Now that you can (hopefully) better appreciate the nature of the record-matching problem, let’s collect some real-world data from LinkedIn and start hacking out clusters. You might want to mine your LinkedIn data to take an honest look at whether your networking skills have been helping you meet the “right kinds of people,” or to approach contacts who most likely fit into a certain socioeconomic bracket with a particular kind of business enquiry or proposition. The remainder of this section systematically works through some exercises in clustering your contacts by job title.
An obvious starting point when working with a new data set is to count things, and this situation is no different. This section introduces a pattern for transforming common job titles and then performs a basic frequency analysis.
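The transform-then-count pattern just described can be sketched in a few lines. This is only a minimal illustration, not the book's own listing: the `ABBREVIATIONS` map, the function names, and the sample titles are all hypothetical, and a real export would need a longer map tuned to your own contacts.

```python
from collections import Counter

# Hypothetical abbreviation map (an assumption for illustration);
# keys are lowercase with any trailing period removed.
ABBREVIATIONS = {
    "sr": "Senior",
    "jr": "Junior",
    "vp": "Vice President",
    "ceo": "Chief Executive Officer",
    "cto": "Chief Technology Officer",
}


def normalize_title(title):
    """Expand known abbreviations token by token; leave other tokens as-is."""
    tokens = title.split()
    expanded = [ABBREVIATIONS.get(t.rstrip(".").lower(), t) for t in tokens]
    return " ".join(expanded)


def title_and_token_counts(titles):
    """Return (title frequencies, token frequencies) for normalized titles."""
    normalized = [normalize_title(t) for t in titles if t.strip()]
    title_counts = Counter(normalized)
    token_counts = Counter(tok for t in normalized for tok in t.split())
    return title_counts, token_counts


# Made-up sample data standing in for exported contacts.
sample_titles = [
    "Sr. Software Engineer",
    "Senior Software Engineer",
    "VP Engineering",
    "Software Engineer",
    "CEO",
]

title_counts, token_counts = title_and_token_counts(sample_titles)

# Print each inventory sorted by descending frequency.
for title, freq in title_counts.most_common():
    print(freq, title)
for token, freq in token_counts.most_common():
    print(freq, token)
```

Working token by token, rather than with plain substring replacement, avoids accidentally rewriting an abbreviation that happens to appear inside a longer word. The cost is that multiword abbreviations would need separate handling.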
Assuming you have a reasonable number of exported contacts, the
minor nuances among job titles that you’ll encounter may actually be
surprising, but before we get into that, let’s put together some code
that establishes some patterns for normalizing record data and takes a
basic inventory sorted by frequency. Example 6-2 inspects job titles and prints out
frequency information for the titles themselves and for individual
tokens that occur in them. If you haven’t already, you’ll need to
easy_install prettytable, a package that you can use to produce ...