CHAPTER 3

CLUSTERING

3.1 OVERVIEW

To make sense of a large data set, it often helps to break the data down into smaller groups of observations that share something in common. Knowing the contents of these smaller groups helps in understanding the entire data set. Clustering is a widely used and flexible approach to analyzing data in which observations are automatically organized into groups, such that observations within a particular group are more similar to each other than to observations in other groups. This approach has been applied successfully in a variety of scientific and commercial settings, including medical diagnosis, insurance underwriting, financial portfolio management, organizing search results, and marketing. For example, retail organizations have used clustering to analyze customer data based on historical purchases, along with information about each customer, such as age or place of residence. Customers are grouped using clustering approaches, and specific marketing campaigns are then formulated for the identified market segments.
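The idea that members of a group are more similar to each other than to members of other groups can be made concrete with a small sketch. The Python example below is only an illustration under stated assumptions: the customer records (age and purchases per year) are invented, and the grouping method, repeatedly merging the two closest groups using straight-line distance, is one simple choice picked for brevity, not necessarily the approach developed later in this chapter. The names `customers`, `single_linkage`, and `agglomerate` are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance between two points (Python 3.8+)

# Hypothetical customers: (age in years, purchases per year).
# These values are invented purely for illustration.
customers = {
    "A": (22, 4),
    "B": (25, 5),
    "C": (24, 6),
    "D": (58, 30),
    "E": (61, 28),
    "F": (63, 33),
}


def single_linkage(c1, c2, points):
    """Distance between two groups: the smallest distance between any pair of members."""
    return min(dist(points[a], points[b]) for a in c1 for b in c2)


def agglomerate(points, n_groups):
    """Start with one group per observation and merge the two closest
    groups repeatedly until only n_groups remain."""
    clusters = [frozenset([name]) for name in points]
    while len(clusters) > n_groups:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return clusters


groups = agglomerate(customers, n_groups=2)
for group in groups:
    print(sorted(group))
```

With this toy data the younger, low-volume customers end up in one group and the older, high-volume customers in the other. In practice the variables would usually be put on comparable scales first, since a variable with a large numeric range can otherwise dominate the distance.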

A data set of animals will be used to illustrate clustering. Table 3.1 describes a series of animal observations (taken from http://archive.ics.uci.edu/ml/datasets/Zoo; Murphy and Aha, 1994). Each animal is characterized by a number of variables, including several binary variables, such as whether the animal has hair (hair) or produces milk (milk). The data set also includes a count of the number of animal legs ...

