Chapter 22 Measuring Cluster Goodness

22.1 Rationale for Measuring Cluster Goodness

Every modeling technique requires an evaluation phase. For example, we may work hard to develop a multiple regression model for predicting the amount of money to be spent on a new car. But if the standard error of the estimate, s, for this regression model is $100,000, then the model's typical prediction error is about $100,000, and its usefulness is questionable. In the classification realm, we would expect that a model predicting who will respond to our direct-mail marketing operation will yield more profitable results than the baseline “send-a-coupon-to-everybody” or “send-out-no-coupons-at-all” models.

Clustering models need to be evaluated in a similar way. Some of the questions of interest might be the following:

  • Do my clusters actually correspond to reality, or are they simply artifacts of mathematical convenience?
  • I am not sure how many clusters there are in the data. What is the optimal number of clusters to identify?
  • How do I measure whether one set of clusters is preferable to another?

In this chapter, we introduce two methods for measuring cluster goodness: the silhouette method and the pseudo-F statistic. These techniques help to address the questions above by measuring the goodness of our cluster solutions. We also examine a method for validating our clusters using cross-validation with graphical and statistical analysis.
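
As a rough preview of how these two measures can compare competing cluster solutions, here is a minimal Python sketch. It assumes the commonly used definitions: a silhouette value of (b - a) / max(a, b) for each point, where a is the mean distance to the other members of the point's own cluster and b is the mean distance to the members of the nearest other cluster, and a pseudo-F equal to (SSB / (k - 1)) / (SSE / (N - k)). The function names `mean_silhouette` and `pseudo_f`, the toy points, and the two labelings are illustrative assumptions, not taken from the chapter, whose own treatment may differ in detail.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def _group(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def mean_silhouette(points, labels):
    """Average of s_i = (b_i - a_i) / max(a_i, b_i) over all points.

    Assumes at least two clusters.
    """
    clusters = _group(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def pseudo_f(points, labels):
    """Pseudo-F = (SSB / (k - 1)) / (SSE / (N - k)).

    Assumes k >= 2 clusters, N > k points, and SSE > 0.
    """
    n, dim = len(points), len(points[0])
    clusters = _group(points, labels)
    k = len(clusters)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    ssb = sse = 0.0
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        ssb += len(members) * dist(centroid, grand) ** 2   # between-cluster
        sse += sum(dist(p, centroid) ** 2 for p in members)  # within-cluster
    return (ssb / (k - 1)) / (sse / (n - k))


# Hypothetical toy data: two visibly separate groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the visible groups
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the groups together

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, mean_silhouette(points, labels), pseudo_f(points, labels))
```

Under these assumptions, the labeling that matches the two visible groups should score higher on both measures than the mixed labeling, which is the sense in which either statistic can help decide whether one set of clusters is preferable to another.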

Any measure of cluster goodness, or cluster quality, should address the ...
