Cover by Toby Segaran


Clustering

Hierarchical clustering and K-means clustering are unsupervised learning techniques: they don't require labeled examples as training data, because they don't attempt to make predictions. Chapter 3 looked at how to take a list of top bloggers and automatically cluster them so you could see which ones naturally fell into groups that write about similar subjects or use similar words.

Hierarchical Clustering

Clustering works on any set of items that have one or more numerical properties. The example in Chapter 3 used word counts for the different blogs, but any set of numbers can be used for clustering. To demonstrate how the hierarchical clustering algorithm works, consider a simple table of items (some letters of the alphabet) and some numerical properties (Table 12-7).

Table 12-7. Simple table for clustering

Item    P1     P2
A       1      8
B       3      8
C       2      6
D       1.5    1
E       4      2

Figure 12-12 shows the process of clustering these items. In the first pane, the items have been plotted in two dimensions, with P1 on the x-axis and P2 on the y-axis. Hierarchical clustering works by finding the two items that are closest together and merging them into a cluster. In the second pane, you can see that the closest items, A and B, have been grouped together. The "location" of this cluster is the average of the two items in it. In the next pane, the closest pair turns out to be C and the new A-B cluster, so those are merged. This process continues until the final pane, in which everything is contained in one big cluster.
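The merging process described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own listing: it assumes plain Euclidean distance as the measure of closeness, and it assumes that when two clusters merge, the new location is the simple average of their two locations. The names `ITEMS`, `euclidean`, and `hierarchical_cluster` are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations

# Items from Table 12-7: name -> (P1, P2)
ITEMS = {
    "A": (1.0, 8.0),
    "B": (3.0, 8.0),
    "C": (2.0, 6.0),
    "D": (1.5, 1.0),
    "E": (4.0, 2.0),
}


def euclidean(p, q):
    """Straight-line distance between two points (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hierarchical_cluster(items, distance=euclidean):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    # Each cluster is (members, location); start with one cluster per item.
    clusters = [((name,), point) for name, point in items.items()]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose locations are closest together.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: distance(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        (names_i, loc_i), (names_j, loc_j) = clusters[i], clusters[j]
        # Assumption: the merged cluster sits at the average of the two locations.
        merged_loc = tuple((a + b) / 2 for a, b in zip(loc_i, loc_j))
        merged = (names_i + names_j, merged_loc)
        merges.append(merged)
        # Replace the two old clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    for members, location in hierarchical_cluster(ITEMS):
        print("+".join(members), location)
```

Under these assumptions, the first merge joins A and B at (2, 8), the second adds C to that cluster, the third joins D and E, and the last merge combines everything into one cluster, which matches the sequence described for the figure. A different distance measure or a different rule for locating merged clusters could change the order of the merges.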

Figure 12-12. Process of hierarchical ...
