Hierarchical clustering gives a nice tree as a result, but it has a couple of disadvantages. The tree view doesn't really break the data into distinct groups without additional work, and the algorithm is computationally intensive: because the relationship between every pair of items must be calculated, and then recalculated whenever items are merged, it will run slowly on very large datasets.

An alternative method of clustering is
*K-means* clustering. This type of algorithm is quite
different from hierarchical clustering because it is told in advance how
many distinct clusters to generate. The algorithm will determine the
size of the clusters based on the structure of the data.

K-means clustering begins with *k* randomly
placed *centroids* (points in space that represent
the center of the cluster), and assigns every item to the nearest one.
After the assignment, the centroids are moved to the average location of
all the nodes assigned to them, and the assignments are redone. This
process repeats until the assignments stop changing. Figure 3-5 shows this process in
action for five items and two clusters.

Figure 3-5. K-means clustering with two clusters

In the first frame, the two centroids (shown as dark circles) are placed randomly. Frame 2 shows that each of the items is assigned to the nearest centroid—in this case, A and B are assigned to the top centroid ...
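The loop described above can be sketched in a few lines of Python. This is a minimal, illustrative implementation, not the book's own code: the function name, the optional `centroids` parameter (useful for reproducible tests), and the bounding-box random placement are all choices made here; points are assumed to be numeric sequences of equal length, and distance is squared Euclidean.

```python
import random

def kmeans(points, k, max_iters=100, centroids=None):
    dims = len(points[0])
    if centroids is None:
        # Place k centroids at random within the bounding box of the data.
        ranges = [(min(p[d] for p in points), max(p[d] for p in points))
                  for d in range(dims)]
        centroids = [[random.uniform(lo, hi) for lo, hi in ranges]
                     for _ in range(k)]

    assignments = None
    for _ in range(max_iters):
        # Assign every item to its nearest centroid (squared Euclidean distance).
        new_assignments = []
        for p in points:
            dists = [sum((p[d] - c[d]) ** 2 for d in range(dims))
                     for c in centroids]
            new_assignments.append(dists.index(min(dists)))

        # The process repeats until the assignments stop changing.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Move each centroid to the average location of its assigned items.
        for i in range(k):
            members = [p for p, a in zip(points, assignments) if a == i]
            if members:
                centroids[i] = [sum(p[d] for p in members) / len(members)
                                for d in range(dims)]
    return assignments, centroids
```

Note that with random starting centroids, the final clusters can differ from run to run, since k-means only converges to a local optimum; a common remedy is to run it several times and keep the best result.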
