The clustering algorithms in this chapter have been demonstrated using a stylized visualization of the data in two dimensions, with the difference between items indicated by how far apart they are in the diagram. Since most real-life datasets describe each item with more than two numbers, you can't just take the data as-is and plot it in two dimensions. However, to understand the relationships between the items, it would be very useful to see them charted on a page, with closer distances indicating similarity.
This section will introduce a technique called multidimensional scaling, which will be used to find a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to make a chart in which the distances between the items match those differences. To do this, the algorithm first calculates the target distances between all the items. In the blog dataset, Pearson correlation was used to compare the items. An example of this is shown in Table 3-2.
Table 3-2. Sample distance matrix
| | A | B | C | D |
|---|---|---|---|---|
| A | 0.0 | 0.2 | 0.8 | 0.7 |
| B | 0.2 | 0.0 | 0.9 | 0.8 |
| C | 0.8 | 0.9 | 0.0 | 0.1 |
| D | 0.7 | 0.8 | 0.1 | 0.0 |
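Building a matrix like the one above can be sketched as follows. This is a minimal illustration, assuming that the target distance between two items is taken as 1 minus their Pearson correlation (so identical items are 0.0 apart and strongly anti-correlated items approach 2.0); the `distance_matrix` helper name is my own, not from the chapter:

```python
from math import sqrt

def pearson(v1, v2):
    # Pearson correlation between two equal-length numeric vectors
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1sq = sum(x * x for x in v1)
    sum2sq = sum(x * x for x in v2)
    psum = sum(a * b for a, b in zip(v1, v2))
    num = psum - (sum1 * sum2 / n)
    den = sqrt((sum1sq - sum1 ** 2 / n) * (sum2sq - sum2 ** 2 / n))
    if den == 0:
        return 0
    return num / den

def distance_matrix(data):
    # Pairwise target distances: 1 - correlation between every pair of rows
    return [[1 - pearson(row_i, row_j) for row_j in data]
            for row_i in data]
```

The diagonal of the result is all zeros, since every item is perfectly correlated with itself, matching the pattern in Table 3-2.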
Next, all the items (blogs, in this case) are placed randomly on the two-dimensional chart, as shown in Figure 3-7.
Figure 3-7. Starting locations of the 2D projection
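The random starting placement can be sketched like this; `random_layout` is a hypothetical helper name, and confining the points to the unit square is an arbitrary choice of starting region:

```python
import random

def random_layout(n, seed=None):
    # Assign each of the n items a random (x, y) starting position
    # in the unit square; a seed makes the layout reproducible
    rng = random.Random(seed)
    return [[rng.random(), rng.random()] for _ in range(n)]
```

From this arbitrary arrangement, the algorithm will iteratively nudge each point until the on-chart distances approximate the target distances.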
The current distances between all the items are calculated using the ...