O'Reilly logo

Programming Collective Intelligence by Toby Segaran

Stay ahead with the world's most comprehensive technology and business learning platform.

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, tutorials, and more.

Start Free Trial

No credit card required

Viewing Data in Two Dimensions

The clustering algorithms in this chapter have been demonstrated using a stylized visualization of data in two dimensions, with the difference between the various items indicated by how far apart they are in the diagram. Since most real-life examples of items you would want to cluster have more than two numbers, you can't just take the data as-is and plot it in two dimensions. However, to understand the relationship between the various items, it would be very useful to see them charted on a page with closer distances indicating similarity.

This section will introduce a technique called multidimensional scaling, which will be used to find a two-dimensional representation of the dataset. The algorithm takes the difference between every pair of items and tries to make a chart in which the distances between the items match those differences. To do this, the algorithm first calculates the target distances between all the items. In the blog dataset, Pearson correlation was used to compare the items. An example of this is shown in Table 3-2.

Table 3-2. Sample distance matrix

 

A

B

C

D

A

0.0

0.2

0.8

0.7

B

0.2

0.0

0.9

0.8

C

0.8

0.9

0.0

0.1

D

0.7

0.8

0.1

0.0

Next, all the items (blogs, in this case) are placed randomly on the two-dimensional chart, as shown in Figure 3-7.

Starting locations of the 2D projection

Figure 3-7. Starting locations of the 2D projection

The current distances between all the items are calculated using the ...

With Safari, you learn the way you learn best. Get unlimited access to videos, live online training, learning paths, books, interactive tutorials, and more.

Start Free Trial

No credit card required