
k-Nearest Neighbors

Chapter 8 covered the topic of numerical prediction using an algorithm called k-nearest neighbors (kNN), and used it to show how you could build models for predicting prices given a set of examples. The recommendation algorithm in Chapter 2 for predicting how much someone would like a movie or a link was also a simple version of kNN.

kNN works by taking a new item for which you want a numerical prediction and comparing it to a set of items for which it already has values. It finds the ones most similar to the item in question and averages their values to get a predicted value. Table 12-6 shows a list of digital cameras, along with their megapixel rating, zoom power, and sale price.

Table 12-6. Digital cameras and prices

Camera   Megapixels   Zoom    Price
C1       7.1          3.8x    $399
C2       5.0          2.4x    $299
C3       6.0          4.0x    $349
C4       6.0          12.0x   $399
C5       10.0         3x      $449

Suppose you want to guess the price for a new camera with six megapixels and a 6x zoom lens. The first thing you'll need is a way to measure how similar two items are. Chapter 8 used Euclidean distance, and you've seen many other distance metrics throughout this book, such as Pearson correlation and Tanimoto score. Using Euclidean distance in this example reveals that the closest item in the table is C3. To visualize this, imagine plotting the items on a chart with megapixels as the x-axis and zoom as the y-axis. The items themselves are identified by their prices in Figure 12-10.

Figure 12-10. Camera prices in zoom-megapixel space
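As a rough sketch of the idea (the variable names and helper functions below are illustrative, not the book's own code), the data from Table 12-6 and the nearest-neighbor lookup might look like this:

```python
from math import sqrt

# Camera data from Table 12-6: name -> ((megapixels, zoom), price)
cameras = {
    'C1': ((7.1, 3.8), 399),
    'C2': ((5.0, 2.4), 299),
    'C3': ((6.0, 4.0), 349),
    'C4': ((6.0, 12.0), 399),
    'C5': ((10.0, 3.0), 449),
}

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_price(features, data, k=3):
    """Average the prices of the k items nearest to the given features."""
    ranked = sorted(data.values(),
                    key=lambda item: euclidean(features, item[0]))
    return sum(price for _, price in ranked[:k]) / k

# The nearest neighbor of a 6-megapixel, 6x-zoom camera is C3,
# as the text observes:
new_camera = (6.0, 6.0)
nearest = min(cameras, key=lambda c: euclidean(new_camera, cameras[c][0]))
```

With k=1 this simply returns the nearest item's price ($349, from C3); larger values of k average over more neighbors, which the following paragraphs discuss.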

You could just take the price ...