You are previewing Programming Collective Intelligence.

k-Nearest Neighbors

The easiest approach to our wine pricing problem is the same one you would use if you were trying to price something manually—that is, to find a few of the most similar items and assume the prices will be roughly the same. By finding a set of items similar to the item that interests you, the algorithm can average their prices and make a guess at what the price should be for this item. This approach is called k-nearest neighbors (kNN).

Number of Neighbors

The k in kNN refers to the number of items that will be averaged to get the final result. If the data were perfect, you could use k=1, meaning you would just pick the nearest neighbor and use its price as the answer. But in real-world situations, there are always aberrations. In the example, you deliberately add "noise" to simulate this (the random addition or subtraction of 20 percent). Someone might get a great deal, or an uninformed customer might drastically overpay for the nearest neighbor. For this reason, it's better to take a few neighbors and average them to reduce any noise.

To visualize the problem of choosing too few neighbors, consider a problem where there's only one descriptive variable, say, age. Figure 8-1 shows a graph of price (on the y-axis) versus age (on the x-axis). Also on the graph is the line that you get if you only use a single nearest neighbor.

Notice ...