The easiest approach to our wine pricing problem is the same one
you would use if you were trying to price something manually—that is, to
find a few of the most similar items and assume the prices will be
roughly the same. By finding a set of items similar to the item that
interests you, the algorithm can average their prices and make a guess
at what the price should be for this item. This approach is called
*k-nearest neighbors* (kNN).

The k in kNN refers to the number of items that will be averaged to get the final result. If the data were perfect, you could use k=1, meaning you would just pick the nearest neighbor and use its price as the answer. But in real-world situations, there are always aberrations. In the example, you deliberately add "noise" to simulate this (the random addition or subtraction of 20 percent). Someone might get a great deal, or an uninformed customer might drastically overpay for the nearest neighbor. For this reason, it's better to take a few neighbors and average them to reduce any noise.

To visualize the problem of choosing too few neighbors, consider a problem where there's only one descriptive variable, say, age. Figure 8-1 shows a graph of price (on the y-axis) versus age (on the x-axis). Also on the graph is the line that you get if you only use a single nearest neighbor.

Figure 8-1. kNN using too few neighbors

Notice ...

Start Free Trial

No credit card required