The dataset you built at the start of this chapter was designed to be artificially simple: all the variables used to predict the price are on roughly comparable scales, and all of them matter to the final result.

Because all the variables fall within the same range, it's meaningful to calculate distances using all of them at once. Imagine, however, that you introduced a new variable that influenced the price, such as the size of the bottle in milliliters. Unlike the variables you've used so far, which ranged from 0 to 100, this one could be as large as 1,500. Figure 8-6 shows how this would affect the nearest neighbor or distance-weighting calculations.

Figure 8-6. Heterogeneous variables cause distance problems

This new variable has a far greater impact on the calculated distances than the original ones do: a difference of several hundred milliliters dwarfs any difference in variables that never exceed 100. It will overwhelm any distance calculation, so the other variables are effectively ignored.
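To see the effect in numbers, here is a minimal sketch. The three wines, their values, and the `euclidean` helper are made up for illustration and are not taken from the chapter's dataset; each wine is assumed to be described as (rating, age, bottle size in ml).

```python
import math

def euclidean(v1, v2):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Hypothetical wines: (rating, age in years, bottle size in ml)
query = (90.0, 10.0, 750.0)
same_wine_big_bottle = (90.0, 10.0, 1500.0)   # identical except for bottle size
different_wine_same_bottle = (50.0, 40.0, 750.0)  # very different rating and age

d_same = euclidean(query, same_wine_big_bottle)
d_different = euclidean(query, different_wine_same_bottle)

print(d_same)       # distance driven entirely by the 750 ml gap
print(d_different)  # distance driven by rating and age only
```

With these made-up values, the wine that matches on rating and age ends up 750 units away, while the wine that differs by 40 rating points and 30 years is only 50 units away. A nearest neighbor search would therefore prefer the wrong wine purely because of bottle size.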

A different problem is the introduction of entirely irrelevant variables. If the dataset also included the number of the aisle in which you found the wine, this variable would be included in the distance calculations. Two items that were identical in every respect but were found in very different aisles would be considered very far apart, which would badly hinder the ability of the algorithms to make accurate predictions.
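The same kind of sketch shows what an irrelevant variable does. Again, the wines and the `euclidean` helper are hypothetical, and the aisle number is assumed to have no influence on price; each wine is described as (rating, age, aisle).

```python
import math

def euclidean(v1, v2):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Hypothetical wines: (rating, age in years, aisle number)
wine_a = (85.0, 5.0, 1.0)
wine_b = (85.0, 5.0, 20.0)  # same wine as wine_a, shelved in a distant aisle
wine_c = (80.0, 8.0, 1.0)   # a different wine in the same aisle as wine_a

# Distances with the aisle included
with_aisle_ab = euclidean(wine_a, wine_b)
with_aisle_ac = euclidean(wine_a, wine_c)

# Distances with the aisle dropped (first two fields only)
without_aisle_ab = euclidean(wine_a[:2], wine_b[:2])
without_aisle_ac = euclidean(wine_a[:2], wine_c[:2])

print(with_aisle_ab, with_aisle_ac)
print(without_aisle_ab, without_aisle_ac)
```

When the aisle is included, the identical wine (`wine_b`) looks farther from `wine_a` than the genuinely different wine (`wine_c`) does. Once the aisle is left out, the identical wine is at distance zero, which is the ordering you would want.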

In order ...
