So far we've been assuming that if you take an average or weighted average of the data, you'll get a pretty good estimate of the final price. In many cases this will be accurate, but in some situations there may be an unmeasured variable that can have a big effect on the outcome. Imagine that in the wine example there were buyers from two separate groups: people who bought from the liquor store, and people who bought from a discount store and received a 40 percent discount. Unfortunately, this information isn't tracked in the dataset.
The wineset3 function creates a dataset that simulates these properties. It drops some of the complicating variables and just focuses on the original ones. Add this function to numpredict.py:
def wineset3():
  rows=wineset1()
  for row in rows:
    if random()<0.5:
      # Wine was bought at a discount store
      row['result']*=0.6
  return rows
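To get a feel for what this hidden variable does to the data, here is a minimal standalone sketch that doesn't depend on numpredict.py. The helper name apply_hidden_discount, the fixed price of 100.0, the sample size, and the seed are all assumptions made for illustration; only the 0.6 multiplier and the 50/50 split come from wineset3.

```python
import random

def apply_hidden_discount(prices, multiplier=0.6, share=0.5, seed=0):
    """Return a copy of prices in which roughly `share` of the entries
    are scaled by `multiplier`, mimicking the untracked discount store.
    (Hypothetical helper, not part of numpredict.py.)"""
    rng = random.Random(seed)  # seeded so repeated runs give the same split
    observed = []
    for price in prices:
        if rng.random() < share:
            # This sale happened at the discount store
            observed.append(price * multiplier)
        else:
            observed.append(price)
    return observed

# Assumed data: 10,000 sales of one identical wine priced at 100.0
full_price = 100.0
observed = apply_hidden_discount([full_price] * 10000)

# Identical items now show up at two different prices...
distinct = sorted(set(observed))
# ...and a plain average falls between them, at a price nobody paid
mean = sum(observed) / len(observed)

print(distinct)
print(mean)
```

With an even split, the average should land close to 80, which is about 20 percent below the full price even though every buyer paid either the full price or the discounted one.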
Consider what will happen if you ask for an estimate of the price of a different item using the kNN or weighted kNN algorithms. Since the dataset doesn't contain any information about whether the buyer bought from the liquor store or a discount store, the algorithm can't take this into account, so it will bring in the nearest neighbors regardless of where the purchase was made. The result is that it will give the average of items from both groups, which with an even split works out to roughly a 20 percent discount. You can verify this by trying it in your Python session:
>>> reload(numpredict)
<module 'numpredict' ...