The matchmaker dataset contains numerical data and categorical data. Some classifiers, like the decision tree, can handle both types without any preprocessing, but the classifiers in the remainder of this chapter work only with numerical data. To handle this, you'll need a way to turn data into numbers so that it will be useful to the classifier.
The simplest thing to convert to a number is a yes/no question, because you can turn a "yes" into 1 and a "no" into −1. This also leaves the option of converting missing or ambiguous data (such as "I don't know") to 0. Add the yesno function to advancedclassify.py to do this conversion for you:
def yesno(v):
    # Map 'yes' to 1 and 'no' to -1; anything else is treated as unknown
    if v=='yes': return 1
    elif v=='no': return -1
    else: return 0
There are a couple of different ways you can record people's interests in the dataset. The simplest is to treat every possible interest as a separate numerical variable, and assign a 1 if the person has that interest and a 0 if they don't. If you are dealing with individual people, that is the best approach. In this case, however, you have pairs of people, so a more intuitive approach is to use the number of common interests as a variable.
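The per-person encoding described above can be sketched as follows, using the usual convention of 1 for an interest the person has and 0 otherwise. This is only an illustration: the function name interestvector and the sample list of interests are hypothetical and are not part of advancedclassify.py. It assumes each person's interests arrive as a single colon-separated string.

```python
def interestvector(interests, allinterests):
    # Hypothetical helper: one numerical variable per known interest,
    # 1 if the person lists it and 0 otherwise
    have = set(interests.split(':'))
    return [1 if i in have else 0 for i in allinterests]

# Made-up vocabulary of interests, for illustration only
allinterests = ['skiing', 'knitting', 'dancing', 'soccer']

print(interestvector('skiing:dancing', allinterests))  # [1, 0, 1, 0]
```

Every person gets a vector of the same length, one entry per possible interest, which is why this works well for individuals but becomes unwieldy when each row describes a pair of people.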
Add a new function called matchcount to advancedclassify.py, which returns the number of matching items in a list as a float:
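def matchcount(interest1,interest2):
    # Interests are stored as colon-separated strings
    l1=interest1.split(':')
    l2=interest2.split(':')
    x=0
    # Count the items in the first list that also appear in the second
    for v in l1:
        if v in l2: x+=1
    return float(x)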
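To see how the count behaves, the sketch below repeats matchcount in a self-contained form that returns a float, and calls it on a few made-up interest strings. The sample values are invented for illustration and are not taken from the matchmaker dataset.

```python
def matchcount(interest1, interest2):
    # Split the colon-separated interest strings into lists
    l1 = interest1.split(':')
    l2 = interest2.split(':')
    x = 0
    # Count each item of the first list that also appears in the second
    for v in l1:
        if v in l2:
            x += 1
    return float(x)

print(matchcount('skiing:knitting:dancing', 'dancing:skiing:reading'))  # 2.0
print(matchcount('soccer', 'opera'))                                    # 0.0

# Caveat: ''.split(':') gives [''], so two empty fields count as one match
print(matchcount('', ''))                                               # 1.0
```

If your data can contain blank interest fields, one option is to discard empty strings from both lists before counting, so that two people with no recorded interests are not treated as having something in common.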
The number of common interests is an ...