Categorical Features

The matchmaker dataset contains numerical data and categorical data. Some classifiers, like the decision tree, can handle both types without any preprocessing, but the classifiers in the remainder of this chapter work only with numerical data. To use them, you'll need a way to turn the categorical data into numbers that the classifier can work with.

Yes/No Questions

The simplest thing to convert to a number is a yes/no question because you can turn a "yes" into 1 and a "no" into −1. This also leaves the option of converting missing or ambiguous data (such as "I don't know") to 0. Add the `yesno` function to advancedclassify.py to do this conversion for you:

```python
def yesno(v):
    # Map 'yes' to 1 and 'no' to -1; anything else (missing or ambiguous) becomes 0
    if v == 'yes': return 1
    elif v == 'no': return -1
    else: return 0
```
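As a quick check of the conversion, the sketch below applies `yesno` to a few answers. The sample values are made up for illustration and are not taken from the matchmaker dataset.

```python
def yesno(v):
    # Map 'yes' to 1 and 'no' to -1; anything else becomes 0
    if v == 'yes':
        return 1
    elif v == 'no':
        return -1
    else:
        return 0

# Hypothetical sample answers, not taken from the dataset
answers = ['yes', 'no', "I don't know", '']
encoded = [yesno(a) for a in answers]
print(encoded)  # [1, -1, 0, 0]
```

The comparison is case-sensitive, so a value such as `'Yes'` would fall through to 0. If the data mixes capitalization, it may be worth lowercasing values before calling the function.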

Lists of Interests

There are a couple of different ways you can record people's interests in the dataset. The simplest is to treat every possible interest as a separate numerical variable, and assign a 1 if the person has that interest and a 0 if not. If you are dealing with individual people, that is the best approach. In this case, however, you have pairs of people, so a more intuitive approach is to use the number of common interests as a variable.
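For comparison, the per-interest approach might look like the minimal sketch below. The function name `interestflags`, the vocabulary list, and the choice of 1 for "has the interest" and 0 for "does not" are assumptions made for illustration; the sketch also assumes each person's interests arrive as a colon-separated string.

```python
def interestflags(interests, allinterests):
    # interests: colon-separated string such as 'skiing:knitting' (assumed format)
    # allinterests: fixed, ordered list of every interest that can occur
    have = set(interests.split(':'))
    # One variable per possible interest: 1 if present, 0 if absent
    return [1 if i in have else 0 for i in allinterests]

# Made-up vocabulary for illustration
allinterests = ['dancing', 'knitting', 'skiing']
print(interestflags('skiing:knitting', allinterests))  # [0, 1, 1]
```

This produces one variable for every possible interest per person, which is why a single count of shared interests is the more compact choice when each row describes a pair of people.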

Add a new function called `matchcount` to advancedclassify.py, which returns the number of matching items in a list as a float:

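```python
def matchcount(interest1, interest2):
    # Interests are colon-separated strings, e.g. 'skiing:knitting'
    l1 = interest1.split(':')
    l2 = interest2.split(':')
    x = 0
    for v in l1:
        if v in l2: x += 1
    return float(x)  # the count is returned as a float
```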
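The same count can also be sketched with a set intersection. This is a swapped-in variant, not the function from the text: the name `matchcount_set` is hypothetical, and unlike the loop it counts an interest only once even if it is repeated within one list.

```python
def matchcount_set(interest1, interest2):
    # Hypothetical variant: set intersection instead of a loop.
    # Duplicate entries within one list are counted once.
    s1 = set(interest1.split(':'))
    s2 = set(interest2.split(':'))
    return float(len(s1 & s2))

# Made-up interest strings; 'skiing' and 'dancing' are shared
print(matchcount_set('skiing:knitting:dancing', 'dancing:skiing:opera'))  # 2.0
```

One caveat applies to both the loop and the set version: `''.split(':')` yields `['']`, so two empty interest strings would be counted as one match. If the data can contain empty entries, filtering them out before counting is probably worthwhile.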

The number of common interests is an ...
