Cover by Toby Segaran

Safari, the world’s most comprehensive technology and business learning platform.

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required

O'Reilly logo

Categorical Features

The matchmaker dataset contains numerical data and categorical data. Some classifiers, like the decision tree, can handle both types without any preprocessing, but the classifiers in the remainder of this chapter work only with numerical data. To handle this, you'll need a way to turn data into numbers so that it will be useful to the classifier.

Yes/No Questions

The simplest thing to convert to a number is a yes/no question because you can turn a "yes" into 1 and a "no" into −1. This also leaves the option of converting missing or ambiguous data (such as "I don't know") to 0. Add the yesno function to advancedclassify.py to do this conversion for you:

def yesno(v):
  if v=='yes': return 1
  elif v=='no': return −1
  else: return 0

Lists of Interests

There are a couple of different ways you can record people's interests in the dataset. The simplest is to treat every possible interest as a separate numerical variable, and assign a 0 if the person has that interest and a 1 if he doesn't. If you are dealing with individual people, that is the best approach. In this case, however, you have pairs of people, so a more intuitive approach is to use the number of common interests as a variable.

Add a new function called matchcount to advancedclassify.py, which returns the number of matching items in a list as a float:

def matchcount(interest1,interest2):
  l1=interest1.split(':')
  l2=interest2.split(':')
  x=0
  for v in l1:
    if v in l2: x+=1
  return x

The number of common interests is an ...

Find the exact information you need to solve a problem on the fly, or go deeper to master the technologies and skills you need to succeed

Start Free Trial

No credit card required