Encoding categorical variables

The final step in preparing the data for the exploratory phase is to encode categorical variables. Some software packages do this behind the scenes, but it is good to understand when and how to do it.

Most statistical models accept only numerical data. Categorical data (which, depending on the context, is sometimes already expressed as digits) cannot be used in a model directly. To use such variables, we encode them, that is, we assign each category a unique numerical code. That explains when. As for how, you can use the following recipe.
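To make "a unique numerical code" concrete, here is a minimal sketch that maps each distinct category to an integer. The DataFrame, the heating column, and its values are hypothetical and serve only as an illustration; pd.factorize is one of several ways to do this in pandas and is not necessarily the approach the recipe takes.

```python
import pandas as pd

# Hypothetical toy data; the column name and values are illustrative only.
df = pd.DataFrame({"heating": ["gas", "electric", "gas", "oil"]})

# factorize assigns one integer per distinct value, in order of first appearance,
# and returns both the codes and the list of distinct categories.
codes, categories = pd.factorize(df["heating"])
df["heating_code"] = codes

print(df)
print(list(categories))
```

Note that plain integer codes suggest an ordering between categories that may not exist, which is one reason dummy coding is often preferred for nominal variables.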

Getting ready

To execute this recipe, you will need the pandas module.

No other prerequisites are required.

How to do it…

Once again, pandas already has a method that does all of this for us (the data_dummy_code.py ...
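The excerpt is cut off before it names the method. A common pandas function for dummy coding is pd.get_dummies, so the sketch below assumes that function; the data, the column names, and the d prefix are hypothetical and are not taken from the book's script.

```python
import pandas as pd

# Hypothetical toy data; not the dataset used in the book.
df = pd.DataFrame({
    "price": [100, 150, 120],
    "heating": ["gas", "electric", "gas"],
})

# Replace the categorical column with one 0/1 indicator column per category.
# dtype=int keeps the indicators as integers regardless of the pandas version.
encoded = pd.get_dummies(df, columns=["heating"], prefix="d", dtype=int)

print(encoded)
```

Each row has a 1 in exactly one of the new indicator columns, and the original heating column is dropped from the result.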
