The trick to encoding categorical data is to expand a categorical variable into multiple columns, one per possible value, each holding a 1 or 0 to indicate whether the row takes that value. This of course comes with some caveats and subtle issues that must be navigated with care. For the rest of this subsection, I shall use a real categorical variable to explain further.
Consider the LandSlope variable. There are three possible values for LandSlope:
- Gtl
- Mod
- Sev
One possible encoding scheme, commonly known as one-hot encoding, is this:
| Slope | Slope_Gtl | Slope_Mod | Slope_Sev |
|-------|-----------|-----------|-----------|
| Gtl   | 1         | 0         | 0         |
| Mod   | 0         | 1         | 0         |
| Sev   | 0         | 0         | 1         |
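To make the mapping concrete, here is a minimal sketch of this scheme in Python, using only the standard library. The helper name `one_hot` and its parameters are my own illustrative assumptions, not part of any library; the column names simply follow the `Slope_` prefix used in the table.

```python
def one_hot(values, categories, prefix):
    """Expand a list of categorical values into one 0/1 column per category.

    Hypothetical helper for illustration: returns one dict per input value,
    keyed by "<prefix>_<category>".
    """
    rows = []
    for v in values:
        if v not in categories:
            raise ValueError(f"unknown category: {v!r}")
        # Exactly one column is 1: the one matching this row's value.
        rows.append({f"{prefix}_{c}": int(v == c) for c in categories})
    return rows


# The three possible values of LandSlope, in the order used in the table.
categories = ["Gtl", "Mod", "Sev"]

# A small made-up sample of the column, for demonstration only.
slopes = ["Gtl", "Mod", "Sev", "Gtl"]

encoded = one_hot(slopes, categories, prefix="Slope")
for slope, row in zip(slopes, encoded):
    print(slope, row)
```

Each input value becomes a row with a single 1 in the column that matches it and 0 everywhere else, which is exactly what the table shows.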
The one-hot scheme above would be a terrible encoding scheme. To understand why, we must first understand linear regression ...