Derived Variables: Making the Data Mean More
The preceding chapter is about getting data to the point where modeling can begin. This chapter is devoted to making models better by improving the quality of the data going into them. For the most part, this is not a matter of obtaining additional data sources; it is about defining new variables that express the information inherent in the data in ways that make the information more useful or more readily available to data mining techniques.
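As a minimal sketch of what defining a new variable can look like, the example below derives two fields from information already present in a record. The record layout, the field names (`total_spend`, `num_orders`, `signup`), the `add_derived` helper, and the snapshot date are all hypothetical, chosen only for illustration.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative assumptions.
customers = [
    {"id": 1, "total_spend": 1200.0, "num_orders": 12, "signup": date(2020, 1, 15)},
    {"id": 2, "total_spend": 0.0, "num_orders": 0, "signup": date(2022, 6, 1)},
]

# Assumed snapshot date against which tenure is measured.
AS_OF = date(2023, 1, 1)


def add_derived(record, as_of):
    """Return a copy of the record with two derived variables added."""
    out = dict(record)
    # Tenure: days between signup and the snapshot date.
    out["tenure_days"] = (as_of - record["signup"]).days
    # Average order value: a ratio of two existing fields.
    # None marks customers with no orders, avoiding division by zero.
    orders = record["num_orders"]
    out["avg_order_value"] = record["total_spend"] / orders if orders else None
    return out


enriched = [add_derived(c, AS_OF) for c in customers]
```

Neither derived field adds new data. Each restates existing fields in a form a modeling technique may be able to use more directly than the raw date or the raw totals.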
Creating derived variables is one of the most creative parts of the data mining process. If data mining is both an art and a science, creating derived variables belongs to the art. Derived variables let models incorporate human insight and take advantage of important characteristics already known about customers, products, and markets. In fact, the ability to come up with the right set of variables for modeling is one of the most important skills a data miner must have.
Well-chosen derived variables can improve model performance as measured by technical criteria such as average squared error, misclassification rate, and lift. Perhaps more importantly, they also make models easier to understand and interpret.
However, the variables that work best in one setting may not work in another, seemingly similar setting. Different companies have different drivers in their markets. Some ...