Chapter 6

Data Mining Using Classic Statistical Techniques

The notion that data mining and statistics are separate disciplines now seems outdated and even a bit quaint. In fact, all data mining techniques are based on the science of probability and the discipline of statistics. The techniques described in this chapter are just closer to these roots than the techniques described in other chapters.

The chapter begins by describing how even simple, descriptive statistics can be viewed as models. If you can describe what you are looking for, then finding it is easier. This leads to the idea of similarity models — the more something looks like what you are looking for, the higher its score.
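As a rough illustration of the similarity idea, the sketch below scores records by how close they are to a described ideal. Everything in it is an assumption made for illustration rather than anything taken from the chapter: the field names (`age`, `income`), the prototype values, the per-field scales, and the choice of scaled Euclidean distance mapped to a score between 0 and 1.

```python
import math

def similarity_score(record, prototype, scales):
    """Return a score in (0, 1]; 1.0 means the record matches the prototype exactly."""
    # Scaled Euclidean distance, so that no single field dominates the result.
    dist = math.sqrt(
        sum(((record[k] - prototype[k]) / scales[k]) ** 2 for k in prototype)
    )
    # Smaller distance -> higher score.
    return 1.0 / (1.0 + dist)

# Hypothetical description of "what you are looking for".
prototype = {"age": 40, "income": 80000}
# Hypothetical scales: roughly how much variation in each field counts as "one unit".
scales = {"age": 10, "income": 20000}

# Made-up candidate records.
candidates = {
    "a": {"age": 40, "income": 80000},
    "b": {"age": 45, "income": 70000},
    "c": {"age": 70, "income": 30000},
}

scores = {name: similarity_score(rec, prototype, scales) for name, rec in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Other distance measures or score transformations would serve equally well here; the only property the sketch relies on is that records closer to the description receive higher scores.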

Next come table lookup models, which are popular in the direct marketing industry and widely applicable in other fields. Naïve Bayesian models are a useful generalization of table lookup models that allow many more inputs than can usually be accommodated as dimensions of a lookup table.
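A table lookup model can be sketched in a few lines: each combination of dimension values defines a cell, and a record's score is the average target value observed in its cell. The dimension names (`recency`, `frequency`), the `responded` target, the sample history, and the fallback to the overall rate for unseen cells are all hypothetical choices made for this illustration.

```python
from collections import defaultdict

def build_lookup_table(rows, dims, target):
    """Map each combination of dimension values to the mean target in that cell."""
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of target, row count]
    for row in rows:
        cell = tuple(row[d] for d in dims)
        totals[cell][0] += row[target]
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}

def score(table, row, dims, default):
    """Look up the row's cell; fall back to a default when the cell was never seen."""
    return table.get(tuple(row[d] for d in dims), default)

# Made-up response history for a hypothetical mailing.
history = [
    {"recency": "recent", "frequency": "high", "responded": 1},
    {"recency": "recent", "frequency": "high", "responded": 1},
    {"recency": "recent", "frequency": "low", "responded": 1},
    {"recency": "recent", "frequency": "low", "responded": 0},
    {"recency": "old", "frequency": "low", "responded": 0},
    {"recency": "old", "frequency": "low", "responded": 0},
]

dims = ("recency", "frequency")
table = build_lookup_table(history, dims, "responded")
overall_rate = sum(r["responded"] for r in history) / len(history)

seen = score(table, {"recency": "recent", "frequency": "low"}, dims, overall_rate)
unseen = score(table, {"recency": "old", "frequency": "high"}, dims, overall_rate)
print(seen, unseen)
```

The sketch also hints at why the number of dimensions is limited: every added dimension multiplies the number of cells, so with a fixed amount of history more cells end up empty or backed by very few rows.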

Much of the chapter is devoted to linear and logistic regression, which are among the most widely used predictive modeling techniques. Regression models are introduced first as a way of formalizing the relationship between two variables that can be seen in a scatter plot. Next comes multiple regression, which allows models with more than one input, followed by logistic regression, which extends the technique to targets with a restricted range such as probability ...