Chapter 11. The Best‐Fit Line: Linear Regression Models

The previous chapter introduced data mining ideas using various types of models well suited to databases, such as look‐alike models, lookup tables, and naïve Bayesian models. This chapter extends these ideas to the realm of more traditional statistical techniques: linear regression and best‐fit lines.

Unlike the techniques in the previous chapter, linear regression requires that the input and target variables all be numeric; the results are coefficients in a mathematical formula. A formal treatment of linear regression involves lots of mathematics and proofs. However, this chapter steers away from an overly theoretical approach.

In addition to providing a basis for statistical modeling, linear regression has many applications. To understand relationships between different numeric quantities, regressions, especially best-fit lines, are the place to start. The examples in this chapter include estimating potential product penetration in zip codes, studying price elasticity (investigating the relationship between product prices and sales volumes), and quantifying the effect of monthly fees on yearly stop rates.

The simplest linear regression models are best‐fit lines that have one input and one target. Because the data can be plotted using a scatter plot, such models are readily understood visually. In fact, Excel builds linear regression models into charts using the best‐fit trend line, one of six built‐in types ...
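The best-fit line that Excel draws is the least-squares line: the slope and intercept that minimize the sum of squared vertical distances between the line and the data points. As a minimal sketch, the closed-form formulas can be computed directly; the data here is hypothetical (monthly fee versus yearly stop rate, echoing one of the chapter's examples), not taken from the book:

```python
# Hypothetical data: monthly fee (x) vs. yearly stop rate (y).
xs = [10.0, 15.0, 20.0, 25.0, 30.0]
ys = [0.05, 0.08, 0.10, 0.13, 0.16]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# Standard least-squares formulas for the slope and intercept
# of the best-fit line y = slope * x + intercept.
slope = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n

print(f"y = {slope:.4f} * x + {intercept:.4f}")
```

These are the same values Excel reports when you add a linear trend line to a scatter plot and display its equation.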
