Chapter 17 Model Building: Stepwise Regression and Best Subsets Regression

17.1 Introduction and Overview

Several potential predictor (i.e., X) variables are sometimes available for a multiple linear regression (MLR) model, but including all of them is unwieldy, and it is unclear which are the most important or appropriate to select. In addition, some of the X variables may be intercorrelated (i.e., redundant), which creates problems in conventional regression analysis. If there are k candidate X variables, the number of possible models is 2^k. For instance, if k is 2 (i.e., there are two available X variables, say, X1 and X2), then there are 2^2 = 4 possible models to choose from, namely, (1) a model that includes both X1 and X2; (2) one that includes only X1; (3) one that includes only X2; and (4) one that includes neither X1 nor X2, that is, the “null” or “intercept-only” model, in which the response (Y) variable is estimated simply as the mean of the Y values in the data sample, and that mean is the intercept coefficient of the model. If k is 4, the number of possible models is 16. If k is 10, there are a staggering 1024 models from which to select one! Could we simply use all of the available X variables and not worry about picking and choosing variables? Or could we arbitrarily select a convenient number of them for use in the regression? ...
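
The model count can be checked by listing every subset of the candidate predictors. The following is a minimal Python sketch for illustration only; the function name enumerate_models and the labels X1, X2, ... are assumptions made here, not part of the chapter. Each predictor is either in or out of a model, so k predictors yield 2^k subsets, with the empty subset standing for the intercept-only model.

```python
from itertools import combinations


def enumerate_models(predictors):
    """Return every subset of the candidate predictors as a tuple.

    The empty tuple stands for the intercept-only ("null") model.
    enumerate_models is a hypothetical helper written for this sketch.
    """
    models = []
    # Subsets of size 0 (null model) up to size k (all predictors).
    for size in range(len(predictors) + 1):
        models.extend(combinations(predictors, size))
    return models


# Two candidate predictors give the four models described in the text.
for model in enumerate_models(["X1", "X2"]):
    print(model if model else "intercept-only")

# The count doubles with each added predictor: 2**k models for k predictors.
for k in (2, 4, 10):
    names = [f"X{i}" for i in range(1, k + 1)]
    print(k, len(enumerate_models(names)))
```

Listing the subsets is cheap, but fitting and comparing a regression for each one becomes costly as k grows, since every added predictor doubles the number of candidates.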
