Modeling functions like `lm` will include every variable specified in the formula, calculating a coefficient for each one. Unfortunately, this means that `lm` may calculate coefficients for variables that aren’t needed. You can manually tune a model using diagnostics like `summary` and `lm.influence`. However, you can also use some other statistical techniques to reduce the effect of insignificant variables or remove them from a model altogether.

A simple technique for selecting the most important variables is stepwise variable selection. The stepwise algorithm works by repeatedly adding or removing variables from the model, trying to “improve” the model at each step. When the algorithm can no longer improve the model by adding or subtracting variables, it stops and returns the new (and usually smaller) model.
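To make the loop concrete, here is a minimal sketch of backward elimination written by hand. It is not how `step` is implemented, and it is not hardened for edge cases. It assumes the built-in `mtcars` data set, an arbitrarily chosen starting formula, and the AIC column reported by `drop1` as the measure of "improvement." The helper name `backward_eliminate` is hypothetical.

```r
# Hypothetical helper: repeatedly drop the single term whose removal
# lowers the AIC the most, and stop when no removal helps.
backward_eliminate <- function(fit) {
  repeat {
    # drop1() reports the AIC of the current model (row "<none>")
    # and of each model obtained by deleting one term.
    candidates <- drop1(fit)
    best <- which.min(candidates$AIC)

    # Stop when keeping the current model is at least as good
    # as any single deletion.
    if (rownames(candidates)[best] == "<none>") break

    term <- rownames(candidates)[best]
    fit <- update(fit, as.formula(paste(". ~ . -", term)))
  }
  fit
}

# Starting formula chosen only for illustration.
start <- lm(mpg ~ wt + hp + qsec + am + cyl, data = mtcars)
reduced <- backward_eliminate(start)

formula(reduced)
```

The sketch only removes variables. A "both" direction search would also consider adding back previously dropped terms at each step.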

Note that “improvement” does not just mean reducing the residual sum of squares (RSS) for the fitted model. Adding an additional variable to a model will not increase the RSS (see a statistics book for an explanation of why), but it does increase model complexity. Typically, AIC (Akaike’s information criterion) is used to measure the value of each additional variable. The AIC is defined as AIC = −2 log(*L*) + *k* × edf, where *L* is the likelihood, edf is the equivalent degrees of freedom, and *k* is the penalty applied per degree of freedom (*k* = 2 gives the classical AIC).
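The following sketch illustrates both points on the built-in `mtcars` data set: the RSS does not go up when a variable is added, and the AIC can be reproduced from the log-likelihood with *k* = 2. The model formulas and the helper name `aic_by_hand` are assumptions made for illustration.

```r
# Two nested models; the variables are chosen only for illustration.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + qsec, data = mtcars)

# For an lm fit, deviance() returns the residual sum of squares.
rss <- c(small = deviance(small), large = deviance(large))

# Hypothetical helper: AIC = -2 * log(L) + k * edf.
aic_by_hand <- function(fit, k = 2) {
  ll <- logLik(fit)                      # log-likelihood of the fitted model
  -2 * as.numeric(ll) + k * attr(ll, "df")
}

rss
c(by_hand = aic_by_hand(large), built_in = AIC(large))
```

The values that `step` prints come from `extractAIC`, which for `lm` fits differs from `AIC` by an additive constant. Differences between models fitted to the same data are unaffected, so the ranking of candidate models is the same.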

In R, you perform stepwise selection through the `step` function:

step(object, scope, scale = 0, direction = c("both", "backward", "forward"), ...
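As a usage sketch, the calls below run a backward search from a larger model and a forward search from an intercept-only model. The `mtcars` data set and the starting formula are assumptions chosen for illustration; `trace = 0` simply suppresses the per-step log.

```r
# Starting model; the formula is chosen only for illustration.
full <- lm(mpg ~ wt + hp + qsec + am + cyl, data = mtcars)

# Backward search: start from the full model and try removing terms.
backward <- step(full, direction = "backward", trace = 0)

# Forward search: start from the intercept-only model. A single
# formula passed as scope is treated as the upper limit of the search.
null <- lm(mpg ~ 1, data = mtcars)
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)

# extractAIC() returns c(edf, AIC) as used by step().
extractAIC(full)
extractAIC(backward)
```

The two directions are not guaranteed to arrive at the same model, so it can be worth comparing the results.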
