Models and Formulas

To statisticians, a model is a concise way to describe a set of data, usually with a mathematical formula. Sometimes, the goal is to build a predictive model with training data to predict values based on other data. Other times, the goal is to build a descriptive model that helps you understand the data better.

R has a special notation for describing relationships between variables. Suppose that you are assuming a linear model for a variable y, predicted from the variables x1, x2, ..., xn. (Statisticians usually refer to y as the dependent variable, and x1, x2, ..., xn as the independent variables.) In equation form, this implies a relationship like:

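$$y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + \varepsilon$$

Here c_0, c_1, ..., c_n are the coefficients to be estimated and ε is an error term.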

In R, you would write the relationship as y ~ x1 + x2 + ... + xn, which is a formula object.
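As a minimal sketch of what "formula object" means in practice, the snippet below builds a formula and inspects it. The variable names y, x1, and x2 are placeholders chosen for illustration; they do not need to exist when the formula is created, because R stores the expression unevaluated.

```r
# Build a formula object; y, x1, and x2 are illustrative names only
# and are not evaluated at this point.
f <- y ~ x1 + x2

class(f)      # the class of the object is "formula"
all.vars(f)   # the variable names that appear in the formula
```

The variables are looked up only when the formula is passed to a modeling function, typically in the data frame supplied through its data argument.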

As an example, let’s use the cars data set, which is included in the datasets package that ships with R. The data was recorded during the 1920s and gives the speed and stopping distance for a set of cars. We’ll assume that stopping distance is a linear function of speed and use linear regression to estimate that relationship. The formula is dist~speed. The lm function estimates the parameters of a linear model and returns an object of class lm, which we will assign to a variable called cars.lm:

> cars.lm <- lm(formula=dist~speed,data=cars) ...
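One way to examine the fitted object is sketched below. It refits the same model so that the snippet stands alone, then uses the standard coef and predict functions. The speed of 20 passed to predict is an arbitrary value chosen for illustration, not one taken from the text.

```r
# Refit the model from the text so this snippet is self-contained.
cars.lm <- lm(formula = dist ~ speed, data = cars)

# Estimated intercept and slope of the fitted line.
coef(cars.lm)

# Predicted stopping distance at a hypothetical speed of 20.
new.speed <- data.frame(speed = 20)
predict(cars.lm, newdata = new.speed)
```

The column name in the new data frame must match the predictor name used in the formula (speed), because predict looks variables up by name.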
