A linear regression assumes that there is a linear relationship between the response variable and the predictors. Specifically, it assumes that a response variable y is a linear function of a set of predictor variables x1, x2, ..., xn, plus a random error term.
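Written out in symbols, the assumption looks roughly like the following. The coefficient names c_0, ..., c_n and the error term ε are notation chosen here for illustration; they are not defined in the text above.

```latex
y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + \varepsilon
```

Here c_0 is the intercept, each c_i measures how much y changes when x_i increases by one unit with the other predictors held fixed, and ε is the part of y that the predictors do not explain. Fitting the regression means estimating the coefficients from data.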
As an example, we’re going to look at how different metrics predict the runs scored by a baseball team. Let’s start by loading the data for every team between 2000 and 2008. We’ll use the SQLite database that we used in Chapter 13 and extract the fields we want using an SQL query:
> library(RSQLite)
> drv <- dbDriver("SQLite")
> con <- dbConnect(drv,
+   dbname=system.file("extdata","bb.db", package="nutshell"))
> team.batting.00to08 <- dbGetQuery(con,
+   paste(
+     'SELECT teamID, yearID, R as runs, ',
+     ' H-"2B"-"3B"-HR as singles, ',
+     ' "2B" as doubles, "3B" as triples, HR as homeruns, ',
+     ' BB as walks, SB as stolenbases, CS as caughtstealing, ',
+     ' HBP as hitbypitch, SF as sacrificeflies, ',
+     ' AB as atbats ',
+     ' FROM Teams ',
+     ' WHERE yearID between 2000 and 2008'
+   )
+ )
Or, if you’d like, you can just load the file from the nutshell package:
> library(nutshell)
> data(team.batting.00to08)
Because this is a book about R and not a book about baseball, I renamed the common abbreviations to more intuitive names for plays. Let’s look at scatter plots of runs versus each of the other variables so that we can see which variables are likely to be most important.
We’ll create a data frame for plotting, ...