A linear regression assumes that there is a linear
relationship between the response variable and the predictors.
Specifically, it assumes that a response variable
*y* is a linear function of a set of predictor
variables *x*_{1}, *x*_{2}, ..., *x*_{n}.
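
Written out, this assumption can be sketched as the equation below. The coefficient names *c*_{0}, ..., *c*_{n} and the error term ε are notation assumed here for illustration; ε stands for whatever variation in *y* the predictors do not explain.

$$
y = c_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n + \varepsilon
$$

Fitting a linear regression then amounts to estimating the coefficients from observed values of *y* and the predictors.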

As an example, we’re going to look at how different metrics predict
the runs scored by a baseball team.^{[54]} Let’s start by loading the data for every team between 2000
and 2008. We’ll use the SQLite database that we used in Chapter 13 and extract the fields we want using an SQL
query:

    > library(RSQLite)
    > drv <- dbDriver("SQLite")
    > con <- dbConnect(drv,
    +   dbname=system.file("extdata", "bb.db", package="nutshell"))
    > team.batting.00to08 <- dbGetQuery(con,
    +   paste(
    +     'SELECT teamID, yearID, R as runs, ',
    +     '   H-"2B"-"3B"-HR as singles, ',
    +     '   "2B" as doubles, "3B" as triples, HR as homeruns, ',
    +     '   BB as walks, SB as stolenbases, CS as caughtstealing, ',
    +     '   HBP as hitbypitch, SF as sacrificeflies, ',
    +     '   AB as atbats ',
    +     '   FROM Teams ',
    +     '   WHERE yearID between 2000 and 2008'
    +   )
    + )

Or, if you’d like, you can just load the file from the `nutshell` package:

    > library(nutshell)
    > data(team.batting.00to08)

Because this is a book about R and not a book about baseball, I renamed the common abbreviations to more intuitive names for plays. Let’s look at scatter plots of runs versus each other variable so that we can see which variables are likely to be most important.

We’ll create a data frame for plotting, ...
