Chapter 4. Regression in a Nutshell

In Chapter 1, where we briefly explored the realms of machine learning, we began with linear regression because it is probably something you have come across at some point in your mathematical training. The process is fairly intuitive and easier to explain as a first concept than some other machine learning models. Additionally, many realms of data analysis rely on regression modeling, ranging from a business trying to forecast its profits to the frontiers of science trying to uncover the laws that govern the universe. Regression shows up in any scenario in which we want to predict a numeric outcome from one or more input variables. In this chapter, we examine how to use regression modeling in R in depth, but we also explore some caveats and pitfalls to be aware of in the process.

The main motivation behind regression is to build an equation by which we can learn more about our data. There is no hard-and-fast rule about which type of regression model to fit to your data, however. Choosing between a logistic, linear, or multivariate regression model depends on the problem and the data that you have. You could fit a straight line to a given series of data points, but is that always the best choice? Ideally, we are after a balance of simplicity and explanatory power. A straight line fit to a complex series of data might be simple, but it might not describe the whole picture. On the other hand, taking a very simple set of data that is basically a straight line and fitting a model with all sorts of wacky curves to it might give you a very high degree of accuracy on those points, but leave very little room for new data points to fit the model well.

You might recall from your high school mathematics education fitting a line through a couple of data points. That kind of fit is the simplest form of machine learning, and it is often used without anyone realizing it is machine learning at all. Although fitting a line to two data points is easy to do by hand, fitting a line to three or more data points becomes a task better suited to computers. Adding even a handful of extra points (or, as we'll see, several dozen more) makes the problem much harder to solve by hand. But through the mathematical techniques that power mainstream machine learning models, we can solve those kinds of problems very easily. R makes a lot of these steps quite simple to compute, and this chapter provides a foundation for deciding where to draw the line between model complexity and accuracy.

Linear Regression

In Chapter 1, we briefly encountered linear regression with an example from the mtcars dataset. In that example, we determined a linear relationship of fuel efficiency as a function of vehicle weight and saw the trend go downward. We extracted coefficients for a linear equation and dusted our hands. Yet there is a lot more to regression than simply slapping an equation onto a bunch of data and calling it a day. Let's revisit mtcars, this time modeling fuel efficiency (mpg) as a function of engine size (disp), as shown in Figure 4-1:

model <- lm(mtcars$mpg ~ mtcars$disp)

plot(y = mtcars$mpg, x = mtcars$disp, xlab = "Engine Size (cubic inches)",
    ylab = "Fuel Efficiency (Miles per Gallon)", main = "Fuel Efficiency From the `mtcars` Dataset")

abline(a = coef(model)[1], b = coef(model)[2], lty = 2)
Figure 4-1. A simple linear regression fit to data

Now let's look at the output of the model with the summary() function:

summary(model)

##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$disp)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.8922 -2.2022 -0.9631  1.6272  7.2305
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## mtcars$disp -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

There is a wealth of information dumped out by the summary() call on this linear model object. The one number people typically look at first for a baseline accuracy assessment is the multiple R-squared value: the closer that value is to 1, the more of the variation in the response the model explains. There are a lot of other terms in this output, though, so let's walk through each element to gain a solid understanding:

Call

This displays the formula we passed to lm(). In this case, we modeled one response variable, mpg, as a function of one predictor variable, disp, both taken from the mtcars data frame.

Residuals

Residuals are the vertical distances from each data point to the fitted line. Here we get summary statistics of those distances across all the points in the model. The smaller the residuals, the closer the line passes to the data.

Coefficients

These are the estimates for the coefficients of our linear equation. Our equation in this case would be y = −0.04x + 29.59.

  • Std. Error: With those coefficients come error estimates as given by the Std. Error part of the coefficients table. In reality, our equation would be something like y = (−0.04 ± 0.005)x + (29.59 ± 1.23).

  • t-value: This is the coefficient estimate divided by its standard error; it measures how large the estimate is relative to the uncertainty in it. The t-value is linked to the p-value, but p-values are used far more frequently.

  • p-value: p-values are statistical assessments of significance. The workings of p-values are more complicated than that, but for our purposes a p-value less than 0.05 means we can treat the estimate as statistically significant. If the p-value is greater than 0.05, we should err on the side of it not being statistically significant. The star ratings next to the p-values are explained by the significance codes that follow.

Residual standard error

This is an estimate of the standard deviation of the residuals: roughly, how far a typical observation falls from the fitted line.

Multiple R-squared

This is the R-squared value reported when we have multiple predictors. It isn't a concern for this single-predictor example, but as we add more predictors to the model, the multiple R-squared will invariably go up, because every feature we add explains some part of the variance whether or not it is genuinely related to the response.

Adjusted R-squared

To counteract the bias introduced by an R-squared value that only ever increases as predictors are added, the adjusted R-squared tends to be a better representation of a model's accuracy when there are multiple features.

F-statistic

Finally, the F-statistic is the ratio of the variance explained by the parameters in the model to the unexplained variance.

This simple linear example has some decent explanatory power. We have determined a relationship between fuel efficiency and engine size. Oftentimes, this is where simple linear regression examples exhaust their usefulness. The things we are most after in this specific case are the slope and intercept. If this example were applied to sales over time, for example, the intercept would give a starting value and the slope coefficient a growth rate.
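If you prefer to pull these quantities out of the model object programmatically rather than reading them off the printed summary, a few standard accessor functions will do it (a quick sketch using the model fit above):

coef(model)               # intercept and slope estimates
confint(model)            # 95% confidence intervals for the coefficients
summary(model)$r.squared  # multiple R-squared
summary(model)$sigma      # residual standard error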

Multivariate Regression

Suppose that you want to build a more robust model of fuel efficiency with more variables built into it. The fuel efficiency of a vehicle is a complex phenomenon with many contributing factors beyond engine size, so to capture the features that actually drive the behavior you want to use regression as you have been, but in a multivariate context.

Recall that our simple linear regression example was based around:

  • y = b + m1x1

where the coefficients are the intercept, b, and the slope, m, tied to the one variable we had in the model. If you want to bring in more factors that contribute to the model, change the mathematical form to:

  • y = b + m1x1 + m2x2 + m3x3 + (...)

where x1, x2, x3, and so forth, are different features in the model, such as a vehicle’s weight, engine size, number of cylinders, and so on. Because the new objective is to find coefficients for a model of the form y = f(x1, x2, x3, (…)), you need to revisit the call to the lm() function in R:

lm.wt <- lm(mpg ~ disp + wt, data = mtcars)
summary(lm.wt)

##
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4087 -2.3243 -0.7683  1.7721  6.3484
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
## disp        -0.01773    0.00919  -1.929  0.06362 .
## wt          -3.35082    1.16413  -2.878  0.00743 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658
## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10

This code extends the earlier linear model to include the vehicle's weight in the fitting procedure. In this case, the adjusted R-squared has gone up from 0.709, when you fit a model with just the engine size, to 0.7658 after including the vehicle's weight. However, notice that the statistical relevance of the original feature has gone down considerably. Before, the p-value of the disp feature was far below the 0.05 significance threshold; now it is about 0.06. This might be because fuel efficiency is more sensitive to changes in vehicle weight than to engine size.

If you want to extend this analysis further, you can bring in another feature from the dataset and see how the R-squared of the model and p-values of the coefficients change accordingly:

lm.cyl <- lm(mpg ~ disp + wt + cyl, data = mtcars)
summary(lm.cyl)

##
## Call:
## lm(formula = mpg ~ disp + wt + cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.4035 -1.4028 -0.4955  1.3387  6.0722
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.107678   2.842426  14.462 1.62e-14 ***
## disp         0.007473   0.011845   0.631  0.53322
## wt          -3.635677   1.040138  -3.495  0.00160 **
## cyl         -1.784944   0.607110  -2.940  0.00651 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.595 on 28 degrees of freedom
## Multiple R-squared:  0.8326, Adjusted R-squared:  0.8147
## F-statistic: 46.42 on 3 and 28 DF,  p-value: 5.399e-11

This code takes the same approach as before, but adds the engine's cylinder count to the model. Notice that the adjusted R-squared has increased yet again, from 0.7658 to 0.8147. However, the statistical relevance of the displacement is now basically defunct, with a p-value of 0.53322, more than 10 times the 0.05 threshold. This tells us that fuel efficiency is tied more to the combined feature set of vehicle weight and number of cylinders than it is to engine size. You can rerun the analysis with just the statistically relevant features:

lm.cyl.wt <- lm(mpg ~ wt + cyl, data = mtcars)
summary(lm.cyl.wt)

##
## Call:
## lm(formula = mpg ~ wt + cyl, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.2893 -1.5512 -0.4684  1.5743  6.1004
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
## wt           -3.1910     0.7569  -4.216 0.000222 ***
## cyl          -1.5078     0.4147  -3.636 0.001064 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.568 on 29 degrees of freedom
## Multiple R-squared:  0.8302, Adjusted R-squared:  0.8185
## F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12

By removing the statistically irrelevant feature from the model, you have more or less preserved the adjusted R-squared at 0.8185 versus 0.8147, while keeping only features that are statistically relevant to the data.

You should take care when adding features to the data, however. In R, you can easily model a response to all the features in the data by calling the lm() function with the following form:

lm.all <- lm(mpg ~ ., data = mtcars)
summary(lm.all)

##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4506 -1.6044 -0.1196  1.2193  4.6271
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337   18.71788   0.657   0.5181
## cyl         -0.11144    1.04502  -0.107   0.9161
## disp         0.01334    0.01786   0.747   0.4635
## hp          -0.02148    0.02177  -0.987   0.3350
## drat         0.78711    1.63537   0.481   0.6353
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739
## vs           0.31776    2.10451   0.151   0.8814
## am           2.52023    2.05665   1.225   0.2340
## gear         0.65541    1.49326   0.439   0.6652
## carb        -0.19942    0.82875  -0.241   0.8122
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

This syntax creates a linear model with the dependent variable mpg being modeled against everything in the dataset, as denoted by the . mark in the function call. The problem with this approach, however, is that you see very little statistical value in the coefficients of the model. Likewise, the standard error for each of the coefficients is very high, and thus pinning down an exact value for the coefficients is very difficult. Instead of this top-down approach to seeing which features are the most important in the dataset, it is better to approach it from the bottom up as we have done thus far. Although the theme of feature selection itself is a very broad topic—one which we explore in depth with other machine learning algorithms—we can mitigate some of these problems in a couple of ways:

Careful selection of features

Pick features to add to the model one at a time and cut the ones that are statistically insignificant. We've accomplished this in the preceding code chunks by adding one parameter at a time and checking whether the p-value for that parameter in the model output is statistically significant (see the short sketch after this list for a built-in helper that automates the check).

Regularization

Keep all of the features but mathematically reduce the coefficients of the less important ones to minimize their impact on the model.
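As a small illustration of the first approach (an aside, not part of the original workflow), base R's add1() function evaluates candidate features one at a time against a starting model and reports an F-test for each; you add the most significant one, refit, and repeat:

# start from an intercept-only model and test each candidate feature
base.model <- lm(mpg ~ 1, data = mtcars)
add1(base.model, scope = ~ disp + wt + cyl, test = "F")

# suppose wt comes out strongest; add it and test the remaining candidates again
add1(update(base.model, . ~ . + wt), scope = ~ disp + wt + cyl, test = "F")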

Regularization

Regularization can be a tough concept mathematically, but in principle it’s fairly straightforward. The idea is that you want to include as many of the features in your data as you can squeeze into the final model. The more features, the better you can explain all the intricacies of the dataset. The catch here is that the degree to which each feature explains part of the model, after regularization is applied, can be quite different.

Through the use of regularization, you can make your model more succinct and reduce the noise in the dataset that might be coming from features that have little impact on what you are trying to model against.

Let’s see what the linear model for the mtcars dataset would look like if we included all the features. We would have an equation like this:

  • mpg = 12.3 − 0.11cyl + 0.01disp − 0.02hp + 0.79drat − 3.72wt + 0.82qsec + 0.32vs + 2.52am + 0.66gear − 0.20carb

According to this linear equation, fuel efficiency is most sensitive to the weight of the vehicle (−3.72wt), given that it has the largest coefficient. However, most of the coefficients are within an order of magnitude or so of one another. Regularization would keep all of the features, but the less important ones would have their coefficients scaled down much further.

To utilize this regularization technique, you call a particular type of regression modeling, known as a lasso regression, as shown here:

library(lasso2)
lm.lasso <- l1ce(mpg ~ ., data = mtcars)
summary(lm.lasso)$coefficients

##                   Value  Std. Error     Z score   Pr(>|Z|)
## (Intercept) 36.01809203 18.92587647  1.90311355 0.05702573
## cyl         -0.86225790  1.12177221 -0.76865686 0.44209704
## disp         0.00000000  0.01912781  0.00000000 1.00000000
## hp          -0.01399880  0.02384398 -0.58709992 0.55713660
## drat         0.05501092  1.78394922  0.03083659 0.97539986
## wt          -2.68868427  2.05683876 -1.30719254 0.19114733
## qsec         0.00000000  0.75361628  0.00000000 1.00000000
## vs           0.00000000  2.31605743  0.00000000 1.00000000
## am           0.44530641  2.14959278  0.20715850 0.83588608
## gear         0.00000000  1.62955841  0.00000000 1.00000000
## carb        -0.09506985  0.91237207 -0.10420075 0.91701004

This code calls the l1ce() function from the lasso2 package on the mtcars dataset, using the same formula interface as before: fuel efficiency, mpg, modeled as a function of all the other variables in the dataset. Regularization is built into lasso regression and is applied during the mathematical heavy lifting of the fitting algorithm. It scales each coefficient according to how much impact the corresponding feature actually has on the model, and in some cases a feature is scaled down so far that its coefficient is effectively zero. As a result of this regression modeling, you now have a different equation:

  • mpg = 36.02 − 0.86cyl + 0disp − 0.014hp + 0.06drat − 2.69wt + 0qsec + 0vs + 0.45am + 0gear − 0.095carb

Or, more simply:

  • mpg = 36.02 − 0.86cyl − 0.014hp + 0.06drat − 2.69wt + 0.45am − 0.095carb

The most important feature before the change to a lasso regression was the vehicle’s weight, wt, which has remained unchanged as far as its relative importance. Even though the coefficient has changed somewhat, the fact that it is the highest magnitude coefficient remains the same. What you see in terms of less useful features being scaled down—in this case to zero—are features that you would probably think have little impact on fuel efficiency to begin with. Quarter-mile drag race time (qsec), engine configuration in terms of a V-shape or a straight-line shape (vs), and number of forward gears (gear) have all been rescaled down to zero.

However, the variable of displacement showed a clear relationship to fuel efficiency that we saw earlier. It being scaled down to zero does not mean there is no relationship whatsoever between just that one variable and our response, but when taken together with all the other variables in the dataset, its importance is negligible. Remember, in this case we are interested in a model of all features, not necessarily the importance of just one feature.

Notice from the new lasso regression model that some of the coefficients have been more or less mathematically eliminated from the model. To further refine the model and reduce the number of features in it, you can rerun the regression without those features and see what changes:

lm.lasso2 <- l1ce(mpg ~ cyl + hp + wt + am + carb, data = mtcars)
summary(lm.lasso2)$coefficients

##                     Value Std. Error     Z score     Pr(>|Z|)
## (Intercept) 31.2819166926 4.51160542  6.93365527 4.100942e-12
## cyl         -0.7864202230 0.86107128 -0.91330444 3.610824e-01
## hp          -0.0009037003 0.02343634 -0.03855979 9.692414e-01
## wt          -1.9248597501 1.38749433 -1.38729198 1.653527e-01
## am           0.0000000000 2.22143917  0.00000000 1.000000e+00
## carb         0.0000000000 0.67947216  0.00000000 1.000000e+00

With the reduced dataset being then passed into another lasso regression, you can see that the transmission type of the car, am, and the number of carburetors, carb, have both dropped to zero. By removing these features and rerunning, you can see if any more drop out:

lm.lasso3 <- l1ce(mpg ~ cyl + hp + wt, data = mtcars)
summary(lm.lasso3)$coefficients

##                  Value Std. Error    Z score  Pr(>|Z|)
## (Intercept) 30.2106931 1.97117597 15.3262284 0.0000000
## cyl         -0.7220771 0.82941877 -0.8705821 0.3839824
## hp           0.0000000 0.01748364  0.0000000 1.0000000
## wt          -1.7568469 1.07478525 -1.6346028 0.1021324

In this case, the horsepower of the car, hp, has now dropped out. You can continue this process as long as you have multiple features left to test:

lm.lasso4 <- l1ce(mpg ~ cyl + wt, data = mtcars)
summary(lm.lasso4)$coefficients

##                  Value Std. Error   Z score  Pr(>|Z|)
## (Intercept) 29.8694933  1.4029760 21.290096 0.0000000
## cyl         -0.6937847  0.5873288 -1.181254 0.2375017
## wt          -1.7052064  1.0720172 -1.590652 0.1116879

The final result is a model that has only two features instead of the 10 you started with:

  • mpg = 29.87 − 0.69cyl − 1.70wt
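The lasso2 package used here is one implementation; glmnet is another widely used lasso package (not used elsewhere in this chapter, so treat this as an optional sketch). It expects a numeric matrix of predictors and can choose the penalty strength by cross-validation:

library(glmnet)

x <- as.matrix(mtcars[, -1])  # all columns except mpg
y <- mtcars$mpg

# alpha = 1 requests the lasso penalty; cv.glmnet picks the penalty by cross-validation
cv.fit <- cv.glmnet(x, y, alpha = 1)

# coefficients at the cross-validated penalty; a dot marks a coefficient shrunk to zero
coef(cv.fit, s = "lambda.min")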

Polynomial Regression

Polynomial regression is simply fitting a higher degree function to the data. Previously, we’ve seen fits to our data along the following form:

  • y = b + m1x1 + m2x2 + m3x3(...)

Polynomial regression differs from the simple linear case by allowing higher powers of each feature in the dataset. The simplest such form could be represented as follows:

  • y = b + mx^2

The following example will help with our reasoning (Figure 4-2):

pop <- data.frame(uspop)
pop$uspop <- as.numeric(pop$uspop)
pop$year <- seq(from = 1790, to = 1970, by = 10)

plot(y = pop$uspop, x = pop$year, main = "United States Population From 1790 to 1970",
    xlab = "Year", ylab = "Population")
Figure 4-2. The plotted population of the United States in decades from 1790 to 1970

Here, we have a built-in dataset in R that we've tweaked slightly for demonstration purposes. Normally, uspop is a time-series object that has its own plotting behavior, but here we've reshaped it to plot just the data points. The data is the population of the United States (in millions) at 10-year intervals from 1790 to 1970. Let's begin by fitting a linear model to the data:

lm1 <- lm(pop$uspop ~ pop$year)
summary(lm1)

##
## Call:
## lm(formula = pop$uspop ~ pop$year)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -19.569 -14.776  -2.933   9.501  36.345
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.958e+03  1.428e+02  -13.71 1.27e-10 ***
## pop$year     1.079e+00  7.592e-02   14.21 7.29e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.12 on 17 degrees of freedom
## Multiple R-squared:  0.9223, Adjusted R-squared:  0.9178
## F-statistic: 201.9 on 1 and 17 DF,  p-value: 7.286e-11

This simple linear fit of the data seems to work pretty well. The p-values of the estimates are very low, indicating good statistical significance. Likewise, the R-squared values are both very good. However, the residuals show a pretty wide degree of variability, with differences as large as 36, as demonstrated in Figure 4-3:

plot(y = pop$uspop, x = pop$year, main = "United States Population From 1790 to 1970",
    xlab = "Year", ylab = "Population")

abline(a = coef(lm1)[1], b = coef(lm1)[2], lty = 2, col = "red")
Figure 4-3. Population data with a linear model fit

The dotted line fit from the linear model seems to do OK. It fits some of the data better than others, but it’s pretty clear from the data that it’s not exactly a linear relationship. Moreover, we know from intuition that population over time tends to be more of an exponential shape than one that’s a straight line. What you want to do next is to see how a model of a higher degree stacks up against the linear case, which is the lowest-order degree polynomial that you can fit:

lm2 <- lm(pop$uspop ~ poly(pop$year, 2))
summary(lm2)

##
## Call:
## lm(formula = pop$uspop ~ poly(pop$year, 2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.5997 -0.7105  0.2669  1.4065  3.9879
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         69.7695     0.6377  109.40  < 2e-16 ***
## poly(pop$year, 2)1 257.5420     2.7798   92.65  < 2e-16 ***
## poly(pop$year, 2)2  73.8974     2.7798   26.58 1.14e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.78 on 16 degrees of freedom
## Multiple R-squared:  0.9983, Adjusted R-squared:  0.9981
## F-statistic:  4645 on 2 and 16 DF,  p-value: < 2.2e-16

This code calls the lm() function again, but this time wraps the predictor in the poly() function. This function takes the year values and computes an orthogonal polynomial basis from them, scaled appropriately. By default, poly() uses these orthogonal terms rather than raw powers of the years; this improves numerical stability but does not change the fitted curve, so you can use it to see whether a higher-degree fit does any better than the lower-order fit you did previously. Recall that the linear fit is technically a polynomial, too, but of degree 1. As an equation, here's the resultant model fit:

  • y = b + m1x1^2 + m2x1
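As a quick aside (not in the original text), you can convince yourself that the orthogonal parameterization used by poly() does not change the fitted curve by comparing it against a fit on raw powers of the year:

lm2.raw <- lm(pop$uspop ~ poly(pop$year, 2, raw = TRUE))

# the fitted values should agree up to numerical tolerance (all.equal returns TRUE)
all.equal(as.numeric(fitted(lm2)), as.numeric(fitted(lm2.raw)))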

Let's slowly walk through the summary() output. Looking at the residuals gives us a bit of relief: nothing in the range of 30! Smaller residuals are always better in terms of model fit. The coefficients table now has three entries: one for the intercept, one for the first-degree term, and one for the second-degree term. When you called poly(pop$year, 2), you instructed R that you wanted a polynomial of the year data with the highest degree being 2. Going back to the coefficients table, you can see that all of the p-values are statistically significant, which is also a good indication that this is a solid model fit to your data (see Figure 4-4):

plot(y = pop$uspop, x = pop$year, main = "United States Population From 1790 to 1970",
    xlab = "Year", ylab = "Population")

pop$lm2.predict = predict(lm2, newdata = pop)

lines(sort(pop$year), fitted(lm2)[order(pop$year)], col = "blue",
    lty = 2)

From Figure 4-4, it looks pretty obvious that the higher-degree polynomial (in this case a quadratic equation) fits the data better. So clearly higher-degree polynomials work better than lower-degree ones, right? What happens if you fit a third-degree polynomial? Or something higher still? I'll bet that a sixth-degree polynomial would give a very accurate model indeed! The simple linear fit did the best it could, but the second-degree polynomial (a simple quadratic) fit better; to see how still-higher degrees stack up, Figure 4-5 overlays polynomial fits of increasing degree on the data.

Figure 4-4. Population over time modeled with a quadratic fit seems to fit the data much better than a linear one; if you want the most accurate model possible, however, you might want to increase the polynomial degree to which you fit the data
Figure 4-5. Population over time with multiple models fit

Figure 4-5 shows the linear fit compared to polynomials of increasing degree. The polynomial fits are difficult to tell apart in terms of how well they fit, but all of them seem to fit better than the linear case. To compare models that are this close visually, you need to look at plots of their residuals instead, as demonstrated in Figure 4-6.
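The higher-degree models referenced in the residual plots (lm3 through lm6) are fit the same way as lm2, only varying the degree passed to poly(); their definitions, implied by the figures, are:

lm3 <- lm(pop$uspop ~ poly(pop$year, 3))
lm4 <- lm(pop$uspop ~ poly(pop$year, 4))
lm5 <- lm(pop$uspop ~ poly(pop$year, 5))
lm6 <- lm(pop$uspop ~ poly(pop$year, 6))

With those in hand, you can plot each model's residuals: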

par(mfrow = c(2, 3))
plot(resid(lm1), main = "Degree 1", xlab = "Sequential Year",
    ylab = "Fit Residual")
plot(resid(lm2), main = "Degree 2", xlab = "Sequential Year",
    ylab = "Fit Residual")
plot(resid(lm3), main = "Degree 3", xlab = "Sequential Year",
    ylab = "Fit Residual")
plot(resid(lm4), main = "Degree 4", xlab = "Sequential Year",
    ylab = "Fit Residual")
plot(resid(lm5), main = "Degree 5", xlab = "Sequential Year",
    ylab = "Fit Residual")
plot(resid(lm6), main = "Degree 6", xlab = "Sequential Year",
    ylab = "Fit Residual")
Figure 4-6. A residuals plot of each of the models

Recall that a residual is the vertical distance between a data point and the fitted model line. A model that fits the data points exactly should have a residuals plot that is as close to a flat line at zero as possible. For the linear fit, the scale of the residuals plot is much larger than the rest, and the linear fit has some pretty bad residual distances at its start, halfway point, and end. This is not an ideal model. The higher-degree polynomials, on the other hand, do pretty well: their residual plots sit on a much smaller scale. The one that really stands out is the sixth-degree polynomial fit, whose residuals are pretty much zero at the start and only become a little noisier toward the end.

This is all well and good, but it might be easier to rank the model fit by looking at their residuals numerically:

c(sum(abs(resid(lm1))), sum(abs(resid(lm2))), sum(abs(resid(lm3))),
    sum(abs(resid(lm4))), sum(abs(resid(lm5))), sum(abs(resid(lm6))))

## [1] 272.51432  33.77224  34.54039  36.95125  25.45242  19.59938

This code sums the absolute values of each model's residuals. If you took the raw sum instead, you would get an inaccurate picture, because positive and negative residuals would cancel out. The total residual for the linear fit is quantitatively bad compared to the rest of the models, with the sixth-degree polynomial being the clear winner in terms of closeness to the data points.

But is the best fit to the data points actually the best model? We must take into account the ideas of overfitting and underfitting the data. The linear model fit to the data in the previous case would be a good example of an underfit scenario. Clearly there’s some structure in the data that isn’t being explained by a simple linear fit. On the other hand, a model can be overfit if it is too specific to the data presented and offers little explanatory power for any new data that might come into the system. This is the risk you run by increasing the degree of polynomial models.

Goodness of Fit with Data—The Perils of Overfitting

We have just run an example of trying to get a model that has the best fit to our data. This is a good goal to have, but you need to be careful not to go too far in fitting perfectly to the data. We have seen so far that the linear fit to a population curve probably isn’t the best model for the job. A quadratic or a cubic polynomial fit seems to do much better by comparison. Yet, is it worth it to keep increasing the degree of the model fit? Is the minimization of the residual the only goal in terms of selecting the best model for the job?

Root-Mean-Square Error

In statistics, the root-mean-square error (RMSE) is a quantifiable way to see how our model’s predicted values stack up against actual values. Mathematically, the RMSE is given as follows:

RMSE = sqrt(mean((predicted value − actual value)^2))
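This formula translates directly into a small helper function in R (a convenience sketch; the code later in this section computes the same quantity inline):

# root-mean-square error between predicted and actual numeric vectors
rmse <- function(predicted, actual) {
    sqrt(mean((predicted - actual)^2))
}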

To assess polynomial fits, you can perform an RMSE analysis on each one. You can then compare the resultant errors and select the one that has the lowest result. To do so, you need new data that isn’t in your model. For that, let’s use the US population census data from 1980 to 2010:

uspop.2020 <- data.frame(year = c(1980, 1990, 2000, 2010), uspop = c(226.5,
    249.6, 282.2, 309.3))
uspop.2020.predict <- uspop.2020

pop2 <- data.frame(uspop)
pop2$uspop <- as.numeric(pop2$uspop)
pop2$year <- seq(from = 1790, to = 1970, by = 10)

This code creates a data frame of the newer census values and also reinitializes the old population data as pop2 for a clean fit. From there, you refit each polynomial using the data argument, so that predict() can be applied to the new census years, and then calculate the RMSE for each fit:

uspop.2020.predict$lm1 <- predict(lm(uspop ~ poly(year, 1), data = pop2),
    uspop.2020)

uspop.2020.predict$lm2 <- predict(lm(uspop ~ poly(year, 2), data = pop2),
    uspop.2020)

uspop.2020.predict$lm3 <- predict(lm(uspop ~ poly(year, 3), data = pop2),
    uspop.2020)

uspop.2020.predict$lm4 <- predict(lm(uspop ~ poly(year, 4), data = pop2),
    uspop.2020)

uspop.2020.predict$lm5 <- predict(lm(uspop ~ poly(year, 5), data = pop2),
    uspop.2020)

uspop.2020.predict$lm6 <- predict(lm(uspop ~ poly(year, 6), data = pop2),
    uspop.2020)

And, finally, calculate the RMSE:

c(sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm1)^2)),
    sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm2)^2)),
    sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm3)^2)),
    sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm4)^2)),
    sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm5)^2)),
    sqrt(mean((uspop.2020.predict$uspop - uspop.2020.predict$lm6)^2)))

## [1]  75.622445   8.192311   5.070814   9.153189  73.632318 124.429798

From these results, you can see that the simple linear fit had an RMSE of 75, the second-degree polynomial had 8, and the third-degree polynomial had 5. The errors blow up after the third-degree polynomial, which is another indication that the models were too overfit to the data. In this case, you would select the model that has the lowest RMSE to the new predicted data by picking the polynomial of degree three.

Model Simplicity and Goodness of Fit

If you recall the model coefficients, each one has an attached p-value of statistical significance tied to it from the lm() model fitting procedure. If a coefficient’s p-value is less than 0.05, it’s safe to assume that it is statistically important for your model.

To help you to decide which model to use, identify where the trade-off is between model accuracy and model complexity. The more complex your model is—that is, the degree of polynomial used—the tighter it’s going to fit to your data, but you run the risk of some of the coefficients being less statistically valid as the model becomes more complex. To avoid this, first look at both the R-squared and the number of statistically valid coefficients for each of your models:

table((summary(lm1)$coefficients[, 4]) < 0.05)

##
## TRUE
##    2

summary(lm1)$r.squared

## [1] 0.9223434

This example takes the coefficients from the simple linear fit, lm1, and extracts the p-values tied to them. It then tabulates how many of those are statistically valid (that is, below 0.05). The result for the simple linear case is that there are two coefficients, the intercept and the slope, and both are statistically valid. The R-squared value also confirms that the fit is pretty good, but let's use that as a baseline for the sake of comparison.

Instead of computing this for each model and looking back and forth at the results, you can dump all of this information into a handy data frame for easier reading. Define model.order as the highest degree of the polynomial fit (this is simply the number you pass to the poly() function inside the lm() call). Then define coef.true as the number of coefficients on the predictor terms that are statistically valid; because the intercept is statistically valid in every model and we only care about the predictor terms, you subtract 1 from each count. Next, define coef.false as the opposite case: how many of the model's predictor coefficients are not statistically meaningful. Finally, define model.rsq as the extracted R-squared of each model. You then put it all together in a data frame and define a final metric, goodness, which is the ratio of statistically meaningful coefficients to the model's order:

model.order <- c(1,2,3,4,5,6)

coef.true <- c(
  table((summary(lm1)$coefficients[,4])<0.05) - 1
  ,table((summary(lm2)$coefficients[,4])<0.05) - 1
  ,table((summary(lm3)$coefficients[,4])<0.05)[2] - 1
  ,table((summary(lm4)$coefficients[,4])<0.05)[2] - 1
  ,table((summary(lm5)$coefficients[,4])<0.05)[2] - 1
  ,table((summary(lm6)$coefficients[,4])<0.05)[2] - 1

)

coef.false <- c(
  0
  ,0
  ,table((summary(lm3)$coefficients[,4])<0.05)[1]
  ,table((summary(lm4)$coefficients[,4])<0.05)[1]
  ,table((summary(lm5)$coefficients[,4])<0.05)[1]
  ,table((summary(lm6)$coefficients[,4])<0.05)[1]

)

model.rsq <- c(
  summary(lm1)$r.squared
  ,summary(lm2)$r.squared
  ,summary(lm3)$r.squared
  ,summary(lm4)$r.squared
  ,summary(lm5)$r.squared
  ,summary(lm6)$r.squared

)

model.comparison <- data.frame(model.order, model.rsq, coef.true, coef.false)
model.comparison$goodness <- (model.comparison$coef.true / model.comparison$model.order)

model.comparison

##   model.order model.rsq coef.true coef.false  goodness
## 1           1 0.9223434         1          0 1.0000000
## 2           2 0.9982808         2          0 1.0000000
## 3           3 0.9983235         2          1 0.6666667
## 4           4 0.9984910         2          2 0.5000000
## 5           5 0.9992208         3          2 0.6000000
## 6           6 0.9993027         3          3 0.5000000

The result demonstrates that, although the model's R-squared increases as the fit becomes more complex, the goodness of that fit goes down, because the proportion of coefficients that are statistically meaningful tends to shrink. One way to quantify the trade-off is to combine the quantities you're interested in optimizing into a single rank:

model.comparison$rank <- sqrt(model.comparison$goodness^2 +
    model.comparison$model.rsq^2)
model.comparison

##   model.order model.rsq coef.true coef.false  goodness     rank
## 1           1 0.9223434         1          0 1.0000000 1.360411
## 2           2 0.9982808         2          0 1.0000000 1.412998
## 3           3 0.9983235         2          1 0.6666667 1.200456
## 4           4 0.9984910         2          2 0.5000000 1.116685
## 5           5 0.9992208         3          2 0.6000000 1.165522
## 6           6 0.9993027         3          3 0.5000000 1.117410

Now you can see where the trade-off between model accuracy and goodness of fit is best. The model order with the highest rank in this case is the quadratic fit, all of whose coefficients are statistically valid. Although the fit of the third-degree polynomial is marginally better (almost unmeasurably so), its goodness of fit suffers because it contains a coefficient that is not statistically meaningful.

What this procedure gives us is an optimal model to choose: the one with the highest rank value. A model with a lower R-squared and a lower rank is underfit to the data, whereas a model with a higher R-squared but a lower rank is overfit to our data.

Logistic Regression

Thus far we've considered regression models as taking some kind of numeric data to which we want to fit a curve so that we can use it for predictive purposes. Linear regression takes numeric data and produces an equation like y = mx + b. Linear regression can also have multiple inputs, giving an equation like y = b + m1x1 + m2x2 + (…). Further, these numeric regression models can be extended to nonlinear cases such as y = b + m1x1 + m2x1^2 + m3x1^3 + (…). All of these have their own use cases, and the choice depends entirely on the data we're working with and the kind of accuracy we want to optimize for.

All of these so far have ingested numeric input and given us numeric output. What if, instead, we wanted a "yes" or "no" outcome from our data? What if we were trying to determine whether our input data corresponds to a positive or negative result? In that case, we would be taking in continuous numeric data and producing a discrete output. This is the basis for the classification end of regression modeling. Logistic regression is a particular type of classifier, and it is simple enough to serve as an introductory example. In contrast to linear regression, logistic regression finds the point at which the data switches from one classification to another, rather than trying to fit the individual data points themselves.

The Motivation for Classification

Suppose that you are trying to diagnose patients to determine whether they have a malignant tumor. Let’s look at the code and the resulting plot in Figure 4-7:

data <- data.frame(tumor.size = c(1, 2, 3, 4, 5, 6, 7, 8, 9,
    20), malignant = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1))

tumor.lm <- lm(malignant ~ tumor.size, data = data)

plot(y = data$malignant, x = data$tumor.size, main = "Tumor Malignancy by Size",
    ylab = "Type (0 = benign, 1 = cancerous)", xlab = "Tumor Size")

abline(a = coef(tumor.lm)[1], b = coef(tumor.lm)[2])
coef(tumor.lm)

## (Intercept)  tumor.size
##  0.20380952  0.06095238

summary(tumor.lm)$r.squared

## [1] 0.4063492
Figure 4-7. Fitting a linear regression line to binary data does not provide an accurate model

This code creates a dataset of tumor sizes (1 through 9, plus one tumor of size 20) and classifies whether each is malignant, with a benign or noncancerous tumor labeled 0 and a malignant or cancerous tumor labeled 1. A naive instinct might be to slap a regression model on this data and see what equation comes out. With that approach, you would get an equation such as the following:

  • tumor malignancy = 0.061 × tumor size + 0.204

The poor R-squared of 0.406 suggests that we could obtain a more accurate model. You should also question what it even means for a tumor to be 0.2 malignant when the data records each tumor as either malignant or not, with no room in between. The same issue would arise with the mtcars dataset if we modeled something against transmission type: what would a 0.2 transmission be if 0 were manual and 1 were automatic?

We need to rethink this approach. Instead of fitting a normal mathematical function, we need to fit something called a decision boundary to the data.

The Decision Boundary

The decision boundary is simply a line in the sand for our data that says "anything on this side is classified as X and anything on the other side is classified as Y." Figure 4-8 revisits the plot of tumor sizes and whether they're malignant. You can clearly see that every tumor of size 5 or greater is malignant:

plot(y = data$malignant, x = data$tumor.size, main = "Tumor Malignancy by Size",
    ylab = "Type (0 = benign, 1 = cancerous)", xlab = "Tumor Size")
abline(v = 4.5)
Figure 4-8. Plotting a decision boundary instead of a regression line classifies data less than about 4.5 as 0; data above that threshold is classified as 1

Logistic regression establishes the boundary against which you can classify data. The boundary in Figure 4-8 shows that any tumor size greater than 4.5 is malignant, whereas anything less than that is benign.

The Sigmoid Function

The way logistic regression (as well as many other types of classification algorithms) work is based on the mathematical underpinnings of the sigmoid function. The sigmoid function takes the following mathematical form:

h(x) = 1 / (1 + e^(-x))

Figure 4-9 shows what the plot looks like:

e <- exp(1)
curve(1/(1 + e^-x), -10, 10, main = "The Sigmoid Function", xlab = "Input",
    ylab = "Probability")
Figure 4-9. The sigmoid function is the basis for logistic regression

This function is used in logistic regression to classify data. On its own, it takes some numeric value we are interested in and maps it to a probability between 0 and 1. We might be tempted to just plug values from our earlier example straight into the sigmoid function and see what comes out. If we set x = 1, for example, we would get h(1) = 0.73, or about a 73% chance that the tumor is malignant. Yet our classification scheme says a tumor of size 1 is benign (0), so feeding the raw size into the sigmoid gives an answer that is clearly incorrect.
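You can check that number directly in R:

1/(1 + exp(-1))

## [1] 0.7310586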

Instead, we need to pass a weighted combination of our inputs to the sigmoid. Because we have only one input variable at the moment (keeping in mind that the y-axis, our classification output, is not an input variable), we should expect to pass the sigmoid a function of the following form:

g(length) = θ0 + θ1 × length

A priori, we don’t know what the weights are just yet. What we do want is for them to be chosen such that our g(x) function, when passed to our sigmoid function, will give us a classification that looks reasonable to what we see in our data:

lengths <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
t1 = -4.5
t2 = 1
g = t1 + t2 * lengths
s = 1/(1 + e^-g)
data.frame(lengths, g, s)

##    lengths    g          s
## 1        1 -3.5 0.02931223
## 2        2 -2.5 0.07585818
## 3        3 -1.5 0.18242552
## 4        4 -0.5 0.37754067
## 5        5  0.5 0.62245933
## 6        6  1.5 0.81757448
## 7        7  2.5 0.92414182
## 8        8  3.5 0.97068777
## 9        9  4.5 0.98901306
## 10      10  5.5 0.99592986

This code chunk takes the input tumor lengths, which range from 1 to 10, and picks two weights: θ0 = −4.5 and θ1 = 1. In practice, you would either experiment with different weight values and see how the outputs react, or run them through an algorithm that finds them for you; the preceding code simply uses the final answer. The weights define the function g(x), which is then passed to the sigmoid. The table in the output presents the resulting probability for each length as s. A tumor of length 1, when passed through the input function g(x), gives a result of −3.5. That value, passed through the sigmoid function, yields a probability very close to zero, meaning that a tumor of length 1 has a very low probability of being malignant, as demonstrated in Figure 4-10:

plot(y = s, x = lengths, pch = 1, main = "Sigmoid Function Inputs and Rounding Estimates",
    xlab = "Tumor Lengths", ylab = "Probability of Class 1 Typification")

points(y = round(s), x = lengths, pch = 3)
Figure 4-10. For a given input length, you can see the estimate from the sigmoid function in circles, and its rounded value in crosses

Figure 4-10 shows, for each tumor length, the probability of malignancy produced by the sigmoid: values near 1.0 correspond to malignant tumors and values near 0.0 to benign ones. The result is pretty close to the original labels, but with some error. You get a much cleaner picture if you simply round the probabilities to the nearest whole number; the final result is a classification that looks exactly like the starting data.

We originally started with the input data being tumor length. The output of tumor type between benign, y = 0, and malignant, y = 1, was already given to us. The objective was to design a model that calculates the probability that a tumor is benign or malignant based on its length. We did this by starting with the equation g(x) = θ0 + θ1x and then finding the weights θ0 and θ1 that, when passed through a sigmoid function, produced values that look about right for what we needed. What we get at the end of the day is a decision boundary at length = 4.5; any values above that are classified as 1, and any values below it are classified as 0.
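In practice, you would not pick the weights by hand: R's glm() function (which we use again in the next section) estimates them from the data. A quick sketch on the tumor data from earlier; because that toy dataset is perfectly separated, glm() warns that fitted probabilities of 0 or 1 occurred and returns very large coefficients, but the boundary it implies still falls between the largest benign size (4) and the smallest malignant size (5):

# logistic regression on the tumor data; expect a perfect-separation warning
tumor.glm <- glm(malignant ~ tumor.size, data = data, family = "binomial")
coef(tumor.glm)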

The mechanisms by which classification algorithms like logistic regression work to determine those modeling weights are somewhat similar in scope to how simple linear regression weights are calculated. However, given that the goal of this text is to be introductory in nature, I’ll refer you to the statistical appendix for linear regression. Logistic regression and many other machine learning algorithms work in a similar fashion, but diving too deep into the realm of algorithm optimization can get overly mathy and we would lose focus on the understanding and application of the machine learning ecosystem as a whole.

Binary Classification

Everything we’ve done so far in terms of classification has been on binary data: the tumor is either malignant or benign. Figure 4-11 looks at another example in which we determine the classes based on the data’s distribution:

plot(iris$Sepal.Length ~ iris$Sepal.Width, main = "Iris Flower Sepal Length vs Sepal Width",
    xlab = "Sepal Width", ylab = "Sepal Length")
Figure 4-11. You can use logistic regression for data that is more spread out instead of being discrete in output

In Figure 4-11, there are a bunch of data points and what appears to be two different classes of plants. There looks to be a grouping of data points at the bottom of the plot that seem to be more separated than the others. You can fit a logistic regression model to this data and find the equation for the line that makes your decision boundary. Any points below that threshold will be classified as one type, and all the points above the line will be classified as another type.

This exercise uses a generalized linear model, given by the function glm(). Its usage is more flexible than that of the standard linear model function lm() in that you can use it for classification purposes:

iris.binary <- iris
iris.binary$binary <- as.numeric(iris[, 5] == "setosa")


iris.logistic <- glm(binary ~ Sepal.Width + Sepal.Length, data = iris.binary,
    family = "binomial")
iris.logistic

##
## Call:  glm(formula = binary ~ Sepal.Width + Sepal.Length,
##     family = "binomial", data = iris.binary)
##
## Coefficients:
##  (Intercept)   Sepal.Width  Sepal.Length
##        437.2         137.9        -163.4
##
## Degrees of Freedom: 149 Total (i.e. Null);  147 Residual
## Null Deviance:       191
## Residual Deviance: 2.706e-08     AIC: 6

The output from this method provides some coefficients and intercepts that don't look totally right. You need one extra step to calculate the slope and intercept of the decision boundary this way. Recall for a moment how you used the sigmoid function g(z) = 1/(1 + e^(-z)).

z is a function with the following form:

z = θ0 + θ1x1 + θ2x2

Because you want a value between 0 and 1 for binary classification, the classification is 1 when the sigmoid g(z) ≥ 0.5. That's only true when the function z that you pass it is itself greater than or equal to 0:

0 ≤ θ0 + θ1x1 + θ2x2

You can rewrite this inequality and solve for the x2 value accordingly:

x2 ≥ −θ0/θ2 + (−θ1/θ2)x1

This equation is the same form as a y = b + mx line, where we can solve computationally for the slope and intercept to build out the function that determines the decision boundary:

slope = −θ1/θ2 = −137.9/(−163.4) ≈ 0.844
intercept = −θ0/θ2 = −437.2/(−163.4) ≈ 2.68

You can calculate this directly from the logistic model object:

slope.iris <- coef(iris.logistic)[2]/(-coef(iris.logistic)[3])
int.iris <- coef(iris.logistic)[1]/(-coef(iris.logistic)[3])

slope.iris

## Sepal.Width
##   0.8440957

int.iris

## (Intercept)
##    2.675511

You then can plot this over your data and see how the classes shake out, as illustrated in Figure 4-12:

iris.binary$binary[iris.binary$binary == 0] <- 2

plot(Sepal.Length ~ Sepal.Width, data = iris.binary, pch = (binary),
    main = "Iris Flower Sepal Length vs Sepal Width", xlab = "Sepal Width",
    ylab = "Sepal Length")

abline(a = int.iris, b = slope.iris)
Figure 4-12. Splitting a distribution of data into two classes

You now have an equation that helps separate the species of iris flowers. Any flower in the dataset whose sepal length falls below the line y = 2.68 + 0.844 × (Sepal.Width) is classified as setosa, and any flower above the line is classified as one of the other species.
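As a quick sanity check (an aside, not in the original text), you can compare the model's fitted probabilities against the true species labels; because the residual deviance of the fit above is essentially zero, the two groups are perfectly separated and the table should show no misclassifications:

# fitted probability of setosa versus the actual species
table(predict(iris.logistic, type = "response") > 0.5,
      iris$Species == "setosa")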

Multiclass Classification

If you want to find the splits in your data that define multiple classes and not just a binary classification, you need to use multiclass classification. This approach is slightly different in that you are basically applying the same binary classification scheme that you have been doing thus far, but you are comparing the class you’re interested in versus everything else.

Figure 4-13 illustrates what a multiclass classification exercise might look like:

multi <- data.frame(x1 = c(0.03, 0.24, 0.21, 0, 0, 0.23, 0.6,
    0.64, 0.86, 0.77), x2 = c(0.07, 0.06, 0.19, 1.15, 0.95, 1,
    0.81, 0.64, 0.44, 0.74), lab = c(1, 1, 1, 2, 2, 2, 3, 3,
    3, 3))

plot(x2 ~ x1, pch = lab, cex = 2, data = multi,
    main = "Multi-Class Classification",
    xlab = "x", ylab = "y")
Figure 4-13. Multiclass data can be separated only by using a one-versus-many approach

There are three distinct classes of data, and you want to find lines that split them into their own categories, much as you did in the binary case. What this essentially boils down to is the same binary test, repeated while changing which group you compare against. This is called a "one-versus-all" or "one-versus-many" test, in which you test three cases: circles-versus-rest, triangles-versus-rest, and crosses-versus-rest, as depicted in Figure 4-14:

par(mfrow = c(1, 3))
multi$lab2 <- c(1, 1, 1, 4, 4, 4, 4, 4, 4, 4)
plot(x2 ~ x1, pch = lab2, cex = 2, data = multi,
    main = "Multi-Class Classification",
    xlab = "x", ylab = "y")

multi$lab3 <- c(4, 4, 4, 2, 2, 2, 4, 4, 4, 4)
plot(x2 ~ x1, pch = lab3, cex = 2, data = multi,
    main = "Multi-Class Classification",
    xlab = "x", ylab = "y")

multi$lab4 <- c(4, 4, 4, 4, 4, 4, 3, 3, 3, 3)
plot(x2 ~ x1, pch = lab4, cex = 2, data = multi,
    main = "Multi-Class Classification",
    xlab = "x", ylab = "y")
Figure 4-14. One-versus-many classification utilizes the same approach as standard classification but reclassifies other groups into one single group for comparison

In a one-versus-many classification approach, you use one decision boundary to classify data for one type or class versus all the other types or classes. You then do the same for the rest of the types or classes in the data until you have a number of decision boundaries that you can use to typify your data accordingly. So, for the example in Figure 4-14, you’re computing (on the left plot) the circles versus the rest of the data, and then you compute the triangles versus the rest of the data (middle plot), and, finally, the crosses versus the rest of the data (right plot). By splitting a three-class problem into three, two-class problems, you can more easily find a single decision boundary for each plot and then combine those decision boundaries for a final model.

Here, you call on the multinom() function from the nnet library. You pass it a multinomial case that is basically the same as the simple binary case, but with three class labels instead of two. The same methodology applies to more than three categories:

library(nnet)
multi.model <- multinom(lab ~ x2 + x1, data = multi, trace = F)

Notice that you have two lines to separate the three categories:

multi.model

## Call:
## multinom(formula = lab ~ x2 + x1, data = multi, trace = F)
##
## Coefficients:
##   (Intercept)       x2        x1
## 2   -12.47452 28.50805 -17.97523
## 3   -19.82927 12.95949  33.39610
##
## Residual Deviance: 0.0004050319
## AIC: 12.00041

Again, however, you need to do the special calculation for the slopes and intercepts of the decision boundaries based on the output of this model. You can apply the same math as earlier, but you need to apply it to each of the equations from the model. Then, you can plot the decision boundary lines, as illustrated in Figure 4-15:

multi.int.1 <- -coef(multi.model)[1]/coef(multi.model)[3]
multi.slope.1 <- -coef(multi.model)[5]/coef(multi.model)[3]

multi.int.2 <- -coef(multi.model)[2]/coef(multi.model)[4]
multi.slope.2 <- -coef(multi.model)[6]/coef(multi.model)[4]


plot(x2 ~ x1, pch = lab, cex = 2, data = multi,
    main = "Multi-Class Classification",
    xlab = "x", ylab = "y")
abline(multi.int.1, multi.slope.1)
abline(multi.int.2, multi.slope.2)
Figure 4-15. Two lines separating three classes of data

Logistic Regression with Caret

The caret package makes logistic regression very easy for examples more complex than the ones we have worked through so far. Using caret is fairly straightforward, though for some machine learning methods, optimizations and tunings might be warranted to achieve the best possible results. Following is an example of how you can perform a logistic regression with caret:

library(caret)

data("GermanCredit")

Train <- createDataPartition(GermanCredit$Class, p = 0.6, list = FALSE)
training <- GermanCredit[Train, ]
testing <- GermanCredit[-Train, ]

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
    Housing.Own + CreditHistory.Critical, data = training, method = "glm",
    family = "binomial")


predictions <- predict(mod_fit, testing[, -10])
table(predictions, testing[, 10])


##
## predictions Bad Good
##        Bad    9    8
##        Good 111  272

This simple example uses data from the GermanCredit dataset and shows how you can build a confusion matrix from a caret training object. In this case, the fit doesn't seem great: most of the customers whose actual class is Bad are misclassified as Good. Although caret offers some great ways to tune whichever machine learning method you are interested in, it's also quite flexible at swapping methods. By simply editing the method option, you can specify one of the other 12 logistic regression algorithms to pass to the model, as shown here:

mod_fit <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
    Housing.Own + CreditHistory.Critical, data = training,
    method = "LogitBoost",
    family = "binomial")

predictions <- predict(mod_fit, testing[, -10])
table(predictions, testing[, 10])

##
## predictions Bad Good
##        Bad    7   15
##        Good 113  265
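For either model, caret's confusionMatrix() helper computes accuracy, sensitivity, and specificity directly from the same objects (a short sketch; the exact numbers depend on the random partition created above):

confusionMatrix(predictions, testing$Class)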

Summary

In this chapter, we looked at a couple of different ways to build basic models, from simple linear regression to logistic regression.

Linear Regression

Regression comes in two forms: standard linear regression, which you might have encountered early on in your mathematics classes, and logistic regression, which is very different. R can create a linear model with ease by using the lm() function. In tandem with R's formula operator, ~, you can build a simple y = mx + b regression equation by calling something like lm(y ~ x). From the resulting linear model object, you can extract a wealth of information regarding coefficients, statistical validity, and accuracy. You can do this by using the summary() function, which tells you how statistically valid each coefficient in your model is. Coefficients that aren't statistically useful can often be safely removed from the model.

Regression models can be more advanced by having more features. You can model behavior like fuel efficiency as a function of a vehicle’s weight, but you can also incorporate more things into your model, such as a vehicle’s transmission type, how many cylinders its engine might have, and so forth. Multivariate regression modeling in R follows the same practice as single-feature regression. The only difference is that there are more features listed in the summary() view of your model.

However, simply adding more features to a model doesn’t make it more accurate by default. You might need to employ techniques like regularization, which takes a dataset that has lots of features and reduces the impact of those that aren’t statistically as important as others. This can help you to simplify your model drastically and boost accuracy assessments, as well.

Sometimes, there might be nonlinear relationships in the data that require polynomial fits. A regular linear model is of the form y = b + m1x1 + m2x2 + (…), whereas a polynomial model might have the form y = b + m1x1^2 + m2x1 + m3x2^2 + m4x2 + (…). You can fit polynomial behavior to your models by passing a poly() function to the lm() function; for example, lm(y ~ poly(x, 2)). This creates a quadratic relationship in the data. It's important not to go too crazy with polynomial degrees, however, because you run the risk of fitting your data so tightly that any new data that comes in might have high errors.

Logistic Regression

In machine learning, there are standard regression techniques that estimate continuous values, such as numbers, and classification techniques that estimate discrete values, such as data types or labels. In many cases, the data has discrete values, like a flower's species. If you apply standard linear regression techniques to these datasets, you'll end up with misleading relationships in your data that are better handled by classification schemes.

Logistic regression is a classification method that finds a boundary separating the data into discrete classes. It does this by passing a weighted combination of the input features through a sigmoid function, which maps the result to a probability between 0 and 1. You can use that probability to determine how likely a given data point is to belong to a certain class.

You can also use logistic regression to draw decision boundaries in your data. A decision boundary is a line that doesn’t necessarily fit the data in a standard (x, y) plot, but fits gaps in the data to separate them into specific classes. In the case of data for which you have two classes, you would have one line that splits your data into class 1 and class 2.

In the case of multiple classes, you treat each class with a one-versus-many approach. If you have three classes, you focus on one, group the other two together, find the decision boundary that separates them, and then move on to the next class. At the end, you have a series of decision boundaries that separate the data into zones. Any data point in a particular zone is classified the same as the data points in that zone.
