Regression Trees

In a regression tree the response variable is a continuous measurement, but the explanatory variables can be any mix of continuous and categorical variables. You can think of regression trees as analogous to multiple regression models. The difference is that a regression tree works by forward selection of variables, whereas we have been used to carrying out regression analysis by deletion (backward selection).
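The forward-selection idea can be sketched in base R: try every candidate threshold on every explanatory variable, and keep the split that leaves the smallest total sum of squares within the two resulting groups. This is only a minimal illustration on made-up data, not the fitting code of any particular package; the names node_deviance, best_split and toy are hypothetical.

```r
# Sum of squares about the mean: the deviance of a node holding values y
node_deviance <- function(y) sum((y - mean(y))^2)

# Hypothetical helper: best binary split of y on a numeric variable x
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between adjacent distinct values of x
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  # Total deviance of the two child nodes for each candidate threshold
  dev <- sapply(cuts, function(cut) {
    node_deviance(y[x < cut]) + node_deviance(y[x >= cut])
  })
  list(threshold = cuts[which.min(dev)], deviance = min(dev))
}

# Made-up data: y jumps when x1 passes 5; x2 is unrelated to y
toy <- data.frame(
  x1 = 1:10,
  x2 = c(0.3, 0.9, 0.1, 0.7, 0.5, 0.2, 0.8, 0.4, 0.6, 1.0),
  y  = c(10, 11, 9, 10, 10, 20, 21, 19, 20, 20)
)

# Evaluate every explanatory variable and keep the one with lowest deviance
splits <- lapply(toy[c("x1", "x2")], best_split, y = toy$y)
chosen <- names(which.min(sapply(splits, function(s) s$deviance)))
print(chosen)
print(splits[[chosen]]$threshold)
```

On this toy data the search settles on x1 with a threshold midway between 5 and 6. Repeating the same search inside each of the two resulting groups is what grows the tree one variable at a time.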

For our air pollution example, the regression tree is fitted by stating that the continuous response variable Pollution is to be estimated as a function of all of the explanatory variables in the dataframe called Pollute by use of the ‘tilde dot’ notation like this:

library(tree)
model <- tree(Pollution ~ ., Pollute)
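To see what the dot on the right-hand side of the formula stands for, base R's terms() can expand it against a dataframe. The sketch below uses a small stand-in dataframe, Pollute_demo, whose name and values are invented for illustration; only the column names matter.

```r
# Hypothetical stand-in for the Pollute dataframe (made-up values)
Pollute_demo <- data.frame(
  Pollution  = c(10, 13, 12),
  Temp       = c(70.3, 61.0, 56.7),
  Industry   = c(213, 91, 453),
  Population = c(582, 132, 716)
)

# terms() expands the dot into every column except the response
expanded <- attr(terms(Pollution ~ ., data = Pollute_demo), "term.labels")
print(expanded)
```

Every column other than Pollution appears as an explanatory variable, so nothing has to be listed by hand when the dataframe is wide.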

For a regression tree, the print method produces the following kind of output:

print(model)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 41 22040 30.05
   2) Industry < 748 36 11260 24.92
     4) Population < 190 7 4096 43.43 *
     5) Population > 190 29 4187 20.45
      10) Wet.days < 108 11 96 12.00 *
      11) Wet.days > 108 18 2826 25.61
        22) Temp < 59.35 13 1895 29.69
          44) Wind < 9.65 8 1213 33.88 *
          45) Wind > 9.65 5 318 23.00 *
        23) Temp > 59.35 5 152 15.00 *
   3) Industry > 748 5 3002 67.00 *

The terminal nodes (the leaves) are denoted by * (there are six of them). The node number is on the left, labelled by the variable on which the split at that node was made. Next comes the ‘split criterion’ which shows the threshold value of the variable that was used to create the split. ...
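The n, deviance and yval columns can be checked against one another with a little arithmetic. The sketch below uses only numbers copied from the printout above and assumes, as is usual for regression trees, that yval is the mean response in a node and deviance is the sum of squares about that mean.

```r
# Values copied from the printout for the root and its two children
root  <- c(n = 41, deviance = 22040, yval = 30.05)
left  <- c(n = 36, deviance = 11260, yval = 24.92)  # Industry < 748
right <- c(n = 5,  deviance = 3002,  yval = 67.00)  # Industry > 748

# The child node sizes add up to the parent node size
n_total <- left[["n"]] + right[["n"]]

# The size-weighted mean of the child yvals recovers the parent yval,
# up to the rounding used in the printout
parent_mean <- (left[["n"]] * left[["yval"]] +
                right[["n"]] * right[["yval"]]) / n_total

# Deviance accounted for by the first split
reduction <- root[["deviance"]] - (left[["deviance"]] + right[["deviance"]])

print(c(n_total = n_total,
        parent_mean = round(parent_mean, 2),
        reduction = reduction))
```

The same bookkeeping applies at every internal node: the two child rows beneath it partition its cases, and the drop in total deviance measures how much that split explains.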
