Classification trees with categorical explanatory variables

Tree models are a superb tool for helping to write efficient and effective taxonomic keys.

Suppose that all of our explanatory variables are categorical, and that we want to use tree models to write a dichotomous key. There is only one entry for each species, so we want the twigs of the tree to be the individual rows of the dataframe (i.e. we want to fit a tree perfectly to the data). To do this we need to specify two extra arguments: minsize = 2 and mindev = 0. In practice, it is better to specify a very small value for the minimum deviance (say, 1e-6, i.e. 10^-6) rather than zero (see below).
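As a minimal sketch of these two arguments in use, the lines below fit a tree to a made-up four-taxon dataframe. Two things here are assumptions rather than part of the text: the fitting function is taken to be tree() from the tree package, and the dataframe toy, with its taxa and characters, is purely hypothetical. The call is guarded so that the snippet still runs where the tree package is not installed.

```r
# Hypothetical toy data (not the Epilobium data): four invented taxa,
# each with a unique combination of two categorical characters.
toy <- data.frame(
  taxon  = factor(c("A", "B", "C", "D")),
  leaf   = factor(c("entire", "entire", "toothed", "toothed")),
  flower = factor(c("white", "pink", "white", "pink"))
)

# Assumption: the 'tree' package supplies the fitting function.
if (requireNamespace("tree", quietly = TRUE)) {
  # minsize = 2 lets two-row nodes be split into single rows;
  # mindev = 1e-6 is the "very small value" used in place of zero.
  key <- tree::tree(taxon ~ ., data = toy, minsize = 2, mindev = 1e-6)
  print(key)
}
```

With the default control settings a dataframe this small would not be split at all, which is why both arguments have to be relaxed before every row can end up on its own twig.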

The following example relates to the nine lowland British species in the genus Epilobium (Onagraceae). We have eight categorical explanatory variables and we want to find the optimal dichotomous key. The dataframe looks like this:

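# Read the table of species characters; the first row holds the column names
epilobium <- read.table("c:\\temp\\epilobium.txt", header = TRUE)
# Make the columns accessible by name
attach(epilobium)
# Print the whole dataframe
epilobium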

      species  stigma stem.hairs glandular.hairs     seeds pappilose
1    hirsutum   lobed  spreading          absent      none   uniform
2 parviflorum   lobed  spreading          absent      none   uniform
3    montanum   lobed  spreading         present      none   uniform
4 lanceolatum   lobed  spreading         present      none   uniform
5  tetragonum clavate  appressed         present      none   uniform
6    obscurum clavate  appressed         present      none   uniform
7      roseum clavate  spreading         present      none   uniform
8    palustre clavate  spreading         present appendage   uniform
9    ciliatum clavate  spreading         present appendage    ridged
  stolons petals base
1 ...
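A perfect key is possible only if every species has a unique combination of character states. The sketch below checks this in base R for the five characters visible in the printout above. The dataframe visible is typed in by hand from that printout and its name is an assumption; the remaining columns (stolons, petals, base) are truncated in the printout and so are left out rather than guessed.

```r
# Rows transcribed from the visible part of the printout.
# The truncated columns (stolons, petals, base) are deliberately omitted.
visible <- data.frame(
  species = c("hirsutum", "parviflorum", "montanum", "lanceolatum",
              "tetragonum", "obscurum", "roseum", "palustre", "ciliatum"),
  stigma = c(rep("lobed", 4), rep("clavate", 5)),
  stem.hairs = c(rep("spreading", 4), rep("appressed", 2), rep("spreading", 3)),
  glandular.hairs = c(rep("absent", 2), rep("present", 7)),
  seeds = c(rep("none", 7), rep("appendage", 2)),
  pappilose = c(rep("uniform", 8), "ridged"),
  stringsAsFactors = TRUE
)

# Collapse each row's character states into a single string.
states <- do.call(paste, c(visible[-1], sep = "/"))

# Group the species that share an identical combination of states.
groups <- split(as.character(visible$species), states)

# Groups with more than one member cannot be separated by these five characters.
unresolved <- groups[lengths(groups) > 1]
print(unresolved)
```

On these five characters alone, three pairs of species remain identical (hirsutum with parviflorum, montanum with lanceolatum, tetragonum with obscurum), so any complete key has to rely on the remaining characters to separate them.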
