Decision trees were introduced in Chapter 7 to show you how to build a model of user behavior from server logs. Decision trees are notable for being extremely easy to understand and interpret. An example of a decision tree is shown in Figure 12-1.
Figure 12-1. Example decision tree
It should be clear from the figure what a decision tree does when faced with the task of classifying a new item. Beginning at the node at the top of the tree, it checks the item against the node's criteria—if the item matches the criteria, it follows the Yes branch; otherwise, it follows the No branch. This process is repeated until an endpoint is reached, which is the predicted category.
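The walk just described can be sketched in a few lines of Python. The node layout, attribute handling, and the tiny sample tree here are illustrative assumptions, not the book's actual implementation:

```python
# A minimal sketch of classifying with a decision tree.
# Node fields and the sample tree below are hypothetical.

class Node:
    def __init__(self, col=None, value=None, results=None,
                 true_branch=None, false_branch=None):
        self.col = col                  # column index tested at this node
        self.value = value              # criterion: a match follows the Yes branch
        self.results = results          # category at an endpoint (leaf), else None
        self.true_branch = true_branch  # Yes branch
        self.false_branch = false_branch  # No branch

def classify(item, node):
    """Walk from the top node to an endpoint, following Yes/No branches."""
    if node.results is not None:        # reached an endpoint: predicted category
        return node.results
    v = item[node.col]
    if isinstance(v, (int, float)):
        matches = v >= node.value       # numeric criterion
    else:
        matches = v == node.value       # categorical criterion
    branch = node.true_branch if matches else node.false_branch
    return classify(item, branch)

# A tiny hypothetical tree: diameter >= 4 predicts "apple", else "cherry"
tree = Node(col=0, value=4,
            true_branch=Node(results="apple"),
            false_branch=Node(results="cherry"))

print(classify([6], tree))   # item matches the criterion, so follows the Yes branch
```

The recursion mirrors the figure directly: each call checks one node's criterion and descends one level, stopping when an endpoint supplies the category.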
Classifying with a decision tree is quite simple; training one is trickier. The algorithm described in Chapter 7 built the tree from the top down, choosing at each step the attribute that best divided the data. To illustrate this, consider the fruit dataset shown in Table 12-3. This will be referred to as the original set.
Table 12-3. Fruit data
There are two variables on which this data can be divided to create the top node of the tree: Diameter or Color. The first step is to try each of them and decide which one divides the data best. Dividing the set on Color gives ...
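Dividing a dataset on a single variable can be sketched as follows. The function and the toy fruit rows are illustrative assumptions; Table 12-3's actual values may differ:

```python
# A minimal sketch of the divide step: split rows into two sets
# based on one column's value. The helper name and the sample
# rows are hypothetical.

def divide_set(rows, column, value):
    """Split rows into (matching, non-matching) on one criterion."""
    if isinstance(value, (int, float)):
        matches = lambda row: row[column] >= value   # numeric test
    else:
        matches = lambda row: row[column] == value   # categorical test
    set1 = [row for row in rows if matches(row)]
    set2 = [row for row in rows if not matches(row)]
    return set1, set2

# Toy fruit rows in the spirit of Table 12-3: [diameter, color, fruit]
fruit = [
    [4, 'Red',   'Apple'],
    [4, 'Green', 'Apple'],
    [1, 'Red',   'Cherry'],
    [1, 'Green', 'Grape'],
    [5, 'Red',   'Apple'],
]

reds, not_reds = divide_set(fruit, 1, 'Red')
```

Trying each candidate variable means calling this split once per variable and scoring how cleanly the two resulting sets separate the categories; the variable with the best score becomes the top node.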