Characteristics of Classification Trees - Flexibility of Classification Trees
A distinctive characteristic of classification trees is their flexibility. The ability of classification trees to examine the effects of the predictor variables one at a time, rather than all at once, has already been described, but there are a number of other ways in which classification trees are more flexible than traditional analyses. The ability to perform univariate splits, examining the effects of predictors one at a time, also has implications for the variety of types of predictors that can be analyzed. In the Breiman et al. (1984) heart attack example, blood pressure and age were continuous predictors, but presence of sinus tachycardia was a categorical (two-level) predictor. Even if sinus tachycardia were measured as a three-level categorical predictor (perhaps coded as 0 = not present; 1 = present; 2 = unknown or unsure), without any underlying continuous dimension represented by the values assigned to its levels, univariate splits on the predictor variables could still be easily performed. Additional decisions would simply be added to the decision tree to exploit any additional information on risk provided by the additional category. To summarize, classification trees can be computed for categorical predictors, continuous predictors, or any mix of the two types of predictors when univariate splits are used.
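As a rough illustration of mixing predictor types, the sketch below one-hot encodes a categorical predictor so that it can sit alongside continuous predictors in a single tree. It uses scikit-learn rather than the Classification Trees module, and the tiny heart-attack-style data set and variable names are invented for demonstration only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: two continuous predictors and one categorical predictor.
data = pd.DataFrame({
    "systolic_bp":       [95, 120, 180, 150, 88, 135, 160, 110],
    "age":               [70, 45, 63, 58, 75, 50, 67, 40],
    "sinus_tachycardia": ["present", "not_present", "present", "unknown",
                          "present", "not_present", "unknown", "not_present"],
    "high_risk":         [1, 0, 1, 1, 1, 0, 1, 0],
})

# One-hot encode the categorical predictor; each resulting indicator column
# then supports its own univariate split alongside the continuous predictors.
X = pd.get_dummies(data.drop(columns="high_risk"), columns=["sinus_tachycardia"])
y = data["high_risk"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.predict(X))
```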
Traditional linear discriminant analysis requires that the predictor variables be measured on at least an interval scale. For classification trees based on univariate splits for ordinal-scale predictor variables, it is interesting that any monotonic transformation of the predictor variables (i.e., any transformation that preserves the order of values on the variable) will produce splits yielding the same predicted classes for the cases or objects (if the C & RT-style univariate split selection method is used; see Breiman et al., 1984). Therefore, classification trees based on univariate splits can be computed without concern for whether a unit change on a continuous predictor represents a unit change on the dimension underlying the values on the predictor variable; it need only be assumed that predictors are measured on at least an ordinal scale. In short, assumptions regarding the level of measurement of predictor variables are less stringent.
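This invariance is easy to verify empirically. The sketch below, which uses scikit-learn's CART-style trees as a stand-in and synthetic data invented for the purpose, fits one tree to a positive-valued predictor and another to its logarithm; the split thresholds differ, but the induced partitions, and therefore the predicted classes, are identical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=(200, 1))                      # positive continuous predictor
y = (x.ravel() + rng.normal(0, 10, 200) > 40).astype(int)   # noisy class labels

# Fit CART-style trees to the raw predictor and to a monotonic transform of it.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(x, y)
tree_log = DecisionTreeClassifier(max_depth=3, random_state=0).fit(np.log(x), y)

# The thresholds differ, but the predicted classes are the same for every case.
assert np.array_equal(tree_raw.predict(x), tree_log.predict(np.log(x)))
print("Predictions are identical under the log transform.")
```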
Classification trees are not limited to univariate splits on the predictor variables. When continuous predictors are indeed measured on at least an interval scale, linear combination splits, similar to the splits for linear discriminant analysis, can be computed for classification trees. However, the linear combination splits computed in the Classification Trees module differ in important ways from those computed in the Discriminant Analysis module. In linear discriminant analysis the number of linear discriminant functions that can be extracted is the lesser of the number of predictor variables or the number of classes on the dependent variable minus one. The recursive approach implemented in the Classification Trees module does not face this limitation. For example, dozens of recursive, linear combination splits could potentially be performed when there are dozens of predictor variables but only two classes on the dependent variable. This compares with the single linear combination split that could be performed using traditional, non-recursive linear discriminant analysis, which could leave a substantial amount of the information in the predictor variables unused.
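To make the limit concrete: with ten predictors and a two-class dependent variable, traditional linear discriminant analysis can extract only min(10, 2 - 1) = 1 discriminant function. The snippet below illustrates the cap with scikit-learn's LinearDiscriminantAnalysis and synthetic data; it is a stand-in for, not the behavior of, the Discriminant Analysis module.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                 # ten predictor variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # two classes on the dependent variable

lda = LinearDiscriminantAnalysis().fit(X, y)
# The number of discriminant functions is min(n_predictors, n_classes - 1).
print(lda.transform(X).shape[1])               # prints 1, regardless of the ten predictors
```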
Now consider the situation in which there are many categories but few predictors. Suppose you are trying to sort coins into classes (perhaps pennies, nickels, dimes, and quarters) based only on thickness and diameter measurements. Using traditional linear discriminant analysis, at most two linear discriminant functions could be extracted, and the coins could be successfully sorted only if there were no more than two dimensions, represented by linear combinations of thickness and diameter, on which the coins differed. Again, the approach implemented in the Classification Trees module does not face a limitation on the number of linear combination splits that can be formed.
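A rough sketch of the recursive idea follows: at each node a one-dimensional linear discriminant projection is fitted and the node is split on the resulting weighted composite, then the procedure recurses into each child, so the number of linear combination splits is limited only by the depth of the tree. This is only an illustration of how recursion sidesteps the single-function limit, written with scikit-learn and made-up coin-like data; it is not the algorithm the Classification Trees module actually uses.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def lda_split_tree(X, y, depth=0, max_depth=4, min_node_size=10):
    """Recursively split nodes on a linear combination of the predictors."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if depth == max_depth or len(classes) == 1 or len(y) < min_node_size:
        return {"class": majority}

    # One linear combination split per node; recursion supplies as many more as needed.
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
    score = lda.transform(X).ravel()           # the weighted composite
    threshold = np.median(score)
    left = score <= threshold
    if left.all() or (~left).all():
        return {"class": majority}
    return {
        "lda": lda, "threshold": threshold, "class": majority,
        "left":  lda_split_tree(X[left],  y[left],  depth + 1, max_depth, min_node_size),
        "right": lda_split_tree(X[~left], y[~left], depth + 1, max_depth, min_node_size),
    }

# Hypothetical coin-like data: thickness and diameter for four coin classes.
rng = np.random.default_rng(0)
centers = np.array([[1.5, 19.0], [1.9, 21.2], [1.3, 17.9], [1.7, 24.3]])
X = np.vstack([c + rng.normal(scale=0.2, size=(50, 2)) for c in centers])
y = np.repeat(np.arange(4), 50)
tree = lda_split_tree(X, y)
```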
The approach implemented in the Classification Trees module for linear combination splits can also be used as the analysis method for constructing classification trees using univariate splits. Actually, a univariate split is just a special case of a linear combination split. Imagine a linear combination split in which the coefficients for creating the weighted composite were zero for all predictor variables except one. Since scores on the weighted composite would depend only on the scores on the one predictor variable with the nonzero coefficient, the resulting split would be a univariate split.
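A tiny numeric check of this special case (a NumPy sketch with made-up coefficients and data): a weighted composite whose coefficients are zero for every predictor except the third splits the cases exactly as a threshold on the third predictor alone would.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                # four predictor variables
coeffs = np.array([0.0, 0.0, 1.0, 0.0])      # nonzero coefficient on predictor 3 only

composite = X @ coeffs                       # the weighted composite
threshold = 0.25                             # an arbitrary split point

# Splitting on the composite is identical to a univariate split on predictor 3.
assert np.array_equal(composite > threshold, X[:, 2] > threshold)
```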
The approach implemented in the Classification Trees module for the Discriminant-based univariate split selection method for categorical and ordered predictors and for the Discriminant-based linear combination split selection method for ordered predictors is an adaptation of the algorithms used in QUEST (Quick, Unbiased, Efficient Statistical Trees). QUEST is a classification tree program developed by Loh and Shih (1997) that employs a modification of recursive quadratic discriminant analysis and includes a number of innovative features for improving the reliability and efficiency of the classification trees that it computes.
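Loh and Shih's algorithm is described in the references given in Notes on Computational Algorithms, but its unbiased variable selection step can be loosely paraphrased: each predictor is scored with a significance test of its association with the class variable (roughly, an ANOVA-type test for ordered predictors and a chi-square test for categorical ones), and the most significant predictor is chosen before any split point is searched. The sketch below is only that paraphrase, written with SciPy on hypothetical inputs; it is not the QUEST algorithm itself.

```python
import numpy as np
from scipy.stats import f_oneway, chi2_contingency

def select_split_variable(ordered_predictors, categorical_predictors, y):
    """Pick the predictor most strongly associated with the class variable.

    ordered_predictors / categorical_predictors: dicts mapping a predictor
    name to a NumPy array of values. Ordered predictors are scored with a
    one-way ANOVA F-test across classes; categorical predictors with a
    chi-square test of independence. The smallest p-value wins. (A loose
    paraphrase of QUEST-style unbiased selection, not the actual algorithm.)
    """
    classes = np.unique(y)
    p_values = {}
    for name, x in ordered_predictors.items():
        groups = [x[y == c] for c in classes]
        p_values[name] = f_oneway(*groups).pvalue
    for name, x in categorical_predictors.items():
        table = np.array([[np.sum((x == level) & (y == c)) for c in classes]
                          for level in np.unique(x)])
        p_values[name] = chi2_contingency(table)[1]   # p-value of the chi-square test
    return min(p_values, key=p_values.get)
```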
The algorithms used in QUEST are fairly technical (for references to descriptions of these algorithms, see Notes on Computational Algorithms), but the Classification Trees module also offers a split selection method option based on a conceptually simpler approach. The C & RT-style univariate split selection method is an adaptation of the algorithms used in C & RT, as described by Breiman et al. (1984). C & RT (Classification and Regression Trees) is a classification tree program that uses an exhaustive grid search of all possible univariate splits to find the splits for a classification tree.
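To give a feel for what an exhaustive search involves, the sketch below enumerates every candidate univariate split on every predictor and scores each by the reduction in Gini impurity, keeping the best. It is a bare-bones NumPy illustration of the search strategy, not C & RT itself, and the function names are invented.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_univariate_split(X, y):
    """Exhaustively try every predictor and every candidate threshold."""
    parent = gini(y)
    best = (None, None, 0.0)              # (predictor index, threshold, improvement)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive observed values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            improvement = (parent
                           - left.mean() * gini(y[left])
                           - (~left).mean() * gini(y[~left]))
            if improvement > best[2]:
                best = (j, t, improvement)
    return best
```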
The QUEST and C & RT analysis options complement each other nicely. C & RT searches can be lengthy when there are a large number of predictor variables with many levels, and C & RT is biased toward choosing predictor variables with more levels for splits, but because it employs an exhaustive search it is guaranteed to find the splits producing the best classification (in the learning sample, though not necessarily in cross-validation samples).
QUEST is fast and unbiased. The speed advantage of QUEST over C & RT is particularly dramatic when the predictor variables have dozens of levels (Loh & Shih, 1997, report an analysis that QUEST completed in 1 CPU second but that took C & RT 30.5 CPU hours). QUEST's lack of bias in variable selection for splits is also a distinct advantage when some predictor variables have few levels and other predictor variables have many levels (predictors with many levels are more likely to produce fluke theories, which fit the data well but have low predictive accuracy; see Doyle, 1973, and Quinlan & Cameron-Jones, 1995). Finally, QUEST does not sacrifice predictive accuracy for speed (Lim, Loh, & Shih, 1997). Together, the QUEST and C & RT options enable you to fully exploit the flexibility of classification trees.