Classification Trees Introductory Overview - Comparisons with Other Classification Tree Programs

A variety of classification tree programs have been developed to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. The Classification Trees module is a full-featured implementation of the QUEST (Loh & Shih, 1997) and C & RT (Breiman et al., 1984) programs for computing binary classification trees based on univariate splits for categorical predictor variables, ordered predictor variables (measured on at least an ordinal scale), or a mix of both types of predictors. It also has options for computing classification trees based on linear combination splits for interval scale predictor variables.
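To make the basic idea concrete, the following minimal sketch grows a binary classification tree with univariate splits using scikit-learn's CART-style learner; this is an illustrative stand-in, not the Classification Trees module's own interface:

    # Minimal sketch: a binary classification tree with univariate splits.
    # scikit-learn's CART-style learner stands in for C & RT here.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()

    # Each internal node tests one predictor against a threshold (a
    # univariate split), so every split produces exactly two child nodes.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(data.data, data.target)
    print(export_text(tree, feature_names=data.feature_names))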

Some classification tree programs, such as FACT (Loh & Vanichsetakul, 1988) and THAID (Morgan & Messenger, 1973; see also the related programs AID, for Automatic Interaction Detection, Morgan & Sonquist, 1963, and CHAID, for Chi-square Automatic Interaction Detection, Kass, 1980), perform multi-level splits rather than binary splits when computing classification trees. A multi-level split divides a parent node into more than two child nodes, while a binary split always produces exactly two child nodes, regardless of the number of levels of the splitting variable or the number of classes on the dependent variable. It should be noted, however, that there is no inherent advantage to multi-level splits, because any multi-level split can be represented as a series of binary splits, and there may be disadvantages. In some programs that perform multi-level splits, predictor variables can be used for splitting only once, so the resulting classification trees may be unrealistically short and uninteresting (Loh & Shih, 1997). A more serious problem is bias in variable selection for splits: the tendency to select variables with more levels for splits, which can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et al., 1984). This bias can arise in any program, such as THAID (Morgan & Messenger, 1973), that employs an exhaustive search for finding splits (for a discussion, see Loh & Shih, 1997). See also Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees.
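The bias is easy to demonstrate by simulation. In the hedged sketch below (again using scikit-learn's exhaustive-search CART learner as a stand-in, not STATISTICA output), both predictors are pure noise, yet the predictor with more distinct values is chosen for the root split far more often, simply because it offers more candidate cut points:

    # Illustrative simulation of variable-selection bias under exhaustive search.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    picks = {0: 0, 1: 0}
    for _ in range(500):
        n = 200
        x_two = rng.integers(0, 2, n)       # predictor with 2 levels
        x_fifty = rng.integers(0, 50, n)    # predictor with 50 levels
        X = np.column_stack([x_two, x_fifty])
        y = rng.integers(0, 2, n)           # outcome is pure noise
        stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
        root = stump.tree_.feature[0]
        if root >= 0:                       # -2 would mean no split was made
            picks[root] += 1

    print(picks)  # the 50-level predictor wins the root split far more often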

Bias in variable selection can be avoided by using the discriminant-based split options in the Classification Trees module. These options make use of the algorithms in QUEST (Loh & Shih, 1997) to prevent bias in variable selection. The C & RT-style exhaustive search for univariate splits option is included in the Classification Trees module for use when the goal is to find splits that produce the best possible classification in the learning sample (but not necessarily in independent cross-validation samples). For reliable splits, as well as computational speed, the discriminant-based split options are recommended.
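The key idea behind QUEST's unbiased selection can be sketched as follows: the split variable is chosen first, with a significance test applied uniformly to every predictor, and only then is a split point sought on the winning variable. The sketch below is a simplified illustration of that idea, not the full QUEST algorithm (which, among other refinements, adjusts the tests and uses discriminant analysis to locate the split point); all names here are illustrative:

    # Hedged sketch of QUEST-style variable selection (after Loh & Shih, 1997).
    import numpy as np
    from scipy import stats

    def select_split_variable(X_ordered, X_categorical, y):
        """Return ('ordered' or 'categorical', column index) for the
        predictor whose association with y gives the smallest p-value."""
        classes = np.unique(y)
        best_kind, best_col, best_p = "none", -1, 1.0
        # Ordered predictors: one-way ANOVA F-test across the classes of y.
        for j in range(X_ordered.shape[1]):
            groups = [X_ordered[y == c, j] for c in classes]
            p = stats.f_oneway(*groups).pvalue
            if p < best_p:
                best_kind, best_col, best_p = "ordered", j, p
        # Categorical predictors: chi-square test on the level-by-class table.
        for j in range(X_categorical.shape[1]):
            levels = np.unique(X_categorical[:, j])
            table = np.array([[np.sum((X_categorical[:, j] == lv) & (y == c))
                               for c in classes] for lv in levels])
            chi2, p, dof, _ = stats.chi2_contingency(table)
            if p < best_p:
                best_kind, best_col, best_p = "categorical", j, p
        return best_kind, best_col

    # Example: a weakly informative ordered predictor easily beats a
    # pure-noise 50-level factor, since both face the same p-value scale.
    rng = np.random.default_rng(1)
    y = rng.integers(0, 2, 300)
    X_ord = np.column_stack([y + rng.normal(0.0, 1.0, 300)])  # weak signal
    X_cat = rng.integers(0, 50, (300, 1))                     # pure noise
    print(select_split_variable(X_ord, X_cat, y))             # -> ('ordered', 0)

Because every predictor is judged on the same p-value scale, a predictor with fifty levels gets no head start over a binary one.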

For information on techniques and issues in computing classification trees, see Computational Methods.