Computational Methods - Selecting Splits

The second basic step in classification tree analysis is to select the splits on the predictor variables that are used to predict membership in the classes of the dependent variable for the cases or objects in the analysis. Not surprisingly, given the hierarchical nature of classification trees, these splits are selected one at a time, starting with the split at the root node and continuing with splits of the resulting child nodes until splitting stops, at which point the child nodes that have not been split become terminal nodes. Three split selection methods are available as options in the Classification Trees module. The first of these is the Discriminant-based univariate split option for categorical or ordered predictor variables.

Discriminant-based univariate splits

The first step in split selection when the Discriminant-based univariate splits option is chosen is to determine the best terminal node to split in the current tree, and which predictor variable to use to perform the split. For each terminal node, p-values are computed for tests of the significance of the relationship of class membership with the levels of each predictor variable. For categorical predictors, the p-values are computed for Chi-square tests of independence of the classes and the levels of the categorical predictor that are present at the node. For ordered predictors, the p-values are computed for ANOVAs of the relationship of the classes to the values of the ordered predictor that are present at the node. If the smallest computed p-value is smaller than the default Bonferroni-adjusted p-value for multiple comparisons of .05, or a different threshold value if specified by the user, the predictor variable producing that smallest p-value is chosen to split the corresponding node. If no p-value smaller than the threshold p-value is found, p-values are computed for statistical tests that are robust to distributional violations, such as Levene's F. Details concerning node and predictor variable selection when no p-value is smaller than the specified threshold are described in Loh and Shih (1997).
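
To make this selection step concrete, the sketch below scores each predictor at a node with the corresponding test (a Chi-square test of independence for categorical predictors, a one-way ANOVA for ordered predictors) and applies a Bonferroni adjustment before accepting the best predictor. It is only an illustration of the idea using standard SciPy tests, not the module's implementation, and the names used here (select_predictor, is_categorical) are made up for the example.

    import numpy as np
    from scipy.stats import chi2_contingency, f_oneway

    def select_predictor(y, predictors, is_categorical, alpha=0.05):
        """y: class labels at the node; predictors: dict of name -> values at the node."""
        pvalues = {}
        for name, x in predictors.items():
            if is_categorical[name]:
                # Chi-square test of independence of the classes and the predictor levels
                table = np.array([[np.sum((y == c) & (x == l)) for l in np.unique(x)]
                                  for c in np.unique(y)])
                pvalues[name] = chi2_contingency(table)[1]
            else:
                # One-way ANOVA of the ordered predictor's values across the classes
                pvalues[name] = f_oneway(*[x[y == c] for c in np.unique(y)]).pvalue
        best = min(pvalues, key=pvalues.get)
        # Bonferroni adjustment for the number of predictors tested at this node
        if pvalues[best] < alpha / len(pvalues):
            return best
        return None  # fall back to the robust tests mentioned above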

The next step is to determine the split. For ordered predictors, the 2-means clustering algorithm of Hartigan and Wong (1979, see also Cluster Analysis) is applied to create two "superclasses" for the node. The two roots are found for a quadratic equation describing the difference in the means of the "superclasses" on the ordered predictor, and the values for a split corresponding to each root are computed. The split closest to a "superclass" mean is selected. For categorical predictors, dummy-coded variables representing the levels of the categorical predictor are constructed, and then singular value decomposition methods are applied to transform the dummy-coded variables into a set of non-redundant ordered predictors. The procedures for ordered predictors are then applied, and the obtained split is "mapped back" onto the original levels of the categorical variable and represented as a contrast between two sets of levels of the categorical variable. Again, further details about these procedures are described in Loh and Shih (1997). Although complicated, these procedures reduce a bias in split selection that occurs when using the C&RT-style exhaustive search method for selecting splits: a bias toward selecting variables with more levels for splits, which can skew the interpretation of the relative importance of the predictors in explaining responses on the dependent variable (Breiman et al., 1984). See also Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees.
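
The following sketch shows one way the split point for an ordered predictor can be computed along these lines: the classes present at the node are grouped into two "superclasses" by 2-means clustering of their class means, and the split point is taken as the root, closest to a superclass mean, of the quadratic obtained by equating two normal densities fitted to the superclasses. This is a simplified reading of the procedure in Loh and Shih (1997), not the module's exact algorithm; in particular, sklearn's KMeans stands in for the Hartigan-Wong routine, and the normality assumption and the fallbacks are simplifications.

    import numpy as np
    from sklearn.cluster import KMeans   # stands in here for Hartigan-Wong 2-means

    def split_point(x, y):
        """x: ordered predictor values at the node; y: class labels (both NumPy arrays)."""
        classes = np.unique(y)
        class_means = np.array([x[y == c].mean() for c in classes]).reshape(-1, 1)
        # Group the classes into two "superclasses" by 2-means clustering of their means
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(class_means)
        to_super = dict(zip(classes, labels))
        g = np.array([to_super[c] for c in y])
        m = [x[g == k].mean() for k in (0, 1)]
        s = [x[g == k].std(ddof=1) for k in (0, 1)]
        p = [np.mean(g == k) for k in (0, 1)]
        # Quadratic obtained by equating the two weighted normal densities
        a = 1 / s[0] ** 2 - 1 / s[1] ** 2
        b = -2 * (m[0] / s[0] ** 2 - m[1] / s[1] ** 2)
        c = (m[0] ** 2 / s[0] ** 2 - m[1] ** 2 / s[1] ** 2
             - 2 * np.log(p[0] * s[1] / (p[1] * s[0])))
        if abs(a) < 1e-12:                     # (near-)equal variances: the equation is linear
            return -c / b
        roots = np.roots([a, b, c])
        roots = roots[np.isreal(roots)].real
        if roots.size == 0:                    # no real root: fall back to the midpoint of the means
            return (m[0] + m[1]) / 2
        # Keep the root closest to either "superclass" mean
        return roots[np.argmin([min(abs(r - m[0]), abs(r - m[1])) for r in roots])]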

Discriminant-based linear combination splits

The second split selection method available as an option in the module is the Discriminant-based linear combination split option for ordered predictor variables (the predictors in this case, however, are assumed to be measured on at least interval scales). Perhaps surprisingly, this method treats the continuous predictors from which linear combinations are formed in a manner similar to the way categorical predictors are treated in the previous method. Singular value decomposition methods are used to transform the continuous predictors into a new set of non-redundant predictors. The procedures for creating "superclasses" and finding the split closest to a "superclass" mean are then applied, and the results are "mapped back" onto the original continuous predictors and represented as a univariate split on a linear combination of predictor variables.
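
A small sketch of the "map back" step may help: after the ordered predictors are transformed with a singular value decomposition into non-redundant components, any univariate split on a combination of those components corresponds to a single linear combination split on the original (centered) predictors. The direction w and the cut point below are arbitrary placeholders; finding them via the superclass procedure is not shown.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))              # ordered predictors at the node
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    Z = (X - mean) @ Vt.T / S                  # non-redundant, rescaled components

    w = np.array([0.8, -0.5, 0.2])             # split direction found in the transformed space
    cut = 0.01                                 # split point found in the transformed space

    # Map the split back onto the original predictors as one linear combination
    coef = (Vt.T / S) @ w                      # coefficients on the centered predictors
    assert np.allclose(Z @ w, (X - mean) @ coef)
    go_left = (X - mean) @ coef <= cut         # the same cases go left either way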

C&RT-style exhaustive search for univariate splits

The third split selection option is the C&RT-style exhaustive search for univariate splits option for categorical or ordered predictor variables. With this option, all possible splits for each predictor variable at each node are examined to find the split producing the largest improvement in goodness of fit (or equivalently, the largest reduction in lack of fit). What determines the domain of possible splits at a node? For categorical predictor variables with k levels present at a node, there are 2^(k-1) - 1 possible contrasts between two sets of levels of the predictor. For ordered predictors with k distinct levels present at a node, there are k - 1 midpoints between distinct levels. Thus the number of possible splits to examine can become very large when many predictors with many levels must be evaluated at many nodes.
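
As a rough illustration of these counts (not of the search itself), the snippet below enumerates the 2^(k-1) - 1 two-set contrasts for a categorical predictor and the k - 1 midpoints for an ordered predictor; the function names are made up for the example.

    from itertools import combinations
    import numpy as np

    def categorical_splits(levels):
        """All contrasts between two non-empty sets of levels, each counted once."""
        levels, splits = list(levels), []
        fixed, rest = levels[0], levels[1:]    # fixing one level avoids double counting
        for r in range(len(rest) + 1):
            for subset in combinations(rest, r):
                left = {fixed, *subset}
                if len(left) < len(levels):    # both sides must be non-empty
                    splits.append((left, set(levels) - left))
        return splits

    def ordered_splits(values):
        """Midpoints between the distinct values present at the node."""
        v = np.unique(values)
        return (v[:-1] + v[1:]) / 2

    print(len(categorical_splits(["a", "b", "c", "d"])))   # 2^(4-1) - 1 = 7
    print(ordered_splits([1, 3, 3, 7, 9]))                 # [2. 5. 8.]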

How is improvement in Goodness of fit determined? In the Classification Trees module, three choices of Goodness of fit measures are available. The Gini measure of node impurity reaches a value of zero when only one class is present at a node (with Prior probabilities Estimated from class sizes and Equal Misclassification costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes present at the node; it reaches its maximum value when class sizes at the node are equal). The Gini measure was the measure of Goodness of fit preferred by the developers of C&RT (Breiman et al., 1984; see also Prior Probabilities, the Gini Measure of Node Impurity, and Misclassification Cost). The two other options available are the Chi-square measure, which is similar to the standard Chi-square value computed for the expected and observed classifications (with priors adjusted for misclassification cost), and the G-square measure, which is similar to the maximum-likelihood Chi-square (as, for example, computed in the Log-Linear module). The C&RT-style exhaustive search for univariate splits option works by searching for the split that maximizes the reduction in the value of the selected goodness of fit measure. When the fit is perfect, classification is perfect.
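
For the Gini option, the calculation can be sketched as follows under the same assumptions as the parenthetical above (priors estimated from class sizes, equal misclassification costs): the impurity of a node is one minus the sum of squared class proportions, which equals the sum of products over all pairs of distinct class proportions, and a candidate split is scored by the weighted reduction in impurity it produces. This is a generic illustration, not the module's code, and it omits the Chi-square and G-square options.

    import numpy as np

    def gini(y):
        """Gini impurity: 0 when only one class is present, maximal when class sizes are equal."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)   # = sum of p_i * p_j over all pairs of distinct classes

    def gini_improvement(y, go_left):
        """Weighted reduction in Gini impurity produced by a candidate split."""
        n = len(y)
        left, right = y[go_left], y[~go_left]
        return gini(y) - len(left) / n * gini(left) - len(right) / n * gini(right)

    y = np.array(["a", "a", "b", "b", "b"])
    print(gini(y))                                                            # ~0.48
    print(gini_improvement(y, np.array([True, True, False, False, False])))   # ~0.48: both children are pure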