Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees

Various STATISTICA analyses contain tree-building facilities based on the C&RT algorithm:

Classification Trees Analysis

General Classification and Regression Trees (GC&RT)

Interactive Trees (C&RT, CHAID)

Boosted Tree Classifiers and Regression

In STATISTICA, variable (predictor) importance in these analyses is computed by summing, over all nodes in the tree (or trees), the drop (delta) in node impurity (delta(I) for classification trees) or in the resubstitution estimate (delta(R) for regression trees), and then expressing these sums relative to the largest sum found over all predictors (the most important variable). Note that this is different from the notion of predictor (variable) importance described by Breiman et al. Specifically, those authors discuss variable importance in the context of the split variable only, together with surrogate splits (see also, for example, the description of the Number of surrogates option on the Interactive Trees Specifications dialog - Advanced tab for C&RT). The key difference is that Breiman et al. suggest summing only the delta values for the actual split variable at each split and its surrogates, but not for the variables that provide "competing" splits. So while STATISTICA (and other programs, such as QUEST; Loh and Shih, 1997) computes the sum of delta for all predictors over all nodes (and over all trees, in Boosted Trees), other programs may compute the sum only for the actual split variables chosen at the different nodes (or for their surrogates).
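To make the two computations concrete, the following minimal sketch (written in Python) contrasts them side by side. The data structures and function names (nodes, deltas, split_var, importance_all_candidates, importance_split_variables_only) are hypothetical illustrations, not STATISTICA's internal representation, and surrogate splits are omitted for brevity.

from collections import defaultdict

# Hypothetical node record: the impurity decrease ("delta") each candidate
# predictor would achieve at that node, plus the predictor actually chosen
# for the split. Surrogates are ignored to keep the sketch short.

def importance_all_candidates(nodes):
    # Sum delta for ALL candidate predictors over all nodes (the approach
    # described above for STATISTICA), then scale so the largest sum is 100.
    sums = defaultdict(float)
    for node in nodes:
        for predictor, delta in node["deltas"].items():
            sums[predictor] += delta
    top = max(sums.values())
    return {p: round(100 * s / top) for p, s in sums.items()}

def importance_split_variables_only(nodes):
    # Sum delta only for the predictor actually chosen at each split
    # (the Breiman et al. style, here without surrogates), scaled the same way.
    sums = defaultdict(float)
    for node in nodes:
        sums[node["split_var"]] += node["deltas"][node["split_var"]]
    top = max(sums.values())
    return {p: round(100 * s / top) for p, s in sums.items()}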

There are advantages and disadvantages to both computational approaches. Suppose there is one predictor among several that, for all splits, "came in" as number 2, i.e., provided the second-best alternative (competing) split to the predictor actually chosen at each node. Using Breiman et al.'s method for computing importance, that variable could be entirely ignored and remain unidentified as an important variable. The meaning of "importance" in this case pertains to a specific final solution (tree), and if a predictor is never actually chosen for a split (or as a surrogate), it is, indeed, not an important predictor in the specific final (chosen) tree.

In STATISTICA, such a variable (let's call it v1) could end up being identified as the most important variable. This is because it may show large delta values over many nodes without ever being "used up" or "utilized" by an actual split (i.e., there is not a single split based on v1); another variable (e.g., v2) may have been used, and its predictive power fully exploited, in the first few splits of the tree, so that the delta values over most of the remaining nodes are quite small. This type of "configuration" leads to a potential "inflation" of the importance of the predictor that was never chosen (i.e., of predictor v1 relative to predictor v2, which was actually chosen one or more times for a split).
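Continuing the hypothetical sketch above with made-up delta values, the following example reproduces this configuration: v2 is chosen for the first split and contributes little thereafter, v3 is chosen for the remaining splits, and v1 is never chosen but is a strong competitor at every node.

# Made-up example tree with three nodes (illustrative values only)
nodes = [
    {"split_var": "v2", "deltas": {"v1": 0.30, "v2": 0.35, "v3": 0.10}},
    {"split_var": "v3", "deltas": {"v1": 0.25, "v2": 0.02, "v3": 0.28}},
    {"split_var": "v3", "deltas": {"v1": 0.22, "v2": 0.01, "v3": 0.24}},
]

print(importance_all_candidates(nodes))
# {'v1': 100, 'v2': 49, 'v3': 81} - v1 ranks first without ever being used for a split
print(importance_split_variables_only(nodes))
# {'v2': 67, 'v3': 100} - v1 receives no importance at all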

To summarize, in STATISTICA it is entirely possible to see importance values for predictors that were never chosen (this is, by the way, also possible with Breiman et al.'s approach when surrogates are present). The advantage of the approach used in STATISTICA (as well as in some other programs) is that it helps identify variables that may contain important predictive power with respect to the outcome of interest; the information contained in the importance statistic defined by Breiman et al., by contrast, is largely redundant with the information contained in the actual tree, where splits closer to the root are typically more important (yield a greater improvement in the fit of the model) than those closer to the bottom of the tree.

For details regarding this issue, see also the discussion in Breiman et al. (1984, pp. 146-148).