Comparison of Interactive Trees with GC&RT and GCHAID

While the methods available in the Interactive Trees module are very similar to, and in some ways more flexible than, those provided in the General Classification and Regression Trees (GC&RT) and General CHAID (GCHAID) modules, there are some important differences between these procedures, as well as specific advantages and disadvantages.

V-Fold Cross-Validation of Trees and Tree-Sequences

The technique of v-fold cross-validation is used in various analytic procedures of STATISTICA to avoid overfitting the data. In short, when fitting a model to a data set, you can make the model so complex (i.e., include so many model parameters) that it reproduces the data almost perfectly. The problem, of course, is that such models will likely do much worse when applied to new data, i.e., when used to make predictions or compute predicted classifications for cases or observations that were not used to fit the model. In general, when fitting a complex model to a data set that is small relative to the complexity of the model, there is always the risk of fitting idiosyncratic aspects of the specific sample used in the analysis (the learning sample, from which the parameters are estimated); such aspects will not generalize to the population at large and, hence, are not useful for prediction.

In several modules of STATISTICA, such as the General Classification and Regression Trees (GC&RT) or Classification Trees modules, you can use v-fold cross-validation methods to fit the entire sequence of trees (built by the respective tree-building algorithm) v times to different subsamples of the data, and to evaluate the predictive power (validity) of the respective models. This technique enables you to evaluate the steps of the automatic tree-building process, and to select (automatically) the particular tree that appears most stable and valid, i.e., produces the most accurate predictions for observations that were excluded from the model (tree) building procedure itself. V-fold cross-validation of the tree sequence is an extremely powerful and useful tool for predictive data mining because it often yields simple models with optimum predictive validity.
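
To make the idea concrete, the following sketch shows how a tree might be selected from a pruning sequence by v-fold cross-validation. This is not STATISTICA code; it uses scikit-learn's cost-complexity pruning path and a sample data set as stand-ins for the tree sequence built by GC&RT, and the number of folds and the random seed are arbitrary.

```python
# Illustrative sketch (not STATISTICA code): selecting a tree from a pruning
# sequence by v-fold cross-validation, using scikit-learn as a stand-in.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
v = 10  # number of cross-validation folds

# Build the full sequence of nested trees (from complex to simple) via
# cost-complexity pruning; each ccp_alpha value indexes one tree in the sequence.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
ccp_alphas = path.ccp_alphas

# Evaluate every tree in the sequence with v-fold cross-validation and keep
# the one with the lowest cross-validated error (cost).
cv_errors = []
for alpha in ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    scores = cross_val_score(tree, X, y, cv=v)  # accuracy per fold
    cv_errors.append(1.0 - scores.mean())       # misclassification cost

best_alpha = ccp_alphas[int(np.argmin(cv_errors))]
print(f"selected tree: ccp_alpha={best_alpha:.5f}, cv error={min(cv_errors):.3f}")
```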

In the Interactive Trees (C&RT, CHAID) module, v-fold cross-validation can be applied only to individual trees, not to the entire tree sequence. Since you (the user) can build trees interactively and grow and remove branches of the tree without necessarily building a sequence of trees (from simple to more complex), the program will not apply v-fold cross-validation to pick a best tree. Instead, you can only compute v-fold cross-validation error (cost) estimates for the automatically built final (chosen) tree, to get a basic idea of how useful the solution might be for predicting new observations.
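
By contrast, the sketch below computes only a v-fold cross-validation error estimate for one fixed tree configuration, which is roughly the kind of estimate the Interactive Trees module reports for the final chosen tree. Again, this is an illustrative stand-in (scikit-learn), and the tree parameters shown are hypothetical.

```python
# Illustrative sketch: v-fold cross-validation error for one fixed tree
# configuration only (no search over a tree sequence).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Parameters of the tree the analyst settled on (hypothetical values).
chosen_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20, random_state=0)

scores = cross_val_score(chosen_tree, X, y, cv=10)
print(f"10-fold CV error of the chosen tree: {1.0 - scores.mean():.3f}")
```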

Interactive trees may or may not generalize to new observations. The inability to apply v-fold cross-validation methods to an algorithmically (systematically) computed sequence of trees to select the solution with the greatest predictive validity is a major drawback of all programs that use interactive building of trees. In short, you cannot be sure whether the final solution for a tree model that you decided on after detailed interactive analyses will generalize to new samples, i.e., yield accurate predictions or predicted classifications for new cases.

Differences in Computational Procedures

Another issue that you should be aware of when using the various methods for building trees available in the STATISTICA system is that these techniques may not yield unique solutions. When using, for example, least squares Multiple Regression, there is usually a single best (correct) solution for the parameters of the linear regression equation. This is not necessarily the case for tree-building techniques.

For example, consider the CHAID algorithm (see also Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID). As a first step, the algorithm creates categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with approximately equal numbers of observations. In the General CHAID module, this step is repeated at each node of the tree to achieve the greatest predictive power at each stage of the automatic tree-building process. In the Interactive Trees (C&RT, CHAID) module, by default, this procedure of "cutting" the range of values for continuous predictors into discrete ranges is performed only once, at the beginning of the analysis. This speeds up the analyses of very large data sets, provides greater responsiveness for interactive analyses, and avoids unnecessary computations, since users will often choose ranges for splits manually anyway. However, with complex data sets that yield increasingly complex trees (split rules), these two procedures (initializing ranges for continuous predictors at each node vs. only once at the beginning of the analysis) can result in completely different trees. (Note that the Interactive Trees module contains various options to control when and how to create ranges from the values of continuous predictors.)
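
The following sketch illustrates only the discretization step described above, using pandas as a stand-in: a continuous predictor is cut into categories holding approximately equal numbers of observations, either once on the full learning sample or again on the subset of cases reaching a particular node. The node-membership rule used here is hypothetical.

```python
# Illustrative sketch of the CHAID pre-processing step: cut a continuous
# predictor into categories with roughly equal numbers of observations.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=1_000), name="predictor")

# Binning done once, on the full learning sample (Interactive Trees default).
global_bins = pd.qcut(x, q=10, duplicates="drop")

# Binning redone on the subset of cases reaching a given node (General CHAID
# behavior); the resulting cut points generally differ from the global ones.
node_mask = x > 0.5  # hypothetical node membership
node_bins = pd.qcut(x[node_mask], q=10, duplicates="drop")

print(global_bins.cat.categories[:3])
print(node_bins.cat.categories[:3])
```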

Another example in the context of CHAID where slight differences in the data or computational procedures can yield very different results arises when the program, during automatic operation (growing of the tree), encounters two predictor variables that would produce equal improvements in overall fit and must decide which one to choose. At that point the decision is arbitrary, but how it is resolved (which variable is chosen) can greatly affect the subsequent choices and, hence, the overall tree (in fact, it is not uncommon for some programs to produce different results for different orderings of the predictor variables).
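
A toy sketch of why this matters: if the split-selection rule simply takes the first candidate achieving the maximum improvement, the chosen predictor depends on the order in which the variables are scanned. The variable names and improvement scores below are hypothetical.

```python
# Toy sketch: when two candidate splits score identically, a "take the first
# maximum" rule makes the chosen predictor depend on the evaluation order.
import numpy as np

scores = {"age": 0.42, "income": 0.42, "region": 0.31}  # hypothetical split improvements

order_a = ["age", "income", "region"]
order_b = ["income", "age", "region"]

pick_a = order_a[int(np.argmax([scores[v] for v in order_a]))]
pick_b = order_b[int(np.argmax([scores[v] for v in order_b]))]

print(pick_a, pick_b)  # -> age income : different roots, potentially different trees
```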

Another instance where CHAID or C&RT analyses performed with GCHAID or GC&RT vs. Interactive Trees may yield different results is when the input data contain many missing values in the predictor variables. These differences are explained in Missing Data in GC&RT, GCHAID, and Interactive Trees.

A fourth example of how similar tree-building methods can produce quite different results is the use of surrogate splits: surrogate split variables can be chosen (manually or automatically) for a split when the predictor values for some observations are missing. The program then simply uses a similar (surrogate) variable to perform the split. Depending on the number of surrogates allowed, you may again obtain very different results.
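
The following sketch illustrates the basic idea of routing a case via a surrogate split when the primary split variable is missing; the variables, thresholds, and fallback rule are hypothetical and are not the actual GC&RT or Interactive Trees logic.

```python
# Illustrative sketch of a surrogate split: if the primary split variable is
# missing for a case, route the case using the best available surrogate whose
# split most closely mimics the primary one (thresholds are hypothetical).
import math

def route_case(case, primary=("income", 50_000), surrogates=(("age", 40), ("tenure", 5))):
    """Return 'left' or 'right' for one case (a dict of predictor values)."""
    for var, threshold in (primary, *surrogates):
        value = case.get(var)
        if value is not None and not (isinstance(value, float) and math.isnan(value)):
            return "left" if value <= threshold else "right"
    return "left"  # fallback: majority direction when all split variables are missing

print(route_case({"income": None, "age": 55}))  # routed by the 'age' surrogate -> 'right'
```

With more surrogates allowed, more cases are routed by secondary variables rather than dropped or sent in a default direction, so the composition of the child nodes, and hence the rest of the tree, can change.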

For various reasons, GCHAID and the corresponding method in the Interactive Trees (Itrees) module may not yield identical solutions. Many of these reasons are listed in the help topic Interactive Trees (C&RT, CHAID) Overview; typical sources of disagreement include differences in the handling of missing data and in the binning of continuous predictors.

Another potential difference in the final trees created by the two modules may be due to differences in the computation of the Bonferroni adjustment of the p-value associated with a split. At this time, there is no universally accepted method for the Bonferroni adjustment (see, for example, Kass (1980) and Biggs et al. (1991)). Depending on the user-specified probability for splitting, this difference in the Bonferroni adjustment may lead one module to split on a specific variable while the other does not, and thus produce considerable differences in the final structure of the two trees. Although the two trees may differ in structure, the practical difference is often small; for example, their predictive accuracy may be very similar.

It is also useful to note that, as with most machine learning algorithms, tree algorithms lie outside the traditional hypothesis-testing framework. Whereas differences in p-value computation and adjustment are crucial in traditional hypothesis testing, in tree algorithms such as CHAID they matter much less. Most users of tree algorithms are not concerned with precisely estimating the parameters of a statistical model in order to test hypotheses about some theoretical idealization of a natural process. Instead, they are seeking a data-driven, practical solution that is useful for gaining new insights about relationships among variables and/or for creating useful predictive models, without statistically controlling certain types of error rates.

Even with differences in the Bonferroni adjustment, the Itrees module provides enough information about each split, along with control over the algorithmic parameters, that many of the differences between the trees produced by the two modules can be minimized or possibly eliminated altogether. For example, within Itrees you can view the p-value for each split in a given tree and then set the probabilities for splitting and merging to acceptable values in order to create a different tree. Lastly, you always have the option to manually force splits to create an appropriate tree.
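
As an illustration of why the adjustment matters, the sketch below computes one common form of the CHAID Bonferroni multiplier, following Kass (1980): the raw split p-value is multiplied by the number of ways the predictor's c original categories can be merged into the r groups actually used. Other adjustments (e.g., Biggs et al., 1991) use different multipliers, which is precisely why two implementations can reach different split decisions at the same splitting probability.

```python
# Sketch of one common Bonferroni adjustment for a CHAID split, following
# Kass (1980).  The multiplier depends on the predictor type: for an ordinal
# ("monotonic") predictor it is C(c-1, r-1); for a nominal ("free") predictor
# it is the number of ways to partition c categories into r non-empty groups
# (a Stirling number of the second kind).
from math import comb, factorial

def kass_multiplier(c, r, predictor_type="nominal"):
    """Bonferroni multiplier for reducing c original categories to r groups."""
    if predictor_type == "ordinal":
        return comb(c - 1, r - 1)
    total = sum((-1) ** i * (r - i) ** c / (factorial(i) * factorial(r - i))
                for i in range(r))
    return round(total)

def adjusted_p(raw_p, c, r, predictor_type="nominal"):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, raw_p * kass_multiplier(c, r, predictor_type))

print(adjusted_p(0.004, c=6, r=3, predictor_type="nominal"))  # -> 0.36
```

In this hypothetical example, a module comparing the raw p-value (0.004) to a splitting probability of 0.05 would split on the predictor, whereas a module using this Kass-adjusted value (0.36) would not, and the two trees would diverge from that node onward.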

Tree-building techniques are heuristic algorithms, not exact estimation methods. The general point to remember is that the methods for building trees are based on algorithms rather than precise analytic solutions to equations. These algorithms are usually very sophisticated and in practice can often yield useful and interpretable results. However, the specific solutions are not "exact" in the sense that other analysts or researchers will always obtain identical, or nearly identical, results given similar data and software (algorithms). In that respect, these techniques are not unlike other machine learning algorithms that often produce very useful results (for prediction or predictive classification) even though those results are not necessarily unique (e.g., other trees may exist that are equally useful) or interpretable, and do not necessarily allow for inferences about general population "parameters."