Interactive Trees (C&RT, CHAID) Overview

The Statistica Interactive Trees (C&RT, CHAID) module builds ("grows") classification and regression trees, as well as CHAID trees, based on automatic (algorithmic) methods, on user-defined rules and criteria specified via a highly interactive graphical user interface (brushing tools), or on a combination of both. The module provides a highly interactive environment for building classification or regression trees (via classic C&RT methods or CHAID), enabling users to try various predictors and split criteria in combination with almost all of the automatic tree-building functionality of the General Classification and Regression Trees (GC&RT) and General CHAID Models (GCHAID) modules of Statistica.

The Interactive Trees (C&RT, CHAID) module can be used to build trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C&RT algorithm popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996) as well as the CHAID algorithm (Chi-square Automatic Interaction Detector; see Kass, 1980).

Unique Advantages of the Interactive Trees (C&RT, CHAID) Module

While much of the functionality of the Interactive Trees (C&RT, CHAID) module can be found in other tree-building procedures of Statistica and Statistica Data Miner, there are a number of unique aspects to this program:

Methods for Building Trees for Regression and Classification

The Statistica system includes a comprehensive selection of algorithms for building trees for regression and classification tasks.

Methods for automatic model building (machine learning)
The original purpose of, and traditional application for, tree classification and regression techniques was to provide an alternative to the various linear and nonlinear methods for predictive data mining. These are described further in the topics on Classification Trees Analysis, which implements the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm developed by Loh and Shih (1997); General Classification and Regression Trees (GC&RT), which includes a complete implementation of C&RT methods (see also Breiman, Friedman, Olshen, & Stone, 1984); and General CHAID Models (GCHAID). These modules offer complete, in-depth implementations of powerful techniques and are particularly well suited for inclusion in predictive data mining analyses. All of these methods automatically find a best model (tree) using sophisticated algorithmic methods, and, once the general analysis problem has been defined (e.g., variables have been selected), they require little or no intervention from the user to find good solutions that yield accurate predictions or predicted classifications. In many instances, these techniques generate models superior to any other linear, nonlinear, or neural network-based solution (see Hastie, Tibshirani, & Friedman, 2001, for an overview; see also Nisbet, Elder, & Miner, 2009).
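To make the "automatic" part of this concrete, the following pure-Python sketch shows the split search at the heart of C&RT-style tree growing: the algorithm scans every predictor and candidate threshold and keeps the split that minimizes the weighted Gini impurity of the children. This is an illustration only, not the Statistica implementation; the data and variable names are hypothetical.

```python
# Minimal sketch of the automatic split search used in C&RT-style tree
# growing (illustrative only; not the Statistica engine).

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return 1.0 - sum((k / n) ** 2 for k in counts.values())

def best_split(rows, labels):
    """Scan all predictors and thresholds; return the
    (column, threshold, weighted_impurity) of the best split."""
    n = len(rows)
    best = (None, None, float("inf"))
    for col in range(len(rows[0])):
        for threshold in sorted({r[col] for r in rows}):
            left = [labels[i] for i in range(n) if rows[i][col] <= threshold]
            right = [labels[i] for i in range(n) if rows[i][col] > threshold]
            if not left or not right:
                continue  # a split must produce two non-empty branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (col, threshold, score)
    return best

# Two predictors; only column 0 separates the classes perfectly.
rows = [(1, 7), (2, 3), (8, 5), (9, 6)]
labels = ["buy", "buy", "skip", "skip"]
print(best_split(rows, labels))  # → (0, 2, 0.0): split on column 0 at <= 2
```

Growing a full tree simply repeats this search recursively within each resulting branch until a stopping rule applies, with no user intervention.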
Building trees interactively
In contrast, another approach to building trees, popular in applied research and data exploration, relies on experts' knowledge of the domain under investigation and on their interactive choices (for how to grow the tree) to arrive at "good" (valid) models for prediction or predictive classification. In other words, instead of letting sophisticated algorithms choose good predictors and splits automatically (for growing the branches of the tree), the user determines manually which variables to include in the tree and how to split those variables to create its branches. This enables users to experiment with different variables and scenarios and, ideally, to derive a better understanding of the phenomenon under investigation by combining their expertise with the program's analytic capabilities and tree-building options (see also the next paragraph).
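The division of labor in interactive tree building can be sketched as follows: the analyst chooses the predictor and split point, and the program merely evaluates that choice, for example by reporting the class distribution in each resulting branch. This is an illustrative pure-Python sketch, not the Statistica GUI; the data and the analyst's choice are hypothetical.

```python
# Sketch of the interactive alternative: the analyst, not an algorithm,
# picks the predictor and threshold; the program reports how good the
# choice is. (Illustrative only; not the Statistica implementation.)
from collections import Counter

def apply_user_split(rows, labels, column, threshold):
    """Split the cases on a user-chosen predictor and threshold, and
    return the class distribution in each resulting branch."""
    left = Counter(labels[i] for i, r in enumerate(rows) if r[column] <= threshold)
    right = Counter(labels[i] for i, r in enumerate(rows) if r[column] > threshold)
    return dict(left), dict(right)

rows = [(1, 7), (2, 3), (8, 5), (9, 6)]
labels = ["buy", "buy", "skip", "skip"]

# The analyst suspects predictor 1 matters and tries a split at 5;
# the mixed branches show this choice separates the classes poorly.
print(apply_user_split(rows, labels, column=1, threshold=5))
# → ({'buy': 1, 'skip': 1}, {'buy': 1, 'skip': 1})
```

Feedback of this kind lets the analyst revise the choice (here, switching to predictor 0 yields two pure branches) and so explore scenarios that a fully automatic run would never surface.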
Combining techniques
In practice, it is often most useful to combine the automatic methods for building trees with "educated guesses" and domain-specific expertise. You may want to grow some portions of the tree using automatic methods, then refine and modify the program's choices (for how to grow the branches of the tree) based on your expertise. Combined automatic and interactive tree building is also called for when variables chosen automatically for some splits are not easily observable, because they cannot be measured reliably or economically (i.e., obtaining such measurements would be too expensive). For example, suppose the automatic analysis at some point selects the variable Income as a good predictor for the next split, but you cannot obtain reliable income data from the new sample to which you want to apply the results of the current analysis (e.g., for predicting some behavior of interest, such as whether the person will purchase something from your catalog). In this case, you may want to select a "surrogate" variable instead: a variable that you can observe easily and that is likely related to Income with respect to its predictive power. For instance, Number of years of education may be related to Income and have similar predictive power; most people are reluctant to reveal their income but are more likely to report their level of education, so the latter variable is more easily measured.
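One simple way to judge a candidate surrogate is to measure how often its split routes cases to the same branch as the primary (hard-to-measure) split. The sketch below, in illustrative pure Python with entirely hypothetical data and thresholds, applies this idea to the Income/education example; it is not the surrogate mechanism of any particular product.

```python
# Sketch of evaluating a surrogate predictor: a good surrogate is a
# variable whose split sends cases to the same side as the primary
# split as often as possible. (Illustrative only; hypothetical data.)

def split_agreement(primary, surrogate, primary_thr, surrogate_thr):
    """Fraction of cases that the surrogate split routes to the same
    branch as the primary split."""
    same = sum(
        (p <= primary_thr) == (s <= surrogate_thr)
        for p, s in zip(primary, surrogate)
    )
    return same / len(primary)

# Hypothetical sample: Income is hard to collect reliably, while
# Number of years of education is easy to obtain.
income = [20, 30, 45, 80, 95, 120]    # primary predictor
education = [10, 12, 12, 16, 18, 20]  # candidate surrogate

# Here an education split at 14 routes every case the same way as an
# income split at 60, so education is a perfect surrogate in this sample.
print(split_agreement(income, education, primary_thr=60, surrogate_thr=14))
# → 1.0
```

Ranking candidate variables by such an agreement score is one way to pick the surrogate with predictive power most similar to the original split variable.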

The Statistica Interactive Trees (C&RT, CHAID) module provides a flexible, easy-to-use environment for growing trees, or portions (branches) of trees, algorithmically (automatically) as well as manually. It adds a powerful tool for interactive data analysis and model building that supplements and augments the many other techniques available in Statistica Data Miner for automatically determining valid models for prediction and predictive classification.