Interactive Trees (C&RT, CHAID) Overview
The STATISTICA Interactive Trees (C&RT, CHAID) module builds ("grows") classification and regression trees as well as CHAID trees using automatic (algorithmic) methods, user-defined rules and criteria specified via a highly interactive graphical user interface (brushing tools), or a combination of both. The purpose of the module is to provide a highly interactive environment for building classification or regression trees (via classic C&RT methods or CHAID), so that users can try out various predictors and split criteria while retaining almost all of the functionality for automatic tree building provided in the General Classification and Regression Trees (GC&RT) and General CHAID Models (GCHAID) modules of STATISTICA.
The Interactive Trees (C&RT, CHAID) module can be used to build trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C&RT algorithm popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996) as well as the CHAID algorithm (Chi-square Automatic Interaction Detector; see Kass, 1980).
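The module itself is driven through the graphical interface, but the two underlying ideas are easy to demonstrate with open-source stand-ins. The sketch below is only an analogue, not STATISTICA's implementation: scikit-learn's CART-style trees play the role of C&RT for a categorical and a continuous dependent variable, and a chi-square test of independence (the statistic at the heart of CHAID; Kass, 1980) scores a candidate categorical predictor. All data and variable names here are synthetic assumptions.

```python
# Sketch of the two tree-building ideas using open-source analogues.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                             # three continuous predictors
y_class = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # categorical dependent variable
y_reg = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)   # continuous dependent variable

# C&RT-style binary recursive partitioning: classification and regression.
clf = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=3).fit(X, y_reg)
print(f"classification accuracy: {clf.score(X, y_class):.2f}")
print(f"regression R^2:          {reg.score(X, y_reg):.2f}")

# CHAID-style scoring of a categorical predictor: cross-tabulate its
# categories against the classes and test for independence; a small
# p-value marks the predictor as a strong candidate for the next split.
predictor = np.digitize(X[:, 0], [-0.5, 0.5])             # a 3-category recoding
table = np.zeros((3, 2))
for cat, cls in zip(predictor, y_class):
    table[cat, cls] += 1
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square = {chi2:.1f} (df = {dof}), p = {p:.2e}")
```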
Unique Advantages of the Interactive Trees (C&RT, CHAID) Module
While much of the functionality of the Interactive Trees (C&RT, CHAID) module can be found in other tree-building procedures of STATISTICA and STATISTICA Data Miner, the program has a number of unique aspects:
- The program is particularly optimized for very large data sets, and in many cases the raw data do not have to be stored locally for the analyses.
- Because the Interactive Trees module does not support ANCOVA-like design matrices, it is more flexible in its handling of missing data. For example, in CHAID analyses the program handles predictors one at a time to determine the best (next) split, whereas in the General CHAID Models (GCHAID) module, observations with missing data for any categorical predictor are eliminated from the analysis. See also Missing Data in GC&RT, GCHAID, and Interactive Trees for additional details.
- You can perform "what-if" analyses by interactively deleting individual branches, growing other branches, and observing various result statistics for the different trees (tree models).
- You can automatically grow some parts of the tree but manually specify splits for other branches or nodes. For example, if certain predictor variables cannot, in practice, be measured easily or economically (e.g., information on personal Income is usually difficult to obtain in questionnaire surveys), you can find and specify alternative predictors and splits for nodes to avoid such variables (e.g., replace Income with Number of rooms in primary residence).
- You can define specific splits. This is useful when you want to build simple and parsimonious solutions that can easily be communicated and implemented (e.g., a split on Income < 20,345 is less "convenient" than a split at Income < 20,000; see the sketch after this list).
- You can quickly copy trees into new projects to explore alternative splits and methods for growing branches.
- You can save entire trees (projects) for later use. When you reload a tree project, the tree is restored to the exact state it was in when it was saved.
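To make the "define specific splits" point above concrete, here is a minimal sketch, outside STATISTICA, of what a user-defined split amounts to: partitioning a node's cases at a hand-picked threshold and inspecting the resulting child nodes. The data, the income variable, and both cutpoints are assumptions chosen to mirror the example above.

```python
# Compare an algorithmically found cutpoint with a rounder, easier-to-
# communicate one; the parsimonious rule often costs little node purity.
import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10, sigma=0.4, size=500)          # synthetic incomes
bought = (income + rng.normal(scale=8000, size=500) > 22000).astype(int)

def split_stats(threshold):
    """Child-node sizes and class rates for the split income < threshold."""
    left, right = bought[income < threshold], bought[income >= threshold]
    return left.mean(), len(left), right.mean(), len(right)

for t in (20345.0, 20000.0):                                  # "optimal" vs. convenient
    p_l, n_l, p_r, n_r = split_stats(t)
    print(f"income < {t:>7.0f}: left n={n_l:3d} P(buy)={p_l:.2f} | "
          f"right n={n_r:3d} P(buy)={p_r:.2f}")
```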
Methods for Building Trees for Regression and Classification
The STATISTICA system includes a very comprehensive selection of algorithms for building trees for regression and classification tasks.
- Methods for automatic model building (machine learning)
- The original purpose of and traditional application for tree classification and regression techniques was to provide an alternative to the various linear and nonlinear methods for predictive data mining. This is further described in the topics on Classification Trees Analysis, which implements the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm developed by Loh and Shih (1997); General Classification and Regression Trees (GC&RT), which includes a complete implementation of C&RT methods (see also Breiman, Friedman, Olshen, & Stone, 1984); and General CHAID Models (GCHAID). These modules offer complete and in-depth implementations of powerful techniques and are particularly well suited for inclusion in predictive data mining analyses. All of these methods automatically find a best model (tree) using sophisticated algorithmic methods; once the general analysis problem has been defined (e.g., variables have been selected), they require little or no intervention on the part of the user to find good solutions that yield accurate predictions or predicted classifications. In fact, in many instances these techniques will generate models that are superior to any linear, nonlinear, or neural network-based solution (see Hastie, Tibshirani, & Friedman, 2001, for an overview; see also Nisbet, Elder, & Miner, 2009).
- Building trees interactively
- In contrast, another method for building trees that has proven popular in applied research and data exploration relies on experts' knowledge about the domain or area under investigation, and on their interactive choices (for how to grow the tree) to arrive at "good" (valid) models for prediction or predictive classification. In other words, instead of building trees automatically, using sophisticated algorithms for choosing good predictors and splits (for growing the branches of the tree), a user may want to determine manually which variables to include in the tree and how to split those variables to create its branches. This enables users to experiment with different variables and scenarios and, ideally, to derive a better understanding of the phenomenon under investigation by combining their expertise with the analytic capabilities and options for building the tree (see also the next paragraph).
- Combining techniques
- In practice, it is often most useful to combine the automatic methods for building trees with "educated guesses" and domain-specific expertise. You may want to grow some portions of the tree using automatic methods and then refine and modify the program's choices (for how to grow the branches of the tree) based on your expertise. Another common situation that calls for this type of combined automatic and interactive tree building arises when some variables chosen automatically for certain splits are not easily observable, because they cannot be measured reliably or economically (i.e., obtaining such measurements would be too expensive). For example, suppose the automatic analysis at some point selects the variable Income as a good predictor for the next split; however, you may not be able to obtain reliable data on income from the new sample to which you want to apply the results of the current analysis (e.g., for predicting some behavior of interest, such as whether or not a person will purchase something from your catalog). In this case, you may want to select a "surrogate" variable, i.e., a variable that you can observe easily and that is likely related or similar to Income with respect to its predictive power. For example, the variable Number of years of education may be related to Income and have similar predictive power; while most people are reluctant to reveal their level of income, they are more likely to report their level of education, and hence this latter variable is more easily measured (a minimal sketch of screening for such a surrogate follows this list).
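As a rough illustration of the surrogate idea just described, the sketch below (again outside STATISTICA, with synthetic data and assumed thresholds) screens cutpoints on an easily measured variable to find the one whose split sends the most cases to the same side as the preferred but hard-to-measure splitter.

```python
# Find the education cutpoint whose split best agrees with income < 20,000.
import numpy as np

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=0.4, size=500)
education = 8 + 0.0004 * income + rng.normal(scale=1.5, size=500)  # correlated proxy

primary = income < 20000                      # the split the algorithm proposed
best_agreement, best_cut = 0.0, None
for cut in np.quantile(education, np.linspace(0.05, 0.95, 19)):
    agreement = np.mean((education < cut) == primary)
    if agreement > best_agreement:
        best_agreement, best_cut = agreement, cut

print(f"best surrogate: education < {best_cut:.1f} "
      f"(agrees with the income split on {best_agreement:.0%} of cases)")
```

In C&RT terms, this agreement rate is the basis of the measure of predictive association used to rank surrogate splitters (Breiman et al., 1984); the interactive module lets you substitute such a surrogate manually at any node.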
The STATISTICA Interactive Trees (C&RT, CHAID) module provides a flexible and easy-to-use environment for growing trees, or portions (branches) of trees, algorithmically (automatically) as well as manually. It adds an extremely powerful tool for interactive data analysis and model building that supplements and augments the many other techniques available in STATISTICA Data Miner for automatically determining valid models for prediction and predictive classification.