Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID
The acronym CHAID stands for Chi-squared Automatic Interaction Detector. The name derives from the basic algorithm used to construct (non-binary) trees, which for classification problems (when the dependent variable is categorical in nature) relies on the Chi-square test to determine the best next split at each step; for regression-type problems (continuous dependent variable), the program computes F tests. Specifically, the algorithm proceeds as follows:
- Preparing predictors
- First, Statistica creates categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (classes) are "naturally" defined. (This binning is illustrated in the first sketch below.)
- Merging categories
- Next, Statistica cycles through the predictors to determine, for each predictor, the pair of (predictor) categories that is least significantly different with respect to the dependent variable; for classification problems (where the dependent variable is categorical as well), the program computes a Chi-square test (Pearson Chi-square); for regression problems (where the dependent variable is continuous), the program computes F tests. If the respective test for a given pair of predictor categories is not statistically significant, as defined by an alpha-to-merge value, the program merges the respective predictor categories and repeats this step (i.e., finds the next pair of categories, which now may include previously merged categories). If the test for the respective pair of predictor categories is statistically significant (its p-value is less than the respective alpha-to-merge value), the program (optionally) computes a Bonferroni-adjusted p-value for the set of categories for the respective predictor. (This merging is illustrated in the second sketch below.)
- Selecting the split variable
- Next, the program chooses for the split the predictor variable with the smallest (optionally Bonferroni-adjusted) p-value, i.e., the predictor variable that will yield the most significant split; if the smallest adjusted p-value for any predictor is greater than some alpha-to-split value, no further splits are performed, and the respective node is a terminal node. (This selection rule is illustrated in the third sketch below.)
This process continues until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).
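To make the first step concrete, the following is a minimal sketch, in Python with NumPy, of quantile-based binning, i.e., dividing a continuous predictor into categories that hold approximately equal numbers of observations. The function name, the choice of 10 bins, and the toy data are illustrative assumptions and do not reproduce Statistica's internal implementation.

```python
import numpy as np

def bin_continuous_predictor(x, n_bins=10):
    """Discretize a continuous predictor into categories that each hold
    an approximately equal number of observations (quantile binning)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    edges = np.unique(edges)                 # guard against ties collapsing bin edges
    # Assign each value a category code 0 .. k-1 based on the interior edges.
    return np.digitize(x, edges[1:-1], right=True)

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)
categories = bin_continuous_predictor(income, n_bins=10)
print(np.bincount(categories))               # roughly 100 observations per category
```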
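The merging step for a classification problem can be sketched as follows, using SciPy's Pearson Chi-square test on the two-row contingency table of each candidate pair of categories against the target classes. This is a simplified illustration: it treats the predictor as nominal (any pair may be merged), does not reproduce Statistica's exact tie-breaking, and the alpha-to-merge value, variable names, and toy data are assumptions.

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

def merge_categories(pred, target, alpha_to_merge=0.05):
    """Merge predictor categories pairwise until every remaining pair of
    (merged) categories differs significantly with respect to the target."""
    groups = [{c} for c in np.unique(pred)]   # each group starts as one original category
    classes = np.unique(target)
    while len(groups) > 1:
        # Find the pair of groups that is LEAST significantly different (largest p-value).
        best_p, best_pair = -1.0, None
        for i, j in combinations(range(len(groups)), 2):
            rows = [[np.sum((target == c) & np.isin(pred, list(g))) for c in classes]
                    for g in (groups[i], groups[j])]
            table = np.asarray(rows)
            table = table[:, table.sum(axis=0) > 0]   # drop empty columns to keep the test valid
            _, p_value, _, _ = chi2_contingency(table, correction=False)  # Pearson Chi-square
            if p_value > best_p:
                best_p, best_pair = p_value, (i, j)
        if best_p <= alpha_to_merge:
            break                              # even the closest pair differs significantly
        i, j = best_pair
        groups[i] |= groups.pop(j)             # merge the pair and repeat
    return groups

rng = np.random.default_rng(1)
region = rng.integers(0, 5, size=500)          # predictor with 5 original categories
churn = (rng.random(500) < 0.2 + 0.3 * (region > 2)).astype(int)
print(merge_categories(region, churn))         # e.g. [{0, 1, 2}, {3, 4}]
```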
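The split-selection step then reduces to comparing adjusted p-values across predictors, as in the sketch below. The Bonferroni multiplier used here, comb(c - 1, r - 1) for a predictor reduced from c original to r merged categories, is the Kass (1980) multiplier for an ordered predictor; Statistica's adjustments for other predictor types are not reproduced, and the candidate values are made up for illustration.

```python
from math import comb

def select_split(candidates, alpha_to_split=0.05):
    """candidates: list of (name, unadjusted p-value, c original categories,
    r merged categories). Returns the predictor to split on, or None if the
    node becomes a terminal node."""
    best_name, best_p = None, None
    for name, p_raw, c, r in candidates:
        # Bonferroni adjustment: number of ways c ordered categories can be
        # reduced to r contiguous groups.
        p_adj = min(1.0, p_raw * comb(c - 1, r - 1))
        if best_p is None or p_adj < best_p:
            best_name, best_p = name, p_adj
    # If even the smallest adjusted p-value exceeds alpha-to-split, do not split.
    return best_name if best_p is not None and best_p <= alpha_to_split else None

candidates = [("age", 0.0004, 10, 3), ("income", 0.01, 10, 4), ("region", 0.20, 5, 2)]
print(select_split(candidates))                # -> age (adjusted p = 0.0004 * 36 = 0.0144)
```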
Note: Missing data. Missing data in predictor variables are handled differently in the General CHAID (GCHAID) Models and General Classification and Regression Trees (GC&RT) modules, as compared to the Interactive Trees module. Because the Interactive Trees module does not support ANCOVA-like design matrices, it is more flexible in the handling of missing data. Refer to Missing Data in GC&RT, GCHAID, and Interactive Trees for additional details.
CHAID and Exhaustive CHAID Algorithms
Exhaustive CHAID, a modification to the basic CHAID algorithm, performs a more thorough merging and testing of predictor variables, and hence requires more computing time. Specifically, the merging of categories continues (without reference to any alpha-to-merge value) until only two categories remain for each predictor. The program then proceeds as described above in the Selecting the split variable step, and selects among the predictors the one that yields the most significant split. For large data sets, and with many continuous predictor variables, this modification of the simpler CHAID algorithm may require significant computing time.
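As a rough illustration of the difference, the merging loop from the earlier sketch changes as follows under Exhaustive CHAID: the alpha-to-merge test is dropped, and the least-different pair is merged repeatedly until exactly two categories remain. As before, the pairwise Pearson Chi-square test, the treatment of the predictor as nominal, and the toy data are assumptions for illustration only.

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency

def merge_exhaustive(pred, target):
    """Exhaustive-CHAID-style merging: keep merging the least significantly
    different pair of categories until only two merged categories remain."""
    groups = [{c} for c in np.unique(pred)]
    classes = np.unique(target)
    while len(groups) > 2:
        best_p, best_pair = -1.0, None
        for i, j in combinations(range(len(groups)), 2):
            rows = [[np.sum((target == c) & np.isin(pred, list(g))) for c in classes]
                    for g in (groups[i], groups[j])]
            table = np.asarray(rows)
            _, p_value, _, _ = chi2_contingency(table[:, table.sum(axis=0) > 0],
                                                correction=False)
            if p_value > best_p:
                best_p, best_pair = p_value, (i, j)
        i, j = best_pair                       # merge unconditionally (no alpha-to-merge)
        groups[i] |= groups.pop(j)
    return groups                              # at most two merged categories remain

rng = np.random.default_rng(2)
color = rng.integers(0, 4, size=400)
buy = (rng.random(400) < 0.15 + 0.25 * (color == 0)).astype(int)
print(merge_exhaustive(color, buy))            # e.g. [{0}, {1, 2, 3}]
```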
Note: There are four different types of tree-building algorithms available in Statistica: CHAID (Kass, 1980; see the General CHAID Introductory Overview), C&RT (Breiman, Friedman, Olshen, and Stone, 1984; see General Classification and Regression Trees), QUEST (Loh and Shih, 1997; see Classification Trees Analysis), and Interactive C&RT and CHAID Trees; see also CHAID, C&RT, and QUEST for additional details. For further discussion of the differences between these computational algorithms, see also the Interactive Trees (C&RT, CHAID) Introductory Overview and Missing Data in GC&RT, GCHAID, and Interactive Trees.