General Classification and Regression Trees Introductory Overview

Overview

Basic Ideas

The STATISTICA General Classification and Regression Trees module (GC&RT) will build classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The program supports the classic C & RT algorithm popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996), and includes various methods for pruning, as well as the powerful v-fold cross-validation methods. In addition, with this program you can specify ANCOVA-like experimental designs (see MANOVA and GLM) with continuous and categorical factor effects and interactions, i.e., base the computations on design matrices for the predictor variables. A general introduction to tree classifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities; much of the following discussion presents the same information in a slightly different context. Another, similar type of tree-building algorithm is CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980); a full implementation of this algorithm is available in the General CHAID Models module of STATISTICA.
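To make the notion of a design matrix concrete, the following sketch expands one categorical factor, one continuous predictor, and their interaction into the columns an ANCOVA-like design would be based on. It uses the open-source patsy library rather than STATISTICA itself, and the style and sqft variables are hypothetical stand-ins:

    # A minimal sketch of an ANCOVA-like design matrix: one categorical
    # factor, one continuous predictor, and their interaction. Uses the
    # open-source patsy library, not STATISTICA; all variable names and
    # values are hypothetical.
    import pandas as pd
    import patsy

    data = pd.DataFrame({
        "style": ["ranch", "ranch", "two_story", "two_story"],
        "sqft":  [1200, 1500, 1800, 2400],
    })

    # "C(style) * sqft" expands to an intercept, dummy-coded factor
    # columns, the covariate, and factor-by-covariate interaction columns.
    dm = patsy.dmatrix("C(style) * sqft", data)
    print(dm.design_info.column_names)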

Classification and Regression Problems

STATISTICA contains numerous algorithms for predicting continuous or categorical variables from a set of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear Models) and GRM (General Regression Models), you can specify a linear combination (design) of continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction effects) to predict a continuous dependent variable. In GDA (General Discriminant Analysis), you can specify such designs for predicting categorical variables, i.e., to solve classification problems.

Regression-type problems. Regression-type problems are generally those where one attempts to predict the values of a continuous variable from one or more continuous and/or categorical predictor variables. For example, you may want to predict the selling prices of single-family homes (a continuous dependent variable) from various continuous predictors (e.g., square footage) as well as categorical predictors (e.g., style of home, such as ranch or two-story; the zip code or telephone area code where the property is located; note that this latter variable would be categorical in nature, even though it contains numeric codes). If you used simple multiple regression, or some general linear model (GLM), to predict the selling prices of single-family homes, you would determine a linear equation for these variables that can be used to compute predicted selling prices. STATISTICA contains many different analytic procedures for fitting linear models (GLM, GRM, Regression), various types of nonlinear models [e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized Additive Models (GAM), etc.], or completely custom-defined nonlinear models (see Nonlinear Estimation), where you can type in an arbitrary equation containing parameters to be estimated by the program. The General CHAID Models (GCHAID) module of STATISTICA also analyzes regression-type problems, and produces results that are similar (in nature) to those computed by GC&RT. Note that various neural network architectures [available in STATISTICA Automated Neural Networks (SANN)] are also applicable to regression-type problems.
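As a minimal illustration of a regression-type tree problem, the sketch below fits and prints a small regression tree. It uses the open-source scikit-learn library rather than GC&RT, and the toy data and variable names are hypothetical:

    # A regression tree fit to hypothetical home-price data; scikit-learn
    # stands in for GC&RT here, and the numbers are made up.
    from sklearn.tree import DecisionTreeRegressor, export_text

    # Predictors: square footage (continuous) and home style (categorical,
    # encoded as 0 = ranch, 1 = two-story).
    X = [[1200, 0], [1500, 0], [1800, 1], [2400, 1], [3000, 1]]
    y = [150000, 175000, 210000, 280000, 340000]  # selling prices

    tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["sqft", "style"]))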

Classification-type problems. Classification-type problems are generally those where one attempts to predict values of a categorical dependent variable (class, group membership, etc.) from one or more continuous and/or categorical predictor variables. For example, you may be interested in predicting who will or will not graduate from college, or who will or will not renew a subscription. These are examples of simple binary classification problems, where the categorical dependent variable can assume only two distinct and mutually exclusive values. In other cases, one might be interested in predicting which one of multiple different alternative consumer products (e.g., makes of cars) a person decides to purchase, or which type of failure occurs with different types of engines. In those cases there are multiple categories or classes for the categorical dependent variable. STATISTICA contains a number of methods for analyzing classification-type problems and for computing predicted classifications, either from simple continuous predictors (e.g., binomial or multinomial logit regression in GLZ), from categorical predictors (e.g., log-linear analysis of multi-way frequency tables), or both (e.g., via ANCOVA-like designs in GLZ or GDA). The General CHAID Models module of STATISTICA also analyzes classification-type problems, and produces results that are similar (in nature) to those computed by GC&RT. Note that various neural network architectures [available in STATISTICA Automated Neural Networks (SANN)] are also applicable to classification-type problems.
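A classification-type problem can be sketched the same way; here the dependent variable is a binary class label. Again, scikit-learn stands in for GC&RT, and the data are hypothetical:

    # A classification tree for a hypothetical binary problem (will a
    # student graduate?); scikit-learn stands in for GC&RT.
    from sklearn.tree import DecisionTreeClassifier

    # Predictors: high-school GPA (continuous) and in-state status (0/1).
    X = [[2.1, 0], [2.8, 1], [3.0, 0], [3.4, 1], [3.9, 1]]
    y = ["no", "no", "yes", "yes", "yes"]  # graduated?

    clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(clf.predict([[3.2, 0]]))  # predicted class for a new student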

Classification and Regression Trees (C&RT)

In the most general terms, the purpose of analyses via tree-building algorithms is to determine a set of if-then logical (split) conditions that permit accurate prediction or classification of cases.

Classification Trees

For example, consider the widely referenced Iris data classification problem introduced by Fisher [1936; see also Discriminant Function Analysis and General Discriminant Analysis (GDA)]. The example data file Irisdat.sta reports the lengths and widths of sepals and petals of three types of irises (Setosa, Versicol, and Virginic). The purpose of the analysis is to learn how one can discriminate between the three types of flowers based on the four measures of width and length of petals and sepals. Discriminant function analysis will estimate several linear combinations of predictor variables for computing classification scores (or probabilities) that allow the user to determine the predicted classification for each observation (see also the Discriminant Function Analysis Example for a discussion of this type of analysis). A classification tree will instead determine a set of logical if-then conditions (rather than linear equations) for predicting or classifying cases:

The interpretation of this tree is straightforward: If the petal width is less than or equal to 0.8, the respective flower is classified as Setosa; if the petal width is greater than 0.8 and less than or equal to 1.75, it is classified as Versicol; otherwise, it is classified as Virginic.
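Expressed as code, the tree described above is nothing more than a pair of nested if-then tests. The following sketch simply restates the rule (the 0.8 and 1.75 petal-width thresholds are the ones quoted above):

    # The Iris classification tree restated as plain if-then logic; the
    # split values are the ones quoted in the text above.
    def classify_iris(petal_width: float) -> str:
        if petal_width <= 0.8:
            return "Setosa"
        elif petal_width <= 1.75:
            return "Versicol"
        else:
            return "Virginic"

    print(classify_iris(0.3))  # Setosa
    print(classify_iris(1.4))  # Versicol
    print(classify_iris(2.1))  # Virginic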

Regression Trees

The general approach of deriving predictions from a few simple if-then conditions can be applied to regression problems as well. For example, refer to Example 1 of the Multiple Regression Analysis module (see also Example 2 for GC&RT). Example 1 is based on the data file Poverty.sta, which contains 1960 and 1970 Census figures for a random selection of 30 counties. The research question (for that example) was to determine the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county. A reanalysis of those data, using regression tree analysis [and v-fold cross-validation (see Introductory Overview - Basic Ideas Part II) to determine the best tree; see also Example 2], yields the following results:

Again, the interpretation of these results is rather straightforward: Counties where the percent of households with a phone is greater than 72% generally have a lower poverty rate. The greatest poverty rate is evident in those counties where 72% or fewer of households have a phone, and where the population change (from the 1960 census to the 1970 census) is less than -8.3 (minus 8.3). These results are straightforward, easily presented, and intuitively clear: There are some affluent counties (where most households have a telephone), and those generally have little poverty. Then there are counties that are generally less affluent, and among those, the ones that shrank the most show the greatest poverty rate.
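The v-fold cross-validation used to select the best tree can also be sketched in a few lines. The snippet below uses scikit-learn's cross_val_score with v = 10 folds; synthetic data stand in for the 30 counties in Poverty.sta (the data file itself is not read here):

    # A sketch of v-fold cross-validation for choosing the tree's
    # complexity; scikit-learn stands in for GC&RT, and synthetic data
    # stand in for the Poverty.sta counties.
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeRegressor

    X, y = make_regression(n_samples=30, n_features=5, random_state=0)

    for depth in (1, 2, 3, 4):
        tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
        score = cross_val_score(tree, X, y, cv=10).mean()  # v = 10 folds
        print(f"max_depth={depth}: mean CV R^2 = {score:.3f}")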

A quick review of the scatterplot of observed vs. predicted values shows how the discrimination between the latter two groups is particularly well "explained" by the tree model.
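Such a plot can be produced with a few lines of matplotlib. Note that a regression tree predicts the same value (the leaf mean) for every case in a leaf, so the predicted values cluster into a few distinct levels; the numbers below are hypothetical:

    # A sketch of the observed vs. predicted scatterplot; the values are
    # hypothetical, and a fitted tree would normally supply y_pred.
    import matplotlib.pyplot as plt

    y_obs = [10.2, 35.1, 14.8, 28.9, 12.4]   # observed poverty rates
    y_pred = [12.0, 33.0, 12.0, 33.0, 12.0]  # leaf means from the tree

    plt.scatter(y_obs, y_pred)
    plt.xlabel("Observed percent of families below poverty line")
    plt.ylabel("Tree-predicted percent")
    plt.show()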

Note: There are four different types of tree-building algorithms available in STATISTICA: CHAID (Kass, 1980; see the General CHAID Introductory Overview), C & RT (Breiman, Friedman, Olshen, and Stone, 1984; see General Classification and Regression Trees), QUEST (Loh and Shih, 1997; see Classification Trees Analysis), and Interactive C & RT and CHAID Trees; see also CHAID, C & RT, and QUEST for additional details. For additional discussion of the differences between the computational algorithms, see the Interactive Trees (C & RT, CHAID) Introductory Overview and Missing Data in GC & RT, GCHAID, and Interactive Trees.