Computational Methods - Selecting the "Right-Sized" Tree

After a night at the horse track, a studious gambler computes a huge classification tree with numerous splits that perfectly account for the win, place, show, and no show results for every horse in every race. Expecting to become rich, the gambler takes a copy of the tree graph to the races the next night, sorts the horses racing using the classification tree, makes predictions and places bets, and leaves the race track later much less rich than had been expected. The poor gambler has foolishly assumed that a classification tree computed from a learning sample in which the outcomes are already known will perform equally well in predicting outcomes in a second, independent test sample. The gambler's classification tree performed poorly during cross-validation. The gambler's payoff might have been larger using a smaller classification tree that did not classify perfectly in the learning sample, but that was expected to predict equally well in the test sample.

Some generalizations can be offered about what constitutes the "right-sized" classification tree. It should be sufficiently complex to account for the known facts, but at the same time it should be as simple as possible. It should exploit information that increases predictive accuracy and ignore information that does not. It should, if possible, lead to greater understanding of the phenomena which it describes. Of course, these same characteristics apply to any scientific theory, so we must try to be more specific about what constitutes the "right-sized" classification tree. The options available in the Classification Trees module allow the use of either, or both, of two different strategies for selecting the "right-sized" tree from amongst all the possible trees. One strategy is to grow the tree to just the right size, where the right size is determined by the user from knowledge from previous research, diagnostic information from previous analyses, or even intuition. The other strategy is to use a set of well-documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized" tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least they take subjective judgment out of the process of selecting the "right-sized" tree.

FACT-style direct stopping

We will begin by describing the first strategy, in which the user specifies the size to grow the classification tree. This strategy is followed by selecting FACT-style direct stopping as the Stopping rule for the analysis, and by specifying the Fraction of objects which allows the tree to grow to the desired size. The Classification Trees module provides several options for obtaining diagnostic information to determine the reasonableness of the choice of size for the tree. Specifically, three options are available for performing cross-validation of the selected classification tree.

Test sample cross-validation

The first, and most preferred type of cross-validation is test sample cross-validation. In this type of cross-validation, the classification tree is computed from the learning sample, and its predictive accuracy is tested by applying it to predict class membership in the test sample. If the costs for the test sample exceed the costs for the learning sample (remember, costs equal the proportion of misclassified cases when Prior probabilities are Estimated and Misclassification costs are Equal), this indicates poor cross-validation and that a different sized tree might cross-validate better. The test and learning samples can be formed by collecting two independent data sets, or if a large learning sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for use as the test sample.

In the Classification Trees module, test sample cross-validation is performed by specifying a sample identifier variable which contains codes identifying the sample (learning or test) to which each case or object belongs. The Test sample Misclassification matrix spreadsheet can be displayed to inspect the number of cases or objects in the test sample in each observed class (columns) that are misclassified as another class (rows). This spreadsheet also gives the Cross-Validation (CV) cost and the standard deviation of the CV cost based on the test sample misclassifications. The Test sample predicted classes spreadsheet displays for each object or case in the test sample the object or case number (or case name, if used), the observed class, the predicted class, and the terminal node in the classification tree to which the object was assigned.

V-fold cross-validation

The second type of cross-validation available in the Classification Trees module is V-fold cross-validation. This type of cross-validation is useful when no test sample is available and the learning sample is too small to have the test sample taken from it. The user-specified V value for V-fold cross-validation (the default value is 3) determines the number of random subsamples, as equal in size as possible, that are formed from the learning sample. The classification tree of the specified size is computed V times, each time leaving out one of the subsamples from the computations, and using that subsample as a test sample for cross-validation, so that each subsample is used V - 1 times in the learning sample and just once as the test sample. The CV costs computed for each of the V test samples are then averaged to give the V-fold estimate of the CV costs, which can be displayed, along with its standard error, on the Tree sequence spreadsheet.

Global cross-validation

The third type of cross-validation available in the Classification Trees module is Global cross-validation. In global cross-validation, the entire analysis is replicated a specified number of times (3 is the default) holding out a fraction of the learning sample equal to 1 over the specified number of times, and using each hold-out sample in turn as a test sample to cross-validate the selected classification tree. This type of cross-validation is probably no more useful than V-fold cross-validation when FACT-style direct stopping is used, but can be quite useful as a method validation procedure when automatic tree selection techniques are used (for discussion, see Breiman et. al., 1984). This brings us to the second of the two strategies that can used to select the "right-sized" tree, an automatic tree selection method based on a technique developed by Breiman et al. (1984) called minimal cost-complexity cross-validation pruning.

Minimal cost-complexity cross-validation pruning

In the Classification Trees module, minimal cost-complexity cross-validation pruning is performed if Prune on misclassification error has been selected as the Stopping rule, and minimal deviance-complexity cross-validation pruning is performed if Prune on deviance has been selected as the Stopping rule. The only difference in the two options is the measure of prediction error that is used. Prune on misclassification error uses the costs that we have discussed repeatedly (which equal the misclassification rate when Prior probabilities are Estimated and Misclassification costs are Equal). Prune on deviance uses a measure, based on maximum-likelihood principles, called the deviance (see Ripley, 1996). We will focus on cost-complexity cross-validation pruning (as originated by Breiman et. al., 1984), since deviance-complexity pruning merely involves a different measure of prediction error.

The costs needed to perform cost-complexity pruning are computed as the tree is being grown, starting with the split at the root node up to its maximum size, as determined by the specified Minimum n. The learning sample costs are computed as each split is added to the tree, so that a sequence of generally decreasing costs (reflecting better classification) are obtained corresponding to the number of splits in the tree. The learning sample costs are called resubstitution costs to distinguish them from CV costs, because V-fold cross-validation is also performed as each split is added to the tree. Use the estimated CV costs from V-fold cross-validation as the costs for the root node. Note that tree size can be taken to be the number of terminal nodes, because for binary trees the tree size starts at one (the root node) and increases by one with each added split. Now, define a parameter called the complexity parameter whose initial value is zero, and for every tree (including the first, containing only the root node), compute the value for a function defined as the costs for the tree plus the complexity parameter times the tree size. Increase the complexity parameter continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree. Take the smaller-sized tree to be the new largest tree, continue increasing the complexity parameter continuously until the value of the function for the largest tree exceeds the value of the function for a smaller-sized tree, and continue the process until the root node is the largest tree. (Those who are familiar with numerical analysis will recognize the use of a penalty function in this algorithm. The function is a linear combination of costs, which generally decrease with tree size, and tree size, which increases linearly. As the complexity parameter is increased, larger trees are penalized for their complexity more and more, until a discrete threshold is reached at which a smaller-sized tree's higher costs are outweighed by the largest tree's higher complexity)

The sequence of largest trees obtained by this algorithm have a number of interesting properties. They are nested, because successively pruned trees contain all the nodes of the next smaller tree in the sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of the same size with lower costs. Proofs and/or explanations of these properties can be found in Breiman et al. (1984).

Tree selection after pruning

We now select the "right-sized" tree from the sequence of optimally pruned trees. A natural criterion is the CV costs. While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized" tree, oftentimes there will be several trees with CV costs close to the minimum. Breiman et al. (1984) make the reasonable suggestion that one should choose as the "right-sized" tree the smallest-sized (least complex) tree whose CV costs do not differ appreciably from the minimum CV costs. They proposed a "1 standard error rule" for making this selection, i.e., choose as the "right-sized" tree the smallest-sized tree whose CV costs do not exceed the minimum CV costs plus 1 times the Standard error of the CV costs for the minimum CV costs tree.

In the Classification Trees module, a multiple other than the 1 (the default) can be specified for the Standard Error rule. Thus, specifying a value of 0.0 would result in the minimal CV cost tree being selected as the "right-sized" tree. Values greater than 1.0 could lead to trees much smaller than the minimal CV cost tree being selected as the "right-sized" tree.

One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid "overfitting" and "underfitting" of the data. The graph below shows a typical plot of the Resubstitution costs and CV costs for the sequence of successively pruned trees.

As shown in this graph, the Resubstitution costs (e.g., the misclassification rate in the learning sample) rather consistently decrease as tree size increases. The CV costs, on the other hand, approach the minimum quickly as tree size initially increases, but actually start to rise as tree size becomes very large. Note that the selected "right-sized" tree is close to the inflection point in the curve, that is, close to the point where the initial sharp drop in CV costs with increased tree size starts to level out. The "automatic" tree selection procedure is designed to select the simplest (smallest) tree with close to minimum CV costs, and thereby avoid the loss in predictive accuracy produced by "underfitting" or "overfitting" the data (note the similarity to the logic underlying the use of a scree plot to determine the number of extracted factors to retain in factor analysis).

As has been seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized" tree selection is a truly "automatic" process. The algorithms make all the decisions leading to selection of the "right-sized" tree, except for, perhaps, specification of a value for the Standard error rule. One issue that arises with the use of such "automatic" procedures is how well the results replicate, where replication might involve the selection of trees of quite different sizes across replications, given the "automatic" selection process that is used. This is where global cross-validation can be very useful. As explained previously, in global cross-validation, the entire analysis is replicated a specified number of times (3 is the default) holding out a fraction of the cases to use as a test sample to cross-validate the selected classification tree. If the average of the costs for the test samples, called the global CV costs, exceeds the CV costs for the selected tree, or if the standard error of the global CV costs exceeds the standard error of the CV costs for the selected tree, this indicates that the "automatic" tree selection procedure is allowing too much variability in tree selection rather than consistently selecting a tree with minimum estimated costs.

Classification trees and traditional methods

As can be seen in the methods used in computing classification trees, in a number of respects classification trees are decidedly different from traditional statistical methods for predicting class membership on a categorical dependent variable. They employ a hierarchy of predictions, with many predictions sometimes being applied to particular cases, to sort the cases into predicted classes. Traditional methods use simultaneous techniques to make one and only one class membership prediction for each and every case. In other respects, such as having as its goal accurate prediction, classification tree analysis is indistinguishable from traditional methods. Time will tell if classification tree analysis has enough to commend itself to become as accepted as the traditional methods.

For information on the basic purpose of classification trees, see Basic Ideas. For information on the hierarchical nature and flexibility of classification trees, see Characteristics of Classification Trees.

See also, Exploratory Data Analysis and Data Mining Techniques.