Classification Trees Overview

Basic Ideas

Classification Trees

Classification trees are used to predict membership of cases or objects in the classes of a categorical dependent variable from their measurements on one or more predictor variables. Classification tree analysis is one of the main techniques used in so-called Data Mining. The Classification Trees module is a full-featured implementation of techniques for computing binary classification trees based on univariate splits for categorical predictor variables, ordered predictor variables (measured on at least an ordinal scale), or a mix of both types of predictors. It also has options for computing classification trees based on linear combination splits for interval scale predictor variables.

The goal of classification trees is to predict or explain responses on a categorical dependent variable, and as such, the techniques in this module have much in common with the techniques used in the more traditional methods of Discriminant Analysis, Cluster Analysis, Nonparametric Statistics, and Nonlinear Estimation. The flexibility of classification trees makes them a very attractive analysis option, but this is not to say that their use is recommended to the exclusion of more traditional methods. Indeed, when the typically more stringent theoretical and distributional assumptions of more traditional methods are met, the traditional methods may be preferable. But as an exploratory technique, or as a technique of last resort when traditional methods fail, classification trees are, in the opinion of many researchers, unsurpassed.

What are classification trees? Imagine that you want to devise a system for sorting a collection of coins into different classes (perhaps pennies, nickels, dimes, quarters). Suppose that there is a measurement on which the coins differ, say diameter, which can be used to devise a hierarchical system for sorting coins. You might roll the coins on edge down a narrow track in which a slot the diameter of a dime is cut. If the coin falls through the slot it is classified as a dime, otherwise it continues down the track to where a slot the diameter of a penny is cut. If the coin falls through the slot it is classified as a penny, otherwise it continues down the track to where a slot the diameter of a nickel is cut, and so on. You have just constructed a classification tree. The decision process used by your classification tree provides an efficient method for sorting a pile of coins, and more generally, can be applied to a wide variety of classification problems.

The study and use of classification trees are not widespread in the fields of probability and statistical pattern recognition (Ripley, 1996), but classification trees are widely used in applied fields as diverse as medicine (diagnosis), computer science (data structures), botany (classification), and psychology (decision theory). Classification trees readily lend themselves to being displayed graphically, helping to make them easier to interpret than they would be if only a strict numerical interpretation were possible.

Classification trees can be and sometimes are quite complex. However, graphical procedures can be developed to help simplify interpretation even for complex trees. If one's interest is mainly in the conditions that produce a particular class of response, perhaps a High response, a 3D Contour Plot can be produced to identify which terminal node of the classification tree classifies most of the cases with High responses.

In the example illustrated by this 3D Contour Plot, one could "follow the branches" leading to terminal node 8 to obtain an understanding of the conditions leading to High responses.

Amenability to graphical display and ease of interpretation are perhaps partly responsible for the popularity of classification trees in applied fields, but two features that characterize classification trees more generally are their hierarchical nature and their flexibility.

The hierarchical nature and flexibility of classification trees is described in Characteristics of Classification Trees. For information on techniques and issues in computing classification trees, see Computational Methods.

See also, Exploratory Data Analysis and Data Mining Techniques.