Characteristics of Classification Trees - Hierarchical Nature of Classification Trees
Breiman et al. (1984) give a number of examples of the use of classification trees. As one example, when heart attack patients are admitted to a hospital, dozens of tests are often performed to obtain physiological measures such as heart rate, blood pressure, and so on. A wide variety of other information is also obtained, such as the patient's age and medical history. Patients can subsequently be tracked to see if they survive the heart attack for, say, at least 30 days.
It would be useful in developing treatments for heart attack patients, and in advancing medical theory on heart failure, if measurements taken soon after hospital admission could be used to identify high-risk patients (those who are not likely to survive at least 30 days). One classification tree that Breiman et al. (1984) developed to address this problem was a simple, three-question decision tree. Verbally, the binary classification tree can be described by the statement, "If the patient's minimum systolic blood pressure over the initial 24-hour period is greater than 91, then if the patient's age is over 62.5 years, then if the patient displays sinus tachycardia, then and only then the patient is predicted not to survive for at least 30 days."
It is easy to conjure up the image of a decision "tree" from such a statement. A hierarchy of questions is asked, and the final decision that is made depends on the answers to all the previous questions. Similarly, the relationship of a leaf to the tree on which it grows can be described by the hierarchy of splits of branches (starting from the trunk) leading to the last branch from which the leaf hangs. The hierarchical nature of classification trees is one of their most basic features (but the analogy with trees in nature should not be taken too far; most decision trees are drawn downward on paper, so the more exact analogy in nature would be a decision root system leading to the root tips, hardly a poetic image).
The hierarchical nature of classification trees is illustrated by a comparison to the decision-making procedure employed in discriminant analysis. A traditional linear discriminant analysis of the heart attack data would produce a set of coefficients defining the single linear combination of blood pressure, patient age, and sinus tachycardia measurements that best differentiates low-risk from high-risk patients. A score for each patient on the linear discriminant function would be computed as a composite of that patient's measurements on the three predictor variables, weighted by the respective discriminant function coefficients. The predicted classification of each patient as low risk or high risk would be made by simultaneously considering the patient's scores on the three predictor variables. That is, suppose P (minimum systolic blood pressure over the 24-hour period), A (Age in years), and T (presence of sinus Tachycardia: 0 = not present; 1 = present) are the predictor variables, p, a, and t are the corresponding linear discriminant function coefficients, and c is the "cut point" on the discriminant function for separating the two classes of heart attack patients. The decision equation for each patient would be of the form "if pP + aA + tT - c is less than or equal to zero, the patient is low risk, else the patient is high risk."
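The single decision equation above can be sketched in a few lines of Python. This is only an illustration: the function name `discriminant_risk` and the coefficient values in `coef` are hypothetical, chosen for the example, and are not estimates from the Breiman et al. (1984) data.

```python
def discriminant_risk(P, A, T, p, a, t, c):
    """Classify one patient with a single linear decision equation.

    P, A, T are the predictor values; p, a, t are the discriminant
    function coefficients and c is the cut point.
    """
    # All three predictors enter the score at once (simultaneously).
    score = p * P + a * A + t * T - c
    return "low" if score <= 0 else "high"


# Hypothetical coefficients, invented purely for illustration.
coef = dict(p=-0.02, a=0.05, t=1.0, c=1.5)

print(discriminant_risk(P=120, A=50, T=0, **coef))
print(discriminant_risk(P=100, A=80, T=1, **coef))
```

Note that every predictor contributes to every patient's score; no predictor is ever skipped on the basis of another predictor's value.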
In comparison, the decision tree developed by Breiman et al. (1984) would have the following hierarchical form, where p, a, and t would be -91, -62.5, and 0, respectively: "If p + P is less than or equal to zero, the patient is low risk, else if a + A is less than or equal to zero, the patient is low risk, else if t + T is less than or equal to zero, the patient is low risk, else the patient is high risk." Superficially, the discriminant analysis and classification tree decision processes might appear similar, because both involve coefficients and decision equations. But the difference between the simultaneous decisions of discriminant analysis and the hierarchical decisions of classification trees cannot be emphasized enough.
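The hierarchical rule can be written out directly as nested conditions. The sketch below follows the rule exactly as it is worded in this section, with p = -91, a = -62.5, and t = 0; the function name `tree_risk` is hypothetical.

```python
def tree_risk(P, A, T, p=-91.0, a=-62.5, t=0.0):
    """Classify one patient with the three-question hierarchical rule."""
    # Question 1: blood pressure. Later questions are asked only if
    # this one does not already settle the classification.
    if p + P <= 0:
        return "low"
    # Question 2: age, asked only of patients who passed question 1.
    if a + A <= 0:
        return "low"
    # Question 3: sinus tachycardia, asked only of the remaining patients.
    if t + T <= 0:
        return "low"
    return "high"


print(tree_risk(P=85, A=70, T=1))   # settled by question 1
print(tree_risk(P=120, A=50, T=1))  # settled by question 2
print(tree_risk(P=120, A=70, T=0))  # settled by question 3
print(tree_risk(P=120, A=70, T=1))  # all three questions answered "yes"
```

Unlike a single linear score, each question here is evaluated only when the earlier questions have not already settled the classification, so a patient's age or tachycardia status may never be consulted at all.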
The distinction between the two approaches can perhaps be made most clear by considering how each analysis would be performed as a regression. Because risk in the example of Breiman et al. (1984) is a dichotomous dependent variable, the discriminant analysis predictions could be reproduced by a simultaneous multiple regression of risk on the three predictor variables for all patients. The classification tree predictions could only be reproduced by three separate simple regression analyses, where risk is first regressed on P for all patients, then risk is regressed on A for patients not classified as low risk in the first regression, and finally, risk is regressed on T for patients not classified as low risk in the second regression. This clearly illustrates the simultaneous nature of discriminant analysis decisions as compared to the recursive, hierarchical nature of classification tree decisions, a characteristic of classification trees that has far-reaching implications.
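The recursive narrowing of the patient set can be made concrete with a short sketch. It does not fit any regressions; it only shows which patients each of the three successive analyses would see. The four patient records are invented for illustration, the thresholds are those quoted in this section, and the name `staged_subsets` is hypothetical.

```python
# Invented toy records, not data from Breiman et al. (1984).
patients = [
    {"id": 1, "P": 85, "A": 70, "T": 1},
    {"id": 2, "P": 120, "A": 50, "T": 1},
    {"id": 3, "P": 120, "A": 70, "T": 0},
    {"id": 4, "P": 120, "A": 70, "T": 1},
]


def staged_subsets(patients):
    """Return the patient ids seen at each stage and the final high-risk ids."""
    stages = []
    remaining = list(patients)
    for var, threshold in (("P", 91.0), ("A", 62.5), ("T", 0.0)):
        # This stage's analysis would use only the patients still remaining.
        stages.append((var, [r["id"] for r in remaining]))
        # Patients at or below the threshold are classified low risk and dropped.
        remaining = [r for r in remaining if r[var] > threshold]
    return stages, [r["id"] for r in remaining]


stages, high_risk = staged_subsets(patients)
for var, ids in stages:
    print(var, ids)
print("high risk:", high_risk)
```

A simultaneous multiple regression, by contrast, would use all four records and all three predictors in a single fit.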