Example 1: Classification via Boosted Trees
This example illustrates the use of classification trees for pattern recognition. The example data set used here is also discussed in the Classification Trees Analysis module [see Example 1: Discriminant-Based Splits for Categorical Predictors, as well as the General Classification and Regression Trees (GC&RT) module - Example 1: Pattern Recognition (Classification of Digits)].
The data for the analysis were generated in a manner similar to the way that a faulty calculator would display numerals on a digital display (for a description of how these data were generated, see Breiman et al., 1984). The numerals from one through nine and zero that were entered on the keypad of a calculator formed the observed classes on the dependent variable Digit. There were 7 categorical predictors, Var1 through Var7. The levels on these categorical predictors (0 = absent; 1 = present) correspond to whether or not each of the 7 lines (3 horizontal and 4 vertical) on the digital display was illuminated when the numeral was entered on the calculator. The predictor variable to line correspondence is Var1 - top horizontal, Var2 - upper-left vertical, Var3 - upper-right vertical, Var4 - middle horizontal, Var5 - lower-left vertical, Var6 - lower-right vertical, and Var7 - bottom horizontal. The first 10 cases of the data set are shown below. The complete data set containing a total of 500 cases is available in the example data file Digit.sta. Open this data file via the File - Open Examples menu; it is in the Datasets folder.
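To make the structure of the data concrete, the following Python sketch generates cases of the same general form. It is not the procedure used to create Digit.sta: the segment patterns are the conventional seven-segment shapes, and the function name make_faulty_digits, the 10% fault rate (flip_prob), and the random seed are assumptions chosen for illustration. Consult Breiman et al. (1984) for the actual generating procedure.

```python
import random

# Conventional seven-segment patterns (an assumption), ordered as Var1..Var7:
# top, upper-left, upper-right, middle, lower-left, lower-right, bottom.
SEGMENTS = {
    0: (1, 1, 1, 0, 1, 1, 1),
    1: (0, 0, 1, 0, 0, 1, 0),
    2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1),
    4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1),
    6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0),
    8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1),
}


def make_faulty_digits(n_cases, flip_prob=0.1, seed=0):
    """Return n_cases rows of (digit, [Var1..Var7]) with random segment faults.

    flip_prob is a hypothetical fault rate, not a value taken from Digit.sta.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_cases):
        digit = rng.randrange(10)
        # Each segment independently shows the wrong state with probability flip_prob.
        observed = [s ^ 1 if rng.random() < flip_prob else s
                    for s in SEGMENTS[digit]]
        rows.append((digit, observed))
    return rows


cases = make_faulty_digits(500)
for digit, segments in cases[:10]:
    print(digit, segments)
```

Because the faults are random, two different digits can produce the same observed pattern, which is why the predictors are described as faulty and why no model can be expected to classify every case correctly.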

Because the goal of this analysis is to build a prediction model for the different digits based on several (faulty) categorical predictor variables, accept the default Classification Analysis option and click OK to display the Boosted Trees Specifications dialog.
On the Quick tab, click the Variables button to display the variable selection dialog, select Digit as the categorical dependent variable and all others as the categorical predictor variables, and click the OK button to return to the Specifications dialog.

There are a number of additional options available on the Classification and Advanced tabs of this dialog, which can be used to "fine-tune" the analysis. You can use the options on the Classification tab to specify particular a priori class probabilities and unequal misclassification costs; using the options on the Advanced tab, you can determine the complexity of the individual trees you want to build in each boosting step, as well as the total number of boosting steps.
For the purpose of this analysis (and as a useful "first step" for most analyses), let's accept the defaults; thus, click OK. You will see the Computing dialog for a few moments as the consecutive boosting steps are computed, and then the Results dialog will be displayed.


This graph demonstrates the basic mechanism of how the stochastic gradient boosting algorithm implemented in Statistica can avoid overfitting (see also the Introductory Overview). As more and more additive terms (simple trees) are added to the model, the average squared error function for the training data (from which the respective trees were estimated) will decrease. However, the error estimate for the testing data will at some point start to increase, clearly marking the point where evidence for overfitting is beginning to show.
By default, the program will designate 54 as the optimal number of trees (in this case; because of the random subsampling of training data in successive boosting steps, your results may be slightly different); this happens to be the point where the smallest error for the testing data occurred. You can use the Number of trees option on the Boosted Trees Results dialog - Quick tab to select a specific solution, i.e., number of trees in the final model.
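The mechanism described above can be sketched in a few lines of Python. This is a simplified illustration, not Statistica's implementation: it boosts one-split trees (stumps) on a single 0/1 target under squared error, uses synthetic data, and picks the number of trees with the smallest testing error. All names and settings (fit_stump, boost, rate, subsample, the data-generating rule) are assumptions made for the example.

```python
import random


def fit_stump(X, residuals):
    """Pick the binary predictor whose 0/1 split best reduces squared error."""
    best = None
    for j in range(len(X[0])):
        groups = {0: [], 1: []}
        for row, r in zip(X, residuals):
            groups[row[j]].append(r)
        means = {v: (sum(g) / len(g) if g else 0.0) for v, g in groups.items()}
        sse = sum((r - means[row[j]]) ** 2 for row, r in zip(X, residuals))
        if best is None or sse < best[0]:
            best = (sse, j, means)
    return best[1], best[2]


def mse(pred, y):
    return sum((t - p) ** 2 for p, t in zip(pred, y)) / len(y)


def boost(X_train, y_train, X_test, y_test,
          n_trees=100, rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with stumps; returns the best tree count."""
    rng = random.Random(seed)
    pred_train = [0.0] * len(y_train)
    pred_test = [0.0] * len(y_test)
    train_err, test_err = [], []
    n_sub = max(1, int(subsample * len(X_train)))
    for _ in range(n_trees):
        # Each step fits a stump to the residuals of a random subsample.
        idx = rng.sample(range(len(X_train)), n_sub)
        j, means = fit_stump([X_train[i] for i in idx],
                             [y_train[i] - pred_train[i] for i in idx])
        # Add the shrunken stump to the running predictions.
        for i, row in enumerate(X_train):
            pred_train[i] += rate * means[row[j]]
        for i, row in enumerate(X_test):
            pred_test[i] += rate * means[row[j]]
        train_err.append(mse(pred_train, y_train))
        test_err.append(mse(pred_test, y_test))
    # The "optimal" model is the one with the smallest testing error.
    best_n = min(range(n_trees), key=lambda k: test_err[k]) + 1
    return best_n, train_err, test_err


def make_data(n, rng):
    """Synthetic data: 7 binary predictors, noisy majority-of-three target."""
    X, y = [], []
    for _ in range(n):
        row = [rng.randrange(2) for _ in range(7)]
        label = 1 if row[0] + row[3] + row[5] >= 2 else 0
        if rng.random() < 0.1:  # 10% label noise
            label = 1 - label
        X.append(row)
        y.append(label)
    return X, y


rng = random.Random(1)
X_train, y_train = make_data(300, rng)
X_test, y_test = make_data(200, rng)
best_n, train_err, test_err = boost(X_train, y_train, X_test, y_test)
print("trees with smallest testing error:", best_n)
```

Changing the seed changes the random subsamples and therefore the selected number of trees, which mirrors the remark above that your results may differ slightly from run to run.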

As you can see, the final solution is remarkably accurate over all digits. You may want to review the various additional summary statistics (e.g., Risk estimates) to gauge the quality of different solutions, i.e., for different numbers of additive terms (simple trees).


The predictor importance is computed as the relative (scaled) average value of the predictor statistic over all trees. So, for example, in this case it is the average value of the sums-of-squares prediction over all categories and over all trees and nodes, scaled so that the maximum value of that sum is equal to 1. Hence, these values reflect the strength of the relationship between the predictors and the dependent variable of interest over the successive boosting steps. In this case, variables Var2, Var4, and Var5 stand out as the most important predictors.
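The scaling step can be illustrated with a short Python sketch. The function name scaled_importance and the per-tree statistic values are hypothetical and are not output from this analysis; the sketch only shows averaging a predictor statistic over trees and dividing by the largest average so that the top predictor scores 1.

```python
def scaled_importance(per_tree_stats):
    """Average each predictor's statistic over all trees, then scale to max 1.

    per_tree_stats: list of dicts, one per tree, mapping a predictor name to
    the statistic it contributed in that tree (absent means 0).
    """
    names = sorted({name for tree in per_tree_stats for name in tree})
    n_trees = len(per_tree_stats)
    avg = {name: sum(tree.get(name, 0.0) for tree in per_tree_stats) / n_trees
           for name in names}
    top = max(avg.values())
    return {name: (value / top if top > 0 else 0.0)
            for name, value in avg.items()}


# Hypothetical statistics for two trees (illustrative numbers only).
trees = [
    {"Var2": 4.0, "Var4": 3.0, "Var7": 1.0},
    {"Var2": 2.0, "Var4": 3.0, "Var5": 2.0},
]
importance = scaled_importance(trees)
for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(name, round(value, 3))
```

Because the values are relative, they rank predictors within one model and should not be compared directly across models built on different data.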