Boosted Trees - Computational Details

The implementation of the algorithms in Statistica Boosted Trees for regression problems closely follows the descriptions in Friedman (1999a, 1999b).

Regression problems

Specifically, as described in Algorithm 2 of Friedman (1999b, p. 3), the following steps are performed:

For each of the M trees requested by the user (and, in classification analyses, for each category or class of the dependent variable):

    1. Draw a random sample of the observations, with the sampling fraction as requested by the user
    2. Compute the predicted values for the observations in that sample
    3. Fit a regression tree of the requested complexity to the residuals; note that Statistica always uses a least-squares error function for fitting the tree
    4. Update the weighted (by the constant learning or shrinkage rate) additive expansion of terms (simple regression trees); see the sketch following this list
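The following sketch illustrates this boosting loop. It is not Statistica's implementation: it assumes NumPy arrays for the data and scikit-learn's DecisionTreeRegressor as the simple regression tree, and names such as boost_regression, n_trees, learning_rate, subsample_fraction, and max_depth are illustrative only.

    # Minimal sketch of the stochastic gradient boosting loop for regression
    # (least-squares loss); parameter names are illustrative, not Statistica's.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boost_regression(X, y, n_trees=200, learning_rate=0.1,
                         subsample_fraction=0.5, max_depth=3, random_state=0):
        rng = np.random.default_rng(random_state)
        n = len(y)
        F0 = y.mean()                      # initial (constant) prediction
        F = np.full(n, F0)                 # current additive expansion
        trees = []
        for m in range(n_trees):
            # 1. Draw a random sub-sample of the observations
            idx = rng.choice(n, size=int(subsample_fraction * n), replace=False)
            # 2. Compute predicted values and residuals for that sub-sample
            residuals = y[idx] - F[idx]
            # 3. Fit a simple regression tree (least squares) to the residuals
            tree = DecisionTreeRegressor(max_depth=max_depth, random_state=random_state)
            tree.fit(X[idx], residuals)
            # 4. Update the additive expansion, shrunk by the learning rate
            F += learning_rate * tree.predict(X)
            trees.append(tree)
        return F0, trees

    def predict(F0, trees, X, learning_rate=0.1):
        # Sum the shrunken contributions of all simple trees
        return F0 + learning_rate * sum(t.predict(X) for t in trees)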

In addition, at each boosting step the program will compute the prediction residuals for an independently drawn sample of observations. The average squared error (residual) for that sample can be plotted in a line graph (from the Results dialog box), along with the residuals computed for the training sample. From this error function for the testing data, the program will automatically determine the optimal number of additive expansions, i.e., the number of boosted trees that achieves good predictive validity. Specifically, the program will select the solution that yields the smallest average squared error in the testing data.
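A minimal sketch of this selection rule, assuming the hypothetical F0 and trees returned by the boost_regression sketch above: the average squared error on the test sample is recorded after each boosting step, and the step with the smallest test error is chosen.

    import numpy as np

    def best_number_of_trees(F0, trees, learning_rate, X_test, y_test):
        # Track the hold-out (test sample) error after each boosting step
        F = np.full(len(y_test), F0)
        test_errors = []
        for tree in trees:
            F += learning_rate * tree.predict(X_test)
            test_errors.append(np.mean((y_test - F) ** 2))  # average squared residual
        # The optimal number of trees yields the smallest test-sample error
        return int(np.argmin(test_errors)) + 1, test_errors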

Classification problems

The algorithm for classification problems closely follows Algorithm 6, as described in Friedman (1999a, p. 11), and is essentially the same as that described for regression problems, with the following exceptions.

Instead of a single set of successive regression trees, the program will build an independent set of boosted trees for each category of the categorical dependent variable. Further, instead of simple regression residuals (computed at each step), the program will apply the logistic transformation to the predicted values (see also Nonlinear Estimation) before computing the residuals (scaled to a probability scale). To compute the final classifications, the logistic transformation is again applied to the regression prediction for each category over the successive trees; see also the description of the multi-class logistic regression and classification model in Friedman (1999a, p. 10). Note that as described in Friedman (1999a; e.g., Algorithm 6), the algorithm does not adjust the prediction probabilities for each class and each tree for unequal prior probabilities or misclassification costs for the different classes. In the implementation of this algorithm in Statistica, the prediction probabilities for each class are adjusted for unequal prior probabilities and misclassification costs [as described, for example, in Breiman et al., 1984; see also General Classification and Regression Trees (GC&RT)].
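A sketch of the transformation from per-class boosted scores to class probabilities and classifications is shown below, assuming the per-class scores F_k(x) are the accumulated predictions of the trees built for class k. The multi-class logistic transform follows Friedman (1999a); the prior/cost adjustment is shown only as an illustrative reweighting and is not necessarily the exact formula used by Statistica.

    import numpy as np

    def class_probabilities(F, class_weights=None):
        """F: array of shape (n_samples, n_classes) of per-class boosted scores."""
        F = F - F.max(axis=1, keepdims=True)    # for numerical stability
        p = np.exp(F)
        p /= p.sum(axis=1, keepdims=True)       # multi-class logistic transform
        if class_weights is not None:           # hypothetical prior/cost weights
            p = p * class_weights
            p /= p.sum(axis=1, keepdims=True)
        return p

    def classify(F, class_weights=None):
        # Predicted class = category with the largest (adjusted) probability
        return np.argmax(class_probabilities(F, class_weights), axis=1)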

Predictor importance

Use the options on the Results dialog - Quick tab to plot and display the relative importance of each predictor in the final solution (sequence of trees). The predictor importance is computed as follows:

During the building of each tree, for each split, predictor statistics (such as the sums of squares regression, since simple regression trees are built in all cases) are computed for each predictor variable; the predictor variable yielding the best split at the respective node is then chosen for the actual split. The program also computes the average of the predictor statistic for each variable over all splits and over all trees in the boosting sequence. The final predictor importance values are computed by normalizing those averages so that the highest average is assigned the value 1, and the importance of each other predictor is expressed as the magnitude of its average predictor statistic relative to that of the most important predictor.
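A simplified sketch of this computation, assuming scikit-learn regression trees: unlike the description above, it credits the split statistic (here, the weighted decrease in squared error) only to the predictor actually chosen at each split, but the averaging over splits and trees and the normalization so that the most important predictor equals 1 follow the description.

    import numpy as np

    def predictor_importance(trees, n_features):
        totals = np.zeros(n_features)
        counts = np.zeros(n_features)
        for tree in trees:
            t = tree.tree_
            for node in range(t.node_count):
                left, right = t.children_left[node], t.children_right[node]
                if left == -1:                  # leaf node, no split
                    continue
                var = t.feature[node]
                # Weighted decrease in node impurity achieved by this split
                gain = (t.weighted_n_node_samples[node] * t.impurity[node]
                        - t.weighted_n_node_samples[left] * t.impurity[left]
                        - t.weighted_n_node_samples[right] * t.impurity[right])
                totals[var] += gain
                counts[var] += 1
        # Average the split statistic per predictor, then scale so the top predictor = 1
        averages = np.divide(totals, np.maximum(counts, 1))
        return averages / averages.max()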

Default partitioning of data into training and testing (hold-out) samples

If no Test sample variable and code are specified on the Advanced tab of the Boosted Trees Specifications dialog box, the program will automatically create a random sub-sample of 30% of the observations (cases) in the data and treat them as a test sample in order to evaluate the fit of the model over successive iterations in this separate (hold-out) sample. This leaves the remaining 70% of the observations for the analyses via stochastic gradient boosting (e.g., for the selection of samples for consecutive boosting steps). By default, the program will choose the solution (i.e., the specific number of simple boosted trees) that yields the smallest error (misclassification rate) over all boosting iterations. Use the Test sample options on the Boosted Trees Specifications dialog - Advanced tab if you want to select a specific sub-sample for the test (hold-out) sample.
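A small sketch of this default partitioning is shown below; the 30% test fraction matches the description above, while the function name and random seed are illustrative.

    import numpy as np

    def default_partition(n_cases, test_fraction=0.30, random_state=0):
        # Randomly assign 30% of cases to the test (hold-out) sample
        rng = np.random.default_rng(random_state)
        order = rng.permutation(n_cases)
        n_test = int(round(test_fraction * n_cases))
        test_idx, train_idx = order[:n_test], order[n_test:]
        return train_idx, test_idx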