Random Forests Overview

The STATISTICA Random Forest module is a complete implementation of the Random Forest algorithm developed by Breiman. In STATISTICA, this technique can be used for regression-type problems (to predict a continuous dependent variable) as well as classification problems (to predict a categorical dependent variable).

Estimation
You have full control over all key aspects of the estimation procedure and model parameters, including the complexity of the trees fitted to the data, the maximum number of trees in the forest, and the criteria for stopping the algorithm once satisfactory results have been achieved. You can also specify an independent testing sample to evaluate the predictive validity of your model. If no specific testing sample is selected, STATISTICA will randomly select one and then determine the best solution (the best number of simple trees) based on how well the respective models predict the cases in that testing sample.
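To make this idea concrete, the following Python fragment is a minimal sketch of selecting the number of trees by test-sample performance. It is not STATISTICA's internal procedure; the scikit-learn estimator and the placeholder data X, y are assumptions made only for the example.

# Sketch: pick the number of trees that performs best on a held-out testing sample.
# Assumes scikit-learn and NumPy; X and y are hypothetical predictors and a categorical response.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = np.random.rand(500, 10), np.random.randint(0, 2, 500)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

best_n, best_acc = None, -1.0
for n_trees in (10, 50, 100, 200):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0).fit(X_train, y_train)
    acc = forest.score(X_test, y_test)        # accuracy on the held-out testing sample
    if acc > best_acc:
        best_n, best_acc = n_trees, acc
print(best_n, best_acc)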
Results
As with all modules of STATISTICA, STATISTICA Data Miner, STATISTICA Enterprise Server, and STATISTICA Enterprise Server Data Miner, a large number of graphs are provided in the results as aids for the evaluation of the final model.
Deployment for Data Mining
As is the case for all modules for predictive data mining, the final solution can be deployed by generating computer code in C/C++, STATISTICA Visual Basic (SVB), or PMML (for later deployment via the STATISTICA Rapid Deployment engine).

Technical Notes

The STATISTICA Random Forest module is an implementation of the so-called Random Forest classifiers developed by Breiman. The algorithm is also applicable to regression problems. A Random Forest consists of a collection (ensemble) of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates (classifies) a set of independent (predictor) values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors.

A Random Forest consists of an arbitrary number (ensemble) of simple trees, which are used to vote for the most popular class (classification), or their responses are combined (averaged) to obtain an estimate of the dependent variable (regression). Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
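The voting and averaging steps can be sketched as follows. This is an illustrative Python fragment, not the module's internal code; the per-tree predictions shown are hypothetical values for a single case.

from collections import Counter

# Hypothetical per-tree predictions for one case.
class_votes = ["A", "B", "A", "A", "B"]           # classification: one class label per tree
regression_responses = [2.3, 1.9, 2.1, 2.4, 2.0]  # regression: one numeric response per tree

# Classification: the forest predicts the most popular (majority-vote) class.
forest_class = Counter(class_votes).most_common(1)[0][0]

# Regression: the forest predicts the average of the individual tree responses.
forest_value = sum(regression_responses) / len(regression_responses)

print(forest_class, forest_value)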

The response of each tree depends on a set of predictor values that is chosen independently (with replacement) and with the same distribution for all trees in the forest; this set is a subset of the predictor values of the original data set. In the STATISTICA Random Forest module, the optimal size of the subset of predictor variables is given by log2(M) + 1, where M is the number of inputs.
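For example, under the simplifying assumption that the predictors are indexed 0 through M-1, the subset size and one random draw of predictors could be computed as in the following sketch (not STATISTICA code):

import math
import random

M = 20                                   # total number of predictor variables
subset_size = int(math.log2(M)) + 1      # log2(M) + 1, the subset size used by the module
predictor_subset = random.sample(range(M), subset_size)  # indices of predictors offered to a tree
print(subset_size, predictor_subset)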

For classification problems, given a set of simple trees and a set of random predictor variables, the Random Forest method defines a margin function that measures the extent to which the average number of votes for the correct class exceeds the average vote for any other class present in the dependent variable. This measure provides us not only with a convenient way of making predictions, but also with a way of associating a confidence measure with those predictions.
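A sketch of this margin computation for a single case follows; the per-tree votes and the known correct class are hypothetical values, and the fragment simply implements the verbal definition above.

from collections import Counter

tree_votes = ["A", "A", "B", "A", "C", "A", "B"]  # hypothetical votes of the individual trees
correct_class = "A"

counts = Counter(tree_votes)
n_trees = len(tree_votes)

# Average vote for the correct class minus the largest average vote for any other class.
votes_correct = counts[correct_class] / n_trees
votes_other = max((c / n_trees for cls, c in counts.items() if cls != correct_class), default=0.0)
margin = votes_correct - votes_other
print(margin)   # a positive margin indicates a confident, correct ensemble prediction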

For regression problems, Random Forests are formed by growing simple trees, each capable of producing a numerical response value (rather than a class label, as in classification). Here, too, the predictor set is randomly selected, with the same distribution, for all trees. Given the above, the mean-square error for a Random Forest is given by:

mean error = (observed – tree response)²

The predictions of the Random Forest are taken to be the average of the predictions of the trees:

forest prediction = (1/K) * Σk (kth tree response)

where the index k runs over the individual trees in the forest and K is the total number of trees.
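As a small worked sketch (hypothetical per-tree responses and observed value, not output of the module), the averaging and the squared error for a single case look like this:

tree_responses = [3.1, 2.8, 3.4, 3.0, 2.9]    # hypothetical responses of the K trees for one case
observed = 3.2

K = len(tree_responses)
forest_prediction = sum(tree_responses) / K   # average over the index k = 1..K
squared_error = (observed - forest_prediction) ** 2
print(forest_prediction, squared_error)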

The implementation of the Random Forest algorithm in STATISTICA can flexibly incorporate missing data in the predictor variables. When missing data are encountered for a particular observation (case) during model building, the prediction made for that case is based on the last preceding (non-terminal) node in the respective tree. So, for example, if at a particular point in the sequence of trees a predictor variable is selected at the root (or other non-terminal) node for which some cases have no valid data, then the prediction for those cases is simply based on the overall mean at the root (or other non-terminal) node. Hence, there is no need to eliminate cases from the analysis if they have missing data for some of the predictors, nor is it necessary to compute surrogate split statistics (e.g., see also the documentation for the Number of surrogates option on the Interactive Trees Specifications dialog - Advanced tab for C&RT).
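The following fragment sketches this idea only; it is not STATISTICA's actual implementation. Each node is assumed to store the mean response of the training cases that reached it, and when the split variable is missing the traversal stops at that node and returns its stored value.

class Node:
    def __init__(self, mean, split_var=None, threshold=None, left=None, right=None):
        self.mean = mean            # mean response of the training cases at this node
        self.split_var = split_var  # predictor used to split (None for a terminal node)
        self.threshold = threshold
        self.left, self.right = left, right

def predict(node, case):
    # Descend the tree; if the split variable is missing for this case, fall back to the node's mean.
    while node.split_var is not None:
        value = case.get(node.split_var)
        if value is None:
            return node.mean        # missing predictor: prediction from the last valid node
        node = node.left if value <= node.threshold else node.right
    return node.mean

tree = Node(5.0, "x1", 2.0, left=Node(3.0), right=Node(7.0))
print(predict(tree, {"x1": 1.5}))   # 3.0 (normal traversal to a terminal node)
print(predict(tree, {"x2": 9.9}))   # 5.0 (x1 missing: prediction based on the root node)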

Importance

Generally, in data mining projects, the input variables are not equally informative; often a few variables have a much greater effect on the target variable than the others. It is therefore useful to learn the relative influence (importance) of each predictor in predicting the response.

For a given decision tree, STATISTICA computes predictor importance by summing, over all nodes in the tree, the drop (delta) in node impurity attributable to each predictor, and then expressing these sums relative to the largest sum found over all predictors (i.e., relative to the most important variable).

In general, the delta or decrease in node impurity is given by:

Δi(s, t) = i(t) – pR*i(tR) – pL*i(tL)

where the split s of a parent node t sends a proportion pR of the cases in t to the right daughter node tR and a proportion pL to the left daughter node tL. The impurity measure i(t) of the node is typically given by the Gini measure (the default) for a classification problem and by the sum of squares (SS) for a regression problem. For a Random Forest, the variable importance measures are summed across all trees in the forest and scaled in the same manner, so that the most important variable has a value of 1.
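As a brief sketch of these two steps (Gini impurity and the final scaling), the fragment below uses hypothetical class proportions for one split and hypothetical summed decreases per predictor; it is illustrative only, not the module's computation.

def gini(proportions):
    # Gini impurity of a node from its class proportions.
    return 1.0 - sum(p * p for p in proportions)

# Hypothetical split: parent node t sends 60% of its cases left and 40% right.
i_t = gini([0.5, 0.5])
i_left, i_right = gini([0.8, 0.2]), gini([0.1, 0.9])
p_left, p_right = 0.6, 0.4
delta_i = i_t - p_right * i_right - p_left * i_left   # decrease in node impurity for this split

# Importance: sum the decreases per predictor over all nodes and trees, then scale so the maximum is 1.
raw_importance = {"x1": 12.4, "x2": 3.1, "x3": 7.7}   # hypothetical summed decreases
max_sum = max(raw_importance.values())
scaled = {var: s / max_sum for var, s in raw_importance.items()}
print(delta_i, scaled)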

Further details of various methods using trees can be found in General Classification and Regression Tree Models, Boosted Tree Classification and Regression, and Interactive Trees.