Example: Classification Random Forests
This example illustrates the use of Random Forests for classification tasks (i.e., tasks that require assigning each case, based on its predictor values, to one of the categorical levels of the dependent variable). Further examples of similar models, such as Boosted Trees, can be found in Example 1: Classification via Boosting Trees.

Specifying the Analysis
After opening the Boston2.sta data file, select Random Forest for Classification and Regression from the Data Mining menu to display the Random Forest Startup Panel.

In the Type of analysis list on the Quick tab, select Classification Analysis, and click the OK button to display the Random Forest Specifications dialog where you can configure the options for running the analysis.

Clear the Show appropriate variables only check box, and select variable PRICE as the Dependent variable, variable CAT1 as a Categorical pred variable, and variables ORD1-ORD12 as the Continuous pred variables.

Click the OK button to accept these selections, close the variable selection dialog, and return to the Random Forest Specifications dialog.

There are a number of additional options available on the Classification, Advanced, and Stopping conditions tabs of this dialog that can be reconfigured to "fine-tune" the analysis.
By default, Random Forests assigns equal misclassification costs to all categories. To change this setting, select the Classification tab.

Then, select the User spec. option button in the Misclassification costs group box, and click the adjacent button to display a user-defined input spreadsheet, which is used to adjust the cost values. (Note that for this option to be available, response codes must be assigned via the Response codes option on the Quick tab.)
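Statistica applies these costs internally, but the effect of a user-specified cost matrix can be sketched in plain Python; the matrix, class names, and probabilities below are hypothetical, chosen only to show how unequal costs can change a prediction:

```python
import numpy as np

# Hypothetical misclassification cost matrix for classes LOW, MEDIUM, HIGH
# (rows = observed class, columns = predicted class). Off-diagonal entries
# are the costs of each kind of error; by default all errors cost 1.
costs = np.array([
    [0.0, 1.0, 4.0],   # observed LOW:  predicting HIGH is penalized most
    [1.0, 0.0, 1.0],   # observed MEDIUM
    [4.0, 1.0, 0.0],   # observed HIGH: predicting LOW is penalized most
])

# Class-membership probabilities for a single case (e.g., tree votes).
probs = np.array([0.45, 0.40, 0.15])  # P(LOW), P(MEDIUM), P(HIGH)

# Expected cost of each prediction:
# sum over observed classes of P(observed) * cost(observed, predicted).
expected_cost = probs @ costs

labels = ["LOW", "MEDIUM", "HIGH"]
print(labels[int(np.argmin(expected_cost))])  # MEDIUM
```

Note that with equal costs the case would be classified as LOW (its most probable class); the unequal costs shift the minimum-expected-cost prediction to MEDIUM.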



Prior probabilities should reflect the degree of belief in the class membership of a case (i.e., whether it belongs to category LOW, MEDIUM, or HIGH) before any analysis is performed. One way to set these probabilities is to use the percentage of each category in the data set; this is reasonable if the data set is representative of the true population. If no such information is available, you can assign equal prior probabilities to all categories. Assigning equal priors amounts to saying "I don't know," which simply reflects your lack of knowledge of the percentage of house categories in the Boston area. Note that priors, like all probabilities, must sum to unity.
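The two ways of setting priors can be illustrated with a short Python sketch; the category counts below are hypothetical, not taken from Boston2.sta:

```python
from collections import Counter

# Hypothetical dependent-variable categories for a data set of 100 cases.
cases = ["LOW"] * 50 + ["MEDIUM"] * 30 + ["HIGH"] * 20

# "Estimated" priors: the percentage of each category in the data set,
# reasonable when the sample is representative of the true population.
counts = Counter(cases)
estimated = {cls: n / len(cases) for cls, n in counts.items()}

# "Equal" priors: express no prior belief about category frequencies.
equal = {cls: 1 / len(counts) for cls in counts}

print(estimated)  # {'LOW': 0.5, 'MEDIUM': 0.3, 'HIGH': 0.2}
print(equal)
```

Either way, the priors sum to unity, as required.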
On the Advanced tab, you can access options to control the number and complexity (number of nodes) of the tree models you are about to create.

Instead of randomly partitioning the data set into training and test cases, you can define your holdout (testing) sample via the Test sample option, where you can identify a sample identifier code to divide the data into training and testing sets. Selecting this sampling method will override the random sampling option.
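Outside Statistica, the same identifier-based holdout split can be sketched in a few lines of Python; the code values below are hypothetical:

```python
# Hypothetical sample-identifier column: cases coded "TEST" form the
# holdout (testing) sample; all other cases are used for training.
sample_id = ["TRAIN", "TRAIN", "TEST", "TRAIN", "TEST", "TRAIN"]
cases = [0, 1, 2, 3, 4, 5]  # case indices standing in for full records

train = [c for c, s in zip(cases, sample_id) if s != "TEST"]
test = [c for c, s in zip(cases, sample_id) if s == "TEST"]
print(train, test)  # [0, 1, 3, 5] [2, 4]
```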
In particular, you can specify the number of predictor variables to include in your tree models. This option is an important one, and care should be taken in setting its value. Including a large number of predictors in the tree models can prolong computation time and forfeit one of the advantages of the Random Forest model: the ability to make predictions from a subset of the predictor variables. Conversely, including too few predictor variables may degrade model performance, since it can exclude variables that account for most of the variability and trend in the data. In setting the number of predictor variables, it is recommended that you use the default value, which is based on the formula (see Breiman for further details).
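As a point of comparison, Breiman's commonly cited rule of thumb is to try roughly the square root of the number of available predictors at each split; the sketch below computes that value for this analysis (whether Statistica's default formula is exactly this should be checked against the Breiman reference):

```python
import math

# 13 predictors in this example: CAT1 plus ORD1-ORD12.
n_predictors = 13

# Breiman's rule of thumb: use about sqrt(p) predictors per split,
# but always at least one.
m_try = max(1, round(math.sqrt(n_predictors)))
print(m_try)  # 4
```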
However, for longer training runs there are better ways to specify when training should stop. You can do this on the Stopping Conditions tab.

Perhaps the most useful option is the Percentage decrease in training error. It specifies that if the training error does not improve by at least the given amount over a number of cycles (the Cycles to calculate mean error), training should stop.
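The rule can be sketched as a small Python function; the function and parameter names below are illustrative, not Statistica's internal names:

```python
def should_stop(errors, min_pct_decrease=5.0, cycles=10):
    """Stop training when the error has not improved by at least
    min_pct_decrease percent over the last `cycles` training cycles."""
    if len(errors) <= cycles:
        return False  # not enough history to judge yet
    old, new = errors[-cycles - 1], errors[-1]
    if old == 0:
        return True  # nothing left to improve
    improvement = 100.0 * (old - new) / old
    return improvement < min_pct_decrease

# Training error improves quickly at first, then plateaus.
history = [0.40, 0.30, 0.24, 0.20, 0.19, 0.185,
           0.184, 0.1835, 0.1834, 0.1833, 0.1832, 0.1831]
print(should_stop(history, min_pct_decrease=5.0, cycles=5))  # True
```

Early in training the improvement over five cycles is large, so training continues; once the curve flattens, the improvement falls below the threshold and the rule fires.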

After the analysis has run, the Random Forest Results dialog is displayed.

On the Quick tab, click the Summary button to review how the training and testing classification rates progressed over the training cycles.

This graph demonstrates the basic mechanism by which the Random Forest algorithm implemented in Statistica can avoid overfitting (see also the Introductory Overview and Technical Notes). As more and more simple trees are added to the model, the misclassification rate for the training data (from which the respective trees were estimated) will generally decrease. The same trend should initially be observed for the misclassification rate on the testing data. However, as still more trees are added, the misclassification rate for the testing data will at some point begin to increase (while the misclassification rate for the training set keeps decreasing), clearly marking the point where evidence of overfitting begins to show.
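Curves of this kind can be reproduced outside Statistica, for example with scikit-learn's random forest grown incrementally on synthetic data (this is a stand-in sketch, not the Boston2.sta analysis):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for Boston2.sta.
X, y = make_classification(n_samples=600, n_features=13, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# Grow the forest incrementally (warm_start keeps earlier trees) and
# record both misclassification rates after each batch of trees.
forest = RandomForestClassifier(warm_start=True, random_state=0)
train_err, test_err = [], []
for n_trees in range(10, 210, 10):
    forest.set_params(n_estimators=n_trees)
    forest.fit(X_tr, y_tr)
    train_err.append(1 - forest.score(X_tr, y_tr))
    test_err.append(1 - forest.score(X_te, y_te))

# Training error keeps shrinking, while the test error flattens (and may
# eventually rise), which is where overfitting would begin to show.
print(train_err[0], train_err[-1], test_err[-1])
```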
By default, the program will stop adding trees once this condition is met, even if the designated number of trees specified in the Number of trees option on the Advanced tab of the Random Forest Specifications dialog has not been reached. To turn off the stopping condition, simply clear the Enable advanced stopping condition check box on the Stopping conditions tab of the Random Forest Specifications dialog. In this case, the designated number of trees set in the Number of trees option will be added to the Random Forest.
To produce predictions for the test sample, for example, click the Classification tab of the Results dialog.

Select the Test set option button in the Sample group box, and then click the Predicted vs. observed by classes button. The program will then display the spreadsheet of predicted values and probability of class memberships. It will also display a spreadsheet and a 3D histogram of the classification matrix, together with a spreadsheet of the confusion matrix.
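For readers working outside Statistica, a classification (confusion) matrix is straightforward to tabulate by hand; the observed and predicted categories below are hypothetical:

```python
from collections import Counter

# Hypothetical observed and predicted categories for a small test sample.
observed = ["LOW", "LOW", "MEDIUM", "MEDIUM", "HIGH", "HIGH", "LOW", "HIGH"]
predicted = ["LOW", "MEDIUM", "MEDIUM", "MEDIUM", "HIGH", "MEDIUM", "LOW", "HIGH"]

labels = ["LOW", "MEDIUM", "HIGH"]

# Confusion matrix: rows = observed class, columns = predicted class.
pairs = Counter(zip(observed, predicted))
matrix = [[pairs[(obs, pred)] for pred in labels] for obs in labels]

for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal counts the correctly classified cases; off-diagonal entries show which categories are being confused with which.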


In addition, you may want to review the various additional summary statistics (e.g., Risk estimates) and the predictor importance (in the form of a histogram). The Predictor importance graph contains the importance ranking on a 0-1 scale for each predictor variable in the analysis. See Predictor Importance in Statistica GC&RT, Interactive Trees, and Boosted Trees.

This plot can be used for visual inspection of the relative importance of the predictor variables used in the analysis, and thus helps you identify the most important predictors. In this case, variables ORD1, ORD5, and ORD12 stand out as the most important predictors.
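One common way to put importances on a 0-1 scale is to divide every predictor's raw score by the largest one, so the top predictor gets 1 and the rest are reported relative to it; the raw scores below are hypothetical, and whether Statistica uses exactly this rescaling is described in the referenced Predictor Importance topic:

```python
# Hypothetical raw importance scores for the predictors (e.g., summed
# split improvements accumulated across all trees in the forest).
raw = {"ORD1": 0.30, "ORD5": 0.24, "ORD12": 0.21, "CAT1": 0.05, "ORD2": 0.03}

# Rescale so the most important predictor scores 1 on a 0-1 scale.
top = max(raw.values())
ranking = {name: value / top for name, value in raw.items()}

for name, score in sorted(ranking.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.2f}")
```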
