Random Forest Specifications - Advanced Tab
Select the Advanced tab of the Random Forest Specifications dialog box to access options to control the various parameters for the Random Forest algorithm (as described in the Introductory Overview and Technical Notes).
Random Forest options.
- Number of predictors
- Specify the number of predictors for the tree models. The default value is a subset of the total number of predictor variables (selected via the Variables button on the Quick tab) and is determined from log2(M) + 1, where M is the number of inputs. For example, with M = 16 inputs, the default is log2(16) + 1 = 5.
- Number of trees
- Specify the number of simple regression trees to be computed in successive forest-building steps. In the Results dialog box, you can later review intermediate solutions (for fewer or more trees than initially requested and computed).
- Random test data proportion
- Specify the proportion of randomly chosen observations that will serve as a test sample in the computations. This option is only applicable if the Test sample option (see below) is set to Off.
- Subsample proportion
- Specify the subsample proportion to be used for drawing the bootstrap learning samples for consecutive steps. Bootstrapping creates subsets by randomly sampling, with replacement, from the cases of the original data set. See also the Introductory Overview and Technical Notes for a description of the basic algorithm implemented in Statistica Random Forests.
- Seed for random number generator
- Specify a constant for seeding the random number generator, which is used to select the subsamples for consecutive trees.
- Stopping parameters
- The parameters in this group box control the complexity of the individual trees that will be built at each consecutive step.
- Minimum n of cases
- One way to control splitting is to allow splitting to continue until all terminal nodes contain no more than a specified minimum number of cases or objects; this minimum number of cases in a terminal node can be specified via this option.
- Minimum n in child node
- Use this option to control the smallest permissible number of cases in a child node for a split to be applied. While the Minimum n of cases parameter determines whether an additional split is considered at any particular node, the Minimum n in child node parameter determines whether a split will be applied, depending on whether any of the resulting child nodes would be smaller (have fewer cases) than the n specified via this option. This option is useful if, during your analyses, you determine that consecutive trees tend to partition off very small terminal nodes along one side or branch of the tree. In this case, setting this option to a value other than the default will prevent those splits from being applied.
- Maximum n of levels
- The value entered here will be used as a stopping criterion based on the number of levels in a tree. Each time a parent node is split, the total number of levels (depth of the tree as measured from the root node) is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of levels box.
- Maximum n of nodes
- The value entered here will be used as a stopping criterion based on the number of nodes in each tree. Each time a parent node is split, the total number of nodes in the tree is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of nodes box.
- Test sample
- Click the Test sample button to display the Test-Sample dialog box, which is used to select a test or hold-out sample for the analysis. If no Test sample variable and code is selected, the program will create by default a random sub-sample consisting of 30% of the available observations (cases) in the data and treat them as a test sample in order to evaluate the fit of the model over successive iterations in this separate (hold-out) sample. You can adjust this value via the Random test data proportion option (see above). Given the default proportion (30%) of test cases, this leaves the remaining 70% of the observations for the analysis. By default, the program will choose the specific solution (with the specific number of simple trees) that yields the absolute smallest error (or misclassification rate) over all iterations.
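As a rough illustration of how the Random test data proportion, Subsample proportion, and Seed for random number generator options relate to one another, the following Python sketch holds out a random test sample and then draws one bootstrap learning sample (with replacement) from the remaining cases. It is not Statistica's implementation: the function name draw_samples, its parameters, and the choice to draw the bootstrap sample only from the analysis cases are assumptions made for this example, and the subsample proportion of 0.5 used below is an arbitrary illustrative value.

```python
import random


def draw_samples(n_cases, test_proportion, subsample_proportion, seed):
    """Split case indices into test/analysis sets and draw one bootstrap sample.

    Hypothetical helper for illustration only; not part of Statistica.
    """
    rng = random.Random(seed)  # a fixed seed makes the draws reproducible

    # Randomly choose the test (hold-out) cases.
    indices = list(range(n_cases))
    rng.shuffle(indices)
    n_test = round(n_cases * test_proportion)
    test = sorted(indices[:n_test])
    analysis = sorted(indices[n_test:])

    # Bootstrap learning sample: sampling WITH replacement, so the same
    # case can appear more than once (assumed to come from analysis cases).
    n_boot = round(len(analysis) * subsample_proportion)
    bootstrap = [rng.choice(analysis) for _ in range(n_boot)]
    return test, analysis, bootstrap


# Default test proportion from the text (30%); 0.5 is an illustrative value.
test, analysis, bootstrap = draw_samples(100, 0.3, 0.5, seed=1)
print(len(test), len(analysis), len(bootstrap))
```

With 100 cases, this leaves 30 test cases and 70 analysis cases, and the bootstrap learning sample contains 35 draws. Calling the function again with the same seed reproduces the same samples.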
Use the Test sample options if you want to select a specific sub-sample for the test (hold-out) sample. For example, you may want to use the spreadsheet transformation options to create a new indicator variable (with 0s and 1s) using the formula "=(rnd(1)>.2)"; this generates approximately 20% zeros and 80% ones in the new variable. Then select this variable as the Test sample variable, and 1 as the code for the analysis sample. As a result, the analysis sample will consist of approximately 80% of the total cases instead of the default 70% (100% - 30% = 70%, as described above), and the remaining 20% or so will form the test (hold-out) sample.
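The indicator-variable approach can be mimicked outside the spreadsheet to check the resulting proportions. The Python sketch below assumes that rnd(1) returns a uniform random number between 0 and 1, and uses random.random() as a stand-in; the function names make_indicator and split_by_code are hypothetical and exist only for this example.

```python
import random


def make_indicator(n_cases, threshold=0.2, seed=None):
    """Mimic the formula =(rnd(1)>.2): 1 if a uniform draw exceeds the threshold, else 0."""
    rng = random.Random(seed)
    return [1 if rng.random() > threshold else 0 for _ in range(n_cases)]


def split_by_code(indicator, analysis_code=1):
    """Cases whose code matches analysis_code form the analysis sample; the rest are test cases."""
    analysis = [i for i, value in enumerate(indicator) if value == analysis_code]
    test = [i for i, value in enumerate(indicator) if value != analysis_code]
    return analysis, test


indicator = make_indicator(10_000, seed=42)
analysis, test = split_by_code(indicator)
print(f"analysis share: {len(analysis) / len(indicator):.2f}")
print(f"test share:     {len(test) / len(indicator):.2f}")
```

Because the indicator is random, the shares are only approximately 80% and 20%; the exact counts vary with the seed and the number of cases.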