Boosted Trees Specifications - Advanced Tab

Select the Advanced tab of the Boosted Trees Specifications dialog box to access options that control the various parameters of the stochastic gradient boosting trees algorithm (as described in the Overview and Computational Details). Use the options in the Boosted trees options group box to control the learning or shrinkage rate, the maximum number of boosted trees (additive terms in the additive expansion), and the subsample proportion (see also Friedman, 1999b); the Stopping parameters pertain to the computations for (building) each individual tree in the sequence of trees. On this tab, you can also specify a Test sample for the analyses; the observations in this sample will be used to evaluate the predictive validity of the model and to determine the final best model (number of boosted trees).

Boosted trees options.

Learning rate
Specify here the learning or shrinkage rate for the computations. As described in the Introductory Overview, the Statistica Boosted Trees module will compute a weighted "additive" expansion of simple regression trees. The specific weight with which consecutive simple trees are added into the prediction equation is usually a constant, and referred to as the learning rate or shrinkage parameter; according to Friedman (1999b, p. 2; 1999a), empirical studies have shown that shrinkage values of .1 or less usually lead to better models (with better predictive validity). See also Computational Details for more information.
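The role of the learning rate in the additive expansion can be sketched as follows. This is an illustrative Python sketch of the general shrinkage idea, not Statistica's implementation; the function name and arguments are hypothetical.

```python
def boosted_prediction(base_predictions, learning_rate=0.1, initial=0.0):
    """Combine the outputs of successive simple trees for one case:
    F_M(x) = F_0(x) + nu * (h_1(x) + h_2(x) + ... + h_M(x)),
    where nu is the learning (shrinkage) rate."""
    prediction = initial
    for h in base_predictions:
        # Each consecutive tree enters the expansion with constant weight nu,
        # so a small nu moves the prediction in many small steps.
        prediction += learning_rate * h
    return prediction

# Three base-learner outputs of 1.0 each, shrunk by nu = 0.1, move the
# combined prediction only 0.3 away from the initial value.
combined = boosted_prediction([1.0, 1.0, 1.0], learning_rate=0.1)
```

A smaller learning rate means each tree contributes less, so more additive terms are typically needed to reach a given fit, which is why the shrinkage values of .1 or less cited above tend to go together with larger numbers of boosting steps.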
Number of additive terms
Specify here the number of additive terms to be computed, i.e., the number of simple regression trees to be computed in successive boosting steps. In the Results dialog box, you can later review intermediate solutions, i.e., for fewer trees than initially requested (and computed).
Random test data proportion
Specify here the proportion of randomly chosen observations that will serve as a test sample in the computations; this option is only applicable if the Test sample option (see below) is set to Off.
Subsample proportion
Specify here the subsample proportion to be used for drawing the random learning sample for consecutive boosting steps. See also the Introductory Overview for a description of the basic algorithm implemented in Statistica Boosted Trees.
Seed for random number generator
Specify here a constant for seeding the random number generator, which is used to select the subsamples for consecutive boosting trees.
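The interplay of the Subsample proportion and the random-number seed can be sketched as follows. This is an illustrative Python sketch under assumed conventions (the function name, the seed-mixing scheme, and sampling without replacement are assumptions), not Statistica's code.

```python
import random

def draw_subsample(case_indices, proportion, seed, step):
    """Draw the random learning sample for one boosting step.
    Deriving each step's generator from the fixed user seed makes the
    whole sequence of subsamples reproducible across runs."""
    rng = random.Random(seed * 100_003 + step)  # step-specific stream
    n = max(1, int(round(proportion * len(case_indices))))
    return sorted(rng.sample(case_indices, n))  # sample without replacement

cases = list(range(100))
first = draw_subsample(cases, proportion=0.5, seed=42, step=1)
again = draw_subsample(cases, proportion=0.5, seed=42, step=1)
# Same seed and step -> identical subsample; the next step draws a new one.
</```

Because each boosting step fits its simple tree to a fresh random subsample, rerunning an analysis with the same seed reproduces the same sequence of learning samples, and hence the same sequence of trees.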

Stopping parameters. The parameters in this group box control the complexity of the individual trees that will be built at each consecutive boosting step. In general, it is highly recommended to specify relatively simple trees for each consecutive step (e.g., to leave the default of single-split, 3-node trees in place).

Minimum n of cases
One way to control splitting is to allow it to continue until all terminal nodes contain no more than a specified minimum number of cases or objects; specify that minimum number of cases in a terminal node via this option.
Minimum n in child node
Use this option to control the smallest permissible number of cases in a child node, for a split to be applied. While the Minimum n of cases parameter determines whether an additional split is considered at any particular node, the Minimum n in child node parameter determines whether a split will be applied, depending on whether either of the two resulting child nodes would be smaller (have fewer cases) than the n specified via this option. This option is useful if during your analyses you determined that consecutive trees tend to "partition off" very small terminal nodes along one side or branch of the tree. In this case, setting this option to a value other than the default 1 will prevent those splits from being applied.
Maximum n of levels
The value entered here will be used for stopping on the basis of the number of levels in a tree. Each time a parent node is split, the total number of levels ("depth" of the tree as measured from the root node) is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of levels box.
Maximum n of nodes
The value entered here will be used for stopping on the basis of the number of nodes in each tree. Each time a parent node is split, the total number of nodes in the tree is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of nodes box. The default value of 3 causes each consecutive tree to consist of a single split (one root node, two child nodes).
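Taken together, the stopping parameters can be read as a set of checks applied before each candidate split. The following is a hedged Python sketch of that logic; the function, its parameter names, and the default values shown are illustrative stand-ins for the dialog options, not Statistica identifiers.

```python
def split_allowed(node_n, child_ns, depth, total_nodes_after_split,
                  min_n_of_cases=5, min_n_in_child=1,
                  max_n_of_levels=10, max_n_of_nodes=3):
    """Return True if a candidate split passes all stopping rules.
    Illustrative defaults; only max_n_of_nodes=3 mirrors the dialog's
    documented default of single-split trees."""
    if node_n <= min_n_of_cases:
        return False                      # node already small enough: stop
    if min(child_ns) < min_n_in_child:
        return False                      # a resulting child would be too small
    if depth + 1 > max_n_of_levels:
        return False                      # tree would become too deep
    if total_nodes_after_split > max_n_of_nodes:
        return False                      # tree would have too many nodes
    return True
```

With the default Maximum n of nodes of 3, only the root split passes: splitting the root yields 3 nodes (allowed), while splitting either child would yield 5 nodes and is rejected, so each boosting step produces a single-split tree.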
Test sample
Click the Test sample button to display the dialog of the same name, which is used to select a test or "hold-out" sample for the analyses. If no Test sample variable and code are selected, the program will by default create a random sub-sample of 30% of the observations (cases) in the data and treat them as a test sample, in order to evaluate the fit of the model over successive iterations in this separate (hold-out) sample. You can adjust this value via the Random test data proportion option (see above). Given the default proportion (30%) of test cases, this leaves the remaining 70% of the observations for the analyses via stochastic gradient boosting (e.g., for the selection of samples for consecutive boosting steps). By default, the program will choose the specific solution (with the specific number of simple boosted trees) that yields the absolute smallest error (misclassification rate) over all boosting iterations.

Use the Test sample options if you want to select a specific sub-sample as the test (hold-out) sample. For example, you may want to use the spreadsheet transformation options to create a new indicator variable (with 0s and 1s) using the formula "=(rnd(1)>.2)"; this would generate approximately 20% zeroes and 80% ones in the new variable. Then select this variable as the Test sample variable, and 1 as the code for the analysis sample. As a result, the analysis sample would consist of approximately 80% of the total cases instead of the default 70% (i.e., 100%-30%=70%; as described in the previous paragraph), and the remaining approximately 20% would serve as the test (hold-out) sample instead of the default 30%.
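The effect of the "=(rnd(1)>.2)" formula can be sketched in Python terms. This assumes rnd(1) returns a uniform random value in [0, 1); the variable names are illustrative, and a fixed seed is used here only to make the example reproducible.

```python
import random

rng = random.Random(0)  # fixed seed so the example is reproducible

# Mimic the spreadsheet formula "=(rnd(1)>.2)" for each case:
# each case receives a 1 with probability ~0.8, else a 0.
indicator = [1 if rng.random() > 0.2 else 0 for _ in range(10_000)]

# Selecting code 1 as the analysis sample keeps ~80% of the cases;
# the ~20% of cases coded 0 form the test (hold-out) sample.
analysis_share = sum(indicator) / len(indicator)
```

Any other cut-off in the formula works the same way: "=(rnd(1)>.3)" would reproduce the default 70/30 split, for instance.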