Workspace Node: Boosted Classification Trees - Specifications - Advanced Tab
In the Boosted Classification Trees workspace node dialog box, under the Specifications heading, select the Advanced tab to access options that control the parameters of the stochastic gradient boosting trees algorithm (as described in the Overview and Computational Details). Use the options in the Boosted trees options group box to control the learning or shrinkage rate, the maximum number of boosted trees (additive terms in the additive expansion), and the subsample proportion (see also Friedman, 1999b). The Stopping parameters pertain to the computations for (building) each individual tree in the sequence of trees. On this tab, you can also specify a Test sample for the analyses; the observations in this sample will be used to evaluate the predictive validity of the model and to determine the final best model (number of boosted trees).
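To illustrate how the learning rate, the number of additive terms, the subsample proportion, and the random seed interact, here is a minimal Python sketch of stochastic gradient boosting for a two-class problem with one predictor. It is not Statistica's implementation: the function names (`fit_stump`, `boost`, `predict`) are hypothetical, the leaf values are plain residual means (a simplification), and each tree is a single-split, 3-node tree, which corresponds to a Maximum n of nodes value of 3.

```python
import math
import random

def fit_stump(xs, residuals):
    """Fit a single-split (3-node) regression tree to residuals on one predictor."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x values equal: no split is possible
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def boost(xs, ys, learning_rate=0.1, n_terms=50, subsample=0.5, seed=1):
    """Build n_terms trees; each is fit to residuals on a random subsample."""
    rng = random.Random(seed)            # the seed fixes which subsamples are drawn
    scores = [0.0] * len(xs)             # current log-odds score for each case
    n_sub = max(2, int(subsample * len(xs)))
    trees = []
    for _ in range(n_terms):
        idx = rng.sample(range(len(xs)), n_sub)
        # residual = observed class (0/1) minus current predicted probability
        resid = [ys[i] - 1 / (1 + math.exp(-scores[i])) for i in idx]
        tree = fit_stump([xs[i] for i in idx], resid)
        trees.append(tree)
        # each new tree enters the expansion weighted by the learning rate
        scores = [s + learning_rate * tree(x) for s, x in zip(scores, xs)]
    return trees

def predict(trees, x, learning_rate=0.1, n_terms=None):
    """Classify x using the first n_terms trees (all trees if n_terms is None)."""
    score = sum(learning_rate * t(x) for t in trees[:n_terms])
    return 1 if score > 0 else 0

# Illustrative data: class 1 whenever x is 10 or larger
xs = list(range(20))
ys = [0 if x < 10 else 1 for x in xs]
trees = boost(xs, ys, learning_rate=0.1, n_terms=50, subsample=0.5, seed=7)
print([predict(trees, x) for x in xs])
```

Passing a smaller `n_terms` to `predict` mirrors reviewing an intermediate solution with fewer trees than were computed, and rerunning `boost` with the same `seed` reproduces the same sequence of subsamples.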
Boosted trees options
Element Name | Description |
---|---|
Learning rate | Specify the learning or shrinkage rate for the computations. As described in the Introductory Overview, the Statistica Boosted Trees module will compute a weighted "additive" expansion of simple regression trees. The specific weight with which consecutive simple trees are added into the prediction equation is usually a constant, and referred to as the learning rate or shrinkage parameter; according to Friedman (1999b, p. 2; 1999a), empirical studies have shown that shrinkage values of .1 or less usually lead to better models (with better predictive validity). See also Computational Details for more information. |
Number of additive terms | Specify the number of additive terms to be computed, i.e., the number of simple regression trees to be computed in successive boosting steps. On the Results tabs, you can later review intermediate solutions, i.e., for fewer trees than initially requested (and computed). |
Random test data proportion | Specify the proportion of randomly chosen observations that will serve as a test sample in the computations; this option is only applicable if the Test sample option (see below) is set to Off. |
Subsample proportion | Specify the subsample proportion to be used for drawing the random learning sample for consecutive boosting steps. See also the Introductory Overview for a description of the basic algorithm implemented in Statistica Boosted Trees. |
Seed for random number generator | Specify a constant for seeding the random number generator, which is used to select the subsamples for consecutive boosting trees. |
Stopping parameters | The parameters in this group box control the complexity of the individual trees that will be built at each consecutive boosting step. In general, it is highly recommended to specify relatively simple trees for each consecutive step (e.g., to leave the single-split, 3-node trees default in place). |
Minimum n (%) of cases | One way to control splitting is to allow splitting to continue until all terminal nodes contain no more than a specified minimum number of cases or objects; this minimum number of cases in a terminal node can be specified via this option. |
Minimum n in child node | Use this option to control the smallest permissible number of cases in a child node for a split to be applied. While the Minimum n of cases parameter determines whether an additional split is considered at any particular node, the Minimum n in child node parameter determines whether a split will be applied, depending on whether either of the two resulting child nodes would be smaller (have fewer cases) than the n specified via this option. This option is useful if, during your analyses, you determine that consecutive trees tend to partition off very small terminal nodes along one side or branch of the tree. In this case, setting this option to a value other than the default 1 will prevent those splits from being applied. |
Maximum n of levels | The value entered here will be used for stopping on the basis of the number of levels in a tree. Each time a parent node is split, the total number of levels (depth of the tree as measured from the root node) is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of levels box. |
Maximum n of nodes | The value entered here will be used for stopping on the basis of the number of nodes in each tree. Each time a parent node is split, the total number of nodes in the tree is examined, and the splitting is stopped if this number exceeds the number specified in the Maximum n of nodes box. The default value 3 would cause each consecutive tree to consist of a single split (one root node, two child nodes). |
Test sample | Click this button to display the Test-Sample dialog box, which is used to select a test or hold-out sample for the analyses. If no Test sample variable and code are selected, the program will by default create a random subsample of 30% of the observations (cases) in the data and treat it as a test sample in order to evaluate the fit of the model over successive iterations in this separate (hold-out) sample. You can adjust this value via the Random test data proportion option (see above). Given the default proportion (30%) of test cases, this leaves the remaining 70% of the observations for the analyses via stochastic gradient boosting (e.g., for the selection of samples for consecutive boosting steps). By default, the program will choose the specific solution (with the specific number of simple boosted trees) that yields the absolute smallest error (misclassification rate) over all boosting iterations. Use the Test sample options if you want to select a specific subsample for the test (hold-out) sample. For example, you may want to use the spreadsheet transformation options to create a new indicator variable (with 0's and 1's) using the formula "=(rnd(1)>.2)"; this would generate approximately 20% zeroes and 80% ones in the new variable. Then select this variable as the Test sample variable, and 1 as the code for the analysis sample. As a result, the analysis (training) sample would consist of approximately 80% of the total cases instead of the default 70% (i.e., 100%-30%=70%; as described above), and the remaining 20% would serve as the test (hold-out) sample. |
Options / C / W | See Common Options. |
OK | Click this button to accept all the specifications made in the dialog box and to close it. The analysis results are placed in the Reporting Documents workspace node after running (updating) the project. |
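
The Test sample description above combines two small computations: building a 0/1 indicator variable with a formula such as "=(rnd(1)>.2)", and choosing the number of trees with the smallest test-sample error. The Python sketch below illustrates both under stated assumptions: `rnd(1)` is assumed to return a uniform random number between 0 and 1, the helper names `make_indicator` and `best_number_of_trees` are hypothetical, and the error values are made-up illustrations, not Statistica output.

```python
import random

def make_indicator(n_cases, cutoff=0.2, seed=1):
    """Mimic =(rnd(1)>.2): 1 marks an analysis case, 0 marks a test case."""
    rng = random.Random(seed)
    return [1 if rng.random() > cutoff else 0 for _ in range(n_cases)]

def best_number_of_trees(test_errors):
    """Return the (1-based) number of trees with the smallest test error.

    test_errors[k] is the test-sample misclassification rate after k + 1 trees;
    ties are resolved in favor of the smaller number of trees.
    """
    return min(range(len(test_errors)), key=test_errors.__getitem__) + 1

indicator = make_indicator(1000)
share_analysis = sum(indicator) / len(indicator)
print(f"analysis share: {share_analysis:.2f}, test share: {1 - share_analysis:.2f}")

# Illustrative misclassification rates after 1, 2, 3, 4, and 5 trees
errors = [0.30, 0.22, 0.18, 0.19, 0.18]
print("best number of trees:", best_number_of_trees(errors))
```

With a cutoff of 0.2, roughly 80% of the cases receive code 1 and fall into the analysis sample, so about 20% remain for the test sample.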