Workspace Node: C&RT Classification - Specifications - Validation Tab
In the C&RT Classification node dialog box, under the Specifications heading, select the Validation tab to access the following options.
V-Fold Cross-Validation of Tree-Sequences in C&RT
As described in the Introductory Overview (the V-Fold Cross-Validation of Trees and Tree Sequences section), the v-fold cross-validation options available in the Interactive Trees (C&RT, CHAID) module are applied only to the best automatically selected tree (solution), not to the entire algorithmically built sequence of trees. V-fold cross-validation is a powerful technique for choosing the best tree from an automatically generated sequence of trees in classification and regression trees.
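To make the resampling idea concrete, the following minimal Python sketch shows how a seeded random number generator can split a learning sample into v folds, each of which is held out once while the remaining folds are used for training. The function names (`v_fold_indices`, `train_test_splits`) and the round-robin assignment are illustrative assumptions, not Statistica's actual sampling procedure.

```python
import random

def v_fold_indices(n_cases, v, seed):
    """Split case indices 0..n_cases-1 into v random, nearly equal folds.

    Hypothetical helper: the same seed always reproduces the same folds.
    """
    if not 2 <= v <= n_cases:
        raise ValueError("v must be between 2 and the number of cases")
    rng = random.Random(seed)          # seeded generator, reproducible
    indices = list(range(n_cases))
    rng.shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::v] for i in range(v)]

def train_test_splits(folds):
    """Yield (training, held_out) index lists, holding out each fold once."""
    for k, held_out in enumerate(folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, held_out

folds = v_fold_indices(n_cases=10, v=3, seed=1)
for training, held_out in train_test_splits(folds):
    # A tree would be grown on `training` and its cost measured on `held_out`.
    print(len(training), len(held_out))
```

Averaging the cost measured on the held-out folds gives a cross-validation error estimate; changing the seed changes which cases land in which fold, so estimates can vary slightly between seeds.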
Element Name | Description |
---|---|
V-fold cross-validation | V-fold cross-validation is particularly useful when no test sample is available and the learning sample is too small to have the test sample taken from it. Select the V-fold cross-validation check box to make use of v-fold cross-validation. Additional specifications for v-fold cross-validation include Seed for random number generator and V-fold cross-validation; v-value. These values will be used to control the sampling that Statistica performs to obtain cross-validation error estimates. If this check box is selected when you click the OK button, the program will automatically grow the ("best") tree (apply pruning), and then compute risk estimates separately for the training and cross-validation samples. |
Seed for random number generator | The positive integer value entered in the Seed for random number generator box is used as the seed for a random number generator that produces v-fold random subsamples from the learning sample to test the predictive accuracy of the computed trees. |
V-fold cross-validation; v-value | The value entered in this box determines the number of cross-validation samples that will be generated from the learning sample to provide a cross-validation error estimate for the current tree. See also the Introductory Overview for details. |
Standard error rule | If a pruning method (the Prune on misclassification error, Prune on deviance, or Prune on variance option button) is selected in the Stopping rule group box on the Stopping tab, the value entered in the Standard error rule box is used in the selection of the "right-sized" tree after pruning (see also the General Classification and Regression Trees (GC&RT) Introductory Overviews).
The standard error rule is applied as follows: Find the pruned tree, among all trees produced during pruning, that has the smallest V (validation or cross-validation) cost; this cost is computed from the training data sample, or from the test sample if a Test sample (see below) is specified. Call this value Min. V, and call the standard error of the V cost for this tree Min. Standard error. Then select as the "right-sized" tree the pruned tree with the fewest terminal nodes that has a V cost no greater than Min. V plus the Standard error rule times Min. Standard error. A smaller (closer to zero) value for the Standard error rule generally results in the selection of a right-sized tree that is only slightly "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree. A larger (much greater than zero) value for the Standard error rule generally results in the selection of a right-sized tree that is much "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree. This so-called cost/complexity pruning, as implemented in the selection of the right-sized tree, makes use of the basic scientific principles of parsimony and replication: choose as the best theory the simplest theory (i.e., the pruned tree with the fewest terminal nodes) that is consistent with (i.e., has a V cost no greater than Min. V plus the Standard error rule times Min. Standard error) the theory best supported by independent tests (i.e., the pruned tree with the smallest V cost). |
Cross-validate tree sequence | Select this check box to have the program apply the v-fold cross-validation procedure to the entire tree sequence, instead of just the final tree, and record the results. |
Test sample | This option enables you to use a subsample of cases for estimating the accuracy of the classifier or prediction. Click the Test sample button to display the Test-Sample dialog box, through which you can switch on or off the Test sample option as well as select a variable that will be used as the sample identifier variable. Click the Sample Identifier Variable button to display a variable selection dialog to choose the sample identifier variable. In addition, you need to select the code for the selected variable that uniquely identifies the cases to be used in the test sample. By default, when a sample identifier variable has been selected, a valid code will be displayed in the Code for analysis sample box. If this is not the desired code for identifying the test sample, double-click on the box (or press the F2 key on your keyboard) to display a dialog box from which you can select the desired code from the list of valid variable codes.
If a Test sample is identified, the Risk estimates for the final tree and predicted values or classifications and residuals can be computed separately for the training and the testing sample. |
Options / C / W | See Common Options. |
OK | Click the OK button to accept all the specifications made in the dialog box and to close it. The analysis results will be placed in the Reporting Documents node after running (updating) the project. |
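
The selection logic described for the Standard error rule option above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Statistica's implementation: the `PrunedTree` record, the `select_right_sized_tree` function, and the cost values are all hypothetical, and each pruned tree is assumed to already carry its V cost and the standard error of that cost.

```python
from typing import NamedTuple, Sequence

class PrunedTree(NamedTuple):
    """Hypothetical summary of one tree in the pruning sequence."""
    terminal_nodes: int
    v_cost: float       # validation or cross-validation cost
    v_cost_se: float    # standard error of the V cost

def select_right_sized_tree(trees: Sequence[PrunedTree],
                            se_rule: float = 1.0) -> PrunedTree:
    """Pick the smallest tree whose V cost is within se_rule standard
    errors of the minimum V cost."""
    best = min(trees, key=lambda t: t.v_cost)            # Min. V tree
    threshold = best.v_cost + se_rule * best.v_cost_se   # Min. V + rule * Min. SE
    candidates = [t for t in trees if t.v_cost <= threshold]
    # Fewest terminal nodes wins; ties are broken by the lower V cost.
    return min(candidates, key=lambda t: (t.terminal_nodes, t.v_cost))

# Made-up pruning sequence, for illustration only.
trees = [
    PrunedTree(12, 0.30, 0.02),
    PrunedTree(8, 0.25, 0.02),
    PrunedTree(5, 0.26, 0.02),
    PrunedTree(3, 0.28, 0.02),
    PrunedTree(1, 0.50, 0.03),
]

for rule in (0.0, 1.0, 2.0):
    print(rule, select_right_sized_tree(trees, rule).terminal_nodes)
```

With these made-up costs, a rule of 0 returns the minimum V cost tree itself, while larger values admit progressively simpler trees, which matches the behavior described for smaller and larger Standard error rule values.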