C&RT Quick Specs - Validation Tab
The Validation tab of the C&RT Quick specs dialog box is used to select the cross-validation method for the General C&RT analysis. Two cross-validation methods are available on this tab: V-fold cross-validation and Test sample.
- V-fold cross-validation
- V-fold cross-validation is particularly useful when no test sample is available and the learning sample is too small to have a test sample taken from it. Select the V-fold cross-validation check box to use v-fold cross-validation. Additional specifications for v-fold cross-validation are: Seed for random number generator; V-fold cross-validation, v-value; and Standard error rule. These values control the sampling that STATISTICA performs to obtain cross-validation error estimates. See also Basic Ideas Part II for details.
- Seed for random number generator
- The positive integer value entered in this box is used as the seed for a random number generator that produces v-fold random subsamples from the learning sample to test the predictive accuracy of the computed classification trees.
- V-fold cross-validation; v-value
- The value entered in this box determines the number of cross-validation samples that will be generated from the learning sample to provide an estimate of the CV cost for each classification tree in the tree sequence. See also Basic Ideas Part II for details.
- Standard error rule
- If a pruning method is selected in the Stopping rule group box on the Stopping tab, i.e., the Prune on misclassification error, Prune on deviance, or Prune on variance option button is selected, then the value entered in the Standard error rule box is used in the selection of the "right-sized" classification tree from the sequence of pruned trees after v-fold cross-validation.
The standard error rule is applied as follows. Find the pruned tree in the tree sequence with the smallest CV cost. Call this value Min. CV, and call the standard error of the CV cost for this tree Min. Standard Error. Then select as the "right-sized" tree the pruned tree with the fewest terminal nodes whose CV cost is no greater than Min. CV plus the standard error rule times Min. Standard Error.
A smaller (closer to zero) value for the standard error rule generally results in the selection of a "right-sized" tree that is only slightly "simpler" (in terms of the number of terminal nodes) than the minimum CV cost tree. A larger (much greater than zero) value generally results in the selection of a "right-sized" tree that is much "simpler" than the minimum CV cost tree.
Thus, cost/complexity pruning, as implemented in the selection of the right-sized tree, makes use of the basic scientific principles of parsimony and replication: Choose as the best theory the simplest theory (i.e., the pruned tree with the fewest terminal nodes) that is consistent with (i.e., has a CV cost no greater than Min. CV plus the standard error rule times Min. Standard Error) the theory best supported by independent tests (i.e., the pruned tree with the smallest CV cost).
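The v-fold settings described above can be illustrated with a short Python sketch. It is not STATISTICA code and does not reproduce the program's random number generator; the names assign_folds, select_right_sized_tree, and PrunedTree are hypothetical, and the CV costs in example_sequence are made-up values chosen only to show how the rule behaves. The first function shows how a seed makes the assignment of cases to v subsamples repeatable; the second applies the standard error rule to a sequence of pruned trees.

```python
import random
from typing import List, NamedTuple


class PrunedTree(NamedTuple):
    """One tree in the pruned sequence (hypothetical record, not a STATISTICA type)."""
    terminal_nodes: int
    cv_cost: float       # cross-validation cost estimate
    cv_cost_se: float    # standard error of that estimate


def assign_folds(n_cases: int, v: int, seed: int) -> List[int]:
    """Assign each case to one of v near-equal-sized random subsamples."""
    if v < 2 or v > n_cases:
        raise ValueError("v must be between 2 and the number of cases")
    rng = random.Random(seed)                 # same seed gives the same folds
    folds = [i % v for i in range(n_cases)]   # balanced fold labels 0..v-1
    rng.shuffle(folds)                        # randomize which case gets which label
    return folds


def select_right_sized_tree(trees: List[PrunedTree], se_rule: float) -> PrunedTree:
    """Pick the smallest tree whose CV cost is within the standard error rule."""
    best = min(trees, key=lambda t: t.cv_cost)               # Min. CV tree
    threshold = best.cv_cost + se_rule * best.cv_cost_se     # Min. CV + rule * Min. SE
    candidates = [t for t in trees if t.cv_cost <= threshold]
    # Fewest terminal nodes wins; ties are broken by the lower CV cost.
    return min(candidates, key=lambda t: (t.terminal_nodes, t.cv_cost))


# Illustrative (made-up) tree sequence, largest tree first.
example_sequence = [
    PrunedTree(terminal_nodes=12, cv_cost=0.30, cv_cost_se=0.03),
    PrunedTree(terminal_nodes=8, cv_cost=0.28, cv_cost_se=0.03),
    PrunedTree(terminal_nodes=5, cv_cost=0.30, cv_cost_se=0.03),
    PrunedTree(terminal_nodes=3, cv_cost=0.33, cv_cost_se=0.03),
    PrunedTree(terminal_nodes=1, cv_cost=0.50, cv_cost_se=0.04),
]

if __name__ == "__main__":
    print(assign_folds(n_cases=10, v=3, seed=1))
    for rule in (0.0, 1.0, 2.0):
        chosen = select_right_sized_tree(example_sequence, se_rule=rule)
        print(rule, chosen.terminal_nodes)
```

With these illustrative values, a standard error rule of 0 keeps the 8-node minimum CV cost tree, a rule of 1 selects the 5-node tree, and a rule of 2 selects the 3-node tree, which matches the tendency of larger values to favor simpler trees.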
- Test sample
- With the test sample option, you can use a subsample of cases to estimate the accuracy of the classifier or prediction. Click the Test sample button to display the Cross-Validation dialog box, in which you can turn the Test sample option on or off and select the variable to be used as the sample identifier variable. Click the Sample Identifier Variable button to display a variable selection dialog from which you choose the sample identifier variable. You also need to select the code of that variable that uniquely identifies the cases to be used in the test sample. By default, when a sample identifier variable has been selected, a valid code will appear in the Code for analysis sample box. If this is not the desired code for identifying the test sample, double-click the box (or press the F2 key on your keyboard) to display a dialog from which you can select the desired code from the list of valid variable codes.
If a Test sample is identified, then the Tree sequence option on the Results dialog box will include an estimate of the misclassification cost for the test sample, which is another estimate of the predictive power of the chosen tree.
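The idea behind the test sample option can be sketched in Python as follows. This is an illustration under stated assumptions, not the STATISTICA implementation: the function names, the variable names ("sample", "x", "y"), the code "test", and the one-split stand-in classifier are all hypothetical, and misclassification costs are assumed to be equal, so the cost estimate reduces to the proportion of misclassified test cases.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, object]  # one case: variable name -> value


def split_by_sample_code(
    cases: List[Case], id_variable: str, test_code: object
) -> Tuple[List[Case], List[Case]]:
    """Split cases into (learning, test) using a sample identifier variable."""
    learning = [c for c in cases if c[id_variable] != test_code]
    test = [c for c in cases if c[id_variable] == test_code]
    return learning, test


def misclassification_rate(
    cases: List[Case], target: str, predict: Callable[[Case], object]
) -> float:
    """Proportion of cases whose predicted class differs from the observed class."""
    if not cases:
        raise ValueError("no cases to evaluate")
    errors = sum(1 for c in cases if predict(c) != c[target])
    return errors / len(cases)


# Made-up data: "sample" is the hypothetical sample identifier variable.
cases: List[Case] = [
    {"sample": "learn", "x": 1.0, "y": "A"},
    {"sample": "learn", "x": 3.5, "y": "B"},
    {"sample": "learn", "x": 2.0, "y": "A"},
    {"sample": "test", "x": 1.5, "y": "A"},
    {"sample": "test", "x": 3.0, "y": "B"},
    {"sample": "test", "x": 2.0, "y": "B"},
    {"sample": "test", "x": 4.0, "y": "B"},
]


def stand_in_tree(case: Case) -> str:
    """Hypothetical one-split tree standing in for the chosen classification tree."""
    return "A" if float(case["x"]) < 2.5 else "B"


learning_sample, test_sample = split_by_sample_code(cases, "sample", "test")
test_error = misclassification_rate(test_sample, "y", stand_in_tree)

if __name__ == "__main__":
    print(len(learning_sample), len(test_sample), test_error)
```

Because the test cases play no part in building the tree, their error rate serves as an independent check on the CV cost estimate.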