ITrees Extended Options - Validation Tab
Select the Validation tab of the ITrees Extended Options dialog box to choose the method of cross-validation to be used in the C&RT or (Exhaustive) CHAID analysis. Two cross-validation methods are available from this tab: V-fold cross-validation and Test sample, both described below.
Element Name | Description |
---|---|
V-Fold cross-validation of tree-sequences in C&RT | As described in the Introductory Overview (the V-Fold Cross-Validation of Trees and Tree Sequences section), the v-fold cross-validation options available in the Interactive Trees (C&RT, CHAID) module will be applied to the best automatically selected tree (solution) only, and not to the entire sequence of trees (built algorithmically). V-fold cross-validation is a very powerful technique for choosing a best tree from an automatically generated sequence of trees in classification and regression trees [see General Classification and Regression Trees (GC&RT)]. For additional details, see also GC&RT Introductory Overview - Basic Ideas Part II. |
V-fold cross-validation | V-fold cross-validation is particularly useful when no test sample is available and the learning sample is too small to set a test sample aside from it. Select the V-fold cross-validation check box to use v-fold cross-validation. Additional specifications for v-fold cross-validation include Seed for random number generator and V-fold cross-validation; v-value. These values control the sampling that Statistica performs to obtain cross-validation error estimates. If this check box is selected when you click the OK button, the program will automatically grow the ("best") tree (applying pruning if C&RT is selected as the Model building method on the Interactive Trees Startup Panel - Quick tab), and then compute risk estimates separately for the training and cross-validation samples when you review the Risk estimates from the ITrees Results dialog box - Summary tab. A conceptual sketch of this procedure appears after the table. |
Seed for random number generator | The positive integer value entered in the Seed for random number generator box is used as the seed for the random number generator that draws the v random subsamples from the learning sample to test the predictive accuracy of the computed trees. |
V-fold cross-validation; v-value | The value entered in the V-fold cross-validation; v-value box determines the number of cross-validation samples that will be generated from the learning sample to provide a cross-validation error estimate for the current tree. See also the Introductory Overview for details. |
Standard error rule | This option is available only when C&RT is selected as the Model building method on the Interactive Trees Startup Panel - Quick tab. If a pruning method is selected in the Stopping rule group box on the ITrees Extended Options dialog box - Stopping tab (only applicable to C&RT), i.e., the Prune on misclassification error, Prune on deviance, or Prune on variance option button is selected, then the value entered in the Standard error rule box is used in the selection of the "right-sized" tree after pruning (see also the General Classification and Regression Trees (GC&RT) Introductory Overviews). The standard error rule is applied as follows: Find the pruned tree among all trees produced during pruning that has the smallest cost; this value is computed either from the training data sample or from the test sample if a Test sample (see below) is specified. Call this value Min. V (validation or cross-validation) cost, and call the standard error of the V cost for this tree Min. Standard error. Then select as the "right-sized" tree the pruned tree with the fewest terminal nodes that has a V cost no greater than Min. V cost plus the Standard error rule times Min. Standard error. A smaller (closer to zero) value for the Standard error rule generally results in the selection of a "right-sized" tree that is only slightly "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree. A larger (much greater than zero) value for the Standard error rule generally results in the selection of a "right-sized" tree that is much "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree. This so-called cost/complexity pruning, as implemented in the selection of the right-sized tree, makes use of the basic scientific principles of parsimony and replication: choose as the best theory the simplest theory (i.e., the pruned tree with the fewest terminal nodes) that is consistent with (i.e., has a V cost no greater than Min. V cost plus the Standard error rule times Min. Standard error) the theory best supported by independent tests (i.e., the pruned tree with the smallest V cost). A worked sketch of this selection rule follows the table. |
Cross-validate tree sequence | Select this check box to specify that the program apply the v-fold cross-validation procedure to the entire tree sequence, instead of just the final tree, and record the results. If this option is set, then on the ITrees Results dialog box - Summary tab, you can review the cross-validation cost for each tree (of a given complexity) via the Tree sequence option (this is the default manner in which v-fold cross-validation is performed in the General Classification and Regression Trees (GC&RT) module). |
Test sample | The Test sample option enables you to use a subsample of cases for estimating the accuracy of the classification or prediction. Click the Test sample button to display the Cross-Validation dialog box, through which you can switch the Test sample option on or off as well as select a variable that will be used as the sample identifier variable. Click the Sample Identifier Variable button to display a variable selection dialog box in which to choose the sample identifier variable. In addition, you need to select the code of the selected variable that uniquely identifies the cases to be used in the test sample. By default, when a sample identifier variable has been selected, a valid code will be displayed in the Code for analysis sample box. If this is not the desired code for identifying the test sample, double-click on the box (or press the F2 key on your keyboard) to display a dialog box from which you can select the desired code from the list of valid variable codes. If a Test sample is identified, the Risk estimates for the final tree (see the ITrees Results dialog box - Summary tab) and predicted values or classifications (and residuals; see the ITrees Results dialog box - Prediction tab) can be computed separately for the training and the test sample. A brief illustration of this split appears after the table. |
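
The following sketch illustrates, outside of Statistica, how a v-fold cross-validation risk estimate for a single selected tree is obtained and how it differs from the training-sample (resubstitution) risk. It is a minimal sketch only: scikit-learn's decision trees and a bundled example data set stand in for the tree-growing engine and the learning sample, and the seed and v values mirror the Seed for random number generator and V-fold cross-validation; v-value fields.

```python
# A minimal sketch, assuming scikit-learn is available; the tree, data set,
# and parameter values are illustrative stand-ins, not Statistica code.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)   # stands in for the learning sample
SEED = 1000   # "Seed for random number generator"
V = 10        # "V-fold cross-validation; v-value"

tree = DecisionTreeClassifier(max_depth=4, random_state=SEED)

# Training-sample (resubstitution) risk: misclassification rate on the data
# the tree was grown from.
training_risk = 1.0 - tree.fit(X, y).score(X, y)

# Cross-validation risk: the learning sample is split into V random subsamples;
# each subsample is held out once while the tree is re-grown on the rest.
folds = KFold(n_splits=V, shuffle=True, random_state=SEED)
cv_risk = 1.0 - cross_val_score(tree, X, y, cv=folds).mean()

print(f"training risk = {training_risk:.3f}  cross-validation risk = {cv_risk:.3f}")
```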
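The next sketch extends the same idea to a whole sequence of pruned trees and then applies the standard error rule described above: find the smallest cross-validation cost and its standard error among all pruned trees, then pick the tree with the fewest terminal nodes whose cost does not exceed Min. V cost plus the Standard error rule times Min. Standard error. Scikit-learn's cost-complexity pruning path merely stands in for the C&RT pruning sequence; this is not Statistica's implementation.

```python
# A worked sketch of the standard error rule, assuming scikit-learn and NumPy;
# cost-complexity pruning here stands in for the C&RT pruning sequence.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
SEED, V, SE_RULE = 1000, 10, 1.0   # seed, v-value, and Standard error rule

# One ccp_alpha per pruned tree in the sequence (larger alpha = simpler tree).
path = DecisionTreeClassifier(random_state=SEED).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas
folds = KFold(n_splits=V, shuffle=True, random_state=SEED)

results = []   # (terminal nodes, CV cost, standard error of the CV cost) per tree
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=SEED, ccp_alpha=alpha)
    costs = 1.0 - cross_val_score(tree, X, y, cv=folds)    # misclassification per fold
    results.append((tree.fit(X, y).get_n_leaves(),
                    costs.mean(),
                    costs.std(ddof=1) / np.sqrt(V)))

# Standard error rule: locate the minimum CV cost, then take the tree with the
# fewest terminal nodes whose CV cost is within SE_RULE standard errors of it.
_, min_cost, min_se = min(results, key=lambda r: r[1])
threshold = min_cost + SE_RULE * min_se
right_sized = min((r for r in results if r[1] <= threshold), key=lambda r: r[0])
print("right-sized tree:", right_sized[0], "terminal nodes, CV cost", round(right_sized[1], 3))
```

A larger SE_RULE raises the threshold and therefore admits simpler trees, which is exactly the behavior described for the Standard error rule box.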
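Finally, a small illustration of the Test sample idea: a sample identifier variable with a designated code marks the cases held out for testing, and the remaining cases form the training sample. The column name SAMPLE and the code TEST below are hypothetical placeholders, not Statistica defaults.

```python
# A hypothetical illustration of splitting by a sample identifier variable;
# the "SAMPLE" column and the "TEST" code are made up for this example.
import pandas as pd

data = pd.DataFrame({
    "X1":     [1.2, 0.7, 3.1, 2.2, 0.4, 1.9],
    "CLASS":  ["A", "B", "A", "B", "A", "B"],
    "SAMPLE": ["TRAIN", "TRAIN", "TEST", "TRAIN", "TEST", "TRAIN"],  # identifier variable
})

is_test = data["SAMPLE"] == "TEST"        # the code that identifies test cases
train, test = data[~is_test], data[is_test]

# The tree would be grown on `train` only; risk estimates, predictions, and
# residuals can then be reported separately for `train` and `test`.
print(len(train), "training cases,", len(test), "test cases")
```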