ITrees Results - Summary Tab
Select the Summary tab of the ITrees Results dialog box to access options to review the main results for the current tree.
Element Name | Description |
---|---|
Tree view | Use the options in this group box to review the current tree. |
Display histogram of response in Tree workbook | This option is available only if the dependent variable for the current analysis is continuous, i.e., if you selected Regression Analysis in the Type of analysis box of the Interactive Trees Startup Panel - Quick tab. When the Display histogram of response in Tree workbook check box is selected and you click the Tree browser button (see the next option description), the individual node graphs will contain not only summary statistics (means, standard deviations, fitted normal distribution plot) for the observations belonging to the respective nodes, but also histograms of the observed data. |
Tree browser | Click the Tree browser button to produce a complete representation of the results tree inside a STATISTICA Workbook-like browser, where every node is represented by a graph containing the respective split rule (unless the respective node is a terminal node) and various summary statistics. For more details, see the description of the Tree browser button in the ITrees Results dialog box - Manager tab topic. |
Brush tree | Click the Brush tree button to display the current tree and the Brushing Commands dialog box containing tree-brushing tools to interactively "brush" the current tree. In this mode, you can select all of the options for growing or pruning the tree, and immediately review the results of the chosen actions on the current tree. See also, Tree Brushing Tools for additional details. |
Tree graph | Click the Tree graph button to produce the tree graph for the current tree. For more details, see the description of the Tree graph button in the ITrees Results dialog box - Manager tab topic. |
Scrollable tree | Click this button to display the same tree graph as described above, but in a scrollable window. For more details, see the description of the Scrollable tree button in the ITrees Results dialog box - Manager tab topic. |
Tree layout | Click the Tree layout button to display the graph showing the structure of the current tree. Each node will be presented as a rectangular box; terminal nodes are highlighted in red and non-terminal nodes are highlighted in blue. |
Tree structure | Click the Tree structure button to display the Tree Structure spreadsheet, which contains summary information for all splits and the terminal nodes of the current tree. Regardless of the type of analysis problem selected on the Interactive Trees Startup Panel - Quick tab, the tree structure includes summary information for each node. If you are analyzing a categorical response variable (i.e., you selected Classification Analysis in the Type of analysis list on the Interactive Trees Startup Panel - Quick tab), the tree structure additionally includes the number of cases or objects in each observed class that are sent to the node. In the case of a continuous response variable (regression), the tree structure instead contains the mean and variance of the dependent variable for the cases or objects belonging to the node. |
Terminal nodes | Click the Terminal nodes button to display the spreadsheet containing summary information for the terminal nodes only.
For classification problems (categorical dependent variable; i.e., if you selected Classification Analysis in the Type of analysis list on the Interactive Trees Startup Panel - Quick tab), the spreadsheet shows the number of cases or objects in each observed class that are sent to the node; a Gain value is also reported. By default (with Profit equal to 1.0 for each dependent variable class; see also the options available on the Classification tab), the gain value is simply the total number of observations (cases) in the respective node. If separate Profit values are specified for each dependent variable class, then the Gain value is computed as the total profit (the number of cases in each class times the respective profit value, summed over classes). For regression problems (continuous dependent variable), the spreadsheet shows the number of cases or objects sent to the node, and the respective node mean and variance. |
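The Gain computation described above can be sketched in a few lines of Python. This is an illustrative sketch, not STATISTICA's implementation; the function and variable names are invented for the example:

```python
# Sketch of the Gain value for a terminal node, as described above.
# class_counts: number of cases of each observed class sent to the node.
# profits: per-class Profit values (default 1.0 for every class).
def node_gain(class_counts, profits=None):
    if profits is None:
        # Default Profit = 1.0 per class, so gain is simply
        # the total number of observations in the node.
        return sum(class_counts)
    # Otherwise gain is the total profit: cases times profit, summed.
    return sum(n * p for n, p in zip(class_counts, profits))

print(node_gain([30, 12]))              # default profits: 42 cases
print(node_gain([30, 12], [2.0, 0.5]))  # 30*2.0 + 12*0.5 = 66.0
```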
Importance | Click the Importance button with the spreadsheet icon to display a spreadsheet that contains the importance ranking on a 0-100 scale for each predictor variable in the analysis. Computational details regarding this measure can be found in Breiman (1984; p. 147). In general, with the results presented in this spreadsheet, you can judge the relative importance of each predictor variable for producing the final tree. Refer to the discussion in Breiman (1984) for details. See also, Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees. |
Importance | Click the Importance button with the plot icon to display a bar graph that pictorially shows the importance ranking on a 0-100 scale for each predictor variable considered in the analysis. This plot can be used for visual inspection of the relative importance of the predictor variables used in the analysis and, thus, helps to conclude which predictor variable is the most important predictor. See also, Predictor Importance in STATISTICA GC&RT, Interactive Trees, and Boosted Trees. |
Risk estimates | Click this button to display a spreadsheet with risk estimates for the analysis sample, the test sample (if one is specified on the ITrees Extended Options dialog box - Validation tab), and the v-fold cross-validation risk (if v-fold cross-validation is requested on the Validation tab). For classification-type problems with a categorical dependent variable (if you selected Classification Analysis in the Type of analysis list on the Interactive Trees Startup Panel - Quick tab), and equal misclassification costs (for a C&RT analysis), risk is calculated as the proportion of cases incorrectly classified by the tree (in the respective type of sample); if unequal misclassification costs are specified (for a C&RT analysis), the risk is adjusted accordingly, i.e., expressed relative to the overall cost. For regression-type problems with a continuous dependent variable, risk is calculated as the within-node variance. The standard error for the risk estimate is also reported. |
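The two risk measures described above (misclassification proportion under equal costs, and within-node variance for regression) can be sketched as follows. This is a minimal illustration with invented names, not the STATISTICA computation itself, and it ignores the unequal-cost adjustment and standard errors:

```python
# Sketch of the risk estimates described above (names are illustrative).

def classification_risk(y_true, y_pred):
    """Equal misclassification costs: proportion of cases misclassified."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)

def regression_risk(values_by_node):
    """Within-node variance, pooled across terminal nodes."""
    sse = 0.0
    n = 0
    for values in values_by_node:
        mean = sum(values) / len(values)
        sse += sum((v - mean) ** 2 for v in values)
        n += len(values)
    return sse / n

print(classification_risk([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25
```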
Tree sequence | This button is available only for C&RT, and only if v-fold cross-validation is specified and the Cross-validate tree sequence check box was selected on the ITrees Extended Options dialog box - Validation tab. In this case, you can review the cross-validation cost of the entire tree sequence, i.e., for each level of complexity of the tree (very similar to the results available via the Tree sequence option on the GC&RT Results dialog box - Summary tab). Note that these results are only available until the first time you change the tree by manually removing branches or adding splits: At that point, the tree sequence is no longer valid, and this option will no longer be available until you use the Grow and prune option on the ITrees Results dialog box - Manager tab to grow and cross-validate the tree sequence again. |
Predictor details | This button is available only for C&RT classification and regression analysis. Click this button to produce a spreadsheet for each terminal node containing one row for each of the K predictors. Each row of the spreadsheet contains the node ID; the name of the predictor; the splitting condition (i.e., less than cut-off point, etc.) and, in the case of a categorical predictor, the set of its levels leading to the left son; the node IDs of the successive nodes (sons) when the splitting condition is/is not satisfied (in the case of non-terminal nodes) or the string "LEAVE" (in the case of a terminal node); the impurity measure for the proposed cut-off (improvement statistic); the number of observations in the node; and the number of observations in the node with missing predictor values. |
V-fold cross-validation | Click the V-fold cross-validation button to automatically grow (and prune, if C&RT is selected as the Model building method on the Interactive Trees Startup Panel - Quick tab) the tree, and then perform v-fold cross-validation of the best tree, using the settings specified below this button; you can also specify those settings on the ITrees Extended Options dialog - Validation tab.
Note: When you use this option after you have grown a custom tree, that custom tree will be discarded, and a new algorithmically grown (best) tree will be produced. Therefore, be sure you want to discard your custom tree before using this option. Alternatively, click the New tree button at the bottom of the ITrees Results dialog (available regardless of which tab is selected) to first copy ("clone") the current tree problem, and then use the V-fold cross-validation option on that copy to create an algorithmically grown best tree solution and perform v-fold cross-validation.
The technique of v-fold cross-validation is used in various analytic procedures of STATISTICA (e.g., in General Classification and Regression Trees (GC&RT), General CHAID (GCHAID) Models, and Classification Trees) to avoid overfitting of the data. You can use v-fold cross-validation methods to algorithmically grow a best tree (consistent with the current tree building method and settings) and then fit that tree v times, each time using v-1 subsamples of the data for estimation, and evaluate the predictive power (validity) of the respective tree model in the remaining sample. This technique enables you to examine how accurate (valid) the predictions made from the best (algorithmically grown) tree are for "new" cases, i.e., observations not included in the estimation of the tree. See also the section on Comparison of Interactive Trees and GC&RT and GCHAID in the Introductory Overview for additional details. Specifications for v-fold cross-validation include Seed for random number generator and V-fold cross-validation; v-value (the Standard error rule parameter is used to prune C&RT trees). These values will be used to control the sampling that STATISTICA performs to obtain cross-validation error estimates. The Standard error rule is only available for C&RT, and will affect pruning of the tree to find the right-sized tree. See also the Introductory Overview for details. |
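The v-fold procedure described above (fit v times on v-1 subsamples, evaluate on the held-out subsample) can be sketched generically in Python. The `fit` and `cost` callables stand in for the tree-growing and risk computations and are assumptions of this sketch, not STATISTICA functions:

```python
import random

# Sketch of v-fold cross-validation as described above: split the
# learning sample into v folds, fit on v-1 folds, and evaluate the
# model's cost on the remaining held-out fold; average the v costs.
def v_fold_cv(cases, v, fit, cost, seed=1):
    rng = random.Random(seed)          # "Seed for random number generator"
    shuffled = list(cases)
    rng.shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]   # v random subsamples
    costs = []
    for i in range(v):
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        model = fit(train)             # grow the tree on v-1 folds
        costs.append(cost(model, folds[i]))  # evaluate on held-out fold
    return sum(costs) / v              # cross-validation cost estimate
```

For instance, with `fit` returning the training mean and `cost` the mean squared error, a constant sample yields a cross-validation cost of exactly zero.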
Seed for random number generator | The positive integer value entered in this box is used as the seed for a random number generator that produces v-fold random subsamples from the learning sample to test the predictive accuracy of the computed classification trees. |
V-fold cross-validation; v-value | The value entered in this box determines the number of cross-validation samples that will be generated from the learning sample to provide an estimate of the CV cost for each classification tree in the tree sequence. See also the Introductory Overview for details. |
Standard error rule | This option is only available when C&RT is selected as the Model building method on the Interactive Trees Startup Panel - Quick tab. If a pruning method is selected in the Stopping rule group box (only applicable to C&RT) on the ITrees Extended Options dialog - Stopping tab, i.e., the Prune on misclassification error, Prune on deviance, or Prune on variance option button is selected, then the value entered in the Standard error rule box is used in the selection of the "right-sized" tree after pruning (see also the General Classification and Regression Trees (GC&RT) Introductory Overviews). The Standard error rule is applied as follows: Find the pruned tree among all trees produced during pruning that has the smallest cost; this value is computed either from the training data sample or the test sample (if a Test sample is specified on the ITrees Extended Options dialog - Validation tab). Call this value Min. V (validation or cross-validation) cost, and call the standard error of the V cost for this tree Min. Standard error. Then select as the right-sized tree the pruned tree with the fewest terminal nodes that has a V cost no greater than Min. V plus the Standard error rule times Min. Standard error. A smaller (closer to zero) value for the Standard error rule generally results in the selection of a right-sized tree that is only slightly "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree. A larger (much greater than zero) value for the Standard error rule generally results in the selection of a right-sized tree that is much "simpler" (in terms of the number of terminal nodes) than the minimum V cost tree.
This so-called cost/complexity pruning, as implemented in the selection of the right-sized tree, makes use of the basic scientific principles of parsimony and replication: Choose as the best theory the simplest theory (i.e., the pruned tree with the fewest terminal nodes) that is consistent with (i.e., has a V cost no greater than Min. V plus Standard error rule times Min. SE) the theory best supported by independent tests (i.e., the pruned tree with the smallest V cost). |
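The selection rule described above can be made concrete with a short sketch. Each candidate pruned tree is represented here as a tuple `(n_terminal_nodes, v_cost, se_of_v_cost)`; the representation and function name are invented for illustration and do not reflect STATISTICA internals:

```python
# Sketch of the Standard error rule for picking the right-sized tree.
# trees: list of (n_terminal_nodes, v_cost, se_of_v_cost) tuples,
#        one per pruned tree in the sequence.
def right_sized_tree(trees, se_rule):
    # Min. V cost tree and its standard error.
    min_cost, min_se = min((c, s) for _, c, s in trees)
    # Threshold: Min. V plus Standard error rule times Min. Standard error.
    threshold = min_cost + se_rule * min_se
    # Among trees within the threshold, take the one with the
    # fewest terminal nodes (the "simplest" consistent theory).
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])
```

With `se_rule = 0` the rule degenerates toward the minimum-cost tree itself; larger values admit progressively simpler trees into the eligible set, as the description above notes.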