Classification Trees Startup Panel - Methods Tab
Select the Methods tab of the Classification Trees Startup Panel to access options for selecting the split selection method used in the analysis, the goodness-of-fit measure, the prior probabilities, and the method for specifying misclassification costs.
Element Name | Description |
---|---|
Split selection method | Use the options under Split selection method to select the method by which to compute a hierarchy of splits on the predictor variables used to classify cases or objects on the dependent variable. In a manner somewhat similar to the Forward stepwise entry method available in the Discriminant Analysis module, splits are successively added to the classification tree, just as predictors are successively entered into the prediction equations when Forward stepwise entry is used in Discriminant Analysis. Three Split selection method options are available in Statistica: |
Discriminant-based univariate splits for categ. and ordered predictors. | This option button can be selected with categorical predictor variables, ordered predictor variables, or any combination of both types of predictor variables. |
Discriminant-based linear combination splits for ordered predictors | This option button can be selected when only ordered predictor variables have been specified for the analysis. |
C & RT-style exhaustive search for univariate splits | Like the first option, the C & RT-style exhaustive search for univariate splits option can be used with categorical predictor variables, ordered predictor variables, or any combination of both types of predictor variables. Unlike the discriminant-based split selection methods, the exhaustive search method performs a grid search of all possible combinations of levels of the predictor variables to find the best split. Accordingly, with many predictor variables having many levels, the search can be extensive, resulting in long computing times.
The C&RT-style exhaustive search for univariate splits option works by searching for the split that maximizes the reduction in the value of the selected goodness of fit measure. When the fit is perfect, classification is perfect. |
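As a sketch of the idea only (not Statistica's implementation; the function names here are hypothetical), an exhaustive search over univariate splits on one ordered predictor tries every candidate threshold and keeps the one that most reduces node impurity:

```python
def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Grid search over every candidate threshold on one ordered
    predictor, keeping the split with the largest impurity reduction."""
    n = len(labels)
    parent = gini(labels)
    best_t, best_gain = None, 0.0
    for t in sorted(set(values))[:-1]:            # all candidate split points
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = (parent
                - len(left) / n * gini(left)
                - len(right) / n * gini(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

For a categorical predictor the grid runs over all two-way partitions of the predictor's levels rather than thresholds, which is why the search grows expensive when many predictors have many levels.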
Goodness of fit | There are three options under Goodness of fit: Gini measure, Chi-square, and G-square. These options are only available when the C&RT-style exhaustive search for univariate splits option button (see above) is selected. Goodness of fit is used as a criterion for selecting the best split from the set of possible candidate splits. |
Gini measure | The Gini measure reaches a value of zero when only one class is present at a node (with priors estimated from class sizes and equal misclassification costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes present at the node; it reaches its maximum value when class sizes at the node are equal). The Gini measure was the goodness-of-fit measure preferred by the developers of C&RT (Breiman et al., 1984). |
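The two properties described above (zero for a pure node, maximum at equal class sizes) can be checked directly. A minimal sketch, assuming priors estimated from class sizes and equal misclassification costs; the helper name is hypothetical:

```python
def gini(class_counts):
    """Sum of products over all ordered pairs of distinct class
    proportions, which is equivalent to 1 - sum of squared proportions."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

gini([10, 0])  # only one class present at the node -> 0.0
gini([5, 5])   # equal class sizes -> maximum for two classes, 0.5
```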
Chi-square | The Chi-square measure is similar to the standard Chi-square value computed for the expected and observed classifications (with priors adjusted for misclassification cost). |
G-square | The G-square measure is similar to the maximum-likelihood Chi-square (as, for example, computed in the Log-Linear module; with priors adjusted for misclassification cost). |
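The distinction between the two measures above is the familiar one between the Pearson and the likelihood-ratio chi-square statistics. A minimal sketch comparing observed class counts at a node against the counts expected under the priors (function names are illustrative, not Statistica's):

```python
import math

def chi_square(observed, expected):
    """Pearson chi-square: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_square(observed, expected):
    """Maximum-likelihood (likelihood-ratio) chi-square:
    2 * sum(O * ln(O / E)), with 0 * ln(0) taken as 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

chi_square([30, 10], [20, 20])  # -> 10.0
g_square([30, 10], [20, 20])    # close to the Pearson value, ~10.46
```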
Prior probabilities | There are three options under Prior probabilities: Estimated, Equal, and User spec. Use these options to specify how likely it is, without using any prior knowledge of the values for the predictor variables in the model, that a case or object will fall into one of the classes. Note that the specification of equal or unequal prior probabilities can greatly affect the accuracy of the final tree model for predicting particular classes. For details, see Prior Probabilities, the Gini Measure of Node Impurity, and Misclassification Cost. |
Estimated | Select the Estimated option button to specify that the likelihood that a case or object will fall into one of the classes is proportional to the dependent variable class sizes (see example below). |
Equal | Select the Equal option button to specify that the likelihood that a case or object will fall into one of the classes is the same for all dependent variable classes (see example below).
Example. These two options are best explained with an example. In an educational study of high school dropouts, for instance, it may happen that, overall, there are fewer dropouts than there are students who stay in school (i.e., there are different base rates); thus, the a priori probability that a student drops out is lower than that a student remains in school. The a priori probabilities can greatly affect the classification of cases or objects. If differential base rates are not of interest for the study, or if one knows that there are about an equal number of cases in each class, then one could set the a priori probabilities to be Equal. If the differential base rates are reflected in the class sizes (as they would be if the sample is a probability sample) then set the a priori probabilities to Estimated. |
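The difference between the two settings in the dropout example can be sketched as follows (a hypothetical helper, not Statistica's API):

```python
def priors(class_counts, method="estimated"):
    """Estimated: proportional to the dependent-variable class sizes.
    Equal: the same probability for every class."""
    if method == "equal":
        return [1.0 / len(class_counts)] * len(class_counts)
    total = sum(class_counts)
    return [c / total for c in class_counts]

# 120 students stay in school, 30 drop out:
priors([120, 30])           # Estimated -> [0.8, 0.2]
priors([120, 30], "equal")  # Equal     -> [0.5, 0.5]
```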
User spec | Select the User spec. option button if you have specific knowledge about the base rates (for example, based on previous research). When you select the User spec. option button, the Enter values for prior probabilities dialog box is displayed, in which you specify the a priori probabilities for each class of the dependent variable. This dialog is automatically displayed only the first time priors are set to user-defined (i.e., the User spec. option button is selected); thereafter, click the accompanying settings button to display the dialog containing the previously specified values. If the probabilities do not add up to 1.0, STATISTICA will automatically adjust them proportionately. Note that the User spec. option button is only available if Dependent variable codes are selected via the Advanced tab. |
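The proportional adjustment mentioned above amounts to rescaling the entered values so they sum to 1.0; a one-line sketch:

```python
def normalize_priors(values):
    """Rescale user-entered priors proportionately so they sum to 1.0."""
    total = sum(values)
    return [v / total for v in values]

normalize_priors([2.0, 1.0, 1.0])  # -> [0.5, 0.25, 0.25]
```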
Misclassif. costs | There are two options under Misclassif. costs: Equal and User spec. Use these options to specify the costs of misclassifying cases or objects in an observed class as belonging to another class. |
Equal | If you select the Equal option button, each off-diagonal element of the predicted class (row) x observed class (column) misclassification costs matrix is set equal to 1.0, and the specified Prior probabilities (see above) for the classes on the dependent variable are not adjusted. |
User spec | Select the User spec. option button if more accurate classification is desired for some classes than others. For example, carriers of a disease who are contagious to others might need to be more accurately predicted than carriers of the disease who are not contagious to others. If so, select User spec. in the Misclassif. costs group box to display the User Specified Misclassification Costs spreadsheet, in which you enter the (non-negative) values representing the misclassification costs in the appropriate off-diagonal cells of the misclassification costs matrix. The spreadsheet is automatically displayed only the first time you select the User spec. option button in the Misclassif. costs group box; thereafter, click the accompanying settings button to display the spreadsheet, which will contain the previously specified values. Effectively, user-specified misclassification costs can be used to "weight" the analysis more heavily toward some classes than for others.
Click the Misclassification costs button on the Classification Trees Results dialog box - Predicted Classes tab to display a spreadsheet containing the misclassification costs used in the analysis. User-specified misclassification costs are used to adjust the Prior probabilities (see above) for the classes on the dependent variable included in the analysis. If you select User spec., the adjusted a priori probabilities used in the analysis can be displayed by clicking the Adjusted priors button on the Classification Trees Results dialog box - Predicted Classes tab. |
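One common scheme for folding costs into adjusted ("altered") priors, described by Breiman et al. (1984), weights each class prior by the total cost of misclassifying that class and rescales; this is a sketch of that scheme, not necessarily the exact adjustment Statistica applies:

```python
def adjusted_priors(priors, costs):
    """Altered priors: weight each class prior by the total cost of
    misclassifying that class, then rescale so the priors sum to 1.0.
    costs[i][j] = cost of classifying an observed class-j case as class i."""
    k = len(priors)
    weights = [sum(costs[i][j] for i in range(k) if i != j) for j in range(k)]
    adjusted = [p * w for p, w in zip(priors, weights)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Equal off-diagonal costs leave the priors unchanged:
adjusted_priors([0.8, 0.2], [[0, 1], [1, 0]])  # -> [0.8, 0.2]
# Making errors on class 2 four times as costly shifts weight toward it:
adjusted_priors([0.8, 0.2], [[0, 4], [1, 0]])  # -> [0.5, 0.5]
```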