Discriminant Function Analysis - Example
Specifying the Analysis.

The first two variables in this file (Sepallen, Sepalwid) pertain to the length and width of sepals; the next two variables (Petallen, Petalwid) pertain to the length and width of petals. The last variable in this file is a grouping or coding variable that identifies to which type of iris each flower belongs (Setosa, Versicol, and Virginic). In all, there are 150 flowers in this sample, 50 of each type.
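As an aside, the same Fisher iris data ships with scikit-learn, so the structure described above can be checked outside Statistica. A minimal Python sketch (the column order in scikit-learn happens to match Sepallen, Sepalwid, Petallen, Petalwid):

```python
# Illustrative aside (not a Statistica feature): verify the sample structure.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target   # 4 measurements per flower, 3 group codes
print(X.shape)                  # 150 flowers x 4 variables
print(np.bincount(y))           # 50 flowers of each type
```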
On the Quick tab, select the Advanced options (stepwise analysis) check box. Click the Variables button to display the standard variable selection dialog. Here, select Iristype as the Grouping variable and the remaining variables as the Independent variable list that will be used in order to discriminate between iris types, and then click the OK button.
Next, specify the codes that were used in the grouping variable to identify to which group each case belongs. Click the Codes for grouping variables button and either enter 1-3, click the All button, or use the asterisk (*) convention to select all codes on the Select codes for grouping variable dialog.

Click the OK button to return to the Startup Panel. Alternatively, you can click the OK button on the Startup Panel and Statistica will automatically search the grouping variable(s) and select all codes for those variables.
On the Descriptives tab, before specifying the discriminant function analysis, click the Review descriptive stats button to look at the distribution of some of the variables and their intercorrelations. This displays the Review Descriptive Statistics dialog.

First, look at the means on the Quick tab.


Many other options to graphically view the data are available on the Review Descriptive Statistics dialog. These options are described below.
On the All cases tab, click the Box plot of means button to produce a box and whisker plot of the independent variables. A standard variable selection dialog is first displayed; select all of the variables and then click the OK button. Next, the Box-Whisker Type dialog is displayed; select the Mean/SD/1.96*SD option button and then click the OK button.
This plot is useful to summarize the distribution of the variables by three components:
- A central line to indicate central tendency or location (i.e., mean or median);
- A box to indicate variability around this central tendency (i.e., quartiles, standard errors, or standard deviations);
- Whiskers around the box to indicate the range of the variable [i.e., ranges, standard deviations, 1.96 times the standard deviations (95% normal prediction interval for individual observations around the mean), or 1.96 times the standard errors of the means (95% confidence interval)].
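The Mean/SD/1.96*SD option boils down to simple arithmetic. A short sketch with made-up data standing in for one measured variable (the variable names are mine, not Statistica's):

```python
# Sketch of the Mean/SD/1.96*SD box-whisker components, on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.8, scale=0.8, size=50)   # stand-in for one variable

m, s = x.mean(), x.std(ddof=1)                       # central line; SD
box_low, box_high = m - s, m + s                     # box: mean +/- 1 SD
whisk_low, whisk_high = m - 1.96 * s, m + 1.96 * s   # whiskers: ~95% normal prediction interval
```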
You can also view the distribution of the variables within each level of the grouping variable by clicking the Box plot of means by group button on the Within tab. When you click this button, a standard variable selection dialog is displayed in which you can select a variable from a list of the previously selected independent variables. For this example, select the variable Sepalwid and then click the OK button. The histograms, categorized by the grouping variable selected in the Startup Panel, are shown below.
As you can see, this variable is basically normally distributed within each group (type of flower).
On the All cases tab, select all variables in the variable selection dialog and then click the OK button.
Now, look at the scatterplot for variables Sepallen and Petallen. Select Scatterplots from the Graphs menu to display the 2D Scatterplots dialog. On the Quick tab, click the Variables button and in the variable selection dialog, select Petallen as the X variable, Sepallen as the Y variable, and then click the OK button. Next, select the Confidence option button under Regression bands. Now, click the OK button.

It appears that there are two "clouds" of points in this plot. Perhaps the points in the lower-left corner of this plot all belong to one iris type. If so, then there is good "hope" for this discriminant analysis. However, if not, then the possibility that the underlying distribution for these two variables is not bivariate normal, but rather multi-modal with more than one "peak," would have to be considered. To explore this possibility, create a categorized scatterplot of variables Petallen by Sepallen, categorized by Iristype. Select Scatterplots from the Graphs - Categorized Graphs menu to display the 2D Categorized Scatterplots dialog. On the Quick tab, click the Variables button to display the standard variable selection dialog. Here, select variable Petallen as the Scatterplot X, variable Sepallen as the Scatterplot Y, variable Iristype as the X-Category, and then click the OK button. Also, click the Overlaid option button under Layout and then click the OK button on the 2D Categorized Scatterplots dialog to produce the following plot.

This scatterplot shows the correlation between variables Sepallen and Petallen within groups. Thus, it can be concluded that the assumption of a bivariate normal distribution within each group is probably not violated for this particular pair of variables.
On the Advanced tab, select Forward stepwise in the Method box. In this setting, Statistica will enter variables into the discriminant function model one by one, always choosing the variable that makes the most significant contribution to the discrimination.
Stop rules. Statistica will keep "stepping" until one of four things happens. The program will terminate the stepwise procedure when:
- All variables have been entered or removed, or
- The maximum number of steps has been reached, as specified in the Number of steps box, or
- No variable not yet in the model has an F value greater than the F to enter specified in this dialog, and no variable in the model has an F value smaller than the F to remove specified in this dialog, or
- Any variable after the next step would have a tolerance value that is smaller than that specified in the Tolerance box.
If you want to remove all variables from a model, one by one, set F to enter to a very large value (e.g., 9999), and also set F to remove to a very large value that is only marginally smaller than the F to enter value (e.g., 9998). Remember that the F to enter value must always be set to a larger value than the F to remove value.
For example, if a variable that is about to enter the model has a tolerance value of .01, then this variable can be considered 99% redundant with the variables already included. When one or more variables become too redundant, the variance/covariance matrix of the variables included in the model can no longer be inverted, and the discriminant function analysis cannot be performed.
It is generally recommended that you leave the Tolerance setting at its default value of 0.01. If a variable is included in the model that is more than 99% redundant with other variables, then its practical contribution to the improvement of the discriminatory power is dubious. More importantly, if you set the tolerance to a much smaller value, round-off errors may result, leading to unstable estimates of parameters.
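Tolerance has a concrete definition: 1 minus the R-squared of a candidate variable regressed on the variables already in the model. A hedged numpy sketch of that computation (the function name and the test data are mine, not Statistica's):

```python
# Sketch: tolerance = 1 - R^2 of a column regressed on the other columns.
import numpy as np

def tolerance(X, j):
    """1 - R^2 of column j of X regressed on all remaining columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # add an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - ((y - A @ coef) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return 1 - r2

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
x3 = x1 + x2 + rng.normal(scale=0.01, size=100)   # nearly redundant with x1, x2

X_ok = np.column_stack([x1, x2])        # independent predictors
X_bad = np.column_stack([x1, x2, x3])   # x3 is almost a linear combination
print(tolerance(X_ok, 0))    # near 1: x1 carries its own information
print(tolerance(X_bad, 2))   # near 0: x3 is more than 99% redundant
```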
Reviewing the Results of Discriminant Analysis.
Results at Step 0. First, the Discriminant Function Analysis Results dialog at Step 0 is displayed. Step 0 means that no variable has yet been included in the model.

Because no variable has been entered yet, most options on this dialog are not yet available (i.e., they are dimmed). However, you can review the variables not in the equation via the Variables not in the model button.

F to enter and p-value. Wilks' lambda can be converted to a standard F value (see Notes), and you can compute the corresponding p-values for each F. However, as discussed in the Introductory Overviews, one should generally not take these p-values at face value. One is always capitalizing on chance when including several variables in an analysis without having any a priori hypotheses about them, and choosing to interpret only those that happen to be "significant" is not appropriate.
In short, there is a big difference between predicting a priori a significant effect for a particular variable and then finding that variable to be significant, as compared to choosing, from among 100 variables in the analysis, the one that happens to be significant. Without going into details, in purely practical terms, in the latter case it is not very likely that you would find the same variable to be significant if you were to replicate the study. When reporting the results of a discriminant function analysis, be careful not to leave the impression that the significant variables were chosen in the first place (for some theoretical reason), when, in fact, they were chosen because they happened to "work."
Looking at the spreadsheet above, you can see that the largest F to enter is shown for variable Petallen. Thus, that variable will be entered into the model at the next (first) step.

Overall, the discrimination between types of irises is highly significant (Wilks' Lambda = .037; F = 307.1, p<0.0001). Now look at the independent contributions to the prediction for each variable in the model.
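Wilks' lambda itself is the ratio det(W)/det(T) of the pooled within-group to the total sums-of-squares-and-cross-products (SSCP) determinants. A sketch computing it for the full four-variable iris model (an illustration outside Statistica; with all four variables the value comes out near .023, smaller than the intermediate-step .037 quoted above):

```python
# Sketch: Wilks' lambda = det(W) / det(T) for the full iris model.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

Xc = X - X.mean(axis=0)
T = Xc.T @ Xc                                  # total SSCP matrix
W = sum((X[y == g] - X[y == g].mean(axis=0)).T
        @ (X[y == g] - X[y == g].mean(axis=0))
        for g in np.unique(y))                 # pooled within-group SSCP
wilks = np.linalg.det(W) / np.linalg.det(T)
print(wilks)   # ~0.023: very strong overall discrimination
```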


As you can see, both variables that are not yet in the model have F to enter values that are larger than 1; thus, you know that the stepping will continue and that the next variable that will enter into the model is the variable Petalwid.
Results at Step 4 (Final Step).
Once again, click the Next button in the Discriminant Function Analysis Results dialog to go to the next step in the analysis. Step 3 will not be reviewed here, so click the Next button again to go to the final step in the analysis - Step 4.

Now, click the Summary: Variables in the model button to review the independent contributions for each variable to the overall discrimination between types of irises.

The Partial Wilks' Lambda indicates that variable Petallen contributes most, variable Petalwid second most, variable Sepalwid third most, and variable Sepallen least to the overall discrimination. (Remember that the smaller the Partial Wilks' Lambda, the greater the contribution to the overall discrimination.) Thus, you may conclude at this point that the measures of the petals are the major variables that allow you to discriminate between different types of irises. To learn more about the nature of the discrimination, you need to perform a canonical analysis. Thus, click on the Advanced tab.

As discussed in the Introductory Overviews, Statistica will compute different independent (orthogonal) discriminant functions. Each successive discriminant function will contribute less to the overall discriminatory power. The maximum number of functions that is estimated is either equal to the number of variables or the number of groups minus one, whichever number is smaller. In this case, two discriminant functions will be estimated.
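The min(number of variables, number of groups minus one) rule can be seen in any linear discriminant implementation; for instance, with scikit-learn (a different implementation than Statistica, shown only to illustrate the principle):

```python
# Sketch: the iris data yield min(4, 3 - 1) = 2 discriminant functions.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
lda = LinearDiscriminantAnalysis()
scores = lda.fit_transform(iris.data, iris.target)
print(scores.shape)                   # (150, 2): two canonical functions
print(lda.explained_variance_ratio_)  # the first root carries almost all the power
```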

In general, this spreadsheet reports a step-down test of all canonical roots. The first line always contains the significance test for all roots; the second line reports the significance of the remaining roots after removing the first root, and so on. Thus, this spreadsheet tells you how many canonical roots (discriminant functions) to interpret. In this example, both discriminant (or canonical) functions are statistically significant. Thus, you will have to come up with two separate conclusions (interpretations) of how the measures of sepals and petals allow you to discriminate between iris types.

Raw here means that the coefficients can be used in conjunction with the observed data to compute (raw) discriminant function scores. The standardized coefficients are the ones that are customarily used for interpretation, because they pertain to the standardized variables and therefore refer to comparable scales.

The first discriminant function is weighted most heavily by the length and width of the petals (variables Petallen and Petalwid, respectively). The other two variables also contribute to this function. The second function seems to be marked mostly by variable Sepalwid, and to a lesser extent by Petalwid and Petallen.
The factor structure coefficients (on the Canonical Analysis - Advanced tab) represent the correlations between the variables and the discriminant functions and are commonly used in order to interpret the "meaning" of discriminant functions (see also the discussion in the Introductory Overviews). In educational or psychological research, it is sometimes desirable to attach meaningful labels to functions (e.g., "extroversion," "achievement motivation"), using the same reasoning as in factor analysis (see Factor Analysis). In those cases, the interpretation of the functions should be based on the factor structure coefficients. However, such meaningful labels will not be considered for this example.


Apparently, the first discriminant function discriminates mostly between the type Setosa and the other iris types. The canonical mean for Setosa is quite different from that of the other groups. The second discriminant function seems to distinguish mostly between type Versicol and the other iris types; however, as one would expect based on the review of the eigenvalues earlier, the magnitude of the discrimination is much smaller.
Next, select the Canonical Analysis - Canonical Scores tab, and then click the Scatterplot of canonical scores button to plot the unstandardized scores for Root 1 vs. Root 2.
This plot confirms the interpretation so far. Clearly, the flowers of type Setosa are plotted much further to the right in the scatterplot. Thus, the first discriminant function mostly discriminates between that type of iris and the two others. The second function seems to provide some discrimination between the flowers of type Versicol (which mostly show negative values for the second canonical function) and the others (which have mostly positive values). However, the discrimination is not nearly as clear as that provided by the first canonical function (root).
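This separation pattern can also be checked numerically. The sketch below recomputes canonical scores with scikit-learn (sign and scaling conventions may differ from Statistica, but the relative group separations do not):

```python
# Sketch: Setosa sits far from the other groups on the first canonical root.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
scores = LinearDiscriminantAnalysis().fit_transform(iris.data, iris.target)
root1 = scores[:, 0]
means = [root1[iris.target == g].mean() for g in range(3)]  # Setosa, Versicol, Virginic

gap_setosa = min(abs(means[0] - means[1]), abs(means[0] - means[2]))
gap_others = abs(means[1] - means[2])
print(gap_setosa > gap_others)   # root 1 separates Setosa most strongly
```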
Next, select the Classification tab, and then click the Classification functions button to see those functions.
You could use these functions to define the transformations for three new variables. As you would then enter new cases, Statistica would automatically compute the classification scores for each group.
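For illustration, linear classification functions of this kind can be derived from first principles: each group's weight vector is the inverse pooled within-group covariance matrix times the group mean, with a constant of -0.5 times mean' times that weight vector, plus the log of the prior. A hedged numpy sketch (my own derivation, not Statistica's code):

```python
# Sketch: Fisher-style linear classification functions for the iris groups.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
groups = np.unique(y)
n, k = len(X), len(groups)

# Pooled within-group covariance matrix
Sw = sum((X[y == g] - X[y == g].mean(axis=0)).T
         @ (X[y == g] - X[y == g].mean(axis=0)) for g in groups) / (n - k)
Sw_inv = np.linalg.inv(Sw)
means = np.array([X[y == g].mean(axis=0) for g in groups])

weights = means @ Sw_inv                                        # one weight vector per group
consts = -0.5 * np.einsum('ij,ij->i', weights, means) + np.log(1 / k)

scores = X @ weights.T + consts   # classification score per case and group
pred = scores.argmax(axis=1)      # assign each case to its highest score
print((pred == y).mean())         # very high post hoc accuracy on these data
```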
A priori classification probabilities can also be specified (on the Classification tab). These are the probabilities that a case belongs to a respective group, without using any knowledge of the values for the variables in the model. For example, you may know a priori that there are more flowers of type Versicol in the world, and therefore the a priori probability of a flower belonging to that group is higher than that for any other group. A priori probabilities can greatly affect the accuracy of the classification. You can also compute the results for selected cases only (by using the Select button). This is particularly useful if you want to validate the discriminant function analysis results with new, additional data. However, for this example, simply accept the default selection of the Proportional to group sizes option button.
Classification Matrix. Now, click the Classification matrix button. In the resulting spreadsheet, the second line in each column header indicates the a priori classification probabilities.

Because there were exactly 50 flowers of each type, and you chose those probabilities to be proportional to the sample sizes, the a priori probabilities are equal to 1/3 for each group. In the first column of the spreadsheet, you see the percent of cases that are correctly classified in each group by the current classification functions. The remaining columns show the number of cases that are misclassified in each group, and how they are misclassified.
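The classification matrix has direct analogues in other packages. For example, a scikit-learn sketch with equal a priori probabilities (an illustration only, not Statistica's output):

```python
# Sketch: confusion matrix for LDA on iris with priors of 1/3 per group.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix

iris = load_iris()
lda = LinearDiscriminantAnalysis(priors=[1/3, 1/3, 1/3]).fit(iris.data, iris.target)
cm = confusion_matrix(iris.target, lda.predict(iris.data))
print(cm)            # rows: true group; columns: predicted group
print(cm.trace())    # correctly classified cases (the diagonal)
```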
Classification of Cases. Return to the Classification tab. Shown below is part of the Squared Mahalanobis Distances from Group Centroids spreadsheet.
You can also directly compute the probability that a case belongs to a particular group. This is a conditional probability, that is, it is contingent on your knowledge of the values for the variables in the model. Thus, these probabilities are called posterior probabilities. You can request those probabilities via the Posterior probabilities button. Note that as in the case of the classification matrix, you can select cases to be classified, and you can specify different a priori probabilities.
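Posterior probabilities, too, are straightforward to reproduce in other software; a scikit-learn sketch (illustrative only):

```python
# Sketch: posterior group-membership probabilities for each iris case.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
lda = LinearDiscriminantAnalysis().fit(iris.data, iris.target)
post = lda.predict_proba(iris.data)        # one posterior probability per group
print(post.shape)                          # (150, 3)
print(np.allclose(post.sum(axis=1), 1.0))  # each case's posteriors sum to 1
```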

The classifications are ordered into a first, second, and third choice. The column under the header 1 contains the first classification choice, that is, the group for which the respective case had the highest posterior probability. The rows marked by the asterisk (*) are cases that are misclassified. Again, in this example, the classification accuracy is very high, even considering the fact that these are all post hoc classifications. Such accuracy is rarely attained in research in the social sciences.
See also, Discriminant Function Analysis - Index.