Discriminant Function Analysis Introductory Overview - Assumptions
Discriminant function analysis is computationally very similar to MANOVA, and all assumptions for MANOVA mentioned in ANOVA/MANOVA apply. In fact, you can use the wide range of diagnostics and statistical tests of assumptions that are available in ANOVA/MANOVA to examine your data for the discriminant analysis. To avoid unnecessary duplication, the extensive set of facilities provided in ANOVA/MANOVA is not repeated in Discriminant Analysis.
- Normal distribution
- It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can produce histograms of frequency distributions from within results spreadsheets via the shortcut menu to examine whether or not variables are normally distributed. Note, however, that violations of the normality assumption are usually not "fatal"; that is, the resulting significance tests are still "trustworthy." ANOVA/MANOVA provides specific tests for normality.
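As a rough illustration of this kind of screening outside STATISTICA, the following Python sketch prints a text histogram and computes skewness and excess kurtosis for one variable. The data and the function names (`skewness_and_excess_kurtosis`, `text_histogram`) are hypothetical, and the sketch looks at one variable at a time, so it is only a first check and not a test of multivariate normality.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) based on population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def text_histogram(values, bins=5):
    """Return one text line per equal-width bin, with one '#' per observation."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for x in values:
        idx = min(int((x - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return [f"{lo + i * width:8.2f} | {'#' * c}" for i, c in enumerate(counts)]


# Hypothetical scores for a single variable.
scores = [2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 5.0, 6.0]

skew, kurt = skewness_and_excess_kurtosis(scores)
for line in text_histogram(scores):
    print(line)
print("skewness:", round(skew, 3), "excess kurtosis:", round(kurt, 3))
```

Values of skewness and excess kurtosis near 0 are consistent with a roughly normal shape; large departures suggest looking more closely at the histogram.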
- Homogeneity of variances/covariances
- It is assumed that the variance/covariance matrices of variables are homogeneous across groups. Again, minor deviations are not that important; however, before accepting final conclusions for an important study, it is probably a good idea to review the within-groups variances and correlation matrices. In particular, the scatterplot matrix that can be produced from the Prob. and Scatterplots tab of the Descriptive Statistics dialog can be very useful for this purpose. When in doubt, try re-running the analyses excluding one or two groups that are of less interest. If the overall results (interpretations) hold up, you probably do not have a problem. You can also use the numerous tests and facilities in ANOVA/MANOVA to examine whether or not this assumption is violated in your data. However, as mentioned in ANOVA/MANOVA, the multivariate Box M test for homogeneity of variances/covariances is particularly sensitive to deviations from multivariate normality, so its results should not be taken too "seriously."
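To make the review of within-groups variances and covariances concrete, here is a minimal Python sketch that computes one variance/covariance matrix per group and a simple largest-to-smallest variance ratio for the first variable. The groups, the data, and the helper name `covariance_matrix` are hypothetical; the ratio is only an informal screening aid chosen for this example, not a STATISTICA statistic and not a substitute for the tests in ANOVA/MANOVA.

```python
def covariance_matrix(rows):
    """Sample variance/covariance matrix (n - 1 denominator) for a list of cases."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]


# Hypothetical data: each case is (variable 1, variable 2).
groups = {
    "A": [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 6.0)],
    "B": [(5.0, 1.0), (6.0, 3.0), (7.0, 4.0), (8.0, 6.0)],
    "C": [(2.0, 9.0), (6.0, 1.0), (10.0, 12.0), (14.0, 2.0)],
}

# One within-group variance/covariance matrix per group.
within = {name: covariance_matrix(rows) for name, rows in groups.items()}
for name, s in within.items():
    print(name, [[round(v, 2) for v in row] for row in s])

# Informal check: how far apart are the group variances of variable 1?
first_variances = [s[0][0] for s in within.values()]
ratio = max(first_variances) / min(first_variances)
print("largest/smallest variance of variable 1:", round(ratio, 2))
```

In this made-up data, group C is far more variable than groups A and B, which is the kind of pattern that would justify re-running the analysis without that group.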
- Correlations between means and variances
- The major "real" threat to the validity of significance tests occurs when the means for variables across groups are correlated with the variances (or standard deviations). Intuitively, if there is large variability in a group with particularly high means on some variables, then those high means are not reliable. However, the overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus, the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, erroneously resulting in statistical significance. In practice, this pattern may occur if one group in the study contains a few extreme outliers, which have a large impact on the means and also increase the variability. To guard against this problem, inspect the descriptive statistics, that is, the means and standard deviations or variances, for such a correlation. ANOVA/MANOVA also allows you to plot the means and variances (or standard deviations) in a scatterplot.
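The inspection described above can be sketched in a few lines of Python: compute each group's mean and standard deviation for one variable, then correlate the two across groups. The data and the helper `pearson_r` are hypothetical and serve only to show the pattern; with so few groups the correlation is a descriptive hint, not a formal test.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores on one variable for three groups.
groups = {
    "A": [10.0, 11.0, 12.0, 13.0],
    "B": [12.0, 14.0, 16.0, 18.0],
    "C": [15.0, 18.0, 21.0, 46.0],  # contains one extreme value (46)
}

group_means = [mean(v) for v in groups.values()]
group_sds = [stdev(v) for v in groups.values()]
r = pearson_r(group_means, group_sds)

for name, m, s in zip(groups, group_means, group_sds):
    print(name, "mean:", round(m, 2), "sd:", round(s, 2))
print("correlation between group means and sds:", round(r, 3))
```

Here the single extreme value in group C raises both that group's mean and its standard deviation, so the means and standard deviations are strongly positively correlated across groups.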
- The matrix ill-conditioning problem
- Another assumption of discriminant function analysis is that the variables that are used to discriminate between groups are not completely redundant. As part of the computations involved in discriminant analysis, STATISTICA inverts the variance/covariance matrix of the variables in the model. If any one of the variables is completely redundant with the other variables then the matrix is said to be ill-conditioned, and it cannot be inverted. For example, if a variable is the sum of three other variables that are also in the model, then the matrix is ill-conditioned.
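The sum-of-three-variables example can be checked numerically. The Python sketch below uses exact fractions to show that the variance/covariance matrix of three hypothetical variables has a nonzero determinant, while adding their sum as a fourth variable makes the determinant exactly zero, so the matrix has no inverse. The data and the helper names are assumptions made for this illustration and do not describe how STATISTICA performs its computations.

```python
from fractions import Fraction


def covariance_matrix(columns):
    """Exact sample variance/covariance matrix; each column is one variable."""
    n = len(columns[0])
    means = [Fraction(sum(c), n) for c in columns]
    return [[sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
             for cj, mj in zip(columns, means)]
            for ci, mi in zip(columns, means)]


def determinant(matrix):
    """Exact determinant by Gaussian elimination with row swaps."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = Fraction(1)
    for col in range(size):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no usable pivot: the matrix is singular
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign of the determinant
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return det


# Hypothetical integer data for three variables, plus their sum.
a = [1, 2, 3, 4, 5, 6]
b = [2, 1, 4, 3, 6, 7]
c = [5, 3, 8, 1, 9, 2]
total = [x + y + z for x, y, z in zip(a, b, c)]  # completely redundant variable

print("det without the sum:", determinant(covariance_matrix([a, b, c])))
print("det with the sum:   ", determinant(covariance_matrix([a, b, c, total])))
```

Exact fractions are used so that the zero determinant is not blurred by floating-point rounding; with ordinary floating-point arithmetic the result would typically be a very small number instead of exactly zero.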
- Tolerance values
- To guard against matrix ill-conditioning, STATISTICA constantly checks the so-called tolerance value for each variable. This value is also routinely displayed when you ask to review the summary statistics for variables that are in the model and for those that are not in the model. The tolerance value is computed as 1 minus R-square of the respective variable with all other variables included in the current model. Thus, it is the proportion of variance that is unique to the respective variable. You can also refer to Multiple Regression to learn more about multiple regression and the interpretation of the tolerance value. In general, when a variable is almost completely redundant (and, therefore, the matrix ill-conditioning problem is likely to occur), the tolerance value for that variable will approach 0. The default value in Discriminant Analysis for the minimum acceptable tolerance is 0.01. STATISTICA issues a matrix ill-conditioning message when the tolerance for any variable falls below that value, that is, if any variable is more than 99% redundant. You can change this default value by selecting the Advanced options (stepwise analysis) check box on the Quick tab of the Discriminant Function Analysis dialog, and then adjusting the Tolerance box on the resulting Advanced tab of the Model Definition dialog.
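The definition of tolerance as 1 minus R-square can be sketched in Python as follows. The sketch relies on a standard identity, namely that 1 minus R-square for variable j equals 1 / (S[j][j] * Sinv[j][j]), where S is the variance/covariance matrix of the variables and Sinv is its inverse. The data, the function names, and the use of this identity are assumptions of the example; the text does not say how STATISTICA computes the value internally. Only the 0.01 threshold is taken from the text.

```python
MIN_TOLERANCE = 0.01  # default minimum acceptable tolerance quoted in the text


def covariance_matrix(columns):
    """Sample variance/covariance matrix; each column is one variable."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - mi) * (b - mj) for a, b in zip(ci, cj)) / (n - 1)
             for cj, mj in zip(columns, means)]
            for ci, mi in zip(columns, means)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    size = len(matrix)
    # Augment the matrix with the identity: [matrix | I].
    aug = [list(map(float, row)) + [float(i == j) for j in range(size)]
           for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[size:] for row in aug]


def tolerances(columns):
    """Tolerance (1 - R-square with all other variables) for each variable."""
    s = covariance_matrix(columns)
    s_inv = invert(s)
    return [1.0 / (s[j][j] * s_inv[j][j]) for j in range(len(columns))]


# Hypothetical data: x3 is the sum of x1 and x2 plus a very small disturbance.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
noise = [0.01, -0.02, 0.01, 0.01, -0.02, 0.01]
x3 = [a + b + e for a, b, e in zip(x1, x2, noise)]

for label, cols in (("x1, x2", [x1, x2]), ("x1, x2, x3", [x1, x2, x3])):
    tol = tolerances(cols)
    flagged = [t < MIN_TOLERANCE for t in tol]
    print(label, [round(t, 6) for t in tol], "below minimum:", flagged)
```

With only x1 and x2 in the model, both tolerances are well above 0.01. Once the nearly redundant x3 is added, every variable can be predicted almost perfectly from the other two, so all three tolerances fall below the minimum.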
Copyright © 2021. Cloud Software Group, Inc. All Rights Reserved.