GDA Introductory Overview - Advantages of GDA
- Specifying models for predictor variables and predictor effects
- One advantage of applying the general linear model to the discriminant analysis problem is that you can specify complex models for the set of predictor variables. For example, for a set of continuous predictor variables you can specify a polynomial regression model, a response surface model, a factorial regression, or a mixture surface regression (without an intercept). Thus, you could analyze a constrained mixture experiment (where the predictor variable values must sum to a constant) in which the dependent variable of interest is categorical. In fact, STATISTICA GDA does not impose any particular restrictions on the type of predictor variable (categorical or continuous) that can be used, or on the models that can be specified. However, use caution when working with categorical predictor variables (see "A note of caution for models with categorical predictors, and other advanced techniques" below).
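To make the idea of "models for the predictor variables" concrete, the following minimal sketch expands raw continuous predictor values into the columns of a quadratic response surface design and of a quadratic mixture design without an intercept, and dummy-codes a categorical dependent variable. The function names (`response_surface_terms`, `mixture_terms`, `dummy_code`) are hypothetical and are not part of STATISTICA; the mixture expansion assumes a Scheffe-type form (linear terms plus pairwise products), which may differ from the coding the program uses internally.

```python
from itertools import combinations


def response_surface_terms(x):
    """Quadratic response surface: intercept, linear terms,
    two-way products, and squared terms (in that order)."""
    linear = list(x)
    cross = [a * b for a, b in combinations(x, 2)]
    squares = [a * a for a in x]
    return [1.0] + linear + cross + squares


def mixture_terms(x, total=1.0, tol=1e-9):
    """Quadratic mixture expansion with no intercept (assumed
    Scheffe-type form). Components must sum to the constant `total`."""
    if abs(sum(x) - total) > tol:
        raise ValueError("mixture components must sum to the constant total")
    return list(x) + [a * b for a, b in combinations(x, 2)]


def dummy_code(label, classes):
    """Code a categorical dependent variable as 0/1 indicator columns."""
    return [1.0 if label == c else 0.0 for c in classes]


# Illustrative values only (not taken from any real data set).
surface_row = response_surface_terms([2.0, 3.0])
mixture_row = mixture_terms([0.2, 0.3, 0.5])
group_row = dummy_code("b", ["a", "b", "c"])
print(surface_row)
print(mixture_row)
print(group_row)
```

Each expanded row would then serve as one row of the design matrix on which the dummy-coded dependent variable is regressed.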
- Stepwise and best-subset analyses
- In addition to the traditional stepwise analyses for single continuous predictors provided in Quick Discriminant Analysis, the General Discriminant Analysis module makes available the options for stepwise and best-subset analyses provided in the General Regression Models (GRM) module. Specifically, you can request stepwise and best-subset selection of predictors or sets of predictors (in multiple-degree-of-freedom effects, involving categorical predictors), based on the F-to-enter and p-to-enter statistics (associated with the multivariate Wilks' Lambda test statistic). In addition, when a cross-validation sample is specified, best-subset selection can also be based on the misclassification rates for the cross-validation sample. In other words, after the discriminant functions are estimated for a given set of predictors, the misclassification rates for the cross-validation sample are computed, and the model (subset of predictors) that yields the lowest misclassification rate for the cross-validation sample is chosen. This is a powerful technique for choosing models that may yield good predictive validity while avoiding overfitting of the data (see also STATISTICA Automated Neural Networks).
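The cross-validation variant of best-subset selection described above can be sketched as an exhaustive search over predictor subsets. This is an illustration under stated assumptions, not the STATISTICA implementation: a nearest-centroid rule stands in for the estimated discriminant functions (it ignores the pooled covariance matrix that a real discriminant analysis would use), the data are made up, and all function names are hypothetical.

```python
from itertools import combinations


def centroids(rows, labels, subset):
    """Group means of the selected predictor columns."""
    sums, counts = {}, {}
    for row, lab in zip(rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(subset))
        for j, col in enumerate(subset):
            acc[j] += row[col]
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def classify(row, cents, subset):
    """Assign the row to the group with the nearest centroid
    (a simplified stand-in for a discriminant function)."""
    def sqdist(c):
        return sum((row[col] - c[j]) ** 2 for j, col in enumerate(subset))
    return min(sorted(cents), key=lambda lab: sqdist(cents[lab]))


def misclassification_rate(train, train_y, test, test_y, subset):
    """Fit on the training sample, score on the cross-validation sample."""
    cents = centroids(train, train_y, subset)
    errors = sum(classify(r, cents, subset) != y for r, y in zip(test, test_y))
    return errors / len(test)


def best_subset(train, train_y, test, test_y, n_predictors):
    """Try every non-empty subset; keep the one with the lowest
    cross-validation misclassification rate (ties favor fewer predictors)."""
    candidates = []
    for size in range(1, n_predictors + 1):
        for subset in combinations(range(n_predictors), size):
            rate = misclassification_rate(train, train_y, test, test_y, subset)
            candidates.append((rate, size, subset))
    return min(candidates)


# Made-up data: predictor 0 separates the groups, predictor 1 is mostly noise.
train = [[1.0, 4.0], [1.2, 2.0], [0.8, 9.0], [3.0, 8.0], [3.2, 1.0], [2.8, 9.0]]
train_y = ["A", "A", "A", "B", "B", "B"]
test = [[1.1, 7.0], [0.9, 6.0], [2.9, 3.0], [3.1, 5.0]]
test_y = ["A", "A", "B", "B"]

best = best_subset(train, train_y, test, test_y, n_predictors=2)
print(best)
```

Because the winning subset is chosen by its performance on the cross-validation sample itself, the reported rate is an optimistic estimate; a separate hold-out sample would be needed for an unbiased error rate.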
- Desirability profiling of posterior classification probabilities
- Another unique option of the General Discriminant Analysis (GDA) Models facilities in STATISTICA is the inclusion of Response/desirability profiler options. These options are described in some detail in the context of the Experimental Design (DOE) module, as well as the General Linear Models module. In short, the program computes the predicted response values for each dependent variable, and those values can be combined into a single desirability score. A graphical summary can then be produced to show the "behavior" of the predicted responses and the desirability score over the ranges of values for the predictor variables. In GDA, you can profile simple predicted values (as in the General Regression Models module) for the coded dependent variables (i.e., dummy-coded categories of the categorical dependent variable), and you can also profile posterior prediction probabilities. The latter option allows you to evaluate how different values for the predictor variables affect the predicted classification of cases, and it is particularly useful when interpreting the results for complex models that involve categorical and continuous predictors and their interactions. Additional details concerning these features are provided in the description of the Profiler tab.
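As a rough illustration of profiling posterior probabilities, the sketch below evaluates the posterior probability of each group over a grid of values for a single continuous predictor. It assumes the textbook linear discriminant setup (normal groups with a common variance and known prior probabilities); the group means, variance, priors, and function names are hypothetical, and the sketch does not reproduce the STATISTICA profiler or its desirability functions.

```python
import math


def posterior_probabilities(x, means, variance, priors):
    """Posterior probability of each group at predictor value x,
    computed from linear discriminant scores (common variance assumed)."""
    scores = {
        k: x * m / variance - m * m / (2.0 * variance) + math.log(priors[k])
        for k, m in means.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {k: math.exp(s - top) for k, s in scores.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}


def profile(grid, means, variance, priors):
    """Posterior probabilities at every grid value of the predictor."""
    return [(x, posterior_probabilities(x, means, variance, priors)) for x in grid]


# Hypothetical group means, common variance, and equal priors.
means = {"low": 1.0, "high": 3.0}
priors = {"low": 0.5, "high": 0.5}
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

rows = profile(grid, means, 1.0, priors)
for x, post in rows:
    print(f"x={x:.1f}  P(low)={post['low']:.3f}  P(high)={post['high']:.3f}")
```

With equal priors, the two posterior curves cross midway between the group means, which is the kind of pattern a profile plot makes visible at a glance.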
- A note of caution for models with categorical predictors, and other advanced techniques
- The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most, if not all, textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single-degree-of-freedom continuous predictors. No "experience" (in the literature) exists regarding the robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular in conjunction with categorical predictors or when the misclassification rates in a cross-validation sample are used to choose the best subset of predictors, should be considered a heuristic search method rather than a statistical analysis technique.
- The use of categorical predictor variables
- The use of categorical predictor variables or effects in a discriminant function analysis model may be (statistically) questionable. For example, you can use GDA to analyze a 2 by 2 frequency table by specifying one variable in the 2 by 2 table as the dependent variable and the other as the predictor. Clearly, the (ab)use of the GDA module in this manner would be silly (although, interestingly, in most cases you will get results that are generally compatible with those you would get by computing a simple Chi-square test for the 2 by 2 table). On the other hand, if you only consider the parameter estimates computed by GDA as the least squares solution to a set of linear (prediction) equations, then the use of categorical predictors in GDA is fully justified. Moreover, it is not uncommon in applied research to be confronted with a mixture of continuous and categorical predictors (e.g., income or age, which are continuous, along with occupational status, which is categorical) for predicting a categorical dependent variable. In those cases, it can be very instructive to consider specific models involving the categorical predictors, and possibly interactions between categorical and continuous predictors, for classifying observations. However, to reiterate, the use of categorical predictor variables in discriminant function analysis is not widely documented, and you should proceed cautiously before accepting the results of statistical significance tests and before drawing final conclusions from your analyses. Also remember that there are alternative methods available in STATISTICA to perform similar analyses, namely, the multinomial logit models available in the Generalized Linear Models (GLZ) module and the methods for analyzing multi-way frequency tables in the Log-Linear module.
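The 2 by 2 table remark can be checked with a small calculation. The sketch below regresses a 0/1 coded dependent variable on a 0/1 coded predictor by least squares and compares the result with the Pearson Chi-square statistic (without continuity correction). For a 2 by 2 table, N times the squared correlation equals that Chi-square value, which is one way to see why the two approaches tend to agree. The counts are made up, the function names are hypothetical, and the sketch does not reproduce the test statistics that GDA itself reports.

```python
def chi_square_2x2(table):
    """Pearson Chi-square for a 2 by 2 table, no continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def dummy_regression_2x2(table):
    """Least squares fit of a 0/1 dependent variable on a 0/1 predictor.
    table[x][y] holds the count of cases with predictor x and outcome y."""
    cases = [(x, y) for x in (0, 1) for y in (0, 1) for _ in range(table[x][y])]
    n = len(cases)
    mean_x = sum(x for x, _ in cases) / n
    mean_y = sum(y for _, y in cases) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in cases)
    syy = sum((y - mean_y) ** 2 for _, y in cases)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in cases)
    slope = sxy / sxx                  # difference between the two group proportions
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, r_squared, n


# Made-up counts: rows are predictor levels 0/1, columns are outcomes 0/1.
table = [[30, 10], [15, 25]]

chi2 = chi_square_2x2(table)
slope, r_squared, n = dummy_regression_2x2(table)
print(f"Chi-square = {chi2:.4f}")
print(f"N * R^2    = {n * r_squared:.4f}")
print(f"slope      = {slope:.4f}")
```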
Copyright © 1995-2020 TIBCO Software Inc. All rights reserved.