GDA Introductory Overview - Unique Features

The following topics highlight only some of the unique features of the GDA module that are usually not found in other (less complete) programs for performing discriminant analysis (refer also to the topic Comparison with Other General Linear Model Programs).

Specification of Complex Predictor Designs, Including Mixtures
One advantage of applying the general linear model to the discriminant analysis problem (see the Introductory Overview) is that you can specify complex models for the set of predictor variables. For example, for a set of continuous predictor variables you can specify a polynomial regression model, response surface model, factorial regression, or mixture surface regression (without an intercept). Thus, you could analyze a constrained mixture experiment (where the predictor variable values must sum to a constant) in which the dependent variable of interest is categorical in nature.

In fact, STATISTICA GDA does not impose any particular restrictions on the type of predictor variable (categorical or continuous) that can be used, or on the models that can be specified. However, exercise caution when using categorical predictor variables (see A note of caution for models with categorical predictors, and other advanced techniques).
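To make the idea concrete, the following minimal Python sketch (not STATISTICA code; it assumes NumPy and scikit-learn, and the helper name mixture_surface_design is ours) builds a Scheffe-style mixture design matrix for the predictors -- linear blending terms plus two-way blends, with no intercept column -- and fits a linear discriminant classifier to a categorical response:

import numpy as np
from itertools import combinations
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def mixture_surface_design(X):
    # X: (n, p) array of mixture components; each row sums to a constant.
    # Returns the linear terms plus all two-way blends; no intercept column.
    blends = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + blends)

rng = np.random.default_rng(0)
X = rng.dirichlet([1.0, 1.0, 1.0], size=120)               # 3 components summing to 1
y = (X[:, 0] + 2.0 * X[:, 1] * X[:, 2] > 0.6).astype(int)  # categorical response
Z = mixture_surface_design(X)
print(LinearDiscriminantAnalysis().fit(Z, y).score(Z, y))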

Inclusion of Categorical (ANOVA-like) Effects in Complex Designs
The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. Most -- if not all -- textbook treatments of discriminant function analysis (as well as the implementations in all commercially available computer programs) are limited to simple and stepwise analyses with single-degree-of-freedom continuous predictors. In GDA, you can include categorical "ANOVA-like" effects in complex models for the predictor variables (see "Specification of Complex Predictor Designs, Including Mixtures" above; however, see also A note of caution for models with categorical predictors, and other advanced techniques).
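As a simple illustration of such a multi-degree-of-freedom effect (a sketch using pandas and scikit-learn, not STATISTICA's interface), a categorical predictor can be dummy-coded into a block of columns that enters the discriminant model as one effect:

import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

df = pd.DataFrame({
    "dose":    [0.5, 1.0, 1.5, 0.5, 1.0, 1.5, 0.5, 1.0],
    "region":  ["north", "north", "south", "south", "east", "east", "north", "south"],
    "outcome": ["low", "low", "high", "high", "low", "high", "low", "high"],
})

# "region" becomes a block of dummy columns, i.e. a single 2-df effect;
# drop_first uses reference coding to avoid redundancy with the intercept.
X = pd.get_dummies(df[["dose", "region"]], columns=["region"], drop_first=True)
print(LinearDiscriminantAnalysis().fit(X, df["outcome"]).predict(X))
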
Best Subset Discriminant Analysis on Complex Designs
Because GDA is an implementation of the general linear model, it shares many of the unique features of the General Regression Models (GRM) module (see Comparisons with Other Regression Programs), while adding several enhancements that are of particular utility in the context of classification problems. In addition to stepwise discriminant analysis techniques, STATISTICA GDA includes methods for performing searches for best subsets of predictor variables and/or effects. As in GRM, the best subset search (as well as the stepwise selection methods) can be used for predictor models that include multiple-degree-of-freedom effects for categorical predictors; during the stepwise or best subset search, those effects are evaluated and moved in or out of the model as a whole, and are not "broken up" into single-degree-of-freedom variables. In addition, in GDA you can select as the criterion for evaluating the best subset the misclassification rates in the analysis sample, as well as in a cross-validation sample (i.e., cases not included in the computations of the least-squares parameter estimates). The combination of best subset methods for continuous and categorical predictors with misclassification-based selection of effects makes GDA a uniquely efficient data mining tool.
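A minimal sketch of that kind of search (our own illustration in Python with scikit-learn, not STATISTICA's implementation) makes the "whole effect" rule explicit: each candidate effect is a block of columns that is added or removed together, and each candidate model is scored by its misclassification rate:

from itertools import combinations
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def best_subset(effects, y):
    # effects: dict mapping an effect name to its (n, k) block of columns;
    # a multi-degree-of-freedom effect is never split across candidates.
    best_err, best_names = np.inf, None
    names = list(effects)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            X = np.column_stack([effects[name] for name in subset])
            err = 1.0 - LinearDiscriminantAnalysis().fit(X, y).score(X, y)
            if err < best_err:
                best_err, best_names = err, subset
    return best_names, best_err
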
Best Subset Selection of Predictors Based on Cross-Validation Misclassification Rates
As described in "Best Subset Discriminant Analysis on Complex Designs" above, GDA provides options for performing best subset searches of the predictors, even with complex ANOVA-like predictor effects (for categorical predictors). Several criteria are available for choosing the predictor effects to be included in the model; one criterion is to include the predictor effects that produce the smallest misclassification rates when classifying cases (based on the posterior classification probabilities). You can choose to compute those misclassification rates either for the analysis sample (i.e., for cases or observations that are included in the computations of the parameter estimates) or for a cross-validation sample (i.e., for cases or observations that are not included in the computation of the parameter estimates). This method is particularly useful in data mining applications where one needs to build models that have good predictive validity for classifying new cases. It also guards against over-fitting of models: when considering only the classification of cases in the analysis sample, and in particular when the sample sizes are large, one often includes predictor effects that slightly improve the fit of the model in the analysis sample but have no predictive validity in the cross-validation sample.
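The sketch below (again an illustration with scikit-learn, not STATISTICA's code) shows the cross-validation variant of that criterion: the parameters are estimated from the analysis sample only, and the misclassification rate is computed on cases withheld from estimation:

from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def cv_misclassification(X, y, cv_fraction=0.3, seed=0):
    # Split off a cross-validation sample that plays no part in estimation.
    X_fit, X_cv, y_fit, y_cv = train_test_split(
        X, y, test_size=cv_fraction, random_state=seed, stratify=y)
    model = LinearDiscriminantAnalysis().fit(X_fit, y_fit)  # analysis sample
    return 1.0 - model.score(X_cv, y_cv)                    # cross-validation sample
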
Profiling of Predicted Responses, Posterior Probabilities, and Desirability
In complex discriminant analysis problems with many predictor effects and classes (groups) in the dependent variable, it is often difficult to interpret the results. Specifically, it is often not clear how to determine the combinations of values for the predictors that maximize the likelihood that a respective case belongs to a particular class, or set of classes. GDA includes the same Desirability profiler and response optimization options provided in the General Regression Models (GRM) module. However, in GDA the profiler methods can be applied to the (dummy-) coded dependent variables (see also the Introductory Overview), and you can choose between simple (regression-like) predicted values and posterior classification probabilities, which always vary between 0 and 1. For example, after fitting a particular model to the data, you can perform a grid search of the design space (the values of the predictor effects) to maximize the posterior classification probabilities for one particular class, or for a combination of particular classes. To our knowledge, only STATISTICA offers this extension of response profiling to discriminant analysis for classification.
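A minimal sketch of such a grid search (illustrative Python, assuming a fitted scikit-learn classifier with a predict_proba method and two continuous predictors; it is not STATISTICA's profiler) finds the predictor settings that maximize the posterior probability of a chosen class:

import numpy as np

def maximize_posterior(model, x1_range, x2_range, target_class, steps=50):
    # Evaluate the posterior probability of target_class over a grid of
    # predictor settings and return the best point found.
    g1, g2 = np.meshgrid(np.linspace(*x1_range, steps),
                         np.linspace(*x2_range, steps))
    grid = np.column_stack([g1.ravel(), g2.ravel()])
    idx = list(model.classes_).index(target_class)
    posterior = model.predict_proba(grid)[:, idx]  # always between 0 and 1
    best = posterior.argmax()
    return grid[best], posterior[best]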