Discriminant Function Analysis Introductory Overview - Classification

Another major purpose to which discriminant analysis is applied is the predictive classification of cases. Once a model has been finalized and the discriminant functions have been derived, how well can we predict to which group a particular case belongs?

A priori and post hoc predictions
Before going into the details of the different estimation procedures, we want to make sure that the distinction between a priori and post hoc prediction is clear. Obviously, if we estimate, based on some data set, the discriminant functions that best discriminate between groups, and then use the same data to evaluate how accurate our prediction is, then we are very much capitalizing on chance. In general, one will always get a worse classification when predicting cases that were not used for the estimation of the discriminant function. Put another way, post hoc predictions are always better than a priori predictions. (The trouble with predicting the future a priori is that one does not know what will happen; it is much easier to find ways to predict what we already know has happened.) Therefore, one should never base one's confidence regarding the correct classification of future observations on the same data set from which the discriminant functions were derived; rather, if one wants to classify cases predictively, it is necessary to collect new data to "try out" (cross-validate) the utility of the discriminant functions.
Classification functions
Discriminant analysis automatically computes the classification functions. These are not to be confused with the discriminant functions. The classification functions can be used to determine to which group each case most likely belongs. There are as many classification functions as there are groups. Each function allows us to compute, for each case, a classification score for the respective group by applying the formula:

Si = ci + wi1*x1 + wi2*x2 + ... + wim*xm

In this formula, the subscript i denotes the respective group; the subscripts 1, 2, ..., m denote the m variables; ci is a constant for the i'th group; wij is the weight for the j'th variable in the computation of the classification score for the i'th group; xj is the observed value of the j'th variable for the respective case; and Si is the resultant classification score.

We can use the classification functions to directly compute classification scores for new observations (for example, these functions can be specified in spreadsheet formulas as the formulas for computing new variables; as new cases are added to the file, their classification scores are then automatically computed).
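
As a minimal Python sketch of this computation, assuming the constants ci and weights wij have already been obtained from a fitted discriminant analysis (all numbers and group labels below are purely illustrative):

import numpy as np

# Hypothetical coefficients for three groups; one constant per group (ci)
# and one row of weights per group (wij), one column per variable.
constants = np.array([-12.3, -8.7, -5.1])
weights = np.array([[1.2, 0.8, 2.1],
                    [0.9, 1.1, 1.4],
                    [0.5, 0.7, 0.6]])

def classification_scores(x, constants, weights):
    # Si = ci + wi1*x1 + wi2*x2 + ... + wim*xm, computed for every group at once
    return constants + weights @ x

x_new = np.array([3.4, 2.0, 5.5])                # observed values for a new case
print(classification_scores(x_new, constants, weights))   # one score per group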

Classification of cases
Once we have computed the classification scores for a case, it is easy to decide how to classify the case: in general, we classify the case as belonging to the group for which it has the highest classification score (unless the a priori classification probabilities are widely disparate; see below). Thus, if we were to study high school students' post-graduation career/educational choices (e.g., attending college, attending a professional or trade school, or getting a job) based on several variables assessed one year prior to graduation, we could use the classification functions to predict what each student is most likely to do after graduation. However, we would also like to know the probability that the student will make the predicted choice. Those probabilities are called posterior probabilities, and they can also be computed. To understand how those probabilities are derived, let us first consider the so-called Mahalanobis distances.
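
Before turning to those distances, here is a small sketch of the decision rule just described; the group labels and scores are made up, and equal a priori probabilities are assumed:

import numpy as np

# Assign the case to the group with the highest classification score.
groups = ["college", "trade school", "job"]
scores = np.array([4.2, 5.9, 3.1])               # illustrative classification scores
print(groups[int(np.argmax(scores))])            # -> "trade school"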
Mahalanobis distances
You may have read about these distances in other parts of the STATISTICA user manual (e.g., in Multiple Regression). In general, the Mahalanobis distance is a measure of distance between two points in the space defined by two or more correlated variables. For example, if there are two variables that are uncorrelated, then we could plot points (cases) in a standard two-dimensional scatterplot; the Mahalanobis distances between the points would then be identical to the Euclidean distances, that is, the distances as measured, for example, by a ruler. If there are three uncorrelated variables, we could also simply use a ruler (in a 3D plot) to determine the distances between points. If there are more than three variables, we can no longer represent the distances in a plot. Also, when the variables are correlated, the axes in the plot can be thought of as being non-orthogonal, that is, they are not positioned at right angles to each other. In those cases, the simple Euclidean distance is not an appropriate measure, while the Mahalanobis distance adequately accounts for the correlations.
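
The following sketch computes the Mahalanobis distance for two correlated variables; the simulated data and the test point are purely illustrative:

import numpy as np

# Simulate two positively correlated variables for 100 cases.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=100)

mean = data.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(data, rowvar=False))

def mahalanobis(x, mean, cov_inv):
    # sqrt((x - mean)' * inverse covariance matrix * (x - mean))
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

point = np.array([1.5, -0.5])
print(mahalanobis(point, mean, cov_inv))   # distance from the centroid,
                                           # accounting for the correlation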
Mahalanobis distances and classification
For each group in our sample, we can determine the location of the point that represents the means for all variables in the multivariate space defined by the variables in the model. These points are called group centroids. For each case, we can then compute its Mahalanobis distances from each of the group centroids. Again, we would classify the case as belonging to the group to which it is closest, that is, the group for which the Mahalanobis distance is smallest.
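
A minimal sketch of this nearest-centroid rule, assuming illustrative group centroids and a made-up pooled within-group covariance matrix, might look like this:

import numpy as np

# Classify one case to the group whose centroid is nearest in the Mahalanobis metric.
centroids = {"college":      np.array([3.5, 2.8]),
             "trade school": np.array([2.1, 3.0]),
             "job":          np.array([1.0, 1.5])}
pooled_cov_inv = np.linalg.inv(np.array([[1.0, 0.3],
                                         [0.3, 0.8]]))

case = np.array([2.6, 2.9])
distances = {group: float(np.sqrt((case - m) @ pooled_cov_inv @ (case - m)))
             for group, m in centroids.items()}
print(min(distances, key=distances.get))    # group with the smallest distance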
Posterior classification probabilities
Using the Mahalanobis distances to classify cases, we can now also derive probabilities. The probability that a case belongs to a particular group is basically inversely proportional to the Mahalanobis distance from that group's centroid (it is not exactly proportional because we assume a multivariate normal distribution around each centroid). Because these probabilities are computed after we know the values of the variables in the model for the respective case, they are called posterior probabilities. In summary, the posterior probability is the probability, based on our knowledge of the values of other variables, that the respective case belongs to a particular group. Of course, discriminant analysis automatically computes those probabilities for all cases (or for selected cases only, in cross-validation studies).
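
Under the assumptions of multivariate normality, a common (pooled) covariance matrix, and equal a priori probabilities, the posterior probabilities can be sketched from the squared Mahalanobis distances as follows (the distances themselves are illustrative):

import numpy as np

# Posterior probabilities from squared Mahalanobis distances to each centroid,
# assuming equal priors: P(group | case) is proportional to exp(-0.5 * D^2).
d_squared = np.array([1.8, 0.9, 4.2])        # squared distance to each group centroid
unnormalized = np.exp(-0.5 * d_squared)
posteriors = unnormalized / unnormalized.sum()
print(posteriors)                            # sum to 1; the largest value identifies
                                             # the most likely group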
A priori classification probabilities
There is one additional factor that needs to be considered when classifying cases. Sometimes, we know ahead of time that there are more observations in one group than in any other; thus, the a priori probability that a case belongs to that group is higher. For example, if we know ahead of time that 60% of the graduates from our high school usually go to college (20% go to a professional school, and another 20% get a job), then we should adjust our prediction accordingly: a priori, and all other things being equal, it is more likely that a student will attend college than choose either of the other two options. Discriminant analysis allows you to specify different a priori probabilities, which will then be used to adjust the classification of cases (and the computation of posterior probabilities) accordingly.

In practice, you need to ask yourself whether the unequal number of cases in different groups in the sample is a reflection of the true distribution in the population, or whether it is only the (random) result of the sampling procedure. In the former case, we would set the a priori probabilities to be proportional to the sizes of the groups in our sample; in the latter case, we would specify the a priori probabilities as being equal in each group. The specification of different a priori probabilities can greatly affect the accuracy of the prediction.
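
As a sketch of how such priors enter the computation (continuing the illustrative distances from above and the 60%/20%/20% example), the posteriors are simply reweighted by the a priori probabilities before being normalized:

import numpy as np

# Same posterior computation as before, but weighted by a priori probabilities.
priors = np.array([0.6, 0.2, 0.2])           # college, trade school, job
d_squared = np.array([1.8, 0.9, 4.2])        # squared Mahalanobis distances
unnormalized = priors * np.exp(-0.5 * d_squared)
posteriors = unnormalized / unnormalized.sum()
print(posteriors)                            # the prior pulls the classification
                                             # toward the larger group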

Summary of the prediction
A common result used to determine how well the current classification functions predict group membership is the classification matrix. The classification matrix shows the number of cases that were correctly classified (on the diagonal of the matrix) and those that were misclassified (off the diagonal).
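
A classification matrix can be sketched directly from observed and predicted group membership; the labels below are illustrative:

import numpy as np

groups = ["college", "trade school", "job"]
observed  = ["college", "job", "college", "trade school", "job", "college"]
predicted = ["college", "job", "trade school", "trade school", "college", "college"]

# Count each observed/predicted combination.
matrix = np.zeros((len(groups), len(groups)), dtype=int)
for obs, pred in zip(observed, predicted):
    matrix[groups.index(obs), groups.index(pred)] += 1

print(matrix)                                # rows = observed, columns = predicted
print(matrix.trace() / matrix.sum())         # proportion correctly classified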
Another word of caution
To reiterate, predicting post hoc what has happened in the past is not that difficult. It is not uncommon to obtain very good classification if one uses the same cases from which the classification functions were computed. In order to get an idea of how well the current classification functions "perform," one must classify (a priori) different cases, that is, cases that were not used to estimate the classification functions. In discriminant analysis, you can use the selection conditions to include or exclude cases from the computations; thus, the classification matrix can be computed for "old" cases as well as "new" cases. Only the classification of new cases allows us to assess the predictive validity of the classification functions (see also cross-validation); the classification of old cases only provides a useful diagnostic tool for identifying outliers or areas where the classification functions seem to be less adequate.
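
As a sketch of such a selection condition (the sample size and holdout proportion here are arbitrary), one could randomly exclude part of the cases from the estimation and reserve them for cross-validation:

import numpy as np

# Hold out about 30% of the cases; only the held-out ("new") cases give an
# honest estimate of predictive accuracy.
rng = np.random.default_rng(1)
n_cases = 200
holdout = rng.random(n_cases) < 0.3          # True for cases excluded from estimation
estimation_cases = np.where(~holdout)[0]     # used to compute the classification functions
validation_cases = np.where(holdout)[0]      # used only to cross-validate the predictions
print(len(estimation_cases), len(validation_cases))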
Summary
In general, discriminant analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with better-than-chance accuracy.