Correspondence Analysis Overview
Correspondence analysis is a descriptive/exploratory technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns.
The results provide information that is similar in nature to that produced by Factor Analysis techniques, and they allow you to explore the structure of the categorical variables included in the table. The most common kind of table of this type is the two-way frequency crosstabulation table (see, for example, the Basic Statistics or Log-Linear module).
In a typical correspondence analysis, a crosstabulation table of frequencies is first standardized, so that the relative frequencies across all cells sum to 1.0. One way to state the goal of a typical analysis is to represent the entries in the table of relative frequencies in terms of the distances between individual rows and columns in a low-dimensional space. This is best illustrated by a simple example, which is described here. There are several parallels in interpretation between correspondence analysis and Factor Analysis, and some similar concepts are also pointed out here.
For a comprehensive description of this method, computational details, and its applications (in the English language), refer to the classic text by Greenacre (1984). These methods were originally developed primarily in France by Jean-Paul Benzécri in the early 1960s and the 1970s, but have only more recently gained popularity in English-speaking countries. Note that similar techniques were developed independently in several countries, where they were known as optimal scaling, reciprocal averaging, optimal scoring, quantification method, or homogeneity analysis. In the following paragraphs, a general introduction to correspondence analysis is presented. Note that the Correspondence Analysis module also performs multiple correspondence analyses of Burt tables. If you are familiar with the general concepts used in correspondence analysis, you can refer to Computational Details for a brief review of the computational formulas.
Smoking Category | |||||
Staff Group | (1) None | (2) Light | (3) Medium | (4) Heavy | Row Totals |
(1) Senior Managers | 4 | 2 | 3 | 2 | 11 |
(2) Junior Managers | 4 | 3 | 7 | 4 | 18 |
(3) Senior Employees | 25 | 10 | 12 | 4 | 51 |
(4) Junior Employees | 18 | 24 | 33 | 13 | 88 |
(5) Secretaries | 10 | 6 | 7 | 2 | 25 |
Column Totals | 61 | 45 | 62 | 25 | 193 |
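The standardization step mentioned earlier (dividing by the grand total so that all cells sum to 1.0) can be sketched for this table as follows. This is a minimal illustration in Python with NumPy, not the Statistica implementation; the variable names (N, P, row_mass, col_mass) are chosen here for illustration only.

```python
import numpy as np

# Observed frequencies from the table above.
# Rows: staff groups (1)-(5); columns: None, Light, Medium, Heavy.
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

n = N.sum()                # grand total of the table
P = N / n                  # relative frequencies; all cells sum to 1.0
row_mass = P.sum(axis=1)   # row totals of the relative frequency table
col_mass = P.sum(axis=0)   # column totals of the relative frequency table

print("grand total:", n)
print("sum of relative frequencies:", P.sum())
print("row totals:", np.round(row_mass, 6))
print("column totals:", np.round(col_mass, 6))
```

The row and column totals of the relative frequency table are the quantities that later appear as weights in the analysis.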
You might think of the 4 column values in each row of the table as coordinates in a 4-dimensional space, and one could compute the (Euclidean) distances between the 5 row points in the 4-dimensional space. The distances between the points in the 4-dimensional space summarize all information about the similarities between the rows in the table above. Now suppose one could find a lower-dimensional space in which to position the row points in a manner that retains all, or almost all, of the information about the differences between the rows. You could then present all information about the similarities between the rows (types of employees in this case) in a simple 1, 2, or 3-dimensional graph. While this may not appear to be particularly useful for small tables like the one shown, one can easily imagine how the presentation and interpretation of very large tables (for example, differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) could greatly benefit from the simplification that can be achieved using correspondence analysis (for example, representing the 10 consumer items in a two-dimensional space).
Eigenvalues and Inertia for all Dimensions
Input Table (Rows x Columns): 5 x 4; Total Inertia = .08519; Chi² = 16.442 | |||||
No. of Dims | Singular Values | Eigenvalues | Perc. of Inertia | Cumulative Percent | Chi Squares |
1 | .273421 | .074759 | 87.75587 | 87.7559 | 14.42851 |
2 | .100086 | .010017 | 11.75865 | 99.5145 | 1.93332 |
3 | .020337 | .000414 | .48547 | 100.0000 | .07982 |
First, it appears that, with a single dimension, 87.76% of the inertia can be explained, that is, the relative frequency values that can be reconstructed from a single dimension can reproduce 87.76% of the total Chi-square value (and, thus, of the inertia) for this two-way table; two dimensions allow you to explain 99.51%.
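The quantities in the table above can be approximated with a singular value decomposition of the standardized residuals of the relative frequency table. The sketch below assumes the usual textbook formulation of correspondence analysis and uses NumPy; it is not taken from the Statistica source, and all variable names are illustrative.

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

n = N.sum()
P = N / n                    # relative frequencies
r = P.sum(axis=1)            # row totals
c = P.sum(axis=0)            # column totals

# Standardized residuals: (observed - expected) / sqrt(expected),
# expressed in terms of relative frequencies.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)

# A 5 x 4 table has at most min(5, 4) - 1 = 3 nontrivial dimensions.
sing = np.linalg.svd(S, compute_uv=False)[:3]
eig = sing ** 2                      # eigenvalues (inertia per dimension)
total_inertia = eig.sum()
pct = 100 * eig / total_inertia      # percent of inertia
cum_pct = np.cumsum(pct)             # cumulative percent
chi2 = n * eig                       # Chi-square share per dimension

for k in range(3):
    print(k + 1, round(sing[k], 6), round(eig[k], 6),
          round(pct[k], 5), round(cum_pct[k], 4), round(chi2[k], 5))
print("total inertia:", round(total_inertia, 5), "Chi-square:", round(chi2.sum(), 3))
```

The total inertia equals the overall Chi-square value divided by the grand total, which is why percentages of inertia and percentages of Chi-square coincide.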
Row Name | Dim. 1 | Dim. 2 |
(1) Senior Managers | -.065768 | .193737 |
(2) Junior Managers | .258958 | .243305 |
(3) Senior Employees | -.380595 | .010660 |
(4) Junior Employees | .232952 | -.057744 |
(5) Secretaries | -.201089 | -.078911 |
Of course, you can plot these coordinates in a two-dimensional scatterplot from the Correspondence Analysis Results dialog box. Remember that the purpose of correspondence analysis is to reproduce the distances between the row and column points in a two-way table in a lower-dimensional display; note that, as in factor analysis, the actual rotational orientation of the axes is arbitrarily chosen so that successive dimensions explain less and less of the overall Chi-square value (or inertia). You could, for example, reverse the signs in each column in the table shown above, thereby effectively rotating the respective axis in the plot by 180° (note that you can quickly achieve this reversal of scales using the Reverse scaling check box on the Scale Options dialog box for the respective axis).
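The row coordinates listed above can be approximated from the same decomposition. This is a minimal sketch under the assumption that the reported values are principal row coordinates (row profile standardization); the names U, s, Vt, and F are illustrative. Because the sign of each axis is arbitrary, a given SVD routine may return coordinates with the signs of a whole column reversed.

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)   # standardized residuals

# Thin SVD: U is 5 x 4, s has 4 entries (the last is numerically zero).
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Principal row coordinates: left singular vectors scaled by the
# singular values and divided by the square root of the row totals.
F = (U * s) / np.sqrt(r)[:, None]
row_coords = F[:, :2]                    # first two dimensions

labels = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
for name, (d1, d2) in zip(labels, row_coords):
    print(f"{name:18s} {d1: .6f} {d2: .6f}")
```

Multiplying a column of row_coords by -1 corresponds to the axis reversal described above and leaves all distances between points unchanged.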
What is important are the distances between the points in the two-dimensional display, which are informative in that row points that are close to each other are similar with regard to the pattern of relative frequencies across the columns. If you have produced this plot, you can see that, along the most important first axis in the plot, the Senior Employees and Secretaries are relatively close together on the left side of the origin (scale position 0). If you look at the table of relative row frequencies (that is, frequencies standardized so that their sum in each row is equal to 100%), you will see that these two groups of employees indeed show very similar patterns of relative frequencies across the categories of smoking intensity.
Percentages of Row Totals | |||||
Smoking Category | |||||
Staff Group | (1) None | (2) Light | (3) Medium | (4) Heavy | Row Totals |
(1) Senior Managers | 36.36 | 18.18 | 27.27 | 18.18 | 100.00 |
(2) Junior Managers | 22.22 | 16.67 | 38.89 | 22.22 | 100.00 |
(3) Senior Employees | 49.02 | 19.61 | 23.53 | 7.84 | 100.00 |
(4) Junior Employees | 20.45 | 27.27 | 37.50 | 14.77 | 100.00 |
(5) Secretaries | 40.00 | 24.00 | 28.00 | 8.00 | 100.00 |
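The row percentages in this table are simply the observed frequencies divided by their row totals. A short NumPy sketch (variable names are illustrative):

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

# Divide each row by its own total and express as percentages.
row_pct = 100 * N / N.sum(axis=1, keepdims=True)

labels = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
for name, row in zip(labels, row_pct):
    print(f"{name:18s}", np.round(row, 2), round(row.sum(), 2))
```

Divided by 100, these rows are the row profiles on which the row coordinates are based.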
Obviously the final goal of correspondence analysis is to find theoretical interpretations (that is, meaning) for the extracted dimensions. One method that might aid in interpreting extracted dimensions is to plot the column points. Shown below are the column coordinates for the first and second dimension.
Smoking category | Dim. 1 | Dim. 2 |
None | -.393308 | .030492 |
Light | .099456 | -.141064 |
Medium | .196321 | -.007359 |
Heavy | .293776 | .197766 |
It appears that the first dimension distinguishes mostly between the different degrees of smoking, and in particular between category None and the others. Thus one can interpret the greater similarity of Senior Managers with Secretaries, with regard to their position on the first axis, as mostly deriving from the relatively large numbers of nonsmokers (category None) in these two groups of employees.
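The column coordinates can be sketched in the same way as the row coordinates, using the right singular vectors instead of the left ones. As before, this assumes the reported values are principal column coordinates (column profile standardization); the code is an illustration with NumPy, and axis signs may come out reversed relative to the table.

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)   # standardized residuals

U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Principal column coordinates: right singular vectors scaled by the
# singular values and divided by the square root of the column totals.
G = (Vt.T * s) / np.sqrt(c)[:, None]
col_coords = G[:, :2]

for name, (d1, d2) in zip(["None", "Light", "Medium", "Heavy"], col_coords):
    print(f"{name:8s} {d1: .6f} {d2: .6f}")
```

On the first dimension, category None falls on one side of the origin and the three smoking categories on the other, which matches the interpretation given above.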
... Options tab of the Correspondence Analysis Results dialog box, if one is primarily interested in interpreting the differences (distances) between the rows in the table.
Conversely, if you are interested in the similarities and differences between the columns of the table, you should select the Column profiles (interpret col. dist.) option button in the Standardization of Coordinates group box on the Options tab of the Correspondence Analysis Results dialog box; the resulting column coordinates are then derived from the analysis of the column profile matrix (the matrix of column proportions, where the sum of the table entries in each column is equal to 1.0). This standardization will maximize the distances between the column points in the final coordinate system.
By default, Statistica performs both types of standardizations prior to reporting the coordinates (the Row & column profiles option button in the Standardization of Coordinates group box on the Options tab of the Correspondence Analysis Results dialog box). The row coordinates are computed from the row profile matrix, and the column coordinates are computed from the column profile matrix.
A fourth option button, Canonical standardization, is also available in the Standardization of Coordinates group box on the Options tab of the Correspondence Analysis Results dialog box, and it amounts to a standardization of the columns and rows of the matrix of relative frequencies. For more information, refer to Computational Details; this standardization amounts to a rescaling of the coordinates based on the row profile standardization and the column profile standardization, and this type of standardization is not widely used. Note also that a variety of other custom standardizations can be easily performed, because Statistica reports the raw eigenvalues matrix, which can further be processed with Statistica Visual Basic.
In that case (but not if you chose the canonical standardization), the squared Euclidean distance between, for example, two row points i and i' in the respective coordinate system of a given number of dimensions actually approximates a weighted (that is, Chi-square) distance between the relative frequencies:
$$d_{ii'}^2 = \sum_j \frac{1}{c_j}\left(\frac{p_{ij}}{r_i} - \frac{p_{i'j}}{r_{i'}}\right)^2$$
In this formula, $d_{ii'}^2$ stands for the squared distance between the two points, $c_j$ stands for the column total for the j'th column of the standardized frequency table (where the sum of all entries or mass is equal to 1.0), $p_{ij}$ stands for the individual cell entries in the standardized frequency table (row i, column j), $r_i$ stands for the row total for the i'th row of the relative frequency table, and the summation ($\sum$) is over the columns of the table. To reiterate, only the distances between row points and, correspondingly, between column points are interpretable in this manner; the distances between row points and column points cannot be interpreted.
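The weighted distance described here can be checked numerically. The sketch below assumes the usual definition of the Chi-square distance between two row profiles (squared differences between the profiles, each divided by the corresponding column total, summed over columns) and compares it with the squared Euclidean distance between principal row coordinates. The function name chi2_dist_sq and all variable names are hypothetical and used only for this illustration.

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)
c = P.sum(axis=0)
profiles = P / r[:, None]        # row profiles p_ij / r_i


def chi2_dist_sq(i, k):
    """Squared Chi-square distance between row profiles i and k."""
    diff = profiles[i] - profiles[k]
    return float(np.sum(diff ** 2 / c))


# Principal row coordinates from the SVD of the standardized residuals.
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
F = (U * s) / np.sqrt(r)[:, None]

# Compare the weighted distance with Euclidean distances in the
# coordinate system, using all dimensions and only the first two.
for i in range(5):
    for k in range(i + 1, 5):
        full = float(np.sum((F[i] - F[k]) ** 2))
        two_d = float(np.sum((F[i, :2] - F[k, :2]) ** 2))
        print(i + 1, k + 1, round(chi2_dist_sq(i, k), 6),
              round(full, 6), round(two_d, 6))
```

With all dimensions retained, the two distances agree up to rounding; with only two dimensions, the Euclidean distance is an approximation from below.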
Row Coordinates and Contributions to Inertia | ||||||
Staff Group | Coordin. Dim.1 | Mass | Quality | Relative Inertia | Inertia Dim.1 | Cosine² Dim.1 |
(1) Senior Managers | -.065768 | .056995 | .092232 | .031376 | .003298 | .092232 |
(2) Junior Managers | .258958 | .093264 | .526400 | .139467 | .083659 | .526400 |
(3) Senior Employees | -.380595 | .264249 | .999033 | .449750 | .512060 | .999033 |
(4) Junior Employees | .232952 | .455959 | .941934 | .308354 | .330974 | .941934 |
(5) Secretaries | -.201089 | .129534 | .865346 | .071053 | .070064 | .865346 |
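The statistics in this table can be approximated from the principal row coordinates. The sketch below assumes the common textbook definitions: mass is the row total of the relative frequency table, relative inertia is the row's share of the total inertia, the inertia for a dimension is the row's share of that dimension's eigenvalue, and Cosine² is the share of the row's own inertia accounted for by that dimension. For a one-dimensional solution, quality equals Cosine² for dimension 1. Variable names are illustrative, and this is not the Statistica implementation.

```python
import numpy as np

# Observed frequencies (staff groups x smoking categories).
N = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)                          # mass of each row
c = P.sum(axis=0)
expected = np.outer(r, c)
S = (P - expected) / np.sqrt(expected)
U, s, Vt = np.linalg.svd(S, full_matrices=False)
F = (U * s) / np.sqrt(r)[:, None]          # principal row coordinates
eig = s ** 2

sq_dist = (F ** 2).sum(axis=1)             # squared distance of each row from the origin
row_inertia = r * sq_dist                  # inertia of each row
relative_inertia = row_inertia / eig.sum() # share of the total inertia
inertia_dim1 = r * F[:, 0] ** 2 / eig[0]   # share of the first eigenvalue
cos2_dim1 = F[:, 0] ** 2 / sq_dist         # Cosine² for dimension 1
quality_1d = cos2_dim1                     # quality of a 1-dimensional solution

labels = ["Senior Managers", "Junior Managers", "Senior Employees",
          "Junior Employees", "Secretaries"]
for j, name in enumerate(labels):
    print(f"{name:18s}", round(r[j], 6), round(quality_1d[j], 6),
          round(relative_inertia[j], 6), round(inertia_dim1[j], 6),
          round(cos2_dim1[j], 6))
```

The relative inertias and the dimension-1 inertias each sum to 1.0 across the rows.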
... Options tab of the Correspondence Analysis Results dialog box, the row coordinates are computed based on the row profile matrix. Put another way, the coordinates are computed based on the matrix of conditional probabilities (the row profiles shown in the table of row percentages). The Mass column contains the row totals of the relative frequency table (for example, 11/193 = .056995 for Senior Managers).
See also Correspondence Analysis - Program Overview, Correspondence Analysis - Supplementary Points, Multiple Correspondence Analysis (MCA), and Correspondence Analysis Introductory Overview - Burt Table.