Indicator or design matrix
Consider again the simple two-way table presented in the Introductory Overview:
| Staff Group | Smoking: (1) None | Smoking: (2) Light | Smoking: (3) Medium | Smoking: (4) Heavy | Row Totals |
|---|---|---|---|---|---|
| (1) Senior Managers | 4 | 2 | 3 | 2 | 11 |
| (2) Junior Managers | 4 | 3 | 7 | 4 | 18 |
| (3) Senior Employees | 25 | 10 | 12 | 4 | 51 |
| (4) Junior Employees | 18 | 24 | 33 | 13 | 88 |
| (5) Secretaries | 10 | 6 | 7 | 2 | 25 |
| Column Totals | 61 | 45 | 62 | 25 | 193 |
Suppose you had entered the data for this table in the following manner, as an indicator or design matrix:
| Case Number | Staff: Senior Manager | Staff: Junior Manager | Staff: Senior Employee | Staff: Junior Employee | Staff: Secretary | Smoking: None | Smoking: Light | Smoking: Medium | Smoking: Heavy |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| ... | . | . | . | . | . | . | . | . | . |
| 191 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 192 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 193 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
Each of the 193 cases in the table is represented by one case (row) in this data file. For each case, a 1 is entered in the column of each category to which the case belongs, and a 0 in every other column. For example, case 1 represents a Senior Manager in the None smoking category. As the two-way table above shows, there are 4 such cases, so the indicator matrix contains four rows like this one. In all, the indicator or design matrix contains 193 cases.
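The expansion from the frequency table to the indicator matrix can be sketched in a few lines. This is only an illustration, assuming Python with NumPy; the function name `indicator_matrix` and the variable names are hypothetical and not part of any Statistica interface.

```python
import numpy as np

# Two-way table from above: rows = staff groups, columns = smoking categories.
counts = np.array([
    [4, 2, 3, 2],      # Senior Managers
    [4, 3, 7, 4],      # Junior Managers
    [25, 10, 12, 4],   # Senior Employees
    [18, 24, 33, 13],  # Junior Employees
    [10, 6, 7, 2],     # Secretaries
])

def indicator_matrix(table):
    """Expand a two-way frequency table into a 0/1 indicator matrix
    with one row per case (hypothetical helper for illustration)."""
    n_rows, n_cols = table.shape
    row_idx, col_idx = np.indices(table.shape)
    reps = table.ravel()
    # Repeat each (row category, column category) pair by its cell count.
    r = np.repeat(row_idx.ravel(), reps)
    c = np.repeat(col_idx.ravel(), reps)
    X = np.zeros((reps.sum(), n_rows + n_cols), dtype=int)
    cases = np.arange(reps.sum())
    X[cases, r] = 1            # staff group columns
    X[cases, n_rows + c] = 1   # smoking columns
    return X

X = indicator_matrix(counts)
print(X.shape)        # (193, 9)
print(X.sum(axis=0))  # column sums reproduce the row and column totals
```

Each row of `X` contains exactly two 1's (one per variable), and the column sums of `X` reproduce the marginal totals of the two-way table.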
Analyzing the design matrix
If you now analyzed the data file (design or indicator matrix) shown above as if it were a two-way frequency table, the correspondence analysis would provide column coordinates that allow you to relate the different categories to each other, based on the distances between the row points, that is, between the individual cases. In fact, the two-dimensional display of these column coordinates would look very similar to the combined display of row and column coordinates that you would obtain from a simple correspondence analysis of the two-way frequency table (note that the metric is different, but the relative positions of the points are very similar).
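The difference in metric can be illustrated numerically. The sketch below, assuming Python with NumPy, runs a textbook correspondence analysis (a singular value decomposition of the standardized residuals) on both the two-way table and its indicator matrix; the helper `ca` is hypothetical and is not how Statistica necessarily performs the computation. For the special case of two variables, a standard result is that the leading principal inertias of the indicator matrix equal (1 + s)/2, where s is the corresponding singular value from the simple analysis.

```python
import numpy as np

counts = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
])

def ca(N):
    """Basic correspondence analysis via SVD (illustrative helper).
    Returns the singular values and the column principal coordinates."""
    P = N / N.sum()
    r = P.sum(axis=1)                       # row masses
    c = P.sum(axis=0)                       # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]
    return sv, col_coords

# Build the 193 x 9 indicator matrix from the table.
rows, cols = np.nonzero(counts)
reps = counts[rows, cols]
X = np.hstack([
    np.eye(5, dtype=int)[np.repeat(rows, reps)],
    np.eye(4, dtype=int)[np.repeat(cols, reps)],
])

sv_table, _ = ca(counts)          # simple CA of the two-way table
sv_ind, coords_ind = ca(X)        # CA of the indicator matrix

print(np.round(sv_ind[:3] ** 2, 4))
print(np.round((1 + sv_table[:3]) / 2, 4))
```

The two printed vectors agree, which shows that the scales of the two analyses differ in a systematic way even though the configuration of the category points is similar. `coords_ind` holds the coordinates of all nine categories (five staff groups followed by four smoking categories).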
More than two variables
The approach to analyzing categorical data outlined above can easily be extended to more than two categorical variables. For example, the indicator or design matrix could contain two additional variables, Male and Female, again coded 0 and 1, to indicate the subjects' gender; and three variables could be added to indicate to which one of three age groups a case belongs. Thus, in the final display, one could represent the relationships (similarities) between Gender, Age, Smoking habits, and Occupation (Staff Groups).
Fuzzy coding
It is not necessary that each case be assigned exclusively to one category of each categorical variable. Rather than the 0-or-1 coding scheme, one could enter probabilities for membership in a category, or some other measure that represents a fuzzy rule for group membership. Greenacre (1984) discusses different types of coding schemes of this kind. For example, suppose that in the example design matrix shown earlier, you had missing data for a few cases regarding their smoking habits. Instead of discarding those cases entirely from the analysis (or creating a new category Missing data), you could assign to the different smoking categories proportions (which should add to 1.0) that represent the probabilities that the respective case belongs to each category (for example, you could enter proportions based on your knowledge of estimates of the national averages for the different categories).
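As a minimal sketch of such a fuzzy-coded row, assuming Python with NumPy: the proportions below are hypothetical placeholders chosen for illustration, not actual national averages.

```python
import numpy as np

# A Senior Employee whose smoking category is unknown.
staff = np.array([0.0, 0.0, 1.0, 0.0, 0.0])  # crisp 0/1 coding

# Hypothetical membership proportions for None, Light, Medium, Heavy.
# These numbers are made up for illustration; they must add to 1.0.
smoking_props = np.array([0.40, 0.25, 0.25, 0.10])
assert np.isclose(smoking_props.sum(), 1.0)

# The fuzzy-coded row that would be appended to the indicator matrix.
fuzzy_row = np.concatenate([staff, smoking_props])
print(fuzzy_row)
```

Because the proportions add to 1.0, the row total is the same as for a crisply coded case (one unit per variable), so the case carries the same weight in the analysis.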
Interpretation of coordinates and other results
To reiterate, the results of a multiple correspondence analysis are identical to the results you would obtain for the column coordinates from a simple correspondence analysis of the design or indicator matrix. Therefore, the coordinate values, quality values, squared cosines (cosine²), and other statistics reported as the results of a multiple correspondence analysis can be interpreted in the same manner as described in the context of the simple correspondence analysis (see Introductory Overview); however, these statistics pertain to the total inertia associated with the entire design matrix.
Supplementary column points and "multiple regression" for categorical variables
Another application of the analysis of design matrices via correspondence analysis techniques is that it allows you to perform the equivalent of a Multiple Regression for categorical variables, by adding supplementary columns to the design matrix. For example, suppose you added to the design matrix shown earlier two columns to indicate whether or not the respective subject had been ill over the past year (that is, you could add one column Ill and another column Not ill, and again enter 0's and 1's to indicate each subject's health status). If, in a simple correspondence analysis of the design matrix, you added those columns as supplementary columns to the analysis, then (1) the summary statistics for the quality of representation (see the Introductory Overview) for those columns would give you an indication of how well you can explain illness as a function of the other variables in the design matrix, and (2) the display of the column points in the final coordinate system would provide an indication of the nature (for example, the direction) of the relationships between the columns in the design matrix and the column points indicating illness. This technique (adding supplementary points to a multiple correspondence analysis) is also sometimes called predictive mapping.
The Burt table
The actual computations in multiple correspondence analysis are not performed on a design or indicator matrix (which may be very large if there are many cases), but on the inner product of this matrix with itself; the resulting matrix is also called the Burt matrix or Burt table. With frequency tables, this amounts to tabulating the stacked categories against each other; for example, the Burt table for the two-way frequency table presented earlier would look like this:
| | Employee (1) | Employee (2) | Employee (3) | Employee (4) | Employee (5) | Smoking (1) | Smoking (2) | Smoking (3) | Smoking (4) |
|---|---|---|---|---|---|---|---|---|---|
| (1) Senior Managers | 11 | 0 | 0 | 0 | 0 | 4 | 2 | 3 | 2 |
| (2) Junior Managers | 0 | 18 | 0 | 0 | 0 | 4 | 3 | 7 | 4 |
| (3) Senior Employees | 0 | 0 | 51 | 0 | 0 | 25 | 10 | 12 | 4 |
| (4) Junior Employees | 0 | 0 | 0 | 88 | 0 | 18 | 24 | 33 | 13 |
| (5) Secretaries | 0 | 0 | 0 | 0 | 25 | 10 | 6 | 7 | 2 |
| (1) Smoking:None | 4 | 4 | 25 | 18 | 10 | 61 | 0 | 0 | 0 |
| (2) Smoking:Light | 2 | 3 | 10 | 24 | 6 | 0 | 45 | 0 | 0 |
| (3) Smoking:Medium | 3 | 7 | 12 | 33 | 7 | 0 | 0 | 62 | 0 |
| (4) Smoking:Heavy | 2 | 4 | 4 | 13 | 2 | 0 | 0 | 0 | 25 |
The Burt table has a clearly defined structure. In the case of two categorical variables, it consists of 4 partitions: (1) the crosstabulation of variable Employee against itself, (2) the crosstabulation of variable Employee against variable Smoking, (3) the crosstabulation of variable Smoking against variable Employee, and (4) the crosstabulation of variable Smoking against itself. Note that the matrix is symmetrical, and that the sum of the diagonal elements must be the same in each partition that represents the crosstabulation of a variable against itself (for example, there were a total of 193 observations in the present example; hence, the diagonal elements in the crosstabulation of Employee against itself and in the crosstabulation of Smoking against itself must each sum to 193).
Note: The off-diagonal elements in the partitions representing the crosstabulation of a variable against itself are equal to 0 in the table shown above. However, this is not always the case; for example, they can be nonzero when the Burt table is derived from a design or indicator matrix that includes fuzzy coding of category membership.
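A tiny made-up example shows how fuzzy coding produces nonzero off-diagonal elements. The sketch assumes Python with NumPy and a single two-category variable observed on three hypothetical cases, the last of which is split evenly between the categories.

```python
import numpy as np

# Three hypothetical cases, one variable with two categories.
# The third case is fuzzy-coded with membership 0.5 in each category.
X = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

# Partition of the Burt table for this variable against itself.
block = X.T @ X
print(block)
# [[1.25 0.25]
#  [0.25 1.25]]
```

The off-diagonal value 0.25 comes entirely from the fuzzy-coded case (0.5 times 0.5); with crisp 0/1 coding every such product is 0.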
Creating a Burt table in Statistica
The Correspondence Analysis module enables you to use a Burt table directly as input to the analysis. The module can also automatically create a Burt table from variables coded in the standard manner, that is, if your data file includes grouping variables that indicate the group membership of each case (for example, a variable Gender with the two possible values Male and Female). Thus, in most cases there is no need to recode your data in any special way (for example, into a design or indicator matrix), and you can analyze categorical variables coded in a manner that also allows you to use, for example, the Log-Linear module or the Basic Statistics module. Please refer to the Correspondence Analysis: Table Specifications dialog for additional details on the different ways in which data can be formatted for use with the Correspondence Analysis module.
Creating a customized Burt table
If your analysis requires a customized fuzzy coding scheme for several categorical variables, it is very easy to create a Burt table using Statistica Visual Basic; that table can then be displayed in a spreadsheet and saved as a data file for subsequent analysis with the Correspondence Analysis module. Remember that the Burt table is simply the inner product of the design or indicator matrix with itself; that is, if matrix X is the design or indicator matrix, then the matrix product X'X is a Burt table.
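The product X'X can be checked against the Burt table shown earlier. The following is a sketch in Python with NumPy rather than Statistica Visual Basic (an assumption made purely for illustration); it rebuilds the indicator matrix from the two-way table and multiplies it by its transpose.

```python
import numpy as np

counts = np.array([
    [4, 2, 3, 2],
    [4, 3, 7, 4],
    [25, 10, 12, 4],
    [18, 24, 33, 13],
    [10, 6, 7, 2],
])

# Indicator matrix X: one row per case, 5 staff columns + 4 smoking columns.
rows, cols = np.nonzero(counts)
reps = counts[rows, cols]
X = np.hstack([
    np.eye(5, dtype=int)[np.repeat(rows, reps)],
    np.eye(4, dtype=int)[np.repeat(cols, reps)],
])

# Burt table as the inner product X'X.
B = X.T @ X
print(B)

# The four partitions described in the text.
employee_by_employee = B[:5, :5]   # diagonal holds the row totals
employee_by_smoking = B[:5, 5:]    # equals the original two-way table
smoking_by_smoking = B[5:, 5:]     # diagonal holds the column totals
```

With crisp 0/1 coding, `employee_by_smoking` reproduces the original frequency table, and the diagonals of the two within-variable partitions each sum to 193.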