Overview - ANOVA and REML Method Implementation in Variance Estimation and Precision
With the introduction of Variance Estimation and Precision, STATISTICA now supports two methods for analyzing mixed model designs:
- The traditional ANOVA-based method as it is implemented (for example) in STATISTICA's GLM and Variance Components modules, as well as other commercial software (e.g., SAS Proc GLM, Proc Varcomp; SPSS GLM), and
- The more "modern" and general mixed-model approach which fits the fixed and random effects portions of the design separately and takes the variance components into account when estimating (fitting) and testing the fixed effects (as implemented in Variance Estimation and Precision).
When you choose the ANOVA method on the Define/Review Model dialog, the first approach is used to generate the analysis of variance results (sums of squares, expected mean squares, variance components); when you choose the REML method, the second approach is used for analysis of variance results and variance component estimates. However, the general mixed-model approach is used for computing predictions and performing least squares means analysis regardless of which estimation method is selected.
Random Effects in the Linear Model
- ANOVA method
- When the ANOVA method is chosen, Variance Estimation and Precision will first compute the ANOVA table based on the standard GLM computations, which do not explicitly account for the variance components. Instead, a single design matrix X, which contains indicator columns for both the fixed and random effects in the design, will be computed. This design matrix is then used to compute the coefficients of the linear model, $b = (X'X)^{-}X'y$, and to test hypotheses via $SS(L\beta = 0) = (Lb)'(L(X'X)^{-}L')^{-1}(Lb)$, where $(X'X)^{-}$ denotes a generalized inverse and L is a matrix of hypothesis vectors for the respective test. Thus, the general linear model design matrix is used to compute the variance components, perform denominator synthesis, and compute the ANOVA table.
Once the ANOVA results have been calculated, a second model is fit: one that uses a design matrix (X) for the fixed effects only and a separate design matrix (Z) for the random effects. Specifically, separate coefficient (solution) vectors are computed for the fixed effects (b) and the random effects (g); the joint variance/covariance matrix of both parameter vectors is usually denoted by C.
This general mixed model ($y = Xb + Zg$) is then used to compute least squares means and predictions (predicted values); see Least Squares Means; Predictions (below).
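The two ANOVA-method formulas above can be traced numerically. The following is a minimal NumPy sketch, not STATISTICA's implementation: the data, the `indicators` helper, and the use of the Moore-Penrose pseudoinverse as the generalized inverse are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical balanced 2 x 2 layout with two observations per cell.
a_lvl = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # levels of factor A
b_lvl = np.array([0, 0, 1, 1, 0, 0, 1, 1])   # levels of factor B
y = np.array([10.0, 12.0, 14.0, 15.0, 11.0, 13.0, 19.0, 21.0])

def indicators(levels):
    """Return one 0/1 indicator column per distinct level."""
    return (levels[:, None] == np.unique(levels)[None, :]).astype(float)

# Single overparameterized design matrix: intercept, A columns, B columns.
# Fixed and random effects are not distinguished at this stage.
X = np.hstack([np.ones((len(y), 1)), indicators(a_lvl), indicators(b_lvl)])

# X'X is singular here, so a generalized inverse is needed.
# The pseudoinverse is one valid choice (an assumption of this sketch).
XtX_ginv = np.linalg.pinv(X.T @ X, rcond=1e-10)
coef = XtX_ginv @ X.T @ y                    # b = (X'X)^- X'y

# Estimable hypothesis for the A main effect: alpha_1 - alpha_2 = 0.
L = np.array([[0.0, 1.0, -1.0, 0.0, 0.0]])
Lb = L @ coef
ss_A = float(Lb @ np.linalg.inv(L @ XtX_ginv @ L.T) @ Lb)
print(ss_A)
```

Because L describes an estimable function, `ss_A` does not depend on which generalized inverse is used, even though the individual entries of `coef` do.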
- REML method
- When the REML method is chosen, Variance Estimation and Precision will compute and evaluate the fixed effect model via the design matrix for the fixed effects in X, while estimating the parameters for the random effects in a separate design matrix usually denoted as Z. Further, the variance of the y values is explicitly considered to be a function of the random effects in the model; the matrix of the variances and covariances of y in this model formulation is typically denoted as V (note that Variance Estimation and Precision currently only supports the random effects model). The coefficients for the fixed effects (only) are then computed by solving $b = (X'V^{-1}X)^{-}X'V^{-1}y$; the covariances of the fixed effect parameters are estimated as $(X'V^{-1}X)^{-}$.
As a consequence, in Variance Estimation and Precision the analysis of variance results computed via the REML method can be very different from those computed via the ANOVA method. Similar differences can be observed in other commercial software (e.g., SAS GLM or VARCOMP vs. MIXED), and these differences can be confusing unless one is aware of how the methods differ (e.g., two seemingly identical analyses can yield different significant effects in the fixed effects model).
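To illustrate the estimating equation for the fixed effects under the mixed-model formulation, here is a minimal NumPy sketch. It assumes the common structure V = ZGZ' + R with G = sigma2_B * I and R = sigma2_e * I, and it plugs in made-up variance components instead of estimating them by REML. The data and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: fixed factor A (2 levels) crossed with random factor B
# (3 levels), two replicates per cell.
a_lvl = np.repeat([0, 1], 6)
b_lvl = np.tile(np.repeat([0, 1, 2], 2), 2)
y = np.array([10.0, 11.0, 14.0, 15.0, 9.0, 10.0,
              13.0, 14.0, 18.0, 17.0, 12.0, 13.0])
n = len(y)

# X holds the fixed effects only (intercept plus A indicators);
# Z holds the indicators for the random effect B.
X = np.column_stack([np.ones(n), a_lvl == 0, a_lvl == 1]).astype(float)
Z = (b_lvl[:, None] == np.arange(3)[None, :]).astype(float)

# Assumed variance components; a REML fit would estimate these from the data.
sigma2_B, sigma2_e = 4.0, 1.0
V = sigma2_B * (Z @ Z.T) + sigma2_e * np.eye(n)
Vinv = np.linalg.inv(V)

# b = (X'V^-1 X)^- X'V^-1 y, with the pseudoinverse as the generalized inverse.
cov_b = np.linalg.pinv(X.T @ Vinv @ X, rcond=1e-10)
b = cov_b @ X.T @ Vinv @ y

# Estimable contrast between the two levels of A.
L = np.array([[0.0, 1.0, -1.0]])
contrast = float((L @ b)[0])
print(contrast)
```

In this balanced layout the contrast equals the difference between the raw A-level means; with unbalanced data the weighting by the inverse of V would generally change the estimate.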
Differences in Results
- Nature of hypotheses
- When the ANOVA method is chosen in Variance Estimation and Precision, the traditional "GLM-like" hypotheses are tested for the fixed effects (and random effects). Specifically, Type I, II, and III hypotheses (hypothesis matrices L) are constructed using the $X'X$ matrix with all effects (fixed and random). When the REML method is chosen, the hypothesis (L) matrices are constructed using the $X'X$ matrix with the fixed effects only. As a consequence, sometimes very different hypotheses may be tested.
For example, suppose you have a simple two-way factorial design with two factors A and B, each with two levels. Now declare B as a random effect but treat A by B (A*B) as fixed. Using the ANOVA method and Type III hypotheses, the A*B interaction test will have 1 degree of freedom in the numerator (as in GLM); however, it will have 2 degrees of freedom using REML (i.e., using the more general mixed model approach).
The reason is that the random main effect B is not part of the fixed effect design in the latter case; hence, when the Type III hypothesis vectors are constructed, this interaction contains both the main effect for B (1 degree of freedom) and the A*B interaction (1 degree of freedom). Using the GLM method, the A*B interaction hypothesis is tested in the presence of the B main effect and, hence, is associated with only 1 degree of freedom.
This result is somewhat surprising, but it is consistent with the two different approaches for dealing with mixed models. This issue, and those described below, are discussed further later in this section.
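The degrees-of-freedom difference in this 2 x 2 example can be checked by counting how much rank the A*B indicator columns add to a design matrix that does or does not already contain the B main effect. The sketch below is a simplified illustration under that assumption (rank gained as a stand-in for the numerator degrees of freedom); the helper names are hypothetical.

```python
import numpy as np

# One observation per cell is enough to count degrees of freedom.
a_lvl = np.array([0, 0, 1, 1])
b_lvl = np.array([0, 1, 0, 1])

def indicators(levels):
    """Return one 0/1 indicator column per distinct level."""
    return (levels[:, None] == np.unique(levels)[None, :]).astype(float)

one = np.ones((4, 1))
A = indicators(a_lvl)
B = indicators(b_lvl)
AB = indicators(2 * a_lvl + b_lvl)   # one column per A-by-B cell

def added_rank(terms_before, term):
    """Rank gained by appending `term` to the columns in `terms_before`."""
    before = np.hstack(terms_before)
    after = np.hstack([before, term])
    return int(np.linalg.matrix_rank(after) - np.linalg.matrix_rank(before))

# ANOVA method: B is part of the single design matrix.
df_anova = added_rank([one, A, B], AB)
# REML method: B is random, so the fixed-effect design has only A and A*B.
df_reml = added_rank([one, A], AB)
print(df_anova, df_reml)
```

When B is left out of the fixed-effect design, the A*B columns absorb the variation that the B main effect would otherwise account for, which is why the interaction picks up the extra degree of freedom.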
- F tests
- Using the ANOVA method, a standard ANOVA table (with Sums of Squares, Mean Squares) is constructed, and tests of significance (for fixed and random effects) are performed using the Satterthwaite method and denominator synthesis. Using the REML method (and explicit mixed model formulation), only the fixed effects will be tested using:
$F = \dfrac{b'L'\,(L(X'V^{-1}X)^{-}L')^{-}\,Lb}{\mathrm{rank}(L)}$
Again, L is the matrix of estimable functions that specify the respective hypotheses (main effect, interaction, etc.; L is constructed so as to test the chosen Type I, II, or III hypotheses). The denominator degrees of freedom for this test are computed using the so-called "containment method." Variance Estimation and Precision will search all random effects in the design and determine the smallest degrees of freedom (rank contribution to the [X Z] matrix, where Z is the design matrix for the random effects) found for any random effect that (syntactically) contains the respective fixed effect (for example, the A*B interaction or A(B) effect would contain the A effect, but not a third effect C). Because the resulting F test statistic is not computed from (or compatible with) the standard ANOVA table, the Sums of Squares and Mean Squares are not reported in this case (as they are not used to compute the test statistic).
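The F statistic for a fixed effect under the mixed-model formulation can be sketched in a few lines of NumPy. This is an illustration only: the data are made up, V is assumed to have the form sigma2_B * ZZ' + sigma2_e * I with plugged-in variance components, and the denominator degrees of freedom (containment method) are not computed here.

```python
import numpy as np

# Hypothetical data: fixed factor A (2 levels), random factor B (3 levels),
# two replicates per cell.
a_lvl = np.repeat([0, 1], 6)
b_lvl = np.tile(np.repeat([0, 1, 2], 2), 2)
y = np.array([10.0, 11.0, 14.0, 15.0, 9.0, 10.0,
              13.0, 14.0, 18.0, 17.0, 12.0, 13.0])
n = len(y)

X = np.column_stack([np.ones(n), a_lvl == 0, a_lvl == 1]).astype(float)
Z = (b_lvl[:, None] == np.arange(3)[None, :]).astype(float)

# Assumed (not estimated) variance components.
sigma2_B, sigma2_e = 4.0, 1.0
V = sigma2_B * (Z @ Z.T) + sigma2_e * np.eye(n)
Vinv = np.linalg.inv(V)

# Fixed-effect solution and its covariance matrix (X'V^-1 X)^-.
cov_b = np.linalg.pinv(X.T @ Vinv @ X, rcond=1e-10)
b = cov_b @ X.T @ Vinv @ y

# Hypothesis matrix for the A main effect (a single estimable contrast).
L = np.array([[0.0, 1.0, -1.0]])
Lb = L @ b

# F = b'L' (L (X'V^-1 X)^- L')^- L b / rank(L)
middle = np.linalg.pinv(L @ cov_b @ L.T, rcond=1e-10)
F = float(Lb @ middle @ Lb) / np.linalg.matrix_rank(L)
print(F)
```

Note that no sums of squares or mean squares appear anywhere in the computation, which is consistent with their absence from the REML output.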
Least Squares Means; Predictions
For both estimation methods (ANOVA and REML), Variance Estimation and Precision will compute least squares (predicted) means from the estimated coefficients for the fixed effects only. LS means are computed as $Lb$, where L is the coefficient matrix for the respective LS means (or their differences). The standard errors are computed as the square roots of the diagonal elements of $L(X'V^{-1}X)^{-}L'$.
Predicted values are computed from both the fixed and random effect parameter vectors (b and g): each prediction combines the corresponding rows of the design matrices X and Z, so predictions are computed as $Xb + Zg$. The standard errors of predicted values are computed from the joint variance/covariance matrix of both parameter vectors, usually denoted C.
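The LS means and predictions described above can be sketched as follows. The source does not state how the random-effect solution g is obtained, so this sketch assumes the standard best linear unbiased predictor, g = GZ'V^-1(y - Xb), with G = sigma2_B * I; the variance components and data are made up, and standard errors of the predictions (which require the matrix C) are not computed.

```python
import numpy as np

# Hypothetical data: fixed factor A (2 levels), random factor B (3 levels),
# two replicates per cell.
a_lvl = np.repeat([0, 1], 6)
b_lvl = np.tile(np.repeat([0, 1, 2], 2), 2)
y = np.array([10.0, 11.0, 14.0, 15.0, 9.0, 10.0,
              13.0, 14.0, 18.0, 17.0, 12.0, 13.0])
n = len(y)

X = np.column_stack([np.ones(n), a_lvl == 0, a_lvl == 1]).astype(float)
Z = (b_lvl[:, None] == np.arange(3)[None, :]).astype(float)

# Assumed (not estimated) variance components.
sigma2_B, sigma2_e = 4.0, 1.0
G = sigma2_B * np.eye(3)
V = Z @ G @ Z.T + sigma2_e * np.eye(n)
Vinv = np.linalg.inv(V)

# Fixed-effect solution b and random-effect solution g (assumed BLUP form).
cov_b = np.linalg.pinv(X.T @ Vinv @ X, rcond=1e-10)
b = cov_b @ X.T @ Vinv @ y
g = G @ Z.T @ Vinv @ (y - X @ b)

# LS means for the two levels of A: intercept plus the respective A effect.
L = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
ls_means = L @ b
ls_se = np.sqrt(np.diag(L @ cov_b @ L.T))

# Predicted values use both parameter vectors.
predicted = X @ b + Z @ g
print(ls_means, ls_se)
```

The LS means depend only on b, whereas the predictions shift each observation toward its B-level effect through the term Zg.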
Which Method to Use?
To summarize, results calculated in the Variance Estimation and Precision module can be quite different from the results generated in the General Linear Models (GLM) module. This is at first somewhat perplexing, but it is consistent with the two different approaches. The question, of course, is: Which approach is correct?
Following statistical theory, the general mixed model approach (used in Variance Estimation and Precision) is the more appropriate ("correct") method to use. But there may be other considerations why you might prefer to analyze the data using the "older" GLM-like approach.
In general, statistical "practice" serves to communicate what is and is not statistically significant and, hence, worth additional study. By using the same "standards" on how specific analyses are performed and tests are computed, a scientific community gains experience on what appears reasonable, how spurious effects may appear, etc. From that perspective, the GLM method is desirable because of the long history and general "experience" in various research communities and the thousands of papers that have been reviewed and published using this methodology. The GLM method, therefore, is well understood and somewhat "safe" in the sense that most members of the research community will understand exactly how the results were computed and what assumptions were made when testing hypotheses.
In contrast, the more general mixed model approach described above is rather "new," and less is known about the robustness of various tests, assumptions and their violation, etc. Simply put, many in the research practitioner community will not be familiar with this method and its various options, assumptions, ways of constructing tests of hypotheses, methods for computing degrees of freedom for testing the fixed effects, etc. So while this method is more general and grounded in sound("er") statistical theory (i.e., it is technically the correct approach), it may not always be the preferred method to apply in order to "communicate results," either to a regulatory agency or research community accustomed to and more familiar with traditional approaches.
Either way, STATISTICA supports both approaches and, as such, will be able to accommodate the range of analyses required in different domains of research practice.
See also, Computational Details.