loglin
Contingency Table Analysis

Description

Estimates test statistics and parameter values for a log-linear analysis of a multidimensional contingency table.

Usage

loglin(table, margin, start = rep(1, length(table)), fit = FALSE, 
    eps = 0.1, iter = 20, param = FALSE, print = TRUE) 

Arguments

table a contingency table (array) to be fit by a log-linear model. Typically, table is output from the table function. Neither negative nor missing values (NAs) are allowed.
margin a list of vectors describing the marginal totals to fit. A margin is described by the factors not summed over. Thus list(1:2, 3:4) would indicate fitting the 1,2 margin (summing over variables 3 and 4) and the 3,4 margin in a four-way table. The names of factors (that is, names(dimnames(table))) can be used instead of indices.
start the starting estimate for a fitted table. If start is omitted, a start is used that ensures convergence. If structural zeros appear in table, start should contain zeros in the corresponding entries and ones elsewhere. This ensures that the fit contains those zeros.
fit a logical value. If TRUE, estimated fit is returned. The default is FALSE.
eps the maximum permissible deviation between an observed margin and a fitted margin.
iter the maximum number of iterations.
param a logical value. If TRUE, the parameter values are returned. Setting this to FALSE (the default) saves computation as well as space.
print a logical value. If TRUE (the default), the final deviation and the number of iterations are printed.
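
The use of start for structural zeros can be sketched as follows. This is a minimal illustration on a made-up 3 x 3 table in which the diagonal cells are assumed to be structurally impossible; the data and the object names (tab, start, res) are hypothetical.

```r
# Hypothetical 3 x 3 table whose diagonal cells are structural zeros.
tab <- array(c(0, 15, 22,
               18, 0, 9,
               11, 14, 0), dim = c(3, 3))

# Zeros where the table has structural zeros, ones elsewhere.
start <- array(1, dim = dim(tab))
diag(start) <- 0

res <- loglin(tab, margin = list(1, 2), start = start,
              fit = TRUE, print = FALSE)

res$fit                        # the diagonal of the fit stays at zero
res$df                         # not adjusted for the zeros
res$df - sum(start == 0)       # one common manual adjustment (an assumption here)
```

Subtracting the number of structural-zero cells from df is the usual adjustment for a quasi-independence model such as this one; other patterns of zeros may need a different correction.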

Details

The fit is produced by the Iterative Proportional Fitting algorithm as presented in Haberman (1972).
Convergence is considered to be achieved if the maximum deviation between an observed and a fitted margin is less than eps. At most, iter iterations are performed. The fitting is currently done in single precision; other computations are in double precision.
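
To make the idea of iterative proportional fitting concrete, here is a minimal sketch in R. It is not the code that loglin runs, and the function name ipf_sketch is hypothetical; it only mimics the roles of eps and iter described above, using an all-ones starting table.

```r
# Illustrative IPF sketch (not the implementation used by loglin).
ipf_sketch <- function(tab, margin, eps = 0.1, iter = 20) {
  fit <- array(1, dim = dim(tab))            # all-ones starting table
  for (i in seq_len(iter)) {
    dev <- 0
    for (m in margin) {
      obs <- apply(tab, m, sum)              # observed marginal totals
      cur <- apply(fit, m, sum)              # current fitted marginal totals
      dev <- max(dev, abs(obs - cur))        # largest deviation seen this pass
      ratio <- ifelse(cur > 0, obs / cur, 0)
      fit <- sweep(fit, m, ratio, "*")       # rescale so this margin matches
    }
    if (dev < eps) break                     # all margins within eps: stop
  }
  fit
}

tab <- array(c(10, 20, 30, 40), dim = c(2, 2))
fit <- ipf_sketch(tab, list(1, 2))
fit

# Largest difference from the fit that loglin itself returns.
max(abs(fit - loglin(tab, list(1, 2), fit = TRUE, print = FALSE)$fit))
```

For a two-way independence model the result equals the product of the row and column totals divided by the grand total, which gives a quick way to check the sketch by hand.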
The margins that are fit describe the model, similar to describing an ANOVA model. A high-order term automatically includes all the lower-order terms within it: for example, the term c(1,3) includes the one-factor terms 1 and 3. A factor that had constraints in the sampling plan should always be included. For example, if the sampling plan was such that there would be (precisely) x females and y males sampled, then gender should be in all models.
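
The hierarchy rule can be checked directly. The sketch below uses a made-up 2 x 2 x 2 table and compares a margin list containing only c(1, 3) and 2 with one that also spells out the one-factor terms 1 and 3; the object names are hypothetical.

```r
# Hypothetical 2 x 2 x 2 table of counts.
tab <- array(c(12, 7, 9, 14, 6, 11, 13, 8), dim = c(2, 2, 2))

# The term c(1, 3) already implies the one-factor terms 1 and 3.
a <- loglin(tab, list(c(1, 3), 2), fit = TRUE, print = FALSE)
b <- loglin(tab, list(c(1, 3), 1, 3, 2), fit = TRUE, print = FALSE)

all.equal(a$fit, b$fit)   # same fitted table
c(a$df, b$df)             # same degrees of freedom
```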
Both the LRT and the Pearson test statistics are asymptotically distributed as chi-square with df degrees of freedom (assuming there are no zeros). A general rule of thumb is that the asymptotic distribution is trustworthy when the number of observations is at least 10 times the number of cells. If the two test statistics differ considerably, not much faith can be put in the test.
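
loglin returns the statistics but not their p-values. A minimal sketch of obtaining them from the chi-square distribution with pchisq follows; the table is made up for illustration, and the last line applies the rule of thumb above.

```r
# Hypothetical 2 x 2 x 2 table of counts (80 observations, 8 cells).
tab <- array(c(12, 7, 9, 14, 6, 11, 13, 8), dim = c(2, 2, 2))

# Mutual independence of the three factors.
res <- loglin(tab, list(1, 2, 3), print = FALSE)

# Upper-tail chi-square probabilities for both statistics.
p_lrt     <- pchisq(res$lrt,     df = res$df, lower.tail = FALSE)
p_pearson <- pchisq(res$pearson, df = res$df, lower.tail = FALSE)
c(lrt = p_lrt, pearson = p_pearson)

# Rule-of-thumb check on the sample size.
sum(tab) >= 10 * length(tab)
```

If the two p-values disagree noticeably, that is a sign that the asymptotic approximation should not be relied on for this table.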
Using the test statistics to select a model is a rather backward use of hypothesis testing: a model can be "proved" wrong, but passing the test does not mean that the model is right. Bayesian techniques have been developed to select a good model (or models).
The start argument can be used to produce an analysis when the cells are assigned different weights. (See Clogg and Eliason (1988).) The start should be one over the weights.
A suggested analysis strategy is to use the default settings to narrow down the number of models, and then to set the fit and param options to TRUE to investigate the more promising models further.
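
One way to follow this strategy is sketched below: screen several candidate models with the default settings, then refit a promising one with fit and param set to TRUE. The table, the candidate list, and the choice of model to refit are all hypothetical.

```r
# Hypothetical 2 x 2 x 2 table of counts.
tab <- array(c(12, 7, 9, 14, 6, 11, 13, 8), dim = c(2, 2, 2))

# Candidate models, each described by its list of margins.
models <- list(
  indep = list(1, 2, 3),
  ac_b  = list(c(1, 3), 2),
  all2  = list(c(1, 2), c(1, 3), c(2, 3))
)

# Step 1: cheap screening with fit = FALSE and param = FALSE.
screen <- sapply(models, function(m) {
  r <- loglin(tab, m, print = FALSE)
  c(lrt = r$lrt, df = r$df)
})
screen

# Step 2: refit one model in full to inspect fitted values and parameters.
best <- loglin(tab, models$ac_b, fit = TRUE, param = TRUE, print = FALSE)
names(best)
```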

Value

Returns a list with components:
lrt the Likelihood Ratio Test statistic. This is often called either L squared or G squared in the literature, and it is 2 times the discrimination information. It is defined as 2 * sum(observed * log(observed/expected)).
pearson the Pearson test statistic (chi squared). It is defined as sum((observed - expected)^2/expected).
df the degrees of freedom for the model fit. There is no adjustment for zeros; the user must adjust for them.
margin a list of the margins that were fit. This is the input margin, except that the names of the factors are used if they are present.
fit an array like table, but containing fitted values. This is returned only when the argument fit is TRUE.
param the estimated parameters of the model. They are parametrized so that the (Intercept) component describes the overall mean, each single-factor parameter sums to zero, each two-factor parameter sums to zero both by rows and by columns, and so on. This is returned only when the argument param is TRUE.
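
The definitions of lrt and pearson and the sum-to-zero constraints on param can be verified from the returned components. The sketch below uses a made-up table with no zero cells; the factor names A, B, and C are hypothetical.

```r
# Hypothetical 2 x 2 x 2 table with named factors and no zero cells.
tab <- array(c(12, 7, 9, 14, 6, 11, 13, 8), dim = c(2, 2, 2),
             dimnames = list(A = c("a1", "a2"),
                             B = c("b1", "b2"),
                             C = c("c1", "c2")))

res <- loglin(tab, list(c("A", "B"), "C"),
              fit = TRUE, param = TRUE, print = FALSE)

# Recompute both statistics from observed and fitted counts.
lrt_manual     <- 2 * sum(tab * log(tab / res$fit))
pearson_manual <- sum((tab - res$fit)^2 / res$fit)
all.equal(res$lrt, lrt_manual)
all.equal(res$pearson, pearson_manual)

# Every component other than (Intercept) should sum to (numerically) zero.
sapply(res$param[-1], sum)
```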

Background

Log-linear analysis studies the relationship between a number of categorical variables, extending the idea of simply testing for independence of the factors. Typically, the number of observations falling into each combination of the levels of the variables (factors) is modeled. The model, as the name suggests, is that the logarithm of the counts follows a linear model depending on the levels of the factors.
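
The statement that the logarithm of the counts follows a linear model can be illustrated by fitting the same independence model as a Poisson regression with glm, which uses a log link. This is only a sketch on made-up data; glm is a separate fitting routine and is used here as an assumption-free cross-check, not as part of loglin.

```r
# Hypothetical 2 x 2 table with named factors.
tab <- as.table(array(c(10, 20, 30, 40), dim = c(2, 2),
                      dimnames = list(A = c("a1", "a2"),
                                      B = c("b1", "b2"))))

# One row per cell: columns A, B, and the count Freq.
d <- as.data.frame(tab)

# log(expected count) = intercept + effect of A + effect of B
g  <- glm(Freq ~ A + B, family = poisson, data = d)
ll <- loglin(tab, list("A", "B"), fit = TRUE, print = FALSE)

# The two approaches should give the same fitted counts.
all.equal(as.vector(fitted(g)), as.vector(ll$fit), tolerance = 1e-6)
```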

References

Agresti, A. 1990. Categorical data analysis. New York, NY: Wiley.
Becker, R. A., Chambers, J. M., and Wilks, A. R. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books and Software.
Clogg, C. C. and Eliason, S. R. 1988. Some Common Problems in Log-Linear Analysis. In Common Problems/Proper Solutions. (J. Scott Long, ed.) Newbury Park, CA: Sage Publications.
Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. Second Edition. Cambridge, MA: MIT Press.
Haberman, S. J. 1972. Log-linear fit for contingency tables: Algorithm AS51. Applied Statistics. Volume 21. 218-225.
Lunneborg, C. E. and Abbott, R. D. 1983. Elementary Multivariate Analysis for the Behavioral Sciences. New York, NY: North-Holland.

See Also

table, Chisquare.

Examples

tbl <- with(Sdatasets::market.survey, table(income, age, education))
loglin(tbl, margin=list(c("age", "income"), "education"))
Package stats version 6.1.1-7