loglin
Contingency Table Analysis
Description
Computes test statistics and estimates parameter values for a log-linear
analysis of a multidimensional contingency table.
Usage
loglin(table, margin, start = rep(1, length(table)), fit = FALSE,
eps = 0.1, iter = 20, param = FALSE, print = TRUE)
Arguments
table |
a contingency table (array) to be fit by a log-linear model.
Typically, table is output from the table function.
Neither negative nor missing values (NAs) are allowed.
|
margin |
a list of vectors describing the marginal totals to fit. A margin
is described by the factors not summed over.
Thus list(1:2, 3:4) would indicate
fitting the 1,2 margin (summing over variables 3 and 4) and
the 3,4 margin in a four-way table.
The names of factors (that is, names(dimnames(table)))
can be used instead of indices.
|
start |
the starting estimate for the fitted table. If start is omitted, a
start is used that assures convergence. If structural
zeros appear in table, start should contain zeros in the
corresponding entries and ones everywhere else. This assures
that the fit contains those zeros (see the sketch following this list).
|
fit |
a logical value. If TRUE, the estimated fit is returned. The
default is FALSE.
|
eps |
the maximum permissible deviation between an observed margin and a fitted
margin.
|
iter |
the maximum number of iterations.
|
param |
a logical value. If TRUE, the parameter values are returned.
Setting this to FALSE (the default) saves computation as well as space.
|
print |
a logical value. If TRUE (the default), the final deviation and the number
of iterations are printed.
|
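As a minimal sketch of the start argument, consider a small hypothetical 2 x 3 table whose cell [1, 3] is a structural zero; placing a zero in the corresponding entry of start keeps that cell at zero in the fit:
counts <- array(c(12, 7, 9, 4, 0, 11), dim = c(2, 3),
                dimnames = list(row = c("a", "b"), col = c("x", "y", "z")))
strt <- array(1, dim = dim(counts))
strt[1, 3] <- 0   # cell [1, 3] is a structural zero, so start is zero there
res <- loglin(counts, margin = list(1, 2), start = strt, fit = TRUE)
res$fit[1, 3]     # the fitted table keeps this cell at zero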
Details
The fit is produced by the Iterative Proportional Fitting algorithm as
presented in Haberman (1972).
Convergence is considered to be achieved if the maximum deviation between
an observed and a fitted margin is less than eps.
At most, iter iterations are performed.
The fitting is currently done in single
precision; other computations are in double precision.
The margins that are fit describe the model, much as terms describe an ANOVA
model. A high-order term automatically includes all the lower-order terms
within it: for example, the term c(1,3) includes the one-factor terms
1 and 3. A factor that had constraints in the sampling plan should
always be included. For example, if the sampling plan was such that there
would be (precisely) x females and y males sampled, then gender should
be in all models.
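For example, assuming the standard HairEyeColor table from the datasets package is available (a 4 x 4 x 2 table of Hair, Eye and Sex), two common models can be written as:
# Conditional independence of Hair and Eye given Sex: fit the (Hair, Sex) and
# (Eye, Sex) margins; each term also brings in its one-factor terms.
ci <- loglin(HairEyeColor, margin = list(c("Hair", "Sex"), c("Eye", "Sex")))
# Mutual independence: only the three one-factor margins are fit.
ind <- loglin(HairEyeColor, margin = list("Hair", "Eye", "Sex"))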
Both the LRT and the Pearson test statistics are asymptotically distributed
as chi-squared with df degrees of freedom (assuming there are no zeros).
A general rule of thumb is that the asymptotic distribution is trustworthy
when the number of observations is at least 10 times the number of cells.
If the two test statistics differ considerably, not much faith can be put
in the test.
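For instance, continuing with the assumed HairEyeColor table, the returned statistics and degrees of freedom give the usual p-values:
ci <- loglin(HairEyeColor, margin = list(c("Hair", "Sex"), c("Eye", "Sex")))
pchisq(ci$lrt, df = ci$df, lower.tail = FALSE)       # LRT p-value
pchisq(ci$pearson, df = ci$df, lower.tail = FALSE)   # Pearson p-value
# A large gap between the two p-values is the disagreement warned about above.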
Using the test statistics to select a model is a rather backward use of
hypothesis testing - a model can be "proved" wrong, but passing the test
does not mean that the model is right. Bayesian techniques have been
developed to select a good model (or models).
The start argument can be used to produce an analysis when the cells are
assigned different weights. (See Clogg and Eliason (1988).)
The start should be one over the weights.
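A minimal sketch of this weighting scheme, using a hypothetical 2 x 2 table and assumed cell weights w (for example, unequal exposure per cell):
obs <- array(c(30, 12, 18, 25), dim = c(2, 2))
w   <- array(c(2, 1, 1, 2), dim = c(2, 2))       # assumed cell weights
wtd <- loglin(obs, margin = list(1, 2), start = 1 / w, fit = TRUE)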
A suggested analysis strategy is to use the default settings to narrow down
the number of models, and then to set the fit and param options to TRUE
to investigate the more promising models further.
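For instance, again with the assumed HairEyeColor table, a quick screening call followed by a closer look at a promising model:
loglin(HairEyeColor, margin = list(c("Hair", "Eye"), "Sex"))   # screening pass
detail <- loglin(HairEyeColor, margin = list(c("Hair", "Eye"), "Sex"),
                 fit = TRUE, param = TRUE)
detail$fit      # fitted counts
detail$param    # estimated parameters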
Value
returns a list with the following components:
lrt |
the Likelihood Ratio Test statistic. This is often called either L squared
or G squared in the literature, and it is twice the discrimination information.
It is defined as 2 * sum(observed * log(observed/expected))
(see the sketch following this list).
|
pearson |
the Pearson test statistic (chi squared). It is defined as
sum((observed - expected)^2/expected).
|
df |
the degrees of freedom for the model fit. There is no adjustment for zeros;
the user must adjust for them.
|
margin |
a list of the margins that were fit.
This is the input margin, except
that the names of the factors are used if they are present.
|
fit |
an array like table, but containing fitted values.
This is returned only when the argument fit is TRUE.
|
param |
the estimated parameters of the model.
They are parametrized so that the (Intercept) component describes the overall
mean, the parameters for each single factor sum to zero, each two-factor
parameter sums to zero over both rows and columns, and so on.
This is returned only when the argument param is TRUE.
|
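As a check of the definitions above (again assuming the HairEyeColor table is available), the returned statistics can be recomputed from the fitted table; a loose tolerance allows for the single-precision fitting noted in Details:
res <- loglin(HairEyeColor, margin = list(c("Hair", "Sex"), c("Eye", "Sex")),
              fit = TRUE)
all.equal(res$lrt, 2 * sum(HairEyeColor * log(HairEyeColor / res$fit)),
          tolerance = 1e-4)
all.equal(res$pearson, sum((HairEyeColor - res$fit)^2 / res$fit),
          tolerance = 1e-4)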
Background
Log-linear analysis studies the relationships among a number of categorical
variables,
extending the idea of simply testing for independence of the factors.
Typically, the number of observations falling into each
combination of the levels of the variables (factors) is modeled.
The model, as the name suggests, is that the logarithm of the counts
follows a linear model depending on the levels of the factors.
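Equivalently (a standard correspondence, though not part of this page), a hierarchical log-linear model can be fit as a Poisson regression of the cell counts on the factors; assuming the HairEyeColor table and the glm function are available:
dat  <- as.data.frame(HairEyeColor)           # columns Hair, Eye, Sex, Freq
pois <- glm(Freq ~ Hair * Sex + Eye * Sex, family = poisson, data = dat)
ipf  <- loglin(HairEyeColor, margin = list(c("Hair", "Sex"), c("Eye", "Sex")),
               fit = TRUE)
max(abs(fitted(pois) - as.vector(ipf$fit)))   # small, up to the IPF tolerance eps
c(ipf$lrt, deviance(pois))                    # LRT statistic nearly equals the deviance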
References
Agresti, A. 1990. Categorical data analysis. New York, NY: Wiley.
Becker, R. A., Chambers, J. M., and Wilks, A. R. 1988. The New S Language: A Programming Environment for Data Analysis and Graphics. Pacific Grove, CA: Wadsworth & Brooks/Cole Advanced Books and Software.
Clogg, C. C. and Eliason, S. R. 1988. Some Common Problems in Log-Linear Analysis. In Common Problems/Proper Solutions (J. Scott Long, ed.). Newbury Park, CA: Sage Publications.
Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. Second Edition. Cambridge, MA: MIT Press.
Haberman, S. J. 1972. Log-linear fit for contingency tables: Algorithm AS51. Applied Statistics. Volume 21. 218-225.
Lunneborg, C. E. and Abbott, R. D. 1983. Elementary Multivariate Analysis for the Behavioral Sciences. New York, NY: North-Holland.
See Also
table.
Examples
# Three-way table of counts from the market.survey data.
tbl <- with(Sdatasets::market.survey, table(income, age, education))
# Fit the model with the (age, income) and education margins.
loglin(tbl, margin = list(c("age", "income"), "education"))
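# A possible follow-up once this model looks promising (same tbl as above):
fit2 <- loglin(tbl, margin = list(c("age", "income"), "education"),
               fit = TRUE, param = TRUE)
names(fit2$param)   # "(Intercept)" plus one component per term in the model
fit2$fit            # fitted counts for each income x age x education cell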