Routines for modeling and analyzing a two- or higher-dimensional contingency table are described in this chapter. Also included are routines for modeling responses from some discrete distributions when discrete or continuous covariates are measured.
The most common of the three data structures used by the routines in this chapter is a multidimensional (or multi-way) contingency table input as a real vector with length equal to the product of the number of categories for each dimension. This structure may be obtained from a data matrix X via the routine FREQ in Chapter 1, Basic Statistics. Alternatively, multi-way tables may be created and input directly by the user. The multi-way structure is used by all of the log-linear modeling routines (PRPFT, CTLLN, CTPAR, CTASC, and CTSTP) and is also used in the randomization tests routine, CTRAN.
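As a rough illustration of this storage scheme, the following Python sketch counts observations into a flat vector whose length is the product of the category counts. The function name tabulate, the 0-based category indices, and the choice to let the first dimension vary fastest (column-major order) are all assumptions made for the example; they are not taken from the FREQ documentation, which should be consulted for the actual cell ordering.

```python
def tabulate(rows, n_categories):
    """Count observations into a flat multi-way table (hypothetical helper).

    rows: iterable of tuples of 0-based category indices, one per dimension.
    n_categories: number of categories for each dimension.
    Assumption: the first dimension varies fastest (column-major order).
    """
    # The stride of dimension k is the product of the sizes of dimensions 0..k-1.
    strides = []
    size = 1
    for n in n_categories:
        strides.append(size)
        size *= n
    table = [0.0] * size  # length = product of the category counts
    for row in rows:
        offset = sum(i * s for i, s in zip(row, strides))
        table[offset] += 1.0
    return table


# Four observations on a 2 x 3 x 2 table.
data = [(0, 0, 1), (1, 2, 0), (0, 0, 1), (1, 1, 1)]
table = tabulate(data, (2, 3, 2))
```

Here the table has 2 · 3 · 2 = 12 cells, and the two identical observations (0, 0, 1) accumulate in a single cell.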
A second data structure used by the categorical generalized linear models routine, CTGLM, is the data matrix X. In CTGLM (and elsewhere), if X has many identical rows, at least on the variables of interest, consider using Chapter 1 routine CSTAT to add a frequency variable to a reduced matrix X. The transposed output from this routine can replace X as input to CTGLM, and CTGLM will perform its computations faster (with a linear speedup) on the reduced matrix.
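The idea of reducing X by collapsing identical rows can be sketched in Python as follows. This is only an illustration of the concept: the helper collapse_rows is hypothetical, and it does not reproduce the interface or output layout of CSTAT (in particular, no transposition is modeled).

```python
from collections import Counter


def collapse_rows(x):
    """Collapse identical rows of x and append a frequency column.

    Hypothetical helper: each distinct row appears once in the result,
    followed by the number of times it occurred in x.
    """
    counts = Counter(tuple(row) for row in x)
    # Counter keeps keys in first-seen order, so row order is stable.
    return [list(row) + [float(freq)] for row, freq in counts.items()]


x = [[1, 0], [1, 0], [0, 1], [1, 0]]
reduced = collapse_rows(x)
```

A model-fitting routine that accepts a frequency variable then needs to visit only the distinct rows, which is where the speedup comes from.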
Finally, two-way tables are input into routines CTCHI, CTTWO, CTPRB, CTEPR, and CTWLS as two-dimensional real arrays. As with the multidimensional arrays, two-dimensional arrays may be created via Chapter 1 routine FREQ, in which case the leading dimension must equal the number of categories for the first dimension in the table, or they can be created and input directly by the user. Alternatively, the routine TWFRQ from Chapter 1 may be used to obtain the two-way frequency table.
Routines CTCHI (r × c) and CTTWO (2 × 2) (see Chapter 1: Basic Statistics) compute many statistics of interest in a two-way table. Statistics computed by these routines include the usual chi-squared statistics, measures of association, Kappa, and many others. Asymptotic statistics for a two-way table that are not computed by either CTCHI or CTTWO can probably be computed by routines CTRAN or CTWLS, but note that these latter two routines require more setup because the user must indicate how the statistics are to be computed. Exact probabilities for two-way tables can be computed by CTPRB, but this routine uses the total enumeration algorithm and thus often uses orders of magnitude more computer time than CTEPR, which computes the same probabilities by use of the network algorithm (but can still be quite expensive).
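For orientation, the most familiar of these statistics, the Pearson chi-squared statistic for independence in an r × c table, can be sketched in a few lines of Python. The function pearson_chi_squared is a hypothetical helper written for this example, not the CTCHI interface, and it assumes that every row and column total is positive.

```python
def pearson_chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table.

    table: list of rows of observed counts.
    Assumption: all row and column totals are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


stat, df = pearson_chi_squared([[10, 20], [20, 10]])
```

In this 2 × 2 example every expected count is 15, so the statistic is 4 · 25/15 = 20/3 on 1 degree of freedom.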
The routines in the second section are all concerned with hierarchical log-linear models (see, e.g., Bishop, Fienberg, and Holland 1975). The routines in Chapter 1: Basic Statistics will often be used to obtain the multi-dimensional tables input into these routines, or the table will be input directly by the user. If the hierarchical model is not known, routine CTASC will often be the first routine considered. The partial association statistics computed by this routine can be used to obtain a rough estimate of the model to be used. This rough model can then be refined through the use of CTSTP, which does stepwise model building. Of course, both of these routines are subject to the usual problems associated with building models once the data have been collected: the resulting models may not be correct.
Once a model has been selected (provisional or otherwise), routine CTLLN can be used to compute and print many model statistics (parameter estimates, residuals, goodness of fit tests, etc.). If only the parameter estimates and associated variance/covariance matrix are needed, CTPAR can be used instead. Both of these routines can compute estimates when sampling and/or structural zeros (cells in the table with observed or restricted counts of zero, respectively) are present in the table, as can all routines in this section.
The algorithm underlying all of the routines in the second section is the iterative proportional fitting algorithm, which is implemented in routine PRPFT. When structural or sampling zeros are present in the table, this algorithm can be quite slow to converge. Also, only the expected cell counts are returned by PRPFT. It can be quite difficult to determine degrees of freedom when structural zeros are present in the data. Because a structural zero is a restriction on the parameter space, 1 degree of freedom must be subtracted for each structural zero in the multi-way table. The difficulty is in determining where the subtraction should occur. All routines in this section use a Cholesky factorization of X^T X, where X is the “design matrix.” This is used to determine which effects should lose degrees of freedom because of structural zeros. Sampling zeros, although they can lead to infinite parameter estimates, do not subtract from the total degrees of freedom. See Clarkson and Jennrich (1991) or Baker, Clarke, and Lane (1985) for details.
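The core of iterative proportional fitting can be sketched in Python as follows. This is a minimal illustration of the general technique, not the PRPFT implementation: the function ipf, its arguments, and the dictionary-based table are assumptions made for the example. It returns only fitted cell counts, and it does not treat structural zeros or compute degrees of freedom.

```python
from itertools import product


def ipf(observed, shape, margins, tol=1e-8, max_iter=100):
    """Fit expected cell counts by iterative proportional fitting (sketch).

    observed: dict mapping index tuples to observed counts.
    shape: number of categories for each dimension.
    margins: list of tuples of dimension indices; each tuple names a
             marginal table that the fitted counts must reproduce.
    """
    cells = list(product(*(range(n) for n in shape)))
    fitted = {c: 1.0 for c in cells}  # start from a constant table
    for _ in range(max_iter):
        max_change = 0.0
        for dims in margins:
            # Accumulate the observed and currently fitted marginal totals.
            obs_m, fit_m = {}, {}
            for c in cells:
                key = tuple(c[d] for d in dims)
                obs_m[key] = obs_m.get(key, 0.0) + observed.get(c, 0.0)
                fit_m[key] = fit_m.get(key, 0.0) + fitted[c]
            # Rescale each cell so the fitted margin matches the observed one.
            for c in cells:
                key = tuple(c[d] for d in dims)
                new = fitted[c] * obs_m[key] / fit_m[key] if fit_m[key] > 0 else 0.0
                max_change = max(max_change, abs(new - fitted[c]))
                fitted[c] = new
        if max_change < tol:
            break
    return fitted


# Independence model for a 2 x 2 table: fit the row and column margins.
obs = {(0, 0): 10.0, (0, 1): 20.0, (1, 0): 30.0, (1, 1): 40.0}
fit = ipf(obs, (2, 2), [(0,), (1,)])
```

For the independence model the fit converges after one pass to the familiar row total times column total over n, here 12, 18, 28, and 42. Models with overlapping higher-order margins generally need more passes.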
Routine CTRAN computes generalized Mantel-Haenszel statistics in stratified r × c tables. Generalized Mantel-Haenszel statistics assume that the “direction” of departure from the null hypothesis is consistent from one table to the next. Under this assumption, statistics computed for each table are pooled across all strata, yielding a more powerful test than could be obtained otherwise. The statistics computed include measures of correlation, location, and independence using user-selected row and/or column scores. Details can be found in Koch, Amara, and Atkinson (1983) or in the “Algorithm” section for CTRAN.
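The pooling idea is easiest to see in the classical special case of stratified 2 × 2 tables. The Python sketch below computes the Mantel-Haenszel chi-squared statistic without a continuity correction; it is offered only as an illustration of pooling across strata, not as the CTRAN algorithm, which handles general r × c tables and user-supplied scores. The function name mantel_haenszel is hypothetical, and each stratum is assumed to contain more than one observation.

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel chi-squared statistic for stratified 2 x 2 tables.

    tables: list of ((a, b), (c, d)) tables, one per stratum.
    No continuity correction is applied.
    Assumption: each stratum total exceeds 1.
    """
    diff = 0.0  # pooled (observed - expected) for the upper-left cell
    var = 0.0   # pooled hypergeometric variance of the upper-left cell
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        diff += a - row1 * col1 / n
        var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    return diff * diff / var


strata = [((10, 5), (5, 10)), ((8, 2), (4, 6))]
mh = mantel_haenszel(strata)
```

Because the deviations are summed before squaring, departures in the same direction reinforce each other across strata, while departures in opposite directions cancel. This is the sense in which the test relies on a consistent direction of departure.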
The routine CTGLM in the fourth section is concerned with generalized linear models (see McCullagh and Nelder 1983) in discrete data. This routine may be used to compute estimates and associated statistics in probit, logistic, minimum extreme value, Poisson, negative binomial (with known number of successes), and logarithmic models. Classification variables as well as weights, frequencies and additive constants may be used so that quite general linear models can be fit. Residuals, a measure of influence, the coefficient estimates, and other statistics are returned for each model fit. When infinite parameter estimates are required, extended maximum likelihood estimation may be used. Log-linear models may be fit in CTGLM through the use of Poisson regression models. Results from Poisson regression models involving structural and sampling zeros will be identical to the results obtained from the log-linear model routines but will be fit by a quasi-Newton algorithm rather than through iterative proportional fitting.
The weighted least-squares analysis of Grizzle, Starmer, and Koch (1969) is implemented in routine CTWLS. In this routine, the user first transforms the observed probability estimates (in predefined ways) and then fits a linear model to the transformed estimates using generalized least squares. Multivariate hypotheses associated with the coefficient estimates for the linear model fit may then be tested. In this way, many statistics of interest such as generalized Kappa statistics and parameter estimates in logistic models may be estimated. Of course, the logistic models fit by CTWLS use a generalized least-squares criterion rather than the maximum likelihood criterion used to compute the logistic model estimates in CTGLM. The generalized least-squares estimates will generally differ somewhat from estimates computed via maximum likelihood.
The routines in Chapter 1, “Basic Statistics,” may be used to create the data structures discussed above. These routines can also create one-dimensional frequency tables, which may then be used by routine CHIGF (see Chapter 7, Tests of Goodness of Fit and Randomness) to compute chi-squared goodness-of-fit test statistics or with routines VHSTP or HHSTP (see Chapter 16, Line Printer Graphics) to prepare histograms. Routines CTRHO, TETCC, BSCAT, and BSPBS (see Chapter 3, Correlation) may be used to compute some measures of correlation in two-way contingency tables.