Performs a chi-squared analysis of a 2 by 2 contingency table.
TABLE — 2 by 2 matrix containing the observed counts in the contingency table. (Input)
EXPECT — 3 by 3 matrix containing the expected values of each cell in TABLE under the null hypothesis of independence, in the first 2 rows and 2 columns, and the marginal totals in the last row and column. (Output)
CHICTR — 3 by 3 matrix containing the contributions to chi-squared for each cell in TABLE in the first 2 rows and 2 columns. (Output) The last row and column contain the total contribution to chi-squared for that row or column.
CHISQ — Vector of length 15 containing statistics associated with this contingency table. (Output)
I     CHISQ(I)
1     Pearson chi-squared statistic
2     Probability of a larger Pearson chi-squared
3     Degrees of freedom for chi-squared
4     Likelihood ratio G² (chi-squared)
5     Probability of a larger G²
6     Yates corrected chi-squared
7     Probability of a larger corrected chi-squared
8     Fisher’s exact test (one tail)
9     Fisher’s exact test (two tail)
10    Exact mean
11    Exact standard deviation
The following statistics are based upon the chi-squared statistic CHISQ(1).
I     CHISQ(I)
12    Phi (Φ)
13    The maximum possible Φ
14    Contingency coefficient P
15    The maximum possible contingency coefficient
STAT — 24 by 5 matrix containing statistics associated with this table. (Output) Each row of the matrix corresponds to a statistic.
Row   Statistic
1     Gamma
2     Kendall’s τb
3     Stuart’s τc
4     Somers’ D (row)
5     Somers’ D (column)
6     Product moment correlation
7     Spearman rank correlation
8     Goodman and Kruskal τ (row)
9     Goodman and Kruskal τ (column)
10    Uncertainty coefficient U (normed)
11    Uncertainty Ur|c (row)
12    Uncertainty Uc|r (column)
13    Optimal prediction λ (symmetric)
14    Optimal prediction λr|c (row)
15    Optimal prediction λc|r (column)
16    Optimal prediction λ*r|c (row)
17    Optimal prediction λ*c|r (column)
18    Yule’s Q
19    Yule’s Y
20    Crossproduct ratio
21    Log of crossproduct ratio
22    Test for linear trend
23    Kappa
24    McNemar test of symmetry
If a statistic is not computed, its value is reported as NaN (not a number). The columns are as follows:
Column   Statistic
1        Estimated statistic
2        Its estimated standard error for any parameter value
3        Its estimated standard error under the null hypothesis
4        z-score for testing the null hypothesis
5        p-value for the test in column 4
In the McNemar test, column 1 contains the statistic, column 2 contains the chi-squared degrees of freedom, column 4 contains the exact p-value, and column 5 contains the chi-squared asymptotic p-value.
LDTABL — Leading dimension of TABLE exactly as specified in the dimension statement of the calling program. (Input)
Default: LDTABL = size (TABLE,1).
ICMPT — Computing option. (Input)
If ICMPT = 0, all of the values in CHISQ and STAT are computed. If ICMPT = 1, only the first 11 values of CHISQ are computed, and no values of STAT are computed.
Default: ICMPT = 0.
IPRINT — Printing option. (Input)
If IPRINT = 0, no printing is performed. If IPRINT = 1, printing is performed.
Default: IPRINT = 0.
LDEXPE — Leading dimension of EXPECT exactly as specified in the dimension statement of the calling program. (Input)
Default: LDEXPE = size (EXPECT,1).
LDCHIC — Leading dimension of CHICTR exactly as specified in the dimension statement of the calling program. (Input)
Default: LDCHIC = size (CHICTR,1).
LDSTAT — Leading dimension of STAT exactly as specified in the dimension statement of the calling program. (Input)
Default: LDSTAT = size (STAT,1).
Generic: CALL CTTWO (TABLE, EXPECT, CHICTR, CHISQ, STAT [,…])
Specific: The specific interface names are S_CTTWO and D_CTTWO.
Single: CALL CTTWO (TABLE, LDTABL, ICMPT, IPRINT, EXPECT, LDEXPE, CHICTR, LDCHIC, CHISQ, STAT, LDSTAT)
Double: The double precision name is DCTTWO.
Routine CTTWO computes statistics associated with 2 × 2 contingency tables. Always computed are chi-squared tests of independence, expected values based upon the independence assumption, contributions to chi-squared in a test of independence, and row and column marginal totals. Optionally, when ICMPT = 0, CTTWO can compute some measures of association, correlation, prediction, uncertainty, the McNemar test for symmetry, a test for linear trend, the odds ratio and the log odds ratio, and the Kappa statistic.
Other IMSL routines that may be of interest include TETCC in Chapter 3, “Correlation” (for computing the tetrachoric correlation coefficient), and CTCHI in this chapter (for computing statistics for contingency tables other than 2 × 2).
Let xij denote the observed cell frequency in the ij cell of the table and n denote the total count in the table. Let pij = pi∙p∙j denote the predicted cell probabilities (under the null hypothesis of independence), where pi∙ and p∙j are the row and column relative marginal frequencies, respectively. Next, compute the expected cell counts as eij = n pij.
Also required in the following are auv and buv, u, v = 1, …, n. Let (rs, cs) denote the row and column response of observation s. Then, auv = 1, 0, or −1, depending upon whether ru < rv, ru = rv, or ru > rv, respectively. The buv are similarly defined in terms of the cs’s.
For each of the four cells in the table, the contribution to chi-squared is given as (xij − eij)²/eij. The Pearson chi-squared statistic (denoted Χ²) is computed as the sum of the cell contributions to chi-squared. It has, of course, 1 degree of freedom and tests the null hypothesis of independence, i.e., H0 : pij = pi∙p∙j. Reject the null hypothesis if the computed value of Χ² is too large.
Compute G², the maximum likelihood equivalent of Χ², as
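the standard likelihood-ratio form, stated here in the notation above:

$$G^2 = 2 \sum_{i,j} x_{ij} \,\ln\!\left(\frac{x_{ij}}{e_{ij}}\right)$$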
G² is asymptotically equivalent to Χ² and tests the same hypothesis with the same degrees of freedom.
Two measures related to chi-squared but which do not depend upon the sample size are the phi coefficient Φ and the contingency coefficient P, computed as follows.
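Both are simple functions of Χ² and the sample size n:

$$\Phi = \sqrt{\frac{\chi^2}{n}}, \qquad P = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$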
Since these statistics do not depend upon sample size and are large when the hypothesis of independence is rejected, they may be thought of as measures of association and may be compared across tables with different sample sizes. While P has a range between 0.0 and 1.0 for any given table, the upper bound of P is actually somewhat less than 1.0 (see Kendall and Stuart 1979, page 577). In order to understand association within a table, consider also the maximum possible P (CHISQ(15)) and the maximum possible Φ (CHISQ(13)). The significance of both statistics is the same as that of the Χ² statistic, CHISQ(1).
The distribution of the Χ² statistic in finite samples approximates a chi-squared distribution. To compute the exact mean and standard deviation of the Χ² statistic, Haldane (1939) uses the multinomial distribution with fixed table marginals. The exact mean and standard deviation generally differ little from the mean and standard deviation of the associated chi-squared distribution.
Fisher’s exact test is a conservative but uniformly most powerful unbiased test of equal row (or column) cell probabilities in the 2 × 2 table. In this test, the row and column marginals are assumed fixed, and the hypergeometric distribution is used to obtain the significance level of the test. A one- or a two-sided test is possible. See Kendall and Stuart (1979, page 582) for a discussion.
In rows 1 through 7 of STAT, estimated standard errors and asymptotic p-values are reported. Routine CTTWO computes these standard errors in two ways. The first estimate, in column 2 of matrix STAT, is asymptotically valid for any value of the statistic. The second estimate, in column 3 of STAT, is correct only under the null hypothesis of no association. The z-scores in column 4 are computed using this second estimate of the standard errors, and the p-values in column 5 are computed from these z-scores. See Brown and Benedetti (1977) for a discussion and formulas for the standard errors in column 3.
The measures of association Φ and P do not require any ordering of the row and column categories. Routine CTTWO also computes several measures of association for tables in which the rows and column categories correspond to ranked observations. Two of these measures, the product-moment correlation and the Spearman correlation, are correlation coefficients that are computed using assigned scores for the row and column categories. In the product-moment correlation, this score is the cell index, while in the Spearman rank correlation, this score is the average of the tied ranks of the row or column marginals. Other scores are possible.
Other measures of association (Gamma, Kendall’s τb, Stuart’s τc, and Somers’ D) are also computed similarly to a correlation coefficient in that the numerator of each statistic is, in some sense, a “covariance.” In fact, these measures differ only in their denominators, their numerators being the “covariance” between the auv’s and the buv’s defined earlier. The numerator is computed as
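written out in the notation above:

$$\sum_{u=1}^{n} \sum_{v=1}^{n} a_{uv}\, b_{uv}$$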
Since the product auvbuv = 1 when auv and buv are both 1 or both −1, it is easy to show that this “covariance” is twice the total number of agreements minus the number of disagreements between the row and column variables, where a disagreement occurs when auvbuv = −1.
Kendall’s τb is computed as the correlation between the auv’s and the buv’s (see Kendall and Stuart 1979, page 583). Stuart suggested a modification to the denominator of τ in which the denominator becomes the largest possible value of the “covariance.” This value turns out to be approximately n²/2 in 2 × 2 tables, and this is the value used in the denominator of Stuart’s τc. For large n with nearly balanced marginals, τc ≈ τb.
Gamma can be motivated in a slightly different manner. Because the “covariance” of the auv’s and the buv’s can be thought of as two times the number of agreements minus the number of disagreements [2(A − D), where A is the number of agreements and D is the number of disagreements], gamma is motivated as the probability of agreement minus the probability of disagreement, given that either agreement or disagreement occurred. This is just (A − D)/(A + D).
Two definitions of Somers’ D are possible, one for rows and a second for columns. Somers’ D for rows can be thought of as the regression coefficient for predicting auv from buv. Moreover, Somers’ D for rows is the probability of agreement minus the probability of disagreement, given that the column variable, buv, is not zero. Somers’ D for columns is defined in a similar manner.
A discussion of all of the measures of association in this section can be found in Kendall and Stuart (1979, starting on page 592).
The crossproduct ratio is also sometimes thought of as a measure of association (see Bishop, Fienberg, and Holland 1975, page 14). It is computed as:
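in terms of the observed counts,

$$\frac{x_{11}\, x_{22}}{x_{12}\, x_{21}}$$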
The log of the crossproduct ratio is the log of this quantity.
Yule’s Q and Yule’s Y are related to the crossproduct ratio. They are computed as:
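in terms of the observed counts,

$$Q = \frac{x_{11} x_{22} - x_{12} x_{21}}{x_{11} x_{22} + x_{12} x_{21}}, \qquad Y = \frac{\sqrt{x_{11} x_{22}} - \sqrt{x_{12} x_{21}}}{\sqrt{x_{11} x_{22}} + \sqrt{x_{12} x_{21}}}$$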
The measures in this section do not require any ordering of the row or column variables. They are based entirely upon probabilities. Most are discussed in Bishop, Fienberg, and Holland (1975, page 385).
Consider predicting or classifying the column variable for a given value of the row variable. The best classification for each row under the null hypothesis of independence is the column that has the highest marginal probability (and thus the highest probability for the row under the independence assumption). The probability of misclassification is then one minus this marginal probability. On the other hand, if independence is not assumed, so that the row and column variables are dependent, then within each row one would classify the column variable according to the category with the highest row conditional probability. The probability of misclassification for the row is then one minus this conditional probability.
Define the optimal prediction coefficient λc|r for predicting columns from rows as the proportion of the probability of misclassification that is eliminated because the random variables are not independent. It is estimated by:
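in the notation of this section,

$$\hat{\lambda}_{c|r} = \frac{\sum_i p_{im} - p_{\cdot m}}{1 - p_{\cdot m}}$$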
where m is the index of the maximum estimated probability in the row (pim) or row margin (p∙m). A similar coefficient is defined for predicting the rows from the columns. The symmetric version of the optimal prediction λ is obtained by summing the numerators and denominators of λr|c and λc|r and dividing. Standard errors for these coefficients are given in Bishop, Fienberg, and Holland (1975, page 388).
A problem with the optimal prediction coefficients λ is that they vary with the marginal probabilities. One way to correct for this is to use row conditional probabilities. The optimal prediction λ* coefficients are defined as the corresponding λ coefficients in which one first adjusts the row (or column) marginals to the same number of observations. This yields
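Carrying this adjustment through for the rows (here R = 2 is the number of rows) gives, in one standard form,

$$\lambda^{*}_{c|r} = \frac{\sum_i \max_j p_{j|i} \;-\; \max_j \sum_i p_{j|i}}{R - \max_j \sum_i p_{j|i}}$$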
where i indexes the rows and j indexes the columns, and pj|i is the (estimated) probability of column j given row i.
The coefficient λ*r|c is similarly defined.
A second kind of prediction measure attempts to explain the proportion of the explained variation of the row (column) measure given the column (row) measure. Define the total variation in the rows to be
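In terms of the row marginal totals, this works out to

$$\frac{n^2 - \sum_i x_{i\cdot}^2}{2n}$$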
This is 1/(2n) times the sums of squares of the auv’s.
With this definition of variation, the Goodman and Kruskal τ coefficient for rows is computed as the reduction of the total variation for rows accounted for by the columns divided by the total variation for the rows. To compute the reduction in the total variation of the rows accounted for by the columns, define the total variation for the rows within column j as
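applying the same construction within column j,

$$q_j = \frac{x_{\cdot j}^2 - \sum_i x_{ij}^2}{2\, x_{\cdot j}}$$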
Define the total variation for rows within columns as the sum of the qj’s. Consistent with the usual methods in the analysis of variance, the reduction in the total variation is the difference between the total variation for rows and the total variation for rows within the columns.
Goodman and Kruskal’s τ for columns is similarly defined. See Bishop, Fienberg, and Holland (1975, page 391) for the standard errors.
The uncertainty coefficient for rows is the increase in the log-likelihood that is achieved by the most general model over the independence model divided by the marginal log-likelihood for the rows. This is given by
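in terms of the cell and marginal probabilities, one standard form of this ratio is

$$U_{r|c} = \frac{\sum_{i,j} p_{ij} \,\log\!\left(\dfrac{p_{ij}}{p_{i\cdot}\, p_{\cdot j}}\right)}{-\sum_i p_{i\cdot} \,\log p_{i\cdot}}$$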
The uncertainty coefficient for columns is similarly defined. The symmetric uncertainty coefficient contains the same numerator as Ur|c and Uc|r but averages the denominators of these two statistics. Standard errors for U are given in Brown (1983).
The Kruskal-Wallis statistic for rows is a one-way analysis-of-variance-type test that assumes the column variable is monotonically ordered. It tests the null hypothesis that the row populations are identical, using average ranks for the column variable. This amounts to a test of H0 : p1∙ = p2∙. The Kruskal-Wallis statistic for columns is similarly defined. Conover (1980) discusses the Kruskal-Wallis test.
The test for a linear trend in the column probabilities assumes that the row variable is monotonically ordered. In this test, the probability for column 1 is predicted by the row index using weighted simple linear regression. The slope is given by
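This is the usual weighted least-squares slope, with the row totals xi∙ as weights and the conditional proportions p̂1|i = xi1/xi∙ as responses:

$$\hat{\beta} = \frac{\sum_i x_{i\cdot}\,(i - \bar{\imath})\,(\hat{p}_{1|i} - \hat{p}_{\cdot 1})}{\sum_i x_{i\cdot}\,(i - \bar{\imath})^2}$$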
where ī = (Σi xi∙ i)/n is the average row index. An asymptotic test that the slope is zero may be obtained as the usual large-sample regression test of zero slope.
Kappa is a measure of agreement. In the Kappa statistic, the rows and columns correspond to the responses of two judges. The judges agree along the diagonal and disagree off the diagonal. Let po = p11 + p22 denote the probability that the two judges agree, and let pc = p1∙p∙1 + p2∙p∙2 denote the expected probability of agreement under the independence model. Kappa is then given by (po − pc)/(1 − pc).
The McNemar test is a test of symmetry in a square contingency table. It tests the null hypothesis H0 : θij = θji. The test statistic with 1 degree of freedom is computed as
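for the 2 × 2 table,

$$\chi^2 = \frac{(x_{12} - x_{21})^2}{x_{12} + x_{21}}$$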
Its exact probability may be computed via the binomial distribution.
Informational errors
Type   Code
3      1      At least one marginal total is zero. The remainder of the analysis cannot proceed.
3      9      Some expected table values are less than 1.0. Some asymptotic p-values may not be good.
3      10     Some expected table values are less than 2.0. Some asymptotic p-values may not be good.
3      11     Twenty percent of the table expected values are less than 5.
The following example, from Kendall and Stuart (1979, pages 582-583), compares the condition of the teeth in breast-fed versus bottle-fed babies.
      USE CTTWO_INT
      IMPLICIT NONE
      INTEGER    IPRINT, LDCHIC, LDEXPE, LDSTAT, LDTABL
      PARAMETER  (IPRINT=1, LDCHIC=3, LDEXPE=3, LDSTAT=24, LDTABL=2)
!
      REAL       CHICTR(LDCHIC,3), CHISQ(15), EXPECT(LDEXPE,3), &
                 STAT(LDSTAT,5), TABLE(LDTABL,2)
!                                 Observed counts, stored column by
!                                 column:  TABLE = | 4  16 |
!                                                  | 1  21 |
      DATA TABLE/4, 1, 16, 21/
!                                 Compute and print all statistics
      CALL CTTWO (TABLE, EXPECT, CHICTR, CHISQ, STAT, IPRINT=IPRINT)
      END
              TABLE
          1       2
 1     4.00   16.00
 2     1.00   21.00

          Expected values
            Col 1     Col 2  Marginal
 Row 1     2.3810   17.6190   20.0000
 Row 2     2.6190   19.3810   22.0000
 Marginal  5.0000   37.0000   42.0000

      Contributions to chi-squared
            Col 1     Col 2     Total
 Row 1     1.1010    0.1488    1.2497
 Row 2     1.0009    0.1353    1.1361
 Total     2.1018    0.2840    2.3858

              CHISQ
 Pearson chi-squared       2.3858
 p-value                   0.1224
 Degrees of freedom        1.0000
 Likelihood ratio          2.5099
 p-value                   0.1131
 Yates chi-squared         1.1398
 p-value                   0.2857
 Fisher (one tail)         0.1435
 Fisher (two tail)         0.1745
 Exact mean                1.0244
 Exact std dev             1.3267
 Phi                       0.2383
 Max possible phi          0.3855
 Contingency coef.         0.2318
 Max possible coef.        0.3597
                          STAT
                  Statistic  Std err.  Std err. 0  t-value  p-value
 Gamma               0.6800    0.3135      0.4395   1.5472   0.1218
 Kendall’s tau B     0.2383    0.1347      0.1540   1.5472   0.1218
 Stuart’s tau C      0.1542    0.0997         NaN   1.5472   0.1218
 Somers’ D row       0.1545    0.0999      0.0999   1.5472   0.1218
 Somers’ D col       0.3676    0.1966      0.2376   1.5472   0.1218
 Correlation         0.2383    0.1347      0.1540   1.5472   0.1218
 Spearman rank       0.2383    0.1347      0.1540   1.5472   0.1218
 GK tau row          0.0568    0.0641         NaN      NaN      NaN
 GK tau col          0.0568    0.0609         NaN      NaN      NaN
 U normed            0.0565    0.0661         NaN      NaN      NaN
 U row               0.0819    0.0935         NaN      NaN      NaN
 U col               0.0432    0.0516         NaN      NaN      NaN
 Lamda sym           0.1200    0.0779         NaN      NaN      NaN
 Lamda row           0.0000    0.0000         NaN      NaN      NaN
 Lamda col           0.1500    0.1031         NaN      NaN      NaN
 Lamda star row      0.0000    0.0000         NaN      NaN      NaN
 Lamda star col      0.1761    0.1978         NaN      NaN      NaN
 Yule’s Q            0.6800    0.3135      0.4770   1.4255   0.1540
 Yule’s Y            0.3923    0.2467      0.2385   1.6450   0.1000
 Ratio               5.2500       NaN         NaN      NaN      NaN
 Log ratio           1.6582    1.1662      0.9540   1.7381   0.0822
 Linear trend       -0.1545    0.1001         NaN  -1.5446   0.1224
 Kappa               0.1600    0.1572      0.1600   1.0000   0.3173
 McNemar            13.2353    1.0000         NaN   0.0000   0.0003
*** WARNING ERROR 11 from CTTWO.  Twenty percent of the table expected
***          values are less than 5.0.