Computes cell frequencies, cell means, and cell sums of squares for multivariate data.
X — |NROW| by NCOL matrix containing the data. (Input)
Each column of X represents either a classification variable, a response variable, a weight, or a frequency.
KMAX — Maximum number of cells. (Input)
This quantity does not have to be exact, but must be at least as large as the actual number of cells, K.
CELIF — Matrix with min(KMAX, K) columns containing cell information. (Output, if IDO = 0 or 1; input/output, if IDO = 2.)
The number of rows in CELIF depends on the eight cases tabled below.
Case  Description                       Rows in CELIF
1     MOPT ≤ 0, IFRQ = 0 and IWT = 0    NCOL + NR + 1
2     MOPT ≤ 0, IFRQ > 0 and IWT = 0    NCOL + NR
3     MOPT ≤ 0, IFRQ = 0 and IWT > 0    NCOL + NR + 1
4     MOPT ≤ 0, IFRQ > 0 and IWT > 0    NCOL + NR
5     MOPT > 0, IFRQ = 0 and IWT = 0    NCOL + 2 * NR + 1
6     MOPT > 0, IFRQ > 0 and IWT = 0    NCOL + 2 * NR
7     MOPT > 0, IFRQ = 0 and IWT > 0    NCOL + 3 * NR
8     MOPT > 0, IFRQ > 0 and IWT > 0    NCOL + 3 * NR − 1
Each column of CELIF describes one unique combination of values of the m classification variables that occurs in the data. The first m rows give the values of the classification variables. Row m + 1 gives the number of observations that are in this cell. (For cases 2, 4, 6 and 8, this is the sum of the frequencies.) For cases 3 and 4, row m + 2 contains the sum of the weights.
For NR greater than zero, the remaining rows (beginning with row m + 3 in cases 3 and 4 and with row m + 2 otherwise) contain information concerning the response variables. For cases 1, 2, 3 and 4, there are 2 * NR remaining rows with the cell (weighted) mean and cell (weighted) sum of squares for each of the NR response variables. For cases 5 and 6, there are 3 * NR remaining rows with the sample size, the mean, and the sum of squares for each of the NR response variables. For cases 7 and 8, there are 4 * NR remaining rows with the sample size, the sum of weights, the weighted mean, and the weighted sum of squares for each of the NR response variables.
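The row counts in the table follow from the layout just described, so they can be recomputed when sizing CELIF. The module below is a minimal sketch of that arithmetic, not part of the library: the names celif_layout, n_class, and celif_rows are hypothetical, and m is taken as NCOL − NR minus one for each of the frequency and weight columns that is present.

```fortran
module celif_layout
  implicit none
  private
  public :: n_class, celif_rows
contains

  ! Number of classification variables m: every column of X that is not a
  ! response, frequency, or weight column. (Hypothetical helper.)
  pure function n_class(ncol, nr, ifrq, iwt) result(m)
    integer, intent(in) :: ncol, nr, ifrq, iwt
    integer :: m
    m = ncol - nr
    if (ifrq > 0) m = m - 1
    if (iwt > 0) m = m - 1
  end function n_class

  ! Number of rows of CELIF implied by the layout described above.
  ! (Hypothetical helper; it only reproduces the eight-case table.)
  pure function celif_rows(mopt, ifrq, iwt, ncol, nr) result(rows)
    integer, intent(in) :: mopt, ifrq, iwt, ncol, nr
    integer :: rows
    ! m rows of classification values plus one row of cell frequencies
    rows = n_class(ncol, nr, ifrq, iwt) + 1
    if (mopt <= 0) then
      if (iwt > 0) rows = rows + 1   ! cases 3 and 4: sum-of-weights row
      rows = rows + 2*nr             ! mean and sum of squares per response
    else if (iwt > 0) then
      rows = rows + 4*nr             ! cases 7 and 8
    else
      rows = rows + 3*nr             ! cases 5 and 6
    end if
  end function celif_rows

end module celif_layout
```

With MOPT = 1, IFRQ = 5, IWT = 0, NCOL = 5, and NR = 2, celif_rows evaluates to 9, which agrees with the case 6 entry NCOL + 2 * NR.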
IDO — Processing option. (Input)
Default: IDO = 0.
IDO  Action
0    This is the only invocation of CSTAT for this data set, and all the data are input at once.
1    This is the first invocation, and additional calls to CSTAT will be made. Initialization and updating for the data in X are performed.
2    This is an intermediate invocation of CSTAT, and updating for the data in X is performed.
NROW — The absolute value of NROW is the number of rows of data currently input in X. (Input)
Default: NROW = size (X,1).
NROW may be positive or negative. A negative NROW means that the −NROW rows of data are to be deleted from some aspects of the analysis; this should be done only if IDO is 2. When a negative value is input for NROW, it is assumed that each of the −NROW rows of X has been input (with positive NROW) in previous invocations of CSTAT.
NCOL — Number of columns in X. (Input)
Default: NCOL = size (X,2).
LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program. (Input)
Default: LDX = size (X,1).
NR — Number of response variables. (Input)
NR = 0 means no response variables are input. Otherwise, cell means and sums of squares are computed for the response variables.
Default: NR = 0.
IRX — Vector of length NR. (Input if NR is greater than 0.)
Columns IRX(1), …, IRX(NR) of X contain the response variables for which cell means and sums of squares are computed.
IFRQ — Frequency option. (Input)
IFRQ = 0 means that all frequencies are 1.0. For positive IFRQ, column number IFRQ of X contains the frequencies.
Default: IFRQ = 0.
IWT — Weighting option. (Input)
IWT = 0 means that all weights are 1.0. For positive IWT, column IWT of X contains the weights.
Default: IWT = 0.
MOPT — Missing value option. (Input)
If MOPT is zero, the exclusion is listwise. If MOPT is positive, the following occurs: (1) if a classification variable’s value is missing, the entire case is excluded, (2) if IFRQ > 0 and the frequency variable’s value is missing, the entire case is excluded, (3) if IWT > 0 and the weight variable’s value is missing, the case is classified and the cell frequency updated, but no information with regard to the response variables is computed, and (4) if only some response variables’ values are missing, all computations are performed except those pertaining to the response variables with missing values.
Default: MOPT = 0.
K — Number of cells or an upper bound for this number. (Input/Output)
On the first call, K must be input as 0. It should not be changed between calls to CSTAT. K is incremented by one for each new cell, up to KMAX cells. Once KMAX cells have been encountered, K is incremented by one for each observation that does not fall into one of the KMAX cells. In this case, K is an upper bound on the number of cells and can be used for KMAX in a subsequent run.
Default: K = 0.
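The counting rule for K can be replayed on a toy stream of observations to see why K becomes only an upper bound once KMAX is exceeded. The module below is a rough sketch under stated assumptions, not the library's code: k_counter and simulate_k are hypothetical names, and each observation's cell is represented by a made-up integer label instead of a combination of classification values.

```fortran
module k_counter
  implicit none
  private
  public :: simulate_k
contains

  ! Replays the documented rule for K on a sequence of cell labels.
  ! ids(i) is a hypothetical integer label for the cell of observation i.
  function simulate_k(ids, kmax) result(k)
    integer, intent(in) :: ids(:), kmax
    integer :: k
    integer :: stored(kmax)   ! labels of the cells kept so far
    integer :: nstored, i

    k = 0
    nstored = 0
    do i = 1, size(ids)
      ! An observation that falls into a stored cell leaves K unchanged.
      if (any(stored(1:nstored) == ids(i))) cycle
      ! Otherwise store the new cell while there is room for it.
      if (nstored < kmax) then
        nstored = nstored + 1
        stored(nstored) = ids(i)
      end if
      ! K counts new cells and, past KMAX, every observation that does
      ! not fall into one of the stored cells.
      k = k + 1
    end do
  end function simulate_k

end module k_counter
```

For the labels 1, 2, 3, 3, 3 with KMAX = 2, the sketch returns K = 5 although only three distinct cells occur; with KMAX = 4 it returns the exact count, K = 3. In both cases min(KMAX, K) is the number of columns of CELIF in use.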
LDCELI — Leading dimension of CELIF exactly as specified in the dimension statement in the calling program. (Input)
Default: LDCELI = size (CELIF,1).
Generic: CALL CSTAT (X, KMAX, CELIF [,…])
Specific: The specific interface names are S_CSTAT and D_CSTAT.
Single: CALL CSTAT (IDO, NROW, NCOL, X, LDX, NR, IRX, IFRQ, IWT, MOPT, KMAX, K, CELIF, LDCELI)
Double: The double precision name is DCSTAT.
The routine CSTAT computes cell frequencies, cell means, and cell sums of squares for multivariate data in X.
The columns of X can contain data for four types of variables: classification variables, a frequency variable, a weight variable, and response variables. The frequency variable, the weight variable, and the response variables are all designated by indicators in IFRQ, IWT, and IRX. All other variables are considered to be classification variables; hence, there are m classification variables, where m = NCOL − NR if there is no weight or frequency variable, m = NCOL − NR − 1 if there is a weight or frequency variable but not both, and m = NCOL − NR − 2 if there are both weight and frequency variables.
Each combination of values of the classification variables is stored in the first m rows of CELIF. For each combination of values of the classification variables, the frequencies are stored in the next row of CELIF. Then, for each combination, means and sums of squares for each of the response variables are computed and stored in the remaining rows of CELIF. If a weighting variable is specified, the sum of the weights for each combination is computed and stored. If missing values are deleted elementwise (that is, if MOPT is positive), the frequencies and sums of weights for each of the response variables are stored in the rows of CELIF.
1. If no nonmissing observations with positive weights or frequencies exist in a cell for a particular response variable, the mean and sum of squares are set to NaN (not a number).
2. In cases 3 and 6, if a zero weight is encountered, there is no contribution to the means or sums of squares, but the sample sizes are incremented by one for that observation.
In this example, there are two classification variables, C1 and C2, and two response variables, R1 and R2. Their values are shown below.
               C1 = 1          C1 = 2
              R1     R2       R1     R2
C2 = 1       5.0    3.4      3.8    2.4
             7.0    2.6      5.2    6.3
                             4.9    1.2
C2 = 2       4.3    9.8      6.5    3.4
             3.2    7.1      3.1    5.1
             1.7    6.3
      USE CSTAT_INT
      USE WRRRL_INT
      IMPLICIT NONE
      INTEGER    KMAX, LDCELI, LDX, NR, NCOL
      PARAMETER  (KMAX=4, LDCELI=15, LDX=10, NR=2, NCOL=4)
!
      INTEGER    IDO, IFRQ, IRX(NR), IWT, K, MIN0, MOPT, NROW
      REAL       CELIF(LDCELI,KMAX), X(LDX,NCOL)
      CHARACTER  CLABEL(1)*6, FMT*7, RLABEL(7)*6
      INTRINSIC  MIN0
!     Get data for example
      DATA X/1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, &
          1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 4.3, &
          3.2, 1.7, 3.8, 5.2, 4.9, 6.5, 3.1, 3.4, 2.6, 9.8, 7.1, 6.3, &
          2.4, 6.3, 1.2, 3.4, 5.1/
!     IDO, NROW, IFRQ, IWT and MOPT are set below to their documented
!     default values; the call to CSTAT omits them and relies on those
!     defaults.
!     All data are input at once
      IDO  = 0
      NROW = 10
      K    = 0
!     No unequal frequencies or weights are used
      IFRQ = 0
      IWT  = 0
!     Response variables are in 3rd and 4th columns
      IRX(1) = 3
      IRX(2) = 4
!     Delete any row containing a missing value
      MOPT = 0
!
      CALL CSTAT (X, KMAX, CELIF, NR=NR, IRX=IRX, K=K)
!     Print the results
      CLABEL(1) = 'NONE'
      RLABEL(1) = ' '
      RLABEL(2) = ' '
      RLABEL(3) = 'Freq.'
      RLABEL(4) = 'Mean 1'
      RLABEL(5) = 'SS 1'
      RLABEL(6) = 'Mean 2'
      RLABEL(7) = 'SS 2'
      FMT       = '(W10.4)'
      CALL WRRRL ('Statistics for the Cells', CELIF, &
                  RLABEL, CLABEL, NRA=(NCOL+NR+1), &
                  NCA=MIN0(KMAX, K), FMT=FMT)
      END
        Statistics for the Cells
            1.00    1.00    2.00    2.00
            1.00    2.00    1.00    2.00
Freq.       2.00    3.00    3.00    2.00
Mean 1      6.00    3.07    4.63    4.80
SS 1        2.00    3.41    1.09    5.78
Mean 2      3.00    7.73    3.30    4.25
SS 2        0.32    6.73   14.22    1.44
This example uses the same data as in the first example, except some of the data are set to missing values. Also, a frequency variable is used. It is in the fifth column of X. The frequency variable indicates that the values of the classification and response variables in the first observation occur 3 times and that all other frequencies are 1. Since MOPT is greater than zero, statistics on one response variable are accumulated even if the other response variable has a missing value. If the frequency variable has a missing value, however, the entire observation is omitted.
The missing value is NaN (not a number), which can be obtained from the routine AMACH (Reference Material) with an argument of 6. For this example, we set the first response variable in the first cell (C1 = 1, C2 = 1) to a missing value; we set the second response variable in the (2, 1) cell to a missing value; and we set the frequency variable in the (1, 2) cell to a missing value. The data are now as shown below, with “NaN” in place of the missing values.
               C1 = 1          C1 = 2
              R1     R2       R1     R2
C2 = 1       NaN    3.4      3.8    NaN
             7.0    2.6      5.2    6.3
                             4.9    1.2
C2 = 2       NaN    NaN      6.5    3.4
             3.2    7.1      3.1    5.1
             1.7    6.3
The first two rows output in CELIF are the values of the classification variables, and the third row is the frequencies of the cells, as before. The next three rows correspond to the first response variable, and the last three rows correspond to the second response variable. (This is “case 6” above, where the argument CELIF is described.)
      USE CSTAT_INT
      USE WRRRN_INT
      IMPLICIT NONE
      INTEGER    KMAX, LDCELI, LDX, NR, NCOL, NROW
      PARAMETER  (KMAX=4, LDCELI=15, LDX=10, NR=2, NCOL=5)
!
      INTEGER    IDO, IFRQ, IRX(NR), IWT, K, MIN0, MOPT
      REAL       CELIF(LDCELI,KMAX), X(LDX,NCOL), AMACH
      INTRINSIC  MIN0
!     Get data for example.
      DATA X/1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 2.0, 1.0, &
          1.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 2.0, 5.0, 7.0, 4.3, &
          3.2, 1.7, 3.8, 5.2, 4.9, 6.5, 3.1, 3.4, 2.6, 9.8, 7.1, 6.3, &
          2.4, 6.3, 1.2, 3.4, 5.1, 3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, &
          1.0, 1.0, 1.0/
!     IDO, NROW and IWT are set below to their documented default
!     values; the call to CSTAT omits them and relies on those defaults.
!     All data are input at once.
      IDO  = 0
      NROW = 10
      K    = 0
!     Frequencies are in the 5th column. All weights are equal.
      IFRQ = 5
      IWT  = 0
!     Response variables are in 3rd and 4th columns.
      IRX(1) = 3
      IRX(2) = 4
!     Set some values to "missing" for this example. Specify
!     elementwise deletion of missing values of the response variables.
      MOPT   = 1
      X(1,3) = AMACH(6)
      X(6,4) = AMACH(6)
      X(3,5) = AMACH(6)
!
      CALL CSTAT (X, KMAX, CELIF, NR=NR, IRX=IRX, MOPT=MOPT, IFRQ=IFRQ, &
                  K=K)
!     Print the results.
      CALL WRRRN ('Statistics for the Cells', CELIF, NRA=(NCOL+2*NR), &
                  NCA=MIN0(KMAX, K))
      END
     Statistics for the Cells
        1       2       3       4
1    1.00    1.00    2.00    2.00
2    1.00    2.00    1.00    2.00
3    4.00    2.00    3.00    2.00
4    1.00    2.00    3.00    2.00
5    7.00    2.45    4.63    4.80
6    0.00    1.12    1.09    5.78
7    4.00    2.00    2.00    2.00
8    3.20    6.70    3.75    4.25
9    0.48    0.32   13.01    1.44
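The per-response rows of this output can be checked by hand. The module below is a small sketch for doing so, independent of CSTAT: cell_stats and cell_summary are hypothetical names, and the formulas used (sample size n as the sum of the frequencies f, mean as the sum of f*x divided by n, and sum of squares as the sum of f*(x − mean)**2) are an assumption inferred from the printed numbers, since the text above does not state them. Observations whose response or frequency is NaN are skipped, mirroring the elementwise deletion described for MOPT > 0.

```fortran
module cell_stats
  use, intrinsic :: ieee_arithmetic, only: ieee_is_nan, ieee_value, &
                                           ieee_quiet_nan
  implicit none
  private
  public :: cell_summary
contains

  ! Frequency-based sample size, mean, and sum of squares about the mean
  ! for one response variable within one cell. (Hypothetical helper; the
  ! formulas are inferred from the example output, not from the library.)
  subroutine cell_summary(x, f, n, mean, ss)
    real, intent(in)  :: x(:)   ! response values of the cell's observations
    real, intent(in)  :: f(:)   ! matching frequencies, same length as x
    real, intent(out) :: n, mean, ss
    logical :: ok(size(x))

    ! Keep only observations with a nonmissing response and frequency.
    ok = .not. (ieee_is_nan(x) .or. ieee_is_nan(f))
    n = sum(f, mask=ok)
    if (n > 0.0) then
      mean = sum(f*x, mask=ok) / n
      ss = sum(f*(x - mean)**2, mask=ok)
    else
      ! No usable observations: report NaN for both statistics.
      mean = ieee_value(1.0, ieee_quiet_nan)
      ss = mean
    end if
  end subroutine cell_summary

end module cell_stats
```

For the first cell (C1 = 1, C2 = 1), the second response variable has the values 3.4 and 2.6 with frequencies 3 and 1. The sketch then gives n = 4, a mean of 3.2, and a sum of squares of 0.48, which matches rows 7 to 9 of column 1 in the output above.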