Performs a K-means (centroid) cluster analysis.
X NOBS by NCOL matrix containing
the observations to be clustered. (Input)
The only columns of
X used are those
indicated by IND
and possibly IFRQ and/or IWT.
CM K by NVAR matrix
containing, on input, the cluster seeds, i.e., estimates for the cluster
centers, and the cluster means on output. (Input/Output)
The cluster seeds
must be unique.
SWT K by NVAR matrix containing
the sum of the weights used to compute each cluster mean. (Output)
Missing observations are excluded from SWT.
IC Vector of length NOBS containing the cluster membership for each observation. (Output)
NC Vector of length K containing the number of observations in each cluster. (Output)
WSS Vector of length K containing the within sum of squares for each cluster. (Output)
NOBS Number of
observations. (Input)
Default: NOBS = size (X,1).
NCOL Number of
columns in X.
(Input)
Default: NCOL = size (X,2).
NVAR Number of
variables to be used in computing the metric. (Input)
Default:
NVAR = size
(CM,2).
LDX Leading
dimension of X
exactly as specified in the dimension statement in the calling
program. (Input)
Default: LDX = size (X,1).
IFRQ Frequency
option. (Input)
IFRQ = 0 means all
frequencies are 1. For positive IFRQ, column number
IFRQ of X contains the
nonnegative frequencies.
Default: IFRQ = 0.
IWT Weighting
option. (Input)
IWT = 0 means all
weights are 1. For positive IWT, column number
IWT contains the
nonnegative weights.
Default: IWT = 0.
IND Vector of
length NVAR
containing the columns of X to be used in
computing the metric. (Input)
In the usual case in which X is the data matrix,
no observation has multiple frequency, and unequal weighting is not desired,
IND = (1, 2, 3,
, NVAR).
By default,
IND(I) = (I)
K Number of
clusters. (Input)
Default: K = size (CM,1).
MAXIT Maximum
number of iterations. (Input)
MAXIT = 30 is usually
sufficient.
Default: MAXIT = 30.
LDCM Leading
dimension of CM
exactly as specified in the dimension statement in the calling
program. (Input)
Default: LDCM = size (CM,1).
LDSWT Leading
dimension of SWT
exactly as specified in the dimension statement in the calling
program. (Input)
Default: LDSWT = size (SWT,1).
Generic: CALL KMEAN (X, CM, SWT, IC, NC, WSS [, ])
Specific: The specific interface names are S_KMEAN and D_KMEAN.
Single: CALL KMEAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS)
Double: The double precision name is DKMEAN.
Routine KMEAN is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. Routine KMEAN allows for missing values (coded as NaN, not a number) and for weights and frequencies.
Let p = NVAR denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,
where vim denotes the row index of the m-th observation in the i-th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is zero if the j-th variable on observation vim is missing, otherwise δ is one; and
is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. The user in referred to Hartigan and Wong (1979) or Hartigan (1975) for the details.
1. Workspace may be explicitly provided, if desired, by use of K2EAN/DK2EAN. The reference is:
CALL K2EAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS, IC2, NCP, D, ITRAN, LIVE)
The additional arguments are as follows:
IC2 Work vector of length NOBS.
NCP Work vector of length K.
D Work vector of length NOBS.
ITRAN Work vector of length K.
LIVE Work vector of length K.
2. Informational Error
Type Code
3 1 Convergence did not occur within MAXIT iterations.
This example performs K-means cluster analysis on Fishers iris data, which is first obtained via routine GDATA (see Chapter 19, Utilities). The initial cluster seed for each iris type is an observation known to be in the iris type.
USE IMSL_LIBRARIES
IMPLICIT NONE
INTEGER IPRINT, K, LDCM, LDSWT, LDX, NCOL, NOBS, NV, NVAR
PARAMETER (IPRINT=0, K=3, NCOL=5, NOBS=150, NV=5, NVAR=4, &
LDCM=K, LDSWT=K, LDX=NOBS)
!
INTEGER IC(NOBS), IND(NVAR), NC(K), NXCOL, NXROW
REAL CM(K,NVAR), SWT(K,NVAR), WSS(K), X(NOBS,NV)
!
DATA IND/2, 3, 4, 5/
!
CALL GDATA (3, X, NXROW, NXCOL)
! Copy the cluster seeds into CM
CALL SCOPY (NVAR, X(1:,2), LDX, CM(1:,1), LDCM)
CALL SCOPY (NVAR, X(51:,2), LDX, CM(2:,1), LDCM)
CALL SCOPY (NVAR, X(101:,2), LDX, CM(3:,1), LDCM)
!
CALL KMEAN (X, CM, SWT, IC, NC, WSS, IND=IND)
!
CALL WRRRN ('CM', CM)
CALL WRRRN ('SWT', SWT)
CALL WRIRN ('IC', IC, 1, NOBS, 1)
CALL WRIRN ('NC', NC, 1, K, 1)
CALL WRRRN ('WSS', WSS, 1, K, 1)
END
CM
1 2
3 4
1 5.006
3.428 1.462 0.246
2 5.902
2.748 4.394 1.434
3 6.850
3.074 5.742
2.071
SWT
1 2
3 4
1 50.00
50.00 50.00 50.00
2 62.00
62.00 62.00 62.00
3 38.00
38.00 38.00
38.00
IC
1 2 3 4 5 6
7 8 9 10 11 12 13 14
15 16 17 18 19 20
1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1
21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37
38 39 40
1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1
41 42 43 44 45 46 47 48 49 50
51 52 53 54 55 56 57 58 59
60
1 1 1 1 1 1
1 1 1 1 2
2 3 2 2 2
2 2 2 2
61 62 63 64 65
66 67 68 69 70 71 72 73 74
75 76 77 78 79 80
2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 3
2 2
81 82 83 84 85 86
87 88 89 90 91 92 93 94 95
96 97 98 99
2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2
100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115
2 3 2 3
3 3 3 2
3 3 3 3
3 3 2 2
116 117 118
119 120 121 122 123 124 125 126
127 128 129 130 131
3
3 3 3 2
3 2 3 2
3 3 2 2
3 3 3
132 133 134 135 136
137 138 139 140 141 142 143 144
145 146 147
3 3 2
3 3 3 3
2 3 3 3
2 3 3 3
2
148 149 150
3
3 2
NC
1 2 3
50
62 38
WSS
1
2 3
15.15 39.82
23.88
PHONE: 713.784.3131 FAX:713.781.9260 |