Chapter 11: Cluster Analysis

KMEAN

Performs a K-means (centroid) cluster analysis.

Required Arguments

X — NOBS by NCOL matrix containing the observations to be clustered.   (Input)
The only columns of X used are those indicated by IND and possibly IFRQ and/or IWT.

CM — K by NVAR matrix containing, on input, the cluster seeds, i.e., estimates for the cluster centers, and the cluster means on output. (Input/Output)
The cluster seeds must be unique.

SWT — K by NVAR matrix containing the sum of the weights used to compute each cluster mean.   (Output)
Missing observations are excluded from SWT.

IC — Vector of length NOBS containing the cluster membership for each observation.   (Output)

NC — Vector of length K containing the number of observations in each cluster.   (Output)

WSS — Vector of length K containing the within sum of squares for each cluster.   (Output)

Optional Arguments

NOBS — Number of observations.   (Input)
Default: NOBS = size (X,1).

NCOL — Number of columns in X.   (Input)
Default: NCOL = size (X,2).

NVAR — Number of variables to be used in computing the metric.   (Input)
Default: NVAR = size (CM,2).

LDX — Leading dimension of X exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDX = size (X,1).

IFRQ — Frequency option.   (Input)
IFRQ = 0 means all frequencies are 1. For positive IFRQ, column number IFRQ of X contains the nonnegative frequencies.
Default: IFRQ = 0.

IWT — Weighting option.   (Input)
IWT = 0 means all weights are 1. For positive IWT, column number IWT contains the nonnegative weights.
Default: IWT = 0.

IND — Vector of length NVAR containing the columns of X to be used in computing the metric.   (Input)
In the usual case in which X is the data matrix, no observation has multiple frequency, and unequal weighting is not desired, IND = (1, 2, 3, …, NVAR).
By default, IND(I) = (I)

K — Number of clusters.   (Input)
Default: K = size (CM,1).

MAXIT — Maximum number of iterations.   (Input)
MAXIT = 30 is usually sufficient.
Default: MAXIT = 30.

LDCM — Leading dimension of CM exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDCM = size (CM,1).

LDSWT — Leading dimension of SWT exactly as specified in the dimension statement in the calling program.   (Input)
Default: LDSWT = size (SWT,1).

FORTRAN 90 Interface

Generic:          CALL KMEAN (X, CM, SWT, IC, NC, WSS [,…])

Specific:                             The specific interface names are S_KMEAN and D_KMEAN.

FORTRAN 77 Interface

Single:            CALL KMEAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS)

Double:                              The double precision name is DKMEAN.

Description

Routine KMEAN is an implementation of Algorithm AS 136 by Hartigan and Wong (1979). It computes K-means (centroid) Euclidean metric clusters for an input matrix starting with initial estimates of the K cluster means. Routine KMEAN allows for missing values (coded as NaN, “not a number”) and for weights and frequencies.

Let p = NVAR denote the number of variables to be used in computing the Euclidean distance between observations. The idea in K-means cluster analysis is to find a clustering (or grouping) of the observations so as to minimize the total within-cluster sums of squares. In this case, the total sums of squares within each cluster is computed as the sum of the centered sum of squares over all nonmissing values of each variable. That is,

where vim denotes the row index of the m-th observation in the i-th cluster in the matrix X; ni is the number of rows of X assigned to group i; f denotes the frequency of the observation; w denotes its weight; δ is zero if the j-th variable on observation vim is missing, otherwise δ is one; and

is the average of the nonmissing observations for variable j in group i. This method sequentially processes each observation and reassigns it to another cluster if doing so results in a decrease in the total within-cluster sums of squares. The user in referred to Hartigan and Wong (1979) or Hartigan (1975) for the details.

Comments

1.         Workspace may be explicitly provided, if desired, by use of K2EAN/DK2EAN. The reference is:

CALL K2EAN (NOBS, NCOL, NVAR, X, LDX, IFRQ, IWT, IND, K, MAXIT, CM, LDCM, SWT, LDSWT, IC, NC, WSS, IC2, NCP, D, ITRAN, LIVE)

The additional arguments are as follows:

IC2 — Work vector of length NOBS.

NCP — Work vector of length K.

D — Work vector of length NOBS.

ITRAN — Work vector of length K.

LIVE — Work vector of length K.

2.         Informational Error

Type Code

3         1                  Convergence did not occur within MAXIT iterations.

Example

This example performs K-means cluster analysis on Fisher’s iris data, which is first obtained via routine GDATA (see Chapter 19, Utilities). The initial cluster seed for each iris type is an observation known to be in the iris type.

 

      USE IMSL_LIBRARIES

 

      IMPLICIT   NONE

      INTEGER    IPRINT, K, LDCM, LDSWT, LDX, NCOL, NOBS, NV, NVAR

      PARAMETER  (IPRINT=0, K=3, NCOL=5, NOBS=150, NV=5, NVAR=4, &

                  LDCM=K, LDSWT=K, LDX=NOBS)

!

      INTEGER    IC(NOBS), IND(NVAR), NC(K), NXCOL, NXROW

      REAL       CM(K,NVAR), SWT(K,NVAR), WSS(K), X(NOBS,NV)

!

      DATA IND/2, 3, 4, 5/

!

      CALL GDATA (3, X, NXROW, NXCOL)

!                                 Copy the cluster seeds into CM

      CALL SCOPY (NVAR, X(1:,2), LDX, CM(1:,1), LDCM)

      CALL SCOPY (NVAR, X(51:,2), LDX, CM(2:,1), LDCM)

      CALL SCOPY (NVAR, X(101:,2), LDX, CM(3:,1), LDCM)

!

      CALL KMEAN (X, CM, SWT, IC, NC, WSS, IND=IND)

!

      CALL WRRRN ('CM', CM)

      CALL WRRRN ('SWT', SWT)

      CALL WRIRN ('IC', IC, 1, NOBS, 1)

      CALL WRIRN ('NC', NC, 1, K, 1)

      CALL WRRRN ('WSS', WSS, 1, K, 1)

      END

Output

 

               CM
        1       2       3       4
1   5.006   3.428   1.462   0.246
2   5.902   2.748   4.394   1.434
3   6.850   3.074   5.742   2.071

               SWT
        1       2       3       4
1   50.00   50.00   50.00   50.00
2   62.00   62.00   62.00   62.00
3   38.00   38.00   38.00   38.00

                                     IC
1  2  3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
1  1  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

21 22 23 24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40
 1  1  1  1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1

41 42 43 44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
 1  1  1  1   1   1   1   1   1   1   2   2   3   2   2   2   2   2   2   2

61 62 63 64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80
 2  2  2  2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2   2

81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99
 2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2

100 101 102 103  104  105  106  107  108  109  110  111  112  113  114  115
  2   3   2   3    3    3    3    2    3    3    3    3    3    3    2    2

116 117 118 119  120  121  122  123  124  125  126  127  128  129  130  131
  3   3   3   3    2    3    2    3    2    3    3    2    2    3    3    3

132 133 134 135  136  137  138  139  140  141  142  143  144  145  146  147
  3   3   2   3    3    3    3    2    3    3    3    2    3    3    3    2

148  149  150
  3    3    2

     NC
 1    2    3
50   62   38

          WSS
    1       2       3
15.15   39.82   23.88



http://www.vni.com/
PHONE: 713.784.3131
FAX:713.781.9260