Cluster Analysis - k-Means tab

Select the k-Means tab of the Cluster Analysis dialog box to access options to specify parameters for the k-Means clustering algorithm; see also the Introductory Overview for details.

Initial cluster centers

This group box contains three options (described below) to specify the way in which the initial cluster centers are computed. Note that the results from the k-Means clustering method may depend to some extent on the initial configuration (i.e., cluster means or centers). This is particularly the case when there are many small clusters (with few objects) that are clearly distinct. See also the Introductory Overview for additional details regarding the k-Means clustering algorithm.

Maximize the initial distance

When you select this option button, particular observations are chosen as the initial cluster centers so as to maximize the initial distances between clusters. Specifically:

1) The program first selects the first k (number of clusters) cases as the respective cluster centers.

2) Subsequent cases replace previous cluster centers if their smallest distance to any of the cluster centers is larger than the smallest distance between clusters.

3) If this is not the case, subsequent cases replace initial cluster centers if their smallest distance from a cluster center is larger than the distance of that cluster center from any other cluster center.

The effect of this selection procedure is to maximize the initial distances between clusters. Note that this procedure may yield clusters with single observations if there are clear outliers in the data.
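The selection procedure described above can be sketched in Python roughly as follows. This is only one possible reading, not the program's actual code: the function name `maximize_initial_distance` is hypothetical, Euclidean distance is assumed, and the text does not say which center is replaced in steps 2 and 3, so the replacement rules marked as assumptions in the comments are guesses.

```python
import numpy as np

def maximize_initial_distance(data, k):
    """Return k rows of `data` chosen as widely separated initial centers."""
    data = np.asarray(data, dtype=float)
    if k < 2 or k > len(data):
        raise ValueError("this sketch needs 2 <= k <= number of cases")
    centers = data[:k].copy()  # step 1: start from the first k cases

    for case in data[k:]:
        # Euclidean distances among current centers; the diagonal is set
        # to inf so a center is never compared with itself.
        diff = centers[:, None, :] - centers[None, :, :]
        between = np.sqrt((diff ** 2).sum(axis=2))
        np.fill_diagonal(between, np.inf)
        # Distances from the candidate case to every current center.
        to_case = np.sqrt(((centers - case) ** 2).sum(axis=1))

        if to_case.min() > between.min():
            # Step 2. Assumption: of the two closest centers, replace
            # the one that is nearer to the candidate case.
            m, n = np.unravel_index(np.argmin(between), between.shape)
            centers[m if to_case[m] <= to_case[n] else n] = case
        else:
            # Step 3. Assumption: compare the case's distance to its
            # second-nearest center with the nearest center's smallest
            # distance to any other center.
            nearest, second = np.argsort(to_case)[:2]
            if to_case[second] > between[nearest].min():
                centers[nearest] = case
    return centers

cases = [[0, 0], [1, 0], [0, 1], [10, 10], [0.5, 0.5]]
initial_centers = maximize_initial_distance(cases, k=2)
```

In this toy data set the outlying case (10, 10) ends up as one of the two initial centers, which illustrates why clear outliers can produce single-observation clusters.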

Randomly choose k observations

When you select this option button, k randomly selected observations (where k is the number of clusters) are used as the initial cluster centers.
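A minimal sketch of random selection, assuming the observations are rows of a NumPy array and are drawn without replacement (the text does not state this, but drawing the same case twice would give two identical centers). The function name and the `seed` parameter are hypothetical.

```python
import numpy as np

def random_initial_centers(data, k, seed=None):
    """Pick k distinct rows of `data` at random as initial centers."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # replace=False guarantees k different observations.
    rows = rng.choice(len(data), size=k, replace=False)
    return data[rows]

cases = np.arange(20, dtype=float).reshape(10, 2)
centers = random_initial_centers(cases, k=3, seed=0)
```

Fixing the seed makes a run repeatable; different seeds can lead to different final clusters because the results depend on the initial configuration.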

Choose the first k observations

When you select this option button, the first k (number of clusters) observations will be the initial cluster centers. Thus, this option provides full control over the choice of the initial configuration. This is often useful if you bring a priori expectations regarding the nature of the clusters to the analysis. In that case, move the cases that you want to choose as the initial cluster centers to the beginning of the file.
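As an illustration of moving the desired cases to the beginning of the file, the following sketch reorders an in-memory array so that hand-picked rows come first. The helper `move_to_front` and the example data are hypothetical; in the program itself you would reorder the cases in the data file.

```python
import numpy as np

def move_to_front(data, chosen_rows):
    """Reorder `data` so the rows listed in `chosen_rows` come first."""
    data = np.asarray(data, dtype=float)
    chosen = list(chosen_rows)
    rest = [i for i in range(len(data)) if i not in chosen]
    return data[chosen + rest]

cases = np.array([[1.0, 1.0], [5.0, 5.0], [1.2, 0.9], [9.0, 9.0]])
reordered = move_to_front(cases, chosen_rows=[1, 3])

k = 2
initial_centers = reordered[:k]  # the rows [5, 5] and [9, 9]
```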

Standardize distances

Select this check box to compute distances based upon standardized or normalized values. This option prevents variables from affecting the analysis simply because of how they are scaled; that is, all variables are placed on an equal footing.

Specifically, the values that enter the distance formulas given below are normalized using xmin and xmax, and ymin and ymax, the minimum and maximum values for the x and y variables in each distance.

Distance measure

Specify a distance measure from the four options described below. The default is Euclidean Distances. Note that all distances are computed from ("measured in") normalized values; hence, different scaling (ranges of values) for different variables will not affect the clustering results. This differs from the method for computing Euclidean distances for k-Means clustering in the Cluster Analysis module; see also Differences in k-Means Algorithms in Generalized EM & k-Means Cluster Analysis vs. Cluster Analysis.
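The effect of normalization on scaling can be illustrated with a short sketch. It assumes range (min-max) normalization, (value - min) / (max - min), which the references to minimum and maximum values suggest but the text does not spell out; the function `range_normalize` and the example data are hypothetical.

```python
import numpy as np

def range_normalize(data):
    """Rescale every column (variable) to the interval [0, 1]."""
    data = np.asarray(data, dtype=float)
    lo = data.min(axis=0)
    span = data.max(axis=0) - lo
    span[span == 0] = 1.0  # constant columns: avoid division by zero
    return (data - lo) / span

# The same three cases, with the first variable in meters and in centimeters.
cases_m = np.array([[1.60, 50.0], [1.80, 90.0], [1.70, 70.0]])
cases_cm = cases_m * np.array([100.0, 1.0])

a = range_normalize(cases_m)
b = range_normalize(cases_cm)
dist_m = np.linalg.norm(a[0] - a[1])   # Euclidean distance, cases 1 and 2
dist_cm = np.linalg.norm(b[0] - b[1])
```

Both versions of the data give the same normalized values, so `dist_m` and `dist_cm` agree even though the raw first variable differs by a factor of 100.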

Distance measures for categorical variables

For categorical variables, all distances can only be 0 (zero) or 1 (one); 0 if the class to which a particular observation belongs is the same as the one that occurs with the greatest frequency in the respective cluster, and 1 if it is different from that class (see also the Introductory Overview for details regarding the treatment of categorical variables in k-Means and EM clustering). Consequently, with the exception of the Chebychev distance, for categorical variables the different distance measures available in the program will yield identical results.
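The 0/1 rule for categorical variables can be sketched as follows. The function name and data are hypothetical, and tie-breaking between equally frequent classes (here, the class encountered first wins) is an assumption the text does not address.

```python
from collections import Counter

def categorical_distance(value, cluster_values):
    """Return 0 if `value` is the cluster's most frequent class, else 1."""
    # most_common(1) yields the (class, count) pair with the highest count.
    modal_class = Counter(cluster_values).most_common(1)[0][0]
    return 0 if value == modal_class else 1

cluster_colors = ["red", "red", "blue", "red"]
d_same = categorical_distance("red", cluster_colors)   # modal class -> 0
d_diff = categorical_distance("blue", cluster_colors)  # other class -> 1
```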

Euclidean Distances

This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as:
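In standard notation, assuming x_i and y_i denote the (normalized) values of the two objects x and y on dimension i, the Euclidean distance is usually written as:

```latex
d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}
```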

Squared Euclidean Distances

You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as:
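With x_i and y_i again assumed to be the (normalized) values of the two objects on dimension i, the standard form of the squared Euclidean distance is:

```latex
d(x, y) = \sum_{i} (x_i - y_i)^2
```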

City-block (Manhattan) Distances

This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as:
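The standard definition sums the absolute differences over the dimensions, with x_i and y_i assumed to be the (normalized) values of the two objects on dimension i:

```latex
d(x, y) = \sum_{i} |x_i - y_i|
```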

Chebychev Distances

This distance measure may be appropriate in cases when you want to define two objects as "different" if they are different on any one of the dimensions. The Chebychev distance is computed as:
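In its standard form, the Chebychev distance is the largest absolute difference on any single dimension, that is, the maximum over i of |x_i - y_i|. The sketch below implements this and the other three measures from their standard definitions so that they can be compared on one pair of points; the function names are hypothetical, and the inputs are assumed to be already normalized.

```python
import numpy as np

def _diff(x, y):
    """Element-wise difference between two points."""
    return np.asarray(x, dtype=float) - np.asarray(y, dtype=float)

def euclidean(x, y):
    return float(np.sqrt(np.sum(_diff(x, y) ** 2)))

def squared_euclidean(x, y):
    return float(np.sum(_diff(x, y) ** 2))

def city_block(x, y):
    return float(np.sum(np.abs(_diff(x, y))))

def chebychev(x, y):
    return float(np.max(np.abs(_diff(x, y))))

x, y = [0.0, 0.0], [3.0, 4.0]
distances = {
    "euclidean": euclidean(x, y),                  # 5.0
    "squared_euclidean": squared_euclidean(x, y),  # 25.0
    "city_block": city_block(x, y),                # 7.0
    "chebychev": chebychev(x, y),                  # 4.0
}
```

The Chebychev value depends only on the dimension with the largest difference (here 4), whereas the other three measures combine the differences on all dimensions.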
