Workspace Node: Generalized Cluster Analysis - Specifications - Options Tab
In the Generalized Cluster Analysis workspace node dialog box, under the Specifications heading, select the Options tab; the options available depend on whether k-Means or EM is selected on the Specifications - Quick tab.
k-Means
The following options are available when k-Means is selected on the Specifications - Quick tab.
Element Name | Description |
---|---|
Initial cluster center | This group box contains three options (described below) to specify the way in which the initial cluster centers are computed. Note that the results from the k-Means clustering method may depend to some extent on the initial configuration (i.e., cluster means or centers). This is particularly the case when there are many small clusters (with few objects) that are clearly distinct. See also the Introductory Overview for additional details regarding the k-Means clustering algorithm. |
Maximize the initial distance | When you select this option button, particular observations will be chosen as the initial cluster centers, so as to maximize the initial cluster distances. Specifically, 1) the program will first select the first k (number of clusters) cases to be the respective cluster centers; 2) subsequent cases will replace previous cluster centers if their smallest distance to any of the cluster centers is larger than the smallest distance between clusters; if this is not the case, then 3) subsequent cases will replace initial cluster centers if their smallest distance from a cluster center is larger than the distance of that cluster center from any other cluster center. The effect of this selection procedure is to maximize the initial distances between clusters. Note that this procedure may yield clusters with single observations if there are clear outliers in the data. |
Randomly choose k observations | When you select this option button, k (number of clusters) observations selected randomly will be the initial cluster centers. |
Choose the first k observations | When you select this option button, the first k (number of clusters) observations will be the initial cluster centers. Thus, this option provides full control over the choice of the initial configuration. This is often useful if you bring a priori expectations regarding the nature of the clusters to the analysis. In that case, move the cases that you want to choose as the initial cluster centers to the beginning of the file. |
Standardize Distances | Select this check box to compute distances based on standardized (normalized) values. This option prevents variables from affecting the analysis simply because of how they are scaled; that is, all variables are placed on an equal footing. Specifically, the distance formulas given below are computed from normalized values, where xmin and xmax, and ymin and ymax, are the minimum and maximum values of the x and y variables in each distance. |
Distance Measure | Specify a distance measure from the options in this drop-down list. The default is Euclidean distance. Note that all distances are computed from (measured in) normalized values; hence, unlike the method for computing Euclidean distances for k-Means clustering in the Cluster Analysis module (see also Differences in k-Means Algorithms in Generalized EM & k-Means Cluster Analysis vs. Cluster Analysis), different scaling (ranges of values) for different variables will not affect the clustering results. Specifically, the values in the formulas given below are normalized using xmin and xmax, and ymin and ymax, the minimum and maximum values of the x and y variables in each distance. |
Distance measures for categorical variables | For categorical variables, all distances can only be 0 (zero) or 1 (one); 0 if the class to which a particular observation belongs is the same as the one that occurs with the greatest frequency in the respective cluster, and 1 if it is different from that class (see also the Introductory Overview for details regarding the treatment of categorical variables in k-Means and EM clustering). Consequently, with the exception of the Chebychev distance, for categorical variables the different distance measures available in the program will yield identical results. |
Euclidean distance | This is probably the most commonly chosen type of distance. It is simply the geometric distance in the multidimensional space. It is computed as: $\mathrm{distance}(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$ |
Squared Euclidean distance | You may want to square the standard Euclidean distance in order to place progressively greater weight on objects that are further apart. This distance is computed as: $\mathrm{distance}(x, y) = \sum_i (x_i - y_i)^2$ |
City-block (Manhattan) distance | This distance is simply the sum of the absolute differences across dimensions. In most cases, this distance measure yields results similar to the simple Euclidean distance. However, note that in this measure, the effect of single large differences (outliers) is dampened (since they are not squared). The city-block distance is computed as: $\mathrm{distance}(x, y) = \sum_i \lvert x_i - y_i \rvert$ |
Chebychev distance | This distance measure may be appropriate in cases when you want to define two objects as different if they are different on any one of the dimensions. The Chebychev distance is computed as: $\mathrm{distance}(x, y) = \max_i \lvert x_i - y_i \rvert$ |
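
The four distance measures listed in the table above can be illustrated with a minimal Python sketch. It assumes that "normalized" means rescaling each variable to the range 0 to 1 using its minimum and maximum, which is an interpretation of the xmin/xmax description and not a confirmed detail of the program. All function names (`min_max_normalize`, `euclidean`, `squared_euclidean`, `city_block`, `chebychev`) and the sample data are hypothetical.

```python
import math

def min_max_normalize(rows):
    """Rescale each column to [0, 1] using that column's minimum and maximum.

    Assumption: this is how values are normalized before distances are computed.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]

def euclidean(x, y):
    # Geometric distance in the multidimensional space.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    # Places progressively greater weight on objects that are further apart.
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    # Sum of absolute differences; large single differences are not squared.
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    # Largest absolute difference on any one dimension.
    return max(abs(a - b) for a, b in zip(x, y))

# Hypothetical cases: two variables measured on very different scales.
cases = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = min_max_normalize(cases)

for name, measure in [
    ("Euclidean", euclidean),
    ("Squared Euclidean", squared_euclidean),
    ("City-block", city_block),
    ("Chebychev", chebychev),
]:
    print(name, measure(scaled[0], scaled[2]))
```

After rescaling, both variables contribute equally to every distance, even though the second variable's raw range is far larger than the first's.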
EM
The following options are available when EM is selected on the Specifications - Quick tab.
Element Name | Description |
---|---|
Random seed | Specify the random number generator seed to be used when initializing the classification probabilities (weights). |
Minimum increase of log-likelihood | Specify a minimum increase in the log-likelihood value over consecutive iterations. |
Distributions for continuous variables | Specify the distribution for each continuous variable; by default, the continuous variables are treated as following the Normal distribution. Use the drop-down list to assign the (previously selected) continuous variables to one of three distributions: normal, lognormal, or Poisson. |
Options / C / W | See Common Options. |
OK | Click this button to accept all the specifications made in the dialog box and to close it. The analysis results are placed in the Reporting Documents workspace node after running (updating) the project. |
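
The roles of the Random seed and Minimum increase of log-likelihood options can be illustrated with a minimal Python sketch. It assumes a single continuous variable modeled as a mixture of normal distributions, and it assumes that iterations stop once the gain in log-likelihood between consecutive iterations falls below the specified minimum; the program's actual stopping rule is not documented on this page. The function `em_normal_mixture`, its parameters (including `max_iter`), and the sample data are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_normal_mixture(data, k, seed, min_increase, max_iter=200):
    """Fit a k-component normal mixture to 1-D data with a basic EM loop."""
    rng = random.Random(seed)  # the seed fixes the initial weights
    n = len(data)

    # Initialize the classification probabilities (weights) at random;
    # each case's weights sum to 1 across the k clusters.
    weights = []
    for _ in data:
        raw = [rng.random() + 1e-6 for _ in range(k)]
        total = sum(raw)
        weights.append([r / total for r in raw])

    history = []  # log-likelihood after each iteration
    means = []
    for _ in range(max_iter):
        # M-step: re-estimate cluster proportions, means, and variances.
        priors, means, variances = [], [], []
        for j in range(k):
            mass = sum(weights[i][j] for i in range(n))
            mean = sum(weights[i][j] * data[i] for i in range(n)) / mass
            var = sum(weights[i][j] * (data[i] - mean) ** 2 for i in range(n)) / mass
            priors.append(mass / n)
            means.append(mean)
            variances.append(max(var, 1e-6))  # guard against zero variance

        # E-step: update classification probabilities and the log-likelihood.
        log_lik = 0.0
        for i, x in enumerate(data):
            dens = [priors[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            log_lik += math.log(total)
            weights[i] = [d / total for d in dens]
        history.append(log_lik)

        # Assumed stopping rule: stop when the gain is below the minimum increase.
        if len(history) > 1 and history[-1] - history[-2] < min_increase:
            break
    return means, history

# Hypothetical data with two apparent groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, history = em_normal_mixture(data, k=2, seed=42, min_increase=1e-6)
print("cluster means:", means)
print("iterations:", len(history))
```

Running the function twice with the same seed reproduces the same sequence of log-likelihood values, while a smaller minimum increase generally allows more iterations before the loop stops.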