K-Means Clustering Advanced Tab

Select the Advanced tab of the Cluster Analysis: K-Means Clustering dialog box to access the options described below.

  • Variables: Click the Variables button to display the standard variable selection dialog box, in which you select variables for the analysis.
  • Cluster: The Cluster box contains two options: Variables (columns) and Cases (rows). The option you select determines how STATISTICA interprets the selected Variables.
  • Variables (columns): If Variables (columns) is selected, STATISTICA interprets the selected Variables as the objects to be clustered, and the cases serve as the dimensions.
  • Cases (rows): If Cases (rows) is selected, STATISTICA interprets the selected Variables as dimensions, and the cases are the objects to be clustered.
  • Number of clusters: Use the Number of clusters box (and the accompanying microscrolls) to enter the desired number of clusters, which must be greater than 1 and less than the number of objects (cases or variables, depending on the selection in the Cluster box). The purpose of the k-means clustering procedure is to classify objects into a user-specified number of clusters. The algorithm moves objects between clusters with the goal of minimizing the within-cluster variability while maximizing the between-cluster variability. For a further discussion of this method,
  • Number of iterations: Use the Number of iterations box (and the accompanying microscrolls) to specify the maximum number of iterations that can be performed. k-means clustering is an iterative procedure; in each iteration, objects are moved into different clusters. The algorithm implemented in the Cluster Analysis module is very efficient, and the default setting (10 iterations) usually does not need to be changed.
  • Initial cluster centers: The Initial cluster centers group box contains three options (described below). Use these options to specify the way in which the initial cluster centers are computed. Note that the results from the k-means clustering method depend to some extent on the initial configuration (cluster means or centers). This is particularly the case when there are many small clusters (with few objects) that are clearly distinct.
    • Choose observations to maximize initial between-cluster distances: If you select this option, observations (objects) will be set as the initial cluster centers; they are chosen according to rules that maximize the initial distances between clusters. Specifically, (1) the program will select the first N (number of clusters) cases to be the respective cluster centers; (2) subsequent cases will replace previous cluster centers if their smallest distance to any of the cluster centers is larger than the smallest distance between clusters; if this is not the case, then (3) subsequent cases will replace initial cluster centers if their smallest distance from a cluster center is larger than the distance of that cluster center from any other cluster center. The effect of this selection procedure is to maximize the initial distances between clusters. Note that this procedure may yield clusters with single observations if there are clear outliers in the data.
    • Sort distances and take observations at constant intervals: If you select this option, the distances between all objects will first be sorted, and then objects at constant intervals will be chosen as initial cluster centers.
    • Choose the first N (Number of clusters) observations: If you select this option, the first N (number of clusters) observations will be the initial cluster centers. Thus, this option provides full control over the choice of the initial configuration. This is often useful if you bring a priori expectations regarding the nature of the clusters to the analysis. In that case, move the cases that you want to choose as the initial cluster centers to the beginning of the file.
  • Batch processing and reporting: If you select the Batch processing and reporting check box, STATISTICA automatically performs the analysis (after you click the OK button) and sends the entire output from the analysis to a workbook, individual windows, and/or to a report (depending on the options selected in the Analysis/Graph Output Manager).
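
The interplay of the Number of clusters, Number of iterations, and Initial cluster centers options can be illustrated with a minimal Python sketch. It is not STATISTICA's implementation: it assumes a plain reassign-and-recompute k-means loop with squared Euclidean distances, and it uses only the simplest initialization described above (the first N observations become the initial centers). The names k_means, squared_distance, and within_cluster_ss are hypothetical and chosen for illustration only.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(objects, n_clusters, max_iterations=10):
    """Illustrative k-means loop (an assumption, not STATISTICA's algorithm).

    The first n_clusters objects serve as the initial cluster centers,
    mirroring the "Choose the first N observations" option.
    """
    n_objects = len(objects)
    # The number of clusters must be > 1 and < the number of objects.
    if not 1 < n_clusters < n_objects:
        raise ValueError(
            "n_clusters must be greater than 1 and less than the number of objects"
        )

    centers = [list(obj) for obj in objects[:n_clusters]]
    labels = None

    for _ in range(max_iterations):
        # Assign every object to its nearest current center.
        new_labels = [
            min(range(n_clusters), key=lambda c: squared_distance(obj, centers[c]))
            for obj in objects
        ]
        # Stop early once no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels

        # Recompute each center as the mean of its members.
        for c in range(n_clusters):
            members = [obj for obj, lab in zip(objects, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]

    return labels, centers


def within_cluster_ss(objects, labels, centers):
    """Total within-cluster sum of squares, the quantity k-means tries to reduce."""
    return sum(squared_distance(obj, centers[lab]) for obj, lab in zip(objects, labels))


# Six cases measured on two variables; the first two cases seed the clusters.
cases = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = k_means(cases, n_clusters=2)
```

Because this sketch seeds the clusters with the first observations, reordering the data changes the starting configuration, which is the effect the Initial cluster centers options control. To cluster variables rather than cases under the same assumptions, the data could be transposed first, for example with list(zip(*cases)).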