Example 1: Automatic selection of the best number of clusters from the data
Suppose you only have the measurements for 150 different flowers of type iris, and you are wondering whether these flowers "naturally" fall into a certain number of clusters based on the measurements available to you. In more general terms, suppose you have a set of measurements taken on a large sample of observations, and you are wondering whether any clusters of observations exist in the sample, and if so, how many. This type of research question comes up in various domains: in marketing research, one might be interested in clusters of lifestyles or market segments; in manufacturing and quality control applications, one might be interested in clusters or patterns of failures or defects occurring in the final product (e.g., on silicon wafers).
Specifying the analysis

Open the Cluster Analysis (Generalized EM, k-Means & Tree) module by selecting that command from the Data Mining menu. In the Cluster Analysis dialog box, click the Variables button. In the variable selection dialog box, select variables 1 through 4 (Sepallen, Sepalwid, Petallen, and Petalwid) as continuous variables for the analysis. Note that we are not selecting variable Iristype. Click the OK button. Consistent with the stated purpose of this example (see the Overview paragraph above), we will pretend that we have no prior knowledge regarding the "true" number of clusters (types of iris) contained in the data set.
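As an illustrative, code-based equivalent of this variable selection step (a sketch only: it builds the data from scikit-learn's bundled iris set rather than the module's data file, and the column names simply mirror those in the text), the four measurement columns are kept as clustering inputs while the species label is left out:

```python
# Build a data frame with the four continuous measurements plus the
# species label, then keep only the measurements as clustering inputs.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data,
                  columns=["Sepallen", "Sepalwid", "Petallen", "Petalwid"])
df["Iristype"] = [iris.target_names[t] for t in iris.target]

# Select variables 1 through 4; Iristype is deliberately excluded.
X = df[["Sepallen", "Sepalwid", "Petallen", "Petalwid"]]
print(X.shape)  # 150 flowers, 4 continuous variables
```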


Now, click OK to perform the computations and display the Results dialog.
Reviewing Results
Note that, as discussed in the Introductory Overview, the results of your analysis may differ because the initial random assignment of observations (based on the random number seed) is different; you can enter the same random number seed to reproduce the same results.
In this particular case, the program extracted 4 clusters from the data.


It appears that the error function drops quickly from the 2- to the 3-cluster solution, and then it "flattens" out. Using the same logic as applied to the similar Scree plot computed in Factor Analysis (for determining the best number of factors), you could choose either the 3- or the 4-cluster solution for final review. For this example, let's accept the 4-cluster solution selected by the program.
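This "scree-like" reasoning can be sketched with scikit-learn's k-means on its bundled copy of the iris data (a stand-in for the data file used in this example; the module's own error function and EM-based fitting may differ):

```python
# Fit k-means for k = 2..6 and record the within-cluster error (inertia:
# the sum of squared distances from observations to their cluster
# centers). For iris, the drop from k=2 to k=3 is large, and the curve
# flattens afterward.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # 150 iris flowers x 4 measurements

errors = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    errors[k] = km.inertia_

for k, e in sorted(errors.items()):
    print(k, round(e, 1))
```

Fixing `random_state` plays the same role as entering the same random number seed in the dialog: it makes the initial random assignment, and hence the results, reproducible.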

The plotted means are computed as

M'(i,j) = (M(i,j) - Min(i)) / (Max(i) - Min(i))

where
M'(i,j) is the transformed (scaled) mean for continuous variable i and cluster j
M(i,j) is the arithmetic ("unscaled") mean for continuous variable i and cluster j
Max(i) and Min(i) are the maximum and minimum observed values for continuous variable i
In other words, the plotted values depict the means scaled to the overall ranges of observed values for the respective continuous variables.
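A minimal sketch of this scaling, using k-means centers on scikit-learn's iris data as a stand-in for the module's cluster means:

```python
# Rescale each cluster mean to the 0-1 range of its variable,
# scaled = (mean - min) / (max - min), so that variables with very
# different units can be compared in one plot of means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

means = km.cluster_centers_                # unscaled cluster means M(i,j)
mins, maxs = X.min(axis=0), X.max(axis=0)  # Min(i) and Max(i) per variable
scaled = (means - mins) / (maxs - mins)    # transformed means M'(i,j)

print(scaled.round(2))  # every entry lies in [0, 1]
```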

It appears that the pattern of means for Cluster 1 is quite distinct from that for the other clusters. You can also click the Cluster distances button to verify that the distance of Cluster 1 from all the others is larger than the distances between clusters 2, 3, and 4.
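The Cluster distances check can be approximated outside the module as the pairwise Euclidean distances between fitted cluster centers (a sketch; cluster numbering is arbitrary and depends on the random seed, so only the pattern of distances is meaningful):

```python
# Compute the matrix of Euclidean distances between the four k-means
# cluster centers fitted to the iris measurements.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
centers = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X).cluster_centers_

diff = centers[:, None, :] - centers[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))  # 4 x 4 distance matrix
print(dist.round(2))
```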
On the Results dialog - Quick tab, click the Save classifications & distances button, and then select Iristype as an additional variable to save along with the cluster analysis results (assignments, distances to cluster centers). Part of the resulting data file is shown below.

This data file will automatically be created as an input file for subsequent analyses (see also option Data - Input Spreadsheet). Select Basic Statistics/Tables from the Statistics menu, select Tables and banners to display the Crosstabulation Tables dialog, and then compute a cross tabulation table of variable Iristype by Final classification.

Shown above is the summary crosstabulation table, along with the (row) percentages of observations of each known Iristype classified into the respective clusters. As expected, Cluster 1 (the one that showed the greatest distance from all others) is most distinct: 100% of all flowers of type Setosa were correctly classified as belonging to a distinct group or cluster. Cluster 3 and Cluster 4 apparently identify the flowers of type Versicol and Virginic that are easily "classifiable," while Cluster 2 contains flowers of both type Versicol and type Virginic. It appears that these two types of flowers are not as easily distinguished, a result consistent with those computed in the Discriminant Function Analysis - Example.
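The same crosstabulation can be sketched with pandas (again using scikit-learn's iris data and k-means as stand-ins; the cluster numbers themselves are arbitrary, so only the pattern of agreement with the known species matters):

```python
# Tabulate the known species against the cluster assignments, mirroring
# the Iristype-by-Final-classification table computed in the module.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
clusters = KMeans(n_clusters=4, n_init=10,
                  random_state=0).fit_predict(iris.data)
species = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

table = pd.crosstab(species, clusters,
                    rownames=["Iristype"],
                    colnames=["Final classification"])
print(table)
```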
Summary
The purpose of this example is to illustrate the usefulness of the Generalized EM and k-Means Cluster Analysis module for automatically determining a "best" number of clusters from the data, using v-fold cross-validation techniques. This extension of traditional clustering methods makes the module a very powerful tool for unsupervised learning and pattern recognition, which are typical tasks encountered in Data Mining.
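The v-fold cross-validation idea can be sketched as follows (a rough illustration only, not the module's exact algorithm: here each candidate k is fitted on v-1 folds and scored by the distance from held-out points to their nearest fitted center):

```python
# For each candidate number of clusters k, average the held-out error
# across v folds; a k beyond which the cross-validated error stops
# improving is a natural stopping point.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold

X = load_iris().data
v = 5  # number of folds (an illustrative choice)

def cv_error(k):
    errs = []
    for train, test in KFold(n_splits=v, shuffle=True,
                             random_state=0).split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train])
        # squared distances from held-out points to every center
        d = ((X[test][:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
        errs.append(d.min(axis=1).mean())  # nearest-center error
    return float(np.mean(errs))

scores = {k: cv_error(k) for k in range(2, 7)}
for k, s in sorted(scores.items()):
    print(k, round(s, 3))
```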