Cluster Analysis - Validation Tab
Select the Validation tab of the Cluster Analysis dialog box to choose a cross-validation method for determining the best number of clusters from the data; see also the Introductory Overview for details. In general, the program computes cluster solutions for an increasing number of clusters, from the Minimum number of clusters to the Maximum number of clusters, and stops when the decrease in the respective error function (average distance of cases to cluster centers for k-Means; average log-likelihood of cases for EM) between consecutive solutions is less than the percentage specified in the Smallest percentage decrease field. By default, that value is 5, so if, for example, the error function for a solution with k+1 clusters is not at least 5% better than that for the solution with k clusters, the solution with k clusters is retained as the final (and best) solution.
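The stopping rule described above can be illustrated with a minimal sketch. It is not the program's own code: the function name `select_num_clusters` and its parameters are hypothetical, the errors are assumed to be positive values where smaller is better (as with an average distance), and the list of errors is assumed to be precomputed, whereas the program evaluates solutions one at a time and stops early.

```python
def select_num_clusters(errors, min_clusters=2, smallest_pct_decrease=5.0):
    """Return the number of clusters retained by the percentage-decrease rule.

    errors[i] is the error of the solution with (min_clusters + i) clusters.
    Assumes positive errors where smaller is better (hypothetical helper).
    """
    if not errors:
        raise ValueError("errors must contain at least one value")
    k = min_clusters
    for prev, curr in zip(errors, errors[1:]):
        if prev <= 0:
            break  # no room left for a percentage improvement
        pct_decrease = 100.0 * (prev - curr) / prev
        if pct_decrease < smallest_pct_decrease:
            break  # k+1 clusters is not enough of an improvement; keep k
        k += 1
    return k


# Errors for 2, 3, 4, 5 and 6 clusters: the step from 4 to 5 improves by
# less than 5%, so the 4-cluster solution is retained.
best_k = select_num_clusters([10.0, 7.0, 5.5, 5.4, 5.0])
```

If every step improves by at least the threshold, the sketch returns the largest number of clusters in the list, which corresponds to reaching the Maximum number of clusters.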
- Test sample
- With the Test sample option, you can use a sub-sample of cases for fitting the number of clusters. Click the Test sample button to display the Cross-Validation dialog box, where you can switch the Test sample option on or off and select a variable to be used as the sample identifier variable. Click the Sample Identifier Variable button to display a variable selection dialog box from which to choose the sample identifier variable. In addition, you need to select the code for the selected variable that uniquely identifies the cases to be used in the test sample (from which the cluster solutions will be computed). By default, when a sample identifier variable has been selected, a valid code is displayed in the Code for analysis sample box. If this is not the desired code for identifying the test sample, double-click on the box (or press the F2 key on your keyboard) to display a dialog box from which you can select the desired code from the list of valid variable codes.
If a Test sample is identified, the V-fold cross-validation options are disabled.
- V-fold cross-validation
- Select this check box to use a v-fold cross-validation algorithm to determine the best number of clusters from the range specified in the Minimum number of clusters and Maximum number of clusters fields, which are also displayed on this tab. The v-fold cross-validation algorithm is described in some detail in the context of the Classification Trees, Classification and Regression Trees (C&RT), and General CHAID modules. The general idea of this method is to divide the overall sample into v folds, i.e., randomly drawn (disjoint) sub-samples. The same type of analysis is then successively applied to the observations belonging to v-1 of the folds (the training sample), and the results are applied to the remaining fold (the testing sample, which was not used to determine the clusters) to compute an index of "predictive validity" (e.g., how well the observations in the testing fold can be assigned to homogeneous clusters using the cluster solution computed from the v-1 training folds). The results for the v replications are aggregated (averaged) to yield a single measure of the stability of the respective model, i.e., the validity of the model for assigning new observations to clusters.
- v value
- Specify the number of folds used to perform the cross-validation. The default value is 10, the minimum is 2, and the maximum is 999.
- Random seed
- Specify the random number generator seed to be used in the process of (randomly) grouping the data into v folds.
- Minimum number of clusters
- Specify the minimum number of clusters at which to start the search for the best cluster solution (using v-fold cross-validation or a testing sample). The default value is 2, the minimum is 1, and the maximum is 999.
- Maximum number of clusters
- Specify the maximum number of clusters for the search for the best cluster solution. The default value is 25, the minimum is 2, and the maximum is 1000.
- Smallest percentage decrease
- Enter the minimum percentage decrease in the respective error function that is required for the next cluster solution to be evaluated. The program computes cluster solutions for an increasing number of clusters, from the Minimum number of clusters to the Maximum number of clusters, and stops when the decrease in the respective error function (average distance of cases to cluster centers for k-Means; average log-likelihood of cases for EM) between consecutive solutions is less than the percentage specified in this field. By default, that value is 5, so if, for example, the error function for a solution with k+1 clusters is not at least 5% better than that for the solution with k clusters, the solution with k clusters is retained as the final (and best) solution.
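The idea behind the V-fold cross-validation option described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the program's implementation: the helper names (`kmeans`, `average_distance`, `vfold_error`) are hypothetical, the clustering step is a basic Lloyd-style k-means with random starting centers, and Euclidean distance is assumed as the distance measure.

```python
import math
import random


def kmeans(points, k, rng, iterations=20):
    """Basic Lloyd-style k-means (illustrative); returns k cluster centers."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # New center = mean of assigned cases; an empty cluster keeps its center.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
    return centers


def average_distance(points, centers):
    """Average distance of cases to their nearest cluster center."""
    return sum(min(math.dist(p, c) for c in centers) for p in points) / len(points)


def vfold_error(points, k, v=10, seed=1):
    """Average testing-fold error of a k-cluster solution over v folds."""
    if not 2 <= v <= len(points):
        raise ValueError("v must be between 2 and the number of cases")
    rng = random.Random(seed)      # the seed fixes the random fold assignment
    shuffled = points[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]   # v disjoint sub-samples
    errors = []
    for i, test in enumerate(folds):
        # Fit on the other v-1 folds, then score the held-out fold.
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        centers = kmeans(train, k, rng)
        errors.append(average_distance(test, centers))
    return sum(errors) / v
```

Calling `vfold_error(points, k)` for each `k` from the minimum to the maximum number of clusters yields a sequence of cross-validated errors to which the Smallest percentage decrease rule can then be applied. The `points` argument is assumed to be a list of equal-length numeric tuples.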