K-Nearest Neighbors - Cross-Validation Tab
Select the Cross-validation tab of the K-Nearest Neighbors dialog box to access options to apply cross-validation for estimating the number of nearest neighbors K. Although you can specify K on the Options tab, it is often the case that little is known about its best value. The cross-validation algorithm can obtain an estimate of K for you.
- V-fold cross validation for model selection
- In this group box, select the Apply v-fold cross-validation check box (see description below) to use the cross-validation algorithm to obtain an estimate of K. The v-fold cross-validation algorithm is described in detail in the documentation for the Classification Trees, Classification and Regression Trees (C&RT), and General CHAID modules. The general idea of this method is to divide the overall sample into v folds (randomly drawn, disjoint sub-samples). For a given value of K, the outcomes for the observations in one fold are predicted using the observations in the remaining v-1 folds (i.e., the v-1 folds are used as the prototype sample), and from this an error, usually defined as the sum of squared errors, is computed. This process is repeated until each of the v folds has served once as the held-out fold, and the v fold errors are averaged to yield a single measure (model error) of the stability of the respective model, i.e., the validity of the model for predicting unseen data. The steps above are then repeated for various values of K (defined by the search range), and the value of K that achieves the lowest cross-validation error is selected as the estimate of K.
- Apply v-fold cross-validation
- Select this check box to apply v-fold cross-validation. Note that the number of nearest neighbors K on the Options tab is not available when this option is selected.
- V value
- In this field, specify the number of folds used to perform the cross-validation. The default value is 10; the minimum is 2. The larger the number of folds, the fewer cases are available in each fold. This may lead to high variance in the cross-validation results (among the v folds). Thus, care should be taken in specifying the number of cross-validation folds v.
- Seed
- In this field, specify the random number generator seed to be used in the process of (randomly) grouping the data into v folds.
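The effect of the V value and Seed settings can be illustrated with a minimal Python sketch. It is not the program's own code: the function name `assign_folds`, the round-robin dealing of shuffled cases, and the seed value 1000 are assumptions made for illustration only. It shows that a fixed seed reproduces the same grouping and that more folds leave fewer cases in each fold.

```python
import random

def assign_folds(n_cases, v=10, seed=1000):
    """Randomly split case indices 0..n_cases-1 into v disjoint folds.

    Hypothetical helper: the seed default of 1000 is arbitrary and is not
    taken from the dialog box.
    """
    if v < 2:
        raise ValueError("v must be at least 2")
    if v > n_cases:
        raise ValueError("v cannot exceed the number of cases")
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # same seed -> same grouping
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::v] for i in range(v)]

folds = assign_folds(n_cases=25, v=10, seed=1000)
fold_sizes = [len(fold) for fold in folds]

# More folds means fewer cases per fold (here for 100 cases).
sizes_by_v = {v: len(assign_folds(100, v)[0]) for v in (2, 5, 10)}
print(sizes_by_v)  # {2: 50, 5: 20, 10: 10}
```

With only 10 cases per fold in the last configuration, each fold error rests on few predictions, which is the source of the variance among folds mentioned under V value.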
- Search range for nearest neighbors
- Use the options in this group box to define the grid search (the Minimum, Maximum, and Increment) for the number of nearest neighbors K.
- Minimum
- In this field, specify the minimum value of K for the cross-validation algorithm.
- Maximum
- In this field, specify the maximum value of K for the cross-validation algorithm.
- Increment
- In this field, specify the step by which the value of K is increased from the Minimum toward the Maximum (minimum 1).
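The way the search range and the v-fold procedure combine can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the program's implementation: the function names (`knn_predict`, `cv_error`, `search_k`), the one-dimensional regression data, the inclusive treatment of the Maximum, and the tie-breaking toward the smaller K are all hypothetical choices. The error for each fold is the sum of squared prediction errors, averaged over the v folds as described above.

```python
import random

def knn_predict(train, query, k):
    """Average the outcomes of the k training cases nearest to query (1-D predictor)."""
    nearest = sorted(train, key=lambda case: abs(case[0] - query))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def cv_error(data, folds, k):
    """Average, over the folds, of the sum of squared prediction errors."""
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        # The remaining v-1 folds serve as the prototype sample.
        train = [data[i] for i in range(len(data)) if i not in held]
        sse = sum((data[i][1] - knn_predict(train, data[i][0], k)) ** 2
                  for i in held_out)
        fold_errors.append(sse)
    return sum(fold_errors) / len(fold_errors)

def search_k(data, v, seed, k_min, k_max, increment):
    """Grid search over K; assumes the maximum is included in the grid."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::v] for i in range(v)]  # v disjoint random folds
    errors = {k: cv_error(data, folds, k)
              for k in range(k_min, k_max + 1, increment)}
    best_k = min(errors, key=errors.get)  # ties go to the smaller K
    return best_k, errors

# Synthetic data for illustration only: y = x^2 plus Gaussian noise.
rng = random.Random(0)
data = [(x / 10, (x / 10) ** 2 + rng.gauss(0, 0.5)) for x in range(60)]

best_k, errors = search_k(data, v=10, seed=1, k_min=1, k_max=15, increment=2)
print("K values searched:", list(errors))
print("estimated K:", best_k)
```

A larger increment shortens the search at the cost of a coarser grid; the same folds are reused for every K in this sketch so that the errors are compared on identical splits.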
Copyright © 2021. Cloud Software Group, Inc. All Rights Reserved.