kmeans
K-Means Clustering

Description

Performs a clustering of numeric data into K groups. The K groups are represented by K center vectors, and the observations in a group are closer to its center than to any of the other centers.

Usage

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong",
    "Lloyd", "Forgy", "MacQueen"))
## S3 method for class 'kmeans':
print(x, ...)
## S3 method for class 'kmeans':
fitted(object, method = c("centers", "classes"), ...)

Arguments

x a matrix of multivariate data. Each row corresponds to an observation, and each column corresponds to a variable. Missing values (NAs) are not accepted. For the print method, x is an object of class "kmeans".
centers a matrix of initial guesses for the cluster centers, or an integer giving the number of clusters.
  • If centers is an integer, a random set of rows of x is selected as the initial centers. See the nstart argument to try multiple random initial sets.
  • If centers is a matrix, each row represents a cluster center, and thus centers must have the same number of columns as x. The number of rows in centers is the number of clusters that are formed. There must be at least two rows for the "Hartigan-Wong" algorithm; the other algorithms work with one row.
  • If centers is 1, the algorithm argument is ignored and the "MacQueen" algorithm is always used. Missing values (NAs) are not accepted.
iter.max the maximum number of iterations.
nstart the number of random initial center sets to try when centers is an integer. The best fit (in terms of minimum total within-cluster sum of squares) among the nstart tries is returned (see the sketch following these argument descriptions).
algorithm a character string specifying the clustering algorithm to use. Currently, only the "Hartigan-Wong" algorithm is implemented.
object for the fitted method, an object of class "kmeans".
method for the fitted method, a character string that specifies the type of fitted value to return: "centers" for each observation's cluster center vector, or "classes" for each observation's cluster membership value.
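For illustration, a minimal sketch of the two ways to specify centers (here x stands for any numeric data matrix; the variable names are placeholders, not part of the function's interface):
## centers given as an integer: 4 clusters, 20 random initial sets, keep the best fit
km1 <- kmeans(x, centers = 4, nstart = 20, iter.max = 50)
## centers given as a matrix: one row per initial center, same columns as x
init <- x[sample(nrow(x), 4), , drop = FALSE]
km2 <- kmeans(x, centers = init)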

Details

The objective is to find a partition of the observations into nrow(centers) groups that minimizes sum(withinss). To actually guarantee the minimum would be computationally infeasible in many settings; this function finds a local minimum, that is, a solution such that there is no single switch of an observation from one group to another group that will decrease the objective. The procedure used to achieve the local minimum is rather complex; see Hartigan and Wong (1979) for details.
It may be necessary to scale the columns of x in order for the clustering to be sensible. The larger the variance of a variable, the more important it will be to the clustering.
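For example, a brief sketch of clustering on standardized columns (scale is the base R scaling function; x is again a placeholder for a numeric matrix):
## standardize each column to mean 0 and standard deviation 1 before clustering
km.scaled <- kmeans(scale(x), centers = 3)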
When deciding on the number of clusters, Hartigan (1975, pp. 90-91) suggests the following rough rule of thumb. If km is the result of kmeans with k groups and kmplus1 is the result with k+1 groups, then it is justifiable to add the extra group when
(sum(km$withinss)/sum(kmplus1$withinss) - 1) * (nrow(x) - k - 1)
is greater than 10.
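A minimal sketch of applying this rule (x is assumed to be a numeric data matrix, and the object names follow the notation above):
k <- 3
km      <- kmeans(x, centers = k, nstart = 10)
kmplus1 <- kmeans(x, centers = k + 1, nstart = 10)
## a value greater than 10 suggests that the (k+1)th group is worth adding
(sum(km$withinss) / sum(kmplus1$withinss) - 1) * (nrow(x) - k - 1) > 10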
The hidden print method prints out some key components of an object of class kmeans.
The hidden fitted method returns cluster fitted values. If method is "classes", this is a vector of cluster membership (the cluster component of the "kmeans" object). If method is "centers", this is a matrix where each row is the cluster center for the observation. The rownames of the matrix are the cluster membership values.
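For example, a short sketch of both kinds of fitted values (using the Sdatasets::iris data, as in the Examples below):
km <- kmeans(Sdatasets::iris[, 1:4], centers = 3)
head(fitted(km, method = "centers"))   # the cluster center assigned to each observation
head(fitted(km, method = "classes"))   # cluster memberships, the same as km$cluster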
Value
kmeans returns an object of class kmeans with the following components:
cluster A vector of integers, ranging from 1 to nrow(centers), with length the same as the number of rows of x. The ith value indicates the cluster to which the ith data point belongs.
centers A matrix like the input centers containing the locations of the final cluster centers. Each row is a cluster center location. The row names are the character strings 1 through nrow(centers). The column names are the same as the column names of x.
totss The total sum of squares of the centered x; that is, sum(scale(x, scale = FALSE)^2).
withinss A vector of length nrow(centers). The ith value gives the within-cluster sum of squares for the ith cluster.
tot.withinss The total within-cluster sum of squares; that is, sum(withinss).
betweenss The between-cluster sum of squares; that is, totss minus tot.withinss.
size A vector of length nrow(centers). The ith value gives the number of data points in cluster i.
iter The number of iterations used to compute the results.
ifault An integer error code. If 0, there were no errors; if 2, the iter.max iterations were performed but the algorithm did not converge.
fitted The fitted method returns: if method is "classes", a vector of integers indicating cluster membership (the same as the cluster component of the "kmeans" object); if method is "centers", a matrix like x in which each row is the cluster center for that observation. The row names of the matrix are the cluster membership values.
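For illustration, a small sketch showing how these components relate (again using the Sdatasets::iris data; the exact values depend on the random initial centers):
km <- kmeans(Sdatasets::iris[, 1:4], centers = 3)
all.equal(km$tot.withinss, sum(km$withinss))           # TRUE
all.equal(km$totss, km$tot.withinss + km$betweenss)    # TRUE
km$size                                                # observations per cluster
km$iter                                                # iterations used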
Differences between Spotfire Enterprise Runtime for R and Open-source R
The methods "Lloyd", "Forgy" and "MacQueen", which are available in R, are not yet implemented in Spotfire Enterprise Runtime for R. If one of the unimplemented methods is specified, a warning is issued and the default method ("Hartigan-Wong") is used.
References
Forgy, E. W. 1965. Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics. Volume 21. 768-769.
Hartigan, J. A. and Wong, M. A. 1979. A k-means clustering algorithm. Applied Statistics. Volume 28. 100-108.
Hartigan, J. A. 1975. Clustering Algorithms. New York, NY: Wiley.
Lloyd, S. P. 1982. Technical Note: Least squares quantization in PCM. IEEE Transactions on Information Theory. Bell Laboratories. Volume 28. 128-137.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley, CA: University of California Press. (eds L. M. Le Cam & J. Neyman) Volume 1. pp. 281-297.
See Also
hclust
Examples
kmr <- kmeans(Sdatasets::iris[, c("Sepal.Length", "Sepal.Width",
             "Petal.Length", "Petal.Width")], centers=3)
# result will be somewhat random, since initial estimates for 3 centers
# were randomly chosen from the rows of the input data.frame
table(Sdatasets::iris[, "Species"], kmr$cluster)
kmn <- kmeans(Sdatasets::iris[, 1:4], centers=Sdatasets::iris[c(1,51,101),1:4])
# result will not be random, since initial estimates of centers were given
table(Sdatasets::iris[, "Species"], kmn$cluster)