Technical Notes: The k-Means Algorithm

The k-means algorithm is a clustering method that partitions a collection of objects into k groups according to a specified distance measure. The fundamental algorithm iterates between two steps.

  1. Compute cluster centroids

    a. The initial centroids are set up using the method specified by the user. Three initialization methods are available: choose k observations that maximize the initial distances between centers, choose k observations at random, or choose the first k observations.
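
The three initialization options might be sketched as follows. This is a hypothetical helper, not the implementation described in the notes: the function and method names are illustrative, and the greedy `maxdist` strategy is one plausible reading of "maximize the initial distance", applied to numeric data only.

```python
import random

def init_centroids(data, k, method="random", seed=0):
    """Pick k initial centroids from the observations.

    method: "first"   - the first k observations,
            "random"  - k observations chosen at random,
            "maxdist" - greedily pick observations far from the
                        centroids already chosen (one plausible
                        reading of "maximize the initial distance").
    """
    if method == "first":
        return [list(row) for row in data[:k]]
    if method == "random":
        rng = random.Random(seed)
        return [list(row) for row in rng.sample(data, k)]
    # "maxdist": start from the first observation, then repeatedly add
    # the observation farthest from its nearest chosen centroid.
    centroids = [list(data[0])]
    while len(centroids) < k:
        def nearest_sq_dist(row):
            return min(sum((a - b) ** 2 for a, b in zip(row, c))
                       for c in centroids)
        centroids.append(list(max(data, key=nearest_sq_dist)))
    return centroids
```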

    b. After all objects have been assigned to their nearest centroids, compute the new centroids from the members assigned to each cluster. For a continuous variable, the centroid value is the mean of the values of all members assigned to that cluster. For a categorical variable, the centroid value is the first mode of the members' values.
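
The mean/first-mode centroid update can be sketched as below (the function name is illustrative; `statistics.mode` returns the first mode encountered for multimodal data in Python 3.8+, matching the "first mode" rule):

```python
from statistics import mean, mode

def update_centroid(members, is_categorical):
    """Recompute one centroid from its assigned members.

    is_categorical[i] is True when variable i is categorical.
    Continuous variables use the mean of the members' values;
    categorical variables use the first mode.
    """
    centroid = []
    for i, cat in enumerate(is_categorical):
        values = [m[i] for m in members]
        centroid.append(mode(values) if cat else mean(values))
    return centroid
```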

  2. Assign each object to the nearest centroid

    a. The nearest centroid is determined using the specified distance measure. All continuous variable values are normalized before the calculation. In the formulas below, c denotes a centroid and x an observation.

    i. If the i-th variable is categorical, (xi - ci) equals 0 if the two values are the same, and 1 otherwise.

    ii. If the i-th variable is continuous, xi and ci are first normalized using the minimum and maximum values of that variable.

    Euclidean distance:

    distance( x, c ) = { Σi (xi - ci)^2 }^(1/2)

    Squared Euclidean distance:

    distance( x, c ) = Σi (xi - ci)^2

    City-block (Manhattan) distance:

    distance( x, c ) = Σi |xi - ci|

    Chebychev distance:

    distance( x, c ) = maxi |xi - ci|
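
The per-variable differences and the four distance measures above can be sketched as follows (a minimal sketch; `mins` and `maxs` are assumed to be precomputed per-variable minima and maxima over the whole data set, and the function and parameter names are illustrative):

```python
def component_diffs(x, c, is_categorical, mins, maxs):
    """Per-variable differences (xi - ci): a 0/1 mismatch for
    categorical variables, a min-max-normalized difference for
    continuous ones."""
    diffs = []
    for i, cat in enumerate(is_categorical):
        if cat:
            diffs.append(0.0 if x[i] == c[i] else 1.0)
        else:
            span = (maxs[i] - mins[i]) or 1.0  # avoid division by zero
            xi = (x[i] - mins[i]) / span
            ci = (c[i] - mins[i]) / span
            diffs.append(xi - ci)
    return diffs

def distance(x, c, is_categorical, mins, maxs, measure="euclidean"):
    """Distance between observation x and centroid c under one of the
    four measures defined above."""
    d = component_diffs(x, c, is_categorical, mins, maxs)
    if measure == "euclidean":
        return sum(v * v for v in d) ** 0.5
    if measure == "squared_euclidean":
        return sum(v * v for v in d)
    if measure == "city_block":
        return sum(abs(v) for v in d)
    if measure == "chebychev":
        return max(abs(v) for v in d)
    raise ValueError(measure)
```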

    b. If no observation changed its cluster assignment during this iteration, or if the number of iterations has reached the user-specified maximum, stop iterating. Otherwise, recompute the centroids and repeat. The final centroids and assignments give the resulting clustering.
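
Putting the two steps together, a minimal numeric-only loop might look like this (a sketch, not the implementation described here: "first k" initialization and unnormalized squared Euclidean distance are chosen for brevity, and categorical variables are omitted):

```python
def k_means(data, k, max_iter=100):
    """Minimal numeric-only k-means: assign each row to its nearest
    centroid, recompute centroids as column means, and stop when no
    assignment changes or max_iter is reached."""
    centroids = [list(row) for row in data[:k]]  # "first k" init
    assignments = [None] * len(data)
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid
        # (squared Euclidean distance; ties go to the lower index).
        new_assignments = [
            min(range(k), key=lambda j: sum((a - b) ** 2
                for a, b in zip(row, centroids[j])))
            for row in data
        ]
        if new_assignments == assignments:  # no object moved: converged
            break
        assignments = new_assignments
        # Step 1b: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [row for row, a in zip(data, assignments) if a == j]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[j] = [sum(col) / len(col)
                                for col in zip(*members)]
    return centroids, assignments
```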

For more information, see also the Introductory Overview.