K-Means Clustering - MADlib

Team Studio supports the MADlib K-Means Clustering model implementation.

Information at a Glance

Category: Model
Data source type: DB
Sends output to other operators: No
Data processing tool: n/a

Algorithm

Clustering refers to the problem of partitioning a set of objects according to some problem-dependent measure of similarity. In the k-means variant, one is given n points, and the goal is to position k centroids so that the sum of distances between each point and its closest centroid is minimized. Each centroid represents a cluster that consists of all points to which this centroid is closest.

In the most common case, the distance measure used is the square of the Euclidean distance.

This problem is computationally difficult (NP-hard), yet the local-search heuristic proposed by Lloyd [1] performs reasonably well in practice. In fact, it is so ubiquitous today that it is often referred to as the standard algorithm or even just the k-means algorithm. It works as follows:

  1. Seed the k centroids.
  2. Repeat until convergence:
    a. Assign each point to its closest centroid.
    b. Move each centroid to a position that minimizes the sum of distances within its cluster.
  3. Convergence is achieved when no points change their assignments during step 2a.

Because the objective function decreases in every step and there are only finitely many possible assignments, the algorithm is guaranteed to converge, although only to a local optimum.
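
Because MADlib runs in-database, one Lloyd iteration can be written as a single SQL statement. The following is a minimal sketch under illustrative assumptions: a points table pts with an array column coords, a one-row table ctr whose centroids column holds the current centroid matrix as a 2-D array, and MADlib's closest_column function and avg array aggregate as the building blocks.

  -- One Lloyd iteration, sketched in SQL (table and column names are illustrative).
  CREATE TABLE ctr_next AS
  SELECT cluster_id,
         madlib.avg(coords) AS centroid        -- step 2b: move each centroid
  FROM (
      SELECT pts.coords,
             (madlib.closest_column(ctr.centroids, pts.coords)).column_id
                 AS cluster_id                 -- step 2a: assign to the closest centroid
      FROM pts, ctr
  ) AS assigned
  GROUP BY cluster_id;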

More information can be found on the official MADlib documentation page.

Input

A data set that contains an array column with the various attribute values to be used as clustering criteria.
  • Non-numeric variables in the input data set must first be transformed and scaled or normalized before the clustering can take place.
  • Numeric variables can be normalized as well.
  • Depending on the chosen transformations, normalizations, and distance calculations, certain variables may dominate clustering results or may be completely ignored.
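
For example, individual numeric columns can be packed into the required array column with a simple projection; the table and column names here are illustrative only.

  -- Pack numeric attribute columns into a single FLOAT8[] points column.
  CREATE TABLE km_input AS
  SELECT id,
         ARRAY[age, income, tenure]::FLOAT8[] AS points
  FROM customers;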

Restrictions

Output can be sent to a K-Means Predictor operator only.

Configuration

Notes: Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
MADlib Schema Name: The schema where MADlib is installed in the database. MADlib must be installed in the same database as the input data set.

If a madlib schema exists in the database, this parameter defaults to madlib.

Model Output Schema Name: The name of the schema where the output is stored.
Model Output Table Name: The name of the table that is created to store the model. The model output table stores the following columns:

centroids | objective_fn | frac_reassigned | num_iterations

For more information, see the official MADlib k-means documentation.
Drop If Exists:
  • If Yes (the default), drop the existing table of the same name and create a new one.
  • If No, stop the flow and alert the user that an error has occurred.
Centroid Seeding: Centroid seeding can be done in three ways:
  • random (the default) - Selects the initial centroids randomly from the input data set.
  • distributed - Selects the initial centroids through the k-means++ [2] algorithm, which strives to pick centroids far apart from one another.
  • user-defined - You specify a centroid source input table (Centroid Source) and its Centroid Column that holds the initial centroid positions.
Centroid Source: If Centroid Seeding is set to user-defined, set the input table for the centroid source.
Centroid Column: If Centroid Seeding is set to user-defined, set the column, in the table specified by Centroid Source, that contains the initial centroid positions.
Points Column: The column in the input data set that contains the array of attributes for each point.

This column must be of array type.

K: Specifies the number of clusters to create during the cluster analysis process.
  • The default value is 2.
    Note: If the number of clusters, K, is not chosen to match the natural structure of the data, the results are not ideal. The proper way to alleviate this is to experiment with different values for K. In principle, the best K value exhibits the smallest intra-cluster distances and the largest inter-cluster distances. (See the Rudjer Boskovic Institute Data Mining tutorial on clustering at http://dms.irb.hr/tutorial/tut_clustering_short.php for more information.)
  • Therefore, the modeler might want to experiment with various values of K based on the cluster results. For example, more clusters might be needed if the analysis does not converge, or if the groups overlap too much or are too spread out.
Distance Function: Calculates the distance between each cluster member's values and the cluster's centroid (mean) value. The distance can be calculated in the following ways:

Cosine - Measures the cosine of the angle between two vectors. For n-dimensional vectors x and y, it is calculated using the dot product formula as:

  cos(θ) = (x · y) / (‖x‖ ‖y‖) = Σ_i x_i y_i / ( sqrt(Σ_i x_i^2) · sqrt(Σ_i y_i^2) )

Euclidean - The square root of the sum of the squares of the distances along each attribute axis. For n-dimensional vectors x and y, it is calculated as:

  d(x, y) = sqrt( Σ_i (x_i − y_i)^2 )

Manhattan - The Manhattan, or taxicab, metric measures the distance between two points when traveling parallel to the axes. For n-dimensional vectors x and y, it is given by:

  d(x, y) = Σ_i |x_i − y_i|

Squared Euclidean (the default) - The square of the Euclidean distance; it is the sum of the squares of the distances along each attribute axis. For n-dimensional vectors x and y, it is calculated as:

  d(x, y) = Σ_i (x_i − y_i)^2

Tanimoto - Measures dissimilarity between sample sets. It is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union. As with Dice similarity, 0-1 vectors are used to represent sets. The Jaccard similarity of sets A and B, represented by 0-1 vectors x and y, is given by:

  J(x, y) = (x · y) / (‖x‖^2 + ‖y‖^2 − x · y), and the Tanimoto distance is 1 − J(x, y).

User defined - Specified by the user. See User-Defined Distance.

User-Defined Distance: If, for Distance Function, you specify User defined, provide the function.

The modeler might try to start with the default Squared Euclidean method, then experiment with the various other calculation methods to determine whether the cluster results seem more intuitive or provide more business insight.

Aggregation of Centroid: Defines how cluster centroid positions are calculated from the points assigned to the same cluster. The options are average (the default) or normalized average.
Maximum Iterations: The maximum number of iterations to compute. Computation stops when the number of iterations exceeds Maximum Iterations, or when the fraction of reassigned points falls below Min Frac Reassigned.
Min Frac Reassigned: The minimum fraction of reassigned points required for computation to continue. Computation stops when the fraction of points reassigned in an iteration falls below this value, or when the number of iterations exceeds Maximum Iterations.
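
Each distance option corresponds to an ordinary SQL function in the MADlib schema (this mapping is an assumption based on the MADlib k-means API), so the options can be compared directly on sample points. Note that MADlib's dist_angle returns the angle between the vectors rather than its cosine.

  -- Evaluate MADlib's built-in distance functions on two sample points.
  SELECT madlib.dist_norm2(ARRAY[0,0]::FLOAT8[], ARRAY[3,4]::FLOAT8[])         AS euclidean,     -- 5
         madlib.squared_dist_norm2(ARRAY[0,0]::FLOAT8[], ARRAY[3,4]::FLOAT8[]) AS sq_euclidean,  -- 25
         madlib.dist_norm1(ARRAY[0,0]::FLOAT8[], ARRAY[3,4]::FLOAT8[])         AS manhattan,     -- 7
         madlib.dist_angle(ARRAY[1,0]::FLOAT8[], ARRAY[0,1]::FLOAT8[])         AS angle;         -- pi/2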
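
Taken together, the parameters above map onto the arguments of MADlib's k-means training functions. A hedged sketch of the equivalent direct call with random seeding, assuming the kmeans_random signature from the MADlib documentation and the illustrative km_input table from the Input section:

  -- Train k-means with random seeding; the arguments mirror the operator's parameters.
  DROP TABLE IF EXISTS km_model;              -- Drop If Exists = Yes
  CREATE TABLE km_model AS
  SELECT * FROM madlib.kmeans_random(
      'km_input',                   -- input data set
      'points',                     -- Points Column (must be an array type)
      2,                            -- K
      'madlib.squared_dist_norm2',  -- Distance Function (the default)
      'madlib.avg',                 -- Aggregation of Centroid = average
      20,                           -- Maximum Iterations
      0.001                         -- Min Frac Reassigned
  );
  -- The result holds centroids | objective_fn | frac_reassigned | num_iterations.
  -- madlib.kmeanspp(...) and madlib.kmeans(...) cover the distributed and
  -- user-defined seeding options, respectively.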

Output

Visual Output
The K-Means Clustering (MADlib) operator outputs Center Point and Cluster tabs.
Note: To get the assignment of points to each cluster, run the K-Means Predictor (MADlib) operator.

The Center Point results tab displays the cluster centroid positions, the objective function value, the fraction of points reassigned in the last iteration, the total number of iterations, and the simple silhouette coefficient, a popular measure of clustering quality (the closer to 1, the better the clustering).

  • The Cluster results tab displays a graph of the final centroids determined by the K-means analysis.
  • The graph can display only two dimensions at a time.
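
The simple silhouette coefficient shown on the Center Point tab can also be computed directly with MADlib's simple_silhouette function; this sketch assumes the illustrative km_input and km_model tables from the earlier examples.

  -- Average simple silhouette coefficient for the fitted centroids.
  SELECT * FROM madlib.simple_silhouette(
      'km_input',                        -- input data set
      'points',                          -- points column
      (SELECT centroids FROM km_model),  -- fitted centroid matrix
      'madlib.squared_dist_norm2'        -- same distance function used in training
  );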



Data Output
You can send output only to a K-Means Predictor operator.
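
Within the database, the per-point assignment that the K-Means Predictor produces amounts to a closest-centroid lookup; a sketch using the same illustrative tables:

  -- Assign each input point to its closest fitted centroid.
  SELECT km_input.*,
         (madlib.closest_column(centroids, points)).column_id AS cluster_id
  FROM km_input, km_model;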


References

[1] Lloyd, Stuart: Least squares quantization in PCM. Technical Note, Bell Laboratories. Published much later in: IEEE Transactions on Information Theory 28(2), pp. 128-137, 1982.

[2] Arthur, David; Vassilvitskii, Sergei: k-means++: the advantages of careful seeding. Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '07), pp. 1027-1035. http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf