How to Perform a K-means Clustering


The K-means Clustering tool requires a suitable line chart to base the calculation on. For example, the line chart cannot use multiple Y-axis scales, or an X-axis that is both continuous and binned. The steps below describe how to set up the line chart.

  1. Create a line chart visualization by clicking on the New Line Chart button on the toolbar.

    Comment: The tool uses the lines specified in a line chart to define the data for the calculation.

  2. Make sure that all values that should be included in the calculation are selected on the Y-axis.

  3. If more than one column is selected on the Y-axis, make sure that (Column Names) is selected on the X-axis.

    Comment: (Column Names) is an option that treats the names of the columns selected on the Y-axis as separate categories.

  4. Use Line By, Color By or Trellis By to split the lines according to at least one column, in order to create multiple lines.

    Comment: For examples of how to split lines, see How to Use the Line Chart. If you want to create one line for each individual row, one of these options must be set to a column that uniquely identifies each row. "(Row Number)" is a virtual column that holds the row index of each row and can be used for this purpose.

  5. Select Tools > K-means Clustering....

    Response: The K-means Clustering dialog is displayed.

  6. Make sure that the line chart you just created is selected under Line chart to work on.

  7. Select whether to Create new result column or Update existing result column.

    Comment: Update existing result column is only available if you have previously performed a K-means clustering in this analysis.

  8. Select a Distance measure to use in the calculation.

    Comment: For more information, see Correlation or Euclidean distance.

  9. Specify the Max number of clusters that you wish to create.

    Comment: The actual number of clusters may be smaller than the specified maximum.

  10. Click OK.

    Response: A result column is created, specifying a cluster ID for each individual row (line).

    Comment: Note that the result column is based on a snapshot of the line chart from the moment of performing the calculation and it may become invalid when any additional filtering is applied.
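Example: This page does not describe how the tool computes its result, so the following is only a minimal Python sketch (assuming NumPy) of how a K-means clustering of line profiles can work in general. Each row of `profiles` stands for one line in the line chart and each column for one position on the X-axis. The names `kmeans_lines` and `distances`, the random choice of starting centroids, and the iteration limit are assumptions made for illustration; they are not part of the tool. The sketch also shows one way the actual number of clusters can end up smaller than the specified maximum: clusters that receive no lines are dropped.

```python
import numpy as np


def distances(profiles, centroids, measure):
    """Return the distance from every profile (row) to every centroid."""
    if measure == "euclidean":
        diff = profiles[:, None, :] - centroids[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))
    if measure == "correlation":
        # 1 - Pearson correlation between each profile and each centroid.
        p = profiles - profiles.mean(axis=1, keepdims=True)
        c = centroids - centroids.mean(axis=1, keepdims=True)
        p_norm = np.linalg.norm(p, axis=1, keepdims=True)
        c_norm = np.linalg.norm(c, axis=1, keepdims=True)
        # Flat lines have zero norm; avoid dividing by zero.
        p_norm[p_norm == 0] = 1.0
        c_norm[c_norm == 0] = 1.0
        return 1.0 - (p / p_norm) @ (c / c_norm).T
    raise ValueError("measure must be 'euclidean' or 'correlation'")


def kmeans_lines(profiles, max_clusters, measure="euclidean", n_iter=100, seed=0):
    """Assign a cluster ID (1, 2, ...) to each line profile.

    Hypothetical helper for illustration only, not the tool's algorithm.
    """
    profiles = np.asarray(profiles, dtype=float)
    if max_clusters < 1:
        raise ValueError("max_clusters must be at least 1")
    rng = np.random.default_rng(seed)
    k = min(max_clusters, len(profiles))
    # Assumption: start from k randomly chosen lines as initial centroids.
    centroids = profiles[rng.choice(len(profiles), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        new_labels = distances(profiles, centroids, measure).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        # Drop empty clusters, so fewer than max_clusters may remain.
        used = np.unique(new_labels)
        centroids = np.array([profiles[new_labels == j].mean(axis=0) for j in used])
        labels = np.searchsorted(used, new_labels)
    return labels + 1


# Four lines measured at three X-axis positions: two low, two high.
lines = [[1, 2, 3], [1.1, 2.1, 3.1], [10, 9, 8], [10.2, 9.1, 8.1]]
cluster_ids = kmeans_lines(lines, max_clusters=2, measure="euclidean")
```

Here `cluster_ids` plays the role of the result column: one cluster ID per line. Because the IDs are computed from the profiles as they were passed in, they describe only that snapshot of the data, in the same way as the result column described above.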

Note: When you open an analysis file in which the data has been saved as linked data, any result columns generated by the clustering operation are dynamically re-evaluated based on the new data.

Note: If the input line chart is trellised, the column or expression used to trellis by will be moved to the Line By setting upon running a K-means clustering. This is done in order to keep the original lines in the line chart after presenting the K-means result in trellis panels.

Tip: If you do not want the result column to be overwritten by subsequent clusterings, or re-evaluated when you open an analysis file saved with linked data, you can turn it into a static column: Select Edit > Column Properties, click on the result column to select it, and then click the Freeze Column button in the lower part of the General tab.

See also:

Details on K-means Clustering