SOM Clustering

This operator implements the self-organizing map (SOM) algorithm, which generates clusters that are spatially aligned according to their similarity.

SOM Clustering icon.png

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Model
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The self-organizing map clustering algorithm is a simple neural network that produces clusters arranged on a grid, so that clusters close to each other on the grid are more similar to each other than clusters that are far apart. The specified columns are used to train the self-organizing map clustering algorithm.
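The training idea can be sketched in a few lines of plain Python. This is a minimal, textbook-style illustration and not the operator's Spark implementation: the function name `train_som`, the linear decay of the learning rate, and the Gaussian neighborhood function are all assumptions made for the example.

```python
import math
import random


def _distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_som(rows, k, iterations=100, seed=1):
    """Train a k x k self-organizing map on a list of numeric rows (sketch)."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # One weight vector (cluster centroid) per grid cell, started from random rows.
    grid = [(r, c) for r in range(k) for c in range(k)]
    weights = [list(rng.choice(rows)) for _ in grid]
    for t in range(iterations):
        # Assumed schedule: learning rate and neighborhood radius shrink linearly.
        frac = 1.0 - t / iterations
        learning_rate = 0.5 * frac
        radius = max(k / 2.0 * frac, 0.5)
        sample = rng.choice(rows)  # one sample per iteration
        # Best-matching unit: the cell whose weights are closest to the sample.
        bmu = min(range(len(grid)), key=lambda i: _distance(weights[i], sample))
        for i, (row, col) in enumerate(grid):
            # Cells near the best-matching unit on the grid move more.
            grid_d2 = (row - grid[bmu][0]) ** 2 + (col - grid[bmu][1]) ** 2
            influence = math.exp(-grid_d2 / (2.0 * radius ** 2))
            for j in range(dim):
                weights[i][j] += learning_rate * influence * (sample[j] - weights[i][j])
    return grid, weights
```

Because neighboring cells are updated together, cells that sit close to each other on the grid end up with similar weight vectors, which is the spatial alignment described above.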

This operator uses the silhouette value to determine the optimal number of clusters. The silhouette metric measures how similar an observation is to its assigned cluster compared to other clusters. Generally, higher silhouette values are preferred.
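As an illustration of the metric, the following sketch computes the mean silhouette for a small labeled data set. It uses the standard definition s = (b - a) / max(a, b), where a is the mean distance to the other members of the observation's own cluster and b is the smallest mean distance to the members of any other cluster. The function name `silhouette_score` and the use of plain Euclidean distance are assumptions for this example; the distance measure that the operator uses is not stated here.

```python
import math


def _distance(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette over all points; expects at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(_distance(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(_distance(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values lie between -1 and 1. A well-separated clustering scores close to 1, while a clustering that mixes distant observations scores near or below 0.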

Input

The input is a single tabular data set.

Bad or Missing Values
Null values are not allowed and result in an error.
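Because null values cause the operator to fail, it can help to locate them before running the workflow. The sketch below scans rows held as Python dictionaries. The helper name `find_null_cells` is hypothetical, and treating NaN as a null value is an assumption of the example.

```python
import math


def find_null_cells(rows, columns):
    """Return (row_index, column_name) for every missing cell in the given columns."""
    missing = []
    for index, row in enumerate(rows):
        for name in columns:
            value = row.get(name)
            # Assumption: both None and floating-point NaN count as null.
            if value is None or (isinstance(value, float) and math.isnan(value)):
                missing.append((index, name))
    return missing
```

An empty result means that the selected columns contain no missing cells under these assumptions.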

Configuration

The following table provides the configuration details for the SOM Clustering operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Use all available columns as Predictors When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors.
Continuous Predictors Specify the numerical data columns for training the self-organizing map model. The columns must be numerical. Click Select Columns to select the required columns.
Number of Clusters Specify the grid dimensions, which control the number of clusters created by the self-organizing map. The following methods are available:
  • A single value such as K1. The value must be greater than 1. For example, K1=2 produces a 2x2 grid (4 clusters).

  • A comma-separated sequence such as K1, K2, K3, and so on. For example, K1=2, K2=3, and K3=4 produce a 2x2 grid (4 clusters), a 3x3 grid (9 clusters), and a 4x4 grid (16 clusters), respectively.

  • A sequence specified by start:end:step. The following conditions must be met or an error is displayed.

    • start must be less than end.

    • step must be less than end - start.

    • start must be equal to or greater than 2.

    For example, start:end:step=2:6:2 generates K1=2, K2=4, and K3=6, which produce a 2x2 grid (4 clusters), a 4x4 grid (16 clusters), and a 6x6 grid (36 clusters), respectively.

  • A sequence specified by start:end. This generates a sequence with the step equal to 1. For example, start:end=2:4 generates K1=2, K2=3, and K3=4, which produce a 2x2 grid (4 clusters), a 3x3 grid (9 clusters), and a 4x4 grid (16 clusters), respectively.

Default: 2
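The accepted formats for Number of Clusters can be illustrated with a small parser. This is a sketch of the rules listed above, not the operator's own code: the function names are hypothetical, and three details are assumptions, namely that the end value is inclusive (as in the 2:6:2 example), that the step must be at least 1, and that the step condition is checked only for the three-part form.

```python
def parse_grid_sizes(spec):
    """Expand a Number of Clusters value into a list of grid dimensions K."""
    spec = spec.strip()
    if ":" in spec:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 2:
            start, end = parts
            step = 1
        elif len(parts) == 3:
            start, end, step = parts
        else:
            raise ValueError("expected start:end or start:end:step")
        if start < 2:
            raise ValueError("start must be equal to or greater than 2")
        if start >= end:
            raise ValueError("start must be less than end")
        if step < 1:
            raise ValueError("step must be at least 1")  # assumption
        if len(parts) == 3 and step >= end - start:
            raise ValueError("step must be less than end - start")
        return list(range(start, end + 1, step))  # end treated as inclusive
    # Single value or comma-separated sequence.
    sizes = [int(p) for p in spec.split(",")]
    if any(k <= 1 for k in sizes):
        raise ValueError("each value must be greater than 1")
    return sizes


def cluster_counts(sizes):
    """A grid dimension K yields a K x K grid, that is K * K clusters."""
    return [k * k for k in sizes]
```

For instance, `cluster_counts(parse_grid_sizes("2:6:2"))` yields the 4, 16, and 36 clusters from the example in the list above.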

Normalize Features Specify whether to normalize the numerical features (Z-Transformation).

Default: Yes
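A Z-transformation rescales each numerical column to a mean of 0 and a standard deviation of 1, so that columns with large ranges do not dominate the distance calculation. The following sketch shows the idea for one column. The function name `z_transform` is hypothetical, and the use of the population standard deviation is an assumption, because the variant that the operator uses is not stated here.

```python
import statistics


def z_transform(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # assumption: population standard deviation
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]
```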

Max Iterations Specify the maximum number of iterations for one run of the self-organizing map clustering algorithm (one iteration per sample).

Default: 100

Tolerance Specify the convergence threshold. The smaller the value, the stricter the criterion for deciding that the analysis has converged. A smaller value results in more iterations of the algorithm, up to the limit set by Max Iterations.

Default: 1.0E-4
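The interplay between Max Iterations and Tolerance can be illustrated with a generic convergence loop. This is only a sketch: the helper `run_until_converged` and the toy update function are hypothetical, and the quantity that the operator actually compares against the tolerance is not stated here.

```python
def run_until_converged(update, state, max_iterations=100, tolerance=1.0e-4):
    """Apply `update` until the state changes by less than `tolerance`
    or `max_iterations` is reached. Returns (state, iterations_used)."""
    iterations_used = 0
    for _ in range(max_iterations):
        new_state = update(state)
        movement = abs(new_state - state)
        state = new_state
        iterations_used += 1
        if movement < tolerance:
            break  # converged before reaching the cap
    return state, iterations_used


def halve(x):
    """Toy update whose change shrinks by half on every call."""
    return x / 2.0


_, loose = run_until_converged(halve, 1.0, tolerance=1.0e-4)
_, strict = run_until_converged(halve, 1.0, tolerance=1.0e-8)
_, capped = run_until_converged(halve, 1.0, max_iterations=10, tolerance=1.0e-8)
```

In this toy run, the stricter tolerance needs more iterations than the looser one, and the third call stops at the cap of 10 iterations before its tolerance is met.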

Random Seed Specify the seed used for the pseudo-random row extraction.

Default: 1

Output

Visual Output
  • Parameter Summary Info: Displays information about the input parameters and their current settings.
  • Training Summary: A text field that displays the training summary.
    • Silhouette: A measure of how similar an observation is to its assigned cluster compared to other clusters.

    • Training Cost: The sum of specified distances to the nearest centroid for all points in the training data set.

Output to Successive operator
A model object that can be used with a Predictor operator. To perform the clustering against a data set, the SOM Clustering operator must be succeeded by a Predictor operator. Two additional columns are produced in the Predictor operator.
  • PRED_SOM: Specifies the cluster to which an observation belongs.

  • DIST_SOM: The distance between the cluster centroid and the observation.

The model object cannot be used with any Model Validation operators.
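The meaning of the PRED_SOM and DIST_SOM columns can be sketched as a nearest-centroid lookup. The function name `assign_cluster`, the Euclidean distance, and the zero-based cluster numbering are assumptions of this example; the labels that the Predictor operator actually emits may differ.

```python
import math


def assign_cluster(observation, centroids):
    """Return the nearest centroid's index and the distance to it."""
    distances = [
        math.sqrt(sum((x - c) ** 2 for x, c in zip(observation, centroid)))
        for centroid in centroids
    ]
    nearest = min(range(len(centroids)), key=distances.__getitem__)
    # Column names follow the Predictor output described above.
    return {"PRED_SOM": nearest, "DIST_SOM": distances[nearest]}
```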

Example

The following example illustrates the SOM Clustering operator.

SOM Clustering workflow.png
Data
golf: This data set contains the following information:
  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
  • Use all available columns as Predictors: No

  • Continuous Predictors: temperature, humidity

  • Number of Clusters: 2

  • Normalize Features: Yes

  • Max Iterations: 100

  • Tolerance: 1.0E-4

  • Random Seed: 1

Results
The following figures display the output for the golf data set with the preceding parameter settings.
Parameter Summary Info
SOM Clustering output - Parameter Summary Info tab.png
Training Summary
SOM Clustering output - Training Summary tab.png