Isolation Forest

This operator applies the Isolation Forest unsupervised outlier detection algorithm to the input data set. The implementation of the Isolation Forest algorithm is provided by the open-source library from LinkedIn.

Isolation Forest operator icon

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter Description
Category Model
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

Isolation Forest is an unsupervised learning algorithm for anomaly detection that works by isolating potential anomalies in a data set. It is built on an ensemble of decision trees, and each tree is built from randomly selected features and samples. In principle, the observations that differ most from the rest are isolated with fewer splits and therefore end up closer to the root. Path length thus serves as a measure of normality: the anomaly score returned by the algorithm is a decreasing function of the average path length over the forest of decision trees, so shorter average paths produce higher scores.
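
The relationship between average path length and anomaly score can be sketched as follows. This is a minimal illustration that assumes the normalization from the original Isolation Forest formulation, score = 2^(-E[h(x)] / c(n)), where c(n) is the expected path length for n training samples. The function names are hypothetical, and the operator's internal implementation may differ in detail.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """Expected path length c(n) used to normalize scores (assumed formula)."""
    if n < 2:
        raise ValueError("at least two samples are required")
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_path_length: float, n: int) -> float:
    """Map the mean path length over all trees to a score in (0, 1]."""
    return 2.0 ** (-mean_path_length / average_path_length(n))


# An observation isolated after few splits scores higher than one
# that needs many splits.
short_path_score = anomaly_score(2.0, 256)
long_path_score = anomaly_score(12.0, 256)
```

Under this formula, an observation whose mean path length equals c(n) receives a score of 0.5, and scores approach 1 as the mean path length approaches 0.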

The specified columns are used to train the Isolation Forest anomaly detection model, and the selected categorical columns are converted to numerical features by one-hot encoding.
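
One-hot encoding of a categorical column can be sketched as follows. This is an illustration only: the helper name and the sample values are hypothetical, and the operator's encoder may order the categories differently.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1  # exactly one indicator is set per row
        encoded.append(row)
    return categories, encoded


# Hypothetical values for a categorical column such as "outlook".
outlook = ["sunny", "overcast", "rain", "sunny"]
categories, encoded = one_hot_encode(outlook)
```

Each category becomes its own numerical column, so a categorical column with k distinct values contributes k features to the model in this sketch.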

Input

The input is a single tabular data set.

Columns containing Dates or Times
Input variables that contain date or date/time values should not be entered as string variables. They must be converted to numerical values; otherwise, they are ignored by the Isolation Forest operator. The following methods can be used to convert dates into numbers:
  1. Generating an integer timestamp (for example, the number of seconds since 1 January 1970).

  2. Extracting components such as day of month, month, and year, wherever appropriate.
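
Both conversion methods can be sketched in a preprocessing step before the data reaches the operator. The helper names below are hypothetical, and treating values without a time zone as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_epoch_seconds(ts: datetime) -> int:
    """Method 1: integer number of seconds since 1 January 1970."""
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return int(ts.timestamp())


def to_date_parts(ts: datetime) -> dict:
    """Method 2: extract numerical components from the date."""
    return {"day_of_month": ts.day, "month": ts.month, "year": ts.year}


sample = datetime(2024, 3, 15, 12, 30, 0)
epoch_seconds = to_epoch_seconds(sample)
date_parts = to_date_parts(sample)
```

The timestamp keeps the ordering of events in a single column, while extracted components can expose calendar patterns such as month-end effects.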

Bad or Missing Values
Null values are not allowed and result in an error.
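
Because null values cause an error, it can help to check the predictor columns before running the operator. The following is a minimal sketch with a hypothetical helper and hypothetical sample rows; it only locates the affected rows and leaves the choice between removing and imputing them to the user.

```python
def rows_with_nulls(rows, columns):
    """Return the indexes of rows that have None in any of the given columns."""
    return [
        i
        for i, row in enumerate(rows)
        if any(row.get(column) is None for column in columns)
    ]


# Hypothetical sample rows; None represents a null value.
rows = [
    {"temperature": 85, "outlook": "sunny"},
    {"temperature": None, "outlook": "rain"},
    {"temperature": 70, "outlook": None},
]
bad_rows = rows_with_nulls(rows, ["temperature", "outlook"])
clean_rows = [row for i, row in enumerate(rows) if i not in bad_rows]
```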

Configuration

The following table provides the configuration details for the Isolation Forest operator.

Note: A column that contains a unique value in each row should not be used as a Predictor.
Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Use all available columns as Predictors When set to Yes, the operator uses all the available columns as predictors and ignores the Continuous Predictors and Categorical Predictors parameters. When set to No, the user must select at least one of the Continuous or Categorical Predictors.
Continuous Predictors Specify the numerical data columns to use as independent columns. Only numerical columns can be selected. Click Select Columns to select the required columns.
Note: The columns selected in the Categorical Predictors parameter are not available.
Categorical Predictors Specify the categorical data columns as independent columns. Click Select Columns to select the required columns.
Note: The columns selected in the Continuous Predictors parameter are not available.
Number of Estimators Specify the number of trees or estimators.

Default: 100

Apply Bootstrap Specify whether to sample each tree with replacement. If Yes, draw a sample for each tree with replacement. If No, do not sample with replacement.

Default: No

Fraction/ Number of Samples Specify the number of samples used to train each tree. A value greater than 0.0 and up to 1.0 (including the default of 1.0) is treated as a fraction of the available samples. A value greater than 1.0 is treated as a count.

Default: 1.0

Fraction/ Number of Features Specify the number of features used to train each tree. A value greater than 0.0 and up to 1.0 (including the default of 1.0) is treated as a fraction of the available features. A value greater than 1.0 is treated as a count.

Default: 1.0

Contamination Specify the expected fraction of outliers in the training data set. Setting the value to 0.0 speeds up training, and all predicted labels are then false. The model and outlier scores are otherwise unaffected by this parameter.

Default: 0.1

Contamination Error (Advanced)

The acceptable error when calculating the threshold required to achieve the specified contamination fraction. When the value is 0.0, it forces an exact calculation of the threshold. The exact calculation is slow and can fail for large data sets. If there are issues with the exact calculation, a good choice for this parameter is often 1% of the specified contamination value.

Default: 1.0E-4

Random Seed Specify the seed used for the pseudo-random row extraction.

Default: 1
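
The fraction-or-count rule used by the Fraction/ Number of Samples and Fraction/ Number of Features parameters can be sketched as follows. The helper name is hypothetical, and the rounding behavior (truncation, with a minimum of one) is an assumption rather than documented behavior of the operator.

```python
def resolve_per_tree_count(value: float, total: int) -> int:
    """Interpret a parameter value as a fraction of total or as a count."""
    if value <= 0.0:
        raise ValueError("value must be greater than 0.0")
    if value <= 1.0:
        # Fraction of the available rows or features (assumed to truncate).
        return max(1, int(value * total))
    # Values above 1.0 are taken as an absolute count.
    return int(value)


all_rows = resolve_per_tree_count(1.0, 14)      # default: use everything
half_rows = resolve_per_tree_count(0.5, 14)     # fraction
fixed_rows = resolve_per_tree_count(256, 1000)  # count
```

With the default of 1.0, every tree is trained on all available rows and all available features.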

Output

Visual Output
  • Parameters Summary Info: Displays information about the input parameters and their current settings.
  • Training Summary: Displays a table containing the data for normality count, anomaly count, and cutoff value.

    The cutoff value is the outlier score threshold or the minimum score of the anomalous observations. The observations with outlier scores greater than the cutoff value are considered anomalous.
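
How a cutoff value can follow from the Contamination setting is sketched below. This corresponds to an exact threshold calculation and is an illustration only: the helper names and sample scores are hypothetical, the rounding of the anomaly count is an assumption, and the comparison is written as inclusive so that the observation holding the minimum anomalous score is itself labeled anomalous.

```python
import math


def cutoff_for_contamination(scores, contamination):
    """Return the lowest score among the top contamination fraction of scores."""
    if contamination <= 0.0:
        return None  # no threshold: every label is 0
    # Assumption: round the anomaly count down, but flag at least one row.
    n_anomalies = max(1, math.floor(contamination * len(scores)))
    ranked = sorted(scores, reverse=True)
    return ranked[n_anomalies - 1]


def label_observations(scores, cutoff):
    """Return 1 for anomalous observations and 0 otherwise."""
    if cutoff is None:
        return [0] * len(scores)
    return [1 if score >= cutoff else 0 for score in scores]


# Hypothetical outlier scores for ten observations.
scores = [0.40, 0.42, 0.45, 0.71, 0.43, 0.41, 0.44, 0.46, 0.39, 0.68]
cutoff = cutoff_for_contamination(scores, 0.2)
labels = label_observations(scores, cutoff)
```

In this sketch the anomaly count in the Training Summary would be the number of labels equal to 1, and the normality count would be the remainder.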

Output to successive operators
The operator outputs a model object that can only be used with a Predictor operator. To apply the model to a data set, the Isolation Forest operator must be followed by a Predictor operator. The Predictor operator produces the following additional columns.
  • PRED_ISF: Specifies whether an observation is an anomaly. If the value is 1, it is an anomaly and if the value is 0, it is not an anomaly.

  • CONF_ISF: Contains the anomaly score of the observation.

The model object cannot be used with any Model Validation operators.

Example

The following example demonstrates the Isolation Forest operator.

Isolation Forest example
Data
golf: This data set contains the following information:
  • Five columns: outlook, temperature, wind, humidity, and play.
  • 14 rows.
Parameter Setting
The parameter settings for the golf data set are as follows:
  • Use all available columns as Predictors: No

  • Continuous Predictors: temperature

  • Categorical Predictors: outlook, wind

  • Number of Estimators: 100

  • Apply Bootstrap: No

  • Fraction/ Number of Samples: 1.0

  • Fraction/ Number of Features: 1.0

  • Contamination: 0.1

  • Contamination Error (Advanced): 1.0E-4

  • Random Seed: 1.0

Results
These figures display the results for the parameter settings for the golf data set.
Parameters Summary Info
Isolation Forest operator - Parameters Summary Info
Training Summary
Isolation Forest operator - Training Summary