ANOMALY_IF: Detecting Outliers

ANOMALY_IF detects outliers using an Isolation Forest. An Isolation Forest uses decision trees to randomly and recursively split the space spanned by the observations X0, X1, ... into hyper-rectangles, each containing one or a small number of samples. An outlier can be isolated in its own hyper-rectangle with fewer splits than a sample that lies close to many other samples. The number of splits needed to reach a sample therefore determines the anomaly score of the sample.
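The isolation idea can be illustrated with a short Python sketch. This is not the ANOMALY_IF implementation; it is a simplified illustration that uses only the standard library, and the names isolation_depth and mean_depth, the synthetic data, and the depth cap are all assumptions made for the example. It counts the random axis-aligned splits needed to separate one sample from the rest and averages that count over several random trees.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random axis-aligned splits needed to isolate `point` within `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))            # pick a random predictor
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:                               # simplification: stop if this predictor cannot split
        return depth
    split = rng.uniform(lo, hi)                # pick a random split value
    # Keep only the samples on the same side of the split as the point.
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return isolation_depth(point, side, rng, depth + 1, max_depth)

def mean_depth(point, data, trees=50, seed=0):
    """Average the isolation depth over several random trees."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trees)) / trees

# Synthetic data (an assumption for illustration): one dense cluster plus one distant sample.
gen = random.Random(42)
cluster = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

outlier_depth = mean_depth(outlier, data)
typical_depth = sum(mean_depth(p, data) for p in cluster[:20]) / 20
print(outlier_depth, typical_depth)
```

The distant sample is isolated after far fewer splits on average than the samples inside the cluster, which is the property the anomaly score is derived from.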

Syntax: How to Calculate an Anomaly Score

ANOMALY_IF('options', predictor_field1[, predictor_field2, ...])

where:

'options'

Is a dictionary of advanced parameters that control the model attributes, enclosed in single quotation marks. Most of these parameters have a default value, so you can omit them from the request if you want to use the defaults. The single quotation marks are required even when you specify no advanced parameters. The format of the advanced parameter dictionary is:

'{"parm_name1": "parm_value1", ... ,"parm_namei": "parm_valuei"}'

The following advanced parameters are supported:

"trees"

Is the number of decision trees in the forest. Allowed values are integers greater than 10. The default value is "100".

"score"

Defines the type of value returned by the function. Valid values are "binary" and "grade". If score is "binary", the function returns -1 for anomalous samples and +1 for normal samples. If score is "grade", the function returns a continuous anomaly score between -1.0 and 1.0; the more negative the score, the more of an outlier the sample is. The default value is "binary".

"max_samples"

Is the fraction of the rows in the training set used per tree. Allowed values are fractions between 0 (zero) and 1. The default value is "0.5".

"contamination"

Only applies when score is "binary". Is the fraction of the samples in the training set that will be marked anomalous. Allowed values are fractions between 0 (zero) and 0.5. The default value is "0.1".

"train_ratio"

Is a value between 0 and 1 that specifies the fraction of data used for training the model. The default value is "1.0".

predictor_field1[, predictor_field2, ...]

Numeric

Are one or more predictor field names.
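When requests are generated by a script, the options string can be assembled and range-checked before it is placed in the request. The following Python sketch is one way to do that. The helper build_anomaly_if_options is hypothetical and is not part of the product; the checks follow the allowed values listed above, and treating the upper bounds (1 and 0.5) as inclusive and 0 as exclusive is an assumption. The sketch refuses to guess the form of an empty dictionary, because the format for a request with no advanced parameters is not shown here.

```python
import json

def build_anomaly_if_options(trees=None, score=None, max_samples=None,
                             contamination=None, train_ratio=None):
    """Hypothetical helper: build the quoted 'options' argument for ANOMALY_IF."""
    opts = {}
    if trees is not None:
        if not isinstance(trees, int) or trees <= 10:
            raise ValueError("trees must be an integer greater than 10")
        opts["trees"] = str(trees)
    if score is not None:
        if score not in ("binary", "grade"):
            raise ValueError('score must be "binary" or "grade"')
        opts["score"] = score
    if max_samples is not None:
        if not 0 < max_samples <= 1:           # boundary handling is an assumption
            raise ValueError("max_samples must be a fraction between 0 and 1")
        opts["max_samples"] = str(max_samples)
    if contamination is not None:
        if not 0 < contamination <= 0.5:       # boundary handling is an assumption
            raise ValueError("contamination must be a fraction between 0 and 0.5")
        opts["contamination"] = str(contamination)
    if train_ratio is not None:
        if not 0 < train_ratio <= 1:           # boundary handling is an assumption
            raise ValueError("train_ratio must be between 0 and 1")
        opts["train_ratio"] = str(train_ratio)
    if not opts:
        raise ValueError("this sketch does not guess the form of an empty dictionary")
    # Every value is written as a quoted string, matching the documented format.
    return "'" + json.dumps(opts, separators=(",", ":")) + "'"

options = build_anomaly_if_options(trees=123, score="binary", contamination=0.2)
print(options)  # '{"trees":"123","score":"binary","contamination":"0.2"}'
```

The printed string can be pasted as the first argument of ANOMALY_IF, followed by a comma and the predictor field names.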

Example: Detecting Outliers Using ANOMALY_IF

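The following procedure uses ANOMALY_IF in binary mode to detect outliers, with the predictors horsepower, peak RPM, city MPG, highway MPG, and price. Outliers are identified by the return value -1.00. Because contamination is set to 0.2, about 20% of the samples in the training set are marked anomalous.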

TABLE FILE imports85
PRINT horsepower peakRpm cityMpg highwayMpg price
COMPUTE AnomalyBinaryScore/D5.2 = ANOMALY_IF('{"trees":"123","score":"binary","contamination":"0.2"}',
               horsepower, peakRpm, cityMpg, highwayMpg, price);
ON TABLE SET PAGE NOLEAD
ON TABLE SET STYLE *
GRID=OFF,$
ENDSTYLE
END

Partial output is shown in the following image.

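The following version of the request uses grade mode with the same advanced parameters and predictors. According to the parameter descriptions, contamination applies only when score is "binary", so it is not expected to play a role in this request.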

TABLE FILE imports85
PRINT horsepower peakRpm cityMpg highwayMpg price
COMPUTE AnomalyGradeScore/D5.2 = ANOMALY_IF('{"trees":"123","score":"grade","contamination":"0.2"}',
                                              horsepower, peakRpm, cityMpg, highwayMpg, price);
ON TABLE SET PAGE NOLEAD
ON TABLE SET STYLE *
GRID=OFF,$
ENDSTYLE
END

Partial output is shown in the following image.