Outliers in Numerical Data

Your numerical data might include data that lies outside of a specified percentile threshold.

You can use the Replace Outliers operator (available for both database and Hadoop) to manage those outliers. For details about using the Replace Outliers operator, see the reference topics for your data source type (Replace Outliers (DB) and Replace Outliers (HD)).

Replace Outliers replaces all of the values above and below the specified percentile threshold with the maximum/minimum value within that threshold. Rather than providing an absolute minimum and maximum value to use as the thresholds, you can choose a lower percentile (Lower Boundary %) and upper percentile (Upper Boundary %).

For example, consider the following data set.

StringColumn	Column1	Column2
first	0	5
second	10	5
third	20	5
fourth	30	5
fifth	40	10
sixth	50	5
seventh	60	8
eighth	70	8
ninth	80	9
tenth	90	5

If you set Lower Boundary (%) = 30.0 and Upper Boundary (%) = 90.0, the output is as follows.

StringColumn	Column1	Column2
first	0̶ 20	5
second	1̶0 20	5
third	20	5
fourth	30	5
fifth	40	1̶0 9
sixth	50	5
seventh	60	8
eighth	70	8
ninth	80	9
tenth	9̶0 80	5

For Column1, the ninth element is 80, and the third element is 20, so all the numbers above 80 and below 20 are replaced with those values. Because there are ten elements, the lower threshold is the third element (30% of 10 = 3) and the upper-threshold ninth element is the upper threshold because (90% of 10 = 9).

For Column2 (which is not sorted), the ninth element is 9. Thus, everything greater than 9 is replaced. The third element is 5. So everything lower than 5 (in this case, nothing) is replaced.

Now that you understand how replacing outliers works, try using the Replace Operators operator (either DB or HD).

Related concepts

Aggregation Methods for Batch Aggregation

Key-Value Pairs Parsing Example using the Variable Operator

datetime Format Conversion Examples

Contents

Index

Search Results

Outliers in Numerical Data