Outliers in Numerical Data
Your numerical data might include data that lies outside of a specified percentile threshold.
You can use the Replace Outliers operator (available for both database and Hadoop) to manage those outliers. For details about using the Replace Outliers operator, see the reference topics for your data source type (Replace Outliers (DB) and Replace Outliers (HD)).
Replace Outliers replaces all of the values above and below the specified percentile threshold with the maximum/minimum value within that threshold. Rather than providing an absolute minimum and maximum value to use as the thresholds, you can choose a lower percentile (Lower Boundary %) and upper percentile (Upper Boundary %).
For example, consider the following data set.
StringColumn | Column1 | Column2 |
---|---|---|
first | 0 | 5 |
second | 10 | 5 |
third | 20 | 5 |
fourth | 30 | 5 |
fifth | 40 | 10 |
sixth | 50 | 5 |
seventh | 60 | 8 |
eighth | 70 | 8 |
ninth | 80 | 9 |
tenth | 90 | 5 |
If you set Lower Boundary (%) = 30.0 and Upper Boundary (%) = 90.0, the output is as follows.
StringColumn | Column1 | Column2 |
---|---|---|
first | 0̶ 20 | 5 |
second | 1̶0 20 | 5 |
third | 20 | 5 |
fourth | 30 | 5 |
fifth | 40 | 1̶0 9 |
sixth | 50 | 5 |
seventh | 60 | 8 |
eighth | 70 | 8 |
ninth | 80 | 9 |
tenth | 9̶0 80 | 5 |
For Column1, the ninth element is 80, and the third element is 20, so all the numbers above 80 and below 20 are replaced with those values. Because there are ten elements, the lower threshold is the third element (30% of 10 = 3) and the upper-threshold ninth element is the upper threshold because (90% of 10 = 9).
For Column2 (which is not sorted), the ninth element is 9. Thus, everything greater than 9 is replaced. The third element is 5. So everything lower than 5 (in this case, nothing) is replaced.
Now that you understand how replacing outliers works, try using the Replace Operators operator (either DB or HD).