Replace Outliers (DB)

Reduces the range of values in numeric columns by replacing values in the low and high tails of each column's distribution.

Information at a Glance

Category: Transform
Data source type: DB
Sends output to other operators: Yes
Data processing tool: DB

For more information about how the Replace Outliers operator works, see Outliers in Numerical Data.

Note: The Replace Outliers (DB) operator is for database data only. For Hadoop data, use the Replace Outliers (HD) operator.

Input

This operator works for tabular data sets. The transformation function can be applied only to numeric columns, and the type of the numeric input columns is preserved in the output.

Bad or Missing Values
Any row that contains dirty data, such as a string in a numeric column, is removed as the data is read in. After the data is read in, the operator removes every row that contains a null value in any of the selected numeric columns. Rows that have null values only in columns that were not selected are kept. The removed rows are reported in the Summary tab. If the Write Null Data to File parameter is set to yes, the rows removed because of null data are written to an external file, whose location is reported in the Summary tab.
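The null-handling rule can be illustrated with a minimal Python sketch. It is not the operator's implementation, which runs in the database. The function name drop_null_rows and the sample columns are hypothetical, and nulls are modeled as None.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def drop_null_rows(rows: List[Row], selected: List[str]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (kept, removed).

    A row is removed only if it has a null (None) in one of the
    selected columns; nulls in other columns are ignored.
    """
    kept: List[Row] = []
    removed: List[Row] = []
    for row in rows:
        if any(row[col] is None for col in selected):
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


# Hypothetical sample data: "note" is not a selected column.
rows = [
    {"id": 1, "price": 10.0, "qty": 3, "note": None},
    {"id": 2, "price": None, "qty": 5, "note": "x"},
    {"id": 3, "price": 7.5, "qty": None, "note": "y"},
]
kept, removed = drop_null_rows(rows, ["price", "qty"])
```

In this sketch, the row with id 1 is kept even though its unselected note column is null, while the rows with ids 2 and 3 are removed because a selected column is null.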

Restrictions

Any data set with numeric columns can be used. This operator slows down as the number of selected columns and the cardinality of those columns increase.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Columns The numeric columns to transform.
Lower Boundary (%) A double that represents the percentage of values in the left tail of the distribution (on the low end of the range in each column) to replace.

The lower threshold x is calculated as lower boundary formula.

Upper Boundary (%) A double that represents the percentage of values in the right tail of the original distribution for each column (the high end of the range in each column) to replace.

The upper threshold y is calculated as upper boundary formula.

Output Type
  • TABLE outputs a database table. Specifying TABLE enables Storage Parameters.
  • VIEW outputs a database view.
Output Schema The schema for the output table or view.
Output Table The table path and name where the results are output. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Drop If Exists Specifies whether to overwrite an existing table.
  • Yes - If a table with the same name exists, it is dropped before the results are stored.
  • No - If a table with the same name exists, the results window shows an error message.
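The effect of the Lower Boundary (%) and Upper Boundary (%) parameters can be illustrated with a minimal Python sketch. It does not reproduce the operator's exact threshold formulas. It assumes that the thresholds x and y are nearest-rank percentiles of each column, and that values beyond a threshold are replaced with the threshold itself. The function names nearest_rank and replace_outliers are hypothetical.

```python
import math
from typing import List, Sequence, Union

Number = Union[int, float]


def nearest_rank(sorted_vals: Sequence[Number], pct: float) -> Number:
    """Nearest-rank percentile of an already sorted, non-empty sequence (assumed method)."""
    n = len(sorted_vals)
    k = math.ceil(pct * n / 100)   # 1-based rank
    k = min(max(k, 1), n)          # keep the rank inside the data
    return sorted_vals[k - 1]


def replace_outliers(values: List[Number], lower_pct: float, upper_pct: float) -> List[Number]:
    """Clamp values to [x, y], where x and y are the assumed thresholds.

    lower_pct: percentage of values in the left tail to replace.
    upper_pct: percentage of values in the right tail to replace.
    """
    s = sorted(values)
    x = nearest_rank(s, lower_pct)          # assumed lower threshold
    y = nearest_rank(s, 100 - upper_pct)    # assumed upper threshold
    # min/max return one of the original objects, so an integer
    # column stays integer and a float column stays float.
    return [min(max(v, x), y) for v in values]


column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
result = replace_outliers(column, lower_pct=20, upper_pct=10)
```

Under these assumptions, the extreme value 100 is pulled in to the upper threshold, the lowest value is raised to the lower threshold, and the numeric type of the column is unchanged.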

Output

Visual Output
The operator has two tabs of output:
  • Output: A table with the outlier values replaced, as detailed above. This is the data that is passed on to the next operator.
  • Summary: A description of the input data, the parameters that were selected, and the rows removed due to null data. It also shows where the results are stored.

Data Output
The operator outputs the same tabular data set as the input data, but with some of the values in the selected numeric columns replaced. See Outliers in Numerical Data for more information.