Replace Outliers (HD)

Information at a Glance

Parameter	Description
Category	Transform
Data source type	HD
Send output to other operators	Yes
Data processing tool	Spark

For more information about how the Replace Outliers operator works, see Outliers in Numerical Data.

Note: The Replace Outliers (HD) operator is for Hadoop data only. For database data, use the Replace Outliers (DB) operator.

Input

This operator works for tabular data sets on HDFS. The transformation function can be applied only to numeric columns, and the type of the numeric input columns is preserved in the output.

Bad or Missing Values

Any row that contains dirty data, such as a string in a numeric column, is removed as the data is read in. After the data is read in, the operator filters out all rows that contain null values in the selected numeric columns. Rows that have null values only in columns that are not selected are not removed. The removed rows are reported in the Summary tab. If the Write Null Data To File parameter is set to one of the options that write null rows, the rows removed because of null data are written to an external file, and the location of that file is reported in the Summary tab.
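The row-removal rules in this section can be illustrated with a minimal Python sketch. It is not the operator's Spark implementation. The names clean_rows and parse_numeric are hypothetical, rows are modeled as dictionaries, and treating None or an empty string as a null value is an assumption made for the example.

```python
from typing import Any, Dict, List, Optional, Sequence, Tuple

Row = Dict[str, Any]


def parse_numeric(value: Any) -> Tuple[bool, Optional[float]]:
    """Return (ok, parsed). Assumption: None or "" is null; unparseable text is dirty."""
    if value is None or value == "":
        return True, None
    try:
        return True, float(value)
    except (TypeError, ValueError):
        return False, None


def clean_rows(
    rows: Sequence[Row],
    numeric_columns: Sequence[str],
    selected_columns: Sequence[str],
) -> Tuple[List[Row], List[Row], int]:
    """Drop dirty rows, then drop rows with nulls in the selected columns.

    Returns (kept rows, rows removed for null data, count of dirty rows).
    """
    kept: List[Row] = []
    null_rows: List[Row] = []
    dirty_count = 0
    for row in rows:
        parsed = dict(row)
        dirty = False
        # Step 1: a non-numeric value in any numeric column makes the row dirty.
        for col in numeric_columns:
            ok, value = parse_numeric(row.get(col))
            if not ok:
                dirty = True
                break
            parsed[col] = value
        if dirty:
            dirty_count += 1
            continue
        # Step 2: only nulls in the *selected* columns remove a row.
        if any(parsed[col] is None for col in selected_columns):
            null_rows.append(row)
            continue
        kept.append(parsed)
    return kept, null_rows, dirty_count
```

In this sketch, a row with a null in an unselected numeric column stays in the kept list, which mirrors the behavior described above. The null_rows list stands in for the rows that the operator can write to the external file.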

Restrictions

Any data set with numeric columns can be used. This operator slows down as the number of selected columns and the cardinality of those columns increase.

Configuration

Parameter	Description
Notes	Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Columns	Numeric columns to transform.
Lower Boundary (%)	A double that represents the percentage of values in the left tail of the distribution (on the low end of the range in each column) to replace. The lower threshold x is calculated as
Upper Boundary (%)	A double that represents the percentage of values in the right tail of the original distribution for each column (the high end of the range in each column) to replace. The upper threshold y is calculated as
Write Null Data To File *required	Rows with null values are removed from the analysis. This parameter specifies whether the removed rows are counted and whether they are written to a file. The file is written to the same directory as the rest of the output, and the filename is suffixed with _baddata.
Do Not Write or Count Null Rows (Fastest) - remove null-value data, but do not count it or display it in the result UI.
Do Not Write Null Rows to File - remove null-value data and display it in the result UI, but do not write it to an external file.
Write Up to 1000 Null Rows to File - remove null-value data and write the first 1,000 rows of that data to the external file.
Write All Null Rows to File - remove null-value data and write all removed rows to an external file.

Storage Format

Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression

Select the type of compression for the output.

Available Parquet compression options:

GZIP
Deflate
Snappy
no compression

Available Avro compression options:

Deflate
Snappy
no compression

Output Directory	The location to store the output files.
Output Name	The name under which the results are stored.
Overwrite Output	Specifies whether to delete existing data at the output path. Yes - if the path exists, delete the existing data and save the results. No - fail if the path already exists.

Advanced Spark Settings Automatic Optimization

Yes specifies using the default Spark optimization settings.
No lets you provide custom Spark optimization settings. Click Edit Settings to customize Spark optimization. See Advanced Settings dialog for more information.

Output

Visual Output

The operator has two tabs of output. The first is the output data, which is passed on to the next operator. The second is a summary that explains which parameters were selected, how much null data was removed, and where the results were written.

Output: A table with the outlier values replaced, as detailed above.
Summary: A description of the input data and the rows removed due to null data. It also shows where the results are stored.

Data Output

The operator outputs the same tabular data set as the input data, but with some of the values in the selected numeric columns replaced. See Outliers in Numerical Data for more information. This output should work as input for any Hadoop operator that expects tabular data.
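As a rough illustration of how the Lower Boundary (%) and Upper Boundary (%) parameters could act on one column, the following Python sketch clamps tail values to percentile thresholds. It rests on assumptions that this page does not confirm: the thresholds are taken as nearest-rank percentiles, and values beyond a threshold are replaced with the threshold itself. The function names are hypothetical, and the operator's actual calculation is described in Outliers in Numerical Data and may differ.

```python
import math
from typing import List, Sequence


def nearest_rank_percentile(sorted_values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("at least one value is required")
    if not 0 <= pct <= 100:
        raise ValueError("pct must be between 0 and 100")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def replace_outliers(
    values: Sequence[float], lower_pct: float, upper_pct: float
) -> List[float]:
    """Clamp values outside the assumed lower and upper thresholds."""
    if lower_pct + upper_pct > 100:
        raise ValueError("lower_pct + upper_pct must not exceed 100")
    ordered = sorted(values)
    # Assumption: threshold x sits at lower_pct, threshold y at 100 - upper_pct.
    low = nearest_rank_percentile(ordered, lower_pct)
    high = nearest_rank_percentile(ordered, 100.0 - upper_pct)
    # min/max return the original objects, so integer columns stay integers.
    return [min(max(v, low), high) for v in values]


column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(replace_outliers(column, 10, 10))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
```

Row order and column type are left unchanged in this sketch, in line with the statement that the type of the numeric input columns is preserved in the output.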