Normalization (HD)

Performs normalization on the selected columns of the input data set. Normalization means adjusting values measured on different scales to a notionally common scale.

Information at a Glance

Category	Transform
Data source type	HD
Sends output to other operators	Yes
Data processing tool	Pig

Note: The Normalization (HD) operator is for Hadoop data only. For database data, use the Normalization (DB) operator.

Algorithm

You can accomplish normalization in various ways.

By specifying a user-defined minimum and maximum value.
By a z-transformation (for example, to a mean of 0 and a variance of 1).
By a transformation as proportion of the average or sum of the respective attribute.

These approaches correspond to four normalization types that you can select.

Z-Transformation.
Proportion Transformation.
Range Transformation.
Divide-By-Average Transformation.

See Method under Configuration for a definition of each type.
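
The four types can be illustrated with a minimal Python sketch. It is an illustration only, not the operator's Pig or Spark implementation: the function names are hypothetical, the Z-Transformation is assumed to use the population variance, and the Range Transformation is assumed to map the column minimum to Range Minimum and the column maximum to Range Maximum.

```python
from math import sqrt
from statistics import mean, pvariance


def z_transformation(values):
    """Subtract the column mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = sqrt(pvariance(values, mu))  # assumption: population variance
    return [(v - mu) / sigma for v in values]


def proportion_transformation(values):
    """Divide each value by the column sum."""
    total = sum(values)
    return [v / total for v in values]


def range_transformation(values, range_min=0.0, range_max=1.0):
    """Rescale linearly so the column min/max map to range_min/range_max."""
    lo, hi = min(values), max(values)
    return [range_min + (v - lo) / (hi - lo) * (range_max - range_min)
            for v in values]


def divide_by_average_transformation(values):
    """Divide each value by the column average."""
    avg = mean(values)
    return [v / avg for v in values]


column = [2.0, 4.0, 6.0, 8.0]
print(z_transformation(column))
print(proportion_transformation(column))   # [0.1, 0.2, 0.3, 0.4]
print(range_transformation(column, 0.0, 1.0))
print(divide_by_average_transformation(column))
```

The sketch does not guard against a constant column (zero variance or zero range) or a column whose sum or average is zero. How the operator itself treats those cases is not stated here.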

Input

A data set from the preceding operator.

Configuration

Parameter	Description
Notes	Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Method	Normalization method to use. Divide-By-Average Transformation: normalizes each value by the sample's average. Proportion Transformation: normalizes each value by the sample's sum. Z-Transformation: normalizes each value by the sample's mean and variance. Range Transformation: normalizes each value by the sample's minimum and maximum values.
Range Minimum	Specify the minimum value in Range transformation.
Range Maximum	Specify the maximum value in Range transformation.
Columns	Click Select Columns to open the dialog box and select the numerical columns to normalize.
Store Results?	Specifies whether to store the results. true - results are stored. false - the data set is passed to the next operator without storing.
Results Location	The HDFS directory where the results of the operator are stored. This is the main directory, the subdirectory of which is specified in Results Name. Click Choose File to open the Hadoop File Explorer Dialog Box and browse to the storage location. Do not edit the text directly.
Results Name	The name of the file in which to store the results.
Overwrite	Specifies whether to delete existing data at that path and file name. Yes - if the path exists, delete that file and save the results. No - fail if the path already exists.
Storage Format	Select the format in which to store the results. The storage format is determined by your type of operator. Typical formats are Avro, CSV, TSV, or Parquet.
Compression	Select the type of compression for the output. Available Parquet compression options: GZIP, Deflate, Snappy, no compression. Available Avro compression options: Deflate, Snappy, no compression.
Use Spark	If Yes (the default), uses Spark to optimize calculation time.
Advanced Spark Settings Automatic Optimization	Yes specifies using the default Spark optimization settings. No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output: The data rows of the output table or view (up to 200 rows are displayed).
Data Output: The data set of the normalized data.