Transformation Operators

Transformation operators provide different ways to prepare data for modeling.

Aggregation (DB)
Performs aggregate calculations on a data set by specifying a group-by configuration and an aggregate expression.
Aggregation (HD)
Performs aggregate calculations on a data set by specifying a group-by configuration and an aggregate expression.
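As a rough illustration of a group-by configuration combined with an aggregate expression, the sketch below groups rows by one column and computes a sum and an average per group. The column names (`region`, `sales`) and the helper `aggregate` are hypothetical and are not part of the operator's interface.

```python
from collections import defaultdict

def aggregate(rows, group_by, value):
    """Group rows by one column, then compute the sum and average of another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[value])
    return {
        key: {"sum": sum(vals), "avg": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }

# Hypothetical input data
rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 4},
    {"region": "east", "sales": 20},
]
result = aggregate(rows, group_by="region", value="sales")
```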
Batch Aggregation
Performs aggregations on multiple columns.
Collapse
Transforms the data contained in a column of a table by means of subtotals (or other calculations, such as averages and counts) that are defined by another column in the same table. The result is a collapsed, or condensed, data set.
Column Filter (DB)
Selects a subset of the columns from the data source. Only the selected columns remain in the output data set.
Column Filter (HD)
Selects a subset of the columns from the data source. Only the selected columns remain in the output data set.
Correlation Filter (DB)
Filters numeric columns so that the remaining columns are not strongly correlated with each other.
Correlation Filter (HD)
Filters numeric columns so that the remaining columns are not strongly correlated with each other.
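One plausible way to picture correlation filtering is a greedy pass that keeps a column only if its Pearson correlation with every column already kept stays below a threshold. This is a sketch under that assumption; the threshold value, the column names, and the function `correlation_filter` are hypothetical, and the operator itself may use a different strategy.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (non-constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_filter(columns, threshold=0.9):
    """Keep a column only if it is weakly correlated with all kept columns."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical input data
columns = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "height_in": [59.1, 63.0, 66.9, 70.9],  # nearly a rescaling of height_cm
    "age": [40.0, 22.0, 35.0, 28.0],
}
kept = correlation_filter(columns)
```

Here `height_in` is dropped because it is almost perfectly correlated with `height_cm`, while `age` is retained.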
Distinct (DB)
Returns only distinct combinations of values from specified columns of a database source. Rows are not returned in any particular order, but each combination of values within a row is distinct from other rows.
Distinct (HD)
Returns the unique value combinations across selected columns.
Fuzzy Join
Performs a fuzzy matching join to connect two data sets based on nearly matching string values.
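To give a feel for joining on nearly matching strings, the sketch below pairs rows whose key strings exceed a similarity threshold. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure; the operator's actual matching algorithm is not stated here, and the threshold and the function `fuzzy_join` are hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_join(left, right, key, threshold=0.8):
    """Return (left_row, right_row, score) for pairs with similar key strings."""
    matches = []
    for l_row in left:
        for r_row in right:
            score = SequenceMatcher(
                None, l_row[key].lower(), r_row[key].lower()
            ).ratio()
            if score >= threshold:
                matches.append((l_row, r_row, score))
    return matches

# Hypothetical input data
left = [{"name": "Jon Smith"}, {"name": "Alice Wong"}]
right = [{"name": "John Smith"}, {"name": "Bob Stone"}]
pairs = [(l["name"], r["name"]) for l, r, _ in fuzzy_join(left, right, "name")]
```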
Join (DB)
Performs a table join on the input data sets by allowing users to define the input data set alias, the output columns, and the join condition.
Join (HD)
Performs a table join on the input data sets by allowing users to define the input data set alias, the output columns, and the join condition.
Normalization (DB)
Performs normalization on the selected columns of the input data set. Normalization means adjusting values measured on different scales to a notionally common scale.
Normalization (HD)
Performs normalization on the selected columns of the input data set. Normalization means adjusting values measured on different scales to a notionally common scale.
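Two common ways of bringing differently scaled columns onto a common scale are min-max scaling and z-score standardization. The sketch below shows both on hypothetical columns; which methods the operator actually offers is an assumption and is not confirmed here.

```python
import statistics

def min_max(values):
    """Rescale values linearly to the range [0, 1] (needs at least two distinct values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical columns measured on very different scales
income = [20000.0, 40000.0, 60000.0]
age = [20.0, 30.0, 40.0]

scaled_income = min_max(income)
scaled_age = min_max(age)
standardized_age = z_score(age)
```

After min-max scaling, both columns occupy the same 0-to-1 range, so neither dominates purely because of its units.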
Null Value Replacement (DB)
Replaces null values in the selected fields of the data set with designated values. This is helpful as a data pre-cleansing step.
Null Value Replacement (HD)
Replaces null values in the selected fields of the data set with designated values. This is helpful as a data pre-cleansing step.
Numeric to Text (DB)
Converts a numeric type column to a text type column.
Numeric to Text (HD)
Converts a numeric type column to a text type column.
One-Hot Encoding
Performs one-hot encoding on a selected set of categorical columns. It encodes categorical features using a one-hot (also known as "one-of-K") scheme and outputs a binary column for each distinct category in the input column.
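The one-of-K scheme can be sketched in a few lines: each distinct category becomes its own binary column. The output column naming (`color_red`, and so on) and the function `one_hot` are hypothetical choices made for this illustration.

```python
def one_hot(values, column):
    """Return one dict per input value with a 0/1 flag for each category."""
    categories = sorted(set(values))
    return [
        {f"{column}_{cat}": int(v == cat) for cat in categories}
        for v in values
    ]

# Hypothetical categorical column
encoded = one_hot(["red", "green", "red"], column="color")
```

Each output row has exactly one flag set to 1, which is what gives the scheme its name.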
Pivot (DB)
Transforms the categorical data contained in a column of a table into columns of a new table by means of subtotals (or other calculations, such as averages and counts) that might be defined by another column in the same table.
Pivot (HD)
Transforms the categorical data contained in a column of a table into columns of a new table by means of subtotals (or other calculations, such as averages and counts) that might be defined by another column in the same table.
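As an informal picture of pivoting, the sketch below turns the distinct values of a categorical column into output columns and fills each cell with a subtotal. The column names and the function `pivot` are hypothetical, and summation is only one of the possible calculations.

```python
from collections import defaultdict

def pivot(rows, index, columns, values):
    """Sum `values` for each (index, category) pair; categories become columns."""
    table = defaultdict(lambda: defaultdict(float))
    for row in rows:
        table[row[index]][row[columns]] += row[values]
    return {key: dict(cells) for key, cells in table.items()}

# Hypothetical input data in long format
rows = [
    {"store": "A", "product": "tea", "sales": 5.0},
    {"store": "A", "product": "coffee", "sales": 7.0},
    {"store": "A", "product": "tea", "sales": 3.0},
    {"store": "B", "product": "tea", "sales": 2.0},
]
wide = pivot(rows, index="store", columns="product", values="sales")
```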
Reorder Columns (DB)
Reorders one or more columns from an input table, and optionally renames them.
Reorder Columns (HD)
Reorders one or more columns from an input table, and optionally renames them.
Replace Outliers (DB)
Reduces the range of values for numeric columns.
Replace Outliers (HD)
Reduces the range of values for numeric columns.
Row Filter (DB)
Sets the criteria for filtering data set rows. Only the rows that meet the criteria remain in the output data set.
Row Filter (HD)
Sets the criteria for filtering data set rows. Only the rows that meet the criteria remain in the output data set.
Sessionization
Applies sessionization to time-series data, creating a session_id column that identifies, for each row (and user ID), the session the action belongs to.
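A common way to derive a session_id is to start a new session whenever the gap between two consecutive actions of the same user exceeds a timeout. The sketch below assumes that rule; the 30-minute timeout, the field names `user_id` and `ts`, the identifier format, and the function `sessionize` are all hypothetical.

```python
def sessionize(events, timeout=1800):
    """Attach a session_id to each event; `ts` is a timestamp in seconds."""
    last_seen = {}   # most recent timestamp per user
    session_no = {}  # current session counter per user
    out = []
    for event in sorted(events, key=lambda e: (e["user_id"], e["ts"])):
        user, ts = event["user_id"], event["ts"]
        # A first event, or a gap longer than the timeout, opens a new session.
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        out.append({**event, "session_id": f"{user}-{session_no[user]}"})
    return out

# Hypothetical click events
events = [
    {"user_id": "u1", "ts": 0},
    {"user_id": "u1", "ts": 600},
    {"user_id": "u1", "ts": 5000},
    {"user_id": "u2", "ts": 100},
]
sessions = [e["session_id"] for e in sessionize(events)]
```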
Set Operations (DB)
Combines the results of two or more queries into a single result set.
Set Operations (HD)
Combines the results of two or more queries into a single result set.
Sort By Multiple Columns
Allows you to choose up to three columns to sort by and returns a data set sorted by the selected column(s), adding a column called row_index that enables you to filter the output based on the sorting results.
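The role of the row_index column can be sketched as follows: after sorting, each row receives its position in the sorted order, which can then be used to keep, for example, only the first few rows. Ascending order, a 1-based index, and the function `sort_with_row_index` are assumptions made for this illustration.

```python
def sort_with_row_index(rows, keys):
    """Sort by the given columns and append a 1-based row_index column."""
    ordered = sorted(rows, key=lambda r: tuple(r[k] for k in keys))
    return [{**row, "row_index": i} for i, row in enumerate(ordered, start=1)]

# Hypothetical input data
rows = [
    {"dept": "b", "salary": 10},
    {"dept": "a", "salary": 30},
    {"dept": "a", "salary": 20},
]
ranked = sort_with_row_index(rows, keys=["dept", "salary"])

# Filter on the sorting result: keep the first two rows only.
top_two = [r for r in ranked if r["row_index"] <= 2]
```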
Transpose
Allows you to rearrange data so that rows and columns are switched.
Unpivot (DB)
Unpivots one or more columns.
Unpivot (HD)
Unpivots one or more columns.
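Unpivoting is the reverse of pivoting: selected columns are folded into rows of name and value pairs. The sketch below illustrates the idea; the output column names `variable` and `value` and the function `unpivot` are hypothetical.

```python
def unpivot(rows, id_column, value_columns):
    """Turn each selected column of each row into its own (variable, value) row."""
    return [
        {id_column: row[id_column], "variable": col, "value": row[col]}
        for row in rows
        for col in value_columns
    ]

# Hypothetical wide-format input
wide_rows = [{"store": "A", "q1": 5, "q2": 7}]
long_rows = unpivot(wide_rows, id_column="store", value_columns=["q1", "q2"])
```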
Unstack
Takes an HDFS data set in stacked format and produces an unstacked (wide) HDFS data set using user-specified grouping and pivot columns.
Variable (DB)
Defines variables created from data fields of the input data set, forming a new table or view.
Variable (HD)
Defines variables created from data fields of the input data set, forming a new table or view.
Window Functions - Aggregate
Unlike regular aggregate function calls, lets you create aggregate variables for each input row, based on the specified frame (with an optional order).
Window Functions - Lag/Lead
For several columns and offset values (n), returns the value of the column that is n rows before (lag) or after (lead) the current row.
Window Functions - Rank
Returns the rank of each row in relation to its windowed partition.
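The three window function operators above can be sketched on a single, already ordered partition. The helpers below are hypothetical and only illustrate the general semantics: a running aggregate produces one value per input row, lag and lead look n rows backward or forward, and rank follows the SQL RANK() convention in ascending order, where ties share the lowest rank.

```python
def lag(values, n=1):
    """Value n rows before the current row, or None where no such row exists."""
    return [values[i - n] if i - n >= 0 else None for i in range(len(values))]

def lead(values, n=1):
    """Value n rows after the current row, or None where no such row exists."""
    return [values[i + n] if i + n < len(values) else None
            for i in range(len(values))]

def running_sum(values):
    """Aggregate over a frame from the first row up to the current row."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out

def rank(values):
    """1-based ascending rank; tied values share the lowest rank."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

# Hypothetical partition, already in the desired order
sales = [10, 20, 30]
previous = lag(sales)
following = lead(sales)
cumulative = running_sum(sales)
ranks = rank([30, 10, 30, 20])
```

Unlike a group-by aggregation, every function here returns one output value per input row.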

Related concepts

Exploration Operators

Model Validation Operators

Tool Operators
