Transformation Operators
Transformation operators provide different ways to prepare data for modeling.
- Aggregation (DB)
Performs aggregate calculations on a data set by specifying a group-by configuration and an aggregate expression.
- Aggregation (HD)
Performs aggregate calculations on a data set by specifying a group-by configuration and an aggregate expression.
- Batch Aggregation
Performs aggregations on multiple columns.
- Collapse
Transforms the data contained in a column of a table by means of subtotals (or other calculations) that are defined by another column in the same list. The other calculations might be averages and counts. The result is a collapsed or condensed data set.
- Column Filter (DB)
Selects a subset of the columns from the data source. Only the selected columns remain in the output data set.
- Column Filter (HD)
Selects a subset of the columns from the data source. Only the selected columns remain in the output data set.
- Correlation Filter (DB)
Filters numeric columns so the remaining columns are not correlated strongly with each other.
- Correlation Filter (HD)
Filters numeric columns so the remaining columns are not correlated strongly with each other.
- Distinct (DB)
Returns only distinct combinations of values from specified columns of a database source. Rows are not returned in any particular order, but each combination of values within a row is distinct from other rows.
- Distinct (HD)
Returns the unique value combinations across selected columns.
- Fuzzy Join
Performs a fuzzy matching join to connect two data sets based on nearly matching string values.
- Join (DB)
Performs a table join on the input data sets by allowing users to define the input data set alias, the output columns, and the join condition.
- Join (HD)
Performs a table join on the input data sets by allowing users to define the input data set alias, the output columns, and the join condition.
- Normalization (DB)
Performs normalization on the selected columns of the input data set. Normalization means adjusting values measured on different scales to a notionally common scale.
- Normalization (HD)
Performs normalization on the selected columns of the input data set. Normalization means adjusting values measured on different scales to a notionally common scale.
- Null Value Replacement (DB)
Replaces null values in the selected fields of the data set with designated values. This is helpful as a data pre-cleansing step.
- Null Value Replacement (HD)
Replaces null values in the selected fields of the data set with designated values. This is helpful as a data pre-cleansing step.
- Numeric to Text (DB)
Converts a numeric type column to a text type column.
- Numeric to Text (HD)
Converts a numeric type column to a text type column.
- One-Hot Encoding
Performs one-hot encoding on a selected set of categorical columns: it encodes categorical features using a one-hot (also known as "one-of-K") scheme and outputs a binary column for each distinct category in the input column.
- Pivot (DB)
Lets you transform the categorical data contained in a column of a table into columns of a new table, by means of subtotals (or other calculations) that might be defined by another column in the same list. The other calculations might be averages and counts.
- Pivot (HD)
Lets you transform the categorical data contained in a column of a table into columns of a new table, by means of subtotals (or other calculations) that might be defined by another column in the same list. The other calculations might be averages and counts.
- Reorder Columns (DB)
Reorders one or more columns from an input table, and optionally renames them.
- Reorder Columns (HD)
Reorders one or more columns from an input table, and optionally renames them.
- Replace Outliers (DB)
Reduces the range of values for numeric columns.
- Replace Outliers (HD)
Reduces the range of values for numeric columns.
- Row Filter (DB)
Sets the criteria for filtering data set rows. Only the rows that meet the criteria remain in the output data set.
- Row Filter (HD)
Sets the criteria for filtering data set rows. Only the rows that meet the criteria remain in the output data set.
- Sessionization
Applies sessionization to time-series data to create a session_id column that, for each row (and user ID), identifies the session the action belongs to.
- Set Operations (DB)
Combines results from merging two or more queries into a single result set.
- Set Operations (HD)
Combines results from merging two or more queries into a single result set.
- Sort By Multiple Columns
Allows you to choose up to three columns to sort by and returns a data set sorted by the selected column(s), adding a column called row_index that enables you to filter the output based on the sorting results.
- Transpose
Allows you to rearrange data so that rows and columns are switched.
- Unpivot (DB)
Unpivots one or more columns.
- Unpivot (HD)
Unpivots one or more columns.
- Unstack
Takes an HDFS data set in stacked format and produces an unstacked (wide) HDFS data set using user-specified grouping and pivot columns.
- Variable (DB)
Defines variables created from data fields of the input data set, forming a new table or view.
- Variable (HD)
Defines variables created from data fields of the input data set, forming a new table or view.
- Window Functions - Aggregate
Unlike regular aggregate function calls, allows you to create aggregate variables for each input row, based on the specified frame (with an optional order).
- Window Functions - Lag/Lead
For several columns and offset values (n), returns the value of the column that is n rows before (lag) or after (lead) the current row.
- Window Functions - Rank
Returns the rank of each row in relation to its windowed partition.
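
To make a few of the descriptions above more concrete, the following is a minimal sketch in plain Python of what One-Hot Encoding, Normalization, Sessionization, and Lag/Lead compute conceptually. It is not the operators' implementation or API. The function names, the min-max scaling choice, the inactivity-timeout rule for starting a new session, and the numeric timestamps are all assumptions made for illustration.

```python
def one_hot(rows, column):
    """Add one binary column per distinct category found in `column`."""
    categories = sorted({row[column] for row in rows})
    return [
        {**row, **{f"{column}_{c}": int(row[column] == c) for c in categories}}
        for row in rows
    ]


def min_max_normalize(values):
    """Rescale values to the range [0, 1] (assumed scheme: min-max)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def sessionize(events, timeout):
    """Assign a per-user session number to (user_id, timestamp) events.

    Assumption: a new session starts when the gap since the user's
    previous event exceeds `timeout` (same unit as the timestamps).
    """
    last_seen = {}
    session_no = {}
    result = []
    for user, ts in sorted(events):
        if user not in last_seen or ts - last_seen[user] > timeout:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        result.append((user, ts, session_no[user]))
    return result


def lag(values, n, default=None):
    """Value n rows before the current row, or `default` if none exists."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n, default=None):
    """Value n rows after the current row, or `default` if none exists."""
    return [values[i + n] if i + n < len(values) else default for i in range(len(values))]
```

For example, `lag([10, 20, 30], 1)` pairs each row with the value from the previous row, leaving the first row without a value, while `sessionize` sorts events by user and timestamp before numbering sessions, so input order does not matter.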
Copyright © Cloud Software Group, Inc. All rights reserved.