Window Functions - Lag/Lead
For several columns and offset values (n), returns the value of the column that is n rows before (lag) or after (lead) the current row.
Information at a Glance
Lag is used to access data from n rows prior, and allows the current row to access that data. Lead follows the same logic, but is used to access data from n rows ahead. An example use case is finding previous and next order dates for customers. Given a data set with a customer ID and an order date, using 1 as the parameter for lag and 1 as the parameter for lead, the result is two additional columns: Lag returns the data prior to the current row, and Lead returns the data from one row ahead. Once this information is known, a potential next step could be determining the number of days between customer orders.
Team Studio provides two other distinct Window Functions operators: Aggregate and Rank. The window functions operators allow you compute database-like window functions on top of Hadoop by leveraging Spark SQL. To learn more about how window functions are implemented in Spark SQL, see this tutorial.
A window function calculates a return value for every input row of an input based on a specific group of user-defined rows called the frame. Every input row has a unique frame associated with it.
Input
This operator takes one data set from HDFS. It must have some numeric columns to compute numeric aggregations on, and might include partition by column(s) and order by column of any type.
Datetime columns must have a format specified in the input (for example, Datetime 'MM/dd/yy').
- Bad or Missing Values
- Dirty data: When parsing delimited data, the window functions operators remove dirty data (such as strings in numeric columns, doubles in integer columns, or rows with the incorrect number of values) as it parses. These rows are silently removed because Spark is incapable of handling them.
Null values: Before calculating any of the window functions, the operator filters any rows that contain null values in the Order By column selected. The operator then processes these rows with null values according to the value of the Write Rows Removed Due to Null Data To File parameter. The number of rows removed due to null data is reported in the Summary tab of the visual output (if you do not select Yes, but do not count removed rows (faster).
Restrictions
Wide data: This operator works quickly on long data, but performance might slow down dramatically if window functions are calculated on thousands of columns. Increasing Spark's executor memory might improve performance.
Datetime columns: Input datetime columns must have a format specified in the input (for example, Datetime 'MM/dd/yy'); otherwise, the operator returns null values for the whole column.
Configuration
Parameter | Description |
---|---|
Notes | Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator. |
Partition By | Column(s) to partition by. |
Order By | The column by which to order each partition (all data types are supported). Default ordering: Ascending. |
Calculate Lag | The lag value for each of these columns within an ordered partition, and for each value in
Lag Offset Values.
Lag: for an offset value n, the lag function returns the value that is n rows before the current row. For example, an offset of 1 returns the following row at any given point in the window partition (or a null value for the last row of the partition). |
Lag Offset Values | If some columns are selected in
Calculate Lag, specify the integer comma-separated list of offsets for which to compute the lag value(s). For example:
1,2,4 (= compute lag values of 1, 2, and 4 rows before each column selected in Calculate Lag). |
Calculate Lead | The lead value for each of these columns within an ordered partition, and for each value in
Lead Offset Values.
Lead: For an offset value n, the lead function returns the value that is n rows after the current row. For example, an offset of 1 returns the previous row at any given point in the ordered window partition (or a null value for the first row of the partition). |
Lead Offest Values | If some columns are selected in
Calculate Lead, specify the integer comma-separated list of offsets for which to compute the lead value(s). For example:
1,2,4 (= compute lag values of 1, 2, and 4 rows after each column selected in Calculate Lead). |
Columns to Keep | The columns to keep in the output. |
Write Rows Removed Due to Null Data To File |
Specify whether rows with null values in the
Order By column selected are removed from the analysis and written to a file. The file is written to the same directory as the rest of the output. The file name is suffixed with
_baddata.
|
Storage Format | Select the format in which to store the results. The storage format is determined by your type of operator.
Typical formats are Avro, CSV, TSV, or Parquet. |
Compression | Select the type of compression for the output.
Available Avro compression options. |
Output Directory | The location to store the output files. |
Output Name | The name to contain the results. |
Overwrite Output | Specifies whether to delete existing data at that path. |
Advanced Spark Settings Automatic Optimization |
|