Window Functions - Rank

This operator returns the rank of each row in relation to its windowed partition.

window-function-rank operator icon

Information at a Glance

Note: This operator can only be used with TIBCO® Data Virtualization and Apache Spark 3.2 or later.

Parameter

Description
Category Transform
Data source type TIBCO® Data Virtualization
Send output to other operators Yes
Data processing tool TIBCO® DV, Apache Spark 3.2 or later

Algorithm

The Window Functions-Rank operator supports the rank, dense_rank, cumulative distribution, and ntile (n: number of quantiles) functions. The core concept of this operator is to compute the rank or order of each row, relative to a defined grouping or partition. An example use case is ranking transactions for individual customers, within a data set that contains unique customers. In this example, the partitioning, or grouping, is the customer ID, while the data to rank or order is the transaction value amount. Transactions within each partition or customer ID are ranked and ordered starting with 1 for the highest, and counting up.

You can use this operator to compute database-like window functions by leveraging Spark SQL. To learn more about how window functions are implemented in Spark SQL.

A window function calculates the return value for every input row of an input based on a specific group of user-defined rows called the frame. Every input row has a unique frame associated with it.

Note: Each operator can compute several window functions on several columns at once, but for a specific ordered partition. If you need to compute the same window functions on a different (ordered) partition, you can copy the operator and modify the partition, frame, or order-related parameters on the copy.

Input

An input is a single tabular data set. It must have some numeric columns where it computes numeric aggregations and includes partition by column(s) and order by columns of any type.

Bad or Missing Values
Dirty data: When parsing delimited data, the window functions operators remove dirty data (such as strings in numeric columns, doubles in integer columns, or rows with the incorrect number of values) as it parses. These rows are silently removed because Spark is not capable of handling them.

Null values: Before calculating any of the window functions, the operator filters any rows that contain null values in the Order By column selected. The number of rows removed due to null data is reported on the Summary tab of the visual output.

The Datetime columns must have the format specified in the input (for example, Datetime 'MM/dd/yy').

Restrictions

Wide data: This operator works quickly on long data, but performance might slow down dramatically when window functions are calculated on thousands of columns. Increasing Spark's executor memory might improve performance.

Datetime columns: Input datetime columns must have the format specified in the input (for example, Datetime 'MM/dd/yy'); otherwise, the operator returns null values for the whole column.

Configuration

The following table provides the configuration details for the Window Functions-Rank operator.

Parameter Description
Notes Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator.
Partition By Specify the column(s) by which to partition. Click Select Columns to select the required columns.
Order By Specify the column by which to order each partition (all data types are supported).
Order Specify the order type for the selected Order By column.
Calculate Row Number Specify whether the row number function returns a sequential (and unique) number starting at 1 within an ordered partition.
Calculate Rank Specify whether the rank function returns a sequential number starting at 1 within an ordered partition.
Calculate Dense Rank Specify whether the dense rank function returns a sequential number starting at 1 within an ordered partition.

Note:

The difference between rank and dense rank is that dense rank leaves no gaps in the ranking sequence when ties occur (for example, if three values tie for second place, all three have a dense rank of 2, and the next value has a dense rank of 3).

Calculate Cumulative Distribution Specify whether the function returns the cumulative distribution of values within an ordered window partition; that is, the fraction of rows that are below the current row.

If N = total number of rows in the partition and V = number of values before (and including) x, the equation would be CUME_DIST(X)=V/N.

Calculate Quantiles If Yes, returns the n-tile group ID (from 1 to Number of Quantiles inclusive) in an ordered window partition (equivalent to the NTILE function in SQL).

For example, if n = 4 (number of quantiles), the first quarter of the rows gets a value of 1, the second quarter gets 2, the third quarter gets 3, and the last quarter gets 4.

Note:

If Yes is selected, you must specify an integer value for the Number of Quantiles.

Number of Quantiles If Calculate Quantiles is set to Yes, specify the number of quantiles to return for each ordered window partition. The value must be an integer > 0.
Columns to Keep Specify the columns to keep in the output. Click Select Columns to select the required columns.
Output Schema Specify the schema for the output table or view.
Output Table Specify the table path and name where the output of the results is generated. By default, this is a unique table name based on your user ID, workflow ID, and operator.
Store Results When set to Yes, the operator saves the results. If set to No, the operator does not save the results.

Output

Visual Output
The operator returns the visual output with two tabs: Output and Summary.
  • Output: A table that displays the output of a data set.

  • Summary: Displays information about the input parameters, their current settings, and a summary of the number of rows removed due to null data.

Output to Successive operator
A data set that can be used by any TIBCO Data Science – Team Studio operator that accepts data set as input.

Example

The following example illustrates the Window Functions-Rank operator.

Window Functions Rank operator workflow

Data

golf: This data set contains the following information:

  • Multiple columns namely outlook, temperature, wind, humidity, and play.
  • Multiple rows (14 rows).

Parameter Setting

The parameter settings for the golf data set are as follows:

  • Partition By: play

  • Order By: temperature

  • Order: Ascending

  • Calculate Row Number: Yes

  • Calculate Rank: Yes

  • Calculate Dense Rank: Yes

  • Calculate Cumulative Distribution: Yes

  • Calculate Quantiles: Yes

  • Number of Quantiles: 2

  • Columns to Keep: outlook, temperature, humidity, wind, play

  • Store Results: Yes

Results

These figures displays the results for the parameter settings for the golf data set.

Summary

Window Functions Rank operator - Summary tab

Output

Window Functions Rank operator - Output tab