Association Rules

Association Rules modeling refers to the process of discovering patterns that occur frequently within a data set, such as frequent combinations or sets of items purchased together, subsequences, or substructures within the data.

Information at a Glance

Category: Model
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

For more information about Association Rules, see Patterns in Data Sets.

Input

An HDFS tabular data set with a sparse item set (or "basket") column (a key/value dictionary with distinct items as keys and frequencies as values).

The input is most likely to be generated by the Collapse operator, using the transaction ID column(s) in a Group By clause and the item column in the Columns to Collapse check boxes. See the example for an illustration.
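As a rough, minimal sketch of the kind of grouping the Collapse operator performs, the following PySpark snippet builds a basket column from transactional rows. The column names (transaction_id, item) are hypothetical, and collect_set produces a plain array of distinct items rather than the Team Studio Sparse dictionary the operator expects.

    # Minimal PySpark sketch of collapsing transactional rows into baskets.
    # Hypothetical column names; collect_set yields an array of distinct
    # items, not the Team Studio Sparse key/value dictionary.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    rows = [(1, "meat"), (1, "potatoes"), (2, "meat"),
            (2, "beer"), (3, "potatoes")]
    df = spark.createDataFrame(rows, ["transaction_id", "item"])

    # Group By the transaction ID column and collapse the item column
    # into a single basket per transaction.
    baskets = df.groupBy("transaction_id").agg(
        F.collect_set("item").alias("items"))
    baskets.show(truncate=False)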

Bad or Missing Values
If the selected Item Set Column contains null values, those rows are removed from the data set. The number of null rows removed can be listed in the Summary section of the output (depending on the option chosen for Write Rows Removed Due to Null Data To File). If the selected Item Set Column contains bad key/value pairs in the dictionary, the operator fails at runtime with a meaningful error message.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Item Set Column Select the dictionary column that represents the item set (or "basket") for each transaction (ID, group, and so on).

Data type supported: Team Studio Sparse

This column is most likely to be the output column of the Collapse operator (see the Input section, above).

Maximum Size of Frequent Item Sets Specify the maximum number of items a frequent item set can contain in order to be used to generate association rules. The range of possible values is between 1 and 50. Default value: 5.
Minimum Support The Support of an association rule is the percentage of groups (transactions) that contain all of the items listed in the rule. Minimum Support specifies the lower threshold on that percentage.
  • An Association Rule is considered valid only when its Support equals or exceeds this Minimum Support value.
  • The range of possible values is any decimal value between 0.001 and 1. The default value is 0.3; that is, at least 30% of the transactions must contain the item set.
  • A worked example covering Support, Confidence, and Lift follows the Minimum Rule Lift description below.
Minimum Rule Confidence

The Confidence of an Association Rule is a percentage value that shows how frequently the rule head occurs among all groups that contain the rule body. The Confidence value indicates how reliable the rule is: the higher the value, the more often the items are associated together. Minimum Confidence specifies the lower threshold on this percentage.

For a rule X => Y:

  Confidence(X => Y) = Support(X ∪ Y) / Support(X)

  • For example, if the Confidence is 85% for the rule {meat} => {potatoes}, it means that in 85% of the transactions in which meat (the rule body) was purchased, potatoes (the rule head) were purchased as well.
  • The range of possible values is any decimal value between 0 and 1. The default value is 0.7, which means that the items must be associated together 70% of the time.
Minimum Rule Lift

The Lift of an Association Rule is the ratio of the observed support for the rule, Support(X ∪ Y), to the support expected if X and Y were independent:

  Lift(X => Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

If a rule has a lift of 1, the probability of occurrence of the antecedent X and that of the consequent Y are independent of each other. When two events are independent of each other, no rule should be drawn involving them.

If the lift is greater than 1, it indicates the degree to which the two occurrences depend on one another, and makes the rule potentially useful for predicting the consequent in future data sets.

The advantage of lift is that it considers both the confidence of the rule and the overall data set.

  • The default value is 1.5.
  • The range of possible values is any decimal value between 0 and 10E5.

    Note: Valid rules - To be considered a valid association rule, a rule must meet all the following requirements:
    • Minimum support for the item set
    • Minimum rule confidence
    • Minimum rule lift

    Note that Support(X ∪ Y) means the support of the union of the items in X and Y. This is somewhat confusing, since we normally think in terms of probabilities of events and not sets of items. We can rewrite Support(X ∪ Y) as the joint probability P(E_X ∩ E_Y), where E_X and E_Y are the events that a transaction contains item set X and item set Y, respectively.
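To make the three thresholds concrete, here is a small worked example in plain Python for the rule {meat} => {potatoes} over ten hypothetical baskets; all numbers are invented for illustration.

    # Worked example: Support, Confidence, and Lift for {meat} => {potatoes}.
    baskets = (
        [{"meat", "potatoes"}] * 4                    # both items together
        + [{"meat", "beer"}]                          # meat without potatoes
        + [{"potatoes", "bread"}]                     # potatoes without meat
        + [{"beer"}, {"bread"}, {"milk"}, {"eggs"}]   # neither item
    )
    n = len(baskets)

    support_x  = sum({"meat"} <= b for b in baskets) / n              # 0.5
    support_y  = sum({"potatoes"} <= b for b in baskets) / n          # 0.5
    support_xy = sum({"meat", "potatoes"} <= b for b in baskets) / n  # 0.4

    confidence = support_xy / support_x                # 0.8
    lift       = support_xy / (support_x * support_y)  # 1.6

    # Against the defaults: Support 0.4 >= 0.3, Confidence 0.8 >= 0.7,
    # and Lift 1.6 >= 1.5, so this rule would be considered valid.
    print(support_xy, confidence, lift)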

Write Rows Removed Due to Null Data To File

Rows with null values in the Item Sets Column are removed from the analysis. This parameter allows you to specify whether the rows with null values are written to a file.

The file is written to the same directory as the rest of the output. The filename is bad_data.

  • Do Not Write or Count Null Rows (Fastest) - remove rows with null values, but do not count them or display the count in the results UI.
  • Do Not Write Null Rows - remove rows with null values, but do not write them to an external file.
  • Write Up to 1000 Null Rows to File - remove rows with null values and write the first 1000 of them to the external file.
  • Write All Null Rows to File - remove rows with null values and write all of them to an external file.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name of the file to contain the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
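The Additional Notes section indicates that this operator runs the FP-Growth algorithm on Spark. Purely for orientation, here is a minimal, hypothetical sketch of the analogous open-source call using pyspark.ml.fpm.FPGrowth; Spark's FPGrowth exposes minSupport and minConfidence (plus a numPartitions setting), so thresholds such as Minimum Rule Lift and Maximum Size of Frequent Item Sets would have to be applied on top of it, as this operator does.

    # Rough open-source analogue of this operator's core settings.
    from pyspark.sql import SparkSession
    from pyspark.ml.fpm import FPGrowth

    spark = SparkSession.builder.getOrCreate()
    baskets = spark.createDataFrame(
        [(1, ["meat", "potatoes"]), (2, ["meat", "potatoes", "beer"]),
         (3, ["potatoes"])],
        ["transaction_id", "items"])

    fp = FPGrowth(itemsCol="items",   # Item Set Column
                  minSupport=0.3,     # Minimum Support (default 0.3)
                  minConfidence=0.7)  # Minimum Rule Confidence (default 0.7)
    model = fp.fit(baskets)

    model.freqItemsets.show()      # compare: Frequent Item Sets output
    model.associationRules.show()  # compare: Association Rules output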

Outputs

Visual Outputs
  • Association Rules Output (the main output transmitted to subsequent operators). An example is shown here.



    It displays the rules selected according to the user's specified criteria, along with the related measures of significance for these rules:

    • Support (see Parameter section - Minimum Support)
    • Confidence: (see Parameter section - Minimum Rule Confidence)
    • Lift: (see Parameter section - Minimum Rule Lift)
    • Conviction: The conviction of a rule can be interpreted as the ratio of the expected frequency that X occurs without Y (that is, the frequency with which the rule makes an incorrect prediction) if X and Y were independent, to the observed frequency of incorrect predictions. For example, if a rule (X => Y) has a conviction of 1.2, the rule would be incorrect 20% more often (1.2 times as often) if the association between X and Y were purely random chance. The higher the conviction, the more reliable the rule. A worked computation follows this list.

      Note: Null Conviction - If the observed frequency of incorrect predictions is 0 (that is, the confidence is 1), the rule conviction cannot be calculated and a null value is displayed.
    • Frequent Item Sets data set (stored in HDFS):



    • Summary of the parameters selected, rows removed due to null values, rules generated, and output location.
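To make the conviction figure concrete, here is a small worked computation in plain Python; the numbers are invented and chosen to reproduce a conviction of 1.2.

    # Worked Conviction example (invented numbers chosen to yield 1.2):
    #   Conviction(X => Y) = (1 - Support(Y)) / (1 - Confidence(X => Y))
    support_y  = 0.6    # fraction of transactions containing Y
    confidence = 2 / 3  # observed confidence of the rule X => Y
    conviction = (1 - support_y) / (1 - confidence)
    print(round(conviction, 2))  # 1.2: the rule would be wrong 1.2 times
                                 # as often if X and Y were independent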



Data Output

The output of this operator is the Association Rules data set, which can be further filtered based on rule metrics (for example, using a Row Filter).

Note: The Frequent Items data set is also stored in HDFS and can be dragged on the canvas for further analysis.
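Continuing the hypothetical FP-Growth sketch from the Configuration section, that kind of downstream filtering corresponds to an ordinary DataFrame filter; in recent Spark versions the rules DataFrame carries confidence and lift columns.

    # Continuing the sketch above: keep only strong rules, analogous to
    # applying a Row Filter to the Association Rules output.
    rules = model.associationRules      # antecedent, consequent, confidence, lift
    rules.filter("lift >= 2.0").show()  # 2.0 is an arbitrary example threshold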

Example

Additional Notes

FP-Growth Performance - The FP-Growth algorithm can result in slow processing when the minimum support chosen is very low (close to 0). If you detect slowness, try one of the following remedies.

  • Increase the value of Minimum Support.
  • Increase Spark executor memory and/or Number of Partitions in Advanced Spark Settings.