Patterns in Data Sets

A pattern in a data set is referred to as an "Association." It can be a set of items, subsequences, substructures, and so on, that occurs frequently in a dataset. It is also known as a frequent pattern.

You can use the Association Rules operator to discover frequent patterns in your data. For details about using the Association Rules operator, see Association Rules.

Specific examples of the inherent regularities in data that Association Rule modeling can find include the following.

  • Frequent itemsets: For example, a set of items, such as milk and bread, might appear frequently together in a transaction dataset.
    • What products are often purchased together? This is commonly referred to as Shopping Basket Analysis or Market Basket Analysis.
  • Frequent substructures: A substructure can refer to different structural forms, such as subgraphs, subtrees, or sublattices, which may be combined with itemsets or subsequences. For example, certain DNA structures that are sensitive to a new drug might occur frequently in a biotechnology drug testing dataset.
    • What are substructures of the data associated with a certain event?
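To make the idea of a frequent itemset concrete, the following is a minimal sketch that counts how often single items and pairs of items occur together in a small list of baskets. It is a brute-force illustration only, not the FP-Growth algorithm; the function name frequent_itemsets, the min_support and max_size parameters, and the sample baskets are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items
    whose support (fraction of transactions) is at least min_support.

    Brute-force enumeration for illustration; hypothetical helper.
    """
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items in a basket
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Made-up sample data for the example.
baskets = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["milk", "eggs"],
]

result = frequent_itemsets(baskets, min_support=0.5)
for itemset, support in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), support)
```

With this sample data, milk and bread appear together in two of the four baskets, so the pair has a support of 0.5 and is kept, while butter appears only once and is dropped. Enumerating every combination becomes expensive quickly as baskets grow, which is why algorithms such as FP-Growth are used in practice.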

Association Rules are if/then statements that define relationships between seemingly unrelated data. An example of an association rule might be "If a customer buys a dozen eggs, that customer is 80% likely to also purchase milk."

  • The if part of the rule is often referred to as the premise (or antecedent).
  • The then part of the rule is the conclusion (or consequent).

Therefore, a premise can be considered to be an item or condition that is found frequently in combination with the conclusion item or condition.

The Association Rules operator performs the following analysis.
  1. Finds the frequent item sets in the data by applying the parallel version of the FP-Growth algorithm from MLlib (described in the paper Li et al., PFP: Parallel FP-growth for query recommendation).
  2. Runs a parallel Association Rules algorithm to produce a list of uncovered Association Rules that have a single item as the consequent and which meet the modeler's specified Support, Confidence, and Lift criteria.
    • Support is an indication of how frequently the items appear together in the input.
    • Confidence indicates how often the rule holds: of the transactions that contain the antecedent, the percentage that also contain the consequent.
    • Lift measures the ratio of the observed support to the support expected if the antecedent and the consequent were independent. A lift greater than 1 indicates that the antecedent and consequent occur together more often than chance would predict, which is what makes the rule valuable.
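The three criteria above can be sketched in a few lines of plain Python. This is an illustration of the standard definitions, not the operator's implementation; the function name rule_metrics and the sample baskets are assumptions made for this example.

```python
def rule_metrics(transactions, premise, conclusion):
    """Return (support, confidence, lift) for the rule premise -> conclusion.

    Hypothetical helper that applies the standard definitions directly.
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    premise, conclusion = set(premise), set(conclusion)

    # Count the baskets that contain each side of the rule, and both sides.
    n_premise = sum(premise <= b for b in baskets)
    n_conclusion = sum(conclusion <= b for b in baskets)
    n_both = sum((premise | conclusion) <= b for b in baskets)

    support = n_both / n
    confidence = n_both / n_premise if n_premise else 0.0
    # Lift compares confidence with the baseline frequency of the conclusion.
    lift = confidence / (n_conclusion / n) if n_conclusion else 0.0
    return support, confidence, lift


# Made-up sample data: eggs appear in 5 baskets, 4 of which also hold milk.
baskets = [
    ["eggs", "milk"],
    ["eggs", "milk", "bread"],
    ["eggs", "milk"],
    ["eggs", "milk", "butter"],
    ["eggs", "bread"],
    ["milk", "bread"],
    ["bread", "butter"],
    ["butter"],
]

support, confidence, lift = rule_metrics(baskets, ["eggs"], ["milk"])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.50 confidence=0.80 lift=1.28
```

In this sample, the rule "eggs, then milk" has a confidence of 80%, matching the earlier example, and a lift above 1 because milk appears in only 5 of the 8 baskets overall.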

Use Cases

Association Rules modeling is useful for unsupervised analysis of transactional data that is categorical in nature. Finding such frequent patterns can be applied to a variety of business use cases. In shopping basket analysis, for example, the model studies customers' buying habits by searching for item sets that are frequently purchased together (or in sequence). Other common use cases include cross-marketing, product clustering, catalog design, store layout, sales campaign analysis, Web log ("click stream") analysis, and DNA sequence analysis.