Association Rules - Computational Procedures and Terminology
Statistica Association Rules is an implementation of the a priori algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000). The algorithm, and the types of data that can be analyzed with it in Statistica, are described below.
- Categorical Variables, Multiple Response Variables, Multiple Dichotomies
- Statistica Association Rules supports all common types of variables or formats in which categories, items, or transactions (e.g., information regarding purchases of consumer items) are typically recorded.
- Categorical or class variables
- Categorical variables are single variables that contain codes or text values to denote distinct classes; for example, a variable Gender would have the categories Male and Female.
- Multiple response variables
- Multiple response variables usually consist of multiple variables (i.e., a list of variables) that can contain, for each observation, codes or text values describing a single "dimension" or transaction. For example, a vendor might record the purchases made by a customer in a single record, where each record contains one or more purchased items in arbitrary order. This is a typical format in which customer transaction data are kept. This type of data format is also discussed in detail in the context of Basic Statistics (see Multiple Responses/Dichotomies - Multiple Response Variables).
- Multiple dichotomies
- In this data format, each variable represents one item or category, and the dichotomous data in each variable indicate whether or not the respective item or category applies to the respective case. For example, suppose a vendor created a data spreadsheet where each column represented one of the products available for purchase. Each transaction (row of the data spreadsheet) would record whether or not the respective customer purchased that product, i.e., whether or not the respective transaction involved each item. This type of data format is also discussed in detail in the context of Basic Statistics (see Multiple Responses/Dichotomies - Multiple Dichotomies).
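The following Python sketch illustrates how the multiple response and multiple dichotomies formats can describe the same transactions. It is an illustration only, not Statistica code: the item names, the use of None to pad unused columns, and the helper functions from_multiple_response and from_multiple_dichotomies are all assumptions made for this example.

```python
# Hypothetical example data; item names and layout are illustrative only.

# Multiple response format: each record lists the purchased items in
# arbitrary order; None pads the unused columns (an assumption here).
multiple_response = [
    ["bread", "milk", None],
    ["milk", "bread", "butter"],
    ["butter", None, None],
]

# Multiple dichotomies format: one column per item, 1 = purchased.
columns = ["bread", "butter", "milk"]
multiple_dichotomies = [
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]


def from_multiple_response(rows):
    """Turn each record into the set of items it lists."""
    return [frozenset(v for v in row if v is not None) for row in rows]


def from_multiple_dichotomies(rows, names):
    """Turn each 0/1 row into the set of items flagged as present."""
    return [frozenset(n for n, flag in zip(names, row) if flag) for row in rows]


transactions_a = from_multiple_response(multiple_response)
transactions_b = from_multiple_dichotomies(multiple_dichotomies, columns)
```

In this sketch both conversions yield the same list of item sets, which is the form that the support and confidence computations operate on.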
- Association Rules: If Body then Head
- The a priori algorithm attempts to derive from the data association rules of the form: If "Body" then "Head," where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items). For example, in the rule if (Car=Porsche and Age<20) then (Risk=High and Insurance=High), the logical conjunction before the then is the Body, and the logical conjunction following the then is the Head of the association rule.
- Initial Pass Through the Data: The Support Value
- First, Statistica scans all variables to determine the unique codes or text values (items) found in the variables selected for the analysis. In this initial pass, the relative frequencies with which the individual codes or text values occur across the transactions are also computed. The probability that a transaction contains a particular code or text value is called Support; the Support value is also computed in consecutive passes through the data, as the joint probability (relative frequency of co-occurrence) of pairs, triplets, etc. of codes or text values (items), i.e., separately for the Body and Head of each association rule.
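As a minimal sketch of the support computation described above, the Python code below counts relative frequencies for single items and for a set of items. The transactions and the function names item_support and itemset_support are hypothetical and chosen for illustration; this is not the Statistica implementation.

```python
from collections import Counter

# Hypothetical transactions, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]


def item_support(transactions):
    """Support of each single item: share of transactions containing it."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: count / n for item, count in counts.items()}


def itemset_support(transactions, itemset):
    """Joint support: share of transactions containing every item in the set."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


single = item_support(transactions)
pair = itemset_support(transactions, {"bread", "milk"})
```

With these four transactions, bread occurs in three of them (support 0.75), and bread and milk occur together in two (support 0.5).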
- Second Pass Through the Data: The Confidence Value; Correlation Value
- After the initial pass through the data, all items with a support value greater than some predefined minimum support value are "remembered" for subsequent passes through the data. Specifically, Statistica computes the conditional probabilities for all pairs of codes or text values that have support values greater than the minimum support value. This conditional probability - that an observation (transaction) that contains a code or text value X also contains a code or text value Y - is called the Confidence Value. In general (in later passes through the data) the confidence value denotes the conditional probability of the Head of the association rule, given the Body of the association rule.
In addition, Statistica computes the support value for each pair of codes or text values, and a Correlation value based on the support values. The correlation value for a pair of codes or text values {X, Y} is computed as the support value for that pair, divided by the square root of the product of the support values for X and Y. After the second pass through the data, the program retains those pairs of codes or text values that (1) have a confidence value greater than some user-defined minimum confidence value, (2) have a support value greater than some user-defined minimum support value, and (3) have a correlation value greater than some minimum correlation value.
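The pairwise quantities defined above can be sketched in Python as follows. The data, the threshold values, and the function names (support, confidence, correlation, keep_pair) are assumptions made for illustration and are not part of Statistica; the sketch also assumes that every item queried occurs in at least one transaction, so that no division by zero arises.

```python
from math import sqrt

# Hypothetical transactions, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
]


def support(items):
    """Relative frequency of transactions containing every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x, y):
    """Conditional probability that a transaction containing x also contains y."""
    return support({x, y}) / support({x})


def correlation(x, y):
    """Pair support divided by the square root of the product of item supports."""
    return support({x, y}) / sqrt(support({x}) * support({y}))


def keep_pair(x, y, min_support, min_confidence, min_correlation):
    """Apply the three retention conditions to the rule 'if x then y'."""
    return (support({x, y}) > min_support
            and confidence(x, y) > min_confidence
            and correlation(x, y) > min_correlation)
```

Note that confidence depends on the direction of the rule, while the correlation value is symmetric in X and Y. With the data above and a minimum confidence of 0.8, keep_pair("milk", "bread", 0.3, 0.8, 0.5) holds (confidence 1.0), whereas keep_pair("bread", "milk", 0.3, 0.8, 0.5) does not (confidence about 0.67).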
- Subsequent Passes Through The Data: Maximum Item Size in Body, Head
- Statistica continues scanning the data in subsequent steps, computing support, confidence, and correlation values for pairs of codes or text values (associations between single codes or text values), triplets of codes or text values, and so on. To reiterate, in general, at each iteration the program derives association rules of the general form if "Body" then "Head", where Body and Head stand for simple codes or text values (items), or the conjunction of codes and text values (items).
Unless the process stops because no further associations can be found that satisfy the minimum support, confidence, and correlation conditions, the process could continue to build very complex association rules (e.g., if X1 and X2 ... and X20 then Y1 and Y2 ... and Y20). To avoid excessive complexity, the user can also specify the maximum number of codes or text values (items) in the Body and Head of the association rules; this value is referred to as the maximum item set size in the Body and Head of an association rule.
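The iterative procedure can be sketched as a level-wise search in Python. This is a simplified illustration under stated assumptions, not the Statistica implementation: the function names, the parameters max_body and max_head, and the example data are hypothetical, and the correlation value for rules with more than one item in the Body or Head is assumed here to generalize the pairwise definition (support of Body and Head together, divided by the square root of the product of the Body and Head supports).

```python
from itertools import combinations
from math import sqrt


def support(transactions, items):
    """Relative frequency of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size):
    """Level-wise search: size k+1 candidates come from frequent size k sets."""
    items = sorted({i for t in transactions for i in t})
    level = [s for s in (frozenset([i]) for i in items)
             if support(transactions, s) > min_support]
    found, size = {}, 1
    while level and size <= max_size:
        for s in level:
            found[s] = support(transactions, s)
        # Join step: merge pairs of frequent sets that differ in one item.
        candidates = {a | b for a in level for b in level if len(a | b) == size + 1}
        # Prune step: every size-k subset must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in found for sub in combinations(c, size))
                 and support(transactions, c) > min_support]
        size += 1
    return found


def derive_rules(transactions, min_support, min_confidence, min_correlation,
                 max_body, max_head):
    """Split each frequent item set into Body and Head and apply the thresholds."""
    found = frequent_itemsets(transactions, min_support, max_body + max_head)
    rules = []
    for itemset, supp in found.items():
        for k in range(1, len(itemset)):
            for body_items in combinations(sorted(itemset), k):
                body = frozenset(body_items)
                head = itemset - body
                if len(body) > max_body or len(head) > max_head:
                    continue
                conf = supp / found[body]
                # Assumed generalization of the pairwise correlation value.
                corr = supp / sqrt(found[body] * found[head])
                if conf > min_confidence and corr > min_correlation:
                    rules.append((tuple(sorted(body)), tuple(sorted(head)),
                                  supp, conf, corr))
    return rules


# Hypothetical transactions, one set of items per transaction.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "butter"},
    {"bread", "milk"},
]

found_rules = derive_rules(transactions, min_support=0.3, min_confidence=0.7,
                           min_correlation=0.5, max_body=2, max_head=2)
```

In this example the search stops before reaching three-item sets, because the pair {butter, milk} does not exceed the minimum support; the size limits max_body and max_head matter only when the thresholds alone would allow longer rules.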