About the Relationship Probability Score (RPS)
Discovery controls which relationships are discovered and how they are scored, using its internal RPS algorithms. Each discovered relationship is assigned an RPS based on these factors:
Column Name Comparison Factor
Index Key Factor
Match Percentage Factor
Number of Matches Factor
Schema Locality Factor
The factors are multiplied by their weights (which total 100%) and then added together to arrive at the total score for the relationship using this formula:
Score = columnNameComparisonFactor * 40% + indexKeyFactor * 30% + matchPercentageFactor * 10% + numberOfMatchesFactor * 10% + schemaLocalityFactor * 10%
The factor weights are configurable for non-string data types. See Adjusting the Weights of the RPS Factors for information about changing the weights of these factors.
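As an illustration of the formula above, the following Python sketch computes the weighted sum using the default weights. It is not Discovery's implementation: the function name `rps_score`, the dictionary layout, and the sample factor values are assumptions made for this example.

```python
# Default weights from the formula above, written as fractions that total 1.0.
DEFAULT_WEIGHTS = {
    "columnNameComparisonFactor": 0.40,
    "indexKeyFactor": 0.30,
    "matchPercentageFactor": 0.10,
    "numberOfMatchesFactor": 0.10,
    "schemaLocalityFactor": 0.10,
}


def rps_score(factors, weights=DEFAULT_WEIGHTS):
    """Return the weighted sum of the factor values (hypothetical helper).

    Each factor value is expected to be in the range 0 to 1.
    """
    # The weights must total 100%, as stated in the text.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must total 100%")
    return sum(weights[name] * factors[name] for name in weights)


# Sample factor values, chosen only for illustration.
example = {
    "columnNameComparisonFactor": 0.9,
    "indexKeyFactor": 1.0,
    "matchPercentageFactor": 0.8,
    "numberOfMatchesFactor": 1.0,
    "schemaLocalityFactor": 1.0,
}

score = rps_score(example)
```

With these sample values the terms are 0.36 + 0.30 + 0.08 + 0.10 + 0.10, so the score is approximately 0.94.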
The following table describes these factors.
| Factor | Description |
|---|---|
| Column Name Comparison Factor | This factor is multiplied by its weight to get the name component of RPS. It ranges from 0 to 1, with 1 being an exact match and 0 being no match. 1.0: The column names of c1 and c2 match exactly. 0.9: The column names match exactly after non-alphanumeric characters are removed. 0.9: One column name ends with the other column name. 0.9: The table name of one column is part of the other column name. 0.5 to 0.8: The column names are similar (to handle misspelled names). |
| Index Key Factor | This factor is multiplied by its weight to get the index key component of RPS. It ranges from 0 to 1 based on the likelihood that one of the columns in the relationship is a key column. 1.0: The relationship cardinality is one-to-one, many-to-one, or one-to-many, and either column has more than 90% unique values. 0.5: The relationship cardinality is many-to-many, with less than 90% unique values in both columns. |
| Match Percentage Factor | This factor is multiplied by its weight to get the match percentage component of RPS. It is calculated using this formula: [# matches] / MIN([# unique values in c1], [# unique values in c2]), where [# matches] is the number of unique values that appear in both c1 and c2. See Adjusting the Minimum Unique Percentage for information about adjusting the threshold of value uniqueness. Example: If c1 has 100 unique values, c2 has 50 unique values, and 40 unique values appear in both c1 and c2, the factor is 40 / MIN(100, 50) = 40 / 50 = 0.8. |
| Number of Matches Factor | This factor is multiplied by its weight to get the number of matches component of RPS. 1.0: [number of matches] >= 10. Otherwise, the factor is [number of matches] / 10. By default, a relationship with fewer than 3 matches is not discovered. |
| Schema Locality Factor | This factor is multiplied by its weight to get the schema locality component of RPS. 1.0: The two columns are from the same data source. 0: The columns are not from the same data source. |
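
The three factors in the table that are defined by explicit arithmetic (match percentage, number of matches, and schema locality) can be sketched in Python as follows. This is a minimal illustration, not Discovery's code: the function names and arguments are hypothetical, returning 0.0 for an empty column is an assumption the table does not cover, and the default minimum of 3 matches is not modeled.

```python
def match_percentage_factor(values_c1, values_c2):
    """[# matches] / MIN([# unique values in c1], [# unique values in c2])."""
    unique_c1, unique_c2 = set(values_c1), set(values_c2)
    smaller = min(len(unique_c1), len(unique_c2))
    if smaller == 0:
        # Assumption: the table does not say what happens for an empty column.
        return 0.0
    # Matches are the unique values that appear in both columns.
    return len(unique_c1 & unique_c2) / smaller


def number_of_matches_factor(number_of_matches):
    """1.0 when there are at least 10 matches, otherwise matches / 10."""
    return 1.0 if number_of_matches >= 10 else number_of_matches / 10


def schema_locality_factor(data_source_c1, data_source_c2):
    """1.0 when both columns come from the same data source, otherwise 0."""
    return 1.0 if data_source_c1 == data_source_c2 else 0.0


# Reproduces the table's example: 100 unique values in c1, 50 in c2, 40 shared.
c1 = range(100)       # unique values 0..99
c2 = range(60, 110)   # unique values 60..109, of which 60..99 also occur in c1
factor = match_percentage_factor(c1, c2)  # 40 / 50
```

Sets are used here because the formula counts unique values only, so duplicate rows in either column do not change the result.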