Fuzzy Join

Performs a fuzzy matching join to connect two data sets based on nearly matching string values.

Information at a Glance

Category: Transform
Data source type: HD
Sends output to other operators: Yes
Data processing tool: Spark

Input

Exactly two HDFS tabular inputs.

Bad or Missing Values

If the output contains empty strings in the first column, these empty strings are written as a pair of double quote characters ("") instead (and are not detected as empty strings by subsequent operators).

This issue is due to a bug in spark-csv for Spark version 1.6.1, where the CSV writer always quotes an empty string when it is the first value in a row (see https://issues.apache.org/jira/browse/CSV-63 for details). To work around it, replace all empty strings and null values in the column that appears first in the Fuzzy Join output before running the operator.
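
Conceptually, the workaround can be applied with a short PySpark step before the Fuzzy Join. The sketch below is illustrative only: the input path, the customer_name column, and the "NA" placeholder are assumptions, and it uses the Spark 2+ CSV API rather than the spark-csv package bundled with Spark 1.6.1.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical left input; the first kept column ("customer_name") is
    # assumed to become the first column of the Fuzzy Join output.
    left = spark.read.csv("hdfs:///data/left_input.csv", header=True)

    first_col = "customer_name"  # assumption: first column kept in the output
    # Replace nulls and empty strings with a visible placeholder so the CSV
    # writer never emits a quoted empty field at the start of a row.
    left = left.withColumn(
        first_col,
        F.when(F.col(first_col).isNull() | (F.col(first_col) == ""), F.lit("NA"))
         .otherwise(F.col(first_col)),
    )
    left.write.csv("hdfs:///data/left_input_clean", header=True)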

Restrictions

This operator accepts two inputs only.

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
Left data set

*required

The left data set input for the join.
Join On Column

*required

The joining column from the Left data set. This column must be categorical (chararray).
Columns To Keep The columns to keep from the Left data set.
Right data set The right data set input for the join.
Join On Column

*required

The joining column from the Right data set. This column must be categorical (chararray).
Columns to Keep The columns to keep from the Right data set.
Join Type

*required

The join type.
  • Inner
  • Left Outer
  • Right Outer
  • Full Outer
Pre-Processing on Columns to Match Select zero or more text pre-processing options to apply to the two selected Join On columns before executing the fuzzy join (a rough sketch of these steps follows this list).
  • Normalize text (trim, strip diacritics and accents, transform to lower case)
  • Remove Punctuation
  • Stem Words (using the Snowball Stemmer from the Apache OpenNLP tools; see the OpenNLP Snowball Stemmer documentation, v1.6.0)
  • Clear Stop Words (from default list or custom list of stop words; see Custom Stop Words File parameter below)
  • Sort Words Alphabetically (words are considered distinct when separated by a space)
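
For illustration only, the following Python sketch approximates these steps (stemming omitted); the stop-word list shown is a stand-in, not the operator's default list, and the exact behavior of the operator may differ.

    import string
    import unicodedata

    STOP_WORDS = {"the", "a", "an", "of", "and"}  # stand-in list for illustration

    def preprocess(text, normalize=True, remove_punctuation=True,
                   clear_stop_words=True, sort_words=True):
        """Approximate the operator's pre-processing options on one join value."""
        if normalize:
            # Trim, strip diacritics/accents, and lower-case the text.
            text = unicodedata.normalize("NFD", text.strip())
            text = "".join(c for c in text if not unicodedata.combining(c)).lower()
        if remove_punctuation:
            text = text.translate(str.maketrans("", "", string.punctuation))
        words = text.split()
        if clear_stop_words:
            words = [w for w in words if w not in STOP_WORDS]
        if sort_words:
            words = sorted(words)  # words are distinct when separated by a space
        return " ".join(words)

    print(preprocess("  The Café of Montréal  "))  # -> "cafe montreal"
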
Custom Stop Words File If the Clear Stop Words option is selected in Pre-Processing on Columns to Match and this box is left blank, a standard set of stop words is used. You can find this list here.

Otherwise, choose a file that contains a list of the stop words. This list must be one word per line and should be small enough to fit in memory.

Note: If additional pre-processing transformations are selected, these transformations are applied to the list of stop words as well.
Stemmer Algorithm Define the stemming algorithm to use if the Stem Words option is selected in Pre-Processing on Columns to Match.
  • English (the default)
  • Porter
  • French
  • German
  • Italian
  • Russian
  • Dutch
  • Finnish
  • Swedish
  • Spanish

If you use a language other than English, you must add a custom stop word file.

Match Threshold (%)

*required

Select the minimum match score (inclusive) that the two (optionally pre-processed) strings must reach to be considered a match. The value must be a Double in [0, 100].

The match score between strings A and B is computed using the formula: 100.0 - (100 * Damerau-Levenshtein distance(A, B) / max_length(A, B)).

If both values A and B are null or empty (that is, if max_length(A, B) = 0), the match score is 100. Thus, they are considered a match.

For details, see Wikipedia: Damerau-Levenshtein distance.
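
To make the formula concrete, the following self-contained Python sketch computes the score. It uses the restricted Damerau-Levenshtein (optimal string alignment) variant, which may differ in edge cases from the operator's exact implementation.

    def damerau_levenshtein(a, b):
        """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            d[i][0] = i
        for j in range(len(b) + 1):
            d[0][j] = j
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[len(a)][len(b)]

    def match_score(a, b):
        """100.0 - (100 * distance / max_length); empty vs. empty counts as 100."""
        if max(len(a), len(b)) == 0:
            return 100.0
        return 100.0 - 100.0 * damerau_levenshtein(a, b) / max(len(a), len(b))

    print(match_score("john smith", "jon smith"))  # 90.0 -> a match at any threshold <= 90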

Suffix for Duplicate Columns in Right data set

*required

If some columns to keep from the left and right data sets have the same name, this suffix is appended (preceded by an underscore) to those output columns of the right data set. For example, with the suffix right, a duplicate column name from the right data set appears in the output as name_right.
Storage Format Select the format in which to store the results. The storage format is determined by your type of operator.

Typical formats are Avro, CSV, TSV, or Parquet.

Compression Select the type of compression for the output.
Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • no compression

Available Avro compression options:

  • Deflate
  • Snappy
  • no compression
Output Directory The location to store the output files.
Output Name The name of the output file that contains the results.
Overwrite Output Specifies whether to delete existing data at that path.
  • Yes - if the path exists, delete that file and save the results.
  • No - fail if the path already exists.
Advanced Spark Settings Automatic Optimization
  • Yes specifies using the default Spark optimization settings.
  • No enables providing customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.

Output

Visual Output
Output - A preview of the data output.

Summary - A summary of the selected parameters and a message that provides the results location.

Data Output
The joined Hadoop data set with columns to keep selected from each input, plus the match score column (match_score).
Note: If any pre-processing operations are selected on the join columns, the transformed join columns appear in the output instead of the original input columns.

Example