N-gram Dictionary Loader

Creates an N-gram dictionary object from two inputs: a dictionary data set with exactly the same columns as the output dictionary data set created by the N-gram Dictionary Builder operator, and the location of the N-gram Dictionary Builder configuration file. This file is always stored in HDFS when an N-gram Dictionary Builder operator is trained, and it has the output suffix _dictInfo.

Information at a Glance

Category NLP
Data source type HD
Sends output to other operators Yes
Data processing tool Spark

For more information, see N-gram Dictionary Builder.

With this operator, you can reuse an N-gram dictionary without having to retrain the N-Gram Dictionary Builder operator each time. You can filter a dictionary created by an N-Gram Dictionary Builder operator in a custom way, and then use it as the new dictionary data set to create an N-gram dictionary object that can be used with Text Featurizer or LDA Trainer operators.

Input

A tabular data set that represents an N-Gram dictionary (most commonly the output dictionary of an N-Gram Dictionary Builder operator after it has been filtered), with exactly the same column names and types as the N-Gram Dictionary Builder data set output.



Restrictions

This operator requires an input with the exact same column names and types as the N-Gram Dictionary Builder data set output; otherwise, an error occurs.
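
Before connecting a filtered dictionary, it can help to confirm that filtering did not change the schema. The following is a minimal Python sketch and is not part of the operator. The column names and types used here (ngram, count, doc_count) are hypothetical placeholders, so substitute the actual columns of your N-Gram Dictionary Builder output. The sketch compares names and types only and ignores column order.

```python
from typing import List, Tuple

Schema = List[Tuple[str, str]]  # ordered (column name, column type) pairs


def schema_mismatches(expected: Schema, actual: Schema) -> List[str]:
    """Return readable differences between two schemas (empty list if they match)."""
    problems = []
    expected_types = dict(expected)
    actual_types = dict(actual)
    # Every expected column must be present with the same type.
    for name, col_type in expected:
        if name not in actual_types:
            problems.append(f"missing column: {name}")
        elif actual_types[name] != col_type:
            problems.append(
                f"type mismatch for {name}: expected {col_type}, got {actual_types[name]}"
            )
    # No extra columns may be present.
    for name, _ in actual:
        if name not in expected_types:
            problems.append(f"unexpected column: {name}")
    return problems


# Hypothetical schemas, for illustration only.
builder_schema = [("ngram", "chararray"), ("count", "long"), ("doc_count", "long")]
filtered_schema = [("ngram", "chararray"), ("count", "int")]

for problem in schema_mismatches(builder_schema, filtered_schema):
    print(problem)
```

An empty result means the filtered data set still matches the builder output under these assumptions.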

Configuration

Parameter Description
Notes Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.
N-Gram Dictionary Builder Configuration Select the HDFS directory where the configuration parameters and corpus statistics of a trained N-Gram Dictionary Builder operator are stored.
Note: This directory is created when the N-Gram Dictionary Builder operator is run, and it is stored at the same output path as the N-Gram dictionary data set, with the _dictInfo suffix appended.

This configuration file contains information about the training corpus of documents, as well as the options that the user specified when training the N-Gram Dictionary Builder (stemming, case sensitivity, stop words, sentence tokenization, and so on).
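
The naming convention described above can be sketched in Python as follows. This is only an illustration, assuming the suffix is appended directly to the dictionary data set's output path. The function name and the example path are hypothetical, and the actual location should be confirmed in HDFS.

```python
def dict_info_path(dictionary_output_path: str) -> str:
    """Append the _dictInfo suffix to the dictionary data set's output path."""
    # Drop any trailing slash so the suffix attaches to the last path segment.
    return dictionary_output_path.rstrip("/") + "_dictInfo"


# Hypothetical output path, for illustration only.
print(dict_info_path("/tmp/example_workflow/ngram_dictionary"))
```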

Output

Visual Output
Visual output includes Dictionary, Corpus Statistics, and Summary sections.
Dictionary
A table that shows a preview of the first rows of the n-gram dictionary that the operator loads and passes on to subsequent operators.

Corpus Statistics
Shows aggregate counts for number of documents, n-grams, and unique tokens found.

Summary
Lists the parameters that were selected and the location where the results were stored. Use this information to navigate to the full results data set.

Data Output
The N-gram dictionary object, which can be connected to a Text Featurizer or LDA Trainer operator (in combination with a data set input).

Example