N-gram Dictionary Loader
Creates an N-gram dictionary object from two inputs: a dictionary data set with exactly the same columns as the output dictionary data set created by the N-gram Dictionary Builder operator, and the location of the N-gram Dictionary Builder configuration file. This file is always stored in HDFS when an N-gram Dictionary Builder operator is trained, and has the output suffix _dictInfo.
Information at a Glance
| Parameter | Description |
|---|---|
| Category | NLP |
| Data source type | HD |
| Send output to other operators | Yes |
| Data processing tool | Spark |
For more information, see N-gram Dictionary Builder.
With this operator, you can reuse an N-gram dictionary without having to retrain the N-Gram Dictionary Builder operator each time. You can filter a dictionary created by an N-Gram Dictionary Builder operator in a custom way, and then use it as the new dictionary data set to create an N-gram dictionary object that can be used with Text Featurizer or LDA Trainer operators.
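As a rough illustration of custom filtering, the following Python sketch drops rare N-grams from a small in-memory dictionary while leaving every column untouched. The column names `ngram` and `document_count`, the sample rows, and the threshold are all hypothetical; the real columns are whatever the N-Gram Dictionary Builder wrote, and in a workflow the filtering would normally be done with upstream operators rather than with this code.

```python
# Hypothetical rows: the real column names come from the
# N-Gram Dictionary Builder output and are not known here.
dictionary_rows = [
    {"ngram": "machine learning", "document_count": 42},
    {"ngram": "learning rate", "document_count": 3},
    {"ngram": "neural network", "document_count": 17},
]


def filter_dictionary(rows, min_document_count):
    """Keep rows whose (hypothetical) document_count meets the threshold.

    Rows are returned unchanged, so column names and types are preserved.
    """
    return [row for row in rows if row["document_count"] >= min_document_count]


filtered_rows = filter_dictionary(dictionary_rows, min_document_count=10)
```

The filter only removes rows. It does not rename, drop, or cast columns, which keeps the result compatible with the requirement that the input match the builder's output columns.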
Input
A tabular data set that represents an N-Gram dictionary (most commonly the filtered output dictionary of an N-Gram Dictionary Builder operator), with exactly the same column names and types as the N-Gram Dictionary Builder data set output.
Restrictions
This operator requires an input with the exact same column names and types as the N-Gram Dictionary Builder data set output; otherwise, an error occurs.
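One way to catch a mismatch before running the operator is to compare the two schemas directly. The sketch below is a plain Python helper, not part of the product. It treats a schema as an ordered list of (name, type) pairs. The column names and type strings are hypothetical, and comparing columns by position is an assumption, since the restriction above mentions only names and types.

```python
def schema_mismatches(expected, actual):
    """Return human-readable differences between two schemas.

    Each schema is an ordered list of (column_name, column_type) pairs.
    An empty result means the schemas match exactly.
    """
    problems = []
    if len(expected) != len(actual):
        problems.append(f"expected {len(expected)} columns, found {len(actual)}")
    for position, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            problems.append(f"column {position}: expected {want}, found {got}")
    return problems


# Hypothetical schemas, for illustration only.
builder_schema = [("ngram", "chararray"), ("document_count", "long")]
filtered_schema = [("ngram", "chararray"), ("document_count", "int")]

for problem in schema_mismatches(builder_schema, filtered_schema):
    print(problem)
```

In this example the second column keeps its name but changes type, which is the kind of silent change a cast or aggregation step can introduce while filtering.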
Configuration
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| N-Gram Dictionary Builder Configuration | Select the HDFS directory where the configuration parameters and corpus statistics of a trained N-Gram Dictionary Builder operator are stored. Note: This location is created when the N-Gram Dictionary Builder operator runs, and is stored at the same output path as the N-Gram dictionary data set, with the _dictInfo suffix appended. The configuration contains information on the training corpus of documents as well as the options specified when the N-Gram Dictionary Builder was trained (stemming, case sensitivity, stop words, sentence tokenization, and so on). |
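The naming rule for the configuration location can be sketched in a few lines of Python. This is a small illustrative helper, not a product API. The function name and the example path are hypothetical, and it assumes the suffix is appended directly to the dictionary data set's output path, as described in the note above.

```python
DICT_INFO_SUFFIX = "_dictInfo"


def dict_info_path(dictionary_output_path):
    """Return the configuration location for a dictionary output path.

    Assumes the suffix is appended directly to the output path;
    a trailing slash is removed first so the suffix attaches to the name.
    """
    return dictionary_output_path.rstrip("/") + DICT_INFO_SUFFIX


# Hypothetical HDFS path, for illustration only.
config_path = dict_info_path("/tmp/example_user/ngram_dictionary")
```

The resulting path is the value to look for when selecting the N-Gram Dictionary Builder Configuration parameter.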
Output



Example