N-gram Dictionary Builder
A sequence of tokens (one or greater) that might appear in a text corpus. The N-gram Dictionary operator parses each document in the corpus into tokens, and then into all possible n-grams (combinations of sequential tokens).
Information at a Glance
Parameter |
Description |
---|---|
Category | NLP |
Data source type | HD |
Send output to other operators | Yes |
Data processing tool | Spark |
For more detailed information about working with N-gram Dictionary Builder, see Test Corpus Parsing.
Input
A corpus of documents represented as a data set with one row per document and at least one column of text. You can select the column with the text to analyze from the Text Column parameter.
Configuration
Parameter | Description |
---|---|
Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
Text Column | Select the column with the text of each document. |
Maximum Size of N-Gram | Must be 1, 2 (the default), or 3 (Unigrams, Bigrams, or Trigrams). |
Case Insensitive | If yes (the default), all words in corpus are considered lowercase, so "Fish" and "fish" would be considered the same token. |
Use Sentence Tokenization |
|
Use Stemming |
Stemming means reducing the forms of a word to a single word. For example: "swims", "swimmer", and "swimming" would all be parsed into the token "swim". Stemming is its own complicated subfield of linguistics, so for simplicity, we use the Porter Stemmer from apache.open.nlp.
Note: The stemmer might not produce real English words. For example, the stemmed version of "parsed" is "pars", because stemming removes the "ed" suffix.
|
Filter Stop Words |
Stop words are words that are very common or not useful for the analysis; for example, "a", "the", and "that". You can specify your own list of stop words in the Stop Words File parameter, or use our default list of stop words Stop Words. |
Stop Words File | If left at the default value, a standard set of stop words is used. You can find this list
Stop Words.
Otherwise, choose a file that contains a list of stop words. This list must be one word per line and should be small enough to fit in memory.
|
Output Directory | The location to store the output files. |
Output Name | The name to contain the results. |
Overwrite Output | Specifies whether to delete existing data at that path.
|
Advanced Spark Settings Automatic Optimization |
|
Outputs
Three sections are displayed after the operator is run:
Additional Notes
The data output of this operator is written to HDFS as a delimited file, but is not recognized by TIBCO Data Science – Team Studio as a tabular data set. This is because it is a special n-gram dictionary type that is only recognized by the Text Featurizer operator.
Although you cannot connect a transformation operator such as Summary Statistics (DB) to this operator directly, you can go to the location of the results on HDFS (specified in the Summary tab of the results pane), drag the file(s) onto your workflow, and use that as input to other TIBCO Data Science – Team Studio operators. However, keep in mind that, if you use this method, the files might be stored in multiple parts.