NLP Operators

You can use the NLP (Natural Language Processing) operators to extract text from human-language documents and process it for analysis.

N-gram Dictionary Builder
Parses each document in a text corpus into tokens, and then into all possible n-grams. An n-gram is a sequence of one or more consecutive tokens that might appear in the corpus.
N-gram Dictionary Loader
Creates an N-gram dictionary object from two inputs: a dictionary data set, which must have exactly the same columns as the dictionary data set output by the N-gram Dictionary Builder operator, and the location of the N-gram Dictionary Builder configuration file. The configuration file is always stored in HDFS when an N-gram Dictionary Builder operator is trained, and has the output suffix _dictInfo.
Text Extractor
Lets you select an HDFS input directory that contains a set of documents, and then parses those documents to create a new data set that contains the parsed text.
Text Featurizer
Parses a corpus of text into numeric features. You can select which metric(s) to compute for each document and for each of the selected n-grams or hashed features.
Stop Words
Stop words are words that are very common or not useful for an analysis.
LDA Predictor
Uses the model trained by the LDA Trainer and a tabular data set to output topic predictions for new documents in various formats.
LDA Trainer
LDA (Latent Dirichlet Allocation) is an unsupervised text-mining algorithm used to analyze collections of unstructured documents.
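
To make the terms above concrete, the following is a minimal Python sketch of how a document can be split into tokens, filtered for stop words, and expanded into n-grams that are then counted across a corpus. It is an illustration only, not the operators' implementation: the function names (tokenize, ngrams, build_dictionary), the whitespace tokenizer, and the choice to drop stop words before forming n-grams are all assumptions made for this example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def ngrams(tokens, max_n):
    """Return every run of 1..max_n consecutive tokens, ordered by n and then by position."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_dictionary(corpus, max_n=2, stop_words=frozenset()):
    """Count how often each n-gram occurs across all documents in the corpus."""
    counts = Counter()
    for doc in corpus:
        # Assumption for this sketch: stop words are removed before n-grams are formed.
        tokens = [t for t in tokenize(doc) if t not in stop_words]
        counts.update(ngrams(tokens, max_n))
    return counts


corpus = ["the cat sat", "the cat ran"]
dictionary = build_dictionary(corpus, max_n=2, stop_words={"the"})
for gram, count in sorted(dictionary.items()):
    print(gram, count)
```

In this sketch the resulting counts play the role of a simple n-gram dictionary, and per-document counts of the same n-grams would be one possible numeric feature of the kind the Text Featurizer description mentions.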

Related concepts

Modeling Operators