NLP Operators
You can use the NLP (Natural Language Processing) operators to process and extract text from human-language documents.
- N-gram Dictionary Builder
An n-gram is a sequence of one or more tokens that might appear in a text corpus. The N-gram Dictionary Builder operator parses each document in the corpus into tokens, and then into all possible n-grams (sequences of consecutive tokens).
- N-gram Dictionary Loader
Creates an N-gram dictionary object from two inputs: a dictionary data set (with exactly the same columns as the output dictionary data set created by the N-gram Dictionary Builder operator) and the location of the N-gram Dictionary Builder configuration file. This file is always stored in HDFS when an N-gram Dictionary Builder operator is trained, and it has the output suffix _dictInfo.
- Text Extractor
With the Text Extractor, you can select an HDFS input directory that contains a set of documents, and then parse that content to create a new data set that contains the parsed text.
- Text Featurizer
Parses a corpus of text into numeric features. You can select which metric(s) to compute for each document and for each of the selected n-grams or hashed features.
- Stop Words
Stop words are words that are very common or not useful for an analysis.
- LDA Predictor
Uses both the model trained by the LDA Trainer and a tabular data set to output topic predictions for the new documents in various formats.
- LDA Trainer
LDA (Latent Dirichlet Allocation) is an unsupervised text-mining algorithm used to analyze collections of unstructured documents.
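To make the ideas behind the n-gram, stop-word, and featurization operators more concrete, the following is a minimal Python sketch. It is not the operators' implementation: the function names (tokenize, ngrams, build_dictionary, featurize), the whitespace tokenizer, the tiny stop-word set, and the raw-count metric are all assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "is"}


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngrams(tokens, max_n):
    """Return every run of 1..max_n consecutive tokens, space-joined."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def build_dictionary(corpus, max_n=2):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for doc in corpus:
        # Assumption: stop words are dropped before n-grams are formed.
        tokens = [t for t in tokenize(doc) if t not in STOP_WORDS]
        counts.update(ngrams(tokens, max_n))
    return counts


def featurize(doc, vocabulary, max_n=2):
    """Map one document to raw counts, one value per selected n-gram."""
    tokens = [t for t in tokenize(doc) if t not in STOP_WORDS]
    counts = Counter(ngrams(tokens, max_n))
    return [counts[g] for g in vocabulary]


corpus = ["the cat sat", "the cat ran"]
dictionary = build_dictionary(corpus)
vocabulary = sorted(dictionary)  # fixed column order for the features
features = [featurize(doc, vocabulary) for doc in corpus]
```

In this sketch each document becomes one row of counts, with one column per n-gram in the dictionary. Removing stop words before forming n-grams is a simplification; it can create bigrams from tokens that were not adjacent in the original text.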
Copyright © 2021. Cloud Software Group, Inc. All Rights Reserved.