LDA Predictor
Uses both the model trained by the LDA Trainer and a tabular data set to output topic prediction for the new documents in various formats.
Information at a Glance
|
Parameter |
Description |
|---|---|
| Category | NLP |
| Data source type | HD |
| Send output to other operators | Yes1 |
| Data processing tool | Spark |
This operator also stores in HDFS the Featurized data set used as input for LDA Prediction (and created from the featurization parameters specified in the LDA Trainer). You can drag this data set onto the canvas for further analysis.
For more information about using LDA, see Unsupervised Text Mining and LDA Training and Model Evaluation Tips.
Input
The LDA Predictor requires the following 2 inputs.
- The output model from an LDA Trainer.
- A tabular data set with at least a unique document ID column and a text content column (for example, output of the Text Extractor operator).
Note: The LDA Predictor operator includes the Text Featurization of Text Column, which converts raw text to n-gram features.
Configuration
| Parameter | Description |
|---|---|
| Notes | Notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk appears on the operator. |
| Document ID Column
*required |
Column that contains the document unique ID (String, Int, or Long).
Note: This column must contain unique IDs. If duplicate values are found, the job fails with an error message.
|
| Text Column
*required |
Column that contains the document text content (that is featurized based on N-Gram Dictionary Builder and Featurization input parameters). |
| Other Columns to Keep | Columns to keep in the output data set. |
| Write Rows Removed Due to Null Data To File
*required |
Rows with null values (in
Doc ID Column or
Text Column) are removed from the analysis. This parameter allows you to specify that the data with null values be written to a file.
The file is written to the same directory as the rest of the output. The filename is bad_data.
|
| Compression | Select the type of compression for the output.
Available Parquet compression options are the following.
Available Avro compression options are the following.
|
| Output Directory | The location to store the output files. |
| Output Name | The name to contain the results. |
| Overwrite Output | Specifies whether to delete existing data at that path.
|
| Advanced Spark Settings Automatic Optimization |
|
Outputs
- A tabular data set output with the following columns (which can be further filtered, depending on the columns needed for the user's specific use case):
- Doc ID Column and Pass Through columns selected
- top_topics_summary column with the top topics and their corresponding weights in a dictionary format (the number of top topics displayed corresponds to the Max Topics to Describe Documents parameter set in the LDA Trainer).
- Full topic distribution weights (weight_topic_X columns)
- Top topic distribution as pairs of columns topic_ranked_X, weight_topic_ranked_X). The number of top topics displayed corresponds to the Max Topics to Describe Documents parameter set in the LDA Trainer)

- Featurized data set (stored in HDFS): featurized data set used as input for LDA prediction:

- Summary tab that describes the parameters selected, model prediction results, output location of the results, and the number of rows processed

Training data Log Likelihood (entire corpus): Lower bound on the log likelihood of the entire corpus.
Training data Average Log Likelihood (per document): Log Likelihood/Total Number of Documents
Training data Log Perplexity (per token): Upper bound on the log perplexity per token of the provided documents, given the inferred topics (lower is better). In information theory, perplexity is a measurement of how well a probability distribution (or probability model) predicts a sample, and is a common measure of model performance in topic modeling. More specifically, it measures how well the word counts of the documents are represented by the word distributions represented by the topics. A low perplexity indicates the probability distribution is good at predicting the sample.
- Change the configuration properties of the Transpose operator.
- Change the input connected to the Transpose operator.
- Clear the step run results of the Transpose operator.
In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid), and you must run the Transpose operator again to transmit the new output schema.
Example