LDA Predictor

Uses both the model trained by the LDA Trainer and a tabular data set to output topic predictions for new documents in various formats.

Information at a Glance

Category: NLP
Data source type: HD
Sends output to other operators: Yes [1]
Data processing tool: Spark

This operator also stores in HDFS the featurized data set used as input for LDA prediction, created using the featurization parameters specified in the LDA Trainer. You can drag this data set onto the canvas for further analysis.

For more information about using LDA, see Unsupervised Text Mining and LDA Training and Model Evaluation Tips.

Input

The LDA Predictor requires the following two inputs.

  • The output model from an LDA Trainer.
  • A tabular data set with at least a unique document ID column and a text content column (for example, output of the Text Extractor operator).
    Note: The LDA Predictor operator includes text featurization of the Text Column, which converts raw text to n-gram features (sketched below).
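The featurization itself is governed by the parameters set in the LDA Trainer. As a rough illustration only, this kind of n-gram featurization can be expressed with PySpark's standard feature transformers (the column names, bigram size, and vocabulary size below are assumptions, not the operator's actual defaults):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import RegexTokenizer, NGram, CountVectorizer

    # Tokenize the text column, form bigrams, and count them into the
    # sparse vectors that LDA consumes. All names here are illustrative.
    tokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W+")
    bigrams = NGram(n=2, inputCol="tokens", outputCol="ngrams")
    counts = CountVectorizer(inputCol="ngrams", outputCol="features", vocabSize=10000)

    pipeline = Pipeline(stages=[tokenizer, bigrams, counts])
    featurized = pipeline.fit(docs).transform(docs)  # docs: DataFrame with doc_id, text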
Bad or Missing Values
If a row contains a null value in either the Doc ID column or the Text column, the row is removed from the data set. The number of rows removed can be listed in the Summary section of the output (depending on the option chosen for Write Rows Removed Due to Null Data To File).
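As a rough PySpark equivalent of this null handling (the doc_id and text column names and the output path are assumptions for illustration):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("d1", "some text"), ("d2", None), (None, "more text")],
        ["doc_id", "text"])

    # Rows that are null in either key column are removed from the analysis.
    bad = df.filter(F.col("doc_id").isNull() | F.col("text").isNull())
    clean = df.dropna(subset=["doc_id", "text"])

    # "Write All Null Rows to File" corresponds roughly to:
    bad.write.mode("overwrite").parquet("output_dir/bad_data")  # path is illustrative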

Configuration

Notes
  Any notes or helpful information about this operator's parameter settings. When you enter content in the Notes field, a yellow asterisk is displayed on the operator.

Document ID Column (required)
  The column that contains the unique document ID (String, Int, or Long).
  Note: This column must contain unique IDs. If duplicate values are found, the job fails with an error message (see the uniqueness check sketched after this table).

Text Column (required)
  The column that contains the document text content, which is featurized based on the N-Gram Dictionary Builder and Featurization input parameters.

Other Columns to Keep
  Columns to keep in the output data set.

Write Rows Removed Due to Null Data To File (required)
  Rows with null values in the Doc ID Column or Text Column are removed from the analysis. This parameter specifies whether the removed rows are written to a file. The file is written to the same directory as the rest of the output, with the filename bad_data.
  • Do Not Write Null Rows to File - remove rows with null values and display the count in the results UI, but do not write them to an external file.
  • Do Not Write or Count Null Rows (Fastest) - remove rows with null values, but do not count them or display them in the results UI.
  • Write All Null Rows to File - remove rows with null values and write all removed rows to an external file.

Compression
  Select the type of compression for the output.
  Available Parquet compression options:
  • GZIP
  • Deflate
  • Snappy
  • No compression
  Available Avro compression options:
  • Deflate
  • Snappy
  • No compression

Output Directory
  The location to store the output files.

Output Name
  The name of the file that contains the results.

Overwrite Output
  Specifies whether to delete existing data at the output path.
  • Yes - if the path exists, delete the existing file and save the results.
  • No - fail if the path already exists.

Advanced Spark Settings Automatic Optimization
  • Yes - use the default Spark optimization settings.
  • No - provide customized Spark optimization. Click Edit Settings to customize Spark optimization. See Advanced Settings Dialog Box for more information.
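Because duplicate document IDs cause the job to fail, it can be worth verifying uniqueness before running the operator. A minimal PySpark sketch, assuming the input is loaded as a DataFrame named docs with a doc_id column (both names are illustrative):

    from pyspark.sql import functions as F

    # Find document IDs that occur more than once.
    dupes = docs.groupBy("doc_id").count().filter(F.col("count") > 1)
    if dupes.limit(1).count() > 0:
        raise ValueError("Document ID column contains duplicate values")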

Outputs

Visual Output
  • A tabular data set output with the following columns (which can be further filtered, depending on the columns needed for the user's specific use case):
    • The Doc ID Column and the selected pass-through columns.
    • A top_topics_summary column with the top topics and their corresponding weights in dictionary format. The number of top topics displayed corresponds to the Max Topics to Describe Documents parameter set in the LDA Trainer.
    • The full topic distribution weights (weight_topic_X columns).
    • The top topic distribution as pairs of columns (topic_ranked_X, weight_topic_ranked_X). The number of top topics displayed corresponds to the Max Topics to Describe Documents parameter set in the LDA Trainer.
  • Featurized data set (stored in HDFS): the featurized data set used as input for LDA prediction.
  • Summary tab that describes the parameters selected, the model prediction results, the output location of the results, and the number of rows processed. It also reports the following metrics.

Training data Log Likelihood (entire corpus): Lower bound on the log likelihood of the entire corpus.

Training data Average Log Likelihood (per document): Log Likelihood / Total Number of Documents.

Training data Log Perplexity (per token): Upper bound on the log perplexity per token of the provided documents, given the inferred topics (lower is better). In information theory, perplexity is a measurement of how well a probability distribution (or probability model) predicts a sample, and is a common measure of model performance in topic modeling. More specifically, it measures how well the word counts of the documents are represented by the word distributions represented by the topics. A low perplexity indicates the probability distribution is good at predicting the sample.
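These quantities correspond to the standard Spark LDA diagnostics. A minimal sketch of how they relate, assuming a fitted pyspark.ml.clustering.LDAModel named model and a featurized data set named featurized (both names are placeholders):

    # Lower bound on the log likelihood of the entire corpus.
    log_likelihood = model.logLikelihood(featurized)

    # Per-document average: log likelihood / total number of documents.
    avg_log_likelihood = log_likelihood / featurized.count()

    # Upper bound on the per-token log perplexity (lower is better).
    log_perplexity = model.logPerplexity(featurized)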

Data Output
This is a semi-terminal operator. It can be connected to any subsequent operator at design time, but it does not transmit the full output schema until you run the operator. The partial output schema at design time consists only of the first columns of the output (the Doc ID Column, the selected pass-through columns, and the top_topics_summary column). After you run the operator, the output schema is automatically updated, and subsequent operators turn red if their parameter selections are no longer valid.
Note: The final output schema of the LDA Predictor operator is cleared when you take one of the following actions.
  • Change the configuration properties of the LDA Predictor operator.
  • Change the input connected to the LDA Predictor operator.
  • Clear the step run results of the LDA Predictor operator.

In this case, the output schema transmitted to subsequent operators again becomes the partial schema defined at design time (hence, subsequent operators can turn invalid), and you must run the LDA Predictor operator again to transmit the new output schema.

Example
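A minimal, hypothetical PySpark sketch of the prediction step, showing how the topic-distribution vector relates to weight_topic_X-style output columns. The model path, variable names, and the use of LocalLDAModel are assumptions for illustration, not a description of the operator's internals:

    from pyspark.ml.clustering import LocalLDAModel
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DoubleType

    # Load a previously trained model and score the featurized documents.
    model = LocalLDAModel.load("hdfs:///models/lda")  # illustrative path
    predicted = model.transform(featurized)           # adds a topicDistribution column

    # Expand the topic-distribution vector into one weight_topic_X column per topic.
    to_array = F.udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
    predicted = predicted.withColumn("weights", to_array("topicDistribution"))
    for i in range(model.getK()):
        predicted = predicted.withColumn("weight_topic_%d" % i, F.col("weights")[i])

The ranked topic_ranked_X and weight_topic_ranked_X pairs described under Visual Output can be derived from the same vector by sorting each row's weights in descending order.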



[1] The full output schema is not available until you step-run the operator. After you run this operator, the output schema automatically updates, and subsequent operators either validate or turn red, depending on the structure of the output data.