LDA Training and Model Evaluation Tips

When using the LDA Trainer and LDA Predictor, the following guidelines can help you produce more meaningful results.

  • Select the right n-grams: make sure the N-Gram Dictionary and the n-gram selection method are relevant to your corpus (for example, by specifying or updating a customized Stop Words File in the N-Gram Dictionary Builder, and by changing the n-gram selection method):
    • Filter out common stop words and any other words that are not relevant to your use case.
    • Don't allow very high-frequency words to overpower the rest of the corpus.
    • Very infrequent words are usually not useful either, as they rarely contribute to coherent topics.
  • Run the LDA training for long enough (it can take many iterations to obtain relevant topics).
  • Try different parameters (number of topics, etc.) and evaluate log perplexity on a held-out sample.
  • Building a good LDA model often requires many iterations and human feedback. Log perplexity is useful for relative comparisons between models or parameter settings, but its absolute value is hard to interpret and is not well correlated with human judgment.
    • Inspect the topics: look at the highest-likelihood words in each topic. Do they form a cohesive theme, or just a random group of words?
    • Inspect the topic assignments: hold out a few random documents from training and see which topics LDA assigns to them. Manually compare the documents with the top words of their assigned topics. Do the topics describe what the documents are actually about?
  • Look at the probability density of each topic's words: if a topic's constituent words all have weak/low probabilities, it is most likely a weak topic.
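The vocabulary advice above (filter stop words, very frequent words, and very infrequent words) can be sketched in plain Python. The thresholds, stop-word list, and helper name below are illustrative assumptions, not part of the N-Gram Dictionary Builder itself:

```python
# Illustrative sketch of vocabulary filtering before LDA training.
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "on", "in", "is"}  # customize per use case

def build_vocabulary(documents, min_doc_freq=2, max_doc_ratio=0.8):
    """Keep words that are neither too rare nor too frequent across documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    n_docs = len(documents)
    return {
        word
        for word, df in doc_freq.items()
        if word not in STOP_WORDS
        and df >= min_doc_freq             # drop very infrequent words
        and df / n_docs <= max_doc_ratio   # drop words that overpower the corpus
    }

docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["a", "cat", "and", "a", "dog"],
]
vocab = build_vocabulary(docs)
# Stop words and the one-off words "mat" and "log" are filtered out.
```

Filtering on *document* frequency (rather than raw counts) is a common choice because a word repeated many times in one document should not count as corpus-wide frequent.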
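Comparing parameter settings by held-out log perplexity can look like the following sketch. scikit-learn is used here only as an illustrative stand-in for whatever LDA implementation you are actually running; the toy corpus and parameter values are assumptions:

```python
# Hedged sketch: compare LDA settings by perplexity on a held-out sample.
# Lower held-out perplexity is better, but only for *relative* comparison.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

corpus = [
    "cats purr and chase mice",
    "dogs bark and chase cats",
    "stocks rose as markets rallied",
    "investors sold stocks when markets fell",
    "kittens and puppies play together",
    "traders watch markets and stocks daily",
] * 5  # repeat the toy corpus so the train/held-out split is non-trivial

X = CountVectorizer(stop_words="english").fit_transform(corpus)
X_train, X_held_out = train_test_split(X, test_size=0.3, random_state=0)

for n_topics in (2, 5):
    lda = LatentDirichletAllocation(
        n_components=n_topics, max_iter=20, random_state=0
    )
    lda.fit(X_train)
    print(n_topics, lda.perplexity(X_held_out))
```

The key point is that the perplexity numbers are only meaningful next to each other, evaluated on the same held-out sample; never read one in isolation.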
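The "inspect the topics" step can be automated just enough to put the highest-likelihood words in front of a human. The `topic_word_probs` structure below is a hypothetical stand-in for a trained model's topic-word distribution:

```python
# Minimal sketch of topic inspection: list each topic's highest-likelihood
# words so a human can judge whether they form a cohesive theme.
topic_word_probs = {
    0: {"cat": 0.21, "dog": 0.18, "pet": 0.15, "market": 0.01, "stock": 0.01},
    1: {"stock": 0.25, "market": 0.22, "trade": 0.14, "cat": 0.01, "dog": 0.01},
}

def top_words(word_probs, k=3):
    """Return the k most likely words of one topic, most likely first."""
    return sorted(word_probs, key=word_probs.get, reverse=True)[:k]

for topic_id, probs in topic_word_probs.items():
    print(f"Topic {topic_id}: {top_words(probs)}")
# Topic 0 reads like a cohesive "pets" theme, topic 1 like "finance" —
# the kind of judgment call only a human reviewer can make.
```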
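The final check on word density can be made concrete as follows: sum the probability mass carried by a topic's top words, and treat a diffuse topic as suspect. The 0.3 threshold is an illustrative assumption, not a standard value:

```python
# Hedged sketch of the "density" check: if even a topic's top words carry
# little probability mass, the topic is diffuse and probably weak.
def top_k_mass(word_probs, k=10):
    """Total probability mass held by the topic's k most likely words."""
    return sum(sorted(word_probs.values(), reverse=True)[:k])

strong = {"stock": 0.25, "market": 0.22, "trade": 0.14}          # peaked topic
weak = {f"misc{i}": 0.002 for i in range(500)}                   # flat, diffuse topic

for name, topic in (("strong", strong), ("weak", weak)):
    mass = top_k_mass(topic)
    print(name, round(mass, 3), "weak topic?" if mass < 0.3 else "ok")
```

A peaked topic concentrates most of its mass in a handful of words; a flat one spreads tiny probabilities over the whole vocabulary, which is the "weak/low densities" pattern the tip warns about.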