Unsupervised Text Mining
You can perform unsupervised text mining to analyze collections of unstructured documents using the LDA (Latent Dirichlet Allocation) operators.
In LDA, each document can be viewed as a mixture of various topics. This is similar to probabilistic latent semantic analysis (pLSA), except that in LDA, the topic distribution is assumed to have a Dirichlet prior. In practice, this results in more reasonable mixtures of topics in a document.
For example, an LDA model might have topics that can be classified as CAT_related and DOG_related. A topic has probabilities of generating various words: a topic that assigns high probability to words such as "milk," "meow," and "kitten" can be classified and interpreted by the viewer as "CAT_related". Naturally, the word "cat" itself has high probability given this topic. The DOG_related topic likewise has probabilities of generating each word: "puppy," "bark," and "bone" might have high probability. Words without special relevance, such as "the," have roughly even probability between classes (or can be placed into a separate category). A topic is not strongly defined, either semantically or epistemologically; it is identified on the basis of supervised labeling and (manual) pruning, guided by the words' likelihood of co-occurrence. A lexical word may occur in several topics with different probabilities, but with a different typical set of neighboring words in each topic.
Each document is assumed to be characterized by a particular set of topics. This is akin to the standard bag-of-words model assumption, and makes the individual words exchangeable. The topic distributions that LDA learns can be used in several ways:
- Clustering: Topics are cluster centers and documents are associated with multiple clusters (topics). This clustering can help organize or summarize document collections.
- Feature generation: LDA can generate features for other ML algorithms to use. As mentioned above, LDA infers a distribution over topics for each document; with k topics, this gives k numerical features. These features can then be plugged into algorithms such as Logistic Regression or Decision Trees for prediction tasks (see the sketch after this list).
- Dimensionality reduction: Each document's distribution over topics gives a concise summary of the document. Comparing documents in this reduced feature space can be more meaningful than comparing them in the original feature space of words.
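As a concrete illustration of the feature-generation and dimensionality-reduction points above, here is a minimal Scala sketch against the RDD-based spark.mllib API (Spark 1.5.x). It assumes an already-trained `LocalLDAModel` (for example, one produced by the training sketch after the next paragraph); the `labels` map from document id to class label and the name `trainOnTopicFeatures` are purely illustrative.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object LdaTopicFeatures {

  // Use each document's inferred topic distribution (k values summing to 1)
  // as its feature vector, then fit a classifier on those k features.
  // `labels` is a hypothetical map from document id to a 0/1 class label.
  def trainOnTopicFeatures(ldaModel: LocalLDAModel,
                           corpus: RDD[(Long, Vector)],
                           labels: Map[Long, Double]): LogisticRegressionModel = {

    // k numerical features per document: its distribution over the k topics.
    val topicFeatures: RDD[(Long, Vector)] = ldaModel.topicDistributions(corpus)

    // Keep only documents we have labels for and wrap them as LabeledPoints.
    val trainingData: RDD[LabeledPoint] = topicFeatures
      .filter { case (docId, _) => labels.contains(docId) }
      .map { case (docId, topicDist) => LabeledPoint(labels(docId), topicDist) }

    new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(trainingData)
  }
}
```

Because each document is reduced to its k topic weights, the downstream classifier works in a much lower-dimensional space than the raw vocabulary.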
We leverage the MLlib LDA algorithm (Spark version 1.5.1) with the Online Variational Bayes optimizer (original online LDA paper: "Online Learning for Latent Dirichlet Allocation," Hoffman, Blei, and Bach, NIPS 2010). This algorithm uses iterative mini-batch sampling: on each iteration it processes a subset of the corpus and updates the term-topic distribution adaptively. This makes it memory-friendly, especially with a large number of documents or a large vocabulary. It is also preferable to the EM algorithm (also available in MLlib LDA) because it can optimize the parameters (α) of the Dirichlet prior over the per-document topic mixing weights, which can produce better topics.
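Below is a minimal, self-contained sketch of this setup using the RDD-based spark.mllib API in Scala (Spark 1.5.x). The corpus here is a toy set of term-count vectors over a five-word vocabulary, and names such as `OnlineLdaSketch` are illustrative; a real workflow would build the count vectors from an upstream tokenization step.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{LDA, LocalLDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object OnlineLdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("online-lda-sketch").setMaster("local[*]"))

    // Toy bag-of-words corpus: each document is a term-count vector over a
    // five-word vocabulary, keyed by a document id.
    val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
      Vectors.dense(4.0, 0.0, 3.0, 1.0, 0.0),
      Vectors.dense(0.0, 5.0, 0.0, 2.0, 3.0),
      Vectors.dense(3.0, 1.0, 2.0, 0.0, 0.0),
      Vectors.dense(0.0, 4.0, 0.0, 1.0, 4.0)
    )).zipWithIndex.map { case (counts, id) => (id, counts) }.cache()

    // Online variational Bayes: each iteration samples a mini-batch of the
    // corpus and updates the term-topic distribution. Optimizing the document
    // concentration lets the optimizer fit alpha instead of keeping it fixed.
    val optimizer = new OnlineLDAOptimizer()
      .setMiniBatchFraction(1.0) // toy corpus; use a small fraction (e.g. 0.05) on large corpora
      .setOptimizeDocConcentration(true)

    val ldaModel = new LDA()
      .setK(2)
      .setMaxIterations(50)
      .setOptimizer(optimizer)
      .run(corpus)
      .asInstanceOf[LocalLDAModel] // the online optimizer returns a LocalLDAModel

    // Inspect the top terms (by vocabulary index) and their weights per topic.
    ldaModel.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
      case ((termIndices, termWeights), topic) =>
        println(s"topic $topic: " + termIndices.zip(termWeights).mkString(", "))
    }

    sc.stop()
  }
}
```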
LDA Trainer and LDA Predictor work on Hadoop datasets, and you can use them in tandem with our other NLP operators to build complex workflows.