Master generative directed graphical models for text analysis. Learn Latent Dirichlet Allocation (LDA), bag-of-words representation, Dirichlet distributions, and topic inference for document clustering.
Topic Models are generative directed graphical models for discrete data (primarily text). Latent Dirichlet Allocation (LDA) is the most popular topic model: it assumes documents are mixtures of topics, and topics are distributions over words.
LDA is a generative model that describes how documents are generated from topics and words:
Each document d has a distribution over topics $\theta_d$, drawn from a Dirichlet distribution: $\theta_d \sim \text{Dirichlet}(\alpha)$
$\theta_d$ is a K-dimensional probability vector: $\theta_d = (\theta_{d,1}, \ldots, \theta_{d,K})$ with $\sum_{k=1}^{K} \theta_{d,k} = 1$, where $\theta_{d,k}$ is the probability of topic k in document d.
Each topic k has a distribution over words $\phi_k$, drawn from a Dirichlet distribution: $\phi_k \sim \text{Dirichlet}(\beta)$
$\phi_k$ is a V-dimensional probability vector: $\phi_k = (\phi_{k,1}, \ldots, \phi_{k,V})$ with $\sum_{v=1}^{V} \phi_{k,v} = 1$, where $\phi_{k,v}$ is the probability of word v in topic k.
For each word position n in document d:
The word is generated by first choosing a topic $z_{d,n} \sim \text{Multinomial}(\theta_d)$ from the document's topic distribution, then choosing a word $w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}})$ from that topic's word distribution.
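A minimal simulation of this generative process, assuming toy values for the number of topics K, vocabulary size V, document count, and the priors $\alpha$ and $\beta$ (none of these values come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative values): K topics, V vocabulary words.
K, V, n_docs, doc_len = 3, 8, 5, 20
alpha = np.full(K, 0.5)   # document-topic Dirichlet prior
beta = np.full(V, 0.1)    # topic-word Dirichlet prior

# Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
phi = rng.dirichlet(beta, size=K)          # shape (K, V)

documents = []
for d in range(n_docs):
    theta_d = rng.dirichlet(alpha)         # document-topic distribution
    words = []
    for n in range(doc_len):
        z = rng.choice(K, p=theta_d)       # choose a topic for this position
        w = rng.choice(V, p=phi[z])        # choose a word from that topic
        words.append(w)
    documents.append(words)

print(documents[0])  # word ids of the first generated document
```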
The Dirichlet distribution is a probability distribution over probability vectors (vectors that sum to 1). It's the conjugate prior for the multinomial distribution, making it ideal for LDA.
$$\text{Dirichlet}(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$

Where $\Gamma(\cdot)$ is the gamma function, $\alpha = (\alpha_1, \ldots, \alpha_K)$ is the concentration parameter, and $\theta$ is a probability vector.
Properties:

- Support: probability vectors $\theta$ with $\theta_k \ge 0$ and $\sum_k \theta_k = 1$ (the probability simplex).
- Mean: $E[\theta_k] = \alpha_k / \sum_j \alpha_j$.
- Conjugacy: if $\theta \sim \text{Dirichlet}(\alpha)$ and counts $n_k$ are observed from $\text{Multinomial}(\theta)$, the posterior is $\text{Dirichlet}(\alpha_1 + n_1, \ldots, \alpha_K + n_K)$.
- Small concentration values ($\alpha_k < 1$) favor sparse vectors with most mass on a few components; large values favor more uniform vectors.
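A small numpy sketch illustrating these properties (the concentration values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 1.0])

samples = rng.dirichlet(alpha, size=100_000)
print(samples.sum(axis=1)[:3])   # each sample sums to 1
print(samples.mean(axis=0))      # approx alpha / alpha.sum() = [0.5, 0.25, 0.25]

# Smaller concentration parameters push mass toward sparse vectors.
sparse = rng.dirichlet([0.1, 0.1, 0.1], size=3)
print(sparse.round(3))           # most mass typically on one component
```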
Given a collection of documents, learn the topic-word distributions $\phi_k$ and the document-topic distributions $\theta_d$. The topic assignments $z_{d,n}$ are hidden variables.
Gibbs sampling (in its collapsed form) is the most common method for LDA parameter learning. It iteratively resamples the topic assignment of each word, then estimates $\phi$ and $\theta$ from the samples.
Sampling Formula:
$$P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \left(n_{d,k}^{-i} + \alpha_k\right) \cdot \frac{n_{k,v}^{-i} + \beta_v}{\sum_{v'} \left(n_{k,v'}^{-i} + \beta_{v'}\right)}$$

Where $n_{d,k}^{-i}$ = count of topic k in document d (excluding the current word), $n_{k,v}^{-i}$ = count of word v in topic k (excluding the current word).
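A from-scratch sketch of collapsed Gibbs sampling based on the formula above, assuming symmetric scalar priors and documents given as lists of word ids; the function name and defaults are illustrative:

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    n_dk = np.zeros((D, K))                          # topic counts per document
    n_kv = np.zeros((K, V))                          # word counts per topic
    n_k = np.zeros(K)                                # total words per topic
    z = [np.zeros(len(doc), dtype=int) for doc in docs]

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        for i, v in enumerate(doc):
            k = rng.integers(K)
            z[d][i] = k
            n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, v in enumerate(doc):
                k = z[d][i]
                # Remove the current word's assignment from the counts.
                n_dk[d, k] -= 1; n_kv[k, v] -= 1; n_k[k] -= 1
                # Sampling formula: (n_dk + alpha) * (n_kv + beta) / (n_k + V*beta)
                p = (n_dk[d] + alpha) * (n_kv[:, v] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1; n_kv[k, v] += 1; n_k[k] += 1

    # Posterior mean estimates of theta and phi.
    theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Example: theta, phi = gibbs_lda([[0, 1, 2], [2, 3, 3]], K=2, V=4)
```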
An alternative method uses variational inference to approximate the posterior distribution with a simpler factorized distribution. It is typically faster than Gibbs sampling but may introduce approximation error.
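For instance, scikit-learn's LatentDirichletAllocation implements LDA with (online) variational inference; a minimal usage sketch with illustrative documents and parameters:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["machine learning algorithm neural network",
        "basketball game player team score"]

X = CountVectorizer().fit_transform(docs)           # bag-of-words counts
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)                   # document-topic proportions
print(doc_topics.round(2))
```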
Apply LDA to a collection of 5 documents. Learn 3 topics (Tech, Sports, Science) and infer document-topic distributions.
| Document | Words (Sample) | Topic Distribution |
|---|---|---|
| Doc 1 | machine learning algorithm neural network | Tech (0.6), Science (0.3), Sports (0.1) |
| Doc 2 | basketball game player team score | Sports (0.7), Tech (0.2), Science (0.1) |
| Doc 3 | neural network brain neuron signal | Science (0.5), Tech (0.4), Sports (0.1) |
| Doc 4 | football match goal team win | Sports (0.8), Tech (0.1), Science (0.1) |
| Doc 5 | algorithm data structure computer | Tech (0.7), Science (0.2), Sports (0.1) |
LDA learns that Doc 1 and Doc 5 are about Tech, Doc 2 and Doc 4 are about Sports, and Doc 3 is about Science. Each document is a mixture of topics.
Topic 1 (Tech) top words: machine (0.15), learning (0.12), algorithm (0.11), computer (0.10), data (0.09), ...
Topic 2 (Sports) top words: game (0.18), team (0.15), player (0.13), score (0.11), match (0.10), ...
Topic 3 (Science) top words: neural (0.14), network (0.12), brain (0.11), neuron (0.10), signal (0.09), ...
LDA successfully discovers 3 interpretable topics from the document collection. Each topic has a clear semantic meaning, and documents are assigned to topics based on their content.
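A sketch of fitting LDA to the five example documents with gensim (one possible tool, not prescribed by the text); with such a tiny corpus the recovered topics and proportions will only roughly resemble the tables above and will vary with the random seed:

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [
    "machine learning algorithm neural network".split(),
    "basketball game player team score".split(),
    "neural network brain neuron signal".split(),
    "football match goal team win".split(),
    "algorithm data structure computer".split(),
]

dictionary = corpora.Dictionary(texts)                     # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]            # bag-of-words counts
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=3, passes=100, random_state=0)

for k in range(3):
    print(f"Topic {k}:", lda.print_topic(k, topn=5))       # top words per topic
for d, bow in enumerate(corpus):
    print(f"Doc {d + 1}:", lda.get_document_topics(bow))   # topic mixture per doc
```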