
Topic Models (LDA)

Master generative directed graph models for text analysis. Learn Latent Dirichlet Allocation (LDA), bag-of-words representation, Dirichlet distributions, and topic inference for document clustering.

Module 7 of 7
Intermediate to Advanced
120-150 min

Core Definition

Topic Models are generative directed graph models for discrete data (primarily text). Latent Dirichlet Allocation (LDA) is the most popular topic model, which assumes documents are mixtures of topics, and topics are distributions over words.

Key Concepts

  • Bag-of-words: Document represented as unordered word counts (ignores word order); see the sketch after this list
  • Topics: Latent (hidden) themes that generate words in documents
  • Word distributions: Each topic has a probability distribution over vocabulary words
  • Document-topic distribution: Each document has a probability distribution over topics
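
To make the bag-of-words idea concrete, here is a minimal Python sketch (the sentence is made up for illustration): a document is reduced to a multiset of word counts, and word order plays no role.

```python
from collections import Counter

# A toy document; grammar and word order are discarded.
doc = "the cat sat on the mat because the mat was warm"

# Bag-of-words representation: tokenize, then count occurrences of each word.
bow = Counter(doc.lower().split())

print(bow)  # e.g. Counter({'the': 3, 'mat': 2, 'cat': 1, ...})
```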

Advantages

  • Unsupervised learning
  • Interpretable topics
  • Handles large corpora
  • No labeled data needed

Limitations

  • Ignores word order
  • Number of topics must be specified
  • May find spurious topics
  • Computationally intensive

LDA Model Structure

LDA is a generative model that describes how documents are generated from topics and words:

Variables

  • D = collection of documents
  • d = a single document
  • w_{d,n} = n-th word in document d
  • z_{d,n} = topic assignment for word w_{d,n} (hidden)
  • \theta_d = document-topic distribution for document d
  • \phi_k = topic-word distribution for topic k

Parameters

  • \alpha = Dirichlet parameter for document-topic distributions (K-dimensional vector)
  • \eta = Dirichlet parameter for topic-word distributions (V-dimensional vector, V = vocabulary size)
  • K = number of topics (user-specified hyperparameter)

Core Probability Formulas

Document-Topic Distribution

Each document has a distribution over topics, drawn from a Dirichlet distribution:

\theta_d \sim \text{Dirichlet}(\alpha)

\theta_d is a K-dimensional probability vector: \theta_d = (\theta_{d,1}, \ldots, \theta_{d,K}), where \theta_{d,k} is the probability of topic k in document d.

Topic-Word Distribution

Each topic has a distribution over words, drawn from a Dirichlet distribution:

\phi_k \sim \text{Dirichlet}(\eta)

\phi_k is a V-dimensional probability vector: \phi_k = (\phi_{k,1}, \ldots, \phi_{k,V}), where \phi_{k,v} is the probability of word v in topic k.
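
As a small sketch of these two sampling steps (using NumPy, with made-up values for K, V, \alpha, and \eta):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 10             # number of topics and vocabulary size (illustrative values)
alpha = np.full(K, 0.5)  # document-topic Dirichlet parameter
eta = np.full(V, 0.1)    # topic-word Dirichlet parameter

# theta_d ~ Dirichlet(alpha): a K-dimensional topic mixture for one document
theta_d = rng.dirichlet(alpha)

# phi_k ~ Dirichlet(eta): a V-dimensional word distribution for each topic
phi = np.array([rng.dirichlet(eta) for _ in range(K)])

print(theta_d, theta_d.sum())      # K probabilities summing to 1
print(phi.shape, phi.sum(axis=1))  # (3, 10), each row sums to 1
```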

Word Generation Process

For each word w_{d,n} in document d:

  1. Sample topic: z_{d,n} \sim \text{Multinomial}(\theta_d)
  2. Sample word: w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}})

The word is generated by first choosing a topic from the document's topic distribution, then choosing a word from that topic's word distribution.
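
A minimal sketch of this two-step process, continuing the NumPy example above (the vocabulary and document length are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["game", "team", "score", "data", "model",
         "neuron", "brain", "algorithm", "match", "player"]  # toy vocabulary
K, V, N = 3, len(vocab), 8                                   # topics, vocab size, words per document

theta_d = rng.dirichlet(np.full(K, 0.5))                            # document-topic distribution
phi = np.array([rng.dirichlet(np.full(V, 0.1)) for _ in range(K)])  # topic-word distributions

doc = []
for n in range(N):
    z = rng.choice(K, p=theta_d)  # step 1: sample a topic from the document's topic mixture
    w = rng.choice(V, p=phi[z])   # step 2: sample a word from that topic's word distribution
    doc.append(vocab[w])

print(doc)
```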

Dirichlet Distribution

The Dirichlet distribution is a probability distribution over probability vectors (vectors that sum to 1). It's the conjugate prior for the multinomial distribution, making it ideal for LDA.

Probability Density Function

P(\theta | \alpha) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i - 1}

Where \Gamma is the gamma function, \alpha = (\alpha_1, \ldots, \alpha_K) is the concentration parameter, and \theta = (\theta_1, \ldots, \theta_K) is a probability vector.

Properties:

  • \alpha_i controls concentration: larger values yield more uniform distributions
  • \alpha_i < 1 encourages sparse distributions (few topics/words have high probability)
  • \alpha_i > 1 encourages near-uniform distributions
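
A quick way to see the effect of the concentration parameter (a sketch with arbitrary \alpha values):

```python
import numpy as np

rng = np.random.default_rng(2)

sparse_alpha = np.full(5, 0.1)    # alpha_i < 1: mass piles onto a few components
uniform_alpha = np.full(5, 10.0)  # alpha_i > 1: samples stay close to uniform

print(rng.dirichlet(sparse_alpha))   # typically one or two large entries, the rest near 0
print(rng.dirichlet(uniform_alpha))  # typically all entries near 0.2
```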

Parameter Learning

Given a collection of documents, learn the topic-word distributions \phi_k and document-topic distributions \theta_d. The topic assignments z_{d,n} are hidden variables.

Collapsed Gibbs Sampling

The most common method for learning LDA parameters. It iteratively samples the topic assignment z_{d,n} for each word, then estimates \phi_k and \theta_d from the samples.

Sampling Formula:

P(z_{d,n} = k | z_{-(d,n)}, w, \alpha, \eta) \propto \frac{n_{d,k}^{-dn} + \alpha_k}{\sum_{k'} (n_{d,k'}^{-dn} + \alpha_{k'})} \cdot \frac{n_{k,v}^{-dn} + \eta_v}{\sum_{v'} (n_{k,v'}^{-dn} + \eta_{v'})}

Where n_{d,k}^{-dn} = count of topic k in document d (excluding the current word), and n_{k,v}^{-dn} = count of word v in topic k (excluding the current word).
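
A compact sketch of this update on a toy corpus (word IDs, counts, and the symmetric \alpha, \eta values are made up; the document-side denominator is the same for every k, so it is dropped from the proportionality):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy corpus: each document is a list of word IDs (illustrative only).
docs = [[0, 1, 2, 0], [3, 4, 3, 5], [0, 2, 4, 5]]
K, V = 2, 6
alpha, eta = 0.5, 0.01            # symmetric hyperparameters (assumed values)

# Random initial topic assignments and the count tables they induce.
z = [[rng.integers(K) for _ in d] for d in docs]
n_dk = np.zeros((len(docs), K))   # topic counts per document
n_kv = np.zeros((K, V))           # word counts per topic
n_k = np.zeros(K)                 # total words per topic
for d, doc in enumerate(docs):
    for w, t in zip(doc, z[d]):
        n_dk[d, t] += 1
        n_kv[t, w] += 1
        n_k[t] += 1

for it in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            # Remove the current word from the counts (the "-dn" superscript).
            n_dk[d, t] -= 1; n_kv[t, w] -= 1; n_k[t] -= 1
            # Unnormalized conditional probability of each topic k.
            p = (n_dk[d] + alpha) * (n_kv[:, w] + eta) / (n_k + V * eta)
            t = rng.choice(K, p=p / p.sum())
            # Add the word back under its newly sampled topic.
            z[d][n] = t
            n_dk[d, t] += 1; n_kv[t, w] += 1; n_k[t] += 1

# Point estimates of theta_d and phi_k from the final counts.
theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
phi = (n_kv + eta) / (n_kv + eta).sum(axis=1, keepdims=True)
print(theta.round(2))
print(phi.round(2))
```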

Variational Inference

An alternative approach that approximates the posterior distribution with variational inference. It is typically faster than Gibbs sampling but introduces approximation error.
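
For example, scikit-learn's LatentDirichletAllocation estimator fits LDA with variational Bayes; a minimal sketch on a made-up document-term count matrix (the matrix and prior values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: 4 documents x 6 vocabulary words (made up).
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 1, 0, 3, 3, 2],
])

lda = LatentDirichletAllocation(
    n_components=2,         # K, the number of topics
    doc_topic_prior=0.5,    # alpha
    topic_word_prior=0.01,  # eta
    random_state=0,
)
doc_topic = lda.fit_transform(X)   # document-topic distributions (theta)
topic_word = lda.components_       # pseudo-counts; normalize rows to get phi

print(doc_topic.round(2))
print((topic_word / topic_word.sum(axis=1, keepdims=True)).round(2))
```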

Document Collection Example

Apply LDA to a collection of 5 documents. Learn 3 topics (Tech, Sports, Science) and infer document-topic distributions.

Documents with Inferred Topic Distributions

Document | Words (Sample) | Topic Distribution
Doc 1 | machine learning algorithm neural network | Tech (0.6), Science (0.3), Other (0.1)
Doc 2 | basketball game player team score | Sports (0.7), Entertainment (0.2), Other (0.1)
Doc 3 | neural network brain neuron signal | Science (0.5), Tech (0.4), Other (0.1)
Doc 4 | football match goal team win | Sports (0.8), Entertainment (0.1), Other (0.1)
Doc 5 | algorithm data structure computer | Tech (0.7), Science (0.2), Other (0.1)

LDA learns that Doc 1 and Doc 5 are about Tech, Doc 2 and Doc 4 are about Sports, and Doc 3 is about Science. Each document is a mixture of topics.
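
A sketch of running this kind of experiment with gensim, assuming it is installed (the documents are taken from the table above; with such a tiny corpus the learned topics will not exactly reproduce the illustrative numbers):

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [
    "machine learning algorithm neural network".split(),
    "basketball game player team score".split(),
    "neural network brain neuron signal".split(),
    "football match goal team win".split(),
    "algorithm data structure computer".split(),
]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3,
               passes=50, random_state=0)

for d, bow in enumerate(corpus):
    print(f"Doc {d + 1}:", lda.get_document_topics(bow))  # inferred theta_d
for k in range(3):
    print(f"Topic {k}:", lda.show_topic(k, topn=5))        # top words of phi_k
```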

Learned Topic-Word Distributions

Topic 1: Tech

Top words: machine (0.15), learning (0.12), algorithm (0.11), computer (0.10), data (0.09), ...

Topic 2: Sports

Top words: game (0.18), team (0.15), player (0.13), score (0.11), match (0.10), ...

Topic 3: Science

Top words: neural (0.14), network (0.12), brain (0.11), neuron (0.10), signal (0.09), ...

LDA Result:

LDA successfully discovers 3 interpretable topics from the document collection. Each topic has a clear semantic meaning, and documents are assigned to topics based on their content.

Applications

Text Analysis Tasks

  • Document clustering: Group similar documents
  • Topic discovery: Find hidden themes in text corpora
  • Information retrieval: Improve search by topic matching
  • Text summarization: Extract key topics from documents

Real-World Examples

  • News articles: Discover topics (politics, sports, tech)
  • Research papers: Find research themes and trends
  • Social media: Analyze trending topics and discussions
  • Customer reviews: Extract product features and opinions

Advantages and Limitations

Advantages

  • Unsupervised learning: No labeled data required
  • Interpretable topics: Topics have clear semantic meaning
  • Handles large corpora: Scalable to millions of documents
  • Flexible: Can model any text collection

Limitations

  • Ignores word order: Bag-of-words assumption loses syntax
  • Number of topics: Must be specified a priori
  • May find spurious topics: Topics may not be meaningful
  • Computationally intensive: Training can be slow for large corpora