
Topic Models (LDA)

Master generative directed graph models for text analysis. Learn Latent Dirichlet Allocation (LDA), bag-of-words representation, Dirichlet distributions, and topic inference for document clustering.

Module 7 of 7
Intermediate to Advanced
120-150 min

Core Definition

Topic Models are generative directed graph models for discrete data (primarily text). Latent Dirichlet Allocation (LDA) is the most popular topic model, which assumes documents are mixtures of topics, and topics are distributions over words.

Key Concepts

  • Bag-of-words: Document represented as unordered word counts (ignores word order); see the sketch after this list
  • Topics: Latent (hidden) themes that generate words in documents
  • Word distributions: Each topic has a probability distribution over vocabulary words
  • Document-topic distribution: Each document has a probability distribution over topics
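
To make the bag-of-words idea concrete, here is a minimal Python sketch (the sentence is made up for illustration): a document is reduced to a multiset of word counts, and word order plays no role.

```python
from collections import Counter

# A toy document; grammar and word order are discarded.
doc = "the cat sat on the mat because the mat was warm"

# Bag-of-words representation: tokenize, then count occurrences of each word.
bow = Counter(doc.lower().split())

print(bow)  # e.g. Counter({'the': 3, 'mat': 2, 'cat': 1, ...})
```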

Advantages

  • Unsupervised learning
  • Interpretable topics
  • Handles large corpora
  • No labeled data needed

Limitations

  • Ignores word order
  • Number of topics must be specified
  • May find spurious topics
  • Computationally intensive

LDA Model Structure

LDA is a generative model that describes how documents are generated from topics and words:

Variables

  • D = collection of documents
  • d = a single document
  • w_{d,n} = n-th word in document d
  • z_{d,n} = topic assignment for word w_{d,n} (hidden)
  • \theta_d = document-topic distribution for document d
  • \phi_k = topic-word distribution for topic k

Parameters

  • \alpha = Dirichlet parameter for document-topic distributions (K-dimensional vector)
  • \eta = Dirichlet parameter for topic-word distributions (V-dimensional vector, V = vocabulary size)
  • K = number of topics (user-specified hyperparameter)

Core Probability Formulas

Document-Topic Distribution

Each document has a distribution over topics, drawn from a Dirichlet distribution:

\theta_d \sim \text{Dirichlet}(\alpha)

\theta_d is a K-dimensional probability vector: \theta_d = (\theta_{d,1}, \ldots, \theta_{d,K}), where \theta_{d,k} is the probability of topic k in document d.

Topic-Word Distribution

Each topic has a distribution over words, drawn from a Dirichlet distribution:

\phi_k \sim \text{Dirichlet}(\eta)

\phi_k is a V-dimensional probability vector: \phi_k = (\phi_{k,1}, \ldots, \phi_{k,V}), where \phi_{k,v} is the probability of word v in topic k.
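
As a small sketch of these two sampling steps (using NumPy, with made-up values for K, V, \alpha, and \eta):

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 3, 10             # number of topics and vocabulary size (illustrative values)
alpha = np.full(K, 0.5)  # document-topic Dirichlet parameter
eta = np.full(V, 0.1)    # topic-word Dirichlet parameter

# theta_d ~ Dirichlet(alpha): a K-dimensional topic mixture for one document
theta_d = rng.dirichlet(alpha)

# phi_k ~ Dirichlet(eta): a V-dimensional word distribution for each topic
phi = np.array([rng.dirichlet(eta) for _ in range(K)])

print(theta_d, theta_d.sum())      # K probabilities summing to 1
print(phi.shape, phi.sum(axis=1))  # (3, 10), each row sums to 1
```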

Word Generation Process

For each word w_{d,n} in document d:

  1. Sample topic: z_{d,n} \sim \text{Multinomial}(\theta_d)
  2. Sample word: w_{d,n} \sim \text{Multinomial}(\phi_{z_{d,n}})

The word is generated by first choosing a topic from the document's topic distribution, then choosing a word from that topic's word distribution.
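
A minimal sketch of this two-step process, continuing the NumPy example above (the vocabulary and document length are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["game", "team", "score", "data", "model",
         "neuron", "brain", "algorithm", "match", "player"]  # toy vocabulary
K, V, N = 3, len(vocab), 8                                   # topics, vocab size, words per document

theta_d = rng.dirichlet(np.full(K, 0.5))                            # document-topic distribution
phi = np.array([rng.dirichlet(np.full(V, 0.1)) for _ in range(K)])  # topic-word distributions

doc = []
for n in range(N):
    z = rng.choice(K, p=theta_d)  # step 1: sample a topic from the document's topic mixture
    w = rng.choice(V, p=phi[z])   # step 2: sample a word from that topic's word distribution
    doc.append(vocab[w])

print(doc)
```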

Dirichlet Distribution

The Dirichlet distribution is a probability distribution over probability vectors (vectors that sum to 1). It's the conjugate prior for the multinomial distribution, making it ideal for LDA.

Probability Density Function

P(\theta | \alpha) = \frac{\Gamma(\sum_{i=1}^K \alpha_i)}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K \theta_i^{\alpha_i - 1}

Where \Gamma is the gamma function, \alpha = (\alpha_1, \ldots, \alpha_K) is the concentration parameter, and \theta = (\theta_1, \ldots, \theta_K) is a probability vector.

Properties:

  • \alpha_i controls concentration: larger values yield more uniform distributions
  • \alpha_i < 1 encourages sparse distributions (few topics/words have high probability)
  • \alpha_i > 1 encourages near-uniform distributions
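
A quick way to see the effect of the concentration parameter (a sketch with arbitrary \alpha values):

```python
import numpy as np

rng = np.random.default_rng(2)

sparse_alpha = np.full(5, 0.1)    # alpha_i < 1: mass piles onto a few components
uniform_alpha = np.full(5, 10.0)  # alpha_i > 1: samples stay close to uniform

print(rng.dirichlet(sparse_alpha))   # typically one or two large entries, the rest near 0
print(rng.dirichlet(uniform_alpha))  # typically all entries near 0.2
```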

Parameter Learning

Given a collection of documents, learn the topic-word distributions \phi_k and document-topic distributions \theta_d. The topic assignments z_{d,n} are hidden variables.

Collapsed Gibbs Sampling

The most common method for learning LDA parameters. It iteratively samples the topic assignment z_{d,n} for each word, then estimates \phi_k and \theta_d from the samples.

Sampling Formula:

P(z_{d,n} = k | z_{-(d,n)}, w, \alpha, \eta) \propto \frac{n_{d,k}^{-dn} + \alpha_k}{\sum_{k'} (n_{d,k'}^{-dn} + \alpha_{k'})} \cdot \frac{n_{k,v}^{-dn} + \eta_v}{\sum_{v'} (n_{k,v'}^{-dn} + \eta_{v'})}

Where n_{d,k}^{-dn} = count of topic k in document d (excluding the current word), and n_{k,v}^{-dn} = count of word v in topic k (excluding the current word).
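
A compact sketch of this update on a toy corpus (word IDs, counts, and the symmetric \alpha, \eta values are made up; the document-side denominator is the same for every k, so it is dropped from the proportionality):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy corpus: each document is a list of word IDs (illustrative only).
docs = [[0, 1, 2, 0], [3, 4, 3, 5], [0, 2, 4, 5]]
K, V = 2, 6
alpha, eta = 0.5, 0.01            # symmetric hyperparameters (assumed values)

# Random initial topic assignments and the count tables they induce.
z = [[rng.integers(K) for _ in d] for d in docs]
n_dk = np.zeros((len(docs), K))   # topic counts per document
n_kv = np.zeros((K, V))           # word counts per topic
n_k = np.zeros(K)                 # total words per topic
for d, doc in enumerate(docs):
    for w, t in zip(doc, z[d]):
        n_dk[d, t] += 1
        n_kv[t, w] += 1
        n_k[t] += 1

for it in range(200):             # Gibbs sweeps
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t = z[d][n]
            # Remove the current word from the counts (the "-dn" superscript).
            n_dk[d, t] -= 1; n_kv[t, w] -= 1; n_k[t] -= 1
            # Unnormalized conditional probability of each topic k.
            p = (n_dk[d] + alpha) * (n_kv[:, w] + eta) / (n_k + V * eta)
            t = rng.choice(K, p=p / p.sum())
            # Add the word back under its newly sampled topic.
            z[d][n] = t
            n_dk[d, t] += 1; n_kv[t, w] += 1; n_k[t] += 1

# Point estimates of theta_d and phi_k from the final counts.
theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
phi = (n_kv + eta) / (n_kv + eta).sum(axis=1, keepdims=True)
print(theta.round(2))
print(phi.round(2))
```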

Variational Inference

An alternative approach that approximates the posterior distribution with variational inference. It is typically faster than Gibbs sampling but introduces approximation error.
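
For example, scikit-learn's LatentDirichletAllocation estimator fits LDA with variational Bayes; a minimal sketch on a made-up document-term count matrix (the matrix and prior values are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy document-term count matrix: 4 documents x 6 vocabulary words (made up).
X = np.array([
    [3, 2, 0, 0, 1, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 0, 0, 4, 2, 3],
    [0, 1, 0, 3, 3, 2],
])

lda = LatentDirichletAllocation(
    n_components=2,         # K, the number of topics
    doc_topic_prior=0.5,    # alpha
    topic_word_prior=0.01,  # eta
    random_state=0,
)
doc_topic = lda.fit_transform(X)   # document-topic distributions (theta)
topic_word = lda.components_       # pseudo-counts; normalize rows to get phi

print(doc_topic.round(2))
print((topic_word / topic_word.sum(axis=1, keepdims=True)).round(2))
```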

Document Collection Example

Apply LDA to a collection of 5 documents. Learn 3 topics (Tech, Sports, Science) and infer document-topic distributions.

Documents with Inferred Topic Distributions

Document | Words (Sample) | Topic Distribution
Doc 1 | machine learning algorithm neural network | Tech (0.6), Science (0.3), Other (0.1)
Doc 2 | basketball game player team score | Sports (0.7), Entertainment (0.2), Other (0.1)
Doc 3 | neural network brain neuron signal | Science (0.5), Tech (0.4), Other (0.1)
Doc 4 | football match goal team win | Sports (0.8), Entertainment (0.1), Other (0.1)
Doc 5 | algorithm data structure computer | Tech (0.7), Science (0.2), Other (0.1)

LDA learns that Doc 1 and Doc 5 are about Tech, Doc 2 and Doc 4 are about Sports, and Doc 3 is about Science. Each document is a mixture of topics.
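
A sketch of running this kind of experiment with gensim, assuming it is installed (the documents are taken from the table above; with such a tiny corpus the learned topics will not exactly reproduce the illustrative numbers):

```python
from gensim import corpora
from gensim.models import LdaModel

texts = [
    "machine learning algorithm neural network".split(),
    "basketball game player team score".split(),
    "neural network brain neuron signal".split(),
    "football match goal team win".split(),
    "algorithm data structure computer".split(),
]

dictionary = corpora.Dictionary(texts)           # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3,
               passes=50, random_state=0)

for d, bow in enumerate(corpus):
    print(f"Doc {d + 1}:", lda.get_document_topics(bow))  # inferred theta_d
for k in range(3):
    print(f"Topic {k}:", lda.show_topic(k, topn=5))        # top words of phi_k
```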

Learned Topic-Word Distributions

Topic 1: Tech

Top words: machine (0.15), learning (0.12), algorithm (0.11), computer (0.10), data (0.09), ...

Topic 2: Sports

Top words: game (0.18), team (0.15), player (0.13), score (0.11), match (0.10), ...

Topic 3: Science

Top words: neural (0.14), network (0.12), brain (0.11), neuron (0.10), signal (0.09), ...

LDA Result:

LDA successfully discovers 3 interpretable topics from the document collection. Each topic has a clear semantic meaning, and documents are assigned to topics based on their content.

Applications

Text Analysis Tasks

  • Document clustering: Group similar documents
  • Topic discovery: Find hidden themes in text corpora
  • Information retrieval: Improve search by topic matching
  • Text summarization: Extract key topics from documents

Real-World Examples

  • News articles: Discover topics (politics, sports, tech)
  • Research papers: Find research themes and trends
  • Social media: Analyze trending topics and discussions
  • Customer reviews: Extract product features and opinions

Advantages and Limitations

Advantages

  • Unsupervised learning: No labeled data required
  • Interpretable topics: Topics have clear semantic meaning
  • Handles large corpora: Scalable to millions of documents
  • Flexible: Can model any text collection

Limitations

  • Ignores word order: Bag-of-words assumption loses syntax
  • Number of topics: Must be specified a priori
  • May find spurious topics: Topics may not be meaningful
  • Computationally intensive: Training can be slow for large corpora