
Approximate Inference

Handle complex graphs and high-dimensional variables with approximation methods. Learn MCMC sampling and variational inference for intractable probability computations.

Module 6 of 7
Advanced
150-180 min

When to Use Approximate Inference

Exact inference becomes intractable (computationally infeasible) in several scenarios:

Graphs with Cycles

Belief propagation may not converge or give incorrect results. Variable elimination can have exponential complexity.

High-Dimensional Variables

Large state spaces make exact marginalization computationally expensive or impossible.

Complex Distributions

Non-standard or multi-modal distributions may not have closed-form solutions.

Two Categories of Approximate Inference

Sampling Methods (MCMC)

Use Monte Carlo sampling to approximate probabilities by drawing samples from the distribution.

  • Metropolis-Hastings algorithm
  • Gibbs sampling
  • Importance sampling

Variational Inference

Approximate the true distribution with a simpler distribution from a tractable family.

  • Mean-field approximation
  • Structured variational methods
  • Stochastic variational inference

MCMC Sampling Methods

Markov Chain Monte Carlo (MCMC) methods construct a Markov chain whose stationary distribution is the target distribution. By sampling from the chain, we approximate the true distribution.

Metropolis-Hastings Algorithm

A general-purpose MCMC algorithm that can sample from any distribution whose unnormalized density can be evaluated.

Algorithm Steps:

  1. Initialize: Start with an initial state $x^{(0)}$
  2. For each iteration t:
    • Propose a new state $x'$ from the proposal distribution $q(x' \mid x^{(t)})$
    • Compute the acceptance probability:
    $$\alpha = \min\left(1, \frac{P(x')\, q(x^{(t)} \mid x')}{P(x^{(t)})\, q(x' \mid x^{(t)})}\right)$$
    • Set $x^{(t+1)} = x'$ with probability $\alpha$; otherwise set $x^{(t+1)} = x^{(t)}$
  3. After burn-in: Collect samples $x^{(T)}, x^{(T+1)}, \ldots$ to approximate the distribution
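The steps above can be sketched in Python. This is a minimal illustration, not production MCMC code: it assumes a Gaussian random-walk proposal, which is symmetric, so the $q$ ratio in the acceptance probability cancels and only the target ratio remains.

```python
import math
import random

def metropolis_hastings(log_p, x0, n_samples, burn_in=1000, step=1.0):
    """Random-walk Metropolis-Hastings for a 1-D target.

    log_p: log of the (possibly unnormalized) target density.
    The Gaussian proposal is symmetric, so q(x | x') / q(x' | x) = 1
    and the acceptance ratio reduces to P(x') / P(x).
    """
    x = x0
    samples = []
    for t in range(burn_in + n_samples):
        x_new = x + random.gauss(0.0, step)        # propose x'
        log_alpha = log_p(x_new) - log_p(x)        # log acceptance ratio
        if random.random() < math.exp(min(0.0, log_alpha)):
            x = x_new                              # accept with prob min(1, alpha)
        if t >= burn_in:                           # discard burn-in samples
            samples.append(x)
    return samples

random.seed(0)
# Target: standard normal, known only up to its normalizing constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)   # should be close to 0
```

Note that rejected proposals still produce a sample (the chain repeats its current state); dropping them would bias the approximation.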

Gibbs Sampling

Special case of Metropolis-Hastings where we update one variable at a time, sampling from its conditional distribution given all other variables.

Algorithm Steps:

  1. Initialize: $x^{(0)} = (x_1^{(0)}, \ldots, x_n^{(0)})$
  2. For each iteration t, sweep over the variables $i = 1, \ldots, n$:
    • Sample $x_i^{(t+1)} \sim P(x_i \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)})$
    • Update only variable $i$, keeping the others fixed
  3. After burn-in: Collect samples to approximate the joint distribution
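As an illustration, here is a small Gibbs sampler for a toy model chosen because both full conditionals are known in closed form: a bivariate standard normal with correlation $\rho$, where $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000):
    """Gibbs sampling for a bivariate standard normal with correlation rho.

    Each full conditional is an exact 1-D Gaussian draw, so every
    update is accepted -- there is no rejection step at all.
    """
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)       # conditional standard deviation
    samples = []
    for t in range(burn_in + n_samples):
        x1 = random.gauss(rho * x2, sd)   # sample x1 | x2
        x2 = random.gauss(rho * x1, sd)   # sample x2 | x1 (uses the fresh x1)
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
corr = sum(a * b for a, b in samples) / len(samples)   # estimates rho
```

The sweep structure matches step 2 above: each variable is resampled from its conditional given the most recent values of all the others.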

Advantages:

  • No rejection (acceptance probability = 1)
  • Simple to implement
  • Works well for many graphical models

Variational Inference

Variational inference approximates the true posterior $P(x \mid \text{evidence})$ with a simpler distribution $Q(x)$ from a tractable family. The goal is to find the $Q$ that minimizes the KL divergence from $Q$ to $P$.

Evidence Lower Bound (ELBO)

Minimizing KL divergence is equivalent to maximizing the ELBO:

$$\text{ELBO}(Q) = \mathbb{E}_{x \sim Q}[\log P(x, \text{evidence})] - \mathbb{E}_{x \sim Q}[\log Q(x)]$$

The ELBO is a lower bound on the log evidence $\log P(\text{evidence})$; maximizing the ELBO therefore minimizes the KL divergence.
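One way to see this: for any $Q$, the log evidence decomposes into the two quantities defined in this section,

```latex
\begin{aligned}
\log P(\text{evidence})
  &= \mathbb{E}_{x \sim Q}\!\left[\log \frac{P(x, \text{evidence})}{Q(x)}\right]
   + \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x \mid \text{evidence})}\right] \\[4pt]
  &= \text{ELBO}(Q) + \text{KL}(Q \,\|\, P)
\end{aligned}
```

Since $\log P(\text{evidence})$ does not depend on $Q$ and the KL term is nonnegative, the ELBO can never exceed the log evidence, and pushing the ELBO up pushes the KL divergence down by exactly the same amount.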

KL Divergence

KL divergence measures the difference between distributions:

$$\text{KL}(Q \,\|\, P) = \sum_x Q(x) \log \frac{Q(x)}{P(x \mid \text{evidence})}$$

KL divergence is always $\geq 0$, and equals 0 only when $Q = P$.
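For discrete distributions given as probability tables, the formula above translates directly into code; a small sketch:

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of probabilities.

    Terms with q_i = 0 contribute 0 (the usual 0 * log 0 = 0 convention).
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
d = kl_divergence(q, p)      # > 0, since q != p
zero = kl_divergence(p, p)   # 0 when Q = P
```

Note the asymmetry: `kl_divergence(q, p)` and `kl_divergence(p, q)` generally differ, which is why the direction $\text{KL}(Q \,\|\, P)$ matters in variational inference.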

Mean-Field Approximation

The simplest variational family assumes variables are independent:

$$Q(x) = \prod_{i=1}^n Q_i(x_i)$$

Each $Q_i$ is optimized independently. This simplifies computation but may lose important dependencies.

Optimization:

Use coordinate ascent: iteratively optimize each $Q_i$ while holding the others fixed, until convergence.
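A minimal sketch of this coordinate ascent, on a hypothetical toy joint over two binary variables (the table `P` below is made up for illustration). Each update sets $Q_i(x_i) \propto \exp\big(\mathbb{E}_{Q_j}[\log P(x_1, x_2)]\big)$ with the other factor held fixed:

```python
import math

# Toy joint P(x1, x2) over two binary variables (rows: x1, cols: x2).
P = [[0.30, 0.10],
     [0.15, 0.45]]

def cavi(P, iters=50):
    """Mean-field coordinate ascent for Q(x1, x2) = Q1(x1) * Q2(x2)."""
    q1, q2 = [0.5, 0.5], [0.5, 0.5]
    for _ in range(iters):
        # Update Q1 with Q2 fixed: Q1(a) ∝ exp( E_{Q2}[ log P(a, x2) ] )
        logits = [sum(q2[b] * math.log(P[a][b]) for b in range(2)) for a in range(2)]
        w = [math.exp(l - max(logits)) for l in logits]
        q1 = [x / sum(w) for x in w]
        # Update Q2 with Q1 fixed: Q2(b) ∝ exp( E_{Q1}[ log P(x1, b) ] )
        logits = [sum(q1[a] * math.log(P[a][b]) for a in range(2)) for b in range(2)]
        w = [math.exp(l - max(logits)) for l in logits]
        q2 = [x / sum(w) for x in w]
    return q1, q2

q1, q2 = cavi(P)
```

On this example the optimized factor for $x_1$ comes out slightly sharper than the true marginal $(0.4, 0.6)$, a small instance of the tendency of mean-field approximations to understate uncertainty, noted below.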

MCMC vs Variational Inference

MCMC Sampling

  • Asymptotically exact: Converges to true distribution
  • Slower: Requires many samples
  • Uncertainty preserved: Captures full distribution
  • No bias: Only sampling error

Variational Inference

  • Faster: Optimization-based, converges quickly
  • Biased: Approximation error from simpler family
  • Deterministic: No sampling randomness
  • May underestimate uncertainty: Simpler distribution

Advantages and Limitations

Advantages

  • Handles complex graphs: Works with cycles and high dimensions
  • Practical: Only feasible method for many real-world problems
  • Flexible: Applies to a broad range of distributions and models
  • Scalable: Can handle large models

Limitations

  • Approximation error: Not exact, may have bias
  • Convergence issues: MCMC may be slow, VI may get stuck
  • Parameter tuning: Requires careful hyperparameter selection
  • Quality depends on method: Results vary by algorithm choice