
Approximate Inference

Handle complex graphs and high-dimensional variables with approximation methods. Learn MCMC sampling and variational inference for intractable probability computations.

Module 6 of 7
Advanced
150-180 min

When to Use Approximate Inference

Exact inference becomes intractable (computationally infeasible) in several scenarios:

Graphs with Cycles

Belief propagation may not converge or give incorrect results. Variable elimination can have exponential complexity.

High-Dimensional Variables

Large state spaces make exact marginalization computationally expensive or impossible.

Complex Distributions

Non-standard or multi-modal distributions may not have closed-form solutions.

Two Categories of Approximate Inference

Sampling Methods (MCMC)

Use Monte Carlo sampling to approximate probabilities by drawing samples from the distribution.

  • Metropolis-Hastings algorithm
  • Gibbs sampling
  • Importance sampling

Variational Inference

Approximate the true distribution with a simpler distribution from a tractable family.

  • Mean-field approximation
  • Structured variational methods
  • Stochastic variational inference

MCMC Sampling Methods

Markov Chain Monte Carlo (MCMC) methods construct a Markov chain whose stationary distribution is the target distribution. By sampling from the chain, we approximate the true distribution.

Metropolis-Hastings Algorithm

A general-purpose MCMC algorithm that can sample from any distribution whose unnormalized density can be evaluated.

Algorithm Steps:

  1. Initialize: Start with an initial state $x^{(0)}$
  2. For each iteration t:
    • Propose a new state $x'$ from the proposal distribution $q(x' \mid x^{(t)})$
    • Compute the acceptance probability:
    $$\alpha = \min\left(1, \frac{P(x')\, q(x^{(t)} \mid x')}{P(x^{(t)})\, q(x' \mid x^{(t)})}\right)$$
    • Set $x^{(t+1)} = x'$ with probability $\alpha$; otherwise set $x^{(t+1)} = x^{(t)}$
  3. After burn-in: Collect samples $x^{(T)}, x^{(T+1)}, \ldots$ to approximate the distribution
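The steps above can be sketched in Python. This is a minimal illustration, not production MCMC code: it assumes a Gaussian random-walk proposal, which is symmetric, so the $q$ ratio in the acceptance probability cancels and only the target ratio remains.

```python
import math
import random

def metropolis_hastings(log_p, x0, n_samples, burn_in=1000, step=1.0):
    """Random-walk Metropolis-Hastings for a 1-D target.

    log_p: log of the (possibly unnormalized) target density.
    The Gaussian proposal is symmetric, so q(x | x') / q(x' | x) = 1
    and the acceptance ratio reduces to P(x') / P(x).
    """
    x = x0
    samples = []
    for t in range(burn_in + n_samples):
        x_new = x + random.gauss(0.0, step)        # propose x'
        log_alpha = log_p(x_new) - log_p(x)        # log acceptance ratio
        if random.random() < math.exp(min(0.0, log_alpha)):
            x = x_new                              # accept with prob min(1, alpha)
        if t >= burn_in:                           # discard burn-in samples
            samples.append(x)
    return samples

random.seed(0)
# Target: standard normal, known only up to its normalizing constant.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_samples=20000)
mean = sum(samples) / len(samples)   # should be close to 0
```

Note that rejected proposals still produce a sample (the chain repeats its current state); dropping them would bias the approximation.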

Gibbs Sampling

Special case of Metropolis-Hastings where we update one variable at a time, sampling from its conditional distribution given all other variables.

Algorithm Steps:

  1. Initialize: $x^{(0)} = (x_1^{(0)}, \ldots, x_n^{(0)})$
  2. For each iteration t, sweep over the variables $i = 1, \ldots, n$:
    • Sample $x_i^{(t+1)} \sim P(x_i \mid x_1^{(t+1)}, \ldots, x_{i-1}^{(t+1)}, x_{i+1}^{(t)}, \ldots, x_n^{(t)})$
    • Update only variable $i$, keeping the others fixed
  3. After burn-in: Collect samples to approximate the joint distribution
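As an illustration, here is a small Gibbs sampler for a toy model chosen because both full conditionals are known in closed form: a bivariate standard normal with correlation $\rho$, where $x_1 \mid x_2 \sim \mathcal{N}(\rho x_2,\, 1 - \rho^2)$ and symmetrically for $x_2$.

```python
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=1000):
    """Gibbs sampling for a bivariate standard normal with correlation rho.

    Each full conditional is an exact 1-D Gaussian draw, so every
    update is accepted -- there is no rejection step at all.
    """
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)       # conditional standard deviation
    samples = []
    for t in range(burn_in + n_samples):
        x1 = random.gauss(rho * x2, sd)   # sample x1 | x2
        x2 = random.gauss(rho * x1, sd)   # sample x2 | x1 (uses the fresh x1)
        if t >= burn_in:
            samples.append((x1, x2))
    return samples

random.seed(0)
samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
corr = sum(a * b for a, b in samples) / len(samples)   # estimates rho
```

The sweep structure matches step 2 above: each variable is resampled from its conditional given the most recent values of all the others.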

Advantages:

  • No rejection (acceptance probability = 1)
  • Simple to implement
  • Works well for many graphical models

Variational Inference

Variational inference approximates the true posterior $P(x \mid \text{evidence})$ with a simpler distribution $Q(x)$ from a tractable family. The goal is to find the $Q$ that minimizes the KL divergence from $Q$ to $P$.

Evidence Lower Bound (ELBO)

Minimizing KL divergence is equivalent to maximizing the ELBO:

$$\text{ELBO}(Q) = \mathbb{E}_{x \sim Q}[\log P(x, \text{evidence})] - \mathbb{E}_{x \sim Q}[\log Q(x)]$$

The ELBO is a lower bound on the log evidence $\log P(\text{evidence})$; maximizing the ELBO therefore minimizes the KL divergence.
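One way to see this: for any $Q$, the log evidence decomposes into the two quantities defined in this section,

```latex
\begin{aligned}
\log P(\text{evidence})
  &= \mathbb{E}_{x \sim Q}\!\left[\log \frac{P(x, \text{evidence})}{Q(x)}\right]
   + \mathbb{E}_{x \sim Q}\!\left[\log \frac{Q(x)}{P(x \mid \text{evidence})}\right] \\[4pt]
  &= \text{ELBO}(Q) + \text{KL}(Q \,\|\, P)
\end{aligned}
```

Since $\log P(\text{evidence})$ does not depend on $Q$ and the KL term is nonnegative, the ELBO can never exceed the log evidence, and pushing the ELBO up pushes the KL divergence down by exactly the same amount.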

KL Divergence

KL divergence measures the difference between distributions:

$$\text{KL}(Q \,\|\, P) = \sum_x Q(x) \log \frac{Q(x)}{P(x \mid \text{evidence})}$$

KL divergence is always $\geq 0$, and equals 0 only when $Q = P$.
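For discrete distributions given as probability tables, the formula above translates directly into code; a small sketch:

```python
import math

def kl_divergence(q, p):
    """KL(Q || P) for discrete distributions given as lists of probabilities.

    Terms with q_i = 0 contribute 0 (the usual 0 * log 0 = 0 convention).
    """
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

q = [0.5, 0.5]
p = [0.9, 0.1]
d = kl_divergence(q, p)      # > 0, since q != p
zero = kl_divergence(p, p)   # 0 when Q = P
```

Note the asymmetry: `kl_divergence(q, p)` and `kl_divergence(p, q)` generally differ, which is why the direction $\text{KL}(Q \,\|\, P)$ matters in variational inference.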

Mean-Field Approximation

The simplest variational family assumes variables are independent:

$$Q(x) = \prod_{i=1}^n Q_i(x_i)$$

Each $Q_i$ is optimized independently. This simplifies computation but may lose important dependencies.

Optimization:

Use coordinate ascent: iteratively optimize each $Q_i$ while holding the others fixed, until convergence.
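A minimal sketch of this coordinate ascent, on a hypothetical toy joint over two binary variables (the table `P` below is made up for illustration). Each update sets $Q_i(x_i) \propto \exp\big(\mathbb{E}_{Q_j}[\log P(x_1, x_2)]\big)$ with the other factor held fixed:

```python
import math

# Toy joint P(x1, x2) over two binary variables (rows: x1, cols: x2).
P = [[0.30, 0.10],
     [0.15, 0.45]]

def cavi(P, iters=50):
    """Mean-field coordinate ascent for Q(x1, x2) = Q1(x1) * Q2(x2)."""
    q1, q2 = [0.5, 0.5], [0.5, 0.5]
    for _ in range(iters):
        # Update Q1 with Q2 fixed: Q1(a) ∝ exp( E_{Q2}[ log P(a, x2) ] )
        logits = [sum(q2[b] * math.log(P[a][b]) for b in range(2)) for a in range(2)]
        w = [math.exp(l - max(logits)) for l in logits]
        q1 = [x / sum(w) for x in w]
        # Update Q2 with Q1 fixed: Q2(b) ∝ exp( E_{Q1}[ log P(x1, b) ] )
        logits = [sum(q1[a] * math.log(P[a][b]) for a in range(2)) for b in range(2)]
        w = [math.exp(l - max(logits)) for l in logits]
        q2 = [x / sum(w) for x in w]
    return q1, q2

q1, q2 = cavi(P)
```

On this example the optimized factor for $x_1$ comes out slightly sharper than the true marginal $(0.4, 0.6)$, a small instance of the tendency of mean-field approximations to understate uncertainty, noted below.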

MCMC vs Variational Inference

MCMC Sampling

  • Asymptotically exact: Converges to true distribution
  • Slower: Requires many samples
  • Uncertainty preserved: Captures full distribution
  • No bias: Only sampling error

Variational Inference

  • Faster: Optimization-based, converges quickly
  • Biased: Approximation error from simpler family
  • Deterministic: No sampling randomness
  • May underestimate uncertainty: Simpler distribution

Advantages and Limitations

Advantages

  • Handles complex graphs: Works with cycles and high dimensions
  • Practical: Only feasible method for many real-world problems
  • Flexible: Applies to a broad range of distributions and models
  • Scalable: Can handle large models

Limitations

  • Approximation error: Not exact, may have bias
  • Convergence issues: MCMC may be slow, VI may get stuck
  • Parameter tuning: Requires careful hyperparameter selection
  • Quality depends on method: Results vary by algorithm choice