
EM Algorithm & Approximate Inference

Handling latent variables and missing data with iterative optimization

What is the EM Algorithm?

Iterative Optimization

The Expectation-Maximization (EM) Algorithm is an iterative optimization method for estimating parameters in probabilistic models with latent (unobserved) variables or missing data. It's a powerful tool for handling incomplete datasets.

Problem Scenario

Example: A watermelon's stem has fallen off, so we cannot observe whether it is "curled" or "stiff". The value of the "stem" attribute is unknown; it is a latent variable. The EM algorithm can still estimate the model parameters even when some variables are unobserved.

EM Algorithm Formulation

Objective

Let $X$ denote the observed variables, $Z$ the latent variables, and $\Theta$ the model parameters. We want to maximize the marginal likelihood:

$$LL(\Theta | X) = \ln P(X | \Theta) = \ln \sum_Z P(X, Z | \Theta)$$

Iterative Steps

Starting from an initial parameter estimate $\Theta^0$, iterate the following two steps until convergence:

E-step (Expectation)

Based on the current parameters $\Theta^t$, infer the distribution of the latent variables $Z$ (their expected value is denoted $Z^t$) and form the expected complete-data log-likelihood:

$$Q(\Theta; \Theta^t) = E_{Z \sim P(Z | X, \Theta^t)}[\ln P(X, Z | \Theta)]$$

M-step (Maximization)

Based on the observed variables $X$ and $Z^t$, perform maximum likelihood estimation to obtain the updated parameters $\Theta^{t+1}$:

$$\Theta^{t+1} = \arg\max_{\Theta} Q(\Theta; \Theta^t)$$

Insight

By alternating between the E-step and M-step, the marginal likelihood $\ln P(X | \Theta)$ monotonically increases, eventually converging to a local maximum. EM provides a principled way to handle missing data without discarding incomplete samples.
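To make the alternation concrete, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture; the toy model and all parameter choices are illustrative assumptions, not part of the original text. The responsibilities computed in the E-step play the role of the posterior $P(Z | X, \Theta^t)$, and the M-step maximizes $Q(\Theta; \Theta^t)$ in closed form.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Minimal EM sketch for a 2-component 1-D Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Theta^0: mixing weight, component means, component variances
    pi = 0.5
    mu = rng.choice(x, size=2, replace=False).astype(float)
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: responsibilities r1[i] = P(z_i = 1 | x_i, Theta^t)
        p1 = pi * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1

        # M-step: maximize Q(Theta; Theta^t), i.e. weighted maximum likelihood
        pi = r1.mean()
        mu = np.array([(r1 * x).sum() / r1.sum(), (r2 * x).sum() / r2.sum()])
        var = np.array([(r1 * (x - mu[0])**2).sum() / r1.sum(),
                        (r2 * (x - mu[1])**2).sum() / r2.sum()])
    return pi, mu, var

# Usage: data drawn from two Gaussians; EM recovers the mixture parameters.
data = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 1, 700)])
print(em_gmm_1d(data))
```

In this model both steps have closed forms, which is why the Gaussian mixture is the standard worked example for EM.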

Gibbs Sampling

Overview

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method for approximate inference in Bayesian networks. It iteratively samples variables to approximate the posterior distribution.

Algorithm

  1. Initialize: Randomly generate a sample $q^0$ consistent with the evidence $E = e$ as the starting point.
  2. Iterative Sampling: Perform $T$ sampling iterations. In each iteration, examine each non-evidence variable in turn (a minimal code sketch of the whole procedure follows this list):
    • Assume all other variables take their current values
    • Infer the conditional (sampling) probability of this variable given those values
    • Sample a new value and update the variable
  3. Approximate Posterior: After $T$ iterations, if $n_q$ of the samples match the query target $q$, approximate:
    $$P(Q = q | E = e) \approx \frac{n_q}{T}$$
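Below is a minimal sketch of the three steps for a small hypothetical Bayesian network over binary variables (Cloudy, Sprinkler, Rain, WetGrass); the network structure and CPT numbers are made up for illustration. Each non-evidence variable is resampled from its conditional distribution given the current values of all the others, obtained by normalizing the joint probability over that variable's two values.

```python
import random

# Hypothetical network: Cloudy -> Sprinkler, Cloudy -> Rain, (Sprinkler, Rain) -> WetGrass.
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}          # P(Sprinkler=1 | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}          # P(Rain=1 | Cloudy)
P_W_given_SR = {(False, False): 0.01, (False, True): 0.9,
                (True, False): 0.9, (True, True): 0.99}  # P(Wet=1 | S, R)

def joint(a):
    """Joint probability of a full assignment dict with keys 'C', 'S', 'R', 'W'."""
    p = P_C if a['C'] else 1 - P_C
    p *= P_S_given_C[a['C']] if a['S'] else 1 - P_S_given_C[a['C']]
    p *= P_R_given_C[a['C']] if a['R'] else 1 - P_R_given_C[a['C']]
    pw = P_W_given_SR[(a['S'], a['R'])]
    return p * (pw if a['W'] else 1 - pw)

def gibbs(query='R', evidence=None, T=20000, seed=0):
    random.seed(seed)
    evidence = {'W': True} if evidence is None else evidence
    # Step 1: random starting sample consistent with the evidence E = e
    state = {v: random.random() < 0.5 for v in ('C', 'S', 'R')}
    state.update(evidence)
    hits = 0
    for _ in range(T):
        # Step 2: resample each non-evidence variable given all the others
        for v in ('C', 'S', 'R'):
            if v in evidence:
                continue
            p = {}
            for val in (True, False):
                state[v] = val
                p[val] = joint(state)
            state[v] = random.random() < p[True] / (p[True] + p[False])
        # Step 3: count samples that match the query target q
        hits += state[query]
    return hits / T   # approximates P(Q = q | E = e)

print("P(Rain=1 | Wet=1) ≈", gibbs())
```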

Advantage

Because each step resamples a single variable conditioned on the current values of all the others, the Markov chain gradually converges to the true posterior distribution. Each update only needs a low-dimensional conditional distribution, which makes the method practical in high-dimensional problems and for complex Bayesian networks.

Variational Inference

Basic Idea

Variational Inference transforms complex probabilistic inference into an optimization problem. It uses a known simple distribution $q(Z)$ to approximate the complex posterior distribution $P(Z | X, \Theta)$.

Log-Likelihood Decomposition

The log-likelihood can be decomposed into two parts:

$$\ln p(X) = L(q) + KL(q || p)$$

Evidence Lower Bound (ELBO):

$$L(q) = \int q(z) \ln\left\{\frac{p(x, z)}{q(z)}\right\} dz$$

KL Divergence:

$$KL(q || p) = -\int q(z) \ln\left\{\frac{p(z | x)}{q(z)}\right\} dz$$

Optimization Objective

Since $KL(q || p) \geq 0$, we have $L(q) \leq \ln p(X)$. Variational inference maximizes $L(q)$ (the ELBO), which drives $q(z)$ as close as possible to the true posterior $p(z | x)$.
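As a numerical sanity check on this decomposition, the snippet below builds a tiny discrete toy model (the prior, the likelihood values, and $q(z)$ are made-up numbers) and verifies that $\ln p(x) = L(q) + KL(q || p)$, with the gap between $\ln p(x)$ and the ELBO equal to the KL divergence.

```python
import numpy as np

# Hypothetical discrete toy model: z in {0, 1, 2}, one observation x.
p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x | z) at the observed x
p_xz = p_z * p_x_given_z                  # joint p(x, z)
p_x = p_xz.sum()                          # evidence p(x)
post = p_xz / p_x                         # true posterior p(z | x)

q = np.array([0.6, 0.3, 0.1])             # arbitrary approximating distribution q(z)

elbo = np.sum(q * np.log(p_xz / q))       # L(q) = sum_z q(z) ln{ p(x,z) / q(z) }
kl = np.sum(q * np.log(q / post))         # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)             # both equal ln p(x)
print("gap between ln p(x) and ELBO:", np.log(p_x) - elbo, "= KL:", kl)
```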

EM Algorithm with Variational Inference

Integration

In practice, the E-step of EM (inferring $P(Z | X, \Theta^t)$) may be intractable. Variational inference can simplify it by constraining the form of $q(z)$:

E-step (with Variational Inference)

Fix $\Theta^t$ and find the optimal $q(z)$ that maximizes $L(q, \Theta^t)$. This is equivalent to minimizing $KL(q || p(z | X, \Theta^t))$.

M-step

Fix $q(z)$ and find the optimal $\Theta$ that maximizes $L(q, \Theta)$. This is equivalent to maximizing $E_{q(z)}[\ln p(X, Z | \Theta)]$.
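Why these two equivalences hold (a standard derivation, stated here for completeness): writing the decomposition from the previous section with the parameters made explicit,

$$\ln p(X | \Theta) = L(q, \Theta) + KL(q || p(Z | X, \Theta)), \qquad L(q, \Theta) = \int q(Z) \ln\left\{\frac{p(X, Z | \Theta)}{q(Z)}\right\} dZ$$

In the E-step, $\Theta = \Theta^t$ is fixed, so $\ln p(X | \Theta^t)$ is a constant; maximizing $L(q, \Theta^t)$ over $q$ therefore minimizes $KL(q || p(Z | X, \Theta^t))$. In the M-step, $q$ is fixed and $L(q, \Theta) = E_{q(z)}[\ln p(X, Z | \Theta)] - E_{q(z)}[\ln q(Z)]$, where the second (entropy) term does not depend on $\Theta$, so maximizing $L$ over $\Theta$ is the same as maximizing $E_{q(z)}[\ln p(X, Z | \Theta)]$.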

Graphical Interpretation

By alternating the E-step and M-step, the lower bound $L(q, \Theta)$ monotonically increases and eventually converges to a local maximum; when $q$ matches the exact posterior the bound is tight and $\ln p(X | \Theta)$ itself increases. Variational inference makes the E-step tractable by restricting $q$ to a simpler family of approximate distributions.

Plate Notation

Definition

Plate notation is a compact way to represent probabilistic graphical models. Mutually independent variables generated by the same mechanism are placed in a box (plate) labeled with the number of repetitions $N$. Plates can be nested, and observed variables are typically shaded.

Example

For $N$ observed variables $x_1, \ldots, x_N$, each generated together with a latent variable $z$:

$$p(x | \Theta) = \prod_{i=1}^{N} \sum_z p(x_i, z | \Theta)$$

The corresponding log-likelihood:

$$\ln p(x | \Theta) = \sum_{i=1}^{N} \ln\left\{\sum_z p(x_i, z | \Theta)\right\}$$
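As a tiny numerical illustration (the model, its parameters, and the numbers below are assumptions made for the example), the code evaluates exactly this log-likelihood for a mixture with $z \in \{0, 1\}$ and unit-variance Gaussian components:

```python
import numpy as np

# Hypothetical model: z_i in {0, 1} with P(z=1) = pi_, and x_i | z ~ N(mu_z, 1).
pi_, mu = 0.3, np.array([-1.0, 2.0])

def log_likelihood(x, pi_, mu):
    """ln p(x | Theta) = sum_i ln { sum_z p(x_i, z | Theta) } for N i.i.d. samples."""
    # p(x_i, z | Theta) = P(z) * N(x_i; mu_z, 1), arranged as an (N, 2) array
    norm = np.exp(-0.5 * (x[:, None] - mu[None, :])**2) / np.sqrt(2 * np.pi)
    joint = np.array([1 - pi_, pi_]) * norm
    return np.sum(np.log(joint.sum(axis=1)))   # sum over z sits inside the log

x = np.array([-1.2, 0.3, 2.1, 1.8, -0.7])
print(log_likelihood(x, pi_, mu))
```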

Example: Handling Missing Data

Using EM algorithm for incomplete datasets

Problem Setup

In a medical dataset, some patients have missing test results. We want to estimate model parameters (e.g., disease probabilities) using all available data, including incomplete records.

EM Solution

E-step: For samples with missing test results, estimate the expected value of the missing variable based on current parameter estimates and observed symptoms.

M-step: Update disease probabilities and symptom distributions using both complete samples and expected values from incomplete samples.

Iterate: Repeat E-step and M-step until convergence. The algorithm naturally handles missing data without discarding incomplete samples.
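A minimal sketch of how this could look in code, assuming a hypothetical model with a latent binary disease variable D, an always-observed binary symptom S, and a binary test result T that is missing for some patients; the variable names, probabilities, and model structure are illustrative assumptions rather than anything specified above.

```python
import numpy as np

def em_missing(s, t, n_iter=100, seed=0):
    """EM sketch: latent disease D, observed symptom S, test T that may be NaN."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(t)
    # Theta^0: pi = P(D=1), a[d] = P(S=1|D=d), b[d] = P(T=1|D=d)
    pi, a, b = 0.5, rng.uniform(0.3, 0.7, 2), rng.uniform(0.3, 0.7, 2)

    for _ in range(n_iter):
        # E-step: posterior r_i = P(D=1 | observed part of sample i)
        like1 = pi * np.where(s == 1, a[1], 1 - a[1])
        like0 = (1 - pi) * np.where(s == 1, a[0], 1 - a[0])
        like1 = np.where(obs, like1 * np.where(t == 1, b[1], 1 - b[1]), like1)
        like0 = np.where(obs, like0 * np.where(t == 1, b[0], 1 - b[0]), like0)
        r = like1 / (like1 + like0)
        # Missing test results are replaced by their expected value E[T | D=d] = b[d]
        t1 = np.where(obs, t, b[1])
        t0 = np.where(obs, t, b[0])

        # M-step: weighted maximum likelihood updates using all samples
        pi = r.mean()
        a = np.array([((1 - r) * s).sum() / (1 - r).sum(), (r * s).sum() / r.sum()])
        b = np.array([((1 - r) * t0).sum() / (1 - r).sum(), (r * t1).sum() / r.sum()])
    return pi, a, b

# Usage: synthetic data with about 30% of test results missing.
rng = np.random.default_rng(1)
d = rng.random(2000) < 0.3
s = (rng.random(2000) < np.where(d, 0.9, 0.2)).astype(float)
t = (rng.random(2000) < np.where(d, 0.8, 0.1)).astype(float)
t[rng.random(2000) < 0.3] = np.nan
print(em_missing(s, t))
```

Samples with a missing test still contribute to every update through their posterior weights, which is the sense in which EM avoids discarding incomplete records.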