
EM Algorithm & Approximate Inference

Handling latent variables and missing data with iterative optimization

What is the EM Algorithm?

Iterative Optimization

The Expectation-Maximization (EM) Algorithm is an iterative optimization method for estimating parameters in probabilistic models with latent (unobserved) variables or missing data. It's a powerful tool for handling incomplete datasets.

Problem Scenario

Example: A watermelon's stem has fallen off, so we cannot observe whether it is "curled" or "stiff". The value of the "stem" attribute is unknown; it is a latent variable. The EM algorithm can still estimate the model parameters even when some variables are unobserved.

EM Algorithm Formulation

Objective

Let $X$ denote the observed variables, $Z$ the latent variables, and $\Theta$ the model parameters. We want to maximize the marginal likelihood:

$$LL(\Theta | X) = \ln P(X | \Theta) = \ln \sum_Z P(X, Z | \Theta)$$

Iterative Steps

Starting from an initial parameter estimate $\Theta^0$, iterate the following two steps until convergence:

E-step (Expectation)

Based on the current parameters $\Theta^t$, infer the distribution of the latent variables $Z$ (their expected value is denoted $Z^t$) and form the expected complete-data log-likelihood:

$$Q(\Theta; \Theta^t) = E_{Z \sim P(Z | X, \Theta^t)}[\ln P(X, Z | \Theta)]$$

M-step (Maximization)

Based on the observed variables $X$ and $Z^t$, perform maximum likelihood estimation to obtain the updated parameters $\Theta^{t+1}$:

$$\Theta^{t+1} = \arg\max_{\Theta} Q(\Theta; \Theta^t)$$

Insight

By alternating between the E-step and M-step, the marginal likelihood $\ln P(X | \Theta)$ monotonically increases, eventually converging to a local maximum. EM provides a principled way to handle missing data without discarding incomplete samples.
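To make the alternation concrete, here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture; the toy model and all parameter choices are illustrative assumptions, not part of the original text. The responsibilities computed in the E-step play the role of the posterior $P(Z | X, \Theta^t)$, and the M-step maximizes $Q(\Theta; \Theta^t)$ in closed form.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """Minimal EM sketch for a 2-component 1-D Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Theta^0: mixing weight, component means, component variances
    pi = 0.5
    mu = rng.choice(x, size=2, replace=False).astype(float)
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: responsibilities r1[i] = P(z_i = 1 | x_i, Theta^t)
        p1 = pi * np.exp(-(x - mu[0])**2 / (2 * var[0])) / np.sqrt(2 * np.pi * var[0])
        p2 = (1 - pi) * np.exp(-(x - mu[1])**2 / (2 * var[1])) / np.sqrt(2 * np.pi * var[1])
        r1 = p1 / (p1 + p2)
        r2 = 1.0 - r1

        # M-step: maximize Q(Theta; Theta^t), i.e. weighted maximum likelihood
        pi = r1.mean()
        mu = np.array([(r1 * x).sum() / r1.sum(), (r2 * x).sum() / r2.sum()])
        var = np.array([(r1 * (x - mu[0])**2).sum() / r1.sum(),
                        (r2 * (x - mu[1])**2).sum() / r2.sum()])
    return pi, mu, var

# Usage: data drawn from two Gaussians; EM recovers the mixture parameters.
data = np.concatenate([np.random.normal(-2, 1, 300), np.random.normal(3, 1, 700)])
print(em_gmm_1d(data))
```

In this model both steps have closed forms, which is why the Gaussian mixture is the standard worked example for EM.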

Gibbs Sampling

Overview

Gibbs Sampling is a Markov Chain Monte Carlo (MCMC) method for approximate inference in Bayesian networks. It iteratively samples variables to approximate the posterior distribution.

Algorithm

  1. Initialize: Randomly generate a sample $q^0$ consistent with the evidence $E = e$ as the starting point.
  2. Iterative Sampling: Perform $T$ sampling iterations. In each iteration, examine each non-evidence variable in turn (a minimal code sketch of the whole procedure follows this list):
    • Assume all other variables take their current values
    • Infer the conditional (sampling) probability of this variable given those values
    • Sample a new value and update the variable
  3. Approximate Posterior: After $T$ iterations, if $n_q$ of the samples match the query target $q$, approximate:
    $$P(Q = q | E = e) \approx \frac{n_q}{T}$$
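Below is a minimal sketch of the three steps for a small hypothetical Bayesian network over binary variables (Cloudy, Sprinkler, Rain, WetGrass); the network structure and CPT numbers are made up for illustration. Each non-evidence variable is resampled from its conditional distribution given the current values of all the others, obtained by normalizing the joint probability over that variable's two values.

```python
import random

# Hypothetical network: Cloudy -> Sprinkler, Cloudy -> Rain, (Sprinkler, Rain) -> WetGrass.
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}          # P(Sprinkler=1 | Cloudy)
P_R_given_C = {True: 0.8, False: 0.2}          # P(Rain=1 | Cloudy)
P_W_given_SR = {(False, False): 0.01, (False, True): 0.9,
                (True, False): 0.9, (True, True): 0.99}  # P(Wet=1 | S, R)

def joint(a):
    """Joint probability of a full assignment dict with keys 'C', 'S', 'R', 'W'."""
    p = P_C if a['C'] else 1 - P_C
    p *= P_S_given_C[a['C']] if a['S'] else 1 - P_S_given_C[a['C']]
    p *= P_R_given_C[a['C']] if a['R'] else 1 - P_R_given_C[a['C']]
    pw = P_W_given_SR[(a['S'], a['R'])]
    return p * (pw if a['W'] else 1 - pw)

def gibbs(query='R', evidence=None, T=20000, seed=0):
    random.seed(seed)
    evidence = {'W': True} if evidence is None else evidence
    # Step 1: random starting sample consistent with the evidence E = e
    state = {v: random.random() < 0.5 for v in ('C', 'S', 'R')}
    state.update(evidence)
    hits = 0
    for _ in range(T):
        # Step 2: resample each non-evidence variable given all the others
        for v in ('C', 'S', 'R'):
            if v in evidence:
                continue
            p = {}
            for val in (True, False):
                state[v] = val
                p[val] = joint(state)
            state[v] = random.random() < p[True] / (p[True] + p[False])
        # Step 3: count samples that match the query target q
        hits += state[query]
    return hits / T   # approximates P(Q = q | E = e)

print("P(Rain=1 | Wet=1) ≈", gibbs())
```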

Advantage

Because each step resamples a single variable conditioned on the current values of all the others, the Markov chain gradually converges to the true posterior distribution. Each update only needs a low-dimensional conditional distribution, which makes the method practical in high-dimensional problems and for complex Bayesian networks.

Variational Inference

Basic Idea

Variational Inference transforms complex probabilistic inference into an optimization problem. It uses a known simple distribution $q(Z)$ to approximate the complex posterior distribution $P(Z | X, \Theta)$.

Log-Likelihood Decomposition

The log-likelihood can be decomposed into two parts:

$$\ln p(X) = L(q) + KL(q || p)$$

Evidence Lower Bound (ELBO):

$$L(q) = \int q(z) \ln\left\{\frac{p(x, z)}{q(z)}\right\} dz$$

KL Divergence:

$$KL(q || p) = -\int q(z) \ln\left\{\frac{p(z | x)}{q(z)}\right\} dz$$

Optimization Objective

Since $KL(q || p) \geq 0$, we have $L(q) \leq \ln p(X)$. Variational inference maximizes $L(q)$ (the ELBO), which drives $q(z)$ as close as possible to the true posterior $p(z | x)$.
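As a numerical sanity check on this decomposition, the snippet below builds a tiny discrete toy model (the prior, the likelihood values, and $q(z)$ are made-up numbers) and verifies that $\ln p(x) = L(q) + KL(q || p)$, with the gap between $\ln p(x)$ and the ELBO equal to the KL divergence.

```python
import numpy as np

# Hypothetical discrete toy model: z in {0, 1, 2}, one observation x.
p_z = np.array([0.5, 0.3, 0.2])           # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])   # likelihood p(x | z) at the observed x
p_xz = p_z * p_x_given_z                  # joint p(x, z)
p_x = p_xz.sum()                          # evidence p(x)
post = p_xz / p_x                         # true posterior p(z | x)

q = np.array([0.6, 0.3, 0.1])             # arbitrary approximating distribution q(z)

elbo = np.sum(q * np.log(p_xz / q))       # L(q) = sum_z q(z) ln{ p(x,z) / q(z) }
kl = np.sum(q * np.log(q / post))         # KL(q || p(z|x))

print(np.log(p_x), elbo + kl)             # both equal ln p(x)
print("gap between ln p(x) and ELBO:", np.log(p_x) - elbo, "= KL:", kl)
```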

EM Algorithm with Variational Inference

Integration

In practice, the E-step of EM (inferring $P(Z | X, \Theta^t)$) may be intractable. Variational inference can simplify it by constraining the form of $q(z)$:

E-step (with Variational Inference)

Fix $\Theta^t$ and find the optimal $q(z)$ that maximizes $L(q, \Theta^t)$. This is equivalent to minimizing $KL(q || p(z | X, \Theta^t))$.

M-step

Fix $q(z)$ and find the optimal $\Theta$ that maximizes $L(q, \Theta)$. This is equivalent to maximizing $E_{q(z)}[\ln p(X, Z | \Theta)]$.
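Why these two equivalences hold (a standard derivation, stated here for completeness): writing the decomposition from the previous section with the parameters made explicit,

$$\ln p(X | \Theta) = L(q, \Theta) + KL(q || p(Z | X, \Theta)), \qquad L(q, \Theta) = \int q(Z) \ln\left\{\frac{p(X, Z | \Theta)}{q(Z)}\right\} dZ$$

In the E-step, $\Theta = \Theta^t$ is fixed, so $\ln p(X | \Theta^t)$ is a constant; maximizing $L(q, \Theta^t)$ over $q$ therefore minimizes $KL(q || p(Z | X, \Theta^t))$. In the M-step, $q$ is fixed and $L(q, \Theta) = E_{q(z)}[\ln p(X, Z | \Theta)] - E_{q(z)}[\ln q(Z)]$, where the second (entropy) term does not depend on $\Theta$, so maximizing $L$ over $\Theta$ is the same as maximizing $E_{q(z)}[\ln p(X, Z | \Theta)]$.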

Graphical Interpretation

By alternating the E-step and M-step, the lower bound $L(q, \Theta)$ monotonically increases and eventually converges to a local maximum; when $q$ matches the exact posterior the bound is tight and $\ln p(X | \Theta)$ itself increases. Variational inference makes the E-step tractable by restricting $q$ to a simpler family of approximate distributions.

Plate Notation

Definition

Plate notation is a compact way to represent probabilistic graphical models. Mutually independent variables generated by the same mechanism are placed in a box (plate) labeled with the number of repetitions $N$. Plates can be nested, and observed variables are typically shaded.

Example

For $N$ observed variables $x_1, \ldots, x_N$, each generated together with a latent variable $z$:

$$p(x | \Theta) = \prod_{i=1}^{N} \sum_z p(x_i, z | \Theta)$$

The corresponding log-likelihood:

$$\ln p(x | \Theta) = \sum_{i=1}^{N} \ln\left\{\sum_z p(x_i, z | \Theta)\right\}$$
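As a tiny numerical illustration (the model, its parameters, and the numbers below are assumptions made for the example), the code evaluates exactly this log-likelihood for a mixture with $z \in \{0, 1\}$ and unit-variance Gaussian components:

```python
import numpy as np

# Hypothetical model: z_i in {0, 1} with P(z=1) = pi_, and x_i | z ~ N(mu_z, 1).
pi_, mu = 0.3, np.array([-1.0, 2.0])

def log_likelihood(x, pi_, mu):
    """ln p(x | Theta) = sum_i ln { sum_z p(x_i, z | Theta) } for N i.i.d. samples."""
    # p(x_i, z | Theta) = P(z) * N(x_i; mu_z, 1), arranged as an (N, 2) array
    norm = np.exp(-0.5 * (x[:, None] - mu[None, :])**2) / np.sqrt(2 * np.pi)
    joint = np.array([1 - pi_, pi_]) * norm
    return np.sum(np.log(joint.sum(axis=1)))   # sum over z sits inside the log

x = np.array([-1.2, 0.3, 2.1, 1.8, -0.7])
print(log_likelihood(x, pi_, mu))
```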

Example: Handling Missing Data

Using EM algorithm for incomplete datasets

Problem Setup

In a medical dataset, some patients have missing test results. We want to estimate model parameters (e.g., disease probabilities) using all available data, including incomplete records.

EM Solution

E-step: For samples with missing test results, estimate the expected value of the missing variable based on current parameter estimates and observed symptoms.

M-step: Update disease probabilities and symptom distributions using both complete samples and expected values from incomplete samples.

Iterate: Repeat E-step and M-step until convergence. The algorithm naturally handles missing data without discarding incomplete samples.
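A minimal sketch of how this could look in code, assuming a hypothetical model with a latent binary disease variable D, an always-observed binary symptom S, and a binary test result T that is missing for some patients; the variable names, probabilities, and model structure are illustrative assumptions rather than anything specified above.

```python
import numpy as np

def em_missing(s, t, n_iter=100, seed=0):
    """EM sketch: latent disease D, observed symptom S, test T that may be NaN."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(t)
    # Theta^0: pi = P(D=1), a[d] = P(S=1|D=d), b[d] = P(T=1|D=d)
    pi, a, b = 0.5, rng.uniform(0.3, 0.7, 2), rng.uniform(0.3, 0.7, 2)

    for _ in range(n_iter):
        # E-step: posterior r_i = P(D=1 | observed part of sample i)
        like1 = pi * np.where(s == 1, a[1], 1 - a[1])
        like0 = (1 - pi) * np.where(s == 1, a[0], 1 - a[0])
        like1 = np.where(obs, like1 * np.where(t == 1, b[1], 1 - b[1]), like1)
        like0 = np.where(obs, like0 * np.where(t == 1, b[0], 1 - b[0]), like0)
        r = like1 / (like1 + like0)
        # Missing test results are replaced by their expected value E[T | D=d] = b[d]
        t1 = np.where(obs, t, b[1])
        t0 = np.where(obs, t, b[0])

        # M-step: weighted maximum likelihood updates using all samples
        pi = r.mean()
        a = np.array([((1 - r) * s).sum() / (1 - r).sum(), (r * s).sum() / r.sum()])
        b = np.array([((1 - r) * t0).sum() / (1 - r).sum(), (r * t1).sum() / r.sum()])
    return pi, a, b

# Usage: synthetic data with about 30% of test results missing.
rng = np.random.default_rng(1)
d = rng.random(2000) < 0.3
s = (rng.random(2000) < np.where(d, 0.9, 0.2)).astype(float)
t = (rng.random(2000) < np.where(d, 0.8, 0.1)).astype(float)
t[rng.random(2000) < 0.3] = np.nan
print(em_missing(s, t))
```

Samples with a missing test still contribute to every update through their posterior weights, which is the sense in which EM avoids discarding incomplete records.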