
Probabilistic Graphical Models

Master Markov Random Fields, Conditional Random Fields, and factor graphs

Types of Probabilistic Graphical Models

Directed Models (Bayesian Networks)

  • DAG structure
  • Natural causal interpretation
  • Examples: Bayesian networks, DBNs

Undirected Models (Markov Networks)

  • Undirected graph
  • Symmetric dependencies
  • Examples: MRF, CRF, Boltzmann machines

Partition Function: Computational Challenge

Why Computing Z is Hard

Definition

Z = \sum_{\mathbf{x}} \prod_{c} \psi_c(\mathbf{x}_c)

Sum over all possible configurations of variables

Intractability

For n binary variables: 2^n configurations

• n=20: ~1 million
• n=50: ~10^15
• n=100: ~10^30 configurations

Exact computation infeasible for moderate-sized networks
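To make this concrete, here is a minimal brute-force computation of Z for a small chain-structured MRF; the chain structure, potential values, and function name are illustrative assumptions, and the inner enumeration loop is exactly the 2^n-term sum that explodes.

```python
# Brute-force partition function for a chain MRF with n binary variables.
# Illustrative sketch: pairwise potentials psi(x_i, x_{i+1}) favouring equal neighbours.
import itertools

import numpy as np

def partition_function_bruteforce(n, coupling=1.0):
    """Sum prod_i psi(x_i, x_{i+1}) over all 2^n binary configurations."""
    psi = np.exp(coupling * np.array([[1.0, -1.0], [-1.0, 1.0]]))  # psi = exp(-E)
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n):      # 2^n configurations
        weight = 1.0
        for i in range(n - 1):
            weight *= psi[x[i], x[i + 1]]
        Z += weight
    return Z

print(partition_function_bruteforce(10))   # 1024 terms: instant
# partition_function_bruteforce(50) would need ~10^15 terms: hopeless.
```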

Approximation Methods

Sampling (MCMC)

  • Gibbs sampling
  • Metropolis-Hastings
  • Approximate Z by samples

Variational Methods

  • Mean field approximation
  • Loopy belief propagation
  • Structured variational inference

Factor Graphs

Unified Representation for Directed and Undirected Models

Factor Graph Structure

Bipartite graph with variable nodes and factor nodes:

P(\mathbf{X}) = \frac{1}{Z}\prod_{a} f_a(\mathbf{X}_{N(a)})

where N(a) = neighbors (variables connected to factor a)

Message Passing (Sum-Product Algorithm)

Variable to Factor:

\mu_{x \to f}(x) = \prod_{h \in N(x) \setminus f} \mu_{h \to x}(x)

Factor to Variable:

\mu_{f \to x}(x) = \sum_{\mathbf{X}_{N(f)\setminus x}} f(\mathbf{X}_{N(f)}) \prod_{y \in N(f)\setminus x} \mu_{y \to f}(y)

Marginals: P(x) is proportional to the product of all incoming factor-to-variable messages at x

Loopy BP Convergence

Tree-structured graphs: Exact inference in O(n) time

Graphs with cycles: No convergence guarantee, but often works well in practice
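A minimal sketch of the sum-product updates on a tiny chain-shaped factor graph x1 – f_a – x2 – f_b – x3, checked against brute-force marginalization; the factor tables are made-up illustrative values, and since the graph is a tree the two answers agree exactly.

```python
# Sum-product on a tiny chain factor graph  x1 -- f_a -- x2 -- f_b -- x3,
# compared against brute-force marginalisation. All variables are binary.
import numpy as np

f_a = np.array([[2.0, 1.0], [1.0, 3.0]])   # f_a(x1, x2)
f_b = np.array([[1.0, 2.0], [4.0, 1.0]])   # f_b(x2, x3)

# Leaf variables send the all-ones message (empty product).
mu_x1_to_fa = np.ones(2)
mu_x3_to_fb = np.ones(2)

# Factor-to-variable messages: sum out the other argument.
mu_fa_to_x2 = f_a.T @ mu_x1_to_fa          # sum over x1
mu_fb_to_x2 = f_b @ mu_x3_to_fb            # sum over x3

# Marginal of x2 is proportional to the product of incoming messages.
p_x2 = mu_fa_to_x2 * mu_fb_to_x2
p_x2 /= p_x2.sum()

# Brute force: P(x) ∝ f_a(x1, x2) f_b(x2, x3), then marginalise out x1, x3.
joint = f_a[:, :, None] * f_b[None, :, :]  # shape (x1, x2, x3)
p_x2_bf = joint.sum(axis=(0, 2))
p_x2_bf /= p_x2_bf.sum()

print(p_x2, p_x2_bf)                       # identical on this tree
```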

Markov Random Fields (MRF)

Undirected graphical model for joint distribution:

P(X) = \frac{1}{Z}\prod_{c \in C} \psi_c(X_c)

Potentials encode compatibility preferences, not probabilities; they are often written in energy-based form, ψ_c = exp(-E_c).

Applications:

  • Image denoising
  • Texture modeling
  • Spatial data modeling

Conditional Random Fields (CRF)

Discriminative model for structured prediction:

P(Y|X) = \frac{1}{Z(X)}\prod_c \psi_c(Y_c, X)

Directly models P(Y|X), not the joint P(X, Y). Can use rich, overlapping features of the observations X.

Applications:

  • Named Entity Recognition
  • POS tagging
  • Image segmentation

MRF: Factorization Theorem

Hammersley-Clifford Theorem

Gibbs Distribution

P(\mathbf{X}) = \frac{1}{Z}\exp\left(-\sum_{c \in C} E_c(\mathbf{X}_c)\right) = \frac{1}{Z}\prod_{c \in C} \psi_c(\mathbf{X}_c)

• C = set of cliques (fully connected subgraphs)
• E_c = energy function for clique c
• ψ_c = exp(-E_c) = potential function
• Z = partition function (normalization constant)

Partition Function Z

Z = \sum_{\mathbf{x}} \prod_{c \in C} \psi_c(\mathbf{x}_c)

Major challenge: Computing Z requires summing over all possible configurations → intractable for large graphs!

For n binary variables: 2^n terms to sum. Approximation methods needed.

Markov Property

Pairwise Markov Property:

X_i \perp X_j \mid \mathbf{X}_{\setminus\{i,j\}} \quad \text{if } (i,j) \notin E

Non-adjacent nodes are conditionally independent given all other nodes

CRF: Gradient Calculation

Linear-Chain CRF Training

Linear-Chain CRF Model

For sequence labeling (e.g., POS tagging):

P(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\left(\sum_t \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t)\right)

• f_k = feature functions
• λ_k = learned weights
• Linear-chain: each y_t depends only on y_(t-1) and all of x

Log-Likelihood Gradient

Objective: Maximize log-likelihood

\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \log P(\mathbf{y}^{(i)}|\mathbf{x}^{(i)})

Gradient has elegant form:

\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_i \left[\underbrace{\sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}^{(i)}, t)}_{\text{observed}} - \underbrace{\mathbb{E}_{P(\mathbf{y}|\mathbf{x}^{(i)})}\left[\sum_t f_k\right]}_{\text{expected}}\right]

Gradient = observed feature counts minus expected feature counts under the current model.

Forward-Backward Algorithm

Efficiently compute marginals P(y_t | x) and expectations:

Forward α:

\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y')\, \psi(y', y, \mathbf{x}, t)

Unnormalized score of all label prefixes ending in y at position t

Backward β:

\beta_t(y) = \sum_{y'} \beta_{t+1}(y')\, \psi(y, y', \mathbf{x}, t+1)

Unnormalized score of all label suffixes starting from y at position t

Marginal:

P(y_t|\mathbf{x}) = \frac{\alpha_t(y_t)\,\beta_t(y_t)}{Z(\mathbf{x})}
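A minimal forward-backward sketch for a linear chain, using randomly generated potentials ψ[t, y_{t-1}, y_t] as stand-ins for exp(Σ_k λ_k f_k); the resulting marginals are what supply the expected feature counts in the gradient above. Array names and sizes are illustrative assumptions.

```python
# Forward-backward for a linear-chain model with potentials psi[t, y_prev, y]
# (illustrative random scores; in a CRF these would be
# exp(sum_k lambda_k f_k(y_prev, y, x, t))).
import numpy as np

rng = np.random.default_rng(0)
T, K = 5, 3                                  # sequence length, number of labels
psi = rng.uniform(0.5, 2.0, size=(T, K, K))  # psi[t, y_{t-1}, y_t]
psi0 = rng.uniform(0.5, 2.0, size=K)         # initial scores for y_0

# Forward pass: alpha[t, y] = sum_{y'} alpha[t-1, y'] * psi[t, y', y]
alpha = np.zeros((T, K))
alpha[0] = psi0
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ psi[t]

# Backward pass: beta[t, y] = sum_{y'} beta[t+1, y'] * psi[t+1, y, y']
beta = np.zeros((T, K))
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = psi[t + 1] @ beta[t + 1]

Z = alpha[T - 1].sum()                       # partition function Z(x)
marginals = alpha * beta / Z                 # P(y_t = y | x) for every t, y
print(marginals.sum(axis=1))                 # each row sums to 1
```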

Viterbi Algorithm (MAP Inference)

Find most likely sequence: y* = argmax P(y|x)

\delta_t(y) = \max_{y'} \delta_{t-1}(y')\, \psi(y', y, \mathbf{x}, t)

Return: sequence with max δ_T via backtracking

Complexity: O(T|Y|²) where T = sequence length, |Y| = label set size
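A matching Viterbi sketch, reusing the illustrative psi, psi0, T, K arrays from the forward-backward sketch above; sums are replaced by maxes and a back-pointer table records the best predecessor for each label.

```python
# Viterbi decoding: delta[t, y] = max_{y'} delta[t-1, y'] * psi[t, y', y].
# Reuses psi, psi0, T, K and numpy (np) from the forward-backward sketch above.
delta = np.zeros((T, K))
backptr = np.zeros((T, K), dtype=int)
delta[0] = psi0
for t in range(1, T):
    scores = delta[t - 1][:, None] * psi[t]   # scores[y', y]
    backptr[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)

# Backtrack from the best final label.
path = [int(delta[T - 1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
path.reverse()
print(path)                                   # most likely label sequence, O(T * K^2)
```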

MAP Inference Algorithms

Finding Most Likely Configuration

Problem

\mathbf{x}^* = \arg\max_{\mathbf{x}} P(\mathbf{x})

Find configuration with highest probability (Maximum A Posteriori)

Max-Product Algorithm

Replace sum with max in message passing:

\mu_{f \to x}(x) = \max_{\mathbf{X}_{N(f)\setminus x}} f(\mathbf{X}_{N(f)}) \prod_{y \in N(f)\setminus x} \mu_{y \to f}(y)

Exact on trees, approximate on graphs with cycles

LP Relaxation for General Graphs

MAP as integer programming:

\max \sum_c \sum_{\mathbf{x}_c} \theta_c(\mathbf{x}_c)\, \mu_c(\mathbf{x}_c)

subject to: μ_c(x_c) ∈ {0,1} (indicator variables) and marginalization (consistency) constraints

Relax to an LP by allowing μ ∈ [0,1]: tractable, but the optimum may not be integral.

Structure Learning

Learning Graph Structure from Data

Score-Based Methods

BIC Score:

\text{BIC}(G) = \log P(D|G, \hat{\theta}) - \frac{d}{2}\log n

• First term: log-likelihood (goodness of fit)
• Second term: penalty for complexity (d = # parameters)
• Higher BIC → better structure (see the comparison sketch below)
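A small sketch of score-based comparison on synthetic binary data, scoring the structure X → Y against the empty (independence) structure with the BIC formula above; the data-generating process, parameter counts, and function names are illustrative assumptions.

```python
# BIC for structure A (X -> Y) versus structure B (X and Y independent)
# on synthetic binary data where Y really does depend on X.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.binomial(1, 0.6, size=n)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2))

def log_lik_independent(x, y):
    px, py = x.mean(), y.mean()
    return (np.sum(np.log(np.where(x == 1, px, 1 - px)))
            + np.sum(np.log(np.where(y == 1, py, 1 - py))))

def log_lik_x_to_y(x, y):
    px = x.mean()
    ll = np.sum(np.log(np.where(x == 1, px, 1 - px)))
    for v in (0, 1):                               # P(Y | X = v) from counts
        mask = x == v
        p = y[mask].mean()
        ll += np.sum(np.log(np.where(y[mask] == 1, p, 1 - p)))
    return ll

def bic(log_lik, d, n):
    return log_lik - 0.5 * d * np.log(n)

print("independent:", bic(log_lik_independent(x, y), d=2, n=n))
print("X -> Y:     ", bic(log_lik_x_to_y(x, y), d=3, n=n))   # higher BIC wins
```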

Constraint-Based Methods

PC Algorithm:

  1. Start with fully connected graph
  2. Test conditional independence: X ⊥ Y | Z?
  3. Remove edges for independent pairs
  4. Orient edges using v-structures

Example: Named Entity Recognition with Linear-Chain CRF

Task: Label each word in "John lives in New York" as PERSON, LOCATION, or OTHER

CRF Models:

  • Transition features (e.g., PERSON is often followed by LOCATION)
  • Emission features (e.g., a capitalized word is more likely to be PERSON or LOCATION)
  • Contextual features (a window of surrounding words)

Result:

John [PERSON] lives [OTHER] in [OTHER] New [LOCATION] York [LOCATION]
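A sketch of how such feature functions f_k(y_t, y_{t-1}, x, t) might be written by hand for this sentence; the feature names and templates are illustrative assumptions, not a fixed CRF feature set.

```python
# Hand-written binary feature functions for the NER example above.
def crf_features(y_t, y_prev, words, t):
    """Return the set of active (binary) feature names at position t."""
    word = words[t]
    feats = {
        f"emission:word={word.lower()}:label={y_t}",          # emission feature
        f"transition:{y_prev}->{y_t}",                        # transition feature
    }
    if word[0].isupper():
        feats.add(f"shape:capitalized:label={y_t}")           # capitalisation cue
    if t > 0:
        feats.add(f"context:prev_word={words[t-1].lower()}:label={y_t}")  # window
    return feats

words = ["John", "lives", "in", "New", "York"]
labels = ["PERSON", "OTHER", "OTHER", "LOCATION", "LOCATION"]
for t in range(len(words)):
    prev = labels[t - 1] if t > 0 else "START"
    print(words[t], crf_features(labels[t], prev, words, t))
```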

Variational Inference

Deterministic Approximation to Posterior

Problem Setup

Want to compute the posterior P(Z|X), but the evidence P(X) = Σ_z P(X, z) is intractable to compute

Idea: Approximate P(Z|X) with simpler distribution q(Z) from family Q

Evidence Lower Bound (ELBO)

Maximize ELBO w.r.t. q:

\mathcal{L}(q) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] - \mathbb{E}_q[\log q(\mathbf{Z})]

Equivalently:

\mathcal{L}(q) = \log P(\mathbf{X}) - D_{KL}(q(\mathbf{Z}) \| P(\mathbf{Z}|\mathbf{X}))

Maximizing ELBO = minimizing KL divergence to true posterior

Mean Field Approximation

Assume factorized posterior:

q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(Z_i)

Coordinate ascent updates:

\log q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\log P(\mathbf{X}, \mathbf{Z})] + \text{const}

Iterate until convergence. Faster than sampling but less accurate.
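A minimal mean-field sketch for a small Ising-style MRF, P(x) ∝ exp(Σ_{i<j} J_ij x_i x_j + Σ_i h_i x_i) with x_i ∈ {−1, +1}, where the coordinate-ascent update reduces to m_i = tanh(h_i + Σ_j J_ij m_j); the couplings, fields, and sizes are made-up illustrative values.

```python
# Mean-field coordinate ascent for an Ising-style MRF with x_i in {-1, +1}.
# q(x) = prod_i q_i(x_i); each q_i is summarised by its mean m_i = E_q[x_i].
import numpy as np

rng = np.random.default_rng(2)
n = 8
J = rng.normal(scale=0.3, size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)                   # no self-coupling
h = rng.normal(scale=0.5, size=n)

m = np.zeros(n)                            # initial mean-field means
for sweep in range(100):
    m_old = m.copy()
    for i in range(n):                     # coordinate ascent, one variable at a time
        m[i] = np.tanh(h[i] + J[i] @ m)
    if np.max(np.abs(m - m_old)) < 1e-8:   # stop when updates stop changing
        break

q_plus = (1 + m) / 2                       # q_i(x_i = +1)
print(sweep, q_plus)
```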

MCMC Sampling

Markov Chain Monte Carlo Methods

Gibbs Sampling

Algorithm: Sample each variable conditioned on all others

Initialize x^(0)
For t = 1 to T:
    For i = 1 to n:
        x_i^(t) ~ P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t-1), ..., x_n^(t-1))

Simple when conditionals are tractable. Converges to stationary distribution.
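A minimal Gibbs-sampling sketch for the same kind of Ising-style MRF used in the mean-field sketch, where each conditional P(x_i | rest) is a simple sigmoid of the local field; couplings, fields, and the number of sweeps are illustrative assumptions.

```python
# Gibbs sampling for P(x) ∝ exp(sum_{i<j} J_ij x_i x_j + sum_i h_i x_i), x_i in {-1, +1}.
import numpy as np

def gibbs_ising(J, h, n_sweeps=1000, rng=None):
    """Return an array of samples, one row per full sweep over the variables."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(h)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):
            field = h[i] + J[i] @ x - J[i, i] * x[i]     # exclude any self-term
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(x_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(3)
n = 8
J = rng.normal(scale=0.3, size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.normal(scale=0.5, size=n)
samples = gibbs_ising(J, h, rng=rng)
print(samples[200:].mean(axis=0))   # E[x_i] estimates after a burn-in of 200 sweeps
```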

Metropolis-Hastings

General MCMC with proposal distribution q(x'|x):

  1. Propose x' ~ q(x'|x)
  2. Compute acceptance ratio:
\alpha = \min\left(1, \frac{P(x')\,q(x|x')}{P(x)\,q(x'|x)}\right)
  3. Accept x' with probability α, else keep x

Only needs P up to normalization constant (Z cancels in ratio)!
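A minimal Metropolis-Hastings sketch targeting an unnormalized 1-D density (a two-component Gaussian mixture chosen for illustration) with a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio; all names and tuning values are assumptions.

```python
# Random-walk Metropolis-Hastings on an unnormalised target p_tilde.
import numpy as np

def p_tilde(x):
    """Unnormalised two-component Gaussian mixture (illustrative target)."""
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples=10_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()                # propose x' ~ q(x'|x)
        alpha = min(1.0, p_tilde(x_prop) / p_tilde(x))  # acceptance ratio (Z cancels)
        if rng.random() < alpha:
            x = x_prop                                  # accept
        chain.append(x)                                 # else keep current x
    return np.array(chain)

chain = metropolis_hastings()
print(chain[1000:].mean(), chain[1000:].std())          # moment estimates after burn-in
```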

Convergence Diagnostics

  • Trace plots: Visualize chain trajectory
  • Gelman-Rubin statistic: Compare multiple chains
  • Effective sample size: Account for autocorrelation (see the sketch below)
  • Burn-in: Discard initial samples before convergence
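Continuing the Metropolis-Hastings sketch above, a rough effective-sample-size estimate using the common N / (1 + 2 Σ ρ_k) heuristic, with the autocorrelation sum truncated at the first non-positive lag; the lag cut-off and burn-in are illustrative choices.

```python
# Rough ESS estimate for the MH chain produced above.
import numpy as np

def effective_sample_size(chain, max_lag=200):
    x = chain - chain.mean()
    n = len(x)
    var = x @ x / n
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = (x[:-lag] @ x[lag:]) / (n * var)   # autocorrelation at this lag
        if rho <= 0:                             # truncate at first non-positive value
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

print(effective_sample_size(chain[1000:]))       # far fewer than 9000 independent draws
```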

PGM Applications

Real-World Use Cases

Bayesian Networks

  • Medical diagnosis (symptoms → diseases)
  • Spam filtering (words → spam/ham)
  • Credit risk assessment
  • Gene regulatory networks

Markov Random Fields

  • Image segmentation (pixel neighbors)
  • Image denoising
  • Texture synthesis
  • Protein structure prediction

CRFs

  • Named Entity Recognition
  • Part-of-speech tagging
  • Semantic segmentation
  • OCR (optical character recognition)

Hidden Markov Models

  • Speech recognition
  • Handwriting recognition
  • Bioinformatics (gene finding)
  • Time series prediction

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main difference between directed (Bayesian networks) and undirected (Markov networks) graphical models?
  2. In a Markov Random Field, the joint distribution factorizes as:
  3. What is the partition function in MRFs?
  4. Conditional Random Fields (CRFs) are:
  5. Linear-chain CRF for sequence labeling has the form:
  6. Compared to HMMs, CRFs:
  7. Factor graphs represent:
  8. The Markov blanket of a node in an undirected graph consists of:
  9. Loopy belief propagation is used when:
  10. Common applications of CRFs include: