
Probabilistic Graphical Models

Master Markov Random Fields, Conditional Random Fields, and factor graphs

Types of Probabilistic Graphical Models

Directed Models (Bayesian Networks)

  • DAG structure
  • Natural causal interpretation
  • Examples: Bayesian networks, DBNs

Undirected Models (Markov Networks)

  • Undirected graph
  • Symmetric dependencies
  • Examples: MRF, CRF, Boltzmann machines

Partition Function: Computational Challenge

Why Computing Z is Hard

Definition

Z = \sum_{\mathbf{x}} \prod_{c} \psi_c(\mathbf{x}_c)

Sum over all possible configurations of variables

Intractability

For n binary variables: 2^n configurations

• n=20: ~1 million
• n=50: ~10^15
• n=100: ~10^30 configurations

Exact computation infeasible for moderate-sized networks
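To make this concrete, here is a minimal brute-force computation of Z for a small chain-structured MRF; the chain structure, potential values, and function name are illustrative assumptions, and the inner enumeration loop is exactly the 2^n-term sum that explodes.

```python
# Brute-force partition function for a chain MRF with n binary variables.
# Illustrative sketch: pairwise potentials psi(x_i, x_{i+1}) favouring equal neighbours.
import itertools

import numpy as np

def partition_function_bruteforce(n, coupling=1.0):
    """Sum prod_i psi(x_i, x_{i+1}) over all 2^n binary configurations."""
    psi = np.exp(coupling * np.array([[1.0, -1.0], [-1.0, 1.0]]))  # psi = exp(-E)
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n):      # 2^n configurations
        weight = 1.0
        for i in range(n - 1):
            weight *= psi[x[i], x[i + 1]]
        Z += weight
    return Z

print(partition_function_bruteforce(10))   # 1024 terms: instant
# partition_function_bruteforce(50) would need ~10^15 terms: hopeless.
```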

Approximation Methods

Sampling (MCMC)

  • Gibbs sampling
  • Metropolis-Hastings
  • Approximate Z by samples

Variational Methods

  • Mean field approximation
  • Loopy belief propagation
  • Structured variational inference

Factor Graphs

Unified Representation for Directed and Undirected Models

Factor Graph Structure

Bipartite graph with variable nodes and factor nodes:

P(\mathbf{X}) = \frac{1}{Z}\prod_{a} f_a(\mathbf{X}_{N(a)})

where N(a) = neighbors (variables connected to factor a)

Message Passing (Sum-Product Algorithm)

Variable to Factor:

\mu_{x \to f}(x) = \prod_{h \in N(x) \setminus f} \mu_{h \to x}(x)

Factor to Variable:

\mu_{f \to x}(x) = \sum_{\mathbf{X}_{N(f)\setminus x}} f(\mathbf{X}_{N(f)}) \prod_{y \in N(f)\setminus x} \mu_{y \to f}(y)

Marginals: P(x) is proportional to the product of all incoming factor-to-variable messages at x

Loopy BP Convergence

Tree-structured graphs: Exact inference in O(n) time

Graphs with cycles: No convergence guarantee, but often works well in practice
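A minimal sketch of the sum-product updates on a tiny chain-shaped factor graph x1 – f_a – x2 – f_b – x3, checked against brute-force marginalization; the factor tables are made-up illustrative values, and since the graph is a tree the two answers agree exactly.

```python
# Sum-product on a tiny chain factor graph  x1 -- f_a -- x2 -- f_b -- x3,
# compared against brute-force marginalisation. All variables are binary.
import numpy as np

f_a = np.array([[2.0, 1.0], [1.0, 3.0]])   # f_a(x1, x2)
f_b = np.array([[1.0, 2.0], [4.0, 1.0]])   # f_b(x2, x3)

# Leaf variables send the all-ones message (empty product).
mu_x1_to_fa = np.ones(2)
mu_x3_to_fb = np.ones(2)

# Factor-to-variable messages: sum out the other argument.
mu_fa_to_x2 = f_a.T @ mu_x1_to_fa          # sum over x1
mu_fb_to_x2 = f_b @ mu_x3_to_fb            # sum over x3

# Marginal of x2 is proportional to the product of incoming messages.
p_x2 = mu_fa_to_x2 * mu_fb_to_x2
p_x2 /= p_x2.sum()

# Brute force: P(x) ∝ f_a(x1, x2) f_b(x2, x3), then marginalise out x1, x3.
joint = f_a[:, :, None] * f_b[None, :, :]  # shape (x1, x2, x3)
p_x2_bf = joint.sum(axis=(0, 2))
p_x2_bf /= p_x2_bf.sum()

print(p_x2, p_x2_bf)                       # identical on this tree
```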

Markov Random Fields (MRF)

Undirected graphical model for joint distribution:

P(X) = \frac{1}{Z}\prod_{c \in C} \psi_c(X_c)

Potentials encode compatibility preferences, not probabilities; they are often written in energy-based form, ψ_c = exp(-E_c).

Applications:

  • Image denoising
  • Texture modeling
  • Spatial data modeling

Conditional Random Fields (CRF)

Discriminative model for structured prediction:

P(Y|X) = \frac{1}{Z(X)}\prod_c \psi_c(Y_c, X)

Directly models P(Y|X), not the joint P(X, Y). Can use rich, overlapping features of the observations X.

Applications:

  • Named Entity Recognition
  • POS tagging
  • Image segmentation

MRF: Factorization Theorem

Hammersley-Clifford Theorem

Gibbs Distribution

P(\mathbf{X}) = \frac{1}{Z}\exp\left(-\sum_{c \in C} E_c(\mathbf{X}_c)\right) = \frac{1}{Z}\prod_{c \in C} \psi_c(\mathbf{X}_c)

• C = set of cliques (fully connected subgraphs)
• E_c = energy function for clique c
• ψ_c = exp(-E_c) = potential function
• Z = partition function (normalization constant)

Partition Function Z

Z = \sum_{\mathbf{x}} \prod_{c \in C} \psi_c(\mathbf{x}_c)

Major challenge: Computing Z requires summing over all possible configurations → intractable for large graphs!

For n binary variables: 2^n terms to sum. Approximation methods needed.

Markov Property

Pairwise Markov Property:

X_i \perp X_j \mid \mathbf{X}_{\setminus\{i,j\}} \quad \text{if } (i,j) \notin E

Non-adjacent nodes are conditionally independent given all other nodes

CRF: Gradient Calculation

Linear-Chain CRF Training

Linear-Chain CRF Model

For sequence labeling (e.g., POS tagging):

P(\mathbf{y}|\mathbf{x}) = \frac{1}{Z(\mathbf{x})}\exp\left(\sum_t \sum_k \lambda_k f_k(y_t, y_{t-1}, \mathbf{x}, t)\right)

• f_k = feature functions
• λ_k = learned weights
• Linear-chain: each y_t depends only on y_(t-1) and all of x

Log-Likelihood Gradient

Objective: Maximize log-likelihood

\mathcal{L}(\boldsymbol{\lambda}) = \sum_{i=1}^{n} \log P(\mathbf{y}^{(i)}|\mathbf{x}^{(i)})

Gradient has elegant form:

\frac{\partial \mathcal{L}}{\partial \lambda_k} = \sum_i \left[\underbrace{\sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, \mathbf{x}^{(i)}, t)}_{\text{observed}} - \underbrace{\mathbb{E}_{P(\mathbf{y}|\mathbf{x}^{(i)})}\left[\sum_t f_k\right]}_{\text{expected}}\right]

Gradient = observed feature counts minus expected feature counts under the current model.

Forward-Backward Algorithm

Efficiently compute marginals P(y_t | x) and expectations:

Forward α:

\alpha_t(y) = \sum_{y'} \alpha_{t-1}(y')\, \psi(y', y, \mathbf{x}, t)

Unnormalized score of all label prefixes ending in y at position t

Backward β:

\beta_t(y) = \sum_{y'} \beta_{t+1}(y')\, \psi(y, y', \mathbf{x}, t+1)

Unnormalized score of all label suffixes starting from y at position t

Marginal:

P(y_t|\mathbf{x}) = \frac{\alpha_t(y_t)\,\beta_t(y_t)}{Z(\mathbf{x})}
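A minimal forward-backward sketch for a linear chain, using randomly generated potentials ψ[t, y_{t-1}, y_t] as stand-ins for exp(Σ_k λ_k f_k); the resulting marginals are what supply the expected feature counts in the gradient above. Array names and sizes are illustrative assumptions.

```python
# Forward-backward for a linear-chain model with potentials psi[t, y_prev, y]
# (illustrative random scores; in a CRF these would be
# exp(sum_k lambda_k f_k(y_prev, y, x, t))).
import numpy as np

rng = np.random.default_rng(0)
T, K = 5, 3                                  # sequence length, number of labels
psi = rng.uniform(0.5, 2.0, size=(T, K, K))  # psi[t, y_{t-1}, y_t]
psi0 = rng.uniform(0.5, 2.0, size=K)         # initial scores for y_0

# Forward pass: alpha[t, y] = sum_{y'} alpha[t-1, y'] * psi[t, y', y]
alpha = np.zeros((T, K))
alpha[0] = psi0
for t in range(1, T):
    alpha[t] = alpha[t - 1] @ psi[t]

# Backward pass: beta[t, y] = sum_{y'} beta[t+1, y'] * psi[t+1, y, y']
beta = np.zeros((T, K))
beta[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    beta[t] = psi[t + 1] @ beta[t + 1]

Z = alpha[T - 1].sum()                       # partition function Z(x)
marginals = alpha * beta / Z                 # P(y_t = y | x) for every t, y
print(marginals.sum(axis=1))                 # each row sums to 1
```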

Viterbi Algorithm (MAP Inference)

Find most likely sequence: y* = argmax P(y|x)

\delta_t(y) = \max_{y'} \delta_{t-1}(y')\, \psi(y', y, \mathbf{x}, t)

Return: sequence with max δ_T via backtracking

Complexity: O(T|Y|²) where T = sequence length, |Y| = label set size
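A matching Viterbi sketch, reusing the illustrative psi, psi0, T, K arrays from the forward-backward sketch above; sums are replaced by maxes and a back-pointer table records the best predecessor for each label.

```python
# Viterbi decoding: delta[t, y] = max_{y'} delta[t-1, y'] * psi[t, y', y].
# Reuses psi, psi0, T, K and numpy (np) from the forward-backward sketch above.
delta = np.zeros((T, K))
backptr = np.zeros((T, K), dtype=int)
delta[0] = psi0
for t in range(1, T):
    scores = delta[t - 1][:, None] * psi[t]   # scores[y', y]
    backptr[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0)

# Backtrack from the best final label.
path = [int(delta[T - 1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(backptr[t, path[-1]]))
path.reverse()
print(path)                                   # most likely label sequence, O(T * K^2)
```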

MAP Inference Algorithms

Finding Most Likely Configuration

Problem

\mathbf{x}^* = \arg\max_{\mathbf{x}} P(\mathbf{x})

Find configuration with highest probability (Maximum A Posteriori)

Max-Product Algorithm

Replace sum with max in message passing:

\mu_{f \to x}(x) = \max_{\mathbf{X}_{N(f)\setminus x}} f(\mathbf{X}_{N(f)}) \prod_{y \in N(f)\setminus x} \mu_{y \to f}(y)

Exact on trees, approximate on graphs with cycles

LP Relaxation for General Graphs

MAP as integer programming:

\max \sum_c \sum_{\mathbf{x}_c} \theta_c(\mathbf{x}_c)\, \mu_c(\mathbf{x}_c)

subject to: μ_c(x_c) ∈ {0,1} (indicator variables) and marginalization (consistency) constraints

Relax to an LP by allowing μ ∈ [0,1]: tractable, but the optimum may not be integral.

Structure Learning

Learning Graph Structure from Data

Score-Based Methods

BIC Score:

\text{BIC}(G) = \log P(D|G, \hat{\theta}) - \frac{d}{2}\log n

• First term: log-likelihood (goodness of fit)
• Second term: penalty for complexity (d = # parameters)
• Higher BIC → better structure (see the comparison sketch below)
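A small sketch of score-based comparison on synthetic binary data, scoring the structure X → Y against the empty (independence) structure with the BIC formula above; the data-generating process, parameter counts, and function names are illustrative assumptions.

```python
# BIC for structure A (X -> Y) versus structure B (X and Y independent)
# on synthetic binary data where Y really does depend on X.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.binomial(1, 0.6, size=n)
y = rng.binomial(1, np.where(x == 1, 0.8, 0.2))

def log_lik_independent(x, y):
    px, py = x.mean(), y.mean()
    return (np.sum(np.log(np.where(x == 1, px, 1 - px)))
            + np.sum(np.log(np.where(y == 1, py, 1 - py))))

def log_lik_x_to_y(x, y):
    px = x.mean()
    ll = np.sum(np.log(np.where(x == 1, px, 1 - px)))
    for v in (0, 1):                               # P(Y | X = v) from counts
        mask = x == v
        p = y[mask].mean()
        ll += np.sum(np.log(np.where(y[mask] == 1, p, 1 - p)))
    return ll

def bic(log_lik, d, n):
    return log_lik - 0.5 * d * np.log(n)

print("independent:", bic(log_lik_independent(x, y), d=2, n=n))
print("X -> Y:     ", bic(log_lik_x_to_y(x, y), d=3, n=n))   # higher BIC wins
```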

Constraint-Based Methods

PC Algorithm:

  1. Start with fully connected graph
  2. Test conditional independence: X ⊥ Y | Z?
  3. Remove edges for independent pairs
  4. Orient edges using v-structures

Example: Named Entity Recognition with Linear-Chain CRF

Task: Label each word in "John lives in New York" as PERSON, LOCATION, or OTHER

CRF Models:

  • Transition features (e.g., PERSON is often followed by LOCATION)
  • Emission features (e.g., a capitalized word is more likely to be PERSON or LOCATION)
  • Contextual features (a window of surrounding words)

Result:

John [PERSON] lives [OTHER] in [OTHER] New [LOCATION] York [LOCATION]
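A sketch of how such feature functions f_k(y_t, y_{t-1}, x, t) might be written by hand for this sentence; the feature names and templates are illustrative assumptions, not a fixed CRF feature set.

```python
# Hand-written binary feature functions for the NER example above.
def crf_features(y_t, y_prev, words, t):
    """Return the set of active (binary) feature names at position t."""
    word = words[t]
    feats = {
        f"emission:word={word.lower()}:label={y_t}",          # emission feature
        f"transition:{y_prev}->{y_t}",                        # transition feature
    }
    if word[0].isupper():
        feats.add(f"shape:capitalized:label={y_t}")           # capitalisation cue
    if t > 0:
        feats.add(f"context:prev_word={words[t-1].lower()}:label={y_t}")  # window
    return feats

words = ["John", "lives", "in", "New", "York"]
labels = ["PERSON", "OTHER", "OTHER", "LOCATION", "LOCATION"]
for t in range(len(words)):
    prev = labels[t - 1] if t > 0 else "START"
    print(words[t], crf_features(labels[t], prev, words, t))
```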

Variational Inference

Deterministic Approximation to Posterior

Problem Setup

Want to compute the posterior P(Z|X), but the evidence P(X) = Σ_z P(X, z) is intractable to compute

Idea: Approximate P(Z|X) with simpler distribution q(Z) from family Q

Evidence Lower Bound (ELBO)

Maximize ELBO w.r.t. q:

\mathcal{L}(q) = \mathbb{E}_q[\log P(\mathbf{X}, \mathbf{Z})] - \mathbb{E}_q[\log q(\mathbf{Z})]

Equivalently:

\mathcal{L}(q) = \log P(\mathbf{X}) - D_{KL}(q(\mathbf{Z}) \| P(\mathbf{Z}|\mathbf{X}))

Maximizing ELBO = minimizing KL divergence to true posterior

Mean Field Approximation

Assume factorized posterior:

q(\mathbf{Z}) = \prod_{i=1}^{M} q_i(Z_i)

Coordinate ascent updates:

\log q_j^*(Z_j) = \mathbb{E}_{i \neq j}[\log P(\mathbf{X}, \mathbf{Z})] + \text{const}

Iterate until convergence. Faster than sampling but less accurate.
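A minimal mean-field sketch for a small Ising-style MRF, P(x) ∝ exp(Σ_{i<j} J_ij x_i x_j + Σ_i h_i x_i) with x_i ∈ {−1, +1}, where the coordinate-ascent update reduces to m_i = tanh(h_i + Σ_j J_ij m_j); the couplings, fields, and sizes are made-up illustrative values.

```python
# Mean-field coordinate ascent for an Ising-style MRF with x_i in {-1, +1}.
# q(x) = prod_i q_i(x_i); each q_i is summarised by its mean m_i = E_q[x_i].
import numpy as np

rng = np.random.default_rng(2)
n = 8
J = rng.normal(scale=0.3, size=(n, n))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)                   # no self-coupling
h = rng.normal(scale=0.5, size=n)

m = np.zeros(n)                            # initial mean-field means
for sweep in range(100):
    m_old = m.copy()
    for i in range(n):                     # coordinate ascent, one variable at a time
        m[i] = np.tanh(h[i] + J[i] @ m)
    if np.max(np.abs(m - m_old)) < 1e-8:   # stop when updates stop changing
        break

q_plus = (1 + m) / 2                       # q_i(x_i = +1)
print(sweep, q_plus)
```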

MCMC Sampling

Markov Chain Monte Carlo Methods

Gibbs Sampling

Algorithm: Sample each variable conditioned on all others

Initialize x^(0)
For t = 1 to T:
    For i = 1 to n:
        x_i^(t) ~ P(x_i | x_1^(t), ..., x_{i-1}^(t), x_{i+1}^(t-1), ..., x_n^(t-1))

Simple when conditionals are tractable. Converges to stationary distribution.
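A minimal Gibbs-sampling sketch for the same kind of Ising-style MRF used in the mean-field sketch, where each conditional P(x_i | rest) is a simple sigmoid of the local field; couplings, fields, and the number of sweeps are illustrative assumptions.

```python
# Gibbs sampling for P(x) ∝ exp(sum_{i<j} J_ij x_i x_j + sum_i h_i x_i), x_i in {-1, +1}.
import numpy as np

def gibbs_ising(J, h, n_sweeps=1000, rng=None):
    """Return an array of samples, one row per full sweep over the variables."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(h)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for _ in range(n_sweeps):
        for i in range(n):
            field = h[i] + J[i] @ x - J[i, i] * x[i]     # exclude any self-term
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))  # P(x_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)

rng = np.random.default_rng(3)
n = 8
J = rng.normal(scale=0.3, size=(n, n)); J = (J + J.T) / 2; np.fill_diagonal(J, 0.0)
h = rng.normal(scale=0.5, size=n)
samples = gibbs_ising(J, h, rng=rng)
print(samples[200:].mean(axis=0))   # E[x_i] estimates after a burn-in of 200 sweeps
```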

Metropolis-Hastings

General MCMC with proposal distribution q(x'|x):

  1. Propose x' ~ q(x'|x)
  2. Compute acceptance ratio:
\alpha = \min\left(1, \frac{P(x')\,q(x|x')}{P(x)\,q(x'|x)}\right)
  3. Accept x' with probability α, else keep x

Only needs P up to normalization constant (Z cancels in ratio)!
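A minimal Metropolis-Hastings sketch targeting an unnormalized 1-D density (a two-component Gaussian mixture chosen for illustration) with a symmetric Gaussian random-walk proposal, so the q terms cancel in the acceptance ratio; all names and tuning values are assumptions.

```python
# Random-walk Metropolis-Hastings on an unnormalised target p_tilde.
import numpy as np

def p_tilde(x):
    """Unnormalised two-component Gaussian mixture (illustrative target)."""
    return np.exp(-0.5 * (x - 2) ** 2) + 0.5 * np.exp(-0.5 * (x + 2) ** 2)

def metropolis_hastings(n_samples=10_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = 0.0
    chain = []
    for _ in range(n_samples):
        x_prop = x + step * rng.normal()                # propose x' ~ q(x'|x)
        alpha = min(1.0, p_tilde(x_prop) / p_tilde(x))  # acceptance ratio (Z cancels)
        if rng.random() < alpha:
            x = x_prop                                  # accept
        chain.append(x)                                 # else keep current x
    return np.array(chain)

chain = metropolis_hastings()
print(chain[1000:].mean(), chain[1000:].std())          # moment estimates after burn-in
```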

Convergence Diagnostics

  • Trace plots: Visualize chain trajectory
  • Gelman-Rubin statistic: Compare multiple chains
  • Effective sample size: Account for autocorrelation (see the sketch below)
  • Burn-in: Discard initial samples before convergence
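Continuing the Metropolis-Hastings sketch above, a rough effective-sample-size estimate using the common N / (1 + 2 Σ ρ_k) heuristic, with the autocorrelation sum truncated at the first non-positive lag; the lag cut-off and burn-in are illustrative choices.

```python
# Rough ESS estimate for the MH chain produced above.
import numpy as np

def effective_sample_size(chain, max_lag=200):
    x = chain - chain.mean()
    n = len(x)
    var = x @ x / n
    rho_sum = 0.0
    for lag in range(1, max_lag):
        rho = (x[:-lag] @ x[lag:]) / (n * var)   # autocorrelation at this lag
        if rho <= 0:                             # truncate at first non-positive value
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

print(effective_sample_size(chain[1000:]))       # far fewer than 9000 independent draws
```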

PGM Applications

Real-World Use Cases

Bayesian Networks

  • Medical diagnosis (symptoms → diseases)
  • Spam filtering (words → spam/ham)
  • Credit risk assessment
  • Gene regulatory networks

Markov Random Fields

  • Image segmentation (pixel neighbors)
  • Image denoising
  • Texture synthesis
  • Protein structure prediction

CRFs

  • Named Entity Recognition
  • Part-of-speech tagging
  • Semantic segmentation
  • OCR (optical character recognition)

Hidden Markov Models

  • Speech recognition
  • Handwriting recognition
  • Bioinformatics (gene finding)
  • Time series prediction

Practice Quiz

Test your understanding with 10 multiple-choice questions

  1. What is the main difference between directed (Bayesian networks) and undirected (Markov networks) graphical models?
  2. In a Markov Random Field, the joint distribution factorizes as:
  3. What is the partition function in MRFs?
  4. Conditional Random Fields (CRFs) are:
  5. Linear-chain CRF for sequence labeling has the form:
  6. Compared to HMMs, CRFs:
  7. Factor graphs represent:
  8. The Markov blanket of a node in an undirected graph consists of:
  9. Loopy belief propagation is used when:
  10. Common applications of CRFs include: