Master Markov Random Fields, Conditional Random Fields, and factor graphs
Sum over all possible configurations of variables
For n binary variables: 2^n configurations
• n=20: ~1 million
• n=50: ~10^15
• n=100: ~10^30 (far beyond any feasible enumeration)
Exact computation infeasible for moderate-sized networks
Two main families of approximate inference:
• Sampling (MCMC)
• Variational methods
Bipartite graph with variable nodes and factor nodes:
P(x) = (1/Z) ∏_a ψ_a(x_(N(a)))
where N(a) = neighbors of factor a (the variables connected to factor a)
Variable to Factor: μ_(x→a)(x) = ∏_(b ∈ N(x)\{a}) μ_(b→x)(x)
Factor to Variable: μ_(a→x)(x) = Σ_(x_(N(a)\{x})) ψ_a(x_(N(a))) ∏_(y ∈ N(a)\{x}) μ_(y→a)(y)
Marginals: P(x_i) ∝ ∏_(a ∈ N(i)) μ_(a→x_i)(x_i)
Tree-structured graphs: Exact inference in O(n) time
Graphs with cycles: no convergence guarantee (loopy belief propagation), but it often works well in practice
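A minimal sum-product sketch (Python/NumPy), assuming a 3-variable binary chain with made-up unary and pairwise potentials; on this tree the messages give exact marginals, which the code checks by brute-force enumeration.

import numpy as np
from itertools import product

# Minimal sum-product sketch on a 3-variable binary chain: x0 - x1 - x2.
# Unary potentials phi[i] (shape 2) and pairwise potentials psi[(i,j)] (shape 2x2)
# are made-up numbers purely for illustration.
phi = [np.array([1.0, 2.0]), np.array([1.5, 0.5]), np.array([1.0, 1.0])]
psi = {(0, 1): np.array([[2.0, 1.0], [1.0, 2.0]]),
       (1, 2): np.array([[1.0, 3.0], [3.0, 1.0]])}

# Forward messages m_fwd[i]: message arriving at variable i from the left.
m_fwd = [np.ones(2) for _ in range(3)]
for i in range(1, 3):
    # sum over x_{i-1} of phi_{i-1}(x_{i-1}) * incoming message * psi(x_{i-1}, x_i)
    m_fwd[i] = (phi[i - 1] * m_fwd[i - 1]) @ psi[(i - 1, i)]

# Backward messages m_bwd[i]: message arriving at variable i from the right.
m_bwd = [np.ones(2) for _ in range(3)]
for i in range(1, -1, -1):
    m_bwd[i] = psi[(i, i + 1)] @ (phi[i + 1] * m_bwd[i + 1])

# Marginal of each variable: local potential times both incoming messages, normalized.
for i in range(3):
    unnorm = phi[i] * m_fwd[i] * m_bwd[i]
    print("BP    P(x%d):" % i, unnorm / unnorm.sum())

# Brute-force check: enumerate all 2^3 configurations.
joint = np.zeros((2, 2, 2))
for x in product([0, 1], repeat=3):
    joint[x] = (phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
                * psi[(0, 1)][x[0], x[1]] * psi[(1, 2)][x[1], x[2]])
joint /= joint.sum()
print("Exact P(x1):", joint.sum(axis=(0, 2)))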
Undirected graphical model for the joint distribution:
P(x) = (1/Z) ∏_(c ∈ C) ψ_c(x_c)
Potentials encode preferences, not probabilities. Often energy-based: ψ_c(x_c) = exp(−E_c(x_c)).
Applications:
Discriminative model for structured prediction:
P(y|x) = (1/Z(x)) ∏_c ψ_c(y_c, x)
Directly models the conditional P(y|x), not the joint P(x, y). Can use rich, overlapping features of the observations x.
Applications:
P(x) = (1/Z) exp(−Σ_(c ∈ C) E_c(x_c)), where:
• C = set of cliques (fully connected subgraphs)
• E_c = energy function for clique c
• ψ_c = exp(-E_c) = potential function
• Z = partition function (normalization constant)
Major challenge: Computing Z requires summing over all possible configurations → intractable for large graphs!
For n binary variables: 2^n terms to sum. Approximation methods needed.
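A brute-force sketch of Z for a tiny pairwise binary MRF with made-up weights, just to make the 2^n cost concrete; the function enumerates every configuration, which is exactly what becomes infeasible for large n.

import numpy as np
from itertools import product

def partition_function(W, b):
    """Brute-force Z for a pairwise binary MRF with
    E(x) = -(sum_{i<j} W[i,j] x_i x_j + sum_i b[i] x_i), x_i in {0, 1}.
    Costs 2^n terms, which is why this only works for small n."""
    n = len(b)
    Z = 0.0
    for x in product([0, 1], repeat=n):
        x = np.array(x)
        energy = -(x @ np.triu(W, 1) @ x + b @ x)
        Z += np.exp(-energy)   # psi(x) = exp(-E(x))
    return Z

# Made-up weights for a tiny 4-node model (illustration only).
rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n)); W = (W + W.T) / 2
b = rng.normal(size=n)
print("Z =", partition_function(W, b))   # 2^4 = 16 terms; n = 100 would need 2^100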
Pairwise Markov Property:
Non-adjacent nodes are conditionally independent given all other nodes
For sequence labeling (e.g., POS tagging):
P(y|x) = (1/Z(x)) exp(Σ_t Σ_k λ_k f_k(y_(t-1), y_t, x, t))
• f_k = feature functions
• λ_k = learned weights
• Linear-chain: each y_t depends only on y_(t-1) and all of x
Objective: Maximize log-likelihood
Gradient has an elegant form:
∂L/∂λ_k = Σ_t f_k(y_(t-1), y_t, x, t) − E_(y'∼P(y'|x))[Σ_t f_k(y'_(t-1), y'_t, x, t)]
Observed count − expected count under the model!
Efficiently compute marginals P(y_t | x) and expectations:
Forward α: α_t(y) = Σ_(y') α_(t-1)(y') ψ(y', y, x, t)
(Unnormalized) score of the prefix y_1..y_t ending in y_t = y
Backward β: β_t(y) = Σ_(y') ψ(y, y', x, t+1) β_(t+1)(y')
(Unnormalized) score of the suffix y_(t+1)..y_T given y_t = y
Marginal: P(y_t = y | x) = α_t(y) β_t(y) / Z(x), with Z(x) = Σ_y α_T(y)
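A sketch of the recursions above for a linear chain (Python/NumPy); the potentials ψ are random placeholders standing in for the feature-based scores exp(Σ_k λ_k f_k), and the resulting marginals are what the gradient's expected feature counts are computed from.

import numpy as np

def forward_backward(psi):
    """psi[t] has shape (Y, Y): psi[t][y_prev, y] = potential psi(y_prev, y, x, t)
    for t = 1..T-1; psi[0] has shape (Y,) with the initial scores.
    Returns the marginals P(y_t = y | x) and the partition function Z(x)."""
    T = len(psi)
    Y = psi[0].shape[0]
    alpha = np.zeros((T, Y))
    beta = np.ones((T, Y))
    alpha[0] = psi[0]
    for t in range(1, T):                 # forward: alpha_t(y) = sum_y' alpha_{t-1}(y') psi(y', y)
        alpha[t] = alpha[t - 1] @ psi[t]
    for t in range(T - 2, -1, -1):        # backward: beta_t(y) = sum_y' psi(y, y') beta_{t+1}(y')
        beta[t] = psi[t + 1] @ beta[t + 1]
    Z = alpha[-1].sum()
    marginals = alpha * beta / Z          # P(y_t = y | x) = alpha_t(y) beta_t(y) / Z
    return marginals, Z

# Illustrative potentials for T=4 positions and |Y|=3 labels (random placeholders).
rng = np.random.default_rng(1)
T, Y = 4, 3
psi = [rng.uniform(0.5, 2.0, size=Y)] + [rng.uniform(0.5, 2.0, size=(Y, Y)) for _ in range(T - 1)]
marginals, Z = forward_backward(psi)
print(marginals)                # each row is P(y_t = . | x)
print(marginals.sum(axis=1))    # each row sums to 1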
Find most likely sequence: y* = argmax P(y|x)
δ_t(y) = max_(y') δ_(t-1)(y') ψ(y', y, x, t)
Return: sequence with max δ_T via backtracking
Complexity: O(T|Y|²) where T = sequence length, |Y| = label set size
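A matching Viterbi sketch using the same kind of placeholder potentials as the forward-backward example; it keeps back-pointers and recovers the argmax sequence by backtracking, in O(T|Y|²).

import numpy as np

def viterbi(psi):
    """MAP label sequence for a linear chain.
    psi[0]: shape (Y,) initial scores; psi[t]: shape (Y, Y) transition potentials,
    as in the forward-backward sketch above."""
    T, Y = len(psi), psi[0].shape[0]
    delta = np.zeros((T, Y))              # delta_t(y) = best score of any prefix ending in y
    back = np.zeros((T, Y), dtype=int)    # back-pointers for backtracking
    delta[0] = psi[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * psi[t]   # scores[y_prev, y]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    # Backtrack from the best final label.
    y = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        y.append(int(back[t][y[-1]]))
    return y[::-1]

rng = np.random.default_rng(2)
T, Y = 5, 3
psi = [rng.uniform(0.5, 2.0, size=Y)] + [rng.uniform(0.5, 2.0, size=(Y, Y)) for _ in range(T - 1)]
print("MAP label sequence:", viterbi(psi))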
Find configuration with highest probability (Maximum A Posteriori)
Replace sum with max in message passing (max-product):
μ_(a→x)(x) = max_(x_(N(a)\{x})) ψ_a(x_(N(a))) ∏_(y ∈ N(a)\{x}) μ_(y→a)(y)
Exact on trees, approximate on graphs with cycles
MAP as an integer program:
max_μ Σ_c Σ_(x_c) θ_c(x_c) μ_c(x_c)
subject to: μ ∈ {0,1} (indicator variables), marginalization (consistency) constraints
Relax to an LP: μ ∈ [0,1]. Tractable, but the solution may not be integral.
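A toy sketch of this local-polytope LP for two binary variables using scipy.optimize.linprog; the unary and pairwise scores are made up, and because a two-node chain is a tree the relaxation is tight here (the solution comes out integral).

import numpy as np
from scipy.optimize import linprog

# MAP LP relaxation for two binary variables x1, x2 with unary scores theta1, theta2
# and a pairwise score theta12. Variable order of mu:
# [mu1(0), mu1(1), mu2(0), mu2(1), mu12(00), mu12(01), mu12(10), mu12(11)]
theta1 = np.array([0.0, 1.0])            # made-up scores for illustration
theta2 = np.array([0.5, 0.0])
theta12 = np.array([[0.0, 2.0],
                    [2.0, 0.0]])         # favors x1 != x2

c = -np.concatenate([theta1, theta2, theta12.ravel()])   # linprog minimizes, so negate

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],    # mu1 sums to 1
    [0, 0, 1, 1, 0, 0, 0, 0],    # mu2 sums to 1
    [-1, 0, 0, 0, 1, 1, 0, 0],   # sum_{x2} mu12(0, x2) = mu1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],   # sum_{x2} mu12(1, x2) = mu1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],   # sum_{x1} mu12(x1, 0) = mu2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],   # sum_{x1} mu12(x1, 1) = mu2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * 8)
print("LP optimum:", -res.fun)           # best MAP score is 3.5, at (x1, x2) = (1, 0)
print("mu:", np.round(res.x, 3))         # integral here, since a 2-node chain is a tree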
BIC Score:
BIC(G) = log P(D | θ̂_G, G) − (d/2) log N
• First term: log-likelihood (goodness of fit)
• Second term: penalty for complexity (d = # parameters, N = # data points)
• Higher BIC → better structure
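A minimal helper under the convention above (higher is better); the log-likelihood values below are placeholders, not real fits.

import numpy as np

def bic_score(log_likelihood, num_params, num_samples):
    """BIC = log-likelihood - (d/2) * log N; higher is better under this convention."""
    return log_likelihood - 0.5 * num_params * np.log(num_samples)

# Illustrative comparison of two candidate structures fit to N = 1000 samples
# (the log-likelihood values are placeholders, not from a real fit).
print(bic_score(-2140.0, num_params=12, num_samples=1000))  # simpler structure
print(bic_score(-2125.0, num_params=40, num_samples=1000))  # denser structure, bigger penalty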
PC Algorithm: constraint-based structure learning — start from the complete undirected graph, remove edges using conditional-independence tests, then orient v-structures and propagate edge orientations.
Task: Label each word in "John lives in New York" as PERSON, LOCATION, or OTHER
CRF model: a linear-chain CRF over the label sequence, P(labels | sentence), using overlapping features of the words and adjacent labels.
Result:
John [PERSON] lives [OTHER] in [OTHER] New [LOCATION] York [LOCATION]
Want to compute the posterior P(Z|X), but it is intractable because the evidence P(X) = Σ_z P(X, z) requires summing (or integrating) over all z
Idea: Approximate P(Z|X) with simpler distribution q(Z) from family Q
Maximize the ELBO w.r.t. q:
ELBO(q) = E_q[log P(X, Z)] − E_q[log q(Z)]
Equivalently:
log P(X) = ELBO(q) + KL(q(Z) || P(Z|X))
Maximizing the ELBO = minimizing the KL divergence to the true posterior (since log P(X) does not depend on q)
Assume a factorized posterior: q(Z) = ∏_i q_i(Z_i)
Coordinate ascent updates: log q_j*(Z_j) = E_(q_(-j))[log P(X, Z)] + const
Iterate until convergence. Faster than sampling but less accurate (a fully factorized q cannot capture posterior correlations).
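A sketch of coordinate-ascent mean field for a tiny Ising model with spins in {−1, +1} and made-up weights; for this model the update reduces to m_i ← tanh(b_i + Σ_j W_ij m_j), and the result is compared against exact marginal means from brute-force enumeration.

import numpy as np
from itertools import product

# Mean field for a small Ising model P(x) ∝ exp(sum_{i<j} W_ij x_i x_j + sum_i b_i x_i),
# spins x_i in {-1, +1}. Coordinate ascent on q(x) = prod_i q_i(x_i) gives
# m_i <- tanh(b_i + sum_j W_ij m_j), where m_i = E_q[x_i].
rng = np.random.default_rng(3)
n = 5
W = rng.normal(scale=0.3, size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.3, size=n)

m = np.zeros(n)
for _ in range(200):                      # iterate the updates until (approximate) convergence
    for i in range(n):
        m[i] = np.tanh(b[i] + W[i] @ m)

# Exact marginal means by brute-force enumeration (2^n terms), for comparison.
configs = np.array(list(product([-1, 1], repeat=n)))
logp = np.einsum('ki,ij,kj->k', configs, np.triu(W, 1), configs) + configs @ b
p = np.exp(logp - logp.max()); p /= p.sum()
exact_m = p @ configs

print("mean-field E[x]:", np.round(m, 3))
print("exact      E[x]:", np.round(exact_m, 3))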
Algorithm: Sample each variable conditioned on all others
Initialize x^(0)
For t = 1 to T:
For i = 1 to n:
x_i^(t) ~ P(x_i | x_1^(t),...,x_(i-1)^(t), x_(i+1)^(t-1),...,x_n^(t-1))
Simple when the conditionals are tractable. Under mild conditions, the chain converges to the target distribution P as its stationary distribution.
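A sketch of the loop above for the same kind of Ising model used in the mean-field example; for spins in {−1, +1} the conditional is P(x_i = +1 | rest) = σ(2(b_i + Σ_(j≠i) W_ij x_j)). Weights are made up for illustration.

import numpy as np

def gibbs_ising(W, b, num_sweeps=5000, rng=None):
    """Gibbs sampler for P(x) ∝ exp(sum_{i<j} W_ij x_i x_j + sum_i b_i x_i), x_i in {-1, +1}.
    Each sweep resamples every x_i from its conditional given the current values of the rest."""
    rng = rng or np.random.default_rng()
    n = len(b)
    x = rng.choice([-1, 1], size=n)       # x^(0): random initialization
    samples = []
    for t in range(num_sweeps):
        for i in range(n):
            h = b[i] + W[i] @ x - W[i, i] * x[i]      # local field from all other spins
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))   # P(x_i = +1 | rest)
            x[i] = 1 if rng.random() < p_plus else -1
        samples.append(x.copy())
    return np.array(samples)

# Same illustrative model as in the mean-field sketch above.
rng = np.random.default_rng(3)
n = 5
W = rng.normal(scale=0.3, size=(n, n)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
b = rng.normal(scale=0.3, size=n)
samples = gibbs_ising(W, b, rng=np.random.default_rng(4))
print("Gibbs estimate of E[x]:", np.round(samples[500:].mean(axis=0), 3))  # drop burn-in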
General MCMC with proposal distribution q(x'|x): propose x' ~ q(x'|x) and accept with probability
A(x → x') = min(1, [P(x') q(x|x')] / [P(x) q(x'|x)])
Only needs P up to a normalization constant (Z cancels in the ratio)!
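A minimal 1-D sketch: the target is an unnormalized two-component mixture and the proposal is a Gaussian random walk (symmetric, so the q terms cancel, but the general ratio is kept in the code); all numbers are illustrative.

import numpy as np

def unnorm_p(x):
    """Unnormalized target: a two-component Gaussian mixture (illustrative).
    MH never needs the normalization constant."""
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def q_density(x_to, x_from, scale=1.0):
    """Gaussian random-walk proposal density q(x_to | x_from), up to a constant."""
    return np.exp(-0.5 * ((x_to - x_from) / scale) ** 2)

def metropolis_hastings(num_steps=20000, scale=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    x = 0.0
    samples = []
    for _ in range(num_steps):
        x_new = x + scale * rng.normal()              # propose x' ~ q(x'|x)
        # A = min(1, P(x') q(x|x') / (P(x) q(x'|x))); the q terms cancel here
        # because the random walk is symmetric, but the general form is kept.
        a = (unnorm_p(x_new) * q_density(x, x_new, scale)) / \
            (unnorm_p(x) * q_density(x_new, x, scale))
        if rng.random() < min(1.0, a):
            x = x_new
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings()
print("sample mean:", samples[2000:].mean())   # drop burn-in before summarizing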
Test your understanding with 10 multiple-choice questions