
Bayesian Networks

Master probabilistic graphical models for reasoning under uncertainty

Structure & Representation

Directed Acyclic Graph (DAG)

A Bayesian network consists of:

1. Graph Structure (Qualitative)

  • Nodes = random variables
  • Edges = direct dependencies
  • DAG = no directed cycles
  • Encodes conditional independence

2. Parameters (Quantitative)

  • Each node X_i has a CPT (Conditional Probability Table)
  • Specifies P(X_i | Parents(X_i))
  • Quantifies strength of dependencies

Joint Probability Factorization

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))

This compact representation exploits conditional independence to reduce the O(2^n) parameters of the full joint distribution to a manageable number of per-node CPT entries.

Example: Simple Medical Diagnosis

Variables: D (Disease), F (Fever), C (Cough)

Structure: D → F, D → C (disease causes symptoms)

P(D, F, C) = P(D) \cdot P(F|D) \cdot P(C|D)

Parameters: P(D), P(F|D), and P(C|D), i.e., one prior and two small conditional tables (5 parameters instead of 7 for the full joint over three binary variables)
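
A minimal Python sketch of this factorization; the CPT values below are hypothetical, chosen only to make the example runnable:

```python
# Joint probability via the factorization P(D, F, C) = P(D) * P(F|D) * P(C|D).
# CPT numbers are illustrative, not from any real data.
p_d = {True: 0.01, False: 0.99}          # prior P(D)
p_f_given_d = {True: 0.80, False: 0.10}  # P(F = true | D)
p_c_given_d = {True: 0.70, False: 0.20}  # P(C = true | D)

def joint(d: bool, f: bool, c: bool) -> float:
    """P(D=d, F=f, C=c) using the network factorization."""
    pf = p_f_given_d[d] if f else 1.0 - p_f_given_d[d]
    pc = p_c_given_d[d] if c else 1.0 - p_c_given_d[d]
    return p_d[d] * pf * pc

# Sanity check: the eight joint entries sum to 1.
total = sum(joint(d, f, c) for d in (True, False)
                           for f in (True, False)
                           for c in (True, False))
print(round(total, 10))  # 1.0
```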

Conditional Independence Patterns

Chain (Serial)
A → B → C

Independence:

A \perp C \mid B

B blocks the path. Given B, A doesn't provide info about C.

Fork (Diverging)
A ← B → C

Independence:

A \perp C \mid B

B is common cause. Conditioning on B blocks influence between A and C.

Collider (V-structure)
A → B ← C

Independence:

A \perp C \quad \text{but} \quad A \not\perp C \mid B

Marginally independent, but dependent when conditioning on B ('explaining away').

Explaining Away Example

Scenario: Grass is wet (W). Possible causes: Rain (R) or Sprinkler (S). Structure: R → W ← S (a collider at W).

Without observing W:

R ⊥ S (rain and sprinkler are independent events)

After observing W = wet:

R ⊥̸ S | W. If we learn the sprinkler was on, rain becomes less likely (one cause explains the effect).
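
The effect is easy to check numerically. The sketch below picks hypothetical values for P(R), P(S), and P(W | R, S) and computes the posterior on rain by brute-force enumeration; the numbers are ours, for illustration only.

```python
from itertools import product

# Explaining away: Rain -> Wet <- Sprinkler (illustrative probabilities).
p_r = 0.2                                    # P(Rain = 1)
p_s = 0.3                                    # P(Sprinkler = 1)
p_w = {(1, 1): 0.99, (1, 0): 0.90,           # P(Wet = 1 | Rain, Sprinkler)
       (0, 1): 0.85, (0, 0): 0.05}

def joint(r, s, w):
    pr = p_r if r else 1 - p_r
    ps = p_s if s else 1 - p_s
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pr * ps * pw

def prob_rain(evidence):
    """P(Rain = 1 | evidence) by summing the joint over unobserved variables."""
    rows = [dict(zip(("r", "s", "w"), vals)) for vals in product((0, 1), repeat=3)]
    keep = [a for a in rows if all(a[k] == v for k, v in evidence.items())]
    den = sum(joint(a["r"], a["s"], a["w"]) for a in keep)
    num = sum(joint(a["r"], a["s"], a["w"]) for a in keep if a["r"] == 1)
    return num / den

print(round(prob_rain({"w": 1}), 3))            # ~0.444: wet grass makes rain more likely
print(round(prob_rain({"w": 1, "s": 1}), 3))    # ~0.226: the sprinkler "explains away" the rain
```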

D-Separation

Graphical Test for Conditional Independence

D-separation (directed separation) determines whether X ⊥ Y | Z holds by examining paths in the graph:

A path is blocked by Z if:

  • Chain or Fork: Middle node is in Z
  • Collider: Neither the collider node nor any of its descendants is in Z

If all paths from X to Y are blocked by Z, then X ⊥ Y | Z.

Example Application

Network: A → B → C ← D

Query: Is A ⊥ D | B?

Path A → B → C ← D:

  • Collider at C (C is not in the observation set {B}, nor is any descendant of C)
  • Path is blocked!

Answer: Yes, A ⊥ D | B
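
For readers who want to check d-separation in code, below is a self-contained sketch of the standard moralization criterion: X ⊥ Y | Z iff X and Y are disconnected after taking the ancestral subgraph of X ∪ Y ∪ Z, moralizing it (marrying co-parents and dropping edge directions), and deleting Z. The function names and the toy DAG are our own.

```python
from collections import deque

def ancestral_set(dag, nodes):
    """All ancestors of `nodes` (including the nodes themselves).
    `dag` maps each node to the list of its parents."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, xs, ys, zs):
    """True iff xs is d-separated from ys given zs."""
    keep = ancestral_set(dag, set(xs) | set(ys) | set(zs))
    # Moral graph of the ancestral subgraph: undirected parent-child edges
    # plus edges between co-parents of the same child.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = [p for p in dag[child] if p in keep]
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # Delete the conditioning set, then test reachability from xs to ys.
    blocked = set(zs)
    frontier = deque(x for x in xs if x not in blocked)
    reached = set(frontier)
    while frontier:
        for nbr in adj[frontier.popleft()]:
            if nbr not in blocked and nbr not in reached:
                reached.add(nbr)
                frontier.append(nbr)
    return not (reached & set(ys))

# The example above: A -> B -> C <- D, query A ⊥ D | {B}.
dag = {"A": [], "B": ["A"], "C": ["B", "D"], "D": []}
print(d_separated(dag, {"A"}, {"D"}, {"B"}))   # True: the collider at C blocks the path
print(d_separated(dag, {"A"}, {"D"}, {"C"}))   # False: conditioning on the collider unblocks it
```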

Inference in Bayesian Networks

Computing Posterior Probabilities

Inference Task

Given evidence e (observed variables), compute the posterior for query variable(s) X:

P(X | \mathbf{e}) = \frac{P(X, \mathbf{e})}{P(\mathbf{e})} = \alpha \sum_{\mathbf{h}} P(X, \mathbf{h}, \mathbf{e})

where h ranges over the hidden (unobserved) variables and α = 1/P(e) is the normalization constant
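
The formula can be implemented directly by enumerating assignments to the hidden variables. A minimal sketch, assuming binary variables and reusing the toy disease network from above (CPT numbers are illustrative):

```python
from itertools import product

def joint_prob(assign, cpts):
    """P(full assignment) = product of P(x_i | parents(x_i)); binary variables.
    cpts maps each variable to (parent list, table of P(var = 1 | parent values))."""
    p = 1.0
    for var, (parents, table) in cpts.items():
        p1 = table[tuple(assign[q] for q in parents)]
        p *= p1 if assign[var] else 1.0 - p1
    return p

def posterior(query, evidence, cpts):
    """P(query = 1 | evidence) = alpha * sum over hidden assignments of the joint."""
    hidden = [v for v in cpts if v != query and v not in evidence]
    num = den = 0.0
    for values in product((0, 1), repeat=len(hidden) + 1):
        assign = dict(evidence)
        assign[query] = values[0]
        assign.update(zip(hidden, values[1:]))
        p = joint_prob(assign, cpts)
        den += p
        if assign[query]:
            num += p
    return num / den

# Toy disease network D -> F, D -> C with illustrative numbers.
cpts = {
    "D": ([], {(): 0.01}),
    "F": (["D"], {(1,): 0.80, (0,): 0.10}),
    "C": (["D"], {(1,): 0.70, (0,): 0.20}),
}
print(round(posterior("D", {"F": 1, "C": 1}, cpts), 3))  # ~0.22: P(disease | fever, cough)
```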

Exact Inference

  • Variable Elimination: Marginalize variables systematically, reuse intermediate results
  • Belief Propagation: Message passing in tree-structured networks (efficient)
  • Junction Tree: Convert to tree, then use message passing

✓ Accurate | ✗ Exponential in tree-width

Approximate Inference

  • Sampling: Forward sampling, rejection sampling, MCMC (Gibbs, Metropolis-Hastings)
  • Variational Methods: Approximate complex distribution with simpler one

✓ Scales to large networks | ✗ Approximate

Variable Elimination Algorithm

Steps:

  1. Start with factors (CPTs) from the network
  2. For each hidden variable Z (in elimination order):
     a. Collect all factors mentioning Z
     b. Multiply them together
     c. Sum out Z from the product
     d. Add the resulting factor back to the list
  3. Multiply remaining factors and normalize

Key insight: Elimination order matters! Good orders exploit independence structure for efficiency.
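
A compact, deliberately unoptimized sketch of this procedure for binary variables is given below. The factor representation (a tuple of variable names plus a table over 0/1 assignments) and the toy disease network in the usage example are our own illustrative choices.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors, each given as (variables, table)."""
    fv, ft = f
    gv, gt = g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product((0, 1), repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out_vars, table

def sum_out(var, f):
    """Marginalize `var` out of a factor."""
    fv, ft = f
    out_vars = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        a = dict(zip(fv, vals))
        key = tuple(a[v] for v in out_vars)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

def variable_elimination(factors, query, evidence, order):
    # 1. Restrict factors to the evidence (zero out disagreeing rows).
    factors = [(fv, {vals: (p if all(dict(zip(fv, vals)).get(e, ev) == ev
                                     for e, ev in evidence.items()) else 0.0)
                     for vals, p in ft.items()})
               for fv, ft in factors]
    # 2. Eliminate hidden variables one at a time.
    for z in order:
        mentioning = [f for f in factors if z in f[0]]
        factors = [f for f in factors if z not in f[0]]
        prod_f = mentioning[0]
        for f in mentioning[1:]:
            prod_f = multiply(prod_f, f)
        factors.append(sum_out(z, prod_f))
    # 3. Multiply what is left and normalize over the query variable.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    rv, rt = result
    qi = rv.index(query)
    marg = {0: 0.0, 1: 0.0}
    for vals, p in rt.items():
        marg[vals[qi]] += p
    total = marg[0] + marg[1]
    return {k: v / total for k, v in marg.items()}

# Toy disease network as factors (values illustrative): query P(C | F = 1), eliminating D.
f_d = (("D",), {(1,): 0.01, (0,): 0.99})
f_fd = (("F", "D"), {(1, 1): 0.80, (0, 1): 0.20, (1, 0): 0.10, (0, 0): 0.90})
f_cd = (("C", "D"), {(1, 1): 0.70, (0, 1): 0.30, (1, 0): 0.20, (0, 0): 0.80})
print(variable_elimination([f_d, f_fd, f_cd], "C", {"F": 1}, order=["D"]))
# {1: ~0.237, 0: ~0.763}: cough becomes more likely once fever is observed
```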

Learning Bayesian Networks

Parameter Learning

Given structure, estimate CPT parameters from data:

Maximum Likelihood:

\theta_{ijk} = P(X_i=k | \text{Parents}(X_i)=j) = \frac{\text{Count}(X_i=k, \text{Pa}=j)}{\text{Count}(\text{Pa}=j)}

With complete data, closed-form solution. For missing data, use EM algorithm.
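
With complete data, each CPT entry is literally a ratio of counts. A minimal sketch, assuming binary variables and a small made-up dataset:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child = 1 | parents) by counting, given complete data.
    `data` is a list of dicts mapping variable name -> 0/1."""
    joint, pa = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        pa[key] += 1
        joint[(key, row[child])] += 1
    return {key: joint[(key, 1)] / count for key, count in pa.items()}

# Hypothetical complete data for a D -> F edge (rows are illustrative).
data = [
    {"D": 1, "F": 1}, {"D": 1, "F": 0}, {"D": 1, "F": 1},
    {"D": 0, "F": 0}, {"D": 0, "F": 0}, {"D": 0, "F": 1},
]
print(mle_cpt(data, "F", ["D"]))   # P(F=1 | D=1) = 2/3, P(F=1 | D=0) = 1/3
```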

Structure Learning

Learn graph structure from data:

  • Constraint-based: Test conditional independence, construct graph
  • Score-based: Search over structures, optimize score (BIC, AIC, MDL)
  • Hybrid: Combine both approaches

Challenge: Exponentially many structures, hard to learn causal direction from observational data alone.

Example: Medical Diagnosis Network

Diagnosing Disease from Symptoms

Network Structure

Flu → Fever
Flu → Cough
COVID → Fever
COVID → Cough

Parameters (Simplified)

Priors: P(Flu), P(COVID)

Likelihoods: P(Fever | Flu, COVID), P(Cough | Flu, COVID)

Inference Example

Query: P(Flu | Fever = yes, Cough = yes)

Use variable elimination or belief propagation to compute the posterior probability of Flu given both symptoms; the same computation gives the posterior for COVID, and both will typically be well above their priors.

Note: Both can be high if they're not mutually exclusive (patient could have both).

Joint Probability Factorization Theorem

Chain Rule and DAG Structure

General Chain Rule

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \ldots, X_{i-1})

This is always true by the definition of conditional probability, but it requires exponentially many parameters!

Bayesian Network Factorization

Theorem: For DAG G with nodes X_1, ..., X_n:

P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))

Each variable depends only on its parents, not all predecessors! This is the key insight.

CPT Size Reduction:

If each node has at most k parents: O(n × 2^k) vs O(2^n) parameters. Exponential savings!
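
For a concrete sense of the savings, take n = 30 binary variables with at most k = 3 parents each:

n \cdot 2^k = 30 \times 2^3 = 240 \quad \text{vs} \quad 2^n - 1 = 2^{30} - 1 \approx 1.07 \times 10^9

parameters for the full joint distribution.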

D-Separation Rules

D-separation determines conditional independence from graph structure:

Chain: A → B → C

A ⊥ C | B (B blocks path)

Fork: A ← B → C

A ⊥ C | B (B blocks path)

V-structure (collider): A → B ← C

A ⊥ C | ∅ but NOT A ⊥ C | B (B unblocks!)

Variable Elimination Algorithm

Efficient Exact Inference

Problem: Computing Marginals

Goal: Compute P(X_query | evidence)

P(X_q | E=e) = \frac{\sum_{\text{all other vars}} P(X_q, \mathbf{X}, E=e)}{\sum_{X_q, \text{all other vars}} P(X_q, \mathbf{X}, E=e)}

Naive summation: exponential time in number of variables!

Key Idea: Push Sums Inward

Example: P(A) in the chain A → B → C → D

Naive:

P(A) = \sum_B \sum_C \sum_D P(A)P(B|A)P(C|B)P(D|C)

Cost: O(2^4) = 16 operations

Variable Elimination (eliminate D, then C, then B):

P(A) = P(A) \sum_B P(B|A) \sum_C P(C|B) \underbrace{\sum_D P(D|C)}_{=1}

Cost: O(3×2) = 6 operations
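
A few lines of Python (with illustrative CPT values) confirm that both arrangements give the same marginal and that the innermost sum over D really contributes a factor of 1:

```python
p_a1 = 0.3               # P(A = 1); the remaining tables give P(child = 1 | parent)
pb = {1: 0.9, 0: 0.2}    # P(B = 1 | A)
pc = {1: 0.6, 0: 0.4}    # P(C = 1 | B)
pd = {1: 0.5, 0: 0.5}    # P(D = 1 | C)

def bern(table, parent, value):
    """P(child = value | parent) for a binary child."""
    return table[parent] if value else 1 - table[parent]

# Naive: sum the full joint P(A=1, B, C, D) over B, C, D (2^3 terms).
naive = sum(p_a1 * bern(pb, 1, b) * bern(pc, b, c) * bern(pd, c, d)
            for b in (0, 1) for c in (0, 1) for d in (0, 1))

# Pushed-in sums: the eliminated sum over D equals 1 and is simply dropped.
pushed = p_a1 * sum(bern(pb, 1, b) * sum(bern(pc, b, c) for c in (0, 1))
                    for b in (0, 1))

print(naive, pushed)     # both ≈ 0.3 = P(A = 1)
```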

Algorithm Steps

1. Choose elimination ordering π

2. For each variable Z in π:

a. Multiply all factors containing Z

b. Sum out Z from the product

c. Add resulting factor to factor list

3. Multiply remaining factors

4. Normalize to get probability

Complexity Analysis

Time complexity: O(n × d^(w+1)) where:

  • n = number of variables
  • d = domain size (e.g., 2 for binary)
  • w = induced width (depends on elimination order!)

Finding optimal elimination order is NP-hard, but good heuristics exist (min-fill, min-degree).
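
As an illustration of one such heuristic, here is a minimal min-degree sketch; representing the moral graph as a dict of neighbour sets is our own choice.

```python
def min_degree_order(adj):
    """Greedy min-degree elimination ordering.
    `adj` maps each variable to the set of its neighbours in the moral graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        # Pick the variable currently connected to the fewest others.
        v = min(adj, key=lambda u: len(adj[u]))
        nbrs = adj.pop(v)
        # Eliminating v connects its remaining neighbours to each other (fill-in edges).
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
        order.append(v)
    return order

# Moral graph of the chain A - B - C - D: the low-degree end nodes go first.
moral = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(min_degree_order(moral))   # ['A', 'B', 'C', 'D'] (ties broken by dict order)
```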

Parameter Learning: Closed-Form Solutions

Maximum Likelihood Estimation for BN Parameters

Complete Data MLE

For node X_i with parents Pa_i, estimate P(X_i | Pa_i):

\hat{\theta}_{x_i|\text{pa}_i} = \frac{\text{Count}(X_i=x_i, \text{Pa}_i=\text{pa}_i)}{\text{Count}(\text{Pa}_i=\text{pa}_i)}

Simply count frequencies in data! Each CPT entry estimated independently (global decomposition).

Why MLE Decomposes

Log-likelihood of data D:

\mathcal{L}(\theta) = \sum_{j=1}^{m} \log P(\mathbf{x}^{(j)}) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P(x_i^{(j)} | \text{pa}_i^{(j)})

Rearranging sums:

= \sum_{i=1}^{n} \sum_{j=1}^{m} \log P(x_i^{(j)} | \text{pa}_i^{(j)})

Each term involves only parameters of one CPT! Optimize independently.

Bayesian Parameter Learning

With Dirichlet prior α:

\hat{\theta}_{x_i|\text{pa}_i} = \frac{\text{Count}(X_i=x_i, \text{Pa}_i=\text{pa}_i) + \alpha}{\text{Count}(\text{Pa}_i=\text{pa}_i) + \alpha|X_i|}

Equivalent to Laplace smoothing! Prevents overfitting with sparse data.
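
In code this is a small change to the counting estimator sketched earlier: add α to the numerator count and α·|X_i| to the denominator. A minimal sketch, assuming binary variables (|X_i| = 2) and an illustrative pseudocount:

```python
from collections import Counter

def smoothed_cpt(data, child, parents, alpha=1.0):
    """Dirichlet-smoothed estimate of P(child = 1 | parents) for binary variables:
    (Count(child = 1, pa) + alpha) / (Count(pa) + 2 * alpha)."""
    joint, pa = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        pa[key] += 1
        joint[(key, row[child])] += 1
    return {key: (joint[(key, 1)] + alpha) / (count + 2 * alpha)
            for key, count in pa.items()}

rows = [{"D": 1, "F": 1}, {"D": 1, "F": 0}, {"D": 0, "F": 0}]
print(smoothed_cpt(rows, "F", ["D"]))   # (1+1)/(2+2) = 0.5 and (0+1)/(1+2) ≈ 0.33
```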

Inference Complexity: NP-Hardness

Why Exact Inference is Intractable

Theorem: NP-Hardness

Computing P(X_query | evidence) in general Bayesian networks is #P-complete.

(a #P-complete problem is at least as hard as NP-complete problems such as SAT or the traveling salesman decision problem)

Proof Sketch via 3-SAT Reduction

Idea: Construct BN that encodes 3-SAT formula, where inference solves SAT.

  1. For each variable x_i in the SAT formula: create a binary node X_i with a uniform prior
  2. For each clause (a ∨ b ∨ c): create clause node C with parents a,b,c
  3. C = 1 iff (a ∨ b ∨ c) is true
  4. Add evidence: all clause nodes = 1
  5. Query: P(all clause nodes = 1) > 0 iff the SAT formula is satisfiable

Since SAT is NP-complete and we reduced it to BN inference, BN inference is NP-hard!

Tractable Special Cases

✓ Poly-time Cases:

  • Trees (w = 1)
  • Bounded treewidth
  • Singly-connected networks

Approximate Methods:

  • MCMC sampling
  • Variational inference
  • Loopy belief propagation

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is a Bayesian network?
2. In a Bayesian network, if there's an edge from node A to node B, we say:
3. What does a Bayesian network's factorization theorem state?
4. For a chain A → B → C, are A and C conditionally independent given B?
5. What is a 'v-structure' (collider) in Bayesian networks?
6. What does d-separation determine?
7. In a Bayesian network with n binary variables in a chain, how many parameters are needed?
8. What is inference in Bayesian networks?
9. Variable elimination works by:
10. What makes exact inference in Bayesian networks intractable (NP-hard in general)?