
Bayesian Networks

Master probabilistic graphical models for reasoning under uncertainty

Structure & Representation

Directed Acyclic Graph (DAG)

A Bayesian network consists of:

1. Graph Structure (Qualitative)

  • Nodes = random variables
  • Edges = direct dependencies
  • DAG = no directed cycles
  • Encodes conditional independence

2. Parameters (Quantitative)

  • Each node X_i has a CPT (Conditional Probability Table)
  • Specifies P(X_i | Parents(X_i))
  • Quantifies strength of dependencies

Joint Probability Factorization

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))

This compact representation exploits conditional independence to reduce the O(2^n) parameters of the full joint distribution to a manageable number of per-node CPT entries.

Example: Simple Medical Diagnosis

Variables: D (Disease), F (Fever), C (Cough)

Structure: D → F, D → C (disease causes symptoms)

P(D, F, C) = P(D) \cdot P(F|D) \cdot P(C|D)

Parameters: P(D), P(F|D), and P(C|D), i.e., one prior and two small conditional tables (5 parameters instead of 7 for the full joint over three binary variables)
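
A minimal Python sketch of this factorization; the CPT values below are hypothetical, chosen only to make the example runnable:

```python
# Joint probability via the factorization P(D, F, C) = P(D) * P(F|D) * P(C|D).
# CPT numbers are illustrative, not from any real data.
p_d = {True: 0.01, False: 0.99}          # prior P(D)
p_f_given_d = {True: 0.80, False: 0.10}  # P(F = true | D)
p_c_given_d = {True: 0.70, False: 0.20}  # P(C = true | D)

def joint(d: bool, f: bool, c: bool) -> float:
    """P(D=d, F=f, C=c) using the network factorization."""
    pf = p_f_given_d[d] if f else 1.0 - p_f_given_d[d]
    pc = p_c_given_d[d] if c else 1.0 - p_c_given_d[d]
    return p_d[d] * pf * pc

# Sanity check: the eight joint entries sum to 1.
total = sum(joint(d, f, c) for d in (True, False)
                           for f in (True, False)
                           for c in (True, False))
print(round(total, 10))  # 1.0
```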

Conditional Independence Patterns

Chain (Serial)
A → B → C

Independence:

A \perp C \mid B

B blocks the path. Given B, A doesn't provide info about C.

Fork (Diverging)
A ← B → C

Independence:

A \perp C \mid B

B is common cause. Conditioning on B blocks influence between A and C.

Collider (V-structure)
A → B ← C

Independence:

A \perp C \quad \text{but} \quad A \not\perp C \mid B

Marginally independent, but dependent when conditioning on B ('explaining away').

Explaining Away Example

Scenario: Grass is wet (W). Possible causes: Rain (R) or Sprinkler (S). Structure: R → W ← S (a collider at W).

Without observing W:

R ⊥ S (rain and sprinkler are independent events)

After observing W = wet:

R ⊥̸ S | W. If we learn the sprinkler was on, rain becomes less likely (one cause explains the effect).
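
The effect is easy to check numerically. The sketch below picks hypothetical values for P(R), P(S), and P(W | R, S) and computes the posterior on rain by brute-force enumeration; the numbers are ours, for illustration only.

```python
from itertools import product

# Explaining away: Rain -> Wet <- Sprinkler (illustrative probabilities).
p_r = 0.2                                    # P(Rain = 1)
p_s = 0.3                                    # P(Sprinkler = 1)
p_w = {(1, 1): 0.99, (1, 0): 0.90,           # P(Wet = 1 | Rain, Sprinkler)
       (0, 1): 0.85, (0, 0): 0.05}

def joint(r, s, w):
    pr = p_r if r else 1 - p_r
    ps = p_s if s else 1 - p_s
    pw = p_w[(r, s)] if w else 1 - p_w[(r, s)]
    return pr * ps * pw

def prob_rain(evidence):
    """P(Rain = 1 | evidence) by summing the joint over unobserved variables."""
    rows = [dict(zip(("r", "s", "w"), vals)) for vals in product((0, 1), repeat=3)]
    keep = [a for a in rows if all(a[k] == v for k, v in evidence.items())]
    den = sum(joint(a["r"], a["s"], a["w"]) for a in keep)
    num = sum(joint(a["r"], a["s"], a["w"]) for a in keep if a["r"] == 1)
    return num / den

print(round(prob_rain({"w": 1}), 3))            # ~0.444: wet grass makes rain more likely
print(round(prob_rain({"w": 1, "s": 1}), 3))    # ~0.226: the sprinkler "explains away" the rain
```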

D-Separation

Graphical Test for Conditional Independence

D-separation (directed separation) determines whether X ⊥ Y | Z holds by examining paths in the graph:

A path is blocked by Z if:

  • Chain or Fork: Middle node is in Z
  • Collider: Neither the collider node nor any of its descendants is in Z

If all paths from X to Y are blocked by Z, then X ⊥ Y | Z.

Example Application

Network: A → B → C ← D

Query: Is A ⊥ D | B?

Path A → B → C ← D:

  • Collider at C (C is not in the observation set {B}, nor is any descendant of C)
  • Path is blocked!

Answer: Yes, A ⊥ D | B
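
For readers who want to check d-separation in code, below is a self-contained sketch of the standard moralization criterion: X ⊥ Y | Z iff X and Y are disconnected after taking the ancestral subgraph of X ∪ Y ∪ Z, moralizing it (marrying co-parents and dropping edge directions), and deleting Z. The function names and the toy DAG are our own.

```python
from collections import deque

def ancestral_set(dag, nodes):
    """All ancestors of `nodes` (including the nodes themselves).
    `dag` maps each node to the list of its parents."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def d_separated(dag, xs, ys, zs):
    """True iff xs is d-separated from ys given zs."""
    keep = ancestral_set(dag, set(xs) | set(ys) | set(zs))
    # Moral graph of the ancestral subgraph: undirected parent-child edges
    # plus edges between co-parents of the same child.
    adj = {v: set() for v in keep}
    for child in keep:
        parents = [p for p in dag[child] if p in keep]
        for p in parents:
            adj[child].add(p); adj[p].add(child)
        for i, p in enumerate(parents):
            for q in parents[i + 1:]:
                adj[p].add(q); adj[q].add(p)
    # Delete the conditioning set, then test reachability from xs to ys.
    blocked = set(zs)
    frontier = deque(x for x in xs if x not in blocked)
    reached = set(frontier)
    while frontier:
        for nbr in adj[frontier.popleft()]:
            if nbr not in blocked and nbr not in reached:
                reached.add(nbr)
                frontier.append(nbr)
    return not (reached & set(ys))

# The example above: A -> B -> C <- D, query A ⊥ D | {B}.
dag = {"A": [], "B": ["A"], "C": ["B", "D"], "D": []}
print(d_separated(dag, {"A"}, {"D"}, {"B"}))   # True: the collider at C blocks the path
print(d_separated(dag, {"A"}, {"D"}, {"C"}))   # False: conditioning on the collider unblocks it
```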

Inference in Bayesian Networks

Computing Posterior Probabilities

Inference Task

Given evidence e (observed variables), compute the posterior for query variable(s) X:

P(X | \mathbf{e}) = \frac{P(X, \mathbf{e})}{P(\mathbf{e})} = \alpha \sum_{\mathbf{h}} P(X, \mathbf{h}, \mathbf{e})

where h ranges over the hidden (unobserved) variables and α = 1/P(e) is the normalization constant
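
The formula can be implemented directly by enumerating assignments to the hidden variables. A minimal sketch, assuming binary variables and reusing the toy disease network from above (CPT numbers are illustrative):

```python
from itertools import product

def joint_prob(assign, cpts):
    """P(full assignment) = product of P(x_i | parents(x_i)); binary variables.
    cpts maps each variable to (parent list, table of P(var = 1 | parent values))."""
    p = 1.0
    for var, (parents, table) in cpts.items():
        p1 = table[tuple(assign[q] for q in parents)]
        p *= p1 if assign[var] else 1.0 - p1
    return p

def posterior(query, evidence, cpts):
    """P(query = 1 | evidence) = alpha * sum over hidden assignments of the joint."""
    hidden = [v for v in cpts if v != query and v not in evidence]
    num = den = 0.0
    for values in product((0, 1), repeat=len(hidden) + 1):
        assign = dict(evidence)
        assign[query] = values[0]
        assign.update(zip(hidden, values[1:]))
        p = joint_prob(assign, cpts)
        den += p
        if assign[query]:
            num += p
    return num / den

# Toy disease network D -> F, D -> C with illustrative numbers.
cpts = {
    "D": ([], {(): 0.01}),
    "F": (["D"], {(1,): 0.80, (0,): 0.10}),
    "C": (["D"], {(1,): 0.70, (0,): 0.20}),
}
print(round(posterior("D", {"F": 1, "C": 1}, cpts), 3))  # ~0.22: P(disease | fever, cough)
```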

Exact Inference

  • Variable Elimination: Marginalize variables systematically, reuse intermediate results
  • Belief Propagation: Message passing in tree-structured networks (efficient)
  • Junction Tree: Convert to tree, then use message passing

✓ Accurate | ✗ Exponential in tree-width

Approximate Inference

  • Sampling: Forward sampling, rejection sampling, MCMC (Gibbs, Metropolis-Hastings)
  • Variational Methods: Approximate complex distribution with simpler one

✓ Scales to large networks | ✗ Approximate

Variable Elimination Algorithm

Steps:

  1. Start with factors (CPTs) from the network
  2. For each hidden variable Z (in elimination order):
     a. Collect all factors mentioning Z
     b. Multiply them together
     c. Sum out Z from the product
     d. Add the resulting factor back to the list
  3. Multiply remaining factors and normalize

Key insight: Elimination order matters! Good orders exploit independence structure for efficiency.
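
A compact, deliberately unoptimized sketch of this procedure for binary variables is given below. The factor representation (a tuple of variable names plus a table over 0/1 assignments) and the toy disease network in the usage example are our own illustrative choices.

```python
from itertools import product

def multiply(f, g):
    """Pointwise product of two factors, each given as (variables, table)."""
    fv, ft = f
    gv, gt = g
    out_vars = fv + tuple(v for v in gv if v not in fv)
    table = {}
    for vals in product((0, 1), repeat=len(out_vars)):
        a = dict(zip(out_vars, vals))
        table[vals] = ft[tuple(a[v] for v in fv)] * gt[tuple(a[v] for v in gv)]
    return out_vars, table

def sum_out(var, f):
    """Marginalize `var` out of a factor."""
    fv, ft = f
    out_vars = tuple(v for v in fv if v != var)
    table = {}
    for vals, p in ft.items():
        a = dict(zip(fv, vals))
        key = tuple(a[v] for v in out_vars)
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

def variable_elimination(factors, query, evidence, order):
    # 1. Restrict factors to the evidence (zero out disagreeing rows).
    factors = [(fv, {vals: (p if all(dict(zip(fv, vals)).get(e, ev) == ev
                                     for e, ev in evidence.items()) else 0.0)
                     for vals, p in ft.items()})
               for fv, ft in factors]
    # 2. Eliminate hidden variables one at a time.
    for z in order:
        mentioning = [f for f in factors if z in f[0]]
        factors = [f for f in factors if z not in f[0]]
        prod_f = mentioning[0]
        for f in mentioning[1:]:
            prod_f = multiply(prod_f, f)
        factors.append(sum_out(z, prod_f))
    # 3. Multiply what is left and normalize over the query variable.
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    rv, rt = result
    qi = rv.index(query)
    marg = {0: 0.0, 1: 0.0}
    for vals, p in rt.items():
        marg[vals[qi]] += p
    total = marg[0] + marg[1]
    return {k: v / total for k, v in marg.items()}

# Toy disease network as factors (values illustrative): query P(C | F = 1), eliminating D.
f_d = (("D",), {(1,): 0.01, (0,): 0.99})
f_fd = (("F", "D"), {(1, 1): 0.80, (0, 1): 0.20, (1, 0): 0.10, (0, 0): 0.90})
f_cd = (("C", "D"), {(1, 1): 0.70, (0, 1): 0.30, (1, 0): 0.20, (0, 0): 0.80})
print(variable_elimination([f_d, f_fd, f_cd], "C", {"F": 1}, order=["D"]))
# {1: ~0.237, 0: ~0.763}: cough becomes more likely once fever is observed
```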

Learning Bayesian Networks

Parameter Learning

Given structure, estimate CPT parameters from data:

Maximum Likelihood:

\theta_{ijk} = P(X_i=k | \text{Parents}(X_i)=j) = \frac{\text{Count}(X_i=k, \text{Pa}=j)}{\text{Count}(\text{Pa}=j)}

With complete data, closed-form solution. For missing data, use EM algorithm.
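
With complete data, each CPT entry is literally a ratio of counts. A minimal sketch, assuming binary variables and a small made-up dataset:

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child = 1 | parents) by counting, given complete data.
    `data` is a list of dicts mapping variable name -> 0/1."""
    joint, pa = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        pa[key] += 1
        joint[(key, row[child])] += 1
    return {key: joint[(key, 1)] / count for key, count in pa.items()}

# Hypothetical complete data for a D -> F edge (rows are illustrative).
data = [
    {"D": 1, "F": 1}, {"D": 1, "F": 0}, {"D": 1, "F": 1},
    {"D": 0, "F": 0}, {"D": 0, "F": 0}, {"D": 0, "F": 1},
]
print(mle_cpt(data, "F", ["D"]))   # P(F=1 | D=1) = 2/3, P(F=1 | D=0) = 1/3
```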

Structure Learning

Learn graph structure from data:

  • Constraint-based: Test conditional independence, construct graph
  • Score-based: Search over structures, optimize score (BIC, AIC, MDL)
  • Hybrid: Combine both approaches

Challenge: Exponentially many structures, hard to learn causal direction from observational data alone.

Example: Medical Diagnosis Network

Diagnosing Disease from Symptoms

Network Structure

Flu → Fever
Flu → Cough
COVID → Fever
COVID → Cough

Parameters (Simplified)

Priors: P(Flu), P(COVID)

Likelihoods: P(Fever | Flu, COVID), P(Cough | Flu, COVID)

Inference Example

Query: P(Flu | Fever = yes, Cough = yes)

Use variable elimination or belief propagation to compute the posterior probability of Flu given both symptoms; the same computation gives the posterior for COVID, and both will typically be well above their priors.

Note: Both can be high if they're not mutually exclusive (patient could have both).

Joint Probability Factorization Theorem

Chain Rule and DAG Structure

General Chain Rule

P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | X_1, \ldots, X_{i-1})

This is always true by the definition of conditional probability, but it requires exponentially many parameters!

Bayesian Network Factorization

Theorem: For DAG G with nodes X_1, ..., X_n:

P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i | \text{Parents}(X_i))

Each variable depends only on its parents, not all predecessors! This is the key insight.

CPT Size Reduction:

If each node has at most k parents: O(n × 2^k) vs O(2^n) parameters. Exponential savings!
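
For a concrete sense of the savings, take n = 30 binary variables with at most k = 3 parents each:

n \cdot 2^k = 30 \times 2^3 = 240 \quad \text{vs} \quad 2^n - 1 = 2^{30} - 1 \approx 1.07 \times 10^9

parameters for the full joint distribution.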

D-Separation Rules

D-separation determines conditional independence from graph structure:

Chain: A → B → C

A ⊥ C | B (B blocks path)

Fork: A ← B → C

A ⊥ C | B (B blocks path)

V-structure (collider): A → B ← C

A ⊥ C | ∅ but NOT A ⊥ C | B (B unblocks!)

Variable Elimination Algorithm

Efficient Exact Inference

Problem: Computing Marginals

Goal: Compute P(X_query | evidence)

P(X_q | E=e) = \frac{\sum_{\text{all other vars}} P(X_q, \mathbf{X}, E=e)}{\sum_{X_q, \text{all other vars}} P(X_q, \mathbf{X}, E=e)}

Naive summation: exponential time in number of variables!

Key Idea: Push Sums Inward

Example: P(A) in the chain A → B → C → D

Naive:

P(A) = \sum_B \sum_C \sum_D P(A)P(B|A)P(C|B)P(D|C)

Cost: O(2^4) = 16 operations

Variable Elimination (eliminate D, then C, then B):

P(A) = P(A) \sum_B P(B|A) \sum_C P(C|B) \underbrace{\sum_D P(D|C)}_{=1}

Cost: O(3×2) = 6 operations
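
A few lines of Python (with illustrative CPT values) confirm that both arrangements give the same marginal and that the innermost sum over D really contributes a factor of 1:

```python
p_a1 = 0.3               # P(A = 1); the remaining tables give P(child = 1 | parent)
pb = {1: 0.9, 0: 0.2}    # P(B = 1 | A)
pc = {1: 0.6, 0: 0.4}    # P(C = 1 | B)
pd = {1: 0.5, 0: 0.5}    # P(D = 1 | C)

def bern(table, parent, value):
    """P(child = value | parent) for a binary child."""
    return table[parent] if value else 1 - table[parent]

# Naive: sum the full joint P(A=1, B, C, D) over B, C, D (2^3 terms).
naive = sum(p_a1 * bern(pb, 1, b) * bern(pc, b, c) * bern(pd, c, d)
            for b in (0, 1) for c in (0, 1) for d in (0, 1))

# Pushed-in sums: the eliminated sum over D equals 1 and is simply dropped.
pushed = p_a1 * sum(bern(pb, 1, b) * sum(bern(pc, b, c) for c in (0, 1))
                    for b in (0, 1))

print(naive, pushed)     # both ≈ 0.3 = P(A = 1)
```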

Algorithm Steps

1. Choose elimination ordering π

2. For each variable Z in π:

a. Multiply all factors containing Z

b. Sum out Z from the product

c. Add resulting factor to factor list

3. Multiply remaining factors

4. Normalize to get probability

Complexity Analysis

Time complexity: O(n × d^(w+1)) where:

  • n = number of variables
  • d = domain size (e.g., 2 for binary)
  • w = induced width (depends on elimination order!)

Finding optimal elimination order is NP-hard, but good heuristics exist (min-fill, min-degree).
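
As an illustration of one such heuristic, here is a minimal min-degree sketch; representing the moral graph as a dict of neighbour sets is our own choice.

```python
def min_degree_order(adj):
    """Greedy min-degree elimination ordering.
    `adj` maps each variable to the set of its neighbours in the moral graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        # Pick the variable currently connected to the fewest others.
        v = min(adj, key=lambda u: len(adj[u]))
        nbrs = adj.pop(v)
        # Eliminating v connects its remaining neighbours to each other (fill-in edges).
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}
        order.append(v)
    return order

# Moral graph of the chain A - B - C - D: the low-degree end nodes go first.
moral = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(min_degree_order(moral))   # ['A', 'B', 'C', 'D'] (ties broken by dict order)
```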

Parameter Learning: Closed-Form Solutions

Maximum Likelihood Estimation for BN Parameters

Complete Data MLE

For node X_i with parents Pa_i, estimate P(X_i | Pa_i):

\hat{\theta}_{x_i|\text{pa}_i} = \frac{\text{Count}(X_i=x_i, \text{Pa}_i=\text{pa}_i)}{\text{Count}(\text{Pa}_i=\text{pa}_i)}

Simply count frequencies in data! Each CPT entry estimated independently (global decomposition).

Why MLE Decomposes

Log-likelihood of data D:

\mathcal{L}(\theta) = \sum_{j=1}^{m} \log P(\mathbf{x}^{(j)}) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log P(x_i^{(j)} | \text{pa}_i^{(j)})

Rearranging sums:

= \sum_{i=1}^{n} \sum_{j=1}^{m} \log P(x_i^{(j)} | \text{pa}_i^{(j)})

Each term involves only parameters of one CPT! Optimize independently.

Bayesian Parameter Learning

With Dirichlet prior α:

\hat{\theta}_{x_i|\text{pa}_i} = \frac{\text{Count}(X_i=x_i, \text{Pa}_i=\text{pa}_i) + \alpha}{\text{Count}(\text{Pa}_i=\text{pa}_i) + \alpha|X_i|}

Equivalent to Laplace smoothing! Prevents overfitting with sparse data.
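
In code this is a small change to the counting estimator sketched earlier: add α to the numerator count and α·|X_i| to the denominator. A minimal sketch, assuming binary variables (|X_i| = 2) and an illustrative pseudocount:

```python
from collections import Counter

def smoothed_cpt(data, child, parents, alpha=1.0):
    """Dirichlet-smoothed estimate of P(child = 1 | parents) for binary variables:
    (Count(child = 1, pa) + alpha) / (Count(pa) + 2 * alpha)."""
    joint, pa = Counter(), Counter()
    for row in data:
        key = tuple(row[p] for p in parents)
        pa[key] += 1
        joint[(key, row[child])] += 1
    return {key: (joint[(key, 1)] + alpha) / (count + 2 * alpha)
            for key, count in pa.items()}

rows = [{"D": 1, "F": 1}, {"D": 1, "F": 0}, {"D": 0, "F": 0}]
print(smoothed_cpt(rows, "F", ["D"]))   # (1+1)/(2+2) = 0.5 and (0+1)/(1+2) ≈ 0.33
```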

Inference Complexity: NP-Hardness

Why Exact Inference is Intractable

Theorem: NP-Hardness

Computing P(X_query | evidence) in general Bayesian networks is #P-complete.

(a #P-complete problem is at least as hard as NP-complete problems such as SAT or the traveling salesman decision problem)

Proof Sketch via 3-SAT Reduction

Idea: Construct BN that encodes 3-SAT formula, where inference solves SAT.

  1. For each variable x_i in the SAT formula: create a binary node X_i with a uniform prior
  2. For each clause (a ∨ b ∨ c): create clause node C with parents a,b,c
  3. C = 1 iff (a ∨ b ∨ c) is true
  4. Add evidence: all clause nodes = 1
  5. Query: P(all clause nodes = 1) > 0 iff the SAT formula is satisfiable

Since SAT is NP-complete and we reduced it to BN inference, BN inference is NP-hard!

Tractable Special Cases

✓ Poly-time Cases:

  • Trees (w = 1)
  • Bounded treewidth
  • Singly-connected networks

Approximate Methods:

  • MCMC sampling
  • Variational inference
  • Loopy belief propagation

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is a Bayesian network?
2. In a Bayesian network, if there's an edge from node A to node B, we say:
3. What does a Bayesian network's factorization theorem state?
4. For a chain A → B → C, are A and C conditionally independent given B?
5. What is a 'v-structure' (collider) in Bayesian networks?
6. What does d-separation determine?
7. In a Bayesian network with n binary variables in a chain, how many parameters are needed?
8. What is inference in Bayesian networks?
9. Variable elimination works by:
10. What makes exact inference in Bayesian networks intractable (NP-hard in general)?