Bayesian Networks

Probabilistic graphical models for complex dependency structures

What are Bayesian Networks?

Probabilistic Graphical Model

Bayesian Networks (also called "belief networks") are probabilistic graphical models that use a graph structure to represent conditional dependencies between variables. They provide a compact way to represent and reason about complex probability distributions.

Definition

A Bayesian network B = (G, \Theta) consists of:

  • Structure G: a Directed Acyclic Graph (DAG) whose nodes represent attribute variables and whose edges represent dependencies
  • Parameters \Theta: a set of Conditional Probability Tables (CPTs), one for each node given its parents

Core Assumption

Conditional Independence

Bayesian networks assume that each attribute is conditionally independent of its non-descendants given its parent nodes. This allows the joint probability distribution to be factorized:

P_B(x_1, x_2, \ldots, x_d) = \prod_{i=1}^{d} P_B(x_i | \pi_i) = \prod_{i=1}^{d} \theta_{x_i | \pi_i}

where \pi_i is the set of parent nodes of x_i.

Example Factorization

For a Bayesian network with variables (good melon x_1, sweetness x_2, knock sound x_3, color x_4, root stem x_5):

P(x_1, x_2, x_3, x_4, x_5) = P(x_1) P(x_2) P(x_3 | x_1) P(x_4 | x_1, x_2) P(x_5 | x_2)
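
As a sanity check, the factorization can be evaluated directly from per-node CPTs. Here is a minimal Python sketch for the five-variable network above; all CPT numbers are invented for illustration.

```python
# A minimal sketch: evaluating the factorized joint above from per-node
# CPTs. All probability values are invented for illustration only.

p_x1 = {0: 0.4, 1: 0.6}                   # P(x1): good melon
p_x2 = {0: 0.5, 1: 0.5}                   # P(x2): sweetness
p_x3 = {(0, 0): 0.7, (1, 0): 0.3,         # P(x3 | x1): knock sound
        (0, 1): 0.2, (1, 1): 0.8}
p_x4 = {(0, 0, 0): 0.9, (1, 0, 0): 0.1,   # P(x4 | x1, x2): color
        (0, 0, 1): 0.6, (1, 0, 1): 0.4,
        (0, 1, 0): 0.5, (1, 1, 0): 0.5,
        (0, 1, 1): 0.1, (1, 1, 1): 0.9}
p_x5 = {(0, 0): 0.8, (1, 0): 0.2,         # P(x5 | x2): root stem
        (0, 1): 0.3, (1, 1): 0.7}

def joint(x1, x2, x3, x4, x5):
    """P(x1,...,x5) as the product of local conditional probabilities."""
    return (p_x1[x1] * p_x2[x2] * p_x3[(x3, x1)]
            * p_x4[(x4, x1, x2)] * p_x5[(x5, x2)])

print(joint(1, 1, 1, 1, 1))  # 0.6 * 0.5 * 0.8 * 0.9 * 0.7 = 0.1512
```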

Three Basic Dependency Structures

Common Parent

Structure: x_1 \to x_3, x_1 \to x_4

Conditional Independence: x_3 \perp x_4 | x_1

Given the common parent x_1, the children x_3 and x_4 are independent.

V-Structure

Structure: x_1 \to x_4, x_2 \to x_4

Marginal Independence: x_1 \perp x_2

Key: Given x_4, x_1 and x_2 become dependent (the explaining-away effect).

Sequential Structure

Structure: x \to y \to z

Conditional Independence: x \perp z | y

Given the intermediate node y, x and z are independent.
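
The explaining-away effect in the V-structure is easy to verify numerically. The sketch below, with invented CPTs over binary variables, confirms that x_1 and x_2 are marginally independent but become dependent once x_4 is observed.

```python
from itertools import product

# A toy numerical check of the V-structure x1 -> x4 <- x2 with binary
# variables; all probability values below are invented for illustration.
p1 = {0: 0.7, 1: 0.3}                                        # P(x1)
p2 = {0: 0.8, 1: 0.2}                                        # P(x2)
p4 = {(0, 0): 0.05, (0, 1): 0.7, (1, 0): 0.6, (1, 1): 0.9}   # P(x4=1 | x1, x2)

def joint(a1, a2, a4):
    px4 = p4[(a1, a2)] if a4 == 1 else 1 - p4[(a1, a2)]
    return p1[a1] * p2[a2] * px4

def prob(x1=None, x2=None, x4=None):
    """Marginal probability: sum the joint over unspecified variables."""
    return sum(joint(a1, a2, a4)
               for a1, a2, a4 in product([0, 1], repeat=3)
               if (x1 is None or a1 == x1)
               and (x2 is None or a2 == x2)
               and (x4 is None or a4 == x4))

# Marginally independent: conditioning on x2 leaves P(x1 = 1) unchanged.
print(prob(x1=1))                             # 0.3
print(prob(x1=1, x2=1) / prob(x2=1))          # 0.3

# Explaining away: given x4 = 1, learning x2 = 1 lowers belief in x1 = 1.
print(prob(x1=1, x4=1) / prob(x4=1))                  # ~0.61
print(prob(x1=1, x2=1, x4=1) / prob(x2=1, x4=1))      # ~0.36
```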

D-separation

Moral Graph

To analyze D-separation, we first convert the directed graph to an undirected moral graph:

  1. Connect V-structure parents: For each V-structure (A \to C \gets B), add an undirected edge between A and B ("moralization")
  2. Convert directed edges: Replace all directed edges with undirected edges

D-separation Criterion

Two variables x and y are D-separated by a set z if x and y fall into different connected components of the moral graph after removing the nodes in z.

If x and y are D-separated by z, then x \perp y | z (conditional independence).
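
The moralization-based test above translates almost directly into code. The sketch below (plain Python, no graph library) builds the moral graph, deletes the conditioning set, and checks connectivity with BFS; note that the full algorithm first restricts attention to the ancestral graph of x, y, and z, which is omitted here for brevity.

```python
from collections import defaultdict, deque
from itertools import combinations

def d_separated(edges, x, y, z):
    """Test whether x and y are separated by set z, following the
    moralization procedure described above. `edges` are (parent, child)."""
    parents = defaultdict(set)
    adj = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
        adj[u].add(v); adj[v].add(u)      # undirected skeleton
    for ps in parents.values():           # marry all co-parents
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    blocked = set(z)                      # remove conditioning nodes
    seen, queue = {x}, deque([x])
    while queue:                          # BFS: can we still reach y?
        for nxt in adj[queue.popleft()]:
            if nxt == y:
                return False              # connected, hence not d-separated
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt); queue.append(nxt)
    return True

# Sequential structure x -> y -> z: the middle node separates the ends.
print(d_separated([("x", "y"), ("y", "z")], "x", "z", ["y"]))  # True
print(d_separated([("x", "y"), ("y", "z")], "x", "z", []))     # False
```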

Structure Learning

Scoring Functions

Structure learning aims to find the optimal network structure GG that best fits the training data. Common scoring functions include:

AIC (Akaike Information Criterion)

s(B | D) = f(\theta) |B| - LL(B | D)

where f(\theta) = 1, |B| is the number of parameters of the network, and LL(B | D) is the log-likelihood of B on the data D

BIC (Bayesian Information Criterion)

s(B | D) = f(\theta) |B| - LL(B | D)

where f(\theta) = \frac{1}{2} \log m (m is the sample size)

MDL (Minimal Description Length)

Based on information-theoretic principles, balancing model complexity and data fit.
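
To make the scoring concrete, here is a sketch of the BIC score for a fixed structure on discrete data, with maximum-likelihood CPT estimates obtained from counts. The data format and names (`bic_score`, `structure`, `arity`) are our own choices for illustration.

```python
import math
from collections import Counter

def bic_score(data, structure, arity):
    """s(B | D) = (1/2) log(m) |B| - LL(B | D); lower is better.
    `data`: list of dicts (variable -> value); `structure`: variable ->
    tuple of its parents; `arity`: variable -> number of possible values."""
    m = len(data)
    log_lik, n_params = 0.0, 0
    for var, pars in structure.items():
        joint = Counter((tuple(r[p] for p in pars), r[var]) for r in data)
        parent = Counter(tuple(r[p] for p in pars) for r in data)
        for (pa, _), c in joint.items():
            log_lik += c * math.log(c / parent[pa])   # ML estimate of theta
        n_configs = math.prod(arity[p] for p in pars)
        n_params += (arity[var] - 1) * n_configs      # free parameters |B|
    return 0.5 * math.log(m) * n_params - log_lik

data = [{"x1": 1, "x2": 1}, {"x1": 1, "x2": 1},
        {"x1": 0, "x2": 0}, {"x1": 1, "x2": 0}]
print(bic_score(data, {"x1": (), "x2": ("x1",)}, {"x1": 2, "x2": 2}))
```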

Computational Challenge

Finding the optimal Bayesian network structure is NP-hard. In practice, we use heuristic search algorithms (e.g., greedy search, genetic algorithms) to find good structures rather than optimal ones.
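
As an illustration of the heuristic approach, the sketch below runs a greedy hill-climbing search that starts from the empty graph and keeps adding whichever edge improves the score, under any scoring function of the form above (e.g. the `bic_score` sketch). Edge deletion and reversal moves, used by practical implementations, are omitted for brevity.

```python
from itertools import permutations

def creates_cycle(structure, u, v):
    """Would adding the edge u -> v create a directed cycle?
    True iff a directed path v -> ... -> u already exists."""
    stack, seen = [v], set()
    while stack:
        node = stack.pop()
        if node == u:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(w for w, ps in structure.items() if node in ps)
    return False

def greedy_search(data, variables, arity, score):
    structure = {v: () for v in variables}     # start from the empty graph
    best = score(data, structure, arity)
    improved = True
    while improved:
        improved = False
        for u, v in permutations(variables, 2):
            if u in structure[v] or creates_cycle(structure, u, v):
                continue
            trial = dict(structure)
            trial[v] = structure[v] + (u,)     # tentatively add u -> v
            s = score(data, trial, arity)
            if s < best:                       # lower score is better
                structure, best, improved = trial, s, True
    return structure
```

Because each step only accepts locally improving edges, the result is a local optimum that depends on the traversal order, which is exactly the trade-off the NP-hardness result forces.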

Inference

Exact Inference

Directly compute posterior probabilities from the joint distribution defined by the Bayesian network. However, exact inference is NP-hard for general networks.

Approximate Inference

For complex networks, we use approximate methods:

  • Gibbs Sampling: an MCMC method that iteratively resamples each variable given the current values of the others to approximate the posterior distribution (see the sketch below)
  • Variational Inference: transforms inference into an optimization problem, approximating the true posterior with a simpler distribution
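
As a concrete illustration of Gibbs sampling, the sketch below estimates P(x = 1 | z = 1) on the small sequential network x \to y \to z with invented CPTs, resampling each unobserved variable from its conditional given the current state.

```python
import random

px = {1: 0.3, 0: 0.7}              # P(x)         -- invented numbers
py = {(1, 1): 0.9, (1, 0): 0.2}    # P(y = 1 | x)
pz = {(1, 1): 0.8, (1, 0): 0.1}    # P(z = 1 | y)

def joint(x, y, z):
    p = px[x]
    p *= py[(1, x)] if y == 1 else 1 - py[(1, x)]
    p *= pz[(1, y)] if z == 1 else 1 - pz[(1, y)]
    return p

def gibbs(n_samples=50_000, burn_in=1_000, z_obs=1):
    x, y = 1, 1                    # arbitrary initial state
    hits = 0
    for t in range(burn_in + n_samples):
        # Resample each free variable from P(var | all others), which is
        # proportional to the joint evaluated at the two candidate values.
        p1 = joint(1, y, z_obs)
        x = 1 if random.random() < p1 / (p1 + joint(0, y, z_obs)) else 0
        p1 = joint(x, 1, z_obs)
        y = 1 if random.random() < p1 / (p1 + joint(x, 0, z_obs)) else 0
        if t >= burn_in:
            hits += x
    return hits / n_samples

print(gibbs())   # converges near the exact value P(x=1 | z=1) ~ 0.566
```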

Example: Medical Diagnosis Network

Bayesian network for disease diagnosis

Network Structure

Variables: Disease (D), Symptom1 (S1), Symptom2 (S2), Test Result (T), Age (A)

Dependencies:

  • D → S1, D → S2, D → T
  • A → D

Joint Probability:

P(D, S_1, S_2, T, A) = P(A) P(D | A) P(S_1 | D) P(S_2 | D) P(T | D)

Inference Example

Given observed symptoms S1=present, S2=present, and test result T=positive, we can compute:

P(D = \text{disease} | S_1, S_2, T) = \frac{P(D, S_1, S_2, T)}{P(S_1, S_2, T)}

This lets us compute the probability of disease given the observed evidence; since Age A is unobserved, both numerator and denominator are obtained by summing the joint distribution over A.
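
Since this network is tiny, the posterior can be computed exactly by brute-force enumeration. The sketch below does so with invented CPT values (all variables binary, 1 meaning present/positive/older), summing A out of the numerator and denominator.

```python
from itertools import product

pA = {1: 0.3, 0: 0.7}              # P(A)          -- invented numbers
pD = {(1, 1): 0.15, (1, 0): 0.05}  # P(D = 1 | A)
pS1 = {(1, 1): 0.8, (1, 0): 0.1}   # P(S1 = 1 | D)
pS2 = {(1, 1): 0.7, (1, 0): 0.2}   # P(S2 = 1 | D)
pT = {(1, 1): 0.9, (1, 0): 0.05}   # P(T = 1 | D)

def bern(table, val, parent):
    """P(val | parent) for a binary child stored as P(child = 1 | parent)."""
    p1 = table[(1, parent)]
    return p1 if val == 1 else 1 - p1

def joint(d, s1, s2, t, a):
    return (pA[a] * bern(pD, d, a) * bern(pS1, s1, d)
            * bern(pS2, s2, d) * bern(pT, t, d))

# P(D = 1 | S1 = 1, S2 = 1, T = 1): sum out A, then normalize over D.
num = sum(joint(1, 1, 1, 1, a) for a in (0, 1))
den = sum(joint(d, 1, 1, 1, a) for d, a in product((0, 1), repeat=2))
print(num / den)   # ~0.978 with these made-up numbers
```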