
Graph-Based Learning

Master graph-based semi-supervised learning through label propagation. Learn how to construct graphs, propagate labels along edges, and leverage the manifold assumption for classification.

Module 4 of 6
Intermediate to Advanced
100-130 min

Core Idea: Graph Structure and Label Propagation

Map data to an undirected graph G = (V, E) where nodes represent samples and edges represent similarity. Labels from labeled samples propagate along edges to unlabeled samples, following the manifold assumption: nearby samples (connected by edges) should have similar labels.

Key Components

  • Nodes V: Each sample (labeled or unlabeled) is a node
  • Edges E: Connect similar samples; edge weight W_{ij} reflects similarity
  • Label Propagation: Labels spread from labeled nodes to unlabeled nodes along edges
  • Manifold Assumption: Neighboring samples (high edge weight) should have similar labels

Advantages

  • Directly embodies manifold assumption
  • Sensitive to local structure
  • No model assumptions needed
  • Intuitive and interpretable

Limitations

  • Large storage overhead
  • Difficult for large-scale data
  • Hard to incorporate new samples
  • Requires similarity metric

Graph Construction: Computing Edge Weights

The quality of the graph significantly affects label propagation. We use a Gaussian kernel function to compute edge weights based on sample similarity.

Gaussian Kernel Function

W_{ij} = \begin{cases} \exp\left( -\frac{\|x_i - x_j\|_2^2}{2\sigma^2} \right), & i \neq j \\ 0, & i = j \end{cases}

Where:

  • W_{ij} = weight of the edge between nodes i and j (similarity)
  • \|x_i - x_j\|_2^2 = squared Euclidean distance between samples x_i and x_j
  • \sigma = bandwidth parameter (controls similarity decay rate)
  • W_{ii} = 0 (no self-loops)

Bandwidth Parameter σ

The bandwidth \sigma controls how quickly similarity decreases with distance (a construction sketch follows the list below):

  • Small σ: Only very close samples have high similarity (sparse graph)
  • Large σ: Distant samples also have moderate similarity (dense graph)
  • Typical choice: \sigma = median distance to the k-th nearest neighbor
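A minimal numpy sketch of this construction, assuming the samples sit in a matrix X of shape (m, d); the function name, the default k, and the k-th-nearest-neighbor bandwidth heuristic (the "typical choice" above) are illustrative, not a fixed API:

```python
import numpy as np

def gaussian_weight_matrix(X, sigma=None, k=7):
    """Build the Gaussian-kernel weight matrix W for samples X (m x d).

    If sigma is not given, fall back to the heuristic from the text:
    the median distance to each sample's k-th nearest neighbor.
    """
    # Pairwise squared Euclidean distances, shape (m, m)
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    d2 = np.maximum(d2, 0.0)  # guard against tiny negative values from round-off

    if sigma is None:
        d = np.sqrt(d2)
        kth = np.sort(d, axis=1)[:, k]  # column 0 is the self-distance 0
        sigma = np.median(kth)          # median distance to the k-th nearest neighbor

    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # W_ii = 0: no self-loops
    return W, sigma
```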

Label Propagation: Core Formulas

Labels propagate iteratively from labeled nodes to unlabeled nodes. We define a degree matrix and a propagation matrix to control the propagation process.

Degree Matrix D

A diagonal matrix where each diagonal element is the sum of edge weights connected to that node:

D_{ii} = \sum_j W_{ij}

The degree D_{ii} measures how "well-connected" node i is in the graph.

Propagation Matrix S

For multi-class classification, we use symmetric normalization:

S = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}

This symmetric normalization ensures stable propagation and convergence.

Alternative (Row Normalization):

P = D^{-1} W

P represents transition probabilities (from node i to node j) and is used in some label propagation variants.
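A short sketch computing D, the symmetric normalization S, and the row normalization P from a weight matrix W (it assumes every node has positive degree; the function name is illustrative):

```python
import numpy as np

def propagation_matrices(W):
    """Return the degree matrix D, S = D^{-1/2} W D^{-1/2}, and P = D^{-1} W."""
    d = W.sum(axis=1)                      # D_ii = sum_j W_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt        # symmetric normalization
    P = np.diag(1.0 / d) @ W               # row normalization (transition probabilities)
    return np.diag(d), S, P
```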

Label Matrix Iteration

Let F(t) be the label matrix at iteration t, where F_{ij} represents the probability that node i belongs to class j:

F(t+1) = \alpha S F(t) + (1 - \alpha) Y

Where:

  • \alpha \in [0,1) = trade-off parameter (balance between propagation and initial labels)
  • S = propagation matrix
  • Y = initial label matrix (Y_{ij} = 1 if labeled sample i belongs to class j, 0 otherwise)
  • F(0) = Y (initialization)

Convergence Result

As t \to \infty, the label matrix converges to:

F^* = (1-\alpha)(I - \alpha S)^{-1} Y

This closed-form solution can be computed directly, but iterative updates are often more efficient for large graphs.
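Both routes can be sketched in a few lines of numpy, assuming S and Y are arrays built as above (function names are illustrative):

```python
import numpy as np

def label_propagation(S, Y, alpha, tol=1e-5, max_iter=1000):
    """Iterate F(t+1) = alpha * S @ F(t) + (1 - alpha) * Y until convergence."""
    F = Y.astype(float).copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1.0 - alpha) * Y
        if np.linalg.norm(F_next - F) < tol:
            return F_next
        F = F_next
    return F

def label_propagation_closed_form(S, Y, alpha):
    """Closed-form limit F* = (1 - alpha) (I - alpha S)^{-1} Y."""
    m = S.shape[0]
    # Solve (I - alpha S) F* = (1 - alpha) Y rather than forming the inverse explicitly
    return np.linalg.solve(np.eye(m) - alpha * S, (1.0 - alpha) * Y)
```

The two functions agree up to the stopping tolerance; the iterative form avoids the m x m linear solve, which is why it is usually preferred for large graphs.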

Graph-Based Learning Algorithm Steps

Step 1

Graph Construction

Compute edge weights W_{ij} using the Gaussian kernel for all sample pairs. Construct the degree matrix D and the propagation matrix S = D^{-\frac{1}{2}} W D^{-\frac{1}{2}}.

Step 2

Initialization

Initialize the label matrix F(0) = Y (a minimal sketch of this initialization follows the list below):

  • For labeled sample i in class j: F_{ij}(0) = 1, F_{ik}(0) = 0 for k \neq j
  • For unlabeled sample i: F_{ij}(0) = 0 for all classes j
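A minimal sketch of this initialization, assuming the (hypothetical) convention that y[i] holds the class index of sample i, or -1 when sample i is unlabeled:

```python
import numpy as np

def initial_label_matrix(y, n_classes):
    """Build Y: one-hot rows for labeled samples, all-zero rows for unlabeled ones."""
    Y = np.zeros((len(y), n_classes))
    for i, label in enumerate(y):
        if label >= 0:           # labeled sample: F_ij(0) = 1 for its class, 0 elsewhere
            Y[i, label] = 1.0
    return Y                     # unlabeled rows stay all zero
```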
Step 3

Iterative Propagation

Repeat until convergence:

F(t+1) = \alpha S F(t) + (1 - \alpha) Y

Stop when \|F(t+1) - F(t)\| < \epsilon (e.g., \epsilon = 10^{-5}).

Step 4

Prediction

For unlabeled sample i, predict class:

y_i = \arg\max_{1 \leq j \leq |\mathcal{Y}|} F^*_{ij}

Choose the class with highest probability in the converged label matrix.
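These four steps correspond closely to scikit-learn's LabelSpreading, which implements the symmetric-normalized propagation described above (its rbf kernel parameter gamma plays the role of 1/(2\sigma^2), and alpha is the same trade-off parameter). A minimal usage sketch on synthetic two-blob data:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two small Gaussian blobs; only one labeled point per class (synthetic data for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(2.0, 0.3, size=(50, 2))])
y = np.full(100, -1)   # -1 marks unlabeled samples
y[0], y[50] = 0, 1     # one labeled sample per class

model = LabelSpreading(kernel='rbf', gamma=20, alpha=0.8, max_iter=1000, tol=1e-5)
model.fit(X, y)

# transduction_ holds the predicted (argmax) label for every node, labeled and unlabeled
print(model.transduction_[1:4], model.transduction_[51:54])
```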

Social Network Classification Example

Apply graph-based learning to classify 200 social network users into "Tech Enthusiast" or "Sports Fan" based on friendship connections. Only 50 users have known interests (labeled), while 150 users are unlabeled.

Graph Construction and Propagation

Graph Construction:

Build a graph with 200 nodes (users). Edge weight W_{ij} is based on friendship strength, encoded as in the sketch after this list:

  • Direct friends: W_{ij} = 1.0
  • Mutual friends only (not direct friends): W_{ij} = 0.5
  • No connection: W_{ij} = 0
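One hypothetical way to encode this weighting rule in numpy (the edge-list input format and the function name are invented for illustration; the actual dataset is not shown here):

```python
import numpy as np

def friendship_weights(n_users, friendships):
    """Direct friends get weight 1.0; pairs that only share a mutual friend get 0.5.

    `friendships` is a list of (i, j) index pairs of direct friends.
    """
    A = np.zeros((n_users, n_users))        # adjacency matrix of direct friendships
    for i, j in friendships:
        A[i, j] = A[j, i] = 1.0

    W = A.copy()                            # direct friends: W_ij = 1.0
    shares_friend = (A @ A) > 0             # pairs with at least one common neighbor
    W[shares_friend & (A == 0)] = 0.5       # mutual friends only: W_ij = 0.5
    np.fill_diagonal(W, 0.0)                # no self-loops
    return W
```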

Label Propagation (α = 0.8):

Iteratively propagate labels from 50 labeled users to 150 unlabeled users:

  • Iteration 1: Labels spread to immediate neighbors
  • Iteration 3: Labels reach 2-hop neighbors
  • Iteration 8: Convergence (change < 1e-5)

Final Result:

Achieves 88% accuracy on the test set (vs. 75% using labeled samples only). Labels successfully propagate through friendship connections, leveraging the "birds of a feather" principle (homophily).

Advantages and Limitations

Advantages

  • Directly embodies manifold assumption: Naturally captures local structure
  • Sensitive to local structure: Works well for high-dimensional data
  • No model assumptions: Doesn't require specific distribution
  • Intuitive and interpretable: Graph structure is easy to visualize

Limitations

  • Large storage overhead: Need to store an m \times m weight matrix
  • Difficult for large-scale data: Doesn't scale well to millions of samples
  • Hard to incorporate new samples: Graph must be reconstructed
  • Requires similarity metric: Choice of kernel affects performance