
Disagreement-Based Methods

Master multi-view learning and co-training algorithms. Learn how multiple classifiers trained on different views can collaborate to improve performance by leveraging high-confidence unlabeled samples.

Module 5 of 6
Intermediate to Advanced
100-130 min

Core Concept: Multi-View Data

A view is a subset of attributes (feature set) that can be used to train a classifier. Multi-view data has multiple independent views, each providing complementary information about the same samples.

Examples of Multi-View Data

  • Web pages: View 1 = text content, View 2 = hyperlink structure
  • Images: View 1 = pixel features, View 2 = texture features
  • Documents: View 1 = word frequency, View 2 = semantic features
  • Videos: View 1 = visual features, View 2 = audio features

Advantages

  • Leverages multiple information sources
  • Robust to view-specific noise
  • Can improve with limited labels
  • Natural for many applications

Requirements

  • Requires multi-view data
  • Views must be compatible
  • Views should be complementary
  • Need high-confidence selection

Key Assumptions

For disagreement-based methods to work effectively, two critical assumptions must hold:

1. Compatibility Assumption

Different views contain consistent class information: if one view predicts class A for a sample, the other views should also predict class A for that sample.

Example:

If text view classifies a web page as "Tech", the link view should also classify it as "Tech" (not "Sports"). Both views provide consistent information about the same underlying class.

2. Complementarity Assumption

Each view is sufficient (an optimal classifier can be trained from that view alone), and the views are conditionally independent given the class.

Sufficiency:

A single view contains enough information to train a good classifier. For example, text content alone can classify web pages reasonably well.

Conditional Independence:

Given the class label, the views are independent: $P(\text{view}_1, \text{view}_2 \mid \text{class}) = P(\text{view}_1 \mid \text{class}) \cdot P(\text{view}_2 \mid \text{class})$
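
As a toy numeric illustration (the probabilities below are invented for the web-page example, not taken from the dataset later in this module), conditional independence lets the joint conditional probability be computed from the per-view conditionals:

```python
# Toy check of the conditional independence factorization.
# All probabilities are invented for illustration only.
p_text_given_tech = 0.8    # P(view_1 = "tech-looking text"  | class = Tech), assumed
p_links_given_tech = 0.7   # P(view_2 = "tech-looking links" | class = Tech), assumed

# Under conditional independence, the joint conditional factorizes:
# P(view_1, view_2 | Tech) = P(view_1 | Tech) * P(view_2 | Tech)
p_joint_given_tech = p_text_given_tech * p_links_given_tech
print(p_joint_given_tech)  # prints 0.5599999999999999, i.e. approximately 0.56
```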

Co-Training Algorithm

The most famous disagreement-based method. Two classifiers trained on different views "teach each other" by selecting high-confidence unlabeled samples and providing pseudo-labels to the other view.

Step 1

Data Preparation

From the unlabeled set $D_u$, randomly sample $s$ samples to form a buffer pool $D_s$. Remaining samples: $D_u = D_u \setminus D_s$.

Parameters: base learning algorithm (e.g., SVM, decision tree), number of rounds $T$, positive samples selected per round $p$, negative samples selected per round $n$.
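
A minimal Python sketch of this preparation step, assuming index-based bookkeeping of the unlabeled set (the function name and the pool size shown are illustrative, not part of the algorithm's specification):

```python
import random

def make_buffer_pool(n_unlabeled, s, seed=0):
    """Split the indices of D_u into a buffer pool D_s of size s and the remainder."""
    rng = random.Random(seed)
    indices = list(range(n_unlabeled))
    rng.shuffle(indices)
    return indices[:s], indices[s:]   # (buffer pool D_s, remaining D_u)

# Example: 150 unlabeled web pages, buffer pool of s = 20
pool_idx, remaining_idx = make_buffer_pool(150, s=20)
```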

Step 2

Initial Training

For each view $j$ ($j = 1, 2$), train an initial classifier $h_j$ using the labeled samples $D_l$.

Example: View 1 (text) classifier and View 2 (links) classifier, both trained on 50 labeled web pages.

Step 3

Iterative Learning (T rounds)

For each round t = 1 to T, and for each view j:

  a) Use $h_j$ to evaluate the samples in the buffer pool $D_s$
  b) Select $p$ high-confidence positive samples and $n$ high-confidence negative samples
  c) Generate pseudo-labeled samples: $D_j^+ = \{(x, +1) \mid x \in \text{positive}\}$, $D_j^- = \{(x, -1) \mid x \in \text{negative}\}$
  d) Remove the selected samples from the buffer: $D_s = D_s \setminus (D_j^+ \cup D_j^-)$
  e) If neither classifier was updated in this round, terminate
  f) Otherwise, add the pseudo-labeled samples selected by view $j$ to the other view's training set, $D_l^{3-j} = D_l^{3-j} \cup D_j^+ \cup D_j^-$, and retrain $h_{3-j}$
  g) Sample $2p + 2n$ new samples from $D_u$ to replenish the buffer $D_s$

Step 4

Output

Final classifiers $h_1$ and $h_2$. Can use either separately or combine predictions (e.g., voting or averaging).
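
Putting the four steps together, a simplified co-training loop could look like the sketch below. It assumes scikit-learn-style base learners, -1/+1 labels, and index-based bookkeeping of the unlabeled pool; the defaults $T = 5$, $p = n = 5$, $s = 20$ merely mirror the web-page example and are not prescribed by the algorithm.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, T=5, p=5, n=5, s=20, base=GaussianNB()):
    """Simplified co-training with two views; labels are assumed to be -1 / +1."""
    rng = np.random.default_rng(0)
    # Step 1: buffer pool D_s of s unlabeled indices; the rest stays in D_u
    order = rng.permutation(len(X1_u))
    pool, reserve = list(order[:s]), list(order[s:])

    views_u = (X1_u, X2_u)
    train = [(list(X1_l), list(y_l)), (list(X2_l), list(y_l))]   # per-view training sets
    h = [clone(base).fit(X1_l, y_l), clone(base).fit(X2_l, y_l)]  # Step 2: initial classifiers

    for _ in range(T):                                 # Step 3: T rounds
        picks = [[], []]                               # (unlabeled index, pseudo-label) per view
        for j in (0, 1):
            if not pool:
                break
            proba = h[j].predict_proba(views_u[j][pool])
            # classes_ is sorted, so column 0 = P(y = -1), column 1 = P(y = +1)
            pos = [(pool[i], +1) for i in np.argsort(-proba[:, 1])[:p]]
            neg = [(pool[i], -1) for i in np.argsort(-proba[:, 0])[:n]]
            picks[j] = pos + neg
            chosen = {idx for idx, _ in picks[j]}
            pool = [i for i in pool if i not in chosen]   # remove selections from the buffer
        if not any(picks):
            break                                      # neither classifier selected anything
        for j in (0, 1):                               # cross-teach: view j learns from the
            for idx, label in picks[1 - j]:            # samples picked by the other view
                train[j][0].append(views_u[j][idx])
                train[j][1].append(label)
            h[j] = clone(base).fit(np.array(train[j][0]), np.array(train[j][1]))
        pool += reserve[: 2 * p + 2 * n]               # replenish the buffer from D_u
        reserve = reserve[2 * p + 2 * n:]
    return h[0], h[1]                                  # Step 4: final classifiers
```

The sketch returns the two final classifiers; whether to use one of them alone or to combine their predictions is left to the caller, as in Step 4.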

High-Confidence Selection

Only select unlabeled samples where the classifier is highly confident:

  • For probabilistic classifiers: select samples with $P(y \mid x) > \theta$ (e.g., $\theta = 0.9$)
  • For SVM: select samples far from the decision boundary (large margin)
  • This ensures pseudo-labels are likely correct, preventing error propagation
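
Two common ways to implement this selection rule, sketched with scikit-learn-style classifiers (the threshold value and the helper names are illustrative):

```python
import numpy as np

def confident_by_proba(clf, X_pool, theta=0.9):
    """Indices of pool samples whose top-class probability exceeds theta."""
    proba = clf.predict_proba(X_pool)
    return np.where(proba.max(axis=1) > theta)[0]

def confident_by_margin(svm, X_pool, k=5):
    """Indices of the k pool samples farthest from a binary SVM's decision boundary."""
    margin = np.abs(svm.decision_function(X_pool))
    return np.argsort(-margin)[:k]
```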

Web Page Classification Example

Apply co-training to classify 200 web pages into "Tech" or "Sports" categories. Only 50 pages are labeled, while 150 are unlabeled. Two views: text content and hyperlink structure.

Web Page Dataset: Multi-View Features

ID | Text Words | Text Links | Link Count | Link Text Words | Label   | Type
 1 | 450        | 8          | 12         | 35              | Tech    | labeled
 2 | 320        | 3          | 5          | 15              | Sports  | labeled
 3 | 280        | 2          | 8          | 25              | Tech    | labeled
 4 | 520        | 5          | 15         | 45              | Sports  | labeled
 5 | 380        | 6          | 10         | 30              | Tech    | labeled
 6 | 410        | 4          | 7          | 20              | Unknown | unlabeled
 7 | 290        | 3          | 9          | 28              | Unknown | unlabeled
 8 | 480        | 7          | 11         | 38              | Unknown | unlabeled
 9 | 350        | 2          | 6          | 18              | Unknown | unlabeled
10 | 440        | 5          | 13         | 42              | Unknown | unlabeled

Dataset: 200 web pages (50 labeled: 25 Tech, 25 Sports; 150 unlabeled). View 1 (Text): word count, text link count. View 2 (Links): total links, link text words.
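
To feed the table into a co-training implementation, each row's features can be split into the two views. The sketch below hard-codes only the ten rows shown above and encodes Tech as +1 and Sports as -1 (an arbitrary but convenient choice):

```python
import numpy as np

# Columns: [text words, text links, link count, link text words]
features = np.array([
    [450, 8, 12, 35], [320, 3,  5, 15], [280, 2,  8, 25],
    [520, 5, 15, 45], [380, 6, 10, 30], [410, 4,  7, 20],
    [290, 3,  9, 28], [480, 7, 11, 38], [350, 2,  6, 18],
    [440, 5, 13, 42],
])
labels = np.array([+1, -1, +1, -1, +1])     # Tech = +1, Sports = -1 (rows with IDs 1-5)

X1 = features[:, :2]    # View 1 (text): word count, text link count
X2 = features[:, 2:]    # View 2 (links): total links, link text words
X1_l, X2_l, y_l = X1[:5], X2[:5], labels    # labeled pages (IDs 1-5)
X1_u, X2_u = X1[5:], X2[5:]                 # unlabeled pages (IDs 6-10)
```

These arrays match the signature of the co_train sketch in the algorithm section, e.g. co_train(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=1, s=5) for this small excerpt.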

Co-Training Process

Initial Training:

Train View 1 (text) classifier and View 2 (links) classifier on 50 labeled pages. Initial accuracy: 78% (text), 72% (links).

Round 1:

View 1 selects 5 high-confidence Tech pages and 5 high-confidence Sports pages from the buffer; View 2 does the same. Each view adds the other view's selections to its training set, and both classifiers retrain.

Rounds 2-5:

Continue iterative learning. Each round, classifiers improve by learning from high-confidence samples identified by the other view.

Final Result:

After 5 rounds, View 1 accuracy: 89%, View 2 accuracy: 85%. Combined (voting): 91% accuracy (vs 78% using labeled samples only). Co-training successfully leveraged unlabeled data.
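
A small sketch of the combination step, averaging the two views' class probabilities (majority voting over hard predictions would work similarly; the helper below is illustrative, not part of the original algorithm):

```python
import numpy as np

def combined_predict(h1, h2, X1_new, X2_new):
    """Average the two views' class probabilities and take the most probable class.

    Assumes both classifiers were trained on the same label set, so their
    classes_ arrays are in the same (sorted) order.
    """
    proba = (h1.predict_proba(X1_new) + h2.predict_proba(X2_new)) / 2.0
    return h1.classes_[np.argmax(proba, axis=1)]
```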

Advantages and Limitations

Advantages

  • Leverages multiple information sources: Uses complementary views effectively
  • Robust to view-specific noise: Errors in one view can be corrected by another
  • Can improve with limited labels: Iteratively expands training set
  • Natural for many applications: Web pages, images, documents naturally have multiple views

Limitations

  • Requires multi-view data: Not all problems have natural multiple views
  • Assumptions must hold: Compatibility and complementarity are critical
  • High-confidence selection needed: Poor confidence estimation can propagate errors
  • Sensitive to initial classifiers: Poor initial models can lead to poor final results