
Disagreement-Based Methods

Master multi-view learning and co-training algorithms. Learn how multiple classifiers trained on different views can collaborate to improve performance by leveraging high-confidence unlabeled samples.

Module 5 of 6
Intermediate to Advanced
100-130 min

Core Concept: Multi-View Data

A view is a subset of attributes (feature set) that can be used to train a classifier. Multi-view data has multiple independent views, each providing complementary information about the same samples.

Examples of Multi-View Data

  • Web pages: View 1 = text content, View 2 = hyperlink structure
  • Images: View 1 = pixel features, View 2 = texture features
  • Documents: View 1 = word frequency, View 2 = semantic features
  • Videos: View 1 = visual features, View 2 = audio features

Advantages

  • Leverages multiple information sources
  • Robust to view-specific noise
  • Can improve with limited labels
  • Natural for many applications

Requirements

  • Requires multi-view data
  • Views must be compatible
  • Views should be complementary
  • Need high-confidence selection

Key Assumptions

For disagreement-based methods to work effectively, two critical assumptions must hold:

1. Compatibility Assumption

Different views contain consistent class information: if one view predicts class A for a sample, the other views should also predict class A for that sample.

Example:

If text view classifies a web page as "Tech", the link view should also classify it as "Tech" (not "Sports"). Both views provide consistent information about the same underlying class.

2. Complementarity Assumption

Each view is sufficient (an optimal classifier can be trained from that view alone), and the views are conditionally independent given the class.

Sufficiency:

A single view contains enough information to train a good classifier. For example, text content alone can classify web pages reasonably well.

Conditional Independence:

Given the class label, the views are independent: $P(\text{view}_1, \text{view}_2 \mid \text{class}) = P(\text{view}_1 \mid \text{class}) \cdot P(\text{view}_2 \mid \text{class})$
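
As a toy numeric illustration (the probabilities below are invented for the web-page example, not taken from the dataset later in this module), conditional independence lets the joint conditional probability be computed from the per-view conditionals:

```python
# Toy check of the conditional independence factorization.
# All probabilities are invented for illustration only.
p_text_given_tech = 0.8    # P(view_1 = "tech-looking text"  | class = Tech), assumed
p_links_given_tech = 0.7   # P(view_2 = "tech-looking links" | class = Tech), assumed

# Under conditional independence, the joint conditional factorizes:
# P(view_1, view_2 | Tech) = P(view_1 | Tech) * P(view_2 | Tech)
p_joint_given_tech = p_text_given_tech * p_links_given_tech
print(p_joint_given_tech)  # prints 0.5599999999999999, i.e. approximately 0.56
```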

Co-Training Algorithm

The most famous disagreement-based method. Two classifiers trained on different views "teach each other" by selecting high-confidence unlabeled samples and providing pseudo-labels to the other view.

Step 1

Data Preparation

From the unlabeled set $D_u$, randomly sample $s$ samples to form a buffer pool $D_s$. Remaining samples: $D_u = D_u \setminus D_s$.

Parameters: base learning algorithm (e.g., SVM, decision tree), number of rounds $T$, positive samples selected per round $p$, negative samples selected per round $n$.
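
A minimal Python sketch of this preparation step, assuming index-based bookkeeping of the unlabeled set (the function name and the pool size shown are illustrative, not part of the algorithm's specification):

```python
import random

def make_buffer_pool(n_unlabeled, s, seed=0):
    """Split the indices of D_u into a buffer pool D_s of size s and the remainder."""
    rng = random.Random(seed)
    indices = list(range(n_unlabeled))
    rng.shuffle(indices)
    return indices[:s], indices[s:]   # (buffer pool D_s, remaining D_u)

# Example: 150 unlabeled web pages, buffer pool of s = 20
pool_idx, remaining_idx = make_buffer_pool(150, s=20)
```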

Step 2

Initial Training

For each view $j$ ($j = 1, 2$), train an initial classifier $h_j$ using the labeled samples $D_l$.

Example: View 1 (text) classifier and View 2 (links) classifier, both trained on 50 labeled web pages.

Step 3

Iterative Learning (T rounds)

For each round t = 1 to T, and for each view j:

  a) Use $h_j$ to evaluate the samples in the buffer pool $D_s$
  b) Select $p$ high-confidence positive samples and $n$ high-confidence negative samples
  c) Generate pseudo-labeled samples: $D_j^+ = \{(x, +1) \mid x \in \text{positive}\}$, $D_j^- = \{(x, -1) \mid x \in \text{negative}\}$
  d) Remove the selected samples from the buffer: $D_s = D_s \setminus (D_j^+ \cup D_j^-)$
  e) If neither classifier was updated in this round, terminate
  f) Otherwise, add the pseudo-labeled samples selected by view $j$ to the other view's training set, $D_l^{3-j} = D_l^{3-j} \cup D_j^+ \cup D_j^-$, and retrain $h_{3-j}$
  g) Sample $2p + 2n$ new samples from $D_u$ to replenish the buffer $D_s$

Step 4

Output

Final classifiers $h_1$ and $h_2$. Can use either separately or combine predictions (e.g., voting or averaging).
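
Putting the four steps together, a simplified co-training loop could look like the sketch below. It assumes scikit-learn-style base learners, -1/+1 labels, and index-based bookkeeping of the unlabeled pool; the defaults $T = 5$, $p = n = 5$, $s = 20$ merely mirror the web-page example and are not prescribed by the algorithm.

```python
import numpy as np
from sklearn.base import clone
from sklearn.naive_bayes import GaussianNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, T=5, p=5, n=5, s=20, base=GaussianNB()):
    """Simplified co-training with two views; labels are assumed to be -1 / +1."""
    rng = np.random.default_rng(0)
    # Step 1: buffer pool D_s of s unlabeled indices; the rest stays in D_u
    order = rng.permutation(len(X1_u))
    pool, reserve = list(order[:s]), list(order[s:])

    views_u = (X1_u, X2_u)
    train = [(list(X1_l), list(y_l)), (list(X2_l), list(y_l))]   # per-view training sets
    h = [clone(base).fit(X1_l, y_l), clone(base).fit(X2_l, y_l)]  # Step 2: initial classifiers

    for _ in range(T):                                 # Step 3: T rounds
        picks = [[], []]                               # (unlabeled index, pseudo-label) per view
        for j in (0, 1):
            if not pool:
                break
            proba = h[j].predict_proba(views_u[j][pool])
            # classes_ is sorted, so column 0 = P(y = -1), column 1 = P(y = +1)
            pos = [(pool[i], +1) for i in np.argsort(-proba[:, 1])[:p]]
            neg = [(pool[i], -1) for i in np.argsort(-proba[:, 0])[:n]]
            picks[j] = pos + neg
            chosen = {idx for idx, _ in picks[j]}
            pool = [i for i in pool if i not in chosen]   # remove selections from the buffer
        if not any(picks):
            break                                      # neither classifier selected anything
        for j in (0, 1):                               # cross-teach: view j learns from the
            for idx, label in picks[1 - j]:            # samples picked by the other view
                train[j][0].append(views_u[j][idx])
                train[j][1].append(label)
            h[j] = clone(base).fit(np.array(train[j][0]), np.array(train[j][1]))
        pool += reserve[: 2 * p + 2 * n]               # replenish the buffer from D_u
        reserve = reserve[2 * p + 2 * n:]
    return h[0], h[1]                                  # Step 4: final classifiers
```

The sketch returns the two final classifiers; whether to use one of them alone or to combine their predictions is left to the caller, as in Step 4.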

High-Confidence Selection

Only select unlabeled samples where the classifier is highly confident:

  • For probabilistic classifiers: select samples with $P(y \mid x) > \theta$ (e.g., $\theta = 0.9$)
  • For SVM: select samples far from the decision boundary (large margin)
  • This ensures pseudo-labels are likely correct, preventing error propagation
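
Two common ways to implement this selection rule, sketched with scikit-learn-style classifiers (the threshold value and the helper names are illustrative):

```python
import numpy as np

def confident_by_proba(clf, X_pool, theta=0.9):
    """Indices of pool samples whose top-class probability exceeds theta."""
    proba = clf.predict_proba(X_pool)
    return np.where(proba.max(axis=1) > theta)[0]

def confident_by_margin(svm, X_pool, k=5):
    """Indices of the k pool samples farthest from a binary SVM's decision boundary."""
    margin = np.abs(svm.decision_function(X_pool))
    return np.argsort(-margin)[:k]
```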

Web Page Classification Example

Apply co-training to classify 200 web pages into "Tech" or "Sports" categories. Only 50 pages are labeled, while 150 are unlabeled. Two views: text content and hyperlink structure.

Web Page Dataset: Multi-View Features

ID | Text Words | Text Links | Link Count | Link Text Words | Label   | Type
 1 | 450        | 8          | 12         | 35              | Tech    | labeled
 2 | 320        | 3          | 5          | 15              | Sports  | labeled
 3 | 280        | 2          | 8          | 25              | Tech    | labeled
 4 | 520        | 5          | 15         | 45              | Sports  | labeled
 5 | 380        | 6          | 10         | 30              | Tech    | labeled
 6 | 410        | 4          | 7          | 20              | Unknown | unlabeled
 7 | 290        | 3          | 9          | 28              | Unknown | unlabeled
 8 | 480        | 7          | 11         | 38              | Unknown | unlabeled
 9 | 350        | 2          | 6          | 18              | Unknown | unlabeled
10 | 440        | 5          | 13         | 42              | Unknown | unlabeled

Dataset: 200 web pages (50 labeled: 25 Tech, 25 Sports; 150 unlabeled). View 1 (Text): word count, text link count. View 2 (Links): total links, link text words.
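
To feed the table into a co-training implementation, each row's features can be split into the two views. The sketch below hard-codes only the ten rows shown above and encodes Tech as +1 and Sports as -1 (an arbitrary but convenient choice):

```python
import numpy as np

# Columns: [text words, text links, link count, link text words]
features = np.array([
    [450, 8, 12, 35], [320, 3,  5, 15], [280, 2,  8, 25],
    [520, 5, 15, 45], [380, 6, 10, 30], [410, 4,  7, 20],
    [290, 3,  9, 28], [480, 7, 11, 38], [350, 2,  6, 18],
    [440, 5, 13, 42],
])
labels = np.array([+1, -1, +1, -1, +1])     # Tech = +1, Sports = -1 (rows with IDs 1-5)

X1 = features[:, :2]    # View 1 (text): word count, text link count
X2 = features[:, 2:]    # View 2 (links): total links, link text words
X1_l, X2_l, y_l = X1[:5], X2[:5], labels    # labeled pages (IDs 1-5)
X1_u, X2_u = X1[5:], X2[5:]                 # unlabeled pages (IDs 6-10)
```

These arrays match the signature of the co_train sketch in the algorithm section, e.g. co_train(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=1, s=5) for this small excerpt.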

Co-Training Process

Initial Training:

Train View 1 (text) classifier and View 2 (links) classifier on 50 labeled pages. Initial accuracy: 78% (text), 72% (links).

Round 1:

View 1 selects 5 high-confidence Tech pages and 5 high-confidence Sports pages from the buffer; View 2 does the same. Each view adds the other view's selections to its training set, and both classifiers retrain.

Rounds 2-5:

Continue iterative learning. Each round, classifiers improve by learning from high-confidence samples identified by the other view.

Final Result:

After 5 rounds, View 1 accuracy: 89%, View 2 accuracy: 85%. Combined (voting): 91% accuracy (vs 78% using labeled samples only). Co-training successfully leveraged unlabeled data.
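
A small sketch of the combination step, averaging the two views' class probabilities (majority voting over hard predictions would work similarly; the helper below is illustrative, not part of the original algorithm):

```python
import numpy as np

def combined_predict(h1, h2, X1_new, X2_new):
    """Average the two views' class probabilities and take the most probable class.

    Assumes both classifiers were trained on the same label set, so their
    classes_ arrays are in the same (sorted) order.
    """
    proba = (h1.predict_proba(X1_new) + h2.predict_proba(X2_new)) / 2.0
    return h1.classes_[np.argmax(proba, axis=1)]
```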

Advantages and Limitations

Advantages

  • Leverages multiple information sources: Uses complementary views effectively
  • Robust to view-specific noise: Errors in one view can be corrected by another
  • Can improve with limited labels: Iteratively expands training set
  • Natural for many applications: Web pages, images, documents naturally have multiple views

Limitations

  • Requires multi-view data: Not all problems have natural multiple views
  • Assumptions must hold: Compatibility and complementarity are critical
  • High-confidence selection needed: Poor confidence estimation can propagate errors
  • Sensitive to initial classifiers: Poor initial models can lead to poor final results