Master the lazy learning algorithm that serves as a key application scenario for dimensionality reduction. Learn how kNN finds neighbors, makes predictions, and why its performance degrades in high dimensions.
k-Nearest Neighbors (kNN) is a lazy learning (also called instance-based learning) algorithm with no explicit training process. Given a test sample, kNN finds the k training samples closest to it based on some distance metric, then makes predictions based on these neighbors.
Lazy Learning: No training phase; all computation happens at prediction time.
Non-Parametric: Makes no assumptions about the data distribution.
Distance-Based: Performance depends directly on the quality of the distance metric.
Sensitive to Dimensionality: Performance degrades severely in high-dimensional spaces (the curse of dimensionality).
The kNN algorithm follows a simple three-step process:
For a test sample x, calculate distances to all training samples using a distance metric (typically Euclidean distance).
Details: Distance calculation: dist(x, x_i) = ||x - x_i|| for all training samples x_i.
Select the k training samples with smallest distances to x. These are the 'neighbors'.
Details: k is a hyperparameter. Common values: k=1 (nearest neighbor), k=3, k=5, or k=√n (n = number of training samples).
For classification: majority vote among k neighbors' labels. For regression: average of k neighbors' values.
Details: Classification: y_pred = mode({y_i | i ∈ k nearest neighbors}). Regression: y_pred = (1/k) Σ y_i.
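The three steps above translate almost directly into code. Below is a minimal from-scratch sketch using NumPy; the function name knn_predict and the toy data are illustrative, not taken from any particular library.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, task="classification"):
    """Predict the label/value of a single test sample x with plain kNN."""
    # Step 1: Euclidean distance from x to every training sample
    distances = np.linalg.norm(X_train - x, axis=1)

    # Step 2: indices of the k closest training samples
    neighbor_idx = np.argsort(distances)[:k]
    neighbor_labels = y_train[neighbor_idx]

    # Step 3: majority vote (classification) or mean (regression)
    if task == "classification":
        return Counter(neighbor_labels).most_common(1)[0][0]
    return neighbor_labels.mean()

# Toy example: two clusters of 2-D points
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.2, 1.9]), k=3))  # -> 0
```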
kNN can be used for both classification and regression tasks, with different prediction strategies:
For classification tasks, kNN uses majority voting among the k nearest neighbors.
Select the most common class label among k neighbors
Example: k=3
Neighbors: [Class A, Class A, Class B]
Prediction: Class A (majority: 2 votes)
For regression tasks, kNN uses averaging of the k nearest neighbors' target values.
Average the target values of k neighbors
Example: k=3
Neighbors: [450000, 480000, 420000]
Prediction: $450,000 (average)
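Both prediction modes are available off the shelf in scikit-learn (assuming it is installed); the tiny datasets below are made up purely for illustration.

```python
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# Classification: majority vote among the k nearest neighbors
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit([[1], [2], [3], [10], [11], [12]], ["A", "A", "A", "B", "B", "B"])
print(clf.predict([[2.5]]))   # -> ['A']

# Regression: mean of the k nearest neighbors' target values
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit([[1800], [2400], [2800], [3600]], [320_000, 450_000, 520_000, 750_000])
print(reg.predict([[2600]]))  # -> [430000.], the mean of the 3 closest prices
```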
Understanding the theoretical error rate of nearest neighbor classifiers provides insight into why dimensionality reduction is crucial for kNN performance.
For a test sample x, let z be its nearest neighbor. The prediction errs exactly when x and z carry different labels, so the error probability is:

P(err) = 1 - Σ_{c∈Y} P(c|x) P(c|z)

Where Y is the set of all class labels, and P(c|x) is the posterior probability of class c given sample x.

Under the assumption that samples are independent and identically distributed (i.i.d.), and training samples are sufficiently dense (any test sample has a training sample arbitrarily close to it), we have P(c|z) ≈ P(c|x), so the error rate can be approximated as:

P(err) ≈ 1 - Σ_{c∈Y} P(c|x)² ≤ 1 - P(c*|x)² = (1 + P(c*|x))(1 - P(c*|x)) ≤ 2 × (1 - P(c*|x))

where c* = argmax_{c∈Y} P(c|x) is the class chosen by the Bayes optimal classifier, so 1 - P(c*|x) is the Bayes error rate.
Core Conclusion:
The generalization error rate of the nearest neighbor classifier is at most twice the error rate of the Bayes optimal classifier. This provides strong theoretical justification for kNN's effectiveness when distances are meaningful.
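For example, if the Bayes optimal classifier picks the true class with probability P(c*|x) = 0.95, so the Bayes error rate is 5%, the bound guarantees that the nearest neighbor classifier's error rate is at most 2 × 0.05 = 10%.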
The theoretical guarantee rests on two assumptions: the samples are i.i.d., and the training set is dense enough that every test sample has a training sample very close to it.
In high-dimensional spaces, both assumptions break down in practice: reaching that sample density would require exponentially many training samples, and distances between points become nearly indistinguishable, so the nearest neighbor is no longer meaningfully "near".
This is why dimensionality reduction is essential for kNN: it restores sample density and makes distances meaningful again.
Apply kNN regression to predict house prices based on similar properties:
| ID | Sqft | Bedrooms | Bathrooms | Year Built | Location Score | Price |
|---|---|---|---|---|---|---|
| 1 | 2,400 | 4 | 2.5 | 2005 | 8.5/10 | $450,000 |
| 2 | 1,800 | 3 | 2 | 1998 | 7.2/10 | $320,000 |
| 3 | 3,200 | 5 | 3 | 2010 | 9.1/10 | $680,000 |
| 4 | 1,500 | 2 | 1.5 | 1995 | 6.8/10 | $240,000 |
| 5 | 2,800 | 4 | 2.5 | 2008 | 8.9/10 | $520,000 |
| 6 | 2,100 | 3 | 2 | 2002 | 7.5/10 | $380,000 |
| 7 | 3,600 | 5 | 3.5 | 2012 | 9.5/10 | $750,000 |
| 8 | 1,900 | 3 | 2 | 2000 | 7/10 | $310,000 |
Features: Square footage (continuous), Bedrooms/Bathrooms (discrete), Year Built (ordinal), Location Score (1-10), Price (continuous target)
To predict price for a new house with: 2600 sqft, 4 bedrooms, 2.5 bathrooms, built in 2007, location score 8.7:
• House #1: 2400 sqft, 4 bed, 2.5 bath, 2005, loc 8.5 → Price: $450,000
• House #5: 2800 sqft, 4 bed, 2.5 bath, 2008, loc 8.9 → Price: $520,000
• House #6: 2100 sqft, 3 bed, 2 bath, 2002, loc 7.5 → Price: $380,000
Average of 3 nearest neighbors: ($450k + $520k + $380k) / 3 = $450,000
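A sketch of this calculation in NumPy. The raw, unscaled features are used here so that the result matches the worked example (square footage dominates the distance at this scale); in practice you would standardize the features first.

```python
import numpy as np

# Columns: sqft, bedrooms, bathrooms, year built, location score
X = np.array([
    [2400, 4, 2.5, 2005, 8.5],
    [1800, 3, 2.0, 1998, 7.2],
    [3200, 5, 3.0, 2010, 9.1],
    [1500, 2, 1.5, 1995, 6.8],
    [2800, 4, 2.5, 2008, 8.9],
    [2100, 3, 2.0, 2002, 7.5],
    [3600, 5, 3.5, 2012, 9.5],
    [1900, 3, 2.0, 2000, 7.0],
])
prices = np.array([450_000, 320_000, 680_000, 240_000,
                   520_000, 380_000, 750_000, 310_000])

query = np.array([2600, 4, 2.5, 2007, 8.7])

# Raw Euclidean distances to every house in the table
distances = np.linalg.norm(X - query, axis=1)
nearest = np.argsort(distances)[:3]      # houses #5, #1, #6 (0-indexed: 4, 0, 5)
prediction = prices[nearest].mean()      # (520k + 450k + 380k) / 3 = 450,000
print(nearest + 1, prediction)
```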
kNN's performance directly depends on the quality of distance computations, making it a prime application scenario for dimensionality reduction techniques.
In high-dimensional spaces, kNN faces critical challenges: distances concentrate (the nearest and farthest neighbors become almost equally far away), the sample density needed for reliable neighbors grows exponentially with the number of dimensions, and computing distances over many features becomes expensive.
Dimensionality reduction addresses these issues by projecting the data into a low-dimensional space where distances are meaningful again, the available samples are effectively denser, and neighbor searches are cheaper to compute.
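The distance-concentration effect is easy to observe with a small synthetic experiment (not from the original text): sample random points in the unit cube and compare the nearest and farthest distances to a query as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))          # 500 random points in the d-dimensional unit cube
    q = rng.random(d)                 # one random query point
    dist = np.linalg.norm(X - q, axis=1)
    # As d grows, this ratio approaches 1: the nearest and farthest
    # neighbors become almost equally far away.
    print(d, dist.min() / dist.max())
```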
A common machine learning pipeline: standardize the features, reduce the dimensionality (for example with PCA), then run kNN in the reduced space.
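A sketch of such a pipeline in scikit-learn; the dataset, number of components, and number of neighbors below are illustrative choices rather than values from the text.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)    # 64-dimensional digit images
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardize -> project to a lower-dimensional space -> kNN on the projection
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    KNeighborsClassifier(n_neighbors=5),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))
```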