Master fundamental concepts, terminology, and evaluation methods in machine learning
Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. More formally (Tom Mitchell's definition): a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance on tasks in T, as measured by P, improves with experience E.
Example: Email Spam Filter. The task T is classifying emails as spam or not spam; the performance measure P is the fraction of emails correctly classified; the experience E is a corpus of emails labeled by users.
Supervised Learning
Learning from labeled training data with input-output pairs
Unsupervised Learning
Learning patterns from unlabeled data
Dataset (D)
Collection of samples for training/testing
Sample (Instance)
Single object or data point
Example
Sample with its label
Feature (Attribute)
Characteristic describing an object
Label (Target)
Output or result to predict
Feature Space
Space formed by all features
Hypothesis (Model)
Learned mapping from features to labels
Hypothesis Space
Set of all possible models
Inductive Bias
Assumptions made by learning algorithm
Training Set
Data used to train the model
Validation Set
Data for hyperparameter tuning
Test Set
Data for final evaluation
Overfitting
Model learns the training data too well, including noise and outliers
Underfitting
Model is too simple to capture the underlying patterns
Training Error
Error on training data
Generalization Error
Expected error on unseen data
Bias
Error from wrong assumptions (underfitting)
Variance
Error from sensitivity to training data (overfitting)
Noise
Irreducible error in data
Holdout Method
Split data into training (70-80%) and test (20-30%) sets
K-Fold Cross-Validation
Split data into k folds; train k times, each time using k-1 folds for training and the remaining fold for validation
Leave-One-Out Cross-Validation (LOOCV)
Special case of k-fold where k = n (the number of samples)
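The three strategies are easy to compare side by side. Below is a minimal sketch using scikit-learn (an assumption; the section names no library), with a logistic regression standing in for any classifier:

```python
# A minimal sketch of the three splitting strategies with scikit-learn
# (assumed available); LogisticRegression stands in for any classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (KFold, LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000)

# Holdout: a single 75/25 split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
print("holdout:", model.fit(X_tr, y_tr).score(X_te, y_te))

# K-fold: average accuracy over k=5 train/validate rounds.
print("5-fold:", cross_val_score(model, X, y, cv=KFold(n_splits=5)).mean())

# LOOCV: k = n folds, each holding out a single sample.
print("LOOCV:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())
```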
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP (True Positive) | FN (False Negative) |
| Actual Negative | FP (False Positive) | TN (True Negative) |
Accuracy
Overall correctness: (TP + TN) / (TP + TN + FP + FN)
Precision
Accuracy of positive predictions: TP / (TP + FP)
Recall (Sensitivity)
Coverage of actual positives: TP / (TP + FN)
F1 Score
Harmonic mean of precision and recall: 2 · Precision · Recall / (Precision + Recall)
| Sqft | Bedrooms | Location Score | Year Built | Price | Category |
|---|---|---|---|---|---|
| 2800 | 4 | 9.5 | 2018 | $695,000 | High |
| 2400 | 4 | 9 | 2015 | $575,000 | High |
| 2100 | 3 | 8 | 2012 | $485,000 | Standard |
| 1800 | 3 | 8.5 | 2010 | $420,000 | Standard |
| 1650 | 3 | 7.5 | 2008 | $385,000 | Standard |
| 1500 | 3 | 7 | 2010 | $350,000 | Standard |
| 1350 | 2 | 6.5 | 2005 | $315,000 | Standard |
| 1200 | 2 | 7 | 2005 | $285,000 | Standard |
| 950 | 2 | 5.5 | 1998 | $225,000 | Standard |
| 2200 | 3 | 8 | 2013 | $520,000 | High |
Dimensionality: 4 features
Task Type: Binary Classification
Class Balance: 3 High, 7 Standard
Learning Goal
Train a classifier to learn the pattern: larger, newer houses in better locations tend to be high-value. The model should generalize to predict categories for new houses it hasn't seen before.
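A concrete sketch of this learning goal, assuming scikit-learn: fit a shallow decision tree to the 10-house table above and ask it to classify a house it has never seen.

```python
# A sketch of the learning goal: fit a shallow decision tree (scikit-learn
# assumed) to the 10-house table above and predict an unseen house.
from sklearn.tree import DecisionTreeClassifier

# Features: [sqft, bedrooms, location score, year built]
X = [[2800, 4, 9.5, 2018], [2400, 4, 9.0, 2015], [2100, 3, 8.0, 2012],
     [1800, 3, 8.5, 2010], [1650, 3, 7.5, 2008], [1500, 3, 7.0, 2010],
     [1350, 2, 6.5, 2005], [1200, 2, 7.0, 2005], [950, 2, 5.5, 1998],
     [2200, 3, 8.0, 2013]]
y = ["High", "High", "Standard", "Standard", "Standard",
     "Standard", "Standard", "Standard", "Standard", "High"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# An unseen house: large, new, well located -> "High" expected.
print(clf.predict([[2500, 4, 9.0, 2016]]))
```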
The No Free Lunch (NFL) theorem states that, averaged over all possible problems, every learning algorithm has the same expected performance. This means no single algorithm is universally best: any algorithm that outperforms others on one class of problems must underperform on another, so success depends on how well an algorithm's inductive bias matches the problem at hand.
Practical Implication
Always try multiple algorithms and validation methods. What works for one problem may fail for another. Success requires understanding both your data and your algorithms.
Let f be the true function with y = f(x) + ε, where E[ε] = 0 and Var(ε) = σ², and let h be a model learned from training set D.

The expected prediction error at point x, taken over all possible training sets and noise:

E[(y - h(x))²]

Since y = f(x) + ε and ε is independent of h(x):

E[(y - h(x))²] = E[(f(x) - h(x))²] + σ²

Add and subtract the expected prediction E[h(x)]:

E[(f(x) - h(x))²] = E[(h(x) - E[h(x)] + E[h(x)] - f(x))²]

Let A = h(x) - E[h(x)] and B = E[h(x)] - f(x):

E[(A + B)²] = E[A²] + 2B·E[A] + B² = E[A²] + B²

Cross terms vanish because E[A] = 0, E[ε] = 0, and B is deterministic given x. Putting it together:

E[(y - h(x))²] = B² + E[A²] + σ² = Bias² + Variance + Irreducible Error
Bias² = (E[h(x)] - f(x))²
Squared difference between the average prediction and the true function. High bias = underfitting.
Reduced by: More complex models
Variance = E[(h(x) - E[h(x)])²]
Expected squared deviation from the average prediction. High variance = overfitting.
Reduced by: Regularization, more data
Irreducible Error = σ²
Noise in the data. Cannot be reduced by any model.
Inherent in the problem
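The decomposition can be checked numerically. The sketch below (numpy assumed; the function f, noise level, and polynomial degrees are illustrative choices) fits polynomials to many noisy training sets drawn from the same f and estimates bias² and variance of the prediction at a fixed point:

```python
# A numerical sketch of the decomposition (numpy assumed): fit polynomials
# of two degrees to many noisy training sets drawn from the same f, then
# estimate bias^2 and variance of the prediction at a fixed point x0.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)    # true function
sigma = 0.3                            # noise level -> irreducible error sigma^2
x0 = 0.5                               # evaluation point

for degree in (1, 7):                  # simple vs. flexible model
    preds = []
    for _ in range(500):               # many training sets D
        x = rng.uniform(0, 1, 20)
        y = f(x) + rng.normal(0, sigma, 20)
        preds.append(np.polyval(np.polyfit(x, y, degree), x0))
    preds = np.array(preds)
    print(f"degree={degree}: bias^2={(preds.mean() - f(x0))**2:.4f}, "
          f"variance={preds.var():.4f}")
```

The simple model shows higher bias² and low variance; the flexible model reverses the pattern, matching the underfitting/overfitting cards above.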
A concept class C is PAC-learnable if there exists an algorithm A such that for every distribution, every target concept in C, and all ε > 0 and δ > 0:

Given m ≥ m₀(ε, δ) training samples, with probability at least 1 - δ, the learned hypothesis h satisfies error(h) ≤ ε.

For a finite hypothesis class with |H| hypotheses (and a consistent learner):

m ≥ (1/ε) · (ln|H| + ln(1/δ))

Logarithmic in hypothesis space size

For a hypothesis class with VC dimension d:

m = O((1/ε) · (d · ln(1/ε) + ln(1/δ)))
More complex hypothesis spaces (higher d) require more samples to achieve same (ε, δ) guarantee. This formalizes the intuition that complex models need more data.
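A small calculator makes the finite-|H| bound concrete (the conjunction example is an illustrative assumption, not from the text above):

```python
# A small calculator for the finite-|H| bound above:
# m >= (1/eps) * (ln|H| + ln(1/delta)).
import math

def pac_sample_bound(h_size: int, eps: float, delta: float) -> int:
    """Samples sufficient for a consistent learner over a finite class H."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# e.g. boolean conjunctions over 10 variables: |H| = 3^10 hypotheses,
# yet only ~140 samples suffice for eps = 0.1, delta = 0.05.
print(pac_sample_bound(3 ** 10, eps=0.1, delta=0.05))
```

Note the logarithm at work: |H| is about 59,000 hypotheses, but the sample requirement grows only with ln|H|.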
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Precision (Positive Predictive Value): TP / (TP + FP)
Of all positive predictions, what fraction is correct? High precision = few false alarms.
Recall (Sensitivity, True Positive Rate): TP / (TP + FN)
Of all actual positives, what fraction did we catch? High recall = few missed positives.
F1-Score (Harmonic Mean): 2 · Precision · Recall / (Precision + Recall)
Derivation: F1 = 2 / (1/Precision + 1/Recall). The harmonic mean gives equal weight to precision and recall and is dragged down by whichever is smaller.
Specificity (True Negative Rate): TN / (TN + FP)
Of all actual negatives, what fraction did we correctly identify?
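These formulas are a few lines of plain Python; the counts below are made-up example numbers:

```python
# A sketch computing the four metrics above from raw confusion-matrix counts.
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # sensitivity, TPR
    specificity = tn / (tn + fp)             # TNR
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "f1": f1}

# e.g. a classifier with 40 TP, 10 FP, 20 FN, 30 TN:
print(classification_metrics(tp=40, fp=10, fn=20, tn=30))
# precision 0.8, recall ~0.667, specificity 0.75, f1 ~0.727
```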
By adjusting the decision threshold t:

- Lowering t labels more examples positive: recall rises (more true positives caught) while precision typically falls (more false positives slip in)
- Raising t labels fewer examples positive: precision typically rises while recall falls

This fundamental tradeoff stems from the fact that TP + FN is constant (the total number of actual positives), while TP + FP (the number of predicted positives) varies with the threshold.
The ROC curve plots TPR vs. FPR at different classification thresholds:

True Positive Rate (TPR): TPR = TP / (TP + FN)

False Positive Rate (FPR): FPR = FP / (FP + TN)

AUC has a probabilistic interpretation:

AUC = P(score(x⁺) > score(x⁻)) for a randomly chosen positive example x⁺ and negative example x⁻

Meaning: the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative example.

For n threshold points (TPR_i, FPR_i), sorted by increasing FPR:

AUC ≈ Σ_{i=1}^{n-1} (FPR_{i+1} - FPR_i) · (TPR_i + TPR_{i+1}) / 2
Sum of trapezoid areas between consecutive points
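The trapezoidal estimate is short to implement. A sketch assuming numpy (scores and labels below are illustrative):

```python
# A sketch of the trapezoidal AUC estimate (numpy assumed): sweep thresholds
# over the unique scores, collect (FPR, TPR) points, sum trapezoid areas.
import numpy as np

def auc_trapezoid(scores, labels):
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    thresholds = np.r_[np.inf, np.sort(np.unique(scores))[::-1]]
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    tpr = [((scores >= t) & (labels == 1)).sum() / pos for t in thresholds]
    fpr = [((scores >= t) & (labels == 0)).sum() / neg for t in thresholds]
    return np.trapz(tpr, fpr)   # integral of TPR d(FPR) via trapezoids

# 3 positives, 2 negatives; 5 of 6 positive-negative pairs ranked correctly
print(auc_trapezoid([0.9, 0.8, 0.6, 0.3, 0.2], [1, 1, 0, 1, 0]))  # ~0.833
```

The result 5/6 also matches the probabilistic interpretation directly: of the six positive-negative pairs, five have the positive scored higher.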
AUC = 0.5
Random classifier (diagonal line)
AUC = 0.7-0.8
Acceptable model
AUC = 0.9-1.0
Excellent model
K-fold CV estimates the true error by averaging the errors over the K folds:

Error_CV = (1/K) · Σ_{i=1}^{K} E_i
where E_i is the error on fold i when trained on other K-1 folds.
Key insight: Folds are not independent (they share n(K-1)/K training samples), so covariance terms are positive. This increases variance compared to independent estimates.
Large K (e.g., K=n, LOOCV)
Lower bias (each training set is nearly the full dataset), but higher variance and higher computational cost
Small K (e.g., K=5 or 10)
Somewhat higher bias (smaller training sets), but lower variance and much cheaper to compute
Standard Choice: K=10
Empirically shown to provide a good bias-variance balance. Prefer K=5 for large datasets (cheaper to compute) and K=n for very small datasets (maximizes training data per fold).
Problem: Random splits may have imbalanced class distributions across folds.
Solution: Maintain class proportions in each fold.
Mathematical effect: Reduces variance of CV estimate by ensuring each fold is representative. Especially important for imbalanced datasets.
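The effect is easy to see on an imbalanced label vector. A sketch assuming scikit-learn:

```python
# A sketch contrasting plain vs. stratified K-fold (scikit-learn assumed)
# on an imbalanced label vector: stratification keeps ~20% positives per fold.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((25, 1))                       # features are irrelevant here
y = np.array([1] * 5 + [0] * 20)            # 20% positive class

for name, cv in [("plain", KFold(n_splits=5, shuffle=True, random_state=0)),
                 ("stratified", StratifiedKFold(n_splits=5, shuffle=True,
                                                random_state=0))]:
    rates = [y[test].mean() for _, test in cv.split(X, y)]
    print(name, [f"{r:.0%}" for r in rates])
```

Plain K-fold can leave some folds with no positives at all, while stratified K-fold holds every fold at the overall 20% rate.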