Learn how to evaluate ML models, understand overfitting and underfitting, and master essential performance metrics with watermelon examples
The proportion of misclassified samples:
E = a / m
where a = number of errors, m = total samples
The proportion of correctly classified samples:
Accuracy = 1 − E = (m − a) / m
Accuracy + Error Rate = 1
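A quick numeric check of both definitions, using a hypothetical batch of 20 watermelons with 3 misclassified (plain Python, no libraries needed):

```python
# Hypothetical example: 20 watermelons, 3 misclassified
m = 20                      # total samples
a = 3                       # misclassified samples
error_rate = a / m          # 0.15
accuracy = 1 - error_rate   # 0.85
print(f"Error rate = {error_rate:.2f}, Accuracy = {accuracy:.2f}")
```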
Error on the training set - measures how well the model fits the training data
Error on the test set - measures how well the model generalizes to new data
The learner "memorizes" the training samples too well, treating specific characteristics of the training data as general properties of all samples.
The model learns: "A good watermelon must be exactly 3.5kg, dark-green color, clear texture, AND harvested on Tuesday morning." It performs perfectly on training data but fails on new watermelons because it learned noise and specific details rather than general patterns.
The learner fails to capture general properties even in the training samples. The model is too simple to learn the underlying patterns.
The model learns: "All watermelons are good" or "Only weight matters, ignore all other features." It performs poorly even on training data because it's too simplistic to capture real patterns.
How do we split our data to evaluate model performance? Here are the main approaches:
Directly split the dataset into a training set and a test set.
With 20 watermelons (10 good, 10 bad), typically use about 2/3 to 4/5 of the data for training and the rest for testing, keeping the good/bad ratio the same in both sets (e.g., 14 for training, 6 for testing).
Trade-off: More training data = better model, but less test data = less reliable evaluation
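A minimal sketch of a stratified hold-out split with scikit-learn; the feature matrix X and labels y below are hypothetical stand-ins for the watermelon data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 watermelons, 2 features each, 10 good (1) / 10 bad (0)
X = np.random.rand(20, 2)
y = np.array([1] * 10 + [0] * 10)

# Hold-out: 70% training, 30% testing, stratified to preserve the good/bad ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 14 train, 6 test
```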
Split data into k equal-sized subsets. Use k-1 subsets for training and 1 for testing, rotating through all k combinations.
Final result = average of k test results
20 watermelons, k=5: Each fold has 4 watermelons. Train on 16, test on 4, repeat 5 times. Average the 5 test accuracies.
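A sketch of 5-fold cross-validation with scikit-learn; the decision tree here is an arbitrary stand-in learner:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 20 watermelons, 2 features, 10 good / 10 bad
X = np.random.rand(20, 2)
y = np.array([1] * 10 + [0] * 10)

# 5-fold CV: train on 16 samples, test on the remaining 4, five times
model = DecisionTreeClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(scores)                                  # the 5 fold accuracies
print(f"Mean accuracy: {scores.mean():.2f}")   # final result = average of the 5 folds
```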
Special case of cross-validation where k = number of samples. Each sample serves as the test set once.
Sample with replacement from the original dataset to create training sets. Samples not selected form the test set.
Given dataset D with m samples, randomly draw m samples with replacement to form training set D'. About 36.8% of samples never appear in D' (can be proven mathematically).
The probability that a sample is never selected is (1 − 1/m)^m → 1/e ≈ 0.368 as m → ∞
From 20 watermelons, draw 20 times with replacement; the ~7-8 watermelons that are never selected form the test set.
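A sketch of bootstrap sampling with NumPy, which also checks the ~36.8% out-of-bag figure empirically (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 20  # 20 watermelons

# Bootstrap: draw m samples with replacement to form the training set D'
indices = rng.integers(0, m, size=m)
train_idx = set(indices)
test_idx = set(range(m)) - train_idx   # out-of-bag samples form the test set

print(f"Training set uses {len(train_idx)} distinct samples")
print(f"Test set (never drawn): {len(test_idx)} samples")   # typically ~7 of 20

# Empirical check of the never-drawn fraction over many trials
never_drawn = [m - len(set(rng.integers(0, m, size=m))) for _ in range(10000)]
print(f"Average fraction never drawn: {np.mean(never_drawn) / m:.3f}")  # ≈ 0.358 for m=20, → 0.368 as m grows
```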
Different tasks require different metrics to evaluate model performance:
Most common metric for regression tasks. Measures the average squared difference between predictions and actual values:
MSE = (1/m) Σᵢ (f(xᵢ) − yᵢ)²
where f(xᵢ) is the predicted value and yᵢ is the actual value
Predicting sugar content (0-100). If predicted [78, 65, 82] but actual [80, 70, 85], MSE = ((78-80)² + (65-70)² + (82-85)²) / 3 = (4 + 25 + 9) / 3 = 12.67
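The same arithmetic in NumPy:

```python
import numpy as np

# Worked example from above: predicted vs actual sugar content
predicted = np.array([78, 65, 82])
actual = np.array([80, 70, 85])

mse = np.mean((predicted - actual) ** 2)
print(f"MSE = {mse:.2f}")  # 12.67
```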
A table showing true vs predicted classifications:
|                 | Predicted Positive      | Predicted Negative      |
|---|---|---|
| Actual Positive | TP (True Positive) = 8  | FN (False Negative) = 1 |
| Actual Negative | FP (False Positive) = 2 | TN (True Negative) = 9  |
• TP (8): Predicted good, actually good ✓
• FP (2): Predicted good, actually bad ✗
• FN (1): Predicted bad, actually good ✗
• TN (9): Predicted bad, actually bad ✓
Of all predicted positives, how many are actually positive?
Precision = TP / (TP + FP)
Of all actual positives, how many did we find?
Recall = TP / (TP + FN)
Harmonic mean of precision and recall:
F1 = 2 × Precision × Recall / (Precision + Recall)
Weighted harmonic mean allowing us to emphasize precision or recall:
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
β > 1 weights recall more heavily; β < 1 weights precision more heavily
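Plugging the confusion-matrix counts above into these formulas (plain Python):

```python
# Counts from the watermelon confusion matrix above
TP, FP, FN, TN = 8, 2, 1, 9

precision = TP / (TP + FP)                           # 8/10 = 0.80
recall = TP / (TP + FN)                              # 8/9  ≈ 0.89
f1 = 2 * precision * recall / (precision + recall)   # ≈ 0.84
accuracy = (TP + TN) / (TP + TN + FP + FN)           # 17/20 = 0.85

print(f"Precision={precision:.2f}  Recall={recall:.2f}  F1={f1:.2f}  Accuracy={accuracy:.2f}")
```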
ROC (Receiver Operating Characteristic) Curve: Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at various classification thresholds.
Use case: AUC (Area Under the ROC Curve) measures how well the model ranks samples, independent of the classification threshold. Particularly useful for imbalanced datasets.
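A sketch of computing the ROC curve and AUC with scikit-learn; the labels and scores below are hypothetical ranking outputs from a classifier:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical: each watermelon gets a predicted probability of being good
y_true  = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.75, 0.7, 0.6, 0.4, 0.35, 0.3, 0.2, 0.1])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")
```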
Generalization error can be decomposed into three components:
The difference between the expected prediction and the true value. Measures how far off the model is on average.
How much predictions vary when trained on different training sets. Measures model stability.
Irreducible error from the data itself. No model can eliminate this.
The total error is the sum of the squared bias, the variance, and the irreducible noise:
E(f; D) = bias²(x) + var(x) + ε²
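A sketch of estimating bias² and variance empirically by retraining on many resampled datasets; the sine ground truth, noise level, and polynomial models are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(2 * np.pi * x)   # hypothetical ground-truth function

x_test = np.linspace(0.05, 0.95, 50)         # fixed evaluation points
n_datasets, n_train, noise_std = 200, 20, 0.3

for degree in (1, 4, 15):                    # too simple, about right, too complex
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # Draw a fresh noisy training set each time
        x = rng.uniform(0, 1, n_train)
        y = true_fn(x) + rng.normal(0, noise_std, n_train)
        coeffs = np.polyfit(x, y, degree)    # least-squares polynomial fit
        preds[i] = np.polyval(coeffs, x_test)
    bias2 = np.mean((preds.mean(axis=0) - true_fn(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree={degree:2d}  bias²={bias2:.3f}  variance={variance:.3f}")
```

As the model becomes more flexible, bias² drops while variance grows, which is the same tradeoff the training trajectory below describes.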
Insufficient training: High bias dominates generalization error (underfitting)
Continued training: Bias decreases, but variance gradually increases
Excessive training: High variance dominates, leading to overfitting
Finding the sweet spot between bias and variance is a fundamental challenge in machine learning!
Training vs Test Error: Training error measures fit, test error measures generalization
Overfitting/Underfitting: Balance model complexity with data size using regularization and cross-validation
Evaluation Methods: Hold-out, k-fold CV, LOO, bootstrap - each with specific use cases
Performance Metrics: Choose appropriate metrics for your task (MSE for regression, precision/recall/F1 for classification)
Bias-Variance Tradeoff: Finding the right model complexity is essential for good generalization