
Model Evaluation & Selection

Learn how to evaluate ML models, understand overfitting and underfitting, and master essential performance metrics with watermelon examples

Module 3 of 4
Intermediate Level
60-80 min

Understanding Error Concepts

Error Rate and Accuracy

Error Rate

The proportion of misclassified samples

E = a/m

where a = number of errors, m = total samples

Accuracy

The proportion of correctly classified samples

Accuracy = 1 - E

Accuracy + Error Rate = 1

Types of Error

Training (Empirical) Error

Error on the training set - measures how well the model fits the training data

If the model misclassifies 2 of 15 training watermelons, the training error is 2/15 ≈ 13.3%

Test Error

Error on the test set - measures how well the model generalizes to new data

If the model misclassifies 1 of 5 test watermelons, the test error is 1/5 = 20%
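
A minimal Python sketch of these calculations (the label lists are made up to match the watermelon numbers above):

```python
# Minimal sketch: error rate and accuracy on training vs. test predictions.
# The label lists below are made-up values matching the example (2/15 and 1/5 errors).

def error_rate(y_true, y_pred):
    """Proportion of misclassified samples: E = a / m."""
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)

y_train_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 15 training labels
y_train_pred = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # 2 mistakes
y_test_true  = [1, 1, 1, 0, 0]                                 # 5 test labels
y_test_pred  = [1, 1, 0, 0, 0]                                 # 1 mistake

E_train = error_rate(y_train_true, y_train_pred)
E_test  = error_rate(y_test_true, y_test_pred)
print(f"training error = {E_train:.3f}, accuracy = {1 - E_train:.3f}")  # 0.133 / 0.867
print(f"test error     = {E_test:.3f}, accuracy = {1 - E_test:.3f}")    # 0.200 / 0.800
```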

Overfitting and Underfitting

Overfitting (Over-learning)

The learner "memorizes" the training samples too well, treating specific characteristics of the training data as general properties of all samples.

Watermelon Example:

The model learns: "A good watermelon must be exactly 3.5kg, dark-green color, clear texture, AND harvested on Tuesday morning." It performs perfectly on training data but fails on new watermelons because it learned noise and specific details rather than general patterns.

Solutions:

  • Regularization: Add penalty terms to the optimization objective
  • Early stopping: Stop training before model overfits
  • Data augmentation: Increase training data diversity
  • Dropout: Randomly drop neurons during training (for neural networks)
  • Ensemble methods: Combine multiple models
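
As a rough sketch of the first solution, the snippet below fits a logistic regression with different L2 penalty strengths; it assumes scikit-learn is installed, and the features and labels are synthetic stand-ins, not real watermelon data:

```python
# Sketch: fighting overfitting with L2 regularization (assumes scikit-learn).
# Smaller C = stronger penalty on large weights = simpler decision boundary.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))                                # synthetic "melon features"
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)    # only feature 0 matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for C in (100.0, 1.0, 0.01):                                  # weak -> strong regularization
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    print(f"C={C:>6}: train acc={clf.score(X_tr, y_tr):.2f}, "
          f"test acc={clf.score(X_te, y_te):.2f}")
```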

Underfitting (Under-learning)

The learner fails to capture general properties even in the training samples. The model is too simple to learn the underlying patterns.

Watermelon Example:

The model learns: "All watermelons are good" or "Only weight matters, ignore all other features." It performs poorly even on training data because it's too simplistic to capture real patterns.

Solutions:

  • Increase model complexity: Use more features or more complex model
  • Decision trees: Add more branches and depth
  • Neural networks: Add more layers or neurons
  • Feature engineering: Create more informative features
  • Train longer: Increase training iterations
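
A small sketch of "increase model complexity": a depth-1 decision tree (roughly "only one feature matters") against deeper trees, on synthetic data where a good melon needs two conditions at once (assumes scikit-learn):

```python
# Sketch: fixing underfitting by increasing model complexity (assumes scikit-learn).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))                        # weight, color, texture (synthetic)
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)   # good melon needs BOTH conditions

for depth in (1, 3, 6):                               # too simple -> complex enough
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    acc = cross_val_score(tree, X, y, cv=5).mean()
    print(f"max_depth={depth}: mean CV accuracy = {acc:.2f}")
```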

Model Evaluation Methods

How do we split our data to evaluate model performance? Here are the main approaches:

1. Hold-out Method

Directly split the dataset into training set and test set.

Key Principles:

  • Maintain data distribution consistency
  • Use stratified sampling to preserve class proportions
  • Perform multiple random splits and average results
  • Common split ratios: 70:30, 75:25, or 80:20

Watermelon Example:

With 20 watermelons (10 good, 10 bad):

  • Training: 15 watermelons (7-8 good, 7-8 bad)
  • Test: 5 watermelons (2-3 good, 2-3 bad)
  • Stratified sampling maintains the 50:50 ratio

Trade-off: More training data = better model, but less test data = less reliable evaluation
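
A minimal sketch of a stratified hold-out split for the 20-melon example (assumes scikit-learn; the feature matrix is a placeholder):

```python
# Sketch: hold-out split with stratified sampling (assumes scikit-learn).
# 20 watermelons, 10 good and 10 bad, split 75:25 while preserving class proportions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(-1, 1)          # stand-in features, one row per melon
y = np.array([1] * 10 + [0] * 10)         # 10 good (1), 10 bad (0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))          # 15 training melons, 5 test melons
print(y_train.mean(), y_test.mean())      # class proportions roughly preserved in both splits
```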

2. Cross-Validation (k-fold CV)

Split data into k equal-sized subsets. Use k-1 subsets for training and 1 for testing, rotating through all k combinations.

Process:

  • Fold 1: train on folds 2 through k, test on fold 1
  • Fold 2: train on folds 1 and 3 through k, test on fold 2
  • ...
  • Fold k: train on folds 1 through k-1, test on fold k

Final result = average of the k test results

Common Practice:

  • k=10 is most common (10-fold CV)
  • • "10 times 10-fold CV" = repeat 10-fold CV 10 times
  • • More stable than single hold-out

Watermelon Example:

20 watermelons, k=5: Each fold has 4 watermelons. Train on 16, test on 4, repeat 5 times. Average the 5 test accuracies.
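
A short sketch of exactly this 5-fold setup (assumes scikit-learn; features are random placeholders):

```python
# Sketch: 5-fold cross-validation on 20 synthetic "watermelons" (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 4))             # made-up features
y = np.array([1] * 10 + [0] * 10)         # 10 good, 10 bad

cv = KFold(n_splits=5, shuffle=True, random_state=0)   # each fold holds 4 melons
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores)            # 5 test accuracies, one per fold
print(scores.mean())     # final result = average of the 5 folds
```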

3. Leave-One-Out (LOO)

Special case of cross-validation where k = number of samples. Each sample serves as the test set once.

Advantages:

  • Not affected by random split variations
  • Gives a nearly unbiased estimate (each model trains on almost all the data)
  • Maximizes training data usage

Disadvantages:

  • Computationally expensive for large datasets
  • Must train m models (m = dataset size)
  • Watermelon: 20 models for 20 samples
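
A minimal sketch of leave-one-out evaluation, i.e. k-fold CV with k = m (assumes scikit-learn; data is a random placeholder):

```python
# Sketch: leave-one-out evaluation on 20 made-up watermelons (assumes scikit-learn).
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 4))
y = np.array([1] * 10 + [0] * 10)

loo = LeaveOneOut()                        # trains 20 models, one per held-out melon
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=loo)
print(len(scores), scores.mean())          # 20 single-sample results and their average
```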

4. Bootstrap Method

Sample with replacement from the original dataset to create training sets. Samples not selected form the test set.

Process:

Given dataset D with m samples, randomly draw m samples with replacement to form training set D'. About 36.8% of samples never appear in D' (can be proven mathematically).

The probability a sample is never selected = (1-1/m)^m → 1/e ≈ 0.368 as m→∞

Use Cases:

  • Small datasets
  • Ensemble learning (e.g., Random Forest)
  • Estimating confidence intervals

Watermelon Example:

From 20 watermelons, draw 20 times with replacement. On average about 7 watermelons (≈36.8%) are never selected; these form the test set.
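
A small NumPy sketch of one bootstrap draw and the out-of-bag fraction:

```python
# Sketch: bootstrap sampling and the ~36.8% out-of-bag fraction (plain NumPy).
import numpy as np

rng = np.random.default_rng(0)
m = 20                                             # 20 watermelons
indices = np.arange(m)

boot = rng.choice(indices, size=m, replace=True)   # training set D' (with repeats)
oob = np.setdiff1d(indices, boot)                  # melons never drawn -> test set
print(f"out-of-bag melons: {len(oob)} of {m}")     # typically around 7 (~36.8%)

# The limit (1 - 1/m)^m -> 1/e as m grows:
print((1 - 1 / m) ** m, 1 / np.e)                  # ~0.358 vs ~0.368
```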

Performance Metrics

Different tasks require different metrics to evaluate model performance:

Regression Metrics

Mean Squared Error (MSE)

Most common metric for regression tasks. Measures average squared difference between predictions and actual values.

E(f;D) = (1/m) Σ(f(xᵢ) - yᵢ)²

where f(xᵢ) is predicted value, yᵢ is actual value

Watermelon Example:

Predicting sugar content (0-100). If predicted [78, 65, 82] but actual [80, 70, 85], MSE = ((78-80)² + (65-70)² + (82-85)²) / 3 = (4 + 25 + 9) / 3 = 12.67
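
The same calculation in a few lines of Python:

```python
# Sketch: MSE for the sugar-content example above (plain Python).
predicted = [78, 65, 82]
actual    = [80, 70, 85]

mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
print(mse)   # (4 + 25 + 9) / 3 = 12.666...
```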

Classification Metrics

Confusion Matrix

A table showing true vs predicted classifications:

                      Predicted Positive         Predicted Negative
Actual Positive       TP (True Positive): 8      FN (False Negative): 1
Actual Negative       FP (False Positive): 2     TN (True Negative): 9

Watermelon Example:

• TP (8): Predicted good, actually good ✓
• FP (2): Predicted good, actually bad ✗
• FN (1): Predicted bad, actually good ✗
• TN (9): Predicted bad, actually bad ✓

Precision (查准率)

P = TP / (TP + FP)

Of all predicted positives, how many are actually positive?

Watermelon example: P = 8 / (8 + 2) = 80.0%

Recall (查全率)

R = TP / (TP + FN)

Of all actual positives, how many did we find?

Watermelon example: R = 8 / (8 + 1) ≈ 88.9%

F1 Score

F1 = 2PR/(P+R)

Harmonic mean of precision and recall

Watermelon example: F1 = 2 × 0.800 × 0.889 / (0.800 + 0.889) ≈ 84.2%

F-beta Score

Weighted harmonic mean allowing us to emphasize precision or recall:

Fβ = (1 + β²) × P × R / (β² × P + R)
β > 1: Emphasize recall (find more positives, okay to have false positives)
β < 1: Emphasize precision (be more certain, okay to miss some positives)
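
A quick sketch computing all four metrics from the confusion-matrix counts above (plain Python; the f_beta helper is just the formula written out):

```python
# Sketch: precision, recall, F1 and F-beta from the confusion-matrix counts above.
TP, FP, FN, TN = 8, 2, 1, 9

precision = TP / (TP + FP)                                   # 8/10 = 0.800
recall    = TP / (TP + FN)                                   # 8/9  ≈ 0.889
f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.842

def f_beta(p, r, beta):
    """Weighted harmonic mean; beta > 1 favors recall, beta < 1 favors precision."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(f"P={precision:.3f}  R={recall:.3f}  F1={f1:.3f}")
print(f"F2={f_beta(precision, recall, 2):.3f}  F0.5={f_beta(precision, recall, 0.5):.3f}")
```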

ROC Curve & AUC

ROC (Receiver Operating Characteristic) Curve: Plots True Positive Rate (TPR) vs False Positive Rate (FPR) at various classification thresholds.

Key Terms:
  • TPR (True Positive Rate): Same as Recall
  • FPR (False Positive Rate): FP / (FP + TN)
  • AUC: Area Under the ROC Curve
Interpretation:
  • AUC = 1.0: Perfect classifier
  • AUC = 0.5: Random guessing
  • AUC < 0.5: Worse than random
  • Higher AUC = better ranking quality

Use case: ROC/AUC measures how well the model ranks samples, independent of the classification threshold. Particularly useful for imbalanced datasets.
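
A minimal sketch of computing ROC points and AUC (assumes scikit-learn; the labels and scores are made-up values for illustration):

```python
# Sketch: ROC curve points and AUC from predicted scores (assumes scikit-learn).
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])      # actual good/bad melons (made up)
y_score = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.4,
                    0.3, 0.2, 0.45, 0.65])               # model's predicted scores (made up)

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # TPR vs FPR at each threshold
print(roc_auc_score(y_true, y_score))                    # area under that curve
```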

Bias-Variance Tradeoff

Generalization error can be decomposed into three components:

Bias (偏差)

The difference between the expected prediction and the true value. Measures how far off the model is on average.

High bias: Underfitting - model is too simple, can't capture patterns

Variance (方差)

How much predictions vary when trained on different training sets. Measures model stability.

High variance: Overfitting - too sensitive to training data specifics

Noise (噪声)

Irreducible error from the data itself. No model can eliminate this.

Inherent randomness and measurement errors in data

Generalization Error Decomposition

E(f;D) = bias²(x) + var(x) + ε²

The total error is the sum of squared bias, variance, and irreducible noise.
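
A rough way to see this decomposition empirically is to refit the same model on many resampled training sets and measure bias² and variance directly. The sketch below does this with plain NumPy; the sine target, noise level, and polynomial models are illustrative assumptions, not part of the formula itself:

```python
# Sketch: estimating bias^2 and variance by refitting a model on many training sets.
import numpy as np

rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, 50)
f_true = np.sin(2 * np.pi * x_test)               # assumed "true" function
sigma = 0.3                                       # irreducible noise level (assumption)

def bias_variance(degree, n_runs=200):
    preds = []
    for _ in range(n_runs):                       # many different training sets
        x = rng.uniform(size=20)
        y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=20)
        coefs = np.polyfit(x, y, degree)          # fit a polynomial of this degree
        preds.append(np.polyval(coefs, x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # how far off on average
    var = np.mean(preds.var(axis=0))                      # how much predictions wobble
    return bias2, var

for degree in (1, 3, 9):                          # simple -> complex models
    b2, v = bias_variance(degree)
    print(f"degree={degree}: bias^2={b2:.3f}  variance={v:.3f}")
```

Simple models (degree 1) show high bias and low variance; very flexible models (degree 9) show the opposite, which is the dilemma described next.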

The Bias-Variance Dilemma

Insufficient training: High bias dominates generalization error (underfitting)

Continued training: Bias decreases, but variance gradually increases

Excessive training: High variance dominates, leading to overfitting

Finding the sweet spot between bias and variance is a fundamental challenge in machine learning!

Key Takeaways

Training vs Test Error: Training error measures fit, test error measures generalization

Overfitting/Underfitting: Balance model complexity with data size using regularization and cross-validation

Evaluation Methods: Hold-out, k-fold CV, LOO, bootstrap - each with specific use cases

Performance Metrics: Choose appropriate metrics for your task (MSE for regression, precision/recall/F1 for classification)

Bias-Variance Tradeoff: Finding the right model complexity is essential for good generalization