
Introduction to Machine Learning

Master fundamental concepts, terminology, and evaluation methods in machine learning

What is Machine Learning?

Classic Definition

Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. More formally, a computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance on T, as measured by P, improves with experience E.

Example: Email Spam Filter

  • Task (T): Classify emails as spam or not spam
  • Experience (E): Labeled training emails
  • Performance (P): Classification accuracy on new emails
Supervised Learning

Learning from labeled training data with input-output pairs

Classification: Predict discrete categories
Regression: Predict continuous values
Unsupervised Learning

Learning patterns from unlabeled data

Clustering: Group similar data points
Dimensionality reduction: Reduce features
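
A minimal sketch contrasting the two paradigms, assuming scikit-learn is available; the tiny arrays below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised: requires labels
from sklearn.cluster import KMeans                   # unsupervised: no labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # features
y = np.array([0, 0, 1, 1])                                      # labels (supervised only)

clf = LogisticRegression().fit(X, y)   # learns a mapping from features to labels
print(clf.predict([[1.2, 1.9]]))       # predicts a discrete category for a new point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # groups similar points
print(km.labels_)                      # cluster assignments found without any labels
```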

Essential Terminology

Dataset & Samples

Dataset (D)

Collection of samples for training/testing

Sample (Instance)

Single object or data point

Example

Sample with its label

Features & Labels

Feature (Attribute)

Characteristic describing an object

Label (Target)

Output or result to predict

Feature Space

Space formed by all features

Model Components

Hypothesis (Model)

Learned mapping from features to labels

Hypothesis Space

Set of all possible models

Inductive Bias

Assumptions made by learning algorithm

Data Splits

Training Set

Data used to train the model

Validation Set

Data for hyperparameter tuning

Test Set

Data for final evaluation

Model Evaluation

Overfitting vs Underfitting

Overfitting

Model learns training data too well, including noise and outliers

  • High training accuracy, low test accuracy
  • Model too complex for data
  • Poor generalization

Underfitting

Model too simple to capture underlying patterns

  • Low training and test accuracy
  • High bias, low variance
  • Fails to learn patterns
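
The gap between training and test error is easy to see numerically. Below is a small sketch, assuming NumPy, that fits polynomials of different degrees to noisy synthetic data; the dataset and degrees are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # noise-free target, used to judge generalization

for degree in (1, 4, 14):            # too simple, reasonable, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# Degree 1 underfits (both errors high); degree 14 overfits
# (training error near zero, test error much larger).
```
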
Error Metrics

Training Error

E_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)

Error on training data

Generalization Error

E_{\text{gen}} = \mathbb{E}_{(x,y) \sim \mathcal{D}} [L(y, f(x))]

Expected error on unseen data
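
A minimal sketch of the training-error formula with a 0-1 loss (the predictions below are made up for illustration). The generalization error is the same average taken over the unknown data distribution, so in practice it is estimated on held-out data:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # labels y_i
y_pred = np.array([1, 0, 0, 1, 0, 1])   # model outputs y_hat_i

zero_one_loss = (y_true != y_pred).astype(float)  # L(y_i, y_hat_i)
E_train = zero_one_loss.mean()                    # (1/n) * sum of losses
print(E_train)                                    # 1/6 ≈ 0.167
```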

Bias-Variance Tradeoff
\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}

Bias

Error from wrong assumptions (underfitting)

Variance

Error from sensitivity to training data (overfitting)

Noise

Irreducible error in data

Cross-Validation Methods

Hold-Out Method

Split data into training (70-80%) and test (20-30%) sets

✓ Pros: Simple and fast
✗ Cons: High variance, less training data

K-Fold Cross-Validation

Split data into k folds, train k times using k-1 folds for training and 1 for validation

E_{\text{CV}} = \frac{1}{k} \sum_{i=1}^{k} E_i
✓ Pros: Lower variance, uses all data
✗ Cons: Computationally expensive

Leave-One-Out (LOO)

Special case of k-fold where k = n (number of samples)

✓ Pros: Maximum training data, no randomness
✗ Cons: Very expensive for large datasets
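
The hold-out and k-fold procedures can be sketched as follows, assuming scikit-learn; the synthetic dataset and the logistic-regression model are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-out: a single 75/25 split - fast, but the estimate has higher variance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold CV: average the score over 5 train/validate rounds, using all the data.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=kfold)

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold accuracies: {np.round(cv_scores, 3)}  mean = {cv_scores.mean():.3f}")
```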

Classification Metrics

Confusion Matrix
                     Predicted Positive      Predicted Negative
Actual Positive      TP (True Positive)      FN (False Negative)
Actual Negative      FP (False Positive)     TN (True Negative)

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Overall correctness

Precision

\text{Precision} = \frac{TP}{TP + FP}

Accuracy of positive predictions

Recall (Sensitivity)

\text{Recall} = \frac{TP}{TP + FN}

Coverage of actual positives

F1 Score

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Harmonic mean of precision and recall
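
All four metrics follow directly from the confusion-matrix counts, as in this short sketch (the counts are made up for illustration):

```python
TP, FP, TN, FN = 30, 10, 50, 10   # hypothetical confusion-matrix counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  f1={f1:.2f}")
# accuracy=0.80  precision=0.75  recall=0.75  f1=0.75
```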

Example: Housing Data Classification

Task: Predict House Value Category
Classify each house as High-Value (price > $500k) or Standard-Value (price ≤ $500k)

Sample Dataset (10 houses)

Sqft   Bedrooms   Location   Year   Price      Category
2800   4          9.5        2018   $695,000   High
2400   4          9.0        2015   $575,000   High
2100   3          8.0        2012   $485,000   Standard
1800   3          8.5        2010   $420,000   Standard
1650   3          7.5        2008   $385,000   Standard
1500   3          7.0        2010   $350,000   Standard
1350   2          6.5        2005   $315,000   Standard
1200   2          7.0        2005   $285,000   Standard
950    2          5.5        1998   $225,000   Standard
2200   3          8.0        2013   $520,000   High

Features (Input)

  • Square footage (continuous)
  • Number of bedrooms (discrete)
  • Location score 1-10 (ordinal)
  • Year built (continuous)

Dimensionality: 4 features

Label (Output)

  • High-Value: Price > $500k
  • Standard-Value: Price ≤ $500k

Task Type: Binary Classification

Class Balance: 3 High, 7 Standard

Learning Goal

Train a classifier to learn the pattern: larger, newer houses in better locations tend to be high-value. The model should generalize to predict categories for new houses it hasn't seen before.
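
A sketch of this task in code, assuming scikit-learn; the decision tree is just one reasonable classifier for a dataset this small, and the "new house" is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: sqft, bedrooms, location score, year built (the 10 houses from the table)
X = np.array([
    [2800, 4, 9.5, 2018], [2400, 4, 9.0, 2015], [2100, 3, 8.0, 2012],
    [1800, 3, 8.5, 2010], [1650, 3, 7.5, 2008], [1500, 3, 7.0, 2010],
    [1350, 2, 6.5, 2005], [1200, 2, 7.0, 2005], [ 950, 2, 5.5, 1998],
    [2200, 3, 8.0, 2013],
])
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])  # 1 = High-Value (> $500k), 0 = Standard

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
new_house = [[2500, 4, 9.0, 2016]]   # an unseen, hypothetical house
print(clf.predict(new_house))        # likely [1] (High-Value), given the sqft pattern
```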

No Free Lunch Theorem

The No Free Lunch (NFL) theorem states that, averaged over all possible problems, every learning algorithm has the same expected performance. This means:

  • No single algorithm is universally best for all problems
  • Algorithm performance depends on problem structure
  • Domain knowledge and inductive bias are crucial
  • Must match algorithm assumptions to problem characteristics

Practical Implication

Always try multiple algorithms and validation methods. What works for one problem may fail for another. Success requires understanding both your data and your algorithms.
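
One way to follow this advice in practice, assuming scikit-learn, is to compare several models under the same cross-validation protocol; the dataset and the three models here are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("k-nearest neighbors", KNeighborsClassifier()),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")

# Which model "wins" depends on this particular dataset; on a different
# problem the ranking may flip, which is exactly the NFL point.
```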

Bias-Variance Decomposition

Mathematical Decomposition of Expected Error

Setup and Notation

Let f be the true function, and let h be a learned model from training set D.

  • True relationship: y = f(x) + \epsilon, where the noise \epsilon has zero mean
  • Prediction from model h: \hat{y} = h(x)
  • Expectation over all possible training sets: \mathbb{E}_D[h(x)]
  • Noise variance: \sigma^2 = \mathbb{E}[\epsilon^2]

Step 1: Expected Squared Error

The expected prediction error at point x over all possible training sets:

\mathbb{E}_D[(h(x) - y)^2]

Step 2: Substitute True Function

Since y = f(x) + ε:

\mathbb{E}_D[(h(x) - f(x) - \epsilon)^2]

Step 3: Add and Subtract Average Prediction

Add and subtract the expected prediction:

\mathbb{E}_D[(h(x) - \mathbb{E}_D[h(x)] + \mathbb{E}_D[h(x)] - f(x) - \epsilon)^2]

Step 4: Expand the Square

Let A = h(x) - \mathbb{E}_D[h(x)] and B = \mathbb{E}_D[h(x)] - f(x):

\mathbb{E}_D[(A + B - \epsilon)^2] = \mathbb{E}_D[A^2] + \mathbb{E}_D[B^2] + \mathbb{E}_D[\epsilon^2] + \text{cross terms}

The cross terms vanish because \mathbb{E}_D[A] = 0, \mathbb{E}[\epsilon] = 0, \epsilon is independent of the training set D, and B is deterministic given x.

Step 5: Final Decomposition

\mathbb{E}_D[(h(x) - y)^2] = \underbrace{(\mathbb{E}_D[h(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(h(x) - \mathbb{E}_D[h(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}

Bias²

Squared difference between average prediction and true function. High bias = underfitting.

Reduced by: More complex models

Variance

Expected squared deviation from average prediction. High variance = overfitting.

Reduced by: Regularization, more data

Irreducible Error

Noise in data. Cannot be reduced by any model.

Inherent in the problem
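
The decomposition can be checked empirically with a Monte Carlo sketch: repeatedly draw training sets, refit a model, and measure bias² and variance at a single query point. The true function, noise level, and polynomial model below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # true function f
sigma = 0.3                           # noise std, so irreducible error = sigma^2
x0, n, trials, degree = 0.5, 30, 500, 3

preds = np.empty(trials)
for t in range(trials):                # each trial draws a fresh training set D
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)  # h learned from D
    preds[t] = np.polyval(coeffs, x0)  # h(x0)

bias_sq  = (preds.mean() - f(x0)) ** 2  # (E_D[h(x0)] - f(x0))^2
variance = preds.var()                  # E_D[(h(x0) - E_D[h(x0)])^2]
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
# Their sum approximates the expected squared error between h(x0) and a noisy y.
```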

PAC Learning Framework

Probably Approximately Correct Learning

Formal Definition

A concept class C is PAC-learnable if there exists an algorithm A such that:

\forall f \in C, \forall \delta > 0, \forall \epsilon > 0, \exists m_0(\epsilon, \delta):

If the algorithm receives m ≥ m₀ training samples, the learned hypothesis h satisfies:

P(\text{error}(h) \leq \epsilon) \geq 1 - \delta

Key Parameters

  • ε (epsilon): Accuracy parameter - maximum acceptable error
  • δ (delta): Confidence parameter - probability of failure
  • m: Number of training examples needed
  • h: Hypothesis (learned model)

Sample Complexity

For finite hypothesis class with |H| hypotheses:

m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

Logarithmic in hypothesis space size

VC Dimension Bound

For hypothesis class with VC dimension d:

m \geq O\left(\frac{1}{\epsilon}\left(d + \ln\frac{1}{\delta}\right)\right)

More complex hypothesis spaces (higher d) require more samples to achieve same (ε, δ) guarantee. This formalizes the intuition that complex models need more data.
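
A quick sketch of the finite-hypothesis-class bound from above; the values of |H|, ε, and δ are arbitrary illustrations:

```python
import math

def pac_sample_bound(H_size: int, epsilon: float, delta: float) -> int:
    """Samples needed: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

# Example: one million hypotheses, 5% error tolerance, 95% confidence.
print(pac_sample_bound(H_size=10**6, epsilon=0.05, delta=0.05))  # 337 samples
```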

Confusion Matrix Metrics Derivation

From Confusion Matrix to Performance Metrics

Confusion Matrix Foundation

                     Predicted Positive     Predicted Negative
Actual Positive      TP                     FN
Actual Negative      FP                     TN

Metric Derivations

Precision (Positive Predictive Value)

\text{Precision} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{All Predicted Positives}}

Of all positive predictions, what fraction is correct? High precision = few false alarms.

Recall (Sensitivity, True Positive Rate)

\text{Recall} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{All Actual Positives}}

Of all actual positives, what fraction did we catch? High recall = few missed positives.

F1-Score (Harmonic Mean)

F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}

Derivation: Harmonic mean gives equal weight to precision and recall.

\frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}

Specificity (True Negative Rate)

\text{Specificity} = \frac{TN}{TN + FP} = \frac{\text{True Negatives}}{\text{All Actual Negatives}}

Of all actual negatives, what fraction did we correctly identify?

Precision-Recall Tradeoff

By adjusting decision threshold t:

  • Lower threshold → more positives → higher recall, lower precision
  • Higher threshold → fewer positives → lower recall, higher precision

This fundamental tradeoff arises because TP + FN is fixed (the total number of actual positives), while TP + FP (the number of predicted positives) changes with the threshold.
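
A small sketch of this tradeoff, sweeping the threshold over made-up scores (pure NumPy; the labels and scores are illustrative):

```python
import numpy as np

y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.45, 0.35, 0.30, 0.20, 0.10, 0.05])

for t in (0.3, 0.5, 0.7):
    y_pred = (y_score >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")

# Lowering the threshold catches more actual positives (recall rises)
# but admits more false positives (precision falls).
```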

ROC Curve & AUC Analysis

Receiver Operating Characteristic Mathematical Foundation

ROC Curve Definition

ROC curve plots TPR vs FPR at different classification thresholds:

True Positive Rate (TPR):

TPR = \frac{TP}{TP + FN} = \text{Recall}

False Positive Rate (FPR):

FPR = \frac{FP}{FP + TN} = 1 - \text{Specificity}

AUC (Area Under Curve) Interpretation

AUC has a probabilistic interpretation:

AUC = P(\text{score}(x_+) > \text{score}(x_-) \mid x_+ \in \text{positive}, x_- \in \text{negative})

Meaning: Probability that a randomly chosen positive example has a higher predicted score than a randomly chosen negative example.

Computing AUC: Trapezoidal Rule

For n threshold points (TPR_i, FPR_i):

AUC \approx \sum_{i=1}^{n-1} \frac{TPR_i + TPR_{i+1}}{2} \cdot (FPR_{i+1} - FPR_i)

Sum of trapezoid areas between consecutive points

AUC = 0.5

Random classifier (diagonal line)

AUC = 0.7-0.8

Acceptable model

AUC = 0.9-1.0

Excellent model
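
The two views of AUC (pairwise probability and trapezoidal area) can be computed side by side with plain NumPy; the toy labels and scores below are illustrative:

```python
import numpy as np

y_true  = np.array([1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])

# 1) Probabilistic view: P(score(x+) > score(x-)) over all pos/neg pairs (ties = 1/2).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
diffs = pos[:, None] - neg[None, :]
auc_prob = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

# 2) Geometric view: sweep thresholds, collect (FPR, TPR) points, sum trapezoid areas.
tpr, fpr = [0.0], [0.0]
for t in np.sort(np.unique(y_score))[::-1]:
    y_pred = y_score >= t
    tpr.append(np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0))
tpr, fpr = np.array(tpr), np.array(fpr)
auc_trap = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

print(f"pairwise AUC = {auc_prob:.3f}, trapezoidal AUC = {auc_trap:.3f}")  # both 11/12
```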

Cross-Validation Theory

K-Fold Cross-Validation Mathematical Analysis

Estimation of Generalization Error

K-fold CV estimates true error by averaging errors over K folds:

\hat{E}_{CV} = \frac{1}{K}\sum_{i=1}^{K} E_i

where E_i is the error on fold i when the model is trained on the other K-1 folds.

Variance of CV Estimate

\text{Var}(\hat{E}_{CV}) = \frac{1}{K^2}\sum_{i=1}^{K} \text{Var}(E_i) + \frac{2}{K^2}\sum_{i<j} \text{Cov}(E_i, E_j)

Key insight: The K error estimates are not independent (any two of the training sets share n(K-2)/K samples), so the covariance terms are positive. This increases variance compared to averaging truly independent estimates.

Bias-Variance Tradeoff in Fold Choice

Large K (e.g., K=n, LOOCV)

  • Training set size: n-1 (almost full)
  • Low bias in error estimate
  • High variance (highly correlated folds)
  • Computationally expensive

Small K (e.g., K=5 or 10)

  • Training set size: n(K-1)/K
  • Slight upward bias in error
  • Lower variance
  • Computationally efficient

Standard Choice: K=10

Empirically, K=10 provides a good bias-variance balance; K=5 is common for large datasets, and K=n (LOOCV) for very small ones.

Stratified K-Fold Justification

Problem: Random splits may have imbalanced class distributions across folds.

Solution: Maintain class proportions in each fold.

Mathematical effect: Reduces variance of CV estimate by ensuring each fold is representative. Especially important for imbalanced datasets.
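
A sketch of the effect, assuming scikit-learn; the roughly 10%-positive synthetic dataset is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced dataset: roughly 10% positive class.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

for name, splitter in [
    ("plain KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
]:
    fold_pos_rates = [y[val_idx].mean() for _, val_idx in splitter.split(X, y)]
    print(f"{name:16s} positive rate per fold: {np.round(fold_pos_rates, 3)}")

# Stratification keeps each fold's class proportion close to the overall ~10%,
# making each fold representative and lowering the variance of the CV estimate.
```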

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. Which of the following is an example of supervised learning?
2. What is the main difference between classification and regression?
3. In the housing price dataset with features (sqft, bedrooms, bathrooms, year) → price, what is the sample dimensionality?
4. What does overfitting mean in machine learning?
5. In k-fold cross-validation with k=5, how many times is the model trained?
6. For a binary classifier with TP=80, FP=20, TN=70, FN=30, what is the precision?
7. What does the No Free Lunch (NFL) theorem state?
8. In the bias-variance tradeoff, high bias typically indicates:
9. Which metric is most appropriate for highly imbalanced datasets (e.g., 1% positive class)?
10. What is the purpose of a validation set?