
Introduction to Machine Learning

Master fundamental concepts, terminology, and evaluation methods in machine learning

What is Machine Learning?

Classic Definition

Machine Learning is a field of computer science that gives computers the ability to learn without being explicitly programmed. More formally, a computer program is said to learn from experience E with respect to some task T and performance measure P, if its performance on T, as measured by P, improves with experience E.

Example: Email Spam Filter

  • Task (T): Classify emails as spam or not spam
  • Experience (E): Labeled training emails
  • Performance (P): Classification accuracy on new emails
Supervised Learning

Learning from labeled training data with input-output pairs

Classification: Predict discrete categories
Regression: Predict continuous values
Unsupervised Learning

Learning patterns from unlabeled data

Clustering: Group similar data points
Dimensionality reduction: Reduce features
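
A minimal sketch contrasting the two paradigms, assuming scikit-learn is available; the tiny arrays below are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # supervised: requires labels
from sklearn.cluster import KMeans                   # unsupervised: no labels

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])  # features
y = np.array([0, 0, 1, 1])                                      # labels (supervised only)

clf = LogisticRegression().fit(X, y)   # learns a mapping from features to labels
print(clf.predict([[1.2, 1.9]]))       # predicts a discrete category for a new point

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)  # groups similar points
print(km.labels_)                      # cluster assignments found without any labels
```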

Essential Terminology

Dataset & Samples

Dataset (D)

Collection of samples for training/testing

Sample (Instance)

Single object or data point

Example

Sample with its label

Features & Labels

Feature (Attribute)

Characteristic describing an object

Label (Target)

Output or result to predict

Feature Space

Space formed by all features

Model Components

Hypothesis (Model)

Learned mapping from features to labels

Hypothesis Space

Set of all possible models

Inductive Bias

Assumptions made by learning algorithm

Data Splits

Training Set

Data used to train the model

Validation Set

Data for hyperparameter tuning

Test Set

Data for final evaluation

Model Evaluation

Overfitting vs Underfitting

Overfitting

Model learns training data too well, including noise and outliers

  • High training accuracy, low test accuracy
  • Model too complex for data
  • Poor generalization

Underfitting

Model too simple to capture underlying patterns

  • Low training and test accuracy
  • High bias, low variance
  • Fails to learn patterns
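
The gap between training and test error is easy to see numerically. Below is a small sketch, assuming NumPy, that fits polynomials of different degrees to noisy synthetic data; the dataset and degrees are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)  # noise-free target, used to judge generalization

for degree in (1, 4, 14):            # too simple, reasonable, too complex
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")

# Degree 1 underfits (both errors high); degree 14 overfits
# (training error near zero, test error much larger).
```
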
Error Metrics

Training Error

E_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)

Error on training data

Generalization Error

E_{\text{gen}} = \mathbb{E}_{(x,y) \sim \mathcal{D}} [L(y, f(x))]

Expected error on unseen data
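
A minimal sketch of the training-error formula with a 0-1 loss (the predictions below are made up for illustration). The generalization error is the same average taken over the unknown data distribution, so in practice it is estimated on held-out data:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1])   # labels y_i
y_pred = np.array([1, 0, 0, 1, 0, 1])   # model outputs y_hat_i

zero_one_loss = (y_true != y_pred).astype(float)  # L(y_i, y_hat_i)
E_train = zero_one_loss.mean()                    # (1/n) * sum of losses
print(E_train)                                    # 1/6 ≈ 0.167
```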

Bias-Variance Tradeoff
\text{Error} = \text{Bias}^2 + \text{Variance} + \text{Noise}

Bias

Error from wrong assumptions (underfitting)

Variance

Error from sensitivity to training data (overfitting)

Noise

Irreducible error in data

Cross-Validation Methods

Hold-Out Method

Split data into training (70-80%) and test (20-30%) sets

✓ Pros: Simple and fast
✗ Cons: High variance, less training data

K-Fold Cross-Validation

Split data into k folds, train k times using k-1 folds for training and 1 for validation

E_{\text{CV}} = \frac{1}{k} \sum_{i=1}^{k} E_i
✓ Pros: Lower variance, uses all data
✗ Cons: Computationally expensive

Leave-One-Out (LOO)

Special case of k-fold where k = n (number of samples)

✓ Pros: Maximum training data, no randomness
✗ Cons: Very expensive for large datasets
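
The hold-out and k-fold procedures can be sketched as follows, assuming scikit-learn; the synthetic dataset and the logistic-regression model are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000)

# Hold-out: a single 75/25 split - fast, but the estimate has higher variance.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold CV: average the score over 5 train/validate rounds, using all the data.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=kfold)

print(f"hold-out accuracy: {holdout_acc:.3f}")
print(f"5-fold accuracies: {np.round(cv_scores, 3)}  mean = {cv_scores.mean():.3f}")
```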

Classification Metrics

Confusion Matrix
                     Predicted Positive      Predicted Negative
Actual Positive      TP (True Positive)      FN (False Negative)
Actual Negative      FP (False Positive)     TN (True Negative)

Accuracy

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

Overall correctness

Precision

\text{Precision} = \frac{TP}{TP + FP}

Accuracy of positive predictions

Recall (Sensitivity)

\text{Recall} = \frac{TP}{TP + FN}

Coverage of actual positives

F1 Score

F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

Harmonic mean of precision and recall
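
All four metrics follow directly from the confusion-matrix counts, as in this short sketch (the counts are made up for illustration):

```python
TP, FP, TN, FN = 30, 10, 50, 10   # hypothetical confusion-matrix counts

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  "
      f"recall={recall:.2f}  f1={f1:.2f}")
# accuracy=0.80  precision=0.75  recall=0.75  f1=0.75
```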

Example: Housing Data Classification

Task: Predict House Value Category
Classify each house as High-Value (price > $500k) or Standard-Value (price ≤ $500k)

Sample Dataset (10 houses)

Sqft   Bedrooms   Location   Year   Price      Category
2800   4          9.5        2018   $695,000   High
2400   4          9.0        2015   $575,000   High
2100   3          8.0        2012   $485,000   Standard
1800   3          8.5        2010   $420,000   Standard
1650   3          7.5        2008   $385,000   Standard
1500   3          7.0        2010   $350,000   Standard
1350   2          6.5        2005   $315,000   Standard
1200   2          7.0        2005   $285,000   Standard
950    2          5.5        1998   $225,000   Standard
2200   3          8.0        2013   $520,000   High

Features (Input)

  • Square footage (continuous)
  • Number of bedrooms (discrete)
  • Location score 1-10 (ordinal)
  • Year built (continuous)

Dimensionality: 4 features

Label (Output)

  • High-Value: Price > $500k
  • Standard-Value: Price ≤ $500k

Task Type: Binary Classification

Class Balance: 3 High, 7 Standard

Learning Goal

Train a classifier to learn the pattern: larger, newer houses in better locations tend to be high-value. The model should generalize to predict categories for new houses it hasn't seen before.
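
A sketch of this task in code, assuming scikit-learn; the decision tree is just one reasonable classifier for a dataset this small, and the "new house" is hypothetical:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Columns: sqft, bedrooms, location score, year built (the 10 houses from the table)
X = np.array([
    [2800, 4, 9.5, 2018], [2400, 4, 9.0, 2015], [2100, 3, 8.0, 2012],
    [1800, 3, 8.5, 2010], [1650, 3, 7.5, 2008], [1500, 3, 7.0, 2010],
    [1350, 2, 6.5, 2005], [1200, 2, 7.0, 2005], [ 950, 2, 5.5, 1998],
    [2200, 3, 8.0, 2013],
])
y = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 1])  # 1 = High-Value (> $500k), 0 = Standard

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
new_house = [[2500, 4, 9.0, 2016]]   # an unseen, hypothetical house
print(clf.predict(new_house))        # likely [1] (High-Value), given the sqft pattern
```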

No Free Lunch Theorem

The No Free Lunch (NFL) theorem states that, averaged over all possible problems, every learning algorithm has the same expected performance. This means:

  • No single algorithm is universally best for all problems
  • Algorithm performance depends on problem structure
  • Domain knowledge and inductive bias are crucial
  • Must match algorithm assumptions to problem characteristics

Practical Implication

Always try multiple algorithms and validation methods. What works for one problem may fail for another. Success requires understanding both your data and your algorithms.
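
One way to follow this advice in practice, assuming scikit-learn, is to compare several models under the same cross-validation protocol; the dataset and the three models here are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("decision tree", DecisionTreeClassifier(random_state=0)),
    ("k-nearest neighbors", KNeighborsClassifier()),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")

# Which model "wins" depends on this particular dataset; on a different
# problem the ranking may flip, which is exactly the NFL point.
```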

Bias-Variance Decomposition

Mathematical Decomposition of Expected Error

Setup and Notation

Let f be the true function, and let h be a learned model from training set D.

  • True relationship: y = f(x) + \epsilon, where the noise \epsilon has zero mean
  • Prediction from model h: \hat{y} = h(x)
  • Expectation over all possible training sets: \mathbb{E}_D[h(x)]
  • Noise variance: \sigma^2 = \mathbb{E}[\epsilon^2]

Step 1: Expected Squared Error

The expected prediction error at point x over all possible training sets:

\mathbb{E}_D[(h(x) - y)^2]

Step 2: Substitute True Function

Since y = f(x) + ε:

\mathbb{E}_D[(h(x) - f(x) - \epsilon)^2]

Step 3: Add and Subtract Average Prediction

Add and subtract the expected prediction:

\mathbb{E}_D[(h(x) - \mathbb{E}_D[h(x)] + \mathbb{E}_D[h(x)] - f(x) - \epsilon)^2]

Step 4: Expand the Square

Let A = h(x) - \mathbb{E}_D[h(x)] and B = \mathbb{E}_D[h(x)] - f(x):

\mathbb{E}_D[(A + B - \epsilon)^2] = \mathbb{E}_D[A^2] + \mathbb{E}_D[B^2] + \mathbb{E}_D[\epsilon^2] + \text{cross terms}

The cross terms vanish because \mathbb{E}_D[A] = 0, \mathbb{E}[\epsilon] = 0, \epsilon is independent of the training set D, and B is deterministic given x.

Step 5: Final Decomposition

\mathbb{E}_D[(h(x) - y)^2] = \underbrace{(\mathbb{E}_D[h(x)] - f(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(h(x) - \mathbb{E}_D[h(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Error}}

Bias²

Squared difference between average prediction and true function. High bias = underfitting.

Reduced by: More complex models

Variance

Expected squared deviation from average prediction. High variance = overfitting.

Reduced by: Regularization, more data

Irreducible Error

Noise in data. Cannot be reduced by any model.

Inherent in the problem
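
The decomposition can be checked empirically with a Monte Carlo sketch: repeatedly draw training sets, refit a model, and measure bias² and variance at a single query point. The true function, noise level, and polynomial model below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # true function f
sigma = 0.3                           # noise std, so irreducible error = sigma^2
x0, n, trials, degree = 0.5, 30, 500, 3

preds = np.empty(trials)
for t in range(trials):                # each trial draws a fresh training set D
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coeffs = np.polyfit(x, y, degree)  # h learned from D
    preds[t] = np.polyval(coeffs, x0)  # h(x0)

bias_sq  = (preds.mean() - f(x0)) ** 2  # (E_D[h(x0)] - f(x0))^2
variance = preds.var()                  # E_D[(h(x0) - E_D[h(x0)])^2]
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
# Their sum approximates the expected squared error between h(x0) and a noisy y.
```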

PAC Learning Framework

Probably Approximately Correct Learning

Formal Definition

A concept class C is PAC-learnable if there exists an algorithm A such that:

\forall f \in C, \forall \delta > 0, \forall \epsilon > 0, \exists m_0(\epsilon, \delta):

If the algorithm receives m ≥ m₀ training samples, the learned hypothesis h satisfies:

P(\text{error}(h) \leq \epsilon) \geq 1 - \delta

Key Parameters

  • ε (epsilon): Accuracy parameter - maximum acceptable error
  • δ (delta): Confidence parameter - probability of failure
  • m: Number of training examples needed
  • h: Hypothesis (learned model)

Sample Complexity

For finite hypothesis class with |H| hypotheses:

m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)

Logarithmic in hypothesis space size

VC Dimension Bound

For hypothesis class with VC dimension d:

m \geq O\left(\frac{1}{\epsilon}\left(d + \ln\frac{1}{\delta}\right)\right)

More complex hypothesis spaces (higher d) require more samples to achieve same (ε, δ) guarantee. This formalizes the intuition that complex models need more data.
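
A quick sketch of the finite-hypothesis-class bound from above; the values of |H|, ε, and δ are arbitrary illustrations:

```python
import math

def pac_sample_bound(H_size: int, epsilon: float, delta: float) -> int:
    """Samples needed: m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / epsilon)

# Example: one million hypotheses, 5% error tolerance, 95% confidence.
print(pac_sample_bound(H_size=10**6, epsilon=0.05, delta=0.05))  # 337 samples
```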

Confusion Matrix Metrics Derivation

From Confusion Matrix to Performance Metrics

Confusion Matrix Foundation

                     Predicted Positive     Predicted Negative
Actual Positive      TP                     FN
Actual Negative      FP                     TN

Metric Derivations

Precision (Positive Predictive Value)

\text{Precision} = \frac{TP}{TP + FP} = \frac{\text{True Positives}}{\text{All Predicted Positives}}

Of all positive predictions, what fraction is correct? High precision = few false alarms.

Recall (Sensitivity, True Positive Rate)

\text{Recall} = \frac{TP}{TP + FN} = \frac{\text{True Positives}}{\text{All Actual Positives}}

Of all actual positives, what fraction did we catch? High recall = few missed positives.

F1-Score (Harmonic Mean)

F_1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}

Derivation: Harmonic mean gives equal weight to precision and recall.

\frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P + R}

Specificity (True Negative Rate)

\text{Specificity} = \frac{TN}{TN + FP} = \frac{\text{True Negatives}}{\text{All Actual Negatives}}

Of all actual negatives, what fraction did we correctly identify?

Precision-Recall Tradeoff

By adjusting decision threshold t:

  • Lower threshold → more positives → higher recall, lower precision
  • Higher threshold → fewer positives → lower recall, higher precision

This fundamental tradeoff arises because TP + FN is fixed (the total number of actual positives), while TP + FP (the number of predicted positives) changes with the threshold.
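
A small sketch of this tradeoff, sweeping the threshold over made-up scores (pure NumPy; the labels and scores are illustrative):

```python
import numpy as np

y_true  = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.95, 0.80, 0.60, 0.40, 0.45, 0.35, 0.30, 0.20, 0.10, 0.05])

for t in (0.3, 0.5, 0.7):
    y_pred = (y_score >= t).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"threshold={t:.1f}  precision={precision:.2f}  recall={recall:.2f}")

# Lowering the threshold catches more actual positives (recall rises)
# but admits more false positives (precision falls).
```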

ROC Curve & AUC Analysis

Receiver Operating Characteristic Mathematical Foundation

ROC Curve Definition

ROC curve plots TPR vs FPR at different classification thresholds:

True Positive Rate (TPR):

TPR = \frac{TP}{TP + FN} = \text{Recall}

False Positive Rate (FPR):

FPR = \frac{FP}{FP + TN} = 1 - \text{Specificity}

AUC (Area Under Curve) Interpretation

AUC has a probabilistic interpretation:

AUC = P(\text{score}(x_+) > \text{score}(x_-) \mid x_+ \in \text{positive}, x_- \in \text{negative})

Meaning: Probability that a randomly chosen positive example has a higher predicted score than a randomly chosen negative example.

Computing AUC: Trapezoidal Rule

For n threshold points (TPR_i, FPR_i):

AUC \approx \sum_{i=1}^{n-1} \frac{TPR_i + TPR_{i+1}}{2} \cdot (FPR_{i+1} - FPR_i)

Sum of trapezoid areas between consecutive points

AUC = 0.5

Random classifier (diagonal line)

AUC = 0.7-0.8

Acceptable model

AUC = 0.9-1.0

Excellent model
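
The two views of AUC (pairwise probability and trapezoidal area) can be computed side by side with plain NumPy; the toy labels and scores below are illustrative:

```python
import numpy as np

y_true  = np.array([1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1])

# 1) Probabilistic view: P(score(x+) > score(x-)) over all pos/neg pairs (ties = 1/2).
pos, neg = y_score[y_true == 1], y_score[y_true == 0]
diffs = pos[:, None] - neg[None, :]
auc_prob = np.mean(diffs > 0) + 0.5 * np.mean(diffs == 0)

# 2) Geometric view: sweep thresholds, collect (FPR, TPR) points, sum trapezoid areas.
tpr, fpr = [0.0], [0.0]
for t in np.sort(np.unique(y_score))[::-1]:
    y_pred = y_score >= t
    tpr.append(np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0))
tpr, fpr = np.array(tpr), np.array(fpr)
auc_trap = np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

print(f"pairwise AUC = {auc_prob:.3f}, trapezoidal AUC = {auc_trap:.3f}")  # both 11/12
```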

Cross-Validation Theory

K-Fold Cross-Validation Mathematical Analysis

Estimation of Generalization Error

K-fold CV estimates true error by averaging errors over K folds:

\hat{E}_{CV} = \frac{1}{K}\sum_{i=1}^{K} E_i

where E_i is the error on fold i when the model is trained on the other K-1 folds.

Variance of CV Estimate

\text{Var}(\hat{E}_{CV}) = \frac{1}{K^2}\sum_{i=1}^{K} \text{Var}(E_i) + \frac{2}{K^2}\sum_{i<j} \text{Cov}(E_i, E_j)

Key insight: The K error estimates are not independent (any two of the training sets share n(K-2)/K samples), so the covariance terms are positive. This increases variance compared to averaging truly independent estimates.

Bias-Variance Tradeoff in Fold Choice

Large K (e.g., K=n, LOOCV)

  • Training set size: n-1 (almost full)
  • Low bias in error estimate
  • High variance (highly correlated folds)
  • Computationally expensive

Small K (e.g., K=5 or 10)

  • Training set size: n(K-1)/K
  • Slight upward bias in error
  • Lower variance
  • Computationally efficient

Standard Choice: K=10

Empirically, K=10 provides a good bias-variance balance; K=5 is common for large datasets, and K=n (LOOCV) for very small ones.

Stratified K-Fold Justification

Problem: Random splits may have imbalanced class distributions across folds.

Solution: Maintain class proportions in each fold.

Mathematical effect: Reduces variance of CV estimate by ensuring each fold is representative. Especially important for imbalanced datasets.
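
A sketch of the effect, assuming scikit-learn; the roughly 10%-positive synthetic dataset is an illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced dataset: roughly 10% positive class.
X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)

for name, splitter in [
    ("plain KFold", KFold(n_splits=5, shuffle=True, random_state=0)),
    ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=0)),
]:
    fold_pos_rates = [y[val_idx].mean() for _, val_idx in splitter.split(X, y)]
    print(f"{name:16s} positive rate per fold: {np.round(fold_pos_rates, 3)}")

# Stratification keeps each fold's class proportion close to the overall ~10%,
# making each fold representative and lowering the variance of the CV estimate.
```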

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. Which of the following is an example of supervised learning?
2. What is the main difference between classification and regression?
3. In the housing price dataset with features (sqft, bedrooms, bathrooms, year) → price, what is the sample dimensionality?
4. What does overfitting mean in machine learning?
5. In k-fold cross-validation with k=5, how many times is the model trained?
6. For a binary classifier with TP=80, FP=20, TN=70, FN=30, what is the precision?
7. What does the No Free Lunch (NFL) theorem state?
8. In the bias-variance tradeoff, high bias typically indicates:
9. Which metric is most appropriate for highly imbalanced datasets (e.g., 1% positive class)?
10. What is the purpose of a validation set?