Introduction to Machine Learning

Comprehensive Examples & Practice Problems


📧 Example 1: Email Spam Classification

Problem Statement

A company has collected 1,000 emails: 600 legitimate and 400 spam. Each email has the following features:

  • Length: Number of characters (100-5000)
  • Uppercase ratio: Percentage of capital letters (0-100%)
  • Keyword count: Occurrences of words like "free", "winner" (0-20)
  • In contacts: Is sender in contact list? (Yes/No)

Task: Build a classifier to predict if a new email is spam.

Sample Training Data

| Email ID | Length | Uppercase % | Keywords | In Contacts | Label |
|---|---|---|---|---|---|
| 1 | 450 | 5% | 0 | Yes | Legitimate |
| 2 | 2500 | 35% | 8 | No | Spam |
| 3 | 800 | 10% | 1 | Yes | Legitimate |
| 4 | 3200 | 45% | 12 | No | Spam |
| 5 | 650 | 8% | 2 | No | Legitimate |

Step-by-Step Solution

Step 1: Data Splitting (70-30 Split)

  • Training set: 700 emails (420 legitimate, 280 spam)
  • Test set: 300 emails (180 legitimate, 120 spam)
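The stratified 70/30 split above (70% of each class into training) can be sketched with only the standard library; the email tuples below are stand-ins for real feature vectors, and production code would typically use scikit-learn's `train_test_split(..., stratify=labels)` instead:

```python
# Sketch of a stratified 70/30 split using only the standard library.
import random

random.seed(0)
emails = [("legitimate", i) for i in range(600)] + [("spam", i) for i in range(400)]

train, test = [], []
for label in ("legitimate", "spam"):
    group = [e for e in emails if e[0] == label]
    random.shuffle(group)
    cut = len(group) * 7 // 10              # 70% of each class into training
    train += group[:cut]
    test += group[cut:]

print(len(train), len(test))                # 700 300
print(sum(e[0] == "spam" for e in train))   # 280 spam emails in training
```

Splitting each class separately is what keeps the 60/40 class ratio identical in both sets.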

Step 2: Model Selection - Logistic Regression

Decision function:

P(Spam|x) = 1 / (1 + e^(-z))
where z = w₀ + w₁·length + w₂·uppercase + w₃·keywords + w₄·contacts
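The decision function can be evaluated directly once weights are known. The weights below are hypothetical values chosen for illustration, not the result of actual training:

```python
import math

def spam_probability(length, uppercase, keywords, in_contacts, w):
    # z = w0 + w1*length + w2*uppercase + w3*keywords + w4*contacts
    z = (w[0] + w[1] * length + w[2] * uppercase
         + w[3] * keywords + w[4] * in_contacts)
    return 1 / (1 + math.exp(-z))       # sigmoid squashes z into (0, 1)

# Hypothetical weights for illustration only -- not fitted values:
w = [-4.0, 0.0005, 0.05, 0.4, -2.0]
p = spam_probability(2500, 35, 8, 0, w)   # email 2 from the sample table
print(round(p, 3))                        # well above 0.5 -> predicted spam
```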

Step 3: Confusion Matrix on Test Set

| | Predicted: Legitimate | Predicted: Spam |
|---|---|---|
| Actual: Legitimate | 170 (TN) | 10 (FP) |
| Actual: Spam | 15 (FN) | 105 (TP) |

Step 4: Calculate Metrics

Accuracy = (TP + TN) / Total = (105 + 170) / 300 = 91.67%

Precision = TP / (TP + FP) = 105 / 115 = 91.30%

Recall = TP / (TP + FN) = 105 / 120 = 87.50%

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 89.36%
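All four metrics follow mechanically from the confusion-matrix counts, which a few lines of Python can verify:

```python
# Metric calculations straight from the confusion-matrix counts.
TP, TN, FP, FN = 105, 170, 10, 15

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"{accuracy:.2%} {precision:.2%} {recall:.2%} {f1:.2%}")
# 91.67% 91.30% 87.50% 89.36%
```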

💡 Key Takeaways

  • This is a supervised learning binary classification problem
  • Accuracy alone can be misleading with imbalanced data
  • Precision measures "Of all predicted spam, how many were actually spam?"
  • Recall measures "Of all actual spam, how many did we catch?"
  • F1-Score balances precision and recall (harmonic mean)

🏠 Example 2: House Price Prediction (Regression)

Problem Statement

Predict house prices using the following features:

| House | Area (m²) | Bedrooms | Floor | Age (years) | Price ($1000s) |
|---|---|---|---|---|---|
| 1 | 80 | 2 | 5 | 10 | 300 |
| 2 | 120 | 3 | 10 | 5 | 480 |
| 3 | 100 | 2 | 15 | 3 | 420 |
| 4 | 150 | 4 | 8 | 2 | 550 |
| 5 | 70 | 1 | 3 | 15 | 250 |
| Test | 90 | 2 | 8 | 7 | ? |

Task: Predict the price of the test house (90m², 2 bedrooms, floor 8, age 7 years)

Step-by-Step Solution

Step 1: Linear Regression Model

Price = w₀ + w₁·Area + w₂·Bedrooms + w₃·Floor + w₄·Age

After training on the 5 houses, we get:

Price = 50 + 3.2·Area + 20·Bedrooms + 5·Floor - 8·Age

Step 2: Make Prediction

Price = 50 + 3.2×90 + 20×2 + 5×8 - 8×7

Price = 50 + 288 + 40 + 40 - 56

Price = 362 (in $1000s) = $362,000
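The arithmetic above can be checked with a small helper that plugs the fitted coefficients into the model:

```python
def predict_price(area, bedrooms, floor, age):
    # Coefficients from Step 1; price is in $1000s
    return 50 + 3.2 * area + 20 * bedrooms + 5 * floor - 8 * age

print(round(predict_price(90, 2, 8, 7), 1))   # 362.0, i.e. $362,000
```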

Step 3: Model Evaluation (on test set)

Mean Absolute Error (MAE) = Avg|Actual - Predicted| = $25,000

Root Mean Squared Error (RMSE) = √(Avg(Actual - Predicted)²) = $32,000

R² Score = 1 - (SS_res / SS_tot) = 0.87

R² = 0.87 means the model explains 87% of the variance in house prices.
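All three metrics can be computed from scratch. The actual/predicted pairs below are made-up numbers purely to demonstrate the formulas; they are not the test-set data behind the $25,000 / $32,000 / 0.87 figures above:

```python
import math

# Made-up actual vs. predicted prices ($1000s), for illustration only.
actual    = [300, 480, 420, 550, 250]
predicted = [310, 465, 430, 540, 265]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae  = sum(abs(e) for e in errors) / n                  # mean absolute error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)       # root mean squared error

mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)                    # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 2), round(r2, 4))
```

Note that RMSE ≥ MAE always holds, because squaring weights large errors more heavily.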

Step 4: Feature Importance

  • Area: Coefficient = 3.2 → Each m² adds $3,200
  • Bedrooms: Coefficient = 20 → Each bedroom adds $20,000
  • Age: Coefficient = -8 → Each year reduces $8,000
  • Floor: Coefficient = 5 → Each floor adds $5,000

💡 Key Takeaways

  • Regression predicts continuous values (vs. discrete categories in classification)
  • Linear regression assumes linear relationship between features and target
  • Coefficients show feature importance and direction of influence
  • R² score indicates how well the model fits the data (0-1 scale)
  • Feature engineering (e.g., Area², Age×Floor) can improve accuracy

✏️ Practice Problems

Problem 1: Bias-Variance Tradeoff

A model achieves the following results:

  • Training error: 2%
  • Validation error: 15%
  • Test error: 16%

Questions:

  1. Is this overfitting or underfitting?
  2. What are 3 possible solutions?
  3. If you add L2 regularization, which error(s) will likely increase?
Solution:

1. Problem: Overfitting (large gap between training and validation error)

2. Solutions:

  • Regularization: Add L1/L2 penalty to prevent large weights
  • More data: Collect more training examples
  • Reduce complexity: Use fewer features or simpler model
  • Dropout: For neural networks, randomly drop neurons
  • Early stopping: Stop training when validation error increases

3. L2 Regularization effect: Training error will increase (model becomes simpler), validation/test error should decrease
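The claimed effect of L2 regularization can be observed on a tiny one-feature example. The data below is made up, and the brute-force search is only a sketch; the point is that a larger penalty λ shrinks the weight and raises training error:

```python
# Sketch: effect of the L2 penalty on a one-feature linear fit (made-up data).
X = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.8]   # roughly y = 2x

def ridge_loss(w1, lam):
    # MSE plus L2 penalty lam * w1^2 (no intercept, for simplicity)
    mse = sum((w1 * x - yi) ** 2 for x, yi in zip(X, y)) / len(X)
    return mse + lam * w1 ** 2

def best_w1(lam):
    # Brute-force 1-D search over [0, 3] in steps of 0.001
    candidates = [i / 1000 for i in range(3001)]
    return min(candidates, key=lambda w: ridge_loss(w, lam))

for lam in (0.0, 1.0, 10.0):
    w1 = best_w1(lam)
    train_mse = sum((w1 * x - yi) ** 2 for x, yi in zip(X, y)) / len(X)
    print(f"lambda={lam}: w1={w1:.3f}, training MSE={train_mse:.3f}")
```

As λ grows, the fitted weight moves away from the least-squares value toward zero, so training MSE necessarily increases; the hope is that the simpler model generalizes better.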

Problem 2: Cross-Validation Calculation

You perform 5-fold cross-validation on 500 samples. The accuracy for each fold is:

Fold 1: 85%, Fold 2: 88%, Fold 3: 82%, Fold 4: 90%, Fold 5: 85%

Calculate:

  1. Mean accuracy
  2. Standard deviation
  3. How many samples in each training/validation split?
Solution:

1. Mean Accuracy:

μ = (85 + 88 + 82 + 90 + 85) / 5 = 430 / 5 = 86%

2. Standard Deviation:

σ = √[Σ(xᵢ - μ)² / n]

= √[((85-86)² + (88-86)² + (82-86)² + (90-86)² + (85-86)²) / 5]

= √[(1 + 4 + 16 + 16 + 1) / 5]

= √[38 / 5] = √7.6 ≈ 2.76%

3. Data Split:

Training set: 500 × 4/5 = 400 samples

Validation set: 500 × 1/5 = 100 samples

Final Result: 86% ± 2.76%
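The mean and standard deviation can be verified in a few lines; note the division by n (population standard deviation), matching the worked solution:

```python
import math

scores = [85, 88, 82, 90, 85]           # per-fold accuracies in percent

mean = sum(scores) / len(scores)
# Population standard deviation (divide by n), as in the solution above:
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

print(f"{mean}% ± {round(std, 2)}%")    # 86.0% ± 2.76%
```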

Problem 3: Learning Type Classification

Classify each scenario as Supervised, Unsupervised, or Reinforcement Learning:

  1. Predicting stock prices using historical data with known prices
  2. Grouping customers into segments without predefined categories
  3. Training a robot to navigate a maze using rewards
  4. Detecting fraudulent transactions (labeled as fraud/legitimate)
  5. Discovering topics in a collection of documents
  6. Playing chess by learning from wins/losses
Solution:
  1. Supervised Learning - Regression with labeled prices
  2. Unsupervised Learning - Clustering without labels
  3. Reinforcement Learning - Learning through rewards/penalties
  4. Supervised Learning - Binary classification with labels
  5. Unsupervised Learning - Topic modeling (e.g., LDA)
  6. Reinforcement Learning - Learning optimal strategy through gameplay