Introduction to Machine Learning

Comprehensive Examples & Practice Problems


📧 Example 1: Email Spam Classification

Problem Statement

A company has collected 1,000 emails: 600 legitimate and 400 spam. Each email has the following features:

  • Length: Number of characters (100-5000)
  • Uppercase ratio: Percentage of capital letters (0-100%)
  • Keyword count: Occurrences of words like "free", "winner" (0-20)
  • In contacts: Is sender in contact list? (Yes/No)

Task: Build a classifier to predict if a new email is spam.

Sample Training Data

| Email ID | Length | Uppercase % | Keywords | In Contacts | Label |
|---|---|---|---|---|---|
| 1 | 450 | 5% | 0 | Yes | Legitimate |
| 2 | 2500 | 35% | 8 | No | Spam |
| 3 | 800 | 10% | 1 | Yes | Legitimate |
| 4 | 3200 | 45% | 12 | No | Spam |
| 5 | 650 | 8% | 2 | No | Legitimate |

Step-by-Step Solution

Step 1: Data Splitting (70-30 Split)

  • Training set: 700 emails (420 legitimate, 280 spam)
  • Test set: 300 emails (180 legitimate, 120 spam)
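The stratified 70/30 split above (70% of each class into training) can be sketched with only the standard library; the email tuples below are stand-ins for real feature vectors, and production code would typically use scikit-learn's `train_test_split(..., stratify=labels)` instead:

```python
# Sketch of a stratified 70/30 split using only the standard library.
import random

random.seed(0)
emails = [("legitimate", i) for i in range(600)] + [("spam", i) for i in range(400)]

train, test = [], []
for label in ("legitimate", "spam"):
    group = [e for e in emails if e[0] == label]
    random.shuffle(group)
    cut = len(group) * 7 // 10              # 70% of each class into training
    train += group[:cut]
    test += group[cut:]

print(len(train), len(test))                # 700 300
print(sum(e[0] == "spam" for e in train))   # 280 spam emails in training
```

Splitting each class separately is what keeps the 60/40 class ratio identical in both sets.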

Step 2: Model Selection - Logistic Regression

Decision function:

P(Spam|x) = 1 / (1 + e^(-z))
where z = w₀ + w₁·length + w₂·uppercase + w₃·keywords + w₄·contacts
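The decision function can be evaluated directly once weights are known. The weights below are hypothetical values chosen for illustration, not the result of actual training:

```python
import math

def spam_probability(length, uppercase, keywords, in_contacts, w):
    # z = w0 + w1*length + w2*uppercase + w3*keywords + w4*contacts
    z = (w[0] + w[1] * length + w[2] * uppercase
         + w[3] * keywords + w[4] * in_contacts)
    return 1 / (1 + math.exp(-z))       # sigmoid squashes z into (0, 1)

# Hypothetical weights for illustration only -- not fitted values:
w = [-4.0, 0.0005, 0.05, 0.4, -2.0]
p = spam_probability(2500, 35, 8, 0, w)   # email 2 from the sample table
print(round(p, 3))                        # well above 0.5 -> predicted spam
```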

Step 3: Confusion Matrix on Test Set

| | Predicted: Legitimate | Predicted: Spam |
|---|---|---|
| Actual: Legitimate | 170 (TN) | 10 (FP) |
| Actual: Spam | 15 (FN) | 105 (TP) |

Step 4: Calculate Metrics

Accuracy = (TP + TN) / Total = (105 + 170) / 300 = 91.67%

Precision = TP / (TP + FP) = 105 / 115 = 91.30%

Recall = TP / (TP + FN) = 105 / 120 = 87.50%

F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 89.36%
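All four metrics follow mechanically from the confusion-matrix counts, which a few lines of Python can verify:

```python
# Metric calculations straight from the confusion-matrix counts.
TP, TN, FP, FN = 105, 170, 10, 15

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"{accuracy:.2%} {precision:.2%} {recall:.2%} {f1:.2%}")
# 91.67% 91.30% 87.50% 89.36%
```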

💡 Key Takeaways

  • This is a supervised learning binary classification problem
  • Accuracy alone can be misleading with imbalanced data
  • Precision measures "Of all predicted spam, how many were actually spam?"
  • Recall measures "Of all actual spam, how many did we catch?"
  • F1-Score balances precision and recall (harmonic mean)

🏠 Example 2: House Price Prediction (Regression)

Problem Statement

Predict house prices using the following features:

| House | Area (m²) | Bedrooms | Floor | Age (years) | Price ($1000s) |
|---|---|---|---|---|---|
| 1 | 80 | 2 | 5 | 10 | 300 |
| 2 | 120 | 3 | 10 | 5 | 480 |
| 3 | 100 | 2 | 15 | 3 | 420 |
| 4 | 150 | 4 | 8 | 2 | 550 |
| 5 | 70 | 1 | 3 | 15 | 250 |
| Test | 90 | 2 | 8 | 7 | ? |

Task: Predict the price of the test house (90m², 2 bedrooms, floor 8, age 7 years)

Step-by-Step Solution

Step 1: Linear Regression Model

Price = w₀ + w₁·Area + w₂·Bedrooms + w₃·Floor + w₄·Age

After training on the 5 houses, we get:

Price = 50 + 3.2·Area + 20·Bedrooms + 5·Floor - 8·Age

Step 2: Make Prediction

Price = 50 + 3.2×90 + 20×2 + 5×8 - 8×7

Price = 50 + 288 + 40 + 40 - 56

Price = 362 (in $1000s) = $362,000
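The arithmetic above can be checked with a small helper that plugs the fitted coefficients into the model:

```python
def predict_price(area, bedrooms, floor, age):
    # Coefficients from Step 1; price is in $1000s
    return 50 + 3.2 * area + 20 * bedrooms + 5 * floor - 8 * age

print(round(predict_price(90, 2, 8, 7), 1))   # 362.0, i.e. $362,000
```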

Step 3: Model Evaluation (on test set)

Mean Absolute Error (MAE) = Avg|Actual - Predicted| = $25,000

Root Mean Squared Error (RMSE) = √(Avg(Actual - Predicted)²) = $32,000

R² Score = 1 - (SS_res / SS_tot) = 0.87

R² = 0.87 means the model explains 87% of the variance in house prices.
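All three metrics can be computed from scratch. The actual/predicted pairs below are made-up numbers purely to demonstrate the formulas; they are not the test-set data behind the $25,000 / $32,000 / 0.87 figures above:

```python
import math

# Made-up actual vs. predicted prices ($1000s), for illustration only.
actual    = [300, 480, 420, 550, 250]
predicted = [310, 465, 430, 540, 265]

n = len(actual)
errors = [a - p for a, p in zip(actual, predicted)]

mae  = sum(abs(e) for e in errors) / n                  # mean absolute error
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)       # root mean squared error

mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)                    # residual sum of squares
ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 2), round(r2, 4))
```

Note that RMSE ≥ MAE always holds, because squaring weights large errors more heavily.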

Step 4: Feature Importance

  • Area: Coefficient = 3.2 → Each m² adds $3,200
  • Bedrooms: Coefficient = 20 → Each bedroom adds $20,000
  • Age: Coefficient = -8 → Each year reduces $8,000
  • Floor: Coefficient = 5 → Each floor adds $5,000

💡 Key Takeaways

  • Regression predicts continuous values (vs. discrete categories in classification)
  • Linear regression assumes linear relationship between features and target
  • Coefficients show feature importance and direction of influence
  • R² score indicates how well the model fits the data (0-1 scale)
  • Feature engineering (e.g., Area², Age×Floor) can improve accuracy

✏️ Practice Problems

Problem 1: Bias-Variance Tradeoff

A model achieves the following results:

  • Training error: 2%
  • Validation error: 15%
  • Test error: 16%

Questions:

  1. Is this overfitting or underfitting?
  2. What are 3 possible solutions?
  3. If you add L2 regularization, which error(s) will likely increase?
Solution:

1. Problem: Overfitting (large gap between training and validation error)

2. Solutions:

  • Regularization: Add L1/L2 penalty to prevent large weights
  • More data: Collect more training examples
  • Reduce complexity: Use fewer features or simpler model
  • Dropout: For neural networks, randomly drop neurons
  • Early stopping: Stop training when validation error increases

3. L2 Regularization effect: Training error will increase (model becomes simpler), validation/test error should decrease
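The claimed effect of L2 regularization can be observed on a tiny one-feature example. The data below is made up, and the brute-force search is only a sketch; the point is that a larger penalty λ shrinks the weight and raises training error:

```python
# Sketch: effect of the L2 penalty on a one-feature linear fit (made-up data).
X = [1, 2, 3, 4, 5]
y = [2.1, 4.2, 5.9, 8.1, 9.8]   # roughly y = 2x

def ridge_loss(w1, lam):
    # MSE plus L2 penalty lam * w1^2 (no intercept, for simplicity)
    mse = sum((w1 * x - yi) ** 2 for x, yi in zip(X, y)) / len(X)
    return mse + lam * w1 ** 2

def best_w1(lam):
    # Brute-force 1-D search over [0, 3] in steps of 0.001
    candidates = [i / 1000 for i in range(3001)]
    return min(candidates, key=lambda w: ridge_loss(w, lam))

for lam in (0.0, 1.0, 10.0):
    w1 = best_w1(lam)
    train_mse = sum((w1 * x - yi) ** 2 for x, yi in zip(X, y)) / len(X)
    print(f"lambda={lam}: w1={w1:.3f}, training MSE={train_mse:.3f}")
```

As λ grows, the fitted weight moves away from the least-squares value toward zero, so training MSE necessarily increases; the hope is that the simpler model generalizes better.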

Problem 2: Cross-Validation Calculation

You perform 5-fold cross-validation on 500 samples. The accuracy for each fold is:

Fold 1: 85%, Fold 2: 88%, Fold 3: 82%, Fold 4: 90%, Fold 5: 85%

Calculate:

  1. Mean accuracy
  2. Standard deviation
  3. How many samples in each training/validation split?
Solution:

1. Mean Accuracy:

μ = (85 + 88 + 82 + 90 + 85) / 5 = 430 / 5 = 86%

2. Standard Deviation:

σ = √[Σ(xᵢ - μ)² / n]

= √[((85-86)² + (88-86)² + (82-86)² + (90-86)² + (85-86)²) / 5]

= √[(1 + 4 + 16 + 16 + 1) / 5]

= √[38 / 5] = √7.6 ≈ 2.76%

3. Data Split:

Training set: 500 × 4/5 = 400 samples

Validation set: 500 × 1/5 = 100 samples

Final Result: 86% ± 2.76%
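The mean and standard deviation can be verified in a few lines; note the division by n (population standard deviation), matching the worked solution:

```python
import math

scores = [85, 88, 82, 90, 85]           # per-fold accuracies in percent

mean = sum(scores) / len(scores)
# Population standard deviation (divide by n), as in the solution above:
std = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))

print(f"{mean}% ± {round(std, 2)}%")    # 86.0% ± 2.76%
```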

Problem 3: Learning Type Classification

Classify each scenario as Supervised, Unsupervised, or Reinforcement Learning:

  1. Predicting stock prices using historical data with known prices
  2. Grouping customers into segments without predefined categories
  3. Training a robot to navigate a maze using rewards
  4. Detecting fraudulent transactions (labeled as fraud/legitimate)
  5. Discovering topics in a collection of documents
  6. Playing chess by learning from wins/losses
Solution:
  1. Supervised Learning - Regression with labeled prices
  2. Unsupervised Learning - Clustering without labels
  3. Reinforcement Learning - Learning through rewards/penalties
  4. Supervised Learning - Binary classification with labels
  5. Unsupervised Learning - Topic modeling (e.g., LDA)
  6. Reinforcement Learning - Learning optimal strategy through gameplay