A company has collected 1,000 emails: 600 legitimate and 400 spam.
Task: Build a classifier to predict whether a new email is spam.
Each email has the following features (sample rows shown):
| Email ID | Length | Uppercase % | Keywords | In Contacts | Label |
|---|---|---|---|---|---|
| 1 | 450 | 5% | 0 | Yes | Legitimate |
| 2 | 2500 | 35% | 8 | No | Spam |
| 3 | 800 | 10% | 1 | Yes | Legitimate |
| 4 | 3200 | 45% | 12 | No | Spam |
| 5 | 650 | 8% | 2 | No | Legitimate |
Step 1: Data Splitting (70-30 Split)
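A minimal sketch of the 70-30 split, assuming it is stratified by label. That assumption matches the test-set confusion matrix, which contains 180 legitimate and 120 spam emails (the same 60/40 ratio as the full dataset). The function name `stratified_split` and the seed are illustrative, not from the original exercise.

```python
import random

def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)
    # Group sample indices by their class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test

# 600 legitimate and 400 spam emails, as in the exercise.
labels = ["legitimate"] * 600 + ["spam"] * 400
train_idx, test_idx = stratified_split(labels)
test_spam = sum(labels[i] == "spam" for i in test_idx)
print(len(train_idx), len(test_idx), test_spam)  # 700 300 120
```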
Step 2: Model Selection - Logistic Regression
Decision function:
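Logistic regression conventionally scores an input as the sigmoid of a weighted feature sum, P(spam | x) = 1 / (1 + e^(-(w·x + b))), and predicts spam when that probability reaches a threshold. The sketch below applies this to two rows of the table. The weights, bias, and 0.5 threshold are hypothetical values chosen only for illustration; they are not the trained model from this exercise.

```python
import math

# HYPOTHETICAL weights and bias for illustration only (not from the source).
WEIGHTS = {"length": 0.001, "uppercase_pct": 0.05, "keywords": 0.4, "in_contacts": -2.0}
BIAS = -3.0

def spam_probability(email):
    """Sigmoid of the linear score z = w·x + b."""
    z = BIAS + sum(WEIGHTS[name] * email[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def predict(email, threshold=0.5):
    """Label an email as Spam when its probability reaches the threshold."""
    return "Spam" if spam_probability(email) >= threshold else "Legitimate"

# Emails 1 and 2 from the table (In Contacts encoded as 1 = Yes, 0 = No).
email_1 = {"length": 450, "uppercase_pct": 5, "keywords": 0, "in_contacts": 1}
email_2 = {"length": 2500, "uppercase_pct": 35, "keywords": 8, "in_contacts": 0}
print(predict(email_1), predict(email_2))  # Legitimate Spam
```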
Step 3: Confusion Matrix on Test Set
| | Predicted: Legitimate | Predicted: Spam |
|---|---|---|
| Actual: Legitimate | 170 (TN) | 10 (FP) |
| Actual: Spam | 15 (FN) | 105 (TP) |
Step 4: Calculate Metrics
Accuracy = (TP + TN) / Total = (105 + 170) / 300 = 91.67%
Precision = TP / (TP + FP) = 105 / 115 = 91.30%
Recall = TP / (TP + FN) = 105 / 120 = 87.50%
F1-Score = 2 × (Precision × Recall) / (Precision + Recall) = 89.36%
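The four metrics can be checked directly from the confusion-matrix counts. The following sketch uses only the counts given in Step 3; the function name `classification_metrics` is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Counts from the test-set confusion matrix (spam is the positive class).
metrics = classification_metrics(tp=105, tn=170, fp=10, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
# accuracy: 91.67%
# precision: 91.30%
# recall: 87.50%
# f1: 89.36%
```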
Predict house prices using the following features:
| House | Area (m²) | Bedrooms | Floor | Age (years) | Price ($1000s) |
|---|---|---|---|---|---|
| 1 | 80 | 2 | 5 | 10 | 300 |
| 2 | 120 | 3 | 10 | 5 | 480 |
| 3 | 100 | 2 | 15 | 3 | 420 |
| 4 | 150 | 4 | 8 | 2 | 550 |
| 5 | 70 | 1 | 3 | 15 | 250 |
| Test | 90 | 2 | 8 | 7 | ? |
Task: Predict the price of the test house (90m², 2 bedrooms, floor 8, age 7 years)
Step 1: Linear Regression Model
After training on the 5 houses, we get:
Price = 50 + 3.2×Area + 20×Bedrooms + 5×Floor - 8×Age (price in $1000s)
Step 2: Make Prediction
Price = 50 + 3.2×90 + 20×2 + 5×8 - 8×7
Price = 50 + 288 + 40 + 40 - 56
Price = 362 (in $1000s) = $362,000
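The same prediction as a short sketch. The intercept and coefficients are read off the arithmetic in Step 2; the dictionary keys and the function name `predict_price` are illustrative.

```python
# Coefficients taken from the Step 2 arithmetic; prices are in $1000s.
INTERCEPT = 50
COEFFICIENTS = {"area": 3.2, "bedrooms": 20, "floor": 5, "age": -8}

def predict_price(house):
    """Linear model: intercept plus the weighted sum of the features."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)

test_house = {"area": 90, "bedrooms": 2, "floor": 8, "age": 7}
price_thousands = predict_price(test_house)
print(f"${price_thousands * 1000:,.0f}")  # $362,000
```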
Step 3: Model Evaluation (on test set)
Mean Absolute Error (MAE) = Avg(|Actual - Predicted|) = $25,000
Root Mean Squared Error (RMSE) = √(Avg((Actual - Predicted)²)) = $32,000
R² Score = 1 - (SS_res / SS_tot) = 0.87
R² = 0.87 means the model explains 87% of the variance in house prices.
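A sketch of how the three metrics are computed from paired actual and predicted values. The exercise does not list its test-set predictions, so the `predicted` values below are hypothetical and the resulting numbers will not match the $25,000, $32,000, and 0.87 quoted above.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R²) for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return mae, rmse, r2

actual = [300, 480, 420]     # prices of houses 1-3 from the table, in $1000s
predicted = [310, 470, 400]  # HYPOTHETICAL predictions, for illustration only
mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions are far off.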
Step 4: Feature Importance
A model achieves the following results:
Questions:
1. Problem: Overfitting (large gap between training and validation error)
2. Solutions:
3. L2 Regularization effect: Training error will typically increase (the penalty shrinks the weights, so the model becomes simpler), while validation/test error should decrease if the model was overfitting
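To illustrate the training-error half of that answer, the sketch below fits a one-feature ridge regression without an intercept, whose closed-form weight is w = Σxy / (Σx² + λ). The data points and λ values are hypothetical. As λ grows, the weight shrinks and the training error rises; whether validation error falls depends on how strongly the unregularized model was overfitting, which this toy example does not show.

```python
def ridge_weight(xs, ys, lam):
    """Closed-form ridge solution for a single feature with no intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(xs, ys, w):
    """Mean squared error of the prediction w * x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# HYPOTHETICAL training data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

lambdas = [0.0, 1.0, 10.0]
weights = [ridge_weight(xs, ys, lam) for lam in lambdas]
train_errors = [mse(xs, ys, w) for w in weights]
for lam, w, err in zip(lambdas, weights, train_errors):
    print(f"lambda={lam:>4}: w={w:.4f}, training MSE={err:.4f}")
```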
You perform 5-fold cross-validation on 500 samples. The accuracy for each fold is:
Fold 1: 85%, Fold 2: 88%, Fold 3: 82%, Fold 4: 90%, Fold 5: 85%
Calculate:
1. Mean Accuracy:
μ = (85 + 88 + 82 + 90 + 85) / 5 = 430 / 5 = 86%
2. Standard Deviation:
σ = √[Σ(xᵢ - μ)² / n]
= √[((85-86)² + (88-86)² + (82-86)² + (90-86)² + (85-86)²) / 5]
= √[(1 + 4 + 16 + 16 + 1) / 5]
= √[38 / 5] = √7.6 ≈ 2.76%
3. Data Split:
Training set: 500 × 4/5 = 400 samples
Validation set: 500 × 1/5 = 100 samples
Final Result: 86% ± 2.76% (mean ± standard deviation across folds)
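The calculation above can be reproduced with Python's standard `statistics` module. The sketch uses `pstdev`, the population standard deviation (dividing by n), to match the formula in step 2; the variable names are illustrative.

```python
import statistics

fold_accuracies = [85, 88, 82, 90, 85]  # percent, one value per fold

mean_accuracy = statistics.fmean(fold_accuracies)
std_population = statistics.pstdev(fold_accuracies)  # divides by n

# In k-fold cross-validation, one fold validates and the other k-1 folds train.
n_samples, k = 500, 5
validation_size = n_samples // k
training_size = n_samples - validation_size

print(f"{mean_accuracy:.0f}% ± {std_population:.2f}%")  # 86% ± 2.76%
print(training_size, validation_size)                   # 400 100
```

If the five folds are treated as a sample rather than the whole population, `statistics.stdev` (dividing by n - 1) gives about 3.08% instead.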
Classify each scenario as Supervised, Unsupervised, or Reinforcement Learning: