Master binary and multi-class classification methods including logistic regression, LDA, and handling imbalanced data
Logistic regression models the probability of binary outcomes using the sigmoid function:
σ(z) = 1/(1 + e^(-z)), where z = w^T x
• Output range: (0, 1) - perfect for probabilities
• σ(0) = 0.5, so z = 0 gives the decision boundary
• S-shaped curve: smooth transition between classes
Binary cross-entropy (log loss):
J(w) = -Σ_i [y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i))]
Convex function - guaranteed global minimum
For gradient descent:
∂J/∂w_j = Σ_i (σ(w^T x_i) - y_i) x_ij
Similar form to linear regression!
Decision Threshold
Default threshold is 0.5, but adjust based on costs: use lower threshold if false negatives are more costly (e.g., disease detection).
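To make the threshold discussion concrete, here is a minimal NumPy sketch (not from the original example; the scores and labels are invented) that applies the sigmoid, computes the binary cross-entropy, and compares the default 0.5 cutoff with a lowered one, as in the disease-detection case:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores z = w^T x for five patients and their true labels.
z = np.array([-2.0, -0.5, 0.3, 1.5, -0.1])
y = np.array([0, 0, 1, 1, 1])

p = sigmoid(z)                                                # probabilities in (0, 1)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))       # binary cross-entropy

pred_default = (p >= 0.5).astype(int)   # default threshold: misses the last positive
pred_lowered = (p >= 0.3).astype(int)   # lowered threshold: fewer false negatives, more false positives
print(bce, pred_default, pred_lowered)
```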
LDA models the class-conditional distributions and uses Bayes' theorem for classification. It assumes Gaussian distributions with equal covariance:
p(x | y = k) = N(x; μ_k, Σ)
where features follow a Gaussian distribution with class-specific means μ_k and a shared covariance Σ.
Between-class scatter S_B:
S_B = Σ_k n_k (μ_k - μ)(μ_k - μ)^T
Within-class scatter S_W:
S_W = Σ_k Σ_{i: y_i = k} (x_i - μ_k)(x_i - μ_k)^T
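As an illustration of the scatter matrices above, here is a short NumPy sketch (an assumed implementation, not code from the source) that computes S_W and S_B for a labeled dataset:

```python
import numpy as np

def scatter_matrices(X, y):
    """Compute within-class (S_W) and between-class (S_B) scatter matrices."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for k in np.unique(y):
        X_k = X[y == k]
        mu_k = X_k.mean(axis=0)
        S_W += (X_k - mu_k).T @ (X_k - mu_k)          # spread of each class around its own mean
        diff = (mu_k - overall_mean).reshape(-1, 1)
        S_B += len(X_k) * diff @ diff.T               # spread of class means, weighted by class size
    return S_W, S_B

# Tiny synthetic example: two 2-D Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
S_W, S_B = scatter_matrices(X, y)
```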
One-vs-Rest (OvR)
Train K binary classifiers, each distinguishing one class from all others.
Training:
For class k: positive = class k, negative = all other classes
Prediction:
ŷ = argmax_k p_k(x), the class whose classifier is most confident
✓ Simple, efficient (only K classifiers)
✗ Class imbalance in training (each binary problem pits one class against the other K-1)
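A minimal sketch of one-vs-rest, assuming scikit-learn's LogisticRegression as the base binary classifier (the helper names are illustrative, not from the source):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y):
    """Fit one binary logistic regression per class: class k vs. everything else."""
    models = {}
    for k in np.unique(y):
        models[k] = LogisticRegression(max_iter=1000).fit(X, (y == k).astype(int))
    return models

def one_vs_rest_predict(models, X):
    """Pick the class whose binary classifier reports the highest P(class k | x)."""
    scores = np.column_stack([m.predict_proba(X)[:, 1] for m in models.values()])
    classes = np.array(list(models.keys()))
    return classes[np.argmax(scores, axis=1)]
```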
One-vs-One (OvO)
Train K(K-1)/2 binary classifiers, one for each pair of classes.
Training:
For each pair (i, j): use only the samples from classes i and j
Prediction:
Majority voting across all classifiers
✓ Balanced training (each classifier sees only two classes)
✗ Many classifiers (K(K-1)/2 of them)
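For comparison, scikit-learn ships wrappers for both strategies; a quick sketch on the iris data (assuming scikit-learn is available):

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)                                        # K = 3 classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)    # K(K-1)/2 = 3 classifiers
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)   # K = 3 classifiers
print(len(ovo.estimators_), len(ovr.estimators_))
```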
Softmax Regression
Direct multi-class extension of logistic regression. Models all classes simultaneously:
P(y = k | x) = e^(w_k^T x) / Σ_j e^(w_j^T x)
Probabilities are guaranteed to sum to 1. Generalizes the sigmoid to K classes.
Cross-entropy loss:
J(W) = -Σ_i Σ_k y_ik log P(y_i = k | x_i)
where y_ik is 1 if sample i belongs to class k, 0 otherwise (one-hot encoding)
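A tiny numeric check of the softmax properties, using SciPy's softmax (the logits are made-up values):

```python
import numpy as np
from scipy.special import softmax

z = np.array([2.0, 1.0, 0.1])        # logits z_k = w_k^T x for K = 3 classes
p = softmax(z)
print(p, p.sum())                    # ≈ [0.659, 0.242, 0.099], sums to 1.0
```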
When one class greatly outnumbers the others (e.g., fraud detection: 1% fraud, 99% normal), classifiers become biased toward the majority class.
Example:
Dataset: 990 normal transactions, 10 fraudulent. A classifier predicting "all normal" achieves 99% accuracy but catches 0% fraud!
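The arithmetic of this example, spelled out in a few lines of NumPy:

```python
import numpy as np

# 990 normal (label 0) and 10 fraudulent (label 1) transactions;
# a classifier that always predicts "normal" looks accurate but is useless.
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)            # predict "all normal"

accuracy = (y_pred == y_true).mean()          # 0.99
fraud_caught = y_pred[y_true == 1].mean()     # recall on fraud: 0.0
print(accuracy, fraud_caught)
```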
Undersampling
Remove majority class samples
⚠ Loses information
Oversampling
Duplicate minority class samples
⚠ Risk of overfitting
SMOTE (Synthetic Minority Over-sampling Technique)
How it works: for each minority-class sample, pick one of its k nearest minority-class neighbors and create a synthetic sample at a random point on the line segment between them.
Class Weights
Weight the loss by inverse class frequency
Threshold Adjustment
Lower threshold for minority class
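A sketch combining class weights and threshold adjustment with scikit-learn on synthetic 1%-positive data (the dataset, the 0.3 cutoff, and all parameters are illustrative assumptions, not values from the source):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 1%-positive dataset standing in for the fraud example above.
X, y = make_classification(n_samples=5000, weights=[0.99], flip_y=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights the loss by inverse class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Threshold adjustment: flag the minority class at a lowered cutoff instead of 0.5.
proba = clf.predict_proba(X_te)[:, 1]
y_pred = (proba >= 0.3).astype(int)
print("positives flagged:", y_pred.sum(), "true positives in test set:", y_te.sum())
```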
| Income ($k) | Credit Score | Debt ($k) | Decision |
|---|---|---|---|
| 85 | 720 | 15 | Approve |
| 45 | 650 | 25 | Reject |
| 120 | 780 | 20 | Approve |
| 35 | 580 | 30 | Reject |
| 95 | 740 | 18 | Approve |
| 55 | 640 | 28 | Reject |
| 110 | 760 | 22 | Approve |
| 40 | 600 | 32 | Reject |
After training, the learned weights are positive for income and credit score and negative for debt: higher income or credit score increases the approval probability, while higher debt decreases it.
New applicant: Income = $20k
15.4% approval probability → Predict: Reject
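For reproducibility, here is a sketch that fits scikit-learn's logistic regression to the eight applicants in the table; because scikit-learn applies L2 regularization by default and the features are unscaled, the fitted coefficients will differ from whatever weights the original example reports, but the signs should follow the pattern described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The eight applicants from the table: [income ($k), credit score, debt ($k)].
X = np.array([[85, 720, 15], [45, 650, 25], [120, 780, 20], [35, 580, 30],
              [95, 740, 18], [55, 640, 28], [110, 760, 22], [40, 600, 32]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])   # 1 = Approve, 0 = Reject

clf = LogisticRegression(max_iter=5000).fit(X, y)
# Expect positive coefficients for income and credit score, negative for debt
# (exact values depend on regularization and feature scaling).
print(clf.coef_, clf.intercept_)
```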
Maps any real number z to (0, 1), perfect for probability interpretation.
Proof: since e^(-z) > 0 for all z, we have 1 + e^(-z) > 1, so 0 < σ(z) = 1/(1 + e^(-z)) < 1.
Theorem: σ'(z) = σ(z)(1 - σ(z))
Proof:
Using the quotient rule on σ(z) = 1/(1 + e^(-z)):
σ'(z) = e^(-z) / (1 + e^(-z))^2
Rewrite:
σ'(z) = [1/(1 + e^(-z))] · [e^(-z)/(1 + e^(-z))] = σ(z) · (1 - σ(z))
Computational Advantage:
If we already computed σ(z), the derivative requires only one multiplication and subtraction! This makes backpropagation very efficient.
As z → +∞: σ(z) → 1 and σ'(z) = σ(z)(1 - σ(z)) → 0
High confidence positive prediction, gradient vanishes.
As z → -∞: σ(z) → 0 and σ'(z) → 0
High confidence negative prediction, gradient vanishes.
Maximum derivative at z = 0: σ'(0) = 0.25. Because every sigmoid unit scales gradients by at most 0.25, stacking many of them shrinks gradients exponentially; this is the vanishing gradient problem in deep networks.
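A quick numeric sanity check of the derivative identity and the 0.25 maximum (a NumPy sketch, not from the source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 2001)
analytic = sigmoid(z) * (1 - sigmoid(z))     # sigma(z)(1 - sigma(z))
numeric = np.gradient(sigmoid(z), z)         # finite-difference derivative

print(np.max(np.abs(analytic - numeric)))    # tiny (finite-difference error only)
print(analytic.max())                        # 0.25, attained at z = 0
```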
Model the probability that y = 1 given x:
P(y = 1 | x) = σ(w^T x),  P(y = 0 | x) = 1 - σ(w^T x)
Compact form using the Bernoulli distribution:
P(y | x) = σ(w^T x)^y (1 - σ(w^T x))^(1-y)
For n independent samples, the likelihood is:
L(w) = Π_i σ(w^T x_i)^(y_i) (1 - σ(w^T x_i))^(1-y_i)
Take the logarithm (a monotonic transformation, so it doesn't change the argmax):
ℓ(w) = Σ_i [y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i))]
Product becomes sum, which is easier to differentiate and numerically stable.
Maximize the log-likelihood ⇔ minimize the negative log-likelihood:
J(w) = -ℓ(w) = -Σ_i [y_i log σ(w^T x_i) + (1 - y_i) log(1 - σ(w^T x_i))]
This is the binary cross-entropy loss!
Taking the derivative with respect to w_j (writing σ_i = σ(w^T x_i)):
∂J/∂w_j = -Σ_i [y_i/σ_i - (1 - y_i)/(1 - σ_i)] · ∂σ_i/∂w_j
Using σ'(z) = σ(z)(1-σ(z)) and the chain rule, ∂σ_i/∂w_j = σ_i(1-σ_i) x_ij:
∂J/∂w_j = -Σ_i [y_i(1 - σ_i) - (1 - y_i)σ_i] x_ij
Substituting and simplifying:
∂J/∂w_j = Σ_i (σ_i - y_i) x_ij
In vector form:
∇J(w) = X^T(σ - y)
Remarkably Simple Form!
Gradient is just the difference between predictions and true labels, weighted by features. Same form as linear regression!
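Putting the gradient to work: a from-scratch gradient descent sketch using ∇J = X^T(σ - y) (the toy data, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gd(X, y, lr=0.1, n_iter=5000):
    """Gradient descent on binary cross-entropy; the gradient is X^T(sigma - y),
    averaged over n here, which only rescales the learning rate."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        sigma = sigmoid(X @ w)
        w -= lr * X.T @ (sigma - y) / len(y)
    return w

# Toy data: a bias column plus one overlapping feature, so the optimum is finite.
X = np.column_stack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1])
print(logreg_gd(X, y))
```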
A function f is convex if its Hessian matrix H is positive semi-definite: v^T H v ≥ 0 for all v.
The gradient is: ∇J(w) = X^T(σ - y)
Taking the second derivative:
H = ∂/∂w [X^T(σ - y)] = X^T ∂σ/∂w
Since ∂σ_i/∂w_j = σ_i(1-σ_i)x_ij, we get:
H = X^T S X
where S is diagonal with S_ii = σ_i(1-σ_i).
Theorem: H is positive semi-definite
Proof: For any vector v:
v^T H v = v^T X^T S X v
Let z = Xv. Then:
v^T H v = z^T S z = Σ_i σ_i(1-σ_i) z_i²
Since 0 < σ_i < 1, we have σ_i(1-σ_i) > 0, and z_i² ≥ 0.
Therefore: v^T H v ≥ 0 for all v ⇒ H is positive semi-definite ⇒ J is convex ✓
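A numeric spot check of the PSD claim: build H = X^T S X for a random design matrix and confirm its eigenvalues are non-negative (a sketch under assumed random data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

sigma = 1 / (1 + np.exp(-X @ w))
S = np.diag(sigma * (1 - sigma))
H = X.T @ S @ X                          # Hessian of the binary cross-entropy loss

print(np.linalg.eigvalsh(H))             # all eigenvalues >= 0 (up to floating-point error)
```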
✓ Guaranteed Global Optimum
Any local minimum is also a global minimum. No risk of getting stuck in bad local minima.
✓ Convergence Guarantee
Gradient descent with proper step size is guaranteed to converge to the global optimum.
For K classes, compute logits z_k = w_k^T x for each class, then:
P(y = k | x) = softmax(z)_k = e^(z_k) / Σ_j e^(z_j)
Properties: All probabilities positive, sum to 1, generalizes sigmoid (K=2 case).
Using one-hot encoding y_i ∈ {0,1}^K (exactly one 1, rest 0) and writing p_ik = P(y_i = k | x_i):
J = -Σ_i Σ_k y_ik log p_ik
Since y_ik is 1 only for the true class c_i:
J = -Σ_i log p_i,c_i
Computing ∂J/∂z_k for a single sample with true class c:
For the correct class k = c: ∂J/∂z_c = p_c - 1
For incorrect classes k ≠ c: ∂J/∂z_k = p_k
Compact form: ∂J/∂z = p - y
Beautiful Result!
Same form as binary logistic regression. Gradient is prediction error!
Computing e^(z_k) directly can overflow for large z. Use the log-sum-exp trick:
softmax(z)_k = e^(z_k - c) / Σ_j e^(z_j - c)
where c = max_k z_k. This shifts all exponents to be ≤ 0, preventing overflow.
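A minimal implementation of the trick (the function name is an assumption, not a standard API):

```python
import numpy as np

def stable_softmax(z):
    """Softmax with max-subtraction (log-sum-exp trick) so exponents are <= 0."""
    exp_z = np.exp(z - np.max(z))        # subtract c = max_k z_k
    return exp_z / exp_z.sum()

z = np.array([1000.0, 1001.0, 1002.0])   # naive np.exp(z) would overflow to inf
print(stable_softmax(z))                 # ≈ [0.090, 0.245, 0.665], sums to 1
```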
Newton's method update:
w_new = w_old - H^(-1) ∇J(w_old)
where H is the Hessian matrix (second derivative).
Recall:
∇J(w) = X^T(σ - y) and H = X^T S X
Newton's update becomes:
w_new = w_old - (X^T S X)^(-1) X^T(σ - y)
Gradient Descent: uses only the gradient; each iteration is cheap, but convergence is linear and the step size must be tuned.
Newton's Method: uses the Hessian; each iteration is costlier (form and solve X^T S X), but convergence near the optimum is quadratic with no step size to tune.
Quadratic Convergence: Near the optimum, the error squares at each iteration:
||w_(t+1) - w*|| ≤ C ||w_t - w*||²
This means if you have 1 correct digit, next iteration gives 2, then 4, then 8, etc. Much faster than gradient descent's linear convergence!
Iteratively Reweighted Least Squares:
Initialize: w₀
Repeat until convergence:
1. Compute σ_i = σ(w^T x_i) for all i
2. Form diagonal matrix S with S_ii = σ_i(1-σ_i)
3. Solve: (X^T S X)w_new = X^T S X w_old - X^T(σ - y)
4. w ← w_new
Return: w
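A runnable NumPy version of the pseudocode above (a sketch; the toy data, iteration cap, and tolerance are my assumptions):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Newton's method / IRLS for logistic regression, following the steps above."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        sigma = 1 / (1 + np.exp(-X @ w))                          # step 1
        S = np.diag(sigma * (1 - sigma))                          # step 2
        H = X.T @ S @ X                                           # Hessian X^T S X
        w_new = np.linalg.solve(H, H @ w - X.T @ (sigma - y))     # step 3
        if np.linalg.norm(w_new - w) < tol:                       # convergence check
            return w_new
        w = w_new                                                 # step 4
    return w

# Toy data: bias column plus one overlapping feature (non-separable, so the MLE is finite).
X = np.column_stack([np.ones(6), [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
y = np.array([0, 0, 1, 0, 1, 1])
print(irls_logistic(X, y))   # converges in a handful of Newton steps
```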