
Logistic Regression & Classification

Master binary and multi-class classification methods including logistic regression, LDA, and handling imbalanced data

Logistic Regression

Sigmoid Function & Model

Logistic regression models the probability of binary outcomes using the sigmoid function:

P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}

• Output range: (0, 1) - perfect for probabilities

• σ(0) = 0.5, so w^T x = 0 gives the decision boundary

• S-shaped curve: smooth transition between classes

Loss Function

Binary cross-entropy (log loss):

L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i)\right]

Convex function - guaranteed global minimum

Gradient

For gradient descent:

\nabla L = \frac{1}{n}\mathbf{X}^T(\sigma(\mathbf{Xw}) - \mathbf{y})

Similar form to linear regression!

Decision Threshold

Default threshold is 0.5, but adjust based on costs: use lower threshold if false negatives are more costly (e.g., disease detection).
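A minimal sketch of these pieces in code, assuming NumPy is available: the sigmoid, the gradient X^T(σ − y)/n from above, a plain gradient-descent loop, and the default 0.5 threshold. The helper names (`sigmoid`, `fit_logistic`) and the toy data are illustrative, not a fixed API.

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for extreme logits
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Plain gradient descent on the mean binary cross-entropy loss."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        p_hat = sigmoid(X @ w)
        grad = X.T @ (p_hat - y) / n   # gradient from the section above
        w -= lr * grad
    return w

# Toy usage: two 2D clusters, with a bias column prepended
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
X = np.hstack([np.ones((100, 1)), X])
y = np.array([0] * 50 + [1] * 50)
w = fit_logistic(X, y)
preds = (sigmoid(X @ w) >= 0.5).astype(int)   # default 0.5 decision threshold
print("training accuracy:", (preds == y).mean())
```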

Linear Discriminant Analysis (LDA)

Generative Approach to Classification

LDA models the class-conditional distributions and applies Bayes' theorem for classification. It assumes Gaussian class densities with equal covariance:

P(y=k|\mathbf{x}) = \frac{P(\mathbf{x}|y=k)P(y=k)}{P(\mathbf{x})}

where the features follow a Gaussian distribution with class-specific means and a shared covariance matrix

Objective: Maximize Between-Class Separation

J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}

Between-class scatter S_B:

S_B = \sum_{k} n_k(\boldsymbol{\mu}_k - \boldsymbol{\mu})(\boldsymbol{\mu}_k - \boldsymbol{\mu})^T

Within-class scatter S_W:

S_W = \sum_{k}\sum_{\mathbf{x} \in C_k}(\mathbf{x}-\boldsymbol{\mu}_k)(\mathbf{x}-\boldsymbol{\mu}_k)^T
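A short sketch of the two-class case, assuming NumPy: it builds the within-class scatter S_W and uses the closed-form optimum w ∝ S_W^{-1}(μ_1 − μ_0) of the objective above. The function name `lda_direction` and the toy blobs are illustrative.

```python
import numpy as np

def lda_direction(X, y):
    """Two-class LDA: maximize w^T S_B w / w^T S_W w.
    The optimum is w proportional to S_W^{-1} (mu_1 - mu_0)."""
    classes = np.unique(y)
    mus = [X[y == c].mean(axis=0) for c in classes]
    # Within-class scatter: sum of centered outer products per class
    S_W = sum((X[y == c] - mu).T @ (X[y == c] - mu) for c, mu in zip(classes, mus))
    w = np.linalg.solve(S_W, mus[1] - mus[0])
    return w / np.linalg.norm(w)

# Toy usage: two Gaussian blobs with a shared covariance
rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]
X = np.vstack([rng.multivariate_normal([0, 0], cov, 100),
               rng.multivariate_normal([2, 2], cov, 100)])
y = np.array([0] * 100 + [1] * 100)
print("LDA projection direction:", lda_direction(X, y))
```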

✓ When LDA Works Well

  • Classes roughly Gaussian
  • Similar covariance structure
  • Small datasets (the strong distributional assumptions compensate for limited data)

✗ When to Avoid LDA

  • Non-Gaussian distributions
  • Very different covariances
  • Large datasets (logistic regression is more flexible)

Multi-Class Classification

One-vs-Rest (OvR)

Train K binary classifiers, each distinguishing one class from all the others.

Training:

For class k: positive = class k, negative = all other classes

Prediction:

\hat{y} = \arg\max_k P_k(y=1|\mathbf{x})

✓ Simple, efficient

✗ Class imbalance in training

One-vs-One (OvO)

Train K(K-1)/2 binary classifiers, one for each pair of classes.

Training:

For each pair (i, j): use only the samples from classes i and j

Prediction:

Majority voting across all classifiers

✓ Balanced training

✗ Many classifiers: K(K-1)/2 in total (see the sketch below)
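A quick way to try both strategies is scikit-learn's wrappers, which train the K (OvR) or K(K−1)/2 (OvO) underlying binary classifiers for you. The sketch below assumes scikit-learn is installed and uses the built-in iris data (3 classes) purely as a convenient example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)            # 3 classes
base = LogisticRegression(max_iter=1000)

ovr = OneVsRestClassifier(base).fit(X, y)    # one classifier per class
ovo = OneVsOneClassifier(base).fit(X, y)     # one classifier per pair of classes

print("OvR classifiers:", len(ovr.estimators_))   # 3
print("OvO classifiers:", len(ovo.estimators_))   # 3 * (3 - 1) / 2 = 3
```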

Softmax Regression (Multinomial Logistic Regression)

Direct multi-class extension of logistic regression. Models all classes simultaneously:

P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{j=1}^{K} e^{\mathbf{w}_j^T\mathbf{x}}}

Probabilities are guaranteed to sum to 1. Generalizes the sigmoid to K classes.

Cross-Entropy Loss

L = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} y_{i,k} \log(\hat{y}_{i,k})

where y_{i,k} is 1 if sample i belongs to class k and 0 otherwise (one-hot encoding)
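A small NumPy sketch of softmax probabilities and the cross-entropy loss above; the max-shift inside `softmax` anticipates the numerical-stability trick discussed later. The function names and toy logits are illustrative.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax with a max-shift for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(P, Y):
    """Mean multi-class cross-entropy: -1/n * sum_i sum_k y_ik log p_ik."""
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

# Toy usage: 4 samples, 3 classes, one-hot targets
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.3],
                   [-1.0, 3.0, 0.0],
                   [0.0, 0.0, 5.0]])
y = np.array([0, 2, 1, 2])
Y = np.eye(3)[y]                              # one-hot encoding
P = softmax(logits)
print("rows sum to 1:", np.allclose(P.sum(axis=1), 1.0))
print("cross-entropy loss:", cross_entropy(P, Y))
```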

Handling Class Imbalance

The Problem

When one class greatly outnumbers others (e.g., fraud detection: 1% fraud, 99% normal), classifiers bias toward majority class.

Example:

Dataset: 990 normal transactions, 10 fraudulent. A classifier predicting "all normal" achieves 99% accuracy but catches 0% fraud!

1. Resampling

Undersampling

Remove majority class samples

⚠ Loses information

Oversampling

Duplicate minority class samples

⚠ Risk of overfitting

2. SMOTE

Synthetic Minority Over-sampling Technique

How it works:

  1. Pick a minority-class sample
  2. Find its k nearest minority-class neighbors
  3. Randomly select one of those neighbors
  4. Generate a synthetic sample on the segment between the original sample and the neighbor (see the sketch below)
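A from-scratch sketch of the interpolation step, assuming NumPy; production code would normally use an existing implementation (e.g. imbalanced-learn's SMOTE), so treat `smote_sample` as an illustration of the four steps rather than the reference algorithm.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority samples by interpolating each picked
    sample toward one of its k nearest minority-class neighbors."""
    rng = rng if rng is not None else np.random.default_rng(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                  # 1. pick a minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]            # 2. its k nearest neighbors
        j = rng.choice(neighbors)                     # 3. one neighbor at random
        lam = rng.random()                            # 4. interpolate between them
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Toy usage: 10 minority points in 2D, expanded with 50 synthetic points
rng = np.random.default_rng(2)
X_min = rng.normal(0, 1, (10, 2))
print(smote_sample(X_min, k=3, n_new=50, rng=rng).shape)   # (50, 2)
```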
3. Cost-Sensitive

Class Weights

w_k = \frac{n}{K \cdot n_k}

Weight the loss by the inverse class frequency

Threshold Adjustment

Lower threshold for minority class
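A tiny sketch of the inverse-frequency weights w_k = n/(K·n_k), assuming NumPy; the resulting dictionary can be passed as per-class weights to most classifiers (scikit-learn's `class_weight='balanced'` option uses the same formula).

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights: w_k = n / (K * n_k)."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, len(y) / (len(classes) * counts)))

# Toy usage: 990 negatives, 10 positives
y = np.array([0] * 990 + [1] * 10)
print(balanced_class_weights(y))   # {0: ~0.505, 1: 50.0} -> minority errors weigh ~100x more
```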

Example: Credit Approval

Binary Classification: Approve or Reject

Sample Data (8 applicants)

Income ($k) | Credit Score | Debt ($k) | Decision
85          | 720          | 15        | Approve
45          | 650          | 25        | Reject
120         | 780          | 20        | Approve
35          | 580          | 30        | Reject
95          | 740          | 18        | Approve
55          | 640          | 28        | Reject
110         | 760          | 22        | Approve
40          | 600          | 32        | Reject

Logistic Regression Model

After training, learned weights:

P(\text{Approve}) = \sigma(0.02 \times \text{Income} + 0.01 \times \text{Score} - 0.03 \times \text{Debt} - 10)

Higher income/score increases approval probability; higher debt decreases it.

Prediction Example

New applicant: Income = $90k, Credit Score = 710, Debt = $20k

z = 0.02(90) + 0.01(710) - 0.03(20) - 10 = 1.8 + 7.1 - 0.6 - 10 = -1.7

P = \sigma(-1.7) \approx 0.154

15.4% approval probability → Predict: Reject
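The arithmetic of this worked example, reproduced in a few lines of NumPy (the weights and applicant values are the ones given above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Learned weights from the model above: income, credit score, debt, plus intercept -10
w = np.array([0.02, 0.01, -0.03])
b = -10.0

# New applicant: income $90k, credit score 710, debt $20k
x = np.array([90.0, 710.0, 20.0])
z = w @ x + b
p = sigmoid(z)
print(f"z = {z:.2f}, P(Approve) = {p:.3f}")          # z = -1.70, P ≈ 0.154
print("decision:", "Approve" if p >= 0.5 else "Reject")
```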

Sigmoid Function Mathematical Properties

Logistic Function Derivation and Properties

Sigmoid Function Definition

\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}

Maps any real number z to (0, 1), perfect for probability interpretation.

Key Property: Symmetry

\sigma(-z) = 1 - \sigma(z)

Proof:

\sigma(-z) = \frac{1}{1 + e^{z}} = \frac{1}{1 + e^{z}} \cdot \frac{e^{-z}}{e^{-z}} = \frac{e^{-z}}{e^{-z} + 1} = \frac{1 + e^{-z} - 1}{1 + e^{-z}} = 1 - \frac{1}{1 + e^{-z}} = 1 - \sigma(z)

Derivative of Sigmoid

Theorem:

\frac{d\sigma}{dz} = \sigma(z)(1 - \sigma(z))

Proof:

Using quotient rule on σ(z) = 1/(1 + e^(-z)):

\frac{d\sigma}{dz} = \frac{0 \cdot (1 + e^{-z}) - 1 \cdot (-e^{-z})}{(1 + e^{-z})^2} = \frac{e^{-z}}{(1 + e^{-z})^2}

Rewrite:

= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{1 + e^{-z}} \cdot \frac{1 + e^{-z} - 1}{1 + e^{-z}} = \sigma(z)(1 - \sigma(z))

Computational Advantage:

If we already computed σ(z), the derivative requires only one multiplication and subtraction! This makes backpropagation very efficient.
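A quick numerical check of this property, assuming NumPy: the analytic derivative σ(z)(1 − σ(z)) reuses the already-computed σ(z) and matches a central finite-difference estimate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
s = sigmoid(z)                 # value from the forward pass
analytic = s * (1.0 - s)       # derivative reuses s: one subtraction, one multiplication

# Central finite-difference estimate of sigma'(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(analytic, numeric)       # both ≈ 0.2217
```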

Asymptotic Behavior

As z → +∞:

\sigma(z) \to 1, \quad \sigma'(z) \to 0

High confidence positive prediction, gradient vanishes.

As z → -∞:

\sigma(z) \to 0, \quad \sigma'(z) \to 0

High confidence negative prediction, gradient vanishes.

The derivative is largest at z = 0, where σ'(0) = 0.25. Because it never exceeds 0.25 and shrinks toward 0 for large |z|, stacked sigmoid layers suffer from the vanishing gradient problem in deep networks.

Maximum Likelihood Estimation

From Probability Model to Loss Function

Step 1: Probabilistic Model

Model the probability that y = 1 given x:

P(y=1|\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^T\mathbf{x}}}

P(y=0|\mathbf{x}; \mathbf{w}) = 1 - \sigma(\mathbf{w}^T\mathbf{x})

Compact form using Bernoulli distribution:

P(y|\mathbf{x}; \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})^y \cdot (1 - \sigma(\mathbf{w}^T\mathbf{x}))^{1-y}

Step 2: Likelihood Function

For n independent samples, the likelihood is:

L(\mathbf{w}) = \prod_{i=1}^{n} P(y_i|\mathbf{x}_i; \mathbf{w}) = \prod_{i=1}^{n} \sigma(\mathbf{w}^T\mathbf{x}_i)^{y_i} (1 - \sigma(\mathbf{w}^T\mathbf{x}_i))^{1-y_i}

Step 3: Log-Likelihood

Take logarithm (monotonic transformation, doesn't change argmax):

\ell(\mathbf{w}) = \log L(\mathbf{w}) = \sum_{i=1}^{n} \left[y_i \log \sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))\right]

The product becomes a sum, which is easier to differentiate and more numerically stable.

Step 4: Negative Log-Likelihood (Cross-Entropy Loss)

Maximize log-likelihood ⇔ Minimize negative log-likelihood:

J(\mathbf{w}) = -\ell(\mathbf{w}) = -\sum_{i=1}^{n} \left[y_i \log \sigma(\mathbf{w}^T\mathbf{x}_i) + (1-y_i) \log(1 - \sigma(\mathbf{w}^T\mathbf{x}_i))\right]

This is the binary cross-entropy loss!

Step 5: Gradient Calculation

Taking derivative with respect to w_j:

\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{n} \left[\frac{y_i}{\sigma_i} - \frac{1-y_i}{1-\sigma_i}\right] \frac{\partial \sigma_i}{\partial w_j}

Using σ'(z) = σ(z)(1-σ(z)) and chain rule:

\frac{\partial \sigma_i}{\partial w_j} = \sigma_i(1-\sigma_i) x_{ij}

Substituting and simplifying:

\frac{\partial J}{\partial w_j} = -\sum_{i=1}^{n} [y_i(1-\sigma_i) - (1-y_i)\sigma_i] x_{ij} = \sum_{i=1}^{n} (\sigma_i - y_i) x_{ij}

In vector form:

\nabla J(\mathbf{w}) = \mathbf{X}^T(\boldsymbol{\sigma} - \mathbf{y})

Remarkably Simple Form!

Gradient is just the difference between predictions and true labels, weighted by features. Same form as linear regression!
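A small sanity check of the derivation, assuming NumPy: compare the analytic gradient X^T(σ − y) against finite differences of the negative log-likelihood on random data. The helper names here are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    """Negative log-likelihood (binary cross-entropy, unnormalized)."""
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    """Analytic gradient from the derivation: X^T (sigma - y)."""
    return X.T @ (sigmoid(X @ w) - y)

# Finite-difference check on random data
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
y = rng.integers(0, 2, 20).astype(float)
w = rng.normal(size=4)
eps = 1e-6
numeric = np.array([(nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(numeric, grad(w, X, y), atol=1e-5))   # True
```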

Convexity of Logistic Regression

Proof that Loss Function is Convex

Convexity Definition

A function f is convex if its Hessian matrix H is positive semi-definite:

\mathbf{H} = \nabla^2 f \succeq 0 \quad \iff \quad \mathbf{v}^T\mathbf{H}\mathbf{v} \geq 0 \quad \forall \mathbf{v}

Computing the Hessian

The gradient is: ∇J(w) = X^T(σ - y)

Taking second derivative:

\mathbf{H} = \nabla^2 J(\mathbf{w}) = \frac{\partial}{\partial \mathbf{w}}\left(\mathbf{X}^T\boldsymbol{\sigma}\right)

Since ∂σ_i/∂w_j = σ_i(1-σ_i)x_ij, we get:

\mathbf{H} = \mathbf{X}^T\mathbf{S}\mathbf{X}

where S is diagonal with S_ii = σ_i(1-σ_i).

Positive Semi-Definite Proof

Theorem: H is positive semi-definite

Proof: For any vector v:

\mathbf{v}^T\mathbf{H}\mathbf{v} = \mathbf{v}^T\mathbf{X}^T\mathbf{S}\mathbf{X}\mathbf{v} = (\mathbf{X}\mathbf{v})^T\mathbf{S}(\mathbf{X}\mathbf{v})

Let z = Xv. Then:

\mathbf{v}^T\mathbf{H}\mathbf{v} = \sum_{i=1}^{n} S_{ii} z_i^2 = \sum_{i=1}^{n} \sigma_i(1-\sigma_i) z_i^2

Since 0 < σ_i < 1, we have σ_i(1-σ_i) > 0, and z_i² ≥ 0.

Therefore: v^T H v ≥ 0 for all v ⇒ H is positive semi-definite ⇒ J is convex ✓
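The proof can also be checked numerically, assuming NumPy: build H = X^T S X at a random weight vector and confirm its eigenvalues are non-negative (up to floating-point error).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random design matrix and weight vector
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

s = sigmoid(X @ w)
S = np.diag(s * (1 - s))            # diagonal entries S_ii = sigma_i (1 - sigma_i)
H = X.T @ S @ X                     # Hessian of the log-loss

eigvals = np.linalg.eigvalsh(H)     # symmetric matrix -> real eigenvalues
print("min eigenvalue:", eigvals.min())   # non-negative up to floating-point error
```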

Implications

✓ Guaranteed Global Optimum

Any local minimum is also a global minimum. No risk of getting stuck in bad local minima.

✓ Convergence Guarantee

Gradient descent with proper step size is guaranteed to converge to the global optimum.

Softmax for Multi-Class Classification

From Binary to K-Class Classification

Softmax Function Definition

For K classes, compute logits z_k = w_k^T x for each class, then:

P(y=k|\mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}} = \text{softmax}(\mathbf{z})_k

Properties: All probabilities positive, sum to 1, generalizes sigmoid (K=2 case).

Cross-Entropy Loss for Multi-Class

Using the one-hot encoding y_i ∈ {0,1}^K (a single 1, all other entries 0):

J(\mathbf{W}) = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik} \log P(y=k|\mathbf{x}_i)

Since y_ik is 1 only for true class c_i:

J(\mathbf{W}) = -\sum_{i=1}^{n} \log P(y=c_i|\mathbf{x}_i) = -\sum_{i=1}^{n} \log \frac{e^{z_{i c_i}}}{\sum_j e^{z_{ij}}}

Gradient Derivation

Computing ∂J/∂z_k:

For the correct class k = c:

\frac{\partial J}{\partial z_c} = -\frac{1}{p_c} \cdot p_c(1-p_c) = -(1-p_c) = p_c - 1

For incorrect classes k ≠ c:

\frac{\partial J}{\partial z_k} = -\frac{1}{p_c} \cdot (-p_c p_k) = p_k

Compact form:

\frac{\partial J}{\partial z_k} = p_k - y_k \quad \text{(predicted probability minus true label)}

Beautiful Result!

Same form as binary logistic regression. Gradient is prediction error!

Numerical Stability Trick

Computing e^{z_k} directly can overflow for large z. Use the log-sum-exp trick:

\log\sum_{k} e^{z_k} = c + \log\sum_{k} e^{z_k - c}

where c = max_k z_k. This shifts all exponents to be ≤ 0, preventing overflow.
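A minimal NumPy sketch of the trick: compute log-probabilities with the max-shift, then exponentiate. The naive `np.exp(z)` would overflow for the logits used here.

```python
import numpy as np

def log_softmax(z):
    """Stable log-probabilities via the log-sum-exp trick."""
    c = z.max()                                # shift by the largest logit
    lse = c + np.log(np.sum(np.exp(z - c)))    # log sum_k e^{z_k}, computed safely
    return z - lse

z = np.array([1000.0, 1001.0, 1002.0])         # naive np.exp(z) would overflow
print(np.exp(log_softmax(z)))                  # ≈ [0.090, 0.245, 0.665], sums to 1
```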

Newton's Method for Logistic Regression

Second-Order Optimization with Hessian

Newton's Update Rule

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \mathbf{H}^{-1}(\mathbf{w}^{(t)}) \nabla J(\mathbf{w}^{(t)})

where H is the Hessian matrix (the matrix of second derivatives).

For Logistic Regression

Recall:

  • Gradient: ∇J = X^T(σ - y)
  • Hessian: H = X^T S X, where S_ii = σ_i(1-σ_i)

Newton's update becomes:

\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - (\mathbf{X}^T\mathbf{S}\mathbf{X})^{-1}\mathbf{X}^T(\boldsymbol{\sigma} - \mathbf{y})

Comparison: Newton vs Gradient Descent

Gradient Descent

  • First-order method (gradient only)
  • Per iteration: O(np) time
  • Many iterations needed
  • No matrix inversion
  • Better for large p

Newton's Method

  • Second-order method (uses the Hessian)
  • Per iteration: O(np² + p³) time
  • Fewer iterations (quadratic convergence)
  • Requires matrix inversion
  • Better for small p

Convergence Rate

Quadratic Convergence: Near the optimum, error squares at each iteration:

\|\mathbf{w}^{(t+1)} - \mathbf{w}^*\| \leq C \|\mathbf{w}^{(t)} - \mathbf{w}^*\|^2

This means that if one digit is correct, the next iteration gives roughly 2 correct digits, then 4, then 8, and so on. Much faster than gradient descent's linear convergence!

Practical Algorithm: IRLS

Iteratively Reweighted Least Squares:

Initialize: w₀

Repeat until convergence:

1. Compute σ_i = σ(w^T x_i) for all i

2. Form diagonal matrix S with S_ii = σ_i(1-σ_i)

3. Solve: (X^T S X)w_new = X^T S X w_old - X^T(σ - y)

4. w ← w_new

Return: w
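A compact NumPy sketch of the IRLS loop above; the function name `irls` and the small ridge term added before solving are illustrative choices for numerical safety, not part of the textbook algorithm.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def irls(X, y, n_iter=25, tol=1e-8):
    """Newton's method / IRLS for logistic regression."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        s = sigmoid(X @ w)
        S = s * (1 - s)                         # diagonal of S stored as a vector
        H = X.T @ (S[:, None] * X)              # X^T S X without forming diag(S)
        # Small ridge term (illustrative) keeps the solve well-conditioned
        step = np.linalg.solve(H + 1e-8 * np.eye(p), X.T @ (s - y))
        w_new = w - step
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w

# Toy usage: recover weights from synthetic data
rng = np.random.default_rng(5)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
true_w = np.array([-0.5, 2.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_w)).astype(float)
print("IRLS estimate:", irls(X, y))
```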

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the range of the sigmoid function σ(z)?
2. In logistic regression, what loss function is typically used?
3. For a binary classifier with decision threshold 0.5, if , what is the prediction?
4. What does LDA (Linear Discriminant Analysis) maximize?
5. For 5 classes, how many binary classifiers does One-vs-One (OvO) require?
6. In softmax for multi-class classification with K classes, what is the formula for class k?
7. What is the main purpose of SMOTE for imbalanced datasets?
8. For an imbalanced dataset (1% positive), which metric is LEAST appropriate?
9. What assumption does LDA make that logistic regression does not?
10. In One-vs-Rest (OvR) for 4 classes, if all classifiers output low probabilities, what happens?