Master binary classification with logistic regression, from sigmoid functions to maximum likelihood estimation with real-world credit approval examples
Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It predicts the probability that an input belongs to a particular class (typically the "positive" or "1" class) by applying a sigmoid function to a linear combination of features.
Probability of positive class:
P(y=1|x) = σ(wᵀx + b) = 1 / (1 + e^(-(wᵀx + b)))
Where σ is the sigmoid (logistic) function:
σ(z) = 1 / (1 + e^(-z))
The sigmoid function maps any real-valued input to the range (0, 1), making it well suited to probability estimation. Outputs close to 0 indicate the negative class; outputs close to 1 indicate the positive class.
Typically, we classify as positive if P(y=1|x) ≥ 0.5, which occurs when wᵀx + b ≥ 0. This creates a linear decision boundary in feature space, separating the two classes.
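This decision rule is short enough to sketch directly. Below is a minimal NumPy version; the weight vector w and bias b are made-up values used purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y=1|x) = sigmoid of the linear score w·x + b, for each row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Label 1 when the probability clears the threshold (0.5 corresponds to w·x + b >= 0)."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical weights for two features
w = np.array([1.5, -2.0])
b = 0.25
X = np.array([[0.4, 0.1],
              [0.1, 0.9]])
print(predict_proba(X, w, b))  # probabilities in (0, 1)
print(predict(X, w, b))        # hard 0/1 labels from the linear decision boundary
```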
Understanding the mathematical transformation at the heart of logistic regression
| z | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very unlikely |
| -2 | 0.119 | Unlikely |
| 0 | 0.500 | Uncertain |
| 2 | 0.881 | Likely |
| 5 | 0.993 | Very likely |
The logit function is the inverse of the sigmoid. It transforms probabilities back to the real line:
logit(p) = ln(p / (1-p)) = wᵀx + b
The term p/(1-p) is called the odds, and its logarithm is the log-odds. This shows that logistic regression models the log-odds as a linear function of features.
If the probability of credit approval is 0.8, the odds are 0.8/0.2 = 4, meaning approval is 4 times more likely than rejection. The log-odds is ln(4) ≈ 1.39. Each unit increase in a feature contributes its weight to this log-odds value.
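A quick script reproduces this arithmetic; the 0.15 coefficient at the end is a hypothetical feature weight, used only to show the effect of a one-unit increase:

```python
import math

p = 0.8                        # probability of approval
odds = p / (1 - p)             # 0.8 / 0.2 = 4.0
log_odds = math.log(odds)      # ln(4) ≈ 1.386

# A one-unit increase in a feature adds its weight to the log-odds,
# which multiplies the odds by e^weight.
weight = 0.15                  # hypothetical feature weight
new_odds = math.exp(log_odds + weight)
print(odds, log_odds, new_odds)   # 4.0, ~1.386, ~4.65
```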
Using logistic regression to predict loan approval decisions
A bank has historical data on 1,000 loan applications. Here's a sample of 8 applications:
| ID | Income | Age | Debt | Credit Score | Years Employed | Approved |
|---|---|---|---|---|---|---|
| 1 | $45,000 | 28 | $12,000 | 680 | 3 | No |
| 2 | $75,000 | 35 | $8,000 | 750 | 8 | Yes |
| 3 | $55,000 | 42 | $25,000 | 620 | 12 | No |
| 4 | $95,000 | 38 | $15,000 | 780 | 10 | Yes |
| 5 | $32,000 | 25 | $18,000 | 590 | 2 | No |
| 6 | $68,000 | 45 | $5,000 | 720 | 15 | Yes |
| 7 | $52,000 | 31 | $10,000 | 695 | 5 | Yes |
| 8 | $38,000 | 29 | $22,000 | 610 | 3 | No |
After training on the full dataset, we obtain a model like:
z = 0.00003 × income
+ 0.02 × age
- 0.00008 × debt
+ 0.012 × credit_score
+ 0.15 × years_employed
- 10.5
P(approved = 1) = 1 / (1 + e^(-z))
Example applicant: $60,000 income, age 32, $15,000 debt, credit score 710, 6 years employed:
z = 0.00003(60000) + 0.02(32) - 0.00008(15000) + 0.012(710) + 0.15(6) - 10.5
z = 1.8 + 0.64 - 1.2 + 8.52 + 0.9 - 10.5 = 0.16
P(approved) = 1 / (1 + e^(-0.16)) = 1 / (1 + 0.852) ≈ 0.54
Prediction: APPROVED (54% probability)
Since P(approved) = 0.54 > 0.5, we classify this application as approved with moderate confidence.
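The same calculation in code, using the coefficients of the fitted model above and the applicant's feature values from the hand calculation:

```python
import numpy as np

# Coefficients from the trained model: income, age, debt, credit_score, years_employed
w = np.array([0.00003, 0.02, -0.00008, 0.012, 0.15])
b = -10.5

# Example applicant: $60,000 income, age 32, $15,000 debt, 710 credit score, 6 years employed
x = np.array([60_000, 32, 15_000, 710, 6])

z = w @ x + b                         # ≈ 0.16
p_approved = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.2f}, P(approved) = {p_approved:.2f}")
print("APPROVED" if p_approved >= 0.5 else "REJECTED")
```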
How logistic regression finds optimal parameters
Unlike linear regression, which minimizes squared error, logistic regression maximizes the likelihood of the observed training labels. For a single sample (xᵢ, yᵢ):
P(yᵢ | xᵢ) = p̂ᵢ^yᵢ × (1 - p̂ᵢ)^(1-yᵢ)
where p̂ᵢ = σ(wᵀxᵢ + b) is the predicted probability
This formula elegantly handles both classes: when yᵢ=1, it equals p̂ᵢ; when yᵢ=0, it equals (1-p̂ᵢ).
Because the likelihood of the full dataset is a product of many small probabilities, we maximize the log-likelihood instead: the product becomes a sum, which is both easier to optimize and numerically more stable. Taking the negative gives us the cross-entropy loss (also called log loss or logistic loss):
Negative log-likelihood (to minimize):
ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ log(p̂ᵢ) - (1-yᵢ) log(1-p̂ᵢ)]
Equivalently, with β = (w; b) and x̂ᵢ = (xᵢ; 1) the augmented feature vector:
ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ βᵀx̂ᵢ + log(1 + e^(βᵀx̂ᵢ))]
Correct prediction with high confidence:
y=1, p̂=0.95 → loss ≈ 0.05 (small)
Wrong prediction with high confidence:
y=1, p̂=0.05 → loss ≈ 3.0 (large)
Uncertain prediction:
y=1, p̂=0.50 → loss ≈ 0.69 (medium)
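These three loss values are easy to verify with a one-line helper (it assumes p̂ is never exactly 0 or 1):

```python
import numpy as np

def cross_entropy(y, p_hat):
    """Per-sample log loss; assumes 0 < p_hat < 1."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

for p_hat in (0.95, 0.05, 0.50):
    print(f"y=1, p_hat={p_hat:.2f} -> loss = {cross_entropy(1, p_hat):.2f}")
# 0.05, 3.00, 0.69 — matching the three scenarios above
```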
Iterative optimization for finding optimal weights
Since the cross-entropy loss is convex but its minimizer has no closed-form expression, we use gradient descent to find the optimal parameters iteratively:
1. Initialize: w ← random values, b ← 0
2. Repeat until convergence:
w ← w - α × ∂ℓ/∂w
b ← b - α × ∂ℓ/∂b
where α is the learning rate (step size)
The gradient of the cross-entropy loss (averaged over the m training samples) has a remarkably clean form:
∂ℓ/∂w = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ) xᵢ
∂ℓ/∂b = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ)
The gradient is simply the average of (prediction - actual), weighted by the corresponding feature values. This elegance comes from the special relationship between the sigmoid function and the cross-entropy loss.
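A compact batch gradient-descent implementation of these update rules; this is a teaching sketch on synthetic data, with no regularization and a fixed iteration count rather than a convergence test:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the averaged cross-entropy loss."""
    m, n = X.shape
    w = np.zeros(n)        # zero init works here because the loss is convex
    b = 0.0
    for _ in range(n_iters):
        p_hat = sigmoid(X @ w + b)        # predicted probabilities
        error = p_hat - y                 # (prediction - actual)
        w -= lr * (X.T @ error) / m       # dL/dw: average of error-weighted features
        b -= lr * error.mean()            # dL/db: average error
    return w, b

# Tiny synthetic sanity check: one informative feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)
w, b = fit_logistic(X, y)
print(w, b)   # expect a clearly positive weight on the informative feature
```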
| Variant | Samples per Update | Pro | Con |
|---|---|---|---|
| Batch gradient descent | All training samples each iteration | Stable convergence | Slow for large datasets |
| Stochastic gradient descent (SGD) | One sample at a time | Fast updates, escapes local minima | Noisy convergence |
| Mini-batch gradient descent | Small batches (32-256 samples) | Best of both worlds, GPU-efficient | Hyperparameter tuning needed |
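For comparison with the batch version above, here is a minimal mini-batch variant with an assumed batch size of 32; the gradient is identical, just computed on a random subset each step:

```python
import numpy as np

def fit_logistic_minibatch(X, y, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent: same gradient, computed on random batches."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle samples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            p_hat = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))
            error = p_hat - y[idx]
            w -= lr * (X[idx].T @ error) / len(idx)
            b -= lr * error.mean()
    return w, b
```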
Modern implementations often use sophisticated optimizers, such as Adam, RMSProp, and AdaGrad, that adapt the learning rate per parameter.
Measuring logistic regression performance beyond simple accuracy
For our credit approval model on a test set of 100 applications:
| | Predicted: Rejected (0) | Predicted: Approved (1) |
|---|---|---|
| Actual: Rejected (0) | 45 (True Negatives) | 8 (False Positives) |
| Actual: Approved (1) | 5 (False Negatives) | 42 (True Positives) |
| Metric | Value | Formula | Interpretation |
|---|---|---|---|
| Accuracy | 87.0% | (TP + TN) / Total = 87 / 100 | Overall correctness, but can be misleading with imbalanced classes |
| Precision | 84.0% | TP / (TP + FP) = 42 / 50 | Of predicted approvals, what fraction were correct? |
| Recall | 89.4% | TP / (TP + FN) = 42 / 47 | Of actual approvals, what fraction did we catch? |
| F1 Score | 86.6% | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall |
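Recomputing the four metrics directly from the confusion-matrix counts above:

```python
tn, fp, fn, tp = 45, 8, 5, 42   # counts from the confusion matrix

accuracy  = (tp + tn) / (tp + tn + fp + fn)              # 0.870
precision = tp / (tp + fp)                               # 0.840
recall    = tp / (tp + fn)                               # ≈ 0.894
f1 = 2 * precision * recall / (precision + recall)       # ≈ 0.866

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
```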
Credit Approval: Precision might be more important (avoid approving risky applicants) to minimize defaults.
Disease Screening: Recall is critical (don't miss sick patients) even if it means more false alarms.
Spam Detection: Balance both - don't want to miss spam (recall) or flag legitimate emails (precision).
Fraud Detection: High recall to catch fraudsters, but need reasonable precision to avoid blocking legitimate transactions.
Understanding when to use each classification method
| Aspect | Logistic Regression | LDA |
|---|---|---|
| Approach | Discriminative - models P(y\|x) directly | Generative - models P(x\|y) and P(y) |
| Assumptions | No distributional assumptions on features | Features follow Gaussian distribution, equal covariance |
| Training Data Needed | More flexible, works with less data | Needs sufficient data to estimate class distributions |
| Outliers | More robust to outliers | Sensitive to outliers (affects covariance estimates) |
| Computational Cost | Iterative optimization (gradient descent) | Closed-form solution (faster) |
| When Assumptions Met | Still works well | More statistically efficient (better with small data) |
| Best Use Case | General-purpose binary classification | Well-separated Gaussian classes, multi-class problems |
Use Logistic Regression as your default choice for binary classification. It's more flexible, makes fewer assumptions, and is more robust. Switch to LDA when: (1) you have strong evidence features are Gaussian, (2) you have limited training data and assumptions are met, or (3) you need a multi-class classifier with shared covariance structure.
Why is logistic regression called "regression" if it's used for classification?
Historical reasons. The method was developed as an extension of linear regression, using the logistic (sigmoid) function to transform continuous output into probabilities. The name stuck despite its being used primarily for classification. Think of it as "regression to estimate probabilities for classification."
Can logistic regression handle more than two classes?
Yes, through extensions. Use one-vs-rest (train K binary classifiers, one per class) or multinomial logistic regression (softmax regression), which generalizes logistic regression to K classes by replacing the sigmoid with the softmax function. Scikit-learn's LogisticRegression supports both approaches.
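A quick scikit-learn illustration on a standard 3-class dataset (iris is used here only as a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 3 classes

# Recent scikit-learn versions handle the multi-class case with a softmax
# (multinomial) model by default; OneVsRestClassifier is available if you
# prefer K separate binary models.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))              # one probability per class; rows sum to 1
```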
How do I choose the classification threshold?
Adjust it based on the cost of errors. If false negatives are expensive (e.g., missing a cancer diagnosis), lower the threshold (e.g., to 0.3) to increase recall. If false positives are costly (e.g., flagging legitimate transactions as fraud), raise it (e.g., to 0.7) to increase precision. Use ROC or precision-recall curve analysis to find the optimal threshold for your use case.
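In scikit-learn this amounts to thresholding predict_proba yourself instead of calling predict. A self-contained sketch on synthetic data (make_classification is just a stand-in for a real credit dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]        # P(y=1) for each test sample

y_pred_default = (proba >= 0.5).astype(int)    # standard threshold
y_pred_recall  = (proba >= 0.3).astype(int)    # lower threshold: more positives, higher recall
y_pred_precise = (proba >= 0.7).astype(int)    # higher threshold: fewer positives, higher precision
print(y_pred_default.sum(), y_pred_recall.sum(), y_pred_precise.sum())
```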
Should I use regularization with logistic regression?
Almost always yes, especially with many features. L2 regularization (ridge) is standard and prevents overfitting; L1 regularization (lasso) additionally performs feature selection. Most implementations default to L2. The regularization parameter C (the inverse of λ) should be tuned via cross-validation; smaller C means stronger regularization.
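One way to tune C by cross-validation, sketched on synthetic data (scikit-learn also provides LogisticRegressionCV for the same purpose); standardizing first matters because the penalty treats all weights on the same scale:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)      # smaller C = stronger regularization
```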
How do I handle imbalanced classes?
Several approaches: (1) use class weights to penalize minority-class errors more heavily, (2) resample your data (oversample the minority class or undersample the majority), (3) adjust the decision threshold, (4) use appropriate metrics (F1, precision-recall curves) instead of accuracy. See the class imbalance section for more details.
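A minimal class-weights example (option 1 above) on a synthetic 90/10 imbalanced dataset, used here as a stand-in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced problem: roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' reweights each class inversely to its frequency, so mistakes
# on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```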