
Logistic Regression

Master binary classification with logistic regression, from sigmoid functions to maximum likelihood estimation with real-world credit approval examples

What is Logistic Regression?

Binary Classification

Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It predicts the probability that an input belongs to a particular class (typically the "positive" or "1" class) by applying a sigmoid function to a linear combination of features.

Logistic Regression Formula

Probability of positive class:

P(y=1|x) = σ(wᵀx + b) = 1 / (1 + e^(-(wᵀx + b)))

Where σ is the sigmoid (logistic) function:

σ(z) = 1 / (1 + e^(-z))

Output Range

The sigmoid function maps any real-valued input to the range (0, 1), making it perfect for probability estimation. Output values close to 0 indicate negative class, close to 1 indicate positive class.

Decision Boundary

Typically, we classify as positive if P(y=1|x) ≥ 0.5, which occurs when wᵀx + b ≥ 0. This creates a linear decision boundary in feature space, separating the two classes.
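Here is a minimal NumPy sketch of the model and its 0.5-threshold decision rule; the weights w and bias b below are made-up placeholder values, not fitted parameters:

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # P(y=1 | x) = sigmoid(w^T x + b), computed for every row of X
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    # Positive class when the probability crosses the threshold,
    # equivalently when w^T x + b >= 0 for the default threshold of 0.5
    return (predict_proba(X, w, b) >= threshold).astype(int)

w = np.array([1.5, -0.8])            # placeholder weights
b = 0.2                              # placeholder bias
X = np.array([[1.0, 0.5],
              [-2.0, 1.0]])
print(predict_proba(X, w, b))        # probabilities in (0, 1)
print(predict(X, w, b))              # 0/1 class labels
```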

The Sigmoid Function

Understanding the mathematical transformation at the heart of logistic regression

Properties of the Sigmoid Function

Key Properties

  • Output range: (0, 1) - perfect for probabilities
  • σ(0) = 0.5 - the curve is symmetric about the point (0, 0.5)
  • As z → ∞, σ(z) → 1
  • As z → -∞, σ(z) → 0
  • Smooth and differentiable everywhere
  • Derivative: σ'(z) = σ(z)(1 - σ(z))
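The derivative identity in the list above is easy to check numerically; a small sketch assuming NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))                     # sigma'(z) = sigma(z) * (1 - sigma(z))
print(np.allclose(numeric, analytic))                        # True
print(sigmoid(0.0))                                          # 0.5
print(sigmoid(50.0), sigmoid(-50.0))                         # ~1.0 and ~0.0 at the extremes
```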

Sample Values

| z  | σ(z)  | Interpretation |
|----|-------|----------------|
| -5 | 0.007 | Very unlikely  |
| -2 | 0.119 | Unlikely       |
| 0  | 0.500 | Uncertain      |
| 2  | 0.881 | Likely         |
| 5  | 0.993 | Very likely    |

Log-Odds (Logit) Interpretation

The logit function is the inverse of the sigmoid. It transforms probabilities back to the real line:

logit(p) = ln(p / (1-p)) = wᵀx + b

The term p/(1-p) is called the odds, and its logarithm is the log-odds. This shows that logistic regression models the log-odds as a linear function of features.

Practical Meaning

If the probability of credit approval is 0.8, the odds are 0.8/0.2 = 4, meaning approval is four times as likely as rejection. The log-odds is ln(4) ≈ 1.39. Each one-unit increase in a feature adds its weight to this log-odds value.
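A short sketch of the probability → odds → log-odds conversion used in this example:

```python
import math

p = 0.8                                   # P(approval)
odds = p / (1 - p)                        # 4.0: approval is four times as likely as rejection
log_odds = math.log(odds)                 # ~1.386, and this equals w^T x + b for the applicant
p_back = 1 / (1 + math.exp(-log_odds))    # the sigmoid inverts the logit: back to 0.8
print(odds, round(log_odds, 2), p_back)
```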

Credit Approval Prediction Example

Using logistic regression to predict loan approval decisions

Dataset Overview

A bank has historical data on 1,000 loan applications. Here's a sample of 8 applications:

| ID | Income  | Age | Debt    | Credit Score | Years Employed | Approved |
|----|---------|-----|---------|--------------|----------------|----------|
| 1  | $45,000 | 28  | $12,000 | 680          | 3              | No       |
| 2  | $75,000 | 35  | $8,000  | 750          | 8              | Yes      |
| 3  | $55,000 | 42  | $25,000 | 620          | 12             | No       |
| 4  | $95,000 | 38  | $15,000 | 780          | 10             | Yes      |
| 5  | $32,000 | 25  | $18,000 | 590          | 2              | No       |
| 6  | $68,000 | 45  | $5,000  | 720          | 15             | Yes      |
| 7  | $52,000 | 31  | $10,000 | 695          | 5              | Yes      |
| 8  | $38,000 | 29  | $22,000 | 610          | 3              | No       |
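A sketch of how such a model could be fit with scikit-learn (assumed available). The eight rows above are far too few to recover the illustrative coefficients shown below, so treat this purely as a template; the feature-scaling step is an added assumption, not part of the original example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Columns: income, age, debt, credit_score, years_employed (the 8 sample applications)
X = np.array([
    [45000, 28, 12000, 680,  3],
    [75000, 35,  8000, 750,  8],
    [55000, 42, 25000, 620, 12],
    [95000, 38, 15000, 780, 10],
    [32000, 25, 18000, 590,  2],
    [68000, 45,  5000, 720, 15],
    [52000, 31, 10000, 695,  5],
    [38000, 29, 22000, 610,  3],
])
y = np.array([0, 1, 0, 1, 0, 1, 1, 0])   # Approved: No = 0, Yes = 1

# Standardizing keeps the optimizer well-behaved; coefficients are then on the scaled features
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)
print(model.named_steps["logisticregression"].coef_)
print(model.predict_proba(X)[:, 1])      # P(approved) for each applicant
```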

Learned Model (Example)

After training on the full dataset, we obtain a model like:

z = 0.00003 × income + 0.02 × age - 0.00008 × debt + 0.012 × credit_score + 0.15 × years_employed - 10.5

P(approved = 1) = 1 / (1 + e^(-z))

Interpretation:

  • Higher income increases approval probability (positive coefficient)
  • Higher debt decreases approval probability (negative coefficient)
  • Higher credit score strongly increases approval probability
  • More years employed increases approval probability
  • Age has a small positive effect

Making a Prediction

Example applicant:

  • Income: $60,000
  • Age: 32
  • Debt: $15,000
  • Credit Score: 710
  • Years Employed: 6

z = 0.00003(60000) + 0.02(32) - 0.00008(15000) + 0.012(710) + 0.15(6) - 10.5

z = 1.8 + 0.64 - 1.2 + 8.52 + 0.9 - 10.5 = 0.16

P(approved) = 1 / (1 + e^(-0.16)) = 1 / (1 + 0.852) ≈ 0.54

Prediction: APPROVED (54% probability)

Since P(approved) = 0.54 > 0.5, we classify this application as approved with moderate confidence.
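The same arithmetic in a few lines of Python, with the illustrative coefficients hard-coded:

```python
import math

weights = {"income": 0.00003, "age": 0.02, "debt": -0.00008,
           "credit_score": 0.012, "years_employed": 0.15}
bias = -10.5
applicant = {"income": 60000, "age": 32, "debt": 15000,
             "credit_score": 710, "years_employed": 6}

z = sum(weights[k] * applicant[k] for k in weights) + bias   # 0.16
p = 1 / (1 + math.exp(-z))                                   # ~0.54
decision = "APPROVED" if p >= 0.5 else "REJECTED"
print(f"z = {z:.2f}, P(approved) = {p:.2f}, decision = {decision}")
```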

Maximum Likelihood Estimation (MLE)

How logistic regression finds optimal parameters

Likelihood Function

Unlike linear regression which minimizes squared error, logistic regression maximizes the likelihood of observing the training data. For a single sample (xᵢ, yᵢ):

P(yᵢ | xᵢ) = p̂ᵢ^yᵢ × (1 - p̂ᵢ)^(1-yᵢ)

where p̂ᵢ = σ(wᵀxᵢ + b) is the predicted probability

This formula elegantly handles both classes: when yᵢ=1, it equals p̂ᵢ; when yᵢ=0, it equals (1-p̂ᵢ).

Log-Likelihood (Loss Function)

Because a product of many probabilities is numerically unstable and awkward to optimize, we work with the log-likelihood instead. Taking its negative gives us the cross-entropy loss (also called log loss or logistic loss):

Negative log-likelihood (to minimize):

ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ log(p̂ᵢ) - (1-yᵢ) log(1-p̂ᵢ)]

Equivalently (with β = (w; b) and x̂ᵢ augmented features):

ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ βᵀx̂ᵢ + log(1 + e^(βᵀx̂ᵢ))]
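A minimal NumPy implementation of the (per-sample averaged) cross-entropy loss; the clipping is a standard numerical safeguard and not part of the formula above:

```python
import numpy as np

def cross_entropy(y_true, p_hat, eps=1e-12):
    # Clip so log(0) never occurs when a prediction saturates at exactly 0 or 1
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return np.mean(-y_true * np.log(p_hat) - (1 - y_true) * np.log(1 - p_hat))

y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(cross_entropy(y, p))   # average negative log-likelihood of the three samples
```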

Why Cross-Entropy Loss?

Advantages

  • Convex: Guarantees global optimum
  • Differentiable: Can use gradient-based optimization
  • Probabilistic: Directly models probability distribution
  • Penalizes confidence: Heavily penalizes wrong but confident predictions

Loss Behavior

Correct prediction with high confidence:

y=1, p̂=0.95 → loss ≈ 0.05 (small)

Wrong prediction with high confidence:

y=1, p̂=0.05 → loss ≈ 3.0 (large)

Uncertain prediction:

y=1, p̂=0.50 → loss ≈ 0.69 (medium)
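Each of these numbers is just -log(p̂) for a positive example (y = 1), which a short loop confirms:

```python
import math

for p_hat in (0.95, 0.05, 0.50):
    # Per-sample loss for y = 1 is -log(p_hat)
    print(p_hat, round(-math.log(p_hat), 2))   # 0.05, 3.0, 0.69
```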

Solving with Gradient Descent

Iterative optimization for finding optimal weights

Gradient Descent Algorithm

Since cross-entropy loss is convex but has no closed-form solution, we use gradient descent to iteratively find the optimal parameters:

Update Rules

1. Initialize: w ← random values, b ← 0

2. Repeat until convergence:

w ← w - α × ∂ℓ/∂w

b ← b - α × ∂ℓ/∂b

where α is the learning rate (step size)

Gradient Computation

The gradient of cross-entropy loss has a remarkably clean form:

∂ℓ/∂w = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ) xᵢ

∂ℓ/∂b = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ)

The gradient is simply the prediction error (p̂ᵢ - yᵢ) averaged over the samples and weighted by the features; the 1/m factor corresponds to averaging the loss over the m samples. This clean form comes from the special relationship between the sigmoid function and the cross-entropy loss.
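Putting the update rules and the gradient formulas together, a from-scratch batch gradient descent sketch (NumPy only; the learning rate and iteration count are arbitrary choices, and zero initialization is fine here because the loss is convex):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    m, n = X.shape
    w = np.zeros(n)                      # weights (zero init is safe: convex loss)
    b = 0.0                              # bias
    for _ in range(n_iters):
        p_hat = sigmoid(X @ w + b)       # current predicted probabilities
        error = p_hat - y                # (prediction - actual)
        grad_w = X.T @ error / m         # dL/dw = mean of (p_hat - y) * x
        grad_b = error.mean()            # dL/db = mean of (p_hat - y)
        w -= lr * grad_w                 # gradient descent updates
        b -= lr * grad_b
    return w, b

# Tiny synthetic example: positive class when the single feature is large
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
print(w, b)
print(sigmoid(X @ w + b).round(2))       # probabilities rise with the feature value
```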

Variants of Gradient Descent

Batch GD

Uses all training samples in each iteration

Pro: Stable convergence
Con: Slow for large datasets

Stochastic GD

Uses one sample at a time

Pro: Fast updates, escapes local minima
Con: Noisy convergence

Mini-Batch GD

Uses small batches (32-256 samples)

Pro: Best of both worlds, GPU-efficient
Con: Hyperparameter tuning needed
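As a concrete illustration of the mini-batch variant, here is one epoch of updates on shuffled slices of the data (a sketch building on the batch implementation above):

```python
import numpy as np

def minibatch_epoch(X, y, w, b, lr=0.1, batch_size=32, seed=0):
    # One pass over the data: shuffle, then update on each mini-batch
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        rows = order[start:start + batch_size]
        p_hat = 1.0 / (1.0 + np.exp(-(X[rows] @ w + b)))
        error = p_hat - y[rows]
        w = w - lr * X[rows].T @ error / len(rows)   # gradient on this batch only
        b = b - lr * error.mean()
    return w, b
```

Stochastic GD is the batch_size=1 special case of the same loop.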

Advanced Optimizers

Modern implementations often use more sophisticated optimizers that adapt step sizes or exploit curvature information:

  • Adam: Combines momentum with adaptive learning rates (most popular)
  • RMSprop: Adapts learning rate based on recent gradient magnitudes
  • Newton's Method: Uses second-order information (Hessian) for faster convergence

Model Evaluation Metrics

Measuring logistic regression performance beyond simple accuracy

Confusion Matrix

For our credit approval model on a test set of 100 applications:

|                      | Predicted: Rejected (0) | Predicted: Approved (1) |
|----------------------|-------------------------|-------------------------|
| Actual: Rejected (0) | 45 (True Negative)      | 8 (False Positive)      |
| Actual: Approved (1) | 5 (False Negative)      | 42 (True Positive)      |

Performance Metrics

Accuracy

87.0%

(TP + TN) / Total = 87 / 100

Overall correctness, but can be misleading with imbalanced classes

Precision

84.0%

TP / (TP + FP) = 42 / 50

Of predicted approvals, what fraction were correct?

Recall (Sensitivity)

89.4%

TP / (TP + FN) = 42 / 47

Of actual approvals, what fraction did we catch?

F1 Score

86.6%

2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall
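All four numbers follow directly from the confusion matrix counts; a quick check:

```python
TP, TN, FP, FN = 42, 45, 8, 5

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 0.87
precision = TP / (TP + FP)                                  # 0.84
recall    = TP / (TP + FN)                                  # ~0.894
f1        = 2 * precision * recall / (precision + recall)   # ~0.866
print(accuracy, precision, round(recall, 3), round(f1, 3))
```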

Which Metric to Optimize?

Credit Approval: Precision might be more important (avoid approving risky applicants) to minimize defaults.

Disease Screening: Recall is critical (don't miss sick patients) even if it means more false alarms.

Spam Detection: Balance both - don't want to miss spam (recall) or flag legitimate emails (precision).

Fraud Detection: High recall to catch fraudsters, but need reasonable precision to avoid blocking legitimate transactions.

Logistic Regression vs Linear Discriminant Analysis

Understanding when to use each classification method

  • Approach: Logistic Regression is discriminative (models P(y|x) directly); LDA is generative (models P(x|y) and P(y)).
  • Assumptions: Logistic Regression makes no distributional assumptions on features; LDA assumes Gaussian features with equal class covariances.
  • Training data needed: Logistic Regression is more flexible and works with less data; LDA needs sufficient data to estimate the class distributions.
  • Outliers: Logistic Regression is more robust; LDA is sensitive to outliers (they distort covariance estimates).
  • Computational cost: Logistic Regression uses iterative optimization (gradient descent); LDA has a closed-form solution (faster).
  • When assumptions are met: Logistic Regression still works well; LDA is more statistically efficient (better with small data).
  • Best use case: Logistic Regression for general-purpose binary classification; LDA for well-separated Gaussian classes and multi-class problems.
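Both methods are implemented in scikit-learn, so comparing them on your own data takes only a few lines; a sketch using a synthetic dataset purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

log_reg = LogisticRegression(max_iter=1000)
lda = LinearDiscriminantAnalysis()

print("LogisticRegression CV accuracy:", cross_val_score(log_reg, X, y, cv=5).mean())
print("LDA CV accuracy:               ", cross_val_score(lda, X, y, cv=5).mean())
```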

Recommendation

Use Logistic Regression as your default choice for binary classification. It's more flexible, makes fewer assumptions, and is more robust. Switch to LDA when: (1) you have strong evidence features are Gaussian, (2) you have limited training data and assumptions are met, or (3) you need a multi-class classifier with shared covariance structure.

Common Questions About Logistic Regression

Why is it called "logistic regression" if it's for classification?

Historical reasons. The method was developed as an extension of linear regression, using the logistic (sigmoid) function to transform continuous output into probabilities. The name stuck despite being used primarily for classification. Think of it as "regression to estimate probabilities for classification."

Can I use logistic regression for multi-class classification?

Yes, through extensions. Use One-vs-Rest (train K binary classifiers, one per class) or multinomial logistic regression (softmax regression), which generalizes logistic regression to K classes using the softmax function instead of sigmoid. Scikit-learn's LogisticRegression supports both approaches.

How do I choose the decision threshold (default is 0.5)?

Adjust based on cost of errors. If false negatives are expensive (e.g., missing cancer diagnosis), lower the threshold to 0.3 to increase recall. If false positives are costly (e.g., flagging legitimate transactions as fraud), raise it to 0.7 to increase precision. Use ROC curve analysis to find the optimal threshold for your use case.
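With scikit-learn this amounts to thresholding predict_proba yourself instead of calling predict; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)   # stand-in for real data
model = LogisticRegression(max_iter=1000).fit(X, y)

probs = model.predict_proba(X)[:, 1]                 # P(y = 1) for each sample
recall_first    = (probs >= 0.3).astype(int)         # lower threshold: fewer false negatives
precision_first = (probs >= 0.7).astype(int)         # higher threshold: fewer false positives
print(recall_first.sum(), precision_first.sum())     # the lower threshold flags more positives
```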

Should I regularize logistic regression?

Almost always yes, especially with many features. L2 regularization (Ridge) is standard and prevents overfitting. L1 regularization (Lasso) adds feature selection. Most implementations default to L2. The regularization parameter C (inverse of λ) should be tuned via cross-validation - smaller C means more regularization.
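In scikit-learn the relevant knobs are penalty and C; a sketch of tuning C by cross-validation (the grid values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

search = GridSearchCV(
    LogisticRegression(penalty="l2", max_iter=1000),   # L2 (Ridge) is the default penalty
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},         # smaller C = stronger regularization
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```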

What if my classes are severely imbalanced (e.g., 1% positive)?

Several approaches: (1) Use class weights to penalize minority class errors more heavily, (2) Resample your data (oversample minority or undersample majority), (3) Adjust decision threshold, (4) Use appropriate metrics (F1, precision-recall curves instead of accuracy). See the class imbalance section for more details.
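For option (1), scikit-learn exposes a class_weight parameter; a brief, purely illustrative sketch on a synthetic dataset with roughly 1% positives:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# weights=[0.99] makes class 0 about 99% of the samples
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain    = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

print("F1, no class weights:       ", f1_score(y_te, plain.predict(X_te)))
print("F1, class_weight='balanced':", f1_score(y_te, weighted.predict(X_te)))
```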