Master binary classification with logistic regression, from sigmoid functions to maximum likelihood estimation with real-world credit approval examples
Despite its name, logistic regression is a classification algorithm, not a regression algorithm. It predicts the probability that an input belongs to a particular class (typically the "positive" or "1" class) by applying a sigmoid function to a linear combination of features.
Probability of positive class:
P(y=1|x) = σ(wᵀx + b) = 1 / (1 + e^(-(wᵀx + b)))
Where σ is the sigmoid (logistic) function:
σ(z) = 1 / (1 + e^(-z))
The sigmoid function maps any real-valued input to the range (0, 1), making it well suited to probability estimation. Outputs close to 0 indicate the negative class; outputs close to 1 indicate the positive class.
Typically, we classify as positive if P(y=1|x) ≥ 0.5, which occurs when wᵀx + b ≥ 0. This creates a linear decision boundary in feature space, separating the two classes.
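This decision rule is short enough to sketch directly. Below is a minimal NumPy version; the weight vector w and bias b are made-up values used purely for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """P(y=1|x) = sigmoid of the linear score w·x + b, for each row of X."""
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    """Label 1 when the probability clears the threshold (0.5 corresponds to w·x + b >= 0)."""
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Hypothetical weights for two features
w = np.array([1.5, -2.0])
b = 0.25
X = np.array([[0.4, 0.1],
              [0.1, 0.9]])
print(predict_proba(X, w, b))  # probabilities in (0, 1)
print(predict(X, w, b))        # hard 0/1 labels from the linear decision boundary
```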
Understanding the mathematical transformation at the heart of logistic regression
| z | σ(z) | Interpretation |
|---|---|---|
| -5 | 0.007 | Very unlikely |
| -2 | 0.119 | Unlikely |
| 0 | 0.500 | Uncertain |
| 2 | 0.881 | Likely |
| 5 | 0.993 | Very likely |
The logit function is the inverse of the sigmoid. It transforms probabilities back to the real line:
logit(p) = ln(p / (1-p)) = wᵀx + b
The term p/(1-p) is called the odds, and its logarithm is the log-odds. This shows that logistic regression models the log-odds as a linear function of features.
If the probability of credit approval is 0.8, the odds are 0.8/0.2 = 4, meaning approval is 4 times more likely than rejection. The log-odds is ln(4) ≈ 1.39. Each unit increase in a feature contributes its weight to this log-odds value.
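A quick script reproduces this arithmetic; the 0.15 coefficient at the end is a hypothetical feature weight, used only to show the effect of a one-unit increase:

```python
import math

p = 0.8                        # probability of approval
odds = p / (1 - p)             # 0.8 / 0.2 = 4.0
log_odds = math.log(odds)      # ln(4) ≈ 1.386

# A one-unit increase in a feature adds its weight to the log-odds,
# which multiplies the odds by e^weight.
weight = 0.15                  # hypothetical feature weight
new_odds = math.exp(log_odds + weight)
print(odds, log_odds, new_odds)   # 4.0, ~1.386, ~4.65
```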
Using logistic regression to predict loan approval decisions
A bank has historical data on 1,000 loan applications. Here's a sample of 8 applications:
| ID | Income | Age | Debt | Credit Score | Years Employed | Approved |
|---|---|---|---|---|---|---|
| 1 | $45,000 | 28 | $12,000 | 680 | 3 | No |
| 2 | $75,000 | 35 | $8,000 | 750 | 8 | Yes |
| 3 | $55,000 | 42 | $25,000 | 620 | 12 | No |
| 4 | $95,000 | 38 | $15,000 | 780 | 10 | Yes |
| 5 | $32,000 | 25 | $18,000 | 590 | 2 | No |
| 6 | $68,000 | 45 | $5,000 | 720 | 15 | Yes |
| 7 | $52,000 | 31 | $10,000 | 695 | 5 | Yes |
| 8 | $38,000 | 29 | $22,000 | 610 | 3 | No |
After training on the full dataset, we obtain a model like:
z = 0.00003 × income
+ 0.02 × age
- 0.00008 × debt
+ 0.012 × credit_score
+ 0.15 × years_employed
- 10.5
P(approved = 1) = 1 / (1 + e^(-z))
Example applicant: $60,000 income, age 32, $15,000 debt, credit score 710, 6 years employed:
z = 0.00003(60000) + 0.02(32) - 0.00008(15000) + 0.012(710) + 0.15(6) - 10.5
z = 1.8 + 0.64 - 1.2 + 8.52 + 0.9 - 10.5 = 0.16
P(approved) = 1 / (1 + e^(-0.16)) = 1 / (1 + 0.852) ≈ 0.54
Prediction: APPROVED (54% probability)
Since P(approved) = 0.54 > 0.5, we classify this application as approved with moderate confidence.
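The same calculation in code, using the coefficients of the fitted model above and the applicant's feature values from the hand calculation:

```python
import numpy as np

# Coefficients from the trained model: income, age, debt, credit_score, years_employed
w = np.array([0.00003, 0.02, -0.00008, 0.012, 0.15])
b = -10.5

# Example applicant: $60,000 income, age 32, $15,000 debt, 710 credit score, 6 years employed
x = np.array([60_000, 32, 15_000, 710, 6])

z = w @ x + b                         # ≈ 0.16
p_approved = 1.0 / (1.0 + np.exp(-z))

print(f"z = {z:.2f}, P(approved) = {p_approved:.2f}")
print("APPROVED" if p_approved >= 0.5 else "REJECTED")
```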
How logistic regression finds optimal parameters
Unlike linear regression, which minimizes squared error, logistic regression maximizes the likelihood of the observed training labels. For a single sample (xᵢ, yᵢ):
P(yᵢ | xᵢ) = p̂ᵢ^yᵢ × (1 - p̂ᵢ)^(1-yᵢ)
where p̂ᵢ = σ(wᵀxᵢ + b) is the predicted probability
This formula elegantly handles both classes: when yᵢ=1, it equals p̂ᵢ; when yᵢ=0, it equals (1-p̂ᵢ).
Because the likelihood of the full dataset is a product of many small probabilities, we maximize the log-likelihood instead: the product becomes a sum, which is both easier to optimize and numerically more stable. Taking the negative gives us the cross-entropy loss (also called log loss or logistic loss):
Negative log-likelihood (to minimize):
ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ log(p̂ᵢ) - (1-yᵢ) log(1-p̂ᵢ)]
Equivalently, with β = (w; b) and x̂ᵢ = (xᵢ; 1) the augmented feature vector:
ℓ(β) = Σᵢ₌₁ᵐ [-yᵢ βᵀx̂ᵢ + log(1 + e^(βᵀx̂ᵢ))]
Correct prediction with high confidence:
y=1, p̂=0.95 → loss ≈ 0.05 (small)
Wrong prediction with high confidence:
y=1, p̂=0.05 → loss ≈ 3.0 (large)
Uncertain prediction:
y=1, p̂=0.50 → loss ≈ 0.69 (medium)
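These three loss values are easy to verify with a one-line helper (it assumes p̂ is never exactly 0 or 1):

```python
import numpy as np

def cross_entropy(y, p_hat):
    """Per-sample log loss; assumes 0 < p_hat < 1."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

for p_hat in (0.95, 0.05, 0.50):
    print(f"y=1, p_hat={p_hat:.2f} -> loss = {cross_entropy(1, p_hat):.2f}")
# 0.05, 3.00, 0.69 — matching the three scenarios above
```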
Iterative optimization for finding optimal weights
Since the cross-entropy loss is convex but its minimizer has no closed-form expression, we use gradient descent to find the optimal parameters iteratively:
1. Initialize: w ← random values, b ← 0
2. Repeat until convergence:
w ← w - α × ∂ℓ/∂w
b ← b - α × ∂ℓ/∂b
where α is the learning rate (step size)
The gradient of the cross-entropy loss (averaged over the m training samples) has a remarkably clean form:
∂ℓ/∂w = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ) xᵢ
∂ℓ/∂b = (1/m) Σᵢ₌₁ᵐ (p̂ᵢ - yᵢ)
The gradient is simply the average of (prediction - actual), weighted by the corresponding feature values. This elegance comes from the special relationship between the sigmoid function and the cross-entropy loss.
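A compact batch gradient-descent implementation of these update rules; this is a teaching sketch on synthetic data, with no regularization and a fixed iteration count rather than a convergence test:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the averaged cross-entropy loss."""
    m, n = X.shape
    w = np.zeros(n)        # zero init works here because the loss is convex
    b = 0.0
    for _ in range(n_iters):
        p_hat = sigmoid(X @ w + b)        # predicted probabilities
        error = p_hat - y                 # (prediction - actual)
        w -= lr * (X.T @ error) / m       # dL/dw: average of error-weighted features
        b -= lr * error.mean()            # dL/db: average error
    return w, b

# Tiny synthetic sanity check: one informative feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)
w, b = fit_logistic(X, y)
print(w, b)   # expect a clearly positive weight on the informative feature
```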
| Variant | Samples per Update | Pro | Con |
|---|---|---|---|
| Batch gradient descent | All training samples each iteration | Stable convergence | Slow for large datasets |
| Stochastic gradient descent (SGD) | One sample at a time | Fast updates, escapes local minima | Noisy convergence |
| Mini-batch gradient descent | Small batches (32-256 samples) | Best of both worlds, GPU-efficient | Hyperparameter tuning needed |
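For comparison with the batch version above, here is a minimal mini-batch variant with an assumed batch size of 32; the gradient is identical, just computed on a random subset each step:

```python
import numpy as np

def fit_logistic_minibatch(X, y, lr=0.1, epochs=50, batch_size=32, seed=0):
    """Mini-batch gradient descent: same gradient, computed on random batches."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(epochs):
        order = rng.permutation(m)                 # reshuffle samples each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            p_hat = 1.0 / (1.0 + np.exp(-(X[idx] @ w + b)))
            error = p_hat - y[idx]
            w -= lr * (X[idx].T @ error) / len(idx)
            b -= lr * error.mean()
    return w, b
```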
Modern implementations often use sophisticated optimizers, such as Adam, RMSProp, and AdaGrad, that adapt the learning rate per parameter.
Measuring logistic regression performance beyond simple accuracy
For our credit approval model on a test set of 100 applications:
| | Predicted: Rejected (0) | Predicted: Approved (1) |
|---|---|---|
| Actual: Rejected (0) | 45 (True Negatives) | 8 (False Positives) |
| Actual: Approved (1) | 5 (False Negatives) | 42 (True Positives) |
| Metric | Value | Formula | Interpretation |
|---|---|---|---|
| Accuracy | 87.0% | (TP + TN) / Total = 87 / 100 | Overall correctness, but can be misleading with imbalanced classes |
| Precision | 84.0% | TP / (TP + FP) = 42 / 50 | Of predicted approvals, what fraction were correct? |
| Recall | 89.4% | TP / (TP + FN) = 42 / 47 | Of actual approvals, what fraction did we catch? |
| F1 Score | 86.6% | 2 × (Precision × Recall) / (Precision + Recall) | Harmonic mean of precision and recall |
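Recomputing the four metrics directly from the confusion-matrix counts above:

```python
tn, fp, fn, tp = 45, 8, 5, 42   # counts from the confusion matrix

accuracy  = (tp + tn) / (tp + tn + fp + fn)              # 0.870
precision = tp / (tp + fp)                               # 0.840
recall    = tp / (tp + fn)                               # ≈ 0.894
f1 = 2 * precision * recall / (precision + recall)       # ≈ 0.866

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  f1={f1:.3f}")
```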
Credit Approval: Precision might be more important (avoid approving risky applicants) to minimize defaults.
Disease Screening: Recall is critical (don't miss sick patients) even if it means more false alarms.
Spam Detection: Balance both - don't want to miss spam (recall) or flag legitimate emails (precision).
Fraud Detection: High recall to catch fraudsters, but need reasonable precision to avoid blocking legitimate transactions.
Understanding when to use each classification method
| Aspect | Logistic Regression | LDA |
|---|---|---|
| Approach | Discriminative - models P(y\|x) directly | Generative - models P(x\|y) and P(y) |
| Assumptions | No distributional assumptions on features | Features follow Gaussian distribution, equal covariance |
| Training Data Needed | More flexible, works with less data | Needs sufficient data to estimate class distributions |
| Outliers | More robust to outliers | Sensitive to outliers (affects covariance estimates) |
| Computational Cost | Iterative optimization (gradient descent) | Closed-form solution (faster) |
| When Assumptions Met | Still works well | More statistically efficient (better with small data) |
| Best Use Case | General-purpose binary classification | Well-separated Gaussian classes, multi-class problems |
Use Logistic Regression as your default choice for binary classification. It's more flexible, makes fewer assumptions, and is more robust. Switch to LDA when: (1) you have strong evidence features are Gaussian, (2) you have limited training data and assumptions are met, or (3) you need a multi-class classifier with shared covariance structure.
Why is logistic regression called "regression" if it's used for classification?
Historical reasons. The method was developed as an extension of linear regression, using the logistic (sigmoid) function to transform continuous output into probabilities. The name stuck despite its being used primarily for classification. Think of it as "regression to estimate probabilities for classification."
Can logistic regression handle more than two classes?
Yes, through extensions. Use one-vs-rest (train K binary classifiers, one per class) or multinomial logistic regression (softmax regression), which generalizes logistic regression to K classes by replacing the sigmoid with the softmax function. Scikit-learn's LogisticRegression supports both approaches.
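A quick scikit-learn illustration on a standard 3-class dataset (iris is used here only as a convenient stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 3 classes

# Recent scikit-learn versions handle the multi-class case with a softmax
# (multinomial) model by default; OneVsRestClassifier is available if you
# prefer K separate binary models.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))              # one probability per class; rows sum to 1
```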
How do I choose the classification threshold?
Adjust it based on the cost of errors. If false negatives are expensive (e.g., missing a cancer diagnosis), lower the threshold (e.g., to 0.3) to increase recall. If false positives are costly (e.g., flagging legitimate transactions as fraud), raise it (e.g., to 0.7) to increase precision. Use ROC or precision-recall curve analysis to find the optimal threshold for your use case.
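In scikit-learn this amounts to thresholding predict_proba yourself instead of calling predict. A self-contained sketch on synthetic data (make_classification is just a stand-in for a real credit dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]        # P(y=1) for each test sample

y_pred_default = (proba >= 0.5).astype(int)    # standard threshold
y_pred_recall  = (proba >= 0.3).astype(int)    # lower threshold: more positives, higher recall
y_pred_precise = (proba >= 0.7).astype(int)    # higher threshold: fewer positives, higher precision
print(y_pred_default.sum(), y_pred_recall.sum(), y_pred_precise.sum())
```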
Should I use regularization with logistic regression?
Almost always yes, especially with many features. L2 regularization (ridge) is standard and prevents overfitting; L1 regularization (lasso) additionally performs feature selection. Most implementations default to L2. The regularization parameter C (the inverse of λ) should be tuned via cross-validation; smaller C means stronger regularization.
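One way to tune C by cross-validation, sketched on synthetic data (scikit-learn also provides LogisticRegressionCV for the same purpose); standardizing first matters because the penalty treats all weights on the same scale:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(penalty="l2", max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)      # smaller C = stronger regularization
```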
How do I handle imbalanced classes?
Several approaches: (1) use class weights to penalize minority-class errors more heavily, (2) resample your data (oversample the minority class or undersample the majority), (3) adjust the decision threshold, (4) use appropriate metrics (F1, precision-recall curves) instead of accuracy. See the class imbalance section for more details.
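A minimal class-weights example (option 1 above) on a synthetic 90/10 imbalanced dataset, used here as a stand-in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced problem: roughly 90% negatives, 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# 'balanced' reweights each class inversely to its frequency, so mistakes
# on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
print(clf.score(X, y))
```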