Master probabilistic classification using Bayes' theorem and conditional independence
Posterior
P(c | x): probability of the class given the features
Likelihood
P(x | c): probability of the features given the class
Prior
P(c): probability of the class before seeing any data
Choose the class with the highest posterior: ĉ = argmax_c P(c | x) = argmax_c P(c) P(x | c). Since P(x) is constant for all classes, we can ignore it.
Features are assumed conditionally independent given the class:
P(x_1, x_2, ..., x_d | c) = P(x_1 | c) × P(x_2 | c) × ... × P(x_d | c)
This drastically simplifies computation: from exponentially many joint feature combinations down to d individual per-feature probabilities!
For continuous features, assume a normal distribution per class:
P(x_i | c) = (1 / √(2π σ²_ic)) · exp(−(x_i − μ_ic)² / (2σ²_ic))
Estimate μ_ic and σ²_ic from the training data for each class and feature
For count data (e.g., word frequencies):
P(x_i | c) = (count(x_i, c) + α) / (Σ_j count(x_j, c) + α|V|)
α is the Laplace smoothing parameter, |V| is the vocabulary size. Popular for text classification.
For binary features (presence/absence):
P(x_i | c) = p_ic^{x_i} · (1 − p_ic)^{1 − x_i}, where p_ic = P(x_i = 1 | c)
Explicitly models both feature presence AND absence. Good for document classification with binary features.
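As a quick, optional illustration (assuming scikit-learn is available; the estimator names below are scikit-learn's, and the toy data is made up), each variant maps onto a ready-made class:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Toy data (illustration only): 6 samples, 3 features, 2 classes
X_cont = np.array([[1.0, 2.1, 0.3], [0.9, 1.8, 0.2], [1.1, 2.0, 0.4],
                   [3.0, 0.5, 2.2], [2.8, 0.4, 2.0], [3.2, 0.6, 2.4]])
X_counts = np.array([[3, 0, 1], [2, 1, 0], [4, 0, 2],
                     [0, 5, 1], [1, 4, 0], [0, 6, 2]])
y = np.array([0, 0, 0, 1, 1, 1])

GaussianNB().fit(X_cont, y)                  # continuous features: per-class Gaussians
MultinomialNB(alpha=1.0).fit(X_counts, y)    # counts (e.g., word frequencies), Laplace smoothing
BernoulliNB(alpha=1.0).fit(X_counts > 0, y)  # binary presence/absence of each feature
```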
If a word never appears in training for a class, P(word | class) = 0 makes the entire product zero!
One zero probability ruins everything: P(c) × P(x_1 | c) × ... × 0 × ... × P(x_d | c) = 0
α = 1 (Laplace), α = 0.5 (Jeffreys). Adds "pseudo-counts" to avoid zeros.
Vocabulary size |V|, word "free" appears 5 times in spam (total 200 words), α = 1:
P("free" | spam) = (5 + 1) / (200 + |V|)
Word "aardvark" appears 0 times:
P("aardvark" | spam) = (0 + 1) / (200 + |V|) > 0
| Word | Spam Count | Ham Count |
|---|---|---|
| free | 8 | 1 |
| money | 6 | 2 |
| meeting | 1 | 10 |
| Total words | 50 | 100 |
Prior: P(spam) = 0.4, P(ham) = 0.6 (from 40 spam, 60 ham emails)
For a new email containing the words "free" and "money":
Spam Score: P(spam) × P(free|spam) × P(money|spam) = 0.4 × (8/50) × (6/50) = 0.00768
Ham Score: P(ham) × P(free|ham) × P(money|ham) = 0.6 × (1/100) × (2/100) = 0.00012
Prediction: SPAM (0.00768 > 0.00012)
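A minimal sketch of the same arithmetic in Python (counts and priors taken from the table; the email is assumed to contain exactly the words "free" and "money"):

```python
spam_counts = {"free": 8, "money": 6, "meeting": 1}   # per-class word counts from the table
ham_counts  = {"free": 1, "money": 2, "meeting": 10}
spam_total, ham_total = 50, 100
p_spam, p_ham = 0.4, 0.6                              # class priors

email = ["free", "money"]
spam_score, ham_score = p_spam, p_ham
for word in email:
    spam_score *= spam_counts[word] / spam_total      # multiply in P(word | spam)
    ham_score  *= ham_counts[word] / ham_total        # multiply in P(word | ham)

print(spam_score, ham_score)                          # ~0.00768 vs ~0.00012
print("SPAM" if spam_score > ham_score else "HAM")    # SPAM
```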
Conditional probability: P(A | B) = P(A ∩ B) / P(B), the probability of A given that B has occurred, assuming P(B) > 0.
Step 1: Express the joint probability two ways: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
Step 2: Equate the two expressions and solve for P(A | B)
Bayes' Theorem: P(A | B) = P(B | A) P(A) / P(B)
The denominator can be computed by summing over all classes: P(x) = Σ_c P(x | c) P(c). It is often ignored in classification since it is constant for all classes.
Can drop P(x) since it's the same for all classes!
Naive assumption: Features are conditionally independent given the class
This simplifies likelihood computation dramatically!
This assumption is often violated! Real-world features are usually correlated.
Example: In spam detection, words "free" and "money" likely appear together.
Yet it works well in practice because:
- Classification only needs the correct class to receive the highest score, not well-calibrated probabilities
- Errors in the individual probability estimates often do not change which class wins the argmax
Without Naive Assumption
For d binary features: 2^d − 1 parameters per class to specify the full joint distribution
Exponential in d! Infeasible for large d.
With Naive Assumption
For d binary features: d parameters per class (one per feature)
Linear in d! Scales to high dimensions.
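To make the gap concrete, a two-line check (the choice of d = 30 is just an illustration):

```python
d = 30
print(2 ** d - 1)  # full joint over d binary features: 1,073,741,823 parameters per class
print(d)           # naive Bayes: 30 per-feature probabilities per class
```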
Assume each feature follows a Gaussian distribution within each class:
P(x_i | c) = (1 / √(2π σ²_ic)) · exp(−(x_i − μ_ic)² / (2σ²_ic))
Parameters: μ_ic (mean) and σ_ic (std dev) for feature i in class c
For the n_c training examples from class c, maximize the log-likelihood of feature i:
ℓ(μ_ic, σ²_ic) = Σ_n [ −½ log(2π σ²_ic) − (x_i^(n) − μ_ic)² / (2σ²_ic) ]
Taking the derivative w.r.t. μ_ic and setting it to zero:
μ̂_ic = (1/n_c) Σ_n x_i^(n)
MLE for the mean is the sample mean!
Taking the derivative w.r.t. σ²_ic and setting it to zero:
σ̂²_ic = (1/n_c) Σ_n (x_i^(n) − μ̂_ic)²
MLE for the variance is the sample variance!
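A quick NumPy sanity check that these closed forms are just the per-class sample mean and the (biased, 1/n) sample variance; the data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1_000)   # feature values for one class (synthetic)

mu_hat = x.mean()                                # MLE: sample mean
var_hat = ((x - mu_hat) ** 2).mean()             # MLE: (1/n) * sum of squared deviations
print(mu_hat, var_hat, np.var(x))                # np.var also uses the 1/n estimator by default
```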
Training Phase:
- For each class c:
  - P(c) = n_c / n_total
  - For each feature i:
    - μ_ic = mean(x_i over examples in class c)
    - σ²_ic = var(x_i over examples in class c)

Prediction Phase:
- For a new example x:
  - For each class c:
    - score_c = log P(c) + Σ_i log P(x_i | c)
  - Return the class with the highest score
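A compact NumPy sketch of both phases for the Gaussian case (an illustration of the recipe above with names of my own choosing, not a production implementation):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Training phase: per-class prior, per-feature mean and variance."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),      # P(c) = n_c / n_total
            "mean": Xc.mean(axis=0),        # mu_ic for every feature i
            "var": Xc.var(axis=0) + 1e-9,   # sigma^2_ic (tiny constant avoids division by zero)
        }
    return params

def predict_gaussian_nb(x, params):
    """Prediction phase: class maximizing log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)
```

Usage would look like `params = fit_gaussian_nb(X_train, y_train)` followed by `predict_gaussian_nb(x_new, params)`.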
Problem: If a feature value v never appears with a class c in training: P(x_i = v | c) = 0
One zero probability wipes out the entire product!
Solution: Add α (typically 1) to all counts
Without smoothing: P(x_i = v | c) = count(x_i = v, c) / n_c
With Laplace smoothing (α = 1): P(x_i = v | c) = (count(x_i = v, c) + α) / (n_c + α|V_i|)
where |V_i| is the number of possible values for feature i
Laplace smoothing is equivalent to using a symmetric Dirichlet(α, ..., α) prior over the |V_i| possible values of the feature.
With uniform prior (α=1), the posterior mean is the smoothed estimate. This is a form of regularization—prevents overfitting to training data.
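For reference, the standard Dirichlet-multinomial posterior-mean calculation behind this claim (written in LaTeX; θ_v is the probability of value v for feature i in class c):

```latex
\theta \sim \mathrm{Dirichlet}(\alpha, \dots, \alpha)
\quad\Longrightarrow\quad
\mathbb{E}\!\left[\theta_v \mid \text{data}\right]
  = \frac{\operatorname{count}(x_i = v, c) + \alpha}{n_c + \alpha\,|V_i|}
```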
Setup: vocabulary size |V| = 10,000, the spam class contains 5,000 total word occurrences, α = 1. The word "lottery" appears 5 times in spam, "jackpot" appears 0 times.
Without smoothing:
P("lottery"|spam) = 5/5000 = 0.001
P("jackpot"|spam) = 0/5000 = 0
With smoothing:
P("lottery"|spam) = 6/15000 ≈ 0.0004
P("jackpot"|spam) = 1/15000 ≈ 0.000067
Problem: Multiplying many small probabilities leads to underflow
Example: If P(x_i|c) ≈ 0.001 for each of 1000 features: 0.001^1000 = 10^−3000
Far below machine precision! Computed as 0.
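The underflow is easy to reproduce directly in Python:

```python
import math

p = 0.001 ** 1000            # product of 1000 likelihoods of about 0.001
print(p)                     # 0.0 -- the true value 1e-3000 underflows to zero

log_p = 1000 * math.log(0.001)
print(log_p)                 # about -6907.76 -- perfectly representable in log-space
```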
Solution: Work in log-space
Original: predict argmax_c P(c) · Π_i P(x_i | c)
Log-space (using log monotonicity): predict argmax_c [ log P(c) + Σ_i log P(x_i | c) ]
Benefits: no numerical underflow, products become cheap additions, and the argmax is unchanged because log is strictly increasing.
from math import log

def predict(x, prior, params, pdf):
    # prior[c] = P(c); params[c][i] = parameters of feature i under class c
    # pdf(x_i, theta) evaluates P(x_i | c) with parameters theta = params[c][i]
    scores = {}
    for c in prior:
        log_prob = log(prior[c])
        for i, x_i in enumerate(x):
            log_prob += log(pdf(x_i, params[c][i]))
        scores[c] = log_prob
    return max(scores, key=scores.get)   # argmax over classes
Note: Never exponentiate back to probabilities unless needed for interpretation!
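If calibrated probabilities are needed for interpretation, normalize in log-space with the standard log-sum-exp trick (a sketch; `log_scores` stands for the per-class log scores computed above):

```python
import math

def log_scores_to_probs(log_scores):
    """Turn per-class log scores into normalized probabilities without underflow."""
    m = max(log_scores.values())                            # subtract the max before exponentiating
    exps = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

print(log_scores_to_probs({"spam": -700.0, "ham": -705.0}))  # ≈ {'spam': 0.993, 'ham': 0.007}
```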
Training: O(n × d) where n = samples, d = features
Prediction per sample: O(K × d) where K = classes
Space: O(K × d) to store parameters
Extremely efficient!
Linear in both features and classes. Scales to millions of features (e.g., text classification).
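As a small-scale illustration of that scalability claim (assuming scikit-learn; the toy corpus is made up), the sparse bag-of-words pipeline touches each nonzero count only once:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "win a free prize", "meeting at noon", "project meeting agenda"]
labels = [1, 1, 0, 0]                        # 1 = spam, 0 = ham (toy labels)

vec = CountVectorizer()
X = vec.fit_transform(docs)                  # sparse count matrix, one column per vocabulary word
clf = MultinomialNB(alpha=1.0).fit(X, labels)

print(clf.predict(vec.transform(["free prize money"])))   # expected: [1] (spam)
```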