
Naive Bayes

Master probabilistic classification using Bayes' theorem and conditional independence

Bayes' Theorem & Foundation

Probabilistic Classification

Bayes' Theorem

P(Y|\mathbf{X}) = \frac{P(\mathbf{X}|Y) \cdot P(Y)}{P(\mathbf{X})}

Posterior

P(Y|\mathbf{X})

Probability of class given features

Likelihood

P(\mathbf{X}|Y)

Probability of features given class

Prior

P(Y)

Probability of class before seeing data

Classification Rule

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(Y=y) \prod_{i=1}^{n} P(X_i | Y=y)

Choose the class with the highest posterior. Since P(X) is constant for all classes, we can ignore it.
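The decision rule maps directly to a few lines of code. Below is a minimal sketch (the priors and likelihoods lookup structures are hypothetical placeholders, not a fixed API):

# Minimal sketch of the Naive Bayes decision rule.
# priors[c] holds P(Y=c); likelihoods[c][i] maps a value of feature i to P(X_i = value | Y=c).
def nb_predict(x, priors, likelihoods):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, x_i in enumerate(x):
            score *= likelihoods[c][i].get(x_i, 0.0)   # P(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get)                 # class with highest posterior score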

The Naive Independence Assumption

Conditional Independence

Features are assumed conditionally independent given the class:

P(X_1, X_2, ..., X_n | Y) = P(X_1|Y) \cdot P(X_2|Y) \cdot \ldots \cdot P(X_n|Y) = \prod_{i=1}^{n} P(X_i|Y)

This drastically simplifies computation: instead of estimating a probability for every possible combination of feature values, we only estimate individual per-feature probabilities!

Why It Works

  • Massively reduces parameters to estimate
  • Fast training and prediction
  • Works well even when assumption is violated
  • Excels in high dimensions

When It Fails

  • Strong feature dependencies exist
  • Dependencies are critical for classification
  • Example: diagnostic rules requiring feature combinations

Naive Bayes Variants

Gaussian Naive Bayes

For continuous features, assume normal distribution per class:

P(X_i=x | Y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}

Estimate μ_c and σ²_c from the training data for each class and feature
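As a quick, hedged illustration, scikit-learn's GaussianNB fits these per-class means and variances directly (the tiny arrays below are invented for demonstration):

# Sketch: Gaussian Naive Bayes via scikit-learn on a made-up 2-feature dataset.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])  # continuous features
y = np.array([0, 0, 1, 1])                                       # class labels

model = GaussianNB()                      # estimates mu_c and sigma_c^2 per class and feature
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))        # expected: class 0 (highest Gaussian likelihood)
print(model.predict_proba([[1.1, 2.0]]))  # per-class posterior probabilities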

Multinomial Naive Bayes

For count data (e.g., word frequencies):

P(w_i|c) = \frac{\text{count}(w_i, c) + \alpha}{\sum_j \text{count}(w_j, c) + \alpha V}

α is the Laplace smoothing parameter and V is the vocabulary size. Popular for text classification.
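A hedged scikit-learn sketch of this variant (the two-document corpus is invented purely for illustration; alpha=1.0 corresponds to Laplace smoothing):

# Sketch: Multinomial Naive Bayes over word counts with Laplace smoothing (alpha = 1).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["free money now", "meeting agenda attached"]   # toy corpus (made up)
labels = ["spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # word-count feature matrix

clf = MultinomialNB(alpha=1.0)         # alpha adds pseudo-counts to every word
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money"])))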

Bernoulli Naive Bayes

For binary features (presence/absence):

P(X_i|Y=c) = p_{ic}^{X_i}(1-p_{ic})^{1-X_i}

Explicitly models both feature presence AND absence. Good for document classification with binary features.
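A minimal sketch of the per-feature Bernoulli likelihood (the probability value used below is invented for illustration):

# Sketch: Bernoulli likelihood for a single binary feature x_i under class c.
def bernoulli_likelihood(x_i, p_ic):
    # p_ic is the estimated P(X_i = 1 | Y = c); x_i must be 0 or 1
    return p_ic if x_i == 1 else 1.0 - p_ic

print(bernoulli_likelihood(1, 0.8))   # feature present: 0.8
print(bernoulli_likelihood(0, 0.8))   # feature absent: ~0.2 (absence contributes too)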

Laplace Smoothing

Avoiding Zero Probabilities

If a word never appears in the training data for a class, its estimated probability is zero, which makes the entire product zero!

Problem: Zero Counts

P(w_i|c) = \frac{0}{N} = 0

One zero probability ruins the entire product.

Solution: Add-α Smoothing

P(w_i|c) = \frac{\text{count}(w_i,c) + \alpha}{N + \alpha V}

α = 1 (Laplace), α = 0.5 (Jeffreys). Adds "pseudo-counts" to avoid zeros.

Example Calculation

Vocabulary size V = 10,000, word "free" appears 5 times in spam (total 200 words), α = 1:

P(\text{'free'}|\text{spam}) = \frac{5 + 1}{200 + 1 \times 10000} = \frac{6}{10200} \approx 0.000588

Word "aardvark" appears 0 times:

P(\text{'aardvark'}|\text{spam}) = \frac{0 + 1}{10200} \approx 0.000098
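These figures are easy to verify with a short Python sketch that just reproduces the arithmetic above:

# Sketch: Laplace-smoothed word probabilities for the example above.
def smoothed_prob(count, total_words, vocab_size, alpha=1):
    return (count + alpha) / (total_words + alpha * vocab_size)

V, N = 10_000, 200                    # vocabulary size, total words in spam
print(smoothed_prob(5, N, V))         # 'free':     6/10200  ≈ 0.000588
print(smoothed_prob(0, N, V))         # 'aardvark': 1/10200  ≈ 0.000098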

Example: Email Spam Classification

Spam Filter Using Multinomial Naive Bayes

Training Data

Word          Spam Count   Ham Count
free          8            1
money         6            2
meeting       1            10
Total words   50           100

Prior: P(spam) = 0.4, P(ham) = 0.6 (from 40 spam, 60 ham emails)

Classify: "free money"

Spam Score

P(\text{spam}) = 0.4
P(\text{free}|\text{spam}) = 8/50 = 0.16
P(\text{money}|\text{spam}) = 6/50 = 0.12
\text{Score} = 0.4 \times 0.16 \times 0.12 = 0.00768

Ham Score

P(\text{ham}) = 0.6
P(\text{free}|\text{ham}) = 1/100 = 0.01
P(\text{money}|\text{ham}) = 2/100 = 0.02
\text{Score} = 0.6 \times 0.01 \times 0.02 = 0.00012

Prediction: SPAM (0.00768 > 0.00012)
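The same comparison in a short Python sketch, using the counts and priors from the table above (no smoothing, to match the worked numbers):

# Sketch: score the message 'free money' under each class from the example counts.
priors = {"spam": 0.4, "ham": 0.6}
counts = {"spam": {"free": 8, "money": 6, "total": 50},
          "ham":  {"free": 1, "money": 2, "total": 100}}

def score(words, c):
    s = priors[c]
    for w in words:
        s *= counts[c][w] / counts[c]["total"]   # P(word | class), no smoothing
    return s

scores = {c: score(["free", "money"], c) for c in priors}
print(scores)                        # spam ≈ 0.00768, ham ≈ 0.00012
print(max(scores, key=scores.get))   # -> 'spam'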

Bayes' Theorem Complete Derivation

From Conditional Probability to Bayes' Rule

Conditional Probability Definition

P(A|B) = \frac{P(A \cap B)}{P(B)}

Probability of A given B has occurred, assuming P(B) > 0.

Step-by-Step Derivation

Step 1: Express joint probability two ways

P(A \cap B) = P(A|B)P(B)
P(A \cap B) = P(B|A)P(A)

Step 2: Equate and solve for P(A|B)

P(A|B)P(B) = P(B|A)P(A)
P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Bayes' Theorem:

P(\text{class}|\text{features}) = \frac{P(\text{features}|\text{class}) \cdot P(\text{class})}{P(\text{features})}
• P(class|features): Posterior
• P(features|class): Likelihood
• P(class): Prior
• P(features): Evidence

Law of Total Probability (for Evidence)

P(\mathbf{x}) = \sum_{c} P(\mathbf{x}|c)P(c)

Denominator can be computed by summing over all classes. Often ignored in classification since it's constant for all classes.

Classification Rule

\hat{c} = \arg\max_c P(c|\mathbf{x}) = \arg\max_c \frac{P(\mathbf{x}|c)P(c)}{P(\mathbf{x})} = \arg\max_c P(\mathbf{x}|c)P(c)

Can drop P(x) since it's the same for all classes!
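If an actual posterior probability is wanted (e.g., as a confidence score), P(x) can be recovered by normalizing the unnormalized scores. Using the 'free money' example scores from earlier as an illustration:

P(\mathbf{x}) = 0.00768 + 0.00012 = 0.0078
P(\text{spam}|\mathbf{x}) = \frac{0.00768}{0.0078} \approx 0.985
P(\text{ham}|\mathbf{x}) = \frac{0.00012}{0.0078} \approx 0.015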

Naive Independence Assumption

Mathematical Formulation and Implications

The Assumption

Naive assumption: Features are conditionally independent given the class

P(x_1, x_2, \ldots, x_d | c) = P(x_1|c) \cdot P(x_2|c) \cdot \ldots \cdot P(x_d|c) = \prod_{i=1}^{d} P(x_i|c)

This simplifies likelihood computation dramatically!

Why "Naive"?

This assumption is often violated! Real-world features are usually correlated.

Example: In spam detection, words "free" and "money" likely appear together.

Yet it works well in practice because:

  • Classification only needs correct ranking of P(c|x), not exact values
  • Errors in P(x|c) estimates can cancel out
  • Simple model reduces overfitting
  • Very few parameters to estimate

Parameter Count Comparison

Without Naive Assumption

For d binary features:

2^d - 1 \text{ parameters per class}

Exponential in d! Infeasible for large d.

With Naive Assumption

For d binary features:

d \text{ parameters per class}

Linear in d! Scales to high dimensions.
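For concreteness, a quick calculation assuming d = 30 binary features (a value chosen only for illustration):

\text{Without the assumption: } 2^{30} - 1 \approx 1.07 \times 10^9 \text{ parameters per class}
\text{With the assumption: } 30 \text{ parameters per class}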

Full Naive Bayes Formula

P(c|\mathbf{x}) \propto P(c) \prod_{i=1}^{d} P(x_i|c)
\hat{c} = \arg\max_c \left[ P(c) \prod_{i=1}^{d} P(x_i|c) \right]

Gaussian Naive Bayes: MLE Derivation

Parameter Estimation for Continuous Features

Gaussian Likelihood Model

Assume each feature follows a Gaussian distribution:

P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^2}} \exp\left(-\frac{(x_i - \mu_{ic})^2}{2\sigma_{ic}^2}\right)

Parameters: μ_ic (mean) and σ_ic (std dev) for feature i in class c

Maximum Likelihood Estimation for μ

For training data x_1, \ldots, x_n from class c, maximize the log-likelihood:

\mathcal{L}(\mu, \sigma^2) = \sum_{j=1}^{n} \log P(x_j|\mu, \sigma^2)
= \sum_{j=1}^{n} \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_j - \mu)^2}{2\sigma^2}\right]

Taking derivative w.r.t. μ:

\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{j=1}^{n} \frac{x_j - \mu}{\sigma^2} = 0
\sum_{j=1}^{n} x_j - n\mu = 0
\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j

MLE for mean is sample mean!

Maximum Likelihood Estimation for σ²

Taking derivative w.r.t. σ²:

\frac{\partial \mathcal{L}}{\partial \sigma^2} = \sum_{j=1}^{n} \left[-\frac{1}{2\sigma^2} + \frac{(x_j - \mu)^2}{2(\sigma^2)^2}\right] = 0
-\frac{n}{2\sigma^2} + \frac{\sum_j(x_j - \mu)^2}{2(\sigma^2)^2} = 0
\hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{\mu})^2

MLE for variance is sample variance!

Algorithm Summary

Training Phase:

  For each class c:
    P(c) = n_c / n_total
    For each feature i:
      μ_ic  = mean(x_i for examples in class c)
      σ²_ic = var(x_i for examples in class c)

Prediction Phase:

  For new example x:
    For each class c:
      score_c = log P(c) + Σ_i log P(x_i|c)
    Return: class with highest score
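A compact from-scratch sketch of this algorithm in NumPy (function names are my own; a small variance floor eps is added for numerical safety, which the outline above does not include):

# Sketch: Gaussian Naive Bayes training and log-space prediction with NumPy.
import numpy as np

def fit_gnb(X, y, eps=1e-9):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),       # P(c) = n_c / n_total
            "mu":    Xc.mean(axis=0),        # per-feature MLE mean
            "var":   Xc.var(axis=0) + eps,   # per-feature MLE variance (+ floor)
        }
    return params

def predict_gnb(x, params):
    def log_score(p):
        # log P(c) + sum_i log N(x_i; mu_ic, var_ic)
        return (np.log(p["prior"])
                - 0.5 * np.sum(np.log(2 * np.pi * p["var"]))
                - 0.5 * np.sum((x - p["mu"]) ** 2 / p["var"]))
    return max(params, key=lambda c: log_score(params[c]))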

Laplace Smoothing Derivation

Handling Zero Probabilities

The Zero Probability Problem

Problem: If a feature value never appears with a class in training:

P(x_i|c) = 0 \implies P(c|\mathbf{x}) = 0

One zero probability wipes out the entire product!

Laplace (Add-One) Smoothing

Solution: Add α (typically 1) to all counts

Without smoothing:

P(x_i = v | c) = \frac{\text{count}(x_i = v, c)}{\text{count}(c)}

With Laplace smoothing (α=1):

P(x_i = v | c) = \frac{\text{count}(x_i = v, c) + \alpha}{\text{count}(c) + \alpha |V_i|}

where |V_i| is the number of possible values for feature i

Bayesian Interpretation

Laplace smoothing is equivalent to using a Dirichlet prior:

P(\theta) = \text{Dir}(\alpha, \alpha, \ldots, \alpha)

With uniform prior (α=1), the posterior mean is the smoothed estimate. This is a form of regularization—prevents overfitting to training data.
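As a brief check of this claim: with a Dirichlet(α, …, α) prior over the K = |V_i| possible feature values and the observed counts, the posterior mean of each value's probability is

\mathbb{E}[\theta_v \mid \text{data}] = \frac{\text{count}(x_i = v, c) + \alpha}{\text{count}(c) + \alpha K}

which is exactly the smoothed estimate above.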

Example: Text Classification

Setup:

  • Vocabulary size: |V| = 10,000 words
  • Class "spam": 100 documents, 5000 total words
  • Word "lottery" appears 5 times in spam

Without smoothing:

P("lottery"|spam) = 5/5000 = 0.001

P("jackpot"|spam) = 0/5000 = 0

With smoothing:

P("lottery"|spam) = 6/15000 ≈ 0.0004

P("jackpot"|spam) = 1/15000 ≈ 0.000067

Log-Probability for Numerical Stability

Preventing Underflow in Computation

Numerical Underflow Problem

Problem: Multiplying many small probabilities leads to underflow

Example: If P(x_i|c) ≈ 0.001 for 1000 features:

\prod_{i=1}^{1000} 0.001 = 10^{-3000}

Far below machine precision! Computed as 0.
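The underflow is easy to reproduce in a couple of lines (a sketch; IEEE-754 doubles bottom out near 5 × 10^-324):

# Sketch: multiplying 1000 probabilities of 0.001 underflows to exactly 0.0.
import math

p = 1.0
for _ in range(1000):
    p *= 0.001
print(p)                         # 0.0 (underflow)

log_p = 1000 * math.log(0.001)   # the same quantity, computed in log-space
print(log_p)                     # about -6907.8, easily representable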

Log-Space Computation

Solution: Work in log-space

Original:

\hat{c} = \arg\max_c P(c) \prod_{i=1}^{d} P(x_i|c)

Log-space (using log monotonicity):

\hat{c} = \arg\max_c \left[\log P(c) + \sum_{i=1}^{d} \log P(x_i|c)\right]

Benefits:

  • Product → Sum (numerically stable)
  • No underflow (log₁₀(10^-1000) = -1000 is easily representable)
  • Often faster (addition vs. multiplication)

Practical Implementation

from math import log

# likelihood(x_i, theta) returns P(x_i | c) given that feature's per-class parameters theta.
def predict(x, priors, params, likelihood):
    # Score each class in log-space and return the argmax.
    scores = {}
    for c in priors:
        log_prob = log(priors[c])                           # log P(c)
        for i, x_i in enumerate(x):
            log_prob += log(likelihood(x_i, params[c][i]))  # + log P(x_i | c)
        scores[c] = log_prob
    return max(scores, key=scores.get)                      # argmax over classes

Note: Never exponentiate back to probabilities unless needed for interpretation!

Complexity Analysis

Training: O(n × d) where n = samples, d = features

Prediction per sample: O(K × d) where K = classes

Space: O(K × d) to store parameters

Extremely efficient!

Linear in both features and classes. Scales to millions of features (e.g., text classification).

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the 'naive' assumption in Naive Bayes?
2. Using Bayes' theorem, the class posterior probability is:
3. For a document classification task with vocabulary size 5000, the word 'discount' appears 3 times in spam (total 100 words) and 0 times in ham (total 200 words). With Laplace smoothing (α = 1), what is the resulting smoothed word probability?
4. What distribution does Gaussian Naive Bayes assume for continuous features?
5. Why do we work with log probabilities in practice?
6. For spam classification with P(spam)=0.3, P('lottery'|spam)=0.05, P('lottery'|ham)=0.001, what does Naive Bayes predict for an email containing only 'lottery'?
7. Which Naive Bayes variant is best for binary feature vectors (e.g., word presence/absence)?
8. What is a key advantage of Naive Bayes over other classifiers?
9. In Multinomial Naive Bayes for text, what does represent?
10. When would Naive Bayes perform poorly?