
Naive Bayes

Master probabilistic classification using Bayes' theorem and conditional independence

Bayes' Theorem & Foundation

Probabilistic Classification

Bayes' Theorem

P(Y|\mathbf{X}) = \frac{P(\mathbf{X}|Y) \cdot P(Y)}{P(\mathbf{X})}

Posterior

P(Y|\mathbf{X})

Probability of class given features

Likelihood

P(\mathbf{X}|Y)

Probability of features given class

Prior

P(Y)

Probability of class before seeing data

Classification Rule

\hat{y} = \arg\max_{y \in \mathcal{Y}} P(Y=y) \prod_{i=1}^{n} P(X_i | Y=y)

Choose the class with the highest posterior. Since P(X) is constant for all classes, we can ignore it.
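The decision rule maps directly to a few lines of code. Below is a minimal sketch (the priors and likelihoods lookup structures are hypothetical placeholders, not a fixed API):

# Minimal sketch of the Naive Bayes decision rule.
# priors[c] holds P(Y=c); likelihoods[c][i] maps a value of feature i to P(X_i = value | Y=c).
def nb_predict(x, priors, likelihoods):
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, x_i in enumerate(x):
            score *= likelihoods[c][i].get(x_i, 0.0)   # P(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get)                 # class with highest posterior score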

The Naive Independence Assumption

Conditional Independence

Features are assumed conditionally independent given the class:

P(X_1, X_2, ..., X_n | Y) = P(X_1|Y) \cdot P(X_2|Y) \cdot \ldots \cdot P(X_n|Y) = \prod_{i=1}^{n} P(X_i|Y)

This drastically simplifies computation: instead of estimating a probability for every possible combination of feature values, we only estimate individual per-feature probabilities!

Why It Works

  • Massively reduces parameters to estimate
  • Fast training and prediction
  • Works well even when assumption is violated
  • Excels in high dimensions

When It Fails

  • Strong feature dependencies exist
  • Dependencies are critical for classification
  • Example: diagnostic rules requiring feature combinations

Naive Bayes Variants

Gaussian Naive Bayes

For continuous features, assume normal distribution per class:

P(X_i=x | Y=c) = \frac{1}{\sqrt{2\pi\sigma_c^2}} e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}

Estimate μ_c and σ²_c from the training data for each class and feature
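As a quick, hedged illustration, scikit-learn's GaussianNB fits these per-class means and variances directly (the tiny arrays below are invented for demonstration):

# Sketch: Gaussian Naive Bayes via scikit-learn on a made-up 2-feature dataset.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])  # continuous features
y = np.array([0, 0, 1, 1])                                       # class labels

model = GaussianNB()                      # estimates mu_c and sigma_c^2 per class and feature
model.fit(X, y)
print(model.predict([[1.1, 2.0]]))        # expected: class 0 (highest Gaussian likelihood)
print(model.predict_proba([[1.1, 2.0]]))  # per-class posterior probabilities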

Multinomial Naive Bayes

For count data (e.g., word frequencies):

P(w_i|c) = \frac{\text{count}(w_i, c) + \alpha}{\sum_j \text{count}(w_j, c) + \alpha V}

α is the Laplace smoothing parameter and V is the vocabulary size. Popular for text classification.
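A hedged scikit-learn sketch of this variant (the two-document corpus is invented purely for illustration; alpha=1.0 corresponds to Laplace smoothing):

# Sketch: Multinomial Naive Bayes over word counts with Laplace smoothing (alpha = 1).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs   = ["free money now", "meeting agenda attached"]   # toy corpus (made up)
labels = ["spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)            # word-count feature matrix

clf = MultinomialNB(alpha=1.0)         # alpha adds pseudo-counts to every word
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money"])))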

Bernoulli Naive Bayes

For binary features (presence/absence):

P(X_i|Y=c) = p_{ic}^{X_i}(1-p_{ic})^{1-X_i}

Explicitly models both feature presence AND absence. Good for document classification with binary features.
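A minimal sketch of the per-feature Bernoulli likelihood (the probability value used below is invented for illustration):

# Sketch: Bernoulli likelihood for a single binary feature x_i under class c.
def bernoulli_likelihood(x_i, p_ic):
    # p_ic is the estimated P(X_i = 1 | Y = c); x_i must be 0 or 1
    return p_ic if x_i == 1 else 1.0 - p_ic

print(bernoulli_likelihood(1, 0.8))   # feature present: 0.8
print(bernoulli_likelihood(0, 0.8))   # feature absent: ~0.2 (absence contributes too)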

Laplace Smoothing

Avoiding Zero Probabilities

If a word never appears in the training data for a class, its estimated probability is zero, which makes the entire product zero!

Problem: Zero Counts

P(w_i|c) = \frac{0}{N} = 0

One zero probability ruins the entire product.

Solution: Add-α Smoothing

P(w_i|c) = \frac{\text{count}(w_i,c) + \alpha}{N + \alpha V}

α = 1 (Laplace), α = 0.5 (Jeffreys). Adds "pseudo-counts" to avoid zeros.

Example Calculation

Vocabulary size V = 10,000, word "free" appears 5 times in spam (total 200 words), α = 1:

P(\text{'free'}|\text{spam}) = \frac{5 + 1}{200 + 1 \times 10000} = \frac{6}{10200} \approx 0.000588

Word "aardvark" appears 0 times:

P(\text{'aardvark'}|\text{spam}) = \frac{0 + 1}{10200} \approx 0.000098
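These figures are easy to verify with a short Python sketch that just reproduces the arithmetic above:

# Sketch: Laplace-smoothed word probabilities for the example above.
def smoothed_prob(count, total_words, vocab_size, alpha=1):
    return (count + alpha) / (total_words + alpha * vocab_size)

V, N = 10_000, 200                    # vocabulary size, total words in spam
print(smoothed_prob(5, N, V))         # 'free':     6/10200  ≈ 0.000588
print(smoothed_prob(0, N, V))         # 'aardvark': 1/10200  ≈ 0.000098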

Example: Email Spam Classification

Spam Filter Using Multinomial Naive Bayes

Training Data

Word          Spam Count   Ham Count
free          8            1
money         6            2
meeting       1            10
Total words   50           100

Prior: P(spam) = 0.4, P(ham) = 0.6 (from 40 spam, 60 ham emails)

Classify: "free money"

Spam Score

P(\text{spam}) = 0.4
P(\text{free}|\text{spam}) = 8/50 = 0.16
P(\text{money}|\text{spam}) = 6/50 = 0.12
\text{Score} = 0.4 \times 0.16 \times 0.12 = 0.00768

Ham Score

P(\text{ham}) = 0.6
P(\text{free}|\text{ham}) = 1/100 = 0.01
P(\text{money}|\text{ham}) = 2/100 = 0.02
\text{Score} = 0.6 \times 0.01 \times 0.02 = 0.00012

Prediction: SPAM (0.00768 > 0.00012)
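The same comparison in a short Python sketch, using the counts and priors from the table above (no smoothing, to match the worked numbers):

# Sketch: score the message 'free money' under each class from the example counts.
priors = {"spam": 0.4, "ham": 0.6}
counts = {"spam": {"free": 8, "money": 6, "total": 50},
          "ham":  {"free": 1, "money": 2, "total": 100}}

def score(words, c):
    s = priors[c]
    for w in words:
        s *= counts[c][w] / counts[c]["total"]   # P(word | class), no smoothing
    return s

scores = {c: score(["free", "money"], c) for c in priors}
print(scores)                        # spam ≈ 0.00768, ham ≈ 0.00012
print(max(scores, key=scores.get))   # -> 'spam'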

Bayes' Theorem Complete Derivation

From Conditional Probability to Bayes' Rule

Conditional Probability Definition

P(A|B) = \frac{P(A \cap B)}{P(B)}

Probability of A given B has occurred, assuming P(B) > 0.

Step-by-Step Derivation

Step 1: Express joint probability two ways

P(A \cap B) = P(A|B)P(B)
P(A \cap B) = P(B|A)P(A)

Step 2: Equate and solve for P(A|B)

P(A|B)P(B) = P(B|A)P(A)
P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Bayes' Theorem:

P(\text{class}|\text{features}) = \frac{P(\text{features}|\text{class}) \cdot P(\text{class})}{P(\text{features})}
• P(class|features): Posterior
• P(features|class): Likelihood
• P(class): Prior
• P(features): Evidence

Law of Total Probability (for Evidence)

P(\mathbf{x}) = \sum_{c} P(\mathbf{x}|c)P(c)

Denominator can be computed by summing over all classes. Often ignored in classification since it's constant for all classes.

Classification Rule

\hat{c} = \arg\max_c P(c|\mathbf{x}) = \arg\max_c \frac{P(\mathbf{x}|c)P(c)}{P(\mathbf{x})} = \arg\max_c P(\mathbf{x}|c)P(c)

Can drop P(x) since it's the same for all classes!
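If an actual posterior probability is wanted (e.g., as a confidence score), P(x) can be recovered by normalizing the unnormalized scores. Using the 'free money' example scores from earlier as an illustration:

P(\mathbf{x}) = 0.00768 + 0.00012 = 0.0078
P(\text{spam}|\mathbf{x}) = \frac{0.00768}{0.0078} \approx 0.985
P(\text{ham}|\mathbf{x}) = \frac{0.00012}{0.0078} \approx 0.015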

Naive Independence Assumption

Mathematical Formulation and Implications

The Assumption

Naive assumption: Features are conditionally independent given the class

P(x_1, x_2, \ldots, x_d | c) = P(x_1|c) \cdot P(x_2|c) \cdot \ldots \cdot P(x_d|c) = \prod_{i=1}^{d} P(x_i|c)

This simplifies likelihood computation dramatically!

Why "Naive"?

This assumption is often violated! Real-world features are usually correlated.

Example: In spam detection, words "free" and "money" likely appear together.

Yet it works well in practice because:

  • Classification only needs correct ranking of P(c|x), not exact values
  • Errors in P(x|c) estimates can cancel out
  • Simple model reduces overfitting
  • Very few parameters to estimate

Parameter Count Comparison

Without Naive Assumption

For d binary features:

2^d - 1 \text{ parameters per class}

Exponential in d! Infeasible for large d.

With Naive Assumption

For d binary features:

d \text{ parameters per class}

Linear in d! Scales to high dimensions.
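For concreteness, a quick calculation assuming d = 30 binary features (a value chosen only for illustration):

\text{Without the assumption: } 2^{30} - 1 \approx 1.07 \times 10^9 \text{ parameters per class}
\text{With the assumption: } 30 \text{ parameters per class}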

Full Naive Bayes Formula

P(c|\mathbf{x}) \propto P(c) \prod_{i=1}^{d} P(x_i|c)
\hat{c} = \arg\max_c \left[ P(c) \prod_{i=1}^{d} P(x_i|c) \right]

Gaussian Naive Bayes: MLE Derivation

Parameter Estimation for Continuous Features

Gaussian Likelihood Model

Assume each feature follows a Gaussian distribution:

P(x_i|c) = \frac{1}{\sqrt{2\pi\sigma_{ic}^2}} \exp\left(-\frac{(x_i - \mu_{ic})^2}{2\sigma_{ic}^2}\right)

Parameters: μ_ic (mean) and σ_ic (std dev) for feature i in class c

Maximum Likelihood Estimation for μ

For training data x_1, \ldots, x_n from class c, maximize the log-likelihood:

\mathcal{L}(\mu, \sigma^2) = \sum_{j=1}^{n} \log P(x_j|\mu, \sigma^2)
= \sum_{j=1}^{n} \left[-\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_j - \mu)^2}{2\sigma^2}\right]

Taking derivative w.r.t. μ:

\frac{\partial \mathcal{L}}{\partial \mu} = \sum_{j=1}^{n} \frac{x_j - \mu}{\sigma^2} = 0
\sum_{j=1}^{n} x_j - n\mu = 0
\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j

MLE for mean is sample mean!

Maximum Likelihood Estimation for σ²

Taking derivative w.r.t. σ²:

\frac{\partial \mathcal{L}}{\partial \sigma^2} = \sum_{j=1}^{n} \left[-\frac{1}{2\sigma^2} + \frac{(x_j - \mu)^2}{2(\sigma^2)^2}\right] = 0
-\frac{n}{2\sigma^2} + \frac{\sum_j(x_j - \mu)^2}{2(\sigma^2)^2} = 0
\hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j - \hat{\mu})^2

MLE for variance is sample variance!

Algorithm Summary

Training Phase:

  For each class c:
    P(c) = n_c / n_total
    For each feature i:
      μ_ic  = mean(x_i for examples in class c)
      σ²_ic = var(x_i for examples in class c)

Prediction Phase:

  For new example x:
    For each class c:
      score_c = log P(c) + Σ_i log P(x_i|c)
    Return: class with highest score
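A compact from-scratch sketch of this algorithm in NumPy (function names are my own; a small variance floor eps is added for numerical safety, which the outline above does not include):

# Sketch: Gaussian Naive Bayes training and log-space prediction with NumPy.
import numpy as np

def fit_gnb(X, y, eps=1e-9):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),       # P(c) = n_c / n_total
            "mu":    Xc.mean(axis=0),        # per-feature MLE mean
            "var":   Xc.var(axis=0) + eps,   # per-feature MLE variance (+ floor)
        }
    return params

def predict_gnb(x, params):
    def log_score(p):
        # log P(c) + sum_i log N(x_i; mu_ic, var_ic)
        return (np.log(p["prior"])
                - 0.5 * np.sum(np.log(2 * np.pi * p["var"]))
                - 0.5 * np.sum((x - p["mu"]) ** 2 / p["var"]))
    return max(params, key=lambda c: log_score(params[c]))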

Laplace Smoothing Derivation

Handling Zero Probabilities

The Zero Probability Problem

Problem: If a feature value never appears with a class in training:

P(x_i|c) = 0 \implies P(c|\mathbf{x}) = 0

One zero probability wipes out the entire product!

Laplace (Add-One) Smoothing

Solution: Add α (typically 1) to all counts

Without smoothing:

P(x_i = v | c) = \frac{\text{count}(x_i = v, c)}{\text{count}(c)}

With Laplace smoothing (α=1):

P(x_i = v | c) = \frac{\text{count}(x_i = v, c) + \alpha}{\text{count}(c) + \alpha |V_i|}

where |V_i| is the number of possible values for feature i

Bayesian Interpretation

Laplace smoothing is equivalent to using a Dirichlet prior:

P(\theta) = \text{Dir}(\alpha, \alpha, \ldots, \alpha)

With uniform prior (α=1), the posterior mean is the smoothed estimate. This is a form of regularization—prevents overfitting to training data.
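As a brief check of this claim: with a Dirichlet(α, …, α) prior over the K = |V_i| possible feature values and the observed counts, the posterior mean of each value's probability is

\mathbb{E}[\theta_v \mid \text{data}] = \frac{\text{count}(x_i = v, c) + \alpha}{\text{count}(c) + \alpha K}

which is exactly the smoothed estimate above.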

Example: Text Classification

Setup:

  • Vocabulary size: |V| = 10,000 words
  • Class "spam": 100 documents, 5000 total words
  • Word "lottery" appears 5 times in spam

Without smoothing:

P("lottery"|spam) = 5/5000 = 0.001

P("jackpot"|spam) = 0/5000 = 0

With smoothing:

P("lottery"|spam) = 6/15000 ≈ 0.0004

P("jackpot"|spam) = 1/15000 ≈ 0.000067

Log-Probability for Numerical Stability

Preventing Underflow in Computation

Numerical Underflow Problem

Problem: Multiplying many small probabilities leads to underflow

Example: If P(x_i|c) ≈ 0.001 for 1000 features:

\prod_{i=1}^{1000} 0.001 = 10^{-3000}

Far below machine precision! Computed as 0.
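The underflow is easy to reproduce in a couple of lines (a sketch; IEEE-754 doubles bottom out near 5 × 10^-324):

# Sketch: multiplying 1000 probabilities of 0.001 underflows to exactly 0.0.
import math

p = 1.0
for _ in range(1000):
    p *= 0.001
print(p)                         # 0.0 (underflow)

log_p = 1000 * math.log(0.001)   # the same quantity, computed in log-space
print(log_p)                     # about -6907.8, easily representable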

Log-Space Computation

Solution: Work in log-space

Original:

\hat{c} = \arg\max_c P(c) \prod_{i=1}^{d} P(x_i|c)

Log-space (using log monotonicity):

\hat{c} = \arg\max_c \left[\log P(c) + \sum_{i=1}^{d} \log P(x_i|c)\right]

Benefits:

  • Product → Sum (numerically stable)
  • No underflow (log₁₀(10^-1000) = -1000 is easily representable)
  • Often faster (addition vs. multiplication)

Practical Implementation

from math import log

# likelihood(x_i, theta) returns P(x_i | c) given that feature's per-class parameters theta.
def predict(x, priors, params, likelihood):
    # Score each class in log-space and return the argmax.
    scores = {}
    for c in priors:
        log_prob = log(priors[c])                           # log P(c)
        for i, x_i in enumerate(x):
            log_prob += log(likelihood(x_i, params[c][i]))  # + log P(x_i | c)
        scores[c] = log_prob
    return max(scores, key=scores.get)                      # argmax over classes

Note: Never exponentiate back to probabilities unless needed for interpretation!

Complexity Analysis

Training: O(n × d) where n = samples, d = features

Prediction per sample: O(K × d) where K = classes

Space: O(K × d) to store parameters

Extremely efficient!

Linear in both features and classes. Scales to millions of features (e.g., text classification).

Practice Quiz

Test your understanding with 10 multiple-choice questions

1. What is the 'naive' assumption in Naive Bayes?
2. Using Bayes' theorem, the class posterior probability is:
3. For a document classification task with vocabulary size 5000, the word 'discount' appears 3 times in spam (total 100 words) and 0 times in ham (total 200 words). With Laplace smoothing (α = 1), what is the resulting smoothed word probability?
4. What distribution does Gaussian Naive Bayes assume for continuous features?
5. Why do we work with log probabilities in practice?
6. For spam classification with P(spam)=0.3, P('lottery'|spam)=0.05, P('lottery'|ham)=0.001, what does Naive Bayes predict for an email containing only 'lottery'?
7. Which Naive Bayes variant is best for binary feature vectors (e.g., word presence/absence)?
8. What is a key advantage of Naive Bayes over other classifiers?
9. In Multinomial Naive Bayes for text, what does represent?
10. When would Naive Bayes perform poorly?