
Naive Bayes Classifier

Simple yet powerful probabilistic classifier with attribute independence assumption

What is Naive Bayes?

Generative Model

The Naive Bayes Classifier is one of the simplest yet most effective probabilistic classifiers. Its "naive" name comes from a strong assumption: that all attributes are conditionally independent given the class label.

Main Challenge

In Bayes' theorem, estimating the class-conditional probability P(x | c) is the main obstacle. When a sample x contains multiple attributes, the joint probability P(x | c) is difficult to estimate from limited training samples, leading to combinatorial explosion and data sparsity problems.

Solution: Attribute Independence Assumption

Assume that, given class c, all attributes x_1, x_2, \ldots, x_d are conditionally independent. This allows us to decompose the joint probability into a product of individual attribute probabilities.

Formula Derivation

From Bayes' Theorem

Starting with Bayes' theorem:

P(c | x) = \frac{P(c) P(x | c)}{P(x)}

Independence Assumption

Under the attribute independence assumption, the class-conditional probability can be decomposed:

P(x | c) = P(x_1, x_2, \ldots, x_d | c) = \prod_{i=1}^{d} P(x_i | c)

where d is the number of attributes and x_i is the value of x on attribute i.

Naive Bayes Classifier

Substituting the decomposition into Bayes' theorem:

P(c | x) = \frac{P(c) \prod_{i=1}^{d} P(x_i | c)}{P(x)}

Since the evidence P(x) is the same for all classes and does not affect the comparison, the Naive Bayes classifier h_{nb}(x) uses the decision rule:

h_{nb}(x) = \arg\max_{c \in \mathcal{Y}} P(c) \prod_{i=1}^{d} P(x_i | c)
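As a rough sketch of how this rule looks in code (not taken from the original text), assume the probability tables have already been estimated and stored in two hypothetical dictionaries, `priors` and `cond_prob`:

```python
import math

def naive_bayes_predict(sample, priors, cond_prob):
    """Return argmax over classes c of P(c) * prod_i P(x_i | c).

    sample    : dict mapping attribute name -> observed value
    priors    : dict mapping class label -> P(c)
    cond_prob : dict mapping (class, attribute, value) -> P(x_i | c)
    Log-probabilities are summed instead of multiplying raw
    probabilities to avoid floating-point underflow.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for attr, value in sample.items():
            score += math.log(cond_prob[(c, attr, value)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because the logarithm is monotonic, maximizing the sum of log-probabilities selects the same class as maximizing the product itself.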

Probability Estimation

Prior Probability P(c)

Estimate the prior probability using class frequencies:

P(c) = \frac{|D_c|}{|D|}

where |D_c| is the number of samples of class c in training set D, and |D| is the total number of samples.
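A minimal sketch of this frequency estimate, assuming the training labels are available as a plain Python list (the name `labels` is illustrative):

```python
from collections import Counter

def estimate_priors(labels):
    """P(c) = |D_c| / |D| for every class c appearing in the training labels."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}

# Example: 8 good melons and 9 bad melons out of 17 samples
# estimate_priors(["yes"] * 8 + ["no"] * 9)  ->  {"yes": 0.470..., "no": 0.529...}
```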

Class-Conditional Probability P(x_i | c)

For Discrete Attributes

Let D_{c,x_i} denote the set of samples in D_c whose attribute i takes value x_i:

P(x_i | c) = \frac{|D_{c,x_i}|}{|D_c|}
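The corresponding frequency count in code, assuming the training set is a list of `(attributes, label)` pairs where `attributes` is a dict (a hypothetical layout, kept consistent with the other sketches here):

```python
def estimate_discrete_cond(dataset, attr, value, cls):
    """P(x_i | c) = |D_{c,x_i}| / |D_c| for a discrete attribute."""
    in_class = [x for x, y in dataset if y == cls]            # D_c
    matching = sum(1 for x in in_class if x[attr] == value)   # |D_{c,x_i}|
    return matching / len(in_class)
```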

For Continuous Attributes

Typically assume a Gaussian (normal) distribution. If p(x_i | c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2), the probability density function is:

p(x_i | c) = \frac{1}{\sqrt{2\pi}\sigma_{c,i}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)

where \mu_{c,i} and \sigma_{c,i}^2 are the mean and variance of attribute i for class c, estimated by MLE from the training samples.
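A sketch for the continuous case under the same assumed dataset layout: fit the class-conditional mean and variance by MLE and evaluate the Gaussian density at the observed value:

```python
import math

def gaussian_cond_density(dataset, attr, value, cls):
    """p(x_i | c) with mu and sigma^2 estimated by MLE within class cls."""
    xs = [x[attr] for x, y in dataset if y == cls]
    mu = sum(xs) / len(xs)
    var = sum((v - mu) ** 2 for v in xs) / len(xs)   # MLE variance (divide by n)
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```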

Laplacian Correction

The Zero Probability Problem

If a certain attribute value never appears with a class in the training set, the estimated probability P(x_i | c) = 0. Since Naive Bayes uses multiplication, this causes the entire posterior probability to become zero, "erasing" information from other attributes.

Example

If training data never shows "敲声=清脆" (Sound=Crisp) for "好瓜=是" (Good Melon=Yes), then P(\text{敲声=清脆} | \text{好瓜=是}) = 0. When encountering a test sample with "敲声=清脆", regardless of how good the other attributes look, it will be classified as "not a good melon".

Solution: Laplacian Correction

Laplacian correction (also called smoothing) adds a small constant to the numerator and denominator to avoid zero probabilities:

Corrected prior probability:

\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}

where N is the number of possible classes in training set D.

Corrected class-conditional probability:

\hat{P}(x_i | c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}

where N_i is the number of possible values for attribute i.
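A sketch of both corrected estimates, reusing the hypothetical `labels` / `dataset` layouts from the earlier snippets; `num_classes` and `num_values` stand for N and N_i:

```python
def laplace_prior(labels, cls, num_classes):
    """Corrected prior: (|D_c| + 1) / (|D| + N)."""
    return (sum(1 for y in labels if y == cls) + 1) / (len(labels) + num_classes)

def laplace_cond(dataset, attr, value, cls, num_values):
    """Corrected conditional: (|D_{c,x_i}| + 1) / (|D_c| + N_i)."""
    in_class = [x for x, y in dataset if y == cls]
    matches = sum(1 for x in in_class if x[attr] == value)
    return (matches + 1) / (len(in_class) + num_values)
```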

Insight

Laplacian correction assumes a uniform distribution over attribute values and classes, introducing some bias. However, it effectively solves the zero probability problem and improves model robustness, especially with small datasets.

Complete Example: Watermelon Dataset 3.0

Step-by-step Naive Bayes classification with mixed discrete and continuous attributes

Test Sample

Test Sample: 色泽=青绿 (Color=Green), 根蒂=蜷缩 (Stem=Curled), 敲声=浊响 (Sound=Dull), 纹理=清晰 (Texture=Clear), 脐部=凹陷 (Navel=Sunken), 触感=硬滑 (Touch=Hard-smooth), 密度=0.697 (Density=0.697), 含糖率=0.460 (Sugar=0.460)

Goal: Classify as "好瓜=是" (Good Melon=Yes) or "好瓜=否" (Good Melon=No)

Step 1: Estimate Prior Probabilities

Assume the training set has 8 good melons and 9 bad melons, 17 samples in total:

P(\text{好瓜=是}) = \frac{8}{17} \approx 0.471
P(\text{好瓜=否}) = \frac{9}{17} \approx 0.529

Step 2: Estimate Conditional Probabilities for Discrete Attributes

色泽=青绿 (Color=Green):

P(\text{色泽=青绿} | \text{好瓜=是}) = \frac{3}{8} = 0.375
P(\text{色泽=青绿} | \text{好瓜=否}) = \frac{3}{9} \approx 0.333

根蒂=蜷缩 (Stem=Curled):

P(\text{根蒂=蜷缩} | \text{好瓜=是}) = \frac{5}{8} = 0.625
P(\text{根蒂=蜷缩} | \text{好瓜=否}) = \frac{3}{9} \approx 0.333

敲声=浊响 (Sound=Dull):

P(\text{敲声=浊响} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{敲声=浊响} | \text{好瓜=否}) = \frac{4}{9} \approx 0.444

纹理=清晰 (Texture=Clear):

P(\text{纹理=清晰} | \text{好瓜=是}) = \frac{7}{8} = 0.875
P(\text{纹理=清晰} | \text{好瓜=否}) = \frac{2}{9} \approx 0.222

脐部=凹陷 (Navel=Sunken):

P(\text{脐部=凹陷} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{脐部=凹陷} | \text{好瓜=否}) = \frac{2}{9} \approx 0.222

触感=硬滑 (Touch=Hard-smooth):

P(\text{触感=硬滑} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{触感=硬滑} | \text{好瓜=否}) = \frac{6}{9} \approx 0.667

Step 3: Estimate Conditional Probabilities for Continuous Attributes

For continuous attributes, assume a Gaussian distribution. After estimating the mean and variance from the training data:

密度=0.697 (Density=0.697):

p(\text{密度=0.697} | \text{好瓜=是}) \approx 1.959
p(\text{密度=0.697} | \text{好瓜=否}) \approx 1.203

含糖率=0.460 (Sugar=0.460):

p(\text{含糖率=0.460} | \text{好瓜=是}) \approx 0.788
p(\text{含糖率=0.460} | \text{好瓜=否}) \approx 0.066

Step 4: Calculate Posterior Probabilities (Unnormalized)

For "好瓜=是" (Good Melon=Yes):

P(\text{好瓜=是}) \times P(\text{青绿} | \text{是}) \times P(\text{蜷缩} | \text{是}) \times P(\text{浊响} | \text{是}) \times P(\text{清晰} | \text{是}) \times P(\text{凹陷} | \text{是}) \times P(\text{硬滑} | \text{是}) \times p(\text{密度=0.697} | \text{是}) \times p(\text{含糖率=0.460} | \text{是})
= 0.471 \times 0.375 \times 0.625 \times 0.750 \times 0.875 \times 0.750 \times 0.750 \times 1.959 \times 0.788
\approx 0.063

For "好瓜=否" (Good Melon=No):

P(\text{好瓜=否}) \times P(\text{青绿} | \text{否}) \times P(\text{蜷缩} | \text{否}) \times P(\text{浊响} | \text{否}) \times P(\text{清晰} | \text{否}) \times P(\text{凹陷} | \text{否}) \times P(\text{硬滑} | \text{否}) \times p(\text{密度=0.697} | \text{否}) \times p(\text{含糖率=0.460} | \text{否})
= 0.529 \times 0.333 \times 0.333 \times 0.444 \times 0.222 \times 0.222 \times 0.667 \times 1.203 \times 0.066
\approx 6.80 \times 10^{-5}

Step 5: Classification Result

Since 0.063 > 6.80 \times 10^{-5}, the Naive Bayes classifier classifies the test sample as:

好瓜=是 (Good Melon=Yes)
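The arithmetic in Steps 4 and 5 is easy to check directly; the factors below are copied from the estimates above:

```python
# Unnormalized posterior scores for the two classes
good = 0.471 * 0.375 * 0.625 * 0.750 * 0.875 * 0.750 * 0.750 * 1.959 * 0.788
bad  = 0.529 * 0.333 * 0.333 * 0.444 * 0.222 * 0.222 * 0.667 * 1.203 * 0.066
print(round(good, 3), bad)                     # ~0.063 vs ~6.8e-05
print("好瓜=是" if good > bad else "好瓜=否")   # -> 好瓜=是
```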

Laplacian Correction Example

Problem Scenario

If training data never shows "敲声=清脆" (Sound=Crisp) for "好瓜=是" (Good Melon=Yes), then:

P(\text{敲声=清脆} | \text{好瓜=是}) = \frac{0}{8} = 0

This causes P(\text{好瓜=是} | x) = 0, regardless of the other attributes, leading to incorrect classification.

With Laplacian Correction

Corrected prior probabilities:

\hat{P}(\text{好瓜=是}) = \frac{8 + 1}{17 + 2} = \frac{9}{19} \approx 0.474
\hat{P}(\text{好瓜=否}) = \frac{9 + 1}{17 + 2} = \frac{10}{19} \approx 0.526

Corrected conditional probabilities:

\hat{P}(\text{色泽=青绿} | \text{好瓜=是}) = \frac{3 + 1}{8 + 3} = \frac{4}{11} \approx 0.364
\hat{P}(\text{色泽=青绿} | \text{好瓜=否}) = \frac{3 + 1}{9 + 3} = \frac{4}{12} \approx 0.333
\hat{P}(\text{敲声=清脆} | \text{好瓜=是}) = \frac{0 + 1}{8 + 3} = \frac{1}{11} \approx 0.091 \quad \text{(no longer 0!)}
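These corrected values can be reproduced with the `laplace_prior` / `laplace_cond` sketches above, or checked directly (N = 2 classes; 色泽 and 敲声 each have N_i = 3 possible values):

```python
print((8 + 1) / (17 + 2))   # corrected P(好瓜=是)  ≈ 0.474
print((9 + 1) / (17 + 2))   # corrected P(好瓜=否)  ≈ 0.526
print((3 + 1) / (8 + 3))    # corrected P(色泽=青绿 | 好瓜=是) ≈ 0.364
print((0 + 1) / (8 + 3))    # corrected P(敲声=清脆 | 好瓜=是) ≈ 0.091, no longer zero
```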

Use Cases and Scenarios

High Prediction Speed

Strategy: Pre-compute all probability estimates, then answer predictions by direct table lookup. Naive Bayes training mainly involves counting frequencies and computing means/variances; once that is done, prediction requires only a few multiplications and a comparison, which is extremely fast.

Frequent Data Updates

Strategy: Skip the training phase and compute the needed estimates on the fly when a prediction request arrives (lazy learning). This suits data streams or scenarios requiring real-time model updates, since it avoids frequent retraining.

Incremental Learning

Strategy: Keep the existing estimates and update only those involving the new samples (incremental learning). Naive Bayes supports this easily: the relevant counters or running statistics are updated without reprocessing any historical data, as sketched below.
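Because every estimate is a ratio of simple counts (or a running mean/variance for continuous attributes), absorbing a new labelled sample only requires incrementing a few counters. A minimal sketch with hypothetical counter names:

```python
from collections import defaultdict

class_counts = defaultdict(int)   # |D_c| per class
value_counts = defaultdict(int)   # |D_{c,x_i}| keyed by (class, attribute, value)

def absorb(sample, label):
    """Fold one new (sample, label) pair into the counts; old data is never revisited."""
    class_counts[label] += 1
    for attr, value in sample.items():
        value_counts[(label, attr, value)] += 1
    # P(c) and P(x_i | c) are recomputed from these counters whenever needed
```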

Text Classification

Application: Email spam detection, document categorization, sentiment analysis. Naive Bayes works well for text because the word-independence assumption, while only an approximation, rarely hurts in practice, and the model handles high-dimensional, sparse feature spaces efficiently.