
Naive Bayes Classifier

Simple yet powerful probabilistic classifier with attribute independence assumption

What is Naive Bayes?

Generative Model

The Naive Bayes Classifier is one of the simplest yet most effective probabilistic classifiers. Its "naive" name comes from a strong assumption: that all attributes are conditionally independent given the class label.

Main Challenge

In Bayes' theorem, estimating the class-conditional probability P(x | c) is the main obstacle. When a sample x contains multiple attributes, the joint probability P(x | c) is difficult to estimate from limited training samples, leading to combinatorial explosion and data sparsity problems.

Solution: Attribute Independence Assumption

Assume that, given class c, all attributes x_1, x_2, \ldots, x_d are conditionally independent. This allows us to decompose the joint probability into a product of individual attribute probabilities.

Formula Derivation

From Bayes' Theorem

Starting with Bayes' theorem:

P(c | x) = \frac{P(c) P(x | c)}{P(x)}

Independence Assumption

Under the attribute independence assumption, the class-conditional probability can be decomposed:

P(x | c) = P(x_1, x_2, \ldots, x_d | c) = \prod_{i=1}^{d} P(x_i | c)

where d is the number of attributes and x_i is the value of x on attribute i.

Naive Bayes Classifier

Substituting the decomposition into Bayes' theorem:

P(c | x) = \frac{P(c) \prod_{i=1}^{d} P(x_i | c)}{P(x)}

Since the evidence P(x) is the same for all classes and does not affect the comparison, the Naive Bayes classifier h_{nb}(x) uses the decision rule:

h_{nb}(x) = \arg\max_{c \in \mathcal{Y}} P(c) \prod_{i=1}^{d} P(x_i | c)
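As a rough sketch of how this rule looks in code (not taken from the original text), assume the probability tables have already been estimated and stored in two hypothetical dictionaries, `priors` and `cond_prob`:

```python
import math

def naive_bayes_predict(sample, priors, cond_prob):
    """Return argmax over classes c of P(c) * prod_i P(x_i | c).

    sample    : dict mapping attribute name -> observed value
    priors    : dict mapping class label -> P(c)
    cond_prob : dict mapping (class, attribute, value) -> P(x_i | c)
    Log-probabilities are summed instead of multiplying raw
    probabilities to avoid floating-point underflow.
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for attr, value in sample.items():
            score += math.log(cond_prob[(c, attr, value)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Because the logarithm is monotonic, maximizing the sum of log-probabilities selects the same class as maximizing the product itself.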

Probability Estimation

Prior Probability P(c)

Estimate the prior probability using class frequencies:

P(c) = \frac{|D_c|}{|D|}

where |D_c| is the number of samples of class c in training set D, and |D| is the total number of samples.
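A minimal sketch of this frequency estimate, assuming the training labels are available as a plain Python list (the name `labels` is illustrative):

```python
from collections import Counter

def estimate_priors(labels):
    """P(c) = |D_c| / |D| for every class c appearing in the training labels."""
    total = len(labels)
    return {c: n / total for c, n in Counter(labels).items()}

# Example: 8 good melons and 9 bad melons out of 17 samples
# estimate_priors(["yes"] * 8 + ["no"] * 9)  ->  {"yes": 0.470..., "no": 0.529...}
```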

Class-Conditional Probability P(x_i | c)

For Discrete Attributes

Let D_{c,x_i} denote the set of samples in D_c whose attribute i takes value x_i:

P(x_i | c) = \frac{|D_{c,x_i}|}{|D_c|}
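The corresponding frequency count in code, assuming the training set is a list of `(attributes, label)` pairs where `attributes` is a dict (a hypothetical layout, kept consistent with the other sketches here):

```python
def estimate_discrete_cond(dataset, attr, value, cls):
    """P(x_i | c) = |D_{c,x_i}| / |D_c| for a discrete attribute."""
    in_class = [x for x, y in dataset if y == cls]            # D_c
    matching = sum(1 for x in in_class if x[attr] == value)   # |D_{c,x_i}|
    return matching / len(in_class)
```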

For Continuous Attributes

Typically assume a Gaussian (normal) distribution. If p(x_i | c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2), the probability density function is:

p(x_i | c) = \frac{1}{\sqrt{2\pi}\sigma_{c,i}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)

where \mu_{c,i} and \sigma_{c,i}^2 are the mean and variance of attribute i for class c, estimated by MLE from the training samples.
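A sketch for the continuous case under the same assumed dataset layout: fit the class-conditional mean and variance by MLE and evaluate the Gaussian density at the observed value:

```python
import math

def gaussian_cond_density(dataset, attr, value, cls):
    """p(x_i | c) with mu and sigma^2 estimated by MLE within class cls."""
    xs = [x[attr] for x, y in dataset if y == cls]
    mu = sum(xs) / len(xs)
    var = sum((v - mu) ** 2 for v in xs) / len(xs)   # MLE variance (divide by n)
    return math.exp(-(value - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
```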

Laplacian Correction

The Zero Probability Problem

If a certain attribute value never appears with a class in the training set, the estimated probability P(x_i | c) = 0. Since Naive Bayes uses multiplication, this causes the entire posterior probability to become zero, "erasing" information from other attributes.

Example

If training data never shows "敲声=清脆" (Sound=Crisp) for "好瓜=是" (Good Melon=Yes), then P(\text{敲声=清脆} | \text{好瓜=是}) = 0. When encountering a test sample with "敲声=清脆", regardless of how good the other attributes look, it will be classified as "not a good melon".

Solution: Laplacian Correction

Laplacian correction (also called smoothing) adds a small constant to the numerator and denominator to avoid zero probabilities:

Corrected prior probability:

\hat{P}(c) = \frac{|D_c| + 1}{|D| + N}

where N is the number of possible classes in training set D.

Corrected class-conditional probability:

\hat{P}(x_i | c) = \frac{|D_{c,x_i}| + 1}{|D_c| + N_i}

where N_i is the number of possible values for attribute i.
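A sketch of both corrected estimates, reusing the hypothetical `labels` / `dataset` layouts from the earlier snippets; `num_classes` and `num_values` stand for N and N_i:

```python
def laplace_prior(labels, cls, num_classes):
    """Corrected prior: (|D_c| + 1) / (|D| + N)."""
    return (sum(1 for y in labels if y == cls) + 1) / (len(labels) + num_classes)

def laplace_cond(dataset, attr, value, cls, num_values):
    """Corrected conditional: (|D_{c,x_i}| + 1) / (|D_c| + N_i)."""
    in_class = [x for x, y in dataset if y == cls]
    matches = sum(1 for x in in_class if x[attr] == value)
    return (matches + 1) / (len(in_class) + num_values)
```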

Insight

Laplacian correction assumes a uniform distribution over attribute values and classes, introducing some bias. However, it effectively solves the zero probability problem and improves model robustness, especially with small datasets.

Complete Example: Watermelon Dataset 3.0

Step-by-step Naive Bayes classification with mixed discrete and continuous attributes

Test Sample

Test Sample: 色泽=青绿 (Color=Green), 根蒂=蜷缩 (Stem=Curled), 敲声=浊响 (Sound=Dull), 纹理=清晰 (Texture=Clear), 脐部=凹陷 (Navel=Sunken), 触感=硬滑 (Touch=Hard-smooth), 密度=0.697 (Density=0.697), 含糖率=0.460 (Sugar=0.460)

Goal: Classify as "好瓜=是" (Good Melon=Yes) or "好瓜=否" (Good Melon=No)

Step 1: Estimate Prior Probabilities

Assume the training set has 8 good melons and 9 bad melons, 17 samples in total:

P(\text{好瓜=是}) = \frac{8}{17} \approx 0.471
P(\text{好瓜=否}) = \frac{9}{17} \approx 0.529

Step 2: Estimate Conditional Probabilities for Discrete Attributes

色泽=青绿 (Color=Green):

P(\text{色泽=青绿} | \text{好瓜=是}) = \frac{3}{8} = 0.375
P(\text{色泽=青绿} | \text{好瓜=否}) = \frac{3}{9} \approx 0.333

根蒂=蜷缩 (Stem=Curled):

P(\text{根蒂=蜷缩} | \text{好瓜=是}) = \frac{5}{8} = 0.625
P(\text{根蒂=蜷缩} | \text{好瓜=否}) = \frac{3}{9} \approx 0.333

敲声=浊响 (Sound=Dull):

P(\text{敲声=浊响} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{敲声=浊响} | \text{好瓜=否}) = \frac{4}{9} \approx 0.444

纹理=清晰 (Texture=Clear):

P(\text{纹理=清晰} | \text{好瓜=是}) = \frac{7}{8} = 0.875
P(\text{纹理=清晰} | \text{好瓜=否}) = \frac{2}{9} \approx 0.222

脐部=凹陷 (Navel=Sunken):

P(\text{脐部=凹陷} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{脐部=凹陷} | \text{好瓜=否}) = \frac{2}{9} \approx 0.222

触感=硬滑 (Touch=Hard-smooth):

P(\text{触感=硬滑} | \text{好瓜=是}) = \frac{6}{8} = 0.750
P(\text{触感=硬滑} | \text{好瓜=否}) = \frac{6}{9} \approx 0.667

Step 3: Estimate Conditional Probabilities for Continuous Attributes

For continuous attributes, assume a Gaussian distribution. After estimating the mean and variance from the training data:

密度=0.697 (Density=0.697):

p(\text{密度=0.697} | \text{好瓜=是}) \approx 1.959
p(\text{密度=0.697} | \text{好瓜=否}) \approx 1.203

含糖率=0.460 (Sugar=0.460):

p(\text{含糖率=0.460} | \text{好瓜=是}) \approx 0.788
p(\text{含糖率=0.460} | \text{好瓜=否}) \approx 0.066

Step 4: Calculate Posterior Probabilities (Unnormalized)

For "好瓜=是" (Good Melon=Yes):

P(\text{好瓜=是}) \times P(\text{青绿} | \text{是}) \times P(\text{蜷缩} | \text{是}) \times P(\text{浊响} | \text{是}) \times P(\text{清晰} | \text{是}) \times P(\text{凹陷} | \text{是}) \times P(\text{硬滑} | \text{是}) \times p(\text{密度=0.697} | \text{是}) \times p(\text{含糖率=0.460} | \text{是})
= 0.471 \times 0.375 \times 0.625 \times 0.750 \times 0.875 \times 0.750 \times 0.750 \times 1.959 \times 0.788
\approx 0.063

For "好瓜=否" (Good Melon=No):

P(\text{好瓜=否}) \times P(\text{青绿} | \text{否}) \times P(\text{蜷缩} | \text{否}) \times P(\text{浊响} | \text{否}) \times P(\text{清晰} | \text{否}) \times P(\text{凹陷} | \text{否}) \times P(\text{硬滑} | \text{否}) \times p(\text{密度=0.697} | \text{否}) \times p(\text{含糖率=0.460} | \text{否})
= 0.529 \times 0.333 \times 0.333 \times 0.444 \times 0.222 \times 0.222 \times 0.667 \times 1.203 \times 0.066
\approx 6.80 \times 10^{-5}

Step 5: Classification Result

Since 0.063 > 6.80 \times 10^{-5}, the Naive Bayes classifier classifies the test sample as:

好瓜=是 (Good Melon=Yes)
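The arithmetic in Steps 4 and 5 is easy to check directly; the factors below are copied from the estimates above:

```python
# Unnormalized posterior scores for the two classes
good = 0.471 * 0.375 * 0.625 * 0.750 * 0.875 * 0.750 * 0.750 * 1.959 * 0.788
bad  = 0.529 * 0.333 * 0.333 * 0.444 * 0.222 * 0.222 * 0.667 * 1.203 * 0.066
print(round(good, 3), bad)                     # ~0.063 vs ~6.8e-05
print("好瓜=是" if good > bad else "好瓜=否")   # -> 好瓜=是
```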

Laplacian Correction Example

Problem Scenario

If training data never shows "敲声=清脆" (Sound=Crisp) for "好瓜=是" (Good Melon=Yes), then:

P(\text{敲声=清脆} | \text{好瓜=是}) = \frac{0}{8} = 0

This causes P(\text{好瓜=是} | x) = 0, regardless of the other attributes, leading to incorrect classification.

With Laplacian Correction

Corrected prior probabilities:

\hat{P}(\text{好瓜=是}) = \frac{8 + 1}{17 + 2} = \frac{9}{19} \approx 0.474
\hat{P}(\text{好瓜=否}) = \frac{9 + 1}{17 + 2} = \frac{10}{19} \approx 0.526

Corrected conditional probabilities:

\hat{P}(\text{色泽=青绿} | \text{好瓜=是}) = \frac{3 + 1}{8 + 3} = \frac{4}{11} \approx 0.364
\hat{P}(\text{色泽=青绿} | \text{好瓜=否}) = \frac{3 + 1}{9 + 3} = \frac{4}{12} \approx 0.333
\hat{P}(\text{敲声=清脆} | \text{好瓜=是}) = \frac{0 + 1}{8 + 3} = \frac{1}{11} \approx 0.091 \quad \text{(no longer 0!)}
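These corrected values can be reproduced with the `laplace_prior` / `laplace_cond` sketches above, or checked directly (N = 2 classes; 色泽 and 敲声 each have N_i = 3 possible values):

```python
print((8 + 1) / (17 + 2))   # corrected P(好瓜=是)  ≈ 0.474
print((9 + 1) / (17 + 2))   # corrected P(好瓜=否)  ≈ 0.526
print((3 + 1) / (8 + 3))    # corrected P(色泽=青绿 | 好瓜=是) ≈ 0.364
print((0 + 1) / (8 + 3))    # corrected P(敲声=清脆 | 好瓜=是) ≈ 0.091, no longer zero
```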

Use Cases and Scenarios

High Prediction Speed

Strategy: Pre-compute all probability estimates, then answer predictions by direct table lookup. Naive Bayes training mainly involves counting frequencies and computing means/variances; once that is done, prediction requires only a few multiplications and a comparison, which is extremely fast.

Frequent Data Updates

Strategy: Skip the training phase and compute the needed estimates on the fly when a prediction request arrives (lazy learning). This suits data streams or scenarios requiring real-time model updates, since it avoids frequent retraining.

Incremental Learning

Strategy: Keep the existing estimates and update only those involving the new samples (incremental learning). Naive Bayes supports this easily: the relevant counters or running statistics are updated without reprocessing any historical data, as sketched below.
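Because every estimate is a ratio of simple counts (or a running mean/variance for continuous attributes), absorbing a new labelled sample only requires incrementing a few counters. A minimal sketch with hypothetical counter names:

```python
from collections import defaultdict

class_counts = defaultdict(int)   # |D_c| per class
value_counts = defaultdict(int)   # |D_{c,x_i}| keyed by (class, attribute, value)

def absorb(sample, label):
    """Fold one new (sample, label) pair into the counts; old data is never revisited."""
    class_counts[label] += 1
    for attr, value in sample.items():
        value_counts[(label, attr, value)] += 1
    # P(c) and P(x_i | c) are recomputed from these counters whenever needed
```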

Text Classification

Application: Email spam detection, document categorization, sentiment analysis. Naive Bayes works well for text because the word-independence assumption, while only an approximation, rarely hurts in practice, and the model handles high-dimensional, sparse feature spaces efficiently.