
Semi-Naive Bayes Classifiers

Relaxing the independence assumption to model attribute dependencies

Motivation

Problem & Solution

The attribute independence assumption in Naive Bayes is often too strong in practice. Real-world attributes frequently exhibit dependencies. Semi-Naive Bayes classifiers aim to appropriately consider some attribute dependencies to improve classification performance.

Key Challenge

How to model dependencies without falling into the combinatorial explosion problem? We need methods that balance between the simplicity of Naive Bayes and the complexity of full dependency modeling.

One-Dependent Estimator (ODE)

Assumption

Assume each attribute depends on at most one other attribute besides the class. The attribute it depends on is called its parent attribute pa_i.

P(c | x) \propto P(c) \prod_{i=1}^{d} P(x_i | c, pa_i)

where pa_i is the parent attribute of attribute i

Key Challenge

The main challenge is determining which attribute should serve as the parent pa_i for each attribute. Different methods (SPODE, TAN, AODE) use different strategies to address this.
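
To make this concrete, here is a minimal scoring sketch, assuming the probability tables and parent assignments have already been estimated; the function name, data layout, and table format are illustrative assumptions, not from the source:

```python
import numpy as np

def ode_score(c, x, prior, cond, parents):
    """Unnormalized P(c | x) under a one-dependent estimator.

    prior[c]                : estimate of P(c)
    cond[i][c, x[pa], x[i]] : numpy table estimating P(x_i | c, x_{pa_i})
    parents[i]              : index pa_i of attribute i's parent attribute
    Attributes are assumed discrete and integer-coded; an attribute whose
    only parent is the class would use a 2-D table P(x_i | c) instead.
    """
    score = prior[c]
    for i, pa in enumerate(parents):
        score *= cond[i][c, x[pa], x[i]]
    return score
```

Taking the argmax of this score over all classes c yields the prediction; the sections below show where the tables and parent assignments come from.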

SPODE (Super-Parent ODE)

Assumption

SPODE assumes all attributes depend on the same attribute, called the super-parent. This attribute serves as the parent for all other attributes.

For SPODE with super-parent x_1:

P(c | x) \propto P(c) \prod_{i=1}^{d} P(x_i | c, x_1)

Determining the Super-Parent

The best super-parent is determined through cross-validation or other model selection methods. We try each attribute as the super-parent and choose the one that gives the best performance.

Insight

SPODE simplifies the dependency structure while still being more flexible than Naive Bayes. It's easier to implement and train than more complex methods like TAN.
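
A minimal sketch of this selection procedure, assuming integer-coded discrete attributes, Laplace-smoothed count tables, and a single hold-out split standing in for full cross-validation (all names here are illustrative):

```python
import numpy as np

def fit_spode(X, y, sp, n_vals, n_classes, alpha=1.0):
    """Laplace-smoothed SPODE tables with attribute `sp` as the super-parent.
    X: (n, d) integer-coded discrete attributes; y: (n,) integer class labels."""
    prior = np.bincount(y, minlength=n_classes) + alpha
    prior = prior / prior.sum()
    cond = []  # cond[i][c, v_sp, v_i] estimates P(x_i = v_i | c, x_sp = v_sp)
    for i in range(X.shape[1]):
        counts = np.full((n_classes, n_vals[sp], n_vals[i]), alpha)
        np.add.at(counts, (y, X[:, sp], X[:, i]), 1.0)
        cond.append(counts / counts.sum(axis=2, keepdims=True))
    return prior, cond

def predict_spode(X, prior, cond, sp):
    scores = np.tile(np.log(prior), (len(X), 1))
    for i, table in enumerate(cond):
        # the i == sp factor is P(x_sp | c, x_sp) ~= 1, barely affecting ranking
        scores += np.log(table[:, X[:, sp], X[:, i]]).T
    return scores.argmax(axis=1)

def choose_super_parent(X, y, n_vals, n_classes):
    """Try each attribute as the super-parent; keep the most accurate one."""
    cut = int(0.8 * len(X))  # single hold-out split in place of full CV
    best_sp, best_acc = 0, -1.0
    for sp in range(X.shape[1]):
        prior, cond = fit_spode(X[:cut], y[:cut], sp, n_vals, n_classes)
        acc = (predict_spode(X[cut:], prior, cond, sp) == y[cut:]).mean()
        if acc > best_acc:
            best_sp, best_acc = sp, acc
    return best_sp
```

The winning super-parent would then be refit on the full training set before deployment.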

TAN (Tree Augmented Naive Bayes)

Construction Method

  1. Build a complete graph: use the conditional mutual information between each pair of attributes (given the class) as the edge weight.
  2. Find the maximum weighted spanning tree: use an algorithm such as Prim's or Kruskal's to keep only the strongest attribute dependencies, forming a tree structure (a code sketch follows this list).
  3. Add the class node: the class node becomes the parent of all attribute nodes, and the tree edges represent the attribute dependencies.
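
The three steps can be sketched compactly. The following is an illustration under stated assumptions, not a reference implementation: attributes are integer-coded and discrete, the conditional mutual information is estimated from empirical frequencies, and a simple O(d^3) Prim's loop builds the tree; all function names are made up for this sketch:

```python
import numpy as np

def cond_mutual_info(xi, xj, y):
    """Empirical I(x_i; x_j | y): sum over (a, b, c) of
    p(a, b, c) * log( p(a, b | c) / (p(a | c) * p(b | c)) )."""
    cmi = 0.0
    for c in set(y):
        m = (y == c)
        for a in set(xi[m]):
            for b in set(xj[m]):
                p_abc = np.mean(m & (xi == a) & (xj == b))  # p(a, b, c)
                p_ab_c = p_abc / np.mean(m)                 # p(a, b | c)
                p_a_c = np.mean(xi[m] == a)                 # p(a | c)
                p_b_c = np.mean(xj[m] == b)                 # p(b | c)
                cmi += p_abc * np.log(p_ab_c / (p_a_c * p_b_c))
    return cmi

def max_spanning_tree(w):
    """Prim's algorithm on a symmetric weight matrix.
    Returns parents[v] = attribute parent of node v (-1 for the root)."""
    d = len(w)
    parents, in_tree = np.full(d, -1), {0}
    while len(in_tree) < d:
        _, u, v = max((w[u, v], u, v)
                      for u in in_tree for v in range(d) if v not in in_tree)
        parents[v] = u
        in_tree.add(v)
    return parents

def tan_structure(X, y):
    """Steps 1-2: edge weights from conditional mutual information, then the
    maximum weighted spanning tree. Step 3 is implicit: the class is
    additionally a parent of every attribute."""
    d = X.shape[1]
    w = np.zeros((d, d))
    for i in range(d):
        for j in range(i + 1, d):
            w[i, j] = w[j, i] = cond_mutual_info(X[:, i], X[:, j], y)
    return max_spanning_tree(w)
```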

Formula

P(c | x) \propto P(c) \prod_{i=1}^{d} P(x_i | c, pa_i)

where pa_i is determined by the tree structure: the root attribute depends only on the class, and every other attribute additionally depends on its parent in the tree

Advantage

TAN builds a tree structure that captures the most important attribute dependencies while maintaining computational efficiency. It balances between model complexity and performance.

AODE (Averaged One-Dependent Estimator)

Basic Idea

AODE, proposed by Geoffrey Webb, is a more robust and powerful ODE variant. Instead of committing to a single super-parent, it averages over multiple SPODE models:

P(c | x) \propto \sum_{i=1,\, |D_{x_i}| \geq m'}^{d} P(c, x_i) \prod_{j=1}^{d} P(x_j | c, x_i)

where D_{x_i} is the set of training samples in which attribute i takes the value x_i, and m' is a threshold constant that ensures each parent has sufficient training-data support

Probability Estimation

Joint probability:

\hat{P}(c, x_i) = \frac{|D_{c,x_i}| + 1}{|D| + N_i}

Conditional probability:

\hat{P}(x_j | c, x_i) = \frac{|D_{c,x_i,x_j}| + 1}{|D_{c,x_i}| + N_j}

where D_{c,x_i,x_j} is the set of samples with class c, attribute i equal to x_i, and attribute j equal to x_j; N_i and N_j denote the numbers of distinct values attributes i and j can take, and the +1 terms apply Laplace smoothing
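
Putting the two smoothed estimates together, this sketch scores one query sample directly from the training set, mirroring the formulas above (the names, integer coding, and recompute-per-query design are illustrative simplifications; m' = 30 is a commonly cited default in the AODE literature, not a value from this text):

```python
import numpy as np

def aode_scores(x, X, y, n_vals, n_classes, m_prime=30):
    """Unnormalized AODE class scores for one query sample x.

    X, y      : integer-coded training attributes and class labels
    n_vals[i] : N_i, the number of values attribute i can take
    m_prime   : minimum support |D_{x_i}| for attribute i to act as a parent
    """
    n, d = X.shape
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        for i in range(d):
            if np.sum(X[:, i] == x[i]) < m_prime:   # |D_{x_i}| below threshold
                continue
            mask_ci = (y == c) & (X[:, i] == x[i])  # D_{c, x_i}
            # P_hat(c, x_i) = (|D_{c,x_i}| + 1) / (|D| + N_i)
            term = (mask_ci.sum() + 1) / (n + n_vals[i])
            for j in range(d):
                mask_cij = mask_ci & (X[:, j] == x[j])  # D_{c, x_i, x_j}
                # P_hat(x_j | c, x_i) = (|D_{c,x_i,x_j}| + 1) / (|D_{c,x_i}| + N_j)
                term *= (mask_cij.sum() + 1) / (mask_ci.sum() + n_vals[j])
            scores[c] += term
    return scores
```

scores.argmax() then gives the predicted class; a practical implementation would precompute the count tables rather than re-scan the training set for every query.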

Advantage

By averaging multiple SPODE models, AODE reduces the risk of choosing a poor super-parent. It improves robustness and accuracy compared to a single SPODE model, while still being more efficient than full Bayesian networks.

Dependency Relationship Diagrams

(a) Naive Bayes (NB)

Class y is the parent of all attributes x_i. Attributes are independent given the class.

y → x₁, x₂, x₃, ..., x_d

(b) SPODE

Class y is the parent of all attributes. x_1 (the super-parent) is also the parent of all other attributes.

y → all x_i
x₁ → x₂, x₃, ..., x_d

(c) TAN

Class y is the parent of all attributes. Attributes form a tree structure representing dependencies.

y → all x_i
x₁—x₂—x₃—...—x_d

Example: Customer Segmentation

Applying TAN to model customer attribute dependencies

Problem Setup

We want to classify customers into segments (High-Value, Medium-Value, Low-Value) based on attributes:

  • Age: Young, Middle, Senior
  • Income: Low, Medium, High
  • Purchase Frequency: Low, Medium, High
  • Average Order Value: Continuous (discretized into bins for count-based estimation)

TAN Construction

Step 1: Calculate conditional mutual information between attributes given the class. We might find that Income and Purchase Frequency are strongly dependent.

Step 2: Build maximum weighted spanning tree. The tree might show:

Class → Age → Income → Purchase Frequency → Average Order Value

Step 3: Use this structure for classification, where each attribute depends on the class and its parent in the tree.
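
As a purely illustrative run (the dataset below is tiny and entirely made up, so the recovered tree depends on these toy counts), the tan_structure sketch from the TAN section above can be applied to integer-coded customer records:

```python
import numpy as np

# Hypothetical integer-coded customers: columns are Age (0=Young, 1=Middle,
# 2=Senior), Income (0=Low, 1=Medium, 2=High), Purchase Frequency (0=Low,
# 1=Medium, 2=High), and Average Order Value discretized into 3 bins.
X = np.array([
    [0, 0, 0, 0], [0, 1, 1, 1], [1, 2, 2, 2], [1, 2, 1, 2],
    [2, 1, 1, 1], [2, 0, 0, 0], [1, 1, 2, 1], [0, 2, 2, 2],
])
y = np.array([2, 1, 0, 0, 1, 2, 0, 0])  # 0=High-, 1=Medium-, 2=Low-Value

parents = tan_structure(X, y)  # reuses the sketch from the TAN section
print(parents)  # attribute parent of each column; -1 marks the tree root
```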

High-Order Dependencies

Question

Can we improve generalization performance by considering higher-order dependencies between attributes?

Simple Extension: kDE

We could extend ODE to kDE (the k-Dependent Estimator) by replacing the single parent attribute pa_i with a set of k attributes:

P(x_i | y, pa_i)

where pa_i now contains k attributes

Major Obstacle

As k increases, the number of samples needed to reliably estimate P(x_i | y, pa_i) grows exponentially (see the worked count after the list below).

  • Sufficient training samples: Performance may improve
  • Limited training samples: High-order joint probability estimation becomes difficult
  • Conclusion: Modeling high-order dependencies requires more sophisticated methods like Bayesian networks
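
A back-of-the-envelope parameter count makes the obstacle concrete (the numbers below are illustrative assumptions, not from the source). With |Y| classes and every attribute taking N values, each conditional table P(x_i | y, pa_i) has

|Y| \cdot N^k \cdot (N - 1)

free parameters. For |Y| = 3 and N = 10, k = 1 needs 3 × 10 × 9 = 270 parameters per attribute, while k = 3 already needs 3 × 10^3 × 9 = 27,000, so the data required for reliable estimation grows roughly as N^k.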