
Bayesian Decision Theory & Bayes' Theorem

The probabilistic framework for optimal decision-making under uncertainty

What is Bayesian Decision Theory?

Probabilistic Framework

Bayesian Decision Theory is a fundamental framework for making optimal decisions under uncertainty. It provides a principled way to quantify the risk associated with different decisions and choose the action that minimizes expected loss. In machine learning, this theory forms the foundation for probabilistic classification.

Key Insight

Instead of making hard classifications, Bayesian decision theory allows us to consider the probability that a sample belongs to each class, along with the cost of making different types of errors. This enables optimal decision-making that accounts for both uncertainty and consequences.

Conditional Risk

Definition

Given $N$ classes, let $\lambda_{ij}$ represent the loss incurred by classifying a sample from class $j$ as class $i$. The conditional risk $R(c_i | x)$ of classifying sample $x$ as class $c_i$ is:

$$R(c_i | x) = \sum_{j=1}^{N} \lambda_{ij} P(c_j | x)$$

where $P(c_j | x)$ is the posterior probability that sample $x$ belongs to class $c_j$.

Insight

This formula computes the expected loss of classifying sample $x$ as class $c_i$. It considers every possible true class $c_j$, its posterior probability $P(c_j | x)$, and the corresponding misclassification loss $\lambda_{ij}$.
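As a minimal sketch of this weighted sum (using NumPy; the function and variable names are illustrative, not part of any library):

```python
import numpy as np

def conditional_risk(loss_matrix, posteriors):
    """Expected loss of each possible prediction for one sample x.

    loss_matrix[i, j] -- loss of predicting class i when the true class is j
    posteriors[j]     -- posterior probability P(c_j | x)
    Returns an array whose i-th entry is R(c_i | x) = sum_j loss[i, j] * P(c_j | x).
    """
    return loss_matrix @ posteriors
```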

Example: Medical Diagnosis

Consider a medical diagnosis scenario with two classes: Healthy (class 0) and Disease (class 1). The loss matrix might be:

| Predicted / Actual | Healthy | Disease |
| --- | --- | --- |
| Healthy | 0 | 10 |
| Disease | 1 | 0 |

Here, $\lambda_{01} = 10$ (classifying disease as healthy) has a high cost because it delays treatment, while $\lambda_{10} = 1$ (classifying healthy as disease) has a lower cost (an unnecessary test).

Calculation Example

Suppose that for a patient with symptoms $x$, we have:

  • $P(\text{Healthy} | x) = 0.3$
  • $P(\text{Disease} | x) = 0.7$

Conditional risk of predicting "Healthy":

$$R(\text{Healthy} | x) = 0 \times 0.3 + 10 \times 0.7 = 7.0$$

Conditional risk of predicting "Disease":

$$R(\text{Disease} | x) = 1 \times 0.3 + 0 \times 0.7 = 0.3$$

Optimal decision: Predict "Disease" (lower risk: 0.3 < 7.0)
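These numbers can be reproduced with a short script; the sketch below simply hard-codes the loss matrix and posteriors from this example:

```python
import numpy as np

# Rows = predicted class, columns = true class, order: [Healthy, Disease]
loss = np.array([[0.0, 10.0],   # predict Healthy: lambda_00 = 0, lambda_01 = 10
                 [1.0,  0.0]])  # predict Disease: lambda_10 = 1, lambda_11 = 0

posteriors = np.array([0.3, 0.7])   # [P(Healthy | x), P(Disease | x)]

risks = loss @ posteriors           # [R(Healthy | x), R(Disease | x)]
print(risks)                        # [7.  0.3] -> predicting "Disease" has the lower risk
```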

Bayes Decision Rule

Definition

The Bayes Decision Rule aims to minimize the overall risk. The decision function $h^*(x)$ is:

$$h^*(x) = \arg\min_{c \in \mathcal{Y}} R(c | x)$$

where $\mathcal{Y}$ is the set of all possible classes.

Insight

For any given sample $x$, we should classify it into the class $c$ that minimizes its conditional risk $R(c | x)$.
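A sketch of this rule as code, assuming the loss matrix and per-sample posteriors are already available (names are illustrative):

```python
import numpy as np

def bayes_optimal_decision(loss_matrix, posteriors):
    """Return the index of the class that minimizes the conditional risk R(c | x)."""
    risks = loss_matrix @ posteriors        # R(c_i | x) for every candidate class i
    return int(np.argmin(risks))

# Medical example from above: class 1 ("Disease") minimizes the risk
loss = np.array([[0.0, 10.0], [1.0, 0.0]])
print(bayes_optimal_decision(loss, np.array([0.3, 0.7])))   # -> 1
```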

Bayes Optimal Classifier

The classifier $h^*$ obtained from the Bayes decision rule is called the Bayes Optimal Classifier.

Bayes Risk

The overall risk corresponding to the Bayes optimal classifier is called the Bayes Risk. This represents the theoretical performance limit.

Key Insight: Given the data distribution and loss function, no classifier can achieve a lower overall risk than the Bayes optimal classifier. It serves as a benchmark for evaluating other classifiers.
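For reference, the overall risk of a decision rule $h$ is the expectation of its conditional risk over the sample distribution, and under the standard 0/1 loss (a common special case not spelled out above) the Bayes decision rule reduces to choosing the most probable class:

$$R(h) = \mathbb{E}_{x}\big[ R(h(x) | x) \big]$$

$$\lambda_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \quad\Rightarrow\quad R(c | x) = 1 - P(c | x), \quad h^*(x) = \arg\max_{c \in \mathcal{Y}} P(c | x)$$

In that special case the Bayes risk is $1 - \mathbb{E}_x[\max_c P(c | x)]$, the lowest error rate any classifier can reach on the given distribution.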

Discriminative vs. Generative Models

In machine learning, we need to estimate the posterior probability $P(c | x)$ for classification. However, $P(c | x)$ is usually difficult to obtain directly. There are two fundamental strategies:

| Feature | Discriminative Models | Generative Models |
| --- | --- | --- |
| Approach | Directly model $P(c \mid x)$ | Model the joint distribution $P(x, c)$, then derive $P(c \mid x)$ |
| Formula | N/A | $P(c \mid x) = \frac{P(x, c)}{P(x)}$ |
| Examples | Decision Trees, Neural Networks, SVM | Bayesian Classifiers |
| Focus | Learn decision boundaries directly | Learn the data generation mechanism |
| Capability | Better at classification | Can generate new samples |

Important Note

Bayesian Classifiers ≠ Bayesian Learning. Bayesian learning is a learning paradigm, while Bayesian classifiers are specific models. Bayesian classifiers use Bayes' theorem but do not necessarily use Bayesian learning (which places prior distributions over model parameters).

Bayes' Theorem

The Fundamental Formula

Bayes' theorem is a core theorem in probability theory that describes how to update probabilities when new evidence is observed:

$$P(c | x) = \frac{P(x, c)}{P(x)} = \frac{P(c) P(x | c)}{P(x)}$$

Components Explained

$P(c | x)$ - Posterior Probability

The probability that sample $x$ belongs to class $c$ after observing the sample. This is what we want to estimate for classification.

$P(c)$ - Prior Probability

The proportion of class $c$ samples in the sample space. It can be estimated from class frequencies, justified by the law of large numbers.

$P(x | c)$ - Likelihood

Also called the class-conditional probability. The probability of observing sample $x$ given that it belongs to class $c$. Estimating it is the main challenge in practice.

$P(x)$ - Evidence

A normalization factor independent of the class $c$. In classification, since it is the same for all classes, we can ignore it and compare $P(c) P(x | c)$ directly.

Main Difficulty

The main difficulty in applying Bayes' theorem is estimating the likelihood $P(x | c)$. When sample $x$ is a high-dimensional vector, directly estimating its probability distribution is very challenging: with $d$ binary attributes, for example, there are $2^d$ possible value combinations of $x$, far more than any realistic training set can cover. This is why we need techniques like maximum likelihood estimation and assumptions like attribute independence (in Naive Bayes).

Example: Email Spam Classification

Real-world application of Bayes' theorem

Problem Setup

We want to classify emails as Spam or Not Spam based on the presence of certain words. Suppose we observe an email containing the word "winner".

Given Information

  • Prior: $P(\text{Spam}) = 0.3$, $P(\text{Not Spam}) = 0.7$
  • Likelihood: $P(\text{"winner"} | \text{Spam}) = 0.8$
  • Likelihood: $P(\text{"winner"} | \text{Not Spam}) = 0.1$

Calculate Posterior

First, calculate evidence:

P("winner")=P(Spam)P("winner"Spam)+P(Not Spam)P("winner"Not Spam)P(\text{"winner"}) = P(\text{Spam}) P(\text{"winner"} | \text{Spam}) + P(\text{Not Spam}) P(\text{"winner"} | \text{Not Spam})
=0.3×0.8+0.7×0.1=0.24+0.07=0.31= 0.3 \times 0.8 + 0.7 \times 0.1 = 0.24 + 0.07 = 0.31

Posterior for Spam:

P(Spam"winner")=P(Spam)P("winner"Spam)P("winner")=0.3×0.80.310.774P(\text{Spam} | \text{"winner"}) = \frac{P(\text{Spam}) P(\text{"winner"} | \text{Spam})}{P(\text{"winner"})} = \frac{0.3 \times 0.8}{0.31} \approx 0.774

Posterior for Not Spam:

P(Not Spam"winner")=P(Not Spam)P("winner"Not Spam)P("winner")=0.7×0.10.310.226P(\text{Not Spam} | \text{"winner"}) = \frac{P(\text{Not Spam}) P(\text{"winner"} | \text{Not Spam})}{P(\text{"winner"})} = \frac{0.7 \times 0.1}{0.31} \approx 0.226

Classification Result: Spam (probability: 77.4% > 22.6%)
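The same arithmetic as a short script (the numbers below are the assumed values from this example):

```python
# Assumed values from the example above
p_spam, p_ham = 0.3, 0.7              # priors P(Spam), P(Not Spam)
p_word_spam, p_word_ham = 0.8, 0.1    # likelihoods P("winner" | Spam), P("winner" | Not Spam)

# Evidence P("winner") via the law of total probability
evidence = p_spam * p_word_spam + p_ham * p_word_ham    # 0.31

# Posteriors via Bayes' theorem
post_spam = p_spam * p_word_spam / evidence             # ~0.774
post_ham = p_ham * p_word_ham / evidence                # ~0.226
print(post_spam, post_ham)

# For classification alone the evidence can be skipped: comparing the numerators
# 0.24 vs. 0.07 already picks Spam, since the denominator is shared by both classes.
```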