
Bayesian Decision Theory & Bayes' Theorem

The probabilistic framework for optimal decision-making under uncertainty

What is Bayesian Decision Theory?

Probabilistic Framework

Bayesian Decision Theory is a fundamental framework for making optimal decisions under uncertainty. It provides a principled way to quantify the risk associated with different decisions and choose the action that minimizes expected loss. In machine learning, this theory forms the foundation for probabilistic classification.

Key Insight

Instead of making hard classifications, Bayesian decision theory allows us to consider the probability that a sample belongs to each class, along with the cost of making different types of errors. This enables optimal decision-making that accounts for both uncertainty and consequences.

Conditional Risk

Definition

Given $N$ classes, let $\lambda_{ij}$ represent the loss incurred by classifying a sample from class $j$ as class $i$. The conditional risk $R(c_i | x)$ of classifying sample $x$ as class $c_i$ is:

$$R(c_i | x) = \sum_{j=1}^{N} \lambda_{ij} P(c_j | x)$$

where $P(c_j | x)$ is the posterior probability that sample $x$ belongs to class $c_j$.

Insight

This formula computes the expected loss of classifying sample $x$ as class $c_i$. It considers every possible true class $c_j$, its posterior probability $P(c_j | x)$, and the corresponding misclassification loss $\lambda_{ij}$.
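As a minimal sketch of this weighted sum (using NumPy; the function and variable names are illustrative, not part of any library):

```python
import numpy as np

def conditional_risk(loss_matrix, posteriors):
    """Expected loss of each possible prediction for one sample x.

    loss_matrix[i, j] -- loss of predicting class i when the true class is j
    posteriors[j]     -- posterior probability P(c_j | x)
    Returns an array whose i-th entry is R(c_i | x) = sum_j loss[i, j] * P(c_j | x).
    """
    return loss_matrix @ posteriors
```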

Example: Medical Diagnosis

Consider a medical diagnosis scenario with two classes: Healthy (class 0) and Disease (class 1). The loss matrix might be:

| Predicted / Actual | Healthy | Disease |
| --- | --- | --- |
| Healthy | 0 | 10 |
| Disease | 1 | 0 |

Here, $\lambda_{01} = 10$ (classifying disease as healthy) has a high cost because it delays treatment, while $\lambda_{10} = 1$ (classifying healthy as disease) has a lower cost (an unnecessary test).

Calculation Example

Suppose that for a patient with symptoms $x$, we have:

  • $P(\text{Healthy} | x) = 0.3$
  • $P(\text{Disease} | x) = 0.7$

Conditional risk of predicting "Healthy":

$$R(\text{Healthy} | x) = 0 \times 0.3 + 10 \times 0.7 = 7.0$$

Conditional risk of predicting "Disease":

$$R(\text{Disease} | x) = 1 \times 0.3 + 0 \times 0.7 = 0.3$$

Optimal decision: Predict "Disease" (lower risk: 0.3 < 7.0)
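These numbers can be reproduced with a short script; the sketch below simply hard-codes the loss matrix and posteriors from this example:

```python
import numpy as np

# Rows = predicted class, columns = true class, order: [Healthy, Disease]
loss = np.array([[0.0, 10.0],   # predict Healthy: lambda_00 = 0, lambda_01 = 10
                 [1.0,  0.0]])  # predict Disease: lambda_10 = 1, lambda_11 = 0

posteriors = np.array([0.3, 0.7])   # [P(Healthy | x), P(Disease | x)]

risks = loss @ posteriors           # [R(Healthy | x), R(Disease | x)]
print(risks)                        # [7.  0.3] -> predicting "Disease" has the lower risk
```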

Bayes Decision Rule

Definition

The Bayes Decision Rule aims to minimize the overall risk. The decision function $h^*(x)$ is:

$$h^*(x) = \arg\min_{c \in \mathcal{Y}} R(c | x)$$

where $\mathcal{Y}$ is the set of all possible classes.

Insight

For any given sample $x$, we should classify it into the class $c$ that minimizes its conditional risk $R(c | x)$.
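A sketch of this rule as code, assuming the loss matrix and per-sample posteriors are already available (names are illustrative):

```python
import numpy as np

def bayes_optimal_decision(loss_matrix, posteriors):
    """Return the index of the class that minimizes the conditional risk R(c | x)."""
    risks = loss_matrix @ posteriors        # R(c_i | x) for every candidate class i
    return int(np.argmin(risks))

# Medical example from above: class 1 ("Disease") minimizes the risk
loss = np.array([[0.0, 10.0], [1.0, 0.0]])
print(bayes_optimal_decision(loss, np.array([0.3, 0.7])))   # -> 1
```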

Bayes Optimal Classifier

The classifier $h^*$ obtained from the Bayes decision rule is called the Bayes Optimal Classifier.

Bayes Risk

The overall risk corresponding to the Bayes optimal classifier is called the Bayes Risk. This represents the theoretical performance limit.

Key Insight: Given the data distribution and loss function, no classifier can achieve a lower overall risk than the Bayes optimal classifier. It serves as a benchmark for evaluating other classifiers.
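For reference, the overall risk of a decision rule $h$ is the expectation of its conditional risk over the sample distribution, and under the standard 0/1 loss (a common special case not spelled out above) the Bayes decision rule reduces to choosing the most probable class:

$$R(h) = \mathbb{E}_{x}\big[ R(h(x) | x) \big]$$

$$\lambda_{ij} = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \quad\Rightarrow\quad R(c | x) = 1 - P(c | x), \quad h^*(x) = \arg\max_{c \in \mathcal{Y}} P(c | x)$$

In that special case the Bayes risk is $1 - \mathbb{E}_x[\max_c P(c | x)]$, the lowest error rate any classifier can reach on the given distribution.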

Discriminative vs. Generative Models

In machine learning, we need to estimate the posterior probability $P(c | x)$ for classification. However, $P(c | x)$ is usually difficult to obtain directly. There are two fundamental strategies:

| Feature | Discriminative Models | Generative Models |
| --- | --- | --- |
| Approach | Directly model $P(c \mid x)$ | Model the joint distribution $P(x, c)$, then derive $P(c \mid x)$ |
| Formula | N/A | $P(c \mid x) = \frac{P(x, c)}{P(x)}$ |
| Examples | Decision Trees, Neural Networks, SVM | Bayesian Classifiers |
| Focus | Learn decision boundaries directly | Learn the data generation mechanism |
| Capability | Better at classification | Can generate new samples |

Important Note

Bayesian Classifiers ≠ Bayesian Learning. Bayesian learning is a learning paradigm, while Bayesian classifiers are specific models. Bayesian classifiers use Bayes' theorem but do not necessarily use Bayesian learning (which places prior distributions over model parameters).

Bayes' Theorem

The Fundamental Formula

Bayes' theorem is a core theorem in probability theory that describes how to update probabilities when new evidence is observed:

$$P(c | x) = \frac{P(x, c)}{P(x)} = \frac{P(c) P(x | c)}{P(x)}$$

Components Explained

$P(c | x)$ - Posterior Probability

The probability that sample $x$ belongs to class $c$ after observing the sample. This is what we want to estimate for classification.

$P(c)$ - Prior Probability

The proportion of class $c$ samples in the sample space. It can be estimated from class frequencies, justified by the law of large numbers.

$P(x | c)$ - Likelihood

Also called the class-conditional probability. The probability of observing sample $x$ given that it belongs to class $c$. Estimating it is the main challenge in practice.

$P(x)$ - Evidence

A normalization factor independent of the class $c$. In classification, since it is the same for all classes, we can ignore it and compare $P(c) P(x | c)$ directly.

Main Difficulty

The main difficulty in applying Bayes' theorem is estimating the likelihood $P(x | c)$. When sample $x$ is a high-dimensional vector, directly estimating its probability distribution is very challenging: with $d$ binary attributes, for example, there are $2^d$ possible value combinations of $x$, far more than any realistic training set can cover. This is why we need techniques like maximum likelihood estimation and assumptions like attribute independence (in Naive Bayes).

Example: Email Spam Classification

Real-world application of Bayes' theorem

Problem Setup

We want to classify emails as Spam or Not Spam based on the presence of certain words. Suppose we observe an email containing the word "winner".

Given Information

  • Prior: $P(\text{Spam}) = 0.3$, $P(\text{Not Spam}) = 0.7$
  • Likelihood: $P(\text{"winner"} | \text{Spam}) = 0.8$
  • Likelihood: $P(\text{"winner"} | \text{Not Spam}) = 0.1$

Calculate Posterior

First, calculate evidence:

P("winner")=P(Spam)P("winner"Spam)+P(Not Spam)P("winner"Not Spam)P(\text{"winner"}) = P(\text{Spam}) P(\text{"winner"} | \text{Spam}) + P(\text{Not Spam}) P(\text{"winner"} | \text{Not Spam})
=0.3×0.8+0.7×0.1=0.24+0.07=0.31= 0.3 \times 0.8 + 0.7 \times 0.1 = 0.24 + 0.07 = 0.31

Posterior for Spam:

P(Spam"winner")=P(Spam)P("winner"Spam)P("winner")=0.3×0.80.310.774P(\text{Spam} | \text{"winner"}) = \frac{P(\text{Spam}) P(\text{"winner"} | \text{Spam})}{P(\text{"winner"})} = \frac{0.3 \times 0.8}{0.31} \approx 0.774

Posterior for Not Spam:

P(Not Spam"winner")=P(Not Spam)P("winner"Not Spam)P("winner")=0.7×0.10.310.226P(\text{Not Spam} | \text{"winner"}) = \frac{P(\text{Not Spam}) P(\text{"winner"} | \text{Not Spam})}{P(\text{"winner"})} = \frac{0.7 \times 0.1}{0.31} \approx 0.226

Classification Result: Spam (probability: 77.4% > 22.6%)
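The same arithmetic as a short script (the numbers below are the assumed values from this example):

```python
# Assumed values from the example above
p_spam, p_ham = 0.3, 0.7              # priors P(Spam), P(Not Spam)
p_word_spam, p_word_ham = 0.8, 0.1    # likelihoods P("winner" | Spam), P("winner" | Not Spam)

# Evidence P("winner") via the law of total probability
evidence = p_spam * p_word_spam + p_ham * p_word_ham    # 0.31

# Posteriors via Bayes' theorem
post_spam = p_spam * p_word_spam / evidence             # ~0.774
post_ham = p_ham * p_word_ham / evidence                # ~0.226
print(post_spam, post_ham)

# For classification alone the evidence can be skipped: comparing the numerators
# 0.24 vs. 0.07 already picks Spam, since the denominator is shared by both classes.
```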