The probabilistic framework for optimal decision-making under uncertainty
Bayesian Decision Theory is a fundamental framework for making optimal decisions under uncertainty. It provides a principled way to quantify the risk associated with different decisions and choose the action that minimizes expected loss. In machine learning, this theory forms the foundation for probabilistic classification.
Instead of making hard classifications, Bayesian decision theory allows us to consider the probability that a sample belongs to each class, along with the cost of making different types of errors. This enables optimal decision-making that accounts for both uncertainty and consequences.
Given $N$ classes $\{c_1, c_2, \ldots, c_N\}$, let $\lambda_{ij}$ represent the loss incurred by classifying a sample from class $c_j$ as class $c_i$. The conditional risk of classifying sample $\boldsymbol{x}$ as class $c_i$ is:

$$R(c_i \mid \boldsymbol{x}) = \sum_{j=1}^{N} \lambda_{ij}\, P(c_j \mid \boldsymbol{x})$$

where $P(c_j \mid \boldsymbol{x})$ is the posterior probability that sample $\boldsymbol{x}$ belongs to class $c_j$.

This formula computes the expected loss of classifying sample $\boldsymbol{x}$ as class $c_i$. It considers all possible true classes $c_j$, their corresponding posterior probabilities $P(c_j \mid \boldsymbol{x})$, and the misclassification losses $\lambda_{ij}$.
Consider a medical diagnosis scenario with two classes: Healthy (class 0) and Disease (class 1). The loss matrix might be:
| Predicted / Actual | Healthy | Disease |
|---|---|---|
| Healthy | 0 | 10 |
| Disease | 1 | 0 |
Here, $\lambda_{01} = 10$ (classifying disease as healthy) has high cost because it delays treatment, while $\lambda_{10} = 1$ (classifying healthy as disease) has lower cost (an unnecessary test).
Suppose for a patient with symptoms $\boldsymbol{x}$, we have the posterior probabilities $P(\text{Healthy} \mid \boldsymbol{x}) = 0.3$ and $P(\text{Disease} \mid \boldsymbol{x}) = 0.7$.

Conditional risk of predicting "Healthy": $R(\text{Healthy} \mid \boldsymbol{x}) = 0 \times 0.3 + 10 \times 0.7 = 7.0$

Conditional risk of predicting "Disease": $R(\text{Disease} \mid \boldsymbol{x}) = 1 \times 0.3 + 0 \times 0.7 = 0.3$
Optimal decision: Predict "Disease" (lower risk: 0.3 < 7.0)
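A minimal Python sketch of this calculation, using the loss matrix and posteriors from the example above (the names `loss`, `posterior`, and `risk` are just illustrative):

```python
import numpy as np

# Loss matrix: loss[i][j] = cost of predicting class i when the true class is j
# Classes: 0 = Healthy, 1 = Disease (values from the example above)
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])

# Posterior probabilities [P(Healthy | x), P(Disease | x)] for this patient
posterior = np.array([0.3, 0.7])

# Conditional risk R(c_i | x) = sum_j loss[i][j] * P(c_j | x)
risk = loss @ posterior
print(risk)                              # [7.0, 0.3]

# Pick the class with the minimum conditional risk
decision = np.argmin(risk)
print(["Healthy", "Disease"][decision])  # Disease
```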
The Bayes Decision Rule aims to minimize the overall risk. The decision function is:

$$h^*(\boldsymbol{x}) = \arg\min_{c \in \mathcal{Y}} R(c \mid \boldsymbol{x})$$

where $\mathcal{Y}$ is the set of all possible classes.

For any given sample $\boldsymbol{x}$, we should classify it into the class that minimizes its conditional risk $R(c \mid \boldsymbol{x})$.
The classifier $h^*$ obtained from the Bayes decision rule is called the Bayes Optimal Classifier.
The overall risk corresponding to the Bayes optimal classifier is called the Bayes Risk. This represents the theoretical performance limit.
Key Insight: Given the data distribution and the loss function, no classifier can achieve a lower overall risk than the Bayes optimal classifier (under 0/1 loss, this means no lower error rate). It serves as a benchmark for evaluating other classifiers.
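As a sketch of the rule in general form (assuming the true posteriors are available as an array; all names here are illustrative), the decision function and the corresponding average minimal risk can be computed as follows:

```python
import numpy as np

def bayes_decision(posteriors: np.ndarray, loss: np.ndarray) -> np.ndarray:
    """Bayes decision rule h*(x) = argmin_c R(c | x), applied to each sample."""
    # risks[n, i] = R(c_i | x_n) = sum_j loss[i, j] * P(c_j | x_n)
    risks = posteriors @ loss.T
    return risks.argmin(axis=1)

def average_min_risk(posteriors: np.ndarray, loss: np.ndarray) -> float:
    """Average minimal conditional risk over the samples.
    If the posteriors are the true ones, this estimates the Bayes risk."""
    risks = posteriors @ loss.T
    return float(risks.min(axis=1).mean())

# Illustrative values: the medical loss matrix and three hypothetical patients
loss = np.array([[0.0, 10.0],
                 [1.0,  0.0]])
posteriors = np.array([[0.30, 0.70],
                       [0.95, 0.05],
                       [0.60, 0.40]])

print(bayes_decision(posteriors, loss))    # [1 0 1] -> Disease, Healthy, Disease
print(average_min_risk(posteriors, loss))  # (0.3 + 0.5 + 0.6) / 3 ≈ 0.467
```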
In machine learning, we need to estimate the posterior probability $P(c \mid \boldsymbol{x})$ for classification. However, $P(c \mid \boldsymbol{x})$ is usually difficult to obtain directly. There are two fundamental strategies:
| Feature | Discriminative Models | Generative Models |
|---|---|---|
| Approach | Directly model $P(c \mid \boldsymbol{x})$ | Model the joint distribution $P(\boldsymbol{x}, c)$, then derive $P(c \mid \boldsymbol{x})$ |
| Formula | N/A | $P(c \mid \boldsymbol{x}) = \dfrac{P(\boldsymbol{x}, c)}{P(\boldsymbol{x})}$ |
| Examples | Decision Trees, Neural Networks, SVM | Bayesian Classifiers |
| Focus | Learn decision boundaries directly | Learn data generation mechanism |
| Capability | Better at classification | Can generate new samples |
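As a small illustration of the generative route (with made-up numbers), the sketch below models the joint distribution $P(\boldsymbol{x}, c)$ as a table over a single binary feature and derives the posterior from it:

```python
import numpy as np

# Illustrative joint distribution P(x, c) over one binary feature x and two classes c
# Rows: x in {0, 1}; columns: c in {0, 1}. Values are assumed for this sketch.
joint = np.array([[0.35, 0.10],   # P(x=0, c=0), P(x=0, c=1)
                  [0.15, 0.40]])  # P(x=1, c=0), P(x=1, c=1)

# Generative route: P(c | x) = P(x, c) / P(x), where P(x) = sum_c P(x, c)
evidence = joint.sum(axis=1, keepdims=True)   # P(x)
posterior = joint / evidence                  # P(c | x)

print(posterior)
# For x=1: P(c=0 | x=1) = 0.15 / 0.55 ≈ 0.273, P(c=1 | x=1) = 0.40 / 0.55 ≈ 0.727
```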
Bayesian Classifiers ≠ Bayesian Learning. Bayesian learning is a learning paradigm, while Bayesian classifiers are specific models. Bayesian classifiers use Bayes' theorem but may not necessarily use Bayesian learning (which involves prior distributions over parameters).
Bayes' theorem is a core theorem in probability theory that describes how to update probabilities when new evidence is observed:

$$P(c \mid \boldsymbol{x}) = \frac{P(c)\, P(\boldsymbol{x} \mid c)}{P(\boldsymbol{x})}$$
- **Posterior $P(c \mid \boldsymbol{x})$**: the probability that sample $\boldsymbol{x}$ belongs to class $c$ after observing the sample. This is what we want to estimate for classification.
- **Prior $P(c)$**: the proportion of class $c$ samples in the sample space. It can be estimated from class frequencies using the law of large numbers.
- **Likelihood $P(\boldsymbol{x} \mid c)$**: also called the class-conditional probability. The probability of observing sample $\boldsymbol{x}$ given that it belongs to class $c$. This is the main challenge in practice.
- **Evidence $P(\boldsymbol{x})$**: a normalization factor independent of the class $c$. In classification, since it is the same for all classes, we can ignore it and compare $P(c)\,P(\boldsymbol{x} \mid c)$ directly.
The main difficulty in applying Bayes' theorem is estimating the likelihood $P(\boldsymbol{x} \mid c)$. When sample $\boldsymbol{x}$ is a high-dimensional vector, directly estimating its probability distribution is very challenging. This is why we need techniques like maximum likelihood estimation and assumptions like attribute independence (as in Naive Bayes).
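A rough way to see the difficulty: in an assumed toy setting with $d$ binary attributes, direct estimation of $P(\boldsymbol{x} \mid c)$ requires a probability for every joint configuration of $\boldsymbol{x}$, whereas the attribute-independence assumption reduces this to one term per attribute:

```python
# Assumed toy setting: d binary attributes per sample, one class c
d = 30

# Direct estimation of P(x | c): one free parameter per joint configuration of x
direct_params = 2 ** d - 1
print(direct_params)   # 1073741823 -- over a billion values to estimate

# Under attribute independence (Naive Bayes), P(x | c) factorizes into
# a product of per-attribute terms, so only d probabilities are needed
naive_params = d
print(naive_params)    # 30
```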
Real-world application of Bayes' theorem
We want to classify emails as Spam or Not Spam based on the presence of certain words. Suppose we observe an email containing the word "winner".
First, calculate the evidence from the estimated priors and likelihoods using the law of total probability:

$$P(\text{"winner"}) = P(\text{"winner"} \mid \text{Spam})\,P(\text{Spam}) + P(\text{"winner"} \mid \text{Not Spam})\,P(\text{Not Spam})$$

Posterior for Spam:

$$P(\text{Spam} \mid \text{"winner"}) = \frac{P(\text{"winner"} \mid \text{Spam})\,P(\text{Spam})}{P(\text{"winner"})} \approx 0.774$$

Posterior for Not Spam:

$$P(\text{Not Spam} \mid \text{"winner"}) = \frac{P(\text{"winner"} \mid \text{Not Spam})\,P(\text{Not Spam})}{P(\text{"winner"})} \approx 0.226$$
Classification Result: Spam (posterior probability: 77.4% > 22.6%)
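A minimal Python sketch of the calculation is shown below. The priors and likelihoods are assumed values, chosen only so that the posteriors reproduce the 77.4% / 22.6% figures quoted above:

```python
# Assumed values (illustrative only): priors and likelihoods chosen so that the
# posteriors match the 77.4% / 22.6% result quoted above.
p_spam = 0.3                     # P(Spam)
p_not_spam = 0.7                 # P(Not Spam)
p_word_given_spam = 0.8          # P("winner" | Spam)
p_word_given_not_spam = 0.1      # P("winner" | Not Spam)

# Evidence P("winner") via the law of total probability
evidence = p_word_given_spam * p_spam + p_word_given_not_spam * p_not_spam  # 0.31

# Bayes' theorem: P(class | "winner") = P("winner" | class) * P(class) / P("winner")
posterior_spam = p_word_given_spam * p_spam / evidence              # ~0.774
posterior_not_spam = p_word_given_not_spam * p_not_spam / evidence  # ~0.226

print(f"P(Spam | winner)     = {posterior_spam:.3f}")
print(f"P(Not Spam | winner) = {posterior_not_spam:.3f}")
print("Spam" if posterior_spam > posterior_not_spam else "Not Spam")
```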