Maximum Likelihood Estimation (MLE) is a fundamental method for estimating the parameters of probability distributions. The core idea is to find parameter values that make the observed data most probable under the assumed distribution.
Assume that the class-conditional probability $P(\boldsymbol{x} \mid c)$ has a specific probability distribution form, uniquely determined by a parameter vector $\theta_c$. Our task is to estimate $\theta_c$ using the training set $D$.
For the set $D_c$ of samples from class $c$ in the training set, the likelihood is:

$$P(D_c \mid \theta_c) = \prod_{\boldsymbol{x} \in D_c} P(\boldsymbol{x} \mid \theta_c)$$

This is the probability of observing all samples in $D_c$ given the parameters $\theta_c$.
Multiplying many probabilities (each between 0 and 1) can cause numerical underflow: the product becomes too small for floating-point arithmetic to represent accurately. To avoid this, we work with the log-likelihood instead:

$$LL(\theta_c) = \log P(D_c \mid \theta_c) = \sum_{\boldsymbol{x} \in D_c} \log P(\boldsymbol{x} \mid \theta_c)$$
The maximum likelihood estimate is:

$$\hat{\theta}_c = \underset{\theta_c}{\arg\max}\; LL(\theta_c)$$
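A minimal numeric sketch of the underflow issue (the probabilities below are synthetic, chosen only for illustration): multiplying them directly underflows to zero in double precision, while summing their logarithms stays well within range.

```python
import numpy as np

# Synthetic probabilities, purely for illustration: 1,000 values around 1e-3.
rng = np.random.default_rng(0)
probs = rng.uniform(1e-4, 1e-2, size=1000)

# Direct product of many small probabilities underflows to 0.0 in double precision.
likelihood = np.prod(probs)

# Summing logs keeps the same information in a perfectly representable range.
log_likelihood = np.sum(np.log(probs))

print(likelihood)       # 0.0  (underflow)
print(log_likelihood)   # a finite negative number (around -5.6e3 here)
```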
For continuous attributes, we often assume they follow a Gaussian (normal) distribution. If $p(x_i \mid c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2)$ for attribute $x_i$, the probability density function is:

$$p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$
The maximum likelihood estimates for the mean and variance are:

Mean: $\displaystyle \hat{\mu}_{c,i} = \frac{1}{|D_c|} \sum_{\boldsymbol{x} \in D_c} x_i$

Variance: $\displaystyle \hat{\sigma}_{c,i}^2 = \frac{1}{|D_c|} \sum_{\boldsymbol{x} \in D_c} (x_i - \hat{\mu}_{c,i})^2$
The MLE estimates are simply the sample mean and the sample variance (dividing by $|D_c|$ rather than $|D_c| - 1$) of the training data for each class. This makes intuitive sense: we estimate the distribution parameters using the statistics of the observed data.
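As a sketch, these estimates take only a few lines of NumPy; the helper names `gaussian_mle` and `gaussian_pdf` and the sample values are illustrative, not from any particular library.

```python
import numpy as np

def gaussian_mle(values):
    """MLE of a univariate Gaussian: sample mean and biased (1/n) sample variance."""
    values = np.asarray(values, dtype=float)
    mu_hat = values.mean()
    var_hat = ((values - mu_hat) ** 2).mean()   # divides by n, not n - 1
    return mu_hat, var_hat

def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Tiny usage example with made-up per-class measurements.
mu_hat, var_hat = gaussian_mle([12.5, 13.0, 13.2, 12.8])
print(mu_hat, var_hat, gaussian_pdf(13.0, mu_hat, var_hat))
```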
For binary attributes (e.g., presence/absence of a word in an email), we use the Bernoulli distribution. The probability mass function is:

$$P(x_i \mid c) = \theta_{c,i}^{\,x_i} (1 - \theta_{c,i})^{1 - x_i}, \qquad x_i \in \{0, 1\}$$

where $\theta_{c,i}$ is the probability that attribute $x_i$ equals 1 for class $c$.
The maximum likelihood estimate is:

$$\hat{\theta}_{c,i} = \frac{|D_{c, x_i = 1}|}{|D_c|}$$

where $D_{c, x_i = 1}$ is the set of samples in $D_c$ where attribute $x_i$ equals 1.
The MLE is simply the proportion of samples in class where attribute equals 1. This is the empirical frequency, which aligns with our intuition.
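A small sketch of the Bernoulli case, using made-up spam-filter data for a single word attribute (the function name and the sample values are purely hypothetical):

```python
import numpy as np

def bernoulli_mle(binary_values):
    """MLE of the Bernoulli parameter: the fraction of samples equal to 1."""
    return float(np.mean(binary_values))

# Hypothetical attribute for a spam filter: does the word "offer" appear?
# 1 = present, 0 = absent, restricted to training emails labelled "spam".
word_present_in_spam = [1, 0, 1, 1, 0, 1, 1, 1]
theta_hat = bernoulli_mle(word_present_in_spam)
print(theta_hat)   # 0.75, i.e. 6 of the 8 spam emails contain the word
```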
Example: applying MLE to estimate the alcohol content distribution
We have wine quality data with alcohol content measurements. For "High Quality" wines (class 1), we observe the following alcohol percentages:
Sample alcohol content values: 12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4
Step 1: Calculate the sample mean

$$\hat{\mu} = \frac{1}{10}(12.5 + 13.0 + \cdots + 13.4) = \frac{130.4}{10} = 13.04$$

Step 2: Calculate the sample variance

$$\hat{\sigma}^2 = \frac{1}{10} \sum_{j=1}^{10} (x_j - 13.04)^2 = \frac{0.924}{10} \approx 0.092$$

Estimated Distribution: $p(\text{alcohol} \mid \text{High Quality}) \sim \mathcal{N}(13.04,\ 0.092)$
For a new wine with alcohol content 13.2%, we can now calculate $p(13.2 \mid \text{High Quality}) \approx 1.14$ using the Gaussian probability density function.
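The worked example can be checked with a short script that reproduces the estimates and the density at 13.2%:

```python
import numpy as np

# Alcohol content (%) of the "High Quality" wines listed above.
alcohol = np.array([12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4])

mu_hat = alcohol.mean()                      # 13.04
var_hat = ((alcohol - mu_hat) ** 2).mean()   # ~0.0924 (MLE, divides by n)

# Density of the estimated N(mu_hat, var_hat) at a new value x = 13.2
x = 13.2
density = np.exp(-(x - mu_hat) ** 2 / (2 * var_hat)) / np.sqrt(2 * np.pi * var_hat)
print(mu_hat, var_hat, density)              # 13.04, ~0.0924, ~1.14
```

Note that a density value above 1 is perfectly legitimate: the Gaussian PDF returns a density, not a probability.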
Critical Insight: The accuracy of MLE estimates depends heavily on whether the assumed probability distribution form matches the true underlying distribution.
If the assumed distribution form is incorrect, maximizing the likelihood only finds the best fit within that wrong family, so the resulting density can misrepresent the data. For example, if we assume a Gaussian distribution but the data is actually skewed, the fitted Gaussian can badly misestimate probabilities, especially in the tails.
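One way to see this concretely is to fit a Gaussian by MLE to strongly skewed data. The sketch below uses synthetic log-normal samples (an assumption chosen only for illustration) and shows that the fitted Gaussian assigns clearly nonzero probability to impossible negative values:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Skewed, strictly positive data: synthetic log-normal samples with a heavy right tail.
data = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

# MLE under a (mis-specified) Gaussian assumption.
mu_hat = data.mean()
sigma_hat = data.std()   # 1/n convention, i.e. the Gaussian MLE

# The fitted Gaussian puts substantial probability mass on negative values,
# even though the true data can never be negative.
p_negative = 0.5 * (1.0 + math.erf((0.0 - mu_hat) / (sigma_hat * math.sqrt(2.0))))
print(mu_hat, sigma_hat, p_negative)   # p_negative comes out around 0.2
```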