Maximum Likelihood Estimation

Estimating probability distribution parameters from observed data

What is Maximum Likelihood Estimation?

Parameter Estimation

Maximum Likelihood Estimation (MLE) is a fundamental method for estimating the parameters of probability distributions. The core idea is to find parameter values that make the observed data most probable under the assumed distribution.

Basic Approach

  1. Assume a probability distribution form (e.g., Gaussian, Bernoulli)
  2. Estimate distribution parameters based on training samples
  3. Choose parameters that maximize the probability of observing the training data

Likelihood Function

Setup

Assume that the class-conditional probability P(x | c) has a specific probability distribution form, uniquely determined by parameters \theta_c. Our task is to estimate \theta_c using the training set D.

Likelihood for Class c

For the set D_c of samples from class c in training set D, the likelihood is:

P(D_c | \theta_c) = \prod_{x \in D_c} P(x | \theta_c)

This is the probability of observing all samples in D_c given parameters \theta_c.

Underflow Problem

Multiplying many probabilities (each between 0 and 1) can cause numerical underflow—the result becomes too small for computers to represent accurately. This is why we use log-likelihood.
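As a minimal Python sketch (the per-sample probability 1e-3 and the sample count 400 are made-up illustration values), the naive product underflows to zero in double precision, while the sum of logarithms stays representable:

    import math

    # Hypothetical per-sample probabilities P(x | theta_c), each small but nonzero
    probs = [1e-3] * 400

    # Naive likelihood: the running product underflows to exactly 0.0
    likelihood = 1.0
    for p in probs:
        likelihood *= p
    print(likelihood)                          # 0.0 -- 1e-1200 is below what a double can represent

    # The same information as a sum of logs stays well within range
    print(sum(math.log(p) for p in probs))     # about -2763.1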

Log-Likelihood

Definition

To avoid underflow, we use the log-likelihood:

LL(\theta_c) = \log P(D_c | \theta_c) = \sum_{x \in D_c} \log P(x | \theta_c)
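A small sketch of LL(\theta_c) for one concrete model, a Bernoulli class-conditional over a single binary attribute; the data and the parameter values are illustrative, not taken from the text:

    import math

    def bernoulli_log_likelihood(theta: float, samples: list[int]) -> float:
        """LL(theta) = sum of log P(x | theta) over the class's samples."""
        return sum(math.log(theta if x == 1 else 1.0 - theta) for x in samples)

    # Binary attribute values for the samples of one class, e.g. "word present" flags
    D_c = [1, 0, 1, 1, 0, 1, 1, 1]

    print(bernoulli_log_likelihood(0.50, D_c))   # about -5.55
    print(bernoulli_log_likelihood(0.75, D_c))   # about -4.50, higher: 6 of 8 samples are 1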

Why Log-Likelihood?

  • Converts multiplication to addition: Easier to compute and differentiate
  • Avoids underflow: Log of small numbers is manageable
  • Monotonic transformation: Maximizing log-likelihood is equivalent to maximizing likelihood

Maximum Likelihood Estimate

The maximum likelihood estimate \hat{\theta}_c is:

\hat{\theta}_c = \arg\max_{\theta_c} LL(\theta_c)
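A grid-search sketch of this argmax for the same illustrative Bernoulli setup; a closed-form solution or numerical optimizer would be used in practice, but the maximizer already lands on the sample frequency:

    import math

    D_c = [1, 0, 1, 1, 0, 1, 1, 1]   # binary attribute values for one class (illustrative)

    def LL(theta: float) -> float:
        # Log-likelihood of D_c under a Bernoulli(theta) model
        return sum(math.log(theta if x == 1 else 1.0 - theta) for x in D_c)

    # Evaluate LL on a fine grid and keep the best candidate
    theta_hat = max((i / 1000 for i in range(1, 1000)), key=LL)
    print(theta_hat)   # 0.75 -- the proportion of 1s in D_c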

Gaussian Distribution MLE

Setup

For continuous attributes, we often assume they follow a Gaussian (normal) distribution. If p(x_i | c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2), the probability density function is:

p(x_i | c) = \frac{1}{\sqrt{2\pi}\sigma_{c,i}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)
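This density translates directly into a small helper; the function name and test value are mine, not from the text:

    import math

    def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
        """p(x_i | c) for a univariate Gaussian with mean mu and variance sigma2."""
        return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

    print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))   # about 0.3989, the standard normal at 0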

MLE for Gaussian Parameters

The maximum likelihood estimates for the mean and variance are:

Mean:

\hat{\mu}_{c,i} = \frac{1}{|D_c|} \sum_{x \in D_c} x_i

Variance:

\hat{\sigma}_{c,i}^2 = \frac{1}{|D_c|} \sum_{x \in D_c} (x_i - \hat{\mu}_{c,i})^2

Insight

The MLE estimates are simply the sample mean and sample variance of the training data for each class. This makes intuitive sense: we estimate the distribution parameters using the statistics of the observed data.
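A sketch of both estimators on toy per-class data (the numbers and class labels are made up); note that statistics.pvariance uses the 1/|D_c| normalization of the MLE, not the unbiased 1/(|D_c| - 1) version:

    from statistics import fmean, pvariance

    # (attribute value, class label) pairs -- illustrative toy data
    data = [(12.5, "high"), (13.0, "high"), (13.2, "high"), (11.2, "low"), (10.9, "low")]

    for c in ("high", "low"):
        xs = [x for x, label in data if label == c]   # the attribute values of D_c
        mu_hat = fmean(xs)                            # MLE mean
        sigma2_hat = pvariance(xs, mu=mu_hat)         # MLE variance
        print(c, mu_hat, sigma2_hat)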

Bernoulli Distribution MLE

Setup

For binary attributes (e.g., presence/absence of a word in an email), we use the Bernoulli distribution. The probability mass function is:

P(x_i = v | c) = \begin{cases} \theta_{c,i} & \text{if } v = 1 \\ 1 - \theta_{c,i} & \text{if } v = 0 \end{cases}

where \theta_{c,i} is the probability that attribute i equals 1 for class c.

MLE for Bernoulli Parameter

The maximum likelihood estimate is:

\hat{\theta}_{c,i} = \frac{|D_{c,x_i}|}{|D_c|}

where D_{c,x_i} is the set of samples in D_c where attribute i takes value x_i.

Interpretation

The MLE is simply the proportion of samples in class c where attribute i equals 1. This is the empirical frequency, which aligns with our intuition.
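A sketch of this frequency estimate for one binary attribute, with a toy spam-filter style dataset (labels and values are illustrative):

    # (attribute value in {0, 1}, class label), e.g. whether a given word appears in an email
    data = [(1, "spam"), (0, "spam"), (1, "spam"), (1, "ham"), (0, "ham"), (0, "ham")]

    def bernoulli_mle(samples, c):
        # theta_hat_{c,i} = (number of class-c samples with attribute value 1) / |D_c|
        D_c = [x for x, label in samples if label == c]
        return sum(D_c) / len(D_c)

    print(bernoulli_mle(data, "spam"))   # 2/3: the attribute is 1 in two of three spam samples
    print(bernoulli_mle(data, "ham"))    # 1/3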

Example: Wine Quality Prediction

Applying MLE to estimate alcohol content distribution

Dataset Overview

We have wine quality data with alcohol content measurements. For "High Quality" wines (class 1), we observe the following alcohol percentages:

Sample alcohol content values: 12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4

MLE Calculation

Step 1: Calculate sample mean

\hat{\mu}_{1,\text{alcohol}} = \frac{1}{10}(12.5 + 13.0 + 13.2 + 12.8 + 13.5 + 13.1 + 12.9 + 13.3 + 12.7 + 13.4)
= \frac{130.4}{10} = 13.04

Step 2: Calculate sample variance

\hat{\sigma}_{1,\text{alcohol}}^2 = \frac{1}{10}\sum_{i=1}^{10}(x_i - 13.04)^2
= \frac{1}{10}\left[(12.5-13.04)^2 + (13.0-13.04)^2 + \cdots + (13.4-13.04)^2\right]
= \frac{0.924}{10} = 0.0924

Estimated Distribution: \mathcal{N}(13.04, 0.0924)

For a new wine with alcohol content 13.2%, we can now calculate p(13.2 | \text{High Quality}) using the Gaussian probability density function.
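A sketch that reproduces this worked example end to end, using Python's standard-library statistics helpers (NormalDist expects a standard deviation, hence the square root):

    from statistics import NormalDist, fmean, pvariance

    alcohol = [12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4]

    mu_hat = fmean(alcohol)                      # 13.04
    sigma2_hat = pvariance(alcohol, mu=mu_hat)   # 0.0924

    # Density of the fitted Gaussian at 13.2% alcohol, roughly 1.14
    print(NormalDist(mu_hat, sigma2_hat ** 0.5).pdf(13.2))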

Limitations and Assumptions

Distribution Assumption

Critical Insight: The accuracy of MLE estimates depends heavily on whether the assumed probability distribution form matches the true underlying distribution.

If the assumed distribution is incorrect, even maximizing the likelihood may yield inaccurate parameter estimates. For example, if we assume a Gaussian distribution but the data is actually skewed, the MLE estimates may be biased.
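A small illustration of that risk, assuming right-skewed exponential data (all values are positive by construction): the Gaussian MLE fit still goes through, but the fitted model places a noticeable share of its probability mass on impossible negative values:

    import random
    from statistics import NormalDist, fmean, pvariance

    random.seed(0)
    data = [random.expovariate(1.0) for _ in range(10_000)]   # skewed, strictly positive

    mu_hat = fmean(data)                                      # about 1.0
    sigma_hat = pvariance(data, mu=mu_hat) ** 0.5             # about 1.0

    # Probability mass the fitted Gaussian assigns below zero -- roughly 0.16,
    # even though the true distribution never produces negative values
    print(NormalDist(mu_hat, sigma_hat).cdf(0.0))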

Key Limitations

  • Distribution Mismatch: Assumes the correct distribution form is known. In practice, we may need to test multiple distributions or use non-parametric methods.
  • Sample Size: Requires sufficient training samples. With very few samples, MLE estimates can be unreliable, especially for high-dimensional data.
  • Overfitting Risk: MLE can overfit to training data, particularly with complex models. Regularization techniques may be needed.
  • No Prior Information: MLE doesn't incorporate prior knowledge about parameters. Bayesian estimation (MAP) can incorporate priors when available.