Maximum Likelihood Estimation

Estimating probability distribution parameters from observed data

What is Maximum Likelihood Estimation?

Parameter Estimation

Maximum Likelihood Estimation (MLE) is a fundamental method for estimating the parameters of probability distributions. The core idea is to find parameter values that make the observed data most probable under the assumed distribution.

Basic Approach

  1. Assume a probability distribution form (e.g., Gaussian, Bernoulli)
  2. Estimate distribution parameters based on training samples
  3. Choose parameters that maximize the probability of observing the training data

Likelihood Function

Setup

Assume that the class-conditional probability P(x | c) has a specific probability distribution form, uniquely determined by parameters \theta_c. Our task is to estimate \theta_c using the training set D.

Likelihood for Class c

For the set D_c of samples from class c in training set D, the likelihood is:

P(D_c | \theta_c) = \prod_{x \in D_c} P(x | \theta_c)

This is the probability of observing all samples in D_c given parameters \theta_c.

Underflow Problem

Multiplying many probabilities (each between 0 and 1) can cause numerical underflow—the result becomes too small for computers to represent accurately. This is why we use log-likelihood.
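As a minimal Python sketch (the per-sample probability 1e-3 and the sample count 400 are made-up illustration values), the naive product underflows to zero in double precision, while the sum of logarithms stays representable:

    import math

    # Hypothetical per-sample probabilities P(x | theta_c), each small but nonzero
    probs = [1e-3] * 400

    # Naive likelihood: the running product underflows to exactly 0.0
    likelihood = 1.0
    for p in probs:
        likelihood *= p
    print(likelihood)                          # 0.0 -- 1e-1200 is below what a double can represent

    # The same information as a sum of logs stays well within range
    print(sum(math.log(p) for p in probs))     # about -2763.1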

Log-Likelihood

Definition

To avoid underflow, we use the log-likelihood:

LL(\theta_c) = \log P(D_c | \theta_c) = \sum_{x \in D_c} \log P(x | \theta_c)
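A small sketch of LL(\theta_c) for one concrete model, a Bernoulli class-conditional over a single binary attribute; the data and the parameter values are illustrative, not taken from the text:

    import math

    def bernoulli_log_likelihood(theta: float, samples: list[int]) -> float:
        """LL(theta) = sum of log P(x | theta) over the class's samples."""
        return sum(math.log(theta if x == 1 else 1.0 - theta) for x in samples)

    # Binary attribute values for the samples of one class, e.g. "word present" flags
    D_c = [1, 0, 1, 1, 0, 1, 1, 1]

    print(bernoulli_log_likelihood(0.50, D_c))   # about -5.55
    print(bernoulli_log_likelihood(0.75, D_c))   # about -4.50, higher: 6 of 8 samples are 1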

Why Log-Likelihood?

  • Converts multiplication to addition: Easier to compute and differentiate
  • Avoids underflow: Log of small numbers is manageable
  • Monotonic transformation: Maximizing log-likelihood is equivalent to maximizing likelihood

Maximum Likelihood Estimate

The maximum likelihood estimate \hat{\theta}_c is:

\hat{\theta}_c = \arg\max_{\theta_c} LL(\theta_c)
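A grid-search sketch of this argmax for the same illustrative Bernoulli setup; a closed-form solution or numerical optimizer would be used in practice, but the maximizer already lands on the sample frequency:

    import math

    D_c = [1, 0, 1, 1, 0, 1, 1, 1]   # binary attribute values for one class (illustrative)

    def LL(theta: float) -> float:
        # Log-likelihood of D_c under a Bernoulli(theta) model
        return sum(math.log(theta if x == 1 else 1.0 - theta) for x in D_c)

    # Evaluate LL on a fine grid and keep the best candidate
    theta_hat = max((i / 1000 for i in range(1, 1000)), key=LL)
    print(theta_hat)   # 0.75 -- the proportion of 1s in D_c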

Gaussian Distribution MLE

Setup

For continuous attributes, we often assume they follow a Gaussian (normal) distribution. If p(x_i | c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2), the probability density function is:

p(x_i | c) = \frac{1}{\sqrt{2\pi}\sigma_{c,i}} \exp\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)
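This density translates directly into a small helper; the function name and test value are mine, not from the text:

    import math

    def gaussian_pdf(x: float, mu: float, sigma2: float) -> float:
        """p(x_i | c) for a univariate Gaussian with mean mu and variance sigma2."""
        return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

    print(gaussian_pdf(0.0, mu=0.0, sigma2=1.0))   # about 0.3989, the standard normal at 0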

MLE for Gaussian Parameters

The maximum likelihood estimates for the mean and variance are:

Mean:

\hat{\mu}_{c,i} = \frac{1}{|D_c|} \sum_{x \in D_c} x_i

Variance:

\hat{\sigma}_{c,i}^2 = \frac{1}{|D_c|} \sum_{x \in D_c} (x_i - \hat{\mu}_{c,i})^2

Insight

The MLE estimates are simply the sample mean and sample variance of the training data for each class. This makes intuitive sense: we estimate the distribution parameters using the statistics of the observed data.
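A sketch of both estimators on toy per-class data (the numbers and class labels are made up); note that statistics.pvariance uses the 1/|D_c| normalization of the MLE, not the unbiased 1/(|D_c| - 1) version:

    from statistics import fmean, pvariance

    # (attribute value, class label) pairs -- illustrative toy data
    data = [(12.5, "high"), (13.0, "high"), (13.2, "high"), (11.2, "low"), (10.9, "low")]

    for c in ("high", "low"):
        xs = [x for x, label in data if label == c]   # the attribute values of D_c
        mu_hat = fmean(xs)                            # MLE mean
        sigma2_hat = pvariance(xs, mu=mu_hat)         # MLE variance
        print(c, mu_hat, sigma2_hat)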

Bernoulli Distribution MLE

Setup

For binary attributes (e.g., presence/absence of a word in an email), we use the Bernoulli distribution. The probability mass function is:

P(x_i = v | c) = \begin{cases} \theta_{c,i} & \text{if } v = 1 \\ 1 - \theta_{c,i} & \text{if } v = 0 \end{cases}

where \theta_{c,i} is the probability that attribute i equals 1 for class c.

MLE for Bernoulli Parameter

The maximum likelihood estimate is:

\hat{\theta}_{c,i} = \frac{|D_{c,x_i}|}{|D_c|}

where D_{c,x_i} is the set of samples in D_c where attribute i takes value x_i.

Interpretation

The MLE is simply the proportion of samples in class c where attribute i equals 1. This is the empirical frequency, which aligns with our intuition.
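A sketch of this frequency estimate for one binary attribute, with a toy spam-filter style dataset (labels and values are illustrative):

    # (attribute value in {0, 1}, class label), e.g. whether a given word appears in an email
    data = [(1, "spam"), (0, "spam"), (1, "spam"), (1, "ham"), (0, "ham"), (0, "ham")]

    def bernoulli_mle(samples, c):
        # theta_hat_{c,i} = (number of class-c samples with attribute value 1) / |D_c|
        D_c = [x for x, label in samples if label == c]
        return sum(D_c) / len(D_c)

    print(bernoulli_mle(data, "spam"))   # 2/3: the attribute is 1 in two of three spam samples
    print(bernoulli_mle(data, "ham"))    # 1/3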

Example: Wine Quality Prediction

Applying MLE to estimate alcohol content distribution

Dataset Overview

We have wine quality data with alcohol content measurements. For "High Quality" wines (class 1), we observe the following alcohol percentages:

Sample alcohol content values: 12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4

MLE Calculation

Step 1: Calculate sample mean

\hat{\mu}_{1,\text{alcohol}} = \frac{1}{10}(12.5 + 13.0 + 13.2 + 12.8 + 13.5 + 13.1 + 12.9 + 13.3 + 12.7 + 13.4)
= \frac{130.4}{10} = 13.04

Step 2: Calculate sample variance

\hat{\sigma}_{1,\text{alcohol}}^2 = \frac{1}{10}\sum_{i=1}^{10}(x_i - 13.04)^2
= \frac{1}{10}\left[(12.5-13.04)^2 + (13.0-13.04)^2 + \cdots + (13.4-13.04)^2\right]
= \frac{0.924}{10} = 0.0924

Estimated Distribution: \mathcal{N}(13.04, 0.0924)

For a new wine with alcohol content 13.2%, we can now calculate p(13.2 | \text{High Quality}) using the Gaussian probability density function.
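A sketch that reproduces this worked example end to end, using Python's standard-library statistics helpers (NormalDist expects a standard deviation, hence the square root):

    from statistics import NormalDist, fmean, pvariance

    alcohol = [12.5, 13.0, 13.2, 12.8, 13.5, 13.1, 12.9, 13.3, 12.7, 13.4]

    mu_hat = fmean(alcohol)                      # 13.04
    sigma2_hat = pvariance(alcohol, mu=mu_hat)   # 0.0924

    # Density of the fitted Gaussian at 13.2% alcohol, roughly 1.14
    print(NormalDist(mu_hat, sigma2_hat ** 0.5).pdf(13.2))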

Limitations and Assumptions

Distribution Assumption

Critical Insight: The accuracy of MLE estimates depends heavily on whether the assumed probability distribution form matches the true underlying distribution.

If the assumed distribution is incorrect, even maximizing the likelihood may yield inaccurate parameter estimates. For example, if we assume a Gaussian distribution but the data is actually skewed, the MLE estimates may be biased.
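A small illustration of that risk, assuming right-skewed exponential data (all values are positive by construction): the Gaussian MLE fit still goes through, but the fitted model places a noticeable share of its probability mass on impossible negative values:

    import random
    from statistics import NormalDist, fmean, pvariance

    random.seed(0)
    data = [random.expovariate(1.0) for _ in range(10_000)]   # skewed, strictly positive

    mu_hat = fmean(data)                                      # about 1.0
    sigma_hat = pvariance(data, mu=mu_hat) ** 0.5             # about 1.0

    # Probability mass the fitted Gaussian assigns below zero -- roughly 0.16,
    # even though the true distribution never produces negative values
    print(NormalDist(mu_hat, sigma_hat).cdf(0.0))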

Key Limitations

  • Distribution Mismatch: Assumes the correct distribution form is known. In practice, we may need to test multiple distributions or use non-parametric methods.
  • Sample Size: Requires sufficient training samples. With very few samples, MLE estimates can be unreliable, especially for high-dimensional data.
  • Overfitting Risk: MLE can overfit to training data, particularly with complex models. Regularization techniques may be needed.
  • No Prior Information: MLE doesn't incorporate prior knowledge about parameters. Bayesian estimation (MAP) can incorporate priors when available.