
Hidden Markov Models (HMM)

Master sequential directed graph models with hidden states. Learn how HMMs model temporal dependencies for speech recognition, weather prediction, and sequence analysis.

Module 2 of 7
Intermediate to Advanced
120-150 min

Core Definition

Hidden Markov Models (HMM) are sequential directed graph models based on Markov chains. The core feature is "hidden states, observable outputs": the state variables cannot be directly observed and can only be inferred indirectly through the observation variables.

Key Characteristics

  • Sequential structure: Models temporal sequences of states and observations
  • Hidden states: Internal states that are not directly observable
  • Observable outputs: What we can actually measure or observe
  • Markov assumption: Next state depends only on current state, not history

Advantages

  • Handles sequential data naturally
  • Efficient algorithms available
  • Well-established theory
  • Wide range of applications

Limitations

  • Assumes Markov property
  • Discrete state/observation spaces
  • May not capture long dependencies
  • Limited expressiveness

Key Elements: Variables

State Variables $y_1, y_2, \ldots, y_n$ (Hidden)

Internal states that cannot be directly observed. Each state variable takes values from the state space $S = \{s_1, s_2, \ldots, s_N\}$, where $N$ is the number of states.

Examples:

  • Weather: $S = \{\text{Sunny}, \text{Cloudy}, \text{Rainy}\}$ ($N = 3$)
  • Speech: Phonemes or words being spoken
  • Part-of-speech: Noun, verb, adjective, etc.

Observation Variables $x_1, x_2, \ldots, x_n$ (Observable)

Variables that can be directly observed. Each observation variable takes values from the observation space $O = \{o_1, o_2, \ldots, o_M\}$, where $M$ is the number of possible observations.

Examples:

  • Umbrella: $O = \{\text{Umbrella}, \text{No Umbrella}\}$ ($M = 2$)
  • Speech: Acoustic features or audio signals
  • Text: Words in a sentence

Three Parameters: Model Definition

An HMM is completely defined by three parameters $\lambda = [A, B, \pi]$:

1. State Transition Probability Matrix $A = [a_{ij}]_{N \times N}$

$$a_{ij} = P(y_{t+1} = s_j \mid y_t = s_i)$$

Probability of transitioning from state $s_i$ to state $s_j$.

Example (Weather):

$a_{\text{Sunny} \to \text{Rainy}} = 0.3$ means 30% chance of transitioning from Sunny to Rainy. Each row sums to 1: $\sum_j a_{ij} = 1$.

2. Observation Emission Probability Matrix $B = [b_{ij}]_{N \times M}$

$$b_{ij} = P(x_t = o_j \mid y_t = s_i)$$

Probability of generating observation $o_j$ when in state $s_i$.

Example (Weather-Umbrella):

$b_{\text{Rainy} \to \text{Umbrella}} = 0.7$ means 70% chance of seeing "Umbrella" when the state is "Rainy". Each row sums to 1: $\sum_j b_{ij} = 1$.

3. Initial State Probability Vector $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$

$$\pi_i = P(y_1 = s_i)$$

Probability of being in state $s_i$ at the initial time step ($t = 1$).

Example:

$\pi_{\text{Sunny}} = 0.5$ means 50% chance of starting in the "Sunny" state. The vector sums to 1: $\sum_i \pi_i = 1$.
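As a concrete illustration, the sketch below encodes a weather HMM's $\lambda = [A, B, \pi]$ as NumPy arrays and checks that every row is a valid probability distribution. The state and observation orderings are assumptions made for this sketch, and the Cloudy rows of $A$ and $B$ are illustrative placeholders rather than values given in the text.

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]        # state space S, N = 3
observations = ["No Umbrella", "Umbrella"]   # observation space O, M = 2

# A[i, j] = P(y_{t+1} = s_j | y_t = s_i); each row sums to 1.
A = np.array([[0.7, 0.2, 0.1],    # from Sunny
              [0.3, 0.4, 0.3],    # from Cloudy (assumed values, not given in the text)
              [0.2, 0.2, 0.6]])   # from Rainy

# B[i, j] = P(x_t = o_j | y_t = s_i); each row sums to 1.
B = np.array([[0.6, 0.4],         # Sunny:  (No Umbrella, Umbrella)
              [0.5, 0.5],         # Cloudy: (No Umbrella, Umbrella), assumed
              [0.3, 0.7]])        # Rainy:  (No Umbrella, Umbrella)

# pi[i] = P(y_1 = s_i); sums to 1.
pi = np.array([0.5, 0.3, 0.2])

# Sanity checks: every row of A and B, and pi itself, is a distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```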

Core Probability Formula

Based on the Markov chain assumption (next state depends only on current state, independent of history), the joint probability distribution of all variables is:

$$P(x_1, y_1, \ldots, x_n, y_n) = P(y_1) \, P(x_1 \mid y_1) \prod_{i=2}^n P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$$

This factorization breaks down into:

  • $P(y_1)$ = initial state probability (from $\pi$)
  • $P(x_1 \mid y_1)$ = first observation probability (from $B$)
  • $P(y_i \mid y_{i-1})$ = state transition probability (from $A$)
  • $P(x_i \mid y_i)$ = observation emission probability (from $B$)
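A minimal sketch of this factorization in code, assuming the `A`, `B`, `pi` arrays and index orderings from the parameter sketch above are in scope:

```python
import numpy as np

def joint_probability(state_seq, obs_seq, A, B, pi):
    """P(x_1, y_1, ..., x_n, y_n) = P(y_1) P(x_1|y_1) * prod_{i>=2} P(y_i|y_{i-1}) P(x_i|y_i)."""
    p = pi[state_seq[0]] * B[state_seq[0], obs_seq[0]]   # P(y_1) * P(x_1 | y_1)
    for t in range(1, len(state_seq)):
        p *= A[state_seq[t - 1], state_seq[t]]           # P(y_t | y_{t-1}) from A
        p *= B[state_seq[t], obs_seq[t]]                 # P(x_t | y_t) from B
    return p

# e.g. states [Sunny, Sunny, Rainy] with observations [No Umbrella, No Umbrella, Umbrella]:
# joint_probability([0, 0, 2], [0, 0, 1], A, B, pi)
```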

Observation Sequence Generation Process

How an HMM generates an observation sequence:

Step 1

Select Initial State

According to the initial state probability $\pi$, select the initial state $y_1$.

Step 2

Generate Observation

According to the current state $y_t$ and emission probability $B$, generate observation $x_t$.

Step 3

Transition to Next State

According to the current state $y_t$ and transition probability $A$, transition to the next state $y_{t+1}$.

Step 4

Repeat

Repeat steps 2-3 until $t = n$ (desired sequence length reached).
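A minimal sketch of this generation procedure, assuming the same `A`, `B`, `pi` conventions as in the parameter sketch above; the function name `sample_sequence` is just illustrative:

```python
import numpy as np

def sample_sequence(A, B, pi, n, rng=None):
    """Generate (states, observations) of length n from an HMM (A, B, pi)."""
    rng = rng or np.random.default_rng()
    states, obs = [], []
    y = rng.choice(len(pi), p=pi)              # Step 1: initial state from pi
    for _ in range(n):
        states.append(y)
        x = rng.choice(B.shape[1], p=B[y])     # Step 2: emit observation from B[y]
        obs.append(x)
        y = rng.choice(A.shape[1], p=A[y])     # Step 3: transition to next state via A[y]
    return states, obs                         # Step 4: repeated until length n

# states, obs = sample_sequence(A, B, pi, n=7)
```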

Three Fundamental Problems

HMM applications typically involve solving one of three fundamental problems:

Problem 1: Evaluation Problem

Given: Model parameters $\lambda$ and observation sequence $x$

Compute: $P(x \mid \lambda)$ (probability of the observation sequence given the model)

Algorithm:

  • Forward algorithm: Compute $P(x \mid \lambda)$ using forward probabilities
  • Backward algorithm: Alternative computation using backward probabilities
  • Both have time complexity $O(N^2 T)$, where $N$ = number of states and $T$ = sequence length

Application:

Model selection - compare different HMMs to see which best explains the observed data.
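A minimal sketch of the forward algorithm under the same `A`, `B`, `pi` conventions as above: initialize with $\pi$ and $B$, apply the standard recursion, then sum over the final states. This is one common formulation, not the only one.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(x | lambda) for an observation index sequence obs."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: pi_i * b_i(x_1)
    for t in range(1, T):
        # recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(x_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                         # termination: sum over final states

# likelihood = forward(A, B, pi, [0, 0, 1, 1, 0, 0, 1])   # P(x | lambda)
```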

Problem 2: Decoding Problem (Inference)

Given: Model parameters $\lambda$ and observation sequence $x$

Find: Optimal state sequence $y^* = \arg\max_y P(y \mid x, \lambda)$

Algorithm:

  • Viterbi algorithm: Dynamic programming to find most likely state sequence
  • Uses recursion: $\delta_t(j) = \max_i [\delta_{t-1}(i) \, a_{ij}] \, b_j(x_t)$
  • Time complexity: $O(N^2 T)$

Application:

Speech recognition - infer phonemes/words from acoustic signals. Part-of-speech tagging - infer POS tags from words.
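A minimal Viterbi sketch using the recursion above, with backpointers to recover the state sequence; the same `A`, `B`, `pi` conventions are assumed.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (best_path, best_prob) for an observation index sequence obs."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                 # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # delta_t(j) = max_i [...] * b_j(x_t)
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

# path, prob = viterbi(A, B, pi, [0, 0, 1, 1, 0, 0, 1])
```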

Problem 3: Learning Problem (Training)

Given: Observation sequence $x$

Estimate: Model parameters $\lambda^* = \arg\max_\lambda P(x \mid \lambda)$

Algorithm:

  • Baum-Welch algorithm: Special case of EM algorithm for HMM
  • E-step: Compute expected state transitions and emissions
  • M-step: Update parameters A, B, π to maximize likelihood
  • Iterates until convergence

Application:

Train HMM from unlabeled observation sequences. Learn model parameters from data.
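A minimal sketch of a single Baum-Welch (EM) iteration for one observation sequence, under the same conventions as above; the full training loop repeats this step until the likelihood stops improving. It omits practical details such as log-space or scaled computation and training on multiple sequences.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM iteration: returns re-estimated (A, B, pi) and the current likelihood."""
    obs = np.asarray(obs)
    N, M, T = A.shape[0], B.shape[1], len(obs)

    # E-step: forward (alpha) and backward (beta) probabilities
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    gamma = alpha * beta / likelihood      # gamma_t(i) = P(y_t = s_i | x, lambda)
    xi = np.zeros((T - 1, N, N))           # xi_t(i,j) = P(y_t = s_i, y_{t+1} = s_j | x, lambda)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood

    # M-step: re-estimate parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood
```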

Weather Prediction Example

Apply HMM to predict weather states (Sunny, Cloudy, Rainy) from umbrella observations (Umbrella, No Umbrella) over 7 days.

Observation Sequence and Inferred States

Day   State (Hidden)   Observation    Emission Prob
1     Sunny            No Umbrella    0.6
2     Sunny            No Umbrella    0.6
3     Rainy            Umbrella       0.7
4     Rainy            Umbrella       0.7
5     Cloudy           No Umbrella    0.5
6     Sunny            No Umbrella    0.6
7     Rainy            Umbrella       0.7

States are hidden (not directly observable). We observe umbrella usage and infer the weather states using the HMM parameters and the Viterbi algorithm.

HMM Parameters

Transition Matrix A:

Sunny → Sunny: 0.7, Sunny → Cloudy: 0.2, Sunny → Rainy: 0.1

Rainy → Rainy: 0.6, Rainy → Cloudy: 0.2, Rainy → Sunny: 0.2

Emission Matrix B:

Sunny → No Umbrella: 0.6, Sunny → Umbrella: 0.4

Rainy → Umbrella: 0.7, Rainy → No Umbrella: 0.3

Initial State π:

Sunny: 0.5, Cloudy: 0.3, Rainy: 0.2

Viterbi Decoding Result:

Given the observation sequence [No, No, Umbrella, Umbrella, No, No, Umbrella], the Viterbi algorithm infers the most likely state sequence [Sunny, Sunny, Rainy, Rainy, Cloudy, Sunny, Rainy] with probability 0.0042.
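Assuming the `forward` and `viterbi` sketches and the weather parameters defined earlier are in scope, the example can be run as below. Because the Cloudy transition probabilities are not given in the text, this sketch is not expected to reproduce the 0.0042 figure exactly.

```python
# Observation indices follow the assumed ordering (No Umbrella = 0, Umbrella = 1)
obs = [0, 0, 1, 1, 0, 0, 1]          # No, No, Umbrella, Umbrella, No, No, Umbrella

likelihood = forward(A, B, pi, obs)  # Problem 1: P(x | lambda)
path, prob = viterbi(A, B, pi, obs)  # Problem 2: most likely hidden state sequence
print([["Sunny", "Cloudy", "Rainy"][i] for i in path], prob, likelihood)
```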

Advantages and Limitations

Advantages

  • Handles sequential data naturally: Perfect for time series and sequences
  • Efficient algorithms: Forward, backward, Viterbi all have polynomial time complexity
  • Well-established theory: Decades of research and applications
  • Wide range of applications: Speech, NLP, bioinformatics, finance
  • Interpretable: States and transitions have clear meaning

Limitations

  • Markov assumption: Next state depends only on current state, may miss long dependencies
  • Discrete spaces: Typically assumes discrete state and observation spaces
  • Limited expressiveness: May not capture complex temporal patterns
  • Parameter estimation: Baum-Welch can be slow and may converge to local optima
  • State space size: Computational cost grows quadratically with number of states