
Hidden Markov Models (HMM)

Master sequential directed graph models with hidden states. Learn how HMMs model temporal dependencies for speech recognition, weather prediction, and sequence analysis.

Module 2 of 7
Intermediate to Advanced
120-150 min

Core Definition

Hidden Markov Models (HMM) are sequential directed graph models based on Markov chains. The core feature is "hidden states, observable outputs": the state variables cannot be directly observed and can only be inferred indirectly through the observation variables.

Key Characteristics

  • Sequential structure: Models temporal sequences of states and observations
  • Hidden states: Internal states that are not directly observable
  • Observable outputs: What we can actually measure or observe
  • Markov assumption: Next state depends only on current state, not history

Advantages

  • Handles sequential data naturally
  • Efficient algorithms available
  • Well-established theory
  • Wide range of applications

Limitations

  • Assumes Markov property
  • Discrete state/observation spaces
  • May not capture long dependencies
  • Limited expressiveness

Key Elements: Variables

State Variables $y_1, y_2, \ldots, y_n$ (Hidden)

Internal states that cannot be directly observed. Each state variable takes values from the state space $S = \{s_1, s_2, \ldots, s_N\}$, where $N$ is the number of states.

Examples:

  • Weather: $S = \{\text{Sunny}, \text{Cloudy}, \text{Rainy}\}$ ($N = 3$)
  • Speech: Phonemes or words being spoken
  • Part-of-speech: Noun, verb, adjective, etc.

Observation Variables $x_1, x_2, \ldots, x_n$ (Observable)

Variables that can be directly observed. Each observation variable takes values from the observation space $O = \{o_1, o_2, \ldots, o_M\}$, where $M$ is the number of possible observations.

Examples:

  • Umbrella: $O = \{\text{Umbrella}, \text{No Umbrella}\}$ ($M = 2$)
  • Speech: Acoustic features or audio signals
  • Text: Words in a sentence

Three Parameters: Model Definition

An HMM is completely defined by three parameters $\lambda = [A, B, \pi]$:

1. State Transition Probability Matrix $A = [a_{ij}]_{N \times N}$

$$a_{ij} = P(y_{t+1} = s_j \mid y_t = s_i)$$

Probability of transitioning from state $s_i$ to state $s_j$.

Example (Weather):

$a_{\text{Sunny} \to \text{Rainy}} = 0.3$ means 30% chance of transitioning from Sunny to Rainy. Each row sums to 1: $\sum_j a_{ij} = 1$.

2. Observation Emission Probability Matrix $B = [b_{ij}]_{N \times M}$

$$b_{ij} = P(x_t = o_j \mid y_t = s_i)$$

Probability of generating observation $o_j$ when in state $s_i$.

Example (Weather-Umbrella):

$b_{\text{Rainy} \to \text{Umbrella}} = 0.7$ means 70% chance of seeing "Umbrella" when the state is "Rainy". Each row sums to 1: $\sum_j b_{ij} = 1$.

3. Initial State Probability Vector $\pi = (\pi_1, \pi_2, \ldots, \pi_N)$

$$\pi_i = P(y_1 = s_i)$$

Probability of being in state $s_i$ at the initial time step ($t = 1$).

Example:

$\pi_{\text{Sunny}} = 0.5$ means 50% chance of starting in the "Sunny" state. The vector sums to 1: $\sum_i \pi_i = 1$.
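As a concrete illustration, the sketch below encodes a weather HMM's $\lambda = [A, B, \pi]$ as NumPy arrays and checks that every row is a valid probability distribution. The state and observation orderings are assumptions made for this sketch, and the Cloudy rows of $A$ and $B$ are illustrative placeholders rather than values given in the text.

```python
import numpy as np

states = ["Sunny", "Cloudy", "Rainy"]        # state space S, N = 3
observations = ["No Umbrella", "Umbrella"]   # observation space O, M = 2

# A[i, j] = P(y_{t+1} = s_j | y_t = s_i); each row sums to 1.
A = np.array([[0.7, 0.2, 0.1],    # from Sunny
              [0.3, 0.4, 0.3],    # from Cloudy (assumed values, not given in the text)
              [0.2, 0.2, 0.6]])   # from Rainy

# B[i, j] = P(x_t = o_j | y_t = s_i); each row sums to 1.
B = np.array([[0.6, 0.4],         # Sunny:  (No Umbrella, Umbrella)
              [0.5, 0.5],         # Cloudy: (No Umbrella, Umbrella), assumed
              [0.3, 0.7]])        # Rainy:  (No Umbrella, Umbrella)

# pi[i] = P(y_1 = s_i); sums to 1.
pi = np.array([0.5, 0.3, 0.2])

# Sanity checks: every row of A and B, and pi itself, is a distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
assert np.isclose(pi.sum(), 1.0)
```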

Core Probability Formula

Based on the Markov chain assumption (next state depends only on current state, independent of history), the joint probability distribution of all variables is:

$$P(x_1, y_1, \ldots, x_n, y_n) = P(y_1) \, P(x_1 \mid y_1) \prod_{i=2}^n P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)$$

This factorization breaks down into:

  • $P(y_1)$ = initial state probability (from $\pi$)
  • $P(x_1 \mid y_1)$ = first observation probability (from $B$)
  • $P(y_i \mid y_{i-1})$ = state transition probability (from $A$)
  • $P(x_i \mid y_i)$ = observation emission probability (from $B$)
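A minimal sketch of this factorization in code, assuming the `A`, `B`, `pi` arrays and index orderings from the parameter sketch above are in scope:

```python
import numpy as np

def joint_probability(state_seq, obs_seq, A, B, pi):
    """P(x_1, y_1, ..., x_n, y_n) = P(y_1) P(x_1|y_1) * prod_{i>=2} P(y_i|y_{i-1}) P(x_i|y_i)."""
    p = pi[state_seq[0]] * B[state_seq[0], obs_seq[0]]   # P(y_1) * P(x_1 | y_1)
    for t in range(1, len(state_seq)):
        p *= A[state_seq[t - 1], state_seq[t]]           # P(y_t | y_{t-1}) from A
        p *= B[state_seq[t], obs_seq[t]]                 # P(x_t | y_t) from B
    return p

# e.g. states [Sunny, Sunny, Rainy] with observations [No Umbrella, No Umbrella, Umbrella]:
# joint_probability([0, 0, 2], [0, 0, 1], A, B, pi)
```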

Observation Sequence Generation Process

How an HMM generates an observation sequence:

Step 1

Select Initial State

According to the initial state probability $\pi$, select the initial state $y_1$.

Step 2

Generate Observation

According to the current state $y_t$ and emission probability $B$, generate observation $x_t$.

Step 3

Transition to Next State

According to the current state $y_t$ and transition probability $A$, transition to the next state $y_{t+1}$.

Step 4

Repeat

Repeat steps 2-3 until $t = n$ (desired sequence length reached).
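A minimal sketch of this generation procedure, assuming the same `A`, `B`, `pi` conventions as in the parameter sketch above; the function name `sample_sequence` is just illustrative:

```python
import numpy as np

def sample_sequence(A, B, pi, n, rng=None):
    """Generate (states, observations) of length n from an HMM (A, B, pi)."""
    rng = rng or np.random.default_rng()
    states, obs = [], []
    y = rng.choice(len(pi), p=pi)              # Step 1: initial state from pi
    for _ in range(n):
        states.append(y)
        x = rng.choice(B.shape[1], p=B[y])     # Step 2: emit observation from B[y]
        obs.append(x)
        y = rng.choice(A.shape[1], p=A[y])     # Step 3: transition to next state via A[y]
    return states, obs                         # Step 4: repeated until length n

# states, obs = sample_sequence(A, B, pi, n=7)
```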

Three Fundamental Problems

HMM applications typically involve solving one of three fundamental problems:

Problem 1: Evaluation Problem

Given: Model parameters $\lambda$ and observation sequence $x$

Compute: $P(x \mid \lambda)$ (probability of the observation sequence given the model)

Algorithm:

  • Forward algorithm: Compute $P(x \mid \lambda)$ using forward probabilities
  • Backward algorithm: Alternative computation using backward probabilities
  • Both have time complexity $O(N^2 T)$, where $N$ = number of states and $T$ = sequence length

Application:

Model selection - compare different HMMs to see which best explains the observed data.
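A minimal sketch of the forward algorithm under the same `A`, `B`, `pi` conventions as above: initialize with $\pi$ and $B$, apply the standard recursion, then sum over the final states. This is one common formulation, not the only one.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(x | lambda) for an observation index sequence obs."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                   # initialization: pi_i * b_i(x_1)
    for t in range(1, T):
        # recursion: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(x_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                         # termination: sum over final states

# likelihood = forward(A, B, pi, [0, 0, 1, 1, 0, 0, 1])   # P(x | lambda)
```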

Problem 2: Decoding Problem (Inference)

Given: Model parameters $\lambda$ and observation sequence $x$

Find: Optimal state sequence $y^* = \arg\max_y P(y \mid x, \lambda)$

Algorithm:

  • Viterbi algorithm: Dynamic programming to find most likely state sequence
  • Uses recursion: $\delta_t(j) = \max_i [\delta_{t-1}(i) \, a_{ij}] \, b_j(x_t)$
  • Time complexity: $O(N^2 T)$

Application:

Speech recognition - infer phonemes/words from acoustic signals. Part-of-speech tagging - infer POS tags from words.
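A minimal Viterbi sketch using the recursion above, with backpointers to recover the state sequence; the same `A`, `B`, `pi` conventions are assumed.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return (best_path, best_prob) for an observation index sequence obs."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)                 # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)                # best predecessor for each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]  # delta_t(j) = max_i [...] * b_j(x_t)
    # Backtrack from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[-1].max())

# path, prob = viterbi(A, B, pi, [0, 0, 1, 1, 0, 0, 1])
```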

Problem 3: Learning Problem (Training)

Given: Observation sequence $x$

Estimate: Model parameters $\lambda^* = \arg\max_\lambda P(x \mid \lambda)$

Algorithm:

  • Baum-Welch algorithm: Special case of EM algorithm for HMM
  • E-step: Compute expected state transitions and emissions
  • M-step: Update parameters A, B, π to maximize likelihood
  • Iterates until convergence

Application:

Train HMM from unlabeled observation sequences. Learn model parameters from data.
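A minimal sketch of a single Baum-Welch (EM) iteration for one observation sequence, under the same conventions as above; the full training loop repeats this step until the likelihood stops improving. It omits practical details such as log-space or scaled computation and training on multiple sequences.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One EM iteration: returns re-estimated (A, B, pi) and the current likelihood."""
    obs = np.asarray(obs)
    N, M, T = A.shape[0], B.shape[1], len(obs)

    # E-step: forward (alpha) and backward (beta) probabilities
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[-1].sum()

    gamma = alpha * beta / likelihood      # gamma_t(i) = P(y_t = s_i | x, lambda)
    xi = np.zeros((T - 1, N, N))           # xi_t(i,j) = P(y_t = s_i, y_{t+1} = s_j | x, lambda)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / likelihood

    # M-step: re-estimate parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros((N, M))
    for k in range(M):
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, likelihood
```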

Weather Prediction Example

Apply HMM to predict weather states (Sunny, Cloudy, Rainy) from umbrella observations (Umbrella, No Umbrella) over 7 days.

Observation Sequence and Inferred States

Day   State (Hidden)   Observation    Emission Prob
1     Sunny            No Umbrella    0.6
2     Sunny            No Umbrella    0.6
3     Rainy            Umbrella       0.7
4     Rainy            Umbrella       0.7
5     Cloudy           No Umbrella    0.5
6     Sunny            No Umbrella    0.6
7     Rainy            Umbrella       0.7

States are hidden (not directly observable). We observe umbrella usage and infer the weather states using the HMM parameters and the Viterbi algorithm.

HMM Parameters

Transition Matrix A:

Sunny → Sunny: 0.7, Sunny → Cloudy: 0.2, Sunny → Rainy: 0.1

Rainy → Rainy: 0.6, Rainy → Cloudy: 0.2, Rainy → Sunny: 0.2

Emission Matrix B:

Sunny → No Umbrella: 0.6, Sunny → Umbrella: 0.4

Rainy → Umbrella: 0.7, Rainy → No Umbrella: 0.3

Initial State π:

Sunny: 0.5, Cloudy: 0.3, Rainy: 0.2

Viterbi Decoding Result:

Given the observation sequence [No, No, Umbrella, Umbrella, No, No, Umbrella], the Viterbi algorithm infers the most likely state sequence [Sunny, Sunny, Rainy, Rainy, Cloudy, Sunny, Rainy] with probability 0.0042.
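Assuming the `forward` and `viterbi` sketches and the weather parameters defined earlier are in scope, the example can be run as below. Because the Cloudy transition probabilities are not given in the text, this sketch is not expected to reproduce the 0.0042 figure exactly.

```python
# Observation indices follow the assumed ordering (No Umbrella = 0, Umbrella = 1)
obs = [0, 0, 1, 1, 0, 0, 1]          # No, No, Umbrella, Umbrella, No, No, Umbrella

likelihood = forward(A, B, pi, obs)  # Problem 1: P(x | lambda)
path, prob = viterbi(A, B, pi, obs)  # Problem 2: most likely hidden state sequence
print([["Sunny", "Cloudy", "Rainy"][i] for i in path], prob, likelihood)
```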

Advantages and Limitations

Advantages

  • Handles sequential data naturally: Perfect for time series and sequences
  • Efficient algorithms: Forward, backward, Viterbi all have polynomial time complexity
  • Well-established theory: Decades of research and applications
  • Wide range of applications: Speech, NLP, bioinformatics, finance
  • Interpretable: States and transitions have clear meaning

Limitations

  • Markov assumption: Next state depends only on current state, may miss long dependencies
  • Discrete spaces: Typically assumes discrete state and observation spaces
  • Limited expressiveness: May not capture complex temporal patterns
  • Parameter estimation: Baum-Welch can be slow and may converge to local optima
  • State space size: Computational cost grows quadratically with number of states