
Conditional Random Fields (CRF)

Master discriminative undirected graphical models for sequence labeling. Learn how CRFs model conditional probabilities for part-of-speech tagging, named entity recognition, and syntactic analysis.

Module 4 of 7
Intermediate to Advanced
120-150 min

Core Definition

Conditional Random Fields (CRF) are discriminative undirected graphical models that focus on modeling the conditional probability of the state sequence $y$ given the observation sequence $x$: $P(y \mid x)$.

Key Characteristics

  • Discriminative: Models $P(y \mid x)$ directly, not the joint distribution $P(x, y)$
  • Undirected graph: Uses a Markov random field (MRF) structure for state dependencies
  • Markov property: $P(y_v \mid x, y_{V \setminus \{v\}}) = P(y_v \mid x, y_{n(v)})$, where $n(v)$ is the set of neighbors of node $v$
  • Feature-based: Uses feature functions to capture patterns

Advantages

  • More accurate for classification
  • Flexible feature engineering
  • No independence assumptions
  • Handles overlapping features

Limitations

  • Cannot generate samples
  • Requires labeled data
  • Training can be slow
  • Feature engineering needed

Linear-Chain CRF: Typical Form

The most commonly used CRF structure is the linear-chain CRF, where the state sequence $y_1, y_2, \ldots, y_n$ forms a chain corresponding to the observation sequence $x_1, x_2, \ldots, x_n$.

Graph Structure

States form a linear chain: $y_1 - y_2 - y_3 - \ldots - y_n$ (undirected edges). Each state $y_i$ is also connected to its corresponding observation $x_i$.

Example (POS Tagging):

  • States: POS tags (Noun, Verb, Adjective, etc.)
  • Observations: Words in the sentence
  • Goal: Infer the POS tag sequence from the word sequence (see the data sketch below)
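As a minimal sketch of how such a labeled sequence can be represented in code (the sentence, tag set, and variable names are illustrative, not tied to any particular library), observations and states are simply two parallel sequences of equal length:

```python
# A single labeled training example for a linear-chain CRF:
# observations (words) and states (POS tags) as parallel lists.
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
tags  = ["DT",  "JJ",    "JJ",    "NN",  "VBZ",   "IN",   "DT",  "JJ",   "NN"]

assert len(words) == len(tags)  # y_i is aligned with x_i

# The label space the model chooses from at every position.
tag_set = sorted(set(tags))
print(tag_set)  # ['DT', 'IN', 'JJ', 'NN', 'VBZ']
```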

Core Probability Formula

A CRF defines the conditional probability through weighted feature functions in an exponential (log-linear) form, ensuring non-negativity and interpretability:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_j \sum_{i=1}^{n-1} \lambda_j t_j(y_{i+1}, y_i, x, i) + \sum_k \sum_{i=1}^{n} \mu_k s_k(y_i, x, i) \right)$$

Where:

  • $Z(x) = \sum_y \exp(\ldots)$ = normalization constant (partition function; depends on $x$)
  • $t_j(y_{i+1}, y_i, x, i)$ = transition feature function (captures adjacent state pairs)
  • $s_k(y_i, x, i)$ = state feature function (captures state-observation relationships)
  • $\lambda_j, \mu_k$ = feature weights (learned from training data)
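To make the formula concrete, here is a minimal Python sketch that evaluates $P(y \mid x)$ for a tiny, hypothetical example by brute-force enumeration of all tag sequences when computing $Z(x)$. The tag set and weights are made up for illustration; real implementations compute $Z(x)$ with dynamic programming (the forward algorithm) instead.

```python
import itertools
import math

# Hypothetical toy example: three words, three possible tags.
words = ("the", "lazy", "dog")
tags = ("DT", "JJ", "NN")

# Weighted features folded into lookup tables (weight 0 if a pair is absent):
# transition entries ~ lambda_j * t_j, state entries ~ mu_k * s_k.
TRANS = {("DT", "JJ"): 1.0, ("DT", "NN"): 2.5, ("JJ", "NN"): 2.0}
STATE = {("the", "DT"): 3.0, ("lazy", "JJ"): 1.5, ("dog", "NN"): 2.0}

def score(y, x):
    """Unnormalized score: sum of weighted state and transition features."""
    total = sum(STATE.get((x[i], y[i]), 0.0) for i in range(len(x)))
    total += sum(TRANS.get((y[i], y[i + 1]), 0.0) for i in range(len(x) - 1))
    return total

def conditional_probability(y, x):
    """P(y | x) = exp(score(y, x)) / Z(x), with Z(x) enumerated explicitly."""
    Z = sum(math.exp(score(cand, x))
            for cand in itertools.product(tags, repeat=len(x)))
    return math.exp(score(y, x)) / Z

print(conditional_probability(("DT", "JJ", "NN"), words))  # the highest-scoring labeling
```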

Feature Functions

Feature functions are binary indicators (0 or 1) that capture specific patterns in the data. They enable flexible feature engineering.

Transition Feature Functions $t_j(y_{i+1}, y_i, x, i)$

Capture relationships between adjacent states $y_i$ and $y_{i+1}$, possibly depending on the observations $x$ and the position $i$.

Example (POS Tagging):

  • $t_1(y_{i+1}, y_i, x, i) = 1$ if $y_i = \text{DT}$ and $y_{i+1} = \text{NN}$, else 0
    • Captures the pattern: a determiner (DT) is often followed by a noun (NN)
  • $t_2(y_{i+1}, y_i, x, i) = 1$ if $y_i = \text{JJ}$ and $y_{i+1} = \text{NN}$, else 0
    • Captures the pattern: an adjective (JJ) is often followed by a noun (NN)

State Feature Functions $s_k(y_i, x, i)$

Capture relationships between the current state $y_i$ and the observations $x$ at position $i$.

Example (POS Tagging):

  • $s_1(y_i, x, i) = 1$ if $y_i = \text{NN}$ and $x_i$ ends with "tion", else 0
    • Captures the pattern: words ending in "tion" are often nouns
  • $s_2(y_i, x, i) = 1$ if $y_i = \text{VB}$ and $x_i$ ends with "ed", else 0
    • Captures the pattern: words ending in "ed" are often verbs (these four features are written out as code below)
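The four example features above translate directly into Python indicator functions. This is only a sketch of the feature-function interface; the argument names mirror the notation in this section, and the weights that multiply these indicators would be learned during training.

```python
def t1(y_next, y_curr, x, i):
    """Transition feature: determiner (DT) followed by noun (NN)."""
    return 1 if y_curr == "DT" and y_next == "NN" else 0

def t2(y_next, y_curr, x, i):
    """Transition feature: adjective (JJ) followed by noun (NN)."""
    return 1 if y_curr == "JJ" and y_next == "NN" else 0

def s1(y_curr, x, i):
    """State feature: word ending in 'tion' tagged as noun (NN)."""
    return 1 if y_curr == "NN" and x[i].endswith("tion") else 0

def s2(y_curr, x, i):
    """State feature: word ending in 'ed' tagged as verb (VB)."""
    return 1 if y_curr == "VB" and x[i].endswith("ed") else 0

# Both a transition feature and a state feature fire on this fragment.
x = ["the", "acquisition"]
print(t1("NN", "DT", x, 0), s1("NN", x, 1))  # -> 1 1
```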

Feature Function Properties

  • Binary values: Each feature function returns 0 or 1 (indicator function)
  • Flexible design: Can capture any pattern (word patterns, context, position, etc.)
  • Weighted combination: Feature weights $\lambda_j, \mu_k$ are learned from data
  • Overlapping features: Multiple features can be active simultaneously

Part-of-Speech Tagging Example

Apply a linear-chain CRF to tag each word in a sentence with its part-of-speech label. Observations are words; states are POS tags.

Sentence: Word Sequence with POS Tags

| Index | Word (Observation) | POS Tag (State) | Tag Description |
|-------|--------------------|-----------------|-----------------|
| 1 | The | DT | Determiner |
| 2 | quick | JJ | Adjective |
| 3 | brown | JJ | Adjective |
| 4 | fox | NN | Noun |
| 5 | jumps | VBZ | Verb (3rd person) |
| 6 | over | IN | Preposition |
| 7 | the | DT | Determiner |
| 8 | lazy | JJ | Adjective |
| 9 | dog | NN | Noun |

Sentence: "The quick brown fox jumps over the lazy dog." Goal: given the word sequence, infer the POS tag sequence using a CRF.

CRF Model for POS Tagging

Graph Structure:

Linear chain: $\text{POS}_1 - \text{POS}_2 - \text{POS}_3 - \ldots - \text{POS}_9$. Each POS tag $y_i$ corresponds to word $x_i$.

Transition Features:

  • DT → NN: $\lambda_1 = 2.5$ (determiner often followed by noun)
  • JJ → NN: $\lambda_2 = 2.0$ (adjective often followed by noun)
  • NN → VBZ: $\lambda_3 = 1.5$ (noun often followed by verb)

State Features:

  • Word = "The" → DT: $\mu_1 = 3.0$ (high confidence)
  • Word ends with "ly" → RB: $\mu_2 = 2.0$ (adverb pattern)
  • Word capitalized → NNP: $\mu_3 = 1.5$ (proper noun pattern)

CRF Inference Result:

Given the word sequence, the CRF computes $P(\text{POS tags} \mid \text{words})$ and finds the most likely tag sequence with the Viterbi algorithm (sketched below). Result: DT-JJ-JJ-NN-VBZ-IN-DT-JJ-NN (shown in the table above) with high confidence.
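Here is a compact Python sketch of Viterbi decoding for a linear-chain CRF. The tag set and score tables are hypothetical stand-ins for the weighted feature sums $\sum_j \lambda_j t_j$ and $\sum_k \mu_k s_k$ (echoing the example weights above), so this illustrates the dynamic program rather than a trained model.

```python
def viterbi_decode(words, tags, state_score, trans_score):
    """Most likely tag sequence argmax_y score(y, x) for a linear-chain CRF.

    state_score(word, tag)     ~ sum_k mu_k * s_k at one position
    trans_score(prev_tag, tag) ~ sum_j lambda_j * t_j for one adjacent pair
    """
    n = len(words)
    # delta[i][t] = best score of any tag prefix ending in tag t at position i
    delta = [{t: state_score(words[0], t) for t in tags}]
    backptr = [{}]
    for i in range(1, n):
        delta.append({})
        backptr.append({})
        for t in tags:
            prev = max(tags, key=lambda p: delta[i - 1][p] + trans_score(p, t))
            delta[i][t] = (delta[i - 1][prev] + trans_score(prev, t)
                           + state_score(words[i], t))
            backptr[i][t] = prev
    # Trace the best path back from the final position.
    path = [max(tags, key=lambda t: delta[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

# Hypothetical scores echoing the weights above (missing pairs score 0).
TRANS = {("DT", "NN"): 2.5, ("JJ", "NN"): 2.0, ("NN", "VBZ"): 1.5,
         ("DT", "JJ"): 1.0, ("JJ", "JJ"): 0.5}
STATE = {("The", "DT"): 3.0, ("the", "DT"): 3.0, ("quick", "JJ"): 1.0,
         ("brown", "JJ"): 1.0, ("fox", "NN"): 1.0, ("jumps", "VBZ"): 1.0}

print(viterbi_decode(["The", "quick", "brown", "fox", "jumps"],
                     ["DT", "JJ", "NN", "VBZ"],
                     lambda w, t: STATE.get((w, t), 0.0),
                     lambda p, t: TRANS.get((p, t), 0.0)))
# -> ['DT', 'JJ', 'JJ', 'NN', 'VBZ']
```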

Applications

Sequence Labeling Tasks

  • Part-of-speech tagging: Words → POS tags (a library-based training sketch follows this list)
  • Named entity recognition: Words → Entity types (Person, Location, Organization)
  • Chunking: Words → Syntactic chunks (NP, VP, etc.)
  • Chinese word segmentation: Characters → Word boundaries
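In practice, linear-chain CRFs for these tasks are usually trained with an existing library rather than implemented from scratch. The sketch below assumes the third-party sklearn-crfsuite package and a hypothetical toy training set, with per-token feature dictionaries as input; parameter names and data formats should be checked against the library's documentation.

```python
# pip install sklearn-crfsuite   (assumed third-party dependency)
import sklearn_crfsuite

def token_features(sentence, i):
    """Feature dictionary for one token; keys are arbitrary feature names."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization as a proper-noun cue
        "suffix3": word[-3:],             # suffixes such as 'ion', 'ed', 'ly'
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
    }

# Hypothetical toy training data: one labeled sentence.
sentences = [["The", "quick", "brown", "fox", "jumps"]]
labels = [["DT", "JJ", "JJ", "NN", "VBZ"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y_train = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # predicted tag sequences for the training sentences
```

Here the feature dictionaries roughly play the role of the state feature functions $s_k$, while the library learns transition weights between adjacent tags on its own.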

Advantages Over HMM

  • No independence assumption: Can use overlapping features
  • More accurate: Discriminative training focuses on classification
  • Flexible features: Can incorporate any observation pattern
  • Better for NLP: Handles complex word patterns naturally

Advantages and Limitations

Advantages

  • More accurate for classification: Discriminative training optimizes for prediction
  • Flexible feature engineering: Can use any observation pattern
  • No independence assumptions: Can model overlapping features
  • Better for NLP: Handles complex word patterns and context

Limitations

  • Cannot generate samples: Only models conditional distribution
  • Requires labeled data: Needs fully labeled sequences for training
  • Training can be slow: Feature computation and normalization can be expensive
  • Feature engineering needed: Requires domain knowledge to design features