
Conditional Random Fields (CRF)

Master discriminative undirected graphical models for sequence labeling. Learn how CRFs model conditional probabilities for part-of-speech tagging, named entity recognition, and syntactic analysis.

Module 4 of 7
Intermediate to Advanced
120-150 min

Core Definition

Conditional Random Fields (CRF) are discriminative undirected graphical models that focus on modeling the conditional probability of the state sequence $y$ given the observation sequence $x$: $P(y \mid x)$.

Key Characteristics

  • Discriminative: Models $P(y \mid x)$ directly, not the joint distribution $P(x, y)$
  • Undirected graph: Uses a Markov random field (MRF) structure for state dependencies
  • Markov property: $P(y_v \mid x, y_{V \setminus \{v\}}) = P(y_v \mid x, y_{n(v)})$, where $n(v)$ is the set of neighbors of node $v$
  • Feature-based: Uses feature functions to capture patterns

Advantages

  • More accurate for classification
  • Flexible feature engineering
  • No independence assumptions
  • Handles overlapping features

Limitations

  • Cannot generate samples
  • Requires labeled data
  • Training can be slow
  • Feature engineering needed

Linear-Chain CRF: Typical Form

The most commonly used CRF structure is the linear-chain CRF, where the state sequence $y_1, y_2, \ldots, y_n$ forms a chain corresponding to the observation sequence $x_1, x_2, \ldots, x_n$.

Graph Structure

States form a linear chain: $y_1 - y_2 - y_3 - \ldots - y_n$ (undirected edges). Each state $y_i$ is also connected to its corresponding observation $x_i$.

Example (POS Tagging):

  • States: POS tags (Noun, Verb, Adjective, etc.)
  • Observations: Words in the sentence
  • Goal: Infer the POS tag sequence from the word sequence (see the data sketch below)
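As a minimal sketch of how such a labeled sequence can be represented in code (the sentence, tag set, and variable names are illustrative, not tied to any particular library), observations and states are simply two parallel sequences of equal length:

```python
# A single labeled training example for a linear-chain CRF:
# observations (words) and states (POS tags) as parallel lists.
words = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
tags  = ["DT",  "JJ",    "JJ",    "NN",  "VBZ",   "IN",   "DT",  "JJ",   "NN"]

assert len(words) == len(tags)  # y_i is aligned with x_i

# The label space the model chooses from at every position.
tag_set = sorted(set(tags))
print(tag_set)  # ['DT', 'IN', 'JJ', 'NN', 'VBZ']
```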

Core Probability Formula

A CRF defines the conditional probability through weighted feature functions in an exponential (log-linear) form, ensuring non-negativity and interpretability:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_j \sum_{i=1}^{n-1} \lambda_j t_j(y_{i+1}, y_i, x, i) + \sum_k \sum_{i=1}^{n} \mu_k s_k(y_i, x, i) \right)$$

Where:

  • $Z(x) = \sum_y \exp(\ldots)$ = normalization constant (partition function; depends on $x$)
  • $t_j(y_{i+1}, y_i, x, i)$ = transition feature function (captures adjacent state pairs)
  • $s_k(y_i, x, i)$ = state feature function (captures state-observation relationships)
  • $\lambda_j, \mu_k$ = feature weights (learned from training data)
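To make the formula concrete, here is a minimal Python sketch that evaluates $P(y \mid x)$ for a tiny, hypothetical example by brute-force enumeration of all tag sequences when computing $Z(x)$. The tag set and weights are made up for illustration; real implementations compute $Z(x)$ with dynamic programming (the forward algorithm) instead.

```python
import itertools
import math

# Hypothetical toy example: three words, three possible tags.
words = ("the", "lazy", "dog")
tags = ("DT", "JJ", "NN")

# Weighted features folded into lookup tables (weight 0 if a pair is absent):
# transition entries ~ lambda_j * t_j, state entries ~ mu_k * s_k.
TRANS = {("DT", "JJ"): 1.0, ("DT", "NN"): 2.5, ("JJ", "NN"): 2.0}
STATE = {("the", "DT"): 3.0, ("lazy", "JJ"): 1.5, ("dog", "NN"): 2.0}

def score(y, x):
    """Unnormalized score: sum of weighted state and transition features."""
    total = sum(STATE.get((x[i], y[i]), 0.0) for i in range(len(x)))
    total += sum(TRANS.get((y[i], y[i + 1]), 0.0) for i in range(len(x) - 1))
    return total

def conditional_probability(y, x):
    """P(y | x) = exp(score(y, x)) / Z(x), with Z(x) enumerated explicitly."""
    Z = sum(math.exp(score(cand, x))
            for cand in itertools.product(tags, repeat=len(x)))
    return math.exp(score(y, x)) / Z

print(conditional_probability(("DT", "JJ", "NN"), words))  # the highest-scoring labeling
```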

Feature Functions

Feature functions are binary indicators (0 or 1) that capture specific patterns in the data. They enable flexible feature engineering.

Transition Feature Functions $t_j(y_{i+1}, y_i, x, i)$

Capture relationships between adjacent states $y_i$ and $y_{i+1}$, possibly depending on the observations $x$ and the position $i$.

Example (POS Tagging):

  • $t_1(y_{i+1}, y_i, x, i) = 1$ if $y_i = \text{DT}$ and $y_{i+1} = \text{NN}$, else 0
    • Captures the pattern: a determiner (DT) is often followed by a noun (NN)
  • $t_2(y_{i+1}, y_i, x, i) = 1$ if $y_i = \text{JJ}$ and $y_{i+1} = \text{NN}$, else 0
    • Captures the pattern: an adjective (JJ) is often followed by a noun (NN)

State Feature Functions $s_k(y_i, x, i)$

Capture relationships between the current state $y_i$ and the observations $x$ at position $i$.

Example (POS Tagging):

  • $s_1(y_i, x, i) = 1$ if $y_i = \text{NN}$ and $x_i$ ends with "tion", else 0
    • Captures the pattern: words ending in "tion" are often nouns
  • $s_2(y_i, x, i) = 1$ if $y_i = \text{VB}$ and $x_i$ ends with "ed", else 0
    • Captures the pattern: words ending in "ed" are often verbs (these four features are written out as code below)
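The four example features above translate directly into Python indicator functions. This is only a sketch of the feature-function interface; the argument names mirror the notation in this section, and the weights that multiply these indicators would be learned during training.

```python
def t1(y_next, y_curr, x, i):
    """Transition feature: determiner (DT) followed by noun (NN)."""
    return 1 if y_curr == "DT" and y_next == "NN" else 0

def t2(y_next, y_curr, x, i):
    """Transition feature: adjective (JJ) followed by noun (NN)."""
    return 1 if y_curr == "JJ" and y_next == "NN" else 0

def s1(y_curr, x, i):
    """State feature: word ending in 'tion' tagged as noun (NN)."""
    return 1 if y_curr == "NN" and x[i].endswith("tion") else 0

def s2(y_curr, x, i):
    """State feature: word ending in 'ed' tagged as verb (VB)."""
    return 1 if y_curr == "VB" and x[i].endswith("ed") else 0

# Both a transition feature and a state feature fire on this fragment.
x = ["the", "acquisition"]
print(t1("NN", "DT", x, 0), s1("NN", x, 1))  # -> 1 1
```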

Feature Function Properties

  • Binary values: Each feature function returns 0 or 1 (indicator function)
  • Flexible design: Can capture any pattern (word patterns, context, position, etc.)
  • Weighted combination: Feature weights $\lambda_j, \mu_k$ are learned from data
  • Overlapping features: Multiple features can be active simultaneously

Part-of-Speech Tagging Example

Apply a linear-chain CRF to tag each word in a sentence with its part-of-speech label. Observations are words; states are POS tags.

Sentence: Word Sequence with POS Tags

| Index | Word (Observation) | POS Tag (State) | Tag Description |
|-------|--------------------|-----------------|-----------------|
| 1 | The | DT | Determiner |
| 2 | quick | JJ | Adjective |
| 3 | brown | JJ | Adjective |
| 4 | fox | NN | Noun |
| 5 | jumps | VBZ | Verb (3rd person) |
| 6 | over | IN | Preposition |
| 7 | the | DT | Determiner |
| 8 | lazy | JJ | Adjective |
| 9 | dog | NN | Noun |

Sentence: "The quick brown fox jumps over the lazy dog." Goal: given the word sequence, infer the POS tag sequence using a CRF.

CRF Model for POS Tagging

Graph Structure:

Linear chain: $\text{POS}_1 - \text{POS}_2 - \text{POS}_3 - \ldots - \text{POS}_9$. Each POS tag $y_i$ corresponds to word $x_i$.

Transition Features:

  • DT → NN: $\lambda_1 = 2.5$ (determiner often followed by noun)
  • JJ → NN: $\lambda_2 = 2.0$ (adjective often followed by noun)
  • NN → VBZ: $\lambda_3 = 1.5$ (noun often followed by verb)

State Features:

  • Word = "The" → DT: $\mu_1 = 3.0$ (high confidence)
  • Word ends with "ly" → RB: $\mu_2 = 2.0$ (adverb pattern)
  • Word capitalized → NNP: $\mu_3 = 1.5$ (proper noun pattern)

CRF Inference Result:

Given the word sequence, the CRF computes $P(\text{POS tags} \mid \text{words})$ and finds the most likely tag sequence with the Viterbi algorithm (sketched below). Result: DT-JJ-JJ-NN-VBZ-IN-DT-JJ-NN (shown in the table above) with high confidence.
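Here is a compact Python sketch of Viterbi decoding for a linear-chain CRF. The tag set and score tables are hypothetical stand-ins for the weighted feature sums $\sum_j \lambda_j t_j$ and $\sum_k \mu_k s_k$ (echoing the example weights above), so this illustrates the dynamic program rather than a trained model.

```python
def viterbi_decode(words, tags, state_score, trans_score):
    """Most likely tag sequence argmax_y score(y, x) for a linear-chain CRF.

    state_score(word, tag)     ~ sum_k mu_k * s_k at one position
    trans_score(prev_tag, tag) ~ sum_j lambda_j * t_j for one adjacent pair
    """
    n = len(words)
    # delta[i][t] = best score of any tag prefix ending in tag t at position i
    delta = [{t: state_score(words[0], t) for t in tags}]
    backptr = [{}]
    for i in range(1, n):
        delta.append({})
        backptr.append({})
        for t in tags:
            prev = max(tags, key=lambda p: delta[i - 1][p] + trans_score(p, t))
            delta[i][t] = (delta[i - 1][prev] + trans_score(prev, t)
                           + state_score(words[i], t))
            backptr[i][t] = prev
    # Trace the best path back from the final position.
    path = [max(tags, key=lambda t: delta[n - 1][t])]
    for i in range(n - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

# Hypothetical scores echoing the weights above (missing pairs score 0).
TRANS = {("DT", "NN"): 2.5, ("JJ", "NN"): 2.0, ("NN", "VBZ"): 1.5,
         ("DT", "JJ"): 1.0, ("JJ", "JJ"): 0.5}
STATE = {("The", "DT"): 3.0, ("the", "DT"): 3.0, ("quick", "JJ"): 1.0,
         ("brown", "JJ"): 1.0, ("fox", "NN"): 1.0, ("jumps", "VBZ"): 1.0}

print(viterbi_decode(["The", "quick", "brown", "fox", "jumps"],
                     ["DT", "JJ", "NN", "VBZ"],
                     lambda w, t: STATE.get((w, t), 0.0),
                     lambda p, t: TRANS.get((p, t), 0.0)))
# -> ['DT', 'JJ', 'JJ', 'NN', 'VBZ']
```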

Applications

Sequence Labeling Tasks

  • Part-of-speech tagging: Words → POS tags (a library-based training sketch follows this list)
  • Named entity recognition: Words → Entity types (Person, Location, Organization)
  • Chunking: Words → Syntactic chunks (NP, VP, etc.)
  • Chinese word segmentation: Characters → Word boundaries
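In practice, linear-chain CRFs for these tasks are usually trained with an existing library rather than implemented from scratch. The sketch below assumes the third-party sklearn-crfsuite package and a hypothetical toy training set, with per-token feature dictionaries as input; parameter names and data formats should be checked against the library's documentation.

```python
# pip install sklearn-crfsuite   (assumed third-party dependency)
import sklearn_crfsuite

def token_features(sentence, i):
    """Feature dictionary for one token; keys are arbitrary feature names."""
    word = sentence[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),   # capitalization as a proper-noun cue
        "suffix3": word[-3:],             # suffixes such as 'ion', 'ed', 'ly'
        "prev_word": sentence[i - 1].lower() if i > 0 else "<BOS>",
    }

# Hypothetical toy training data: one labeled sentence.
sentences = [["The", "quick", "brown", "fox", "jumps"]]
labels = [["DT", "JJ", "JJ", "NN", "VBZ"]]

X_train = [[token_features(s, i) for i in range(len(s))] for s in sentences]
y_train = labels

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict(X_train))  # predicted tag sequences for the training sentences
```

Here the feature dictionaries roughly play the role of the state feature functions $s_k$, while the library learns transition weights between adjacent tags on its own.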

Advantages Over HMM

  • No independence assumption: Can use overlapping features
  • More accurate: Discriminative training focuses on classification
  • Flexible features: Can incorporate any observation pattern
  • Better for NLP: Handles complex word patterns naturally

Advantages and Limitations

Advantages

  • More accurate for classification: Discriminative training optimizes for prediction
  • Flexible feature engineering: Can use any observation pattern
  • No independence assumptions: Can model overlapping features
  • Better for NLP: Handles complex word patterns and context

Limitations

  • Cannot generate samples: Only models conditional distribution
  • Requires labeled data: Needs fully labeled sequences for training
  • Training can be slow: Feature computation and normalization can be expensive
  • Feature engineering needed: Requires domain knowledge to design features