Sequence-to-sequence models map one variable-length sequence to another: English to French, audio to text, source code to docstrings. The original Seq2Seq design used an encoder RNN to compress the input into a single context vector and a decoder RNN to generate the output one token at a time. That single context vector turned out to be the architecture's critical weakness — and attention was the fix that fundamentally changed how sequence models work.
This article walks through the Seq2Seq architecture, the bottleneck problem, the attention mechanism that solves it, and beam search — the inference-time decoding strategy that produces better translations than greedy sampling.
The Seq2Seq architecture
Seq2Seq splits the model into two RNNs trained jointly:
The encoder reads the input sequence $x_1, \dots, x_T$ one token at a time and produces a hidden state at every position:

$$h_t = f(x_t, h_{t-1})$$

The final hidden state $h_T$ becomes the context vector $c = h_T$, summarizing the entire input.

The decoder initializes its hidden state with the context vector and generates output tokens autoregressively:

$$s_t = g(y_{t-1}, s_{t-1}, c), \qquad P(y_t \mid y_{<t}, x) = \operatorname{softmax}(W_o s_t)$$
Two special tokens are essential: <bos> (beginning of sequence) starts decoding, and <eos> (end of sequence) signals the model to stop. During training, teacher forcing feeds the ground-truth previous token as input to the decoder. At inference time, the decoder must use its own previous prediction.
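To make the two regimes concrete, here is a minimal Python sketch contrasting teacher-forced inputs during training with free-running decoding at inference. The `decoder_step` stub is hypothetical and merely stands in for a trained RNN cell; the token IDs are illustrative.

```python
import numpy as np

VOCAB_SIZE, BOS, EOS = 100, 1, 2

def decoder_step(prev_token, state):
    """Hypothetical stub for one decoder step: returns (logits, new_state).
    A real model would run an RNN cell here; we fake deterministic logits."""
    rng = np.random.default_rng(prev_token * 31 + state)
    return rng.standard_normal(VOCAB_SIZE), state + 1

def teacher_forced_inputs(target):
    """Training: the decoder sees the ground-truth previous token at each step."""
    return [BOS] + target[:-1]          # gold sequence shifted right by one

def greedy_decode(max_len=20):
    """Inference: the decoder consumes its own previous prediction."""
    token, state, output = BOS, 0, []
    for _ in range(max_len):
        logits, state = decoder_step(token, state)
        token = int(np.argmax(logits))  # feed this prediction back in
        if token == EOS:
            break
        output.append(token)
    return output

print(teacher_forced_inputs([5, 6, 7, EOS]))  # [1, 5, 6, 7]
print(greedy_decode())
```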
The bottleneck problem
A single fixed-size context vector must encode every detail of the input sequence. For a 5-word sentence this works. For a 50-word sentence, one 256- or 512-dimensional vector simply cannot preserve all the information needed for accurate translation.
Empirically, Seq2Seq performance on long sequences degraded sharply. Translation BLEU scores collapsed beyond 20–30 tokens. The bottleneck was not the encoder or the decoder individually — it was the assumption that a fixed-size summary could suffice for variable-length inputs.
Compressing a 50-word sentence into a 256-dimensional vector is like asking someone to summarize a paragraph in three syllables. The compression is too aggressive.
Attention: a weighted view of the entire encoder
Attention removes the bottleneck by giving the decoder access to all encoder hidden states $h_1, \dots, h_T$, not just the final one. At every decoding step $t$, the decoder computes a different weighted combination of the encoder states based on what it currently needs.
Three quantities define attention. The decoder generates a query from its current state. Each encoder position provides a key (used to compute relevance) and a value (used to construct the context). For Seq2Seq attention, both keys and values are typically the encoder hidden states themselves.
The attention computation has three steps. First, compute alignment scores between the query and each key:

$$e_{t,i} = \operatorname{score}(q_t, k_i)$$

Second, normalize the scores into a probability distribution:

$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}$$

Third, compute the context as a weighted sum of encoder hidden states:

$$c_t = \sum_{i} \alpha_{t,i} h_i$$
The decoder receives this fresh, position-aware context vector $c_t$ at every step instead of relying on the static $c$. Two common scoring functions exist:
Additive attention (Bahdanau): a small feedforward network combines query and key:

$$\operatorname{score}(q, k) = v^\top \tanh(W_q q + W_k k)$$

Scaled dot-product attention (Luong, then Transformer): a simple inner product:

$$\operatorname{score}(q, k) = \frac{q^\top k}{\sqrt{d}}$$

The $\sqrt{d}$ denominator prevents large dot products from saturating the softmax when query and key dimensions are large. Scaled dot-product is the form used in Transformers and is computationally cheaper because it is just matrix multiplication.
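The whole mechanism fits in a few lines of NumPy. The sketch below implements the three steps with either scoring function; the dimensions and weight matrices are illustrative choices, not taken from any specific paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())            # subtract max for numerical stability
    return e / e.sum()

def additive_score(q, k, W_q, W_k, v):
    """Bahdanau-style: v^T tanh(W_q q + W_k k_i), one scalar per position."""
    return np.tanh(q @ W_q.T + k @ W_k.T) @ v

def scaled_dot_score(q, k):
    """Luong/Transformer-style: q . k_i / sqrt(d)."""
    return k @ q / np.sqrt(q.shape[-1])

def attend(q, h, score_fn):
    e = score_fn(q, h)                 # step 1: alignment scores, shape (T,)
    alpha = softmax(e)                 # step 2: normalize to a distribution
    return alpha @ h, alpha            # step 3: weighted sum of the values

# Toy setup: T=5 encoder states of dimension d=8 (all weights random).
rng = np.random.default_rng(0)
d, T = 8, 5
h = rng.standard_normal((T, d))        # encoder states = keys = values
q = rng.standard_normal(d)             # decoder query
W_q, W_k = rng.standard_normal((d, d)), rng.standard_normal((d, d))
v = rng.standard_normal(d)

ctx, alpha = attend(q, h, scaled_dot_score)
ctx_add, _ = attend(q, h, lambda q, k: additive_score(q, k, W_q, W_k, v))
print(alpha.round(3), alpha.sum())     # attention weights sum to 1.0
```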
Attention as a soft, differentiable lookup
A useful intuition: attention is a soft generalization of dictionary lookup. With a hard lookup, a query exactly matches one key and retrieves one value. With attention, the query produces a similarity score against every key, the softmax turns those scores into probabilities, and the output is a weighted blend of all values.
The blend is differentiable end-to-end, which is what makes attention trainable. The network learns which positions to weight more heavily by adjusting the parameters that produce queries, keys, and values. There is no manual rule for which encoder positions matter — the gradients sort it out.
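A toy numerical contrast of the two lookups; the keys and values here are made up purely for illustration.

```python
import numpy as np

# Hard lookup: the query matches exactly one key, retrieving exactly one value.
table = {"cat": 10.0, "mat": 20.0}
print(table["cat"])                            # 10.0

# Soft lookup: score the query against every key, softmax, blend all values.
keys    = np.array([[1.0, 0.0], [0.0, 1.0]])   # keys for "cat" and "mat"
values  = np.array([10.0, 20.0])
query   = np.array([0.9, 0.1])                 # mostly "cat"-like
scores  = keys @ query
weights = np.exp(scores) / np.exp(scores).sum()
print(weights @ values)                        # ~13.1, dominated by "cat"'s value
```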
Concrete example: when translating "The cat sat on the mat" to French, the decoder generating "chat" would learn to place high attention weight on the encoder position for "cat" and lower weight elsewhere. A few steps later, when generating "tapis," the attention pattern shifts to the encoder position for "mat." This positional alignment emerges from training, not from hand-engineered rules.
Why attention fixes the bottleneck
Attention solves three problems at once. The encoder no longer needs to compress everything into one vector — it provides one hidden state per input token. The decoder no longer relies on a static summary — it pulls fresh context at every step. Long-range dependencies become accessible because the decoder can attend directly to any input position, regardless of distance.
In numbers: pre-attention Seq2Seq lost ~5 BLEU points on long sentences. Adding attention recovered most of that gap and unlocked further improvements. The attention mechanism became the dominant architectural ingredient in subsequent NLP work, eventually evolving into the self-attention layer of the Transformer.
Beam search: better decoding without greedy mistakes
Even with attention, the decoder still needs an inference strategy to choose the output sequence. The simplest approach, greedy decoding, picks the highest-probability token at each step:
$$\hat{y}_t = \arg\max_{y}\, P(y \mid \hat{y}_{<t}, x)$$

Greedy decoding is fast but myopic. A locally optimal token at step $t$ can lead to dead-end continuations that no later step can fix. The optimal output sequence is the one that maximizes the joint probability:

$$y^{*} = \arg\max_{y} \prod_{t} P(y_t \mid y_{<t}, x)$$

Exhaustive search is intractable: with vocabulary size $V$ and output length $T$, the search space is $V^T$. Beam search is a tractable approximation that maintains the top $k$ candidate sequences at every step.
The beam search algorithm
With beam width $k$ (commonly 5 or 10):
- Initialize with $k$ copies of the start token <bos>, each with score 0.
- At each step, extend each of the $k$ candidates with every possible next token, producing $kV$ extensions.
- Score each extension by its cumulative log-probability: $\operatorname{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \mid y_{<i}, x)$.
- Keep only the top $k$ highest-scoring extensions and continue.
- When a candidate emits <eos>, set it aside as a complete hypothesis. Continue searching with the remaining beams until the length limit is reached or all beams complete.
The use of log-probabilities is essential. Multiplying many small probabilities causes numerical underflow; summing log-probabilities is both numerically stable and equivalent up to a monotonic transformation.
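A quick demonstration of why, with the numbers chosen only to force the underflow:

```python
import numpy as np

p = np.full(400, 0.1)            # 400 token probabilities of 0.1 each
print(np.prod(p))                # 0.0 -- the product underflows float64
print(np.sum(np.log(p)))         # -921.03... -- the log-space score is stable
```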
Length normalization
Beam search has a built-in bias toward shorter sequences. Each additional token contributes a negative log-probability, so longer sequences accumulate more negative score regardless of quality. To compensate, divide the cumulative score by the sequence length raised to a power $\alpha$:

$$\operatorname{score}_{\text{norm}}(y) = \frac{1}{|y|^{\alpha}} \sum_{t=1}^{|y|} \log P(y_t \mid y_{<t}, x)$$
Common values are $\alpha = 0.6$ or $\alpha = 0.7$. With $\alpha = 0$, there is no normalization (favors short outputs). With $\alpha = 1$, the score is a simple length average (sometimes favors very long outputs). Tuning $\alpha$ on a validation set is standard practice in machine translation.
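Putting the algorithm and the length normalization together, here is a self-contained Python sketch. The `log_probs` stub stands in for a real decoder, and the beam width, length limit, and $\alpha$ are illustrative choices.

```python
import numpy as np

VOCAB_SIZE, BOS, EOS = 20, 0, 1

def log_probs(prefix):
    """Stub decoder: a deterministic log-distribution over the next token."""
    rng = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    logits = rng.standard_normal(VOCAB_SIZE)
    return logits - np.log(np.exp(logits).sum())   # log-softmax

def beam_search(k=5, max_len=10, alpha=0.7):
    beams, finished = [([BOS], 0.0)], []            # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            lp = log_probs(seq)
            for tok in range(VOCAB_SIZE):           # extend with every next token
                candidates.append((seq + [tok], score + lp[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:           # keep the top-k extensions
            if seq[-1] == EOS:
                finished.append((seq, score))       # complete hypothesis
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)                          # unfinished beams at the limit
    # Length normalization: divide by |y|^alpha to offset the short-output bias.
    return max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)

sequence, score = beam_search()
print(sequence, round(score, 2))
```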
Beam width tradeoffs
Larger beams explore more candidates but also cost more computation and memory. The pattern is usually:
- $k = 1$: greedy decoding. Fast but suboptimal.
- $k = 5$: typical sweet spot for machine translation. Good quality at modest cost.
- $k = 10$ to $20$: slightly better quality, sometimes used for evaluation.
- $k > 20$: diminishing returns. Quality plateaus and inference time scales linearly.
Surprisingly, very wide beams sometimes hurt quality. The extra freedom finds candidates that maximize model probability but produce unnatural language — the "beam search curse." This is a consequence of imperfect language models: maximum-probability sequences are not always the most fluent ones.
Evaluation: BLEU as a translation metric
Translation quality is usually measured with BLEU (Bilingual Evaluation Understudy), which compares $n$-gram overlap between the model output and one or more reference translations:

$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

Here $p_n$ is the precision of $n$-gram matches, $w_n$ is the weight per order (uniform $w_n = 1/N$, typically with $N = 4$), and BP is a brevity penalty discouraging overly short outputs. BLEU is imperfect (it does not capture meaning preservation or fluency directly), but it is fast and reproducible, which is why it remains the standard MT metric.
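A minimal single-reference BLEU sketch. Real implementations such as sacreBLEU add smoothing, tokenization rules, and multi-reference handling; this version is illustrative only.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(candidate, reference, max_n=4):
    """Clipped n-gram precisions combined with a brevity penalty."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped matches
        if overlap == 0:
            return 0.0           # any zero precision zeroes the score (no smoothing)
        log_precisions.append(math.log(overlap / sum(cand.values())))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)             # uniform weights

print(sentence_bleu("the cat sat on the mat".split(),
                    "the cat sat on the red mat".split()))        # ~0.67
```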
From attention to Transformer
Once attention was established, a logical next step appeared: if attention can replace the encoder-decoder bottleneck, can it also replace the recurrence inside the encoder and decoder themselves? The answer was yes — and the result was the Transformer, which removes RNNs entirely in favor of stacked self-attention layers. That architecture is a separate topic, but the path to it runs straight through the Seq2Seq + attention design covered in this article.
The main takeaway
Seq2Seq turned variable-length input into variable-length output by chaining an encoder RNN and a decoder RNN. The single context vector worked for short sequences but became a bottleneck for longer ones, with translation quality degrading sharply beyond roughly 20–30 tokens.
Attention solved the bottleneck by giving the decoder a position-aware, weighted view of all encoder hidden states at every decoding step. Two scoring forms (additive and scaled dot-product) and three steps (score, softmax, weighted sum) are all that is needed to express the mechanism. Beam search at inference time replaces myopic greedy decoding with a tractable approximation of joint-probability maximization, with length normalization to counter the short-output bias.
Together, these three pieces — encoder/decoder, attention, beam search — defined the era of neural machine translation that immediately preceded Transformers, and they remain the foundation for understanding modern attention-based architectures.