
GRU and LSTM Explained: Gates, Cell States, and Long-Range Memory

How element-wise gates turn the multiplicative cascade of vanilla RNNs into an additive memory highway.

GRU · LSTM · Gated Cells · Long-Range Dependencies · RNN

Vanilla RNNs lose long-range dependencies. Backpropagation through time multiplies the same Jacobian dozens or hundreds of times, and the gradient either vanishes or explodes long before it reaches the earliest time steps. Gated cells — GRU and LSTM — fix this with a simple but powerful idea: let the network learn when to remember, when to forget, and when to update.

Both architectures replace the single tanh transform of a vanilla RNN with multiple gates that control the flow of information. This article walks through GRU first, then LSTM, then explains when to use each.

GRU: two gates that decide what to keep

The Gated Recurrent Unit introduces two gates that act element-wise on the hidden state. Each gate is a vector with entries in $(0, 1)$ produced by a sigmoid; values close to zero mean "block" and values close to one mean "allow."

The reset gate $R_t$ decides how much of the past hidden state to ignore when computing the candidate update:

$$R_t = \sigma(X_t W_{xr} + H_{t-1} W_{hr} + b_r)$$

The update gate $Z_t$ decides how much of the past to carry forward versus how much to overwrite with the new candidate state:

$$Z_t = \sigma(X_t W_{xz} + H_{t-1} W_{hz} + b_z)$$

The candidate hidden state mixes the current input with a reset-gated version of the past:

$$\tilde{H}_t = \tanh\!\left(X_t W_{xh} + (R_t \odot H_{t-1}) W_{hh} + b_h\right)$$

And the final hidden state is a learned interpolation between the previous state and the candidate:

$$H_t = Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$$

The symbol $\odot$ is element-wise multiplication. Each dimension of the hidden state has its own gate value, so the cell can remember some features while overwriting others.
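Put together, one GRU step is only a few lines of code. The sketch below follows the four equations directly; the `params` dict of weight matrices and biases is an assumption for illustration, not a trained model:

```python
import torch

def gru_step(x_t, h_prev, params):
    """One GRU step following the equations above (sketch; `params` is an
    assumed dict holding W_x*, W_h*, and b_* for each gate and the candidate)."""
    r = torch.sigmoid(x_t @ params["W_xr"] + h_prev @ params["W_hr"] + params["b_r"])      # reset gate R_t
    z = torch.sigmoid(x_t @ params["W_xz"] + h_prev @ params["W_hz"] + params["b_z"])      # update gate Z_t
    h_tilde = torch.tanh(x_t @ params["W_xh"] + (r * h_prev) @ params["W_hh"] + params["b_h"])  # candidate state
    return z * h_prev + (1 - z) * h_tilde                                                   # interpolate old and new
```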

Reading the GRU equations

The two extremes clarify what the gates do:

  • When $Z_t \approx 1$: $H_t \approx H_{t-1}$. The cell ignores the new input entirely and carries the past forward unchanged. Useful for "hold this information across many time steps."
  • When $Z_t \approx 0$: $H_t \approx \tilde{H}_t$. The cell completely overwrites memory with the new candidate. Useful for "forget the past, the situation just changed."
  • When $R_t \approx 0$: the candidate $\tilde{H}_t$ ignores the previous hidden state. The cell processes the current input fresh.
  • When $R_t \approx 1$: the candidate uses the full past hidden state, like a vanilla RNN.

The crucial property is the additive update $Z_t \odot H_{t-1} + (1 - Z_t) \odot \tilde{H}_t$. Information from earlier time steps reaches later ones via a path that does not pass through a tanh nonlinearity at every step. That path lets gradients flow back through long sequences without the multiplicative attenuation that vanilla RNNs suffer.
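A quick way to see this numerically is to push a gradient back through a long tanh cascade and through a gated, additive update with the gate held near 1. The toy script below is a sketch; the sizes, weight scale, and fixed gate value are illustrative assumptions:

```python
import torch

T, d = 100, 8
W = torch.randn(d, d) * 0.1
h0 = torch.randn(d, requires_grad=True)

# Vanilla-style cascade: every step passes through the weight matrix and tanh.
h = h0
for _ in range(T):
    h = torch.tanh(h @ W)
grad_vanilla = torch.autograd.grad(h.sum(), h0)[0].norm()

# GRU-style additive update with the update gate held near 1 ("keep the past").
z = torch.full((d,), 0.99)
h = h0
for _ in range(T):
    h = z * h + (1 - z) * torch.tanh(h @ W)
grad_gated = torch.autograd.grad(h.sum(), h0)[0].norm()

# The gated path keeps the gradient many orders of magnitude larger.
print(f"vanilla: {grad_vanilla:.3e}   gated: {grad_gated:.3e}")
```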

LSTM: separate cell state and three gates

LSTM was invented before GRU and is more elaborate. It maintains two running states: a hidden state $H_t$ (used by other layers) and a separate cell state $C_t$ (the "long-term memory"). Three gates control the cell:

$$I_t = \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i) \quad \text{(input gate)}$$
$$F_t = \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f) \quad \text{(forget gate)}$$
$$O_t = \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o) \quad \text{(output gate)}$$

A candidate cell state uses tanh (so its values are in $[-1, 1]$):

$$\tilde{C}_t = \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c)$$

The cell state update combines the previous cell state filtered by the forget gate with the candidate filtered by the input gate:

$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$$

Finally, the hidden state is a tanh-squashed cell state filtered by the output gate:

$$H_t = O_t \odot \tanh(C_t)$$
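As with the GRU, the whole cell is a handful of element-wise operations. A minimal sketch of one LSTM step, again assuming a dict of per-gate weights and biases:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the equations above (sketch; `params` is an
    assumed dict with one weight/bias set per gate plus the candidate)."""
    i = torch.sigmoid(x_t @ params["W_xi"] + h_prev @ params["W_hi"] + params["b_i"])   # input gate
    f = torch.sigmoid(x_t @ params["W_xf"] + h_prev @ params["W_hf"] + params["b_f"])   # forget gate
    o = torch.sigmoid(x_t @ params["W_xo"] + h_prev @ params["W_ho"] + params["b_o"])   # output gate
    c_tilde = torch.tanh(x_t @ params["W_xc"] + h_prev @ params["W_hc"] + params["b_c"])  # candidate cell state
    c_t = f * c_prev + i * c_tilde        # cell-state highway: one multiply, one add
    h_t = o * torch.tanh(c_t)             # squashed, gated view exposed to other layers
    return h_t, c_t
```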

Why the separate cell state matters

The defining feature of LSTM is the cell state highway: $C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$. The cell state passes through one element-wise multiplication and one element-wise addition per step — no tanh, no squashing, no aggressive nonlinearity in the path that carries information forward.

When the forget gate stays near 1 and the input gate stays near 0, $C_t \approx C_{t-1}$ and information is preserved verbatim across arbitrarily many steps. Gradients flowing backward through this path are multiplied only by the forget-gate values, which the network learns to keep near 1 for important memory.

The LSTM cell state is engineered to be a low-resistance pathway for long-range information. The hidden state is then a filtered, tanh-squashed view of that long-term memory presented to subsequent layers.

A common diagnostic when LSTMs underperform is to check the forget-gate biases. Initializing $b_f$ to a positive value (commonly $1$) starts the network in a "remember by default" regime, which empirically improves training on tasks with long dependencies.
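With PyTorch's built-in nn.LSTM, the gate parameters are packed as [input, forget, cell, output] blocks of size hidden_size, so the forget-gate bias is the second block. A sketch of the initialization (layer sizes are illustrative; note that bias_ih and bias_hh both contribute, so their values add up):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2)

with torch.no_grad():
    for name, bias in lstm.named_parameters():
        if "bias" in name:
            H = lstm.hidden_size
            bias[H:2 * H].fill_(1.0)   # forget-gate block: start in "remember by default"
```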

GRU vs. LSTM: parameters, speed, performance

The two cells are close cousins. Comparing them directly:

| Property | GRU | LSTM |
| --- | --- | --- |
| Gates | 2 (reset, update) | 3 (input, forget, output) |
| Internal states | Hidden only | Hidden + cell |
| Parameters per cell | ~3× vanilla RNN | ~4× vanilla RNN |
| Speed | Faster | Slower |
| Long-sequence performance | Strong | Slightly stronger on hardest tasks |

Empirically, GRU and LSTM perform very similarly on most language and time-series tasks. GRU is often preferred when training time matters or when the dataset is of moderate size; LSTM still wins on the longest sequences with the most demanding long-range dependencies. For practical purposes, "try GRU first and only switch to LSTM if you need to" is reasonable advice.
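The parameter ratios in the table are easy to verify directly with PyTorch (sizes below are arbitrary):

```python
import torch.nn as nn

I, H = 128, 256
count = lambda m: sum(p.numel() for p in m.parameters())

print(count(nn.RNN(I, H)))    # vanilla baseline: one weight/bias set
print(count(nn.GRU(I, H)))    # ~3x the baseline: reset, update, and candidate blocks
print(count(nn.LSTM(I, H)))   # ~4x the baseline: three gates plus the candidate
```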

Stacking gated cells: deep RNNs

A single layer of GRU or LSTM is rarely enough for complex tasks. Deep RNNs stack multiple recurrent layers, where the hidden state of layer $l$ at time $t$ becomes the input of layer $l+1$ at time $t$:

$$H_t^{(l)} = f^{(l)}\left(H_t^{(l-1)}, H_{t-1}^{(l)}\right)$$

Two to four layers are typical. Deeper stacks are possible but require careful regularization (dropout between layers, layer normalization). The PyTorch one-liner nn.LSTM(num_layers=2) handles the stacking; the equations above describe what happens inside each layer.
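For reference, a stacked configuration with dropout between the recurrent layers looks like this (sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Two stacked LSTM layers; dropout is applied between layers, not inside a layer.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, dropout=0.2)

x = torch.randn(50, 32, 128)        # (seq_len, batch, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)                 # (50, 32, 256): top-layer hidden states at every step
print(h_n.shape)                    # (2, 32, 256): final hidden state of each layer
```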

Bidirectional RNNs: looking forward and backward

Some tasks need access to both past and future context. In named entity recognition, knowing the next word helps disambiguate the current one. Bidirectional RNNs run two independent recurrent passes — one forward in time, one backward — and concatenate their hidden states at each position:

$$\overrightarrow{H}_t = f(\overrightarrow{H}_{t-1}, x_t), \quad \overleftarrow{H}_t = f(\overleftarrow{H}_{t+1}, x_t)$$
$$H_t = [\overrightarrow{H}_t; \overleftarrow{H}_t]$$

Bidirectional RNNs are useful for tagging, classification, and any task with the full sequence available at inference. They are not suitable for autoregressive generation, where future tokens do not exist yet at decoding time.
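In PyTorch this is a single flag, and the output feature dimension doubles because the forward and backward states are concatenated (sizes are illustrative):

```python
import torch
import torch.nn as nn

# Bidirectional GRU, e.g. as the encoder for a tagging task.
tagger_rnn = nn.GRU(input_size=128, hidden_size=256, bidirectional=True)

x = torch.randn(50, 32, 128)        # (seq_len, batch, features)
output, h_n = tagger_rnn(x)
print(output.shape)                 # (50, 32, 512): forward and backward states concatenated
```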

A practical recipe

Three rules cover most situations:

  • Default to GRU for most language and time-series tasks. It trains faster and reaches similar performance.
  • Use LSTM when the task has very long dependencies (hundreds of steps) and benchmarks show LSTM matters, or when the literature for your specific problem uses LSTM and you want to compare directly.
  • Add bidirectionality for tagging and classification tasks where the full sequence is available at inference. Skip it for generation.

Across all variants, the implementation pattern is the same: replace the vanilla recurrence with a gated cell, keep gradient clipping, keep truncated BPTT, and let the network learn how to manage memory.
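A minimal training step illustrating that pattern is sketched below; the model, sizes, and the sequence-classification head are assumptions made for the example:

```python
import torch
import torch.nn as nn

model = nn.GRU(input_size=128, hidden_size=256, num_layers=2)
head = nn.Linear(256, 10)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(50, 32, 128)            # one truncated-BPTT chunk: (seq_len, batch, features)
y = torch.randint(0, 10, (32,))         # one label per sequence

output, h_n = model(x)
loss = criterion(head(h_n[-1]), y)      # classify from the top layer's final hidden state
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # keep gradient clipping
optimizer.step()
```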

The main takeaway

Gated cells solve the vanishing-gradient problem of vanilla RNNs by introducing element-wise gates that control information flow. GRU uses two gates (reset and update) and a single hidden state. LSTM uses three gates (input, forget, output) and adds a separate cell state that acts as a long-term memory highway. The shared insight is the same: replace the multiplicative cascade through tanh with an additive update path, and gradients can flow across hundreds of time steps.

In practice, GRU is the modern default for most sequence tasks, with LSTM held in reserve for the hardest long-range problems. Both are usually stacked in two- or three-layer configurations, and bidirectional variants extend them to non-autoregressive tasks. Understanding the gating math also makes the next architectural step — attention and Transformers — easier to motivate, because attention is essentially what happens when you take the "weighted sum of past states" intuition behind gating and let it operate over arbitrary positions instead of just the immediately previous step.
