Everything covered so far — MLPs and CNNs — assumed independent samples. Whether the next image is a cat or a dog has nothing to do with the previous image. Sequence data breaks this assumption. Language, time series, and audio all carry temporal structure: scrambling the order destroys the meaning.
This article walks the path from why feedforward networks cannot model sequences, through the Markov approximation, to the hidden-state idea that produces RNNs, and finally to the formal goal of language modeling — assigning probabilities to sequences of tokens.
The i.i.d. assumption and why sequences break it
Both MLPs and CNNs train under a strong implicit assumption: independent and identically distributed (i.i.d.) samples. The model treats two consecutive training images as having no causal relationship. Shuffling the entire dataset would not affect training in any meaningful way.
Many of the most valuable real-world data sources do not work this way. Language has word order. Time series have temporal causality. Audio has phoneme transitions. The sentence "I forgot my umbrella, so I got soaked in the ___" almost forces the answer "rain" — but only because every preceding word is part of the conditioning context.
Mathematically, predicting the next value in a sequence is a conditional probability:

$$P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1)$$
The condition gets longer with every new time step. A standard MLP expects fixed-size input — it has no native way to absorb a context of unknown and growing length.
The Markov approximation: truncate history to fit the model
The most direct workaround is to chop off the history. Predict using only the most recent $\tau$ values:

$$P(x_t \mid x_{t-1}, \ldots, x_{t-\tau})$$
This is a $\tau$-th-order Markov assumption. In code, it manifests as the sliding-window pattern that makes every basic time-series tutorial look identical:
```python
import torch

def create_dataset(time_series, tau):
    """Slice a 1-D series (list or array) into (past tau steps, next value) pairs."""
    features, labels = [], []
    for i in range(len(time_series) - tau):
        features.append(time_series[i: i + tau])   # past tau steps
        labels.append(time_series[i + tau])        # next value
    return torch.tensor(features), torch.tensor(labels)

# A 1000-step series with tau=4 produces X.shape = (996, 4) and y.shape = (996,)
```

Variable-length history becomes fixed-length input. An MLP can now consume a (batch_size, tau) tensor and produce a prediction. The cost is direct: anything older than $\tau$ steps is permanently discarded. For tasks that need long context — paragraph-level coherence, multi-day trends — this is too aggressive.
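To make the shape concrete, here is a minimal sketch of an MLP consuming those windows; the layer sizes, optimizer settings, and synthetic data are placeholders, not part of the original pipeline:

```python
import torch
from torch import nn

# Illustrative model and data; sizes and optimizer are placeholders.
tau = 4
net = nn.Sequential(nn.Linear(tau, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X = torch.randn(996, tau)   # stand-in for create_dataset features
y = torch.randn(996)        # stand-in for create_dataset labels

for epoch in range(5):
    loss = loss_fn(net(X).squeeze(-1), y)   # one prediction per tau-step window
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```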
The hidden-state idea: a running summary of history
Instead of throwing history away, what if the model maintained a compact running summary of everything it has seen? A hidden state $h_{t-1}$, written and updated step by step, captures the relevant information from $x_1, \ldots, x_{t-1}$ in a fixed-size vector.

With such a summary, the prediction shrinks to:

$$P(x_t \mid x_{t-1}, h_{t-1})$$

Two inputs — the current observation $x_{t-1}$ and the running summary $h_{t-1}$ — instead of a window of arbitrary length. The summary itself is updated by a learned function $g$:

$$h_t = g(h_{t-1}, x_t)$$

When $g$ is parameterized as a neural network with shared weights at every step, this recurrence is the recurrent neural network. Sequence length is no longer a structural problem — the network ingests one time step at a time and updates its hidden state, regardless of how long the sequence eventually becomes.
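A minimal sketch of this recurrence, with a single linear-plus-tanh layer standing in for $g$ and hypothetical dimensions:

```python
import torch
from torch import nn

input_dim, hidden_dim = 8, 16                       # hypothetical sizes
g = nn.Linear(input_dim + hidden_dim, hidden_dim)   # one set of weights, shared by every step

def run_sequence(xs):
    """xs: (seq_len, batch_size, input_dim). Returns the final hidden state."""
    h = torch.zeros(xs.shape[1], hidden_dim)            # h_0: empty summary
    for x_t in xs:                                       # ingest one time step at a time
        h = torch.tanh(g(torch.cat([x_t, h], dim=-1)))   # h_t = g(h_{t-1}, x_t)
    return h

print(run_sequence(torch.randn(100, 32, input_dim)).shape)  # torch.Size([32, 16])
```

The same weights process every step, so the loop accepts a sequence of any length without changing the model.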
Single-step versus multi-step prediction: error accumulates
Sequence models that look fantastic on a validation set sometimes collapse on real generation tasks. The reason is the difference between single-step and multi-step prediction.
Single-step prediction: at every time step, the model has access to the true history up to $x_{t-1}$ and predicts only $x_t$. Errors do not propagate. Validation losses computed this way look small because each prediction starts from the ground truth.
Multi-step prediction: predict 50 steps into the future. After the first step, there is no more ground truth — the model's own (possibly wrong) prediction becomes the input to the next step. A 0.05 error at step 1 turns into a 0.1 error at step 2, then 0.2, then exponentially worse. By step 50, the prediction trajectory may have decayed into a flat line that has nothing to do with the true sequence.
Single-step validation loss is misleading for autoregressive generation. Always evaluate the multi-step trajectory before deploying a sequence model.
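A sketch of the multi-step rollout, assuming a trained one-step model `net` that maps the last `tau` observations to the next value (as in the sliding-window setup above), with the true series given as a plain list of floats:

```python
import torch

def multistep_predict(net, x, tau, num_preds):
    """Autoregressive rollout: the model's own outputs become its next inputs.

    Assumes `net` maps a (1, tau) window to the next value and `x` is the
    true series as a list of floats.
    """
    history = list(x[-tau:])                   # start from the last tau true values
    preds = []
    with torch.no_grad():
        for _ in range(num_preds):
            window = torch.tensor(history[-tau:]).reshape(1, -1)  # (1, tau)
            next_val = net(window).item()      # prediction from possibly wrong inputs
            preds.append(next_val)
            history.append(next_val)           # error feeds forward into the next step
    return preds
```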
Language modeling as probability assignment
Language modeling is the most-studied instance of sequence prediction. The formal goal is not "teach the model to talk" but "assign a probability to a sequence of tokens." Given a token sequence $x_1, x_2, \ldots, x_T$, what is $P(x_1, x_2, \ldots, x_T)$?
The chain rule of probability decomposes the joint into a product of conditionals:

$$P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$$
This is exactly how generative language models produce text. They do not synthesize a complete sentence in one shot. They predict one token at a time, conditioned on everything that came before.
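A sketch of that loop, assuming a hypothetical `model(tokens)` interface that returns next-token logits for a list of token ids; the running sum of log-probabilities is exactly the chain-rule decomposition:

```python
import torch

def generate(model, prefix, num_tokens):
    """Autoregressive generation: one token at a time, conditioned on all prior tokens.

    Hypothetical interface: model(tokens) returns logits of shape (vocab_size,)
    for a list of token ids.
    """
    tokens = list(prefix)
    log_prob = 0.0                                       # log P(generated tokens | prefix)
    for _ in range(num_tokens):
        probs = torch.softmax(model(tokens), dim=-1)     # P(x_t | x_1, ..., x_{t-1})
        next_token = int(torch.argmax(probs))            # greedy decoding for simplicity
        log_prob += torch.log(probs[next_token]).item()  # chain rule: log-probs add up
        tokens.append(next_token)
    return tokens, log_prob
```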
Why counting fails as a language model
The simplest possible language model estimates each conditional probability by counting in a corpus. To estimate $P(\text{learning} \mid \text{machine})$, count occurrences of "machine learning" and divide by occurrences of "machine" followed by anything.
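A minimal sketch of that counting estimate on a toy corpus (the corpus and helper name are made up for illustration):

```python
from collections import Counter

def bigram_probability(tokens, prev, nxt):
    """Count-based estimate of P(nxt | prev): count(prev nxt) / count(prev followed by anything)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])      # "prev" followed by anything
    if context_counts[prev] == 0:
        return 0.0                             # unseen context: no estimate at all
    return bigram_counts[(prev, nxt)] / context_counts[prev]

corpus = "machine learning needs machine readable data".split()
print(bigram_probability(corpus, "machine", "learning"))  # 1 / 2 = 0.5
```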
Four structural problems doom this approach:
Zero-probability collapse. Any token sequence not seen in training has count zero. A single unseen bigram makes the entire sentence probability zero, even when the sentence is perfectly natural. Smoothing partially fixes this with formulas like

$$\hat{P}(x' \mid x) = \frac{n(x, x') + \epsilon \, \hat{P}(x')}{n(x) + \epsilon}$$

When the pair $(x, x')$ appears often, the count $n(x, x')$ dominates. When it is rare, the unigram estimate $\hat{P}(x')$ takes over. The principle — when high-order statistics are unreliable, fall back to lower-order ones — is conceptually clean but mechanical.
Storage explodes. A trigram model needs counts for every triple of words. With a vocabulary of 50,000 tokens, the trigram space has $50{,}000^3 = 1.25 \times 10^{14}$ entries. Even sparse storage cannot keep up at scale.
No semantic understanding. "King" and "monarch" are nearly synonymous, but a count-based model treats them as completely separate symbols. Replacing one with the other in a sentence yields different counts and different probabilities, even though the meanings match.
Long-range dependencies vanish. A bigram or trigram model cannot see further back than two or three tokens. Sentences whose meaning hinges on context 30 tokens earlier are completely beyond reach.
These three walls — combinatorial storage, lack of semantics, and short context — are exactly what neural sequence models were built to break through. Embedding layers compress vocabularies into dense semantic vectors. Recurrent or attentional architectures carry context across long spans. Parameters scale with model size, not with the number of unique $n$-grams.
How text becomes training tensors
Knowing the modeling goal is half the battle. The other half is structuring the data. Long documents are sliced into fixed-length subsequences:
```
Source: "the time machine by h g wells"

Sliding subsequences of length 5:
  "the t" → input "t h e _ t", target predicts "i"
  "he ti" → input "h e _ t i", target predicts "m"
  ...
```

Where to start cutting matters. Always starting from index 0 produces a fixed set of sub-windows. Random sampling picks a random offset, then shuffles the resulting subsequences and groups them into mini-batches. Subsequences in a batch are no longer adjacent in the original text, which improves sample independence. Sequential partitioning picks one random offset, then takes subsequences in order — useful when the model needs to maintain hidden state between batches.
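A sketch of the random-sampling variant, assuming `corpus` is a plain list of token ids; the random offset, the shuffle, and the shifted labels mirror the description above:

```python
import random
import torch

def random_partition(corpus, batch_size, num_steps):
    """Yield (X, Y) minibatches of shuffled, non-adjacent subsequences.

    Assumes `corpus` is a plain list of token ids.
    """
    offset = random.randint(0, num_steps - 1)        # random starting cut
    corpus = corpus[offset:]
    num_subseqs = (len(corpus) - 1) // num_steps     # -1 leaves room for shifted labels
    starts = [i * num_steps for i in range(num_subseqs)]
    random.shuffle(starts)                           # batch members are not adjacent
    for i in range(0, len(starts) - batch_size + 1, batch_size):
        batch = starts[i: i + batch_size]
        X = [corpus[s: s + num_steps] for s in batch]           # inputs
        Y = [corpus[s + 1: s + 1 + num_steps] for s in batch]   # targets: shifted by one
        yield torch.tensor(X), torch.tensor(Y)
```

Replacing the shuffle with an in-order walk through `starts` gives the sequential-partitioning variant.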
Each subsequence becomes a tensor with three dimensions:
num_steps (sequence length): how many tokens per sample.
batch_size: how many samples processed in parallel.
feature_dim: how each token is represented (vocab size for one-hot, or embedding dimension for embedded tokens).
The training tensor shape is therefore (batch_size, num_steps, feature_dim). For a Fashion-MNIST-style example with 32 samples, 4 tokens per sample, and 256-dimensional embeddings, this is (32, 4, 256). Labels are (batch_size, num_steps) because the language modeling task asks for a next-token prediction at every position, not just at the end.
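A quick shape check with illustrative sizes (the 5,000-token vocabulary is a placeholder; 32, 4, and 256 follow the example above):

```python
import torch
import torch.nn.functional as F

batch_size, num_steps = 32, 4
vocab_size, embed_dim = 5000, 256                               # illustrative sizes
tokens = torch.randint(0, vocab_size, (batch_size, num_steps))  # token ids

one_hot = F.one_hot(tokens, vocab_size).float()                 # feature_dim = vocab size
embedded = torch.nn.Embedding(vocab_size, embed_dim)(tokens)    # feature_dim = embedding dim

print(one_hot.shape, embedded.shape)
# torch.Size([32, 4, 5000]) torch.Size([32, 4, 256])
```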
The main takeaway
Sequence modeling is what happens when the i.i.d. assumption breaks. The Markov approximation handles bounded-context tasks but throws away anything beyond $\tau$ steps. Hidden states recover unbounded context by running a learned summary forward through time, which is exactly what an RNN does.
Language modeling formalizes the goal: assign probabilities to sequences. The chain rule decomposes this into next-token prediction, which is also how generation works at inference time. Counting fails on language because of zero-probability collapse, exponential storage, lack of semantics, and short context windows — all four problems that neural sequence models, starting with RNNs, were designed to solve.