I was watching a podcast the other day, and the topic of Markov chains came up in the context of deterministic vs. non-deterministic behavior. By the end of the episode, my mind had drifted to modern GPTs. Models like GPT are often described as “predicting the next word,” which sounds a lot like what a Markov chain does. But the similarity ends at the surface. Underneath, the two systems behave in fundamentally different ways, especially when it comes to predictability and determinism.
This post outlines the journey I took to understand why the two are so different, even though they feel so similar.
Both systems predict the next token, but they do it with fundamentally different architectures and guarantees. Markov chains use explicit, discrete transitions; GPTs compute a continuous, context‑dependent distribution using attention and deep layers.
How Markov Chains Think: Fixed Paths, Fixed Futures
A Markov chain is built on a simple idea: the next state depends only on the current state.
This creates a world where:
- The system has discrete states (like words or events).
- Each state has a fixed set of outgoing transitions.
- The probabilities of those transitions are explicitly stored.
- The structure of the system is fully deterministic, even if the outcome is probabilistic.
[ Current State: token t ]
│
▼
[ Transition Table: P(next | token t) ]
│
▼
[ Sample next token t+1 ]
Think of it as a railway network:
- Every station has a fixed set of tracks.
- You can roll dice to choose which track to take.
- But the tracks themselves never change.
Even when randomness is involved, the space of possible outcomes is fixed and predictable.
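A minimal sketch makes the "fixed tracks" idea concrete. The corpus and names below are illustrative, not from any real dataset:

```python
import random
from collections import defaultdict

# Toy corpus; any text works. Purely illustrative.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Build the explicit transition table: outgoing edges for each state,
# with probabilities implied by repetition counts.
transitions = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    transitions[current].append(nxt)

def next_token(current, rng=random):
    """Sample the next token from the stored outgoing edges of `current`."""
    return rng.choice(transitions[current])

# The space of futures is fixed and enumerable: only tokens ever seen
# after "the" can follow "the".
print(sorted(set(transitions["the"])))  # ['cat', 'fish', 'mat']
```

However many times you sample, no new track ever appears; randomness only chooses among the stored edges.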
How GPTs Think: Continuous Context, Infinite Possibilities
GPTs also predict the next token—but the mechanism is radically different.
Instead of fixed states and fixed transitions, GPTs operate in a continuous, high‑dimensional space. The “state” of the model at any moment is not a token—it’s a context vector built from:
- all previous tokens
- their relationships
- their semantic meaning
- the model’s learned internal representations
This means:
- There are no fixed outgoing edges.
- The model recomputes the next-token distribution from scratch each time.
- The same input token can lead to entirely different outputs depending on context.
- With sampling enabled, the system’s behavior is non-deterministic by design.
[ Input tokens: t1, t2, ..., tn ]
│
▼
[ Embedding layer → token vectors ]
│
▼
[ Transformer blocks (self‑attention + FFN) ]
• each block: attend to all previous tokens
• build contextual hidden state h
│
▼
[ Output projection → logits over vocab ]
│
▼
[ Softmax → P(next token | full context) ]
│
▼
[ Sample or pick next token ]
It’s less like a railway network and more like a GPS:
- It recalculates the route every time.
- The route depends on the entire environment.
- Two identical starting points can lead to different paths depending on context.
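The pipeline above can be sketched as a toy, single-head attention step in NumPy. The weights here are random stand-ins for learned parameters, and real GPTs stack many blocks with feed-forward layers and normalization, so this illustrates only the data flow: the distribution is recomputed from the whole context, not looked up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 10, 8

# Randomly initialised weights stand in for learned parameters (illustrative only).
E = rng.normal(size=(vocab, d))          # embedding layer
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wout = rng.normal(size=(d, vocab))       # output projection to logits

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def next_token_dist(tokens):
    """Recompute P(next token | full context) from scratch for a sequence."""
    x = E[tokens]                            # embed tokens into vectors
    q, k, v = x @ Wq, x @ Wk, x @ Wv         # self-attention projections
    att = softmax(q @ k.T / np.sqrt(d))      # each position attends to the others
    h = att @ v                              # contextual hidden states
    return softmax(h[-1] @ Wout)             # distribution at the last position

# Same final token (7), different context -> different distributions.
p1 = next_token_dist([1, 2, 7])
p2 = next_token_dist([5, 5, 7])
print(np.allclose(p1, p2))  # False
```

The last line is the whole point: in a Markov chain, the row for token 7 would be identical in both cases.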
A side-by-side comparison
| Feature | Markov Chains | GPT/Transformer |
|---|---|---|
| State | Discrete token | Continuous context vector |
| Memory | Local (last state or few tokens) | Global within context window |
| Transitions | Stored, fixed probabilities | Computed dynamically via attention |
| Predictability | Structure predictable; outcomes probabilistic but bounded | Outputs can vary widely for the same token, depending on context |
| Expressiveness | Limited to local patterns | Captures long‑range dependencies and abstractions |
| Typical use | Simple text generators, stochastic processes | Large‑scale language generation, reasoning, code, dialogue |
Predictability, determinism, and creativity — a nuanced view
- Markov chains trend toward predictability. Given the transition graph and enough time, behavior converges to stationary patterns; the space of possible futures is fixed and enumerable.
- GPTs produce controlled non‑determinism. The same prompt can yield different outputs because the model samples from a context‑shaped distribution; the model’s internal representation encodes nuanced semantic and syntactic signals rather than fixed outgoing edges. In practice, Markov/n‑gram generators tend to collapse quickly into local loops, while transformer models sustain global coherence and richer continuations.
Important distinction: non‑determinism in GPTs is not random noise — it’s structured variability. Sampling temperature, top‑k/top‑p, and the model’s learned logits shape how creative or conservative the output is.
Markov: [state] --(stored probs)--> sample --> next state
Transformer:
[context] --(compute logits via attention & FFN)--> softmax --> sample --> next token
- In Markov models, randomness is applied to a fixed distribution.
- In transformers, randomness is applied to a computed distribution that depends on the entire context.
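A minimal sketch of how temperature and top‑k reshape a computed logit vector before sampling (the logits below are made-up numbers for illustration):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, rng=np.random.default_rng()):
    """Turn a computed logit vector into a sampling distribution.

    temperature < 1 sharpens the distribution (more conservative),
    temperature > 1 flattens it (more creative); top_k keeps only the
    k most likely tokens before sampling.
    """
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]                 # k-th largest logit
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1, -1.0]

# Very low temperature is effectively greedy: it always picks the argmax token.
print(sample_next(logits, temperature=1e-3))  # 0
```

The same knobs have no analogue in a plain Markov chain: there, the distribution is stored and fixed, so there is nothing to reshape at generation time.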
In conclusion
Markov chains store transitions. GPTs compute them.
That single architectural shift—from discrete stored edges to continuous, context‑driven computation—turns a predictable stochastic system into a generator capable of controlled creativity and long‑range coherence.
Until we meet next time,
Be courteous. Drive responsibly.





