#0435 – Why Markov chains and GPTs feel similar — and why they’re not


I was watching a podcast the other day, and the topic of Markov chains came up in the context of deterministic vs. indeterministic behavior. By the end of the episode, modern GPTs had come to mind. Modern AI models like GPTs often get described as “predicting the next word,” which sounds a lot like what Markov chains do. But the similarity ends at the surface. Underneath, the two systems behave in fundamentally different ways – especially when it comes to predictability and determinism.

This post outlines the journey I took to understand why they are different, even though they feel familiar.

Both systems predict the next token, but they do it with fundamentally different architectures and guarantees. Markov chains use explicit, discrete transitions; GPTs compute a continuous, context‑dependent distribution using attention and deep layers.

How Markov Chains Think: Fixed Paths, Fixed Futures

A Markov chain is built on a simple idea: the next state depends only on the current state.

This creates a world where:

  • The system has discrete states (like words or events).
  • Each state has a fixed set of outgoing transitions.
  • The probabilities of those transitions are explicitly stored.
  • The structure of the system is fully deterministic, even if the outcome is probabilistic.
[ Current State: token t ]
            ↓
[ Transition Table: P(next | token t) ]
            ↓
[ Sample next token t+1 ]

Think of it as a railway network:

  • Every station has a fixed set of tracks.
  • You can roll dice to choose which track to take.
  • But the tracks themselves never change.

Even when randomness is involved, the space of possible outcomes is fixed and predictable.
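The stored-transition idea can be sketched in a few lines of Python. This is a toy, hand-built table (not trained on real text), but it shows the defining property: every state's outgoing edges and probabilities are explicitly stored, and generation is just repeated lookup-and-sample.

```python
import random

# Toy transition table: each state maps to a fixed set of
# outgoing edges with explicitly stored probabilities.
transitions = {
    "the": [("cat", 0.5), ("dog", 0.5)],
    "cat": [("sat", 0.7), ("ran", 0.3)],
    "dog": [("ran", 0.6), ("sat", 0.4)],
    "sat": [("the", 1.0)],
    "ran": [("the", 1.0)],
}

def next_state(state):
    """Sample the next token from the stored distribution."""
    tokens, probs = zip(*transitions[state])
    return random.choices(tokens, weights=probs)[0]

def generate(start, n):
    """Walk the chain n steps from a starting state."""
    out = [start]
    for _ in range(n):
        out.append(next_state(out[-1]))
    return out

print(" ".join(generate("the", 8)))
```

Note that no matter how long you run this, the walk can never leave the railway network: the set of reachable states and edges is fixed at construction time.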

How GPTs Think: Continuous Context, Infinite Possibilities

GPTs also predict the next token—but the mechanism is radically different.

Instead of fixed states and fixed transitions, GPTs operate in a continuous, high‑dimensional space. The “state” of the model at any moment is not a token—it’s a context vector built from:

  • all previous tokens
  • their relationships
  • their semantic meaning
  • the model’s learned internal representations

This means:

  • There are no fixed outgoing edges.
  • The model recomputes the next-token distribution from scratch each time.
  • The same input token can lead to entirely different outputs depending on context.
  • The system’s behavior is non-deterministic by design.
[ Input tokens: t1, t2, ..., tn ]
              ↓
[ Embedding layer → token vectors ]
              ↓
[ Transformer blocks (self‑attention + FFN) ]
  • each block: attend to all previous tokens
  • build contextual hidden state h
              ↓
[ Output projection → logits over vocab ]
              ↓
[ Softmax → P(next token | full context) ]
              ↓
[ Sample or pick next token ]

It’s less like a railway network and more like a GPS:

  • It recalculates the route every time.
  • The route depends on the entire environment.
  • Two identical starting points can lead to different paths depending on context.
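To make the “computed, not stored” point concrete, here is a minimal sketch of a single attention step. Everything is deliberately tiny and the weights (`EMB`, `W_OUT`) are random stand-ins for learned parameters – this is an illustration of the mechanism, not a real model. The key observable: the same final token produces a different next-token distribution when the surrounding context changes, because the distribution is recomputed from the whole context every time.

```python
import math, random

random.seed(0)
VOCAB = ["the", "cat", "dog", "sat", "ran"]
DIM = 4

# Random vectors stand in for learned embeddings and the
# learned output projection.
EMB = {w: [random.gauss(0, 1) for _ in range(DIM)] for w in VOCAB}
W_OUT = [[random.gauss(0, 1) for _ in range(DIM)] for _ in VOCAB]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_dist(context):
    """Compute P(next | full context) with one attention step.

    The last token's embedding attends to every token in the
    context; the attention-weighted mixture is projected to
    logits. Nothing is looked up in a table -- the distribution
    is recomputed from scratch for each context.
    """
    vecs = [EMB[t] for t in context]
    query = vecs[-1]
    attn = softmax([dot(query, v) for v in vecs])        # attention weights
    hidden = [sum(a * v[i] for a, v in zip(attn, vecs))  # weighted mix
              for i in range(DIM)]
    logits = [dot(row, hidden) for row in W_OUT]
    return dict(zip(VOCAB, softmax(logits)))

# Same final token "sat", different contexts -> different distributions.
print(next_token_dist(["the", "cat", "sat"]))
print(next_token_dist(["the", "dog", "ran", "the", "cat", "sat"]))
```

A Markov chain conditioned on "sat" would print the same distribution both times; here the two printed distributions differ, because the whole context feeds the computation.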

A side-by-side comparison

Feature         | Markov Chains                                           | GPT/Transformer
----------------|---------------------------------------------------------|----------------------------------------------------------
State           | Discrete token                                          | Continuous context vector
Memory          | Local (last state or few tokens)                        | Global within context window
Transitions     | Stored, fixed probabilities                             | Computed dynamically via attention
Predictability  | Structure predictable; outcomes probabilistic but bounded | Outputs can vary widely for the same token depending on context
Expressiveness  | Limited to local patterns                               | Captures long‑range dependencies and abstractions
Typical use     | Simple text generators, stochastic processes            | Large‑scale language generation, reasoning, code, dialogue

Predictability, determinism, and creativity — a nuanced view

  • Markov chains trend toward predictability. Given the transition graph and enough time, behavior converges to stationary patterns; the space of possible futures is fixed and enumerable.
  • GPTs produce controlled non‑determinism. The same prompt can yield different outputs because the model samples from a context‑shaped distribution; the model’s internal representation encodes nuanced semantic and syntactic signals rather than fixed outgoing edges. In practice, Markov/n‑gram generators tend to collapse quickly into local loops, while transformer models sustain global coherence and richer continuations.

Important distinction: non‑determinism in GPTs is not random noise — it’s structured variability. Sampling temperature, top‑k/top‑p, and the model’s learned logits shape how creative or conservative the output is.
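The sampling controls mentioned above are easy to demonstrate on a hand-picked logit vector. This sketch implements temperature scaling and top‑k filtering only (top‑p works similarly, keeping the smallest set of tokens whose cumulative probability exceeds p); the logit values are made up for illustration.

```python
import math, random

def sample(logits, temperature=1.0, top_k=None):
    """Sample an index from logits after temperature / top-k shaping."""
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # Keep only the top_k highest logits; mask the rest out.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [s if s >= cutoff else float("-inf") for s in scaled]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical model output

# Low temperature -> conservative: almost always the top logit.
# High temperature -> flatter distribution, more varied picks.
cold = [sample(logits, temperature=0.1) for _ in range(20)]
hot = [sample(logits, temperature=5.0) for _ in range(20)]
print(cold)
print(hot)
```

The “cold” run is nearly all index 0, while the “hot” run wanders across the vocabulary – same computed distribution, different shaping. That knob is what makes GPT variability structured rather than noisy.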

Markov:      [state] --(stored probs)--> sample --> next state
Transformer: [context] --(compute logits via attention & FFN)--> softmax --> sample --> next token
  • In Markov models, randomness is applied to a fixed distribution.
  • In transformers, randomness is applied to a computed distribution that depends on the entire context.

In conclusion

Markov chains store transitions. GPTs compute them.

That single architectural shift—from discrete stored edges to continuous, context‑driven computation—turns a predictable stochastic system into a generator capable of controlled creativity and long‑range coherence.

Until we meet next time,

Be courteous. Drive responsibly.