How LLMs Actually Work
Tokens, transformers, and next-token prediction — the intuitive version
- Explain how raw text becomes tokens, then embeddings the model can compute on
- Describe attention and the transformer at a conceptual level, without the math
- Trace generation as repeated next-token prediction and explain what temperature actually does
- Distinguish the three training stages — pretraining, SFT, and alignment — and what each one buys
- Connect those mechanics to why models hallucinate, follow instructions, and react to small prompt changes
Underneath every agent is a model that does exactly one thing: predict the next token. This lesson builds the correct mental model — tokens, embeddings, attention, and autoregressive generation — and connects three training stages to the behaviors you'll fight or exploit when building agents: hallucination, prompt sensitivity, and instruction-following.
- 1. One trick, repeated
- 2. Tokens: the model's alphabet
- 3. Embeddings: meaning as geometry
- 4. Attention and the transformer
- 5. Generation: sampling the next token
- 6. Three stages that shape behavior
- 7. Why this explains agent behavior
One trick, repeated
Start with the punchline, because it is almost insultingly simple: a large language model is a machine that reads some text and guesses the next word. That's the whole engine. Writing code, planning a trip, reasoning through a proof — all of it is that one guess, made over and over.
A little more precisely: the model takes a sequence of text and outputs a probability distribution over what token comes next (a token is roughly a word-piece; more on that next). When you send it "The capital of France is", it doesn't look up an answer. It scores every token in its vocabulary by how likely it is to come next — "Paris" scores high, "banana" scores near zero — then picks one, appends it to the text, and runs the entire computation again on the now-longer sequence. Token by token, a coherent answer assembles itself.
This is autoregressive generation ("autoregressive" just means each new token is predicted from all the tokens so far), and internalizing it changes how you think about agents. The model has no hidden memory, no database it queries, no plan it secretly holds. At each step it sees only the text in front of it and predicts a plausible continuation. An agent is what you get when you wrap that loop with tools and feed the results back in — but the engine inside is always just next-token prediction.
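To make the loop concrete, here is a minimal sketch of autoregressive generation. The "model" is a hypothetical hand-written lookup table that only looks at the last token, which is a drastic simplification: a real LLM conditions on the whole sequence and scores a vocabulary of many thousands of tokens. The names `TOY_MODEL`, `next_token_distribution`, and `generate` are made up for illustration.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
# A real LLM computes this distribution from the entire sequence so far.
TOY_MODEL = {
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"France": 1.0},
    "France": {"is": 1.0},
    "is": {"Paris": 0.9, "banana": 0.1},
    "Paris": {"<end>": 1.0},
    "banana": {"<end>": 1.0},
}


def next_token_distribution(tokens):
    """Return a probability distribution over the next token."""
    return TOY_MODEL[tokens[-1]]


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Autoregressive loop: predict, pick, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        candidates = list(dist.keys())
        weights = list(dist.values())
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        if chosen == "<end>":
            break  # stop token ends generation
        tokens.append(chosen)  # the longer sequence becomes the next input
    return tokens


print(generate(["The", "capital", "of"]))
```

Notice that nothing is carried between iterations except the growing token list. That is the sense in which the model has no hidden memory: everything it knows about the conversation is in the sequence it is handed.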
Tokens: the model's alphabet
Before a model can do anything with your text, it has to chop it into pieces it recognizes. Those pieces are tokens — and they aren't words. A token is a chunk of text (often a whole common word, sometimes a fragment of a rare one) that the model knows by an integer ID drawn from a fixed vocabulary. A tokenizer does the chopping first, and from then on the model only ever sees the IDs, never the letters.
The dominant algorithm is Byte Pair Encoding (BPE), used by GPT-4, Llama, Claude, and most modern models. BPE learns subword units from data: frequent words collapse into a single token, while rare or long words get split into pieces. As a rule of thumb, one token is about 4 characters, or 0.75 of an English word.
import tiktoken
enc = tiktoken.get_encoding("cl100k_base") # GPT-4 / GPT-3.5; GPT-4o uses o200k_base
print(len(enc.encode("agentic"))) # -> 2 tokens
print(len(enc.encode(" the"))) # -> 1 token
print(len(enc.encode("antidisestablishmentarianism"))) # -> 6 tokens
This matters concretely. Token counts, not character counts, drive cost, latency, and context limits. Code, JSON, and non-English text tokenize far less efficiently, so they burn your budget faster. And because the model reasons over subword chunks, character-level tasks ("how many r's in strawberry?") are genuinely awkward for it: it never saw the individual letters, only the chunks they came packaged in.
Watch out
Tokens are not words
A common, costly mistake is treating tokens as words when estimating prompt size or cost. Common words are one token, but rare words, names, code, and other languages can be 3–5 tokens each. Always count with the actual tokenizer (tiktoken, HuggingFace tokenizers) — never eyeball it.
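If you want to see where subword tokens come from, the core of BPE training fits in a few lines. The sketch below is a toy version, assuming a character-level starting alphabet and a tiny word list; production tokenizers work on bytes, handle whitespace and pre-tokenization rules, and learn tens of thousands of merges. The function name `learn_bpe_merges` and the sample words are illustrative, not taken from any library.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn BPE merges from a list of words (toy, character-level start)."""
    # Each distinct word starts as a tuple of single characters, with a count.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the most frequent pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges, corpus = learn_bpe_merges(words, num_merges=4)
print(merges)
print(dict(corpus))
```

After a handful of merges the frequent word "low" has collapsed into a single symbol, while the rarer "lower" is still split into pieces. That is the same behavior described above, just at toy scale.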
Embeddings: meaning as geometry
A token ID like 4351 is just a name tag — the number itself means nothing to a neural network. So the model's first real move is to translate each ID into an embedding: a long list of numbers (a vector of hundreds or thousands of values) that positions the token in space. The trick is that this space is arranged so distance encodes meaning — tokens with similar meanings sit close together, and the layout of the space is the model's notion of semantics.
The classic intuition: in a good embedding space, the step from "king" to "queen" points in roughly the same direction as the step from "man" to "woman". Relationships turn into consistent vector arithmetic.
Here's the subtlety that matters for builders. Early embeddings like Word2Vec were static — "bank" got one fixed vector whether you meant a riverbank or a vault. Transformers produce contextual embeddings: as a token flows through the network, its vector is reshaped by the surrounding tokens, so "bank" near "river" and "bank" near "loan" end up in different places. This context-dependence is the source of the model's flexibility — and it's also the foundation of retrieval and RAG, where you embed text into vectors and search by geometric proximity. Same machinery, different use.
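The "meaning as geometry" idea is easy to try with toy numbers. The sketch below uses hand-made 3-dimensional vectors, chosen purely for illustration; real embeddings are learned from data and have hundreds or thousands of dimensions. It measures closeness with cosine similarity, a common choice for embedding search, and `EMBEDDINGS`, `cosine_similarity`, and `nearest` are hypothetical names.

```python
import math

# Hand-made toy vectors (hypothetical values, not learned embeddings).
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.8, 0.1],
    "woman": [0.1, 0.1, 0.8],
    "banana": [0.0, 0.3, 0.3],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Return the vocabulary word whose embedding is closest to `vector`."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))


# "king" - "man" + "woman": take the king vector and swap the man/woman step.
analogy = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(analogy, exclude=("king", "man", "woman")))
```

With these toy values the nearest remaining word is "queen". Retrieval systems run the same kind of nearest-neighbor search, only over embeddings of whole passages instead of single words.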
Attention and the transformer
So a token's meaning depends on its neighbors — but which neighbors, and how much? That's the job of attention, the mechanism at the heart of the transformer architecture (Vaswani et al., 2017) that underpins every major LLM today.
The intuition first: for each token, the model asks "which other tokens should I look at to understand this one?" In "the trophy didn't fit in the suitcase because it was too big," attention is what lets "it" look back and lean heavily on "trophy" rather than "suitcase." Mechanically, each token emits a query (what am I looking for?), a key (what do I offer?), and a value (what I'll contribute if chosen). A token's query is matched against every other token's key to produce attention scores; a softmax turns those scores into weights, and the weighted blend of value vectors becomes the token's updated representation.
Three properties make this powerful:
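- Parallel, not sequential. Unlike older RNNs, which process tokens one at a time, attention scores all token pairs in a single pass, which is why transformers train so efficiently on GPUs. (In decoder-only LLMs a causal mask limits each token to itself and the tokens before it, but the computation is still done in parallel.)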
- Multi-head. The model runs many attention computations side by side, each learning a different kind of relationship (grammar, coreference, topic).
- Position-aware. Attention by itself is order-blind, so position has to be injected separately. Many modern models use Rotary Positional Embeddings (RoPE), which encode relative position and tend to extend to long sequences better than the original sinusoidal scheme.
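The query/key/value mechanics described above can be sketched in plain Python. This is a single attention head under stated assumptions: the query, key, and value vectors are passed in directly (a real transformer derives them from each token's embedding with learned projection matrices), scores are divided by the square root of the vector size as in Vaswani et al., and the `causal` flag reflects the masking used by decoder-only LLMs. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=True):
    """Single-head scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for i, query in enumerate(queries):
        # With a causal mask, token i may only look at positions 0..i.
        visible = range(i + 1) if causal else range(len(keys))
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(dim)
            for j in visible
        ]
        weights = softmax(scores)
        # The updated representation is a weighted blend of value vectors.
        blended = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(blended)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy tokens; the same vectors are reused as queries, keys, and values.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(vectors, vectors, vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one token's attention weights over the tokens it can see. The rows always sum to 1, which is what makes the output a soft blend rather than a hard selection.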
Key insight
Attention is a soft blend, not a database lookup
It's tempting to picture attention as the model "looking up" a fact in memory. It isn't. Attention is a weighted average over value vectors computed from the current context. Nothing is retrieved from a store, and nothing persists between calls — which is exactly why an agent's context window is its only working memory.
Generation: sampling the next token
After all that computation, the model has to actually commit to a word. The final layer produces logits — one raw score per vocabulary token. A softmax turns those scores into a probability distribution, the model picks one token to emit, appends it, and runs the entire forward pass again. Repeat until a stop token or a length limit. That's generation: the loop from the first section, now with the picking step made explicit.
How it picks is where you get control:
- Temperature rescales the logits before softmax. Low temperature (→0) sharpens the distribution toward the single most likely token, making output near-deterministic. High temperature flattens it, lifting rarer tokens and making output more varied.
- Top-k restricts sampling to the k most likely tokens; top-p (nucleus) keeps the smallest set of tokens whose probabilities sum to p. Both prune the long tail of nonsense.
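Here is a minimal sketch of those sampling controls, assuming the model has already produced a list of logits. The helper names are made up for illustration and the logits are toy values. Temperature divides the logits before softmax; top-k and top-p then zero out the tail and renormalize what is left. Temperature 0 would divide by zero, so implementations commonly treat it as "always pick the most likely token"; this sketch simply rejects it.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero every index not in `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose probabilities reach p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


def sample(probs, rng):
    """Draw one token index according to the probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [4.0, 2.0, 1.0, 0.5]  # toy scores for a 4-token vocabulary
cold = softmax_with_temperature(logits, temperature=0.2)
base = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)
print([round(p, 3) for p in cold])
print([round(p, 3) for p in base])
print([round(p, 3) for p in hot])
print(sample(top_p_filter(base, p=0.9), random.Random(0)))
```

Comparing the three printed distributions shows the effect directly: the low-temperature one piles almost all probability on the top token, and the high-temperature one spreads it out.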
Under the hood, inference runs in two phases: prefill processes your whole prompt in parallel (compute-bound), then decode generates one token at a time (memory-bound). A KV cache stores the key/value vectors of past tokens so they aren't recomputed every step — the main reason long outputs stay affordable.
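A rough way to see why the KV cache matters is to count how many key/value vectors get computed over a whole generation. The sketch below is a simplified counting model, not a real inference engine: it counts one key/value computation per token position and ignores layers, heads, and the attention arithmetic itself. The function names are hypothetical.

```python
def kv_computations_without_cache(prompt_len, new_tokens):
    """Every decode step reprocesses the whole sequence so far."""
    total = 0
    for step in range(new_tokens):
        total += prompt_len + step  # sequence length when this token is produced
    return total


def kv_computations_with_cache(prompt_len, new_tokens):
    """Prefill handles the prompt once; each later step adds one new position."""
    if new_tokens == 0:
        return 0
    return prompt_len + (new_tokens - 1)


prompt_len, new_tokens = 1000, 500
print(kv_computations_without_cache(prompt_len, new_tokens))
print(kv_computations_with_cache(prompt_len, new_tokens))
```

Under this counting model, the uncached total grows with the square of the output length while the cached total grows linearly. The trade is memory: the cache has to hold key/value vectors for every past token, which is one reason decode is memory-bound.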
Tip
Temperature is not an intelligence dial
Temperature only reshapes the sampling distribution — it trades diversity for coherence. It does not make the model smarter or more knowledgeable. For agents and tool-calling, use a low temperature (often 0) so behavior is reliable and reproducible; save higher temperatures for brainstorming and creative drafting.
Three stages that shape behavior
A model you can actually chat with isn't trained in one shot — it's built in three stages, and each one explains something you'll see in production. The short version: the first stage teaches it the world, the second teaches it to answer, and the third teaches it to behave.
1. Pretraining. Self-supervised next-token prediction over trillions of tokens from the web, books, and code — no human labels. This is where the model soaks up language, world knowledge, and reasoning patterns. A purely pretrained "base" model is a brilliant autocomplete, but it doesn't follow instructions: ask it a question and it might just continue with more questions.
2. Supervised fine-tuning (SFT). Training on a smaller, carefully curated set of instruction→response pairs. This is where "answer the user's question" behavior is installed. Instruction-following is a learned habit from this stage, not an innate property of the base model.
3. Alignment. Tuning the model toward helpful, harmless, honest behavior using human (or AI) preferences. RLHF trains a reward model from human preference rankings, then optimizes the policy (usually via PPO). DPO skips the separate reward model and optimizes preference pairs directly — cheaper, and now widely used (Llama 3, Mistral). RLAIF swaps human labelers for an AI; Constitutional AI (Anthropic) is RLAIF guided by a written set of principles.
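The objectives behind these stages are compact enough to write down. The sketch below shows two of them for a single example, using plain numbers in place of real model outputs: the next-token cross-entropy used in pretraining (SFT typically uses the same loss, applied to the response tokens), and the DPO loss for one preference pair as given in the DPO paper. The function names, the sample log-probabilities, and `beta=0.1` are illustrative assumptions, not values from any particular training run.

```python
import math


def next_token_loss(prob_of_correct_token):
    """Cross-entropy at one position: -log p(correct next token)."""
    return -math.log(prob_of_correct_token)


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the total log-probability a model assigns to a response:
    `policy_*` from the model being trained, `ref_*` from a frozen reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-x))


# Pretraining: confident correct predictions are cheap, wrong ones are costly.
print(round(next_token_loss(0.9), 3))
print(round(next_token_loss(0.01), 3))

# DPO: loss falls as the policy favors the chosen response more than the reference does.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 3))  # policy identical to reference
print(round(dpo_loss(-10.0, -18.0, -12.0, -15.0), 3))  # policy prefers chosen more
```

Note what is absent from both losses: neither one checks whether the text is true. They reward likely continuations and preferred responses, which matters for the discussion of hallucination that follows.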
Why this explains agent behavior
Here's the payoff: once you see the model as a next-token predictor shaped by those three stages, its most frustrating quirks stop looking random. They're direct, predictable consequences of how it was built — and that's exactly why agents are designed the way they are.
Hallucination is structural, not a bug. The objective rewards plausible continuations, not true ones, and alignment can further reward confident, fluent answers. So when the model lacks a fact, it doesn't stop — it generates something that sounds right. This is systematic, which is exactly why agents add tools and retrieval: to ground generation in real data instead of parametric guesses.
Instruction-following is a learned veneer. It comes from SFT and alignment layered on a next-token predictor. That's why a well-placed instruction works — and why prompt injection is so dangerous: the model can't reliably tell your instructions from instructions hidden in a web page it just read.
Prompt sensitivity is overfitting to training patterns. Small wording changes can swing outputs because fine-tuning teaches the surface patterns of the training distribution, not robust understanding. Hence prompt engineering — and why you test prompts empirically rather than trusting intuition.
The takeaway: an LLM is a probabilistic text engine with brilliant strengths and predictable weaknesses. Agents exist precisely to compensate for those weaknesses with tools, memory, and structure.
Try it: See the model think in tokens
Install tiktoken (pip install tiktoken) and write a short script that (1) encodes a sentence, a rare word, a snippet of JSON, and a line of non-English text, then prints the token count for each. Notice how unevenly tokens are spent. Then (2) call any chat model API on the same prompt several times with temperature=0 and several times with temperature=1.0, and compare the outputs across runs. Write three sentences: what surprised you about token counts, what temperature changed, and one way each observation would affect how you'd build an agent (think cost budgeting and reliability).
Key takeaways
- 1. An LLM does one thing, predict the next token, and generation is that operation run autoregressively in a loop.
- 2. Text becomes tokens (subword IDs via BPE), then contextual embeddings where geometric distance encodes meaning.
- 3. Attention lets every token weight every other token in parallel; the transformer plus positional encoding (now RoPE) makes this work.
- 4. Three stages build the model: pretraining gives knowledge, SFT installs instruction-following, and alignment (RLHF/DPO/RLAIF) shapes behavior.
- 5. Hallucination, prompt sensitivity, and instruction-following all fall out of these mechanics, and motivate why agents add tools, retrieval, and memory.
Quiz
Lock in what you learned
Check your understanding
1. What does a large language model fundamentally compute at each step of generation?
2. Which statement about tokens is correct?
3. What does raising the temperature during sampling actually do?
4. Which training stage primarily installs an LLM's ability to follow instructions?
Go deeper
Hand-picked sources to keep learning
The definitive visual walk-through of attention and the transformer. Still the clearest intuition-builder for core concepts.
The foundational transformer paper. Its architecture still underpins every modern LLM.
Rigorous, accessible breakdown of pretraining, SFT, RLHF, and DPO and how the stages relate.
Clear, current explanation of prefill/decode phases, the KV cache, and sampling strategies.
Hands-on 2025 post that makes tokenization concrete by building it step by step.
Current survey of hallucination types, mechanistic causes, and mitigations with real statistics.