Chain-of-Thought & Reasoning Models
Letting models think before answering
- Explain chain-of-thought prompting, both few-shot and zero-shot, and when it actually helps
- Distinguish prompt-based CoT from RL-trained reasoning models and name the major families
- Explain test-time compute and why allocating more inference budget can raise accuracy
- Apply self-consistency and reason about where its gains plateau
- Decide when a reasoning model is worth the latency and cost — and when it is overkill
Asking a model to "show its work" before answering can turn a confident wrong guess into a correct, traceable solution. This lesson covers chain-of-thought prompting and its limits, then the modern reasoning models (OpenAI o-series, DeepSeek-R1, Claude extended thinking, Gemini thinking) that bake step-by-step reasoning in via reinforcement learning — plus the cost, latency, and self-consistency trade-offs you need to choose wisely in 2026.
1. Think before you speak
2. Few-shot, zero-shot, and the size threshold
3. From prompting a chain to training one
4. Test-time compute: buying accuracy at inference
5. Self-consistency: vote across many chains
6. When reasoning is worth it, and when it isn't
Think before you speak
Imagine being forced to shout the answer to a tricky word problem the instant you hear it — no pen, no paper, no muttering it through. You would get a lot of them wrong, not because you don't know the math, but because you had nowhere to work it out. A language model is in exactly that position. It generates one token at a time, left to right, with no scratchpad. If you demand the final answer immediately, all of its "reasoning" has to happen invisibly inside a single forward pass.
Chain-of-thought (CoT) prompting hands the model a scratchpad. Instead of jumping to the answer, you invite it to write the intermediate steps as text first. Here is the key mechanism: those generated steps become part of the context the model reads as it keeps going, so each later token is conditioned on the reasoning it just wrote. The model literally uses its own output as working memory.
The technique was introduced by Wei et al. (2022) at Google Brain. Their key result: prompting a 540B-parameter PaLM model with just 8 worked examples showing step-by-step solutions set a new state of the art on GSM8K grade-school math — far beyond answering directly. The reasoning was not new knowledge; it was a better path to knowledge the model already had.
Key insight
Why writing steps helps a token predictor
Each generated token is extra computation and extra context. By emitting reasoning first, the model spreads a hard problem across many forward passes instead of cramming it into one, and conditions its final answer on explicit intermediate conclusions rather than a hidden guess.
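The mechanism can be sketched with a toy decoding loop. This is a minimal illustration, not any real model's API: `next_token` is a hypothetical stand-in for one forward pass, and the scripted `toy_model` exists only to show that each call sees every token generated before it.

```python
from typing import Callable, List, Optional

# Hypothetical stand-in for one forward pass: it reads the full context
# (prompt plus everything generated so far) and returns the next token,
# or None to stop.
NextToken = Callable[[List[str]], Optional[str]]


def generate(next_token: NextToken, prompt: List[str], max_new_tokens: int) -> List[str]:
    context = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(context)  # one forward pass per generated token
        if token is None:
            break
        context.append(token)  # the new token becomes input to the next pass
    return context[len(prompt):]


def toy_model(context: List[str]) -> Optional[str]:
    # Scripted toy: it can only halve a number already written in the
    # context, so the final answer depends on the intermediate step.
    numbers = [int(t) for t in context if t.isdigit()]
    if len(numbers) == 1:
        return str(numbers[0] // 2)  # intermediate step: 16 -> 8
    if len(numbers) == 2:
        return str(numbers[1] // 2)  # final answer built on that step: 8 -> 4
    return None


steps = generate(toy_model, ["16"], max_new_tokens=5)
print(steps)
```

The second token is computed from the first generated token rather than from the prompt alone, which is the sense in which written-out steps act as working memory.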
Few-shot, zero-shot, and the size threshold
There are two ways to ask for a chain of thought, and they differ only in how much you show the model.
- Few-shot CoT (Wei et al., 2022): you include a handful of exemplars that show the reasoning — a question, then a worked-out solution, then the answer. The model imitates that format on your new question.
- Zero-shot CoT (Kojima et al., 2022): you add no examples at all — just the instruction "Let's think step by step." Astonishingly, that single phrase alone unlocks much of the same benefit, because it cues the model toward the reasoning style it saw constantly in pretraining.
But there is a crucial catch. In the original experiments (Wei et al., 2022), CoT behaved as an emergent ability: it reliably helped only in large models (roughly 100B+ parameters). In small models, prompting for steps tended to produce fluent but illogical chains that could make accuracy worse than answering directly. The model writes confident, well-formed sentences that do not actually compute anything correct: reasoning-shaped text without the reasoning underneath.
Example
Zero-shot CoT in one line
Prompt: "A juggler has 16 balls; half are golf balls and half of those are blue. How many blue golf balls? Let's think step by step."
The trailing phrase pushes the model to write "16 / 2 = 8 golf balls; 8 / 2 = 4 blue" before answering 4, instead of guessing.
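The two prompt styles from this section can be sketched as plain string builders. The `Q:`/`A:` layout, the "The answer is" suffix, the function names, and the bookshelf exemplar are illustrative assumptions, not a format any particular model requires.

```python
from typing import List, Tuple

ZERO_SHOT_TRIGGER = "Let's think step by step."

# (question, worked reasoning, final answer); a hypothetical exemplar shape.
Exemplar = Tuple[str, str, str]


def zero_shot_cot_prompt(question: str) -> str:
    # No examples: only the trigger phrase after the question.
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def few_shot_cot_prompt(exemplars: List[Exemplar], question: str) -> str:
    # Each exemplar shows question, reasoning, then answer, so the model
    # imitates that format on the new question.
    blocks = [
        f"Q: {q}\nA: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


exemplars = [(
    "A shelf holds 3 boxes with 4 books each. How many books are there?",
    "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
    "12",
)]
juggler = ("A juggler has 16 balls; half are golf balls and half of those "
           "are blue. How many blue golf balls?")

print(zero_shot_cot_prompt(juggler))
print(few_shot_cot_prompt(exemplars, juggler))
```

The only difference between the two builders is how much reasoning the prompt demonstrates before the model starts writing.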
From prompting a chain to training one
So far you, the prompter, have been doing the work — coaxing steps out of a model with the right phrasing. The biggest shift since 2024 is that the best reasoning is no longer prompted at all; it is trained in. Reasoning models are trained with large-scale reinforcement learning to produce extended internal reasoning before they answer. They do not merely follow an instructed template; through trial and reward, they discover strategies — decomposing problems, checking their own work, backtracking from dead ends — the way a student learns to reason by doing thousands of problems, not by being handed a script.
This is fundamentally different from CoT prompting, and the distinction matters in practice. The major families as of 2026:
- OpenAI o-series (o1, o3, o4-mini, o3-pro). o3 and o4-mini shipped April 16, 2025; o3 set a new state of the art on ARC-AGI-1, scoring in the 41–53% range on the production semi-private eval (varying by effort level), with a high-compute preview configuration reaching 87.5%. The chain of thought is hidden from users. Exposes a reasoning_effort control (low/medium/high).
- DeepSeek-R1 (Jan 20, 2025). Open-source (MIT), a 671B MoE plus distilled variants, trained with GRPO reinforcement learning. Matches o1 on math (79.8% AIME 2024) and exposes its reasoning traces.
- Claude extended thinking (Anthropic, from Claude 3.7 Sonnet, Feb 24, 2025). Emits structured thinking blocks; a budget_tokens parameter caps thinking (up to 128K). Newer Claude 4 models deprecate budget_tokens in favor of an effort parameter with adaptive thinking that decides when to think hard.
- Gemini 2.5 (Google, March 2025). Flagship thinking model; ~92% AIME 2024, 63.8% SWE-bench Verified, 1M-token context, thinking fused with tool use.
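To give a feel for "trial and reward", here is a minimal sketch of the group-relative scoring idea behind GRPO: several answers are sampled for one prompt, and each is scored against the group average. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; the actual training loop (policy update, clipping, KL penalty) is omitted entirely.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    # Score each sampled answer relative to its siblings for the same prompt:
    # above-average answers get a positive advantage, below-average negative.
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids divide-by-zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem: 1.0 = correct, 0.0 = wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every sample in a group earns the same reward, all advantages are zero, so the group carries no learning signal under this scheme.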
Watch out
CoT prompting ≠ a reasoning model
These are not the same thing. CoT is a prompt that asks any model to show steps. A reasoning model is trained to reason internally. Telling an o-series or Gemini 2.5 model "think step by step" is mostly redundant: it already does, and you can't steer its reasoning with that phrase the way you would a base model.
Test-time compute: buying accuracy at inference
Here is the simple idea underneath reasoning models: you can make a model smarter on a hard question just by letting it think longer — after training, at the moment you ask. That is test-time compute scaling. Instead of spending all the compute during training and then answering in a fixed-size burst, these models spend more compute at inference — generating longer reasoning chains, verifying intermediate results, and backtracking. More thinking budget generally buys more accuracy, up to a point of diminishing returns.
The useful analogy is Kahneman's two systems:
- System 1 — fast, automatic pattern-matching. A standard model answering directly.
- System 2 — slow, deliberate, effortful reasoning. A reasoning model working through a long chain.
For a trivia question, System 1 is fine and System 2 is wasteful. For a competition math problem or a tricky multi-file code change, deliberate System 2 thinking is where the accuracy gains live. This is why you now tune the budget: OpenAI's reasoning_effort, Anthropic's budget_tokens, and Gemini's thinking controls all let you dial how much the model deliberates per request — trading latency and cost for correctness on demand.
Tip
Match the budget to the problem
Start with low or medium effort. Raise the thinking budget only on the hard slice of your traffic where you can measure an accuracy lift. A high budget on easy requests just burns tokens and seconds.
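One way to apply this tip is a small router that picks reasoning settings from an estimated difficulty. This is a sketch under stated assumptions: the difficulty labels, the token budgets, and the function name are hypothetical, and how the resulting values are passed to a provider (for example as reasoning_effort or budget_tokens) should be checked against that provider's current documentation.

```python
from typing import Dict, Union

# Hypothetical difficulty labels mapped to illustrative settings.
EFFORT_BY_DIFFICULTY: Dict[str, str] = {
    "easy": "low",
    "medium": "medium",
    "hard": "high",
}
BUDGET_BY_DIFFICULTY: Dict[str, int] = {
    "easy": 1024,
    "medium": 4096,
    "hard": 16384,
}


def reasoning_settings(difficulty: str) -> Dict[str, Union[str, int]]:
    # Unknown difficulty falls back to the cheapest setting: start low and
    # raise the budget only where a measured accuracy lift justifies it.
    if difficulty not in EFFORT_BY_DIFFICULTY:
        difficulty = "easy"
    return {
        "reasoning_effort": EFFORT_BY_DIFFICULTY[difficulty],
        "budget_tokens": BUDGET_BY_DIFFICULTY[difficulty],
    }


print(reasoning_settings("hard"))
print(reasoning_settings("unlabelled"))
```

In practice the difficulty label would come from your own signal, such as task type or a cheap classifier, and the mapping would be tuned against measured accuracy on your traffic.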
Self-consistency: vote across many chains
If one student's reasoning can go astray, ask ten students and trust the answer most of them agree on. That is the whole intuition behind self-consistency (Wang et al., 2022), the strongest purely prompt-based enhancement to CoT. Instead of trusting one reasoning chain, you sample several with temperature > 0, then take the majority vote over their final answers. Different chains may reach the same correct answer by different routes, while wrong answers tend to scatter — so the mode is usually right.
    from collections import Counter

    def self_consistent_answer(model, prompt, n=10, temperature=0.7):
        # `model.generate` and `extract_final_answer` are placeholders for your
        # own sampling call and answer parser.
        answers = []
        for _ in range(n):
            # Sample one reasoning chain; temperature > 0 makes chains differ.
            chain = model.generate(prompt + "\nLet's think step by step.",
                                   temperature=temperature)
            answers.append(extract_final_answer(chain))
        # Majority vote over the final answers.
        return Counter(answers).most_common(1)[0][0]

It reliably improves accuracy, but gains plateau after roughly 20-40 samples: beyond that, sampled chains start overlapping and you pay N times the cost for little lift. Self-consistency works with any CoT-capable model and even stacks on top of reasoning models, but it multiplies token cost, so reserve it for high-value, hard problems where a few extra points of accuracy justify the spend.
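The diminishing returns can be illustrated with a deliberately simplified model. Assume each chain is independently correct with probability `p` and that answers are simply right or wrong; both are assumptions that real sampled chains violate (they are correlated, which tends to flatten the curve sooner). Under those assumptions, the chance that a strict majority is correct is a binomial tail sum.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent chains is correct.

    Toy model: binary right/wrong outcomes, per-chain accuracy p, odd n.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("use a positive odd n so the vote cannot tie")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Accuracy of the vote as the number of sampled chains grows.
for n in (1, 5, 11, 21, 41):
    print(n, round(majority_vote_accuracy(0.7, n), 3))
```

Going from 1 to 5 samples buys far more than going from 21 to 41, even though the second step costs five times as many extra chains.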
When reasoning is worth it — and when it isn't
Thinking longer is genuinely useful, but it is never free — and the bill comes due in two currencies: time and tokens. Before you reach for a reasoning model, weigh the concrete trade-offs:
- Latency. Reasoning chains take seconds to tens of seconds, versus milliseconds for a direct answer.
- Cost. Thinking tokens are billed (as output tokens on most APIs), so a long internal chain can dominate the bill even when the visible answer is short.
- Overkill risk. For factual lookups, classification, short creative writing, and most retrieval tasks, a reasoning model adds cost and delay with no quality gain.
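The cost point can be made concrete with a little arithmetic. The prices and token counts below are made-up placeholders, not any provider's rates; the sketch only assumes, as stated above, that thinking tokens are billed at the output-token price.

```python
def request_cost_usd(
    input_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    # Assumption: thinking tokens are billed as output tokens.
    output_tokens = thinking_tokens + answer_tokens
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Hypothetical prices: $1 per million input tokens, $4 per million output tokens.
direct = request_cost_usd(500, 0, 200, 1.0, 4.0)
reasoning = request_cost_usd(500, 8000, 200, 1.0, 4.0)
print(f"direct:    ${direct:.4f}")
print(f"reasoning: ${reasoning:.4f}")
```

With these placeholder numbers the visible answer is the same 200 tokens in both cases, yet the request with 8,000 thinking tokens costs roughly 25 times as much.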
Reasoning models earn their keep on complex multi-step math, code, scientific reasoning, and planning — tasks where the steps aren't known up front and a wrong intermediate step quietly poisons the answer.
And here is the 2026 reality check: a Wharton Generative AI Labs study (2025) found that adding explicit CoT prompting to reasoning models like o3-mini and o4-mini yielded only 2.9-3.1% accuracy gains while increasing latency 20-80%. For non-reasoning models, CoT prompting helped more (4.4-13.5%), but some models regressed (e.g., a 17.2% drop on certain tasks). The lesson: pick a reasoning model when the task is genuinely hard; do not reflexively bolt "think step by step" onto a model that already thinks.
Note
Who shows their thinking?
OpenAI hides the o-series chain of thought for safety/policy reasons; you see only the answer. DeepSeek-R1 and Anthropic's thinking blocks expose reasoning to API users — though Claude 4+ defaults to summarized (and newest Opus to omitted) thinking for latency, with full traces available via explicit config.
Try it: Measure the value of thinking
Pick 10 hard problems with known answers (e.g. GSM8K-style word problems or small logic puzzles). Run each through one model in three modes: (1) direct answer, (2) zero-shot CoT by appending 'Let's think step by step.', and (3) self-consistency — sample 5-10 CoT chains and majority-vote. Record accuracy, average latency, and approximate token cost for each mode. Then, if you have access, run the same set on a reasoning model (e.g. o4-mini, DeepSeek-R1, or Claude with extended thinking) at a low and a high thinking budget. Write a short table comparing accuracy vs cost/latency. You should see CoT and self-consistency help the base model, self-consistency's gains start to flatten, and explicit CoT add little to the reasoning model — the exact trade-offs this lesson describes.
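A minimal harness for this exercise might look like the sketch below. Everything here is scaffolding you would adapt: `evaluate`, the `Problem` shape, and the stub solver are hypothetical, exact string matching is a simplification, and token cost is not tracked because it depends on what your client reports.

```python
import time
from typing import Callable, Dict, List, Tuple

Problem = Tuple[str, str]  # (question, expected final answer)


def evaluate(solve: Callable[[str], str], problems: List[Problem]) -> Dict[str, float]:
    # Run one mode (direct, zero-shot CoT, self-consistency, ...) over all
    # problems and report accuracy plus average wall-clock latency.
    if not problems:
        raise ValueError("need at least one problem")
    correct = 0
    start = time.perf_counter()
    for question, expected in problems:
        if solve(question).strip() == expected:
            correct += 1
    elapsed = time.perf_counter() - start
    return {
        "accuracy": correct / len(problems),
        "avg_latency_s": elapsed / len(problems),
    }


# Stub solver standing in for a real model call; it gets one answer wrong.
canned = {"2 + 2": "4", "3 * 3": "6"}
problems = [("2 + 2", "4"), ("3 * 3", "9")]
result = evaluate(lambda q: canned[q], problems)
print(result)
```

Swap the lambda for one function per mode, each wrapping your own model call, and collect the returned dictionaries into the comparison table.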
Key takeaways
1. Chain-of-thought prompting works because generated reasoning becomes context the model conditions on, turning a single forward pass into many.
2. CoT is an emergent ability: it helps large models (~100B+) but can hurt small ones by producing fluent-but-illogical chains.
3. Reasoning models (o-series, DeepSeek-R1, Claude extended thinking, Gemini thinking) are RL-trained to reason internally, which is fundamentally different from prompting a model to show steps.
4. Test-time compute scaling means more inference-time thinking buys accuracy up to a point; tune the budget with controls like reasoning_effort and budget_tokens.
5. Self-consistency (sample many chains, majority-vote) is the strongest prompt-based boost, but its gains plateau around 20-40 samples. Reasoning models, for their part, are overkill for simple tasks.
Quiz
Lock in what you learned
Check your understanding
1. Why does chain-of-thought prompting improve a model's performance on hard problems?
2. What is the key difference between CoT prompting and a reasoning model like OpenAI o3 or DeepSeek-R1?
3. How does self-consistency improve over a single chain of thought, and what is its main limitation?
4. Based on current evidence, when is a reasoning model most clearly the WRONG choice?
Go deeper
Hand-picked sources to keep learning
- Wei et al. (2022): the original CoT paper from Google Brain, covering few-shot exemplars, the ~100B size threshold, and GSM8K SOTA with 540B PaLM.
- Kojima et al. (2022): introduces zero-shot CoT with the phrase "Let's think step by step".
- Wang et al. (2022): the sample-and-marginalize approach to self-consistency, which samples diverse chains and majority-votes the answer.
- DeepSeek-R1 technical report: open-source reasoning model, GRPO training, distilled variants, benchmarks matching o1.
- Anthropic extended thinking documentation: official docs on thinking blocks, budget_tokens, adaptive vs manual thinking, and display modes as of 2026.
- Wharton Generative AI Labs study (2025): empirical study quantifying CoT accuracy gains vs latency cost across standard and reasoning models.