Tree- & Graph-of-Thought Planning
Searching over many possible reasoning paths
- Explain how Tree-of-Thoughts branches, evaluates, and backtracks over reasoning paths instead of committing to one chain
- Describe how Graph-of-Thoughts generalizes the tree to allow merging and feedback between thoughts
- Choose a search strategy (BFS, DFS, beam) and an evaluation method (value vs. vote) for a given problem
- Judge when the extra compute of structured search is worth it — and when linear CoT or a reasoning model is the better call
- Position ToT/GoT correctly against trained reasoning models like o3 and DeepSeek-R1
Chain-of-thought reasons in a straight line — one guess, no take-backs. Tree-of-Thoughts and Graph-of-Thoughts turn that line into a search: the model generates several candidate steps, scores them, and explores the promising branches while abandoning dead ends. This lesson shows you how that search works, why it can turn a 4% success rate into 74%, what it costs, and how it relates to today's trained reasoning models.
- 1From a single line to a search
- 2How Tree-of-Thoughts actually runs
- 3Why search unlocks capability the model already has
- 4Graph-of-Thoughts: when a tree isn't enough
- 5The cost, and when it's worth paying
- 6ToT/GoT vs. trained reasoning models
From a single line to a search
Chain-of-thought (CoT) asks the model to think step by step, but it writes one chain top to bottom. If the third step is a wrong turn, every step after it inherits the mistake — there is no going back. For a creative essay that's fine. For a puzzle with a hidden solution path, it's fatal: the model commits to a guess before it can tell whether the guess leads anywhere.
Tree-of-Thoughts (ToT) removes the commitment. Instead of one chain, the model explores a tree of partial solutions. At each step it generates several candidate next thoughts, evaluates how promising each one looks, keeps the good branches, and prunes the rest. When a branch dead-ends, it backtracks and tries another. This is just classic state-space search — BFS, DFS, beam — with the language model standing in for both the move generator and the evaluator.
The mental model: CoT is writing your answer in pen; ToT is sketching several options in pencil, crossing out the bad ones, and following the best. The model already contains the knowledge to solve hard problems; ToT gives it a way to search for the path instead of betting on the first one it sees.
How Tree-of-Thoughts actually runs
ToT was introduced by Shunyu Yao and colleagues at Princeton (NeurIPS 2023). A thought is a coherent chunk of text — one intermediate step — and the tree is searched by an external orchestrator (Python), not inside a single prompt. Four design choices define a ToT system:
- Thought generation —
propose(generate dependent next steps sequentially; good for Game of 24) orsample(generate independent thoughts in parallel; good for creative writing). - State evaluation —
value(score each state independently, e.g. 1–10 or sure/maybe/impossible) orvote(have the model compare states and vote for the best). - Search algorithm — BFS keeps the top b states per level (beam search); DFS dives deep and backtracks on failure.
- Limits — branching factor, beam width, and max depth to keep the search finite.
Every generation and evaluation is a separate LLM call; the search logic lives in code.
def tot_bfs(x, gen, evaluate, steps, b=5, k=3):
states = [""] # frontier of partial solutions
for _ in range(steps):
# 1. GENERATE: expand each state into k candidate next thoughts
cands = [s + step for s in states for step in gen(x, s, k)]
# 2. EVALUATE: score each candidate (a separate LLM call)
scored = [(c, evaluate(x, c)) for c in cands]
# 3. SELECT: keep the best b (prune the rest = backtracking)
states = [c for c, _ in sorted(scored, key=lambda t: -t[1])[:b]]
return states[0]Key insight
Generation and evaluation are different calls
A common misread is that ToT runs BFS inside one model call. It doesn't. Generation (propose/sample) and evaluation (value/vote) are distinct LLM calls, and the orchestrator multiplies them. With branching k and beam b over s steps, your call count grows like s × b × k — the product, not the sum. That product is the whole cost story.
Why search unlocks capability the model already has
The headline result: on Game of 24 (combine four numbers with +−×÷ to make 24), GPT-4 with standard CoT prompting solved just 4% of puzzles. The same model wrapped in ToT solved 74% — roughly an 18x jump, with no change to the model's weights.
What changed wasn't knowledge; it was search. Game of 24 has a large combinatorial space and a clear failure signal at each step (a partial result that can't reach 24 is dead). CoT picks one arithmetic path and usually picks wrong. ToT generates several, prunes the impossible ones early, and backtracks — so it reliably finds the narrow path that exists.
That tells you exactly when ToT pays off. The ingredients are: (a) a large search space with hidden solution paths, (b) a strong intermediate evaluation signal — you can tell a good partial state from a bad one — and (c) value in lookahead and backtracking. Mathematical puzzles, constraint satisfaction, crosswords, and planning problems fit. Open-ended writing, summarization, and most everyday tasks do not — their evaluation signal is weak or subjective, so the extra branches just burn tokens.
Graph-of-Thoughts: when a tree isn't enough
A tree has one limitation built in: every thought has exactly one parent. You can branch, but you can never combine two branches. Graph-of-Thoughts (GoT) — Maciej Besta et al., ETH Zurich, AAAI 2024 — drops that constraint. Thoughts become vertices in an arbitrary directed graph, and edges express dependencies, so the orchestrator can do things a tree forbids:
- Aggregate / merge — fuse several partial thoughts into one stronger thought (e.g. merge independently-sorted sub-lists).
- Refine — loop a thought back through the model with feedback to improve it in place.
- Distill — collapse a sprawling subgraph into a compact result.
The classic example is sorting: split the list, sort the pieces in parallel branches, then merge them — a move a tree literally cannot represent. GoT reports a 62% quality improvement over ToT on sorting tasks while cutting cost by over 31%, precisely because merging avoids redundant re-derivation.
The trade-off is engineering complexity. A tree is a clean recursive search; a graph is a workflow you design — decide which thoughts merge, where feedback loops close, and how to avoid cycles. GoT is most worth it when sub-problems are independent and recombinable.
Watch out
Read benchmark claims in context
The 74%-vs-4% (ToT) and 62%/31% (GoT) numbers are real but task-specific — Game of 24 and sorting. They are not general guarantees. On tasks solvable by linear reasoning, CoT is faster, cheaper, and often just as accurate. Always pilot on your task before assuming structured search wins.
The cost, and when it's worth paying
Structured search is expensive by construction. Each node expansion is generation plus evaluation, and depth times branching multiplies fast: a beam of 5 across 3 steps can mean 5–10x more LLM calls than CoT, with matching latency and dollars. You manage this with the levers in your code:
- Beam width / branching — fewer candidates per level.
- Pruning — discard hopeless states immediately (the
impossiblevalue). - Early stopping — quit a branch or the whole search once a solution is verified.
- Cheaper evaluator — score states with a smaller, faster model than the generator.
A decision rule that holds up in practice:
| Use ToT/GoT when… | Use linear CoT (or a reasoning model) when… |
|---|---|
| Hidden solution path, big search space | The path is roughly linear |
| Strong, cheap intermediate evaluation | Evaluation is weak or subjective |
| Backtracking clearly helps | One pass usually succeeds |
| Correctness justifies 5–10x cost | Latency and cost are tight |
If you can't articulate a good evaluation signal for partial solutions, ToT has nothing to steer with — and you're paying 10x for noise. Start with CoT; reach for search only when a measurable evaluation signal and a real search space are both present.
ToT/GoT vs. trained reasoning models
The biggest 2025–2026 question: didn't o3 and DeepSeek-R1 make this obsolete? No — they're a different thing. The simplest way to see it: ToT/GoT put the search outside the model, in your Python; reasoning models put the search inside the model, in its weights. Concretely, ToT/GoT are prompt-level external frameworks — no training, search orchestrated in your code, one LLM call per thought. Trained reasoning models (OpenAI o1/o3, DeepSeek-R1, Claude extended thinking) have internalized sequential search via reinforcement learning on long chain-of-thought traces. They explore alternatives implicitly, inside a single stream of tokens — there's no tree in your orchestrator.
Each has an edge. Reasoning models are simpler to use (one call) and often match or beat hand-built ToT on math and code. But they can't explicitly parallelize independent branches or expose an interpretable decision tree with a real tool call at each node — which is exactly what an agent sometimes needs. So ToT survives as an orchestration pattern for branching, tool-using agentic search.
The research is converging the two. LATS (ICML 2024) fuses Monte Carlo Tree Search with LLM agents (92.7% pass@1 on HumanEval). Adaptive Graph-of-Thoughts (Feb 2025) unifies chain/tree/graph as a dynamic DAG at test time — up to 46.2% gains on GPQA with no training. ToTRL (May 2025) trains the tree into the weights. The direction is clear: make structured search both cheaper and trainable.
Tip
A lightweight middle ground
Don't need full BFS/DFS machinery? The multi-expert prompt variant (Hulbert, 2023) asks the model to imagine several experts reasoning step by step, sharing and self-correcting in one prompt. It captures some of ToT's branch-and-evaluate benefit at a fraction of the cost — a good first thing to try before building an orchestrator.
Try it: Build a tiny Tree-of-Thoughts solver for Game of 24
Implement a minimal ToT search in Python over the Game of 24 (use four numbers and +, −, ×, ÷ to make 24).
- Generate: write a
propose(state)function that calls your LLM to suggest the next arithmetic operation given the remaining numbers (theproposemode). - Evaluate: write
evaluate(state)that asks the LLM to classify the remaining numbers assure/maybe/impossibleof reaching 24, and map those to scores (e.g. 1.0 / 0.3 / 0.0). - Search: run BFS with a beam width of
b=5over 3 steps, keeping the top-scored states each level and pruning the rest. - Compare: run the same puzzles with plain CoT (one call: 'solve this, think step by step'). Log the success rate and the number of LLM calls for each approach.
Deliverable: a short table of CoT vs. ToT accuracy and call-count on 10 puzzles. Then write two sentences on whether the accuracy gain justified the extra calls — and what evaluation signal made ToT work here.
Key takeaways
- 1Tree-of-Thoughts replaces CoT's single chain with a search: generate candidate thoughts, evaluate them, keep the best branches, and backtrack from dead ends.
- 2Generation and evaluation are separate LLM calls run by external code, so cost scales like the product of steps × beam × branching — not the sum.
- 3ToT shines only when there's a large search space, a strong intermediate evaluation signal, and value in backtracking; for linear tasks, CoT is cheaper and just as good.
- 4Graph-of-Thoughts generalizes the tree to a directed graph, adding merging, refinement, and feedback loops that a tree structurally cannot express.
- 5Trained reasoning models (o3, DeepSeek-R1) internalize sequential search and don't replace ToT/GoT, which remain useful for interpretable, branching, tool-using agentic search.
Quiz
Lock in what you learned
Check your understanding
0 / 4 answered
1.What is the core difference between Chain-of-Thought and Tree-of-Thoughts?
2.In a ToT system, how does the total number of LLM calls scale?
3.What does Graph-of-Thoughts enable that Tree-of-Thoughts structurally cannot?
4.How do trained reasoning models (o3, DeepSeek-R1) relate to ToT/GoT?
Go deeper
Hand-picked sources to keep learning
The primary source — Yao et al., Princeton. BFS/DFS details, propose vs. sample, value vs. vote, and the Game of 24 result (74% vs 4%).
Besta et al., ETH Zurich. Generalizes ToT to arbitrary directed graphs with merging and feedback; ~62% quality gain over ToT on sorting at ~31% lower cost.
Combines Monte Carlo Tree Search with LLM agents for unified reasoning, acting, and planning; 92.7% HumanEval pass@1 on GPT-4.
Feb 2025. Unifies chain/tree/graph into a dynamic DAG at test time, no training; up to 46.2% improvement on GPQA scientific reasoning.
Official Python code for Game of 24 (BFS), creative writing (sample), and crosswords (DFS). Research proof-of-concept — not production software.
Accessible summary of generation/evaluation modes, search strategies, and the lightweight multi-expert prompting variant.