Tree- & Graph-of-Thought Planning

Searching over many possible reasoning paths

Advanced 13 minResearcherBuilder
What you'll be able to do
  • Explain how Tree-of-Thoughts branches, evaluates, and backtracks over reasoning paths instead of committing to one chain
  • Describe how Graph-of-Thoughts generalizes the tree to allow merging and feedback between thoughts
  • Choose a search strategy (BFS, DFS, beam) and an evaluation method (value vs. vote) for a given problem
  • Judge when the extra compute of structured search is worth it — and when linear CoT or a reasoning model is the better call
  • Position ToT/GoT correctly against trained reasoning models like o3 and DeepSeek-R1
At a glance

Chain-of-thought reasons in a straight line — one guess, no take-backs. Tree-of-Thoughts and Graph-of-Thoughts turn that line into a search: the model generates several candidate steps, scores them, and explores the promising branches while abandoning dead ends. This lesson shows you how that search works, why it can turn a 4% success rate into 74%, what it costs, and how it relates to today's trained reasoning models.

  1. 1From a single line to a search
  2. 2How Tree-of-Thoughts actually runs
  3. 3Why search unlocks capability the model already has
  4. 4Graph-of-Thoughts: when a tree isn't enough
  5. 5The cost, and when it's worth paying
  6. 6ToT/GoT vs. trained reasoning models

How Tree-of-Thoughts actually runs

ToT was introduced by Shunyu Yao and colleagues at Princeton (NeurIPS 2023). A thought is a coherent chunk of text — one intermediate step — and the tree is searched by an external orchestrator (Python), not inside a single prompt. Four design choices define a ToT system:

  1. Thought generationpropose (generate dependent next steps sequentially; good for Game of 24) or sample (generate independent thoughts in parallel; good for creative writing).
  2. State evaluationvalue (score each state independently, e.g. 1–10 or sure/maybe/impossible) or vote (have the model compare states and vote for the best).
  3. Search algorithm — BFS keeps the top b states per level (beam search); DFS dives deep and backtracks on failure.
  4. Limits — branching factor, beam width, and max depth to keep the search finite.

Every generation and evaluation is a separate LLM call; the search logic lives in code.

python
def tot_bfs(x, gen, evaluate, steps, b=5, k=3):
    states = [""]  # frontier of partial solutions
    for _ in range(steps):
        # 1. GENERATE: expand each state into k candidate next thoughts
        cands = [s + step for s in states for step in gen(x, s, k)]
        # 2. EVALUATE: score each candidate (a separate LLM call)
        scored = [(c, evaluate(x, c)) for c in cands]
        # 3. SELECT: keep the best b (prune the rest = backtracking)
        states = [c for c, _ in sorted(scored, key=lambda t: -t[1])[:b]]
    return states[0]

Key insight

Generation and evaluation are different calls

A common misread is that ToT runs BFS inside one model call. It doesn't. Generation (propose/sample) and evaluation (value/vote) are distinct LLM calls, and the orchestrator multiplies them. With branching k and beam b over s steps, your call count grows like s × b × k — the product, not the sum. That product is the whole cost story.

Why search unlocks capability the model already has

The headline result: on Game of 24 (combine four numbers with +−×÷ to make 24), GPT-4 with standard CoT prompting solved just 4% of puzzles. The same model wrapped in ToT solved 74% — roughly an 18x jump, with no change to the model's weights.

What changed wasn't knowledge; it was search. Game of 24 has a large combinatorial space and a clear failure signal at each step (a partial result that can't reach 24 is dead). CoT picks one arithmetic path and usually picks wrong. ToT generates several, prunes the impossible ones early, and backtracks — so it reliably finds the narrow path that exists.

That tells you exactly when ToT pays off. The ingredients are: (a) a large search space with hidden solution paths, (b) a strong intermediate evaluation signal — you can tell a good partial state from a bad one — and (c) value in lookahead and backtracking. Mathematical puzzles, constraint satisfaction, crosswords, and planning problems fit. Open-ended writing, summarization, and most everyday tasks do not — their evaluation signal is weak or subjective, so the extra branches just burn tokens.

Graph-of-Thoughts: when a tree isn't enough

A tree has one limitation built in: every thought has exactly one parent. You can branch, but you can never combine two branches. Graph-of-Thoughts (GoT) — Maciej Besta et al., ETH Zurich, AAAI 2024 — drops that constraint. Thoughts become vertices in an arbitrary directed graph, and edges express dependencies, so the orchestrator can do things a tree forbids:

  • Aggregate / merge — fuse several partial thoughts into one stronger thought (e.g. merge independently-sorted sub-lists).
  • Refine — loop a thought back through the model with feedback to improve it in place.
  • Distill — collapse a sprawling subgraph into a compact result.

The classic example is sorting: split the list, sort the pieces in parallel branches, then merge them — a move a tree literally cannot represent. GoT reports a 62% quality improvement over ToT on sorting tasks while cutting cost by over 31%, precisely because merging avoids redundant re-derivation.

The trade-off is engineering complexity. A tree is a clean recursive search; a graph is a workflow you design — decide which thoughts merge, where feedback loops close, and how to avoid cycles. GoT is most worth it when sub-problems are independent and recombinable.

Watch out

Read benchmark claims in context

The 74%-vs-4% (ToT) and 62%/31% (GoT) numbers are real but task-specific — Game of 24 and sorting. They are not general guarantees. On tasks solvable by linear reasoning, CoT is faster, cheaper, and often just as accurate. Always pilot on your task before assuming structured search wins.

The cost, and when it's worth paying

Structured search is expensive by construction. Each node expansion is generation plus evaluation, and depth times branching multiplies fast: a beam of 5 across 3 steps can mean 5–10x more LLM calls than CoT, with matching latency and dollars. You manage this with the levers in your code:

  • Beam width / branching — fewer candidates per level.
  • Pruning — discard hopeless states immediately (the impossible value).
  • Early stopping — quit a branch or the whole search once a solution is verified.
  • Cheaper evaluator — score states with a smaller, faster model than the generator.

A decision rule that holds up in practice:

Use ToT/GoT when…Use linear CoT (or a reasoning model) when…
Hidden solution path, big search spaceThe path is roughly linear
Strong, cheap intermediate evaluationEvaluation is weak or subjective
Backtracking clearly helpsOne pass usually succeeds
Correctness justifies 5–10x costLatency and cost are tight

If you can't articulate a good evaluation signal for partial solutions, ToT has nothing to steer with — and you're paying 10x for noise. Start with CoT; reach for search only when a measurable evaluation signal and a real search space are both present.

ToT/GoT vs. trained reasoning models

The biggest 2025–2026 question: didn't o3 and DeepSeek-R1 make this obsolete? No — they're a different thing. The simplest way to see it: ToT/GoT put the search outside the model, in your Python; reasoning models put the search inside the model, in its weights. Concretely, ToT/GoT are prompt-level external frameworks — no training, search orchestrated in your code, one LLM call per thought. Trained reasoning models (OpenAI o1/o3, DeepSeek-R1, Claude extended thinking) have internalized sequential search via reinforcement learning on long chain-of-thought traces. They explore alternatives implicitly, inside a single stream of tokens — there's no tree in your orchestrator.

Each has an edge. Reasoning models are simpler to use (one call) and often match or beat hand-built ToT on math and code. But they can't explicitly parallelize independent branches or expose an interpretable decision tree with a real tool call at each node — which is exactly what an agent sometimes needs. So ToT survives as an orchestration pattern for branching, tool-using agentic search.

The research is converging the two. LATS (ICML 2024) fuses Monte Carlo Tree Search with LLM agents (92.7% pass@1 on HumanEval). Adaptive Graph-of-Thoughts (Feb 2025) unifies chain/tree/graph as a dynamic DAG at test time — up to 46.2% gains on GPQA with no training. ToTRL (May 2025) trains the tree into the weights. The direction is clear: make structured search both cheaper and trainable.

Tip

A lightweight middle ground

Don't need full BFS/DFS machinery? The multi-expert prompt variant (Hulbert, 2023) asks the model to imagine several experts reasoning step by step, sharing and self-correcting in one prompt. It captures some of ToT's branch-and-evaluate benefit at a fraction of the cost — a good first thing to try before building an orchestrator.

Try it: Build a tiny Tree-of-Thoughts solver for Game of 24

Implement a minimal ToT search in Python over the Game of 24 (use four numbers and +, −, ×, ÷ to make 24).

  1. Generate: write a propose(state) function that calls your LLM to suggest the next arithmetic operation given the remaining numbers (the propose mode).
  2. Evaluate: write evaluate(state) that asks the LLM to classify the remaining numbers as sure / maybe / impossible of reaching 24, and map those to scores (e.g. 1.0 / 0.3 / 0.0).
  3. Search: run BFS with a beam width of b=5 over 3 steps, keeping the top-scored states each level and pruning the rest.
  4. Compare: run the same puzzles with plain CoT (one call: 'solve this, think step by step'). Log the success rate and the number of LLM calls for each approach.

Deliverable: a short table of CoT vs. ToT accuracy and call-count on 10 puzzles. Then write two sentences on whether the accuracy gain justified the extra calls — and what evaluation signal made ToT work here.

Key takeaways

  1. 1Tree-of-Thoughts replaces CoT's single chain with a search: generate candidate thoughts, evaluate them, keep the best branches, and backtrack from dead ends.
  2. 2Generation and evaluation are separate LLM calls run by external code, so cost scales like the product of steps × beam × branching — not the sum.
  3. 3ToT shines only when there's a large search space, a strong intermediate evaluation signal, and value in backtracking; for linear tasks, CoT is cheaper and just as good.
  4. 4Graph-of-Thoughts generalizes the tree to a directed graph, adding merging, refinement, and feedback loops that a tree structurally cannot express.
  5. 5Trained reasoning models (o3, DeepSeek-R1) internalize sequential search and don't replace ToT/GoT, which remain useful for interpretable, branching, tool-using agentic search.

Quiz

Lock in what you learned

Check your understanding

0 / 4 answered

1.What is the core difference between Chain-of-Thought and Tree-of-Thoughts?

2.In a ToT system, how does the total number of LLM calls scale?

3.What does Graph-of-Thoughts enable that Tree-of-Thoughts structurally cannot?

4.How do trained reasoning models (o3, DeepSeek-R1) relate to ToT/GoT?

Go deeper

Hand-picked sources to keep learning