Capabilities & Limitations

What LLMs are brilliant and terrible at — and why agents exist

Beginner 12 minBuilderDecision-maker
What you'll be able to do
  • Name the core strengths LLMs reliably deliver and explain why those strengths make them the engine of agent systems
  • Explain hallucination, knowledge staleness, statelessness, and arithmetic unreliability as consequences of how LLMs are trained — not fixable bugs
  • Distinguish hallucination from staleness, and explain why fine-tuning fixes neither
  • Map each limitation to its concrete remedy: RAG, code tools, and external memory
  • Reason about model calibration and why confident-sounding output is not the same as correct output
At a glance

A language model is a probabilistic next-token predictor — astonishingly fluent, structurally unreliable. This lesson gives an honest tour of what LLMs are brilliant at (language, synthesis, code, flexible pattern completion) and what they are terrible at (fresh facts, precise arithmetic, true memory, calibrated confidence), then shows that each weakness maps cleanly to a fix — retrieval, tools, and external memory — which is precisely why agents exist.

  1. 1One engine, two faces
  2. 2What LLMs are genuinely brilliant at
  3. 3Hallucination: confidently wrong
  4. 4Knowledge cutoff and staleness
  5. 5Arithmetic and the statelessness problem
  6. 6Calibration: confidence is not accuracy
  7. 7Why these limits are why agents exist

One engine, two faces

Start with the one idea that makes the rest of this lesson click: an LLM is a fancy autocomplete, not a fact-checker. It was trained to guess the next word that sounds right given everything before it. Nothing in that training process rewards being correct or up to date — only being plausible. So when it's right, it's because plausible and true usually overlap; when it's wrong, it's because they sometimes don't.

Stated precisely: an LLM is a probabilistic text predictor that optimizes for plausibility, not correctness or recency. Everything it is great at and everything it is terrible at flows from that single fact. This is not a defect to apologize for — it is a trade. The same machinery that lets the model fluently continue any text in any style is what occasionally lets it continue confidently into a falsehood. Fluency and wrongness are not opposites; they come from the same place.

For an agent builder this reframes the whole problem. You are not waiting for a future model with "no weaknesses." You are designing a system around an engine whose strengths and weaknesses are both known and stable. The rest of this lesson catalogs both — and, crucially, shows that each major weakness has a direct, deployable remedy. That mapping (weakness → remedy) is the bridge from "LLM" to "agent," and it is the reason the rest of this course exists.

What LLMs are genuinely brilliant at

Before the caveats, take the strengths seriously — they are load-bearing. Think of the model as a brilliant generalist who has read almost everything: ask it to draft, rephrase, summarize, translate, or sketch code, and it is fast and fluent. It is the reasoning core of an agent because it is exceptional at a specific cluster of open-ended, language-shaped tasks:

  • Natural-language synthesis and transformation — drafting, rewriting, tone-shifting, explaining. This is the home turf.
  • Summarizing large corpora — collapsing fifty pages or a long transcript into the parts that matter.
  • Flexible code generation — producing idiomatic code for common patterns, glue scripts, and boilerplate across many languages.
  • Brainstorming and ideation — generating many candidate options fast.
  • Cross-domain analogy and pattern completion — recognizing structure and applying it to a new situation it never saw verbatim.
  • Translation and multilingual work — moving fluidly between languages and registers.

The through-line is flexibility: the model handles open-ended, fuzzy, linguistic tasks that resist hard-coding. That is exactly the capability traditional software lacks — you cannot write if/else rules for "make this email warmer." An agent borrows this flexibility to decide what to do next — and then leans on tools to do the parts the model is bad at.

Hallucination: confidently wrong

A hallucination is output that is fluent and plausible but false — the model fills a gap with something that sounds right instead of admitting it doesn't know. It comes in two flavors: factuality errors (stating an incorrect fact about the world) and faithfulness errors (distorting a source you actually gave it). Both sound equally confident, which is exactly what makes them dangerous.

Why does it happen? Two layers. First, the model predicts likely tokens, so a plausible-but-invented citation or API name can be more probable than the honest words "I'm not sure." Second — the deeper cause — OpenAI's 2025 analysis Why Language Models Hallucinate shows that standard training and evaluation reward confident guessing and penalize abstention. Benchmarks score "I don't know" as a miss, just like a wrong answer, so post-training quietly teaches models to bluff rather than hedge.

The practical reality: rates vary enormously. Best-in-class models reach under 1% on grounded summarization tasks (down from ~22% in 2021), yet specialized domains without mitigation can exceed 50% — GPT-4o hallucinated in 53% of one medical test set before mitigation. Hallucination is not a bug awaiting a patch; it is structural. The goal is mitigation and transparency, not elimination.

Watch out

Hallucination is not going to be "fixed"

It is a consequence of probabilistic prediction plus training incentives, and OpenAI's own research argues it is unavoidable under common conditions. Design for it: ground answers in retrieved sources, validate structured output, and treat unsourced model claims as untrusted until checked.

Knowledge cutoff and staleness

Imagine a brilliant expert who fell into a coma on a fixed date and just woke up. They can still reason superbly, but their facts stop at that date — and they don't always realize it. That is every base model: it is frozen at a knowledge cutoff, the date its training data ends. Ask it about anything after that and it does one of two things: admit it doesn't know, or confidently report what used to be true. The second is staleness, and it is a distinct failure mode from hallucination.

  • Hallucination = inventing a fact that never existed.
  • Staleness = faithfully recalling a fact that was true and is now wrong (a price, a leader, an API signature, a record).

Both produce confident-sounding errors, but the cure is different. And here is the trap that catches teams: fine-tuning does not fix staleness. Fine-tuning adjusts style and domain familiarity; it does not inject this week's facts. A fine-tuned stale model is more dangerous, because it states outdated information with greater authority.

The only real fix for staleness is to fetch current information at query time — retrieval from a live source or a web-search tool. You cannot prompt-engineer your way to facts the model was never given.

Key insight

The remedy follows from the diagnosis

Staleness is missing current data, so the fix is bring data in (RAG / web search). Hallucination is missing grounding and calibration, so the fix is ground + validate. Match the remedy to the actual failure mode — they are not interchangeable.

Arithmetic and the statelessness problem

Two more weaknesses fall straight out of "it predicts tokens."

Precise arithmetic. An LLM is not a calculator; it emits the most probable token sequence, not the correct one. 2 + 2 usually works because it appeared a million times in training — it's effectively memorized. But 12345 × 67890 is unreliable, because that exact answer was never a frequent pattern, so the model is pattern-matching toward a number that looks right. The accurate framing isn't "LLMs can't do math" — it's that they are unreliable for precise, novel arithmetic. The fix is not a bigger model; it is to call a calculator or code tool and let real arithmetic run.

Statelessness. LLMs have no memory between calls. Each request is processed completely fresh, as if meeting you for the first time. The "memory" you feel in a chatbot is an illusion: the entire conversation history is re-fed into the context window every single turn. That has hard limits — attention cost grows roughly with the square of context length, and a 2025 Chroma "context rot" study found all 18 frontier models tested losing accuracy continuously as context filled, with mid-window degradation exceeding 30 percentage points — a steady slope from token one, not a cliff at the limit. A bigger context window is not persistent memory; every session still starts blank unless you wire in an external memory store (e.g., a vector database).

Example

Don't compute — delegate

Instead of asking the model to multiply inline, give it a tool and let it call out.

python
def calculator(expression: str) -> float:
    """Evaluate a numeric expression safely."""
    import ast, operator as op
    ops = {ast.Add: op.add, ast.Sub: op.sub,
           ast.Mult: op.mul, ast.Div: op.truediv}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

# The model emits the call; your code returns the exact number.
print(calculator("12345 * 67890"))  # 838102050 — correct, every time

Calibration: confidence is not accuracy

Here's the uncomfortable part: the model sounds exactly as sure when it's right as when it's wrong. A model is well-calibrated when its expressed confidence tracks its real accuracy — when the things it says "I'm 90% sure" about turn out right about 90% of the time. LLMs are often poorly calibrated. RLHF (the human-feedback tuning that makes models helpful and confident) tends to reproduce human overconfidence, and miscalibration is worst exactly where it hurts most: at the edges of the model's knowledge and in low-resource domains.

The blunt operational lesson: a confident tone carries no information about correctness. The most capable models can be fluently, authoritatively wrong. A 2025 survey frames LLM uncertainty along four axes — input, reasoning, parameter, and prediction uncertainty — and a recurring finding is that capability benchmarks improve far faster than calibration does. Newer and smarter does not reliably mean better-calibrated.

For builders, calibration is why you don't let an agent's self-assessment be the final word. You add external checks: retrieval to ground claims, validators and schemas to catch malformed output, tests or execution to verify code, and human approval gates for high-stakes actions. Treat the model's confidence as a weak prior, never as proof.

Why these limits are why agents exist

Now the payoff. Each weakness in this lesson has a known, deployable fix — and when you line them up, the agent architecture writes itself. An agent isn't magic; it's an LLM with a prosthetic bolted onto each blind spot.

LimitationRoot causeRemedy (the agent's answer)
Hallucinationpredicts plausible, not true; trained to bluffgrounding via RAG + output validation/schemas
Stalenessfrozen at knowledge cutoffretrieval from live data / web-search tools
Bad arithmeticemits probable, not correct, tokenscode interpreter / calculator tool
Statelessnessno memory between callsexternal memory (vector DB) beyond the context window
Poor calibrationoverconfident, worst at the edgesexternal verification + human-in-the-loop

This is the thesis of the whole course in one table. An agent is what you get when you wrap the LLM's flexible reasoning with exactly the tools that cover its blind spots. The model decides what to do; tools and retrieval and memory do the parts it can't be trusted to do alone.

A caution to carry forward: remedies reduce, they don't erase. RAG cuts grounding errors but still propagates a stale or wrong source, and won't fix a reasoning slip. You are managing failure, not deleting it — and managing it well is the craft.

Try it: Diagnose the failure, prescribe the remedy

Run four small probes on any chat LLM and classify each failure, then name the correct fix.

  1. Staleness: ask for a fact that changed after the model's cutoff (a current officeholder, a latest software version, today's date).
  2. Arithmetic: ask it to compute a large multiplication like 48273 × 9182 with no tools, then verify with a calculator.
  3. Hallucination: ask for three sources/papers on a niche topic and check whether the citations actually exist.
  4. Statelessness: tell it a fact, start a brand-new session, and ask it to recall that fact.

For each probe, write two sentences: (a) which limitation it is (hallucination, staleness, arithmetic, statelessness, or calibration), and (b) the exact agentic remedy — RAG/web search, code tool, external memory, or external verification. This builds the single most valuable instinct from this lesson: mapping a failure to the tool that fixes it.

Key takeaways

  1. 1An LLM optimizes for plausible next tokens, not truth or recency — its strengths (fluency, synthesis, flexible code) and its weaknesses come from the same mechanism.
  2. 2Hallucination is structural: models are trained to guess confidently rather than abstain, so it must be mitigated, never assumed to be 'fixed.'
  3. 3Staleness is distinct from hallucination, and fine-tuning fixes neither freshness nor facts — only retrieval from current data does.
  4. 4LLMs are stateless and unreliable at precise arithmetic; external memory and code/calculator tools are the correct fixes, not bigger models.
  5. 5Each core limitation maps to a remedy — RAG, tools, external memory, external verification — which is exactly why the agent architecture exists.

Quiz

Lock in what you learned

Check your understanding

0 / 4 answered

1.According to current (2025) research, why do LLMs hallucinate?

2.What is the key difference between hallucination and staleness?

3.Why does giving a model a larger context window NOT solve the memory problem?

4.Which remedy correctly matches the limitation it addresses?

Go deeper

Hand-picked sources to keep learning