Memory: Short-Term to Long-Term

Helping agents remember beyond a single context window

Intermediate · 15 min · Builder

What you'll be able to do

Explain why the context window is the agent's working memory, and why a bigger window is not a substitute for a memory system
Implement short-term conversation memory using sliding-window and summarization strategies
Build long-term memory with embeddings, a vector store, and similarity retrieval
Distinguish episodic, semantic, and procedural memory and map them onto a real agent
Design the write and forget path — filtering, deduplication, and TTLs — to avoid stale memory, retrieval misses, and context bloat

At a glance

An agent's only true working memory is its context window — everything it 'knows' in a step must be sitting in that window. This lesson builds the full memory stack on top of that fact: short-term conversation memory and summarization, long-term memory via embeddings and vector stores, the four-type taxonomy (working, episodic, semantic, procedural), and the retrieval, writing, and forgetting policies that decide whether memory helps or quietly poisons your agent.

1. Working memory is the context window
2. Short-term memory: conversation and summarization
3. Long-term memory: embeddings and vector stores
4. Four kinds of memory: working, episodic, semantic, procedural
5. Retrieval, writing, and forgetting
6. Architectures, tools, and pitfalls

Working memory is the context window

Here is the simplest way to think about an agent's mind: at any given moment, it can only "think about" the words currently in front of it. That's the context window — the block of text (system prompt, conversation, tool results) handed to the model on each call. There is no hidden brain it quietly consults on the side.

Stated precisely: the only thing an agent can reason over in a given step is what is physically present in its context window. If a fact isn't in the prompt, embedded in the system message, or returned by a tool call, the model cannot use it — full stop. The context window is the agent's working memory, the equivalent of RAM in a computer.

This is why "memory" in agents is a bit of a misnomer. Everything else we call memory — conversation history, retrieved documents, summaries, prior outcomes — is really machinery for deciding what to load into that window and when. All of it has to be selected, formatted, and injected into the context before each model call. Memory engineering is, at bottom, context-loading engineering.

That reframing drives the whole lesson. The question is never "does the agent remember?" It is "at this step, did the right information get into the window, and was there room for it?" The window is finite and shared with the system prompt, tool definitions, and the current task. Memory systems exist to fill it well.
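To make "was there room for it?" concrete, here is a minimal sketch of budgeted context loading. Everything in it is an assumption for illustration: the `ContextItem` type, the priority scheme, and especially `approx_tokens`, which counts words as a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    text: str
    priority: int  # lower number = more important (0 = must include)


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())


def pack_context(items: list[ContextItem], budget: int) -> list[ContextItem]:
    """Greedily admit items in priority order while they still fit the budget."""
    chosen, used = [], 0
    for item in sorted(items, key=lambda i: i.priority):
        cost = approx_tokens(item.text)
        if used + cost <= budget:
            chosen.append(item)
            used += cost
    return chosen


items = [
    ContextItem("You are a support agent.", 0),
    ContextItem("User asks: why was I charged twice?", 0),
    ContextItem("Memory: user is on the Enterprise plan.", 1),
    ContextItem("Memory: user once asked about dark mode.", 2),
]
packed = pack_context(items, budget=20)  # the low-priority memory no longer fits
```

The point of the sketch is the competition: retrieved memories only get whatever budget is left after the system prompt and the current task have been admitted.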

Key insight

Long context is not memory

It is tempting to think a 200k-token window removes the need for a memory system. It doesn't. On selective-recall tasks, agents with huge windows underperform purpose-built retrieval systems: models that score near-perfect on simple recall benchmarks drop to roughly 40–60% on interdependent multi-session tasks (MemoryArena). Long context and memory are complementary, not interchangeable — you still have to choose what goes in the window.

Short-term memory: conversation and summarization

The most basic memory an agent has is just the chat so far — the running list of messages in this session. The catch: that list grows forever, but the window doesn't. So you need a policy for what to keep once the conversation outgrows the budget. Three patterns dominate:

Sliding window: keep only the last N messages. Cheap, and lossless for recent turns, but anything older is dropped entirely.
Summarization: compress older turns into a compact running summary, then prepend it. Preserves the gist across long conversations at the cost of detail.
Summary-buffer (hybrid): keep recent turns verbatim and a summary of everything older. This is the production default; it gives you precise recent context plus durable older context.

Concretely: imagine a 40-turn support chat. A sliding window of 8 remembers the last few exchanges but forgets the original complaint. A pure summary remembers the complaint but blurs the exact error message. The hybrid keeps both — recent turns word-for-word, plus a summary of everything before. A minimal version in Python:

python

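def build_context(history, summary, keep=8, summarize_fn=None):
    """Summary-buffer policy: return (messages, summary, trimmed_history)."""
    # Guard keep=0: history[-0:] would return the whole list, not an empty one.
    recent = history[-keep:] if keep > 0 else []
    older = history[:len(history) - len(recent)]
    if older and summarize_fn:
        # Recursive: fold the turns leaving the buffer into the running summary.
        summary = summarize_fn(summary, older)
        history = recent  # drop folded turns so the next call does not re-summarize them
    msgs = []
    if summary:
        msgs.append({"role": "system", "content": f"Summary so far:\n{summary}"})
    msgs.extend(recent)
    return msgs, summary, history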

Recursive summarization — repeatedly folding older messages into the running summary — keeps cost bounded but introduces drift: each compression cycle can erase a low-frequency detail that mattered.

Watch out

Summarization is not lossless

Treating summarization as free compression is a trap. Summarization drift is a documented failure mode: a high-importance fact mentioned only once in older history can vanish through repeated compression. Protect critical facts (IDs, decisions, constraints) by pinning them outside the summarized region — e.g., in a dedicated core memory block that is never summarized away.
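One way to picture pinning is a prompt builder that renders critical facts from a separate block the summarizer never touches. This is a sketch only: `render_prompt`, the `core_facts` dictionary, and the example IDs are hypothetical names, not part of any framework.

```python
def render_prompt(core_facts: dict[str, str], summary: str, recent: list[dict]) -> list[dict]:
    """Assemble messages so pinned facts bypass summarization entirely."""
    core_text = "\n".join(f"- {key}: {value}" for key, value in core_facts.items())
    msgs = [{"role": "system", "content": f"Core memory (never summarized):\n{core_text}"}]
    if summary:
        msgs.append({"role": "system", "content": f"Summary so far:\n{summary}"})
    msgs.extend(recent)
    return msgs


# Hypothetical pinned facts; only the summary and recent turns change over time.
core = {"user_id": "u_7", "order_id": "A-1042"}
msgs = render_prompt(
    core,
    "User reported a double charge.",
    [{"role": "user", "content": "Any update?"}],
)
```

Because the core block is rebuilt from structured data on every call, no number of compression cycles can erase the order ID.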

Long-term memory: embeddings and vector stores

Short-term memory dies when the session ends. To remember a user across days or conversations, you need something durable that lives outside the window. The standard trick: turn each memory into a list of numbers (an embedding) that captures its meaning, store those numbers, and later fetch the ones whose meaning is closest to the current question. That's the whole idea behind a vector database.

Long-term memory persists across sessions, and the pipeline is consistent across the ecosystem:

text → embedding model → vector → store in a vector DB → at query time, embed the query and do approximate-nearest-neighbor (ANN) search → inject the top-k retrieved chunks into the context.

("Approximate-nearest-neighbor" just means: find the closest vectors fast, trading a little exactness for speed.) Common backends include Qdrant, Weaviate, Milvus, Chroma (popular for prototyping), pgvector (Postgres-native), and Redis. The memory layer Mem0 alone supports 20+ such backends.

python

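# Pseudocode: embed(), store, now(), and query are placeholders for your
# embedding model, vector DB client, clock, and the incoming user question.
# Write a memory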
vec = embed("User prefers metric units and dark-mode UIs.")
store.upsert(id="mem_42", vector=vec, payload={
    "text": "User prefers metric units and dark-mode UIs.",
    "user_id": "u_7", "type": "semantic", "ts": now(),
})

# Read at query time
hits = store.search(embed(query), top_k=5,
                    filter={"user_id": "u_7"})  # metadata scoping is essential
context = "\n".join(h.payload["text"] for h in hits)

Note the metadata filter: in multi-tenant systems, scoping retrieval by user_id (or session, or agent) is not optional — it's how you stop one user's memories from leaking into another's context.
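To see the write/read cycle and the tenant filter run end to end without a database, here is a self-contained sketch. Both pieces are stand-ins invented for illustration: `embed` is a hashed bag-of-words vector (it matches shared words, not meaning), and `InMemoryStore` does an exact brute-force scan where a real vector DB would use an ANN index.

```python
import math
import zlib

DIM = 64


def embed(text: str) -> list[float]:
    """Toy embedding: hashed bag-of-words, L2-normalized. Not a real semantic model."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class InMemoryStore:
    """Hypothetical store: exact search over a dict instead of an ANN index."""

    def __init__(self):
        self.rows = {}  # id -> (vector, payload)

    def upsert(self, id, vector, payload):
        self.rows[id] = (vector, payload)

    def search(self, query_vec, top_k=5, filter=None):
        filter = filter or {}
        scored = []
        for vec, payload in self.rows.values():
            # Apply the metadata filter first so other tenants are never scored.
            if all(payload.get(k) == v for k, v in filter.items()):
                score = sum(a * b for a, b in zip(query_vec, vec))  # cosine: vectors are unit length
                scored.append((score, payload))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]


store = InMemoryStore()
for mem_id, user, text in [
    ("mem_1", "u_7", "prefers metric units"),
    ("mem_2", "u_9", "prefers imperial units"),
]:
    store.upsert(mem_id, embed(text), {"text": text, "user_id": user})

hits = store.search(embed("which units does the user prefer"), filter={"user_id": "u_7"})
```

Without the filter, the other user's "imperial units" memory would be a plausible match for the same query, which is exactly the leak the filter prevents.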

Tip

The short-term/long-term split is about persistence, not window size

A common oversimplification says "short-term = in-context, long-term = vector store." The sharper distinction is whether memory survives across sessions. Systems like Letta keep core memory blocks that are both always in-context and long-lived. What makes memory "long-term" is durability, not where it sits at any moment.

Four kinds of memory: working, episodic, semantic, procedural

Not all memory is the same kind of thing. Your own mind keeps several flavors: what you're thinking about right now, specific things that happened, general facts you've learned, and skills you can just do. Agent memory borrows exactly this split. The field has converged on a taxonomy formalized in the CoALA paper (Sumers et al., 2023), now used across LangChain, Letta, Mem0, and others. Four memory types:

Type	What it holds	Example for a support agent
Working	The active context window	The current ticket and conversation
Episodic	Past experiences with when they happened	"On May 3, this user reported the same bug"
Semantic	Abstracted facts and preferences	"This user is on the Enterprise plan"
Procedural	Reusable skills that map situations → actions	A learned, reusable refund-processing routine

The distinctions are practical, not academic. Episodic memory is timestamped and specific — it answers "what happened?" Semantic memory is distilled and timeless — it answers "what is true?" You often derive semantic memory from many episodic events: notice the user mention metric units in three separate chats (episodic), and you can store "user prefers metric units" once (semantic).
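The episodic-to-semantic step can be sketched as a simple promotion rule. The `Episode` type, the `min_mentions` threshold of three, and the assumption that claims arrive already normalized to identical strings are all illustrative choices, not a standard algorithm.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Episode:
    ts: str       # ISO date of the event
    user_id: str
    claim: str    # normalized observation extracted from the conversation


def promote_to_semantic(episodes: list[Episode], min_mentions: int = 3) -> list[dict]:
    """Turn claims seen in at least `min_mentions` episodes into semantic facts."""
    counts = Counter((e.user_id, e.claim) for e in episodes)
    return [
        {"user_id": user, "fact": claim, "support": n, "type": "semantic"}
        for (user, claim), n in counts.items()
        if n >= min_mentions
    ]


episodes = [
    Episode("2025-05-03", "u_7", "prefers metric units"),
    Episode("2025-05-10", "u_7", "prefers metric units"),
    Episode("2025-05-21", "u_7", "prefers metric units"),
    Episode("2025-05-21", "u_7", "asked about invoices"),
]
facts = promote_to_semantic(episodes)  # one fact, with no timestamp attached
```

Note what changes in the promotion: the timestamps drop away and a support count takes their place, which is the difference between "what happened?" and "what is true?".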

Procedural memory is the one people get wrong: it stores reusable executable skills (think Voyager's growing skill library), not instructions or preferences. "The user likes terse replies" is semantic; a saved, callable routine for handling a refund is procedural.
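The difference shows up clearly in code: a semantic memory is data you read, while a procedural memory is something you call. The registry below is a hypothetical sketch; the `skill` decorator, the `SKILLS` table, and the refund routine are invented for illustration, and a real routine would call a payments API.

```python
from typing import Callable

SKILLS: dict[str, Callable[..., str]] = {}


def skill(name: str):
    """Register a function as a named, reusable skill."""
    def decorator(fn):
        SKILLS[name] = fn
        return fn
    return decorator


@skill("process_refund")
def process_refund(order_id: str, amount: float) -> str:
    # Hypothetical routine standing in for a real payments call.
    return f"Refunded {amount:.2f} for order {order_id}"


semantic_memory = {"reply_style": "terse"}  # a preference: data, not executable

result = SKILLS["process_refund"]("A-1042", 19.5)
```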

Note

Why the taxonomy earns its keep

These aren't academic labels — they map to different storage and retrieval strategies. Episodic memory needs temporal indexing and time-aware retrieval; semantic memory needs deduplication and conflict resolution; procedural memory needs versioned, executable artifacts. Lumping them into one vector store is a common cause of poor recall.

Retrieval, writing, and forgetting

Having a store isn't enough — how you read from it, write to it, and prune it is where agents live or die. Think of it as three jobs: pulling the right memory back in, deciding what's even worth saving, and throwing out what's gone stale. All three are real engineering problems.

Reading (retrieval). Pure embedding similarity is a weak default — it finds text that is similar, not necessarily correct or useful. (Ask "what's the refund policy?" and similarity might surface a chat that merely mentions refunds but answers nothing.) Production systems layer signals:

Hybrid search — dense vectors + keyword (BM25) to catch exact terms embeddings miss.
Reranking — a cross-encoder re-sorts the top-k candidates by true relevance.
Metadata + temporal/entity filtering — scope by tenant, recency, or named entity.

Multi-signal retrieval (semantic + keyword + entity) is now the norm; Mem0 reports large gains on temporal and multi-hop reasoning from exactly this.
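The text does not say how dense and keyword results should be combined. One widely used option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as what any particular system does. The two input rankings are hypothetical outputs of a vector search and a BM25 search, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["m3", "m1", "m7"]    # hypothetical output of vector search
keyword_hits = ["m1", "m9", "m3"]  # hypothetical output of BM25
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
# fused == ["m1", "m3", "m9", "m7"]
```

Memories that both signals agree on rise to the top, and the fused list is a natural candidate set to hand to a reranker.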

Writing (the write path). Naive append is the most common mistake. A good pipeline does: filter (is this worth keeping?), canonicalize (normalize entity names), deduplicate, resolve conflicts (version contradictory facts instead of silently overwriting), and priority-score. Writes are increasingly asynchronous (Mem0 defaults to async) so they don't block the agent loop.
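A compressed sketch of that write path, under several stated assumptions: memories are keyed by a slot such as `"u_7:plan"`, the `ALIASES` table is hypothetical, the filter is a bare word count, and dedup is exact match after canonicalization (a real system would more likely compare embeddings). Priority scoring is left out for brevity.

```python
import re

ALIASES = {"k8s": "kubernetes", "postgres": "postgresql"}  # hypothetical alias table


def canonicalize(text: str) -> str:
    """Lowercase, strip punctuation, and map known aliases to one spelling."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(ALIASES.get(w, w) for w in words)


def write_memory(store: dict, key: str, text: str, ts: int, min_words: int = 3) -> str:
    """Filter, canonicalize, dedupe, and version a memory. Returns the action taken."""
    canon = canonicalize(text)
    if len(canon.split()) < min_words:
        return "filtered"  # too trivial to keep
    versions = store.setdefault(key, [])
    if versions and versions[-1]["text"] == canon:
        return "duplicate"
    versions.append({"text": canon, "ts": ts})  # keep history instead of overwriting
    return "superseded" if len(versions) > 1 else "added"


store: dict = {}
actions = [
    write_memory(store, "u_7:plan", "ok thanks", ts=1),
    write_memory(store, "u_7:plan", "User is on the Pro plan", ts=2),
    write_memory(store, "u_7:plan", "user is on the PRO plan.", ts=3),
    write_memory(store, "u_7:plan", "User is on the Enterprise plan", ts=4),
]
# actions == ["filtered", "added", "duplicate", "superseded"]
```

The contradictory fact is appended as a new version, so the latest entry wins at read time while the old one stays available for audit.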

Forgetting. Forgetting is a feature, not a bug. Without TTLs or versioning, outdated facts accumulate and the agent treats them as authoritative — it will happily quote last quarter's price forever. So expire stale entries, supersede old facts, and prune low-value memories on a schedule.
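Expiry and supersession can both be applied as a filter in front of retrieval. The sketch below assumes each memory carries a timezone-aware `ts` and an optional `superseded_by` field; those field names and the 90-day TTL are illustrative, not a standard schema.

```python
from datetime import datetime, timedelta, timezone


def live_memories(memories: list[dict], now: datetime, ttl: timedelta) -> list[dict]:
    """Drop expired and superseded entries before they can reach retrieval."""
    return [
        m for m in memories
        if now - m["ts"] <= ttl and not m.get("superseded_by")
    ]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
memories = [
    {"id": "p1", "text": "Pro plan costs $20", "ts": now - timedelta(days=120)},
    {"id": "p2", "text": "Pro plan costs $25", "ts": now - timedelta(days=5)},
    {"id": "a1", "text": "Ships to Austin", "ts": now - timedelta(days=10), "superseded_by": "a2"},
    {"id": "a2", "text": "Ships to Denver", "ts": now - timedelta(days=2)},
]
kept = live_memories(memories, now, ttl=timedelta(days=90))
# kept contains p2 and a2: the old price expired, the old address was superseded
```

A single global TTL is the simplest policy; in practice you would likely vary it by memory type, since a stated preference ages more slowly than a price.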

Watch out

More memory is not better memory

The "add everything" strategy actively degrades agents. In a medical-reasoning study, agents with 2,400+ memory records scored 13% accuracy versus 39% for agents with a curated 248-record store. Relevant memory has to compete with the current task for window space — quality and curation beat quantity every time.

Architectures, tools, and pitfalls

You rarely build all of this from scratch — a handful of architectures and frameworks already package these ideas. The most instructive is Letta (formerly the MemGPT research project), which borrows an idea from operating systems: treat memory as tiers, and let the agent move data between them like a computer pages between RAM and disk.

Letta uses an OS-inspired hierarchy: core memory (always in-context, like RAM), recall memory (searchable conversation history), and archival memory (external vector store, like disk). The agent pages between tiers via tool calls — powerful, but a wrong paging decision is a silent failure.
Mem0 is a framework-agnostic layer with user/session/agent scopes over a hybrid vector + graph + key-value store.
LangGraph (v1.0 GA October 2025) gives you built-in checkpointing for within-thread state (Postgres, Redis, MongoDB backends) and a separate store for cross-session long-term memory. Since October 2025, LangChain's create_agent runs on LangGraph's execution engine under the hood, replacing the legacy AgentExecutor.
A-MEM (NeurIPS 2025) advances memory evolution: a new memory triggers re-indexing and link formation with existing notes (Zettelkasten-style), so memory reorganizes itself rather than sitting static.
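The tiering idea can be illustrated with a toy class. To be clear, this is not Letta's API: the class, its method names, and the evict-oldest policy are assumptions made up for this sketch, and the tier names simply mirror the description above.

```python
class TieredMemory:
    """Toy three-tier layout; names mirror the text, not any library's API."""

    def __init__(self, core_limit: int = 3):
        self.core: dict[str, str] = {}      # always rendered into the prompt
        self.recall: list[str] = []         # full conversation log, searchable
        self.archival: dict[str, str] = {}  # out-of-context long-term store
        self.core_limit = core_limit

    def page_in(self, key: str) -> bool:
        """Move an archival entry into core, evicting the oldest core entry if full."""
        if key not in self.archival:
            return False  # report the miss instead of failing silently
        if len(self.core) >= self.core_limit:
            oldest = next(iter(self.core))  # dicts preserve insertion order
            self.archival[oldest] = self.core.pop(oldest)
        self.core[key] = self.archival.pop(key)
        return True

    def render_core(self) -> str:
        return "\n".join(f"{k}: {v}" for k, v in self.core.items())


mem = TieredMemory(core_limit=2)
mem.core.update({"plan": "Enterprise", "tone": "terse"})
mem.archival["last_bug"] = "Export fails on CSV files over 1 GB"
ok = mem.page_in("last_bug")          # evicts "plan" to archival to make room
missing = mem.page_in("unknown_key")  # False: nothing to page in
```

In an agent, `page_in` would be exposed as a tool the model calls. Returning an explicit `False` on a miss is one small way to make a bad paging decision visible.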

The three pitfalls to watch: stale memory (no TTL or versioning), retrieval misses (similar ≠ correct), and context bloat (add-all crowds out the task). Note that MemGPT is now called Letta, so use the current name.

Don't forget evaluation. Agents can ace single-hop recall yet collapse on temporal and multi-hop, cross-session tasks. Benchmarks like LoCoMo, LongMemEval, MemoryArena, and BEAM test these distinct dimensions — measure the one you actually depend on.
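Before reaching for a full benchmark, you can get a first signal by scoring your own retriever per question category. The harness below is a minimal sketch: the case format, the category labels, and `fake_retrieve` are hypothetical, and the metric is plain recall@k on a single gold memory ID per case.

```python
from collections import defaultdict


def recall_by_category(cases: list[dict], retrieve, k: int = 5) -> dict[str, float]:
    """Fraction of cases, per category, whose gold memory ID appears in the top-k."""
    hits, totals = defaultdict(int), defaultdict(int)
    for case in cases:
        totals[case["category"]] += 1
        if case["gold_id"] in retrieve(case["query"])[:k]:
            hits[case["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}


def fake_retrieve(query: str) -> list[str]:
    # Stand-in retriever for the demo; swap in your real search function.
    canned = {"refund policy?": ["m1", "m2"], "what changed since May?": ["m4"]}
    return canned.get(query, [])


cases = [
    {"category": "single-hop", "query": "refund policy?", "gold_id": "m1"},
    {"category": "temporal", "query": "what changed since May?", "gold_id": "m3"},
]
scores = recall_by_category(cases, fake_retrieve)
# scores == {"single-hop": 1.0, "temporal": 0.0}
```

Reporting one number per category, rather than a single average, is what exposes the pattern described above: strong single-hop recall hiding weak temporal recall.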

Try it: Build a two-tier memory for a chat agent

Build a small agent with both short-term and long-term memory.

Short-term: Implement a summary-buffer policy — keep the last 8 messages verbatim and fold older messages into a running summary on each turn. Pin one critical fact (e.g., a user ID) in a core block that is never summarized.
Long-term: Stand up any vector store (Chroma or pgvector are easy) and a write path that filters trivial messages, deduplicates near-identical memories, and tags each with user_id, type (episodic/semantic), and a timestamp. Retrieve the top-5 with a user_id metadata filter and inject them.
Forget: Add a TTL so memories older than N days are excluded from retrieval.
Stress-test it: In session 1, tell the agent three facts (one mentioned only once). Start a fresh session 2 and ask about all three. Did the rare fact survive? Now flood the store with 200 irrelevant memories and re-ask — did retrieval quality degrade?

Write a short note on which pitfall bit you first: summarization drift, retrieval miss, or context bloat.

Key takeaways

1. The context window is the agent's only working memory; every memory system is really machinery for choosing what to load into it.
2. Long context is not a memory system: huge windows underperform purpose-built retrieval on selective, cross-session recall.
3. Use the four-type taxonomy (working, episodic, semantic, procedural) to pick the right storage and retrieval strategy for each kind of memory.
4. Long-term memory = embeddings + vector store + multi-signal retrieval; the write and forget path (filter, dedupe, conflict-resolve, TTL) matters as much as the read path.
5. More memory is not better: curate aggressively, because add-all strategies bloat context and measurably degrade accuracy.

Quiz

Lock in what you learned

Check your understanding

1. Why is a very large context window not a replacement for a memory system?

2. Which short-term memory strategy is the common production default?

3. Which statement about procedural memory is correct?

4. What does the medical-reasoning result (39% vs 13%) most directly demonstrate?

Go deeper

Hand-picked sources to keep learning

CoALA: Cognitive Architectures for Language Agents

The 2023 paper (Sumers et al.) that formalized the working/episodic/semantic/procedural taxonomy now used across the ecosystem.

Letta — Agent Memory Architecture

Authoritative deep-dive from the MemGPT/Letta team on in-context vs external memory and the core/recall/archival OS-paging model.

Mem0 — State of AI Agent Memory 2026

Current (2026) benchmark scores and an architectural comparison of Letta, Mem0, LangMem, and Zep.

A-MEM: Agentic Memory for LLM Agents (NeurIPS 2025)

Introduces Zettelkasten-inspired, self-evolving memory networks where new memories trigger re-indexing and link formation.

Memory for Autonomous LLM Agents: Survey

Comprehensive 2025–2026 survey of memory mechanisms, control policies, and benchmarks (LoCoMo, MemoryArena, BEAM). Source of the 'long context is not memory' framing.

LangGraph Persistence Docs

Practical implementation of in-session checkpointing and cross-session long-term memory for graph-based agents.