Why Your Agent Forgets (and How to Fix It)
Users perceive agents as forgetting when retrieval fails. But the failure is rarely "the memory was deleted." It is usually "the memory was outranked." A user who once told the agent their preferred IDE finds, six months later, that the agent suggests a different one — because the relevant memory is the 47th-ranked result for that query, and the agent only sees the top 10.
Three architectural patterns cause this failure mode. All three are fixable; most agent memory implementations have at least two.
The retrieval failure model
Before diagnosing a specific failure, it helps to be precise about the taxonomy. "The agent forgot" can mean four distinct things, each with a different fix:
Wrong store. The memory was never written — the write pipeline filtered it, the extraction model missed it, or the session never got processed. This is actual forgetting. It's less common than teams assume.
Correct store, wrong rank. The memory exists but scores below the top-K threshold. This is the most common failure mode by a wide margin. The query vector is close enough to 2,000 noise memories that the one signal memory ends up at rank 23. The agent never sees it.
Correct rank, wrong context window. The memory is in the top-K results, but the context budget is exhausted before it gets injected. A 128K-token window sounds roomy, yet once the prompt template, conversation history, and tool outputs take their share, the effective budget often holds only a few dozen memories. Retrieve 50 and perhaps only the first 30 fit; the rest are truncated. The memory was retrieved but not read.
Correct context, wrong reasoning. The memory was retrieved, injected, and the model read it — but the model's generation didn't use it. This is a prompting failure, not a memory failure. It's diagnosed by diffing completions with and without the memory injected.
Distinguishing these cases matters because the fixes are completely different. Wrong-rank failures require write-pipeline cleanup or ranking improvements. Context-window failures require token budget discipline. Reasoning failures require prompt engineering. Treating them all as "the memory system is broken" leads to changes that don't help.
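The taxonomy is mechanical enough to encode. A minimal sketch (the names are illustrative, not from any particular library) pairing each failure class with its fix:

```python
from enum import Enum

class ForgetFailure(Enum):
    WRONG_STORE = "memory was never written"
    WRONG_RANK = "memory exists but scored below top-K"
    WRONG_CONTEXT = "memory retrieved but truncated from the prompt"
    WRONG_REASONING = "memory injected but ignored by the model"

# Each class has a different fix; conflating them wastes engineering effort.
FIXES = {
    ForgetFailure.WRONG_STORE: "repair the write pipeline / reprocess the session",
    ForgetFailure.WRONG_RANK: "write-time filtering, metadata filters, re-ranking",
    ForgetFailure.WRONG_CONTEXT: "token budget discipline",
    ForgetFailure.WRONG_REASONING: "prompt engineering, not memory changes",
}
```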
Failure 1: indiscriminate writes
The most common pattern: every conversational turn is sent to LLM extraction. The extraction prompt reads something like "list every fact stated here." The model dutifully extracts the user's question, the agent's clarification, the small-talk, the restatement of a previous turn, the meta-commentary about the conversation. By month three, the store is 90% noise and 10% signal.
97.8% of stored memories were unhelpful at retrieval time — pleasantries, restatements, meta-talk about the conversation, or context that never reappeared.
— Mem0 production audit, 2024
Top-K retrieval doesn't get to choose between signal and noise — it returns the top 10 by similarity score. Noise that's lexically close to the query crowds out signal. The word "prefer" appears in 2,000 noise memories ("I'd prefer it if you could clarify", "do you prefer I explain this differently", "prefer to revisit this later"). A query about user lunch preferences competes against all of them.
The fix is a write pipeline that filters before storage, not after.
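A minimal sketch of such a filter, assuming a hypothetical classify_turn LLM call that labels each turn; the heuristics and the length cutoff are illustrative:

```python
import re

# Cheap lexical heuristics reject obvious noise before any model call.
PLEASANTRY = re.compile(
    r"^(thanks|thank you|ok(ay)?|got it|sounds good|hi|hello)\b", re.I
)

MEMORIZABLE_KINDS = {"fact", "preference", "decision", "constraint"}

def is_memorizable(turn_text: str, classify_turn) -> bool:
    """Return True only if the turn should reach LLM extraction at all.

    classify_turn stands in for an LLM call that labels a turn as one of:
    fact / preference / decision / constraint / noise.
    """
    text = turn_text.strip()
    if len(text) < 15:          # too short to carry a durable fact
        return False
    if PLEASANTRY.match(text):  # small talk and acknowledgements
        return False
    return classify_turn(text) in MEMORIZABLE_KINDS
```

Only turns that pass the filter are sent to extraction; everything else never enters the store.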
Quantifying the cost of junk
The math is straightforward and damning. Suppose a production store has 10,000 memories after six months of indiscriminate writes. The Mem0 audit figure of 97.8% junk rate puts useful memories at roughly 220. The other 9,780 are noise.
For a query about user lunch preferences, the signal memory is somewhere in those 220. Assume ranking is uniformly random — a generous assumption, since in practice noise memories cluster around common query terms and outrank signal more often than chance. The probability that one specific signal memory lands in the top 10 of 10,000 is:

P(signal in top-10) ≈ 10 / 10,000 = 0.10%

Even collectively the odds are poor. The probability that any of the 220 signal memories makes the top 10 is:

P(at least one signal memory in top-10) ≈ 1 − (9,780/10,000)^10 ≈ 19.8%

So in the best case, with uniformly distributed signal, you have roughly a 20% chance of retrieving any useful memory. In a realistic store, noise memories cluster around common query terms (they do — "prefer", "like", "use", "want" appear in thousands of noise memories for any active user), so signal is systematically outranked and real-world precision is lower than 20%.
The preventive fix is reducing junk at write time. If your write pipeline filters 75% of turns as non-memorizable, your store after six months has 2,500 memories. If 97.8% of that filtered set is still junk (unlikely — filtering should improve quality), you have 55 signal memories in 2,500, and P(at least one in top-10) ≈ 1 − (2,445/2,500)^10 ≈ 19.8% — the same. But in practice the filtered store has much better signal density: if filtering improves quality to 50% useful, you have 1,250 signal memories, and under the same uniform model the retrieval probability jumps to 1 − (1,250/2,500)^10 ≈ 99.9% — effectively certain.
The point: write-time filtering doesn't just save storage. It directly improves retrieval precision, and the improvement is nonlinear.
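The arithmetic is easy to sanity-check, here with sampling without replacement (which agrees with the with-replacement approximation above to three decimals at these store sizes):

```python
def p_any_signal_in_top_k(total: int, signal: int, k: int = 10) -> float:
    """P(at least one signal memory in the top k) under uniformly random ranking."""
    p_none = 1.0
    for i in range(k):  # draw k memories without replacement
        p_none *= (total - signal - i) / (total - i)
    return 1 - p_none

print(p_any_signal_in_top_k(10_000, 220))    # ~0.199  unfiltered store
print(p_any_signal_in_top_k(2_500, 55))      # ~0.199  filtered, same junk rate
print(p_any_signal_in_top_k(2_500, 1_250))   # ~0.999  filtered to 50% signal
```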
Failure 2: entity fragmentation
The user said "I work for VW" in January. In March: "We had a meeting at Volkswagen." In May: "My employer requires all code to be reviewed." Without entity resolution, those are three separate strings in three separate memories. The embedding for "VW" is only partially close to "Volkswagen" — they share some semantic overlap, but they are not the same token sequence. And "my employer" shares almost no embedding space with "VW" for a model that doesn't know they co-refer.
A query about "the user's job" hits the embedding for "job" and "work." It might retrieve "I work for VW" — but "my employer requires all code to be reviewed" matches "code review" more strongly than "job." The memory that actually contains the most useful context ("my employer" = VW = a large automotive company with specific engineering constraints) doesn't surface.
Entity fragmentation cascades. Once you have three separate memories for the same entity, any update to one doesn't automatically update the others. If the user later says "I left VW," the supersession logic marks the "I work for VW" memory as inactive — but "my employer requires code review" is still active, because the supersession check didn't identify it as related.
The fix is entity resolution at write time. Every mention of an entity gets a canonical entity ID assigned during the write pipeline's normalization stage. "VW", "Volkswagen", and "my employer" all map to entity:org:vw. Queries about employment pull all memories tagged with that entity ID, regardless of surface form. A supersession event on entity:org:vw marks all associated memories inactive in one operation.
This requires a lightweight entity registry — a map from surface forms and aliases to canonical IDs. The registry can be bootstrapped from Wikipedia redirects and organization name aliases for common entities, and extended with user-specific aliases as they emerge. The cost is modest compared to the retrieval quality improvement.
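A minimal registry sketch; the class shape and alias data are illustrative:

```python
class EntityRegistry:
    """Maps surface forms ("VW", "Volkswagen", "my employer") to canonical IDs."""

    def __init__(self) -> None:
        self._aliases: dict[str, str] = {}

    def register(self, canonical_id: str, *surface_forms: str) -> None:
        for form in surface_forms:
            self._aliases[form.lower()] = canonical_id

    def resolve(self, mention: str) -> str | None:
        return self._aliases.get(mention.lower())

registry = EntityRegistry()
# Bootstrapped from public alias lists; user-specific aliases are added as
# co-references are established ("my employer" after "I work for VW").
registry.register("entity:org:vw", "VW", "Volkswagen", "my employer")

assert registry.resolve("volkswagen") == "entity:org:vw"
assert registry.resolve("my employer") == "entity:org:vw"
```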
Failure 3: missing supersession
The user changed jobs. "Works at Acme" is in the store with confidence 0.78. "Works at Volkswagen" is stored two months later with confidence 0.65. Both retrieve for the query "where does the user work." The agent surfaces both, then either picks one or hedges with a confused answer.
If supersession ran correctly, "works at Acme" would be marked inactive the moment "works at Volkswagen" was written and the conflict was detected. It stays in the store — the audit trail is preserved — but it doesn't participate in retrieval. When the user asks "where do I work," only the current memory surfaces.
Without supersession, the agent is in a worse position than knowing nothing. It knows two contradictory facts and has no way to determine which is current. The one that surfaces is whichever happens to score higher on cosine similarity for the specific query — often the older one, because it has higher access_boost from being retrieved many times before the job change. The agent confidently answers with stale data.
The write pipeline's conflict-check stage handles supersession. When a new memory is about to be written, the pipeline checks for existing memories with high semantic overlap (≥ 0.85 similarity) and contradictory predicates ("works at X" vs "works at Y"). On a conflict, the resolution logic marks the existing, contradicted memory as superseded, records the new memory's ID in the superseded record's superseded_by field, and proceeds with the write.
The superseded memory is not deleted. It is retained with a flag that excludes it from retrieval but includes it in audit queries. This matters for GDPR deletion requests (which need to touch all versions of a memory), for debugging incorrect answers (the old memory might explain why the agent gave a wrong answer three months ago), and for edge cases where supersession fires incorrectly and you need to roll back.
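A sketch of that conflict-check stage, with similarity standing in for cosine similarity over the two memories' embeddings; the Memory fields mirror the ones named in this section:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    text: str
    predicate: tuple[str, str]       # e.g. ("works_at", "acme")
    confidence: float
    active: bool = True
    superseded_by: str | None = None

def write_with_conflict_check(new: Memory, store: list[Memory],
                              similarity, threshold: float = 0.85) -> None:
    """Mark existing, contradicted memories inactive when `new` is written."""
    for old in store:
        if not old.active:
            continue
        same_subject = old.predicate[0] == new.predicate[0]  # both "works_at"
        contradicts = same_subject and old.predicate[1] != new.predicate[1]
        if contradicts and similarity(old, new) >= threshold:
            old.active = False          # excluded from retrieval...
            old.superseded_by = new.id  # ...but kept for audits, GDPR, rollback
    store.append(new)
```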
The ranking problem in detail
Even with clean writes, entity resolution, and supersession, retrieval can fail. The mechanism: noise memories with high surface similarity to the query term crowd out signal.
Imagine a store for an active user with 6 months of well-filtered writes. The store has 2,000 memories. Of those, 200 are genuinely useful. The user asks "what does the user prefer for lunch?" The query embedding hits the vector for "prefer" — a word that appears in roughly 400 stored memories ("I'd prefer", "user prefers quiet", "prefers async communication", "preferred the earlier time slot"). Most of those 400 are not about lunch.
The actual lunch preference memory — "user mentioned preferring vegetarian food, specifically Indian cuisine, multiple times" — is in the store. Its cosine similarity to "what does the user prefer for lunch" is 0.76. But 20 other memories have similarity above 0.76 simply because they contain "prefer" prominently and have high access_boost from frequent past retrievals. The lunch memory ends up at rank 21. Top-10 retrieval misses it.
This is the ranking problem. It's not a storage failure; it's a retrieval precision failure. The solutions:
Increase K. Retrieve top-50 instead of top-10. The lunch memory probably appears in top-50. Cost: more tokens injected into the context, more noise in the prompt.
Add metadata filters. Tag memories by category at write time (food, work, technical, preferences). Query with a category filter: "preferences, food." This narrows the candidate set from 2,000 to perhaps 80, making rank-21 effectively rank-4.
Re-rank. Run a second-pass relevance model on the top-50 candidates before truncating to top-10. Cross-encoders outperform bi-encoders on precision for this task. The cost is a second model call per retrieval.
Improve the query. "what does the user prefer for lunch" is a generic query. "user_id=abc123, category=food_preferences, context=lunch decision" is a structured query. Structured queries with metadata filters beat unstructured semantic queries for precision in memory retrieval.
In practice, the fix is usually a combination: retrieve top-30, apply category filter, re-rank with a lightweight cross-encoder, inject top-10. This brings the lunch memory from rank 21 to rank 2 in most cases.
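A sketch of that combined pipeline, with vector_search and cross_encoder_score standing in for the store's ANN query and a re-ranking model, and a hypothetical categories tag on each memory:

```python
def retrieve(query: str, user_id: str, category: str,
             vector_search, cross_encoder_score,
             k_candidates: int = 30, k_final: int = 10) -> list:
    """Top-30 by similarity, category filter, cross-encoder re-rank, top-10."""
    # Stage 1: wide net with the cheap bi-encoder similarity search.
    candidates = vector_search(query, user_id=user_id, k=k_candidates)

    # Stage 2: metadata filter shrinks the candidate set to the right category.
    candidates = [m for m in candidates if category in m.categories]

    # Stage 3: precise but expensive; score (query, memory) pairs and re-rank.
    candidates.sort(key=lambda m: cross_encoder_score(query, m.text), reverse=True)
    return candidates[:k_final]
```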
Cold-start done right
On a fresh user, the memory store is empty. Retrieval returns nothing; the agent behaves generically. This feels like forgetting — but it's actually the absence of any signal to retrieve. The perception is the same from the user's perspective.
Three tactics, in order of effectiveness:
Structured onboarding extraction. The first conversation is high-signal: users introduce themselves explicitly, state goals directly, and explain context they assume the system needs. Run extraction with higher aggressiveness on the first session — extract preferences, facts, and constraints that would normally be filtered out as marginal. The extraction quality threshold can be relaxed for session 1 without materially harming signal-to-noise, because any signal in session 1 is more valuable than typical signal.
After onboarding, ask structured questions: "What tools do you use most?" "What languages do you code in?" "What's your role?" These answers are directly extractable as structured facts with high source strength. They seed the store with reliable signal before the system has had time to observe organic behavior.
Cohort defaults. If you know anything about the user at signup (professional role, industry, platform — anything inferred from the onboarding context), populate cohort-based defaults. A user who signs up as a "backend engineer" probably wants code examples in a server-side language, probably doesn't want front-end framework suggestions first. These defaults get overwritten as real preferences emerge, but they prevent the "completely blank slate" first-impression failure.
Defaults should be stored with low confidence (source strength ≈ 0.20, reflecting that they're inferred from cohort, not stated) and should decay faster than normal memories. They exist only to fill the cold-start gap, not to persist indefinitely; a seeding sketch follows this list.
Treat absence as a result. A memory system that returns nothing should not cause the agent to guess. The correct response to empty retrieval is a question: "I don't have your preferences on file yet — what do you prefer?" That question, and the answer to it, seeds the store properly. Guessing based on general priors and being wrong is worse than asking once.
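A sketch of cohort-default seeding under the numbers above; the cohort table, the store.write signature, and the 14-day half-life are illustrative assumptions:

```python
import time

# Cohort priors are inferred, not stated -- hence the low source strength.
COHORT_DEFAULTS = {
    "backend_engineer": [
        ("prefers_examples_in", "server-side language"),
        ("deprioritize", "front-end framework suggestions"),
    ],
}

def seed_cold_start(user_id: str, cohort: str, store) -> None:
    for predicate, value in COHORT_DEFAULTS.get(cohort, []):
        store.write(
            user_id=user_id,
            predicate=(predicate, value),
            source_strength=0.20,     # inferred from cohort, not stated
            decay_half_life_days=14,  # decays faster than stated preferences
            created_at=time.time(),
        )
```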
Junk accumulation over time
The degradation trajectory for an indiscriminate-write store is predictable. Month 1: the store has perhaps 500 memories. Most are noise but the signal memories are findable — there aren't enough noise memories yet to crowd them out. Top-10 retrieval returns something useful roughly 60% of the time. Users don't notice.
Month 6: 5,000 memories. The junk-to-signal ratio has hit 90:10 based on typical extraction rates. Top-10 precision is below 30%. Users are starting to notice that the agent "doesn't seem to remember things."
Month 24: 30,000 memories. Precision is near the floor — in a well-clustered noise store, top-10 retrieval on a typical query returns 0 or 1 useful memories. The agent has become demonstrably worse than it was in month 1.
This degradation is not repairable by a cleanup job. A cleanup job can delete memories that look like noise (pleasantries, meta-talk) based on heuristic signals, but it cannot safely delete ambiguous memories without sweeping up signal with them. The 2.2% of useful memories are scattered through the corpus; a cleanup job aggressive enough to catch 95% of junk will typically misclassify and delete something like 10% of signal.
The fix has to be preventive. Write-time filtering — rejecting turns that produce no memorable facts before they ever enter the store — is the only sustainable approach. A filter that rejects 60% of turns at write time is worth more than any amount of post-hoc cleanup. It should be calibrated so that precision (fraction of stored memories that are useful) stays above 50% indefinitely, regardless of how long the system runs.
Context window crowding
Even a perfectly clean store with perfect retrieval can fail at the last step. Context windows are finite. At 128K tokens, with a typical prompt template consuming 20–40K tokens for system prompt, conversation history, and tool outputs, you have roughly 88–108K tokens left for injected memories. A typical stored memory is 50–150 tokens. That gives you a theoretical budget of roughly 700–2,000 memories per request.
In practice the budget is much tighter. Retrieval returns top-K candidates; you inject all of them. If K=50 and average memory length is 100 tokens, that's 5,000 tokens — a manageable 4% of the context window. But most implementations inject conversation history (often 10,000–30,000 tokens), tool call outputs (variable but often large), and various system-level context blocks. The effective memory budget might be 2,000–5,000 tokens, or 20–50 memories.
The failure mode: retrieval returns the top-50 memories. The token budget is exhausted after 30. Memories 31–50 don't get injected. Memory 31 happens to be the one the user is asking about.
The fix is explicit token budget management. Calculate the available budget before retrieval. Retrieve top-K where K is calibrated to the budget. Prioritize within the retrieved set by recency and confidence — newer, higher-confidence memories get token budget first. And log which memories were retrieved but not injected; that's a diagnostic signal for budget tuning.
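A sketch of that budget management, assuming a count_tokens helper and created_at/confidence fields on each memory:

```python
def inject_memories(memories: list, context_limit: int, other_tokens: int,
                    count_tokens, logger) -> list:
    """Fit retrieved memories into the remaining token budget.

    Newer, higher-confidence memories get budget first; anything that does
    not fit is logged rather than silently dropped.
    """
    budget = context_limit - other_tokens
    # Priority: recency first, confidence as the tiebreaker.
    ranked = sorted(memories, key=lambda m: (m.created_at, m.confidence),
                    reverse=True)

    injected, truncated = [], []
    for m in ranked:
        cost = count_tokens(m.text)
        if cost <= budget:
            injected.append(m)
            budget -= cost
        else:
            truncated.append(m)

    if truncated:  # the diagnostic signal for tuning K and the budget split
        logger.warning("retrieved-but-truncated: %s", [m.id for m in truncated])
    return injected
```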
The diagnostic workflow
When a user reports that the agent "forgot something," follow this sequence before touching the code:
Step 1: Does the memory exist? Query the store directly with the user's ID and a broad semantic search for the topic. If no matching memory exists, the failure is in the write pipeline — extraction missed it, or the session was never processed. Fix the write pipeline or reprocess the session.
Step 2: What rank did it appear at? Run the same query the agent would have run at retrieval time. Find the memory in the ranked result list. If it's outside top-K, the failure is ranking. Examine why: is there a noise memory with higher similarity? Is access_boost inflating a stale memory? Does the memory lack metadata tags that would have filtered the candidate set?
Step 3: Was it in the token budget? If the memory was in top-K, check whether it was actually injected into the context. Log the retrieved-and-injected list; compare it to the retrieved-but-truncated list. If the memory was retrieved but truncated, the budget needs adjustment.
Step 4: Was it superseded? Check the memory's superseded flag. If it's marked inactive, find the memory that superseded it. Was the supersession correct? If yes, the agent should have surfaced the newer memory instead — go back to step 2 with the newer one. If no, the supersession was a false positive — roll back and tune the conflict threshold.
Step 5: Was the reasoning correct? If the memory was retrieved, injected, and not superseded, diff the agent's completion with and without the memory injected. If the output doesn't change, the model isn't using the injected context — that's a prompt engineering problem, not a memory problem.
This five-step sequence resolves 95% of "the agent forgot" reports within one diagnostic pass. The most common outcomes are step 2 (ranking failure, usually a noise accumulation problem) and step 4 (supersession false positive, usually a threshold tuned too aggressively).
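The workflow is scriptable. A sketch, with the store, retriever, injection log, and completion-diff harness as stand-ins for your own stack (the supersession check runs early here, since a superseded memory never participates in retrieval):

```python
def diagnose_forgetting(user_id: str, topic: str, query: str,
                        store, retriever, injected_ids, diff_completions) -> str:
    # Step 1: does the memory exist at all?
    matches = store.search(user_id=user_id, text=topic, broad=True)
    if not matches:
        return "wrong store: fix the write pipeline / reprocess the session"
    memory = matches[0]

    # Step 4, checked early: an inactive memory never ranks in the first place.
    if not memory.active:
        return "superseded: verify the supersession, else roll back and retune"

    # Step 2: what rank did it get for the query the agent actually ran?
    ranked = retriever.rank(query, user_id=user_id)
    rank = next((i for i, m in enumerate(ranked) if m.id == memory.id), None)
    if rank is None or rank >= retriever.top_k:
        return f"wrong rank ({rank}): noise accumulation or missing metadata tags"

    # Step 3: was it inside the token budget?
    if memory.id not in injected_ids:
        return "retrieved but truncated: adjust the token budget"

    # Step 5: did the model actually use it?
    if not diff_completions(with_memory=memory):
        return "reasoning failure: a prompting problem, not a memory problem"
    return "memory was used: investigate the report itself"
```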
What good memory hygiene looks like
Aggressive write-time filtering. Reject turns that contain no memorizable content before running extraction. A turn is memorizable if it contains a fact about the user, a stated preference, a decision, or a constraint. Pleasantries, restatements, clarifications, and meta-talk are not memorizable. Typical rejection rates for conversational agents are 40–70% of turns. If your rejection rate is below 30%, your filter is too permissive.
Entity resolution at write time. Before storing a memory, normalize all entity references to canonical IDs. Don't let "VW" and "Volkswagen" exist as separate strings in the same store. The normalization pass is cheap compared to the retrieval quality improvement.
Automatic supersession on conflict detection. Every write triggers a conflict check. High-similarity + contradictory-predicate = supersession. The threshold (0.85 similarity for "same fact") should be tuned against a labeled sample of your production data. Don't set it so high that legitimate updates don't trigger it, or so low that synonym variation creates false positives.
Decay configured per memory type. Preferences decay slowly (months). Events decay quickly (days). Contextual state (what the user is working on right now) decays within hours. A single global decay rate is always wrong for some type. Configure per-type decay constants and apply them in the freshness calculation; a sketch follows this list.
Quarterly cleanup audits. Even with preventive filtering, stores accumulate marginal memories over time. A quarterly review of memories with low confidence and zero retrievals in 90 days is a safe cleanup signal — if a memory has never been useful, it's probably noise. Delete after manual spot-check, not automatically.
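A sketch of per-type decay in the freshness calculation, plus the quarterly audit filter; the half-lives and the 0.3 confidence cutoff are illustrative:

```python
# Per-type half-lives: a single global decay rate is always wrong for some type.
HALF_LIFE_SECONDS = {
    "preference": 90 * 86_400,  # months
    "event": 3 * 86_400,        # days
    "context": 6 * 3_600,       # hours: what the user is working on right now
}

def freshness(memory_type: str, age_seconds: float) -> float:
    """Exponential decay with a per-type half-life."""
    return 0.5 ** (age_seconds / HALF_LIFE_SECONDS[memory_type])

def audit_candidates(store: list, now: float) -> list:
    """Quarterly cleanup: low confidence AND zero retrievals in 90 days.

    Candidates go to a manual spot-check, never straight to deletion.
    """
    ninety_days = 90 * 86_400
    return [m for m in store
            if m.confidence < 0.3
            and now - m.last_retrieved_at > ninety_days]
```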
The common thread: memory quality is a write-time problem. The retrieval stage can compensate partially through re-ranking and budget management, but it can't overcome a fundamentally noisy store. Build the filter first.
Next
Each of these failures has a dedicated page. Start with the cost of junk memories for the economics, or the 7-stage write pipeline for the fix.