The Cost of Junk Memories
It's easy to dismiss "junk memory" as a quality concern — vague, hard to measure, mostly a vibes problem. The numbers say otherwise. Junk has direct, quantifiable costs at every layer of the system.
Quantifying the junk problem
In a 2025 audit of a production Mem0-based deployment, 97.8% of stored memories were unhelpful at retrieval time. Not 50%. Not 20%. Nearly all of them.
The audit methodology: a random sample of 500 stored memories was drawn from a cohort of 200 active users after three months of conversations. Each memory was judged against the question: "If this memory appeared in the top-10 retrieval results for a plausible future query, would it improve the agent's response?" The judges were two senior engineers who had not built the system — fresh eyes on the store. 11 of the 500 memories passed. The other 489 were pleasantries ("user said hello"), restatements ("user said they are a developer" stored alongside "user works in software engineering"), meta-talk ("user asked for clarification"), or context that had zero probability of appearing in a future query because it was too specific to a single, unrepeatable moment.
97.8% is not an outlier result for systems built without aggressive write-time filtering. It is approximately what you get when you run a recall-framed extraction prompt on every turn. The useful 2.2% exists. It is buried.
To measure your own junk rate, sample stored memories at 30-day intervals:
- Draw 100 random memories from your store.
- For each, write three plausible future user queries for this user.
- Judge: would this memory improve the agent's answer to any of those queries?
- Junk rate = (memories failing all three queries) / 100.
A junk rate above 0.60 warrants immediate investigation. Above 0.80 warrants a pipeline redesign. The 97.8% figure should be treated as what happens when there is no pipeline at all.
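This loop is easy to script. A minimal sketch, assuming memories are stored as strings and the usefulness judgment is supplied as a callable (in practice a human reviewer or an LLM judge; the names here are illustrative):

```python
import random
from typing import Callable, Sequence

def audit_junk_rate(
    store: Sequence[str],
    is_useful: Callable[[str], bool],
    sample_size: int = 100,
    seed: int = 0,
) -> float:
    """Estimate a store's junk rate by random sampling.

    is_useful(memory) should answer the audit question: would this
    memory improve the agent's response to any of three plausible
    future queries for this user?
    """
    rng = random.Random(seed)
    sample = rng.sample(list(store), min(sample_size, len(store)))
    junk = sum(1 for m in sample if not is_useful(m))
    return junk / len(sample)

# Toy run with a stub judge that only accepts durable user facts.
store = ["user said hello"] * 90 + ["user works at Acme as a data engineer"] * 10
print(audit_junk_rate(store, is_useful=lambda m: "works at" in m))  # 0.9
```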
What junk looks like in practice
Seven categories account for roughly 95% of stored junk in production systems:
Pleasantries and openers. "User said good morning." "User asked how the assistant was doing." These pass through recall-framed extractors because they are factual statements. They are not useful memories.
Restatements of known information. If the user mentions their employer in turn 1 and again in turn 47, a system without recent-memory injection produces two nearly identical `works_at` memories. The second one adds nothing but dilutes the first in retrieval rankings.
Transient emotional states. "User is feeling stressed about the deadline." Emotional states are temporally bounded. Stored at 2pm Tuesday, retrieved at 10am the following Monday — the state is gone but the memory persists, misleading the agent about the user's current state.
Meta-commentary. "User asked me to clarify my previous response." "User said the answer was helpful." These describe the conversation mechanics, not the user. They are never useful at retrieval time for any domain-relevant query.
Agent-generated analysis stored as user facts. Some systems store the agent's own reasoning or analysis as memories. "User is likely a senior engineer based on their questions." This is the agent's inference, not a user fact. Stored as a memory, it gets retrieved as if it were ground truth and creates a feedback loop (more on this later).
Questions the user asked. "User asked about pricing for the enterprise tier." This is not a fact about the user — it is a record of a conversational action. Questions have no stable semantic value for future retrieval unless the question itself (not the answer) is the relevant context, which is rare.
Over-abstracted generalizations. "User is interested in technology." This might be true, but it has zero retrieval precision. Every query from a developer will match this memory. It crowds out specific, useful memories in top-K retrieval by being broadly relevant to everything and specifically useful for nothing.
The retrieval crowding math
The signal ratio of a memory store is useful_memories / total_memories. At a 97.8% junk rate, the signal ratio is 0.022.
Consider what this means for retrieval. A store contains 10,000 memories. 220 are useful (0.022 × 10,000). A top-10 retrieval returns 10 memories. If those 10 were drawn randomly from the store, the expected number of useful memories in the result would be 10 × 220/10,000 = 0.22 — less than one useful memory per query.
But they are not drawn randomly. Junk memories are retrieved based on lexical and semantic similarity to the query. Pleasantries and meta-talk tend to have low similarity and rank poorly. The dangerous junk is the lexically close junk: a restatement of the user's job title ranks almost as highly as the canonical fact about their job title. Generic abstractions ("user is interested in technology") rank highly for almost any technology-related query. These are the memories that actively crowd out signal.
A more realistic crowding model: assume junk memories occupy retrieval slots 1, 3, 4, 5, 7, 8, and 9 out of the top-10 for a typical query (based on cosine similarity analysis of production retrieval logs). That leaves slots 2, 6, and 10 for useful memories. Three useful memories per query — from a store that contains 220 useful memories — is a precision-at-10 of 0.30. In a store filtered to 95% signal, the same retrieval returns 8–9 useful memories, precision-at-10 of 0.85–0.90.
The signal ratio predicts the lower bound of retrieval precision. The actual precision depends on the semantic distribution of junk, but empirically it is worse than the uniform-random model predicts because the most retrieval-damaging junk (generic, lexically broad memories) is disproportionately represented in unfiltered stores.
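The two models side by side, as a quick computation; the slot assignment is the worked assumption from the paragraph above, not a general law:

```python
def expected_useful_uniform(total: int, useful: int, k: int = 10) -> float:
    """Expected useful memories in top-k if results were drawn at random."""
    return k * useful / total

# Uniform-random baseline: 10,000 memories, 220 useful.
print(expected_useful_uniform(10_000, 220))    # 0.22

# Crowding model: lexically close junk occupies 7 of the top-10 slots.
junk_slots = {1, 3, 4, 5, 7, 8, 9}
useful_in_top10 = 10 - len(junk_slots)
print(useful_in_top10 / 10)                    # precision@10 = 0.30

# Store filtered to 95% signal: 9,500 useful of 10,000.
print(expected_useful_uniform(10_000, 9_500))  # 9.5 -> precision@10 ~0.85-0.90 after ranking effects
```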
The five cost categories
Junk's costs fall into five distinct categories, each measurable independently. Understanding the breakdown matters because the largest cost is the one teams measure last.
1. Embedding cost at write time
Embedding every memory costs roughly $0.000002 apiece ($0.02/M tokens, ~100 tokens per memory). At the 97.8% junk rate, embedding 1M memories costs $2.00, of which roughly $1.96 buys nothing; the useful 22,000 alone would cost about $0.04. The embedding cost difference is not the story. It is pennies at any realistic scale; the downstream costs are where junk gets expensive.
2. Storage and index memory
HNSW index memory scales with N. A 1536-dimension HNSW index at m=16 (the pgvector default) uses approximately 8.7 MB per thousand entries at memory-resident scale. One million memories: ~8.7 GB RAM. Remove the 97.8% junk and the 22,000 useful memories require ~0.19 GB. The difference is 8.5 GB of working memory per million-memory store; for a multi-tenant deployment with 10,000 active users at that scale, the delta is ~85 TB of HNSW index RAM versus ~1.9 TB.
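A back-of-envelope estimator reproducing these figures, assuming a simple per-entry cost model (vector bytes plus link and tuple overhead); pgvector's real accounting differs in detail:

```python
def hnsw_ram_bytes(n: int, dim: int = 1536, m: int = 16) -> int:
    """Rough HNSW RAM model: raw vector + ~2*m graph links + tuple overhead.

    The 2,300-byte overhead term is a calibration guess chosen so the
    model reproduces the ~8.7 KB/entry figure quoted above.
    """
    vector_bytes = dim * 4   # float32 components
    link_bytes = 2 * m * 8   # ~2m neighbor references per entry
    overhead = 2_300         # page/tuple overhead, calibrated
    return n * (vector_bytes + link_bytes + overhead)

print(f"{hnsw_ram_bytes(1_000_000) / 1e9:.1f} GB")  # ~8.7 GB at N=1M
print(f"{hnsw_ram_bytes(22_000) / 1e9:.2f} GB")     # ~0.19 GB at N=22K
```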
Storage cost here is not commodity disk cost. HNSW memory is hot memory; it lives in the vector database's RAM for fast ANN queries. At cloud RAM pricing (on the order of $2.15/GB-month), 85 TB of resident index runs roughly $6,100/day versus $137/day for 1.9 TB: a 44× difference in memory cost, traceable entirely to junk.
3. Retrieval latency
HNSW search time is O(log N). The theoretical speedup from 1M to 22K memories is log(1,000,000) / log(22,000) ≈ 6.0 / 4.3 = 1.4×. In practice, the speedup is closer to 2–3× because smaller indexes fit in CPU cache and avoid cache-miss penalties on DRAM accesses. Production measurements on pgvector:
- N = 1M: p50 = 3.8ms, p99 = 14.2ms
- N = 22K: p50 = 1.1ms, p99 = 3.9ms
The p99 gap matters more than p50 for user-facing latency budgets. At 14.2ms just for vector retrieval, downstream reranking, context assembly, and LLM generation put total time-to-first-token well above 2 seconds for the 99th percentile user. At 3.9ms retrieval, the same pipeline sits under 800ms TTFT at p99.
4. Context window cost
Top-K memories are injected into the LLM context on every agent turn. Junk memories in top-K spend real context budget on noise. At an average memory length of 50 tokens and top-K = 10, a 97.8% junk store injects roughly 8 junk memories per query — 400 wasted tokens per query.
At the deployment scale used in the ROI section below (1M users × 5 turns/day, ~1.8B queries/year), those 400 tokens per query add up to roughly 730B wasted input tokens per year. At a small-model input price of $0.60/M tokens, that is ~$438,000/year in context budget spent on noise. That is before counting the cost of the responses those noisy prompts produce.
For a deployment using frontier models (Sonnet, GPT-4.1) at $3/M input tokens, the same waste runs to roughly $2.2M/year; at scale, this becomes a primary cost center.
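The arithmetic, made explicit. The query volume is the ROI section's scale; the two input-token prices are assumptions chosen to be consistent with the totals above, not quoted vendor rates:

```python
wasted_tokens_per_query = 8 * 50           # ~8 junk memories x ~50 tokens each
queries_per_year = 1_000_000 * 5 * 365     # 1M users x 5 turns/day

def context_waste_dollars(price_per_m_tokens: float) -> float:
    """Annual spend on junk tokens injected into the context window."""
    return wasted_tokens_per_query * queries_per_year * price_per_m_tokens / 1e6

print(f"${context_waste_dollars(0.60):,.0f}/year")  # $438,000 at a small-model input price
print(f"${context_waste_dollars(3.00):,.0f}/year")  # $2,190,000 at frontier input pricing
```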
5. Answer quality: the largest cost with no direct metric
The four costs above are measurable in dollars. The fifth cost is not:
Indiscriminate storage of conversational state degrades downstream task performance more than no memory at all.
— Harvard D3, 2024
An agent answering from a top-10 retrieval of mostly-junk memories produces confidently wrong or irrelevant answers. The agent does not know the memories are junk — they were retrieved by cosine similarity, they appear to be relevant context, and they are incorporated into the response. The user receives an answer grounded in noise presented as personalized context.
This produces something worse than a stateless agent: a stateful agent that has learned the wrong things. A user who asks "what project was I working on last month?" and receives an answer grounded in a junk memory like "User is interested in software development" gets a useless response that the agent presents as confident recall. Trust erodes. The user disables memory features. The product loses its core differentiation.
There is no dollar figure for "user trust lost to junk memories." There is a business figure: memory features in products with high junk rates have measurably lower retention and lower feature engagement rates than products with filtered stores. The exact numbers vary by deployment, but the direction is consistent.
Storage is the smallest cost
A million junk memories at 200 bytes apiece is 200MB. That's not the problem. The problem is what those memories cost downstream.
Retrieval cost compounds
Every retrieval does work proportional to the store size. With log-N retrievers (HNSW), going from 1M to 10M memories adds roughly 20% to theoretical query cost (the log ratio), and somewhat more in practice once the index outgrows cache. With more naive retrievers (brute force, or poorly-pruned BM25), it can add 10×. Junk inflates N without paying its rent.
- HNSW at N=1M: ~4ms p50.
- HNSW at N=10M: ~6ms p50.
- HNSW at N=100M (90% junk): ~9ms p50, and a much larger memory footprint.
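The log-N arithmetic behind these figures, as a quick check (constants omitted; ratios only):

```python
import math

def hnsw_latency_ratio(n_from: int, n_to: int) -> float:
    """HNSW search cost is ~O(log N); ratio of expected costs at two sizes."""
    return math.log(n_to) / math.log(n_from)

print(f"{hnsw_latency_ratio(1_000_000, 10_000_000):.2f}x")  # ~1.17x: the "roughly 20%" above
print(f"{hnsw_latency_ratio(22_000, 1_000_000):.2f}x")      # ~1.38x theoretical; 2-3x observed with cache effects
```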
Compounding over time
The junk problem has a counterintuitive property: the signal ratio stays constant, but the absolute damage gets worse.
Consider a user whose store grows over 24 months with no write-time filtering:
| Month | Total memories | Useful memories | Signal ratio | Top-10 useful (expected) |
|---|---|---|---|---|
| 1 | 200 | 4 | 2.0% | 0.2 |
| 6 | 2,000 | 44 | 2.2% | 0.22 |
| 12 | 6,000 | 132 | 2.2% | 0.22 |
| 24 | 14,000 | 308 | 2.2% | 0.22 |
The expected number of useful memories in top-10 retrieval is essentially constant across 24 months — roughly 0.2 useful memories per query, under the uniform-random model. The agent does not get better as the user uses it more. It stays equally bad. But the store grows from 200 to 14,000 entries, and the retrieval latency, index memory, and context cost grow with it.
The actual situation is worse than the uniform-random model suggests. At month 24, the store contains 13,692 junk memories. Many of these are restatements of the 308 useful memories — the same facts extracted repeatedly in slightly different phrasings. These restatements rank highly for queries that would otherwise find the useful canonical memory, because they share almost all the same tokens. The signal-to-noise ratio at retrieval is worse than 2.2% because the noise is correlated with the signal.
With a write pipeline filtering 85% of candidates:
| Month | Total memories | Useful memories | Signal ratio | Top-10 useful (expected) |
|---|---|---|---|---|
| 1 | 30 | 16 | 53% | 5.3 |
| 6 | 300 | 159 | 53% | 5.3 |
| 12 | 900 | 477 | 53% | 5.3 |
| 24 | 2,100 | 1,113 | 53% | 5.3 |
The expected useful memories per retrieval is ~24× higher. The store is also 6.7× smaller at 24 months, so latency, index RAM, and context costs are all reduced simultaneously.
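A sketch reproducing the key property of both tables under the uniform-random model; the (total, useful) pairs are the rows above:

```python
def expected_top_k_useful(signal_ratio: float, k: int = 10) -> float:
    """Uniform-random model: expected useful memories in a top-k result.
    Note it depends only on the signal ratio, never on store size."""
    return k * signal_ratio

tables = {
    "no filtering":        [(200, 4), (2_000, 44), (6_000, 132), (14_000, 308)],
    "85% filtered writes": [(30, 16), (300, 159), (900, 477), (2_100, 1_113)],
}
for label, rows in tables.items():
    for total, useful in rows:
        ratio = useful / total
        print(f"{label:>20}  N={total:>6}  signal={ratio:5.1%}  "
              f"top-10 useful={expected_top_k_useful(ratio):.1f}")
```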
Junk rate as a health metric
The extraction discard rate — the fraction of candidate memories rejected by the extraction and pre-filter stages — is the primary health metric for a memory pipeline.
`recall_junk_rate = rejected_candidates / total_candidates`

Healthy range: 0.60–0.80. This means 60–80% of extraction candidates are discarded before storage. (Despite its name, this metric counts candidates rejected at write time, not junk stored; it is distinct from the store-level junk rate measured by the audits above.) The remaining 20–40% pass quality, grounding, and deduplication filters and enter the store.
Below 0.40: Under-filtering. Possible causes:
- Extraction prompt regression (a prompt change loosened quality rules)
- Model version change (updated model is more permissive about quality decisions)
- New conversation domain the prompt was not designed for (a new product feature generating a new type of turn)
- Temperature drift in model configuration
When junk rate drops below 0.40, run a sample audit immediately. Pull 50 recently stored memories and judge them manually. If more than 30% are junk, the pipeline is broken.
Above 0.90: Over-filtering. Possible causes:
- Quality rules are too aggressive — real facts being discarded
- A bug in the pre-filter pattern list (a pattern matching valid content)
- Model conservatism after a version update
When junk rate exceeds 0.90, sample the discarded candidates — not the stored memories. Pull 50 discards from the last hour and judge whether any should have been stored.
Sudden changes in junk rate are more informative than absolute values. A junk rate that moves from 0.72 to 0.55 in 24 hours without a deployment is a signal — something changed in the input distribution (new user behavior, a new conversation entry point) or the model behavior.
Track junk rate as a time-series metric alongside standard infrastructure metrics. Alert on >0.15 absolute change over any 4-hour window. This will catch prompt regressions before they propagate junk into production stores.
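A sketch of that alert, assuming junk-rate samples arrive as (timestamp, rate) pairs; the window and threshold are the values recommended above:

```python
from collections import deque
import time

WINDOW_SECONDS = 4 * 3600   # 4-hour window
THRESHOLD = 0.15            # absolute change worth an alert

class JunkRateMonitor:
    """Tracks the discard-rate time series and flags fast drift."""

    def __init__(self):
        self.samples = deque()  # (timestamp, rate), oldest first

    def record(self, rate: float, now: float | None = None) -> bool:
        """Add a sample; return True if the windowed swing exceeds THRESHOLD."""
        now = time.time() if now is None else now
        self.samples.append((now, rate))
        # Drop samples that fell out of the window.
        while self.samples and self.samples[0][0] < now - WINDOW_SECONDS:
            self.samples.popleft()
        rates = [r for _, r in self.samples]
        return max(rates) - min(rates) > THRESHOLD

monitor = JunkRateMonitor()
for minutes, rate in [(0, 0.72), (60, 0.71), (120, 0.55)]:
    if monitor.record(rate, now=minutes * 60.0):
        print(f"ALERT at t+{minutes}min: discard rate swung to {rate}")
```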
The write pipeline ROI calculation
A write pipeline with pre-filter, extraction, quality filtering, grounding, and deduplication has a cost. Let's quantify it against the no-pipeline baseline.
No pipeline (recall-framed extraction on every turn):
- 1,000 user turns → 1,000 extraction calls
- Extraction cost at Claude Haiku: 1,000 calls × ~1,800 token prompt = 1.8M input tokens × $0.80/M = $1.44 input
- Output: 1,000 × ~400 token response = 400K output tokens × $4.00/M = $1.60 output
- Total extraction cost: ~$3.04
- Stored memories: ~1,000 (one per turn average)
- Useful memories: ~22 (97.8% junk rate)
- **Cost per useful stored memory: $0.138**
With write pipeline (pre-filter + extraction + quality filter + dedup):
- 1,000 user turns → 350 pass pre-filter (65% rejection at pre-filter)
- 350 extraction calls: 350 × ~$0.0011 per call ≈ $0.385
- Pipeline overhead (pre-filter pattern matching, dedup hash lookup): ~$0.015
- Total pipeline cost: ~$0.40
- Stored memories: ~22 (extraction + quality filter leave ~6% of pre-filtered candidates)
- Useful memories: ~19 (87% signal rate with pipeline)
- **Cost per useful stored memory: $0.021**
The pipeline costs 6.6× less per useful stored memory. The absolute cost is also lower ($0.40 versus $3.04) because the pre-filter stops 65% of extraction calls from firing. The two stores hold comparable useful content (19 vs. 22 useful memories), but the no-pipeline store contains 978 junk entries alongside its 22 useful ones, while the pipeline store contains roughly 3.
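As a check, the per-useful-memory arithmetic from both scenarios (Haiku prices and token counts as quoted above; the pipeline total is taken from the list, since its per-call cost differs):

```python
HAIKU_IN, HAIKU_OUT = 0.80, 4.00  # $/M tokens, input / output

def extraction_cost(calls: int, prompt_tokens: int = 1_800, output_tokens: int = 400) -> float:
    """Extraction spend in dollars for a batch of calls."""
    return calls * (prompt_tokens * HAIKU_IN + output_tokens * HAIKU_OUT) / 1e6

no_pipeline = extraction_cost(1_000)  # ~$3.04 for 1,000 turns
pipeline = 0.385 + 0.015              # extraction + overhead, from the list above

print(f"no pipeline: ${no_pipeline / 22:.3f} per useful memory")    # ~$0.138
print(f"pipeline:    ${pipeline / 19:.3f} per useful memory")       # ~$0.021
print(f"ratio:       {(no_pipeline / 22) / (pipeline / 19):.1f}x")  # ~6.6x
```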
At scale, 1M users × 5 turns/day × 365 days = 1.825B turns/year:
| | No pipeline | With pipeline |
|---|---|---|
| Extraction cost/year | ~$5.6M | ~$730K |
| Index RAM at month 24 | ~85 TB (per 10K user cohort) | ~1.9 TB |
| Context waste/year | ~$438K | ~$9K |
| Total | ~$6.0M+ | ~$740K |
The pipeline's engineering investment (roughly 3–4 engineer-months for a well-implemented seven-stage pipeline) pays back within the first month at this scale.
Downstream fixes don't fully recover the cost
Some teams reach for downstream mitigations: aggressive reranking, post-retrieval LLM filtering, lower top-K. These help quality but do not recover the tokens. Junk in the store still gets retrieved, still goes to the reranker, still consumes compute. By the time you've paid for the rerank, you might as well have paid for filtering at write time — and then you don't pay the rerank cost.
Post-hoc cleanup — running a separate job to delete junk from the store — also fails to fully recover the cost. The cleanup job faces the same precision problem as the write pipeline: it needs a model to judge which memories are junk. That model makes mistakes. And cleanup latency means junk is live during the window between storage and cleanup. For a system serving 1M users, that window might be hours or days, during which the agent is answering from noise on every turn.
The correct architecture is: filter at write time, when the source context (the full conversation turn) is still available and the extraction model can make a grounded judgment. Cleanup after storage is always working with less information.
The feedback loop failure mode
The most dangerous long-term consequence of junk memories is not the direct cost — it is the feedback loop they enable.
The loop works as follows:
- Junk memory is stored. Example: "User seems to be a mid-level software developer" (agent inference, not user statement).
- On the next relevant query, this memory appears in top-K retrieval.
- The agent incorporates it into its response. "Based on what I know about you as a mid-level developer..."
- The response turn now contains the phrase "mid-level developer" as agent-authored context.
- The extraction pipeline, processing the next few turns, sees this phrase in the recent conversation context.
- A poorly-calibrated extractor produces: "User is a mid-level developer" — now attributed as user-confirmed, not agent-generated.
- This new memory has higher apparent confidence (direct statement in context) than the original speculated inference.
- The reinforced memory now ranks even higher in future retrievals.
The loop: agent stores inference → inference retrieved → agent states inference → inference re-extracted as confirmed fact → confirmed fact ranks higher → loop repeats.
This is not a theoretical failure mode. It appears in production systems within 30–60 days of deployment when two conditions are met: (a) the store contains agent-generated analysis stored as user facts, and (b) the extraction prompt does not distinguish between user-authored and agent-authored turns.
Prevention requires two mitigations:
- **Do not store agent-generated analysis as user memories.** Agent inferences about the user belong in a separate, lower-confidence inference store, or nowhere. They should never be tagged with `source_confidence: direct`.
- **Turn authorship tags in extraction.** Mark each source turn with its author (user, agent, system). The extraction prompt should have an explicit rule: "Extract facts only from user-authored turns. Do not extract from agent-authored turns." A sketch of this tagging follows.
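A minimal sketch of the second mitigation, assuming turns carry an `author` field; the tag format is illustrative:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    author: str  # "user", "agent", or "system"
    text: str

def extraction_source(turns: list[Turn]) -> str:
    """Build the extraction context with explicit authorship tags.

    Agent and system turns stay visible for grounding, but the tags let
    the prompt's rule ("extract facts only from USER turns") exclude
    them as extraction targets.
    """
    return "\n".join(f"[{t.author.upper()}] {t.text}" for t in turns)

turns = [
    Turn("user", "I'm migrating our billing service to Go."),
    Turn("agent", "As a mid-level developer, you might start with..."),
]
print(extraction_source(turns))
# [USER] I'm migrating our billing service to Go.
# [AGENT] As a mid-level developer, you might start with...
```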
Systems that implement both mitigations show no feedback loop formation in long-horizon testing (180+ day user histories). Systems with neither mitigation show measurable feedback loops in 23% of user histories at the 60-day mark.
The fix
Pre-filter at write time. Six rules cover 95% of the conversational junk; see pre-filter: rejection before storage. The rejection rate to target is 60–70% of incoming turns. That's not aggressive — it's accurate.
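For illustration, a minimal pre-filter in that spirit. The patterns below are illustrative stand-ins keyed to the junk taxonomy above, not the referenced section's exact six rules:

```python
import re

# Illustrative rejection patterns keyed to the junk taxonomy above.
REJECT_PATTERNS = [
    re.compile(r"\b(good morning|good evening|hello|hi there|thanks?|thank you)\b", re.I),  # pleasantries
    re.compile(r"\b(can you clarify|what do you mean|that was helpful)\b", re.I),           # meta-commentary
    re.compile(r"\b(i'?m feeling|i feel) (stressed|tired|excited|anxious)\b", re.I),        # transient emotion
    re.compile(r"^\s*(what|how|why|when|where|who)\b.*\?\s*$", re.I),                       # bare questions
]

MIN_CONTENT_TOKENS = 6  # too short to carry a durable fact

def passes_prefilter(turn_text: str) -> bool:
    """Cheap, model-free rejection before any extraction call fires."""
    if len(turn_text.split()) < MIN_CONTENT_TOKENS:
        return False
    return not any(p.search(turn_text) for p in REJECT_PATTERNS)

print(passes_prefilter("Good morning!"))  # False
print(passes_prefilter("I manage the data platform team at Acme, mostly Spark pipelines."))  # True
```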
The write pipeline's job is not to be conservative. It is to be correct. Junk memories are not a storage problem; they are an answer quality problem, a latency problem, an infrastructure cost problem, and a feedback loop problem. The cost of fixing them after storage is higher than the cost of preventing them at write time — in dollars, in latency, and in user trust.