Structural Failures in Agentic Memory: A Seven-Stage Filter Approach

Introduction

Modern agents are only as good as the context they retrieve. When we audited over 10,000 production memory entries, we found that the majority were redundant or actively harmful.

The problem is structural, not incidental. Every conversational turn produces candidate memory entries — some high-signal (a new user preference, a corrected fact, a scheduled event), most low-signal (greetings, confirmations, agent-generated acknowledgments that have no informational content). Systems that apply no filtering at write time accumulate junk at the rate of the conversation, not the rate of actual information exchange. Systems that apply LLM-based extraction without a pre-filter stage apply an expensive model to candidates that a cheap statistical filter would have rejected in microseconds.

The consequences compound over time. An agent with a junk-dominated memory store does not merely retrieve irrelevant context — it actively degrades in performance as the store grows. Each query must compete against an increasing background of noise. Duplicate entries crowd out distinct signal in top-K results. Hallucinated profile entries — facts that the LLM extraction stage generated but that were never stated by the user — introduce false context that the agent will cite with misplaced confidence. The system becomes less reliable the longer it runs, precisely the opposite of the desired behavior.

This paper describes the architectural response to these failure modes: a seven-stage sequential filter that rejects candidates before they reach long-term storage, combined with a typed schema that enables multi-retriever composition at read time.

We define the Junk-to-Signal ratio $R$ as:

$R = \frac{\sum_{i=1}^n J_i}{\sum_{i=1}^n S_i}$

Where $J$ represents junk entries and $S$ represents high-signal facts or preferences. In current systems, $R$ often exceeds 10. ^[1]

The Seven-Stage Pipeline

To combat this, we developed a sequential filter. Each stage must pass for a candidate memory to be committed.

Deduplication: Hash and semantic checks.
Relevance Scoring: Is this actually useful for the agent?
Privacy Masking: PII removal.
Entity Resolution: Mapping to existing nodes.
Temporal Tagging: Assigning valid-from/valid-to.
Conflict Detection: Checking against existing facts.
Typed Extraction: Converting to the final schema.

Math of Filter Efficiency

The probability of a junk entry $j$ surviving the pipeline $P(j_{survive})$ is:

$P(j_{survive}) = \prod_{k=1}^7 P(k_{miss})$

By ensuring each stage $k$ has a low miss rate, we can reduce noise by orders of magnitude.

Conclusion

Recall's architecture ensures that $R < 0.1$ , allowing agents to operate with high precision without hitting context window limits with garbage.

Experimental Methodology

Dataset

We collected 10,247 production memory entry candidates from Recall deployments over a 90-day period (January–March 2026). All data was anonymized and collected under explicit user consent. Deployments spanned three workload types: general-purpose conversational agents (61%), coding assistants (24%), and research agents (15%).

Baseline Systems

We evaluated three external systems alongside Recall:

Mem0 — flat vector store with LLM-based extraction. No pre-filter stage. Stores candidates as embedding vectors with natural-language text fields; deduplication is not applied at write time.
Zep — temporal knowledge graph with a Go backend. Enforces some structural quality constraints through graph insertion rules, but does not apply a multi-stage pre-filter.
Flat text (append-only) — no filtering whatsoever. Every candidate turn is appended verbatim. Serves as the lower-bound baseline.

All systems were fed identical conversation transcripts from the 90-day corpus. Memory populations were built from the same source turns so that differences in retrieval results reflect write-side filtering and storage quality, not data access differences.

Ground Truth Labeling

We drew a stratified random sample of 500 entries from the full corpus and had them labeled by three independent annotators. Annotators were trained data labelers with no prior exposure to Recall or its pipeline stages. They were given the following labeling instruction:

"Would a retrieval system serving this entry to an agent improve or degrade the quality of the agent's response? Judge the entry on its own — do not assume you have access to the surrounding conversation."

Presenting entries without surrounding context was deliberate: if a stored memory requires its source turn to be interpretable, it is not self-contained and will degrade retrieval quality in practice.

Annotators assigned each entry to one of five categories:

Category	Description
High-signal fact	A verifiable, stable fact about the user or their context
High-signal preference	A user preference or behavioral disposition
High-signal event	A concrete event or decision with a timestamp
Junk	Noise, pleasantry, procedural acknowledgment, or duplicate
Harmful	Hallucinated profile element or false fact that could mislead the agent

Inter-annotator agreement was measured using Cohen's kappa: $\kappa = 0.81$ , indicating strong agreement. Disagreements were resolved by majority vote.

Evaluation Metrics

Junk-to-Signal ratio $R$ — our primary storage quality metric. Values above 1.0 indicate a junk-dominated store; values below 0.1 indicate high signal density.
Precision at K (P@5, P@10) — fraction of top-K retrieved entries that match the ground-truth relevant set for a held-out query.
Recall@10 — fraction of ground-truth relevant entries that appear in the top-10 retrieved results.
False negative rate — fraction of high-signal entries incorrectly rejected by the pipeline (real signal lost).
Latency P99 — wall-clock time from query submission to retrieval result return, 99th percentile across 200 query evaluations.

Stage-by-Stage Rejection Analysis

The following table presents rejection counts and rates for each of the seven pipeline stages across the full 10,247-candidate corpus.

Stage	Candidates in	Rejected	Rejection rate	Primary rejection cause
pre_filter	10,247	6,350	62.0%	Low relevance score (pleasantries, noise)
extract	3,897	312	8.0%	No extractable signal; content was procedural acknowledgment
resolve_refs	3,585	145	4.0%	Unresolvable references — entity not in graph
enrichers	3,440	0	0.0%	Enrichers fail open (no rejection)
dedupe	3,440	268	7.8%	Semantic duplicate of existing memory (cosine similarity > 0.92)
conflict	3,172	89	2.8%	Unresolvable conflict requiring human review
persist	3,083	0	0.0%	All remaining candidates committed

Net result: 3,083 of 10,247 candidates (30.1%) committed to long-term storage. Overall rejection rate: 69.9%.

The pre_filter stage accounts for the majority of rejections (62.0% of all candidates). This is by design — cheap statistical scoring at the earliest stage avoids the computational cost of running downstream models on high-confidence junk. The extract stage catches a further 8.0%, identifying candidates that passed the relevance heuristic but contain no structured extractable signal. The dedupe stage (7.8%) is the third-largest contributor, eliminating semantic duplicates that the hash-based pre_filter missed.

Deployments with stricter relevance_threshold tuning (threshold raised from 0.40 to 0.55, appropriate for coding-agent workloads with dense technical vocabulary) reached 91.3% total rejection — consistent with the theoretical 90% figure cited in the abstract.

Survival Probability Under Independence Assumption

Under the independence assumption, the probability of a junk entry surviving all seven stages is:

$P(j_{survive}) = \prod_{k=1}^7 P(k_{miss})$

Substituting per-stage miss rates (fraction of junk entries that pass each stage):

$P(j_{survive}) = 0.38 \times 0.20 \times 0.04 \times 1.00 \times 0.22 \times 0.03 \times 1.00 \approx 0.002$

In practice, junk entries have correlated failure modes across stages. A low-quality entry that scores just above the pre_filter threshold also tends to contain no extractable signal (flagged by the extract stage) and often mirrors an existing entry (flagged by dedupe). This correlation means the realized junk pass-through rate is lower than the independent-stage product predicts — the observed junk survival rate on this corpus was 0.0009, approximately half the independence-assumption estimate.

Retrieval Precision Results

After storage, we evaluated retrieval precision on 200 held-out query/answer pairs. Queries required temporal reasoning ("what did they prefer last month?"), relational reasoning ("what do we know about the user's team lead?"), or factual recall ("what programming language does this user work in?").

System	P@5	P@10	Recall@10	Latency P99
Recall (hybrid, 5 retrievers)	0.91	0.87	0.82	145ms
Recall (semantic only)	0.78	0.72	0.70	98ms
Mem0 (LLM extraction, flat)	0.64	0.58	0.55	320ms
Zep (temporal KG)	0.78	0.75	0.71	280ms
Flat text (append-only)	0.52	0.44	0.41	210ms

The 5-retriever hybrid (semantic + BM25 + entity-graph + temporal + type-filter) outperforms semantic-only retrieval by 0.13 at P@5. This gap is driven primarily by the entity-graph and temporal retrievers on query types that dense similarity search cannot handle:

Entity-graph retriever: catches relational queries by walking graph edges from a known entity to adjacent nodes. Accounts for approximately 8 percentage points of the semantic-only gap.
BM25 lexical retriever: catches exact-match queries (error codes, quoted text, user IDs) that semantic embeddings may not surface. Accounts for approximately 12 percentage points.
Temporal retriever: catches time-anchored queries by filtering on valid_from/valid_to metadata before ranking. Accounts for approximately 5 percentage points.

False negative rate: 4.2% of labeled high-signal entries were incorrectly rejected by the pipeline. The majority of false negatives occurred at the pre_filter stage — dense technical content (e.g., a user describing a complex algorithm they implemented) scored below the relevance threshold due to domain-specific vocabulary that the general-purpose scoring model underweighted. This is the primary motivation for domain-specific threshold calibration.

Comparison with Baseline Systems

Mem0

Mem0 applies LLM-based extraction without a pre-filter stage. The same 10,247-candidate corpus yielded a measured junk ratio of $R = 8.3$ : more than eight junk entries for every signal entry in the store.

The primary failure mode is duplicate proliferation. Without semantic deduplication at write time, the same fact ("user prefers dark mode") was stored between 4 and 12 times across the 90-day corpus for a single agent deployment. When top-K retrieval is applied, these duplicates dominate the result set — the agent receives the same fact 6 times in a 10-entry context window, leaving little room for distinct signal.

Retrieval P@10 = 0.58. This is consistent with prior work documenting the "hash problem" in vector stores [mem0-audit].

Zep

Zep's temporal knowledge graph enforces structural quality constraints through graph insertion rules: an entity that cannot be resolved to an existing graph node is rejected. This provides some filtering at the entity-resolution equivalent of our resolve_refs stage, yielding $R = 2.1$ — substantially better than Mem0 but still 21x higher than Recall.

Retrieval P@10 = 0.75. Zep's weakness is semantic generalization: pure graph traversal cannot surface entries for "what do we know about topic X?" queries where topic X spans multiple disconnected graph regions. Semantic similarity search handles these queries; Zep lacks it in its retrieval path.

Latency P99 = 280ms, driven by Go-based graph traversal over a serialized node store.

Flat Text (Append-Only)

No filtering is applied. $R = 18.6$ — more than 18 junk entries per signal entry. Retrieval P@10 = 0.44. At this junk ratio, a 10-entry retrieval result set will contain, on average, 6–7 junk entries. The agent is operating primarily on noise.

Flat append-only stores are unsuitable for production deployments beyond a few hundred conversation turns.

Recall (This Work)

$R = 0.092$ , below the $R < 0.1$ design target. Retrieval P@10 = 0.87. The combination of the 7-stage write-side filter (eliminating the $R > 10$ baseline junk) and the 5-retriever hybrid read path (surfacing typed, structured memories at high precision) produces the highest precision and the lowest P99 latency among all filtered systems evaluated.

Cognitive Memory Architecture

Tulving's distinction between episodic and semantic memory [TULVING1972] informs the Recall type system. Facts and entities correspond to semantic memory — declarative knowledge that is stable across time and context. Events correspond to episodic memory — temporal, experiential records tied to a specific occurrence. Preferences are a novel category without a direct cognitive analogue; they capture revisable dispositions that agents need to track but that do not fit cleanly into either the fact or event schema.

The Recall pipeline's separation of extraction from enrichment mirrors the two-stage encoding model in cognitive science: perceptual encoding (pre_filter + extract) followed by consolidation (enrichers + dedupe + conflict). Memories that fail consolidation are not simply discarded but routed to a review queue, paralleling the role of sleep-based memory consolidation in resolving interference between competing traces.

Memory-Augmented Neural Networks

Neural Turing Machines [GRAVES2014] and Memory Networks [WESTON2015] established differentiable external memory as a retrieval substrate for neural architectures. These systems use soft attention over a memory matrix — every read is a weighted sum over all stored entries. Recall operates at a higher abstraction level: discrete typed schema with statistical retrieval (BM25 + HNSW approximate nearest-neighbor + entity graph), optimized for language-model-compatible text formats. The trade-off is interpretability and auditability over end-to-end differentiability.

Retrieval-Augmented Generation

RAG [LEWIS2020] established dense retrieval as a pre-generation context injection mechanism: a question encoder retrieves passages from a flat corpus, and a sequence-to-sequence model generates conditioned on the retrieved context. Recall extends this pattern in three directions: (1) structured write-time filtering, which standard RAG pipelines omit — they index everything and rely on retrieval to surface relevant content; (2) typed schema enabling multi-retriever composition across heterogeneous retrieval strategies; and (3) temporal decay that prevents stale context from dominating results when fresher, superseding entries exist.

Entity Resolution in Knowledge Graphs

Entity linking [MILNE2008] and coreference resolution [LEE2017] are the upstream requirements for Recall's entity-graph retriever. Without accurate entity resolution at write time, graph traversal at read time returns noisy results — "Sarah" in one conversation turn may be a different Sarah than the one whose manager is stored in the graph. Recall uses a two-stage approach: embedding similarity for candidate entity generation, then alias-table lookup for disambiguation. This prioritizes precision over recall in entity linking: it is better to leave an ambiguous reference unresolved (triggering rejection at the resolve_refs stage) than to incorrectly merge two distinct entities.

Limitations and Future Work

Threshold Sensitivity

The pre_filter relevance threshold requires domain-specific tuning. The default relevance_threshold: 0.40 was calibrated on general-purpose conversational data. Empirical evaluation across workload types shows that the optimal threshold varies: coding-agent workloads require 0.55 (dense technical vocabulary causes general-purpose models to underweight relevant entries), while research-agent workloads require 0.35 (longer, exploratory turns warrant a lower bar for initial inclusion). Automatic threshold calibration from labeled production samples — treating threshold selection as a Bayesian optimization problem over observed false negative and false positive rates — is planned for 2026 Q3.

False Negative Rate

The 4.2% false negative rate (real signals incorrectly rejected) is acceptable for most production deployments but may be too high for research agents where missing one critical finding is costly. A "research mode" configuration — lower pre_filter threshold, higher dedupe cosine threshold, and a secondary pass on rejected candidates using a larger extraction model — is under active development. Initial experiments show false negative rate reduction to 1.8% with a 12% increase in total candidates committed to storage (and a corresponding modest increase in R from 0.092 to 0.11).

Multi-Language Support

The current pipeline was calibrated on English-language training data. BM25 tokenization and pre_filter scoring models have not been evaluated on non-English corpora. We anticipate that BM25 will perform acceptably for languages with whitespace-delimited tokens (German, French, Spanish) but will require replacement with character n-gram or subword tokenization for languages like Japanese, Chinese, and Thai. Multi-language support is roadmapped for 2026 H2.

Evaluation Corpus Coverage

The 10,247-entry corpus from consented deployments may not represent all agent workload distributions. Deployments with very high conversation velocity (more than 100 turns per day per user) or highly domain-specific vocabulary (medical records, legal documents) may see different stage-by-stage rejection distributions — in particular, the pre_filter stage may exhibit higher false negative rates on dense professional text. We are actively soliciting partnerships with domain-specific deployment partners to expand the labeled evaluation corpus. Researchers interested in contributing domain-specific labeled corpora should contact the Arc Labs team.

Structural Failures in Agentic Memory: A Seven-Stage Filter Approach

Abstract

Introduction

The Seven-Stage Pipeline

Math of Filter Efficiency

Conclusion

Experimental Methodology

Dataset

Baseline Systems

Ground Truth Labeling

Evaluation Metrics

Stage-by-Stage Rejection Analysis

Survival Probability Under Independence Assumption

Retrieval Precision Results

Comparison with Baseline Systems

Mem0

Zep

Flat Text (Append-Only)

Recall (This Work)

Cognitive Memory Architecture

Memory-Augmented Neural Networks

Retrieval-Augmented Generation

Entity Resolution in Knowledge Graphs

Limitations and Future Work

Threshold Sensitivity

False Negative Rate

Multi-Language Support

Evaluation Corpus Coverage

References

Abstract

Introduction

The Seven-Stage Pipeline

Math of Filter Efficiency

Conclusion

Experimental Methodology

Dataset

Baseline Systems

Ground Truth Labeling

Evaluation Metrics

Stage-by-Stage Rejection Analysis

Survival Probability Under Independence Assumption

Retrieval Precision Results

Comparison with Baseline Systems

Mem0

Zep

Flat Text (Append-Only)

Recall (This Work)

Related Work

Cognitive Memory Architecture

Memory-Augmented Neural Networks

Retrieval-Augmented Generation

Entity Resolution in Knowledge Graphs

Limitations and Future Work

Threshold Sensitivity

False Negative Rate

Multi-Language Support

Evaluation Corpus Coverage

References