LLM Extraction as Filtering, Not Recall
Most early agent memory systems use rolling extraction — every turn is sent to an LLM with a prompt like "extract every fact stated here." The result is a store full of restated questions, rephrased instructions, and trivial assertions that contribute nothing at retrieval time. The fix is a prompt-level reframing: extraction is not a recall problem. It is a precision problem. The LLM is not a stenographer; it is the second filter.
Reframe the question
Compare two prompts:
```
# Recall-framed (bad)
List every factual statement made in the user turn below.

# Filter-framed (good)
Identify any durable user-specific facts, preferences, events, entities, or
relations in the user turn below — content worth remembering across sessions.
If the turn is acknowledgement, restatement, or transient state, return an
empty list.
```
The first prompt is a recall problem: extract everything. The second is a filter problem: extract only the memorable. The model behavior changes substantially. Empty-list returns on conversational fluff happen reliably with the second framing and almost never with the first.
Why one LLM call, not three
The naive architecture has three sequential calls: one to extract raw candidates, a second to classify and type them, and a third to ground each candidate against the source turns. That design is logical but wrong in practice.
The original prototype ran all three calls, with a combined p50 latency of 1,100ms and a cost per turn of roughly $0.0038. Collapsing extraction, classification, and grounding into a single call brought p50 latency to roughly 420ms and cost per turn to roughly $0.0011 — a 62% latency reduction and a 71% cost reduction. Those are not incremental improvements; they change the economics of running the pipeline on every conversational turn.
The single-call design also produces better output. In the three-call design, the classification call operates on a list of decontextualized candidates — it does not see the source turns those candidates came from. The grounding call sees candidates but does not see the quality judgments. Each stage is partially blind. In the single-call design, the model extracts, classifies, and grounds in one coherent pass. It knows why it extracted something when it decides whether to keep it. Coherent context produces more consistent output than staged context.
The third argument is failure handling. Three sequential calls means three surfaces for timeouts, retries, partial failures, and schema violations. One call means one surface. A failure on extraction is also a failure on classification and grounding — there is nothing to classify or ground. The failure envelope shrinks.
The cost breakdown at Claude Haiku 4.5 pricing (~$4/M output tokens):
| Design | Calls | p50 latency | Cost/turn |
|---|---|---|---|
| Three-call (extract → classify → ground) | 3 | ~1,100ms | ~$0.0038 |
| Single-call | 1 | ~420ms | ~$0.0011 |
The model is configurable. Default is Claude Haiku 4.5. GPT-4.1-mini, Gemini Flash, and local Llama 3.1 via Ollama are all supported through the AnyLlm wrapper, which normalizes provider-specific APIs (structured output, JSON mode, constrained decoding) to a single interface.
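The wrapper's internals are not shown here, but the shape of "normalize provider-specific APIs to a single interface" is straightforward. A minimal sketch in Python, with hypothetical names (ProviderAdapter, ExtractionClient) standing in for the real AnyLlm types:

```python
import json
from dataclasses import dataclass
from typing import Any, Protocol


class ProviderAdapter(Protocol):
    """One adapter per provider. Each knows how to request JSON output for its
    own API (tool schema, JSON mode, or constrained decoding) and returns the
    raw JSON text."""

    def complete(self, system: str, user: str) -> str: ...


@dataclass
class ExtractionClient:
    """Provider-agnostic front door: the extraction pipeline never sees which
    provider sits behind the adapter."""

    adapter: ProviderAdapter

    def extract(self, system_prompt: str, dynamic_context: str, source_turns: str) -> list[dict[str, Any]]:
        text = self.adapter.complete(
            system=system_prompt,
            user=f"{dynamic_context}\n\n{source_turns}",
        )
        return json.loads(text).get("memories", [])
```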
The XML prompt architecture
The extraction system prompt is structured with XML tags, not JSON. This is a deliberate choice. LLMs handle XML-delimited sections more reliably than nested JSON in the system prompt position. The reason is training distribution: XML-style tags appear heavily in chain-of-thought and document-structure contexts in pretraining data. Using XML for section boundaries and JSON for the output schema gives each format the job it is best at.
The system prompt contains four sections:
```
<output_schema>
{
"memories": [{
"type": "fact" | "preference" | "event" | "entity" | "relation",
"subject": entity_id or null,
"predicate": short snake_case verb or attribute,
"object": { "literal": "..." } | { "entity": "ent_..." } | { "list": [...] },
"content": human-readable canonical sentence,
"event_at": ISO8601 timestamp (events only),
"source_confidence": "direct" | "confirmed" | "inferred" | "speculated",
"source_turn_ids": [array of turn IDs],
"quality_decision": "keep" | "discard",
"quality_reason": "short explanation",
"confidence_adjustment": -0.2 to +0.2,
"grounding_verdict": "Supported" | "Partial" | "Unknown" | "NotSupported"
}]
}
</output_schema>
<type_rules>
1. Only extract STABLE information. Do not extract:
- Transient emotional states ("I'm tired today")
- Meta-commentary about the conversation ("can you clarify?")
- Tool call outputs
- Questions the user asked
2. Resolve coreferences in the output
3. Prefer concrete over abstract
4. One fact per memory
5. For events, set event_at
6. Do NOT re-extract memories in the "Recent memories" list
7. If a statement contradicts a recent memory, still extract it — conflict handled downstream
8. If nothing is worth extracting, return {"memories": []}
</type_rules>
<quality_rules>
- KEEP if: direct statement of stable fact, concrete preference, specific event
- DISCARD if: generic ("user likes things"), speculation ("maybe X"),
transient ("feeling tired right now"), restating known info verbatim
- Bias: when in doubt, DISCARD.
</quality_rules>
<grounding_rules>
- "Supported": directly stated in source turns (no confidence penalty)
- "Partial": partially supported, some details inferred (penalty: -0.15)
- "Unknown": can't trace to specific source turn (penalty: -0.10)
- "NotSupported": not in source turns → discard
</grounding_rules>
```

The output_schema section defines both the shape of each memory object and the fields that carry extraction judgment (quality_decision, grounding_verdict, confidence_adjustment). These are not metadata fields added after extraction — they are part of the extraction itself. The model produces a quality decision in the same generation pass that produces the content field.
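One way to carry that schema into code is a validation model. A sketch using Pydantic (a library choice of this sketch, not a requirement); the field names, enums, and ranges are taken directly from the output_schema above:

```python
from typing import Any, Literal, Optional

from pydantic import BaseModel, Field


class Memory(BaseModel):
    type: Literal["fact", "preference", "event", "entity", "relation"]
    subject: Optional[str] = None      # entity_id or null
    predicate: str                     # short snake_case verb or attribute
    object: dict[str, Any]             # {"literal": ...} | {"entity": "ent_..."} | {"list": [...]}
    content: str                       # human-readable canonical sentence
    event_at: Optional[str] = None     # ISO8601 timestamp, events only
    source_confidence: Literal["direct", "confirmed", "inferred", "speculated"]
    source_turn_ids: list[str]
    quality_decision: Literal["keep", "discard"]
    quality_reason: str
    confidence_adjustment: float = Field(default=0.0, ge=-0.2, le=0.2)
    grounding_verdict: Literal["Supported", "Partial", "Unknown", "NotSupported"]


class ExtractionOutput(BaseModel):
    memories: list[Memory]
```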
Why predicate is snake_case: The predicate is designed to be used programmatically downstream in deduplication and relation traversal. A free-text predicate like "works at" differs superficially from "is employed by" and "has a job at" — a normalized snake_case verb (works_at) makes these equivalent at the storage layer without an additional normalization step. The extraction prompt enforces this by example in the few-shot section.
Prompt caching: The static sections — output_schema, type_rules, quality_rules, grounding_rules, and the few-shot examples — are placed at the top of the system prompt with cache_control: ephemeral on the Anthropic API. These sections total roughly 1,800 tokens. At a typical conversation turn rate, 85–90% of extraction calls are served from the cache. At roughly 1,296 extraction calls per day, the savings compound significantly compared to no caching. For OpenAI-compatible endpoints (GPT-4.1-mini, Gemini Flash), JSON mode is enabled instead of constrained decoding — the output schema is embedded in the prompt itself rather than passed as a function definition.
Ordering matters for caching. The static sections must precede any dynamic content. If user-specific context (recent memories, active entities) appears before the rules sections, the cache key changes on every request and the cache never warms. The ordering is: system prompt statics → context injections (dynamic) → source turns (dynamic).
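A minimal sketch of that ordering with the Anthropic Python SDK. The model identifier is a placeholder and the section contents are elided; the point is that the static prefix carries cache_control while everything user-specific comes after it:

```python
import anthropic

client = anthropic.Anthropic()

# output_schema, type_rules, quality_rules, grounding_rules, and few-shot
# examples, concatenated once at startup (contents elided here).
STATIC_RULES: str = "..."


def run_extraction(active_entities: str, recent_memories: str, source_turns: str) -> str:
    response = client.messages.create(
        model="claude-haiku-4-5",  # placeholder id; the model is configurable
        max_tokens=2048,
        system=[
            # Static sections first, marked cacheable. Identical on every call,
            # so the cache key stays stable and warms after the first request.
            {
                "type": "text",
                "text": STATIC_RULES,
                "cache_control": {"type": "ephemeral"},
            },
            # Dynamic context after the cache breakpoint.
            {
                "type": "text",
                "text": (
                    f"<active_entities>\n{active_entities}\n</active_entities>\n"
                    f"<recent_memories>\n{recent_memories}\n</recent_memories>"
                ),
            },
        ],
        messages=[{"role": "user", "content": source_turns}],
    )
    return response.content[0].text
```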
Context injection: what the LLM knows
Three dynamic sections are injected between the static system prompt and the source turns:
Active entities (up to 30). The memory store maintains a set of entities that are "in scope" for the current user — people, organizations, products, places that have appeared in previous sessions. These are injected so the extraction model can resolve coreferences to stable entity IDs rather than creating new entity records for the same person. The limit of 30 is not arbitrary: in synthetic conversation analysis, 95th-percentile active entity counts peak at 28 for a six-month user. Above 30, the incremental benefit drops while prompt length grows linearly. Each entity entry is compact — ent_012 | Priya Sharma | person | manager — approximately 60 tokens for 30 entities.
Recent memories (up to 15). The most recently stored memories for this user are injected with the explicit instruction "Do NOT re-extract memories present in this list." Without this, the same fact is extracted on every turn that references it. A user who mentions their employer in turn 1 and again in turn 47 would otherwise produce the same works_at memory twice. With recent memory injection, the turn-47 extraction produces nothing for that fact — it is already in the store. Fifteen memories is enough to cover a typical session (median session length is 8 turns in production data); anything older than the recent window is handled by the deduplication stage downstream. Each memory entry is approximately 80 tokens; 15 entries cost ~1,200 tokens per call.
Source turns (up to 20, truncated at 2,000 characters each). The extraction model operates on the raw conversation turns. The 20-turn window covers 95% of extraction-relevant context in production conversations — facts stated more than 20 turns back either already appear in the recent memories list (if previously extracted) or are genuinely stale and should not be re-extracted. The 2,000-character truncation per turn prevents runaway context growth from pathological turns (tool call outputs, large code pastes) without losing the conversational text that actually drives extraction.
Total dynamic context per call: roughly 60 (entities) + 1,200 (recent memories) + 20 × average_turn_length tokens. For a typical 200-character average turn, that is ~4,000 tokens of dynamic context on top of the ~1,800 cached static tokens.
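A sketch of the injection step under those limits. The record shapes (id, name, kind, content, text fields) are assumptions of this sketch; the caps and the per-turn truncation length are the ones described above:

```python
MAX_ENTITIES = 30
MAX_RECENT_MEMORIES = 15
MAX_TURNS = 20
MAX_TURN_CHARS = 2_000


def build_dynamic_context(entities: list[dict], recent: list[dict], turns: list[dict]) -> tuple[str, str]:
    """Returns (context_block, source_turns_block) for one extraction call."""
    entity_lines = [
        f"{e['id']} | {e['name']} | {e['kind']} | {e.get('note', '')}"
        for e in entities[:MAX_ENTITIES]
    ]
    memory_lines = [f"[{m['id']}] {m['content']}" for m in recent[:MAX_RECENT_MEMORIES]]
    turn_lines = [
        # Truncate pathological turns (tool outputs, large code pastes) per turn.
        f"[{t['id']}] {t['text'][:MAX_TURN_CHARS]}"
        for t in turns[-MAX_TURNS:]
    ]
    context = (
        "<active_entities>\n" + "\n".join(entity_lines) + "\n</active_entities>\n"
        "<recent_memories>\n"
        "Do NOT re-extract memories present in this list.\n"
        + "\n".join(memory_lines) + "\n</recent_memories>"
    )
    return context, "\n".join(turn_lines)
```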
The five types as filter criteria
The prompt should anchor on the five memory types. The model returns candidates that are each one of:
- Fact — a stable predicate ("user works at Volkswagen").
- Preference — a mutable choice ("user prefers dark mode").
- Event — a timestamped occurrence ("user joined the platform team on 2026-05-09").
- Entity — an identity introduction ("Volkswagen is an organization").
- Relation — a typed edge ("user reports to Sarah").
Anything that does not fit one of these five drops on the floor. The act of forcing the model to pick a type is itself a filter — content that resists typing usually was not memorable to begin with.
The bias toward rejection
The quality_rules section ends with one sentence that carries disproportionate weight: "when in doubt, DISCARD."
This is not a style preference. It is the fundamental prior of the system.
Consider what happens over a long-running agent deployment. A user interacts with an agent daily for six months. That is roughly 900 turns. Without a rejection bias, an extraction model running on each turn might produce 2–4 candidates per turn — 1,800 to 3,600 stored memories in six months from one user. With a rejection bias that produces 0.5 candidates per turn on average, the same six months yields 450 memories. The difference in stored volume is 4–8×.
Now consider retrieval. A top-10 retrieval from 450 memories in a compact, high-quality store will surface relevant context on 70–80% of queries (based on internal evals). The same retrieval from 3,600 memories, most of them junk, degrades to 20–30% relevance at top-10. The failure mode is not that useful memories are missing — they are in the store. The failure mode is that they are buried under noise, and the cosine similarity scores of junk memories are not uniformly low. Many junk memories share surface-form terms with real queries and rank highly despite being semantically useless.
The alternative mental model — "store everything, clean up later" — fails in practice for three reasons:
- Cleanup is never as complete as pre-filtering. Post-hoc deletion runs face the same precision problem: you need a model to identify junk, and that model makes mistakes. The difference is that pre-filter mistakes are cheaper — a falsely discarded extraction is a miss; a falsely kept junk memory costs retrieval and context budget on every future query that matches it.
- Cleanup requires a second LLM call over stored data — more expensive than catching junk at write time when the source context is still available.
- Cleanup latency means junk is live in the store for some window. If the agent is active during that window, it is answering from noise.
The grounding_rules reinforce the rejection bias from a different angle. A candidate grounded as "NotSupported" — meaning the model cannot trace it to any source turn — is discarded unconditionally. The grounding verdict is not just a quality signal; "NotSupported" is a hard discard rule. This catches a specific failure mode: the extraction model hallucinating plausible-sounding facts that were never stated. Without the grounding check, these hallucinations enter the memory store and become "remembered facts" that the agent will cite confidently in future turns.
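As a sketch, the write-time filter implied by these rules is a handful of lines: NotSupported is a hard discard, and the other verdicts only carry the penalties listed in grounding_rules into the downstream confidence calculation.

```python
GROUNDING_PENALTY = {
    "Supported": 0.0,
    "Partial": -0.15,
    "Unknown": -0.10,
}


def apply_write_filters(memories: list[dict]) -> list[dict]:
    kept = []
    for m in memories:
        if m["quality_decision"] == "discard":
            continue  # the model's own quality filter
        if m["grounding_verdict"] == "NotSupported":
            continue  # hard discard: cannot be traced to any source turn
        # Penalty feeds the composite confidence score computed downstream.
        m["grounding_penalty"] = GROUNDING_PENALTY[m["grounding_verdict"]]
        kept.append(m)
    return kept
```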
What the source_confidence field means
Each extracted memory carries a source_confidence field with four possible values. These map directly to numeric weights in the confidence formula (see the confidence formula article):
| Value | Meaning | Source strength |
|---|---|---|
| direct | Explicitly stated by the user in the source turns | 1.0 |
| confirmed | Stated and subsequently confirmed, or stated more than once | 1.0 |
| inferred | Logically follows from what was stated, but not explicit | 0.75 |
| speculated | Plausible given context but not clearly supported | 0.30 |
The distinction between inferred and speculated is where extraction models most frequently make errors. "Georgian works in software" is a direct statement. "Georgian probably works in a technical role" is speculation — the word "probably" signals it. "Georgian is a backend engineer" inferred from "Georgian is debugging a Rust memory leak" is inference — it follows from the evidence but was not stated. The extraction prompt uses examples in the few-shot section to calibrate this boundary.
A speculated extraction is not automatically discarded. It is kept if the model assigns quality_decision: keep and the content is specific enough to be useful. But it enters the store with a source strength of 0.30, which means the composite confidence score will be low — typically 0.25–0.40 depending on other formula components. Low-confidence memories are not surfaced in high-stakes contexts (agent tools, structured outputs), only in low-stakes contexts (conversational hints). The confidence field governs where a memory can be used, not whether it is stored.
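A sketch of how those values are consumed downstream. The source strengths are the table values; the 0.5 high-stakes threshold is an illustrative assumption, not a documented constant:

```python
SOURCE_STRENGTH = {
    "direct": 1.0,
    "confirmed": 1.0,
    "inferred": 0.75,
    "speculated": 0.30,
}

HIGH_STAKES_THRESHOLD = 0.5  # assumed value; tune per deployment


def usable_in(memory: dict, context: str) -> bool:
    """Gate where a memory may surface, not whether it is stored."""
    confidence = memory["confidence"]  # composite score; source strength is one input
    if context == "high_stakes":       # agent tools, structured outputs
        return confidence >= HIGH_STAKES_THRESHOLD
    return True                        # conversational hints accept low-confidence memories
```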
Convention and constraint tags
Preferences, conventions, and constraints are all stored as memories, but they receive different tags that control runtime behaviour:
- No tag (preference): "User prefers dark mode." Surfaced in agent context when relevant. Not enforced when irrelevant.
- convention tag: "Always respond in bullet points." Injected into the agent system prompt on every turn, regardless of query. The agent honours it without the user repeating it.
- constraint tag: "Never reveal internal pricing details." Injected with a warning prefix that signals the agent to treat it as a hard rule.
A worked example illustrates why the distinction matters. Suppose a user says, in session 1: "I work on an enterprise sales team, so please always use formal language." This is a convention, not a preference. The next session, the user asks a casual question — "what's the weather like today?" A preference extraction would treat the formality request as context-dependent. A convention extraction tags it ["convention"], and the agent uses formal language on the weather question. That is the correct behavior.
The extraction prompt identifies conventions by pattern — instructions that use "always," "every time," "whenever," "make sure to" — and constraints by prohibition patterns — "never," "do not," "don't," "avoid." These heuristics catch roughly 85% of user-stated rules. The remaining 15% are type-tagged as preferences and may be re-extracted as conventions if they recur across sessions (the recurrence signal is handled by the confidence formula's recurrence component).
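Those patterns are simple enough to express as a first-pass check, for example as a regex heuristic run over the extracted content (a sketch; the real tagging happens inside the extraction prompt itself):

```python
import re

CONVENTION_PATTERN = re.compile(r"\b(always|every time|whenever|make sure to)\b", re.IGNORECASE)
CONSTRAINT_PATTERN = re.compile(r"\b(never|do not|don't|avoid)\b", re.IGNORECASE)


def rule_tags(content: str) -> list[str]:
    """Tag preference-like memories that read as standing rules."""
    if CONSTRAINT_PATTERN.search(content):
        return ["constraint"]   # hard rule, injected with a warning prefix
    if CONVENTION_PATTERN.search(content):
        return ["convention"]   # injected into the system prompt on every turn
    return []                   # plain preference, surfaced only when relevant
```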
A worked extraction example
Source turns:
```
[turn_042] I just finished my Arrive interview! The Stockholm team
wants someone strong in Java and React.

[turn_043] Priya will be my manager if I join. She seems great.
```

The extraction model produces:
- Event — type: event, subject: ent_Georgian, predicate: completed_interview, object: { entity: "ent_Arrive_Stockholm" }, event_at: <current timestamp>, source_confidence: direct, quality_decision: keep, grounding_verdict: Supported
- Fact — type: fact, subject: ent_Arrive_Stockholm, predicate: requires_skills, object: { list: ["Java", "React"] }, source_confidence: direct, quality_decision: keep, grounding_verdict: Supported, confidence_adjustment: -0.1 (the requirement is stated as what "the team wants," not as a formal job description — slight downgrade)
- Entity — type: entity, subject: ent_Priya, predicate: is_person, object: { literal: "Priya" }, source_confidence: direct, quality_decision: keep, grounding_verdict: Supported
- Relation — type: relation, subject: ent_Priya, predicate: potential_manager_of, object: { entity: "ent_Georgian" }, source_confidence: inferred, quality_decision: keep, grounding_verdict: Partial, confidence_adjustment: -0.15 (the relationship is conditional on joining)
Discarded:
- "She seems great" →
quality_decision: discard,quality_reason: TransientOpinion - "Georgian is considering joining Arrive" → already in recent memories as of the previous turn
- "Georgian is a job seeker" →
quality_decision: discard,quality_reason: TooGeneric(nothing actionable about the abstraction) - "Arrive Stockholm team is hiring" →
quality_decision: discard,quality_reason: TooGeneric(no subject-specific predicate)
The four kept memories from these two turns are specific, typed, grounded, and will be useful at retrieval time. The four discarded candidates would have added noise without adding signal.
Five discard patterns worth keeping on hand:
| Candidate | Decision | Reason |
|---|---|---|
| "Georgian is a software developer at Google" | DISCARD | Hallucinated employer — NotSupported |
| "Georgian feels excited about the interview" | DISCARD | TransientState |
| "User mentioned software development" | DISCARD | TooGeneric |
| "The user is probably a senior engineer" | DISCARD | Speculation without evidence |
| "Georgian asked about pricing" | DISCARD | Question, not statement |
Error handling
The extraction call returns a JSON array. Three failure modes occur in production:
Invalid JSON. The AnyLlm wrapper catches a JSON parse failure and retries once. On retry, it prepends "Return valid JSON only, no prose:" to the user message. The retry succeeds in roughly 92% of first-retry cases. If the retry also fails, the turn is logged as extraction-failed and skipped — no memories are written, no exception propagates to the caller. An extraction failure on one turn should not block the conversation.
Empty response. A valid {"memories": []} is not an error. It is the expected outcome for turns containing only pleasantries, meta-commentary, or content the model correctly determined was not memorable. These are logged separately from parse failures to keep failure metrics clean. Target empty-response rate in production is 35–50% of turns — anything below 30% suggests the extraction prompt may be under-filtering.
Schema violation on individual memories. The wrapper validates each memory object against the output schema. Objects that fail validation (missing required fields, invalid type enum values, malformed event_at timestamps) are dropped individually. Valid objects in the same response are kept. This "partial acceptance" approach is critical: a single malformed memory in a batch of five should not discard the other four. The validation failure is logged with the offending memory content for prompt regression monitoring.
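A sketch of those three behaviours in code, reusing a schema model like the Memory class sketched earlier. The retry prefix is the one quoted above; the module and log labels are assumptions:

```python
import json
import logging
from typing import Callable

from pydantic import ValidationError

from memory_schema import Memory  # the validation model sketched earlier (module name hypothetical)

log = logging.getLogger("extraction")


def parse_extraction(call_llm: Callable[[str], str], user_message: str) -> list[dict]:
    """call_llm(text) returns raw model output; result is a list of validated memory dicts."""
    raw = call_llm(user_message)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # One retry with an explicit JSON-only instruction prepended.
        raw = call_llm("Return valid JSON only, no prose:\n" + user_message)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            log.warning("extraction-failed: unparseable JSON after retry")
            return []  # skip the turn; never raise into the conversation path

    memories = data.get("memories", [])
    if not memories:
        log.info("empty-extraction")  # expected outcome, tracked separately from failures
        return []

    valid = []
    for m in memories:
        try:
            valid.append(Memory.model_validate(m).model_dump())  # partial acceptance
        except ValidationError:
            log.warning("schema-violation: %r", m)  # drop this object, keep the rest
    return valid
```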
Batch extraction vs turn-level
Some systems run extraction at end-of-session over the full transcript. This is legitimately useful — themes and arcs only appear at session-scale — but it is not a substitute for turn-level extraction. Run both:
- Turn-level → per-turn candidates (facts, preferences, events).
- Session-level → arc/theme summaries; "user is exploring relocation".
The session-level call is more expensive but runs once per session, not once per turn. The two extraction modes produce different memory types: turn-level extracts individual facts and events; session-level extracts context summaries, inferred goals, and arc-level patterns that only become legible when the full session is visible. Neither replaces the other.
Model tier and confidence
The extraction model is the largest single confidence input. A small fast model produces more candidates per turn but with higher false-positive rates. A frontier model produces fewer, more discriminating candidates. The confidence formula includes the extractor as a weighted component for exactly this reason — downstream stages need to know how much to trust the upstream judgment.
Default: Claude Haiku 4.5. Typical false-positive rate on the curated eval set: 6.2%. Upgrade path: Claude Sonnet 4.5 reduces false-positive rate to 2.8% at 4× the cost. The tradeoff is only worth making for high-stakes deployments where junk memory has compounding consequences — enterprise agents, long-horizon planning assistants, health or legal contexts.
Evaluating extraction quality
Extraction quality has two axes that move in opposite directions:
Precision: Of the memories actually stored, what fraction are genuinely useful at retrieval time? Target: >80%. Method: sample 100 stored memories per week; judge each manually as "would this be useful context for a plausible future query?" Anything below 70% precision is a signal the extraction prompt is under-filtering or the model is hallucinating.
Recall: Of the conversational turns that contained something worth storing, what fraction produced at least one stored memory? Target: >60%. Method: sample 50 turns that produced zero extractions; judge each manually as "did this turn contain anything memorable?" A zero-extraction rate above 55% is a signal the extraction prompt is over-filtering — real facts are being discarded.
The "bias toward discard" design intentionally sacrifices some recall for precision. An 80% precision / 65% recall point is better than a 50% precision / 85% recall point for a long-running memory system. The reason is asymmetric cost: a missed extraction can always be recovered if the information recurs (and recurring information will eventually pass the confidence threshold). A stored junk memory costs retrieval budget on every future query that matches it, indefinitely.
The most useful diagnostic metric in production is the extraction discard rate — the fraction of candidates produced that receive quality_decision: discard. Healthy range: 40–60%. Below 30% suggests the model is not applying quality rules aggressively enough (prompt regression, model version change, or new conversation domain). Above 75% suggests the model is over-filtering, possibly due to temperature drift or prompt injection by users crafting turns to defeat the extractor.
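A sketch of that check over a window of logged candidates (the healthy band and alert thresholds are the ones above; the candidate record shape is an assumption):

```python
def discard_rate(candidates: list[dict]) -> float:
    """Fraction of extracted candidates the model itself marked discard."""
    if not candidates:
        return 0.0
    discarded = sum(1 for c in candidates if c["quality_decision"] == "discard")
    return discarded / len(candidates)


def check_discard_rate(candidates: list[dict]) -> str:
    rate = discard_rate(candidates)
    if rate < 0.30:
        return f"LOW ({rate:.0%}): quality rules applied too loosely; check for prompt regression"
    if rate > 0.75:
        return f"HIGH ({rate:.0%}): over-filtering; check temperature drift and adversarial turns"
    return f"OK ({rate:.0%})"
```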