The 7-Stage Write Pipeline

By Arc Labs Research · 22 min read

Most agent memory failures are not retrieval failures. They are write failures: a system stored too much, of the wrong kind, with too little context. By the time retrieval runs, the relevant memory exists — it just cannot beat the noise on score. The cheapest fix is to reject earlier. The seven-stage write pipeline puts a sequence of progressively more expensive checks in front of every conversational turn. Each stage either rejects, transforms, or hands the candidate to the next. By the time persistence runs, the pipeline has already discarded the vast majority of inputs.

Figure: Funnel · 100 turns through the pipeline — each bar is the survivor count after that stage.

Why a funnel, not a fan-out

The instinct in many systems is to treat every turn as a potential memory: extract, embed, store, then clean up later. This pattern is expensive (every turn costs an LLM call) and fragile (cleanup is unbounded — bad memories surface for years). A funnel inverts the cost curve. Cheap rule-based filters run first; LLMs only see the survivors. By the time a candidate reaches the entity resolution stage, fewer than half of the original turns are still in the system. By the time it reaches conflict detection, fewer than a quarter remain.

The seven stages

  • Pre-filter. Pattern and length rules: greetings, acknowledgements, code-only blocks, meta-talk. Free; runs on every turn.
  • Extract. An LLM generates candidate memories from the surviving turn. The prompt is framed as filtering, not recall — see the dedicated page on LLM extraction as filtering.
  • Classify. Each candidate is tagged with a memory type — fact, preference, event, entity, relation. Type is load-bearing for retrieval, supersession, and decay.
  • Resolve entities. Pronouns and references become stable identities via a four-stage cascade: pronoun rules → grammar parse → fuzzy match → LLM judge.
  • Dedupe. Hash-equality, then cosine-similarity, then LLM judge for the ambiguous middle. See three tiers of dedup.
  • Conflict check. Does this candidate contradict an existing memory? If yes, supersede the older one (keep for audit) instead of overwriting.
  • Persist. Atomic write: vector index, lexical index, entity graph, and ledger commit together or not at all.
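
To make the funnel shape concrete, here is a minimal Python sketch of the control flow, with stubbed stages. Every identifier is hypothetical — the point is the structure: each stage can reject, and a rejection short-circuits every more expensive stage behind it.

from typing import Optional

def pre_filter(turn: str) -> Optional[str]:
    # Stage 1 (stub): free pattern/length rules. None means "reject this turn".
    if len(turn.split()) < 4 or turn.strip().lower() in {"hi", "thanks", "ok"}:
        return None
    return turn

def extract_and_classify(turn: str) -> list[dict]:
    # Stage 2 (stub): a single LLM call returning typed candidates.
    # An empty list means the turn contained nothing worth storing.
    return [{"text": turn, "type": "fact"}]

def resolve_dedupe_conflict(candidate: dict) -> Optional[dict]:
    # Stages 3-5 (stub): resolve references, collapse duplicates,
    # supersede or link on conflict. None means the candidate was dropped.
    return candidate

def process_turn(turn: str) -> list[dict]:
    kept = pre_filter(turn)
    if kept is None:
        return []                                  # rejected at zero LLM cost
    persisted = []
    for candidate in extract_and_classify(kept):   # only survivors pay for the LLM call
        candidate = resolve_dedupe_conflict(candidate)
        if candidate is not None:
            persisted.append(candidate)            # Stage 6: the atomic persist goes here
    return persisted

print(process_turn("hi"))                                # [] -- pre-filter reject
print(process_turn("Sarah moved to Lisbon last month"))  # one surviving candidate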

The rejection budget is the spec

A common configuration target: 30–50% pre-filter rejection, 25–30% extraction rejection (turns that produce no candidates), 15–20% dedupe collapse. Net: 80–90% of incoming turns produce zero net new memories. That is not a bug. The benchmark is not candidates produced; it is retrieval precision at 30 days. Aggressive rejection at write time is the cheapest input to that metric.
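
One way to make the budget operational is to encode the target ranges and alert when live rates drift outside them. A minimal sketch — the ranges come from the paragraph above, everything else (names, fields) is illustrative:

REJECTION_BUDGET = {
    "pre_filter_reject":    (0.30, 0.50),  # share of turns dropped by pre-filter
    "extract_no_candidate": (0.25, 0.30),  # share of surviving turns yielding nothing
    "dedupe_collapse":      (0.15, 0.20),  # share of candidates merged into existing memories
}

def out_of_budget(observed: dict[str, float]) -> list[str]:
    # Compare observed rates against the budget; anything outside its band is
    # a signal to investigate, not an automatic rollback.
    alerts = []
    for metric, (lo, hi) in REJECTION_BUDGET.items():
        rate = observed.get(metric)
        if rate is None or not lo <= rate <= hi:
            alerts.append(f"{metric}={rate} outside [{lo}, {hi}]")
    return alerts

print(out_of_budget({"pre_filter_reject": 0.18,       # sudden drop: investigate upstream
                     "extract_no_candidate": 0.28,
                     "dedupe_collapse": 0.17}))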

Failure modes the funnel addresses

  • Greeting flood. Without pre-filter, conversational openers fill the store. Top-K retrieval starts pulling "hi how can I help you" alongside real facts.
  • Pronoun rot. Without resolution, "she" stored in the memory text fails to match queries about Sarah six months later.
  • Duplicate ranking inflation. Three near-identical memories in the top-10 push one valuable memory out of the context window.
  • Silent contradictions. A new fact that contradicts an old one without superseding it leaves both in retrieval — the agent surfaces whichever embeds closer to the query.

Implementation notes

Stages 2–6 should run off the user-facing path: the agent's reply should never wait for memory extraction. Buffer turns to a queue; a background worker drains them through the pipeline. Durability is the only synchronous concern within the write itself — a memory must be durably persisted before it surfaces in retrieval. Each stage emits structured telemetry: rejection reason, latency, cost. The aggregate rejection-rate-over-time curve is the single most diagnostic signal in operating a memory system. A 20% drop in rejection rate usually means the upstream agent changed its conversational style.
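
A minimal sketch of that buffering, assuming a simple in-process queue (a real deployment would use a durable queue; all names here are illustrative):

import queue
import threading

turn_queue: queue.Queue = queue.Queue()

def agent_turn(user_text: str) -> str:
    # User-facing path: hand the turn to the memory pipeline and reply
    # immediately. Nothing in stages 2-6 blocks this response.
    turn_queue.put(user_text)
    return f"(reply to: {user_text})"

def memory_worker() -> None:
    # Background path: drain the queue through stages 1-6. The persist stage
    # must complete durably before the memory is visible to retrieval, but
    # that requirement never touches the user-facing reply.
    while True:
        turn = turn_queue.get()
        print("pipeline processing:", turn)   # stand-in for stages 1-6
        turn_queue.task_done()

threading.Thread(target=memory_worker, daemon=True).start()
print(agent_turn("Sarah moved to Lisbon last month"))
turn_queue.join()   # demo only: wait for the worker before the script exits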

Indiscriminate storage of conversational state degrades downstream task performance more than no memory at all.

— Harvard D3 group, 2024

Next

The next four pages drill into individual stages: pre-filter patterns, framing extraction as filtering, entity resolution, and three-tier dedup.

Stage 2 in depth: why one LLM call

The description above lists Extract and Classify as two separate stages. In the real implementation they are one. Classification, quality gating, and grounding verification all happen inside the same LLM call that produces candidate memories. This is not an accident — it is a deliberate decision with cost, latency, and coherence motivations.

Cost. A second LLM call for classification would cost approximately the same as the extraction call itself, roughly $0.0025 on Haiku at typical memory turn lengths. Doing both in one call keeps the per-turn extraction cost flat. At 1M turns/day that amounts to $2,500/day — every unnecessary LLM round trip adds up quickly at scale.

Latency. Two serial LLM calls add 300–600ms of wall-clock time in the P99 case. The HTTP path returns after Stage 6 (persist), but background workers still have throughput limits. Folding classification into extraction keeps the total worker time below the threshold where queue depth begins to grow.

Coherence. The classification of a memory candidate is not independent of its extraction. The same context — what type of claim is this? is it grounded in evidence from the turn? — that informs extraction also informs classification. Sending both questions to the same model call eliminates the risk of a classification disagreeing with the extraction because it was given slightly different context.

The prompt is structured with explicit XML sections:

  • <output_schema> — the typed output format the model must produce. This includes the memory text, memory type, grounding flag, quality decision (keep or discard), and the predicate_is_stateful boolean.
  • <type_rules> — extraction rules per memory type. Facts, preferences, events, entities, and relations each have distinct criteria. For example: facts require a subject, predicate, and object all resolvable to entities in context; preferences require an explicit first-person valence signal ("I prefer", "I don't like", not just a topic mention).
  • <grounding_rules> — grounding verification. A candidate memory is only kept if the turn contains direct evidence for it. The model is explicitly instructed to discard inferences, speculations, and paraphrases of things said in prior turns (those are already stored).
  • <quality_rules> — keep/discard classification. Transient facts (today's weather, last Tuesday's standup) are discarded. Speculative claims ("I might move to Berlin") are marked tentative or discarded depending on confidence level.
  • <examples> — four to six worked examples per memory type, showing both keep and discard decisions with reasoning.

Context injected into the call:

  • Up to 30 active entities for the current user, so the model can ground references to "the company" or "her manager" against known entities rather than creating new ones.
  • Up to 15 recent memories, to prevent re-extracting facts already in the store. Without this, a fact stated three turns ago would be extracted again, reaching the dedupe stage for elimination — which works, but wastes the dedupe call.
  • Up to 20 source turns, truncated at 2,000 characters each. The truncation limit matters: a 10,000-character code dump is almost certainly not going to yield memory candidates worth the full context cost. Pre-filter should have rejected it; if it didn't, the truncation cap limits damage.
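
A skeletal sketch of how such a prompt might be assembled — the section tags mirror the list above and the context caps mirror the limits just described, but the wording, schema fields, and helper are illustrative, not the production prompt:

EXTRACTION_PROMPT_TEMPLATE = """\
<output_schema>
Return a JSON array. Each item: {{"memory": str, "type": "fact|preference|event|entity|relation",
 "grounded": bool, "decision": "keep|discard", "predicate_is_stateful": bool}}
</output_schema>

<type_rules>
facts: subject, predicate, and object must all resolve to entities in context.
preferences: require an explicit first-person valence signal ("I prefer", "I don't like").
</type_rules>

<grounding_rules>
Keep a candidate only if this turn contains direct evidence for it.
Discard inferences, speculation, and paraphrases of earlier turns.
</grounding_rules>

<quality_rules>
Discard transient facts; mark speculative claims tentative or discard them.
</quality_rules>

<examples>
{examples}
</examples>

<entities>
{entities}
</entities>

<recent_memories>
{recent_memories}
</recent_memories>

<turns>
{turns}
</turns>
"""

def build_extraction_prompt(entities: list[str], recent: list[str], turns: list[str]) -> str:
    # Context caps: up to 30 entities, 15 recent memories, 20 turns at 2,000 chars each.
    return EXTRACTION_PROMPT_TEMPLATE.format(
        examples="(worked keep/discard examples per memory type)",
        entities="\n".join(entities[:30]),
        recent_memories="\n".join(recent[:15]),
        turns="\n".join(t[:2000] for t in turns[:20]),
    )

print(build_extraction_prompt(
    entities=["Sarah (person)", "Arrive (company)"],
    recent=["Sarah works at Arrive"],
    turns=["She just moved to the Lisbon office."],
)[:200])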

The output of Stage 2 is zero or more typed memory candidates, each carrying quality, grounding, and predicate metadata. Everything that follows — entity resolution, dedupe, conflict detection — operates on these typed, pre-classified candidates.

The predicate_is_stateful flag

One of the fields on every extracted memory candidate is predicate_is_stateful: bool. It is set at extraction time, inside the Stage 2 LLM call, and it determines how the conflict detection stage handles contradictions.

A stateful predicate is one that can only have one true value at a time. Examples:

  • lives_in — a person lives in one city at a time.
  • works_at — a person works at one company at a time (in most contexts; the model is instructed to handle contractors differently).
  • current_role — a person holds one title at a time.
  • is_dating — a relationship status.

When Stage 5 detects a contradiction between a new memory and an existing one, and both memories carry a stateful predicate, the pipeline supersedes the older memory with the newer one. The older memory is not deleted — it is marked as superseded with a pointer to the newer memory and preserved in the audit log. This is not a soft delete. The audit record is permanent.

An idempotent predicate is one that can hold multiple true values simultaneously. Examples:

  • knows_skill — a person can know many skills.
  • has_interest — a person can have many interests.
  • has_visited — a person can have visited many places.

When Stage 5 detects a contradiction on an idempotent predicate, it does not supersede. Instead it creates a contradicts relationship between the two memories and flags them for human review. This matters because "contradicts" in the idempotent case often reflects a genuine inconsistency the agent surfaced — not a simple update — and surfacing it is more useful than silently resolving it.

The model is not always right about statefulness. A skill can become outdated; a preference can flip. That is fine. The system is optimized for the common case (most stateful predicates really are stateful) and surfaces the uncommon case (contradicts links on idempotent predicates) for human review rather than silently overwriting in either direction.
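
Sketched as the branch inside the conflict stage — the Memory fields follow the candidate metadata described above, while the store interface and its method names are hypothetical stand-ins for the persistence layer:

from dataclasses import dataclass

@dataclass
class Memory:
    id: str
    entity_id: str
    predicate: str
    value: str
    predicate_is_stateful: bool

class AuditStore:
    # Minimal in-memory stand-in for the real persistence layer.
    def __init__(self) -> None:
        self.superseded: list[tuple[str, str]] = []
        self.contradictions: list[tuple[str, str]] = []
        self.review_queue: list[tuple[str, str]] = []

    def mark_superseded(self, old_id: str, new_id: str) -> None:
        self.superseded.append((old_id, new_id))    # old memory kept, permanently, for audit

    def link_contradiction(self, a: str, b: str) -> None:
        self.contradictions.append((a, b))

    def flag_for_review(self, a: str, b: str) -> None:
        self.review_queue.append((a, b))

def resolve_conflict(new: Memory, existing: Memory, store: AuditStore) -> None:
    if new.predicate_is_stateful:
        # One true value at a time: the newer memory wins; the older one is
        # marked superseded with a pointer to its successor, never deleted.
        store.mark_superseded(old_id=existing.id, new_id=new.id)
    else:
        # Idempotent predicate: both values can be true, so the contradiction
        # is surfaced for human review rather than silently resolved.
        store.link_contradiction(existing.id, new.id)
        store.flag_for_review(existing.id, new.id)

store = AuditStore()
old = Memory("mem_1", "ent_7", "lives_in", "Austin", predicate_is_stateful=True)
new = Memory("mem_2", "ent_7", "lives_in", "Lisbon", predicate_is_stateful=True)
resolve_conflict(new, old, store)
print(store.superseded)   # [('mem_1', 'mem_2')] -- superseded, not overwritten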

HyPE: write-time query indexing

Stage 7 does not run synchronously with the HTTP request. The HTTP response returns after Stage 6 (persist — with extraction and classification folded into one stage, persistence is the sixth stage in the implementation numbering, which frees the seventh slot). Stage 7 — Hypothetical Prompt Embedding, or HyPE — runs as a background job queued via a transactional outbox pattern.

The job is simple: for each memory that was successfully persisted in Stage 6, generate three hypothetical questions a user might ask in order to retrieve that memory, embed each question, and store the embeddings separately from the memory's own content embedding.

For a memory like "Georgian works at Arrive as a data engineer", the three hypothetical queries might be:

  • "What company does Georgian work for?"
  • "What role does Georgian have at Arrive?"
  • "Who is the data engineer at Georgian's company?"

These questions are then embedded using the same embedding model used for content. At retrieval time, the query is compared against both the content embeddings and the question embeddings. A query that is phrased differently from the stored memory text — "where does Georgian work" instead of "Georgian works at Arrive" — is much more likely to hit one of the hypothetical question embeddings than the content embedding directly.

This is the core insight behind HyDE (Hypothetical Document Embedding), applied at write time rather than query time. The standard HyDE approach generates a hypothetical answer to a query at retrieval time and embeds that. HyPE inverts the direction: at write time, generate hypothetical questions for a known fact and embed those. The practical advantage is that the embedding cost is paid once per memory, not once per query.
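
A compressed sketch of the write-time job — generate_questions, embed, and store_vector are hypothetical stand-ins for the LLM call, the embedding endpoint, and the vector index write:

from typing import Callable

def hype_job(
    memory_id: str,
    memory_text: str,
    generate_questions: Callable[[str, int], list[str]],   # LLM call (stubbed below)
    embed: Callable[[str], list[float]],                    # embedding call (stubbed below)
    store_vector: Callable[[str, str, list[float]], None],  # vector index write (stubbed below)
) -> None:
    # Per persisted memory: the cost of generating and embedding hypothetical
    # questions is paid once at write time, not on every query.
    store_vector(memory_id, "content", embed(memory_text))
    for i, question in enumerate(generate_questions(memory_text, 3)):
        store_vector(memory_id, f"hypothetical_q{i}", embed(question))

# Stub wiring so the sketch runs end to end; real calls hit an LLM and an
# embedding endpoint.
vectors: dict[tuple[str, str], list[float]] = {}
hype_job(
    "mem_abc123",
    "Georgian works at Arrive as a data engineer",
    generate_questions=lambda text, n: [f"hypothetical question {i} about: {text}" for i in range(n)],
    embed=lambda text: [float(len(text))],   # placeholder one-dimensional "embedding"
    store_vector=lambda mid, kind, vec: vectors.__setitem__((mid, kind), vec),
)
print(sorted(vectors))   # content vector plus three hypothetical-question vectors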

The transactional outbox pattern ensures that Stage 7 completes even if the background worker restarts mid-job. The outbox record is created atomically with the Stage 6 persist. The worker marks the outbox entry complete only after all three embeddings are durably stored. If the worker crashes between embedding 2 and 3, it retries from the beginning of the job on restart. The embeddings are idempotent to write (the same memory text produces the same question text, which produces the same embedding), so retries are safe.

The content embedding for the memory itself also runs in Stage 7 — it does not block the HTTP response. This is a deliberate choice: embedding a 100-token memory takes 5–15ms on a fast inference endpoint, but at high write rates that latency accumulates in the critical path. Moving it to the background keeps HTTP P99 flat under load.

Rejection budget: a worked example

Take 1,000 incoming conversational turns. Walk them through the funnel stage by stage.

Stage 1 — pre_filter. Pre-filter rejects 30–50% of turns via pattern and length rules. Call it 40%, the midpoint. 400 turns are dropped here — greetings, acknowledgements, meta-talk, pure emoji reactions, tool call markers. Zero LLM cost. 600 turns survive.

Stage 2 — Extract. Of the 600 surviving turns, approximately 25–30% produce no memory candidates at all. The extraction call runs — unlike pre-filter, we cannot skip it — but the model determines that the turn contains nothing worth storing: transient content, already-known facts, speculation without a grounding signal. Call it 28%: 168 turns produce no candidates. 432 turns produce at least one candidate. If we assume an average of 1.5 candidates per turn (some turns yield two or three), we have roughly 648 candidates entering Stage 3.

Stage 3 — resolve_refs. Entity resolution does not reject candidates; it transforms them. A candidate referencing "she" becomes a candidate referencing EntityId 7f3a…. Zero or two LLM calls per candidate, depending on ambiguity. Most candidates resolve on the first pass (pronoun rules + grammar parse handle ~80% of cases). Roughly 130 of the 648 candidates require a fuzzy match or LLM judge. No candidates are dropped here under normal operation, though resolution failures produce a SkipReason::ResolutionFailed that drops the candidate rather than storing a dangling pronoun reference.

Stage 4 — Dedupe. Dedupe collapses 15–20% of extracted candidates into existing memories. Call it 17%: 110 candidates are determined to be near-duplicates of memories already in the store and are merged rather than stored as new. 538 candidates survive as distinct.

Stage 5 — Conflict. Conflict detection runs a SQL lookup against entity_facts for each candidate, filtered by entity and predicate. For stateful predicates, a conflict triggers supersession: the old memory is archived, the new one proceeds. For idempotent predicates, a conflict creates a contradicts link and both memories persist. Conflict detection does not net-reduce the candidate count in the common case — it routes candidates, not drops them. The exception is a quality check on the conflict resolution path: if the LLM judge determines that the "new" memory is actually weaker evidence than the existing one, it may drop the candidate instead. This affects roughly 3–5% of candidates that reach Stage 5.

Stage 6 — Persist. The atomic write stage. By this point, from the original 1,000 turns, roughly 510–520 distinct memory candidates are committed. But many of those candidates update existing entity records or supersede existing memories rather than creating net-new rows. Net-new memories from 1,000 turns: roughly 100–150.

This is the number that matters. 80–90% of incoming turns produce zero net new memories. The rejection budget is not inefficiency — it is the mechanism by which retrieval precision stays high at 30 days.
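
The same walkthrough reduced to arithmetic, using the midpoints quoted above:

turns = 1_000
after_prefilter       = turns * (1 - 0.40)             # Stage 1: 40% dropped -> 600
turns_with_candidates = after_prefilter * (1 - 0.28)   # Stage 2: 28% yield nothing -> 432
candidates            = turns_with_candidates * 1.5    # ~1.5 candidates per producing turn -> 648
after_dedupe          = candidates * (1 - 0.17)        # Stage 4: 17% collapse -> ~538
committed             = after_dedupe * (1 - 0.04)      # Stage 5 quality drop of ~3-5% -> ~516

print(round(after_prefilter), round(turns_with_candidates),
      round(candidates), round(after_dedupe), round(committed))
# 600 432 648 538 516 — many of those commits update or supersede existing
# rows; net-new memories land around 100-150, i.e. 80-90% of turns add nothing new.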

The trace_id: end-to-end provenance

Every memory produced by the pipeline carries a trace_id. The pipeline's HTTP response for a write call looks like this:

{
  "stored": 3,
  "merged": 1,
  "discarded": 2,
  "memory_ids": ["mem_abc123", "mem_def456", "mem_ghi789"],
  "trace_id": "trc_9f7e2b…"
}

The trace_id links to a full pipeline provenance record: one StageRecord span per stage, each carrying the stage name, latency in milliseconds, result (pass, reject, transform), and any metadata specific to that stage (pattern matched, dedupe tier reached, conflict type detected).

This enables three operational capabilities that are otherwise unavailable.

Replay. A trace_id can be used to re-run the pipeline on the original input with modified configuration. If you tighten the pre-filter and want to know what would have changed over the last 7 days, replay the traces. The replay infrastructure runs the pipeline against the stored inputs using the new configuration and reports the diff: how many more turns would have been dropped, how many fewer memories would have been written.

Rollback. If a configuration change causes a regression — extraction quality drops, a new pattern causes too many false positives — trace_ids let you identify the exact turns affected and surgically remove the memories they produced, rather than rolling back the entire store.

Debugging extraction failures. When a user reports "you didn't remember X", the trace_id for the conversation session shows exactly what happened to the turn that contained X. Was it dropped at pre-filter? Did extraction run but produce no candidates? Did dedupe collapse it into an existing memory that was less precise? The answer is in the trace. Without it, debugging write failures requires reproducing the conditions, which is often impossible six hours later.

Each stage emits a span with a consistent set of fields:

  • stage: one of pre_filter, extract, resolve_refs, dedupe, conflict, persist.
  • latency_ms: wall-clock duration.
  • result: pass, reject, transform, or error.
  • reason: for rejections, the structured reason (pattern matched, word count, resolution failed, duplicate confirmed, etc.).

The spans are structured logs, not distributed traces in the OpenTelemetry sense — there is no parent span propagation across service boundaries. They are written as JSONL to a stage_spans table, queryable by trace_id, memory_id, user_id, and time range.
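
A minimal sketch of emitting and reassembling these spans as JSONL (the field names follow the list above; the writer and query helpers are illustrative):

import io
import json
import time
from typing import Any

def emit_span(sink, trace_id: str, stage: str, result: str,
              latency_ms: float, reason: str | None = None, **meta: Any) -> None:
    # One structured-log line per stage: not an OpenTelemetry span, just a
    # JSONL row destined for the stage_spans table.
    record = {"trace_id": trace_id, "stage": stage, "result": result,
              "latency_ms": latency_ms, "reason": reason, "ts": time.time(), **meta}
    sink.write(json.dumps(record) + "\n")

def load_trace(lines: list[str], trace_id: str) -> list[dict]:
    # Reassemble one pipeline run: every stage the turn touched, in order,
    # with its result and rejection reason.
    return [r for r in map(json.loads, lines) if r["trace_id"] == trace_id]

sink = io.StringIO()
emit_span(sink, "trc_9f7e2b", "pre_filter", "pass", 0.4)
emit_span(sink, "trc_9f7e2b", "extract", "reject", 812.0, reason="no_candidates")
print(load_trace(sink.getvalue().splitlines(), "trc_9f7e2b"))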

Async-first design

The HTTP response for a write call returns after Stage 6. Not after Stage 7. This is not an oversight — it is the design.

Stage 7 (HyPE) generates three hypothetical queries per memory and embeds them. At 3 queries × 1.5 memories/turn × embedding latency of 10ms = 45ms per turn in the background. That is acceptable in a worker; it is not acceptable in the HTTP critical path at scale.

More importantly, the embedding for the memory itself runs in Stage 7, not Stage 6. Stage 6 persists the memory text, entity links, and metadata. The vector index entry is written in Stage 7, after the embedding is computed. This means there is a brief window — typically under two seconds — between when a memory is persisted and when it becomes retrievable via vector search. Lexical (BM25) retrieval is available immediately after Stage 6, since it does not require an embedding.

The transactional outbox pattern closes the reliability gap. When Stage 6 runs its atomic transaction, it includes an outbox row for the Stage 7 job. The outbox row contains the memory_id and all inputs needed for the Stage 7 job. The Stage 7 worker polls the outbox, processes jobs, and marks them complete only after durable writes. If the worker is killed between processing two memories in a batch, the unacknowledged outbox rows are retried on the next worker start. Stage 7 jobs are idempotent: running the same job twice produces the same embeddings (deterministic question generation given fixed model and seed) and the second write is a no-op.

The failure mode this prevents: a worker crash that leaves some memories without embeddings, causing them to be silently unretrievable by vector search. Without the outbox, this class of failure requires a periodic reconciliation job scanning for memories with no vector entry. With the outbox, it is handled automatically by retry.
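
A compressed sketch of the outbox mechanics, using SQLite in memory as a stand-in for the real database (table and column names are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE memories (id TEXT PRIMARY KEY, text TEXT);
    CREATE TABLE outbox   (memory_id TEXT PRIMARY KEY, job TEXT, done INTEGER DEFAULT 0);
""")

def persist_with_outbox(memory_id: str, text: str) -> None:
    # Stage 6: the memory row and its Stage 7 job land in one transaction, so
    # a persisted memory can never be missing its HyPE job.
    with conn:
        conn.execute("INSERT INTO memories VALUES (?, ?)", (memory_id, text))
        conn.execute("INSERT INTO outbox (memory_id, job) VALUES (?, 'hype')", (memory_id,))

def drain_outbox(run_job) -> None:
    # Stage 7 worker: process pending jobs, mark them done only after durable
    # writes. A crash mid-job leaves done = 0, so the job is retried on
    # restart; jobs are idempotent, so retries are safe.
    pending = conn.execute(
        "SELECT o.memory_id, m.text FROM outbox o JOIN memories m ON m.id = o.memory_id "
        "WHERE o.done = 0"
    ).fetchall()
    for memory_id, text in pending:
        run_job(memory_id, text)   # generate + embed hypothetical questions
        with conn:
            conn.execute("UPDATE outbox SET done = 1 WHERE memory_id = ?", (memory_id,))

persist_with_outbox("mem_abc123", "Georgian works at Arrive as a data engineer")
drain_outbox(lambda memory_id, text: print("HyPE job for", memory_id))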

Operating the pipeline

The rejection-rate-over-time chart is the primary operational signal. Run it as a dashboard panel, updated hourly, with per-stage breakdown.
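
A sketch of the panel computation over rows from the stage_spans table described earlier (the stage and result fields are as listed there; the bucketing helper is illustrative):

from collections import defaultdict
from datetime import datetime, timezone

def rejection_rate_by_hour(spans: list[dict]) -> dict[tuple[str, str], float]:
    # spans: stage_spans rows carrying at least stage, result, and a unix ts.
    # Returns {(hour bucket, stage): rejection rate} for a per-stage panel.
    seen: dict[tuple[str, str], int] = defaultdict(int)
    rejected: dict[tuple[str, str], int] = defaultdict(int)
    for s in spans:
        hour = datetime.fromtimestamp(s["ts"], tz=timezone.utc).strftime("%Y-%m-%d %H:00")
        key = (hour, s["stage"])
        seen[key] += 1
        rejected[key] += s["result"] == "reject"
    return {key: rejected[key] / seen[key] for key in seen}

sample = [
    {"stage": "pre_filter", "result": "reject", "ts": 1_700_000_000},
    {"stage": "pre_filter", "result": "pass",   "ts": 1_700_000_100},
]
print(rejection_rate_by_hour(sample))   # {('2023-11-14 22:00', 'pre_filter'): 0.5}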

What normal looks like:

  • pre_filter rejection: 30–50%
  • extract no-candidate rate: 25–30% of surviving turns
  • dedupe collapse: 15–20% of extracted candidates
  • Stage 5 LLM judge invocations: under 10% of candidates reaching Stage 5 (the cosine gate should be blocking the rest)

Signals that require investigation:

A sudden drop in pre_filter rejection rate (e.g. from 42% to 18%) almost always means the upstream agent changed its conversational style. New agents tend to emit more structured output — JSON payloads, tool calls, formatted lists — that does not match the greeting/acknowledgement pattern set. Add new patterns. Do not relax existing ones.

A sudden spike in extract no-candidate rate beyond 35% usually means the extraction prompt's quality rules are too aggressive for the new content type. Sample the discarded turns. If more than 15% look like they should have produced candidates, relax the quality rules or add type-specific examples to the prompt.

A spike in extraction_json_fail — the model produced malformed output that could not be parsed against the <output_schema> — is rare on Haiku (under 0.2% of calls) but spikes under context window pressure. The most common cause: injecting too many recent memories or too many entity records into the call, pushing the prompt into the long-context regime where output quality degrades. Reduce the entity injection limit from 30 to 15 and monitor.

A dedupe collapse rate above 25% suggests the extraction prompt is re-extracting already-known facts despite the recent-memory injection. Increase the recent-memory injection count. If it is already at 15, check whether the user's conversation is cycling over the same topics — if so, the high dedupe rate is correct behavior, not a bug.

Watch the P99 latency for each stage separately. pre_filter should never exceed 5ms. extract is bounded by LLM latency: P50 around 400ms, P99 around 1,200ms on Haiku. resolve_refs P99 spikes when the LLM judge is invoked frequently — if P99 exceeds 2,500ms on a rolling window, the entity resolution cascade is calling the LLM judge too often, which usually means entity names in the store have drifted from how the user refers to them. Run a normalization pass on entity aliases.

