Tracing a Memory Back to Its Source
Memory systems fail in a particular way that's different from most software bugs: the wrong information gets stored correctly. The pipeline didn't crash. The extraction ran successfully. The LLM returned a well-formed response. But somewhere in the interpretation of a conversation, a fact got slightly wrong, an entity got misidentified, or a preference got inverted. The stored memory looks legitimate — it has high confidence, proper type, correct scope — and it will surface in retrieval for as long as it exists.
The standard debugging toolkit doesn't help much here. Stack traces, error logs, exception handlers — none of these fire when a memory is wrong but valid. What you need is the ability to trace the memory back to the conversation that created it, understand what the extractor saw and concluded, and either correct the memory or discard it.
That's what provenance is for. Every memory in Recall carries a Provenance struct as a first-class field. Not in the logs, not in a separate audit table that gets pruned — embedded directly in the memory itself, persistent as long as the memory persists.
The Provenance struct
pub struct Provenance {
pub source_turn_ids: Vec<String>,
pub source_session_id: Option<String>,
pub extracted_by: ExtractorInfo,
pub extracted_at: DateTime<Utc>,
pub extraction_trace_id: String,
}
pub struct ExtractorInfo {
pub provider: String,
pub model: String,
pub model_version_hash: Option<String>,
pub extractor_schema_version: u32,
}

Each field answers a different question:
source_turn_ids: Which specific conversation turns contributed to this memory? A memory can be grounded in multiple turns — a fact established in turn 12 and confirmed in turn 37 will list both IDs. This is the direct link from memory to conversation.
extracted_by: Which model processed those turns? provider and model give you the LLM identity. model_version_hash is a 16-character hex hash of the model identifier — stable across deployments, useful for correlating memories from a specific model version. extractor_schema_version is the schema version active when extraction ran (see the schema versioning guide for what this tracks).
extracted_at: When the extraction pipeline ran. Note this is pipeline execution time, not the time of the conversation. For async extraction (the default — turns are buffered and processed by a background worker), extracted_at may be minutes or hours after source_turn_ids were recorded.
extraction_trace_id: The trace ID of the write pipeline execution. This is the most powerful provenance field. Every call to remember() generates one pipeline execution, one trace. All memories produced by that execution share a single extraction_trace_id.
The extraction_trace_id: the thread that connects them
When something goes wrong with extraction — a prompt change produced unexpected outputs, a model upgrade changed the extraction vocabulary, a bug caused a batch of turns to be processed with wrong context — the extraction_trace_id is how you find everything that went wrong and fix it.
Every remember() response includes the trace ID:
const result = await recall.remember({ turns, scope });
console.log(result.trace_id); // "trc_01HX7RABC..."
console.log(result.metadata.stored); // 3
console.log(result.metadata.discarded); // 12
console.log(result.metadata.merged); // 1

To list all memories from that execution:
const memories = await recall.listMemories({
scope,
filter: { extraction_trace_id: result.trace_id }
});
// Returns all memories where provenance.extraction_trace_id === result.trace_id

This query is indexed. It runs in constant time regardless of how many total memories are in the namespace.
You can also fetch the full pipeline trace — every span, every LLM call with its cost and token counts, every stage decision:
const trace = await recall.getTrace(result.trace_id);
// trace.spans: array of Span
// trace.spans[1].name === "extract"
// trace.spans[1].attributes["llm.cost_usd"] === 0.00012
// trace.spans[1].attributes["extract.candidates_produced"] === 2The trace links observability (what the pipeline did) to provenance (what it produced). Between the two, you can answer almost any question about why a particular memory exists.
Debugging: "This memory is wrong"
You get a user report: the agent told them "your manager at Arrive is Sarah" when the correct answer is Priya. You find the memory:
const mem = await recall.getMemory("mem_8B3C...");
console.log(mem.content);
// "Georgian's manager at Arrive is Sarah"
console.log(mem.provenance.source_turn_ids);
// ["turn_089", "turn_091"]
console.log(mem.provenance.extracted_by.model);
// "claude-haiku-4-5"
console.log(mem.provenance.extraction_trace_id);
// "trc_01HX7RABC..."Retrieve the source turns from your conversation store (Recall stores memory metadata, not the original conversation turns — those live in your system):
const turns = await yourConversationStore.getTurns(["turn_089", "turn_091"]);
console.log(turns[0].content);
// "user: Sarah reviewed my PR today. She's my manager at Arrive..."
console.log(turns[1].content);
// "assistant: Got it. Anything specific about the review you want to note?"Now you have the full picture: the user mentioned "Sarah reviewed my PR" in a context where Sarah might have been described as a manager, the extractor extracted "Sarah is manager at Arrive" as a fact, and the existing entity for ent_priya_XYZ wasn't connected to this turn's context.
This happens when entity resolution fails — the extractor saw "Sarah" but didn't connect it back to a non-manager role. The correct action here depends on what actually happened in that conversation: either delete the wrong memory and let the correct one surface, or supersede it with a corrected version.
// Option 1: delete the wrong memory
await recall.forget({ memory_id: "mem_8B3C...", scope });
// Option 2: supersede with correction
await recall.updateMemory({
memory_id: "mem_8B3C...",
scope,
patch: {
content: "Sarah reviewed Georgian's PR at Arrive; Priya is Georgian's manager",
memory_type: "event", // this was a review event, not a fact about management
}
});

Both operations create audit entries. The original memory is preserved in the audit log regardless.
Debugging: "This whole batch came out wrong"
A bigger problem: you changed your extraction prompt last Tuesday and processed 400 conversations before realizing the new prompt was extracting preference memories as facts, losing the versioning semantics. You need to roll back everything from that period.
The trace-based approach:
// Find trace IDs for the affected period
const traces = await recall.listTraces({
namespace,
since: "2026-04-22T09:00:00Z", // when you deployed the bad prompt
until: "2026-04-24T16:30:00Z", // when you reverted
operation: "remember"
});
console.log(traces.length); // 847 pipeline executions
// Soft-delete all memories from those traces
for (const trace of traces) {
const result = await recall.forgetByTrace({
trace_id: trace.id,
scope,
reason: "bulk rollback: bad extraction prompt 2026-04-22 to 2026-04-24"
});
console.log(`Removed ${result.deleted_count} memories from ${trace.id}`);
}

forgetByTrace sets deleted_at on every memory sharing that trace ID. Those memories no longer surface in retrieval, but they remain in the audit log along with the reason you provided. If you realize the rollback was incorrect:
// Restore memories from a specific trace
await recall.restoreByTrace({
trace_id: "trc_01HX7RABC...",
scope
});
// Clears deleted_at; memories return to active state

The reason to use soft delete rather than hard delete is that rollbacks are often wrong in the other direction — you delete too much. Soft delete gives you a recovery path. Hard delete is irreversible and should be reserved for GDPR deletion requests and cases where you're certain the memories are contaminated.
Debugging: "Why didn't this memory surface?"
The hardest class of retrieval bugs: a memory you know exists didn't appear in the results for a query that should have matched it.
const debug = await recall.debugRetrieval({
query: "what does my manager prefer for code reviews?",
expected_memory_id: "mem_7A2B",
scope: { user_id, agent_id }
});

The response tells you exactly where in the retrieval pipeline the memory got lost:
{
"found_in_retrievers": ["entity_graph"],
"not_found_in": ["semantic", "bm25", "temporal", "type"],
"semantic_similarity": 0.68,
"bm25_score": 0.21,
"rrf_rank": 8,
"rrf_cutoff": 5,
"filtered_by": null,
"policy_check": "passed — confidence 0.87, not superseded, in scope",
"suggestion": "Memory ranked 8th after RRF fusion; top-K cutoff is 5. Consider lowering semantic threshold from 0.70 to 0.65, or increasing top-K to 8."
}

In this case, the memory was found by entity graph retrieval (the manager entity has a direct graph edge to the preference), but it wasn't found by semantic retrieval because the embedding similarity (0.68) was below the default threshold of 0.70. After RRF fusion, it ranked 8th — outside the top-5 cutoff.
Two possible fixes:
- Lower the semantic similarity threshold (read.semantic.min_similarity: 0.65)
- Increase the top-K cutoff (read.top_k: 8)
The first is higher risk: lowering the threshold lets in more noise, which reduces precision across all queries. The second is safer: more candidates go through RRF fusion and policy filtering, and the policy gate will still reject low-confidence or superseded memories.
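If you take the safer route, the change is a one-line config update. A minimal sketch follows; the updateNamespaceConfig call is hypothetical (use whatever config surface your deployment exposes), and only the key names are taken from the options above.

// Hypothetical config-update API; the read.top_k and read.semantic.min_similarity
// keys are the ones named in this guide.
await recall.updateNamespaceConfig({
  namespace,
  patch: {
    "read.top_k": 8 // raise the post-fusion cutoff so the rank-8 memory surfaces
    // "read.semantic.min_similarity": 0.65  // riskier: admits more noise globally
  }
});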
A third option is to check whether the memory's content can be improved. A similarity of 0.68 suggests the query and the memory content are using different vocabulary for the same concept. If the memory says "prefers inline comments" and the query asks about "code review style," there's a genuine vocabulary gap. Re-writing the memory content to include "code review" would close the gap — but the better fix is usually to let the extraction process produce richer content in the first place, by providing better context in the extraction prompt.
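For the narrow case where one existing memory needs the richer vocabulary now, a one-off patch with the updateMemory call shown earlier closes the gap. The rewritten content string here is illustrative:

await recall.updateMemory({
  memory_id: "mem_7A2B",
  scope,
  patch: {
    // add the query's vocabulary ("code review") to the stored content
    content: "Prefers inline comments during code review"
  }
});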
The audit ledger
The audit log is separate from traces. Traces track pipeline execution details — spans, latencies, LLM calls. The audit log tracks memory lifecycle events — creation, update, supersession, deletion.
Every audit entry has:
{
"audit_id": "aud_01HX...",
"event_type": "insert" | "supersede" | "update" | "delete" | "restore" | "feedback",
"memory_id": "mem_...",
"actor": "api_key:ak_...",
"timestamp": "2026-05-07T14:23:45Z",
"reason": "bulk rollback: bad extraction prompt",
"before": { /* memory state before */ },
"after": { /* memory state after */ }
}

Query the audit log for a specific memory:
const history = await recall.auditLog({
memory_id: "mem_8B3C...",
scope
});
// Returns chronological list of all events for this memory

Query all events in a time window:
const events = await recall.auditLog({
scope,
since: "2026-04-22T00:00:00Z",
until: "2026-04-24T23:59:59Z",
event_types: ["insert", "delete"]
});The audit log is the permanent record. Trace retention defaults to 7 days before hot storage rolls off. The audit log is retained for as long as the memory store exists — it doesn't expire.
Provenance invariants the system maintains
Two formal invariants ensure provenance is always present:
I-4 (Provenance existence): Every memory has at least one source turn ID in provenance.source_turn_ids. The write pipeline enforces this at the extraction stage — a candidate without source turn attribution fails the grounding check and is discarded.
I-5 (Provenance consistency): memory.scope must match the scope of the source turns. Cross-scope extraction (extracting from user A's turns into user B's namespace) is rejected by the storage layer before persist.
These invariants mean you can always answer "where did this memory come from?" and "was this memory supposed to be here?" for any memory in any namespace.
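Both invariants are enforced server-side, but they are also cheap to spot-check from the client. A minimal sketch, assuming the listMemories filter and the provenance fields shown elsewhere in this guide (source_scope is the field used later in the scope isolation section):

// Spot-check I-4 and I-5 over recent memories in a scope.
const recent = await recall.listMemories({
  scope,
  filter: { created_since: "2026-05-01T00:00:00Z" }
});
// I-4: every memory must cite at least one source turn
const i4Violations = recent.filter(
  (m) => m.provenance.source_turn_ids.length === 0
);
// I-5: provenance scope must match the scope the memory lives in
const i5Violations = recent.filter(
  (m) => m.provenance.source_scope?.user_id !== scope.user_id
);
console.log({ i4: i4Violations.length, i5: i5Violations.length }); // expect 0, 0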
Practical discipline
In practice, provenance is most valuable in three situations: post-mortem debugging when users report wrong agent behavior, rolling back batches after a configuration mistake, and audits when you need to demonstrate that a specific memory was grounded in a specific user statement.
For day-to-day operations, the observability metrics (particularly recall_hallucination_blocked_total) are the right tool. For targeted investigations, the debugRetrieval API and audit log give you the specifics. The full provenance fields are there when you need to trace things back to the conversation level.
The key thing to internalize: the trace ID is not just a logging artifact. It's the foreign key between every memory and the pipeline execution that produced it. Keep it, expose it in your application layer where possible, and treat the retrieval debug API as standard debugging tooling rather than an emergency measure. Most retrieval issues, once traced, reveal straightforward configuration fixes.
The retrieval debug API in depth
The debugRetrieval response shown in the example above gives a summary. Here is how to interpret each field systematically and what action each outcome suggests.
found_in_retrievers — which of the 5 retrievers found this memory. Recall uses semantic (dense vector), BM25 (sparse lexical), temporal, type-based, and entity-graph retrievers in parallel. If the memory should have been found by semantic retrieval but wasn't, the embedding similarity was below threshold. If it should have been found by BM25 but wasn't, check whether the key terms in the memory are in the BM25 vocabulary — terms that appear in fewer than 3 memories in the namespace are often pruned from the vocabulary, even though they may be the exact terms the user is searching for.
semantic_similarity — the cosine similarity between the query embedding and the memory's stored embedding. Use these bands as reference points during investigation:
- Above 0.80: strong semantic match. If the memory is not in results at this similarity level, the problem is ranking, not retrieval — it was found but pushed out by higher-scoring competitors. Look at rrf_rank versus rrf_cutoff.
- 0.70–0.80: moderate match. The memory may surface or not depending on competition in the namespace. In a namespace with many semantically related memories, a 0.75 similarity may rank outside the top-K cutoff. In a sparse namespace, it will surface easily.
- 0.60–0.70: weak semantic match. The memory will surface only when competition is sparse. At 0.68, the default threshold of 0.70 excludes it entirely from the semantic retriever's candidate set.
- Below 0.60: the semantic retriever treats this as unrelated to the query. This is correct behavior if the memory really isn't semantically related to the query; it signals a vocabulary gap if the memory content and query are expressing the same concept with different terminology.
bm25_score — the BM25 score for this memory against the query terms. A score of 0.0 means none of the query terms appear in the memory content after stemming and stop-word removal. A BM25 score of 0.0 combined with a low semantic similarity is the clearest indicator of a vocabulary gap — the memory uses entirely different words than the query. A high BM25 score with a low semantic similarity often means the memory contains the exact query terms but expresses a different concept — BM25 correctly finds it, and semantic correctly down-weights it. This is the intended behavior of hybrid retrieval.
rrf_rank — the memory's position after Reciprocal Rank Fusion across all retrievers. RRF accounts for the memory's rank position within each retriever that found it, not just whether it was found. A memory found by only one retriever gets an RRF score based on its rank within that retriever. A memory found by three retrievers, each ranking it in the top 10, will beat a memory found by only one retriever at rank 1 — the multi-retriever coverage adds up. This is why found_in_retrievers matters: a memory consistently found across many retrievers is harder to miss than one found by only the entity graph.
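To make the fusion arithmetic concrete, here is a sketch of the standard Reciprocal Rank Fusion formula. The constant k = 60 is the conventional value from the original RRF paper; Recall's actual constant is not documented here, so treat this as an illustration of the mechanics, not the exact scoring:

// score(m) = sum over retrievers of 1 / (k + rank), counting only retrievers
// that found the memory; ranks are 1-based within each retriever.
function rrfScore(ranks: Array<number | null>, k = 60): number {
  return ranks.reduce<number>(
    (sum, rank) => (rank === null ? sum : sum + 1 / (k + rank)),
    0
  );
}

// Found by three retrievers at rank 10 beats found by one retriever at rank 1:
rrfScore([10, 10, 10, null, null]);    // ≈ 0.0429
rrfScore([1, null, null, null, null]); // ≈ 0.0164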
rrf_cutoff — the top-K cutoff applied after fusion. If rrf_rank > rrf_cutoff, the memory was retrieved and scored but not included in the final context window. The fix here is to increase read.top_k in your namespace config. Note that increasing top-K has a cost: more memories pass through to the policy gate and into context, which increases token consumption per read call. Set top-K to the minimum that reliably surfaces the memories you care about, and measure context token usage before and after.
filtered_by — if non-null, the memory was excluded by a hard filter after retrieval. Possible values and their fixes:
- confidence_floor: the memory's confidence score is below the minimum threshold set in read.policy.min_confidence. Either accept lower-confidence memories by lowering the threshold, or address the root cause by improving extraction grounding so the memory is extracted with higher confidence.
- freshness_floor: the memory's freshness score (a combination of recency and access frequency) is below the minimum threshold. This typically means the memory hasn't been accessed recently and is old. If the memory is still relevant, consider calling recall.touchMemory() to update its access timestamp.
- status_filter: the memory has status superseded or expired. A superseded memory means a later extraction produced a more current version. Find the superseding memory and verify its content is correct. An expired memory means its expires_at timestamp has passed — if expiry was set incorrectly, restore the memory and extend its TTL.
- scope_filter: the memory is outside the scope of the read request. This should not happen in correctly configured systems — it indicates a scope mismatch between write time and read time. Check the memory's provenance scope and the scope passed to the read call.
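In tooling, each filtered_by value maps onto one of these remediation paths. A small triage helper restating the list above (the FilteredBy union is assumed to match the values the API returns):

type FilteredBy =
  | "confidence_floor"
  | "freshness_floor"
  | "status_filter"
  | "scope_filter"
  | null;

function suggestFix(filteredBy: FilteredBy): string {
  switch (filteredBy) {
    case "confidence_floor":
      return "Lower read.policy.min_confidence or improve extraction grounding.";
    case "freshness_floor":
      return "If still relevant, recall.touchMemory() refreshes the access timestamp.";
    case "status_filter":
      return "Inspect the superseding memory, or restore and extend the TTL.";
    case "scope_filter":
      return "Compare the memory's provenance scope to the read scope.";
    default:
      return "Not filtered: check rrf_rank against rrf_cutoff instead.";
  }
}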
Proactive health monitoring
Don't wait for user reports to detect memory quality issues. The problems described in this guide — extraction errors, entity fragmentation, retrieval misses, confidence degradation — are all measurable before they produce user-visible failures. Set up automated checks that run weekly and alert before problems accumulate.
async function weeklyMemoryHealthCheck(namespace: string) {
const health = await recall.health.deep({ namespace });
const alerts: string[] = [];
// Check confidence distribution
const avgConfidence = health.confidence_stats.mean;
if (avgConfidence < 0.55) {
alerts.push(
`Low average confidence: ${avgConfidence.toFixed(3)} — extractor quality may have degraded`
);
}
// Check junk rate
const junkRate = health.quality_metrics.estimated_junk_rate;
if (junkRate > 0.15) {
alerts.push(
`High junk rate: ${(junkRate * 100).toFixed(1)}% — pre-filter may need tuning`
);
}
// Check retrieval miss rate
const missRate = health.retrieval_metrics.expected_miss_rate;
if (missRate > 0.20) {
alerts.push(
`High retrieval miss rate: ${(missRate * 100).toFixed(1)}% — review retrieval thresholds`
);
}
// Check entity fragmentation
const fragmentation = health.entity_metrics.fragmentation_index;
if (fragmentation > 0.15) {
alerts.push(
`Entity fragmentation: ${(fragmentation * 100).toFixed(1)}% — entity resolution may be degrading`
);
}
// Check schema version skew
if (health.alerts.includes("schema_version_mismatch")) {
const versions = Object.keys(health.schema_versions);
if (versions.length > 1) {
const skewDays = health.schema_version_skew_days;
if (skewDays > 7) {
alerts.push(
`Schema version mismatch persisting ${skewDays} days — migration may be stalled`
);
}
}
}
return {
healthy: alerts.length === 0,
alerts,
metrics: health
};
}

Run this weekly and pipe alerts to your on-call system. Each alert type maps to a specific remediation path covered in this guide or the schema versioning guide. The point is not to act on every alert immediately — some fluctuation is normal — but to catch trends before they compound. A junk rate of 16% this week and 22% next week is a signal worth investigating. A stable 8% is not.
The fragmentation_index deserves particular attention because entity fragmentation has a cascading effect: when the same person exists as two separate entity nodes, all relational queries (find memories about X's manager, find all preferences expressed by X) may return incomplete results. The index is a ratio of suspected duplicate entity pairs to total entities, computed using name similarity and co-occurrence patterns. Values above 0.15 mean roughly 1 in 6 entities has a suspected duplicate — at that level, relational retrieval is unreliable.
Entity resolution debugging
When entity resolution fails, the entity graph fragments and relational queries return wrong or empty results. Detect entity resolution failures by looking for duplicate entity nodes:
const entities = await recall.listEntities({
namespace,
filter: { min_memory_count: 5 } // only entities with substantial backing
});
// Find potential duplicates: entities with similar names
const suspectedDuplicates: Array<{
entity_a: string;
entity_b: string;
name_a: string;
name_b: string;
similarity: string;
}> = [];
for (let i = 0; i < entities.length; i++) {
for (let j = i + 1; j < entities.length; j++) {
// jaroWinkler: assumes a Jaro-Winkler string-similarity implementation is in
// scope (e.g. from an npm string-distance package); returns a score in [0, 1]
const similarity = jaroWinkler(
entities[i].canonical_name,
entities[j].canonical_name
);
if (similarity > 0.85 && entities[i].type === entities[j].type) {
suspectedDuplicates.push({
entity_a: entities[i].id,
entity_b: entities[j].id,
name_a: entities[i].canonical_name,
name_b: entities[j].canonical_name,
similarity: similarity.toFixed(3)
});
}
}
}

The min_memory_count: 5 filter focuses the search on entities with enough backing to be worth resolving. Entities with 1–2 memories are often transient mentions that the system correctly didn't resolve to an existing entity — don't merge these automatically.
When you find suspected duplicates, inspect their backing memories before merging:
const memoriesA = await recall.listMemories({
scope,
filter: { subject_entity_id: entityA.id }
});
const memoriesB = await recall.listMemories({
scope,
filter: { subject_entity_id: entityB.id }
});
// If they refer to the same person, merge them
if (shouldMerge(memoriesA, memoriesB)) {
await recall.mergeEntities({
primary: entityA.id, // keep this entity
secondary: entityB.id, // re-point secondary's memories to primary
scope,
reason: "entity resolution failure: same person, two nodes"
});
});

mergeEntities re-wires all memories, relations, and graph edges from the secondary entity to the primary. The secondary entity is marked with status: "merged" and merged_into: primary_id — it is excluded from future entity resolution so the same fragmentation doesn't recur. The merge is recorded in the audit log with the reason you provided, making it traceable.
After merging, relational queries against the primary entity will include memories that were previously attached only to the secondary. This is the intended effect. Verify by running a relational read against the primary entity and checking that the memory count increases by the expected amount.
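A quick post-merge check, reusing the listMemories filter from above:

const merged = await recall.listMemories({
  scope,
  filter: { subject_entity_id: entityA.id }
});
// The primary should now back the combined set from before the merge
console.log(merged.length); // expect memoriesA.length + memoriesB.length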
One scenario where you should not merge: when two entities with similar names are genuinely different people in the same namespace (e.g., two different people named "Alex" in a team context). In this case, the entity resolver was correct to keep them separate. Use the recall.updateEntity API to add disambiguating context to each entity's canonical_name field — for example, "Alex Chen" and "Alex Rivera" — so the resolver has more signal in future sessions.
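A sketch of that disambiguation step; updateEntity is named above, but its exact parameter shape is assumed here for illustration:

// Add disambiguating context so the resolver keeps the two Alexes apart.
await recall.updateEntity({
  entity_id: entityAlexChen.id,   // hypothetical handle for the first "Alex"
  scope,
  patch: { canonical_name: "Alex Chen" }
});
await recall.updateEntity({
  entity_id: entityAlexRivera.id, // hypothetical handle for the second "Alex"
  scope,
  patch: { canonical_name: "Alex Rivera" }
});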
Scope isolation debugging
A common production bug: memories from one user appearing in another user's context. This is prevented by the I-5 provenance invariant (memory scope must match source turn scope), but write path bugs upstream of Recall can submit turns with wrong scope metadata. The I-5 check fires at storage time and rejects the write — but it only catches mismatches that are detectable at write time. If the upstream system assigns the wrong user ID to a session before calling remember(), Recall has no way to know the ID is incorrect.
Check for scope pollution by comparing the provenance.source_scope on each memory to the scope used to retrieve it:
const userMemories = await recall.listMemories({
scope: { user_id: userId, agent_id: agentId },
filter: { created_since: "2026-05-01T00:00:00Z" }
});
// Check each memory's provenance scope matches the requested scope
const scopeViolations = userMemories.filter(
(m) => m.provenance.source_scope?.user_id !== userId
);
if (scopeViolations.length > 0) {
console.error(
`${scopeViolations.length} memories with mismatched provenance scope`
);
// These memories were either:
// 1. Migrated incorrectly from another user's namespace
// 2. Stored via a bug in the scope assignment logic upstream
// Quarantine them immediately
for (const mem of scopeViolations) {
await recall.forget({
memory_id: mem.id,
scope: { user_id: userId, agent_id: agentId },
reason: "scope isolation violation: provenance scope mismatch"
});
}
}

The scope field on a memory is set at write time and is immutable thereafter; it is part of the provenance invariant. If a memory is in the wrong scope, the only remediation is to delete it from the wrong scope and re-extract from the correct source turns, this time with the correct scope metadata.
After quarantining scope violations, investigate the write path to find where the wrong user ID was assigned. Common causes: a shared session ID being reused across user contexts in a multi-tenant setup, a middleware layer that maps session IDs to user IDs incorrectly after a session handoff, and migrations that bulk-copy memories without adjusting scope fields. The audit log entry for the original insert will contain the actor (API key) and timestamp, which helps narrow down which component of the write path submitted the incorrect scope.
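That lookup uses the auditLog call from earlier; the original insert event carries the actor and timestamp:

const history = await recall.auditLog({
  memory_id: scopeViolations[0].id,
  scope: { user_id: userId, agent_id: agentId }
});
const insertEvent = history.find((e) => e.event_type === "insert");
console.log(insertEvent?.actor);     // e.g. "api_key:ak_...", which component wrote it
console.log(insertEvent?.timestamp); // when: correlate with deploys and migrations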
Hallucination investigation workflow
When a user reports that the agent said something false that appeared to be grounded in memory, the investigation workflow follows a consistent pattern. The goal is to determine whether the false claim came from a wrong memory (an extraction failure) or from the model going beyond what the memory actually said (a faithfulness failure). The two cases have different root causes and different fixes.
Step 1: Get the response that contained the false claim and the retrieval context that was injected into the prompt. The retrieval context should include memory IDs — if your prompt template doesn't currently include them, add them. They are essential for this kind of investigation. Identify which memory IDs were cited by the response.
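If your prompt assembly doesn't carry memory IDs today, the change is small. A sketch, assuming a retrieval result exposing id and content fields (the retrieved.memories shape is illustrative):

// Inject memories with their IDs so later investigations can cite them.
const context = retrieved.memories
  .map((m) => `[${m.id}] ${m.content}`)
  .join("\n");
// Produces lines like: "[mem_8B3C...] Georgian's manager at Arrive is Sarah"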
Step 2: Fetch those memories and examine their provenance:
const citedMemories = await Promise.all(
citedMemoryIds.map((id) => recall.getMemory(id, scope))
);
for (const memory of citedMemories) {
console.log({
content: memory.content,
confidence: memory.confidence,
source_turns: memory.provenance.source_turn_ids,
extractor: memory.provenance.extracted_by.model,
extracted_at: memory.provenance.extracted_at
});
}

Step 3: Check whether the memory content actually supports the claim the agent made. There are two distinct failure modes here:
Case A — The memory content is factually correct, but the model hallucinated beyond what the memory stated. The memory says "user works at Volkswagen." The agent said "user is the CEO of Volkswagen." This is a faithfulness failure — the model over-inferred from the retrieved memory, adding a seniority claim that has no grounding. This is not a memory quality issue. The memory is correct. The fix is a prompt-level instruction: add an explicit faithfulness constraint such as "cite exactly what the memory states; do not infer roles, relationships, or attributes beyond what is explicitly recorded." Evaluate the impact of this instruction across a representative sample of queries before deploying it — overly strict faithfulness instructions can make the agent less useful when reasonable inference is appropriate.
Case B — The memory content is itself wrong. The memory says "user is the CEO of Volkswagen" when the user's actual role is something more junior. Trace back to the source turns via source_turn_ids. If the source turn says something weaker — "I had a call with the Volkswagen team today" — and the extractor produced "user is CEO of Volkswagen," this is an extraction over-generalization. The extractor drew an ungrounded inference from insufficient evidence. Correct the specific memory using updateMemory or forget, then address the root cause by tightening the extraction prompt: require that every extracted fact be directly stated in the source turn, not inferred from context.
Step 4: For case B, run debugRetrieval to understand why the wrong memory surfaced at a high rank, and whether a correct version of the fact exists in the store at a lower rank:
const debug = await recall.debugRetrieval({
query: "what is the user's role at Volkswagen?",
scope,
top_k_to_inspect: 10
});
// Review debug.retrieved_memories to find whether a correct memory exists
// at a lower rank, and why the wrong one ranked higher

If a correct memory exists but ranked lower, the problem is ranking competition. The wrong memory may have higher confidence (it was extracted with high certainty by the model, even though the underlying inference was wrong) or a stronger embedding match to the query terms. After correcting or deleting the wrong memory, verify that the correct memory now surfaces at an appropriate rank.
Common extraction error patterns and fixes
The most frequent extraction errors, detectable via provenance analysis and audit log inspection:
Over-extraction of agent statements. The model extracts facts from the assistant's own responses rather than from user statements. An assistant turn saying "I understand that you work at Volkswagen" produces a memory "user works at Volkswagen" — sourced from the agent's restatement rather than the user's direct assertion. The provenance will show source_turn_ids pointing to an assistant turn rather than a user turn. Fix: set extract_from_assistant: false in the write pipeline config, which restricts extraction to user turns only. If you need to extract from assistant turns for other reasons (e.g., the assistant generates structured data that should be remembered), add a custom instruction: "extract only facts the user directly asserted or confirmed; do not extract facts from the assistant's restatements or hypotheticals."
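Detection is mechanical with the conversation-store lookup from earlier. Turns in that example embed the speaker in the content prefix, which this sketch assumes; memory is the memory under investigation:

// Flag memories grounded only in assistant turns (the over-extraction signature).
const sourceTurns = await yourConversationStore.getTurns(
  memory.provenance.source_turn_ids
);
const assistantOnly = sourceTurns.every((t) => t.content.startsWith("assistant:"));
if (assistantOnly) {
  // Candidate for review: the "fact" came from the agent's restatement
  console.warn(`memory ${memory.id} is grounded only in assistant turns`);
}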
Pronoun resolution to wrong entity. A memory is attributed to the wrong entity because the pronoun resolver mapped "she" or "they" to a different entity than the one being discussed. This is detectable when subject_entity_id on the memory points to an entity whose name doesn't appear in the source turns. Inspect the resolution trace:
const trace = await recall.getTrace(memory.provenance.extraction_trace_id);
const resolutionSpan = trace.spans.find((s) => s.name === "resolve_refs");
const pronounResolution =
resolutionSpan?.attributes?.["pronoun_resolutions"];
// e.g. { "she (turn_042)": "ent_sarah_XYZ", "she (turn_047)": "ent_priya_XYZ" }If "she" resolved to the wrong entity, the ring buffer's recent context was stale at the time of extraction. The resolver tracks which entity a pronoun most recently referred to in a sliding window of turns. When that window doesn't include the turn where the entity was last named directly, resolution degrades. Fix: increase pronoun_resolution.ring_buffer_size from the default 20 turns to 40 turns to give the resolver a wider context window. The cost is marginally higher extraction latency on long conversations; the benefit is fewer entity misattributions in threads with multiple people.
Temporal anchor mismatch. An event memory's event_at field is set to the extraction timestamp rather than the time of the event described in the conversation. "I had a great meeting yesterday" produces event_at: <today> instead of event_at: <yesterday>. This causes the memory to sort incorrectly in temporal retrieval — it appears to be a memory from today rather than yesterday, which affects freshness scoring and temporal range queries. The extraction timestamp is available in provenance.extracted_at; if event_at matches extracted_at for an event memory, that's a strong signal the temporal anchor was incorrectly set. Fix: add an explicit instruction to the extraction prompt: "For event memories, set event_at to the time when the event occurred according to the conversation. 'Yesterday' means the calendar day before the conversation date. Do not set event_at to the current extraction time." You may also need to pass the conversation date explicitly to the extraction context so the model can resolve relative time expressions correctly.
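The signature is easy to scan for. A sketch, assuming event memories expose event_at and that listMemories accepts a memory_type filter (both field names are taken from the text above; the filter key is an assumption):

const events = await recall.listMemories({
  scope,
  filter: { memory_type: "event" }
});
// event_at equal to extracted_at is the mis-set temporal anchor signature
const suspectAnchors = events.filter(
  (m) => m.event_at === m.provenance.extracted_at
);
console.log(`${suspectAnchors.length} event memories with suspect anchors`);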
Confidence inflation from corroboration. Multiple turns expressing similar but not identical facts cause the confidence score to inflate beyond what the evidence supports. A user mentioning "I usually prefer async communication" in three separate sessions might produce a memory with confidence 0.97, even though each individual mention was hedged. High confidence means the memory will survive policy filters and rank highly — but if the underlying statements were all hedged ("usually," "generally," "sometimes"), that confidence level may be misleading. This is less a bug and more a calibration issue, but it matters for the hallucination investigation workflow: a high-confidence memory that is hedged at the source will suppress lower-confidence memories that are more directly stated. If you're seeing this pattern, consider adding a post-extraction calibration step that penalizes confidence when source statements contain hedging language.
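A sketch of such a calibration pass. The hedge list, threshold, and penalized value are all illustrative, and it assumes confidence is patchable via updateMemory:

const HEDGES = /\b(usually|generally|sometimes|often|maybe|probably)\b/i;

async function calibrateConfidence(
  memory: { id: string; confidence: number; provenance: { source_turn_ids: string[] } },
  scope: { user_id: string; agent_id: string }
) {
  const sourceTurns = await yourConversationStore.getTurns(
    memory.provenance.source_turn_ids
  );
  const hedged = sourceTurns.some((t) => HEDGES.test(t.content));
  if (hedged && memory.confidence > 0.9) {
    await recall.updateMemory({
      memory_id: memory.id,
      scope,
      patch: { confidence: 0.75 } // illustrative cap for hedged sources
    });
  }
}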
Each of these patterns leaves a detectable signature in provenance data. Over-extraction of agent statements shows source_turn_ids pointing to assistant turns. Pronoun misresolution shows entity IDs that don't match the named entities in source turns. Temporal anchor mismatch shows event_at equal to extracted_at. Confidence inflation is visible when high-confidence memories trace back to hedged source statements. The audit log and trace data give you the raw material to identify all of these without instrumenting the extraction pipeline itself.