Layered Hallucination Defense in a Persistent Memory System
Abstract
Hallucinations in a persistent memory system are qualitatively worse than hallucinations in a stateless LLM: a fabricated memory stays in the store and gets retrieved on every future query about that entity. This paper documents Recall's three-layer defense architecture — write-time grounding verification, store-time consistency scanning, and read-time faithfulness scoring — with full algorithm specifications, worked examples, cost analysis, and the independence argument that makes multiplicative (rather than additive) escape rate reduction valid.
The Problem with Persistent Hallucinations
A hallucination in a stateless LLM response is bad. A hallucination in a persistent memory system is worse by an order of magnitude.
In a stateless system, the hallucination exists for one turn. It affects one response. No future query sees it. In a memory system, the hallucinated fact is persisted, indexed, and retrieved on every future query about that entity. A fabricated employer, a fictional project name, an invented relationship — these become permanent context injected into every subsequent conversation.
From our audit of production memory systems: "hallucinated user profiles, fabricated employers, a fictional character the model invented independently across 6+ days of sessions." [1] The fictional character was retrieved as fact for six days because nothing detected the hallucination at write time.
This motivates defense in depth: multiple independent guards at different points in the memory pipeline, each catching the class of hallucinations the others miss.
Architecture: Three Attack Surfaces
Hallucinations enter a memory system from three distinct directions, requiring different defenses:
Conversation turn
│
▼
┌──────────────────────┐
│ WRITE-TIME │ ← extractor invents facts not in the source
│ Grounding check │ ← Guard #1
└──────────┬───────────┘
▼
┌──────────────────────┐
│ STORE-TIME │ ← contradictions accumulate silently over time
│ Consistency scan │ ← Guard #2
└──────────┬───────────┘
▼
┌──────────────────────┐
│ READ-TIME │ ← LLM generates claims beyond its provided context
│ Faithfulness score │ ← Guard #3 (opt-in)
└──────────┬───────────┘
▼
Response

| Guard | Trigger | Default | Latency added | Cost/op |
|---|---|---|---|---|
| Grounding | Write pipeline (between classify and resolve refs) | ON | ~100ms async | ~$0.0001 |
| Consistency scan | Background worker (nightly) | ON | 0 (async) | ~$0.005 amortized per 10K memories |
| Faithfulness | Read pipeline (after LLM response) | OFF | ~150–250ms | ~$0.0005 |
Guard #1: Write-Time Grounding Verification
What it catches
Write-time grounding catches fabricated extractions — cases where the extraction model invented a fact not present in the source conversation turns.
The grounding verifier asks a different question than the classifier. The classifier asks "is this memory worth storing?" — junk detection based on memory quality. The verifier asks "does the source actually say this?" — hallucination detection based on source support.
These are independent questions. A hallucinated memory about a real topic passes the quality check (it looks plausible) but fails the grounding check (it's not in the source).
Algorithm
async fn verify_grounding(
candidate: &CandidateMemory,
source_turns: &[Turn],
verifier: &dyn LlmClient,
) -> GroundingVerdict
enum GroundingVerdict {
Supported { evidence_spans: Vec<TextSpan> },
Partial { evidence_spans: Vec<TextSpan>, confidence_penalty: f32 },
NotSupported { reason: String },
Unknown, // verifier itself failed
}

The prompt provides the candidate memory and the source turns and asks the verifier to return one of SUPPORTED, PARTIAL, NOT_SUPPORTED. Evidence spans from the source are returned with each verdict — these are stored in the memory's provenance.evidence_spans field and are traceable from any future memory retrieval.
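As a sketch, the prompt assembly might look like the following; the struct fields and prompt wording are illustrative stand-ins, not Recall's actual types or prompt text.

```rust
// Illustrative sketch of assembling the grounding-verifier prompt.
// Struct fields and prompt wording are stand-ins, not Recall's actual API.
struct Turn {
    speaker: String,
    text: String,
}

struct CandidateMemory {
    content: String,
}

fn build_grounding_prompt(candidate: &CandidateMemory, source_turns: &[Turn]) -> String {
    // Number the turns so evidence spans can be traced back to a specific turn.
    let turns = source_turns
        .iter()
        .enumerate()
        .map(|(i, t)| format!("Turn {}: [{}] {}", i + 1, t.speaker, t.text))
        .collect::<Vec<_>>()
        .join("\n");

    format!(
        "You are verifying whether a candidate memory is supported by the source turns.\n\
         Candidate memory:\n{}\n\nSource turns:\n{}\n\n\
         Answer with exactly one of SUPPORTED, PARTIAL, NOT_SUPPORTED, and list the \
         exact source spans that support the memory (an empty list if none).",
        candidate.content, turns
    )
}
```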
Verdict handling
| Verdict | Action | Confidence adjustment |
|---|---|---|
| Supported | Keep, record evidence spans | None |
| Partial | Keep with grounding_partial tag, optionally rewrite content to match evidence | −0.10 to −0.30 |
| Not supported | Drop, increment hallucination_blocked_total | Discarded |
| Unknown | Per config: Block, Allow with flag, or Queue | — |
For Partial verdicts with rewrite_partial: true, a second LLM call rewrites the content to match only the evidence spans. This produces a softened but grounded version: "Georgian expects to deliver the MVP by end of April" rather than the fabricated "Georgian will deliver the MVP by April 30 with 95% confidence."
If adjusted_confidence < 0.3 (the minimum after penalty), the memory is dropped. Speculative memories that can't be fully grounded should not persist.
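A minimal sketch of this verdict handling, assuming simplified types; the constant mirrors the min_confidence_after_penalty config key, and everything else is illustrative.

```rust
// Illustrative sketch of verdict handling: apply the penalty, check the floor, drop or store.
const MIN_CONFIDENCE_AFTER_PENALTY: f32 = 0.3;

struct TextSpan {
    text: String,
}

enum GroundingVerdict {
    Supported { evidence_spans: Vec<TextSpan> },
    Partial { evidence_spans: Vec<TextSpan>, confidence_penalty: f32 },
    NotSupported { reason: String },
    Unknown, // verifier itself failed
}

enum WriteDecision {
    Store { confidence: f32, tags: Vec<&'static str>, evidence_spans: Vec<TextSpan> },
    Drop { reason: String },
    Queue, // verifier failed; defer per the on_verifier_failure config
}

fn apply_verdict(base_confidence: f32, verdict: GroundingVerdict) -> WriteDecision {
    match verdict {
        GroundingVerdict::Supported { evidence_spans } => WriteDecision::Store {
            confidence: base_confidence,
            tags: vec![],
            evidence_spans,
        },
        GroundingVerdict::Partial { evidence_spans, confidence_penalty } => {
            let adjusted = base_confidence - confidence_penalty;
            if adjusted < MIN_CONFIDENCE_AFTER_PENALTY {
                // Speculative memories that can't be fully grounded are not persisted.
                WriteDecision::Drop { reason: "partial grounding below confidence floor".into() }
            } else {
                WriteDecision::Store {
                    confidence: adjusted,
                    tags: vec!["grounding_partial"],
                    evidence_spans,
                }
            }
        }
        GroundingVerdict::NotSupported { reason } => WriteDecision::Drop { reason },
        GroundingVerdict::Unknown => WriteDecision::Queue,
    }
}
```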
Worked example: caught fabrication
Source turns:
Turn 1: "Let's schedule the meeting for next Tuesday."
Turn 2: "I'll be joining from my home office in Bangalore."

Extraction output:
{
"type": "fact",
"predicate": "works_at",
"object": "Google",
"content": "Georgian works as a software developer at Google",
"confidence": 0.85
}

Grounding verdict:
{
"verdict": "NOT_SUPPORTED",
"evidence_spans": [],
"reasoning": "No mention of Google or employment in the source turns."
}

Action: drop. hallucination_blocked_total incremented. Audit log entry created with the rejected candidate and the verdict reason. The fabricated memory never enters the store.
Worked example: partial grounding → rewrite
Source turn: "I think I should be done with the MVP by end of April, pretty confident."
Extraction output: "Georgian will deliver the Inbox3 MVP by April 30 with 95% confidence"
Verdict:
{
"verdict": "PARTIAL",
"evidence_spans": [{"span": "done with the MVP by end of April"}],
"reasoning": "'End of April' is supported; '95% confidence' extrapolates from 'pretty confident'",
"confidence_penalty": 0.20
}

Stored memory:
{
"content": "Georgian expects to deliver Inbox3 MVP by end of April",
"confidence": 0.62,
"tags": ["grounding_partial"]
}

The rewrite is cheap (one Haiku call, ~$0.00003) and prevents the spurious quantification from being cited verbatim in future responses.
Cost analysis
Per 1,000 user turns:
- Pre-filter rejects 40% → 600 turns reach extraction
- Extraction produces 1.2 candidates per turn → 720 candidates
- Classifier rejects 50% → 360 candidates reach grounding
- 360 grounding calls at ~$0.0001 each → ~$0.036 per 1,000 turns
For context: a single hallucinated memory that persists for 1,000 future retrievals costs far more in degraded agent quality than $0.036 in prevention spend.
Guard #2: Store-Time Consistency Scanning
What write-time grounding misses
Each memory is grounded individually against its source turns. But two individually-grounded memories can still contradict each other:
- Temporal evolution: "Georgian works at Datakynd" (June 2025) and "Georgian works at Arrive" (April 2026) — both grounded in their sources, but the store now holds a contradiction unless one supersedes the other.
- Semantic contradiction: "Georgian prefers concise responses" and "Georgian prefers detailed explanations" — distinguishable by context but the store doesn't know which applies when.
- Extractor drift: an extractor upgrade produces memories with slightly different predicates for the same underlying facts. Old + new coexist and look contradictory.
Three signals for finding contradictions
Signal A — structural: same (subject, predicate) but different object. The clearest case: same predicate, different value.
SELECT subject, predicate, array_agg(id) AS memory_ids
FROM memories
WHERE deleted_at IS NULL AND superseded_by IS NULL
AND type IN ('fact', 'preference')
GROUP BY subject, predicate
HAVING count(*) > 1;

Signal B — semantic: memories with cosine similarity > 0.85 but different subjects or predicates. Catches paraphrase duplicates that escaped write-time dedup.
Signal C — temporal: memories about the same (subject, predicate) with different valid_from timestamps more than 30 days apart. Catches legitimate job changes and role shifts that should trigger supersession.
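Signals B and C reduce to simple predicates over pairs of memories. A sketch under a simplified in-memory representation (field names are illustrative; in practice Signal B runs against the vector index rather than pairwise):

```rust
// Illustrative pairwise checks for Signals B and C over simplified memory records.
struct StoredMemory {
    subject: String,
    predicate: String,
    embedding: Vec<f32>,
    valid_from_unix: i64, // seconds
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Signal B: near-duplicate content that escaped write-time dedup because
/// the subject or predicate differs.
fn signal_b(a: &StoredMemory, b: &StoredMemory, threshold: f32) -> bool {
    let different_key = a.subject != b.subject || a.predicate != b.predicate;
    different_key && cosine(&a.embedding, &b.embedding) > threshold
}

/// Signal C: same (subject, predicate) asserted at timestamps more than
/// `drift_days` apart, a candidate for temporal supersession.
fn signal_c(a: &StoredMemory, b: &StoredMemory, drift_days: i64) -> bool {
    let same_key = a.subject == b.subject && a.predicate == b.predicate;
    let drift_secs = (a.valid_from_unix - b.valid_from_unix).abs();
    same_key && drift_secs > drift_days * 24 * 3600
}
```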
LLM judge verdict
For each suspect cluster, one LLM call classifies it:
| Verdict | Meaning | Action |
|---|---|---|
| Equivalent | All say the same thing, different phrasing | Merge into canonical; set others superseded_by = canonical |
| TemporalEvolution | Fact changed over time; latest supersedes | Set superseded_by, valid_to on older memories |
| Contradiction | Genuinely contradictory; can't auto-resolve | Flag for human review, link contradicts_with |
| Distinct | False positive; memories are about different things | No-op |
Auto-actions fire for Equivalent and TemporalEvolution. Contradiction requires human review because the system can't reliably determine which memory is true without additional context.
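A sketch of the verdict dispatch, gated by the auto_actions configuration; the types here are illustrative, not Recall's internal API.

```rust
// Illustrative dispatch from judge verdict to store action, honoring the
// auto_actions configuration.
struct AutoActions {
    merge_equivalent: bool,
    supersede_temporal: bool,
    flag_contradiction: bool,
}

enum JudgeVerdict {
    Equivalent,
    TemporalEvolution,
    Contradiction,
    Distinct,
}

enum ClusterAction {
    MergeIntoCanonical,
    SupersedeOlder,
    FlagForHumanReview,
    NoOp,
}

fn dispatch(verdict: JudgeVerdict, cfg: &AutoActions) -> ClusterAction {
    match verdict {
        // Same fact, different phrasing: collapse into one canonical memory.
        JudgeVerdict::Equivalent if cfg.merge_equivalent => ClusterAction::MergeIntoCanonical,
        // Fact changed over time: latest memory supersedes the older ones.
        JudgeVerdict::TemporalEvolution if cfg.supersede_temporal => ClusterAction::SupersedeOlder,
        // Genuinely contradictory: never auto-resolved, always surfaced to a human.
        JudgeVerdict::Contradiction if cfg.flag_contradiction => ClusterAction::FlagForHumanReview,
        // Distinct clusters, or the auto-action is disabled in config: leave the store untouched.
        _ => ClusterAction::NoOp,
    }
}
```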
Worked example: temporal evolution
Memory A (2025-06-10): subject=ent_georgian, predicate=works_at, object="Datakynd"
Memory B (2026-04-05): subject=ent_georgian, predicate=works_at, object="Arrive"

Judge verdict: TEMPORAL_EVOLUTION. Action: mem_A.superseded_by = mem_B, mem_A.valid_to = mem_B.valid_from. Both remain in store; retrieval returns only mem_B by default. Historical queries can still reach mem_A.
Worked example: unresolvable contradiction
Memory A (2026-04-01): subject=ent_inbox3, predicate=uses_database, object="Supabase"
Memory B (2026-04-02): subject=ent_inbox3, predicate=uses_database, object="Neon"

Judge verdict: CONTRADICTION. The one-day gap and same predicate make it ambiguous whether this is a migration (temporal evolution) or a hallucination. Neither auto-action fires. Dashboard alert: "2 memories in contradiction, review needed." Retrieval returns both, flagged; the LLM sees the conflict explicitly.
Merge semantics for equivalent clusters
When three memories all say "Georgian lives in Bangalore" with different phrasing:
- Pick highest-confidence as canonical
- Union evidence_spans across all
- access_count = sum(all access counts)
- Mark others superseded_by = canonical_id
- Full history preserved in audit log
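A minimal sketch of that merge over simplified records; the field names mirror the list above, and the function is illustrative rather than Recall's implementation.

```rust
// Illustrative merge of an Equivalent cluster: the highest-confidence memory
// becomes canonical, evidence and access counts are folded into it, and the
// rest are marked superseded. Nothing is deleted.
struct MemoryRecord {
    id: String,
    confidence: f32,
    evidence_spans: Vec<String>,
    access_count: u64,
    superseded_by: Option<String>,
}

fn merge_equivalent(mut cluster: Vec<MemoryRecord>) -> Vec<MemoryRecord> {
    // Pick the highest-confidence memory as canonical.
    cluster.sort_by(|a, b| b.confidence.partial_cmp(&a.confidence).unwrap());
    let canonical_id = cluster[0].id.clone();

    // Union evidence spans and sum access counts across the whole cluster.
    let mut all_spans: Vec<String> = Vec::new();
    let mut total_access = 0;
    for m in &cluster {
        for s in &m.evidence_spans {
            if !all_spans.contains(s) {
                all_spans.push(s.clone());
            }
        }
        total_access += m.access_count;
    }
    cluster[0].evidence_spans = all_spans;
    cluster[0].access_count = total_access;

    // Mark the non-canonical memories as superseded; history stays in the audit log.
    for m in cluster.iter_mut().skip(1) {
        m.superseded_by = Some(canonical_id.clone());
    }
    cluster
}
```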
Cost
Default schedule: nightly at 3 AM. Per 10,000 memories in a typical namespace:
- Cluster detection SQL: ~500ms
- Judge calls: ~20–50 clusters, 1 Haiku call each → ~$0.005
- Write operations: SQL only
Total: < 1 second compute + ~$0.005 LLM spend per night. Amortized across a namespace with 10K memories, this is negligible.
Guard #3: Read-Time Faithfulness Scoring (Opt-In)
What the first two guards miss
Guards 1 and 2 ensure the memory store is accurate and consistent. Guard 3 addresses a different problem: even with an accurate store, the LLM generating the response can add claims beyond what the provided context contains.
Consider: the context contains "Inbox3 is at 60% completion, blockers are OAuth and rate limits." The LLM might respond "Your team of 3 engineers is currently allocated 50% to this project." The team size and allocation percentage were hallucinated from nothing — the memory store is fine, but the response isn't.
Algorithm
- Parse the LLM response into atomic factual claims (exclude advice and opinions).
- For each factual claim, check whether the provided context memories support it.
- Score: faithfulness = supported_claims / total_claims.
- Classify risk level based on unsupported claim count and criticality.
Input response:
"Your Inbox3 project is at 60% completion. The main blockers are OAuth
and rate limits. Your team of 3 engineers is allocated 50% to this project."
Claims extracted:
1. Inbox3 at 60% completion → context check → SUPPORTED (mem_X)
2. Blockers: OAuth, rate limits → context check → SUPPORTED (mem_Y)
3. Team of 3 engineers → context check → UNSUPPORTED
4. 50% allocation → context check → UNSUPPORTED
Faithfulness: 2/4 = 0.50 → risk HIGH

Risk classification
| Risk | Condition |
|---|---|
| None | 0 unsupported claims |
| Low | ≤ 2 non-critical unsupported claims |
| Medium | 1–2 critical factual unsupported claims |
| High | ≥ 3 unsupported, or any critical unsupported |
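A sketch of the scoring loop and a threshold-based classification consistent with the worked example above and the risk_thresholds configuration. Claim extraction and the per-claim support check are stubbed out here, since in practice they are LLM calls, and the count/criticality rules in the table refine the classification further.

```rust
// Illustrative faithfulness scoring: compute supported/total and classify
// risk against the configurable score thresholds (defaults 0.7 / 0.5).
struct Claim {
    text: String,
    supported: bool, // result of checking the claim against the memory context
}

#[derive(Debug)]
enum Risk {
    None,
    Low,
    Medium,
    High,
}

struct RiskThresholds {
    medium: f32, // score floors, not fractions of unsupported claims
    high: f32,
}

fn score_and_classify(claims: &[Claim], t: &RiskThresholds) -> (f32, Risk) {
    let total = claims.len();
    let supported = claims.iter().filter(|c| c.supported).count();
    let faithfulness = if total == 0 { 1.0 } else { supported as f32 / total as f32 };

    let risk = if supported == total {
        Risk::None // every claim traced to the provided context
    } else if faithfulness <= t.high {
        Risk::High // e.g. 2/4 = 0.50 in the worked example above
    } else if faithfulness < t.medium {
        Risk::Medium
    } else {
        Risk::Low // a small unsupported tail
    };
    (faithfulness, risk)
}
```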
Response policies
| Policy | Trigger | Action |
|---|---|---|
| warn | Medium or High | Annotate response with uncertainty note on unsupported claims |
| regenerate | High | Retry with stricter prompt; re-score; use if better |
| block | Any unsupported claim | "I don't have enough reliable information to answer that" |
warn is the recommended starting configuration. block is appropriate for high-stakes domains (medical, legal, financial) where any hallucination is unacceptable. regenerate adds the most latency (~400ms total when triggered) and is most useful when the LLM systematically extrapolates beyond its context.
Why opt-in by default
Faithfulness scoring adds ~150–250ms to response latency and ~$0.0005 in cost per response. For most chat applications, Guards 1 and 2 are sufficient. Guard 3 provides additional protection for high-stakes deployments where incorrect information has material consequences.
Why the Defense is Layered (Not Redundant)
The three guards are not the same check applied three times. Each operates on different inputs and catches structurally different failure modes:
| Guard | Input | Failure mode it catches |
|---|---|---|
| Grounding | (candidate memory, source turns) | Fabrication at extraction |
| Consistency scan | (stored memory clusters) | Accumulated contradictions |
| Faithfulness | (LLM output, provided context) | Claims added at generation |
A hallucination that escapes Guard 1 (say, a partial grounding that cleared the minimum confidence threshold) is unlikely to escape Guard 2 (which would catch a contradiction with an existing memory about the same predicate). A memory that escapes both Guards 1 and 2 might still produce an unsupported claim at generation time that Guard 3 catches.
This independence is what makes the escape rate multiplicative rather than additive.
The math
Let eᵢ be the escape rate for guard i — the fraction of hallucinations that get past it.
Preliminary empirical estimates:
- Guard 1 (grounding) catches ~80% of fabrications →
e₁ ≈ 0.20 - Guard 2 (consistency) catches ~50% of what slips through →
e₂ ≈ 0.50 - Guard 3 (faithfulness) catches ~70% of what reaches response →
e₃ ≈ 0.30
Single guard: 20% escape rate.
Three independent guards:
P(escapes all three) = e₁ × e₂ × e₃ = 0.20 × 0.50 × 0.30 = 0.03

Three percent, down from twenty. A 6.7× reduction from adding two more layers.
Caveat on independence: the independence assumption holds because each guard sees different data. It would break if a systematic extractor bias produced hallucinations that look consistent (fool Guard 2) and have high source-cited confidence (fool Guard 1). This is why observability tracks hallucination-blocked rates and alerts when they deviate — to catch systematic bias that the math doesn't account for.
The Hallucination Dashboard
Operations view of all three guards in one place:
Guard #1: Write-time grounding (last 24h)
Candidates processed: 1,847
Blocked (NOT_SUPPORTED): 43 (2.3%)
Partial (conf. penalty): 127 (6.9%)
Cost: $0.18
Guard #2: Consistency scan
Last run: 3 hours ago
Clusters found: 12
Auto-merged (equivalent): 7
Auto-superseded (temporal): 3
Flagged (contradiction): 2 ⚠
Guard #3: Read-time faithfulness (opt-in)
Responses scored: 2,341
Mean faithfulness: 0.94
High-risk responses: 18 (0.8%)
Regenerated: 12
Overall hallucination risk: LOW

Key metric to watch: the "Blocked (NOT_SUPPORTED)" rate at Guard 1. A healthy system blocks 2–4% of candidates — low enough that the extractor is working, high enough that the guard is catching things. If this rate suddenly rises to 15%, something upstream changed (extractor model update, new conversation domain, prompt regression). If it drops to near zero, the grounding verifier may be misconfigured or its prompts are too lenient.
What Layered Defense Doesn't Fix
Honesty about limits:
- Sincerely believed wrong memories. The user told the agent incorrect information. The memory is grounded (the user said it), consistent (no contradictions), and faithfully reproduced. None of the three guards catches this — it's a data quality problem, not a memory quality problem.
- Reasoning errors over correct memories. A correct memory can be misused by the LLM in downstream reasoning. Guard 3 checks whether the response claims are supported by the context, but doesn't verify whether the reasoning connecting claim to context is valid.
- Adversarial injection. If a malicious user deliberately injects false information to contaminate the memory store, all three guards can be fooled (the source "supports" the claim, there are no contradictions because the true memory was never written, and the response faithfully cites the injected memory).
The memory system's defense layers lift the floor on reliability. They are not a ceiling on agent trustworthiness.
Configuration
hallucination_guards:
grounding:
enabled: true
verifier_model: claude-haiku-4-5
min_confidence_after_penalty: 0.3
on_verifier_failure: Queue # Block | Allow | Queue
rewrite_partial: true
skip_for_types: [entity]
consistency_scan:
enabled: true
schedule: "0 3 * * *"
signals:
structural: true
semantic_similarity_threshold: 0.85
temporal_drift_days: 30
judge_model: claude-haiku-4-5
auto_actions:
merge_equivalent: true
supersede_temporal: true
flag_contradiction: true
max_clusters_per_scan: 200
faithfulness:
enabled: false
scorer_model: claude-haiku-4-5
on_hallucination: warn
risk_thresholds:
medium: 0.7
      high: 0.5

Start with the defaults. Enable Guard 3 (faithfulness.enabled: true) for high-stakes use cases. Tune min_confidence_after_penalty up if you want stricter write-time filtering, or down if you want to keep more partial inferences. The max_clusters_per_scan cap on Guard 2 controls the cost ceiling for the nightly job.
Calibrating Guard Sensitivity for Your Domain
The default configuration is calibrated for a general-purpose conversational assistant. Different domains impose different costs for false negatives (letting a hallucination through) versus false positives (blocking a legitimate memory). Adjusting the three guards to match your domain's tolerance is as important as enabling them in the first place.
Adjusting Guard 1 (grounding) sensitivity
The min_confidence_after_penalty threshold — default 0.3 — controls the floor below which a partially-grounded memory is dropped rather than stored. Moving this threshold has different implications depending on whether your domain optimizes for recall or for precision.
In a high-recall domain (legal research, medical history, personal knowledge management), missing a fact is catastrophic. A therapist's notes, a patient history, a lawyer's case timeline — these require preserving partial inferences that might be correct even when the source phrasing is imprecise. Lower this threshold to 0.2. A memory with base confidence 0.35 and a penalty of 0.10 produces adjusted_confidence = 0.35 − 0.10 = 0.25, which clears the 0.2 floor and gets stored — tagged grounding_partial and down-weighted for retrieval, but not lost.
In a high-precision domain (financial compliance, regulatory reporting, audit trails), a wrong fact causes real harm — a misquoted contract term, a fabricated trade size, an invented approval date. Raise the threshold to 0.5. Now that same memory with adjusted_confidence = 0.25 sits well below the floor and is dropped. Only memories whose adjusted confidence still clears 0.5 — a confident extraction with a modest penalty — survive.
The practical effect is significant at the margin: moving from 0.3 to 0.5 typically drops an additional 3–7% of write-time candidates in a general corpus, concentrated in the speculative-extrapolation failure mode where the extractor makes confident-sounding claims from weak hedged phrasing. Moving from 0.3 to 0.2 recovers roughly the same fraction, accepting that some of those stored memories will be wrong.
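The effect of the floor is a single comparison once the penalty is applied. A minimal sketch using the numbers from the example above (the helper name is illustrative):

```rust
// The floor is one comparison on the penalty-adjusted confidence.
fn survives(base_confidence: f32, penalty: f32, floor: f32) -> bool {
    (base_confidence - penalty) >= floor
}

fn main() {
    let (base, penalty) = (0.35, 0.10); // adjusted_confidence = 0.25
    assert!(survives(base, penalty, 0.2)); // high-recall floor: stored
    assert!(!survives(base, penalty, 0.3)); // default floor: dropped
    assert!(!survives(base, penalty, 0.5)); // high-precision floor: dropped
}
```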
Adjusting Guard 2 (consistency scan) sensitivity
Two parameters control consistency scan behavior.
The semantic_similarity_threshold for Signal B (default 0.85) determines when two memories are considered near-duplicate candidates for LLM judge evaluation. At 0.80, more clusters are found — higher recall, but more false positives passed to the judge. The judge handles them correctly (Distinct verdict fires for false positives), but at a small cost per false-positive cluster. At 0.90, fewer clusters are surfaced — some real contradictions expressed in moderately-similar but not near-identical language will be missed. The 0.85 default is a practical balance for English-language general knowledge; corpora with domain-specific terminology where the same concept has multiple phrasings (medical synonyms, legal equivalences) may benefit from a lower threshold closer to 0.80.
The temporal_drift_days threshold (default 30) controls the window for Signal C — temporal evolution detection. Facts that change faster than once a month — job titles in a fast-moving startup, project status in an active sprint, pricing in a volatile market — need a shorter window. Set temporal_drift_days: 7 to catch weekly-scale changes. The tradeoff: more clusters are generated per scan run, increasing judge call volume and nightly cost. In a namespace with 10K memories, moving from 30 days to 7 days may increase judge call volume by 40–60% on a domain with rapidly-evolving facts, though total cost remains small (rising from roughly $0.005 to $0.007–0.008 per nightly run).
Enabling and tuning Guard 3 (faithfulness)
Guard 3 is off by default. When you enable it, the two risk threshold parameters require domain-specific tuning before production use.
risk_thresholds.medium (default 0.7) and risk_thresholds.high (default 0.5) are faithfulness score thresholds, where faithfulness score = fraction of response claims that are supported by the provided memory context. A score of 0.6 (60% of claims supported) sits above the high threshold of 0.5 and below the medium threshold of 0.7, classifying the response as medium risk and triggering the warn policy.
For domains where the model is expected to draw legitimate inferences from memory — a reasoning-intensive research assistant that synthesizes across multiple stored notes, for instance — lower both thresholds, for example to medium: 0.55, high: 0.35. This acknowledges that a substantial fraction of claims being inferential rather than directly cited is acceptable, and keeps such responses out of the medium- and high-risk bands. Without this adjustment, a research assistant that connects two stored facts to produce a valid but non-literal conclusion will be flagged on nearly every response.
For domains requiring strict citation fidelity — a compliance bot, a contract review tool, a clinical summary generator — tighten instead: raise the medium threshold to 0.80 (high stays at 0.50) and set on_hallucination: block. Any response in which more than 20% of claims are unsupported is then flagged as at least medium risk, and the block policy rejects responses containing unsupported claims outright. These settings are intentionally strict: in a compliance context, a wrong number cited as fact has material legal consequences that outweigh the user experience cost of a blocked response.
Measuring guard effectiveness
Calibration without measurement is guesswork. Run a weekly effectiveness report against these target ranges:
- guard1_block_rate: healthy range is 2–4% of grounded candidates blocked. Below 0.5% suggests the verifier prompt has drifted toward permissiveness or the extractor model changed. Above 8% sustained suggests a domain shift or extractor regression.
- guard2_clusters_per_10k_memories: healthy range is 1–5 clusters per 10K memories per scan. Above 20 suggests either threshold misconfiguration or a large batch of conflicting extractions (often from an extractor prompt change). Below 0.5 sustained suggests the similarity threshold is too high.
- guard3_high_risk_rate: healthy range is below 1% for general chat, below 0.1% for high-stakes deployments with block policy. A rising trend over consecutive weeks is an early warning signal even before threshold alarms fire.
Sudden spikes in any of these metrics warrant investigation before they become a data quality problem at scale.
The "lenient in dev, strict in prod" pattern
During development — while iterating on extraction prompts, testing new memory types, or integrating a new conversation domain — tight thresholds block legitimate extractions and obscure whether the extractor is working correctly. Use loose thresholds in development: min_confidence_after_penalty: 0.1, faithfulness.enabled: false. Log blocked candidates but don't drop them; review the logs to distinguish false negatives from genuine hallucinations.
Before promoting to production, tighten: raise min_confidence_after_penalty to production target, enable faithfulness guard if the domain warrants it, lower semantic_similarity_threshold if the domain has synonym-heavy terminology. Store threshold config in environment-specific YAML files (config/guards.dev.yaml, config/guards.prod.yaml) and validate the promotion diff in a code review before deployment. A threshold change that looks minor — moving min_confidence_after_penalty from 0.2 to 0.4 — can alter the stored-memory rate by several percentage points on a live system.
Observability and Alerting
The three guards emit structured metrics that aggregate into a coherent picture of memory system health. Instrumenting these metrics and defining alert conditions is the operational complement to getting the sensitivity configuration right.
Metrics emitted by each guard
Guard 1 (grounding) emits:
- recall_grounding_verdicts_total{verdict="supported|partial|not_supported|unknown"} — raw verdict counts since process start, useful for ratio calculations over rolling windows.
- recall_grounding_latency_ms{quantile="0.5|0.99"} — p50 and p99 latency for the grounding verifier LLM call. A rising p99 (above 300ms) indicates verifier model saturation or network latency to the LLM provider.
- recall_hallucination_blocked_total — monotonically increasing count of NOT_SUPPORTED verdicts that resulted in a dropped candidate. This is the primary guard effectiveness signal.
Guard 2 (consistency scan) emits:
- recall_consistency_clusters_found_total — number of suspect clusters identified by the three signals per scan run.
- recall_consistency_actions_total{action="merge|supersede|flag"} — count of each auto-action taken. The ratio of merge + supersede to flag indicates how often the system can self-heal versus requiring human review.
- recall_consistency_scan_duration_ms — end-to-end duration of the nightly scan job. A sudden increase (more than 2× baseline) indicates either memory store growth crossing a cluster-detection knee or a threshold change that dramatically increased cluster volume.
Guard 3 (faithfulness) emits:
- recall_faithfulness_scores_total{risk="none|low|medium|high"} — distribution of risk classifications across scored responses.
- recall_faithfulness_latency_ms{quantile="0.5|0.99"} — faithfulness scoring latency. When the regenerate policy fires, the total latency for a single response can exceed 600ms; this metric helps distinguish scoring latency from regeneration latency.
- recall_response_policy_applied_total{policy="warn|regenerate|block"} — count of each policy action. A rising regenerate count with a stable block count suggests the LLM is extrapolating but the regenerated responses are better — the guard is working as designed. A rising block count suggests the domain is shifting or Guard 3 thresholds need recalibration.
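A sketch of how a guard might emit these metrics through a minimal counter interface. The MetricSink trait and the in-memory sink are hypothetical stand-ins for a real Prometheus or OpenTelemetry client; the metric names match those listed above.

```rust
// Hypothetical metric-emission interface; a real deployment would back this
// with a metrics client rather than a HashMap.
use std::collections::HashMap;

trait MetricSink {
    fn incr(&mut self, name: &str, labels: &[(&str, &str)]);
    fn observe_ms(&mut self, name: &str, labels: &[(&str, &str)], value_ms: f64);
}

/// Toy in-memory sink so the sketch is self-contained.
#[derive(Default)]
struct InMemorySink {
    counters: HashMap<String, u64>,
}

impl MetricSink for InMemorySink {
    fn incr(&mut self, name: &str, labels: &[(&str, &str)]) {
        let key = format!("{name}{labels:?}");
        *self.counters.entry(key).or_insert(0) += 1;
    }
    fn observe_ms(&mut self, _name: &str, _labels: &[(&str, &str)], _value_ms: f64) {
        // Latency histograms are elided in this sketch.
    }
}

/// Called once per grounding verdict in the write pipeline.
fn record_grounding(sink: &mut dyn MetricSink, verdict: &str, latency_ms: f64, dropped: bool) {
    sink.incr("recall_grounding_verdicts_total", &[("verdict", verdict)]);
    sink.observe_ms("recall_grounding_latency_ms", &[], latency_ms);
    if dropped {
        sink.incr("recall_hallucination_blocked_total", &[]);
    }
}
```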
Alert conditions and their causes
Alerting should distinguish between conditions that require immediate investigation and slow-moving trends that require scheduled recalibration.
guard1_block_rate > 8% for 3 or more consecutive hours: the grounding verifier is blocking an unusually high fraction of candidates. The most common causes are a new conversation domain the extractor handles poorly (generating confident-sounding but ungrounded extractions), an extractor model update that changed extraction behavior, or a prompt regression introduced during a recent deployment. Investigate by pulling the specific NOT_SUPPORTED reasons from the audit log — if a single predicate type accounts for most blocks, the extraction prompt for that predicate type has likely regressed.
guard1_block_rate < 0.5% sustained for 7 or more consecutive days: the grounding verifier may have become too permissive. Common causes include a verifier model update that made the verifier more lenient, a verifier prompt that drifted (cached prompt with stale content), or a domain shift where the new conversation topics happen to be well-grounded. Re-test against a golden dataset of known hallucinations; if the verifier fails to catch them, the prompt needs correction.
guard2_contradiction_flag_count > 10 in a single scan run: a large batch of memories entered the store that contradict existing memories and couldn't be auto-resolved. This pattern typically accompanies an extractor prompt change that altered predicate naming — the new memories use current_employer where the old memories used works_at. The semantic-similarity signal surfaces these pairs as suspect clusters, and because the predicates no longer line up, the judge cannot resolve them automatically and flags them even when they describe the same underlying relationship. Check whether a schema version change or extractor deployment happened within the preceding 48 hours.
guard3_high_risk_rate > 5% when Guard 3 is enabled: the LLM is systematically extrapolating beyond the provided memory context on more than 1 in 20 responses. This is typically either a retrieval problem (not enough relevant memories are being retrieved, so the LLM fills gaps with generation) or a prompt problem (the system prompt implicitly encourages synthesis beyond the provided context). Switch from warn to regenerate policy temporarily to force self-correction on high-risk responses, and investigate memory retrieval coverage for the affected query types.
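As a sketch, the first of these rules is a pure function over hourly counts; the other rules follow the same shape (the names and toy data are illustrative):

```rust
// Illustrative evaluation of the "block rate > 8% for 3+ consecutive hours" rule.
// Each entry in `hourly` is (candidates_processed, candidates_blocked) for one
// hour, oldest first.
fn guard1_block_rate_alert(hourly: &[(u64, u64)], threshold: f64, consecutive: usize) -> bool {
    let over = |&(processed, blocked): &(u64, u64)| {
        processed > 0 && blocked as f64 / processed as f64 > threshold
    };
    // Fire only if the most recent `consecutive` hours are all over threshold.
    hourly.len() >= consecutive && hourly[hourly.len() - consecutive..].iter().all(over)
}

fn main() {
    let last_four_hours = [(500, 10), (480, 52), (510, 49), (470, 44)];
    // 10.8%, 9.6%, 9.4% over the last three hours: the alert fires.
    assert!(guard1_block_rate_alert(&last_four_hours, 0.08, 3));
}
```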
Correlating guard signals
Isolated metric movements are easier to diagnose than correlated ones. Two correlation patterns are particularly informative.
When Guard 1 block rate increases and Guard 3 high-risk rate also increases simultaneously, both write-time and read-time quality are degrading. The most likely explanation is a domain shift: the system has started receiving conversations from a topic area the extractor handles poorly. At write time, the extractor fabricates facts the verifier catches (Guard 1 blocks more). At read time, the retrieved memories for the new domain are sparse, so the LLM extrapolates (Guard 3 high-risk rate rises). The diagnostic is to sample the new conversation topics and check whether memory retrieval coverage is lower for them than for baseline domains.
When Guard 2 finds many contradictions but Guards 1 and 3 are stable, the problem is localized to the store and is most likely temporal evolution not being detected as supersession at write time. Facts are changing legitimately (job changes, project completions, preference updates), but the write-time pipeline is not recognizing them as supersession candidates — instead, each new fact is stored alongside the old one as an apparent contradiction. Check that predicate_is_stateful is correctly set to true for the affected memory types. Stateful predicates (those where the latest value supersedes all prior values) should trigger supersession logic at write time, not wait for the nightly scan to catch them.
The weekly hallucination risk report
In addition to real-time alerting, generate a weekly summary that aggregates all three guards into a single executive view. The report should include: total memory candidates processed, Guard 1 block rate (and trend vs. prior 4 weeks), Guard 2 contradiction flags and auto-resolution rate, mean faithfulness score if Guard 3 is enabled, and Guard 3 high-risk rate trend. A flat or declining trend across all three guards indicates a stable, well-calibrated system. A rising trend in any guard's alert rate — even before threshold alarms fire — is an early warning signal that warrants investigation before it compounds.
Integration Patterns for Async Workflows
The three guards were designed for the common case: a synchronous write pipeline and a synchronous read pipeline, each handling one request at a time. Production deployments introduce complications — asynchronous write queues, streaming LLM responses, and multi-agent orchestration — that require specific adaptations.
The synchronous vs. async grounding trade-off
By default, Guard 1 runs asynchronously. The write pipeline places the extracted candidate in a write queue, returns immediately to the caller, and the grounding verifier processes candidates from the queue in the background. This keeps the write path from adding 100ms of latency to every conversation turn.
The trade-off is a brief window of exposure: a memory may be retrievable for seconds to minutes (depending on queue depth and verifier throughput) before grounding completes and potentially removes it. In most applications, a subsequent query within that window is unlikely, and the window is short enough that the risk is acceptable.
For applications where the write and read paths are tightly coupled — a single-session agent where a memory written in turn 3 is retrieved in turn 4 — this window is not acceptable. A fabricated memory could be written, retrieved, and cited within the same conversation before grounding removes it. For these cases, set grounding.mode: synchronous. This blocks the write path until the grounding verdict returns, adding approximately 100ms to write latency. The caller's response to the user turn is not affected (grounding runs in parallel with response generation), but the memory is not eligible for retrieval until the verdict clears it.
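A sketch of the difference between the two modes at the point where a candidate enters the write queue; the types are simplified stand-ins, and grounding.mode mirrors the config key mentioned above.

```rust
// Illustrative write-path gating for grounding mode. In async mode the memory
// is retrievable immediately and verified in the background; in synchronous
// mode it is held back from retrieval until the verdict clears it.
enum GroundingMode {
    Async,
    Synchronous,
}

struct Candidate {
    content: String,
}

struct PendingWrite {
    candidate: Candidate,
    retrievable: bool, // false until grounding clears it in synchronous mode
}

fn enqueue_write(candidate: Candidate, mode: &GroundingMode) -> PendingWrite {
    match mode {
        // Async: retrievable right away; a later NOT_SUPPORTED verdict removes it.
        // This is the brief exposure window described above.
        GroundingMode::Async => PendingWrite { candidate, retrievable: true },
        // Synchronous: invisible to retrieval until verify_grounding returns.
        GroundingMode::Synchronous => PendingWrite { candidate, retrievable: false },
    }
}
```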
Guard 3 in streaming responses
Faithfulness scoring requires the full LLM response as input — it must parse the response into atomic factual claims before it can check each claim against the memory context. This is structurally incompatible with streaming outputs, where the response is delivered token-by-token to the user before it is complete.
In streaming mode, Guard 3 can only run as a post-hoc check: score the completed response after delivery and trigger a corrective follow-up if the score is high-risk. The corrective pattern is an explicit correction message appended to the conversation: "I should clarify that [the unsupported claim] was not in the memories I was given — the information I have is [grounded alternative]." This is less user-friendly than blocking or regenerating the response before delivery, but it is the only approach compatible with streaming. The faithfulness score and any correction event are logged to the audit trail regardless, so post-hoc review remains possible even for streaming sessions.
For applications where faithfulness is a hard requirement (block policy), streaming must be disabled when Guard 3 is active. The two requirements are mutually exclusive: you cannot block a response that has already been streamed to the user.
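A sketch of that post-hoc pattern: score after the stream completes and, if the result is high risk, emit a follow-up correction rather than retracting tokens already delivered. The message format is an example, not a fixed Recall string.

```rust
// Illustrative post-hoc faithfulness handling for a streamed response.
// `Risk` mirrors the classification sketched earlier; the correction text is an example.
#[derive(PartialEq)]
enum Risk {
    None,
    Low,
    Medium,
    High,
}

struct ScoredResponse {
    risk: Risk,
    unsupported_claims: Vec<String>,
}

/// Returns an optional follow-up message to append to the conversation
/// after the streamed response has already been delivered.
fn post_hoc_correction(scored: &ScoredResponse, grounded_alternative: &str) -> Option<String> {
    if scored.risk != Risk::High {
        return None;
    }
    let claims = scored.unsupported_claims.join("; ");
    Some(format!(
        "I should clarify that the following was not in the memories I was given: {}. \
         The information I do have is: {}.",
        claims, grounded_alternative
    ))
}
```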
Faithfulness in multi-agent systems
In a multi-agent setup — a retrieval agent that fetches memories and passes them to a reasoning agent that generates the response — faithfulness scoring must be applied carefully to avoid over-constraining legitimate reasoning.
The retrieval agent's memory context is a set of stored facts. The reasoning agent's job is to think about those facts, connect them, draw inferences, and produce a response. The reasoning agent's internal chain-of-thought may legitimately extend well beyond the literal text of the memory context — that inferential step is the point of having a reasoning agent. Applying Guard 3 to intermediate reasoning steps would block nearly all useful outputs.
The correct placement is to apply Guard 3 only at the final output node of the agent graph — the externally-visible response that the user receives. The reasoning agent's chain-of-thought is internal state; the check is whether the final externally-visible claim, after reasoning completes, can be traced back to the memory context. A claim like "given that Georgian completed the OAuth integration last Tuesday and the rate-limit refactor is the only remaining blocker, the project is likely to ship by end of month" is an inference — it goes beyond the literal memory text — but it is a traceable inference from two specific memories. Guard 3, applied at the response node with appropriate threshold configuration for inference-heavy domains, would correctly classify this as low-risk rather than flagging the inferred conclusion as unsupported.
The practical configuration for a multi-agent system: apply Guard 3 at the response node only, with the lowered thresholds recommended above for inference-heavy domains (for example risk_thresholds.medium: 0.55, risk_thresholds.high: 0.35), and log the memory context that was available to the reasoning agent alongside each faithfulness score. When a high-risk score fires, the audit trail shows exactly which memories were provided and which claims in the response couldn't be traced to them — making it straightforward to determine whether the failure was a retrieval gap (right memories not retrieved) or a generation failure (model extrapolated despite sufficient context).
References
1. Mem0 Production Audit (2025): hallucinated user profiles across multi-day agent sessions.
2. LongMemEval: Benchmarking Retrieval-Augmented Generation for Long Conversational Histories (2025).
3. Es et al. (2023): "RAGAS: Automated Evaluation of Retrieval Augmented Generation." EMNLP.