Three Layers of Hallucination Defense

By Arc Labs Research · 10 min read

Memory systems can hallucinate. The model that extracts candidates from a turn can invent facts that aren't in the source text. Two memories that contradict each other can both sit in the store, fooling retrieval. A retrieved memory can be off-topic but high-similarity, and the agent answers from it. Each failure mode has a defense.

Each hallucination scenario gets caught at a different layer. Defense in depth: any one layer can fail and the next still catches it.

Layer 1 — write-time grounding

Every extracted candidate carries the source span it came from. Before persisting, a grounding check verifies the candidate is supported by that span. This catches:

  • Fabricated extractions. The model invented a fact not in the input. The grounding check fails; the candidate is dropped.
  • Unsupported predicates. The user said "I'm thinking about Berlin" — the model extracted "user lives in Berlin." Partial grounding; confidence penalty applied.

Implementation: a small LLM judge takes (source span, candidate) and returns supported/partial/unsupported/contradicted. Partial and unknown verdicts apply confidence penalties (see the confidence formula); a contradicted verdict rejects the candidate.
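A minimal sketch of that judge call, assuming a generic llm client that exposes a complete_json method; the client interface and function names are illustrative, not a specific SDK:

from dataclasses import dataclass

VERDICTS = {"supported", "partial", "not_supported", "contradicted"}

@dataclass
class GroundingResult:
    verdict: str
    evidence_span: str
    reasoning: str

def ground_candidate(llm, source_span: str, candidate: dict) -> GroundingResult:
    """Ask a small LLM judge whether the candidate is supported by its source span.
    llm is assumed to expose complete_json(prompt) -> dict; that interface is illustrative."""
    prompt = (
        f"<source>\n{source_span}\n</source>\n\n"
        f"<candidate>\n{candidate['content']}\n</candidate>\n\n"
        "Rate the candidate as SUPPORTED, PARTIAL, NOT_SUPPORTED, or CONTRADICTED. "
        'Return JSON: {"verdict": "...", "evidence_span": "...", "reasoning": "..."}'
    )
    try:
        out = llm.complete_json(prompt)
    except Exception:
        # verifier failure: surfaced as "unknown" and handled per config downstream
        return GroundingResult("unknown", "", "verifier call failed")
    verdict = str(out.get("verdict", "unknown")).lower()
    if verdict not in VERDICTS:
        verdict = "unknown"
    return GroundingResult(verdict, out.get("evidence_span", ""), out.get("reasoning", ""))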

Layer 2 — store-time consistency

Two memories that disagree about the same predicate cannot both be true. Store-time consistency runs as a periodic background job: scan for contradictions, propose supersessions, log unresolved conflicts for review.

  • Latent contradictions. A new memory contradicts an old one but the write-time conflict check missed it (the indexes were stale, or the embedding similarity was below threshold). The consistency scan catches it eventually.
  • Stale supersession chains. A memory was superseded; the memory that superseded it was itself later superseded. The chain now has more than one link. Consolidation flattens it (sketched below).
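A minimal sketch of that flattening step, using a hypothetical in-memory record shape with a superseded_by field; real storage would be a database update:

def flatten_supersession_chain(memories: dict) -> None:
    """Point every superseded memory directly at the terminal (current) memory.
    memories maps memory_id -> record dict with an optional 'superseded_by' key;
    this is an illustrative in-memory version of the consolidation step."""
    def terminal(mid: str) -> str:
        seen = set()
        while memories[mid].get("superseded_by") and mid not in seen:
            seen.add(mid)
            mid = memories[mid]["superseded_by"]
        return mid

    for mid, record in memories.items():
        if record.get("superseded_by"):
            record["superseded_by"] = terminal(mid)

# Example: the chain A -> B -> C collapses so both A and B point directly at C.
mems = {"A": {"superseded_by": "B"}, "B": {"superseded_by": "C"}, "C": {}}
flatten_supersession_chain(mems)
assert mems["A"]["superseded_by"] == "C" and mems["B"]["superseded_by"] == "C"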

Layer 3 — read-time faithfulness

Even with clean writes and a consistent store, retrieval can pull memories that match the query lexically but not semantically. A reranker — typically a cross-encoder over (query, memory) pairs — re-evaluates the top-K and discards low-grounding hits.

  • Off-topic high-similarity matches. Embedding similarity sometimes pulls memories that share surface form but answer a different question. Rerank suppresses these.
  • Outdated memories. A memory whose supersession was missed can still surface; reranking with awareness of supersession state filters it out.
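A sketch of that rerank-and-filter step, using sentence-transformers' CrossEncoder; the model name, score threshold, and memory record shape are illustrative defaults, not the system's actual configuration:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative model choice

def rerank(query: str, memories: list, threshold: float = 0.0) -> list:
    """Re-score the top-K retrieved memories against the query with a cross-encoder,
    drop superseded memories, and discard low-scoring hits."""
    live = [m for m in memories if not m.get("superseded_by")]   # supersession-aware filter
    if not live:
        return []
    scores = reranker.predict([(query, m["content"]) for m in live])
    ranked = sorted(zip(live, scores), key=lambda pair: pair[1], reverse=True)
    return [m for m, score in ranked if score >= threshold]

The threshold depends on the cross-encoder's score range; some models emit raw logits rather than probabilities, so calibrate it against held-out query/memory pairs.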

Why three layers, not one

Each layer has a different failure mode. Write-time grounding misses contradictions (it only sees one candidate at a time). Store-time consistency misses fabrications (the candidate was rejected before storage, so there is nothing to scan). Read-time faithfulness misses neither, but it runs at query time, which makes it slow and expensive, so you can't lean on it for everything. Three cheap layers compose into a stronger defense than any single expensive one. This is the standard defense-in-depth argument from security architecture, applied to memory.

What this doesn't fix

  • Honestly-believed wrong memories. The user told the agent something incorrect. The memory is consistent, grounded, and faithful. Defense layers don't catch this — that's a data problem, not a memory problem.
  • The agent's reasoning over correct memories. A correct retrieved memory can still be misused by the model. That's the prompt's problem, not memory's.

Memory's defense lifts the floor; it does not place a ceiling on the agent's overall truthfulness.

The escape rate math

The three-layer defense composes multiplicatively, not additively.

Empirically from preliminary testing:

  • Write-time grounding catches ~80% of hallucinations at write time → escape rate e₁ ≈ 0.20
  • Consistency scan catches ~50% of what slips through grounding → escape rate e₂ ≈ 0.50
  • Faithfulness scoring catches ~70% of what reaches response → escape rate e₃ ≈ 0.30

Combined escape rate: 0.20 × 0.50 × 0.30 = 0.03 (3%). Single-guard: 20%.
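The same arithmetic as a quick check:

# Combined escape rate under the independence assumption discussed below.
e1, e2, e3 = 0.20, 0.50, 0.30   # per-layer escape rates from preliminary testing
combined = e1 * e2 * e3         # 0.03 -> roughly 3% of hallucinations escape all three layers
single_guard = e1               # 0.20 -> 20% escape with write-time grounding alone
print(f"combined={combined:.2%}, single={single_guard:.2%}")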

Independence holds because each guard operates on different inputs: grounding sees (candidate, source turn), the consistency scan sees stored memory clusters, and faithfulness sees (LLM output, provided context). A hallucination that escapes one guard is unlikely to escape the next for the same structural reason.

Grounding specifics

The grounding check adds ~100ms to write latency. Confidence penalties by verdict:

  • Supported → keep, record evidence spans; no confidence adjustment.
  • Partial → keep with tag, may rewrite content; confidence penalty of −0.10 to −0.20 (model-set).
  • Not supported → drop, increment hallucination_blocked_total; candidate discarded.
  • Unknown (verifier failure) → per config: block, allow with tag, or queue for review.

After a Partial verdict with rewrite_partial: true, a second LLM call rewrites the content to match only the evidence spans — the softened version is stored instead of the original. This is cheap (Haiku, ~$0.00003) and prevents the partial inference from being cited verbatim.
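A sketch of how the verdict table and the rewrite path might be applied, reusing the GroundingResult from the earlier sketch; the candidate record shape, the partial_penalty and on_unknown config keys, and the rewrite prompt are illustrative assumptions:

def apply_grounding_verdict(candidate: dict, result, config: dict, metrics: dict, llm):
    """Apply the verdict table: keep, penalize-and-rewrite, drop, or defer per config.
    Returns the record to persist, or None if the candidate is dropped."""
    if result.verdict == "supported":
        candidate["evidence_span"] = result.evidence_span
        return candidate

    if result.verdict == "partial":
        # partial_penalty is an assumed config key; the penalty range here is -0.10 to -0.20
        candidate["confidence"] -= config.get("partial_penalty", 0.15)
        candidate.setdefault("tags", []).append("partially_grounded")
        if candidate["confidence"] < config.get("min_confidence_after_penalty", 0.3):
            return None  # below the floor: drop rather than store at very low confidence
        if config.get("rewrite_partial", False):
            # second, cheap LLM call: restate the content using only the evidence span
            candidate["content"] = llm.complete(
                "Rewrite this claim so it asserts only what the evidence supports.\n"
                f"Claim: {candidate['content']}\nEvidence: {result.evidence_span}"
            )
        return candidate

    if result.verdict == "not_supported":
        metrics["hallucination_blocked_total"] = metrics.get("hallucination_blocked_total", 0) + 1
        return None
    if result.verdict == "contradicted":
        return None

    # unknown (verifier failure): block, allow with tag, or queue, per config
    action = config.get("on_unknown", "queue")
    if action == "allow":
        candidate.setdefault("tags", []).append("unverified")
        return candidate
    return None if action == "block" else {**candidate, "queued_for_review": True}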

Cost analysis: at 1,000 user turns, roughly 360 candidates survive to grounding (pre-filter drops 40%, extraction produces 1.2 candidates per surviving turn, classification drops 50%). At ~$0.0001 per grounding call, that is about $0.036 per 1,000 turns.

min_confidence_after_penalty: 0.3 is the floor — candidates whose confidence drops below this after penalty are dropped rather than stored at very low confidence. Adjust based on how aggressively you want to filter inferred content.

Consistency scan: the three signals

The store-time scan identifies suspect clusters using three signals, not one:

Structural: memories with the same (subject, predicate) but different object. These are the clearest contradiction candidates — same predicate, different value.

Semantic: memory pairs with cosine similarity > 0.85 but different subjects or predicates. These catch paraphrase duplicates that escaped write-time dedup.

Temporal: memories about the same (subject, predicate) with different valid_from timestamps more than 30 days apart. These catch job changes, role shifts, and other temporal evolutions that look like contradictions but are actually correct supersessions.

An LLM judge classifies each cluster as: Equivalent (merge), TemporalEvolution (supersede), Contradiction (flag), or Distinct (false positive). Auto-actions for Equivalent and TemporalEvolution run by default; Contradiction is flagged for human review. Cost: ~$0.005 per 10,000 memories per scan (50–100 clusters, 1 Haiku call each).
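A sketch of the structural signal and the judge dispatch, assuming a SQLite-style memories table and a hypothetical judge interface; the schema, column names, and classify_pair call are illustrative:

import sqlite3

# Structural signal: same subject and predicate, different object, neither superseded.
STRUCTURAL_SIGNAL_SQL = """
SELECT a.id, b.id
FROM memories a
JOIN memories b
  ON a.subject = b.subject
 AND a.predicate = b.predicate
 AND a.object <> b.object
 AND a.id < b.id
WHERE a.superseded_by IS NULL
  AND b.superseded_by IS NULL
"""

def consistency_scan(conn: sqlite3.Connection, judge) -> list:
    """Collect candidate pairs from the broad structural query, then let the LLM judge
    classify each one. Only EQUIVALENT and TEMPORAL_EVOLUTION auto-resolve."""
    actions = []
    for a_id, b_id in conn.execute(STRUCTURAL_SIGNAL_SQL):
        verdict = judge.classify_pair(a_id, b_id)  # hypothetical judge interface
        if verdict == "EQUIVALENT":
            actions.append({"action": "merge", "ids": (a_id, b_id)})
        elif verdict == "TEMPORAL_EVOLUTION":
            actions.append({"action": "supersede", "ids": (a_id, b_id)})
        elif verdict == "CONTRADICTION":
            actions.append({"action": "flag_for_review", "ids": (a_id, b_id)})
        # DISTINCT: false positive from the intentionally broad SQL signal; no action.
    return actions

The semantic and temporal signals would feed the same dispatch; only the candidate-pair query differs.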

Faithfulness policies

When read-time faithfulness scoring is enabled, three response policies govern what happens on high-risk verdicts:

  • warn (medium or high risk): annotate the response with an uncertainty note on unsupported claims.
  • regenerate (high risk): retry with a stricter prompt, re-score, and use the retry if it scores better.
  • block (any unsupported claim): replace the response with "I don't have enough reliable information to answer that."

warn is the recommended starting point. block is appropriate for high-stakes domains (medical, legal, financial). regenerate adds the most latency (~400ms total when triggered) and is most useful when the LLM is systematically adding detail beyond the provided context.
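A sketch of the policy dispatch; the risk labels, score fields, and fallback message mirror the text, while the function shape itself is an assumption:

FALLBACK = "I don't have enough reliable information to answer that."

def apply_faithfulness_policy(response: str, score: dict, policy: str, regenerate_fn=None) -> str:
    """score is assumed to carry 'risk' (LOW/MEDIUM/HIGH) and 'unsupported_claims'."""
    risk = score["risk"]
    unsupported = score.get("unsupported_claims", [])

    if policy == "block" and unsupported:
        return FALLBACK  # any unsupported claim replaces the whole response

    if policy == "regenerate" and risk == "HIGH" and regenerate_fn is not None:
        retry, retry_score = regenerate_fn()  # stricter prompt, then re-score
        if retry_score["risk"] != "HIGH":
            return retry
        response = retry  # still high risk: fall through to the warn behavior
        unsupported = retry_score.get("unsupported_claims", unsupported)

    if risk in ("MEDIUM", "HIGH") and unsupported:
        note = "Note: the following claims were not verified in stored context: " + "; ".join(unsupported)
        return response + "\n\n" + note
    return response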

Grounding worked examples

The grounding verdict table is concise, but the failure modes it covers are not. Two concrete end-to-end examples show how grounding catches structurally different fabrication patterns.

Example 1 — Employer fabrication

Source turn: "Let's schedule the meeting for next Tuesday. I'll be joining from my home office in Bangalore."

Extracted candidate:

{
  "type": "fact",
  "predicate": "works_at",
  "object": "Google",
  "content": "User works as a software developer at Google"
}

Grounding verdict: NOT_SUPPORTED. The source text mentions a home office in Bangalore. It mentions no employer, no role, and no company. Google does not appear anywhere in the source span.

Action: candidate dropped. hallucination_blocked_total incremented.

What happened: the extraction model inferred employment at Google from the combination of "home office" (implies tech worker) and "Bangalore" (large Google engineering presence). This is a classic over-inference — the model drew on training-data associations rather than facts stated in the conversation. The grounding check has no access to those associations; it only sees the source span, and the source span does not mention Google. The extraction is rejected.

This is the most dangerous fabrication pattern. The invented fact (works at Google) is specific, plausible, and would survive casual human review if stored — many Bangalore-based software developers do work at Google. Grounding catches it precisely because it operates on evidence spans rather than plausibility.

Example 2 — Quantification inflation

Source turn: "I think I should be done with the MVP by end of April, pretty confident."

Extracted candidate:

{
  "type": "fact",
  "predicate": "project_deadline",
  "content": "User will deliver the MVP by April 30 with 95% confidence"
}

Grounding verdict: PARTIAL. "End of April" is directly supported by the source text. "95% confidence" is not — the user said "pretty confident," which is informal hedged language, not a quantified probability estimate. The extraction model converted natural-language confidence into a specific percentage, which is fabrication even if directionally correct.

Action: stored with rewritten content: "User expects to deliver MVP by end of April." Confidence penalty applied: −0.20. Original extraction confidence (0.72) → stored confidence 0.52.

The rewrite removes the fabricated quantification while preserving the genuine commitment. The penalty reduces the memory's retrieval weight, signaling to downstream systems that this memory contains interpreted rather than verbatim content.

The key difference between the two examples: Example 1 is a hallucinated entity — Google never appeared in the source. The extraction invented a specific name out of contextual associations. Example 2 is a partially-grounded quantification — the deadline was genuinely stated, but the confidence figure was invented. Both are caught, but via different verdicts with different consequences. A NOT_SUPPORTED verdict discards the candidate entirely; a PARTIAL verdict fires the rewrite path and applies a penalty. The distinction matters operationally: over-aggressive rejection of PARTIAL verdicts discards useful inferred content along with the fabricated detail. The penalty-plus-rewrite path preserves what was real.

Store-time consistency: worked examples

The three signals (structural, semantic, temporal) each catch different contradiction patterns. Two concrete examples show the LLM judge in action on real-world contradiction types.

Example 1 — Temporal evolution (auto-resolved)

The weekly consistency scan finds two memories for the same subject and predicate:

  • Memory A (created 2025-06-10): {subject: ent_user, predicate: works_at, object: "Datakynd"}
  • Memory B (created 2026-04-05): {subject: ent_user, predicate: works_at, object: "Arrive"}

These are 299 days apart. The temporal signal fires because the gap exceeds the 30-day temporal_drift_days threshold. The LLM judge receives both memories with their timestamps and the question: "Are these contradictory, or does one supersede the other?"

LLM judge verdict: TEMPORAL_EVOLUTION. The 10-month gap and the job-change predicate strongly suggest the user changed employers — not that one memory is wrong. Temporal evolution is the most common resolution path for works_at, role, and reports_to predicates.

Action: Memory A gets superseded_by = B and valid_to = 2026-04-05. Both remain in the store. Future retrieval returns only Memory B by default (the superseded_by field filters it from standard queries). Historical queries — for audit, drift analysis, or incident review — can still surface Memory A. No human intervention is required; the scan auto-resolves and logs the action to the audit trail.

This resolution is not destructive. Datakynd is not erased — it becomes part of the user's employment history. If the user later asks "where did I work before Arrive?", Memory A is queryable through the supersession chain.
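The resolution itself is a small update. A sketch against the same illustrative SQLite-style schema used above; the audit_log table and its columns are also assumptions:

def supersede(conn, old_id: str, new_id: str, new_valid_from: str) -> None:
    """Mark the older memory as superseded by the newer one without deleting it.
    Default retrieval filters on superseded_by IS NULL; historical queries skip the filter."""
    conn.execute(
        "UPDATE memories SET superseded_by = ?, valid_to = ? WHERE id = ?",
        (new_id, new_valid_from, old_id),
    )
    conn.execute(  # audit_log is an assumed table for the sketch
        "INSERT INTO audit_log (action, memory_id, detail) VALUES ('supersede', ?, ?)",
        (old_id, "superseded_by=" + new_id),
    )
    conn.commit()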

Example 2 — Unresolvable contradiction (flagged for review)

The scan finds two memories for the same entity and predicate:

  • Memory A (2026-04-01): {subject: ent_inbox3, predicate: uses_database, object: "Supabase"}
  • Memory B (2026-04-02): {subject: ent_inbox3, predicate: uses_database, object: "Neon"}

These are one day apart. The structural signal fires first (same subject and predicate, different object values). The temporal signal does not fire — one day is well below the 30-day threshold for inferring temporal evolution.

LLM judge verdict: CONTRADICTION. A one-day gap, same predicate, different database values could indicate a database migration (plausible for a startup project) or a hallucination in one of the extractions. The judge cannot determine which from the memories alone. Neither auto-action fires — TEMPORAL_EVOLUTION requires a sufficient time gap, and EQUIVALENT requires matching content.

Action: A dashboard alert is raised: "2 memories in contradiction — review needed." Retrieval returns both memories from the store, each flagged with a contradicts_with link pointing to the other. When either memory appears in an LLM context window, the system injects an explicit conflict note: "Note: two conflicting memories exist for this predicate — [Supabase (2026-04-01)] vs [Neon (2026-04-02)]." The LLM sees the conflict rather than silently receiving one value.

A human reviewer can resolve the contradiction by marking one memory as authoritative and the other as superseded. Until resolved, the agent acknowledges uncertainty rather than asserting a single value with false confidence. This is the correct behavior — better to surface a known conflict than to silently pick a side and be wrong.
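A sketch of the conflict-note injection at context-assembly time; the memory record fields (contradicts_with, created_at) follow the description above, and the function shape is an assumption:

def render_with_conflicts(memory: dict, store: dict) -> str:
    """When a memory carries a contradicts_with link, surface both values to the LLM
    instead of silently presenting one side of an unresolved conflict."""
    text = memory["content"]
    other_id = memory.get("contradicts_with")
    if other_id and other_id in store:
        other = store[other_id]
        text += (
            "\nNote: two conflicting memories exist for this predicate - "
            f"[{memory['content']} ({memory['created_at']})] vs "
            f"[{other['content']} ({other['created_at']})]."
        )
    return text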

Read-time faithfulness: anatomy of a response

Faithfulness scoring is easiest to understand through a concrete scored example. The following walks through Guard 3 catching fabricated claims in a realistic agent response.

User query: "How's the Inbox3 project going?"

Context injected into the prompt:

mem_1: "Inbox3 at 60% completion"
mem_2: "blockers: OAuth and rate limits"
mem_3: "team lead is Sarah"

LLM response: "Your Inbox3 project is at 60% completion. The main blockers are OAuth and rate limits. Your team of 3 engineers is allocated 50% to this project — at this pace you should ship by end of month."

Faithfulness scoring — claim by claim:

  • "at 60% completion" (mem_1): SUPPORTED
  • "blockers: OAuth and rate limits" (mem_2): SUPPORTED
  • "team of 3 engineers" (no source memory): UNSUPPORTED
  • "allocated 50%" (no source memory): UNSUPPORTED
  • "ship by end of month" (derived from mem_1 + mem_2): INFERRED

Score: 2 supported out of 5 verifiable claims = 0.40 faithfulness → HIGH risk.

Policy applied (warn): the response is delivered with an annotation on the two unsupported claims. The user sees the response but the system notes that team composition and allocation figures were not found in memory. The annotation might read: "Note: team size and allocation figures are not verified in stored context."

Audit log entry:

faithfulness=0.40, risk=HIGH, policy_applied=warn,
unsupported_claims=["team of 3 engineers", "allocated 50%"],
inferred_claims=["ship by end of month"]

What happened: the model added team composition and resource allocation from its general knowledge about software projects, not from the three memories provided. This is a common pattern — the model fills gaps in provided context with plausible-sounding details. The 60% completion and blocker information are accurate and should be delivered; the team and allocation figures are fabricated and should not.

Guard 3 catches this without blocking the legitimate information. Under warn policy, both supported and unsupported claims are delivered, but the user knows which to trust. Under block policy, the entire response would be replaced — appropriate for a medical or financial domain where even annotated fabricated figures could cause harm. Under regenerate, a second attempt with a stricter prompt ("do not include any information not present in the provided context") is tried before falling back to warn.

The INFERRED claim ("ship by end of month") is treated separately. It is derived from legitimate context (60% done + blockers exist → some delivery estimate is reasonable) rather than fabricated from general knowledge. Inferred claims are logged and annotated but not flagged as fabrications; they still count toward the claim total, which is why the score above is 0.40 rather than 0.50. They are the model doing reasoning over real inputs, not inventing inputs.

Tuning for your domain

The three guards have configurable sensitivity thresholds. Default values are calibrated for general-purpose agents. Domain-specific deployments will need to adjust.

Raising Guard 1 strictness

The key parameter is min_confidence_after_penalty, which defaults to 0.3. This floor determines when a PARTIAL verdict causes a drop versus a penalty-and-rewrite.

Raise it to 0.4 for high-stakes domains (medical, legal, financial). A higher floor means even moderately partial verdicts get dropped rather than rewritten and stored at reduced confidence. The tradeoff: fewer inferred-but-useful memories survive. In domains where a wrong fact is worse than a missing one, this is the correct tradeoff. A memory asserting "patient takes 10mg metoprolol" when the source text said "patient mentioned heart medication" should be dropped rather than stored at reduced confidence.

Lower it to 0.2 for exploratory agents (research assistants, brainstorming tools) where inference has high value and false negatives (dropping useful inferences) are costly. An agent that infers "user is interested in distributed systems" from "user asked about Raft consensus" is doing useful inference that should survive the grounding check even if the predicate is not directly stated.

The right value for min_confidence_after_penalty is a function of your domain's asymmetry between false positives (storing a wrong fact) and false negatives (dropping a useful inference). Medical: strongly prefer false negatives. Research assistant: balance or prefer false positives.

Adjusting Guard 2 temporal window

temporal_drift_days (default 30) controls the minimum gap required before the LLM judge considers TEMPORAL_EVOLUTION as a resolution for same-predicate contradictions. Two memories less than 30 days apart for the same predicate will be classified as CONTRADICTION or EQUIVALENT — not as a temporal supersession — regardless of other signals.

Lower to 14 days for fast-moving domains: startup teams, active projects, organizational changes. A role change or employer change is plausible within two weeks for these subjects. Keeping the window at 30 days means legitimate rapid changes are flagged as contradictions and require human review rather than auto-resolving.

Raise to 60 days for stable domains: personal facts, long-term relationships, physical attributes, preferences. These predicates change slowly. A 30-day gap between two conflicting "home city" memories is more likely a hallucination than a legitimate move — raising the window forces those to the CONTRADICTION path and human review rather than auto-superseding.

As a rule of thumb, set temporal_drift_days to the typical cadence of legitimate change for the most volatile predicates in your domain. Job changes: 90–365 days for most professionals, 30 days for startup founders. Physical address: 180+ days for most users. Active project status: 7–14 days.

Enabling Guard 3 selectively

Faithfulness scoring adds ~80ms per response and an LLM call. Enabling it on every query is costly and unnecessary — many queries do not produce verifiable factual claims.

The recommended approach: gate Guard 3 on the query optimizer's specificity score. Specific factual queries (specificity > 0.7) produce responses with verifiable claims. Apply Guard 3 here. Conversational queries (specificity < 0.3) produce responses where the model legitimately draws on training knowledge — faithfulness scoring would produce false positives by flagging general-knowledge answers as unsupported.

Enable Guard 3 for: "What is X?", "What did Y say about Z?", "What is the status of project P?", "When does Q happen?" — queries where the agent is expected to answer from stored memory and deviation from that memory is the failure mode.

Disable Guard 3 for: "How does OAuth work?", "What are good practices for database indexing?", "Explain this error" — queries where the model is providing explanation or general knowledge. Faithfulness scoring against stored memory context is not meaningful here; the correct context is the model's training data, not stored facts.

The query optimizer's specificity score is available in the request context as query_specificity. Pass it as a condition in your faithfulness policy configuration:

faithfulness:
  enabled: true
  min_specificity: 0.7
  policy: warn
  high_risk_policy: regenerate

This single configuration change typically reduces Guard 3 invocations by 60–70% while preserving coverage on the queries where it matters.

Write-time grounding: what the verifier actually checks

The grounding verifier is a targeted LLM call, not a general question-answering call. It receives the candidate memory and the exact source turn text and asks one question: "Is this claim supported by this source text?"

The prompt structure:

<source>
{source turn text — verbatim}
</source>

<candidate>
type: {type}
predicate: {predicate}
content: {content}
</candidate>

Rate whether the candidate is: SUPPORTED (the source clearly states this), PARTIAL (the source
implies but doesn't directly state this), NOT_SUPPORTED (the source doesn't contain this
information), or CONTRADICTED (the source contradicts this).

Return JSON: {"verdict": "...", "evidence_span": "...", "reasoning": "..."}

The evidence span is the exact substring of the source turn that supports the candidate. When verdict is PARTIAL, the evidence span often contains hedged language ("I think", "probably", "might be") that the extractor over-interpreted. When verdict is NOT_SUPPORTED, the evidence span is empty — the verifier found no textual basis for the claim.

Two common extractor failure modes the verifier catches:

Temporal conflation: the source says "I used to work at Volkswagen" and the extractor produces {predicate: "works_at", object: "Volkswagen"} (present-tense fact). Verdict: NOT_SUPPORTED for the present-tense fact. The memory should be an Event type: "user previously worked at Volkswagen." The verifier rejects the incorrectly-typed extraction; a correct re-extraction with type Event would be SUPPORTED.

Role inference: the source says "I met with the VP today" and elsewhere in the turn names a person; the extractor produces {predicate: "is_VP", subject: "person_name"}, inferring that the named person is the VP. Verdict: PARTIAL (the source confirms a meeting with someone who may be the VP but does not confirm the named person's title). Confidence penalty applied. If the user's conversation later contains "... the VP, Jordan, approved the budget," a re-extraction with "Jordan" as subject would be SUPPORTED. Inference chains that accumulate PARTIAL verdicts are flagged: three consecutive PARTIAL verdicts for the same entity trigger a consolidation review.

The verifier is deliberately narrow. It does not evaluate whether the claim is plausible, consistent with other memories, or useful for retrieval. It evaluates only one question: does this source text actually say this? Broadening the verifier's scope would dilute its signal — a verifier that considers plausibility is doing the extractor's job, not guarding against it.

Store-time consistency: what "contradiction" means precisely

Not every disagreement between memories is a contradiction. The consistency scan uses a strict definition: a contradiction requires same subject, same predicate, different object, and neither memory supersedes the other. Without the "neither supersedes" condition, every superseded memory would look like a contradiction of its successor.

The four LLM judge verdicts in practice:

TemporalEvolution: "Georgian works at Datakynd" (2025) and "Georgian works at Arrive" (2026). Both are grounded. The more recent one is probably current. Action: set works_at_Datakynd.valid_to = works_at_Arrive.valid_from, add superseded_by link. Neither memory is deleted. The older memory becomes part of the employment history timeline and remains queryable through historical queries.

Equivalent: "User prefers TypeScript" and "The user likes to use TypeScript for new projects." Both say the same thing with different phrasing. Action: merge, keep higher-confidence as canonical, union source turn IDs. The merged memory gets access_count = sum(both), which increases its retrieval weight — a preference mentioned twice in different sessions is now correctly weighted as more salient than one mentioned once.

Contradiction: "Inbox3 uses PostgreSQL" (April 1) and "Inbox3 uses SQLite" (April 2). One-day gap, ambiguous causality — could be a migration or one could be wrong. Action: flag, link contradicts_with, surface for human review. Neither is auto-resolved. The conflict is surfaced in subsequent retrievals so the LLM sees both values and the conflict annotation rather than confidently citing one.

Distinct: "Sarah manages the platform team" and "Sarah is managing her workload well." Different predicates; the scan incorrectly flagged these as a manages/manages pair because both sentences contain "manages" and "Sarah." Action: no-op. This is the most common outcome — false positive by the structural SQL signal, caught by the LLM judge after inspecting the actual predicate semantics.

Frequency of each verdict in a typical namespace: Distinct (false positive) ~50%, Equivalent (auto-merge) ~25%, TemporalEvolution (auto-supersede) ~20%, Contradiction (human review) ~5%. The high false-positive rate from the SQL signal is expected and acceptable — the SQL query is intentionally broad to avoid missing real contradictions. The LLM judge's job is to eliminate false positives cheaply before any state change occurs.

Read-time faithfulness: what the score actually measures

Layer 3's faithfulness score is not the same as retrieval relevance. Relevance is "does this memory relate to the query?" Faithfulness is "does the generated response's claims trace back to these memories?"

The faithfulness check runs after the LLM generates a response. It:

  1. Parses the response into atomic factual claims. Strips advice, opinions, hedged statements, and questions — only declarative factual assertions are scored. "You should consider using Redis" is not scored. "Your current setup uses Redis" is scored.
  2. For each claim, checks whether any retrieved memory supports it. The check is semantic, not lexical — "user's stack includes Redis" supports the claim "you're using Redis" even though the wording differs.
  3. Computes faithfulness = supported_claims / total_claims.

A faithfulness score of 0.80 means 80% of the response's factual claims are supported by retrieved memories. The remaining 20% are either inferences the model drew from memory (often legitimate) or fabrications (problematic). The audit log distinguishes between these: claims that follow logically from retrieved memories are tagged INFERRED; claims with no memory connection are tagged UNSUPPORTED.
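A sketch of the three steps, with the claim parser and the per-claim classifier passed in as hypothetical helpers; the risk cutoffs are illustrative, chosen so the worked example's 0.40 lands at HIGH risk:

def faithfulness_score(response: str, memories: list, parse_claims, classify_claim) -> dict:
    """parse_claims: LLM call that splits a response into atomic factual claims.
    classify_claim(claim, memories) -> 'SUPPORTED' | 'INFERRED' | 'UNSUPPORTED'.
    Both are assumed interfaces sketched for illustration, not a specific library."""
    claims = parse_claims(response)  # step 1: declarative factual assertions only
    verdicts = [(claim, classify_claim(claim, memories)) for claim in claims]  # step 2: semantic check
    supported = [c for c, v in verdicts if v == "SUPPORTED"]
    unsupported = [c for c, v in verdicts if v == "UNSUPPORTED"]
    inferred = [c for c, v in verdicts if v == "INFERRED"]
    score = len(supported) / max(len(claims), 1)  # step 3
    risk = "LOW" if score >= 0.8 else "MEDIUM" if score >= 0.5 else "HIGH"  # illustrative cutoffs
    return {"faithfulness": round(score, 2), "risk": risk,
            "unsupported_claims": unsupported, "inferred_claims": inferred}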

The policy distinction:

warn (default): append a note to the response indicating which claims could not be verified in stored memory. Useful for chat applications where some inference is expected and the user can assess the annotation.

regenerate: retry with a stricter system prompt: "Answer only from the provided memories. Do not infer beyond what is explicitly stated." Re-score; use the lower-risk response. If the regenerated response is still above the risk threshold, fall back to warn. This path adds ~200–400ms latency and is most useful when the model is systematically over-inferring for a particular query type.

block: replace the entire response with a fixed message. Appropriate for financial, medical, or legal domains where any unsupported claim reaching the user is unacceptable regardless of annotation. The blocked response is logged with the original faithfulness score for offline review.

Faithfulness adds ~200ms latency and is opt-in (default OFF). The latency comes from the claim-parsing call — decomposing a prose response into atomic verifiable assertions is itself an LLM call. The actual claim-by-claim lookup against retrieved memories is a vector operation and adds negligible time.

Measuring defense effectiveness over time

The three guards are most useful as trend signals, not single-point measurements. A single day's guard1_block_rate of 4% tells you little. Three consecutive weeks of it increasing from 2% to 3% to 4% tells you the extractor is drifting — something changed upstream.

Metrics to track weekly:

guard1_block_rate (grounding NOT_SUPPORTED rate): healthy 2–4%. A sudden spike above 6% suggests the extractor model changed, the conversation domain shifted into territory the extractor handles poorly, or an upstream data pipeline is sending malformed turns. Investigate within 24 hours of a spike; don't wait for the next weekly review.

guard1_partial_rate (PARTIAL verdicts): healthy 5–8%. A rising partial rate with a stable block rate means the extractor is becoming more speculative — producing claims that are partially grounded rather than fully fabricated. This is a subtler signal: the extractor isn't inventing facts outright, it's over-interpreting hedged language. Left unaddressed, partial rates above 12% indicate the extractor prompt needs retuning.

guard2_contradiction_count (per nightly scan): healthy 0–5 contradictions per 10,000 memories. A spike to 20+ contradictions signals one of: bulk import of inconsistent data, an extractor prompt change that generates predicates incompatible with existing ones, or a genuine period of high change in the user's life or organization (role changes, project pivots). The first two require engineering action; the third is expected and the resolution queue will process normally.

guard3_faithfulness_mean (if enabled): healthy above 0.85. A declining mean over several weeks means the LLM is increasingly extrapolating beyond the memory context it receives. This often correlates with decreasing memory coverage — if the retriever is returning fewer relevant memories per query (perhaps due to embedding drift or growing namespace size), the model fills the gap with inference. The fix is retrieval quality, not faithfulness policy.

Dashboard alert configuration: when guard1_block_rate has trended upward for 3 consecutive measurement periods, trigger an investigation. When guard2_contradiction_count exceeds 2× its 30-day moving average in a single scan, trigger an alert. When guard3_faithfulness_mean drops below 0.80 for two consecutive weeks, open a retrieval quality investigation. These alert conditions are conservative enough to avoid noise while sensitive enough to catch real problems before they affect users at scale.
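A sketch of those alert conditions as checks over a history of weekly metric snapshots; the snapshot format and the moving-average stand-in are assumptions:

def check_guard_alerts(history: list) -> list:
    """history: chronological weekly snapshots, each a dict with guard1_block_rate,
    guard2_contradiction_count, and guard3_faithfulness_mean."""
    alerts = []

    rates = [h["guard1_block_rate"] for h in history]
    if len(rates) >= 4 and all(a < b for a, b in zip(rates[-4:-1], rates[-3:])):
        alerts.append("guard1_block_rate rising for 3 consecutive periods: investigate extractor drift")

    counts = [h["guard2_contradiction_count"] for h in history]
    if len(counts) >= 5:
        baseline = sum(counts[-5:-1]) / 4  # stand-in for the 30-day moving average
        if counts[-1] > 2 * baseline:
            alerts.append("guard2_contradiction_count exceeded 2x its moving average")

    means = [h["guard3_faithfulness_mean"] for h in history]
    if len(means) >= 2 and all(m < 0.80 for m in means[-2:]):
        alerts.append("guard3_faithfulness_mean below 0.80 for two consecutive weeks: check retrieval quality")

    return alerts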
