Detecting Memory Drift

By Arc Labs Research · 11 min read

A memory store is not a snapshot — it's a running record of an evolving relationship. Things change. The user changes jobs, the company rebrands, your application adds new features that need new types. Memory needs explicit detection for these shifts; otherwise old assumptions silently dominate retrieval.

[Interactive demo: concept drift, centroid distance vs. Jaccard. The drift trigger fires only when both signals agree.]

Four kinds of drift

  • Concept drift. The meaning of an entity shifts. Twitter became X. Acme was acquired by Globex. The same identifier now refers to something materially different. Detection: dual-signal (see the dual-signal detector below).
  • Data drift. The distribution of memories shifts — users start writing differently, or a feature change generates a new conversational pattern. Detection: MMD (Maximum Mean Discrepancy) over recent vs. historical embeddings.
  • Schema drift. The set of memory types you use changes. You added a skill type; you sunset contact. Detection: changelog-driven, not statistical.
  • Vocabulary drift. New domain terminology enters the corpus. The embedding model — trained on data without these terms — embeds them poorly. Detection: rare-token incidence over time + reranker confidence on those queries.

Why detection matters

Without drift detection, two failure modes accrete:

  • Confidently wrong identity bindings. Memories about "Twitter" still retrieve when the user asks about "X." If the user's mental model has updated and yours hasn't, the agent gives stale answers from old memories.
  • Embedding model rot. The vector index encodes a snapshot of a model. Distribution shift means the index is increasingly miscalibrated against current queries. Recall degrades silently — the only signal is rerank confidence falling.

MMD — distribution drift detection

Maximum Mean Discrepancy is a kernel-based two-sample test. Given an embedding sample from recent memories and one from historical memories, MMD asks: do these come from the same distribution?

MMD²(P, Q) = E_P[k(x, x')] − 2·E_{P,Q}[k(x, y)] + E_Q[k(y, y')]

With an RBF kernel and a permutation test for significance, you get a p-value: under the null (no drift), how likely is the observed MMD? Below 0.01 is a real drift signal; trigger an alert and an investigation.

Schema drift is a process, not an algorithm

Adding a memory type:

  • Define the new type's schema, decay, and confidence prior.
  • Backfill: classifier pass over recent (last 30 days) candidates to retroactively type them.
  • Update extraction prompts to surface the new type.
  • Update query optimizer plans to consult it where relevant.

Sunsetting a type: similar but in reverse — remap existing memories to a new (or merged) type, then drop the old one from extraction and routing.
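
As a sketch, the "define the new type" step might look like a registry entry of this shape. The field names (decayHalfLifeDays, confidencePrior, backfillWindowDays) are illustrative assumptions, not a real API:

// Sketch: a registry entry for a new memory type. Field names are
// illustrative; adapt to your own extraction configuration.
interface MemoryTypeConfig {
  name: string;
  schema: Record<string, "string" | "number" | "boolean">;
  decayHalfLifeDays: number;    // retrieval-time relevance decay
  confidencePrior: number;      // starting confidence before evidence accrues
  backfillWindowDays: number;   // how far back the retroactive classifier runs
}

const skillType: MemoryTypeConfig = {
  name: "skill",
  schema: { subject: "string", proficiency: "string", last_used: "string" },
  decayHalfLifeDays: 180,  // skills fade slowly relative to, say, preferences
  confidencePrior: 0.6,
  backfillWindowDays: 30,  // matches the 30-day backfill pass above
};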

Vocabulary drift is the slowest signal

New domain terminology enters slowly. The signal: rerank confidence on queries containing a recently-emerged token tracks lower than the baseline. The fix: re-embed memories with a fresh embedding model. This is expensive (every memory) but inevitable on a multi-year horizon. See background worker jobs for scheduling.

MMD in depth: kernel, permutation test, and sampling

MMD uses the RBF (radial basis function) kernel:

k(x, y) = exp(−‖x − y‖² / (2σ²))

σ is set using the median heuristic — median pairwise distance among the sample embeddings. This makes the kernel adaptive to the embedding geometry without requiring a separate tuning step.

MMD² measures the squared distance between distributions:

MMD²(P, Q) = E_P[k(x, x')] − 2·E_{P,Q}[k(x, y)] + E_Q[k(y, y')]

Under the null hypothesis (P = Q), all three terms are equal and MMD² is zero. Positive values indicate distributional divergence. The permutation test converts the raw MMD² value into a p-value: shuffle P ∪ Q into two random groups and compute MMD² for each of 1,000 shuffles. The p-value is the fraction of shuffled MMD² values that exceed the observed one.

p < 0.01 → alert. This threshold gives a ~1% false positive rate on stable namespaces, which is workable because drift scans run weekly — one false alert per hundred scans per namespace is acceptable noise.

Why 1,000 recent + 1,000 historical samples? MMD computation is quadratic in the sample sizes (every pair of samples needs a kernel evaluation). Full-store computation at 100K memories would require ~10 billion kernel evaluations. Sampling 2,000 total makes the test fast enough to run weekly on commodity hardware (under 60 seconds) while preserving statistical power — the test is adequately powered at n=1,000 per group to detect moderate distributional shifts.
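
A runnable sketch of the whole test, assuming embeddings arrive as plain number[] arrays (sampling and storage elided). The kernel matrix is computed once over the pooled sample so each permutation only shuffles indices; a production version would vectorize the kernel computation:

// Sketch: MMD² with RBF kernel + permutation p-value over pooled embeddings.
function sqDist(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) { const d = a[i] - b[i]; s += d * d; }
  return s;
}

// Median heuristic: σ = median pairwise distance over the pooled sample.
function medianSigma(points: number[][]): number {
  const d: number[] = [];
  for (let i = 0; i < points.length; i++)
    for (let j = i + 1; j < points.length; j++)
      d.push(Math.sqrt(sqDist(points[i], points[j])));
  d.sort((a, b) => a - b);
  return d[Math.floor(d.length / 2)] || 1e-8;
}

// Unbiased MMD² from a precomputed kernel matrix K over the pooled sample;
// idx[0..m) indexes group P, idx[m..) indexes group Q.
function mmd2(K: number[][], idx: number[], m: number): number {
  const n = idx.length - m;
  let kxx = 0, kyy = 0, kxy = 0;
  for (let i = 0; i < m; i++)
    for (let j = 0; j < m; j++)
      if (i !== j) kxx += K[idx[i]][idx[j]];
  for (let i = m; i < idx.length; i++)
    for (let j = m; j < idx.length; j++)
      if (i !== j) kyy += K[idx[i]][idx[j]];
  for (let i = 0; i < m; i++)
    for (let j = m; j < idx.length; j++)
      kxy += K[idx[i]][idx[j]];
  return kxx / (m * (m - 1)) + kyy / (n * (n - 1)) - 2 * kxy / (m * n);
}

function mmdDriftTest(recent: number[][], historical: number[][], nPerm = 1000) {
  const pooled = [...recent, ...historical];
  const sigma = medianSigma(pooled);
  const twoSigmaSq = 2 * sigma * sigma;
  // Kernel matrix computed once; each permutation only shuffles indices.
  const K = pooled.map(a => pooled.map(b => Math.exp(-sqDist(a, b) / twoSigmaSq)));
  const idx = pooled.map((_, i) => i);
  const observed = mmd2(K, idx, recent.length);
  let exceed = 0;
  for (let p = 0; p < nPerm; p++) {
    for (let i = idx.length - 1; i > 0; i--) {  // Fisher–Yates shuffle
      const j = Math.floor(Math.random() * (i + 1));
      [idx[i], idx[j]] = [idx[j], idx[i]];
    }
    if (mmd2(K, idx, recent.length) >= observed) exceed++;
  }
  return { mmd2: observed, pValue: exceed / nPerm }; // p < 0.01 → drift alert
}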

KL divergence for type distribution drift

Type distribution drift is cheaper to detect than data drift because it doesn't require embeddings. Compare the fraction of each memory type in recent vs historical memories:

KL(P_recent || P_historical) = Σ_t P_recent(t) × log(P_recent(t) / P_historical(t))
How to read the KL value:

  • < 0.05 — stable, normal variation
  • 0.05 – 0.15 — mild shift, monitor weekly
  • > 0.15 — significant shift, investigate

A rising KL divergence usually means extraction behavior changed: a model upgrade, a prompt change, or a shift in what the upstream agent is discussing. This is distinct from concept drift (entity semantics) and data drift (embedding distribution). It's cheap enough to compute daily.
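
A minimal sketch of the computation, assuming the per-window type counts already exist (a GROUP BY over created_at ranges); the epsilon smoothing handles types that appear in only one window:

// Sketch: KL divergence between recent and historical type proportions.
function typeKL(recent: Record<string, number>, historical: Record<string, number>): number {
  const types = new Set([...Object.keys(recent), ...Object.keys(historical)]);
  const total = (c: Record<string, number>) =>
    Object.values(c).reduce((a, b) => a + b, 0);
  const eps = 1e-9; // smoothing: a type absent from one window shouldn't blow up
  const rTot = total(recent), hTot = total(historical);
  let kl = 0;
  for (const t of types) {
    const p = ((recent[t] ?? 0) + eps) / (rTot + eps * types.size);
    const q = ((historical[t] ?? 0) + eps) / (hTot + eps * types.size);
    kl += p * Math.log(p / q);
  }
  return kl; // < 0.05 stable · 0.05–0.15 monitor · > 0.15 investigate
}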

Vocabulary Jaccard: when embeddings fail

The Jaccard signal tracks new terminology the embedding model may not encode well:

jaccard(V_recent, V_historical) = |V_recent ∩ V_historical| / |V_recent ∪ V_historical|

where V is the set of high-frequency tokens (appearing in >1% of memories) in each window. A Jaccard below 0.60 signals significant vocabulary shift. Cross-reference with reranker confidence on queries using the new tokens — if confidence is below baseline, the embedding model was trained without this terminology and re-embedding is worth planning.
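
A sketch under the stated definition of V; the whitespace tokenizer is a stand-in for whatever analyzer your BM25 index uses:

// Sketch: vocabulary Jaccard over high-frequency tokens per window.
function highFreqTokens(memories: string[], minFraction = 0.01): Set<string> {
  const docFreq = new Map<string, number>();
  for (const text of memories) {
    const seen = new Set(text.toLowerCase().split(/\W+/).filter(Boolean));
    for (const tok of seen) docFreq.set(tok, (docFreq.get(tok) ?? 0) + 1);
  }
  const cutoff = memories.length * minFraction; // "appearing in >1% of memories"
  return new Set([...docFreq].filter(([, n]) => n >= cutoff).map(([t]) => t));
}

function vocabJaccard(recent: string[], historical: string[]): number {
  const vRecent = highFreqTokens(recent);
  const vHist = highFreqTokens(historical);
  const inter = [...vRecent].filter(t => vHist.has(t)).length;
  const union = new Set([...vRecent, ...vHist]).size;
  return union === 0 ? 1 : inter / union; // < 0.60 → vocabulary shift signal
}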

All drift findings persist to the drift_findings table (not just logs), so you can track drift score trends across weeks and months rather than only seeing the current scan.

Data drift in practice: what you're comparing

MMD is the right test but the statistical framework only works when the two sample populations are drawn correctly. The two populations in a standard data drift scan:

  • Recent: embeddings of memories created in the last 30 days
  • Historical: embeddings of memories created 30–180 days ago (a 150-day window)

The historical window deliberately excludes the most recent 30 days, which prevents the recent window from bleeding into the historical baseline. The 150-day historical window gives stable statistics because it averages over enough semantic variation that random monthly fluctuations don't dominate the baseline distribution. Narrowing the historical window to 30 days creates a test that is overly sensitive to month-to-month variation rather than genuine distributional shift.

For namespaces with fewer than 100 memories per month, extend both windows proportionally while maintaining the gap. The guiding constraint is sample count: you need at least 50 samples in each group for the permutation test to have meaningful statistical power. At 50 samples per group, the test can detect large shifts (effect size > 0.5) at p < 0.05 but not subtle ones. At 200+ samples per group, the test can detect moderate shifts (effect size > 0.2).
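
One way to implement the proportional extension, as a sketch. The scaling rule (multiply both windows by however many 30-day periods are needed to reach the sample floor) is an assumption, not prescribed above:

// Sketch: extend both windows proportionally for low-volume namespaces,
// keeping the recent/historical boundary so the windows never overlap.
function driftWindows(memoriesPerMonth: number, minSamples = 50) {
  // How many 30-day periods does the recent window need to reach the floor?
  const scale = Math.max(1, Math.ceil(minSamples / Math.max(1, memoriesPerMonth)));
  return {
    recent: { fromDaysAgo: 30 * scale, toDaysAgo: 0 },
    historical: { fromDaysAgo: 180 * scale, toDaysAgo: 30 * scale },
  };
}
// driftWindows(100) → the default 30-day / 30–180-day windows
// driftWindows(20)  → 90-day recent window, 90–540-day historical window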

The permutation test at n=1,000 samples per group runs approximately 1,000 permutations, each requiring computation of MMD² on a random shuffle of the 2,000 combined samples. In practice, vectorized kernel computation on modern hardware (with embedding dimensions of 1,536) takes approximately 10 seconds total for this configuration. For weekly scans on a production system, this is negligible — schedule it as a background job during off-peak hours. For real-time or near-real-time monitoring, drop to 100 samples per group and accept a less powerful test (use p < 0.05 rather than p < 0.01 as the alert threshold, and treat the result as a weak signal that triggers closer manual review rather than automated alerting).

Two practical questions arise repeatedly when implementing MMD-based data drift detection:

What if the namespace has fewer than 1,000 recent memories? Sample what you have. The test loses power but does not break — the permutation test remains valid with any sample size. At n=100 per group, use p < 0.05 as the significance threshold and treat alerts as advisory rather than definitive. Cross-reference with the vocabulary Jaccard signal: if both indicate shift at low sample counts, the combined evidence is meaningful even if neither test individually reaches high confidence.

What if the historical memories don't exist yet (new namespace)? Skip data drift detection entirely until the namespace has at least 30 days of history. Running MMD against an empty or near-empty historical window produces meaningless results — the test always fires because you are sampling from different (small) random subsets rather than different time periods. Set drift_detection.data_drift.min_history_days: 30 in your configuration to suppress false alerts on new namespaces. The system should log a "skipped — insufficient history" record in drift_findings rather than silently omitting the scan, so the absence of data drift results is explicitly visible in the trend analysis view.
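
A sketch of the skip-and-record guard; namespaceAgeDays and recordFinding are hypothetical helpers standing in for the storage layer:

// Sketch: skip data drift scans on young namespaces, but record the skip.
declare function namespaceAgeDays(namespace: string): Promise<number>;
declare function recordFinding(finding: Record<string, unknown>): Promise<void>;

async function maybeRunDataDriftScan(namespace: string, minHistoryDays = 30) {
  const ageDays = await namespaceAgeDays(namespace);
  if (ageDays < minHistoryDays) {
    await recordFinding({
      namespace_id: namespace,
      drift_type: "data",
      severity: "low",
      context: { skipped: true, reason: "insufficient history", ageDays },
    });
    return; // explicitly visible in drift_findings, not silently omitted
  }
  // ...sample windows and run the MMD test as above
}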

Concept drift vs. data drift: how they look different

Both concept drift and data drift cause retrieval quality to degrade over time. From the outside, they look nearly identical: queries return results that feel off-topic, precision metrics fall, users notice that the system "isn't as good as it used to be." They require fundamentally different interventions, so distinguishing them early saves significant remediation effort.

Concept drift is localized to specific entities. It appears in the signals as: queries about a specific entity ("Sarah") return memories that feel contradictory or temporally inconsistent — old memories about Sarah's role as an engineer conflict with newer memories about her role as an engineering manager. The centroid distance for the Sarah entity is above 0.4, indicating that her semantic representation has shifted substantially. The predicate Jaccard for Sarah-related memories (comparing old vs. new) is below 0.5, indicating that the predicates associated with her have changed. The dual-signal fires on Sarah specifically but not on other entities in the namespace.

The fix for concept drift is entity-specific: create a new entity version or re-wire the entity graph edges to reflect the current state. The historical memories remain valid historical records — Sarah was an engineer — but they should be retrieved with lower weight for current-state queries. See concept drift, dual-signal detection for the full entity versioning protocol.

Data drift is namespace-wide. It appears as: general retrieval precision declining across all query types simultaneously, not concentrated on queries about specific entities. The MMD test fires on the full namespace sample. Mean reranker confidence falls below baseline across all memory types, not just those connected to specific entities. Vocabulary Jaccard may also be declining, suggesting that the underlying language of the memories is shifting.

The fix for data drift is namespace-wide: rebuild the BM25 index with current term frequencies (the IDF scores from 6 months ago may no longer reflect what's common in the namespace), and consider scheduling a re-embedding pass using a fresher embedding model checkpoint. Re-embedding is expensive — every memory requires a new embedding call — but it is the correct intervention when distributional shift has caused the vector index to diverge from current query distributions.

The diagnostic question that separates the two: is the quality problem isolated to queries about specific entities or to memories of specific types? That pattern indicates concept drift. Is the quality problem approximately uniform across all query types, all entities, and all memory types? That pattern indicates data drift. In practice, you often see both simultaneously — a period of rapid change in the user's context (new job, new project) produces concept drift on the newly-relevant entities and data drift in the overall memory distribution as the mix of topics shifts.

Schema drift: the three trigger events

Schema drift is not detected by a statistical test on memory embeddings or token distributions. It is triggered by deliberate configuration events — changes that a human made to the extraction pipeline. Three events should always trigger a schema version bump and a follow-up drift assessment:

Model upgrade. Switching from one extraction model to another — even a minor checkpoint update, say from claude-haiku-4-5 to claude-haiku-4-5-20251001 — can shift predicate vocabulary, confidence calibration, and field population rates in ways that appear as schema drift even when your type rules haven't changed. Newer checkpoints may extract more fine-grained predicates, may populate optional fields (like confidence_rationale or source_context) at different rates, and may parse ambiguous utterances differently, assigning memories to different types than the previous checkpoint would have.

Prompt change. Any modification to the extraction system prompt's type_rules, quality_rules, or custom_instructions sections directly controls what gets extracted and how it's structured. A change that makes the extractor more aggressive about creating Fact-type memories will visibly shift the type distribution KL divergence. A change to quality thresholds (raising the minimum confidence for memory creation from 0.6 to 0.75) will reduce memory creation rate overall. Both changes are intentional — but they look like drift from the monitoring system's perspective, and they need to be correlated with the prompt change event to be correctly interpreted.

Type rules change. Adding, removing, or redefining memory types in the extraction configuration is the most structurally significant change because it invalidates the historical type distribution baseline. Adding a new Skill type where previously skill-related memories were typed as Fact will cause existing Fact memories to be retroactively re-typed (via the backfill process), visibly shifting the type distribution. Removing a type causes a sudden drop in that type's proportion as new memories stop being typed to it.

For each of these three trigger events, the assessment process should follow the same three-step sequence:

Step 1: Predicate Jaccard. Compare predicate sets between the last 7 days under the old configuration and the first 7 days under the new configuration. If Jaccard < 0.7, the new configuration is producing meaningfully different predicates — apply predicate normalization to map equivalent predicates across the version boundary so that retrieval and entity graph queries remain coherent across historical and new memories.

Step 2: Confidence histogram comparison. Plot the distribution of extraction confidence scores from old vs. new configuration. If mean confidence shifts by more than 0.05 in either direction, apply a calibration scaling factor to the new configuration's confidence outputs so that confidence thresholds (used in retrieval filtering and entity-centric plan selection) remain correctly calibrated. A model upgrade that makes the extractor uniformly more confident will cause entity-centric plan selection to fire more aggressively — potentially useful, but it should be a deliberate choice rather than an accidental side effect.

Step 3: Type distribution KL divergence. Compare type proportions across the version boundary. If KL > 0.10, investigate whether the shift is the intended result of the configuration change. A type rule addition should produce a KL increase in the expected direction (new type appears, adjacent types shrink). An unexpected KL increase in a type that wasn't targeted by the change indicates an unintended side effect of the prompt modification or model update.
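
The three steps fit in one assessment function. This sketch assumes extraction samples carry their predicates, confidence, and type, and reuses the typeKL sketch from earlier:

// Sketch: the three-step assessment across a version boundary.
declare function typeKL(p: Record<string, number>, q: Record<string, number>): number;

interface ExtractionSample { predicates: string[]; confidence: number; type: string; }

function assessSchemaChange(oldWin: ExtractionSample[], newWin: ExtractionSample[]) {
  // Step 1: predicate Jaccard (< 0.7 → apply predicate normalization)
  const preds = (w: ExtractionSample[]) => new Set(w.flatMap(s => s.predicates));
  const [po, pn] = [preds(oldWin), preds(newWin)];
  const inter = [...po].filter(p => pn.has(p)).length;
  const predicateJaccard = inter / new Set([...po, ...pn]).size;

  // Step 2: mean confidence shift (|Δ| > 0.05 → apply calibration scaling)
  const mean = (w: ExtractionSample[]) =>
    w.reduce((acc, s) => acc + s.confidence, 0) / w.length;
  const confidenceShift = mean(newWin) - mean(oldWin);

  // Step 3: type distribution KL (> 0.10 → verify the shift was intended)
  const counts = (w: ExtractionSample[]) => {
    const c: Record<string, number> = {};
    for (const s of w) c[s.type] = (c[s.type] ?? 0) + 1;
    return c;
  };
  const typeDivergence = typeKL(counts(newWin), counts(oldWin));

  return { predicateJaccard, confidenceShift, typeDivergence };
}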

Drift detection dashboard

A practical monitoring view surfaces all four drift types in one place with different scan cadences based on computational cost and signal speed.

Daily signals — cheap, always on.

Vocabulary Jaccard is computed each morning by comparing the set of high-frequency tokens from yesterday's new memories against the 30-day rolling average token set. The comparison is a set operation on token lists — no embedding calls, no kernel computations, no SQL aggregations beyond token frequency counts. Cost: negligible. The output is a single float, trending downward when new terminology is entering the corpus faster than old terminology is fading. Below 0.6: fire an alert and begin correlating with reranker confidence on queries containing the new tokens. The trend line showing the 4-week trajectory is more informative than any single day's value — vocabulary shift is slow, and a single low reading may be an anomaly while a consistent downward trend indicates genuine shift.

Type distribution KL divergence is also computed daily from memory type counts. The count query is a simple SQL aggregation over created_at ranges — trivially fast even at large scale. Above 0.10: file an alert. The daily computation enables early detection of type distribution changes that typically precede or accompany schema drift events. When a KL spike aligns with a known configuration change, it confirms that the change took effect as expected. When it doesn't align with any known change, it is an unexpected extraction behavior shift that warrants investigation.

Weekly signals — moderate cost.

MMD for data drift runs on the weekend during low-traffic hours. The full computation (1,000 samples per group, 1,000 permutation shuffles) takes approximately 10 seconds on a modern CPU with vectorized kernel computation. The results are categorical: above 0.1 with p < 0.01 produces a high-severity alert requiring investigation before the next weekly scan. A value between 0.05 and 0.1 with p < 0.05 produces a medium-severity alert — flag for manual review but no immediate action required. Below 0.05 or p ≥ 0.05: stable, record the score for trend analysis.

Concept drift scanning runs per-entity for all entities with at least 5 memories in each comparison window (recent 30 days vs. historical 30–180 days). The dual-signal check — centroid distance above 0.4 and predicate Jaccard below 0.5 — runs as a batch SQL computation that fetches the memories for each qualifying entity and computes distances on the server. For namespaces with large entity graphs (hundreds of entities), this scan can take 30–60 seconds. The output is a per-entity drift report: entities that triggered the dual-signal receive an entity-specific alert with the centroid distance and Jaccard value, the timestamp of the first memory in the new cluster, and a suggested action (create entity version, re-wire relationships, or monitor for one more week).

Event-triggered signals.

Schema drift assessment fires on any of the three trigger events described above. Because it is event-triggered rather than scheduled, it runs regardless of the time of day or week. The assessment completes within seconds (predicate Jaccard and KL divergence are fast aggregations; confidence histogram comparison requires fetching a sample of recent memories). The result is immediately available in the dashboard under the "Schema changes" timeline view, correlated with the triggering event.

The unified findings view.

All signals write to the same drift_findings table with a common schema: namespace_id, drift_type (concept/data/schema/vocabulary), severity (high/medium/low), score (the raw signal value), threshold (the threshold that was breached), detected_at, and context (JSON blob with type-specific details — entity ID for concept drift, MMD value and p-value for data drift, version hashes for schema drift, Jaccard value and new token list for vocabulary drift).
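
The row shape, expressed as a TypeScript type for reference; the column types are assumptions:

// The drift_findings row shape described above.
type DriftType = "concept" | "data" | "schema" | "vocabulary";
type Severity = "high" | "medium" | "low";

interface DriftFinding {
  namespace_id: string;
  drift_type: DriftType;
  severity: Severity;
  score: number | null;       // raw signal value (null for skipped scans)
  threshold: number | null;   // the threshold that was breached
  detected_at: string;        // ISO-8601 timestamp
  context: Record<string, unknown>;  // entity ID, p-value, version hashes, ...
}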

The most actionable view of this table: sort by detected_at DESC, severity DESC. This surfaces the most recent, most severe findings first. A high-severity MMD alert from last weekend plus a high-severity concept drift finding from three days ago both visible at the top of the view provides the context needed to determine whether they are related (both could stem from a major application feature change that shifted user behavior and entity semantics simultaneously) or independent (a model upgrade drove schema drift while a genuine entity change drove concept drift).

Trend analysis across weeks reveals patterns that individual scans miss: a vocabulary Jaccard that has been declining at 0.02 per week for 10 weeks is a clear signal that re-embedding is approaching, even though no individual weekly reading crossed the 0.60 alert threshold. The drift_findings table makes this longitudinal view straightforward: a simple time-series query on Jaccard scores for a given namespace surfaces the trend without requiring any additional instrumentation.

Concept drift: the dual-signal detector

Concept drift detection requires two signals to fire simultaneously — AND logic. A single signal is insufficient: each has a meaningful false positive rate on its own, and different real-world events produce different single-signal patterns that should not trigger a drift alert.

The centroid distance signal. For each entity, compute the mean embedding of memories from the last 30 days (centroid_recent) and from 30–180 days ago (centroid_historical). Cosine distance between them:

d(e) = 1 − cosine_similarity(centroid_recent, centroid_historical)

Threshold: d > 0.4. Under a Beta(α=1.5, β=10) distribution of centroid distances for stable entities, the 95th percentile is approximately 0.35. Setting the threshold at 0.4 gives roughly a 2% false positive rate from this signal alone. At 10,000 entities, that is 200 spurious alerts per weekly scan — operationally unmanageable as a sole trigger.

The relation overlap signal. Compare the set of predicates the entity participated in across the two windows:

J(e) = |R_recent ∩ R_historical| / |R_recent ∪ R_historical|

Threshold: J < 0.5. For growing entities that accumulate new connections — a person adding new team memberships — Jaccard typically stays above 0.6. New predicates are additions, not replacements: old edges persist while new ones appear alongside them. For entities undergoing genuine semantic shift, old predicates disappear or are replaced, shrinking the intersection relative to the union.

Combined trigger: d(e) > 0.4 AND J(e) < 0.5. Together, the false positive rate on stable entities drops to approximately 0.04% — roughly 4 entities per 10,000 per weekly scan. That is a manageable signal.
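
A sketch of the per-entity check, assuming the two windows' embeddings and predicate sets are already fetched:

// Sketch: dual-signal concept drift trigger for one entity.
function centroid(vecs: number[][]): number[] {
  const c = new Array(vecs[0].length).fill(0);
  for (const v of vecs) for (let i = 0; i < v.length; i++) c[i] += v[i];
  return c.map(x => x / vecs.length);
}

function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function conceptDriftTrigger(
  recentEmb: number[][], histEmb: number[][],
  recentPreds: Set<string>, histPreds: Set<string>,
): boolean {
  const d = cosineDistance(centroid(recentEmb), centroid(histEmb));
  const inter = [...recentPreds].filter(p => histPreds.has(p)).length;
  const j = inter / new Set([...recentPreds, ...histPreds]).size;
  return d > 0.4 && j < 0.5; // AND logic: both signals must agree
}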

Three concrete examples:

  • Twitter → X rebrand: embedding shifts (new product context, new owners, new policies) AND relation churn (new owned_by predicate replaces founded_by, API partner entities change). Both signals fire. Drift alert is correct.
  • Person accumulating new team memberships: embedding is stable (same person, same domain); new member_of edges are additions, old ones persist — Jaccard stays above 0.5. Second signal does not fire. No alert.
  • Embedding model retrained: all entity centroids shift slightly but uniformly; the content of memories is unchanged, only the model's encoding. No entity's centroid distance materially exceeds 0.4, and predicate structure is unchanged. Neither signal fires; no alert.

Monitoring drift in production

All four drift types warrant different monitoring cadences. The driftScan API consolidates all signals into a single structured report:

// Weekly drift scan report
const report = await recall.driftScan({ namespace });

console.log({
  concept_drift: {
    entities_scanned: report.concept.entities_scanned,
    triggered: report.concept.triggered.length,    // entities with d>0.4 AND J<0.5
    pending_review: report.concept.pending_review  // acknowledged vs open
  },
  data_drift: {
    mmd_score: report.data.mmd_score,
    p_value: report.data.p_value,
    severity: report.data.severity,  // none | low | medium | high
    vocab_jaccard: report.data.vocab_jaccard
  },
  schema_drift: {
    predicate_jaccard: report.schema.predicate_jaccard,  // between current and baseline version
    field_drift_alerts: report.schema.field_drift_alerts  // fields with >20pp change
  },
  type_kl_divergence: report.type_kl  // KL between recent and historical type distribution
});

Healthy ranges for weekly monitoring:

  • Concept drift triggered: 0–2 entities per week is normal. Ten or more entities in one scan suggests an external event (company restructuring, product pivot) that drove correlated drift across an entity cluster — investigate the common factor rather than reviewing each entity independently.
  • MMD p-value: above 0.05 means stable. Below 0.01 is a high-severity alert requiring investigation before the next scan.
  • Vocabulary Jaccard: above 0.60 is stable. Below 0.50 means more than half the active vocabulary has turned over — BM25 index rebuild and re-embedding planning are warranted.
  • Type KL divergence: below 0.05 is stable. Above 0.15 means the type distribution has shifted significantly — investigate whether an extraction model update, prompt change, or upstream agent behavior change caused it.

All scan results are written to drift_findings with timestamps and severity levels. Persisting findings rather than only logging them is what makes longitudinal trend analysis possible; slow-burn signals, like the steadily declining vocabulary Jaccard described above, only show up in the week-over-week trend.

When to trigger re-embedding

Re-embedding means recomputing the vector representation of every memory using an updated or different embedding model. At 500,000 memories with a 50ms embedding latency each, a full re-embedding pass takes approximately 7 hours on a single connection, or 35 minutes with 12x parallelism. The trigger should be high-confidence.

Trigger condition 1: sustained MMD severity = high. A single week with MMD above 0.1 and p-value below 0.01 is not sufficient — one week could be sampling noise. Three consecutive weeks at high severity is a trend. If the distributional drift is persistent, the embedding model's representation of your namespace has drifted from the distribution it was calibrated on.

Trigger condition 2: Vocabulary Jaccard below 0.40. At that level, most of the high-frequency vocabulary is not shared between the recent and historical windows. General-purpose embedding models are trained on broad web text, which by construction does not include domain-specific terminology that emerged after the training cutoff or specialized jargon from narrow technical domains. When most of your active vocabulary falls into this gap, vector similarity scores for queries using that vocabulary are unreliable.

Trigger condition 3: reranker confidence below 0.5 on new-token queries. The cross-encoder reranker operates on (query, memory) pairs independently of the embedding model. If reranker confidence for queries containing recently-emerged tokens falls below 0.5, those tokens are being poorly represented in the embedding space. The reranker is a more sensitive end-to-end quality signal than embedding similarity alone for this failure mode.
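
A sketch combining the three conditions. Treating them as independent triggers (any one fires) is an assumption; require agreement across signals for a stricter policy:

// Sketch: evaluating the re-embedding triggers from weekly findings.
interface WeeklySignals {
  mmdSeverity: "none" | "low" | "medium" | "high";
  vocabJaccard: number;
  newTokenRerankConfidence: number;
}

function shouldScheduleReEmbedding(weeks: WeeklySignals[]): boolean {
  if (weeks.length === 0) return false;
  const lastThree = weeks.slice(-3);
  const sustainedMMD =
    lastThree.length === 3 && lastThree.every(w => w.mmdSeverity === "high");
  const latest = weeks[weeks.length - 1];
  return (
    sustainedMMD ||                          // 1: three consecutive high weeks
    latest.vocabJaccard < 0.40 ||            // 2: heavy vocabulary turnover
    latest.newTokenRerankConfidence < 0.5    // 3: reranker flags new tokens
  );
}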

The shadow re-embedding strategy. When a trigger fires, do not immediately cut over to a new model. Dual-write all new memories with both the old and new embedding models simultaneously. For 30 days, both models produce embeddings for every incoming memory. The old model's embeddings serve live retrieval; the new model's embeddings accumulate in a shadow index.

After 30 days, compare retrieval quality between the two indices using Jaccard divergence over top-K results for a sample of 500 representative queries:

jaccard_overlap(old_top_k, new_top_k) = |old_top_k ∩ new_top_k| / |old_top_k ∪ new_top_k|

When average Jaccard overlap exceeds 0.80 — 80% retrieval agreement between old and new models — the new model is safe for full cutover. It is not radically changing retrieval for queries that were working before, and it is improving coverage for queries that were failing.
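
A sketch of the cutover check; searchOldIndex and searchNewIndex are hypothetical retrieval calls returning memory IDs from the live and shadow indices:

// Sketch: post-shadow cutover check over a representative query sample.
declare function searchOldIndex(query: string, k: number): Promise<string[]>;
declare function searchNewIndex(query: string, k: number): Promise<string[]>;

async function cutoverReady(queries: string[], k = 10): Promise<boolean> {
  let overlapSum = 0;
  for (const q of queries) {               // e.g. 500 representative queries
    const oldTopK = new Set(await searchOldIndex(q, k));
    const newTopK = await searchNewIndex(q, k);
    const inter = newTopK.filter(id => oldTopK.has(id)).length;
    const union = new Set([...oldTopK, ...newTopK]).size;
    overlapSum += union === 0 ? 1 : inter / union;
  }
  return overlapSum / queries.length > 0.8; // > 0.80 average → safe cutover
}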

If average Jaccard overlap never reaches 0.80 over the 30-day shadow period, the models disagree structurally on a significant fraction of queries. This may reflect that the domain shift is so significant that the new model's encoding is materially different — not worse, just different. Evaluate on a golden query set with reranker scores to determine which model is correct before committing to cutover. If Jaccard stays below 0.60, consider whether a domain-specialized embedding model trained on corpus overlapping with your namespace's vocabulary is needed rather than a general-purpose model.
