A Taxonomy of Drift in Long-Running Memory Systems
Abstract
The word 'drift' is overloaded in machine learning, covering at least four structurally different phenomena in a persistent memory system: concept drift (entity meanings change), data drift (distributional shift in content), schema drift (extractor behavior changes), and vocabulary shift (new domain terminology). This paper defines each type precisely, explains the detection math — MMD with RBF kernel for data drift, dual-signal centroid+Jaccard for concept drift, KL divergence for type distributions, vocabulary Jaccard for lexical shift — and discusses why each requires a different mitigation strategy.
Why Drift Is a First-Class Problem in Memory
A document store can be re-indexed whenever its embedding model changes. A retrieval system over a static corpus can rebuild from scratch on any schedule. A persistent memory system cannot.
Memory accumulates over months or years. Each memory is grounded in a source conversation. Re-extracting those memories with a new model would require replaying those conversations — often impossible, always expensive. The store carries its history, for better and worse.
From LongMemEval: "knowledge updates — correct adaptation to dynamic user information (e.g., a changed address superseding old values)" is a capability current memory systems fail at, with measured accuracy ranging from 20% to 85% and highly inconsistent across systems. [1] The core problem is that most systems treat memory as append-only, without mechanisms to detect or respond to the ways the accumulated store becomes stale or misleading.
Recall distinguishes four fundamentally different types of drift, each with its own detection math and mitigation strategy. Conflating them — treating all drift as "memory quality degradation" — leads to applying the wrong fix: re-embedding when the real issue is entity versioning, or trying to update type schemas when the real issue is vocabulary shift.
Type 1: Concept Drift
Definition
Concept drift is when the referent of a symbol changes. What an entity name means has shifted, not just how it's expressed.
Example — product pivot. A user in January 2026 says "the product" and means their mobile app. By October 2026 they've pivoted; "the product" now means their web dashboard. The entity ent_the_product has two different underlying identities, separated by a pivot date.
Example — role change. "My manager" in early 2026 resolves to ent_priya_ABC. After a reorg in October, the same surface form resolves to ent_rahul_XYZ. The resolution is wrong unless the system detects the shift.
Example — company rebrand. "Inbox3" started as a personal project and became a company name. The entity must now distinguish the project (deprecated) from the company (active) or retrieval becomes ambiguous.
This is distinct from data drift or schema drift: the entity's identifier is unchanged, but its meaning has changed. This is fundamentally a coreference resolution problem, not an embedding distribution problem.
Detection: dual-signal
Concept drift detection requires two signals to fire simultaneously (AND logic, not OR). Either signal alone is too noisy.
Signal 1 — centroid distance.
For an entity, compute the semantic centroid of its memories in two windows:
```
centroid_recent = mean(embeddings of memories about entity, last 30 days)
centroid_old    = mean(embeddings of memories about entity, 30–180 days ago)
drift_distance  = 1 - cosine_similarity(centroid_recent, centroid_old)
```
Under normal use with a stable entity, drift distance follows approximately Beta(α=1.5, β=10). The 95th percentile of this distribution is around 0.35. The alert threshold is 0.4 — conservative, giving roughly 2% false positives on stable entities.
A drift distance above 0.4 means the embedding-space center of mass for this entity has shifted significantly. It's necessary but not sufficient evidence — embeddings can shift because of new topics in the same domain, not just because the referent changed.
Signal 2 — relation overlap.
Also compare the relations the entity participates in across the two windows:
```
R_recent = predicates this entity appeared in, last 30 days
R_old    = predicates this entity appeared in, 30–180 days ago
relation_overlap = |R_recent ∩ R_old| / |R_recent ∪ R_old|
```
Low Jaccard (< 0.5) means the entity is now connected to entirely different things than before. An entity that was previously found in works_at, builds, and manages_project edges but now appears in delivers, coaches, and trains_client edges has likely changed domains — strong evidence of semantic shift.
Combined alert condition:
```
drifted = (drift_distance > 0.4) AND (relation_overlap < 0.5)
```
Either signal alone fires too often on legitimate context changes. Together they identify cases where both the content and the structural role of the entity have changed — the definition of concept drift.
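A minimal sketch of the combined check, assuming memories for one entity arrive as (embedding, predicate) pairs already split into the two windows. The Memory struct and helper names are illustrative, not Recall's API:

```rust
use std::collections::HashSet;

// Illustrative shape; assumes both windows are non-empty.
struct Memory {
    embedding: Vec<f32>, // embedding of the memory text
    predicate: String,   // relation this memory asserts about the entity
}

fn centroid(memories: &[Memory]) -> Vec<f32> {
    let dim = memories[0].embedding.len();
    let mut c = vec![0.0f32; dim];
    for m in memories {
        for (ci, ei) in c.iter_mut().zip(&m.embedding) {
            *ci += *ei;
        }
    }
    let n = memories.len() as f32;
    c.iter_mut().for_each(|x| *x /= n);
    c
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f32 {
    a.intersection(b).count() as f32 / a.union(b).count() as f32
}

/// AND logic: both the embedding centroid and the relational role must have shifted.
fn concept_drifted(recent: &[Memory], old: &[Memory]) -> bool {
    // Signal 1: embedding center of mass moved
    let drift_distance = 1.0 - cosine(&centroid(recent), &centroid(old));
    // Signal 2: structural role changed (Jaccard over predicate sets)
    let r_recent: HashSet<&str> = recent.iter().map(|m| m.predicate.as_str()).collect();
    let r_old: HashSet<&str> = old.iter().map(|m| m.predicate.as_str()).collect();
    drift_distance > 0.4 && jaccard(&r_recent, &r_old) < 0.5
}
```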
Three mitigation options
Option 1 — entity versioning (preferred): split the entity into versions with a canonical_from / canonical_until boundary (see the sketch after this list). Memories are rewired to the appropriate version. Coreference resolution routes "the product" (in current context) to v2 and "the product back then" to v1. Clean history, correct retrieval.
Option 2 — merge with pivot note: preserve both referents as one entity with an embedded note describing the pivot event. Less clean but preserves all memories as-is. Works when the pivot is clear and the histories don't need strict separation.
Option 3 — acknowledge and ignore: user flags as false positive (entity legitimately evolved but all references remain valid). The coreference cache TTL (30 days) ensures resolution doesn't go stale indefinitely even without a formal versioning event.
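To make Option 1 concrete, here is one possible shape for version records and time-aware resolution. The EntityVersion struct, its field names, and the resolve helper are assumptions rather than Recall's actual schema; chrono is assumed for timestamps:

```rust
use chrono::{DateTime, Utc};

// Hypothetical record shape; fields mirror the canonical_from /
// canonical_until boundary described in Option 1.
struct EntityVersion {
    entity_id: String,                      // stable surface form, e.g. "ent_the_product"
    version: u32,                           // v1, v2, ...
    canonical_from: DateTime<Utc>,          // when this referent became current
    canonical_until: Option<DateTime<Utc>>, // None for the active version
    pivot_note: Option<String>,             // human-readable description of the pivot
}

/// Route a reference to whichever version was canonical at time `at`.
fn resolve(versions: &[EntityVersion], at: DateTime<Utc>) -> Option<&EntityVersion> {
    versions
        .iter()
        .find(|v| v.canonical_from <= at && v.canonical_until.map_or(true, |u| at < u))
}
```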
Type 2: Data Drift
Definition
Data drift is a namespace-level distributional shift — the kinds of memories being written have changed, not any single entity's meaning.
Example: for 6 months, a user's conversations are about software engineering. They start freelancing as a fitness coach. Now memories are about workouts, nutrition, client sessions. The embedding model, trained on mixed data, may over-match fitness queries to engineering vocabulary and produce degraded recall precision on the new domain.
Data drift is different from concept drift (which is per-entity) and from schema drift (which is about memory structure). It's about whether the accumulated store still resembles the kinds of queries being asked.
Detection: Maximum Mean Discrepancy
MMD measures distributional distance using samples from two periods, without requiring density estimation in high-dimensional space. [2]
For two sample sets X ~ P (recent embeddings) and Y ~ Q (historical embeddings):
```
MMD²(P, Q) = E[k(x, x')] - 2·E[k(x, y)] + E[k(y, y')]
```
with RBF (Gaussian) kernel:
```
k(x, y) = exp(-‖x - y‖² / (2σ²))
```
σ is set using the median heuristic — the median pairwise distance among the combined sample. This makes the kernel adaptive to the embedding geometry without a separate tuning step.
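The estimator is short enough to write out. A minimal sketch, assuming embeddings arrive as plain float vectors; this uses the biased V-statistic form of MMD² and is not Recall's implementation:

```rust
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(p, q)| (p - q) * (p - q)).sum()
}

/// Median heuristic: median pairwise squared distance over the pooled sample,
/// used here directly as the kernel denominator 2σ².
fn median_heuristic(pooled: &[Vec<f32>]) -> f32 {
    let mut dists = Vec::new();
    for i in 0..pooled.len() {
        for j in (i + 1)..pooled.len() {
            dists.push(sq_dist(&pooled[i], &pooled[j]));
        }
    }
    dists.sort_by(|a, b| a.partial_cmp(b).unwrap());
    dists[dists.len() / 2]
}

fn mmd_sq(x: &[Vec<f32>], y: &[Vec<f32>], two_sigma_sq: f32) -> f32 {
    let k = |a: &[f32], b: &[f32]| (-sq_dist(a, b) / two_sigma_sq).exp();
    let mean_k = |s: &[Vec<f32>], t: &[Vec<f32>]| -> f32 {
        let mut sum = 0.0f32;
        for a in s {
            for b in t {
                sum += k(a, b);
            }
        }
        sum / (s.len() * t.len()) as f32
    };
    // E[k(x,x')] - 2·E[k(x,y)] + E[k(y,y')]
    mean_k(x, x) - 2.0 * mean_k(x, y) + mean_k(y, y)
}
```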
Why MMD over alternatives:
| Method | Why not |
|---|---|
| KL divergence | Requires density estimation in high dimensions — unreliable |
| Wasserstein distance | O(n³) — computationally infeasible for large samples |
| Kolmogorov-Smirnov | 1D only, can't handle vector distributions |
| MMD | Works directly on samples, O(n²), formal significance test available |
Significance via permutation test:
A raw MMD value doesn't tell you whether the observed shift is statistically meaningful or just sampling noise. The permutation test does:
- Compute observed MMD² on (X, Y).
- Repeat 100 times: shuffle labels, recompute MMD².
- p-value = (count of permuted MMD² ≥ observed MMD²) / 100.
A p-value below 0.01 means the observed distributional difference would happen by chance less than 1% of the time under the null hypothesis of identical distributions. That's a real signal.
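A sketch of that procedure, built on the mmd_sq function above and assuming the rand crate for shuffling; not Recall's implementation:

```rust
use rand::seq::SliceRandom;

/// Shuffle labels under the null hypothesis and count how often the
/// permuted statistic reaches the observed one.
fn mmd_p_value(x: &[Vec<f32>], y: &[Vec<f32>], two_sigma_sq: f32, n_perms: usize) -> f32 {
    let observed = mmd_sq(x, y, two_sigma_sq);
    let mut pooled: Vec<Vec<f32>> = x.iter().chain(y.iter()).cloned().collect();
    let mut rng = rand::thread_rng();
    let mut exceed = 0;
    for _ in 0..n_perms {
        pooled.shuffle(&mut rng); // random relabeling: H0 says the split is arbitrary
        let (px, py) = pooled.split_at(x.len());
        if mmd_sq(px, py, two_sigma_sq) >= observed {
            exceed += 1;
        }
    }
    exceed as f32 / n_perms as f32
}
```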
Severity thresholds:
| Severity | Condition | Implication |
|---|---|---|
| High | MMD > 0.1 AND p < 0.01 | Significant distribution shift — recommend re-embedding |
| Medium | MMD > 0.05 AND p < 0.05 | Moderate shift — monitor, rebuild BM25 index |
| Low | MMD > 0.05 AND p ≥ 0.05 | Possible shift — not statistically significant |
| None | MMD ≤ 0.05 | Stable distribution |
Practical sampling constraint: MMD is O(n²) in sample size. Full-store computation at 100K memories would require ~10 billion kernel evaluations — hours of computation. Sampling 500 recent + 500 historical embeddings makes each weekly scan complete in under 60 seconds on commodity hardware, while preserving sufficient statistical power to detect moderate distributional shifts.
Secondary signals
Vocabulary Jaccard:
```
vocab_shift = 1 - jaccard(top_100_tokens_recent, top_100_tokens_historical)
```
Where top_100 is the 100 most frequent tokens in memories from each window, weighted by IDF to emphasize informative terms. A Jaccard below 0.5 (more than 50% vocabulary turnover) confirms that the lexical domain has shifted — cheap enough to compute daily, no embedding math needed.
Type distribution KL divergence:
```
KL(P_recent ‖ P_historical) = Σₜ P_recent(t) × log(P_recent(t) / P_historical(t))
```
where t ranges over memory types (Fact, Preference, Event, Entity, Relation). This measures whether the kind of information being captured is changing:
| KL divergence | Interpretation |
|---|---|
| < 0.05 | Stable — normal variation |
| 0.05 – 0.15 | Mild shift — monitor weekly |
| > 0.15 | Significant shift — investigate upstream extractor |
Rising KL usually means extraction behavior changed (model upgrade, prompt change), not just that the user's conversational domain shifted. It's a cheaper signal than MMD — computable from type counts, no embedding computation.
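A minimal sketch of the computation from per-type counts. The ε smoothing (so a type absent in one window doesn't produce an infinite term) is an assumption, not documented behavior:

```rust
use std::collections::HashMap;

fn type_kl(recent: &HashMap<String, u64>, historical: &HashMap<String, u64>) -> f64 {
    let eps = 1e-9; // smoothing for types absent in one window
    let total = |c: &HashMap<String, u64>| c.values().sum::<u64>() as f64;
    let (tr, th) = (total(recent), total(historical));
    recent
        .iter()
        .map(|(t, &n)| {
            let p = n as f64 / tr;
            let q = historical.get(t).copied().unwrap_or(0) as f64 / th + eps;
            p * ((p + eps) / q).ln()
        })
        .sum()
}
```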
Mitigation
Re-tune BM25: vocabulary shift is fixed by rebuilding the GIN index with updated term statistics — a cheap, always-safe operation.
Re-embed: if the embedding model's strengths don't match the new domain (e.g., general-purpose model for a specialized medical corpus), re-embedding with a better-suited model can recover retrieval precision. Use the shadow strategy: dual-write with both old and new embeddings, backfill historical memories, compare top-K overlap via Jaccard similarity before cutover.
Acknowledge and continue: many data drifts are expected — the user legitimately pivoted. Alerting is useful; forced action is not. A well-designed drift dashboard distinguishes "this drift is expected and acknowledged" from "this drift is new and requires investigation."
Type 3: Schema Drift
Definition
Schema drift is the subtlest kind. The memory's structure — predicates, field population patterns, type vocabulary — changes. Often triggered by an extractor upgrade.
Example: upgrading from claude-haiku-4-5 to claude-haiku-5-0 for extraction. The new model:
- Uses slightly different predicate vocabulary (said vs stated vs communicated)
- Populates event_at 78% of the time vs 42% before — a 36 percentage-point shift
- Interprets ambiguous relations differently, producing different object values for the same source text
Memories from before the upgrade have different shape than memories after. Retrieval on predicates like communicated misses older memories that use said. Temporal filtering on event_at misses 58% of the older event records.
Detection: predicate vocabulary drift
Track predicate frequency per extractor version:
```rust
fn predicate_vocab_drift(v1: &ExtractorVersion, v2: &ExtractorVersion) -> f32 {
    let v1_predicates = top_k_predicates(v1, 50); // top 50 predicates by frequency
    let v2_predicates = top_k_predicates(v2, 50);
    let overlap = jaccard(&v1_predicates, &v2_predicates);
    1.0 - overlap // drift score
}
```
| Threshold | Interpretation |
|---|---|
| > 0.3 | Suspicious — likely extractor behavior change |
| > 0.5 | Serious — requires predicate normalization |
Detection: field population drift
For each field in the memory schema, compare population rate between extractor versions:
```
event_at populated: 42% (v1) → 78% (v2) = +36pp ← alert
object populated:   88% (v1) → 94% (v2) = +6pp  ← normal
tags populated:     23% (v1) → 34% (v2) = +11pp ← monitor
```
A shift greater than 20 percentage points on any field triggers an alert. This is particularly important for fields used in retrieval logic — a field that was rarely populated (and therefore never indexed on) becoming frequently populated changes the retrieval semantics of the index.
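A sketch of the per-field comparison, assuming population counts per extractor version are available as (populated, total) pairs; names are illustrative:

```rust
use std::collections::HashMap;

fn field_population_alerts(
    v1: &HashMap<String, (u64, u64)>, // field -> (populated_count, total_count)
    v2: &HashMap<String, (u64, u64)>,
    threshold_pp: f64, // e.g. 20.0 percentage points
) -> Vec<String> {
    let rate = |(p, t): (u64, u64)| 100.0 * p as f64 / t as f64;
    v1.iter()
        .filter_map(|(field, &c1)| {
            let c2 = *v2.get(field)?; // skip fields absent in v2
            let shift_pp = (rate(c2) - rate(c1)).abs();
            (shift_pp > threshold_pp).then(|| format!("{field}: {shift_pp:.0}pp shift"))
        })
        .collect()
}
```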
Mitigation: predicate normalization
Map synonymous predicates to canonical forms at write time:
```yaml
canonical_predicates:
  communicated:
    aliases: [said, stated, mentioned, announced, told, expressed]
  has_attribute:
    aliases: [is, has, was, are]
  prefers:
    aliases: [prefers, likes, wants, favors]
```
At write time, raw predicates are canonicalized. At read time, the original raw_predicate is preserved for display. This makes retrieval consistent across extractor versions without requiring a historical re-extraction.
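One way the write-time step might look, as a sketch that inverts the alias map into a lookup table; function names are hypothetical:

```rust
use std::collections::HashMap;

// Build an alias -> canonical lookup from config pairs like
// ("communicated", &["said", "stated", ...]).
fn build_canonical_map(config: &[(&str, &[&str])]) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for (canonical, aliases) in config {
        for alias in aliases.iter() {
            map.insert(alias.to_string(), canonical.to_string());
        }
    }
    map
}

/// Returns (canonical predicate for indexing, raw predicate preserved for display).
fn canonicalize(raw: &str, map: &HashMap<String, String>) -> (String, String) {
    let canonical = map.get(raw).cloned().unwrap_or_else(|| raw.to_string());
    (canonical, raw.to_string())
}
```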
Mitigation: A/B extractor gate
Before fully switching extractors, run them side-by-side on incoming turns:
```yaml
extractor:
  mode: ab_test
  a:
    model: claude-haiku-4-5
    weight: 0.5
  b:
    model: claude-haiku-5-0
    weight: 0.5
  comparison_enabled: true
  alert_on_drift: 0.3  # alert if shapes diverge > 30%
```
Each turn is extracted by both; outputs are compared. Default: use version A (the trusted one) until B is validated. Switch fully to B only when predicate vocabulary drift is below 0.3 and quality is equal or better on LongMemEval-style probes.
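The final gate decision reduces to those two conditions. A trivial sketch, where quality_delta (B minus A on the probe suite) is an assumed input, not a Recall API:

```rust
fn promote_extractor_b(predicate_vocab_drift: f32, quality_delta: f32) -> bool {
    predicate_vocab_drift < 0.3     // output shapes agree closely enough
        && quality_delta >= 0.0     // B equal or better on LongMemEval-style probes
}
```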
Mitigation: re-extraction
When the extractor upgrade is significant, offer re-extraction of historical turns:
```typescript
await recall.reExtract({
  scope: { namespace },
  target_extractor: { model: "claude-haiku-5-0" },
  source_turns_range: { from: "2025-06-01", to: "2026-04-01" },
  strategy: "append" // safe: dedup handles equivalents, supersession handles conflicts
});
```
Cost for a year of data (~5,000 turns): $5–10. This restores schema consistency without losing historical grounding.
Type 4: Vocabulary Shift
Definition
Vocabulary shift is when new terminology enters a domain that the embedding model was not trained on. Unlike concept drift (entity referents change), data drift (distribution changes), or schema drift (extractor behavior changes), vocabulary shift is a model limitation problem: the embedding model cannot represent new terms well because it never encountered them during training.
This distinction matters for remediation. The other three drift types are addressed by changes to the memory system — entity versioning, index rebuilds, predicate normalization. Vocabulary shift is addressed by changes to the model tier: selecting an embedding model whose training corpus covers the new terminology, or enriching the retrieval pipeline to compensate for the model's blind spots.
Example — new technology. In early 2026 a user's conversations are about Kubernetes and Docker. By late 2026 they are discussing WebAssembly Components (Wasm Components) — a relatively new technology that was not well-represented in the embedding model's training data. Vector queries for "Wasm component interface" return memories about general WebAssembly rather than the specific Component Model standard. The model does not know the term well enough to retrieve by semantic similarity.
Example — domain-specific jargon. A legal department's memory store accumulates references to "DPDPA" (India's Digital Personal Data Protection Act). The embedding model trained on general text has thin representations for this specific act; queries about DPDPA retrieve memories about privacy regulations in general rather than the specific act. The retrieval is semantically plausible — related topic — but not precisely on-target.
Example — internal naming. A user starts using an internal codename "Project Polaris" for their new initiative. The embedding model has no representation of this term at all. Queries for "Polaris" retrieve nothing related to the user's project. The BM25 index retrieves correctly on exact match, but any paraphrase — "the new initiative," "the infrastructure overhaul" — returns irrelevant results because the semantic bridge between the paraphrase and the codename does not exist in the model's representation space.
Detection: vocabulary Jaccard
The primary detection signal for vocabulary shift is a cheap token-frequency comparison that requires no embedding computation:
```
vocab_shift = 1 - jaccard(top_100_tokens_recent, top_100_tokens_historical)
```
Where top_100_tokens is the 100 most frequent tokens in memories from each window, weighted by IDF (inverse document frequency) to emphasize informative terms over stop words. Stop words like "the," "a," and "is" are near-zero weight; domain-specific nouns and technical terms carry high weight.
A Jaccard below 0.5 (more than 50% vocabulary turnover) signals significant lexical shift. This is the earliest signal for domain change in a namespace — it appears before MMD rises, before retrieval quality measurably degrades, and before users report problems.
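A sketch of the daily check, assuming whitespace tokenization and a precomputed IDF map; both are simplifications over a production tokenizer:

```rust
use std::collections::{HashMap, HashSet};

fn top_k_tokens(texts: &[String], idf: &HashMap<String, f32>, k: usize) -> HashSet<String> {
    let mut weight: HashMap<String, f32> = HashMap::new();
    for text in texts {
        for tok in text.split_whitespace() {
            let tok = tok.to_lowercase();
            let w = idf.get(&tok).copied().unwrap_or(0.0); // stop words carry ~0 weight
            *weight.entry(tok).or_insert(0.0) += w;
        }
    }
    let mut ranked: Vec<(String, f32)> = weight.into_iter().collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked.into_iter().take(k).map(|(t, _)| t).collect()
}

fn vocab_shift(recent: &HashSet<String>, historical: &HashSet<String>) -> f32 {
    let inter = recent.intersection(historical).count() as f32;
    let union = recent.union(historical).count() as f32;
    1.0 - inter / union
}
```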
Severity:
| vocab_shift | Severity |
|---|---|
| < 0.3 | None — stable vocabulary |
| 0.3–0.5 | Mild — monitor, check retrieval quality on new terms |
| 0.5–0.7 | Significant — BM25 rebuild recommended |
| > 0.7 | Severe — consider re-embedding with domain-appropriate model |
The daily vocabulary Jaccard scan is computationally cheap — just token counting over recent memories, no embedding computation needed. Running it daily across all active namespaces adds negligible load even at scale, making it practical to treat as an always-on canary rather than a scheduled batch check.
Why BM25 does not solve it
BM25 handles exact term matching — it correctly retrieves memories containing the exact string "DPDPA" for a query containing "DPDPA." For out-of-vocabulary terms that appear verbatim in both query and stored memory, BM25 performs correctly regardless of the embedding model's knowledge.
But BM25 cannot handle synonyms or semantic neighbors. "Digital Personal Data Protection Act" does not match "DPDPA" in BM25; that synonym relationship is semantic retrieval's job. If the embedding model does not know the synonym relationship, neither retriever handles it. The failure mode is asymmetric: queries using the abbreviation find the memories using the abbreviation, but queries using the full name do not. Users who switch between forms experience inconsistent retrieval quality that is difficult to diagnose without explicit vocabulary shift detection.
For internal codenames like "Project Polaris," the situation is more severe. A user asking about "the infrastructure overhaul" expects the semantic retriever to bridge from the description to the codename. Without a trained synonym relationship, the bridge does not exist.
Mitigation: re-embedding with domain-appropriate model
The definitive fix for severe vocabulary shift: select an embedding model trained on domain-specific corpora. A legal embedding model (trained on case law, statutes, legal commentary) represents legal terminology far better than a general-purpose model. A code-specialized embedding model handles programming APIs and technical terminology better. Domain-specific models are increasingly available through embedding providers; the selection decision should be driven by measuring retrieval quality on domain-specific probe queries, not by model provider marketing.
Use the shadow re-embedding strategy to transition safely. Run both the old and new models in parallel for new memories (dual-write to separate embedding columns). Use the old model for retrieval from historical memories; use the new model for recent memories. Compare top-K overlap between old-model and new-model retrieval results via Jaccard similarity over a 30-day window. When Jaccard exceeds 0.8 (80% retrieval agreement), the models are producing sufficiently aligned results that the new model can take over full retrieval. Backfill historical memories in batches during off-peak hours after the cutover gate is passed.
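The cutover gate itself is a small computation. A sketch, assuming per-query top-K id lists from both models are available; the 0.8 threshold comes from the text above:

```rust
use std::collections::HashSet;

fn topk_jaccard(old_ids: &[String], new_ids: &[String]) -> f32 {
    let a: HashSet<&String> = old_ids.iter().collect();
    let b: HashSet<&String> = new_ids.iter().collect();
    a.intersection(&b).count() as f32 / a.union(&b).count() as f32
}

/// True when mean top-K agreement across probe queries reaches 80%.
fn cutover_ready(per_query_jaccard: &[f32]) -> bool {
    let mean = per_query_jaccard.iter().sum::<f32>() / per_query_jaccard.len() as f32;
    mean >= 0.8
}
```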
Mitigation: BM25 GIN index rebuild
For vocabulary shift that stays within a domain the embedding model can handle — new product names, new internal projects, new organizational terminology — a BM25 GIN rebuild updates the IDF statistics to reflect the new vocabulary distribution. New high-frequency terms get lower IDF (they are now common in the corpus); new rare but informative terms get higher IDF (they are distinctive). This does not improve semantic retrieval for out-of-vocabulary terms, but significantly improves lexical retrieval for domain-specific identifiers and names.
The GIN rebuild is a safe, always-appropriate maintenance operation for any namespace exhibiting vocabulary shift. It is cheap relative to re-embedding and produces immediate improvement for the exact-match retrieval path. The correct operational posture: rebuild the GIN index on any namespace showing vocabulary_shift > 0.5, regardless of whether re-embedding is planned.
Why This Separation Matters
Treating all drift as one undifferentiated "memory quality problem" leads to applying the wrong fix:
- Concept drift without versioning means entity resolution uses stale bindings. Re-embedding doesn't help — the embedding of "the product" is consistent within each period; the problem is that the periods resolve to different referents.
- Data drift without BM25 rebuild means lexical retrieval degrades on the new vocabulary while vector retrieval remains relatively stable — an asymmetric degradation that looks like "the memory system is random."
- Schema drift without normalization means the same fact, extracted before and after an upgrade, has different predicates and doesn't merge correctly. Historical and recent memories about the same subject partition into separate groups that the dedup logic can't reconcile.
- Vocabulary shift without a model-tier response means paraphrase queries silently miss memories stored under new terminology, while exact-match BM25 partially compensates. The degradation surfaces only on paraphrases, which is the hardest pattern to diagnose.
Each drift type has a specific failure signature in retrieval. Diagnosing from the symptom (retrieval degradation) to the cause (which drift type) requires looking for all four.
How Recall Differs From Alternatives
Mem0: no explicit drift detection of any kind. Memories accumulate indefinitely. Concept drift is invisible until users notice incorrect retrievals.
Zep: temporal knowledge graph handles some concept drift via time-versioned facts. No explicit data drift detection. Schema drift is undefined because Zep doesn't expose extractor versioning.
Letta: agent-managed memory can theoretically handle any drift — the agent notices and corrects. No systematic detection means drift gets handled inconsistently and depends on the agent noticing in the right conversational context.
Recall: four explicit detection loops with matched mitigation strategies and a unified drift dashboard. This is necessary (not optional) for enterprise deployments accumulating memory over multiple years.
Unified Configuration
```yaml
drift:
  concept_drift:
    enabled: true
    scan_schedule: "weekly"
    thresholds:
      distance: 0.4
      relation_overlap: 0.5
  data_drift:
    enabled: true
    scan_schedule: "weekly"
    sample_size: 500
    thresholds:
      mmd_high: 0.1
      mmd_medium: 0.05
      p_value: 0.05
      vocab_jaccard: 0.5
      kl_divergence_alert: 0.15
  schema_drift:
    enabled: true
    ab_test_on_extractor_change: true
    alert_on_predicate_drift: 0.3
    alert_on_field_drift_pp: 20
    normalization_config: ./predicate_normalization.yaml
  vocabulary_shift:
    enabled: true
    scan_schedule: "daily"
    thresholds:
      mild: 0.3
      significant: 0.5
      severe: 0.7
    auto_gin_rebuild_on: "significant"
```
All findings persist to the drift_findings table — not just logs — so you can trend drift scores across weeks and months rather than seeing only the current scan result. A namespace whose MMD score has been creeping upward over three months tells a different story than one that spiked suddenly this week.
Drift Detection Cadence and Compute Cost
The four drift types do not all warrant the same scan frequency. Vocabulary Jaccard is cheap enough to run daily on every active namespace. MMD is expensive enough that sampling is mandatory. Schema drift detection is not time-scheduled at all — it is event-triggered.
Per-type scan cadence
Concept drift: weekly scan, but only on entities with at least 5 memories in each time window (recent and historical). Entities with fewer memories don't have sufficient statistical coverage to produce meaningful centroid estimates — running centroid computation on sparse entities generates noise, not signal. For a namespace with 10,000 entities but only 200 with sufficient memory density, the weekly scan covers only those 200.
Data drift (MMD): weekly on namespaces with at least 1,000 total memories, sampling 500 recent and 500 historical embeddings. Namespaces below 1,000 memories don't have sufficient history to define a meaningful "historical" distribution; the MMD result would be dominated by sampling variance rather than distributional signal.
Schema drift: triggered automatically on extractor version change, not scheduled. When the extractor model or prompt is updated, the schema drift detection pipeline runs immediately and compares outputs of the new version against the previous 30 days of stored memories. There is no value in running schema drift detection between extractor changes — the extractor hasn't changed, so the predicate vocabulary and field population rates are stable by definition.
Vocabulary shift: daily Jaccard on all active namespaces (any namespace with at least one memory written in the past 7 days). Computationally cheap — just token counting with IDF weighting, no embedding computation, no kernel math.
Compute cost per 1M memories (weekly scan)
Concept drift centroid computation. Assuming 1M memories spread across 10,000 entities averaging 100 memories each: 1M embedding reads at 768 dimensions (float32) = approximately 3 GB of data accessed. With warm database caches and embedding columns indexed, a realistic estimate is a few minutes on commodity hardware, dominated by I/O rather than arithmetic. The computation is embarrassingly parallel across entity batches — 16 parallel workers reduce wall-clock time to seconds. The centroid computation itself (mean of vectors) is O(n × d) per entity: negligible arithmetic.
Data drift MMD. The sampling design keeps this tractable. 500 recent + 500 historical = 1,000 sampled embeddings at 768 dimensions. The O(n²) kernel computation: 1,000² = 1 million kernel evaluations. At approximately 100 nanoseconds per RBF evaluation (on CPU with SIMD): 100 milliseconds. The 100 permutation tests run at the same cost each: 100 × 100ms = 10 seconds total per namespace. For 100 active namespaces: approximately 17 minutes, but trivially parallelizable to under 1 minute with concurrent namespace scans.
Schema drift (predicate Jaccard). Counting predicates across memory versions requires an O(N) database scan with grouping by extractor version. For 1M memories, this is a standard aggregate query on an indexed column. Typical runtime on commodity hardware: 2–5 minutes, dominated by I/O throughput.
Vocabulary Jaccard (daily). Token counting over the last 30 days' memories with IDF weighting. For a namespace with 100K memories written in the past 30 days: a single table scan with tokenization, approximately 30 seconds. For all active namespaces combined, parallelized across worker threads: typically under 2 minutes.
Prioritization when resources are constrained
If compute budget allows only one drift check, run schema drift detection after every extractor change. It is the most actionable signal: the extractor changed, you know precisely when, you know which version transitioned, and you can apply predicate normalization immediately. The cost of missing schema drift is permanent — historical memories with old predicates will permanently fail to match queries using new predicate vocabulary unless normalized.
Data drift MMD is the second priority. It is the most statistically rigorous signal for domain changes and the only one that operates at the namespace level with a formal significance test. A high-confidence MMD finding drives the re-embedding decision, which is the highest-cost mitigation — worth having strong statistical grounds before committing.
Concept drift is the third priority. It is the most expensive scan and addresses the least frequent phenomenon — entity referents genuinely change on the order of months, not weeks. Reducing concept drift scan frequency from weekly to monthly introduces minimal risk of missing a real drift event.
Vocabulary Jaccard is fourth in operational priority for the simple reason that vocabulary shift is often benign domain growth rather than a retrieval problem. A user accumulating new technical vocabulary is not necessarily experiencing degraded retrieval — they may simply be broadening. The Jaccard signal is cheap to collect, but the operational response (evaluate retrieval quality on new terms, consider GIN rebuild) requires human judgment about whether the shift is causing actual problems.
Drift Findings Schema and Dashboard Interpretation
The drift_findings table
All four drift detectors write to a single persistent table rather than separate logs. The unified schema allows cross-type analysis: detecting patterns where multiple drift types occur simultaneously (which often indicates a significant external event — a company acquisition, a major product pivot, a technology platform change).
```sql
CREATE TABLE drift_findings (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  namespace TEXT NOT NULL,
  drift_type TEXT NOT NULL,            -- 'concept' | 'data' | 'schema' | 'vocabulary'
  entity_id UUID,                      -- for concept drift: which entity
  detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  window_start TIMESTAMPTZ NOT NULL,
  window_end TIMESTAMPTZ NOT NULL,
  drift_score REAL NOT NULL,           -- type-specific: centroid distance, MMD, Jaccard, etc.
  signal_type TEXT,                    -- 'centroid' | 'relation_jaccard' | 'mmd' | 'predicate_jaccard' | 'vocab_jaccard'
  severity TEXT NOT NULL,              -- 'low' | 'medium' | 'high'
  status TEXT NOT NULL DEFAULT 'open', -- 'open' | 'acknowledged' | 'resolved'
  acknowledged_by TEXT,
  acknowledged_at TIMESTAMPTZ,
  resolution_action TEXT,              -- 'versioned' | 'normalized' | 're-embedded' | 'false_positive'
  metadata JSONB
);
```
The entity_id column is nullable because only concept drift findings are entity-scoped; data drift, schema drift, and vocabulary shift findings are namespace-scoped. The signal_type column distinguishes which sub-signal triggered the finding — a concept drift finding triggered by centroid distance alone (below the combined-signal threshold) would appear with signal_type = 'centroid' and severity = 'low', while one that triggered both centroid and relation Jaccard appears with signal_type = 'centroid+relation_jaccard' and severity = 'high'.
Trend analysis
Because findings are persistent rows rather than transient log entries, drift score trends are queryable as time series. A namespace with concept drift score creeping from 0.25 to 0.38 over eight consecutive weekly scans — still below the 0.4 alert threshold — is exhibiting slow-burn drift that a threshold-based alerting system would miss entirely. The finding exists in the table for each week as a sub-threshold record; plotting the series reveals the monotonic movement.
```sql
SELECT detected_at, drift_score
FROM drift_findings
WHERE entity_id = $1
  AND drift_type = 'concept'
ORDER BY detected_at;
```
Monotonically increasing drift scores over multiple weeks, even below the alert threshold, warrant early intervention. Entity versioning applied before the threshold is crossed avoids a hard alert and the user impact that accompanies it. Slow-burn concept drift is particularly common in organizational contexts where entities like "the team," "the platform," or "the roadmap" are continuously evolving referents.
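Flagging a slow burn over the weekly series returned by that query can be as simple as checking for sustained upward movement below the alert line. A sketch; the 80% monotonicity cutoff and six-week minimum are illustrative choices, not Recall's documented behavior:

```rust
fn slow_burn(weekly_scores: &[f32]) -> bool {
    if weekly_scores.len() < 6 {
        return false; // need several weeks of history to call it a trend
    }
    // fraction of week-over-week steps that increased
    let rising = weekly_scores.windows(2).filter(|w| w[1] > w[0]).count();
    let mostly_rising = rising as f32 / (weekly_scores.len() - 1) as f32 >= 0.8;
    // still under the 0.4 hard-alert threshold the whole time
    let below_alert = weekly_scores.iter().all(|&s| s < 0.4);
    mostly_rising && below_alert
}
```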
Dashboard interpretation guide
Many concept drift findings in one week. When ten or more entities in a namespace all trigger concept drift findings in the same weekly scan window, the cause is almost certainly an external event rather than organic drift in individual entities. Look at the window_end timestamps: if they cluster around the same date, check for a reorg announcement, a company acquisition, or a product pivot near that date. The correct response is a bulk entity versioning operation with a shared pivot event annotation, not individual review of each entity.
Data drift finding with high MMD but low vocabulary Jaccard. The embedding distribution has shifted, but the word-level vocabulary is stable. This pattern indicates that what changed is not the words being used but the embedding space itself — a likely embedding model update on the infrastructure side. Check whether the embedding model version changed near the finding date. If so, the finding is a detection artifact rather than a real domain shift; mark as resolution_action = 'false_positive' and add a note to the metadata JSONB field indicating the model version change.
Schema drift finding with predicate vocabulary drift above 0.3. An extractor prompt or model change has introduced new predicate vocabulary. The operational response is immediate: apply predicate normalization before the vocabulary gap between historical and recent memories widens further. The longer normalization is deferred, the larger the gap between old and new predicate vocabularies grows — more aliases to define, more historical memories that fail to match current queries.
Vocabulary shift with Jaccard below 0.5 sustained over three or more consecutive weeks. This is a genuine domain shift signal, not a transient fluctuation. The namespace's lexical domain has structurally changed. This is not necessarily a bug — a user who pivoted from software engineering to legal work should produce vocabulary Jaccard below 0.5. The operational question is whether the embedding model is still appropriate for the new domain. Evaluate retrieval quality on probe queries in the new vocabulary before committing to a re-embedding operation.
Keeping findings actionable
Each finding should reach status = 'resolved' within 30 days of detection. Unresolved findings are listed in the operations dashboard with age indicators. A backlog of more than 20 open findings at any time usually indicates one of two problems: the thresholds are calibrated too sensitively and generating more findings than the team can process (a tuning problem), or the operational bandwidth to handle drift resolution has not been allocated (a process problem).
The drift system is a detection system, not a self-healing system. Every finding ultimately requires a human decision about whether to version an entity, rebuild an index, apply normalization, or mark as false positive. Automated responses — auto-GIN-rebuild on vocabulary shift, auto-predicate-normalization on schema drift — reduce the operational load for low-severity findings but cannot replace the judgment call on high-severity findings. A high-severity concept drift finding on an entity like ent_the_company requires understanding what actually changed about the company, which requires context that lives outside the memory system.
Summary
Four phenomena hide under the word "drift" in memory systems:
- Concept drift — entity meanings change. Detected by dual-signal (centroid distance AND Jaccard relation overlap). Mitigated by entity versioning or pivot annotations.
- Data drift — embedding distribution shifts at the namespace level. Detected by MMD with RBF kernel and permutation-test significance. Mitigated by BM25 rebuild or re-embedding.
- Schema drift — extractor behavior changes predicate vocabulary and field population patterns. Detected by predicate Jaccard and field population rate comparisons. Mitigated by a predicate normalization layer and A/B extractor gates.
- Vocabulary shift — new domain terminology the embedding model was not trained on. Detected by vocabulary Jaccard on term frequency distributions. Mitigated by re-embedding with a domain-appropriate model.
Each requires a different detection mechanism because each is a different statement about what has changed. And each requires a different mitigation because the fix for "the referent of an entity has changed" is categorically different from the fix for "the IDF-weighted term distribution of the namespace has shifted by 62%."
References
- [1] Wu et al. (2025). "LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory." ICLR 2025.
- [2] Gretton et al. (2012). "A Kernel Two-Sample Test." Journal of Machine Learning Research 13, 723–773.
- [3] Quiñonero-Candela et al. (2009). "Dataset Shift in Machine Learning." MIT Press.