Concept Drift: Dual-Signal Detection
Detecting that an entity's meaning has changed is harder than it sounds. Embeddings drift naturally as the embedding model sees the entity in new contexts; relation graphs are sparse, so most entities don't have enough edges to compare meaningfully. Either signal alone produces too many false positives or too many false negatives. AND-ing the two signals — requiring both to agree before triggering — gives reliable detection that fires on real concept drift (Twitter → X, Acme acquired by Globex) while ignoring routine variation.
Signal 1 — centroid distance
For each entity, compute two centroids: the mean embedding of recent memories (last 30 days) and the mean of historical memories (30+ days old). The cosine distance between them is the centroid shift.
d(e) = 1 − cos(centroid_recent(e), centroid_historical(e))
Cosine distance between recent and historical centroids; crossing 0.4 raises high-drift suspicion.
- Below 0.2: stable.
- 0.2–0.4: noisy.
- Above 0.4: suspicious — but not enough alone.
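Signal 1 can be sketched in a few lines, assuming embeddings arrive as NumPy row vectors (function names are illustrative, not from the source):

```python
import numpy as np

def centroid(embeddings: np.ndarray) -> np.ndarray:
    """Mean embedding of a window, L2-normalized."""
    c = embeddings.mean(axis=0)
    return c / np.linalg.norm(c)

def centroid_shift(recent: np.ndarray, historical: np.ndarray) -> float:
    """Signal 1: cosine distance between recent and historical centroids."""
    return float(1.0 - centroid(recent) @ centroid(historical))

# Sanity check: identical windows produce (numerically) zero shift.
rng = np.random.default_rng(0)
emb = rng.normal(size=(20, 8))
assert centroid_shift(emb, emb) < 1e-9
```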
Signal 2 — relation overlap
For each entity, compute the Jaccard index of its relation set in the recent window vs. the historical window.
J(e) = |R_recent(e) ∩ R_historical(e)| / |R_recent(e) ∪ R_historical(e)|
Below 0.5: substantial relation churn; threshold for the second signal.
- Above 0.7: same relations, stable connections.
- 0.5–0.7: noisy churn.
- Below 0.5: substantial churn — but not enough alone.
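Signal 2 is a one-liner once the relation sets are built; the convention for two empty windows is an assumption here, and the predicate names below reuse the example from later in this section:

```python
def jaccard(recent: set[str], historical: set[str]) -> float:
    """Signal 2: overlap of distinct predicates between the two windows."""
    union = recent | historical
    if not union:
        return 1.0  # no relations in either window: treat as stable (assumption)
    return len(recent & historical) / len(union)

# 5 shared predicates plus 3 new ones in the recent window: 5/8 = 0.625.
hist = {"works_at", "manages", "collaborates_with", "lives_in", "attends"}
rec = hist | {"mentors", "sponsors", "chairs"}
assert abs(jaccard(rec, hist) - 0.625) < 1e-9
```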
AND, not OR
The key insight: trigger only when both signals fire.
d(e) > 0.4 ∧ J(e) < 0.5
Either signal alone produces too many false positives. Together, they isolate the rare cases where an entity's meaning genuinely changed:
- True positive (Twitter → X). Embedding shifts (new product context) AND relation churn (new owners, new policies). Fires.
- False positive avoided (entity getting more use). Embedding stable; only relation count growing. Does not fire.
- False positive avoided (model retraining noise). Embedding shifts; relations stable. Does not fire.
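The three scenarios above reduce to a simple gate (thresholds are from the text; the function name is illustrative):

```python
D_THRESHOLD = 0.4   # centroid distance must exceed this
J_THRESHOLD = 0.5   # Jaccard must fall below this

def drift_fires(d: float, j: float) -> bool:
    """AND gate: both signals must agree before an alert fires."""
    return d > D_THRESHOLD and j < J_THRESHOLD

assert drift_fires(0.55, 0.3)       # Twitter -> X: both signals fire
assert not drift_fires(0.1, 0.3)    # growth: relations churn, embedding stable
assert not drift_fires(0.55, 0.9)   # retraining noise: embedding shifts only
```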
Cadence and cost
Run the full scan weekly. The bottleneck is the centroid computation; sampling 100 memories per entity gives stable centroids and bounds the cost. On a 1M-entity store, weekly scan completes in under an hour on commodity hardware. Real-time triggering (every write) is overkill — drift accumulates over weeks, not minutes. Daily is too noisy; weekly is the right cadence.
When dual-signal still fails
- Sparse-relation entities. Entities with fewer than 3 relations in both windows are skipped — Jaccard is too noisy at that size. These rely on the centroid signal alone, with a higher confidence threshold.
- Slow-burn drift. Drift that happens over months (a slow product rebrand) may stay just-under-threshold each week. The fix: track centroid trajectory, not just snapshot. If the centroid is monotonically moving, accumulate evidence.
What concept drift actually means
Concept drift in memory systems is not the same as concept drift in machine learning literature, where it refers to distribution shifts in model training data. In memory systems, concept drift is a semantic problem: the referent of an entity identifier has changed, but the identifier itself remains the same. The entity ent_my_manager still exists, but the person it points to is now Rahul, not Priya. The string key is stable; the real-world object it names is not.
Three categories of concept drift arise in practice, and they require different responses.
Category 1 — Product pivots. A startup founder mentions "the product" dozens of times in early 2026 to mean their mobile app. They pivot to a web-first model in October. In November 2026, "the product" means the web dashboard. The entity ent_the_product now has two distinct underlying identities with a hard break at the pivot date. Memories from before the pivot carry wrong context when used to answer post-pivot questions. Asking "what does the product do?" after the pivot should not retrieve memories about the mobile app's onboarding flow. Without drift detection, that retrieval happens silently.
Category 2 — Role transitions. "My manager" resolves to Priya in Q1, then to Rahul after a reorg in Q3. Any memory using the pronoun "my manager" or the relationship reports_to needs to be time-windowed. Without concept drift detection, "ask my manager about the budget" routes to Priya months after Priya's replacement. The coreference engine has no mechanism to invalidate the Priya binding without an explicit signal that the role has turned over. Dual-signal detection provides that signal: when the entity's relation edges (reports_to, managed_by) churn while the manager-role embedding space also shifts to reflect a new person's context, the alert fires.
Category 3 — Company rebrands. "Twitter" existed as an entity in the memory store. Post-rebrand, the entity becomes "X" but the platform, the followers, and the use cases are the same referent. "X (formerly Twitter)" is one entity that should receive a canonical alias update, not a new entity. Contrast this with "Inbox3," which started as a personal side-project name and later became the legal company name: those are genuinely different referents that share a name during a transition window. Entity versioning handles the hard pivot; alias updates handle the rebrand with continuity.
These three categories produce different correct actions: pivots need entity versioning (create a new entity with a canonical_from date), role transitions need temporal edge validity (valid_from/valid_to on relation edges), and rebrands need alias updates (add an alternate name to the entity record). What makes the dual-signal approach general is that it detects all three without requiring any category-specific logic at scan time. The signal fires; the human reviewer determines which mitigation applies.
Why embeddings alone fail to detect concept drift
The natural instinct is to detect concept drift purely through embedding distance: if the distribution of embeddings for entity-related memories shifts, the concept drifted. This fails for two distinct reasons, one producing false positives and one producing false negatives.
False positives from legitimate context expansion. An entity that was primarily discussed in the context of software engineering picks up new mentions in the context of team management as the person grows in their role. The centroid of their recent embeddings shifts toward leadership and process vocabulary. The centroid distance can exceed the 0.4 threshold — but the entity (the person) hasn't changed its fundamental nature. The embedding model is sensitive to surrounding context, and context naturally expands as an entity's responsibilities grow. Centroid distance alone would fire here; the AND gate suppresses it because the person's relations (works_at, collaborates_with, reports_to) remain stable throughout the context expansion.
False negatives from embedding proximity within a domain. Some real concept drift does not produce large centroid shifts. If "the product" changes from a mobile app to a web app in the same product domain — both are software products serving the same user base, both described using engineering and UX vocabulary — the embedding distance between "mobile app memories" and "web app memories" may be modest. The structural change (different engineering team, different performance metrics, different customer acquisition channel, different deployment infrastructure) is invisible to the embedding space, because all of that vocabulary lives in the same software-product semantic neighborhood. Centroid alone would not fire. The AND gate catches it because the relation set changes substantially: the mobile team's members, the app store as a distribution channel, and the crash-reporting tools are all replaced by a web team, CDN edges, and browser error-tracking tools.
The 0.4 threshold derivation. The centroid distance threshold was chosen empirically. On a corpus of stable entities across a 6-month production deployment, centroid distances follow an approximately Beta-distributed shape with most mass below 0.3. Setting the threshold at 0.4 places the alert boundary at roughly the 97th to 98th percentile of the stable-entity distribution. At that threshold, the false-positive rate from the centroid signal in isolation is approximately 2–3% of all entities per weekly scan. On a 1M-entity namespace, that is 20,000–30,000 spurious alerts per week — far too noisy to surface for review. The Jaccard gate reduces the fired-alert set by a further 10–20x, bringing the practical false-positive rate to a manageable 0.1–0.3% of entities per week.
Why relation overlap alone fails
Relation sets are sparse for most entities. Consider an entity representing a person that appears in 50 memories total: it might participate in 5 distinct predicates — works_at, manages, collaborates_with, lives_in, attends. With only 5 data points per window, Jaccard is highly sensitive to sampling noise. One new predicate appearing in the recent window reduces Jaccard by roughly 17 percentage points (from 5/5 to 5/6). Two new predicates drop it to 5/7, or 0.71 — still above threshold. Three drop it to 5/8, or 0.625 — also above; five would bring it to 5/10, exactly at the 0.5 threshold. Whether the threshold fires thus depends on exactly how many new predicates the extractor generated, which may vary week to week based on the documents ingested.
Legitimate growth generates false Jaccard instability. A new hire at a company naturally accumulates new team memberships, project associations, and collaboration edges over their first year. All of these are legitimate new relations — the person hasn't drifted; they've become more connected. Jaccard drops as the recent window accumulates new predicates relative to the (smaller) historical set. But this is growth, not drift: the historical predicates are still present in the recent window, and the new ones are additive rather than replacements. Without the centroid gate, every rapidly-onboarding entity would trigger drift alerts continuously.
The 0.5 threshold derivation. On stable entities with growing graphs, Jaccard typically stays above 0.6 because new predicates are additions to an existing set rather than replacements of it. The union grows; the intersection grows at the same rate. Below 0.5, more than half the relation set has turned over within the comparison window — this level of churn occurs reliably during genuine structural changes: company acquisition (all subsidiary_of and owned_by edges update), team dissolution (all member_of edges drop), or role change (reports_to and managed_by edges swap). Below 0.5 is where structural changes live; above 0.5 is where growth and noise live.
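The growth-versus-replacement distinction behind the 0.5 threshold can be checked numerically (predicate names below are illustrative):

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index of two predicate sets."""
    return len(a & b) / len(a | b)

core = {"works_at", "member_of", "collaborates_with", "reports_to"}

# Growth: new predicates are additive; the historical set survives intact.
growth = core | {"mentors", "leads"}            # 4/6 ~= 0.67: stable zone
# Structural change: most of the set is replaced, as in an acquisition.
churn = {"works_at", "subsidiary_of", "owned_by", "acquired_by"}  # 1/7 ~= 0.14

assert jaccard(growth, core) > 0.6
assert jaccard(churn, core) < 0.5
```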
The AND requirement isolates the intersection. An entity accumulating new relations without changing its embedding domain (growth pattern) produces low Jaccard stability but also low centroid distance — the embedding centroid stays in the same semantic neighborhood even as new edges appear. It does not trigger. An entity whose embedding space shifts because the underlying language model was retrained produces high centroid distance but stable relations — the graph structure didn't change; only the vector representation of existing memories changed. It does not trigger either.
Centroid computation details
The centroid computation requires careful design to be both statistically sound and computationally tractable at scale.
Memory sampling strategy. For entities with more than 100 memories in either window, randomly sample 100 from each window before computing the centroid. The centroid of 100 sample embeddings drawn uniformly from a 1,000-embedding set is stable to within approximately ±0.02 cosine distance of the true centroid — small relative to the 0.35–0.40 noise band around the threshold. Sampling bounds the computation to a constant number of operations per entity regardless of how large the entity's memory set grows. Without sampling, a heavily-mentioned entity with 50,000 memories would require 50,000 embedding lookups per scan; with sampling, it requires 100 regardless of total memory count.
Window definition. Recent window: memories created in the last 30 days. Historical window: memories created 30 to 180 days ago. The 30-day recent window captures current context; the 30-to-180-day window (5 months) provides a stable baseline that isn't so old as to be irrelevant — a 6-month-old memory of someone's job title is still meaningful context, but a 3-year-old memory may reflect a different life stage entirely. The 30-day gap between windows ensures the two sets are non-overlapping: we are not comparing different random samples from the same population, but genuinely different temporal slices.
Minimum memory requirement. Entities with fewer than 5 memories in either window are skipped. Five memories produce a noisy centroid — the sample mean of five 768-dimensional vectors has high variance in all dimensions, and the resulting centroid is not a reliable representation of the entity's semantic neighborhood in that period. New entities and rarely-mentioned entities are correctly excluded: they don't have sufficient history to compute a meaningful baseline, and they are precisely the entities for which any centroid comparison would be noise-dominated.
Computation cost at scale. For a 1M-entity namespace, the entity distribution is highly skewed: most entities have fewer than 50 memories and are processed trivially. Entities with 100+ memories (the important, frequently-referenced entities) get the full 100-sample centroid computation. At 100 samples × 768 dimensions × 4 bytes per float32 = 307 KB of embedding data per entity, a batch of 10,000 large entities requires approximately 3 GB of embedding data transferred from storage — manageable within a 1-hour weekly batch window on commodity hardware with a local embedding cache. The centroid arithmetic itself (100 vector additions and one normalization per entity) is negligible; the bottleneck is storage I/O for the embedding retrieval.
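A quick arithmetic check of the figures above (float32 and 768 dimensions, per the text):

```python
SAMPLES = 100          # embeddings sampled per large entity
DIM = 768              # embedding dimensionality
FLOAT32_BYTES = 4

per_entity = SAMPLES * DIM * FLOAT32_BYTES   # bytes fetched per large entity
batch = 10_000 * per_entity                  # bytes per 10k-entity batch

assert per_entity == 307_200                 # ~307 KB per entity
assert batch == 3_072_000_000                # ~3 GB per batch
```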
Relation set construction
The relation set R(e) for entity e in a time window is the set of distinct predicates in which entity e participates during that window, normalized to their canonical forms and filtered by a minimum support threshold.
Predicate normalization. Raw extraction produces verb phrases with surface variation: "manages", "is managing", "led", "was the manager of" may all express the same predicate. Without normalization, predicate vocabulary drift — caused by an extractor model upgrade that consistently chooses different surface forms for the same relation — generates spurious concept drift alerts. The canonical predicate table maps surface forms to normalized identifiers (e.g., all manager-role verbs → manages). This table is shared with the schema drift detection system, which maintains it as part of the ontology layer. The concept drift scanner applies the same normalization before constructing R(e).
Minimum support threshold. Only predicates with at least 2 supporting memories within the window are included in R(e). A predicate appearing in exactly one memory may be an extraction error, a misread entity boundary, or an event so rare that it does not characterize the entity's structural position. Requiring 2 instances filters single-occurrence noise without meaningfully reducing signal: a genuine structural relation (works_at, manages, member_of) appears repeatedly across multiple conversations, not once.
Bidirectional inclusion. If entity e appears as both subject and object across different relations, both directions are included in R(e). The memory "Priya manages the backend team" adds manages to Priya's R(e). The memory "The backend team is managed by Priya" adds managed_by to the backend team entity's R(e). If Priya's role changes, both her manages predicate and the team's managed_by predicate will churn — the signal reinforces from both ends of the relation edge. This bidirectionality means structural changes affecting multiple entities are visible as correlated drift signals, which can be used for cross-entity consistency checks.
Predicate set vs. edge set. R(e) is a set of distinct predicates, not a multiset of edges. An entity that has 20 memories all expressing works_at contributes one element to R(e), not 20. This is intentional: we are measuring structural role diversity, not edge frequency. An entity that participates in many distinct types of relations is more structurally diverse and has a richer R(e); an entity that is mentioned 1,000 times in the same role has R(e) = {works_at}. Jaccard on predicate sets is more stable and interpretable than Jaccard on edge multisets.
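Putting the three rules together, R(e) construction for one window might look like this (the canonical table entries are illustrative; the real table is maintained by the ontology layer):

```python
from collections import Counter

# Hypothetical canonical-predicate table; surface forms map to normalized IDs.
CANONICAL = {
    "manages": "manages", "is managing": "manages",
    "led": "manages", "was the manager of": "manages",
    "works at": "works_at", "is employed by": "works_at",
}

def relation_set(raw_predicates: list[str], min_support: int = 2) -> set[str]:
    """Build R(e) for one window: normalize surface forms, then keep only
    predicates with at least `min_support` supporting memories."""
    counts = Counter(CANONICAL.get(p, p) for p in raw_predicates)
    return {pred for pred, n in counts.items() if n >= min_support}

preds = ["manages", "led", "works at", "is employed by", "attends"]
# "manages"/"led" normalize together (2 supports); "attends" appears once.
assert relation_set(preds) == {"manages", "works_at"}
```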
Three mitigation options in detail
When the dual-signal fires and human review confirms genuine concept drift, three mitigation paths are available. The correct choice depends on the nature of the drift.
Option 1 — Entity versioning (preferred for hard pivots). Create a new entity ent_the_product_v2 with a canonical_from timestamp set to the detected pivot date. Wire all memories created after canonical_from to the v2 entity. Keep memories created before canonical_from on the v1 entity. Add a succeeds directed edge from v2 to v1. Coreference resolution then uses the temporal context of the query to route entity references to the correct version: a query about "the product" in a conversation that has no temporal anchors resolves to v2 (current); a query explicitly asking about "what the product was like before the pivot" uses the temporal signal to route to v1. Entity versioning is the most precise mitigation: it preserves historical accuracy while ensuring current queries get current context. The tradeoff is complexity — two entities must be maintained, and coreference must be version-aware.
Option 2 — Merge with pivot note (for soft continuity). Keep one entity with an embedded pivot_note field that records the semantic transition: "As of October 2026, this entity's context changed from mobile app (iOS/Android) to web dashboard (React SPA). Memories before this date reflect the mobile product." All memories remain on the single entity without temporal routing. The LLM receives the pivot note as part of the entity's context block when it retrieves entity-related memories, and it must reason about which era's context applies to the current question. This approach is operationally simpler — no versioned entity to maintain, no temporal routing logic — but loses precision. The model must now perform temporal disambiguation that the system could have handled structurally. For slow rebrands with genuine continuity (the core user base and mission remained stable), this is the correct tradeoff. For hard pivots with genuinely incompatible contexts, Option 1 is preferable.
Option 3 — Acknowledge and continue. The human reviewer examines the drift alert and determines that the signal fired on legitimate entity growth or a minor semantic expansion, not genuine drift. The entity's drift score is reset; the detected_at timestamp is recorded in drift_findings with outcome = acknowledged_no_action. The system continues without structural changes. This is correct for entities that are growing, gaining new context, and accumulating relations as expected. The coreference cache (30-day TTL per entity binding) ensures that any stale bindings from before the acknowledgment expire within a month without requiring an explicit cache invalidation.
Slow-burn drift and trajectory tracking
The dual-signal design targets step-change drift: an entity that changes significantly over a short period. Slow-burn drift presents differently. A product that gradually pivots over 6 months shows centroid distances of 0.25–0.35 each week (below the 0.4 threshold) and Jaccard values of 0.55–0.65 each week (above the 0.5 threshold). No single weekly scan triggers the alert. But cumulatively over 6 months, the entity has drifted as much as if the pivot had happened overnight.
Trajectory tracking mechanism. Instead of comparing the current centroid to the static historical centroid, trajectory tracking compares the centroid computed 6 weeks ago to the centroid computed today. If the centroid has moved monotonically in the same direction over the 6-week sequence — measured by checking that the dot product of consecutive week-over-week displacement vectors is positive — the system accumulates a trajectory score rather than a point score.
T(e) = Σ_{i=1..6} [1 − cos(centroid_week_i(e), centroid_week_{i−1}(e))]
Accumulated trajectory score over a 6-week window; fires a soft alert when sum exceeds 0.8.
The trajectory alert threshold is set at T(e) > 0.8. At an average week-over-week centroid distance of 0.133, the accumulated score reaches 0.8 after 6 weeks. Under perfectly monotonic movement, that is roughly equivalent to a single-week centroid distance of 0.8 — far above the 0.4 point-alert threshold. The trajectory threshold is intentionally set high to avoid alerting on noisy random walks that happen to accumulate positive dot products by chance; monotonic movement is necessary but not sufficient.
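The accumulation and the monotonicity check can be sketched together (function name and return shape are illustrative):

```python
import numpy as np

def trajectory_score(weekly_centroids: list[np.ndarray]) -> tuple[float, bool]:
    """Accumulated week-over-week cosine distance T(e), plus a flag that is
    True when consecutive displacement vectors all point the same way."""
    t, dots, prev_disp = 0.0, [], None
    for a, b in zip(weekly_centroids, weekly_centroids[1:]):
        t += 1.0 - float(a @ b) / float(np.linalg.norm(a) * np.linalg.norm(b))
        disp = b - a
        if prev_disp is not None:
            dots.append(float(prev_disp @ disp))
        prev_disp = disp
    return t, all(d > 0 for d in dots)

# A centroid sliding steadily in one direction accumulates score and
# passes the monotonicity check; a soft alert would fire once T(e) > 0.8.
weeks = [np.array([1.0, 0.25 * i]) for i in range(7)]
t, monotonic = trajectory_score(weeks)
assert monotonic and t > 0.0
```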
Soft alerts vs. hard alerts. Trajectory alerts are soft alerts. They do not require immediate action — they surface in the drift dashboard with the label "slow movement detected" and carry a 30-day acknowledgment deadline. Hard alerts (from the point-threshold dual-signal) require acknowledgment within 7 days because step-change drift can corrupt active queries immediately. Slow-burn drift is less urgent: the context degradation accumulates gradually, and a 30-day window is sufficient to review and mitigate before the accumulated drift produces significant retrieval errors.
Persistence schema. Trajectory findings are written to the same drift_findings table as point alerts, with signal_type = 'trajectory' to distinguish them. The schema includes window_start, window_end, centroid_delta_accumulated (the T(e) value), and direction_consistency_score (the mean dot product of consecutive displacement vectors, where 1.0 means perfectly monotonic and 0.0 means random walk). The dashboard renders a time series of weekly centroid distances for each entity in review, making slow burns visible as trends rather than threshold crossings. A reviewer looking at a 6-week chart of centroid distances at 0.15, 0.19, 0.22, 0.26, 0.30, 0.33 can immediately recognize the pattern, even though no individual week crossed 0.4.
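An illustrative row shape for the drift_findings records described above (column names follow the text; the types and the date format are assumptions):

```python
from dataclasses import dataclass

@dataclass
class DriftFinding:
    entity_id: str
    signal_type: str                    # 'point' or 'trajectory'
    window_start: str                   # ISO dates, as an assumption
    window_end: str
    centroid_delta_accumulated: float   # T(e) for trajectory findings
    direction_consistency_score: float  # 1.0 = monotonic, 0.0 = random walk

row = DriftFinding("ent_the_product", "trajectory",
                   "2026-09-01", "2026-10-13", 0.83, 0.91)
assert row.signal_type == "trajectory"
```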
Interaction with the point-alert system. If an entity accumulates a trajectory alert and then subsequently crosses the 0.4 point threshold in the same week, both alerts fire. The human reviewer sees the full history: the slow burn that preceded the step change. This combined signal provides stronger evidence that the drift is genuine rather than noise, and it helps the reviewer understand the timeline — the pivot may have started 6 weeks earlier than the point alert suggests, which is important for determining the correct canonical_from date for entity versioning.