Upgrading Your Extractor Without Breaking Your Memory Store
Extractor upgrades are the one infrastructure operation that most documentation about LLM memory systems skips. Deploy a new model, update the extraction prompt, change the type rules — these feel like configuration changes. The service restarts, the new configuration loads, and everything continues. But what actually happened is that your memory store now contains two populations: everything extracted by the old configuration and everything going forward under the new one.
These two populations don't always agree on how to represent information. A model that previously extracted "prefers Python for data work" as a preference might extract the same statement as two separate facts under a new prompt. A schema change that renames work_location to primary_office creates a vocabulary gap — entity graph queries using the old predicate miss memories created after the upgrade. The confidence calibration of a new model might be systematically higher or lower, distorting the retrieval ranking for weeks.
None of this is unique to Recall. Any system that uses an LLM to structure data faces the same problem when the LLM changes. What's different is that Recall is designed to make this problem visible and manageable, rather than letting it accumulate silently.
The schema_version field
Every memory in Recall carries schema_version: u32. Every ExtractorInfo struct (embedded in Provenance) carries extractor_schema_version: u32. These are the same number — the extractor schema version at the time of extraction becomes the memory's schema version.
pub struct Memory {
    pub id: MemoryId,
    pub schema_version: u32, // version of the extraction schema that produced this memory
    // ... other fields
    pub provenance: Provenance,
}
pub struct ExtractorInfo {
    pub provider: String,
    pub model: String,
    pub model_version_hash: Option<String>, // 16-char hex hash of model identifier
    pub extractor_schema_version: u32, // same value stamped onto the memory
}

You control extractor_schema_version in your write pipeline config:
write_pipeline:
  extractor:
    provider: anthropic
    model: claude-haiku-4-5
    schema_version: 3 # increment this when you change extraction config
    custom_instructions: |
      Extract work preferences as "preference" type.
      Use snake_case predicates throughout.

When you bump schema_version from 3 to 4, all subsequent memories get schema_version: 4. All previously stored memories retain schema_version: 3. The namespace now has a mixed version distribution.
What actually changes across versions
Three things can change independently, and all three justify incrementing schema_version:
1. The model itself. Switching from one LLM to another — or upgrading to a new checkpoint of the same model — changes the extraction behavior even if the prompt is identical. Different models have different calibration, different entity naming tendencies, different willingness to extract borderline candidates. If you change provider or model in your config, increment schema_version.
2. The extraction schema. The <type_rules> and <quality_rules> sections of the extraction prompt define what gets extracted and how. If you add a new predicate, remove one, or change the type classification rules, the resulting memories will have different structure. Increment.
3. The extraction prompt. Custom instructions (<custom_rules>) affect extraction behavior. Even if the schema is identical, different instructions produce different outputs for edge cases. If you change custom_instructions, increment.
model_version_hash gives you a stable identifier for the model even when provider versioning is opaque — it's a hash of the model name string, so claude-haiku-4-5 always maps to the same hash across deployments.
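The hashing scheme isn't spelled out beyond "hash of the model name string," so treat the algorithm in this sketch as an assumption; the point is that the mapping is deterministic across deployments:

import { createHash } from "node:crypto";

// Assumed implementation: SHA-256 of the model name, truncated to 16 hex
// chars. The exact hash function Recall uses is not documented here.
function modelVersionHash(model: string): string {
  return createHash("sha256").update(model).digest("hex").slice(0, 16);
}

modelVersionHash("claude-haiku-4-5"); // same output on every deployment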
Detecting mixed-version namespaces
The schema_version_mismatch alert fires when a namespace contains memories from more than one schema version. Check it directly:
const health = await recall.health.deep({ namespace });
console.log(health.schema_versions);
// { "3": 48201, "4": 1847 }
console.log(health.alerts);
// ["schema_version_mismatch"]A mixed-version namespace during an upgrade is expected — this is the normal state for 24–72 hours after an extractor change. The alert is informational during this window. If the mismatch persists for more than a week, it usually means either:
- The cutover didn't complete (some write paths still use the old config)
- The new extractor is failing silently and writes are falling back to the old schema
Check recall_writes_total labeled by schema_version (if you've added this dimension to your Prometheus relabeling) to see whether new-version memories are still being produced.
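A quick way to run that check from a script, assuming a standard Prometheus deployment (the URL and the 1h window are placeholders; recall_writes_total and the schema_version label come from the relabeling described above):

// Query the Prometheus HTTP API for recent writes broken out by version.
const query = "sum by (schema_version) (increase(recall_writes_total[1h]))";
const resp = await fetch(
  `http://prometheus:9090/api/v1/query?query=${encodeURIComponent(query)}`
);
const { data } = await resp.json();
for (const series of data.result) {
  console.log(series.metric.schema_version, series.value[1]);
}
// A zero (or missing series) for the new version means the cutover never took effect.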
Migration strategy 1: coexist
The simplest approach: do nothing. Increment schema_version, deploy the new extraction config, and let old and new memories coexist in the namespace.
When it works: The semantic differences between versions are small. You changed the extraction prompt to use more precise predicate names but the underlying meaning is the same. The entity naming conventions are compatible. Confidence calibration is close enough.
When it fails: You changed how a core concept is typed — for example, job titles were previously stored as preference memories and are now stored as fact memories. The retrieval pipeline uses type-specific filters; a query with type: "fact" now finds job title memories that didn't exist in v3 while missing them in the v3 population. Retrieval results become inconsistent by memory age.
The test: Run debugRetrieval on a sample of queries against known memories from both versions. If the ranking order looks consistent, coexist is safe. If v3 memories are systematically ranking lower for recent queries, you have a semantic break.
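A sketch of what that test looks like; the response shape of debugRetrieval is an assumption here, but the idea is to surface each hit's schema_version alongside its rank:

// Hedged sketch: inspect which schema version each ranked result came from.
const debug = await recall.debugRetrieval({
  query: "where does the user prefer to deploy?",
  scope
});
for (const hit of debug.ranked) {
  console.log(hit.rank, `v${hit.memory.schema_version}`, hit.score.toFixed(3));
}
// Healthy coexist: v3 and v4 memories interleave by relevance.
// Semantic break: v3 memories sink to the bottom regardless of relevance.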
Migration strategy 2: progressive rollout
Route a fraction of new writes to the new extractor schema and compare quality before full cutover. This requires running two write pipeline configurations simultaneously.
write_pipeline:
  extractor:
    strategy: progressive
    primary:
      schema_version: 3
      model: claude-haiku-4-5
      # ... v3 config
    candidate:
      schema_version: 4
      model: claude-haiku-4-5
      # ... v4 config
    rollout_percent: 10 # 10% of writes go to candidate
    compare_metric: retrieval_precision_at_5
    compare_window_days: 7

During the rollout window, both extractors process their assigned turns. The quality dashboard shows parallel precision@5 metrics for each version. After 7 days, if v4 precision is higher (or equal), you cut over fully by setting rollout_percent: 100 and then removing the primary config.
When to use it: Any significant change to extraction schema or a model upgrade where you're not confident the new behavior is strictly better. The week-long comparison window catches systematic regressions before they affect the full namespace.
The cost: Roughly 10% overhead during the rollout window (10% of turns processed by both extractors). This is usually acceptable compared to the cost of a full rollback.
Migration strategy 3: full re-extraction
Delete all memories in the namespace and re-extract from scratch against the raw conversation history. Cleanest result, highest cost and risk.
When it's justified: A breaking semantic change that makes v3 and v4 memories incompatible in a way that can't be papered over. The canonical example: you changed the entity naming convention from display names ("Priya Sharma") to ID-based slugs ("priya-sharma-arr"). The entity graph is built on entity IDs; a naming convention change requires rebuilding every entity and every relation that references them.
The process:
// 1. Snapshot current namespace stats for comparison
const before = await recall.stats({ namespace });

// 2. Soft-delete all memories in namespace (not hard-delete — preserves audit trail)
await recall.purgeNamespace({
  namespace,
  mode: "soft",
  reason: "full re-extraction for schema v4 migration"
});

// 3. Replay conversation history through new extractor
const sessions = await yourConversationStore.getAllSessions({ namespace });
for (const session of sessions) {
  await recall.remember({
    turns: session.turns,
    scope: { ...session.scope, namespace }
  });
}

// 4. Compare quality
const after = await recall.stats({ namespace });
console.log("Before:", before.retrieval_precision_at_5);
console.log("After:", after.retrieval_precision_at_5);The risks: If your conversation history is large, re-extraction takes time and costs LLM tokens — at roughly 25 in LLM fees. More importantly, you can't re-extract if you don't have the conversation history. Recall stores provenance (source turn IDs) but not the turn content itself — that lives in your system. If you've pruned old conversations, full re-extraction is not available.
Rolling back an extractor upgrade
The safe rollback path:
Step 1: Revert schema_version in your config and redeploy. New writes immediately resume with the old schema. This is instant.
Step 2: Soft-delete memories produced by the new schema version:
const newVersionMemories = await recall.listMemories({
  scope,
  filter: { schema_version: 4 }
});
for (const mem of newVersionMemories) {
  await recall.forget({
    memory_id: mem.id,
    scope,
    reason: "rollback: schema v4 regression"
  });
}

Or by trace ID if the new schema was deployed for a known time window:
const traces = await recall.listTraces({
  namespace,
  since: deploymentTime,
  until: rollbackTime,
  operation: "remember"
});
for (const trace of traces) {
  await recall.forgetByTrace({ trace_id: trace.id, scope });
}

Step 3: The v3 memories are still in place. They were never touched. Retrieval returns to the v3 population immediately after the soft-delete completes.
Rollback is safe because Recall uses soft delete, not hard delete. The v4 memories are marked deleted_at in the database and excluded from retrieval, but they exist in the audit log. If you later determine the rollback was unnecessary, you can restore:
await recall.restoreByVersion({ schema_version: 4, scope });

The decision table
| Change type | Risk | Recommended strategy |
|---|---|---|
| Prompt wording changes, same schema | Low | Coexist |
| New predicate added | Low | Coexist |
| Model upgrade, same prompt | Medium | Progressive rollout |
| Type classification rules changed | Medium | Progressive rollout |
| Predicate renamed or removed | High | Progressive rollout |
| Entity naming convention changed | High | Full re-extraction |
| Fundamental type structure changed | High | Full re-extraction |
When in doubt, use progressive rollout. The 10% overhead during the rollout window is almost always cheaper than investigating a subtle quality regression three weeks later.
The schema_version field on every memory is there so you always know exactly what generated any given memory — not just "which model" but "which configuration of which model on which date." That answer is the starting point for every migration decision.
Testing an extractor upgrade before rollout
Before committing to any migration strategy, test the new extractor configuration against a representative sample of historical conversation data. Choosing a strategy based on a gut read of the config diff is how teams end up running full re-extractions they didn't budget for. A structured pre-flight test tells you which strategy is actually appropriate.
The testing workflow uses rememberDry, a dry-run variant of the write pipeline that runs every stage (pre-filter → extract → resolve_refs → dedupe) but does not persist anything to the database. It returns the candidates that would have been stored, allowing you to compare extraction behavior across configurations without touching production data:
// Load a test dataset: 50 conversations with known "golden" memories
const testConversations = await loadTestDataset("extractor_test_v4");

const results = {
  v3: { precision: 0, recall: 0, extraction_count: 0, junk_rate: 0 },
  v4: { precision: 0, recall: 0, extraction_count: 0, junk_rate: 0 }
};

for (const conversation of testConversations) {
  // Process with v3 config
  const v3Result = await recall.rememberDry({
    turns: conversation.turns,
    scope: conversation.scope,
    extractor_override: { schema_version: 3, model: "claude-haiku-4-5" }
  });

  // Process with v4 config
  const v4Result = await recall.rememberDry({
    turns: conversation.turns,
    scope: conversation.scope,
    extractor_override: { schema_version: 4, model: "claude-haiku-4-5", custom_instructions: "..." }
  });

  // Compare against golden memories
  results.v3.precision += await computePrecision(v3Result.candidates, conversation.golden_memories);
  results.v4.precision += await computePrecision(v4Result.candidates, conversation.golden_memories);
}

// Average the accumulated per-conversation scores
results.v3.precision /= testConversations.length;
results.v4.precision /= testConversations.length;

The golden memory set is a manually curated list of facts that should be extracted from each conversation — the ground truth for what a good extractor does. Building this set takes time: expect to spend two to three hours reviewing fifty conversations with a domain expert marking which facts are extractable. That investment pays for itself the first time it catches a regression before it ships.
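The computePrecision helper in the loop above is evaluation code you own, not a Recall API. A minimal sketch, assuming a hypothetical llmJudge() that implements the 1-5 rating described under Precision below:

// llmJudge is hypothetical: it asks a strong model to rate one candidate
// 1-5 given the golden memories as context.
declare function llmJudge(
  content: string,
  golden: { content: string }[]
): Promise<number>;

async function computePrecision(
  candidates: { content: string }[],
  goldenMemories: { content: string }[]
): Promise<number> {
  if (candidates.length === 0) return 0;
  let useful = 0;
  for (const candidate of candidates) {
    const rating = await llmJudge(candidate.content, goldenMemories); // 1-5 scale
    if (rating >= 4) useful++; // ratings 4-5 count as "useful"
  }
  return useful / candidates.length; // fraction of extracted memories that are useful
}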
Four metrics determine whether to proceed with the upgrade:
Precision — what fraction of the extracted memories are genuinely useful facts? Compute this by having an LLM judge (Claude Opus works well here) rate each extracted memory on a 1–5 scale, then treating ratings 4–5 as "useful." Precision is the fraction of useful memories in the extracted set. A drop in precision between v3 and v4 means the new extractor is generating more junk, which will accumulate in the namespace and distort retrieval ranking.
Recall — what fraction of the known-important facts from the golden set were captured? Recall is measured against the golden memories: for each golden fact, did the extractor produce a memory that covers the same information (allowing paraphrase)? Low recall means the new schema's type rules are more restrictive than intended, or the new model is less willing to extract borderline candidates. Borderline cases are often where extraction matters most.
Extraction count — did the new version extract more or fewer memories per turn? More is not always better. A v4 extractor that produces three times as many memories per turn as v3 is almost certainly extracting noise. A rough heuristic: if the extraction count per turn increased by more than 40%, investigate before proceeding.
Junk rate — junk memories are memories that are syntactically valid extractions but semantically worthless: memories about the assistant's own behavior ("the user said hello"), over-broad generalizations, and memories with no retrievable context. Run the LLM judge on a random sample of 100 candidates per version and compare junk rates. Even a 5-percentage-point increase in junk rate compounds over time in an active namespace.
If v4 precision is higher with similar or better recall, proceed to progressive rollout. If v4 recall is lower, investigate whether the new schema's type rules are more restrictive than intended — often a single overly narrow type constraint is blocking a category of valid extractions that the previous schema captured implicitly.
Predicate vocabulary drift across versions
One of the subtler migration problems has nothing to do with which facts are extracted and everything to do with how extracted facts are labeled. An extractor that used snake_case predicates (works_at, reports_to) may, after an upgrade to a new model, start producing camelCase or verb-phrase predicates (worksAt, is_manager_of). The entity graph uses predicates as edge labels; a works_at edge and a worksAt edge are structurally different even if semantically identical. Queries that filter by predicate return different results depending on which version produced the memory.
This is predicate vocabulary drift: the set of edge label strings that the extractor uses changes across versions even when the underlying semantic intent stays the same. It happens because LLMs don't have hard constraints on output format unless the prompt enforces them explicitly — and prompts that didn't pin predicate formatting under v3 may produce different capitalization conventions under v4's base model.
Measure vocabulary drift before any schema upgrade reaches production:
const v3Stats = await recall.predicateStats({
  namespace,
  filter: { schema_version: 3 },
  top_k: 50
});
// { "works_at": 1847, "reports_to": 923, "prefers": 412, ... }

const v4Stats = await recall.predicateStats({
  namespace,
  filter: { schema_version: 4 },
  top_k: 50
});

// Compute Jaccard between top-50 predicate vocabularies
const v3Predicates = new Set(Object.keys(v3Stats));
const v4Predicates = new Set(Object.keys(v4Stats));
const intersection = new Set([...v3Predicates].filter(p => v4Predicates.has(p)));
const jaccard = intersection.size / (v3Predicates.size + v4Predicates.size - intersection.size);

console.log(`Predicate Jaccard: ${jaccard.toFixed(3)}`);
// Above 0.8: minimal vocabulary drift, coexist is safe
// 0.5-0.8: noticeable drift, add predicate normalization
// Below 0.5: significant break, progressive rollout required

The Jaccard coefficient over the top-50 predicate vocabularies is a quick proxy for migration risk. A Jaccard above 0.8 means the two versions are using nearly the same edge labels — the coexist strategy is safe, and cross-version queries will return consistent results. Between 0.5 and 0.8, there's noticeable drift: some important predicates have changed form. Below 0.5, the vocabulary has effectively changed; memories from the two versions speak different structural languages, and the progressive rollout strategy (or full re-extraction) is needed.
When Jaccard is below 0.8, add predicate normalization to your write pipeline configuration before the new schema reaches production:
write_pipeline:
  predicate_normalization:
    enabled: true
    mappings:
      - from: ["worksAt", "works_for", "is_employed_at"]
        to: "works_at"
      - from: ["reports_to", "reportsTo", "managed_by"]
        to: "reports_to"

Normalization applies at write time: the raw predicate produced by the extractor is preserved in the raw_predicate field on the memory, and the canonical form is stored in the indexed predicate field. Retrieval queries use the canonical form, so filter: { predicate: "works_at" } returns memories from both v3 and v4 regardless of which surface form the extractor produced. Display code uses raw_predicate if it exists, falling back to predicate, so the UI shows what the extractor actually said while the data layer remains consistent.
One operational note: predicate normalization mappings are write-time-only. They don't retroactively normalize memories already in the store. If you deploy normalization after v4 has been writing un-normalized predicates for a week, you'll have a window of v4 memories with un-normalized predicate values. The fix is a one-time migration script that reads memories with un-normalized predicates and re-writes the predicate field using the normalization table.
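A minimal sketch of that backfill, assuming listMemories accepts a predicate filter; updateMemory is a hypothetical admin-level call, so substitute whatever your deployment exposes for in-place field rewrites:

// One-time backfill: rewrite un-normalized v4 predicates to canonical form.
// The mapping table mirrors the predicate_normalization config above.
const mappings: Record<string, string> = {
  worksAt: "works_at",
  works_for: "works_at",
  is_employed_at: "works_at",
  reportsTo: "reports_to",
  managed_by: "reports_to"
};
for (const [rawForm, canonical] of Object.entries(mappings)) {
  const stale = await recall.listMemories({
    scope,
    filter: { schema_version: 4, predicate: rawForm }
  });
  for (const mem of stale) {
    await recall.updateMemory({ // hypothetical admin API
      memory_id: mem.id,
      scope,
      set: { predicate: canonical, raw_predicate: rawForm }
    });
  }
}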
Confidence calibration drift
A model upgrade does not only change what gets extracted — it changes how confident the extractor claims to be about each extraction. Confidence calibration is the relationship between the extractor's stated confidence and the actual probability that the extraction is correct. A well-calibrated extractor at confidence 0.75 is correct about 75% of the time. A poorly calibrated one might be correct 95% of the time at stated confidence 0.75 — or 50% of the time.
Model upgrades frequently shift calibration. A new base model checkpoint may be more conservative with confidence (most extractions cluster in the 0.4–0.65 range), while the previous checkpoint was more generous (clustering in 0.7–0.85). The practical consequence: retrieval thresholds tuned for the old calibration are now wrong. If your retrieval configuration filters out memories below confidence 0.6 and the new model marks everything below 0.7, you're discarding memories the new model considers borderline but correct — memories that would have surfaced under the old configuration.
Detect calibration drift by comparing confidence distributions across versions:
const v3Confidences = await recall.confidenceHistogram({
  namespace,
  filter: { schema_version: 3 },
  buckets: [0, 0.3, 0.5, 0.7, 0.85, 1.0]
});
// { "0-0.3": 812, "0.3-0.5": 2341, "0.5-0.7": 4892, "0.7-0.85": 3112, "0.85-1.0": 1205 }

const v4Confidences = await recall.confidenceHistogram({
  namespace,
  filter: { schema_version: 4 },
  buckets: [0, 0.3, 0.5, 0.7, 0.85, 1.0]
});

Compare the distributions visually: if the v4 histogram has substantially more mass in the high-confidence buckets (0.7–1.0), the new model is more generous. If mass has shifted toward the low-confidence buckets (0–0.5), the new model is more conservative. Either shift is a signal that your retrieval threshold needs adjustment — or that a per-version confidence scaling factor is needed.
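For a first-cut scaling factor, you can go beyond the visual comparison and take the ratio of mean stated confidence, approximating each memory by its bucket midpoint. A sketch against the histogram shape shown above:

// Rough estimate of a scaling factor: ratio of mean stated confidence,
// using bucket midpoints as a stand-in for the raw per-memory scores.
function meanConfidence(histogram: Record<string, number>): number {
  let total = 0;
  let weighted = 0;
  for (const [bucket, count] of Object.entries(histogram)) {
    const [lo, hi] = bucket.split("-").map(Number); // "0.3-0.5" -> [0.3, 0.5]
    weighted += ((lo + hi) / 2) * count;
    total += count;
  }
  return weighted / total;
}

const factor = meanConfidence(v3Confidences) / meanConfidence(v4Confidences);
console.log(`Suggested confidence_scaling.factor: ${factor.toFixed(2)}`);
// Below 1.0: v4 is more generous than v3, scale it down.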
A per-version scaling factor is a blunt but effective correction. Apply it in your write pipeline configuration for the new schema version:
write_pipeline:
  extractor:
    schema_version: 4
    confidence_scaling:
      enabled: true
      factor: 0.92 # scale down v4 confidence by 8% to match v3 calibration

The scaling factor multiplies the extractor's raw confidence before it's stored. A factor of 0.92 shifts a raw score of 0.85 down to 0.782, bringing the v4 distribution closer to the v3 distribution without changing the retrieval threshold. The factor applies to new writes only — existing v4 memories in the store keep their original stored confidence. If you've already accumulated a large body of v4 memories with inflated confidence, a one-time backfill is needed to apply the correction to existing records.
The ideal approach is proper confidence recalibration using ground-truth labels: sample 200 memories from each version, have a human annotator or LLM judge rate whether each is correct, and fit an isotonic regression to map stated confidence to empirical accuracy. This gives you a calibration curve rather than a single scaling factor. For most production situations, the simple linear scale is sufficient and can be deployed in hours rather than days.
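If you do want the full curve, pool-adjacent-violators is the standard isotonic fit and is short enough to write by hand. A sketch, assuming you've collected (stated confidence, judged correct) pairs from the per-version samples:

// Pool Adjacent Violators: fit a monotone step function mapping stated
// confidence to empirical accuracy from labeled samples.
interface Sample { confidence: number; correct: boolean; }

function isotonicCalibration(samples: Sample[]): (c: number) => number {
  const sorted = [...samples].sort((a, b) => a.confidence - b.confidence);
  const blocks: { mean: number; weight: number; maxConf: number }[] = [];
  for (const s of sorted) {
    blocks.push({ mean: s.correct ? 1 : 0, weight: 1, maxConf: s.confidence });
    // Merge backward whenever monotonicity is violated
    while (blocks.length > 1 && blocks[blocks.length - 2].mean >= blocks[blocks.length - 1].mean) {
      const right = blocks.pop()!;
      const left = blocks.pop()!;
      blocks.push({
        mean: (left.mean * left.weight + right.mean * right.weight) / (left.weight + right.weight),
        weight: left.weight + right.weight,
        maxConf: right.maxConf
      });
    }
  }
  // Calibrated accuracy for a stated confidence = mean of its block
  return (c: number) => {
    for (const b of blocks) if (c <= b.maxConf) return b.mean;
    return blocks[blocks.length - 1].mean;
  };
}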
Operating metrics to watch during a migration
A migration that looks clean from the configuration side can still produce silent quality regressions in the data layer. The schema_version field and the migration strategy are inputs; what actually happened to your memory store is measured in metrics. Set up dashboards for the following metrics during any migration window and keep them visible for at least 14 days after full cutover.
recall_writes_by_schema_version — the most direct indicator of cutover progress. This metric should show a clean step function: v3 writes dropping to zero and v4 writes replacing them at the deployment timestamp. If v3 writes continue after the cutover, some write paths didn't update — the most common cause is a configuration deployed to only a subset of instances in a multi-instance setup. Check your deployment tooling for partial rollouts.
recall_extraction_json_fail_rate — the fraction of extraction attempts where the LLM didn't produce valid JSON. The extraction pipeline asks the LLM to respond with a structured JSON object; when the response is malformed, the extraction fails and the turn is skipped. A spike after a schema version bump usually means the extraction prompt changed in a way that broke the output format contract. The new prompt may have been longer than the model's effective instruction-following window, or a new custom_instructions block conflicts with the output format specification. Roll back immediately when this metric spikes above 2% — every failed extraction is a missed memory.
recall_retrieval_precision_by_version — if you have a continuous evaluation framework (sample queries scored by an LLM judge on a daily basis), watch per-version precision during the progressive rollout period. The 7-day comparison window in the progressive rollout config is specifically designed to capture this metric across a representative slice of query traffic. A downward trend in v4 precision is an early rollback signal — don't wait for it to bottom out before acting.
recall_dedup_collapse_rate — the fraction of new memories that merged into existing ones rather than creating new records. This metric has two failure modes after a schema upgrade: a spike (collapse rate much higher than baseline) means the new extractor is producing memories that are semantically equivalent to things already in the store — it's re-extracting facts the namespace already has, wasting tokens and producing no new information. A drop (collapse rate much lower than baseline) means the new extractor's vocabulary is diverging enough from existing memories that deduplication is failing to recognize matches — previously identical facts are being stored as separate memories under different surface forms. Both outcomes warrant investigation before full cutover.
recall_entity_resolution_fail_rate — the write pipeline resolves each extracted entity reference through a 4-stage cascade: exact match → alias match → fuzzy match → UUID v5 fallback. The fallback stage creates a new entity node keyed on a hash of the raw string — it's a last resort that generates entity fragmentation when the same person appears under multiple surface forms. A spike in the fail rate after a schema upgrade means the new extractor is producing entity references that don't match the existing entity table. This creates two nodes for the same person: ent_priya_XYZ (from historical extractions) and ent_dr_priya_sharma_ABC (from the new extractor that started adding "Dr." prefixes). Cross-session entity queries return incomplete results until the entities are merged. Watch this metric closely and run entity deduplication if the fail rate climbs above 1%.
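What that entity deduplication pass might look like, as a rough sketch: listEntities and mergeEntities are assumptions here, and the normalization is deliberately crude, so treat the output as merge candidates for review rather than automatic merges.

// Detect likely-fragmented entities by grouping on a normalized surface form.
const entities = await recall.listEntities({ namespace }); // hypothetical API
const groups = new Map<string, string[]>();
for (const ent of entities) {
  // Strip honorifics and punctuation: "Dr. Priya Sharma" -> "priya sharma"
  const key = ent.display_name
    .toLowerCase()
    .replace(/\b(dr|mr|ms|mrs|prof)\.?\s+/g, "")
    .replace(/[^a-z0-9\s]/g, "")
    .trim();
  groups.set(key, [...(groups.get(key) ?? []), ent.id]);
}
for (const [key, ids] of groups) {
  if (ids.length > 1) {
    console.log(`Possible fragmentation for "${key}": ${ids.join(", ")}`);
    // Review before merging — normalization can over-group distinct people.
    // await recall.mergeEntities({ namespace, keep: ids[0], merge: ids.slice(1) });
  }
}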