The Background Worker: 7 Maintenance Jobs
Memory is not write-once-read-many — it's write-some, read-many, and degrade-quietly. The background worker is what keeps a memory store healthy in production. Each job is small; their absence is what's expensive.
The seven jobs
- 1. Freshness decay (hourly). Apply exponential decay; flip the retrievable bit when a memory falls below the floor. Cheap; runs on a sliding window of recently-modified memories.
- 2. Consolidation (every 6 hours). Find near-duplicates that escaped write-time dedup; propose merges via LLM judge; flatten supersession chains longer than 2.
- 3. Drift scan (weekly). Concept drift detection on entities updated in the last 30 days. Surface candidates for human review.
- 4. Consistency check (daily). Cross-memory contradiction detection: pairs of memories that disagree about the same predicate but neither is superseded. Costly per pair; runs off-peak.
- 5. Embedding refresh (monthly). Re-embed memories whose content has been updated, or all memories when the embedding model is updated. Most expensive job; run in batches with progress tracking.
- 6. Garbage collect (weekly). Hard-delete memories that have been in the forgotten state past the retention window. Reclaim index space.
- 7. Snapshot (daily). Point-in-time backup. Critical for incident recovery and historical drift analysis.
Picking cadences
Cheap-and-frequent or expensive-and-rare. Avoid the middle ground (medium cost, daily), where the cumulative impact grows fastest.
- Hourly: latency-sensitive things (the retrievable flag).
- Daily: low-cost janitorial work.
- Weekly: expensive but bounded jobs.
- Monthly: very expensive jobs that don't need fresher data.
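As a concrete shape for this policy, the job-to-cadence mapping can be written down directly; the enum and helper below are an illustrative sketch, not the system's actual types:

use std::time::Duration;

// Hypothetical job/cadence mapping mirroring the guidance above.
#[derive(Clone, Copy, Debug)]
enum Job {
    FreshnessDecay,
    Consolidation,
    ConsistencyCheck,
    DriftScan,
    EmbeddingRefresh,
    GarbageCollect,
    Snapshot,
}

fn cadence(job: Job) -> Duration {
    const HOUR: u64 = 3_600;
    const DAY: u64 = 24 * HOUR;
    match job {
        Job::FreshnessDecay => Duration::from_secs(HOUR),        // latency-sensitive
        Job::Consolidation => Duration::from_secs(6 * HOUR),     // 4 small runs per day
        Job::ConsistencyCheck | Job::Snapshot => Duration::from_secs(DAY),
        Job::DriftScan | Job::GarbageCollect => Duration::from_secs(7 * DAY),
        Job::EmbeddingRefresh => Duration::from_secs(30 * DAY),  // very expensive
    }
}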
Observability
Each job emits at least:
- Items processed.
- Items modified.
- Latency p50/p99.
- Failures (with reasons).
- Cost (LLM calls, compute time).

The "items modified" trend is the most diagnostic. A consolidation job that suddenly modifies 10× more memories than usual probably means upstream extraction got noisier. A drift scan that suddenly fires on 50 entities means a real shift happened — investigate.
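A minimal sketch of the per-run emission, with assumed field names:

use std::time::Duration;

// Illustrative per-run metrics record; the field names are assumptions.
struct JobRunMetrics {
    job_type: &'static str,
    items_processed: u64,
    items_modified: u64,          // the most diagnostic trend
    latency_p50: Duration,
    latency_p99: Duration,
    failures: Vec<(String, u64)>, // (reason, count)
    llm_calls: u64,               // cost proxy: judge calls made
    compute_time: Duration,       // cost proxy: wall-clock work
}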
Don't block writes
Background jobs operate on copies or with lock-free reads where possible. A consolidation job that holds a write lock for 10 minutes is a 10-minute partial outage. Use snapshots + reconciliation patterns; never long write locks.
Failure isolation
One job's failure should not cascade. Each runs in its own process or worker pool with explicit timeouts. The drift scan failing should not delay the freshness decay; the embedding refresh stalling should not block consolidation.
Concurrent execution groups
The jobs run in three groups based on shared-state constraints:
- Group 1 (concurrent): decay + prune + prune_obs + idempotency_sweep — no shared write state, safe to run simultaneously.
- Group 2 (sequential): consolidate → consistency_scan — must not run concurrently because both modify the entity graph and supersession chains. A race here can produce orphaned supersession pointers.
- Group 3 (alone): drift — a read-heavy weekly scan that takes long enough that running anything else concurrently risks lock contention.
The daily schedule for a typical namespace:
03:00 Consistency scan (~2 min)
04:00 Prune (~5 min)
06:00 Consolidation run 1 (~1 min)
12:00 Consolidation run 2
18:00 Consolidation run 3
00:00 Consolidation run 4
Weekly (Sun 02:00): Drift scan

Freshness decay runs at query time, not as a scheduled job — each retrieval applies the decay formula in-flight. The background worker updates half-life recommendations based on observed retrieval patterns, but the decay computation itself has no worker overhead.
Job locking
Per-namespace locks prevent concurrent runs of the same job across multiple worker instances:
INSERT INTO worker_locks (namespace_id, job_type, locked_at, locked_by, expires_at)
VALUES ($ns, $job, now(), $worker_id, now() + interval '10 minutes')
ON CONFLICT (namespace_id, job_type) DO NOTHING;

If the INSERT produces 0 rows, another worker holds the lock and this instance skips the
run. Locks auto-expire at the expires_at timestamp to prevent stuck jobs from blocking
indefinitely. If a job panics, a LockReleaseGuard RAII struct fires Drop and releases
the lock immediately using a fresh single-threaded runtime, so a panic never leaves the
lock stranded.
On graceful shutdown, jobs check a CancellationToken at batch boundaries. The drift job
checks at each namespace boundary (worst-case shutdown delay: ~5 seconds per namespace).
Consolidate and consistency scan check at LLM call boundaries (worst-case: ~30 seconds).
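A sketch of the release guard described above, assuming sqlx and Tokio; the struct shape and SQL are illustrative:

// Illustrative RAII guard: releases the worker lock on Drop, including
// during panic unwinding. Table and column names follow the SQL above.
struct LockReleaseGuard {
    pool: sqlx::PgPool,
    namespace_id: i64,
    job_type: String,
}

impl Drop for LockReleaseGuard {
    fn drop(&mut self) {
        let pool = self.pool.clone();
        let (ns, job) = (self.namespace_id, self.job_type.clone());
        // Drop cannot .await, and block_on would panic on an async worker
        // thread, so run the release on a short-lived OS thread that owns
        // a fresh single-threaded runtime.
        std::thread::spawn(move || {
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .expect("lock-release runtime");
            rt.block_on(async {
                let _ = sqlx::query(
                    "UPDATE worker_locks SET released_at = now() \
                     WHERE namespace_id = $1 AND job_type = $2 AND released_at IS NULL",
                )
                .bind(ns)
                .bind(&job)
                .execute(&pool)
                .await;
            });
        })
        .join()
        .ok();
    }
}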
Incremental processing and resumption
For namespaces with >100K memories, jobs process in batches and checkpoint progress:
struct JobState {
    namespace_id: NamespaceId,
    job_type: JobType,
    last_processed_memory_id: Option<MemoryId>,
    total_processed: u64,
    batch_size: usize, // default 1,000 for decay, one predicate group for consolidate
}

State is persisted to the database after each batch. On restart after a crash, the job resumes from the last checkpoint rather than re-scanning from zero. Worst-case re-processing is one batch — 1,000 records for decay, one predicate group for consolidation.
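The resume loop is small; the sketch below assumes hypothetical fetch_batch, process_batch, and save_state helpers:

// Illustrative resumable batch loop; `fetch_batch`, `process_batch`, and
// `save_state` are hypothetical helpers, not the system's real API.
async fn run_incremental(pool: &sqlx::PgPool, mut state: JobState) -> anyhow::Result<()> {
    loop {
        // Fetch the next batch strictly after the checkpoint.
        let batch = fetch_batch(pool, &state).await?;
        if batch.is_empty() {
            break; // namespace fully processed for this run
        }
        process_batch(pool, &batch).await?;
        // Advance and persist the checkpoint only after the batch commits,
        // so a crash re-processes at most one batch.
        state.last_processed_memory_id = batch.last().map(|m| m.id);
        state.total_processed += batch.len() as u64;
        save_state(pool, &state).await?;
    }
    Ok(())
}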
Consolidation: how it actually merges
The consolidation algorithm uses graph clustering, not a simple pairwise comparison:
- Find pairs of memories with similarity > 0.75 and the same (subject, predicate, type).
- Union-find clustering: if A-B are a pair and B-C are a pair, then {A, B, C} is one cluster.
- For each cluster: one LLM judge call to determine if they're equivalent or distinct.
- If equivalent: update canonical content to the LLM's suggested phrasing, set access_count = sum(all counts), union source turn IDs, mark others as superseded by canonical.
- Re-embed the canonical (content changed).
Canonical selection priority: highest confidence → most accesses → most recent.
At 10K memories with ~5% requiring consolidation: under $0.01 and under 5 seconds per run.
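The canonical selection priority translates directly into a sort key; a sketch with assumed field names:

// Illustrative canonical selection: highest confidence, then most accesses,
// then most recent. Field names are assumptions.
fn pick_canonical(cluster: &[Memory]) -> &Memory {
    cluster
        .iter()
        .max_by(|a, b| {
            a.confidence
                .partial_cmp(&b.confidence)
                .unwrap_or(std::cmp::Ordering::Equal)
                .then(a.access_count.cmp(&b.access_count))
                .then(a.created_at.cmp(&b.created_at)) // newest wins the final tie
        })
        .expect("cluster is non-empty")
}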
Prune eligibility criteria
A memory is prune-eligible when ALL of:
- Age > 365 days (configurable)
- Last accessed > 180 days ago
- Freshness × access_boost < 0.1
- Either superseded AND older than 1 year, OR access_count = 0
- No active relation references it via evidence_memory_ids
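Expressed as a predicate (field names are assumptions; freshness and access_boost are the query-time values described in the freshness sections below, not stored columns):

// Illustrative eligibility check mirroring the criteria above.
fn is_prune_eligible(
    m: &Memory,
    now: chrono::DateTime<chrono::Utc>,
    freshness: f32,
    access_boost: f32,
    referenced: bool, // any active relation via evidence_memory_ids
) -> bool {
    use chrono::Duration;
    let age = now - m.created_at;
    let idle = now - m.last_accessed_at;
    age > Duration::days(365)        // configurable
        && idle > Duration::days(180)
        && freshness * access_boost < 0.1
        && ((m.superseded_by.is_some() && age > Duration::days(365)) || m.access_count == 0)
        && !referenced
}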
Before hard deletion, a snapshot is written to prune_log. Restore is possible within the
grace period. For GDPR hard-delete requests, even the prune_log snapshot is scrubbed after
the jurisdiction-specific legal retention window (default 90 days).
Job 1: Freshness decay in depth
The summary above says freshness decay runs hourly. The detail matters: decay is not actually computed in a scheduled batch job. It is computed at retrieval time, on the memories being retrieved, using values already present in the row. The background job's role is narrower and cheaper than it might appear.
Query-time computation
At retrieval, each retrieved memory's freshness is computed inline:

freshness = 2^(-t / τ)

where t = days since last_accessed_at, and τ = the type-specific half-life for that
memory's predicate. Preference memories default to τ = 90 days, event memories to τ = 30
days, fact memories to τ = 180 days, and relationship memories to τ = 365 days.
This computation is cheap for two reasons. First, it only runs for the memories being retrieved
in the current query — not for the full store. A namespace with 500K memories triggers decay
computation for at most the top-K retrieval candidates (typically 20–50). Second, it requires
no database calls beyond what retrieval already does — last_accessed_at is already on the
row, and the half-life table is cached in the worker process.
What the hourly background job actually does
The background job has two responsibilities, neither of which is computing freshness:
Responsibility 1 — update the retrievable flag. When freshness drops below the floor
(0.1), the memory must be flipped to retrievable = false so it stops appearing in retrieval
results. This cannot happen lazily at query time because a memory that hasn't been queried
recently may have crossed the floor without any retrieval event to trigger the flip.
The hourly scan is bounded by last_accessed_at > now() - (τ × 4). Beyond 4 half-lives,
freshness is below 0.0625 (floor is 0.1), and the memory was already flipped in a prior scan.
The 4× bound keeps the hourly scan small even for large namespaces — at τ = 90 days, the
window is 360 days, which sounds large, but the retrievable = false flip happens on first
crossing and subsequent scans skip already-flipped memories via a simple filter.
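A sketch of that bounded scan, assuming sqlx/Postgres and a denormalized half_life_days column (an illustrative schema choice, not the actual table definition):

// Hourly retrievable-flip scan: only rows inside the 4-half-life window can
// cross the floor for the first time; older rows were flipped earlier.
async fn flip_stale_memories(pool: &sqlx::PgPool, ns: i64) -> anyhow::Result<u64> {
    let result = sqlx::query(
        "UPDATE memories
         SET retrievable = false
         WHERE namespace_id = $1
           AND retrievable = true               -- skip already-flipped rows
           AND last_accessed_at > now() - (half_life_days * 4) * interval '1 day'
           AND power(2.0, -extract(epoch FROM now() - last_accessed_at)
                           / 86400.0 / half_life_days)
               * least(1.0 + ln(1.0 + access_count), 3.0) < 0.1",
    )
    .bind(ns)
    .execute(pool)
    .await?;
    Ok(result.rows_affected())
}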
Responsibility 2 — flush batched access increments. When a memory is retrieved, its
access_count should be incremented and last_accessed_at updated. But doing a database
write per retrieval causes write amplification under heavy load — a hot memory retrieved 500
times per hour would generate 500 writes per hour to that single row.
Instead, retrievals write to an in-process buffer (a counter per memory ID, held in memory). The hourly job flushes the buffer to the database in a single batched UPDATE per namespace. Worst-case staleness of access counts: 1 hour, which is acceptable — freshness decay over one hour at τ = 90 days is negligible, and access_boost changes of this magnitude do not affect retrieval ranking materially.
This architecture prevents hot memories from causing database write amplification under heavy retrieval load, which is the dominant operational failure mode for naive decay implementations.
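A minimal sketch of the buffer-and-flush pattern, assuming sqlx/Postgres and illustrative names:

use std::collections::HashMap;
use std::sync::Mutex;

// In-process buffer of pending access increments; memory_id -> count.
#[derive(Default)]
struct AccessBuffer {
    counts: Mutex<HashMap<i64, u64>>,
}

impl AccessBuffer {
    fn record(&self, memory_id: i64) {
        *self.counts.lock().unwrap().entry(memory_id).or_insert(0) += 1;
    }

    // Hourly flush: one batched UPDATE per chunk instead of one write per
    // retrieval. last_accessed_at is approximated by flush time, which is
    // within one flush interval of the true retrieval time.
    async fn flush(&self, pool: &sqlx::PgPool) -> anyhow::Result<()> {
        let drained: Vec<(i64, u64)> = self.counts.lock().unwrap().drain().collect();
        for chunk in drained.chunks(1_000) {
            let ids: Vec<i64> = chunk.iter().map(|(id, _)| *id).collect();
            let incs: Vec<i64> = chunk.iter().map(|(_, n)| *n as i64).collect();
            sqlx::query(
                "UPDATE memories m
                 SET access_count = m.access_count + u.inc,
                     last_accessed_at = now()
                 FROM unnest($1::bigint[], $2::bigint[]) AS u(id, inc)
                 WHERE m.id = u.id",
            )
            .bind(&ids)
            .bind(&incs)
            .execute(pool)
            .await?;
        }
        Ok(())
    }
}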
Job 2: Consolidation algorithm in depth
The high-level description of consolidation covers the four-phase structure. The details of each phase are where implementation choices matter.
Phase 1 — candidate pair detection
Finding near-duplicate pairs via embedding similarity requires a join against itself, which
is expensive at scale. The (subject, predicate, type) constraint is the key optimization —
it reduces the join space from all memories × all memories to same-predicate memories × same-predicate
memories. In practice, most predicates have few memories per subject, keeping the join tractable.
SELECT a.id, b.id, dot_product(a.embedding, b.embedding) AS similarity
FROM memories a
JOIN memories b ON a.subject = b.subject AND a.predicate = b.predicate
AND a.type = b.type AND a.id < b.id
WHERE a.deleted_at IS NULL AND b.deleted_at IS NULL
AND a.schema_version = b.schema_version
AND dot_product(a.embedding, b.embedding) > 0.75;

The a.id < b.id constraint prevents duplicate pairs (A-B and B-A). The schema_version
constraint avoids comparing memories across schema migrations, where embedding spaces may have
shifted. At 10K memories: 10K² / 2 = 50M theoretical pairs, but with the same-subject and
same-predicate constraint, typically 100–500 candidate pairs in practice.
Phase 2 — union-find clustering
Pairwise similarity is not transitive. If A-B have similarity 0.80 and B-C have similarity 0.78, but A-C have similarity 0.68 (below threshold), a naive pairwise approach would only merge A-B and B-C separately, potentially leaving B in an ambiguous state.
Union-find resolves this: A, B, and C are placed in the same cluster because the A-B and B-C edges connect them, regardless of the A-C edge. Union-find runs in O(pairs × α(N)) — effectively linear — on the candidate pair set. Clusters rarely exceed 3–4 memories in practice. A cluster of 10+ memories is a signal that a predicate is being over-extracted upstream; investigate the extraction model.
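The consolidation code shown later calls UnionFind::new, union, and components; a compact implementation sketch (path compression, no union by rank):

struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }

    fn find(&mut self, x: usize) -> usize {
        let p = self.parent[x];
        if p != x {
            let root = self.find(p);
            self.parent[x] = root; // path compression
            root
        } else {
            x
        }
    }

    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }

    // Group indices by root; each group is one cluster.
    fn components(&mut self) -> Vec<Vec<usize>> {
        let mut groups: std::collections::HashMap<usize, Vec<usize>> =
            std::collections::HashMap::new();
        for i in 0..self.parent.len() {
            let root = self.find(i);
            groups.entry(root).or_default().push(i);
        }
        groups.into_values().collect()
    }
}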
Phase 3 — LLM judgment per cluster
One LLM call per cluster, not per pair. The prompt presents all memories in the cluster with their content, confidence scores, and timestamps, and asks: "Are these memories saying the same thing, or do they represent distinct facts?"
Returns one of: Equivalent (merge), Distinct (false positive — do not merge). The judge
also returns a canonical_content suggestion for the merged memory when Equivalent is returned
— this is the rewritten phrasing that best captures the shared meaning across cluster members.
At 50 clusters with one Haiku call each: approximately $0.01 per run.
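The judge's response can be modeled as a small tagged enum; the JSON shape below is an assumption, not the actual prompt contract:

use serde::Deserialize;

// Illustrative verdict type for the per-cluster judge call.
#[derive(Deserialize)]
#[serde(tag = "verdict", rename_all = "snake_case")]
enum JudgeVerdict {
    // Merge the cluster; `canonical_content` is the judge's preferred phrasing.
    Equivalent { canonical_content: String },
    // False positive from cosine similarity; leave members untouched.
    Distinct,
}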
Phase 4 — merge execution
For Equivalent clusters, the canonical memory is selected by priority (highest confidence →
most accesses → newest). The canonical's content is updated to the LLM's suggested phrasing.
access_count is set to the sum of all cluster members' counts — retrieval weighting should
reflect total historical access, not just the canonical's individual history. source_turn_ids
are unioned — provenance is preserved across all merged members.
Non-canonical memories are marked superseded_by = canonical_id, not deleted. This preserves
audit trails and allows the supersession chain to be traversed for debugging. Re-embedding the
canonical is mandatory because its content changed; re-embedding non-canonicals is skipped
since they are no longer retrievable.
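A sketch of the merge step, reusing the pick_canonical sketch from earlier; the store methods are assumptions:

// Illustrative merge execution for an Equivalent cluster.
async fn execute_merge(
    store: &MemoryStore,
    cluster: &[Memory],
    canonical_content: String,
) -> anyhow::Result<()> {
    let canonical = pick_canonical(cluster); // confidence, then accesses, then recency
    let total_accesses: u64 = cluster.iter().map(|m| m.access_count).sum();
    let all_sources: Vec<TurnId> = cluster
        .iter()
        .flat_map(|m| m.source_turn_ids.iter().cloned())
        .collect();

    // Rewrite the canonical; carry merged access history and provenance.
    store
        .update_memory(canonical.id, canonical_content, total_accesses, all_sources)
        .await?;
    store.mark_needs_reembedding(canonical.id).await?; // content changed

    // Supersede, don't delete: the chain stays traversable for debugging.
    for m in cluster.iter().filter(|m| m.id != canonical.id) {
        store.set_superseded_by(m.id, canonical.id).await?;
    }
    Ok(())
}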
Job 5: Embedding refresh scheduling
Re-embedding is the most expensive single operation a background worker performs. A full namespace refresh at scale takes days and costs real money. When to trigger it and how to bound cost are operational decisions with significant impact.
When to schedule a full refresh
Three conditions warrant a full re-embedding pass:
Embedding model upgrade: a new provider or a new model version with a different embedding
space. Embeddings from text-embedding-3-small and text-embedding-3-large are not
compatible — mixing them in a vector index produces silently degraded retrieval quality without
any obvious error signal. A model upgrade requires re-embedding all memories before the
new model goes live for production retrieval.
Shadow strategy divergence: the Recall retrieval system can run a shadow embedding model alongside the primary, comparing retrieval results on live queries without serving them to users. When Jaccard similarity between primary and shadow retrieval sets falls below 0.5 for two or more consecutive weeks, the shadow model is capturing meaningfully different content. This warrants promoting the shadow model and re-embedding the full store.
Vocabulary drift: when the vocabulary of incoming memories has drifted substantially from the vocabulary of stored memories — measured as Jaccard similarity of top-1000 token n-grams below 0.6, sustained for 4+ weeks — older embeddings are increasingly poor representations of the domain. A full refresh realigns the index with current usage patterns.
Batch sizing and rate limiting
The math for a 10M memory namespace with OpenAI text-embedding-3-small:
| Parameter | Value |
|---|---|
| Total memories | 10M |
| API throughput limit (full) | 3,000 embeddings/min |
| Production headroom reserved | 1,000 embeddings/min |
| Backfill throughput | 2,000 embeddings/min |
| Batch size (memories per call) | 100 |
| Calls needed | 100,000 |
| Time at 2,000 embeddings/min | 50 min per 100K memories ≈ 83 hours total (~3.5 days) |
Batch size of 100 maximizes throughput: each API call embeds 100 memories, keeping per-request overhead low while staying under the throughput limit. Smaller batches waste request budget on overhead; larger batches risk timeouts.
Cost
text-embedding-3-small: roughly $100 for a full namespace refresh. text-embedding-3-large (higher retrieval quality): roughly $650.
A model switch at 10M memories therefore costs $100–$650. Plan model upgrades accordingly — quarterly at most for large namespaces, triggered by measurable retrieval degradation rather than model release cadence.
Checkpoint and resumption
A crash during the 3.5-day backfill run must not restart from scratch. The job tracks
last_reembedded_memory_id in the worker_state table, updated after each batch commit.
On resume, the query is:
SELECT id, content FROM memories
WHERE namespace_id = $ns AND id > $last_reembedded_memory_id
ORDER BY id ASC LIMIT $batch_size;

Worst-case re-processing on crash is one batch (100 memories), not the full backfill. The
worker_state entry is written atomically with the embedding update — no partial-batch
checkpoints that could leave the index in an inconsistent state.
The backfill runs alongside production traffic. Production retrieval uses the primary model's embeddings until the backfill completes and the cutover is triggered. Do not cut over at 70% completion — incomplete backfills produce a split index where some memories are in the old embedding space and some are in the new one, causing systematically worse retrieval for the un-migrated fraction.
Operational runbook: handling a stuck job
Background jobs get stuck. The lock is held, the process is unresponsive, and other workers cannot acquire the lock to run the job for the affected namespace. The following steps apply to any of the seven jobs.
Step 1 — identify
The worker_locks table contains active and recently released locks. A lock whose
expires_at timestamp is in the past but whose released_at is NULL is stuck:
SELECT namespace_id, job_type, locked_by, locked_at, expires_at
FROM worker_locks
WHERE expires_at < now() AND released_at IS NULL;

If this query returns rows, those jobs are stuck. The locked_by field contains the worker
instance ID that holds the lock.
Step 2 — investigate
Check the worker process logs for the locked_by instance ID. Is the process still alive?
A running process that holds an expired lock is in a worse state than a dead one — it may be
in an infinite loop, blocked on a network call, or deadlocked on an internal mutex.
If the process crashed, the LockReleaseGuard RAII struct should have released the lock
automatically on Drop. If the lock is still held after a confirmed crash, the crash occurred
in a signal handler, during panic unwinding before Drop ran, or in a context where the Tokio
runtime was dropped before the guard could execute. This is rare but possible when the OS
sends SIGKILL rather than SIGTERM.
Step 3 — force release
If the worker is confirmed dead and the lock is preventing other workers from proceeding, manually release it:
UPDATE worker_locks
SET released_at = now(), release_reason = 'manual: process confirmed dead'
WHERE namespace_id = $ns AND job_type = $job AND released_at IS NULL;

Log the forced release with the worker ID and timestamp in your incident management system.
This is a rare operation and should be tracked — repeated forced releases for the same job
indicate a systemic problem (job taking longer than expires_at, memory pressure causing OOM
kills, or network instability causing blocked database calls).
Step 4 — resume
The next scheduled run of the job picks up the lock normally and resumes from the last
checkpoint. For incremental jobs (decay, consolidation, consistency scan), the JobState
checkpoint prevents re-processing — the job continues from last_processed_memory_id.
For non-incremental jobs (snapshot), the next run starts fresh. A snapshot that was 80% complete at the time of the crash is abandoned; the next run creates a new snapshot from scratch. This is safe — snapshots are append-only and the partial snapshot from the crashed run is not referenced by any downstream system.
Prevention
The stuck-job failure mode has two root causes, each with a prevention:
Root cause 1 — job runtime exceeds expires_at. Prevention: set expires_at = now() + interval '<2× expected duration>'. A drift scan expected to run for 10 minutes gets
expires_at = now() + interval '20 minutes'. Monitor the p99 job duration metric and adjust
expires_at when p99 approaches the lock expiry. Do not set expires_at to an absurdly large
value (hours) to avoid the problem — this makes forced releases take longer and hides real
performance regressions.
Root cause 2 — process crash without lock release. Prevention: prefer SIGTERM over SIGKILL
for worker shutdown. SIGTERM triggers graceful shutdown, which fires CancellationToken
cancellation and allows RAII guards to run. SIGKILL bypasses all cleanup. In Kubernetes, set
terminationGracePeriodSeconds to at least 60 seconds for worker pods. Add a preStop hook
that sends SIGTERM and waits for the process to acknowledge it before Kubernetes sends SIGKILL.
Both preventions together reduce stuck-lock incidents to near zero in steady-state operation. The runbook above is for residual cases — hardware failures, OOM kills under memory pressure, and other unpreventable crashes.
The consolidation job in depth
Consolidation is the most complex background job and the one most likely to affect retrieval quality. It runs 4 times daily (at 06:00, 12:00, 18:00, 00:00) to spread work across time rather than doing one large batch. A single large consolidation batch creates a brief retrieval quality dip as memories are being merged and re-embedded; four small batches make that dip imperceptible.
Each consolidation run processes memories created or modified in the last 6 hours. The union-find clustering:
// For each (subject, predicate, type) group with multiple members:
fn find_clusters(memories: &[Memory], threshold: f32) -> Vec<Vec<MemoryId>> {
    let mut uf = UnionFind::new(memories.len());
    for i in 0..memories.len() {
        for j in (i + 1)..memories.len() {
            let similarity = cosine_similarity(&memories[i].embedding, &memories[j].embedding);
            if similarity > threshold { // default 0.75
                uf.union(i, j);
            }
        }
    }
    // Map index components back to memory IDs; each inner vec is one cluster.
    uf.components()
        .into_iter()
        .map(|component| component.into_iter().map(|i| memories[i].id).collect())
        .collect()
}

The threshold 0.75 is lower than the write-time dedup threshold (0.92). This is intentional. Write-time dedup catches near-identical memories: same session, same context, likely the same extraction from slightly different phrasings in the same conversation. Background consolidation catches near-duplicate memories that accumulated across sessions with different wording — a preference stated on Monday and the same preference paraphrased on Thursday, which may score 0.78 in cosine space. These are too different for write-time dedup to catch (0.78 < 0.92) but similar enough for background consolidation to flag (0.78 > 0.75).
The LLM judge for each cluster receives all cluster members in a single call — not one call
per pair. Batch-per-cluster is significantly cheaper than pairwise: a cluster of 4
near-duplicates requires 1 judge call, not 6 pairwise calls. The judge returns a verdict
(equivalent/distinct) and, if equivalent, the canonical phrasing it prefers among the options.
The canonical content gets stored; other cluster members get superseded_by = canonical_id.
Clusters the judge returns as "distinct" are left untouched. These are false positives from the cosine similarity threshold — memories that are numerically close but semantically different. Example: "User prefers dark mode" and "User prefers working in dark, quiet environments" might score 0.77 in cosine space but are clearly distinct predicates. The LLM judge catches this; no merge occurs.
Freshness decay: implementation details
Freshness is not stored as a field — it is computed at query time. The memory table stores
created_at, last_accessed_at, access_count, and memory_type. The freshness formula:
fn freshness(memory: &Memory, now: DateTime<Utc>) -> f32 {
    let age_days = (now - memory.last_accessed_at).num_days() as f32;
    let tau = half_life_days(memory.memory_type);
    let base_freshness = 2.0_f32.powf(-age_days / tau);
    let access_boost = (1.0 + (1.0 + memory.access_count as f32).ln()).min(3.0);
    (base_freshness * access_boost).max(0.1) // floor at 0.1
}

Half-life values by memory type: Preference (tau=90 days), Fact (tau=180 days), Event (tau=30 days), Relationship (tau=365 days). These reflect how quickly each type of memory typically becomes stale. An event ("user has a meeting Thursday") becomes stale in weeks; a relationship ("user works with Sarah") is stable for months to years.
The background worker does not compute freshness for every memory every hour — that would be
prohibitively expensive at scale. Instead it runs a targeted query to mark memories as
"below the retrievable threshold" when it can predict their freshness will fall below 0.1.
The decay worker runs hourly over memories where now() - last_accessed_at > tau × log₂(10 × access_boost),
a precomputed estimate of when freshness falls below the floor (solve access_boost × 2^(-t/tau) = 0.1 for t).
When the access count is 0, the boost is 1 and the threshold collapses to t = tau × log₂(10) ≈ 3.32 × tau.
For a Preference memory (tau=90), that is 299 days before falling below floor. For an Event
memory (tau=30), it is 99 days. These are the natural expiry dates for unaccessed memories —
not when they are deleted (that is garbage collection's job), but when they stop appearing in
standard retrieval results.
The access_boost term applies a multiplier up to 3.0 based on how often a memory has been
retrieved. A memory with access_count = 50 has a raw boost of 1 + ln(51) ≈ 4.9, capped at 3.0, meaning it
stays retrievable approximately three times longer than an equivalent unaccessed memory. This
reflects a sensible prior: memories that have been retrieved frequently are more likely to be
genuinely useful and should not decay as rapidly.
Embedding refresh job: when to trigger
The embedding refresh job is the most expensive background operation in terms of API cost and should be triggered deliberately, not on a routine schedule. The monthly cadence in the job list refers to routine content-update backfill — not full re-embedding.
Three conditions that trigger an embedding refresh:
Embedding model change: the model used to produce all existing embeddings is being replaced. Old and new embeddings live in incompatible vector spaces — a query embedded with the new model will not retrieve memories embedded with the old model reliably. All embeddings must be refreshed before the new model's query path is activated.
Memory content updates: memories whose content field was modified by consolidation
(canonical rewriting) or manual correction have stale embeddings. The consolidation job flags
these with needs_reembedding = true; the embedding refresh job processes this backlog on its
regular cadence. At 5% consolidation rate with 10K memories, the backlog is 500 memories per
day — trivial at batch sizes of 1,000.
Vocabulary drift alert: when the vocabulary Jaccard similarity between recent queries and the existing embedding space drops below 0.40 sustained over 3 consecutive weeks, it signals the domain has shifted enough that the current embedding model may be poorly calibrated. A full re-embedding with a domain-appropriate model may be warranted.
Process for a full re-embedding (model change):
async fn shadow_reembed(
    namespace: &str,
    store: &MemoryStore,
    config: &Config,
    old_model: &dyn EmbeddingModel,
    new_model: &dyn EmbeddingModel,
) -> Result<()> {
    // Phase 1: dual-write new memories with both models
    config.set_dual_embed(namespace, true); // new writes get both embeddings

    // Phase 2: backfill historical memories in batches of 1,000
    let mut cursor = None;
    loop {
        let batch = store.memories_without_new_embedding(namespace, cursor, 1000).await?;
        if batch.is_empty() { break; }
        let embeddings = new_model.embed_batch(&batch.contents()).await?;
        store.update_embeddings_v2(namespace, &batch.ids(), &embeddings).await?;
        cursor = batch.last_id();
        tokio::time::sleep(Duration::from_millis(50)).await; // rate limiting
    }

    // Phase 3: validate Jaccard overlap before cutover
    let overlap = compute_topk_jaccard(namespace, old_model, new_model).await?;
    if overlap < 0.80 { return Err(anyhow!("Models disagree too much; abort cutover")); }

    // Phase 4: switch query path to new model
    config.set_active_embedding_model(namespace, new_model.id());
    config.set_dual_embed(namespace, false);
    Ok(())
}

The Jaccard validation in Phase 3 checks that the two models return substantially overlapping top-K results for a representative sample of queries. An overlap below 0.80 means the models disagree about which memories are relevant to typical queries. At 0.80 overlap, the transition is safe. Below 0.80, the cutover is aborted and the discrepancy is logged for investigation.
The 50ms rate limiting between batches of 1,000 memories prevents embedding API throttling. At a typical embedding API throughput of 1,000 embeddings per second, a namespace of 1M memories requires about 17 minutes of batched processing — plus the 50ms gaps, totaling approximately 20 minutes for Phase 2. Plan model migrations during low-traffic windows.
Operating the background worker: deployment patterns
The background worker should not run in the same process as the write or read pipelines. It is a separate service with different resource requirements: long-running, bursty CPU, moderate I/O, no response-time SLO. Mixing it with the write pipeline means a runaway consolidation job can affect write latency; mixing it with the read pipeline means a drift scan that spikes CPU affects query throughput.
Deployment pattern: run a single background worker process per cluster, elected via the database lock mechanism. If the elected worker dies, the lock auto-expires and a standby worker is elected within one lock expiry period (default: 10 minutes). Multiple standby workers can safely coexist in the process pool — the lock mechanism prevents concurrent runs of the same job on the same namespace, so standbys simply skip the locked jobs until they acquire the lock.
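The election mechanics reduce to the lock INSERT from the job-locking section; a sketch of the standby loop, with hypothetical due_jobs and run_job helpers:

// Every worker runs this loop; whoever wins the INSERT runs the job,
// everyone else skips until the lock is released or expires.
async fn try_acquire_lock(
    pool: &sqlx::PgPool,
    ns: i64,
    job: &str,
    worker_id: &str,
) -> anyhow::Result<bool> {
    let result = sqlx::query(
        "INSERT INTO worker_locks (namespace_id, job_type, locked_at, locked_by, expires_at)
         VALUES ($1, $2, now(), $3, now() + interval '10 minutes')
         ON CONFLICT (namespace_id, job_type) DO NOTHING",
    )
    .bind(ns)
    .bind(job)
    .bind(worker_id)
    .execute(pool)
    .await?;
    Ok(result.rows_affected() == 1)
}

async fn standby_loop(pool: sqlx::PgPool, worker_id: String) -> anyhow::Result<()> {
    loop {
        for (ns, job) in due_jobs(&pool).await? {
            if try_acquire_lock(&pool, ns, &job, &worker_id).await? {
                run_job(&pool, ns, &job).await?; // guard releases the lock on exit
            }
        }
        tokio::time::sleep(std::time::Duration::from_secs(60)).await;
    }
}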
Resource sizing: the background worker's RAM requirement is dominated by consolidation — holding a batch of embeddings in memory during the clustering step. At batch size 1,000 with 768-dimensional embeddings at 4 bytes per float: 1,000 × 768 × 4 = ~3MB per batch. Two batches in memory simultaneously (current + next pre-fetched): ~6MB. Add process overhead: 2GB RAM is comfortable for namespaces up to 10M memories. CPU is bursty during LLM judge calls — the bottleneck is the LLM API rate limit, not local CPU. 2 CPU cores handle the coordination logic without saturation.
Scaling beyond 10M memories: at 100M memories, the consolidation job's pairwise cosine similarity comparison becomes the bottleneck. Switch to approximate nearest-neighbor (ANN) candidate selection at this scale: use the vector index to find candidates (sub-linear time) rather than scanning all pairs (quadratic time). The union-find clustering step is unchanged; only the candidate generation step changes.
Monitoring: set alerts on the background_job_last_success metric per job type. If any
job has not succeeded within 2× its expected cadence, alert. Concrete thresholds:
| Job | Expected cadence | Alert threshold |
|---|---|---|
| Consolidation | 6 hours | 12 hours since last success |
| Consistency check | 24 hours | 48 hours since last success |
| Drift scan | 7 days | 14 days since last success |
| Embedding refresh | 30 days | 45 days since last success |
| Garbage collect | 7 days | 10 days since last success |
| Snapshot | 24 hours | 36 hours since last success |
The snapshot alert threshold is tight (36 hours) because a failed snapshot is a silent data recovery risk. Every other job failure degrades quality gradually; a snapshot gap means you cannot recover to a known-good state if another job causes data corruption. When the snapshot alert fires, treat it as an incident — not a background item for the next engineering review.