Three Tiers of Deduplication
Without dedup, paraphrases pile up. The user mentions Vim three times in three different ways across a year. The store now has three near-identical memories that all match the same queries — pushing one valuable memory out of the top-K for every query they touch. With aggressive dedup, you lose temporal signal — the third mention might be a stronger endorsement than the first, but if you collapse them you lose the count. The three-tier approach handles both: collapse identical-enough memories, but increment a repetition counter so the confidence formula can still reward repeated observations.
The three tiers
- Tier 1 — hash equality. Normalize whitespace and case, then hash. Rejects exact restatements and replays. Free.
- Tier 2 — cosine similarity. Embed the candidate and
compare against neighbors in the vector store. Two thresholds:
sim > 0.92auto-merges;0.85 ≤ sim ≤ 0.92escalates to tier 3. - Tier 3 — LLM judge. A small model classifies the pair as duplicate-or-distinct. Used only on the ambiguous middle band — typically 5–10% of extracted candidates.
The thresholds are not arbitrary
The 0.85 and 0.92 numbers come from calibration runs against labeled pair sets. They reflect the embedding model's separation between "same fact, different words" and "different facts, related topics."
sim > 0.92 → merge · 0.85 ≤ sim ≤ 0.92 → judge · sim < 0.85 → keep
Choose thresholds by holding the Tier 3 misclassification rate under 10%.
Different embedding models have different separation properties. If you swap embedding models, recalibrate the thresholds — do not import them blindly.
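As a concrete form of that routing rule, here is a minimal sketch (the enum and constants are illustrative, not the production types) of the decision a candidate goes through once its best-neighbor similarity is known:

enum DedupRoute {
    AutoMerge,       // sim > 0.92: same claim, merge without an LLM call
    EscalateToJudge, // 0.85 <= sim <= 0.92: ambiguous, let the LLM judge decide
    Keep,            // sim < 0.85: distinct enough to store as a new memory
}

fn route_by_similarity(sim: f32) -> DedupRoute {
    // These constants are per embedding model; recalibrate them when the model changes.
    const AUTO_MERGE: f32 = 0.92;
    const JUDGE_FLOOR: f32 = 0.85;
    if sim > AUTO_MERGE {
        DedupRoute::AutoMerge
    } else if sim >= JUDGE_FLOOR {
        DedupRoute::EscalateToJudge
    } else {
        DedupRoute::Keep
    }
}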
Merging vs. superseding
Merge and supersede are not the same operation. Merging applies when two memories make the same claim ("user uses Vim" and "user prefers Vim as editor"). Superseding applies when one contradicts the other ("user uses Vim" and "user switched to Helix"). Dedup handles the first; conflict detection (the next pipeline stage) handles the second.
Implementation notes
- Scope dedup by user. A fact about User A is not a duplicate of the same fact about User B. Per-user namespaces in the vector index handle this naturally.
- Scope dedup by type. A fact and a preference with the same surface text are different memories. Always include type in the dedup keying.
- Bound the search. Limit cosine search to the top-N nearest neighbors (N=20 works well). Searching the entire store is wasteful and rarely changes the outcome.
The hash normalization in detail
The Tier 1 check is deceptively simple. The normalization function does exactly this:
fn normalize_for_hash(content: &str) -> String {
content.to_lowercase()
.chars()
.filter(|c| !c.is_whitespace() && c.is_alphanumeric() || *c == '.')
.collect::<String>()
.trim()
.to_string()
}

Lowercase the string. Strip every character that is neither alphanumeric nor a literal period. Drop all whitespace. The result is a canonical flat string that collapses surface variation without losing structural information.
Walk through three examples where the same underlying fact arrives in different phrasings:
| Input | Normalized |
|---|---|
| "User works at Volkswagen AG" | "userworksatvolkswagenag" |
| " user WORKS at Volkswagen AG " | "userworksatvolkswagenag" |
| "User works at Volkswagen AG." | "userworksatvolkswagenag." |
The first two normalize to the same string. The third keeps its trailing period, which survives only because periods are explicitly kept; if that matters to you, drop the || *c == '.' clause. For most content hashing purposes, periods in company names ("U.S.A.") are worth preserving. Punctuation at sentence end is noise, so stripping trailing periods explicitly makes the third example match the other two without losing the internal ones.
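If you want the third example to hash identically to the first two, a small wrapper over the normalizer can drop a single trailing period (a sketch; the function name is illustrative):

fn normalize_without_trailing_period(content: &str) -> String {
    let flat = normalize_for_hash(content);
    // Drop at most one sentence-ending period. Internal periods ("u.s.a." mid-string)
    // are untouched; an abbreviation at the very end of the text still loses its last period.
    flat.strip_suffix('.').unwrap_or(&flat).to_string()
}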
The hash itself combines content with memory metadata:
fn content_hash(memory: &CandidateMemory) -> [u8; 32] {
let normalized = normalize_for_hash(&memory.content);
let combined = format!("{}|{}|{}|{}",
memory.memory_type,
memory.subject.as_deref().unwrap_or(""),
memory.predicate.as_deref().unwrap_or(""),
normalized
);
Sha256::digest(combined.as_bytes()).into()
}

The combined key is type|subject|predicate|content. This prevents
cross-type false positives: a Fact memory and a Preference memory about
"User likes Python" produce different hashes even with identical
normalized content, because their memory_type strings differ. It also
prevents cross-entity false positives: "Georgian runs Datakynd" and
"Priya runs Datakynd" hash differently because the subject differs.
The hash is stored as content_hash BYTEA with a unique index per
scope. A uniqueness constraint in the database means Tier 1 is a single
index lookup — the database enforces it, the application just catches
the unique-violation error and routes the candidate to the merge path.
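A minimal sketch of that Tier 1 insert path, assuming a Postgres-backed store accessed through sqlx (the table, columns, and the Tier1Outcome type are illustrative, not the actual schema): the application attempts the insert and treats Postgres's unique-violation error code (23505) as the signal to route to merge.

enum Tier1Outcome {
    Inserted,
    RouteToMerge,
}

async fn tier1_insert(
    pool: &sqlx::PgPool,
    hash: &[u8],
    memory: &CandidateMemory,
) -> Result<Tier1Outcome, sqlx::Error> {
    let result = sqlx::query("INSERT INTO memories (content_hash, content) VALUES ($1, $2)")
        .bind(hash)
        .bind(&memory.content)
        .execute(pool)
        .await;
    match result {
        Ok(_) => Ok(Tier1Outcome::Inserted),
        // 23505 is unique_violation: the hash already exists in this scope,
        // so the candidate goes to the merge path instead of being written again.
        Err(sqlx::Error::Database(db_err)) if db_err.code().as_deref() == Some("23505") => {
            Ok(Tier1Outcome::RouteToMerge)
        }
        Err(other) => Err(other),
    }
}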
This matters historically: Tier 1 alone would have caught the "668 copies of a single feedback-loop hallucination" bug that appeared in Mem0's 2024 audit. A buggy extraction loop that re-extracted the same turn repeatedly would have produced 668 identical normalized hashes. The first would insert; the next 667 would hit the unique constraint and be routed to merge instead of insert. The repetition counter would end up at 668, which is useful signal that something went wrong upstream. Without Tier 1, all 668 would land in the store as separate memories and saturate retrieval.
Tier 2: why entity_facts scoping matters
Tier 2 fires when the candidate passes Tier 1 (no exact hash match) and
has both subject and predicate set. The check is:
existing = storage.entity_facts(subject, predicate, scope)
for each existing memory:
    sim = dot_product(candidate.embedding, existing.embedding)
    if sim > 0.92: merge
    else if sim >= 0.85: escalate to Tier 3

The entity_facts call fetches all existing memories that share the
same (subject, predicate, scope) triple. This scoping is not an
optimization — it is a precision decision.
Consider what happens if you ran Tier 2 against all memories for the user. For a user with 50,000 memories, you'd be computing cosine similarity against 50,000 vectors for every candidate. Even with HNSW approximate nearest-neighbor search cutting that to ~100 candidates, you'd surface memories from different subjects and predicates. "Georgian works at Volkswagen" (subject: Georgian, predicate: employer) might score 0.86 against "Priya works at Google" (same grammatical structure, different entities) — high enough to escalate to Tier 3. Tier 3 would correctly reject the false positive, but you'd have paid an LLM call for a case that subject/predicate scoping would have excluded for free.
By scoping to (subject, predicate), the search set for a typical fact
is 5–50 memories. These are the only plausible duplicates: if two
memories share the same subject and predicate, they are making a claim
about the same relationship between the same entity and something else.
"Georgian's employer is Volkswagen" and "Georgian works at Volkswagen"
have the same subject (Georgian) and the same predicate (employer/works
at). They are candidates for dedup. "Priya's employer is Google" has a
different subject and never enters the comparison.
Because the embeddings are L2-normalized at write time, cosine
similarity equals the dot product. That is a 5× speedup over computing
the full cosine formula (dot / (|a| × |b|)) — the normalization
eliminates both magnitude computations. With a search set of 20–50
vectors, Tier 2 runs in 5–15ms including the database round-trip. This
is the right cost structure: cheaper than an LLM call by two orders of
magnitude, more discriminating than a hash check.
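Putting those two points together, a sketch of the Tier 2 pass (it reuses the route_by_similarity sketch from earlier; the embedding fields are assumed to be unit-length f32 vectors, as described):

fn dot(a: &[f32], b: &[f32]) -> f32 {
    // For L2-normalized vectors the dot product is exactly the cosine similarity.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn tier2_route(candidate: &CandidateMemory, neighbors: &[Memory]) -> DedupRoute {
    // neighbors is the small (subject, predicate, scope)-scoped set from entity_facts.
    let mut best_sim = f32::NEG_INFINITY;
    for existing in neighbors {
        let sim = dot(&candidate.embedding, &existing.embedding);
        if sim > best_sim {
            best_sim = sim;
        }
    }
    if neighbors.is_empty() {
        DedupRoute::Keep // nothing to dedup against
    } else {
        route_by_similarity(best_sim)
    }
}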
Tier 2 only routes the first matching existing memory to Tier 3. If three memories all fall in the 0.85–0.92 band, only the highest-scoring one goes to the LLM judge. If that judge calls them duplicates, the candidate is merged into it. The other two are not re-evaluated — they already survived their own dedup passes when they were written, so they are distinct enough from each other. The cost saving is meaningful: three Tier 3 calls for the same candidate would be unusual and wasteful.
The ambiguous band: why 0.85–0.92
The threshold pair 0.85 and 0.92 defines the zone where cosine similarity is insufficient to decide without additional reasoning. To understand why these specific numbers were chosen, you need to know what kinds of memory pairs live at each boundary.
At sim=0.93 (above the auto-merge threshold), you see pairs like:
- "Georgian runs Datakynd, a freelance consultancy" / "Georgian owns a freelance business named Datakynd"
- "User prefers dark mode" / "User likes dark themes"
- "Alice reports to Bob" / "Alice is Bob's direct report"
These are same-fact-different-phrasing at high similarity. The model's geometry places them far enough apart that the hash won't catch them, but close enough that a human reviewing them would agree they are the same claim in every case. Auto-merging is correct.
At sim=0.88 (inside the ambiguous band), you see pairs like:
- "Georgian works at Volkswagen" / "Georgian is employed by VW"
- "User writes Python" / "User's primary language is Python"
- "The meeting is Monday at 3pm" / "The standup is scheduled for Monday"
These are probably the same fact, but the phrasing differences are meaningful enough that auto-merge would occasionally be wrong. "VW" vs. "Volkswagen" is the same entity; the LLM judge will correctly call these duplicates. But "The meeting" vs. "The standup" might or might not be the same event — that requires reasoning about whether these terms refer to the same calendar item. The ambiguous band routes these to the LLM judge, which has the reasoning capacity to make that call.
At sim=0.84 (just below the lower threshold), you see pairs like:
- "Georgian works at Volkswagen" / "Georgian's team at Volkswagen"
- "User likes Python" / "User's team uses Python"
- "The Berlin office is large" / "The Munich office is large"
These are related facts but not the same fact. "Works at" and "team at" are different predicates. "User likes" and "user's team uses" are different subjects. "Berlin" and "Munich" are different entities. Routing these to the LLM judge would mostly produce correct rejections; the judge would say "not the same fact", but you would be paying LLM costs for cases that are already clearly distinct at the semantic level. The 0.85 lower bound ensures the LLM judge sees genuine ambiguity, not obvious distinctions.
The 0.85–0.92 band is calibrated by building a labeled dataset of 500+ memory pairs with similarity scores and ground-truth duplicate labels, then finding the thresholds that minimize (false auto-merges + LLM cost per candidate). The numbers 0.85 and 0.92 are not transferable across embedding models — they reflect the specific geometry of text-embedding-3-small's latent space. See the recalibration section below.
LLM judge parallelism
The Tier 3 prompt is structured as a binary classification:
Are these two memories the same fact?
Memory A (existing): "Georgian runs Datakynd, a freelance consultancy"
Memory B (new): "Georgian owns a freelance business named Datakynd"
Output: { "same": true | false, "confidence": 0.0-1.0, "reason": "..." }

If same: true and confidence >= 0.75, the candidate is merged into
the existing memory. The confidence threshold of 0.75 exists to handle
the judge's own uncertainty — at 0.74, the judge is saying "I lean toward
same fact but I'm not sure." In that case, keeping both memories is the
conservative choice.
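Parsed into a typed verdict, the acceptance rule is one line. A sketch with serde; the struct name and fields mirror the prompt's output schema, and the 0.75 constant is the confidence threshold described above:

use serde::Deserialize;

#[derive(Deserialize)]
struct JudgeVerdict {
    same: bool,
    confidence: f32,
    reason: String,
}

// Merge only when the judge is both positive and confident enough;
// anything weaker keeps both memories, which is the conservative choice.
fn should_merge(verdict: &JudgeVerdict) -> bool {
    verdict.same && verdict.confidence >= 0.75
}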
The implementation runs all Tier 3 calls for a batch via
FuturesUnordered. This matters at batch scale. A typical write batch of
20 conversation turns might extract 8 candidates, of which 3 fall in the
ambiguous 0.85–0.92 band. Sequential Tier 3 execution would take 3 ×
~200ms = 600ms. With FuturesUnordered, all three fire concurrently and
complete in ~200ms total (the longest of the three, not the sum). The
practical overhead is approximately one LLM call's latency regardless of
how many ambiguous pairs exist in the batch.
FuturesUnordered is the right primitive because the three calls are
fully independent — there's no ordering dependency between judging pair
A and pair B. Each result is processed as it arrives. If one judge call
errors, the other two proceed normally. Errors are non-fatal: a failed
Tier 3 call defaults to "not a duplicate" (keep the candidate), which is
the conservative choice. Writing a duplicate is recoverable; dropping a
unique memory is not.
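A sketch of that fan-out, assuming a judge_pair async function that wraps the Tier 3 prompt and returns the verdict struct from the earlier sketch; judge failures default to keeping the candidate:

use futures::stream::{FuturesUnordered, StreamExt};

async fn judge_batch(pairs: Vec<(Memory, CandidateMemory)>) -> Vec<(CandidateMemory, bool)> {
    // One future per ambiguous pair; all fire concurrently and complete in roughly
    // the latency of the slowest single call, not the sum.
    let mut in_flight: FuturesUnordered<_> = pairs
        .into_iter()
        .map(|(existing, candidate)| async move {
            let merge = match judge_pair(&existing, &candidate).await {
                Ok(verdict) => should_merge(&verdict),
                // Errors are non-fatal: default to "not a duplicate" and keep the candidate.
                Err(_) => false,
            };
            (candidate, merge)
        })
        .collect();

    let mut decisions = Vec::new();
    while let Some(result) = in_flight.next().await {
        // Results are processed as they arrive; there is no ordering dependency between pairs.
        decisions.push(result);
    }
    decisions
}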
In practice, Tier 3 fires on roughly 10% of candidates that pass Tier 1. The 90% that auto-resolve in Tier 1 or Tier 2 pay no LLM cost. The 10% that reach Tier 3 cost approximately $0.04/day in judge calls.
Merge vs. supersede: the full distinction
Deduplication and conflict detection are adjacent pipeline stages with different semantics. Understanding exactly where the boundary falls prevents misrouting.
Obvious merge — same claim, different phrasing: "User prefers dark mode" + "User always uses dark themes" → sim=0.93 → auto-merge. The two memories make identical claims. The merge updates the existing memory's repetition counter, source provenance, and confidence. Nothing is lost.
Obvious supersede — same predicate, different value: "User works at Volkswagen" (written January) + "User now works at Stripe" (written July) → the predicate (employer) is the same, the value changed. This is a conflict, not a duplicate. The pipeline routes it to conflict detection, which supersedes the Volkswagen memory (marks it as historical, keeps it for audit) and writes the Stripe memory as current.
Ambiguous case — same predicate, plausibly same value with updated
framing:
"User runs Datakynd" (written 2025) + "User recently sold Datakynd and
started a new consultancy" (written 2026). These share the subject and
predicate (subject: User, predicate: employer/business). Cosine
similarity might land at 0.87 (in the ambiguous band) because both
mention Datakynd. Tier 3 receives this pair. The LLM judge, reading the
full text, outputs {"same": false, "confidence": 0.91, "reason": "Memory B describes a transition that Memory A does not mention — they are temporally distinct facts, not paraphrases"}. The candidate passes
dedup, and conflict detection then handles the Datakynd → new consultancy
transition as a supersession.
The routing rule: if the values of the predicate agree (same employer, same preference, same location), it's a dedup candidate. If the values disagree (employer changed, preference reversed, location moved), it's a conflict candidate. The LLM judge at Tier 3 handles cases where "agree" versus "disagree" is itself ambiguous — it has enough context to reason about whether the second memory updates or duplicates the first.
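Stated as code, the routing rule looks like this (a sketch; exact string equality on the value stands in for "values agree", and the genuinely ambiguous comparisons are exactly the ones the Tier 3 judge reasons about):

enum Route {
    DedupCandidate,    // same predicate, same value: a paraphrase, handled by merge
    ConflictCandidate, // same predicate, different value: handled by conflict detection
    Unrelated,         // different predicate: neither dedup nor conflict
}

fn route_claim(predicate_a: &str, value_a: &str, predicate_b: &str, value_b: &str) -> Route {
    if predicate_a != predicate_b {
        return Route::Unrelated;
    }
    if value_a == value_b {
        Route::DedupCandidate
    } else {
        Route::ConflictCandidate
    }
}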
How the repetition counter integrates
When a merge fires, the merge_into_existing function runs:
pub fn merge_into_existing(existing: &mut Memory, new: CandidateMemory) {
existing.access_count += 1;
existing.last_accessed = Some(Utc::now());
// Append source turns (dedup by turn ID)
for turn_id in new.provenance.source_turn_ids {
if !existing.provenance.source_turn_ids.contains(&turn_id) {
existing.provenance.source_turn_ids.push(turn_id);
}
}
// Recompute confidence with increased repetition
existing.confidence = recompute_confidence(existing);
// If new has stronger source_confidence, adopt it
if new.source_confidence > existing.source_confidence {
existing.source_confidence = new.source_confidence;
}
}

Nothing is lost. The merge strengthens the existing memory.
The access_count increment feeds into the repetition component of the
confidence formula. Recall that repetition boost is:
r(n) = 1 − 1 / (1 + ln(1 + n))
r(n) is the repetition boost as a function of independent observation count.
Concrete values: n=1 → 0.409, n=2 → 0.523, n=3 → 0.581, n=5 → 0.642.
Walk through a memory that gets merged three times. Starting from n=0 (first write): r(0) = 0. After the first merge (n=1): r(1) = 0.409. Contribution to confidence: 0.409 × 0.20 = 0.082. After the second merge (n=2): r(2) = 0.523. Contribution: 0.523 × 0.20 = 0.105. After the third merge (n=3): r(3) = 0.581. Contribution: 0.581 × 0.20 = 0.116.
The cumulative confidence boost from three independent merges is 0.116, which is enough to push a borderline memory from below the 0.5 retrieval floor to above it if source and extractor quality are moderate. Three independent sightings of the same fact genuinely increase your confidence that the fact is true — the confidence formula encodes that relationship.
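A quick sketch for sanity-checking those numbers; the 0.20 factor is the repetition component's weight in the confidence formula, as used in the walk-through:

// r(n) = 1 − 1 / (1 + ln(1 + n))
fn repetition_boost(n: u32) -> f64 {
    1.0 - 1.0 / (1.0 + (1.0 + n as f64).ln())
}

fn main() {
    const REPETITION_WEIGHT: f64 = 0.20;
    for n in [0u32, 1, 2, 3, 5] {
        let r = repetition_boost(n);
        // n=1 prints r≈0.409 and contribution≈0.082, matching the walk-through above.
        println!("n={n}: r={r:.3}, contribution={:.3}", r * REPETITION_WEIGHT);
    }
}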
The source_confidence adoption in merge_into_existing handles the
common case where the same fact appears in different quality sources
over time. If the original memory was sourced from weak inference
(source_confidence=0.50) and the new candidate was sourced from a
direct user statement (source_confidence=0.95), the existing memory
adopts the stronger source. The user's direct confirmation upgrades the
memory's trustworthiness retroactively. This is correct behavior: the
most trustworthy source for a fact should define the memory's standing.
The provenance dedup (if !contains(&turn_id)) ensures that even if the
same turn is processed twice (due to worker retry), it doesn't inflate
the source_turn_ids list. Source turn IDs are auditable: at any point
you can trace exactly which conversation turns contributed to a memory
and in what sequence.
Background consolidation vs. write-time dedup
The three-tier pipeline runs at write time, per candidate. It is fast and scoped: it only compares the incoming candidate against existing memories with the same (subject, predicate, scope). This catches duplicates that arise when the same fact surfaces in the same structural form across different sessions.
Background consolidation is a separate job that runs periodically (typically nightly or weekly). It uses a lower similarity threshold (0.75 rather than 0.85) and broader scoping. Where write-time dedup requires matching (subject, predicate), background consolidation only requires matching (subject, type). This catches two categories of duplicates that escape write-time dedup:
Category 1 — predicate drift. "Georgian's employer is Volkswagen" and "Georgian works for the Volkswagen group" might have different predicates (employer vs. employer_of_record) if entity resolution assigned them differently. Write-time dedup would not compare them (different predicates). Background consolidation, operating on (subject="Georgian", type=Fact), would find them as near-duplicates at the 0.75 threshold and propose a merge.
Category 2 — temporal accumulation. A user who mentions Vim once per month across 18 months produces 18 separately-written memories that each survive write-time dedup (they arrived sequentially, not concurrently). After 18 months, the memory store has 18 near-duplicate Vim preference memories. Background consolidation runs union-find clustering on the full set of (subject=User, type=Preference) memories, finds the Vim cluster, runs an LLM judge on representative pairs, and merges them into a single memory with n=18.
The lower threshold (0.75 vs. 0.85) is intentional: background consolidation can tolerate a higher false-positive rate because merges are proposed for human review or a higher-cost LLM judge, not executed immediately. At 0.75, you catch more borderline cases; the review step filters the false positives. At write time, you want a threshold that auto-executes correctly — hence the higher 0.85 floor.
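A sketch of the clustering step in that consolidation job, assuming the pairwise similarities over the (subject, type) set have already been computed and filtered at the 0.75 threshold (the union-find is a plain array-based version; the resulting clusters become merge proposals for review, not immediate merges):

struct UnionFind {
    parent: Vec<usize>,
}

impl UnionFind {
    fn new(n: usize) -> Self {
        Self { parent: (0..n).collect() }
    }
    fn find(&mut self, x: usize) -> usize {
        if self.parent[x] != x {
            let root = self.find(self.parent[x]);
            self.parent[x] = root; // path compression
        }
        self.parent[x]
    }
    fn union(&mut self, a: usize, b: usize) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent[ra] = rb;
        }
    }
}

// Group memories whose pairwise similarity clears the consolidation threshold
// into clusters; each multi-member cluster becomes a proposed merge for review.
fn propose_clusters(n_memories: usize, similar_pairs: &[(usize, usize)]) -> Vec<Vec<usize>> {
    let mut uf = UnionFind::new(n_memories);
    for &(a, b) in similar_pairs {
        uf.union(a, b);
    }
    let mut clusters: std::collections::HashMap<usize, Vec<usize>> = std::collections::HashMap::new();
    for i in 0..n_memories {
        let root = uf.find(i);
        clusters.entry(root).or_default().push(i);
    }
    clusters.into_values().filter(|c| c.len() > 1).collect()
}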
Threshold recalibration when switching embedding models
The 0.85 and 0.92 thresholds are properties of the embedding model's geometry, not universal constants. When you switch from text-embedding-3-small to text-embedding-3-large (or any other model), the similarity scores for the same pairs will change. Large models typically produce more discriminating embeddings — the gap between "same fact, different phrasing" and "different facts, same topic" is wider. That means a pair that scored 0.88 with the small model might score 0.94 with the large model, and the same pair might auto-merge under the old thresholds but should stay in the ambiguous band under correctly recalibrated thresholds.
The recalibration procedure:
- Collect 200–500 labeled pairs. Extract existing memories from your store that have previously been merged (these are confirmed duplicates) and memories that survived dedup separately (confirmed distinct). Label them: duplicate: true/false.
- Embed both members of each pair with the new model. Compute cosine similarity for each pair under the new model.
- Build similarity distributions. Separate the pairs into duplicate=true and duplicate=false sets. Plot the similarity distributions for each set.
- Set thresholds at percentile crossings. The lower threshold (0.85 equivalent) should be the 5th percentile of the duplicate=true distribution — you want to miss at most 5% of true duplicates below it. The upper threshold (0.92 equivalent) should be the 95th percentile of the duplicate=false distribution — you want at most 5% of false positives auto-merging above it.
- Validate with held-out pairs. Reserve 20% of your labeled set for validation. Measure the false-merge rate (duplicates incorrectly auto-merged), false-keep rate (duplicates incorrectly kept as distinct), and LLM-escalation rate (fraction hitting the ambiguous band). Target: false-merge rate under 1%, LLM-escalation rate under 15%.
The recalibration takes roughly an hour of engineering time and avoids the failure mode of importing thresholds that were calibrated for a different model. If you are running in production with text-embedding-3-small and plan to upgrade to text-embedding-3-large, run the recalibration on a sample of your production memory pairs before cutover. The distribution shift may be subtle enough that performance looks fine for weeks, then degrades as edge cases accumulate.
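A sketch of the percentile step in that procedure, assuming the labeled pairs have already been re-embedded and scored under the new model (the sort-based percentile is an approximation; a calibration notebook would normally produce the same numbers with plots attached):

// A labeled calibration pair: similarity under the new embedding model plus ground truth.
struct LabeledPair {
    similarity: f64,
    is_duplicate: bool,
}

fn percentile(mut values: Vec<f64>, p: f64) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((values.len() - 1) as f64 * p).round() as usize;
    values[idx]
}

// Returns (lower_threshold, upper_threshold) for the new model.
fn recalibrate(pairs: &[LabeledPair]) -> (f64, f64) {
    let dup_sims: Vec<f64> = pairs.iter().filter(|p| p.is_duplicate).map(|p| p.similarity).collect();
    let distinct_sims: Vec<f64> = pairs.iter().filter(|p| !p.is_duplicate).map(|p| p.similarity).collect();
    // Lower threshold: 5th percentile of true duplicates — miss at most ~5% of them below it.
    let lower = percentile(dup_sims, 0.05);
    // Upper threshold: 95th percentile of distinct pairs — at most ~5% of them auto-merge above it.
    let upper = percentile(distinct_sims, 0.95);
    (lower, upper)
}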
Performance characteristics
The three-tier pipeline is designed so that the expensive steps only run when necessary:
- Tier 1: Hash lookup against a unique index. This is a single B-tree probe. Under 1ms including database round-trip.
- Tier 2: Cosine similarity over 5–50 candidates (same subject+predicate). L2-normalized dot products. 5–15ms including the entity_facts query.
- Tier 3: LLM judge call. ~200ms per call. Runs in parallel for batches via FuturesUnordered.
At pipeline scale: p50 dedup latency is ~10ms (most candidates resolve in Tier 1 or Tier 2). p99 is ~220ms (the 10% that hit Tier 3, rounded up for tail latency in the LLM call). The 10ms vs. 220ms spread is the right design: pay 200ms only for the 1-in-10 candidates that need it.
The pipeline is asynchronous relative to the user's response. The write pipeline runs in a background worker; users don't wait for dedup to complete before seeing a reply. The latency figures matter for worker throughput sizing, not for user-perceived latency. A worker that processes candidates sequentially handles ~100 candidates/second (limited by Tier 3 latency, since judge calls run one at a time). A worker with 10 concurrent Tier 3 slots handles ~1,000 candidates/second. Size worker concurrency based on your observed Tier 3 rate and target throughput.