BM25 for Memory, in Plain English

By Arc Labs Research · 14 min read

Dense embeddings dominated retrieval discourse for half a decade, and then the practitioners came back. BM25 still holds its own on rare-term queries, identifier lookups, and out-of-distribution data. In an agent memory system, BM25 is the second retriever in the five-retriever ensemble for exactly this reason: it catches what semantic search misses.

TF saturation (adjustable k₁): adding the 30th occurrence of a term contributes much less than the 2nd. The asymptote sits at k₁ + 1.

The full formula

BM25(D, Q) = Σ_i IDF(q_i) × f(q_i, D)·(k₁ + 1) / (f(q_i, D) + k₁·(1 − b + b·|D|/avgdl))

Okapi BM25, the standard formulation. Typical defaults are k₁ = 1.2 and b = 0.75.

Three pieces matter:

  • Inverse Document Frequency. Rare terms are more discriminating. IDF(q) = ln((N − n + 0.5) / (n + 0.5) + 1) where N is the total number of memories and n is the count containing q.
  • TF saturation. Term frequency contribution saturates: 30 occurrences are not 30× more relevant than 1. The k₁ parameter controls how fast saturation kicks in.
  • Length normalization. Long memories shouldn't get a free ride from containing the term once. The b parameter (0.75 default) sets how aggressively to penalize length.
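These three pieces compose into one small scoring function. Here is a minimal sketch over a toy corpus of whitespace-tokenized memories; the names and data are ours, not from any particular library:

import math

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    # corpus: list of token lists; IDF and avgdl are derived from it on the fly
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)             # document frequency
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)
        f = doc_tokens.count(q)                            # term frequency
        norm = k1 * (1 - b + b * len(doc_tokens) / avgdl)  # length normalization
        score += idf * f * (k1 + 1) / (f + norm)           # zero when f == 0
    return score

Real engines precompute document frequencies and avgdl at index build time; this version recomputes them per call purely for legibility.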

IDF in depth: what rare means in a memory store

Inverse Document Frequency is BM25's signal-to-noise filter. The IDF formula is:

IDF(q) = ln( (N − n(q) + 0.5) / (n(q) + 0.5) + 1 )

N = total documents; n(q) = documents containing term q. The +0.5 smoothing terms prevent division by zero when n(q) = 0 and damp extreme values at small counts.

Run through concrete numbers. Say your agent has 50,000 memories. The word "work" appears in 30,000 of them — it's nearly universal. Its IDF:

ln((50000 − 30000 + 0.5) / (30000 + 0.5) + 1)
= ln(20000.5 / 30000.5 + 1)
= ln(1.667)
≈ 0.51

Now consider the term "VW123-platform-team", a project identifier that appears in exactly 3 memories:

ln((50000 − 3 + 0.5) / (3 + 0.5) + 1)
= ln(49997.5 / 3.5 + 1)
= ln(14285.0 + 1)
≈ 9.57

That's an 18× IDF difference. When the user queries for "VW123-platform-team", the retriever multiplies each term's TF contribution by its IDF weight. "Work" barely registers. The project identifier dominates the score. This is exactly the right behavior — the query is specific and the rare term should drive precision.
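The arithmetic is easy to sanity-check in a couple of lines:

import math

def idf(N, n):
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

print(idf(50_000, 30_000))  # "work"                 -> ~0.51
print(idf(50_000, 3))       # "VW123-platform-team"  -> ~9.57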

High-IDF terms in a typical memory store: project codes, ticket IDs, names in non-Latin scripts (they almost never appear in more than a handful of memories), version numbers, unusual technical jargon ("HPA autoscaler deadband"), named entities specific to a single employer.

Low-IDF terms: "user", "prefers", "works", "like", "told", "said", "meeting", "team". These are noise in almost every query. BM25 naturally down-weights them without requiring a manual stopword list, because they appear in the majority of documents and their IDF approaches zero.

Why does this matter for precision in memory retrieval specifically? A memory store is not a web corpus. It's a dense personal graph of facts about one person or agent. Domain vocabulary is extremely narrow. Identifiers — badge IDs, project names, conference room names, client codes — appear infrequently but are exactly what users query for. BM25 IDF amplifies exactly those terms.

TF saturation: the math behind the curve

Without saturation, raw term frequency would mean a memory containing the word "Munich" 20 times scores 20× higher than one containing it once. That's wrong — relevance doesn't scale linearly with repetition. BM25 solves this with the k₁ parameter.

The TF component of BM25 is not f(q, D) directly. It's:

saturated_tf = f(q, D) × (k₁ + 1) / (f(q, D) + k₁)

(The length-normalization denominator is simplified here for clarity.) Work through four values at k₁ = 1.2:

f(q, D) | saturated_tf | marginal gain
1       | 1.00         | +1.00 (from 0)
5       | 1.77         | +0.77
10      | 1.96         | +0.19
30      | 2.12         | +0.15

The asymptote is k₁ + 1 = 2.2. The jump from 0 to 1 occurrence is the biggest single gain (0 → 1.00). Occurrences 2 through 5 together add only 0.77, occurrences 6 through 10 add 0.19, and occurrences 11 through 30 add barely 0.15 combined. The curve flattens fast.

Now compare k₁ = 2.0 at the same term frequencies:

f(q, D) | k₁ = 1.2 | k₁ = 2.0 | delta
1       | 1.00     | 1.00     | 0.00
5       | 1.77     | 2.14     | +0.37
10      | 1.96     | 2.50     | +0.54
30      | 2.12     | 2.81     | +0.70

At k₁ = 2.0 saturation kicks in more slowly. The first occurrence is worth exactly 1.0 at any k₁ (at f = 1 the numerator and denominator both equal k₁ + 1), but repeated occurrences keep earning score for longer, and the ceiling is higher (3.0 vs 2.2). This is useful for long documents where a term can legitimately appear 20–30 times without being noise. But in a memory store, individual memories are short: typically 20–80 tokens. A term rarely appears more than 3–4 times in a single memory. For short text, k₁ = 1.2 is the right default: the first occurrence captures most of the TF signal, and you don't want the formula to keep rewarding repetition that just means the memory is verbose.
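Both tables fall straight out of the simplified TF term; a throwaway sketch to reproduce them:

def saturated_tf(f, k1):
    # TF component with the length-normalization factor dropped for clarity
    return f * (k1 + 1) / (f + k1)

for f in (1, 5, 10, 30):
    print(f, round(saturated_tf(f, 1.2), 2), round(saturated_tf(f, 2.0), 2))
# f = 1 prints 1.0 for both: the first occurrence is worth exactly 1.0 at any k₁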

The practical upshot: if you're indexing memory notes that are one to three sentences long, leave k₁ at 1.2. If you index long memories — extended episodic logs, full conversation transcripts attached as memories — push k₁ toward 1.6–2.0.

Length normalization: when it helps and when it hurts

Consider two memories that both contain the term "Volkswagen" exactly once:

  • Memory A: "User works at Volkswagen." (5 tokens)
  • Memory B: "User attended a company all-hands at the Munich headquarters of Volkswagen, the fourth-largest automaker by production volume, on Thursday afternoon." (30 tokens)

Without any length normalization (b = 0), BM25 treats both identically for the Volkswagen term. That's actually the right call here — both memories are equally relevant to a query about Volkswagen. Length normalization helps only when longer documents have a higher raw term count because they're longer, not because they're more relevant.

With b = 0.75 at these lengths (assuming avgdl = 15 tokens):

Memory A denominator factor: 1 − 0.75 + 0.75 × (5/15)  = 0.25 + 0.25 = 0.50
Memory B denominator factor: 1 − 0.75 + 0.75 × (30/15) = 0.25 + 1.50 = 1.75

Memory B gets penalized by a factor of 3.5 relative to Memory A, even though both contain Volkswagen exactly once. That's over-penalizing: Memory B isn't more verbose about Volkswagen, it's just a longer memory about the same entity.

This is the central tension with length normalization in memory systems. Web documents are long because they repeat key terms more. Memory objects are long because the event itself was complex. The correlation between document length and term repetition is weaker in memory stores than in web corpora.
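To see how b shifts the penalty, a quick sketch of the denominator factor (avgdl = 15 as in the example above):

def length_factor(doc_len, avgdl, b=0.75):
    # the (1 - b + b * |D| / avgdl) piece of the BM25 denominator
    return 1 - b + b * doc_len / avgdl

print(length_factor(5, 15))          # Memory A            -> 0.50
print(length_factor(30, 15))         # Memory B            -> 1.75
print(length_factor(30, 15, b=0.5))  # Memory B, softer b  -> 1.50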

Practical guidance:

  • b = 0.75 (default): Good starting point. May over-penalize richer episodic memories if avgdl is pulled down by many short fact-memories.
  • b = 0.5: Better when your memory store contains a mix of one-sentence facts and multi-sentence episodic records. Reduced penalty for length.
  • b = 0: Disables length normalization entirely. Correct when all memories are roughly the same length (e.g., you enforce a token-cap at write time).
  • b = 1: Full normalization. Every document is treated as if it's avgdl long. Useful theoretically; almost never correct in practice because it punishes legitimate detail.

The default of 0.75 comes from the TREC experiments that originally calibrated BM25. It was tuned on news wire documents. Treat it as a prior to update, not a constant.

BM25 vs semantic: the failure analysis

Neither retriever wins universally. The failure modes are complementary, which is why hybrid retrieval is the correct architecture.

Where BM25 wins:

Exact identifiers and codes. "What's badge ID 4729?" retrieves the one memory containing that exact string. An embedding model has no reason to put badge ID 4729 near any conceptual neighbor — there isn't one. Cosine similarity returns the closest number, which could be badge ID 4730 or just "employee ID" memories in general. BM25 returns the exact match.

Out-of-vocabulary terms. Embedding models are trained on fixed vocabularies. A tokenizer that has never seen "HPA-autoscaler-deadband" will split it into subword tokens and embed a noisy average. The resulting vector drifts from the true semantic position. BM25 treats it as a literal token sequence — no drift.

Non-Latin script names. A Chinese or Arabic personal name will embed reasonably inside a multilingual model, but only if the model was trained on that script. A monolingual English embedding model produces near-random vectors for characters it never learned. BM25 is entirely script-agnostic: a byte sequence is a byte sequence.

Numbers and version strings. "Use Node 18.12.1" and "Use Node 18.16.0" are semantically nearly identical to a general embedding model. They're in the same high-density cluster. BM25 distinguishes them precisely.

Where semantic wins:

Paraphrase and synonymy. "What does the user dislike?" should retrieve a memory saying "User finds meetings unproductive." There's no lexical overlap with "dislike." Cosine similarity catches this; BM25 scores zero for both "dislike" and "dislikes" if neither appears in the memory text.

Concept neighbors. "transportation preferences" should surface memories about trains, bikes, and car rentals even if the word "transportation" never appears. Dense embeddings are trained to place these concepts nearby.

Cross-lingual retrieval. A query in English over memories stored in mixed languages. Multilingual embedding models handle this; BM25 requires explicit translation.

Typo tolerance (soft). A query with a misspelled term returns nothing from BM25 if the token doesn't exist verbatim. Embeddings tolerate reasonable spelling variation because nearby subword tokens land in similar vector positions.

The overlap zone:

Short queries with common words fall into a zone where neither wins cleanly. "Meeting" returns too much from BM25 (every memory mentioning any meeting), while dense retrieval returns semantically coherent but vague results. This is where the type-filter and temporal retrievers pick up the slack.

The key architectural takeaway: BM25 and semantic don't overlap in their failure cases. BM25 fails on paraphrase. Semantic fails on identifiers. Their union captures nearly all query types that either one alone handles well.

The predicate + content search structure

Recall memories have a structured schema: a content field carrying the full natural language text, and a predicate field carrying the structured relationship type — works_at, prefers, attended, owns, reported_bug. BM25 indexes both, but not with equal weight.

In the PostgreSQL tsvector configuration, content is indexed at weight A and predicate at weight B. The ts_rank_cd function applies per-label weights when computing the final score, so a term matching in the weight-A field contributes more than the same term matching in the weight-B field.

Why does predicate weighting matter? Consider the query "what does the user work on?" The content of a relevant memory might be "User is a backend engineer on the payments service at Stripe" — generic enough that "work" doesn't appear at all. But the predicate is works_on. The predicate token "works_on" is a strong signal. Weighting predicate lower than content sounds counterintuitive but is correct: you don't want predicate-only matches to dominate when the content field has rich matching text. You want the predicate to act as a tiebreaker — a secondary signal that lifts structurally relevant memories when content scores are close.

The subject field (the entity identifier, e.g., the user's node ID) is used for scope filtering, not for BM25 scoring. Including it in the full-text index would inflate IDF for entity identifiers that should function as hard filters, not ranking signals.

Predicate vocabulary in Recall is constrained: the extraction pipeline normalizes predicates to a finite ontology (works_at, prefers, reported_bug, etc.). This means predicate tokens have predictable IDF — "works_at" appears in thousands of memories, so its IDF is low and it contributes little to ranking. That's correct. The predicate signal is really carried by the specificity of the content text co-occurring with the right predicate, not by the predicate token alone.

Index configuration in Recall

PostgreSQL tsvector

Recall's Postgres implementation uses ts_rank_cd with normalization flag 32. What flag 32 actually does is not obvious from the documentation summary.

ts_rank_cd accepts a normalization integer that is interpreted as a bitmask. Flag 32 means: divide the rank by itself plus one. Concretely:

normalized_rank = rank / (rank + 1)

This maps all rank values into the range [0, 1). A memory with a very high raw rank (many matching terms, high IDF) gets a score approaching 1.0. A memory with a modest raw rank gets a proportionally lower score. The rank/(rank + 1) mapping acts as a soft ceiling.

Why flag 32 specifically? The alternatives are worse for memory-sized text. Flag 1 divides the rank by 1 plus the logarithm of document length (and flag 2 by the raw length); both double-count length normalization already handled by BM25's b parameter. Flag 4 divides by the mean harmonic distance between matching extents, which is useful for phrase proximity scoring but adds latency for short memories where proximity doesn't add signal. Flag 32 is the simplest scalar normalization that makes scores comparable across queries with different term counts.
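The mapping itself is one line; a sketch of what flag 32 does to raw ranks:

def normalize_flag32(rank):
    # ts_rank_cd normalization flag 32: rank / (rank + 1), bounded in [0, 1)
    return rank / (rank + 1)

print(normalize_flag32(0.1))   # modest match -> ~0.09
print(normalize_flag32(12.5))  # strong match -> ~0.93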

The tsvector configuration for the memory table looks roughly like:

setweight(to_tsvector('english', content),   'A') ||
setweight(to_tsvector('english', predicate), 'B') AS search_vec

The two vectors are combined with || and weighted with setweight before storage, as shown. The english text search configuration handles English stemming: proper nouns like "Munich" pass through unchanged, but inflected forms in other languages (German "Münchens") won't be conflated, so use the language config that matches your corpus language.
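Putting the pieces together, here is how the weighted query might be issued from Python. This is a sketch, not Recall's code: the memories(id, content, predicate) table and the connection string are assumptions for illustration, and a real deployment would store a precomputed tsvector column behind a GIN index rather than building vectors per row.

import psycopg2  # assumes a reachable Postgres; connection params are illustrative

QUERY = """
SELECT id, content,
       ts_rank_cd(
         setweight(to_tsvector('english', content),   'A') ||
         setweight(to_tsvector('english', predicate), 'B'),
         plainto_tsquery('english', %(q)s),
         32                               -- flag 32: rank / (rank + 1)
       ) AS score
  FROM memories
 WHERE setweight(to_tsvector('english', content), 'A') ||
       setweight(to_tsvector('english', predicate), 'B')
       @@ plainto_tsquery('english', %(q)s)
 ORDER BY score DESC
 LIMIT 20;
"""

with psycopg2.connect("dbname=recall") as conn, conn.cursor() as cur:
    cur.execute(QUERY, {"q": "payments service"})
    for mem_id, content, score in cur.fetchall():
        print(f"{score:.3f}  {content}")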

SQLite FTS5

SQLite FTS5 exposes BM25 as the default rank function. The table declaration is minimal; the ranking knobs live at query time:

-- content_rowid='id' applies only together with an external-content table
-- (the content='...' option); for a standalone index, omit both.
CREATE VIRTUAL TABLE memories USING fts5(
  content,
  predicate,
  tokenize='unicode61 remove_diacritics 1'
);

Column weights in FTS5 are passed to the bm25() function at query time: ORDER BY bm25(memories, 10.0, 5.0) gives content a 2× weight advantage over predicate. The default is uniform weight 1.0 per column.

The unicode61 remove_diacritics 1 tokenizer handles accented characters (Müller → Muller) which is important for personal names in European locales. Without diacritic removal, "Müller" and "Muller" are distinct tokens and you miss exact matches when the query and memory use different encodings of the same name.

avgdl is maintained internally by FTS5 — you don't compute it explicitly. The k₁ and b parameters are fixed at their defaults (1.2 and 0.75) in FTS5 and are not configurable without modifying the SQLite source.
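A self-contained sketch using Python's stdlib sqlite3 (the toy rows are ours; most stock SQLite builds ship with FTS5 enabled):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE VIRTUAL TABLE memories USING fts5(
        content, predicate, tokenize='unicode61 remove_diacritics 1')
""")
db.execute("INSERT INTO memories VALUES ('User works at Volkswagen', 'works_at')")
db.execute("INSERT INTO memories VALUES ('User prefers dark roast coffee', 'prefers')")

# bm25() returns more-negative values for better matches, so plain ascending
# ORDER BY is already best-first; 10.0 / 5.0 weights content 2x over predicate.
rows = db.execute("""
    SELECT content, bm25(memories, 10.0, 5.0) AS score
      FROM memories WHERE memories MATCH 'volkswagen'
     ORDER BY score
""").fetchall()
print(rows)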

BM25 in the RRF fusion pipeline

BM25's score — say, 12.5 — and cosine similarity — say, 0.87 — are incommensurable. Different scales, different distributions, different upper bounds. You cannot add them, average them, or multiply them and expect a meaningful result. Calibrating both onto a common scale requires corpus-specific normalization that breaks every time the score distribution shifts (new memories, new domains, model updates).

RRF sidesteps this entirely by using ranks. The formula:

RRF(m) = Σ_i w_i / (k + rank_i(m))

k is a constant (default 60). rank_i(m) is m's rank in retriever i's list, 1-indexed. A memory absent from a retriever's list contributes 0 for that retriever.

Work through a concrete example with k = 60. Memory M1 ranks #1 in BM25 and #15 in semantic:

RRF(M1) = 1/(60+1) + 1/(60+15)
        = 1/61    + 1/75
        = 0.01639 + 0.01333
        = 0.02972

Memory M2 ranks #8 in BM25 and #3 in semantic:

RRF(M2) = 1/(60+8) + 1/(60+3)
        = 1/68    + 1/63
        = 0.01471 + 0.01587
        = 0.03058

M2 beats M1 despite ranking lower in BM25. Why? Its high semantic rank (#3 vs #15) contributes significantly more via the reciprocal formula. M1 dominates only if you're weighting BM25 more heavily — which you'd do via w₁ > w₂ if your eval data shows BM25 is the more reliable retriever for your query distribution.

Memory M3 ranks #2 in BM25 but is absent from the semantic results entirely:

RRF(M3) = 1/(60+2) + 0
        = 1/62
        = 0.01613

M3 scores lower than both M1 and M2, despite its strong BM25 rank. Absence from a retriever is not a penalty — it's just zero contribution. But in practice, a memory that ranks #2 in BM25 and is genuinely absent from the top-100 semantic results is unusual unless it's a highly specific lexical match (an identifier, a code) that has no semantic neighbors. That's fine — such memories should rank based on their BM25 signal.
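The whole fusion step is a few lines. A sketch that reproduces the M1/M2/M3 numbers above (uniform weights, k = 60):

def rrf(rank_lists, weights=None, k=60):
    # rank_lists: one {memory_id: rank} dict per retriever, ranks 1-indexed.
    # A memory missing from a retriever's list simply contributes nothing.
    weights = weights or [1.0] * len(rank_lists)
    scores = {}
    for w, ranks in zip(weights, rank_lists):
        for mem_id, r in ranks.items():
            scores[mem_id] = scores.get(mem_id, 0.0) + w / (k + r)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranks     = {"M1": 1, "M2": 8, "M3": 2}
semantic_ranks = {"M1": 15, "M2": 3}      # M3 absent from semantic results
print(rrf([bm25_ranks, semantic_ranks]))
# M2 ≈ 0.0306, M1 ≈ 0.0297, M3 ≈ 0.0161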

The practical consequence: when you're debugging low-quality fusion results, BM25's rank in the fusion pipeline is what you tune, not its raw score. A memory that gets BM25 rank #1 in every query because it happens to be the longest memory in the store (lots of term overlap with any query) is a signal to investigate avgdl, b, or your query preprocessing — not to tune the RRF weights.

Vocabulary drift and IDF staleness

IDF statistics are computed over the current corpus at index build time. They're stale the moment a new memory is written. Usually this is fine — adding one memory to a 50,000-memory store doesn't meaningfully shift IDF. The problem is vocabulary drift: when the user starts a new domain and adds memories with terminology that was previously absent.

A user who has been an engineer for two years has a memory store dominated by engineering vocabulary. "CI pipeline", "deployment", "PR", "staging" appear in thousands of memories, so their IDF is low. The words "diagnosis", "dosage", and "contraindication" appear zero times. IDF(diagnosis) at this point is maximally high: with n = 0 the formula gives ln((N + 0.5)/0.5 + 1) = ln(2N + 2) ≈ ln(100002) ≈ 11.5.

Now the user starts medical school. Over three months, "diagnosis" appears in 500 memories. IDF(diagnosis) should now be around:

ln((50000 − 500 + 0.5) / (500 + 0.5) + 1)
= ln(49500.5 / 500.5 + 1)
= ln(98.9 + 1)
= ln(99.9)
≈ 4.60

But the stale index still has IDF(diagnosis) ≈ 11.5. Queries containing "diagnosis" will score roughly 2.5× too high relative to any query term whose IDF was computed accurately. Precision degrades: the BM25 retriever surfaces medical memories disproportionately for any query that happens to mention medical terms.

Recall detects vocabulary drift by tracking the Jaccard similarity between the current corpus vocabulary and the vocabulary indexed at last build time. Jaccard is a natural fit:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

When Jaccard drops below a threshold (typically 0.85 for a 50K-memory store), the index is flagged for rebuild. The rebuild cost is amortized — it happens in a background worker and doesn't block reads. During the rebuild window, the stale index continues serving queries; the new index swaps in atomically.

An alternative to full rebuild is incremental IDF update: recompute IDF for only the terms whose document frequency has changed by more than some threshold (e.g., ±10%). This is cheaper but complex to implement correctly in the presence of deletions. For memory stores below 10M documents, full rebuild is simple enough to be the right default.
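The drift check itself is small. A sketch, assuming the vocabulary sets come from the index's term dictionary and the live corpus:

def vocab_jaccard(indexed_vocab, current_vocab):
    union = indexed_vocab | current_vocab
    return len(indexed_vocab & current_vocab) / len(union) if union else 1.0

def needs_rebuild(indexed_vocab, current_vocab, threshold=0.85):
    # flag the index for a background rebuild once drift crosses the threshold
    return vocab_jaccard(indexed_vocab, current_vocab) < threshold

indexed = {"ci", "pipeline", "deployment", "staging"}
current = indexed | {"diagnosis", "dosage", "contraindication"}
print(vocab_jaccard(indexed, current))  # 4/7 ≈ 0.57
print(needs_rebuild(indexed, current))  # True -> rebuild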

Why BM25 still matters in 2026

  • Out-of-vocabulary terms. Embeddings trained on web data may not have seen "VW123-platform-team" — a tokenized identifier. BM25 doesn't care; tokens are tokens.
  • Exact-match queries. "What's badge ID 4729?" — semantic returns nearby concepts; BM25 returns the document with that exact string.
  • Domain-shifted data. An embedding model trained on general text underperforms on, say, medical or legal terminology. BM25 is domain-agnostic by design.

Tuning k₁ and b

  • k₁ (default 1.2): lower for short, dense memory text (more aggressive saturation). Higher for long memories.
  • b (default 0.75): lower if memories vary widely in length and you don't want length penalties dominating. 0 disables length normalization entirely.
  • avgdl (average doc length): recompute periodically. Stale avgdl makes the normalization mis-weight new corpora.

Common mistakes

Getting BM25 wrong in a memory system usually comes from one of five places.

Indexing normalized text instead of raw tokens. If your pipeline aggressively stems and lowercases before indexing, "VW123" and "vw123" map to the same token — probably fine. But over-stemming turns "meetings" and "meet" into the same stem, and "diagnoses" becomes "diagnos", which may not match "diagnosis" depending on the stemmer. The failure mode is silent: precision looks fine in aggregate but drops on morphologically complex queries. Index with the lightest stemming that still handles obvious plurals. For identifiers, disable stemming entirely in a parallel index.

Stale avgdl. avgdl is recomputed at index build time. If you add 100,000 new memories without rebuilding, avgdl no longer reflects the actual average. If the new memories run longer than the stale average, every document looks relatively long (|D|/avgdl inflates) and gets over-penalized; if they run shorter, length penalties are understated. The symptom is BM25 scores that sit uniformly high or low compared to expected ranges. Solution: track avgdl as a live statistic and update it incrementally, or trigger periodic rebuilds.

Forgetting to index the predicate field. The most common configuration error. The memory's content text is indexed, but the structured predicate is not. Queries like "what does the user prefer" have weak lexical overlap with content text ("User likes dark roast coffee") but strong overlap with the predicate token "prefers". Without predicate indexing, BM25 misses these entirely and the failure looks like a semantic retrieval failure when it's actually a schema configuration failure.

No stopword handling for extremely common terms. IDF handles common terms gracefully in theory, but in a small memory store (< 1,000 memories) IDF statistics are noisy. A term appearing in 600 of 1,000 memories still has IDF ≈ 0.51, which sounds low but, multiplied by a saturated TF near the ceiling in a long memory, can still outweigh a rare term that appears only once. For small stores, a small stopword list (20–50 terms: "user", "memory", "note", "said", "told", "mentioned", "works", "likes") is a pragmatic fix.

Score-summing BM25 with cosine similarity directly. BM25 score 12.5 and cosine similarity 0.87 are not addable. Any weighted sum of these values produces a result that changes behavior when either distribution shifts — a new embedding model update, a corpus growth spurt. Use rank-based fusion (RRF) at the fusion layer. The scores only need to be good enough to produce the right ranked list within each retriever; they never need to be comparable across retrievers.

Implementation choices

  • Tantivy / Lucene: mature, fast, custom analyzers. Best for dedicated search infrastructure.
  • Postgres + tsvector + ts_rank_cd: good when you're already on Postgres and don't want a separate index. Use normalization flag 32 for bounded score output.
  • SQLite FTS5: built-in BM25 in embedded mode — surprisingly fast at the 100K-memory scale. Column weights are passed at query time via the bm25() function.

At the 10M+ memory scale, dedicated lexical infrastructure (Tantivy in-process, or a Postgres extension) outperforms general-purpose ranking by 10–20× on tail latency. The p99 for lexical retrieval on a 10M-memory store sits at approximately 12ms — well within the 50ms total retrieval budget that includes semantic, entity-graph, and temporal retrievers all running in parallel.
