Reciprocal Rank Fusion, Explained

By Arc Labs Research · 18 min read

Multi-retriever systems face the same problem as ensemble classifiers: how do you combine outputs that disagree? Score-based fusion (weighted sum, weighted geometric mean) requires all scores to live on the same scale. They don't. BM25 returns values in roughly [0, 30]. Cosine similarity returns [-1, 1]. Graph hop-scores return [0, 1]. Normalize them and you introduce calibration bugs that change behavior every time the underlying distributions shift.

RRF sidesteps the problem by using ranks, not scores. The position of an item in a list carries the entire signal.

The formula

RRF(m) = Σ_i  w_i / (k + rank_i(m))

k is a smoothing constant; rank_i(m) is m's rank in list i (1-indexed); w_i is the weight assigned to retriever i.

For each candidate memory, sum w_i / (k + rank) across every retriever list it appears in. Higher sum → better rank in the fused output. Items absent from a retriever's list contribute zero — no normalization, no penalty, no calibration step.
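A minimal sketch of the formula in Python (the function name, input shape, and defaults are illustrative, not a fixed API):

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, weights=None, k=60):
    """Fuse per-retriever ranked lists of item IDs with Reciprocal Rank Fusion.

    ranked_lists: one list per retriever, sorted best-first.
    weights: optional per-retriever weights w_i (default 1.0 each).
    k: smoothing constant; 60 is the conventional default.
    """
    weights = weights or [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for w, items in zip(weights, ranked_lists):
        for rank, item in enumerate(items, start=1):  # ranks are 1-indexed
            scores[item] += w / (k + rank)  # items absent from a list add nothing
    return sorted(scores.items(), key=lambda kv: -kv[1])
```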

[Interactive demo: RRF fusion with adjustable k. Lower k makes the top of each list dominate; higher k flattens the curve.]

Why score normalization fails

The intuitive fix for score heterogeneity is min-max normalization: for each retriever, rescale all scores to [0, 1] and sum. It sounds clean. It breaks in practice.

Take a concrete query. BM25 returns three results with raw scores 28.4, 14.2, and 3.1. Cosine similarity returns the same three memories with scores 0.91, 0.88, and 0.61. Min-max normalize both to [0, 1]:

  • BM25 normalized: 1.00, 0.44, 0.00
  • Cosine normalized: 1.00, 0.90, 0.00

Sum them: memory A scores 2.00, memory B scores 1.34, memory C scores 0.00. Memory A wins. But what does a BM25 score of 28.4 mean? It means the query term appears many times in the memory text, weighted by IDF. A BM25 score of 28 versus 14 does not imply the first memory is twice as relevant — BM25 scores are corpus-relative. In a different corpus with more repetitive text, the same memory might score 8. The normalized value 1.00 carries the implicit claim that this is the most relevant result in the corpus, which is exactly what you cannot know from a single retriever's score.

Cosine similarity has the opposite problem. The difference between 0.91 and 0.88 is tiny in absolute terms but enormous in rank — these are likely both excellent semantic matches. Normalizing to 1.00 and 0.90 preserves the structure, but that still gets summed with a BM25 normalized score that carries completely different semantics.

The deeper issue is that normalization is a function of the retrieved set, not of relevance. If you run BM25 and return 50 results instead of 10, the normalization denominator changes, and every score shifts. Add a new retriever, and all existing normalized scores are invalidated. The calibration surface is fragile and opaque.
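That fragility is easy to demonstrate. In the sketch below (hypothetical scores), retrieving one extra junk hit moves every existing normalized value:

```python
def min_max(scores):
    lo, hi = min(scores), max(scores)
    return [round((s - lo) / (hi - lo), 3) for s in scores]

print(min_max([28.4, 14.2, 3.1]))       # [1.0, 0.439, 0.0]
print(min_max([28.4, 14.2, 3.1, 0.2]))  # [1.0, 0.496, 0.103, 0.0] -- every score shifted
```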

Min-max normalization also amplifies noise at the tails. If the lowest-scoring result in a BM25 list is a near-zero junk hit, it anchors the lower bound and compresses all meaningful scores into a narrow band near 1.0. The ranking inside that band is then dominated by floating-point rounding, not retrieval quality.

RRF discards scores entirely. The fact that memory A was rank 1 in BM25 carries exactly one piece of information: BM25 thinks it is the best match for this query in this retriever. That information is commensurable with "rank 1 in cosine similarity." Ordinal position is a universal currency.

The k constant: what the spread numbers mean

k is a smoothing constant. It controls how much extra weight the top of any list receives relative to deeper positions. Small k → steep dropoff, rank-1 dominates. Large k → flat distribution, deeper results contribute meaningfully.

The Cormack et al. paper tested k ∈ {10, 60, 200} on TREC data. k = 60 was the empirical sweet spot. Here is what the spread actually looks like:

k = 10:

  • Rank 1: 1/11 ≈ 0.091
  • Rank 5: 1/15 ≈ 0.067
  • Rank 10: 1/20 = 0.050
  • Rank 50: 1/60 ≈ 0.017
  • Spread (rank 1 / rank 50): 5.45×

k = 60:

  • Rank 1: 1/61 ≈ 0.0164
  • Rank 5: 1/65 ≈ 0.0154
  • Rank 10: 1/70 ≈ 0.0143
  • Rank 50: 1/110 ≈ 0.0091
  • Spread (rank 1 / rank 50): 1.8×

k = 200:

  • Rank 1: 1/201 ≈ 0.00498
  • Rank 5: 1/205 ≈ 0.00488
  • Rank 10: 1/210 ≈ 0.00476
  • Rank 50: 1/250 = 0.00400
  • Spread (rank 1 / rank 50): 1.24×
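These tables are three lines of code to regenerate, which is worth doing when debating a k value:

```python
for k in (10, 60, 200):
    r1, r50 = 1 / (k + 1), 1 / (k + 50)
    print(f"k={k}: rank1={r1:.4f}  rank50={r50:.4f}  spread={r1 / r50:.2f}x")
# k=10:  rank1=0.0909  rank50=0.0167  spread=5.45x
# k=60:  rank1=0.0164  rank50=0.0091  spread=1.80x
# k=200: rank1=0.0050  rank50=0.0040  spread=1.24x
```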

At k = 10, a rank-1 result in any single retriever is almost unbeatable. It receives 5.45× the contribution of a rank-50 result. If you have three retrievers and memory A appears at rank 1 in just one of them while memory B appears at rank 30 in all three, B's total score is 3 × 0.025 = 0.075 and A's score is 0.091 + 2 × 0 = 0.091. A wins — barely. Lowering k further tilts this toward single-strong-signal dominance.

At k = 60, the same scenario inverts. B scores 3 × 1/90 ≈ 0.033 and A scores 0.0164. B wins by 2×. This is the consensus property: three retrievers agreeing at a middling rank beats one retriever firing confidently on a single signal. For mixed queries — most real queries — this is what you want.

At k = 200, the spread is so flat (1.24×) that rank differences are nearly meaningless: a rank-50 result contributes four-fifths as much as a rank-1 result. You would set this only when your retrievers are extremely noisy and you want contribution breadth to dominate rank quality. Practically, this means you trust the union of retriever outputs more than the internal ordering of any single one.

When to deviate from k = 60:

Use k = 10–20 when you have one highly reliable retriever — typically semantic search on a well-tuned embedding model — and the others serve as weak signals or tie-breakers. The strong retriever's rank-1 result will dominate, which is correct behavior when you trust it.

Use k = 100–200 when all retrievers are noisy (e.g., graph traversal in a sparsely connected entity graph, or BM25 on short, low-vocabulary memories). High k dilutes individual rank quality and rewards breadth of coverage across lists.

The default k = 60 works well across a wide range of query types and retriever qualities. Most teams should not tune k before first tuning which retrievers participate and what per-retriever weights to assign.

Per-retriever weights: when to use them

The default weight for every retriever is 1.0, meaning all signals are treated equally. This is a reasonable prior for a new deployment where you have no eval data. Once you have a labeled set — even 200 query–relevant-memory pairs — you can start tuning weights.

Recall's default weights by retriever:

Retriever              Default weight   Notes
Semantic (R1)          1.0              Baseline for all queries
BM25 (R2)              1.0              Equal to semantic
Entity graph (R3)      1.0              Equal weight; boosted for entity-centric queries
Temporal (R4)          1.0              Equal weight; boosted for time-heavy domains
Graph traversal (R5)   0.8              Multi-hop results have inherently lower confidence

Graph traversal is down-weighted to 0.8 for a specific reason: multi-hop paths add noise. A direct edge A→B is a strong relational signal. A two-hop path A→C→B is weaker — C might be a shared context that connects many nodes, making the association coincidental. At three hops, the relevance signal is typically diluted to noise. Weighting graph traversal at 0.8 instead of 1.0 means that even a rank-1 multi-hop result contributes 0.8/61 ≈ 0.013 instead of 0.016 — a 20% haircut that reflects lower confidence without excluding the signal entirely.

When to boost semantic weight to 2.0: Dense paraphrase queries — where the user's language is stylistically different from the stored memory text but semantically equivalent — are pure semantic retrieval problems. BM25 and the entity graph contribute noise. Setting semantic weight to 2.0 and others to 1.0 means a rank-1 semantic hit scores 2/61 ≈ 0.033, which beats any combination of lower-quality signals. The query optimizer does this automatically when query analysis detects a paraphrase pattern (high embedding confidence, low entity match rate).

When to boost temporal weight to 1.5: Time-heavy domains — customer support, meeting notes, incident logs — tend to have queries where recency is part of the relevance criteria. A memory from six months ago about the same topic is less useful than one from last week, even if semantically closer. Temporal weight 1.5 lets the range-index results punch above their default weight. Note that temporal weighting at the RRF level is not the same as the freshness decay applied later in the pipeline — temporal weight determines how much the temporal retriever's rank contributes to fusion; freshness decay applies a multiplicative penalty post-fusion.

How to calibrate on an eval set: Treat weights as hyperparameters. Fix k at 60. For each candidate weight configuration, compute nDCG@10 and Recall@20 on your labeled set. Grid search over the weight space is cheap — RRF is sub-millisecond to compute, so you can evaluate thousands of configurations in seconds. The best configuration for your deployment is query-distribution-dependent. A coding assistant will look different from a customer-success agent.
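A sketch of that loop, reusing `rrf_fuse` from the formula section; the eval-set shape and grid values are assumptions, not a prescribed format:

```python
import itertools

def recall_at(fused, relevant, k=20):
    """Fraction of labeled relevant IDs present in the top-k fused results."""
    top = {item for item, _ in fused[:k]}
    return len(top & relevant) / len(relevant)

def grid_search_weights(eval_set, grid=(0.5, 0.8, 1.0, 1.5, 2.0), n_retrievers=5):
    """eval_set: list of (ranked_lists, relevant_ids) pairs, one per labeled query."""
    best_weights, best_score = None, -1.0
    for weights in itertools.product(grid, repeat=n_retrievers):  # 5^5 = 3125 configs
        score = sum(
            recall_at(rrf_fuse(lists, weights=list(weights), k=60), relevant)
            for lists, relevant in eval_set
        ) / len(eval_set)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```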

Worked fusion example

Three retrievers, five candidate memories, k = 60, all weights = 1.0. The query is "what did I tell the team about the API rate limits?"

Memory   Semantic rank   BM25 rank   Graph rank
A        1               absent      3
B        5               1           absent
C        2               2           5
D        absent          4           1
E        10              3           2

Compute RRF scores at k = 60:

Memory A: 1/61 + 0 + 1/63 = 0.01639 + 0.01587 = 0.03226

Memory B: 1/65 + 1/61 + 0 = 0.01538 + 0.01639 = 0.03177

Memory C: 1/62 + 1/62 + 1/65 = 0.01613 + 0.01613 + 0.01538 = 0.04764

Memory D: 0 + 1/64 + 1/61 = 0.01563 + 0.01639 = 0.03202

Memory E: 1/70 + 1/63 + 1/62 = 0.01429 + 0.01587 + 0.01613 = 0.04629

Final ranking: C (0.0476) > E (0.0463) > A (0.0323) > D (0.0320) > B (0.0318).

The result is instructive. Memory A was rank 1 in semantic — the strongest single-retriever signal in the set — but it finished third. Memory C, which was rank 2 in two retrievers and rank 5 in a third, won decisively. Memory E, which was never ranked first in any retriever, finished second by appearing consistently across all three lists. This is the consensus property in practice: triangulated moderate confidence beats single-source high confidence.

Memory D demonstrates absent-list behavior. It received no semantic contribution, but its rank 1 in graph and rank 4 in BM25 put it level with memory A. If the query had a strong relational component — "what did I tell the team" is indeed a relational query involving an entity "team" — the entity graph might warrant a weight boost to 1.5, which would push D into second place.
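The whole table reduces to a few lines; ranks are copied from above, and weights are omitted because they are all 1.0:

```python
k = 60
ranks = {  # memory: (semantic, bm25, graph); None marks absence from a list
    "A": (1, None, 3),
    "B": (5, 1, None),
    "C": (2, 2, 5),
    "D": (None, 4, 1),
    "E": (10, 3, 2),
}
scores = {m: sum(1 / (k + r) for r in rs if r is not None) for m, rs in ranks.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# C 0.0476 > E 0.0463 > A 0.0323 > D 0.0320 > B 0.0318 (rounded)
```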

Absent memories and the missing-retriever case

When a memory appears in only one retriever's list, its RRF score is exactly w/(k + rank). At default weights and k = 60:

  • Rank 1 in one retriever, absent from all others: 1/61 ≈ 0.0164
  • Rank 3 in all three retrievers: 3 × 1/63 ≈ 0.0476

A memory at rank 1 in exactly one retriever cannot beat a memory at rank 3 in all three retrievers, no matter how high the single-retriever confidence. This is a deliberate property. Individual retrievers can overfit to surface features: BM25 will rank high any memory that contains the exact query tokens, even if the memory is semantically irrelevant. Semantic search will rank high any memory in the same embedding neighborhood, even if it is factually disconnected. The fusion formula penalizes outlier signals naturally.

The missing-retriever case — where fewer than the full set of retrievers return results — is distinct from the absent-memory case. If the entity graph retriever is skipped entirely (the query optimizer decided it would not contribute), then no memory receives an entity graph contribution, and the RRF scores collapse to a sum over the remaining retrievers. This is correct behavior: the formula degrades gracefully. Absolute scores are lower across the board, but the ranking implied by the remaining lists is unaffected.

RRF in the full retrieval pipeline

RRF is not the first step and not the last. It sits in the middle of a multi-stage pipeline. The full flow for a single query:

  1. Query analysis. Parse intent, extract entities, detect temporal anchors, classify query type.
  2. Retriever fan-out. Five retrievers run in parallel. Each returns a ranked list of up to 50 candidates. Total latency: max(individual latencies) ≈ 12ms p99.
  3. RRF fusion. Merge all lists into one unified ranking. Input: up to 5 × 50 = 250 candidates (with deduplication). Output: top-100 by RRF score. Compute time: sub-millisecond.
  4. Policy weights. Apply weight(m) = RRF_score × freshness(age(m)) × access_boost(m). Freshness is a decay function; access_boost is a logarithmic repetition multiplier. Both are described in their own pages; a stand-in sketch follows this list. This step re-ranks the top-100 into a final ordering.
  5. Top-K selection. Keep the top K memories (typically 20–50) for the next stage.
  6. Optional rerank. A cross-encoder scores each of the top-K memories against the full query. This is 15–30ms p99 on commodity hardware. Not every query justifies this cost.
  7. Top-N to context. The final N memories (typically 5–15) are injected into the prompt.
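A sketch of step 4; the exponential decay and log1p boost here are illustrative stand-ins for the freshness and access-boost functions described on their own pages:

```python
import math

def apply_policy_weights(fused, age_days, access_count, half_life_days=30.0):
    """Re-rank (memory_id, rrf_score) pairs with post-fusion multipliers."""
    def freshness(age):
        return 0.5 ** (age / half_life_days)  # assumed decay: halves every 30 days

    def access_boost(n):
        return 1.0 + math.log1p(n)  # assumed logarithmic repetition multiplier

    rescored = [
        (m, s * freshness(age_days[m]) * access_boost(access_count.get(m, 0)))
        for m, s in fused
    ]
    return sorted(rescored, key=lambda kv: -kv[1])
```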

RRF handles recall: it ensures that relevant memories from any retriever are visible to the policy stage. Freshness and access boost handle temporal and behavioral biases. The reranker handles precision: it removes semantically close but factually wrong memories from the final set. Each stage has a different job, and they are not substitutes for each other.

The position of RRF — after fan-out, before policy weights — is important. It means per-memory policy adjustments operate on an already-fused score, not on individual retriever scores. This keeps the policy logic simple: one multiplier per memory, not five.

Reranking vs fusion: complementary, not alternatives

A common mistake is to reach for a reranker when fusion quality is the actual problem. Rerankers are expensive (cross-encoder inference) and can only improve precision within the set they receive. If the correct memory is at rank 120 in the fused list and you only pass top-50 to the reranker, the reranker cannot help.

Fusion solves recall. Its job is to ensure the right memories are present somewhere in the top-K. If fusion is miscalibrated — wrong k, wrong weights, missing retriever — the correct memories may not make it into the top-K at all. No amount of reranking fixes that.

Reranking solves precision. Its job is to push correct memories to the top of the top-K and push incorrect ones down. A reranker sees the full query and each memory together, which lets it detect subtle incoherence — a memory that matches the query topic but refers to a different person, a different time, or a different context. RRF cannot see this because it only has ranks, not content.

The operational rule: invest in fusion quality first. Measure Recall@20 (does the correct memory appear in the top 20 fused results?) before measuring nDCG@5 (is the correct memory in the top 5?). If Recall@20 is below 85%, fix fusion — add a retriever, tune weights, or lower k to favor more reliable retrievers. Once Recall@20 is satisfactory, add a reranker to improve nDCG@5.

Running a reranker on a list with poor recall is expensive waste: you spend 20–30ms reordering irrelevant memories.

Failure modes in depth

Truncated lists. The most common production failure. BM25 returns 20 results; semantic returns 200. Semantic candidates at ranks 21–200 see no BM25 competition. Their RRF scores accumulate only the semantic contribution, which means they cannot beat any BM25-present memory — even if they are actually more relevant. The correct fix is to truncate all lists to the same K before fusion. Choose K conservatively: if the weakest retriever reliably returns 50 results, use K = 50 as the truncation depth for all retrievers. This ensures the playing field is level.

Tie ranks. When a retriever produces multiple results with identical scores, how you assign ranks matters. Gap rank: if three results are tied for rank 5, assign them ranks 5, 6, 7. Dense rank: assign all three rank 5. Dense rank is correct for RRF because gap rank artificially inflates the denominator for tied results, reducing their contribution. With gap rank, a result tied for rank 5 might be assigned rank 7 and receive 1/67 instead of 1/65. That 3% difference is an artifact of tie-breaking, not a signal about relevance.
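A dense-rank assignment sketch, assuming each retriever hands back (item, score) pairs sorted best-first:

```python
def dense_ranks(scored):
    """1-indexed dense ranks: tied scores share a rank, no gaps after ties."""
    ranks, rank, prev = {}, 0, None
    for item, score in scored:
        if score != prev:  # advance the rank only when the score changes
            rank += 1
            prev = score
        ranks[item] = rank
    return ranks

dense_ranks([("a", 9.1), ("b", 7.0), ("c", 7.0), ("d", 7.0), ("e", 6.2)])
# {'a': 1, 'b': 2, 'c': 2, 'd': 2, 'e': 3}
```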

Single-retriever queries. When only one retriever returns results — either because the others were skipped or returned nothing — RRF collapses to that retriever's ordering with contributions w/(k+rank). This is mathematically correct but operationally important: you lose the consensus benefit entirely. Single-retriever quality is now the ceiling. If your semantic retriever returns no results for a rare-identifier query, and BM25 is the only retriever that fires, the fused output is only as good as BM25. The query optimizer should be aware of this: if it skips multiple retrievers, it should lower confidence in the output and potentially widen context injection.

Score collapse. If all retrievers return the same results in the same order, RRF amplifies the agreement rather than resolving it. The top-1 result gets contributions from every retriever, making it overwhelmingly dominant. This can happen when all retrievers share a common indexing structure (e.g., all search the same embedding space with different distance functions). True diversity of retrieval mechanism is a prerequisite for fusion to add value over any single retriever.

k tuned on the wrong distribution. k affects how much rank depth matters. If you tune k on a benchmark where queries are short and results are abundant, you may overfit to a high-diversity regime. Deployed queries in a narrow-domain assistant tend to have lower diversity — the same five memories appear at the top of every retriever, and k choice has little effect. Benchmark your k on queries sampled from the actual deployment distribution.

Calibrating RRF in practice

What to measure. The two primary metrics for fusion quality are:

  • Recall@K: fraction of labeled relevant memories that appear in the top-K fused results. Use K = 20. This measures whether fusion is doing its job: getting relevant memories into the candidate set.
  • nDCG@10: normalized discounted cumulative gain at rank 10. This measures ranking quality — are relevant memories near the top or buried?

Measure both because they catch different failures. Low Recall@20 means fusion is missing relevant memories outright. High Recall@20 but low nDCG@10 means the memories are present but ranked poorly — a job for weight tuning or k adjustment, not retriever addition.
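Recall@K appeared in the weight-tuning sketch earlier; binary-relevance nDCG@10 is similarly small:

```python
import math

def ndcg_at_k(fused_ids, relevant, k=10):
    """Binary-relevance nDCG@k: gain 1 if a result is labeled relevant, else 0."""
    dcg = sum(1 / math.log2(i + 2)  # i=0 is rank 1, discounted by log2(rank + 1)
              for i, m in enumerate(fused_ids[:k]) if m in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0
```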

What bad RRF looks like. The most common failure signature is retriever dominance: one retriever contributes to nearly all top-5 results across a diverse query set. Log which retriever is the primary contributor to each of the top-5 fused results. Run this over 100 representative queries. If semantic contributes to >80% of top-5 slots and entity graph contributes to under 5%, you have a dominance problem. Either the entity graph retriever is misconfigured, the weight is too low for your query distribution, or the graph itself is sparse in ways that matter for your domain.

How to diagnose. For each query in your eval set, record: which retrievers returned a result for each top-5 memory, what rank each retriever assigned it, and what the final RRF score was. Build a heat map: rows are query types (temporal, relational, lexical, semantic), columns are retrievers, cells are average contribution to top-5 score. Outliers in this heat map indicate miscalibration.
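A sketch of that bookkeeping; the log shape (one dict per query, mapping each top-5 memory to its per-retriever contributions) is an assumption:

```python
from collections import Counter

def dominance_report(per_query_logs):
    """per_query_logs: one dict per query, {memory_id: [(retriever, contribution)]}.
    Returns each retriever's share of 'primary contributor' slots in the top-5."""
    primary, slots = Counter(), 0
    for contributions in per_query_logs:
        for parts in contributions.values():
            winner = max(parts, key=lambda p: p[1])[0]  # largest RRF contribution
            primary[winner] += 1
            slots += 1
    return {name: count / slots for name, count in primary.items()}
```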

When to change k vs change weights. k controls the global aggressiveness of rank preference. Weights control the relative importance of individual retrievers. Change k when you want to shift how much any strong rank signal matters across all retrievers. Change weights when specific retrievers are under- or over-contributing for your query distribution. In practice: fix k = 60 as the default; only adjust it after you have measured that a specific retriever quality distribution warrants it. Adjust weights first — they have more targeted effect and are easier to reason about.

Tie-breaking. When two memories have identical RRF scores — rare but possible when they appear in the same retrievers at the same ranks with the same weights — the system must be deterministic. Recall breaks ties by memory ID (lexicographic order). This is arbitrary but stable across requests, which prevents non-deterministic results in evaluation and in production.
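In code, that determinism is one compound sort key (descending score, then ascending ID):

```python
ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))  # score desc, ID asc
```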

Why it works

  • Score-free. No normalization step means no calibration bug. Drop in a new retriever; nothing else changes.
  • Top-heavy. The reciprocal weighting puts heavy weight on rank 1, less on rank 2, almost nothing past rank 50. This matches the intuition that the top of a list carries most of the signal.
  • Composable. Add retrievers without rebalancing. The k constant tunes the aggressiveness; each retriever can also carry an optional weight w_i.
  • Consensus-rewarding. A memory at rank 10 in three retrievers beats a memory at rank 1 in one. Agreement across independent signals is a stronger indicator of relevance than any single strong signal.
  • Deterministic. Given the same ranked lists, k, and weights, RRF always produces the same output. Tie-breaking by memory ID makes the full pipeline reproducible.

We propose Reciprocal Rank Fusion (RRF), a simple method for combining the document rankings from multiple IR systems.

— Cormack, Clarke, and Büttcher, "Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods" (SIGIR 2009)
