Five Retrievers Are Better Than One

By Arc Labs Research · 11 min read

Memory retrieval is not a single problem. Asking "what's my badge ID?" is a precision query — one specific lexical match. Asking "what did I do last Tuesday?" is a temporal slice over the event log. Asking "who reports to Sarah?" is a one-hop walk on the entity graph. A single retriever optimized for any one of these will fail the others.

RRF fusion · 5 retrievers, 1 ranked output
Each retriever scores a different facet. RRF combines them without score normalization.

The five retrievers

  • Semantic (dense). Embedding-based cosine similarity over all memories. Catches paraphrases, conceptual neighbors. Misses out-of-vocabulary terms and rare identifiers.
  • Lexical (BM25). Term-frequency / inverse-document-frequency over the memory text. Catches exact matches, rare terms, identifiers. Misses paraphrases.
  • Entity graph. Multi-hop traversal from query-mentioned entities. Catches relational queries that have no lexical or semantic anchor in the memory text itself.
  • Temporal. Range index over the event timestamps. Catches time-anchored queries ("last week", "since I joined").
  • Type filter. Hard filter to the memory type the query implies. "What should I configure?" → preferences only. "When did X happen?" → events only.

Why fuse rather than route

A naive design uses a router: classify the query, pick the best retriever, run it. This works most of the time and fails badly the rest. Real queries are usually mixed. "What did I tell Sarah about the Berlin office last week?" is temporal, relational, and lexical at once. Fusion handles mixed queries gracefully — every retriever contributes its top results, and the fusion weights determine which signal dominates. Items that appear in multiple lists get a natural boost without any explicit cross-retriever logic.

Cost and latency

Running five retrievers in parallel costs more compute, but the latency is only that of the slowest one — fan-out is the right shape for memory retrieval. Typical p99 numbers on a 10M-memory store:

  • Semantic (HNSW): 8ms.
  • Lexical (Tantivy / pg_trgm): 12ms.
  • Entity graph (1-hop): 4ms.
  • Temporal (BTREE range): 2ms.
  • Type filter: 1ms (combined with above).

Total fan-out latency: max ≈ 12ms. Plus fusion (sub-ms) and optional rerank (typically 15–30ms for top-50). End-to-end memory retrieval below 50ms p99 on commodity hardware.
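As a rough sketch of the fan-out shape — not the production code — the whole stage can be four concurrent index scans awaited together. The retriever names and signatures below are illustrative stand-ins:

```rust
// Illustrative stubs: in the real system each retriever hits its own index.
async fn semantic(_q: &str) -> Vec<u64> { vec![] }
async fn lexical(_q: &str) -> Vec<u64> { vec![] }
async fn entity_graph(_q: &str) -> Vec<u64> { vec![] }
async fn temporal(_q: &str) -> Vec<u64> { vec![] }

#[tokio::main]
async fn main() {
    let q = "what did I tell Sarah about the Berlin office last week?";
    // All scans run concurrently, so wall-clock latency is the max of the four,
    // not the sum; the type filter is applied inside each retriever's query.
    let (sem, lex, graph, temp) =
        tokio::join!(semantic(q), lexical(q), entity_graph(q), temporal(q));
    println!("{} {} {} {}", sem.len(), lex.len(), graph.len(), temp.len());
}
```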

When the query optimizer can skip retrievers

Not every query benefits from every retriever. A pure temporal query ("yesterday") gets nothing useful from BM25 — the retriever returns noise that fusion has to overcome. The query optimizer selectively skips retrievers when the query signature predicts low contribution. This halves p99 on simple queries without affecting quality.

How the semantic retriever works

Dense vector search works by embedding the query with the exact same model used at write time, then computing cosine similarity between the query vector and every stored memory vector. The model-at-query-time must match the model-at-write-time precisely — this is not a soft requirement. If memories were written with text-embedding-3-small and the query is embedded with text-embedding-ada-002, the cosine similarities are meaningless garbage: the two models occupy geometrically incompatible spaces, and there is no transformation between them. This failure mode is silent — the retriever returns results, they just have no relationship to actual relevance. Model version pinning in the namespace configuration exists specifically to prevent this.

Cosine similarity and dot product are often confused. For unnormalized vectors, they differ significantly. For L2-normalized vectors (which most embedding APIs return by default), cosine similarity and dot product are equivalent since cos(θ) = (a·b) / (|a||b|) reduces to a·b when both vectors have unit norm. Memory embeddings are stored pre-normalized. Cosine is preferred as a conceptual framing because it measures angular distance in the embedding space, making the score independent of vector magnitude — two vectors pointing in the same direction but with different magnitudes score 1.0, which is semantically correct.
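A minimal illustration of that equivalence, assuming nothing about the embedding model beyond unit-norm output:

```rust
/// Plain dot product; equals cosine similarity when both vectors are unit-norm.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Full cosine similarity for arbitrary (unnormalized) vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let norm = |v: &[f32]| v.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot(a, b) / (norm(a) * norm(b))
}

fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let n = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / n).collect()
}

fn main() {
    let a = l2_normalize(&[0.3, 1.2, -0.5]);
    let b = l2_normalize(&[0.1, 0.9, -0.7]);
    // For unit-norm vectors the two numbers agree (up to float error).
    println!("cosine = {:.6}, dot = {:.6}", cosine(&a, &b), dot(&a, &b));
}
```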

HNSW (Hierarchical Navigable Small World) answers the query without comparing against every stored vector. At query time, HNSW maintains a dynamic candidate list of size ef_search. The search starts at a high-level entry point in the graph, greedily traverses to closer neighbors, and expands the candidate list as it descends layers until it reaches layer 0, where the final ef_search candidates are evaluated and the top-K are returned. The word "approximate" here has a specific meaning: at any finite ef_search, there is a nonzero probability that an exact nearest neighbor is not in the returned set. At ef_search=64 over a 10M-vector store, roughly 5% of exact top-10 matches are missed — almost always the 8th through 10th nearest neighbors, not the top-1 or top-2. Increasing ef_search raises recall at the cost of latency; the curve is concave: the first doublings of ef_search buy significant recall gains and later doublings buy very little.

The practical score range for cosine similarity is [-1, 1] in theory, but memory embeddings from conversational data cluster in [0.5, 1.0] for relevant content and [0.2, 0.5] for unrelated content. Negative cosine similarities are essentially impossible in this domain because all memory text embeddings share a positive bias toward everyday language tokens. This clustering matters for score-weighted fusion: a cosine of 0.95 and a cosine of 0.60 are both "relevant" by human judgment, but the fusion formula weight × sqrt(score) distinguishes them: sqrt(0.95) = 0.975 versus sqrt(0.60) = 0.775. The semantic retriever contributes its full weight only for very high-confidence matches.

Semantic retrieval fails in three identifiable patterns. First, out-of-vocabulary proper nouns: an internal codename like "Project Kestrel" that never appeared in the embedding model's training data has no stable geometric neighborhood — any memory mentioning it scores against a random-ish vector, so results are dominated by whatever happened to be in the training data near the tokens "project" and "kestrel" in isolation. Second, rare identifiers: employee IDs like "E-47821", ticket numbers like "JIRA-9921", or badge IDs have high lexical specificity but low semantic signal — the embedding model sees a number and produces a vector near other numbers, not near the semantic content of the memory that mentioned it. Third, vocabulary divergence without paraphrase: a query asking "what's my stance on technical debt?" fails if the relevant memory says "I always push back on shortcuts in the codebase" — the concepts align but the tokens don't overlap paraphrastically enough for the embedding to bridge the gap. These are the cases BM25 and entity-graph retrieval exist to cover.

How the lexical retriever works

BM25 over PostgreSQL uses a GIN index on to_tsvector(memory_content) to answer full-text queries without scanning every row. The GIN index pre-computes the posting lists — for each lexeme (stemmed term), the set of memory row IDs that contain it. A query for "kestrel" hits the GIN index directly, retrieves the posting list for that lexeme in microseconds, and ranks results using ts_rank_cd. Strictly speaking, ts_rank_cd implements cover-density ranking rather than BM25 proper, but with length normalization enabled (flag 2 divides the raw score by the document length, measured in lexemes) it serves the same purpose here: long memories cannot dominate purely by volume of token matches.

The memory schema exposes two text fields to the lexical retriever: the predicate field (a short structured phrase summarizing the memory, e.g. "badge ID is 47821") and the content field (the full memory text). These receive different weights in the tsvector: predicate is weighted 'A', content is weighted 'B'. PostgreSQL's default weighting scheme assigns rank multipliers of 0.1 (D), 0.2 (C), 0.4 (B), and 1.0 (A). Because predicates are short and high-signal, a term appearing in the predicate gets a 2.5× multiplier over a term appearing only in the content body. A query for "badge ID" matching the predicate "badge ID is 47821" therefore outranks a memory whose only match is a tangential mention buried in a long conversation transcript.

IDF behavior is the mechanism that makes BM25 good at exactly what semantic search is bad at. IDF (inverse document frequency) scores each term by how rarely it appears across the full memory corpus: IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5)), where N is the total memory count and df(t) is the number of memories containing term t. A word like "works" might appear in 60% of all memories — the ratio inside the log falls below 1, the raw IDF goes slightly negative, and most implementations floor it at zero, so the term contributes essentially nothing to the final score. A term like "Kestrel" appearing in only 2 of 1M memories gets IDF = log(999998.5 / 2.5) ≈ 12.9. This is why a query for "Project Kestrel" reliably surfaces the two relevant memories even when the semantic retriever distributes attention across hundreds of vaguely related results.
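A sketch of that computation, using the common floor-at-zero variant (the exact handling of common terms is an implementation choice, not something the formula dictates):

```rust
/// BM25 IDF with a floor at zero: terms appearing in more than half the
/// corpus would otherwise produce a negative score.
fn bm25_idf(total_docs: f64, docs_with_term: f64) -> f64 {
    let raw = ((total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5)).ln();
    raw.max(0.0)
}

fn main() {
    // A term in 60% of a 1M-memory corpus: raw IDF is negative, floored to 0.
    println!("{:.3}", bm25_idf(1_000_000.0, 600_000.0));
    // "Kestrel" in 2 of 1M memories: ~12.9.
    println!("{:.1}", bm25_idf(1_000_000.0, 2.0));
}
```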

IDF scores are computed at index time and encoded in the GIN index. When a user's domain changes significantly — for example, when their company acquires a new subsidiary and a new vocabulary of project names, product lines, and organizational terms appears — the existing IDF scores become stale. The new terms accumulate high document frequencies as they appear across many memories, but the IDF denominator is based on the pre-acquisition corpus. The symptom is that newly common terms rank too high (old IDF) and domain-specific rare terms rank unpredictably. The weekly vocabulary Jaccard check compares the current top-5000 lexemes by frequency against the lexemes from the previous week. When Jaccard similarity drops below a threshold (typically 0.7), a REINDEX on the GIN index is scheduled during the next low-traffic window. Reindexing recalculates IDF across the current corpus and restores correct ranking behavior.
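The drift check itself is a small set computation; a sketch, assuming the top-lexeme lists for the two weeks have already been extracted:

```rust
use std::collections::HashSet;

/// Jaccard similarity between two lexeme vocabularies (e.g. this week's
/// top-5000 lexemes by frequency vs. last week's).
fn vocab_jaccard(current: &HashSet<String>, previous: &HashSet<String>) -> f64 {
    let intersection = current.intersection(previous).count() as f64;
    let union = current.union(previous).count() as f64;
    if union == 0.0 { 1.0 } else { intersection / union }
}

/// Schedule a REINDEX when vocabulary drift pushes similarity below the threshold.
fn should_reindex(current: &HashSet<String>, previous: &HashSet<String>, threshold: f64) -> bool {
    vocab_jaccard(current, previous) < threshold
}

fn main() {
    let prev: HashSet<String> = ["kestrel", "badge", "berlin"].iter().map(|s| s.to_string()).collect();
    let curr: HashSet<String> = ["kestrel", "badge", "subsidiary"].iter().map(|s| s.to_string()).collect();
    // 2 shared lexemes out of 4 total -> 0.50, below a 0.7 threshold.
    println!("jaccard = {:.2}, reindex = {}", vocab_jaccard(&curr, &prev), should_reindex(&curr, &prev, 0.7));
}
```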

BM25 also fails in identifiable ways. The most common failure is paraphrase: a query for "what's my take on technical debt?" will not match a memory that says "I always advocate for refactoring over patching" because no token overlaps. BM25 has no concept of semantic similarity — it operates on lexemes after stemming and stop-word removal, and "advocate" has no overlap with "take" after that process. This is not a tuning failure; it is a fundamental property of term-frequency retrieval. The solution is not to fix BM25 but to run it in parallel with semantic retrieval so that each covers the other's blind spots.

How the entity graph retriever works

The entity graph retriever operates differently from the other four: it does not scan a vector or text index. Instead, it traverses a graph where nodes are entities (people, places, organizations, projects, concepts) and edges are relationships between them (reports-to, works-on, mentioned-with, part-of). The traversal starts from entity references extracted during Stage 1 NLP parsing. The ParsedQuery struct's entity_refs field contains the resolved entities — "Sarah" resolved to the canonical entity ID for Sarah Chen, "Berlin office" resolved to the location entity for the Berlin branch. Each resolved entity becomes a traversal root.
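For orientation, here is a sketch of the parsed-query shape the retrievers consume. The field names (entity_refs, temporal_window, type_hints) come from the text above; the surrounding types and values are assumptions for illustration:

```rust
use chrono::{DateTime, Duration, Utc};

#[derive(Debug, Clone)]
enum MemoryType { Fact, Event, Preference, Entity }

#[derive(Debug, Clone)]
struct EntityRef {
    canonical_id: u64,          // e.g. "Sarah" resolved to Sarah Chen's entity ID
    resolution_confidence: f32, // used when resolution is ambiguous
}

#[derive(Debug, Clone)]
struct ParsedQuery {
    raw: String,
    entity_refs: Vec<EntityRef>,                              // Stage 1 NLP parse
    temporal_window: Option<(DateTime<Utc>, DateTime<Utc>)>,  // Stage 1 NLP parse
    type_hints: Vec<MemoryType>,                              // filled by the Stage 2 optimizer
}

fn main() {
    let q = ParsedQuery {
        raw: "what did I tell Sarah about the Berlin office last week?".into(),
        entity_refs: vec![EntityRef { canonical_id: 1, resolution_confidence: 0.93 }],
        temporal_window: Some((Utc::now() - Duration::days(7), Utc::now())),
        type_hints: vec![],
    };
    println!("{q:?}");
}
```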

Hop scoring assigns a base confidence to memories discovered at each graph distance from a query entity. The scale is: score(0-hop) = 1.0 (the memory directly mentions the queried entity), score(1-hop) = 0.6 (the memory mentions an entity one edge away), score(2-hop) = 0.35, score(3-hop) = 0.15. The exponential decay is intentional — at 3 hops, the relational connection is weak enough that the memory is likely noise. Most production queries cap traversal at 2 hops; 3-hop traversal is only enabled for explicit multi-hop queries detected by the optimizer (queries containing "who does X work with who also knows Y").

Each edge carries additional modifiers that adjust the base hop score. Edge type matters: structural edges (org hierarchy, formal team membership) use a type_prior of 1.0 — these are high-confidence, slow-changing relationships. Semantic edges (co-mentioned entities, inferred associations) use type_prior = 0.9 — slightly less reliable. Lifecycle edges (relationships that have a defined end date, like a former manager relationship) use a freshness multiplier of 0.3 when valid_to < now. The final score per memory is hop_score × edge_confidence × edge_freshness × edge_type_prior. A memory discovered via a 1-hop structural edge with confidence 0.95 scores 0.6 × 0.95 × 1.0 × 1.0 = 0.57. The same memory reached via a 2-hop expired lifecycle edge scores 0.35 × 0.95 × 0.3 × 0.3 = 0.030 — effectively noise that RRF will bury.
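A sketch of the scoring described above; the numbers mirror the two worked examples, and the struct shape is illustrative:

```rust
/// Base confidence by graph distance from a query entity (0-hop = direct mention).
fn hop_score(hops: u8) -> f32 {
    match hops {
        0 => 1.0,
        1 => 0.6,
        2 => 0.35,
        _ => 0.15,
    }
}

/// Edge modifiers: per-edge confidence, a type prior (1.0 structural, 0.9 semantic),
/// and a freshness penalty (0.3 when the edge's valid_to is in the past).
struct Edge {
    confidence: f32,
    type_prior: f32,
    freshness: f32,
}

fn graph_score(hops: u8, edge: &Edge) -> f32 {
    hop_score(hops) * edge.confidence * edge.freshness * edge.type_prior
}

fn main() {
    // 1-hop structural edge, confidence 0.95 -> 0.57.
    let fresh = Edge { confidence: 0.95, type_prior: 1.0, freshness: 1.0 };
    println!("{:.2}", graph_score(1, &fresh));
    // 2-hop expired lifecycle edge (the worked example applies 0.3 to both
    // the freshness and type factors) -> ~0.030, effectively noise.
    let stale = Edge { confidence: 0.95, type_prior: 0.3, freshness: 0.3 };
    println!("{:.3}", graph_score(2, &stale));
}
```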

The sparse-graph problem affects new namespaces. A namespace with fewer than 500 memories has a thin graph: few entities have been extracted, edges between them are sparse, and traversal from any query entity reaches at most a handful of memories. In this regime, the graph retriever returns few results and contributes little to the fused ranking — its effective weight via RRF approaches zero because it contributes few items to the merged list. This is a feature, not a bug: it means the retriever degrades gracefully. Above 2,000 memories, graphs begin to develop meaningful connectivity, and the retriever starts contributing non-trivially to relational query results. Above 10,000 memories in a namespace, the graph becomes a primary signal for any query that mentions a named entity.

Entity resolution quality gates the entire retriever. If "Sarah" in the query resolves to the wrong Sarah (name collision between two contacts), the traversal starts from the wrong root and returns irrelevant memories with high scores. The resolution layer uses a combination of mention context (surrounding tokens), prior probability (how often "Sarah" has referred to each candidate in past queries), and structural signals (Sarah's org relationship to the querying user). When resolution confidence is below 0.7, the retriever runs traversal from all candidate entities and merges the results with confidence-weighted scores. This is conservative but avoids the hard failure of committing to the wrong root entity.
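The conservative fallback reduces to a small root-selection step; a sketch, assuming a combined per-candidate confidence has already been computed upstream:

```rust
struct CandidateEntity {
    entity_id: u64,
    confidence: f32, // combines mention context, prior probability, structural signals
}

/// Pick traversal roots: a single root when the best candidate clears the
/// threshold (0.7 in the text), otherwise every candidate, confidence-weighted.
fn traversal_roots(mut candidates: Vec<CandidateEntity>, threshold: f32) -> Vec<(u64, f32)> {
    candidates.sort_by(|a, b| b.confidence.partial_cmp(&a.confidence).unwrap());
    let confident = candidates.first().map_or(false, |best| best.confidence >= threshold);
    if confident {
        vec![(candidates[0].entity_id, 1.0)]
    } else {
        candidates.into_iter().map(|c| (c.entity_id, c.confidence)).collect()
    }
}

fn main() {
    let candidates = vec![
        CandidateEntity { entity_id: 101, confidence: 0.55 }, // Sarah #1
        CandidateEntity { entity_id: 202, confidence: 0.40 }, // Sarah #2
    ];
    // Below the 0.7 threshold: traverse from both roots.
    println!("{:?}", traversal_roots(candidates, 0.7));
}
```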

How the temporal retriever works

Temporal queries are identified and parsed during Stage 1 of the query pipeline, before any retriever executes. The NLP parser produces a temporal_window: Option<(DateTime, DateTime)> — a closed interval if a temporal expression was found, or None if the query contains no detectable time reference. The parser handles three classes of temporal expression. Absolute dates are the easiest: "on March 15th", "in Q3" map to exact calendar intervals with no ambiguity. Relative dates require the current timestamp as an anchor: "last Tuesday" resolves to a specific calendar day, "last week" → [now - 7d, now], "yesterday" → [now - 1d, now], "this month" → [month_start, now]. Fuzzy expressions are the hardest: "recently" maps to [now - 30d, now] by default (configurable per namespace), "a few months ago" maps to [now - 90d, now - 30d] (the interval excludes the last 30 days because "a few months ago" implies something not recent).
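A sketch of window resolution for a few of the expressions above, using the defaults named in the text (30 days for "recently", 30–90 days back for "a few months ago"); the string matching is a stand-in for the real parser:

```rust
use chrono::{DateTime, Duration, Utc};

/// Resolve a temporal expression to a closed interval, or None when the
/// query has no detectable time reference (the retriever is then skipped).
fn resolve_window(expr: &str, now: DateTime<Utc>) -> Option<(DateTime<Utc>, DateTime<Utc>)> {
    match expr {
        "last week" => Some((now - Duration::days(7), now)),
        "yesterday" => Some((now - Duration::days(1), now)),
        "recently" => Some((now - Duration::days(30), now)),
        "a few months ago" => Some((now - Duration::days(90), now - Duration::days(30))),
        _ => None,
    }
}

fn main() {
    let now = Utc::now();
    println!("{:?}", resolve_window("recently", now));
    println!("{:?}", resolve_window("what's my badge ID?", now)); // None
}
```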

When a temporal window is parsed, the retriever executes a BTREE range scan on the event_at index: SELECT memory_id FROM memories WHERE event_at BETWEEN $1 AND $2 AND type = 'Event'. This is a pure index operation — no vector computation, no text scoring. The 2ms p99 latency comes from the BTREE index being small enough to fit in buffer cache on any reasonably sized server. The retriever does not rank within the window by default; it returns all events in the window and lets RRF handle relative ordering (the most relevant events will also rank well in the semantic and lexical retrievers, getting a fusion boost).

Only Event-type memories have a populated event_at field. Fact and Preference memories use created_at as a secondary signal, but this is not the same as the event timestamp — created_at is when the memory was extracted by the pipeline, which can be minutes or hours after the actual conversation turn the memory was derived from. When the temporal window is active and no Event-type memories are found in the window, the retriever does not fall back to created_at on Fact/Preference memories. Those types are handled by the semantic and lexical retrievers, which have no temporal awareness — if the user asks "what did I say about the budget last week?" and the budget memory is typed Fact (because it was extracted as a stable predicate rather than an event), the temporal retriever misses it, but semantic will find it because the embedding for "budget" will match, and lexical will find it if the word "budget" appears in the memory text.

The fuzzy expression parser has one important edge case: "since we last talked." This phrase implies a temporal boundary defined by the previous session, not a fixed interval. The parser looks up the last_session_end_at timestamp from the namespace session log and constructs the window as [last_session_end_at, now]. If no session log entry exists (first session, or session log was purged), the parser returns None and the temporal retriever is skipped. This is the correct behavior — returning all memories since the beginning of time would not be useful.

How the type filter works

The type_hints field in ParsedQuery is not populated by the NLP parser. The NLP parse in Stage 1 performs entity extraction, coreference resolution, and temporal window detection — it does not infer memory type. Type hints are populated exclusively by the query optimizer in Stage 2, after the initial parse completes. This ordering matters: the optimizer has access to namespace statistics (memory type distribution, vocabulary density per type) that the NLP parser does not use.

The optimizer applies rule-based type classification with the following trigger vocabulary. Queries containing "prefer", "like", "want", "setting", "configure", "my default" are classified as Preference queries; type_hints is set to [Preference]. Queries containing "when did", "at what time", "happened", "occurred", "was it", "did I" (past tense event framing) are classified as Event queries; type_hints is set to [Event]. Queries containing "who is", "what is [named entity]", "tell me about [person]" are classified as Entity queries; type_hints is set to [Entity]. Queries that match multiple patterns get multiple type hints. Queries with no pattern matches leave type_hints empty.
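A sketch of that rule-based classification using the trigger vocabulary listed above; a real implementation would match on lemmas rather than raw substrings, and the "what is [named entity]" pattern would use the entity_refs from Stage 1:

```rust
#[derive(Debug, PartialEq)]
enum MemoryType { Preference, Event, Entity }

/// Rule-based type hints; returns an empty vector when nothing matches,
/// in which case no type filter is applied.
fn type_hints(query: &str) -> Vec<MemoryType> {
    let q = query.to_lowercase();
    let mut hints = Vec::new();
    if ["prefer", "like", "want", "setting", "configure", "my default"]
        .iter().any(|t| q.contains(*t)) { hints.push(MemoryType::Preference); }
    if ["when did", "at what time", "happened", "occurred", "was it", "did i"]
        .iter().any(|t| q.contains(*t)) { hints.push(MemoryType::Event); }
    if ["who is", "tell me about"]
        .iter().any(|t| q.contains(*t)) { hints.push(MemoryType::Entity); }
    hints
}

fn main() {
    println!("{:?}", type_hints("what should I configure for my default editor?")); // [Preference]
    println!("{:?}", type_hints("summarize the Kestrel launch plan")); // []
}
```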

When type_hints is non-empty, the type filter executes as a hard pre-filter: the other retrievers (semantic, BM25) run their searches restricted to the type-filtered subset of the memory store. This is implemented at the index level: the vector index partition for each type is kept separate (pgvector partition by type), so semantic search over [Preference] only scans the Preference partition. BM25 uses a WHERE type = ANY($1) clause added to the GIN index query. The net effect is faster execution (smaller search space) and improved precision (results stay on-type). The temporal retriever always runs over Event-type memories regardless of type_hints — its type restriction is hardcoded, not configurable by the optimizer.

The type filter has a known over-constraint failure mode. A query like "what's my setting for the Kestrel project timeout?" will route to Preference because of "setting". But if the relevant memory was extracted as a Fact (because the pipeline classified "timeout is 30 seconds" as a stable predicate, not a preference), the Preference filter excludes it. The result is zero or low-quality results. The mitigation is a widening fallback: when the type-filtered retrieval produces fewer than 5 results, the filter is dropped and the search reruns over all types. The widening adds one retrieval round-trip (roughly 15ms) but prevents the hard failure. The widening threshold of 5 results is configurable per namespace.
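The widening fallback is a thin wrapper around the search call; a sketch, where the closure and its boolean "apply type filter" flag are illustrative:

```rust
/// When the type-filtered retrieval returns fewer than `threshold` results
/// (5 by default, configurable per namespace), drop the filter and rerun.
fn retrieve_with_widening<F>(search: F, threshold: usize) -> Vec<u64>
where
    F: Fn(bool) -> Vec<u64>, // true = apply the type filter
{
    let filtered = search(true);
    if filtered.len() >= threshold {
        filtered
    } else {
        search(false) // one extra round-trip (~15ms), paid only when the narrow search fails
    }
}

fn main() {
    // Hypothetical search: the type-filtered pass comes back thin, so we widen.
    let results = retrieve_with_widening(
        |filtered| if filtered { vec![9] } else { vec![9, 14, 23, 31, 47] },
        5,
    );
    println!("{} results after widening", results.len());
}
```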

Score-weighted RRF

Standard Reciprocal Rank Fusion computes a combined score for each candidate memory as the sum of 1 / (k + rank_i) across all retrievers that returned it, where rank_i is the position of the memory in retriever i's result list and k is a smoothing constant. At k=60, a memory at rank 1 contributes 1/61 ≈ 0.0164; a memory at rank 50 contributes 1/110 ≈ 0.0091 — a 1.8× spread. Standard RRF ignores the actual retriever scores entirely: a memory returned by semantic search with cosine 0.95 and a memory returned with cosine 0.52 contribute identically to the RRF sum if they occupy the same rank. This is deliberate in the original formulation — it avoids the normalization problem of combining scores from incommensurable scales.

Recall uses score-weighted RRF to restore some score signal without reintroducing the normalization problem. The formula is weight_i × sqrt(score_i) / (k + rank_i). The sqrt(score_i) term is key: it moderates the score signal while keeping contributions positive and bounded. A cosine similarity of 0.95 contributes sqrt(0.95) = 0.975 — nearly full weight. A cosine similarity of 0.25 contributes sqrt(0.25) = 0.5 — half weight. This means that a high-rank but low-confidence result from the semantic retriever is penalized relative to a high-rank high-confidence result. The sqrt compresses the dynamic range of scores into a range where rank still dominates but score provides a meaningful secondary signal.

rrf_score(doc) = Σᵢ [ wᵢ × √scoreᵢ(doc) / (k + rankᵢ(doc)) ]

Score-weighted RRF across retrievers i ∈ {semantic, BM25, graph, temporal}.
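In code, the fusion is a single pass over each retriever's ranked list. A sketch, assuming each raw score has already been mapped into [0, 1] (cosine and graph scores are natively; BM25 scores would need a bounded mapping first — an assumption the formula above leaves implicit):

```rust
use std::collections::HashMap;

/// One retriever's output: (memory_id, score in [0, 1]), best first.
type RankedList = Vec<(u64, f64)>;

/// Score-weighted RRF: sum of w_i * sqrt(score_i) / (k + rank_i) over every
/// retriever that returned the memory. Ranks are 1-based.
fn fuse(lists: &[(f64, RankedList)], k: f64) -> Vec<(u64, f64)> {
    let mut fused: HashMap<u64, f64> = HashMap::new();
    for (weight, list) in lists {
        for (rank0, (memory_id, score)) in list.iter().enumerate() {
            let contribution = weight * score.sqrt() / (k + (rank0 + 1) as f64);
            *fused.entry(*memory_id).or_insert(0.0) += contribution;
        }
    }
    let mut out: Vec<(u64, f64)> = fused.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}

fn main() {
    // Semantic and graph lists, both weighted 1.0, k = 60.
    let semantic = vec![(42, 0.95), (7, 0.61)];
    let graph = vec![(7, 0.57), (42, 0.35)];
    for (id, score) in fuse(&[(1.0, semantic), (1.0, graph)], 60.0) {
        println!("memory {id}: {score:.4}");
    }
}
```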

Per-retriever weights reflect the expected signal quality of each retriever. Semantic and BM25 both carry weight 1.0 — they are the primary retrievers and neither is systematically more reliable than the other. Entity graph carries weight 1.0 for direct (0-hop) matches and 0.8 for multi-hop traversal results — multi-hop results have higher noise and warrant a slight penalty. Temporal retriever carries weight 1.0 when a temporal_window was parsed (the query explicitly asked about a time period) and weight 0.5 when no window was found but the retriever was run anyway — the latter is rare and occurs only when the optimizer runs the temporal retriever speculatively on queries mentioning time-adjacent words like "recent" or "latest" that did not produce a parseable window.

The k=60 constant is set for a result list of up to 100 candidates per retriever. Lowering k to 10 produces a 5.4× spread between rank-1 and rank-50 — this over-rewards rank-1 results and makes the fused ranking nearly equivalent to choosing the top result from the best retriever, discarding the ensemble benefit. Raising k to 200 flattens the spread to roughly 1.2× — every result in the list contributes nearly equally regardless of rank, which wastes the ranking information each retriever produces. k=60 is the standard recommendation from the original RRF paper for lists of 50–100 candidates and holds well empirically on memory retrieval benchmarks.

Retriever failure modes and mitigations

Understanding when each retriever fails is as important as understanding how it succeeds. Each failure is triggered by a specific query structure, and each has a concrete mitigation in the fusion architecture.

Semantic fails on exact identifiers. The query "what's my badge ID 47821?" contains a specific number that has no semantic neighborhood in the embedding model's geometry. The embedding for "47821" sits near other number tokens, not near "badge" or "ID" in any meaningful way. The semantic retriever may return memories about badges or IDs in general, not the specific memory containing "47821". BM25 handles this correctly: "47821" appears in the memory predicate "badge ID is 47821" and gets a very high IDF score (a 5-digit number appearing in exactly one memory). BM25 ranks this memory at position 1; semantic may not rank it in the top 20. Fusion surfaces it because BM25's rank-1 contribution of 1/(k+1) — the largest any single retriever can make — is enough to put the memory at the top of the fused list.

BM25 fails on paraphrase and conceptual queries. The query "what's that thing I mentioned that's like an API but for people?" has no useful lexical signal — "API", "people", and "like" are either too common (low IDF) or too general. BM25 returns noise. Semantic retrieval embeds the query and finds memories about interfaces, social graphs, org charts, or network protocols — one of which is likely the intended memory. Here, BM25 contributes nothing to the fused ranking; its results all appear below rank 20 and their RRF contribution is negligible. Semantic dominates, which is the correct behavior.

Entity graph fails with no named entity anchor. The query "what did I say about climate?" contains a concept ("climate") but no specific named entity that can be resolved to a graph node. The entity graph retriever's traversal set is empty; it returns zero results. The retriever contributes nothing to RRF — not noise, just absence. Semantic retrieval handles this well (embedding similarity for climate-related memories) and BM25 handles the exact term "climate". The absence of graph results is benign because the other two retrievers cover the query adequately.

Temporal fails on unparseable relative expressions. The query "what changed since we last talked?" requires looking up the previous session boundary. If the session log is unavailable or corrupted, the parser returns None for temporal_window and the temporal retriever is skipped. The optimizer detects the skip and does not penalize the fused result — it simply means the temporal signal is absent. More problematically, queries like "what's been on my mind lately?" contain implicit temporal framing ("lately") but no parseable expression that maps to a concrete interval. The parser resolves "lately" to the default fuzzy window [now - 30d, now] — this may or may not match the user's intent. When the resolved window produces fewer than 3 Event-type results, the temporal retriever contributes little to fusion regardless.

Type filter over-constrains and costs recall. The query "what's my setting for that thing" routes the type filter to [Preference]. If the relevant memory was extracted as a Fact (because the memory pipeline classified "timeout is configured to 30 seconds" as a stable factual predicate rather than a user preference), it is excluded from the type-filtered search space. The semantic and BM25 retrievers run only over Preference memories and find nothing relevant. The widening fallback triggers: fewer than 5 results → drop type filter → re-run over all types → the Fact memory is found. The cost is one additional retrieval round-trip, paid only when the narrow search fails. For queries where the type filter is correct, no widening occurs and latency is lower than unfiltered retrieval due to the smaller search space.
