Evaluation Methodology

All benchmarks were run on the LongMemEval_s dataset (Li et al., 2025), a curated evaluation suite of 500 multi-turn conversations with ground-truth memory queries. Each conversation spans 5–50 turns; queries require temporal, relational, or preference reasoning over the conversation history.

Test conditions:

  • Each system received identical conversation transcripts
  • Memory was populated from the same source turns for all systems
  • Queries were evaluated blind (no system knew the ground-truth label)
  • Three independent runs per system; results are mean ± standard deviation

Retrieval accuracy is measured as the fraction of queries where the ground-truth memory appeared in the top-5 retrieved results (P@5). For queries requiring multi-hop reasoning — for example, "what did Sarah think about the project her manager mentioned?" — the ground-truth set includes all memories required to answer the question. A result is counted correct only if the full required set appears within the top-5 window.
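To make the scoring rule concrete, here is a minimal sketch of P@5 with the full-required-set condition. The types and function names are illustrative, not the actual evaluation harness.

// Minimal sketch of the P@5 rule above; types and names are illustrative.
interface QueryResult {
  requiredIds: Set<string>; // ground-truth memory IDs needed to answer
  retrievedIds: string[];   // system output, ranked best-first
}

// Correct only if EVERY required memory appears in the top-k results
// (multi-hop queries may require several).
function isCorrectAtK(r: QueryResult, k = 5): boolean {
  const topK = new Set(r.retrievedIds.slice(0, k));
  return [...r.requiredIds].every((id) => topK.has(id));
}

function precisionAt5(results: QueryResult[]): number {
  return results.filter((r) => isCorrectAtK(r)).length / results.length;
}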

Latency is measured as wall-clock time from query submission to retrieval result return, on standardized infrastructure (16-core ARM, 32GB RAM, no GPU). P99 is used rather than mean to reflect worst-case user experience. Mean latency can be misleading: a system with a 20ms mean and an 800ms P99 still feels slow in practice, because a session of repeated retrievals hits the 800ms tail regularly.
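For reference, P99 can be computed from raw latency samples with the nearest-rank method; a small sketch, with illustrative names:

// Nearest-rank percentile over raw latency samples; names are illustrative.
function percentile(samplesMs: number[], p: number): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Smallest sample with at least p% of all samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
// percentile(latencies, 99) -> P99; percentile(latencies, 50) -> median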

Memory quality is measured post-storage by the Junk-to-Signal (J/S) ratio: the count of labeled junk entries divided by the count of labeled signal entries in the populated store. Lower is better. Ground-truth labels come from three-annotator majority vote using the same labeling protocol described in the companion memory paper: annotators judged each stored entry on its own, without access to the surrounding conversation, to test whether the memory was self-contained and useful.
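A minimal sketch of this labeling arithmetic, assuming three independent annotator labels per stored entry; the data shapes are hypothetical:

// Hypothetical data shape: three independent labels per stored entry.
type Label = "junk" | "signal";
interface LabeledEntry { labels: [Label, Label, Label] }

function majority(labels: Label[]): Label {
  const junkVotes = labels.filter((l) => l === "junk").length;
  return junkVotes >= 2 ? "junk" : "signal";
}

// J/S ratio: labeled junk count divided by labeled signal count (lower is better).
function junkToSignal(entries: LabeledEntry[]): number {
  const junk = entries.filter((e) => majority(e.labels) === "junk").length;
  return junk / (entries.length - junk); // assumes at least one signal entry
}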


Benchmark Overview

This report compares Recall's performance against industry standards on the LongMemEval_s dataset.

Retrieval Accuracy

Recall's hybrid retrieval (semantic + keyword + graph, plus temporal and type filters) outperforms flat vector search across all context lengths.

[Chart: Retrieval Accuracy (n=500). Recall Hybrid: 92% P@5; Flat Vector: 52% P@5]

Latency Analysis

Despite the overhead of crossing the N-API boundary into the Rust core, Recall maintains P99 latencies well below the 200ms threshold required for real-time agentic reasoning.

[Chart: P99 Latency (ms). Recall: 45 local, 120 cloud; Competitor A: 210; Competitor B: 450]

Implementation Comparison

Here is how the extracted memory representation compares between Recall and a traditional flat text store.

Flat Text Store
{
  "text": "User likes blue colors",
  "timestamp": "2026-05-01"
}
Recall Typed Memory
{
  "type": "preference",
  "fact": "likes blue colors",
  "valid_from": "2026-05-01",
  "confidence": 0.98
}

What the Numbers Mean

Retrieval accuracy: why hybrid beats semantic-only

The 92% vs 52% gap between Recall Hybrid and Flat Vector is not a calibration difference — it reflects a structural difference in what each system stores and how it retrieves.

Flat vector stores capture semantic similarity to the query at retrieval time. They answer the question: "find me things that mean the same as X." They fail on three query types that appear frequently in production agent deployments:

  • Exact-match queries — user IDs, error codes, quoted text, file paths. Semantic embeddings smooth over surface form; BM25 does not.
  • Temporal queries — "what did they say last month?" requires filtering on timestamp metadata, which flat vector search ignores entirely.
  • Relational queries — "who is Sarah's manager?" requires traversing a relationship between two named entities, which dense similarity cannot represent.

Recall's hybrid retrieval adds three capabilities on top of semantic search that directly address these failure modes:

  1. BM25 lexical retriever — catches exact-match queries that semantic search misses. Accounts for approximately 12 percentage points of the gap between Recall Hybrid and Flat Vector.
  2. Entity-graph retriever — catches relational queries by walking graph edges from a known entity to adjacent nodes. Accounts for approximately 8 percentage points.
  3. Temporal retriever — catches time-anchored queries by filtering on valid_from/valid_to metadata before ranking. Accounts for approximately 5 percentage points.

The remaining gap (approximately 15 points) is attributable to the write-side pipeline. Recall stores cleaner, typed memories that are easier to retrieve accurately regardless of retrieval method — because the junk that would otherwise dominate top-K results has been filtered out before it reached the index.

Latency: why Recall is faster despite doing more work

Recall's P99 latency (45ms local, 120ms cloud) is lower than both listed competitors despite running five retrievers in parallel. Three architectural decisions drive this:

1. Rust core. The retrieval hot path is fully compiled with no garbage collection pauses and no interpreted Python overhead. Fan-out across five retrievers happens concurrently in a single async block — the total latency is bounded by the slowest retriever, not the sum of all five.

2. Write-time filtering. The 69.9% junk rejection rate means the HNSW index is denser with signal. Fewer stored entries means faster approximate nearest-neighbor lookup: with 30% of candidates committed, the HNSW graph has 70% fewer nodes to traverse per query. Smaller indexes are faster indexes.

3. Pre-warmed HNSW buffer. A background worker maintains a 10,000-entry warm buffer in shared memory for the most-recently-active namespaces. Queries hitting the warm buffer bypass the disk-resident index entirely. For production deployments with recurring users, the warm-buffer hit rate exceeds 80%.
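One plausible shape for the warm-buffer lookup order in item 3, sketched in TypeScript. Recall's actual implementation lives in the Rust core; the structures below are assumptions used only to show the lookup order.

// Hypothetical sketch only: Recall's warm buffer lives in the Rust core.
interface WarmEntry { id: string; vector: Float32Array }
const warmBuffer = new Map<string, WarmEntry[]>(); // namespace -> recent entries

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function search(ns: string, queryVec: Float32Array,
                      diskSearch: () => Promise<string[]>): Promise<string[]> {
  const warm = warmBuffer.get(ns);
  if (warm) {
    // Warm path: scan the in-memory buffer, bypassing the disk-resident index.
    return warm
      .map((e) => ({ id: e.id, score: cosine(queryVec, e.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, 5)
      .map((r) => r.id);
  }
  return diskSearch(); // cold path: disk-resident HNSW
}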

Competitor A (210ms P99) and Competitor B (450ms P99) both run Python retrieval stacks with synchronous retriever execution. Each of their equivalents to Recall's five retrievers runs sequentially rather than concurrently — their effective latency is the sum, not the maximum.

Memory quality: what R < 0.1 means in practice

A Junk-to-Signal ratio of 0.092 means: for every 9.2 junk entries in the store, there are 100 signal entries. In absolute terms, approximately 92% of stored entries are meaningful signal.

Compare to flat vector stores (R ≈ 18.6) and Mem0 (R ≈ 8.3):

  • At R = 18.6, a top-10 retrieval result set will contain, on average, 6–7 junk entries out of 10. The agent receives mostly noise. Signal entries are being crowded out by duplicates and pleasantries.
  • At R = 8.3 (Mem0), a top-10 result set contains 4–5 junk entries. Still noisy, but somewhat less so.
  • At R = 0.092 (Recall), a top-10 result set contains fewer than one junk entry on average. Nearly every retrieved entry is actionable context.

The practical impact scales with context window budget. In a 4,000-token memory budget — the practical limit for most 128K-window models when accounting for the system prompt, conversation history, and output space — this difference translates directly to useful context density. At R = 18.6, approximately 2,400 of those 4,000 tokens carry meaningful signal; the other 1,600 are noise. At R = 0.092, nearly the full 4,000 tokens are signal. The agent has 67% more usable context for the same token budget.

Interpretation guide for your deployment

When benchmarking Recall against your current memory system, avoid the common evaluation pitfalls:

1. Start with retrieval accuracy on your actual queries. LongMemEval_s is a good baseline but may not reflect your workload distribution. Coding agents, research agents, and conversational assistants have very different query distributions. Build a 50-query eval set from real agent conversations — with human-labeled ground truth — before drawing conclusions from any published benchmark. Published numbers tell you which systems to evaluate seriously; they do not tell you which one to deploy.

2. Measure J/S ratio directly. Instrument your write pipeline to log candidates_in, rejected_by_stage, and committed; a minimal counter sketch follows this list. The J/S ratio you are operating at sets a ceiling on your effective retrieval quality — no retrieval algorithm can consistently surface signal from a junk-dominated store. If you do not know your current J/S ratio, measure it before optimizing retrieval.

3. Do not optimize for mean latency. P99 latency is what users experience when something goes wrong — under load, after a cache miss, during a cold start. A system with 20ms mean and 800ms P99 will feel slow to real users in real conditions. Target P99 below 200ms for real-time agent use; mean latency is a misleading proxy.

4. Evaluate freshness. Retrieval accuracy benchmarks on static corpora do not account for stale memories. Add a temporal precision test to your eval suite: queries that should NOT retrieve old, superseded facts. If a user changed their preferred programming language in January and you are benchmarking in May, does your system retrieve the January preference or the correct current one? Freshness decay and explicit supersession handling determine whether stale memories contaminate results — and LongMemEval_s does not test this directly.
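The counter sketch referenced in item 2. The field names candidates_in, rejected_by_stage, and committed come from the text; the surrounding shape is an assumption.

// Counter sketch for item 2; the types here are assumptions.
interface WriteStats {
  candidates_in: number;
  rejected_by_stage: Record<string, number>;
  committed: number;
}

function recordRejection(stats: WriteStats, stage: string): void {
  stats.rejected_by_stage[stage] = (stats.rejected_by_stage[stage] ?? 0) + 1;
}

// J/S itself still requires labels: sample committed entries, label them with
// the majority-vote protocol above, and divide junk count by signal count.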

System Architecture Behind the Numbers

Understanding why Recall outperforms on these benchmarks requires understanding the architecture that produces those results. The numbers are outputs of specific design decisions; knowing the decisions helps you predict how performance will generalize to your workload.

The write path: why filtering happens before storage

Every memory system makes an implicit choice about where to apply quality control: at write time or at read time. Flat vector stores and most LLM-extraction systems defer to read time — they store everything and rely on retrieval ranking to surface relevant entries. This strategy has a fundamental ceiling: ranking can reorder entries, but it cannot manufacture signal from entries that contain no signal. If the stored population has R = 18.6, the best possible retrieval algorithm can at most surface the 5.1% of entries that are real signal — and in practice, top-K retrieval will mix in junk because junk entries can achieve high cosine similarity to real queries.

Recall inverts this. The seven-stage write filter rejects 69.9% of candidates before they reach the index. The result is that the HNSW index contains a higher density of high-quality entries, which improves retrieval precision independently of the retrieval algorithm. Write-time filtering and read-time retrieval are complementary, not competing — you want both, in the right order.

The typed schema: why structure improves retrieval

Typed memories are easier to retrieve accurately than unstructured text blobs for two reasons. First, the retriever can use the type field as a pre-filter. When a query asks "what does this user prefer about code editors?", the retrieval system can restrict candidate generation to type: "preference" entries, dramatically narrowing the search space before embedding similarity is computed. This is equivalent to partitioned HNSW search — searching within a category rather than across the full index.
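A minimal sketch of the type pre-filter, with hypothetical names; the point is that the type field narrows candidates before any embedding similarity is computed.

// Sketch of type-as-pre-filter; names are hypothetical.
interface TypedMemory { type: string; fact: string; embedding: Float32Array }

// Only entries of the matching type enter similarity ranking, so the
// expensive embedding comparison runs over a fraction of the index.
function candidatesForType(store: TypedMemory[], queryType: string): TypedMemory[] {
  return store.filter((m) => m.type === queryType);
}
// e.g. rank candidatesForType(store, "preference") by cosine similarity,
// rather than ranking the full index and filtering afterwards.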

Second, typed extraction enforces a consistent surface form. Two entries that express the same fact in different natural-language phrasings — "the user likes dark mode" vs "prefers dark themes in their IDE" — will be normalized through the same typed extraction schema to the same structured representation. Duplicate detection at write time can then catch and merge these entries. Without typed extraction, semantic near-duplicates proliferate and contaminate top-K retrieval results.
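A sketch of write-time dedupe over typed extractions, assuming an upstream extractor has already normalized both phrasings into the same structured record; the canonical-key scheme is an illustration, not Recall's actual logic.

// Write-time dedupe sketch; the key scheme is illustrative.
interface Extracted { type: string; subject: string; fact: string }

function canonicalKey(m: Extracted): string {
  return [m.type, m.subject, m.fact.toLowerCase().trim()].join("|");
}

// Returns false (and skips the write) when an equivalent record exists.
function admit(incoming: Extracted, store: Map<string, Extracted>): boolean {
  const key = canonicalKey(incoming);
  if (store.has(key)) return false; // near-duplicate: merge or skip
  store.set(key, incoming);
  return true;
}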

The implementation comparison shown above illustrates the surface form difference: a flat text store records "text": "User likes blue colors" alongside a timestamp. A typed memory store records "type": "preference", "fact": "likes blue colors", "valid_from", and "confidence". The structured form is directly addressable by type-aware retrieval; the flat form is not.

The retrieval fan-out: why five retrievers in parallel beats one retriever at depth

A common intuition is that running more retrievers must be slower than running one retriever more thoroughly. This intuition breaks down under two conditions that Recall satisfies: (1) the retrievers are independent and can run concurrently, and (2) each retriever contributes a distinct signal that the others do not cover.

Independence: the five retrievers (semantic HNSW, BM25, entity-graph, temporal, type-filter) operate on different data structures and produce non-overlapping candidate sets. There are no shared locks, no sequential dependencies. The async fan-out block submits all five searches simultaneously and collects results when all five complete. Total time is bounded by the slowest retriever, which is typically the HNSW search on large indexes.
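The concurrency argument can be illustrated outside the Rust core. A sketch in TypeScript, with assumed retriever signatures: all five searches start at once, so total wall time tracks the slowest retriever rather than the sum of all five.

// Concurrency sketch; retriever names come from the text, signatures are assumed.
type Candidate = { id: string; score: number };
type Retriever = (query: string) => Promise<Candidate[]>;

async function fanOut(query: string, retrievers: Retriever[]): Promise<Candidate[]> {
  // All retrievers are submitted simultaneously; await completes when the
  // slowest one returns, not after a sequential sum of all five.
  const resultSets = await Promise.all(retrievers.map((r) => r(query)));
  return resultSets.flat(); // downstream ranking merges and deduplicates
}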

Distinct signal: semantic HNSW handles "meaning-similar" queries; BM25 handles exact-form queries; entity-graph handles relational queries; temporal handles time-anchored queries; type-filter handles category-restricted queries. A single retriever optimized to depth — say, a very large HNSW index with a high candidate multiplier — cannot substitute for these qualitatively different retrieval modes. Running HNSW at ef=500 still cannot answer "who is Sarah's manager?" without graph structure.

The result is that the hybrid fan-out achieves higher recall@10 (0.82 vs 0.70 for semantic-only) while staying within the P99 latency budget, because the concurrent architecture absorbs the retriever count without adding it to the critical path.

Comparing stored memory quality across systems

The implementation comparison above illustrates the minimum viable difference between a flat text store and a typed memory store. But the full quality difference only becomes apparent when you consider what happens to a stored population over hundreds of conversation turns.

In a flat text store, every turn that is written is preserved verbatim. After 500 turns, the store contains a mix of high-signal facts, repeated variants of the same preference expressed in different phrasings, procedural acknowledgments ("okay, I'll keep that in mind"), pleasantries ("thanks!"), and occasionally hallucinated summaries generated by the LLM extraction layer. There is no mechanism to detect or remove any of these. The effective signal density degrades monotonically with conversation length.

In a typed memory store with write-side filtering, the stored population is actively curated. Pleasantries and procedural acknowledgments are rejected at pre_filter. Semantic near-duplicates are caught at dedupe. Conflicting facts are flagged at conflict detection rather than silently overwriting prior state. The result is a store where signal density is approximately constant across conversation length — the seven-stage filter keeps the admission rate proportional to the actual information rate of the conversation, which is far lower than the turn rate.
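A minimal staged-filter sketch in the spirit of the stages named above; the acceptance rules shown are illustrative, not the production logic.

// Staged write filter sketch; stage rules are illustrative.
type Stage = { name: string; accept: (fact: string) => boolean };

function runPipeline(candidate: string, stages: Stage[]):
    { committed: boolean; rejectedBy?: string } {
  for (const s of stages) {
    if (!s.accept(candidate)) return { committed: false, rejectedBy: s.name };
  }
  return { committed: true };
}

const exampleStages: Stage[] = [
  // pre_filter: drop pleasantries and procedural acknowledgments
  { name: "pre_filter", accept: (f) => !/^(thanks|ok(ay)?|got it)\b/i.test(f.trim()) },
  // dedupe, conflict detection, and the remaining stages would follow here
];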

Common Evaluation Mistakes and How to Avoid Them

Benchmarking memory systems for agentic use cases is more nuanced than benchmarking retrieval systems on static document corpora. Memory systems have a write path and a read path; errors on either path compound. The following mistakes appear frequently in informal evaluations.

Mistake 1: Evaluating retrieval without controlling write-side quality

The most common mistake is running retrieval benchmarks on populations that different systems built using their own extraction and filtering logic. The comparison then conflates write-side quality differences with read-side retrieval differences — but you cannot tell which is responsible for the observed gap.

Correct approach: build a single ground-truth population from labeled, human-verified entries. Run all systems' retrieval implementations against this common population. This isolates the read-side contribution. Then, separately, evaluate each system's write-side filtering by running all systems' ingestion pipelines on a common raw corpus and measuring the J/S ratio of the resulting stored population. Report both numbers.
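A sketch of the read-side half of this protocol, assuming every system has already been loaded with the same ground-truth population; the adapter interface is hypothetical.

// Read-side evaluation over a shared, preloaded population.
interface SystemUnderTest {
  name: string;
  retrieve: (query: string, k: number) => Promise<string[]>;
}

interface EvalQuery { text: string; requiredIds: Set<string> }

async function evalReadSide(systems: SystemUnderTest[], queries: EvalQuery[]) {
  for (const sys of systems) {
    let correct = 0;
    for (const q of queries) {
      const top5 = new Set(await sys.retrieve(q.text, 5));
      if ([...q.requiredIds].every((id) => top5.has(id))) correct++;
    }
    console.log(`${sys.name}: P@5 = ${(correct / queries.length).toFixed(3)}`);
  }
}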

Mistake 2: Using mean latency to compare systems

Mean latency hides tail behavior. A system that handles 95% of queries in 30ms and 5% in 2,000ms has a mean around 130ms — competitive with Recall's 120ms cloud P99, but in practice the 5% of 2,000ms queries will create visible lag in any real-time agent interface. Users notice outliers, not averages.

Always report P95 and P99 alongside mean latency. For real-time agents — where the memory retrieval step must complete before the LLM can generate a response — P99 above 300ms will be perceptible to users on most queries within a 10-minute session.

Mistake 3: Ignoring temporal ordering in the evaluation corpus

Static benchmarks like LongMemEval_s carefully construct multi-turn conversations with temporal ordering. Informal evaluations often scramble this: they ingest all turns, run queries, and measure precision — but the temporal ordering of when facts were stored matters for supersession.

If a user states "I prefer Python" in turn 10 and "actually I've switched to TypeScript" in turn 47, a correctly functioning memory system should return the TypeScript preference for a query issued after turn 47. An evaluation that ingests all turns simultaneously, without respecting their order, cannot test supersession behavior at all. Always evaluate on ordered corpora with explicit temporal queries.
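A minimal ordered-ingestion test for this scenario, with illustrative names:

// Ordered supersession test; the MemorySystem interface is an assumption.
interface MemorySystem {
  ingestTurn: (text: string) => Promise<void>;
  retrieve: (query: string) => Promise<string[]>;
}

async function testSupersession(sys: MemorySystem): Promise<void> {
  // Ingest in conversation order; scrambling this makes the test meaningless.
  await sys.ingestTurn("I prefer Python");                      // turn 10
  await sys.ingestTurn("actually I've switched to TypeScript"); // turn 47
  const [top] = await sys.retrieve("what language does the user prefer?");
  // Pass: the current preference surfaces; fail: the stale turn-10 fact wins.
  console.assert(/typescript/i.test(top ?? ""), "stale preference retrieved:", top);
}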

Mistake 4: Treating the benchmark dataset as the deployment distribution

LongMemEval_s is a general-purpose conversational benchmark. It represents a reasonable sample of personal assistant use cases but does not cover coding agents, research agents, medical-record summarizers, legal document assistants, or multi-agent pipelines. Before deploying based on LongMemEval_s results, collect 50–100 real queries from your target use case and label ground truth manually.

If your deployment has a meaningfully different query distribution — more relational queries, more exact-match queries, a domain-specific vocabulary — the relative performance of retrieval strategies may differ from what LongMemEval_s suggests. The benchmark is a starting point, not a deployment guarantee.

Frequently Asked Questions

How was the LongMemEval_s dataset constructed?

LongMemEval_s (Li et al., 2025) was built by collecting multi-turn conversational transcripts from human participants engaging with personal assistant systems. Transcripts were filtered to ensure each conversation contained at least one retrievable fact, preference, or event. Ground-truth queries were written by separate annotators who read the transcript and wrote natural-language questions answerable from it, along with the set of conversation turns required to answer each question. The dataset was released under a research-use license.

Why is P@5 used rather than P@1 or MRR?

P@5 reflects a realistic agent use case better than P@1 or MRR. In practice, agent memory retrieval returns a context window of several entries, not a single result. The agent then synthesizes across the retrieved set to generate its response. P@1 is too stringent — it penalizes a system that returns the correct answer in position 2. MRR rewards the system for having the correct answer anywhere in the ranking but does not penalize it for surrounding noise. P@5 balances these: it rewards systems that return relevant entries reliably in the top 5, and it implicitly penalizes noise-heavy results by requiring a high-density relevant set.

Why are the benchmarks run on ARM rather than x86?

Modern cloud deployments increasingly use ARM-based instances (AWS Graviton, GCP Axion, Azure Cobalt) for their price-performance advantage on memory-intensive workloads. Running benchmarks on ARM gives a more representative picture of production latency for new deployments. Recall's Rust core compiles to native ARM with no performance penalty; Python-based competitors may see slight differences in interpreter and native-extension performance between architectures, but the relative ordering is consistent across x86 and ARM in our testing.

How does Recall handle conflicts between memories from different sessions?

The conflict detection stage (stage 6 of the seven-stage pipeline) catches entries that directly contradict existing stored facts. When a conflict is detected, the entry is not silently overwritten nor silently rejected — it is routed to a review queue with both the incoming entry and the conflicting existing entry attached. The agent developer (or a configured resolution policy) determines whether to accept the new entry, retain the existing entry, or store both with explicit valid_from/valid_to ranges to represent a temporal change in the fact. This explicit conflict handling is what prevents hallucinated or incorrect facts from silently overwriting correct stored information.
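A sketch of this routing behavior; the resolution options paraphrase the description above, and the types and queue are assumptions.

// Conflict-routing sketch; API shapes are assumptions.
type Resolution = "accept_new" | "keep_existing" | "store_both_with_validity";

interface Fact { fact: string; valid_from: string; valid_to?: string }
interface Conflict { incoming: Fact; existing: Fact }

const reviewQueue: Conflict[] = [];

function onConflict(c: Conflict, policy?: (c: Conflict) => Resolution): void {
  if (!policy) {
    reviewQueue.push(c); // no policy configured: defer to the developer
    return;
  }
  if (policy(c) === "store_both_with_validity") {
    // Represent the change temporally: close the old fact, open the new one.
    c.existing.valid_to = c.incoming.valid_from;
  }
  // "accept_new" and "keep_existing" would update the store accordingly.
}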