Recall
A Rust-core, polyglot-SDK memory layer that fixes three structural failures in today's agent memory systems: noise injection at write time, imprecise retrieval at read time, and missing freshness management for long-lived deployments.
The Problem
Extraction produces garbage.
Independent evaluations of popular memory systems show that the majority of stored entries are hash duplicates, hallucinated profiles, or feedback-loop contamination. When we audited over 10,000 production memory entries across Recall deployments, we found a Junk-to-Signal ratio exceeding 10 in unfiltered systems: for every high-signal fact or preference, there were more than 10 junk entries consuming index space and degrading retrieval precision. The root cause is architectural: systems that store everything and filter at retrieval time cannot achieve high precision without sacrificing recall. Recall inverts this: filter aggressively at write time so retrieval can be fast and precise.
Typed Memory Schema
Five types with explicit semantics.
Flat text loses temporal and relational signal. A typed schema makes precise retrieval possible: each memory knows what it is, when it holds, and what it relates to. The five types are: fact (stable declarative information, half-life 180 days), preference (revisable disposition, half-life 90 days), event (time-anchored occurrence, half-life 30 days), entity (persistent referent in the user's world, half-life 365 days), and relation (typed edge between two entities, half-life 180 days). Type matters for three reasons: per-type confidence priors calibrate extraction quality (preferences carry a prior of 0.20 and facts 0.15, each combined with source strength, so an off-hand comment and an explicit statement score differently), per-type decay ensures stale memories stop dominating retrieval, and per-type routing lets the query optimizer pre-filter to relevant types before scoring.
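The type semantics are compact enough to write out. Below is a minimal Rust sketch; the `MemoryType` enum and its method names are illustrative assumptions rather than Recall's actual API, and only the half-lives and the two quoted priors come from the text.

```rust
/// Minimal sketch of the five-type schema. Names are illustrative;
/// half-lives and priors are the values quoted above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MemoryType {
    Fact,       // stable declarative information
    Preference, // revisable disposition
    Event,      // time-anchored occurrence
    Entity,     // persistent referent in the user's world
    Relation,   // typed edge between two entities
}

impl MemoryType {
    /// Type-specific freshness half-life, in days.
    fn half_life_days(self) -> f64 {
        match self {
            MemoryType::Fact | MemoryType::Relation => 180.0,
            MemoryType::Preference => 90.0,
            MemoryType::Event => 30.0,
            MemoryType::Entity => 365.0,
        }
    }

    /// Type-specific confidence prior: the `t` term in the confidence formula.
    fn type_prior(self) -> f64 {
        match self {
            MemoryType::Preference => 0.20,
            MemoryType::Fact => 0.15,
            // Priors for event, entity, and relation are not given in this
            // overview; a neutral placeholder is used here.
            _ => 0.15,
        }
    }
}

fn main() {
    let t = MemoryType::Preference;
    println!("{t:?}: half-life {} days, prior {}", t.half_life_days(), t.type_prior());
}
```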
Seven-Stage Pipeline
Quality at the gate.
The write pipeline rejects ~90% of candidates before they ever touch storage. Every memory that survives has passed through seven stages: pre_filter (relevance scoring, ~62% rejection), extract (typed candidate extraction, ~8% rejection), resolve_refs (entity resolution, ~4% rejection), enrichers (metadata addition, fail-open), dedupe (semantic and hash deduplication, ~8% rejection), conflict (contradiction detection and resolution, ~3% rejection), and persist (committed to storage). The survival probability of a junk entry through all seven stages is approximately 0.2%, the product of the per-stage miss rates. In practice, junk entries fail multiple stages simultaneously (a low-quality entry tends to score low on relevance AND produce low-confidence extractions AND match existing duplicates), so the realized junk pass-through rate is lower than the independent-stage product suggests.
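To see why stacked gates compound, it helps to run the multiplication. The sketch below uses hypothetical per-stage junk miss rates (the real per-stage rates are not given in this overview), chosen only to show how a compound survival near the quoted 0.2% arises when every stage gets an independent chance to catch a junk entry.

```rust
/// If stage i misses junk at rate m_i, independent stages compound:
/// P(junk survives) = m_1 * m_2 * ... * m_n.
fn junk_survival(miss_rates: &[f64]) -> f64 {
    miss_rates.iter().product()
}

fn main() {
    // HYPOTHETICAL junk miss rates for the five rejecting stages
    // (pre_filter, extract, resolve_refs, dedupe, conflict); chosen for
    // illustration, not taken from the spec.
    let miss_rates = [0.10, 0.40, 0.60, 0.25, 0.35];
    println!("junk survival ~ {:.2}%", 100.0 * junk_survival(&miss_rates)); // ~0.21%
}
```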
Hybrid Retrieval
Five retrievers. One ranked list.
Semantic search finds similar meaning. Keyword search finds exact matches. Entity-graph traversal finds relational context. Temporal range scanning finds time-anchored memories. Type-filtering pre-narrows the candidate pool. Recall runs all five concurrently and fuses results with score-weighted RRF: score = weight × sqrt(raw_score) / (k + rank) with k=60. The sqrt transformation compresses within-retriever score variation; the weight factor allows per-retriever calibration for different query types; k=60 prevents rank-1 results from dominating across retrievers. On LongMemEval_s (n=500 multi-turn conversations), hybrid retrieval achieves P@5 of 0.91 — compared to 0.78 for semantic-only. The 0.13 P@5 gap is driven primarily by the entity-graph retriever (handling relational queries) and the temporal retriever (handling time-anchored queries), both of which semantic search systematically misses.
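The fusion rule is small enough to transcribe directly. Below is a sketch of score-weighted RRF as described above; the `Hit` struct and `fuse` function are illustrative names, and raw scores are assumed to be normalized to [0, 1] per retriever.

```rust
use std::collections::HashMap;

const K: f64 = 60.0; // rank-smoothing constant from the formula above

/// One result from one retriever. Illustrative shape, not Recall's API.
struct Hit {
    memory_id: u64,
    raw_score: f64, // retriever-native score, assumed normalized to [0, 1]
    rank: usize,    // 1-based rank within this retriever's result list
}

/// Fuse per-retriever result lists: each hit contributes
/// weight * sqrt(raw_score) / (K + rank), summed per memory.
fn fuse(results: &[(f64, Vec<Hit>)]) -> Vec<(u64, f64)> {
    let mut fused: HashMap<u64, f64> = HashMap::new();
    for (weight, hits) in results {
        for hit in hits {
            let contribution = weight * hit.raw_score.sqrt() / (K + hit.rank as f64);
            *fused.entry(hit.memory_id).or_insert(0.0) += contribution;
        }
    }
    let mut ranked: Vec<(u64, f64)> = fused.into_iter().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest fused score first
    ranked
}

fn main() {
    // Memory 1 is found by two retrievers; memory 2 by one. The weights
    // are per-retriever calibration factors.
    let semantic = (1.0, vec![Hit { memory_id: 1, raw_score: 0.92, rank: 1 }]);
    let keyword = (0.8, vec![
        Hit { memory_id: 1, raw_score: 1.00, rank: 2 },
        Hit { memory_id: 2, raw_score: 0.70, rank: 1 },
    ]);
    for (id, score) in fuse(&[semantic, keyword]) {
        println!("memory {id}: fused score {score:.4}");
    }
}
```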
Confidence and Freshness
Every memory knows how much to trust it.
Retrieval precision is not just about finding the right memory — it's about knowing whether to trust it. Recall computes a confidence score for every memory at write time: conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t), where s is source strength (explicit statement vs. inferred), r is recency at write time, e is evidence count (corroborating memories), and t is the type-specific prior. Confidence is combined with freshness at read time: freshness(t) = 2^(-t/τ), where τ is the type-specific half-life. A preference memory written 90 days ago (one full half-life) has freshness 0.5 — half its original influence. A fact written 180 days ago has the same. This means stale memories do not disappear abruptly; they fade out of ranking as newer evidence accumulates, and reappear if freshened by a new mention. The result is a retrieval system that naturally prioritizes recent, high-confidence evidence without requiring manual curation.
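Both signals are closed-form, so a worked example is easy to check by hand. The sketch below is a direct transcription of the two formulas; the input values in `main` are made up for illustration, and all inputs are assumed pre-normalized to [0, 1].

```rust
/// Write-time confidence: conf(m) = min(1.0, 0.45*s + 0.20*r + 0.25*e + 0.10*t).
fn confidence(source_strength: f64, recency: f64, evidence: f64, type_prior: f64) -> f64 {
    (0.45 * source_strength + 0.20 * recency + 0.25 * evidence + 0.10 * type_prior).min(1.0)
}

/// Read-time freshness: 2^(-t/tau) for elapsed days t and half-life tau.
fn freshness(elapsed_days: f64, half_life_days: f64) -> f64 {
    2.0_f64.powf(-elapsed_days / half_life_days)
}

fn main() {
    // An explicit preference with some corroboration (inputs are illustrative).
    let conf = confidence(0.9, 0.8, 0.4, 0.20);
    // A preference (half-life 90 days) written one full half-life ago.
    let fresh = freshness(90.0, 90.0); // exactly 0.5
    println!("confidence {conf:.3} * freshness {fresh:.3} = {:.3}", conf * fresh);
}
```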
Drift Detection
Catching when the user's world has changed.
Agent memory degrades when the world it models changes. A user who has changed jobs, moved cities, or shifted preferences is poorly served by memories that reflect who they were 18 months ago. Recall's background worker runs four types of drift detection: data drift (the statistical distribution of new memories has diverged from the historical baseline, detected via MMD with an RBF kernel and a permutation test), concept drift (a specific entity's embedding centroid has moved more than 0.4 in cosine distance AND its relation-set Jaccard similarity has dropped below 0.5, a dual-signal gate that reduces false positives), schema drift (the frequency of type assignments has shifted, suggesting the extraction model has changed), and vocabulary drift (new technical terms have entered the corpus that the BM25 index doesn't capture well, detected via IDF-weighted vocabulary Jaccard). Each drift type triggers a different response: concept drift flags the entity for user review, data drift prompts consideration of re-embedding, and schema drift triggers recalibration of the extraction model.
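The concept-drift gate is the most mechanical of the four, so here is a sketch of it. The helper functions and the `concept_drift` signature are illustrative assumptions; only the two thresholds (0.4 cosine distance, 0.5 Jaccard) come from the text.

```rust
use std::collections::HashSet;

fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    1.0 - dot / (norm_a * norm_b)
}

fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 { return 1.0; } // two empty sets: treat as identical
    a.intersection(b).count() as f64 / union as f64
}

/// Dual-signal gate: both the centroid shift AND the relation churn must
/// fire before an entity is flagged, which reduces false positives.
fn concept_drift(
    old_centroid: &[f64], new_centroid: &[f64],
    old_relations: &HashSet<&str>, new_relations: &HashSet<&str>,
) -> bool {
    cosine_distance(old_centroid, new_centroid) > 0.4
        && jaccard(old_relations, new_relations) < 0.5
}

fn main() {
    let old_rel: HashSet<&str> = ["works_at:Acme", "lives_in:Austin"].into();
    let new_rel: HashSet<&str> = ["works_at:Initech"].into();
    println!("{}", concept_drift(&[1.0, 0.0], &[0.0, 1.0], &old_rel, &new_rel)); // true
}
```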
Production Observability
Instrument what matters.
A memory system is only as trustworthy as its observability. Recall emits structured traces for every write operation (keyed by extraction_trace_id, linking the source turn to every candidate's pipeline journey) and every read operation (keyed by query_trace_id, linking the query to each retriever's results and the final RRF fusion). These trace IDs are stored in provenance metadata — meaning you can always ask 'where did this memory come from?' and trace back to the exact source turn and extraction model version. Three SLOs to instrument from day one: retrieval latency P99 < 200ms (violations indicate HNSW degradation or index bloat), write pipeline throughput > 100 candidates/second per worker (violations indicate LLM extraction bottleneck), and false negative rate < 5% (high false negatives indicate pre_filter is over-rejecting). Prometheus metric names for all stages are documented in Chapter 11 of the spec.
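The provenance link is easiest to picture as a record attached to each memory. A minimal sketch, assuming field names inferred from the trace IDs named above; the spec's actual provenance schema may differ.

```rust
/// Provenance metadata carried by every stored memory. Field names are
/// assumptions inferred from the trace IDs described above.
struct Provenance {
    extraction_trace_id: String,      // links to the full write-pipeline journey
    source_turn_id: String,           // the exact conversation turn it came from
    extraction_model_version: String, // extraction model version at write time
}
```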
The design decisions that define Recall
Every production memory system faces the same three choices: what to store, how to retrieve it, and when to trust it. The decisions Recall makes on each are different from most systems — and the differences compound.
What to store: filter at write time, not retrieval time. Most memory systems store everything and rely on retrieval ranking to surface what matters. This produces a Junk-to-Signal ratio that degrades retrieval precision in proportion to the junk fraction. The Recall approach inverts this: the seven-stage write pipeline rejects ~90% of candidate memories before they reach storage. A denser, cleaner store means any retrieval method produces better results; there's simply less noise to rank past. The pre_filter stage alone rejects ~62% of all candidates (roughly 70% of total rejections), and its relevance_threshold is the single most impactful tuning parameter in the system.
How to retrieve: five methods, fused. No single retrieval method covers the full query space. Semantic search (cosine similarity over HNSW) handles paraphrase and concept matching. BM25 handles exact matches — product names, error codes, quoted terms, user IDs. Entity-graph traversal handles relational queries ("what do I know about Sarah's manager?"). Temporal range scanning handles time-anchored queries ("what was the plan last quarter?"). Type-filter pre-scans narrow the candidate pool before any scoring happens. Running all five concurrently and fusing with score-weighted RRF (k=60) achieves P@5 of 0.91 on LongMemEval_s — versus 0.78 for semantic-only. The 13-point gap represents queries that semantic search structurally cannot handle.
When to trust it: confidence × freshness, computed separately. A memory's reliability depends on two independent signals that decay at different rates. Confidence (how much to trust the content) is computed at write time from source strength, evidence count, recency, and type prior: conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t). Freshness (how current the memory is) is computed at retrieval time from time elapsed: freshness(t) = 2^(-t/τ) with type-specific half-lives. The two signals are multiplied before ranking. This means a high-confidence memory from 6 months ago competes on equal terms with a low-confidence memory from yesterday — neither automatically wins.
How the pipeline stages interact
The seven stages are sequential by design: each stage produces a more refined candidate set for the next. But understanding the interaction between stages matters for tuning.
pre_filter → extract: pre_filter rejects by predicted relevance; extract then attempts to identify what type of memory the surviving candidate should become. These stages should be tuned together. If pre_filter is too permissive (low threshold), extract receives more low-quality candidates and produces lower-confidence extractions. If pre_filter is too aggressive (high threshold), the false negative rate rises — real signals rejected before extraction. The recommended starting point is relevance_threshold: 0.40 for general workloads, 0.55 for high-noise environments (customer support, voice transcripts), and 0.35 for low-noise environments (structured research notes).
dedupe → conflict: deduplication (semantic hash, similarity threshold 0.85) runs before conflict detection. This ordering matters: two nearly identical memories, "user works at Acme" and "user works at Acme Corp", should be merged by dedupe, not treated as conflicting by conflict detection. Getting this wrong produces unnecessary conflict flags that require human review. The dedupe threshold (0.85 by default) is the key parameter: raising it makes merging more conservative, protecting semantically similar but factually distinct memories from being collapsed together; lowering it merges more aggressively.
conflict → persist: the conflict stage is the final gate before storage. It checks whether any surviving candidate contradicts an existing memory whose confidence is above a threshold. For most deployments, the recommended handling is strategy: flag_for_review rather than strategy: auto_supersede; the latter is faster but silently discards the older memory. Flag-for-review adds review overhead (a human in the loop) but preserves the full decision history.
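Taken together, the three knobs above might sit in a single config. A hedged sketch, assuming illustrative type and field names; the default values are the recommendations from the preceding paragraphs.

```rust
/// How conflicts are handled at the final gate.
enum ConflictStrategy {
    FlagForReview, // preserves full decision history; adds a review loop
    AutoSupersede, // faster, but silently drops the older memory
}

/// The three tuning knobs discussed above, with the recommended defaults.
/// Struct and field names are illustrative, not the spec's config schema.
struct PipelineConfig {
    relevance_threshold: f64, // pre_filter: 0.40 general, 0.55 high-noise, 0.35 low-noise
    dedupe_threshold: f64,    // semantic similarity for merging; 0.85 default
    conflict_strategy: ConflictStrategy,
}

impl Default for PipelineConfig {
    fn default() -> Self {
        PipelineConfig {
            relevance_threshold: 0.40,
            dedupe_threshold: 0.85,
            conflict_strategy: ConflictStrategy::FlagForReview,
        }
    }
}
```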
What the numbers mean in practice
The headline metrics — ~90% junk rejection, P99 < 200ms, 85%+ on LongMemEval — translate into specific improvements in deployed agents.
~90% junk rejection: In a 4,096-token memory budget (the practical limit for most 128K-window models after system prompt and conversation history), a Junk-to-Signal ratio of R=10 (10 junk entries per signal) leaves approximately 370 tokens of useful context, while R=0.1 (0.1 junk entries per signal) leaves approximately 3,700. The difference, roughly 3,330 tokens, is otherwise spent on noise or left unused. Junk is not neutral: it actively occupies the context budget and degrades the agent's ability to surface what matters.
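The arithmetic behind those token counts: with Junk-to-Signal ratio R, the signal fraction of the budget is 1/(1+R). A quick check (the function name is illustrative):

```rust
/// Useful tokens in a context budget at Junk-to-Signal ratio r:
/// signal fraction = 1 / (1 + r).
fn useful_tokens(budget: f64, junk_to_signal: f64) -> f64 {
    budget / (1.0 + junk_to_signal)
}

fn main() {
    println!("R = 10:  {:.0} useful tokens", useful_tokens(4096.0, 10.0)); // ~372
    println!("R = 0.1: {:.0} useful tokens", useful_tokens(4096.0, 0.1)); // ~3724
}
```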
P99 < 200ms: Multi-retriever fan-out adds latency versus single-retriever retrieval. Recall's P99 is lower than most single-retriever alternatives despite running five retrievers concurrently, because the Rust core eliminates GC pauses that dominate Python-based systems' tail latency at the 99th percentile. For real-time agent use cases, P99 is the operative metric — mean latency hides GC spikes.
85%+ on LongMemEval: The LongMemEval_s benchmark specifically tests temporal, relational, and preference reasoning over multi-turn conversations — the query types most likely to fail in flat vector stores. The 85% figure reflects the full pipeline: write-time filtering producing a clean store, plus five-retriever fusion handling the full query-type distribution. Either half alone produces lower scores.
Deployment modes and operational considerations
Recall ships in three deployment modes, each with different operational characteristics.
Embedded mode (local, in-process): the Recall binary links directly into the host process via napi-rs (Node) or PyO3 (Python). No network boundary. Data never leaves the host process. P99 latency: ~45ms. Best for: single-tenant applications, edge deployments, development environments. Trade-off: all memory lives on the local machine; no replication, no multi-node fan-out.
Self-hosted mode (Postgres + Recall server): the Recall server runs as a separate process alongside a Postgres instance (with pgvector). The SDK communicates over a local network boundary. P99 latency: ~80-120ms. Best for: multi-tenant deployments, team-shared memory stores, production applications that need replication and backup via standard Postgres tooling. Trade-off: operational overhead of running Postgres and the Recall server.
Managed cloud (Recall-hosted): fully managed, multi-region, SLA-backed. SDK connects to the managed endpoint. P99 latency: ~120ms (same region). Best for: teams that don't want to operate infrastructure. Trade-off: data leaves your infrastructure; review the security page for current compliance state and roadmap.
All three modes use the same SDK interface. Migration from embedded to self-hosted to managed is a configuration change, not a code change — the VectorStore and MemoryStore abstractions in the Rust core ensure that the pipeline behavior is identical regardless of which backend the data lands in. This means you can develop and test in embedded mode and promote to self-hosted or managed in production without rewriting integration code.
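At the SDK boundary, "configuration change" can be as small as swapping one value. A sketch under the assumption of an illustrative `Backend` type; the SDK's real configuration surface may differ.

```rust
/// Backend selection as a single configuration value. The enum shape is an
/// illustrative assumption, not the SDK's actual configuration surface.
enum Backend {
    Embedded { data_dir: String },                 // in-process, local disk
    SelfHosted { postgres_url: String },           // Postgres + pgvector
    Managed { endpoint: String, api_key: String }, // Recall-hosted
}

fn main() {
    // Promote from development to production by changing this value only;
    // pipeline and retrieval behavior stay identical across backends.
    let _backend = Backend::SelfHosted {
        postgres_url: "postgres://recall:****@db.internal:5432/recall".into(),
    };
}
```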
Reading the research
The research content on this site is organized at two levels. The overview papers (linked below) assume familiarity with the pipeline architecture described above. The Learn track builds up that foundation incrementally; if any concept in the papers is unfamiliar, the relevant Learn page covers it with a worked example and an interactive demo. The cheat sheet collects every formula, threshold, and constant on one page for quick reference while reading. For the quantitative comparisons referenced above, the 2026 benchmark report documents the methodology, dataset, and full results, with interpretation guidance so you can evaluate the numbers against your own workload rather than taking the headline figures at face value.
The research papers
Three technical papers go deep on the areas where Recall's design departs most from conventional approaches:
- Confidence and Scoring — Full derivation of the confidence formula, freshness decay schedule, RRF with score-weighting, and end-to-end worked examples tracing a single memory from extraction through scoring to final ranking position.
- Layered Hallucination Defense — Architecture of the three-layer guard (write-time grounding, store-time consistency, read-time faithfulness), multiplicative escape rate analysis, calibration methodology, and integration patterns for async workloads.
- Drift Taxonomy — Four types of drift (data, concept, schema, vocabulary) with detection algorithms, significance testing, unified configuration, and the monitoring cadence required to catch slow-moving drift before it degrades retrieval quality. Includes a full PostgreSQL schema for storing drift findings and a dashboard interpretation guide.