Reading the spec: a guided tour
Recall's architectural spec is open. Nineteen chapters, ~400KB of dense prose. Reading it cover-to-cover is a commitment. This guide lays out the reading paths we recommend, depending on what you're trying to learn.
The spec has a specific purpose: to be precise enough that someone could re-implement Recall from scratch, while being honest about the design tradeoffs we made and why. It is not marketing. There are chapters that describe known failure modes, limitations, and places where we chose simplicity over theoretical optimality. When we say something "works well in practice," the spec cites what corpus we tested on.
This also means the spec has a learning curve. Chapters 03 and 14 assume familiarity with vector indexing and information retrieval. Chapters 08 and 12 reference academic literature on hallucination detection and distributional drift. For readers who want the concepts without the formal machinery, the Learn track is the right entry point. For readers who want to go deeper than the Learn track, or who want to verify claims rather than take them on faith, the spec is the document.
If you're evaluating Recall
Read these four chapters in order; they answer "is this the right shape for my problem?"
- Chapter 00 (Overview) — what we're building and why.
- Chapter 02 (Data model) — the typed schema and lifecycle states.
- Chapter 04 (Write pipeline) — the seven stages and the rejection rates.
- Chapter 16 (Deployment) — embedded, self-hosted, managed.
Chapter 00 in depth: The overview chapter makes two key claims that Chapter 02 substantiates: (1) that a typed schema over flat text is not an ergonomic choice but a functional requirement — without types, you cannot apply per-type decay, confidence priors, or retrieval routing; (2) that the write-time filter is the central architectural decision, not retrieval. Most memory systems spend their design budget on retrieval; Recall spends it on the gate.
Chapter 02 detail: The typed schema defines five types with explicit semantics, confidence priors, and decay parameters. Importantly, the chapter also defines the lifecycle state machine: pending → active → superseded → expired. A memory that is expired is not deleted — it's retained for audit and provenance tracing, but excluded from retrieval. This is the mechanism that lets you answer "what did this agent believe about X as of 30 days ago?" even after the belief has changed.
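To make the lifecycle concrete, here is a minimal sketch of the state machine and a point-in-time view. The states come from Chapter 02; the MemoryRecord fields (activated_at, superseded_at) are assumptions for illustration, not the spec's actual record layout.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class LifecycleState(Enum):
    PENDING = "pending"
    ACTIVE = "active"
    SUPERSEDED = "superseded"
    EXPIRED = "expired"


@dataclass
class MemoryRecord:
    content: str
    state: LifecycleState
    activated_at: datetime                      # when the memory became active (assumed field)
    superseded_at: Optional[datetime] = None    # when a newer memory replaced it (assumed field)


def believed_as_of(memories: list[MemoryRecord], as_of: datetime) -> list[MemoryRecord]:
    """Return the memories that were active at `as_of`.

    Because superseded and expired records are retained rather than deleted,
    a point-in-time view only needs the transition timestamps: a memory counts
    if it had been activated by `as_of` and had not yet been superseded.
    """
    return [
        m for m in memories
        if m.activated_at <= as_of
        and (m.superseded_at is None or m.superseded_at > as_of)
    ]
```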
Chapter 04 key insight: The pre_filter stage is where the architecture makes its biggest bet. It rejects ~90% of candidates before they reach extraction. This is aggressive — the cost of a false negative (rejecting a real signal) is non-zero. The chapter's defense: the distribution of agent conversation turns is heavily skewed toward noise (pleasantries, acknowledgments, task confirmations). The 10% of turns that contain real signal are dense enough that a high threshold filters well. Chapter 04 also introduces the concept of retriever-weighted RRF, which is the fusion mechanism for the five retrievers described in Chapter 05.
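As a toy illustration of the shape of such a gate (not the spec's actual criteria, which Chapter 04 enumerates), a pre_filter reduces to a cheap rejection test with a high threshold:

```python
# Illustrative only: the real pre_filter criteria live in Chapter 04. The point is
# the shape: a cheap check plus a high bar, so most turns never reach extraction.
NOISE_MARKERS = {"thanks", "ok", "sounds good", "got it", "sure"}


def pre_filter(turn_text: str, signal_score: float, threshold: float = 0.9) -> bool:
    """Return True if the turn should proceed to extraction.

    `signal_score` stands in for whatever cheap relevance score precedes
    extraction; the name and threshold here are assumptions.
    """
    text = turn_text.strip().lower()
    if not text or text in NOISE_MARKERS:
        return False                       # pleasantries and acknowledgments
    return signal_score >= threshold       # high bar: most candidates stop here
```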
If you're curious about the math
Two chapters carry the formulas. The interactive demos on this site are renderings of these:
- Chapter 03 (Mathematical foundations) — confidence formula, decay curves, RRF, BM25, MMD drift detection.
- Chapter 14 (Indexing) — HNSW math, scaling tiers, IVF/PQ trade-offs.
We've turned much of this into the cheat sheet and the math track in Learn; the spec is the deeper version.
Chapter 03 deep content: The mathematical foundations chapter covers four areas:
- Confidence formula: conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t), where s = source strength (0-1, explicit statement vs. inferred), r = recency (freshness decay applied at read time), e = evidence count (number of corroborating memories), and t = type-specific prior. The weights were calibrated empirically on a labeled corpus of 500 conversations.
- Freshness decay: freshness(t) = 2^(-t/τ), where τ is the type-specific half-life: Preference=90d, Event=30d, Fact=180d, Relation=180d, Entity=365d. At t=τ, a memory scores 0.5 freshness regardless of its initial confidence. The decay is applied multiplicatively at read time, not destructively — the stored confidence value is unchanged.
- RRF fusion: score = weight × sqrt(raw_score) / (k + rank), with k=60 per retriever. The sqrt transformation compresses score variation within a retriever's results; the weight factor allows per-retriever calibration; k=60 is the Borda count offset that prevents rank 1 from dominating. The total score per candidate is summed across all retrievers. (A runnable sketch of these three formulas follows this list.)
- MMD drift detection: Chapter 03 uses Maximum Mean Discrepancy with an RBF kernel to detect when the embedding distribution of an entity's associated memories has shifted. A permutation test determines whether the observed MMD is statistically significant.
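To make the first three formulas concrete, here is a minimal sketch that plugs in the constants from the bullets above. The retriever outputs at the end are invented for illustration.

```python
import math

# Type-specific half-lives in days, from the freshness-decay bullet above.
HALF_LIFE_DAYS = {"preference": 90, "event": 30, "fact": 180, "relation": 180, "entity": 365}


def confidence(s: float, r: float, e: float, t: float) -> float:
    """conf(m) = min(1.0, 0.45*s + 0.20*r + 0.25*e + 0.10*t)."""
    return min(1.0, 0.45 * s + 0.20 * r + 0.25 * e + 0.10 * t)


def freshness(age_days: float, mem_type: str) -> float:
    """freshness(t) = 2^(-t / tau); at t == tau the score is exactly 0.5."""
    return 2 ** (-age_days / HALF_LIFE_DAYS[mem_type])


def rrf_contribution(weight: float, raw_score: float, rank: int, k: int = 60) -> float:
    """One retriever's contribution: weight * sqrt(raw_score) / (k + rank)."""
    return weight * math.sqrt(raw_score) / (k + rank)


# A candidate's total fused score is the sum of its contributions across retrievers.
# The (weight, raw_score, rank) triples below are hypothetical retriever outputs.
total = sum(
    rrf_contribution(weight=w, raw_score=s, rank=r)
    for (w, s, r) in [(1.0, 0.82, 1), (0.7, 0.55, 3)]
)
print(round(total, 4))
```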
Chapter 14 key content: The indexing chapter covers HNSW parameter selection (m=16 is the recommended default; ef_construction=200 for build quality; ef_search=64 for query balance), the four scaling tiers (SQLite-vec → pgvector HNSW → sharded pgvector → specialized vector DBs), and the memory formula bytes_per_vec ≈ dims × 4 + m × 8 × log₂(N). At m=16, dims=1536, N=1M, that works out to roughly 8.7 GB: about 6.1 GB of raw float32 vectors plus about 2.6 GB of graph links.
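The memory estimate is easy to reproduce from the formula; the sketch below plugs in Chapter 14's reference parameters.

```python
import math


def hnsw_bytes_per_vector(dims: int, m: int, n: int) -> float:
    """bytes_per_vec ≈ dims * 4 (float32 embedding) + m * 8 * log2(N) (graph links)."""
    return dims * 4 + m * 8 * math.log2(n)


# Chapter 14's reference point: m=16, dims=1536, N=1,000,000.
n = 1_000_000
per_vec = hnsw_bytes_per_vector(dims=1536, m=16, n=n)
print(f"{per_vec:.0f} bytes/vector, {per_vec * n / 1e9:.1f} GB total")  # ~8695 bytes, ~8.7 GB
```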
If you're building something similar
Read the operational chapters. They're where the engineering reality lives:
- Chapter 08 (Hallucination guards) — three-layer defense.
- Chapter 10 (Background worker) — seven maintenance jobs.
- Chapter 11 (Observability) — what to instrument.
- Chapter 12 (Drift) — concept, data, schema, vocabulary.
- Chapter 13 (Query optimizer) — feature vector + plan selection.
Chapter 08 — Hallucination guards in detail: Three layers with independent escape rates: write-time grounding (ε₁=0.20), store-time consistency scan (ε₂=0.50), read-time faithfulness (ε₃=0.30). Because the layers are independent, the rates multiply to a combined 0.03 escape rate for genuine junk. The chapter is notable for explaining what "consistency" means operationally — the consistency scanner detects logical contradictions between new candidates and existing memories, not just semantic similarity. A new fact "user works at Acme" doesn't necessarily contradict an old fact "user works at Datakynd" (they could have changed jobs) — the scanner distinguishes temporal updates from logical conflicts.
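The 0.03 figure follows directly from multiplying the per-layer escape rates, assuming the layers fail independently. The sketch below also shows how much the combined rate degrades if any single layer is disabled.

```python
# Independent layers: a junk candidate must slip past all three to reach a response.
eps = {"write_grounding": 0.20, "store_consistency": 0.50, "read_faithfulness": 0.30}

combined = 1.0
for rate in eps.values():
    combined *= rate
print(f"all three layers: {combined:.2f}")   # 0.03

# Sensitivity check: drop any single layer and the escape rate for junk jumps.
for skipped in eps:
    rest = [r for name, r in eps.items() if name != skipped]
    print(f"{skipped} disabled -> {rest[0] * rest[1]:.2f}")
```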
Chapter 10 — Background worker jobs: Seven maintenance jobs running in three concurrency groups. Group 1 (consolidation + freshness): runs every 4 hours. Group 2 (embedding refresh + re-extraction): runs weekly or on trigger. Group 3 (prune + audit): runs daily. The job locking mechanism uses PostgreSQL INSERT ON CONFLICT DO NOTHING against a background_jobs table — if a job for a given namespace is already running, the new attempt fails the insert and skips. This prevents thundering herd on high-volume namespaces.
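A minimal sketch of the lock-by-insert pattern, assuming psycopg2 and a background_jobs table with a unique constraint on (namespace, job_name); the column names and the run_consolidation call are illustrative, not the spec's actual schema or API.

```python
import psycopg2

# Assumes: CREATE TABLE background_jobs (namespace text, job_name text, started_at timestamptz,
#                                         UNIQUE (namespace, job_name));
LOCK_SQL = """
    INSERT INTO background_jobs (namespace, job_name, started_at)
    VALUES (%s, %s, now())
    ON CONFLICT (namespace, job_name) DO NOTHING
"""


def try_acquire_job(conn, namespace: str, job_name: str) -> bool:
    """Return True if this worker won the slot; False means a run is already in flight."""
    with conn.cursor() as cur:
        cur.execute(LOCK_SQL, (namespace, job_name))
        acquired = cur.rowcount == 1   # 0 rows inserted -> conflict -> skip this run
    conn.commit()
    return acquired


# Usage: skip quietly when the namespace already has a consolidation run in progress.
# conn = psycopg2.connect(...)
# if try_acquire_job(conn, "tenant-42", "consolidation"):
#     run_consolidation("tenant-42")   # hypothetical job entry point
```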
Chapter 13 — Query optimizer: The optimizer converts a raw query string into a ParsedQuery struct with: entity_refs (resolved entity IDs), temporal_window (explicit or inferred date range), type_hints (optimizer-only, not user-facing), specificity (specific vs. general — drives retriever weighting), negations (terms to exclude), and rewrites (synonym expansion). The optimizer selects among four retrieval plans based on the ParsedQuery feature vector — a hand-coded decision tree, not a learned model.
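A sketch of what the ParsedQuery struct and plan selection might look like in code. The field names follow Chapter 13's description, but the types, defaults, and plan names are assumptions rather than the spec's definitions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class ParsedQuery:
    """Field names follow Chapter 13; types and defaults are assumed for illustration."""
    entity_refs: list[str] = field(default_factory=list)      # resolved entity IDs
    temporal_window: Optional[tuple[date, date]] = None       # explicit or inferred date range
    type_hints: list[str] = field(default_factory=list)       # optimizer-only, not user-facing
    specificity: str = "general"                               # "specific" or "general"
    negations: list[str] = field(default_factory=list)        # terms to exclude
    rewrites: list[str] = field(default_factory=list)         # synonym expansions


def select_plan(q: ParsedQuery) -> str:
    """Toy stand-in for the hand-coded decision tree; the plan names are invented."""
    if q.entity_refs and q.temporal_window:
        return "entity_temporal"
    if q.entity_refs:
        return "entity_first"
    if q.specificity == "specific":
        return "precision"
    return "broad_recall"
```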
If you're worried about production posture
Three chapters address security and operations:
- Chapter 16 (Deployment) — modes, infrastructure assumptions.
- Chapter 17 (Security) — auth, isolation, key management.
- Chapter 11 (Observability) — failure modes and SLOs.
The companion /security page is the public-facing version with current compliance state.
Chapter 17 — Security specifics: Three isolation boundaries: (1) namespace isolation (database-level, not query-level — queries cannot cross namespace boundaries even with crafted inputs); (2) encryption at rest (AES-256 for stored memories, separate key per namespace); (3) key management (keys stored in a secrets manager, never in the database alongside the data). Chapter 17 also covers the disclosure policy and the currently-known limitations (the managed cloud does not yet offer customer-managed KMS integration — roadmapped).
Chapter 11 — Observability instrumentation: Three SLOs to instrument from day one: retrieval latency P99 < 200ms, write pipeline throughput > 100 candidates/second, and false negative rate < 5%. The chapter provides Prometheus metric names for each stage of the write pipeline and the retrieval fan-out. Structured traces (via OpenTelemetry) are emitted for every write and read operation, keyed by extraction_trace_id (write) and query_trace_id (read). These trace IDs are also stored in provenance metadata, creating a link between a retrieval result and the exact operation that created it.
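A minimal sketch of the two instrumentation hooks, assuming prometheus_client and the OpenTelemetry Python SDK. The metric name, attribute key, and run_retrieval stub are illustrative; Chapter 11 lists the canonical names.

```python
from prometheus_client import Histogram
from opentelemetry import trace

# Retrieval latency, bucketed so the "P99 < 200ms" SLO is easy to read off.
RETRIEVAL_LATENCY = Histogram(
    "recall_retrieval_latency_seconds",   # illustrative name; see Chapter 11 for the real ones
    "End-to-end retrieval latency",
    buckets=(0.025, 0.05, 0.1, 0.2, 0.5, 1.0),
)

tracer = trace.get_tracer("recall.retrieval")


def run_retrieval(query: str) -> list:
    ...  # placeholder for the actual retrieval fan-out


def instrumented_read(query: str, query_trace_id: str):
    with tracer.start_as_current_span("recall.read") as span:
        # The same trace ID is stored in provenance metadata, linking result to operation.
        span.set_attribute("recall.query_trace_id", query_trace_id)
        with RETRIEVAL_LATENCY.time():
            return run_retrieval(query)
```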
The deployment chapter deserves a standalone read
Chapter 16 is often underweighted by technical readers who treat it as boilerplate. It isn't. The three deployment modes — embedded, self-hosted, managed — have meaningfully different properties beyond just "where does the binary run."
Embedded mode runs the entire memory pipeline in-process. There is no network boundary between your application and the memory layer. This means retrieval latency is dominated by disk I/O (for SQLite-vec) or local pgvector, not network round-trips. It also means the data never leaves your process — there is no network to intercept. The tradeoff: you own the upgrade cycle, and if the Recall version embedded in your binary has a bug, you need to ship a new binary to fix it.
Self-hosted mode runs Recall as a sidecar or separate service. The memory layer is network-accessible within your infrastructure, but controlled by you. This unlocks pgvector-backed persistence with proper HNSW indexing, multi-tenant namespace isolation, and the full background worker suite. The tradeoff: you're operating a stateful service, which is qualitatively harder than operating stateless compute.
Managed mode is the cloud offering — you call our API, we handle the infrastructure. Chapter 16 is honest about what this implies for data residency (currently US-West and EU-West regions), what the SLA targets are (99.9% retrieval availability), and what we don't yet have (customer-managed KMS, custom regions on request). If you're evaluating Recall for a regulated workload, Chapter 16 is the chapter to read before Chapter 17.
If you're checking our work
Chapter 18 is the roadmap. Chapter 19 is appendices — citations, prior art, references to academic work we lean on. Both are short and worth a skim.
Chapter 18 is unusual for a technical spec: it doesn't just list features on a timeline, it explains which architectural decisions are open vs. closed. Closed decisions (the five memory types, the lifecycle state machine, RRF as the fusion strategy) are unlikely to change because they have substantial downstream dependencies in the implementation. Open decisions (the exact confidence formula weights, the choice of MMD over other drift statistics, the background job schedule) may change as we accumulate more empirical data. The roadmap flags which category each item falls into, so you can evaluate which pending changes are likely to affect your integration.
Chapter 19's citations are worth reading as a secondary reading list. The spec draws substantially on the original HNSW paper (Malkov & Yashunin 2018), on the Reciprocal Rank Fusion paper (Cormack et al. 2009), and on more recent work on agent memory from 2023-2025. If you want to understand why a design choice was made rather than just what it is, the citations give you the intellectual lineage.
How the spec evolves
The spec is versioned. We tag releases as the architecture firms up. Major changes go through public proposal first; the spec repository is the source of truth. If you find an error or want to argue with a design choice, the issue tracker is the right place.
The versioning convention is SemVer-inspired but applied to architecture rather than code. Patch versions are typographic corrections and clarifications that don't change meaning. Minor versions add new sections, cover new deployment modes, or add precision to existing specifications without contradicting them. Major versions introduce breaking changes — a new memory type, a change to the confidence formula weights, a revised lifecycle state machine. When a major version ships, we publish a migration guide that explains what changed and what existing integrations need to account for.
Currently the spec is at 1.x. We expect to stay there for the foreseeable future — the core architecture (five types, seven write stages, five retrievers, RRF fusion) is stable. The next major version would likely be triggered by adding a sixth memory type, which requires changes to the data model, the decay schedule, and the confidence formula simultaneously.
Quick links: github.com/arc-labs/recall, /research (overview), /learn (interactive).
A note on reading order within chapters
Most chapters have the same internal structure: motivation → formal specification → implementation notes → known limitations. If a chapter is dense, the implementation notes section is usually the most useful entry point — it describes the concrete behavior in plain terms, and you can work backward to the formal spec for precision when you need it.
The exception is Chapter 03 (Mathematical foundations), which is purely formal. There's no "motivation" section because the formulas are referenced by every other chapter. We recommend reading Chapter 03 after you've read at least Chapters 00, 02, and 04, so you have context for what each variable represents before you encounter it in a formula.
What to read alongside the spec
The spec is precise but dense. Three companion resources make it more approachable:
- The Learn track at /learn renders spec chapters as interactive pages with D3 visualizations. If Chapter 03's confidence formula feels abstract, the confidence-formula learn page walks through a worked example turn by turn.
- The research papers at /research cover the topics where we have quantitative evaluation: the seven-stage pipeline's rejection rates, the 2026 benchmark comparing retrieval systems, and three long-form papers on confidence scoring, hallucination defense, and drift taxonomy.
- The cheat sheet at /learn/cheat-sheet is a single page with every formula, threshold, and constant in the spec. When you need to remember "what's the default k in RRF?" or "what's the entity half-life?" — that's the page.
What the spec doesn't cover
Some topics are deliberately out of scope:
- Language model selection: the spec assumes a capable LLM for extraction and reranking but is model-agnostic. We've evaluated the pipeline with GPT-4o, Claude 3.7 Sonnet, and Gemini 2.0 Flash as the extraction backbone. Extraction quality varies; we'll publish per-model calibration data separately.
- Conversation turn format: the write pipeline accepts raw turn objects (user + assistant string pairs). The spec doesn't prescribe how you structure your agent's conversation — that's your concern.
- Application-layer memory policy: when to write, what scope to write with, when to expire — these are decisions the spec architecturally supports but doesn't dictate. The use-case guides at /use-cases cover these policy decisions by domain.
The out-of-scope list also reveals something about the design philosophy: Recall is not trying to be a complete agent framework. It's trying to be the best possible memory layer that slots into a complete agent framework. The interface boundaries (write turns, read memories) are narrow by intent. If you want a system that also handles prompt construction, tool orchestration, and response generation, Recall is not that — but it will compose cleanly with whatever does handle those things.
This is worth flagging because it's the most common misread of the spec among engineers who come from frameworks that try to handle everything. The spec doesn't discuss prompt injection, because prompt construction is upstream of Recall's write interface. It doesn't discuss tool selection, because tool selection is downstream of Recall's read interface. The spec's scope is the memory layer only — and it tries to do that one thing well enough that you don't have to think about it again.