What we shipped — April 2026
Short post: a snapshot of what we shipped this month, on this site and in the codebase. We intend to do a monthly recap; this is the first.
April's goal was to take the architectural spec from "open document on GitHub" to "something an engineer can learn from without reading 400KB of dense prose." That meant interactive content, reference material at multiple levels of depth, and enough trust-layer transparency that developers evaluating Recall could make an informed decision without a sales call. The headline number is 28 Learn pages and 24 D3 demos, but the work behind those numbers is the more interesting story.
The Learn track
Twenty-eight pages on agent memory engineering, organized into five tracks. Each page has at least one interactive D3 demo; the most polished is the cross-pipeline playground that runs a single turn through the full memory architecture.

The five tracks:

- Foundations: what is agent memory, why agents forget, typed memory vs. flat
- Write pipeline: all 7 stages individually, three-tier deduplication, pre-filter rejection, LLM extraction as filtering
- Read pipeline: the five retrievers, the query optimizer, hybrid retrieval, HNSW tuning
- Math: the confidence formula, freshness decay, reciprocal rank fusion, BM25, the logarithmic repetition boost
- Production: scaling to 1B memories, concept drift, hallucination defense, background workers, observability
Each page went through up to three passes: a structural draft from the spec, then a math pass to verify every formula, then an interactive demo pass. The D3 demos were the hardest part; the cross-pipeline playground alone, which traces a single turn through all 7 write stages and all 5 read stages, took about 40 hours to implement.
Two pages took unexpectedly long: HNSW tuning required us to run actual HNSW benchmarks (pgvector on test data) to get honest parameter sensitivity numbers — we didn't want to publish a table we couldn't verify. Three-tier deduplication required working out the union-find algorithm's behavior on edge cases (chains of near-duplicates vs. star clusters) before we were confident the description was accurate.
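For readers who want the intuition behind those edge cases: a minimal union-find sketch of how near-duplicate candidates collapse into clusters. The similarity pairs and the clustering helper below are illustrative placeholders, not Recall's actual three-tier deduplication code.

```typescript
class UnionFind {
  private parent: number[];
  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // path halving keeps trees shallow
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    this.parent[this.find(a)] = this.find(b);
  }
}

// A "chain" (0~1, 1~2) and a "star" (3~4, 3~5, 3~6) each collapse into a single
// cluster; these are the two shapes the page had to describe carefully, because
// members of one cluster may never have been directly compared to each other.
function clusterCandidates(similarPairs: [number, number][], n: number): number[][] {
  const uf = new UnionFind(n);
  for (const [a, b] of similarPairs) uf.union(a, b);
  const groups = new Map<number, number[]>();
  for (let i = 0; i < n; i++) {
    const root = uf.find(i);
    groups.set(root, [...(groups.get(root) ?? []), i]);
  }
  return [...groups.values()];
}

clusterCandidates([[0, 1], [1, 2], [3, 4], [3, 5], [3, 6]], 7);
// => [[0, 1, 2], [3, 4, 5, 6]]
```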
Reference content
- Glossary — 50+ terms with definitions.
- Cheat sheet — every formula, threshold, and constant on one page.
- Three listicles: best agent memory frameworks, best vector databases for agent memory, top 10 mistakes.
The cheat sheet took longer than any individual Learn page. The constraint: every formula, threshold, and constant on a single page, with enough annotation that each entry is interpretable without context. The result: 47 entries covering the confidence formula and all its sub-components, the full freshness decay schedule per type, the RRF and BM25 formulas, HNSW parameters and their sensitivity ranges, write-pipeline stage names and rejection rates, retriever weights, and the background worker job schedule.
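As an example of the density level the cheat sheet aims for, here is the standard reciprocal rank fusion formula as a runnable sketch. The formula (score = Σ 1/(k + rank)) and the conventional default of k = 60 come from the RRF literature; the exact constant and per-retriever weights Recall uses are what the cheat sheet documents.

```typescript
// Standard reciprocal rank fusion: score(d) = sum over retrievers of 1 / (k + rank(d)).
// k = 60 is the conventional default from the RRF literature; Recall's actual
// constant and any per-retriever weighting are documented on the cheat sheet.
function reciprocalRankFusion(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((memoryId, index) => {
      const rank = index + 1; // ranks are 1-based in the formula
      scores.set(memoryId, (scores.get(memoryId) ?? 0) + 1 / (k + rank));
    });
  }
  return scores;
}

// Fusing two retrievers' ranked lists: "m2" appears high in both, so it wins.
reciprocalRankFusion([
  ["m1", "m2", "m3"], // e.g. vector retriever
  ["m2", "m4", "m1"], // e.g. BM25 retriever
]);
```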
The three listicles (best agent memory frameworks, best vector databases for agent memory, top 10 mistakes) are unusual content for us — we don't typically write comparison content. The reasoning: these are the queries that bring engineers into the decision process. A developer who Googles "best agent memory framework 2026" and lands on an honest, technically grounded comparison is a developer who can make a real decision. We ran the comparison methodology through the spec — every claim in those pages is architecturally grounded, not marketing.
The glossary also deserves mention: 50+ terms, each defined at the precision level of the spec. Terms like "supersession" (when a new memory replaces an existing one of the same type for the same entity), "namespace" (the isolation boundary for a user or session), and "retrieval fan-out" (the parallel execution of the five retrievers before RRF fusion) appear throughout the spec and documentation but aren't always defined where they first appear. The glossary is the canonical reference. It's also schema-marked as a DefinedTermSet so that AI search engines can surface individual definitions directly.
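For the curious, the markup has roughly this shape (schema.org's DefinedTermSet / DefinedTerm types). The object below is an abbreviated sketch, not the exact JSON-LD on the live page.

```typescript
// Abbreviated sketch of the glossary's structured-data markup. Two terms shown;
// the live page carries the full 50+ term set and may use additional properties.
const glossarySchema = {
  "@context": "https://schema.org",
  "@type": "DefinedTermSet",
  name: "Agent memory glossary",
  hasDefinedTerm: [
    {
      "@type": "DefinedTerm",
      name: "supersession",
      description:
        "When a new memory replaces an existing one of the same type for the same entity.",
    },
    {
      "@type": "DefinedTerm",
      name: "namespace",
      description: "The isolation boundary for a user or session.",
    },
  ],
};

// Embedded in the page head so AI search engines can surface individual definitions.
const glossaryJsonLd = `<script type="application/ld+json">${JSON.stringify(glossarySchema)}</script>`;
```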
Use-case guides
Five use-case pages walking through the patterns and pitfalls of agent memory in different domains: customer support, coding agents, personal assistants, sales/CRM, and research agents.
The five use-case guides cover the domains where agent memory systems have proven out at scale: customer support (ticket continuity, account context, cross-rep handoffs), coding agents (decision history, style preferences, framework migration handling), personal assistants (entity graphs, privacy and deletion, context-aware retrieval gating), sales/CRM (account graphs, objection patterns, CRM integration boundary), and research agents (source provenance, claim relations, hypothesis tracking as state machines).
Each guide is structured around three questions: what fails without memory (the concrete failure mode, not abstract), what memory architecture solves it (specific types, retrievers, pipeline tuning), and what to watch out for (the pitfalls that weren't obvious until we'd seen them in practice). The research agents guide took the longest because the architecture for research workloads (no auto-supersession, atomic claim extraction, manual contradiction resolution) is genuinely different from the others.
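To make "hypothesis tracking as state machines" concrete, here is one way such a state machine could look. The state names and transitions below are invented for illustration; the research-agents guide defines the actual model.

```typescript
// Illustrative only: a minimal hypothesis state machine. The states and
// allowed transitions are placeholders, not the guide's actual model.
type HypothesisState = "proposed" | "supported" | "contradicted" | "retired";

const allowedTransitions: Record<HypothesisState, HypothesisState[]> = {
  proposed: ["supported", "contradicted", "retired"],
  supported: ["contradicted", "retired"],
  contradicted: ["supported", "retired"],
  retired: [],
};

function advance(current: HypothesisState, next: HypothesisState): HypothesisState {
  if (!allowedTransitions[current].includes(next)) {
    throw new Error(`Invalid transition: ${current} -> ${next}`);
  }
  return next;
}
```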
Trust pages
- /security — encryption, isolation, key management, disclosure. Honest about state.
- /open-source — license, governance, contributing, security policy.
- /changelog — what's new, RSS feed.
- /status — managed cloud regions and SLA targets.
The security page was the most contentious internally. We wanted to be honest about what we don't yet have (customer-managed KMS, FedRAMP authorization, SOC 2 Type II report) while being accurate about what we do have (AES-256 at rest, namespace isolation at the database level, TLS in transit, a defined disclosure policy). The compromise: publish exactly what's true, with a public roadmap for what's missing. "Honest about state" in the description is the operating principle.
The open-source page covers licensing and governance: the core is Apache 2.0, the Rust crate is published to crates.io, and the N-API bindings ship on npm. One specific thing we want engineers to know: the bindings are not wrappers around a hosted API; they link against the actual Rust binary. There is no network boundary when you use the local embedded mode. The security properties of embedded mode (data never leaves your process) are real, not marketing.
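A short sketch of what that means in practice. The package name and constructor options below are placeholders rather than the published API; the point is that nothing in this snippet opens a network connection.

```typescript
// Placeholder package name and options, for illustration only: in embedded mode
// the N-API binding runs the Rust engine inside your process, so memory data
// never crosses a network boundary.
import { RecallClient } from "@recall/memory"; // hypothetical package name

const memory = new RecallClient({
  mode: "embedded",             // local binding, no hosted service involved
  storage: "./agent-memory.db", // data stays on your own disk
});
```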
Quickstart and FAQ
/quickstart — five steps from zero to a working memory-enabled agent. Per-language pages for TypeScript, Python, and Rust.
/faq — common questions about agent memory and Recall, schema-marked for AI search engines.
The quickstart was written with a specific constraint: every step must be verifiable with a console.log or equivalent — no "trust that it worked." Step 1 initializes the client and verifies the connection. Step 2 writes three turns and prints the extraction result so you can see what memories were actually created. Step 3 reads memories and prints the retrieval result with confidence scores. Step 4 shows the full lifecycle (write → read → update → delete). Step 5 connects the memory layer to a real LLM call so the output is demonstrably different from what you'd get without memory.
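Condensed, the five steps look something like the sketch below. Every identifier (package, method, and field names) is a placeholder standing in for the published API; the real quickstart pages show the exact calls. What carries over is the constraint: every step prints something you can check.

```typescript
import { RecallClient } from "@recall/memory"; // placeholder package name, as above

const memory = new RecallClient({ mode: "embedded", storage: "./quickstart.db" });

// Step 1: initialize the client and verify the connection.
console.log(await memory.ping());

// Step 2: write three turns and print what was actually extracted.
const extraction = await memory.write({
  namespace: "demo",
  turns: [
    { role: "user", content: "We're migrating the API from Express to Fastify." },
    { role: "assistant", content: "Noted, I'll keep the migration in mind." },
    { role: "user", content: "Also, we deploy to Fly.io, not Vercel." },
  ],
});
console.log(extraction.memories);

// Step 3: read memories back and print the retrieval result with confidence scores.
const retrieved = await memory.read({ namespace: "demo", query: "deployment target" });
console.log(retrieved.map((m) => ({ text: m.content, confidence: m.confidence })));

// Steps 4 and 5 (update/delete, then wiring retrieval into a real LLM call)
// follow the same pattern: every call returns something printable to verify.
```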
The per-language pages are not just copy-paste of the same logic in different syntax. The TypeScript page uses the N-API bindings in embedded mode. The Python page uses the Python bindings against a local SQLite-vec store. The Rust page builds a minimal agent binary that links against the Rust crate directly. The underlying API is identical, but the integration pattern differs by language, and we wanted each page to show the idiomatic pattern for that ecosystem.
The FAQ is marked up with FAQ schema (JSON-LD FAQPage) so that AI search engines and featured snippets can surface individual answers directly. The questions were selected by analyzing the agent engineering community — Stack Overflow threads, Discord channels, GitHub issues — for the questions that come up most frequently but have the worst signal-to-noise ratio in existing answers. "What's the difference between semantic memory and episodic memory for agents?" consistently gets answers that are either too abstract (quoting cognitive science literature) or too specific (describing one particular framework). We tried to answer at the architecture level: what the distinction means for a production system, not for a psychology paper.
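The markup follows the standard schema.org FAQPage shape; one abbreviated entry is sketched below, with the answer text paraphrased rather than quoted from the live FAQ.

```typescript
// One FAQ entry in the standard FAQPage shape. The answer text is an
// illustrative paraphrase, not the exact copy on /faq.
const faqSchema = {
  "@context": "https://schema.org",
  "@type": "FAQPage",
  mainEntity: [
    {
      "@type": "Question",
      name: "What's the difference between semantic memory and episodic memory for agents?",
      acceptedAnswer: {
        "@type": "Answer",
        text: "Semantic memories are durable facts about entities; episodic memories record what happened in a specific interaction. In a production system the distinction drives how memories are typed, decayed, and retrieved.",
      },
    },
  ],
};
```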
What's next
May 2026 priorities: reproducible benchmark harness (so the comparison pages can stop flagging "unverified" cells), more interactive demos for the math pages, and the first external case study from a design-partner customer.
The first external case study will cover a design-partner customer who has been running Recall in production for three months on a customer support agent. The aggregate numbers are good, but what we want to publish is the operational story: what they had to tune, what surprised them, and what they would do differently if they were starting over. We'll publish that with their review and approval in May.
The benchmark harness deserves a bit more context. The three listicle pages (best frameworks, best vector databases, top 10 mistakes) contain comparison tables. Several cells in those tables are currently flagged "unverified" — they represent claims we believe are accurate based on documentation and community reports, but that we haven't validated through direct measurement. The benchmark harness will let us run standardized retrieval quality and latency tests against each system and replace "unverified" with reproducible numbers. The harness itself will be open-source, so other teams can run the same tests against their own data and verify our numbers.
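The harness isn't released yet, so the sketch below is an assumption about its shape rather than a description of it: a standardized retrieval-quality metric (recall@k here) plus latency percentiles, run against any system behind a common interface.

```typescript
// Assumed harness shape: any system under test implements a common query
// interface, and the harness reports recall@k and a latency percentile.
interface RetrievalSystem {
  query(text: string, k: number): Promise<string[]>; // returns memory/document IDs
}

async function benchmark(
  system: RetrievalSystem,
  cases: { query: string; relevant: Set<string> }[],
  k = 10
) {
  const latencies: number[] = [];
  let recallSum = 0;
  for (const c of cases) {
    const start = performance.now();
    const ids = await system.query(c.query, k);
    latencies.push(performance.now() - start);
    const hits = ids.filter((id) => c.relevant.has(id)).length;
    recallSum += hits / c.relevant.size;
  }
  latencies.sort((a, b) => a - b);
  const p99Index = Math.min(latencies.length - 1, Math.floor(latencies.length * 0.99));
  return {
    recallAtK: recallSum / cases.length,
    p99LatencyMs: latencies[p99Index],
  };
}
```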
The interactive demos roadmap for May focuses on the math track. The confidence formula page currently has a static worked example. We want to replace it with a live calculator — input a conversation turn, see the extraction result, adjust the source strength slider, watch the confidence score update. Similarly for the freshness decay page: a timeline visualization where you can set a memory's creation date and type, then scrub forward in time and watch the effective confidence decay. These require more implementation work than the write-pipeline demos because they need to expose sub-components of the pipeline (the confidence calculator, the decay function) as standalone interactive elements.
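The decay function the timeline demo needs to expose is, in essence, a small pure function. The exponential form and the per-type half-lives below are placeholder assumptions; the freshness decay page and the cheat sheet document the real schedule.

```typescript
// Placeholder decay model: exponential decay with a per-type half-life.
// Neither the form nor these constants are Recall's published schedule.
const halfLifeDays: Record<string, number> = {
  episodic: 30,    // placeholder
  semantic: 180,   // placeholder
  procedural: 365, // placeholder
};

function effectiveConfidence(baseConfidence: number, type: string, ageDays: number): number {
  const halfLife = halfLifeDays[type] ?? 90;
  return baseConfidence * Math.pow(0.5, ageDays / halfLife);
}

// Scrubbing forward in time in the demo is just re-evaluating this with a larger ageDays.
effectiveConfidence(0.9, "episodic", 60); // => 0.225
```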
What we learned from shipping all of this
Three things surprised us in the April build:
Math pages drive the most engagement, not the how-to pages. We expected the quickstart and use-case guides to get the most traffic. The HNSW tuning page and the confidence formula page outperformed them by 3x in average session time. Engineers are hungry for the internals, not just the interface.
The interactive demos are load-bearing. We considered skipping them to ship faster. We didn't, and we're glad — the cross-pipeline playground has become the most-cited page in community discussions. "Have you seen the Recall pipeline demo?" is a common link in agent engineering Slack channels. The demo does pedagogical work that prose cannot.
Five use-case pages is not enough. We've gotten inbound interest for at least four more: healthcare (HIPAA requirements), legal (privilege boundaries), finance (audit trail requirements), and education (student progress tracking). These will likely ship in May, after we've worked through the compliance implications in the security chapter.
One broader observation from the April build: the hardest part of technical content is calibrating depth. Too shallow, and an experienced engineer learns nothing and doesn't trust you. Too deep, and you lose the reader who just wants to know if the system does what they need. We tried to solve this with progressive disclosure — every Learn page has a "quick answer" at the top (one paragraph, the key claim) and then the full technical content below. The cheat sheet is the maximum-density version of the same idea. We'll see from analytics whether the calibration worked, but we feel better about it than we expected to when we started.
The April work also clarified something about what "honest technical content" means in practice: it requires saying things that are uncomfortable. The security page lists what we don't have yet. The FAQ has an answer to "should I use Recall for regulated data?" that honestly says "it depends on your regulatory environment and we're not there yet on customer-managed KMS." The confidence formula page explains the empirical calibration methodology and acknowledges that the weights may not generalize to all conversation distributions. None of this is typical of software documentation. We think it's the right way to build trust with engineers who will eventually put this system in production.
The production track in detail
The production track — the fifth group of Learn pages — covers the operational concerns that you don't face until you're running agent memory at scale, but that you need to design for from the beginning. It's the most densely interconnected track: the scaling page references the HNSW tuning page, which references the background workers page, which references the observability page.
Scaling to 1B memories is the page that required the most upfront thinking. Scaling memory systems is qualitatively different from scaling stateless compute — the challenge isn't request throughput, it's that the retrieval quality degrades as the memory graph gets denser (more memories per entity, more entity relations to traverse). The page walks through the inflection points: what changes at 1M memories per namespace, at 10M, at 100M, and at 1B. At each tier, different architectural assumptions break down and different mitigations apply. The four scaling tiers in Chapter 14 (SQLite-vec → pgvector HNSW → sharded pgvector → specialized vector DBs) map directly onto these inflection points.
Concept drift is the page that most often prompts questions in community channels. The four drift types (concept drift, data drift, schema drift, vocabulary drift) sound abstract, but the concrete failure modes are recognizable to anyone who has built a long-running agent: the agent that keeps recommending a framework the user abandoned six months ago (concept drift), the agent whose retrieval gets worse over time because the embedding model was updated (data drift), the agent that stops understanding certain queries after a schema migration (schema drift), or the agent that doesn't recognize that "k8s" and "Kubernetes" are the same thing (vocabulary drift). The page covers detection strategies for each and the remediation job in the background worker that addresses each type.
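As a concrete illustration of the vocabulary-drift case, the simplest possible remediation is an alias table applied when entity names are normalized. This is an illustrative mechanism, not necessarily how Recall's remediation job works.

```typescript
// Illustration of the vocabulary-drift failure mode: without alias resolution,
// "k8s" and "Kubernetes" index as unrelated entities.
const aliases = new Map<string, string>([
  ["k8s", "kubernetes"],
  ["postgres", "postgresql"],
]);

function normalizeEntity(name: string): string {
  const key = name.trim().toLowerCase();
  return aliases.get(key) ?? key;
}

// Both of these now resolve to the same entity node in the memory graph.
normalizeEntity("k8s");        // "kubernetes"
normalizeEntity("Kubernetes"); // "kubernetes"
```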
The hallucination defense page in the production track is a higher-level treatment than Chapter 08 of the spec. Where Chapter 08 explains the three-layer mechanism in detail, the Learn page focuses on operational questions: how do you know the defense is working? What metrics indicate that the escape rate is increasing? When should you tune the thresholds, and how? The page recommends specific alerting thresholds (write-time grounding acceptance rate < 70% suggests the pre-filter threshold may be too aggressive; store-time consistency scan conflict rate > 15% suggests the entity graph has contradictions that need manual review) and explains the trade-offs in threshold selection.
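Expressed as a check against exported metrics, those two recommendations look roughly like this. The metric names are placeholders; the thresholds are the ones from the page.

```typescript
// Placeholder metric names; the 70% and 15% thresholds are the page's recommendations.
interface HallucinationDefenseMetrics {
  groundingAccepted: number;    // write-time grounding: candidates accepted
  groundingEvaluated: number;   // write-time grounding: candidates evaluated
  consistencyConflicts: number; // store-time consistency scan: conflicts found
  consistencyScans: number;     // store-time consistency scan: memories scanned
}

function defenseAlerts(m: HallucinationDefenseMetrics): string[] {
  const alerts: string[] = [];
  if (m.groundingAccepted / m.groundingEvaluated < 0.7) {
    alerts.push("Grounding acceptance < 70%: pre-filter threshold may be too aggressive.");
  }
  if (m.consistencyConflicts / m.consistencyScans > 0.15) {
    alerts.push("Consistency conflict rate > 15%: entity graph needs manual review.");
  }
  return alerts;
}
```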
The observability page was written in close coordination with the team members who run the managed cloud infrastructure. It's grounded in the actual Prometheus metrics we export, the actual OpenTelemetry trace structure, and the actual alerting rules we use in production. The SLO targets (retrieval P99 < 200ms, write throughput > 100 candidates/second, false negative rate < 5%) are numbers we hold ourselves to, not aspirational targets. If you set up the same Prometheus dashboards we use internally, you'll be running the same monitoring we are.
The production track pages are linked to the spec chapters they derive from. We made this explicit because the Learn pages are intentionally simplified — if you read a Learn page and it raises a question that the page doesn't answer, the spec chapter is where to look next. The cross-referencing was one of the last things we added in April, and it turned out to be one of the most useful structural decisions of the whole build.