Lessons from building Recall in Rust

By Arc Labs · 5 min read

We built Recall's core in Rust. This is a blunt account of why, what worked, and what we'd reconsider.

Why Rust

Three properties drove the choice:

  • Predictable latency. Memory retrieval sits on the critical path of every agent turn. Garbage-collected runtimes introduce p99 spikes that compound across a multi-retriever fan-out. Rust's deterministic deallocation gives us flat tail latency.
  • Zero-copy serialization. Vectors are large; copying them between layers adds up. Rust's ownership model lets us pass references through the pipeline without paying for serialization at every stage.
  • Cross-language SDKs. A Rust core compiles cleanly into both Node (via napi-rs) and Python (via PyO3). One codebase; native bindings everywhere.
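The zero-copy point is easiest to see in miniature: a scoring function can borrow an embedding slice instead of owning a copy, so nothing is allocated or serialized on the hot path. A minimal std-only sketch (the function is our illustration, not Recall's API):

```rust
/// Cosine similarity over borrowed slices: the caller keeps ownership,
/// and no allocation or copy happens on this path.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f32, 0.0f32, 0.0f32);
    for (&x, &y) in a.iter().zip(b) {
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

fn main() {
    let query = vec![1.0, 0.0, 0.0];
    let doc = vec![1.0, 0.0, 0.0];
    // Both vectors are only borrowed; the caller still owns them afterwards.
    let score = cosine(&query, &doc);
    assert!((score - 1.0).abs() < 1e-6); // identical vectors score 1.0
    assert_eq!(query.len(), 3); // query is still usable here
}
```

The same discipline extends through the pipeline: stages take `&[f32]` and the borrow checker proves no stage holds the data longer than it should.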
[Chart: Predictable Tail Latency (Simulated), JS vs Rust]

What worked well

napi-rs is the right Node binding story in 2026

We tried two paths: classic N-API via node-addon-api, and napi-rs. The latter was the right call. Builds are dramatically faster, you work with real Rust types rather than raw N-API handles, and the resulting SDK feels like idiomatic TypeScript without compromise.

#[napi]
pub struct Recall {
  inner: Arc<RecallCore>,
}

#[napi]
impl Recall {
  #[napi]
  pub async fn write(&self, args: WriteArgs) -> Result<WriteResult> {
    self.inner.write(args.into()).await
      .map(Into::into)
      .map_err(napi_err)
  }
}

Async functions on the Rust side become real Promise<T> in TypeScript. No callback bridging. No ceremony.

PyO3 is solid; the GIL is the gotcha

The Python bindings via PyO3 work, but the GIL costs us throughput in batch operations. Our workaround: release the GIL aggressively before any Rust-internal compute, re-acquire only at the boundary back to Python.

#[pyfunction]
fn write_batch(py: Python<'_>, items: Vec<PyTurn>) -> PyResult<Vec<PyWriteResult>> {
  // Convert to Rust-owned values while the GIL is still held.
  let rust_items: Vec<Turn> = items.into_iter().map(Into::into).collect();
  py.allow_threads(|| {
    // Heavy compute happens here without the GIL held
    runtime.block_on(core.write_batch(rust_items))
  })
  .map(|res| res.into_iter().map(Into::into).collect())
  .map_err(into_py_err)
}

sqlx + pgvector composes cleanly

We use sqlx for Postgres access. The pgvector extension's wire format integrates as a custom type with no friction. Compile-time SQL checking has caught countless bugs that would otherwise have shipped.

Performance characteristics we measured

We don't publish performance claims without measuring them. Here's what our internal benchmarks showed:

Write pipeline throughput: 380 candidates/second on a single 16-core ARM machine (no GPU). The bottleneck is the extraction LLM call, not the Rust pipeline — the pipeline itself can process > 10,000 candidates/second when extraction is mocked out. The practical limit is LLM throughput.

Retrieval latency breakdown (5-retriever fan-out, N=1M memories, p99 measured over 10,000 queries):

Component                                  P50     P99
HNSW nearest-neighbor search (pgvector)    8ms     22ms
BM25 (GIN index tsvector query)            4ms     11ms
Entity-graph traversal (2-hop, BFS)        6ms     18ms
Temporal range scan (BTREE)                2ms     6ms
Type-filter pre-scan                       1ms     3ms
RRF fusion and reranking                   2ms     5ms
Network + serialization overhead           3ms     12ms
Total                                      26ms    77ms

The fan-out is fully concurrent: all five retrievers run in a single tokio::join! block. The P99 total is dominated by HNSW (the slowest retriever) plus serialization overhead, not by any sequential execution.
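RRF itself is small enough to sketch. A minimal std-only version of reciprocal-rank fusion (k = 60 is the constant from the original RRF paper; Recall's actual constants and result types differ):

```rust
use std::collections::HashMap;

/// Reciprocal-rank fusion: each retriever contributes 1 / (k + rank + 1)
/// per document, so documents ranked highly by several retrievers win.
fn rrf_fuse(ranked_lists: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in ranked_lists {
        for (rank, id) in list.iter().enumerate() {
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank as f64 + 1.0);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by descending fused score.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}

fn main() {
    let semantic = vec!["m1", "m2", "m3"];
    let bm25 = vec!["m1", "m4", "m2"];
    let fused = rrf_fuse(&[semantic, bm25], 60.0);
    // "m1" tops both lists, so it wins the fused ranking.
    assert_eq!(fused[0].0, "m1");
}
```

The appeal of RRF over score-based fusion is that it only consumes ranks, so the five retrievers never need their raw scores calibrated against each other.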

GC pause comparison: we ran equivalent Python code (using psycopg2 for pgvector queries) and measured p99 latency at 340ms — 4.4x slower than the Rust path. Most of the gap is GC pauses compounded across the fan-out: each retriever runs in Python's GIL context, and a major GC cycle mid-fan-out inflates tail latency dramatically. Rust's deterministic deallocation eliminates this tail.

Memory footprint: the Recall binary (embedded mode) adds ~18MB to the host process's footprint at startup, growing proportional to the HNSW warm buffer (configured separately, default 64MB). For a typical Node.js server process, this is acceptable; for memory-constrained edge deployments, the warm buffer can be disabled.

The HNSW implementation decision

We considered four options for approximate nearest-neighbor search:

  1. Pure Rust (usearch, hnswlib-rs): maximum control, but we would have had to maintain the index ourselves
  2. pgvector: Postgres extension, mature, SQL-compatible, slightly slower than dedicated systems
  3. Qdrant: dedicated vector database, fastest, but adds a network boundary
  4. Weaviate/Pinecone: managed services, adds latency and external dependency

We chose pgvector. The reasoning:

pgvector gives us the ACID transaction boundary we needed for the write pipeline. The 7-stage write pipeline makes decisions (deduplicate, supersede, persist) that must be atomic. With a separate vector database, we'd need distributed transactions or two-phase commit. With pgvector, the entire pipeline runs inside a single Postgres transaction — the memory either commits fully or rolls back fully.

The performance trade-off is acceptable for our retrieval pattern. pgvector's HNSW is 20-30% slower than Qdrant's implementation at N=1M. At N=10M, the gap grows to 40-50%. But our retrieval is fan-out (5 concurrent queries), and the bottleneck quickly becomes network and serialization, not raw HNSW speed. The ACID guarantee is worth more to us than the vector-DB speedup at our current scale.

Operational simplicity: adding Qdrant means running and operating a separate service, a separate failure domain, and a separate backup strategy. pgvector is Postgres — operators already know how to run it.

We do anticipate moving the embedding layer to a dedicated vector database at 10M+ memories per namespace. That migration is planned; the abstraction layer (our VectorStore trait in Rust) makes it mechanical.
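The shape of that abstraction can be sketched. This is our illustration of the idea, not Recall's actual trait (the real one would be async and fallible); a brute-force in-memory implementation doubles as a test fixture:

```rust
/// Hypothetical shape of a vector-store boundary: pgvector and a dedicated
/// vector database would both implement this, making a migration mechanical.
trait VectorStore {
    fn upsert(&mut self, id: u64, embedding: Vec<f32>);
    fn search(&self, query: &[f32], top_k: usize) -> Vec<(u64, f32)>;
}

/// Brute-force in-memory implementation, useful in tests.
struct BruteForceStore {
    rows: Vec<(u64, Vec<f32>)>,
}

impl VectorStore for BruteForceStore {
    fn upsert(&mut self, id: u64, embedding: Vec<f32>) {
        self.rows.retain(|(rid, _)| *rid != id); // upsert replaces by id
        self.rows.push((id, embedding));
    }

    fn search(&self, query: &[f32], top_k: usize) -> Vec<(u64, f32)> {
        let mut scored: Vec<(u64, f32)> = self.rows.iter()
            .map(|(id, v)| {
                // Dot product as the similarity for illustration.
                let s: f32 = v.iter().zip(query).map(|(a, b)| a * b).sum();
                (*id, s)
            })
            .collect();
        scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
        scored.truncate(top_k);
        scored
    }
}

fn main() {
    let mut store = BruteForceStore { rows: Vec::new() };
    store.upsert(1, vec![1.0, 0.0]);
    store.upsert(2, vec![0.0, 1.0]);
    let hits = store.search(&[1.0, 0.1], 1);
    assert_eq!(hits[0].0, 1); // id 1 is closest to the query direction
}
```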

Lessons from the N-API boundary

The TypeScript SDK is a thin wrapper over the Rust binary via napi-rs. This boundary has some surprising properties:

Async semantics work cleanly: napi-rs maps Rust async fn to JavaScript Promise<T>. The tokio runtime runs on a dedicated thread pool; JavaScript awaits are genuine non-blocking suspensions. The Node event loop is not blocked during any Rust operation.

// In Rust: async fn that doesn't block Node's event loop
#[napi]
impl Recall {
  #[napi]
  pub async fn search(&self, args: SearchArgs) -> Result<Vec<MemoryResult>> {
    // Runs on tokio threadpool; Node event loop is free
    self.inner.search(args.into()).await
      .map(|results| results.into_iter().map(Into::into).collect())
      .map_err(napi_err)
  }
}

Error propagation requires explicit mapping: Rust errors are typed (RecallError::NotFound, RecallError::Conflict, etc.). JavaScript errors are Error objects with string messages. napi-rs provides a napi::Error type, but you have to write the mapping:

fn napi_err(e: RecallError) -> napi::Error {
  match e {
    RecallError::NotFound(id) => {
      napi::Error::from_reason(format!("Memory not found: {}", id))
    }
    RecallError::Conflict { memory_id, conflict_with } => {
      napi::Error::from_reason(
        format!("Conflict: {} contradicts {}", memory_id, conflict_with)
      )
    }
    RecallError::DatabaseError(db_err) => {
      // Don't leak internal DB state to JavaScript callers
      napi::Error::from_reason("Internal storage error")
    }
    _ => napi::Error::from_reason(e.to_string()),
  }
}

This mapping is where we've caught most of our API design mistakes. When you write the Rust-to-JavaScript error conversion, you're forced to think about what the JavaScript caller can actually do with each error type. RecallError::Conflict should surface enough information for the caller to retry with a merge strategy; RecallError::DatabaseError should not surface internal state.

Type generation is real but incomplete: napi-rs generates TypeScript .d.ts files from the Rust code. The generated types are accurate for primitive arguments, but complex nested structs require hand-written type augmentations. We generate the base types and then maintain a types.ts with the augmentations — a small but real maintenance surface.

Testing the boundary is the hardest part: Rust unit tests can't test the napi boundary (they're pure Rust). The napi binary can't be imported into a Node test runner without the full build. We ended up with three test layers: Rust unit tests (90% of coverage), a small Rust integration test suite that mocks the LLM calls, and a TypeScript integration test suite that tests the JavaScript API surface against a real Rust binary. The TypeScript tests caught the most boundary-specific bugs.

What we'd reconsider

Async story is heavier than it needs to be

The Rust async ecosystem is great in production but heavyweight to learn. Send bounds, lifetime gymnastics inside async fn, and the runtime split (tokio vs async-std vs smol) cost new contributors weeks of ramp-up. If we were starting today with a smaller team, we might prototype in Go and migrate later.

Error type design needs more discipline

Rust's Result<T, E> is wonderful — until you have ten error types and every binding boundary requires explicit conversions. We've consolidated to a single RecallError enum at the public API; internally, thiserror keeps things readable.

Testing the bindings is harder than testing the core

Rust unit tests are great. Testing that the napi-rs binding behaves correctly under all the JavaScript / async / error edge cases is a separate effort. We invested heavily in TypeScript-side integration tests run against the actual binary; that caught more bugs than any amount of Rust unit tests.

Would we pick Rust again? Yes — but knowing the costs. The right rule of thumb: pick Rust when the workload is latency-critical, when cross-language bindings matter, and when the team has at least one Rust-experienced senior. Pick Go or TypeScript for everything else.

What we'd tell teams evaluating Rust for an AI workload

Three questions determine if Rust is the right call:

Is latency-predictability a hard requirement? If you need flat p99 tail latency — because you're on the critical path of agent reasoning, because you're doing multi-retriever fan-out, because a single slow response breaks the user experience — Rust's deterministic deallocation eliminates the GC tail. If latency is loose (batch jobs, background processing) this advantage doesn't apply.

Do you need cross-language SDKs? A Rust core compiles cleanly to Node (napi-rs) and Python (PyO3). One codebase, native bindings, no network boundary. If you need SDKs in three or more languages, Rust amortizes the cross-language investment better than any alternative. If you're TypeScript-only, this advantage doesn't apply.

Does the team have Rust depth? The ramp-up cost for an experienced engineer coming from Go or TypeScript is 3-6 months before they're productive in idiomatic Rust. The borrow checker, lifetime annotations, and async Send bounds are genuinely hard. We've seen experienced engineers take 4 weeks just to get comfortable with Arc<Mutex<T>> vs Arc<RwLock<T>> trade-offs in the context of our specific access patterns. Budget for this honestly.

If all three are true: Rust is the right call. If one is false, evaluate carefully. If two are false, pick Go (deterministic performance, fast cross-language bindings via cgo or gRPC, shorter ramp).
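The Arc<Mutex<T>> vs Arc<RwLock<T>> trade-off mentioned above has one concrete rule of thumb: read-heavy shared state (a retriever's in-memory cache, say) wants RwLock so concurrent readers don't serialize behind one lock. A std-only illustration, not Recall's code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawn `n` reader threads that all take the read lock concurrently.
/// With a Mutex, these readers would serialize behind a single lock.
fn read_all(cache: Arc<RwLock<HashMap<String, u32>>>, n: usize) -> Vec<u32> {
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || *cache.read().unwrap().get("hits").unwrap())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let cache = Arc::new(RwLock::new(HashMap::new()));
    // Writes still take the exclusive lock.
    cache.write().unwrap().insert("hits".to_string(), 7);
    assert_eq!(read_all(cache, 4), vec![7, 7, 7, 7]);
}
```

The flip side, and the part that takes the 4 weeks: RwLock can starve writers under heavy read load, so write-heavy state is often better off with the plain Mutex.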

sqlx and compile-time SQL verification

The brief mention above doesn't do sqlx justice. Compile-time SQL checking is one of the most underrated quality-of-life wins in the entire Rust ecosystem, and it matters particularly for a system where the database schema evolves alongside the pipeline.

Here's what it looks like in practice:

// This fails to compile if the SQL is wrong, the column type mismatches,
// or the struct field doesn't match the database schema
let memories = sqlx::query_as!(
    MemoryRow,
    r#"
    SELECT id, content, type as "memory_type: MemoryType", confidence,
           valid_from, valid_until, scope_user_id
    FROM memories
    WHERE scope_user_id = $1
      AND valid_until IS NULL
    ORDER BY confidence DESC
    LIMIT $2
    "#,
    user_id,
    limit as i64
)
.fetch_all(&pool)
.await?;

At compile time, sqlx connects to a test database and verifies: the table exists, the columns exist, the types match the Rust struct, and the query is syntactically valid. If you rename a column in a migration, the code fails to compile before the bug reaches any test. This has saved us from several would-have-been production incidents — schema drift caught at build time, not in prod at 3am.

The workflow: run migrations before cargo build in CI. The DATABASE_URL environment variable points sqlx at the migrated schema. No runtime reflection, no ORM-style magic — just SQL that the compiler has verified.

The one friction point: compile-time checking requires a live database at build time, which complicates offline builds and some CI environments. Our workaround: commit the sqlx-data.json offline metadata file for CI builds that don't have a database available. It's a small cost for a large safety guarantee.

Ownership model and concurrent retrieval

The five-retriever fan-out is the critical performance path, and Rust's ownership model makes writing it safely straightforward:

pub async fn retrieve(&self, query: &ParsedQuery) -> Result<Vec<RankedMemory>> {
    let (semantic, bm25, entity_graph, temporal, type_filter) = tokio::join!(
        self.semantic_retriever.retrieve(query),
        self.bm25_retriever.retrieve(query),
        self.entity_graph_retriever.retrieve(query),
        self.temporal_retriever.retrieve(query),
        self.type_filter_retriever.retrieve(query),
    );

    // All five complete; now fuse with RRF
    let candidates = merge_results([semantic?, bm25?, entity_graph?, temporal?, type_filter?]);
    Ok(self.rrf_fuse(candidates))
}

Each retriever holds an Arc<PgPool>, a reference-counted handle to the shared connection pool. The compiler verifies that every retriever is Send + Sync wherever its future could move across threads (for example, if a retriever were handed to tokio::spawn). If any retriever type is accidentally not thread-safe, the code doesn't compile. In Python or Go, thread-safety is a runtime concern, documented in comments, verified in code review. In Rust, it's a compile-time guarantee.

The Arc<PgPool> model also means pgvector connection pooling is shared across all five retrievers without any explicit synchronization code. tokio::join! schedules all five futures on the tokio thread pool concurrently; the pool handles connection contention internally. The result: genuinely parallel retrieval with no locking ceremony on our side.

One subtlety: tokio::join! waits for the slowest of the five. If HNSW takes 22ms p99 and the others take 3-11ms, the total fan-out p99 is the HNSW p99 plus the fusion overhead (≈5ms), not the sum of all five. This is the right model for retrieval — the bottleneck determines the tail, not the accumulation.
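That "slowest retriever wins" behavior is easy to demonstrate with threads standing in for the five futures (a std-only sketch; the real code awaits tokio futures, not threads):

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Run all simulated "retrievers" concurrently and wait for every one,
/// mirroring tokio::join!: total latency tracks the slowest, not the sum.
fn fan_out(delays_ms: &[u64]) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = delays_ms.iter()
        .map(|&d| thread::spawn(move || thread::sleep(Duration::from_millis(d))))
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    start.elapsed()
}

fn main() {
    // Five simulated retrievers: the delays sum to 42ms, the slowest is 22ms.
    let elapsed = fan_out(&[8, 4, 6, 2, 22]);
    assert!(elapsed >= Duration::from_millis(22)); // bounded below by the slowest
    assert!(elapsed < Duration::from_millis(42));  // well under the sequential sum
}
```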

Cross-compilation and distribution

Shipping native binaries to three language ecosystems (Node, Python, and direct Rust) without breaking users across three operating systems and two CPU architectures requires discipline.

Our build matrix:

  • x86_64-unknown-linux-gnu (most server deployments)
  • aarch64-unknown-linux-gnu (AWS Graviton, Apple Silicon containers)
  • x86_64-apple-darwin (developer machines)
  • aarch64-apple-darwin (Apple Silicon developer machines)
  • x86_64-pc-windows-msvc (Windows development environments)

napi-rs has first-class support for this build matrix via GitHub Actions. The workflow: compile each target in its native environment (cross-compilation works but produces larger binaries and has occasional edge cases with dynamic linking). Each target produces a .node file; the npm package ships all of them and selects the right one at install time via a platform-detect shim.

PyO3 follows a similar pattern via maturin, which handles the Python wheel packaging across targets.

The hardest part: libssl and libpq (Postgres client) dynamic linking. On Linux, the .so version varies by distribution. Our solution: link statically against openssl (via the openssl crate's vendored feature) and against libpq (via PQ_LIB_STATIC). Static linking makes the binary self-contained at the cost of a larger file size (~8MB larger). Worth it: no "wrong libssl version" errors in production.
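The vendored-OpenSSL half of that is a one-line Cargo feature (the version number here is illustrative):

```toml
[dependencies]
# Compile OpenSSL from source and link it statically, instead of
# loading whatever libssl.so the host distribution ships.
openssl = { version = "0.10", features = ["vendored"] }
```

The libpq half is controlled by the PQ_LIB_STATIC environment variable at build time, as noted above.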

Closing

Rust is the right call for the core of an agent memory system. The investment compounds: faster bindings, predictable latency, fewer 3am pages. The costs are real too: longer ramp time, more careful API design, more discipline around error types and async shapes.

The specific wins — flat p99 tail latency, compile-time SQL verification, thread-safety guarantees enforced by the borrow checker, zero-copy vector passing through the pipeline — are not things you can retrofit onto a Go or Python codebase later. They emerge from the ownership model at every layer of the architecture. If you're building something where those properties matter, the ramp cost is worth paying. If you're prototyping or working on a batch job, pick the language your team already knows.

The core is open source. Read the code.
