Mem0 vs Letta vs Zep vs Recall

By Arc Labs Research · 14 min read

Disclosure: we are the team behind Recall. We tried to write this honestly. Cells marked unverified are differences we couldn't confirm from public sources at time of writing — we'd rather flag the gap than guess. If you've benchmarked these systems, we'd value the correction.

Snapshot

| Property | Mem0 | Letta | Zep | Recall |
| --- | --- | --- | --- | --- |
| Open source | Yes | Yes | Partial | Yes |
| Core language | Python | Python | Go | Rust |
| Storage default | Qdrant + Postgres | Postgres + pgvector | Postgres + pgvector | SQLite-vec → pgvector |
| Typed memory schema | Loose | Loose (block-based) | Loose | Strict (fact/pref/event/entity/relation) |
| Pre-store filtering | Light | Configurable | Light | 7-stage pipeline |
| Multi-retriever fusion | Single (semantic) | Configurable | Hybrid (vec + lexical) | Five (vec + BM25 + graph + temporal + type) |
| Entity graph | No | Adjacent (via blocks) | Yes | Yes |
| Temporal-first design | No | No | Yes | Adjacent |
| Drift detection | Unverified | Unverified | Unverified | Dual-signal |
| Hallucination defense layers | Unverified | Unverified | Unverified | 3-layer |
| Reported junk rate (audit) | 97.8% | Unverified | Unverified | <10% target |

When Mem0 wins

You want the lowest-friction starting point. Mem0's API is the simplest of the four — three lines of code from "no memory" to "agent persists facts." For prototypes and demos, this is the right choice. The trade-off is that production hardening — write-time filtering, supersession, drift detection — is largely a build-it-yourself exercise.
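As a rough illustration of that starting point, the open-source Python client looks like this (method names reflect recent Mem0 releases; check the current docs for exact signatures before copying):

```python
# Minimal Mem0 usage sketch. Method names reflect recent open-source releases;
# verify against current Mem0 documentation before relying on exact signatures.
from mem0 import Memory

m = Memory()  # default config: local vector store, LLM keys from environment
m.add("I prefer dark mode and I'm allergic to peanuts", user_id="alice")
results = m.search("what are this user's preferences?", user_id="alice")
```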

When Letta wins

You want the agent's memory shape itself to be programmable. Letta's "memory blocks" treat memory as a typed, editable region the agent can manipulate explicitly — closer to a programming abstraction than a retrieval one. For agents that reason about their own memory (auto-edit user preferences, organize their own notes), Letta is uniquely well-designed.

When Zep wins

You have heavy temporal-query workloads. Zep emphasizes time-anchored retrieval and built first-class temporal indexes. For applications where "what happened last week" or "what changed since" queries dominate, Zep's design is well-targeted. The trade-off is fewer primitives for non-temporal queries; you'll layer your own.

When Recall wins

You care about retrieval precision over time on long-running, high-volume conversational workloads. Recall's seven-stage write pipeline targets the junk-rate problem head-on, and the typed schema enables type-aware retrieval, supersession, and decay. The trade-off is a more involved write path — there's more to configure, and the rejection rate is higher than in systems that store more eagerly.

Dimensions not in the table

  • SDK ergonomics. Subjective; we'd rather not score these. All four have functional SDKs in their primary languages.
  • Hosted offering. All four offer some hosted variant. Pricing differences exist but change quarterly; benchmark current pricing yourself.
  • Integration ecosystem. Mem0 has the largest external integration surface (LangChain, LlamaIndex). Letta is most often used standalone. Zep and Recall are stack-agnostic.

How to evaluate honestly

Build the same agent against two of these systems with identical prompts. Measure:

  • Retrieval precision at 30 days (sample 100 queries; manually score top-5).
  • Latency p99 on your real query distribution.
  • Token cost per response (memory-induced overhead).
  • Operational toil (incidents per week, time spent on cleanup).

Public benchmarks are useful as filters but rarely match your workload. The real test is a small head-to-head on your data.

Mem0 architecture deep-dive

Mem0's core design is Python-based with Qdrant handling vector storage and Postgres managing metadata. The write path is lightweight: user-provided text goes through an LLM extraction call, the extracted facts are embedded, and the embeddings are upserted to Qdrant. There is no multi-stage pipeline; the LLM extraction call is the only gate between raw text and stored memory.

The "loose schema" in the comparison table means Mem0 stores key-value facts without enforced types: {"memory": "User prefers Python", "metadata": {"category": "preference"}}. The category field is user-defined and not enforced at the storage layer. This flexibility is the reason Mem0 is fast to start with — no schema migration, no type classification, no predicate normalization. The cost is that downstream logic (supersession, type-specific decay, type-aware retrieval) is not provided out of the box; you build it on top of the raw stored strings.

Supersession in Mem0 works by using an LLM similarity check at write time. When a new memory is deemed contradictory to an existing one, the old one is deleted and the new one is inserted. This is a simpler mechanism than typed supersession because it doesn't distinguish between a preference update, a fact update, and a relation change — all contradictions are treated the same way. That said, it handles the common case adequately. What it misses: partial supersessions (where the new memory is a qualified update, not a full replacement) and append-semantics conflicts (events shouldn't be superseded even when they appear contradictory — "user ran a marathon" and "user says they don't run" should coexist, not cancel).
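A hypothetical sketch of the difference (not code from Mem0, Recall, or any of the four systems) shows why append-semantics types need to be exempted from supersession:

```python
# Hypothetical illustration of typed supersession. The point: events keep
# append semantics; same-type preferences/facts can be superseded; everything
# else deserves a closer look rather than a blind delete-and-replace.
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    mem_type: str  # "fact" | "preference" | "event" | "relation"

def resolve_conflict(new: Candidate, existing: Candidate, contradicts: bool) -> str:
    if not contradicts:
        return "store_both"
    if "event" in (new.mem_type, existing.mem_type):
        # "user ran a marathon" vs "user says they don't run": both are history.
        return "store_both"
    if new.mem_type == existing.mem_type:
        return "supersede_old"   # e.g. a preference update replaces the old preference
    return "flag_for_review"     # partial or cross-type conflicts need finer handling
```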

Mem0's retrieval is a single ANN search against Qdrant. There is no BM25 component, no graph traversal, no temporal index. For most early-stage products this is sufficient — vector retrieval catches 75–85% of queries in practice. The remaining 15–25% covers exact identifier matches (API keys, account numbers), relational queries (who is connected to whom), and time-anchored queries (what happened on a specific date). These either fail silently or return semantically plausible but factually wrong results. If your application is in one of those categories, Mem0's retrieval layer will require supplementation before it's production-reliable.

One consequence of the loose schema worth flagging for operators: because nothing enforces what's stored, junk accumulates at the rate the LLM judges content to be storable. On conversational data where most turns contain some extractable fact, that rate is high. Audits from production Mem0 deployments (published by community members, unverified by us) suggest junk rates of 60–80% — meaning the majority of stored memories contribute noise rather than signal during retrieval. This is not a Mem0-specific flaw so much as the result of any system that prioritizes recall over precision at write time.

Letta architecture deep-dive

Letta, the system formerly known as MemGPT, implements a fundamentally different memory model. Instead of a passive store that the agent reads from at query time, Letta gives the agent active control over its own memory. "Memory blocks" are typed text regions that the agent can read and edit via tool calls: core_memory_append, core_memory_replace, archival_memory_insert, archival_memory_search.

The agent decides what to memorize. When the model determines something is worth keeping, it explicitly calls core_memory_append("user prefers TypeScript"). When it changes its mind, it calls core_memory_replace("user prefers Python", "user prefers TypeScript"). This makes the memory model highly flexible — the agent's own judgment determines what's stored and how it's organized. For agents that need to reason about their own knowledge state (self-updating preference models, agents that maintain structured notes about ongoing work), Letta's design is architecturally suited in a way the other three systems are not.
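As an illustration of the pattern (this is not the Letta SDK; only the tool names come from Letta's documented tool set, the handler below is hypothetical):

```python
# Hypothetical handler for Letta-style memory tools. Tool names match Letta's
# documented core-memory tools; the dispatch code itself is illustrative only.
core_memory = {"human": "user prefers TypeScript"}

def handle_tool_call(name: str, args: dict) -> None:
    if name == "core_memory_append":
        core_memory["human"] += "\n" + args["content"]
    elif name == "core_memory_replace":
        core_memory["human"] = core_memory["human"].replace(
            args["old_content"], args["new_content"]
        )

# The agent, not a write pipeline, decides when to emit these calls:
handle_tool_call("core_memory_replace",
                 {"old_content": "user prefers TypeScript",
                  "new_content": "user prefers Python"})
```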

The "adjacent entity graph" row in the comparison table reflects that Letta doesn't have a first-class entity graph. But because the agent controls its own memory, a sophisticated agent can build and maintain its own entity registry by writing structured facts to archival memory and querying them back via similarity search. This is the Letta design philosophy: programmability over out-of-the-box structure. Whether that's an advantage or a liability depends on how much you trust your model's judgment and how much schema structure you need.

Letta's limitations in production are worth being explicit about. First, memory quality is bounded by the model's judgment at write time. A model that over-generalizes stores bloated summaries that contain less than the original detail; one that under-extracts misses important specifics. Without a write-time pipeline that operates independently of the model's tool-calling decisions, junk accumulation is correlated with model quality — you're not protecting against model errors, you're downstream of them. Second, the agent calling memory tools adds latency and token cost to every turn. A model that decides to run three archival memory searches before responding adds those round-trips to response time. For low-latency applications, this is a real constraint. Third, archival memory is a flat vector store backed by Postgres and pgvector; there's no structured schema, no type-specific retrieval, no temporal index. Semantic similarity is the only retrieval signal available.

For operators evaluating Letta: the clearest indicator that Letta is the right choice is if your agent's primary job is to manage and reason about its own knowledge state, not to answer user queries from a large external memory. For the "large external memory" case, the other three systems have more appropriate retrieval architectures.

Zep architecture deep-dive

Zep is written in Go — unlike the Python-based Mem0 and Letta — and takes a graph-first approach to memory. Its core abstraction is a temporal knowledge graph: entities and relations, with time-versioned facts. The fact "User works at Volkswagen" is stored as a graph edge with a valid_from timestamp. When "user joined Stripe" is extracted from a later turn, Zep creates a new works_at edge for Stripe and sets valid_to on the Volkswagen edge. The historical state is preserved, not overwritten.

This temporal versioning is Zep's strongest design decision and the clearest case where Zep wins. Temporal queries like "where did the user work six months ago?" are answered by filtering graph edges by valid_at(target_date). This is structurally cleaner than approaches that use memory freshness decay to handle temporal resolution — Zep explicitly represents the history of facts rather than aging them into expiry. For applications where the history of beliefs or states matters (customer success tools, long-running project assistants, any workload where what was true at a specific time is as important as what's true now), Zep's model is the most honest representation of reality.
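A schematic of the time-versioned edge model (illustrative data structures, not Zep's actual schema):

```python
# Illustrative sketch of time-versioned graph edges, not Zep's schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Edge:
    source: str
    relation: str
    target: str
    valid_from: datetime
    valid_to: datetime | None = None   # None means the fact is still valid

edges = [
    Edge("user", "works_at", "Volkswagen",
         valid_from=datetime(2021, 3, 1), valid_to=datetime(2024, 6, 15)),
    Edge("user", "works_at", "Stripe", valid_from=datetime(2024, 6, 15)),
]

def valid_at(edges: list[Edge], relation: str, when: datetime) -> list[Edge]:
    """Answer 'where did the user work at <when>?' by filtering on validity windows."""
    return [e for e in edges
            if e.relation == relation
            and e.valid_from <= when
            and (e.valid_to is None or when < e.valid_to)]

print(valid_at(edges, "works_at", datetime(2023, 1, 1)))  # -> the Volkswagen edge
```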

Zep's retrieval uses a hybrid of vector search and BM25 over node attributes and edge labels. The graph structure enables relational queries via multi-hop traversal, though the traversal primitives are less expressive than a typed graph retriever with edge_kind differentiation — Zep's graph is primarily temporal, not typed in the way Recall's entity graph distinguishes WORKS_AT from KNOWS from OWNS. Temporal queries are Zep's sweet spot; non-temporal, non-relational semantic queries rely on the BM25 and vector hybrid without the graph signal, which puts it on roughly equal footing with Recall and ahead of Mem0 for exact-match retrieval.

Zep's weaknesses are worth naming clearly for evaluation purposes. There is no explicit write-time filtering pipeline — conversation turns go through entity extraction, but there is no pre-filter stage, no rate gate, and no confidence floor equivalent to the pipeline stages Recall runs before storage. Zep stores more eagerly, which is fine for entity-relational data but means non-entity conversational filler can accumulate. Concept drift between periods — an entity's role or attribute changing gradually while keeping the same identifier — is handled by the temporal versioning mechanism but is not systematically detected as a drift signal to surface to operators. Finally, the Go implementation means Python ecosystem integrations (LangChain, LlamaIndex) run through a REST API rather than a native client, which adds network latency to the write path on Python stacks and makes tighter integration harder.

Head-to-head: write-time filtering

This is the axis where the four systems differ most significantly in production, and where the architectural differences have the most direct operational consequence.

Mem0 runs one LLM gate. Everything that passes the LLM extraction call is stored. The LLM uses no explicit rejection criteria — it decides on its own whether something is worth storing, which effectively means "anything extractable as a fact is stored." On conversational data, where most turns contain some surface-level extractable content, this results in a high storage rate and a high junk rate. The system is optimizing for recall over precision: it would rather store a useless fact than miss a useful one.

Letta delegates the decision entirely to the agent. The model explicitly decides to store via tool call. For capable models (Claude Sonnet, GPT-4-class), this can be high-precision in practice — the model stores only what it judges important and skips conversational filler it deems irrelevant. For weaker models, precision varies widely and there's no pipeline backstop. Storage cost is effectively inference cost: every archival_memory_insert call is a tool use that adds latency and token cost to the turn in which it occurs.

Zep uses an entity-extraction gate. Zep extracts entities and relations from turns and adds them to the temporal knowledge graph. Non-entity conversational text — statements, opinions, atomic facts without clear entity anchors — may be missed by the extractor. Precision is high for entity-relational data. Recall is lower for facts like "user prefers dark mode," which requires identifying "user" as an entity and "prefers" as a predicate with an implicit subject — extractable in principle, but dependent on the extractor's ability to handle implicit subjects and non-nominal predicates.

Recall runs seven stages before a memory is committed. The pre-filter stage rejects roughly 40–60% of turns before any LLM call is made — turns that are greetings, acknowledgments, system noise, or otherwise below a heuristic signal threshold. Extraction uses a single combined LLM call with explicit rejection criteria including a grounding check (is this claim supported by the turn text?) and a confidence floor (is the model confident enough about this extraction?). Deduplication merges near-duplicate candidates against existing stored memories. Conflict resolution prevents contradictory memories from coexisting silently by routing them to supersession logic rather than storing both. The target junk rate after this pipeline is under 10% of committed memories. The cost is higher write latency — approximately 420ms median versus Mem0's approximately 150ms — and more complex initial configuration. For a high-volume deployment, the pre-filter's savings on LLM extraction calls offset a meaningful fraction of infrastructure cost; see the cost comparison section for numbers.
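To make the pre-filter stage concrete, here is a hypothetical sketch of the kind of cheap, pre-LLM heuristics such a stage can run (not Recall's actual implementation):

```python
# Hypothetical pre-filter heuristics, illustrating the kind of checks that can
# reject a large share of turns before any LLM call. Not Recall's actual code.
import re

GREETINGS = re.compile(r"^(hi|hey|hello|thanks|thank you|ok(ay)?|got it|sure)[.!]?$", re.I)

def passes_prefilter(turn: str) -> bool:
    text = turn.strip()
    if len(text) < 12:                 # too short to carry a storable fact
        return False
    if GREETINGS.match(text):          # pure acknowledgment or greeting
        return False
    if text.startswith("[system]"):    # system noise, not user content
        return False
    return True                        # candidate proceeds to LLM extraction

turns = ["hey!", "ok", "I moved to Berlin last month and started at Stripe"]
print([t for t in turns if passes_prefilter(t)])  # only the last turn survives
```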

Head-to-head: retrieval quality

For a 12-month memory store with 5,000 active memories per user, retrieval quality varies significantly across query types. The right comparison is not "which system is better overall" but "which system is better for the query types that dominate your application."

Semantic similarity queries ("tell me about my preferences for this project type"): All four systems handle this adequately. Vector retrieval over well-tuned embeddings catches 75–85% of relevant memories regardless of the retrieval architecture. The dominant variable at this query type is embedding model selection, not retrieval architecture. If semantic similarity queries represent 90%+ of your workload, the four systems will perform similarly, and you should optimize on write-time precision and latency instead.

Exact-match queries ("what's my API key for service X"): Mem0, using ANN search only, misses exact identifiers at non-trivial rates when the embedding space doesn't favor them — exact character sequences don't always cluster the way semantic similarity does. Zep and Recall both incorporate BM25 components (BM25 over node attributes for Zep, BM25 with a GIN index for Recall), which catches most exact term matches that ANN misses. Letta's archival search relies on semantic similarity from pgvector — exact identifiers may not surface reliably without lexical search. For applications where users store and retrieve identifiers, codes, or precise strings, BM25 coverage is not optional.
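If you are supplementing a vector-only system, the standard fix is to run lexical and vector retrieval in parallel and fuse the rankings; reciprocal rank fusion is the common, simple choice. A generic sketch, not tied to any of the four systems:

```python
# Generic reciprocal-rank-fusion sketch for combining a BM25 ranking with an
# ANN ranking. Not taken from any of the four systems.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """rankings: each inner list is memory ids ordered best-first from one retriever."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

ann_hits  = ["mem_12", "mem_87", "mem_03"]   # semantic neighbours
bm25_hits = ["mem_55", "mem_12", "mem_90"]   # exact-term matches (e.g. an API key)
print(rrf([ann_hits, bm25_hits])[:5])        # mem_12 rises: found by both retrievers
```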

Temporal queries ("what did I do last Tuesday"): Zep wins clearly here. Its temporal graph with valid_at filtering provides structurally correct answers to time-window queries and represents the cleanest architectural solution for this query class. Recall's Event-type temporal index is competitive for "what happened around this time" queries but less precise for exact time-window filtering. Mem0 and Letta have no first-class temporal retrieval mechanism — temporal queries either fail or are answered by semantic similarity against timestamps embedded in the text, which is unreliable.

Relational queries ("who is on Sarah's team"): Recall and Zep both support entity graph traversal. Recall's typed graph with edge_kind differentiation handles complex multi-hop queries where the relation type matters — distinguishing "Sarah manages" from "Sarah works with" from "Sarah reports to" requires typed edges, not just adjacency. Zep's temporal KG is stronger for time-versioned relational queries ("who was on Sarah's team before the reorg"). Mem0 stores no entity graph at all; relational queries cannot be answered structurally and fall back to semantic similarity, which will frequently surface wrong or incomplete answers. Letta can build an ad-hoc entity registry via archival memory but has no graph retrieval primitives — multi-hop traversal requires multiple sequential similarity searches.
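The difference between typed and untyped traversal is easy to see in miniature (hypothetical structures; the edge kinds mirror the examples above, not Recall's or Zep's internals):

```python
# Hypothetical typed-edge traversal: relational answers depend on edge_kind,
# not just adjacency.
edges = [
    ("sarah", "MANAGES",    "tom"),
    ("sarah", "WORKS_WITH", "priya"),
    ("tom",   "REPORTS_TO", "sarah"),
]

def neighbours(node: str, edge_kind: str) -> list[str]:
    return [dst for src, kind, dst in edges if src == node and kind == edge_kind]

print(neighbours("sarah", "MANAGES"))     # -> ['tom']    (Sarah's reports)
print(neighbours("sarah", "WORKS_WITH"))  # -> ['priya']  (peers, not reports)
```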

Cost comparison at 100K users

The estimates below model infrastructure cost for a 100K-user deployment, each user generating 20 turns per day, with memories retained for 12 months. These are order-of-magnitude estimates based on publicly available pricing; validate with your actual usage patterns before committing.

Storage volume: 100K users × 20 turns/day × 365 days = 730M turns per year. With Mem0's storage behavior (approximately 5% of turns result in stored memories after LLM extraction, based on community reports — unverified), that is approximately 36M stored memories. With Recall's write pipeline, approximately 40–45% of turns pass the pre-filter and go to LLM extraction; extraction's rejection criteria discard most of those candidates, and roughly 60% of the extracted facts then survive deduplication and conflict resolution, yielding approximately 35M net stored memories. Storage counts end up similar despite the very different pipelines because Recall's higher extraction volume is offset by extraction rejection and by deduplication collapsing near-duplicates. Embedding storage at 768 dimensions: approximately 198 GB of vector data for both.

LLM extraction cost: Mem0 calls the LLM for every turn to extract memories: 730M turns × $0.001 per LLM call = $730K/year at current API pricing. Recall pre-filters approximately 55% of turns before any LLM call, passing 328M turns to LLM extraction × $0.0011 per call (slightly higher per-call cost for the more structured extraction prompt) = $361K/year. The pre-filter reduces extraction cost by roughly half at this scale. Letta's cost is harder to estimate because model-controlled memory storage varies by agent design; for an agent that stores memories on 5% of turns, inference overhead is lower, but per-turn latency and token cost for the turns in which storage occurs are higher.
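The arithmetic behind these two figures, so you can substitute your own volumes and per-call pricing (the rates below are the assumptions quoted above, not measured prices):

```python
# Back-of-envelope extraction-cost arithmetic using the assumptions quoted above.
users, turns_per_day, days = 100_000, 20, 365
turns = users * turns_per_day * days          # 730,000,000 turns/year

mem0_cost   = turns * 0.001                   # LLM extraction call on every turn
recall_cost = turns * 0.45 * 0.0011           # ~55% pre-filtered, pricier structured prompt

print(f"turns/year:        {turns:,}")
print(f"Mem0 extraction:   ${mem0_cost:,.0f}/year")      # ~ $730,000
print(f"Recall extraction: ${recall_cost:,.0f}/year")    # ~ $361,350
```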

Retrieval precision impact on downstream cost: This is the less obvious cost driver. A system with a 5% retrieval precision rate (1 in 20 retrieved memories is relevant) hands the model roughly 9.5 irrelevant memories in every top-10 retrieval. At typical context windows, this translates to roughly 3,000–8,000 wasted tokens per response in memory overhead. At a 30% retrieval precision rate, that overhead drops to roughly 500–2,000 wasted tokens. At 100K users × 20 queries/day, the difference compounds: lower precision adds meaningful token cost to the inference layer even if the memory storage costs are similar. The concrete number depends on your LLM and memory window size, but precision's downstream cost is non-trivial at scale.

Operational toil: Harder to quantify but relevant. Systems with high junk rates require periodic manual cleanup or automated dedup jobs. A 70% junk rate on 36M stored memories means approximately 25M memories that consume storage, slow retrieval, and pad context windows but don't contribute useful information. Cleaning these requires either periodic batch reprocessing or accepting gradual quality degradation over time. Budget for this if you choose a system without aggressive write-time filtering.

Evaluation methodology

A reproducible approach to evaluating these systems on your own data. Public benchmarks are useful as initial filters but almost never match production query distributions. The right evaluation is on your workload.

Step 1 — Dataset preparation: sample 200 multi-turn conversations from your application's existing logs or a representative simulation. Annotate 50 "golden" query-answer pairs manually — queries that require retrieved memories to answer correctly, with the correct answer explicitly specified. Aim for your realistic query distribution: if 30% of your queries are temporal, make 15 of the 50 golden pairs temporal queries. Skewing toward the easy query type (semantic similarity) will make all systems look similar and miss the real differentiation.

Step 2 — System setup: deploy each system you're evaluating with default configuration. Do not tune hyperparameters before the initial measurement — you want to know what you get out of the box, which is what most teams actually deploy. Write the 200 conversations to each system's memory store. Allow 24 hours for any background processing — HyPE generation, graph construction, deduplication jobs — to complete before measuring.

Step 3 — Retrieval precision measurement: for each of the 50 golden queries, retrieve top-10 memories from each system. Score each retrieved memory as relevant (would help answer the query correctly) or irrelevant (would not help or would mislead). Retrieval precision@10 = (relevant memories in top-10) / 10. Record per-system and per-query-type results. A well-tuned system should exceed 40% precision@10 on a realistic query distribution; systems at or below 20% have retrieval quality issues that will affect answer quality.
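A minimal scoring harness for this step might look like the following (illustrative only; `retrieve` stands in for whatever client call each system exposes, and relevance judgments come from your manual annotation):

```python
# Minimal precision@10 harness. `retrieve` and the golden-set format are
# placeholders for your own system clients and annotations.
from collections import defaultdict
from statistics import mean

def precision_at_10(golden_queries, retrieve, is_relevant):
    """golden_queries: [(query, query_type)]; retrieve(query) -> ranked memories;
    is_relevant(query, memory) -> bool from manual scoring."""
    per_type = defaultdict(list)
    for query, query_type in golden_queries:
        hits = retrieve(query)[:10]
        p10 = sum(is_relevant(query, m) for m in hits) / 10
        per_type[query_type].append(p10)
    overall = mean(p for scores in per_type.values() for p in scores)
    return overall, {t: mean(s) for t, s in per_type.items()}
```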

Step 4 — End-to-end answer quality: for the same 50 queries, use an identical LLM (fixed model, fixed system prompt, fixed answer prompt template) with each system's retrieved context injected. Score answers on a 1–3 scale: 1 (wrong or significantly incomplete), 2 (partially correct, missing key details), 3 (correct). This measures the compounded effect of retrieval precision plus context quality plus LLM reasoning on your actual question types. This score matters more than retrieval precision alone because it captures what the user actually experiences.

Step 5 — Latency and cost: measure p50 and p99 write latency (time from turn submission to memory committed), p50 and p99 retrieval latency (time from query to retrieved context ready), and total LLM token cost per 100 turns including both extraction and retrieval overhead. These are operational constraints that often dominate the final architectural decision for high-volume applications more than precision differences do. A system with 20% better retrieval precision but 3x higher p99 write latency may be the wrong choice if your write path is on the critical path for user response time.

Step 6 — Longitudinal decay check: re-run your 50 golden queries after 30 days of continued writes (simulate this with additional synthetic conversations if needed). Retrieval precision naturally decays as junk accumulates in systems without write-time filtering — a system at 45% precision@10 on day 1 that drops to 20% on day 30 has a worse long-term profile than one that holds 35% steady. This check is the most commonly skipped step in evaluations and the one that reveals the most about operational behavior at scale.

The most important results are Step 4's end-to-end answer quality on your specific query distribution and Step 6's longitudinal stability. Systems that look similar on general benchmarks at day 0 often diverge significantly on domain-specific queries and diverge further over time as memory stores age.
