What is Agent Memory?

By Arc Labs Research · 9 min read

An LLM is a function with no internal state. It receives a context window and produces a response; the next call starts fresh. To make this function feel like an agent that knows you across conversations, something else must hold state: the memory layer.

Agent memory is the durable substrate that turns repeated calls into a continuous relationship. It is not just "store everything." Every production memory system makes choices about what to keep, how to type it, when to age it out, when to supersede it, and how to retrieve it cheaply. Those choices are the discipline of memory engineering.

Memory is not RAG

Retrieval-augmented generation (RAG) and agent memory share retrieval mechanics but solve different problems. RAG retrieves from a static corpus — documentation, manuals, knowledge bases — that the agent did not write. Agent memory writes new state from interactions, types it, ages it, supersedes it when contradicted, and reasons over relations. RAG is a read-only library. Memory is a living workspace. A useful agent often needs both: RAG for stable knowledge, memory for the user's specific state. See memory vs RAG vs long context for the full decision framework.

Five types of memory

Most production systems converge on five memory types. Treating them as one bucket — the "flat-text" approach — collapses temporal and relational signals that the application critically needs. (See typed memory vs flat text.)

  • Fact — a stable predicate. "User works at Volkswagen." Changes rarely, retrieved often.
  • Preference — a mutable choice. "User prefers dark mode." Changes occasionally, supersedes when contradicted.
  • Event — a timestamped occurrence. "User joined the platform team on 2026-05-09." Append-only, decays fastest.
  • Entity — an identity. "Volkswagen is an organization." Sticky; once resolved, lasts.
  • Relation — a typed edge between entities. "User reports to Sarah," "Sarah leads platform team." Enables multi-hop reasoning.
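As a concrete sketch of what "typed" means in practice, here is a minimal Python record covering the five types. This is an illustration under assumed names — the fields (subject, predicate, object, confidence, status) are one reasonable schema, not a fixed one:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    FACT = "fact"
    PREFERENCE = "preference"
    EVENT = "event"
    ENTITY = "entity"
    RELATION = "relation"


@dataclass
class Memory:
    type: MemoryType
    subject: str             # resolved entity ID, e.g. "user:123"
    predicate: str           # e.g. "works_at", "prefers_theme"
    object: str              # e.g. "org:volkswagen", "dark_mode"
    confidence: float        # 0-1 trust score
    status: str = "active"   # active | superseded | expired | forgotten
    event_at: datetime | None = None   # only Events carry an occurrence time
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# "User works at Volkswagen" as a typed record rather than a flat-text blob
m = Memory(MemoryType.FACT, subject="user:123", predicate="works_at",
           object="org:volkswagen", confidence=0.9)
```

The point of the structure is that temporal fields (event_at), identity fields (subject, object), and trust fields (confidence, status) stay machine-readable instead of being buried in prose.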

The lifecycle

A memory does not just exist forever. It enters as active, may be superseded by a contradicting memory, may decay to expired when its freshness drops below the retrieval floor, and is eventually forgotten by the background worker. Re-access can lift expired memories back to active. Each transition is deterministic and audit-logged — "why did the agent forget that?" should always have an answer.

Memory is more than the store

It's tempting to equate "memory" with "the database that holds it." That's like equating a database with the queries against it. A real memory system has at least:

  • A write pipeline that decides what to store (see the 7-stage write pipeline).
  • A read pipeline that retrieves and ranks (see five retrievers are better than one).
  • A store with at least vector, lexical, and graph indexes.
  • A background worker for decay, consolidation, drift detection, and GC.
  • A confidence model that gives each memory a trust score.

Why this matters

Flat-text storage systems achieve only 5–65% accuracy on temporal-reasoning queries over long-running conversations.

— LongMemEval (2024)

The headline failure mode of agents with flat memory is not that they forget — it's that they fail to reason. They cannot answer "what changed since last time we talked about X?" because they cannot order what they stored in time or tell a superseded statement from a current one; the flat store has collapsed exactly the temporal and relational structure the question depends on.

Three cognitive memory types — and how they map

Cognitive scientists distinguish three memory types in humans: episodic (specific past events), semantic (general facts and knowledge), and procedural (how to do things). Agent memory maps imperfectly but usefully to these categories. Understanding the mapping helps at extraction time — when a new piece of information arrives, the right question is "which cognitive type does this resemble?" before assigning a technical type.

Episodic maps to Event. Episodic memory in humans is memory for specific past occurrences: "I had coffee with Sarah on Tuesday." The defining property is that it is bound to a specific time and context. In agent memory, this maps directly to the Event type: "user joined the platform team on May 9th," "user deployed the API at 2pm last Tuesday," "user reported a billing issue on March 15th." Episodic memories decay fastest because they are context-specific. An event that happened three months ago is usually less relevant than one that happened last week, and its relevance continues to decline unless it belongs to a category the user revisits (e.g., recurring billing issues). The 30-day half-life in the decay model reflects this: events lose half their freshness every 30 days absent re-access, reaching the 0.1 expiry floor after roughly 100 days.

Semantic maps to Fact and Preference. Semantic memory in humans is memory for general knowledge: "Paris is the capital of France," "TypeScript is a typed superset of JavaScript." It is not bound to a specific time or event — it is simply known. In agent memory, this maps to the Fact type ("user works at Volkswagen," "user's team uses a microservices architecture") and the Preference type ("user prefers dark mode," "user prefers concise explanations over detailed ones"). Semantic memories are relatively stable and decay slowly. The Fact type uses a 180-day half-life; the Preference type, which changes more often, uses a 90-day half-life. A new contradicting fact supersedes the old one (the user moved from Volkswagen to Rivian → the old Fact is superseded, not deleted), while a new Preference entry supersedes the old preference for the same domain.

Procedural has no direct mapping. Human procedural memory is memory for how to do things: how to ride a bicycle, how to touch-type. It is largely unconscious and expressed through performance, not recall. Conversational agents do not have a clean equivalent. Some systems add a Procedure type for step-by-step workflows the user has explicitly taught the agent ("when I say 'run the usual deploy,' do: check CI, merge to main, notify the team"). This is not one of the five canonical types — it is a domain-specific extension appropriate for agents that learn and execute multi-step tasks from user instruction. If you need this, add it as a custom type with superseded_by logic (a new procedure for the same trigger supersedes the old one) and an access-boosted freshness model (procedures that are invoked frequently should not decay).

The mapping is actionable at extraction time. When the write pipeline's extract-classify-ground stage receives a new piece of information, a useful heuristic: "did this happen at a specific time?" → likely Event. "Is this a stable state about the user or their world?" → likely Fact. "Is this a mutable choice or preference?" → likely Preference. "Does this refer to another entity and establish a relationship?" → likely Relation. "Is this defining a new actor?" → likely Entity. The LLM classifier in stage 2 applies these heuristics, but knowing the mapping helps when debugging misclassification or tuning the classifier prompt.
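Those questions are straightforward to encode, either as a cheap pre-classifier or as few-shot guidance in the stage-2 prompt. A minimal sketch, assuming simple keyword cues stand in for the LLM classifier (the function name and cue lists are illustrative, not exhaustive):

```python
import re


def guess_memory_type(candidate: str) -> str:
    """Rule-of-thumb mapping from the cognitive heuristics to the five types.

    A real system asks these questions of an LLM classifier; this sketch only
    illustrates the order in which the questions are applied.
    """
    text = candidate.lower()
    # "Did this happen at a specific time?" -> Event
    if re.search(r"\b(yesterday|last (week|month|year)|on \d{4}-\d{2}-\d{2}|at \d{1,2}(am|pm))\b", text):
        return "event"
    # "Is this a mutable choice or preference?" -> Preference
    if re.search(r"\b(prefers?|likes?|wants?|rather)\b", text):
        return "preference"
    # "Does this establish a relationship between entities?" -> Relation
    if re.search(r"\b(reports to|works with|leads|manages|married to)\b", text):
        return "relation"
    # "Is this defining a new actor?" -> Entity
    if re.search(r"\bis an? (organization|company|team|person)\b", text):
        return "entity"
    # Default: a stable state about the user or their world -> Fact
    return "fact"


assert guess_memory_type("User joined the platform team on 2026-05-09") == "event"
assert guess_memory_type("User prefers dark mode") == "preference"
```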

The lifecycle in detail — what triggers each transition

Each state transition in the memory lifecycle has a specific trigger. The transitions are deterministic and audit-logged — every memory's state history is queryable, so "why did this memory get superseded?" or "when did this expire?" always has a traceable answer.

Extracted → Active. A memory candidate completes all 7 stages of the write pipeline with a confidence score above the 0.3 floor. The confidence score is computed in stage 2 (extract-classify-ground) as a composite of source confidence (how assertive was the statement in the source text?), grounding confidence (is the claim actually supported by the text around it?), and extraction consistency (did the extractor produce this memory consistently across the extraction temperature sweep?). Once a memory enters Active state, it participates in fusion retrieval and is returned in regular queries.
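The exact weighting of the three components is deployment-specific; a minimal sketch, assuming a simple weighted mean (the weights below are an assumption, not a documented value):

```python
def composite_confidence(source: float, grounding: float, consistency: float,
                         weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Combine the three stage-2 signals into a single 0-1 trust score.

    source      -- how assertive the statement was in the source text
    grounding   -- whether the claim is supported by the surrounding text
    consistency -- fraction of temperature-sweep extractions that reproduced it
    weights     -- assumed relative importance of the three signals
    """
    w_s, w_g, w_c = weights
    score = w_s * source + w_g * grounding + w_c * consistency
    return round(min(1.0, max(0.0, score)), 3)


# A weakly asserted, poorly grounded candidate stays below the 0.3 floor
assert composite_confidence(0.2, 0.3, 0.1) < 0.3
```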

Active → Superseded. A new memory is written with the same type, subject, and predicate as an existing Active memory, and the predicate is marked predicate_is_stateful = true in the schema (meaning the predicate describes a state rather than an event, so only one value can be true at a time). The conflict detection stage (stage 5) detects the match and creates a superseded_by link from the old memory to the new one. The old memory transitions to Superseded state. Superseded memories are excluded from regular retrieval (they do not appear in standard retrieve() responses) but are visible in audit queries (retrieve(include_superseded=True)) and explicit history requests ("what did I tell you I preferred last year?"). Superseded memories are never permanently deleted by the GC worker unless they also meet the low-confidence + long-expired criteria — the history is preserved.

Active → Expired. The background freshness worker runs on a configurable schedule (default: weekly). For each Active memory, it computes a freshness score using exponential decay:

freshness(m) = 2^{-t/\tau} \times access\_boost(m)

where t is days since last access, τ is the type-specific half-life (30 days for Events, 90 for Preferences, 180 for Facts, 365 for Entities and Relations), and access_boost(m) = min(3.0, 1.2^{access_count}). When freshness drops below 0.1, the memory transitions to Expired. The access boost is multiplicative and capped at 3.0 — a memory accessed 20 times gets the same boost as one accessed 100 times. This prevents high-frequency-access memories from becoming immortal.
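A minimal implementation of this decay model, using the half-lives and the 0.1 expiry floor described above (the function names are illustrative):

```python
HALF_LIFE_DAYS = {        # type-specific half-life τ, in days
    "event": 30,
    "preference": 90,
    "fact": 180,
    "entity": 365,
    "relation": 365,
}
EXPIRY_FLOOR = 0.1


def access_boost(access_count: int) -> float:
    return min(3.0, 1.2 ** access_count)


def freshness(memory_type: str, days_since_access: float, access_count: int) -> float:
    tau = HALF_LIFE_DAYS[memory_type]
    return 2 ** (-days_since_access / tau) * access_boost(access_count)


def is_expired(memory_type: str, days_since_access: float, access_count: int) -> bool:
    return freshness(memory_type, days_since_access, access_count) < EXPIRY_FLOOR


# An untouched Event expires after roughly 100 days: 2**(-100/30) ≈ 0.099
assert is_expired("event", 100, 0)
# A few re-accesses lift freshness back above the floor (the Expired -> Active reversal)
assert not is_expired("event", 100, 3)
```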

Expired → Active (reversal). Expiry is reversible. A direct access to an Expired memory — triggered by a query that retrieves it in historical mode, or by the user explicitly asking about that topic — applies the access boost and recalculates freshness. If the recalculated freshness exceeds 0.1, the memory transitions back to Active. This is the correct behavior for "I haven't thought about this in six months but now it's relevant again" — the memory was stale, not wrong.

Expired → Forgotten. The garbage collection worker runs monthly. It permanently deletes memories that satisfy both conditions: expired for 90+ days AND confidence below 0.3. The two-condition filter is important. A high-confidence memory that happens to be stale (user hasn't accessed the "user works at Volkswagen" Fact in 120 days because no conversations touched on their employer) is not deleted — it may become relevant when the topic returns. Only low-confidence AND long-expired memories are GC'd, because the combination indicates the memory was probably an extraction error that never got validated by access.

Active → Contradicts (link, not state). When two Active memories assert contradictory things about the same subject — same type, same subject entity, same predicate, but incompatible object values — the conflict detection stage creates a contradicts directed link between them. Neither memory is superseded. The higher-confidence memory is promoted in retrieval weight: its effective score during ranking is multiplied by the confidence delta ratio. The lower-confidence memory is demoted but remains Active and retrievable. For high-importance contradictions (defined by predicate importance score in the schema), the system sets a needs_review flag visible in the admin API. A human reviewer or supervisor agent can resolve the contradiction by writing a superseding memory that explicitly overrides both.

The write pipeline overview

Before a memory reaches the store, it passes through 7 sequential stages. Each stage can reject candidates, modify them, or route them to a different path. Understanding each stage is essential for debugging why a memory did or did not get stored.

Stage 1 — Pre-filter. The first gate is cheap and runs synchronously. Four checks: word count gate (reject turns with fewer than 5 or more than 10,000 words), pattern match (reject turns matching chit-chat patterns: "ok," "thanks," "got it," "sounds good," single-word responses), rate gate (reject turns identical or near-identical to a turn processed within the last 60 seconds — deduplication of rapid re-submissions), role gate (configurable: assistant turns, system turns, or tool results can be excluded from extraction if the deployment only wants to extract from user turns). Approximately 40–60% of conversational turns are rejected at this stage. The rejection rate is high because most conversational turns carry no extractable information.
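A minimal sketch of the four gates, assuming an in-memory dedup window; the chit-chat list, word-count bounds, and 60-second window are the values stated above, while the function name and hashing details are illustrative:

```python
import hashlib
import time

CHIT_CHAT = {"ok", "okay", "thanks", "thank you", "got it", "sounds good"}
_recent: dict[str, float] = {}   # content hash -> last-seen timestamp (rate gate)


def passes_prefilter(turn_text: str, role: str,
                     allowed_roles: frozenset[str] = frozenset({"user"})) -> bool:
    words = turn_text.split()
    # 1. word-count gate: reject turns shorter than 5 or longer than 10,000 words
    if not 5 <= len(words) <= 10_000:
        return False
    # 2. pattern gate: reject chit-chat and single-word acknowledgements
    if turn_text.strip().lower().strip(".!") in CHIT_CHAT or len(words) == 1:
        return False
    # 3. rate gate: reject a turn identical to one seen within the last 60 seconds
    digest = hashlib.sha256(turn_text.strip().lower().encode()).hexdigest()
    now = time.monotonic()
    if now - _recent.get(digest, float("-inf")) < 60:
        return False
    _recent[digest] = now
    # 4. role gate: only extract from configured roles
    return role in allowed_roles
```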

Stage 2 — Extract, classify, and ground. A single LLM call handles extraction, classification, and grounding simultaneously to minimize latency. The prompt asks the model to: (1) extract memory candidates from the turn text, (2) classify each candidate by type (Fact, Preference, Event, Entity, Relation), (3) assign source confidence on a 0–1 scale, and (4) verify grounding — is the claim actually supported by the surrounding text, or is the model hallucinating an inference? The single-call approach reduces per-turn cost relative to separate extract/classify/ground calls. Multiple candidates may emerge from one turn: "I moved from Boston to Berlin last month to work at Volkswagen's EV division" yields an Event (moved cities), a Fact (works at Volkswagen EV division), and potentially an Entity update (Berlin, Volkswagen EV division). Approximately 10–20% of turns that pass the pre-filter produce zero extractable memory candidates.

Stage 3 — Entity resolution. Each memory's subject and object are resolved to stable entity IDs to enable deduplication and relation traversal. The resolution uses a 4-stage cascade. First: exact string match against the entity table (case-insensitive, whitespace-normalized). Second: alias match — the entity table stores known aliases ("VW" → Volkswagen, "the company" may be resolvable if the context window contains a prior coreference). Third: trigram fuzzy match with a 0.85 similarity threshold, to catch typos and name variations ("Volkswagon" → Volkswagen). Fourth: LLM fallback for cases the first three stages cannot resolve — the LLM is given the entity string and recent conversation context to determine if it matches an existing entity. If all four stages fail to match, a new entity is created with a UUID v5 of the normalized raw string as its ID. Coreference resolution (mapping "she" or "her" to a specific named entity) uses a per-pronoun ring buffer of size 30, with a TTL of 30 conversations — the 30 most recent conversations' coreference assignments are available for resolution.
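A minimal sketch of the cascade, assuming a small in-memory entity table; difflib's ratio stands in for a real trigram index, and the LLM fallback is an injected callable rather than a specific API:

```python
import uuid
from difflib import SequenceMatcher

ENTITIES = {"volkswagen": "org:volkswagen", "sarah": "person:sarah"}
ALIASES = {"vw": "volkswagen"}


def _normalize(name: str) -> str:
    return " ".join(name.lower().split())


def resolve_entity(raw: str, llm_fallback=None) -> str:
    name = _normalize(raw)
    # 1. exact match (case-insensitive, whitespace-normalized)
    if name in ENTITIES:
        return ENTITIES[name]
    # 2. alias match
    if name in ALIASES:
        return ENTITIES[ALIASES[name]]
    # 3. fuzzy match at a 0.85 similarity threshold (difflib stands in for trigrams)
    for known, entity_id in ENTITIES.items():
        if SequenceMatcher(None, name, known).ratio() >= 0.85:
            return entity_id
    # 4. LLM fallback: given the string and candidates, return a match or None
    if llm_fallback is not None:
        match = llm_fallback(raw, list(ENTITIES))
        if match:
            return ENTITIES[match]
    # no match anywhere: mint a deterministic new ID from the normalized string
    return f"entity:{uuid.uuid5(uuid.NAMESPACE_URL, name)}"


assert resolve_entity("VW") == "org:volkswagen"
assert resolve_entity("Volkswagon") == "org:volkswagen"   # typo caught by fuzzy match
```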

Stage 4 — Deduplication. Three tiers of duplicate detection. Tier 1: SHA-256 hash of the normalized memory content — exact duplicates are detected in O(1) and merged immediately. Tier 2: cosine similarity above 0.92 between the new candidate's embedding and existing memories of the same type for the same subject — near-duplicates are merged. Tier 3: for pairs with cosine similarity in the 0.85–0.92 band, an LLM judge evaluates whether the two memories convey the same information. The LLM judge produces a binary is-duplicate decision. On merge, the repetition counter on the surviving memory is incremented (repetition is a weak signal for confidence), the confidence is recalculated as a weighted average of the two merged candidates' confidences, and the evidence turn list (the list of turn IDs that contributed this memory) is updated with the new turn ID.
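A minimal sketch of the three-tier decision, assuming pre-computed embeddings and a stubbed LLM judge; the 0.92 and 0.85 thresholds are the ones stated above:

```python
import hashlib


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def _content_hash(text: str) -> str:
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()


def is_duplicate(new_text: str, new_emb: list[float],
                 old_text: str, old_emb: list[float],
                 llm_judge=None) -> bool:
    # Tier 1: exact duplicate via hash of normalized content, O(1)
    if _content_hash(new_text) == _content_hash(old_text):
        return True
    # Tier 2: near-duplicate via embedding similarity above 0.92
    sim = _cosine(new_emb, old_emb)
    if sim > 0.92:
        return True
    # Tier 3: ambiguous 0.85-0.92 band goes to an LLM judge (binary decision)
    if 0.85 <= sim <= 0.92 and llm_judge is not None:
        return bool(llm_judge(new_text, old_text))
    return False
```

On a True result, the caller performs the merge described above: increment the repetition counter, recompute confidence as a weighted average, and append the new turn ID to the evidence list.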

Stage 5 — Conflict detection. Check for two conflict types. Supersession: the new memory has the same type, subject entity, and predicate as an existing Active memory, and predicate_is_stateful = true in the schema. Create a superseded_by link and transition the old memory to Superseded. Contradiction: the new memory asserts an object value incompatible with an existing Active memory for the same subject and predicate, but a clean supersession does not apply. Non-stateful predicates can be true multiple times — "user attended event X" and "user attended event Y" are not contradictions — so a predicate match alone is not a conflict; only genuinely incompatible object values are. On a detected contradiction: create a contradicts link, demote the lower-confidence memory, and set needs_review if the predicate's importance threshold is exceeded.

Stage 6 — Persist. Atomic write across four tables: the memories table (the memory record with type, subject, object, confidence, status, timestamps), the entity table (upsert for any new or updated entities), the relation table (for Relation-type memories, the directed edge between subject and object entities), and the vector embedding table (the embedding of the memory content, stored alongside the memory ID). All four writes are in a single database transaction — all succeed or all roll back. A partial write (memory record without embedding, or embedding without memory record) would cause retrieval inconsistency and is not permitted.
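A minimal sketch of the shape of that transaction, using sqlite3 from the standard library purely to illustrate the all-or-nothing write; the table and column names are assumptions, and a production store would differ:

```python
import json
import sqlite3


def persist(db: sqlite3.Connection, memory: dict, embedding: list[float]) -> None:
    """Write the memory record, entity upsert, relation edge, and embedding
    in one transaction: all writes succeed or all roll back."""
    # The sqlite3 connection context manager commits on success and rolls back
    # if any statement raises, so a partial write cannot be left behind.
    with db:
        db.execute(
            "INSERT INTO memories (id, type, subject, predicate, object, confidence, status) "
            "VALUES (?, ?, ?, ?, ?, ?, 'active')",
            (memory["id"], memory["type"], memory["subject"],
             memory["predicate"], memory["object"], memory["confidence"]))
        db.execute(
            "INSERT INTO entities (id, name) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
            (memory["subject"], memory["subject_name"]))
        if memory["type"] == "relation":
            db.execute(
                "INSERT INTO relations (source, predicate, target) VALUES (?, ?, ?)",
                (memory["subject"], memory["predicate"], memory["object"]))
        db.execute(
            "INSERT INTO embeddings (memory_id, vector) VALUES (?, ?)",
            (memory["id"], json.dumps(embedding)))
```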

Stage 7 — HyPE (background, async). Hypothetical Prompt Embeddings. After the synchronous persist completes, an async worker generates 3 hypothetical questions per memory: "If someone wanted to find this memory later, what question would they ask?" Examples for "user prefers TypeScript for new projects": "What programming language does this user prefer?", "Does this user have a language preference for new work?", "Should I recommend TypeScript or JavaScript for this user?" These hypothetical questions are embedded and stored in a separate hypothetical_embeddings table linked to the memory ID. HyDE-style retrieval then matches query embeddings against question embeddings rather than against memory content embeddings directly, improving recall for queries that are phrased differently from the memory's assertion style.

The read pipeline overview

Retrieval proceeds through 4 stages. The goal is to return the top-K memories most relevant to the query, ranked by relevance, confidence, and freshness, within a configurable latency budget.

Stage 1 — Understand. Parse the query into a ParsedQuery struct with fields: entity references (which named entities does the query mention?), temporal window (does the query specify a time range, e.g., "last month," "since March"?), predicate hints (is the query asking about a specific predicate, e.g., "where does X work"?), type hints (is the query asking for a specific memory type, e.g., "what events happened"?), negations (does the query contain "not," "never," "didn't"?), and specificity score (0–1, how precisely constrained is the query?). When specificity is below 0.8 — the query is broad or ambiguous — an LLM enrichment call rewrites the query into 2–3 semantically equivalent forms that improve recall. "Tell me about Sarah" becomes ["what is Sarah's role?", "what is my relationship with Sarah?", "what do I know about Sarah professionally?"]. The query optimizer then selects a retrieval plan from a lookup table that maps ParsedQuery patterns to retriever weight configurations.
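A minimal sketch of the ParsedQuery struct as described; the exact field names and types are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ParsedQuery:
    raw: str
    entities: list[str] = field(default_factory=list)        # named entities mentioned
    temporal_window: tuple[datetime, datetime] | None = None  # e.g. "last month", "since March"
    predicate_hints: list[str] = field(default_factory=list)  # e.g. "works_at"
    type_hints: list[str] = field(default_factory=list)       # e.g. ["event"]
    negations: list[str] = field(default_factory=list)        # e.g. ["not Sarah"]
    specificity: float = 0.0                                   # 0-1; below 0.8 triggers LLM enrichment
    rewrites: list[str] = field(default_factory=list)          # semantically equivalent enriched forms
```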

Stage 2 — Retrieve. Five retrievers run in parallel. Semantic retriever: HNSW cosine similarity search against memory content embeddings (and hypothetical embeddings when HyPE is enabled). Lexical retriever: BM25 over a GIN-indexed tsvector column on memory content — handles exact term matches, proper nouns, and technical identifiers that embeddings may not capture well. Entity graph retriever: typed edge traversal from the entities mentioned in the query — retrieve all memories where the subject or object is a named entity in the query, then expand one hop via Relation-type memories. Temporal retriever: BTREE range scan on event_at for queries with a temporal window — highly efficient for Event-type memories. Type filter: a hard pre-filter applied to all retrievers when a type hint is present — eliminates entire memory type buckets from consideration before retrieval. Each retriever returns a ranked list of (memory_id, score) pairs. Score-weighted Reciprocal Rank Fusion combines the lists:

rrf\_score(m) = \sum_{r} w_r \cdot \frac{\sqrt{score_r(m)}}{k + rank_r(m)}

where k=60 (standard RRF constant), w_r is the weight for retriever r (tuned by query plan), and score_r(m) is the raw score from retriever r for memory m. The sqrt of the raw score is used rather than the raw score to reduce the dominance of extremely high-scoring results from a single retriever.
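A minimal implementation of the score-weighted fusion, assuming each retriever returns a ranked list of (memory_id, score) pairs as described:

```python
from math import sqrt


def fuse(retriever_results: dict[str, list[tuple[str, float]]],
         weights: dict[str, float], k: int = 60) -> list[tuple[str, float]]:
    """Score-weighted Reciprocal Rank Fusion over per-retriever ranked lists."""
    fused: dict[str, float] = {}
    for name, results in retriever_results.items():
        w = weights.get(name, 1.0)
        for rank, (memory_id, score) in enumerate(results, start=1):
            # sqrt(score) damps retrievers that emit extremely large raw scores
            fused[memory_id] = fused.get(memory_id, 0.0) + w * sqrt(score) / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical weights from a query plan favoring semantic over lexical retrieval
ranked = fuse(
    {"semantic": [("m1", 0.81), ("m2", 0.64)], "lexical": [("m2", 3.1), ("m3", 1.4)]},
    weights={"semantic": 1.0, "lexical": 0.7},
)
```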

Stage 3 — Rank. Optional cross-encoder rerank on the top-50 fused results. The cross-encoder scores (query, memory_content) pairs with a more expensive model that sees both the query and the full memory text simultaneously, rather than comparing embeddings independently. Cross-encoding is more accurate than embedding similarity but too slow to run on the full candidate set — hence the two-stage approach (retrieve a large candidate set cheaply, then rerank a smaller set expensively). After reranking, confidence and freshness weights are applied:

final\_score(m) = retrieval\_score(m) \times conf(m) \times freshness(m)

A memory with high retrieval relevance but low confidence (0.3) or low freshness (0.1) is aggressively demoted. A memory with moderate retrieval relevance but high confidence (0.95) and high freshness (0.9) is promoted. This weighting is what prevents low-confidence extraction errors and stale memories from contaminating the context injected into the LLM.

Stage 4 — Filter and assemble. Hard threshold filters: confidence floor (default 0.5 for production queries, 0.3 for historical or audit queries), status filter (Active only by default; Superseded and Expired are included in historical mode), negative assertion filter (memories that match query negations are suppressed — if the query includes "not Sarah," memories with Sarah as subject are removed from the result set). Context assembly: the top-K memories (default K=10) are grouped into 6 categories (Facts, Preferences, Events, Entities, Relations, and a catch-all Recent category), token-budgeted against a configurable limit (default 2,000 tokens for the injected context block), and formatted into the prompt injection string. Categories are ordered by relevance to the query: the category most represented in the top-K results appears first.
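A minimal sketch of the hard filters and the token budget, omitting the negation filter and category grouping for brevity and assuming a crude 4-characters-per-token estimate in place of a real tokenizer:

```python
def assemble_context(ranked_memories: list[dict], *, k: int = 10,
                     confidence_floor: float = 0.5, historical: bool = False,
                     token_budget: int = 2_000) -> str:
    """Apply the hard filters, then pack the top-K memories into a budgeted block."""
    allowed_status = {"active", "superseded", "expired"} if historical else {"active"}
    floor = 0.3 if historical else confidence_floor

    kept: list[str] = []
    used_tokens = 0
    for m in ranked_memories:                      # already ranked by final_score
        if m["confidence"] < floor or m["status"] not in allowed_status:
            continue
        line = f"- [{m['type']}] {m['content']}"
        cost = max(1, len(line) // 4)              # rough token estimate
        if used_tokens + cost > token_budget or len(kept) >= k:
            break
        kept.append(line)
        used_tokens += cost
    return "Known about the user:\n" + "\n".join(kept)
```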

Why "just store everything" fails

The appeal of flat-text, no-filter memory is real: zero false negatives at write time. Every piece of information is stored and theoretically recoverable. But this creates compounding problems that become severe at scale.

Retrieval crowding. Measured on unfiltered conversational data, approximately 97.8% of conversational turns contain no extractable memory-worthy information — chit-chat, filler, task-transient state, repetition of already-known facts. A store built by writing every turn without filtering will have a junk rate near this proportion. At a store size of 10,000 memories, 9,780 are junk. Retrieval returns the top-50 by fused score. The expected number of useful memories in that top-50 depends on how well retrieval can discriminate useful from junk. In the worst case (uniform score distribution), useful memories make up 2.2% of the full store, so the expected useful memories in top-50 is approximately 1.1. Retrieval precision: 2.2%. The model receives a context with 2.2% useful signal and 97.8% junk. Its responses reflect this. In practice, useful memories are slightly more retrievable (they contain more specific language than "sounds good" or "thanks"), but the improvement is bounded — retrieval alone cannot solve a 97.8% junk base rate.

Accumulation cost. A high-volume deployment writing 100,000 turns per day without filtering stores 36.5 million memory objects per year. Each memory requires a vector embedding stored in the HNSW index. At 1,536 dimensions × 4 bytes per float, that is approximately 6KB per embedding. The HNSW index RAM requirement at 36.5M vectors is approximately 219GB. This is the vector index alone, not the full memory record storage. The total infrastructure cost exceeds what most deployments can justify. With a 40% pass-through rate from the write pipeline's pre-filter and extraction stages, 14.6M memories per year → approximately 88GB vector index RAM — still substantial, but approximately 2.5× more cost-efficient than storing everything.
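The arithmetic behind those figures, reproduced as a quick check (using the rounded 6 KB per embedding from above):

```python
turns_per_day = 100_000
vectors_per_year = turns_per_day * 365        # 36,500,000 memories with no write filtering

kb_per_vector = 6                             # 1,536 dims × 4 bytes/float ≈ 6 KB
unfiltered_gb = vectors_per_year * kb_per_vector / 1e6    # ≈ 219 GB of vectors alone

pass_through = 0.40                           # survival rate after pre-filter + extraction
filtered_vectors = vectors_per_year * pass_through        # 14.6M memories/year
filtered_gb = filtered_vectors * kb_per_vector / 1e6      # ≈ 88 GB

print(f"{unfiltered_gb:.0f} GB unfiltered vs {filtered_gb:.0f} GB filtered "
      f"({unfiltered_gb / filtered_gb:.1f}x)")
```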

Junk perpetuation. Junk memories, once stored, are not passive. They get retrieved occasionally when their content happens to match a query by surface-level similarity. When retrieved, they accumulate access_count increments. Access count feeds the access boost in the freshness model: access_boost = min(3.0, 1.2^{access_count}). A junk memory retrieved 10 times has an access boost of 6.2 (capped to 3.0), meaning it effectively has 3× the freshness of a never-accessed memory at the same age. This makes junk memories harder to evict via decay — they receive the same freshness boosts as legitimate memories because the system cannot distinguish access driven by retrieval noise from access driven by genuine relevance. The decay system can clear low-confidence junk over time (the 30-day Event half-life plus the 90-day GC trigger), but high-volume unfiltered stores accumulate junk faster than decay can remove it, causing the effective junk rate in the store to grow over time rather than stabilize. The write pipeline's pre-filter and confidence model are the primary defense against this accumulation pattern.
