Memory vs RAG vs Long Context

By Arc Labs Research · 10 min read

A large class of agent design questions reduces to: where does this knowledge live, and how does it get into the prompt? RAG, long context, and agent memory are the three canonical answers. They are not competing — they are complementary, and confusing them leads to systems that work in demos and fail in production.

Same query, three modes: each mode wins different queries, and the right answer is usually "use all three."

Side-by-side

|              | RAG                        | Long context             | Memory                        |
|--------------|----------------------------|--------------------------|-------------------------------|
| Source       | Static corpus you maintain | Per-request input        | Accumulated from interactions |
| Mutability   | Read-only at runtime       | Read-only                | Read/write at runtime         |
| Per-user     | Usually shared             | Always per-request       | Always per-user               |
| Latency      | Retrieval cost             | Token cost only          | Retrieval cost                |
| Token cost   | Bounded by top-K           | Full payload every call  | Bounded by budget             |
| Best for     | Documentation, KBs         | One-shot analysis        | Personalization, continuity   |
| Failure mode | Stale/missing entries      | Lost-in-the-middle       | Junk accumulation             |

When RAG wins

RAG is the right primitive when your knowledge is large, static, and shared across users. Documentation, product manuals, regulatory filings, codebases. The corpus is curated; you own its structure; users do not write into it.

  • The data exists before the user shows up.
  • The same answer applies to every user.
  • You'd rather update the source documents than the agent.

When long context wins

Long context wins for ad-hoc analysis of fresh material. The user pastes a 30-page contract and asks for a summary. Indexing for RAG isn't worth the round-trip; the answer needs the whole document; nothing about it persists.

  • Material is provided per-request.
  • The full content is needed, not snippets.
  • The interaction is one-shot — no continuity across sessions.

The trade-off: long context is bounded by the model's window, and quality degrades through lost-in-the-middle effects on large prompts.

Performance is highest when relevant information appears at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle.

— Liu et al., 2024

When memory wins

Memory wins when state must accumulate from interactions and persist across sessions. The user's preferences, history, ongoing projects, recent events. Anything that the user creates by talking to the agent and that should affect future conversations.

  • State accumulates from interactions.
  • Each user has different state.
  • Continuity across sessions is the product.

They compose, not compete

A real agent often uses all three. A coding agent might use:

  • RAG for the codebase, language docs, and API references.
  • Long context for the open files in the current task.
  • Memory for the user's coding preferences, project history, recurring patterns.

Routing happens per-query: a "how do I use this API?" question hits RAG; a "fix this file" task uses long context; a "what was that pattern we agreed on last week?" question hits memory.

RAG in depth: how it works and where it breaks

RAG has three phases. Understanding each phase is a prerequisite for debugging the failure modes.

Phase 1 — offline indexing: Documents are split into chunks, typically 512–1,024 tokens. Each chunk is passed through an embedding model (text-embedding-3-small, text-embedding-ada-002, or a self-hosted equivalent) to produce a dense vector. Those vectors are stored in a vector index alongside the chunk text and metadata. This phase runs once at ingestion time and again whenever the source documents change.

Phase 2 — online retrieval: At query time, the user's query is embedded with the same model used for indexing. The vector index performs approximate nearest-neighbor (ANN) search against the stored chunk embeddings, returning the top-K most similar chunks. K is typically 5–20. Those chunks are injected into the prompt as context, usually with a prefix like "Use the following documents to answer the question."

Phase 3 — generation: The LLM generates an answer conditioned on the retrieved chunks and the user's query. It has no access to the rest of the corpus — only what retrieval returned.
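
A minimal sketch of the three phases, assuming an OpenAI-style client and an in-memory index (the chunk texts, model names, and exact-search shortcut are illustrative; a production system would use a real vector store with ANN):

import numpy as np
from openai import OpenAI

client = OpenAI()  # any embedding + chat provider works; OpenAI is an assumption

def embed(texts: list[str]) -> np.ndarray:
    # The same embedding model must be used at indexing and query time.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Phase 1: offline indexing. Chunks are assumed pre-split to ~512-1,024 tokens.
chunks = ["...chunk text...", "...more chunk text..."]
index = embed(chunks)  # shape (n_chunks, 1536)

# Phase 2: online retrieval. Exact cosine search stands in for ANN here.
def retrieve(query: str, k: int = 10) -> list[str]:
    q = embed([query])[0]
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

# Phase 3: generation, conditioned only on what retrieval returned.
def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (f"Use the following documents to answer the question.\n\n"
              f"{context}\n\nQuestion: {query}")
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model name is illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content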

RAG's failure modes, in order of severity:

Staleness. The corpus was indexed a week ago. The document changed yesterday. RAG returns the outdated version with no staleness signal — the LLM has no way to know the chunk it received is stale. For fast-moving documentation or code, RAG without continuous re-indexing pipelines is dangerous. Mitigation: index with a last_modified timestamp and inject a staleness warning into the chunk metadata when the retrieved content is older than a configurable threshold. Something like [POSSIBLY OUTDATED: indexed 14 days ago] prepended to the chunk text causes the model to hedge appropriately.
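
A sketch of that staleness annotation, assuming each chunk carries a timezone-aware last_modified timestamp in its metadata (field name and threshold are illustrative):

from datetime import datetime, timezone

STALENESS_THRESHOLD_DAYS = 7  # configurable per corpus

def annotate_staleness(chunk_text: str, last_modified: datetime) -> str:
    # Prepend a warning when the indexed copy is older than the threshold,
    # so the model hedges instead of asserting stale content as current.
    age_days = (datetime.now(timezone.utc) - last_modified).days
    if age_days > STALENESS_THRESHOLD_DAYS:
        return f"[POSSIBLY OUTDATED: indexed {age_days} days ago]\n{chunk_text}"
    return chunk_text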

Chunking failures. A table spanning two chunks has its first half retrieved but not its second. A sentence that only makes sense with its preceding paragraph is retrieved in isolation. No chunking strategy eliminates this — it is an irreducible problem at chunk boundaries. The heuristic is that ~20% of factual questions require information from multiple chunks that fall on opposite sides of a boundary. Mitigation: overlapping chunks (50% overlap between adjacent chunks), or semantic chunking where splits are placed at paragraph or section boundaries rather than at fixed token counts. For tabular data, treat the entire table as one chunk regardless of size, and truncate to the model's limit if necessary.
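
A sketch of the overlap strategy, with token counting simplified to whitespace words for illustration:

def overlapping_chunks(tokens: list[str], size: int = 512) -> list[str]:
    # 50% overlap: each chunk starts halfway through the previous one, so
    # content near a boundary appears whole in at least one chunk.
    step = size // 2
    chunks = []
    for start in range(0, max(len(tokens) - step, 1), step):
        chunks.append(" ".join(tokens[start:start + size]))
    return chunks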

Missing knowledge. The corpus does not contain the answer. RAG confidently retrieves the nearest chunk — which is wrong — and the model generates a plausible-sounding but incorrect answer. This is the hallucination failure mode of RAG: the model cannot distinguish "the corpus does not know" from "the corpus says this." It will synthesize an answer from the nearest chunks regardless. Mitigation: include explicit "no relevant documents found" handling. If the top-K similarity scores are all below a threshold (typically cosine similarity < 0.75), inject "No relevant information was found in the knowledge base" into the prompt and instruct the model to say so. Alternatively, add a retrieval confidence gate: only inject chunks whose similarity score exceeds a floor, and tell the model how many chunks were retrieved to help it calibrate its confidence.
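
A sketch of the confidence gate, assuming the retriever returns (chunk, score) pairs with cosine similarity scores:

SIMILARITY_FLOOR = 0.75  # threshold as suggested above; tune per corpus

def gated_context(results: list[tuple[str, float]]) -> str:
    # Only inject chunks above the floor, and tell the model how many
    # survived so it can calibrate instead of synthesizing from near-misses.
    kept = [chunk for chunk, score in results if score >= SIMILARITY_FLOOR]
    if not kept:
        return ("No relevant information was found in the knowledge base. "
                "Say so rather than guessing.")
    return f"{len(kept)} relevant document(s) retrieved:\n\n" + "\n\n".join(kept)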

Long context in depth: the quality curve

Long context windows have grown significantly. Claude Sonnet 4.6 supports 128K tokens. Gemini 1.5 Pro supports 1M tokens. The temptation is to load the entire corpus, full conversation history, and all available documents into a single massive context and let the model figure it out. The empirical evidence says this strategy is suboptimal above roughly 4K–8K tokens of relevant content.

The Liu et al. (2024) "Lost in the Middle" experiments provide concrete numbers. On a 20-document retrieval task where one document contains the correct answer:

  • Answer at position 1 (beginning of context): 86% accuracy
  • Answer at position 10 (middle): 65% accuracy
  • Answer at position 20 (end): 82% accuracy

At 30 documents:

  • Position 1: 84%
  • Position 15 (middle): 41%
  • Position 30: 79%

The degradation is not linear across the context. It is specifically the middle that suffers. The model attends most strongly to the beginning and end of the context window. Material buried in the middle of a 100K token context is effectively invisible. This means naive "load everything" strategies do not just trade quality for completeness — they actively harm quality on the subset of information that lands in the middle.
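
One common mitigation (a general technique, not something the Liu et al. paper prescribes) is to reorder retrieved items so the highest-scoring ones land at the edges of the context rather than in raw rank order. A sketch:

def edge_order(items: list[str]) -> list[str]:
    # items arrive sorted most-relevant first; alternate them onto the two
    # ends of the context so the middle holds the least relevant material.
    front, back = [], []
    for i, item in enumerate(items):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]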

Long context also incurs cost at every call. A 100K-token context at Claude Sonnet 4.6 pricing costs approximately $0.30 per call in input tokens alone. For an agent making 50 calls per session, that is $15 per session in input costs — before accounting for output tokens. Compare this to a memory system that retrieves 4K tokens of relevant context for approximately $0.012 per call. Long context makes sense for one-shot tasks where the full document is genuinely necessary. It is economically prohibitive as the primary mechanism for high-frequency interactive agents. After a year of daily conversations, a user's transcript is too large to load even in a 1M-token window; the approach hits a hard ceiling that memory-based systems do not.

The practical niche where long context remains the right tool: one-shot document analysis where the entire document is relevant, not just excerpts. Contract review, paper summarization, code audit across a repository. In these cases the retrieval problem does not exist — everything is relevant — and loading it all is correct.

Memory in depth: what accumulation means

Memory accumulation is the mechanism by which each conversation turn leaves a permanent deposit in a per-user store. After a turn like "I prefer TypeScript for new projects," the memory write pipeline extracts a Preference-type memory (subject: user, predicate: prefers_language, value: TypeScript, scope: new_projects), stores it with an initial confidence of 0.65 (first mention, explicitly stated), and makes it retrievable for all future sessions. After session 5, where the user has mentioned this preference four more times in different contexts, the repetition boost has raised confidence to 0.82. The memory now scores near the top of preference retrieval for any TypeScript-related query, outcompeting less-reinforced preferences.
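
A sketch of what such a record and its confidence dynamics might look like; the field names, boost size, and decay rate are illustrative assumptions rather than a fixed schema (a boost near 0.04 roughly reproduces the 0.65 → 0.82 trajectory over four re-mentions):

from dataclasses import dataclass

@dataclass
class Memory:
    subject: str        # "user"
    predicate: str      # "prefers_language"
    value: str          # "TypeScript"
    scope: str          # "new_projects"
    confidence: float   # 0.65 on first explicit mention

def reinforce(m: Memory, boost: float = 0.04, cap: float = 0.95) -> None:
    # Repetition boost: each re-mention nudges confidence toward a cap.
    m.confidence = min(m.confidence + boost, cap)

def decay(m: Memory, rate: float = 0.01) -> None:
    # Per-memory decay: unreinforced memories drift toward the retrieval
    # floor, which is how stale state ages out (see below).
    m.confidence = max(m.confidence - rate, 0.0)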

The accumulation compounds across weeks and months. After six months of daily use, a user's memory store might contain 2,000 high-confidence memories covering their preferences across tools and languages, their project history and recurring patterns, relationships with teammates, and technical decisions made in past sessions. A new session that retrieves the top 50 most relevant memories gets a dense, personalized context with no staleness problem (each memory decays individually; stale memories drop below the retrieval floor) and no per-call re-indexing cost (the write pipeline runs asynchronously after each turn, not at retrieval time).

The failure mode is junk accumulation: extracted memories that were false positives from the extraction stage, duplicates that were not caught by deduplication, or memories that captured transient conversational artifacts rather than durable state. A user saying "let's pretend you're a pirate" should not generate a Preference memory for pirate-themed responses. Without write-time filtering, exactly this happens. After six months without quality filtering, the store might contain 10,000 memories of which 8,000 are low-quality: conversational filler, test queries, one-off requests, incorrect extractions. Retrieval precision collapses because junk outcompetes real memories in semantic similarity — they were extracted from the same conversation space, so they embed nearby in the same vector neighborhood.

This is why the write pipeline's pre-filter stage is the most important performance lever in a memory system. Junk enters faster than it decays without active filtering at write time. The pre-filter — word count gate, pattern matching, rate limiting per turn — catches the easy cases. The extraction stage's grounding score catches the subtler ones: memories with low grounding (the model inferred the memory rather than reading it directly) are discarded before they reach the store.
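
A sketch of the pre-filter's easy cases, with the patterns and thresholds as assumptions:

import re

ROLEPLAY_PATTERNS = re.compile(r"\b(let's pretend|pretend you're|role-?play)\b", re.I)
MAX_WRITES_PER_TURN = 3
MIN_WORDS = 4

def pre_filter(candidates: list[str]) -> list[str]:
    kept = []
    for text in candidates:
        if len(text.split()) < MIN_WORDS:   # word-count gate
            continue
        if ROLEPLAY_PATTERNS.search(text):  # pattern match for transient artifacts
            continue
        kept.append(text)
    return kept[:MAX_WRITES_PER_TURN]       # rate limit per turn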

Cost comparison at scale

Concrete cost analysis for an agent making 20 queries per day over 30 days — 600 total queries — for one user:

RAG approach (static corpus, 50K-doc knowledge base, 500-token chunks):

Indexing is a one-time or periodic cost. Embedding 50K chunks at 500 tokens each: 25M tokens × $0.0001/1K tokens (text-embedding-3-small) = $2.50 one-time. Incremental re-indexing for changed documents: roughly $0.001 per document re-indexed. Storage for 50K embedding vectors at 1,536 dimensions: negligible in managed services.

Per-query runtime cost: embed the query ($0.0001) + retrieve 10 chunks (~5K tokens injected) at $3/1M input tokens = ~$0.015 per query. For 600 queries: approximately $9. The key advantage: the corpus and its index are shared across all N users, so the indexing cost amortizes across the entire user base; only the small per-query cost scales with usage.

Long context approach (load full conversation history each session):

Assume an average context of 50K tokens per session (growing over time as history accumulates). 20 sessions per month × 50K tokens = 1M input tokens per month. At $3/1M tokens: $3/user/month in input cost. After six months, the conversation history has grown; average context is now 120K tokens. 20 sessions × 120K tokens = 2.4M input tokens/month = $7.20/user/month, and rising. Plus output tokens (not calculated here). Total: $5–15/user/month, scaling linearly and without bound as conversation history grows. The approach eventually hits the model's context window limit, but for active users the economics become prohibitive well before that hard ceiling.

Memory approach (4K-token context from retrieved memories):

Per-query: embed the query ($0.0001) + retrieve 50 memories assembled into a 4K-token context at $3/1M = $0.012, plus the write pipeline per turn: one extraction LLM call (~$0.002/turn). For 600 queries and 600 write turns: $7.20 in retrieval + $1.20 in writes = $8.40/user/month. Unlike long context, this does not scale with conversation history. After six months, the memory store is larger but retrieval still returns the top 50 most relevant memories into a fixed 4K-token budget. The per-query cost stays flat at roughly $0.014 regardless of how long the user has been using the product.
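
The arithmetic above, collected in one place so the assumptions are explicit (prices as quoted in this article; output tokens excluded throughout):

INPUT_PRICE = 3.00 / 1_000_000   # $ per input token
EMBED_QUERY = 0.0001             # $ per query embedding

def rag_cost(queries: int, injected_tokens: int = 5_000) -> float:
    return queries * (EMBED_QUERY + injected_tokens * INPUT_PRICE)

def long_context_cost(sessions: int, context_tokens: int) -> float:
    return sessions * context_tokens * INPUT_PRICE

def memory_cost(queries: int, context_tokens: int = 4_000,
                write_cost: float = 0.002) -> float:
    return queries * (EMBED_QUERY + context_tokens * INPUT_PRICE + write_cost)

# 600 queries/month: rag ≈ $9.06, memory ≈ $8.46 (flat over time);
# 20 sessions at 120K tokens: long context ≈ $7.20 and rising with history.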

Routing in a composed system

In a real system using all three modes, routing logic determines which mode answers each query. The routing decision is not expensive — it runs before the model call and adds minimal latency. A simple routing layer in Python:

from dataclasses import dataclass


@dataclass
class ParsedQuery:
    """Lightweight query analysis computed upstream of routing."""
    contains_kb_entity: bool  # query mentions an entity known to the KB corpus
    specificity: float        # 0.0 = generic how-to, 1.0 = highly personal


def route_query(query: str, parsed: ParsedQuery, user_provided_documents: bool) -> list[str]:
    modes: list[str] = []

    # Memory is nearly always included — it provides personalization at low cost
    modes.append("memory")

    # RAG for knowledge-base queries: low specificity or known KB entities
    if parsed.contains_kb_entity or parsed.specificity < 0.5:
        modes.append("rag")

    # Long context only when the user explicitly provides material
    if user_provided_documents:
        modes.append("long_context")

    return modes

The routing threshold matters more than it might appear. Setting specificity < 0.5 for RAG means low-specificity queries ("how does this work?", "what are best practices for X?") hit the knowledge base, while high-specificity queries ("what did we decide about the database schema last Tuesday?") go only to memory. The specificity score is a rough measure of how general versus personalized the query is: queries containing personal pronouns, specific project names, or references to past events score high; generic how-to queries score low.
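
A rough sketch of such a specificity score, heuristic and illustrative only:

import re

PERSONAL = re.compile(r"\b(i|my|we|our|me|us)\b", re.I)
PAST_EVENT = re.compile(r"\b(last (week|tuesday|month)|yesterday|we (decided|agreed))\b", re.I)

def specificity(query: str, known_project_names: set[str]) -> float:
    score = 0.0
    if PERSONAL.search(query):
        score += 0.4   # personal pronouns suggest per-user context
    if PAST_EVENT.search(query):
        score += 0.4   # references to past events suggest memory
    if any(name.lower() in query.lower() for name in known_project_names):
        score += 0.3   # specific project names are user-specific
    return min(score, 1.0)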

In practice, most queries are low-specificity (users asking general questions) or high-specificity (users asking about their own context). Very few queries require both RAG and memory simultaneously — but for those that do (e.g., "how does the API work and have I used this pattern before?"), the two retrievals run in parallel and their outputs are assembled into a single context block.

The merged context block ordering matters given the lost-in-the-middle finding. Memory context — the most personalized, highest-signal content — goes at the beginning of the injected context. RAG chunks go after. Long context (if any) goes last but is flagged as highest-priority for the model with an explicit instruction. This ordering biases the model toward personal context while keeping shared knowledge accessible.
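
A sketch of that assembly order; the section labels and instruction wording are assumptions:

def assemble_context(memory: str, rag: str, long_context: str) -> str:
    parts = []
    if memory:
        # Beginning of context: highest-attention position for personal state.
        parts.append(f"## What you know about this user\n{memory}")
    if rag:
        # Shared knowledge sits after the personal context.
        parts.append(f"## Reference documents\n{rag}")
    if long_context:
        # Last position, explicitly flagged so the model prioritizes it.
        parts.append("## User-provided material (highest priority: "
                     "answer from this first)\n" + long_context)
    return "\n\n".join(parts)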

Decision framework for new projects

Walk through this decision tree when starting a new AI product:

Step 1: Is your knowledge static or dynamic?

If the knowledge exists before the user shows up and changes by editorial decision (documentation, product manuals, regulatory filings, curated knowledge bases): start with RAG. The corpus is yours to maintain. If the knowledge is generated by users through interaction (preferences, history, ongoing work, relationship context): you need memory. If it is both: you need both, which is the common case.

Step 2: Is the information per-user or shared?

Shared knowledge that gives the same answer to all users belongs in a RAG corpus. User-specific state that differs per user belongs in memory. A product where the same documentation serves all users is pure RAG. A personal assistant where context is entirely individual is pure memory. Most products live between these extremes.

Step 3: Does the interaction span sessions?

Single-session products (document analysis, one-shot code generation, single-turn QA): transcript buffer is fine, no persistence needed, long context is the right tool. Multi-session products where each session should know about previous ones (personal assistant, coding copilot, customer support): you need durable storage. Long context does not persist — once the session ends, the context is gone. Memory and a database are the two options; memory is better when the state is conversational and relational rather than structured.

Step 4: What is your token budget?

If per-query cost is a hard constraint: memory is more economical than long context at scale, and RAG is economical when the corpus is shared. If the interaction is genuinely one-shot and cost is not a constraint: long context is simpler to implement and avoids the engineering overhead of a write pipeline.

Starting configuration by product type:

  • Documentation assistant: RAG primary, no memory needed. Users share the same corpus; no per-user state accumulates.
  • Personal AI assistant: Memory primary, RAG for general knowledge if the assistant covers broad domains, no long context.
  • Code editor copilot: Long context for open files in the current task, RAG for language documentation and API references, memory for the user's coding preferences and project history.
  • Customer support agent: RAG for product documentation (shared, curated, changes by product decision), memory for customer history (per-user, conversational), no long context.
  • Research agent: Long context for uploaded papers (user-provided, full-document relevant), RAG for general literature (shared corpus), memory for the user's research notes and prior analyses.

These are starting points, not rigid templates. The signal to add a mode is when you hit the failure mode of the mode you are currently using: stale answers → your RAG indexing cadence is too slow; forgotten context across sessions → you need memory; cost scaling with history → you need to move off long context.

