LangChain Memory vs LangGraph State vs Recall
These three are often discussed as "memory solutions," but they target different scopes. Confusing them produces over-engineering (using Recall for what LangGraph state handles) or under-engineering (using LangChain memory for what needs durable storage).
At a glance
| | LangChain Memory | LangGraph State | Recall |
|---|---|---|---|
| Scope | Single conversation | Single graph execution | Across sessions, durable |
| Lifetime | Until session ends | Until graph completes | Indefinite |
| Persistence | Optional (you implement) | Optional checkpointing | Built-in |
| Schema | Buffer / window / summary | TypedDict per node | Typed memory primitives |
| Retrieval | Whole buffer or summary | Read state by key | Multi-retriever fusion |
| Best for | In-session continuity | Workflow state | Cross-session relationships |
LangChain Memory — short-window continuity
LangChain's memory abstractions (ConversationBufferMemory, ConversationBufferWindowMemory, ConversationSummaryMemory, VectorStoreRetrieverMemory) all answer the same question: how do you give an LLM access to its own prior turns within a conversation?
- BufferMemory: the entire transcript verbatim.
- WindowMemory: the last N turns.
- SummaryMemory: a running compressed summary.
- VectorStoreMemory: embed turns; retrieve relevant ones.

These all operate on conversational state, not durable user state. They don't survive a session restart unless you build that yourself; they don't have types, supersession, or decay; they don't address cross-session continuity.
LangGraph State — workflow state
LangGraph models an agent as a graph. Each node executes; state flows along edges. The State object is typed (a TypedDict) and shared across nodes within an execution. With checkpointing, the state can persist across runs. This is the right primitive for workflow state — variables that matter for the current task. "What document did the user upload?", "what step are we on?", "what intermediate result did we compute?" These are not memory; they are runtime state.
Recall — durable typed memory
Recall is the third layer. It holds what should persist across sessions, weeks, months — the user's preferences, history, ongoing relationships, evolving entities. It is opinionated about types, supersession, decay, retrieval, and drift in ways the other two are intentionally not.
They compose
A real agent typically uses all three:
- LangGraph State for "what task am I currently doing? what tools have I called? what intermediate results have I computed?"
- LangChain Memory (or built-in equivalent) for "what was just said in this conversation?"
- Recall for "what does this user prefer? who do they work with? what have they done over the months I've known them?"
Common anti-patterns
- Putting workflow state in long-term memory. "Current step number" stored as a fact crowds the memory with churned junk. Use graph state instead.
- Putting cross-session state in conversation buffer. The buffer disappears at session end. The user's preferences should not.
- Using a vector store as if it were memory. See vector DB ≠ memory. Embedding-only stores handle retrieval but not lifecycle.
Picking what you actually need
Most teams overestimate what they need. A coding-assistant prototype probably needs only LangGraph State and a transcript buffer. A long-term personal assistant needs Recall on top. A multi-agent system might need Recall plus an additional cross-agent state layer. Build up; don't start at the top.
LangChain Memory internals
Each LangChain memory type has a specific cost structure and failure mode. Choosing wrong creates problems that only surface at scale.
ConversationBufferMemory stores the full transcript as a list of (role, content) pairs. At each call, the full list is serialized and prepended to the system prompt — or injected at a designated placeholder in the prompt template. Cost is O(total_turns × average_turn_length) tokens per call. This is not amortized: every call pays the full cost. At 50 turns averaging 200 tokens each, the buffer is 10,000 tokens before the model sees the new user message. That number is a significant fraction of a 32K context window and nearly the entire practical budget for a 16K window. There is no filtering, prioritization, or compression. The model receives the entire transcript and must attend over it uniformly. Right for: short interactions, prototypes where you need to inspect what the model saw, debugging sessions where verbatim history matters.
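A minimal sketch of the buffer lifecycle, using the classic langchain.memory API (marked legacy in recent LangChain releases):

```python
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory()  # default memory_key="history", string output
memory.save_context({"input": "I prefer dark mode"}, {"output": "Noted."})
memory.save_context({"input": "Summarize the Q3 report"}, {"output": "Here is the summary..."})

# The entire transcript comes back verbatim and is re-injected on every call,
# so prompt cost grows linearly with turn count.
history = memory.load_memory_variables({})["history"]
```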
ConversationBufferWindowMemory stores only the last N turns. This addresses the token scaling problem but introduces an abrupt cliff: everything before turn (current_turn - N) is completely gone. A user preference stated 25 turns ago is invisible when N=20. This is not lossy compression — it is total truncation. The information is not summarized; it simply does not exist in the prompt. The right N depends on average turn length and context window size. At 200 tokens per turn and a 16K context budget for history, N≤80. At 500 tokens per turn, N≤32. Calculating N correctly requires knowing your average turn length in advance, which most teams don't measure. Right for: fixed-length interactions with no long-range dependencies, workflows where recent context is sufficient by design.
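A sizing sketch using the numbers above; the budget figures are illustrative and must be replaced with measurements from your own traffic:

```python
from langchain.memory import ConversationBufferWindowMemory

history_budget_tokens = 16_000   # context tokens reserved for history (assumed)
avg_turn_tokens = 200            # measure this on real conversations
k = history_budget_tokens // avg_turn_tokens  # 80 turns

memory = ConversationBufferWindowMemory(k=k)
# Turns older than the last k are dropped entirely (truncated, not summarized).
```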
ConversationSummaryMemory compresses older turns into a running summary via an LLM call. Its hybrid variant, ConversationSummaryBufferMemory, maintains two buffers: a "recent" verbatim window and a "summary" of everything older. When the verbatim window exceeds a threshold (configurable, default ~2,000 tokens), the oldest turns are passed to the summarizer LLM and the resulting summary replaces them. The summary is then prepended to the verbatim window. This preserves information across turn count at the cost of fidelity — the summary is the LLM's interpretation of what mattered, not the original text. Critical details that the summarizer judged less important may be lost. Importantly, every session incurs summarization LLM calls (at roughly $0.001–0.005 per compression pass), making this more expensive than it appears. Right for: long conversations where gist and high-level state matter more than verbatim history, use cases where the user is unlikely to ask about specific exact phrasing or numerical values from prior turns.
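A configuration sketch of the summary-buffer variant; the summarizer model is an arbitrary choice here, not a recommendation:

```python
from langchain.memory import ConversationSummaryBufferMemory
from langchain_openai import ChatOpenAI

memory = ConversationSummaryBufferMemory(
    llm=ChatOpenAI(model="gpt-4o-mini"),  # pays an extra LLM call per compression pass
    max_token_limit=2000,                 # verbatim window threshold before summarization
)
memory.save_context({"input": "We shipped v2 last week"}, {"output": "Congrats!"})
# Once the verbatim window exceeds max_token_limit, older turns are replaced
# by an LLM-written summary; exact phrasing and numbers may be lost.
```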
VectorStoreRetrieverMemory embeds each turn and stores it in a vector index. At query time, the query (typically the current user message) is embedded and the most similar prior turns are retrieved. This is the closest to memory-system behavior of the four types, but it lacks the properties that make a real memory system useful: no type classification, no supersession (if the user says "I changed my mind, I now prefer Python" both the old and new statements are in the store with equal weight), no decay or freshness scoring, no relation graph. Retrieval is purely embedding-cosine — no BM25, no temporal filtering, no graph traversal. Two turns that contradict each other may both appear in the same retrieved context. Right for: long conversations where a small subset of prior turns is semantically relevant to each new turn, and where contradiction is uncommon or acceptable.
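A sketch of the retriever-backed memory; FAISS and OpenAI embeddings are stand-ins for whatever vector store and embedding model you actually use:

```python
from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

vectorstore = FAISS.from_texts(["seed"], OpenAIEmbeddings())
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
)
memory.save_context({"input": "I prefer Java"}, {"output": "Noted."})
memory.save_context({"input": "I changed my mind, I now prefer Python"}, {"output": "Updated."})

# Both contradictory turns stay in the index with equal weight; retrieval is
# cosine similarity only, with no supersession, decay, or temporal filtering.
relevant = memory.load_memory_variables({"input": "which language does the user prefer?"})
```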
The practical cost comparison at 1,000-turn session scale:
| Type | Tokens/call (1K turns) | Latency overhead | Info retention |
|---|---|---|---|
| BufferMemory | ~200K (impractical) | None | Complete |
| WindowMemory (N=20) | ~4K | None | Last 20 turns only |
| SummaryMemory | ~2–6K | LLM call per compression | Lossy gist |
| VectorStoreMemory (top-5) | ~1K | Embedding + ANN query | Top-5 by similarity |
LangGraph State internals
LangGraph's State object is a TypedDict shared across all nodes in a graph execution. Each node receives the current state and returns a dict containing only the fields it modified. LangGraph merges these deltas using annotated reducers:
```python
from typing import TypedDict, Annotated
import operator

from langgraph.graph import StateGraph


class AgentState(TypedDict):
    messages: Annotated[list, operator.add]  # append-only; new items merged in
    current_task: str                        # last-write-wins (default)
    tool_results: dict[str, str]             # last-write-wins
    iteration_count: int                     # last-write-wins
    final_answer: str | None                 # last-write-wins


def router_node(state: AgentState) -> dict:
    # Only return fields this node changed
    return {"current_task": "search", "iteration_count": state["iteration_count"] + 1}


def tool_node(state: AgentState) -> dict:
    result = run_search(state["current_task"])
    return {
        "tool_results": {**state["tool_results"], "search": result},
        "messages": [{"role": "tool", "content": result}],  # appended via operator.add
    }
```

The Annotated[list, operator.add] pattern is load-bearing: without it, a node returning {"messages": [new_msg]} would overwrite the entire message list rather than appending. Always annotate list fields with operator.add or a custom reducer. For dicts, last-write-wins merges the entire dict, not individual keys — if two nodes both write tool_results, the second write wins entirely.
Checkpointing. With MemorySaver (in-process) or AsyncSqliteSaver (file-backed), LangGraph serializes state to a checkpoint store after every node execution. A graph that crashes mid-execution can resume from the last completed node by passing the same thread_id to the next invocation. This is the "optional persistence" in the comparison table — it requires explicit configuration:
```python
from langgraph.checkpoint.sqlite.aio import AsyncSqliteSaver

async with AsyncSqliteSaver.from_conn_string("checkpoints.db") as checkpointer:
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "user-session-42"}}
    result = await graph.ainvoke(initial_state, config)

    # Resume after crash:
    resumed = await graph.ainvoke(None, config)  # None = resume from checkpoint
```

Without checkpointing, state is in-process memory and is lost if the process dies.
State size limits. LangGraph state is not designed for large payloads. The state object is serialized in full at every checkpoint. A 1MB state object with 10,000 memory entries means 1MB written to the checkpoint store at every node transition. This multiplies quickly: a 20-node graph with 10,000 memories in state generates 20MB of checkpoint writes per execution. More fundamentally, state serialization, delta merging, and checkpoint comparison all assume state objects are small — a few dozen fields, a few hundred tokens at most. Large payloads belong in a database referenced by ID from the state. The canonical pattern:
```python
class AgentState(TypedDict):
    messages: Annotated[list, operator.add]
    current_task: str
    # NOT: memories: list[dict]  ← wrong
    # YES: a scoped reference that the tool uses to fetch from Recall
    user_id: str
    session_id: str
```

The user_id and session_id fields in state are what the memory tool uses to call Recall. The memories themselves never enter the state object.
Thread ID and multi-tenancy. LangGraph's thread_id in the checkpointer config is the primary key for state isolation. One thread_id per user session. If the same user has two concurrent browser tabs open (two sessions), use two thread IDs — state is not thread-safe across concurrent executions with the same thread ID. For cross-session state (what Recall handles), the user ID in Recall's scope provides the continuity that LangGraph's thread ID intentionally does not.
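A sketch of the separation: one thread_id per concurrent session for LangGraph state, one user-scoped Recall namespace for cross-session memory. The ID format below is illustrative, not a LangGraph convention:

```python
user_id = "user-42"

# Two concurrent tabs -> two thread IDs, so checkpointed state never collides.
config_tab_a = {"configurable": {"thread_id": f"{user_id}:tab-a"}}
config_tab_b = {"configurable": {"thread_id": f"{user_id}:tab-b"}}

# Cross-session continuity comes from Recall's scope, keyed by user rather than thread.
recall_scope = {"user_id": user_id, "agent_id": "my_agent"}
```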
Recall as a LangGraph tool
The natural integration pattern for Recall in a LangGraph agent is a pair of tools: recall_read and recall_write. These are normal LangGraph tool nodes — the agent's planner node decides when to call them.
```python
from langchain_core.tools import tool
from recall import RecallClient

recall_client = RecallClient(api_key=RECALL_API_KEY)


@tool
async def recall_read(query: str, user_id: str) -> str:
    """Retrieve relevant memories for the current query."""
    memories = await recall_client.retrieve(
        query=query,
        scope={"user_id": user_id, "agent_id": "my_agent"},
        limit=10,
    )
    return format_memories_as_context(memories)


@tool
async def recall_write(turn: str, user_id: str) -> None:
    """Write a conversation turn to the memory pipeline."""
    await recall_client.write(
        content=turn,
        scope={"user_id": user_id, "agent_id": "my_agent"},
    )
```

In the graph wiring, recall_read is called by the node that assembles context before the LLM call. recall_write is called by a post-response node after the LLM response is generated and returned to the user. The write call is fire-and-forget — the response does not wait for it:
```python
import asyncio


async def assemble_context_node(state: AgentState) -> dict:
    last_user_msg = state["messages"][-1]["content"]
    memory_context = await recall_read.ainvoke(
        {"query": last_user_msg, "user_id": state["user_id"]}
    )
    return {"memory_context": memory_context}


async def post_response_node(state: AgentState) -> dict:
    full_turn = f"User: {state['messages'][-2]['content']}\nAssistant: {state['messages'][-1]['content']}"
    # Fire and forget — don't await or block the graph on this
    asyncio.create_task(
        recall_write.ainvoke({"turn": full_turn, "user_id": state["user_id"]})
    )
    return {}
```

The asymmetry is intentional: reads block the response path because the model needs memory context to generate a good response. Writes are decoupled from the response path because the user does not wait for memory extraction — that work happens in the background. This means the write pipeline's 7 stages (extraction, entity resolution, deduplication, conflict detection, persistence, embedding, HyPE) do not add latency to the user-facing response.
Memory context injection. The recall_read output is injected into the LLM prompt at a designated position — typically after the system prompt and before the conversation history. A reasonable format:
```
[Memory context — what I know about this user]
- Fact: User works at Volkswagen (confidence: 0.92)
- Preference: User prefers TypeScript for new projects (confidence: 0.87)
- Relation: User reports to Sarah, who leads the platform team (confidence: 0.78)
[End memory context]
```

The LLM treats this as ground truth about the user. High-confidence memories should be presented assertively; low-confidence memories (below 0.5) can be omitted or flagged with hedging language in the format string.
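The helper format_memories_as_context used in the snippets on this page is left undefined; a sketch along these lines works, assuming each retrieved memory exposes type, content, and confidence fields (the field names are assumptions about the retrieve() response shape):

```python
def format_memories_as_context(memories: list[dict]) -> str:
    lines = ["[Memory context — what I know about this user]"]
    for m in memories:
        if m["confidence"] < 0.5:
            continue  # omit low-confidence memories, or flag them with hedged wording
        lines.append(f"- {m['type']}: {m['content']} (confidence: {m['confidence']:.2f})")
    lines.append("[End memory context]")
    return "\n".join(lines)
```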
Adding Recall to an existing LangChain agent
The migration path from ConversationBufferMemory to Recall is a four-step process. Each step is independently reversible.
Step 1: replace the memory load call. In a typical LangChain agent, memory is loaded at the start of each call via memory.load_memory_variables({}). Replace this with a recall_client.retrieve() call. The return value shape is different (a list of typed memory objects rather than a dict with a history key), so you need to reformat the output into the prompt injection your chain expects.
```python
# Before
memory_vars = memory.load_memory_variables({})
history_text = memory_vars["history"]

# After
memories = await recall_client.retrieve(
    query=current_user_message,
    scope={"user_id": user_id, "agent_id": "my_agent"},
)
history_text = format_memories_as_context(memories)
```

Step 2: add the write call. After the chain generates a response, write the full turn to Recall. The write call receives the human message and AI response concatenated. Do this after returning the response to the user, not before:
```python
response = await chain.ainvoke({"input": user_message, "history": history_text})

# Return response to user here, then:
asyncio.create_task(
    recall_client.write(
        content=f"User: {user_message}\nAssistant: {response['output']}",
        scope={"user_id": user_id, "agent_id": "my_agent"},
    )
)
```

Step 3: remove durable state from the old buffer. Anything in the old buffer that represents durable user state — preferences ("I like dark mode"), facts ("I work at Volkswagen"), ongoing relationships — will now be extracted automatically by the Recall write pipeline. You do not need to extract it yourself. Session-transient state (current topic, current task, intermediate results) can stay in a lightweight in-session buffer. After removing durable state from the buffer, the buffer will be smaller and cheaper. Many teams find they can shrink N in their WindowMemory significantly once Recall handles cross-session facts.
Step 4 (optional): backfill historical conversations. Run recall_client.write() over your existing conversation logs to bootstrap the memory store with prior context. The write pipeline processes historical turns identically to live turns — the extraction, classification, and deduplication stages do not distinguish between live and backfill writes. Approximate cost: 200 at current LLM pricing.
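A backfill sketch; the logs iterable and its record fields are placeholders for however you export historical transcripts:

```python
async def backfill(recall_client, logs):
    # Historical turns go through the same write pipeline as live turns.
    for log in logs:
        await recall_client.write(
            content=f"User: {log['user_message']}\nAssistant: {log['assistant_message']}",
            scope={"user_id": log["user_id"], "agent_id": "my_agent"},
        )
```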
A common mistake during migration: continuing to run memory.save_context() (the LangChain call that saves a turn to the buffer) alongside the new Recall write. This causes double-writing — the turn enters both the old buffer and the new Recall pipeline. Disable memory.save_context() as soon as the Recall write call is active.
Multi-agent patterns with shared Recall
In multi-agent systems, multiple agents — a planner, an executor, a reviewer — can share the same Recall namespace. This enables coordination through memory without explicit message-passing between agents.
Shared namespace pattern. All agents use the same scope parameter:
```python
SHARED_SCOPE = {"user_id": user_id, "agent_id": "shared"}

# Planner agent writes a decision
await recall_client.write(
    content="Decided to use PostgreSQL for the data store. Rationale: existing team expertise, JSONB support for schema flexibility.",
    scope=SHARED_SCOPE,
)

# Executor agent reads it when implementing the data layer
memories = await recall_client.retrieve(
    query="what database should I use",
    scope=SHARED_SCOPE,
)
# Returns: Fact memory "decided to use PostgreSQL for the data store" with confidence 0.89

# Reviewer agent reads it when checking the implementation
memories = await recall_client.retrieve(
    query="data store technology decision",
    scope=SHARED_SCOPE,
)
```

Without shared memory, the executor re-derives the database choice independently (possibly reaching a different conclusion), or requires the planner to explicitly pass the decision through LangGraph state to every downstream agent. Shared Recall removes the coordination overhead: agents read from the same ground truth.
Agent-specific namespaces alongside shared. Agents can maintain private state alongside shared state by using their own agent_id:
```python
# Planner's private reasoning — not visible to other agents
await recall_client.write(
    content="Considered DynamoDB but ruled out due to team unfamiliarity.",
    scope={"user_id": user_id, "agent_id": "planner"},
)

# Shared decision — visible to all agents
await recall_client.write(
    content="Selected PostgreSQL as the primary data store.",
    scope={"user_id": user_id, "agent_id": "shared"},
)
```

Conflict resolution across agents. If the planner writes "use PostgreSQL" and a different agent writes "use SQLite" in the same scope, the conflict detection stage (stage 5 of the write pipeline) detects the contradiction: same type (Fact), same subject (data store), same predicate (technology choice), incompatible objects (PostgreSQL vs SQLite). Rather than silently superseding one with the other, the pipeline creates a contradicts link between the two memories and promotes the higher-confidence one in retrieval weight. The lower-confidence memory is demoted but not deleted. A supervisor agent can query for contradictions by filtering on the contradicts relation type and resolve them explicitly — typically by writing a third memory that supersedes both.
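A sketch of that explicit resolution pass; the filters argument is an assumption about how contradiction links could be queried, not a documented Recall parameter:

```python
# Hypothetical supervisor step: surface conflicting memories, then supersede both.
conflicts = await recall_client.retrieve(
    query="data store technology choice",
    scope=SHARED_SCOPE,
    filters={"relation": "contradicts"},  # assumed filter syntax
)
if conflicts:
    await recall_client.write(
        content="Final decision: PostgreSQL is the primary data store; SQLite is used only for local test fixtures.",
        scope=SHARED_SCOPE,
    )
```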
This pattern is especially useful in long-running multi-agent pipelines where multiple agents run concurrently and may reach conflicting intermediate conclusions. The memory layer surfaces conflicts explicitly rather than hiding them.
When to stop at LangGraph State
Not every use case needs Recall. The correct escalation path avoids adding layers before they pay for themselves.
LangGraph State only: the task is fully contained within one session and the output is deterministic. A user uploads a file; the agent processes it and returns a result. Nothing about this interaction should influence future sessions. The graph state holds the file reference, intermediate parse results, and final output. Checkpointing is configured for crash recovery only, not persistence. Adding Recall here would incur extraction cost with zero retrieval benefit — no future session will ask about this file.
Add a conversation buffer: the task involves multi-turn dialogue within a session. The agent needs context from turns earlier in the same conversation — the user said "use the second approach" and the agent needs to remember what the two approaches were. A WindowMemory with N=20 is sufficient. The total session will not exceed 20 turns, or if it does, earlier turns are truly irrelevant to the current exchange.
Add Recall: any of these conditions are true:
- The user's preferences should affect future sessions ("I prefer concise explanations" should carry forward).
- The agent needs to recall facts about the user established in prior sessions ("last time we talked, you were evaluating PostgreSQL vs MySQL — did you decide?").
- The user works with other people the agent should know about ("remind me what I told you about the Sarah situation").
- The user's context evolves over time and the agent should track the evolution ("how has your take on this changed over the past month?").
Add Recall with shared namespace: multiple agents collaborate on behalf of the same user and need to share ground truth. Or the same agent runs in multiple concurrent instances for the same user (parallel browser tabs, concurrent API calls). Shared memory namespace enables coordination without routing all communication through LangGraph state, which would require a single-threaded execution model.
The key discipline: add each layer when a specific use case requires it, not preemptively. Most agents genuinely need LangGraph State. Many need a session buffer. Some need Recall. Few need the full shared-namespace multi-agent setup from the start.