Memory for Research Agents

By Arc Labs Research · 9 min read

The problem

Research is path-dependent. The hypothesis you investigated on Tuesday shapes the sources you read Wednesday and the contradictions you noticed Thursday. A research agent without memory cannot follow that path — it loses the thread between sessions and re-investigates the same things from scratch.

The harder failure: contradictions go unsurfaced. Source A claimed X. Source B claimed not-X. Without persistent memory linking the two, the agent presents whichever was most recent without flagging the conflict.

What agent memory gives you

Research agents need three memory categories beyond the typical user-personalization stack:

  • Source provenance — every claim memory carries its source (URL, document ID, page). This is the ground truth for citation and re-verification.
  • Hypothesis tracking — open questions, partial answers, dead ends. Stored as events with status (active, resolved, abandoned). Searchable temporally — "what hypotheses are still open after 30 days?"
  • Claim relations — typed edges between memories: supports, contradicts, extends. The graph of these relations is the substrate for contradiction surfacing.

With provenance + relations, the agent can answer "what's the strongest evidence for X?" or "what contradicts what we currently believe?" — questions that flat-text memory cannot answer.

Source provenance as a first-class constraint

For research agents, every memory that reaches the store must carry source provenance. This is non-negotiable — a research claim without a citable source is worse than no claim, because it sounds authoritative without being verifiable.

The Recall write pipeline enforces this via the extract stage: candidates that fail to carry at least one source reference (document ID, URL, page number, or span offset) are rejected before they reach the store. Configure this explicitly:

await recall.write({
  scope: { project_id },
  source: {
    document: { id: docId, url: paperUrl, title: paperTitle },
    page: 42,
    span: [1200, 1580],  // character offsets within the page
    accessed_at: new Date().toISOString(),
  },
  candidates: [
    {
      type: "fact",  // research claims stored as facts or custom "claim" type
      content: "Intervention group showed 23% reduction in latency (p < 0.001, n=150)",
      tags: ["quantitative", "primary-finding"],
    },
  ],
});

The provenance struct (source_turn_ids, extracted_by, extraction_trace_id) ensures you can always ask: "Where did this claim come from? Which model version extracted it? What was the exact source span?" These are questions the researcher will ask every time they go to cite something.

Span offsets matter because papers get updated, URLs move, and PDFs get re-typeset. Storing the character offset against a specific document version means you can verify the claim against the exact text that was read, even if the canonical URL has since pointed to a revised version. Pair span offsets with a content hash of the document at access time if re-verification against original text is a hard requirement.
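
A minimal sketch of that pairing, assuming the source struct accepts an extra content_hash field (the field name is an assumption, not confirmed API):

// Hash the document text exactly as it was fetched, so the span offsets can be
// re-verified against the same bytes later. content_hash is an assumed field.
import { createHash } from "node:crypto";

const contentHash = createHash("sha256").update(documentText).digest("hex");

await recall.write({
  scope: { project_id },
  source: {
    document: { id: docId, url: paperUrl, title: paperTitle },
    page: 42,
    span: [1200, 1580],
    content_hash: contentHash,
    accessed_at: new Date().toISOString(),
  },
  candidates: [/* claims extracted from this span */],
});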

Source entities in the graph are first-class nodes: paper nodes carry author relations, publication year, DOI, venue. A "Smith (2024)" query resolves to the source entity node, and from there the graph surfaces all claims derived from that source. This is the research-specific complement to personal assistant entity resolution — instead of resolving people, you are resolving sources.
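
A hedged sketch of that walk; the start_node, edge_type, and direction parameters on queryGraph are illustrative here, not confirmed API:

// Resolve "Smith (2024)" to its source entity node, then pull every claim
// linked to that source. The edge type "derived_from" is an assumed name for
// the claim-to-source link.
const smithEvidence = await recall.queryGraph({
  scope: { project_id },
  start_node: "source_smith_2024",
  edge_type: "derived_from",
  direction: "incoming",  // claims point at the source they were extracted from
});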

The claim relation graph

Research memory is a knowledge graph where nodes are claims and edges are semantic relations: supports, contradicts, extends, supersedes, replicates.

This is the structure that makes research retrieval qualitatively different from personal assistant retrieval. A flat list of facts cannot answer "what's the current state of the evidence on X?" — but a graph of claims with typed relations can. The relation type carries meaning that bare similarity scores cannot. Two claims that score 0.92 cosine similarity might support each other, or they might make contradictory assertions about the same measurement. Similarity alone cannot distinguish the two cases.

// New claim that contradicts an existing one
await recall.write({
  scope: { project_id },
  source: { document: { id: docId }, page: 18 },
  candidates: [
    {
      type: "fact",
      content: "Contrary to Smith (2024), no significant latency reduction found (n=80, p=0.34)",
      relations: [
        {
          type: "contradicts",
          target: "mem_smith2024_latency_claim",
          confidence: 0.85,
          note: "Different cohort size; p-value non-significant"
        }
      ],
      tags: ["quantitative", "null-result"],
    },
  ],
});

Querying the graph for open contradictions:

const openContradictions = await recall.queryGraph({
  scope: { project_id },
  edge_type: "contradicts",
  filter: { status: "unresolved" },
  // Returns pairs of contradicting claims with their sources
});

Contradiction resolution is manual. The researcher decides: is this a genuine contradiction (different conditions), a supersession (newer methodology), or an error in one of the sources? The resolution itself becomes a new edge — a resolved_by relation pointing to the resolution note. The resolution note is itself a memory node: it carries the researcher's reasoning for how the contradiction was resolved, who resolved it, and when.
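
A sketch of that flow using the write and update shapes above; the return value of write and the ability to patch relations through updateMemory are assumptions:

// 1. Store the researcher's resolution as its own memory node.
const resolution = await recall.write({
  scope: { project_id },
  source: { document: { id: "researcher-note-2025-06-12" } },  // illustrative manual source
  candidates: [
    {
      type: "fact",
      content: "Smith (2024) and the n=80 null result used different cohorts; both stand as conditional findings",
      tags: ["resolution"],
    },
  ],
});

// 2. Point the contradicting claim at the note via a resolved_by edge.
//    The returned ids array and the relations patch are assumed API shapes.
await recall.updateMemory({
  id: "mem_replication_null_claim",
  patch: {
    relations: [{ type: "resolved_by", target: resolution.ids[0] }],
  },
});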

Relation edges are bidirectional by query convention but stored directionally. A contradicts edge from claim A to claim B means A was written with explicit knowledge that it contradicts B. The reverse query (find all things that contradict A) traverses the edge in reverse. Storing directionality preserves the temporal and epistemic ordering of discovery: A was known first, B was written in the context of A.

The extends relation type covers the most common case in research: a new study uses a similar methodology on a different cohort, or investigates a related but distinct variable. Extends is not the same as supports: a supporting claim makes the same assertion; an extending claim builds on the same framework to ask a different question. Distinguishing these at write time makes graph queries far more precise.

Hypothesis tracking as a state machine

Research hypotheses live through a state machine: a hypothesis starts open, moves into active_investigation, and ends up supported, contradicted, abandoned, or resolved.

Implement this with events and custom tags:

// Create a hypothesis
await recall.write({
  scope: { project_id },
  candidates: [
    {
      type: "event",  // hypotheses are events — they have a lifecycle
      content: "Hypothesis: intervention effect is moderated by baseline severity",
      tags: ["hypothesis", "status:open"],
      entities: ["hypothesis_moderator_effect"],
    },
  ],
});

// Update hypothesis status when evidence mounts
await recall.updateMemory({
  id: "mem_hypothesis_123",
  patch: {
    tags: ["hypothesis", "status:supported"],
    supersession_note: "Three independent replications found same moderator pattern"
  }
});

Querying open hypotheses at session start:

const openHypotheses = await recall.search({
  query: "what questions are still unanswered",
  scope: { project_id },
  types: ["event"],
  filters: { tags: ["hypothesis", "status:open"] },
  sort: "valid_from",  // oldest open hypotheses first
  limit: 10,
});

At session start, the agent surfaces open hypotheses — the researcher immediately sees what threads are live and can pick up where they left off. This is the core continuity benefit that distinguishes a memory-equipped research agent from a stateless one. Without hypothesis tracking, every session starts with "where were we?" — with it, the agent leads with the current state of investigation.

Status transitions require evidence. Moving from open to supported requires at least one supports relation edge from a claim to the hypothesis node. Moving from open to contradicted requires at least one contradicts edge. The abandoned status is manually set — the researcher decides a line of inquiry is no longer worth pursuing. Abandoned hypotheses are retained (not deleted) because they document negative space: "we looked at this and decided not to pursue it for the following reason."
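
A sketch of that guard; filtering queryGraph results by edge target is an assumption about the API:

// Only move a hypothesis from "status:open" to "status:supported" when at least
// one supports edge points at it.
const supportingEdges = await recall.queryGraph({
  scope: { project_id },
  edge_type: "supports",
  filter: { target: "mem_hypothesis_123" },  // assumed filter key
});

if (supportingEdges.length > 0) {
  await recall.updateMemory({
    id: "mem_hypothesis_123",
    patch: { tags: ["hypothesis", "status:supported"] },
  });
}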

Hypothesis nodes can also carry relations to other hypotheses. A sub-hypothesis is a refinement: if the parent hypothesis is "intervention effect is moderated by baseline severity", a sub-hypothesis might be "the moderation is stronger in adults over 40". The parent-child relation is stored as parent_of / child_of. Graph traversal from the parent surfaces all sub-hypotheses and their statuses, giving a complete picture of how an investigation has decomposed.
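
Writing a sub-hypothesis follows the same shape as the hypothesis write above, with the child_of relation declared at write time:

// Sub-hypothesis stored as its own event node, linked to the parent hypothesis.
await recall.write({
  scope: { project_id },
  candidates: [
    {
      type: "event",
      content: "Sub-hypothesis: the moderation by baseline severity is stronger in adults over 40",
      tags: ["hypothesis", "status:open"],
      relations: [
        { type: "child_of", target: "mem_hypothesis_123" },  // parent hypothesis node
      ],
    },
  ],
});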

Write pipeline behavior for research agents

Research agents need different write-pipeline tuning than other domains.

pre_filter: The relevance threshold should be lower than the default (0.35 vs 0.40). Research conversations contain dense technical content — the filter should be permissive because missing a real finding is worse than storing a few extra methodological notes. The cost of a false negative (missing a claim) is higher than a false positive (storing an unimportant procedural detail). Pre-filter should pass anything that mentions: a quantitative result, a named entity (author, paper, institution), a methodological term, or a hypothesis-relevant keyword. Filter out: researcher's meta-commentary on the interface, acknowledgments, formatting instructions.
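
A sketch of that tuning, following the pipelineConfig shape used for conflict handling below; the pre_filter field names are assumptions:

const pipelineConfig = {
  pre_filter: {
    relevance_threshold: 0.35,  // default 0.40; research runs permissive
    always_pass: ["quantitative-result", "named-entity", "methodology-term", "hypothesis-keyword"],
    always_drop: ["interface-meta", "acknowledgments", "formatting-instruction"],
  },
};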

extract: Should create claims at atomic granularity. "Smith et al. (2024) found that X reduces Y by 23% in cohort Z" is one claim. The summary "the intervention worked" is not a claim — it is a lossy compression that breaks citation. Extract atomically; reconstruct summaries on demand. The graph holds the truth; rendering is done at query time by composing retrieved claims into a summary, not by storing the summary.

Each extracted claim should be assigned a claim type: quantitative (carries a numeric result), qualitative (descriptive finding), methodological (a finding about research design), or null-result (a non-finding). Claim type affects how the claim is weighted in retrieval: quantitative claims with effect sizes rank higher for "strongest evidence" queries; null-results rank higher for "what failed to replicate" queries.

conflict: Research agent conflict detection must not auto-supersede. When a new claim contradicts an existing one, the default behavior (auto-superseding the older fact) is wrong for research — the older fact might be the better one, or both might be correct under different conditions. Configure conflict handling to flag and hold for researcher review:

// In pipeline configuration
const pipelineConfig = {
  conflict: {
    strategy: "flag_for_review",  // not "auto_supersede"
    reviewer: "researcher",
    notification: true,  // surface the conflict at next session start
  },
};

dedupe: Semantic deduplication threshold should be higher than default (0.92 vs 0.85). Research claims that look similar might make subtly different quantitative assertions — dedupe should be conservative. "23% reduction in latency" and "24% reduction in latency" will score above 0.85 similarity but represent different findings and should be stored separately. Only deduplicate when the claims are genuinely identical: same assertion, same source, same quantitative result.
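
The corresponding dedupe setting, again a hedged sketch of the config shape:

const pipelineConfig = {
  dedupe: {
    semantic_threshold: 0.92,   // default 0.85; conservative so near-identical numbers stay separate
    require_same_source: true,  // assumed flag: only merge claims from the same source
  },
};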

infer: Research agents should run minimal inference. The infer stage for personal assistants extracts implicit preferences from behavior; for research agents, inference risks inventing claims. Limit the infer stage to: tagging claims with their research domain (based on terminology), identifying relation candidates for researcher confirmation, and flagging hypothesis-claim connections (this claim might support hypothesis H). Do not infer new claims; only infer metadata and relations.
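
And the infer stage locked down to metadata and relation candidates; every flag here is an assumption about what the stage exposes:

const pipelineConfig = {
  infer: {
    allow_new_claims: false,  // never invent claims
    allow_metadata: ["domain-tag", "relation-candidate", "hypothesis-link"],
    relation_candidates_require_review: true,
  },
};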

Retriever strategy for research

For a "what do we know about X?" query in a research project, the retrieval stack weights differently than for personal assistant queries:

  • Semantic retriever (weight 0.40): finds conceptually similar claims even if they use different terminology — critical for surfacing evidence that uses synonyms or domain-specific vocabulary
  • BM25/lexical (weight 0.35): exact-match is critical for author names, paper titles, technical terms, p-values, and effect sizes — "Smith 2024" and "0.001" must match exactly
  • Entity-graph (weight 0.20): traverses the claim-relation graph to surface supporting and contradicting evidence from the structural relationships, not just similarity
  • Temporal (weight 0.05): "what have we found in the last 30 days?" — less critical for research than recency-sensitive domains like news analysis

The entity-graph retriever is different in research: entities are claims, authors, and papers — not people and organizations. A query for "Smith 2024" should graph-walk from that source entity to all claims derived from it, then optionally to all claims that contradict those claims. This is claim-centered graph traversal rather than person-centered.

BM25 weight is intentionally high (0.35) for research — higher than the typical 0.20 for general-purpose agents. Research retrieval must be precise on technical terminology. A semantic-only retriever might surface "latency improvement" when the query is specifically about "response time reduction" — similar concepts, but a different measurement. BM25 anchors precision; semantic retrieval supplies recall.
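
One way to express those weights on a query, assuming search accepts a retriever-weight option (the parameter name is illustrative):

const evidence = await recall.search({
  query: "what do we know about intervention effects on latency",
  scope: { project_id },
  retriever_weights: { semantic: 0.40, lexical: 0.35, entity_graph: 0.20, temporal: 0.05 },  // assumed parameter
  limit: 20,
});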

For contradiction surfacing specifically ("what contradicts what we believe about X?"), the retrieval strategy inverts: the entity-graph retriever dominates (weight 0.60) because it can directly traverse contradicts edges from known claims about X. Semantic and lexical retrieval serve as fallbacks for claims that haven't been explicitly connected via graph edges yet.
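
The inverted profile for contradiction surfacing, under the same assumption:

const counterEvidence = await recall.search({
  query: "evidence against the latency reduction effect",
  scope: { project_id },
  retriever_weights: { entity_graph: 0.60, semantic: 0.25, lexical: 0.15, temporal: 0.0 },
  limit: 20,
});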

Measuring research memory quality

Three metrics determine whether the research memory store is functioning correctly.

Claim precision: of memories tagged as claims, what fraction have a citable source reference? Should be 100%. Any claim memory without provenance is a data quality failure — it should never have passed the extract stage. If claim precision drops below 100%, investigate whether the source struct is being correctly populated at write time. A common failure: the source document is set at the session level but not propagated to individual candidates during extraction.

Contradiction resolution rate: what fraction of flagged contradictions have been resolved by the researcher? If this is below 50% after 30 days, the contradiction-surfacing mechanism is not actionable. Possible causes: the contradictions are being flagged but not surfaced to the researcher at session start; the contradictions are too numerous to review (extract is over-generating contradicts relations); or the researcher does not have a clear workflow for resolving them. Check whether the researcher is seeing the flags, then check whether the flags are accurate.

Graph density: average number of relations per claim node. Fewer than 0.5 relations per claim means the graph is sparse — claims are not being connected to each other. Either the extraction stage is not creating relations, the deduplication threshold is too aggressive (merging claims that should have been separate nodes with edges between them), or the researcher is not resolving contradictions into resolved edges. Target greater than 1.5 relations per claim for a mature research project (typically 3+ months of active use).

A secondary metric worth tracking: source entity coverage — what fraction of cited papers have a corresponding source entity node in the graph with at least 2 claims attached? Low source entity coverage means claims are being stored without proper graph integration, and "find all claims from Smith 2024" queries will fail or return incomplete results. Source entity coverage should track toward 100% for all papers the agent has actively processed.
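
A rough sketch of how these could be computed; recall.listMemories and the field names on the returned records are hypothetical stand-ins for whatever list or export API you have:

// Claim precision and graph density over all claim memories in the project.
const claims = await recall.listMemories({ scope: { project_id }, types: ["fact"] });  // hypothetical call

const withSource = claims.filter((m) => m.source?.document?.id).length;
const claimPrecision = withSource / claims.length;  // target: 1.0

const relationCount = claims.reduce((n, m) => n + (m.relations?.length ?? 0), 0);
const graphDensity = relationCount / claims.length;  // target: > 1.5 for a mature project

console.log({ claimPrecision, graphDensity });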

Example flow

  1. Researcher asks about a topic
     Agent retrieves prior research memories, hypotheses, and source provenance for this topic.
  2. Agent presents what's known
     'You have 3 supporting sources for X (Smith 2024, Lee 2025, ...) and 1 contradicting (Kim 2025). Two hypotheses are still open: A and B.'
  3. Researcher asks for a new investigation
     Agent runs the search/tool. New findings extracted as candidate memories with source IDs.
  4. Conflict detection on new claim
     Pipeline checks if the new claim contradicts existing memories. If yes, doesn't auto-supersede — flags for researcher review.
  5. Researcher resolves the conflict
     Either accepts the new claim (older becomes superseded with explanation) or marks both as 'open question' with relation links.
  6. Memory graph updates
     New supports/contradicts edges added. The graph becomes more useful with each session.

Patterns that work

  • Source-aware extraction
    Every extracted claim must carry source provenance. No source = no memory. This is non-negotiable for research workloads.
  • Typed claim relations
    Build a knowledge graph where edges are typed (supports, contradicts, extends). Queries can ask graph questions, not just similarity questions.
  • Hypothesis as first-class type
    Add a 'hypothesis' memory type alongside the canonical five. State machine: open → supported → contradicted → resolved.
  • Manual conflict resolution
    Don't auto-supersede in research. Flag conflicts and let the human decide. Auto-resolution is the quickest path to a wrong knowledge graph.

Pitfalls to avoid

  • Claims without provenance
    A research memory without a citable source is worse than no memory — it sounds authoritative without being verifiable. Reject candidates that lack provenance.
  • Treating recency as truth
    The newer source isn't always right. Don't auto-supersede based on date alone; let edge type (contradicts vs extends) and source quality drive resolution.
  • Storing summaries, not claims
    Summaries are lossy. Extract atomic claims with their sources; reconstruct summaries on demand. The graph holds the truth, not the rendering.
  • Mixing reasoning with memory
    The agent's reasoning chain is workflow state, not memory. Don't pollute the research memory with intermediate hypotheses the agent considered and rejected.

Code sketch

// Every claim memory requires source
await recall.write({
  scope: { project_id },
  source: { document: { id: docId }, page: 42, span: [120, 380] },
  candidates: [
    {
      type: "claim",
      content: "X causes Y in N=200 cohort",
      relations: [{ type: "contradicts", target: "claim_xyz789" }],
    },
  ],
  // Pipeline rejects claims without source
});

// Surface contradictions
const conflicts = await recall.queryGraph({
  scope: { project_id },
  edge_type: "contradicts",
  filter: { status: "unresolved" },
});

Go deeper

Build this with Recall

Recall is open source and ships with the architecture above out of the box.
