Memory for Coding Assistants

By Arc Labs Research · 9 min read

The problem

Coding assistants are repeated-interaction tools by definition — the same developer, the same project, often the same files. Every conversation that starts cold is a conversation where the agent has to relearn idioms, preferences, and prior decisions that the human already explained.

The most visible failure mode: the assistant suggests a refactor the user previously declined, with the same reasoning. Or the code reviewer flags a pattern the team agreed to keep. Both signal "this tool is not paying attention," and trust collapses quickly.

What agent memory gives you

For coding agents, four memory categories carry most of the value:

  • Project facts — language, stack, conventions, build commands, deployment targets. Mostly stable; supersede on stack migration.
  • Code-style preferences — formatting, naming, error-handling style, comment density. Mutable; explicit overrides should win over inferences.
  • Decision history — patterns the team has accepted or declined, with the reasoning. Stored as events with the originating PR / commit as source.
  • Procedural memory — recurring workflows ("how we deploy", "how we open a PR"). Closer to runbook content than facts.
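As a rough sketch, these categories map onto typed records along these lines (the field names are illustrative, not Recall's exact schema):

// Illustrative shapes only; field names are assumptions, not Recall's exact schema.
type ProjectFact = {
  type: "fact";
  text: string;          // "Build with pnpm; deploy target is Cloud Run"
  tags?: string[];       // e.g. ["framework:nextjs"]
};

type StylePreference = {
  type: "preference";
  text: string;          // "Prefer guard clauses over nested conditionals"
  explicit: boolean;     // explicit statements should outrank inferences
};

type Decision = {
  type: "event";
  text: string;          // "Declined extracting retry logic into a shared util"
  tags: string[];        // e.g. ["decision", "declined"]
  source?: string;       // originating PR URL or commit SHA
};

type Procedure = {
  type: "fact";
  tags: string[];        // tagged "procedural"; closer to runbook content than facts
  steps: string[];       // "how we deploy" as ordered steps with check conditions
};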

The result is an assistant that gets quieter and more useful over time, instead of louder and more annoying.

How the write pipeline behaves for coding agents

Recall's write pipeline runs in seven stages: pre_filter → extract → resolve_refs → [enrichers] → dedupe → conflict → persist. Each stage has configuration surface that coding agents should tune deliberately.

pre_filter is the first gate. For coding agents, configure it to reject: inline diffs and patches (the repo is the source of truth for code, not memory), test output and CI status, compilation errors (transient noise, not signal), and raw log snippets. The rejection threshold is controlled by relevance_threshold, which defaults to 0.40. For coding workloads, bump it to 0.55 — you want to be more aggressive about rejecting noise because a large fraction of coding turns are mechanically generated output rather than decisions or preferences.
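A sketch of what that tuning might look like at client construction, assuming the pipeline accepts per-stage options (the option names here are assumptions, not Recall's documented config):

// Assumed configuration surface; option names are illustrative, not Recall's documented API.
import { Recall } from "@arc-labs/recall";

const recall = new Recall({
  apiKey: process.env.RECALL_API_KEY,
  pipeline: {
    pre_filter: {
      relevance_threshold: 0.55,   // raised from the 0.40 default for coding workloads
      reject: [
        "inline_diff",             // the repo, not memory, is the source of truth for code
        "test_output",
        "ci_status",
        "compiler_error",
        "raw_log",
      ],
    },
  },
});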

extract converts the surviving turns into typed candidates. The mapping for coding agents:

  • Code-style feedback → preference (source: the conversation turn, confidence: 0.70 if explicit, 0.45 if inferred)
  • Architectural decisions → fact (confidence prior: 0.15, because architecture is hard to extract with certainty from a single conversation)
  • Deployment workflow steps → fact tagged as procedural
  • Refactor declines → event with tags: ["decision", "declined"]

The low confidence prior (0.15) on architectural decisions is intentional. The agent should not lock in a high-confidence architectural fact from one ambiguous exchange. These memories gain confidence over time as they're confirmed by consistent behavior across many sessions.
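For concreteness, candidates extracted from a turn where the user declines a refactor and states a style preference might look roughly like this (shapes and values are illustrative, not Recall's exact extraction output):

// Illustrative candidate shapes; session and turn identifiers are hypothetical.
const candidates = [
  {
    type: "preference",
    text: "Prefers guard clauses over nested if/else in request handlers",
    confidence: 0.70,                              // explicit statement in the turn, not inferred
    source: { session_id: "s_123", turn_index: 42 },
  },
  {
    type: "event",
    text: "Declined refactor: extract retry logic into a shared util",
    tags: ["decision", "declined"],
    source: { session_id: "s_123", turn_index: 42 },
  },
];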

resolve_refs resolves natural-language references — "that function", "the auth module", "the PR we reviewed yesterday" — into stable entity IDs. For coding agents this means your entity graph needs repos, files, modules, functions, PRs, and developers as first-class entity types. Without this stage, every decision about "the auth module" is recorded as a free-text annotation disconnected from the canonical entity, and graph-based retrieval cannot find it.

conflict checks whether a new preference or decision contradicts an existing memory. When the user accepts a pattern they previously declined, the old event gets a superseded_by pointer to the new one. The decision history is traversable, not destroyed — this matters when someone asks "why did we change our mind about X?" six months later.

The full pipeline is idempotent: running the same turn through it twice produces one memory, not two. The dedupe stage uses the session ID and turn index as part of the dedup key.
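Conceptually, the key looks something like the sketch below. Session ID and turn index are the documented inputs; including the candidate text is an assumption, and the real construction is internal to Recall.

// Conceptual sketch only; not Recall's actual key construction.
import { createHash } from "node:crypto";

function dedupKey(sessionId: string, turnIndex: number, candidateText: string): string {
  return createHash("sha256")
    .update(`${sessionId}:${turnIndex}:${candidateText}`)
    .digest("hex");
}
// Replaying the same turn produces the same key, so the second write is a no-op.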

Retriever strategy for code context

Recall ships five retriever types. Their relative weights require tuning per domain; the defaults below work for most coding workloads.

Semantic retriever (cosine similarity over pgvector HNSW): Best for conceptual queries like "find all decisions related to error handling" or "what do we know about our approach to retry logic". Weight: 0.35. This retriever is the weakest at exact symbol lookup.

BM25/lexical retriever: Strongest for symbol names, file paths, function names, and module identifiers. handlePayment, UserRepository, apps/auth/src/middleware — these are exact-match queries in disguise. The lexical retriever uses a GIN index on tsvector. Weight: 0.35. Run both the semantic and lexical retrievers together; they complement each other.

Entity-graph retriever: Critical for "what decisions affect this file?" or "what do we know about everything this module touches?". The retriever walks the entity graph from the target entity outward. auth/middleware.ts → related modules → decisions about those modules. Hop limit: 2 (direct neighbors and their neighbors). Beyond two hops, result quality degrades for most coding queries. Weight: 0.20.

Temporal retriever: For "what changed in the last sprint?" or "decisions from before the migration". Uses a BTREE index on valid_from. Set the time window at query time; the temporal retriever is most useful as a secondary constraint rather than a primary retriever.

Type-filter: Not a scorer — a pre-narrower. Applied before scoring to reduce the candidate pool. For "show me my declined suggestions" → type=event, tags=["decision","declined"]. Always apply a type-filter when you know what you're looking for; it cuts scoring load and eliminates category-level false positives.

RRF score-weighting formula: weight × sqrt(score) / (k + rank) with k=60. Per-retriever weights are tunable; the above defaults (0.35 / 0.35 / 0.20 / ...) work for most coding workloads. Shift weight toward BM25 if your codebase has many unusual symbol names; shift toward entity-graph if you have a large, well-connected module graph.
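A minimal sketch of that fusion step, assuming each retriever returns a ranked list of (memory ID, score) pairs:

// Weighted RRF fusion: weight × sqrt(score) / (k + rank), with k = 60.
type Ranked = { memoryId: string; score: number };

function fuse(
  resultsByRetriever: Record<string, Ranked[]>,
  weights: Record<string, number>,
  k = 60,
): Array<[string, number]> {
  const fused = new Map<string, number>();
  for (const [retriever, results] of Object.entries(resultsByRetriever)) {
    const weight = weights[retriever] ?? 0;
    results.forEach(({ memoryId, score }, i) => {
      const rank = i + 1;                                      // 1-based rank within this retriever
      const contribution = (weight * Math.sqrt(score)) / (k + rank);
      fused.set(memoryId, (fused.get(memoryId) ?? 0) + contribution);
    });
  }
  return [...fused.entries()].sort((a, b) => b[1] - a[1]);     // best fused score first
}

// e.g. fuse(results, { semantic: 0.35, lexical: 0.35, graph: 0.20 })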

Per-repo namespace design

Memory is scoped to { user_id, repo_id }. Code-style preferences in a Go monorepo don't apply to a Python side project. The scope key keeps them isolated — a user who works across ten repositories gets ten independent memory spaces, each accumulating knowledge about that specific project.

For teams (same repo, multiple developers), introduce an org_id scope for team-level preferences that apply to everyone — agreed naming conventions, the error-handling strategy the team voted on, the testing philosophy. Keep user-scoped memory for individual style notes that don't apply team-wide.

// Individual coding style preferences
scope: { user_id: "u_abc", repo_id: "r_xyz" }

// Team-wide architectural decisions
scope: { org_id: "org_abc", repo_id: "r_xyz" }

At query time, read from both scopes using a scope union, or query them separately and merge at the agent prompt layer. When a user-level preference conflicts with a team-level preference, the team-level fact should win — the agent's job is to apply team conventions, not override them with individual style drift.
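A sketch of the query-separately-and-merge option, with team-level memories winning on conflict (conflictsWith is a hypothetical helper you would supply):

// Read both scopes, then merge; team conventions take precedence over personal style.
async function loadStyleContext(userId: string, orgId: string, repoId: string) {
  const [team, personal] = await Promise.all([
    recall.search({
      query: "code style formatting naming conventions",
      scope: { org_id: orgId, repo_id: repoId },
      types: ["preference", "fact"],
      limit: 15,
    }),
    recall.search({
      query: "code style formatting naming conventions",
      scope: { user_id: userId, repo_id: repoId },
      types: ["preference"],
      limit: 15,
    }),
  ]);
  // Drop personal preferences that contradict a team-level memory.
  const personalKept = personal.filter(p => !team.some(t => conflictsWith(t, p)));
  return [...team, ...personalKept];
}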

One common mistake: using only user_id as the scope without repo_id. This produces a single memory space for all of a user's projects. The agent starts injecting Python style preferences into TypeScript sessions, or worse, starts referencing architecture decisions from a different codebase as context.

Handling framework migrations

Stack migrations — React to Next.js, REST to GraphQL, Jest to Vitest, Webpack to Vite — create the biggest challenge for coding agent memory. A large fraction of existing preferences and decisions become invalid simultaneously. The agent that continues applying pre-migration preferences post-migration is actively harmful.

The right signal for triggering bulk supersession: when the user explicitly says "we migrated from X to Y", or when the commit message contains a well-known migration signal ("chore: migrate from webpack to vite"), fire the bulk supersession operation.

// On stack migration signal
await recall.bulkSupersede({
  scope: { user_id, repo_id },
  filter: {
    types: ["preference", "fact"],
    tags: ["framework:react"]  // tag memories at write time for this
  },
  reason: "Migrated to Next.js App Router — react-specific preferences no longer apply",
  superseded_by: null  // no single replacement; bulk invalidation
});

The key dependency: memories must be tagged with the framework or tool they apply to at write time. Tag the framework at extraction: when a preference is extracted that references a specific build tool or framework, include framework:<name> in the tags. Cost of tagging at extraction: negligible. Cost of not tagging when a migration happens: replaying eighteen months of irrelevant decisions, with no clean way to invalidate only the affected subset.
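One hedged way to make sure the tags land is to pass stack hints at write time when the session already knows what the repo uses; whether the write call accepts a hints field like this is an assumption, not documented Recall behavior.

// Assumed: write-time tag hints that the extract stage attaches to whatever it extracts.
await recall.write({
  scope: { user_id, repo_id },
  source: { turn },
  hints: {
    tags: ["framework:react", "build:webpack"],   // the stack this repo is known to use
  },
});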

For partial migrations (e.g., migrating authentication from one library to another while keeping the rest of the stack), scope the bulk supersession narrowly using multiple tag filters rather than invalidating all preferences. Precision here matters — don't accidentally supersede architectural decisions that remain valid post-migration.
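For example, an auth-library swap might be scoped like this (the library name is illustrative, and this assumes multiple tags combine conjunctively in the filter):

// Narrow invalidation: only memories carrying both tags are superseded.
await recall.bulkSupersede({
  scope: { user_id, repo_id },
  filter: {
    types: ["preference", "fact"],
    tags: ["area:auth", "library:passport"],   // assumed AND semantics to keep the blast radius small
  },
  reason: "Auth library migrated; old-library-specific preferences no longer apply",
  superseded_by: null,
});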

Measuring memory quality for coding agents

Memory quality degrades silently. Unlike a bug in the code path, a degraded memory store doesn't throw errors — it just produces subtly worse suggestions, and the developer stops noticing any benefit. Track these three metrics weekly.

Supersession rate: what fraction of newly written memories supersede an existing memory? A rate below 5% is healthy for a stable codebase. A rate above 20% signals either a rapidly evolving project or a misconfigured conflict detection threshold. Check which memory types are superseding most — if fact memories are superseding at 25% but preference memories are at 3%, the project's architecture is in flux, which is expected information. If preference memories are superseding at 25%, you have either a rapidly changing team style or conflict detection that's too aggressive.
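A minimal sketch of the per-type breakdown, run over whatever export of recently written memories you have (the record shape is an assumption):

// Assumed record shape: each recent memory knows its type and whether it superseded something.
type RecentMemory = { type: string; supersedes?: string };

function supersessionRateByType(recent: RecentMemory[]): Record<string, number> {
  const counts: Record<string, { total: number; superseding: number }> = {};
  for (const m of recent) {
    const bucket = (counts[m.type] ??= { total: 0, superseding: 0 });
    bucket.total += 1;
    if (m.supersedes) bucket.superseding += 1;
  }
  return Object.fromEntries(
    Object.entries(counts).map(([t, c]) => [t, c.total ? c.superseding / c.total : 0]),
  );
}
// e.g. { fact: 0.25, preference: 0.03 } reads as "architecture in flux, style stable".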

Decision retrieval precision: when the agent says "you previously decided X", trace back to the source memory ID. Build a small eval set: fifty decisions made in the last quarter, sampled across different file paths and decision types. Query the memory store for each decision's context and verify the agent would cite the right memory. Target: greater than 80% correct citation. Below 70% means your entity graph is under-connected or your retriever weights are miscalibrated.
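A sketch of that eval loop, assuming each eval case records the memory ID it should cite and that search results expose an id field:

// Eval: does the store surface the right decision memory for each known case?
type EvalCase = { query: string; expectedMemoryId: string };

async function decisionCitationPrecision(cases: EvalCase[], scope: object): Promise<number> {
  let correct = 0;
  for (const c of cases) {
    const results = await recall.search({
      query: c.query,
      scope,
      types: ["event"],
      filters: { tags: ["decision"] },
      limit: 5,
    });
    if (results.some((r: { id: string }) => r.id === c.expectedMemoryId)) correct += 1;
  }
  return correct / cases.length;   // target: above 0.80; below 0.70 means retrieval needs work
}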

Junk rejection rate: add instrumentation to the pre_filter stage. Log rejected candidates with their computed relevance score and the turn they came from. Review a random sample weekly. If more than 5% of the rejected batch contains real decisions or preferences — content a human would say is worth remembering — lower your relevance_threshold incrementally. The goal is to catch all decisions and reject all noise; the threshold is the single lever.

The provenance endpoint returns extracted_by.model_version_hash for each memory. When your extraction model is updated, use this field to detect whether the memory distribution shifted — a jump in supersession rate immediately following a model update is a signal that the new model extracts differently, not that the codebase changed.

Integration pattern

import { Recall } from "@arc-labs/recall";

const recall = new Recall({ apiKey: process.env.RECALL_API_KEY });

// At session start: load context for the file being opened
async function loadCodeContext(userId: string, repoId: string, filePath: string) {
  const [decisions, preferences, relatedFiles] = await Promise.all([
    recall.search({
      query: `decisions about ${filePath}`,
      scope: { user_id: userId, repo_id: repoId },
      types: ["event"],
      filters: { tags: ["decision"] },
      limit: 10,
    }),
    recall.search({
      query: "code style formatting naming conventions",
      scope: { user_id: userId, repo_id: repoId },
      types: ["preference"],
      limit: 15,
    }),
    recall.search({
      query: filePath,
      scope: { user_id: userId, repo_id: repoId },
      hints: { entities: [filePath], hop: 1 },
      types: ["fact", "event"],
      limit: 10,
    }),
  ]);
  return { decisions, preferences, relatedFiles };
}

// After session: write what was learned
async function persistSession(userId: string, repoId: string, turn: Turn) {
  await recall.write({
    scope: { user_id: userId, repo_id: repoId },
    source: { turn },
    // Pipeline handles extraction, dedup, conflict resolution
  });
}

Three parallel reads at session open — decisions about the specific file, style preferences, and entity-graph neighbors of the file — cover the main retrieval needs without sequential latency. The write after session end is a single call; the seven-stage pipeline handles everything downstream.

The hints.entities field in the third query passes the file path as an entity hint to the entity-graph retriever. Without this, the retriever has to discover the entity from the query text, which is slower and less precise. When you know what entity you're querying about, pass it directly.

For editors that open multiple files simultaneously (split panes, multi-tab workflows), fan the read out per file and deduplicate at the merge layer before prompt injection. The token budget for memory context in a coding session is typically 800–1,200 tokens; prioritize decisions over preferences over facts when trimming.
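A sketch of that fan-out and trim, using the priority order above (decisions, then preferences, then facts); the memory shape and the estimateTokens helper are assumptions:

// Fan out per open file, dedupe by memory ID, then trim to the token budget by priority.
type Mem = { id: string; type: string; text: string };   // assumed memory shape

async function loadMultiFileContext(userId: string, repoId: string, openFiles: string[], budget = 1200) {
  const perFile = await Promise.all(openFiles.map(f => loadCodeContext(userId, repoId, f)));

  const all = perFile.flatMap(ctx => [...ctx.decisions, ...ctx.preferences, ...ctx.relatedFiles]) as Mem[];
  const seen = new Set<string>();
  const merged = all.filter(m => !seen.has(m.id) && Boolean(seen.add(m.id)));

  // Decisions (events) first, then preferences, then facts, until the budget runs out.
  const priority = (m: Mem) => (m.type === "event" ? 0 : m.type === "preference" ? 1 : 2);
  merged.sort((a, b) => priority(a) - priority(b));

  const kept: Mem[] = [];
  let used = 0;
  for (const m of merged) {
    const cost = estimateTokens(m.text);                 // hypothetical token estimator
    if (used + cost > budget) break;
    kept.push(m);
    used += cost;
  }
  return kept;
}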

Example flow

  1. Developer opens a file
     Agent retrieves project facts, file-specific decisions, and recent changes scoped to this user + this repo.
  2. Developer asks for a refactor
     Query optimizer detects the file path; entity-graph retriever pulls related files and their decision history.
  3. Memory check before suggestion
     Agent checks for similar past suggestions. If the user previously declined this pattern, the agent surfaces that fact in its reasoning.
  4. Suggestion includes citations
     "I'm suggesting X because Y, but I see you previously chose Z for the same reason — should I default to your prior decision?"
  5. User chooses; outcome stored
     Decision becomes a new event memory with citation to the PR/commit. Supersedes any contradicting older decisions.
  6. Background drift detection
     If the user's preferences have shifted (new framework, new style guide), drift detection surfaces it for review.

Patterns that work

  • Per-repo namespace
    Memory scoped to (user × repo). Code-style preferences vary by project; never assume a user's monorepo preferences apply to their hobby project.
  • Decisions linked to code
    Every decision memory carries a source (PR URL, commit SHA, or file path). Makes "why did we decide that?" queries answerable.
  • Procedural memory as runbooks
    Store workflows as structured procedures, not free text. "How we deploy" is a sequence of steps with check conditions, not a paragraph.
  • Supersession on framework migration
    Stack migrations invalidate large slices of memory. Bulk-supersede when the user signals "we moved from X to Y"; don't wait for incidental conflicts.

Pitfalls to avoid

  • Storing source code as memory
    The repo is the source of truth for code. Memory should hold decisions about code, not the code itself. Re-extracting the file is cheaper and more accurate.
  • Preferences without confidence
    An inferred preference from one off-hand comment shouldn't override an explicit setting. Source-strength matters; an explicit user statement scores higher than ambient chat.
  • Cross-user style bleed
    In team contexts, distinguish team-level preferences (codify in the team namespace) from individual ones (user namespace). Mixing them causes 'whose style is this?' confusion.
  • Forgetting to decay procedural memory
    How the team deploys changes. A 2-year-old runbook stored as a fact is actively misleading. Procedural memory needs decay even if you store it as facts.

Code sketch

// Before suggesting a change, check decision history
const priorDecisions = await recall.search({
  query: `refactor ${functionName}`,
  scope: { user_id, repo_id },
  types: ["event"], // decisions are events
  filters: { tags: ["decision"] },
  limit: 10,
});

if (priorDecisions.some(d => contradicts(d, proposedChange))) {
  // Surface the prior decision in your reasoning, don't silently override
}

Go deeper

Build this with Recall

Recall is open source and ships with the architecture above out of the box.
