Context Aggregation: Token Budgeting

By Arc Labs Research · 10 min read

Retrieval ends when the top-K memories are picked. Context assembly is what happens next: you have a list of fused, reranked memories and a token budget, and you have to produce a prompt that fits and works. The naive approach — concatenate memories until you hit the budget — wastes the budget and triggers Lost-in-the-Middle. Better: budget by category, structure the output, and place high-priority items at the start and end of the context.

Interactive figure: token budget allocator. Six categories share a budget; tune the split to match your query mix.

The six categories

  • Facts. Stable predicates about the user. Highest reuse — touch nearly every query. Default 25–30% of budget.
  • Preferences. Mutable choices. Always include if the query has any configuration component. Default 8–12%.
  • Recent events. Last 30 days, time-anchored. Highly relevant for "did I" / "when did" queries. Default 15–20%.
  • Entities. Compact summaries of mentioned entities. Default 8%.
  • Session summary. Compressed running summary of the current session. Default 12%.
  • Recent turns. Verbatim recent dialogue. Default 25–30%.

Lost-in-the-middle

Stanford's well-known result: LLM answer quality on retrieval-augmented prompts follows a U-curve over position. Information in the middle of long contexts gets ignored relative to the start and end.

Performance is highest when relevant information appears at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts.

— Liu et al., 'Lost in the Middle' (2024)

Mitigations:

  • Place the highest-priority memories at the start and end; lower-priority in the middle (see the sketch after this list).
  • Use structured headings (markdown or XML) to give the model anchor points.
  • Keep the total context shorter — the curve flattens dramatically below 4K tokens.
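
A minimal sketch of the first mitigation: interleave a priority-sorted list so the best memories land at the start and end of the block. The function name and shape are illustrative, not part of any pipeline described here.

def order_for_u_curve(memories):
    """Interleave a priority-sorted list (highest priority first) so that
    high-priority items land at the start and end of the block and
    low-priority items land in the middle, the degraded-attention zone."""
    head, tail = [], []
    for i, mem in enumerate(memories):
        if i % 2 == 0:
            head.append(mem)          # ranks 1, 3, 5, ... fill the start in order
        else:
            tail.append(mem)          # ranks 2, 4, 6, ... fill the end in reverse
    return head + list(reversed(tail))

# order_for_u_curve(["m1", "m2", "m3", "m4", "m5"])
# → ["m1", "m3", "m5", "m4", "m2"]: m1 first, m2 last, m5 buried in the middle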

Dynamic budgeting

A static 25/12/20/8/12/23 split is fine as a default but suboptimal per-query. The query optimizer's signature should also drive the budget split:

  • Temporal queries get 35% events, 15% facts, 5% preferences.
  • Configuration queries get 30% preferences, 20% facts, 5% events.
  • Relational queries get 20% entities, 25% facts, 10% events.

Borrowing across categories is allowed — if a category produces fewer results than its budget, the slack should rebalance to other categories rather than going unused.

Cap individual items

Long memories crowd out budget. Cap each rendered memory at ~80 tokens; truncate with an ellipsis. The full memory is in the store; the rendered prompt is a pointer plus enough text to disambiguate. This trade-off is almost always net-positive for answer quality.
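
A minimal sketch of the cap, using a whitespace split as a stand-in for a real tokenizer (in practice you would count tokens with the model's own tokenizer):

def cap_memory(text, max_tokens=80):
    """Truncate a rendered memory to roughly max_tokens, appending an
    ellipsis so the model can tell the text was cut."""
    tokens = text.split()             # crude stand-in for real token counting
    if len(tokens) <= max_tokens:
        return text
    return " ".join(tokens[:max_tokens]) + " …"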

Make memories citable

Render every included memory with a stable ID. When the model produces an answer, ask it to cite the memory IDs it used. This gives you read-time grounding (see three layers of hallucination defense), and the citations are first-class telemetry: which memories actually drive answers?

The token budget arithmetic

The apparent opportunity with 128K-token models — claude-sonnet-4-6, for instance — is to include everything in context. In practice, the window is large but the attention budget is not. Consider where the tokens go in a realistic deployment:

  • System prompt (instructions, persona, tool definitions, safety rails): typically 800–2,000 tokens. A minimal instruction-only prompt is 800; one with tool definitions and behavioral guidelines reaches 2,000 easily.
  • Current conversation (recent turns, user message): 200–500 tokens per turn at 10 turns = 2,000–5,000 tokens. A multi-turn conversation where each exchange includes code or structured output can hit 10,000 tokens for recent turns alone.
  • Model response budget (tokens reserved for the model's reply): 2,000–4,000 tokens. Constrained responses reserve less; open-ended assistant responses reserve more.
  • Remaining for memory context: 128,000 − 2,000 (system) − 5,000 (turns) − 4,000 (response) = 117,000 tokens — theoretical maximum (see the sketch below).
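
The same arithmetic as a minimal sketch; the defaults are the illustrative figures above, not constants of any particular model:

def memory_budget(window=128_000, system=2_000, turns=5_000, response=4_000):
    """Tokens left for the memory block after fixed costs are reserved."""
    return window - system - turns - response

# memory_budget() → 117000, the theoretical maximum before attention effects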

The theoretical maximum is misleading. The Lost-in-the-Middle degradation sets in well before the window fills. At 8,000 tokens of memory context, answer quality on retrieval-dependent questions degrades measurably versus 4,000 tokens of well-organized memory context. At 16,000 tokens, only the first and last 2,000 tokens of the memory block are reliably attended to — the middle 12,000 tokens contribute noise proportional to their length. At 32,000 tokens, middle-position accuracy on multi-document QA tasks can drop below 40% in controlled studies.

The practical recommendation is to keep the memory block under 4,000 tokens. This is not a window size constraint — it is an attention quality constraint. The rest of the 128K window can hold conversation history, which the model attends to more reliably because it is sequentially structured and the model was trained on sequential dialogue. Memory context, being non-sequential and injected in bulk, degrades faster with size.

At 4,000 tokens total for the memory block, the six-category default allocation works out to:

Category | Percentage | Token budget | Approximate items (at 80 tokens each)
Facts | 25% | 1,000 | ~12 facts
Recent events | 20% | 800 | ~10 events
Preferences | 12% | 480 | ~6 preferences
Session summary | 12% | 480 | 1 summary block
Entities | 8% | 320 | ~4 entity summaries
Recent turns | 23% | 920 | ~11 turn excerpts
Total | 100% | 4,000 |

The 80-token-per-item cap is essential to making this arithmetic work. A single uncapped memory that describes a complex event in 400 tokens consumes half the events budget by itself and displaces nine other events. Capping enforces the constraint that no single memory dominates the budget regardless of how much information the extractor originally stored.

The U-curve in detail

The Liu et al. (2024) finding is often summarized as "middle context gets ignored," but the specific properties of the U-curve matter for implementation decisions.

Property 1 — Position 1 and position N both perform well. Both the first item in a retrieval block and the last item score above 80% accuracy on multi-document QA tasks, even when the total context is long. This is consistent with how transformers attend — strong recency effects at the end, and strong primacy effects at the beginning because the beginning anchors the attention pattern.

Property 2 — Middle positions degrade to as low as 55% accuracy. On tasks where the correct answer is in the middle third of a 32K-token context, accuracy can be 25–30 percentage points lower than when the same answer is in the first or last position. This is not subtle — it is a large, reliable effect across model families.

Property 3 — The "good zone" covers approximately the first and last 20% of total context. At 4,000 tokens, the good zone is the first 800 tokens and the last 800 tokens. The middle 2,400 tokens are degraded attention territory. At 20,000 tokens, the good zone is the first 4,000 and last 4,000 — the middle 12,000 is largely wasted.

Property 4 — The curve flattens dramatically below 4,000 tokens. Below 4K total context, the difference between first-position and middle-position accuracy is small — both hover near 90%. This is why the 4,000-token recommendation is not arbitrary. Below 4K, the U-curve is nearly flat; above 4K, it steepens nonlinearly.

Property 5 — The curve steepens above 16K. At 32K total context, middle-position accuracy in controlled experiments drops below 40%. The degradation is not linear — the penalty for middle-position content grows faster than linearly with each doubling of context length.

These properties have a direct implication for memory injection strategy: if a deployment uses 20K tokens of conversation history plus 8K of memory context, the total context that matters for position effects is not just the 8K memory block in isolation. Memories injected immediately after the system prompt (before the conversation history) are in the first 10% of a 30K context — good position. Memories injected immediately before the current user turn (after the conversation history) are in the last 10% — also good position. Memories injected in the middle of a long conversation history are in the worst position regardless of their relevance score.

The mitigation that has the best cost-benefit ratio: repeat the single highest-priority memory twice — once at the very beginning of the memory block, and once at the very end. The duplication cost is approximately 80 tokens (one capped memory). The attention dividend is that the most important memory receives both primacy and recency weighting simultaneously. For memories below the highest priority, duplicate only if the budget allows — typically the top 3 memories can be duplicated within a 4,000-token budget without exceeding it.
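
A minimal sketch of the duplication mitigation, assuming memories arrive priority-sorted and capped at ~80 tokens each; the function and its parameters are illustrative:

def assemble_with_duplication(memories, budget=4000, item_tokens=80, dup_top_n=1):
    """Place the top dup_top_n memories at both the start and the end of the
    block; the duplication cost is paid out of the same token budget."""
    max_items = budget // item_tokens                # e.g. 50 capped items at 4,000 tokens
    top = memories[:dup_top_n]
    middle_room = max(0, max_items - 2 * len(top))   # slots left after both copies
    middle = memories[dup_top_n:dup_top_n + middle_room]
    return top + middle + list(reversed(top))        # primacy copy, middle, recency copy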

How slack rebalancing works

When a category's retriever returns fewer memories than its token budget allows, the unspent tokens must be redistributed rather than discarded. The rebalancing algorithm runs after all retrievers have completed but before the rendering step — at this point the system knows exactly how many memories each retriever returned and can compute actual token usage before assembling the final prompt.

The algorithm in pseudocode:

total_budget = 4000

# Nominal allocation (the six-category default at a 4,000-token budget)
allocated = {
  "facts": 1000, "preferences": 480, "events": 800,
  "entities": 320, "summary": 480, "recent": 920,
}

# After retrieval — compute actual usage
actual_used = {
  "facts": 600,       # only ~7 facts returned × ~80 tokens ≈ 600 used
  "preferences": 480,
  "events": 800,
  "entities": 80,     # only 1 entity returned × 80 tokens = 80 used
  "summary": 480,
  "recent": 920,
}

# Compute slack per category
slack = {cat: max(0, allocated[cat] - actual_used[cat]) for cat in allocated}
remaining_slack = sum(slack.values())
# slack = { facts: 400, entities: 240 }  → 640 tokens unspent

# Redistribute to priority-ordered absorbers.
# Priority order for absorbing slack: events > facts > recent > summary > preferences
# (entities is left out: it is almost always a slack source, not an absorber)
absorbers = ["events", "facts", "recent", "summary", "preferences"]

for cat in absorbers:
  if actual_used[cat] < allocated[cat]:
    continue  # category did not fill its own budget, so it cannot absorb more
  absorbable = allocated[cat] * 0.5        # max 50% increase per category
  take = min(absorbable, remaining_slack)  # highest-priority absorber takes first
  allocated[cat] += take
  remaining_slack -= take
  if remaining_slack <= 0:
    break
The 50% cap on per-category absorption prevents one category from consuming all slack. If events is the highest-priority absorber and the total slack is 640 tokens, events can absorb at most 400 additional tokens (50% of its 800-token allocation), leaving 240 for the next absorber. This cap is important because the retrievers are tuned for their nominal budget — asking the events retriever to fill double its budget forces it to include lower-relevance memories that may degrade answer quality.

In practice, entity slack is the most common source of rebalanceable tokens. Many queries mention few entities, and the entity retriever returns 1–2 items when its budget is sized for 4. The entity slack typically flows to facts or events, which have more retrievable candidates at any given query.

Memory rendering formats compared

The format in which memories are serialized into the prompt affects both answer quality and citation mechanics. Three formats are in common use.

Free-text bullets — do not use in production:

- User works at Volkswagen
- User prefers dark mode
- User joined the platform team on May 9th

This format has no stable IDs, no type signal, no confidence values, and no structure the model can reference in citations. "The memory about the user's employer" is not a citable reference. Debugging is impossible: when the model gives a wrong answer, there is no way to trace which memory it relied on or whether it hallucinated. Free-text bullets work in prototypes where the memory store has fewer than 20 items and citations are not needed. They are inappropriate for production.

JSON — recommended default:

{
  "memories": {
    "facts": [
      {
        "id": "mem_abc",
        "subject": "user",
        "predicate": "works-at",
        "object": "Volkswagen",
        "confidence": 0.82
      }
    ],
    "preferences": [
      {
        "id": "mem_def",
        "key": "editor.theme",
        "value": "dark",
        "confidence": 0.93
      }
    ],
    "events": [
      {
        "id": "mem_ghi",
        "event": "joined platform team",
        "at": "2026-05-09",
        "precision": "day",
        "confidence": 0.78
      }
    ]
  }
}

The model can reference memories.facts[0].id in citations. The structured fields give the model unambiguous semantic anchors — it knows that predicate: "works-at" is a relationship field, not a description. The confidence field is visible to the model, allowing it to hedge appropriately on low-confidence memories without explicit instruction. The category grouping (facts, preferences, events) mirrors the type system and reinforces the semantic distinction between categories.
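
A minimal sketch of rendering typed memories into this shape. The field names follow the example above; the function itself, and the assumption that each item already carries a stable id and confidence, are illustrative:

import json

def render_memory_block(facts, preferences, events):
    """Serialize capped, typed memories into the JSON structure shown above."""
    block = {"memories": {"facts": facts, "preferences": preferences, "events": events}}
    return json.dumps(block, indent=2, ensure_ascii=False)

# render_memory_block(
#     facts=[{"id": "mem_abc", "subject": "user", "predicate": "works-at",
#             "object": "Volkswagen", "confidence": 0.82}],
#     preferences=[], events=[])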

XML — alternative for XML-native prompts:

<memories>
  <facts>
    <fact id="mem_abc" confidence="0.82">
      User works at Volkswagen
    </fact>
  </facts>
  <preferences>
    <preference id="mem_def" key="editor.theme" confidence="0.93">
      dark
    </preference>
  </preferences>
</memories>

XML is slightly more verbose than JSON for the same information, but some model families respond better to XML structure — particularly when the surrounding prompt already uses XML for other structured content (tool definitions, safety rails). The citation mechanics work identically: the model references mem_abc as the memory ID. Choose XML when the system prompt already uses XML-style structure; choose JSON otherwise. The choice between them matters less than the presence of stable IDs and confidence values.

One additional consideration: JSON is easier to parse programmatically when extracting citation IDs from model responses. A regex like \[mem_[a-z0-9]+\] reliably extracts JSON-formatted citations. XML citations embedded in prose are less structurally predictable.

Citation mechanics and telemetry

When memories are rendered with stable IDs, the inference prompt should include a citation instruction. A minimal version: "When your answer is based on a retrieved memory, cite the memory ID in square brackets, for example [mem_abc]. If no memory supports a claim, do not cite." This instruction costs approximately 40–60 tokens. It is worthwhile at any scale.

Citation collection works as follows: after the model responds, parse the response text for the pattern \[mem_[a-z0-9]+\]. For each matched ID, look up the memory in the store and increment its citation_count. The citation_count feeds into the access boost calculation: access_boost = min(3.0, 1.2^access_count). A memory cited frequently across many queries accumulates a boost multiplier that raises its effective retrieval weight, creating a positive feedback loop for high-signal memories. This is intentional — memories that the model finds useful repeatedly should rank higher in future retrieval.
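
A minimal sketch of the collection step. The regex is the one given above; the in-memory counter stands in for whatever store actually holds citation_count, and the boost formula is quoted from the text:

import re
from collections import Counter

CITATION_RE = re.compile(r"\[(mem_[a-z0-9]+)\]")
citation_counts = Counter()    # stand-in for the store's citation_count field

def record_citations(response_text):
    """Parse memory citations out of a model response and bump each count."""
    cited = CITATION_RE.findall(response_text)
    citation_counts.update(cited)
    return cited

def access_boost(access_count):
    """Capped exponential boost: min(3.0, 1.2^access_count)."""
    return min(3.0, 1.2 ** access_count)

# record_citations("Your employer is Volkswagen [mem_abc].") → ["mem_abc"]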

Citation absence is as informative as citation presence. Define "retrieved but uncited" as a memory that appeared in the rendered context (was in the top-K results) but was not cited in the model's response. Track this per memory across queries. A memory with retrieval_count = 50 and citation_count = 0 after 50 appearances in context is a strong signal that the memory is junk: too general to be specifically useful, a near-duplicate of a more specific memory, or misclassified into the wrong type so it appears in the wrong context. The model itself is telling you, through non-citation, that this memory adds nothing.

A weekly job that flags memories with retrieval_count > 20 and citation_count = 0 produces a human-reviewable list. In practice, the list falls into three categories: (1) memories too generic to cite ("user uses computers"), which should be deleted; (2) memories correctly retrieved but not cited because the answer didn't require them — these are false positives in the junk detector and can be filtered by checking whether any query in the retrieval history was closely related to the memory's content; (3) memories misclassified into a type that surfaces them in the wrong context — these should be reclassified.
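
A minimal sketch of the weekly flagging job, assuming each memory record exposes retrieval_count and citation_count; the thresholds are the ones from the text:

def flag_uncited_memories(memories, min_retrievals=20):
    """Memories that keep appearing in context but are never cited:
    candidates for deletion, filtering, or reclassification."""
    return [
        m for m in memories
        if m["retrieval_count"] > min_retrievals and m["citation_count"] == 0
    ]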

The citation feedback loop is the cleanest production signal available for memory quality. It does not require labeling, does not require human annotation, and scales with usage. Every query where the model cites or fails to cite a memory is a free data point about that memory's utility.

Query-type budget tables in full

The dynamic budgeting section introduced three query types. The full allocation tables for each type follow, with token counts at the 4,000-token total budget.

Temporal query — user asks about events, timing, sequences, or history. Characteristic phrasings: "what happened last week?", "when did I join?", "what have I done on this project?"

Category | Percentage | Token budget | Approximate items
Events | 35% | 1,400 | ~17 events
Facts | 15% | 600 | ~7 facts
Preferences | 5% | 200 | ~2 preferences
Entities | 10% | 400 | ~5 entity summaries
Session summary | 10% | 400 | 1 summary block
Recent turns | 25% | 1,000 | ~12 turn excerpts

The event budget expands to 35% because temporal queries are almost exclusively answered by event memories. Facts and preferences are included at reduced allocation because they provide background context — the user's employer and role help interpret which events are relevant — but they should not compete with events for attention. The entity budget is kept at 10% rather than reducing it further because temporal queries frequently involve entity resolution ("what did Priya's team do last week?") and entity memories help identify which events are in scope.

Configuration query — user asks about settings, preferences, behavior, or how they want things to work. Characteristic phrasings: "how do I have this configured?", "what's my default?", "what did I say about X?"

Category | Percentage | Token budget | Approximate items
Preferences | 30% | 1,200 | ~15 preferences
Facts | 20% | 800 | ~10 facts
Events | 5% | 200 | ~2 events
Entities | 8% | 320 | ~4 entity summaries
Session summary | 12% | 480 | 1 summary block
Recent turns | 25% | 1,000 | ~12 turn excerpts

Preferences dominate at 30% because configuration queries are answered almost entirely by preference memories. Facts at 20% provide supporting context — the user's role and employer help interpret which preferences are applicable. Events drop to 5% — only the most recent events are included as background, since configuration queries rarely depend on historical events. Entity budget stays moderate because configuration queries sometimes reference entities ("how is my work machine configured?").

Relational query — user asks about people, organizational relationships, project membership, or network structure. Characteristic phrasings: "who does Sarah report to?", "who's on the platform team?", "what's the org structure for this project?"

Category | Percentage | Token budget | Approximate items
Facts | 25% | 1,000 | ~12 facts
Entities | 20% | 800 | ~10 entity summaries
Events | 10% | 400 | ~5 events
Preferences | 5% | 200 | ~2 preferences
Session summary | 15% | 600 | 1 summary block
Recent turns | 25% | 1,000 | ~12 turn excerpts

Facts and Entities share the top allocation for relational queries. Facts carry the predicates that describe relationships ("reports-to," "member-of," "manages"), and entity summaries provide the identity context needed to resolve references. Events at 10% are higher than for configuration queries because relational changes often show up as events ("Sarah joined the platform team" is an Event that also encodes a Relation). The session summary budget is slightly higher at 15% because relational queries often require more context about which project or team is being discussed.

Routing mixed queries. Not all queries fit cleanly into one type. A query like "what is Priya currently working on, and when did she join her team?" is both relational and temporal. The routing logic computes a weighted average based on which type signals are present in the ParsedQuery:

  • temporal_window ≠ null → temporal weight += 0.5
  • entity_refs with multiple distinct entities → relational weight += 0.4
  • predicate_hints containing preference-like terms → configuration weight += 0.3
  • Weights are normalized to sum to 1.0, then the budget is the weighted average of the two closest type allocations

For the example query: temporal weight = 0.5, relational weight = 0.4, normalized to 0.56 temporal / 0.44 relational. The events allocation becomes 0.56 × 35% + 0.44 × 10% = 19.6% + 4.4% = 24%. The entities allocation becomes 0.56 × 10% + 0.44 × 20% = 5.6% + 8.8% = 14.4%. The blended budget surfaces both event memories (for the "when did she join" sub-question) and entity memories (for the "what is Priya working on" sub-question) without either dominating.
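
A minimal sketch of the blending step, using the allocations from the three tables above. The weights and the worked example follow the text; the function and data layout are illustrative:

# Per-type allocations as fractions of the total budget (from the tables above)
ALLOCATIONS = {
    "temporal":      {"events": 0.35, "facts": 0.15, "preferences": 0.05,
                      "entities": 0.10, "summary": 0.10, "recent": 0.25},
    "configuration": {"events": 0.05, "facts": 0.20, "preferences": 0.30,
                      "entities": 0.08, "summary": 0.12, "recent": 0.25},
    "relational":    {"events": 0.10, "facts": 0.25, "preferences": 0.05,
                      "entities": 0.20, "summary": 0.15, "recent": 0.25},
}

def blend_budget(weights, total_budget=4000):
    """Blend per-type allocations by normalized weight, then convert to tokens."""
    norm = sum(weights.values())
    categories = ALLOCATIONS["temporal"].keys()
    return {
        cat: round(total_budget * sum((w / norm) * ALLOCATIONS[qtype][cat]
                                      for qtype, w in weights.items()))
        for cat in categories
    }

# blend_budget({"temporal": 0.5, "relational": 0.4})
# → events at roughly 24% of the budget and entities at roughly 14%, as in the worked example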

The query type is determined by the ParsedQuery's signature produced in Stage 1 of the read pipeline. The budget allocator receives the ParsedQuery alongside the retrieval results and applies the appropriate allocation — or the blended allocation for mixed queries — before calling the rendering step.
