The Query Optimizer for Memory

By Arc Labs Research · 10 min read

With five retrievers running in parallel, fan-out latency is bounded by the slowest — and the slowest retriever is often useless for the query at hand. A temporal range query gets nothing from BM25 over the full corpus; running it anyway burns budget and adds noise to fusion. The query optimizer's job is to predict, before retrieval, which retrievers are worth consulting for this specific query. The features are cheap; the savings are real.

The optimizer extracts features from the query and selects a plan template.

The query feature vector

  • Entity density (0.35 weight). Number of detected entities per token. High density → entity graph is high-value.
  • Temporal precision (0.25 weight). Presence of explicit time anchors ("yesterday", "since 2024", "last Tuesday"). High → temporal retriever leads.
  • Lexical rarity (0.25 weight). Mean inverse document frequency of query terms. High → BM25 wins; low → semantic wins.
  • Type cues (0.15 weight). Verb + noun patterns predict the target memory type ("what should I configure" → preference; "when did I" → event).

spec = 0.35·entityDensity + 0.25·temporal + 0.25·lexicalRarity + 0.15·typeCues

Specificity score; higher = more selective query → more aggressive routing.
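
As a sketch in code — the struct and field names are illustrative, mirroring the weights above:

pub struct QueryFeatures {
    pub entity_density: f32,  // detected entities per token, 0.0–1.0
    pub temporal: f32,        // strength of explicit time anchors
    pub lexical_rarity: f32,  // mean IDF of query terms, normalized
    pub type_cues: f32,       // verb + noun pattern match strength
}

impl QueryFeatures {
    /// Weighted specificity score; higher = more selective query.
    pub fn specificity(&self) -> f32 {
        0.35 * self.entity_density
            + 0.25 * self.temporal
            + 0.25 * self.lexical_rarity
            + 0.15 * self.type_cues
    }
}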

The plan templates

  • Entity-centric. Graph + type-filter + semantic. For high-entity, low-temporal queries.
  • Temporal. Temporal + semantic. For time-anchored queries.
  • Precision. Lexical + semantic + graph. For rare-term queries.
  • Exploratory. Semantic + lexical + type-filter + entity graph. For broad, low-specificity queries.
  • Balanced. All five retrievers, equal RRF weights. Fallback for unclear queries.

The latency win

On 10M-memory stores, typical numbers:

  • Balanced (all 5): p99 ≈ 50ms.
  • Temporal-only: p99 ≈ 18ms.
  • Entity-centric: p99 ≈ 32ms.
  • Precision (4 retrievers): p99 ≈ 45ms.

Routing 60% of queries to a 3-retriever plan and 40% to balanced halves p99 on the simple majority while preserving full coverage where it matters.

When the optimizer is wrong

Two failure modes to guard against:

  • Under-coverage. A query the optimizer routed to "temporal-only" was actually mixed and now misses entity-relevant results. Mitigate by always including semantic as a baseline retriever.
  • Plan thrashing. Tiny query variations produce different plans, leading to inconsistent results. Mitigate by hashing on a normalized feature vector, not the raw query (sketched below).

Both failure modes are bounded by always running a "safety semantic" retriever even when it isn't in the plan — its results contribute via RRF if it surfaces high-rank items the chosen plan missed.
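
A sketch of the thrashing mitigation: bucket the continuous features before hashing so near-identical queries map to the same plan key (the bucket width is an assumption):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Bucket continuous features so near-identical queries share a plan key.
fn plan_key(entity_density: f32, temporal: f32, rarity: f32, cues: f32) -> u64 {
    let bucket = |x: f32| (x * 10.0).round() as u8; // 0.0–1.0 → 0–10
    let mut h = DefaultHasher::new();
    (bucket(entity_density), bucket(temporal), bucket(rarity), bucket(cues)).hash(&mut h);
    h.finish()
}

Queries whose features differ by less than a bucket width hash identically and therefore get the same plan.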

The rule DAG — how plans are selected

Plan selection is a fast-path decision tree executed in order. Given a parsed query with a specificity score:

  1. Specificity ≥ 0.8 → precise plan: semantic (top 20) + entity graph if refs present. No rerank — the query is already narrow enough that the top candidates are rarely wrong.
  2. Entity refs with confidence > 0.85 → entity-centric: entity graph (weight 2.0) + BM25 or semantic depending on specificity. High-specificity entity queries skip the embedding API entirely — entity graph + BM25 are pure SQL — saving ~200ms per read.
  3. Temporal window detected → temporal-anchored: temporal retriever primary, semantic secondary. Temporal here is a SQL BTREE scan on event_at, not a vector lookup.
  4. Specificity < 0.4 → exploratory: all four retrievers, rerank ON, diversity ON. When the query is ambiguous, reranking compensates by re-scoring the fused candidates against the full query intent.
  5. Default → hybrid balanced: equal weights, no rerank.
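
In code, the DAG is just an ordered if-chain. A sketch over the ParsedQuery struct shown later in the post — the Plan enum and the confidence field on entity refs are illustrative:

pub enum Plan {
    Precise,
    EntityCentric,
    TemporalAnchored,
    Exploratory,
    Balanced,
}

/// First matching rule wins; order encodes priority.
pub fn select_plan(q: &ParsedQuery) -> Plan {
    // Highest resolution confidence among entity refs (assumed field).
    let max_entity_conf = q
        .entity_refs
        .iter()
        .map(|e| e.confidence)
        .fold(0.0_f32, f32::max);

    if q.specificity >= 0.8 {
        Plan::Precise // rule 1: narrow query, skip rerank
    } else if max_entity_conf > 0.85 {
        Plan::EntityCentric // rule 2: confident entity anchor
    } else if q.temporal_window.is_some() {
        Plan::TemporalAnchored // rule 3: explicit time window
    } else if q.specificity < 0.4 {
        Plan::Exploratory // rule 4: ambiguous, rerank + diversity on
    } else {
        Plan::Balanced // rule 5: default
    }
}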

Worked examples from spec tests:

Query | Entities | Temporal | Specificity | Plan chosen | Latency
"what's my manager's communication style?" | Priya (0.94) | — | 0.78 | entity-centric | ~25ms
"something about deadlines" | none | — | 0.22 | exploratory + rerank | ~180ms
"what did Priya say yesterday about the MVP?" | Priya (0.94) | yesterday | 0.85 | precise (specificity fast-path) | ~20ms

The query result cache (Mode B)

The cache stores fused, post-policy results keyed by hash(normalize(query) + user_id + agent_id + namespace). Normalization lowercases, strips punctuation, and canonicalizes whitespace — two queries that differ only in casing or trailing punctuation share a cache entry.
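
A minimal sketch of the key construction. Folding the scope version into the key is one way to get the invalidation behavior listed below, and any normalization beyond the three stated rules is an assumption:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Lowercase, strip punctuation, canonicalize whitespace.
fn normalize(query: &str) -> String {
    query
        .to_lowercase()
        .chars()
        .filter(|c| !c.is_ascii_punctuation())
        .collect::<String>()
        .split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
}

/// Cache key; bumping scope_version on remember() invalidates the scope.
fn cache_key(query: &str, user_id: &str, agent_id: &str, namespace: &str, scope_version: u64) -> u64 {
    let mut h = DefaultHasher::new();
    (normalize(query), user_id, agent_id, namespace, scope_version).hash(&mut h);
    h.finish()
}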

Key properties:

  • Storage: in-process DashMap. No Redis tier currently.
  • TTL: 30 seconds. Configurable.
  • Bound: 1,000 entries, LRU eviction.
  • Hit latency: under 1ms (vs 150–400ms full pipeline).
  • Invalidation: any remember() in a scope bumps the scope's version, clearing all entries for that scope.

Expected hit rate is 15–30% for chat-continuation workloads and ~5–10% for exploratory workloads. At 20% hit rate, average read latency improves from 200ms to ~161ms — the p99 improvement is larger because cache hits clip the long tail.

recall_cache_hit_rate is the dashboard signal to watch. A rising hit rate should pull latency down; if hit rate and latency rise together, either the cache is too small (hot entries are evicted before they can be reused) or the post-cache pipeline has gotten slower (check reranker latency).

The learned optimizer (Mode C)

Mode C requires >10K labeled queries per day and is opt-in. It adds two adjustments on top of rule-based plans:

Plan selection by query signature. Queries cluster into ~100 signatures (has entities, has temporal, has type hints, specificity bucket, length bucket, has negation). For each signature with >30 examples, NDCG is computed against each plan variant. The best plan for that signature is promoted.

RRF weight tuning. Per retriever, marginal utility = NDCG(full) − NDCG(ablation without that retriever). New weight: w_new = w_old × (1 + 0.1 × normalize(marginal_utility)), clamped to [0.1, 3.0]. Updates apply to 10% of traffic first; rule-based NDCG is the floor — if learned weights produce lower NDCG on a validation holdout, they roll back automatically.
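
A sketch of both adjustments — the signature buckets and the bounded weight update. Bucket boundaries and the normalization range are assumptions:

/// Coarse query signature; queries sharing a signature share a plan.
fn signature(
    has_entities: bool,
    has_temporal: bool,
    has_type_hints: bool,
    has_negation: bool,
    specificity: f32,
    token_count: usize,
) -> (bool, bool, bool, bool, u8, u8) {
    (
        has_entities,
        has_temporal,
        has_type_hints,
        has_negation,
        (specificity.clamp(0.0, 1.0) * 4.0).round() as u8, // specificity bucket
        (token_count.min(16) / 4) as u8,                   // length bucket
    )
}

/// Bounded multiplicative RRF weight update.
/// marginal_utility = NDCG(full) − NDCG(ablation without the retriever),
/// assumed normalized into roughly [−1, 1] upstream.
fn update_rrf_weight(w_old: f32, normalized_marginal_utility: f32) -> f32 {
    (w_old * (1.0 + 0.1 * normalized_marginal_utility)).clamp(0.1, 3.0)
}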

The learning job runs weekly and takes under 1 minute per namespace. Cold start: rule-based plans until 30 queries per signature have feedback or 1 month has passed.

ParsedQuery: what the optimizer actually receives

The optimizer doesn't receive raw text. Before the optimizer runs, Stage 1 of the read pipeline parses the query into a ParsedQuery struct:

pub struct ParsedQuery {
    pub original: String,
    pub rewrites: Vec<String>,        // 2–3 LLM-generated semantic rewrites when specificity < 0.8
    pub entity_refs: Vec<EntityRef>,  // entities extracted + resolved, with confidence
    pub temporal_window: Option<(DateTime<Utc>, DateTime<Utc>)>,
    pub predicate_hints: Vec<String>, // hinted predicates from query vocabulary
    pub type_hints: Vec<MemoryType>,  // populated ONLY by the optimizer, not by NLP
    pub negations: Vec<String>,       // "not X" patterns
    pub specificity: f32,             // 0.0–1.0 composite score
}

The optimizer's job is to read this struct and produce a RetrievalPlan. Two fields are notable. The first, entity_refs, carries not just entity names but the resolved entity IDs and resolution confidence. An entity_ref with confidence 0.95 (exact database match) versus 0.62 (fuzzy match) affects whether the entity-centric plan is selected — the rule DAG's second condition requires entity ref confidence above 0.85. A fuzzy match at 0.62 won't trigger entity-centric even if the entity name appears clearly in the query.
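
The post shows the optimizer's input but not its output type. As a sketch, a RetrievalPlan might carry per-retriever weights plus the flags the rule DAG sets — every field name here is hypothetical:

use chrono::{DateTime, Utc};

pub enum Retriever { Semantic, Lexical, EntityGraph, Temporal, TypeFilter }

pub struct RetrievalPlan {
    pub retrievers: Vec<(Retriever, f32)>, // retriever + RRF weight
    pub type_filter: Vec<MemoryType>,      // copied from optimizer-set type_hints
    pub temporal_post_filter: Option<(DateTime<Utc>, DateTime<Utc>)>,
    pub rerank: bool,                      // exploratory plans only
    pub mmr_diversity: bool,
    pub top_k: usize,
}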

The second notable field, type_hints, is empty when it arrives from NLP. The optimizer fills it in based on query patterns — verb-noun analysis, predicate recognition, and the type_cues component of the specificity score. This matters because type_hints feeds directly into the type-filter retriever. The optimizer is therefore the sole gatekeeper for type-filtered retrieval: no optimizer pass means no type filtering, regardless of how clearly the query implies a type.

The rewrites field connects to a key design choice in how vague queries are handled. When specificity is below 0.8, the NLP stage generates 2–3 semantic rewrites of the original query using a small LLM call with the prompt: "What are alternative ways someone might ask about the same information as this query?" All rewrites are embedded independently using the same embedding model, and the top-K results from each rewrite's semantic search are unioned into the candidate set before fusion. This is HyDE-style retrieval (hypothetical document embeddings) applied not to hypothetical documents but to hypothetical queries — achieving the same distributional broadening effect.

For a vague query like "that thing I mentioned about storage," the rewrites might be:

  • "user's preference for database type"
  • "user's storage backend choice"
  • "what storage system was discussed"

The union of top-K from all three rewrites as semantic retrieval candidates is typically 40–60% more diverse than top-K from the original query alone. The cost is three embedding API calls instead of one, which adds ~15ms to the NLP stage. This is why the threshold is specificity < 0.8 rather than always on — high-specificity queries don't benefit from rewriting because the original query already distributes correctly in embedding space.
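
A sketch of the union step: semantic_search is a hypothetical helper returning (memory_id, score) pairs, and each memory keeps its best score across the original query and its rewrites:

use std::collections::HashMap;

/// Run semantic search for the original query plus each rewrite and union
/// the candidates, keeping the max score per memory.
fn union_rewrites(
    queries: &[String],
    top_k: usize,
    semantic_search: impl Fn(&str, usize) -> Vec<(u64, f32)>,
) -> Vec<(u64, f32)> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    for q in queries {
        for (id, score) in semantic_search(q, top_k) {
            let entry = best.entry(id).or_insert(score);
            if score > *entry {
                *entry = score;
            }
        }
    }
    let mut out: Vec<_> = best.into_iter().collect();
    out.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    out
}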

Plan selection examples in depth

Tracing five concrete queries through the full plan selection process illustrates how the feature vector, rule DAG, and ParsedQuery interact.

"Who is Sarah's manager?"

The NLP stage extracts one entity (Sarah) with confidence 0.93 — an exact database match. Temporal: none detected. The query is short and specific: entityDensity = 0.9, temporal = 0.0, lexicalRarity = 0.35 (manager, who are common terms), typeCues = 0.6 (relational query pattern). Specificity = 0.35·0.9 + 0.25·0.0 + 0.25·0.35 + 0.15·0.6 = 0.315 + 0 + 0.0875 + 0.09 ≈ 0.49.

Specificity 0.49 does not trigger the first DAG rule (≥ 0.8). The second rule fires: entity ref confidence 0.93 > 0.85. Plan selected: entity-centric. What happens at retrieval: entity graph traversal from Sarah's EntityId, looking for manager_of or reports_to edge types. BM25 supplements with a predicate filter on "manager." No embedding API call is made. Expected result: 1-hop graph edge (Sarah → reports_to → Manager) found in approximately 4ms. The total read pipeline completes in roughly 25ms — almost entirely SQL.

"Did I finish the OAuth PR?"

Two recognized entities: OAuth (product/protocol entity, confidence 0.81) and PR (implied artifact entity, confidence 0.74 — inferred from context). Temporal: implicit — "did I finish" implies a past completed event, which the NLP stage translates to a temporal_window spanning the last 30 days. Specificity: entityDensity moderate (2 entities, both moderate confidence), lexicalRarity moderate (OAuth is domain-specific, PR is moderately rare). Overall specificity ≈ 0.65.

Specificity 0.65 does not trigger DAG rule 1. Entity ref confidence (max 0.81) is below 0.85 — DAG rule 2 does not fire. Temporal window detected — DAG rule 3 fires. Plan selected: temporal-anchored. What happens: temporal retriever scans event_at for the last 30 days, filtering to Event-type memories. Semantic retriever runs in parallel with the query embedding, with rewrites generated because specificity < 0.8 (rewrites: "completed OAuth pull request," "merged PR for authentication," "finished OAuth implementation"). Expected: Event-type memories about OAuth retrieved via temporal retriever, semantic retriever finding memories about "completed," "finished," and "merged" for the PR context.

"Something about our API"

No entities extracted — API is a generic term not resolved to a specific entity in the graph. Temporal: none. Query length is short, all terms are common: entityDensity = 0.0, temporal = 0.0, lexicalRarity = 0.15 (API, something, our are all high-frequency), typeCues = 0.2. Specificity = 0.35·0.0 + 0.25·0.0 + 0.25·0.15 + 0.15·0.2 = 0 + 0 + 0.0375 + 0.03 = 0.068. Specificity 0.068 is well below 0.4 — DAG rule 4 fires immediately.

Plan selected: exploratory. All four retrievers run: semantic (with 3 rewrites given very low specificity), BM25 (broad lexical scan), type-filter (no type hint, so all types consulted), entity graph (no seed entity, so traversal starts from recently-active entities). Rerank: ON. MMR diversity: ON. Latency: approximately 180ms — the full pipeline with reranking. Expected output: a diverse set of memories across all mentions of API-related content: facts about API design decisions, events involving API work, preferences about API conventions. The reranker and MMR together ensure the returned set isn't dominated by the five most recent API mentions.

"What's my badge number"

No entities extracted (badge numbers are identifiers, not named entities in the graph). Temporal: none. The key signal here is lexical rarity: "badge" and "badge number" are rare in typical memory corpora — few memories contain these terms, giving them high IDF. lexicalRarity = 0.92. entityDensity = 0.0, temporal = 0.0, typeCues = 0.4 (identifier lookup pattern). Specificity = 0.35·0.0 + 0.25·0.0 + 0.25·0.92 + 0.15·0.4 = 0 + 0 + 0.23 + 0.06 = 0.29.

Specificity 0.29 would normally trigger the exploratory path (< 0.4), but the lexical rarity component is the dominant signal at 0.92. The rule DAG operates on the full specificity score; a secondary heuristic sits on top of it: when lexicalRarity alone exceeds 0.8, the optimizer promotes the precision plan even when overall specificity is below 0.4. Plan selected: precision. BM25 is primary, with the GIN index finding "badge" as a rare term with very high IDF — few documents in the corpus contain it, so the BM25 score for an exact match is extremely high. Semantic retriever is secondary. Expected result: exact-match memory found in approximately 18ms via BM25, long before semantic scores are computed.
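
A sketch of that promotion as an override on the DAG output. The enum is trimmed to the relevant variants; applying the override to the balanced default as well reflects the later worked example where a 0.65-specificity, high-rarity query still lands on precision:

pub enum Plan {
    Exploratory,
    Balanced,
    Precision,
    // …remaining variants as in the earlier sketch
}

/// Secondary heuristic: a dominant lexical-rarity signal promotes the
/// precision plan when the DAG would otherwise fall back to a broad plan.
fn apply_rarity_override(plan: Plan, lexical_rarity: f32) -> Plan {
    match plan {
        Plan::Exploratory | Plan::Balanced if lexical_rarity > 0.8 => Plan::Precision,
        other => other,
    }
}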

"What was happening with Inbox3 in March?"

Entity: Inbox3 (product entity, confidence 0.97 — exact match). Temporal: explicit — "in March" is resolved to a temporal_window spanning March 1–31 of the current or most recent year. entityDensity = 0.95 (high entity confidence, single entity query), temporal = 0.95 (explicit temporal anchor), lexicalRarity = 0.45 (Inbox3 is somewhat rare, "happening" is common), typeCues = 0.3. Specificity = 0.35·0.95 + 0.25·0.95 + 0.25·0.45 + 0.15·0.3 = 0.3325 + 0.2375 + 0.1125 + 0.045 = 0.728.

Both entity-centric and temporal conditions could apply (confidence 0.97 > 0.85, temporal window set). However, the DAG evaluates rules in order, and specificity 0.728 does not trigger rule 1. Rule 2 fires first (entity ref confidence > 0.85) before rule 3 (temporal). Plan selected: entity-centric — but with temporal post-filtering added because a temporal window is present. The temporal filter is applied as a SQL predicate on the entity-graph results (WHERE event_at BETWEEN '2026-03-01' AND '2026-03-31'), not as a separate retriever pass. No rerank — the combined entity + temporal constraint is already narrow. Expected result: all memories mentioning Inbox3 created or dated to March, found in approximately 20ms.

Cost model: when to use Mode C

The learned optimizer (Mode C) adds overhead and requires sufficient labeled data to produce signal rather than noise. Understanding when it earns its keep requires examining the cost structure of each mode.

Mode A (rule-based) is correct for the common case. Empirically, 60–70% of queries fall cleanly into one plan template with unambiguous feature vectors. The rule DAG executes in under 1ms. No external dependencies, no labeled data, no training pipeline. For new namespaces, new deployments, or deployments without reliable feedback signals, Mode A is always the right starting point. Getting it wrong in Mode A is recoverable — the optimizer's failure modes (under-coverage, plan thrashing) are bounded by the safety semantic retriever.

Mode B (cached) gives the largest latency improvement per unit of infrastructure investment. A 20% cache hit rate on a 200ms average pipeline reduces mean latency to approximately 161ms — a 20% improvement in mean latency with zero retrieval cost for those hits. The p99 improvement is proportionally larger because cache hits entirely eliminate the long-tail retrieval cases. Mode B is worth enabling for any deployment where users ask similar questions repeatedly: chat-continuation workloads where follow-up questions are semantically similar to the previous exchange, automated agents running the same memory lookups on a schedule, and dashboards that periodically refresh the same summary queries.

Mode C (learned) is worth deploying when you have evidence that the rule-based plan is systematically suboptimal for your specific query distribution. The indicators:

  • NDCG@5 from rule-based plans is below 0.60 on your golden query set (below 0.6 means more than 40% of the top-5 results are irrelevant on average)
  • Users frequently follow up with corrections ("you missed X," "what about Y") — this suggests retrieval is consistently leaving relevant memories behind
  • p99 latency is acceptable but mean relevance scores from the reranker are low — indicating the plan is retrieving the wrong candidate set, not that the pipeline is slow

The data volume threshold for Mode C is greater than 10,000 queries per day with attached user feedback signals (thumbs up/down on responses, correction patterns, explicit ratings). Without that volume, the per-signature sample counts stay below 30 — the minimum for NDCG estimates to be stable. With sufficient volume, Mode C typically improves NDCG@5 by 8–15% on specialized domains: medical (where temporal precision patterns differ from general queries), legal (where entity references are dense and entity confidence thresholds need tuning), and coding (where lexical rarity scores for identifiers dominate and BM25 should be weighted higher than default rules specify).

The rollout path for Mode C: enable on 10% of traffic, compare NDCG against the rule-based baseline for 1 week, promote to 100% if the improvement holds on the validation holdout. The rule-based NDCG is always the floor — learned weights that perform worse than rules on the holdout roll back automatically after each weekly training run.

Negation handling

A query like "what did I do that wasn't related to the frontend?" contains a negation. The ParsedQuery struct's negations field captures these: in this case, negations: ["frontend"]. The optimizer adds a negation filter to the retrieval plan, which is applied as a post-fusion step: memories with high BM25 similarity to the negated terms are demoted in the final ranking, even if they scored well across all retrievers.

The choice of post-fusion filtering rather than pre-filtering is deliberate and worth understanding. A pre-filter would remove from the candidate set any memory containing the negated term before retrieval completes. This is too aggressive: a memory about "frontend and backend architecture tradeoffs" contains "frontend" but may be directly relevant to a question about non-frontend work if the memory substantively discusses backend components. The negation "not frontend" expresses intent about the answer's focus, not a blanket exclusion of any memory that mentions the word.

Post-filtering implements this correctly. After all retrievers return candidates and RRF fusion produces an initial ranking, the negation filter runs as follows: compute the BM25 similarity between each retrieved memory and the negated term text. Memories above the negation threshold — defaulting to a BM25 score of 0.6 against the negation term — are demoted to the bottom of the ranking rather than removed. Demotion rather than removal preserves the memory for edge cases where it is the only relevant result in the namespace; a demoted memory still surfaces if nothing better exists, but it yields to any result that doesn't strongly match the negation.

The threshold of 0.6 BM25 against the negation term is a tunable parameter. In namespaces where negations are frequent and high-precision exclusion is preferred (for example, a coding assistant where "not Python" should strongly suppress Python memories in favor of other languages), lower the threshold to 0.4. In namespaces where negations are mostly soft preferences ("mostly non-frontend" rather than "strictly nothing about frontend"), raise it to 0.75. The default 0.6 represents the general-purpose tradeoff: aggressive enough to meaningfully re-order the results, conservative enough not to lose genuinely relevant memories that happen to mention the negated term in passing.

Multiple negations compound: each negated term is scored independently, and a memory is demoted if it exceeds the threshold for any negation. A memory that strongly matches two negation terms receives a larger demotion penalty, pushing it further toward the bottom of the ranked list.
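
A sketch of the demotion pass. The bm25 helper and the penalty constant are assumptions; because fused RRF scores are small, subtracting a large constant per matched negation sinks demoted items below any clean result while keeping them in the list:

/// Assumption: far larger than any fused score, so one hit sinks the memory
/// below all clean results and each additional hit sinks it further.
const DEMOTION_PENALTY: f32 = 1_000.0;

/// Demote, don't drop: post-fusion negation filter over the ranked list.
fn apply_negations(
    ranked: &mut Vec<(u64, f32)>,    // (memory_id, fused RRF score)
    negations: &[String],
    threshold: f32,                  // default 0.6
    bm25: impl Fn(u64, &str) -> f32, // hypothetical similarity helper
) {
    for (id, score) in ranked.iter_mut() {
        let hits = negations
            .iter()
            .filter(|neg| bm25(*id, neg.as_str()) > threshold)
            .count() as f32;
        *score -= hits * DEMOTION_PENALTY; // multiple negations compound
    }
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
}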

Inside ParsedQuery: entity resolution, rewrites, and type hints

The optimizer doesn't work from raw text. By the time it runs, the query has been parsed into the ParsedQuery struct shown earlier by Stage 1 of the read pipeline. That struct is the optimizer's entire input — it never sees the original string again except as stored in the original field. Three fields deserve a closer look.


The rewrites field comes from a conditional LLM enrichment call that fires when specificity < 0.8. The enrichment generates 2–3 semantically equivalent reformulations to improve semantic recall. The semantic retriever runs against all reformulations and takes the max score per memory — if any reformulation ranks a memory highly, that memory survives into the candidate set. When specificity ≥ 0.8, enrichment is skipped entirely. A precise query already distributes well in embedding space; generating synonyms for a query that names a specific entity and time window pulls in off-topic memories from adjacent vocabulary and adds noise rather than recall.

The entity_refs field is populated via a 4-stage resolution cascade in Stage 1:

  1. Exact match against the entity name index.
  2. Alias match against stored alternative names.
  3. Trigram fuzzy match, accepting candidates above a similarity threshold.
  4. LLM fallback for ambiguous references that did not resolve in earlier stages.
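
A sketch of the cascade's control flow, with the four lookups stubbed out — a real resolver would consult the name index, alias table, trigram index, and an LLM in that order:

pub struct EntityRef {
    pub entity_id: u64,
    pub confidence: f32, // set by the stage that resolved the mention
}

struct Resolver;

impl Resolver {
    // Stubs standing in for the real lookups (assumptions).
    fn exact(&self, _m: &str) -> Option<EntityRef> { None }
    fn alias(&self, _m: &str) -> Option<EntityRef> { None }
    fn trigram(&self, _m: &str) -> Option<EntityRef> { None }
    fn llm(&self, _m: &str) -> Option<EntityRef> { None }

    /// First stage to resolve wins and supplies the confidence score.
    fn resolve(&self, mention: &str) -> Option<EntityRef> {
        self.exact(mention)
            .or_else(|| self.alias(mention))
            .or_else(|| self.trigram(mention))
            .or_else(|| self.llm(mention))
    }
}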

Each resolved entity carries a confidence score from the resolution stage that found it. An entity ref with confidence below 0.85 is not passed to the entity-centric plan. If the resolution is uncertain — a 15% or greater chance of naming the wrong entity — forcing entity-graph traversal from that anchor produces garbage results. The entity graph is extremely high-recall within the correct entity's neighborhood; anchoring on the wrong entity means that neighborhood is completely irrelevant to the query.

The type_hints field is always empty from Stage 1 NLP parsing. This is intentional. The NLP parser identifies temporal windows and entity refs through pattern matching and named entity recognition. Type classification — deciding whether a query is asking about a preference, an event, a skill, or a fact — requires semantic understanding of query intent that the NLP parser does not perform. The optimizer fills type_hints based on its own feature analysis. Downstream retrievers read this field; nothing upstream of the optimizer sets it.

Per-query plan selection: 5 worked examples

Abstract routing rules become precise when applied to specific queries. These five examples trace full feature extraction through plan selection and expected retrieval behavior.

Query 1: "what's my API key for the Recall SDK?"

Entity refs: none — "API key" is generic, not a named entity in the graph. Temporal window: none. Lexical rarity: high — IDF of "API key" in a general-domain namespace is approximately 4.2, and "Recall SDK" is a rare bigram. Specificity: ~0.65. Rule DAG: specificity below 0.8 (no precise fast-path), no entity refs above threshold (no entity-centric plan), no temporal window (no temporal plan), specificity above 0.4 (no exploratory plan). Plan selected: precision — lexical + semantic + entity graph, promoted over the balanced default by the high lexical-rarity signal. BM25 on "API key" ranks the exact stored memory near the top because the term combination is rare. Entity graph is included but contributes little without an anchor entity. Rerank is off.

Query 2: "who is on the backend team?"

Entity refs: ent_backend_team resolves via exact match, confidence 0.91 (above the 0.85 threshold). Temporal window: none. Specificity: ~0.72. Rule DAG: DAG rule 1 (specificity ≥ 0.8) does not fire. DAG rule 2 fires: entity ref confidence 0.91 > 0.85. Plan selected: entity-centric — entity graph weight 2.0, BM25 for lexical backup, semantic for members referenced by description. Entity graph traversal from ent_backend_team returns all member_of edges in 1–2 hops via SQL join on the graph edge table. Expected latency: approximately 25ms, almost entirely SQL.

Query 3: "what did I work on last Tuesday and Wednesday?"

Entity refs: implicit user (0-hop, confidence 1.0, always present). Temporal window: Tuesday 00:00 through Wednesday 23:59. No named external entities. Lexical rarity: low ("work" is high-frequency). Specificity: ~0.71. Rule DAG: rules 1 and 2 do not fire. Rule 3 fires: temporal window detected. Plan selected: temporal-anchored — temporal retriever performs a BTREE range scan on event_at bounded by the two-day window. Semantic retriever runs in parallel, catching non-event memories from those days that are topically relevant to what the user worked on.

Query 4: "what are my general preferences?"

Entity refs: none. Temporal window: none. Lexical rarity: low — IDF of "preferences" is approximately 1.1, a very common term in a personal-assistant namespace. Type cues: the "preferences" keyword is a first-class vocabulary signal for the Preference memory type. Specificity: ~0.28. Rule DAG: specificity below 0.4 — DAG rule 4 fires immediately. Plan selected: exploratory — all retrievers, rerank ON, MMR diversity ON. The optimizer sets type_hints = [Preference] before retrieval, pre-filtering the corpus to Preference-typed memories. Reranker and MMR together ensure the returned set is not dominated by five semantically near-duplicate preference memories.

Query 5: "what database did Priya recommend for Project Kestrel?"

Entity refs: ent_priya_ABC (confidence 0.96), ent_project_kestrel (confidence 0.89). No temporal window. Lexical rarity: medium-high — "Kestrel" has IDF ≈ 5.1. Specificity: 0.84. Rule DAG: specificity ≥ 0.8 — DAG rule 1 fires. Plan selected: precise fast-path — semantic top-20, entity graph anchored at both entities, no rerank, top-K = 5. Expected result: if the recommendation memory was stored, semantic retrieval finds it via the combined Priya + Kestrel + database embedding. Entity graph confirms by returning memories involving both anchors. If the memory was not stored, the system surfaces topically adjacent near-misses rather than confabulating an answer.

Optimizer configuration and tuning

The optimizer exposes its thresholds and mode settings through the deployment configuration file. Defaults are calibrated against a diverse benchmark workload; most deployments benefit from minor tuning based on their specific query distribution.

read_pipeline:
  optimizer:
    mode: rule  # rule | cached | learned

    rule:
      precise_threshold: 0.8    # specificity above this → precise plan
      exploratory_threshold: 0.4 # below this → exploratory + rerank
      entity_confidence_min: 0.85 # min entity resolution confidence for entity-centric plan

    cache:
      enabled: true
      ttl_seconds: 30
      max_entries: 1000
      normalize_casing: true
      normalize_punctuation: true

    learned:
      enabled: false  # requires eval data
      min_queries_per_signature: 30
      update_cadence: weekly
      rollback_on_ndcg_regression: true
      traffic_fraction_for_new_weights: 0.1

Raise precise_threshold from 0.8 to 0.9 if you're seeing high false-negative rates on entity queries — entity-centric plans not triggering when they should. This shrinks the precise fast-path and gives more queries a chance to be evaluated by the entity-centric branch. Lower it toward 0.7 if the reranker is running unnecessarily on queries that are already quite specific — the reranker adds ~80ms on a 10M-memory store, and skipping it on clearly narrow queries is pure savings.

Lower exploratory_threshold from 0.4 to 0.3 if broad queries are producing poor results and you want more aggressive reranking on moderately ambiguous queries. Be aware that each exploratory plan invocation carries the full reranker cost — expanding the exploratory window increases mean latency.

The entity_confidence_min is the most critical threshold for precision. At 0.85, a resolved entity has 85%+ confidence — high enough that graph traversal from it is expected to land in the right neighborhood. At 0.70, you would get more entity-centric plans but 30% of them would be anchored to the wrong entity. The entity graph is high-recall within the correct entity's neighborhood and high-noise outside it. Do not lower this threshold without careful evaluation of entity resolution accuracy in your namespace.

Observability: reading the optimizer dashboard

The optimizer emits per-plan latency and quality metrics as Prometheus-compatible counters and histograms. These are the signals to watch in production.

recall_optimizer_plan_selected{plan="precise"}        → 18% of queries
recall_optimizer_plan_selected{plan="entity_centric"} → 31%
recall_optimizer_plan_selected{plan="temporal"}       → 12%
recall_optimizer_plan_selected{plan="exploratory"}    → 22%
recall_optimizer_plan_selected{plan="balanced"}       → 17%

recall_retrieval_latency_ms{plan="precise",p99}        → 22ms
recall_retrieval_latency_ms{plan="entity_centric",p99} → 28ms
recall_retrieval_latency_ms{plan="balanced",p99}       → 52ms

The plan distribution tells you what kind of workload your users have. A deployment where 50% or more of queries route to "exploratory" is a low-specificity workload — users ask vague questions, and the reranker runs constantly. Consider whether upstream query construction can add entity context or temporal anchors before queries reach Recall. A deployment with 40%+ entity-centric routing has relationship-heavy conversations; entity graph traversal is efficient but depends heavily on entity resolution quality.

When recall_cache_hit_rate is low (below 10%), either the query distribution is too diverse (many unique queries) or the TTL is too short for your workload's repeat pattern. A 30-second TTL captures conversational repetition but not session repetition. When it is high (above 40%), check whether the cache is hiding retrieval quality problems — a cached result that was mediocre 25 seconds ago is still mediocre now. High hit rate with low user satisfaction is a sign the cache is serving stale low-quality results rather than fresh accurate ones.

recall_reranker_latency_ms should be checked separately from retrieval latency. The reranker contributes the largest single component of exploratory plan latency. If reranker p99 exceeds 150ms, reduce the fusion candidate count passed to the reranker via rerank_top_k, or switch to a smaller, faster cross-encoder model. In Mode C deployments, recall_ndcg_by_plan breaks out retrieval quality per plan type — a plan with high selection rate but low NDCG is a routing error that warrants threshold recalibration.
