What to Watch When Recall Goes to Production

Arc Labs Engineering · 20 min read

When Recall first ran in a real deployment, the monitoring dashboard looked like any other: p50 and p99 latency, error rate, request throughput. Three weeks into production, the first reliability issue wasn't a timeout or an HTTP 500. It was gradual retrieval drift — the coding agent was surfacing architecture decisions from eight months ago as if they were still current. Nothing in the standard metrics moved. By the time a user flagged it, roughly 2,000 low-quality memories had accumulated in the store with no signal that anything was wrong.

That incident shaped the observability design. Recall ships with three metric families, not two. Throughput and cost look like what you're used to. Quality — junk rate, duplicate rate, retrieval precision — is the addition that makes memory-specific problems visible before users report them.

The three metric families

Throughput metrics

These are the standard ones. Recall exposes them because they're table stakes for any production service, not because they're what distinguishes a memory system.

Metric                         Type       Labels                    What it tells you
recall_writes_total            counter    namespace, outcome        Total write operations, split by outcome
recall_reads_total             counter    namespace, cache_status   Total search operations
recall_forgets_total           counter    namespace, scope          Total delete/forget operations
recall_write_latency_seconds   histogram  namespace, stage          Per-stage write latency
recall_read_latency_seconds    histogram  namespace, with_rerank    End-to-end read latency
recall_queue_depth             gauge      queue                     Pending background work
recall_worker_lag_seconds      gauge      job                       Time since last successful background job run

The outcome label on recall_writes_total has four values: stored, discarded, merged, error. Watch the ratio, not just the total. In a healthy system, discarded is the largest bucket by a wide margin — that's the funnel working.

The stage label on recall_write_latency_seconds gives you per-stage histograms. Under normal conditions:

pre_filter:     p50 ≈ 1ms    p99 ≈ 3ms
extract:        p50 ≈ 450ms  p99 ≈ 1800ms
resolve_refs:   p50 ≈ 12ms   p99 ≈ 45ms
dedupe:         p50 ≈ 40ms   p99 ≈ 150ms
conflict:       p50 ≈ 15ms   p99 ≈ 60ms
persist:        p50 ≈ 22ms   p99 ≈ 80ms

If your p99 for extract is regularly above 2,000ms, either your input turns are very long (check MAX_TURN_CONTENT_CHARS truncation) or your LLM provider is under load. If dedupe is slow, HNSW ef_search is probably set too high for your index size.
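
If you want an automated check against these baselines rather than eyeballing dashboards, something like the following works. A minimal TypeScript sketch: the baseline table comes from the numbers above, but the 1.5x multiplier and the helper itself are assumptions to tune, not part of the Recall SDK.

// Hypothetical watchdog: compare observed per-stage p99s (in ms) against
// the baselines quoted above and flag anything more than 1.5x over.
const EXPECTED_P99_MS: Record<string, number> = {
  pre_filter: 3,
  extract: 1800,
  resolve_refs: 45,
  dedupe: 150,
  conflict: 60,
  persist: 80,
};

function flagLatencyRegressions(observedP99Ms: Record<string, number>): string[] {
  const flagged: string[] = [];
  for (const [stage, baseline] of Object.entries(EXPECTED_P99_MS)) {
    const observed = observedP99Ms[stage];
    if (observed !== undefined && observed > baseline * 1.5) {
      flagged.push(`${stage}: p99 ${observed}ms vs expected ~${baseline}ms`);
    }
  }
  return flagged;
}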

Read latency has two modes: with reranking and without. Without reranking, p50 is around 25ms and p99 around 75ms. With reranking (cross-encoder), p50 climbs to 80ms and p99 to about 195ms. Both modes run the five retrievers in parallel; the difference is the serial reranking pass on the final candidates.
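
That read path is worth seeing in code. A sketch under stated assumptions: the retriever and reranker signatures are invented for illustration, and the fusion assumes the rrf_score Recall reports is standard reciprocal rank fusion with the conventional k = 60; neither is confirmed internals.

// Illustrative read path: five retrievers fan out in parallel, and the
// optional cross-encoder rerank is a serial pass over the fused candidates.
type Candidate = { id: string; score: number };
type Retriever = (query: string) => Promise<Candidate[]>;

// Reciprocal rank fusion: score each id by summing 1 / (k + rank) across
// the retrievers that returned it. (Assumed form of Recall's rrf_score.)
function rrfFuse(lists: Candidate[][], k = 60): Candidate[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((c, rank) => {
      scores.set(c.id, (scores.get(c.id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

async function searchSketch(
  query: string,
  retrievers: Retriever[], // semantic, bm25, entity_graph, temporal, type
  rerank?: (q: string, cands: Candidate[]) => Promise<Candidate[]>
): Promise<Candidate[]> {
  // Parallel fan-out: wall-clock cost is the slowest retriever, not the sum.
  const candidateLists = await Promise.all(retrievers.map((r) => r(query)));
  const fused = rrfFuse(candidateLists);
  // The serial rerank pass is what moves p50 from ~25ms to ~80ms.
  return rerank ? rerank(query, fused) : fused;
}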

Cost metrics

LLM calls dominate Recall's operational cost. The extract stage alone accounts for roughly 42% of total LLM spend — this is expected, because extraction simultaneously filters, classifies, grounds, and resolves coreferences in a single prompt. It's the most complex judgment in the pipeline.

Metric                            Type     Labels                   What it tells you
recall_llm_calls_total            counter  provider, model, stage   LLM invocations per stage
recall_llm_tokens_in_total        counter  provider, model, stage   Input tokens
recall_llm_tokens_out_total       counter  provider, model, stage   Output tokens
recall_llm_cost_usd_total         gauge    provider, stage          Accumulated cost in USD
recall_embedding_calls_total      counter  provider, model          Embedding calls
recall_embedding_cost_usd_total   counter  provider                 Embedding cost
recall_storage_bytes              gauge    namespace, type          Storage consumed

Cost is accumulated with rust_decimal::Decimal internally — not f64. This matters because floating-point accumulation across thousands of writes produces drift:

f64:     Σ(0.001, 1000 times) = 0.9999999999999893
Decimal: Σ(0.001, 1000 times) = 1.000

For financial metrics at any meaningful scale, the drift in f64 is not negligible. The Prometheus export converts to f64 at export time, but the accumulation is exact.
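
JavaScript numbers are IEEE-754 doubles, so the failure mode is easy to reproduce outside Rust. A quick illustration, using scaled integers as the exact accumulator; this shows the principle, not Recall's Decimal code.

// f64 accumulation drifts because 0.001 has no exact binary representation.
let f64Sum = 0;
for (let i = 0; i < 1000; i++) f64Sum += 0.001;
console.log(f64Sum === 1); // false: off by a tiny residual

// Exact alternative: accumulate integer micro-dollars, convert at export
// time. Same shape as Recall's Decimal-accumulate / f64-export split.
let microUsd = 0n;
for (let i = 0; i < 1000; i++) microUsd += 1000n; // 0.001 USD = 1000 micro-USD
console.log(Number(microUsd) / 1_000_000); // exactly 1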

The biggest cost lever is batch size. Default configuration processes one conversational turn per extraction call. Batching 4–8 turns per call reduces per-memory cost by approximately 30%, because the fixed overhead of the system prompt (which is cached, but still counts toward input tokens on the first call per session) is amortized across more candidates.

write_pipeline:
  extraction:
    batch_size: 6  # process up to 6 turns per LLM call
    max_turns_per_prompt: 20  # hard cap on turns per prompt
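
The arithmetic behind that ~30% figure is easy to sanity-check. The token counts and price below are placeholders, not Recall's real numbers; the point is how fixed overhead divides across the batch.

// Hypothetical input-cost model: fixed system-prompt overhead plus per-turn tokens.
const USD_PER_INPUT_TOKEN = 1e-6;  // placeholder price
const SYSTEM_PROMPT_TOKENS = 1200; // fixed overhead, paid once per call
const TOKENS_PER_TURN = 300;       // placeholder average turn size

function inputCostPerTurn(batchSize: number): number {
  const callTokens = SYSTEM_PROMPT_TOKENS + batchSize * TOKENS_PER_TURN;
  return (callTokens * USD_PER_INPUT_TOKEN) / batchSize;
}

console.log(inputCostPerTurn(1)); // 0.0015: overhead dominates every call
console.log(inputCostPerTurn(6)); // 0.0005: overhead amortized across 6 turns

// Real savings land closer to ~30% than this 3x, because output tokens don't
// amortize and prompt caching already discounts the repeated system prompt.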

The stage labels on cost metrics let you identify which part of the pipeline drives your bill. A typical breakdown at moderate scale:

extract:       42%  (dominant — single LLM call doing filter+classify+ground+resolve)
classify:      20%  (type classification pass)
ground:        15%  (grounding verification)
embeddings:    12%  (vector computation)
consolidate:    5%  (background consolidation jobs)
rerank:         4%  (cross-encoder, if enabled)
judge (scan):   2%  (LLM judge in dedup tier 3)

Quality metrics — the family nobody else ships

This is what makes Recall observability different. These metrics track memory quality directly, not just infrastructure health.

Metric                                Type     Labels              What it tells you
recall_junk_rate                      gauge    namespace           discarded / (stored + merged + discarded) per write run
recall_duplicate_rate                 gauge    namespace           merged / (stored + merged + discarded) per write run
recall_hallucination_blocked_total    counter  namespace, source   Grounding rejections at extraction
recall_contradictions_active          gauge    namespace           Current unresolved contradictions in store
recall_retrieval_precision_at_k       gauge    namespace, k        Sampled precision@k (LLM-judged, 50 samples/day)
recall_extractor_schema_drift_score   gauge    namespace           Drift indicator from the weekly drift scan
recall_faithfulness_score_mean        gauge    namespace           Mean faithfulness score (opt-in, read-time check)

The junk rate: your most important metric

recall_junk_rate is computed as:

junk_rate = discarded / (stored + merged + discarded)

A healthy junk_rate is 0.60–0.80. This sounds alarmingly high — the system is throwing away 60–80% of everything that comes in. That's correct. The pre-filter and extraction stages are supposed to be aggressive. Most conversational turns contain no durable information: greetings, clarifications, acknowledgements, tool call outputs, questions without answers. The funnel's job is to drop them.

If junk_rate falls below 0.50, something changed upstream. The most common cause is that your agent changed its conversational style — longer turns, more content per turn, fewer social exchanges — and the pre-filter patterns aren't matching as many of them. The effect: noisier writes. This shows up in retrieval precision 2–3 weeks later when the accumulated noise starts competing with real memories in top-K results.

If junk_rate rises above 0.90, your pre-filter rules are probably too aggressive, or the LLM extraction is over-refusing. Check recall_hallucination_blocked_total — if grounding rejections are high, the extraction model may be generating claims not supported by source turns, which get dropped at the grounding stage. This is less common but worth checking if you've recently changed your extraction prompt.

recall_duplicate_rate is the complementary signal:

duplicate_rate = merged / (stored + merged + discarded)

A rising duplicate rate (trending up 0.5–1% per week) usually means the extractor is re-emitting the same facts in slightly different words across sessions. The near-verbatim repeats get caught and merged, which is what drives the rate up, while paraphrases that land just below the tier-2 cosine similarity threshold slip through as new memories. Lower the dedup threshold from 0.75 to 0.70 and monitor for a week.
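
Both rates fall out of the same three outcome counts, so they're cheap to recompute anywhere. Here is a small helper that applies the thresholds from this section; the function and input shape are illustrative, not part of the Recall SDK.

// Outcome counts for one write run (from recall_writes_total, by outcome).
type WriteOutcomes = { stored: number; merged: number; discarded: number };

function qualityRates({ stored, merged, discarded }: WriteOutcomes) {
  const total = stored + merged + discarded;
  const junkRate = discarded / total;
  const duplicateRate = merged / total;
  const warnings: string[] = [];
  if (junkRate < 0.5)
    warnings.push("junk_rate low: upstream style change likely, expect noisier writes");
  if (junkRate > 0.9)
    warnings.push("junk_rate high: pre-filter too aggressive or extraction over-refusing");
  return { junkRate, duplicateRate, warnings };
}

// The sample scrape later in this post: 8473 stored, 421 merged, 12847 discarded.
console.log(qualityRates({ stored: 8473, merged: 421, discarded: 12847 }));
// junkRate ≈ 0.591, just under the healthy 0.60–0.80 band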

recall_retrieval_precision_at_k is sampled — 50 LLM-judged retrievals per day by default. Don't use it for real-time alerting; it's too noisy at that cadence. Use it for weekly trend analysis: if precision@5 is trending down across three weeks, there's a systematic quality problem that the other metrics should be pointing to.

The Prometheus scrape endpoint

Every Recall deployment exposes metrics at /v1/metrics in Prometheus text format. No extra configuration needed for embedded or self-hosted deployments.

scrape_configs:
  - job_name: 'recall'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/v1/metrics'
    scrape_interval: 30s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

A sample of what you'll see:

# HELP recall_writes_total Total write operations
# TYPE recall_writes_total counter
recall_writes_total{namespace="user_123/inbox3",outcome="stored"} 8473
recall_writes_total{namespace="user_123/inbox3",outcome="discarded"} 12847
recall_writes_total{namespace="user_123/inbox3",outcome="merged"} 421

# HELP recall_junk_rate Junk rate per write run
# TYPE recall_junk_rate gauge
recall_junk_rate{namespace="user_123/inbox3"} 0.593

# HELP recall_write_latency_seconds Latency per write pipeline stage
# TYPE recall_write_latency_seconds histogram
recall_write_latency_seconds_bucket{stage="extract",le="0.5"} 342
recall_write_latency_seconds_bucket{stage="extract",le="1.0"} 891
recall_write_latency_seconds_bucket{stage="extract",le="2.0"} 1247
recall_write_latency_seconds_sum{stage="extract"} 542.8
recall_write_latency_seconds_count{stage="extract"} 1253

For cloud deployments, managed Prometheus with pre-built Grafana dashboards ships as part of the Enterprise tier. For self-hosted, the Grafana dashboard JSON is available in the observability/ directory of the Recall repository.

OpenTelemetry traces

Every API call produces an OTel trace. Every pipeline stage is a span. LLM calls within stages carry cost and token attribution.

Trace: remember_call  [total: 509ms]
├─ span: pre_filter        [2ms]
├─ span: extract           [420ms]
│   └─ llm.provider = anthropic
│   └─ llm.model = claude-haiku-4-5
│   └─ llm.tokens_in = 340
│   └─ llm.tokens_out = 120
│   └─ llm.cost_usd = 0.00012
│   └─ extract.candidates_produced = 2
├─ span: resolve_refs      [5ms]
├─ span: dedupe            [45ms]
│   └─ span: tier1_hash    [1ms]
│   └─ span: tier2_cosine  [40ms]
│   └─ span: tier3_judge   [skipped — no ambiguous band hits]
├─ span: conflict_check    [15ms]
└─ span: persist           [22ms]

OTel exporters are available for Jaeger, Grafana Tempo, Datadog APM, Honeycomb, New Relic, and the OpenTelemetry Collector. For teams on Langfuse:

observability:
  exporters:
    - type: langfuse
      public_key: pk-lf-...
      secret_key: sk-lf-...
      host: https://cloud.langfuse.com
      operations: ["remember", "search", "build_context"]

Each Recall trace becomes a Langfuse trace; LLM calls within it become generations with cost attribution.

Trace retention defaults to 7 days hot (full traces, queryable), 7–90 days warm (10% sample), 90+ days cold (aggregated stats only). Configure per tier in observability.traces.

Three debugging workflows

"Why did Recall return this memory?"

Pass explain: true on any search call:

const result = await recall.search({
  query: "what does my manager think about the new architecture?",
  scope: { user_id, agent_id },
  explain: true
});

// Every memory in result has retrieval_detail:
console.log(result.memories[0].retrieval_detail);
// {
//   matched_retrievers: ["semantic", "entity_graph"],
//   scores: { semantic: 0.82, entity_graph: 0.90 },
//   rrf_score: 0.048,
//   rrf_rank: 1,
//   reranked: false,
//   filters_passed: ["scope", "confidence>=0.5", "not_superseded"]
// }

This tells you exactly why a specific memory ranked where it did. If you're debugging a surprising top-1 result, check whether entity_graph is boosting it — graph retrieval is not purely semantic, and a well-connected entity can surface memories that aren't the closest embedding match.

"Why was this memory NOT returned?"

When a retrieval result is missing something you expected:

const debug = await recall.debugRetrieval({
  query: "what's my manager's communication style?",
  expected_memory_id: "mem_7A2B",
  scope: { user_id, agent_id }
});

// {
//   found_in_retrievers: ["entity_graph"],
//   not_found_in: ["semantic", "bm25", "temporal", "type"],
//   semantic_similarity: 0.68,   // below the 0.70 threshold
//   rrf_rank: 8,                  // out of top 5
//   filtered_by: null,            // not filtered, just didn't rank
//   suggestion: "Lower semantic threshold from 0.70 to 0.65"
// }

A semantic similarity of 0.68 for a memory that should be relevant usually means the query and the memory are using different vocabulary for the same concept. The entity graph found it because the entity relation is explicit, but semantic search missed it because the embedding distance was below threshold. Lowering the threshold is one option; re-extracting the memory with a more representative content string is another.

"Why is cost spiking?"

const breakdown = await recall.costBreakdown({
  namespace: "user_123/inbox3",
  since: "1h ago",
  group_by: ["stage", "model", "hour"]
});
// Returns time-series cost per dimension

The most common causes of sudden cost spikes:

  • Batch size dropped: if something restarted your worker with default config, batch_size may have reset to 1. More LLM calls per turn = proportionally more cost.
  • Model pricing change: a provider updated its pricing and your cost per token is higher.
  • Traffic spike: more turns in → more candidates → more extraction calls. Check recall_writes_total at the same time.
  • Reprocessing bug: if recall_queue_depth is stuck high and cost is spiking, a job may be re-queuing the same turns. Check the worker logs for repeated trace IDs.

Alert thresholds

Recall ships built-in alert rules. These are starting points — tune to your traffic patterns:

Alert                 Condition                       Severity   Interpretation
Low junk rate         junk_rate < 0.5 for 1h          warning    Extraction not filtering; probable agent style change
Latency regression    p99_read > 500ms for 10m        critical   HNSW ef_search too high, index needs rebuild, or reranker slow
Queue backed up       queue_depth > 10000             warning    Write pipeline falling behind; check worker health
Worker stalled        worker_lag > 30m                critical   Background jobs not running; freshness scores stale
Cost spike            hourly_cost > 2x last_7d_avg    warning    Check batch size, model pricing, traffic
Contradiction surge   new_contradictions > 10 in 1h   warning    Upstream data quality issue
Hallucination rate    grounding_blocked_rate > 0.1    warning    Extractor generating ungrounded claims
Faithfulness drop     faithfulness_mean < 0.85        critical   Read-time faithfulness failing; check retriever quality
Drift detected        drift_score > 0.3               warning    Entity semantics shifting; review drift report
Schema mismatch       mixed versions in namespace     critical   Upgrade cutover incomplete

(Note the junk-rate rule: a healthy junk_rate sits at 0.60–0.80, so the failure mode to alert on is the rate falling below 0.5, meaning extraction has stopped filtering.)

Configure alert channels in recall.yaml:

alerts:
  channels:
    - type: slack_webhook
      url: https://hooks.slack.com/...
      severities: ["critical"]
    - type: email
      recipients: ["oncall@company.com"]
      severities: ["warning", "critical"]
  rules:
    latency_regression:
      enabled: true
      threshold_ms: 500
      percentile: 99
      window: 10m
      channels: ["slack_webhook"]

Alerts are deduplicated: the same alert key fires at most once within an hour, and after resolution the condition must stay clear for 15 minutes before the alert can fire again.
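
Those dedup semantics are simple enough to state in code. A model of the behavior as described above, not Recall's implementation:

// An alert key fires at most once per hour while active; after resolution,
// the condition must stay clear for 15 minutes before it may fire again.
const HOUR_MS = 3_600_000;
const STABLE_MS = 900_000;

type AlertState = { lastFiredAt?: number; resolvedAt?: number };
const alertStates = new Map<string, AlertState>();

function shouldFire(key: string, now: number): boolean {
  const s = alertStates.get(key) ?? {};
  if (s.lastFiredAt !== undefined && now - s.lastFiredAt < HOUR_MS) return false;
  if (s.resolvedAt !== undefined && now - s.resolvedAt < STABLE_MS) return false;
  alertStates.set(key, { lastFiredAt: now });
  return true;
}

function markResolved(key: string, now: number): void {
  const s = alertStates.get(key) ?? {};
  alertStates.set(key, { ...s, resolvedAt: now });
}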

Health check endpoints

Three levels:

GET /v1/health — liveness only. Returns 200 if the process is alive. Use for container orchestrator liveness probes.

GET /v1/ready — readiness. Checks DB connection pool, background worker reachability, at least one LLM provider healthy, index accessible. Returns 200 only if all pass. Use for Kubernetes readiness probes.

GET /v1/health/deep?namespace=... — namespace-level health. Checks storage quota, recent write/read success, worker schedule compliance, active alerts. Use for operator dashboards and pre-deployment checks.
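
The split matters in practice: /v1/ready is the one to gate deploys on. A minimal sketch of a deploy-script gate using fetch; the base URL and retry budget here are arbitrary choices, not defaults.

// Poll /v1/ready until the instance reports ready, or give up.
async function waitUntilReady(baseUrl: string, attempts = 30): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(`${baseUrl}/v1/ready`);
      // 200 means DB pool, workers, an LLM provider, and the index all checked out.
      if (res.status === 200) return;
    } catch {
      // Connection refused while the process boots; keep polling.
    }
    await new Promise((resolve) => setTimeout(resolve, 2000));
  }
  throw new Error("Recall instance never became ready");
}

await waitUntilReady("http://localhost:8080");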

The quality-first discipline

The hardest part of operating a memory system is that most failures are slow and quiet. Retrieval precision doesn't crash — it drifts. Costs don't spike — they creep. LLM-powered pipelines don't produce obvious errors when something changes upstream; they produce subtly different outputs that accumulate.

The quality metrics exist specifically to give you an earlier signal than user reports. recall_junk_rate will move before precision@k moves. recall_duplicate_rate will trend up before users notice repeated information in agent responses. The drift score will spike before entity resolution starts returning wrong results.

Treat quality metrics as your primary indicators. Treat latency and error rate as your secondary check that the infrastructure is functioning. That inversion — quality first, infrastructure second — is what makes memory-specific monitoring actually useful.
