Whitepaper · 2026-05-07

Scoring Functions in Recall: Confidence, Freshness, and Fusion

Georgian
Founding Engineer
Arc Team

Abstract

Every number Recall attaches to a memory — confidence, freshness, RRF rank, BM25 score, graph traversal weight — has a derivation. This paper documents the full mathematical foundation: where each number comes from, why each formula was chosen over its alternatives, how the numbers compose at retrieval time, and where the thresholds were calibrated. The aim is to make Recall's scoring transparent enough to debug and extend.

Motivation

A memory system that stores facts without knowing how confident it is in them, or how fresh they are, or how well they rank against a query, is just a database with a vector column. Recall's scoring layer is what makes stored memories into retrievable, rankable, composable evidence.

Every memory in Recall carries a confidence: f32 ∈ [0.0, 1.0] computed at persist time. Every retrieval operation produces per-memory weight scores that combine freshness, access history, and cross-retriever fusion. This paper traces those numbers from their inputs to their outputs and explains why each formula is what it is rather than something else.

Notation

Symbol       Meaning                         Range
m            A memory record                 —
q            A query                         —
τ            Freshness half-life in days     > 0
λ            MMR diversity trade-off         [0, 1]
k            RRF smoothing constant          typically 60
wᵢ           Per-retriever RRF weight        > 0
σ            RBF kernel bandwidth            default 1.0
sim(a, b)    Cosine similarity               [−1, 1]

All scores live in [0.0, 1.0] unless noted. Time is in seconds; age is converted to days before freshness calculation.


Confidence Scoring

The formula

conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t)

Four independent signals, weighted and summed. The min(1.0, ...) clamp prevents overflow when multiple signals are maxed simultaneously.

  • s — source strength. How directly the fact was stated.
  • r — repetition boost. How many times the same fact was independently observed.
  • e — extractor confidence. How reliable the model that extracted this memory is.
  • t — type prior. Baseline confidence by memory type.

The weight on s (0.45) is deliberately the largest. Source strength is the most information-dense signal: a direct statement from the user ("I work at Acme") is categorically different from a speculation ("I think they might work at Acme"). Conflating them with equal weights would corrupt the confidence signal from the start.
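
As a minimal sketch in Rust (function and parameter names here are illustrative, not Recall's actual API), the composition is a clamped weighted sum:

// Combine the four confidence signals, each already normalized to [0, 1].
// Weights mirror the formula: 0.45·s + 0.20·r + 0.25·e + 0.10·t.
fn confidence(s: f32, r: f32, e: f32, t: f32) -> f32 {
    let raw = 0.45 * s + 0.20 * r + 0.25 * e + 0.10 * t;
    raw.min(1.0) // clamp: guards against the sum exceeding 1.0 when several signals are maxed
}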

Source strength values

Source level         s value    Example
Direct statement     0.95       "I work at Acme Corp"
User confirmation    0.80       "Yes, that's right" (when asked)
Strong inference     0.70       "My commute to Acme HQ is 40 minutes"
Weak inference       0.50       "I think I mentioned this last week"
Speculation          0.30       Agent-inferred without textual evidence

The gap between Direct (0.95) and Speculation (0.30) is 0.65 — multiplied by the 0.45 weight, that's a swing of 0.45 × 0.65 = 0.29. That's the largest single-factor swing in the entire formula. It correctly places speculation well below any reasonable retrieval threshold.

Repetition boost and why it's logarithmic

r(n) = 1 − 1 / (1 + ln(1 + n))
Observations (n)    r(n)
0                   0.000
1                   0.409
2                   0.523
5                   0.642
10                  0.706
100                 0.822

A linear function would make a fact mentioned 100 times dominate every other signal. The logarithm provides genuine diminishing returns: going from 10 observations to 100 adds less (about 0.12) than the very first observation does (about 0.41). This mirrors the epistemological reality: you get most of the confidence from the first few repetitions; additional observations are confirming evidence, not new information.

Why ln(1 + n) specifically rather than a simpler log₂(n)? The 1 + offset makes r(0) = 0 exactly (no observations → no repetition signal), and the natural log provides a smooth saturation curve.
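
The formula transcribes directly (a sketch; `repetition_boost` is a hypothetical helper name):

// r(n) = 1 − 1 / (1 + ln(1 + n)); r(0) = 0 exactly.
fn repetition_boost(n: u32) -> f32 {
    1.0 - 1.0 / (1.0 + ((1 + n) as f32).ln())
}
// repetition_boost(0) = 0.000, repetition_boost(3) ≈ 0.581, repetition_boost(100) ≈ 0.822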

Extractor confidence

Model                   e value
Claude Sonnet / Opus    0.90
Claude Haiku            0.80
GPT-4 class             0.85
GPT-3.5 or smaller      0.65
Unknown model           0.65

When logprobs are available, extractor confidence is the geometric mean of the token probabilities for the extracted content (equivalently, the exponential of the mean token logprob). The table is a fallback for models that don't expose logprobs. The values reflect empirical accuracy differences in extraction tasks: Sonnet-class models hallucinate extracted facts less often.

Switching from Haiku to Sonnet adds 0.25 × (0.90 - 0.80) = 0.025 to confidence — a real but modest effect. The weight on e is 0.25, not 0.45, because the extractor model choice is less informative about truth than how directly the user stated something.

Type prior

Memory type    t value    Rationale
Entity         0.90       Named entities are asserted, rarely inferred
Event          0.85       Events are anchored in time, checkable
Fact           0.80       Default baseline
Preference     0.75       Preferences are volatile
Relation       0.70       Two-entity reasoning introduces more error surface

The type prior contributes least (weight 0.10) because it's inherent to the extraction, not to any specific piece of evidence. It sets the floor for each type's confidence, not the ceiling.

Confirmation update

When a user explicitly confirms a memory, confidence is recalculated with s = max(s_current, 0.80) (the Confirmed level) and n incremented. The result is clamped at 0.99, not 1.0. The 0.01 headroom is intentional — preserving numerical room for future updates and signaling that even confirmed memories remain fallible.
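
A sketch of the update, assuming the `confidence` and `repetition_boost` helpers above (the real update path may differ):

// Recompute confidence after an explicit user confirmation.
fn confirm(s_current: f32, n: u32, e: f32, t: f32) -> f32 {
    let s = s_current.max(0.80);      // promote source strength to the Confirmed level
    let r = repetition_boost(n + 1);  // the confirmation counts as one more observation
    confidence(s, r, e, t).min(0.99)  // cap below 1.0: confirmed memories stay fallible
}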


Freshness and Temporal Decay

Exponential decay

freshness(t) = 2^(-t/τ)

where t = age in days, τ = half-life in days. At t = τ, freshness = 0.5 exactly. This is a genuine half-life in the radioactive decay sense — the interval at which the signal is halved.

Alternatives were considered: linear decay (drops to zero, undesirable), power law (heavier tail but no clean half-life interpretation), step function (too discontinuous). Exponential wins because it has a natural half-life parameterization, never reaches zero, and is computationally trivial.

Per-type half-lives

Memory type    τ (days)    Why
Entity         365         Entities are long-lived by nature
Fact           180         Stable but can change
Relation       180         Relations follow facts
Preference     90          Preferences shift more often
Event          30          Recency dominates for events

An Event memory from 30 days ago has freshness 0.5. A Preference from 90 days ago has freshness 0.5. An Entity from 365 days ago has freshness 0.5. The half-lives are not equal — they're calibrated to how quickly different memory types become unreliable.
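
In code, the per-type decay is a lookup plus one exponential (a sketch; the enum simply mirrors the types above):

enum MemoryType { Entity, Fact, Relation, Preference, Event }

fn half_life_days(ty: &MemoryType) -> f32 {
    match ty {
        MemoryType::Entity => 365.0,
        MemoryType::Fact | MemoryType::Relation => 180.0,
        MemoryType::Preference => 90.0,
        MemoryType::Event => 30.0,
    }
}

// freshness(t) = 2^(−t/τ): halves every τ days, never reaches zero.
fn freshness(age_days: f32, ty: &MemoryType) -> f32 {
    2f32.powf(-age_days / half_life_days(ty))
}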

Access boost

Retrieval is itself a signal of relevance. A memory that gets retrieved often is being used and presumably useful:

access_boost(m) = 1 + ln(1 + m.access_count)
access_count    boost
0               1.000
1               1.693
5               2.792
10              3.398
100             5.615

Again logarithmic, for the same reason as the repetition boost: prevent popular memories from dominating retrieval. A memory accessed 1,000 times should not reduce everything else to unreachable noise.

Freshness floor

effective_freshness = max(freshness(t), 0.1)

Without this floor, a memory from five years ago would have freshness essentially zero and could never be retrieved regardless of content relevance. The floor of 0.1 means ancient memories retain 10% of their original signal — reachable when the query is specific enough, but not competitive against recent evidence.

Combined retrieval weight

weight(m) = base_retrieval_score × freshness(age(m)) × access_boost(m)

The base_retrieval_score comes from the RRF fusion step below. The multiplicative composition means all three factors must be non-negligible for a memory to rank high — a very old memory with high RRF score still gets suppressed by freshness, and a frequently-accessed memory with low RRF score still doesn't surface.
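
Putting the pieces together (a sketch reusing the helpers above, with the freshness floor applied at this step):

fn retrieval_weight(rrf_score: f32, age_days: f32, ty: &MemoryType, access_count: u32) -> f32 {
    let fresh = freshness(age_days, ty).max(0.1);        // freshness floor of 0.1
    let boost = 1.0 + (1.0 + access_count as f32).ln();  // logarithmic access boost
    rrf_score * fresh * boost                            // multiplicative composition
}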


Reciprocal Rank Fusion

Recall runs five retrievers in parallel: semantic (cosine similarity), BM25 (lexical), entity graph (typed edges), temporal (event anchoring), and graph traversal (multi-hop). Their outputs are ranked lists. These lists need to be merged into a single ranking.

Why RRF, not score-sum

The fundamental problem with score-sum fusion is that scores from different retrievers are incommensurable. A BM25 score of 12.5 and a cosine similarity of 0.87 don't live on the same scale, don't have the same distribution, and can't be directly summed.

RRF operates on ranks, not scores. Rank 1 from BM25 and rank 1 from semantic retrieval are directly comparable: both mean "the retriever's best guess." This makes RRF score-free and therefore model-agnostic.

The formula

RRF(m) = Σᵢ wᵢ / (k + rankᵢ(m))

where rankᵢ(m) is the 1-based position of memory m in retriever i's list (∞ if absent), k = 60 is the smoothing constant, and wᵢ is the per-retriever weight. [1]

Memories are sorted by descending RRF score; ties are broken by memory ID for determinism.
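
A sketch of weighted RRF over ranked ID lists (a memory absent from a list simply contributes nothing, which is equivalent to rank = ∞):

use std::collections::HashMap;

// `lists` pairs each retriever's weight wᵢ with its ranking (best first).
fn rrf_fuse(lists: &[(f32, Vec<u64>)], k: f32) -> Vec<(u64, f32)> {
    let mut scores: HashMap<u64, f32> = HashMap::new();
    for (weight, ranking) in lists {
        for (i, id) in ranking.iter().enumerate() {
            let rank = (i + 1) as f32; // 1-based rank
            *scores.entry(*id).or_insert(0.0) += *weight / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f32)> = scores.into_iter().collect();
    // Descending score; ties broken by memory ID for determinism.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap().then(a.0.cmp(&b.0)));
    fused
}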

Why k=60

From Cormack et al. (2009): k=60 minimizes the impact of high-ranking outliers while preserving fusion benefit. The spread between rank-1 and rank-50 scores at different k values:

k=10:   rank-1 = 1/11 = 0.091,  rank-50 = 1/60 = 0.017  → 5.4× spread
k=60:   rank-1 = 1/61 = 0.016,  rank-50 = 1/110 = 0.009 → 1.8× spread
k=200:  rank-1 = 1/201 = 0.005, rank-50 = 1/250 = 0.004 → 1.25× spread

k=10 over-rewards top-ranked items. k=200 flattens the ranking until everything looks equally relevant. k=60 provides balanced reward for high ranking without allowing a single retriever's top result to dominate.

Default retriever weights

Retriever            Default weight    Notes
Semantic (cosine)    1.0               Baseline
BM25 (lexical)       1.0               Equal to semantic
Entity graph         1.0               Equal weight
Temporal             1.0               Equal weight
Graph traversal      0.8               Multi-hop results are inherently lower confidence

The query optimizer adjusts these per-query. For a query with explicit entity references and high confidence, entity graph weight rises to 2.0. For a query with a precise temporal window, temporal weight rises to 1.5. For an exploratory broad query, all weights stay near 1.0.


BM25 Scoring

BM25 handles lexical matching — the case where the query uses the exact words that appear in the memory. [2]

The formula

BM25(D, Q) = Σᵢ IDF(qᵢ) × ( f(qᵢ, D) × (k₁ + 1) ) / ( f(qᵢ, D) + k₁ × (1 - b + b × |D| / avgdl) )

where:

  • f(qᵢ, D) = term frequency of query term qᵢ in memory D
  • |D| = memory length in tokens
  • avgdl = average memory length across the namespace
  • k₁ = 1.2 = term saturation constant
  • b = 0.75 = length normalization

The IDF component: IDF(qᵢ) = ln( (N - n(qᵢ) + 0.5) / (n(qᵢ) + 0.5) + 1 ) where N = total memories and n(qᵢ) = memories containing the term. IDF gives high weight to rare terms that are likely informative, and low weight to common terms that match everything.

The k₁ parameter controls term frequency saturation. A term appearing 10 times in a memory should rank higher than one appearing once — but not 10× higher. At k₁ = 1.2, the BM25 term contribution saturates quickly; the difference between 5 and 10 occurrences is small compared to the difference between 0 and 1.
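
A reference transcription of the formula, for illustration only (as noted below, production scoring is delegated to the database engines):

// Textbook BM25 for a single memory against a tokenized query.
// `doc_freq` maps a term to n(qᵢ), the number of memories containing it.
fn bm25(query: &[&str], doc: &[&str], n_docs: f32, doc_freq: &std::collections::HashMap<&str, f32>, avgdl: f32) -> f32 {
    const K1: f32 = 1.2;  // term saturation
    const B: f32 = 0.75;  // length normalization
    let dl = doc.len() as f32;
    query.iter().map(|q| {
        let tf = doc.iter().filter(|&t| t == *q).count() as f32;
        let n = doc_freq.get(q).copied().unwrap_or(0.0);
        let idf = ((n_docs - n + 0.5) / (n + 0.5) + 1.0).ln();
        idf * (tf * (K1 + 1.0)) / (tf + K1 * (1.0 - B + B * dl / avgdl))
    }).sum()
}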

Implementation in Recall: SQLite uses FTS5's built-in BM25 variant. Postgres uses ts_rank_cd with normalization flag 32, applied to a tsvector with content at weight A and predicate at weight B.


Graph Traversal Scoring

The graph traversal retriever handles multi-hop queries — when the relevant memory isn't about the queried entity directly, but about an entity connected to it.

Hop score decay

Hops from root            Score
1 (immediate neighbor)    1.0
2 (two hops)              0.6
3+ (distant)              0.35

These constants reflect the empirical drop in relevance: a 2-hop connection is more relevant than a 3-hop one, but less relevant than a 1-hop one. Beyond three hops, the signal falls below the noise threshold.

Path confidence: MIN not product

For a path root → e₁ → e₂:

path_confidence = min(conf(edge₁), conf(edge₂))

The alternative — multiplying confidences — produces exponential decay: 0.8 × 0.8 × 0.8 = 0.51 at three hops. MIN preserves weakest-link semantics: a path through a very low-confidence edge is only as trustworthy as that edge, regardless of how confident the other edges are.

traversal_score(hops, path_conf) = hop_score(hops) × path_conf

Memories about the discovered entity are passed to RRF fusion with this score.
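
A sketch of the scoring (hop counts start at 1 for an immediate neighbor; the edge list is assumed non-empty):

fn hop_score(hops: u32) -> f32 {
    match hops {
        1 => 1.0,
        2 => 0.6,
        _ => 0.35, // 3 or more hops
    }
}

fn traversal_score(hops: u32, edge_confidences: &[f32]) -> f32 {
    // Weakest-link semantics: MIN over edge confidences, not the product.
    let path_conf = edge_confidences.iter().copied().fold(1.0, f32::min);
    hop_score(hops) * path_conf
}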


Deduplication Math

Three-tier dedup runs at write time to prevent semantically equivalent memories from multiplying.

Tier 1 — hash dedup: SHA256 of normalize(type + "|" + subject + "|" + predicate + "|" + content). O(1) lookup. Catches exact duplicates.

Tier 2 — cosine dedup:

sim > 0.92  → duplicate (auto-merge)
0.85–0.92   → ambiguous (escalate to tier 3)
< 0.85      → distinct (keep both)

The 0.92 threshold for auto-merge is deliberately conservative. A cosine similarity above 0.92 for memory-length text (10–50 tokens) essentially guarantees same-fact, different phrasing. Between 0.85 and 0.92, the two memories could be the same fact or closely related but distinct facts — that ambiguity warrants an LLM judgment.

Tier 3 — LLM judge: fires in ~10% of cases (the ambiguous band). Merge semantics when duplicate is confirmed: confidence = max(old, new) with r(n+1) recalculation, access_count = sum, source turn IDs unioned.
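
The tier-2 decision that routes between these three outcomes is a pair of threshold checks (a sketch, using the configured defaults):

enum DedupVerdict { Duplicate, Ambiguous, Distinct }

fn cosine_dedup(sim: f32) -> DedupVerdict {
    if sim > 0.92 {
        DedupVerdict::Duplicate     // auto-merge
    } else if sim >= 0.85 {
        DedupVerdict::Ambiguous     // escalate to the tier-3 LLM judge
    } else {
        DedupVerdict::Distinct      // keep both
    }
}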


Diversity: Maximal Marginal Relevance

When the query is broad and exploratory, returning the top-K most relevant memories can mean returning K near-duplicates — all about the same topic, offering no breadth. MMR trades off relevance against redundancy. [4]

MMR(m, S) = λ × relevance(m, q) − (1 − λ) × max_{s ∈ S} sim(m, s)

where S = already-selected memories, λ = 0.7 (default), relevance = RRF score, sim = cosine similarity between embeddings.

The greedy selection picks the memory that maximizes MMR at each step, adds it to S, and repeats. At λ = 0.7, relevance carries 70% of the weight and novelty 30%. As λ → 1.0, MMR converges to pure relevance ranking. As λ → 0.0, it converges to maximum diversity with no relevance constraint.
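
Greedy MMR selection as a sketch (candidates are indices into a relevance array; `sim` is any cosine-similarity callback over candidate embeddings):

// Select up to k candidates by repeatedly maximizing the MMR criterion.
fn mmr_select(relevance: &[f32], sim: impl Fn(usize, usize) -> f32, k: usize, lambda: f32) -> Vec<usize> {
    let mut selected: Vec<usize> = Vec::new();
    let mut remaining: Vec<usize> = (0..relevance.len()).collect();
    while selected.len() < k && !remaining.is_empty() {
        // MMR(m, S) = λ·relevance(m) − (1 − λ)·max_{s ∈ S} sim(m, s)
        let (pos, _) = remaining.iter().enumerate()
            .map(|(pos, &m)| {
                let redundancy = selected.iter().map(|&s| sim(m, s)).fold(0.0, f32::max);
                (pos, lambda * relevance[m] - (1.0 - lambda) * redundancy)
            })
            .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
            .unwrap();
        let chosen = remaining.remove(pos);
        selected.push(chosen);
    }
    selected
}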

MMR is opt-in (default OFF) and triggered by the query optimizer for exploratory queries where specificity < 0.4.


Query Specificity

The query optimizer needs to classify queries before it can choose an appropriate retrieval plan. Specificity is its primary input.

specificity = 0.35 × entity_density
            + 0.25 × temporal_precision
            + 0.25 × lexical_rarity
            + 0.15 × type_cue_strength
Component             Computation
entity_density        resolved entity count / total query tokens
temporal_precision    1.0 if exact date, 0.7 if relative, 0.3 if vague, 0.0 if none
lexical_rarity        average IDF of query terms / max IDF in corpus
type_cue_strength     1.0 for explicit type keyword, 0.5 if implied, 0.0 if none

Condition                             Plan
Specificity ≥ 0.8                     Precise: semantic + entity only, no rerank, k=5
Entity refs with confidence > 0.85    Entity-centric: entity graph weight 2.0
Temporal window present               Temporal-anchored
Specificity < 0.4                     Exploratory: all retrievers, rerank ON, MMR ON
Default                               Hybrid balanced
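
Sketched as code (the plan names are illustrative strings, and the checks are assumed to run in the table's top-to-bottom order):

fn specificity(entity_density: f32, temporal_precision: f32, lexical_rarity: f32, type_cue: f32) -> f32 {
    0.35 * entity_density + 0.25 * temporal_precision + 0.25 * lexical_rarity + 0.15 * type_cue
}

fn choose_plan(spec: f32, entity_conf: Option<f32>, has_temporal_window: bool) -> &'static str {
    if spec >= 0.8 { "precise" }
    else if entity_conf.map_or(false, |c| c > 0.85) { "entity-centric" }
    else if has_temporal_window { "temporal-anchored" }
    else if spec < 0.4 { "exploratory" }
    else { "hybrid-balanced" }
}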

Faithfulness and Grounding

Grounding confidence penalty

When write-time grounding returns a Partial verdict:

adjusted_confidence = original_confidence − penalty

If adjusted_confidence < 0.3 (the min_confidence_after_penalty floor), the memory is dropped. This means a speculative memory (s = 0.30) with a partial verdict penalty of 0.20 produces 0.30 - 0.20 = 0.10, which is below floor and gets dropped. Intentional: speculation that can't be fully grounded should not persist.

Faithfulness score

faithfulness = supported_claims / total_claims
Risk      Condition
None      0 unsupported claims
Low       ≤ 2 non-critical unsupported
Medium    1–2 critical unsupported
High      ≥ 3 unsupported, or many critical

Layered escape rate

The three guards (write-time grounding, consistency scan, read-time faithfulness) operate on different inputs, making their failures approximately independent:

P(escapes all guards) = P(escape grounding) × P(escape scan) × P(escape faithfulness)
                      = 0.20 × 0.50 × 0.30
                      = 0.03

Down from 0.20 with a single guard. This is why all three layers exist: each catches a class of hallucinations the others miss.


All Thresholds in One Place

Every configurable threshold from the scoring system:

Parameter                                  Default    Range        Used in
RRF k                                      60         10–200       Retrieval fusion
Cosine dedup (duplicate)                   0.92       0.85–0.98    Write dedup tier 2
Cosine dedup (escalate)                    0.85       0.75–0.92    Write dedup tier 2
Consolidation similarity                   0.75       0.60–0.85    Background consolidation
Concept drift distance                     0.4        0.2–0.6      Drift detection
Concept drift relation overlap             0.5        0.3–0.7      Drift detection
Confidence floor (grounding)               0.3        0.1–0.5      Grounding verdict
Confidence cap (confirmation)              0.99       —            Confirmation update
Freshness floor                            0.1        0.0–0.3      Retrieval weight
Specificity precise threshold              0.8        0.7–0.9      Query optimizer
Specificity exploratory threshold          0.4        0.3–0.5      Query optimizer
Entity confidence (entity-centric plan)    0.85       0.7–0.9      Query optimizer
MMR λ                                      0.7        0.5–0.9      Diversity reranking
HNSW recall@10 alert                       0.95       0.90–0.99    Index monitoring
Min confidence (retrieval filter)          0.5        0.3–0.8      Retrieval policies

Every threshold is configurable. The defaults are calibrated starting points, not immutable constants — different domains may warrant different sensitivity.


Summary

Recall's scoring is a stack of composable, independently motivated formulas:

  1. Confidence captures how well-evidenced a memory is at write time (source, repetition, extractor, type).
  2. Freshness captures how recent the evidence is, with type-specific decay rates.
  3. Access boost captures how often a memory has been useful.
  4. RRF fuses multiple ranked retriever outputs without requiring commensurable scores.
  5. BM25 handles exact lexical matches with proper IDF weighting.
  6. Graph traversal handles relational queries with hop decay and weakest-link path confidence.
  7. MMR handles result diversity for exploratory queries.
  8. Query specificity routes queries to the appropriate combination of the above.

The formulas are not magic. Each is the simplest function that satisfies its requirements: logarithmic boosts prevent dominance, multiplicative composition keeps independent signals independent, rank-based fusion avoids cross-retriever score calibration. Understanding the derivation makes the system debuggable.


Appendix A: Threshold Calibration Methodology

The thresholds and weights throughout this paper are not arbitrary choices or round numbers. Each was derived from a structured calibration process on a held-out dataset, and each can — and should — be revisited when Recall is deployed in a domain that differs materially from the calibration corpus.

The calibration dataset

The primary calibration set consists of 500 diverse conversations with human-annotated ground truth: for each conversation, two independent raters marked which facts should be stored as memories, at what confidence level, and under which memory type. This annotation protocol produces a "golden" memory set — the target that the scoring formula is evaluated against.

Inter-annotator agreement is measured via Cohen's kappa. A κ value above 0.75 is required before any calibration run is considered valid. Below that threshold, disagreement among raters indicates the annotation guidelines are ambiguous, and the guidelines are updated before continuing. In practice, the initial calibration runs produced κ = 0.71 on preference-type memories — annotators disagreed on how volatile a stated preference needed to be before it should be stored at a lower confidence. The guidelines were clarified to define three tiers of preference stability, and subsequent runs reached κ = 0.79.

The 500 conversations span five domains: software development, legal research, personal productivity, healthcare coordination, and education. Domain balance prevents the formula from being inadvertently tuned to the conversational patterns of a single field.

Confidence formula weights

The four weights (0.45, 0.20, 0.25, 0.10) were chosen by constrained optimization. The objective is to minimize mean squared error between the formula's output and the rater-assigned confidence across all golden memories:

minimize  Σ (conf(m) − rater_conf(m))²
subject to  w_s + w_r + w_e + w_t = 1.0
            w_s, w_r, w_e, w_t > 0

The optimization was run using gradient descent with the constraint enforced via a softmax reparametrization. The source strength weight (0.45) emerged as the dominant factor across all domain splits. Annotators consistently rated direct statements from the user as the most trustworthy evidence regardless of repetition count — a fact said once clearly ("I use PostgreSQL for every new project") was consistently rated higher than a fact inferred many times from weak signals. This held even in the legal and healthcare domains, where professional language is more hedged and inference is more common.

The repetition weight (0.20) was the most domain-sensitive: in the software development subset, where users frequently restate technical preferences across sessions, the optimal weight was 0.24. In the legal subset, where users rarely restate facts verbatim (they rephrase precisely), the optimal weight was 0.16. The reported 0.20 is the cross-domain average; teams with heavily repetition-driven corpora should consider tuning this upward.

RRF k=60

The theoretical motivation for k=60 comes from Cormack et al. (2009), who showed it minimizes the expected rank error when fusing two retrievers of similar quality. Recall's calibration on the held-out set confirmed this: k values in the range 40–80 produced nDCG@10 scores within 0.3% of each other, forming a Pareto-optimal band where the choice of k has minimal impact on retrieval quality.

Outside this band, the degradation is asymmetric. At k=20, nDCG@10 drops by 1.8% because top-ranked results from a single retriever dominate the fusion, undermining the benefit of running multiple retrievers in parallel. At k=120, nDCG@10 drops by 0.9% because rank differentiation compresses — a memory at rank 5 and a memory at rank 20 produce nearly identical RRF contributions.

Teams with retrieval-heavy workloads — where the majority of queries are exact-match or near-exact-match — should experiment with k=30, which rewards top-ranked results more aggressively and can improve precision at the cost of recall. Teams with exploratory workloads should try k=90, which spreads RRF weight more evenly and surfaces more diverse candidates.

Cosine dedup threshold 0.92

The auto-merge threshold was evaluated at five candidate values: 0.85, 0.88, 0.90, 0.92, and 0.95. At each threshold, the calibration set's known-duplicate pairs were used to measure false positive rate (distinct memories incorrectly merged) and false negative rate (true duplicates not merged):

Threshold    False positive rate    False negative rate    Total error
0.85         12%                    1%                     13%
0.88         6%                     2%                     8%
0.90         4%                     3%                     7%
0.92         2%                     5%                     7%
0.95         0.5%                   18%                    18.5%

The 0.90 and 0.92 thresholds produce similar total error rates. The choice of 0.92 reflects an asymmetric cost judgment: incorrectly merging two distinct memories is more harmful than failing to merge two true duplicates. A false merge destroys information; a false non-merge merely adds a redundant entry. With this cost asymmetry in mind, 0.92 is preferred over 0.90. At 0.95, the false negative rate becomes unacceptably high — true duplicates that differ only in paraphrase degree are stored twice, compounding memory count without adding information.

Type prior values

The type prior values were not chosen judgmentally; they were derived empirically by measuring extractor accuracy per memory type on the calibration set and mapping through a logistic function to constrain values to the range [0.70, 0.90].

Entity extractions had the highest annotator agreement (95%): named entities appear explicitly in text, and the extractor's task is identification rather than interpretation. Relation extractions had the lowest agreement (72%): determining the correct semantic relationship between two entities requires inference, and raters often disagreed on the precise predicate. The prior values directly encode this differential — Entity at 0.90, Relation at 0.70 — making the type prior a compressed summary of extractor reliability per category.

Recalibration trigger

Two conditions should prompt recalibration. First, if your deployment domain has systematically different source strength distributions — for example, users who communicate primarily through structured, formal statements rather than conversational text — the source strength signal loses discriminating power because everything looks like a "direct statement." In this case, reduce the source strength weight toward 0.35 and increase the extractor confidence weight toward 0.30 to compensate. Second, if annotator agreement on the new domain falls below κ = 0.70 on the Preference type, recalibrate the type prior for Preference before adjusting any other parameter — the prior is the fastest-to-tune lever with the clearest interpretation.


Appendix B: Score Composition Under Missing Signals

The confidence formula assumes all four signals are available. In practice, some signals are unavailable at extraction time: the memory may be a first mention, the model may not expose logprobs, the type classifier may fail, or entity resolution may not find a canonical identifier. This appendix documents how the formula degrades in each case.

First mention (n=0)

When a fact appears for the first time, the repetition count is zero. The repetition boost r(0) = 0 exactly, by construction of the formula. The confidence formula reduces to:

conf = min(1.0, 0.45·s + 0.20·0 + 0.25·e + 0.10·t)
     = min(1.0, 0.45·s + 0.25·e + 0.10·t)

To compute the maximum possible first-mention confidence, substitute the highest plausible values: a direct statement (s = 0.95), a Sonnet-class extractor (e = 0.90), and Entity type (t = 0.90):

0.45×0.95 + 0.25×0.90 + 0.10×0.90
= 0.4275 + 0.225 + 0.090
= 0.7425

A first mention from the best possible extraction conditions starts at approximately 0.74 — meaningfully below the 0.75 confidence threshold that triggers many downstream behaviors. This is by design. The system requires at least one repetition before a memory crosses into high-confidence territory. The first observation, no matter how reliable the source, represents a single data point. The formula encodes appropriate epistemic humility about that first encounter.

Missing logprobs (e falls back to model default)

When the extraction model does not expose token logprobs, extractor confidence falls back to the model-specific table value (0.80 for Haiku, 0.90 for Sonnet). This is a coarser signal than the per-extraction geometric mean of token logprobs, which can range from 0.50 to 0.99 within the same model depending on how ambiguous the extraction was.

The cost of missing logprobs is approximately 0.05 in expected confidence error per memory. For high-confidence memories — where source strength is strong (s > 0.80) and the extraction is unambiguous — this error is small relative to the dominant source strength contribution. For borderline memories, where the true extractor confidence might be 0.61 (indicating genuine model uncertainty about the extraction), using the table default of 0.80 inflates the reported confidence by 0.19 × 0.25 = 0.048. For memories near the retrieval floor threshold, this could be the difference between storage and rejection.

Teams requiring precise confidence tracking for borderline memories should prefer extraction models whose APIs expose token logprobs rather than accepting the table fallback.

Unknown type

When the type classifier fails to assign a recognized memory type — an event that occurs in fewer than 0.5% of extractions — the type prior defaults to 0.75, the midpoint between Preference (the lowest prior) and Fact (the baseline). The memory is stored with type set to Fact as the most semantically neutral container, and a type_uncertain tag is attached to the record.

At retrieval time, type-filtered queries — which request memories of a specific type (e.g., "retrieve only Preferences") — exclude type_uncertain memories by default. This is conservative: a memory tagged type_uncertain might be a Preference, or it might be a Fact or Event. Until the type is resolved by a later classification pass, the memory participates only in unfiltered retrieval paths (semantic, BM25, and graph traversal), where its content is evaluated on its own terms rather than its categorical label.

Entity resolution failure

When the subject of a memory cannot be resolved to a canonical EntityId in the entity graph — because the surface form is ambiguous, novel, or falls outside the known entity registry — the memory is stored with a stable fallback identifier derived from the raw surface form:

subject = uuid5(NAMESPACE_DNS, normalize(raw_surface_form))

UUID5 is deterministic: the same surface form always produces the same identifier. This means two memories about "PostgreSQL" that both fail entity resolution will share a subject identifier and can still be deduplicated against each other. However, they are not connected to the canonical entity node for PostgreSQL in the entity graph, so the entity graph retriever cannot traverse to them via an entity-centric query. These memories participate only in semantic and BM25 retrieval paths until entity resolution is retried and succeeds.
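
A sketch using the uuid crate (the `normalize` helper shown is an assumption about what normalization does, not Recall's exact rules):

use uuid::Uuid;

fn normalize(s: &str) -> String {
    // Assumed normalization: lowercase and collapse internal whitespace.
    s.to_lowercase().split_whitespace().collect::<Vec<_>>().join(" ")
}

// Deterministic fallback subject ID: same surface form, same identifier.
fn fallback_subject_id(raw_surface_form: &str) -> Uuid {
    Uuid::new_v5(&Uuid::NAMESPACE_DNS, normalize(raw_surface_form).as_bytes())
}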

Entity resolution failure does not affect the confidence formula directly. The subject field is metadata, not a scoring input. The practical consequence is reduced retrieval coverage for entity-centric queries, not a lower confidence score.

Multi-turn memories

Some facts are established across multiple source turns rather than in a single utterance. A user might hedge in turn 1 ("I think we usually use PostgreSQL"), clarify in turn 2 ("well, for this project we definitely are"), and confirm in turn 3 ("yes, PostgreSQL is our team standard"). The source strength for this memory is not the average of the three turns' source levels (0.50, 0.80, 0.80 → average 0.70). Instead, source strength is taken from the most strongly-stated instance across all turns:

s = max(s_turn_1, s_turn_2, ..., s_turn_k)

This is a conservative choice: the user has explicitly provided a high-confidence statement, and the earlier hedging is part of the conversational process of arriving at that statement. Using the maximum rather than the average correctly reflects the final evidential state. The repetition count n is incremented for each turn that independently asserts the fact, so the multi-turn structure still contributes through the repetition boost channel.


Appendix C: End-to-End Worked Example

Abstract formulas are easier to reason about with a concrete instantiation. This appendix traces a single memory through the complete scoring pipeline — from raw conversation text through confidence computation, freshness decay at retrieval time, RRF fusion rank, and final context inclusion decision.

Scenario

The user says: "I always use PostgreSQL for new projects — it's been our team's standard since we joined Arrive in 2024."

Context: this is the third time the user has mentioned a PostgreSQL preference in conversations with this agent. The extraction happens via Claude Haiku. The memory was stored 90 days ago. Since storage, it has been retrieved 4 times.

Step 1: Source strength

The statement "I always use PostgreSQL for new projects" is a direct, unhedged assertion. The phrasing "I always use" is the canonical form of a direct statement — no conditional, no inference marker, no attribution to another party. Source strength: s = 0.95.

Step 2: Repetition boost

This is the third independent mention of the PostgreSQL preference (n = 3, counting from 1 for the first occurrence). Applying the formula:

r(3) = 1 − 1 / (1 + ln(1 + 3))
     = 1 − 1 / (1 + ln(4))
     = 1 − 1 / (1 + 1.3863)
     = 1 − 1 / 2.3863
     = 1 − 0.4191
     = 0.581

Repetition boost: r(3) = 0.581.

Step 3: Extractor confidence

Claude Haiku was used for extraction, and logprobs were not available for this run. The model falls back to the table default. Extractor confidence: e = 0.80.

Step 4: Type prior

A stated preference about tooling belongs to the Preference memory type. Type prior: t = 0.75.

Step 5: Confidence computation

conf = min(1.0, 0.45×0.95 + 0.20×0.581 + 0.25×0.80 + 0.10×0.75)
     = min(1.0, 0.4275 + 0.1162 + 0.200 + 0.075)
     = min(1.0, 0.8187)
     = 0.82

The memory is stored with confidence = 0.82. This exceeds the retrieval floor (0.50) comfortably.

Step 6: Freshness at retrieval time

The memory is 90 days old. Preference memories have a half-life of τ = 90 days.

freshness = 2^(-90/90) = 2^(-1) = 0.5

The freshness floor is 0.10, and 0.5 is above it, so no floor correction is needed. Freshness = 0.50.

Step 7: Access boost

The memory has been retrieved 4 times since storage.

access_boost = 1 + ln(1 + 4) = 1 + ln(5) = 1 + 1.6094 = 2.609

Access boost = 2.609. This is a substantial multiplier — the memory has demonstrated its utility by surfacing repeatedly in past retrievals.

Step 8: RRF fusion score

At retrieval time, the query is "What database does the user prefer for new projects?" The memory ranks 3rd in semantic retrieval (the embedding captures the PostgreSQL-preference relationship, but two other memories are slightly better cosine matches) and 1st in BM25 retrieval (the word "PostgreSQL" is an exact lexical match to the query).

RRF(semantic, rank=3) = 1.0 / (60 + 3) = 1/63 = 0.01587
RRF(BM25, rank=1)     = 1.0 / (60 + 1) = 1/61 = 0.01639
Total RRF             = 0.01587 + 0.01639 = 0.03226

The memory does not appear in the entity graph retriever's results (entity resolution for "Arrive" succeeded, but the graph query is centered on the user entity, not Arrive). It does not appear in the temporal retriever (no date anchor in the query). RRF score = 0.0323.

Step 9: Final retrieval weight

weight = RRF × freshness × access_boost
       = 0.0323 × 0.50 × 2.609
       = 0.0323 × 1.3045
       = 0.04213

Final retrieval weight = 0.0421.

Step 10: Context inclusion decision

The top-K cutoff for this query is 50 memories. After computing weights for all candidate memories and sorting in descending order, this memory ranks 4th. It is included in the retrieved context set.

The memory is placed in the Preferences category of the structured context, rendered as:

{
  "id": "mem_a3f9c2...",
  "type": "Preference",
  "content": "Uses PostgreSQL for new projects; team standard since joining Arrive in 2024",
  "confidence": 0.82,
  "age_days": 90
}

Interpretation

The key dynamic this example illustrates: a 90-day-old preference memory has freshness 0.50 — exactly half its original retrieval signal from the decay alone. Yet it still ranks 4th overall. The access boost of 2.609 more than compensates for the freshness penalty: the combined freshness-access multiplier is 0.50 × 2.609 = 1.305, which is greater than 1.0. A frequently-accessed memory that is moderately stale outranks a never-accessed memory that is fresh, because retrieval history is itself strong evidence of ongoing relevance.

This also shows why the logarithmic form of access boost matters. If access boost were linear (1 + access_count = 5), a memory accessed 100 times would carry a boost of 101, overwhelming every other signal. The logarithmic form 1 + ln(1 + 100) = 5.615 is significant but bounded: a memory accessed 100 times is about 2.15× more boosted than one accessed 4 times, not the roughly 20× a linear boost would give. The formula rewards consistent utility without creating an irreversible popularity lock-in that would prevent newer, more relevant memories from surfacing.

References

  1. Cormack, Clarke & Buettcher (2009). "Reciprocal Rank Fusion outperforms Condorcet and individual rank learning methods." ACM SIGIR.
  2. Robertson & Zaragoza (2009). "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends in Information Retrieval.
  3. LongMemEval: Benchmarking Retrieval-Augmented Generation for Long Conversational Histories (2025).
  4. Carbonell & Goldstein (1998). "The use of MMR, diversity-based reranking for reordering documents and producing summaries." ACM SIGIR.