The Confidence Formula

By Arc Labs Research · 14 min read

A memory store with no per-item trust signal is one big "everything is true" bucket. The confidence formula gives you a single floating-point number per memory that downstream components — ranking, conflict detection, decay — can act on consistently.

The formula

conf(m) = min(1, 0.45·s + 0.20·r + 0.25·e + 0.10·t)

Each component is in [0, 1]; the weighted sum is capped at 1.

  • s — source strength. How trustworthy is the channel that produced this memory? Ranges from 0.30 (agent speculation) to 0.95 (direct user statement).
  • r — repetition boost. Logarithmic in the number of independent observations. Builds slowly; cannot be spammed. See why logarithmic, not linear.
  • e — extractor model quality. Encodes which model produced the candidate, or — when logprobs are available — the geometric mean of the per-token probabilities (the exponentiated mean logprob) over the extracted span.
  • t — type prior. Some memory types are more reliable a priori. Entity beats Relation; Event beats Preference.
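
To make the arithmetic concrete, here is a minimal Python sketch of the formula. The weights and component values follow the tables later in this article; the constant and function names are illustrative, not from any particular library.

import math

# Weights from conf(m) = min(1, 0.45·s + 0.20·r + 0.25·e + 0.10·t)
WEIGHTS = {"source": 0.45, "repetition": 0.20, "extractor": 0.25, "type": 0.10}

# s: trustworthiness of the input channel (see the source strength table below)
SOURCE_STRENGTH = {
    "direct": 0.95, "confirmed": 0.80, "strong_inference": 0.70,
    "weak_inference": 0.50, "speculation": 0.30,
}

# t: a-priori reliability of the memory type (see the type prior table below)
TYPE_PRIOR = {
    "entity": 0.90, "event": 0.85, "fact": 0.80,
    "preference": 0.75, "relation": 0.70,
}

def repetition_boost(n: int) -> float:
    # r(n) = 1 - 1/(1 + ln(1 + n)); n = independent observations beyond the first
    return 1.0 - 1.0 / (1.0 + math.log(1.0 + n))

def confidence(source: str, n: int, extractor_quality: float, mem_type: str) -> float:
    # Weighted sum of the four components, capped at 1.0
    total = (WEIGHTS["source"] * SOURCE_STRENGTH[source]
             + WEIGHTS["repetition"] * repetition_boost(n)
             + WEIGHTS["extractor"] * extractor_quality
             + WEIGHTS["type"] * TYPE_PRIOR[mem_type])
    return min(1.0, total)

# "I work at Acme", stated directly, re-asserted in three later sessions,
# extracted by a strong model, stored as an Entity memory:
print(round(confidence("direct", 3, 0.90, "entity"), 3))  # ≈ 0.859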

Why these weights

The 0.45 / 0.20 / 0.25 / 0.10 split is calibrated, not arbitrary. Source dominates because it captures the trustworthiness of the input channel: a fact extracted from a user statement is fundamentally more trustworthy than one inferred from passing chat. Extractor quality is next; rep boost matters but should not let spam dominate; type prior is a tiebreaker.

The practical consequence: source can swing confidence by up to 0.45 × 0.65 = 0.29 (direct statement at 0.95 vs. speculation at 0.30). Extractor quality swings it by at most 0.25 × 0.25 = 0.0625. Type prior can shift it by at most 0.10 × 0.20 = 0.02. Source is doing most of the work by design.

Tune the weights for your data on a labeled eval set. Don't import them blindly — the ratios reflect a particular data distribution.

Source strength in depth

Source strength captures who said what and in what register. The five levels are not evenly spaced, and the gaps between them are intentional.

Level               s      Example
Direct statement    0.95   "I work at Acme."
Confirmed           0.80   "Yes, that's right" in response to a question
Strong inference    0.70   "My commute to Acme HQ takes 40 minutes"
Weak inference      0.50   "I think I mentioned I was at Acme?"
Speculation         0.30   Agent infers employer from a job title in a signature

The gap between direct (0.95) and confirmed (0.80) is 0.15. It exists because confirmation is parasitic on the agent's prior belief — the user is ratifying a candidate, not initiating a claim. "Yes, that's right" is more susceptible to affirmation bias than "I work at Acme" said unprompted. If the user is just agreeing to get past a prompt, that confirmation is weaker evidence than a clean first-person statement.

Strong inference (0.70) covers cases where the claim is deducible from something the user said, but they did not assert it directly. "My commute to Acme HQ" implies employment at Acme without stating it. One degree of inference from a concrete reference is worth 0.70. The gap below confirmed (0.80 → 0.70) encodes that inference from context is less reliable than an explicit yes.

Weak inference (0.50) is the murky middle: the user is uncertain themselves. "I think I mentioned..." is hedged; the source signal is degraded. Setting weak inference at 0.50 means the memory will only pass the retrieval floor if extractor and type prior compensate.

Speculation (0.30) applies when the agent originated the claim with no direct user input. An agent that infers employer from a GPG key in a signature has no user-supplied confirmation. 0.30 is low enough that speculation-sourced memories require strong repetition and high extractor quality to become retrievable. This is the right behavior: you want agent-inferred facts to earn trust over time, not be handed it.

Edge cases worth calling out:

Rhetorical questions. "Does anyone actually enjoy standups?" is not a preference claim. The extractor should skip it; if it doesn't, tag the source as speculation (0.30) because the user was not asserting a belief.

Reported speech and hearsay. "I heard that Acme is pivoting to ML" is not a direct statement about the user, and it's not a direct statement about Acme either — it's one layer of hearsay removed. The source for a memory extracted from this should be capped at weak inference (0.50), and the memory should note the indirect attribution. Without that cap, a single piece of gossip can seed a high-confidence memory about a third party.

Self-referential corrections. "Wait, I didn't say I worked at Acme — I said I interviewed there." This is a correction. The write pipeline should treat it as a conflict signal that supersedes the original claim, not as a new weak-inference observation. Source updates from corrections go through conflict resolution, not the normal confidence path.

Repetition boost: independent observations only

The repetition component rewards the same fact being seen multiple times. The formula is:

r(n) = 1 − 1 / (1 + ln(1 + n))

r(n) is the repetition boost; n counts independent observations beyond the first, so a fact seen only once has n = 0 and r = 0.

Concrete values: n=1 → 0.409, n=2 → 0.523, n=5 → 0.642, n=10 → 0.706, n=100 → 0.822. The first repetition is the biggest jump. By n=10, diminishing returns are aggressive.
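
A standalone sketch makes the diminishing returns easy to eyeball (the function name is illustrative):

import math

def repetition_boost(n: int) -> float:
    # r(n) = 1 - 1/(1 + ln(1 + n)); n = independent observations beyond the first
    return 1.0 - 1.0 / (1.0 + math.log(1.0 + n))

for n in (0, 1, 2, 5, 10, 100):
    print(n, round(repetition_boost(n), 3))
# 0 0.0, 1 0.409, 2 0.523, 5 0.642, 10 0.706, 100 0.822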

The critical constraint: n counts independent observations only. The repetition counter increments when the new observation comes from a meaningfully different source than previous ones. Three consecutive turns where the user says "I work at Acme," "I'm at Acme," and "Acme is my employer" in the same session count as one observation. They are paraphrases in the same conversational context — the new information is zero. If the same fact surfaces in a different session a week later, that is an independent observation; increment n.

What makes an observation independent?

  • Different session. The strongest independence signal. A different timestamp and session ID mean the user re-asserted the fact in a different context.
  • Different surface form with temporal separation. "I prefer dark mode" and "I always use dark themes" one hour apart in the same session is borderline. Six months apart in different sessions is clearly independent.
  • Different input modality. A fact extracted from a user's calendar is independent from the same fact extracted from a chat turn.

Independence detection is done at dedup time using the three-tier pipeline. When tier 3 (LLM judge) classifies a candidate as a duplicate of an existing memory, the system must decide whether this is a restatement in the same context or a genuinely new independent observation. The rule: if the candidate arrives in a different session more than 1 hour after the original observation and the surface form differs, increment n. Otherwise, do not.
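
That decision can be written as a short predicate. This is a sketch under the assumptions above: both memories carry a session ID and timestamp, and "surface form differs" is approximated with exact-string inequality, where a real system would reuse the dedup pipeline's similarity signal.

from datetime import datetime, timedelta

INDEPENDENCE_GAP = timedelta(hours=1)  # threshold from the rule above

def is_independent_observation(existing_text: str, existing_session: str, existing_time: datetime,
                               candidate_text: str, candidate_session: str, candidate_time: datetime) -> bool:
    # True if the candidate should increment n on the existing memory
    different_session = candidate_session != existing_session
    separated = candidate_time - existing_time > INDEPENDENCE_GAP
    different_surface = candidate_text.strip().lower() != existing_text.strip().lower()
    # Same-session paraphrases inside the gap carry no new information; do not increment
    return different_session and separated and different_surface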

When the user confirms a memory — the agent asks "are you still at Acme?" and the user says "yes" — source is updated to max(s_current, 0.80) and n is incremented by 1. The resulting confidence is re-clamped at 0.99, not 1.0, to preserve numerical headroom in the ranking layer. A confidence of exactly 1.0 would collapse relative ordering among memories that all hit the cap.
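
A sketch of that confirmation update, with illustrative field names and the same weights as above:

import math
from dataclasses import dataclass

@dataclass
class Memory:
    source_strength: float    # s
    repetitions: int          # n, independent observations beyond the first
    extractor_quality: float  # e
    type_prior: float         # t

def confirm(mem: Memory) -> float:
    # User confirms the memory: source rises to at least 0.80 and the confirmation counts as an observation
    mem.source_strength = max(mem.source_strength, 0.80)
    mem.repetitions += 1
    r = 1.0 - 1.0 / (1.0 + math.log(1.0 + mem.repetitions))
    conf = (0.45 * mem.source_strength + 0.20 * r
            + 0.25 * mem.extractor_quality + 0.10 * mem.type_prior)
    # Re-clamp at 0.99, not 1.0, to keep ordering headroom in the ranking layer
    return min(0.99, conf)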

Extractor confidence and logprobs

The extractor quality component has two modes: table-based fallback and logprob-derived.

Table-based fallback. When the extraction system has access to model identity but not token probabilities, e maps the model to a static value:

Model                       e
Claude Sonnet / Opus        0.90
GPT-4 class                 0.85
Claude Haiku                0.80
GPT-3.5 / smaller models    0.65
Unknown model               0.65

The spread is intentional. Frontier models extract structured facts with fewer hallucinations, and the table encodes that empirically. The Haiku entry (0.80) is lower than Sonnet (0.90) not because Haiku is bad, but because extraction quality degrades more on ambiguous spans — the cases where logprob variance would be highest if you had it.

Logprob-derived. When the model returns per-token log probabilities for the extracted span, e is the exponentiated mean of those logprobs, which is the geometric mean of the per-token probabilities and lands in [0, 1]:

e = exp( (1/n) · Σ log P(token_i) )

The mean token logprob over the extracted span, exponentiated to a probability: equivalently, the geometric mean of the per-token probabilities.

This gives you a per-extraction quality signal rather than a per-model one. The practical difference is significant. Consider two extractions of "I work at Acme Corp":

  • High-certainty: each token ("Acme", "Corp") has logprob near -0.1. e ≈ exp(-0.1) ≈ 0.905. Close to Sonnet's table value.
  • Uncertain: the entity tokens have logprob near -2.5 (the model was choosing between Acme, Apple, and Amazon). Over that span, e ≈ exp(-2.5) ≈ 0.082. This extraction should not be trusted regardless of which model produced it.

The logprob path catches within-model variance that the table cannot. A Sonnet extraction where the named entity was ambiguous in context can score lower than a Haiku extraction where the entity was unambiguous.

When logprobs are unavailable — common with hosted APIs that don't expose them — fall back to the table. Do not use confidence-of-extraction proxies like self-consistency sampling unless you're willing to pay the cost of multiple inference calls per extraction.
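
Both modes fit behind one helper: use logprobs when the API exposes them, otherwise fall back to the table. A sketch; the model-name keys are assumptions standing in for the table rows above.

import math
from typing import Optional, Sequence

MODEL_E = {  # table-based fallback, keyed on model identity
    "claude-sonnet": 0.90, "claude-opus": 0.90,
    "gpt-4": 0.85,
    "claude-haiku": 0.80,
    "gpt-3.5": 0.65,
}
UNKNOWN_MODEL_E = 0.65

def extractor_quality(model: str, span_logprobs: Optional[Sequence[float]] = None) -> float:
    # e component: geometric mean of token probabilities when logprobs exist, else the table value
    if span_logprobs:
        mean_logprob = sum(span_logprobs) / len(span_logprobs)
        return math.exp(mean_logprob)  # exp(mean log P) = geometric mean of P
    return MODEL_E.get(model, UNKNOWN_MODEL_E)

print(extractor_quality("claude-sonnet"))                           # 0.90 (table fallback)
print(round(extractor_quality("claude-sonnet", [-0.1, -0.1]), 3))   # 0.905 (high-certainty span)
print(round(extractor_quality("claude-sonnet", [-2.5, -2.5]), 3))   # 0.082 (ambiguous entity)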

Type prior: why the hierarchy

The type prior t encodes how reliable each memory category is a priori, independent of how the memory was sourced or extracted.

Type         t      Rationale
Entity       0.90   Named entities are discrete, verifiable
Event        0.85   Events are anchored in time
Fact         0.80   Default baseline for declarative claims
Preference   0.75   Preferences drift; prior is lower
Relation     0.70   Requires two valid entities; more error surface

Entity sits at the top (0.90) because named entities — people, companies, places, products — are either right or wrong. "Acme Corp" is a valid company or it isn't. The model either extracted the name correctly or didn't. There's little room for gradation, and that binary structure makes entity memories high-precision.

Event is 0.85 rather than 0.90 because events have more structure: a time, a place, sometimes participants. Getting the name right is not enough — the date also has to be right. But events are temporally anchored, which means they are falsifiable in a way that preferences are not.

Fact (0.80) is the baseline for declarative claims with no stronger category. "The user's primary working language is Python" is a fact. It's not an entity, and framed as a current attribute it's not a preference likely to drift; it's just a claim. 0.80 is the default.

Preference (0.75) is lower because preferences change. A user who preferred Python three years ago might prefer TypeScript now. The lower prior doesn't mean preferences are extracted worse — it means the memory system should demand more evidence before raising confidence to retrieval-dominant levels.

Relation is at the bottom (0.70) because a relation involves two entities. "Alice reports to Bob" requires Alice to be correctly identified, Bob to be correctly identified, and the relationship direction to be correct. Three places to fail. The lower prior propagates that compound uncertainty without requiring the system to explicitly track per-entity extraction quality through relation formation.

When type prior is decisive: Two memories are otherwise tied — same source strength, same repetition count, same extractor. One is an Entity memory (t=0.90), the other is a Relation memory (t=0.70). The type prior contributes 0.10 × 0.90 = 0.090 vs 0.10 × 0.70 = 0.070. A 0.02 difference in the final score. That's small in absolute terms, but it's the tiebreaker: in retrieval ranking, 0.02 consistently pushes the Entity memory higher across all ties. Over a large memory store, this adds up to measurably better precision at the top of the ranked list.

Weight sensitivity analysis

Three worked examples showing how the formula behaves in practice. All use conf(m) = min(1, 0.45·s + 0.20·r + 0.25·e + 0.10·t).

Scenario A — high-confidence entity memory: User directly stated "I work at Acme" (s=0.95). The fact has been independently re-asserted in three later sessions (n=3, r≈0.581). Extraction done by Sonnet (e=0.90). Type is Entity (t=0.90).

0.45 × 0.95  = 0.4275
0.20 × 0.581 = 0.1162
0.25 × 0.90  = 0.2250
0.10 × 0.90  = 0.0900
────────────────────
sum           = 0.8587 → conf = 0.859

This memory comfortably clears the 0.5 retrieval floor and the 0.4 render gate. It will dominate any conflicting claims with lower confidence, and its decay floor is high enough that it survives moderate staleness.

Scenario B — medium-confidence preference: User mentioned preferring dark mode in a single session (s=0.70, strong inference from "I hate white backgrounds"). No repetitions yet (n=0, r=0). Extraction by Haiku (e=0.80). Type is Preference (t=0.75).

0.45 × 0.70 = 0.3150
0.20 × 0.00 = 0.0000
0.25 × 0.80 = 0.2000
0.10 × 0.75 = 0.0750
────────────────────
sum          = 0.5900 → conf = 0.590

This memory passes the retrieval floor (barely) but won't survive much freshness decay. If the user doesn't restate the preference in subsequent sessions, r stays at 0 and the memory fades. This is correct behavior: a single inferred preference with no confirmation should fade.

Scenario C — borderline speculative memory: Agent inferred from a signature that the user works in finance (s=0.30, speculation). First observation only (r=0). Unknown model (e=0.65). Type is Fact (t=0.80).

0.45 × 0.30 = 0.1350
0.20 × 0.00 = 0.0000
0.25 × 0.65 = 0.1625
0.10 × 0.80 = 0.0800
────────────────────
sum          = 0.3775 → conf = 0.378

This memory falls below the 0.4 render gate. It exists in the store and is auditable, but it won't be injected into context. It needs either a repetition boost (a second independent observation would push r to 0.409, lifting the total to ~0.46) or a source upgrade (if the user later confirms it, s rises to 0.80 and n increments, lifting the total to ~0.68) before it becomes useful.
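
All three scenarios reduce to a few lines of arithmetic. A standalone check, with the weights and components inlined:

import math

def conf(s: float, n: int, e: float, t: float) -> float:
    r = 1.0 - 1.0 / (1.0 + math.log(1.0 + n))
    return min(1.0, 0.45 * s + 0.20 * r + 0.25 * e + 0.10 * t)

# Scenario A: direct statement, three later re-assertions, Sonnet-class extractor, Entity
print(round(conf(0.95, 3, 0.90, 0.90), 3))  # ≈ 0.859
# Scenario B: strong inference, no repetitions, Haiku-class extractor, Preference
print(round(conf(0.70, 0, 0.80, 0.75), 3))  # ≈ 0.590
# Scenario C: speculation, first observation, unknown model, Fact
print(round(conf(0.30, 0, 0.65, 0.80), 3))  # ≈ 0.378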

Where confidence shows up

  • Retrieval boost. Feeds the fusion step: high-confidence memories rank higher and stay in the candidate pool longer, all else equal (details in the pipeline section below).
  • Conflict resolution. When two memories disagree, the higher-confidence one wins; the lower-confidence one is superseded but kept for audit.
  • Decay floor. Low-confidence memories fall below the retrieval floor faster as freshness decays.
  • Render gating. Below 0.4, memories are excluded from context unless no higher-confidence alternative exists.
  • Dedup merge. When a duplicate is detected, the merged memory takes confidence = max(old.confidence, new.confidence), then recomputes r(n+1) for the combined observation count.
  • Graph edge confidence. For multi-hop entity graph traversal, path confidence = min(edge1.conf, edge2.conf). The weakest link in the chain caps the path's trustworthiness.
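
Two of those touchpoints are small enough to spell out. A sketch with illustrative names: the dedup merge keeps the surviving record's components and recomputes the repetition term for the combined observation count, and graph traversal propagates the weakest edge.

import math

def repetition_boost(n: int) -> float:
    return 1.0 - 1.0 / (1.0 + math.log(1.0 + n))

def merged_confidence(s: float, n: int, e: float, t: float) -> float:
    # Dedup merge: s, e, t come from the higher-confidence record; the duplicate adds one observation
    return min(1.0, 0.45 * s + 0.20 * repetition_boost(n + 1) + 0.25 * e + 0.10 * t)

def path_confidence(edge_confidences: list) -> float:
    # Multi-hop traversal: the weakest link caps the path
    return min(edge_confidences)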

Confidence in the retrieval pipeline

Confidence does not directly appear in the final retrieval weight — it feeds into it. The full formula for how a memory's retrieval weight is computed during fusion:

retrieval_weight = base_score × freshness(t) × (1 + ln(1 + access_count))

Confidence feeds decay floor; freshness and access boost scale the base score.

Confidence interacts with this in two places. First, the retrieval minimum: memories with conf below 0.5 are excluded from the candidate set entirely, unless no higher-confidence candidates exist for the query. This is a hard gate, not a soft penalty. Second, confidence affects the decay floor: a memory's retrievable flag is set false when freshness falls below a threshold, and that threshold is scaled by confidence — a high-confidence memory has a lower threshold, meaning it stays retrievable longer even as freshness decays.

The practical outcome: a memory with conf=0.85 might remain in the candidate pool after 90 days of no access, while an identical memory with conf=0.45 drops below the floor at 45 days. Confidence buys longevity in the store.
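
A sketch of those two interaction points in the fusion step. The article does not pin down the decay curve or the threshold constants, so the exponential half-life and the threshold scaling below are illustrative assumptions, not production values.

import math

RETRIEVAL_FLOOR = 0.5            # hard confidence gate on the candidate set
BASE_STALENESS_THRESHOLD = 0.2   # assumed freshness level below which a memory goes stale

def freshness(age_days: float, half_life_days: float = 45.0) -> float:
    # Assumed exponential decay from the stored write timestamp
    return 0.5 ** (age_days / half_life_days)

def retrieval_weight(base_score: float, age_days: float, access_count: int) -> float:
    # retrieval_weight = base_score × freshness(t) × (1 + ln(1 + access_count))
    return base_score * freshness(age_days) * (1.0 + math.log(1.0 + access_count))

def passes_candidate_gate(conf: float) -> bool:
    # Hard gate: below 0.5 the memory is excluded unless nothing better exists for the query
    return conf >= RETRIEVAL_FLOOR

def still_retrievable(conf: float, age_days: float) -> bool:
    # Decay floor: higher confidence lowers the staleness threshold, buying longevity
    threshold = BASE_STALENESS_THRESHOLD * (1.0 - conf)
    return freshness(age_days) >= threshold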

Freshness is computed at retrieval time from the stored write timestamp, not materialized at write time. The background consolidation job only materializes the retrievable flag when a memory's freshness falls below the per-confidence threshold. This keeps write costs low and lets you retroactively adjust decay parameters without reprocessing the store.

Access boost (the ln(1 + access_count) term) is independent of confidence. It rewards memories that prove their value at retrieval time — if a memory is retrieved often, the system infers it's useful and keeps it fresh. The interaction: a low-confidence memory that gets retrieved often can accumulate access boost that partially compensates for its low confidence floor. It won't match a high-confidence memory with the same access count, but it stays in the game.

Grounding penalty

Confidence has a write-time penalty when grounding is partial: the source span only loosely supports the candidate. The penalty is additive (subtracted from the weighted sum) and capped — confidence cannot drop below 0.3 from grounding alone, because a coherent candidate from a reasonable extractor is still worth retrieving.

conf′(m) = max(0.3, conf(m) − p), where p = 0.15 for partial grounding, 0.10 for unknown grounding, and 0 when supported

Partial-coverage and unknown-grounding penalties; 0.3 floor protects extraction signal.

Grounding verdicts are computed by the write pipeline against the source document. Three outcomes:

  • Supported. The extracted memory is traceable to a specific span in the source. No penalty.
  • Partial. The memory is plausible given the source but the exact claim is not verbatim or directly deducible. Penalty is -0.10 to -0.20 depending on coverage; use -0.15 as the default.
  • Not supported. The source does not support the claim. Discard the candidate; do not write it to the store.

The 0.3 floor matters because partial-grounding penalties can stack with low source strength. A speculation-sourced memory (s=0.30) with an unknown extractor (e=0.65) and a partial grounding hit could naively fall to 0.378 - 0.15 = 0.228. The floor catches it at 0.30 — still below the render gate, but preserved for audit and future reinforcement.
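
The penalty step is a small clamp. A sketch; the verdict strings follow the list above, and "unknown" covers the case where grounding could not be computed at all.

GROUNDING_PENALTY = {"supported": 0.0, "partial": 0.15, "unknown": 0.10}
GROUNDING_FLOOR = 0.3

def apply_grounding(conf: float, verdict: str) -> float:
    # Subtract the penalty but never drop below the 0.3 floor;
    # "not supported" candidates are discarded upstream and never reach this step
    return max(GROUNDING_FLOOR, conf - GROUNDING_PENALTY[verdict])

print(apply_grounding(0.378, "partial"))  # 0.3, the floor catches the naive 0.228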

Tuning guidance

The default weights are a starting point. To recalibrate for your data:

Build a labeled eval set. Take 200–500 memories from your store with known ground-truth labels (correct / incorrect / partially-correct). Label them manually or with a high-quality model judge. The set should be stratified by memory type and source level — don't let it be 90% direct statements or the calibration will only work at the high end.

Fit weights by minimizing rank inversions. For each pair (m_correct, m_incorrect), compute conf under candidate weights and check whether m_correct ranks higher. Minimize the fraction of inversions. This is a small pairwise-ranking problem; Nelder-Mead handles the step-like objective directly (gradient descent needs a smooth surrogate), and either finishes in under a minute on 500 examples.
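
A minimal sketch of that fitting loop, assuming a feature matrix whose columns are the (s, r, e, t) components and a binary label per memory (1 = correct). The objective is step-like, so Nelder-Mead is applied directly here.

import itertools
import numpy as np
from scipy.optimize import minimize

def inversion_fraction(weights, feats, labels):
    # Fraction of (correct, incorrect) pairs where the incorrect memory scores at least as high
    w = np.abs(weights)
    scores = feats @ w
    correct, incorrect = scores[labels == 1], scores[labels == 0]
    pairs = len(correct) * len(incorrect)
    inversions = sum(inc >= cor for cor, inc in itertools.product(correct, incorrect))
    return inversions / max(pairs, 1)

def fit_weights(feats: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # Start from the default 0.45/0.20/0.25/0.10 split and minimize rank inversions
    x0 = np.array([0.45, 0.20, 0.25, 0.10])
    result = minimize(inversion_fraction, x0, args=(feats, labels), method="Nelder-Mead")
    w = np.abs(result.x)
    return w / w.sum()  # renormalize so the weights sum to 1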

Validate on held-out data. Split 80/20: with a 500-example set, fit on 400 and evaluate on 100. Report precision@5 and precision@10 in the ranking. If your calibrated weights are an improvement over the defaults, ship them. If not, investigate whether your eval set is the problem.

When to change only one weight. If you are deploying for a use case where the extractor model is fixed and high-quality (e.g., always Sonnet), the extractor weight (0.25) contributes less differentiation than the default assumes. You can safely redistribute some of that weight to source or repetition. Conversely, if your user base generates a lot of preference-style memories and you've found preferences decay faster than the default type prior implies, lower t for Preference and raise it for Event.

Common failure modes

Overconfident junk. Symptom: high-scoring memories that are wrong. Common cause: source strength was set too high because the extraction pipeline classified a paraphrase as a direct statement. A user who said "Acme seems like a good company" has source strength of at most 0.50 (weak inference), not 0.95. If source classification is coarse (binary rather than five-level), every non-obvious statement gets inflated.

Diagnosis: audit the top-50 memories by confidence. For each one, inspect the raw source span and verify the source level assignment. A precision-at-50 below 0.80 means source classification is broken.

Underconfident valuable memories. Symptom: correct memories not being retrieved. Common cause: the repetition counter was reset on a schema migration or dedup pass and n was not restored. A memory that had accumulated n=8 (r≈0.69) now has n=0 (r=0), dropping confidence by 0.20 × 0.69 ≈ 0.14. With a tight retrieval floor, that's enough to fall out of the candidate set.

Diagnosis: pull memories that users reference ("I told you I prefer...") and check their confidence. If correct memories are below 0.5, inspect repetition counts first — they're the component most likely to be zeroed by infrastructure operations.

Confidence stable but wrong. The formula has no mechanism to detect that a previously-true memory has become false. User changed jobs; the Acme memory has conf=0.85 and will stay at 0.85 indefinitely. This is why confidence is paired with freshness decay and concept drift detection. Confidence measures how well the claim was established at write time; it does not measure whether the claim is still true. For volatile facts (job, location, relationship status), pair high confidence with short half-lives or drift monitoring.

Type prior miscategorized. A memory extracted as a Fact that should be a Preference (or vice versa) carries the wrong t. Fact (0.80) vs. Preference (0.75) is only a 0.10 × 0.05 = 0.005 swing, so this is usually noise. But Relation (0.70) vs Entity (0.90) miscategorization produces a 0.10 × 0.20 = 0.02 swing — visible in tight ranking scenarios. Type classification should be validated in the same eval set as weight tuning.

Formula output too flat. If most memories in your store land between 0.55 and 0.70, the formula has low discrimination power. This happens when your extractor always uses the same model (e constant), all memories come from a single source level (s constant), and repetition hasn't had time to build up (r all near 0). Type prior and small source variation are then your only differentiators. The long-run fix is time: repetition builds as the system accumulates sessions. In the short run, tighten the retrieval floor to 0.55 so that the top of your flat distribution still stands out.
