Why Logarithmic Repetition Boost
When the same memory is re-observed, the confidence formula's repetition component should increase. The question is: by how much, and on what curve?
The log boost formula
r(n) = 1 − 1 / (1 + ln(1 + n))
r(n) is the repetition boost as a function of observation count.
The table below shows sampled values:
| Observations (n) | r(n) | Marginal gain over previous row |
|---|---|---|
| 0 | 0.000 | — |
| 1 | 0.409 | +0.409 |
| 2 | 0.523 | +0.114 |
| 5 | 0.642 | +0.118 (3 more obs) |
| 10 | 0.706 | +0.064 (5 more obs) |
| 100 | 0.822 | +0.116 (90 more obs) |
| 1000 | 0.874 | +0.052 (900 more obs) |
The first observation moves the boost from 0 to 0.409, a substantial jump. The 1000th observation moves it from 0.874 to 0.874, a change only in the fourth decimal place. That diminishing character is the entire point.
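The curve is cheap to verify directly. A minimal sketch (the function name is illustrative, not from any particular codebase):

```python
import math

def repetition_boost(n: int) -> float:
    """r(n) = 1 - 1/(1 + ln(1 + n)): 0 at n=0, asymptotic to 1, never reaching it."""
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

for n in (0, 1, 2, 5, 10, 100, 1000):
    print(f"n={n:>4}  r(n)={repetition_boost(n):.3f}")
```

Running this reproduces the sampled values above to three decimal places.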
The math in depth
Start with the goal: a function that maps observation count n ∈ {0, 1, 2, ...} to a boost value in [0, 1), is monotonically increasing, front-loaded, and asymptotic to 1 without reaching it.
A natural first candidate is 1 − 1/n, but it is undefined at n=0 and
hits 0 for n=1 — wrong in both directions.
Shift by 1: 1 − 1/(1 + n). Now r(0)=0 and r(1)=0.5. But this is hyperbolic, not logarithmic, and it saturates far too quickly: r(10) ≈ 0.909 and r(100) ≈ 0.990, indistinguishable from 1. Past a handful of observations every count earns effectively maximum credit, which throws away the distinction between a fact seen ten times and a fact seen a thousand times.
Replace n with ln(1 + n) inside the denominator:
r(n) = 1 − 1 / (1 + ln(1 + n))

The formula carries two offsets, and each does a distinct job. The inner 1 + keeps the logarithm defined at n=0: ln(1 + 0) = ln(1) = 0, where a bare ln(n) would be undefined. The outer 1 + in the denominator keeps the denominator from being zero when the log is zero, so r(0) = 1 − 1/1 = 0 falls out cleanly, with no separate case for n=0.
The outer 1 − flips the function from "starts high, decays to 0" into
"starts at 0, grows toward 1." The asymptote comes from what happens as
n→∞: ln(1+n)→∞, so 1/(1+ln(1+n))→0, so r(n)→1. It never reaches 1
because the fraction never reaches 0 for finite n.
In practice the function saturates well before ∞. By n=10 you are at r(10) ≈ 0.706, leaving only 0.294 of headroom. By n=100 you are at 0.822, leaving 0.178. The marginal contribution of additional observations past n=10 is real but small: each doubling of observation count adds roughly 0.05 to the boost, which at a weight of 0.20 in the confidence formula amounts to 0.01 per doubling.
Computationally the function is a single log call plus a few additions and a division. There is no lookup table, no piecewise definition, no tunable parameter to overfit. That simplicity is deliberate — memory systems need functions you can reason about without running simulations.
The 0.20 weight in the confidence formula
The full confidence formula is:
conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t)

Where:
- s = source strength (0–1)
- r = repetition boost, i.e. r(n) from above
- e = extractor quality (0–1)
- t = type prior (0–1)
Repetition gets 0.20 — second-lowest among the four components. This is intentional. Source strength (0.45) is the primary driver because a single high-quality source — an explicit user statement, a structured form field, a verified document — is more informative than ten casual restatements. Extractor quality (0.25) ranks second because the extraction step introduces model-level uncertainty that has to be discounted.
Repetition at 0.20 reflects its role as corroboration, not foundation. Repeated observations strengthen a memory's confidence; they don't substitute for quality signal at the first observation. A memory with low source strength that has been seen 1000 times still only reaches:
conf = 0.45·(0.20) + 0.20·r(1000) + 0.25·(0.50) + 0.10·(0.50)
= 0.09 + 0.20·0.874 + 0.125 + 0.05
= 0.09 + 0.175 + 0.125 + 0.05
= 0.440

That's below the 0.50 threshold used for active retrieval in most deployments. Spam alone can't get a low-quality memory into production use, even at 1000 observations. That ceiling is the point.
Type prior (0.10) is lowest because it encodes a prior probability of the memory type being accurate — useful for calibration but not for discrimination between memories of the same type.
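The weighted sum is a direct transcription into code. A sketch, with the function and argument names chosen here for illustration:

```python
import math

def repetition_boost(n: int) -> float:
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

def confidence(source: float, n_obs: int, extractor: float, type_prior: float) -> float:
    """conf(m) = min(1.0, 0.45*s + 0.20*r(n) + 0.25*e + 0.10*t)."""
    score = (0.45 * source
             + 0.20 * repetition_boost(n_obs)
             + 0.25 * extractor
             + 0.10 * type_prior)
    return min(1.0, score)

# The spam scenario from the text: weak source, 1000 observations, middling everything else.
print(round(confidence(source=0.20, n_obs=1000, extractor=0.50, type_prior=0.50), 2))  # 0.44
```

Note the cap at 1.0 is applied to the whole sum, not to individual components.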
Independence requirement
The repetition counter only increments for independent observations. Three identical back-to-back turns count as one observation. This is not optional: the independence requirement is the entire spam defense.
An independent observation meets all three conditions:
- Different turn. Two extractions from the same conversational turn always count as one, even if the extraction model surfaces the fact twice (e.g., once in the main extraction pass, once in a dedup check).
- Different session. Repetitions within the same session reflect the user re-emphasizing a point in context, which is weaker signal than the same fact surfacing in two completely separate contexts. A session boundary is the minimum independence gate.
- Different surface form (preferred, not required). "I use dark mode" and "I prefer dark interfaces" are more independent than "I use dark mode" repeated verbatim. Surface form diversity suggests the user genuinely holds the belief rather than copy-pasting. This is scored heuristically rather than enforced as a hard gate, but it informs whether a marginal increment is counted.
Duplicate detection is distinct from re-observation counting. When the dedup stage determines two stored memories are semantically identical, the question is whether the second is a restatement (same session, same source) or a genuine independent observation (different session, different source). Only the latter increments n.
In practice, the dedup pipeline runs a similarity threshold check and a source-metadata comparison. If both memories come from different sessions and have similarity ≥ 0.92 (the threshold for "same fact"), they are merged and n is incremented. If they come from the same session, they are merged but n stays the same — you saw the same thing twice in one conversation, not twice independently.
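The gate logic reduces to a small decision function. The class and field names here are hypothetical; only the two rules — the 0.92 similarity threshold and the session comparison — come from the text:

```python
from dataclasses import dataclass

SAME_FACT_SIMILARITY = 0.92  # threshold above which two memories are "the same fact"

@dataclass
class Memory:
    text: str
    session_id: str
    n_obs: int = 1

def merge_decision(old: Memory, new: Memory, similarity: float) -> tuple[bool, int]:
    """Return (merge?, resulting n). n only grows across session boundaries."""
    if similarity < SAME_FACT_SIMILARITY:
        return False, old.n_obs              # different facts: keep both, no merge
    if new.session_id == old.session_id:
        return True, old.n_obs               # same-session restatement: merge, n unchanged
    return True, old.n_obs + new.n_obs       # independent re-observation: merge and count

print(merge_decision(Memory("dark mode", "s1", 3), Memory("dark interfaces", "s2", 2), 0.95))
# (True, 5)
```

The same-session branch is what collapses a thousand in-conversation repeats into a single observation.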
r(n) vs access_boost: two logs, different purposes
The repetition boost r(n) and the access_boost multiplier look similar — both are logarithmic functions of a count. They serve completely different roles at completely different points in the pipeline.
r(n) = 1 − 1 / (1 + ln(1 + n)) # write-time, confidence component
access_boost = 1 + ln(1 + access_count) # retrieval-time, ranking multiplier

r(n) is computed at write time and baked into the stored confidence
score. It measures how many times the same fact was independently
extracted, across sessions, before the memory was written or last merged.
It is static between dedup merges. When you retrieve a memory, r(n) is
already encoded in the stored confidence field — you don't recompute it.
access_boost is computed at retrieval time and used as a ranking
multiplier on top of cosine similarity. It measures how many times this
memory has been retrieved and surfaced to the agent. It is dynamic — every
retrieval increments access_count, and the next retrieval uses the
updated value.
The practical consequence: a memory can have high r(n) but low access_boost (facts that were observed many times before any retrieval — common for batch-ingested preferences) or low r(n) but high access_boost (a fact seen once but retrieved constantly — common for a user's name). Treating them as the same would corrupt both signals.
The formulas are superficially similar because both model diminishing returns. But the base of the log, the constant offsets, and the scale are tuned independently. access_boost is a multiplicative weight applied during ranking; r(n) is an additive component inside a weighted confidence sum. Conflating them is a common implementation mistake.
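Side by side, the separation is easy to keep straight. A sketch (function names illustrative; the ranking expression assumes a plain similarity-times-boost score, which is a simplification):

```python
import math

def repetition_boost(n: int) -> float:
    """Write-time: additive component of stored confidence, bounded in [0, 1)."""
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

def access_boost(access_count: int) -> float:
    """Retrieval-time: multiplicative ranking weight, unbounded above."""
    return 1.0 + math.log(1 + access_count)

def rank_score(cosine_similarity: float, access_count: int) -> float:
    # r(n) never appears here: it is already baked into the stored confidence field.
    return cosine_similarity * access_boost(access_count)

print(round(rank_score(0.80, 5), 2))  # 0.80 * (1 + ln 6)
```

The type signatures alone capture the distinction: one returns a bounded component for a sum, the other an unbounded multiplier.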
Spam resistance: worked example
Suppose an adversarial extraction pipeline submits the same fact 1000 times in one session (e.g., a poorly-configured extraction loop repeats every turn, a user pastes the same text 1000 times, or a test suite generates synthetic repetitions).
Because all 1000 observations come from the same session, only the first counts as an independent observation (session gate). n=1. r(1) = 0.409.
Now suppose the spam is more sophisticated: 1000 separate sessions, each containing one mention. This passes the session gate. n=1000. r(1000) = 1 − 1/(1 + ln(1001)) ≈ 1 − 1/(1 + 6.909) ≈ 1 − 0.126 ≈ 0.874.
The maximum possible contribution to confidence from repetition alone: 0.20 × 0.874 ≈ 0.175.
Even with 1000 completely independent sessions all asserting the same fact, repetition contributes at most 0.175 to the confidence score. The remaining confidence must come from source strength, extractor quality, and type prior. A fact manufactured by spam with no genuine source signal cannot cross the active-retrieval threshold on repetition alone.
To put that in perspective: a single high-quality source (direct user statement, source strength ≈ 0.90) contributes 0.45 × 0.90 = 0.405 to confidence. That one observation outweighs 1000 spam observations in repetition contribution by a factor of more than 2×.
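The comparison is one line of arithmetic each way:

```python
import math

def repetition_boost(n: int) -> float:
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

# 1000 independent spam sessions: repetition contribution and nothing else.
spam = 0.20 * repetition_boost(1000)
# One direct user statement (source strength ~0.90): source contribution only.
quality = 0.45 * 0.90

print(round(spam, 3), round(quality, 3))  # 0.175 0.405
```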
Dedup merge semantics and r(n) recalculation
When the dedup stage determines two stored memories refer to the same fact, it merges them according to these rules:
merged.confidence = max(old.confidence, new.confidence)
with r(n_old + n_new) recalculation
merged.access_count = old.access_count + new.access_count
merged.source_turn_ids = union(old.source_turn_ids, new.source_turn_ids)

The confidence recalculation is the important part. Finding a duplicate observation increases n and therefore increases r(n). If you have a memory with n=3 (r ≈ 0.581) and you discover a duplicate that was stored separately with n=2 (r ≈ 0.523), the merged memory has n=5 (r ≈ 0.642). The confidence of the merged memory is recomputed using the new r value.
This means dedup is not just a storage optimization: it actively improves confidence calibration. Two separately-stored versions of the same fact, each with modest confidence, merge into a single memory with higher confidence. The system becomes more certain as it discovers that different sessions independently arrived at the same conclusion.
The max(old.confidence, new.confidence) baseline ensures that merging never decreases confidence. If one version had a high-quality source and the other a weak one, the merged version inherits the high-quality source's confidence floor before the r(n) recalculation adds further boost.
Source turn IDs are unioned rather than overwritten to preserve the audit trail. You can always inspect which sessions contributed to a merged memory; this matters for debugging retrieval quality and for GDPR deletion requests (delete the session → recalculate without its contributions).
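The merge rules can be sketched in a few lines. The field names, and the decision to store the non-repetition part of confidence separately as `base_conf` so the repetition term can be swapped out, are assumptions for illustration:

```python
import math

W_REP = 0.20  # repetition weight in the confidence formula

def repetition_boost(n: int) -> float:
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

def merge(old: dict, new: dict) -> dict:
    """Merge two duplicate memories (hypothetical schema)."""
    n = old["n_obs"] + new["n_obs"]
    base = max(old["base_conf"], new["base_conf"])  # floor from the stronger memory
    return {
        "n_obs": n,
        "base_conf": base,
        "confidence": min(1.0, base + W_REP * repetition_boost(n)),  # recomputed r(n)
        "access_count": old["access_count"] + new["access_count"],
        "source_turn_ids": old["source_turn_ids"] | new["source_turn_ids"],  # audit trail
    }

a = {"n_obs": 3, "base_conf": 0.40, "access_count": 2, "source_turn_ids": {"t1", "t2"}}
b = {"n_obs": 2, "base_conf": 0.35, "access_count": 1, "source_turn_ids": {"t3"}}
m = merge(a, b)
print(m["n_obs"], round(m["confidence"], 3))  # 5 0.528
```

The merged confidence exceeds either input's, which is exactly the calibration improvement described above.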
Preference vs fact vs event
Repetition signal varies enormously by memory type:
Preferences are the primary use case for high repetition counts. A user who mentions "dark mode" in five separate conversations across six months is providing strong evidence — not just of the preference, but of its persistence. Users don't spontaneously bring up UI preferences unless they matter. Repetition in this category is informative and should be rewarded. r(5) ≈ 0.642 in the confidence formula, contributing 0.20 × 0.642 ≈ 0.128 to total confidence on top of whatever the source and extractor contribute.
Facts (biographical, structural — job, location, languages spoken) repeat moderately. A user mentions their job once in an introduction, once when discussing a work problem, once when explaining context. Three observations, n=3, r(3) ≈ 0.581. The repetition signal is meaningful but not dominant — the primary confidence driver for facts is source strength, since facts are often stated explicitly and clearly.
Events almost never repeat — every event is unique by definition. "I flew to Tokyo last Tuesday" is not strengthened by a second observation; a second observation is more likely to be a restatement or a retrieval artifact than a genuine independent sighting of the same event. For events, r(n) stays near r(1) = 0.409 almost always, and the confidence is driven by extractor quality and the type prior. Tuning the system to wait for repetition before trusting event memories would cause most events to never reach active-retrieval confidence.
This asymmetry implies that the appropriate confidence thresholds differ by type. A preference with n=5 deserves high confidence. An event with n=5 should raise a flag — why is the same event appearing five times? — before contributing to confidence.
Cross-session repetition: the temporal dimension
Consider a user who mentions "I prefer dark mode" in five separate conversations over six months. Compare that to a user who says "dark mode, dark mode, dark mode, dark mode, dark mode" five times in one conversation turn.
Both have n=5 if you count naively. But they are not equally informative.
The first case: five independent sessions, spaced over 200 days. Each session was a different context (a bug report, a feature request, a casual greeting, a productivity question, a settings update). The user had no reason to think "I should mention dark mode" — it surfaced organically each time. That's robust evidence. r(5) ≈ 0.642 and the confidence is genuinely earned.
The second case: five mentions in one conversational turn. This fails the session gate and counts as n=1 regardless of raw count. r(1) = 0.409. The user may be emphasizing, may be agitated, may be testing the system. One turn is one turn.
The session boundary is the temporal independence gate. But session spacing also matters directionally — five sessions in one hour is different from five sessions in six months. The current formula doesn't encode temporal spacing of independent sessions (that's captured separately via the freshness decay component), but session spacing is a consideration for future refinements to the independence detection heuristic.
In the cross-session scenario, the access_boost multiplier compounds the effect. A memory that has been retrieved in five different sessions will have access_count ≥ 5. access_boost = 1 + ln(6) ≈ 2.79. That multiplier amplifies the memory's rank in future retrievals, making it more likely to surface again. The combination of high r(n) (strong write-time confidence) and high access_boost (strong retrieval-time rank) creates a well-reinforced memory that consistently surfaces in relevant contexts.
Alternatives considered
The chosen form was not the first candidate evaluated. Three alternatives were considered and rejected:
Square root: r(n) = sqrt(n) / (1 + sqrt(n)). This has a similar front-loaded shape but decays more slowly. At n=1000, r(1000) ≈ 0.97 — leaving substantial room for spam to accumulate confidence. The tail behavior is not aggressive enough. You would need to cap it manually, which re-introduces the "Cap at N" problem.
Sigmoid: r(n) = 1 / (1 + exp(−k·(n − n0))). The sigmoid has an S-shaped curve — slow at low n, fast in the middle, then saturating. This is useful for systems with a hard threshold ("once we see 5 observations, we trust it; fewer than 3, we don't"). For Recall's use case — continuous boosting with no hard gate — the sigmoid's slow behavior at low n is a liability. The first observation should matter substantially; sigmoid undersells it.
Linear cap: r(n) = min(n/N, 1.0) for some N. Simple to implement. Completely linear up to N, then flat. The problem is the cliff: everything above N is treated identically. A memory with n=11 and one with n=1000 are indistinguishable if N=10. This discards real information, and the choice of N is arbitrary — there's no principled reason to treat 10 as a ceiling.
The chosen 1 − 1/(1 + ln(1 + n)) wins on all three criteria: the tail
behavior is aggressive enough to resist spam, the low-n behavior is
meaningful, and there are no tunable parameters to overfit.
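Tabulating the rejected candidates against the chosen form makes the comparison concrete. A sketch (the sigmoid is omitted since its k and n0 were never fixed; N=10 for the linear cap is the example value from the text):

```python
import math

def log_boost(n: int) -> float:      # chosen: 1 - 1/(1 + ln(1 + n))
    return 1.0 - 1.0 / (1.0 + math.log(1 + n))

def sqrt_boost(n: int) -> float:     # rejected: tail not aggressive enough
    return math.sqrt(n) / (1.0 + math.sqrt(n))

def linear_cap(n: int, N: int = 10) -> float:  # rejected: cliff at N
    return min(n / N, 1.0)

for n in (1, 10, 1000):
    print(n, round(log_boost(n), 3), round(sqrt_boost(n), 3), linear_cap(n))
```

At n=1000 the square root sits near 0.97 while the log form holds at 0.874, and the linear cap has been flat (and information-free) since n=10.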