Typed Memory: Beyond Flat Text
The simplest memory schema is "string in, string out": every memory is a piece of text; retrieval is text similarity. It works for prototypes and breaks at scale. The reason isn't capacity — it's that flat text loses information that downstream components need.
What types enable
- Type-aware retrieval. A query about configuration should consult only preferences. A "what did I do last week?" query should consult only events. Type as a retrieval filter eliminates noise that flat text systems must rerank against.
- Type-driven supersession. A new preference supersedes the old one (you changed your mind). A new event does not supersede the old one (both happened). Without types, "supersede" has no rule.
- Type-specific decay. Events are local in time; facts are not. A 6-month rule for facts is too aggressive; a 6-month rule for events is too lenient. Types let decay match reality.
- Type-specific confidence priors. Explicit user preferences are more trustworthy a priori than inferred relations. The confidence formula's type-prior component encodes this.
The five canonical types
- Fact — a stable predicate.
{"{ subject: \"user\", predicate: \"works-at\", object: \"Volkswagen\" }"} - Preference — a mutable choice.
{"{ user: \"u1\", preference: \"dark-mode\", confidence: 0.93 }"} - Event — a timestamped occurrence.
{"{ user: \"u1\", event: \"joined-platform-team\", at: \"2026-05-09\" }"} - Entity — an identity.
{"{ id: \"e_volkswagen\", type: \"organization\", aliases: [\"VW\", \"Volkswagen AG\"] }"} - Relation — a typed edge.
{"{ from: \"u1\", relation: \"reports-to\", to: \"e_sarah\", since: \"2026-05-09\" }"}
The cost of typing
One small-LLM classifier call per surviving candidate (after pre-filter and extraction). At 100K turns/day with 25% surviving to classify, that's 25K classifier calls/day at maybe $2.50/day. Negligible.
When flat is fine
- Single-session agents. A coding session that won't persist beyond the IDE close. No supersession needed; no decay needed. Flat text retrieves fine for the window.
- Read-only memory. If your "memory" is actually a static curated KB, you're doing RAG, not memory — and types don't pay back.
- Tiny stores. Below ~1000 memories per user, flat text is good enough. Most apps don't stay below that.
Flat-text storage collapsed temporal precision; structured systems retained 80%+ accuracy on temporal queries vs. 5–65% for flat systems.
— LongMemEval (2024)
Schema evolution is solvable
A common objection: "we'll get the types wrong and have to migrate." Yes — but migrations on typed memory are tractable. Add a new type with a backfill classifier; sunset an old type with a recategorization pass. Flat-text systems also have schema problems (everyone invents their own bullet-point conventions), they just don't admit them.
What flat text actually loses — concrete examples
Abstract arguments for typed schemas are easy to dismiss. Concrete failure scenarios are harder to ignore. The following three scenarios describe failure modes that appear in production flat-text memory systems, not theoretical ones.
Scenario 1 — The preference overwrite problem. A user tells the agent "I prefer dark mode" on day 1. The flat system writes one memory: "user prefers dark mode." Sixty days later, the user says "actually I switched to light mode." The flat system now has two memories: "user prefers dark mode" and "user switched to light mode." When the system queries "what display mode does the user prefer?", semantic retrieval returns both strings. They score similarly — both discuss the user's display mode preference — typically within 0.05 cosine similarity of each other. The LLM sees both in context. It may correctly infer that the more recent statement supersedes the earlier one, but only if the timestamp is visible in the text, only if the LLM attends to that timestamp, and only if recency is treated as a supersession signal rather than a recency bias correction. Empirically, this fails roughly half the time: the LLM picks the memory in the better position in context, which is not reliably the newer one.
With typed Preference memory, the write pipeline identifies the second statement as a Preference with predicate_is_stateful = true. The system queries the store for existing Preferences with the same preference_key (editor.theme). It finds one. The existing memory's status is set to superseded and superseded_by is populated with the new memory's ID. At retrieval time, the store filter status = active returns exactly one Preference memory for display mode. The answer is structurally guaranteed to be correct — no LLM reasoning required, no position-in-context sensitivity.
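A minimal sketch of that write path. The store here is an in-memory list and the field set is trimmed to what the supersession check needs; both are illustrative rather than a prescribed implementation:

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

@dataclass
class Preference:
    preference_key: str
    preference_value: str
    id: str = field(default_factory=lambda: f"mem_{uuid.uuid4().hex[:8]}")
    status: str = "active"
    superseded_by: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def write_preference(store, new_pref):
    # Supersede any active Preference sharing the same key, then insert the new one.
    for existing in store:
        if existing.status == "active" and existing.preference_key == new_pref.preference_key:
            existing.status = "superseded"
            existing.superseded_by = new_pref.id
    store.append(new_pref)
    return new_pref

# Day 1: dark mode. Day 60: light mode. Retrieval filters on status = "active".
store = []
write_preference(store, Preference("editor.theme", "dark"))
write_preference(store, Preference("editor.theme", "light"))
active = [p for p in store if p.status == "active" and p.preference_key == "editor.theme"]
assert len(active) == 1 and active[0].preference_value == "light"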
Scenario 2 — The temporal confusion problem. The store contains two memories written at different times: "User joined the platform team on May 9th" (an event) and "User works at Volkswagen" (a stable fact). A user asks: "What happened last Tuesday?" In a flat system, semantic retrieval scores both memories against this query. Both involve the user's professional context. The event memory matches well because of the date. The fact memory also scores reasonably because both are about the user's work life and the embedding space does not separate temporal from atemporal statements cleanly. The LLM receives both in context. It may correctly exclude the Volkswagen fact as non-responsive, but the presence of the irrelevant memory consumes tokens, increases the chance of hallucination, and complicates the model's task.
At scale — a store with 10,000 memories, many of which are facts about the user's professional life — the query "what happened last Tuesday?" without type filtering returns dozens of facts alongside the handful of relevant events. The model cannot reliably sift them. With typed Event memory: the temporal retriever filters to type = Event first, then applies a temporal window filter (event_at BETWEEN Tuesday_start AND Tuesday_end). The result set contains only events. No facts, no preferences, no entities appear unless they are explicitly referenced by the matching events. Temporal confusion is eliminated at the retrieval step, not deferred to the LLM.
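A sketch of the type-first filter order, assuming memories are plain records carrying type, event_at, and status fields; the storage layer here is illustrative:

from datetime import datetime, timedelta, timezone

# Illustrative rows; only the fields the filter needs are shown.
memories = [
    {"type": "Event", "text": "joined platform team",
     "event_at": datetime(2026, 5, 5, tzinfo=timezone.utc), "status": "active"},
    {"type": "Fact", "text": "works at Volkswagen",
     "event_at": None, "status": "active"},
]

def events_in_window(memories, window_start, window_end):
    # Type filter first, then the temporal window: facts never reach the candidate set.
    return [
        m for m in memories
        if m["type"] == "Event"
        and m["status"] == "active"
        and m["event_at"] is not None
        and window_start <= m["event_at"] < window_end
    ]

tuesday_start = datetime(2026, 5, 5, tzinfo=timezone.utc)
tuesday_end = tuesday_start + timedelta(days=1)
print(events_in_window(memories, tuesday_start, tuesday_end))
# Only the "joined platform team" event is returned; the Volkswagen fact is excluded structurally.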
Scenario 3 — The entity fragmentation problem. The store contains three flat-text memories written at different times: "Priya manages the project," "Priya is the engineering manager," and "Sarah works for Priya." In a flat-text system, these are three independent strings. They share the name "Priya" but the system does not know that all three refer to the same entity. Query: "Who does Sarah work for, and what is their role?" The system retrieves all three because they all mention Priya, but it cannot chain them. It returns the strings as a list and the LLM must infer the connection. This works sometimes, but fails when the connecting entity name is not identical across memories — "Priya" vs "P. Sharma" vs "the EM" — or when the chain is longer than one hop.
With typed Entity and Relation memory: "Priya" resolves to an Entity record e_priya with aliases ["Priya", "P. Sharma", "the EM"]. "Priya is the engineering manager" becomes a Fact with subject e_priya and predicate has-role. "Sarah works for Priya" becomes a Relation {from: e_sarah, relation: reports-to, to: e_priya}. "Priya manages the project" becomes a Relation {from: e_priya, relation: manages, to: e_project}. Query "who does Sarah work for?" resolves e_sarah, traverses the reports-to edge to e_priya, retrieves the has-role Fact for e_priya. Two hops in the graph. The answer is structural, not lexical, and robust to name variation.
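A sketch of the two-hop traversal, assuming entities, Facts, and Relations are held in simple in-memory collections; a real store would index edges, but the hop logic is the same:

# Illustrative in-memory collections, keyed and linked by entity id.
entities = {
    "e_priya": {"canonical_name": "Priya", "aliases": ["Priya", "P. Sharma", "the EM"]},
    "e_sarah": {"canonical_name": "Sarah", "aliases": ["Sarah"]},
}
relations = [
    {"from": "e_sarah", "relation": "reports-to", "to": "e_priya"},
    {"from": "e_priya", "relation": "manages", "to": "e_project"},
]
facts = [
    {"subject": "e_priya", "predicate": "has-role", "object": "engineering manager"},
]

def manager_and_role(person_id):
    # Hop 1: follow the reports-to edge. Hop 2: read the has-role Fact on the target entity.
    manager_id = next((r["to"] for r in relations
                       if r["from"] == person_id and r["relation"] == "reports-to"), None)
    if manager_id is None:
        return None
    role = next((f["object"] for f in facts
                 if f["subject"] == manager_id and f["predicate"] == "has-role"), None)
    return entities[manager_id]["canonical_name"], role

print(manager_and_role("e_sarah"))  # ('Priya', 'engineering manager')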
Full type schemas with field specifications
Each type carries a specific set of fields. Understanding these schemas is necessary for implementing the write pipeline and the retrieval layer correctly. The fields below are the canonical set; implementations may add domain-specific fields but should not remove these.
Fact schema — represents a stable predicate that changes infrequently and supersedes previous values when updated:
{
"id": "mem_abc123",
"type": "Fact",
"subject": "user",
"predicate": "works-at",
"object": "e_volkswagen",
"source_confidence": 0.9,
"repetition_count": 3,
"extractor_confidence": 0.87,
"type_prior": 0.15,
"confidence": 0.82,
"created_at": "2026-01-15T09:00:00Z",
"last_accessed_at": "2026-05-06T14:23:00Z",
"access_count": 12,
"status": "active",
"predicate_is_stateful": true,
"superseded_by": null,
"evidence_turn_id": "turn_xyz"
}
The predicate_is_stateful field is set by the extractor at write time. For Facts, this is almost always true: if the user now works at a different company, the new Fact supersedes the old one. The evidence_turn_id links the memory back to the specific conversation turn that produced it — essential for debugging extraction errors and for audit trails.
Preference schema — represents a mutable user choice that supersedes previous values for the same preference key:
{
"id": "mem_def456",
"type": "Preference",
"preference_key": "editor.theme",
"preference_value": "dark",
"confidence": 0.93,
"created_at": "2026-03-01T10:00:00Z",
"updated_at": "2026-04-15T09:00:00Z",
"status": "active",
"supersedes": "mem_old789",
"predicate_is_stateful": true
}
The preference_key is a namespaced identifier rather than free text — this enables exact-match lookup rather than fuzzy lookup when checking for existing preferences to supersede. Keys follow a domain.attribute convention (editor.theme, notifications.email, language.primary). The supersedes field creates a linked list of preference history, which is useful for debugging and for "what did I prefer before?" queries.
Event schema — represents a point-in-time occurrence that accumulates rather than supersedes:
{
"id": "mem_ghi789",
"type": "Event",
"event_description": "joined platform team",
"event_at": "2026-05-09T00:00:00Z",
"event_at_precision": "day",
"participants": ["e_user", "e_platform_team"],
"confidence": 0.78,
"created_at": "2026-05-09T16:00:00Z",
"status": "active",
"predicate_is_stateful": false
}
The event_at_precision field encodes what the extractor actually knows about the timestamp. Values: "exact" (ISO 8601 to the second), "day" (date known, time not), "week" (approximate week), "month" (approximate month), "approximate" (temporal language like "recently" or "last year"), "unknown" (event is known to have happened but no time anchor could be extracted). Temporal retrieval uses this field to decide which window operator to apply. A precision of "week" means the event should match any temporal query whose window overlaps with the entire week range, not just the specific day.
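A sketch of precision-aware window matching. The half-widths per precision level are illustrative choices, not values fixed by the schema:

from datetime import datetime, timedelta, timezone

# Half-widths per precision level are illustrative; "unknown" is handled as a skip.
PRECISION_HALF_WIDTH = {
    "exact": timedelta(0),
    "day": timedelta(hours=12),
    "week": timedelta(days=3, hours=12),
    "month": timedelta(days=15),
    "approximate": timedelta(days=45),
}

def matches_window(event_at, precision, window_start, window_end):
    # Widen the stored timestamp into a range by its precision, then test for overlap.
    if event_at is None or precision == "unknown":
        return False  # no usable time anchor: the temporal retriever skips it
    half = PRECISION_HALF_WIDTH[precision]
    return event_at - half <= window_end and event_at + half >= window_start

event_at = datetime(2026, 5, 9, tzinfo=timezone.utc)
query_window = (datetime(2026, 5, 4, tzinfo=timezone.utc),
                datetime(2026, 5, 6, tzinfo=timezone.utc))
print(matches_window(event_at, "week", *query_window))   # True: the week range overlaps the window
print(matches_window(event_at, "exact", *query_window))  # False: the exact timestamp falls outside it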
The participants array links events to entity records. This enables queries like "what events did e_platform_team participate in?" without full-text search. The predicate_is_stateful field is always false for Events — two distinct events are two distinct memories, never a supersession relationship.
Entity schema — represents a named entity with identity resolution across aliases:
{
"id": "e_volkswagen",
"type": "Entity",
"entity_type": "organization",
"canonical_name": "Volkswagen AG",
"aliases": ["VW", "Volkswagen", "Volkswagen Group"],
"attributes": {
"industry": "automotive",
"headquarters": "Wolfsburg, Germany"
},
"confidence": 0.95,
"created_at": "2026-01-15T09:00:00Z",
"mention_count": 47,
"status": "active"
}
Entity schemas are the foundation of the graph. Every Relation points to two Entity IDs. Every Event's participants array contains Entity IDs. The aliases array is used by the entity resolution step at write time: when the extractor produces the string "VW" in a Fact or Relation, the entity resolver searches the alias index before creating a new Entity record. If "VW" matches e_volkswagen's alias list, the new memory's object field is set to e_volkswagen rather than spawning a duplicate entity.
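A sketch of that write-time alias lookup, assuming a case-insensitive in-memory alias index; production resolvers would add fuzzy matching and candidate scoring on top of this:

# Illustrative alias index: lowercased alias string -> entity id.
entities = {
    "e_volkswagen": {
        "canonical_name": "Volkswagen AG",
        "aliases": ["VW", "Volkswagen", "Volkswagen Group"],
    },
}
alias_index = {
    alias.lower(): entity_id
    for entity_id, e in entities.items()
    for alias in e["aliases"] + [e["canonical_name"]]
}

def resolve_entity(mention):
    # Return an existing entity id for the mention, or mint a new Entity record.
    entity_id = alias_index.get(mention.lower())
    if entity_id is not None:
        return entity_id
    new_id = "e_" + mention.lower().replace(" ", "_")
    entities[new_id] = {"canonical_name": mention, "aliases": [mention]}
    alias_index[mention.lower()] = new_id
    return new_id

print(resolve_entity("VW"))          # e_volkswagen, so no duplicate entity is spawned
print(resolve_entity("Porsche AG"))  # e_porsche_ag, a new Entity record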
Relation schema — represents a typed, directed edge between two entities:
{
"id": "rel_jkl012",
"type": "Relation",
"from_entity": "e_user",
"relation_type": "reports-to",
"to_entity": "e_sarah",
"since": "2026-05-09T00:00:00Z",
"confidence": 0.74,
"created_at": "2026-05-09T16:00:00Z",
"status": "active",
"predicate_is_stateful": true,
"evidence_turn_id": "turn_abc"
}
Relation types come from a controlled vocabulary to enable graph traversal — free-text relation types make traversal impossible because "reports to," "works under," and "is managed by" are semantically equivalent but lexically distinct. The vocabulary includes: reports-to, manages, works-with, member-of, owns, created, uses, located-at. Domain extensions add to this list; they don't use free text.
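A sketch of vocabulary enforcement at write time. The synonym map entries are illustrative; the point is that anything outside the controlled list is rejected rather than stored as free text:

# Controlled vocabulary from the text; domain extensions append entries here.
RELATION_TYPES = {
    "reports-to", "manages", "works-with", "member-of",
    "owns", "created", "uses", "located-at",
}

# Illustrative normalization map for common extractor phrasings.
RELATION_SYNONYMS = {
    "works under": "reports-to",
    "belongs to": "member-of",
}

def normalize_relation(raw):
    # Map extractor output onto the controlled vocabulary, or fail loudly.
    hyphenated = raw.strip().lower().replace(" ", "-")
    if hyphenated in RELATION_TYPES:
        return hyphenated
    if raw.strip().lower() in RELATION_SYNONYMS:
        return RELATION_SYNONYMS[raw.strip().lower()]
    raise ValueError(f"relation type {raw!r} is not in the controlled vocabulary")

print(normalize_relation("reports to"))   # reports-to
print(normalize_relation("works under"))  # reports-to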
Type-specific confidence priors
The confidence formula across all memory types is:
conf(m) = min(1.0, 0.45·s + 0.20·r + 0.25·e + 0.10·t)
Where s is source strength (how directly the user stated the information), r is repetition factor (how many times this has been confirmed), e is extractor confidence (the LLM's self-reported confidence in the extraction), and t is the type prior. The type prior reflects the baseline accuracy of each type category before any evidence is considered:
- Preference (t = 0.20): highest prior. When a user explicitly states a preference in first-person — "I prefer X" or "I always use Y" — this is the most direct possible signal. The error rate for explicit preference extraction is low. The type prior is set high to reflect this.
- Fact (t = 0.15): high prior. Stable predicates like employment, location, and role are usually stated explicitly and extracted correctly. The main failure mode is misextraction of complex or conditional statements ("I used to work at Volkswagen" extracted as present-tense). The prior is slightly below Preference because fact statements are more varied in form.
- Entity (t = 0.12): medium-high. When an entity is clearly named and unambiguous, entity extraction is reliable. The prior drops when pronoun resolution is required — "she manages the project" requires resolving "she" to a specific entity, which introduces error.
- Relation (t = 0.10): medium. Relations are frequently inferred from context rather than stated explicitly. "John and Sarah work together on the platform team" implies a works-with relation but doesn't state it. Inference-based extraction has higher error rates than explicit extraction.
- Event (t = 0.08): medium-low. Events themselves are usually correctly identified. The lower prior reflects that the timestamp precision is often wrong — the extractor correctly identifies that something happened but incorrectly identifies when. Since Events without correct timestamps are significantly less useful, the prior is discounted.
These priors interact with the other components in non-obvious ways. Consider a Preference memory with strong source signal: s = 0.9 (explicit first-person statement), r = 0 (first time this preference has been mentioned), e = 0.9 (extractor is confident), t = 0.20:
conf = min(1.0, 0.45×0.9 + 0.20×0 + 0.25×0.9 + 0.10×0.20) = min(1.0, 0.405 + 0 + 0.225 + 0.02) = 0.65
A new preference starts at 0.65 confidence even when the source signal is strong. The repetition component r has not accumulated yet — this is deliberate. A single statement, even a clear one, could be sarcastic, hypothetical, or misread by the extractor. Confidence builds with repetition. After the user mentions the same preference in three separate conversations (r = 0.5 for three repetitions, which contributes 0.20 × 0.5 = 0.10), confidence reaches 0.65 + 0.10 = 0.75. After five repetitions (r = 1.0), confidence reaches 0.65 + 0.20 = 0.85.
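The same arithmetic as a small helper, using the type priors listed above (outputs rounded for display):

TYPE_PRIORS = {"Preference": 0.20, "Fact": 0.15, "Entity": 0.12,
               "Relation": 0.10, "Event": 0.08}

def confidence(s, r, e, t):
    # conf(m) = min(1.0, 0.45*s + 0.20*r + 0.25*e + 0.10*t)
    return min(1.0, 0.45 * s + 0.20 * r + 0.25 * e + 0.10 * t)

t = TYPE_PRIORS["Preference"]
print(round(confidence(s=0.9, r=0.0, e=0.9, t=t), 2))  # 0.65: first explicit mention
print(round(confidence(s=0.9, r=0.5, e=0.9, t=t), 2))  # 0.75: after three confirmations
print(round(confidence(s=0.9, r=1.0, e=0.9, t=t), 2))  # 0.85: after five confirmations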
Type-specific decay half-lives
Memory freshness decays over time independently of confidence. The freshness function for each memory type uses an exponential decay model:
freshness(t) = 2^(-t/τ)
Where t is time since last access in days and τ is the half-life in days — the time at which freshness reaches 0.5. Each type has a different characteristic half-life:
- Entity (τ = 365 days): entities are persistent. "Volkswagen" doesn't become stale. An entity record created three years ago is still valid. The half-life is measured in years because the underlying real-world entity changes infrequently relative to conversational timescales.
- Fact (τ = 180 days): stable predicates change infrequently. "User works at Volkswagen" has a 6-month half-life. At t = 180, freshness = 0.5 — still very much retrievable, just weighted less than a recent confirmation. At t = 360, freshness = 0.25 — likely stale but not purged.
- Relation (τ = 180 days): like facts, relations are relatively stable. A reports-to edge from six months ago still carries 0.5 freshness. The organization may have restructured, but the prior relation is still a reasonable baseline until explicitly superseded.
- Preference (τ = 90 days): preferences change more often than facts. A preference for dark mode that hasn't been mentioned in three months may reflect evolved preference or irrelevance — the user's environment changed, the preference stopped mattering. After 90 days without mention, freshness = 0.5.
- Event (τ = 30 days): events are inherently historical. A meeting or project milestone from six months ago is rarely relevant to current queries. At t = 180, freshness = 2^(-180/30) = 2^(-6) = 0.016 — effectively near zero. The event is not purged from the store — it remains available for explicit historical queries — but it does not surface in routine retrieval unless explicitly requested.
These half-lives interact with the access boost. Each time a memory is retrieved and used, its effective freshness is multiplied: boosted_freshness = freshness × min(3.0, 1.2^access_count). A meeting event that has been accessed five times in the past month accumulates a boost of 1.2^5 = 2.49. Even with its 30-day half-life, if the event is 45 days old, its freshness before boost is 2^(-45/30) = 2^(-1.5) = 0.35. After boost: 0.35 × 2.49 = 0.87. Frequently revisited events stay alive despite their short half-life. This is the intended behavior — the half-life represents passive decay, not active relevance.
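Both pieces, passive decay and the access boost, in one sketch. Note that rounding the intermediate values, as the prose does, gives 0.87 rather than the 0.88 the unrounded computation produces:

HALF_LIFE_DAYS = {"Entity": 365, "Fact": 180, "Relation": 180,
                  "Preference": 90, "Event": 30}

def freshness(days_since_access, memory_type):
    # freshness(t) = 2^(-t / tau) with a type-specific half-life tau.
    return 2 ** (-days_since_access / HALF_LIFE_DAYS[memory_type])

def boosted_freshness(days_since_access, memory_type, access_count):
    # Each retrieval multiplies freshness by a boost capped at 3.0.
    return freshness(days_since_access, memory_type) * min(3.0, 1.2 ** access_count)

print(round(freshness(45, "Event"), 2))             # 0.35: passive decay alone
print(round(boosted_freshness(45, "Event", 5), 2))  # 0.88: five accesses keep the event alive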
The classifier: how type assignment works
Type assignment happens inside Stage 2 of the write pipeline — the LLM extraction call. It is not a separate model call. The extraction prompt asks the model to produce a JSON object containing both the extracted memory content and the memory type in a single output. This design choice avoids the latency and cost of a dedicated classification step while still getting reliable type labels.
The extraction prompt structure: the model is given the conversation turn, the five type definitions with examples, and an instruction to choose from {Fact, Preference, Event, Entity, Relation}. The model outputs a structured JSON block with a type field alongside the memory fields. Because extraction and classification share a single forward pass, the model uses the same context (the conversation turn) for both — there is no information loss between extraction and classification.
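An illustrative shape for that single-call output. Field names follow the schemas above, but the exact prompt wording and the top-level memories wrapper are assumptions:

import json

# Illustrative single-call extraction output for the turn
# "I joined the platform team last Friday."
extraction_output = json.loads("""
{
  "memories": [
    {
      "type": "Event",
      "event_description": "joined platform team",
      "event_at": "2026-05-09T00:00:00Z",
      "event_at_precision": "day",
      "participants": ["e_user", "e_platform_team"],
      "extractor_confidence": 0.85
    }
  ]
}
""")

VALID_TYPES = {"Fact", "Preference", "Event", "Entity", "Relation"}
for memory in extraction_output["memories"]:
    # Reject any type label outside the five canonical types before further processing.
    assert memory["type"] in VALID_TYPES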
Post-extraction rule adjustments handle ambiguous cases that the LLM systematically misclassifies:
- If type = Fact and the predicate contains temporal language — "used to," "was," "had been" — reclassify as Event with event_at_precision = "approximate". The extractor saw a fact-shaped statement but the temporal language indicates a historical event, not a current state.
- If type = Relation and both subject and object resolve to the same entity, reclassify as Fact. Self-referential relations are not graph edges — "the user is a senior engineer" with subject and object both resolving to the user entity is a Fact with predicate has-role.
- If type = Event and no event_at timestamp can be parsed from the text or the conversation metadata, keep as Event but set event_at = null and event_at_precision = "unknown". The temporal retriever will skip it when filtering by date range, but the semantic retriever can still surface it for queries that don't require a time anchor.
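A sketch of these adjustments as deterministic post-processing; the temporal-marker regex and the dict-based memory representation are illustrative:

import re

TEMPORAL_MARKERS = re.compile(r"\b(used to|was|were|had been)\b", re.IGNORECASE)

def apply_post_extraction_rules(memory):
    # Deterministic fixes for classifications the extractor systematically gets wrong.
    if memory["type"] == "Fact" and TEMPORAL_MARKERS.search(memory.get("predicate", "")):
        # Fact-shaped statement with past-tense language: actually a historical event.
        memory["type"] = "Event"
        memory["event_at"] = None
        memory["event_at_precision"] = "approximate"
    elif memory["type"] == "Relation" and memory.get("from_entity") == memory.get("to_entity"):
        # Self-referential relation: not a graph edge, store as a Fact.
        memory["type"] = "Fact"
    elif memory["type"] == "Event" and memory.get("event_at") is None:
        # Event with no parsable timestamp: keep the type, drop the time anchor.
        memory["event_at_precision"] = "unknown"
    return memory

print(apply_post_extraction_rules(
    {"type": "Fact", "subject": "user", "predicate": "used to work at", "object": "e_volkswagen"}))
# Reclassified as Event with event_at_precision = "approximate".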
Common misclassifications and their costs: extracting "user joined the team" as Fact instead of Event loses the timestamp anchor entirely. The memory is treated as a stable predicate ("user is a team member") rather than a historical event. It becomes invisible to temporal queries ("when did I join?") and it won't carry the event_at field that temporal retrieval requires. This is one of the higher-cost misclassification errors because it affects a structurally different retrieval path.
Extracting "user prefers TypeScript" as Fact instead of Preference means it won't be superseded through the Preference supersession path when the user later says "I'm now using Go." The system will create a new Fact and link the two with a contradicts edge rather than a clean supersession link. The old Fact and new Fact coexist as contradicting statements, and the resolution becomes a confidence comparison at retrieval time rather than a structural supersession. This degrades to the flat-text problem described in Scenario 1.
Schema evolution in practice
Schema evolution happens when the domain requires a new type or when an existing type has been overloaded with semantically distinct concepts.
Adding a domain-specific type — healthcare example. A healthcare agent needs to track medications separately from general facts. The existing Fact type could hold medication information, but this leads to the same problems as flat text: medication supersession rules are different from fact supersession rules (a new medication prescription doesn't supersede an old one — both may be active), and medication-specific retrieval (by drug class, by interaction risk) requires dedicated indexes.
Adding a Medication type: (1) Define the schema — medication_name, dosage, frequency, prescribed_by, started_at, ended_at, active. (2) Write a backfill classifier prompt targeting existing Fact memories where the predicate contains "takes," "prescribed," "on medication," or similar. (3) Run the backfill as a background job over the existing store, reclassifying matching Facts as Medications. (4) Existing Facts not matching the classifier are untouched — no risk of overclassification beyond what the classifier flags. (5) Update the write pipeline's extraction prompt to include the new Medication type definition. Total migration for a 100K-memory store: a few hours of background processing plus a prompt update.
Splitting an overloaded type — Fact into Fact and Constraint. After deployment, an analysis of the store reveals that a subset of Facts are hard constraints — "user cannot use external APIs due to compliance requirements" — that should never be superseded by normal Fact updates. If the user later says "we switched to a different compliance framework," a normal Fact supersession would remove the constraint. But the constraint might still apply under the new framework. Constraints need a different supersession policy: they can only be superseded by an explicit constraint-removal event, not by a new Fact with a matching predicate.
Split procedure: (1) Add Constraint type schema. (2) Backfill classifier identifies Facts containing constraint language ("cannot," "must not," "required to," "prohibited"). (3) Reclassify. (4) Update the supersession logic: when a new Fact has predicate_is_stateful = true and matches subject+predicate, check whether the existing memory is type Constraint. If so, do not supersede — instead create a conflicts-with link and flag for human review. (5) Update the extraction prompt to include Constraint as a type option with its definition.
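A sketch of the constraint-aware check from step 4. Field names follow the schemas above; the return values and the conflicts_with list are illustrative:

def supersede_or_flag(existing, new_fact):
    # Constraints are never silently superseded by a matching Fact; they get flagged instead.
    same_slot = (existing["subject"] == new_fact["subject"]
                 and existing["predicate"] == new_fact["predicate"])
    if not (new_fact.get("predicate_is_stateful") and same_slot):
        return "no-action"
    if existing["type"] == "Constraint":
        existing.setdefault("conflicts_with", []).append(new_fact["id"])
        return "flagged-for-review"
    existing["status"] = "superseded"
    existing["superseded_by"] = new_fact["id"]
    return "superseded"

constraint = {"id": "mem_c1", "type": "Constraint", "subject": "user",
              "predicate": "compliance-framework", "status": "active"}
new_fact = {"id": "mem_f2", "type": "Fact", "subject": "user",
            "predicate": "compliance-framework", "predicate_is_stateful": True}
print(supersede_or_flag(constraint, new_fact))  # flagged-for-review, not superseded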
This migration is surgical — only the flagged memories change type. Everything else in the store is untouched. The key insight is that typed schemas make schema evolution mechanically predictable: you know exactly which memories need to change, and you can verify completeness by running the backfill classifier and inspecting the results before committing.