Entity Graphs for Multi-Hop Reasoning

By Arc Labs Research · 10 min read

Consider the query "who's on Sarah's team?" The relevant memory might be "Alice reports to Sarah." That memory's text contains "Alice" and "Sarah" — but the query mentions only "Sarah." A vector retriever pulls memories about Sarah, but no memory says the answer directly. The information is structural, not lexical. Graph retrieval handles structural queries by traversing typed edges between entities. Once entity resolution has produced stable identifiers, every memory becomes either a node attribute or a typed edge — and queries that mention an entity can pull connected nodes regardless of whether they appear in any single memory's text.

[Interactive demo: graph hopper. Click any entity; the selected entity's 1-hop neighborhood lights up, and multi-hop scoring decays.]

The graph schema

  • Nodes are entities — people, organizations, places, projects, artifacts. Each carries a type, aliases, and aggregate attributes ("most recent role", "most recent location").
  • Edges are typed relations: works-at, reports-to, lives-in, collaborated-with. Each carries a confidence and a temporal window (when did this relation hold?).
  • Memories reference entities; they are not themselves nodes. A memory like "user joined Volkswagen" creates or updates the (user)→[works-at]→(Volkswagen) edge.

Hop scoring

Distance-from-query matters. The entity a query mentions, say Sarah, is 0-hop relevant; Sarah's manager is 1-hop relevant; the manager's other reports are 2-hop relevant. The score decays:

score(0-hop) = 1.00
score(1-hop) = 0.60
score(2-hop) = 0.35
score(3-hop) = 0.15

Hop scores, tunable per relation. Beyond 3 hops, signal is below noise.

These constants are calibrated, not arbitrary: they reflect the empirical drop in relevance as hops are added on real conversational graphs.
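As a minimal sketch, the decay table is just a lookup (the constants are from the text; the function name is illustrative):

```python
# Hop-distance decay as a lookup table; constants mirror the calibrated values above.
HOP_SCORES = {0: 1.0, 1: 0.6, 2: 0.35, 3: 0.15}

def hop_score(hops: int) -> float:
    """Return the decay factor for a hop distance; beyond 3 hops, signal is below noise."""
    return HOP_SCORES.get(hops, 0.0)
```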

Edge type matters more than distance

A 1-hop edge of type collaborated-with from a 5-year-old project is less relevant than a 2-hop edge of type currently-reports-to. The score composes three factors:

  • Hop distance (with the decay above).
  • Edge confidence × edge freshness (per-edge).
  • Edge type prior (some types decay faster — collaborated, met-with).

This composition is why the graph retriever lives alongside semantic and lexical, not above them. Graph rank goes into RRF; fusion handles the cross-retriever balance.

When graph retrieval fails

  • Sparse graphs. Early in a user's lifecycle, the entity graph has few edges. Graph retrieval returns thin results; semantic and lexical carry the load.
  • Wrong-name graphs. Without good entity resolution (see entity resolution), entities fragment and the graph is partitioned into disconnected pieces. Graph quality is bounded by resolution quality.
  • Pure-text queries. "What did the user say about climate change?" has no entity anchor. The graph retriever returns nothing useful; semantic dominates fusion. Correct behavior.

Three edge kinds

Relations in the entity graph have an edge_kind field that affects retrieval priority:

  • Semantic edges: relations extracted from conversation content. "Priya manages Georgian" creates a semantic manager_of edge. These are the most common and most informative.
  • Structural edges: relations from explicit data structures — a user adding a contact, linking a project to an organization. Higher confidence than semantic edges because they're direct assertions, not inferred from text.
  • Lifecycle edges: relations tracking entity state transitions — "project entered active status", "person joined organization". These have explicit valid_from/valid_to timestamps and participate in temporal filtering. A reports_to edge from 2024 has valid_to set when a 2026 reorg creates a new one.

The retriever weights these differently: structural > semantic > lifecycle when confidence is equal. But confidence × freshness × edge_type_prior is the final score — an old structural edge can rank below a recent semantic one.

Every relation carries evidence_memory_ids: Vec<MemoryId> — the specific memories that established this edge. If you query why a graph edge exists, you can trace it back to the exact memory and the source turn that produced it.
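A relation record along these lines would carry the evidence field. This is a hypothetical Python sketch: only evidence_memory_ids comes from the text; the other field names and the prior values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed kind priors, encoding structural > semantic > lifecycle at equal confidence.
EDGE_KIND_PRIOR = {"structural": 1.0, "semantic": 0.9, "lifecycle": 0.8}

@dataclass
class Relation:
    source: str
    target: str
    edge_type: str                      # e.g. "reports_to"
    edge_kind: str                      # "semantic" | "structural" | "lifecycle"
    confidence: float
    valid_from: Optional[str] = None    # lifecycle edges carry temporal bounds
    valid_to: Optional[str] = None
    evidence_memory_ids: List[str] = field(default_factory=list)

def trace_evidence(rel: Relation) -> List[str]:
    """Answer 'why does this edge exist?' with the memories that established it."""
    return rel.evidence_memory_ids
```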

The sparse-graph early-lifecycle problem

Graph retrieval is genuinely weak for new users. A namespace with 50 memories has few edges. The graph retriever produces thin results; semantic and BM25 carry retrieval. This is correct behavior but it means the optimizer's entity-centric plan may underperform for new users until the graph is populated.

The practical threshold: the entity graph starts contributing meaningfully at ~500 memories and becomes reliable at ~2,000. Below that, lean on semantic retrieval and don't optimize for graph performance.

The entity attribute cache (Redis, 1-hour TTL) accelerates entity lookup for established entities. New entities below the caching threshold hit Postgres directly — this is fast at small scale but warms up as the namespace grows.
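The read-through pattern can be sketched without Redis. Assume a caller-supplied loader standing in for the Postgres lookup; the class and parameter names are illustrative.

```python
import time

class EntityAttributeCache:
    """Minimal read-through cache sketch (stand-in for Redis with a 1-hour TTL)."""

    def __init__(self, ttl_seconds: float = 3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}  # entity_id -> (expires_at, attributes)

    def get(self, entity_id, load_from_store):
        now = self.clock()
        hit = self._store.get(entity_id)
        if hit and hit[0] > now:
            return hit[1]                      # fresh cache hit
        attrs = load_from_store(entity_id)     # miss or expired: hit the primary store
        self._store[entity_id] = (now + self.ttl, attrs)
        return attrs
```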

Resolution quality is the multiplier on graph quality. An entity that has been seen in 20 different surface forms (Priya, priya.sharma, "my manager", "Priya at Arrive") benefits most from the four-stage resolution cascade. Resolution failures partition the graph into disconnected islands — the most common cause of graph retrieval underperforming on a nominally entity-rich query.

Multi-hop traversal: how queries actually execute

Walk through a concrete multi-hop query execution step-by-step. This is the sequence the graph retriever follows when it receives a parsed query with multiple entity anchors.

Query: "What has Alice worked on with Sarah's team?"

Step 1 — entity extraction

Stage 1 of the read pipeline extracts entity references from the query. This query resolves to: ent_alice_XYZ (confidence 0.94) and ent_sarah_DEF (confidence 0.91). Both are above the 0.80 threshold for the entity-centric retrieval plan, so the optimizer chooses a graph-anchored plan rather than falling back to pure semantic retrieval.
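The plan choice reduces to a threshold test. A sketch with illustrative names; only the 0.80 threshold is from the text.

```python
# Hypothetical optimizer step: pick the graph-anchored plan only when at least
# one extracted entity clears the confidence threshold.
ENTITY_PLAN_THRESHOLD = 0.80

def choose_plan(entity_confidences: list[float]) -> str:
    anchors = [c for c in entity_confidences if c >= ENTITY_PLAN_THRESHOLD]
    return "graph_anchored" if anchors else "semantic_only"
```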

Step 2 — 1-hop traversal from Alice

The entity graph retriever fetches all edges incident to ent_alice_XYZ. Returns:

  • ent_project_kestrel via collaborates_with edge (confidence 0.82, created 3 months ago)
  • ent_platform_team via member_of edge (confidence 0.90, valid_to = null, still current)

Step 3 — 1-hop traversal from Sarah

Fetch all edges incident to ent_sarah_DEF. Returns:

  • ent_platform_team via leads edge (confidence 0.95, valid_to = null)
  • ent_alice_XYZ via managed_by edge (confidence 0.88)
  • ent_project_anvil via sponsoring edge (confidence 0.71, created 8 months ago)

Step 4 — 2-hop from Sarah via platform_team

Traverse Sarah → platform_team → members. Returns: {ent_alice_XYZ, ent_bob_GHI, ent_carol_JKL, ent_dave_MNO}. This 2-hop path has hop score 0.35. Alice appears in both Sarah's 1-hop neighborhood (via managed_by) and the 2-hop neighborhood (via platform_team membership).
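Steps 2 through 4 amount to a breadth-first expansion. A toy sketch over the walk-through's entities (ids abbreviated; edge confidences from the text; the adjacency structure is illustrative):

```python
from collections import deque

# Toy adjacency list mirroring the walk-through.
EDGES = {
    "alice": [("kestrel", 0.82), ("platform_team", 0.90)],
    "sarah": [("platform_team", 0.95), ("alice", 0.88), ("anvil", 0.71)],
    "platform_team": [("alice", 0.90), ("bob", 0.90), ("carol", 0.90), ("dave", 0.90)],
}
HOP_SCORES = {0: 1.0, 1: 0.6, 2: 0.35, 3: 0.15}

def neighborhood(start: str, max_hops: int = 2) -> dict:
    """Breadth-first expansion; each reached node keeps its best (fewest-hop) score."""
    seen = {start: 1.0}
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue
        for nbr, _conf in EDGES.get(node, []):
            if nbr not in seen:
                seen[nbr] = HOP_SCORES[hops + 1]
                frontier.append((nbr, hops + 1))
    return seen
```

Expanding from sarah, alice lands at 1 hop (score 0.6) via the direct managed_by edge, while bob, carol, and dave land at 2 hops (0.35) through platform_team.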

Step 5 — intersection and path scoring

Both Alice's 1-hop traversal and Sarah's 2-hop traversal arrive at ent_platform_team. Platform-team membership is the structural answer to "what connects them." Scoring the 2-hop path Sarah → platform_team → Alice with weakest-link path confidence:

score = min(Sarah_edge_conf, Alice_edge_conf) × hop_score
      = min(0.95, 0.90) × 0.35
      = 0.315

The direct managed_by edge scores 0.88 × 0.6 = 0.53, which is stronger; the 2-hop path through the team membership is a corroborating signal.

Step 6 — memory retrieval

Memories tagged with ent_project_kestrel and ent_platform_team enter the RRF fusion pool with entity match scores of 0.82 and 0.90 respectively. These scores weight the graph retriever's contribution in the RRF merge alongside results from the semantic and BM25 retrievers. The final ranked list that surfaces to the LLM context is the fused output — the graph retriever does not operate in isolation.

This execution pattern generalizes: the graph retriever is always one input to RRF, never the sole retriever. Queries with strong entity anchors weight the graph results more heavily in fusion; queries with weak entity anchors weight semantic results more heavily. The weighting is implicit in the RRF rank positions, not configured per-query.
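The fusion step can be sketched as plain reciprocal rank fusion over per-retriever rankings. This is a minimal unweighted sketch; the production merge described above presumably also folds in the entity match scores.

```python
# Reciprocal rank fusion: each retriever contributes 1/(k + rank) per document.
# k=60 is the conventional constant used in the text.
def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked by two retrievers (here "m2") outscores one ranked by a single retriever, which is exactly how strong entity anchors implicitly up-weight graph results.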

Edge confidence, freshness, and type priors

The score for a memory surfaced via graph traversal is not just hop count. Three factors compose into the final traversal score, and all three must be considered when interpreting why a memory was or was not retrieved.

Hop score

The base decay factor from the existing formula: 1.0 at 0 hops, 0.6 at 1 hop, 0.35 at 2 hops, 0.15 at 3 hops.

Edge confidence

Each edge has a confidence score assigned at creation time, reflecting the certainty of the underlying inference. Sources and typical confidence ranges:

  • Structural edges (explicit user-provided data): 0.90–0.99
  • Semantic edges from unambiguous text ("Alice reports to Sarah"): 0.80–0.92
  • Semantic edges from ambiguous text ("we worked together"): 0.60–0.75
  • Inferred edges from co-occurrence patterns: 0.50–0.65

Path confidence is the MIN of all edge confidences along the traversal path: weakest-link semantics, since a chain is only as reliable as its weakest edge. A 2-hop path with edges at 0.92 and 0.65 has path confidence 0.65, not their 0.785 average. This means long chains through uncertain edges degrade quickly.

Edge type priors and decay

Different edge types represent relations with fundamentally different temporal stability:

Edge type                        | Confidence prior | Decay half-life (τ)
member_of (current team)         | High             | 180 days
reports_to (management)          | High             | 180 days
collaborated_with (past project) | Medium           | 30 days
mentioned_with (co-occurrence)   | Low              | 14 days
met_with (one-off meeting)       | Low              | 7 days

An edge created a year ago with type collaborated_with gets freshness:

freshness = 2^(-days_since_creation / τ)
           = 2^(-365 / 30)
           = 2^(-12.2)
           ≈ 0.0002

Effectively zero. It still appears in the graph (it is not deleted) but contributes essentially zero traversal score. Only lifecycle edges with valid_to = null (still current) should be traversed for high-weight retrieval — traversal implementations should filter for valid_to IS NULL OR valid_to > NOW() before scoring.
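The freshness curve and the validity filter can be sketched directly from the formula (function names are illustrative):

```python
def edge_freshness(days_since_creation: float, half_life_days: float) -> float:
    """Exponential half-life decay: freshness halves every τ days."""
    return 2.0 ** (-days_since_creation / half_life_days)

def is_traversable(valid_to_days_from_now) -> bool:
    """Mirror of `valid_to IS NULL OR valid_to > NOW()`: None means still current."""
    return valid_to_days_from_now is None or valid_to_days_from_now > 0
```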

Combined traversal score

traversal_score = hop_score × path_confidence × edge_freshness × edge_type_prior

For a 2-hop path through a 6-month-old collaborated_with edge (confidence 0.70, type_prior 0.8):

= 0.35 × 0.70 × 2^(-180/30) × 0.8
= 0.35 × 0.70 × 0.016 × 0.8
= 0.0031

After RRF fusion with k=60, even a rank-1 result from this path contributes 1.0 / (60 + 1) × 0.0031 ≈ 0.00005 to the fused score — essentially nothing. Stale collaboration edges retire gracefully from retrieval without explicit deletion. The graph does not need a cleanup job to remove expired edges; the scoring formula handles relevance decay automatically.

This property is important for auditability: you can always ask "why was this edge not contributing?" and get a numeric answer from the freshness formula rather than "the edge was deleted."
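The composition can be sketched in a few lines, reproducing the worked example above:

```python
def traversal_score(hop_score: float, path_confidence: float,
                    edge_freshness: float, edge_type_prior: float) -> float:
    """Final graph-traversal score: all four factors compose multiplicatively."""
    return hop_score * path_confidence * edge_freshness * edge_type_prior

# Worked example from the text: 2-hop path through a 6-month-old
# collaborated_with edge (confidence 0.70, type prior 0.8, τ = 30 days).
score = traversal_score(0.35, 0.70, 2 ** (-180 / 30), 0.8)  # ≈ 0.0031
```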

Entity resolution quality: the graph multiplier

The existing content introduces resolution quality as the multiplier on graph quality. Here is what bad resolution looks like in practice, and how to detect and remediate it.

Fragmentation signatures

A fragmented graph has entities that should be one node appearing as multiple disconnected nodes. Detection: entities with high name similarity that have no alias or merged_into link between them are candidate duplicates. Run this query weekly as part of graph health monitoring:

-- requires the pg_trgm extension for similarity()
SELECT a.id, a.canonical_name, b.id, b.canonical_name,
       similarity(a.canonical_name, b.canonical_name) AS sim
FROM entities a
JOIN entities b ON a.id < b.id AND a.type = b.type
WHERE similarity(a.canonical_name, b.canonical_name) > 0.85
  AND NOT EXISTS (
    SELECT 1 FROM entity_links
    WHERE (from_id = a.id AND to_id = b.id)
       OR (from_id = b.id AND to_id = a.id)
  )
ORDER BY sim DESC
LIMIT 20;

This returns candidate duplicate pairs ordered by similarity. Pairs above 0.92 similarity are almost certainly the same entity; pairs between 0.85 and 0.92 require human review or an LLM disambiguation call. When a pair is confirmed as the same entity, mergeEntities(a.id, b.id) re-parents all edges and memories from the secondary entity to the primary, then marks the secondary as merged_into.
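A pure-Python sketch of the merge semantics (the real mergeEntities presumably runs against Postgres; the dict-based store here is illustrative):

```python
def merge_entities(entities: dict, edges: list, secondary_id: str, primary_id: str) -> None:
    """Re-parent all edges from the secondary entity to the primary,
    then mark the secondary as merged_into the primary."""
    for edge in edges:
        if edge["source"] == secondary_id:
            edge["source"] = primary_id
        if edge["target"] == secondary_id:
            edge["target"] = primary_id
    entities[secondary_id]["merged_into"] = primary_id
    # Keep the secondary's name as an alias so future mentions resolve correctly.
    entities[primary_id].setdefault("aliases", []).append(
        entities[secondary_id]["canonical_name"]
    )
```

In a real store the memory rows referencing the secondary entity would be re-parented in the same transaction.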

Resolution failure indicators

Two metrics flag resolution degradation early, before the fragmentation query shows visible problems:

entity_resolution_fail_rate rising: the 4-stage cascade (embedding match → alias match → LLM disambiguation → UUID v5 creation) is falling back to UUID v5 creation more often than baseline. This means new surface forms are not matching existing entities. The most common cause is a naming pattern shift — a person's name starts appearing with a title or honorific ("Dr. Priya Sharma" versus the stored alias "Priya Sharma") that does not match without alias normalization. Fix: add the new surface form as an alias rather than letting it fragment.

Entity count growing at the same rate as memory count: in a healthy system, entity count grows logarithmically as new people, projects, and organizations are introduced, while memory count grows linearly (conversations continue with the same people). If entity count and memory count grow at the same slope, entity resolution is failing to match new mentions to existing entities — every mention is creating a new entity rather than updating an existing one. This is a clear signal of systematic resolution failure, not just edge cases.
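The slope comparison is easy to automate. A sketch, assuming equal-length time series of cumulative counts; the ratio interpretation follows the text, and any alerting threshold is deployment-specific.

```python
def resolution_growth_ratio(entity_counts: list[int], memory_counts: list[int]) -> float:
    """Ratio of entity growth to memory growth over the same window.
    Near 1.0: every mention mints a new entity (systematic resolution failure).
    Well below 1.0: healthy sub-linear entity growth."""
    d_entities = entity_counts[-1] - entity_counts[0]
    d_memories = memory_counts[-1] - memory_counts[0]
    return d_entities / d_memories if d_memories else 0.0
```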

Graph health score

Compute a graph health score weekly:

health = (1 - fragment_rate) × resolution_precision × edge_coverage

Where:

  • fragment_rate = duplicate entity pairs above 0.85 similarity / total entity count
  • resolution_precision = 1 - (entity_resolution_fail_rate / baseline_fail_rate)
  • edge_coverage = namespaces with >10 edges / namespaces with >500 memories

A health score above 0.85 indicates a functioning graph. Below 0.70 indicates a graph that is degrading retrieval quality meaningfully — graph retriever weight in RRF should be reduced automatically when health drops below threshold, falling back to semantic + BM25 dominance until graph quality recovers.
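The health computation and the auto-reduction can be sketched as below. The text only says the weight should be reduced under the threshold; scaling it by the health score is an assumption.

```python
def graph_health(fragment_rate: float, resolution_precision: float,
                 edge_coverage: float) -> float:
    """Weekly graph health score, composed from the three factors above."""
    return (1 - fragment_rate) * resolution_precision * edge_coverage

def graph_retriever_weight(health: float, configured_weight: float = 1.0,
                           threshold: float = 0.70) -> float:
    """Auto-reduce the graph retriever's RRF weight when health drops below threshold.
    Proportional scaling is an assumed policy, not the documented one."""
    return configured_weight if health >= threshold else configured_weight * health
```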

Entity graph as a retrieval signal: practical configuration

The entity graph retriever's contribution to RRF fusion is configurable per-deployment. The defaults are tuned for balanced workloads; specialized agent types benefit from adjustments.

Default configuration

read_pipeline:
  retrievers:
    entity_graph:
      enabled: true
      max_hops: 3                    # beyond 3, signal below noise
      weight: 1.0                    # in RRF fusion (relative to other retrievers)
      min_edge_confidence: 0.50      # edges below this confidence ignored
      skip_edge_types:               # low-value transient edges skipped
        - "met_with"
      entity_cache_ttl: 3600         # entity attribute cache TTL in seconds
      health_threshold: 0.70         # auto-reduce weight below this graph health score

Workload-specific tuning

Increase weight to 2.0 for workloads where relational queries dominate. Examples: HR agents (team membership, reporting structure), project management agents (task assignment, project membership), CRM agents (account ownership, contact relationships). These workloads issue structural queries frequently and the graph retriever contributes high-signal results.

Decrease weight to 0.5 for workloads that are primarily informational with no relational structure. Examples: FAQ agents, documentation agents, general knowledge assistants. These workloads rarely issue entity-anchored queries; the graph retriever contributes little while consuming traversal time.

Set weight to 0.0 (disabled) for workloads with confirmed sparse graphs. Early in a deployment's lifecycle — before reaching the ~500-memory threshold where graph contribution becomes meaningful — disabling the graph retriever eliminates traversal overhead with no quality cost.

The skip_edge_types parameter

Skipping met_with edges cuts graph traversal time by approximately 30% without meaningful quality loss. The reasoning: met_with edges have a 7-day decay half-life, so any met_with edge older than 30 days has freshness 2^(-30/7) ≈ 0.05 — contributing almost nothing to traversal score even if traversed. Rather than traversing and scoring at near-zero, skip these edges entirely. The only case where met_with edges matter is within the first week after a meeting memory was created; for most retrieval queries that window has already passed.

The same logic applies to mentioned_with edges (14-day half-life): consider adding to skip_edge_types for latency-sensitive deployments. The quality impact at 14+ days is negligible.

Cache TTL and warm-up

The entity_cache_ttl of 3,600 seconds (1 hour) balances freshness against Redis hit rate. For namespaces with rapidly changing entity graphs (active users in a fast-moving project), reduce to 600 seconds. For mature namespaces with stable entities (long-term personal assistant agents), increase to 7,200 seconds. The L2 cache hit rate should be monitored per namespace — a hit rate below 60% on an established namespace indicates either a TTL that is too short or an entity resolution fragmentation problem causing frequent cache misses for unfamiliar entity forms.
