Scaling to 1B Memories: Index Tiers

By Arc Labs Research · 10 min read

Memory infrastructure decisions are different from "vector database" decisions. The access pattern is per-user namespaced; the typical user store is small; the global store is enormous. The right architecture depends on which dimension dominates your scale, and the answer changes as you grow.

Tier picker
Each tier has a clear capacity ceiling. The right tier is the one that can hold your store with margin.

Tier 1 — SQLite-vec, embedded

Up to 100K memories per user-store. Single SQLite file with a vector extension. The whole memory system runs in-process. Latency is sub-millisecond.

  • When to use: single-user agents, desktop apps, edge deployments.
  • When to leave: multi-tenancy, >100K per user, or distributed reads.
  • Failure mode: brute-force (linear-scan) search doesn't have HNSW's log-N scaling. Above 100K, query latency rises sharply.

Tier 2 — pgvector HNSW

Up to ~10M memories. Postgres with the pgvector extension; one HNSW index. Multi-tenant via row-level security or per-user schemas.

  • When to use: SaaS at any reasonable scale; you already run Postgres.
  • When to leave: p99 query latency rises above your SLO; index build times become operational issues.
  • Failure mode: HNSW index builds are expensive (and single-threaded before pgvector 0.6); rebuilds at this scale start hurting.

Tier 3 — sharded pgvector

Up to ~100M memories. Multiple Postgres instances, with a routing layer that picks the shard by user-ID hash. Each shard maintains its own HNSW index.

  • When to use: you've hit the single-instance pgvector ceiling and your ops team can manage Postgres clusters.
  • When to leave: rebuild orchestration becomes a bottleneck; cross-shard queries become common.
  • Failure mode: coordinated shard rebuilds, consistency during migrations, and tail-latency outliers from straggler shards.

Tier 4 — specialized vector DBs

100M+ memories. Qdrant, Milvus, or similar — engines purpose-built for vector workloads, often paired with product quantization (PQ) or DiskANN for memory-efficient storage.

  • When to use: billion-scale; you've outgrown Postgres; recall vs cost is a real constraint.
  • When to leave: you start questioning whether you actually need a billion memories — usually you don't, and aggressive consolidation is cheaper than infra growth.
  • Failure mode: operational complexity; ecosystem maturity vs Postgres; coordinating with the rest of your data plane.

Vector store ≠ memory store

The vector index is one component. A real memory store also holds:

  • Lexical index (BM25, Tantivy or pg_trgm).
  • Entity graph (typed edges, hop traversal).
  • Temporal index (BTREE on event timestamps).
  • Audit ledger (supersession chains, access counts).
  • Metadata (confidence, type, freshness).

Tier 4 vector DBs typically don't manage these — you'll need Postgres anyway for the rest. At very large scale, the architecture becomes "Postgres for everything except the vector index, which lives in a specialized store, with a coordinator that fans out to both" (a minimal sketch follows).
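
A minimal sketch of that coordinator, assuming hypothetical async clients for the two stores (pg_search, vector_search, and the naive score fusion are illustrative names, not a real API):

import asyncio

# Hypothetical async clients — stand-ins for a Postgres pool and a vector-DB client.
async def pg_search(namespace_id: str, query: str) -> list[dict]:
    """Lexical + temporal + metadata candidates from Postgres (stub)."""
    return [{"id": "m1", "score": 0.42, "source": "bm25"}]

async def vector_search(namespace_id: str, query_embedding: list[float]) -> list[dict]:
    """ANN candidates from the specialized vector store (stub)."""
    return [{"id": "m2", "score": 0.87, "source": "ann"}]

async def retrieve(namespace_id: str, query: str, query_embedding: list[float]) -> list[dict]:
    # Fan out to both stores concurrently; Postgres remains the system of record.
    pg_hits, ann_hits = await asyncio.gather(
        pg_search(namespace_id, query),
        vector_search(namespace_id, query_embedding),
    )
    # Naive fusion: merge by memory id, keep the best score per memory.
    best: dict[str, dict] = {}
    for hit in pg_hits + ann_hits:
        if hit["id"] not in best or hit["score"] > best[hit["id"]]["score"]:
            best[hit["id"]] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)

if __name__ == "__main__":
    print(asyncio.run(retrieve("ns_demo", "what does Priya do?", [0.0] * 1536)))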

Reality check

Most teams that imagine they need 1B memories don't. A typical memory system at 1M users and ~200 retained memories per user is 200M memories — at the upper edge of tier 3, not tier 4. Above that, consolidation usually beats further scale-out: aggressive supersession + decay reduces memory counts faster than user growth grows them. Before reaching for tier 4, audit your store. What's your dedup rate? Your supersession rate? Your decay floor? An aggressive cleanup pass often shrinks the store enough to stay in tier 3 for another year.

HNSW parameter math

Understanding HNSW parameters makes tier transitions easier to reason about.

Memory cost:

bytes_per_vector ≈ dims × 4 + m × 8 × log₂(N)

For N = 1M vectors, dims = 1536, m = 16:

bytes_per_vector ≈ 6,144 + 128 × 20 ≈ 8,700 bytes
Total index ≈ 8.7 GB in RAM

At N = 10M, the log₂ term grows by ~3 bits: per-vector cost rises to ~9,100 bytes and the index grows to ~90 GB. This is where tier 2 becomes uncomfortable on a typical 64 GB server — you start fighting the OS page cache.

Build time:

T_build ≈ N × log(N) × m × ef_construction / parallelism

Empirically: 1M vectors on 8 cores → 20–40 minutes at ef_construction = 200. 10M vectors → 4–8 hours. At 10M, you care about incremental insertion latency (amortized log(N) per insert) rather than cold rebuild time.

Query time:

T_query ≈ log(N) × ef_search × vector_distance_cost

ef_search = 80: ~5–15ms for 1M vectors, ~15–30ms for 10M. Raising ef_search increases recall@k but costs linearly. At 10M memories with ef_search = 120, you start bumping against p99 SLOs if your SLO is 100ms end-to-end.
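
A back-of-envelope calculator for the three formulas, useful when sizing a tier transition. The structure mirrors the formulas above; the throughput and per-distance constants are assumptions tuned to roughly reproduce the worked numbers, not measurements:

import math

def hnsw_index_bytes(n: int, dims: int, m: int = 16) -> float:
    # bytes_per_vector ≈ dims × 4 + m × 8 × log2(N), times N vectors
    per_vector = dims * 4 + m * 8 * math.log2(n)
    return n * per_vector

def hnsw_build_minutes(n: int, m: int = 16, ef_construction: int = 200,
                       parallelism: int = 8, ops_per_core_per_s: float = 4.0e6) -> float:
    # T_build ≈ N × log(N) × m × ef_construction / parallelism
    # ops_per_core_per_s is an assumed throughput constant, tuned so that
    # 1M vectors on 8 cores lands near the 30-minute mark quoted above.
    ops = n * math.log2(n) * m * ef_construction
    return ops / (parallelism * ops_per_core_per_s) / 60

def hnsw_query_ms(n: int, ef_search: int = 80, distance_us: float = 7.5) -> float:
    # T_query ≈ log(N) × ef_search × vector_distance_cost
    # distance_us is an assumed per-distance cost; real systems see a bigger
    # jump at 10M+ because the graph no longer fits in CPU cache.
    return math.log2(n) * ef_search * distance_us / 1000

for n in (1_000_000, 10_000_000):
    print(f"N={n:>11,}:"
          f" index ≈ {hnsw_index_bytes(n, 1536) / 1e9:5.1f} GB,"
          f" build ≈ {hnsw_build_minutes(n):5.0f} min,"
          f" query ≈ {hnsw_query_ms(n):4.1f} ms at ef_search=80")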

When HNSW starts breaking down

HNSW becomes problematic above ~50M vectors on a single node:

  • Index build time exceeds maintenance windows. A nightly rebuild of a 50M-vector index takes 8+ hours on a 16-core machine. You can't afford nightly rebuilds.
  • RAM footprint exceeds available memory. At 50M vectors with m=16 and 1536-dim embeddings, the formula above puts the index near 470 GB — far more than commodity servers hold once data tables and the OS page cache are added on top.
  • Partial indexes become necessary for multi-tenancy: a global HNSW at 50M must post-filter by namespace, which reduces effective ef_search headroom and hurts recall.

Tier 3 solutions: time-partitioned HNSW per namespace (smaller indexes, faster rebuilds, cold tiering for old partitions). Tier 4: consider DiskANN (disk-resident approximate NN for memory-constrained environments) or IVF-PQ (product quantization reduces index size by 10–30× at modest recall cost).

Four-layer cache system

Independent of scale tier, Recall stacks four caches:

Layer                  Contents                          Storage              TTL        Hit rate
L1: Query result       Fused search results              In-process DashMap   30s        ~20%
L2: Entity attribute   Canonical names, top attributes   Redis                1 hour     ~80%
L3: Embedding          query_hash → embedding            Redis                24 hours   ~40%
L4: HNSW warm buffer   HNSW graph pages                  Shared memory        No TTL     ~95%

L2 (entity attribute) is what makes entity resolution fast — looking up "Priya" and getting ent_priya_XYZ with attributes hits Redis at under 2ms rather than querying Postgres. L4 keeps the HNSW graph warm across Postgres restarts; without it, the first query after a restart is slow while pages are loaded from disk.
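
A simplified sketch of the read-through layering for L1–L3 (L4 is OS/Postgres shared memory and lives below application code). Plain in-process dicts stand in for DashMap and Redis; the layer names and TTLs come from the table, the example data is illustrative:

import time
from typing import Callable

class TTLCache:
    """Toy stand-in for one cache layer (DashMap or Redis)."""
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.data: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.time() - stored_at > self.ttl:
            del self.data[key]
            return None
        return value

    def put(self, key: str, value) -> None:
        self.data[key] = (time.time(), value)

def cached(layer: TTLCache, key: str, compute: Callable[[], object]):
    """Read-through helper: return the cached value, or compute, store, and return it."""
    value = layer.get(key)
    if value is None:
        value = compute()
        layer.put(key, value)
    return value

# TTLs from the table: 30s query results, 1h entity attributes, 24h embeddings.
l1_query_results = TTLCache(30)
l2_entity_attrs = TTLCache(3600)
l3_embeddings = TTLCache(24 * 3600)

# Example: the second lookup of "Priya" is served from L2 instead of Postgres.
attrs = cached(l2_entity_attrs, "entity:priya",
               lambda: {"id": "ent_priya_XYZ", "attributes": {"team": "platform"}})
print(attrs)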

Re-embedding at scale: the shadow strategy

When you change embedding models, you can't just swap the model and rebuild — old embeddings in the index are incompatible with new query embeddings. The shadow strategy runs the two models in parallel:

  1. Dual-write: new memories embed with both old (v1) and new (v2) models.
  2. Backfill: background job re-embeds historical memories with v2. Rate-limited to avoid provider throttling. Checkpointed for resumption.
  3. Shadow query: reads compute results against both v1 and v2 and compare top-K overlap via Jaccard similarity over result IDs. Alert if divergence (1 − Jaccard) exceeds 0.5.
  4. Validation: run LongMemEval-style probes against both versions.
  5. Cutover: queries switch to v2. v1 retained for N days as rollback.
  6. Cleanup: drop v1 column and index.
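
A minimal sketch of the step-3 comparison, treating divergence as 1 − Jaccard over the two top-K result-ID sets (the 0.5 threshold is from the list above; the IDs are illustrative):

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two result-ID sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def shadow_divergence(v1_top_k: list[str], v2_top_k: list[str]) -> float:
    """Divergence = 1 − Jaccard over top-K memory IDs from the v1 and v2 indexes."""
    return 1.0 - jaccard(set(v1_top_k), set(v2_top_k))

ALERT_THRESHOLD = 0.5  # step 3: alert if divergence > 0.5

v1 = ["m1", "m2", "m3", "m4", "m5"]
v2 = ["m1", "m2", "m9", "m8", "m7"]
d = shadow_divergence(v1, v2)
if d > ALERT_THRESHOLD:
    print(f"ALERT: shadow divergence {d:.2f} exceeds {ALERT_THRESHOLD}")
else:
    print(f"shadow divergence {d:.2f} within tolerance")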

Cost to re-embed 10M memories at OpenAI text-embedding-3-small rates (~50 tokens per memory): roughly $10 in API fees and a few days of wall-clock time at 3,000 requests per minute. Most of the "cost" is time, not dollars — plan accordingly.

Scaling lexical and graph indexes alongside vectors

The vector index gets the most attention in scaling discussions, but it is only one of five indexes in a memory store. Scaling to 100M+ memories requires planning all five in parallel — the vector index rarely becomes the bottleneck first.

BM25 / GIN index

PostgreSQL's GIN index for full-text search scales differently from HNSW. At 10M memories, a GIN index on to_tsvector(content) is approximately 2–4 GB, and a fresh build takes 10–15 minutes — a far friendlier cold-build story than HNSW's.

The bottleneck at 100M records is not build time but write-time maintenance: every INSERT updates the GIN index, which can mean touching thousands of posting lists. At high write rates (1,000+ turns/second), GIN writes become a measurable bottleneck. The mitigation is GIN's fastupdate pending list (sized via gin_pending_list_limit), which batches updates and flushes them in bulk — during vacuum or when the list exceeds the limit — rather than updating posting lists on every write. This can cut write-time GIN overhead by 60–80%; the trade-off is that queries must also scan the unflushed pending list, which adds a little read latency until the next flush.

At 100M memories, the GIN index is 20–40 GB. This is large but manageable on a dedicated Postgres instance. The GIN does not need to live on the same machine as the HNSW — at tier 3, BM25 traffic can route to a read-replica while ANN traffic routes to the primary or dedicated vector DB.

Entity graph (edge table)

The entity graph scales by node and edge count, not by memory count. A namespace with 10M memories might have 500K entities and 2M edges. A Postgres table with 2M rows is trivial — the entity graph scales easily to 100M edges before requiring specialized infrastructure.

Edge traversal (multi-hop queries) uses recursive CTEs (WITH RECURSIVE) in PostgreSQL. Performance characteristics:

  • 2-hop traversal at 2M edges: under 5ms
  • 3-hop traversal at 2M edges: 15–50ms depending on fan-out
  • 4-hop traversal: exponential growth in intermediate results, frequently timing out

Cap hop depth at 3 in production. Beyond 3 hops, the signal-to-noise ratio falls below retrieval usefulness on real conversational graphs — and the query time cost is significant. The practical answer to "why limit at 3 hops?" is that measured LongMemEval scores do not improve past 3 hops while query latency rises sharply.
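
A sketch of a hop-capped traversal, assuming a hypothetical edges(src_entity_id, dst_entity_id, relation) table and psycopg2 for execution; the 3-hop cap is enforced inside the recursive CTE itself:

import psycopg2

# Hypothetical schema: edges(src_entity_id, dst_entity_id, relation).
TRAVERSE_SQL = """
WITH RECURSIVE hops AS (
    SELECT e.dst_entity_id, e.relation, 1 AS depth
    FROM edges e
    WHERE e.src_entity_id = %(start)s
  UNION ALL
    SELECT e.dst_entity_id, e.relation, h.depth + 1
    FROM edges e
    JOIN hops h ON e.src_entity_id = h.dst_entity_id
    WHERE h.depth < %(max_hops)s   -- cap depth: beyond 3 hops recall gains vanish
)
SELECT DISTINCT dst_entity_id, relation, depth
FROM hops
ORDER BY depth;
"""

def traverse(conn, start_entity_id: str, max_hops: int = 3):
    with conn.cursor() as cur:
        cur.execute(TRAVERSE_SQL, {"start": start_entity_id, "max_hops": max_hops})
        return cur.fetchall()

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=memories")  # connection string is illustrative
    print(traverse(conn, "ent_priya_XYZ", max_hops=3))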

Temporal BTREE index

The event_at BTREE index is the cheapest to maintain of all five. BTREE is update-friendly and scales linearly with row count. At 10M event memories, a range scan for a 7-day window returns fewer than 1,000 results in 2–5ms. The temporal index does not require sharding until 500M+ events. If you are sharding before that threshold, shard by namespace_id first, which collocates each user's temporal data on one shard and keeps range scans local.

Metadata indexes

Confidence, status, schema_version, and namespace — composite indexes cover these. A composite index on (namespace_id, status, confidence) enables the most common filter pattern (all active memories for a user above a confidence threshold). At 100M memories, this index is approximately 3–4 GB. Unlike GIN, composite BTREE indexes have minimal write-time overhead — they add roughly 10–20% to INSERT cost rather than the potentially large posting-list updates GIN requires.

The key insight across all five indexes: GIN is the write-time bottleneck at high write rates, HNSW is the RAM bottleneck at high memory counts, and the graph and temporal indexes scale without drama until you are well into tier 3. Plan your scaling headroom for GIN and HNSW first; the others will not surprise you.

Multi-tenant scaling patterns

A memory system serving 1M users has different scaling constraints than one serving 1K users. The unit of isolation is the namespace (one per user, typically); the challenge is that all namespaces share infrastructure while having wildly different usage profiles.

Per-user namespace isolation

Row-level security in PostgreSQL ensures each session sees only its own namespace's rows — effectively adding WHERE namespace_id = $user_namespace to every query. This is correct and necessary, but it puts a predicate on every query. At 10M total memories across 100K users (average 100 memories/user), the namespace predicate is cheap — most indexes include namespace_id as a leading column, so the predicate prunes at index traversal time rather than requiring a heap scan.

The index design that makes this work: all five indexes should use namespace_id as the leftmost column in composite indexes, or — for HNSW — pair the vector index with a namespace_id filter that applies during approximate neighbor search rather than post-hoc. pgvector supports filtering via WHERE predicates on top of HNSW, but with degraded recall if the filter is highly selective. The workaround is per-namespace HNSW indexes for large namespaces (discussed below).

Hot namespace problem

Some users have vastly more memories than others: power users with long conversation histories, or organizations sharing a single namespace. A namespace with 500K memories uses more resources than 5,000 namespaces with 100 memories each — not just in storage, but in HNSW recall degradation (post-filter selectivity), BM25 query time (more posting list entries), and graph traversal fan-out (more edges per hop).

Identify hot namespaces by memory count with a simple query run weekly:

SELECT namespace_id, COUNT(*) AS memory_count
FROM memories
GROUP BY namespace_id
ORDER BY memory_count DESC
LIMIT 20;

Namespaces above a threshold (100K memories is a reasonable trigger) are candidates for dedicated Postgres schemas with per-namespace HNSW indexes. A per-namespace HNSW at 100K vectors is tiny — under 1 GB RAM — and gives near-perfect recall without post-filter degradation. Route hot namespaces to their dedicated schema; cold namespaces remain in the shared table.

Write rate spikes

A user who pastes 1,000 conversation turns at once (backfilling history, importing a chat log) can saturate the write pipeline and affect other users' write latency. Rate limiting per namespace at the API layer prevents this. Default: 100 turns/minute per namespace with a burst allowance of 200 turns. For deliberate backfill operations, use the batch import endpoint, which places work in the background worker queue rather than the synchronous write path. The background worker drains at a rate that does not impact real-time write latency for other namespaces.
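
A sketch of per-namespace admission control — a token bucket with the defaults above (100 turns/minute sustained, burst of 200). The structure is illustrative, not Recall's actual limiter:

import time
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    rate_per_sec: float = 100 / 60           # 100 turns/minute sustained
    burst: float = 200.0                      # burst allowance of 200 turns
    tokens: float = 200.0
    last_refill: float = field(default_factory=time.monotonic)

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last_refill) * self.rate_per_sec)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def admit_turn(namespace_id: str) -> bool:
    """Return False to reject (or redirect the caller to the batch-import path)."""
    return buckets[namespace_id].allow()

# A 1,000-turn paste exhausts the burst after ~200 turns; the remainder belongs
# on the batch import endpoint, not the synchronous write path.
accepted = sum(admit_turn("ns_power_user") for _ in range(1000))
print(f"accepted {accepted} of 1000 synchronous writes")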

The write-rate spike also affects GIN index maintenance at high aggregate rates. If 100 users simultaneously backfill at 100 turns/minute each, that is 10,000 writes/minute — high enough to stress GIN pending-list flushing. Monitor pg_stat_user_tables.n_live_tup and the pending-list statistics from pgstatginindex() (in the pgstattuple extension) to detect bloat accumulation during sustained high-write periods.

Shard transition planning

Moving from tier 2 (pgvector, single instance) to tier 3 (sharded) or tier 4 (specialized vector DB) requires careful planning because the transition must be zero-downtime. Users cannot experience degraded retrieval quality during migration.

Tier 2 to Tier 3 (adding shards)

A four-week migration timeline:

Day 1:    Provision shard 2 (new Postgres instance with pgvector)
Day 2:    Add routing layer (consistent hash by namespace_id)
Day 3–7:  Backfill existing namespaces to their target shard
          (dual-write to both shards during backfill window)
Day 7:    Cut over write path — new namespaces route to target shard
          via hash(namespace_id) % num_shards
Week 2-4: Migrate remaining existing namespaces at background priority
Day 30:   All existing namespaces migrated; remove dual-write path;
          decommission shared table

Routing is by hash(namespace_id) % num_shards. For N = 4 shards, hash(namespace_id) mod 4 determines the shard. This is stable — a given namespace always routes to the same shard without a coordination table, which keeps the routing layer stateless and fast.
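
A sketch of the stateless routing function. Note the stable hash (md5 here) — Python's built-in hash() is salted per process and would route the same namespace differently across instances. The shard DSNs are illustrative:

import hashlib

NUM_SHARDS = 4
SHARD_DSNS = [f"postgresql://shard{i}.internal/memories" for i in range(NUM_SHARDS)]

def shard_for(namespace_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stable hash(namespace_id) % num_shards — same namespace, same shard, no lookup table."""
    digest = hashlib.md5(namespace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

print(shard_for("ns_42"), SHARD_DSNS[shard_for("ns_42")])

The timeline above says "consistent hash"; plain modulo is the simplest stable scheme but remaps most namespaces whenever the shard count changes, so a consistent-hash ring is worth the extra complexity if you expect to add shards more than once.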

Cross-shard queries are rare in memory systems because nearly all queries are scoped to a single namespace. When cross-shard is required (administrative queries, analytics), fan-out to all shards and merge results at the routing layer. Design APIs to avoid cross-shard patterns in the hot path — they are linearly slower with shard count.

Tier 3 to Tier 4 (adding a dedicated vector DB)

Vector-specific traffic (ANN queries) moves to the specialized store (Qdrant, Milvus, Weaviate); all other operations — BM25, graph traversal, temporal range scans, metadata filters — stay in Postgres. The routing layer dispatches based on operation type, not namespace.

This "split-brain" architecture requires strict write synchronization: a memory write that lands in Postgres must also trigger an embedding upsert in the vector DB within the consistency SLA. Implement this as a transactional outbox: the Postgres write appends to an embedding_queue table within the same transaction, and the background worker drains the queue to the vector DB with at-least-once delivery. Idempotency key: the memory's stable UUID.

The failure mode to design for is queue backlog: if the vector DB is slow or offline, the outbox accumulates. Design the backlog tolerance (how many queued embeddings before the write path starts rejecting or slowing) based on your vector DB's observed recovery time.

Cost optimization at scale

Memory infrastructure costs more than expected if you are not paying attention to the right cost drivers. At 100M+ memories, three items dominate the budget.

Embedding API cost

At OpenAI text-embedding-3-small pricing (~$0.02 per 1M tokens), embedding 100M memories at an average of 50 tokens each is 5B tokens, or about $100. That is a one-time cost for the initial corpus. The ongoing cost for new memories — 1M new memories per day at 50 tokens each — is 50M tokens per day, or about $1 per day: essentially free.

The expensive event is model migration: re-embedding 100M memories when switching from text-embedding-3-small to text-embedding-3-large (or any model change) costs another ~$100 in API fees and weeks of background worker time at 3,000 requests per minute (roughly three weeks unbatched; less with batched requests). Model selection decisions are consequential. The shadow strategy described above gives you the tooling to validate before committing; the cost model means you do not want to do this frequently. Plan embedding model selection as a 1–2 year commitment, not a quarterly optimization.
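
The same arithmetic as a small calculator — the price and token counts are the assumptions stated above, so verify current provider pricing before budgeting from it:

PRICE_PER_1M_TOKENS = 0.02    # text-embedding-3-small, USD (from the text; verify current pricing)
AVG_TOKENS_PER_MEMORY = 50

def embed_cost_usd(num_memories: int) -> float:
    return num_memories * AVG_TOKENS_PER_MEMORY / 1_000_000 * PRICE_PER_1M_TOKENS

def reembed_days(num_memories: int, requests_per_minute: int = 3000, batch_size: int = 1) -> float:
    """Wall-clock time is dominated by the rate limit, not dollars."""
    requests = num_memories / batch_size
    return requests / requests_per_minute / 60 / 24

print(f"initial corpus, 100M memories: ${embed_cost_usd(100_000_000):,.0f}")
print(f"ongoing, 1M new memories/day:  ${embed_cost_usd(1_000_000):,.2f}/day")
print(f"full re-embed of 100M:         ${embed_cost_usd(100_000_000):,.0f} "
      f"and ~{reembed_days(100_000_000):.0f} days at 3,000 RPM (unbatched)")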

LLM extraction cost

At Claude Haiku pricing, 100M extraction calls (including input context and output JSON) cost approximately $1,000–2,000. This is the largest ongoing operational cost in the memory pipeline — larger than storage, larger than the embedding API. The pre-filter gate (the classifier that rejects turns unlikely to contain memorable content) is the primary cost lever: rejecting 55% of turns before the LLM extraction call saves $550–1,100 at 100M-memory scale.

HyPE (hypothetical question generation at write time) adds approximately 30% to extraction cost because it generates additional LLM output per memory. It is opt-in and improves retrieval quality on direct question queries. At 100M scale, opting into HyPE adds $300–600 to extraction costs — justified if your workload includes frequent direct question retrieval patterns (e.g., "what is Alice's phone number?"), unjustified for workloads that are primarily conversational context retrieval.

Database storage cost

At AWS RDS pricing, 1 TB of gp3 SSD storage is approximately $100–115/month. Each memory carries one embedding vector at 1536 dims × 4 bytes = 6,144 bytes, so 100M memories consume roughly 615 GB for embeddings alone. Adding text content (average 200 bytes), metadata, audit fields, indexes, and TOAST overhead brings the total to approximately 1 TB at 100M memories. Budget $100–200/month for storage at this scale.

Retrieval infrastructure (Redis for L2/L3 cache, dedicated vector DB if at tier 4) adds $200–500/month at 100M scale depending on Redis instance size and vector DB pricing model.

The consolidation dividend

Aggressive deduplication reduces the effective memory count and therefore every cost that scales with memory count. A system with 10M raw stored memories and a 40% dedup rate (aggressive supersession + entity attribute consolidation) has 6M effective memories in active indexes. Each 10% reduction in effective memory count reduces:

  • HNSW RAM footprint by 10%
  • Embedding API re-embedding cost by 10%
  • BM25 GIN index size by 10%
  • Storage cost by 10%

The consolidation dividend compounds: a system that aggressively consolidates from the start avoids the infrastructure scaling work that a system with poor dedup requires. Measure dedup rate monthly. If it is below 20% at 1M memories, the extraction and supersession pipeline has a problem — it is storing too many near-duplicate memories.
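
A toy illustration of the dividend, reusing figures from this section (~9 KB per vector from the HNSW math above, 50 tokens per memory, $0.02 per 1M tokens):

def effective_costs(raw_memories: int, dedup_rate: float) -> dict[str, float]:
    """Every cost that scales with memory count scales with the *effective* count."""
    effective = raw_memories * (1 - dedup_rate)
    return {
        "effective_memories_M": effective / 1e6,
        "hnsw_ram_gb": effective * 9_000 / 1e9,           # ~9 KB per vector
        "reembed_cost_usd": effective * 50 / 1e6 * 0.02,  # 50 tokens/memory, $0.02 per 1M tokens
    }

for dedup in (0.0, 0.2, 0.4):
    print(f"dedup {dedup:.0%}: {effective_costs(10_000_000, dedup)}")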
