HNSW Tuning: m, ef, and Memory Cost
Hierarchical Navigable Small World (HNSW) is the algorithm behind almost every production vector index — pgvector, Qdrant, Milvus, Weaviate. The reason is straightforward: it delivers very good recall (95%+) at very low query latency, with predictable resource costs you can budget for.
The three parameters
- m (default 16): the maximum number of bidirectional links per node in the upper layers. Higher m → better recall, more memory, slower build.
- ef_construction (default 200): the size of the dynamic list during insertion. Higher → better quality index, much slower build, no query-time impact.
- ef_search (default 64): the size of the dynamic list during query. Higher → better recall, slower query, no memory impact.
Memory cost
bytes_per_vec ≈ dims × 4 + m × 8 × log₂(N)
Approximate; assumes float32 vectors and 8-byte link IDs.
- A 768-dim vector at m=16, N=1M: 3072 + 16 × 8 × 20 = 5632 bytes ≈ 5.5 GB for 1M vectors.
- Same parameters at N=100M: 3072 + 16 × 8 × 27 = 6528 bytes ≈ 0.65 TB.
- The dims term dominates for high-dim embeddings; the link term dominates for very large N at low dims.
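The formula is easy to script for capacity planning. A minimal sketch (Python), assuming float32 vectors, 8-byte link IDs, and log₂(N) rounded to the nearest integer as in the worked examples above:

```python
import math

def hnsw_bytes_per_vec(dims: int, m: int, n: int) -> int:
    """Per-vector cost from the formula above: float32 storage plus
    8-byte link IDs, with log2(N) rounded to the nearest integer."""
    return dims * 4 + m * 8 * round(math.log2(n))

def hnsw_index_gb(dims: int, m: int, n: int) -> float:
    """Total index size in decimal GB."""
    return hnsw_bytes_per_vec(dims, m, n) * n / 1e9

print(hnsw_bytes_per_vec(768, 16, 1_000_000))     # 5632 bytes, as above
print(hnsw_index_gb(768, 16, 100_000_000))        # 652.8 GB ≈ 0.65 TB
```

Running this reproduces both worked examples, which makes it a convenient sanity check before provisioning.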
Build time
build_time ≈ N × log(N) × m × ef_construction × cost_per_distance
Dominated by ef_construction; rebuilds are slow.
For N=10M, m=16, ef_construction=200 on a single thread, expect roughly 3–6 hours depending on dims and CPU. Most production systems batch-insert in parallel and accept slightly worse index quality in exchange for time.
Query time
query_time ≈ log(N) × ef_search × cost_per_distance
Logarithmic in N; linear in ef_search.
Empirically, p99 query latency on commodity CPU at N=10M, ef_search=64, dims=768: roughly 4–8ms per query. ef_search=256: 16–30ms. The trade-off is recall.
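The query-time formula can be turned into a back-of-envelope estimator. A sketch, where the 768 ns per-distance constant is an assumption (one 768-dim float32 distance at roughly 10⁹ float ops per second); measured p99 runs higher than this mean estimate because of traversal overhead and cache misses:

```python
import math

def query_distance_evals(n: int, ef_search: int) -> int:
    """Distance computations per query ≈ log2(N) × ef_search."""
    return round(math.log2(n)) * ef_search

def query_latency_ms(n: int, ef_search: int, ns_per_distance: float = 768.0) -> float:
    """Back-of-envelope mean latency; ns_per_distance is an assumed constant."""
    return query_distance_evals(n, ef_search) * ns_per_distance / 1e6

print(query_distance_evals(10_000_000, 64))  # 1472 distance evaluations
```

At N=10M and ef_search=64 this gives roughly 1.1 ms of pure distance work, consistent with the observed 4–8 ms p99 once graph traversal and tail effects are added.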
When to rebuild vs. update incrementally
HNSW supports incremental insert — but quality degrades over time as distributions drift. Rules of thumb:
- Below 10M memories: rebuild quarterly. The cost is bounded; the quality win is real.
- 10M–100M: rebuild yearly; partial rebuild on shards in between.
- 100M+: never full-rebuild. Use sharded indexes; rebuild one shard at a time, behind a router that drains traffic from the shard under rebuild.
When HNSW is not the right choice
- Below 100K memories. Brute force or IVF is simpler and faster to maintain. HNSW pays off at scale, not at small N.
- Above 1B memories. Memory cost becomes prohibitive. Consider IVF-PQ (product quantization) or DiskANN — sacrifice some recall for radically lower RAM.
- Frequent deletes. HNSW handles deletion via tombstoning; the index accumulates dead nodes. If your churn rate is high, the index degrades faster than you can rebuild.
Why approximate nearest neighbors exist
Exact nearest neighbor search in d-dimensional space requires computing the distance between the query vector and every vector in the store: O(N × d) distance operations with no way to skip candidates without violating the exactness guarantee. At N=10M and d=768 (a standard embedding dimension for models like text-embedding-3-small), this is 7.68 billion floating-point multiplications per query. On a modern server CPU capable of roughly 1 billion float ops per second in practice (accounting for memory bandwidth and cache misses), this is approximately 7–8 seconds per query — entirely unusable for interactive agent memory retrieval where the budget is 50ms end-to-end.
HNSW trades exactness for speed by building a graph structure that allows the search to skip most of the vector store. At N=10M and ef_search=64, HNSW evaluates on the order of log(N) × ef_search ≈ 23 × 64 ≈ 1,472 distance computations per query instead of 10,000,000. That is roughly a 6,800× reduction in work, which translates directly to the observed 4–8ms query latency. The cost is that the result set has bounded error: at ef_search=64, HNSW finds a set of 10 candidates that overlaps approximately 95% with the true 10 nearest neighbors. The 5% gap almost always falls in positions 8–10 of the exact ranking, not position 1. For agent memory retrieval, missing the 9th most relevant memory in favor of the 11th has no practical consequence.
The exactness gap is the reason HNSW has a separate ef_search control at query time, distinct from the build-time ef_construction. You cannot retroactively improve the graph topology after building it (short of rebuilding), but you can always spend more query-time effort searching the existing graph more thoroughly. Setting ef_search=256 at query time on an index built with ef_construction=200 will find approximately 99% of exact neighbors, at the cost of 3–4× higher query latency. The separation of build-quality and search-effort into two independent controls is what makes HNSW operationally practical: you build once, then tune ef_search against your SLA without touching the index.
The error bound on approximate search is not uniform across all queries. Queries in regions of the embedding space with high vector density (many similar memories clustered together) tend to have higher recall because the greedy graph search naturally stays in the right neighborhood. Queries in sparse regions (unusual topics with few relevant memories) tend to have lower recall because the graph navigation may converge to a local minimum that is not the global nearest neighbor. For agent memory, this means the retriever is most reliable for common topics and least reliable for unusual ones — the opposite of what BM25 experiences, which makes them natural complements in a fusion architecture.
How HNSW is structured
HNSW builds a multi-layer graph over the vector store. Layer 0 is the base layer and contains all N nodes. Layer 1 contains approximately N/m nodes selected probabilistically during construction. Layer 2 contains N/m² nodes, and so on, until the top layer contains a single entry point node. The exact layer assignment for each node is determined at insertion time by drawing from a geometric distribution with parameter 1/m_L (where m_L is a configurable factor, typically set to 1/ln(m)). This probabilistic assignment is the source of the "hierarchical" property: nodes at higher layers form a coarse connectivity skeleton over the full dataset.
Search begins at the top layer with the single entry point. The algorithm greedily moves to whichever neighbor of the current node is closest to the query vector, continuing until it reaches a local minimum at that layer. It then descends to the next layer and repeats the greedy traversal, starting from where it stopped. This layer-by-layer descent means that the high-layer traversal rapidly narrows the search from the full vector space to a small neighborhood, and the low-layer traversal refines within that neighborhood. The structure explains the O(log N) query complexity: each layer shrinks the effective search space by roughly a factor of m, and there are approximately log(N)/log(m) layers.
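The descent can be sketched in a few lines. This is a deliberately simplified toy (ef=1, a single greedy path, hand-built 2-D layers), not a production traversal:

```python
import math

def greedy_search(layers, coords, entry, query):
    """Greedy layer-by-layer descent. `layers` is a list of adjacency
    dicts (layer 0 first, top layer last); `coords` maps node id -> point."""
    current = entry
    for graph in reversed(layers):              # top layer down to layer 0
        improved = True
        while improved:                         # move greedily until a local minimum
            improved = False
            for neighbor in graph.get(current, []):
                if math.dist(coords[neighbor], query) < math.dist(coords[current], query):
                    current, improved = neighbor, True
        # `current` carries over as the start node of the next layer down
    return current

coords = {0: (0.0, 0.0), 1: (5.0, 0.0), 2: (9.0, 0.0), 3: (9.5, 1.0)}
layer1 = {0: [2], 2: [0]}                       # coarse upper-layer skeleton
layer0 = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(greedy_search([layer0, layer1], coords, entry=0, query=(9.4, 0.9)))  # 3
```

The upper layer jumps straight from node 0 to node 2 across the space; layer 0 then refines one hop to node 3, the true nearest neighbor.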
Insertion into a built HNSW index follows the same layer assignment and greedy search logic. When a new node is inserted, the algorithm first determines its maximum layer (random draw), then for each layer from the top down to 0, it finds the ef_construction nearest neighbors via greedy search and creates bidirectional links. Bidirectionality is the "navigable small world" property: not only does the new node link to its neighbors, but each neighbor links back to the new node. This ensures that the graph has no unreachable isolated clusters — from any node, you can reach any other node by following links, which is necessary for correct search behavior.
Link pruning is applied when a node would exceed m connections at a given layer. HNSW uses a heuristic called "select neighbors by heuristic" (the original paper's Algorithm 4) rather than simply keeping the m closest neighbors. The heuristic favors neighbors that provide diverse graph connectivity over neighbors that are all clustered in the same direction. The practical effect is that HNSW avoids the degenerate case where a hub node connects only to its closest cluster, leaving distant regions of the graph unreachable without an expensive traversal through the hub. This is why HNSW empirically outperforms simpler graph-based indexes (like KGraph) on real-world data distributions with clusters.
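The diversity heuristic itself is compact. A sketch in the spirit of the paper's Algorithm 4 — a candidate is kept only if it is closer to the inserted node than to every neighbor already selected, so links spread across directions rather than piling into one cluster (simplified: no keep-pruned-connections fallback):

```python
import math

def select_neighbors_heuristic(query, candidates, m, coords):
    """Keep at most m candidates, preferring diverse directions over
    a single tight cluster."""
    selected = []
    for c in sorted(candidates, key=lambda c: math.dist(coords[c], query)):
        if len(selected) == m:
            break
        if all(math.dist(coords[c], query) < math.dist(coords[c], coords[r])
               for r in selected):
            selected.append(c)
    return selected

# Three clustered candidates on one side, one lone candidate on the other:
coords = {0: (1.0, 0.0), 1: (1.1, 0.1), 2: (1.2, -0.1), 3: (-2.0, 0.0)}
print(select_neighbors_heuristic((0.0, 0.0), [0, 1, 2, 3], 2, coords))  # [0, 3]
```

Plain nearest-m would pick the two clustered nodes [0, 1]; the heuristic instead keeps the distant node 3, preserving a link toward the far region of the graph.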
The entry point for search is the node at the topmost layer, maintained as global state. When the graph is built from scratch, the first inserted node becomes the entry point. As the index grows and new top-layer nodes are created, the entry point may be updated to a more central node. In pgvector's implementation, the entry point is stored as a special metadata page in the index file. After a rebuild, the new entry point is computed as the node with the highest average connectivity across upper layers — this minimizes expected path length from the entry point to any query destination.
Parameter sensitivity analysis
The three HNSW parameters interact with the three output metrics — recall@10, query latency, and memory per vector — in ways that are not always intuitive. Understanding the sensitivity of each parameter independently is the prerequisite for tuning to a specific SLA.
Effect of m on recall, latency, and memory at N=1M, ef_search=64, dims=768.
| m | recall@10 | query latency (p99) | memory per vector |
|---|---|---|---|
| 8 | ~90% | 4ms | 4.2 KB |
| 16 | ~95% | 8ms | 5.6 KB |
| 32 | ~97% | 14ms | 8.4 KB |
| 64 | ~98% | 26ms | 14 KB |
m=16 is the standard default and represents the best tradeoff for most production workloads. Doubling m from 16 to 32 adds 2% recall at the cost of 50% more memory and 75% more query latency — diminishing returns on both axes. The latency increase from higher m is not driven by memory cost alone; the graph traversal evaluates more candidates per hop (32 neighbors per node instead of 16), multiplying the distance computations.
Effect of ef_search on recall and latency at m=16, N=1M, dims=768.
| ef_search | recall@10 | query latency (p99) |
|---|---|---|
| 32 | ~92% | 4ms |
| 64 | ~95% | 8ms |
| 128 | ~98% | 14ms |
| 256 | ~99% | 26ms |
ef_search is the only parameter you can tune without rebuilding the index, making it the primary operational lever. The recall gain from doubling ef_search is concave: going from 32→64 adds 3%, 64→128 adds 3%, 128→256 adds 1%. The latency scales nearly linearly with ef_search because each additional candidate in the dynamic list costs roughly one distance evaluation. For agent memory retrieval with a 50ms end-to-end budget and a 20ms budget for the semantic retriever specifically, ef_search=64 is the correct default. ef_search=128 is appropriate for high-value queries where missing the 9th most relevant memory is costly.
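Since ef_search is the one runtime lever, the tuning decision reduces to a lookup against a measured recall/latency curve. A sketch using the numbers from the table above as assumed measurements — re-measure on your own index before relying on them:

```python
# Measured (ef_search, recall@10, p99 ms) points; assumed values from the table.
CURVE = [(32, 0.92, 4), (64, 0.95, 8), (128, 0.98, 14), (256, 0.99, 26)]

def choose_ef_search(min_recall: float, latency_budget_ms: float):
    """Smallest ef_search meeting the recall floor within the latency
    budget; None if the SLA is infeasible on this curve."""
    for ef, recall, p99 in CURVE:
        if recall >= min_recall and p99 <= latency_budget_ms:
            return ef
    return None

print(choose_ef_search(0.95, 20))   # 64 — the default described above
print(choose_ef_search(0.99, 20))   # None — 99% needs ef=256 at 26ms
```

The None case is the useful one operationally: it tells you the SLA cannot be met by query-time tuning alone and the index itself (m, ef_construction, or sharding) must change.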
Effect of ef_construction on build quality and build time at m=16, N=10M.
| ef_construction | build time | recall@10 (at ef_search=64) |
|---|---|---|
| 100 | ~4h | ~92% |
| 200 | ~8h | ~95% |
| 400 | ~16h | ~96% |
ef_construction is a pure build-time parameter with zero runtime effect. Once the index is built, ef_construction is irrelevant — it does not appear in any query code path. This is a common source of confusion: operators sometimes increase ef_construction trying to improve query latency, which does nothing. The correct mental model is that ef_construction controls how much effort went into placing each link correctly during construction. Higher ef_construction means each inserted node had more candidates evaluated when selecting its m neighbors, producing better long-range connectivity. The quality gain from 200→400 is only 1%, making 400 rarely worth the 2× build time.
Memory sizing for real deployments
The formula bytes_per_vec ≈ dims × 4 + m × 8 × log₂(N) gives per-vector memory cost, but translating this to server sizing requires accounting for PostgreSQL overhead (shared_buffers, WAL, index metadata) and operating system page cache. The practical rule is: provision RAM at 1.5× the computed index size to leave room for query working memory, connection pools, and OS page cache. Running with less than 1.2× forces the OS to swap, and HNSW on swapped memory degrades to effectively random query latency.
Scenario A — Small namespace, 100K memories, dims=768, m=16.
bytes_per_vec = 768 × 4 + 16 × 8 × log₂(100,000) = 3,072 + 16 × 8 × 17 = 3,072 + 2,176 = 5,248 bytes
Total index size: 100,000 × 5,248 = 524 MB. At 1.5× overhead: ~786 MB. This fits comfortably in a 1 GB shared_buffers allocation (well above PostgreSQL's conservative 128 MB default) on any modern server. A namespace at this scale does not require dedicated infrastructure — a shared database server with 4–8 GB RAM handles dozens of such namespaces simultaneously.
Scenario B — Medium namespace, 1M memories, dims=1536 (OpenAI text-embedding-3-large), m=16.
bytes_per_vec = 1,536 × 4 + 16 × 8 × log₂(1,000,000) = 6,144 + 16 × 8 × 20 = 6,144 + 2,560 = 8,704 bytes
Total index size: 1,000,000 × 8,704 = 8.7 GB. At 1.5× overhead: ~13 GB. This requires a dedicated database server with at least 16 GB RAM to avoid swap. The dims term (6,144 bytes) now exceeds the link term (2,560 bytes), confirming the earlier observation: for high-dimensional embeddings, the raw vector storage dominates. Switching from dims=1536 to dims=768 would cut the index size roughly in half with a moderate recall penalty depending on the embedding model.
Scenario C — Large namespace, 10M memories, dims=768, m=16.
bytes_per_vec = 768 × 4 + 16 × 8 × log₂(10,000,000) = 3,072 + 16 × 8 × 23 = 3,072 + 2,944 = 6,016 bytes
Total index size: 10,000,000 × 6,016 = 60 GB. At 1.5× overhead: ~90 GB. No commodity server can hold a 90 GB index entirely in RAM economically — a 128 GB RAM server costs roughly $800/month in cloud pricing. At this scale, three options exist: vertical scaling (expensive), horizontal sharding (the recommended approach, splitting the namespace into 4–8 shards each below 10M vectors), or switching to a dedicated vector database (Qdrant, Weaviate) that offers disk-aware segment management. Disk-aware HNSW keeps frequently accessed graph layers in memory and colder layers on NVMe, trading some query latency for radically lower RAM requirements.
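The three scenarios follow one sizing routine: per-vector cost, 1.5× provisioning, and a shard count when one server cannot hold the result. A sketch, where the 16 GB-per-shard-server figure is an illustrative assumption:

```python
import math

def size_namespace(n: int, dims: int, m: int = 16,
                   overhead: float = 1.5, server_ram_gb: float = 16.0) -> dict:
    """Index size, provisioned RAM at the 1.5x rule, and how many
    shard servers of `server_ram_gb` RAM the namespace needs."""
    bytes_per_vec = dims * 4 + m * 8 * round(math.log2(n))
    index_gb = bytes_per_vec * n / 1e9
    provision_gb = index_gb * overhead
    return {
        "index_gb": round(index_gb, 1),
        "provision_gb": round(provision_gb, 1),
        "shards": math.ceil(provision_gb / server_ram_gb),
    }

print(size_namespace(1_000_000, 1536))   # scenario B: ~8.7 GB index, ~13 GB RAM
print(size_namespace(10_000_000, 768))   # scenario C: ~60 GB index, 6 shards
```

For scenario C this yields 6 shards of 16 GB — inside the 4–8 shard range the text recommends.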
Index quality degradation from incremental inserts
HNSW is built with a global view of the data distribution at construction time. The greedy neighbor selection during ef_construction phases chooses links that are optimal for the distribution of vectors present when each node was inserted. When new vectors are inserted after the initial build, the algorithm uses the existing graph structure to find their neighbors — but the existing structure was not designed to accommodate these new nodes. The new nodes are correctly inserted with m links each, but they do not update existing nodes' link sets. Over time, as the distribution shifts or fills in, some existing nodes that were well-connected to their original neighborhood become less well-connected to the new arrivals in their region.
The degradation is quantifiable and roughly predictable. After inserting 10% new nodes post-build (i.e., the index has grown from N to 1.1N since the last rebuild), recall@10 drops approximately 1–2% at equivalent ef_search. After 30% new nodes (N grows to 1.3N), degradation reaches 4–6%. The exact magnitude depends on how different the new vectors' distribution is from the original: if new memories cover similar topics as existing ones, degradation is minimal; if new memories cover entirely new semantic territory (a new project, a new domain, a new language), degradation is sharper because the new vectors have no nearby neighbors in the existing graph and end up poorly connected.
The spec's rebuild schedule — quarterly for sub-10M, yearly for 10M–100M with partial shard rebuilds — is calibrated against this degradation curve. At quarterly cadence, a namespace growing at 1,000 memories per day accumulates roughly 90,000 new memories between rebuilds. If the namespace was at 2M memories at the start of the quarter, it grows to 2.09M — a 4.5% growth rate, well within the 10% threshold where degradation becomes noticeable. For faster-growing namespaces, tighten the cadence: as a rule of thumb, rebuild once quarterly growth exceeds 10% × (target_recall_drop / 2%). At a target of 1% recall drop, that means rebuilding when growth exceeds 5%.
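The rule of thumb is a one-line function — a sketch that assumes the degradation curve above (~2% recall drop per 10% post-build growth) and treats it as linear:

```python
def rebuild_growth_threshold_pct(target_recall_drop_pct: float) -> float:
    """Quarterly growth (in percent) that should trigger a rebuild,
    assuming ~2% recall drop per 10% post-build growth, linearly."""
    return 10.0 * (target_recall_drop_pct / 2.0)

print(rebuild_growth_threshold_pct(1.0))  # 5.0 -> rebuild once growth exceeds 5%
print(rebuild_growth_threshold_pct(2.0))  # 10.0
```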
One effective mitigation for high-ingestion-rate namespaces is the write-buffer pattern. New memories are inserted into a small secondary index — either a flat brute-force scan (viable up to ~50K vectors) or a small HNSW sub-index — rather than directly into the main HNSW index. Queries fan out across both the main index and the write buffer, and results are merged. The write buffer is periodically merged into the main index via a full rebuild. This preserves the graph quality of the main index (it never receives incremental inserts) while handling new arrivals without quality degradation. The cost is additional query fan-out complexity and the overhead of maintaining two indexes, but both are manageable below 1M memories in the write buffer.
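The write-buffer pattern can be sketched with a frozen main store plus a brute-force buffer. Here plain dicts of 2-D points stand in for the real HNSW index and flat scan — the shape of the fan-out and merge is the point, not the index internals:

```python
import math

class WriteBufferedIndex:
    """Write-buffer pattern sketch: a frozen main index plus a
    brute-force buffer for new arrivals; queries fan out over both."""

    def __init__(self, main_vectors: dict):
        self.main = dict(main_vectors)   # never receives incremental inserts
        self.buffer = {}                 # new memories land here

    def insert(self, key, vec):
        self.buffer[key] = vec

    def query(self, q, k=2):
        # Fan out across main index and write buffer, merge by distance.
        pool = {**self.main, **self.buffer}
        return sorted(pool, key=lambda key: math.dist(pool[key], q))[:k]

    def merge_into_main(self):
        # Models the periodic full rebuild that folds the buffer in.
        self.main.update(self.buffer)
        self.buffer.clear()

idx = WriteBufferedIndex({"a": (0.0, 0.0), "b": (3.0, 4.0)})
idx.insert("c", (0.1, 0.1))
print(idx.query((0.0, 0.0)))  # ['a', 'c'] — buffer results merge with main
```

In production the merge step is the full rebuild described above; until it runs, the main index's graph quality is untouched because it never sees an incremental insert.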
pgvector vs dedicated vector databases
pgvector extends PostgreSQL with HNSW and IVF index types, enabling vector similarity search within the same transaction and query planner that handles your memory table, entity graph, BM25 index, and metadata. The architectural consequence is significant: a write that stores a new memory, updates the entity graph, and inserts the memory embedding into the HNSW index can all happen in a single BEGIN / COMMIT block. If the write fails partway through, the entire transaction rolls back. There is no possibility of a state where the memory row exists but its vector is missing from the index, or where the entity graph reflects a relationship that was later rolled back. This transactional correctness is not a luxury — without it, retrieval systems accumulate subtle inconsistencies that are difficult to debug and painful to remediate.
The tradeoff is performance. pgvector's HNSW implementation shares the PostgreSQL process model, buffer pool, and locking primitives with every other operation on the database. A heavy HNSW build operation competes with ongoing queries for CPU and shared_buffers. A vector search during a large batch write may experience elevated latency because the buffer pool is churned by the write path. Dedicated vector databases (Qdrant, Milvus, Weaviate) run HNSW in a purpose-built runtime: their memory management is tuned exclusively for vector operations, they use SIMD-optimized distance kernels that bypass PostgreSQL's generic expression evaluator, and their concurrency model is designed for high-throughput parallel vector searches. Benchmarks consistently show dedicated systems at 1.5–2× lower query latency than pgvector at equivalent N and recall.
The practical decision point is 10M memories per namespace. Below 10M, pgvector is almost always the right choice. The operational advantage of a single database is substantial: one backup strategy, one monitoring stack, one connection pool, one set of credentials, one failover procedure. The performance gap versus dedicated systems is real but falls within acceptable SLA budgets when the end-to-end retrieval pipeline is already spending 15–30ms on reranking. Above 10M memories per active namespace, the pgvector performance gap compounds: query latency climbs, the HNSW index may not fit in shared_buffers, and rebuilds begin to impact production query performance. At this scale, moving the vector tier to a dedicated system (while keeping PostgreSQL for everything else) is worth evaluating. The integration cost is one additional service with its own operational burden, but the performance gain is predictable and the architectural boundary is clean: the vector store handles only HNSW queries; PostgreSQL handles relational data, BM25, entity graph, and metadata.
One pgvector-specific limitation bears noting: pgvector does not support filtered HNSW search efficiently. A query like "find the 10 nearest neighbors of this vector among memories where type = 'Preference'" requires pgvector to either post-filter (run the full HNSW search and discard non-Preference results) or pre-filter with a bitmap scan. Post-filtering degrades recall when the filter is selective; pre-filtering requires maintaining separate indexes per type. Dedicated vector databases implement filtered HNSW natively using per-segment metadata filters that prune graph traversal before distance computations. For the type-filtered retrieval pattern described in the five-retrievers architecture, this pgvector limitation is mitigated by maintaining separate HNSW indexes per memory type (one for Event, one for Fact, one for Preference) — a straightforward schema decision that avoids the filtering problem entirely.
IVF and IVF-PQ for the very-large scale
Inverted File Index (IVF) is the alternative vector indexing strategy for scales where HNSW's memory cost becomes prohibitive. IVF begins with a k-means clustering pass over the full vector dataset, producing N_lists cluster centroids. At query time, the query vector is compared against all N_lists centroids (a small O(N_lists × dims) computation), and the top N_probes closest clusters are selected. The exact nearest neighbor search is then performed only within those N_probes clusters, examining a fraction N_probes / N_lists of the total dataset. Memory cost is O(N × dims × 4) for the raw vectors plus O(N_lists × dims × 4) for the centroids — no link structure, so no m-dependent overhead. For N=1B at dims=768, this is 3 TB for vectors plus ~3 MB for 1,000 centroids — dramatically cheaper than HNSW's link structure, which would add roughly 3.8 TB at m=16 (m × 8 × log₂(10⁹) ≈ 3,840 bytes per vector).
The recall curve for IVF is controlled by N_probes. At N_probes=1, only the closest cluster is searched — extremely fast, but recall@10 drops to roughly 60–70% for clustered data. At N_probes=8, recall@10 reaches ~80%. At N_probes=64, recall@10 reaches ~93%. Unlike HNSW, IVF's recall does not asymptote smoothly — there are discontinuous jumps when N_probes crosses a cluster boundary that happens to contain several relevant vectors. For agent memory retrieval, IVF is practical above 500M memories where HNSW RAM cost exceeds $2,000/month in cloud infrastructure.
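The IVF query path above fits in a few lines. A toy sketch with hand-built 2-D centroids and inverted lists (no k-means, no quantization):

```python
import math

def ivf_query(q, centroids, lists, n_probes, k=2):
    """IVF search sketch: rank all centroids against the query, scan only
    the n_probes closest inverted lists, exact search inside them. With
    low n_probes, true neighbors in unprobed lists are simply missed —
    that is the recall cost described above."""
    probe = sorted(range(len(centroids)),
                   key=lambda i: math.dist(centroids[i], q))[:n_probes]
    candidates = [v for i in probe for v in lists[i]]
    return sorted(candidates, key=lambda v: math.dist(v, q))[:k]

centroids = [(0.0, 0.0), (10.0, 0.0)]
lists = [[(0.0, 1.0), (1.0, 0.0)],     # vectors assigned to centroid 0
         [(9.0, 0.0), (10.0, 1.0)]]    # vectors assigned to centroid 1
print(ivf_query((0.3, 0.1), centroids, lists, n_probes=1))
```

With n_probes=1 only the first inverted list is scanned; raising n_probes widens the scan and the recall, exactly as in the curve above.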
IVF-PQ (Product Quantization) adds a compression step that reduces each vector from dims × 4 bytes to a fixed-size compact code, typically 8–64 bytes regardless of dims. The compression works by dividing the vector into M sub-vectors of dims/M dimensions each and encoding each sub-vector as an index into a codebook of 256 centroids (8 bits). At M=8 and dims=768, each sub-vector has 96 dimensions and is encoded as 1 byte — 8 bytes total per vector instead of 3,072. This 384× compression allows 10B+ memories to fit in the RAM of a modest server. The cost is recall: the quantized distance approximation introduces error, and recall@10 at N_probes=8 drops from ~80% (exact IVF) to ~65–70% (IVF-PQ). At N_probes=64, IVF-PQ achieves ~85–88% recall@10 — acceptable for some applications, too low for others.
IVF-PQ recall@10 vs N_probes at N=1B, dims=768, M=8 sub-vectors, N_lists=10,000.
| N_probes | recall@10 | approximate latency |
|---|---|---|
| 8 | ~65% | 2ms |
| 32 | ~78% | 8ms |
| 64 | ~85% | 15ms |
| 256 | ~92% | 55ms |
For agent memory at the 1B+ scale, the production-viable architecture uses IVF-PQ with N_probes dynamically scaled by query specificity. High-specificity queries (rare terms, exact names, narrow temporal windows) benefit from thorough search because there are few relevant memories and missing them is costly — N_probes=64 to 128. Low-specificity queries (broad concepts, common words) have many relevant memories distributed across many clusters — N_probes=32 is sufficient to find representative examples. The specificity signal comes from the same query optimizer that drives retriever selection: a high-IDF query gets high N_probes; a low-IDF query gets low N_probes. This adaptive approach recovers much of the latency cost of high-recall search on the queries that need it, while preserving fast retrieval for the majority of queries that do not.
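The adaptive policy reduces to a small mapping from the optimizer's specificity signal to a probe count. A sketch, where the mean query-term IDF input and the threshold value are assumed stand-ins for whatever signal your query optimizer actually emits:

```python
def adaptive_n_probes(mean_idf: float,
                      low: int = 32, high: int = 128,
                      idf_threshold: float = 8.0) -> int:
    """High-specificity (high-IDF) queries get a thorough search;
    low-specificity queries get a fast one. Threshold is an assumption
    to calibrate against your own query distribution."""
    return high if mean_idf >= idf_threshold else low

print(adaptive_n_probes(11.2))  # rare terms  -> 128 probes
print(adaptive_n_probes(3.5))   # common terms -> 32 probes
```

A production version would interpolate rather than threshold, but the two-level policy already captures the spend-where-it-matters behavior described above.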