AITE M1.1-Art06 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Vector Stores: Selection, Hybrid Retrieval, and Reranking



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 6 of 35


The vector store is the piece of the RAG architecture that the organization will live with longest. Models are swapped every twelve months as new frontier releases land. Prompts are rewritten every sprint. Retrieval corpora grow month by month, but the index that holds their embeddings is rebuilt only under duress because reindexing billions of chunks across an embedding-model migration is a week-long incident even with parallel compute. An architect who chooses a vector store well inherits a component that quietly compounds in value. An architect who chooses poorly inherits a system that gets harder to change every quarter. Article 5 settled chunking and embedding; this article settles the component that catches those chunks and returns them under query load.

What a vector store actually does

A vector store is a database specialized for similarity search on high-dimensional vectors. Each record contains an embedding (typically 768 to 3,072 dimensions), a primary identifier, an opaque payload (the original chunk text), and a set of structured metadata fields. At query time, the store accepts a query vector plus optional metadata filters and returns the top-k records whose embeddings are most similar under a configured distance metric, usually cosine similarity, sometimes dot product, rarely Euclidean. Similarity is computed approximately using an index structure — HNSW, IVF, ScaNN, DiskANN — because exact search across hundreds of millions of vectors is computationally infeasible under interactive latency budgets.
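The record shape and query contract described above can be sketched in miniature. The in-memory store, field names, and brute-force scan below are illustrative stand-ins, not any particular store's API; a real store replaces the linear scan with an approximate index such as HNSW:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# A record mirrors the shape described above: identifier, embedding,
# opaque payload (the chunk text), and structured metadata.
records = [
    {"id": "c1", "vec": [0.9, 0.1, 0.0], "payload": "refund policy", "meta": {"tenant": "acme"}},
    {"id": "c2", "vec": [0.1, 0.9, 0.0], "payload": "shipping rules", "meta": {"tenant": "acme"}},
    {"id": "c3", "vec": [0.8, 0.2, 0.1], "payload": "refund window", "meta": {"tenant": "globex"}},
]

def query(store, query_vec, top_k=2, meta_filter=None):
    # Exact (brute-force) top-k; production stores replace this scan
    # with an approximate index structure (HNSW, IVF, DiskANN).
    candidates = [r for r in store if meta_filter is None
                  or all(r["meta"].get(k) == v for k, v in meta_filter.items())]
    ranked = sorted(candidates, key=lambda r: cosine(r["vec"], query_vec), reverse=True)
    return [r["id"] for r in ranked[:top_k]]

print(query(records, [1.0, 0.0, 0.0], meta_filter={"tenant": "acme"}))  # → ['c1', 'c2']
```

The metadata filter here runs before similarity ranking, which previews the pre-filter versus post-filter distinction discussed later in this article.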

The architect who treats the vector store as a commodity misses the point. The same chunks, embedded by the same model, indexed in three different stores, return meaningfully different results at the top of the ranked list because each store’s index parameters, distance computation, and filter implementation differ. The store is a component with personality. The architect evaluates that personality against the workload rather than assuming fungibility.

The four architectural dimensions

Every vector store presents four architectural dimensions on which the decision turns.

Managed versus self-hosted. Managed stores (Pinecone, Weaviate Cloud, Qdrant Cloud, Zilliz Cloud for Milvus, MongoDB Atlas Vector Search, Elastic Cloud, Azure AI Search, Amazon OpenSearch Serverless, Google Vertex AI Vector Search) trade control for operational simplicity. The architect pays per-hour or per-vector pricing and gets autoscaling, backups, and SLAs. Self-hosted stores (pgvector on PostgreSQL, self-hosted Milvus, self-hosted Qdrant, self-hosted Weaviate, FAISS, Chroma) trade operational simplicity for control and cost predictability. The architect runs the cluster, handles failover, and pays for the underlying compute and storage without per-vector markups.

Index algorithm. HNSW is the dominant graph-based index and is the default in Qdrant, Weaviate, Milvus, pgvector (via pgvectorscale or the HNSW index type), and most others. IVF (inverted file) suits high-recall workloads when memory is a constraint and is available in Milvus and Elasticsearch. DiskANN lets the index spill to SSD for extreme scale and is the basis of Microsoft’s Azure offering and pgvectorscale. The architect who does not understand which algorithm their store uses cannot reason about why latency spikes when the corpus doubles or why recall degrades when the filter cardinality is high.

Filter capability. Pre-filter versus post-filter is the most commonly overlooked architectural question. In post-filter designs, the store retrieves the top-k by similarity then filters by metadata; if the filter is selective, the result set shrinks below k and quality degrades. In pre-filter designs, the filter narrows the candidate set before similarity search; this preserves k but is slower at high cardinality. Qdrant, Weaviate, and Milvus support sophisticated pre-filtering. Pinecone supports metadata filtering natively with tenant-namespace isolation. Some stores force the architect into one mode.
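The difference can be made concrete with a toy ranked list. The document ids and filter set below are invented for illustration; the point is only the order of operations:

```python
# 'ranked' is a similarity-ordered candidate list (best first);
# 'allowed' is the set of ids passing a selective metadata filter.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
allowed = {"d2", "d7", "d8"}
k = 3

def post_filter(ranked, allowed, k):
    # Take top-k by similarity first, then filter: the result set
    # shrinks below k whenever the filter is selective.
    return [d for d in ranked[:k] if d in allowed]

def pre_filter(ranked, allowed, k):
    # Narrow the candidate set first, then take top-k: k is preserved
    # as long as enough candidates pass the filter.
    return [d for d in ranked if d in allowed][:k]

print(post_filter(ranked, allowed, k))  # → ['d2'] — one result instead of three
print(pre_filter(ranked, allowed, k))   # → ['d2', 'd7', 'd8']
```

The post-filter path silently returns a single result where the caller asked for three, which is exactly the quality degradation described above.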

Scale. Different stores dominate at different orders of magnitude. pgvector is production-grade through the low hundreds of millions of vectors; beyond that, dedicated stores pull ahead. Pinecone and Vertex AI Vector Search are engineered for tens of billions. Milvus and Qdrant span the range with self-hosted operations. The architect sizes the future corpus three years out, not today’s pilot, and picks a store whose scale envelope includes that figure.

[DIAGRAM: MatrixDiagram — aite-sat-article-6-store-selection-matrix — Two-axis comparison matrix with vector stores on the horizontal axis (Pinecone, Weaviate, Qdrant, pgvector, Milvus, OpenSearch) and evaluation criteria on the vertical axis (operational model, index algorithm, filter capability, scale envelope, metadata model, hybrid-search support, multi-tenancy, ecosystem). Cells contain short labels (for example, “HNSW, pre-filter, 100B+ scale, managed only”). The grid is colour-coded by strength category (native / capable / partial / absent).]

The six representative stores

Pinecone pioneered managed vector search and remains the reference for teams who want zero operations. It supports metadata filtering, namespaces for multi-tenancy, serverless and pod-based pricing tiers, and hybrid search via sparse-dense vectors. Its public cases include Notion AI’s retrieval layer over the workspace corpus.1 Pinecone’s weakness is cost at very large scale and the lack of a self-hosted option, which concerns architects with residency requirements not met by Pinecone’s available regions.

Weaviate is available managed and self-hosted, is open-source, supports HNSW with filter-friendly pre-filtering, and has a first-class hybrid-search API combining BM25 and vector similarity. Weaviate’s module system covers embedding providers, rerankers, and generative endpoints directly, which shortens the RAG pipeline. The operational footprint is modest for mid-scale corpora; at very large scale, architects report memory pressure that demands careful sharding.

Qdrant is open-source Rust, available managed and self-hosted, with one of the strongest filter implementations — pre-filter by default with cost-based query planning. Qdrant’s strengths are latency under heavy filtering, straightforward self-hosting, and a clean API. It is increasingly the reference self-hosted choice in Phase 2 European enterprise deployments that cannot use Pinecone for residency reasons.

pgvector is the PostgreSQL extension that turns any existing Postgres instance into a vector store. Supabase’s public documentation and reference architectures show pgvector powering production RAG at scale on their managed Postgres service.2 pgvector’s strengths are operational familiarity (Postgres is already in the stack), transactional consistency (the vector is written in the same ACID transaction as the source record), row-level security (tenant isolation is enforced by the same mechanism that protects the rest of the data), and cost efficiency at small-to-medium scale. Its weakness is a scale ceiling lower than that of purpose-built stores unless the pgvectorscale or Timescale Vector extensions are added.

Milvus is the reference open-source store for very large scale with a Kubernetes-native operator and dedicated storage and index nodes. Milvus underpins Zilliz Cloud, its managed offering, and is deployed at billion-vector scale by several public customers. Its complexity suits architects with platform teams ready to operate it.

OpenSearch and Elasticsearch brought their mature inverted-index engines together with HNSW vector support, which makes them natural choices for teams that already run search clusters. Hybrid search is first-class because BM25 is native. The latency and memory profile is different from Qdrant or Pinecone because the engine was designed for lexical search first.

Hybrid retrieval

Dense retrieval (vector similarity) captures semantic similarity. Sparse retrieval (BM25 or SPLADE) captures keyword specificity. They fail on complementary queries. Dense alone misses exact keyword matches on rare entities — a product code, a drug name, a regulation number. Sparse alone misses paraphrase and concept-level similarity. Hybrid retrieval runs both in parallel and fuses the results before passing them to the next stage. The fusion function is usually Reciprocal Rank Fusion (RRF) or a learned linear combination; RRF is the default because it has no learned parameters (only a smoothing constant) and works robustly across corpora.3

The architecture runs the query through two paths: dense retrieval against the vector index and sparse retrieval against a BM25 index. Both return ranked lists of candidate chunks. The fusion step combines them into a single ranked list. The top candidates pass to the next stage — either directly to the generator or to a reranker.
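The fusion step itself is a few lines. The sketch below follows the RRF formula from Cormack et al. (each document scores the sum of 1/(k + rank) across lists, with the paper's smoothing constant k = 60); the candidate ids are invented:

```python
def rrf(rankings, k=60, top_n=5):
    # Reciprocal Rank Fusion: a document's fused score is the sum over
    # ranked lists of 1 / (k + rank), rank being 1-based. Documents
    # appearing high in multiple lists rise to the top.
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

dense = ["c7", "c2", "c9", "c4"]   # dense retriever output, best first
sparse = ["c2", "c5", "c7", "c1"]  # BM25 retriever output, best first
print(rrf([dense, sparse]))        # c2 and c7 lead: both lists agree on them
```

Note that c2 and c7, the only documents both retrievers returned, outrank every single-list document regardless of their absolute positions, which is why RRF is robust without tuning.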

Hybrid is not a free win. The architect pays two retrieval costs rather than one. On corpora where queries are conceptual and the vocabulary is small, pure dense retrieval is adequate. On corpora with entities, codes, or numbers, hybrid is the default. The decision is made by evaluating both modes on the golden-query set from Article 5’s evaluation protocol.

[DIAGRAM: StageGateFlow — aite-sat-article-6-hybrid-retrieval-flow — Left-to-right flow: “User query” → split into two parallel branches labelled “Dense retriever (HNSW, top-50)” and “Sparse retriever (BM25, top-50)” → “Reciprocal Rank Fusion” → “Metadata filter (tenant, date, classification)” → “Cross-encoder reranker (top-10 → top-5)” → “Context assembly” → “Generator”. Gates annotated with measured-latency budgets (10ms, 8ms, 2ms, 3ms, 60ms, 5ms).]

When to rerank

Reranking is the third stage of a strong retrieval pipeline. The first-stage retrievers (dense, sparse, or hybrid) are optimized for recall at large k — return 50 to 100 plausibly relevant candidates quickly. The reranker is optimized for precision at small k — score each candidate against the query using a cross-encoder that concatenates the query and passage and produces a single relevance score. The cross-encoder sees both sides simultaneously and produces more accurate ordering than the bi-encoder used for first-stage retrieval, at the cost of latency proportional to the number of candidates being reranked.
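The two-stage shape can be sketched with a stub standing in for the scorer. `stub_score` below is a hypothetical placeholder (a production pipeline calls a cross-encoder such as bge-reranker or a managed rerank API at that point); only the pipeline shape is the point:

```python
def rerank(query, candidates, score_fn, top_n=5):
    # Second-stage precision step: score each (query, passage) pair
    # jointly, then keep the top_n highest-scoring candidates.
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_n]]

def stub_score(query, passage):
    # Hypothetical stand-in scorer: fraction of query tokens present in
    # the passage. A real deployment replaces this with a cross-encoder.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

hits = ["refund window is 30 days", "shipping takes 5 days", "refund requires receipt"]
print(rerank("what is the refund window", hits, stub_score, top_n=2))
```

The latency cost scales with the candidate count, because the scorer runs once per (query, passage) pair rather than once per query as in first-stage retrieval.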

Common rerankers include Cohere Rerank (managed API), BAAI bge-reranker (open-weight, self-hostable), Voyage rerank, and open-weight cross-encoders from sentence-transformers. Cohere Rerank is documented at roughly 100ms for 50 candidates against a typical query; self-hosted rerankers depend on hardware.4

The architect adds reranking when the first-stage ranking is not precise enough at small k — when the generator is confused by near-miss passages that appear high in the ranking or when the corpus contains many similar passages that must be disambiguated by the query’s specific wording. The architect avoids reranking when latency budgets are tight and first-stage quality is already adequate. The decision is not ideological; it is measured against the golden set.

Two real-world examples

Notion AI uses Pinecone as the vector store for its workspace-search retrieval, as described in joint Pinecone and Notion public case materials.1 The workload involves tenant-isolated indexes per workspace, metadata filters for document type and permission, and a combination of dense retrieval and structured filter. The architectural point is that Notion chose a managed store to avoid running a vector database as part of their product operations, not because Pinecone was the only viable option. The decision was about operations allocation, not about ranking algorithms.

Supabase publishes reference architectures for pgvector-backed RAG at production scale, including OpenAI-embedding-plus-pgvector templates and row-level-security patterns for multi-tenancy.2 The architectural point here is the opposite of Notion’s — teams already running Supabase for their application database get vector search without adding an operational component, and row-level security gives tenant isolation without a second access-control layer. The decision was about reducing moving parts, not about reaching the highest possible throughput.

Both decisions are correct given their constraints. The error an architect makes is copying one decision into a context that matches the other. Pinecone for a twelve-user pilot on an existing Supabase stack adds an unnecessary vendor; pgvector for a billion-chunk index serving interactive queries to a global audience will eventually hit a scale ceiling the team cannot move past without reindexing.

Multi-tenancy and the index

Multi-tenancy (developed at length in Article 16) is relevant here because the vector store is the most common place tenant isolation leaks. A shared index with tenant-id metadata is safe only if every query enforces the filter and every filter pushes into the index’s pre-filter plan rather than a post-filter. An index-per-tenant (Pinecone namespace, Qdrant collection, pgvector schema) eliminates the filter-enforcement problem at the cost of per-tenant index overhead. The architect chooses based on the number of tenants and the isolation posture demanded by the data.
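One way to make the shared-index variant safe is to inject the tenant filter at a single chokepoint so no query path can reach the index without it. The classes below are hypothetical sketches of that pattern, not any particular store's client API:

```python
class ToyStore:
    # Minimal stand-in for a vector store; similarity ranking omitted.
    def __init__(self, records):
        self._records = records  # list of {"id": ..., "meta": {...}}

    def search(self, query_vec, top_k=10, meta_filter=None):
        hits = [r["id"] for r in self._records
                if all(r["meta"].get(k) == v for k, v in (meta_filter or {}).items())]
        return hits[:top_k]

class TenantScopedIndex:
    # Wraps the store so every query carries the tenant filter.
    def __init__(self, store, tenant_id):
        self._store = store
        self._tenant_id = tenant_id

    def search(self, query_vec, top_k=10, extra_filter=None):
        # The tenant filter is applied last, so a caller-supplied
        # extra_filter can never omit or override it.
        merged = dict(extra_filter or {})
        merged["tenant_id"] = self._tenant_id
        return self._store.search(query_vec, top_k=top_k, meta_filter=merged)

store = ToyStore([
    {"id": "a1", "meta": {"tenant_id": "acme"}},
    {"id": "g1", "meta": {"tenant_id": "globex"}},
])
scoped = TenantScopedIndex(store, "acme")
# Even a hostile extra_filter cannot escape the tenant scope:
print(scoped.search([0.0], extra_filter={"tenant_id": "globex"}))  # → ['a1']
```

The same discipline applies regardless of store: in pgvector it is a row-level-security policy, in Pinecone a namespace bound at client construction; the invariant is that tenant scope is set once, not per call site.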

Capacity planning and reindexing

Two operational realities deserve explicit planning. First, vector-store capacity is sized on vectors, on metadata, and on expected query volume in parallel. A store sized for 100 million vectors at 768 dimensions with modest metadata is a different physical cluster from one sized for the same vector count with high-cardinality metadata and concurrent heavy query load. The architect builds a capacity spreadsheet that tracks vector count, dimensionality, metadata bytes per record, query concurrency, and ingestion throughput, and validates it against the chosen store’s documented scale envelope before committing.
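A minimal version of that capacity arithmetic, under clearly labeled assumptions (float32 vectors, a flat per-record metadata budget, and a rough per-vector HNSW graph overhead; real stores add allocator, replica, and WAL overhead on top of this):

```python
def index_memory_gib(n_vectors, dims, bytes_per_dim=4, metadata_bytes=256,
                     hnsw_links=32, link_bytes=8):
    # Back-of-envelope memory estimate for an in-RAM HNSW index.
    # bytes_per_dim=4 assumes float32; metadata_bytes and the graph
    # overhead (hnsw_links edges of link_bytes each per vector) are
    # illustrative defaults to be replaced with measured figures.
    vector_bytes = n_vectors * dims * bytes_per_dim
    meta_bytes = n_vectors * metadata_bytes
    graph_bytes = n_vectors * hnsw_links * link_bytes
    return (vector_bytes + meta_bytes + graph_bytes) / 2**30

# 100 million vectors at 768 dimensions: the vectors themselves dominate.
print(round(index_memory_gib(100_000_000, 768), 1))
```

Even this crude model makes the sizing point: doubling dimensionality roughly doubles the cluster, while metadata and graph overhead matter most for high-cardinality filter workloads at lower dimensionality.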

Second, reindexing is a scheduled operation, not an improvisation. Changing the embedding model requires reindexing every chunk; changing the chunking strategy requires reindexing; changing the metadata schema in ways that affect filters often requires reindexing. A production corpus of billions of chunks takes days of parallel compute to reindex even on well-provisioned hardware. The architect plans reindexing capacity ahead of time — extra ingestion throughput, a shadow index that can be swapped in after verification, a traffic-cutover procedure with rollback — so that reindexing is a routine operation rather than a crisis.

Regulatory alignment

The vector index is in scope for EU AI Act Article 10 (data governance) and Article 12 (record-keeping) when the system it serves is high-risk. Article 10 requires that the data used by the system be relevant and representative, which extends to retrieval corpora because retrieval supplies content to the model at inference. Article 12 requires that logs be kept and auditable; the architect designs retrieval-event logging so that every query-to-chunk relationship is reconstructable for a defined retention period. ISO/IEC 42001 Clause 8.3 requires life-cycle management, which includes index versioning and reindexing procedures.5
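A sketch of one such retrieval-event record, under the assumption that a query hash rather than raw query text satisfies the organization's retention policy (a legal reading may require the raw text instead; the field names are illustrative):

```python
import datetime
import hashlib
import json

def retrieval_log_record(query_text, chunk_ids, index_version, tenant_id):
    # One auditable record per retrieval event: enough to reconstruct
    # the query-to-chunk relationship for the retention period.
    # The query is stored as a hash here; adjust to policy.
    return {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tenant_id": tenant_id,
        "query_sha256": hashlib.sha256(query_text.encode()).hexdigest(),
        "index_version": index_version,
        "chunk_ids": chunk_ids,
    }

rec = retrieval_log_record("refund window?", ["c1", "c3"], "idx-2026-04-v3", "acme")
print(json.dumps(rec))
```

The `index_version` field is what ties the log to the life-cycle requirement: without it, a reindexing event silently breaks the ability to reconstruct which chunk content a past query actually retrieved.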

Summary

The vector store is a long-lived architectural commitment whose personality affects retrieval quality in ways the benchmarks do not capture. The four selection dimensions — managed versus self-hosted, index algorithm, filter capability, and scale — narrow the choice. The six representative stores cover the field from managed simplicity (Pinecone) through operational-familiarity self-hosting (pgvector) to very-large-scale open-source (Milvus). Hybrid retrieval combines dense and sparse paths and is the default when the corpus contains entities or codes. Reranking with a cross-encoder adds precision at small k when first-stage ranking is not good enough and latency allows. The Notion-AI-on-Pinecone and Supabase-on-pgvector cases show the same discipline applied to different operational postures. Tenant isolation lives or dies at the store’s filter implementation. Regulatory alignment is satisfied by treating the index as a governed data artifact.

Further reading in the Core Stream: Data Architecture for Enterprise AI and Grounding, Retrieval, and Factual Integrity for AI Agents.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Pinecone customer story and Notion AI technical discussion. Pinecone, “Notion AI” customer case. https://www.pinecone.io/customers/ — accessed 2026-04-20.

  2. Supabase Vector / pgvector reference architecture documentation. https://supabase.com/docs/guides/ai — accessed 2026-04-20.

  3. Gordon V. Cormack, Charles L. A. Clarke, Stefan Büttcher, “Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods,” SIGIR 2009. https://dl.acm.org/doi/10.1145/1571941.1572114 — accessed 2026-04-20.

  4. Cohere Rerank documentation. https://docs.cohere.com/docs/rerank — accessed 2026-04-20. BAAI bge-reranker model card. https://huggingface.co/BAAI/bge-reranker-large — accessed 2026-04-20.

  5. Regulation (EU) 2024/1689, Articles 10 and 12. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clause 8.3. https://www.iso.org/standard/81230.html — accessed 2026-04-20.