Architecture decision report

Multi-Tenant RAG Vector Index Partitioning

Choose the best vector index partitioning strategy for a multi-tenant RAG platform: per-tenant ANN indexes versus a shared metadata-filtered index.

multitenant-rag-vector-index-partitioning.md
UC-013 · 2026-07-03 · Recommended

How should a multi-tenant RAG platform partition vector indexes for scalability and retrieval performance?

In a multi-tenant RAG platform where each tenant owns isolated vector indexes, what are the trade-offs between per-tenant ANN indexes versus a shared index with metadata filtering when considering retrieval latency, embedding drift, and operational costs?

Original question on Stack Overflow

Recommendation

Use a hybrid layout: keep separate ANN indexes for high-volume or sensitive tenants, and use a shared metadata-filtered index for the long tail of small tenants.

That gives you strong tenant isolation where it matters most while avoiding the cost and operational sprawl of one full ANN index per tenant.

Options compared
Per-tenant ANN indexes · Viable
Fit: Fair fit

Best when tenant isolation is strict or when tenant-specific query latency must stay consistently low, but it raises build, monitoring, and reindexing overhead as tenant count grows.

Shared index with metadata filtering · Viable
Fit: Fair fit

Best for many small tenants because it reduces duplicate storage and index management, but it depends on reliable tenant filters and can be harder to keep fast under skew or noisy neighbors.

Hybrid partitioning by tenant or tenant group · Pick
Fit: Strong fit

It combines the strongest parts of both approaches: isolate hot or sensitive tenants, share the rest, and keep drift/reindexing work bounded.
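
The split between isolated and shared tenants can be expressed as a small placement policy. The sketch below is illustrative only: `TenantProfile`, `choose_placement`, and both thresholds are hypothetical names and values that do not come from the report, and the real split point depends on the tenant count, corpus skew, and latency targets that are still unknown.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumptions); tune them against real tenant data.
MAX_SHARED_VECTORS = 1_000_000
MAX_SHARED_QPS = 50.0


@dataclass(frozen=True)
class TenantProfile:
    tenant_id: str
    vector_count: int
    peak_qps: float
    sensitive: bool = False        # e.g. a contractual isolation requirement
    reindexes_often: bool = False  # embeddings or chunking change frequently


def choose_placement(t: TenantProfile) -> str:
    """Return "dedicated" for hot, large, or high-risk tenants, else "shared"."""
    if t.sensitive or t.reindexes_often:
        return "dedicated"
    if t.vector_count > MAX_SHARED_VECTORS or t.peak_qps > MAX_SHARED_QPS:
        return "dedicated"
    return "shared"


tenants = [
    TenantProfile("acme", vector_count=12_000_000, peak_qps=300.0),
    TenantProfile("clinic", vector_count=40_000, peak_qps=2.0, sensitive=True),
    TenantProfile("smallco", vector_count=8_000, peak_qps=0.5),
]
placements = {t.tenant_id: choose_placement(t) for t in tenants}
print(placements)
```

Keeping the policy in one function makes the split point easy to revisit once real latency and skew numbers are available.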

Architecture shape
Authenticated tenant context → retrieval gateway → tenant-aware routing → per-tenant ANN indexes for hot/sensitive tenants OR shared filtered index for smaller tenants → chunk-level metadata/permissions → answer assembly
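
A minimal sketch of this flow, assuming an in-memory store and exact cosine ranking as a stand-in for a real ANN index. `Chunk`, `Router`, and their methods are hypothetical names for illustration, not the API of any vector database; the tenant ID passed to `search` is assumed to come from the authenticated context, never from user-supplied query text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    tenant_id: str
    text: str
    vector: tuple


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class Router:
    """Routes a tenant's query to its dedicated index or to the shared one."""

    def __init__(self):
        self.dedicated = {}  # tenant_id -> list of Chunk (one index per tenant)
        self.shared = []     # chunks of all long-tail tenants

    def add(self, chunk, dedicated=False):
        if dedicated:
            self.dedicated.setdefault(chunk.tenant_id, []).append(chunk)
        else:
            self.shared.append(chunk)

    def search(self, tenant_id, query, k=3):
        if tenant_id in self.dedicated:
            candidates = self.dedicated[tenant_id]
        else:
            # Shared path: filter by tenant BEFORE ranking (pre-filtering).
            candidates = [c for c in self.shared if c.tenant_id == tenant_id]
        ranked = sorted(candidates, key=lambda c: cosine(c.vector, query), reverse=True)
        hits = ranked[:k]
        # Retrieval-time check: fail closed if a foreign chunk slipped through.
        if any(c.tenant_id != tenant_id for c in hits):
            raise PermissionError("cross-tenant chunk in results")
        return hits


router = Router()
router.add(Chunk("acme", "acme pricing", (1.0, 0.0)), dedicated=True)
router.add(Chunk("smallco", "smallco onboarding", (0.9, 0.1)))
router.add(Chunk("otherco", "otherco notes", (1.0, 0.0)))
hits = router.search("smallco", (1.0, 0.0))
print([c.text for c in hits])
```

The final check is redundant while the filter is correct; it exists so that a future bug in the filter raises an error instead of returning another tenant's data.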
Assumptions
None recorded.
Tradeoffs
Separate indexes improve isolation and simplify tenant-specific reindexing, but they multiply operational cost.
Shared filtering lowers storage and maintenance cost, but every retrieval must enforce tenant metadata correctly.
Hybrid partitioning reduces noisy-neighbor risk without forcing every tenant to pay for its own full index.
Embedding or chunking changes are easier to absorb in small isolated slices than in one giant shared index.
A shared path is cheaper to run, but a hot tenant can dominate resources unless you add explicit backpressure and quotas.
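
To illustrate the point about absorbing embedding changes in small slices, the sketch below assumes each partition records the embedding model that produced its vectors. `Partition`, `plan_reindex`, and the model names are hypothetical. It lists the partitions that are stale relative to a target model and orders them smallest first, so that cheap slices migrate early and the large shared index is scheduled deliberately.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    embedding_model: str  # model that produced the stored vectors (assumed tracked)
    vector_count: int


def plan_reindex(partitions, target_model):
    """Return stale partitions, smallest first."""
    stale = [p for p in partitions if p.embedding_model != target_model]
    return sorted(stale, key=lambda p: p.vector_count)


partitions = [
    Partition("shared-long-tail", "embed-v1", 5_000_000),
    Partition("tenant-acme", "embed-v2", 12_000_000),
    Partition("tenant-clinic", "embed-v1", 40_000),
]
plan = plan_reindex(partitions, target_model="embed-v2")
print([p.name for p in plan])
```

Queries against a partition should use the same model that produced its vectors, which is why the model name is stored per partition rather than globally.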
This is still provisional because
You have not said whether hard physical isolation is mandatory.
You have not given tenant count, corpus skew, or p95/p99 latency targets, so the exact split point is still unknown.
Risks and mitigations
Cross-tenant leakage if tenant metadata is missed; mitigate with authenticated tenant context, retrieval-time checks, and chunk-level permissions.
Noisy-neighbor latency spikes in the shared path; mitigate with tenant-group partitioning, quotas, and overload backpressure.
Reindexing churn from embedding drift can become expensive; mitigate by isolating tenants that change embeddings or chunking frequently.
Operational complexity from too many partitions; mitigate by only splitting tenants that are hot, large, or high-risk.
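
The quota mitigation for the shared path could look like a per-tenant token bucket, sketched below under stated assumptions. `TokenBucket` and `TenantQuotas` are hypothetical names, the rate and capacity are placeholder values, and the optional `now` argument exists only to make the example deterministic.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject or queue the request (backpressure)


class TenantQuotas:
    """One bucket per tenant, so a hot tenant cannot drain the others' budget."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, tenant_id, now=None):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = self.buckets[tenant_id] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)


quotas = TenantQuotas(rate=1.0, capacity=2)
hot = [quotas.allow("hot", now=0.0) for _ in range(3)]
quiet = quotas.allow("quiet", now=0.0)
hot_later = quotas.allow("hot", now=1.0)
print(hot, quiet, hot_later)
```

In this run the hot tenant exhausts its burst on the third request while the quiet tenant is unaffected, and the hot tenant is admitted again after one second of refill.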
Answer these next
01. Do tenants require hard physical isolation of their vectors and indexes, or is logical isolation with shared infrastructure acceptable?
02. What retrieval latency target do you need at p95/p99?
03. Will embeddings, chunking, or retrieval schemas change independently over time or per tenant?
Revisit this if
If you require hard physical isolation for all tenants, move to per-tenant indexes.
If tenant count is high and most tenants are small, lean more heavily toward the shared filtered path.
If p95/p99 latency is very strict, increase isolation for hot tenants.
If embeddings or chunking change often per tenant, isolate those tenants earlier.
Curated references
[1] Multi-Tenant Security Isolation · Multi-Tenant Application Security Cheat Sheet
The index partitioning choice has to preserve tenant boundaries without accidentally leaking or cross-mixing retrieval results, so the isolation guidance directly informs how much sharing is acceptable.
[2] RAG Access Control And Chunk Isolation · RAG Security Cheat Sheet
A shared index only works if chunk-level retrieval can reliably carry tenant permissions and source lineage, which is central to deciding between per-tenant indexes and metadata filtering.
[3] Optimize Data Performance · Architecture strategies for optimizing data performance
This decision is fundamentally about retrieval latency, index maintenance, and storage/query efficiency, so the performance guidance helps weigh the real cost of each partitioning model.
[4] Handle Overload with Backpressure and Graceful Degradation · Google SRE Book - Handling Overload
If some tenants are much hotter than others, the platform needs a design that avoids noisy-neighbor effects, which is a direct input to shared-versus-isolated index layout.
[5] Central Repository With Local Caches · RFC 977 - Network News Transfer Protocol
The consultation is essentially choosing between one shared retrieval store with selective filtering and duplicated per-tenant copies, which maps well to the central-repository tradeoff.
[6] pgvector Vector Similarity Search For Postgres · pgvector
Original question by Mouhamed Aymen Harroum on Stack Overflow (2026-06-28), licensed CC BY-SA 4.0. Report generated by the Architect Buddy pipeline on 2026-07-03, reviewed and edited by a human. This is architectural guidance, not a substitute for your own judgment.