AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 16 of 35
A SaaS product rarely has one tenant. A large enterprise rarely has one business unit. The moment an AI feature serves more than one audience, it becomes multi-tenant, and multi-tenancy for AI is not solved by the same patterns that solved multi-tenancy for traditional SaaS. The vector index leaks if one tenant’s documents are retrieved to answer another tenant’s query. The prompt leaks if one tenant’s system prompt is accessible to another tenant’s users. The cost leaks if one tenant’s heavy usage is invisibly subsidized by another tenant’s spend. The safety policy leaks if one tenant’s tolerance for content is applied to another tenant’s regulated domain. Each leak is an incident. The architect designs multi-tenancy from the start because retrofitting it after the first tenant is live is an order of magnitude harder than building it in. This article gives the AITE-SAT learner the five architectural patterns, the layers that must be tenant-scoped regardless of pattern, and the decision framework that picks the right pattern for the workload.
Why AI multi-tenancy is different
Classical multi-tenant SaaS isolates tenants at the data-store level: each row carries a tenant identifier, every query filters by that identifier, and application logic enforces the filter. AI adds four complications.
The retrieval corpus is a tenant-specific data asset. Two tenants have different document sets. Mixing them in a shared index without filter discipline produces cross-tenant retrieval. Ensuring the filter is always applied, always correctly, and always before similarity computation rather than after is an architecture decision, not a developer-discipline decision.
The model is a shared computational resource. The same model instance serves many tenants. Prompt-injection attempts from one tenant’s user cannot be allowed to persist in ways that affect another tenant’s session. State held in the model is ephemeral per-call, which helps; state held in caches (semantic caches, prompt caches, session stores) must be tenant-scoped, which the architect must enforce.
Per-tenant policy differs. A healthcare tenant has different regulated-content requirements than a retail tenant. A European tenant has different residency requirements than a US tenant. A free-tier tenant has different usage caps than an enterprise tenant. The policy layer must be per-tenant, not per-application.
Cost attribution is a business requirement. Multi-tenant AI is usually billed, and the bills must be accurate. Per-tenant cost attribution is a runtime concern that has to be designed into the observability stack from day one.
The five architectural patterns
Pattern 1 — Per-tenant index, per-tenant everything
Each tenant gets its own vector index, its own prompt templates, its own fine-tuned adapters if any, its own safety policy, its own rate limits, and often its own deployment. The pattern is the strongest isolation posture and the highest operational cost.
Fit: small number of large tenants, strict regulatory isolation, tenants with genuinely different data schemas or use cases. Pinecone’s dedicated-pod pricing tier, per-project Azure AI Foundry deployments, and one-account-per-tenant Databricks workspaces all fit this pattern. The per-tenant overhead (index, deployment, observability) sets a floor on per-tenant cost; the pattern is uneconomic below a certain tenant size.
Pattern 2 — Shared service, per-tenant index (namespace)
One deployment of the AI service serves all tenants. The vector index is logically per-tenant using the store’s native namespace or collection construct — Pinecone namespaces, Qdrant collections, pgvector schemas, Weaviate tenancy. The prompt assembly, model call, and orchestration are shared.
Fit: SaaS with medium-to-large tenant counts where each tenant has a distinct content corpus but shared application logic. The pattern is the most common AI multi-tenant architecture in 2024–2026 because it scales to thousands of tenants without per-tenant deployment cost and it eliminates cross-tenant index leakage as a class because tenant separation is enforced by the store itself.
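Why namespace separation eliminates cross-tenant leakage as a class can be seen in a minimal in-memory sketch. `NamespacedVectorStore` below is illustrative only, not a real vector-store client; the point is structural: a query names exactly one namespace, so similarity search can never even see another tenant's vectors.

```python
from collections import defaultdict

class NamespacedVectorStore:
    """Toy stand-in for a store with native per-tenant namespaces
    (Pinecone namespaces, Qdrant collections, pgvector schemas)."""

    def __init__(self):
        # One independent vector list per namespace; isolation is
        # structural, not a filter the application must remember.
        self._namespaces = defaultdict(list)

    def upsert(self, namespace: str, vec_id: str, vector: list[float]):
        self._namespaces[namespace].append((vec_id, vector))

    def query(self, namespace: str, vector: list[float], top_k: int = 3):
        # Similarity search runs only over the caller's namespace.
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        candidates = self._namespaces[namespace]
        return sorted(candidates, key=lambda c: dot(c[1], vector),
                      reverse=True)[:top_k]

store = NamespacedVectorStore()
store.upsert("tenant-a", "a1", [1.0, 0.0])
store.upsert("tenant-b", "b1", [1.0, 0.0])
# A query against tenant-a's namespace can never return b1.
print([vec_id for vec_id, _ in store.query("tenant-a", [1.0, 0.0])])  # -> ['a1']
```

Contrast this with Pattern 3 below, where the same guarantee depends on a filter being applied on every call.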
Pattern 3 — Shared index with metadata filter
One index holds all tenants’ chunks, each chunk tagged with a tenant identifier. Every query applies a mandatory tenant filter. This is the cheapest pattern per tenant because there is no per-tenant overhead, but it is the most susceptible to filter-enforcement errors.
Fit: very large tenant counts (tens of thousands or more), light content volume per tenant, where per-tenant overhead would dominate cost. The pattern works when the filter is pushed into the store’s pre-filter plan (Qdrant, Weaviate, or pgvector with the right index parameters) and when a missing filter at query time fails closed rather than returning all tenants’ chunks. Row-level security patterns in relational databases — pgvector with RLS on Supabase is the canonical reference — enforce the filter at the storage layer so that application bugs cannot bypass it.1
Pattern 4 — Sharded index by tenant cohort
Tenants are grouped into cohorts (by region, by tier, by vertical), and each cohort gets a shared index using one of the above patterns within the cohort. The pattern combines the operational simplicity of a shared index with the blast-radius containment of segregation.
Fit: global SaaS with residency requirements (EU tenants in an EU index, US tenants in a US index), or enterprise platforms with tier-based SLA differentiation (gold tenants in a gold cohort with premium latency targets). The architect tunes cohort size to match the risk-versus-overhead trade-off.
Pattern 5 — Per-tenant orchestration, shared model
The orchestration layer (prompt assembly, retrieval, tool selection, guardrails) is per-tenant. The model call is shared — the same provider endpoint serves all tenants. This pattern isolates the logic and data per tenant while pooling the expensive model-call resource.
Fit: tenants with genuinely different AI applications (different features, different prompt templates, different tool sets) built on the same shared model infrastructure. A product line with multiple AI features per tenant can adopt this pattern so that each feature has its own orchestration lane while the underlying model billing consolidates.
[DIAGRAM: MatrixDiagram — aite-sat-article-16-tenant-isolation-cost-quadrant — A 2×2 matrix with “Tenant isolation” (weak → strong) on one axis and “Cost of isolation” (low → high) on the other. The five patterns placed in the matrix: Pattern 1 (per-tenant everything) — upper-right (strongest isolation, highest cost); Pattern 2 (shared service, per-tenant index) — upper-middle (strong isolation, medium cost); Pattern 3 (shared index with filter) — lower-left (weakest isolation, lowest cost); Pattern 4 (cohort sharded) — middle (medium isolation, medium cost); Pattern 5 (per-tenant orchestration, shared model) — middle-right (strong orchestration isolation, medium cost). Annotation arrows indicate migration paths as tenant counts or regulatory requirements change.]
The five layers that are always tenant-scoped
Regardless of pattern, five layers are tenant-scoped in any multi-tenant AI system.
Retrieval results. Every retrieval call returns only chunks belonging to (or authorized for) the current tenant. In Patterns 1, 2, and 4 this is native; in Pattern 3 this is a filter that must be applied at every call. The architect treats a retrieval call that returned cross-tenant chunks as a P0 incident, not an anomaly.
Prompts and system prompts. Tenant A’s system prompt may contain sensitive information — custom instructions, tool descriptions that reveal business logic, regulatory disclaimers specific to their jurisdiction. It must not surface to Tenant B. Prompt-template storage is tenant-scoped; prompt selection at runtime verifies the tenant identity before assembly.
Caches. Prompt caches, semantic caches, and response caches are tenant-scoped so that a cached response generated for one tenant cannot be returned to another. This is non-obvious because caches are a performance concern conceptually divorced from tenancy; it becomes obvious after the first incident.
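One way to make tenant scoping structural rather than disciplinary is to fold the tenant identifier into the cache key itself. The sketch below is illustrative (the model name `gpt-x` is a placeholder): identical prompts from different tenants hash to different keys, so a cross-tenant cache hit is impossible by construction.

```python
import hashlib

def cache_key(tenant_id: str, prompt: str, model: str) -> str:
    """Tenant-scoped cache key: the tenant identifier is part of the
    key, so a hit for one tenant can never be served to another."""
    # NUL separators prevent ambiguous concatenations of the fields.
    raw = f"{tenant_id}\x00{model}\x00{prompt}"
    return hashlib.sha256(raw.encode()).hexdigest()

cache: dict[str, str] = {}
cache[cache_key("tenant-a", "summarize Q3", "gpt-x")] = "cached answer for A"

# Same prompt, same model, different tenant: different key, cache miss.
assert cache_key("tenant-a", "summarize Q3", "gpt-x") != \
       cache_key("tenant-b", "summarize Q3", "gpt-x")
print(cache.get(cache_key("tenant-b", "summarize Q3", "gpt-x")))  # -> None
```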
Rate limits and usage caps. Per-tenant rate limits (queries per minute, tokens per day, concurrent requests) prevent one tenant from starving others. The rate-limit enforcement happens at the gateway before any AI-specific processing so the expensive path is never entered for over-limit requests.
Cost attribution. Every inference call, retrieval call, tool call, and storage operation is tagged with the tenant identifier. The billing pipeline aggregates by tag and produces per-tenant cost reports. Without attribution, the business cannot charge accurately, and cannot detect the runaway tenant whose usage exceeds their tier before the month-end invoice.
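The tagging-and-aggregation mechanics are simple enough to sketch end to end. The per-token rates below are placeholders, not any provider's actual price list; the point is that the tenant tag is attached at call time and the billing pipeline only ever aggregates by that tag.

```python
from collections import defaultdict

# Illustrative per-token rates; real rates come from the provider's price list.
RATES = {"input": 0.000002, "output": 0.000008}

ledger: list[dict] = []

def record_call(tenant_id: str, input_tokens: int, output_tokens: int):
    """Tag every inference call with the tenant identifier at call time."""
    cost = input_tokens * RATES["input"] + output_tokens * RATES["output"]
    ledger.append({"tenant": tenant_id, "cost": cost})

def per_tenant_costs() -> dict[str, float]:
    """Billing pipeline: aggregate the ledger by tenant tag."""
    totals: dict[str, float] = defaultdict(float)
    for entry in ledger:
        totals[entry["tenant"]] += entry["cost"]
    return dict(totals)

record_call("tenant-a", input_tokens=1000, output_tokens=500)
record_call("tenant-b", input_tokens=200, output_tokens=100)
record_call("tenant-a", input_tokens=400, output_tokens=200)
print(per_tenant_costs())
```

The same aggregation, run continuously rather than at month end, is what surfaces the runaway tenant before the invoice does.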
[DIAGRAM: HubSpokeDiagram — aite-sat-article-16-per-tenant-policy-hub-spoke — Hub labelled “Shared AI service” with spokes radiating out to per-tenant policy objects. Spoke 1: “Tenant A: EU residency, healthcare content class, 10 QPS, bge-large embeddings”. Spoke 2: “Tenant B: US residency, retail content class, 50 QPS, OpenAI embeddings”. Spoke 3: “Tenant C: APAC residency, financial services class, 25 QPS, Voyage embeddings, human-review sample 5%”. Spoke 4: “Tenant D: Internal tenant, no residency constraint, unlimited QPS, experimental models allowed”. Each spoke annotated with the per-tenant artifacts: system prompt, tool schema subset, safety policy, rate limits, cost attribution tag, retention policy.]
Per-tenant safety policy
Safety is not uniform across tenants. A medical-advice tenant requires stricter disclaimers and lower tolerance for speculation than a general-purpose assistant. A tenant in a regulated jurisdiction requires bias-auditing frequencies higher than a tenant in a less regulated one. A tenant serving minors has content restrictions a tenant serving adults does not. The architect expresses these as per-tenant policy objects the shared service loads at request time and applies to the prompt assembly, tool availability, and output-layer filters.
Policy objects are versioned. A tenant’s policy change is a deliberate event recorded in the change log and propagated to the evaluation harness so that the post-change evaluation runs against the new policy. A silent policy change — one tenant’s tolerance updated by an admin without the team noticing — is a governance failure waiting to be found by an external auditor.
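A minimal shape for a versioned policy object, assuming an illustrative (non-normative) set of fields: the object is immutable, every update produces a new version, and the update path appends to the change log, so a silent change is structurally impossible.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TenantPolicy:
    """Versioned per-tenant policy object loaded at request time.
    Fields are illustrative, not a normative schema."""
    tenant_id: str
    version: int
    residency: str              # e.g. "eu", "us", "apac"
    content_class: str          # e.g. "healthcare", "retail"
    max_qps: int
    human_review_sample: float  # fraction of outputs sampled for review

change_log: list[TenantPolicy] = []

def update_policy(current: TenantPolicy, **changes) -> TenantPolicy:
    """Every change bumps the version and is recorded in the change
    log, ready for the post-change evaluation run."""
    new = TenantPolicy(**{**asdict(current), **changes,
                          "version": current.version + 1})
    change_log.append(new)
    return new

p1 = TenantPolicy("tenant-c", 1, "apac", "financial", 25, 0.05)
change_log.append(p1)
p2 = update_policy(p1, human_review_sample=0.10)
print(p2.version, len(change_log))  # -> 2 2
```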
Tenant onboarding and offboarding
Onboarding provisions the tenant’s isolated resources (index or namespace, policy object, rate-limit bucket, cost-attribution tag) and loads any tenant-specific content into the retrieval corpus through the data pipeline from Article 15. Offboarding reverses the process: content is deleted from the index per the tenancy pattern (delete namespace, delete filtered chunks, retire deployment), caches are purged, the policy object is archived, and the cost-attribution tag is frozen so historical billing remains auditable.
GDPR Article 17 right to erasure applies at the individual user level within a tenant; the offboarding process is the tenant-level analogue. Both processes exercise the same deletion machinery.
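The onboarding and offboarding checklists above can be sketched as a pair of mirrored functions. Everything here is hypothetical scaffolding (the `platform` dict stands in for real provisioning APIs); the property worth noting is that offboarding deletes content and purges caches but freezes rather than deletes the cost-attribution tag.

```python
# Hypothetical provisioning sketch; none of this is a real platform API.

def onboard(tenant_id: str, platform: dict):
    platform["namespaces"][tenant_id] = []            # index namespace
    platform["policies"][tenant_id] = {"version": 1}  # policy object
    platform["rate_limits"][tenant_id] = 10           # rate-limit bucket (QPS)
    platform["cost_tags"][tenant_id] = {"frozen": False}

def offboard(tenant_id: str, platform: dict):
    del platform["namespaces"][tenant_id]             # delete namespace content
    platform["caches"] = {k: v for k, v in platform["caches"].items()
                          if not k.startswith(tenant_id)}  # purge caches
    platform["policies"][tenant_id]["archived"] = True     # archive policy
    platform["cost_tags"][tenant_id]["frozen"] = True      # billing stays auditable

platform = {"namespaces": {}, "policies": {}, "rate_limits": {},
            "caches": {}, "cost_tags": {}}
onboard("tenant-x", platform)
platform["caches"]["tenant-x:q1"] = "cached"
offboard("tenant-x", platform)
print("tenant-x" in platform["namespaces"],
      platform["cost_tags"]["tenant-x"]["frozen"])  # -> False True
```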
Three real-world examples
OpenAI Assistants threading. OpenAI’s Assistants API documents per-thread conversation state, per-thread file uploads with scoped access, and per-assistant configuration — a structural multi-tenancy where each thread and each assistant is effectively a sub-tenant within the customer account.2 The architectural point for the AITE-SAT learner is that even within a single customer account, the managed provider has built multi-tenancy primitives that the architect can use instead of rebuilding. When composing on top of a managed platform, the architect uses the platform’s native isolation units rather than overlaying a home-grown isolation layer.
Pinecone namespaces for per-tenant isolation. Pinecone’s namespace feature is the documented, recommended pattern for multi-tenant applications: a single index holds all tenants’ data, each tenant’s vectors live in a distinct namespace, and every query specifies the namespace to constrain results.3 The architectural point is that Pattern 2 (shared service, per-tenant index) is natively supported by the vector store — the architect does not have to design custom isolation, they consume the store’s built-in construct. The architect is responsible for ensuring every query includes the namespace; the store is responsible for enforcing isolation given that input.
Azure AI Foundry per-project isolation. Azure AI Foundry exposes projects as isolation units for AI applications — each project has its own deployments, its own datasets, its own access control, its own logs.4 A multi-tenant architecture on Azure AI Foundry can map one project to one tenant for Pattern 1 isolation without the architect building deployment machinery. The architectural point is that hyperscalers have matured their AI-specific multi-tenancy constructs significantly since 2023, and the architect choosing a cloud-platform approach gets per-tenant isolation in the platform’s administrative UX rather than in custom code.
Noisy-neighbor and fair-share scheduling
A multi-tenant AI architecture faces noisy-neighbor problems the classical SaaS world has seen but in AI-specific shapes. A tenant’s long prompt or large batch job occupies model-server capacity during its processing, delaying other tenants’ short queries behind it. A tenant’s embedding-refresh job hammers the vector store, slowing retrieval for other tenants. A tenant’s tool-heavy agent loop consumes orchestration-service connections at a rate that starves other tenants.
The architect applies three mitigations. First, fair-share scheduling at the model-call layer — queueing theory applied to batch boundaries, with per-tenant concurrent-request caps so no single tenant monopolizes the serving engine. Second, priority tiers — premium tenants get guaranteed capacity via provisioned-throughput reservations, free tenants share best-effort capacity, and burst capacity floats between tiers. Third, isolation of long-running workloads — batch embedding refreshes, evaluation runs, and other background tasks run on a separate serving path or at off-peak hours so they do not compete with interactive traffic.
The architecture is harder to get right than classical web fair-sharing because LLM request duration varies by more than three orders of magnitude (a completion request can take 50ms or 50 seconds), and the serving engine’s batching dynamics mean that a few long-running requests can dominate throughput. The architect measures per-tenant interactive-path latency under load and validates that fair-share enforcement is actually working rather than assuming it is.
Regulatory alignment
EU AI Act Article 10 on data governance, Article 13 on transparency, and Article 14 on human oversight all have per-tenant implications when the deployer provides the AI system on behalf of multiple end-users.5 GDPR’s data-controller and data-processor distinctions become operationally real in multi-tenant architectures: the SaaS vendor is typically the processor and each tenant may be a controller, or both may be joint controllers depending on the configuration. The architect documents the roles, the tenant-isolation controls, and the per-tenant policy mechanisms as part of the conformity assessment and the data-processing agreement. ISO/IEC 42001 Clause 7.5 on documented information requires per-tenant configurations be tracked; the per-tenant policy object satisfies this.
Summary
Multi-tenancy in AI systems has architectural implications classical SaaS does not prepare the architect for. The five patterns — per-tenant everything, shared service with per-tenant index, shared index with filter, cohort-sharded, per-tenant orchestration with shared model — span the isolation-versus-cost trade-off. Five layers are always tenant-scoped: retrieval, prompts, caches, rate limits, cost attribution. Per-tenant safety policies are versioned and evaluated on every change. OpenAI Assistants threading, Pinecone namespaces, and Azure AI Foundry projects are public references for patterns 1 and 2. Regulatory alignment with EU AI Act, GDPR, and ISO 42001 depends on the architect documenting the tenancy model, the isolation controls, and the per-tenant policy machinery as first-class artifacts.
Further reading in the Core Stream: Enterprise AI Deployment Patterns and Data Governance for AI.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Supabase Row-Level Security (RLS) documentation. https://supabase.com/docs/guides/database/postgres/row-level-security — accessed 2026-04-20.
2. OpenAI Assistants API and threads documentation. https://platform.openai.com/docs/assistants/overview — accessed 2026-04-20.
3. Pinecone namespaces documentation. https://docs.pinecone.io/guides/indexes/use-namespaces — accessed 2026-04-20.
4. Azure AI Foundry projects documentation. https://learn.microsoft.com/en-us/azure/ai-studio/how-to/create-projects — accessed 2026-04-20.
5. Regulation (EU) 2024/1689, Articles 10, 13, and 14. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. Regulation (EU) 2016/679 (GDPR), Articles 4, 17, and 28. https://eur-lex.europa.eu/eli/reg/2016/679/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clause 7.5. https://www.iso.org/standard/81230.html — accessed 2026-04-20.