AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 8 of 35
A model does not exist in isolation. It is hosted somewhere, reached through some protocol, sized for some throughput. The decision about where and how to host it — the serving pattern — determines latency, cost, residency, capability ceiling, and operational dependency. A managed API returns tokens from a service the architect neither operates nor controls. A self-hosted deployment returns tokens from hardware the architect provisioned, optimized, and monitored. The intermediate option — cloud AI platforms — splits the difference. Each serving pattern is an architectural choice with downstream consequences, and the architect is responsible for making that choice deliberately rather than inheriting it from whatever path is most convenient on the day.
The five serving patterns
Managed API. The provider runs the model; the application calls a documented endpoint and receives tokens. OpenAI, Anthropic, Cohere, Mistral API, and Google AI Studio are canonical examples. The architect surrenders infrastructure control in exchange for zero operations, provider-maintained evaluation benchmarks, automatic capability upgrades, and predictable per-call billing. Capability ceiling is the highest because the provider ships their best model as the default. Residency is limited to where the provider has endpoints; some providers now offer regional isolation (OpenAI European endpoints, Anthropic on AWS Bedrock Europe), others do not.
Cloud AI platform. A hyperscaler (AWS, Azure, GCP) operates a model catalog accessible through their control plane. AWS Bedrock offers Anthropic Claude, Meta Llama, Mistral, Cohere, Amazon Nova, and others through a unified API; Azure AI Foundry offers OpenAI models plus a catalog including Llama, Mistral, Phi, and DeepSeek; Google Vertex AI Model Garden offers Gemini plus third-party models. The architect inherits IAM, VPC, regional endpoints, existing billing, and compliance postures from the cloud provider, and gains residency control through regional selection. Capability ceiling is high but sometimes lags the direct managed API by weeks as new model versions propagate through the platform.
Self-hosted batch. The organization runs its own inference server on its own hardware for offline or asynchronous workloads. Common stacks are vLLM, Text Generation Inference (TGI), SGLang, Triton Inference Server, or TensorRT-LLM running open-weight models (Llama 3, Mistral, Qwen, DeepSeek, Phi). Batch inference suits document processing, bulk classification, embeddings at scale, and any workload where seconds-to-minutes latency is acceptable. Cost economics are predictable (GPU-hours rather than per-token), residency is complete (data never leaves the organization’s boundary), and the architect is responsible for everything — model weights, tokenizer, server tuning, quantization, fleet management.
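The GPU-hours-versus-per-token economics can be made concrete with a back-of-envelope calculation. All prices and throughput figures below are illustrative assumptions, not vendor quotes:

```python
# Back-of-envelope comparison of managed-API per-token pricing against
# self-hosted batch GPU-hour pricing for a bulk document job.
# Every number here is a hypothetical assumption for illustration.

DOCS = 10_000_000            # documents in the corpus
TOKENS_PER_DOC = 2_000       # prompt + completion tokens per document
total_tokens = DOCS * TOKENS_PER_DOC

# Hypothetical managed-API blended price: $3.00 per million tokens.
api_cost = total_tokens / 1_000_000 * 3.00

# Hypothetical self-hosted batch node: $20/hour for an 8xA100 node
# sustaining 20,000 tokens/second across the whole batch.
node_tokens_per_hour = 20_000 * 3600
hours = total_tokens / node_tokens_per_hour
batch_cost = hours * 20.00

print(f"managed API:      ${api_cost:,.0f}")
print(f"self-hosted batch: ${batch_cost:,.0f} over {hours:,.0f} node-hours")
```

The point is not the specific numbers but the shape of the comparison: at corpus scale, per-token billing grows linearly with volume while self-hosted cost grows with GPU-hours, and the crossover is something the architect computes rather than assumes.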
Self-hosted online. The same open-weight stacks, now operated to meet interactive latency targets. vLLM’s continuous batching, FlashAttention kernels, speculative decoding, and paged KV cache turn a single A100 or H100 into an online serving engine for a 7B–70B-parameter model with respectable time-to-first-token. The architect pays for staff who can tune the server, a fleet that can autoscale, and the expertise to diagnose tail-latency regressions. The reward is an inference path with no external billing surface and full control over the model.
Edge. Inference runs on or near the device: mobile (CoreML, TensorFlow Lite), laptop (Ollama, LM Studio), factory-floor compute node, or embedded device (Jetson, NPU-accelerated PCs). Small quantized models (Phi-3 Mini, Llama 3 8B 4-bit, Mistral 7B 4-bit, Gemma 2B) make this viable for constrained tasks. Edge is the topology of last resort for residency but the topology of first resort for latency-critical UX. Article 18 develops deployment topology including edge in depth; this article treats edge as one of the five serving patterns the architect should be able to reach for.
[DIAGRAM: MatrixDiagram — aite-sat-article-8-control-scale-matrix — A 2×2 quadrant diagram with “Control” (low → high) on the vertical axis and “Scale” (small → very large) on the horizontal axis. Quadrant labels: top-left (high control, small scale) — “Self-hosted online, dedicated cluster”; top-right (high control, very large scale) — “Self-hosted batch, GPU fleet”; bottom-left (low control, small scale) — “Managed API, direct”; bottom-right (low control, very large scale) — “Cloud AI platform, regional endpoints”. Edge is marked as a fifth pattern sitting outside the matrix at extreme-low-control/extreme-small-scale.]
The inference path
An inference call is not a single operation. It is a sequence of stages, each with its own latency contribution. The architect reasons about the path so that the measured latency numbers from production tie back to a named stage rather than a mystery. The path is: client request → gateway authentication and routing → orchestration layer pre-processing (prompt assembly, tool schema injection, retrieval) → model serving endpoint → token generation (prefill + decode) → post-processing (parsing, validation) → streaming or buffered response → client rendering. Each stage has a latency budget the architect set during design; each stage has measured actuals in production; the comparison surfaces problems before they become incidents.
Time-to-first-token (TTFT) is the latency from request submission to the first token arriving at the client. It is dominated by prefill — the model processing the input prompt — plus network and gateway overhead. Inter-token latency (ITL) is the time between consecutive tokens during streaming. It is dominated by decode throughput, which in turn depends on GPU utilization, batch size, and KV cache efficiency. Total response latency is TTFT plus (ITL × token count). The architect budgets all three separately because optimization techniques target them differently: prompt caching improves TTFT, continuous batching improves ITL, speculative decoding improves both.
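The three budgets compose arithmetically. A minimal sketch (noting that, strictly, ITL applies to the tokens − 1 gaps between consecutive tokens):

```python
# How TTFT, ITL, and total response latency compose. ITL measures the
# gap between consecutive tokens, so the second term uses tokens - 1.

def total_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """Total response latency = TTFT + ITL x (output_tokens - 1)."""
    return ttft_ms + itl_ms * max(output_tokens - 1, 0)

# Example budget: a 500 ms TTFT budget and a 30 ms ITL budget for a
# 300-token response imply roughly a 9.5-second total budget.
budget = total_latency_ms(500, 30, 300)
print(f"total budget: {budget:.0f} ms")   # 500 + 30 * 299 = 9470 ms
```

Comparing this computed budget against measured production actuals, stage by stage, is what ties a latency regression back to a named stage: a TTFT overrun points at prefill, gateway, or prompt assembly; an ITL overrun points at decode throughput.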
The build-your-own reference
The curriculum’s technology-neutrality requirement asks for a build-your-own raw-model reference example that mirrors the patterns taught on managed APIs and cloud platforms. The reference stack is FastAPI (for the outer application), vLLM (for the inference engine), and Llama 3 (for the model), running on a single H100 or an 8xA100 node.
vLLM is the reference engine because it is the most widely documented open-source serving framework, it implements paged attention and continuous batching in production-quality form, and it supports the OpenAI-compatible API surface that applications written against OpenAI can target without code changes.1 A production vLLM deployment exposes an HTTP server with a chat completion endpoint; FastAPI in front of it adds the application-specific authentication, request shaping, and integration with the tool registry, retrieval layer, and evaluation pipeline that the managed API providers would otherwise host.
Running Llama 3 70B quantized to AWQ or GPTQ on 4×A100-40GB is a production-grade configuration documented in vLLM’s deployment guides. Running Llama 3 8B un-quantized on a single A100 delivers sub-second TTFT and hundreds of tokens per second at modest batch sizes. The numbers are public and reproducible; the curriculum teaches the architect to measure them on their own hardware before relying on them.
[DIAGRAM: TimelineDiagram — aite-sat-article-8-vllm-continuous-batching — Horizontal timeline showing four concurrent requests arriving at a vLLM server across a 2-second window. Request arrivals annotated at t=0ms, t=120ms, t=310ms, and t=480ms. Each request shown as a horizontal bar broken into three segments: “Queued” (amber), “In-flight prefill” (green), “In-flight decode” (blue with tokens being emitted). Continuous batching is visible because request 2 enters in-flight decode while request 1 is still decoding and request 3’s prefill overlaps request 2’s decode. Total time-to-first-token and total response time are labelled for each request.]
Capability-versus-control trade-off
The five patterns form a curve, not a ranking. Managed APIs sit at one end: highest capability ceiling, lowest operational burden, lowest control. Edge sits at the other: lowest capability ceiling, highest operational burden per device, highest control. The architect’s task is not to pick the “best” pattern but to pick the pattern that matches the use case’s constraints.
A consumer-facing chatbot with non-sensitive queries that needs frontier quality gets a managed API. A regulated HR assistant that must keep data in a specific jurisdiction gets a cloud AI platform with regional endpoints or a sovereign-cloud deployment (Article 18). A bulk document-processing job on a ten-million-document corpus gets self-hosted batch because the per-call economics of a managed API would be prohibitive. A latency-critical mobile feature that must work offline gets edge.
The architect expects to use more than one pattern in the same architecture. A common enterprise pattern is managed API for the interactive path, self-hosted batch for the offline ingestion and evaluation path, and cloud AI platform for the regulated-workload path. Article 9’s cost architecture explores routing across patterns; Article 26’s build-versus-buy develops the strategic framing.
Three real-world examples
Databricks DBRX. When Databricks released DBRX in 2024, the Mosaic AI team published a detailed serving blog documenting the self-hosted-open-weight reference architecture: Mosaic Inference with continuous batching, per-model autoscaling, tensor parallelism for larger variants, and benchmarks for latency and throughput at production batch sizes.2 The serving blog is an exemplar of what a self-hosted online deployment looks like for a team that has the engineering depth to operate one. The architectural point for the AITE-SAT learner is that self-hosted serving at enterprise scale is a real option, documented by a vendor with reproducible numbers.
OpenAI API performance page. OpenAI publishes aggregate performance metrics for its managed API — TTFT percentiles by model, throughput under load, availability.3 The performance page is the reference an architect uses to understand the SLA surface of the managed option. It is also the baseline against which other options are compared. If a cloud AI platform or a self-hosted stack cannot meet the managed-API baseline, the architect has to argue why — latency-insensitive workload, residency requirement, cost — not merely appeal to preference.
Azure AI Foundry deployment options. Azure AI Foundry’s documentation enumerates the deployment options for a model: serverless (per-token, shared endpoint), provisioned throughput (reserved capacity, predictable latency), and managed compute (dedicated GPU cluster running the chosen model).4 The documentation is the cloud-AI-platform reference for how the pattern subdivides inside a hyperscaler’s catalog. The architect learning from Azure’s options recognizes that the three sub-patterns map to three different cost-and-latency profiles and that picking between them is itself an architectural choice.
Serving-path orchestration frameworks
The architect sits above the serving engine and below the application. Orchestration frameworks — LangChain, LlamaIndex, Haystack, DSPy, Semantic Kernel, Vercel AI SDK — provide the glue that connects the serving endpoint to retrieval, tool use, evaluation, and observability. The choice of orchestration is independent of the choice of serving pattern; the same orchestration framework can talk to a managed API, a cloud AI platform endpoint, and a self-hosted vLLM server. The architect picks orchestration based on language fit (Python for most data-science teams, TypeScript for most product teams), opinionation fit (LangChain’s breadth vs. LlamaIndex’s RAG depth vs. DSPy’s declarative compilation), and ecosystem fit (Semantic Kernel for Microsoft-aligned stacks, Bedrock Agents for AWS-aligned stacks).5
The orchestration layer is where provider abstraction happens in production. An architect who writes directly against the OpenAI API surface has tied the application to one provider; an architect who writes against an orchestration framework’s abstraction can swap providers without rewriting application code. This is Article 26’s build-versus-buy decision applied to the serving path: the abstraction cost is worth paying when the organization anticipates provider mix changes, and it is not worth paying for a one-off prototype.
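The abstraction itself is small. A plain-Python sketch of what an orchestration framework provides under the hood; the class and method names here are illustrative, not any framework’s actual API:

```python
# Provider abstraction: application code depends on a narrow interface,
# so swapping serving patterns requires no application-code changes.
from typing import Protocol

class ChatProvider(Protocol):
    def complete(self, prompt: str) -> str: ...

class ManagedAPIProvider:
    """Would wrap a managed-API SDK call; stubbed for illustration."""
    def complete(self, prompt: str) -> str:
        return f"[managed-api] {prompt}"

class SelfHostedProvider:
    """Would POST to a self-hosted vLLM endpoint; stubbed for illustration."""
    def complete(self, prompt: str) -> str:
        return f"[vllm] {prompt}"

def summarize(provider: ChatProvider, text: str) -> str:
    # The application never names a concrete provider here.
    return provider.complete(f"Summarize: {text}")
```

The cost of the abstraction is the indirection itself plus lowest-common-denominator feature coverage; that is the trade Article 26 prices.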
Failover and graceful degradation across serving patterns
Production systems rarely rely on a single serving path. The architect designs failover across patterns: primary on one managed API, fallback to a different provider or to a cloud platform on provider outage, final fallback to a smaller model or a cached response when all paths fail. The failover path is tested during the build, not during the first outage. Tools like Portkey, Not Diamond, and OpenRouter implement provider-level routing and failover out of the box; building equivalent routing in-house is the same amount of work and demands the same discipline.6
A graceful-degradation plan acknowledges that AI endpoints have worse availability than most application components. A managed API’s 99.9% availability permits roughly ten times the downtime of the 99.99% or better that stateless microservices routinely achieve; the architecture accommodates the gap by expecting and handling failures rather than assuming they will not occur.
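The failover chain with a degradation floor fits in a few lines. The provider callables below are placeholders for the real managed-API, cloud-platform, and self-hosted clients:

```python
# Failover chain across serving paths, degrading to a cached response
# when every path fails. Provider callables are illustrative stubs.
import logging

def call_with_failover(prompt, providers, cached_responses):
    """Try each (name, call) pair in order; degrade to cache on total failure."""
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:
            # Real code would catch narrower exception types and add
            # timeouts, retries with backoff, and circuit breaking.
            logging.warning("provider %s failed: %s", name, exc)
    # Graceful degradation: a cached answer or an explicit fallback message.
    return cached_responses.get(prompt, "Service temporarily unavailable.")
```

The ordering of `providers` encodes the architect’s failover policy, and exercising the chain with a deliberately failing primary is the test that happens during the build rather than during the first outage.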
Regulatory alignment
Serving-pattern selection is the first place residency and sovereignty requirements bite (EU AI Act Article 10 on data governance, Schrems II precedent on transfers, sector rules on healthcare and financial data). A managed API with no EU endpoint is incompatible with a GDPR data-minimization posture for sensitive categories. A cloud AI platform with regional selection plus a sovereign-cloud option covers most EU use cases; a self-hosted stack inside the organization’s own boundary covers all of them at the cost of operational burden. The architect documents the serving-pattern decision as part of the conformity-assessment evidence pack for high-risk systems, per EU AI Act Article 11.
Summary
Serving patterns are the architecture’s relationship with where and how the model runs. The five patterns — managed API, cloud AI platform, self-hosted batch, self-hosted online, edge — form a capability-versus-control curve. The inference path is a stack of stages, each with its own latency contribution; TTFT, ITL, and total response latency are budgeted separately. The build-your-own reference (FastAPI + vLLM + Llama 3) anchors the self-host option as a first-class pattern, not a curiosity. Databricks DBRX’s serving blog documents what enterprise self-hosting looks like in practice; OpenAI and Azure document what managed and cloud-platform options offer. The architect picks the pattern — often a mix — that matches the use case’s constraints, not the vendor’s preference. Regulatory alignment is driven by where the inference runs and under whose operational control.
Further reading in the Core Stream: Enterprise AI Deployment Patterns and The Technology Architecture Roadmap.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. vLLM project documentation and reference deployment guide. https://docs.vllm.ai/ — accessed 2026-04-20. Woosuk Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” SOSP 2023 (arXiv:2309.06180). https://arxiv.org/abs/2309.06180 — accessed 2026-04-20.
2. Databricks Mosaic AI, “Introducing DBRX” and subsequent serving performance blog. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm — accessed 2026-04-20.
3. OpenAI API performance and status pages. https://status.openai.com/ and https://platform.openai.com/docs/guides/latency-optimization — accessed 2026-04-20.
4. Microsoft Azure AI Foundry deployment options documentation. https://learn.microsoft.com/en-us/azure/ai-studio/concepts/deployments-overview — accessed 2026-04-20.
5. LangChain. https://python.langchain.com/ — accessed 2026-04-20. LlamaIndex. https://docs.llamaindex.ai/ — accessed 2026-04-20. Haystack. https://haystack.deepset.ai/ — accessed 2026-04-20. DSPy. https://dspy.ai/ — accessed 2026-04-20. Microsoft Semantic Kernel. https://learn.microsoft.com/en-us/semantic-kernel/ — accessed 2026-04-20. Vercel AI SDK. https://sdk.vercel.ai/ — accessed 2026-04-20.
6. Portkey. https://portkey.ai/ — accessed 2026-04-20. Not Diamond. https://notdiamond.ai/ — accessed 2026-04-20. OpenRouter. https://openrouter.ai/ — accessed 2026-04-20.