AITE M1.1-Art17 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Latency, Cost, and Scalability Architecture


AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 17 of 35


Every AI architecture eventually arrives at the same trilemma. Users want answers in two hundred milliseconds. Finance wants per-query cost at one cent. Product wants the system to handle ten times the traffic at the next launch without breaking the first two promises. Two of the three are usually achievable with straightforward engineering. The third requires architectural work. This article teaches the AITE-SAT learner how to reason about latency, cost, and scalability as interacting variables, what metrics they reduce to in practice, what techniques move each one, and what the trade-offs are. The article ends with a reference pattern the architect can adapt to a named SLA target; the reference is not a silver bullet, but it shows what a balanced design looks like so the architect recognizes when their own design is unbalanced.

The NFR trilemma

Three non-functional requirements dominate AI system design: latency, cost, and scalability. They are not independent. A technique that reduces latency usually increases cost (smaller batches, premium hardware, redundant paths). A technique that reduces cost usually hurts latency (larger batches, cheaper models, aggressive caching with fallback). A technique that improves scalability usually complicates both (horizontal partitioning, async processing, queueing). The architect’s task is not to optimize for any one of them — that problem is easy and almost always produces a bad architecture — but to find the Pareto frontier where none of the three can be improved without worsening another.

The trilemma becomes concrete when the business names the target. A consumer-facing chatbot with free-tier users has a different trilemma frontier than a high-value legal-research tool with paid enterprise users. The free-tier chatbot favors cost and scalability at the expense of latency; the legal tool favors latency and quality at the expense of cost. The architect’s first job is to force the business to name the target so that the trilemma has a point to aim at, not just three axes.

The latency decomposition

LLM latency is not a single number. The architect budgets and measures three separate components.

Time to first token (TTFT). The latency from request submission to the first token arriving at the client. Users perceive TTFT as “how long before I see the response starting.” Streamed responses feel responsive if TTFT is under a second even when total response takes five. TTFT is dominated by prefill — the model’s computation on the input prompt — plus network and gateway overhead. It scales roughly linearly with input length, often non-linearly with batch size on the serving engine.

Inter-token latency (ITL). The time between consecutive tokens during streaming. Users perceive ITL as the “speed of reading” of the response. ITL is dominated by decode throughput. On managed APIs it is typically 20-50 ms per token depending on model and load; on self-hosted deployments it depends on hardware and batch configuration.

Total response latency. TTFT plus (ITL × number of output tokens) plus any post-processing. For a 500-token response at 30 ms per token with 400 ms TTFT, total response is 15.4 seconds. Users perceive total latency as “how long before I have the full answer.” Non-streaming responses expose total latency as the only latency the user sees, which is why most chat UIs stream.

The three components respond to different optimization techniques. Prompt caching improves TTFT by skipping prefill on cached prefixes. Speculative decoding improves both TTFT and ITL by predicting multiple tokens per model step. Output-length control (via max_tokens, structured output, or prompt design) improves total latency by reducing the number of tokens that ITL multiplies against. Streaming improves none of the three directly but improves perceived latency by letting the user read the response while it is still being generated instead of waiting out the full total latency.
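The decomposition above reduces to a one-line budget the architect can plug targets into. A minimal sketch in Python, using the article's worked example (the function name is illustrative):

```python
# Total response latency decomposes as TTFT + ITL * output_tokens,
# plus any post-processing. Figures below are the article's example,
# not measured values.

def total_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int,
                     post_ms: float = 0.0) -> float:
    """Total response latency from the three measured components."""
    return ttft_ms + itl_ms * output_tokens + post_ms

# The article's example: 400 ms TTFT, 30 ms/token ITL, 500 output tokens.
total = total_latency_ms(400, 30, 500)   # 15400 ms = 15.4 s
```

Budgeting in this form makes the levers explicit: caching attacks the first term, speculative decoding the second term's coefficient, output shaping the second term's multiplier.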

Five optimization techniques and what they buy

Prompt caching. Caches the model’s internal representation of a prompt prefix so subsequent requests with the same prefix skip prefill. Buys: TTFT reduction of 50-90% on cached-prefix requests, input-token cost reduction of 90% on cache hits for providers that offer discounted cached reads. Costs: requires prompt templates shaped for cacheable prefixes; cache hit rate depends on workload structure. Applicable to managed APIs (Anthropic, OpenAI, Google) and self-hosted vLLM with prefix caching enabled.1
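The workload-shaping requirement can be made concrete: a prompt assembled stable-content-first maximizes the prefix a cache can reuse across requests. A toy sketch, not tied to any provider's API (segment names are illustrative):

```python
# Sketch of prompt assembly shaped for prefix caching: stable content
# (system prompt, shared context) goes first so repeated requests share
# the longest possible cacheable prefix; volatile per-request content
# goes last.

def assemble_prompt(system: str, shared_context: str, user_query: str) -> str:
    # Cacheable prefix: identical across requests for the same feature.
    prefix = f"{system}\n\n{shared_context}"
    # Volatile suffix: changes every request, so it lands after the prefix.
    return f"{prefix}\n\nUser: {user_query}"

def shared_prefix_len(a: str, b: str) -> int:
    """Leading characters two prompts share: a rough proxy for how much
    prefill a prefix cache could skip on the second request."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n
```

Putting the user query before the shared context would shrink the shared prefix to near zero and defeat the cache, which is why template shape matters as much as cache configuration.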

Speculative decoding. Uses a small draft model to predict multiple tokens ahead and a larger target model to verify them in batch, effectively producing several tokens per target-model step. Buys: TTFT reduction and ITL reduction of 2-3x in typical configurations. Costs: requires both draft and target models co-located; complexity in ensuring draft and target outputs align; limited benefit on already-small models. vLLM and TGI both support speculative decoding for self-hosted deployments.2

Continuous batching. Packs in-flight requests into the same GPU batch dynamically, letting new requests join the batch without waiting for the current batch to complete. Buys: throughput increase of 2-10x over static batching, improved GPU utilization, lower cost per request at scale. Costs: requires a serving engine that supports it (vLLM, TGI, TensorRT-LLM), and the benefit is small at low concurrency, compounding only as concurrency rises. Native to most modern inference engines; the architect verifies it is enabled in production configurations.
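The throughput difference is easy to see in a toy step-wise simulation; the capacity and per-request work figures are illustrative, not measured:

```python
# Toy simulation contrasting static and continuous batching. Each request
# needs `work` decode steps; the engine runs up to `cap` requests per step.

def static_batching_steps(work_per_req, cap):
    """Fixed batches: a new batch starts only after the previous batch
    fully drains, so short requests wait on the batch's slowest member."""
    steps = 0
    queue = list(work_per_req)
    while queue:
        batch, queue = queue[:cap], queue[cap:]
        steps += max(batch)      # batch ends when its slowest request ends
    return steps

def continuous_batching_steps(work_per_req, cap):
    """A finished request's slot is refilled immediately from the queue."""
    steps = 0
    queue = list(work_per_req)
    running = []
    while queue or running:
        while queue and len(running) < cap:
            running.append(queue.pop(0))   # new request joins mid-flight
        steps += 1
        running = [w - 1 for w in running if w > 1]
    return steps
```

With a mix of one long and several short requests, the static engine wastes slots padding out the long request in every batch, while the continuous engine backfills them, which is the source of the utilization gain.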

Quantization. Runs the model in reduced precision (int8, int4, AWQ, GPTQ, FP8) at modest quality loss. Buys: 2-4x memory reduction, 2-3x throughput increase, ability to fit a larger model on the same hardware or a mid-size model on much cheaper hardware. Costs: a quality drop that must be validated on the golden set; not every accelerator supports every precision format; some providers do not disclose whether their deployed model is quantized. Applicable primarily to self-hosted open-weight deployments.
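The memory arithmetic behind the 2-4x claim is simple enough to sketch. Byte widths below are the usual rules of thumb, and the estimate covers weights only (KV cache and activations are excluded):

```python
# Back-of-envelope weight-memory estimate at different precisions.
# Weights only: KV cache and activation memory are not included.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(n_params_b: float, precision: str) -> float:
    """Weights-only footprint in GB for n_params_b billion parameters."""
    return n_params_b * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# A 70B model: ~140 GB at fp16 vs ~35 GB at int4, the difference between
# multi-GPU serving and a single large accelerator.
```

This is why quantization shows up as a cost lever as much as a latency lever: halving or quartering weight memory changes which hardware tier the deployment can run on.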

Output shaping. Reduces the number of output tokens through structured output, length instructions, and prompt design. Buys: proportional reduction in per-call latency and cost. Costs: downstream consumers may need to be adapted to shorter outputs; some use cases resist shortening (creative content, long-form summaries). The cheapest optimization because it requires no infrastructure change.

[DIAGRAM: ScoreboardDiagram — aite-sat-article-17-nfr-target-actual — Scoreboard-style panel showing three rows of metrics. Row 1 (Latency): “TTFT p50 target 400ms — actual 520ms (red)”; “TTFT p95 target 1200ms — actual 980ms (green)”; “Total response p95 target 5000ms — actual 4800ms (green)”. Row 2 (Cost): “Per-query cost target $0.010 — actual $0.014 (amber)”; “Per-tenant monthly run-rate target $12K — actual $15K (red)”; “Cache hit rate target 40% — actual 25% (amber)”. Row 3 (Scalability): “Peak QPS handled target 500 — actual 620 (green)”; “Autoscale lag target 30s — actual 75s (red)”; “Regional failover RPO target 0 — actual 0 (green)”. Annotations below indicate the next optimization actions suggested by the red and amber cells.]

The autoscaling challenge

LLM workloads are spiky in ways classical web workloads are not. A viral prompt template or a newly published blog post can multiply traffic 10x within minutes. A large batch job initiated by a user in another timezone can saturate the serving path during a period the architect expected to be quiet. Autoscaling must scale up fast enough that a burst does not exhaust the connection pool, and scale down promptly enough that capacity paid for but unused is not left stranded, without oscillating between the two.

Three autoscaling patterns are common.

Managed-API auto-scaling. The provider scales transparently within the customer’s quota. The architect’s concerns are quota negotiation (can the managed API accommodate the expected burst), fallback (if the provider rate-limits, does the application gracefully degrade or fail), and cost management (higher throughput bills at the same per-token rate, so the cost scales linearly with traffic).

Cloud-platform serverless. Services like Bedrock on-demand, Vertex AI Model Garden serverless, and Azure AI Foundry serverless deployments autoscale implicitly with pay-per-token pricing. They suit variable workloads where peak is many times average.

Self-hosted autoscaling. Kubernetes-based autoscaling with GPU-aware metrics (KServe, vLLM on EKS/AKS/GKE with HPA, Ray Serve, SkyPilot). Buys: full control over scale-out behavior, potentially lower cost at high steady-state load. Costs: GPU startup time is measured in minutes (image pull, CUDA initialization, model load, warmup) so reactive autoscaling lags; predictive autoscaling with overhead capacity smooths the spike but adds idle cost. The architect plans for the startup cost explicitly, often with a warm pool of pre-initialized but idle pods.3
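The warm-pool argument reduces to simple arithmetic: if pods take longer to start than the burst takes to arrive, reactive scaling alone leaves a capacity gap, and the gap dictates the pool size. A sketch with illustrative numbers and function name:

```python
import math

# Burst-vs-startup arithmetic behind warm pools. If GPU pod startup
# (image pull, CUDA init, model load, warmup) exceeds the burst ramp,
# reactive autoscaling cannot close the gap in time; the gap must be
# covered by pre-initialized idle replicas. All inputs are illustrative.

def warm_pool_size(burst_qps: float, baseline_qps: float,
                   qps_per_replica: float,
                   startup_s: float, burst_ramp_s: float) -> int:
    """Replicas to keep warm so a burst arriving faster than pods can
    start is absorbed. If startup beats the ramp, reactive scaling
    suffices and the pool can be empty."""
    if startup_s <= burst_ramp_s:
        return 0
    gap_qps = burst_qps - baseline_qps
    return math.ceil(gap_qps / qps_per_replica)
```

The idle cost of the pool is the explicit price of the minutes-long GPU startup; the architect trades it against the degradation the burst would otherwise cause.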

Autoscaling at the AI layer composes with autoscaling at lower layers (orchestration service, vector store, object storage). The architect ensures the slowest-to-scale component does not become the bottleneck; a vector store that cannot match the model’s scale capacity starves the system even when the model path is warm.

Queueing and graceful degradation

When demand exceeds capacity, queueing is the mechanism by which the system preserves a predictable response for requests it can still handle. A synchronous API that accepts every request and tries to serve them all becomes uniformly slow for everyone as load exceeds capacity; a queued API returns fast-path responses for the priority traffic and controlled-delay responses for the rest.

Graceful degradation is the architect’s choice of what to give up first. A chatbot under load may: route simple queries to a smaller model (model routing from Article 9); return streaming responses without reranking (dropping a retrieval-quality optimization to save latency); shorten system prompts; disable tool calls; or return cached responses for query patterns that match. Each degradation is an explicit fallback path the architect designs and tests, not an accidental behavior under pressure.
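An explicit degradation ladder can be expressed as data the serving path consults, which is what makes it designed and testable rather than accidental. Tier names and thresholds here are illustrative, not a prescription:

```python
# A designed degradation ladder: under rising load the system sheds
# optional work in a fixed, tested order instead of slowing uniformly.
# Thresholds and feature names are illustrative.

DEGRADATION_LADDER = [
    # (utilization ceiling, features still enabled at that load)
    (0.70, {"rerank", "tools", "large_model"}),   # normal operation
    (0.85, {"tools", "large_model"}),             # drop reranking first
    (0.95, {"large_model"}),                      # then drop tool calls
    (1.00, set()),                                # small model, cache-first
]

def active_features(utilization: float) -> set:
    """Return the feature set the serving path should honor right now."""
    for ceiling, features in DEGRADATION_LADDER:
        if utilization <= ceiling:
            return features
    return set()   # beyond nominal capacity: maximum degradation
```

Because the ladder is data, each tier can be exercised in a load test before it is ever triggered in production.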

The reference pattern

A balanced reference pattern for a moderate-scale enterprise AI feature has the following composition.

Gateway (API gateway or ingress controller) applies authentication, rate limiting, and tenant-scoping at layer 7. Prompt assembly runs in the orchestration service with prompt caching for stable prefixes. Retrieval runs against a hybrid index (dense + sparse) with pre-filter and a cross-encoder reranker as a second stage. Model calls route through a routing layer that sends easy queries to a small model and escalates the rest to a larger model; sensitive queries skip routing and go directly to the enterprise-grade model. The model layer runs on cloud-AI-platform endpoints with provisioned capacity for baseline load and serverless overflow for burst. The response is streamed to the client with per-token logging to observability. Evaluation runs asynchronously on a sample of responses and feeds quality signals back to the dashboard.
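The routing layer in the pattern can be sketched as a small decision function. Model names and the difficulty signal are placeholders for the learner's own classifier and deployment:

```python
# Sketch of the routing layer described above: easy queries take the
# cheap fast path, sensitive queries bypass cost routing entirely, and
# everything else escalates. Names and the 0.3 threshold are placeholders.

def route(query: str, is_sensitive: bool, difficulty: float) -> str:
    """Pick a model tier for a query; `difficulty` comes from an
    upstream classifier scoring the query in [0, 1]."""
    if is_sensitive:
        return "enterprise-model"     # compliance path skips cost routing
    if difficulty < 0.3:
        return "small-model"          # cheap fast path for easy queries
    return "large-model"              # default escalation
```

The sensitive-query branch coming first is the important design choice: compliance routing must not be reachable only when the cost router happens to agree.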

The pattern targets p50 TTFT under 500 ms, p95 TTFT under 1.5 seconds, per-query cost under $0.02 at the current traffic mix, and 10x burst handling with under 30 seconds of degradation. A production system in 2026 that hits this envelope, on any of the three cloud AI platforms and with any of the major managed-API or self-hosted model families, is not a laboratory exercise; multiple publicly discussed enterprise deployments demonstrate it.

[DIAGRAM: TimelineDiagram — aite-sat-article-17-streaming-response-timeline — Horizontal timeline of a single streaming response. t=0: “Request submitted at client”. t=20ms: “Gateway auth and route”. t=60ms: “Orchestration begins prompt assembly (with cache lookup)”. t=120ms: “Retrieval call (dense + sparse, parallel)”. t=200ms: “Retrieved chunks available”. t=240ms: “Reranker top-5 produced”. t=260ms: “Prompt sent to model (cache hit on system prefix)”. t=420ms: “First token arrives” (TTFT marked). t=420ms-4200ms: “Stream continues at ~25ms/token ITL” (ITL marked). t=4200ms: “Final token, end-of-stream”. t=4220ms: “Response rendering complete at client”. Below the timeline, optimization contributions are annotated (cache saved 80ms on TTFT, reranker added 40ms but improved precision, routing saved 30% cost by using medium model).]

Three real-world examples

Character.AI serving. Character.AI has published engineering blog posts describing their serving architecture for extreme-scale low-latency chat, including custom-kernel work, aggressive quantization, and architectural simplifications that trade a small amount of quality for dramatic latency and cost wins at their scale of billions of messages.4 The architectural point for the AITE-SAT learner is that at sufficient scale, previously-marginal optimizations become strategic, and the architect plans for scale-dependent architecture changes from the start rather than assuming the pilot-phase design extends indefinitely.

Groq LPU benchmarks. Groq’s Language Processing Unit hardware publishes public benchmarks showing sub-50 ms TTFT on small and medium models at speeds that exceed typical GPU-based serving by an order of magnitude.5 The architectural point is that hardware choice is an option the architect should evaluate when latency is the dominant trilemma axis; specialized accelerators (Groq, AWS Trainium/Inferentia, Intel Gaudi) are available for workloads where the latency budget is tight enough to justify the hardware-selection decision.

Databricks DBRX Mosaic Inference benchmarks. Databricks publishes throughput and latency benchmarks for DBRX and other open-weight models running on Mosaic AI Inference with continuous batching and optimized kernels.6 The architectural point is that self-hosted open-weight serving can match managed-API performance in the ranges most enterprises operate in, and the published benchmarks let the architect verify the claim rather than relying on vendor assertion.

Regulatory alignment

Latency, cost, and scalability are not regulatory metrics per se, but their architecture intersects with regulation at two points. EU AI Act Article 15 on accuracy, robustness, and cybersecurity requires that high-risk systems behave within specification under expected load; an architecture that cannot sustain peak load fails the robustness requirement regardless of its quality on the golden set.7 ISO/IEC 42001 Clause 9.1 on monitoring and measurement expects continuous monitoring of operational metrics, which the observability stack from Article 13 provides for latency and scalability; cost monitoring is often separate but equally expected under financial governance frameworks. The architect documents the SLA target, the measured performance, and the degradation plan as part of the evidence pack for high-risk deployments.

Summary

The latency-cost-scalability trilemma is the non-functional frame every enterprise AI architecture must navigate. Latency decomposes into TTFT, ITL, and total response; each responds to different techniques. Five techniques — prompt caching, speculative decoding, continuous batching, quantization, output shaping — cover the optimization surface. Autoscaling patterns differ between managed APIs, cloud-platform serverless, and self-hosted deployments; GPU startup time is the dominant constraint on self-hosted autoscaling. Graceful degradation is the architect’s choice of what to give up under load, not an accidental behavior. A reference pattern targeting p50 TTFT under 500 ms, per-query cost under $0.02, and 10x burst handling is achievable in 2026 with careful composition. Character.AI’s serving, Groq’s LPU benchmarks, and Databricks Mosaic Inference are three public references spanning the spectrum from extreme-scale optimization to hardware specialization to open-weight serving parity. Regulatory alignment with EU AI Act Article 15 and ISO 42001 Clause 9.1 depends on the performance architecture being documented, measured, and resilient under expected load.

Further reading in the Core Stream: AI Cost Management and FinOps and Enterprise AI Deployment Patterns.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Anthropic prompt caching documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching — accessed 2026-04-20. OpenAI prompt caching documentation. https://platform.openai.com/docs/guides/prompt-caching — accessed 2026-04-20. vLLM automatic prefix caching. https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html — accessed 2026-04-20.

  2. Yaniv Leviathan et al., “Fast Inference from Transformers via Speculative Decoding,” ICML 2023 (arXiv 2211.17192). https://arxiv.org/abs/2211.17192 — accessed 2026-04-20. vLLM speculative decoding documentation. https://docs.vllm.ai/en/latest/usage/spec_decode.html — accessed 2026-04-20.

  3. KServe documentation. https://kserve.github.io/ — accessed 2026-04-20. Ray Serve documentation. https://docs.ray.io/en/latest/serve/index.html — accessed 2026-04-20.

  4. Character.AI engineering blog posts on serving optimization. https://research.character.ai/ — accessed 2026-04-20.

  5. Groq Language Processing Unit benchmarks. https://groq.com/lpu-inference-engine/ — accessed 2026-04-20.

  6. Databricks Mosaic AI Inference benchmarks. https://www.databricks.com/blog/databricks-mosaic-ai-agent-framework-and-vector-search-bring-enterprise-ready-rag — accessed 2026-04-20.

  7. Regulation (EU) 2024/1689, Article 15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clause 9.1. https://www.iso.org/standard/81230.html — accessed 2026-04-20.