AITE M1.1-Art09 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation

Inference Cost Architecture: Caching, Routing, and Distillation



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 9 of 35


Early enterprise AI budgets allocated most of the spend to training. Later enterprise AI budgets allocated most of the spend to inference, and the ratio has kept shifting. A frontier-class chatbot running tens of millions of queries a month accumulates inference cost at a pace that dwarfs the one-time or periodic fine-tune spend the team planned for. A single careless prompt template that adds three hundred unnecessary tokens to every call multiplies by monthly query volume into a six-figure annual overage. Inference cost is not an ops concern the team discovers after launch; it is an architecture concern the team designs for from the first sprint. This article gives the AITE-SAT learner the three levers — caching, routing, and distillation — and the FinOps governance patterns that decide when each lever is pulled.

Where the cost actually lives

A managed-API inference cost decomposes into four components: input tokens, output tokens, optional cached-prompt read, and any tool or retrieval calls that precede the model call. A self-hosted inference cost decomposes into GPU-hours amortized across throughput, engineering time to operate the fleet, and underlying data-center or cloud-provider costs for the hardware. A cloud-AI-platform cost sits between the two, with per-token pricing that tracks managed APIs plus the hyperscaler’s platform fees.

Output tokens are typically the most expensive component on managed APIs — often 3 to 5 times the per-token cost of input. A response that is 200 tokens shorter saves more than a prompt that is 200 tokens shorter. This asymmetry is invisible in early prototypes because the token counts are small; it becomes dominant in production because the volume is large. The architect reasons about both sides of the token economy.

Retrieval calls, tool calls, and embedding refreshes add subsidiary costs. An agent loop with five tool calls pays five orchestration-layer overheads plus five model calls; a RAG query pays one embedding call plus one index query plus one model call with retrieved context. Each sub-component has its own unit economics, and the sum is the per-query cost. The architect who tracks only the headline model cost misses half the spend.
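The decomposition above can be made concrete as a back-of-envelope cost model. Every price and overhead below is an illustrative placeholder, not any provider's actual rate:

```python
# Illustrative per-query cost model. All prices are placeholder
# assumptions, not any provider's actual rates.
PRICE_PER_1K = {
    "input": 0.003,         # $/1K input tokens
    "output": 0.015,        # $/1K output tokens (note the ~5x asymmetry)
    "cached_input": 0.0003, # $/1K cached-prefix read tokens
    "embedding": 0.0001,    # $/1K embedding tokens
}

def query_cost(input_tokens, output_tokens, cached_tokens=0,
               embedding_tokens=0, tool_calls=0, tool_call_cost=0.0005):
    """Sum the sub-components of a single query's inference cost."""
    return (
        input_tokens / 1000 * PRICE_PER_1K["input"]
        + output_tokens / 1000 * PRICE_PER_1K["output"]
        + cached_tokens / 1000 * PRICE_PER_1K["cached_input"]
        + embedding_tokens / 1000 * PRICE_PER_1K["embedding"]
        + tool_calls * tool_call_cost
    )

# A RAG query: one embedding of the 150-token user turn, 3,200 tokens of
# retrieved context plus the turn as input, and a 400-token answer.
rag = query_cost(input_tokens=3350, output_tokens=400, embedding_tokens=150)

# An agent loop: five model calls of 1,000 in / 200 out each,
# plus five tool-call overheads.
agent = sum(query_cost(1000, 200, tool_calls=1) for _ in range(5))
```

Running both shows the agent loop costing roughly twice the RAG query under these placeholder rates, even though each individual agent call is smaller: the loop depth, not the call size, drives the bill.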

The three levers

Lever 1 — Caching

Caching avoids the work rather than making the work cheaper. Three caching patterns each capture different redundancy in the workload.

Prompt caching (or “prefix caching”) stores the model’s internal representation of a prompt prefix so that subsequent requests with the same prefix skip the prefill stage and jump straight to decode. Anthropic’s prompt caching, OpenAI’s prompt caching for the Chat Completions API, Google’s context caching for Gemini, and vLLM’s prefix caching for self-hosted deployments implement the same pattern with different commercial wrappers.1 Prompt caching is extremely effective when the prompt has a long stable prefix (system prompt, retrieval context, tool schemas) and a short variable suffix (user turn). Cost savings of 50–90% on input tokens are common when the pattern fits. The architect designs prompt templates with the cacheable prefix separated from the variable user content so that caching can be enabled in production without reshaping the template.
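A minimal sketch of a template shaped for prefix caching. The message structure is provider-neutral, and the cache-boundary comment stands in for whatever mechanism the chosen API uses (an explicit cache-control marker, or automatic prefix detection):

```python
# Shape the prompt so the stable content (system prompt, tool schemas,
# retrieval context) comes first and the variable user turn comes last.
STABLE_SYSTEM_PROMPT = "You are a support assistant for the product line."
STABLE_TOOL_SCHEMAS = '[{"name": "lookup_order", "parameters": "..."}]'

def build_messages(retrieved_context, user_turn):
    """Stable content first, variable content last, so a prefix cache
    can reuse everything up to the cache boundary across requests."""
    return [
        {"role": "system", "content": STABLE_SYSTEM_PROMPT},
        {"role": "system", "content": STABLE_TOOL_SCHEMAS},
        # Retrieval context is stable only within a session; whether it
        # sits before or after the boundary depends on its reuse rate.
        {"role": "system", "content": retrieved_context},
        # --- cache boundary: everything above is the cacheable prefix ---
        {"role": "user", "content": user_turn},
    ]

msgs = build_messages("Order #123 shipped 2026-04-01.", "Where is my order?")
```

A template that interleaved the user turn among the stable blocks would invalidate the prefix on every request; keeping the variable suffix last is what makes the cache hit possible.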

Exact-match response caching stores completed responses keyed by prompt hash. When the same prompt arrives again, the cached response is returned without calling the model. Exact-match caching suits deterministic-prompt use cases — structured transformations, scheduled batch jobs, common FAQs — where the prompt distribution has a heavy head. It does not suit conversational use cases where every prompt is unique after the first exchange.
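An exact-match cache is a few lines once the key is a hash of the normalized prompt. This sketch uses an in-memory dict; a production version would sit in a shared store such as Redis with a TTL:

```python
import hashlib

class ExactMatchCache:
    """Response cache keyed by a hash of the normalized prompt."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize case and whitespace so trivially different
        # spellings of the same prompt collide on the same key.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, prompt):
        return self._store.get(self._key(prompt))

    def put(self, prompt, response):
        self._store[self._key(prompt)] = response

cache = ExactMatchCache()
cache.put("What is your refund policy?", "Refunds within 30 days.")
hit = cache.get("what is  your refund policy?")   # normalization -> hit
miss = cache.get("How do I reset my password?")   # None
```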

Semantic caching stores completed responses keyed by the query’s embedding and returns a cached response when a new query’s embedding is sufficiently similar to a stored one. Semantic caching trades strictness for coverage — a new query never seen before can be answered from the cache if it is semantically close to one that was. The architect accepts the correctness risk that the cached response may not precisely match the paraphrased query and designs the similarity threshold to balance cache hit rate against semantic drift. Redis’s vector extensions, GPTCache, and LangChain’s semantic cache implementations make the pattern accessible.2
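A minimal semantic cache, with a toy bag-of-words embedding standing in for a real embedding model and a cosine-similarity threshold playing the hit-rate-versus-drift trade-off described above:

```python
import math

VOCAB = ["refund", "policy", "reset", "password", "order"]

def toy_embed(text):
    # Stand-in for a real embedding model: bag-of-words over a tiny vocab.
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Cache keyed by query embedding; hit when similarity >= threshold."""
    def __init__(self, embed_fn, threshold=0.92):
        self.embed_fn = embed_fn
        self.threshold = threshold  # tune: hit rate vs semantic drift
        self._entries = []          # list of (embedding, response)

    def get(self, query):
        q = self.embed_fn(query)
        best = max(self._entries, key=lambda e: cosine(q, e[0]), default=None)
        if best and cosine(q, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, response):
        self._entries.append((self.embed_fn(query), response))

sem = SemanticCache(toy_embed)
sem.put("what is the refund policy", "Refunds within 30 days.")
hit = sem.get("refund policy details")          # semantically close -> hit
miss = sem.get("how do i reset my password")    # distant -> None
```

Lowering the threshold raises the hit rate and the risk that a cached answer no longer matches the paraphrased question; the right value is found empirically against the golden set, not chosen a priori.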

Lever 2 — Model routing

Model routing sends each query to the cheapest model capable of answering it. The architecture has a router (rule-based, classifier-based, or LLM-based) and a pool of models (often one small, one medium, one frontier). Easy queries go to the small model; hard queries go to the frontier model; medium queries go to the medium model. A well-tuned router can shift 40–70% of traffic to cheaper tiers without measurable quality regression on the golden set.

Rule-based routing is the simplest: a deterministic rule decides the route based on query metadata (document length, tool-use requirement, user tier). It is transparent and cheap but inflexible. Classifier-based routing trains a small classifier to predict which model’s output is adequate. LLM-based routing uses a small model as the first pass and escalates to a larger model when the small model flags uncertainty (sometimes called “cascading” or “waterfall” routing). Mixture-of-experts routing, implemented at training time in models like Mixtral and DBRX, is routing internalized into the model architecture rather than externalized across models.
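The rule-based and cascade patterns can each be sketched in a few lines. The thresholds, tier names, and the `call_model`/`is_uncertain` stubs are illustrative assumptions, not a prescribed design:

```python
# Minimal rule-based router plus a cascade fallback.
def route(query: str, needs_tools: bool) -> str:
    """Deterministic first-pass routing on query metadata."""
    if needs_tools or len(query) > 2000:
        return "frontier"
    if len(query) > 500:
        return "medium"
    return "small"

def cascade(query, call_model, is_uncertain):
    """Try the cheap model first; escalate when it flags uncertainty."""
    answer = call_model("small", query)
    if is_uncertain(answer):
        answer = call_model("frontier", query)
    return answer
```

The cascade pays double latency and double cost on every escalated query, so it only wins when the small model handles the large majority of traffic confidently.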

Routing requires evaluation discipline. The architect evaluates the routed-traffic output against the frontier-only output on the golden set and tracks the rate of escalations, the rate of regressions, and the rate of quality drops below threshold. If the router degrades quality by more than a defined tolerance, the router is retuned or retired. Unbounded cost reduction at unbounded quality risk is not an architecture — it is a quarterly bonus waiting to be clawed back after an incident.
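A sketch of that evaluation loop, reduced to the regression-rate check for brevity. The `routed_fn`, `frontier_fn`, and `judge_fn` callables are placeholders for the team's own harness:

```python
def routing_report(golden_set, routed_fn, frontier_fn, judge_fn):
    """Score routed traffic against frontier-only output on the golden set.
    judge_fn returns True when the routed answer is at least as good."""
    regressions = 0
    for item in golden_set:
        routed = routed_fn(item)
        frontier = frontier_fn(item)
        if not judge_fn(item, routed, frontier):
            regressions += 1
    return regressions / len(golden_set)

TOLERANCE = 0.02  # retune or retire the router above this rate
```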

Lever 3 — Distillation

Distillation is the training-time lever that produces a smaller student model matching the quality of a larger teacher model on a specific task distribution. The student is cheaper to run per call, which shifts the cost curve permanently for that workload. Distillation suits tasks with narrow distributions — customer-intent classification, structured extraction, summarization with a fixed rubric — where the teacher’s general capability exceeds the task’s specific requirements.

OpenAI’s distillation-friendly workflow (supervised fine-tuning and reinforcement fine-tuning via their API) and Anthropic’s reference patterns for smaller Claude models document the process from the managed-API side. Meta’s Llama distillation work (including the Llama 3 technical report) and Hugging Face’s distillation lineage (DistilBERT, DistilGPT2) document the open-weight side.3 The architect considers distillation when the workload is narrow, the volume justifies the training cost, and the evaluation discipline is strong enough to ensure the student holds quality.

Distillation is not a first move. An architect who cannot demonstrate the workload has saturated caching and routing has no basis for choosing distillation. Distillation is expensive in engineering time, in teacher inference cost during training, and in ongoing evaluation burden to catch regressions as the distribution drifts.
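When distillation is justified, the first artifact is a teacher-labeled dataset over the narrow task distribution. A minimal sketch, assuming a JSONL supervised fine-tuning format and a `teacher_fn` stub in place of real teacher inference:

```python
import json

def build_distillation_set(prompts, teacher_fn, out_path):
    """Label a narrow task distribution with the teacher's outputs to
    produce supervised fine-tuning data for the student (JSONL)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": teacher_fn(prompt)}
            f.write(json.dumps(record) + "\n")
```

The teacher inference to produce this dataset is itself a real cost line: labeling a hundred thousand prompts with a frontier model is paid before the student saves its first cent.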

[DIAGRAM: StageGateFlow — aite-sat-article-9-cost-pipeline — Left-to-right flow of a single query: “User query arrives” → “Exact-match cache lookup” (hit → return cached, miss → continue) → “Semantic cache lookup” (hit → return cached, miss → continue) → “Router evaluates query” → split into three lanes: “Small model lane”, “Medium model lane”, “Frontier model lane” → “Model response” → “Response cache write” → “Return to user”. Annotations show per-stage expected hit rates and per-stage latency/cost contributions.]

The FinOps overlay

Levers without governance produce a system that is cheap at launch and expensive within six months as caches churn, routers decay, and distilled students drift. FinOps for AI is the discipline of measuring, attributing, and governing inference cost the same way cloud costs are managed. Three patterns matter most.

Per-tenant and per-user attribution. The architecture tags every inference call with the tenant and user on whose behalf the call was made. The billing-system or FinOps dashboard aggregates by tag and exposes the per-tenant cost curve. Without attribution, a single heavy tenant’s runaway spend is invisible until the month-end invoice; with attribution, the alert fires when the tenant’s weekly cost exceeds its expected band.
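A minimal attribution ledger, with weekly per-tenant aggregation and band-based alerting. The band values and the weekly window are illustrative choices:

```python
from collections import defaultdict

class CostLedger:
    """Tag every inference call with tenant/user and aggregate by tag."""
    def __init__(self, expected_weekly_band):
        self.expected = expected_weekly_band  # tenant -> (low, high) in USD
        self.weekly = defaultdict(float)

    def record(self, tenant: str, user: str, cost_usd: float):
        self.weekly[tenant] += cost_usd

    def alerts(self):
        """Tenants whose weekly spend left its expected band."""
        out = []
        for tenant, spend in self.weekly.items():
            low, high = self.expected.get(tenant, (0.0, float("inf")))
            if not (low <= spend <= high):
                out.append((tenant, spend))
        return out

ledger = CostLedger({"acme": (0.0, 10.0)})
ledger.record("acme", "u1", 12.0)
ledger.record("beta", "u2", 1.0)
```

In production the tag travels in the request metadata and the aggregation lives in the FinOps dashboard; the point of the sketch is that the alert fires weekly, not at month-end.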

Per-user cost caps with graceful degradation. The architecture caps per-user spend at a configurable threshold. When a user hits the cap, the system degrades gracefully — smaller model, shorter responses, rate-limited interactions, or a queue for human review — rather than failing outright. The cap protects the platform from abuse and protects the unit economics when viral growth outpaces pricing. Lab 4 in the memo walks the learner through building this.
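A graceful-degradation policy can be as simple as a tier-selection function keyed on spend versus cap. The 80% soft threshold and the tier parameters here are illustrative, not prescriptive:

```python
def select_tier(user_spend_usd: float, cap_usd: float) -> dict:
    """Degrade gracefully as a user approaches and passes the cap."""
    if user_spend_usd < 0.8 * cap_usd:
        return {"model": "frontier", "max_output_tokens": 1024}
    if user_spend_usd < cap_usd:
        # Soft zone: cheaper model, shorter responses.
        return {"model": "small", "max_output_tokens": 512}
    # Over cap: rate-limit rather than fail outright.
    return {"model": "small", "max_output_tokens": 256, "rate_limited": True}
```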

Budget gates in the deployment pipeline. A new model version or prompt template is evaluated for cost impact before promotion to production. A prompt refactor that increases per-call cost by 30% does not ship until the architect approves the cost delta against the quality delta. This is the pattern that prevents routine deployments from accumulating cost debt.
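A budget gate reduces to a predicate evaluated in CI before promotion. The tolerance values here are illustrative; each team sets its own:

```python
def budget_gate(baseline_cost, candidate_cost,
                baseline_quality, candidate_quality,
                max_cost_regression=0.10, min_quality_gain=0.01):
    """Pass when the cost delta is within tolerance, or when a larger
    cost increase buys a measurable quality gain."""
    cost_delta = (candidate_cost - baseline_cost) / baseline_cost
    if cost_delta <= max_cost_regression:
        return True
    return (candidate_quality - baseline_quality) >= min_quality_gain
```

Under these settings a 30% per-call cost increase blocks the deployment unless the candidate also shows a quality gain, which is exactly the approval conversation the gate is meant to force.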

[DIAGRAM: ScoreboardDiagram — aite-sat-article-9-cost-per-query-breakdown — Stacked horizontal bar chart showing the components of a representative per-query cost (total shown as 100%). Segments: “System prompt (cached)” (small), “User turn input tokens” (medium), “Retrieval context input tokens” (medium), “Tool schema input tokens” (small), “Tool call overhead” (small), “Output tokens” (large), “Response cache write” (tiny). A second bar below shows the same query after optimization (prompt caching + routing to small model + tightened output): dramatic reduction in the system prompt and output segments, modest reduction in input tokens. Annotations on each segment cite the optimization that reduced it.]

Two real-world examples

GitHub Copilot’s cost architecture. GitHub has spoken publicly about the cost levers in Copilot’s architecture, including caching of common completion patterns, fallback routing to smaller models for cheaper completions, and continual evaluation of quality-versus-cost trade-offs at scale.4 The architectural point is that Copilot operates at a scale where single-percentage-point improvements in per-query cost translate into material margin. The architect reading Copilot’s public material sees that at high volume, every lever is worth pulling; at low volume, only the levers with a favorable engineering-to-savings ratio justify the investment.

Anthropic prompt caching. Anthropic’s public documentation for prompt caching describes the mechanism, the pricing (cached reads at roughly 10% of regular input token price for most models), the TTL, and the API changes required to use it.1 The architectural point is that prompt caching is not a silent optimization the provider applies — it is a feature the architect must design prompts for. A prompt template that interleaves variable and stable content cannot be cached effectively; a template that places the stable prefix first and the variable user turn at the end can capture the full caching benefit. The architect reading Anthropic’s documentation learns to shape templates around caching boundaries from day one.

The hidden costs that catch teams out

Three cost categories routinely surprise teams who budget only for headline model-call costs.

Agent-loop amplification. An agentic workflow that calls the model five or six times per user request multiplies the per-call cost by the loop depth. A query that looks like a single-call interaction to the user can be a five-call interaction to the billing system. The architect measures agent-loop depth on the golden set and adds it to the per-query cost calculation rather than assuming one user query equals one model call.

Retrieval-side token inflation. A RAG pipeline that retrieves eight chunks of 400 tokens each and stuffs them into the prompt contributes 3,200 input tokens per query on top of the user’s turn. When the corpus grows and the top-k is raised to maintain recall, the per-query input cost grows with it. The architect measures retrieval-context size over time and either tightens chunking (Article 5) or tightens reranking (Article 6) when the inflation exceeds the budget envelope.

Evaluation and experimentation overhead. A production-quality evaluation harness runs the system against a golden set on every commit plus a sampling of live traffic continuously. The evaluation’s own model calls are a real line item — often 10–30% of total inference spend for teams that evaluate rigorously. The architect budgets evaluation spend as a separate category, because cutting it to save cost compromises the harness that protects quality.

Experiment cohort cost. A/B experiments run two versions in parallel, often against a larger-than-usual traffic slice to hit statistical power within a reasonable window. The parallel-version cost and the longer-than-normal run duration can add a significant surcharge above steady-state spend during experimentation periods. The architect plans experimentation capacity into the budget rather than apologizing for it after the fact.

Regulatory alignment

Cost optimization is not a compliance topic in itself, but it interacts with compliance obligations. Routing traffic to different models means the traffic crosses different model-risk profiles; the architect documents which data classes can flow to which models and excludes sensitive data from cheap third-party models that may not meet the residency or processing constraints. Semantic caching means that a query’s response may have been generated for a different user; privacy-sensitive use cases cannot tolerate cross-user cache pollution, and the architect scopes the cache per user or per tenant. EU AI Act Article 12 on record-keeping expects that the chain from user query to model response to user-facing output is reconstructable, which means cache hits are logged distinctly from fresh inferences so auditors can trace both.

Summary

Inference cost is where enterprise AI lives once it ships. The three levers — caching (prompt, exact-match, semantic), routing (rule, classifier, LLM-based cascade), and distillation (training-time student production for narrow tasks) — cover the architectural moves. The FinOps overlay — per-tenant attribution, per-user caps with graceful degradation, budget gates in the deployment pipeline — prevents the levers from drifting out of governance. GitHub Copilot’s public cost architecture and Anthropic’s prompt caching documentation show the pattern applied in production at scale. The architect evaluates every lever against the golden set, every deployment against the cost baseline, and every quarter against the budget.

Further reading in the Core Stream: AI Cost Management and FinOps and The Technology Architecture Roadmap.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Anthropic prompt caching documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching — accessed 2026-04-20. OpenAI prompt caching documentation. https://platform.openai.com/docs/guides/prompt-caching — accessed 2026-04-20. Google Gemini context caching documentation. https://ai.google.dev/gemini-api/docs/caching — accessed 2026-04-20. vLLM automatic prefix caching. https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html — accessed 2026-04-20.

  2. GPTCache project. https://github.com/zilliztech/GPTCache — accessed 2026-04-20. LangChain semantic cache documentation. https://python.langchain.com/docs/integrations/llm_caching/ — accessed 2026-04-20.

  3. Meta Llama 3 Technical Report (2024). https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ — accessed 2026-04-20. Hugging Face distillation references (DistilBERT, DistilGPT). https://huggingface.co/docs/transformers/model_doc/distilbert — accessed 2026-04-20.

  4. GitHub Engineering blog and public architecture discussions for Copilot. https://github.blog/category/ai-and-ml/ — accessed 2026-04-20.