FinOps for AI matters because AI run-cost now exceeds build-cost in most GenAI programs — a reversal of the historic pattern where training dominated. Token-based inference pricing scales non-linearly with usage; context windows grow with conversation length; retrieval hops and tool calls compound the per-request cost. The cloud FinOps playbook assumes consumption-proportional cost; the AI FinOps playbook must handle cost curves that accelerate with scale.
This article teaches the three FinOps phases applied to AI, the specific practices (prompt caching, model-tier routing, context trimming) that produce the largest cost reductions, and the organizational structure — cross-functional team, budget authority, escalation pathway — that sustains FinOps over time.
The three phases
Phase 1 — Inform
Inform is cost visibility: who is spending, on what, at what rate. For cloud FinOps, the answer comes from billing exports tagged to resources. For AI FinOps, the answer requires three additional dimensions.
Token-level accounting. Every inference request consumes input and output tokens at model-specific prices. Cost must be attributable to the feature, the user cohort, and the prompt pattern that generated it. OpenCost (open-source), AWS CUR with tag-based allocation, and commercial AI observability platforms (Langfuse, Arize) all support token-level accounting.
Retrieval cost. RAG systems make multiple retrieval calls per request. Each retrieval has an embedding cost and a vector-store query cost. Without instrumentation, retrieval cost is hidden inside the platform bill.
Tool-call cost. Agentic systems make tool calls that each have compute, API, and potentially external-service costs. These accumulate quickly in production.
Inform-phase output is the AI cost allocation report: cost per feature, cost per model tier, cost per user cohort, cost per 1K successful outcomes, trend lines over time. A feature team that cannot produce this report has not completed Phase 1.
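The decomposition behind that report can be sketched as a small roll-up. This is an illustrative sketch, not any particular platform's API: the rates, unit costs, and record fields (`feature`, `tier`, `in_tok`, `out_tok`) are hypothetical placeholders for whatever your billing export and instrumentation actually emit.

```python
from collections import defaultdict

# Hypothetical per-1M-token rates; real rates are model- and provider-specific.
RATES = {
    "small":    {"input": 0.25, "output": 1.25},
    "frontier": {"input": 3.00, "output": 15.00},
}

def request_cost(model_tier, input_tokens, output_tokens,
                 retrieval_calls=0, tool_calls=0,
                 retrieval_unit_cost=0.0004, tool_unit_cost=0.002):
    """Decompose one request's cost: tokens + retrieval hops + tool calls."""
    r = RATES[model_tier]
    token_cost = (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000
    return token_cost + retrieval_calls * retrieval_unit_cost + tool_calls * tool_unit_cost

def allocation_report(requests):
    """Roll request-level costs up to one Inform-phase dimension (per feature)."""
    by_feature = defaultdict(float)
    for req in requests:
        by_feature[req["feature"]] += request_cost(
            req["tier"], req["in_tok"], req["out_tok"],
            req.get("retrievals", 0), req.get("tools", 0))
    return dict(by_feature)
```

The same roll-up keyed by user cohort, model tier, or prompt pattern yields the other report dimensions.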
Phase 2 — Optimize
Optimize is the engineering phase. Four practice categories produce outsized cost reductions.
Prompt caching. When many requests share a common prefix — system prompt, persona instructions, few-shot examples — cached prefixes reduce the input token charge by 40–90%. Provider-side prompt caching (Anthropic, OpenAI, Google), client-side caching in orchestration frameworks (LangChain, LlamaIndex), and custom caching on self-hosted models all apply.
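The expected savings from prefix caching can be estimated before adopting it. A minimal sketch, assuming an illustrative input rate, an illustrative cached-token discount (providers often discount cached tokens by roughly 90%), and a known cache hit rate:

```python
def cached_input_cost(prefix_tokens, suffix_tokens, hit_rate,
                      input_rate=3.0, cached_rate=0.3):
    """Expected per-request input cost (USD) with prefix caching.

    Rates are illustrative, per 1M tokens. On a hit, only the suffix is
    billed at the full input rate; on a miss, everything is.
    """
    hit  = prefix_tokens * cached_rate + suffix_tokens * input_rate
    miss = (prefix_tokens + suffix_tokens) * input_rate
    return (hit_rate * hit + (1 - hit_rate) * miss) / 1_000_000
```

For a 5,000-token shared prefix, a 500-token per-request suffix, and a 90% hit rate, the expected input cost drops from about $0.0165 to about $0.00435 per request under these assumed rates, roughly a 74% input-token reduction.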
Model-tier routing. Not every request needs the most capable model. A classification request needs a small model; a complex reasoning request needs a capable one. Tier-routing uses a cheap classifier or heuristic to route each request to the minimum sufficient model. Published case studies report 3–10x cost reductions from aggressive tier-routing.
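The routing idea fits in a few lines. The complexity estimator shown here is a deliberately naive placeholder (length plus keywords); in practice it would be a small classifier model tuned on labeled traffic, and the tier names are hypothetical:

```python
def route(request_text, classify, cheap="small", expensive="frontier"):
    """Route each request to the minimum sufficient model tier.

    `classify` is any cheap complexity estimator returning a score in [0, 1].
    """
    return expensive if classify(request_text) > 0.5 else cheap

def naive_complexity(text):
    """Illustrative heuristic: long requests or reasoning keywords
    go to the capable model; everything else stays on the cheap tier."""
    keywords = ("prove", "analyze", "multi-step", "plan")
    return 1.0 if len(text) > 500 or any(k in text.lower() for k in keywords) else 0.0
```

The classifier's quality is the whole game: a router that over-escalates erodes the savings, as the next section notes.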
Context trimming. Long context windows cost more. Conversations accumulate history that often has low marginal relevance to the current turn. Context trimming — summarization, selective inclusion, retrieval-based windowing — keeps context compact without degrading outcomes.
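A selective-inclusion trimmer can be sketched as follows; the summarization step is elided (a real implementation would replace dropped turns with an actual model-generated summary rather than a placeholder string):

```python
def trim_context(messages, max_turns=6):
    """Keep the system prompt and the most recent turns; stand in a
    summary marker for older turns (summarization itself omitted)."""
    system = [m for m in messages if m["role"] == "system"]
    turns  = [m for m in messages if m["role"] != "system"]
    kept = turns[-max_turns:]
    dropped = len(turns) - len(kept)
    if dropped:
        kept = [{"role": "system",
                 "content": f"[summary of {dropped} earlier turns]"}] + kept
    return system + kept
```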
Model-class substitution. For narrow tasks where fine-tuned small models perform comparably to large models, substitution reduces cost by 10–50x. Substitution requires capability evaluation (Article 24) to confirm non-degradation.
Phase 3 — Operate
Operate is the governance phase. Optimizations that are not operationalized decay; cost discipline requires sustained organizational commitment.
Budget ownership. Each feature has a named budget owner — typically the feature lead — who is accountable for the cost trajectory. Budget ownership is documented in the feature charter and reviewed at each stage-gate (Article 31).
Alert and throttle policies. Compute budgets (Article 29) translate Operate-phase commitments into enforced ceilings. An alert fires at 80% of budget; a throttle engages at 100%; an escalation path handles overruns.
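The alert-then-throttle ceiling is simple enough to express directly. A minimal sketch of the policy logic (the thresholds mirror the text; the actions are placeholders for whatever paging and request-gating your platform provides):

```python
def budget_action(spend_to_date, budget, alert_at=0.8, throttle_at=1.0):
    """Operate-phase ceiling enforcement: alert at 80%, throttle at 100%."""
    ratio = spend_to_date / budget
    if ratio >= throttle_at:
        return "throttle"   # gate non-critical requests; trigger escalation path
    if ratio >= alert_at:
        return "alert"      # notify the budget owner
    return "ok"
```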
Cost-KPI integration. Cost per successful outcome is a first-class KPI in the feature’s KPI tree (Article 12). Feature teams that treat cost as a separate concern routinely under-invest in Phase 2; teams that put cost KPIs on the main scorecard keep optimizing.
Cross-functional review cadence. A weekly or biweekly cost review with engineering, finance, and the feature team sustains discipline. The review examines trend lines, flags budget-at-risk features, and authorizes optimizations or tier-routing changes.
Practice × cost-saving range
Actual savings depend on workload, scale, and starting baseline, but the ranges observed in published case studies and the FinOps Foundation working-group reports cluster as follows.
Prompt caching: 40–90% reduction on input tokens for repeated prefix workloads. For a typical enterprise copilot with 80% of traffic hitting the same system prompt, total cost reduction of 30–50%.
Model-tier routing: 50–85% reduction for mixed-complexity workloads where cheap models suffice for the majority of requests. Realization requires a well-tuned classifier; poor classifiers route too many requests to expensive tiers and erode the savings.
Context trimming: 10–35% reduction in long-conversation settings. Savings grow as conversations lengthen.
Model-class substitution: 80–95% reduction when fine-tuned small models replace general-purpose large models for narrow tasks. The capability-evaluation burden is high but pays for itself at scale.
Compound effect: programs that execute all four practices typically achieve 60–85% total cost reduction within six months. Programs that execute only one or two practices see single-practice gains without the compounding effect.
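Note that savings compound multiplicatively on the remaining cost, not additively; the arithmetic is worth making explicit because additive intuition overstates the total:

```python
def compound_savings(*reductions):
    """Total savings from stacked practices: each reduction applies
    to the cost that remains after the previous ones."""
    remaining = 1.0
    for r in reductions:
        remaining *= (1 - r)
    return 1 - remaining
```

For example, stacking illustrative reductions of 40% (caching), 50% (tier routing), and 15% (trimming) yields a 74.5% total reduction, not 105%, consistent with the 60–85% compound range above.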
Two clouds, one decomposition
FinOps for AI is portable across cloud providers. The same cost decomposition works on AWS, Azure, and GCP, though the accounting mechanics differ.
AWS. Cost and Usage Reports (CUR) with tag-based allocation across Bedrock, SageMaker endpoints, and custom inference services. OpenCost for Kubernetes-based inference workloads. AWS Budgets for alerting.
Azure. Cost Management with resource-group tagging across Azure OpenAI, Azure ML, and custom inference. Azure Budgets. The Azure Monitor stack for inference-level instrumentation.
GCP. Billing exports with label-based allocation across Vertex AI, BigQuery ML, and custom inference. Billing budgets and alerts. Cloud Logging for request-level attribution.
The decomposition — input tokens + output tokens + context overhead + retrieval hops + tool calls — is identical across clouds. The mechanics of pulling the numbers out of each billing system differ; the analyst who has implemented FinOps for AI on two clouds can implement it on the third in weeks.
The self-hosted case
FinOps for self-hosted models differs. There is no token-based billing; the cost is GPU-hours (or specialized-accelerator hours) plus storage plus network. The decomposition is:
Cost per request = (GPU-hours consumed / request) × GPU-hourly rate + storage-access cost + egress cost + operational overhead.
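The formula above can be expressed as a small helper; all parameter values in the example are illustrative, and the operational-overhead term would in practice be amortized from measured platform costs:

```python
def self_hosted_cost_per_request(gpu_seconds, gpu_hourly_rate,
                                 storage_access_cost=0.0, egress_cost=0.0,
                                 operational_overhead=0.0):
    """Cost per request = GPU-hours x hourly rate + storage access
    + egress + amortized operational overhead."""
    return ((gpu_seconds / 3600) * gpu_hourly_rate
            + storage_access_cost + egress_cost + operational_overhead)
```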
For self-hosted stacks, Prometheus + Grafana + custom exporters produce the cost allocation report. OpenCost or Kubecost tracks Kubernetes-level cost attribution. The accounting is more work up front but produces more granular visibility once set up.
The self-hosted case matters especially for regulated industries (finance, healthcare, government) where managed-provider inference is restricted. The Stanford HAI AI Index Report documents the ongoing shift toward self-hosted fine-tuned models for cost and sovereignty reasons.¹
Organization structures that sustain FinOps
Programs that achieve the 60–85% compound savings share organizational patterns.
A dedicated FinOps lead — typically a role, not a full-time position, owned by a senior engineer with finance partnership. The lead coordinates across feature teams and chairs the weekly cost review.
A central platform team that provides the Inform-phase instrumentation and the Optimize-phase tooling. Feature teams consume the platform rather than reinventing it.
Budget discipline at feature-charter time. A feature that launches without a cost budget will acquire cost debt.
Regular FinOps reviews at executive level. Monthly or quarterly reports to CFO/CIO that show cost trajectory, savings achieved, and at-risk features.
Cross-reference to Core Stream
- EATP-Level-2/M2.5-Art13-Agentic-AI-Cost-Modeling-Token-Economics-Compute-Budgets-and-ROI.md — practitioner cost-modeling companion.
- EATE-Level-3/M3.3-Art02-Enterprise-AI-Platform-Strategy.md — platform strategy where FinOps tooling lives.
Self-check
- A feature team cannot answer “what is our cost per successful outcome?” Which FinOps phase are they in, and what is the next step?
- Prompt caching savings are reported at 15% when published ranges suggest 40–90%. Name three plausible causes.
- Model-tier routing is implemented with a classifier that routes 60% of requests to the expensive tier and 40% to the cheap tier. Is this optimal? What is the next optimization?
- A regulated-industry organization self-hosts all inference. How does the FinOps decomposition change?
Further reading
- FinOps Foundation, FinOps for AI technical paper (2024).
- Stanford HAI AI Index Report (2024, 2025 editions) — compute-cost trendlines.
- Cloud provider documentation: AWS Well-Architected ML Lens, Azure AI architecture, Google Cloud AI cost optimization.
Footnotes
1. Stanford Institute for Human-Centered AI, AI Index Report (2024, 2025 editions). https://aiindex.stanford.edu/report/