AITE M1.1-Art33 v1.0 Reviewed 2026-04-06 Open Access

Cost Model and FinOps for AI


12 min read Article 33 of 48

This article gives the three-layer cost model, walks the five major cost drivers, and covers the FinOps governance and dashboard design an architect ships to make AI spend predictable at scale.

The three-layer cost model

Classical FinOps thinks in workloads and budgets. AI FinOps benefits from a third layer for per-query unit economics because AI costs are more sensitive to per-query shape than classical workloads are.

Layer 1 — per-query unit economics

The cost of a single user interaction, summed across all components. A worked example for a mid-complexity RAG query on a closed-weight managed API:

Component                                                      Cost per query
Retrieval (vector store query, embedding input)                $0.0002
Reranker (third-party managed reranker)                        $0.0003
Model call (prompt + retrieved context + response generation)  $0.0250
Observability and tracing                                      $0.0001
Network egress and platform overhead                           $0.0001
Unit cost per query                                            ≈$0.0257

At Layer 1 the architect can see where the cost concentrates (the model call, typically) and which levers reduce it (model routing, caching, prompt compression, response length caps). Klarna’s published per-interaction economics on its 2024 OpenAI-powered assistant gave rough ratios that suggested managed-API customer-service bots reached unit costs in the $0.01–0.10 range depending on the depth of the interaction.1 The specific numbers shift with provider pricing and workload shape; the model structure is stable.
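The Layer 1 arithmetic is simple enough to keep in code next to the telemetry. A minimal sketch, using the illustrative component figures from the table above (not live provider pricing):

```python
# Layer 1 sketch: sum per-query component costs and find the dominant driver.
# Component figures are the illustrative values from the table, not live pricing.
components = {
    "retrieval": 0.0002,
    "reranker": 0.0003,
    "model_call": 0.025,
    "observability": 0.0001,
    "egress": 0.0001,
}

unit_cost = sum(components.values())
dominant = max(components, key=components.get)
share = components[dominant] / unit_cost

print(f"unit cost per query: ${unit_cost:.4f}")
print(f"dominant driver: {dominant} ({share:.0%} of unit cost)")
```

Run against these figures, the model call accounts for roughly 97% of unit cost, which is why the levers listed above all target it first.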

Layer 2 — workload-level monthly run-rate

The monthly total for a given workload. For a workload running 10K daily active users with an average of 3 interactions each, the math is 10K × 3 × 30 = 900K queries/month × $0.025 = $22,500/month. Add amortised fixed costs (self-hosted GPU reservations if any, platform contracts) and the workload’s total monthly run-rate emerges.

The architect uses Layer 2 to answer questions like: does the workload’s value justify this run-rate? If the workload produces $0.10 of value per query, $0.025 of cost per query is a healthy 4x margin; if it produces $0.02 of value per query, the workload is losing money and the architecture needs to change.
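The Layer 2 calculation, parameterised so the traffic assumptions can be varied for sensitivity analysis. The figures are the worked-example values from the text; the function name is illustrative:

```python
# Layer 2 sketch: monthly run-rate and value margin for the worked example.
# Assumed figures from the text: 10K DAU, 3 interactions/day, $0.025/query.
def monthly_run_rate(dau, interactions_per_user, unit_cost, days=30, fixed_costs=0.0):
    """Monthly workload cost: variable per-query spend plus amortised fixed costs."""
    queries = dau * interactions_per_user * days
    return queries * unit_cost + fixed_costs

run_rate = monthly_run_rate(10_000, 3, 0.025)   # the $22,500/month from the text
margin = 0.10 / 0.025                            # value per query / cost per query
print(f"monthly run-rate: ${run_rate:,.0f}, margin: {margin:.0f}x")
```

Doubling DAU or raising the unit cost 50% is a one-argument change, which is exactly the sensitivity analysis the cost-model artefact later in this article asks for.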

Layer 3 — portfolio-level budget

The aggregated monthly spend across all AI workloads in the organisation. This is the finance-team view. Layer 3 is where showback, chargeback, and allocation conversations happen. It is where anomaly detection catches a single workload spiking by a factor of three and burning through everybody else’s headroom.

The five cost drivers

AI cost at scale concentrates in five drivers. The architect monitors each.

1. Model-call cost

Typically the dominant cost for inference-heavy workloads. Function of input tokens, output tokens, and model tier. Closed-weight managed APIs have per-token pricing; open-weight self-hosted has per-GPU-hour pricing that is converted to per-token at the workload’s utilisation rate.

Levers: model-tier routing (cheaper model for easier queries), prompt compression, response-length caps, caching of common queries, batch inference where latency tolerates it (Article 9 covers these in depth).
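For self-hosted serving, the per-GPU-hour-to-per-token conversion mentioned above can be sketched as follows. All throughput and pricing figures here are hypothetical assumptions, not benchmarks:

```python
# Sketch: convert self-hosted per-GPU-hour pricing to an effective per-token
# cost at a given utilisation. All figures are hypothetical assumptions.
def per_million_token_cost(gpu_hour_rate, gpus, tokens_per_sec_peak, utilisation):
    """Effective $/1M tokens for a self-hosted deployment at average utilisation."""
    tokens_per_hour = tokens_per_sec_peak * utilisation * 3600 * gpus
    hourly_cost = gpu_hour_rate * gpus
    return hourly_cost / tokens_per_hour * 1_000_000

# 3 GPUs at $3/hour each, 2,000 tok/s peak per GPU, 40% average utilisation
print(f"${per_million_token_cost(3.0, 3, 2000, 0.4):.2f} per 1M tokens")
```

Note how strongly the answer depends on the utilisation term: halving utilisation doubles the effective per-token price, which is the arithmetic behind the under-utilised-GPU loss discussed under driver 5.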

2. Retrieval cost

For RAG systems: embedding generation, vector store queries, reranker calls. Often modest per query but can grow rapidly at scale, especially for managed vector-store subscriptions where per-query pricing applies.

Levers: vector-store choice (managed vs self-hosted; Article 6), embedding-model choice, query-time reranker selection, caching of embeddings for repeat queries.

3. Tool-call cost

For agentic and tool-using systems, the cost of the tools the agent calls: database queries, API calls, external-service pricing per action. Easy to undercount because each individual tool call is cheap, yet an agent task can chain dozens of calls per query.

Levers: tool-call budgets per task, tool-result caching, tool-consolidation prompts to reduce chained calls.
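A per-task tool-call budget can be as simple as a counter checked before each call. A minimal sketch, tracking costs in integer cents to avoid float drift; the class, names, and figures are illustrative:

```python
# Sketch: a per-task tool-call budget for an agentic workload.
# Costs are tracked in integer cents to avoid float drift; all names and
# figures are illustrative, not a reference design.
class ToolBudget:
    def __init__(self, max_cents):
        self.max_cents = max_cents
        self.spent_cents = 0

    def charge(self, tool_name, cost_cents):
        """Record a tool call; return False once the task budget is exhausted."""
        if self.spent_cents + cost_cents > self.max_cents:
            return False
        self.spent_cents += cost_cents
        return True

budget = ToolBudget(max_cents=5)                       # 5-cent cap per agent task
allowed = [budget.charge("db_query", 1) for _ in range(7)]
print(allowed)                                         # first five allowed, then blocked
```

The agent loop consults `charge` before each tool invocation and falls back (summarise what it has, or escalate) when it returns False, rather than silently running up the bill.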

4. Observability and storage cost

Trace storage, log retention, eval-set storage, evidence-pack retention. Scales with request volume and retention window. Often overlooked until retention budgets start biting at month six.

Levers: retention policies (shorter for non-high-risk workloads), sampling of trace captures, log aggregation.

5. Self-hosting fixed cost

For open-weight self-hosted deployments: GPU reservations, Kubernetes infrastructure, the operations team. Fixed monthly costs that spread across whatever workload uses them. Under-utilised GPUs are the most common hidden loss.

Levers: right-sizing GPU capacity to peak plus buffer; routing overflow to managed APIs; mixed-serving patterns (Article 8).

FinOps governance

The architect’s FinOps responsibility is not finance policing; it is designing the dashboards, alerts, and controls that let the engineering team manage its own spend responsibly. The FinOps Foundation’s maturity model — Crawl, Walk, Run — applies.2

Crawl. Visibility. The platform emits per-query, per-workload, and per-owner cost telemetry. Dashboards are accessible to engineers. The team can answer “what did this cost last month?” accurately. Most teams enter this stage in the first 6 to 12 months of AI work.

Walk. Optimisation. Cost anomalies are detected automatically. Routing and caching decisions are informed by cost data. Per-workload budgets are tracked and alerts fire on approach. Teams at this stage see their first significant cost reductions from architectural changes justified by the numbers.

Run. Operation. Showback or chargeback is live. Budget caps enforce at the gateway. Continuous cost optimisation is part of the engineering review cadence. Most organisations reach this stage two to three years into serious AI spend.

Azure Cost Management + AI, AWS Bedrock cost tracking, Google Cloud AI cost insights, and build-your-own dashboards on OpenTelemetry data all support FinOps dashboards; the specific tool matters less than the data model.3

Showback, chargeback, and allocation

Three patterns for how AI spend lands on the budget.

Showback. Every engineering team sees what they spent but costs are centrally funded. Useful early; creates visibility without political friction.

Chargeback. Costs land on the consuming team’s budget directly. Creates strong incentives but can discourage platform adoption if the chargeback is introduced too early.

Allocation. Platform costs are allocated by usage metric (queries, users, model tier). Sits between showback and chargeback; common for mature platforms.

The architect collaborates with finance on the model; the architect’s specific contribution is ensuring the usage metric is meaningful. Charging by raw request count when workloads have wildly different per-request costs misallocates badly: cheap workloads end up subsidising expensive ones. Charging by a cost-weighted metric is fair and informative.
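The difference between raw-count and cost-weighted allocation is easy to demonstrate. A sketch with two hypothetical teams sharing one platform bill (team names and figures are invented for illustration):

```python
# Sketch: allocating a shared platform bill by cost-weighted usage rather
# than raw request count. Team names and all figures are illustrative.
platform_bill = 100_000.0
usage = {                      # team -> (requests, avg cost per request)
    "team_search":  (900_000, 0.002),
    "team_support": (100_000, 0.025),
}

weighted = {t: n * c for t, (n, c) in usage.items()}
total = sum(weighted.values())
allocation = {t: platform_bill * w / total for t, w in weighted.items()}
print(allocation)
```

By raw request count, team_search would carry 90% of the bill; cost-weighted, it carries roughly 42%, because its requests are an order of magnitude cheaper. That gap is the misallocation the paragraph above warns about.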

Anomaly detection

Cost anomalies are signal. The architect specifies detection:

  • Per-query cost spike. A prompt change that suddenly emits 5x more output tokens. Detected hourly; alerts the prompt owner.
  • Per-workload cost spike. A workload’s daily spend exceeds a threshold. Alerts the workload owner; triggers a cost review.
  • Portfolio ceiling approach. The organisation’s monthly spend is trending to exceed budget. Finance and architect review.
  • Idle self-hosted capacity. GPU utilisation below a threshold for an extended period. Right-sizing trigger.

The detection rules are lightweight. Statistical-process-control style bands around rolling baselines catch most anomalies. The cost observability tool emits these signals alongside the quality signals in the SLO dashboard (Article 20).
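The SPC-style bands described above need only a rolling mean and standard deviation. A sketch with synthetic daily-spend figures; the window length and band width `k` are tuning choices, not recommendations:

```python
# Sketch: statistical-process-control style bands around a rolling baseline.
# Daily-spend figures are synthetic; window and k are tuning choices.
from statistics import mean, stdev

def is_anomaly(history, latest, window=7, k=3.0):
    """Flag `latest` if it falls outside mean ± k·stdev of the rolling window."""
    recent = history[-window:]
    if len(recent) < window:
        return False                      # not enough baseline yet
    mu, sigma = mean(recent), stdev(recent)
    return abs(latest - mu) > k * sigma

daily_spend = [410, 395, 402, 398, 405, 400, 397]   # a stable week of spend
print(is_anomaly(daily_spend, 404))                 # within band
print(is_anomaly(daily_spend, 1200))                # 3x spike, flagged
```

The same function serves all four rule types above by changing what goes into `history`: per-query cost samples, per-workload daily spend, portfolio monthly trend, or GPU utilisation.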

Budget caps and enforcement

A budget cap that is not enforced is a hope, not a cap. The architect specifies enforcement.

Per-user cost cap. If an individual user’s interactions cost more than a threshold in a window, the platform degrades service (cheaper model, smaller context, refusal). This is a common lab exercise in enterprise architect training programmes because it exposes the design choices clearly.4

Per-workload cost cap. Monthly budget; enforcement at 90% warns, at 100% can degrade or rate-limit depending on policy.

Per-agent task cap. For agentic workloads (Article 32), per-task cost limits prevent runaway costs.

Portfolio cap with human override. The organisation-level cap is the ultimate backstop. Exceeding it requires finance and architect sign-off, not a silent accept.

Graceful degradation under cost pressure is a design problem. A cost-capped system tells the user it is operating in reduced mode rather than failing silently; the responsible-AI fallback pattern (Article 31 Pattern 5) applies.
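Graceful degradation under a per-user cap can be expressed as a tier-selection function the gateway calls per request. A sketch; the thresholds, tiers, and notices are illustrative, not a reference design:

```python
# Sketch: per-user cap with graceful degradation instead of hard failure.
# Thresholds, tier names, and notice text are illustrative assumptions.
def select_tier(user_spend, cap):
    """Degrade service as a user approaches their cost cap in the window."""
    if user_spend < 0.8 * cap:
        return {"model": "top-tier", "max_context": 16_000, "notice": None}
    if user_spend < cap:
        return {"model": "mid-tier", "max_context": 4_000,
                "notice": "Running in reduced mode to stay within budget."}
    return {"model": None, "max_context": 0,
            "notice": "Budget exhausted for this window; please retry later."}

print(select_tier(user_spend=0.50, cap=1.00)["model"])   # full service
print(select_tier(user_spend=0.90, cap=1.00)["model"])   # degraded service
print(select_tier(user_spend=1.10, cap=1.00)["model"])   # refusal, with notice
```

The key design point is that every branch returns a notice field: the user is told the system is in reduced mode, which is the responsible-AI fallback behaviour the paragraph above calls for.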

The cost-model artefact

The cost model is a document. A complete cost model document for an AI workload:

  • Workload description and scope.
  • Layer 1 unit economics with component breakdown.
  • Layer 2 monthly run-rate with traffic assumptions and sensitivity analysis (what if DAU doubles, what if average tokens rise 50%).
  • Layer 3 contribution to portfolio.
  • Cost drivers and architectural levers identified.
  • Budget request and approval.
  • Monitoring and alert configuration.
  • Revisit cadence (typically quarterly).

The document is refreshed quarterly and whenever a material architectural change is made. It is cited in the ADR corpus and referenced in stage-gate readouts.

Worked example — model routing saves 64%

A customer-support assistant running at 1M queries/day has an average cost of $0.025/query on a top-tier model for a monthly spend of $750K. The architect observes that 80% of queries are simple FAQ-style lookups that a mid-tier model handles at equivalent quality. Routing these to the mid-tier model at $0.005/query and keeping the remaining 20% on the top-tier model:

  • Simple queries: 800K/day × $0.005 = $4K/day → $120K/month
  • Complex queries: 200K/day × $0.025 = $5K/day → $150K/month
  • Total: $270K/month, a 64% reduction

The savings are real but require the routing infrastructure (a classifier or rules engine), the quality monitoring to catch mis-routed queries, and the ongoing discipline to maintain the routing policy. All are architectural decisions the AITE-SAT architect specifies.
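The routing arithmetic above, parameterised so the split and per-tier prices can be re-run with current figures:

```python
# Sketch: the routing arithmetic from the worked example, parameterised.
# The 80/20 split and per-tier prices are the article's illustrative figures.
def routed_monthly_cost(daily_queries, simple_share, simple_cost, complex_cost, days=30):
    simple = daily_queries * simple_share * simple_cost
    complex_ = daily_queries * (1 - simple_share) * complex_cost
    return (simple + complex_) * days

baseline = 1_000_000 * 0.025 * 30                       # $750K/month, single tier
routed = routed_monthly_cost(1_000_000, 0.80, 0.005, 0.025)
saving = 1 - routed / baseline
print(f"${routed:,.0f}/month, {saving:.0%} saved")
```

The function also answers the sensitivity question the routing policy raises: if the classifier mis-routes and the simple share drops from 80% to 60%, re-running with `simple_share=0.60` shows how much of the saving survives.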

Worked example — self-hosting crossover

A team serves 200M tokens/day through an open-weight model. At typical managed API pricing of roughly $1.50 per million tokens for a mid-tier open-weight model hosted by a provider, the monthly cost is roughly $9K. Self-hosting on three H100 GPUs, at roughly $3/hour for the reserved GPU alone and about $6,500/month per GPU all-in once infrastructure and operations overhead is included, costs roughly $20K/month. The managed option is cheaper at this volume.

At 1B tokens/day the managed cost is $45K/month. Self-hosting with 10 GPUs at this scale (leaving headroom and supporting peak load) costs roughly $65K/month. Managed is still cheaper but the curves are converging.

At 10B tokens/day managed costs $450K/month. Self-hosting scaled to 60 GPUs costs roughly $390K/month and delivers capacity the organisation controls. Self-hosting becomes cheaper and also delivers a data-residency and latency posture managed providers may not match.

The numbers shift constantly. The architect runs the model with current pricing rather than trusting conventional wisdom.
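The three scenarios above, encoded so the comparison can be re-run as pricing moves. The GPU counts are the article's illustrative figures (headroom included), not a sizing formula:

```python
# Sketch: the managed-vs-self-hosted crossover from the worked example.
# Prices and GPU counts are the article's illustrative figures; rerun with
# current provider pricing before relying on the conclusion.
def managed_monthly(tokens_per_day, price_per_million=1.50, days=30):
    return tokens_per_day / 1_000_000 * price_per_million * days

GPU_MONTH_ALL_IN = 6_500
scenarios = {200e6: 3, 1e9: 10, 10e9: 60}    # tokens/day -> provisioned GPUs

for volume, gpus in scenarios.items():
    managed = managed_monthly(volume)
    self_hosted = gpus * GPU_MONTH_ALL_IN
    cheaper = "managed" if managed < self_hosted else "self-hosted"
    print(f"{volume / 1e9:>5.1f}B tok/day: managed ${managed:>9,.0f} "
          f"vs self-hosted ${self_hosted:>9,.0f} -> {cheaper}")
```

Note the GPU counts do not scale linearly with volume in the scenarios: utilisation improves at scale, which is part of why the curves converge and then cross.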

Governance integration

Cost is a first-class governance concern. EU AI Act Article 11 technical documentation and Article 12 record-keeping are not explicit about cost but the evidence pack benefits from cost transparency: a notified body looking at a system’s monitoring disposition sees more reliable signal when the cost telemetry is present.5 ISO/IEC 42001’s requirements on monitoring and measurement (clause 9.1) map cleanly to the cost-observability plane. For regulated industries (financial services, healthcare), the cost model is often audited alongside the model-risk documentation.

Anti-patterns

  • No cost model. Invisible cost is the FinOps equivalent of invisible risk. First thing to fix.
  • Cost model in a spreadsheet nobody opens. Living outside the platform, the cost model drifts out of date within weeks. Put it where the team sees it.
  • One dashboard per department. A proliferation of inconsistent cost dashboards prevents portfolio-level reasoning. One platform dashboard plus department slices.
  • Premature chargeback. Charging back before visibility and optimisation are in place pushes teams away from the platform. Crawl before walk.
  • Budget cap without degradation spec. A cap that simply fails queries when exceeded creates a support nightmare. Specify the degradation.
  • Ignoring observability cost. Observability budget grows with retention windows and request volume. Unbudgeted, it surprises at scale.

Summary

The three-layer cost model (per-query, per-workload, portfolio) and the five cost drivers (model, retrieval, tool, observability, self-hosting) give the architect the mental model to diagnose and control AI spend. FinOps governance (showback, chargeback, allocation) is implemented through the platform dashboards, anomaly detection, and budget caps the architect specifies. Worked examples from model routing and self-hosting crossover show how architectural choices are the primary lever on AI cost.

Key terms

  • FinOps for AI
  • Three-layer cost model
  • Unit economics (per-query)
  • Showback versus chargeback
  • Cost anomaly detection

Learning outcomes

After this article the learner can: explain the three-layer cost model; classify five cost drivers; evaluate a cost model for hidden drivers; design a FinOps dashboard specification.


Footnotes

  1. Klarna AI assistant disclosures (2024); Harvard Business Review and industry analyses of deployment economics.

  2. FinOps Foundation public materials; AI Foundation working group outputs (2024–2025).

  3. Azure Cost Management documentation; AWS Bedrock cost tracking documentation; Google Cloud AI cost insights documentation.

  4. AITE-SAT Lab 4 — “Production-grade cost model and per-user cap” (COMPEL curriculum).

  5. Regulation (EU) 2024/1689 (AI Act), Articles 11, 12.