AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 12 of 14
A decade ago, experiment cost in ML was rounding error compared with engineering salary cost. That is no longer true. Training a modern large language model runs into the tens or hundreds of millions of dollars. Fine-tuning a mid-sized open-weight model is a five- or six-figure bill per run. Even hyperparameter sweeps on classical ML can tip into four or five figures if unbounded. And LLM-based evaluation, done well, has a non-trivial per-query cost that accumulates across regression suites, A/B traffic, and human-review harnesses. A practitioner who does not treat compute budget as a design constraint will lose the ability to run experiments before the organization loses the willingness to fund them.
The cost categories
Experiment cost breaks into four categories, each with different elasticity and different levers.
Training compute. The compute consumed by training a model or fine-tuning one. Driven by model size, data size, epochs, and hardware efficiency. Spot and preemptible instances, gradient checkpointing, mixed-precision training, and parameter-efficient fine-tuning (LoRA, QLoRA) are the major cost levers. Across providers (AWS, Azure, Google Cloud, Oracle Cloud Infrastructure, and neocloud providers including CoreWeave and Lambda), training is often the largest single line item.
Evaluation compute. The compute consumed by running evaluation harnesses. For classical ML this is usually modest. For LLM evaluation, especially when LLM-as-judge is used, costs accumulate fast: a 10,000-item regression suite evaluated by an LLM judge across multiple metrics can reach five figures on a single run, multiplied by frequency.
Online serving compute. The compute consumed by the feature itself in production. A/B tests and canary deployments that double the serving footprint during the experiment window are online-serving cost. For LLM features the token-usage cost can dwarf the infrastructure cost.
Human review cost. The time cost of human reviewers scoring outputs. Often expressed as a count of reviewer-hours per week; converted to dollars through the review team’s loaded rate.
The four categories respond to different levers. Mixing them into a single budget obscures which lever to pull when the budget is tight.
Sample size versus effect size
Classical experimentation economics trades sample size against detectable effect size. Larger samples detect smaller effects. Smaller samples can detect only larger effects. The trade is exact and computable.
For a two-sample A/B test on a binary outcome, the sample size per variant scales roughly with the inverse square of the minimum detectable effect (MDE): halving the MDE quadruples the sample size. For continuous outcomes the scaling is similar, with baseline variance as an additional factor. The practical implication is that cutting sample size to save money disproportionately reduces what the experiment can detect.
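The inverse-square scaling is easy to verify with the standard power calculation. A minimal sketch in Python (standard library only); the baseline rate, significance level, and power below are illustrative choices, not recommendations:

```python
from math import ceil
from statistics import NormalDist

def samples_per_variant(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-variant sample size for a two-proportion z-test.

    mde_abs is the minimum detectable effect as an absolute lift
    (e.g. 0.01 for a one-percentage-point change).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Halving the MDE roughly quadruples the required sample per variant.
n_2pct = samples_per_variant(0.10, 0.02)
n_1pct = samples_per_variant(0.10, 0.01)
```

With a 10% baseline, the 1% MDE needs roughly four times the sample of the 2% MDE, which is the cost lever the text describes.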
Three tactics improve the trade.
Variance reduction. Techniques including CUPED (Controlled-experiment Using Pre-Experiment Data), regression adjustment, and stratified sampling reduce the effective variance of the primary metric. Microsoft ExP has published extensively on CUPED, and comparable techniques are documented across Netflix, LinkedIn, and other large-scale programs.[1][2] A 20% variance reduction allows roughly a 20% sample-size reduction at the same MDE.
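The mechanics of CUPED fit in a few lines: regress the in-experiment metric on a pre-experiment covariate, then subtract the predicted component. This is a simplified single-covariate sketch on synthetic data, not the production form described in the paper:

```python
import random
import statistics

def cuped_adjust(metric, covariate):
    """CUPED adjustment: subtract theta * (covariate - its mean), where
    theta is the OLS slope of the metric on the pre-experiment covariate."""
    c_mean = statistics.fmean(covariate)
    m_mean = statistics.fmean(metric)
    cov = sum((x - c_mean) * (y - m_mean) for x, y in zip(covariate, metric))
    var = sum((x - c_mean) ** 2 for x in covariate)
    theta = cov / var
    return [y - theta * (x - c_mean) for x, y in zip(covariate, metric)]

random.seed(0)
pre = [random.gauss(10, 2) for _ in range(5000)]   # pre-experiment metric
post = [x + random.gauss(0, 1) for x in pre]       # correlated in-experiment metric
adjusted = cuped_adjust(post, pre)

var_before = statistics.variance(post)
var_after = statistics.variance(adjusted)
```

When the pre-period covariate is predictive, the adjusted metric has much lower variance while the mean is unchanged, which is exactly what buys the sample-size reduction.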
Guardrail prioritization. Not every guardrail needs full-sample power. A primary metric might need a 1% MDE, but a guardrail metric that is binary and safety-critical might be usable at 5% MDE. Sizing the experiment for the tightest requirement wastes capacity on guardrails that did not need it.
Sequential testing. As covered in Article 4, sequential-testing frameworks allow continuous monitoring and early stopping when the result is clear, cutting expected sample size for confident wins or losses.
Compute cost for LLM features
LLM features have cost patterns classical ML does not.
Per-token pricing. Managed LLM APIs (OpenAI, Anthropic, Google, Cohere, Mistral API, multiple aggregators) price per token. Input tokens and output tokens often price differently. A feature whose prompt contains 2,000 tokens of retrieved context pays for those tokens on every request. Prompt caching — available from several providers — can reduce costs dramatically for prompts with stable prefixes. Anthropic publishes pricing for its Claude models and documents prompt-caching mechanics explicitly.[3]
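A back-of-envelope cost model makes the caching effect concrete. The per-million-token rates below are illustrative placeholders, not any provider's actual prices:

```python
def request_cost(input_tokens, output_tokens, cached_tokens=0,
                 in_rate=3.0, out_rate=15.0, cache_read_rate=0.3):
    """Dollar cost of one request. Rates are $ per million tokens and are
    illustrative placeholders, not any provider's published prices."""
    fresh = input_tokens - cached_tokens
    return (fresh * in_rate
            + cached_tokens * cache_read_rate
            + output_tokens * out_rate) / 1e6

# 2,000-token retrieved context + 200-token question, 300-token answer:
uncached = request_cost(2200, 300)
# Same request with the 2,000-token context served from a prompt cache:
cached = request_cost(2200, 300, cached_tokens=2000)
```

With the cached prefix read at a tenth of the input rate, the per-request cost roughly halves in this example; at millions of requests, that difference is the evaluation and serving budget.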
Model-version drift in cost. A newer, “better” model from the same provider is often priced higher per token than its predecessor. A team shipping the new model silently absorbs the cost increase. A cost regression test (Article 8) catches this.
Self-hosted cost. Running an open-weight model (Llama, Mistral, Qwen, DeepSeek, others) on self-hosted infrastructure costs differently — primarily compute rental rather than per-token. The break-even against a managed API depends on utilization; low-utilization workloads are cheaper on managed APIs, high-utilization workloads on self-hosted. The decision is not static; practitioners who run the numbers periodically will find the right answer changes as model and hardware prices change.
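Running the numbers can be as simple as a break-even utilization check. The GPU rental rate, full-load throughput, and API price below are hypothetical inputs to be replaced with current quotes:

```python
def breakeven_utilization(gpu_hour_cost, tokens_per_hour_at_full_load,
                          api_cost_per_million_tokens):
    """Fraction of full-load utilization at which self-hosting matches the
    managed API on cost. All three inputs are assumptions to be refreshed
    as model and hardware prices change."""
    api_cost_per_hour_at_full = (tokens_per_hour_at_full_load / 1e6
                                 * api_cost_per_million_tokens)
    return gpu_hour_cost / api_cost_per_hour_at_full

# Hypothetical: $4/hr GPU, 10M tokens/hr at full load, $2 per million via API.
u = breakeven_utilization(4.0, 10_000_000, 2.0)
```

Under these made-up numbers the break-even sits at 20% utilization: below it the managed API is cheaper, above it self-hosting wins, and the answer moves whenever any of the three inputs does.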
Agentic multiplier. An agentic workload with tool use can make many LLM calls per user interaction. A single user query that invokes an agent with retrieval, reasoning, tool use, and a final synthesis step might make 10–50 LLM calls under the hood. Cost budgeting that assumes one call per query will be off by an order of magnitude.
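The multiplier dominates any serving-cost estimate, as a quick hypothetical calculation shows (the query volume and per-call cost are made-up round numbers):

```python
def monthly_feature_cost(queries_per_day, calls_per_query, cost_per_call):
    """Naive monthly serving cost; 30-day month, flat per-call cost."""
    return queries_per_day * calls_per_query * cost_per_call * 30

# One LLM call per query at $0.01/call, vs. an agent making 30 calls per query:
naive = monthly_feature_cost(10_000, 1, 0.01)
agentic = monthly_feature_cost(10_000, 30, 0.01)
```

The same feature at the same traffic moves from roughly $3,000 to roughly $90,000 a month, which is the order-of-magnitude error the text warns about.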
Spot, preemptible, and reserved pricing
Cloud providers offer several pricing tiers that can cut experiment cost substantially.
- Spot instances (AWS), Spot VMs (Azure, Google Cloud). Interruptible compute at 60–90% discount versus on-demand. Suitable for experiment workloads that can tolerate preemption: hyperparameter sweeps, batch evaluation, offline training with checkpointing.
- Reserved instances and savings plans. Committed capacity at discount in exchange for multi-month or multi-year commitment. Suitable for baseline training and serving capacity that a team knows it will use continuously.
- Neocloud pricing. Providers including CoreWeave, Lambda, Vast.ai, and RunPod offer GPU compute at per-hour rates that often undercut the major hyperscalers for specific workloads. Suitable for experiment workloads that do not need the hyperscaler’s integration with the organization’s broader infrastructure.
A practitioner’s job is not to make the commercial decision, but to know the options and to design experiments that can exploit them. A hyperparameter sweep that cannot tolerate preemption can cost three to five times as much as one that can.
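Tolerating preemption usually means checkpointing completed work so a replacement instance resumes rather than restarts. A minimal sketch of a resumable sweep loop; the checkpoint filename and the evaluate callback are placeholders for the expensive training-and-scoring step:

```python
import json
import os

CKPT = "sweep_checkpoint.json"  # placeholder path; use durable storage in practice

def run_sweep(trials, evaluate):
    """Run a hyperparameter sweep that survives spot-instance preemption by
    persisting completed results after every trial."""
    done = {}
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            done = json.load(f)  # resume: reload results from before preemption
    for trial_id, params in trials.items():
        if trial_id in done:
            continue  # already finished on a previous (possibly preempted) run
        done[trial_id] = evaluate(params)
        with open(CKPT, "w") as f:
            json.dump(done, f)  # the instance can now die without losing this trial
    return done
```

Each completed trial costs at most one re-run on preemption, which is what makes the 60–90% spot discount usable for sweeps.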
The experiment-portfolio view
Experiment cost adds up across the organization. A program that runs 50 experiments a month, each costing $500 on average, is a $300,000/year program. The individual experiments might be modest; the portfolio is not.
Portfolio discipline has three elements.
A shared budget ceiling. The organization allocates a compute budget to the experimentation program. Individual teams bid against it. Budget overruns trigger portfolio review, not individual-experiment emergency calls.
A deduplication check. Before approving an experiment, the practitioner checks that a similar experiment has not already run. A team running a hyperparameter sweep that a peer team ran last quarter is wasting the portfolio. Tracking systems (MLflow, W&B, and cloud-provider native) support this if they are used consistently.
A cost-of-value metric. The portfolio view prioritizes experiments by expected information gain per unit cost. A small sample size on a high-signal question dominates a large sample size on a low-signal question. The metric is imprecise but forces the conversation.
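One way to operationalize the cost-of-value metric is a greedy ranking by expected gain per dollar against the shared ceiling. A sketch with hypothetical experiments; the gain scores are judgment calls, not measurements:

```python
def prioritize(portfolio, ceiling):
    """Greedy approval: rank experiments by expected information gain per
    dollar, approve in order until the shared budget ceiling is exhausted."""
    ranked = sorted(portfolio, key=lambda e: e["gain"] / e["cost"], reverse=True)
    approved, spend = [], 0.0
    for exp in ranked:
        if spend + exp["cost"] <= ceiling:
            approved.append(exp["name"])
            spend += exp["cost"]
    return approved, spend

# Hypothetical portfolio; costs in dollars, gain on an arbitrary 0-10 scale.
portfolio = [
    {"name": "prompt-rewrite A/B", "cost": 400, "gain": 8},
    {"name": "judge-model swap", "cost": 2500, "gain": 9},
    {"name": "full retrain sweep", "cost": 9000, "gain": 6},
]
approved, spend = prioritize(portfolio, ceiling=5000)
```

The metric is imprecise, as the text says; the point of scoring it at all is that the cheap high-signal experiment gets approved before the expensive low-signal one.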
[DIAGRAM: ScoreboardDiagram — aitm-eci-article-12-portfolio-cost-dashboard — A dashboard-style table with rows per experiment, columns for compute cost, expected information gain, priority score, and status (approved, running, complete). A summary row shows total portfolio spend against ceiling.]
Budget as a first-class element of the experiment brief
Article 14 develops the experiment brief in full. The brief’s budget section includes:
- Estimated compute cost (training + evaluation + online-serving-if-applicable), in dollars and in GPU-hours.
- Estimated human-review cost, in reviewer-hours.
- Cost trigger. The spend level at which the experiment escalates for additional approval.
- Stopping cost. The spend level at which the experiment halts regardless of results.
A brief without a budget section is not ready to start. A brief whose budget fits within the approved ceiling is funded. The practitioner’s job is to make the brief’s budget honest and to stop the experiment when the ceiling is reached.
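The cost trigger and stopping cost translate directly into a monitoring check. A minimal sketch; the threshold amounts are examples set per brief, not defaults:

```python
def budget_status(spend, cost_trigger, stopping_cost):
    """Map current experiment spend to the action the brief's budget
    section prescribes. Thresholds come from the approved brief."""
    if spend >= stopping_cost:
        return "halt"        # stopping cost: halt regardless of results
    if spend >= cost_trigger:
        return "escalate"    # cost trigger: seek additional approval
    return "continue"

# Example brief: escalate at $8,000, halt at $12,000.
```

Wiring this check into the experiment tracker makes the ceiling enforceable rather than aspirational.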
[DIAGRAM: MatrixDiagram — aitm-eci-article-12-cost-vs-value — 2x2 of “Compute cost (low/high)” by “Decision value (low/high)” with priority recommendations: high-value + low-cost (do first), high-value + high-cost (approve with care), low-value + low-cost (schedule opportunistically), low-value + high-cost (reject).]
Two real references in the cost vocabulary
Anthropic Claude 3 training-compute disclosure. Anthropic’s Claude 3 model card discloses aspects of the training compute profile and is a public reference for how the largest labs think about training budget.[4] The disclosure is a teaching anchor for the scale of frontier training, and a reminder that per-query costs at inference are a tiny fraction of the training investment.
OpenAI GPT-4 technical report. The GPT-4 technical report (March 2023) discussed compute allocation during development, including how evaluation and safety work consumed a non-trivial fraction of the project’s compute.[5] The report is a reference for the claim that experimentation cost is not small relative to training cost in frontier-lab budgets.
Summary
Experiment cost has four categories: training, evaluation, online serving, human review. Sample size trades against detectable effect size; variance reduction, guardrail prioritization, and sequential testing improve the trade. LLM features have per-token pricing, model-version-drift cost risk, and an agentic multiplier. Spot and preemptible compute can cut costs substantially for experiment workloads. The portfolio view — shared ceiling, deduplication check, cost-of-value metric — prevents aggregate waste. The experiment brief’s budget section is first-class; a brief without one is not ready to run. Anthropic Claude 3 and OpenAI GPT-4 are public references for the scale of the economics.
Further reading in the Core Stream: From Measurement to Decision.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
[1] Alex Deng, Ya Xu, Ron Kohavi, Toby Walker. Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data (CUPED). WSDM 2013. https://www.microsoft.com/en-us/research/publication/improving-the-sensitivity-of-online-controlled-experiments-by-utilizing-pre-experiment-data/ — accessed 2026-04-19.
[2] Netflix Technology Blog and LinkedIn Engineering Blog experimentation series. https://netflixtechblog.com/ ; https://engineering.linkedin.com/blog/topic/machine-learning — accessed 2026-04-19.
[3] Anthropic API documentation — prompt caching. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching — accessed 2026-04-19.
[4] Anthropic. Claude 3 Model Card (2024). https://www-cdn.anthropic.com/files/4zrzovbb/website/5c49cc247484cecf107c699baf29250302e5da70.pdf — accessed 2026-04-19.
[5] OpenAI. GPT-4 Technical Report. March 2023. https://arxiv.org/abs/2303.08774 — accessed 2026-04-19.