AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 1 of 14
Most enterprise AI work is experimental work. A model selection is an experiment. A prompt change is an experiment. A retrieval reweight is an experiment. Whether the organization treats these as experiments, documents them as experiments, and governs them as experiments is what separates a team that learns from one that merely changes things. This credential’s first job is to establish a vocabulary for the four modes of AI experimentation so that every subsequent article can be precise about which mode it is talking about.
The four modes
An AI experiment is a structured comparison that produces evidence for a decision. The decision might be “ship this model”, “keep the current prompt”, “grant a deployment gate approval”, or “retire this feature”. The structure is what makes the comparison produce evidence rather than opinion. The four modes differ along two dimensions: whether the experiment uses production traffic, and whether its outputs reach real users or downstream systems.
Offline experiments use static data, no production traffic, and no user impact. A model trains on a historical dataset and is evaluated on a held-out slice of the same dataset. Offline experiments are cheap, fast, and fully reproducible. They answer questions that the data can answer, and nothing else. Hyperparameter sweeps, benchmark runs, and regression suites against curated evaluation sets all sit here.
Online experiments use production traffic, and their outputs reach real users. A/B tests, multi-armed bandits, and canary deployments are all online experiments. Online experiments are expensive, slow, and irreproducible in the strict sense (the exact traffic mix cannot be replayed), but they are the only way to answer questions that depend on how real users actually behave. Whether a recommendation feels useful, whether a generated summary is trusted, whether a fraud model triggers in the patterns that adversaries actually produce — these are online questions.
Shadow experiments use production traffic but do not expose outputs to users. A new model receives the same inputs as the current production model, runs in parallel, and its outputs are logged but not served. Shadow is the middle ground: it gets realistic inputs without user exposure, which makes it the right mode for high-risk changes that need realism but cannot yet absorb user-facing risk. The Google Site Reliability Engineering workbook documents shadow traffic as a standard pattern for risky model launches [1].
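The isolation guarantee at the core of the shadow pattern can be sketched in a few lines. This is a minimal illustration, not a production harness: `handle_request`, `serve_model`, and `shadow_model` are hypothetical names, and the models are assumed to be plain callables.

```python
import logging

logger = logging.getLogger("shadow")

def handle_request(request, serve_model, shadow_model):
    """Serve the incumbent model's output; run the candidate on the same
    input and log its output without ever exposing it to the user."""
    served = serve_model(request)
    try:
        shadowed = shadow_model(request)  # identical input; output is logged only
        logger.info("shadow comparison: served=%r shadow=%r", served, shadowed)
    except Exception:
        # A failing candidate must never affect the user-facing response.
        logger.exception("shadow model raised; request served normally")
    return served
```

Production implementations usually make the shadow call asynchronously so that candidate latency cannot delay the served response; the synchronous form above only illustrates the guarantee that the candidate's output, and its failures, stay out of the user-facing path.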
Adversarial experiments are deliberately designed to make the system fail rather than to validate that it works. Red-team exercises, jailbreak batteries, and safety sweeps are adversarial experiments. They share one property with online experiments (they probe realistic behavior) and one with offline experiments (they use curated inputs), but the hypothesis they test is inverted: the question is “what input class breaks this model?” rather than “does this model meet its metric?”. The MITRE ATLAS catalog is effectively a library of adversarial experiment classes for AI systems [2].
[DIAGRAM: MatrixDiagram — aitm-eci-article-1-experiment-modes-matrix — 2x2 of “Production traffic required (no/yes)” by “User-facing impact (no/yes)”, placing offline in top-left, shadow in top-right, online in bottom-right, and adversarial in bottom-left with a note that adversarial can use production traffic for red-team replay.]
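The matrix's two axes can be read as a small decision function. The sketch below is illustrative of this article's vocabulary only: the `Mode` enum and the `inverted_hypothesis` flag (which captures the adversarial case that the 2x2 alone cannot) are assumptions, not a standard API.

```python
from enum import Enum

class Mode(Enum):
    OFFLINE = "offline"
    SHADOW = "shadow"
    ONLINE = "online"
    ADVERSARIAL = "adversarial"

def classify(uses_production_traffic: bool, user_facing: bool,
             inverted_hypothesis: bool = False) -> Mode:
    """Map the two matrix axes, plus the inverted-hypothesis flag, to a mode."""
    if inverted_hypothesis:
        return Mode.ADVERSARIAL  # asks "what breaks it?", not "does it meet its metric?"
    if not uses_production_traffic:
        return Mode.OFFLINE      # static data, no user impact
    return Mode.ONLINE if user_facing else Mode.SHADOW
```

For example, `classify(True, False)` yields `Mode.SHADOW`: production inputs, no user exposure.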
Why the mode choice is a governance choice, not a technical choice
A team that defaults to offline for everything will ship features that work on paper and fail in production. A team that defaults to online for everything will expose users to untested behavior. Neither default is safe. The correct behavior is deliberate selection per change, per risk tier, and per evidentiary need.
Under the NIST AI Risk Management Framework 1.0, the MEASURE function catalogs the evidence an organization must produce to demonstrate that an AI system performs within specified tolerances. Subcategory MEASURE 2.5 explicitly requires that the AI system be evaluated under both prospective and operational conditions [3]. A risk register that records offline results as sufficient evidence for an operational claim fails to meet the subcategory by construction. Offline evidence answers the prospective question; shadow or online evidence answers the operational question. The mode of experiment is the evidence the regulator expects.
Under the EU AI Act (Regulation (EU) 2024/1689), Article 15 requires that high-risk AI systems be designed, developed, and tested to achieve an appropriate level of accuracy, robustness, and cybersecurity throughout their lifecycle [4]. The word “tested” is doing work: it spans pre-deployment testing (offline), pre-exposure testing (shadow), post-deployment monitoring (online), and adversarial testing (red-team). A high-risk system whose Annex IV technical documentation records only offline numbers is non-conformant. The mode of experiment determines whether the documentation is complete.
Under ISO/IEC 42001:2023, Clause 9.1 requires the organization to monitor, measure, analyze, and evaluate the performance of its AI management system [5]. Operational monitoring is online evaluation by another name, and the audit trail it requires includes the experiment records that produced the claims. The mode of experiment produces the evidence the auditor will read.
Two real cases in the mode vocabulary
Zillow Offers — a missing shadow. In November 2021, Zillow announced the shutdown of its iBuying arm and the winding-down of Zillow Offers. The SEC 10-Q filing for the quarter recorded approximately $881 million of inventory writedowns and related costs [6]. Public reporting and the company’s own statements made clear that the model that priced homes had been trained and evaluated offline on historical housing data and deployed straight to production pricing decisions, where the 2021 housing inflection exposed a transfer failure the offline splits had no way to detect. The experiment-mode error is diagnosable in the vocabulary of this article: the feature never ran in a shadow mode that could have compared model pricing against subsequent sale prices on real homes without committing capital, and the online evaluation that did happen was bundled with real capital at risk. A shadow mode between offline and online would have produced the evidence that offline was not producing, at a cost far below $881 million.
Google Search quality — a disciplined four-mode program. Google’s Search Quality documentation describes a multi-decade program that pairs offline rater evaluations with online experiments on a small percentage of queries, sometimes preceded by shadow scoring runs [7]. Kohavi, Tang, and Xu describe comparable programs at Microsoft’s Bing and at several other platforms whose practitioners have published reference material: tens of thousands of online experiments per year, preceded by offline regression suites, preceded in some cases by shadow runs, preceded in risky cases by red-team sweeps [8]. The four-mode vocabulary maps directly onto what these programs do; the discipline is in doing all four, not in picking one.
The vocabulary is deliberately agnostic to the platform. The same four modes describe a feature built on an open-source stack (a self-hosted fine-tuned Llama served on Kubernetes with Airflow training pipelines and Prometheus observability), on a managed-API stack (an Anthropic or OpenAI API with a managed feature store and a SaaS experimentation platform), or on a cloud-platform stack (SageMaker, Vertex AI, Azure ML, or Databricks). What changes is the tooling that records each mode, not the modes themselves.
Classifying a proposed change
The practitioner’s first artifact for any proposed AI change is a mode-classification note. Given a description of the change, the practitioner writes, in one paragraph: which of the four modes the proposed evaluation plan already covers, which modes it omits, which modes are required by the change’s risk tier, and which omissions must be closed before approval. The note is deliberately short so that the product owner, engineer, and reviewer can hold a shared view in a single meeting.
The classification note asks five questions.
| Question | Offline | Shadow | Online | Adversarial |
|---|---|---|---|---|
| Can the change be evaluated without production traffic? | Yes | No | No | Partly |
| Can the change be evaluated without user exposure? | Yes | Yes | No | Yes |
| Does the change interact with user behavior? | No | Weakly | Yes | No |
| Is the change safety-relevant? | Sometimes | Sometimes | Sometimes | Always |
| Does the risk tier require evidence in this mode? | Almost always | For high-risk | For high-risk | For high-risk and safety-relevant |
A change that a product team calls “just a prompt tweak” might need all four modes if the system it sits in is high-risk. A change that a team calls “a major model upgrade” might need only offline plus shadow if the downstream system is non-user-facing. Risk tier and interaction pattern drive the required mode set; the change description alone does not.
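Read as a rule set, the table and the risk-tier logic above might be sketched as follows. The thresholds are illustrative assumptions drawn from this article's table (a "high" tier requires shadow; online is added only when the change interacts with users; adversarial when it is also safety-relevant), not a normative policy.

```python
def required_modes(risk_tier: str, interacts_with_users: bool,
                   safety_relevant: bool) -> set:
    """Derive the required mode set from risk tier and interaction pattern,
    following the five-question table (illustrative rule boundaries)."""
    modes = {"offline"}              # almost always required
    if risk_tier == "high":
        modes.add("shadow")          # realism without user exposure comes first
        if interacts_with_users:
            modes.add("online")      # only online evidence covers user behavior
        if safety_relevant:
            modes.add("adversarial") # high-risk and safety-relevant: probe for failure
    return modes
```

A classification note then reduces to a set difference: the gap between `required_modes(...)` and the modes the proposed evaluation plan already covers is the list of omissions to close before approval.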
[DIAGRAM: TimelineDiagram — aitm-eci-article-1-four-mode-progression — A horizontal timeline showing progression from offline regression to shadow comparison to canary (1%) to ramp (10% then 50%) to full rollout, with an adversarial sweep attached at each gate.]
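The progression in the timeline can be sketched as a gate loop. Stage names, traffic fractions, and the two gate-check callables are hypothetical; a real gate would run metric checks and the attached adversarial sweep against recorded evidence.

```python
# Each stage pairs a name with the fraction of production traffic it receives.
STAGES = [
    ("offline regression", 0.00),
    ("shadow comparison", 0.00),
    ("canary", 0.01),
    ("ramp", 0.10),
    ("ramp", 0.50),
    ("full rollout", 1.00),
]

def progress(stages, gate_passes, adversarial_sweep_passes):
    """Advance stage by stage; halt at the first gate whose metric checks
    or attached adversarial sweep fail. Returns (stages reached, halted-at)."""
    reached = []
    for name, fraction in stages:
        if not (gate_passes(name, fraction) and adversarial_sweep_passes(name)):
            return reached, name     # halted: do not advance past a failed gate
        reached.append((name, fraction))
    return reached, None             # all gates passed; full rollout reached
```

The design choice worth noting is that the adversarial sweep is evaluated at every gate, not once up front: a sweep that passed against offline inputs can still fail against the traffic mix a canary sees.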
What the rest of the credential builds on
Every subsequent article refers back to this vocabulary. Article 2 teaches hypothesis and metric design across the four modes. Article 3 develops offline evaluation. Article 4 develops online evaluation in depth (A/B, bandits, canary, shadow). Article 5 teaches hyperparameter search, which is almost always offline. Articles 6 through 9 teach the tracking, pipeline, continuous-integration, and continuous-delivery infrastructure that spans all four modes. Articles 10 and 11 apply the vocabulary to LLM evaluation and to red-team experimentation. Articles 12 through 14 cover budget, regulatory documentation, and the experiment brief and report, which are the governance artifacts that turn experiments into evidence.
The four-mode vocabulary is the contract the credential holds with the learner: any proposed AI change can be classified, any classification can be defended, and any defense can be recorded in a one-page artifact that survives an audit.
Summary
AI experiments come in four modes. Offline runs on static data without user exposure. Shadow runs on production traffic without user exposure. Online runs on production traffic with user exposure. Adversarial runs with inverted hypotheses, probing for failure. The mode is not a technical detail; it is the governance claim the experiment can support. Picking the right mode set for a given change is the practitioner’s first job. The Zillow case shows what happens when the shadow mode is skipped; the Google and Microsoft programs show what disciplined four-mode practice looks like. The rest of this credential builds methods, tools, and documentation for each of the four.
Further reading in the Core Stream: The COMPEL Cycle: Iteration and Continuous Improvement, Produce: Executing the Transformation, and Machine Learning Fundamentals for Decision Makers.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. The Site Reliability Workbook, Chapter “Canarying Releases” and Chapter “Managing Load”. Google. https://sre.google/workbook/canarying-releases/ — accessed 2026-04-19.
2. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems). MITRE Corporation. https://atlas.mitre.org/ — accessed 2026-04-19.
3. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023, MEASURE function, Subcategory 2.5. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.
4. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (EU AI Act), Article 15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.
5. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 9.1. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.
6. Zillow Group, Inc., Form 10-Q for the quarterly period ended September 30, 2021. U.S. Securities and Exchange Commission. https://www.sec.gov/Archives/edgar/data/1617640/000161764021000112/z-2021x09x30x10q.htm — accessed 2026-04-19.
7. How Search algorithms work — Ranking systems guide. Google Search Central. https://developers.google.com/search/docs/appearance/ranking-systems-guide — accessed 2026-04-19.
8. Ron Kohavi, Diane Tang, Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020. https://experimentguide.com/ — accessed 2026-04-19.