
Experiment Brief and Experiment Report



AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 14 of 14


Two artifacts bookend every serious experiment. The brief is written before the experiment starts; it specifies what the experiment will do and what decision its result will support. The report is written after the experiment ends; it records what actually happened, what the result means, and what the organization should do next. A team that produces both brief and report runs experiments that contribute to institutional learning. A team that produces neither runs experiments that generate numbers and arguments. This final article of the credential pulls together the artifacts the earlier articles produced and teaches the two documents that make that work legible to the organization.

The experiment brief

An experiment brief is a short document — typically one to two pages — authored before the experiment starts. Its role is to force the team to close open design questions before execution rather than during it. The Kohavi, Tang, and Xu reference text on online controlled experiments describes pre-registration of the hypothesis, design, and decision rule as the single largest driver of experiment-conclusion quality¹.

A competent brief has eight sections, each derived from the earlier articles of this credential; a structured sketch follows the list.

1. Title, owner, date. The experiment has a name, a named accountable owner, and a creation date. The owner is a specific individual, not a team.

2. Hypothesis. Using the structure from Article 2: subject, predicted effect, measurement, threshold. One paragraph, typically.

3. Experiment mode. Using the vocabulary from Article 1: offline, shadow, online, adversarial, or a combination. If a combination, the sequence is specified.

4. Metrics. One primary metric, a short list of secondaries, a short list of guardrails. Each has a definition, an owner, and a pointer to the metric catalog (if the organization has one).

5. Evaluation protocol. The evaluation design, anchored to Articles 3, 4, or 10 depending on the mode. Sample size, power, traffic allocation, duration, stopping rules.

6. Budget. From Article 12: compute cost estimate, human-review cost estimate, cost-trigger threshold, stopping-cost threshold.

7. Decision rule. What triggers ship, what triggers rollback, what triggers a continuation. Specific thresholds and specific actions, not aspirations.

8. Reproducibility and governance. Where the experiment will be tracked, which regulatory anchors (Article 13) the experiment produces evidence for, what the retention policy is.
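
The eight sections translate naturally into a structured record. The sketch below is one possible shape in Python; the class name, the field names, and the open_questions helper are illustrative assumptions, not a template this credential prescribes.

```python
from dataclasses import dataclass, field

# A hypothetical schema for the eight brief sections. Field names and
# types are illustrative; the point is that every section is a field,
# so an empty section is visible before execution, not after.
@dataclass
class ExperimentBrief:
    title: str
    owner: str                # a specific individual, not a team
    created: str              # ISO date, e.g. "2026-04-06"
    hypothesis: str           # subject, predicted effect, measurement, threshold
    mode: list[str]           # e.g. ["shadow", "online"], in execution order
    primary_metric: str
    secondary_metrics: list[str] = field(default_factory=list)
    guardrail_metrics: list[str] = field(default_factory=list)
    protocol: dict = field(default_factory=dict)       # sample size, power, duration, stopping rules
    budget: dict = field(default_factory=dict)         # compute, human review, cost triggers
    decision_rule: dict = field(default_factory=dict)  # ship / rollback / continue thresholds
    tracking_id: str = ""                              # where the experiment is tracked
    regulatory_anchors: list[str] = field(default_factory=list)  # Article 13 evidence targets

    def open_questions(self) -> list[str]:
        """Names of required sections that are still empty."""
        required = {
            "hypothesis": self.hypothesis,
            "primary_metric": self.primary_metric,
            "protocol": self.protocol,
            "budget": self.budget,
            "decision_rule": self.decision_rule,
        }
        return [name for name, value in required.items() if not value]
```

A brief whose open_questions() list is non-empty is not ready for review; that is the sign-off gate described next, made mechanical.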

The brief is reviewed by a peer and, for regulated or high-risk experiments, by a governance role (model-risk manager, compliance officer, platform owner). Review sign-off is the gate to execution.

Teams that have not written a brief before will find the first several painful. By the tenth, most teams produce briefs in under an hour, because the structure is predictable and most of the fields draw on artifacts that already exist (metric catalog, tracking system configuration, budget ceiling). The brief is not heavyweight; it is structured.

[DIAGRAM: BridgeDiagram — aitm-eci-article-14-brief-to-report — Left-side experiment brief sections, center experiment execution with artifact accumulation (tracking records, metrics, registry entries), right-side experiment report sections. Bridge beams show which brief sections each report section draws on.]

The experiment report

An experiment report is a longer document — typically two to five pages — authored after the experiment ends. Its role is to produce a defensible conclusion and a decision. A report that does not produce a decision has failed at its primary job.

A competent report has nine sections; a companion sketch follows the list.

1. Title, owner, date, brief reference. The report links to the brief it closes. A report without a brief reference is a standalone document, not an experiment result.

2. Summary. Two to three sentences stating the hypothesis outcome (supported, rejected, inconclusive), the primary-metric movement, and the decision.

3. Method as executed. What the brief specified versus what actually happened. Deviations (sample size reached, duration extended, traffic re-allocated, rollback triggered) are recorded. An experiment executed exactly as briefed is rare; an experiment whose deviations are undocumented is suspect.

4. Results. The primary metric with confidence interval, each secondary metric, each guardrail metric. Slice-level results where applicable. Figures referenced in the tracking system for reproducibility.

5. Interpretation. The result in context: what it means for the hypothesis, what it does not mean, which of the results were anticipated and which surprised the team.

6. Limitations. What the experiment could not answer. Which populations or conditions the experiment did not cover. Which proxies were used and what the known decoupling risks are.

7. Recommendation. The decision the report recommends: ship, do not ship, run a follow-up experiment, or escalate. The recommendation is specific and refers to the brief’s decision rule.

8. Next experiment. If a follow-up is recommended, the next hypothesis and experiment mode. One paragraph; the formal next brief follows separately.

9. Reproducibility pointer. The experiment ID in the tracking system, the registry entries, the CI runs, the regulatory artifacts produced. A future audit reader starts from this section.
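
A companion sketch for the report, under the same caveat: the fields are assumptions for illustration. The one structural rule the article insists on, the brief reference, is modeled as a required field rather than an optional one.

```python
from dataclasses import dataclass, field

# Hypothetical companion record for the nine report sections.
@dataclass
class ExperimentReport:
    brief_id: str             # the brief this report closes; required, not optional
    outcome: str              # "supported" | "rejected" | "inconclusive"
    decision: str             # "ship" | "do not ship" | "follow-up" | "escalate"
    primary_result: tuple     # (estimate, ci_low, ci_high) for the primary metric
    deviations: list[str] = field(default_factory=list)   # method as executed vs. as briefed
    limitations: list[str] = field(default_factory=list)  # what the experiment could not answer
    surprises: list[str] = field(default_factory=list)    # feeds the "what surprised us" note
    next_hypothesis: str = "" # one paragraph if a follow-up is recommended
    tracking_id: str = ""     # reproducibility pointer: experiment ID, registry, CI runs
```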

The report is reviewed by the same roles that reviewed the brief, plus any stakeholder the decision affects directly (product owner, service owner, compliance, security). The review is not a rubber stamp; a well-written report invites scrutiny.

The brief and report in the Booking.com and Spotify vocabulary

Booking.com. Booking.com’s public writeups on its experimentation platform describe a program running more than ten thousand experiments per year, and a culture where the experiment brief (“experiment design document” in their vocabulary) is a prerequisite for platform time². The brief template is shared across the organization and is reviewed by a central experimentation team before execution. The practice scales because the template is predictable and the review is fast.

Spotify. Spotify’s engineering blog series on experimentation describes the end-to-end cycle including brief, execution, and report³. The reports are reviewed by peer experimenters, and findings that surprise the team are escalated to a broader review before they inform product decisions. The surprise-triggered escalation is a specific learning: the report’s “what surprised us” becomes a signal for institutional attention.

Both programs share a property: briefs and reports are short, structured, and reviewed. Neither is ceremonial. Both are operational.

Claim-evidence fit

The single most important discipline in an experiment report is claim-evidence fit. The claim (ship, do not ship, revisit) is supported by the evidence (primary metric movement, guardrail stability, slice-level findings). A report that makes a claim not supported by its evidence is worse than a report that makes no claim; it misleads the next decision-maker.

Three checks maintain claim-evidence fit; the sketch after them makes the first check mechanical.

Is the claim stronger than the evidence? A 1.5-percentage-point improvement with a ±1.2-percentage-point confidence interval (an interval running from roughly 0.3 to 2.7 points) is evidence of a direction but not of a magnitude. A claim that asserts the magnitude is overreach.

Does the claim address the hypothesis the brief made? A report that pivots to a different hypothesis mid-experiment (“the primary metric did not move, but look at this secondary”) has either discovered something real that must be validated, or has performed post-hoc fishing.

Does the claim account for limitations? A claim of generalization across user segments is defensible only if segments were evaluated. A claim of robustness is defensible only if the adversarial tests were run.
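
The first check can be made mechanical. A minimal sketch, assuming a symmetric confidence interval and an arbitrary, illustrative rule for when an interval is too wide to support a magnitude claim:

```python
def claim_strength(estimate: float, ci_low: float, ci_high: float) -> str:
    """Classify what an interval supports: nothing, a direction, or a
    magnitude. The width rule is illustrative, not a standard."""
    if ci_low <= 0.0 <= ci_high:
        return "inconclusive: the interval crosses zero"
    half_width = (ci_high - ci_low) / 2
    if half_width > abs(estimate) / 2:
        return "direction only: interval too wide to claim a magnitude"
    return "direction and approximate magnitude"

# The example from the text: +1.5 points, interval roughly 0.3 to 2.7.
print(claim_strength(1.5, 0.3, 2.7))  # -> direction only: ...
```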

A report reviewer’s first question, regardless of organization, is usually some version of “does the evidence actually support this claim?” A practitioner who asks that question before the reviewer does is the one whose reports ship.

[DIAGRAM: ScoreboardDiagram — aitm-eci-article-14-exec-summary — A one-page executive summary card with slots for hypothesis, primary metric result, guardrail status, decision, next experiment, and the single-sentence “what surprised us” note.]

Integration with the regulatory pipeline

Article 13 developed the documentation pipeline. The brief and report are the primary practitioner inputs to that pipeline. A brief feeds the pre-registration record. A report feeds the evaluation and monitoring records. The regulatory extractor (from Article 13) pulls both into the Annex IV, Clause 9.1, and MEASURE evidence sets.

A practitioner who writes the brief and report with the regulatory extractor in mind — using consistent section names, structured metadata where possible, and standardized metric definitions — makes the regulatory pipeline nearly automatic. A practitioner who writes them as narrative prose makes the extractor’s job harder, which usually means the extractor does not keep up.
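
Concretely, writing with the extractor in mind might look like the following: report metadata carried as structured data with stable key names, and an extractor that keys on them. Every key name, the example IDs, and the extract_evidence function are assumptions for illustration; only the three anchors come from Article 13.

```python
# Hypothetical structured front matter for a report. Stable key names are
# what make the extractor's job mechanical; the names themselves are
# illustrative, not a standard.
report_metadata = {
    "experiment_id": "exp-2026-0412",   # invented example ID
    "brief_id": "brief-2026-0402",
    "sections": {
        "results": {"primary_metric": "task_success_rate",
                    "estimate_pp": 1.5, "ci_pp": [0.3, 2.7]},
        "limitations": ["no coverage of non-English locales"],
    },
    "regulatory_anchors": ["annex-iv", "clause-9.1", "measure"],
}

def extract_evidence(metadata: dict, anchor: str) -> dict:
    """Pull the fields an evidence set needs, or nothing if the report
    does not claim the anchor. A sketch of the Article 13 extractor."""
    if anchor not in metadata.get("regulatory_anchors", []):
        return {}
    return {
        "experiment_id": metadata["experiment_id"],
        "brief_id": metadata["brief_id"],
        "results": metadata["sections"].get("results", {}),
        "limitations": metadata["sections"].get("limitations", []),
    }

print(extract_evidence(report_metadata, "annex-iv"))
```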

The capstone of this credential

This credential has taught the practitioner a vocabulary (four experiment modes, three metric roles, five partition slices, five leakage classes, four search strategies, eight tracking artifacts, six pipeline stages, five CI layers, five lifecycle states, five LLM evaluation modes, eight red-team technique classes, four cost categories, and three regulatory anchors). It has taught the discipline that turns the vocabulary into practice. And it has taught the two artifacts — brief and report — that carry the vocabulary and discipline out of the practitioner’s head and into the organization.

The exam that closes this credential tests application, not memorization. A passing practitioner can take a described AI change, classify its experiment mode, propose a hypothesis and metric, design an evaluation protocol, budget it, execute it through CI and CD, integrate LLM and red-team layers where appropriate, and produce the brief and the report that make the experiment auditable. That is the job.

Summary

Two artifacts bookend every serious experiment. The brief, written before execution, has eight sections: title, hypothesis, mode, metrics, evaluation protocol, budget, decision rule, reproducibility and governance. The report, written after, has nine: title and brief reference, summary, method as executed, results, interpretation, limitations, recommendation, next experiment, reproducibility pointer. Booking.com and Spotify are reference programs. Claim-evidence fit is the single most important discipline in the report. Brief and report integrate with the regulatory documentation pipeline so that evidence is a by-product of doing the work. The capstone of this credential is the practitioner who can produce both for any AI change at any tier. The exam tests that capstone.

Further reading in the Core Stream: From Measurement to Decision.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Ron Kohavi, Diane Tang, Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020. https://experimentguide.com/ — accessed 2026-04-19.

  2. Booking.com engineering blog — experimentation platform series. https://booking.ai/ — accessed 2026-04-19.

  3. Spotify Engineering Blog — Data category. https://engineering.atspotify.com/category/data/ — accessed 2026-04-19.