AITM-ECI: AI Experimentation Associate — Body of Knowledge Artifact Template
How to use this template
This template is the companion artifact to Article 14. An experiment brief is authored before the experiment starts, reviewed before execution, and committed to the experiment tracking system as the pre-registration record. The brief is typically one to two pages when rendered in a normal document style; the sections below map directly onto the eight-section structure in Article 14.
Every field is required. Empty fields are rejected by the review gate. Fields for which the answer is “not applicable” are filled with “not applicable” and a one-sentence rationale.
Copy the template, rename it with the experiment’s name, and fill in each section. Leave the frontmatter intact so that the brief is recognized by the documentation extractor.
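The exact frontmatter keys depend on the documentation extractor in use; the fragment below is purely illustrative (every field name is hypothetical) and should be replaced with whatever schema your extractor actually expects:

```yaml
# Illustrative only — match your extractor's real schema.
artifact: experiment-brief
template: aitm-eci-article-14
experiment_id: ""        # filled in from the tracking system
risk_tier: ""            # low / moderate / high
status: draft            # draft -> reviewed -> committed
```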
Experiment Brief — [Experiment Name]
1. Identification
| Field | Value |
|---|---|
| Experiment name | [short name, e.g., “TicketSummary prompt v3”] |
| Experiment ID | [tracking-system-assigned ID] |
| Owner (single accountable individual) | [name, role] |
| Peer reviewer | [name, role] |
| Governance reviewer (if applicable) | [name, role; required for high-risk or regulated systems] |
| Date authored | YYYY-MM-DD |
| Planned start | YYYY-MM-DD |
| Planned end | YYYY-MM-DD |
| Feature or system under test | [name; link to feature/system record] |
| Risk tier | [low / moderate / high / high-risk per EU AI Act / other] |
| Regulated-system flags | [EU AI Act high-risk, ISO 42001 scope, HIPAA, PCI, other; or “none”] |
2. Hypothesis
Using the four-element structure from Article 2: subject, predicted effect, measurement, threshold.
- Subject. [What is the change? Which model, prompt, retrieval policy, feature, threshold, configuration, or combination is the object of the experiment?]
- Predicted effect. [What observable quantity does the change alter, and in which direction?]
- Measurement. [How will the effect be observed? On what population? Over what window? By what instrument?]
- Threshold. [What size of effect, detected with what confidence, triggers what decision?]
One sentence stating the hypothesis in falsifiable form:
[Hypothesis sentence, e.g., “Changing the retrieval reranker from cross-encoder A to cross-encoder B will increase answer-acceptance rate on English-language sessions in the last 14 days of the experiment window by at least 1.5 percentage points at p < 0.01 two-sided.”]
3. Experiment mode
Using the vocabulary from Article 1. Select all that apply and specify sequence if multiple.
- Offline evaluation
- Shadow deployment
- Online A/B test
- Online multi-armed bandit
- Canary deployment with guardrail-only evaluation
- Adversarial / red-team experiment
- Other (specify)
Sequence and gating (if multi-mode): [e.g., “Offline first; if offline metrics meet threshold, shadow for 14 days; if shadow divergence within tolerance, canary at 1% then ramp.”]
Rationale for the mode choice in one paragraph:
[Why these modes and not others; what gap each mode closes that the previous mode could not.]
4. Metrics
Primary metric
| Field | Value |
|---|---|
| Name | [metric name, linked to metric catalog if one exists] |
| Operational definition | [precise definition, including numerator, denominator, window, filters] |
| Owner | [name or team] |
| Expected baseline | [baseline value with 95% interval, from recent production] |
| Target MDE (minimum detectable effect) | [e.g., 1.5 absolute percentage points] |
| Confidence and power | [e.g., alpha 0.01 two-sided, power 80%] |
Secondary metrics (short list)
| Name | Definition | Rationale |
|---|---|---|
| [metric 1] | […] | [why included] |
| [metric 2] | […] | [why included] |
| [metric 3] | […] | [why included] |
Guardrail metrics (short list)
| Name | Definition | Breach threshold | Action on breach |
|---|---|---|---|
| [metric 1, e.g., latency p99] | […] | [e.g., +20% over baseline for 5 min rolling window] | [automatic rollback / page owner / extend review] |
| [metric 2, e.g., error rate] | […] | […] | […] |
| [metric 3, e.g., per-request cost for generative features] | […] | […] | […] |
| [metric 4, e.g., safety signal] | […] | […] | […] |
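A breach threshold like "+20% over baseline for a 5-minute rolling window" is only actionable if "sustained" is defined precisely. One way to remove the ambiguity is to write the check down as code. The sketch below is illustrative, not a prescribed implementation; the function name and defaults are assumptions that should be adapted to the actual guardrail definitions in the table:

```python
def sustained_breach(samples, baseline, rel_limit=0.20, window_s=300):
    """True when every sample inside the trailing window exceeds
    baseline * (1 + rel_limit) — i.e., the breach is sustained,
    not a momentary spike.

    samples: list of (epoch_seconds, value) pairs, sorted by time.
    """
    if not samples:
        return False
    t_end = samples[-1][0]
    window = [v for t, v in samples if t >= t_end - window_s]
    return all(v > baseline * (1 + rel_limit) for v in window)
```

Requiring every sample in the window to exceed the threshold (rather than the mean) is one reasonable reading of "for 5 min rolling window"; a team that prefers a mean-over-window reading should encode that instead, and record whichever reading it chose in the brief.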
Adversarial review of the primary metric
One paragraph per Article 2: how might the primary metric move without the feature actually being better?
[Adversarial-review paragraph; 3–6 sentences.]
5. Evaluation protocol
Population and scope
- Included population: [description; e.g., “English-language sessions from organizations on paid plans”]
- Excluded population: [description with rationale]
- Randomization unit: [user / session / account / request / other]
- Stratification (if any): [strata and rationale]
Sample size and duration
| Field | Value |
|---|---|
| Sample size per variant | [number] |
| Traffic allocation | [e.g., 50/50] |
| Variance-reduction technique (if any) | [CUPED, regression adjustment, stratified sampling, or none] |
| Expected duration | [days, with assumption on traffic rate] |
| Sequential-testing framework | [always-valid confidence sequence, mSPRT, fixed-horizon] |
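The "sample size per variant" row can be derived from the baseline, MDE, alpha, and power already stated in §4 rather than guessed. A minimal fixed-horizon sketch for a two-proportion comparison (normal approximation, equal allocation; the function name is illustrative):

```python
import math
from statistics import NormalDist

def n_per_variant(p0: float, mde: float,
                  alpha: float = 0.01, power: float = 0.80) -> int:
    """Per-variant sample size for a two-sided two-proportion z-test
    (normal approximation). p0 is the baseline rate; mde is the absolute
    effect to detect (e.g., 0.015 for 1.5 percentage points)."""
    z = NormalDist().inv_cdf
    p1 = p0 + mde
    z_a, z_b = z(1 - alpha / 2), z(power)
    p_bar = (p0 + p1) / 2
    root = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
    return math.ceil(root ** 2 / mde ** 2)
```

For example, a 40% baseline acceptance rate with a 1.5-point MDE at alpha 0.01 and 80% power needs on the order of 25,000 sessions per variant. Note this is a fixed-horizon figure; always-valid confidence sequences and mSPRT-style frameworks generally require more samples for the same guarantees, so size accordingly if one of those is named above.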
Slice evaluation
Slices to be reported separately (from Article 3):
- [Slice 1, e.g., geography]
- [Slice 2, e.g., segment]
- [Slice 3, e.g., temporal cohort]
- [Slice 4]
Minimum per-slice sample size to render the per-slice metric reliable: [number]
Shadow or canary prerequisites (if mode includes them)
- [Description of shadow duration and comparison protocol]
- [Description of canary ramp steps and minimum time per step]
6. Budget
From Article 12. All three categories below — compute cost, human-review cost, and triggers — are required.
Compute cost
| Category | Estimate | Unit |
|---|---|---|
| Training compute (if any) | [amount] | [GPU-hours, dollars] |
| Evaluation compute | [amount] | [GPU-hours, dollars] |
| Online-serving compute (for the experiment window) | [amount] | [dollars] |
| Per-request LLM cost (for generative features) | [amount] | [dollars total over window] |
Human-review cost
| Category | Estimate | Unit |
|---|---|---|
| Reviewer hours per week | [count] | hours |
| Weeks of review | [count] | weeks |
| Total reviewer hours | [count] | hours |
Triggers
- Cost trigger (escalates for additional approval): [amount; e.g., “if aggregate spend exceeds 1.3× estimate”]
- Stopping cost (halts experiment regardless of results): [amount; e.g., “if aggregate spend exceeds 2× estimate”]
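The two trigger levels reduce to a comparison the spend monitor can run continuously. An illustrative sketch (the multipliers match the examples above but are meant to be configured per experiment):

```python
def budget_action(spend: float, estimate: float,
                  escalate_mult: float = 1.3, stop_mult: float = 2.0) -> str:
    """Map aggregate spend to the budget action defined in §6.

    'stop' halts the experiment regardless of results; 'escalate'
    requires additional approval before spend continues.
    """
    if spend >= stop_mult * estimate:
        return "stop"
    if spend >= escalate_mult * estimate:
        return "escalate"
    return "continue"
```

Checking the stopping cost before the escalation trigger matters: once spend passes 2× the estimate, the halt applies even if the earlier escalation was approved.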
7. Decision rule and stopping rule
Ship rule
The experiment supports a ship decision if and only if:
- [Primary metric condition, e.g., “primary metric improves by at least 1.5 percentage points at p < 0.01”]
- [Guardrail condition, e.g., “no guardrail metric breached at any point in the experiment window”]
- [Slice condition, e.g., “no per-slice metric degrades by more than 3% relative to baseline”]
- [Any other conditions]
Rollback rule
Beyond the guardrail triggers in §4, the experiment triggers full rollback if:
- [Condition; e.g., “primary metric degrades by more than 0.5 percentage points at p < 0.01”]
- [Condition; e.g., “safety-signal rate exceeds X per thousand requests for more than 1 hour”]
- [Condition]
Extension rule
The experiment is extended past planned duration if:
- [Condition; e.g., “result is inconclusive (neither ship nor rollback) and duration extension remains within stopping-cost budget”]
- [Condition]
Peeking policy (select one)
- No peeking allowed before planned end.
- Interim peeks allowed under the sequential-testing framework named in §5; peek results must be recorded in the tracking system.
- Other (specify): […]
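The ship, rollback, and extension rules above are meant to be mutually exclusive, and that is easiest to verify when they are expressed as one function: every readout maps to exactly one action. A hedged sketch using the example thresholds from this section (all parameter names are illustrative, not from the template):

```python
def decide(lift_pp: float, p_value: float,
           guardrail_breached: bool, worst_slice_rel: float,
           ship_mde_pp: float = 1.5, alpha: float = 0.01,
           rollback_pp: float = -0.5, slice_floor_rel: float = -0.03) -> str:
    """Return 'rollback', 'ship', or 'extend' from experiment readouts.

    lift_pp: primary-metric lift in percentage points;
    worst_slice_rel: worst per-slice relative change (-0.02 means -2%).
    """
    significant = p_value < alpha
    if guardrail_breached or (significant and lift_pp <= rollback_pp):
        return "rollback"
    if significant and lift_pp >= ship_mde_pp and worst_slice_rel >= slice_floor_rel:
        return "ship"
    return "extend"  # inconclusive: the extension rule applies,
                     # subject to the stopping-cost budget in §6
```

Rollback is checked first so a guardrail breach can never be outvoted by a strong primary-metric result, which matches the "if and only if" framing of the ship rule.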
8. Reproducibility and governance
Tracking
| Field | Value |
|---|---|
| Tracking system | [MLflow, W&B, Neptune, Aim, SageMaker Experiments, Vertex AI Experiments, Azure ML, Databricks, or another] |
| Experiment record ID | [tracking-system ID] |
| Code repository | [URL, plus branch and commit SHA at experiment start] |
| Data version (if applicable) | [DVC tag, feature-store version, LakeFS branch, or other] |
| Container image (if applicable) | [registry and SHA] |
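Whatever tracking system is chosen, the committed brief benefits from an immutable fingerprint: hash the canonical record at approval time, and any post-authorization modification becomes detectable at re-approval. A stdlib-only sketch (field names mirror the table above; no specific tracking system is assumed):

```python
import hashlib
import json

def brief_fingerprint(record: dict) -> str:
    """SHA-256 digest over the canonical JSON form of the brief record.

    Canonicalization (sorted keys, no whitespace) makes the digest
    independent of field order, so any change to the digest implies
    a change to the record's content.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest alongside the approval rows in §8 gives the governance reviewer a one-line check that the executed brief is the approved brief.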
Regulatory mapping
Which regulatory anchors does this experiment produce evidence for? (From Article 13.)
- EU AI Act Annex IV §§2, 3, 4, 5
- EU AI Act Article 12 record-keeping
- ISO/IEC 42001 Clause 9.1 monitoring records
- NIST AI RMF MEASURE subcategories (list which)
- Other (specify)
- None (non-regulated system)
Retention
| Field | Value |
|---|---|
| Retention system | [e.g., SharePoint, Confluence, document management system] |
| Retention duration | [e.g., 10 years for EU AI Act high-risk, per retention policy] |
| Access controls | [who can read, who can modify before commit, immutability policy] |
Review and sign-off
| Role | Name | Decision | Date | Notes |
|---|---|---|---|---|
| Owner | […] | Authored | YYYY-MM-DD | |
| Peer reviewer | […] | Approved / requested changes | YYYY-MM-DD | |
| Governance reviewer (if applicable) | […] | Approved / requested changes / held | YYYY-MM-DD | |
Execution is not authorized until all required rows show “Approved” and the brief is committed to the tracking system. Modifications after authorization require re-approval; the modification, the rationale, and the re-approval are recorded in the tracking system with the original brief.
Notes on use
When to use this template. Every experiment on a deployed or soon-to-be-deployed AI feature, regardless of risk tier. High-risk systems must use it; low-risk systems benefit from its discipline.
When a lighter template is acceptable. A very small offline experiment (a single training run for a personal hypothesis, run in under a day, with no production implications) can use a shortened version that omits §§5–8 and fits on half a page. The full template returns as soon as the experiment has any production pathway.
Common errors in first-time use. Vague predicted effect (“will improve X”); missing guardrails; no decision rule; no stopping cost; no regulatory mapping for regulated systems. A peer reviewer catches these; a governance reviewer treats them as blocking.
What follows. The companion artifact is the experiment report, authored at conclusion (Article 14 §2). Every brief has a report; every report references the brief.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.