AITM-ECI: AI Experimentation Associate — Body of Knowledge Artifact Template
How to use this template
This template is the companion artifact to Article 14. An experiment brief is authored before the experiment starts, reviewed before execution, and committed to the experiment tracking system as the pre-registration record. The brief is typically one to two pages when rendered in a normal document style; the sections below map directly onto the eight-section structure in Article 14.
Every field is required. Empty fields are rejected by the review gate. Fields for which the answer is “not applicable” are filled with “not applicable” and a one-sentence rationale.
Copy the template, rename it with the experiment’s name, and fill in each section. Leave the frontmatter intact so that the brief is recognized by the documentation extractor.
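The exact frontmatter keys depend on the documentation extractor in use; the fragment below is purely illustrative (every field name is hypothetical) and should be replaced with whatever schema your extractor actually expects:

```yaml
# Illustrative only — match your extractor's real schema.
artifact: experiment-brief
template: aitm-eci-article-14
experiment_id: ""        # filled in from the tracking system
risk_tier: ""            # low / moderate / high
status: draft            # draft -> reviewed -> committed
```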
Experiment Brief — [Experiment Name]
1. Identification
| Field | Value |
|---|---|
| Experiment name | [short name, e.g., “TicketSummary prompt v3”] |
| Experiment ID | [tracking-system-assigned ID] |
| Owner (single accountable individual) | [name, role] |
| Peer reviewer | [name, role] |
| Governance reviewer (if applicable) | [name, role; required for high-risk or regulated systems] |
| Date authored | YYYY-MM-DD |
| Planned start | YYYY-MM-DD |
| Planned end | YYYY-MM-DD |
| Feature or system under test | [name; link to feature/system record] |
| Risk tier | [low / moderate / high / high-risk per EU AI Act / other] |
| Regulated-system flags | [EU AI Act high-risk, ISO 42001 scope, HIPAA, PCI, other; or “none”] |
2. Hypothesis
Using the four-element structure from Article 2: subject, predicted effect, measurement, threshold.
- Subject. [What is the change? Which model, prompt, retrieval policy, feature, threshold, configuration, or combination is the object of the experiment?]
- Predicted effect. [What observable quantity does the change alter, and in which direction?]
- Measurement. [How will the effect be observed? On what population? Over what window? By what instrument?]
- Threshold. [What size of effect, detected with what confidence, triggers what decision?]
One sentence stating the hypothesis in falsifiable form:
[Hypothesis sentence, e.g., “Changing the retrieval reranker from cross-encoder A to cross-encoder B will increase answer-acceptance rate on English-language sessions in the last 14 days of the experiment window by at least 1.5 percentage points at p < 0.01 two-sided.”]
3. Experiment mode
Using the vocabulary from Article 1. Select all that apply and specify sequence if multiple.
- Offline evaluation
- Shadow deployment
- Online A/B test
- Online multi-armed bandit
- Canary deployment with guardrail-only evaluation
- Adversarial / red-team experiment
- Other (specify)
Sequence and gating (if multi-mode): [e.g., “Offline first; if offline metrics meet threshold, shadow for 14 days; if shadow divergence within tolerance, canary at 1% then ramp.”]
Rationale for the mode choice in one paragraph:
[Why these modes and not others; what gap each mode closes that the previous mode could not.]
4. Metrics
Primary metric
| Field | Value |
|---|---|
| Name | [metric name, linked to metric catalog if one exists] |
| Operational definition | [precise definition, including numerator, denominator, window, filters] |
| Owner | [name or team] |
| Expected baseline | [baseline value with 95% interval, from recent production] |
| Target MDE (minimum detectable effect) | [e.g., 1.5 absolute percentage points] |
| Confidence and power | [e.g., alpha 0.01 two-sided, power 80%] |
Secondary metrics (short list)
| Name | Definition | Rationale |
|---|---|---|
| [metric 1] | […] | [why included] |
| [metric 2] | […] | [why included] |
| [metric 3] | […] | [why included] |
Guardrail metrics (short list)
| Name | Definition | Breach threshold | Action on breach |
|---|---|---|---|
| [metric 1, e.g., latency p99] | […] | [e.g., +20% over baseline for 5 min rolling window] | [automatic rollback / page owner / extend review] |
| [metric 2, e.g., error rate] | […] | […] | […] |
| [metric 3, e.g., per-request cost for generative features] | […] | […] | […] |
| [metric 4, e.g., safety signal] | […] | […] | […] |
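A breach threshold like "+20% over baseline for a 5-minute rolling window" is only actionable if "sustained" is defined precisely. One way to remove the ambiguity is to write the check down as code. The sketch below is illustrative, not a prescribed implementation; the function name and defaults are assumptions that should be adapted to the actual guardrail definitions in the table:

```python
def sustained_breach(samples, baseline, rel_limit=0.20, window_s=300):
    """True when every sample inside the trailing window exceeds
    baseline * (1 + rel_limit) — i.e., the breach is sustained,
    not a momentary spike.

    samples: list of (epoch_seconds, value) pairs, sorted by time.
    """
    if not samples:
        return False
    t_end = samples[-1][0]
    window = [v for t, v in samples if t >= t_end - window_s]
    return all(v > baseline * (1 + rel_limit) for v in window)
```

Requiring every sample in the window to exceed the threshold (rather than the mean) is one reasonable reading of "for 5 min rolling window"; a team that prefers a mean-over-window reading should encode that instead, and record whichever reading it chose in the brief.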
Adversarial review of the primary metric
One paragraph per Article 2: how might the primary metric move without the feature actually being better?
[Adversarial-review paragraph; 3–6 sentences.]
5. Evaluation protocol
Population and scope
- Included population: [description; e.g., “English-language sessions from organizations on paid plans”]
- Excluded population: [description with rationale]
- Randomization unit: [user / session / account / request / other]
- Stratification (if any): [strata and rationale]
Sample size and duration
| Field | Value |
|---|---|
| Sample size per variant | [number] |
| Traffic allocation | [e.g., 50/50] |
| Variance-reduction technique (if any) | [CUPED, regression adjustment, stratified sampling, or none] |
| Expected duration | [days, with assumption on traffic rate] |
| Sequential-testing framework | [always-valid confidence sequence, mSPRT, fixed-horizon] |
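The "sample size per variant" row can be derived from the baseline, MDE, alpha, and power already stated in §4 rather than guessed. A minimal fixed-horizon sketch for a two-proportion comparison (normal approximation, equal allocation; the function name is illustrative):

```python
import math
from statistics import NormalDist

def n_per_variant(p0: float, mde: float,
                  alpha: float = 0.01, power: float = 0.80) -> int:
    """Per-variant sample size for a two-sided two-proportion z-test
    (normal approximation). p0 is the baseline rate; mde is the absolute
    effect to detect (e.g., 0.015 for 1.5 percentage points)."""
    z = NormalDist().inv_cdf
    p1 = p0 + mde
    z_a, z_b = z(1 - alpha / 2), z(power)
    p_bar = (p0 + p1) / 2
    root = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
            + z_b * math.sqrt(p0 * (1 - p0) + p1 * (1 - p1)))
    return math.ceil(root ** 2 / mde ** 2)
```

For example, a 40% baseline acceptance rate with a 1.5-point MDE at alpha 0.01 and 80% power needs on the order of 25,000 sessions per variant. Note this is a fixed-horizon figure; always-valid confidence sequences and mSPRT-style frameworks generally require more samples for the same guarantees, so size accordingly if one of those is named above.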
Slice evaluation
Slices to be reported separately (from Article 3):
- [Slice 1, e.g., geography]
- [Slice 2, e.g., segment]
- [Slice 3, e.g., temporal cohort]
- [Slice 4]
Minimum per-slice sample size to render the per-slice metric reliable: [number]
Shadow or canary prerequisites (if mode includes them)
- [Description of shadow duration and comparison protocol]
- [Description of canary ramp steps and minimum time per step]
6. Budget
From Article 12. All three categories below — compute cost, human-review cost, and triggers — are required.
Compute cost
| Category | Estimate | Unit |
|---|---|---|
| Training compute (if any) | [amount] | [GPU-hours, dollars] |
| Evaluation compute | [amount] | [GPU-hours, dollars] |
| Online-serving compute (for the experiment window) | [amount] | [dollars] |
| Per-request LLM cost (for generative features) | [amount] | [dollars total over window] |
Human-review cost
| Category | Estimate | Unit |
|---|---|---|
| Reviewer hours per week | [count] | hours |
| Weeks of review | [count] | weeks |
| Total reviewer hours | [count] | hours |
Triggers
- Cost trigger (escalates for additional approval): [amount; e.g., “if aggregate spend exceeds 1.3× estimate”]
- Stopping cost (halts experiment regardless of results): [amount; e.g., “if aggregate spend exceeds 2× estimate”]
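The two trigger levels reduce to a comparison the spend monitor can run continuously. An illustrative sketch (the multipliers match the examples above but are meant to be configured per experiment):

```python
def budget_action(spend: float, estimate: float,
                  escalate_mult: float = 1.3, stop_mult: float = 2.0) -> str:
    """Map aggregate spend to the budget action defined in §6.

    'stop' halts the experiment regardless of results; 'escalate'
    requires additional approval before spend continues.
    """
    if spend >= stop_mult * estimate:
        return "stop"
    if spend >= escalate_mult * estimate:
        return "escalate"
    return "continue"
```

Checking the stopping cost before the escalation trigger matters: once spend passes 2× the estimate, the halt applies even if the earlier escalation was approved.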
7. Decision rule and stopping rule
Ship rule
The experiment supports a ship decision if and only if:
- [Primary metric condition, e.g., “primary metric improves by at least 1.5 percentage points at p < 0.01”]
- [Guardrail condition, e.g., “no guardrail metric breached at any point in the experiment window”]
- [Slice condition, e.g., “no per-slice metric degrades by more than 3% relative to baseline”]
- [Any other conditions]
Rollback rule
Beyond the guardrail triggers in §4, the experiment triggers full rollback if:
- [Condition; e.g., “primary metric degrades by more than 0.5 percentage points at p < 0.01”]
- [Condition; e.g., “safety-signal rate exceeds X per thousand requests for more than 1 hour”]
- [Condition]
Extension rule
The experiment is extended past planned duration if:
- [Condition; e.g., “result is inconclusive (neither ship nor rollback) and duration extension remains within stopping-cost budget”]
- [Condition]
Peeking policy (select one)
- No peeking allowed before planned end.
- Interim peeks allowed under the sequential-testing framework named in §5; peek results must be recorded in the tracking system.
- Other (specify): […]
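The ship, rollback, and extension rules above are meant to be mutually exclusive, and that is easiest to verify when they are expressed as one function: every readout maps to exactly one action. A hedged sketch using the example thresholds from this section (all parameter names are illustrative, not from the template):

```python
def decide(lift_pp: float, p_value: float,
           guardrail_breached: bool, worst_slice_rel: float,
           ship_mde_pp: float = 1.5, alpha: float = 0.01,
           rollback_pp: float = -0.5, slice_floor_rel: float = -0.03) -> str:
    """Return 'rollback', 'ship', or 'extend' from experiment readouts.

    lift_pp: primary-metric lift in percentage points;
    worst_slice_rel: worst per-slice relative change (-0.02 means -2%).
    """
    significant = p_value < alpha
    if guardrail_breached or (significant and lift_pp <= rollback_pp):
        return "rollback"
    if significant and lift_pp >= ship_mde_pp and worst_slice_rel >= slice_floor_rel:
        return "ship"
    return "extend"  # inconclusive: the extension rule applies,
                     # subject to the stopping-cost budget in §6
```

Rollback is checked first so a guardrail breach can never be outvoted by a strong primary-metric result, which matches the "if and only if" framing of the ship rule.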
8. Reproducibility and governance
Tracking
| Field | Value |
|---|---|
| Tracking system | [MLflow, W&B, Neptune, Aim, SageMaker Experiments, Vertex AI Experiments, Azure ML, Databricks, or another] |
| Experiment record ID | [tracking-system ID] |
| Code repository | [URL, plus branch and commit SHA at experiment start] |
| Data version (if applicable) | [DVC tag, feature-store version, LakeFS branch, or other] |
| Container image (if applicable) | [registry and SHA] |
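Whatever tracking system is chosen, the committed brief benefits from an immutable fingerprint: hash the canonical record at approval time, and any post-authorization modification becomes detectable at re-approval. A stdlib-only sketch (field names mirror the table above; no specific tracking system is assumed):

```python
import hashlib
import json

def brief_fingerprint(record: dict) -> str:
    """SHA-256 digest over the canonical JSON form of the brief record.

    Canonicalization (sorted keys, no whitespace) makes the digest
    independent of field order, so any change to the digest implies
    a change to the record's content.
    """
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the digest alongside the approval rows in §8 gives the governance reviewer a one-line check that the executed brief is the approved brief.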
Regulatory mapping
Which regulatory anchors does this experiment produce evidence for? (From Article 13.)
- EU AI Act Annex IV §§2, 3, 4, 5
- EU AI Act Article 12 record-keeping
- ISO/IEC 42001 Clause 9.1 monitoring records
- NIST AI RMF MEASURE subcategories (list which)
- Other (specify)
- None (non-regulated system)
Retention
| Field | Value |
|---|---|
| Retention system | [e.g., SharePoint, Confluence, document management system] |
| Retention duration | [e.g., 10 years for EU AI Act high-risk, per retention policy] |
| Access controls | [who can read, who can modify before commit, immutability policy] |
Review and sign-off
| Role | Name | Decision | Date | Notes |
|---|---|---|---|---|
| Owner | […] | Authored | YYYY-MM-DD | |
| Peer reviewer | […] | Approved / requested changes | YYYY-MM-DD | |
| Governance reviewer (if applicable) | […] | Approved / requested changes / held | YYYY-MM-DD | |
Execution is not authorized until all required rows show “Approved” and the brief is committed to the tracking system. Modifications after authorization require re-approval; the modification, the rationale, and the re-approval are recorded in the tracking system with the original brief.
Notes on use
When to use this template. Every experiment on a deployed or soon-to-be-deployed AI feature, regardless of risk tier. High-risk systems must use it; low-risk systems benefit from its discipline.
When a lighter template is acceptable. A very small offline experiment (a single training run for a personal hypothesis, run in under a day, with no production implications) can use a shortened version that omits §§5–8 and fits on half a page. The full template returns as soon as the experiment has any production pathway.
Common errors in first-time use. Vague predicted effect (“will improve X”); missing guardrails; no decision rule; no stopping cost; no regulatory mapping for regulated systems. A peer reviewer catches these; a governance reviewer treats them as blocking.
What follows. The companion artifact is the experiment report, authored at conclusion (Article 14 §2). Every brief has a report; every report references the brief.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.