AITM-ECI: AI Experimentation Associate — Body of Knowledge Lab Notebook 2 of 2
Scenario
Your organization is a B2B SaaS company that ships a customer-service ticket management product used by support teams at roughly 8,000 customer organizations worldwide. Your product team has built a generative AI feature called TicketSummary that produces a three-sentence summary at the top of every open ticket, designed to give the assigned support agent faster context. The feature is powered by a managed LLM API behind a feature-flag layer; the platform team can control which customer organizations and which ticket categories receive the feature. You have been asked to design the online A/B test that decides whether to ship TicketSummary to the full fleet.
The product owner’s stated hypothesis is “summaries will reduce average time-to-first-response by support agents by 10% or more”. Baseline time-to-first-response is 6.2 minutes across the fleet, with high variance across customer organizations and ticket categories. Agents currently handle approximately 1.4M tickets per week across the fleet.
Part 1: Hypothesis refinement and metric design (25 minutes)
The product owner’s statement is a starting point, not a hypothesis. Refine it into the four-element hypothesis structure from Article 2 (subject, predicted effect, measurement, threshold). Then design the metric set.
Produce:
- A refined hypothesis in the four-element structure. The predicted-effect and threshold must be concrete and falsifiable.
- A primary metric with an operational definition: how it is measured, over what population, over what window, with what inclusion/exclusion criteria.
- Two to four secondary metrics that help interpret primary-metric movement.
- Three to five guardrail metrics that must not degrade. Consider: error rate, latency p99, per-ticket cost (this is a generative feature with per-token cost), agent satisfaction (survey-based), and at least one metric that represents downstream customer outcome (ticket resolution quality or customer CSAT).
- A one-paragraph adversarial review of the primary metric (from Article 2): how might the metric move without the feature actually being better?
Expected artifact: Hypothesis-and-Metrics.md with the refined hypothesis, the metric table, and the adversarial-review paragraph.
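As one illustration of what an operational metric definition can look like when written as code, the sketch below computes time-to-first-response with explicit inclusion/exclusion criteria. The column names (`created_at`, `first_agent_reply_at`, `auto_closed`) are assumptions for the sketch, not part of the exercise:

```python
import pandas as pd

def ttfr_minutes(tickets: pd.DataFrame) -> pd.Series:
    """Time-to-first-response in minutes for in-scope tickets.

    Exclusion criteria (illustrative): tickets that were auto-closed,
    and tickets with no agent reply recorded in the measurement window.
    """
    in_scope = tickets[
        tickets["first_agent_reply_at"].notna()  # agent actually replied
        & ~tickets["auto_closed"]                # drop auto-closed tickets
    ]
    delta = in_scope["first_agent_reply_at"] - in_scope["created_at"]
    return delta.dt.total_seconds() / 60.0
```

Writing the definition this precisely is what makes the adversarial review possible: every exclusion is visible and can be challenged.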
Part 2: Sample-size calculation and duration planning (30 minutes)
Design the sample-size and duration plan.
Required elements:
- The minimum detectable effect (MDE) you will target. Justify the choice given the product owner’s stated 10% threshold and the typical variance in the baseline.
- The sample size per variant required to detect the MDE at 80% power and 1% significance, given the baseline time-to-first-response of 6.2 minutes and a reasonable variance assumption (state your assumption). Show the calculation.
- The traffic allocation (50/50, 90/10, or other). Justify.
- The variance-reduction strategy (Article 12). Will you use CUPED? Stratified assignment by customer organization? Regression adjustment? Document the choice.
- The expected duration in days given the current ticket volume and the allocation.
- The sequential-testing policy: will you use an always-valid confidence sequence, mSPRT, or fixed-horizon? Name your framework or platform and state the rationale in one sentence.
Include a contingency: if the expected duration exceeds 4 weeks, propose either a broader allocation or a more concentrated deployment plan.
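The per-variant sample-size calculation above can be sketched with a standard two-sample z-approximation. The standard-deviation value in the usage line is an illustrative assumption; in your artifact you would substitute the fleet's measured variance:

```python
import math
from statistics import NormalDist

def n_per_variant(baseline_mean, rel_mde, sd, alpha=0.01, power=0.80):
    """Tickets needed per variant to detect a relative change in mean
    time-to-first-response, via the two-sample z-approximation."""
    delta = baseline_mean * rel_mde                 # absolute MDE, minutes
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# Illustrative assumption: per-ticket TTFR standard deviation of 8 minutes.
n = n_per_variant(6.2, rel_mde=0.10, sd=8.0)
```

Note how sensitive the answer is to the variance assumption: the required n scales with sd², which is why the variance-reduction choice in the next bullet directly shortens the experiment.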
Expected artifact: Sample-Size-and-Duration.md with the calculation and the contingency.
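As a sketch of the CUPED option named above, assuming a pre-experiment covariate is available per unit (for example, each organization's historical mean time-to-first-response):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED adjustment: subtract the covariate-explained component of
    the metric. y = in-experiment metric per unit; x = pre-experiment
    covariate per unit. The mean of y is preserved; its variance drops
    by roughly corr(x, y)**2."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope of y on x
    return y - theta * (x - x.mean())
```

The expected sensitivity improvement your artifact must state is exactly that corr² factor, estimated from pre-experiment data.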
Part 3: Canary ramp and rollback criteria (30 minutes)
Design the canary ramp that will precede the A/B test and the rollback criteria that govern both.
The ramp design must specify:
- Traffic-share steps (a 1% → 10% → 50% → 100% progression is typical; propose your version).
- Minimum time per step before automatic ramp to the next.
- Automated rollback triggers. For each of your Part 1 guardrails, state the threshold breach that triggers rollback. Be specific (not “error rate goes up”, but “error rate p95 exceeds 1.5% over a 5-minute rolling window”).
- Human-escalation triggers. For signals that are not automatic rollbacks but need a human decision, name the signal and the named role that is paged.
- Shadow prerequisite. Will you run shadow before canary? Given this is a user-facing generative feature with moderate risk (it aids but does not replace the human agent), justify the decision.
A concrete consideration for TicketSummary specifically: the feature has per-ticket cost driven by prompt and completion tokens. A cost regression that drives per-ticket cost above budget is a business-critical failure even if quality holds. Include an explicit per-ticket-cost rollback trigger.
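A minimal sketch of such a cost trigger, with an illustrative budget and window (the real values belong in your rollback-trigger table, not here):

```python
from collections import deque

class RollingMeanTrigger:
    """Fires when the rolling mean of a per-ticket measurement (here,
    per-ticket cost in USD) exceeds a budget over the last N tickets.
    Budget and window values are illustrative assumptions."""

    def __init__(self, budget, window=500):
        self.budget = budget
        self.values = deque(maxlen=window)

    def record(self, value):
        self.values.append(value)

    def should_roll_back(self):
        # Require a full window so a handful of unusually expensive
        # tickets cannot fire the trigger on their own.
        if len(self.values) < self.values.maxlen:
            return False
        return sum(self.values) / len(self.values) > self.budget
```

The same shape (measurement, window, threshold, action) applies to every automated trigger in the table; what varies is whether the aggregate is a mean, a rate, or a percentile.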
Expected artifact: Canary-and-Rollback.md with the ramp table, the rollback trigger table, and the shadow-prerequisite rationale.
Part 4: Pre-registration and decision-rule artifact (20 minutes)
Assemble the pre-registration. This is the document that will sit in the experiment tracking system before the experiment starts; at experiment conclusion, it is what the governance reviewer will compare the recorded results against.
Required sections:
- The hypothesis (from Part 1).
- The primary, secondary, guardrail metric definitions (from Part 1).
- The sample-size, duration, and allocation plan (from Part 2).
- The canary ramp and rollback criteria (from Part 3).
- The decision rule: what primary metric movement, at what confidence, with what guardrail condition, triggers ship; what triggers rollback (beyond the canary triggers); what triggers extension.
- The stopping rule: maximum duration, early-stop condition for a confident win or loss, peeking policy (are interim peeks allowed; by whom; with what recorded justification).
- The owner, reviewer, and approver. Specific role names, not “the team”.
- The tracking system identifier where the experiment will be recorded. Name the tool you would actually use.
The pre-registration is approximately one page. It is the source of truth; everything else in the package defers to it.
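One way to leave no room for post-hoc reinterpretation is to express the decision rule as executable logic. The sketch below assumes the confidence interval bounds the relative change in time-to-first-response (negative = faster first response) and uses illustrative thresholds; your Part 4 document sets the real ones:

```python
def decide(ci_low, ci_high, guardrails_ok, rel_threshold=-0.10):
    """Pre-registered decision rule, sketched as code.

    ci_low/ci_high: confidence-interval bounds on the relative change
    in time-to-first-response; negative values mean improvement.
    """
    if not guardrails_ok:
        return "rollback"   # any guardrail breach overrides a metric win
    if ci_high < rel_threshold:
        return "ship"       # whole interval beats the -10% threshold
    if ci_low > 0.0:
        return "rollback"   # confidently worse than control
    return "extend"         # inconclusive: the stopping rule governs
```

If the team cannot agree on this four-branch function before the experiment starts, the pre-registration is not done.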
Expected artifact: Pre-Registration.md, approximately one page.
Reflection questions (10 minutes)
Write one paragraph on each.
- The 10% threshold. The product owner’s stated 10% time-to-first-response improvement is aggressive. If your sample-size calculation suggests the experiment would need to run for 8 weeks to detect 10% reliably, what do you do? Which of the tools in Article 12 (variance reduction, guardrail prioritization, sequential testing) would you reach for first, and why?
- The cost guardrail. Generative AI features have a specific cost hazard. Describe, using Article 12’s vocabulary, the three ways TicketSummary’s per-ticket cost could drift during the A/B window, and what monitoring you would put in place to detect each.
- The regulatory dimension. This is a customer-service product; the EU AI Act does not classify it as high-risk by default, but several customer organizations will be EU-based, and the feature summarizes content that may include personal data. What elements of your Part 3 and Part 4 artifacts would produce EU AI Act Article 12 (record-keeping) and Annex IV–adjacent evidence if a customer organization later asks for it?
Final deliverable
A single online evaluation package named TicketSummary-Online-Experiment-Package.md combining the four artifact files and the reflection paragraphs, with a one-page executive summary at the top stating: the feature, the hypothesis in one sentence, the primary metric and threshold, the total compute-and-duration budget, and the go/no-go recommendation for starting the experiment (approve, approve with conditions, redesign). The package runs to approximately eight to twelve pages.
What good looks like
An experiment-review gate will look for:
- A quantitative sample-size argument. Not “we will run until the result is clear”, but “we need 340,000 tickets per variant to detect a 10% relative change with 80% power at alpha 0.01 given variance X”.
- Concrete rollback triggers. Thresholds, windows, and actions, not verbal guardrails.
- Per-ticket-cost monitoring. The generative-AI-specific cost guardrail visibly in place.
- Variance reduction named. CUPED or stratification or another technique, with the expected sensitivity improvement stated.
- A clear decision and stopping rule. No room for post-hoc reinterpretation.
- Tool neutrality. Feature-flag and experimentation tool named as a choice, not a branded endorsement.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.