AITM-ECI: AI Experimentation Associate — Body of Knowledge Lab Notebook 2 of 2
Scenario
Your organization is a B2B SaaS company that ships a customer-service ticket management product used by support teams at roughly 8,000 customer organizations worldwide. Your product team has built a generative AI feature called TicketSummary that produces a three-sentence summary at the top of every open ticket, designed to give the assigned support agent faster context. The feature is powered by a managed LLM API behind a feature-flag layer; the platform team can control which customer organizations and which ticket categories receive the feature. You have been asked to design the online A/B test that decides whether to ship TicketSummary to the full fleet.
The product owner’s stated hypothesis is “summaries will reduce average time-to-first-response by support agents by 10% or more”. Baseline time-to-first-response is 6.2 minutes across the fleet, with high variance across customer organizations and ticket categories. Agents currently handle approximately 1.4M tickets per week across the fleet.
Part 1: Hypothesis refinement and metric design (25 minutes)
The product owner’s statement is a starting point, not a hypothesis. Refine it into the four-element hypothesis structure from Article 2 (subject, predicted effect, measurement, threshold). Then design the metric set.
Produce:
- A refined hypothesis in the four-element structure. The predicted-effect and threshold must be concrete and falsifiable.
- A primary metric with an operational definition: how it is measured, over what population, over what window, with what inclusion/exclusion criteria.
- Two to four secondary metrics that help interpret primary-metric movement.
- Three to five guardrail metrics that must not degrade. Consider: error rate, latency p99, per-ticket cost (this is a generative feature with per-token cost), agent satisfaction (survey-based), and at least one metric that represents downstream customer outcome (ticket resolution quality or customer CSAT).
- A one-paragraph adversarial review of the primary metric (from Article 2): how might the metric move without the feature actually being better?
Expected artifact: Hypothesis-and-Metrics.md with the refined hypothesis, the metric table, and the adversarial-review paragraph.
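As one illustration of what an operational metric definition can look like when written as code, the sketch below computes time-to-first-response with explicit inclusion/exclusion criteria. The column names (`created_at`, `first_agent_reply_at`, `auto_closed`) are assumptions for the sketch, not part of the exercise:

```python
import pandas as pd

def ttfr_minutes(tickets: pd.DataFrame) -> pd.Series:
    """Time-to-first-response in minutes for in-scope tickets.

    Exclusion criteria (illustrative): tickets that were auto-closed,
    and tickets with no agent reply recorded in the measurement window.
    """
    in_scope = tickets[
        tickets["first_agent_reply_at"].notna()  # agent actually replied
        & ~tickets["auto_closed"]                # drop auto-closed tickets
    ]
    delta = in_scope["first_agent_reply_at"] - in_scope["created_at"]
    return delta.dt.total_seconds() / 60.0
```

Writing the definition this precisely is what makes the adversarial review possible: every exclusion is visible and can be challenged.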
Part 2: Sample-size calculation and duration planning (30 minutes)
Design the sample-size and duration plan.
Required elements:
- The minimum detectable effect (MDE) you will target. Justify the choice given the product owner’s stated 10% threshold and the typical variance in the baseline.
- The sample size per variant required to detect the MDE at 80% power and 1% significance, given the baseline time-to-first-response of 6.2 minutes and a reasonable variance assumption (state your assumption). Show the calculation.
- The traffic allocation (50/50, 90/10, or other). Justify.
- The variance-reduction strategy (Article 12). Will you use CUPED? Stratified assignment by customer organization? Regression adjustment? Document the choice.
- The expected duration in days given the current ticket volume and the allocation.
- The sequential-testing policy: will you use an always-valid confidence sequence, mSPRT, or fixed-horizon? Name your framework or platform and state the rationale in one sentence.
Include a contingency: if the expected duration exceeds 4 weeks, propose either a broader allocation or a more concentrated deployment plan.
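The per-variant sample-size calculation above can be sketched with a standard two-sample z-approximation. The standard-deviation value in the usage line is an illustrative assumption; in your artifact you would substitute the fleet's measured variance:

```python
import math
from statistics import NormalDist

def n_per_variant(baseline_mean, rel_mde, sd, alpha=0.01, power=0.80):
    """Tickets needed per variant to detect a relative change in mean
    time-to-first-response, via the two-sample z-approximation."""
    delta = baseline_mean * rel_mde                 # absolute MDE, minutes
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)

# Illustrative assumption: per-ticket TTFR standard deviation of 8 minutes.
n = n_per_variant(6.2, rel_mde=0.10, sd=8.0)
```

Note how sensitive the answer is to the variance assumption: the required n scales with sd², which is why the variance-reduction choice in the next bullet directly shortens the experiment.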
Expected artifact: Sample-Size-and-Duration.md with the calculation and the contingency.
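As a sketch of the CUPED option named above, assuming a pre-experiment covariate is available per unit (for example, each organization's historical mean time-to-first-response):

```python
import numpy as np

def cuped_adjust(y, x):
    """CUPED adjustment: subtract the covariate-explained component of
    the metric. y = in-experiment metric per unit; x = pre-experiment
    covariate per unit. The mean of y is preserved; its variance drops
    by roughly corr(x, y)**2."""
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)  # OLS slope of y on x
    return y - theta * (x - x.mean())
```

The expected sensitivity improvement your artifact must state is exactly that corr² factor, estimated from pre-experiment data.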
Part 3: Canary ramp and rollback criteria (30 minutes)
Design the canary ramp that will precede the A/B test and the rollback criteria that govern both.
The ramp design must specify:
- Traffic-share steps (a 1% → 10% → 50% → 100% progression is typical; propose your version).
- Minimum time per step before automatic ramp to the next.
- Automated rollback triggers. For each of your Part 1 guardrails, state the threshold breach that triggers rollback. Be specific (not “error rate goes up”, but “error rate p95 exceeds 1.5% over a 5-minute rolling window”).
- Human-escalation triggers. For signals that are not automatic rollbacks but need a human decision, name the signal and the named role that is paged.
- Shadow prerequisite. Will you run shadow before canary? Given this is a user-facing generative feature with moderate risk (it aids but does not replace the human agent), justify the decision.
A concrete consideration for TicketSummary specifically: the feature has per-ticket cost driven by prompt and completion tokens. A cost regression that drives per-ticket cost above budget is a business-critical failure even if quality holds. Include an explicit per-ticket-cost rollback trigger.
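A minimal sketch of such a cost trigger, with an illustrative budget and window (the real values belong in your rollback-trigger table, not here):

```python
from collections import deque

class RollingMeanTrigger:
    """Fires when the rolling mean of a per-ticket measurement (here,
    per-ticket cost in USD) exceeds a budget over the last N tickets.
    Budget and window values are illustrative assumptions."""

    def __init__(self, budget, window=500):
        self.budget = budget
        self.values = deque(maxlen=window)

    def record(self, value):
        self.values.append(value)

    def should_roll_back(self):
        # Require a full window so a handful of unusually expensive
        # tickets cannot fire the trigger on their own.
        if len(self.values) < self.values.maxlen:
            return False
        return sum(self.values) / len(self.values) > self.budget
```

The same shape (measurement, window, threshold, action) applies to every automated trigger in the table; what varies is whether the aggregate is a mean, a rate, or a percentile.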
Expected artifact: Canary-and-Rollback.md with the ramp table, the rollback trigger table, and the shadow-prerequisite rationale.
Part 4: Pre-registration and decision-rule artifact (20 minutes)
Assemble the pre-registration. This is the document that will sit in the experiment tracking system before the experiment starts; at experiment conclusion, it is what the governance reviewer will compare the recorded results against.
Required sections:
- The hypothesis (from Part 1).
- The primary, secondary, guardrail metric definitions (from Part 1).
- The sample-size, duration, and allocation plan (from Part 2).
- The canary ramp and rollback criteria (from Part 3).
- The decision rule: what primary metric movement, at what confidence, with what guardrail condition, triggers ship; what triggers rollback (beyond the canary triggers); what triggers extension.
- The stopping rule: maximum duration, early-stop condition for a confident win or loss, peeking policy (are interim peeks allowed; by whom; with what recorded justification).
- The owner, reviewer, and approver. Specific role names, not “the team”.
- The tracking system identifier where the experiment will be recorded. Name the tool you would actually use.
The pre-registration is approximately one page. It is the source of truth; everything else in the package defers to it.
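One way to leave no room for post-hoc reinterpretation is to express the decision rule as executable logic. The sketch below assumes the confidence interval bounds the relative change in time-to-first-response (negative = faster first response) and uses illustrative thresholds; your Part 4 document sets the real ones:

```python
def decide(ci_low, ci_high, guardrails_ok, rel_threshold=-0.10):
    """Pre-registered decision rule, sketched as code.

    ci_low/ci_high: confidence-interval bounds on the relative change
    in time-to-first-response; negative values mean improvement.
    """
    if not guardrails_ok:
        return "rollback"   # any guardrail breach overrides a metric win
    if ci_high < rel_threshold:
        return "ship"       # whole interval beats the -10% threshold
    if ci_low > 0.0:
        return "rollback"   # confidently worse than control
    return "extend"         # inconclusive: the stopping rule governs
```

If the team cannot agree on this four-branch function before the experiment starts, the pre-registration is not done.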
Expected artifact: Pre-Registration.md, approximately one page.
Reflection questions (10 minutes)
Write one paragraph on each.
- The 10% threshold. The product owner’s stated 10% time-to-first-response improvement is aggressive. If your sample-size calculation suggests the experiment would need to run for 8 weeks to detect 10% reliably, what do you do? Which of the tools in Article 12 (variance reduction, guardrail prioritization, sequential testing) would you reach for first, and why?
- The cost guardrail. Generative AI features have a specific cost hazard. Describe, using Article 12’s vocabulary, the three ways TicketSummary’s per-ticket cost could drift during the A/B window, and what monitoring you would put in place to detect each.
- The regulatory dimension. This is a customer-service product; the EU AI Act does not classify it as high-risk by default, but several customer organizations will be EU-based, and the feature summarizes content that may include personal data. What elements of your Part 3 and Part 4 artifacts would produce EU AI Act Article 12 (record-keeping) and Annex IV–adjacent evidence if a customer organization later asks for it?
Final deliverable
A single online evaluation package named TicketSummary-Online-Experiment-Package.md combining the four artifact files and the reflection paragraphs, with a one-page executive summary at the top stating: the feature, the hypothesis in one sentence, the primary metric and threshold, the total compute-and-duration budget, and the go/no-go recommendation for starting the experiment (approve, approve with conditions, redesign). The package runs to approximately eight to twelve pages.
What good looks like
An experiment-review gate will look for:
- A quantitative sample-size argument. Not “we will run until the result is clear”, but “we need 340,000 tickets per variant to detect a 10% relative change with 80% power at alpha 0.01 given variance X”.
- Concrete rollback triggers. Thresholds, windows, and actions, not verbal guardrails.
- Per-ticket-cost monitoring. The generative-AI-specific cost guardrail visibly in place.
- Variance reduction named. CUPED or stratification or another technique, with the expected sensitivity improvement stated.
- A clear decision and stopping rule. No room for post-hoc reinterpretation.
- Tool neutrality. Feature-flag and experimentation tool named as a choice, not a branded endorsement.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.