AITE M1.3-Art19 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

A/B Testing for AI Features

Maturity Assessment & Diagnostics · Advanced depth · COMPEL Body of Knowledge

9 min read · Article 19 of 48 · Calibrate

A/B testing is well-practised outside of AI, in product development, marketing, and clinical trials. What makes A/B testing of AI features distinct is the three categories of pitfall that undermine the classical design: contamination across arms, network effects between treated and control users, and learning systems that adapt during the test itself. This article teaches the reader to design the RCT, compute the power, anticipate the pitfalls, and produce an evaluation that a CFO and an audit committee will both accept.

Structuring the RCT

A usable AI-feature A/B test has nine specifications, settled before the test starts.

Unit of randomization. User, session, account, tenant, or device. The unit must be the smallest level at which contamination is prevented. For a personalization-model test, the user is usually correct. For a contact-center copilot test, the agent is usually correct. For a recommender test on a social platform, user-level randomization contaminates because treated users’ actions affect controls’ feeds; cluster randomization by social sub-graph is often required.

Eligibility. The criteria that define who enters the test. Eligibility that is too narrow produces a sample that does not generalize; eligibility that is too broad dilutes power. Eligibility decisions are often the first place the CFO will push back. The GitHub Copilot productivity RCT from Peng et al. (2023) limited eligibility to developers completing a scoped coding task, which was appropriate for the research question and explicit about its narrow scope; subsequent claims that the reported effect generalizes to all developer work were not supported by the study design.1

Treatment definition. What exactly constitutes the AI-enabled arm. The treatment must be concretely specified — “copilot v2.4 with enterprise-tier context and tool access” — not gestured at — “the AI copilot.” When the treatment definition is loose, mid-test drift in the AI feature can invalidate the test.

Control arm. What the control gets. Three standard options: no feature, the existing non-AI workflow, or a prior-generation AI feature. The choice matters because the counterfactual is the control. A test against “no feature” overstates the incremental value relative to a realistic rollout alternative; a test against “prior feature” understates it.

Primary metric. The single metric the test is powered to detect. Secondary metrics are observed but not powered. Picking multiple primary metrics inflates the false-positive rate; CFOs will reject results that rely on multiple-comparison-unadjusted secondary findings.
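
The false-positive inflation is straightforward to quantify for independent metrics. The sketch below (function names are mine, not from this article) shows why five "primary" metrics each tested at α = 0.05 carry roughly a 23% chance of at least one spurious win, and the simple Bonferroni division that restores the intended rate:

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """P(at least one false positive) across k independent metrics,
    each tested at significance level alpha, when all nulls are true."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-metric threshold that keeps the family-wise rate near alpha."""
    return alpha / k

for k in (1, 3, 5, 10):
    print(f"{k:>2} metrics at alpha=0.05 -> FWER {familywise_error_rate(0.05, k):.3f}")
```

Bonferroni is conservative when metrics are correlated, but it is a defensible default for a VRR appendix.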

Minimum detectable effect (MDE). The smallest effect size the test can reliably detect at the chosen power level. MDE is the honest answer to “what effect must we see for the test to matter?” If the business case requires a 10% productivity lift but the test MDE is 15%, the test cannot validate the business case.

Sample size. Computed from MDE, baseline variance, and power. The standard computation for a two-sample t-test is n per arm = 2 × (z_{1-α/2} + z_{1-β})² × σ² / δ², where δ is the MDE and σ² the outcome variance. For binary outcomes, use the proportions formulation. Power 0.80 and α 0.05 are standard starting points; high-stakes decisions warrant α 0.01 or power 0.90.
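
Both formulations can be sketched in a few lines of Python (function names are mine; this uses the normal-approximation formulas above, so treat it as a planning estimate rather than a substitute for a statistics package):

```python
import math
from statistics import NormalDist

def n_per_arm_continuous(delta: float, sigma: float,
                         alpha: float = 0.05, power: float = 0.80) -> int:
    """n per arm = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2."""
    z = NormalDist().inv_cdf
    return math.ceil(2 * (z(1 - alpha / 2) + z(power)) ** 2
                     * sigma ** 2 / delta ** 2)

def n_per_arm_proportions(p_control: float, p_treated: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Binary-outcome version: per-arm variances replace 2 * sigma^2."""
    z = NormalDist().inv_cdf
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z(1 - alpha / 2) + z(power)) ** 2
                     * variance / (p_control - p_treated) ** 2)

# Detecting a 0.4-unit shift on a sigma = 3 outcome at 0.80 power:
print(n_per_arm_continuous(0.4, 3.0))
# Detecting a hypothetical 20% -> 22% conversion lift:
print(n_per_arm_proportions(0.20, 0.22))
```

Note how sample size scales with the inverse square of the MDE: halving the detectable effect quadruples the required n.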

Duration. Computed from sample size and expected unit arrival rate. A test that completes in a day is under-specified (seasonal variation, novelty effects); a test that takes six months may be overtaken by model or product changes. Two to eight weeks is a common sweet spot.

Pre-registration. The hypothesis, primary metric, MDE, stopping rule, and analysis method documented before the test starts. Pre-registration is the defence against post-hoc metric-picking. Regulated industries are moving toward mandatory pre-registration; enterprise best practice follows.

The three pitfalls

Classical A/B testing textbooks underweight three pitfalls that dominate AI-feature evaluation.

Pitfall 1 — Contamination

Contamination occurs when treated-unit activity influences control-unit outcomes. A GenAI copilot used by treated agents improves their call-resolution times; control agents, working alongside treated colleagues who absorb the tougher calls, may face an easier queue and record better metrics for reasons unrelated to the feature itself. A recommender-system A/B test on a social feed contaminates because treated users’ shares appear in controls’ feeds.

The mitigation is cluster randomization — assign whole clusters (teams, offices, geographies, network segments) to one arm. Cluster randomization reduces effective sample size; the analyst must re-compute power using the intra-cluster correlation. Textbook references in the social-network-experiment literature give the design-effect formula; Eckles, Karrer, and Ugander’s foundational work documents the bias that unaddressed interference introduces in social-network experiments and the designs that reduce it.2
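
The power re-computation uses the Kish design effect, 1 + (m − 1)·ρ, where m is the cluster size and ρ the intra-cluster correlation. A minimal sketch, with illustrative numbers (the function names and example values are mine):

```python
import math

def design_effect(cluster_size: int, icc: float) -> float:
    """Kish design effect for equal-sized clusters: 1 + (m - 1) * icc."""
    return 1 + (cluster_size - 1) * icc

def clustered_n_per_arm(n_individual: int, cluster_size: int, icc: float) -> int:
    """Inflate an individual-randomization sample size to cover clustering."""
    return math.ceil(n_individual * design_effect(cluster_size, icc))

# Illustrative: 883 units per arm under individual randomization,
# clusters (teams) of 20 agents, intra-cluster correlation 0.05:
print(f"{design_effect(20, 0.05):.2f}")
print(clustered_n_per_arm(883, 20, 0.05))
```

Even a modest ICC of 0.05 nearly doubles the required sample here, which is why cluster-randomized AI pilots so often come back underpowered.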

Pitfall 2 — Network effects

Network effects are contamination’s stronger cousin: the AI feature’s value depends on how many other users have it. A collaboration-copilot A/B test at 10% treatment assignment understates the feature’s value at 100% deployment because many of the collaborative interactions are between treated and control users. The classic fix is staged ramping with multiple treatment densities — 10%, 50%, 100% — and analyzing the dose-response curve.

Eng-blog case studies from Meta, LinkedIn, and Uber document this pattern in detail. Uber’s pricing DiD studies, cited earlier, arose partly because A/B tests on pricing have persistent network-effect contamination that made DiD at geography level the better identification strategy.3

Pitfall 3 — Learning systems that co-adapt

Many AI features learn online from user feedback. A recommender-system A/B test where the treated model retrains on treated-user clicks will, over test duration, diverge from an identical untreated model. The “treatment” changes during the test. Two fixes: freeze the treated model for the test duration (which surrenders some of the feature’s design intent), or build the learning behaviour into the experimental design (cluster-time randomization, cross-over designs).

The Netflix recommender-system tech blog has published honestly on this challenge over the years, and the academic literature on bandit-algorithm evaluation has specific designs for it.4 The AI value practitioner does not need to build the methods, only to recognize when a learning system makes the classical RCT invalid and to request the appropriate design from the experimentation platform team.

Sample-size math, walked through

Consider a contact-center GenAI copilot. The proposed treatment arm receives copilot suggestions during calls; the control arm uses the existing workflow. The baseline average handle time (AHT) is 8 minutes with standard deviation 3 minutes. The business case requires a 5% reduction (0.4-minute improvement) to hit the rNPV hurdle.

Two-sample t-test, 0.05 significance, 0.80 power, equal arms. δ = 0.4, σ = 3. n per arm = 2 × (z_{1-α/2} + z_{1-β})² × σ² / δ² ≈ 883 calls per arm. At ~400 eligible calls per day across the pilot center, both arms fill in roughly four and a half days. Raising power to 0.90 raises the sample to ~1,183 per arm (about a day and a half longer). Tightening the MDE to a 2% rather than 5% effect raises it to ~5,519 per arm (roughly four weeks). The sample-size sensitivity reveals why MDE choice is a business-case conversation, not an analyst conversation.

Reporting the RCT

A reported A/B test lands in the VRR (Article 16) at Section 3 — the counterfactual narrative. The reported elements are: the ATE point estimate with 95% confidence interval, the pre-registered primary metric result (separated from any post-hoc findings), the randomization check (balance on covariates between arms), the test duration, sample sizes per arm, any protocol deviations and their effect on analysis, and the CFO-ready narrative of what the result means.
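
The randomization check can be as simple as a standardized mean difference per pre-treatment covariate. The helper below is a sketch, not a prescribed VRR method; the 0.1 threshold is a common rule of thumb, and the covariate values are hypothetical:

```python
from statistics import mean, stdev

def standardized_mean_difference(treated: list[float],
                                 control: list[float]) -> float:
    """|mean_t - mean_c| / pooled SD for one covariate. A common rule
    of thumb treats |SMD| > 0.1 as imbalance worth reporting."""
    pooled_sd = ((stdev(treated) ** 2 + stdev(control) ** 2) / 2) ** 0.5
    return abs(mean(treated) - mean(control)) / pooled_sd

# Illustrative balance check on a pre-treatment covariate,
# e.g. agent tenure in years (hypothetical values):
treated_tenure = [1.0, 2.0, 3.0, 4.0, 5.0]
control_tenure = [1.2, 2.1, 3.0, 4.1, 5.1]
print(f"SMD: {standardized_mean_difference(treated_tenure, control_tenure):.3f}")
```

Reporting the SMD table alongside the ATE pre-empts the most common audit-committee question: "how do we know the arms were comparable to begin with?"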

The narrative is the hardest section. “Copilot increases productivity” is marketing; “Treated agents completed calls 6% faster than controls (95% CI: 3%–9%), translating to an estimated $2.1M annualized saving at current call volume, subject to the parallel-metrics caveats below” is a counterfactual. Which narrative style the board gets depends on the analyst’s discipline, not on the underlying math.

Cross-reference to Core Stream

  • EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md#experimentation — practitioner framing of A/B tests.
  • EATF-Level-1/M1.2-Art05-Evaluate-Measuring-Transformation-Progress.md — Evaluate stage methodology anchoring RCTs.

Self-check

  1. A GenAI copilot is A/B tested at 10% treatment density. Treated users report a 7% productivity lift. Why might this understate the effect at 100% rollout, and what design change addresses it?
  2. A recommender A/B test runs for three weeks; the treated model retrains on treated-user feedback. What pitfall has the analyst stepped into, and what are the two possible fixes?
  3. Sample-size math suggests an 8-week test to detect a 2% effect; the business wants a decision in 4 weeks. What is the honest analyst response to leadership?
  4. An analyst reports a primary metric (p = 0.03) and four secondary metrics (one with p = 0.02). How should the VRR present this?

Further reading

  • Kohavi, Tang, and Xu, Trustworthy Online Controlled Experiments (Cambridge, 2020).
  • Peng et al., The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (arXiv, 2023).
  • Athey and Imbens, The Econometrics of Randomized Experiments, Handbook of Economic Field Experiments (2017).

Footnotes

  1. Sida Peng, Eirini Kalliamvakou, et al., The Impact of AI on Developer Productivity: Evidence from GitHub Copilot, arXiv preprint (2023). https://arxiv.org/abs/2302.06590

  2. Dean Eckles, Brian Karrer, and Johan Ugander, Design and Analysis of Experiments in Networks: Reducing Bias from Interference, Journal of Causal Inference (2017). https://www.degruyter.com/document/doi/10.1515/jci-2015-0021/html

  3. Uber Engineering, published pricing experimentation and geography-level evaluation design (2015–2020). https://www.uber.com/blog/engineering/

  4. Netflix Technology Blog, published work on recommender experimentation (various years). https://netflixtechblog.com/