AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 4 of 14
Online evaluation is where the hypothesis formulated in Article 2 meets the users it was meant to serve, and where the offline evidence collected in Article 3 is confirmed or refuted against reality. The article teaches four canonical designs — A/B tests, multi-armed bandits, canary deployments, and shadow traffic — and the discipline each one requires. The designs are not substitutes for one another. Each answers a different question, and a mature experimentation program uses all four.
A/B tests
An A/B test (also called a randomized controlled experiment) assigns users randomly to a control variant (A) or a treatment variant (B), collects a primary metric over a fixed window, and compares the two. If the metric difference exceeds a pre-specified threshold at pre-specified confidence, the decision rule fires.
The discipline around A/B tests is statistical, not implementational. The implementation is straightforward: a random assignment function, a metric pipeline, and a reporting view. The statistics are where mistakes are made.
Sample size must be computed in advance. The minimum detectable effect (MDE), the smallest metric difference the experiment can reliably detect at a given power and significance level, depends on sample size, baseline variance, and traffic allocation. Computing this in advance prevents the two common failure modes: experiments run too short (underpowered, inconclusive) and experiments run too long (wasted traffic on a known winner or known loser). Sample-size calculation is widely documented, and most mature experimentation platforms, including the ones documented by Netflix, Microsoft, Booking.com, and others, compute the MDE automatically from historical variance.[1][2][3]
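The calculation itself is short. A minimal sketch for a conversion-rate metric with a 50/50 split, using the standard two-proportion normal approximation and only the Python standard library (the function name and default arguments are illustrative, not taken from any platform cited above):

```python
import math
from statistics import NormalDist

def samples_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.8):
    """Approximate per-arm sample size for a two-proportion z-test
    with equal traffic allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # desired power
    p_bar = baseline_rate + mde_abs / 2             # midpoint rate
    variance = 2 * p_bar * (1 - p_bar)              # variance of the difference
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)
```

For a 10% baseline conversion rate and a one-percentage-point absolute MDE, this lands near fifteen thousand users per arm, which is why underpowered experiments are so common.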
Peeking inflates false positives. Looking at the results before the planned end of the experiment and deciding to stop early based on what is seen is called peeking. Classical fixed-horizon statistics become invalid under peeking; the effective significance level inflates. The correction is either to not peek, or to use a sequential testing framework (mSPRT, Bonferroni-adjusted alpha-spending, or always-valid p-values) that accounts for continuous evaluation. The Kohavi, Tang, Xu reference text treats peeking as the single most common error in online experimentation practice.[1]
Sample ratio mismatch is a tripwire. If the traffic was meant to be split 50/50 but arrives as 52/48, the randomization is broken. Sample ratio mismatch (SRM) indicates a bug somewhere — in assignment, in logging, in filtering — and results under SRM are invalid regardless of how the primary metric moves. Every A/B pipeline must run an SRM check and report it before anything else.
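A minimal SRM check is a one-degree-of-freedom chi-squared test of the observed counts against the planned split; a stdlib-only sketch (the function name is illustrative):

```python
from statistics import NormalDist

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-squared test (df=1) that observed arm counts match the planned
    split. A tiny p-value signals sample ratio mismatch."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total * (1 - expected_ratio)
    stat = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # For df=1, P(chi2 > stat) = 2 * (1 - Phi(sqrt(stat)))
    return 2 * (1 - NormalDist().cdf(stat ** 0.5))
```

At realistic traffic volumes a 52/48 arrival on a planned 50/50 split yields a p-value indistinguishable from zero, which is exactly the tripwire behavior wanted.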
Carry-over and contamination must be prevented. Users who interact with both variants, features that share infrastructure across variants, and networks where users influence each other (social networks, marketplaces) all create paths for the treatment to affect the control. Prevention is design-time: user-level (not session-level) assignment, feature-level isolation, and cluster-randomized designs on networks.
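User-level assignment is typically implemented as a salted hash of the user id, so the same user lands in the same variant in every session and carry-over across sessions is prevented by construction. A sketch under that convention (function and variant names are illustrative):

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, treatment_share=0.5):
    """Deterministic user-level assignment. Salting the hash with the
    experiment id decorrelates assignments across concurrent experiments."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform in [0, 1]
    return "treatment" if bucket < treatment_share else "control"
```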
Multi-armed bandits
A multi-armed bandit is an online experimentation strategy that adaptively shifts traffic toward better-performing variants during the experiment. Two common algorithms are epsilon-greedy (explore with probability epsilon, exploit otherwise) and Thompson sampling (Bayesian posterior sampling). Contextual bandits add per-user context to the decision.
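As one concrete instance, Bernoulli Thompson sampling fits in a few lines: each arm keeps a Beta posterior over its success rate, and traffic shifts toward arms whose posteriors look better. This is an illustrative sketch (the Beta(1, 1) prior and class name are arbitrary choices, not a prescribed implementation):

```python
import random

class ThompsonBandit:
    """Bernoulli Thompson sampling over a fixed set of arms."""
    def __init__(self, n_arms):
        self.successes = [1] * n_arms   # Beta(1, 1) uniform prior
        self.failures = [1] * n_arms

    def choose(self):
        # Sample a plausible success rate from each posterior; play the best.
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

Epsilon-greedy replaces the posterior sampling with a coin flip: with probability epsilon pick a random arm, otherwise pick the arm with the best observed mean.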
Bandits are appropriate when the goal is to minimize opportunity cost across many variants (dozens or hundreds of headlines, creative assets, or ranking policies) and when the hypothesis does not require a clean A/B comparison. Bandits are inappropriate when the goal is a clean “does this variant improve the metric?” decision, when the stakes are high enough that a potential loser should not see meaningful traffic at all, or when statistical rigor in the final report is required — bandits produce biased post-hoc estimates because traffic allocation is correlated with outcomes.
The choice between A/B and bandit is a choice between two questions. A/B answers “which variant should I ship?”. Bandits answer “across many variants, how do I minimize my regret?”. Most AI experimentation programs use A/B for decision-quality experiments and bandits for content-optimization experiments where the number of variants is large and the per-variant signal is modest.
Canary deployments
A canary deployment is a phased rollout in which a new model or feature is exposed to a small slice of traffic (commonly 1%), evaluated against pre-specified guardrails, and ramped to larger slices (10%, 50%, 100%) only if the guardrails hold. The Google SRE Workbook documents the pattern across non-ML deployments; the adaptation to ML is a mechanical extension.[4]
Canary differs from A/B in intent. An A/B test is designed to measure a primary metric. A canary is designed to detect regressions fast — latency spikes, error-rate increases, guardrail-metric breaches, dimension drift — and to roll back automatically when they appear. The canary is a safety rail. It can coexist with an A/B test running underneath it: the ramp through 1%, 10%, and 50% can itself be randomized so that the A/B statistics accrue, while the canary gate controls whether the ramp proceeds.
Canary design has four elements.
- Ramp schedule. Explicit traffic-share steps (1%, 10%, 50%, 100%) and the time the system must spend at each step before ramping.
- Guardrail set. A list of metrics (latency p99, error rate, primary-metric degradation beyond a threshold, secondary-metric breaches) that trigger rollback.
- Rollback mechanism. An automated, tested path from the ramp back to full control traffic, with no human-in-the-loop required for guardrail breach.
- Human-in-the-loop escalation. For breaches that are not clear regressions, a named person is paged and given a decision window.
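The four elements above reduce to a small state machine. A sketch of one guardrail-gated evaluation tick, where a breach returns all traffic to control and a clean check advances the ramp (metric names, thresholds, and the ramp schedule are placeholders for the values the team pre-specifies):

```python
RAMP_STEPS = [0.01, 0.10, 0.50, 1.00]

def canary_tick(step_index, guardrails):
    """One evaluation of a canary gate. `guardrails` maps a metric name to
    an (observed, threshold) pair; any observed value above its threshold
    is a breach and triggers automated rollback with no human in the loop."""
    breached = sorted(name for name, (obs, limit) in guardrails.items()
                      if obs > limit)
    if breached:
        return 0.0, f"rollback (breached: {', '.join(breached)})"
    if step_index + 1 < len(RAMP_STEPS):
        return RAMP_STEPS[step_index + 1], "advance"
    return RAMP_STEPS[step_index], "hold at full traffic"
```

The human-in-the-loop escalation path sits outside this function: ambiguous readings that do not breach a threshold page a named person rather than triggering the automated branch.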
Tooling for canary is available across the managed and self-hosted ecosystems. SageMaker, Vertex AI, Azure ML, Databricks, and MLflow all document canary patterns; Kubernetes-native tools including Argo Rollouts and Flagger implement the ramp mechanics on self-hosted stacks; feature-flag systems including LaunchDarkly, GrowthBook, and Unleash support canary as a flag-driven rollout.[5][6] The pattern is vendor-neutral.
[DIAGRAM: StageGateFlow — aitm-eci-article-4-canary-ramp — A left-to-right ramp from 1% to 10% to 50% to 100%, with guardrails shown above each step and a rollback arrow returning to the previous step on breach. A secondary arrow labels the A/B test accruing across the ramp.]
Shadow traffic
A shadow deployment routes production inputs to both the current production model and a new candidate model in parallel. The candidate’s outputs are logged but never served, so the new model is evaluated on realistic production inputs without any user exposure.
Shadow is the right mode for four situations.
- A model class change where offline evaluation cannot confirm input-distribution coverage.
- A risk-sensitive feature (credit decisioning, medical triage, legal summarization) where user exposure must be preceded by an operational realism check.
- A high-cost experiment where the loss from a bad A/B would outweigh the learning value.
- A regulated system where pre-exposure evaluation on realistic inputs is a documentation obligation, for example for high-risk systems under EU AI Act Annex IV.[7]
Shadow comes with its own hazards. The candidate consumes compute for every production request without serving any user, which doubles the serving cost for the shadow period. Logging candidate outputs at scale is itself a data-governance task. When the candidate’s outputs must be compared to the production model’s outputs to assess agreement, a comparison function must be defined, and the definition is often nontrivial for generative outputs. And shadow alone does not assess user-facing behavior; a canary ramp is the next step after a successful shadow.
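The agreement summary itself is simple once the comparison function exists; a sketch that leaves that function as a parameter precisely because it is the nontrivial part (names are illustrative):

```python
def shadow_agreement(paired_logs, compare):
    """Summarize agreement between production and candidate outputs logged
    during a shadow run. `compare` returns True when two outputs agree:
    exact match works for classifiers, but generative outputs usually need
    a task-specific similarity judgment."""
    matches = sum(1 for prod_out, cand_out in paired_logs
                  if compare(prod_out, cand_out))
    return matches / len(paired_logs) if paired_logs else float("nan")
```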
Sequential testing and always-valid inference
Classical A/B statistics assume a fixed horizon. The data is collected, and the test is performed once at the end. In practice, most practitioners want to monitor the experiment continuously and stop early when the result is clear. Sequential testing frameworks formalize this.
mSPRT (mixture sequential probability ratio test) and always-valid confidence sequences are two families of sequential tests widely adopted at large experimentation platforms. They offer guaranteed Type I error rates under continuous monitoring, at the cost of slightly wider intervals than fixed-horizon tests would produce for the same sample size. The engineering trade is almost always favorable, because the operational cost of running an experiment longer than needed exceeds the statistical cost of slightly wider intervals.
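For intuition, the mSPRT statistic for a mean difference with known observation variance and a normal mixing prior fits in one function; the test rejects the null at level alpha as soon as the statistic reaches 1/alpha, and the always-valid p-value is the running minimum of its reciprocal. This is a sketch under those textbook assumptions, not a production test:

```python
import math

def msprt_lambda(mean_diff, n, sigma2, tau2=1.0):
    """Mixture SPRT statistic for H0: true mean difference = 0, given the
    observed mean difference over n samples, known variance sigma2, and a
    N(0, tau2) mixing prior over the alternative."""
    v = sigma2 + n * tau2
    return math.sqrt(sigma2 / v) * math.exp(
        (n ** 2 * tau2 * mean_diff ** 2) / (2 * sigma2 * v))
```

Monitoring this statistic after every batch is legitimate: unlike a fixed-horizon z-test, the Type I error guarantee holds at every peek.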
The practitioner need not derive the math. A competent A/B platform ships a sequential-testing mode; a competent practitioner enables it when continuous monitoring is required. The Kohavi, Tang, Xu reference text and the Booking.com platform writeups both treat sequential testing as a standard feature of modern experimentation infrastructure.[1][3]
Decision rules and rollback criteria
Every online experiment needs both a ship rule and a rollback rule. The ship rule is “if the primary metric improves by at least X at confidence Y, with no guardrail breach, ship”. The rollback rule is “if any of the following happens, revert”. Typical rollback triggers for an AI feature include:
- Primary metric degraded beyond tolerance.
- Any guardrail metric breached.
- Latency p99 increased beyond threshold.
- Error rate increased beyond threshold.
- Safety monitoring (content-safety breaches, jailbreak detections) increased.
- Cost per request increased beyond budget threshold (a specific AI concern, covered in Article 12).
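Written as code, the pre-registered rule is a pure function of the experiment readout, with no room for in-the-moment reinterpretation; a sketch with placeholder thresholds:

```python
def decide(lift, confidence, guardrail_breaches,
           min_lift=0.02, min_confidence=0.95):
    """Pre-registered ship/rollback rule, written before the experiment
    starts. Thresholds are placeholders for the team's pre-specified values.
    Rollback outranks everything; shipping requires lift AND confidence."""
    if guardrail_breaches:
        return "rollback"
    if lift >= min_lift and confidence >= min_confidence:
        return "ship"
    return "inconclusive"
```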
The rollback rules are written before the experiment. They are tested before the experiment. They are automated where possible. And they are invoked without a meeting when they fire; the meeting happens after the rollback, to decide what to do next.
[DIAGRAM: TimelineDiagram — aitm-eci-article-4-ab-timeline — A horizontal timeline from pre-registration -> traffic ramp -> data collection -> sequential boundary check -> decision with labeled peeking boundary and guardrail monitors.]
Two real programs in the online-experimentation vocabulary
Microsoft and Bing — ExP. Published work by Ron Kohavi and colleagues at Microsoft’s experimentation platform documented thousands of experiments per year across Bing, Office, and Azure properties, with a platform-team-owned infrastructure providing sample-size calculation, SRM checks, sequential testing, and a metric catalog.[2] Microsoft’s published practice is the operational reference for A/B testing at scale, and its lessons (including the 100-to-1 ratio of proposed experiments that fail to improve the primary metric) are transferable to AI experimentation directly.
Vendor landscape — Optimizely, LaunchDarkly, GrowthBook. Commercial experimentation and feature-flag platforms including Optimizely, LaunchDarkly, and GrowthBook implement the canary, A/B, and sequential-testing patterns as vendor services; open-source alternatives including Unleash, Flagsmith, and Eppo’s open-source components provide the same patterns on self-hosted stacks.[5] The technical pattern is the same across all implementations; the practitioner’s responsibility is the design, not the plumbing.
Summary
Online evaluation has four canonical designs. A/B tests answer ship-or-not questions with pre-specified sample size, pre-specified confidence, and sequential-testing safeguards. Bandits minimize regret across many variants when decision-quality statistics are not the primary need. Canary deployments detect regressions fast through guardrail-gated ramps. Shadow traffic produces realistic inputs without user exposure, and is the right mode for high-risk, high-regulation features. Sequential testing, SRM checks, and pre-specified rollback rules are non-negotiable infrastructure. The tooling is broad and vendor-neutral; the discipline is not.
Further reading in the Core Stream: Evaluate: Measuring Transformation Progress.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Ron Kohavi, Diane Tang, Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020. https://experimentguide.com/ — accessed 2026-04-19.
2. Microsoft Research Experimentation Platform (ExP). https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/ — accessed 2026-04-19.
3. Booking.com engineering blog, experimentation platform series. https://booking.ai/ — accessed 2026-04-19.
4. The Site Reliability Workbook, “Canarying Releases”. Google. https://sre.google/workbook/canarying-releases/ — accessed 2026-04-19.
5. Optimizely, LaunchDarkly, and GrowthBook vendor documentation. https://www.optimizely.com/ ; https://launchdarkly.com/ ; https://www.growthbook.io/ — accessed 2026-04-19.
6. Argo Rollouts and Flagger progressive-delivery documentation (Kubernetes-native). https://argoproj.github.io/rollouts/ ; https://flagger.app/ — accessed 2026-04-19.
7. Regulation (EU) 2024/1689 (EU AI Act), Article 15 and Annex IV. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.