AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 2 of 14
A hypothesis is a falsifiable statement that an experiment is designed to support or reject. A metric is the measurement used to decide. The order matters. Teams that pick a metric first and ask the question later end up with dashboards that trend in the wrong direction for reasons nobody can name. Teams that author the hypothesis first can design a metric that answers it and a decision rule that closes the experiment. This article teaches the structure of a defensible hypothesis and the taxonomy of metrics that serves it.
Elements of a testable hypothesis
A hypothesis that can produce a decision has four parts, not one.
- Subject. What is the change? Which model, prompt, retrieval policy, feature, or threshold is the object of the experiment?
- Predicted effect. What observable quantity does the change alter, and in which direction?
- Measurement. How will the effect be observed? On what population? Over what window? By what instrument?
- Threshold and decision rule. What size of effect, detected with what confidence, triggers what decision?
A hypothesis that names only the first two parts is a guess. A hypothesis that names all four is an experiment design. Ron Kohavi, Diane Tang, and Ya Xu document this four-part structure in their reference text on online controlled experiments, and note that the most common cause of inconclusive A/B tests is not weak data but weak hypothesis specification up front[1].
A worked example, built in the four parts:
- Subject. Retrieval reranker for the customer-service chatbot (swap cross-encoder A for cross-encoder B).
- Predicted effect. Answer-acceptance rate rises (fewer “this did not help” clicks).
- Measurement. Acceptance rate computed over English-language sessions of chat users in the last 14 days of the experiment window, with a minimum of three turns per session.
- Threshold and decision rule. A 1.5 percentage-point absolute increase at p < 0.01 two-sided triggers the decision to ship. A greater-than-0.5-point decrease at the same confidence triggers rollback. Anything in between triggers a further 14-day window.
The decision rule is what closes the experiment. Without it, the team debates what to do after the numbers arrive, and the debate usually resolves in favor of whoever argues last.
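The worked example's decision rule can be written down before the numbers arrive. A minimal sketch in Python; the function name and signature are illustrative assumptions, but the thresholds are the ones from the example above:

```python
def rerank_decision(effect_pp: float, p_value: float, alpha: float = 0.01) -> str:
    """Decision rule from the worked example.

    effect_pp: absolute change in acceptance rate, in percentage points
    (positive means cross-encoder B is better). alpha is the two-sided
    significance level named in the brief (p < 0.01).
    """
    if p_value < alpha and effect_pp >= 1.5:
        return "ship"              # >= +1.5 pp at p < 0.01
    if p_value < alpha and effect_pp <= -0.5:
        return "rollback"          # worse than -0.5 pp at p < 0.01
    return "extend 14 days"        # anything in between: another window

# Hypothetical readings:
print(rerank_decision(1.8, 0.004))    # ship
print(rerank_decision(-0.9, 0.002))   # rollback
print(rerank_decision(0.7, 0.03))     # extend 14 days
```

Because the rule is mechanical, there is nothing left to debate when the experiment ends; the function returns the decision the team already committed to.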
[DIAGRAM: HubSpokeDiagram — aitm-eci-article-2-hypothesis-elements — Central hub labeled “Hypothesis” with six spokes: Subject, Predicted effect, Measurement, Threshold, Decision rule, Stopping rule. Each spoke lists two example questions the author answers.]
Primary, secondary, and guardrail metrics
Most experiments need more than one metric. Three roles separate cleanly.
- Primary metric. The one metric whose movement decides the experiment. Only one. When a team claims to have two primaries they are usually ducking a prioritization call.
- Secondary metrics. Metrics that help interpret the primary’s movement but do not drive the decision. A drop in a secondary with no primary signal is information, not a conclusion.
- Guardrail metrics. Metrics that must not degrade even if the primary improves. A ship decision requires both primary improvement and guardrail stability.
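The ship condition implied by the guardrail role (primary improvement plus guardrail stability) can be sketched as a single gate. The function, the sign convention, and the per-metric tolerances are illustrative assumptions, not a standard API:

```python
def ship_gate(primary_improved: bool,
              guardrail_deltas: dict[str, float],
              tolerances: dict[str, float]) -> bool:
    """Ship only when the primary improved AND no guardrail degraded
    beyond its tolerance.

    Convention (an assumption of this sketch): every guardrail delta is
    normalized so that negative means "worse" (e.g., a latency increase
    is recorded as a negative delta), and each tolerance is the largest
    degradation the team will accept for that guardrail.
    """
    guardrails_stable = all(
        guardrail_deltas.get(name, 0.0) >= -tol
        for name, tol in tolerances.items()
    )
    return primary_improved and guardrails_stable
```

Note that a primary win with any guardrail breach returns False: the gate encodes the rule that guardrail stability is a precondition, not a tiebreaker.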
Netflix’s engineering team has documented the guardrail pattern extensively in its tech-blog series, describing how a single primary metric (e.g., retention-relevant streaming engagement) is paired with a set of guardrails (latency, error rates, cancellation, customer-support contact rate)[2]. Microsoft’s experimentation platform follows a similar structure[3]. The shared lesson across both programs is that the guardrail list is not a nice-to-have; it is the mechanism that prevents a primary-metric win from being a Pyrrhic one.
The distinction between leading and lagging indicators cuts across all three roles. A leading indicator moves early in the causal chain (click-through on a generated answer) and gives faster signal but weaker evidence of the downstream outcome. A lagging indicator moves later (paid renewal, repeat-session probability) and gives stronger evidence but slower signal. Most experiments need at least one leading metric to detect early failure and one lagging metric to confirm success.
[DIAGRAM: MatrixDiagram — aitm-eci-article-2-metric-taxonomy-2x2 — 2x2 of “Primary vs. guardrail” on one axis and “Leading vs. lagging” on the other, with examples for each cell: acceptance rate in top-left, latency in top-right, renewal in bottom-left, error rate in bottom-right.]
Proxy metrics and why they drift
A proxy metric is a stand-in for the outcome the team actually cares about. Proxies are often necessary: retention is the outcome, but retention takes weeks to measure, so “sessions in the last week” becomes the proxy. The hazard is that proxies decouple from the underlying outcome in at least three ways.
Causal decoupling. The proxy correlates with the outcome on historical data but not under the new regime. A retrieval model tuned to maximize click-through generated more clicks but fewer satisfied sessions in the long run, because clickable answers were not useful answers.
Measurement decoupling. The proxy is instrumented differently from the outcome. A satisfaction proxy based on thumbs-up feedback collects from 3% of users, and those users are unrepresentative.
Adversarial decoupling. Someone, inside or outside the organization, learns to move the proxy without moving the outcome. A generated-answer system rewarded for long answers produces longer answers, not better ones. This is Goodhart’s law applied to AI experimentation: when a measure becomes a target, it ceases to be a good measure[4].
Defense against proxy drift is structural. Every proxy must be paired with a quarterly “ground-truth reconciliation” — a slower, more expensive, more direct measurement of the underlying outcome on a sample — and the correlation between proxy and ground truth must be recomputed. If it slips, the proxy is retired or revalidated. The practice is documented in Booking.com’s public writeups on its experimentation platform[5] and in the Kohavi reference[1].
Metric gaming and how to anticipate it
Gaming is not always malicious. Engineers optimize for what they are measured on. Product managers chase the chart that produces the promotion. Models trained to maximize a metric find edge-case behaviors that maximize it without moving the outcome the metric was chosen to represent. The discipline the practitioner brings is to anticipate gaming before the experiment runs, not after.
Three anticipation tactics:
- Adversarial review of the metric. Before the experiment runs, someone on the team (or a peer from another team) writes down, “If I wanted the metric to go up without the feature being better, what would I do?” The output is a short list of gaming vectors. For each vector, the team decides whether to add a guardrail, to change the measurement, or to accept the risk.
- Pre-registration. The hypothesis, metrics, threshold, and decision rule are recorded in a permanent artifact before the experiment runs. Pre-registration is standard practice in clinical trials and in the ICLR reproducibility challenge; the adaptation to AI experimentation is a simple markdown file in the experiment’s directory[6].
- Post-hoc guardrail audit. After the experiment, the team reviews the winning variant for evidence of gaming. Did the metric move disproportionately in a narrow user segment? Did a downstream behavior that should have correlated fail to correlate? If so, the conclusion is “metric moved; outcome unclear”, not “ship”.
Two real programs, contrasted
Netflix — long-running experimentation with multi-role metrics. Netflix’s experimentation platform serves thousands of experiments per year. The Netflix Tech Blog has published extensively on multi-role metric design, including a multi-part series on how guardrails are authored, how long-term-holdout designs validate proxies, and how an overall evaluation criterion (OEC) is composed from primary and secondary signals[2]. What distinguishes the Netflix practice is that metric design is treated as a first-class artifact owned by a platform team, not as an ad-hoc decision per experiment.
Microsoft ExP — metric quality as a governance function. Kohavi’s work at Microsoft and Bing documented thousands of experiments and built an internal “metric catalog” that any team could pull from, rather than inventing metrics per experiment[3]. Metric definitions in the catalog carry ownership, a validation history, and a guardrail-pairing schema. When a team proposes a new metric, it enters the catalog only after it has been peer-reviewed. The governance function is that the organization cannot run an experiment against an unreviewed metric.
The common lesson from both programs, and from dozens of others documented at companies across consumer, enterprise, and platform segments, is that metric design is a program, not an artifact. An organization that treats metric design as a program owns a metric catalog, assigns metric owners, and gates experiments on the catalog’s approval. An organization that treats it as an artifact invents a metric per experiment and learns Goodhart’s lesson at scale.
The hypothesis and metric section of the experiment brief
Article 14 of this credential covers the full experiment brief. The hypothesis-and-metric section is authored first and is typically half a page long. Its structure:
- One sentence naming the subject.
- One sentence naming the predicted effect.
- One paragraph describing the measurement, the population, and the instrument.
- One line naming the primary metric.
- A bullet list (kept short) of secondary metrics.
- A bullet list (kept short) of guardrail metrics.
- One line stating the threshold and decision rule.
- One line stating the stopping rule (when the experiment ends early or extends).
If that half-page cannot be written before the experiment starts, the experiment is not ready to start.
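The readiness gate above can be made mechanical. A minimal sketch, assuming the section is captured as a simple mapping; the field names mirror the bullet list and are illustrative, not a prescribed schema:

```python
# Fields of the hypothesis-and-metric section, mirroring the bullet list.
REQUIRED_FIELDS = [
    "subject", "predicted_effect", "measurement", "primary_metric",
    "secondary_metrics", "guardrail_metrics", "decision_rule", "stopping_rule",
]

def brief_is_ready(section: dict) -> tuple[bool, list[str]]:
    """The gate from the article: the experiment may start only when
    every field of the hypothesis-and-metric section is filled in.
    Returns (ready, list of missing or empty fields)."""
    missing = [f for f in REQUIRED_FIELDS if not section.get(f)]
    return (len(missing) == 0, missing)
```

A brief with an empty `stopping_rule` field, for example, fails the gate with that field named, which turns “the experiment is not ready to start” from a judgment call into a checklist.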
Summary
A hypothesis is a falsifiable statement with four parts: subject, predicted effect, measurement, and threshold. Metrics come in three roles: primary drives the decision, secondaries interpret, guardrails must not degrade. Proxies decouple from outcomes through three routes (causal, measurement, adversarial), and the defense is ground-truth reconciliation on a cadence. Gaming is anticipated through adversarial review, pre-registration, and post-hoc audit. Netflix and Microsoft ExP exemplify disciplined metric programs. The hypothesis-and-metric section of the experiment brief is the gate: if it cannot be written, the experiment cannot begin.
Further reading in the Core Stream: “Evaluate: Measuring Transformation Progress” and “From Measurement to Decision.”
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Ron Kohavi, Diane Tang, Ya Xu. Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press, 2020. https://experimentguide.com/ — accessed 2026-04-19.
2. Netflix Technology Blog. Experimentation platform series. https://netflixtechblog.com/tagged/ab-testing — accessed 2026-04-19.
3. Microsoft Research Experimentation Platform (ExP). https://www.microsoft.com/en-us/research/group/experimentation-platform-exp/ — accessed 2026-04-19.
4. Charles Goodhart. Problems of Monetary Management: The UK Experience. 1975; popularized as “Goodhart’s law” and widely adapted to machine-learning metric selection.
5. Booking.com engineering blog. Experimentation and metric design series. https://booking.ai/ — accessed 2026-04-19.
6. ICLR Reproducibility Challenge. Papers With Code. https://paperswithcode.com/ — accessed 2026-04-19.