Choosing Between Experimental and Observational Designs

This article teaches a six-question decision tree for design selection and walks through the standard CFO objections each design faces. Articles 19 through 23 then cover each design in depth. A learner who exits this article can look at any proposed AI value feature and, within twenty minutes, either choose the right design or correctly state why that design is infeasible and what the alternative is.

Why design choice is the first decision

The common analyst instinct is to reach for the method the analyst knows best. If the analyst is an economist, they reach for DiD. If they are a product analyst, they reach for A/B tests. If they are a data scientist, they reach for propensity matching. This is the wrong order of operations. The design choice should be made first, against the problem shape, and the analyst on the project should be chosen second.

Angrist and Pischke make this the opening argument of Mostly Harmless Econometrics: the research question dictates the research design, and the most common failure of applied economics work is to let the method drive the question.1 The same logic applies to AI value. A counterfactual chosen to fit the analyst’s comfort zone rather than the feature’s shape is a counterfactual that will not hold up in a CFO review.

The six questions

Six questions, asked in order, select the right design for any AI feature.

Question 1 — Can we randomize?

Randomization is the gold standard. If two otherwise-identical users, customers, or work units can be randomly assigned to the AI-enabled or AI-disabled condition, an A/B test (RCT) will produce the strongest causal inference any design can produce. DoorDash’s published pricing-experimentation work is a high-quality example of randomization executed at marketplace scale.2

Randomization fails when the AI feature cannot be selectively withheld. An enterprise-wide copilot rollout does not permit randomization — every knowledge worker gets access. A fraud-detection model that scores every transaction does not permit randomization — the threshold is the same for everyone. If the answer to Question 1 is “no,” move to Question 2.
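
To make the Question 1 readout concrete, here is a minimal A/B analysis sketch, assuming one outcome row per randomized unit. The data are simulated and illustrative, not drawn from the DoorDash work.

```python
# Minimal A/B readout: difference in means with a Welch t-test.
# All numbers are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
treated = rng.normal(10.5, 3.0, size=2000)  # outcomes, AI-enabled arm
control = rng.normal(10.0, 3.0, size=2000)  # outcomes, AI-disabled arm

lift = treated.mean() - control.mean()
t, p = stats.ttest_ind(treated, control, equal_var=False)  # Welch's t-test
se = np.sqrt(treated.var(ddof=1) / len(treated)
             + control.var(ddof=1) / len(control))
print(f"lift={lift:.2f}, "
      f"95% CI=({lift - 1.96*se:.2f}, {lift + 1.96*se:.2f}), p={p:.3f}")
```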

Question 2 — Is rollout staged, with natural variation in timing?

If the AI feature will be rolled out in waves — geography-by-geography, team-by-team, tier-by-tier — and the timing of each wave is plausibly independent of the outcome, the DiD design from Article 20 is the right choice. The copilot that launches in Germany this quarter, France next quarter, and Italy the quarter after produces exactly the timing variation DiD exploits.

DiD requires parallel trends — treated and control units must have been moving in parallel before treatment. If the trends were diverging pre-treatment, the DiD estimate will be biased. If the answer to Question 2 is “no” — the rollout is simultaneous, or timing was picked precisely to target the highest-value segments first — move to Question 3.
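
A minimal sketch of the 2x2 DiD arithmetic, assuming parallel trends hold; the group means are illustrative, not real rollout data.

```python
# 2x2 difference-in-differences:
# (treated post - treated pre) minus (control post - control pre).
# Group means are illustrative.
pre_treated, post_treated = 100.0, 118.0  # e.g., Germany before/after launch
pre_control, post_control = 100.0, 106.0  # e.g., France, not yet launched

did = (post_treated - pre_treated) - (post_control - pre_control)
print(f"DiD estimate of the AI effect: {did:.1f}")  # 12.0 here

# Only causal under parallel trends. With panel data, regress the outcome
# on unit and period fixed effects plus a treated-x-post interaction, and
# check that pre-treatment placebo leads are near zero.
```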

Question 3 — Is there a threshold or score cutoff driving assignment?

Many AI features use a threshold. A credit-scoring model accepts applicants above a cutoff, rejects below. A fraud model routes transactions above a threshold to review, passes transactions below. A customer-success model escalates accounts above a churn-risk score. For these threshold-assignment cases, the regression discontinuity design from Article 21 is purpose-built.

RDD assumes the threshold itself is exogenous — that applicants just above and just below the threshold are otherwise similar. Where threshold manipulation exists (applicants gaming their credit-score input, say), the design is biased. If the answer to Question 3 is “no” — there is no threshold, or the threshold is endogenously manipulable — move to Question 4.
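
A minimal sharp-RDD sketch: local linear fits on each side of the cutoff within a bandwidth. The 680 cutoff echoes the self-check below; the data, bandwidth, and effect size are simulated for illustration.

```python
# Sharp RDD: fit a line on each side of the cutoff within bandwidth h,
# then take the gap between the two fits at the cutoff. Simulated data.
import numpy as np

rng = np.random.default_rng(1)
score = rng.uniform(600, 760, 5000)            # running variable (credit score)
above = score >= 680                           # sharp assignment at the cutoff
outcome = 0.02 * score + 3.0 * above + rng.normal(0, 1, 5000)

h = 20                                         # bandwidth around the cutoff
left = (score >= 680 - h) & (score < 680)
right = above & (score <= 680 + h)

fit_left = np.polyfit(score[left], outcome[left], 1)
fit_right = np.polyfit(score[right], outcome[right], 1)
effect = np.polyval(fit_right, 680) - np.polyval(fit_left, 680)
print(f"RDD effect at the 680 cutoff: {effect:.2f}")  # ~3.0 by construction
```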

Question 4 — Is the treated unit unique?

Some AI deployments affect a single unique unit — an enterprise-wide copilot in a single company, a national fraud system in a single government, a single-market service. For these cases, synthetic control from Article 22 constructs a counterfactual by weighting a donor pool of untreated units (peer companies, peer governments, peer markets) to match the treated unit’s pre-treatment trajectory.

Synthetic control requires a plausible donor pool. If the treated unit is truly one of a kind — no peer company, no peer market — synthetic control is infeasible. If the answer to Question 4 is “no,” move to Question 5.
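
A minimal sketch of the synthetic-control weighting step, assuming a small donor pool with pre-treatment outcome histories. The plain least-squares objective is a simplification of the full procedure Article 22 covers, and the data are simulated.

```python
# Synthetic control: choose non-negative donor weights summing to 1 that
# best reproduce the treated unit's pre-treatment path. Simulated data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T_pre, n_donors = 24, 8                         # 24 pre-treatment months, 8 peers
donors = rng.normal(100, 5, (T_pre, n_donors))  # donor-pool outcome histories
true_w = np.array([0.5, 0.3, 0.2, 0, 0, 0, 0, 0])
treated_pre = donors @ true_w                   # treated unit, by construction

res = minimize(lambda w: np.sum((treated_pre - donors @ w) ** 2),
               np.full(n_donors, 1 / n_donors),
               bounds=[(0, 1)] * n_donors,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
print("donor weights:", np.round(res.x, 2))     # recovers ~[0.5, 0.3, 0.2, ...]
# Post-treatment effect at time t: treated_t - donors_post_t @ res.x
```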

Question 5 — Do we have rich observational data on treated and untreated units, and plausible observability of all confounders?

When randomization, staged rollout, threshold assignment, and uniqueness all fail, propensity-score matching from Article 23 may be usable. PSM estimates each unit’s propensity to receive treatment based on observed covariates, then matches treated and control units on similar propensities. PSM’s Achilles heel is the matching-on-observables limitation — if unobserved confounders drive both treatment and outcome, PSM will overclaim.

PSM is the right answer when a feature’s adoption is voluntary and the analyst has rich data on who adopted, who did not, and why. It is the wrong answer when the adoption decision is driven by unobservable factors (individual motivation, team culture, leadership preference) that also drive the outcome. If the answer to Question 5 is “no — we do not trust that our observables cover the confounders,” move to Question 6.
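
A minimal PSM sketch, simulating the favorable case where adoption depends only on observed covariates, which is exactly the assumption the paragraph above warns about. Model choice, columns, and data are illustrative.

```python
# Propensity-score matching: logistic propensity model, then 1-nearest-
# neighbor matching on the score. Simulated voluntary-adoption data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(3)
X = rng.normal(size=(3000, 4))                        # observed covariates
p_adopt = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
treated = rng.random(3000) < p_adopt                  # voluntary adoption
outcome = 2.0 * treated + X[:, 0] + rng.normal(0, 1, 3000)

ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Match each adopter to the non-adopter with the closest propensity.
nn = NearestNeighbors(n_neighbors=1).fit(ps[~treated].reshape(-1, 1))
_, idx = nn.kneighbors(ps[treated].reshape(-1, 1))
att = (outcome[treated] - outcome[~treated][idx.ravel()]).mean()
print(f"ATT estimate: {att:.2f}")  # ~2.0 only because confounders are observed
```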

Question 6 — Is a pre/post comparison the best we can do?

The last resort. A simple pre/post comparison — outcomes in the three months before the feature shipped, versus the three months after — is a weak counterfactual that assumes nothing else changed during the transition. It is often the only option for small programs without rollout variation, unique-unit status, or rich observational data. The best pre/post analyses acknowledge the limitation explicitly in the counterfactual narrative (Article 16, Section 3).

A pre/post design that is presented as equivalent to an RCT or DiD will fail a CFO review. A pre/post design that is presented as pre/post, with its assumptions disclosed, is a defensible minimum.
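
The six questions can be written down as a small function. A minimal sketch: the boolean inputs compress judgment calls that in practice take the full twenty minutes, and the article mapping follows the sections above.

```python
# The six-question decision tree as code. Inputs are illustrative
# simplifications of the judgment calls described above.
def choose_design(can_randomize: bool,
                  staged_rollout: bool,
                  threshold_assignment: bool,
                  unique_unit_with_donors: bool,
                  confounders_observed: bool) -> str:
    if can_randomize:
        return "A/B test (RCT), Article 19"
    if staged_rollout:
        return "Difference-in-differences, Article 20"
    if threshold_assignment:
        return "Regression discontinuity, Article 21"
    if unique_unit_with_donors:
        return "Synthetic control, Article 22"
    if confounders_observed:
        return "Propensity-score matching, Article 23"
    return "Pre/post with assumptions disclosed (last resort)"

# Enterprise-wide copilot, simultaneous launch, peer companies available:
print(choose_design(False, False, False, True, False))  # synthetic control
```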

The secondary filter: stakes, reversibility, network effects, sample size, ethics, feasibility

The six questions above select on problem shape. Six secondary considerations then filter which design is actually deployable in the context.

Stakes. High-stakes decisions (medical triage, credit, hiring) cannot ethically use A/B randomization where the control arm is materially disadvantaged. These cases push toward observational designs with IRB-equivalent review.

Reversibility. If the AI feature is hard to reverse once adopted (an enterprise-wide deployment that trains users on new workflows), the experimental design must front-load the evaluation or commit to running the experiment alongside rollout.

Network effects. A/B tests break when treatment spills between arms. Social-feature AI (content recommendation, team-collaboration copilots) routinely suffers contamination. DiD with geography splits is often better.

Sample size. Programs with fewer than a few hundred units lack statistical power. Synthetic control and PSM need rich pre-treatment history; DiD needs parallel trends; RDD needs dense observations near the threshold. Underpowered designs produce wide uncertainty bands that CFOs read as “we don’t know”; the sizing sketch after this list makes the arithmetic concrete.

Ethics and regulatory constraint. Some designs are banned by sector rule. Credit-scoring A/B tests face ECOA constraints in the US and similar rules in most jurisdictions. Medical AI evaluations require IRB review. The design choice must clear the regulatory bar before it clears the statistical bar.

Feasibility. The most elegant design is worthless if the data does not exist. A DiD design requires treated-unit data and untreated-unit data with comparable time series. A synthetic-control design requires a donor pool with observable covariates. The feasibility filter kills many designs in practice.
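
To make the sample-size filter concrete, here is the textbook two-arm sizing calculation for a difference in means; the effect size and variance are illustrative choices, not COMPEL defaults.

```python
# Required n per arm for a two-arm test of a difference in means:
# n = 2 * (z_{1-alpha/2} + z_{power})^2 * sigma^2 / delta^2
from scipy.stats import norm

alpha, power = 0.05, 0.80
sigma = 3.0   # outcome standard deviation (illustrative)
delta = 0.5   # minimum detectable effect worth claiming (illustrative)

z_a = norm.ppf(1 - alpha / 2)
z_b = norm.ppf(power)
n_per_arm = 2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2
print(f"n per arm: {n_per_arm:.0f}")  # ~566; a 100-unit program is underpowered
```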

The standard CFO objections

Every design faces predictable CFO objections. Learners who anticipate the objection before the CFO raises it deliver counterfactuals that survive first review.

Against A/B tests, CFOs object that the sample does not represent the broader population — the test ran in one market, one segment, one season. The rebuttal is external-validity analysis: show the sample’s covariate distribution matches the broader population, or state where it does not.

Against DiD, CFOs object that the trend break could be coincident with something else — a product launch, a seasonal swing, a macroeconomic shift. The rebuttal is pre-treatment parallel-trend testing and robustness to placebo interventions; a placebo sketch follows this list of objections.

Against RDD, CFOs object that the threshold was chosen to maximize political approval rather than to create a natural experiment. The rebuttal is bandwidth-robustness testing and discussion of threshold manipulability.

Against synthetic control, CFOs object that the donor pool is not comparable. The rebuttal is pre-treatment fit diagnostics and sensitivity to donor-pool composition.

Against PSM, CFOs object that unobservable confounders drive the result. The rebuttal is Rosenbaum sensitivity analysis and transparency about the matching-on-observables limit.
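
Of these rebuttals, the DiD placebo test is the most mechanical to demonstrate: re-estimate the effect at a fake treatment date inside the pre-period and confirm the estimate is near zero. A minimal sketch on simulated data:

```python
# Placebo DiD: a fake treatment date in the pre-period should yield an
# estimate near zero if trends were truly parallel. Simulated data.
import numpy as np

rng = np.random.default_rng(4)
months = np.arange(24)                  # true treatment begins at month 18
treated = 100 + 0.5 * months + 8 * (months >= 18) + rng.normal(0, 1, 24)
control = 100 + 0.5 * months + rng.normal(0, 1, 24)

def did(y_t, y_c, cut):
    return ((y_t[cut:].mean() - y_t[:cut].mean())
            - (y_c[cut:].mean() - y_c[:cut].mean()))

print(f"true-date DiD (month 18): {did(treated, control, 18):.1f}")  # ~8
# Placebo: restrict to the pre-period and pretend treatment hit month 12.
print(f"placebo DiD (month 12): "
      f"{did(treated[:18], control[:18], 12):.1f}")  # ~0
```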

The best practitioners themselves put the CFO objections in Section 3 of the VRR (Article 16), alongside the counterfactual method and the robustness tests. Transparency about known limits is what turns a counterfactual from a marketing claim into an audit-grade finding.

Cross-reference to Core Stream

  • EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md#causal-design — practitioner framing of causal-design choice.
  • EATP-Level-2/M2.5-Art10-From-Measurement-to-Decision.md — decision coupling from causal evidence.

Self-check

  1. A contact center deploys a GenAI agent copilot enterprise-wide on a single day. Which design is feasible and why?
  2. A credit-scoring model uses a 680-score threshold for auto-approve. Which design is purpose-built for this case, and what ethical constraint must be cleared first?
  3. A recommender-system team wants to A/B test but contamination between arms is suspected. What is the alternative, and what rollout shape is required?
  4. An analyst prefers PSM because they are expert in it; the feature has staged geography rollout. What is the correct design and why?

Further reading

  • Angrist & Pischke, Mostly Harmless Econometrics (Princeton, 2009), Chs. 1–3.
  • Pearl, Causal Inference in Statistics: An Overview, Statistics Surveys 3 (2009).
  • Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects, Journal of Economic Literature (2021).

Footnotes

  1. J. D. Angrist and J.-S. Pischke, Mostly Harmless Econometrics: An Empiricist’s Companion (Princeton University Press, 2009). https://press.princeton.edu/books/paperback/9780691120355/mostly-harmless-econometrics

  2. DoorDash Engineering, Published pricing experimentation case studies (2022). https://doordash.engineering/