
Synthetic Control for Unique Deployments



Synthetic control fills a gap that difference-in-differences (DiD) and regression discontinuity design (RDD) cannot fill: what happens when the AI feature is deployed enterprise-wide at one company, nationwide in one country, or across all customers of one product line? There are no untreated units within the organization; there is exactly one treated unit, so DiD, A/B tests, and RDD are all infeasible. But donor units (peer companies, peer countries, peer product lines) may exist in published data, and their weighted combination can construct a counterfactual for the unique treated case.

This article teaches the construction, the placebo testing, the robustness analysis, and the reporting discipline that turn synthetic control from a niche academic method into a repeatable enterprise AI-evaluation tool.

Constructing the synthetic control

Five steps construct a synthetic control.

Step 1 — Define the treated unit and pre/post periods

The treated unit is the organization, market, or segment that received the AI feature. The pre-treatment period is the historical window used for trajectory matching; it needs enough history (typically at least 12 monthly or quarterly periods) to characterize the pre-launch trend. The post-treatment period runs from the feature launch forward.
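As a running sketch in Python (all names hypothetical, with placeholder data standing in for the real panel), the raw material for every step that follows is one treated-unit outcome series and a matrix of donor outcomes over the same periods, split at the launch date:

```python
import numpy as np

# Hypothetical panel: 16 pre-treatment and 8 post-treatment quarters of an
# outcome such as revenue per employee. Replace the placeholder data with
# the organization's actual series.
T0, T1 = 16, 8
periods = T0 + T1

y_treated = np.random.rand(periods)      # outcome series for the treated unit
Y_donors = np.random.rand(periods, 10)   # one column per donor unit (J = 10)

y_pre, Y_pre = y_treated[:T0], Y_donors[:T0]      # used to fit the weights
y_post, Y_post = y_treated[T0:], Y_donors[T0:]    # used to estimate the effect
```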

Step 2 — Assemble the donor pool

The donor pool is a set of untreated units — peer organizations, peer markets, peer products — whose outcomes are observable and whose structure is comparable to the treated unit. For an enterprise-wide copilot evaluation, donors might be peer companies of similar size, industry, and customer base that have not deployed a comparable copilot. For a national AI service evaluation, donors are peer nations with similar economic and demographic profiles.

Donor pool quality is the primary constraint on synthetic control validity. A donor pool of ten peer companies of varied sizes and industries will often produce better counterfactuals than a pool of three perfectly-matched peers, because the weighting algorithm can combine partial matches.

Step 3 — Run the weighting algorithm

The synthetic-control weighting algorithm finds weights $w_1, w_2, …, w_J$ on the $J$ donor units that minimize the pre-treatment trajectory distance between the weighted donor combination and the treated unit. Constraints: weights sum to 1 and all weights are non-negative. Abadie’s implementation is available in R (Synth package), Python (SyntheticControlMethods), and Stata (synth command).1
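A minimal version of the weighting step, continuing the sketch above and matching only the outcome trajectory (the published packages also match on pre-treatment covariates and use more specialized optimizers), can be written with a generic constrained optimizer:

```python
from scipy.optimize import minimize

def fit_weights(y_pre, Y_pre):
    """Donor weights that are non-negative, sum to 1, and minimize the
    squared pre-treatment prediction error."""
    J = Y_pre.shape[1]
    loss = lambda w: np.sum((y_pre - Y_pre @ w) ** 2)
    result = minimize(
        loss,
        x0=np.full(J, 1.0 / J),                                        # start at equal weights
        bounds=[(0.0, 1.0)] * J,                                       # w_j >= 0
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],  # weights sum to 1
        method="SLSQP",
    )
    return result.x

weights = fit_weights(y_pre, Y_pre)
```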

Pre-treatment fit is diagnosed by the root mean squared prediction error (RMSPE): the square root of the mean squared deviation between the treated-unit outcome and the synthetic-control outcome across pre-treatment periods. A small RMSPE (≤5% of the outcome scale) indicates a tight pre-treatment fit. A large RMSPE means the donor pool cannot construct a good counterfactual, and the analysis is unreliable.
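The fitted weights give the RMSPE diagnostic directly (continuing the sketch above):

```python
def rmspe(y, Y, w):
    """Root mean squared prediction error of the synthetic control for a unit."""
    return np.sqrt(np.mean((y - Y @ w) ** 2))

pre_fit = rmspe(y_pre, Y_pre, weights)
# Judge against the outcome scale, e.g. pre_fit / np.mean(y_pre) <= 0.05 for a tight fit.
```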

Step 4 — Compute the treatment effect

The treatment effect in each post-treatment period is the actual treated-unit outcome minus the synthetic-control outcome. Averaged across post-treatment periods, this gives the average post-treatment effect.
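In the same sketch, the per-period gaps and their post-treatment average are:

```python
gaps = y_post - Y_post @ weights   # estimated treatment effect in each post-launch period
average_effect = gaps.mean()       # average post-treatment effect
```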

Step 5 — Placebo and robustness tests

Placebo tests are indispensable. For each donor unit, pretend it was the treated unit, construct a synthetic control from the remaining donors, and compute the “placebo effect.” If the actual treated unit’s post-treatment gap exceeds the 95th percentile of placebo gaps, the effect is statistically distinguishable from noise. Note that with $J$ donors the permutation p-value can never fall below $1/(J+1)$, so a small donor pool caps how strong a claim the placebo test can support.
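A placebo loop can reuse the fit_weights sketch from Step 3; the percentile comparison then follows the logic above:

```python
def placebo_gaps(Y_donors, T0):
    """Treat each donor in turn as if it had launched the feature and
    compute its average post-period gap against the remaining donors."""
    gaps = []
    for j in range(Y_donors.shape[1]):
        y_j = Y_donors[:, j]
        others = np.delete(Y_donors, j, axis=1)
        w_j = fit_weights(y_j[:T0], others[:T0])
        gaps.append(np.mean(y_j[T0:] - others[T0:] @ w_j))
    return np.array(gaps)

placebos = placebo_gaps(Y_donors, T0)
significant = abs(average_effect) > np.percentile(np.abs(placebos), 95)
# A common refinement ranks units by the ratio of post- to pre-treatment RMSPE
# instead of the raw gap, to discount placebos with poor pre-treatment fit.
```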

Robustness tests include leave-one-out (re-run the analysis excluding each donor in turn) and pre-treatment-period sensitivity (re-run with shorter or different pre-treatment windows). Effects robust to leave-one-out and pre-treatment variation are reportable; effects that change materially under these tests are fragile.
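Leave-one-out robustness is the same kind of loop applied to the actual treated unit, excluding each donor in turn (again reusing the earlier sketch):

```python
def leave_one_out_effects(y_treated, Y_donors, T0):
    """Re-estimate the average post-treatment effect with each donor excluded."""
    effects = []
    for j in range(Y_donors.shape[1]):
        reduced = np.delete(Y_donors, j, axis=1)
        w = fit_weights(y_treated[:T0], reduced[:T0])
        effects.append(np.mean(y_treated[T0:] - reduced[T0:] @ w))
    return np.array(effects)

loo = leave_one_out_effects(y_treated, Y_donors, T0)
# Report (loo.min(), loo.max()) alongside the headline estimate.
```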

When synthetic control applies to AI

Three common AI-evaluation scenarios fit synthetic control.

Scenario 1 — Enterprise-wide copilot rollout

A company deploys a GenAI copilot to all 20,000 knowledge workers simultaneously. No internal controls exist. Donor pool: publicly reporting peer companies whose productivity metrics (disclosed in earnings or in labor-productivity data) are observable, and who have not deployed comparable copilots. The synthetic control weights peers to match the pre-launch productivity trajectory; the post-launch gap is the estimated copilot effect.

This analysis has limits. It relies on public productivity data whose definitions vary across companies. Donor pools for knowledge-work copilots are small because the feature has spread quickly. Results are typically reported with wide uncertainty bands and careful disclosure of donor-pool composition.

Scenario 2 — Single-market AI service

A national retailer deploys AI-driven personalization in its home market only. Peer retailers in other markets, or the same retailer’s pre-launch trajectory versus peers, supply the donor pool. Cross-market metric mapping (the “Rosetta stone” problem) is easier here because retail-sector data is reported comparably across markets.

Scenario 3 — Public-sector single-agency deployment

A national government deploys an AI triage system in one agency. Donor pool: analogous agencies in peer countries that have not deployed comparable systems. Government reporting data (OECD, Eurostat, NAO) often provides the donor pool. The US Navy “Task Force Hopper” program, referenced earlier, could in principle be evaluated with synthetic control against peer services in other nations’ defense programs.2

Limitations and honest reporting

Synthetic control is a powerful but limited tool. Four limitations should appear in any reporting of a synthetic-control result.

Single treated unit. Statistical inference is based on permutation over donor placebos, not on classical sampling theory. Uncertainty is typically wider than analysts accustomed to DiD or A/B tests will expect.

Donor-pool dependency. Results depend on which units are in the donor pool. Leave-one-out sensitivity often reveals meaningful shifts in the estimate; these must be disclosed.

Interpolation bias. The synthetic control is a weighted average of donors, so it can only interpolate. If the treated unit lies outside the convex hull of donor values on an important covariate, no weight combination can reproduce it, and even inside the hull, averaging structurally dissimilar donors can misrepresent the treated unit. Pre-treatment RMSPE catches the worst cases but not all; a crude covariate range check is sketched after these limitations.

Confounding shocks. Any event that affects the treated unit but not donors — or vice versa — will be absorbed into the estimate. A post-launch macroeconomic shock that hits the treated country harder than the donor countries will inflate the estimated AI effect.
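For the interpolation-bias limitation, one crude screen (a sketch assuming a hypothetical donor-covariate matrix) is a per-covariate range check: a treated unit that falls outside the donor range on any covariate cannot be matched on that covariate by any non-negative weighted average.

```python
def outside_donor_range(x_treated, X_donors):
    """Flag covariates where the treated unit lies outside the donor min-max range.
    Lying inside every per-covariate range is necessary, but not sufficient,
    for lying inside the donors' convex hull."""
    low, high = X_donors.min(axis=0), X_donors.max(axis=0)
    return (x_treated < low) | (x_treated > high)

# x_treated: 1-D array of treated-unit covariates; X_donors: (J, K) numpy array
# with one row per donor. Any True entry signals extrapolation risk.
```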

Reporting a synthetic-control result

The VRR section 3 presentation of a synthetic-control result includes: the treated unit, the donor pool and its composition rationale, the weighting algorithm, the pre-treatment RMSPE, the point estimate with placebo-based inference, leave-one-out sensitivity, and the explicit disclosure of the four limitations above. The narrative names the method as “synthetic control” — analysts who label it “difference-in-differences” because that term is more familiar are mis-stating the method and will lose credibility when a reviewer spots the mislabel.

Cross-reference to Core Stream

  • EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md#causal-design — practitioner framing of causal methods.
  • EATE-Level-3/M3.5-Art15-Strategic-Value-Realization-Risk-Adjusted-Value-Frameworks.md — expert-level risk-adjusted value framing where synthetic control often applies.

Self-check

  1. A company’s enterprise-wide copilot rollout has no internal controls. Pre-treatment RMSPE on the synthetic control is 14% of the outcome scale. What does this mean, and what is the reporting implication?
  2. A placebo test shows the treated unit’s post-treatment gap is smaller than the median placebo gap. What conclusion is appropriate?
  3. Leave-one-out sensitivity shows the estimate ranging from +2% to +18% across donor-exclusion runs. How do you report the result?
  4. An analyst presents a synthetic-control analysis but labels the method “difference-in-differences.” What is the corrective conversation?

Further reading

  • Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects, Journal of Economic Literature 59, no. 2 (2021): 391–425.
  • Abadie, Diamond, and Hainmueller, Synthetic Control Methods for Comparative Case Studies, Journal of the American Statistical Association 105, no. 490 (2010).
  • Ben-Michael, Feller, and Rothstein, The Augmented Synthetic Control Method, Journal of the American Statistical Association (2021).

Footnotes

  1. Alberto Abadie, Using Synthetic Controls: Feasibility, Data Requirements, and Methodological Aspects, Journal of Economic Literature 59, no. 2 (2021): 391–425. https://www.aeaweb.org/articles?id=10.1257/jel.20191450

  2. US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021). https://www.gao.gov/products/gao-21-519sp