PSM fills a specific niche in AI evaluation: the feature adoption is voluntary, the adopter and non-adopter populations are systematically different, and the analyst has rich observational data on the characteristics that drive both adoption and outcome. Enterprise copilot rollouts where employees opt in, consumer AI services with voluntary sign-up, and AI-augmented clinical workflows with physician-level choice all fit the pattern.
PSM is also the causal design most often misapplied. Its Achilles heel is the matching-on-observables limitation: PSM can only adjust for covariates observed in the data. Unobserved confounders — individual motivation, team culture, leadership preference — routinely drive both treatment uptake and outcome, and PSM cannot remove their effect. This article teaches the method, the balance diagnostics, and the honest disclosure discipline that separate defensible PSM work from overclaiming.
The PSM workflow
Five steps produce a PSM analysis.
Step 1 — Define treatment and control
Treated units adopted the AI feature; control units did not. Both groups must come from a well-defined population and a specified time window. Ambiguous treatment definition (e.g., “used Copilot”) typically collapses under scrutiny — was once-a-month use “treatment”? The standard practice is to define treatment by an activity threshold (e.g., “used Copilot on at least 20 work days in Q1”) and test sensitivity to the threshold.
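The activity-threshold definition above can be sketched in a few lines. The user ids, usage counts, and the 20-day cut-off are illustrative (the cut-off mirrors the example in the text); the loop is the sensitivity check to the threshold choice.

```python
# Sketch: defining treatment by an activity threshold, plus a sensitivity
# check. `usage_days` maps a hypothetical user id to the number of Q1 work
# days with at least one Copilot action.

usage_days = {"u1": 45, "u2": 3, "u3": 22, "u4": 0, "u5": 19, "u6": 31}

def assign_treatment(usage, threshold):
    """Return {user: 1 if treated (>= threshold active days) else 0}."""
    return {u: int(d >= threshold) for u, d in usage.items()}

for threshold in (10, 20, 30):          # test sensitivity to the cut-off
    groups = assign_treatment(usage_days, threshold)
    n_treated = sum(groups.values())
    print(threshold, n_treated, len(groups) - n_treated)
```

If group sizes swing sharply across plausible thresholds, the treatment definition itself is doing analytical work and should be reported as such.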
Step 2 — Estimate propensity scores
Fit a model predicting treatment status from baseline covariates. Logistic regression is the classical choice; machine-learning methods (random forests, gradient boosting) are increasingly used for flexibility. The model includes every covariate that could plausibly drive both treatment uptake and outcome: tenure, role, prior productivity, team, manager, prior technology-adoption history, location.
A practical check: the propensity model should achieve decent but not excellent discrimination. If the ROC AUC is 0.95, treated and control are so separable on observables that matching is barely feasible — any match across the propensity range will involve substantial extrapolation. If the AUC is 0.55, treatment is nearly random on observables and matching may be overkill.
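The discrimination check can be run directly on the fitted scores. In practice the scores would come from a propensity model (e.g. a logistic regression); the labels and scores below are illustrative stand-ins.

```python
# Sketch: computing the ROC AUC of a propensity model from its scores,
# using the Mann-Whitney formulation: AUC = P(score of a random treated
# unit > score of a random control unit), counting ties as 1/2.

def roc_auc(labels, scores):
    """labels: 1 = treated, 0 = control; scores: propensity scores."""
    pairs = wins = 0.0
    for li, si in zip(labels, scores):
        if li != 1:
            continue
        for lj, sj in zip(labels, scores):
            if lj != 0:
                continue
            pairs += 1
            wins += 1.0 if si > sj else (0.5 if si == sj else 0.0)
    return wins / pairs

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]   # hypothetical propensity scores
print(round(roc_auc(labels, scores), 3))
```

An AUC near 0.95 signals near-separation on observables (matching barely feasible); one near 0.55 signals near-random uptake.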
Step 3 — Match
Several matching algorithms are standard. Nearest-neighbour matching pairs each treated unit with its closest control on propensity score. Caliper matching adds a maximum-distance constraint (typically 0.2 standard deviations of the propensity score). Kernel matching uses a weighted average of controls near the treated unit’s propensity. Optimal matching minimizes total within-pair distance across all pairs simultaneously.
The choice of algorithm matters, but rarely by much once balance is achieved. The analyst should report which algorithm was used and verify that it produced balance on the matched sample.
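Nearest-neighbour matching with a caliper, the most common combination above, can be sketched as follows. The propensity scores are illustrative, and the 0.2-standard-deviation caliper follows the rule of thumb in the text.

```python
# Sketch: 1:1 nearest-neighbour matching on the propensity score with a
# caliper, without replacement. Treated units with no control inside the
# caliper are left unmatched (and would be reported as dropped).

import statistics

def caliper_match(treated, controls, caliper_sd=0.2):
    """Pair each treated score with the nearest unused control within the caliper."""
    caliper = caliper_sd * statistics.pstdev(treated + controls)
    available = dict(enumerate(controls))
    pairs = []
    for t in treated:
        if not available:
            break
        j, c = min(available.items(), key=lambda kv: abs(kv[1] - t))
        if abs(c - t) <= caliper:          # enforce the maximum-distance constraint
            pairs.append((t, c))
            del available[j]               # match without replacement
    return pairs

treated_ps = [0.62, 0.55, 0.91]
control_ps = [0.60, 0.50, 0.40, 0.30]
print(caliper_match(treated_ps, control_ps))
```

Note that the caliper can discard treated units entirely; the estimand then covers only the matched subpopulation, which should be disclosed.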
Step 4 — Test balance
Balance testing verifies that matched treated and control groups are comparable on the observed covariates. Standard tests: standardized mean difference (target <0.1 on every covariate), variance ratio (target 0.5–2.0), and Q-Q plots for continuous variables.
Balance failure indicates the matching did not achieve its purpose. Remedies: tighten caliper, add covariates to the propensity model, exclude propensity ranges where matching is infeasible (common-support trimming), or switch methods.
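The two numeric diagnostics above reduce to a few lines per covariate. The tenure values below are illustrative; the thresholds (SMD < 0.1, variance ratio in 0.5–2.0) are the targets stated in Step 4.

```python
# Sketch: post-match balance diagnostics on one continuous covariate:
# standardized mean difference (pooled-SD standardization) and variance ratio.

import statistics

def balance(treated, control):
    mt, mc = statistics.mean(treated), statistics.mean(control)
    vt, vc = statistics.variance(treated), statistics.variance(control)
    smd = abs(mt - mc) / ((vt + vc) / 2) ** 0.5
    return smd, vt / vc

tenure_t = [4.0, 5.5, 6.0, 7.5]          # hypothetical matched treated units
tenure_c = [4.2, 5.6, 6.1, 7.3]          # hypothetical matched controls
smd, vratio = balance(tenure_t, tenure_c)
print(smd < 0.1, 0.5 <= vratio <= 2.0)   # both checks pass here
```

In a real analysis this is run on every covariate, pre-match and post-match, and the full table goes into the report.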
Step 5 — Estimate the treatment effect and test sensitivity
The treatment effect is the mean outcome difference between matched treated and control units. For the average treatment effect on the treated (ATT), one averages pair-level differences. For the average treatment effect (ATE), one re-weights to represent the full population.
Sensitivity analysis — Rosenbaum bounds — asks: how large would an unobserved confounder need to be to overturn the estimated effect? Reporting Rosenbaum bounds is what distinguishes audit-grade PSM from marketing PSM. An effect robust to a Γ = 1.5 confounder (50% greater treatment-odds ratio from unobserved bias) is reasonably defensible; an effect that loses significance at Γ = 1.1 is fragile.
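The ATT computation and its bootstrap confidence interval can be sketched over the matched pairs. The pair outcomes are illustrative; Rosenbaum bounds themselves require a dedicated implementation and are not reproduced here.

```python
# Sketch: ATT as the mean of pair-level outcome differences, with a
# percentile bootstrap confidence interval resampled over matched pairs.

import random
import statistics

pairs = [(108, 100), (95, 97), (112, 104), (101, 96), (99, 95)]  # (treated, control)
diffs = [t - c for t, c in pairs]
att = statistics.mean(diffs)               # ATT: average pair-level difference

rng = random.Random(0)                     # fixed seed for reproducibility
boots = sorted(
    statistics.mean(rng.choices(diffs, k=len(diffs))) for _ in range(2000)
)
ci = (boots[49], boots[1949])              # ~95% percentile interval
print(att, ci)
```

Resampling pairs (rather than individual units) respects the matched structure; with real data the pair count would be far larger than five.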
The matching-on-observables limit
PSM’s cardinal limitation is that it matches only on observables. The assumption that all confounders are observed — the “conditional independence assumption” or “selection on observables” — cannot be empirically tested. If unobserved factors drive both treatment uptake and outcome, PSM will estimate the biased treatment-plus-selection effect, not the treatment effect.
For AI-feature adoption, the most common unobserved confounders are:
- Motivation. Highly motivated employees adopt new tools faster and also perform better. PSM matches on tenure and role; it cannot match on motivation.
- Team culture. Teams with collaborative culture both adopt new tools and perform better. PSM matches on team ID; it cannot match on culture within team.
- Leadership. Teams under ambitious managers both adopt new tools and perform better. PSM can sometimes match on manager ID, but the matched sample becomes thin.
Analysts who estimate PSM without addressing these confounders will overstate the AI feature’s effect. The overstatement magnitude is typically 20–50% in enterprise settings, based on comparisons of PSM estimates with subsequent RCT evidence in the same organizations.
The mitigation is not to abandon PSM. It is to report PSM estimates with Rosenbaum bounds and to triangulate across methods when possible. A PSM estimate that matches a DiD estimate on a subset of the population where DiD is feasible carries more weight than a PSM estimate alone.
The overclaim pattern
The PSM overclaim pattern is common enough to be worth naming. A company rolls out a voluntary AI copilot; 40% of employees adopt; PSM compares adopters to non-adopters on observed covariates; the estimated effect is +12% productivity. Press release: “AI copilot boosts productivity 12%.” CFO accepts; board applauds.
Six months later, a partial RCT in a subset of the organization estimates the effect at +5% productivity. The gap is the matching-on-observables bias. The 7-percentage-point overstatement came from selection — motivated, high-performing employees adopt AI faster. PSM attributed the motivation-and-culture effect to the AI.
The disciplined reporting path would have been: “PSM estimate of +12%, with Rosenbaum sensitivity showing the effect disappears at Γ = 1.3, and a parallel RCT planned for Q3 to verify.” That honest framing, delivered at release time, would have survived the subsequent RCT result. The overstatement framing did not.
The published enterprise adoption-study literature contains examples of both patterns. Academic re-analyses of proprietary enterprise AI-adoption datasets have repeatedly found that PSM-reported effects attenuate under more rigorous designs.¹
When to use PSM
PSM is the right tool when all four conditions hold: treatment uptake is voluntary; adopter and non-adopter populations differ systematically; rich observational data is available on the plausible confounders; and the analyst can commit to Rosenbaum-bounds reporting.
PSM is the wrong tool when: randomization is feasible (use A/B from Article 19); staged rollout creates timing variation (use DiD from Article 20); a threshold governs treatment (use RDD from Article 21); the unit is unique (use synthetic control from Article 22); or unobserved confounders are clearly important and no sensitivity bound exists that provides defensible conclusions.
The six-question decision tree in Article 18 places PSM correctly as Question 5 — a design to reach for after experimental and clean-quasi-experimental options have been exhausted.
Reporting PSM in the VRR
The VRR section 3 presentation of a PSM result includes: the treatment definition, the propensity model specification and AUC, the matching algorithm and caliper, pre-match and post-match balance diagnostics on every covariate, the ATT (or ATE) point estimate with bootstrap confidence interval, Rosenbaum-bounds sensitivity, and the explicit disclosure of the matching-on-observables limit along with the candidate confounders that cannot be ruled out.
Cross-reference to Core Stream
- EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md#causal-design — practitioner causal-design framework.
- EATP-Level-2/M2.5-Art10-From-Measurement-to-Decision.md — decision architecture for PSM-based findings.
Self-check
- A PSM propensity model has an ROC AUC of 0.93. What does this tell you about the feasibility of matching, and what is the remedy?
- Post-match standardized mean differences on two covariates exceed 0.15. Is the analysis usable? What are the options?
- A PSM estimate of +12% productivity loses significance at a Rosenbaum Γ of 1.15. What should appear in the VRR narrative?
- An analyst wants to use PSM because the team is experienced with it; the rollout has clear staged timing. What is the correct advice?
Further reading
- Rosenbaum and Rubin, The Central Role of the Propensity Score in Observational Studies for Causal Effects, Biometrika 70, no. 1 (1983).
- Imbens and Rubin, Causal Inference for Statistics, Social, and Biomedical Sciences (Cambridge, 2015).
- Stuart, Matching Methods for Causal Inference: A Review and a Look Forward, Statistical Science 25, no. 1 (2010).
Footnotes

1. See for example Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, Generative AI at Work, NBER Working Paper 31161 (2023, updated 2024). https://www.nber.org/papers/w31161