AITE M1.3-Art54 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Lab 4: Design a Difference-in-Differences Rollout for an Enterprise Copilot

Maturity Assessment & Diagnostics — Advanced depth — COMPEL Body of Knowledge.

7 min read · Article 54 of 48

COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Lab 4 of 5


Lab objective

Design a staged rollout for the enterprise copilot described below that preserves DiD identification and produces a defensible counterfactual. Specify the parallel-trends test plan, the sample-size math, and the pre-registered analysis.

Duration: Two hours. Deliverable: A rollout-design document (three to five pages) and a parallel-trends test plan. Linked articles: 18 (design selection), 20 (DiD in AI rollouts).

Scenario

You are the AI value lead for a global professional-services firm with 28,000 knowledge workers across 12 major offices in 8 countries. The firm has piloted “SparkCopilot,” a GenAI assistant that supports document drafting, research, meeting summarization, and task planning for its consultants. The pilot ran in one office (Toronto, ~1,800 consultants) for two quarters and showed suggestive positive effects on hours billed per consultant and on junior-consultant retention.

Leadership has approved a global rollout. The product team proposes rolling out to all 12 offices simultaneously over a six-week implementation window. The AI value team — you — has been asked to review the rollout plan and recommend changes that preserve the ability to measure the copilot’s impact.

Context

  • Firm structure: 12 offices, each with 1,500–3,500 consultants, organized into practice areas (Strategy, Technology, Operations, Tax, Audit).
  • Primary outcome metric (candidate): Billable hours per consultant per quarter.
  • Secondary metrics: Junior-consultant retention (12-month rolling); client-satisfaction score; proposal win rate.
  • Historical data: Four years of office-level billable-hours data; three years of retention data; 18 months of client-satisfaction data.
  • External context: Industry experienced a billable-hours decline in the last two quarters due to macroeconomic softness; the softness varies by practice area.

What to produce

Step 1 — Design the staged rollout

Redesign the rollout to preserve DiD identification. Consider:

  • Timing variation. Stage the rollout in three waves over six months: Wave 1 (weeks 1–4) — 4 offices; Wave 2 (weeks 9–12) — 4 offices; Wave 3 (weeks 17–20) — remaining 4 offices.
  • Office selection per wave. How are the 4 offices per wave chosen? If leadership wants the “highest-potential” offices first, the design is compromised. Propose a selection rule that is plausibly independent of expected outcome — e.g., stratify by office size and practice-area mix, then randomize within strata.
  • Practice-area treatment. Should all practice areas go live simultaneously within an office? Or staggered within? Justify.
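The stratify-then-randomize selection rule from the bullets above can be sketched as follows. The office names, size buckets, and practice-area mixes are illustrative placeholders, not the firm's actual roster:

```python
import random

# Hypothetical offices mapped to (size bucket, practice-area mix) strata.
offices = {
    "Toronto": ("large", "tech-heavy"),   "London": ("large", "tech-heavy"),
    "New York": ("large", "mixed"),       "Singapore": ("large", "mixed"),
    "Chicago": ("mid", "tech-heavy"),     "Sydney": ("mid", "tech-heavy"),
    "Frankfurt": ("mid", "mixed"),        "Tokyo": ("mid", "mixed"),
    "Dallas": ("small", "tax-audit"),     "Madrid": ("small", "tax-audit"),
    "Mumbai": ("small", "mixed"),         "Sao Paulo": ("small", "mixed"),
}

def assign_waves(offices, n_waves=3, seed=42):
    """Group offices into strata, then randomize wave order within strata.

    Randomizing within strata keeps wave assignment plausibly independent
    of expected outcomes while balancing size and practice-area mix.
    """
    strata = {}
    for office, key in offices.items():
        strata.setdefault(key, []).append(office)
    rng = random.Random(seed)
    waves = {w: [] for w in range(1, n_waves + 1)}
    idx = 0
    for members in strata.values():
        rng.shuffle(members)
        for office in members:
            waves[idx % n_waves + 1].append(office)  # deal round-robin
            idx += 1
    return waves

waves = assign_waves(offices)
for w in sorted(waves):
    print(f"Wave {w}: {sorted(waves[w])}")
```

The fixed seed makes the draw reproducible for the pre-registration document; in practice the seed would be committed before anyone sees the assignment.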

Step 2 — Specify the DiD model

Write the estimating equation using two-way fixed effects. Use the specification from Article 20:

$Y_{it} = \alpha_i + \gamma_t + \delta \cdot D_{it} + \epsilon_{it}$

Specify:

  • Unit of analysis: office × practice-area × quarter.
  • Treatment indicator: 1 if office has received copilot by quarter start.
  • Fixed effects: office fixed effects, quarter fixed effects.
  • Clustering for standard errors: treatment is assigned at the office level, so cluster at the office level; clustering at the office × practice-area level sits below the level of treatment assignment and would understate the standard errors.
  • Small-panel correction: office-level clustering leaves only 12 clusters, too few for reliable standard cluster-robust inference; use the wild-cluster bootstrap as the primary inference method rather than only as a sensitivity check.

With staggered adoption, the plain two-way fixed-effects estimator can be biased when treatment effects differ across waves or over time; recent econometric work (Callaway–Sant'Anna 2021) provides group-time estimators that avoid this. Name your chosen estimator explicitly and justify it.
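Under the simplifying assumption of a constant treatment effect, the TWFE specification above can be estimated by dummy-variable OLS. A minimal numpy sketch on simulated panel data (every parameter here is invented for illustration; it is not firm data):

```python
import numpy as np

rng = np.random.default_rng(0)

n_units, n_quarters, true_delta = 60, 8, 2.0   # 12 offices x 5 practice areas
office = np.repeat(np.arange(12), 5)           # office id for each unit
# Staggered adoption: wave 1 (offices 0-3) from q4, wave 2 from q5, wave 3 from q6
adopt_q = np.where(office < 4, 4, np.where(office < 8, 5, 6))

unit_fe = rng.normal(0, 3, n_units)            # alpha_i
time_fe = np.linspace(0, 1.5, n_quarters)      # gamma_t: common upward trend

rows = []
for u in range(n_units):
    for t in range(n_quarters):
        d = 1.0 if t >= adopt_q[u] else 0.0
        y = unit_fe[u] + time_fe[t] + true_delta * d + rng.normal(0, 0.5)
        rows.append((u, t, d, y))
u_idx, t_idx, D, Y = map(np.array, zip(*rows))

# Two-way fixed effects via dummy-variable OLS: Y = alpha_u + gamma_t + delta*D
X = np.column_stack([
    D,
    (u_idx[:, None] == np.arange(1, n_units)).astype(float),    # unit dummies
    (t_idx[:, None] == np.arange(1, n_quarters)).astype(float), # quarter dummies
    np.ones(len(Y)),
])
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(f"TWFE estimate of delta: {beta[0]:.3f} (true {true_delta})")
```

With a homogeneous effect, TWFE recovers δ even under staggered timing; when effects are heterogeneous across waves, switch to the Callaway–Sant'Anna group-time estimator (reference implementations exist, e.g., the authors' R package `did`).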

Step 3 — Plan the parallel-trends tests

Plot pre-rollout trajectories for the 12 offices across at least 8 quarters. Check:

  • Visual inspection. Do treated-wave and control-wave (future-treated) offices move in parallel before treatment?
  • Formal test. Run a regression of outcome on quarter × wave-assignment interaction across the pre-treatment window. A significant interaction indicates pre-trend divergence.
  • Placebo tests. Assign Wave 1 a fake treatment date six months before the actual launch, re-estimate the DiD, and confirm the placebo effect is near zero.

Document what you will do if pre-trends fail. Candidate fallback designs include: an event study with linear-trend controls, matched DiD, or (if trends are severely non-parallel) synthetic control, which suits settings with one or few treated units.
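The formal pre-trend test can be sketched as a regression of the outcome on quarter, wave dummies, and quarter × wave interactions; large interaction coefficients signal wave-specific slope divergence. The data below are simulated with parallel trends imposed by construction, so the estimated slope gaps should be near zero (all numbers are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated pre-treatment panel: 12 offices x 8 quarters, parallel by design.
n_offices, n_pre = 12, 8
wave = np.repeat([1, 2, 3], 4)                  # hypothetical wave assignment
office_level = rng.normal(100, 10, n_offices)   # office fixed effects
common_trend = 1.2 * np.arange(n_pre)           # shared quarterly trend

Y = (office_level[:, None] + common_trend[None, :]
     + rng.normal(0, 1.0, (n_offices, n_pre)))

# Long format: one row per office-quarter.
o = np.repeat(np.arange(n_offices), n_pre)
t = np.tile(np.arange(n_pre), n_offices)
w = wave[o]
y = Y.ravel()

# Regress on quarter, wave dummies, and quarter x wave interactions.
X = np.column_stack([
    np.ones_like(y), t,
    (w == 2).astype(float), (w == 3).astype(float),
    t * (w == 2), t * (w == 3),   # wave-specific slope gaps vs. wave 1
])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"slope gap, wave 2 vs 1: {beta[4]:+.3f}; wave 3 vs 1: {beta[5]:+.3f}")
```

In the real analysis, pair these point estimates with cluster-robust (or wild-bootstrap) standard errors and a joint test that both interaction terms are zero.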

Step 4 — Sample-size and MDE analysis

Given the historical variance of billable hours per consultant per quarter (use your own plausible estimate), compute the minimum detectable effect the DiD can support at 80% power.

Is the MDE smaller than the business-case-required effect? If yes, proceed. If not, either extend the pre-treatment window, lengthen the post-treatment window, or accept that the design may fail to detect a real effect smaller than the MDE.
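One common back-of-envelope approach treats the DiD estimate as a difference of four cell means (treated/control × pre/post) over cluster-quarter outcomes. A sketch under that assumption; the sigma, cluster counts, and window lengths below are invented placeholders, not firm data, and serial correlation within clusters would push the true MDE higher:

```python
from statistics import NormalDist

def did_mde(sigma, n_treated, n_control, n_pre, n_post,
            alpha=0.05, power=0.80):
    """Back-of-envelope MDE for a DiD on cluster-level outcomes.

    sigma: SD of the cluster-quarter outcome (e.g., office x practice-area
    mean billable hours) after removing unit and quarter fixed effects.
    Assumes independent cluster-quarter shocks, so this is a lower bound.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    # Variance of the four-means DiD: each cell mean contributes
    # sigma^2 / (clusters x periods) to the estimator's variance.
    var = sigma**2 * ((1 / n_treated + 1 / n_control)
                      * (1 / n_pre + 1 / n_post))
    return (z_alpha + z_power) * var**0.5

# Illustrative: Wave 1 (20 office x practice-area cells) vs. not-yet-treated
# (40 cells), 8 pre-quarters, 2 post-quarters, sigma = 6% of baseline hours.
mde = did_mde(sigma=6.0, n_treated=20, n_control=40, n_pre=8, n_post=2)
print(f"MDE: {mde:.2f} percentage points of billable hours")
```

Use this to answer the question above: compare the computed MDE against the effect size the business case requires, and adjust the windows or cluster counts until the design can detect it.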

Step 5 — Anticipate CFO objections

Article 20 lists the standard CFO objections to DiD. For each, draft your rebuttal plan.

  • “Could a macroeconomic shift explain this?”
  • “Could a different firm-wide initiative explain this?”
  • “Are the treated and control offices really comparable?”
  • “How confident are you about the parallel-trends check?”

For each objection, name the specific evidence that addresses it.

Step 6 — Write the rollout-design document

Three to five pages. Sections:

  1. Proposed rollout schedule. Wave-by-wave with office assignments.
  2. Identification strategy. DiD specification, estimator choice, clustering.
  3. Parallel-trends test plan. Pre-treatment diagnostic and placebo plan.
  4. Sample-size analysis. MDE, power, duration required.
  5. Anticipated objections and rebuttals. Four standard CFO objections, documented rebuttal evidence.
  6. Risk and fallback. What happens if parallel-trends fails; what happens if the macro shock persists.

Guidance

  • Wave selection is the crux. The product team will want to launch to offices with the most “pent-up demand”; that is outcome-driven sequencing, which kills DiD. Your design must balance product-team preferences with identification discipline.
  • Practice-area heterogeneity. Five practice areas in one office can look very different from each other. Treating practice area as a separate unit in the analysis gives you 60 units instead of 12, improving power and supporting sub-group analyses.
  • The macro shock matters. Industry billable-hours decline in recent quarters may violate parallel trends in ways that are hard to distinguish from treatment effects. Plan for this explicitly.
  • Honesty about uncertainty. The final report to the CFO should disclose that DiD identification rests on assumptions; the parallel-trends check supports those assumptions but does not prove them; placebo tests are suggestive but not definitive.

Evaluation rubric

Dimension | What to demonstrate | Weight
Rollout design | Staged waves; identification-aware office selection | 20%
DiD specification | Correct econometric form; appropriate estimator | 15%
Parallel-trends plan | Visual and formal checks; placebo plan | 15%
Sample-size math | MDE computed; compared to business-case effect | 15%
CFO objection handling | Each standard objection addressed with specific evidence | 15%
Macro-shock handling | Explicit acknowledgment and mitigation | 10%
Document quality | Reads in 15 minutes; supports CFO decision | 10%

Reflection questions

  1. The product team pushes to launch in Toronto first because Toronto was the successful pilot. What is your recommendation, and how do you negotiate?
  2. Your sample-size math shows the MDE is 2.8% but the business case requires detecting a 4% effect. What are your design options?
  3. Six months into the rollout, a placebo test flags a pre-treatment divergence you had not caught. What is the corrective process?

Linked articles and further reading

  • Article 18 — Choosing between experimental and observational designs.
  • Article 20 — Difference-in-differences in AI rollouts.
  • Article 11 — Sensitivity analysis for the MDE computation.
  • Callaway and Sant’Anna, Difference-in-Differences with Multiple Time Periods, Journal of Econometrics (2021).

Submission

Submit the rollout-design document and the parallel-trends test plan. The reviewer will validate the identification strategy, the statistical choices, and the rebuttal preparedness.