COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Case Study 2 of 3
Case overview
This case is a composite. It integrates patterns from three publicly documented sources: the Microsoft Work Trend Index reports on Microsoft 365 Copilot deployments (2023–2024); the Peng et al. 2023 arXiv paper on GitHub Copilot developer productivity; and Brynjolfsson, Li, and Raymond’s 2023 NBER working paper on generative AI in customer-service work.1 No individual client organization is portrayed; the specific numbers and timelines are illustrative at the composite level while reflecting the directional findings of the underlying sources.
The case illustrates what it looks like to apply the credential’s measurement discipline across an enterprise copilot rollout from Calibrate through Evaluate. It covers rollout-design choices that preserved DiD identification, attribution-model selection across multiple copilot features, counterfactual-estimate communication to a CFO, and the portfolio-level learning that accumulated across the twelve-month evaluation window.
The program
The case follows “Lighthouse,” a composite enterprise copilot program at a hypothetical global professional-services firm with approximately 28,000 knowledge workers across 12 offices in 8 countries. The firm deployed two copilot features: a document-drafting copilot (Lighthouse Draft) and a meeting-summarization copilot (Lighthouse Summary). A third feature, a research-assistant copilot, was in pilot but is not discussed here.
The firm’s AI value lead — call her the practitioner — had been asked to produce realized-value estimates for both features across the rollout. The business case projected an 8–12% increase in billable hours per consultant per quarter once the copilots reached full adoption, offset by approximately $6M in annualized run cost for the two features combined.
Calibrate — the measurement plan
The practitioner opened Calibrate by writing a measurement plan that covered both features. Article 4’s eleven-section structure was applied. The primary metric for both features was billable hours per consultant per quarter, measured from the firm’s time-tracking system. Secondary metrics included junior-consultant retention, client-satisfaction score, and suggestion-acceptance rate (for Lighthouse Draft) and summary-quality score (for Lighthouse Summary).
The counterfactual-method question was the consequential one. Simultaneous global rollout — which the product team had proposed — would have made DiD infeasible. The practitioner proposed a three-wave staged rollout over six months: Wave 1 in four offices, Wave 2 in four offices three months later, Wave 3 in the remaining four offices six months later. Wave assignment was stratified by office size and practice-area mix, then randomized within stratum, removing the outcome-driven sequencing that would have compromised identification.
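For illustration, a minimal sketch of the stratified randomization, with hypothetical office names and strata standing in for the firm’s actual roster and stratification variables:

```python
import random

# Hypothetical office roster. Each stratum combines an office-size band with a
# practice-area mix; names and strata are illustrative, not the firm's actual data.
offices = [
    ("Toronto", "large / strategy-heavy"), ("London", "large / strategy-heavy"), ("New York", "large / strategy-heavy"),
    ("Singapore", "large / mixed"), ("Frankfurt", "large / mixed"), ("Sydney", "large / mixed"),
    ("Chicago", "mid / audit-heavy"), ("Paris", "mid / audit-heavy"), ("Mumbai", "mid / audit-heavy"),
    ("Dubai", "small / mixed"), ("Madrid", "small / mixed"), ("Tokyo", "small / mixed"),
]

def assign_waves(offices, n_waves=3, seed=7):
    """Randomize offices to rollout waves within each stratum, so every wave
    receives a comparable mix of office sizes and practice areas."""
    rng = random.Random(seed)
    by_stratum = {}
    for office, stratum in offices:
        by_stratum.setdefault(stratum, []).append(office)
    assignment = {}
    for members in by_stratum.values():
        rng.shuffle(members)
        for i, office in enumerate(members):
            assignment[office] = f"Wave {(i % n_waves) + 1}"
    return assignment

print(assign_waves(offices))  # four offices per wave, one per stratum
```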
Leadership pushed back. The firm’s Toronto office had led the pilot and wanted to be in Wave 1. The practitioner proposed that Toronto join Wave 1 alongside its wave peers, which produced an office selection that was not fully randomized but was defensible: Toronto in Wave 1, three comparably sized offices from the other stratum cells also in Wave 1, and the remaining eight offices randomized into Waves 2 and 3.
The practitioner disclosed the non-randomized Toronto placement in the measurement plan, along with a robustness analysis that would re-estimate the DiD excluding Toronto and verify that the estimates held.
Organize — the data infrastructure
The firm’s time-tracking system was the primary data source. Retention came from HR; client satisfaction from the client-feedback platform. Suggestion-acceptance rate came from the Copilot telemetry; summary-quality score came from a rubric-based evaluation performed on a sampled subset of generated summaries.
Three data-quality issues required resolution before rollout. First, time-tracking categories had changed in 2024, making pre-2024 trajectory data not directly comparable to post-2024 data. The practitioner worked with the time-tracking team to produce a reconciled series with a documented crosswalk. Second, client-satisfaction scores were collected only on sampled engagements; the sample was not large enough to support office × practice-area × quarter analysis. The practitioner accepted office × quarter granularity for this secondary metric. Third, retention was a low-frequency metric (turnover events per year); the practitioner accepted that retention effects would have lower statistical power and would be reported with wider uncertainty bands.
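A small sketch of the category reconciliation, assuming hypothetical pre- and post-2024 category names rather than the firm’s actual codes:

```python
import pandas as pd

# Hypothetical crosswalk from pre-2024 time-tracking categories to the post-2024
# taxonomy; the category names are illustrative only.
CROSSWALK = {
    "Client Delivery": "Billable - Delivery",
    "Client Advisory": "Billable - Advisory",
    "Proposal Support": "Non-billable - Business Development",
    "Internal Projects": "Non-billable - Internal",
}

def reconcile(hours: pd.DataFrame) -> pd.DataFrame:
    """Map legacy categories onto the current taxonomy, flagging rows with no
    documented mapping instead of silently guessing."""
    out = hours.copy()
    out["category_reconciled"] = out["category"].map(CROSSWALK).fillna(out["category"])
    out["crosswalk_applied"] = out["category"].isin(CROSSWALK.keys())
    return out
```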
The compute budget (Article 29) for the two features combined was set at $6M annualized run cost with alert threshold at 80% and review threshold at 95%. FinOps instrumentation captured token-level cost attribution per user and per feature; prompt-caching was planned from Week 1.
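A minimal sketch of the threshold check, assuming the FinOps instrumentation already aggregates spend to an annualized run-rate figure; the function name and inputs are illustrative:

```python
def budget_status(annualized_spend: float, annual_budget: float = 6_000_000,
                  alert_at: float = 0.80, review_at: float = 0.95) -> str:
    """Classify run-rate spend against the combined compute budget.
    Thresholds follow the measurement plan: alert at 80%, review at 95%."""
    utilization = annualized_spend / annual_budget
    if utilization >= review_at:
        return "review"
    if utilization >= alert_at:
        return "alert"
    return "ok"

# Example: $5.1M annualized run-rate spend is 85% of budget -> "alert"
print(budget_status(5_100_000))
```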
Produce — the rollout and pilot
Wave 1 launched in Q2. Adoption metrics were tracked weekly. Suggestion-acceptance rates for Lighthouse Draft stabilized around 55% by Week 6; summary-quality scores for Lighthouse Summary averaged 0.78 on the 0–1 rubric.
Three months later, Wave 2 launched. By that point, pre-treatment trajectories across the offices had been plotted (Article 20’s parallel-trends check). Visual inspection showed the treated Wave-1 offices and the not-yet-treated Wave-2 offices moving in parallel over the previous eight quarters. A formal test — regressing the outcome on quarter × wave interactions over the pre-treatment window — showed no significant pre-trend. Placebo tests that assigned Wave 1 a treatment date six months before the actual launch produced effects statistically indistinguishable from zero.
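A sketch of both checks in Python with statsmodels, assuming a hypothetical office × quarter extract (`lighthouse_panel.csv`) with the columns noted in the comments:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed office x quarter extract with columns:
#   billable_hours  - outcome per consultant per quarter
#   office, wave    - office identifier and its rollout wave
#   quarter         - calendar quarter index (integer)
#   first_treated_q - quarter in which the office's wave went live
panel = pd.read_csv("lighthouse_panel.csv")  # hypothetical file name

# Parallel-trends check on the pre-treatment window: quarter x wave interactions
# should add no explanatory power if the waves were trending in parallel.
pre = panel[panel["quarter"] < panel["first_treated_q"]].copy()
restricted = smf.ols("billable_hours ~ C(quarter) + C(wave)", data=pre).fit()
full = smf.ols("billable_hours ~ C(quarter) * C(wave)", data=pre).fit()
print(sm.stats.anova_lm(restricted, full))  # joint F-test on the interactions

# Placebo test: pretend treatment began two quarters before the real launch and
# re-estimate on pre-treatment data only; the "effect" should be near zero.
pre["placebo_post"] = (pre["quarter"] >= pre["first_treated_q"] - 2).astype(int)
placebo = smf.ols("billable_hours ~ placebo_post + C(office) + C(quarter)", data=pre).fit()
print(placebo.params["placebo_post"], placebo.pvalues["placebo_post"])
```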
The parallel-trends result was included in the Produce-stage-exit evidence pack.
Wave 3 launched at the six-month mark. By that time, roughly six months of Wave-1 post-treatment data and three months of Wave-2 post-treatment data had accumulated. DiD estimation began producing interim estimates with wide uncertainty bands — expected given the thin Wave-2 and Wave-3 data.
Evaluate — the counterfactual
Twelve months after the first wave, full-panel DiD estimation was run. The specification followed Article 20: office × practice-area × quarter as the unit of analysis, two-way fixed effects, cluster-robust standard errors at the office × practice-area level, with a wild-cluster bootstrap as a sensitivity check because the cluster count (60 units) was in the range where small-panel correction is prudent.
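A sketch of the two-way fixed-effects estimation, again on an assumed office × practice-area × quarter extract; the wild-cluster bootstrap sensitivity is noted but not implemented here:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed office x practice-area x quarter extract with columns:
#   billable_hours - outcome per consultant per quarter
#   treated        - 1 once the unit's wave has gone live, else 0
#   unit           - office x practice-area identifier (the cluster, 60 in total)
#   quarter        - calendar quarter index
panel = pd.read_csv("lighthouse_panel_full.csv")  # hypothetical file name

# Two-way fixed-effects DiD: unit and quarter fixed effects, a single treatment
# dummy, and cluster-robust standard errors at the unit level.
twfe = smf.ols("billable_hours ~ treated + C(unit) + C(quarter)", data=panel).fit(
    cov_type="cluster", cov_kwds={"groups": panel["unit"]}
)
print(twfe.params["treated"], twfe.conf_int().loc["treated"])

# With ~60 clusters the plan also calls for a wild-cluster bootstrap sensitivity;
# that step typically uses a dedicated package (e.g. wildboottest) and is omitted here.
```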
The estimated treatment effect was an average increase of roughly 6% in billable hours per consultant per quarter, 95% CI 3.5%–8.5%. The point estimate came in below the 8–12% business-case projection; the confidence interval overlapped the lower bound of the business case.
Three follow-up analyses supported the estimate’s interpretation. Robustness to excluding the non-randomized Toronto placement: the estimate moved to 5.8% with a broadly similar confidence interval. Event-study analysis across quarters-since-treatment: the effect rose over the first six months and then stabilized, consistent with an adoption-and-learning pattern. Subgroup analysis by practice area: effects were stronger in the Strategy and Technology practices (roughly 8–9%) and weaker in the Audit and Tax practices (roughly 3–4%).
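The event-study extension can be sketched on the same assumed extract, replacing the single treatment dummy with quarters-since-treatment indicators:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same assumed panel extract as the DiD sketch, plus first_treated_q
# (the quarter in which the unit's wave went live).
panel = pd.read_csv("lighthouse_panel_full.csv")
panel["event_time"] = (panel["quarter"] - panel["first_treated_q"]).clip(-4, 4)

# Event study: quarters-since-treatment indicators (binned at +/-4), with the
# quarter just before go-live (-1) as the reference, plus the same fixed effects.
es = smf.ols(
    "billable_hours ~ C(event_time, Treatment(reference=-1)) + C(unit) + C(quarter)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["unit"]})
print(es.params.filter(like="event_time"))
```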
Attribution across two features
Both Lighthouse Draft and Lighthouse Summary contributed to the same billable-hours outcome for the same consultants. Attribution-model choice (Article 26) was consequential. The practitioner considered three approaches.
Approach A — linear split across features. Allocate each consultant’s productivity gain equally between the two features. Simple, but it may misrepresent the split if one feature contributed substantially more than the other.
Approach B — usage-weighted. Allocate based on proportional usage of each feature (hours-of-use or interaction-count).
Approach C — Shapley-value decomposition. Use cooperative-game-theory attribution to estimate each feature’s marginal contribution.
The firm’s program office had set Shapley as the portfolio-primary attribution model. The practitioner produced Approach C as the primary estimate and reported Approach B as a sensitivity. Approach A was documented as a simpler sensitivity but was not used for headline reporting.
Under Shapley attribution, Lighthouse Draft contributed approximately 60% of the combined feature effect; Lighthouse Summary approximately 40%. The split reflected usage patterns and likely marginal contribution rather than any presumption about relative feature importance.
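With two features, the Shapley decomposition reduces to averaging each feature’s marginal contribution over the two possible join orders. A sketch, with hypothetical coalition values chosen to land near the reported split; the case does not specify how the single-feature effects were estimated:

```python
from itertools import permutations

def shapley_two_features(value):
    """Shapley attribution over feature coalitions: average each feature's
    marginal contribution across all orderings in which it can join."""
    features = ["Draft", "Summary"]
    orders = list(permutations(features))
    shares = {f: 0.0 for f in features}
    for order in orders:
        coalition = frozenset()
        for f in order:
            shares[f] += (value[coalition | {f}] - value[coalition]) / len(orders)
            coalition = coalition | {f}
    return shares

# Hypothetical coalition values (percentage-point lift in billable hours).
v = {
    frozenset(): 0.0,
    frozenset({"Draft"}): 4.2,
    frozenset({"Summary"}): 2.6,
    frozenset({"Draft", "Summary"}): 6.0,
}
print(shapley_two_features(v))  # Draft ~3.8 pts, Summary ~2.2 pts of the 6% effect
```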
The VRR
The quarterly VRR after twelve months presented the findings in Article 16’s six-section structure. Section 1 stated a combined realized-value claim of approximately $14M annualized (based on the 6% effect applied to the firm’s billable-hours baseline), well short of the business-case $19–$28M range but clearly positive. Section 3 presented the DiD analysis with its robustness and subgroup extensions; uncertainty was preserved in the narrative. Section 4 decomposed the $14M by feature under Shapley attribution ($8.4M to Draft, $5.6M to Summary) with parallel reporting under usage-weighted attribution ($9.2M to Draft, $4.8M to Summary); the roll-up arithmetic is sketched after this walkthrough.
Section 5 flagged three risks: the effect’s lower-than-business-case magnitude (yellow, not red), evidence that the effect may be weaker in certain practice areas (yellow, requires targeted investigation), and an open question about whether the effect would sustain beyond twelve months (watch, no action required yet).
Section 6 recommended continuing both features with scope adjustment: focus additional adoption investment on the weaker-effect practice areas; begin planning for a third feature in the Lighthouse family; schedule an 18-month refresh of the DiD analysis.
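The roll-up arithmetic behind the Section 1 and Section 4 figures, as a worked sketch; the affected billable-hours base is an implied assumption, not a number stated in the case:

```python
# Worked roll-up of the VRR figures. The dollar value of the affected
# billable-hours base is not stated in the case; ~$233M is the value implied
# by a 6% effect producing roughly $14M, and is used here only for illustration.
baseline_value = 233_000_000          # implied affected billable-hours base ($/yr)
effect = 0.06                         # DiD point estimate

realized_value = effect * baseline_value            # ~ $14.0M annualized
shapley = {"Draft": 0.60, "Summary": 0.40}          # portfolio-primary attribution
by_feature = {f: s * realized_value for f, s in shapley.items()}  # ~ $8.4M / $5.6M

run_cost = 6_000_000                  # combined annualized run cost
net_value = realized_value - run_cost               # ~ $8.0M, the go-forward economics
print(by_feature, round(net_value))
```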
CFO dialogue
The CFO conversation centered on three questions. First, “why is the effect smaller than the business case projected?” The practitioner’s response: the business case’s 8–12% range was drawn from published pilot findings that applied to narrower task scopes; the firm’s broader adoption pattern includes many consultants whose core work has less room for copilot-driven productivity gain. The 6% effect is still well above the no-effect null and supports continued investment.
Second, “how confident are we that the 6% isn’t measurement error?” The practitioner’s response: the confidence interval is 3.5%–8.5%, so it is statistically distinguishable from zero at the 95% level; the pre-trend and placebo tests support the identification assumption; the Toronto-excluded robustness analysis produces similar estimates. The effect is as solid as DiD identification can produce.
Third, “does this justify continuing?” The practitioner’s response: current run cost is approximately $6M annualized; realized value is approximately $14M; net is approximately $8M annualized, above the CFO’s hurdle rate. Continuation is supported. Retirement would be economically irrational at the current trajectory.
The dialogue illustrates the discipline this credential teaches. Numbers honestly reported, methods disclosed, uncertainty preserved, recommendation anchored in go-forward economics rather than in achieved-claim defense. The CFO accepted the recommendation and the portfolio scorecard moved forward.
What practitioners should take from this case
Four takeaways.
Takeaway 1 — Rollout design is a measurement decision. Product-team preferences (simultaneous global rollout) and measurement-team requirements (staged rollout for DiD identification) often conflict. The practitioner’s job is to negotiate a rollout that serves both, not to accept one at the expense of the other.
Takeaway 2 — Smaller-than-hoped effects are not failures. The business case projected 8–12%; the realized effect was 6%. The 6% effect is still meaningful; the larger claim was aspirational; the discipline is to report honestly and to let the math support continuation when it does.
Takeaway 3 — Multi-feature attribution is a choice that must be governed. Different attribution models produce different feature-level numbers. The governance rule — portfolio primary attribution with sensitivity reporting — is what prevents the aggregation pathology.
Takeaway 4 — Uncertainty preservation wins CFO trust. The practitioner who shows 95% CIs, robustness analyses, and subgroup heterogeneity produces a more credible presentation than the practitioner who rounds everything to a single number. CFOs know numbers are uncertain; what they reward is acknowledgement of the uncertainty.
Further reading
- Microsoft Corporation, Work Trend Index Special Report series (2023, 2024). https://www.microsoft.com/en-us/worklab/work-trend-index
- Peng et al., The Impact of AI on Developer Productivity, arXiv 2302.06590 (2023). https://arxiv.org/abs/2302.06590
- Brynjolfsson, Li, and Raymond, Generative AI at Work, NBER Working Paper 31161 (2023, updated 2024). https://www.nber.org/papers/w31161
- Article 20 — Difference-in-differences in AI rollouts.
- Article 26 — Attribution modeling for AI outcomes.
Discussion questions
- The non-randomized Toronto placement was a concession to leadership. Was it the right call? What design alternative could have preserved identification while respecting the leadership preference?
- The realized effect (6%) came in below the business-case projection (8–12%). What does this say about business-case discipline at Calibrate, and how would you adjust for the next feature in the Lighthouse family?
- Three attribution models produced different feature-level numbers. The practitioner chose Shapley as primary. What arguments could be made for linear or usage-weighted, and how do you govern the choice?
- The CFO question “how confident are we?” is the single most common CFO question in AI value reviews. What communication discipline produces the most trust in response?
Footnotes
1. Brynjolfsson, Li, and Raymond, Generative AI at Work, NBER Working Paper 31161 (2023). https://www.nber.org/papers/w31161. Peng et al., The Impact of AI on Developer Productivity: Evidence from GitHub Copilot, arXiv 2302.06590 (2023). Microsoft Work Trend Index reports (2023–2024).