COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Lab 1 of 5
Lab objective
Produce a complete, defensible measurement plan for the scenario described below. The plan must include all eleven sections specified in Article 4, align to ISO 42001 Clause 9.1 and NIST AI RMF MEASURE 1.1, and be ready for sponsor sign-off.
Duration: 90 minutes. Deliverable: a completed measurement plan document (Word, Google Docs, or Markdown) of roughly three to five pages. Linked articles: 4 (the measurement plan), 5 (leading/lagging indicators), and 18 (experimental vs. observational designs).
Scenario
You are the AI value lead for a mid-sized business-services company. The product team is about to ship “BillExplain,” a generative-AI feature that reads a customer’s invoice and produces a plain-language explanation in response to customer chat requests about their charges. Customer-service representatives use BillExplain’s output as a draft, edit as needed, and send to the customer.
The proposed business case claims: (1) reduced handle time per invoice inquiry from an average of 6 minutes to 3.5 minutes; (2) improved first-contact resolution rate from 72% to 85%; (3) a consequent reduction of $3.2M in annualized operations cost.
The feature launches in three weeks to 120 representatives in the North American contact center, with a gradual rollout to the European centers over the following two quarters. No formal measurement plan exists; your CFO has requested one before the rollout proceeds.
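Before drafting the plan, it can help to sanity-check the arithmetic behind the $3.2M claim. The sketch below is illustrative only: the loaded labor cost is a hypothetical assumption (the scenario does not state it), so substitute your organization's own figure.

```python
# Rough sanity check of the business case. The loaded labor cost is a
# HYPOTHETICAL assumption for illustration; the claimed figures come
# from the scenario (6.0 -> 3.5 minutes, $3.2M annualized).
minutes_saved_per_inquiry = 6.0 - 3.5      # claimed handle-time reduction
loaded_cost_per_rep_hour = 40.0            # assumed fully loaded cost, USD/hour
claimed_annual_saving = 3_200_000          # from the business case

# Annual inquiry volume implied by the $3.2M claim at these assumptions:
saving_per_inquiry = minutes_saved_per_inquiry / 60 * loaded_cost_per_rep_hour
implied_inquiries = claimed_annual_saving / saving_per_inquiry
print(f"Saving per inquiry: ${saving_per_inquiry:.2f}")
print(f"Implied annual inquiry volume: {implied_inquiries:,.0f}")
```

If the implied volume (here about 1.9M inquiries per year, or roughly 64 per rep per day across 120 reps) looks implausible for your contact center, that alone is worth flagging in the measurement plan.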
What to produce
Draft all eleven sections of the measurement plan. Target roughly one page in total for the concise sections and up to half a page each for the two or three sections that require detail.
- Hypothesis. State the primary business hypothesis in falsifiable terms. Include the direction, the magnitude, and the population to which the claim applies.
- Primary metric. Name one primary metric (not multiple). Define its computation, its source system, its time window, and its aggregation rule.
- Secondary metrics. Name three to five secondary metrics that provide supporting evidence but are not the basis for go/no-go decisions. Describe the purpose of each.
- Data sources. List every source system contributing to the primary and secondary metrics. For each source, state the owner, the refresh cadence, and any known data-quality limitations.
- Collection cadence. Specify how frequently each metric is collected and aggregated. Address both the measurement cadence and the reporting cadence.
- Analysis method. Describe how the causal effect of BillExplain will be estimated. Because all 120 reps will have access on day one in the North American center, consider which of the six designs from Article 18 apply. Justify your choice and disclose its limitations.
- Decision rule. Specify the thresholds on the primary metric that trigger continue, modify, or retire decisions. Make the thresholds numeric and defensible.
- Pre-registration. Record the hypothesis, primary metric, decision rule, and analysis method in a pre-registration record. Specify where this record is stored and how changes are authorized.
- Review owners. Name (by role) the owners of the measurement plan, the weekly operational review, the monthly value review, and the quarterly stage-gate review.
- Risk flags. List the three most significant measurement risks you foresee (e.g., contamination, adoption shortfall, drift in upstream invoicing data) and describe a mitigation for each.
- Escalation path. Describe what triggers escalation to the CFO or steering committee, the timeline for escalation, and the supporting evidence required.
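A decision rule ultimately reduces to a function from the measured primary metric to one of three actions. The thresholds below are illustrative placeholders, not the lab's answer; your plan must justify its own numbers against the 6.0-minute baseline and the 3.5-minute target.

```python
def decision(avg_handle_minutes: float) -> str:
    """Map the measured primary metric to a go/no-go action.

    Thresholds are ILLUSTRATIVE placeholders: the business case
    targets 3.5 minutes against a 6.0-minute baseline.
    """
    if avg_handle_minutes <= 4.0:   # ~80% of the claimed 2.5-min reduction realized
        return "continue"
    if avg_handle_minutes <= 5.0:   # partial effect: modify the feature and re-measure
        return "modify"
    return "retire"                 # no material improvement over baseline

for observed in (3.6, 4.7, 5.8):
    print(observed, "->", decision(observed))
```

Writing the rule as code forces the thresholds to be explicit and exhaustive, which is exactly what the rubric's "numeric and defensible" criterion asks for.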
Guidance
- Eligibility and population. The 120 representatives form a cluster; if any randomization is feasible, randomize at the team level rather than the individual level. The geographic rollout provides the staged-timing variation that difference-in-differences (DiD) exploits.
- Counterfactual thinking. Pre/post comparisons alone are weak here because the representatives' training period, the holiday-season volume pattern, and the Europe launch all interact. A DiD comparing North America (treated earlier) with Europe (treated later) is likely the strongest feasible design.
- Indicator discipline. Handle time is a lagging indicator; suggestion-acceptance rate and edit distance on the AI output are leading indicators of realized value.
- Honesty about limits. Adoption is voluntary above minimum use thresholds; this is a potential confounder that can bias the counterfactual comparison. Disclose it.
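The DiD logic above can be sketched in a few lines. All handle-time figures here are hypothetical illustration values; the estimator's validity rests on the parallel-trends assumption (that Europe's handle time would have moved like North America's absent BillExplain), which your plan should state and defend.

```python
# Minimal difference-in-differences sketch using four cell means.
# All handle-time figures are HYPOTHETICAL illustration values.
na_pre, na_post = 6.0, 4.2   # North America: before vs. after BillExplain launch
eu_pre, eu_post = 6.1, 5.9   # Europe (not yet launched): same calendar windows

# DiD subtracts the control region's trend from the treated region's change,
# netting out shared shocks such as seasonal volume.
did = (na_post - na_pre) - (eu_post - eu_pre)
print(f"DiD estimate of BillExplain's effect: {did:+.1f} minutes")
```

In practice you would estimate this from per-inquiry records with a regression (region, period, and their interaction) to obtain standard errors, but the four-cell arithmetic is the core of the design.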
Evaluation rubric
Your draft will be scored on the following dimensions.
| Dimension | What to demonstrate | Weight |
|---|---|---|
| Completeness | All eleven sections present and non-trivial | 20% |
| Hypothesis precision | Falsifiable, directional, magnitude-specified | 10% |
| Primary-metric discipline | One metric, well-defined | 10% |
| Counterfactual choice | Design selected and defended against Article 18’s six-question tree | 15% |
| Decision-rule specificity | Numeric thresholds, defensible | 10% |
| Risk-flag substance | Specific risks, not boilerplate | 10% |
| Alignment to ISO 42001 Clause 9.1 | Explicit mention of clauses addressed | 10% |
| Readability for CFO audience | Can be read in ten minutes, supports decision | 15% |
A passing draft scores 70% or above. Drafts scoring below 70% are returned with feedback and resubmitted.
Reflection questions
After completing the draft, answer the following in writing (approximately 150 words per question).
- Which of the eleven sections did you find hardest to complete, and why?
- Your counterfactual design has at least one known limitation. State it honestly and describe how the VRR would disclose it.
- The business case claims a $3.2M annualized saving. Assuming your measurement plan reveals a 30% shortfall, how would you communicate the shortfall to the CFO?
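For the third question, pin down the arithmetic before drafting the message. A 30% shortfall against a $3.2M claim is a fixed calculation; only the framing is a judgment call.

```python
# Arithmetic behind the third reflection question (figures from the scenario).
claimed = 3_200_000
shortfall = 0.30
realized = claimed * (1 - shortfall)   # saving actually measured
gap = claimed - realized               # dollars to explain to the CFO

print(f"Realized saving: ${realized:,.0f}  (gap vs. claim: ${gap:,.0f})")
```

Leading with the realized figure and the explained gap, rather than with the original claim, is usually the more defensible framing for a CFO audience.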
Linked artifacts and further reading
- Article 4 — The measurement plan artifact.
- Article 5 — Leading and lagging indicators.
- Article 18 — Choosing between experimental and observational designs.
- ISO/IEC 42001:2023 Clause 9.1.
- NIST AI RMF 1.0, MEASURE 1.1 subcategory.
Submission
Submit as a Word document, Google Doc, or Markdown file. A reviewer will provide written feedback within one week; drafts may be revised and resubmitted until they pass the rubric.