AITE M1.3-Art04 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

The Measurement Plan Artifact


11 min read Article 4 of 48 Calibrate

COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Article 4 of 35


A feature team announces that the new recommendation model goes live on Monday. A value lead new to the organisation asks a quiet question: where is the measurement plan? There isn’t one. There is an evaluation report from offline testing, a dashboard specification drafted by the BI team, and an executive summary written for last month’s steering. None of these is a measurement plan. The absence of the artifact — a document written before the feature ships, naming what will be measured, how, and what decision will be taken on the result — is the single most reliable predictor of the shipped-to-realised gap Article 2 catalogued. This article specifies the eleven sections a measurement plan must contain, anchors the artifact to ISO/IEC 42001:2023 Clause 9.1, and teaches the practitioner to produce one the feature team, the business sponsor, and internal audit all sign.

Why the plan must exist before the feature ships

Three reasons distinguish the pre-launch plan from the post-launch report. The first is that the counterfactual Article 3 required must be pre-registered. A counterfactual constructed after the results are known is not a counterfactual; it is a rationalisation. Pre-registration is the mechanism that makes the counterfactual credible, and the measurement plan is the vehicle for pre-registration.

The second is that measurement instrumentation must exist before the feature is producing output that the instrumentation could capture. If the plan specifies a behavioural telemetry signal that the product code was not instrumented to emit, the signal does not exist and cannot be back-filled. The plan drives instrumentation; the instrumentation enables measurement; the measurement enables the decision. This sequence cannot be reordered.

The third is that the decision rule — what the practitioner will conclude if the result is X, Y, or Z — must be agreed with stakeholders before the result is known. A decision rule proposed after the result is known is subject to motivated reasoning on both sides: the feature’s defenders argue down the threshold, the feature’s critics argue it up. A pre-agreed rule short-circuits this predictable conflict.

ISO/IEC 42001:2023 (AI management systems) Clause 9.1 — Monitoring, measurement, analysis and evaluation — makes the pre-registration discipline a management-system requirement. Clause 9.1.a requires the organisation to determine “what needs to be monitored and measured”; 9.1.b, the methods of monitoring; 9.1.c, when the monitoring shall be performed; 9.1.d, when the results shall be analysed and evaluated.1 The measurement plan is the operational response to each of these sub-clauses.

The eleven required sections

The AITE-VDT curriculum standardises an eleven-section plan template. Each section is compulsory; an incomplete plan is not shippable. The template is platform-neutral (works in Word, Google Docs, Confluence, or Notion) and review-friendly (each section fits on one page).

Section 1 — Hypothesis. The testable statement the feature is built on. “If we surface personalised product recommendations to logged-in users, their conversion rate will increase by at least 8% against the current non-personalised experience.” Hypotheses are specific, measurable, falsifiable, and stated as comparatives rather than absolutes.

Section 2 — Primary metric. The single metric on which the feature’s success or failure is judged. One primary metric per feature; pluralising dilutes accountability. The primary metric aligns with the hypothesis and is expressed in the units the organisation’s financial summary will use. A primary metric of “conversion rate” is insufficient if the financial summary reports dollars of margin; “incremental margin per session” is better.

Section 3 — Secondary metrics. Supporting metrics that either confirm the primary result or surface unintended effects. Secondary metrics cover adoption (is the feature being used), behaviour (how are users using it), and guardrails (is the feature causing harm). A guardrail metric carries a pre-specified threshold whose breach stops the feature regardless of primary-metric results — a customer complaint rate, a safety-incident rate, a fairness-audit score.
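A guardrail of this kind is simple to mechanise. A minimal sketch in Python (the metric names, thresholds, and breach directions are illustrative, not prescribed by the template):

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    name: str
    threshold: float
    higher_is_worse: bool = True  # complaint and incident rates breach upward

    def breached(self, observed: float) -> bool:
        # Breach direction depends on the metric: a complaint rate breaches
        # above its threshold, a fairness-audit score breaches below it.
        if self.higher_is_worse:
            return observed > self.threshold
        return observed < self.threshold

complaints = Guardrail("customer_complaint_rate", threshold=0.02)
fairness = Guardrail("fairness_audit_score", threshold=0.85, higher_is_worse=False)
print(complaints.breached(0.035))  # True: stop the feature regardless of lift
print(fairness.breached(0.91))     # False: guardrail holds
```

The point of encoding the threshold before launch is that the breach test is then a lookup, not a negotiation.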

Section 4 — Data sources. The source systems from which metrics will be computed, with freshness, granularity, and access owner specified for each. A metric defined on a data source that the measurement team cannot access is not a measurable metric.

Section 5 — Collection cadence. When the metrics are computed — daily, weekly, monthly — and when the analysis is run. A feature whose launch decision depends on a three-week observation window should not have its first analysis scheduled at eight weeks.

Section 6 — Analysis method. The counterfactual design (from Article 3’s five-approach taxonomy) and the specific estimator, test, and significance criterion. An analysis method of “we’ll look at the numbers” is not an analysis method. A method of “difference-in-differences estimator with treated region = EU, control region = US, measurement window = six weeks post-launch, alpha = 0.05, parallel-trends test run on the 12-week pre-launch window” is an analysis method.
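The difference-in-differences arithmetic behind such a specification reduces to four cell means. A minimal sketch (the conversion-rate numbers are invented for illustration; a real plan would also specify the significance test):

```python
def did_estimate(treated_post: float, treated_pre: float,
                 control_post: float, control_pre: float) -> float:
    """Difference-in-differences point estimate: the treated group's
    pre-to-post change minus the control group's change over the same window."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# EU (treated) conversion moved 4.8% -> 5.4%; US (control) moved 4.9% -> 5.0%.
lift = did_estimate(0.054, 0.048, 0.050, 0.049)
print(round(lift, 4))  # 0.005, i.e. half a percentage point attributable to the feature
```

The control group's change is what subtracts out the seasonal or market-wide movement that a naive before/after comparison would misattribute to the feature.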

Section 7 — Decision rule. What the practitioner and sponsor will conclude for each possible result. “If the primary metric moves up by ≥5% with p<0.05 and no guardrail metric breaches threshold, the feature is retained and scaled. If the metric moves by <2% or any guardrail breaches, the feature is sunset within one quarter. If the metric moves by 2–5%, the feature enters a three-month re-evaluation period with a doubled sample size.” The rule is pre-registered; the adjudication is not rewritten after the result is known.
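A rule of that shape can be encoded so that adjudication is mechanical. A sketch using the thresholds quoted above (the handling of a large lift that misses significance is an assumption; a real plan would route that ambiguous case to Section 11's escalation path):

```python
def adjudicate(lift_pct: float, p_value: float, guardrail_breached: bool) -> str:
    """Apply the Section 7 decision rule. The thresholds are pre-registered:
    this function is written, and signed off, before the result is known."""
    if guardrail_breached or lift_pct < 2.0:
        return "sunset within one quarter"
    if lift_pct >= 5.0 and p_value < 0.05:
        return "retain and scale"
    # The 2-5% band, or a large lift without significance: re-evaluate.
    return "re-evaluate for three months with doubled sample size"

print(adjudicate(8.0, 0.01, guardrail_breached=False))  # retain and scale
print(adjudicate(3.4, 0.20, guardrail_breached=False))  # re-evaluate ...
print(adjudicate(8.0, 0.01, guardrail_breached=True))   # sunset within one quarter
```

Committing the function, not just the prose, to the version-controlled plan (Section 8) makes any post-hoc threshold change visible in the diff.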

Section 8 — Pre-registration and versioning. The plan is version-controlled from its first draft, and substantive changes after launch are logged with reasons. The pre-registration record is audit-grade evidence that the analysis method was not retrofitted to the outcome.

Section 9 — Review owners. Named individuals responsible for the measurement-plan sign-off, the mid-launch check-in, the primary-metric analysis, and the decision-rule adjudication. “Review owner = Head of Data Science” is not a named individual; a specific name and a specific email address are.

Section 10 — Risk flags. The measurement risks the plan has identified — selection bias, drift, attribution ambiguity — with a mitigation specified for each. Risk flags are read-ahead material for the sponsor and the audit reviewer; a plan without them looks under-examined.

Section 11 — Escalation path. Who is notified if a guardrail breaches, if the analysis method must change, or if the result is ambiguous. An escalation path that terminates with “we’ll figure it out” is not an escalation path.
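Because every section is compulsory, completeness can be checked mechanically at the stage gate. A minimal sketch (the dictionary keys are one possible encoding of the eleven sections, not mandated by the curriculum):

```python
REQUIRED_SECTIONS = (
    "hypothesis", "primary_metric", "secondary_metrics", "data_sources",
    "collection_cadence", "analysis_method", "decision_rule",
    "pre_registration", "review_owners", "risk_flags", "escalation_path",
)

def missing_sections(plan: dict) -> list:
    """Return the compulsory sections a draft plan has left empty or absent.
    An incomplete plan is not shippable."""
    return [s for s in REQUIRED_SECTIONS if not plan.get(s)]

draft = {
    "hypothesis": "Personalised recommendations lift conversion by at least 8%.",
    "primary_metric": "incremental margin per session",
}
print(len(missing_sections(draft)))  # 9 sections still empty: not shippable
```

A check like this is the kind of gate-evidence automation Article 31's stage-gate reviews can hang off; the plan itself remains the human-readable artifact.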

[DIAGRAM: HubSpokeDiagram — measurement-plan-eleven-sections — central hub “Measurement Plan” with eleven spokes labelled in the order above, each spoke annotated with a one-line prompt (“What decision are we making?”, “How will we know?”, “What if we’re wrong?”); primitive gives the practitioner a structural reference for plan authoring.]

The ISO 42001 and NIST AI RMF alignment

A measurement plan written to these eleven sections satisfies the main measurement requirements of both the dominant AI management-system standards.

ISO/IEC 42001:2023 Clause 9.1 is satisfied by Sections 2, 4, 5, and 6 (what is monitored, by which methods, when, and how it is analysed).1 The clause’s requirement that results be “retained as documented information” is satisfied by Section 8’s versioning discipline.

ISO/IEC 42001:2023 Clause 9.3 (Management review) is fed by the aggregated output of Section 6’s analysis results and Section 7’s decision rule over time — the CPR artifact that Article 15 develops.

NIST AI RMF (AI 100-1, January 2023) MEASURE 1.1 requires that “approaches and metrics for measurement of AI risks enumerated during the MAP function are selected for implementation starting with the most significant AI risks.”2 Section 3’s guardrail metrics are where NIST MEASURE 1.1 lives in the plan.

NIST AI RMF MEASURE 2.3, 2.6, and 2.10 require measurement of system performance against defined metrics, tracking of identified trustworthy-characteristics measurements, and demonstrated safety in deployment. Sections 2 and 3 are the operational home of these sub-functions.

NIST AI 600-1 (Generative AI Profile, July 2024) extends the MEASURE function with sub-guidance for generative systems, including resource-consumption measurement (GV-1.6, MS-4.1) and environmental-risk tracking (RG-2.5).3 For generative features these extensions are addressed in Section 3 as guardrails and in the TCO/token-economics sections of Articles 8 and 10.

The value lead does not have to choose between ISO and NIST; the eleven-section plan is compatible with both, and in practice most mature programmes reference both standards in the plan’s preamble.

[DIAGRAM: BridgeDiagram — measurement-plan-standards-bridge — left anchor showing the eleven plan sections; right anchor showing the corresponding ISO 42001 and NIST AI RMF clauses; span showing the explicit section-to-clause mapping; primitive teaches auditable traceability.]

Worked example — a public-sector measurement plan

The US Navy’s publicly reported AI programmes, as documented in the GAO’s Artificial Intelligence: An Accountability Framework (GAO-21-519SP), are repeatedly cited as examples of measurement-plan discipline in defence contexts.4 The GAO framework itself names “performance” as a distinct accountability dimension and requires that AI programmes document “entity goals for AI systems” and “performance metrics to evaluate the achievement of those goals.” The language is the public-sector vocabulary for the eleven-section plan the AITE-VDT curriculum teaches.

The UK Government Digital Service’s Understanding artificial intelligence ethics and safety (2019) and subsequent Ethics, Transparency and Accountability Framework for Automated Decision-Making (2021) provide the parallel discipline for UK public-sector AI, requiring pre-launch impact assessment and documented measurement methodology for high-impact deployments.5 These public-sector frameworks do not use the phrase “measurement plan,” but the documents they require map onto the eleven-section structure closely enough that a value lead working in public-sector AI can use the same template.

Common failure modes when the plan is skipped

Three failure modes recur in programmes that ship without a plan. The first is metric proliferation: in the absence of a pre-agreed primary metric, every stakeholder brings a favourite metric to the post-launch review, and the discussion never converges. The second is retrofitted counterfactuals: in the absence of pre-registered comparisons, the analysis team constructs the comparison most favourable to the result, and the conclusion is contested by any sceptical reviewer. The third is guardrail silence: in the absence of pre-specified guardrails, unintended effects are either missed entirely or surfaced only after customer complaints or regulatory enquiries make them unignorable.

Each failure mode is visible and avoidable. The plan’s cost is measured in hours; the cost of skipping the plan is measured in cancelled programmes and careers.

The plan’s relationship to MLOps evaluation harnesses

A measurement plan is not an evaluation harness. The plan is the decision-level artifact; the harness is the continuous-measurement infrastructure that Article 24 develops. The plan references the harness as the data source for several of its metrics, but the plan itself is readable by the sponsor and the auditor, not by a deployment pipeline. A common anti-pattern is to substitute a Langfuse or Arize dashboard for a measurement plan; the dashboard is an input to the plan, not a replacement for it.

The practitioner habit — the plan precedes the launch

The habit is summarised in one rule: no feature ships without a signed measurement plan. The rule is enforced by governance — an organisation whose stage-gate review requires a plan-of-record signature before launch sign-off closes the loophole that allows plans to be “in progress” indefinitely. Article 31 treats stage-gate value reviews in depth; the Unit 1 point is that the plan is one of the non-negotiable gate-evidence artifacts.

Summary

The measurement plan is the pre-launch document that operationalises counterfactual reasoning, satisfies ISO 42001 Clause 9.1 and NIST AI RMF MEASURE 1.1 requirements, and pre-registers the decision rule that will govern the feature’s retain-or-sunset outcome. Eleven sections — hypothesis, primary metric, secondary metrics, data sources, cadence, analysis method, decision rule, pre-registration, review owners, risk flags, escalation path — constitute the minimum viable plan. An organisation that requires a signed plan before feature launch has closed the most reliable predictor of shipped-to-realised value gap. Article 5 closes Unit 1 by teaching leading and lagging indicators, which populate Sections 2 and 3 for most features.


Cross-references to the COMPEL Core Stream:

  • EATF-Level-1/M1.2-Art14-Mandatory-Artifacts-and-Evidence-Management.md — artifact discipline that governs measurement-plan evidence retention
  • EATP-Level-2/M2.5-Art02-Designing-the-Measurement-Framework.md — practitioner-level measurement framework that nests the plan artifact
  • EATF-Level-1/M1.2-Art24-Control-Performance-Report.md — CPR artifact that consumes the plan’s outputs at review cadence


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. International Organization for Standardization and International Electrotechnical Commission, ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system (ISO, 2023), Clause 9.1, https://www.iso.org/standard/81230.html (accessed 2026-04-19).

  2. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1 (January 2023), MEASURE function, https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf (accessed 2026-04-19).

  3. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1 (July 26, 2024), https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf (accessed 2026-04-19).

  4. US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021), https://www.gao.gov/products/gao-21-519sp (accessed 2026-04-19).

  5. UK Central Digital and Data Office, Ethics, Transparency and Accountability Framework for Automated Decision-Making (May 13, 2021), https://www.gov.uk/government/publications/ethics-transparency-and-accountability-framework-for-automated-decision-making (accessed 2026-04-19).