AITE M1.3-Art02 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Shipped Value vs Realized Value



COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Article 2 of 35


A CFO opens the quarterly AI portfolio review and asks a direct question. The deck shows that twelve AI features shipped on schedule, with accuracy above target, user training completed, and adoption at 62% on average. The financial statement, however, shows no line-item lift attributable to the programme. The CFO asks the obvious follow-up: if these twelve features are working, where is the money? The value lead has no answer, because the portfolio was measured against ship-stage milestones rather than realised-value milestones. This conversation is not rare. McKinsey’s 2024 State of AI survey found that while a majority of surveyed organisations had adopted generative AI in at least one function, only a minority attributed measurable financial benefit to it at enterprise level, and the gap between those numbers is almost entirely a shipped-value-to-realised-value gap.1 This article defines the distinction, names ten common failure modes that open the gap, and teaches the practitioner to run a diagnostic probe that closes it before the CFO meeting.

The distinction, precisely

Shipped value is what an AI feature can theoretically produce if it is adopted, trusted, and sustained at design intent. It is a capability measure. A feature that passes its offline evaluation, meets its latency SLOs, and completes user acceptance testing has shipped value. The delivery team measures it, the procurement team pays for it, and the system-test team signs it off.

Realised value is what the organisation actually captures from that feature once it is in production, adopted, used in the decision context it was built for, sustained through drift, and translated into business outcome. Realised value is always less than or equal to shipped value. The ratio of the two — realised over shipped — is the feature’s realisation rate. A mature AI programme tracks the realisation rate for every feature in its portfolio.
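
Where the portfolio is tracked in code, the ratio is worth making explicit rather than leaving implicit in a spreadsheet. The following is a minimal Python sketch; the class name, fields, and dollar figures are illustrative assumptions, and the units are whatever the business case is denominated in.

```python
from dataclasses import dataclass

@dataclass
class FeatureValue:
    # Hypothetical per-feature record; units are whatever the business
    # case is denominated in (e.g. annualised USD).
    shipped_value: float   # value at design intent: full adoption, trust, freshness
    realised_value: float  # value actually captured in production

    def realisation_rate(self) -> float:
        # Realised over shipped; at or below 1.0 for any sound estimate.
        if self.shipped_value <= 0:
            raise ValueError("shipped value must be positive")
        return self.realised_value / self.shipped_value

# A feature with US$2.0M shipped value capturing US$0.7M realised
print(FeatureValue(2_000_000, 700_000).realisation_rate())  # 0.35
```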

The distinction is not academic. The US Navy’s Task Force Hopper programme, as documented by the US Government Accountability Office in its Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities (GAO-21-519SP), built its measurement discipline around the capability-versus-outcome split precisely because the agency had learned that capability shipped is not outcome delivered.2 The GAO framework names “performance” as a distinct accountability dimension from “governance” and “monitoring” for exactly this reason. A shipped-value-only programme cannot be held accountable for the outcome it claimed in order to justify the investment.

The ten failure modes that open the gap

Ten failure modes account for most of the gap in practice. They are not mutually exclusive; a single feature often exhibits three or four simultaneously. Diagnosing which are present is the first step to closing the gap, and a short telemetry sketch after the ten shows how several of the detectable signatures can be flagged programmatically.

Adoption refusal. The target users decline to use the feature. The shipped version sits in the tool tray unopened. Adoption refusal is easy to detect — usage telemetry is flat — but difficult to address without understanding the reason, which is why Article 4 in the AITM-CMD credential dedicates extensive treatment to AI-specific resistance diagnosis.

Silent suppression. Users have the feature available but quietly bypass it on most tasks. Adoption looks acceptable in aggregate but is concentrated on easy cases where the feature adds little value. The signature is adoption dominated by a small subset of tasks.

Decision override. Users see the output but systematically override it. A credit-decision copilot produces approvals the underwriters then reject; a retention-risk model flags customers the account team does not call. Acceptance rate is the first metric; override reason is the diagnostic extension that gives the pattern texture.

Acceptance without action. Users accept the recommendation without taking the action the recommendation implies. A physician accepts an AI-generated diagnostic differential but orders the same tests they would have ordered anyway. The model output is a confirmation rather than an input. The business metric does not move.

Counterfactual deception. The outcome metric moves but would have moved without the AI feature. A marketing-attribution model “produces” conversions that the baseline channel mix would have produced anyway. Article 3 teaches the counterfactual reasoning this failure mode requires.

Drift to irrelevance. The model’s accuracy decays faster than the retraining cadence. A six-month-old recommendation model trained on pre-inflation purchasing patterns ships recommendations that read as tone-deaf to 2024 customers. The feature is still technically working; it has just stopped being useful.

Cost-outcome inversion. The feature’s run cost rises faster than its outcome contribution. A generative assistant’s token spend grows with conversation length and retrieval depth, while the outcome metric — resolution rate, say — plateaus or declines because retrieval depth is no longer additive. The realisation rate collapses even though the model is still producing output.

Pipeline decay. The upstream data pipeline degrades without triggering an alert. The recommendation feature is technically running, but on data that is days or weeks stale. The model looks fine on its own monitor; the value is evaporating upstream.

Attribution dilution. Multiple AI and non-AI features contribute to the same outcome, and the organisation credits the last-touch or loudest-voice feature disproportionately. Attribution discipline, covered in Article 26, is the corrective.

Sponsor departure. The executive sponsor who anchored the business case moves to another role, and the feature’s business case stops being defended. Scope creeps, the outcome target is quietly relaxed, and within two quarters the feature is rationalised as “strategic” rather than measured. The AITM-CMD credential treats sponsor strength in depth; the value lead’s interest is that sponsor loss is a leading indicator of realisation decay.
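
Several of these failure modes leave computable signatures in routine telemetry. The Python sketch below flags six of them from a single per-feature record; every field name and threshold here is an illustrative assumption that would need calibrating against a real portfolio’s baselines, not a standard instrument.

```python
def flag_failure_modes(t: dict[str, float]) -> list[str]:
    # Flag which failure modes a feature's telemetry suggests.
    # Field names and thresholds are illustrative, not prescriptive.
    flags = []
    if t["weekly_active_share"] < 0.10:
        flags.append("adoption refusal")            # usage telemetry is flat
    if t["task_concentration_gini"] > 0.60:
        flags.append("silent suppression")          # usage piled onto easy cases
    if t["override_rate"] > 0.50:
        flags.append("decision override")           # outputs seen but rejected
    if t["accept_rate"] > 0.70 and t["action_rate"] < 0.30:
        flags.append("acceptance without action")   # accepted, never acted on
    if t["run_cost_growth"] > t["outcome_growth"]:
        flags.append("cost-outcome inversion")      # cost outrunning contribution
    if t["data_staleness_days"] > 7:
        flags.append("pipeline decay")              # model fine, inputs stale
    return flags

telemetry = {"weekly_active_share": 0.42, "task_concentration_gini": 0.71,
             "override_rate": 0.18, "accept_rate": 0.80, "action_rate": 0.22,
             "run_cost_growth": 0.30, "outcome_growth": 0.05,
             "data_staleness_days": 2}
print(flag_failure_modes(telemetry))
# ['silent suppression', 'acceptance without action', 'cost-outcome inversion']
```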

[DIAGRAM: MatrixDiagram — shipped-realized-2x2 — 2×2 grid with axes “Shipped (no/yes)” and “Realised (no/yes)”; four quadrants labelled “Not built”, “Vaporware”, “Shipped-not-realised (the danger zone)”, and “Value delivered”; each failure mode placed in the relevant quadrant. Primitive gives the practitioner a one-visual diagnostic.]

Worked real-world examples

Two documented cases illustrate the shipped-to-realised gap at opposite ends of the spectrum.

The Zillow Offers shutdown, announced November 2, 2021, is a canonical shipped-then-unrealised story. Zillow had shipped a sophisticated iBuying pricing algorithm with genuine model capability. What it had not built was a realisation-rate monitor that would have caught the rapid 2021 market inflection as it began. The company’s own Q3 2021 Form 10-Q disclosed inventory write-downs of approximately US$540 million and announced the wind-down of the entire iBuying programme.3 Unit 6 of this credential treats the sunset-case design the Zillow experience illustrates; the Unit 1 teaching point is that shipped-value monitoring would not have prevented this outcome — only realised-value monitoring would have.

The McDonald’s drive-thru voice-AI pilot, wound down in June 2024 after a three-year partnership with IBM, is the complementary case at a different scale.4 The capability shipped; the realisation rate was insufficient to justify the continued run cost. Reputable industry reporting documented the decision as a unit-economics and order-accuracy realisation gap rather than a capability gap. The lesson is not that the capability was inadequate, but that capability-centric measurement would have recommended continuing the pilot — the opposite of the business decision the brand ultimately made.

The diagnostic probe — realisation-rate triage

A value lead new to a portfolio can run a triage pass in two working days per feature. The probe has four questions and produces a triage verdict the sponsor can act on.

First, what is the intended action the feature should produce? Not the output the model generates — the downstream act. A demand-forecasting model’s intended action is an inventory decision, not a forecast chart.

Second, what is the rate at which the intended action occurs when the feature is used, compared to when it is not? This is the proto-counterfactual; Article 3 develops it into a disciplined counterfactual design. For triage purposes, even a rough control-group comparison is informative.

Third, what is the outcome delta — the business metric movement attributable to the difference in action rate? If the demand model lifts on-time forecasting by 20 percentage points but inventory holding cost is unchanged, the outcome delta is zero regardless of the model’s accuracy.

Fourth, what is the cost-to-outcome ratio trajectory over the past six months? A feature whose outcome delta is positive but whose run cost is rising faster than the delta is declining in realisation rate and will cross zero at a calculable future date.
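
The “calculable future date” is a one-line extrapolation. A minimal sketch, assuming a trailing series of monthly net value (outcome delta in currency terms minus run cost) and a simple least-squares trend; the numbers are invented for illustration.

```python
import numpy as np

def months_to_zero(net_value_by_month: list[float]) -> float | None:
    # Fit a linear trend to trailing monthly net value (outcome delta
    # minus run cost) and estimate months until it crosses zero.
    y = np.asarray(net_value_by_month, dtype=float)
    x = np.arange(len(y), dtype=float)
    slope, intercept = np.polyfit(x, y, 1)      # least-squares line
    if slope >= 0:
        return None                             # stable or improving
    crossing = -intercept / slope               # x where the line hits zero
    return max(0.0, crossing - (len(y) - 1))    # months from the latest point

# Six months of declining net value (thousands of currency units)
print(months_to_zero([120, 100, 85, 70, 50, 30]))  # ≈ 1.8 months out
```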

The triage produces one of three verdicts: realising (outcome delta is positive and the cost-to-outcome trajectory is stable or improving), at risk (either the outcome delta is ambiguous or the cost-to-outcome trajectory is deteriorating), or not realising (no defensible outcome delta, regardless of adoption and accuracy). The portfolio scorecard that Unit 6 develops is structured around these three verdicts.
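
The verdict logic itself is small enough to state directly. A minimal sketch, with an illustrative signature rather than a prescribed interface; the inputs correspond to the probe’s third and fourth answers.

```python
def triage_verdict(outcome_delta: float | None,
                   delta_is_ambiguous: bool,
                   cost_trend_deteriorating: bool) -> str:
    # outcome_delta is None when no defensible delta can be measured.
    if outcome_delta is None or outcome_delta <= 0:
        return "not realising"   # no defensible outcome delta
    if delta_is_ambiguous or cost_trend_deteriorating:
        return "at risk"         # positive delta, but shaky or deteriorating
    return "realising"           # positive delta, stable or improving economics

print(triage_verdict(outcome_delta=1.2e5, delta_is_ambiguous=False,
                     cost_trend_deteriorating=True))  # at risk
```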

[DIAGRAM: Timeline — shipped-vs-realized-trajectory-18mo — horizontal 18-month timeline showing two curves: “Shipped value” rising and plateauing at month 6, and “Realised value” rising more slowly, plateauing below shipped value, and decaying from month 12 under drift and cost-inversion pressure; the gap between the two curves is shaded and labelled “realisation gap”; primitive teaches the time-series shape of the problem.]

Why adoption dashboards understate the gap

Most organisations discover the gap the hard way because their existing adoption dashboards were designed to demonstrate programme momentum to stakeholders, not to surface realisation risk to practitioners. Five design choices common to shipped-value dashboards systematically understate the gap.

They count eligible users rather than intended-action users, inflating adoption denominators. They count feature opens rather than feature-driven decisions, confusing use with useful use. They average across diverse user roles, hiding concentration in roles where the feature was already adding the least value. They present uptake rates without cost trajectories, making cost-outcome inversion invisible. They refresh quarterly, missing the high-frequency signal that tells the practitioner the realisation rate is sliding.

A realisation-rate dashboard reverses each choice: denominators are restricted to intended-action users, numerators count feature-driven actions rather than opens, role-level breakdowns are default, cost and outcome are paired on every chart, and refresh is weekly at minimum. The dashboard-design article (Article 17) develops these principles for production use.
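
Assuming an event log carries the fields such a dashboard needs, the reversal of all five design choices fits in one aggregation. The schema below is hypothetical; in practice the columns would come from joins across usage telemetry, the cost ledger, and the outcome system of record.

```python
import pandas as pd

def realisation_dashboard(events: pd.DataFrame) -> pd.DataFrame:
    # Expected (hypothetical) columns: user_id, role, week,
    # intended_action_user (bool), feature_driven_action (bool),
    # run_cost, outcome_value.
    eligible = events[events["intended_action_user"]]      # restricted denominator
    out = eligible.groupby(["week", "role"]).agg(          # role-level by default
        users=("user_id", "nunique"),
        driven_actions=("feature_driven_action", "sum"),   # decisions, not opens
        run_cost=("run_cost", "sum"),
        outcome_value=("outcome_value", "sum"),
    )
    out["actions_per_user"] = out["driven_actions"] / out["users"]
    out["cost_to_outcome"] = out["run_cost"] / out["outcome_value"]  # paired per row
    return out.reset_index()                               # refreshed weekly upstream
```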

The gap as a portfolio-level phenomenon

At portfolio level the gap is not uniformly distributed. A typical enterprise AI portfolio has a small number of features in the “realising” quadrant that account for most of the portfolio’s measurable value, a larger middle tier that is ambiguous, and a long tail that is shipped-not-realised. BCG’s AI at Scale research reports a comparable distribution across its client sample; Gartner’s 2024 CIO and Technology Executive Survey reports similar concentration;5 Forrester’s AI value benchmarking reports an analogous pattern under a different nomenclature. No single framework owns the insight — it is cross-replicated.

The practitioner implication is that portfolio-level investment decisions should be asymmetric. Features in the “realising” tier warrant the run-cost and refresh investment to keep them realising. Features in the “at risk” tier warrant the triage-probe depth to move them to realising or to sunset. Features in the “not realising” tier warrant candid sunset assessments; Article 32 treats the sunset case explicitly, because the ability to sunset well is as professional a discipline as the ability to build well.
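
The asymmetry reduces to a policy map from verdict to next action. A sketch, with action labels that are shorthand for the article’s recommendations rather than a prescribed workflow:

```python
NEXT_ACTION = {
    "realising": "fund run cost and refresh; protect the realisation rate",
    "at risk": "run the full four-question probe; decide realise-or-sunset",
    "not realising": "open a candid sunset assessment (Article 32)",
}

def portfolio_actions(verdicts: dict[str, str]) -> dict[str, str]:
    # Map each feature's triage verdict to its asymmetric next action.
    return {feature: NEXT_ACTION[v] for feature, v in verdicts.items()}

print(portfolio_actions({"copilot-a": "realising", "forecaster-b": "at risk"}))
```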

Summary

Shipped value is what a feature can produce; realised value is what the organisation captures. The ratio — realisation rate — is the variable that determines whether an AI programme pays back. Ten failure modes open the gap: adoption refusal, silent suppression, decision override, acceptance without action, counterfactual deception, drift to irrelevance, cost-outcome inversion, pipeline decay, attribution dilution, and sponsor departure. A four-question triage probe classifies each feature as realising, at risk, or not realising, producing the portfolio-level verdicts the sponsor can act on. Article 3 opens the counterfactual methodology the probe’s second and third questions depend on.


Cross-references to the COMPEL Core Stream:

  • EATF-Level-1/M1.1-Art07-The-Business-Value-Chain-of-AI-Transformation.md — the value chain anchor that shipped-versus-realised extends into measurement specifics
  • EATP-Level-2/M2.5-Art04-Business-Value-and-ROI-Quantification.md — ROI quantification methodology that assumes realised-value measurement as its input
  • EATP-Level-2/M2.5-Art09-Value-Realization-Reporting-and-Communication.md — stakeholder reporting that depends on the realisation-rate discipline


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. McKinsey & Company, “The state of AI in early 2024: Gen AI adoption spikes and starts to generate value” (May 30, 2024), https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai (accessed 2026-04-19).

  2. US Government Accountability Office, Artificial Intelligence: An Accountability Framework for Federal Agencies and Other Entities, GAO-21-519SP (June 2021), https://www.gao.gov/products/gao-21-519sp (accessed 2026-04-19).

  3. Zillow Group Inc., Form 10-Q for the quarter ended September 30, 2021, US Securities and Exchange Commission (filed November 5, 2021), https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001617640&type=10-Q (accessed 2026-04-19).

  4. Amelia Lucas, “McDonald’s pulls plug on AI-powered drive-thrus after three years”, CNBC (June 17, 2024), https://www.cnbc.com/2024/06/17/mcdonalds-pulls-plug-on-ai-powered-drive-thrus-after-three-years.html (accessed 2026-04-19).

  5. Gartner, “2024 CIO and Technology Executive Survey” (published 2023–2024), https://www.gartner.com/en/publications/cio-agenda (accessed 2026-04-19).