This article teaches the practitioner to author a scorecard that succeeds across the three audiences. It walks through the structure, the scoring convention, the multi-audience construction principle, the assembly workflow, and the ongoing refresh discipline. It closes with two Dutch welfare-algorithm cases — the SyRI ruling and the Rotterdam welfare-fraud investigation — that illustrate what happens when a scorecard is absent or inadequate.
What the scorecard has to do
The scorecard has four jobs, each with a different reader in mind.
- Approve or defer the use case. The executive sponsor reads the scorecard to decide whether the use case proceeds, proceeds conditionally, or is deferred pending remediation.
- Defend the decision to a reviewer. An external or internal auditor reads the scorecard to test whether the decision has adequate evidence.
- Guide the team to the next action. The product owner and the data and ML engineering teams read the scorecard to know what they must do next.
- Carry the trajectory over time. The scorecard is refreshed periodically; successive versions show whether the readiness posture is improving, holding, or eroding.
A scorecard written for only the first job is too thin for the auditor. A scorecard written for the auditor is usually too dense for the sponsor. A scorecard with no remediation mapping is useless to the team. The practitioner’s discipline is to produce a document that does all four jobs without inflating into a wall of text.
Structure
The scorecard has seven standard sections. Each section has required content and a length norm. The norms are targets, not limits; they keep the document readable.
Section 1 — Executive summary (one page)
Three paragraphs. The first paragraph states the use case, its risk tier, and the overall readiness determination (fit, conditionally fit, not fit). The second paragraph names the three most material findings that drove the determination. The third paragraph names the top three remediation priorities, with owners and target dates.
Executive readers should not need to read beyond this section to decide. Auditors will read beyond; sponsors often do not.
Section 2 — Use-case scope (one to two pages)
The full use-case description, the risk-tier assignment with rationale, the intended model type and training approach, the deployment context, the population served, the consent and lawful-basis summary, and the applicable regulatory perimeter. This section is the scoping memo from Article 1 of this credential, refined into final form.
Section 3 — Dimension scores (two to four pages)
The ten-dimension scorecard from Article 2, extended with the additional sub-dimensions introduced in subsequent articles:
- Accuracy
- Completeness
- Consistency
- Timeliness
- Validity
- Uniqueness
- Representativeness
- Freshness versus training cutoff
- Labeling agreement
- Distributional stability
- Lineage coverage
- Documentation currency (datasheets, decision logs)
- Access policy enforcement
- Subgroup coverage (absolute, relative, intersectional)
- Proxy audit
- Privacy / minimization compliance
- Third-party source posture
- Drift monitoring coverage
- Incident-response wiring
- Sustainment metric set
Each dimension has: a score (1–5 or green/amber/red), a threshold the score was measured against, the evidence consulted, the gap (if any), and the owner.
[DIAGRAM: ScoreboardDiagram — full-scorecard-render — the complete scorecard rendered as a two-column grid with dimension names on the left and a horizontal bar on the right showing score against threshold, color-coded by status, each row annotated with evidence-reference number and owner badge, summary footer showing overall determination and top-three remediation priorities]
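To make the scoring convention concrete, the sketch below shows one way a per-dimension record might be represented. The field names, the 1–5 scale mapping, and the green/amber/red cutoffs are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class DimensionScore:
    """One row of Section 3: a score, the threshold it was measured
    against, the evidence consulted, the gap (if any), and the owner."""
    name: str                    # e.g. "Subgroup coverage"
    score: int                   # 1-5 scale
    threshold: int               # minimum acceptable score for this risk tier
    evidence_refs: list[int] = field(default_factory=list)  # Section 4 numbers
    gap: str | None = None       # description of the shortfall, if any
    owner: str = ""              # accountable person or role

    @property
    def status(self) -> str:
        """Derive green/amber/red from score vs. threshold (cutoffs assumed)."""
        if self.score >= self.threshold:
            return "green"
        if self.score == self.threshold - 1:
            return "amber"
        return "red"

# A red row: score 2 against a threshold of 4, with two evidence references.
coverage = DimensionScore("Subgroup coverage", score=2, threshold=4,
                          evidence_refs=[12, 13],
                          gap="No intersectional slices measured",
                          owner="Data steward")
```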
Section 4 — Evidence register (two to five pages)
A numbered list of evidence items. Each item has: an evidence number referenced from the dimension scores, a description, the source (document path, report reference, query result, or named person), the date of collection, and a retention location. The evidence register is the backbone of auditability. A claim in the scorecard without a corresponding evidence entry is a claim without defense.
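That backbone claim can be checked mechanically. A minimal sketch, reusing the hypothetical DimensionScore record from above: every evidence reference in Section 3 must resolve to a numbered entry in Section 4.

```python
from dataclasses import dataclass

@dataclass
class EvidenceItem:
    """One numbered entry in the Section 4 evidence register."""
    number: int              # referenced from the dimension scores
    description: str
    source: str              # document path, report reference, query, or person
    collected_on: str        # ISO date of collection
    retention_location: str

def unreferenced_claims(scores, register):
    """Return (dimension, ref) pairs whose evidence number is missing
    from the register -- each one is a claim without defense."""
    known = {item.number for item in register}
    return [(s.name, ref)
            for s in scores
            for ref in s.evidence_refs
            if ref not in known]
```

Run before signoff: a non-empty result is a finding for Section 5, not a formatting nit.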
Section 5 — Gaps and risks (one to three pages)
Findings that will block or complicate the use case if unremediated. Each gap has: a description, the dimension it applies to, the risk if unremediated (with consequence framing), the remediation needed, the remediation cost and timeline, and the owner. Gaps are ranked by severity and feasibility.
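The ranking by severity and feasibility is simple enough to encode. A sketch under the assumption of two 1–5 scales (severity: 1 minor to 5 blocking; feasibility: 1 hard to 5 quick), which this article does not mandate:

```python
from dataclasses import dataclass

@dataclass
class Gap:
    """One Section 5 finding."""
    description: str
    dimension: str
    severity: int      # 1 (minor) .. 5 (blocks the use case) -- assumed scale
    feasibility: int   # 1 (hard or slow) .. 5 (quick fix) -- assumed scale
    owner: str

def rank_gaps(gaps: list[Gap]) -> list[Gap]:
    """Most severe first; among equal severities, the most tractable first."""
    return sorted(gaps, key=lambda g: (-g.severity, -g.feasibility))
```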
Section 6 — Remediation plan (one to two pages)
The sequenced work the organization will undertake to close the gaps. The remediation plan is not a wish list; it is a committed plan with owners, dates, and dependencies. Gaps that cannot be remediated within the use-case timeline are explicitly surfaced as permanent constraints.
Section 7 — Signoff (one page)
Named approvers and their roles. At minimum: the readiness practitioner (author), the data steward (for the data-governance view), the ML engineer or equivalent (for the technical view), the product owner (for the use-case view), and the governance lead (for the risk-review view). Each signoff carries a date and a basis (what they are signing off on).
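A signoff entry is small enough to sketch alongside the records above; the explicit note field anticipates the "silent disagreement" failure mode discussed later in this article. Field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Signoff:
    """One line on the Section 7 signoff page."""
    role: str       # e.g. "data steward" or "governance lead"
    name: str
    signed_on: str  # ISO date
    basis: str      # what, specifically, this party is signing off on
    note: str = ""  # an explicit concern, recorded rather than left silent
```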
The multi-audience construction principle
The document cannot be rewritten for each audience; that would produce three scorecards that eventually disagree. Instead, the scorecard is written once with a layered structure: each audience reads what it needs without being slowed down by what it does not.
- Layer 1 — the first page serves the executive sponsor.
- Layer 2 — sections 2 through 5 serve the auditor and the product owner.
- Layer 3 — the evidence register and remediation plan serve the engineering team and the auditor.
Concretely, the practitioner writes the executive summary last, after the rest of the scorecard is done, so it honestly reflects what the document actually says. The practitioner does not inflate section 1 to hide a weakness in section 5.
The assembly workflow
A scorecard is assembled in five passes.
- Scope freeze. Before any scoring begins, the scope is frozen (section 2). A moving scope produces a scorecard that never finalizes.
- Evidence collection. The practitioner collects evidence dimension by dimension, using the earlier articles’ methods, and files each item in the evidence register.
- Dimension scoring. Each dimension is scored with reference to its evidence. The practitioner should be able to justify each score to a skeptical reviewer without looking outside the scorecard.
- Gap and remediation drafting. Gaps are written before remediation; writing remediation first tempts the practitioner to define gaps that remediation can conveniently solve.
- Summary and review. Executive summary drafted; cross-functional review by the signoff parties; revisions; final signoff.
The workflow is iterative. A gap discovered in the remediation drafting may require a return to evidence collection; a review comment from a signoff party may require a score revision. The practitioner should build in time for at least one full iteration between first complete draft and final.
The refresh discipline
A scorecard is a snapshot in time. It should be refreshed on a documented cadence and on trigger events; a minimal scheduling sketch follows the list below.
- Cadence refresh — quarterly for low-risk use cases, monthly for high-risk use cases, with additional refresh triggers for use cases in active remediation.
- Trigger refresh — after a severity-1 or severity-2 incident, after a regulatory change that affects the perimeter, after a material change to the data (new source, major supplier change), and after a model retraining that changes the deployment context.
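Here is the scheduling sketch promised above. The intervals, the trigger names, and the tightened schedule during active remediation are illustrative policy choices, not fixed by this article.

```python
from datetime import date, timedelta

# Illustrative cadences; real intervals come from organizational policy.
CADENCE = {"low": timedelta(weeks=13),   # quarterly
           "high": timedelta(weeks=4)}   # monthly

TRIGGERS = {"sev1_incident", "sev2_incident", "regulatory_change",
            "material_data_change", "retraining_changed_context"}

def next_refresh(last_refresh: date, risk_tier: str,
                 events: set[str], in_remediation: bool) -> date:
    """Trigger events force an immediate refresh; otherwise the cadence
    date applies, halved (one illustrative policy) during remediation."""
    if events & TRIGGERS:
        return date.today()
    interval = CADENCE[risk_tier]
    if in_remediation:
        interval = interval / 2
    return last_refresh + interval
```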
Successive scorecard versions are preserved. The trajectory across versions is itself evidence — an improving trajectory supports a renewed approval; an eroding trajectory is a reason to tighten the use case scope or suspend deployment.
[DIAGRAM: BridgeDiagram — as-is-to-to-be — horizontal bridge from “As-is readiness posture” on the left to “Target readiness posture” on the right; the bridge sections show the sequenced remediation packages, each labeled with the dimensions it addresses, the owner, the timeline, and the score improvement expected; the right pillar shows the target score per dimension and the overall target determination]
The SyRI and Rotterdam case pair
Two Dutch cases anchor what the scorecard discipline is meant to prevent.
The SyRI ruling (District Court of The Hague, 2020). The System Risk Indication (SyRI) was a Netherlands-wide risk-scoring program for welfare-fraud detection. On 5 February 2020, the District Court of The Hague ruled that the SyRI legislation violated Article 8 of the European Convention on Human Rights, finding that the program lacked the transparency and safeguards required for a system with such deep implications for individuals.1 The court’s reasoning included specific criticism of the data and analytical pipeline: the categories of data processed, the logic applied, and the individuals subjected to scoring were not adequately disclosed.
The readiness translation is direct. A scorecard written for SyRI before launch should have surfaced: a gap in the transparency section (data-subject information was inadequate); a gap in the subgroup-coverage section (disproportionate effect on low-income neighborhoods was foreseeable); a gap in the governance section (the lawful-basis test for the data processing was contested); and a gap in the documentation section (the scoring logic was not auditable). Any one of these should have blocked the use case from proceeding in its then-current form.
Rotterdam welfare-fraud algorithm (Lighthouse Reports and Wired investigation, 2023). Lighthouse Reports, in collaboration with Wired and Dutch outlet Follow the Money, investigated Rotterdam’s use of a machine-learning system to score welfare applicants for fraud risk. The March 2023 coverage documented that the system disadvantaged applicants with specific demographic patterns and that the model’s data and logic could not be reconstructed by the city’s own staff.2 Rotterdam discontinued the program.
The readiness translation is, again, concrete. A scorecard refreshed during operation would have surfaced: an eroding trajectory on subgroup coverage; documentation that could not be reconstructed post-hoc; an access-scope gap between the modeling team and the operational decision-makers; and a sustainment-metric gap. The program’s discontinuation was correct; a timely scorecard could have reached the same decision earlier, with less public harm.
Lessons from the two cases
The cases teach three lessons the practitioner should carry forward:
- A scorecard written before launch is more valuable than any number of forensic analyses after the fact.
- Transparency and subgroup coverage are recurring failure modes in consequential-decision AI; the scorecard must examine both explicitly.
- The ability to reconstruct the data and logic after deployment is itself a readiness dimension; systems that cannot be reconstructed are not ready for high-risk use.
A readiness scorecard that applies the ten preceding articles’ disciplines and surfaces weak evidence across them is the mechanism the profession uses to stop these programs before they harm individuals and before they force their sponsors into public retreat.
Common failure modes in scorecard production
First-time practitioners make a small set of recurring mistakes that the discipline’s working norms are built to prevent. Naming them helps the practitioner avoid them.
- Score inflation. Scoring dimensions higher than evidence supports, because the team wants the use case to proceed. An honest score of 2 reveals a gap that the organization can address; an inflated score of 4 hides the gap until it produces an incident.
- Threshold borrowing. Adopting another organization’s thresholds without examining whether they fit the current risk tier. Thresholds are use-case specific.
- Evidence asymmetry. Strong evidence for the dimensions the practitioner finds interesting, thin evidence for the dimensions that require cross-functional cooperation. The thin-evidence dimensions are usually where the real risk lives.
- Single-iteration delivery. Writing the scorecard in one pass and delivering. The discipline’s quality emerges in the second and third iterations, where review comments surface the gaps the first pass missed.
- Remediation reverse-engineering. Drafting the remediation first and then defining the gaps the remediation addresses. The result is a scorecard that ignores gaps the organization is not willing to pay to remediate.
- Length inflation. Producing a 60-page scorecard because the practitioner is anxious about leaving something out. The sponsor will not read 60 pages, and the practitioner’s signal gets lost in the noise.
- Silent disagreement. Signoff parties who have concerns but sign anyway because the process requires it. The practitioner should actively surface concerns during review rather than let them hide behind a formal signoff.
Each of these failure modes has a counter-practice. Pair the scoring with evidence-first discipline. Justify every threshold in writing. Escalate evidence gaps as findings, not as omissions. Schedule review cycles, not delivery deadlines. Draft gaps before remediation. Set a length target upfront. Include disagreements as explicit notes on the signoff page where they exist.
Scorecard as a governance artifact
The scorecard is not only a decision document. It is a governance artifact that survives beyond the initial approval. It lives in the governance catalog; it is retrievable on demand; it is a primary reference for audits, for incident investigations, and for successor readiness engagements.
The practitioner should author the scorecard with that afterlife in mind. A scorecard that is decipherable only by its author is not a governance artifact. The practitioner writes so that a successor, coming in two years later, can open the scorecard and understand what was assessed, how, and why the determination was made.
Concrete practices that support afterlife:
- Glossary at the front: a half-page defining the organization-specific terms used in the document. A new reader can get oriented in five minutes.
- Plain-language explanation of the risk-tier assignment. “Medium risk because [concrete reasons]” is worth more than “Medium (per internal policy).”
- Named external reference standards cited in full. “EU AI Act Article 10” is clearer than an internal policy reference whose meaning may drift.
- A “when to re-read this scorecard” note. The practitioner predicts the situations that would prompt a future reader to open the document and writes signposts for each.
Cross-references
- COMPEL Practitioner — Organizational readiness pre-assessment (EATP-Level-2/M2.1-Art03-Organizational-Readiness-Pre-Assessment.md) — the broader readiness discipline in which the data scorecard sits.
- COMPEL Core — Calibrate: establishing the baseline (EATF-Level-1/M1.2-Art01-Calibrate-Establishing-the-Baseline.md) — the Calibrate-stage methodology the scorecard operationalizes at the data level.
- AITM-DR Articles 1 through 10 — the dimension-by-dimension disciplines the scorecard aggregates.
Summary
The readiness scorecard is the primary deliverable of the AI data readiness practitioner. It has seven sections, serves four audiences in layered form, is assembled in five passes, and is refreshed on a documented cadence with named triggers. It carries the trajectory of the readiness posture over time. A scorecard that is absent, inadequate, or never refreshed is the failure mode that produced the SyRI and Rotterdam welfare cases. A scorecard well-built, and well-sustained, is how the profession prevents the next generation of those cases from reaching deployment.
Footnotes
1. Rechtbank Den Haag, NJCM c.s. / Staat der Nederlanden (SyRI), ECLI:NL:RBDHA:2020:865, judgment of 5 February 2020. https://uitspraken.rechtspraak.nl/details?id=ECLI:NL:RBDHA:2020:865
2. G. Geiger, E. Braun, J. Labanca, J. Lamdahl, M. Turner, D. Howden, H. Howard, S. Dreyer, F. Simon, "Inside the Suspicion Machine," Wired (Lighthouse Reports investigation), March 6, 2023. https://www.wired.com/story/welfare-algorithms-discrimination/