AITM M1.1-Art61 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Case Study — Amsterdam SyRI and Rotterdam Welfare-Fraud Algorithm



COMPEL Specialization — AITM-DR: AI Data Readiness Associate Case Study 1 of 1


Why these cases

Most AI failure cases referenced in the data readiness literature are drawn from private-sector deployments — a hiring tool withdrawn, a pricing model shut down, a content-moderation pipeline put under review. The two Dutch cases covered here are different in one consequential respect: they were public-sector programs run by governments against their own residents, to allocate or deny social benefits. The stakes were individual welfare. The failure surfaces that the readiness discipline exists to detect were present in both programs and went uncorrected in both.

The practitioner’s learning is not that public-sector AI is uniquely bad. It is that the readiness discipline is the same across sectors, that the failure modes are the same across sectors, and that the harm when readiness is inadequate scales with the consequence tier of the use case. Welfare eligibility carries a consequence tier high enough to warrant the deepest readiness discipline the profession can offer.

Part 1 — Amsterdam SyRI

The program

The System Risk Indication (SyRI) was a Netherlands-wide welfare-fraud detection program operated under Dutch national law. Authorized by the SUWI Act and implemented under a 2014 government decree, SyRI matched data from multiple government databases — tax records, property records, benefits payments, education records, and more — against risk-scoring criteria, flagging individuals for additional welfare-fraud investigation.1 The program was concentrated in four Dutch cities, applied predominantly in neighborhoods classified as low-income, and was operated with limited public disclosure of the categories of data processed, the risk logic applied, or the individuals affected.

Civil-society organizations, led by the Netherlands Committee of Jurists for Human Rights (NJCM), brought a challenge in the District Court of The Hague. The case was heard in October 2019 and decided on 5 February 2020.

The ruling

The District Court ruled that the SyRI legislation violated Article 8 of the European Convention on Human Rights — the right to respect for private and family life. The court’s reasoning, published in full in ECLI:NL:RBDHA:2020:865, identified several structural defects:1

  • Inadequate transparency. Data subjects were not informed that they had been subjected to risk scoring; the categories of data processed were not sufficiently specified; the logic of the risk model was opaque.
  • Inadequate safeguards. The balance between the public interest in fraud detection and the private interests of individuals subjected to scoring was not adequately protected by the statutory and operational framework.
  • Disproportionate effect. The deployment pattern — concentrated in specific neighborhoods — raised concerns about discrimination that the government had not adequately addressed.

The court struck down the SyRI legislation. The decision is one of the earliest major judicial invalidations of an automated decision-support system in Europe, and it is foundational reading for the readiness practitioner.

Reading SyRI through the readiness lens

An AI data readiness scorecard authored before SyRI’s deployment — or even during its operation — would have surfaced findings across multiple dimensions of this credential’s framework.

Data governance (Article 3). What data contracts governed the inputs to SyRI? The public record does not describe standing contracts in the sense Article 3 uses. Matching across tax, property, and benefits records under the program was authorized by statute, but the operational governance that a contract would provide — named owners, change policies, access-scope enforcement, consumer lists — does not appear in the available record. A readiness scorecard would have flagged this as a material gap before the program proceeded.

Lineage and documentation (Article 4). The court’s criticism of opaque logic is, from a readiness perspective, a documentation failure. A datasheet for the risk-scoring inputs would have required disclosure of data sources, preprocessing, and known limitations. A lineage graph would have shown, in reproducible form, how individual scores were constructed. The program could not be defended because it could not be explained.

Bias-relevant variables and subgroup coverage (Article 7). The deployment concentration in specific neighborhoods raises a subgroup-coverage question that the readiness scorecard would have examined before deployment. Were low-income neighborhoods over-sampled in the evaluation set? Did the risk model’s features include proxies for socioeconomic status? What was the expected disparate impact, and what remediation was built into the deployment? These are questions the program would have had to answer, on paper, with evidence, for the scorecard to approve proceeding.
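The subgroup-coverage questions above can be made operational with a very small amount of analysis. The sketch below is illustrative only — the data is invented and the group labels are hypothetical, not drawn from the SyRI record — but it shows the shape of the pre-deployment check: compute the flag rate per subgroup and compare the extremes.

```python
from collections import defaultdict

def flag_rates(records):
    """Flag rate per subgroup: share of individuals flagged for investigation."""
    totals, flagged = defaultdict(int), defaultdict(int)
    for group, was_flagged in records:
        totals[group] += 1
        flagged[group] += was_flagged
    return {g: flagged[g] / totals[g] for g in totals}

def disparate_impact_ratio(rates):
    """Ratio of lowest to highest subgroup flag rate (1.0 = parity)."""
    return min(rates.values()) / max(rates.values())

# Invented records for illustration: (neighborhood income band, flagged 0/1)
records = [("low", 1), ("low", 1), ("low", 0), ("low", 1),
           ("high", 0), ("high", 0), ("high", 1), ("high", 0)]
rates = flag_rates(records)
print(rates)                           # {'low': 0.75, 'high': 0.25}
print(disparate_impact_ratio(rates))   # 0.25 / 0.75 ≈ 0.33, well below the 0.8 rule of thumb
```

A ratio this far below the common 0.8 ("four-fifths") rule of thumb is exactly the kind of evidence a scorecard would require the program to explain or remediate before approving deployment.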

Privacy and minimization (Article 8). The EU AI Act did not exist when SyRI was designed, but the GDPR (in force since May 2018) did. The lawful-basis test for processing of personal data across multiple government databases, the minimization duty, and the data-subject transparency obligations were all operative. A readiness scorecard would have applied the three-test minimization memo from Article 8 and surfaced gaps on all three tests. The court’s Article 8 ECHR ruling was in substance a finding that those tests had not been passed.

Scorecard (Article 11). A scorecard written for SyRI would not have produced a “fit” determination. At best, a “conditionally fit — pending remediation of transparency, subgroup coverage, minimization, and governance gaps” finding would have been issued. At worst, a “not fit” finding would have been issued. The program’s court-ordered shutdown in 2020 was, seen this way, the remediation that an earlier scorecard might have produced without the public harm.

Part 2 — Rotterdam welfare-fraud algorithm

The program

Rotterdam operated a machine-learning system to score welfare applicants for fraud risk. The system had been in use since around 2017 and was part of a suite of tools the city used to allocate investigative resources. In 2021 and 2022, journalists from Lighthouse Reports, Wired, and Follow the Money obtained access — through freedom-of-information requests and a narrow agreement with the city — to documentation and analytical artifacts sufficient to reconstruct the system’s behavior. Their investigation was published in March 2023.2

The investigation’s findings

The published investigation documented several patterns that are readiness findings by any standard:

  • Model could not be reconstructed by city staff. The data and logic had been developed with external vendors and were not fully documented inside the city. Staff could not answer questions about specific model behaviors because the information required to answer them did not exist in retrievable form.
  • Demographic disparities in scoring. The investigation’s reconstruction showed that the model scored applicants with specific demographic patterns (family structure, neighborhood, age) as higher risk in ways that were not readily defensible from the model’s stated purpose.
  • Feature choices embedded proxies. Non-obvious features carried predictive weight that was interpreted, post-hoc, as proxies for national origin and socioeconomic status.
  • Decision context. Applicants flagged as higher risk were subject to additional investigation; the investigative process itself had consequences (delayed benefits, required documentation) that affected material welfare.

Rotterdam discontinued the program.

Reading Rotterdam through the readiness lens

Every finding in the published investigation maps to a dimension this credential covers.

Documentation (Article 4). The inability of city staff to reconstruct the model is a documentation failure of the most literal kind. A datasheet would have covered composition, preprocessing, uses, and maintenance; its absence meant that questions about model behavior had no documented answer. A readiness scorecard refresh during operation would have flagged documentation currency as a red finding and escalated it.

Feature-level bias and proxy leakage (Article 7). The specific finding that non-obvious features carried proxy weight is the failure mode Article 7 is written to prevent. A proxy audit with mutual-information statistics against protected attributes would have identified the features. A curation decision would then have been required — exclude, transform, or accept with explicit rationale — and the decision would have lived in the decision log.
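A proxy audit of the kind described above can be sketched in a few lines. The example below computes empirical mutual information between a candidate feature and a protected attribute; the feature names and data are hypothetical, chosen only to show how a perfect proxy and an independent feature separate under the statistic.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete variables."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * n*n / (count_x * count_y) equals p(x,y) / (p(x) * p(y))
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical audit data: a 'postcode' feature that perfectly tracks a
# protected attribute, and an unrelated feature that is independent of it.
postcode  = ["A", "A", "B", "B", "A", "B", "A", "B"]
protected = ["x", "x", "y", "y", "x", "y", "x", "y"]
unrelated = ["p", "p", "p", "p", "q", "q", "q", "q"]
print(mutual_information(postcode, protected))   # 1.0 bit — a perfect proxy
print(mutual_information(unrelated, protected))  # 0.0 — carries no protected signal
```

In a real audit the statistic would be run over every candidate feature, and any feature scoring materially above zero would trigger the exclude/transform/accept decision the article describes, with the rationale recorded in the decision log.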

Subgroup coverage and net-fairness rule (Article 7). The demographic disparities in scoring would have been examined during pre-deployment subgroup-coverage analysis and sustained monitoring. The net-fairness rule — that each curation decision must be examined for its effect across all groups — is the discipline by which the program would have been required to defend its feature choices.

Sustainment (Article 10). A readiness program running scorecard refreshes during operation would have caught the erosion of documentation currency, the drift in decision outcomes across demographic groups, and the ongoing gap between model behavior and staff understanding. The published investigation did this work as journalism; the readiness practitioner’s job is to do it inside the organization before the journalism becomes necessary.
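The drift-monitoring half of sustainment can also be sketched concretely. One common statistic (an assumption here — the article does not specify which drift measure Article 10 prescribes) is the Population Stability Index, comparing the model's score distribution at launch against the distribution at each refresh; the bin values below are invented for illustration.

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned score distributions.
    Inputs are lists of bin proportions, each summing to 1."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]   # score distribution at launch (illustrative)
current  = [0.10, 0.20, 0.30, 0.40]   # distribution at a later refresh (illustrative)
print(round(psi(baseline, current), 3))  # 0.228
# Common rule of thumb: PSI < 0.1 stable, 0.1-0.25 watch, > 0.25 escalate
```

Run at each scorecard refresh, a check like this turns the erosion the journalists found after the fact into an internal red finding while it can still be remediated.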

Scorecard (Article 11). As with SyRI, a scorecard written for Rotterdam would not have produced a durable “fit” determination. The most honest early-cycle scorecard would have been “conditionally fit — pending documentation rebuild, proxy audit, subgroup coverage evidence, and sustainment instrumentation.” An honest sustainment-refresh scorecard would have been “not fit — documentation has eroded to the point that the system cannot be defended.”

Part 3 — Patterns across the two cases

The two cases are separated by years, municipalities, legal regimes, and specific technical details. The patterns they share are the ones the readiness practitioner should carry forward.

Pattern 1 — Transparency as a first-class dimension

Both programs failed on transparency. SyRI could not tell affected individuals that they had been scored; Rotterdam could not tell its own staff how the model worked. Transparency is not a communications function. It is a data-readiness function. Without it, the governance structures that would otherwise catch problems cannot operate.

Pattern 2 — Proxy discrimination is the default, not the exception

Both programs had deployment patterns that disadvantaged specific populations in ways their proponents did not initially recognize. The pattern is common across consequential-decision AI and is not a failure of intent; it is a failure of the data-curation discipline that Article 7 makes explicit.

Pattern 3 — Sustainment, not launch, is where the failure becomes visible

Both programs operated for years before the failure became public. The readiness discipline’s sustainment layer — scorecard refresh, drift monitoring, incident response — is the mechanism by which such erosions are caught internally. Without sustainment, the organization discovers the failure when it is forced into public retreat.

Pattern 4 — Documentation is the last line of defense

Both programs could be critiqued, in hindsight, because enough information survived to reconstruct what they had done. In Rotterdam’s case, the reconstruction required outside-journalist effort because internal documentation was absent. A stronger documentation regime would have produced the same reconstruction internally and earlier.

Part 4 — Lessons for the practitioner

The practitioner takes four concrete habits from this case study:

  1. For any consequential-decision AI use case, the readiness scorecard carries more weight than it does for advisory use cases. The threshold for proceeding should be tighter.
  2. Transparency, documentation, and proxy audit are not three separate dimensions; they are three expressions of the same discipline — the discipline of making the system explainable to a reviewer who was not part of its construction.
  3. The sustainment cadence for consequential-decision AI should be monthly at minimum. A quarterly refresh is too slow to catch the erosion patterns these cases exhibit.
  4. When the scorecard’s answer is “not fit,” the practitioner’s job is to make the answer visible to decision-makers with enough authority to act. Silence is complicity.

Discussion questions

  1. Imagine you were the readiness practitioner assigned to SyRI before deployment. Draft the three findings from your scorecard that would have been most significant for the Article 8 ECHR concerns the court later identified.
  2. Rotterdam’s sustainment gap meant that documentation eroded over years without anyone raising the issue. What sustainment metric (from Article 10) would have caught the erosion earliest? Why?
  3. The readiness practitioner for either program might have faced institutional pressure to produce a “fit” finding. What is your response to such pressure? What professional boundary do you hold?
  4. Both cases are public-sector. Do the readiness disciplines apply differently in private-sector analogues (credit scoring, tenant screening, employment vetting)? If so, where; if not, why not?
  5. Both cases involved external vendors or contractors in the modeling pipeline. What third-party readiness controls (from Article 9) would have mitigated the documentation gap?

Further reading

  • The SyRI judgment (in Dutch with English summary at the linked source).1
  • The Lighthouse Reports / Wired investigation of Rotterdam.2
  • The broader Lighthouse Reports and Amnesty International coverage of Dutch welfare-fraud AI programs.
  • ECtHR and CJEU jurisprudence on automated decision-making.

Footnotes

  1. Rechtbank Den Haag, NJCM c.s. / Staat der Nederlanden (SyRI), ECLI:NL:RBDHA:2020:865, judgment of 5 February 2020. https://uitspraken.rechtspraak.nl/details?id=ECLI:NL:RBDHA:2020:865

  2. G. Geiger et al., Inside the Suspicion Machine, Wired (Lighthouse Reports investigation), March 6, 2023. https://www.wired.com/story/welfare-algorithms-discrimination/