AITM M1.1-Art02 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Data Quality Dimensions Extended for AI


Article 2 of 15

This article walks through each dimension in turn, describes the check that produces evidence for it, sets an orientation on thresholds, and names the downstream model failure mode the dimension protects against. The reference standard throughout is ISO/IEC 5259, the first multi-part ISO standard dedicated to ML data quality, with Part 2 (2024) providing the measure catalog and Part 3 (2024) the requirements.[1][2]

Why AI quality is not BI quality

The seven classical dimensions were built for operational and analytical systems — data warehouses, operational data stores, reporting layers. Those systems are used by humans making decisions, and humans are generally robust to small quality defects: a rounding inconsistency in a dashboard is annoying but not catastrophic. AI workloads break that assumption. Models are statistical learners that can encode a dataset’s defects as rules. A small training-set bias can produce a model that reproduces the bias at scale, long after the training run is over. A small labeling inconsistency can shift decision boundaries enough to change outcomes for named individuals.

The STAT News investigation into IBM Watson for Oncology documented one manifestation of this pattern. Reporting in 2018 described recommendations traced to training on hypothetical cases authored by a small number of clinicians rather than on real patient data, producing treatment suggestions that MD Anderson’s internal review found unsafe or incorrect in multiple documented instances.[3] The defects were present in the training data from the start; the scale of the product turned a data-quality problem into a patient-safety problem.

A data readiness practitioner’s quality work is therefore more rigorous than a BI practitioner’s. The thresholds are tighter, the checks are more frequent, and the sampling strategy must account for the risk tier of the downstream model, not only the statistical properties of the dataset.

The seven classical dimensions

Accuracy

Accuracy measures how close a data value is to the true value it is intended to represent. For AI, accuracy failures in the training set propagate directly to prediction error in the model. A customer-churn model trained on a dataset where 3% of the “churned” labels are wrong will learn a decision boundary that reflects that noise. An object-detection model trained on a dataset where 2% of bounding boxes are drawn around the wrong object will learn to misplace boxes on unseen images.

Checks: sample-and-compare against source-of-truth; use independent verification where a second system has the same value (customer address in CRM and in billing); compute reconciliation metrics against a gold set.

Threshold orientation: for high-risk use cases, ≥99.0% accuracy on critical fields; for medium-risk, ≥97.0%; for low-risk, ≥95.0%. The threshold must be justified against the use case, not asserted as a default.
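The sample-and-compare check can be sketched as follows. This is a minimal illustration, not a prescribed implementation; the record shapes and the `address` field are hypothetical, and a real program would draw a risk-tier-sized stratified sample rather than the toy data shown here.

```python
# Gold-set reconciliation sketch: compare sampled records against a
# verified source-of-truth and report per-field accuracy for comparison
# with the risk-tier threshold. Field names are hypothetical.

def field_accuracy(records, gold, field):
    """Fraction of sampled records whose `field` matches the gold record."""
    if not records:
        return 0.0
    matched = sum(
        1 for rid, rec in records.items()
        if rid in gold and rec.get(field) == gold[rid].get(field)
    )
    return matched / len(records)

records = {
    "c1": {"address": "12 Elm St"},
    "c2": {"address": "9 Oak Ave"},
    "c3": {"address": "9 Oak Av"},   # transcription defect
}
gold = {
    "c1": {"address": "12 Elm St"},
    "c2": {"address": "9 Oak Ave"},
    "c3": {"address": "9 Oak Ave"},
}

score = field_accuracy(records, gold, "address")
# score is 2/3 here; far below even the low-risk 95.0% orientation
```

The same function run against a CRM-versus-billing pairing gives the independent-verification variant of the check.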

Completeness

Completeness measures the proportion of required values that are present. For AI, completeness has two sub-dimensions: record-level completeness (does each row have the expected columns filled?) and dataset-level completeness (does the dataset cover the expected population?). Both matter. A training set that covers the expected population but has 30% missing age values will force the preprocessing pipeline to impute, and the imputation choice is a modeling decision that must be governed.

Checks: null counts, sparsity analysis, missing-at-random tests (Little’s MCAR test), coverage analysis against a reference population.
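A null-count check for record-level completeness can be sketched as below; the rows and column names are hypothetical. Dataset-level completeness needs a reference population and is a separate comparison.

```python
# Record-level completeness sketch: proportion of rows with a non-null
# value in a given column. Column names are illustrative.

def column_completeness(rows, column):
    """Proportion of rows with a non-null value in `column`."""
    if not rows:
        return 0.0
    present = sum(1 for r in rows if r.get(column) is not None)
    return present / len(rows)

rows = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": 51, "country": None},
    {"age": 29, "country": "US"},
]

age_score = column_completeness(rows, "age")          # 0.75
country_score = column_completeness(rows, "country")  # 0.75
```

A 0.75 score on `age` is the situation described above: the pipeline must impute, and that imputation choice becomes a governed modeling decision.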

Consistency

Consistency measures whether the same fact is represented the same way across records and across systems. Inconsistency is especially corrosive in AI because models learn patterns. If “US,” “USA,” and “United States” appear for the same country field across the training set, the model either treats them as three distinct tokens (learning nothing useful from the country feature) or the preprocessing pipeline must normalize them (and the normalization becomes a governed transformation).

Checks: distinct-value enumeration with reconciliation against a controlled vocabulary; cross-system reconciliation; dual-coding detection.
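Distinct-value enumeration against a controlled vocabulary is straightforward to sketch; the country values and vocabulary below are hypothetical.

```python
from collections import Counter

def vocabulary_violations(values, vocabulary):
    """Distinct observed values absent from the controlled vocabulary,
    with their occurrence counts."""
    counts = Counter(values)
    return {v: n for v, n in counts.items() if v not in vocabulary}

country_values = ["US", "USA", "United States", "US", "DE", "USA"]
violations = vocabulary_violations(country_values, {"US", "DE", "FR"})
# {"USA": 2, "United States": 1}: candidates for a governed normalization map
```

The output is exactly the artifact the governed transformation needs: the variant spellings and how often each occurs.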

Timeliness

Timeliness measures whether a data value reflects the state of the world within the latency window the use case requires. For an AI use case, timeliness has two layers: is the training-set cutoff recent enough for the model to learn a current pattern, and is the serving-path latency short enough for the feature values to be informative at prediction time?

Checks: freshness dashboards at the source system; event-time versus ingestion-time monitoring; feature-staleness metrics in the feature store (covered in Article 6).
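A feature-staleness check on the serving path can be sketched as follows; the feature names and the 24-hour window are illustrative, not thresholds from any standard.

```python
from datetime import datetime, timedelta, timezone

def stale_features(last_refresh, now, max_age):
    """Names of features whose last refresh is older than the allowed window."""
    return sorted(name for name, ts in last_refresh.items() if now - ts > max_age)

now = datetime(2026, 4, 6, tzinfo=timezone.utc)
last_refresh = {
    "txn_count_7d": now - timedelta(hours=2),
    "avg_balance": now - timedelta(days=3),   # beyond a 24-hour window
}

stale = stale_features(last_refresh, now, timedelta(hours=24))
# ["avg_balance"]
```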

Validity

Validity measures whether data values conform to the syntactic and semantic rules defined for them. A date field must be a date. A categorical field must be from the allowed set. A numeric field must fall within a defined range. Validity failures are the easiest class to catch — schema-level checks find them automatically — and the easiest to fix before they reach the training set.

Checks: schema validation at ingest; type checks; range checks; regex checks for formatted strings; enum-conformance checks for categoricals.
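The whole family of checks can be expressed as one predicate per field. The rule set below is hypothetical and mirrors the check types listed above: a semantic date check, a numeric range check, a regex check, and an enum-conformance check.

```python
import re
from datetime import datetime

def _is_date(v):
    """Semantic date check: the string must parse as a real calendar date."""
    try:
        datetime.strptime(str(v), "%Y-%m-%d")
        return True
    except ValueError:
        return False

# Hypothetical rule set: one validity predicate per field.
RULES = {
    "signup_date": _is_date,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "postcode": lambda v: bool(re.fullmatch(r"\d{5}", str(v))),
    "tier": lambda v: v in {"free", "pro", "enterprise"},
}

def invalid_fields(record):
    """Fields present in the record that fail their validity rule."""
    return [f for f, rule in RULES.items() if f in record and not rule(record[f])]

bad = invalid_fields(
    {"signup_date": "2026-13-01", "age": 34, "postcode": "90210", "tier": "gold"}
)
# ["signup_date", "tier"]: month 13 fails the date parse, "gold" fails the enum
```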

Uniqueness

Uniqueness measures whether records that should be distinct are not duplicated, and records that represent the same real-world entity are not fragmented. For AI, duplicates in the training set cause class-imbalance distortions and leakage between training and test splits. Entity fragmentation (the same person appearing as three customer records) causes the model to learn from less data than the dataset volume suggests.

Checks: deterministic deduplication on key columns; probabilistic entity resolution on weaker keys; near-duplicate detection on free-text fields using hashing or embedding similarity.
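The deterministic layer of the check can be sketched as below; the rows and the `email` key column are hypothetical. Probabilistic entity resolution and embedding-based near-duplicate detection need dedicated tooling and are not shown.

```python
from collections import Counter

def duplicate_keys(rows, key_columns):
    """Key tuples appearing more than once: deterministic dedup check."""
    counts = Counter(tuple(r[c] for c in key_columns) for r in rows)
    return {k: n for k, n in counts.items() if n > 1}

rows = [
    {"email": "a@example.com", "name": "Ada"},
    {"email": "b@example.com", "name": "Bo"},
    {"email": "a@example.com", "name": "Ada L."},   # same entity, two records
]

dupes = duplicate_keys(rows, ["email"])
# {("a@example.com",): 2}
```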

Representativeness

Representativeness measures whether the dataset’s distribution across meaningful segments approximates the distribution of the target population the model will serve. A recruiting model trained on ten years of résumés from a predominantly male applicant pool will learn patterns that reflect the pool, not the broader labor market the organization might want to reach. Amazon’s retirement of an experimental recruiting tool in 2018 has been reported as an instance of this failure, traced to gender imbalance in training data.[4]

Checks: distribution comparison against a reference population; subgroup-size audits (covered in depth in Article 7); missing-segment analysis.
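A distribution comparison against a reference population can be sketched as a per-segment share difference; the segment shares and the 5-percentage-point tolerance below are illustrative choices, not standards.

```python
def segment_gaps(dataset_share, reference_share, tolerance=0.05):
    """Segments whose dataset share deviates from the reference population
    by more than `tolerance` (absolute difference in proportion)."""
    return {
        seg: dataset_share.get(seg, 0.0) - ref
        for seg, ref in reference_share.items()
        if abs(dataset_share.get(seg, 0.0) - ref) > tolerance
    }

dataset = {"male": 0.78, "female": 0.22}
reference = {"male": 0.52, "female": 0.48}

gaps = segment_gaps(dataset, reference)
# both segments flagged; the signed gap feeds the missing-segment analysis
```

Segments present in the reference but absent from the dataset surface with a gap equal to the full reference share, which is the missing-segment case.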

The three AI-specific dimensions

Freshness versus training cutoff

A training dataset has a cutoff date — the latest point from which records are drawn. The cutoff interacts with the use case in two ways. If the world has shifted since the cutoff (Zillow’s home-valuation model, trained before the 2021 housing-market inflection, is a canonical example), the model’s predictions will be systematically off on production data. If the cutoff is too recent, the dataset may not include enough of the rare events the model needs to learn.

The readiness practitioner checks the cutoff against the use case’s deployment timeline and against any known regime changes in the domain. The evidence is a documented cutoff date, a rationale for why it is appropriate for the use case, and a monitoring plan that will catch the distributional shift when it eventually happens (covered in Article 10).

Labeling agreement

Where labels are produced by human annotators, the degree of agreement across annotators is itself a quality dimension. A label set where three annotators independently agree 95% of the time on a binary task is very different, as training data, from one where they agree 65% of the time. Inter-annotator agreement is measured using Cohen’s kappa for two annotators, Fleiss’s kappa for three or more on categorical tasks, and Krippendorff’s alpha for more general cases.[5][6]

For high-risk use cases, the practitioner should require a target agreement threshold (often κ ≥ 0.8 for binary classification, α ≥ 0.8 for ordinal tasks) and a documented adjudication process for the disagreements. Article 5 of this credential covers labeling operations in depth.
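Cohen’s kappa for the two-annotator case is small enough to sketch directly; the label lists below are hypothetical. The statistic corrects observed agreement for the agreement expected by chance given each annotator’s label distribution.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators on a categorical task.
    Assumes more than one label is observed (chance agreement < 1)."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # observed
    p_e = sum(                                                   # chance
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1 - p_e)

a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
b = ["pos", "pos", "neg", "pos", "pos", "neg", "pos", "neg", "neg", "pos"]

kappa = cohens_kappa(a, b)
# raw agreement is 0.80, but kappa is about 0.58: below a 0.8 target
```

The example shows why raw percent agreement is not the reported metric: 80% agreement here shrinks to roughly 0.58 after chance correction, failing a κ ≥ 0.8 target.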

Distributional stability

Distributional stability measures how much the dataset’s joint distribution changes across time windows or segments. Two datasets with identical univariate distributions can differ significantly in their joint distributions — the interaction between features — and the difference is what the model ends up learning. Stability is assessed using population stability index (PSI) across time windows, Kolmogorov-Smirnov tests on continuous features, and chi-squared tests on categoricals.

Distributional stability is a snapshot version of drift monitoring, discussed in depth in Article 10. In the readiness assessment, the practitioner checks stability across the train / validation / test split and across the historical windows used to build the training set. Unstable distributions signal that the model will need aggressive drift monitoring in production.
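PSI across two time windows can be sketched from pre-binned proportions. The four-bin proportions below are hypothetical; the sketch skips empty bins, where production implementations often add a small epsilon instead.

```python
import math

def psi(expected, actual):
    """Population stability index from two lists of binned proportions.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
        if e > 0 and a > 0   # skip empty bins; epsilon smoothing is an alternative
    )

train_window = [0.25, 0.25, 0.25, 0.25]   # binned feature shares at training time
serve_window = [0.40, 0.30, 0.20, 0.10]   # same bins in a recent window

score = psi(train_window, serve_window)
# about 0.23: moderate shift, a signal for aggressive drift monitoring
```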

[DIAGRAM: ScoreboardDiagram — ten-dimension-scorecard — ten-row scorecard with columns for dimension, check method, threshold, current score, owner badge, evidence reference, and status (green/amber/red) — shows the seven classical dimensions followed by the three AI-specific dimensions, each annotated with the primary standard reference (ISO 5259-2, ISO 5259-3, or EU AI Act Article 10)]

Setting thresholds — the use-case-first rule

A common mistake is to import vendor-default quality thresholds and assert them across every use case. That approach fails because the same 95% accuracy threshold can be too loose for a medical-imaging model and too tight for a sports-ranking novelty application. The readiness practitioner sets thresholds use-case-first, using a four-step method:

  1. Identify the downstream model’s failure cost. What is the cost of a wrong prediction, expressed in dollars, in human welfare, or in regulatory exposure?
  2. Identify the model’s robustness to the dimension under consideration. Some models tolerate label noise well; others do not.
  3. Identify the risk tier the use case sits in (under EU AI Act Article 6, under internal risk policy, or under sector rules).
  4. Set the threshold at the tightest of the three anchors, then justify it in the scorecard.

The justification is itself evidence. An auditor reading a scorecard with a 97% completeness threshold should be able to follow the justification back to the cost-of-wrong-prediction calculation.
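Step 4 of the method can be sketched as taking the strictest of the three anchors and recording which one drove the choice. The anchor values below are illustrative, not defaults from any standard, and real anchors would come from the cost calculation, robustness testing, and the risk-tier policy.

```python
# Hypothetical sketch of step 4: choose the tightest (highest) of the three
# candidate thresholds and name the driving anchor for the scorecard.

def tightest_threshold(cost_anchor, robustness_anchor, risk_tier_anchor):
    """Return (threshold, driving_anchor) for the scorecard justification."""
    anchors = {
        "failure-cost": cost_anchor,
        "model-robustness": robustness_anchor,
        "risk-tier": risk_tier_anchor,
    }
    driver = max(anchors, key=anchors.get)
    return anchors[driver], driver

threshold, driver = tightest_threshold(0.97, 0.95, 0.99)
# (0.99, "risk-tier"): the scorecard justification names the risk tier
```

Recording the driver is what lets an auditor trace a 97% or 99% threshold back to the anchor that produced it.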

Automated checks and sampling review

No single check captures quality. A mature readiness evidence program uses three layers:

  • Automated schema-level checks — run on every batch, catch validity and many uniqueness failures, integrate into the ingest pipeline with fail-closed semantics.
  • Automated statistical checks — run daily on rolling windows, catch distribution drift, freshness failures, and completeness regressions; produce alerts rather than fail-closed.
  • Sampling review — periodic human review of a stratified sample, catches the semantic failures no statistical check can (the bounding box around the wrong object, the sentiment label inverted, the address that looks valid but is not).

Open-source tools (Great Expectations, Soda Core, custom SQL) and commercial tools (Monte Carlo, Databand, Anomalo, and many others) all implement variants of the first two layers. Technology neutrality requires the practitioner to score the check coverage, not the tool. A homegrown SQL-based framework that covers all three layers passes readiness; a best-in-class commercial tool that only runs on a subset of datasets does not.
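The fail-closed semantics of the first layer can be sketched in plain Python, independent of any tool; the expected columns and record shape below are hypothetical. The point is behavioral: a schema violation raises before the batch is admitted, rather than emitting an alert.

```python
# Layer 1 sketch: fail-closed schema checks at ingest. Column names
# are illustrative; a real pipeline would generate rules from a schema
# registry rather than hard-code them.

EXPECTED_COLUMNS = {"customer_id", "event_time", "amount"}

def validate_batch(batch):
    """Raise on the first schema violation: fail-closed semantics.
    Returns the number of records admitted when the batch is clean."""
    for i, record in enumerate(batch):
        missing = EXPECTED_COLUMNS - record.keys()
        if missing:
            raise ValueError(f"record {i}: missing columns {sorted(missing)}")
        if not isinstance(record["amount"], (int, float)):
            raise ValueError(f"record {i}: amount is not numeric")
    return len(batch)

ok = validate_batch(
    [{"customer_id": "c1", "event_time": "2026-04-06", "amount": 12.5}]
)
# a record missing event_time would raise before reaching the training store
```

A homegrown framework that implements this behavior across all datasets satisfies the layer; the tool brand does not matter.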

[DIAGRAM: BridgeDiagram — classical-to-ai-quality — bridge from “Classical seven-dimension quality” on the left to “AI-extended ten-dimension quality” on the right; the bridge spans show the additions (freshness-vs-cutoff, labeling agreement, distributional stability) and the changes in threshold regime for the seven classical dimensions when applied to AI]

Pedagogical anchor — the Post Office Horizon case

The UK Post Office Horizon scandal is not an AI case. The public inquiry into it, which continued through 2024, documents a two-decade pattern in which software-generated accounting evidence was treated as authoritative and used to prosecute subpostmasters for shortfalls that were, in many cases, software defects.[7] The case is a pedagogical anchor for the readiness practitioner because it illustrates what happens when no independent data-quality check sits between a system’s output and a downstream consequential decision. The machines said the numbers were right; no program of sampling review existed to catch the places where they were not.

An AI readiness scorecard with a thin quality layer produces the same class of failure — a model’s output is treated as authoritative, the data foundations that produced it are not examined, and the downstream decisions compound the error. The practitioner’s quality discipline is the mechanism by which that class of failure is prevented.

Tying quality to risk tier

EU AI Act Article 10 requires, for high-risk systems, that training, validation, and testing datasets be relevant, sufficiently representative, and to the best extent possible free of errors and complete in view of the intended purpose.[8] The text does not prescribe specific thresholds. The readiness practitioner translates the Article 10 language into measurable requirements for the specific use case, scored across the ten dimensions, with evidence supporting each score and a defensible justification for each threshold.

The same translation applies to ISO/IEC 42001 Clause 7.5 (documented information) and Annex B controls on data management.[9] An AI management system audited under 42001 will expect the quality evidence to be organized, retrievable, and signed by named owners. A readiness program that produces quality evidence only once, at model launch, will fail the audit even if the scores themselves are strong.

Cross-references

  • COMPEL Core — Data as the foundation of AI (EATF-Level-1/M1.4-Art05-Data-as-the-Foundation-of-AI.md) — the macro-level argument that data quality bounds AI outcomes.
  • COMPEL Practitioner — Data quality and technology assessment deep dive (EATP-Level-2/M2.2-Art06-Data-Quality-and-Technology-Assessment-Deep-Dive.md) — the practitioner-level treatment of quality diagnostics, which this credential applies to AI-specific contexts.
  • AITM-DR Article 7 (./Article-07-Bias-Relevant-Variables-and-Subgroup-Coverage.md) — representativeness extends into fairness-aware curation.
  • AITM-DR Article 10 (./Article-10-Drift-Monitoring-Incident-Classification-and-Sustainment.md) — distributional stability as a point-in-time assessment connects to drift as a sustained operational discipline.

Summary

AI data quality is the seven classical dimensions plus three extensions — freshness versus training cutoff, labeling agreement, and distributional stability. Thresholds are set use-case-first, justified against downstream model impact and risk tier. Evidence is produced through three check layers: schema-level automation, statistical automation, and stratified sampling review. The readiness scorecard records the scores, the thresholds, and the justifications, and becomes the primary artifact the governance program retains for audit.


Footnotes

  1. ISO/IEC 5259-1:2024, Artificial intelligence — Data quality for analytics and machine learning (ML) — Part 1: Overview, terminology, and examples. https://www.iso.org/standard/81088.html

  2. ISO/IEC 5259-2:2024, Part 2: Data quality measures. https://www.iso.org/standard/81091.html

  3. C. Ross and I. Swetlitz, IBM’s Watson supercomputer recommended ‘unsafe and incorrect’ cancer treatments, STAT News, July 25, 2018. https://www.statnews.com/2018/07/25/ibm-watson-recommended-unsafe-incorrect-treatments/

  4. J. Dastin, Amazon scraps secret AI recruiting tool that showed bias against women, Reuters, October 10, 2018. https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

  5. K. Krippendorff, Content Analysis: An Introduction to Its Methodology, 2nd ed., Sage (2004).

  6. J. L. Fleiss, Measuring nominal scale agreement among many raters, Psychological Bulletin 76, no. 5 (1971): 378–382.

  7. Post Office Horizon IT Inquiry, public hearings and evidence bundles, 2022–2024. https://www.postofficehorizoninquiry.org.uk/

  8. Regulation (EU) 2024/1689, Article 10 (Data and data governance). https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  9. ISO/IEC 42001:2023, Information technology — Artificial intelligence — Management system. https://www.iso.org/standard/81230.html