AITM M1.1-Art04 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Data Lineage, Provenance, and Documentation


12 min read Article 4 of 15

This article defines the three artifacts, translates regulatory expectations into engineering practice, and describes the minimal viable documentation regime that will pass a supervisory review under EU AI Act Article 10 and ISO/IEC 8183.[1][2]

Why documentation is the readiness deliverable

The data readiness scorecard covered in Article 11 of this credential is itself a document, built on top of the documents described in this article. Without the underlying documentation, the scorecard is a set of assertions. With it, the scorecard becomes a review that reconciles claims against evidence. The readiness practitioner’s most common diagnostic finding is not that the data is wrong but that the data cannot be explained — no one can answer where did this come from, what was done to it, who decided, and who is accountable now. That pattern is the one Article 10 was written to prevent.

Gebru et al. introduced the Datasheets for Datasets proposal in 2018 and refined it through publication in Communications of the ACM in 2021.[3] The proposal was simple: every dataset should have a structured questionnaire document that answers a defined list of questions about motivation, composition, collection, preprocessing, intended uses, distribution, and maintenance. Adoption has been uneven. Large organizations have often adopted the form without the substance, producing datasheets as launch-gate paperwork that nobody reads. Small organizations have sometimes skipped them entirely. The readiness practitioner should aim for the middle path: datasheets produced at the cadence the data product actually changes, maintained by the data owner, and kept short enough to stay current.

Three mandatory artifacts

The readiness practitioner requires three artifacts per data product used in an AI workload.

1. Lineage graph

The lineage graph traces every field in the target dataset back to its source systems, through every transformation, with enough resolution that an auditor can ask “where did this column come from?” and be answered with a complete chain. The graph has two layers:

  • Structural lineage — which upstream column or table produces which downstream column or table. Structural lineage is captured automatically by most modern warehouse platforms, transformation tools, and lineage-specific products. Tools that cover this layer include OpenLineage, Marquez, Atlan, Collibra, DataHub, Apache Atlas, and others; the readiness practitioner scores coverage, not the tool.
  • Semantic lineage — what the transformation actually means. A column named customer_ltv that is the output of a SQL query joining three tables has structural lineage captured by the query; the semantic statement “this is lifetime value over the past 24 months, including refunds but excluding returns” is semantic lineage that the query does not capture and must be written down.

Semantic lineage is where most lineage programs quietly fail. The structural graph shows which tables feed which tables; the semantic layer — what the transformations mean — lives in the code, in Slack threads, and in the heads of the engineers who wrote the pipeline. Getting semantic lineage into a durable form is the harder half of the practitioner’s work, and is usually where the readiness engagement discovers the governance gaps that are worth fixing.
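One durable form is to store the semantic statement alongside the structural record the tooling already captures. A minimal sketch, assuming a hypothetical `SemanticLineage` record; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass
class SemanticLineage:
    """Semantic companion to a structurally-lineaged column (sketch)."""
    column: str       # downstream column, e.g. "customer_ltv"
    upstream: list    # structural inputs the lineage tooling already knows
    meaning: str      # the human-written semantic statement
    author: str       # who wrote the statement (accountability)
    reviewed: str     # last review date, ISO 8601

# The example from the text: the query captures the structural lineage of
# customer_ltv; the semantic statement must be written down separately.
ltv = SemanticLineage(
    column="customer_ltv",
    upstream=["orders.amount", "refunds.amount", "customers.id"],
    meaning=("Lifetime value over the past 24 months, "
             "including refunds but excluding returns."),
    author="data-owner@example.com",
    reviewed="2026-04-06",
)
```

Keeping the record in version control next to the pipeline code gives the semantic layer the same change history as the structural one.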

2. Datasheet

The datasheet is the human-readable companion to the lineage graph. It answers the Gebru et al. questions, adapted to the enterprise context:

  1. Motivation. For what specific AI use case was this dataset created or adapted? Who funded it? What specific task does it support?
  2. Composition. What does each instance represent (a customer, a transaction, a document chunk)? How many instances? What labels or target values? What features? What is held out?
  3. Collection process. How was the data collected? Over what time period? Through what mechanisms (query, API, scrape, human annotation, synthetic generation)?
  4. Preprocessing. What cleaning, filtering, imputation, transformation, enrichment, and splitting were applied? Was raw data retained?
  5. Uses. What AI tasks has this dataset been used for? What tasks is it unsuitable for? Are there known distributions where it underperforms?
  6. Distribution. Is the dataset distributed externally? Under what license? With what subject-consent or regulatory constraint?
  7. Maintenance. Who owns it? How often is it refreshed? When will it be retired? Where are errata and change logs kept?

A datasheet should be short. A three-page datasheet that is current and honest is better than a thirty-page datasheet that was accurate in 2023. The readiness practitioner should push back hard on datasheets that attempt comprehensive coverage at the cost of staying current.
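A short datasheet can still be checked for completeness. The sketch below encodes the seven sections above and flags any left unanswered; the section keys, `validate_datasheet` helper, and example answers are illustrative, not a prescribed schema:

```python
# The seven Gebru et al. sections, mirroring the numbered list above.
DATASHEET_SECTIONS = [
    "motivation", "composition", "collection_process",
    "preprocessing", "uses", "distribution", "maintenance",
]

def validate_datasheet(doc: dict) -> list:
    """Return the sections missing or left empty — a current, honest
    datasheet answers every section, however briefly."""
    return [s for s in DATASHEET_SECTIONS
            if not str(doc.get(s, "")).strip()]

example = {
    "motivation": "Churn-prediction scoring for retail accounts.",
    "composition": "One row per customer; 1.2M instances; label = churned_90d.",
    "collection_process": "Warehouse query over CRM events, 2023-2025.",
    "preprocessing": "Deduplicated, null-imputed; raw data retained 18 months.",
    "uses": "Churn scoring only; unsuitable for credit decisions.",
    "distribution": "Internal only; no external license.",
    "maintenance": "",  # unanswered -> flagged by the validator
}
```

A check like this can run in CI so a datasheet that drifts out of date fails the same gate as a broken pipeline.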

3. Dataset decision log

The decision log is the chronological record of governance decisions made about the dataset. Its entries are dated, signed, and reference the evidence consulted. Decisions recorded include:

  • Use-case scope expansions or restrictions
  • Quality threshold changes with rationale
  • Access-scope changes with approver
  • Inclusions or exclusions of records or columns with rationale (especially where fairness or privacy concerns were adjudicated)
  • Remediation decisions after quality incidents
  • Retirement and decommissioning

The decision log is the artifact that most organizations skip and most auditors ask for first. Without it, the readiness scorecard cannot explain why the dataset is shaped the way it is. With it, every score in the scorecard can be traced to a dated decision.
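The log's append-only discipline can be enforced in code rather than by convention. A minimal sketch, assuming a hypothetical `DecisionLog` class whose entries carry the fields the text requires (dated, signed, evidence-referencing):

```python
import datetime

class DecisionLog:
    """Append-only record of dataset governance decisions (sketch)."""

    def __init__(self):
        self._entries = []

    def append(self, decision: str, approver: str, evidence: list):
        """Record a dated, signed decision with the evidence consulted."""
        self._entries.append({
            "date": datetime.date.today().isoformat(),
            "decision": decision,
            "approver": approver,
            "evidence": evidence,
        })

    @property
    def entries(self):
        # Read-only copy: existing entries are never edited or reordered.
        return list(self._entries)

log = DecisionLog()
log.append(
    decision="Excluded pre-2022 records: consent scope did not cover AI training.",
    approver="dpo@example.com",
    evidence=["consent-audit-2025-11.pdf"],
)
```

In production the same shape would typically land in an append-only store (object storage with versioning, or a ledger table) rather than in memory.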

[DIAGRAM: HubSpokeDiagram — dataset-hub-artifacts — central hub labeled “Dataset” with six spokes to: datasheet, lineage graph, decision log, change history, access log, risk register; each spoke is annotated with the named owner and the refresh cadence (per change / daily / per decision)]

Structural and semantic lineage in practice

The practitioner should expect three recurring integration patterns for structural lineage:

  • Transformation-tool-native — dbt, Dataform, Coalesce, and similar tools emit lineage manifests that downstream catalogs can consume. Coverage is strong inside the tool and typically fails at the edges (ingestion, external sinks).
  • Warehouse-native — Snowflake, BigQuery, Databricks, and Redshift variants each publish lineage APIs. Coverage depends on whether queries are run through the native query engine or through external engines (Spark, Trino) whose lineage the warehouse does not see.
  • Catalog-native — OpenLineage, DataHub, Amundsen, and commercial catalogs (Collibra, Alation, Atlan) collect lineage from multiple sources and normalize it. Coverage depends on the collectors configured.
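Collectors in the catalog-native pattern commonly exchange OpenLineage run events. A minimal event shape, sketched from the public OpenLineage specification with facets omitted; the namespace, job, and producer values are illustrative:

```python
import datetime
import json
import uuid

# Minimal OpenLineage-style RunEvent (sketch): a COMPLETE event for a job
# that reads two warehouse tables and writes customer_ltv.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "producer": "https://example.com/readiness-pipeline",  # illustrative
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "warehouse", "name": "build_customer_ltv"},
    "inputs": [{"namespace": "warehouse", "name": "orders"},
               {"namespace": "warehouse", "name": "refunds"}],
    "outputs": [{"namespace": "warehouse", "name": "customer_ltv"}],
}
payload = json.dumps(event)  # what a collector would POST to the catalog
```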

No single pattern is complete. The readiness practitioner scores coverage by measuring lineage completeness across a stratified sample of target-dataset columns: for each sampled column, can the full chain back to a source system be reconstructed from the available tools? A score under 80% on a high-risk use case is a readiness finding that must be remediated.
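The coverage score can be sketched as below. The `resolver` callable stands in for whatever catalog API reconstructs the chain back to a source system, and simple random sampling substitutes for stratification to keep the sketch short:

```python
import random

def lineage_coverage(columns, resolver, sample_size=50, seed=0):
    """Fraction of sampled columns whose full chain back to a source
    system can be reconstructed. `resolver(column)` returns True when
    the chain is complete (a stand-in for your catalog's API)."""
    rng = random.Random(seed)
    sample = rng.sample(columns, min(sample_size, len(columns)))
    resolved = sum(1 for col in sample if resolver(col))
    return resolved / len(sample)

# Toy example: 9 of 10 columns resolve; one chain is broken.
cols = [f"col_{i}" for i in range(10)]
score = lineage_coverage(cols, resolver=lambda c: c != "col_3", sample_size=10)

HIGH_RISK_THRESHOLD = 0.80  # the 80% bar from the text
finding = score < HIGH_RISK_THRESHOLD
```

At 90% this toy dataset clears the bar; a real engagement would stratify the sample by source system and criticality before scoring.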

Regulatory translation — what Article 10 actually requires

EU AI Act Article 10 applies to high-risk AI systems. Subsections 10(2) through 10(6) specify, in sequence: the data governance and management practices that must be applied, the appropriate statistical properties of training data, the handling of special categories of data, the representativeness checks, the bias-detection duty, and the retention duty.[1] The article does not prescribe specific technical methods. It does require that the provider document the methods applied and justify the choices made.

Translated to readiness evidence, Article 10 asks for:

  • A data governance statement (the contract from Article 3 of this credential substantially satisfies this)
  • A representativeness assessment (the quality scorecard from Article 2, combined with Article 7’s subgroup analysis)
  • Known-bias documentation (Article 7 again)
  • A retention justification (the datasheet’s Maintenance section)
  • Traceability from source through every transformation (the lineage graph)

A readiness scorecard that includes these artifacts, with each section evidenced and signed, is the minimum viable Article 10 record. The provider still needs a Conformity Assessment under the Act’s Chapter III rules; the readiness scorecard is an input to that assessment, not a substitute for it.

Provenance failures and enforcement

The enforcement record on provenance has grown since 2021. The Clearview AI case is the clearest example. Clearview built a facial-recognition database by scraping public-web images. Three EU data protection authorities issued enforcement decisions against the company in 2022 — the Italian Garante imposed a €20M fine in February 2022, the UK ICO issued an enforcement notice in May 2022, and the French CNIL fined Clearview €20M in October 2022.[4][5][6] The common thread in the decisions is provenance: the company could not demonstrate the lawful basis on which images were collected, could not demonstrate data-subject awareness, and could not produce a complete record of where its training data came from.

A readiness practitioner assessing a third-party dataset or a web-scraped corpus must therefore treat provenance as a first-tier concern. Article 9 of this credential develops the third-party data readiness rubric; this article establishes the documentation discipline that makes the Article 9 rubric assessable.

[DIAGRAM: ConcentricRingsDiagram — provenance-rings — five concentric rings labeled outward from center: raw source → cleaned → enriched → feature → training snapshot; each ring annotated with the governance artifact produced at that level (source-system metadata, cleaning log, enrichment rules, feature definitions, versioned training snapshot) and the reference standard (ISO 8183, ISO 5259-2, EU AI Act Article 10)]

Sustaining documentation — the operating cadence

Documentation that is produced once and abandoned is a liability. The readiness practitioner should require an explicit maintenance cadence per artifact:

  • Lineage graph — regenerated on every pipeline change; verified monthly against a stratified sample.
  • Datasheet — updated on every material change to the dataset (schema, scope, cleaning rules); reviewed quarterly even if no change occurred.
  • Decision log — appended on every governance decision; no scheduled review (the log is append-only).
  • Change history — emitted automatically by the data platform where possible.
  • Access log — collected automatically; reviewed quarterly for anomalies.
  • Risk register — updated on material risk events; reviewed at each scorecard refresh.

The cadence is not heroic. It is the minimum that keeps the artifacts current. The readiness engagement should not propose a cadence tighter than the organization can sustain, because a slipped cadence is itself a readiness finding.
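The cadence table can be checked mechanically at each scorecard refresh. A sketch with illustrative review intervals drawn from the list above; `None` marks event-driven artifacts with no scheduled review:

```python
import datetime

# Scheduled-review intervals in days, mirroring the cadence list above.
# None = event-driven only (append-only or emitted automatically).
REVIEW_CADENCE_DAYS = {
    "lineage_graph": 30,    # verified monthly
    "datasheet": 90,        # reviewed quarterly
    "decision_log": None,   # append-only; no scheduled review
    "access_log": 90,       # reviewed quarterly for anomalies
}

def overdue(artifact, last_reviewed, today):
    """True when an artifact's scheduled review has slipped —
    itself a readiness finding, per the text."""
    cadence = REVIEW_CADENCE_DAYS.get(artifact)
    if cadence is None:
        return False
    return (today - last_reviewed).days > cadence

today = datetime.date(2026, 4, 6)
datasheet_slipped = overdue("datasheet", datetime.date(2025, 12, 1), today)
log_slipped = overdue("decision_log", datetime.date(2024, 1, 1), today)
```

Here the datasheet review (last done 2025-12-01) has slipped past its quarterly window, while the decision log, being append-only, never goes overdue.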

Pedagogical anchor — Gebru et al.’s continuing relevance

The Datasheets for Datasets proposal has been cited thousands of times since 2018 and has influenced several regulatory drafts, including EU AI Act Article 13 (instructions for use) and Annex IV (technical documentation).[3] The proposal’s durability is not because the specific questionnaire was perfect; it is because the underlying discipline — of answering a defined set of questions about every dataset — is correct and adaptable. The readiness practitioner should treat the datasheet format as a starting point to be adapted to the organization’s context, not as a fixed deliverable to be filed.

The complementary discipline for models is the model card (Mitchell et al. 2019), which the COMPEL Core stream covers in its own documentation module. Datasheets and model cards together form the paired artifacts that make AI systems explainable to reviewers who were not part of their construction.

Cross-references

  • COMPEL Core — Data governance for AI (EATF-Level-1/M1.5-Art07-Data-Governance-for-AI.md) — the governance umbrella in which documentation sits.
  • COMPEL Core — Mandatory artifacts and evidence management (EATF-Level-1/M1.2-Art14-Mandatory-Artifacts-and-Evidence-Management.md) — the framework-level rules on evidence production and retention.
  • AITM-DR Article 3 (./Article-03-Data-Governance-and-Data-Contracts.md) — the data contract is the producing contract for the datasheet and the lineage commitment.
  • AITM-DR Article 11 (./Article-11-The-Readiness-Scorecard.md) — the scorecard consumes these artifacts as evidence.

Summary

Data readiness documentation is three artifacts: the lineage graph (structural and semantic), the datasheet (human-readable dataset description), and the decision log (chronological record of governance decisions). Regulatory expectations under EU AI Act Article 10 and ISO/IEC 8183 are met by these three artifacts when they are sustained at the cadence the data actually changes. Heroic one-time documentation is a liability. The practitioner’s job is to establish a cadence the organization will keep, to verify coverage against stratified samples, and to flag the provenance gaps that would cause an enforcement exposure similar to the Clearview pattern.


Footnotes

  1. Regulation (EU) 2024/1689, Article 10 (Data and data governance). https://eur-lex.europa.eu/eli/reg/2024/1689/oj

  2. ISO/IEC 8183:2023, Information technology — Artificial intelligence — Data life cycle framework. https://www.iso.org/standard/83002.html

  3. T. Gebru et al., Datasheets for Datasets, Communications of the ACM 64, no. 12 (2021): 86–92. arXiv preprint, revised 2021: https://arxiv.org/abs/1803.09010

  4. Garante per la protezione dei dati personali, Provvedimento del 9 febbraio 2022 — Clearview AI, doc. web n. 9751362. https://www.garanteprivacy.it/home/docweb/-/docweb-display/docweb/9751362

  5. Information Commissioner’s Office (UK), enforcement notice against Clearview AI Inc., May 2022. https://ico.org.uk/action-weve-taken/enforcement/clearview-ai-inc-mpn/

  6. Commission Nationale de l’Informatique et des Libertés (CNIL), Délibération SAN-2022-019 (20 October 2022), fining Clearview AI €20,000,000. https://www.cnil.fr/en/facial-recognition-cnil-imposes-penalty-eur-20-million-clearview-ai