AITF M1.22-Art01 v1.0 Reviewed 2026-04-06 Open Access
AITF · Foundations

Data Lineage Documentation Practices


6 min read Article 1 of 4

This article describes the practical documentation practices that turn lineage from a slogan into an operational asset, with emphasis on the metadata model, the capture mechanisms, and the everyday workflows that lineage enables.

Why Lineage Matters More for AI

Conventional data warehousing has long valued lineage. AI raises the stakes for three reasons.

First, decision accountability. When a credit decision, a medical triage recommendation, or a hiring score is challenged, the question “what data informed this output?” must have a credible answer. The Federal Reserve Supervisory Letter SR 11-7 on Model Risk Management at https://www.federalreserve.gov/supervisionreg/srletters/sr1107.htm requires “comprehensive documentation” sufficient to allow independent review.

Second, bias and fairness reasoning. Bias in a model often originates in upstream data composition decisions made by people who never imagined the data would be used for this purpose. Without lineage, the diagnosis of bias hits a wall. The Algorithmic Accountability Act discussion drafts in the United States Congress at https://www.congress.gov/bill/118th-congress/house-bill/5628 contemplate explicit lineage disclosure requirements.

Third, right-to-explanation and erasure. The General Data Protection Regulation’s Article 22 safeguards around automated decision-making, and its Article 17 right to erasure, both presuppose that the organisation can identify which datasets contain a given subject’s data. The European Data Protection Board guidance at https://edpb.europa.eu/our-work-tools/our-documents/guidelines/guidelines-052020-consent-under-regulation-2016679_en discusses related documentation expectations.

The Lineage Metadata Model

Documented lineage rests on a standard metadata model. The OpenLineage specification at https://openlineage.io/ has emerged as the de facto open standard, defining datasets, jobs, runs, and the relationships between them. Organisations building from scratch should adopt OpenLineage rather than invent a parallel schema.

At minimum, every documented dataset should carry: identity, owner, source classification (first-party operational, third-party purchased, scraped public, synthetic, or derived), sensitivity classification (PII, PHI, commercially sensitive, public), refresh cadence, quality SLAs, upstream dependencies, and downstream consumers.

Every documented transformation should carry identity, owner, logic (the SQL, Python, or pipeline definition), inputs and outputs with typed schemas, trigger type (scheduled, event-driven, manual), and quality gates.
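The minimum fields in the two lists above can be pinned down as a small schema. A minimal sketch in Python dataclasses, with field names and example values chosen for illustration (the corresponding OpenLineage facet names differ):

```python
from dataclasses import dataclass, field
from enum import Enum

class SourceClass(Enum):
    FIRST_PARTY = "first-party operational"
    THIRD_PARTY = "third-party purchased"
    SCRAPED = "scraped public"
    SYNTHETIC = "synthetic"
    DERIVED = "derived"

class Sensitivity(Enum):
    PII = "PII"
    PHI = "PHI"
    COMMERCIAL = "commercially sensitive"
    PUBLIC = "public"

@dataclass
class DatasetRecord:
    identity: str                       # e.g. "warehouse.credit.applications_v3"
    owner: str
    source_class: SourceClass
    sensitivity: Sensitivity
    refresh_cadence: str                # e.g. "daily 02:00 UTC"
    quality_slas: dict = field(default_factory=dict)
    upstream: list = field(default_factory=list)    # dataset identities
    downstream: list = field(default_factory=list)  # consuming model/report identities

@dataclass
class TransformationRecord:
    identity: str
    owner: str
    logic_ref: str                      # pointer to the SQL/Python/pipeline definition
    inputs: list                        # [(dataset identity, typed schema), ...]
    outputs: list
    trigger: str                        # "scheduled" | "event-driven" | "manual"
    quality_gates: list = field(default_factory=list)
```

A schema this small is deliberately incomplete; the point is that every field is mandatory at registration time, not backfilled later.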

Capture Mechanisms

Manual lineage documentation is unsustainable beyond toy programs. Three capture patterns dominate.

The first is runtime capture from data orchestrators. Apache Airflow, Dagster, Prefect, and dbt all expose execution metadata that can be emitted as OpenLineage events.
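A task completion in any of these orchestrators reduces to an OpenLineage run event. A minimal sketch of the event JSON, built by hand here to show the shape; in practice the openlineage-python client or the orchestrator's own integration emits these (the job and dataset names are invented):

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(job_name: str, inputs: list, outputs: list,
              event_type: str = "COMPLETE") -> str:
    """Build a minimal OpenLineage-style run event as JSON."""
    event = {
        "eventType": event_type,                          # START, COMPLETE, FAIL, ...
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "airflow", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
        "producer": "https://example.org/lineage-sketch",  # placeholder producer URI
    }
    return json.dumps(event)

evt = json.loads(run_event("daily_credit_features",
                           ["raw.applications"], ["features.credit_v2"]))
```

Emitting one event per run, rather than one static record per job, is what makes the later "diff over time" view possible.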

The second is query capture from data warehouses. Snowflake, BigQuery, Databricks Unity Catalog, and Amazon Athena emit query history that can be parsed for source-to-target dataset relationships. The Databricks Unity Catalog documentation at https://docs.databricks.com/aws/en/data-governance/unity-catalog/data-lineage.html describes the pattern.
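Query capture reduces to extracting source and target tables from each statement in the history. A deliberately naive regex sketch for the simplest INSERT ... SELECT shape; production capture uses a real SQL parser such as sqlglot, which handles CTEs, subqueries, and dialect quirks that regexes cannot:

```python
import re

def naive_lineage(sql: str):
    """Extract (sources, target) from a simple INSERT INTO ... SELECT statement."""
    target = re.search(r"insert\s+into\s+([\w.]+)", sql, re.I)
    sources = re.findall(r"(?:from|join)\s+([\w.]+)", sql, re.I)
    return sorted(set(sources)), (target.group(1) if target else None)

sql = """
INSERT INTO features.credit_v2
SELECT a.id, a.income, b.score
FROM raw.applications a
JOIN raw.bureau_scores b ON a.id = b.id
"""
srcs, tgt = naive_lineage(sql)
# srcs == ['raw.applications', 'raw.bureau_scores'], tgt == 'features.credit_v2'
```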

The third is integration capture from data integration tools. Modern ETL/ELT platforms (Fivetran, Stitch, Talend, Informatica) emit lineage as part of their normal operation.

Mature programs combine all three plus a metadata aggregation layer — typically a data catalogue such as DataHub, Atlan, OpenMetadata, or Collibra.

Field-Level Lineage

Dataset-level lineage answers “where did this table come from?” Field-level lineage answers “where did this column come from?” The latter is what an investigator needs when a single sensitive attribute drives a model decision.

Field-level lineage requires the capture mechanism to parse transformation logic. dbt provides field-level lineage through the manifest. Where transformations occur in code (PySpark, pandas), parsers are less reliable; explicit column-level annotations are the workaround.
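Once a capture mechanism has recorded, per column, which parent columns it was derived from, resolving a column back to its root sources is a recursive walk. A sketch over an invented manifest-like mapping (the real dbt manifest structure is considerably richer):

```python
def root_sources(column: str, parents: dict) -> set:
    """Walk column-level lineage back to columns with no recorded parents."""
    ps = parents.get(column, [])
    if not ps:
        return {column}          # no parents recorded: treat as a root source
    roots = set()
    for p in ps:
        roots |= root_sources(p, parents)
    return roots

# Invented column-level lineage: "model.column" -> list of parent columns
parents = {
    "credit_score.risk_band": ["stg_scores.raw_score", "stg_apps.income"],
    "stg_scores.raw_score":   ["src_bureau.score"],
    "stg_apps.income":        ["src_crm.declared_income"],
}
roots = root_sources("credit_score.risk_band", parents)
# roots == {"src_bureau.score", "src_crm.declared_income"}
```

This is exactly the query an investigator runs when asking whether a protected attribute, directly or via a proxy column, feeds a model decision.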

The expense of full field-level lineage is justified only for data that drives consequential decisions.

Visualisation and Navigation

The catalogue should expose upstream views (from a dataset, walk back to every original source), downstream views (from a dataset, see every consuming model), impact analysis (simulate the effect of removing or changing a dataset), and diff over time (see how lineage has changed across versions).
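All four views are traversals of the same directed graph. A sketch of the upstream walk-back and the downstream impact analysis over a plain adjacency map of dataset dependencies (the edges here are invented for illustration):

```python
from collections import deque

def reachable(start: str, edges: dict) -> set:
    """All nodes reachable from start, breadth-first."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in edges.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Invented dependency edges: dataset -> direct consumers
downstream = {
    "raw.applications": ["features.credit_v2"],
    "features.credit_v2": ["model.credit_score_v7", "report.portfolio"],
}

# Invert to get dataset -> direct upstream dependencies
upstream: dict = {}
for src, consumers in downstream.items():
    for c in consumers:
        upstream.setdefault(c, []).append(src)

impact = reachable("raw.applications", downstream)      # what breaks if it changes
origins = reachable("model.credit_score_v7", upstream)  # sources behind the model
```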

The Linux Foundation’s DataHub project at https://datahubproject.io/ provides a reference implementation.

Lineage and Privacy

Lineage that includes PII or sensitive data must itself be protected. Common controls include role-based access to the catalogue, field-level masking, and audit logging of lineage queries. The European Union Agency for Cybersecurity (ENISA) Data Protection Engineering report at https://www.enisa.europa.eu/publications/data-protection-engineering describes complementary patterns.

Lineage in the AI Lifecycle

Training data preparation. Each training dataset version should be lineage-linked to source datasets, including any sampling, deduplication, balancing, or synthetic augmentation steps.

Feature engineering. Each feature should be lineage-linked from raw inputs through the feature store. Tools like Feast, Tecton, and Databricks Feature Store integrate with lineage capture natively.

Model training. Each model version should be lineage-linked to the training dataset version, the evaluation dataset version, the code commit, and the configuration. Model registries such as MLflow at https://mlflow.org/ encode this lineage as a first-class concept.

Inference. Each inference call should reference (in the audit trail) the model version, which carries the upstream lineage.
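The four lifecycle links above can be encoded as one record per model version, with the inference audit entry carrying only the model version identifier; everything upstream resolves through it. A sketch with invented identifiers (MLflow stores the equivalents as run tags and params):

```python
import json

# One lineage record per registered model version (all values illustrative)
model_lineage = {
    "model_version": "credit_score_v7",
    "training_dataset": "features.credit_v2@2026-03-01",    # identity + version
    "evaluation_dataset": "features.credit_eval@2026-03-01",
    "code_commit": "a1b2c3d",                               # placeholder commit hash
    "config": {"learning_rate": 0.05, "max_depth": 6},
}

def audit_entry(request_id: str, model_version: str, decision: str) -> str:
    """Inference-time audit record: references the model version, not full lineage."""
    return json.dumps({"request_id": request_id,
                       "model_version": model_version,
                       "decision": decision})

entry = json.loads(audit_entry("req-001",
                               model_lineage["model_version"], "approve"))
```

Keeping the audit record this thin is the design choice that makes high-volume inference logging affordable while keeping the full lineage one join away.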

Common Failure Modes

The first is snapshot lineage — the catalogue contains lineage from a single point in time but does not track how lineage has evolved. Counter by treating lineage events as time-series data with version markers.

The second is partial coverage — the catalogue covers the data warehouse but not the data lake, or covers SQL transformations but not Python notebooks. Counter by mandating that any system producing data consumed by AI must emit lineage.

The third is abandoned ownership — datasets in the catalogue list owners who left the organisation years ago. Counter with quarterly ownership re-attestation.

The fourth is over-classification — every dataset marked sensitive, which collapses the meaning of the classification.

What Comes Next

The next article in Module 1.22 turns to synthetic data — generation, validation, and governance for the increasingly important class of data that has been algorithmically generated rather than collected.


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.