AITM M1.1-Art03 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Data Governance and Data Contracts


12 min read Article 3 of 15

This article teaches the readiness practitioner to evaluate governance as a readiness input, to author a data contract for an AI use case, to design the lifecycle in which the contract lives, and to map accountabilities using a RACI that survives handoffs between the data steward, the data engineer, the ML engineer, the product owner, and the governance lead.

Why AI governance is harder than BI governance

BI governance matured in an environment where data flowed from source systems into a warehouse, served dashboards that humans read, and changed slowly. Governance could operate on a monthly cadence because the consequences of a governance failure were typically contained — a wrong dashboard reading was caught by the analyst, corrected, and that was that.

AI changes three things. Data flows are many-to-many instead of many-to-one: the same source dataset can feed feature stores, vector indexes, fine-tuning corpora, and inference pipelines simultaneously. Consequences are amplified: an AI system acts at a scale humans cannot audit in real time, and a governance failure can propagate into many decisions before it is caught. Cadence is compressed: models retrain, retrieval indexes refresh, and prompts evolve on weekly or daily cadences that BI processes cannot match.

The US Government Accountability Office’s 2023 review of federal AI programs documented this gap concretely. Reviewing 20 agencies, GAO found that most had not fully implemented the governance requirements set out in prior executive orders, with data governance among the most consistently underdeveloped areas — catalogs were incomplete, lineage was inconsistent, and access policies had not been adapted from BI-era defaults.1 The readiness practitioner encountering a similar pattern in an enterprise engagement is looking at a governance maturity problem, not a contracts problem, and must scope the engagement accordingly.

Policy-level governance versus contract-level governance

The readiness practitioner should separate two levels of governance:

Policy-level governance sets the rules. Access policy. Retention policy. Classification policy. Residency policy. Acceptable use policy. These are typically owned by a central governance function (CDO, DPO, or equivalent) and change infrequently. Policy governance is the input the readiness practitioner takes as given. A weak policy regime raises the bar on everything downstream, but policy reform is outside the readiness engagement’s scope.

Contract-level governance operationalizes the rules at the dataset level. A data contract specifies, for a specific data product, the schema, the semantics, the freshness SLA, the quality expectations, the access scopes, the change policy, and the owner. Contracts are authored by data producers, consumed by data consumers, and updated on the cadence the use case demands. Contract governance is where most of the readiness practitioner’s work happens.

The distinction matters because conversations get confused when the two levels are conflated. “We don’t have data governance” might mean the organization has no DPO and no retention policy (a policy-level gap the readiness engagement cannot fix), or it might mean that specific datasets have no contracts and no named owner (a contract-level gap the readiness engagement can and should fix).

The anatomy of a data contract

A data contract for an AI use case has nine required sections. Each section is testable — a data consumer can write a check that verifies the contract is being honored.

  1. Identity. A stable identifier for the data product, versioned, with the name of the owning team, the consumer list, and the current revision date.
  2. Schema. The complete column-level schema, including types, nullability, allowed values, and the versioning scheme. A schema change is a contract event and triggers the change policy.
  3. Semantics. Plain-language definitions for each column, with examples and counter-examples. Semantics are where most BI-to-AI failures originate — a column called status means something specific, and that meaning must be written down.
  4. Quality expectations. Per-column and dataset-level quality thresholds, referencing the ten dimensions from Article 2 of this credential. Thresholds are tied to the use case and justified.
  5. Freshness SLA. The acceptable latency from source event to availability in the data product, with a monitoring commitment and an incident-response commitment when the SLA is breached.
  6. Access scope. Who can read. What derived uses are permitted. Whether training, validation, testing, retrieval, and inference are each separately scoped. Whether PII fields are in scope, and under what minimization controls (Article 8 covers this).
  7. Change policy. How schema changes, semantic changes, and deprecations are announced, reviewed, and executed. Breaking-change windows and consumer signoff requirements.
  8. Lineage commitment. The upstream sources, the transformations, and the downstream consumers that will be kept in a lineage graph (Article 4 covers this).
  9. Owner and escalation. Named primary owner, named secondary, on-call contact, and escalation path for quality incidents.

A data contract that has all nine sections and is honored in production is worth more than any number of generic governance policy documents. A contract that has the sections but is not honored is a piece of shelf-ware.
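Because each of the nine sections is testable, the contract itself can be made machine-checkable. The sketch below represents a contract as a plain dictionary and verifies that all nine sections are declared; the field names and example values are illustrative assumptions, not a published template.

```python
# Illustrative sketch: the nine required contract sections as keys of a
# machine-readable dictionary, plus a check that a given contract
# declares all of them. Field names are hypothetical.

REQUIRED_SECTIONS = [
    "identity", "schema", "semantics", "quality_expectations",
    "freshness_sla", "access_scope", "change_policy",
    "lineage_commitment", "owner_and_escalation",
]

def missing_sections(contract: dict) -> list[str]:
    """Return the required sections the contract omits or leaves empty."""
    return [s for s in REQUIRED_SECTIONS if not contract.get(s)]

example_contract = {
    "identity": {"id": "orders.v2", "owner_team": "commerce-data", "revised": "2026-03-01"},
    "schema": {"order_id": "string, not null", "status": "enum(open, shipped, closed)"},
    "semantics": {"status": "fulfilment state at last sync; 'open' excludes drafts"},
    "quality_expectations": {"order_id": {"completeness": 1.0}},
    "freshness_sla": {"max_latency_minutes": 60},
    "access_scope": {"readers": ["ml-platform"], "training_use": True, "pii": False},
    "change_policy": {"breaking_change_notice_days": 30},
    "lineage_commitment": {"upstream": ["oms.orders"], "downstream": ["churn-model"]},
    "owner_and_escalation": {"primary": "a.lee", "on_call": "#data-commerce"},
}

print(missing_sections(example_contract))  # an empty list: all nine sections declared
```

A check like this is the difference between a contract and shelf-ware: the consumer can run it in CI against the published contract rather than trusting that the document is complete.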

[DIAGRAM: StageGateFlow — data-contract-lifecycle — horizontal flow of data-contract states: draft → peer review → consumer review → approval → publication → change request → amendment / deprecation — each stage annotated with the owner, the review artifact, and the gate criterion]

The contract lifecycle

A contract is a living document. The lifecycle has six states.

Draft. The owning team authors the contract. The readiness practitioner often helps draft the first contract in an engagement; subsequent contracts are drafted by the data team independently.

Peer review. An adjacent data producer reviews the draft for schema consistency, semantic clarity, and compatibility with their own contracts. The review catches the upstream-alignment gaps that would otherwise show up as production data incidents.

Consumer review. The AI use-case team reviews the contract against its requirements. The review is where fitness-for-purpose disagreements surface — the dataset the producer has is not quite what the consumer needs — and is where the readiness practitioner often mediates.

Approval. A named approver (typically the data steward for the domain) signs the contract. Approval is gated on the peer and consumer reviews and on a policy-compliance check (access scope consistent with classification policy, retention consistent with policy, etc.).

Publication. The contract is registered in the data catalog, linked to the dataset, and made discoverable. Consumers can now depend on it.

Change request. When a producer needs to change the schema, semantics, SLA, or scope, they open a change request. The change follows a mini-lifecycle of its own: peer review, consumer impact assessment, approval, publication, consumer migration window, deprecation of the old version.

The lifecycle becomes the governance cadence. A team running twenty contracts at once will have changes moving through the lifecycle continuously. The readiness practitioner’s role is to verify that the cadence exists and that no use case is depending on an un-contracted dataset.
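The six states and their legal moves can be written down as an explicit transition table, which makes the cadence auditable. The sketch below is one interpretation of the lifecycle described above (it assumes amendments re-enter through peer review), not a prescribed implementation.

```python
# Sketch of the contract lifecycle as an explicit state-transition table.
# Transition choices are an interpretation of the lifecycle in the text:
# reviews may bounce a contract back to draft, and a change request
# either re-enters review as an amendment or ends in deprecation.

LIFECYCLE = {
    "draft": ["peer_review"],
    "peer_review": ["consumer_review", "draft"],
    "consumer_review": ["approval", "draft"],
    "approval": ["publication"],
    "publication": ["change_request"],
    "change_request": ["peer_review", "deprecated"],  # amendment or deprecation
    "deprecated": [],
}

def can_transition(current: str, target: str) -> bool:
    """True if the lifecycle permits moving from `current` to `target`."""
    return target in LIFECYCLE.get(current, [])

print(can_transition("publication", "change_request"))  # True
print(can_transition("draft", "publication"))           # False: drafts cannot skip review
```

Encoding the table this way lets the practitioner verify the cadence mechanically: every published contract should have a transition history that is a valid walk through the table, and any dataset with no history at all is the un-contracted dependency the audit is looking for.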

A RACI that survives AI handoffs

Accountability for AI data is frequently ambiguous. The data steward says “I own governance, not model training.” The ML engineer says “I consume the data, I do not own it.” The product owner says “I own the business outcome, not the technical plumbing.” In the gap between these positions, readiness fails silently.

The readiness practitioner should establish an explicit RACI covering the five recurring roles:

  • Data steward — the policy-level owner of a data domain. Accountable for classification, retention, and catalog correctness.
  • Data engineer — the producer of the data product. Accountable for schema, SLA, and pipeline reliability.
  • ML engineer — the consumer of the data product for an AI use case. Accountable for fitness-for-purpose assessment, feature engineering, and model-training use.
  • Product owner — the business sponsor of the AI use case. Accountable for the use-case scope, the risk-tier classification, and the value case.
  • Governance lead — the AI governance function. Accountable for risk review, compliance sign-off, and policy escalation.

Across the nine stages of the data lifecycle, the RACI assigns R, A, C, and I per role. For example, at the Label stage, the ML engineer is responsible, the product owner is accountable, the data steward is consulted for access scope, and the governance lead is informed. The specific assignments vary by organization, but the exhaustiveness requirement does not: every stage must have exactly one named accountable role, and no stage can have zero responsible roles.
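The exhaustiveness requirement is itself checkable. The sketch below validates a RACI matrix against the two rules stated above (exactly one A, at least one R, per stage); the stage and role names are illustrative.

```python
# Sketch of the RACI exhaustiveness check: each stage needs exactly one
# accountable role and at least one responsible role. Role names follow
# the five roles in the text; the stage data is illustrative.

ROLES = {"data_steward", "data_engineer", "ml_engineer", "product_owner", "governance_lead"}

def raci_violations(raci: dict[str, dict[str, str]]) -> list[str]:
    """raci maps stage -> {role: 'R'|'A'|'C'|'I'}; returns readable gaps."""
    problems = []
    for stage, assignments in raci.items():
        unknown = set(assignments) - ROLES
        if unknown:
            problems.append(f"{stage}: unknown roles {sorted(unknown)}")
        accountable = [r for r, v in assignments.items() if v == "A"]
        responsible = [r for r, v in assignments.items() if v == "R"]
        if len(accountable) != 1:
            problems.append(f"{stage}: needs exactly one A, has {len(accountable)}")
        if not responsible:
            problems.append(f"{stage}: has zero R roles")
    return problems

# The Label-stage example from the text passes both rules.
label_stage = {"ml_engineer": "R", "product_owner": "A",
               "data_steward": "C", "governance_lead": "I"}
print(raci_violations({"label": label_stage}))  # [] means no violations
```

Running a check like this across all nine stages turns the "readiness fails silently" gap into a loud, listable set of violations.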

[DIAGRAM: OrganizationalMappingBridge — raci-across-lifecycle — matrix with the nine lifecycle stages across the top and the five roles down the side; cells contain R / A / C / I markers per role-stage combination, with a right-side column showing the governance artifact each stage must produce]

Worked example — Uber’s data contract evolution

Uber’s engineering organization has published multi-year accounts of its data governance evolution, documenting the transition from a data-lake-free-for-all to a contract-governed mesh.2 The published timeline spans multiple years: initial ad-hoc governance, a first-generation data catalog, contract emergence in the data-platform team, organization-wide contract adoption, and the transition into governance as a product. Three lessons from the published material are useful to a readiness practitioner:

  1. Contracts became durable only when they were produced by the data-producing teams, not by a central governance team. Central-team contracts were ignored; producer-team contracts were honored because the producers had skin in the game.
  2. The contract-change cadence became the most telling governance metric. Teams shipping twenty changes a month through the change-request workflow were in healthy governance; teams shipping one change a quarter through emergency patches were not.
  3. The downstream consumer list embedded in each contract became the most-used operational feature, because it let producers know who to call when they needed to break a schema.

The practitioner transferring these lessons should not assume the Uber pattern fits every organization. What transfers is the principle: contracts must live with producers, must have a change cadence that reflects the real rate of change, and must know their consumers.

Governance for AI-specific artifacts

Three artifact classes are new to AI and need explicit governance:

  • Feature stores — governed as data products with contracts covering online and offline surfaces, freshness, and access scope. See Article 6.
  • Vector stores / embedding indexes — governed as versioned data products with contracts covering chunking strategy, embedding model version, refresh cadence, and source-document lineage. See Article 6.
  • Fine-tuning corpora — governed as training-set snapshots with contracts covering provenance, license, subject-rights exposure, and retention (the fine-tuning corpus may need to be kept for the life of every model trained on it).

Each of these extends the contract template. The practitioner should not try to force them into a BI-era contract and should expect to negotiate with the data team on what the extensions look like.
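One way to picture the extension, rather than a forced BI-era fit, is composition over a shared base. The sketch below extends a minimal base contract with the vector-store fields named above; every field name here is a hypothetical assumption for illustration, not part of a published template.

```python
# Hypothetical sketch: extending a base data contract for a vector store
# with the AI-specific fields named in the text (chunking strategy,
# embedding model version, refresh cadence, source-document lineage).
# All field names and values are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class BaseContract:
    identity: str
    owner: str
    freshness_sla_minutes: int

@dataclass
class VectorStoreContract(BaseContract):
    chunking_strategy: str = "unspecified"        # e.g. "512-token window, 64-token overlap"
    embedding_model_version: str = "unspecified"  # a schema change in all but name
    refresh_cadence: str = "unspecified"          # e.g. "nightly full re-embed"
    source_document_lineage: list[str] = field(default_factory=list)

vc = VectorStoreContract(
    identity="kb-index.v3", owner="platform-ml", freshness_sla_minutes=1440,
    chunking_strategy="512-token sliding window, 64-token overlap",
    embedding_model_version="embed-model-2026-01",
    refresh_cadence="nightly",
    source_document_lineage=["confluence://policies", "s3://kb-snapshots"],
)
print(vc.embedding_model_version)
```

The design point the composition makes concrete: an embedding-model version bump changes the meaning of every vector in the index, so it should trigger the same change policy a schema change would, which is exactly the negotiation the practitioner should expect with the data team.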

Samsung and the access-scope case

In early May 2023, Samsung banned employees from using external AI chatbots after reported incidents the previous month in which employees pasted source code and internal meeting notes into ChatGPT, routing proprietary material into an external training pipeline.3 The relevant readiness question is: what access scopes existed, and were they enforceable? A data contract for proprietary source code would name external AI services as out-of-scope; an access policy built on top of the contract would need to block the route by technical or procedural means. The Samsung incident was, among other things, a contract-scope enforcement failure — the policy existed, the technical enforcement did not.

An AI readiness engagement that evaluates governance must confirm not only that contracts exist but that their access scopes are enforceable. A scope that is written down but not enforced is not a control.
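The enforceability check reduces to comparing the readers a contract declares against the principals an access system actually grants. The sketch below is a minimal version of that comparison, assuming both inputs can be extracted as sets of principal names (the names here are illustrative).

```python
# Sketch of the enforceability check: any principal holding access that
# the contract's access scope never declared is a control gap. Both
# input sets are illustrative; in practice they would come from the
# contract registry and the access-management system respectively.

def unenforced_scope(declared_readers: set[str], actual_grants: set[str]) -> set[str]:
    """Principals with live access the contract does not grant."""
    return actual_grants - declared_readers

declared = {"ml-platform", "analytics"}
actual = {"ml-platform", "analytics", "external-chatbot-gateway"}

print(unenforced_scope(declared, actual))  # the extra grant is the enforcement gap
```

An empty result does not prove the scope is enforced everywhere (a copy-paste route, as in the Samsung case, never appears in a grant table), but a non-empty result is always a finding.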

Cross-references

  • COMPEL Core — Data governance for AI (EATF-Level-1/M1.5-Art07-Data-Governance-for-AI.md) — the framework-level treatment of AI data governance.
  • COMPEL Core — Process pillar domains: use cases and data (EATF-Level-1/M1.3-Art04-Process-Pillar-Domains-Use-Cases-and-Data.md) — the 20-domain maturity model’s treatment of the data governance domain.
  • AITM-DR Article 4 (./Article-04-Data-Lineage-Provenance-and-Documentation.md) — the lineage commitment in the contract is the spine of Article 4’s provenance discipline.
  • AITM-DR Article 6 (./Article-06-Feature-Stores-and-Vector-Stores-as-Governance-Artifacts.md) — the contract extensions for feature stores and vector stores.

Summary

Data governance for AI operates at two levels: policy governance that sets the rules and contract governance that operationalizes them at the dataset level. The contract is the unit of work. A good contract has nine sections, lives in a six-stage lifecycle, and is accountable to a named five-role RACI across the data lifecycle. The readiness practitioner audits contracts against their lifecycle, flags datasets without contracts, and checks that contract scopes are enforceable. Governance that exists on paper but not in enforcement is not governance.


Footnotes

  1. US Government Accountability Office, Artificial Intelligence: Agencies Have Begun Implementation but Need to Complete Key Requirements, GAO-24-105980, December 2023. https://www.gao.gov/products/gao-24-105980

  2. Uber Engineering, Uber’s Journey Toward Better Data Culture From First Principles, 2021 (and related posts on Uber data contracts). https://www.uber.com/blog/ubers-journey-toward-better-data-culture-from-first-principles/

  3. S. Ray, Samsung bans ChatGPT and other chatbots for employees after sensitive code leak, Forbes, May 2, 2023. https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/