AITM M1.1-Art52 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Lab 2 — Data Contract and Datasheet for a RAG Source


8 min read

COMPEL Specialization — AITM-DR: AI Data Readiness Associate Lab 2 of 2


Lab overview

This lab places the learner in the role of a readiness practitioner at a large professional-services firm. The firm is deploying a retrieval-augmented generation system that answers consultants’ questions about internal methodologies, client engagement templates, and regulatory guidance. The retrieval corpus combines three source classes — published internal methodologies (the well-governed case), regulatory-guidance summaries written by the firm’s regulatory team (the medium case), and client engagement templates contributed ad hoc by practice groups (the messy case). The practitioner’s task is to author a data contract for the retrieval corpus and a datasheet for one representative content type within it.

The lab is a writing exercise supported by provided starter materials (a corpus inventory, three sample documents per source class, and the RAG system's architecture note). No tooling is strictly required; the deliverables are written artifacts.

Time: 90 minutes. Prerequisites: Articles 3, 4, and 6 completed. Output: A data contract draft and a datasheet draft, each critiqued in peer review.

Scenario

The firm. A 12,000-person professional-services firm with a consulting practice and a regulatory-advisory practice. AI governance is maturing but not mature; a governance committee exists, some use cases have been approved, several are in development. RAG deployments have proliferated faster than central governance can keep up with, and the firm’s CIO has asked the readiness function to bring structure to the data side.

The use case. An LLM-powered consultant assistant. Consultants ask questions in natural language; the system retrieves relevant chunks from the corpus and generates answers with citations. Risk tier is classified as medium — answers are advisory and subject to named-consultant review before going to a client.

The corpus. Three content classes:

  • Class A — Published methodologies. 1,200 documents. Governed under the firm’s knowledge-management program, versioned, reviewed, with named owners per document. Content is fully internal.
  • Class B — Regulatory-guidance summaries. 430 documents. Written by the regulatory-advisory practice, reviewed by a regulatory committee, with named owners per topic. Content is partially derived from external regulatory text and partially from the firm’s analysis.
  • Class C — Client engagement templates. 2,700 documents. Contributed by practice groups, no central review, unclear ownership; documents sometimes contain client-specific names and details. Provenance unclear.

Retrieval architecture note. The corpus is chunked into 600-token segments with 100-token overlap, embedded with a commercial embedding model (version v2.1, March 2024), indexed into a managed vector store, refreshed nightly from the source document system.
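The chunking scheme in the architecture note can be sketched in a few lines. This is an illustrative token-window splitter, not the firm's actual pipeline; the whitespace-style token list is a stand-in for whatever tokenizer the embedding model actually uses.

```python
def chunk_tokens(tokens, size=600, overlap=100):
    """Split a token list into windows of `size` tokens,
    each overlapping the previous window by `overlap` tokens."""
    step = size - overlap  # 500 new tokens per window
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window reached the end of the document
    return chunks

# Stand-in tokenization: a real pipeline would use the embedding
# model's own tokenizer, not a synthetic token list.
doc = ["tok"] * 1700
windows = chunk_tokens(doc)
print(len(windows))  # 4 windows, starting at tokens 0, 500, 1000, 1500
```

The overlap means every sentence near a chunk boundary appears in two windows, which matters later when a withdrawn document must be purged: every chunk carrying its source_document_id has to go, not just one.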

Task 1 — Scope the contract (10 minutes)

Before writing the contract, the practitioner must decide: is there one contract for the whole corpus, or one contract per class? Write a one-paragraph scoping decision that identifies:

  • Which approach you are choosing (one contract or three)
  • Why
  • What implications the choice has for schema, access policy, and change cadence

Defensible answers exist for both approaches. A common answer: three contracts because the three classes have materially different governance regimes, and forcing them into a single contract papers over real differences.

Task 2 — Author the data contract (40 minutes)

Using the nine-section data contract template (from Article 3), draft the contract for Class B (regulatory-guidance summaries). Include:

  1. Identity. Name, version, owner (name the role), consumer list (RAG system consumers), revision date.
  2. Schema. Each document chunk’s metadata fields: chunk_id, source_document_id, source_document_version, source_document_publication_date, regulatory_topic, regulatory_jurisdiction, regulatory_instrument_cited (can be multi-valued), owner_role, effective_from, expires_on, confidentiality_class. For each field: type, nullability, allowed values (where applicable).
  3. Semantics. Definitions for regulatory_topic, regulatory_jurisdiction, regulatory_instrument_cited, and effective_from / expires_on — what each means, with examples. Semantics matter for retrieval quality.
  4. Quality expectations. Per-document: all metadata fields populated; title and body non-empty; body length between 500 and 10,000 tokens. Per-corpus: distribution of topics and jurisdictions against a target profile; freshness SLA (documents updated within 30 days of underlying regulatory change).
  5. Freshness SLA. Per-document updates within 14 business days of the triggering regulatory-change notification; index refresh within 24 hours of any per-document update.
  6. Access scope. Read access to the RAG system's retrieval service for all authenticated firm employees. No use of the corpus for model training without separate approval. External distribution prohibited.
  7. Change policy. Schema changes — 2-week consumer notice, consumer impact review, migration path. Semantic changes — 1-week notice, consumer acknowledgement. Document-level changes — continuous, logged.
  8. Lineage commitment. Each document traces to: the source regulatory instrument (citation), the analyst who authored the summary (person-level), the committee approval record.
  9. Owner and escalation. Named primary owner role (e.g., Regulatory Advisory Knowledge Lead), named secondary, on-call contact during business hours, escalation path for quality incidents.
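The schema and quality expectations in sections 2 and 4 only have teeth if a consumer can verify them mechanically. A minimal sketch of such a check follows; the field names come from the schema above, but the chunk-dict shape and the allowed-value set are illustrative assumptions, not the contract itself.

```python
from datetime import date

# Required metadata fields from the contract's schema section.
REQUIRED_FIELDS = [
    "chunk_id", "source_document_id", "source_document_version",
    "source_document_publication_date", "regulatory_topic",
    "regulatory_jurisdiction", "regulatory_instrument_cited",
    "owner_role", "effective_from", "expires_on", "confidentiality_class",
]

# Illustrative allowed values; the real contract would enumerate these.
ALLOWED_CONFIDENTIALITY = {"internal", "internal-restricted"}

def check_chunk(chunk: dict) -> list[str]:
    """Return a list of contract violations for one chunk's metadata."""
    violations = []
    for field in REQUIRED_FIELDS:
        if chunk.get(field) in (None, ""):
            violations.append(f"missing field: {field}")
    cc = chunk.get("confidentiality_class")
    if cc is not None and cc not in ALLOWED_CONFIDENTIALITY:
        violations.append(f"confidentiality_class not allowed: {cc}")
    ef, ex = chunk.get("effective_from"), chunk.get("expires_on")
    if isinstance(ef, date) and isinstance(ex, date) and ex <= ef:
        violations.append("expires_on must be after effective_from")
    return violations
```

A contract whose schema section cannot be translated into a check like this is probably under-specified; that translation test is worth applying to your own draft before peer review.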

Task 3 — Author the datasheet (30 minutes)

Using the datasheet template from Article 4, draft a datasheet for Class B. The datasheet covers:

  1. Motivation. Why does this dataset exist? Who uses it and for what? What business outcome does it support?
  2. Composition. How many documents? Over what time span? What regulatory topics and jurisdictions?
  3. Collection process. Who writes the summaries? What sources are consulted? What review happens before publication?
  4. Preprocessing. What happens to a document between authoring and availability to the RAG system? (Publication, chunking, embedding, indexing.) Is the raw document retained?
  5. Uses. What AI tasks has this corpus been used for (the RAG consultant assistant is one; are there others)? What tasks is it unsuitable for?
  6. Distribution. Is the corpus distributed externally? No — internal only. Are excerpts distributed to clients in any downstream deliverable?
  7. Maintenance. Who owns it? How often is it refreshed? When would it be retired?

The datasheet should be three pages at most. Concise beats comprehensive. Every sentence should earn its place.

Task 4 — Peer review (10 minutes)

Pair with another learner. Review your partner’s contract and datasheet against these questions:

  • Is the schema testable? Could a consumer write a check that verifies a document complies with the contract?
  • Is the semantics section unambiguous? Could two reviewers produce the same interpretation?
  • Is the access scope enforceable, or only aspirational?
  • Does the datasheet’s Uses section name the ways this corpus is unsuitable, not only how it is used?
  • If a regulatory-change notification arrives tomorrow, does the contract tell you what to do and by when?

Exchange two pieces of constructive feedback. Revise your drafts.
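The last review question is itself testable: given the contract's freshness SLA (14 business days per document, 24-hour index refresh), a consumer can compute the latest acceptable dates from a notification date. A sketch, assuming business days exclude only weekends; a real implementation would also need the firm's holiday calendar.

```python
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Advance `days` business days from `start`, skipping weekends."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday..Friday
            remaining -= 1
    return current

# A regulatory-change notification on Friday 2024-03-01:
notified = date(2024, 3, 1)
doc_deadline = add_business_days(notified, 14)     # per-document SLA
index_deadline = doc_deadline + timedelta(days=1)  # 24h index refresh
print(doc_deadline, index_deadline)  # 2024-03-21 and 2024-03-22
```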

Expected artifacts

  • A completed data contract draft for Class B (approximately 1,000 words / 2-3 pages)
  • A completed datasheet draft for Class B (approximately 700 words / 2-3 pages)
  • A scoping-decision paragraph
  • Peer review notes

Reflection questions

Answer in writing. Responses feed into the Article 11 scorecard discussion.

  1. Class C (client engagement templates) has the weakest governance of the three. Sketch the controls you would require before Class C could enter the RAG corpus. What are the minimum controls you would accept?
  2. The embedding model version (v2.1, March 2024) is pinned in the retrieval architecture note. What is the data contract’s obligation when the vendor releases v2.2? What change-policy clause covers it?
  3. The corpus is refreshed nightly. If a source document is withdrawn (the regulatory committee decides a summary is wrong), what is the propagation time until the RAG system no longer retrieves it? Is that acceptable for the risk tier? What would you tighten?
  4. The access scope prohibits external distribution and training-use. What technical and procedural controls would make each of those prohibitions enforceable rather than aspirational?
  5. The datasheet’s Uses section says the corpus is unsuitable for some purposes. Give three concrete unsuitability statements and the reasoning for each.

Grading rubric

Graded pass / needs revision, against:

  • Contract completeness — all nine sections populated with testable content. (weight: 30%)
  • Datasheet clarity — each of the seven Gebru-style sections populated with informative, concise content. (weight: 20%)
  • Scope-decision defensibility — the one-contract versus three-contract decision is justified. (weight: 10%)
  • Semantic precision — the semantics section distinguishes meanings a naive reader would conflate. (weight: 20%)
  • Access-scope enforceability — the access scope references technical and procedural enforcement, not policy language alone. (weight: 20%)

A pass requires a weighted score of at least 80%. Below that threshold, the lab returns with revision notes.

Tooling neutrality

The deliverables are written artifacts. The learner may use any writing tool (Word, Google Docs, Markdown, Confluence templates). The firm’s existing contract template or catalog schema can be substituted for the template used here, provided the nine sections and seven datasheet sections are present.

Connection to subsequent material

The contract and datasheet produced here are two of the inputs that the Article 11 readiness scorecard would consume for this use case. The case study that follows this lab (on the Dutch SyRI system and the Rotterdam welfare cases) will invite the learner to imagine what the contract and datasheet would have looked like for those programs, and why the programs should not have proceeded had the documents been honest.