AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 13 of 14
Every experiment run against a regulated AI system produces evidence. The evidence lives in the experiment tracking system, in the model registry, in the CI pipeline logs, and in the promotion history. Whether it is findable, legible, and complete when the regulator or auditor asks depends on how the practitioner designs the documentation pipeline. A well-designed pipeline produces regulatory evidence as a by-product of doing the work. A poorly designed pipeline produces panic and archaeology. This article teaches the mapping between the experiment artifacts that have already appeared across this credential and the three regulatory anchors most practitioners will encounter: EU AI Act Annex IV, ISO/IEC 42001 Clause 9.1, and NIST AI RMF MEASURE.
EU AI Act Annex IV
Regulation (EU) 2024/1689 (the EU AI Act) obligates providers of high-risk AI systems to compile technical documentation demonstrating compliance with the regulation’s requirements. Article 11 makes the obligation explicit; Annex IV specifies the minimum contents1, enumerating nine headings of information the documentation must cover.
The most relevant headings for experiment evidence are:
- General description of the AI system (§1) — intended purpose, provider information, the hardware on which the system is intended to run, interactions with other systems. Part of the product-definition artifact; not primarily an experiment output.
- Detailed description of the elements of the AI system and of the process for its development (§2) — methods and steps performed for the development, design specifications, data requirements, pre-processing, the main classification choices, the main elements of the system architecture, and validation. This is where the bulk of the experiment evidence lives: the offline and online evaluation protocols, the hyperparameter-search design, the training-data lineage, the performance metrics.
- Detailed information about the monitoring, functioning, and control of the AI system (§3) — metrics used, reference metrics for measuring performance, measures for ensuring non-discriminatory outcomes, and the validation procedures. Maps to the evaluation harness (Articles 3, 4, 10), the CI pipeline (Article 8), and the fairness-slice evaluation (Article 3).
- Detailed description of the risk management system in accordance with Article 9 (§4) — the risk assessment process applied to the system. Maps to red-team findings (Article 11) and to the mitigation record.
- Description of any change made to the system through its life cycle (§5) — every promotion, every retraining, every prompt change. Maps to the promotion history from the model registry (Article 9) and to the experiment tracking log (Article 6).
- A list of the harmonised standards applied (§6) — references to ISO/IEC 42001 and related standards where applied.
- Copy of the EU declaration of conformity (§7) — the formal declaration.
- Detailed description of the system in place to evaluate AI system performance in the post-market phase (§8) — post-market monitoring. Maps to online monitoring and ongoing evaluation (Article 10’s cadence).
Not every experiment contributes to every heading, but the headings determine what records must exist and how long they must be retained. Article 18 of the Regulation sets retention at 10 years for the technical documentation of a high-risk system after the system is placed on the market or put into service.
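The artifact-to-heading mapping sketched above can itself be kept as data, so the documentation pipeline can report which Annex IV headings a given set of artifacts covers. A minimal sketch follows; the artifact names are an assumed vocabulary, not a standard one, and should be adjusted to match the terminology of the tracking system in use.

```python
# Hypothetical mapping from experiment-artifact types to the Annex IV
# headings they most often feed. Illustrative only — real systems will
# have their own artifact taxonomy.
ANNEX_IV_MAP = {
    "evaluation_protocol":   ["§2", "§3"],
    "training_data_lineage": ["§2"],
    "ci_pipeline_log":       ["§3"],
    "red_team_report":       ["§4"],
    "promotion_history":     ["§5"],
    "online_monitoring":     ["§8"],
}


def headings_for(artifacts):
    """Return the sorted set of Annex IV headings a list of artifacts covers."""
    return sorted({h for a in artifacts for h in ANNEX_IV_MAP.get(a, [])})
```

Kept as data, the mapping doubles as a coverage check: any heading that no artifact feeds is a documentation gap to close before the auditor asks.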
ISO/IEC 42001:2023 Clause 9.1
ISO/IEC 42001:2023, “Information technology — Artificial intelligence — Management system”, is the first international management-system standard for AI. Clause 9.1 — Monitoring, measurement, analysis, and evaluation — requires the organization to determine what needs to be monitored, the methods used, when monitoring occurs, and when the results are analyzed and evaluated2. The clause also requires documented information to be retained as evidence.
The mapping from experiment artifacts to Clause 9.1 is direct:
- What needs to be monitored. The primary and guardrail metrics of each deployed feature (Article 2), updated periodically, with the rationale for their selection.
- Methods used. The online evaluation pattern (Article 4) and the LLM evaluation harness (Article 10), documented with their cadence.
- When monitoring occurs. The cadence schedule attached to each harness mode.
- When results are analyzed. The decision-rule artifacts attached to each experiment, plus the incident-review cadence (not covered in this credential but present in related programs).
- Documented information retained as evidence. The experiment tracking records (Article 6) and the promotion-history records (Article 9).
A mature practitioner can produce the Clause 9.1 evidence for a given system in minutes by querying the tracking and registry systems. A less-mature practitioner spends weeks assembling it. The difference is not in the evidence; it is in how the evidence was organized at the time it was produced.
NIST AI RMF MEASURE
The NIST AI Risk Management Framework 1.0 MEASURE function enumerates subcategories that describe the evidence an organization must produce3. The most experiment-relevant subcategories are:
- MEASURE 1.1. Appropriate methods and metrics are identified and applied. Maps to hypothesis and metric design (Article 2) and to evaluation protocols (Articles 3 and 4).
- MEASURE 2.5. The AI system to be deployed is evaluated under both prospective and operational conditions. Maps to the four-mode experiment vocabulary (Article 1) and to the shadow, canary, and online evaluations (Article 4).
- MEASURE 2.6. The AI system is evaluated for the validity, reliability, and safety of its operation. Maps to reproducibility and replicability (Article 6) and to the CI pipeline (Article 8).
- MEASURE 2.7. AI system security and resilience are evaluated and documented. Maps to red-team experimentation (Article 11) and to the safety sweep in LLM evaluation (Article 10).
- MEASURE 2.8. Risks associated with transparency and accountability are examined and documented. Maps to the experiment brief and report (Article 14).
- MEASURE 2.11. Fairness and bias are examined and documented. Maps to slice-based evaluation (Article 3) and to the fairness portion of the LLM evaluation harness (Article 10).
The MEASURE function’s framing is that evidence is generated by activities, not by checklists. The practitioner’s work in Articles 1 through 12 of this credential is the activity set. The documentation pipeline in this article is what converts activity into evidence.
The documentation pipeline
A documentation pipeline is a set of automated and semi-automated jobs that transform experiment artifacts into regulator-ready evidence. The pipeline has three stages.
Extraction. Query the experiment tracking system, model registry, and CI logs to pull the records for a given system over a given window. Most tracking systems expose APIs; a documentation extractor is a few hundred lines of code.
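The extraction stage reduces to filtering tracking records by system and evidence window. A minimal sketch, with an in-memory record list standing in for the tracking system's API (which in practice would be a paginated REST or SDK call):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TrackingRecord:
    """Illustrative shape of one experiment-tracking record."""
    system_id: str
    run_id: str
    logged_at: datetime
    payload: dict


def extract(records, system_id, start, end):
    """Pull every tracking record for one system inside an evidence window.

    `records` is an in-memory list here so the filtering logic is visible;
    a real extractor would page through the tracking system's API instead.
    """
    return [
        r for r in records
        if r.system_id == system_id and start <= r.logged_at < end
    ]
```

The half-open window (`start` inclusive, `end` exclusive) lets consecutive extraction runs tile a timeline without duplicating records at the boundary.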
Transformation. Format the records into the structure the regulatory destination requires. Annex IV has a specific structure; Clause 9.1 has a different structure; internal model-risk committees may have a third. A transformation layer maps tracking records to output structure.
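A transformation layer can be as simple as grouping extracted records under their destination headings. The sketch below assumes a convention in which each record carries an `annex_iv_heading` tag (a hypothetical field name, not a standard); untagged records land in a review bucket rather than being silently dropped.

```python
def to_annex_iv(records):
    """Group extracted tracking records under the Annex IV headings they feed.

    Assumes each record dict may carry an "annex_iv_heading" tag set at
    extraction time. Records without a tag go to "unmapped" so a human
    can route them during review instead of losing them.
    """
    doc = {}
    for rec in records:
        heading = rec.get("annex_iv_heading", "unmapped")
        doc.setdefault(heading, []).append(rec)
    return doc
```

The same grouping function, pointed at a different tag, serves the Clause 9.1 structure or an internal model-risk committee's template; only the mapping changes, not the machinery.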
Retention. Write the output to a records system with appropriate access controls and retention policy. The EU AI Act’s 10-year retention is a non-trivial commitment; the records system must survive tooling changes.
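At minimum, the retention stage should stamp each stored record with a content digest and a retain-until date, so a later audit can confirm the stored bytes are the bytes that were produced. A minimal sketch; writing to actual WORM storage is backend-specific and out of scope here.

```python
import hashlib
import json
from datetime import date, timedelta


def retain(record: dict, retention_years: int = 10) -> dict:
    """Wrap a record in a retention envelope with a content digest.

    The SHA-256 digest lets an auditor verify integrity later; the
    retain-until date operationalizes the EU AI Act's 10-year default.
    A sketch of the envelope, not a storage backend.
    """
    body = json.dumps(record, sort_keys=True).encode()
    return {
        "sha256": hashlib.sha256(body).hexdigest(),
        "retain_until": (
            date.today() + timedelta(days=365 * retention_years)
        ).isoformat(),
        "record": record,
    }
```

Serializing with `sort_keys=True` makes the digest deterministic: the same record always hashes to the same value regardless of dict insertion order.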
Open-source and vendor tooling for each stage exists. OpenLineage and Marquez for lineage extraction4; custom transformation layers (typically Python or similar) for formatting; document management systems (SharePoint, Confluence, custom systems) for retention. The commercial AI-governance tool category — vendors including Credo AI, Holistic AI, Fairly AI, and several others — sells the pipeline as a service; the in-house alternative builds it on open-source substrate.
[DIAGRAM: BridgeDiagram — aitm-eci-article-13-evidence-bridge — Left-side experiment artifacts (tracking records, registry entries, CI logs, promotion history, red-team reports, human-review rubrics), right-side regulatory outputs (Annex IV sections, ISO 42001 Clause 9.1 records, NIST AI RMF MEASURE evidence), and bridge beams mapping each artifact type to each output.]
Retention and auditability
Two discipline elements distinguish evidence from paperwork.
Immutability of evidence. Experiment tracking records must not be retroactively editable. Most tracking systems support this via write-once metadata and, where regulated industries require it, via WORM (write-once read-many) storage. Evidence a regulator can audit is evidence a regulator can trust.
Chain of custody. The evidence record includes who produced it, when, and against what version of the system. The experiment tracking log, the registry’s version metadata, and the promotion history together form the chain. Without the chain, a claim like “the model scored 0.87 on the regression set” is unverifiable.
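One common way to make a custody record tamper-evident is a hash chain: each entry records who, when relative to what system version, and what happened, plus the hash of the previous entry, so editing any earlier entry breaks every subsequent link. A sketch of the idea, not a hardened ledger:

```python
import hashlib
import json


def append_event(chain, actor, system_version, event):
    """Append a custody entry whose hash covers the previous entry.

    Each entry links to its predecessor via `prev_hash`; retroactively
    editing any entry invalidates every hash after it.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    entry = {
        "actor": actor,
        "system_version": system_version,
        "event": event,
        "prev_hash": prev_hash,
    }
    body = json.dumps(entry, sort_keys=True).encode()
    entry["hash"] = hashlib.sha256(body).hexdigest()
    chain.append(entry)
    return chain


def verify(chain):
    """Recompute every hash and confirm the links are intact."""
    prev = "0" * 64
    for entry in chain:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if body["prev_hash"] != prev:
            return False
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        if digest != entry["hash"]:
            return False
        prev = entry["hash"]
    return True
```

With this structure, the claim "the model scored 0.87 on the regression set" becomes verifiable: the score sits in an entry whose hash is anchored to a specific actor, version, and position in the history.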
Organizations subject to FDA, MDR (medical devices), or MiCA (crypto-assets) regulation have parallel chain-of-custody requirements from which AI-specific governance can inherit patterns. The practitioner does not need to invent the pattern; it is well established across regulated industries and can be adapted.
Two real references in the regulatory vocabulary
EU AI Act text. Regulation (EU) 2024/1689 is the primary source. Practitioners do not need to read the full 180 pages but should read Article 9 (risk management), Article 11 (technical documentation), Article 12 (record-keeping), Article 15 (accuracy, robustness, cybersecurity), and Annex IV (technical documentation structure)1. The regulation is publicly available on EUR-Lex and on the EU AI Office’s site.
AESIA — Spanish AI Supervisory Agency pilots. The Spanish AI Supervisory Agency (AESIA), established as a national competent authority under the EU AI Act, has run conformity-assessment pilots with private-sector partners5. The pilots are a reference for how regulatory assessment works in practice and are the first move toward operational implementation of the Act by a national authority. A practitioner building documentation against Annex IV benefits from tracking the AESIA pilots for concrete feedback on what regulators actually look at.
Summary
Every experiment on a regulated AI system produces regulatory evidence. The mapping: experiment tracking records contribute to Annex IV §§2–3, ISO 42001 Clause 9.1 monitoring records, and NIST AI RMF MEASURE subcategories. A documentation pipeline — extraction, transformation, retention — turns experiment artifacts into regulator-ready evidence by construction. Immutability and chain of custody are the discipline elements that distinguish evidence from paperwork. EU AI Act retention is 10 years for high-risk systems. AESIA pilots are a reference for operational regulatory assessment. The next article develops the experiment brief and experiment report, which are the two practitioner artifacts that flow the evidence into the documentation pipeline.
Further reading in the Core Stream: Deployment Readiness Checklist.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (EU AI Act), Articles 9, 11, 12, 15, 18, and Annex IV. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.
2. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 9.1. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.
3. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.
4. OpenLineage and Marquez documentation. https://openlineage.io/ ; https://marquezproject.github.io/marquez/ — accessed 2026-04-19.
5. Agencia Española de Supervisión de la Inteligencia Artificial (AESIA). https://www.aesia.gob.es/ — accessed 2026-04-19.