AITM M1.3-Art09 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Continuous Delivery and Governed Promotion

Continuous Delivery and Governed Promotion — Maturity Assessment & Diagnostics — Applied depth — COMPEL Body of Knowledge.


AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 9 of 14


Continuous delivery for ML takes the output of the CI pipeline (a newly trained model that passed its tests) and carries it through a promotion lifecycle to production, with named gates at every boundary. The point is not speed. The point is governance: every promotion is recorded, every approver is named, every evidentiary artifact is attached, and every rollback path is pre-built. A model reaches production through the lifecycle or it does not reach production at all. Unpromoted code that happens to run in production is, by definition, out of governance.

Lifecycle states

Every mature model-governance program uses a lifecycle with at least five states. The state machine is stable across organizations; what varies is who approves each transition.

  • Development (dev). The model has been trained, passes CI, is registered in the model registry. No production traffic.
  • Staging. The model has passed extended evaluation (larger test sets, fairness analysis, cost analysis, safety sweep). Available for shadow deployment in production.
  • Canary. The model is serving a small slice of production traffic (typically 1%). The canary gate monitors guardrails and automatically rolls back on breach.
  • Production. The model is serving full production traffic. Subject to ongoing monitoring and to scheduled re-evaluation.
  • Archived. The model has been retired. Its artifacts and metadata are retained for audit and for potential re-promotion.

Additional states appear in specific programs: retired (superseded by a successor model, not yet deleted), deprecated (marked for replacement but still serving), restricted (serving only a named subset of users, e.g., for a phased rollout). The five-state minimum is the floor, not the ceiling.
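The five-state machine can be sketched in a few lines of Python. This is a minimal illustration, not any particular registry's API; the stage names follow the list above, and every forward transition is explicit so an illegal promotion fails loudly.

```python
from enum import Enum

class Stage(Enum):
    """The five-state minimum described above."""
    DEV = "dev"
    STAGING = "staging"
    CANARY = "canary"
    PRODUCTION = "production"
    ARCHIVED = "archived"

# Legal forward transitions. Rollback paths run in the opposite direction
# and are handled separately (see the rollback-discipline section).
PROMOTIONS = {
    Stage.DEV: Stage.STAGING,
    Stage.STAGING: Stage.CANARY,
    Stage.CANARY: Stage.PRODUCTION,
    Stage.PRODUCTION: Stage.ARCHIVED,
}

def promote(current: Stage) -> Stage:
    """Return the next stage, refusing any transition not in the state machine."""
    if current not in PROMOTIONS:
        raise ValueError(f"{current.value} has no forward transition")
    return PROMOTIONS[current]
```

Encoding the transitions as data rather than scattered `if` statements is what makes the lifecycle auditable: the whole state machine is one table that review and tooling can read.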

[DIAGRAM: StageGateFlow — aitm-eci-article-9-promotion-lifecycle — Left-to-right flow: dev -> staging -> canary -> production -> archived, with approver roles above each transition (automated tests, ML engineer, platform team, product owner, audit) and rollback arrows for each step.]

Gates and approvers

Each state transition has a gate. A gate is a boolean condition — a combination of automated tests and human approvals — that must evaluate true for the transition to proceed.

dev -> staging. Typically automated: CI tests pass, ML Test Score above threshold, evaluation on the regression set meets the baseline, reproducibility check passes. Human involvement is usually a code-review-style approval.

staging -> canary. Typically automated plus one human. The automated portion: safety-sweep results meet criteria, fairness analysis meets criteria, cost projection meets budget, operational readiness checks pass. The human portion: an ML engineer or tech lead reviews the promotion proposal, confirms the evaluation artifacts, and approves.

canary -> production. Typically automated under guardrail stability plus one or more humans. The automated portion: all canary-phase guardrails stable across the ramp; no rollback triggered. The human portion: a product owner or business owner confirms the business outcome is as predicted; for regulated systems, a model-risk manager or compliance officer confirms the documentation is complete.

production -> archived. Typically human, initiated by the product team with platform-team concurrence.

The approver list is configurable. Organizations with model risk committees include them for high-risk models. Organizations with mandatory security reviews include security. The design principle is that the approver list is encoded in the promotion workflow itself, not in a separate policy document; the workflow cannot be bypassed, and the approvers cannot be silently removed.
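One way to encode a gate so the approver list lives in the workflow itself, as the paragraph above requires, is sketched below. The `Gate` and `Approval` types and the role names are hypothetical, not a real workflow engine's API; the point is that the gate evaluates true only when every automated check passes and every required role has a named approval on record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Approval:
    role: str          # e.g. "ml_engineer", "product_owner" (illustrative names)
    approver: str      # a named human; governance wants a name, not a team alias
    at: datetime

@dataclass
class Gate:
    """A gate is a boolean: ALL automated checks AND all required approvals."""
    automated_checks: dict              # check name -> bool, filled by the pipeline
    required_roles: list                # encoded here, not in a policy document
    approvals: list = field(default_factory=list)

    def approve(self, role: str, approver: str) -> None:
        """Record a named approval with a timestamp for the audit trail."""
        self.approvals.append(Approval(role, approver, datetime.now(timezone.utc)))

    def is_open(self) -> bool:
        """True only if every check passed and every required role signed off."""
        checks_pass = all(self.automated_checks.values())
        approved_roles = {a.role for a in self.approvals}
        return checks_pass and set(self.required_roles) <= approved_roles
```

Because the required roles are constructor arguments of the gate object, removing an approver means changing the workflow definition itself, which is exactly the visibility the design principle calls for.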

Automated approvals where safe, human approvals where required

The temptation in CD is to automate everything. The discipline is to automate where the criteria are mechanical and reserve human judgment for where criteria are not.

Mechanical criteria, suitable for automation, include:

  • Numeric threshold checks against pre-specified baselines.
  • Regression-test-pass booleans.
  • Data-contract validations.
  • Cost-projection checks against budget.
  • Container-image signing and provenance checks.
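The mechanical criteria above reduce to named boolean checks. A minimal sketch, with field names that are assumptions rather than a standard schema; returning named booleans instead of one aggregate pass/fail means the promotion record shows which check failed, not merely that one did.

```python
def mechanical_gate(candidate: dict, baseline: dict, budget: float) -> dict:
    """Evaluate mechanical promotion criteria against pre-specified baselines.

    `candidate` and `baseline` are illustrative metric dictionaries produced
    by the CI pipeline; keys are hypothetical, not a standard contract.
    """
    return {
        "metric_vs_baseline": candidate["accuracy"] >= baseline["accuracy"],
        "regression_tests_pass": candidate["regression_failures"] == 0,
        "data_contract_valid": candidate["schema_violations"] == 0,
        "cost_within_budget": candidate["projected_monthly_cost"] <= budget,
    }
```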

Non-mechanical criteria, suitable for human approval, include:

  • Business-outcome confirmation (did the thing we expected to happen actually happen).
  • Fairness and bias review for high-risk systems.
  • Cross-domain-impact assessment (this change in product A affects service B; does that team concur).
  • Regulatory-sufficiency review (is the Annex IV documentation complete; is the ISO 42001 audit trail adequate).

The separation is not a technical sophistication argument. It is a governance argument: the organization is saying that certain decisions require named human judgment, and the workflow records who exercised that judgment.

The model registry and version governance

The model registry is the system of record for models in promotion. Every model has a versioned entry with metadata including:

  • Lifecycle state and history of state transitions.
  • Training-data version and training-pipeline version.
  • Evaluation artifacts for each promotion stage.
  • Approver list and approval timestamps.
  • Governance tags (risk tier, regulated-system flags, business domain).
  • Serving metadata (endpoint, region, capacity).

MLflow, SageMaker Model Registry, Azure ML Model Registry, Vertex AI Model Registry, Databricks Model Registry, W&B Model Registry, and open-source tools including Kubeflow Metadata each implement the pattern with variations [1]. Cross-registry portability is imperfect; a team that invests in one registry is making a platform decision. The common schema — version, state, metadata, approvals — is portable.

A practitioner’s first responsibility is to treat the registry as the only path to production. A model that serves production traffic without a registry entry is out of governance and, depending on the regulatory context, potentially out of compliance.
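The common schema from the paragraph above can be sketched as a plain data structure. The field names are illustrative and do not correspond to any specific registry's API; they mirror the metadata bullet list.

```python
from dataclasses import dataclass, field

@dataclass
class RegistryEntry:
    """Minimal sketch of the portable schema: version, state, metadata, approvals."""
    model_name: str
    version: int
    stage: str                      # dev / staging / canary / production / archived
    training_data_version: str      # lineage back to the data
    pipeline_version: str           # lineage back to the training pipeline
    risk_tier: str                  # governance tag (e.g. "high", "limited")
    transitions: list = field(default_factory=list)   # (from, to, approver, timestamp)
    eval_artifacts: dict = field(default_factory=dict)  # stage -> artifact URIs
```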

Regulatory alignment

Three regulatory anchors shape the CD design.

ISO/IEC 42001:2023, Clause 8.1 — Operational planning and control. The clause requires that the organization plan, implement, and control the processes needed to meet its AI management system requirements, and retain documented information to demonstrate that processes have been carried out as planned [2]. The promotion lifecycle is exactly this planning-and-control. The state transitions, the gate criteria, the approver assignments, and the records kept at each stage are the documented information the auditor expects.

EU AI Act Articles 11 and 12 — Technical documentation and record-keeping. Article 11 requires high-risk AI systems to have technical documentation demonstrating compliance throughout the lifecycle. Article 12 requires that high-risk systems automatically record events relevant to the identification of risks and substantial modifications [3]. A CD pipeline that records every promotion event, every approver, every evaluation artifact, and every rollback event produces the Article 12 log by construction.

NIST AI RMF 1.0, MEASURE and MANAGE functions. MEASURE subcategories 2.5 and 2.7 require evaluation under prospective and operational conditions, and evaluation of mechanisms for continuous improvement [4]. The CD pipeline’s evaluation-at-each-gate pattern is the operationalization.

A practitioner designing a CD pipeline for a regulated system begins with the regulation as input, not as afterthought. The gate criteria map directly to the regulatory requirements; the approver list maps directly to the accountable roles; the records kept map directly to the documentation obligations.
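One way to keep the regulation-as-input mapping explicit is a small table the pipeline and the auditor-facing report can both walk. The criterion names below are hypothetical; the regulatory anchors are the three named above.

```python
# Hypothetical mapping from promotion-pipeline records to regulatory anchors.
# Keeping this as data means the gate criteria and the obligations they
# discharge are reviewable in one place.
REGULATORY_MAP = {
    "promotion_event_log":     ["EU AI Act Art. 12", "ISO/IEC 42001 Clause 8.1"],
    "technical_documentation": ["EU AI Act Art. 11"],
    "evaluation_at_each_gate": ["NIST AI RMF MEASURE 2.5",
                                "NIST AI RMF MEASURE 2.7"],
    "named_approver_record":   ["ISO/IEC 42001 Clause 8.1"],
}

def obligations_for(criteria: list) -> set:
    """Collect the obligations a given set of gate criteria discharges."""
    out = set()
    for c in criteria:
        out.update(REGULATORY_MAP.get(c, []))
    return out
```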

Rollback discipline

Every promotion has a rollback path. The rollback is tested before the promotion, not after. A CD pipeline that ships model v2 without being able to demonstrate that v1 can be restored in under 15 minutes is not production-ready.

Rollback patterns differ by promotion stage.

  • Staging rollback. Revert the staging pointer to the previous version. Usually seconds.
  • Canary rollback. Return canary traffic to the production model. Triggered automatically on guardrail breach; fires in under 30 seconds in mature systems.
  • Production rollback. Revert production traffic to the previous version. Must be tested periodically; the platform team owns this test.
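The canary auto-rollback can be sketched as a guardrail check that fires the rollback without waiting for a human. The `rollback` and `record` callables stand in for platform hooks and are assumptions, as are the guardrail names.

```python
def guardrails_ok(metrics: dict, guardrails: dict) -> bool:
    """Check live canary metrics against pre-specified (lower, upper) bounds."""
    for name, (lo, hi) in guardrails.items():
        if not (lo <= metrics[name] <= hi):
            return False
    return True

def canary_step(metrics: dict, guardrails: dict, rollback, record) -> bool:
    """One monitoring tick: auto-rollback on breach, and write the audit record.

    Returns True if the canary may continue serving, False if it was rolled back.
    """
    if not guardrails_ok(metrics, guardrails):
        rollback()                          # return canary traffic to production
        record("auto_rollback", metrics)    # the rollback record is the audit trail
        return False
    return True
```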

Article 14 of this credential treats the experiment report in depth, including the rollback record: what triggered the rollback, when, by whom or by what automated trigger, what the guardrail values were at the time, what the remediation plan is. The record is the audit trail.

Two real programs in the CD vocabulary

Uber Michelangelo. Uber’s engineering blog series on Michelangelo describes a large enterprise ML platform that handles thousands of production models across Uber’s product surface. The blog series documents the promotion lifecycle, the gate criteria, and the governance records, and has been a reference for many similar platforms built since [5]. The lesson is that at Uber scale, promotion discipline is what prevents thousands of production models from becoming an operational hazard.

Airbnb Bighead and Zipline. Airbnb’s engineering blog series on Bighead (ML platform) and Zipline (feature store) describes the integrated story: feature consistency, training pipelines, registration, promotion, and monitoring, across a product surface with both classical and generative AI features [6]. The lesson is that CD is not a bolt-on; it is an integrated design with the feature store, the tracking tool, the registry, and the serving infrastructure all cooperating.

[DIAGRAM: OrganizationalMappingBridge — aitm-eci-article-9-promotion-raci — A RACI matrix across roles (ML engineer, tech lead, product owner, security, compliance, model-risk manager) and promotion stages (dev-to-staging, staging-to-canary, canary-to-production, production-to-archived). Each cell marks R, A, C, or I.]

Summary

Continuous delivery for ML is a state machine with at least five states — dev, staging, canary, production, archived — and gates between them. Each gate combines automated tests with human approvals. The model registry is the system of record. The regulatory anchors (ISO 42001 Clause 8.1, EU AI Act Articles 11 and 12, NIST AI RMF MEASURE and MANAGE) shape the gate criteria and the records kept. Rollback is tested before promotion, not after. Uber Michelangelo and Airbnb Bighead are two reference programs. The next article applies this infrastructure to LLM evaluation, where the tests at the gates have a different shape.

Further reading in the Core Stream: Stage-Gate Decision Framework and Deployment Readiness Checklist.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. MLflow Model Registry; SageMaker, Azure ML, Vertex AI, Databricks, and Weights & Biases model registry documentation. https://mlflow.org/docs/latest/model-registry.html — accessed 2026-04-19.

  2. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 8.1. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.

  3. Regulation (EU) 2024/1689 (EU AI Act), Articles 11 and 12. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.

  4. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023, MEASURE and MANAGE functions. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.

  5. Uber Engineering Blog — Michelangelo series. https://www.uber.com/blog/michelangelo-machine-learning-platform/ — accessed 2026-04-19.

  6. Airbnb Engineering Blog — Bighead and Zipline series. https://medium.com/airbnb-engineering/ — accessed 2026-04-19.