AITE M1.1-Art19 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Environment Promotion and Change Management


10 min read Article 19 of 48

This article teaches the architect to design a promotion pipeline for AI change types, specify the gate at each boundary, and write a promotion spec that a platform team can implement on any CI system. It is paired with Article 11 (evaluation architecture) because eval is the primary gate technology, and Article 21 (registries) because the promotion pipeline writes to the prompt, model, and index registries as state transitions.

The six AI change types

A software-only system has one dominant change type: the code deploy. An AI system has six, each with different blast radius and different rollback cost.

  1. Prompt change. System prompt, template text, few-shot examples, tool descriptions. Lowest cost to change; highest risk of silent regression because prompts rarely get the same review scrutiny as code.
  2. Model version change. Upgrading from GPT-4 Turbo to GPT-4o, from Claude 3 Opus to Claude 3.5 Sonnet, from Llama 3 70B to Llama 3.1 70B. An upgrade changes capability in ways that can regress specific patterns even as average quality improves.
  3. Retrieval corpus change. Re-indexing the knowledge base, adding new sources, deprecating obsolete sources, updating chunking or embedding strategy. Can silently shift answer content.
  4. Tool schema change. Adding a tool, removing a tool, changing a tool’s parameter names or return format. Can cascade through agent loops (Article 7).
  5. Guardrail policy change. Updating the input/output filters, PII policies, allowed-topics list, refusal templates. Policy changes usually tighten restrictions; tightening can cause over-refusal and break legitimate workflows.
  6. Infrastructure change. Promoting a new serving config, a new autoscaling policy, a new observability plugin. Can affect latency and cost without affecting quality on average, but p99 shifts may still matter.

The architect’s promotion policy must address all six. A pipeline that only promotes code and treats the model name as a configuration value misses five of the six AI-specific change types.
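The six change types can be made first-class objects in the pipeline so that each change request carries its type and the pipeline selects the matching gate stack. A minimal sketch, assuming a hypothetical gate-name vocabulary drawn from the mapping in this article:

```python
from enum import Enum

class ChangeType(Enum):
    PROMPT = "prompt"
    MODEL_VERSION = "model_version"
    RETRIEVAL_CORPUS = "retrieval_corpus"
    TOOL_SCHEMA = "tool_schema"
    GUARDRAIL_POLICY = "guardrail_policy"
    INFRASTRUCTURE = "infrastructure"

def required_gates(change: ChangeType) -> list[str]:
    """Return the staging gate stack a change type must pass (illustrative mapping)."""
    gates = {
        ChangeType.PROMPT: ["regression_eval", "safety_eval"],
        ChangeType.MODEL_VERSION: ["regression_eval", "capability_delta", "cost_delta"],
        ChangeType.RETRIEVAL_CORPUS: ["retrieval_quality_eval", "citation_eval"],
        ChangeType.GUARDRAIL_POLICY: ["over_refusal_eval", "safety_eval"],
        ChangeType.TOOL_SCHEMA: ["tool_contract_test", "agent_behavior_eval"],
        ChangeType.INFRASTRUCTURE: ["load_test", "chaos_test"],
    }
    return gates[change]
```

A change that is not classifiable into one of the six types is itself a signal that the promotion policy has a gap.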

The four environments

Dev

Dev is the individual engineer’s sandbox. Changes here are cheap, ephemeral, and visible only to the author. Dev exists to let an engineer iterate quickly; it has no production traffic and no eval gates. The architect’s constraint on dev is that it must use the same model families and the same retrieval abstractions as production, so that a change that works in dev is not carrying hidden compatibility assumptions.

Staging

Staging is the first shared environment where the change meets the evaluation harness from Article 11. Staging has three responsibilities: run the full regression eval, run the safety eval, and run representative load tests. A change that fails any of the three cannot be promoted from staging to canary. Staging should use production-like data (or a high-fidelity synthetic) so that eval results transfer; many teams under-invest in staging data quality and are surprised when canary metrics differ from staging metrics.

Staging is where the architect’s ADR (Article 23) is pre-filed and the change artefacts are registered: the prompt registry gets a draft version, the model registry gets a pending approval, the index registry gets a candidate index.
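The staging rule — all three gates pass or the change does not leave staging — can be expressed as a small runner. The check functions below are placeholders standing in for the real eval harness of Article 11; the names are illustrative, not a real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GateResult:
    name: str
    passed: bool

def run_staging_gates(checks: dict[str, Callable[[], bool]]) -> tuple[bool, list[GateResult]]:
    """Run every staging check; promotion to canary requires all to pass."""
    results = [GateResult(name, fn()) for name, fn in checks.items()]
    return all(r.passed for r in results), results

# Hypothetical checks standing in for the real harness.
ok, results = run_staging_gates({
    "regression_eval": lambda: True,
    "safety_eval": lambda: True,
    "load_test": lambda: False,   # load test failed
})
# ok is False: a single failing gate blocks promotion from staging to canary.
```

Note the runner executes all checks rather than short-circuiting, so the staging report shows every failure at once.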

Canary

Canary serves production traffic at a controlled sample rate — typically 1%, 5%, or 10% — while the rest of production traffic still runs the prior version. Canary is the first environment where real users see the change. Canary evaluations include: online eval against a sample stream, shadow comparison to the production version, operational metric monitoring (latency p95/p99, cost per query, error rate), and safety-signal monitoring. The canary window is time-bounded (often 24–72 hours) and volume-bounded (often a minimum of 10K requests).

The Google SRE concept of progressive rollout, adapted to AI, is canary’s architectural ancestor.1 The canary gate is the hardest one to design; canaries that are too short miss slow-onset regressions, and canaries that are too conservative delay shipping indefinitely. The architect’s gate policy should be explicit: required duration, required request volume, max tolerable regression on each KPI, max tolerable cost delta.
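An explicit gate policy of the kind described above — required duration, required volume, per-KPI regression tolerance, cost tolerance — can be encoded directly, making the promote/rollback/continue decision mechanical. A minimal sketch with assumed metric names and thresholds:

```python
from dataclasses import dataclass

@dataclass
class CanaryPolicy:
    min_hours: float
    min_requests: int
    max_kpi_regression: dict[str, float]  # metric -> max tolerable drop (fraction)
    max_cost_delta: float                 # max tolerable cost increase (fraction)

def canary_decision(policy: CanaryPolicy, hours: float, requests: int,
                    kpi_deltas: dict[str, float], cost_delta: float) -> str:
    """Return 'promote', 'rollback', or 'continue' per the explicit gate policy."""
    for metric, delta in kpi_deltas.items():
        if delta < -policy.max_kpi_regression.get(metric, 0.0):
            return "rollback"          # regression beyond tolerance
    if cost_delta > policy.max_cost_delta:
        return "rollback"              # cost blew past the guard
    if hours < policy.min_hours or requests < policy.min_requests:
        return "continue"              # time/volume bounds not yet met
    return "promote"

policy = CanaryPolicy(min_hours=24, min_requests=10_000,
                      max_kpi_regression={"answer_quality": 0.02},
                      max_cost_delta=0.10)
```

With this policy, a canary at 30 hours and 12,000 requests with a 1% quality dip and 5% cost increase promotes; a 5% quality dip triggers rollback regardless of elapsed time.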

Production

Production serves 100% of traffic. The promotion to production is not the end of change management; it is the start of monitoring the change for delayed effects, and the start of planning the next change.

Gates by change type

Different change types need different gate stacks. The table below shows a reference mapping; each team adapts it to their eval harness and their risk tolerance.

| Change type | Staging gate | Canary gate | Production gate |
| --- | --- | --- | --- |
| Prompt change | Regression eval, safety eval | 24h online sample, shadow delta | Continuous regression |
| Model version | Full regression + capability deltas | 72h online, shadow delta, cost delta | Continuous regression + cost guard |
| Retrieval corpus | Retrieval quality eval (recall@k, MRR), citation eval | 48h online, citation-rate monitoring | Corpus freshness monitor |
| Tool schema | Tool-contract test, agent-behavior eval | 48h online, tool-call error rate | Tool-call SLO monitor |
| Guardrail policy | Over-refusal eval, safety eval | 48h online, refusal-rate delta | Refusal-rate continuous monitor |
| Infrastructure | Load test, chaos test | 24h online, latency/cost delta | Classical SLO monitor |

The promotion spec artefact (see templates below and Template 04) is a short document — two to five pages — that lists, per change type, the required gates, the gate thresholds, the owner for the decision, and the rollback procedure. The spec is written once and evolves. Writing it forces the architect to be explicit about trade-offs that are otherwise implicit.
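The promotion spec is a document, but its core content — per change type: gates, thresholds, decision owner, rollback procedure — is structured enough to keep in machine-readable form alongside the prose. A sketch of one plausible shape (field names and values are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class PromotionRule:
    change_type: str
    staging_gates: list[str]
    canary_gates: list[str]
    thresholds: dict[str, float]
    decision_owner: str
    rollback_procedure: str

SPEC = [
    PromotionRule(
        change_type="prompt",
        staging_gates=["regression_eval", "safety_eval"],
        canary_gates=["online_sample_24h", "shadow_delta"],
        thresholds={"max_quality_regression": 0.02},
        decision_owner="prompt-platform-team",
        rollback_procedure="repin the prior prompt version in the prompt registry",
    ),
    # ... one rule per change type, mirroring the table above
]
```

Keeping the spec as data lets the CI system enforce it directly instead of relying on reviewers to remember the document.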

Rollback, not just forward-fix

Every gate must have a rollback path because forward-fix fails. A corpus change that corrupts a retrieval index cannot be patched by adding another index; it must be reversed to the pre-change index. A model upgrade that regresses quality on a critical pattern cannot be patched by prompting harder; it must be reversed to the prior model version.

Rollback for AI systems is harder than for code because the state of the system at time t is distributed across the prompt registry, model registry, index registry, and tool registry. A rollback must change all of them atomically (or with explicit sequencing). The architect’s promotion pipeline should support:

  • Atomic rollback by rolling back the composite release (a manifest that pins versions of all artefacts).
  • Selective rollback of a single artefact while holding others at their current version.
  • Hot rollback that executes in under a minute, as distinct from cold rollback, which tolerates a larger window.
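The composite-release manifest that makes atomic and selective rollback possible can be as simple as an immutable record pinning every artefact version. A minimal sketch with hypothetical version identifiers:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ReleaseManifest:
    """Pins every artefact version for one composite release."""
    prompt_version: str
    model_version: str
    index_version: str
    tool_schema_version: str

current = ReleaseManifest("p-42", "gpt-4o-2024-08-06", "idx-2024-09", "tools-v7")
previous = ReleaseManifest("p-41", "gpt-4o-2024-05-13", "idx-2024-08", "tools-v7")

# Atomic rollback: redeploy the entire prior manifest in one step.
rolled_back = previous

# Selective rollback: revert only the corrupted index, holding the rest at current.
partial = replace(current, index_version=previous.index_version)
```

Because the manifest is frozen, a "rollback" is always the deployment of a complete, previously validated state, never an in-place mutation.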

Replit’s public incident blog on their AI Agent launch shows an example of the rollback discipline in practice.2 GitHub’s published model upgrade cadence for Copilot shows the same pattern at hyperscaler scale.3

Shadow, A/B, and interleaving

Canary is not the only production-side evaluation technique. Shadow traffic duplicates real requests to the new version without returning its responses to users; because users never see the shadow responses, it surfaces regressions at zero user risk that production KPIs alone would miss. A/B testing randomly assigns users to the new or old version and compares aggregate metrics. Interleaving — presenting results from both versions to the same user and measuring which the user prefers — is used in search ranking and is applicable to AI answer quality.

The architect’s choice: shadow for backend correctness and cost comparison; A/B for user-facing quality metrics; interleaving for preference judgments. Each has statistical and UX implications. See Article 11 for the evaluation depth behind the choice.
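The shadow pattern reduces to a wrapper: serve production, run the candidate on a copy of the request, log the delta, and never let the shadow output reach the user. A minimal sketch using a crude text-similarity delta (a real deployment would use the eval harness's comparison metrics instead):

```python
import difflib

def serve_with_shadow(request, prod_fn, shadow_fn, log):
    """Serve the production response; run the candidate in shadow and log the delta.
    The shadow response is never returned to the user."""
    prod_response = prod_fn(request)
    try:
        shadow_response = shadow_fn(request)
        similarity = difflib.SequenceMatcher(None, prod_response, shadow_response).ratio()
        log.append({"request": request, "similarity": similarity})
    except Exception as exc:
        # A shadow failure is logged as a regression signal, never surfaced to the user.
        log.append({"request": request, "shadow_error": str(exc)})
    return prod_response  # only the production output reaches the user
```

The try/except is the point: a crashing candidate must degrade to a log entry, not a user-visible error.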

OpenAI deprecation schedules as a forcing function

Managed-model providers set promotion calendars the architect must respect. OpenAI, Anthropic, and Google publish model deprecation notices; the architect who does not track these will be forced into an un-gated promotion when their current model is retired. The architect builds the deprecation calendar into the platform’s monthly architecture meeting and ensures every deployed model version has a replacement candidate with completed staging eval.

Governance integration: ModelOps, ChangeOps, and the COMPEL stage gates

The promotion pipeline sits between Produce and Evaluate in the COMPEL methodology (Articles 29, 30). Each promotion event produces evidence that feeds the Evaluate gate: the staging eval report, the canary report, the post-promotion monitoring summary. For EU AI Act Article 12 record-keeping on high-risk systems, the promotion pipeline is the primary evidence generator.4

ISO/IEC 42001 Clause 8.1 on operational controls and Clause 9.1 on monitoring align with promotion-pipeline artefacts; an architect asked for ISO 42001 evidence points to the promotion spec and the promotion event log as the primary compliance deliverables.5

Anti-patterns

  • Promoting prompts as configuration without eval gates. Prompts are code; they regress like code. Teams that skip eval on prompt changes discover quality drops days after the change, and struggle to correlate the drop with its cause because the prompt change was never recorded as a deployment event.
  • Canary without rollback rehearsal. A canary whose rollback has never been tested is a canary that will fail to roll back when needed.
  • Long canaries that never finish. A canary that has run for three weeks without a promotion decision is a signal the gate is ill-defined; the architect’s job is to force the decision.
  • Conflating model upgrade with feature change. A model upgrade plus a new feature is two changes; they should be canaried separately so attribution of any regression is clean.

Summary

An AI system changes more often, in more ways, than a classical software system. The architect’s promotion pipeline makes those changes deliberate and reversible: six change types, four environments, gate-specific evaluation, atomic rollback, and an explicit promotion spec. The payoff is smaller and more frequent releases — the industry consensus on software release cadence applies to AI with adjustments, not with exceptions.

Key terms

  • Canary
  • Shadow traffic
  • Prompt registry
  • Promotion spec
  • Rollback

Learning outcomes

After this article the learner can: explain the four-environment promotion pattern and the six AI change types; classify AI changes by promotion risk; evaluate a deployment pipeline for eval coverage; design a promotion spec and rollback plan.


Footnotes

  1. Beyer et al., The Site Reliability Workbook (Google, 2018), Chapter on canary analysis.

  2. Replit Engineering, “Shipping the Agent: Post-Launch Lessons” (2024).

  3. GitHub Engineering, public blog series on Copilot model cadence (2023–2024).

  4. Regulation (EU) 2024/1689 (AI Act), Article 12.

  5. ISO/IEC 42001:2023, Clauses 8.1 and 9.1.