This article walks through what the architect contributes to Evaluate and Learn reviews, the artefacts that depend on architectural input, and the retirement decision that closes the loop.
Evaluate — the architect’s inputs
Evaluate is the ongoing operational review. It runs on a cadence — weekly operational review, monthly business review, quarterly architecture review — and it produces signals that feed the Learn stage. The architect’s Evaluate work is measurement and pattern recognition.
SLO review
The SLO target sheet from Article 20 is the primary Evaluate artefact. Each SLI is checked against its SLO; error-budget burn is reviewed; the action-on-burn rules are enforced. The architect reads the SLO dashboard not as a compliance exercise but as a diagnostic: which SLOs are burning, why, and what architectural change would reduce the burn.
A specific pattern to watch for: SLOs that never burn are probably loose and should be tightened; SLOs that always burn are probably wrong (either the target is too strict or the system cannot meet it) and need a design decision. Google’s SRE book calls this the error budget philosophy and it generalises cleanly to AI SLOs.1
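The burn check behind this pattern can be sketched in a few lines. This is a minimal illustration, not any particular monitoring tool's API; the function name, the single-window shape, and the burn-rate-above-1 alert rule are all assumptions:

```python
# Minimal sketch of an error-budget check for one SLI/SLO pair.
# The function name, window shape, and alert rule are illustrative
# assumptions, not any particular monitoring tool's API.

def error_budget_status(slo_target: float, good: int, total: int,
                        window_fraction_elapsed: float) -> dict:
    """Report how much of the error budget a window has consumed.

    slo_target: e.g. 0.999 means 99.9% of events must be good.
    window_fraction_elapsed: 0..1, how far through the SLO window we are.
    """
    budget = 1.0 - slo_target                # allowed failure fraction
    observed_bad = (total - good) / total    # actual failure fraction
    consumed = observed_bad / budget         # 1.0 means the whole budget is spent
    # A burn rate above 1 means the budget runs out before the window ends.
    burn_rate = consumed / window_fraction_elapsed if window_fraction_elapsed else 0.0
    return {"budget_consumed": consumed, "burn_rate": burn_rate,
            "alert": burn_rate > 1.0}

# Halfway through the window, the full budget is already spent: burn rate ~2.
status = error_budget_status(slo_target=0.999, good=99_900, total=100_000,
                             window_fraction_elapsed=0.5)
```

This maps directly onto the pattern above: a budget that is never touched suggests a loose SLO worth tightening; a burn rate persistently above 1 suggests the target or the design needs a decision, not another alert page.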
Eval-score trend review
Production eval scores (Articles 11, 12) are reviewed at the cadence appropriate to the workload. For a chat product with steady traffic, weekly is typical; for a batch enrichment job, monthly. The architect looks at trend, variance, and slice-based breakdowns. A stable average eval score hiding a slice regression (Article 20’s fourth incident class) is the common early-warning signal.
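The hiding effect is easy to demonstrate. In this sketch the slice names, baseline values, and the 0.05 regression threshold are hypothetical; the point is only that the overall mean and the per-slice means must be read separately:

```python
# Illustrative sketch of a slice-aware eval review: a stable overall
# average can hide a regression in one slice. Slice names, baselines,
# and the 0.05 threshold are hypothetical.
from collections import defaultdict

def slice_regressions(scores, baselines, threshold=0.05):
    """scores: list of (slice_name, eval_score) pairs.
    baselines: slice_name -> expected mean score.
    Returns (overall_mean, {slice: current_mean} for regressed slices)."""
    by_slice = defaultdict(list)
    for name, score in scores:
        by_slice[name].append(score)
    overall = sum(score for _, score in scores) / len(scores)
    regressed = {name: sum(vals) / len(vals)
                 for name, vals in by_slice.items()
                 if sum(vals) / len(vals) < baselines.get(name, 0.0) - threshold}
    return overall, regressed

scores = [("general", 0.90)] * 95 + [("refund_disputes_es", 0.60)] * 5
baselines = {"general": 0.89, "refund_disputes_es": 0.85}
overall, regressed = slice_regressions(scores, baselines)
# Overall is ~0.885, within noise of baseline, yet one slice is down 0.25.
```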
Cost review
The cost model v2 from Produce is now running against real traffic. The architect reviews per-query cost, monthly run-rate, and anomaly signals. Unexpected cost drivers — a prompt that suddenly emits 5x more output tokens, a retrieval path that now scans the whole index, a tool call that loops — surface first in the cost dashboard.
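A trailing-baseline check is often enough to surface these drivers early. A minimal sketch, assuming daily per-query cost figures, a 7-day window, and an arbitrary 1.5x threshold:

```python
# Sketch of a trailing-baseline check for per-query cost anomalies.
# The daily data shape, 7-day window, and 1.5x factor are assumptions
# for illustration only.

def cost_anomalies(daily_cost_per_query, window=7, factor=1.5):
    """Flag day indices whose per-query cost exceeds the trailing-window mean by `factor`."""
    flagged = []
    for i in range(window, len(daily_cost_per_query)):
        baseline = sum(daily_cost_per_query[i - window:i]) / window
        if daily_cost_per_query[i] > factor * baseline:
            flagged.append(i)
    return flagged

# A prompt change that triples output tokens shows up on day 10.
costs = [0.010] * 10 + [0.032]
anomalous_days = cost_anomalies(costs)   # [10]
```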
The FinOps discipline in Article 33 covers this ongoing review. At Evaluate-stage reviews the architect verifies that cost trends are acceptable and that the budget ceiling is not being burned through.
Incident review
All incidents since the previous Evaluate review are covered. For AI systems the classification from Article 20 — confabulation outbreak, safety bypass, prompt injection, model regression, retrieval corruption — applies alongside the classical incidents. Each incident yields either a remediation item, a backlog item, or a documented decision not to remediate. The architect owns the pattern recognition: are recurring incidents pointing at a deeper architectural issue?
Drift and regression review
Over time the eval harness itself may drift relative to production traffic. New topics emerge, new user segments adopt the tool, new question shapes appear. The architect schedules periodic eval-set refresh and verifies the refreshed eval set is representative of current production. A harness that was representative at launch is not necessarily representative six months later.
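One way to make "representative" measurable is to compare the eval set's topic mix against recent production traffic. A sketch using total variation distance, with hypothetical topic labels and an illustrative 0.2 refresh threshold:

```python
# Sketch: measure whether the eval set's topic mix still matches production
# traffic. Topic labels and the 0.2 refresh threshold are illustrative.
from collections import Counter

def topic_drift(eval_topics, prod_topics):
    """Total variation distance between two topic distributions
    (0 = identical mix, 1 = completely disjoint)."""
    e, p = Counter(eval_topics), Counter(prod_topics)
    return 0.5 * sum(abs(e[t] / len(eval_topics) - p[t] / len(prod_topics))
                     for t in set(e) | set(p))

# A topic that emerged after launch is absent from the eval set entirely.
eval_topics = ["billing"] * 50 + ["shipping"] * 50
prod_topics = ["billing"] * 30 + ["shipping"] * 30 + ["refund_disputes"] * 40
drift = topic_drift(eval_topics, prod_topics)   # ~0.4
needs_refresh = drift > 0.2
```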
Learn — the architect’s inputs
Learn is where operational signal becomes organisational knowledge. It is where the architect closes the loop between what was planned and what actually happened, and where the organisation’s ability to do the next use case well is built up.
Post-incident reviews
The Amazon Correction of Errors (COE) and Google SRE post-mortem cultures are the industry references.2 Every significant incident gets a written post-mortem that is blameless, specific, and actionable. The architect either writes or contributes to post-mortems for architectural incidents. Good post-mortems have the following characteristics:
- Timeline with named decision points.
- Named root cause (and contributing causes) in the system, not in individuals.
- Action items with accountable owners and target dates.
- Links to artefacts: the SLO dashboard at the time of incident, the registry state, the model version, the prompt version, the retrieved passages.
- Patterns noted — whether this incident resembles previous incidents, what architectural change would prevent the class.
Post-mortems are published (inside the organisation at a minimum; sometimes externally). The library of post-mortems is itself an architectural asset — new engineers read it to build intuition and existing architects revisit it to avoid repeat patterns.
ADR supersession
Decisions recorded at Organise and Model are revisited against production reality. Some hold; some need superseding. Suppose an ADR chose managed Claude for cost reasons at launch, but the provider’s pricing has since changed and self-hosted Llama 3 is now cheaper for the workload: the ADR is superseded, the link-forward is written, and the platform implements the change. Without the ADR discipline this change becomes invisible drift; with it, the history is legible.
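The link-forward discipline is easiest to see in the artefact itself. A superseded ADR might carry a header like the following; the numbering, file names, and field labels here are a hypothetical sketch, not a prescribed template:

```markdown
# ADR-014: Serve chat completions via managed Claude

Status: Superseded by ADR-031 (2025-06)

Superseded-because: provider pricing changed; self-hosted Llama 3
is now cheaper for this workload. See ADR-031 for the new decision
and the migration plan.
```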
Platform backlog feedback
Operational lessons feed the platform backlog (Article 24). A repeated class of product-team workaround typically indicates a missing platform capability; repeated cost anomalies indicate a missing observability signal; repeated incidents in a specific incident class indicate a missing tool. The architect is the translator from operational pain to platform roadmap.
Capability extensions
The system’s scope often grows after launch. New user segments, new modalities, new languages, new use-case extensions. The architect runs a lightweight Calibrate for each extension rather than quietly bolting it on. The point is not ceremony but continuity: the architect knows what is in the system and why, even three years after launch.
Documentation refresh
The system card (Articles 11, 21) is updated. The evidence pack (Article 22) is refreshed. The reference architecture diagram is kept current. These are chores that only happen if an accountable owner owns them; the architect is usually that owner.
The retirement or replacement decision
Every AI system eventually faces a retirement or replacement decision. Not every system is retired after a year; some run for many years. But the decision is not a passive one and the architect is the one who surfaces it.
Triggers for re-evaluation:
Cost trajectory. The cost curve has shifted and a different architecture is cheaper. A system on a now-retired model version, or a workload that has grown past its serving pattern’s sweet spot, often triggers a replacement.
Capability gap. The underlying model generation has moved on and the system’s outputs look dated or inferior next to contemporaries. A 2023 RAG implementation using an older embedding model and an early-generation reasoning model will usually be outperformed by a 2025 equivalent.
Regulatory change. A new regulation or a tightening enforcement environment makes the current architecture untenable. The 2026 EU AI Act high-risk compliance deadline triggered architectural reassessments across the industry.3
Incident accumulation. Recurring incidents that the current architecture cannot cleanly prevent signal that the architecture needs to change. At some point patching becomes more expensive than rebuilding.
Business shift. The use case’s importance has declined; the system is no longer justifying its ongoing cost and governance overhead. Retirement is the honest answer.
The retirement decision is itself worth an ADR. Who used the system, what replaces it, how long the sunset window is, what happens to the data and the evidence pack. A retirement ADR with a link-forward to the successor system (if any) preserves the lineage.
Worked example — a customer-service assistant at 18-month review
A customer-service assistant launched eighteen months prior comes up for annual review plus scope-extension proposal. The architect’s Evaluate-stage read:
- SLOs: availability stable at 99.7%; TTFT p95 holding at 420ms; eval score has drifted down 3 points over the last quarter, concentrated in a specific slice (refund disputes, Spanish-language).
- Cost: per-query cost has risen 22% due to a prompt change that added more retrieval context; monthly cost is within budget but trending upward.
- Incidents: two minor confabulation incidents both resolved by prompt tuning; one model-regression incident after a provider-pushed model upgrade, resolved by pinning to a prior version.
- Drift: eval set refresh overdue by two months; when refreshed, overall score drops 5 points reflecting genuine drift.
- ADR supersessions: the single-provider ADR from launch is now outdated; the architect writes a superseding ADR introducing a fallback to a second provider for refund-dispute and Spanish-language slices.
- Platform feedback: a prompt A/B-testing capability would accelerate the fixes the team keeps shipping manually; the architect adds it to the platform backlog.
- Retirement decision: not yet. The system continues with the planned changes.
The Evaluate readout is a seven-page document. The Learn activities (ADR supersessions, backlog items, eval-set refresh, post-mortem for the regression incident) are specific and scheduled.
Worked example — Microsoft Tay retrospective
Microsoft Tay, the 2016 chatbot that was taken offline within 24 hours after coordinated adversarial interactions, is not strictly an AITE-SAT case study but its retrospective is an instructive Learn-stage exemplar for the earlier AI era.4 The public lessons — adversarial-traffic testing, content filtering at multiple layers, the kill-switch discipline — became permanent parts of the industry playbook. Contemporary AI deployments inherit these lessons as baseline expectations; the architect’s Learn discipline is to keep absorbing new incidents with the same rigour.
The more contemporary NYC MyCity chatbot issues (2024) and the DPD UK chatbot swearing incident (January 2024) show that the Learn loop is never complete.5 Each generation of deployments produces its own set of incidents from which the next generation learns.
Cross-system learning
An individual architect working on one system learns from their own incidents. An organisation with many AI systems learns across them if the architecture function is set up to share. Patterns that help:
A shared incident log. Not just a JIRA backlog but a searchable, pattern-classified log that all AI architects in the organisation consult.
Architecture guild cadence. Monthly or fortnightly architecture guild meetings where architects share recent decisions and incidents. The McKinsey State of AI findings repeatedly point to this kind of cross-team learning as a differentiator.6
Platform team as learning amplifier. The platform team (Article 24) sits at the intersection of many product teams and sees patterns earlier than any single product architect. The architect guild and the platform team should be tightly coupled.
External reading discipline. Public post-mortems from OpenAI, Anthropic, Microsoft, Google, Meta; academic papers; incident database entries. The architect’s reading list is part of the job.
Governance integration
Articles 72 and 73 of the EU AI Act cover post-market monitoring and serious incident reporting.3 Evaluate and Learn are where those obligations live. For high-risk systems the architect confirms that the post-market monitoring plan is running, that serious incidents are being reported within the required window, and that the evidence pack stays current. ISO/IEC 42001 clause 10 (improvement) directly maps to the Learn stage’s continuous improvement discipline.
Anti-patterns
- Launch-and-forget. A system that has no scheduled Evaluate cadence drifts invisibly. The architect institutes the cadence before handing off, even if the architect does not personally attend every review afterwards.
- Blameful post-mortems. Post-mortems that identify individuals as root causes teach nothing and discourage future honesty. The blameless discipline is a learned organisational skill.
- Unlinked ADR supersessions. A new ADR replaces an old one but without the link-forward. The history becomes unreadable.
- Ignoring the retirement option. Systems that have outlived their usefulness but are kept running to avoid the political cost of retirement accumulate risk. The architect names the retirement option explicitly even when the answer is “not yet.”
- Post-mortems with no action items. A post-mortem that describes what happened without committing to what changes is a written incident, not a learning event.
Summary
Evaluate and Learn are the stages where architecture earns its long-run value. The architect’s Evaluate work is measurement and pattern recognition across SLOs, eval scores, costs, incidents, and drift. The Learn work is post-mortems, ADR supersession, platform feedback, capability extension, and documentation refresh, culminating in the retirement decision when its time comes. Architecture is the through-line across all six COMPEL stages and the architect’s continued presence is what keeps the through-line intact.
Key terms
- Evaluate stage (COMPEL)
- Learn stage (COMPEL)
- Post-incident review (blameless)
- Eval-set drift
- Retirement decision
Learning outcomes
After this article the learner can: explain Evaluate and Learn gate artefacts; classify four post-launch architecture activities; evaluate an Evaluate review for architectural signal quality; design the architect’s section of a Learn readout.
Further reading
Footnotes
1. Beyer et al., Site Reliability Engineering: How Google Runs Production Systems (O’Reilly, 2016), chapter on the error budget philosophy.
2. Amazon AWS Well-Architected Framework — Correction of Errors process; Google SRE public post-mortem library.
3. Regulation (EU) 2024/1689 (AI Act), Articles 72 (post-market monitoring) and 73 (serious incident reporting).
4. Microsoft public statement on the Tay incident (March 2016) and subsequent research retrospectives.
5. BBC coverage of the DPD UK chatbot swearing incident (January 2024); press coverage of the NYC MyCity chatbot wrong-law answers (March 2024).
6. McKinsey State of AI reports (multiple years), findings on organisational AI maturity patterns.