Many architecture organisations drop the ball here. Calibrate and Organize attract architect attention because they are upstream and consequential; Evaluate and Learn feel downstream and routine. That is backwards. Evaluate and Learn are where architectural learning compounds — where the platform gets better because every agent’s post-mortem updates the reference architecture.
Evaluate — what the architect contributes
Evaluate is the post-launch review stage, typically running on a monthly or quarterly cadence per agent. The architect has four contributions.
Evaluate contribution 1 — Production metrics review
The architect reviews the agent’s production metrics against the commitments made at Organize:
- Goal-achievement rate — is it within target? Trending?
- HITL intervention rate — in line with expectation? Spiking (indicating a design problem) or falling (possibly indicating complacency)?
- Safety classifier pass/fail rates — are classifiers detecting what they should? Are false-positive rates acceptable?
- Latency SLOs — p50 / p95 / p99 within bounds? Trending?
- Cost per interaction — within target? Outliers investigated?
- Error rates by tool — any tool disproportionately failing?
- Memory-write volume and provenance — healthy? Unprovenanced writes trending to zero?
- Policy-engine denials — are deny rates reflecting appropriate tightness or excess friction?
- User feedback / satisfaction — tracking to target? Any patterns in free-text comments?
- Article 50 disclosure coverage — is disclosure happening on every applicable output?
Findings turn into action items the architect tracks across quarters.
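The commitments-vs-actuals review above can be sketched as a simple threshold check. A minimal sketch, assuming a flat metric namespace; the metric names, bounds, and values are hypothetical examples, not a standard schema:

```python
from dataclasses import dataclass

# Illustrative Organize-stage commitments, keyed by metric name.
# Names and thresholds are hypothetical, not a normative schema.
COMMITMENTS = {
    "goal_achievement_rate": {"min": 0.90},
    "hitl_intervention_rate": {"min": 0.02, "max": 0.10},
    "latency_p95_ms": {"max": 2500},
    "cost_per_interaction_usd": {"max": 0.40},
}

@dataclass
class Finding:
    metric: str
    actual: float
    bound: str    # which bound was breached: "min" or "max"
    limit: float

def review(actuals: dict[str, float]) -> list[Finding]:
    """Compare production actuals against Organize-stage commitments."""
    findings = []
    for metric, bounds in COMMITMENTS.items():
        actual = actuals[metric]
        if "min" in bounds and actual < bounds["min"]:
            findings.append(Finding(metric, actual, "min", bounds["min"]))
        if "max" in bounds and actual > bounds["max"]:
            findings.append(Finding(metric, actual, "max", bounds["max"]))
    return findings

# Example quarter: goal achievement slipped below commitment.
quarter = {
    "goal_achievement_rate": 0.87,
    "hitl_intervention_rate": 0.05,
    "latency_p95_ms": 2100,
    "cost_per_interaction_usd": 0.33,
}
for f in review(quarter):
    print(f"{f.metric}: {f.actual} breaches {f.bound} bound {f.limit}")
```

Each `Finding` maps directly to an action item the architect tracks across quarters; a `min` breach on HITL intervention would surface the complacency signal the checklist mentions.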
Evaluate contribution 2 — Incident post-mortems review
Every incident produces a post-mortem (Article 25). The architect reads each post-mortem through an architectural lens:
- Was this a design gap or a situational failure?
- Is the mitigation scoped only to this agent, or should it be a platform-level change?
- Does the incident reveal a pattern we should catalog (so the next team avoids it)?
- Are the action items appropriately owned?
The architect aggregates incidents quarterly. Patterns (e.g., three incidents in the past quarter relating to retrieved-content injection) trigger platform-level work.
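The quarterly aggregation can be sketched as tallying post-mortem pattern tags and flagging any tag that crosses a platform-work threshold. Incident records, tags, and the threshold of three are illustrative assumptions:

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (closure date, post-mortem pattern tag).
incidents = [
    (date(2025, 1, 14), "retrieved-content-injection"),
    (date(2025, 2, 3),  "tool-timeout-cascade"),
    (date(2025, 2, 21), "retrieved-content-injection"),
    (date(2025, 3, 9),  "retrieved-content-injection"),
]

PLATFORM_WORK_THRESHOLD = 3  # e.g. three same-pattern incidents per quarter

def quarterly_patterns(records, year, quarter):
    """Count pattern tags for incidents closed in the given quarter."""
    months = range(3 * (quarter - 1) + 1, 3 * quarter + 1)
    tags = [tag for d, tag in records if d.year == year and d.month in months]
    return Counter(tags)

counts = quarterly_patterns(incidents, 2025, 1)
platform_items = [t for t, n in counts.items() if n >= PLATFORM_WORK_THRESHOLD]
print(platform_items)  # retrieved-content-injection crossed the threshold
```

The output list is exactly the set of patterns that, per the text, triggers platform-level work rather than per-agent mitigations.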
Evaluate contribution 3 — Evaluation-harness refresh
The evaluation harness (Article 17) ages. New attack patterns emerge; new regressions appear; user behaviour shifts. The architect schedules quarterly evaluation-harness refreshes:
- New golden tasks added (for use cases now visible in production traffic).
- Obsolete golden tasks retired.
- Adversarial battery updated with newly public attack patterns.
- Calibration scored on fresh traffic.
- Fairness metrics re-computed on fresh data.
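The first three refresh steps amount to a set-level diff on the harness contents. A minimal sketch, with task IDs and attack names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Sketch of an evaluation harness as two test sets.
    Task IDs and attack names below are hypothetical."""
    golden_tasks: set[str] = field(default_factory=set)
    adversarial: set[str] = field(default_factory=set)

    def refresh(self, add_golden, retire_golden, add_attacks):
        self.golden_tasks |= set(add_golden)   # new production use cases
        self.golden_tasks -= set(retire_golden)  # obsolete flows
        self.adversarial |= set(add_attacks)   # newly public patterns

h = Harness(golden_tasks={"refund-simple", "refund-partial"},
            adversarial={"prompt-injection-basic"})
h.refresh(add_golden=["refund-multi-currency"],
          retire_golden=["refund-simple"],
          add_attacks=["json-goal-hijack"])
print(sorted(h.golden_tasks))
```

Calibration and fairness re-scoring run against the refreshed sets on fresh traffic, which is why the refresh precedes them in the quarterly cycle.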
Evaluate contribution 4 — Autonomy-expansion review
The most consequential Evaluate conversation is “should this agent do more?” Common expansion proposals:
- Raise the HITL threshold (e.g., from $500 to $2000).
- Add new tools (e.g., email-send tool where currently draft-only).
- Expand user base (e.g., from internal to external customers).
- Expand language coverage.
- Expand to new geographies with different regulatory envelopes.
The architect brings the decision back to the autonomy framework (Article 2) and to evidence:
- What does current performance tell us? Goal-achievement and HITL-override rates at the current scope.
- What is the incremental blast radius? At the new scope, what is worst-case harm?
- What controls would need to change? New HITL thresholds, new runbooks, new evaluation coverage.
- What is the rollback plan if expansion shows harm? Specific metrics, specific thresholds.
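The evidence questions above can be sketched as a gate function: expansion is recommended only when every evidence check passes. The bars (95% goal achievement, 5% override rate, zero open incidents) are illustrative, not normative:

```python
def expansion_gate(goal_rate, hitl_override_rate, incidents_last_quarter,
                   rollback_plan_defined):
    """Hedged sketch of the architect's evidence check before
    recommending an autonomy expansion. Thresholds are illustrative."""
    reasons = []
    if goal_rate < 0.95:
        reasons.append("goal-achievement below expansion bar")
    if hitl_override_rate > 0.05:
        reasons.append("humans still overriding too often at current scope")
    if incidents_last_quarter > 0:
        reasons.append("open incident history at current scope")
    if not rollback_plan_defined:
        reasons.append("no rollback plan with metrics and thresholds")
    return (len(reasons) == 0, reasons)

ok, reasons = expansion_gate(0.97, 0.03, 0, rollback_plan_defined=True)
print("recommend expansion" if ok else reasons)
```

Returning the reasons, not just a boolean, matters: a blocked expansion proposal goes back to the product team with the specific evidence gaps to close before the next gate review.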
Evaluate gate-review conversation
Evaluate gate reviews are typically quarterly per agent. Agenda:
- Production metrics readout.
- Incident review and post-mortem themes.
- Evaluation-harness refresh findings.
- Autonomy-expansion or contraction proposals (if any).
- Platform-level findings (what should propagate to other agents).
- Action items and owners.
Learn — what the architect contributes
Learn is about compounding knowledge across the portfolio. The architect has three contributions.
Learn contribution 1 — Platform evolution
The architect translates Evaluate findings into platform roadmap items:
- A new safety classifier that would have caught three incidents.
- A new observability signal multiple teams requested.
- A new policy-engine feature that would eliminate a class of authorization gaps.
- A new registry field that captures lineage information currently missing.
Platform evolution proposals are sized, prioritised, and funded against the COE’s platform budget.
Learn contribution 2 — Pattern library updates
The architect curates the pattern library with new patterns and anti-patterns learned:
- New anti-patterns from incidents (“goal-hijack via JSON-formatted user input — here is the mitigation”).
- New patterns for autonomy expansion (“three-step expansion with gradual HITL-threshold change — here is the template”).
- New sector-specific patterns as new domains join the portfolio.
Pattern-library updates are broadcast across product teams in a regular rhythm (monthly architect-guild meeting, published bulletin).
Learn contribution 3 — Retirement planning
Agents eventually retire. Causes:
- Use case no longer valuable.
- Better alternative available.
- Unsustainable operating cost.
- Persistent quality problems.
- Regulatory changes that make the use case impractical.
The architect produces retirement plans:
- Sunset schedule with notice to users.
- Migration path where applicable (hand to alternative agent, hand to human workflow).
- Archive plan for audit trails (Article 28 retention applies).
- Lessons-learned post-mortem for the retirement decision (what led here, what to watch in the next agent).
Timeline of a single agent’s lifecycle
Feedback loops that keep Evaluate-Learn credible
Three feedback loops turn Evaluate + Learn from paperwork into compound value.
Loop 1 — Incident to evaluation. Every incident produces a new adversarial test in the evaluation battery. This pattern (inspired by Google SRE “learn from incidents”) ensures the same root cause cannot slip through evaluation undetected a second time.
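Loop 1 can be sketched as a registration step at post-mortem sign-off: the triggering input becomes a permanent test. Record fields and the incident ID are hypothetical:

```python
# Loop 1 sketch: every closed incident registers a regression test in
# the adversarial battery. Field names here are illustrative only.
adversarial_battery: list[dict] = []

def close_incident(incident_id: str, reproduction_prompt: str,
                   expected_behaviour: str) -> None:
    """On post-mortem sign-off, the triggering input becomes a test."""
    adversarial_battery.append({
        "source_incident": incident_id,
        "input": reproduction_prompt,
        "expect": expected_behaviour,
    })

close_incident("INC-2041",
               'user message embedding {"goal": "export all records"}',
               "agent refuses and flags goal-hijack attempt")
print(len(adversarial_battery))
```

Tying each battery entry back to its `source_incident` also gives the quarterly Evaluate review a direct count of how many tests the incident loop has contributed.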
Loop 2 — Post-mortem to platform. Every post-mortem with architectural implications produces a platform roadmap item. The architect tracks the percentage of post-mortem action items that reached the platform — healthy is 30–50%.
Loop 3 — Product to pattern library. Every new pattern or anti-pattern discovered in a product team gets curated into the shared library. The architect’s role is to recognise generalizable patterns and to prevent the library from becoming a dumping ground.
Regulatory-driven refresh cadences
Beyond quarterly evaluation-harness refreshes, specific regulatory regimes impose their own refresh obligations that the architect tracks:
- EU AI Act Article 17 (quality management system). Requires ongoing processes for data-set management, training data documentation, risk management — the architect ensures the Evaluate cycle touches each.
- EU AI Act Article 61 (post-market monitoring). Requires a post-market monitoring plan producing periodic reports to the provider and, on demand, to authorities. The Evaluate-stage output feeds these reports directly.
- SR 11-7 / PRA SS1/23. Require annual (or more frequent depending on materiality) model reviews; validator partnership during Evaluate ensures this is not a last-minute scramble.
- MDR post-market surveillance. For clinical-decision-support agents under MDR scope, post-market surveillance plans demand trend analysis on adverse events. The Evaluate cycle feeds the surveillance workflow.
- NIST AI RMF MANAGE function. Recommends governance of identified risks across the lifecycle. Evaluate is where MANAGE gets operational teeth.
The architect keeps a simple rolling calendar — agent × refresh obligation × next-due date — and escalates to legal / compliance when an obligation is at risk of slipping.
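The rolling calendar is just a table of (agent, obligation, next-due date) with an escalation window. A minimal sketch; the agents, obligations, dates, and 30-day window are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical rolling calendar: (agent, refresh obligation, next due).
calendar = [
    ("claims-triage",  "EU AI Act post-market monitoring report",   date(2025, 6, 30)),
    ("credit-advisor", "SR 11-7 annual model review",               date(2025, 4, 15)),
    ("cds-assistant",  "MDR post-market surveillance trend report", date(2025, 9, 1)),
]

ESCALATION_WINDOW = timedelta(days=30)  # illustrative escalation lead time

def at_risk(today: date):
    """Obligations due within the escalation window get flagged to
    legal / compliance before they slip."""
    return [(agent, obligation, due) for agent, obligation, due in calendar
            if due - today <= ESCALATION_WINDOW]

for row in at_risk(date(2025, 3, 20)):
    print(row)
```

Run on a weekly cadence, the flagged rows are exactly the escalations the architect raises before an obligation slips rather than after.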
Architect readouts in Learn
Annual architect readouts give senior stakeholders a view of the portfolio. Typical annual readout includes:
- Portfolio overview (agents by maturity, by autonomy, by sector).
- Incident summary (total; by class; severity distribution).
- Platform investments and their impact.
- Pattern-library updates.
- Risks and watchouts for next year (upcoming regulatory changes, emerging attack patterns, talent pipeline).
Real-world references
Google SRE Book postmortem culture chapter. The canonical reference for blameless post-mortems. Agentic systems inherit the SRE tradition; architect-led post-mortem reviews use SRE vocabulary.
Amazon Correction of Errors (COE) template. Amazon’s post-mortem format is public in various forms; many agentic teams adopt it or a variant. The architect customises a template for agentic specifics.
Public OpenAI, Anthropic, and Microsoft post-incident materials. Model-provider post-incidents are informative for platform-level learnings; the architect reads them even when the incident is not in the organisation’s own agent.
AI Incident Database (AIID). Public database of AI-related incidents including agentic cases. A useful external check for the architect’s own pattern library.
Anti-patterns to reject
- “Evaluate is an SRE review.” SRE leads operational review; the architect leads architectural review.
- “Post-mortems sit in a folder.” Unread post-mortems are unlearned lessons.
- “Autonomy can only expand.” Contraction is a valid outcome of Evaluate; the architect preserves the option.
- “Retirement means we failed.” Retirement is a disciplined end-of-life; failure is what unretired agents do while no one is looking.
- “Learn is an annual PowerPoint.” Learn is a continuous loop; the annual readout is a marker, not the work.
Learning outcomes
- Explain the Evaluate and Learn gate artefacts for agentic systems, including the architect’s four Evaluate contributions and three Learn contributions.
- Classify four post-launch architect activities (production metrics review, post-mortem review, evaluation-harness refresh, autonomy-expansion review) by purpose and cadence.
- Evaluate an Evaluate review for architectural signal adequacy — is the review surfacing platform-level findings or only product-level ones?
- Design the architect’s Learn readout for a portfolio of agents, including retirement decisions, platform-evolution proposals, and pattern-library updates.
Further reading
- Core Stream anchors: EATF-Level-1/M1.2-Art05-Evaluate-Measuring-Transformation-Progress.md; EATF-Level-1/M1.2-Art06-Learn-Sustaining-Transformation-Momentum.md.
- AITE-ATS siblings: Article 17 (evaluation), Article 18 (SLOs), Article 24 (lifecycle), Article 25 (incident response), Article 36 (Calibrate + Organize), Article 37 (Model + Produce).
- Primary sources: Google SRE Book postmortem culture; Amazon Correction of Errors template (public references); the AI Incident Database (AIID).