Many architecture organisations drop the ball here. Calibrate and Organize attract architect attention because they are upstream and consequential; Evaluate and Learn feel downstream and routine. That is backwards. Evaluate and Learn are where architectural learning compounds — where the platform gets better because every agent’s post-mortem updates the reference architecture.
Evaluate — what the architect contributes
Evaluate is the post-launch review stage, typically running on a monthly or quarterly cadence per agent. The architect has four contributions.
Evaluate contribution 1 — Production metrics review
The architect reviews the agent’s production metrics against the commitments made at Organize:
- Goal-achievement rate — is it within target? Trending?
- HITL intervention rate — in line with expectation? Spiking (indicating a design problem) or falling (possibly indicating complacency)?
- Safety classifier pass/fail rates — are classifiers detecting what they should? Are false-positive rates acceptable?
- Latency SLOs — p50 / p95 / p99 within bounds? Trending?
- Cost per interaction — within target? Outliers investigated?
- Error rates by tool — any tool disproportionately failing?
- Memory-write volume and provenance — healthy? Unprovenanced writes trending to zero?
- Policy-engine denials — are deny rates reflecting appropriate tightness or excess friction?
- User feedback / satisfaction — tracking to target? Any patterns in free-text comments?
- Article 50 disclosure coverage — is disclosure happening on every applicable output?
Findings turn into action items the architect tracks across quarters.
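The commitments-vs-actuals review above can be sketched as a simple threshold check. A minimal sketch, assuming a flat metric namespace; the metric names, bounds, and values are hypothetical examples, not a standard schema:

```python
from dataclasses import dataclass

# Illustrative Organize-stage commitments, keyed by metric name.
# Names and thresholds are hypothetical, not a normative schema.
COMMITMENTS = {
    "goal_achievement_rate": {"min": 0.90},
    "hitl_intervention_rate": {"min": 0.02, "max": 0.10},
    "latency_p95_ms": {"max": 2500},
    "cost_per_interaction_usd": {"max": 0.40},
}

@dataclass
class Finding:
    metric: str
    actual: float
    bound: str    # which bound was breached: "min" or "max"
    limit: float

def review(actuals: dict[str, float]) -> list[Finding]:
    """Compare production actuals against Organize-stage commitments."""
    findings = []
    for metric, bounds in COMMITMENTS.items():
        actual = actuals[metric]
        if "min" in bounds and actual < bounds["min"]:
            findings.append(Finding(metric, actual, "min", bounds["min"]))
        if "max" in bounds and actual > bounds["max"]:
            findings.append(Finding(metric, actual, "max", bounds["max"]))
    return findings

# Example quarter: goal achievement slipped below commitment.
quarter = {
    "goal_achievement_rate": 0.87,
    "hitl_intervention_rate": 0.05,
    "latency_p95_ms": 2100,
    "cost_per_interaction_usd": 0.33,
}
for f in review(quarter):
    print(f"{f.metric}: {f.actual} breaches {f.bound} bound {f.limit}")
```

Each `Finding` maps directly to an action item the architect tracks across quarters; a `min` breach on HITL intervention would surface the complacency signal the checklist mentions.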
Evaluate contribution 2 — Incident post-mortems review
Every incident produces a post-mortem (Article 25). The architect reads each post-mortem through an architectural lens:
- Was this a design gap or a situational failure?
- Is the mitigation scoped only to this agent, or should it be a platform-level change?
- Does the incident reveal a pattern we should catalog (so the next team avoids it)?
- Are the action items appropriately owned?
The architect aggregates incidents quarterly. Patterns (e.g., three incidents in the past quarter relating to retrieved-content injection) trigger platform-level work.
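The quarterly aggregation can be sketched as tallying post-mortem pattern tags and flagging any tag that crosses a platform-work threshold. Incident records, tags, and the threshold of three are illustrative assumptions:

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (closure date, post-mortem pattern tag).
incidents = [
    (date(2025, 1, 14), "retrieved-content-injection"),
    (date(2025, 2, 3),  "tool-timeout-cascade"),
    (date(2025, 2, 21), "retrieved-content-injection"),
    (date(2025, 3, 9),  "retrieved-content-injection"),
]

PLATFORM_WORK_THRESHOLD = 3  # e.g. three same-pattern incidents per quarter

def quarterly_patterns(records, year, quarter):
    """Count pattern tags for incidents closed in the given quarter."""
    months = range(3 * (quarter - 1) + 1, 3 * quarter + 1)
    tags = [tag for d, tag in records if d.year == year and d.month in months]
    return Counter(tags)

counts = quarterly_patterns(incidents, 2025, 1)
platform_items = [t for t, n in counts.items() if n >= PLATFORM_WORK_THRESHOLD]
print(platform_items)  # retrieved-content-injection crossed the threshold
```

The output list is exactly the set of patterns that, per the text, triggers platform-level work rather than per-agent mitigations.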
Evaluate contribution 3 — Evaluation-harness refresh
The evaluation harness (Article 17) ages. New attack patterns emerge; new regressions appear; user behaviour shifts. The architect schedules quarterly evaluation-harness refreshes:
- New golden tasks added (for use cases now visible in production traffic).
- Obsolete golden tasks retired.
- Adversarial battery updated with newly public attack patterns.
- Calibration scored on fresh traffic.
- Fairness metrics re-computed on fresh data.
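The first three refresh steps amount to a set-level diff on the harness contents. A minimal sketch, with task IDs and attack names invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Sketch of an evaluation harness as two test sets.
    Task IDs and attack names below are hypothetical."""
    golden_tasks: set[str] = field(default_factory=set)
    adversarial: set[str] = field(default_factory=set)

    def refresh(self, add_golden, retire_golden, add_attacks):
        self.golden_tasks |= set(add_golden)   # new production use cases
        self.golden_tasks -= set(retire_golden)  # obsolete flows
        self.adversarial |= set(add_attacks)   # newly public patterns

h = Harness(golden_tasks={"refund-simple", "refund-partial"},
            adversarial={"prompt-injection-basic"})
h.refresh(add_golden=["refund-multi-currency"],
          retire_golden=["refund-simple"],
          add_attacks=["json-goal-hijack"])
print(sorted(h.golden_tasks))
```

Calibration and fairness re-scoring run against the refreshed sets on fresh traffic, which is why the refresh precedes them in the quarterly cycle.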
Evaluate contribution 4 — Autonomy-expansion review
The most consequential Evaluate conversation is “should this agent do more?” Common expansion proposals:
- Raise the HITL threshold (e.g., from $500 to $2000).
- Add new tools (e.g., email-send tool where currently draft-only).
- Expand user base (e.g., from internal to external customers).
- Expand language coverage.
- Expand to new geographies with different regulatory envelopes.
The architect brings the decision back to the autonomy framework (Article 2) and to evidence:
- What does current performance tell us? Goal-achievement and HITL-override rates at the current scope.
- What is the incremental blast radius? At the new scope, what is worst-case harm?
- What controls would need to change? New HITL thresholds, new runbooks, new evaluation coverage.
- What is the rollback plan if expansion shows harm? Specific metrics, specific thresholds.
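The evidence questions above can be sketched as a gate function: expansion is recommended only when every evidence check passes. The bars (95% goal achievement, 5% override rate, zero open incidents) are illustrative, not normative:

```python
def expansion_gate(goal_rate, hitl_override_rate, incidents_last_quarter,
                   rollback_plan_defined):
    """Hedged sketch of the architect's evidence check before
    recommending an autonomy expansion. Thresholds are illustrative."""
    reasons = []
    if goal_rate < 0.95:
        reasons.append("goal-achievement below expansion bar")
    if hitl_override_rate > 0.05:
        reasons.append("humans still overriding too often at current scope")
    if incidents_last_quarter > 0:
        reasons.append("open incident history at current scope")
    if not rollback_plan_defined:
        reasons.append("no rollback plan with metrics and thresholds")
    return (len(reasons) == 0, reasons)

ok, reasons = expansion_gate(0.97, 0.03, 0, rollback_plan_defined=True)
print("recommend expansion" if ok else reasons)
```

Returning the reasons, not just a boolean, matters: a blocked expansion proposal goes back to the product team with the specific evidence gaps to close before the next gate review.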
Evaluate gate-review conversation
Evaluate gate reviews are typically quarterly per agent. Agenda:
- Production metrics readout.
- Incident review and post-mortem themes.
- Evaluation-harness refresh findings.
- Autonomy-expansion or contraction proposals (if any).
- Platform-level findings (what should propagate to other agents).
- Action items and owners.
Learn — what the architect contributes
Learn is about compounding knowledge across the portfolio. The architect has three contributions.
Learn contribution 1 — Platform evolution
The architect translates Evaluate findings into platform roadmap items:
- A new safety classifier that would have caught three incidents.
- A new observability signal multiple teams requested.
- A new policy-engine feature that would eliminate a class of authorization gaps.
- A new registry field that captures lineage information currently missing.
Platform evolution proposals are sized, prioritised, and funded against the COE’s platform budget.
Learn contribution 2 — Pattern library updates
The architect curates the pattern library with new patterns and anti-patterns learned:
- New anti-patterns from incidents (“goal-hijack via JSON-formatted user input — here is the mitigation”).
- New patterns for autonomy expansion (“three-step expansion with gradual HITL-threshold change — here is the template”).
- New sector-specific patterns as new domains join the portfolio.
Pattern-library updates are broadcast across product teams in a regular rhythm (monthly architect-guild meeting, published bulletin).
Learn contribution 3 — Retirement planning
Agents eventually retire. Causes:
- Use case no longer valuable.
- Better alternative available.
- Unsustainable operating cost.
- Persistent quality problems.
- Regulatory changes that make the use case impractical.
The architect produces retirement plans:
- Sunset schedule with notice to users.
- Migration path where applicable (hand to alternative agent, hand to human workflow).
- Archive plan for audit trails (Article 28 retention applies).
- Lessons-learned post-mortem for the retirement decision (what led here, what to watch in the next agent).
Timeline of a single agent’s lifecycle
Feedback loops that keep Evaluate-Learn credible
Three feedback loops turn Evaluate + Learn from paperwork into compound value.
Loop 1 — Incident to evaluation. Every incident produces a new adversarial test in the evaluation battery. This pattern (inspired by Google SRE “learn from incidents”) ensures the same root cause cannot slip through evaluation undetected a second time.
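Loop 1 can be sketched as a registration step at post-mortem sign-off: the triggering input becomes a permanent test. Record fields and the incident ID are hypothetical:

```python
# Loop 1 sketch: every closed incident registers a regression test in
# the adversarial battery. Field names here are illustrative only.
adversarial_battery: list[dict] = []

def close_incident(incident_id: str, reproduction_prompt: str,
                   expected_behaviour: str) -> None:
    """On post-mortem sign-off, the triggering input becomes a test."""
    adversarial_battery.append({
        "source_incident": incident_id,
        "input": reproduction_prompt,
        "expect": expected_behaviour,
    })

close_incident("INC-2041",
               'user message embedding {"goal": "export all records"}',
               "agent refuses and flags goal-hijack attempt")
print(len(adversarial_battery))
```

Tying each battery entry back to its `source_incident` also gives the quarterly Evaluate review a direct count of how many tests the incident loop has contributed.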
Loop 2 — Post-mortem to platform. Every post-mortem with architectural implications produces a platform roadmap item. The architect tracks the percentage of post-mortem action items that reached the platform — healthy is 30–50%.
Loop 3 — Product to pattern library. Every new pattern or anti-pattern discovered in a product team gets curated into the shared library. The architect’s role is to recognise generalizable patterns and to prevent the library from becoming a dumping ground.
Regulatory-driven refresh cadences
Beyond quarterly evaluation-harness refreshes, specific regulatory regimes impose their own refresh obligations that the architect tracks:
- EU AI Act Article 17 (quality management system). Requires ongoing processes for data-set management, training data documentation, risk management — the architect ensures the Evaluate cycle touches each.
- EU AI Act Article 61 (post-market monitoring). Requires a post-market monitoring plan producing periodic reports to the provider and, on demand, to authorities. The Evaluate-stage output feeds these reports directly.
- SR 11-7 / PRA SS1/23. Require annual (or more frequent depending on materiality) model reviews; validator partnership during Evaluate ensures this is not a last-minute scramble.
- MDR post-market surveillance. For clinical-decision-support agents under MDR scope, post-market surveillance plans demand trend analysis on adverse events. The Evaluate cycle feeds the surveillance workflow.
- NIST AI RMF MANAGE function. Recommends governance of identified risks across the lifecycle. Evaluate is where MANAGE gets operational teeth.
The architect keeps a simple rolling calendar — agent × refresh obligation × next-due date — and escalates to legal / compliance when an obligation is at risk of slipping.
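The rolling calendar is just a table of (agent, obligation, next-due date) with an escalation window. A minimal sketch; the agents, obligations, dates, and 30-day window are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical rolling calendar: (agent, refresh obligation, next due).
calendar = [
    ("claims-triage",  "EU AI Act post-market monitoring report",   date(2025, 6, 30)),
    ("credit-advisor", "SR 11-7 annual model review",               date(2025, 4, 15)),
    ("cds-assistant",  "MDR post-market surveillance trend report", date(2025, 9, 1)),
]

ESCALATION_WINDOW = timedelta(days=30)  # illustrative escalation lead time

def at_risk(today: date):
    """Obligations due within the escalation window get flagged to
    legal / compliance before they slip."""
    return [(agent, obligation, due) for agent, obligation, due in calendar
            if due - today <= ESCALATION_WINDOW]

for row in at_risk(date(2025, 3, 20)):
    print(row)
```

Run on a weekly cadence, the flagged rows are exactly the escalations the architect raises before an obligation slips rather than after.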
Architect readouts in Learn
Annual architect readouts give senior stakeholders a view of the portfolio. Typical annual readout includes:
- Portfolio overview (agents by maturity, by autonomy, by sector).
- Incident summary (total; by class; severity distribution).
- Platform investments and their impact.
- Pattern-library updates.
- Risks and watchouts for next year (upcoming regulatory changes, emerging attack patterns, talent pipeline).
Real-world references
Google SRE Book postmortem culture chapter. The canonical reference for blameless post-mortems. Agentic systems inherit the SRE tradition; architect-led post-mortem reviews use SRE vocabulary.
Amazon Correction of Errors (COE) template. Amazon’s post-mortem format is public in various forms; many agentic teams adopt it or a variant. The architect customises a template for agentic specifics.
Public OpenAI, Anthropic, and Microsoft post-incident materials. Model-provider post-incidents are informative for platform-level learnings; the architect reads them even when the incident is not in the organisation’s own agent.
AI Incident Database (AIID). Public database of AI-related incidents including agentic cases. A useful external check for the architect’s own pattern library.
Anti-patterns to reject
- “Evaluate is an SRE review.” SRE leads operational review; the architect leads architectural review.
- “Post-mortems sit in a folder.” Unread post-mortems are unlearned lessons.
- “Autonomy can only expand.” Contraction is a valid outcome of Evaluate; the architect preserves the option.
- “Retirement means we failed.” Retirement is a disciplined end-of-life; failure is what unretired agents do while no one is looking.
- “Learn is an annual PowerPoint.” Learn is a continuous loop; the annual readout is a marker, not the work.
Learning outcomes
- Explain the Evaluate and Learn gate artefacts for agentic systems, including the architect’s four Evaluate contributions and three Learn contributions.
- Classify four post-launch architect activities (production metrics review, post-mortem review, evaluation-harness refresh, autonomy-expansion review) by purpose and cadence.
- Evaluate an Evaluate review for architectural signal adequacy — is the review surfacing platform-level findings or only product-level ones?
- Design the architect’s Learn readout for a portfolio of agents, including retirement decisions, platform-evolution proposals, and pattern-library updates.
Further reading
- Core Stream anchors: EATF-Level-1/M1.2-Art05-Evaluate-Measuring-Transformation-Progress.md; EATF-Level-1/M1.2-Art06-Learn-Sustaining-Transformation-Momentum.md.
- AITE-ATS siblings: Article 17 (evaluation), Article 18 (SLOs), Article 24 (lifecycle), Article 25 (incident response), Article 36 (Calibrate + Organize), Article 37 (Model + Produce).
- Primary sources: Google SRE Book postmortem culture; Amazon Correction of Errors template (public references); the AI Incident Database (AIID).