AITE M1.2-Art38 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Architect in Evaluate and Learn Stages for Agentic Systems

Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

9 min read Article 38 of 53

Many architecture organisations drop the ball here. Calibrate and Organize attract architect attention because they are upstream and consequential; Evaluate and Learn feel downstream and routine. That is backwards. Evaluate and Learn are where architectural learning compounds: the platform gets better because every agent’s post-mortem updates the reference architecture.

Evaluate — what the architect contributes

Evaluate is the post-launch review stage, typically running on a monthly or quarterly cadence per agent. The architect has four contributions.

Evaluate contribution 1 — Production metrics review

The architect reviews the agent’s production metrics against the commitments made at Organize:

  • Goal-achievement rate — is it within target? Trending?
  • HITL intervention rate — in line with expectation? Spiking (indicating a design problem) or falling (possibly indicating complacency)?
  • Safety classifier pass/fail rates — are classifiers detecting what they should? Are false-positive rates acceptable?
  • Latency SLOs — p50 / p95 / p99 within bounds? Trending?
  • Cost per interaction — within target? Outliers investigated?
  • Error rates by tool — any tool disproportionately failing?
  • Memory-write volume and provenance — healthy? Unprovenanced writes trending to zero?
  • Policy-engine denials — are deny rates reflecting appropriate tightness or excess friction?
  • User feedback / satisfaction — tracking to target? Any recurring patterns in comments?
  • Article 50 disclosure coverage — is disclosure happening on every applicable output?

Findings turn into action items the architect tracks across quarters.
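The mechanical part of this review can run before the human conversation starts: compare each observed metric against the commitment recorded at Organize and surface the breaches. A minimal sketch in Python, with illustrative metric names and targets (none of these values are mandated by COMPEL):

```python
from dataclasses import dataclass

@dataclass
class MetricCommitment:
    """A target committed at the Organize gate (names are illustrative)."""
    name: str
    target: float
    higher_is_better: bool

def review_metrics(commitments, observed):
    """Return the metrics that breach their Organize-stage commitment."""
    findings = []
    for c in commitments:
        value = observed[c.name]
        ok = value >= c.target if c.higher_is_better else value <= c.target
        if not ok:
            findings.append(f"{c.name}: observed {value} vs target {c.target}")
    return findings

commitments = [
    MetricCommitment("goal_achievement_rate", 0.90, higher_is_better=True),
    MetricCommitment("p95_latency_ms", 2000, higher_is_better=False),
    MetricCommitment("cost_per_interaction_usd", 0.25, higher_is_better=False),
]
observed = {"goal_achievement_rate": 0.87, "p95_latency_ms": 1850,
            "cost_per_interaction_usd": 0.31}
for finding in review_metrics(commitments, observed):
    print(finding)  # goal-achievement and cost breach; latency is in bounds
```

Each breach becomes a candidate action item; the architect's judgment is in deciding which breaches are noise and which are design signals.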

Evaluate contribution 2 — Incident post-mortems review

Every incident produces a post-mortem (Article 25). The architect reads each post-mortem through an architectural lens:

  • Was this a design gap or a situational failure?
  • Is the mitigation scoped only to this agent, or should it be a platform-level change?
  • Does the incident reveal a pattern we should catalog (so the next team avoids it)?
  • Are the action items appropriately owned?

The architect aggregates incidents quarterly. Patterns (e.g., three incidents in the past quarter relating to retrieved-content injection) trigger platform-level work.
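The quarterly aggregation can be as simple as counting recurring root-cause patterns and flagging any that cross a threshold. A sketch, assuming a hypothetical post-mortem record with `quarter` and `pattern` fields (not a mandated schema):

```python
from collections import Counter

def platform_triggers(incidents, quarter, threshold=3):
    """Group a quarter's incidents by root-cause pattern; patterns that
    recur at or above the threshold become platform-level work items."""
    counts = Counter(i["pattern"] for i in incidents if i["quarter"] == quarter)
    return [pattern for pattern, n in counts.items() if n >= threshold]

incidents = [
    {"quarter": "2026-Q1", "pattern": "retrieved-content-injection"},
    {"quarter": "2026-Q1", "pattern": "retrieved-content-injection"},
    {"quarter": "2026-Q1", "pattern": "retrieved-content-injection"},
    {"quarter": "2026-Q1", "pattern": "tool-timeout"},
]
print(platform_triggers(incidents, "2026-Q1"))  # the injection pattern recurs 3x
```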

Evaluate contribution 3 — Evaluation-harness refresh

The evaluation harness (Article 17) ages. New attack patterns emerge; new regressions appear; user behaviour shifts. The architect schedules quarterly evaluation-harness refreshes:

  • New golden tasks added (for use cases now visible in production traffic).
  • Obsolete golden tasks retired.
  • Adversarial battery updated with newly public attack patterns.
  • Calibration scored on fresh traffic.
  • Fairness metrics re-computed on fresh data.
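The add/retire decision for golden tasks can be driven by production traffic. A sketch under two assumptions not stated in the harness description (Article 17): each golden task maps to a named use case, and a per-use-case traffic share is available:

```python
def refresh_golden_tasks(golden, traffic_shares, min_traffic_share=0.01):
    """Quarterly refresh sketch: retire golden tasks whose use case has
    vanished from production; propose tasks for use cases now visible
    in traffic. `traffic_shares` maps use-case name -> share of traffic."""
    live = {uc for uc, share in traffic_shares.items()
            if share >= min_traffic_share}
    retained = [t for t in golden if t in live]       # still worth testing
    to_add = sorted(live - set(golden))               # visible but untested
    return retained, to_add

golden = ["refund-status", "address-change", "fax-request"]
traffic = {"refund-status": 0.40, "address-change": 0.20,
           "delivery-eta": 0.15, "fax-request": 0.0}
retained, to_add = refresh_golden_tasks(golden, traffic)
# fax-request is retired; delivery-eta needs new golden tasks
```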

Evaluate contribution 4 — Autonomy-expansion review

The most consequential Evaluate conversation is “should this agent do more?” Common expansion proposals:

  • Raise the HITL threshold (e.g., from $500 to $2000).
  • Add new tools (e.g., email-send tool where currently draft-only).
  • Expand user base (e.g., from internal to external customers).
  • Expand language coverage.
  • Expand to new geographies with different regulatory envelopes.

The architect brings the decision back to the autonomy framework (Article 2) and to evidence:

  • What does current performance tell us? Goal-achievement and HITL-override rates at the current scope.
  • What is the incremental blast radius? At the new scope, what is worst-case harm?
  • What controls would need to change? New HITL thresholds, new runbooks, new evaluation coverage.
  • What is the rollback plan if expansion shows harm? Specific metrics, specific thresholds.
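The four evidence questions above can be enforced as a hard gate: no answer, no expansion. A sketch with illustrative keys (the key names are assumptions, not part of the Article 2 framework):

```python
def expansion_decision(evidence):
    """Gate sketch: approve an autonomy expansion only when every
    evidence question has a non-empty answer; otherwise defer and
    report what is missing."""
    required = ["current_performance", "incremental_blast_radius",
                "control_changes", "rollback_plan"]
    missing = [k for k in required if not evidence.get(k)]
    return ("approve", []) if not missing else ("defer", missing)

evidence = {
    "current_performance": "goal-achievement 93%, HITL override 2.1%",
    "incremental_blast_radius": "worst case: $2000 erroneous refund",
    "control_changes": "new HITL threshold, refreshed eval coverage",
    "rollback_plan": "",  # not yet specified -> the decision defers
}
decision, missing = expansion_decision(evidence)
```

The point of the gate is asymmetry: an expansion can always wait a quarter; an under-evidenced expansion cannot always be safely unwound.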

Evaluate gate-review conversation

Evaluate gate reviews are typically quarterly per agent. Agenda:

  1. Production metrics readout.
  2. Incident review and post-mortem themes.
  3. Evaluation-harness refresh findings.
  4. Autonomy-expansion or contraction proposals (if any).
  5. Platform-level findings (what should propagate to other agents).
  6. Action items and owners.

Learn — what the architect contributes

Learn is about compounding knowledge across the portfolio. The architect has three contributions.

Learn contribution 1 — Platform evolution

The architect translates Evaluate findings into platform roadmap items:

  • A new safety classifier that would have caught three incidents.
  • A new observability signal multiple teams requested.
  • A new policy-engine feature that would eliminate a class of authorization gaps.
  • A new registry field that captures lineage information currently missing.

Platform evolution proposals are sized, prioritised, and funded against the COE’s platform budget.

Learn contribution 2 — Pattern library updates

The architect curates the pattern library with new patterns and anti-patterns learned:

  • New anti-patterns from incidents (“goal-hijack via JSON-formatted user input — here is the mitigation”).
  • New patterns for autonomy expansion (“three-step expansion with gradual HITL-threshold change — here is the template”).
  • New sector-specific patterns as new domains join the portfolio.

Pattern-library updates are broadcast across product teams in a regular rhythm (monthly architect-guild meeting, published bulletin).

Learn contribution 3 — Retirement planning

Agents eventually retire. Causes:

  • Use case no longer valuable.
  • Better alternative available.
  • Unsustainable operating cost.
  • Persistent quality problems.
  • Regulatory changes that make the use case impractical.

The architect produces retirement plans:

  • Sunset schedule with notice to users.
  • Migration path where applicable (hand to alternative agent, hand to human workflow).
  • Archive plan for audit trails (Article 28 retention applies).
  • Lessons-learned post-mortem for the retirement decision (what led here, what to watch in the next agent).
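A retirement plan can be checked for completeness against the four components above. A sketch with hypothetical field names:

```python
def retirement_gaps(plan):
    """Checklist sketch: a retirement plan is complete only when all four
    components are present (keys are illustrative, not a mandated schema)."""
    required = ["sunset_schedule", "migration_path",
                "audit_archive", "lessons_learned"]
    return [k for k in required if k not in plan]

plan = {"sunset_schedule": "90-day notice to users",
        "audit_archive": "retain per Article 28"}
gaps = retirement_gaps(plan)  # migration path and lessons-learned still missing
```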

Timeline of a single agent’s lifecycle (figure not reproduced)

Feedback loops that keep Evaluate-Learn credible

Three feedback loops turn Evaluate + Learn from paperwork into compound value.

Loop 1 — Incident to evaluation. Every incident produces a new adversarial test in the evaluation battery. This pattern (inspired by Google SRE “learn from incidents”) ensures the same root cause cannot produce an incident twice.

Loop 2 — Post-mortem to platform. Every post-mortem with architectural implications produces a platform roadmap item. The architect tracks the percentage of post-mortem action items that reach the platform; 30–50% is a healthy range.
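The 30–50% tracking metric is a one-line computation. A sketch assuming each action item carries a `scope` tag (an illustrative schema, not prescribed by the post-mortem format):

```python
def platform_propagation_rate(action_items):
    """Share of post-mortem action items that became platform-level work.
    The article suggests 30-50% as a healthy band."""
    if not action_items:
        return 0.0
    platform = sum(1 for item in action_items if item["scope"] == "platform")
    return platform / len(action_items)

items = [{"scope": "platform"}, {"scope": "product"}, {"scope": "product"},
         {"scope": "platform"}, {"scope": "product"}]
rate = platform_propagation_rate(items)  # 0.4 -> within the healthy band
```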

Loop 3 — Product to pattern library. Every new pattern or anti-pattern discovered in a product team gets curated into the shared library. The architect’s role is to recognise generalizable patterns and to prevent the library from becoming a dumping ground.

Regulatory-driven refresh cadences

Beyond quarterly evaluation-harness refreshes, specific regulatory regimes impose their own refresh obligations that the architect tracks:

  • EU AI Act Article 17 (quality management system). Requires ongoing processes for data-set management, training data documentation, risk management — the architect ensures the Evaluate cycle touches each.
  • EU AI Act Article 61 (post-market monitoring). Requires a post-market monitoring plan producing periodic reports to the provider and, on demand, to authorities. The Evaluate-stage output feeds these reports directly.
  • SR 11-7 / PRA SS1/23. Require annual (or more frequent depending on materiality) model reviews; validator partnership during Evaluate ensures this is not a last-minute scramble.
  • MDR post-market surveillance. For clinical-decision-support agents under MDR scope, post-market surveillance plans demand trend analysis on adverse events. The Evaluate cycle feeds the surveillance workflow.
  • NIST AI RMF MANAGE function. Recommends governance of identified risks across the lifecycle. Evaluate is where MANAGE gets operational teeth.

The architect keeps a simple rolling calendar — agent × refresh obligation × next-due date — and escalates to legal / compliance when an obligation is at risk of slipping.

Architect readouts in Learn

Annual architect readouts give senior stakeholders a view of the portfolio. Typical annual readout includes:

  • Portfolio overview (agents by maturity, by autonomy, by sector).
  • Incident summary (total; by class; severity distribution).
  • Platform investments and their impact.
  • Pattern-library updates.
  • Risks and watchouts for next year (upcoming regulatory changes, emerging attack patterns, talent pipeline).

Real-world references

Google SRE Book postmortem culture chapter. The canonical reference for blameless post-mortems. Agentic systems inherit the SRE tradition; architect-led post-mortem reviews use SRE vocabulary.

Amazon Correction of Errors (COE) template. Amazon’s post-mortem format is public in various forms; many agentic teams adopt it or a variant. The architect customises a template for agentic specifics.

Public OpenAI, Anthropic, and Microsoft post-incident materials. Model-provider post-incidents are informative for platform-level learnings; the architect reads them even when the incident is not in the organisation’s own agent.

AI Incident Database (AIID). A public database of AI-related incidents, including agentic cases. A useful external check for the architect’s own pattern library.

Anti-patterns to reject

  • “Evaluate is an SRE review.” SRE leads operational review; the architect leads architectural review.
  • “Post-mortems sit in a folder.” Unread post-mortems are unlearned lessons.
  • “Autonomy can only expand.” Contraction is a valid outcome of Evaluate; the architect preserves the option.
  • “Retirement means we failed.” Retirement is a disciplined end-of-life; failure is what unretired agents do while no one is looking.
  • “Learn is an annual PowerPoint.” Learn is a continuous loop; the annual readout is a marker, not the work.

Learning outcomes

  • Explain the Evaluate and Learn gate artefacts for agentic systems, including the architect’s four Evaluate contributions and three Learn contributions.
  • Classify four post-launch architect activities (production metrics review, post-mortem review, evaluation-harness refresh, autonomy-expansion review) by purpose and cadence.
  • Evaluate an Evaluate review for architectural signal adequacy — is the review surfacing platform-level findings or only product-level ones?
  • Design the architect’s Learn readout for a portfolio of agents, including retirement decisions, platform-evolution proposals, and pattern-library updates.

Further reading

  • Core Stream anchors: EATF-Level-1/M1.2-Art05-Evaluate-Measuring-Transformation-Progress.md; EATF-Level-1/M1.2-Art06-Learn-Sustaining-Transformation-Momentum.md.
  • AITE-ATS siblings: Article 17 (evaluation), Article 18 (SLOs), Article 24 (lifecycle), Article 25 (incident response), Article 36 (Calibrate + Organize), Article 37 (Model + Produce).
  • Primary sources: Google SRE Book postmortem culture; Amazon Correction of Errors template (public references); AI Incident Database (AIID).