AITE M1.2-Art37 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Architect in Model and Produce Stages for Agentic Systems

Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

10 min read Article 37 of 53

Model and Produce are where the agent becomes real. They are also where shortcuts happen and where agentic discipline is most likely to degrade under schedule pressure. The architect’s deliverables are checklists and documents the team can follow without constant architect presence.

Model — what the architect contributes

Model is the build stage. The architect’s contribution is less about writing new architecture and more about ensuring the build faithfully implements the architecture.

Model contribution 1 — Reference-architecture reality check

At the midpoint of Model the architect reviews the in-progress implementation against the Organize-stage reference architecture:

  • Are the planned tools in the tool registry with approved schemas?
  • Is the policy engine integrated and evaluating tool calls?
  • Is observability live (trace IDs, prompt logs, tool-call logs)?
  • Is the kill-switch wired and testable?
  • Are memory namespaces isolated as designed?

Findings become explicit action items. The review is lightweight but substantive — typically a two-hour architect + tech lead walkthrough.
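
The walkthrough's findings can be captured as data so that gaps become trackable action items rather than meeting notes. A minimal sketch in Python; the check names mirror the bullets above, and the pass/fail values are illustrative:

```python
# Hypothetical sketch of the midpoint reality check as data. Each item from
# the reference architecture is recorded with its observed status; every gap
# becomes an explicit, trackable action item.

CHECKS = [
    ("planned tools in the tool registry with approved schemas", True),
    ("policy engine integrated and evaluating tool calls", True),
    ("observability live (trace IDs, prompt logs, tool-call logs)", False),
    ("kill-switch wired and testable", True),
    ("memory namespaces isolated as designed", False),
]

def action_items(checks):
    """Turn failed checks into explicit action items for the build team."""
    return [f"ACTION: close gap - {name}" for name, passed in checks if not passed]

for item in action_items(CHECKS):
    print(item)
```

The point of the data shape is that the same list runs again before the Model gate, so "reviewed" and "closed" stay distinguishable.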

Model contribution 2 — Evaluation plan finalization

The preliminary evaluation plan from Organize gets finalised in Model:

  • Golden tasks — final list, with pass criteria and scoring rubrics.
  • Adversarial battery — OWASP Agentic Top 10 coverage (Articles 8 and 14), MITRE ATLAS techniques, jailbreak prompts tuned to the use case.
  • Simulation environment — the test-fixture infrastructure is stood up.
  • Calibration evaluation — confidence vs accuracy plots with target calibration scores.
  • Fairness evaluation — for regulated use cases (Annex III), fairness metrics across protected attributes.
  • Cost and latency benchmarks — per-interaction cost, latency p50/p95/p99 on representative traffic.

The evaluation plan is itself a registry-tracked artefact (Article 26), versioned and reusable for subsequent agent versions.
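
Of the items above, the calibration evaluation is the one most often hand-waved. A minimal sketch of a calibration score, assuming an expected-calibration-error (ECE) style metric over confidence bins; the sample confidences, outcomes, and bin count are illustrative assumptions:

```python
# Sketch: compare the agent's reported confidence with observed accuracy in
# bins and compute expected calibration error (ECE). Lower is better-calibrated.

def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Weighted mean absolute gap between confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

confs = [0.95, 0.9, 0.85, 0.6, 0.55, 0.3]   # agent-reported confidence
hits  = [1,    1,   0,    1,   0,    0]      # task actually succeeded?
print(round(expected_calibration_error(confs, hits), 3))
```

The target calibration score in the plan is then a bound on this number, checked again at every subsequent agent version.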

Model contribution 3 — Kill-switch design and drill

Article 9’s kill-switch design becomes a drill-tested implementation in Model:

  • Synchronous kill-switch (ops-triggered, takes effect in ≤10s).
  • Asynchronous kill-switch (user-triggered per session).
  • Deadman switch (heartbeat loss triggers halt).
  • Circuit-breakers on tools (error rate, cost rate, latency).
  • Containment verification — a kill-switch drill runs through each control and confirms the agent halts.
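
The tool circuit-breakers above can be sketched as a rolling-window error-rate check; the window size and threshold here are illustrative assumptions, not prescribed values:

```python
# Illustrative circuit-breaker on a tool: trips when the error rate over a
# rolling window exceeds a threshold, blocking further calls until reset.

from collections import deque

class ToolCircuitBreaker:
    def __init__(self, window=10, max_error_rate=0.3):
        self.results = deque(maxlen=window)   # rolling success/failure history
        self.max_error_rate = max_error_rate
        self.open = False                     # open = calls blocked

    def record(self, success: bool):
        """Record one call outcome; trip the breaker if the window is unhealthy."""
        self.results.append(success)
        if len(self.results) == self.results.maxlen:
            error_rate = self.results.count(False) / len(self.results)
            if error_rate > self.max_error_rate:
                self.open = True

    def allow_call(self) -> bool:
        return not self.open

breaker = ToolCircuitBreaker(window=5, max_error_rate=0.4)
for ok in [True, False, True, False, False]:  # 3 failures in a 5-call window
    breaker.record(ok)
print(breaker.allow_call())  # 3/5 = 0.6 > 0.4, so the breaker is open
```

The containment drill then includes verifying that a tripped breaker actually halts the agent's calls to that tool, not just logs a warning.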

Model contribution 4 — Runbooks

The architect finalises runbooks for each of the six incident classes (Article 25): memory poisoning, goal hijacking, runaway loop, tool misuse, coordination failure, behavioral regression. Each runbook includes detect, contain, remediate, post-mortem steps and named owners.
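
The runbook contract can be made mechanical: every incident class carries the four steps and a named owner, and a completeness check guards the gate. A hedged sketch; the field names and sample content are illustrative:

```python
# Sketch of a runbook as structured data: detect, contain, remediate,
# post-mortem, plus a named owner. A completeness check refuses empty fields.

from dataclasses import dataclass, fields

@dataclass
class Runbook:
    incident_class: str
    detect: str
    contain: str
    remediate: str
    post_mortem: str
    owner: str

def is_complete(rb: Runbook) -> bool:
    """All steps and the owner must be non-empty to pass the gate."""
    return all(getattr(rb, f.name).strip() for f in fields(rb))

rb = Runbook(
    incident_class="runaway loop",
    detect="alert when per-session step count exceeds budget",
    contain="trip the deadman switch; quarantine the session",
    remediate="patch the loop guard; replay affected traces",
    post_mortem="timeline and root cause within five working days",
    owner="",  # missing owner: fails the completeness check
)
print(is_complete(rb))
```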

Model contribution 5 — ADR updates

New decisions surface during build. The architect captures them as ADR updates: “we chose MCP servers from vendor X rather than custom wrappers because Y.”

Model gate — what “done” means

Model-stage exit is not “code works”; it is:

  • The reference architecture is implemented faithfully, with any deltas documented.
  • Evaluation plan is finalised and initial evaluation results are available.
  • Kill-switch is drilled.
  • Runbooks are drafted and reviewed.
  • ADRs are up-to-date.
  • The independent model-risk validator (where applicable — Article 30) has begun their review.

Produce — what the architect contributes

Produce is the deployment stage. The architect’s contribution focuses on the production-readiness gate.

Produce contribution 1 — Production-readiness checklist

The architect owns the production-readiness checklist. The list is discipline-specific; below is a representative example for an agentic system:

  1. Evaluation. Golden-task pass rate ≥ threshold; adversarial battery acceptable pass rate; calibration score within target; fairness metrics within bounds.
  2. Security. OWASP Agentic Top 10 threat-model review complete (Article 27); penetration test with agentic-specific probes; secrets management reviewed.
  3. Observability. Traces propagating end-to-end; prompt/tool-call/memory logs retained with correct classification; alert rules defined and tested.
  4. Operational. On-call rotation staffed and trained; runbooks drilled; kill-switch drilled; fallback mode tested; SLO baselines captured.
  5. Compliance. Article 14 evidence pack complete (Article 23); Article 50 disclosure surfaces live; DPIA complete; ATRS record drafted if public-sector (Article 32); MDR conformity submission filed if healthcare (Article 31); SR 11-7 validation complete if financial services (Article 30).
  6. Registry. Agent record promoted to active; prompt/tool/memory references versioned; lineage complete (Article 26).
  7. Cost. Cost per interaction within target; spend alerts configured; rate limits active (Article 19).
  8. Training. Operators trained on runbooks; HITL reviewers trained on workflow.
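
Item 1 can be expressed as machine-checkable thresholds so the gate fails loudly rather than by interpretation. A sketch; every threshold value here is an illustrative assumption, not a COMPEL-mandated number:

```python
# Sketch of the evaluation portion of the production-readiness checklist as
# explicit thresholds. "min" means the metric must meet or exceed the bound;
# "max" means it must not exceed it.

THRESHOLDS = {
    "golden_task_pass_rate": ("min", 0.95),
    "adversarial_pass_rate": ("min", 0.90),
    "calibration_ece":       ("max", 0.05),
}

def evaluation_gate(metrics):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    for name, (kind, bound) in THRESHOLDS.items():
        value = metrics[name]
        ok = value >= bound if kind == "min" else value <= bound
        if not ok:
            failures.append(f"{name}={value} violates {kind} bound {bound}")
    return failures

print(evaluation_gate({"golden_task_pass_rate": 0.97,
                       "adversarial_pass_rate": 0.88,
                       "calibration_ece": 0.04}))
```

Writing the bounds down as data also gives the independent validator something concrete to disagree with, which is the point.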

Produce contribution 2 — Canary plan

Article 24’s promotion discipline applies: the agent promotes through dev → staging → canary → gradual rollout → full production. The architect defines:

  • Canary traffic percentage. Typically 1–5%.
  • Canary duration. Minimum hours/days before expansion.
  • Canary metrics. Goal-achievement rate, tool-call anomaly rate, cost per interaction, HITL-trigger rate, user-feedback sentiment.
  • Rollback trigger. Specific thresholds that automatically halt expansion.
  • Expansion schedule. 1% → 5% → 25% → 100% with gates at each.
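
The expansion schedule and rollback triggers combine into one decision function. A sketch using the stage percentages above; the metric names echo the canary metrics, and the thresholds are illustrative assumptions:

```python
# Sketch: expand one stage (1 -> 5 -> 25 -> 100) only if every rollback
# trigger is healthy; otherwise halt expansion and roll back to 0%.

STAGES = [1, 5, 25, 100]

ROLLBACK_TRIGGERS = {
    "goal_achievement_rate":  ("min", 0.90),
    "tool_call_anomaly_rate": ("max", 0.02),
    "cost_per_interaction":   ("max", 0.15),
}

def next_step(current_pct, metrics):
    """Return ("expand", next_pct) or ("rollback", 0). current_pct must be a stage."""
    for name, (kind, bound) in ROLLBACK_TRIGGERS.items():
        value = metrics[name]
        healthy = value >= bound if kind == "min" else value <= bound
        if not healthy:
            return ("rollback", 0)
    idx = STAGES.index(current_pct)
    return ("expand", STAGES[min(idx + 1, len(STAGES) - 1)])

print(next_step(5, {"goal_achievement_rate": 0.93,
                    "tool_call_anomaly_rate": 0.01,
                    "cost_per_interaction": 0.12}))  # ('expand', 25)
```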

Produce contribution 3 — Article 14 evidence-pack finalisation

For EU AI Act high-risk systems (Article 23), the evidence pack must be finalised before Produce-gate sign-off:

  • Design evidence — specification, threat model, HITL design.
  • Documentation evidence — operator manual, user instructions, overseer training materials.
  • Monitoring evidence — evaluation results, drift monitoring plan, incident-response plan.
  • Training evidence — records of overseer training completion.
  • Change-management evidence — version histories, promotion decisions, validation sign-offs.

Produce contribution 4 — Post-launch observability readout

Before Produce gate sign-off the architect confirms the observability dashboards show exactly what Learn and Evaluate stages will need. If a dashboard will be useful in month 3 but is missing at launch, add it at launch — not later.

Stage × artefact × owner matrix

Common failure patterns in Model and Produce

Pattern 1 — “It passes internal tests; ship it.” Internal tests measure what you thought to test. The adversarial battery and the red-team (Article 17) measure what an attacker thought of. Both are required.

Pattern 2 — Runbook-by-wiki. Runbooks written on a wiki and not drilled are runbooks that will fail in an incident. Drill.

Pattern 3 — “The model is the agent.” Teams sometimes treat model upgrade as the agent’s release; the prompt + tool + memory + policy + runtime context is the agent. Version all of it.

Pattern 4 — Production-readiness as signoff. Ceremonial checklists erode fast. The architect keeps the checklist material, removing items that lose meaning and adding items informed by real incidents.

Pattern 5 — Canary too short. Many incidents surface at p95/p99 latencies the canary never samples because the canary ran for an hour. Run the canary long enough to see the traffic’s tails.
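
The arithmetic behind this pattern is worth making explicit: with a small canary fraction, tail events are rare in absolute terms, so duration is what buys samples. A back-of-envelope sketch with illustrative traffic figures:

```python
# Back-of-envelope: expected number of p99-tail requests a canary observes.
# All traffic figures here are illustrative assumptions.

def expected_tail_samples(requests_per_hour, canary_fraction, tail_fraction, hours):
    """Expected count of tail-latency requests seen by the canary."""
    return requests_per_hour * canary_fraction * tail_fraction * hours

# 10,000 req/h total, 1% canary, looking for p99 (1%) tail behaviour:
one_hour = expected_tail_samples(10_000, 0.01, 0.01, hours=1)
one_week = expected_tail_samples(10_000, 0.01, 0.01, hours=24 * 7)
print(one_hour, one_week)  # roughly 1 tail observation in an hour vs ~168 in a week
```

One observation is anecdote; a week of them is a distribution you can set rollback triggers against.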

The architect’s partnership with independent validators

For regulated use cases (Article 30 SR 11-7; Article 31 MDR-governed; Article 32 public-sector high-risk), independent validation is not a rubber stamp. The architect’s partnership with validators during Model and Produce shapes whether the agent ships on time.

Ground rules the architect establishes upfront:

  • Validators see the reference architecture during Organize, not at Produce gate.
  • Validators specify the evidence they need; the architect produces it proactively during Model.
  • Disagreements go to named escalation (typically CRO or ethics committee), not to a shouting match at gate time.
  • The validator’s independent evaluation battery is run before the Produce gate, not during it.

What validators usually ask for:

  • The threat model walk-through (Article 27).
  • Evidence the evaluation battery covers known failure modes.
  • Access to production-comparable traces from canary.
  • Fairness analysis where protected attributes matter.
  • Change-control plan with the rationale for predetermined vs submission-triggering changes.

Where disputes often surface:

  • Adequacy of the evaluation battery — validators may want more coverage than the team planned.
  • Calibration — validators often find confidence reports over-stated in agentic systems.
  • HITL threshold placement — validators may recommend lower thresholds than the product team proposes.

The architect’s posture is “engage early, document rigorously.” Late-stage validator objections on material concerns should be rare because the validator has been visible throughout.

Real-world references

EBA guidelines on ICT and security risk management (2019; updates and DORA alignment 2024). Financial-services operational-resilience guidance with production-readiness concepts aligned with DORA. Agentic systems inherit these.

Microsoft AI production readiness materials (public). Microsoft publishes AI-specific production readiness guidance drawing on Azure AI Foundry experience; useful template material.

Cognition AI Devin public production-rollout discussion (2024). Cognition wrote publicly about the gap between demo and reliable production use of their task-level SWE agent; the lessons are broadly applicable to Produce-stage expectations.

Anti-patterns to reject

  • “We’ll update the evaluation plan after launch.” An evaluation plan that can be skipped will be skipped. Finalise in Model.
  • “Kill-switch code is written but never tested.” Untested is broken.
  • “Runbook is the SRE team’s responsibility.” It is SRE’s to execute; it is the architect’s to design.
  • “Production-readiness is a checkbox.” The architect defends the substance; left to itself, the organisation will defend only the ceremony.
  • “Canary for 30 minutes.” Traffic tails live in the night and the weekend; run the canary through at least one full weekly cycle where possible.

Learning outcomes

  • Explain Model and Produce gate artefacts for agentic systems and the architect’s role in each.
  • Classify six architecture deliverables by the stage where they are drafted, committed, exercised, reviewed, or updated.
  • Evaluate a Produce-exit package for production-readiness-checklist completeness, Article 14 evidence-pack adequacy, and canary-plan soundness.
  • Design a production-readiness checklist and canary plan for a given agentic system, including Article 14 evidence-pack references and rollback triggers.

Further reading

  • Core Stream anchors: EATF-Level-1/M1.2-Art03-Model-Designing-the-Transformation.md; EATF-Level-1/M1.2-Art04-Produce-Executing-the-Transformation.md.
  • AITE-ATS siblings: Article 9 (kill-switch), Article 17 (evaluation), Article 23 (EU AI Act evidence), Article 24 (lifecycle promotion), Article 25 (incident response), Article 36 (Calibrate + Organize).
  • Primary sources: EBA guidelines on ICT and security risk management; DORA (Regulation (EU) 2022/2554); Microsoft AI production readiness materials; Cognition AI Devin public production-rollout discussion.