COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert — Article 2 of 40
Thesis. “Is it agentic?” is the wrong question once a system has crossed the threshold Article 1 drew. The operational questions are how agentic and what kind of agentic. Autonomy is a spectrum; agent type is a taxonomy. Together they give the architect a classification that maps directly to governance depth, oversight pattern, evaluation cadence, and EU AI Act Article 14 obligations. An agent at L2 in the task category needs a different pack than an agent at L4 in the code category, even if both run on the same LangGraph runtime.
The six-level autonomy spectrum
The spectrum is a working model the credential uses throughout. It draws on the autonomy-classification lineage of US Department of Defense Directive 3000.09 (autonomy in weapon systems — cited as a public reference for definitional rigour, not to map weapons practice onto commerce), the AI Safety Levels of the Anthropic Responsible Scaling Policy, and the risk thresholds of the OpenAI Preparedness Framework. It is technology-neutral: the level is a property of the feature, not of the orchestration framework.
- L0 — manual-with-AI. Human drives; AI advises. A knowledge worker drafting an email with an LLM suggestion panel is at L0. There is no loop, no tool call, no memory. Governance burden: classical model-card and bias testing only.
- L1 — assisted. Human drives; AI proposes actions the human approves one at a time. A code-completion suggestion in an IDE — accept or reject — is L1. Still not agentic by Article 1’s screen; no loop.
- L2 — bounded executor. AI executes within a pre-approved tool sandbox under synchronous human approval for every consequential action. GitHub Copilot Chat writing a patch the developer must accept is L2. Customer-service agent that drafts refunds up to $X with human approval is L2 for the refund decision. Light agentic pack required — tool schema, authorization, audit.
- L3 — supervised executor. AI plans and executes multi-step sequences. A human reviews outcomes, not every step. Replit AI Agent completing a feature is L3. Most enterprise customer-service agents with bounded refund authority are L3 for the refund-under-threshold path. Full pack required — kill-switch, HITL gate on high-risk actions, observability, behavioural regression.
- L4 — autonomous executor. AI executes for extended periods without per-action supervision. Humans define guardrails; the system operates within them. Salesforce Agentforce SDR pursuing a lead list over days, Klarna customer-service bot handling 700-FTE-equivalent volume (public 2024 report), and most production multi-agent back-office systems sit at L4. Article 14 oversight design is non-trivial; incident response becomes first-class.
- L5 — self-directing. AI sets its own sub-goals, acquires new tools, and operates across long horizons. No enterprise production system is genuinely at L5 in 2025; any vendor marketing claiming otherwise is either using “L5” loosely or describing L4 with a longer horizon. The credential covers L5 for pre-production research agents and for future-state planning, not for present deployment.
The critical jumps: L1 → L2 (first tool execution, first agentic governance needed); L2 → L3 (first loop, first kill-switch requirement, first observability spec); L3 → L4 (first asynchronous operation, first Article 14 design work, first coordination-failure surface if multi-agent). The jump from L4 to L5 is not a near-term design concern for most enterprise architects.
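The cumulative logic of the spectrum — each level inherits the obligations of the levels below it and introduces new ones at the jumps — can be sketched in code. This is an illustrative model only; the enum names and obligation labels are assumptions drawn from the level descriptions above, not a normative list.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    """Six-level autonomy spectrum (L0-L5) from this article."""
    L0_MANUAL_WITH_AI = 0
    L1_ASSISTED = 1
    L2_BOUNDED_EXECUTOR = 2
    L3_SUPERVISED_EXECUTOR = 3
    L4_AUTONOMOUS_EXECUTOR = 4
    L5_SELF_DIRECTING = 5


# Obligations introduced AT each level; a feature classified at level N
# carries the union of everything introduced at or below N.
OBLIGATIONS_INTRODUCED = {
    AutonomyLevel.L0_MANUAL_WITH_AI: ["model card", "bias testing"],
    AutonomyLevel.L1_ASSISTED: [],
    AutonomyLevel.L2_BOUNDED_EXECUTOR: [
        "tool schema", "authorization", "audit log"],
    AutonomyLevel.L3_SUPERVISED_EXECUTOR: [
        "kill-switch", "HITL gate on high-risk actions",
        "observability spec", "behavioural regression"],
    AutonomyLevel.L4_AUTONOMOUS_EXECUTOR: [
        "Article 14 oversight design", "incident response"],
    AutonomyLevel.L5_SELF_DIRECTING: [
        "research-agent containment (pre-production only)"],
}


def governance_pack(level: AutonomyLevel) -> list:
    """Cumulative obligations for a feature classified at `level`."""
    pack = []
    for l in AutonomyLevel:
        if l <= level:
            pack.extend(OBLIGATIONS_INTRODUCED[l])
    return pack
```

Note how the critical jumps fall out of the table: `governance_pack` first contains a kill-switch at L3 and first contains Article 14 oversight design at L4.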
Diagram 1 — Autonomy × agent taxonomy matrix
                          AUTONOMY LEVEL
                L0   L1   L2   L3   L4   L5
              ┌────┬────┬────┬────┬────┬────┐
Conversational│ CX │ CX │ CSV│ CSV│ KLA│ -- │ CX = LLM-chat draft; CSV = customer-service; KLA = Klarna 2024
              ├────┼────┼────┼────┼────┼────┤
Task          │ -- │ -- │ TSK│ TSK│ TSK│ -- │ TSK = per-ticket task agent
              ├────┼────┼────┼────┼────┼────┤
Workflow      │ -- │ -- │ WF │ WF │ WF │ -- │ WF = multi-step business process
              ├────┼────┼────┼────┼────┼────┤
RPA-adjacent  │ -- │ -- │ RPA│ RPA│ RPA│ -- │ RPA = legacy automation + LLM replan
              ├────┼────┼────┼────┼────┼────┤
Research      │ -- │ -- │ -- │ RES│ RES│ -- │ RES = deep research assistants
              ├────┼────┼────┼────┼────┼────┤
Code          │ GHC│ GHC│ CHT│ REP│ DEV│ -- │ GHC = Copilot; CHT = Copilot Chat; REP = Replit; DEV = Devin
              ├────┼────┼────┼────┼────┼────┤
Embodied      │ -- │ -- │ -- │ CCU│ -- │ -- │ CCU = Claude Computer Use (Anthropic); robotics
              └────┴────┴────┴────┴────┴────┘
Reading the matrix: any single cell gives a starting governance spec. A conversational agent at L3 and a code agent at L3 sit at the same level but require very different packs — the conversational agent needs disclosure plus cancellation (Article 52), while the code agent needs sandbox plus rollback (Articles 16, 21). The matrix forces the architect to answer both axes, not one.
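The two-axis lookup can be made concrete as a table keyed on (category, level). The cell contents below are illustrative assumptions drawn from the matrix discussion, not a normative mapping; cells not yet populated fall back to manual classification.

```python
# Starting governance spec per matrix cell: (category, autonomy level).
STARTING_SPEC = {
    ("conversational", 3): ["disclosure (Article 52)", "cancellation path",
                            "HITL gate on consequential actions"],
    ("code", 3):           ["execution sandbox", "rollback",
                            "human review on merge"],
    ("workflow", 4):       ["full pack", "Article 14 oversight design",
                            "coordination-failure monitoring"],
}


def starting_spec(category: str, level: int) -> list:
    """Return the starting governance spec for one matrix cell."""
    return STARTING_SPEC.get((category.lower(), level),
                             ["no preset cell - classify manually"])
```

Same level, different category, different pack: `starting_spec("conversational", 3)` and `starting_spec("code", 3)` share nothing, which is the point of asking both axes.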
The seven-category taxonomy
The taxonomy is the second axis. Each category has a distinct failure-mode signature and a characteristic set of tools.
- Conversational agents. Customer-service, internal-help-desk, advisor bots. Primary tools: CRM, knowledge base, ticketing, email. Primary failure modes: goal hijacking, hallucinated policy commitment (Moffatt v. Air Canada, 2024 BCCRT 149; Chevrolet of Watsonville $1 Tahoe, December 2023), reputational goal-hijack (DPD chatbot, January 2024). Oversight pattern: HOTL with HITL gate on consequential actions.
- Task agents. Scoped to one task-type — draft a memo, reconcile an invoice, triage an alert. Primary tools: one or two line-of-business APIs. Primary failure modes: scope creep, memory poisoning if memory persists across tasks. Oversight pattern: HITL at task-exit gate.
- Workflow agents. Execute multi-step business processes — new-hire onboarding, mortgage origination, claims intake. Primary tools: multiple LOB systems coordinated. Primary failure modes: coordination failures, partial-completion recovery. Oversight pattern: HITL at workflow transition gates.
- RPA-adjacent agents. Legacy RPA bots with LLM re-planning when a UI element is missing or a branch fails. Primary tools: screen-scraping, legacy APIs. Primary failure modes: LLM replans a sensitive step the original script never did. Oversight pattern: constrain LLM replan to non-sensitive branches only.
- Research agents. Deep-research assistants — Perplexity-style search + reason + synthesise. Primary tools: web search, document retrieval, vector stores. Primary failure modes: indirect prompt injection via retrieved content (Article 14), source hallucination. Oversight pattern: citation validation + source-authorization policy.
- Code agents. Code-completion (L1), chat (L2), pull-request generation (L3), full-task (L4). Primary tools: filesystem, test runners, version control, CI. Primary failure modes: sandbox escape, destructive operations, supply-chain injection via dependencies. Oversight pattern: sandbox + human review on merge.
- Embodied agents. Robotics, computer-use surfaces (Anthropic Claude Computer Use, October 2024), voice-in-the-loop. Primary tools: device drivers, screen-control APIs. Primary failure modes: physical/UI side effects, consent in shared environments. Oversight pattern: presence-aware kill-switch, bystander-protection design.
An agent can sit in two categories — a code agent with a conversational surface (Copilot Chat) — in which case the pack must cover both categories' failure modes, with the stricter control applying wherever they overlap.
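The dual-category rule — union the controls, and where both categories specify the same control keep the stricter setting — can be sketched as a merge. The category names, control names, and strictness ordering below are illustrative assumptions, not a normative control catalogue.

```python
# Per-category control sets (illustrative).
CATEGORY_CONTROLS = {
    "code":           {"sandbox": "required", "review_gate": "on-merge"},
    "conversational": {"disclosure": "required",
                       "review_gate": "on-consequential-action"},
}

# Strictness ordering for conflicting values: later entries rank stricter
# (an assumption for this sketch).
STRICTNESS = ["on-merge", "on-consequential-action"]


def merged_pack(categories: list) -> dict:
    """Union the controls of all categories; on conflict keep the stricter value."""
    merged = {}
    for cat in categories:
        for control, value in CATEGORY_CONTROLS[cat].items():
            if control not in merged:
                merged[control] = value
            else:
                # Both categories specify this control: keep the stricter one.
                merged[control] = max(merged[control], value,
                                      key=STRICTNESS.index)
    return merged
```

For a Copilot-Chat-style agent, `merged_pack(["code", "conversational"])` carries the sandbox from the code pack, the disclosure from the conversational pack, and the stricter of the two review gates.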
Diagram 2 — Timeline of autonomy escalation in one product
2021 ──── 2022 ──── 2023 ──── 2024 ──── 2025 ──── 2026
  │
  ├─ 2021  Copilot code completion in the IDE (L1)
  ├─ 2023  Copilot Chat (L2); Copilot X + PR features (L2/L3)
  ├─ 2024  Copilot Workspace (L3)
  └─ 2025  Copilot Agent mode (L3/L4)
GitHub Copilot’s public product history is the single clearest public example of autonomy escalation inside one product. The governance pack for each stage had to be rebuilt — the L1 code-completion pack is not sufficient for L2 Chat, and the L2 Chat pack is not sufficient for L3 Workspace. The architect’s lesson is that autonomy escalation is always a re-architecture, not a patch.
Classification worked example — eight systems
- ChatGPT custom GPT that summarises documents. Conversational, L0/L1. Classical.
- Enterprise copilot drafting emails with one-click send. Conversational, L2. Light agentic.
- Customer-service agent with refund authority up to $500. Conversational, L3 for the refund path, L2 otherwise. HITL gate on amounts above threshold.
- Agentforce SDR (Salesforce, 2024). Task/workflow, L4. Article 52 disclosure + Article 14 oversight design required.
- Replit AI Agent. Code, L3. Sandbox + memory versioning + postmortem culture.
- Devin (Cognition AI, 2024). Code, L4. Long-horizon evaluation + staged rollout.
- Claude Computer Use (Anthropic, October 2024). Embodied, L3. Safety card public; mitigations include click confirmation for high-risk actions.
- Multi-agent back-office system for an insurer (Case Study 3 exemplar). Workflow, L4. Full pack including conformity assessment under EU AI Act Annex III.5.
For each system, the classification yields a three-line governance directive: level, category, primary failure modes to design against. That directive is the input to the autonomy statement template — the first artefact in an agent’s documentation pack.
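The three-line governance directive described above can be rendered mechanically from a classification record. The field names and output format here are assumptions sketching the shape of the artefact, not the credential's actual template.

```python
from dataclasses import dataclass, field


@dataclass
class Classification:
    """One row of the worked example: level, category, failure modes."""
    system: str
    level: str                      # e.g. "L3 (refund path), L2 otherwise"
    category: str                   # e.g. "conversational"
    failure_modes: list = field(default_factory=list)


def directive(c: Classification) -> str:
    """Render the three-line governance directive that feeds the
    autonomy statement template."""
    return (
        f"Level: {c.level}\n"
        f"Category: {c.category}\n"
        f"Design against: {', '.join(c.failure_modes)}"
    )


refund_agent = Classification(
    system="customer-service refund agent",
    level="L3 (refund path), L2 otherwise",
    category="conversational",
    failure_modes=["goal hijacking", "hallucinated policy commitment"],
)
```

`directive(refund_agent)` yields exactly three lines — level, category, failure modes — which is the input the autonomy statement template expects.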
The stability test — “would the classification change under edge cases?”
An autonomy classification that looks stable at the happy path can drift under edge cases. The architect should stress-test every classification against four edge cases:
- Tool failure at L3. If a tool fails, does the agent escalate (stays at L3) or retry autonomously (drifts toward L4)?
- Prompt injection at L2. If a prompt injection arrives via a retrieved document, does the agent remain in its sandbox (L2) or execute a hijacked goal (effectively L4 without the governance)?
- Memory corruption at L3. If memory corrupts, does the system continue (L3 degraded) or halt (L3 stable)?
- User request outside scope at L3. If the user asks something outside scope, does the agent refuse (stable L3) or improvise (drifts)?
A classification that fails any stability test is not yet a design — it is a wish. The autonomy statement artefact records the stable classification plus the containment design for each drift vector.
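The four edge cases above lend themselves to a table-driven stability check. The behaviour labels below are stubs paraphrasing the edge-case list; in practice each scenario would be run against the real agent in a sandbox and the observed behaviour recorded.

```python
# Stable vs drifting behaviour per edge case (labels are illustrative).
STABILITY_CASES = {
    "tool_failure":         {"stable": "escalate",
                             "drift": "retry_autonomously"},
    "prompt_injection":     {"stable": "stay_in_sandbox",
                             "drift": "execute_hijacked_goal"},
    "memory_corruption":    {"stable": "halt",
                             "drift": "continue_degraded"},
    "out_of_scope_request": {"stable": "refuse",
                             "drift": "improvise"},
}


def stability_report(observed: dict) -> dict:
    """True per edge case where the observed behaviour matches the stable response."""
    return {case: observed.get(case) == expected["stable"]
            for case, expected in STABILITY_CASES.items()}


def classification_is_stable(observed: dict) -> bool:
    """A classification that fails any edge case is not yet a design."""
    return all(stability_report(observed).values())
```

A report with any `False` entry names the drift vector whose containment design the autonomy statement must record.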
Real-world anchors
GitHub Copilot — the public autonomy-escalation trajectory
GitHub’s public product announcements tracked Copilot from code-completion in 2021 (L1) through Copilot Chat in 2023 (L2) through Copilot Workspace in 2024 (L3) through Copilot Agent mode in late 2024/2025 (L3+/L4 depending on task). Each escalation required a new safety design — Chat needed filtering for harmful suggestions, Workspace needed scoped repo access and PR review gates, Agent needed sandbox + rollback. The relevant lesson: the architect who designs an L2 pack and calls it done will be doing it again at L3 and at L4. Public announcements: https://github.blog/ (Copilot evolution archive).
Anthropic Claude Computer Use — embodied L3 with public safety card
Anthropic’s October 2024 announcement of Claude Computer Use — a model that can take screenshots, move a cursor, and click — is the clearest public embodied-agent architecture in 2024–2025. The accompanying safety card discussed mitigations explicitly: click confirmation for high-risk actions, URL allowlists for browser operations, session-scoped memory, and prohibited-domain blocking. Autonomy classification: L3 embodied with HOTL supervision plus synchronous HITL on defined-risk actions. The public safety card is a teaching artefact for this article and for Article 8. Source: https://www.anthropic.com/news/3-5-models-and-computer-use.
Closing
The six-level spectrum plus the seven-category taxonomy give the architect a 42-cell classification grid. Most enterprise agents fall into a dozen cells. The autonomy statement — template provided in the credential’s artefact set — records the classification, the containment design for each drift vector, and the governance depth directive. Article 3 takes the next step: given a classification, what runtime pattern should the agent actually use?
Learning outcomes check
- Explain the six autonomy levels with the architectural obligation each introduces.
- Classify eight example systems against the spectrum and taxonomy.
- Evaluate a classification for stability under four edge cases.
- Design an autonomy statement for a proposed feature.
Cross-reference map
- Core Stream: EATF-Level-1/M1.2-Art20-Agent-Autonomy-Classification.md; EATF-Level-1/M1.4-Art11-Agentic-AI-Architecture-Patterns-and-the-Autonomy-Spectrum.md.
- Sibling credential: AITM-AAG Article 3 (governance-facing autonomy classification).
- Forward reference: Article 10 (oversight pattern selection); Article 23 (EU AI Act Article 14 mapping).