AITE M1.2-Art02 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Autonomy Spectrum and Agent Taxonomy

Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

10 min read · Article 2 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert


Thesis. “Is it agentic?” is the wrong question once a system has crossed the threshold Article 1 drew. The operational questions are how agentic and what kind of agentic. Autonomy is a spectrum; agent type is a taxonomy. Together they give the architect a classification that maps directly to governance depth, oversight pattern, evaluation cadence, and EU AI Act Article 14 obligations. An agent at L2 in the task category needs a different pack than an agent at L4 in the code category, even if both run on the same LangGraph runtime.

The six-level autonomy spectrum

The spectrum is a working model the credential uses throughout. It draws on the autonomy-classification lineage from US Department of Defense Directive 3000.09 (autonomy in weapon systems — a public policy reference for definitional rigour, not for mapping weapons to commerce), the capability levels of the Anthropic Responsible Scaling Policy, and the levels of the OpenAI Preparedness Framework. It is technology-neutral — the level is a property of the feature, not of the orchestration framework.

  • L0 — manual-with-AI. Human drives; AI advises. A knowledge worker drafting an email with an LLM suggestion panel is at L0. There is no loop, no tool call, no memory. Governance burden: classical model-card and bias testing only.
  • L1 — assisted. Human drives; AI proposes actions the human approves one at a time. A code-completion suggestion in an IDE — accept or reject — is L1. Still not agentic by Article 1’s screen; no loop.
  • L2 — bounded executor. AI executes within a pre-approved tool sandbox under synchronous human approval for every consequential action. GitHub Copilot Chat writing a patch the developer must accept is L2. A customer-service agent that drafts refunds up to $X, each requiring human approval, is L2 for the refund decision. Light agentic pack required — tool schema, authorization, audit.
  • L3 — supervised executor. AI plans and executes multi-step sequences. A human reviews outcomes, not every step. Replit AI Agent completing a feature is L3. Most enterprise customer-service agents with bounded refund authority are L3 for the refund-under-threshold path. Full pack required — kill-switch, HITL gate on high-risk actions, observability, behavioural regression.
  • L4 — autonomous executor. AI executes for extended periods without per-action supervision. Humans define guardrails; the system operates within them. Salesforce Agentforce SDR pursuing a lead list over days, Klarna customer-service bot handling 700-FTE-equivalent volume (public 2024 report), and most production multi-agent back-office systems sit at L4. Article 14 oversight design is non-trivial; incident response becomes first-class.
  • L5 — self-directing. AI sets its own sub-goals, acquires new tools, and operates across long horizons. No enterprise production system is genuinely at L5 in 2025; any vendor marketing claiming otherwise is either using “L5” loosely or describing L4 with a longer horizon. The credential covers L5 for pre-production research agents and for future-state planning, not for present deployment.

The critical jumps: L1 → L2 (first tool execution, first agentic governance needed); L2 → L3 (first loop, first kill-switch requirement, first observability spec); L3 → L4 (first asynchronous operation, first Article 14 design work, first coordination-failure surface if multi-agent). The jump from L4 to L5 is not a near-term design concern for most enterprise architects.
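The cumulative structure of the jumps can be sketched as a lookup: each level inherits every obligation below it and adds its own. This is a minimal sketch; the control names are illustrative shorthand, not the credential's actual pack contents.

```python
from enum import IntEnum

class Autonomy(IntEnum):
    """Six-level autonomy spectrum from this article."""
    L0 = 0  # manual-with-AI
    L1 = 1  # assisted
    L2 = 2  # bounded executor
    L3 = 3  # supervised executor
    L4 = 4  # autonomous executor
    L5 = 5  # self-directing

# Obligations each level introduces (hypothetical names).
NEW_OBLIGATIONS = {
    Autonomy.L0: {"model_card", "bias_testing"},
    Autonomy.L1: set(),  # still classical governance; no new agentic controls
    Autonomy.L2: {"tool_schema", "authorization", "audit_log"},
    Autonomy.L3: {"kill_switch", "hitl_gate", "observability",
                  "behavioural_regression"},
    Autonomy.L4: {"article_14_oversight_design", "incident_response"},
    Autonomy.L5: {"research_agent_governance"},  # pre-production only
}

def required_controls(level: Autonomy) -> set[str]:
    """Controls are cumulative: an L4 agent carries every L0-L3 obligation too."""
    controls: set[str] = set()
    for l in Autonomy:
        if l <= level:
            controls |= NEW_OBLIGATIONS[l]
    return controls
```

A classification then answers the governance-depth question mechanically: `required_controls(Autonomy.L3)` includes the kill-switch and observability obligations that `required_controls(Autonomy.L2)` does not.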

Diagram 1 — Autonomy × agent taxonomy matrix

                    AUTONOMY LEVEL
                L0   L1   L2   L3   L4   L5
              ┌────┬────┬────┬────┬────┬────┐
Conversational│ CX │ CX │ CSV│ CSV│ KLA│ -- │  CX = LLM-chat draft; CSV = customer-service; KLA = Klarna 2024
              ├────┼────┼────┼────┼────┼────┤
Task          │ -- │ -- │ TSK│ TSK│ TSK│ -- │  task = per-ticket agent
              ├────┼────┼────┼────┼────┼────┤
Workflow      │ -- │ -- │ WF │ WF │ WF │ -- │  multi-step business process
              ├────┼────┼────┼────┼────┼────┤
RPA-adjacent  │ -- │ -- │ RPA│ RPA│ RPA│ -- │  legacy automation + LLM replan
              ├────┼────┼────┼────┼────┼────┤
Research      │ -- │ -- │ -- │ RES│ RES│ -- │  deep research assistants
              ├────┼────┼────┼────┼────┼────┤
Code          │ GHC│ GHC│ CHT│ REP│ DEV│ -- │  GHC=Copilot, CHT=Copilot Chat, REP=Replit, DEV=Devin
              ├────┼────┼────┼────┼────┼────┤
Embodied      │ -- │ -- │ -- │ CCU│ -- │ -- │  CCU = Claude Computer Use (Anthropic); robotics
              └────┴────┴────┴────┴────┴────┘

Reading the matrix: any single cell gives a starting governance spec. Conversational-at-L3 and Code-at-L3 are the same level but require very different packs — conversational at L3 needs disclosure + cancellation (Article 52), while code at L3 needs sandbox + rollback (Articles 16, 21). The matrix forces the architect to ask both axes, not one.
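The two-axis reading can be sketched as a cell lookup. Only the two cells contrasted above are filled in here, with illustrative pack contents; a real grid would populate every occupied cell.

```python
# Hypothetical starting packs for two cells of the 42-cell matrix.
STARTING_PACK: dict[tuple[str, int], set[str]] = {
    ("conversational", 3): {"disclosure", "user_cancellation"},  # Art. 52 path
    ("code", 3):           {"sandbox", "rollback"},              # Arts. 16/21 path
}

def starting_pack(category: str, level: int) -> set[str]:
    """Both axes are required; a missing cell is a design gap, not a default."""
    try:
        return STARTING_PACK[(category, level)]
    except KeyError:
        raise ValueError(f"no starting pack defined for {category!r} at L{level}")
```

Raising on a missing cell rather than returning an empty set encodes the article's point: the matrix forces the architect to ask both axes before any pack exists.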

The seven-category taxonomy

The taxonomy is the second axis. Each category has a distinct failure-mode signature and a characteristic set of tools.

  1. Conversational agents. Customer-service, internal-help-desk, advisor bots. Primary tools: CRM, knowledge base, ticketing, email. Primary failure modes: goal hijacking, hallucinated policy commitment (Moffatt v. Air Canada, 2024 BCCRT 149; Chevrolet of Watsonville $1 Tahoe, December 2023), reputational goal-hijack (DPD chatbot, January 2024). Oversight pattern: HOTL with HITL gate on consequential actions.
  2. Task agents. Scoped to one task-type — draft a memo, reconcile an invoice, triage an alert. Primary tools: one or two line-of-business APIs. Primary failure modes: scope creep, memory poisoning if memory persists across tasks. Oversight pattern: HITL at task-exit gate.
  3. Workflow agents. Execute multi-step business processes — new-hire onboarding, mortgage origination, claims intake. Primary tools: multiple LOB systems coordinated. Primary failure modes: coordination failures, partial-completion recovery. Oversight pattern: HITL at workflow transition gates.
  4. RPA-adjacent agents. Legacy RPA bots with LLM re-planning when a UI element is missing or a branch fails. Primary tools: screen-scraping, legacy APIs. Primary failure modes: LLM replans a sensitive step the original script never did. Oversight pattern: constrain LLM replan to non-sensitive branches only.
  5. Research agents. Deep-research assistants — Perplexity-style search + reason + synthesise. Primary tools: web search, document retrieval, vector stores. Primary failure modes: indirect prompt injection via retrieved content (Article 14), source hallucination. Oversight pattern: citation validation + source-authorization policy.
  6. Code agents. Code-completion (L1), chat (L2), pull-request generation (L3), full-task (L4). Primary tools: filesystem, test runners, version control, CI. Primary failure modes: sandbox escape, destructive operations, supply-chain injection via dependencies. Oversight pattern: sandbox + human review on merge.
  7. Embodied agents. Robotics, computer-use surfaces (Anthropic Claude Computer Use, October 2024), voice-in-the-loop. Primary tools: device drivers, screen-control APIs. Primary failure modes: physical/UI side effects, consent in shared environments. Oversight pattern: presence-aware kill-switch, bystander-protection design.

An agent can sit in two categories — a code agent with a conversational surface (Copilot Chat) — in which case the pack must address each category’s failure modes, applying the stricter control wherever the two packs overlap.
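The oversight patterns listed above, together with the dual-category rule, can be sketched as a lookup. The pattern strings paraphrase the list; the merge helper is an assumption about how a dual-category pack would be assembled.

```python
# Oversight pattern per taxonomy category, paraphrased from the list above.
OVERSIGHT = {
    "conversational": "HOTL + HITL gate on consequential actions",
    "task":           "HITL at task-exit gate",
    "workflow":       "HITL at workflow transition gates",
    "rpa_adjacent":   "LLM replan constrained to non-sensitive branches",
    "research":       "citation validation + source-authorization policy",
    "code":           "sandbox + human review on merge",
    "embodied":       "presence-aware kill-switch + bystander protection",
}

def oversight_for(categories: list[str]) -> set[str]:
    """A dual-category agent (e.g. code + conversational) gets both patterns."""
    return {OVERSIGHT[c] for c in categories}
```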

Diagram 2 — Timeline of autonomy escalation in one product

2021 ──── 2022 ──── 2023 ──── 2024 ──── 2025 ──── 2026
  │         │         │         │         │
  │         │         │         │         │
Copilot   Copilot   Copilot   Copilot   Copilot
 (L1 IDE  + Chat    X + PR    Workspace Agent mode
 complete) (L2)     (L2/L3)   (L3)      (L3/L4)

GitHub Copilot’s public product history is the single clearest public example of autonomy escalation inside one product. The governance pack for each stage had to be rebuilt — the L1 code-completion pack is not sufficient for L2 Chat, and the L2 Chat pack is not sufficient for L3 Workspace. The architect’s lesson is that autonomy escalation is always a re-architecture, not a patch.

Classification worked example — eight systems

  1. ChatGPT custom GPT that summarises documents. Conversational, L0/L1. Classical.
  2. Enterprise copilot drafting emails with one-click send. Conversational, L2. Light agentic.
  3. Customer-service agent with refund authority up to $500. Conversational, L3 for the refund path, L2 otherwise. HITL gate on amounts above threshold.
  4. Agentforce SDR (Salesforce, 2024). Task/workflow, L4. Article 52 disclosure + Article 14 oversight design required.
  5. Replit AI Agent. Code, L3. Sandbox + memory versioning + postmortem culture.
  6. Devin (Cognition AI, 2024). Code, L4. Long-horizon evaluation + staged rollout.
  7. Claude Computer Use (Anthropic, October 2024). Embodied, L3. Safety card public; mitigations include click confirmation for high-risk actions.
  8. Multi-agent back-office system for an insurer (Case Study 3 exemplar). Workflow, L4. Full pack including conformity assessment under EU AI Act Annex III.5.

For each system, the classification yields a three-line governance directive: level, category, primary failure modes to design against. That directive is the input to the autonomy statement template — the first artefact in an agent’s documentation pack.
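A hypothetical record shape for that three-line directive follows; the field names are assumptions for illustration, not the credential's actual template.

```python
from dataclasses import dataclass, field

@dataclass
class AutonomyStatement:
    """Sketch of the autonomy statement artefact as a record."""
    level: str                      # e.g. "L3"
    category: str                   # e.g. "conversational"
    failure_modes: list[str]        # primary modes to design against
    containment: dict[str, str] = field(default_factory=dict)  # drift vector -> design

    def directive(self) -> str:
        """The three-line governance directive the classification yields."""
        return (f"Level: {self.level}\n"
                f"Category: {self.category}\n"
                f"Design against: {', '.join(self.failure_modes)}")
```

Worked example 3 above would become, roughly, `AutonomyStatement("L3", "conversational", ["goal hijacking", "hallucinated policy commitment"])`, with its containment entries filled in by the stability test that follows.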

The stability test — “would the classification change under edge cases?”

An autonomy classification that looks stable at the happy path can drift under edge cases. The architect should stress-test every classification against four edge cases:

  1. Tool failure at L3. If a tool fails, does the agent escalate (stays at L3) or retry autonomously (drifts toward L4)?
  2. Prompt injection at L2. If a prompt injection arrives via a retrieved document, does the agent remain in its sandbox (L2) or execute a hijacked goal (effectively L4 without the governance)?
  3. Memory corruption at L3. If memory corrupts, does the system continue (L3 degraded) or halt (L3 stable)?
  4. User request outside scope at L3. If the user asks something outside scope, does the agent refuse (stable L3) or improvise (drifts)?

A classification that fails any stability test is not yet a design — it is a wish. The autonomy statement artefact records the stable classification plus the containment design for each drift vector.
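The four stress tests can be sketched as a probe harness. The agent interface here is an assumption: each probe method is imagined to report the effective autonomy level the system exhibits under that edge case.

```python
# Hypothetical probes: each returns the effective autonomy level observed
# when the corresponding edge case is injected.
EDGE_CASES = {
    "tool_failure":      lambda agent: agent.on_tool_failure(),
    "prompt_injection":  lambda agent: agent.on_injected_goal(),
    "memory_corruption": lambda agent: agent.on_memory_corruption(),
    "out_of_scope":      lambda agent: agent.on_out_of_scope_request(),
}

def stability_report(agent, declared_level: int) -> dict[str, bool]:
    """True: the classification held. False: a drift vector needing containment."""
    return {name: probe(agent) <= declared_level
            for name, probe in EDGE_CASES.items()}
```

A classification is a design, not a wish, only when every entry in the report is True or has a recorded containment design in the autonomy statement.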

Real-world anchors

GitHub Copilot — the public autonomy-escalation trajectory

GitHub’s public product announcements tracked Copilot from code-completion in 2021 (L1) through Copilot Chat in 2023 (L2) through Copilot Workspace in 2024 (L3) through Copilot Agent mode in late 2024/2025 (L3+/L4 depending on task). Each escalation required a new safety design — Chat needed filtering for harmful suggestions, Workspace needed scoped repo access and PR review gates, Agent needed sandbox + rollback. The relevant lesson: the architect who designs an L2 pack and calls it done will be doing it again at L3 and at L4. Public announcements: https://github.blog/ (Copilot evolution archive).

Anthropic Claude Computer Use — embodied L3 with public safety card

Anthropic’s October 2024 announcement of Claude Computer Use — a model that can take screenshots, move a cursor, and click — is the clearest public embodied-agent architecture in 2024–2025. The accompanying safety card discussed mitigations explicitly: click confirmation for high-risk actions, URL allowlists for browser operations, session-scoped memory, and prohibited-domain blocking. Autonomy classification: L3 embodied with HOTL supervision plus synchronous HITL on defined-risk actions. The public safety card is a teaching artefact for this article and for Article 8. Source: https://www.anthropic.com/news/3-5-models-and-computer-use.

Closing

The six-level spectrum plus the seven-category taxonomy give the architect a 42-cell classification grid. Most enterprise agents fall into a dozen cells. The autonomy statement — template provided in the credential’s artefact set — records the classification, the containment design for each drift vector, and the governance depth directive. Article 3 takes the next step: given a classification, what runtime pattern should the agent actually use?

Learning outcomes check

  • Explain the six autonomy levels with the architectural obligation each introduces.
  • Classify eight example systems against the spectrum and taxonomy.
  • Evaluate a classification for stability under four edge cases.
  • Design an autonomy statement for a proposed feature.

Cross-reference map

  • Core Stream: EATF-Level-1/M1.2-Art20-Agent-Autonomy-Classification.md; EATF-Level-1/M1.4-Art11-Agentic-AI-Architecture-Patterns-and-the-Autonomy-Spectrum.md.
  • Sibling credential: AITM-AAG Article 3 (governance-facing autonomy classification).
  • Forward reference: Article 10 (oversight pattern selection); Article 23 (EU AI Act Article 14 mapping).