AITE M1.2-Art10 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Human-in-the-Loop and Human-on-the-Loop Designs

Human-in-the-Loop and Human-on-the-Loop Designs — Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

10 min read Article 10 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Article 10 of 40


Thesis. Oversight is not a slider. It is a set of three architecturally distinct patterns — human-in-the-loop (HITL), human-on-the-loop (HOTL), and human-out-of-the-loop (HOOTL) — and every agent’s oversight design selects among them per action class, per risk class, and per regulatory regime. The confusing part of industry practice is that teams often believe they are doing HITL when they are in fact doing HOOTL (a human periodically reviews transcripts), and they believe they are doing HOTL when they are actually doing nothing (a dashboard nobody watches). This article pins down the definitions, maps them to EU AI Act Article 14 language, and specifies the architectural controls each pattern requires.

Three patterns, three architectures

Pattern 1 — Human-in-the-loop (HITL)

HITL: a human synchronously approves or rejects a specific agent action before it takes effect. The agent pauses; the human is summoned; the human decides; the agent either proceeds with the approved action (possibly with edits) or receives a rejection and re-plans.

Architectural requirements:

  • A gate in the agent loop at the decision point (Article 4, state-graph).
  • A reviewer queue with authentication, routing logic, and SLA tracking.
  • A review UI that gives the human enough context to decide — the action, the reasoning, the alternatives considered, the inputs — without overwhelming them with a wall of raw prompt text.
  • A decision capture that records approve/reject/edit, the reviewer identity, the time, and any rationale; this is an audit artifact.
  • A timeout policy for what happens when no reviewer shows up within the SLA (default to deny; or escalate; or queue with operational alert).
  • A bypass prohibition — the agent cannot execute the gated action by going around the gate under any prompt condition.

HITL is the right pattern for: high-stakes irreversible actions, regulated decisions (credit, benefits, employment), medical decisions, and first-time exposure to a class of action.
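The requirements above can be sketched in a few dozen lines. This is a minimal, illustrative sketch — `HITLGate` and `ReviewDecision` are hypothetical names, and an in-process queue stands in for the platform's reviewer-queue service — showing the gate blocking on a reviewer decision, the SLA timeout defaulting to deny, and every decision landing in the audit log:

```python
import queue
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ReviewDecision:
    action_id: str
    verdict: str          # "approve" | "reject" | "edit"
    reviewer: str
    rationale: str
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class HITLGate:
    """Hypothetical HITL gate: the agent blocks here; a human decides."""

    def __init__(self, sla_seconds: float):
        self.sla_seconds = sla_seconds
        self._decisions: "queue.Queue[ReviewDecision]" = queue.Queue()
        self.audit_log: list[ReviewDecision] = []

    def submit_decision(self, decision: ReviewDecision) -> None:
        """Called by the reviewer UI when a human decides."""
        self._decisions.put(decision)

    def gate(self, action_id: str) -> ReviewDecision:
        """Block until a reviewer decides; default to deny on SLA timeout."""
        try:
            decision = self._decisions.get(timeout=self.sla_seconds)
        except queue.Empty:
            # Timeout policy: no reviewer within the SLA -> default to deny.
            decision = ReviewDecision(action_id, "reject", "system",
                                      "SLA timeout: default-deny")
        self.audit_log.append(decision)  # decision capture is an audit artifact
        return decision
```

Note the shape of the timeout branch: the safe default is a rejection attributed to the system, not a silent pass-through — a gate that fails open is the bypass the last requirement prohibits.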

Pattern 2 — Human-on-the-loop (HOTL)

HOTL: a human monitors the agent’s operation asynchronously and intervenes when signals flag concern. The agent operates continuously; the human reviews dashboards, traces, and alerts; if a threshold trips, the human can stop, roll back, or redirect.

Architectural requirements:

  • A monitoring surface (Article 15) that presents the agent’s operational health, action statistics, and exceptions in near-real-time.
  • An alert system that routes anomalies to the monitor via paging or ticketing.
  • An intervention affordance — the human can pause, kill (Article 9), or correct the agent’s behavior without rebuilding the system.
  • Rollback capability — for side-effecting actions that were recently taken, the human can undo within a defined window.
  • Retention of enough context for post-hoc review: which actions ran, what they did, what the agent’s reasoning was.

HOTL is the right pattern for: high-volume low-stakes actions where HITL is infeasible, mature workflows where the failure modes are known, deployments with quick rollback and limited per-action damage.
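A minimal sketch of the monitoring-plus-intervention shape, under stated assumptions: `HOTLMonitor`, its sliding window, and its error-rate threshold are all illustrative, not a real monitoring API. The point is architectural — the agent keeps running, the monitor accumulates signals, a tripped threshold raises an alert, and the human holds a pause affordance:

```python
from collections import deque

class HOTLMonitor:
    """Illustrative HOTL monitor: near-real-time signals, human intervention."""

    def __init__(self, window: int, error_threshold: float):
        self._outcomes = deque(maxlen=window)  # sliding window of recent actions
        self.error_threshold = error_threshold
        self.paused = False
        self.alerts: list[str] = []

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return 1 - sum(self._outcomes) / len(self._outcomes)

    def record(self, action_id: str, ok: bool) -> None:
        """Record one completed action; raise an alert if the rate degrades."""
        self._outcomes.append(ok)
        if self.error_rate() > self.error_threshold:
            # In a real deployment this routes to paging/ticketing on-call.
            self.alerts.append(
                f"error rate {self.error_rate():.2f} after {action_id}")

    def pause(self) -> None:
        """Intervention affordance: the human monitor pauses the agent."""
        self.paused = True
```

The alert routing here is a list append; the architectural requirement it stands in for is that the alert reaches a configured on-call human, which is exactly the staffing failure mode discussed later in this article.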

Pattern 3 — Human-out-of-the-loop (HOOTL)

HOOTL: the agent operates autonomously; human oversight occurs through periodic audit of outcomes, metrics, and random samples. No real-time monitor; no per-action gate; accountability is post-hoc.

Architectural requirements:

  • Complete audit log — every action, input, output, decision, and outcome is recorded for later inspection.
  • Periodic review cadence — weekly, monthly, quarterly (depending on risk class), with defined scope and sampling.
  • Outcome metrics — goal-achievement rate, error rate, customer complaint rate, downstream incident rate — tracked over time.
  • Escalation criteria — conditions that move the system back to HOTL or HITL automatically when signals degrade.

HOOTL is the right pattern for: low-stakes repetitive workloads where per-action monitoring is economically infeasible, and the per-action failure cost is bounded. HOOTL is not the right pattern for: high-risk EU AI Act Annex III systems, anything touching safety of persons, financial-decision systems, or first-deployment agents where failure modes are not yet characterized.
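The escalation-criteria requirement is the one teams most often leave unspecified, so a sketch helps. The metric names and thresholds below are illustrative assumptions — each deployment sets its own — but the shape is the requirement: periodic review computes outcome metrics, and degraded signals move the system back toward HOTL or HITL rather than leaving it autonomous:

```python
def escalation_target(metrics: dict[str, float]) -> str:
    """Return the oversight pattern the system should run under next period.

    Thresholds are illustrative; a real spec derives them from risk class.
    """
    if metrics.get("incident_rate", 0.0) > 0.01:
        # Downstream incidents: gate per action until failure modes are
        # re-characterized.
        return "HITL"
    if (metrics.get("error_rate", 0.0) > 0.05
            or metrics.get("complaint_rate", 0.0) > 0.02):
        # Degraded but bounded: restore real-time monitoring.
        return "HOTL"
    # Signals healthy: periodic audit remains sufficient.
    return "HOOTL"
```

The function is trivially simple by design: escalation criteria only work if they are mechanical enough that no one has to argue about whether they fired.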

EU AI Act Article 14 mapping

Article 14 of the EU AI Act (Regulation 2024/1689) requires that high-risk AI systems be “effectively overseen by natural persons” during use. Article 14(4) enumerates the capabilities the oversight must enable: (a) properly understand relevant capacities and limitations; (b) remain aware of possible tendencies (automation bias); (c) correctly interpret output; (d) decide not to use output, or to override; (e) intervene or interrupt operation through a stop button or similar.

The architectural mapping:

  • Article 14(4)(a) — understanding capacities — is addressed by the autonomy statement (Article 2 artifact) and the system card exposed to oversight staff.
  • Article 14(4)(b) — automation bias awareness — is addressed by oversight training plus UI patterns that surface agent confidence and alternatives.
  • Article 14(4)(c) — correctly interpret output — is addressed by explanation surfaces in the UI (Article 34).
  • Article 14(4)(d) — override authority — is addressed by the HITL gate (if present) or HOTL intervention affordance.
  • Article 14(4)(e) — stop-button — is addressed by the kill-switch (Article 9).

For high-risk agentic systems under Article 6 + Annex III, the architect documents which of the three oversight patterns applies to which action class, maps each to Article 14(4) sub-clauses, and includes the mapping in the conformity-assessment evidence pack (Article 23).
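One hedged sketch of what that documented mapping can look like as a machine-checkable artifact — the action classes, control names, and `OVERSIGHT_MAP` structure are illustrative assumptions, not a prescribed conformity-assessment format:

```python
# Per-action-class oversight mapping: pattern plus the control that
# addresses each Article 14(4) sub-clause. Entries are illustrative.
OVERSIGHT_MAP = {
    "credit_decision": {
        "pattern": "HITL",
        "art_14_4": {
            "a": "autonomy statement + system card",
            "b": "oversight training; recommendation hidden until reviewer decides",
            "c": "explanation surface in review UI",
            "d": "HITL gate with approve/reject/edit",
            "e": "kill-switch",
        },
    },
    "routine_followup_email": {
        "pattern": "HOTL",
        "art_14_4": {
            "a": "autonomy statement + system card",
            "b": "oversight training",
            "c": "trace viewer",
            "d": "pause/rollback affordance",
            "e": "kill-switch",
        },
    },
}

def uncovered_subclauses(action_class: str) -> list[str]:
    """Return Article 14(4) sub-clauses with no documented control."""
    clauses = OVERSIGHT_MAP[action_class]["art_14_4"]
    return [c for c in "abcde" if not clauses.get(c)]
```

A check like `uncovered_subclauses` can run in CI against the evidence pack, so a sub-clause with no documented control fails the build rather than surfacing in a conformity assessment.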

Six example systems classified

  1. Customer-support chatbot answering common questions, no side effects. L2 autonomy; HOTL sufficient; periodic sample review.
  2. Customer-support agent with refund authority up to $100. L3 autonomy; HITL above $100, HOTL below, with sample review and outcome tracking.
  3. Mortgage-underwriting agent producing a decision. L3 autonomy, Annex III high-risk; HITL mandatory at the decision boundary; Article 14(4)(d) override must be meaningful (not rubber-stamped).
  4. Agentic SDR sending outbound emails to prospect lists. L3 autonomy; HITL on first email to a new prospect or on flagged segments; HOTL on routine follow-ups; Article 52 transparency disclosure.
  5. Code-completion agent in IDE. L2 autonomy; HITL intrinsic (developer accepts/rejects each suggestion); HOOTL for metrics review.
  6. Fraud-detection triage agent feeding a human analyst queue. L2 autonomy (agent doesn’t decide, it proposes); HITL intrinsic; periodic calibration review.

Five practical failure modes of oversight designs

Oversight that looks good on paper but fails in practice has recognizable failure modes. The architect reviews against these when accepting an oversight spec.

Approval fatigue. HITL systems that gate too many low-stakes actions produce reviewer fatigue; reviewers rubber-stamp; the gate becomes cosmetic. The architectural fix is classification discipline — only the actions that warrant per-action review are gated; everything else flows through HOTL or HOOTL.

Context impoverishment. Reviewers asked to approve/reject without sufficient context default to approval. Fix: review UI exposes the right level of detail — what the agent wants to do, why, what the alternatives were, what could go wrong.

Automation bias. Reviewers over-trust the agent’s recommendation even when their role is to catch errors (Article 14(4)(b)). Fix: explicit decision-support training; UI patterns that encourage independent judgment (e.g., not showing the agent’s recommendation until the reviewer enters their own).

Invisible drift to HOOTL. A system designed as HOTL but whose monitor-staffing was cut becomes HOOTL in practice. Fix: staffing levels are part of the operational runbook; alerts route to a configured on-call rather than “whoever is watching.”

Recoverability assumed but untested. HOTL designs assume the human can intervene in time, but the team has never tested the time between alert and intervention. Fix: periodic drills (Article 25) measure and verify intervention time.
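The drill measurement itself is small enough to sketch. Assuming drills are recorded as (alert timestamp, intervention timestamp) pairs in epoch seconds — an illustrative format, not a prescribed one — verifying the rollback window reduces to comparing the worst measured latency against the window the HOTL design assumes:

```python
import statistics

def intervention_latencies(drills: list[tuple[float, float]]) -> dict[str, float]:
    """drills: (alert_ts, intervention_ts) pairs in epoch seconds."""
    deltas = [intervened - alerted for alerted, intervened in drills]
    return {"p50": statistics.median(deltas), "max": max(deltas)}

def rollback_window_verified(drills: list[tuple[float, float]],
                             window_seconds: float) -> bool:
    """The design's rollback window holds only if the worst drill beats it."""
    return intervention_latencies(drills)["max"] <= window_seconds
```

Using the maximum rather than the median is deliberate: a rollback window that only holds at the median is a window that fails exactly when it matters.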

Framework parity — where HITL lives in frameworks

  • LangGraph — interrupt nodes are the canonical HITL gate; state is checkpointed; resume is clean. LangGraph’s design is deliberately HITL-centric.
  • CrewAI — human input via HumanInput task type; custom task can block until response; supports sequential HITL workflows.
  • AutoGen — UserProxyAgent with human_input_mode="ALWAYS" or "TERMINATE"; group-chat patterns route to human on flagged turns.
  • OpenAI Agents SDK — input_guardrails with tripwire_triggered to request human input; Runner.run_streamed with tool-approval events the caller surfaces to a reviewer UI.
  • Semantic Kernel — function filters that await user input via integration with the hosting UI layer; Process Framework support for explicit human steps.
  • LlamaIndex Agents — ask_human custom tool; reflection patterns that include human verification.

The common pattern: the framework pauses execution; the runtime surfaces the pause event to the platform’s reviewer-queue service; the reviewer’s decision is fed back as a function result that resumes execution.
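That common pattern can be shown framework-neutrally with a Python generator — an illustrative sketch only, since real frameworks checkpoint and persist state rather than holding a live generator in memory. The agent yields a pause event at the gated step; the runtime surfaces it and later sends the reviewer's decision back in as the value that resumes execution:

```python
def agent_run(action: str):
    """Agent loop sketch: pause at the gate, resume on the decision."""
    # ... planning steps elided ...
    decision = yield {"event": "pause_for_review", "action": action}
    if decision["verdict"] == "approve":
        return f"executed {action}"
    return "re-planning after rejection"

def runtime(action: str, reviewer_decision: dict) -> str:
    """Drive the generator: surface the pause, feed the decision back."""
    run = agent_run(action)
    pause_event = next(run)          # agent reaches the gate and pauses
    assert pause_event["event"] == "pause_for_review"
    # In a real platform, pause_event goes to the reviewer-queue service
    # and the process may even restart before the decision arrives.
    try:
        run.send(reviewer_decision)  # decision resumes execution
    except StopIteration as done:
        return done.value
    raise RuntimeError("agent paused again unexpectedly")
```

The gap between this sketch and production is durability: because the human may take hours to decide, the paused state must survive process restarts, which is why checkpointing (not a held coroutine) is the load-bearing piece in every framework listed above.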

Real-world anchor — US DoD Directive 3000.09 on autonomy in weapon systems

The US Department of Defense Directive 3000.09 on “Autonomy in Weapon Systems” (updated 2023) is a public policy reference for the definitions of human-in-the-loop, human-on-the-loop, and human-out-of-the-loop in one of the most consequential application domains. The directive’s insistence on meaningful human control — not just a human somewhere in the process — is the reference architects should invoke when specifying oversight for high-risk agentic systems outside defense. Source: dodd.defense.gov.

Real-world anchor — EU AI Act Article 14 and AI Office guidance

The EU AI Act text (Regulation 2024/1689) and the European Commission AI Office’s Article 14 guidance (published iteratively 2024–2025) are the authoritative reference for the architectural obligations in EU deployments. Architects should track the AI Office guidance because it clarifies what counts as “meaningful” oversight — reviewer expertise, training, authority to override, avoidance of automation bias. The guidance is the most likely source of near-term regulatory nuance between now and Phase 2 publication. Source: ec.europa.eu/commission/ai-office.

Real-world anchor — Morgan Stanley wealth-management assistant

Morgan Stanley’s publicly disclosed wealth-management assistant deployment illustrates a HITL-leaning design: the agent drafts responses, the financial advisor reviews before sending, and the advisor’s edits and rejections feed back into the evaluation pipeline. The architectural shape — agent proposes, human disposes, feedback loop closes — is the pattern most enterprise deployments can reach. The lesson is that HITL is not a scaling ceiling; it is a design pattern that, done well, enables the agent to scale the advisor’s capacity rather than replacing the advisor’s judgment.

Closing

Three patterns, five Article 14(4) sub-clauses, five failure modes. Oversight design is not an afterthought — it is the architectural commitment that a partially autonomous system remains answerable to humans. Article 11 now takes up the framework comparison that the preceding articles have implicitly drawn on.

Learning outcomes check

  • Explain three oversight patterns (HITL, HOTL, HOOTL) with their architectural requirements.
  • Classify six example systems by required pattern, using autonomy level and action class.
  • Evaluate an oversight design against EU AI Act Article 14(4)(a)–(e) and identify any sub-clause not adequately addressed.
  • Design an oversight spec for a given use case including gate placement, reviewer queue design, and failure-mode mitigations.

Cross-reference map

  • Core Stream: EATF-Level-1/M1.4-Art13-Human-Oversight-of-Agentic-Systems.md; EATL-Level-4/M4.5-Art13-Article-14-Implementation-Guide.md.
  • Sibling credential: AITM-AAG Article 9 (governance-facing oversight design).
  • Forward reference: Articles 17 (evaluation of oversight effectiveness), 23 (EU AI Act), 30–32 (sector patterns).