AITE M1.2-Art51 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Lab — Build a Finance Agent with a Human-in-the-Loop Escalation Matrix


Article 51 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Lab 1 of 5


Lab objective

Design and implement a production-credible human-in-the-loop escalation matrix for a finance agent. The agent accepts invoice batches, reconciles them against open purchase orders, proposes payment runs, and — within a policy cap — initiates ACH and SEPA transfers. By the end of the lab the learner will have (a) a written escalation matrix with gate conditions, operator roles, timeouts, and fallbacks; (b) the matrix wired into the agent runtime; (c) a replay showing each gate firing against synthetic invoice inputs; and (d) an incident-response note that reads the replay back to front.

Prerequisites

  • Articles 6, 8, 9, 10, 22, 25 of this credential.
  • One working agent runtime: LangGraph, CrewAI, AutoGen, or the OpenAI Agents SDK. The lab must be reproducible on at least one framework; the deliverables and rubric require it on two.
  • A tool mock layer. Use a local stub for the payments API; do not connect to a live payments rail.
  • Access to any LLM with function-calling: an Anthropic Claude API, an OpenAI GPT-4-class API, a self-hosted Llama 3 70B behind vLLM, or equivalent.
  • An observability sink. Langfuse, LangSmith, Arize Phoenix, Humanloop, or a structured-log file with JSON lines is acceptable.

The finance agent in scope

The agent’s mandate is narrow and bounded:

| Attribute | Value |
|---|---|
| Input | Batch of 20–200 invoices with PO number, vendor ID, amount, currency, due date, line items |
| Actions | (a) query PO store; (b) query vendor master; (c) draft payment-run record; (d) initiate transfer via payments API |
| Output | Payment-run record; per-invoice status; exception list |
| Policy cap (auto-approved) | €5,000 per invoice, €50,000 per batch |
| Escalation triggers | Anything above auto-approval; any vendor flagged high-risk; any currency mismatch; any three-way-match failure |
| Tenant | Single-tenant for the lab; the design must generalise |

The autonomy level under the Article 2 rubric is Level 3 — Supervised executor — for in-cap invoices and Level 2 — Bounded executor — for out-of-cap invoices escalated through the matrix.
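The in-scope input record can be sketched as a typed structure. This is an illustration, not a prescribed schema: the field names and the `Invoice`/`batch_total` helpers are assumptions, and only the caps come from the table above.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    po_number: str
    vendor_id: str
    amount: Decimal        # Decimal, never float, for money
    currency: str          # ISO 4217 code, e.g. "EUR"
    due_date: str          # ISO 8601 date
    line_items: tuple = ()

# Auto-approval caps from the table above.
PER_INVOICE_CAP = Decimal("5000")
PER_BATCH_CAP = Decimal("50000")

def batch_total(invoices):
    """Batch sum, used by the batch-level policy check (gate G2)."""
    return sum((inv.amount for inv in invoices), Decimal("0"))
```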

Step 1 — Write the escalation matrix

Before touching code, write the matrix as a table. The matrix is the design; the code is its implementation.

| Gate ID | Trigger condition | Gate type | Primary operator (role) | Secondary (role) | Timeout | Timeout behaviour | Observability event |
|---|---|---|---|---|---|---|---|
| G1 | Invoice amount > €5,000 | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice; no transfer; notify submitter | gate.g1.fired + correlation ID |
| G2 | Batch total > €50,000 | Pre-execution HITL | Treasury on-call | CFO delegate | 2 business hours | Park batch; hold sub-cap items | gate.g2.fired |
| G3 | Three-way-match failure (PO, invoice, receipt) | Pre-execution HITL | Procurement reviewer | AP team lead | 1 business day | Park invoice; flag for vendor query | gate.g3.fired |
| G4 | Vendor flag = high-risk | Pre-execution HITL | Compliance officer | Head of Finance | 1 business day | Park invoice | gate.g4.fired |
| G5 | Currency mismatch (invoice vs. PO) | Pre-execution HITL | AP team lead | FX desk | 4 business hours | Park invoice | gate.g5.fired |
| G6 | Anomaly detector flag (amount > 3× vendor trailing median) | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice | gate.g6.fired |
| G7 | Tool-call authorization failure | Abort + incident | Security operations | Agent owner | Immediate | Halt run; raise P2 incident | gate.g7.fired |
| G8 | Kill-switch pressed | Abort | Stop-go authority | Agent owner | Immediate | Halt all tool calls | gate.g8.fired |

The eight gates cover the four failure classes the finance agent can produce: policy-cap breach, data-integrity failure, counterparty risk, and runtime anomaly. The matrix is deliberately asymmetric — some gates park, some abort. The learner’s design should be able to defend each choice.

Step 2 — Gate placement in the agent loop

For a ReAct-style loop, gates are not decorators on tool calls alone. They apply at three points:

  1. Pre-plan gates — batch-level policy checks (G2). Before the agent begins reasoning, the batch total is computed and G2 is evaluated.
  2. Pre-action gates — per-invoice policy checks (G1, G3, G4, G5, G6). Evaluated immediately before the initiate_transfer tool call for each invoice.
  3. Tool-response gates — runtime checks (G7). If a tool call returns an authorization failure, G7 fires and the run halts.

G8 (kill-switch) sits outside the loop as a runtime control on the agent session itself, reachable by an operator without framework knowledge.
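The three placement points can be sketched as a framework-free loop skeleton. Everything here is illustrative: the thresholds come from the matrix, but the function signature, dict shapes, and the in-loop kill-switch poll (which in production sits outside the loop, as noted above) are assumptions.

```python
def run_batch(invoices, initiate_transfer, kill_switch):
    """Skeleton showing the three gate placement points in a ReAct-style loop."""
    fired = []

    # 1. Pre-plan gate (G2): batch-level cap, evaluated before any reasoning.
    if sum(inv["amount"] for inv in invoices) > 50_000:
        fired.append("G2")
        return {"status": "parked_batch", "gates_fired": fired}

    for inv in invoices:
        # G8 is an external runtime control in production; polled here for illustration.
        if kill_switch():
            fired.append("G8")
            return {"status": "halted", "gates_fired": fired}

        # 2. Pre-action gates (G1, G3-G6): evaluated just before the transfer call.
        if inv["amount"] > 5_000:
            fired.append("G1")
            continue  # park this invoice; the rest of the batch proceeds

        # 3. Tool-response gate (G7): an authorization failure aborts the whole run.
        result = initiate_transfer(inv)
        if result.get("error") == "authorization_failure":
            fired.append("G7")
            return {"status": "aborted", "gates_fired": fired}

    return {"status": "complete", "gates_fired": fired}
```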

Step 3 — Implementation across frameworks

The lab asks for two implementations. Reference shapes:

LangGraph variant

Gate evaluations are state-graph nodes. The edge from propose_payment to initiate_transfer is replaced by an edge from propose_payment to a gate router node that either emits to initiate_transfer directly or to a pending_human_review node. The pending_human_review node persists the agent state, sends a notification to the primary operator, and halts the run until a callback reactivates the graph with a decision. The state key gate_history is append-only and is written to the audit sink on every transition.
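The router's decision can be isolated as a pure function over the graph state, which is the shape LangGraph-style conditional edges expect. The state keys, invoice fields, and gate subset below (G1, G4, G5) are assumptions for illustration; only gate_history and the two node names come from the description above.

```python
def gate_router(state: dict) -> str:
    """Return the next node name: execute directly, or park for human review.

    `state` is assumed to carry the current invoice and an append-only
    `gate_history` list, as described above.
    """
    invoice = state["current_invoice"]
    fired = []
    if invoice["amount"] > 5_000:
        fired.append("G1")
    if invoice.get("vendor_risk") == "high":
        fired.append("G4")
    if invoice["currency"] != invoice["po_currency"]:
        fired.append("G5")

    # Append-only audit trail: build a new list, never mutate prior entries.
    state["gate_history"] = state.get("gate_history", []) + [
        {"invoice_id": invoice["id"], "gates_fired": fired}
    ]
    return "pending_human_review" if fired else "initiate_transfer"
```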

CrewAI variant

CrewAI expresses the same pattern through a crew of cooperating agents. A compliance_officer agent sits between accountant_agent (proposes) and treasurer_agent (executes). The compliance_officer’s role description encodes the gate matrix; the task output schema includes a decision field (approve | park | abort) and a justification string. The parked tasks become deliverables for a human reviewer queue, via the framework’s task-assignment hook.
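Whichever framework carries it, the compliance output is worth validating before the treasurer acts on it. A minimal sketch of that check; the decision vocabulary follows the text, but the validator function is an assumption, not a CrewAI API.

```python
VALID_DECISIONS = {"approve", "park", "abort"}

def validate_compliance_output(output: dict) -> dict:
    """Reject malformed compliance decisions before they reach the executor."""
    decision = output.get("decision")
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision: {decision!r}")
    justification = output.get("justification", "").strip()
    if not justification:
        raise ValueError("a justification string is required for every decision")
    return {"decision": decision, "justification": justification}
```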

OpenAI Agents SDK variant

The Agents SDK’s handoffs primitive expresses the matrix as named handoffs from the finance agent to a human-in-the-loop agent whose execute method blocks on an external callback. The SDK’s guardrails mechanism expresses G7 naturally; a rejected tool call raises an exception that the run loop catches and routes through the abort path.

Whichever two frameworks the learner chooses, the implementations must produce functionally equivalent observability events. The rubric does not reward code volume; it rewards equivalent semantics across frameworks.

Step 4 — Observability hooks

Each gate firing writes a structured event. Minimum schema:

{
  "event": "gate.g1.fired",
  "timestamp": "2026-04-20T10:14:22.841Z",
  "run_id": "run_8a2c...",
  "invoice_id": "inv_7321",
  "gate_id": "G1",
  "trigger_condition": "amount > eur_cap",
  "invoice_amount": 7420.00,
  "cap": 5000.00,
  "routed_to": "ap_team_lead",
  "timeout_seconds": 14400,
  "agent_version": "finance-agent-v1.2.0",
  "tenant_id": "acme"
}

The observability schema is stable across frameworks. A run-replay tool reconstructs a full narrative — proposal, gate decisions, operator responses, final tool calls — from these events and the agent-trace log. Reconstruction, not real-time monitoring, is the primary use case; the auditor arrives months after the run.
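The replay side can be sketched in a few lines: read the JSON-lines sink, check each event against the minimum schema, and order the narrative by timestamp. The required-field subset below follows the schema above (batch-level gates like G2 may omit invoice-specific fields); the checker itself is an assumption.

```python
import json

REQUIRED_FIELDS = {
    "event", "timestamp", "run_id", "gate_id",
    "trigger_condition", "routed_to", "agent_version", "tenant_id",
}

def load_replay(lines):
    """Parse JSON-lines gate events, reject incomplete ones, sort by timestamp."""
    events = []
    for line in lines:
        event = json.loads(line)
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"event missing fields: {sorted(missing)}")
        events.append(event)
    # ISO-8601 UTC timestamps sort correctly as plain strings.
    return sorted(events, key=lambda e: e["timestamp"])
```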

Step 5 — Synthetic input battery

Drive the agent against a scripted input set that fires each gate at least once. A representative battery:

  • 40 invoices, all in-cap, three-way-matched, low-risk vendor → full auto-approval path; zero gates fire.
  • Insert one invoice at €7,420 → G1 fires; operator approves; transfer initiates.
  • Insert one invoice at €7,420 and one at €4,200 → G1 fires on one; the other proceeds; batch total stays under €50,000.
  • Inflate batch to total €60,000 → G2 fires; whole batch parks.
  • Invoice with no matching receipt → G3 fires; others proceed.
  • Vendor with high-risk flag → G4 fires.
  • JPY invoice against EUR PO → G5 fires.
  • Amount 5× trailing vendor median → G6 fires.
  • Revoke the payments-API service account mid-run → G7 fires; run aborts.
  • Press the kill-switch mid-run after three successful transfers → G8 fires; remaining transfers do not issue.

Record the run. The replay must include every gate firing with correct routing and timeout metadata.
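Coverage can be checked mechanically after the run: every gate in the matrix must appear at least once in the event log. A sketch, assuming events shaped like the Step 4 schema:

```python
ALL_GATES = ("G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8")

def assert_full_gate_coverage(events, gate_ids=ALL_GATES):
    """Fail the battery if any gate never fired during the scripted run."""
    fired = {e["gate_id"] for e in events}
    missing = [g for g in gate_ids if g not in fired]
    if missing:
        raise AssertionError(f"gates never fired: {missing}")
    return True
```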

Step 6 — Write the incident-response note

The lab’s final deliverable is an after-action note that reads the battery run back to front. The note answers: what did the agent do, which gates fired, how did the operators respond, what was the total financial exposure committed, and what would have happened if G7 and G8 had not been present. The note is one page. It is the artefact a regulator would ask for on inquiry.

Deliverables

  1. The escalation-matrix table (Step 1).
  2. Implementation on two frameworks (Step 3). Committed to version control with README describing how to run the battery.
  3. Observability event log for a complete run (Step 5). JSON lines.
  4. Run-replay output reconstructing the narrative.
  5. Incident-response note (Step 6). Single page.

Rubric

| Criterion | Evidence | Weight |
|---|---|---|
| Matrix design defends each gate choice | Step 1 table + 1–2 line justification per row | 20% |
| Correct gate placement in the loop | Code review of two implementations | 20% |
| Cross-framework semantic equivalence | Diff of observability events | 15% |
| Observability schema completeness | Schema review against template | 15% |
| Battery exercises every gate | Log inspection | 15% |
| Incident note reads as written by a practitioner | Style and content review | 15% |

Lab sign-off

The Methodology Lead’s three follow-up questions:

  1. Which gate would you remove first if asked to reduce operator load by 20 percent, and what compensating control would you add?
  2. Which gate is most likely to be gamed by the submitter, and how would you detect that gaming pattern in the replay?
  3. If the policy cap moved from €5,000 to €25,000 tomorrow, which other design elements must move with it?

A defensible lab submission names the gate; identifies the compensating detection (post-hoc sampling, anomaly review, periodic audit); describes the gaming vector (e.g., batch splitting to stay under G2); and, for the cap change, names the observability threshold, the operator-capacity implication, and the reclassification trigger from the autonomy rubric.

The lab’s pedagogic point is that the escalation matrix is not a list of thresholds. It is a contract between the agent, the operators, and the organisation, and the design choices in the contract determine whether the agent is governable at production scale.