AITE M1.2-Art51 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Lab — Build a Finance Agent with a Human-in-the-Loop Escalation Matrix


Article 51 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Lab 1 of 5


Lab objective

Design and implement a production-credible human-in-the-loop escalation matrix for a finance agent. The agent accepts invoice batches, reconciles them against open purchase orders, proposes payment runs, and — within a policy cap — initiates ACH and SEPA transfers. By the end of the lab the learner will have (a) a written escalation matrix with gate conditions, operator roles, timeouts, and fallbacks; (b) the matrix wired into the agent runtime; (c) a replay showing each gate firing against synthetic invoice inputs; and (d) an incident-response note that reads the replay back to front.

Prerequisites

  • Articles 6, 8, 9, 10, 22, 25 of this credential.
  • One working agent runtime: LangGraph, CrewAI, AutoGen, or the OpenAI Agents SDK. The lab must be reproducible on at least one framework; the deliverables and rubric require it on two.
  • A tool mock layer. Use a local stub for the payments API; do not connect to a live payments rail.
  • Access to any LLM with function-calling: an Anthropic Claude API, an OpenAI GPT-4-class API, a self-hosted Llama 3 70B behind vLLM, or equivalent.
  • An observability sink. Langfuse, LangSmith, Arize Phoenix, Humanloop, or a structured-log file with JSON lines is acceptable.

The finance agent in scope

The agent’s mandate is narrow and bounded:

| Attribute | Value |
|---|---|
| Input | Batch of 20–200 invoices with PO number, vendor ID, amount, currency, due date, line items |
| Actions | (a) query PO store; (b) query vendor master; (c) draft payment-run record; (d) initiate transfer via payments API |
| Output | Payment-run record; per-invoice status; exception list |
| Policy cap (auto-approved) | €5,000 per invoice, €50,000 per batch |
| Escalation triggers | Anything above auto-approval; any vendor flagged high-risk; any currency mismatch; any three-way-match failure |
| Tenant | Single-tenant for the lab; the design must generalise |

The autonomy level under the Article 2 rubric is Level 3 — Supervised executor — for in-cap invoices and Level 2 — Bounded executor — for out-of-cap invoices escalated through the matrix.
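The in-scope input record can be sketched as a typed structure. This is an illustration, not a prescribed schema: the field names and the `Invoice`/`batch_total` helpers are assumptions, and only the caps come from the table above.

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    po_number: str
    vendor_id: str
    amount: Decimal        # Decimal, never float, for money
    currency: str          # ISO 4217 code, e.g. "EUR"
    due_date: str          # ISO 8601 date
    line_items: tuple = ()

# Auto-approval caps from the table above.
PER_INVOICE_CAP = Decimal("5000")
PER_BATCH_CAP = Decimal("50000")

def batch_total(invoices):
    """Batch sum, used by the batch-level policy check (gate G2)."""
    return sum((inv.amount for inv in invoices), Decimal("0"))
```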

Step 1 — Write the escalation matrix

Before touching code, write the matrix as a table. The matrix is the design; the code is its implementation.

| Gate ID | Trigger condition | Gate type | Primary operator (role) | Secondary (role) | Timeout | Timeout behaviour | Observability event |
|---|---|---|---|---|---|---|---|
| G1 | Invoice amount > €5,000 | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice; no transfer; notify submitter | gate.g1.fired + correlation ID |
| G2 | Batch total > €50,000 | Pre-execution HITL | Treasury on-call | CFO delegate | 2 business hours | Park batch; hold sub-cap items | gate.g2.fired |
| G3 | Three-way-match failure (PO, invoice, receipt) | Pre-execution HITL | Procurement reviewer | AP team lead | 1 business day | Park invoice; flag for vendor query | gate.g3.fired |
| G4 | Vendor flag = high-risk | Pre-execution HITL | Compliance officer | Head of Finance | 1 business day | Park invoice | gate.g4.fired |
| G5 | Currency mismatch (invoice vs. PO) | Pre-execution HITL | AP team lead | FX desk | 4 business hours | Park invoice | gate.g5.fired |
| G6 | Anomaly detector flag (amount > 3× vendor trailing median) | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice | gate.g6.fired |
| G7 | Tool-call authorization failure | Abort + incident | Security operations | Agent owner | Immediate | Halt run; raise P2 incident | gate.g7.fired |
| G8 | Kill-switch pressed | Abort | Stop-go authority | Agent owner | Immediate | Halt all tool calls | gate.g8.fired |

The eight gates cover the four failure classes the finance agent can produce: policy-cap breach, data-integrity failure, counterparty risk, and runtime anomaly. The matrix is deliberately asymmetric — some gates park, some abort. The learner’s design should be able to defend each choice.

Step 2 — Gate placement in the agent loop

For a ReAct-style loop, gates are not decorators on tool calls alone. They apply at three points:

  1. Pre-plan gates — batch-level policy checks (G2). Before the agent begins reasoning, the batch total is computed and G2 is evaluated.
  2. Pre-action gates — per-invoice policy checks (G1, G3, G4, G5, G6). Evaluated immediately before the initiate_transfer tool call for each invoice.
  3. Tool-response gates — runtime checks (G7). If a tool call returns an authorization failure, G7 fires and the run halts.

G8 (kill-switch) sits outside the loop as a runtime control on the agent session itself, reachable by an operator without framework knowledge.
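The three placement points can be sketched as a framework-free loop skeleton. Everything here is illustrative: the thresholds come from the matrix, but the function signature, dict shapes, and the in-loop kill-switch poll (which in production sits outside the loop, as noted above) are assumptions.

```python
def run_batch(invoices, initiate_transfer, kill_switch):
    """Skeleton showing the three gate placement points in a ReAct-style loop."""
    fired = []

    # 1. Pre-plan gate (G2): batch-level cap, evaluated before any reasoning.
    if sum(inv["amount"] for inv in invoices) > 50_000:
        fired.append("G2")
        return {"status": "parked_batch", "gates_fired": fired}

    for inv in invoices:
        # G8 is an external runtime control in production; polled here for illustration.
        if kill_switch():
            fired.append("G8")
            return {"status": "halted", "gates_fired": fired}

        # 2. Pre-action gates (G1, G3-G6): evaluated just before the transfer call.
        if inv["amount"] > 5_000:
            fired.append("G1")
            continue  # park this invoice; the rest of the batch proceeds

        # 3. Tool-response gate (G7): an authorization failure aborts the whole run.
        result = initiate_transfer(inv)
        if result.get("error") == "authorization_failure":
            fired.append("G7")
            return {"status": "aborted", "gates_fired": fired}

    return {"status": "complete", "gates_fired": fired}
```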

Step 3 — Implementation across frameworks

The lab asks for two implementations. Reference shapes:

LangGraph variant

Gate evaluations are state-graph nodes. The edge from propose_payment to initiate_transfer is replaced by an edge from propose_payment to a gate router node that either emits to initiate_transfer directly or to a pending_human_review node. The pending_human_review node persists the agent state, sends a notification to the primary operator, and halts the run until a callback reactivates the graph with a decision. The state key gate_history is append-only and is written to the audit sink on every transition.
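The router's decision can be isolated as a pure function over the graph state, which is the shape LangGraph-style conditional edges expect. The state keys, invoice fields, and gate subset below (G1, G4, G5) are assumptions for illustration; only gate_history and the two node names come from the description above.

```python
def gate_router(state: dict) -> str:
    """Return the next node name: execute directly, or park for human review.

    `state` is assumed to carry the current invoice and an append-only
    `gate_history` list, as described above.
    """
    invoice = state["current_invoice"]
    fired = []
    if invoice["amount"] > 5_000:
        fired.append("G1")
    if invoice.get("vendor_risk") == "high":
        fired.append("G4")
    if invoice["currency"] != invoice["po_currency"]:
        fired.append("G5")

    # Append-only audit trail: build a new list, never mutate prior entries.
    state["gate_history"] = state.get("gate_history", []) + [
        {"invoice_id": invoice["id"], "gates_fired": fired}
    ]
    return "pending_human_review" if fired else "initiate_transfer"
```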

CrewAI variant

CrewAI expresses the same pattern through a crew of cooperating agents. A compliance_officer agent sits between accountant_agent (proposes) and treasurer_agent (executes). The compliance_officer’s role description encodes the gate matrix; the task output schema includes a decision field (approve | park | abort) and a justification string. The parked tasks become deliverables for a human reviewer queue, via the framework’s task-assignment hook.
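Whichever framework carries it, the compliance output is worth validating before the treasurer acts on it. A minimal sketch of that check; the decision vocabulary follows the text, but the validator function is an assumption, not a CrewAI API.

```python
VALID_DECISIONS = {"approve", "park", "abort"}

def validate_compliance_output(output: dict) -> dict:
    """Reject malformed compliance decisions before they reach the executor."""
    decision = output.get("decision")
    if decision not in VALID_DECISIONS:
        raise ValueError(f"invalid decision: {decision!r}")
    justification = output.get("justification", "").strip()
    if not justification:
        raise ValueError("a justification string is required for every decision")
    return {"decision": decision, "justification": justification}
```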

OpenAI Agents SDK variant

The Agents SDK’s handoffs primitive expresses the matrix as named handoffs from the finance agent to a human-in-the-loop agent whose execute method blocks on an external callback. The SDK’s guardrails mechanism expresses G7 naturally; a rejected tool call raises an exception that the run loop catches and routes through the abort path.

Whichever two frameworks the learner chooses, the implementations must produce functionally equivalent observability events. The rubric does not reward code volume; it rewards equivalent semantics across frameworks.

Step 4 — Observability hooks

Each gate firing writes a structured event. Minimum schema:

{
  "event": "gate.g1.fired",
  "timestamp": "2026-04-20T10:14:22.841Z",
  "run_id": "run_8a2c...",
  "invoice_id": "inv_7321",
  "gate_id": "G1",
  "trigger_condition": "amount > eur_cap",
  "invoice_amount": 7420.00,
  "cap": 5000.00,
  "routed_to": "ap_team_lead",
  "timeout_seconds": 14400,
  "agent_version": "finance-agent-v1.2.0",
  "tenant_id": "acme"
}

The observability schema is stable across frameworks. A run-replay tool reconstructs a full narrative — proposal, gate decisions, operator responses, final tool calls — from these events and the agent-trace log. Reconstruction, not real-time monitoring, is the primary use case; the auditor arrives months after the run.
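The replay side can be sketched in a few lines: read the JSON-lines sink, check each event against the minimum schema, and order the narrative by timestamp. The required-field subset below follows the schema above (batch-level gates like G2 may omit invoice-specific fields); the checker itself is an assumption.

```python
import json

REQUIRED_FIELDS = {
    "event", "timestamp", "run_id", "gate_id",
    "trigger_condition", "routed_to", "agent_version", "tenant_id",
}

def load_replay(lines):
    """Parse JSON-lines gate events, reject incomplete ones, sort by timestamp."""
    events = []
    for line in lines:
        event = json.loads(line)
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"event missing fields: {sorted(missing)}")
        events.append(event)
    # ISO-8601 UTC timestamps sort correctly as plain strings.
    return sorted(events, key=lambda e: e["timestamp"])
```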

Step 5 — Synthetic input battery

Drive the agent against a scripted input set that fires each gate at least once. A representative battery:

  • 40 invoices, all in-cap, three-way-matched, low-risk vendor → full auto-approval path; zero gates fire.
  • Insert one invoice at €7,420 → G1 fires; operator approves; transfer initiates.
  • Insert one invoice at €7,420 and one at €4,200 → G1 fires on one; the other proceeds; batch total stays under €50,000.
  • Inflate batch to total €60,000 → G2 fires; whole batch parks.
  • Invoice with no matching receipt → G3 fires; others proceed.
  • Vendor with high-risk flag → G4 fires.
  • JPY invoice against EUR PO → G5 fires.
  • Amount 5× trailing vendor median → G6 fires.
  • Revoke the payments-API service account mid-run → G7 fires; run aborts.
  • Press the kill-switch mid-run after three successful transfers → G8 fires; remaining transfers do not issue.

Record the run. The replay must include every gate firing with correct routing and timeout metadata.
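Coverage can be checked mechanically after the run: every gate in the matrix must appear at least once in the event log. A sketch, assuming events shaped like the Step 4 schema:

```python
ALL_GATES = ("G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8")

def assert_full_gate_coverage(events, gate_ids=ALL_GATES):
    """Fail the battery if any gate never fired during the scripted run."""
    fired = {e["gate_id"] for e in events}
    missing = [g for g in gate_ids if g not in fired]
    if missing:
        raise AssertionError(f"gates never fired: {missing}")
    return True
```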

Step 6 — Write the incident-response note

The lab’s final deliverable is an after-action note that reads the battery run back to front. The note answers: what did the agent do, which gates fired, how did the operators respond, what was the total financial exposure committed, and what would have happened if G7 and G8 had not been present. The note is one page. It is the artefact a regulator would ask for on inquiry.

Deliverables

  1. The escalation-matrix table (Step 1).
  2. Implementation on two frameworks (Step 3). Committed to version control with README describing how to run the battery.
  3. Observability event log for a complete run (Step 5). JSON lines.
  4. Run-replay output reconstructing the narrative.
  5. Incident-response note (Step 6). Single page.

Rubric

| Criterion | Evidence | Weight |
|---|---|---|
| Matrix design defends each gate choice | Step 1 table + 1–2 line justification per row | 20% |
| Correct gate placement in the loop | Code review of two implementations | 20% |
| Cross-framework semantic equivalence | Diff of observability events | 15% |
| Observability schema completeness | Schema review against template | 15% |
| Battery exercises every gate | Log inspection | 15% |
| Incident note reads as written by a practitioner | Style and content review | 15% |

Lab sign-off

The Methodology Lead’s three follow-up questions:

  1. Which gate would you remove first if asked to reduce operator load by 20 percent, and what compensating control would you add?
  2. Which gate is most likely to be gamed by the submitter, and how would you detect that gaming pattern in the replay?
  3. If the policy cap moved from €5,000 to €25,000 tomorrow, which other design elements must move with it?

A defensible lab submission names the gate; identifies the compensating detection (post-hoc sampling, anomaly review, periodic audit); describes the gaming vector (e.g., batch splitting to stay under G2); and, for the cap change, names the observability threshold, the operator-capacity implication, and the reclassification trigger from the autonomy rubric.

The lab’s pedagogic point is that the escalation matrix is not a list of thresholds. It is a contract between the agent, the operators, and the organisation, and the design choices in the contract determine whether the agent is governable at production scale.