COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Lab 1 of 5
Lab objective
Design and implement a production-credible human-in-the-loop escalation matrix for a finance agent. The agent accepts invoice batches, reconciles them against open purchase orders, proposes payment runs, and — within a policy cap — initiates ACH and SEPA transfers. By the end of the lab the learner will have (a) a written escalation matrix with gate conditions, operator roles, timeouts, and fallbacks; (b) the matrix wired into the agent runtime; (c) a replay showing each gate firing against synthetic invoice inputs; and (d) an incident-response note that reads the replay back to front.
Prerequisites
- Articles 6, 8, 9, 10, 22, 25 of this credential.
- One working agent runtime: LangGraph, CrewAI, AutoGen, or the OpenAI Agents SDK. The lab must be reproducible on at least one framework; the rubric rewards reproducing it on two.
- A tool mock layer. Use a local stub for the payments API; do not connect to a live payments rail.
- Access to any LLM with function-calling: an Anthropic Claude API, an OpenAI GPT-4-class API, a self-hosted Llama 3 70B behind vLLM, or equivalent.
- An observability sink. Langfuse, LangSmith, Arize Phoenix, Humanloop, or a structured-log file with JSON lines is acceptable.
The finance agent in scope
The agent’s mandate is narrow and bounded:
| Attribute | Value |
|---|---|
| Input | Batch of 20–200 invoices with PO number, vendor ID, amount, currency, due date, line items |
| Actions | (a) query PO store; (b) query vendor master; (c) draft payment-run record; (d) initiate transfer via payments API |
| Output | Payment-run record; per-invoice status; exception list |
| Policy cap (auto-approved) | €5,000 per invoice, €50,000 per batch |
| Escalation cap | Anything above auto-approval; any vendor flagged high-risk; any currency mismatch; any three-way-match failure |
| Tenant | Single-tenant for the lab; the design must generalise |
The autonomy level under the Article 2 rubric is Level 3 — Supervised executor — for in-cap invoices and Level 2 — Bounded executor — for out-of-cap invoices escalated through the matrix.
Step 1 — Write the escalation matrix
Before touching code, write the matrix as a table. The matrix is the design; the code is implementation.
| Gate ID | Trigger condition | Gate type | Primary operator (role) | Secondary (role) | Timeout | Timeout behaviour | Observability event |
|---|---|---|---|---|---|---|---|
| G1 | Invoice amount > €5,000 | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice; no transfer; notify submitter | gate.g1.fired + correlation ID |
| G2 | Batch total > €50,000 | Pre-execution HITL | Treasury on-call | CFO delegate | 2 business hours | Park batch; hold sub-cap items | gate.g2.fired |
| G3 | Three-way-match failure (PO, invoice, receipt) | Pre-execution HITL | Procurement reviewer | AP team lead | 1 business day | Park invoice; flag for vendor query | gate.g3.fired |
| G4 | Vendor flag = high-risk | Pre-execution HITL | Compliance officer | Head of Finance | 1 business day | Park invoice | gate.g4.fired |
| G5 | Currency mismatch (invoice vs. PO) | Pre-execution HITL | AP team lead | FX desk | 4 business hours | Park invoice | gate.g5.fired |
| G6 | Anomaly detector flag (amount > 3× vendor trailing median) | Pre-execution HITL | AP team lead | Treasury on-call | 4 business hours | Park invoice | gate.g6.fired |
| G7 | Tool-call authorization failure | Abort + incident | Security operations | Agent owner | Immediate | Halt run; raise P2 incident | gate.g7.fired |
| G8 | Kill-switch pressed | Abort | Stop-go authority | Agent owner | Immediate | Halt all tool calls | gate.g8.fired |
The eight gates cover the four failure classes the finance agent can produce: policy-cap breach, data-integrity failure, counterparty risk, and runtime anomaly. The matrix is deliberately asymmetric — some gates park, some abort. The learner’s design should be able to defend each choice.
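One way to keep the table and the runtime from drifting apart is to hold the matrix as data. The sketch below is illustrative only. The `Gate` dataclass, the snake_case role identifiers (other than `ap_team_lead`, which appears in the Step 4 event), and the conversion of one business day to eight hours are all assumptions, not part of the lab specification.

```python
from dataclasses import dataclass

HOUR = 3600
BUSINESS_DAY = 8 * HOUR  # assumption: one business day = 8 business hours


@dataclass(frozen=True)
class Gate:
    gate_id: str
    gate_type: str        # "hitl", "abort_incident", or "abort"
    primary: str          # primary operator role
    secondary: str        # secondary operator role
    timeout_seconds: int  # 0 means immediate
    on_timeout: str       # timeout behaviour, copied from the table


# One row per line of the Step 1 table; role identifiers are hypothetical.
_ROWS = [
    Gate("G1", "hitl", "ap_team_lead", "treasury_on_call", 4 * HOUR,
         "park invoice; no transfer; notify submitter"),
    Gate("G2", "hitl", "treasury_on_call", "cfo_delegate", 2 * HOUR,
         "park batch; hold sub-cap items"),
    Gate("G3", "hitl", "procurement_reviewer", "ap_team_lead", BUSINESS_DAY,
         "park invoice; flag for vendor query"),
    Gate("G4", "hitl", "compliance_officer", "head_of_finance", BUSINESS_DAY,
         "park invoice"),
    Gate("G5", "hitl", "ap_team_lead", "fx_desk", 4 * HOUR, "park invoice"),
    Gate("G6", "hitl", "ap_team_lead", "treasury_on_call", 4 * HOUR,
         "park invoice"),
    Gate("G7", "abort_incident", "security_operations", "agent_owner", 0,
         "halt run; raise P2 incident"),
    Gate("G8", "abort", "stop_go_authority", "agent_owner", 0,
         "halt all tool calls"),
]

ESCALATION_MATRIX = {gate.gate_id: gate for gate in _ROWS}


def event_name(gate_id: str) -> str:
    """Observability event name for a gate, e.g. 'gate.g1.fired'."""
    return f"gate.{gate_id.lower()}.fired"
```

Keeping the rows frozen means a change to a timeout or a role has to go through a reviewed edit to the matrix rather than a runtime mutation.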
Step 2 — Gate placement in the agent loop
For a ReAct-style loop, gates are not decorators on tool calls alone. They apply at three points:
- Pre-plan gates: batch-level policy checks (G2). Before the agent begins reasoning, the batch total is computed and G2 is evaluated.
- Pre-action gates: per-invoice policy checks (G1, G3, G4, G5, G6). Evaluated immediately before the initiate_transfer tool call for each invoice.
- Tool-response gates: runtime checks (G7). If a tool call returns an authorization failure, G7 fires and the run halts.
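The three evaluation points can be sketched as three pure functions, one per point. This is a minimal illustration under stated assumptions: the `Invoice` fields, the assumption that amounts are already normalised to EUR before cap checks, and the `"authorization_failed"` status string are hypothetical and would be replaced by whatever the learner's data model and payments stub actually return.

```python
from dataclasses import dataclass

INVOICE_CAP_EUR = 5_000.00
BATCH_CAP_EUR = 50_000.00
ANOMALY_MULTIPLE = 3.0  # G6: amount > 3x vendor trailing median


@dataclass(frozen=True)
class Invoice:
    # Hypothetical shape; amounts are assumed to be normalised to EUR.
    invoice_id: str
    amount_eur: float
    currency: str
    po_currency: str
    three_way_matched: bool
    vendor_high_risk: bool
    vendor_trailing_median_eur: float


def pre_plan_gates(batch: list[Invoice]) -> list[str]:
    """Batch-level check, run once before the agent starts reasoning."""
    total = sum(inv.amount_eur for inv in batch)
    return ["G2"] if total > BATCH_CAP_EUR else []


def pre_action_gates(inv: Invoice) -> list[str]:
    """Per-invoice checks, run immediately before initiate_transfer."""
    fired = []
    if inv.amount_eur > INVOICE_CAP_EUR:
        fired.append("G1")
    if not inv.three_way_matched:
        fired.append("G3")
    if inv.vendor_high_risk:
        fired.append("G4")
    if inv.currency != inv.po_currency:
        fired.append("G5")
    if inv.amount_eur > ANOMALY_MULTIPLE * inv.vendor_trailing_median_eur:
        fired.append("G6")
    return fired


def tool_response_gates(response: dict) -> list[str]:
    """Runtime check on a tool result; the status value is an assumption."""
    return ["G7"] if response.get("status") == "authorization_failed" else []
```

Returning every gate that fires, rather than stopping at the first, keeps the audit trail complete when one invoice trips several conditions at once. A production version would likely use `Decimal` rather than `float` for amounts.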
G8 (kill-switch) sits outside the loop as a runtime control on the agent session itself, reachable by an operator without framework knowledge.
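A framework-independent kill-switch can be as small as a shared flag that every tool call checks before it runs. The sketch below assumes a single-process runtime and uses hypothetical names (`KillSwitch`, `RunHalted`, `guarded_call`); a multi-process deployment would need the flag in shared storage instead.

```python
import threading


class RunHalted(RuntimeError):
    """Raised when a tool call is attempted after the kill-switch is pressed."""


class KillSwitch:
    # Hypothetical helper: a thread-safe flag an operator endpoint can set.
    def __init__(self) -> None:
        self._event = threading.Event()

    def press(self) -> None:
        self._event.set()

    @property
    def pressed(self) -> bool:
        return self._event.is_set()


def guarded_call(kill_switch: KillSwitch, tool, *args, **kwargs):
    """Run a tool call only if G8 has not fired."""
    if kill_switch.pressed:
        raise RunHalted("G8: kill-switch pressed; tool call refused")
    return tool(*args, **kwargs)
```

Because the check sits in the wrapper rather than in the agent graph, the operator only needs a way to call `press()`, for example a button behind an authenticated endpoint, and needs no knowledge of the framework underneath.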
Step 3 — Implementation across frameworks
The lab asks for two implementations. Reference shapes:
LangGraph variant
Gate evaluations are state-graph nodes. The edge from propose_payment to initiate_transfer is replaced by an edge from propose_payment to a gate router node that either emits to initiate_transfer directly or to a pending_human_review node. The pending_human_review node persists the agent state, sends a notification to the primary operator, and halts the run until a callback reactivates the graph with a decision. The state key gate_history is append-only and is written to the audit sink on every transition.
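The routing decision and the append-only history can be illustrated without importing LangGraph at all. The sketch below is plain Python under assumptions: the `AgentState` keys other than gate_history are hypothetical, and the function is only the kind of routing callable a conditional edge would take. Persistence, notification, and the resume callback are left out.

```python
from typing import TypedDict


class AgentState(TypedDict):
    # Hypothetical state shape; only gate_history is named in the lab text.
    invoice_id: str
    fired_gates: list[str]
    gate_history: list[dict]


def gate_router(state: AgentState) -> str:
    """Return the name of the next node after propose_payment."""
    if state["fired_gates"]:
        return "pending_human_review"
    return "initiate_transfer"


def record_transition(state: AgentState, node: str, decision: str) -> AgentState:
    """Return a new state with one entry appended; never mutates the input."""
    entry = {
        "node": node,
        "decision": decision,
        "gates": list(state["fired_gates"]),
    }
    return {**state, "gate_history": state["gate_history"] + [entry]}
```

Building a new list on every transition, instead of calling `append` on the existing one, is what makes the history safe to write to the audit sink at each step: earlier snapshots cannot change after they have been logged.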
CrewAI variant
CrewAI expresses the same pattern through a crew of cooperating agents. A compliance_officer agent sits between accountant_agent (proposes) and treasurer_agent (executes). The compliance_officer’s role description encodes the gate matrix; the task output schema includes a decision field (approve | park | abort) and a justification string. The parked tasks become deliverables for a human reviewer queue, via the framework’s task-assignment hook.
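The task output schema described above (a decision field plus a justification string) can be validated before anything reaches the reviewer queue. This is a framework-free sketch with hypothetical names (`GateDecision`, `parse_decision`); how it is attached to a CrewAI task is left to the learner.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    APPROVE = "approve"
    PARK = "park"
    ABORT = "abort"


@dataclass(frozen=True)
class GateDecision:
    invoice_id: str
    decision: Decision
    justification: str

    def __post_init__(self) -> None:
        # Coerce raw strings to the enum; unknown values raise ValueError.
        object.__setattr__(self, "decision", Decision(self.decision))
        if not self.justification.strip():
            raise ValueError("justification must not be empty")


def parse_decision(raw: dict) -> GateDecision:
    """Validate a raw task output dict into a typed decision."""
    return GateDecision(
        invoice_id=raw["invoice_id"],
        decision=raw["decision"],
        justification=raw["justification"],
    )
```

Rejecting an empty justification at parse time means a model that emits a bare "approve" cannot pass the gate silently.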
OpenAI Agents SDK variant
The Agents SDK’s handoffs primitive expresses the matrix as named handoffs from the finance agent to a human-in-the-loop agent whose execute method blocks on an external callback. The SDK’s guardrails mechanism expresses G7 naturally; a rejected tool call raises an exception that the run loop catches and routes through the abort path.
Whichever two frameworks the learner chooses, the implementations must produce functionally equivalent observability events. The rubric does not reward code volume; it rewards equivalent semantics across frameworks.
Step 4 — Observability hooks
Each gate firing writes a structured event. Minimum schema:
    {
      "event": "gate.g1.fired",
      "timestamp": "2026-04-20T10:14:22.841Z",
      "run_id": "run_8a2c...",
      "invoice_id": "inv_7321",
      "gate_id": "G1",
      "trigger_condition": "amount > eur_cap",
      "invoice_amount": 7420.00,
      "cap": 5000.00,
      "routed_to": "ap_team_lead",
      "timeout_seconds": 14400,
      "agent_version": "finance-agent-v1.2.0",
      "tenant_id": "acme"
    }
The observability schema is stable across frameworks. A run-replay tool reconstructs a full narrative — proposal, gate decisions, operator responses, final tool calls — from these events and the agent-trace log. Reconstruction, not real-time monitoring, is the primary use case; the auditor arrives months after the run.
Step 5 — Synthetic input battery
Drive the agent against a scripted input set that fires each gate at least once. A representative battery:
- 40 invoices, all in-cap, three-way-matched, low-risk vendor → full auto-approval path; zero gates fire.
- Insert one invoice at €7,420 → G1 fires; operator approves; transfer initiates.
- Insert one invoice at €7,420 and one at €4,200 → G1 fires on the €7,420 invoice only; the €4,200 invoice proceeds; batch total stays under €50,000.
- Inflate the batch total to €60,000 → G2 fires; whole batch parks.
- Invoice with no matching receipt → G3 fires; others proceed.
- Vendor with high-risk flag → G4 fires.
- JPY invoice against EUR PO → G5 fires.
- Amount 5× trailing vendor median → G6 fires.
- Revoke the payments-API service account mid-run → G7 fires; run aborts.
- Press the kill-switch mid-run after three successful transfers → G8 fires; remaining transfers do not issue.
Record the run. The replay must include every gate firing with correct routing and timeout metadata.
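Before handing in the log, it is worth checking mechanically that the battery really exercised every gate. The sketch below reads a JSON Lines event log and reports any gate that never fired. It assumes only the `event` and `gate_id` fields from the Step 4 schema; the function names are hypothetical.

```python
import json
from typing import Iterable

EXPECTED_GATES = frozenset(f"G{i}" for i in range(1, 9))  # G1..G8


def gates_fired(jsonl_lines: Iterable[str]) -> set[str]:
    """Collect the gate IDs of every gate-fired event in the log."""
    fired = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        event = json.loads(line)
        name = event.get("event", "")
        if name.startswith("gate.") and name.endswith(".fired"):
            fired.add(event["gate_id"])
    return fired


def missing_gates(jsonl_lines: Iterable[str]) -> list[str]:
    """Gates the battery did not exercise, sorted by ID."""
    return sorted(EXPECTED_GATES - gates_fired(jsonl_lines))
```

An empty list from `missing_gates` covers only the "fires at least once" requirement; routing and timeout metadata still need to be inspected in the replay itself.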
Step 6 — Write the incident-response note
The lab’s final deliverable is an after-action note that reads the battery run back to front. The note answers: what did the agent do, which gates fired, how did the operators respond, what was the total financial exposure committed, and what would have happened if G7 and G8 had not been present. The note is one page. It is the artefact a regulator would ask for on inquiry.
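The exposure figure in the note can be computed from the same log rather than tallied by hand. The sketch below assumes the learner's runtime also writes one event per issued transfer; the event name `transfer.initiated` and its `amount` and `currency` fields are hypothetical and are not part of the minimum gate schema.

```python
import json
from collections import defaultdict
from typing import Iterable


def committed_exposure(jsonl_lines: Iterable[str]) -> dict[str, float]:
    """Sum the amounts of issued transfers, grouped by currency."""
    totals: dict[str, float] = defaultdict(float)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Hypothetical event emitted when initiate_transfer succeeds.
        if event.get("event") == "transfer.initiated":
            totals[event["currency"]] += event["amount"]
    return dict(totals)
```

Grouping by currency avoids implying an exchange rate the log does not contain. Parked and aborted invoices contribute nothing, which is the point of the figure: it shows what was actually committed.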
Deliverables
- The escalation-matrix table (Step 1).
- Implementation on two frameworks (Step 3). Committed to version control with a README describing how to run the battery.
- Observability event log for a complete run (Step 5). JSON lines.
- Run-replay output reconstructing the narrative.
- Incident-response note (Step 6). Single page.
Rubric
| Criterion | Evidence | Weight |
|---|---|---|
| Matrix design defends each gate choice | Step 1 table + 1–2 line justification per row | 20% |
| Correct gate placement in the loop | Code review of two implementations | 20% |
| Cross-framework semantic equivalence | Diff of observability events | 15% |
| Observability schema completeness | Schema review against template | 15% |
| Battery exercises every gate | Log inspection | 15% |
| Incident note reads as written by a practitioner | Style and content review | 15% |
Lab sign-off
The Methodology Lead’s three follow-up questions:
- Which gate would you remove first if asked to reduce operator load by 20 percent, and what compensating control would you add?
- Which gate is most likely to be gamed by the submitter, and how would you detect that gaming pattern in the replay?
- If the policy cap moved from €5,000 to €25,000 tomorrow, which other design elements must move with it?
A defensible lab submission names the gate; identifies the compensating detection (post-hoc sampling, anomaly review, periodic audit); describes the gaming vector (e.g., batch splitting to stay under G2); and, for the cap change, names the observability threshold, the operator-capacity implication, and the reclassification trigger from the autonomy rubric.
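The batch-splitting vector lends itself to a simple check over the replay. The sketch below flags submitters whose individually sub-cap batches sum to more than the G2 cap within a rolling window. The batch record fields (`submitter_id`, `total_eur`, `submitted_at`) and the 24-hour window are assumptions chosen for illustration, not values from the lab.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BATCH_CAP_EUR = 50_000.00


def find_split_batches(batches: list[dict],
                       window: timedelta = timedelta(hours=24)) -> list[str]:
    """Submitters whose sub-cap batches exceed the cap inside one window."""
    by_submitter = defaultdict(list)
    for batch in batches:
        # Batches over the cap already fire G2; only sub-cap ones can evade it.
        if batch["total_eur"] <= BATCH_CAP_EUR:
            by_submitter[batch["submitter_id"]].append(batch)

    suspects = []
    for submitter, items in by_submitter.items():
        items.sort(key=lambda b: b["submitted_at"])
        start, running = 0, 0.0
        for end, batch in enumerate(items):
            running += batch["total_eur"]
            # Shrink the window from the left until it spans <= `window`.
            while batch["submitted_at"] - items[start]["submitted_at"] > window:
                running -= items[start]["total_eur"]
                start += 1
            if end > start and running > BATCH_CAP_EUR:
                suspects.append(submitter)
                break
    return sorted(suspects)
```

A hit is a prompt for review, not proof of gaming: a submitter with a legitimately busy day would also be flagged, which is why the compensating control is sampling and audit rather than an automatic block.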
The lab’s pedagogic point is that the escalation matrix is not a list of thresholds. It is a contract between the agent, the operators, and the organisation, and the design choices in the contract determine whether the agent is governable at production scale.