COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Artifact Template 5 of 5
How to use this template
Populate one runbook per agent. The runbook is for the on-call engineer at 02:00 who may be unfamiliar with this agent and needs to halt it quickly and cleanly. It is intentionally short, operational, and specific.
The runbook sits alongside the kill-switch specification (from Lab 5) and is referenced from Section 8 of the Agent Governance Charter (Template 1). The specification is the design; the runbook is the operation. A change to the specification requires a change to the runbook and a re-rehearsal.
Target length when populated: two pages. If the runbook exceeds two pages, it is encoding design choices that belong in the specification rather than the runbook.
Kill-Switch Runbook
Identity
| Field | Value |
|---|---|
| Agent identifier | stable-agent-id |
| Runbook version | 1.0 |
| Charter version referenced | 1.0 |
| Last updated | YYYY-MM-DD |
| Runbook owner (role) | ops lead |
1. What this agent does (one paragraph)
One paragraph, written for a reader who has never seen this agent. What it does, who uses it, what it touches, what the worst thing it can do is.
Example: The finance-agent-v1.2 reconciles invoice batches against open purchase orders and, within policy caps, initiates ACH and SEPA transfers. It touches the PO store (read), the vendor master (read), the payment-runs ledger (write), and the payments API (irreversible). Worst-case bad action: an erroneous transfer of up to €5,000 per invoice and €50,000 per batch before the policy cap halts further action.
2. When to kill
Kill the agent if any of the following is observed:
- A flood of unauthorised tool-call attempts in the observability dashboard.
- Loop length exceeds the p95 threshold for more than one run.
- Users report output that is clearly wrong and the agent is continuing.
- Memory-write schema-violation events are being emitted.
- A security-on-call page fires referencing this agent.
- The deadman watchdog is paging.
When in doubt, kill. Agents are cheap to restart; unchecked incidents are not.
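The trigger list above can also be written as a single check, so that an alert rule or a dashboard evaluates it the same way each time. The following is a minimal sketch: the `Signals` fields and the `kill_reasons` function are hypothetical names, and what counts as a "flood" or a p95 breach is assumed to be decided upstream by the observability stack, not by this code.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical snapshot of the section 2 triggers (field names are illustrative)."""
    unauthorised_tool_call_flood: bool = False
    runs_over_p95_loop_length: int = 0  # count of runs whose loop length exceeded p95
    wrong_output_and_still_running: bool = False
    memory_schema_violations: bool = False
    security_page_for_agent: bool = False
    deadman_watchdog_paging: bool = False


def kill_reasons(s: Signals) -> list[str]:
    """Return every trigger that is met; a non-empty list means kill the agent."""
    reasons = []
    if s.unauthorised_tool_call_flood:
        reasons.append("unauthorised tool-call flood")
    if s.runs_over_p95_loop_length > 1:  # the runbook says "more than one run"
        reasons.append("loop length over p95 for more than one run")
    if s.wrong_output_and_still_running:
        reasons.append("clearly wrong output while agent continues")
    if s.memory_schema_violations:
        reasons.append("memory-write schema violations")
    if s.security_page_for_agent:
        reasons.append("security on-call page")
    if s.deadman_watchdog_paging:
        reasons.append("deadman watchdog paging")
    return reasons
```

The returned list doubles as the free-text justification that the kill paths in section 3 ask for.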
3. How to kill — three paths
Use whichever path is fastest. They are independent; use more than one if in doubt.
Path A — UI button (preferred when available)
- Open the agent console at <url-of-agent-console>.
- Log in with your on-call credentials.
- Find the agent by ID: <agent-identifier>.
- Click the red Kill Session button (to halt one session) or Kill Agent (to halt all sessions).
- Enter a justification (free text; will be audited).
- Confirm.
Expected effect: the agent emits session.halted within 5 seconds, and no further tool calls are issued.
Path B — REST endpoint
curl -X POST https://<agent-platform>/agents/<agent-id>/kill \
-H "Authorization: Bearer $ONCALL_TOKEN" \
-H "Content-Type: application/json" \
-d '{"scope":"session","session_id":"<session-id>","reason":"<text>"}'
Use "scope":"agent" to halt all sessions of this agent. The response returns the kill_id; note it for the post-incident record.
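For teams that script the REST path, the same request can be assembled in Python. This is a sketch under assumptions: `build_kill_request` is a hypothetical helper, the endpoint shape and JSON fields are copied from the curl command above, and omitting `session_id` when the scope is `agent` is an assumption that the platform's API reference should confirm.

```python
import json
import urllib.request


def build_kill_request(base_url, agent_id, token, reason, session_id=None):
    """Build (but do not send) the Path B POST request.

    Scope is "session" when a session_id is given, otherwise "agent".
    """
    body = {"scope": "session" if session_id else "agent", "reason": reason}
    if session_id:
        body["session_id"] = session_id
    return urllib.request.Request(
        url=f"{base_url}/agents/{agent_id}/kill",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request is a separate step, for example `urllib.request.urlopen(req, timeout=10)`. The response format is not specified in this template beyond the fact that it carries the kill_id.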
Path C — CLI
agent-admin kill \
--agent-id <agent-id> \
--session-id <session-id> \
--scope session \
--reason "<text>"
Run from the on-call workstation with the pre-configured credentials.
Last-resort path — process kill
If the above paths fail (platform outage, credential problem), the agent runs in a named container. Identify it:
kubectl get pods -n agents -l agent-id=<agent-id>
kubectl delete pod <pod-name> -n agents
This path is last-resort because it does not roll back reversible side effects cleanly; it terminates the runtime, and the orchestrator spawns a replacement. Use only when Paths A–C are unavailable.
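Where the scriptable paths (B, C, and the last-resort process kill) are wrapped in automation, the ordering described above can be expressed as a simple fallback chain. This is an illustrative sketch only: `first_successful_path` is a hypothetical helper, and each path is assumed to be a callable that returns True on a confirmed kill.

```python
from typing import Callable, Optional


def first_successful_path(paths: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Try each kill path in order and return the name of the first that succeeds.

    A path that raises (platform outage, credential problem) is treated as
    unavailable and the next path is tried. Returns None if every path fails.
    """
    for name, attempt in paths:
        try:
            if attempt():
                return name
        except Exception:
            continue
    return None
```

Placing the process kill last in the list keeps it a last resort, used only after the earlier paths have failed.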
4. What will happen when you kill
| Action class | On kill |
|---|---|
| In-flight read (tool) | Completes; result is discarded |
| In-flight memory write (session) | Written or discarded; session memory is disposed |
| In-flight memory write (persistent) | Rolled back within the 30-second rollback window; emits memory.rolled_back |
| In-flight tool call to payments API | Not initiated if kill precedes call; if call already in flight, rollback attempt initiated |
| Active HITL gates | Gates close; pending approvals marked gate.cancelled_on_kill |
| Observability events | Buffered events are flushed; no events are emitted after the kill |
Tailor this table to the agent. The rows above are illustrative, not prescriptive.
5. How to know it worked
Check the following within 2 minutes of issuing the kill.
- Observability dashboard panel Active sessions: the killed session is absent.
- Observability dashboard panel Last kill events: contains kill_switch.fired with the kill_id from §3.
- Audit log: session.halted event present with the same kill_id.
- Tool-call rate graph for this agent: rate drops to zero within 5 seconds.
- If applicable, payments API sandbox view: no transfers initiated after the kill timestamp.
If any of the five checks fails, escalate immediately (§7).
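The verification step can be captured as a checklist function, so that a missing observation is treated the same as a failed one. A minimal sketch: the check names and `failed_checks` are hypothetical labels for the five items above, and the payments check is skipped for agents that cannot initiate transfers.

```python
CHECKS = (
    "session_absent_from_active_sessions",
    "kill_switch_fired_with_kill_id",
    "session_halted_in_audit_log",
    "tool_call_rate_zero",
    "no_transfers_after_kill",  # only applicable to agents that can initiate payments
)


def failed_checks(observed: dict[str, bool], payments_applicable: bool = True) -> list[str]:
    """Return the checks that did not pass; a check with no observation counts as failed."""
    required = [
        c for c in CHECKS
        if payments_applicable or c != "no_transfers_after_kill"
    ]
    return [c for c in required if not observed.get(c, False)]
```

Any non-empty result corresponds to the escalation condition stated above.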
6. How to recover
Do not restart the agent until at least the following are confirmed.
- Root cause understood, at least at a preliminary level. Was this a model drift, a data-integrity issue, an injection, a runaway loop, or an operator-initiated drill? The category drives the recovery path.
- Side-effect accounting complete. For financial or customer-facing agents, the finance/ops lead has confirmed the side-effect ledger reconciles.
- Configuration unchanged or intentionally changed. If the kill was due to a bad config, the config has been reverted or replaced.
- Restart authorised. The on-call engineer holds explicit delegation from the agent owner to restart the agent.
Then:
Restart procedure
- In the agent console, click Deploy → New session (or equivalent).
- Monitor the first three runs with the observability dashboard open.
- Compare tool-call patterns to the pre-kill baseline: any deviation triggers a second kill and a deeper investigation.
- If the first three runs complete cleanly, hand back to normal on-call posture with a note in the incident record.
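The baseline comparison in the restart procedure can be sketched as follows. The template does not define what counts as a deviation, so the metric here is an assumption: per-tool call counts for one run are compared against a pre-kill average, with a hypothetical relative `tolerance`, and any tool absent from the baseline is always flagged.

```python
def deviations(baseline: dict[str, float], observed: dict[str, int],
               tolerance: float = 0.5) -> list[str]:
    """Return tools whose call count in a run deviates from the pre-kill baseline.

    A tool is flagged if it was never seen in the baseline, or if its count
    differs from the baseline by more than `tolerance` (relative difference).
    """
    flagged = []
    for tool in sorted(set(baseline) | set(observed)):
        expected = baseline.get(tool, 0.0)
        actual = observed.get(tool, 0)
        if expected == 0:
            if actual > 0:  # tool not present in the baseline at all
                flagged.append(tool)
        elif abs(actual - expected) / expected > tolerance:
            flagged.append(tool)
    return flagged
```

Setting `tolerance=0` treats any difference in counts as a deviation.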
7. When to escalate
Escalate immediately if any of the following is true.
- A check in §5 failed: the kill may not have taken effect.
- The agent is customer-facing and customers are currently affected.
- The agent’s action scope includes irreversible external actions and a rollback attempt is in progress.
- The incident class is “memory integrity in question” or “indirect injection suspected”: both require specialist support beyond on-call.
- You are unsure whether this is a drill or a real incident.
Escalation tree (for this agent):
| Incident category | L1 | L2 | L3 |
|---|---|---|---|
| Agent-halt operational failure | platform on-call | agent owner | architect |
| Customer-facing output incident | comms lead | agent owner; legal | executive sponsor |
| Memory integrity suspected | data-governance on-call | agent owner | CISO delegate |
| Indirect-injection suspected | security on-call | agent owner | CISO delegate |
| Financial / reversal in progress | treasury on-call | CFO delegate | executive sponsor |
| Regulatory inquiry during incident | legal | executive sponsor | board |
Contact methods and current rotations live in the on-call directory at <url>. If the directory is unreachable, call the operations hotline: <number>.
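If paging is automated, the escalation tree can be held as a lookup table. The sketch below copies the roles from the tree above; `ESCALATION_TREE` and `escalate_to` are hypothetical names, and resolving a role to a person is left to the on-call directory.

```python
ESCALATION_TREE = {
    "Agent-halt operational failure": ("platform on-call", "agent owner", "architect"),
    "Customer-facing output incident": ("comms lead", "agent owner; legal", "executive sponsor"),
    "Memory integrity suspected": ("data-governance on-call", "agent owner", "CISO delegate"),
    "Indirect-injection suspected": ("security on-call", "agent owner", "CISO delegate"),
    "Financial / reversal in progress": ("treasury on-call", "CFO delegate", "executive sponsor"),
    "Regulatory inquiry during incident": ("legal", "executive sponsor", "board"),
}


def escalate_to(category: str, level: int) -> str:
    """Return the role to contact for an incident category at level 1, 2, or 3."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    return ESCALATION_TREE[category][level - 1]
```

An unknown category raises `KeyError`, so a mistyped category fails loudly instead of silently paging no one.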
8. Post-incident
Within 24 hours of the kill, the agent owner files an incident note in the incident tracker. The note includes: kill timestamp, kill path used, trigger, observed side effects, recovery time, root cause (preliminary), and remediation action. The note is the input to the next monthly incident review.
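The required contents of the incident note can be checked before filing. A minimal sketch, assuming a hypothetical `IncidentNote` record whose fields mirror the list above; the incident tracker's actual schema is not specified in this template.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentNote:
    """Hypothetical record with one field per item the runbook requires."""
    kill_timestamp: Optional[str] = None
    kill_path_used: Optional[str] = None
    trigger: Optional[str] = None
    observed_side_effects: Optional[str] = None
    recovery_time: Optional[str] = None
    preliminary_root_cause: Optional[str] = None
    remediation_action: Optional[str] = None


def missing_fields(note: IncidentNote) -> list[str]:
    """Return the names of required fields that are still empty."""
    return [f.name for f in fields(note) if not getattr(note, f.name)]
```

Writing "none observed" for side effects counts as a filled field, whereas leaving it blank does not.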
9. Rehearsal
The kill-switch specification (Lab 5 deliverable) mandates rehearsal cadence. The runbook is exercised during rehearsal.
| Field | Value |
|---|---|
| Last rehearsal date | YYYY-MM-DD |
| Last rehearsal scenario | e.g., synchronous UI button during mid-tool-call |
| Last rehearsal result | pass / fail with note |
| Next scheduled rehearsal | YYYY-MM-DD |
A runbook whose last rehearsal is older than the rehearsal cadence is out of compliance. The agent owner is notified automatically at cadence − 7 days.
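The compliance rule is simple date arithmetic, sketched below. `rehearsal_status` is a hypothetical helper, and the cadence is assumed to be expressed in days, which is a value set by the kill-switch specification and not by this template.

```python
from datetime import date, timedelta


def rehearsal_status(last_rehearsal: date, cadence_days: int, today: date) -> dict:
    """Compute the next due date, the owner-notification date, and compliance.

    The runbook is out of compliance once the last rehearsal is older than
    the cadence; the owner is notified 7 days before the due date.
    """
    due = last_rehearsal + timedelta(days=cadence_days)
    return {
        "due": due,
        "notify_owner_on": due - timedelta(days=7),
        "out_of_compliance": today > due,
    }
```

For example, with a 90-day cadence and a last rehearsal on 2025-01-01, the next rehearsal is due on 2025-04-01 and the owner is notified on 2025-03-25.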
10. Change log
| Date | Version | Change | Trigger | Author (role) |
|---|---|---|---|---|
| YYYY-MM-DD | 1.0 | initial runbook | onboarding | ops lead |
11. Sign-off
| Role | Sign-off date |
|---|---|
| Ops lead | |
| Agent owner | |
| Architect of record | |
End of Kill-Switch Runbook template.