Template — Kill-Switch Runbook

FlowRidge

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Artifact Template 5 of 5

How to use this template

Populate one runbook per agent. The runbook is for the on-call engineer at 02:00 who may be unfamiliar with this agent and needs to halt it quickly and cleanly. It is intentionally short, operational, and specific.

The runbook sits alongside the kill-switch specification (from Lab 5) and is referenced from Section 8 of the Agent Governance Charter (Template 1). The specification is the design; the runbook is the operation. A change to the specification requires a change to the runbook and a re-rehearsal.

Target length when populated: two pages. If the runbook exceeds two pages, it is encoding design choices that belong in the specification rather than the runbook.

Kill-Switch Runbook

Identity

Field	Value
Agent identifier	stable-agent-id
Runbook version	1.0
Charter version referenced	1.0
Last updated	YYYY-MM-DD
Runbook owner (role)	ops lead

1. What this agent does (one paragraph)

One paragraph, written for a reader who has never seen this agent. What it does, who uses it, what it touches, what the worst thing it can do is.

Example: The finance-agent-v1.2 reconciles invoice batches against open purchase orders and, within policy caps, initiates ACH and SEPA transfers. It touches the PO store (read), the vendor master (read), the payment-runs ledger (write), and the payments API (irreversible). Worst-case bad action: erroneous transfers of up to €5,000 per invoice and €50,000 per batch before the batch cap halts further action.

2. When to kill

Kill the agent if any of the following is observed:

A flood of unauthorised tool-call attempts in the observability dashboard.
Loop length exceeds the p95 threshold for more than one run.
Users report output that is clearly wrong and the agent is continuing.
Memory-write schema-violation events are being emitted.
A security-on-call page fires referencing this agent.
The deadman watchdog is paging.

When in doubt, kill. Agents are cheap to restart; unchecked incidents are not.
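The trigger list above can be read as a simple "any of" rule. The following is a minimal sketch of that rule, assuming the six signals are already available as booleans or counts; the `Signals` class and its field names are hypothetical and not part of any real platform API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    """Hypothetical snapshot of the six kill triggers; field names are illustrative."""
    unauthorised_tool_call_flood: bool = False
    runs_over_p95_loop_length: int = 0
    wrong_output_while_running: bool = False
    memory_schema_violations: bool = False
    security_page_for_agent: bool = False
    deadman_watchdog_paging: bool = False


def kill_reasons(s: Signals) -> List[str]:
    """Return every trigger that currently holds; a non-empty list means kill."""
    reasons = []
    if s.unauthorised_tool_call_flood:
        reasons.append("unauthorised tool-call flood")
    # The trigger is "more than one run" over p95, so a single long run does not fire.
    if s.runs_over_p95_loop_length > 1:
        reasons.append("loop length over p95 on multiple runs")
    if s.wrong_output_while_running:
        reasons.append("clearly wrong output while agent continues")
    if s.memory_schema_violations:
        reasons.append("memory-write schema violations")
    if s.security_page_for_agent:
        reasons.append("security on-call page references agent")
    if s.deadman_watchdog_paging:
        reasons.append("deadman watchdog paging")
    return reasons
```

The function returns reasons rather than a bare boolean so that the same text can be pasted into the kill justification field. It does not replace the "when in doubt, kill" rule; an engineer may still kill with none of the signals set.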

3. How to kill — three paths

Use whichever path is fastest. They are independent; use more than one if in doubt.

Path A — UI button (preferred when available)

Open the agent console at <url-of-agent-console>.
Log in with your on-call credentials.
Find the agent by ID: <agent-identifier>.
Click the red Kill Session button (to halt one session) or Kill Agent (to halt all sessions).
Enter a justification (free text; will be audited).
Confirm.

Expected effect: the agent emits session.halted within 5 seconds; no further tool calls issue.

Path B — REST endpoint

curl -X POST https://<agent-platform>/agents/<agent-id>/kill \
  -H "Authorization: Bearer $ONCALL_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"scope":"session","session_id":"<session-id>","reason":"<text>"}'

Use "scope":"agent" to halt all sessions of this agent. The response returns the kill_id; note it for the post-incident record.
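For teams that script Path B, the same request can be built with the Python standard library. This is a sketch under stated assumptions: the response is assumed to be a JSON object with a top-level `kill_id` field, and an agent-scope request is assumed to omit `session_id`. The helper names `build_kill_request` and `extract_kill_id` are hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_kill_request(base_url: str, agent_id: str, token: str,
                       reason: str, session_id: Optional[str] = None) -> urllib.request.Request:
    """Build the Path B POST. Scope is "session" if a session_id is given, else "agent"."""
    if session_id is not None:
        payload = {"scope": "session", "session_id": session_id, "reason": reason}
    else:
        # Assumption: agent-scope kills carry no session_id.
        payload = {"scope": "agent", "reason": reason}
    return urllib.request.Request(
        url=f"{base_url}/agents/{agent_id}/kill",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_kill_id(response_body: bytes) -> str:
    """Assumption: the response is JSON with a top-level "kill_id" field."""
    return json.loads(response_body)["kill_id"]
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=10)` and feed the response body to `extract_kill_id`. Using `json.dumps` for the body avoids the quoting problems that free-text reasons can cause in a hand-written shell string.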

Path C — CLI

agent-admin kill \
  --agent-id <agent-id> \
  --session-id <session-id> \
  --scope session \
  --reason "<text>"

Run from the on-call workstation with the pre-configured credentials.

Last-resort path — process kill

If the above paths fail (platform outage, credential problem), terminate the agent's runtime directly. The agent runs in a named container; identify its pod and delete it:

kubectl get pods -n agents -l agent-id=<agent-id>
kubectl delete pod <pod-name> -n agents

This path is last-resort because it does not roll back reversible side effects cleanly; it terminates the runtime, and the orchestrator spawns a replacement. Use only when Paths A–C are unavailable.

4. What will happen when you kill

Action class	On kill
In-flight read (tool)	Completes; result is discarded
In-flight memory write (session)	Written or discarded; session memory is disposed
In-flight memory write (persistent)	Rolled back within the 30-second rollback window; emits `memory.rolled_back`
In-flight tool call to payments API	Not initiated if kill precedes call; if call already in flight, rollback attempt initiated
Active HITL gates	Gates close; pending approvals marked `gate.cancelled_on_kill`
Observability events	Flushed; events after kill do not emit

Tailor to the agent. The above is an illustrative rather than a prescriptive list.

5. How to know it worked

Check the following within 2 minutes of issuing the kill.

Observability dashboard panel Active sessions: the killed session is absent.
Observability dashboard panel Last kill events: contains kill_switch.fired with the kill_id from §3.
Audit log: session.halted event present with the same kill_id.
Tool-call rate graph for this agent: rate drops to zero within 5 seconds.
If applicable, payments API sandbox view: no transfers initiated after the kill timestamp.

If any of the five checks fails, escalate immediately (§7).
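The verification step can be sketched as a polling loop over named checks with a two-minute window. This is an illustration only: each check is assumed to be a zero-argument callable that returns True when it passes, and the function name `failed_checks` and its parameters are hypothetical. The clock and sleep functions are injectable so the loop can be rehearsed without waiting.

```python
import time
from typing import Callable, Dict, List


def failed_checks(checks: Dict[str, Callable[[], bool]],
                  window_s: float = 120.0,
                  poll_s: float = 5.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Re-run every check until all pass or the window closes.

    Returns the names of the checks still failing; an empty list means
    the kill is confirmed, a non-empty list means escalate.
    """
    deadline = clock() + window_s
    while True:
        # All checks are re-evaluated each round, so a check that
        # passed earlier and then regressed is still reported.
        failing = [name for name, check in checks.items() if not check()]
        if not failing or clock() >= deadline:
            return failing
        sleep(poll_s)
```

A caller would register one callable per item in the list above (active sessions, last kill events, audit log, tool-call rate, payments view) and escalate if the returned list is non-empty.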

6. How to recover

Do not restart the agent until at least the following are confirmed.

Root cause understood, at least at a preliminary level. Was this model drift, a data-integrity issue, an injection, a runaway loop, or an operator-initiated drill? The category drives the recovery path.
Side-effect accounting complete. For financial or customer-facing agents, the finance/ops lead has confirmed the side-effect ledger reconciles.
Configuration unchanged or intentionally changed. If the kill was due to a bad config, the config has been reverted or replaced.
Restart authorised by the on-call engineer, acting under explicit delegation from the agent owner.

Then:

Restart procedure

In the agent console, click Deploy → New session (or equivalent).
Monitor the first three runs with the observability dashboard open.
Compare tool-call patterns to the pre-kill baseline: any deviation triggers a second kill and a deeper investigation.
If the first three runs complete cleanly, hand back to normal on-call posture with a note in the incident record.
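The baseline comparison in the restart procedure needs a working definition of "deviation". One possible definition, offered here as an assumption rather than as the specification's rule, is: a tool that does not appear in the baseline at all, or a tool whose share of calls differs from its baseline share by more than a tolerance. The function name, the baseline format, and the 0.10 default tolerance are all hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def deviations(baseline: Dict[str, float],
               observed_calls: Iterable[str],
               tolerance: float = 0.10) -> List[str]:
    """Compare observed tool calls against a baseline of call shares.

    baseline maps tool name -> expected share of calls (0.0 to 1.0).
    observed_calls is the sequence of tool names called since restart.
    """
    counts = Counter(observed_calls)
    total = sum(counts.values())
    findings = []
    for tool in sorted(set(baseline) | set(counts)):
        if tool not in baseline:
            findings.append(f"{tool}: not in baseline")
            continue
        share = counts[tool] / total if total else 0.0
        if abs(share - baseline[tool]) > tolerance:
            findings.append(f"{tool}: share {share:.2f} vs baseline {baseline[tool]:.2f}")
    return findings
```

Any non-empty result would map to the "second kill and deeper investigation" branch. A tool missing from the baseline is reported regardless of the tolerance, since an unexpected tool is a stronger signal than a shifted ratio.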

7. When to escalate

Escalate immediately if any of the following is true.

A check in §5 failed: the kill may not have taken effect.
The agent is customer-facing and customers are currently affected.
The agent’s action scope includes irreversible external actions and a rollback attempt is in progress.
The incident class is “memory integrity in question” or “indirect injection suspected”: both require specialist support beyond on-call.
You are unsure whether this is a drill or a real incident.

Escalation tree (for this agent):

Incident category                  | L1                      | L2                 | L3
-----------------------------------|-------------------------|--------------------|------------------
Agent-halt operational failure     | platform on-call        | agent owner        | architect
Customer-facing output incident    | comms lead              | agent owner; legal | executive sponsor
Memory integrity suspected         | data-governance on-call | agent owner        | CISO delegate
Indirect-injection suspected       | security on-call        | agent owner        | CISO delegate
Financial / reversal in progress   | treasury on-call        | CFO delegate       | executive sponsor
Regulatory inquiry during incident | legal                   | executive sponsor  | board

Contact methods and current rotations live in the on-call directory at <url>. If the directory is unreachable, call the operations hotline: <number>.
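If the escalation tree is also kept in machine-readable form (for example, to drive paging), it could look like the sketch below. The roles are copied from the tree above; the snake_case category keys and the `escalation_contact` helper are hypothetical.

```python
from typing import Dict, Tuple

# Rows copied from the escalation tree; tuple order is (L1, L2, L3).
ESCALATION_TREE: Dict[str, Tuple[str, str, str]] = {
    "agent_halt_operational_failure": ("platform on-call", "agent owner", "architect"),
    "customer_facing_output_incident": ("comms lead", "agent owner; legal", "executive sponsor"),
    "memory_integrity_suspected": ("data-governance on-call", "agent owner", "CISO delegate"),
    "indirect_injection_suspected": ("security on-call", "agent owner", "CISO delegate"),
    "financial_reversal_in_progress": ("treasury on-call", "CFO delegate", "executive sponsor"),
    "regulatory_inquiry_during_incident": ("legal", "executive sponsor", "board"),
}


def escalation_contact(category: str, level: int) -> str:
    """Return the role to contact for a category at escalation level 1, 2, or 3."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    return ESCALATION_TREE[category][level - 1]
```

The table holds roles, not people; the current person behind each role still comes from the on-call directory.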

8. Post-incident

Within 24 hours of the kill, the agent owner files an incident note in the incident tracker. The note includes: kill timestamp, kill path used, trigger, observed side effects, recovery time, root cause (preliminary), and remediation action. The note is the input to the next monthly incident review.
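The incident note fields listed above can be captured as a small record type so that no field is forgotten. This is a sketch with hypothetical names; the field types (free text versus structured values) are assumptions, and the incident tracker's real schema may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentNote:
    """One field per item required in the post-incident note."""
    kill_timestamp: datetime
    kill_path: str                 # e.g. "A", "B", "C", or "last-resort"
    trigger: str
    observed_side_effects: str
    recovery_time: timedelta
    preliminary_root_cause: str
    remediation_action: str

    def filing_deadline(self) -> datetime:
        """The note is due within 24 hours of the kill."""
        return self.kill_timestamp + timedelta(hours=24)

    def is_overdue(self, now: datetime) -> bool:
        return now > self.filing_deadline()
```

Because every field is required by the constructor, an incomplete note fails at creation time rather than at the monthly review.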

9. Rehearsal

The kill-switch specification (Lab 5 deliverable) mandates the rehearsal cadence. The runbook is exercised during each rehearsal.

Field	Value
Last rehearsal date	YYYY-MM-DD
Last rehearsal scenario	e.g., synchronous UI button during mid-tool-call
Last rehearsal result	pass / fail with note
Next scheduled rehearsal	YYYY-MM-DD

A runbook whose last rehearsal is older than the rehearsal cadence is out of compliance. The agent owner is notified automatically 7 days before the next rehearsal falls due (cadence − 7 days).
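The compliance rule reduces to date arithmetic on the last rehearsal date and the cadence. The sketch below assumes the cadence is expressed in days; the function name and the three status strings are hypothetical.

```python
from datetime import date, timedelta


def rehearsal_status(last_rehearsal: date, cadence_days: int, today: date) -> str:
    """Classify a runbook by the age of its last rehearsal.

    "out_of_compliance": last rehearsal is older than the cadence.
    "notify_owner": within 7 days of the due date (cadence minus 7 days reached).
    "ok": otherwise.
    """
    due = last_rehearsal + timedelta(days=cadence_days)
    notify_on = due - timedelta(days=7)
    if today > due:
        return "out_of_compliance"
    if today >= notify_on:
        return "notify_owner"
    return "ok"
```

For example, with a last rehearsal on 2024-01-01 and a 90-day cadence, the rehearsal falls due on 2024-03-31, the owner is notified from 2024-03-24, and the runbook is out of compliance from 2024-04-01.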

10. Change log

Date	Version	Change	Trigger	Author (role)
YYYY-MM-DD	1.0	initial runbook	onboarding	ops lead

11. Sign-off

Role	Sign-off date
Ops lead
Agent owner
Architect of record

End of Kill-Switch Runbook template.