The architect owns the blueprint. The SRE team owns the execution. The architect’s deliverable is the incident-response plan, the agent-specific runbook library, the escalation matrix, and the post-mortem template. This article walks through each.
Six agent incident classes
Class 1 — Memory poisoning. Injected content in long-term memory that persists across sessions and affects future behavior. Symptoms: unexpected references to content no user provided, recurring off-topic responses, systematic bias in recommendations. Root cause: an attacker injected content that the agent wrote to long-term memory without adequate provenance checking.
Class 2 — Goal hijacking. The agent’s active goal has been redirected by input it received. Symptoms: the agent pursues a different task than the one it was assigned, ignores the original instructions, produces output related to the injected goal. Root cause: prompt-injection or indirect-injection vector found an unguarded path.
Class 3 — Runaway loop. The agent is burning resources (tokens, tool calls, time) without productive progress. Symptoms: token spend spike, tool-call rate spike, session duration exceeds expected p99, no terminal state reached. Root cause: loop-exit conditions failed, infinite delegation between agents, or a pathological input triggered recursive reasoning.
Class 4 — Tool misuse. The agent called a tool with parameters that caused an unintended side effect. Symptoms: an external system shows unexpected writes/reads, an audit log shows a tool call outside expected parameters. Root cause: authorization gap, schema validation gap, or policy gap.
Class 5 — Coordination failure. A multi-agent system has deadlocked, livelocked, or entered deceptive delegation. Symptoms: no progress across multiple agents, one agent reporting completion to another without actually executing, or all agents waiting on each other. Root cause: missing timeout, missing completion verification, or insufficient coordination policy.
Class 6 — Behavioral regression. A recent promotion has changed agent behavior in a way that violates stated expectations. Symptoms: goal-achievement rate drop, refusal-rate spike, human-intervention rate spike, user complaints about behavior change. Root cause: inadequate pre-promotion testing (Article 24).
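The six classes can be encoded as a routing key so that an alert opens the right runbook automatically. A minimal sketch, assuming a convention where each class maps to one runbook file (names and paths are illustrative):

```python
from enum import Enum, auto

class IncidentClass(Enum):
    """The six agent incident classes."""
    MEMORY_POISONING = auto()
    GOAL_HIJACKING = auto()
    RUNAWAY_LOOP = auto()
    TOOL_MISUSE = auto()
    COORDINATION_FAILURE = auto()
    BEHAVIORAL_REGRESSION = auto()

# Route each class to its runbook so the on-call alert links straight to
# the relevant procedure. The paths are hypothetical.
RUNBOOKS = {
    IncidentClass.MEMORY_POISONING: "runbooks/memory-poisoning.md",
    IncidentClass.GOAL_HIJACKING: "runbooks/goal-hijacking.md",
    IncidentClass.RUNAWAY_LOOP: "runbooks/runaway-loop.md",
    IncidentClass.TOOL_MISUSE: "runbooks/tool-misuse.md",
    IncidentClass.COORDINATION_FAILURE: "runbooks/coordination-failure.md",
    IncidentClass.BEHAVIORAL_REGRESSION: "runbooks/behavioral-regression.md",
}
```

Keeping the class list in one enum also gives game-day tooling a single source of truth for coverage checks.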
The runbook structure
Each incident class has a runbook with four phases: detect, contain, remediate, post-mortem. The architect defines the runbook; the SRE team exercises it on a schedule.
Phase 1 — Detect
Detection mechanisms per class:
- Memory poisoning: memory-content anomaly detection (embedding distance between recently-written memory and the cluster centroid; frequency of low-provenance writes); user reports of “the agent said something weird”; red-team battery failures.
- Goal hijacking: output-to-task semantic-distance spikes; tool-call sequences that don’t match typical task patterns; output classifier flags for policy-violating content.
- Runaway loop: cost-per-session anomaly; session-duration p99 breach; per-session tool-call count breach.
- Tool misuse: external-system audit-log diffs; customer complaints; policy-engine denial rate spike (each denial means an attempt was blocked, but a spike signals that attempts are being made).
- Coordination failure: multi-agent deadlock detection (all agents in wait state for > threshold); delegation chain depth exceeds max; same goal appearing in multiple agents’ queues.
- Behavioral regression: the canary anomaly detectors from Article 24; user feedback volume spike; support-ticket pattern shift.
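The runaway-loop signals above reduce to threshold checks against the expected p99 envelope. A minimal detector sketch; the threshold values are illustrative, not recommendations:

```python
# Expected p99 envelope for a healthy session. Values are illustrative and
# would come from the agent's observed baseline in practice.
P99_DURATION_S = 120.0
P99_TOOL_CALLS = 40
P99_TOKEN_SPEND = 200_000

def runaway_breaches(duration_s: float, tool_calls: int, tokens: int) -> list[str]:
    """Return the list of breached runaway-loop signals; empty means no alert."""
    breaches = []
    if duration_s > P99_DURATION_S:
        breaches.append("session-duration p99 breach")
    if tool_calls > P99_TOOL_CALLS:
        breaches.append("per-session tool-call count breach")
    if tokens > P99_TOKEN_SPEND:
        breaches.append("cost-per-session anomaly")
    return breaches
```

In practice each breach would fire as its own alert so the on-call engineer sees which dimension blew the envelope first.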
Phase 2 — Contain
Containment stops the bleeding. The core action per class:
- Memory poisoning: quarantine suspect memory entries (mark as non-readable but preserve for forensics); if the poison is widespread, switch the agent to a pre-incident memory snapshot.
- Goal hijacking: kill-switch the session; invalidate the session’s memory; block the source of the injection if identifiable.
- Runaway loop: trigger session-level kill-switch; open circuit breakers on the tool(s) that were being called; scale the agent’s token budget cap down temporarily.
- Tool misuse: revoke the agent’s authorization for that tool; if side effects occurred, invoke compensating transactions; notify the affected system’s owners.
- Coordination failure: halt all coordinating agents; reset coordination state; restart with a clean state.
- Behavioral regression: roll back to the previous version; disable the new version in the promotion pipeline pending investigation.
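Two of the containment actions above (runaway loop, tool misuse) rely on a tool-level circuit breaker. A minimal sketch of the open-on-threshold behavior; the class and threshold are illustrative:

```python
class ToolCircuitBreaker:
    """Minimal circuit breaker for a single tool: opens after a failure or
    policy-denial threshold and rejects further calls until explicitly
    reset (which should happen only after remediation)."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.open = True  # contain: stop all further calls to this tool

    def allow_call(self) -> bool:
        return not self.open

    def reset(self) -> None:
        self.failures = 0
        self.open = False
```

A production implementation would add a half-open state and time-based decay; the containment property that matters here is that an open breaker fails closed.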
Phase 3 — Remediate
Remediation restores the system to a safe state and addresses root cause.
- Memory poisoning: audit all memory writes since the suspected injection date; remove poisoned entries; tighten memory-write policy; add the injection vector to the red-team battery.
- Goal hijacking: review the injection path; patch the input-validation or output-filtering gap; add regression test covering the specific attack.
- Runaway loop: implement the loop-exit condition that failed; add per-session token budget if missing; add tool-level circuit breaker.
- Tool misuse: tighten the authorization rule (Articles 6 and 22); add schema validation; steer the agent toward the corrected usage pattern via examples in the prompt.
- Coordination failure: add timeout; add completion-acknowledgment; update coordination policy; document the pattern in the anti-pattern library.
- Behavioral regression: add golden-task coverage for the regression; update promotion gates; schedule a re-canary for the fixed version.
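The runaway-loop remediation, a loop-exit condition plus a per-session token budget, can be sketched as follows; `step` and the limits are hypothetical:

```python
from typing import Callable

def run_agent_loop(step: Callable[[], tuple[bool, int]],
                   max_iterations: int = 20,
                   token_budget: int = 100_000) -> str:
    """Agent loop with the two exit conditions remediation adds: a hard
    iteration cap and a per-session token budget. `step` executes one
    reasoning/tool-call turn and returns (done, tokens_used)."""
    spent = 0
    for _ in range(max_iterations):
        done, tokens = step()
        spent += tokens
        if done:
            return "completed"
        if spent >= token_budget:
            return "halted: token budget exhausted"
    return "halted: iteration cap reached"
```

The halted states are terminal by construction, which is exactly the property the original loop lacked.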
Phase 4 — Post-mortem
The post-mortem captures what happened, why, and what will prevent recurrence. The agentic post-mortem template has these sections:
- Summary. One paragraph describing the incident, impact, and resolution.
- Timeline. Timestamped events from detection through resolution.
- Root cause. The immediate cause and the contributing causes.
- Impact. Users affected; business cost; regulatory exposure.
- What went well. Detection speed; containment effectiveness.
- What went wrong. Detection gaps; containment friction; communication gaps.
- Action items. Specific, owner-assigned, date-targeted changes.
- Evidence. Agent trace IDs; tool-call logs; memory snapshots; policy-engine decisions.
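The template's sections map naturally onto a structured record, which makes post-mortems queryable across incidents rather than trapped in documents. A sketch with illustrative field names:

```python
from dataclasses import dataclass, field

@dataclass
class AgentPostMortem:
    """The post-mortem template as a structured record (field names are
    illustrative, not a standard)."""
    summary: str
    timeline: list[tuple[str, str]]        # (timestamp, event)
    root_cause: str
    contributing_causes: list[str]
    impact: str                            # users affected, cost, exposure
    went_well: list[str]
    went_wrong: list[str]
    action_items: list[dict]               # {"action", "owner", "due"}
    evidence: dict = field(default_factory=dict)  # trace IDs, logs, snapshots
```

With post-mortems in this shape, the architect's review for architectural implications becomes a query over `action_items` rather than a re-read of every document.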
The architect reviews every post-mortem for architectural implications. Action items that imply platform-level changes (new policy pattern, new observability signal, new runbook class) get rolled back into the reference architecture and the shared platform.
Severity matrix
Traditional severity matrices (P1–P5 or SEV1–SEV5) carry over to agentic systems but benefit from agent-specific definitions.
- P1 / SEV1: Agent causing active external harm (false refunds, incorrect advice going to customers in real time, production-database corruption). Immediate kill-switch of the agent class. Executive notification within 30 minutes.
- P2 / SEV2: Agent producing incorrect results at scale (goal-achievement rate drop > 20%; systematic tool misuse). Kill-switch the affected session class. Engineering response within 1 hour.
- P3 / SEV3: Agent producing incorrect results for some users (isolated goal hijacks, tool misuse on edge cases). Pause new sessions; investigate. Engineering response within 4 business hours.
- P4 / SEV4: Quality regression without user harm (slight metric degradation in canary). Fix in next release. Engineering response within 2 business days.
- P5 / SEV5: Minor behavioral drift observable in metrics, no user impact.
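A first-pass triage against this matrix can be encoded so the on-call engineer gets a consistent initial classification. The function below is a sketch: the inputs and thresholds mirror the matrix above, but final severity remains a human call.

```python
def classify_severity(active_external_harm: bool,
                      goal_achievement_drop: float,
                      users_affected: int) -> str:
    """First-pass severity triage mirroring the matrix above.
    goal_achievement_drop is a fraction (0.25 means a 25% drop)."""
    if active_external_harm:
        return "SEV1"  # immediate kill-switch, exec notification in 30 min
    if goal_achievement_drop > 0.20:
        return "SEV2"  # kill-switch affected session class, respond in 1 h
    if users_affected > 0:
        return "SEV3"  # pause new sessions, respond in 4 business hours
    if goal_achievement_drop > 0.0:
        return "SEV4"  # fix in next release
    return "SEV5"      # drift visible in metrics only
```

Ordering matters: the checks run from most to least severe, so an incident is always classified at the highest level whose condition it meets.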
Regulatory reporting
EU AI Act Article 73 requires serious-incident reporting for high-risk AI systems. The reporting threshold captures incidents causing serious harm to health, life, or fundamental rights, serious disruption to critical infrastructure, or breaches of EU law aimed at protecting fundamental rights. The architect should assume certain agent incident classes — P1 goal-hijack incidents in Annex III deployments, memory-poisoning affecting regulated decisions, tool-misuse incidents that violate regulated processes — are reportable and should be pre-classified in the runbook. Incident reports must go to the relevant market-surveillance authority “immediately after the provider has established a causal link” and no later than 15 days after the provider became aware.
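The Article 73 outer deadline is simple date arithmetic, but worth encoding in incident tooling so it is never miscounted mid-incident:

```python
from datetime import date, timedelta

# Outer reporting window under EU AI Act Article 73: no later than 15 days
# after the provider became aware of the serious incident. (Reporting must
# happen sooner once a causal link is established.)
ART73_OUTER_WINDOW = timedelta(days=15)

def art73_reporting_deadline(awareness_date: date) -> date:
    """Return the latest permissible reporting date for a serious incident."""
    return awareness_date + ART73_OUTER_WINDOW
```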
Insurance claims and vendor-coordination
When an incident involves a third-party vendor (the model provider, a tool vendor, the orchestration framework vendor), the runbook should specify the notification chain to that vendor. Most major model providers (Anthropic, OpenAI, Google) have public incident-intake processes; the architect should ensure the contact and the required evidence format are in the runbook rather than scrambled during the incident.
Exercises and game-days
A runbook not exercised is a runbook not reliable. Quarterly game-days that simulate incidents in each of the six classes keep the team’s response muscle warm. The architect participates as the technical authority on agentic behavior during the exercise and uses the exercise to find runbook gaps.
Moffatt v. Air Canada lessons
The 2024 British Columbia Civil Resolution Tribunal decision in Moffatt v. Air Canada is the first public tribunal ruling on liability for incorrect AI-chatbot advice. The tribunal held Air Canada responsible for the chatbot’s incorrect bereavement-fare advice to Mr. Moffatt, rejecting Air Canada’s argument that the chatbot was a “separate legal entity.” The architectural lessons:
- The agent’s representations to the user bind the organization.
- “Tool-like policy effect” — the chatbot committed the company to a policy — arises even without explicit tool calls.
- The incident-response runbook must include customer-facing-statement remediation (retraction, customer-specific remedy), not just system-state remediation.
Learning outcomes
- Explain the six agent incident classes and the detection signals for each.
- Classify four runbook stages — detect, contain, remediate, post-mortem — and the class-specific actions at each.
- Evaluate an incident-response runbook for class coverage and detection-signal adequacy.
- Design an incident-response plan for a given agent, including severity matrix, escalation chain, and regulatory-reporting triggers.
Further reading
- Core Stream anchors: EATF-Level-1/M1.5-Art09-Audit-Preparedness-and-Compliance-Operations.md; EATP-Level-2/M2.4-Art12-Operational-Resilience-for-Agentic-AI-Failure-Modes-and-Recovery.md.
- AITE-ATS siblings: Article 9 (kill-switch), Article 16 (operational resilience), Article 23 (EU AI Act), Article 24 (lifecycle).
- Primary sources: Replit AI Agent memory-corruption public postmortem (2024); Moffatt v. Air Canada, 2024 BCCRT 149; OpenAI March 2023 Redis-bug post-incident report; Google SRE Book postmortem culture chapter.