Incident response for AI systems extends classical incident response with five AI-specific incident classes. This article teaches the architect to design the SLI/SLO sheet, the runbook, and the post-incident review template that make AI operations tractable.
Classical SLIs and SLOs, applied to AI
The classical SLI triad is availability, latency, and error rate. Each has an AI-specific interpretation.
Availability for an AI service is the fraction of requests that return a parsable response within a hard deadline. “Parsable” matters because an LLM can return a 200 OK with a body that is malformed JSON; classical HTTP availability metrics miss this. A production architect defines a response validator and measures availability against validator-pass, not HTTP-200.
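A minimal sketch of such a validator, assuming a JSON response body with a required `answer` field (the field name and schema are illustrative, not a standard):

```python
import json

def validator_pass(status_code: int, body: str, required_keys=("answer",)) -> bool:
    """True only if the response is HTTP 200 AND the body survives validation."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return False
    # Schema check: the body must be an object with every required field non-empty.
    return isinstance(payload, dict) and all(payload.get(k) for k in required_keys)

def availability(responses) -> float:
    """Validator-pass availability: the SLI is measured against validation, not HTTP 200."""
    return sum(validator_pass(status, body) for status, body in responses) / len(responses)

# A 200 OK with malformed JSON counts as unavailable:
sample = [(200, '{"answer": "reset your router"}'),  # passes
          (200, '{"answer": '),                      # 200 but malformed -> fails
          (503, "")]                                 # classical failure
```

The second response would pass a naive HTTP-200 availability metric and fail this one, which is exactly the gap the validator closes.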
Latency for an AI service has two layers. End-to-end p95 and p99 matter, but so does time-to-first-token (TTFT): in a streaming interface, TTFT drives perceived responsiveness, so the architect sets separate targets for TTFT and for end-to-end completion.[1]
Error rate combines HTTP errors, timeouts, validator-fail responses, safety-refusal rates (when those are unexpected), and tool-call failures. The architect defines which errors count against the SLO and publishes the classification.
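One way to make that published classification executable is a small predicate over event types (the event names here are illustrative, not a standard taxonomy):

```python
# Events that always burn the error budget (illustrative taxonomy).
ALWAYS_COUNTED = {"http_5xx", "timeout", "validator_fail", "tool_call_error"}

def counts_against_slo(event: str, refusal_expected: bool = False) -> bool:
    """Decide whether an error event counts against the published SLO."""
    if event == "safety_refusal":
        # Refusals only count when the request should NOT have been refused.
        return not refusal_expected
    return event in ALWAYS_COUNTED
```

Publishing this predicate alongside the SLO sheet removes the recurring argument over whether an expected safety refusal "counts".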
The three AI-specific SLIs
Evaluation score
The eval score is the ongoing output of the evaluation harness (Article 11) applied to production traffic samples. It is usually a composite: capability score (LLM-as-judge or automated metrics), safety score, and citation-accuracy score for RAG systems. Target: stable or rising across the release window.
An SLO on eval score is the strongest AI-specific SLO the architect can write: “the 7-day rolling eval score must remain within 2 points of the baseline.” Eval score regressions are one of the five incident classes below and are often the only early-warning signal before users complain.
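That SLO can be checked with a simple rolling-window comparison. A sketch, assuming daily composite scores sampled from production (the scores, baseline, and window below are illustrative):

```python
from collections import deque

def eval_slo_breaches(daily_scores, baseline, window=7, tolerance=2.0):
    """Return the days on which the rolling mean eval score breached
    baseline - tolerance: the 'within 2 points of baseline' SLO."""
    buf, breaches = deque(maxlen=window), []
    for day, score in enumerate(daily_scores, start=1):
        buf.append(score)
        if sum(buf) / len(buf) < baseline - tolerance:
            breaches.append(day)
    return breaches

# A slow quality drift trips the SLO before any single day looks alarming:
days = [90, 90, 90, 88, 86, 84, 82]
```

No single day here drops more than 2 points from the previous one, yet the rolling mean breaches on day 7 — the early-warning behaviour the SLO is designed to provide.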
Cost per query
Cost per query is the cloud cost for a single user interaction, summed across model call, retrieval, tool execution, observability, and network egress. Target: within budget ceiling; anomaly detection on per-query cost catches prompt regressions (a prompt that suddenly generates 10x more output tokens is a cost incident before it is a quality incident).
An architect who does not measure cost per query as an SLI cannot run an AI system responsibly at scale — the FinOps architecture in Article 33 depends on this SLI existing as a first-class metric.
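A sketch of the SLI and a simple per-query anomaly check (the component names and the 3x threshold are illustrative choices, not a standard):

```python
COST_COMPONENTS = ("model_call", "retrieval", "tool_execution", "observability", "egress")

def cost_per_query(costs: dict) -> float:
    """Sum the per-interaction cost across all billed components."""
    return sum(costs.get(component, 0.0) for component in COST_COMPONENTS)

def is_cost_anomaly(query_cost: float, baseline: float, factor: float = 3.0) -> bool:
    """Flag a query whose cost exceeds factor x baseline, e.g. a prompt change
    that suddenly makes the model emit far more output tokens."""
    return query_cost > factor * baseline

normal = {"model_call": 0.020, "retrieval": 0.004, "egress": 0.001}
runaway = {"model_call": 0.200, "retrieval": 0.004, "egress": 0.001}
```

The runaway query trips the anomaly check on cost alone, before any eval score has had time to move.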
Retrieval freshness
For RAG systems, freshness is the age of the corpus at retrieval time. Target: <N hours for real-time workflows, <N days for knowledge workflows. Stale retrieval produces wrong answers that look right, which is a failure mode worse than a visible error. The freshness SLI is calculated by timestamping every indexed document and checking the retrieved passages’ timestamps at response time.
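The p95 of those passage ages is then the SLI value. A sketch using the nearest-rank percentile definition (a production system would more likely use a metrics backend's histogram):

```python
import math
from datetime import datetime, timedelta

def freshness_p95_hours(passage_timestamps, now):
    """p95 age (hours) of retrieved passages, nearest-rank percentile."""
    ages = sorted((now - ts).total_seconds() / 3600 for ts in passage_timestamps)
    rank = max(1, math.ceil(0.95 * len(ages)))  # nearest-rank definition
    return ages[rank - 1]

now = datetime(2025, 1, 10, 12, 0)
# 100 retrieved passages whose source documents are 1..100 hours old:
stamps = [now - timedelta(hours=h) for h in range(1, 101)]
```

For this sample the p95 age is 95 hours, which would already breach the 72h target in the sample sheet below and trigger a re-index.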
The SLI/SLO target sheet
The architect’s primary deliverable here is a one-page SLI/SLO target sheet. Each row: SLI name, measurement method, SLO target, error-budget window, action on budget burn. A sample for a customer-support assistant:
| SLI | Target | Window | Action on burn |
|---|---|---|---|
| Availability (validator-pass) | 99.5% | 30 days | Freeze non-urgent changes |
| TTFT p95 | <500ms | 7 days | Review serving config |
| End-to-end p95 | <4s | 7 days | Review orchestration + retrieval |
| Eval score | ≥ baseline -2 | 7 days | Halt rollouts, full eval review |
| Cost per query | <$0.04 | 7 days | Cost audit, prompt review |
| Retrieval freshness p95 | <72h | 7 days | Re-index, alert corpus team |
| Refusal rate | 1-5% | 7 days | Tune guardrails |
| Tool-call error rate | <1% | 7 days | Tool contract audit |
The sheet is updated quarterly. The first version of the sheet is often too loose; experience with real traffic tightens it. Architects who lock targets before the first canary are invariably wrong.
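To keep the sheet enforceable rather than decorative, some teams mirror it as machine-readable data that the alerting pipeline can act on. A sketch using values from the sample sheet (the key names are illustrative):

```python
# The SLI/SLO sheet mirrored as data; values from the sample sheet above.
SLO_SHEET = {
    "availability_validator_pass": {"target": 0.995, "window_days": 30, "on_burn": "freeze_non_urgent_changes"},
    "ttft_p95_seconds":            {"target": 0.5,   "window_days": 7,  "on_burn": "review_serving_config"},
    "eval_score_delta":            {"target": -2.0,  "window_days": 7,  "on_burn": "halt_rollouts_full_eval_review"},
    "cost_per_query_usd":          {"target": 0.04,  "window_days": 7,  "on_burn": "cost_audit_prompt_review"},
}

def action_on_burn(sli_name: str) -> str:
    """Look up the pre-agreed action when an SLI's error budget is burned."""
    return SLO_SHEET[sli_name]["on_burn"]
```

Encoding the "action on burn" column as data means the quarterly sheet update and the alert routing change in one place.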
The five AI incident classes
Classical incidents (outage, performance, security) still apply. Five AI-specific incident classes add to the taxonomy.
1. Confabulation outbreak
A change (prompt, model, corpus) causes the system to produce plausible but wrong content at an elevated rate. The DPD UK chatbot swearing incident and the NYC MyCity wrong-law answers are public examples of related failure modes.[2]
Detection: eval score drop, human-review sampling, user complaints. Containment: route to safe fallback (canned response, escalate-to-human). Remediation: identify the change, roll back, re-evaluate.
2. Safety bypass
A user finds a prompt pattern that bypasses the safety guardrails. The Chevrolet of Watsonville $1 Tahoe prompt injection is the public case.[3]
Detection: safety eval failures on adversarial samples, external disclosure. Containment: patch the guardrail immediately; in severe cases, take the feature offline. Remediation: full threat model review (Article 14), red-team exercise, gate-review at Model stage.
3. Prompt injection
A hostile input either in the user message or in retrieved content causes the model to follow attacker-chosen instructions. OWASP LLM01 is the reference taxonomy.[4]
Detection: anomaly detection on output patterns (tool calls, PII emission, unusual token patterns), user reports. Containment: add input/output filters for the exploit pattern; if indirect injection via retrieval, purge and re-index. Remediation: threat-model update, security architecture review.
4. Model regression
A model version upgrade, prompt change, or tooling change regresses a specific pattern even as the average eval score is stable. This is the canary-miss case — the regression is in a minority pattern the canary sample did not exercise.
Detection: slice-based eval monitoring (per-vertical eval scores), customer complaints. Containment: rollback. Remediation: expand eval coverage to the regressed slice; add that slice to the standing regression suite.
5. Retrieval corruption
The retrieval index serves poisoned, stale, or mis-chunked content, or its hybrid-retrieval balance is off and quality drops. Source poisoning — a document added to the corpus that contains instructions the model then follows — is a specific sub-case from OWASP LLM08.[4]
Detection: citation-accuracy eval drop, content-safety evals on retrieved passages, cost anomaly (longer contexts). Containment: revert to prior corpus snapshot. Remediation: corpus audit, source-authorization tightening, re-embed and re-index.
The runbook structure
A production runbook for an AI service includes, per incident class: detection signal, first-response action, decision rights (who can take the feature offline), communication template, rollback command, and escalation tree. The architect does not own the runbook day-to-day but owns its shape and participates in periodic reviews.
The OpenAI 20 March 2023 Redis bug post-incident report is the canonical public example of an AI incident post-mortem.[5] Microsoft and Google status pages include AI-service incidents; reading several builds intuition for the real cadence and character of AI incidents.[6]
Kill-switch architecture
Every AI system needs a kill-switch — a single control that immediately takes the feature offline or downgrades it to a safe fallback. For agentic systems the kill-switch is mandatory and its design is treated in depth in the AITE-ATS credential and the Core Stream article EATL-Level-4/M9.3-Art02.
The architect specifies: what the switch does (refuse requests, fall back to non-AI response, mask part of the feature), who can trigger it (SRE on-call, security team, product leader), how long it can be held (typically a bounded window requiring re-authorisation), and how its state is surfaced to users (explicit error message, graceful degradation).
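A minimal sketch of such a switch, assuming illustrative role names and a 4-hour hold window (both would be set per organisation, and the state would live in a shared store, not in process memory):

```python
from datetime import datetime, timedelta
from enum import Enum

class Mode(Enum):
    LIVE = "live"          # normal operation
    FALLBACK = "fallback"  # safe non-AI canned response
    OFFLINE = "offline"    # refuse requests outright

class KillSwitch:
    AUTHORISED = {"sre_oncall", "security", "product_lead"}  # illustrative roles
    MAX_HOLD = timedelta(hours=4)  # bounded window; longer holds need re-authorisation

    def __init__(self):
        self.mode, self.since = Mode.LIVE, None

    def trigger(self, actor: str, mode: Mode, now: datetime) -> None:
        """Only listed roles may flip the switch; the trigger time starts the hold."""
        if actor not in self.AUTHORISED:
            raise PermissionError(f"{actor} may not trigger the kill-switch")
        self.mode, self.since = mode, now

    def hold_expired(self, now: datetime) -> bool:
        """True once the hold has outlived its window and needs re-authorisation."""
        return self.mode is not Mode.LIVE and now - self.since > self.MAX_HOLD
```

Whether an expired hold fails open (back to LIVE) or fails closed (stays OFFLINE until re-authorised) is itself a design decision the architect records in the spec.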
Error-budget policy
Classical SRE error-budget policy: when the budget is exhausted, halt non-urgent releases until it recovers. For AI, the policy extends across more SLIs — burning the eval-score budget freezes prompt and model changes; burning the cost budget triggers a cost audit and possibly a routing change; burning the availability budget triggers the classical response.
The architect writes the error-budget policy once and publishes it; the platform team enforces it. A well-written policy dissolves most of the usual friction between product and reliability because it replaces case-by-case arguments with rules.
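The arithmetic behind "budget exhausted" is simple enough to show: for an availability-style SLO the budget is the allowed bad fraction, and burn is the observed bad fraction divided by the allowed one. A sketch:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still left (negative means exhausted)."""
    allowed_bad = 1.0 - slo_target          # e.g. a 99.5% target -> 0.5% budget
    observed_bad = 1.0 - good / total
    return 1.0 - observed_bad / allowed_bad

# A 99.5% availability SLO over 10,000 requests allows 50 failures;
# 30 failures have burned 60% of the budget, leaving 40%:
remaining = error_budget_remaining(0.995, good=9_970, total=10_000)
```

The same function applies to any pass/fail SLI in the sheet — validator-pass availability, tool-call error rate — once "good" is defined by the published error classification.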
Governance integration
For EU AI Act Article 12 (record-keeping) and Article 15 (accuracy, robustness, cybersecurity), the SLO sheet and the incident log are primary evidence.[7] ISO/IEC 42001 Clause 9.1 (monitoring) maps directly to SLI measurement. NIST AI RMF MANAGE 1.3 (resource management) and MANAGE 2.3 (risk response) map to the error-budget policy and the kill-switch.[8]
Anti-patterns
- SLOs that copy classical services without AI-specific SLIs. An availability SLO alone misses the incident class that hurts AI products most: silent quality regression.
- SLOs written by the reliability team alone, without the eval-harness owner. The eval SLO should be co-owned by reliability and the ML/AI team.
- Runbooks that assume the response is “restart the service.” AI incidents rarely resolve via restart; they resolve via rollback or policy change. Runbooks reflecting classical patterns will waste the first hour of every incident.
- Kill-switch that has never been tested. Untested kill-switches fail at the worst moment. The architect mandates quarterly kill-switch drills.
Summary
Running AI in production is the same SRE discipline as running any other service, plus three AI-specific SLIs and five AI-specific incident classes. The architect’s job is to design the SLO target sheet, define the kill-switch, write the error-budget policy, and make sure the incident runbook covers the AI-specific failure modes. The outputs feed EU AI Act Article 12/15 evidence and ISO/IEC 42001 Clause 9.1 compliance.
Key terms
- SLI/SLO for AI
- Error budget
- Kill-switch
- Confabulation outbreak
- Retrieval corruption
Learning outcomes
After this article the learner can: explain AI-specific SLIs and SLOs; classify five AI incident types; evaluate a runbook for AI incident fit; design an SLO target sheet and kill-switch spec.
Footnotes
[1] Google Research, "Web Vitals" and latency research summaries; OpenAI and Anthropic streaming UX guidance.
[2] BBC and The Markup coverage of the DPD chatbot (January 2024) and NYC MyCity (March 2024) incidents.
[3] Business Insider, "Chevrolet of Watsonville dealer AI chatbot agrees to sell Tahoe for $1" (December 2023).
[4] OWASP Top 10 for LLM Applications, v2025, LLM01 (prompt injection) and LLM08 (vector and embedding weaknesses).
[5] OpenAI, "March 20 ChatGPT outage: Here's what happened" (post-incident report, 2023).
[6] Microsoft Azure status; Google Cloud AI/Vertex status; Anthropic status pages (public).
[7] Regulation (EU) 2024/1689 (AI Act), Articles 12 and 15.
[8] NIST AI 100-1, MANAGE 1.3 and 2.3.