AITE M1.1-Art20 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

SLO, SLI, and Incident Response for AI

AI Strategy & Vision · Advanced depth · COMPEL Body of Knowledge

10 min read Article 20 of 48

Incident response for AI systems extends classical incident response with five AI-specific incident classes. This article teaches the architect to design the SLI/SLO sheet, the runbook, and the post-incident review template that makes AI operations tractable.

Classical SLIs and SLOs, applied to AI

The classical SLI triad is availability, latency, and error rate. Each has an AI-specific interpretation.

Availability for an AI service is the fraction of requests that return a parsable response within a hard deadline. “Parsable” matters because an LLM can return a 200 OK with a body that is malformed JSON; classical HTTP availability metrics miss this. A production architect defines a response validator and measures availability against validator-pass, not HTTP-200.
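Measured in code, validator-pass availability looks like this minimal sketch; the required `answer` field, the 10-second deadline, and the record shape are illustrative assumptions, not a fixed schema:

```python
import json

def validator_pass(body: str) -> bool:
    """True only if the body parses as JSON containing the expected field.
    The required 'answer' field is an illustrative schema assumption."""
    try:
        return "answer" in json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return False

def availability(responses: list[dict]) -> float:
    """Fraction of requests that are HTTP 200 AND on time AND validator-pass."""
    ok = sum(
        1 for r in responses
        if r["status"] == 200 and r["latency_s"] <= 10.0 and validator_pass(r["body"])
    )
    return ok / len(responses)

sample = [
    {"status": 200, "latency_s": 1.2, "body": '{"answer": "ok"}'},  # counts
    {"status": 200, "latency_s": 0.8, "body": '{"answer": '},       # 200 OK, malformed JSON
    {"status": 500, "latency_s": 0.1, "body": ""},                  # HTTP error
]
print(availability(sample))  # ~0.33; HTTP-200 availability alone would report ~0.67
```

The second record is the case the article warns about: a classical HTTP availability metric counts it as success, a validator-pass metric does not.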

Latency for an AI service has two layers. End-to-end p95 and p99 matter, but so does time-to-first-token (TTFT), because streaming UX makes TTFT the user-perceived latency. The ~100ms responsiveness rule of thumb from web-latency research is the usual anchor; for LLM chat, most enterprise architects target <500ms TTFT and <4s p95 end-to-end, and very different targets apply to batch workloads.1

Error rate combines HTTP errors, timeouts, validator-fail responses, safety-refusal rates (when those are unexpected), and tool-call failures. The architect defines which errors count against the SLO and publishes the classification.
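The published classification can be a small, testable function. A minimal sketch; the event-type names and the `expected` flag on refusals are illustrative assumptions:

```python
def counts_against_slo(event: dict) -> bool:
    """Published error classification (illustrative): which events burn budget.
    Expected safety refusals, i.e. the guardrail working as intended, are excluded."""
    if event["type"] in {"http_5xx", "timeout", "validator_fail", "tool_call_error"}:
        return True
    if event["type"] == "safety_refusal":
        return not event.get("expected", False)  # only unexpected refusals count
    return False

events = [
    {"type": "http_5xx"},                          # counts
    {"type": "safety_refusal", "expected": True},  # intended guardrail, excluded
    {"type": "safety_refusal", "expected": False}, # unexpected refusal, counts
]
print(sum(counts_against_slo(e) for e in events))  # 2
```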

The three AI-specific SLIs

Evaluation score

The eval score is the ongoing output of the evaluation harness (Article 11) applied to production traffic samples. It is usually a composite: capability score (LLM-as-judge or automated metrics), safety score, and citation-accuracy score for RAG systems. Target: stable or rising across the release window.

An SLO on eval score is the strongest AI-specific SLO the architect can write: “the 7-day rolling eval score must remain within 2 points of the baseline.” Eval score regressions are one of the five incident classes below and are often the only early-warning signal before users complain.
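The rolling-window check is straightforward to state in code. A sketch assuming one composite score per day and an illustrative baseline of 82:

```python
from collections import deque

BASELINE = 82.0  # illustrative baseline eval score
WINDOW = 7       # days

def rolling_eval_breach(daily_scores: list[float], baseline: float = BASELINE) -> bool:
    """SLO: the 7-day rolling mean eval score stays within 2 points of baseline.
    Returns True as soon as any full window breaches."""
    window = deque(maxlen=WINDOW)
    for score in daily_scores:
        window.append(score)
        if len(window) == WINDOW and sum(window) / WINDOW < baseline - 2.0:
            return True  # budget burned: halt rollouts, full eval review
    return False

print(rolling_eval_breach([82, 81, 82, 80, 79, 78, 77]))  # True: mean ~79.9 < 80
print(rolling_eval_breach([82] * 14))                     # False
```

Note the gradual decline in the first series: no single day looks alarming, which is exactly why the rolling SLO is the early-warning signal.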

Cost per query

Cost per query is the cloud cost for a single user interaction, summed across model call, retrieval, tool execution, observability, and network egress. Target: within budget ceiling; anomaly detection on per-query cost catches prompt regressions (a prompt that suddenly generates 10x more output tokens is a cost incident before it is a quality incident).
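A hedged sketch of the per-query sum; the component names and the per-1k-token price are illustrative, not any provider's actual rate card:

```python
def cost_per_query(q: dict, price_per_1k_out: float = 0.02) -> float:
    """Sum the component costs of one user interaction (illustrative components)."""
    model = q["output_tokens"] / 1000 * price_per_1k_out
    return round(model + q["retrieval"] + q["tools"] + q["observability"] + q["egress"], 6)

normal = {"output_tokens": 400, "retrieval": 0.002, "tools": 0.001,
          "observability": 0.0005, "egress": 0.0005}
runaway = dict(normal, output_tokens=4000)  # prompt regression emitting 10x output tokens

print(cost_per_query(normal))   # 0.012
print(cost_per_query(runaway))  # 0.084: trips a per-query anomaly alert long before quality evals do
```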

An architect who does not measure cost per query as an SLI cannot run an AI system responsibly at scale — the FinOps architecture in Article 33 depends on this SLI existing as a first-class metric.

Retrieval freshness

For RAG systems, freshness is the age of the corpus at retrieval time. Target: <N hours for real-time workflows, <N days for knowledge workflows. Stale retrieval produces wrong answers that look right, which is a failure mode worse than a visible error. The freshness SLI is calculated by timestamping every indexed document and checking the retrieved passages’ timestamps at response time.
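A minimal freshness computation along those lines, assuming every indexed document carries an indexed-at timestamp and using a simple nearest-rank p95:

```python
import math
from datetime import datetime, timedelta, timezone

def freshness_p95_hours(indexed_at: list[datetime], now: datetime) -> float:
    """p95 age, in hours, of the passages retrieved for a response
    (nearest-rank percentile over their indexed-at timestamps)."""
    ages = sorted((now - ts).total_seconds() / 3600 for ts in indexed_at)
    return ages[math.ceil(0.95 * len(ages)) - 1]

now = datetime(2026, 4, 6, tzinfo=timezone.utc)
passages = [now - timedelta(hours=h) for h in (2, 5, 30, 60, 90)]
p95 = freshness_p95_hours(passages, now)
print(p95, p95 < 72)  # 90.0 False: breaches a <72h freshness SLO, so re-index
```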

The SLI/SLO target sheet

The architect’s primary deliverable here is a one-page SLI/SLO target sheet. Each row: SLI name, measurement method, SLO target, error-budget window, action on budget burn. A sample for a customer-support assistant:

| SLI | Target | Window | Action on burn |
|---|---|---|---|
| Availability (validator-pass) | 99.5% | 30 days | Freeze non-urgent changes |
| TTFT p95 | <500ms | 7 days | Review serving config |
| End-to-end p95 | <4s | 7 days | Review orchestration + retrieval |
| Eval score | ≥ baseline -2 | 7 days | Halt rollouts, full eval review |
| Cost per query | <$0.04 | 7 days | Cost audit, prompt review |
| Retrieval freshness p95 | <72h | 7 days | Re-index, alert corpus team |
| Refusal rate | 1–5% | 7 days | Tune guardrails |
| Tool-call error rate | <1% | 7 days | Tool contract audit |

The sheet is updated quarterly. The first version of the sheet is often too loose; experience with real traffic tightens it. Architects who lock targets before first canary are invariably wrong.
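Keeping the sheet machine-readable makes the burn actions enforceable rather than aspirational. A minimal sketch with illustrative field names, mirroring a few rows of the sample sheet:

```python
SLO_SHEET = [
    # (sli, target, window_days, action_on_burn): one row per SLI
    ("availability_validator_pass", ">=0.995",      30, "freeze non-urgent changes"),
    ("ttft_p95_ms",                 "<500",          7, "review serving config"),
    ("eval_score",                  ">=baseline-2",  7, "halt rollouts, full eval review"),
    ("cost_per_query_usd",          "<0.04",         7, "cost audit, prompt review"),
]

def action_for(sli: str) -> str:
    """Look up the published action when an SLI's budget burns."""
    return next(action for name, _, _, action in SLO_SHEET if name == sli)

print(action_for("eval_score"))  # halt rollouts, full eval review
```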

The five AI incident classes

Classical incidents (outage, performance, security) still apply. Five AI-specific incident classes add to the taxonomy.

1. Confabulation outbreak

A change (prompt, model, corpus) causes the system to produce plausible but wrong content at an elevated rate. The DPD UK chatbot swearing incident and the NYC MyCity wrong-law answers are public examples of related failure modes.2

Detection: eval score drop, human-review sampling, user complaints. Containment: route to safe fallback (canned response, escalate-to-human). Remediation: identify the change, roll back, re-evaluate.

2. Safety bypass

A user finds a prompt pattern that bypasses the safety guardrails. The Chevrolet of Watsonville $1 Tahoe prompt injection is the public case.3

Detection: safety eval failures on adversarial samples, external disclosure. Containment: patch the guardrail immediately; in severe cases, take the feature offline. Remediation: full threat model review (Article 14), red-team exercise, gate-review at Model stage.

3. Prompt injection

A hostile input either in the user message or in retrieved content causes the model to follow attacker-chosen instructions. OWASP LLM01 is the reference taxonomy.4

Detection: anomaly detection on output patterns (tool calls, PII emission, unusual token patterns), user reports. Containment: add input/output filters for the exploit pattern; if indirect injection via retrieval, purge and re-index. Remediation: threat-model update, security architecture review.

4. Model regression

A model version upgrade, prompt change, or tooling change regresses a specific pattern even as the average eval score is stable. This is the canary-miss case — the regression is in a minority pattern the canary sample did not exercise.

Detection: slice-based eval monitoring (per-vertical eval scores), customer complaints. Containment: rollback. Remediation: expand eval coverage to the regressed slice; add that slice to the standing regression suite.

5. Retrieval corruption

The retrieval index serves poisoned, stale, or mis-chunked content, or its hybrid-retrieval balance is off and quality drops. Source poisoning — a document added to the corpus that contains instructions the model then follows — is a specific sub-case from OWASP LLM08.4

Detection: citation-accuracy eval drop, content-safety evals on retrieved passages, cost anomaly (longer contexts). Containment: revert to prior corpus snapshot. Remediation: corpus audit, source-authorization tightening, re-embed and re-index.

The runbook structure

A production runbook for an AI service includes, per incident class: detection signal, first-response action, decision rights (who can take the feature offline), communication template, rollback command, and escalation tree. The architect does not own the runbook day-to-day but owns its shape and participates in periodic reviews.
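Holding one entry per incident class as structured data makes completeness checkable in CI. A sketch of the shape; the template path and rollback command are hypothetical placeholders, not real artifacts:

```python
RUNBOOK = {
    # One entry per incident class; fields mirror the runbook shape above.
    "confabulation_outbreak": {
        "detection":       "7-day rolling eval score falls >2 pts below baseline",
        "first_response":  "route traffic to safe fallback (canned response, escalate-to-human)",
        "decision_rights": "SRE on-call engages fallback; product lead takes feature offline",
        "communication":   "templates/ai-incident-user-notice.md",  # hypothetical path
        "rollback":        "deploy rollback --to previous",         # hypothetical command
        "escalation":      ["sre-oncall", "eval-harness-owner", "product-lead"],
    },
}

REQUIRED = {"detection", "first_response", "decision_rights",
            "communication", "rollback", "escalation"}
assert all(REQUIRED <= set(entry) for entry in RUNBOOK.values())
print("runbook entries complete:", len(RUNBOOK))
```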

The OpenAI 20 March 2023 Redis bug post-incident report is the canonical public example of an AI incident post-mortem.5 Microsoft and Google status pages include AI-service incidents; reading several builds intuition for the real cadence and character of AI incidents.6

Kill-switch architecture

Every AI system needs a kill-switch — a single control that immediately takes the feature offline or downgrades it to a safe fallback. For agentic systems the kill-switch is mandatory and its design is treated in depth in the AITE-ATS credential and the Core Stream article EATL-Level-4/M9.3-Art02.

The architect specifies: what the switch does (refuse requests, fall back to non-AI response, mask part of the feature), who can trigger it (SRE on-call, security team, product leader), how long it can be held (typically a bounded window requiring re-authorisation), and how its state is surfaced to users (explicit error message, graceful degradation).
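The bounded-hold behaviour the architect specifies can be sketched as follows; the role names, the four-hour hold window, and the fallback message are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

class KillSwitch:
    """Kill-switch sketch: authorised triggers only, bounded hold requiring
    re-authorisation, state surfaced as graceful degradation."""
    HOLD = timedelta(hours=4)                              # illustrative bounded window
    AUTHORISED = {"sre-oncall", "security", "product-lead"}

    def __init__(self) -> None:
        self.engaged_until: datetime | None = None

    def engage(self, who: str, now: datetime) -> None:
        if who not in self.AUTHORISED:
            raise PermissionError(who)
        self.engaged_until = now + self.HOLD  # expires unless re-authorised

    def handle(self, request: str, now: datetime) -> str:
        if self.engaged_until and now < self.engaged_until:
            return "FALLBACK: non-AI canned response"  # degradation surfaced to the user
        return f"AI answer to: {request}"

ks = KillSwitch()
t0 = datetime(2026, 4, 6, tzinfo=timezone.utc)
ks.engage("sre-oncall", t0)
print(ks.handle("refund status?", t0 + timedelta(hours=1)))  # fallback path
print(ks.handle("refund status?", t0 + timedelta(hours=5)))  # hold expired, AI path resumes
```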

Error-budget policy

Classical SRE error-budget policy: when the budget is exhausted, halt non-urgent releases until it recovers. For AI, the policy extends across more SLIs — burning the eval-score budget freezes prompt and model changes; burning the cost budget triggers a cost audit and possibly a routing change; burning the availability budget triggers the classical response.
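The budget arithmetic itself is a few lines. This sketch uses the sample sheet's 99.5% availability SLO; any SLO-counted error spends one unit of budget:

```python
def error_budget_remaining(slo: float, total_requests: int, bad: int) -> float:
    """Fraction of the window's error budget still unspent.
    budget = (1 - SLO) * total requests; floors at 0.0 when exhausted."""
    budget = (1 - slo) * total_requests
    return max(0.0, 1 - bad / budget)

# 99.5% SLO over 100,000 requests gives a budget of 500 bad responses.
print(error_budget_remaining(0.995, 100_000, 200))  # 0.6: 60% of budget left
print(error_budget_remaining(0.995, 100_000, 600))  # 0.0: exhausted, freeze releases
```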

The architect writes the error-budget policy once and publishes it; the platform team enforces it. A well-written policy dissolves most of the usual friction between product and reliability because it replaces case-by-case arguments with rules.

Governance integration

For EU AI Act Article 12 (record-keeping) and Article 15 (accuracy, robustness, cybersecurity), the SLO sheet and the incident log are primary evidence.7 ISO/IEC 42001 Clause 9.1 (monitoring) maps directly to SLI measurement. NIST AI RMF MANAGE 1.3 (resource management) and MANAGE 2.3 (risk response) map to the error-budget policy and the kill-switch.8

Anti-patterns

  • SLOs that copy classical services without AI-specific SLIs. An availability SLO alone misses the incidents that hurt AI products most: silent quality regression.
  • SLOs written by the reliability team alone, without the eval-harness owner. The eval SLO should be co-owned by reliability and the ML/AI team.
  • Runbooks that assume the response is “restart the service.” AI incidents rarely resolve via restart; they resolve via rollback or policy change. Runbooks reflecting classical patterns will waste the first hour of every incident.
  • Kill-switch that has never been tested. Untested kill-switches fail at the worst moment. The architect mandates quarterly kill-switch drills.

Summary

Running AI in production is the same SRE discipline as running any other service, plus three AI-specific SLIs and five AI-specific incident classes. The architect’s job is to design the SLO target sheet, define the kill-switch, write the error-budget policy, and make sure the incident runbook covers the AI-specific failure modes. The outputs feed EU AI Act Article 12/15 evidence and ISO/IEC 42001 Clause 9.1 compliance.

Key terms

  • SLI/SLO for AI
  • Error budget
  • Kill-switch
  • Confabulation outbreak
  • Retrieval corruption

Learning outcomes

After this article the learner can: explain AI-specific SLIs and SLOs; classify five AI incident types; evaluate a runbook for AI incident fit; design an SLO target sheet and kill-switch spec.


Footnotes

  1. Google Research, “Web Vitals” and latency research summaries; OpenAI and Anthropic streaming UX guidance.

  2. BBC and The Markup coverage of DPD chatbot (January 2024) and NYC MyCity (March 2024) incidents.

  3. Business Insider, “Chevrolet of Watsonville dealer AI chatbot agrees to sell Tahoe for $1” (December 2023).

  4. OWASP Top 10 for LLM Applications, v2025, LLM01 (prompt injection) and LLM08 (vector and embedding weaknesses).

  5. OpenAI, “March 20 ChatGPT outage: Here’s what happened” (post-incident report, 2023).

  6. Microsoft Azure status; Google Cloud AI/Vertex status; Anthropic status pages (public).

  7. Regulation (EU) 2024/1689 (AI Act), Articles 12 and 15.

  8. NIST AI 100-1, MANAGE 1.3 and 2.3.