Incident response for AI systems extends classical incident response with five AI-specific incident classes. This article teaches the architect to design the SLI/SLO sheet, the runbook, and the post-incident review template that make AI operations tractable.
Classical SLIs and SLOs, applied to AI
The classical SLI triad is availability, latency, and error rate. Each has an AI-specific interpretation.
Availability for an AI service is the fraction of requests that return a parsable response within a hard deadline. “Parsable” matters because an LLM can return a 200 OK with a body that is malformed JSON; classical HTTP availability metrics miss this. A production architect defines a response validator and measures availability against validator-pass, not HTTP-200.
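A minimal sketch of such a validator, assuming a JSON response body with a required `answer` field (the field name and schema are illustrative, not a standard):

```python
import json

def validator_pass(status_code: int, body: str, required_keys=("answer",)) -> bool:
    """True only if the response is HTTP 200 AND the body survives validation."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, TypeError):
        return False
    # Schema check: the body must be an object with every required field non-empty.
    return isinstance(payload, dict) and all(payload.get(k) for k in required_keys)

def availability(responses) -> float:
    """Validator-pass availability: the SLI is measured against validation, not HTTP 200."""
    return sum(validator_pass(status, body) for status, body in responses) / len(responses)

# A 200 OK with malformed JSON counts as unavailable:
sample = [(200, '{"answer": "reset your router"}'),  # passes
          (200, '{"answer": '),                      # 200 but malformed -> fails
          (503, "")]                                 # classical failure
```

The second response would pass a naive HTTP-200 availability metric and fail this one, which is exactly the gap the validator closes.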
Latency for an AI service has two layers. End-to-end p95 and p99 matter, but so does time-to-first-token (TTFT): in a streaming interface, TTFT drives perceived responsiveness, so the architect sets separate targets for TTFT and for end-to-end completion.[1]
Error rate combines HTTP errors, timeouts, validator-fail responses, safety-refusal rates (when those are unexpected), and tool-call failures. The architect defines which errors count against the SLO and publishes the classification.
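One way to make that published classification executable is a small predicate over event types (the event names here are illustrative, not a standard taxonomy):

```python
# Events that always burn the error budget (illustrative taxonomy).
ALWAYS_COUNTED = {"http_5xx", "timeout", "validator_fail", "tool_call_error"}

def counts_against_slo(event: str, refusal_expected: bool = False) -> bool:
    """Decide whether an error event counts against the published SLO."""
    if event == "safety_refusal":
        # Refusals only count when the request should NOT have been refused.
        return not refusal_expected
    return event in ALWAYS_COUNTED
```

Publishing this predicate alongside the SLO sheet removes the recurring argument over whether an expected safety refusal "counts".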
The three AI-specific SLIs
Evaluation score
The eval score is the ongoing output of the evaluation harness (Article 11) applied to production traffic samples. It is usually a composite: capability score (LLM-as-judge or automated metrics), safety score, and citation-accuracy score for RAG systems. Target: stable or rising across the release window.
An SLO on eval score is the strongest AI-specific SLO the architect can write: “the 7-day rolling eval score must remain within 2 points of the baseline.” Eval score regressions are one of the five incident classes below and are often the only early-warning signal before users complain.
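That SLO can be checked with a simple rolling-window comparison. A sketch, assuming daily composite scores sampled from production (the scores, baseline, and window below are illustrative):

```python
from collections import deque

def eval_slo_breaches(daily_scores, baseline, window=7, tolerance=2.0):
    """Return the days on which the rolling mean eval score breached
    baseline - tolerance: the 'within 2 points of baseline' SLO."""
    buf, breaches = deque(maxlen=window), []
    for day, score in enumerate(daily_scores, start=1):
        buf.append(score)
        if sum(buf) / len(buf) < baseline - tolerance:
            breaches.append(day)
    return breaches

# A slow quality drift trips the SLO before any single day looks alarming:
days = [90, 90, 90, 88, 86, 84, 82]
```

No single day here drops more than 2 points from the previous one, yet the rolling mean breaches on day 7 — the early-warning behaviour the SLO is designed to provide.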
Cost per query
Cost per query is the cloud cost for a single user interaction, summed across model call, retrieval, tool execution, observability, and network egress. Target: within budget ceiling; anomaly detection on per-query cost catches prompt regressions (a prompt that suddenly generates 10x more output tokens is a cost incident before it is a quality incident).
An architect who does not measure cost per query as an SLI cannot run an AI system responsibly at scale — the FinOps architecture in Article 33 depends on this SLI existing as a first-class metric.
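A sketch of the SLI and a simple per-query anomaly check (the component names and the 3x threshold are illustrative choices, not a standard):

```python
COST_COMPONENTS = ("model_call", "retrieval", "tool_execution", "observability", "egress")

def cost_per_query(costs: dict) -> float:
    """Sum the per-interaction cost across all billed components."""
    return sum(costs.get(component, 0.0) for component in COST_COMPONENTS)

def is_cost_anomaly(query_cost: float, baseline: float, factor: float = 3.0) -> bool:
    """Flag a query whose cost exceeds factor x baseline, e.g. a prompt change
    that suddenly makes the model emit far more output tokens."""
    return query_cost > factor * baseline

normal = {"model_call": 0.020, "retrieval": 0.004, "egress": 0.001}
runaway = {"model_call": 0.200, "retrieval": 0.004, "egress": 0.001}
```

The runaway query trips the anomaly check on cost alone, before any eval score has had time to move.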
Retrieval freshness
For RAG systems, freshness is the age of the corpus at retrieval time. Target: <N hours for real-time workflows, <N days for knowledge workflows. Stale retrieval produces wrong answers that look right, which is a failure mode worse than a visible error. The freshness SLI is calculated by timestamping every indexed document and checking the retrieved passages’ timestamps at response time.
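The p95 of those passage ages is then the SLI value. A sketch using the nearest-rank percentile definition (a production system would more likely use a metrics backend's histogram):

```python
import math
from datetime import datetime, timedelta

def freshness_p95_hours(passage_timestamps, now):
    """p95 age (hours) of retrieved passages, nearest-rank percentile."""
    ages = sorted((now - ts).total_seconds() / 3600 for ts in passage_timestamps)
    rank = max(1, math.ceil(0.95 * len(ages)))  # nearest-rank definition
    return ages[rank - 1]

now = datetime(2025, 1, 10, 12, 0)
# 100 retrieved passages whose source documents are 1..100 hours old:
stamps = [now - timedelta(hours=h) for h in range(1, 101)]
```

For this sample the p95 age is 95 hours, which would already breach the 72h target in the sample sheet below and trigger a re-index.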
The SLI/SLO target sheet
The architect’s primary deliverable here is a one-page SLI/SLO target sheet. Each row: SLI name, measurement method, SLO target, error-budget window, action on budget burn. A sample for a customer-support assistant:
| SLI | Target | Window | Action on burn |
|---|---|---|---|
| Availability (validator-pass) | 99.5% | 30 days | Freeze non-urgent changes |
| TTFT p95 | <500ms | 7 days | Review serving config |
| End-to-end p95 | <4s | 7 days | Review orchestration + retrieval |
| Eval score | ≥ baseline -2 | 7 days | Halt rollouts, full eval review |
| Cost per query | <$0.04 | 7 days | Cost audit, prompt review |
| Retrieval freshness p95 | <72h | 7 days | Re-index, alert corpus team |
| Refusal rate | 1-5% | 7 days | Tune guardrails |
| Tool-call error rate | <1% | 7 days | Tool contract audit |
The sheet is updated quarterly. The first version of the sheet is often too loose; experience with real traffic tightens it. Architects who lock targets before the first canary are invariably wrong.
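To keep the sheet enforceable rather than decorative, some teams mirror it as machine-readable data that the alerting pipeline can act on. A sketch using values from the sample sheet (the key names are illustrative):

```python
# The SLI/SLO sheet mirrored as data; values from the sample sheet above.
SLO_SHEET = {
    "availability_validator_pass": {"target": 0.995, "window_days": 30, "on_burn": "freeze_non_urgent_changes"},
    "ttft_p95_seconds":            {"target": 0.5,   "window_days": 7,  "on_burn": "review_serving_config"},
    "eval_score_delta":            {"target": -2.0,  "window_days": 7,  "on_burn": "halt_rollouts_full_eval_review"},
    "cost_per_query_usd":          {"target": 0.04,  "window_days": 7,  "on_burn": "cost_audit_prompt_review"},
}

def action_on_burn(sli_name: str) -> str:
    """Look up the pre-agreed action when an SLI's error budget is burned."""
    return SLO_SHEET[sli_name]["on_burn"]
```

Encoding the "action on burn" column as data means the quarterly sheet update and the alert routing change in one place.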
The five AI incident classes
Classical incidents (outage, performance, security) still apply. Five AI-specific incident classes add to the taxonomy.
1. Confabulation outbreak
A change (prompt, model, corpus) causes the system to produce plausible but wrong content at an elevated rate. The DPD UK chatbot swearing incident and the NYC MyCity wrong-law answers are public examples of related failure modes.[2]
Detection: eval score drop, human-review sampling, user complaints. Containment: route to safe fallback (canned response, escalate-to-human). Remediation: identify the change, roll back, re-evaluate.
2. Safety bypass
A user finds a prompt pattern that bypasses the safety guardrails. The Chevrolet of Watsonville $1 Tahoe prompt injection is the public case.[3]
Detection: safety eval failures on adversarial samples, external disclosure. Containment: patch the guardrail immediately; in severe cases, take the feature offline. Remediation: full threat model review (Article 14), red-team exercise, gate-review at Model stage.
3. Prompt injection
A hostile input either in the user message or in retrieved content causes the model to follow attacker-chosen instructions. OWASP LLM01 is the reference taxonomy.[4]
Detection: anomaly detection on output patterns (tool calls, PII emission, unusual token patterns), user reports. Containment: add input/output filters for the exploit pattern; if indirect injection via retrieval, purge and re-index. Remediation: threat-model update, security architecture review.
4. Model regression
A model version upgrade, prompt change, or tooling change regresses a specific pattern even as the average eval score is stable. This is the canary-miss case — the regression is in a minority pattern the canary sample did not exercise.
Detection: slice-based eval monitoring (per-vertical eval scores), customer complaints. Containment: rollback. Remediation: expand eval coverage to the regressed slice; add that slice to the standing regression suite.
5. Retrieval corruption
The retrieval index serves poisoned, stale, or mis-chunked content, or its hybrid-retrieval balance is off and quality drops. Source poisoning — a document added to the corpus that contains instructions the model then follows — is a specific sub-case from OWASP LLM08.[4]
Detection: citation-accuracy eval drop, content-safety evals on retrieved passages, cost anomaly (longer contexts). Containment: revert to prior corpus snapshot. Remediation: corpus audit, source-authorization tightening, re-embed and re-index.
The runbook structure
A production runbook for an AI service includes, per incident class: detection signal, first-response action, decision rights (who can take the feature offline), communication template, rollback command, and escalation tree. The architect does not own the runbook day-to-day but owns its shape and participates in periodic reviews.
The OpenAI 20 March 2023 Redis bug post-incident report is the canonical public example of an AI incident post-mortem.[5] Microsoft and Google status pages include AI-service incidents; reading several builds intuition for the real cadence and character of AI incidents.[6]
Kill-switch architecture
Every AI system needs a kill-switch — a single control that immediately takes the feature offline or downgrades it to a safe fallback. For agentic systems the kill-switch is mandatory and its design is treated in depth in the AITE-ATS credential and the Core Stream article EATL-Level-4/M9.3-Art02.
The architect specifies: what the switch does (refuse requests, fall back to non-AI response, mask part of the feature), who can trigger it (SRE on-call, security team, product leader), how long it can be held (typically a bounded window requiring re-authorisation), and how its state is surfaced to users (explicit error message, graceful degradation).
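A minimal sketch of such a switch, assuming illustrative role names and a 4-hour hold window (both would be set per organisation, and the state would live in a shared store, not in process memory):

```python
from datetime import datetime, timedelta
from enum import Enum

class Mode(Enum):
    LIVE = "live"          # normal operation
    FALLBACK = "fallback"  # safe non-AI canned response
    OFFLINE = "offline"    # refuse requests outright

class KillSwitch:
    AUTHORISED = {"sre_oncall", "security", "product_lead"}  # illustrative roles
    MAX_HOLD = timedelta(hours=4)  # bounded window; longer holds need re-authorisation

    def __init__(self):
        self.mode, self.since = Mode.LIVE, None

    def trigger(self, actor: str, mode: Mode, now: datetime) -> None:
        """Only listed roles may flip the switch; the trigger time starts the hold."""
        if actor not in self.AUTHORISED:
            raise PermissionError(f"{actor} may not trigger the kill-switch")
        self.mode, self.since = mode, now

    def hold_expired(self, now: datetime) -> bool:
        """True once the hold has outlived its window and needs re-authorisation."""
        return self.mode is not Mode.LIVE and now - self.since > self.MAX_HOLD
```

Whether an expired hold fails open (back to LIVE) or fails closed (stays OFFLINE until re-authorised) is itself a design decision the architect records in the spec.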
Error-budget policy
Classical SRE error-budget policy: when the budget is exhausted, halt non-urgent releases until it recovers. For AI, the policy extends across more SLIs — burning the eval-score budget freezes prompt and model changes; burning the cost budget triggers a cost audit and possibly a routing change; burning the availability budget triggers the classical response.
The architect writes the error-budget policy once and publishes it; the platform team enforces it. A well-written policy dissolves most of the usual friction between product and reliability because it replaces case-by-case arguments with rules.
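The arithmetic behind "budget exhausted" is simple enough to show: for an availability-style SLO the budget is the allowed bad fraction, and burn is the observed bad fraction divided by the allowed one. A sketch:

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still left (negative means exhausted)."""
    allowed_bad = 1.0 - slo_target          # e.g. a 99.5% target -> 0.5% budget
    observed_bad = 1.0 - good / total
    return 1.0 - observed_bad / allowed_bad

# A 99.5% availability SLO over 10,000 requests allows 50 failures;
# 30 failures have burned 60% of the budget, leaving 40%:
remaining = error_budget_remaining(0.995, good=9_970, total=10_000)
```

The same function applies to any pass/fail SLI in the sheet — validator-pass availability, tool-call error rate — once "good" is defined by the published error classification.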
Governance integration
For EU AI Act Article 12 (record-keeping) and Article 15 (accuracy, robustness, cybersecurity), the SLO sheet and the incident log are primary evidence.[7] ISO/IEC 42001 Clause 9.1 (monitoring) maps directly to SLI measurement. NIST AI RMF MANAGE 1.3 (resource management) and MANAGE 2.3 (risk response) map to the error-budget policy and the kill-switch.[8]
Anti-patterns
- SLOs that copy classical services without AI-specific SLIs. An availability SLO alone misses the incident class that hurts AI products most: silent quality regression.
- SLOs written by the reliability team alone, without the eval-harness owner. The eval SLO should be co-owned by reliability and the ML/AI team.
- Runbooks that assume the response is “restart the service.” AI incidents rarely resolve via restart; they resolve via rollback or policy change. Runbooks reflecting classical patterns will waste the first hour of every incident.
- Kill-switch that has never been tested. Untested kill-switches fail at the worst moment. The architect mandates quarterly kill-switch drills.
Summary
Running AI in production is the same SRE discipline as running any other service, plus three AI-specific SLIs and five AI-specific incident classes. The architect’s job is to design the SLO target sheet, define the kill-switch, write the error-budget policy, and make sure the incident runbook covers the AI-specific failure modes. The outputs feed EU AI Act Article 12/15 evidence and ISO/IEC 42001 Clause 9.1 compliance.
Key terms
- SLI/SLO for AI
- Error budget
- Kill-switch
- Confabulation outbreak
- Retrieval corruption
Learning outcomes
After this article the learner can: explain AI-specific SLIs and SLOs; classify five AI incident types; evaluate a runbook for AI incident fit; design an SLO target sheet and kill-switch spec.
Footnotes
[1] Google Research, "Web Vitals" and latency research summaries; OpenAI and Anthropic streaming UX guidance.
[2] BBC and The Markup coverage of the DPD chatbot (January 2024) and NYC MyCity (March 2024) incidents.
[3] Business Insider, "Chevrolet of Watsonville dealer AI chatbot agrees to sell Tahoe for $1" (December 2023).
[4] OWASP Top 10 for LLM Applications, v2025, LLM01 (prompt injection) and LLM08 (vector and embedding weaknesses).
[5] OpenAI, "March 20 ChatGPT outage: Here's what happened" (post-incident report, 2023).
[6] Microsoft Azure status; Google Cloud AI/Vertex status; Anthropic status pages (public).
[7] Regulation (EU) 2024/1689 (AI Act), Articles 12 and 15.
[8] NIST AI 100-1, MANAGE 1.3 and 2.3.