COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert — Article 18 of 40
Thesis. Google’s SRE book (2016) defined SLO-driven operations for web services: availability, latency, and error rate expressed as service-level objectives, measured by service-level indicators, with error budgets that enforce a trade-off between feature velocity and reliability. The model survives intact for agents, but availability and latency are no longer the interesting top-line metrics. The questions that govern agents are different: did the agent achieve the user’s goal? At what cost? With how much human intervention? At what rate of safety intervention? An agentic SLO sheet that reads like a web-service SLO sheet is the wrong artifact for the job. This article specifies the metrics that actually govern agents, maps them to dimensions, and gives the architect an SLO template tuned for autonomy-bearing workloads.
Six agentic SLIs
SLI 1 — Goal-achievement rate
Of tasks submitted, what percentage achieved the user’s goal? The top-line agentic SLI. Measured against one of three signals: the user’s explicit confirmation (“yes, this resolved it”), implicit signals (no follow-up complaint within 48 hours), or evaluation-rubric judgment (retrospective human or LLM-as-judge review of a trace sample).
A goal-achievement rate below 85% typically signals agentic misfit — the agent is trying to do something it shouldn’t or isn’t equipped for. Above 95% may signal an under-scoped deployment (only easy tasks are being given to the agent).
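A minimal sketch of the computation, assuming each session ends with a structured outcome record; the `Outcome` fields and signal names here are illustrative, not a standard schema:

```python
from dataclasses import dataclass

# Hypothetical outcome record; field names are assumptions for illustration.
@dataclass
class Outcome:
    task_id: str
    goal_achieved: bool
    signal: str  # "explicit", "implicit", or "rubric"

def goal_achievement_rate(outcomes):
    """Fraction of submitted tasks whose goal was achieved."""
    if not outcomes:
        return 0.0
    return sum(o.goal_achieved for o in outcomes) / len(outcomes)

outcomes = [
    Outcome("t1", True, "explicit"),
    Outcome("t2", True, "implicit"),
    Outcome("t3", False, "rubric"),
    Outcome("t4", True, "rubric"),
]
rate = goal_achievement_rate(outcomes)  # 0.75 for this sample
```

In production the `signal` field matters for audit: a rate built mostly on implicit signals is weaker evidence than one confirmed by rubric review.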
SLI 2 — Human-intervention rate
Of agent sessions, what percentage required human intervention (HITL approval, escalation, correction)? Measured per task class; trend-tracked over time.
A rising human-intervention rate signals deteriorating agent performance; a falling rate may signal either genuine improvement or concerning over-delegation (the human isn’t being asked when they should be). The architect specifies target ranges, not just ceilings.
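Because both a high and a low rate are signals, the check is a band, not a ceiling. A sketch, with an illustrative target range (the real band is per task class and set with stakeholders):

```python
def intervention_status(interventions, sessions, target_range=(0.05, 0.20)):
    """Classify a human-intervention rate against a target band.

    Above the band suggests agent degradation; below the band suggests
    over-delegation (the human is not being asked when they should be).
    """
    rate = interventions / sessions
    low, high = target_range
    if rate > high:
        return rate, "above-band: investigate agent degradation"
    if rate < low:
        return rate, "below-band: check for over-delegation"
    return rate, "in-band"

rate, status = intervention_status(interventions=12, sessions=100)
# rate = 0.12, status = "in-band"
```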
SLI 3 — Cost per task
Total cost — LLM tokens, tool-call costs, infrastructure amortization, human-intervention cost — divided by tasks completed. Tracked per task class. Cost is a first-class SLI because agentic cost is an order of magnitude more variable than classical service cost (Article 19).
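A sketch of the per-class aggregation; the cost components and figures are illustrative:

```python
from collections import defaultdict

def cost_per_task(records):
    """records: iterable of (task_class, token_cost, tool_cost, infra_cost,
    hitl_cost), all in the same currency. Returns mean total cost per
    completed task, keyed by task class."""
    totals, counts = defaultdict(float), defaultdict(int)
    for task_class, *costs in records:
        totals[task_class] += sum(costs)
        counts[task_class] += 1
    return {c: totals[c] / counts[c] for c in totals}

records = [
    ("refund", 0.40, 0.05, 0.02, 1.50),  # HITL approval dominates this one
    ("refund", 0.35, 0.05, 0.02, 0.00),
    ("lookup", 0.08, 0.01, 0.02, 0.00),
]
per_class = cost_per_task(records)
# per_class["refund"] ≈ 1.195, per_class["lookup"] ≈ 0.11
```

Note that human-intervention cost is inside the sum: a task class that looks cheap in tokens can be expensive once approvals are priced in.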
SLI 4 — Time to complete
Wall-clock from task submission to task completion. For synchronous tasks (customer-facing), this is user-experience latency. For asynchronous tasks (back-office), it feeds throughput and backlog planning. Includes agent loop time plus any HITL wait time.
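Because HITL waits produce a long tail, the SLI is reported as percentiles rather than a mean. A sketch using a simple nearest-rank percentile over illustrative durations:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile over wall-clock durations in seconds.
    Agentic time-to-complete is long-tailed (HITL wait dominates the
    tail), so percentiles, not means, are the SLI."""
    vals = sorted(values)
    k = max(0, math.ceil(p / 100 * len(vals)) - 1)
    return vals[k]

# Durations include agent loop plus HITL wait; 300 s is one approval stall.
durations = [4, 5, 5, 6, 7, 8, 9, 12, 40, 300]
p50, p95 = percentile(durations, 50), percentile(durations, 95)
# p50 = 7, p95 = 300: the mean (39.6) would describe neither
```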
SLI 5 — Safety-intervention rate
Rate at which safety controls (kill-switch, policy-engine deny, guardrail tripwire, sanitizer flag) fired. A non-zero rate is healthy — it means the controls are actually doing something. A rising rate signals either more adversarial input or weakening operational discipline.
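A sketch of the trend check, comparing two measurement windows; the rise factor is an illustrative policy choice:

```python
def safety_trend(prev_window, curr_window, rise_factor=1.5):
    """Each window is (interventions, sessions). A rising rate may mean
    more adversarial input or weakening discipline; a zero rate is also
    suspicious, since healthy controls fire occasionally."""
    prev = prev_window[0] / prev_window[1]
    curr = curr_window[0] / curr_window[1]
    if curr == 0:
        return curr, "zero: verify controls are actually live"
    if curr > prev * rise_factor:
        return curr, "rising: investigate input mix and discipline"
    return curr, "steady"

rate, verdict = safety_trend(prev_window=(4, 1000), curr_window=(9, 1000))
```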
SLI 6 — Consistency
Measured as the variance in outcome, cost, and steps across similar tasks. Rising variance is an early signal of model drift or prompt misalignment.
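One concrete way to report this is the coefficient of variation (stdev over mean), which is comparable across metrics with different units; the threshold for alarm is a policy choice, not a constant:

```python
import statistics

def consistency(values):
    """Coefficient of variation of a metric (steps, cost, duration)
    across similar tasks. A rising CV over successive windows is the
    drift signal; the absolute level matters less than the trend."""
    mean = statistics.fmean(values)
    return statistics.pstdev(values) / mean if mean else float("inf")

steps_per_task = [6, 7, 6, 8, 7]  # steps taken on five similar refund tasks
cv = consistency(steps_per_task)
```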
Dimensions to slice SLIs by
Top-line numbers hide problems. Every SLI is sliced by dimensions that make operational anomalies visible.
- Task class — refund, lookup, compose, escalate. Different distributions per class.
- Customer segment — B2C individual, B2B enterprise, internal user. Different expectations and failure tolerance.
- Agent configuration — different agents in the portfolio have different SLOs.
- Model — primary model vs fallback; measure separately.
- Region / data residency — regulatory regime may require per-region reporting.
- Time of day / business hours — off-hours anomalies often emerge here.
- Tenant (for multi-tenant) — anomalies affecting a single tenant are earlier signals.
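The mechanics of slicing are simple grouping; the point is that a per-slice number surfaces what the aggregate blurs. A sketch, with illustrative dimension keys:

```python
from collections import defaultdict

def slice_sli(sessions, dims):
    """Group per-session outcomes by a tuple of dimension values and
    compute goal-achievement per slice. `sessions` is a list of dicts;
    the dimension keys here are illustrative."""
    groups = defaultdict(list)
    for s in sessions:
        groups[tuple(s[d] for d in dims)].append(s["goal_achieved"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

sessions = [
    {"task_class": "refund", "tenant": "acme", "goal_achieved": True},
    {"task_class": "refund", "tenant": "acme", "goal_achieved": False},
    {"task_class": "lookup", "tenant": "beta", "goal_achieved": True},
]
by_class = slice_sli(sessions, ["task_class"])
# by_class[("refund",)] == 0.5, a regression the aggregate (~0.67) blurs
```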
Setting SLO targets — the discipline
SLO targets are commitments to users, expressed numerically. Setting them well is architectural work.
Start from user expectation. What goal-achievement rate does the user need to trust the agent? For regulated financial workloads the answer may be 99%; for consumer assistants, 85% may suffice. The target is a business conversation, not an engineering guess.
Set the objective against a dimensional baseline. If goal-achievement varies across task classes, the overall SLO is weighted by volume; per-class SLOs catch class-specific regressions.
Define the error budget. The gap between the target and 100% is the budget. A 97% goal-achievement target gives 3% budget for failures. When the budget is exhausted in a period, new deployment is paused and remediation prioritized.
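The arithmetic is worth making explicit. A sketch, using the 97% example:

```python
def error_budget(target, failures, tasks):
    """For a 97% goal-achievement target, the budget is 3% of tasks in
    the window. Returns (allowed_failures, fraction_of_budget_consumed)."""
    budget = (1 - target) * tasks
    return budget, failures / budget if budget else float("inf")

budget, consumed = error_budget(target=0.97, failures=45, tasks=2000)
# budget ≈ 60 allowed failures; consumed ≈ 0.75, i.e. 75% burned
```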
Choose the measurement window. 28 days, 30 days, rolling 90: each has trade-offs. Shorter windows react faster; longer windows smooth noise. Most agentic deployments use a rolling 28-day window.
Specify the SLI’s population carefully. Which tasks count? Tasks that errored at the framework level before the agent even engaged don’t count. Tasks the user canceled don’t count. The population definition is written out and reviewed.
Review and adjust quarterly. Targets set at launch are almost always wrong. Quarterly review with stakeholders adjusts based on evidence.
Error-budget policy — the sharp end
An SLO without an error-budget policy is a target; with one, it is a commitment. The policy specifies what happens when the budget is exhausted, when it is burning fast but not yet exhausted, and when it is healthy.
Typical policy for an Annex III high-risk agentic system:
- Budget exhausted (the period’s failure budget is fully consumed): deployment paused for new features; team focuses entirely on reliability; regulatory notification if the Article 14 oversight SLI is affected.
- Budget burning hot (>75% consumed at 50% of period): new feature deployments require additional review; reliability work prioritized.
- Budget healthy: normal feature velocity; reliability work proportional.
The policy is a pre-committed mechanism; it is not invoked on a case-by-case basis. That is the entire point.
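Pre-commitment means the policy can be encoded once and consulted mechanically. A sketch of the tiers above; thresholds mirror the illustrative policy, not a standard:

```python
def budget_action(consumed, period_elapsed):
    """Pre-committed error-budget policy lookup. `consumed` is the
    fraction of the period's budget burned; `period_elapsed` is the
    fraction of the period that has passed. Tiers are illustrative."""
    if consumed >= 1.0:
        return "pause feature deploys; reliability only; assess regulatory notification"
    if consumed > 0.75 and period_elapsed < 0.5:
        return "burning hot: extra review on feature deploys; prioritize reliability"
    return "healthy: normal feature velocity"

action = budget_action(consumed=0.8, period_elapsed=0.4)
```

Because the function has no discretionary inputs, there is nothing to argue about at invocation time, which is the entire point of pre-commitment.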
Mapping to the Google SRE model
The mapping from Google SRE to agentic:
- Availability → goal-achievement rate (when the agent is “up,” is it delivering?)
- Latency → time to complete (unchanged in concept, different distribution in practice)
- Saturation → cost-per-task + queue wait (capacity in economic terms plus queue mechanics)
- Errors → safety-intervention rate + unsuccessful-task rate (not just HTTP 5xx)
- New (agentic-specific) → human-intervention rate and consistency
Classical SRE practitioners adopting agentic systems don’t have to learn a new discipline; they have to adapt the dimensions. SRE tenets (error budgets, blameless post-mortems, service-level rigor) transfer intact.
Observability to SLO pipeline
Agentic observability (Article 15) must emit SLI data. The pipeline:
- Every session ends with a structured outcome record (goal achieved, goal failed, interrupted, escalated).
- The record plus session spans are aggregated into per-session metrics.
- Per-session metrics are aggregated per SLI and per dimension.
- Dashboards and alerts run against the aggregations.
- Error-budget tracking runs continuously.
- Monthly (or per-cadence) review produces the SLO report for stakeholders.
Without the outcome record at session end, goal-achievement cannot be measured. This is the single most commonly missing data point in early agentic deployments. The architect makes it non-negotiable.
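A sketch of what such a record can look like, emitted as one JSON line at session end; the field names are assumptions, not a standard schema:

```python
import json
import time
from dataclasses import dataclass, asdict

# Illustrative outcome record; field names are assumptions, not a standard.
@dataclass
class OutcomeRecord:
    session_id: str
    outcome: str        # "goal_achieved" | "goal_failed" | "interrupted" | "escalated"
    task_class: str     # dimension for slicing
    tenant: str         # dimension for slicing
    cost_usd: float     # total cost attributed to the session
    duration_s: float   # wall-clock, including HITL wait
    interventions: int  # count of human interventions in the session
    ts: float           # emission timestamp (epoch seconds)

def emit_outcome(record):
    """Serialize at session end; the SLI pipeline aggregates these lines."""
    return json.dumps(asdict(record))

line = emit_outcome(OutcomeRecord("s-42", "goal_achieved", "refund",
                                  "acme", 1.97, 38.0, 1, time.time()))
```

Every downstream stage in the pipeline (per-session metrics, per-SLI aggregation, error-budget tracking) reads from this record, which is why its absence is fatal to measurement.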
Five anti-patterns in agentic SLOs
Anti-pattern 1 — Availability-only SLOs. “Agent is up 99.9% of the time” says nothing about whether it’s doing what it should. Up is necessary but not sufficient.
Anti-pattern 2 — No error budget. The team has targets but no pre-committed policy for when they are breached. In practice, targets slip without consequence.
Anti-pattern 3 — No dimensional slicing. Aggregate-only metrics hide per-tenant, per-task-class, per-region problems until they’ve grown.
Anti-pattern 4 — Human-intervention rate treated as a cost, not a signal. Teams optimize to reduce HITL (it looks efficient) without recognizing that reducing HITL can inadvertently move agents toward HOOTL (human out of the loop). HITL rate targets should be ranges, not ceilings.
Anti-pattern 5 — Stale SLOs. Targets set at launch and never revisited. Business changes, model improves, customer expectations shift; the SLOs drift from relevance.
Regulatory mapping
EU AI Act Article 15 (accuracy, robustness, cybersecurity) requires high-risk systems to declare technical robustness. Agentic SLOs are the implementation of Article 15 for agentic systems. The architect’s SLO document serves double duty as Article 15 evidence — targets are committed, measurement is systematic, breaches are tracked, remediation is documented. Regulated industries (finance, health) will have additional SLO expectations; the agentic model accommodates them as dimensions.
Framework parity — metric emission
- LangGraph — native metric emission via OTel; LangSmith aggregates; custom aggregation via callbacks.
- CrewAI — callback manager emits session metrics; dashboards custom.
- AutoGen — OTel emission plus custom event logging; aggregation custom.
- OpenAI Agents SDK — tracing and metric export; dashboard in OpenAI platform or external.
- Semantic Kernel — OTel + Azure Monitor; rich metric story in Azure.
- LlamaIndex Agents — callback manager + external aggregation.
All frameworks can emit; the platform layer aggregates into SLI dashboards and error-budget computations.
Real-world anchor — Google SRE Book SLO methodology
Google’s SRE book (Beyer et al., 2016) is the authoritative reference for SLI/SLO/error-budget discipline. AITE-ATS holders apply its methodology with agent-specific dimensions. The book is a free O’Reilly publication; the chapters on SLIs, SLOs, and error budgets are the core reading. Source: sre.google/sre-book.
Real-world anchor — Anthropic Computer Use measurement discussion
Anthropic’s Computer Use documentation and subsequent engineering posts discuss the specific challenges of measuring agents that act on a user’s screen — what counts as success, how to measure latency in a multi-step UI task, how to attribute a failure to model vs agent-loop vs UI-instability. The discussion is a frontier-lab take on agentic measurement; AITE-ATS holders read it as contemporary practice. Source: Anthropic public materials, 2024–2025.
Real-world anchor — Replit agent metrics
Replit’s public posts on their agent’s production metrics (2024–2025) illustrate the dimensional-slicing problem at consumer scale — per-repo, per-language, per-task-class variation in completion rate and cost. The agent team’s use of error budgets to trade feature velocity for reliability parallels the SRE model precisely. Source: replit.com blog.
Closing
Six SLIs, seven dimensions, an error-budget policy, a pipeline that emits outcome records at session end. Agentic SLO/SLI is the discipline that converts agentic operations from an art into a measurable practice. Article 19 takes up the cost dimension that this chapter touched on but that deserves its own full treatment.
Learning outcomes check
- Explain six agentic SLIs (goal-achievement, human-intervention, cost, time, safety-intervention, consistency) and the Google SRE mapping.
- Classify each SLI by dimension and understand how slicing reveals anomalies aggregate metrics hide.
- Evaluate a monitoring design for coverage against the six-SLI template and the five anti-patterns.
- Design an SLO target sheet for a given agent including targets, error-budget policy, dimensional slicing, and review cadence.
Cross-reference map
- Core Stream: EATE-Level-3/M3.3-Art12-Operations-and-SLOs-for-AI-Systems.md.
- Sibling credential: AITM-OMR Article 9 (ops-management SLO practice); AITF-PLP Article 6 (SRE-grade ops).
- Forward reference: Articles 19 (cost architecture), 23 (EU AI Act Article 15), 24 (lifecycle promotion criteria), 25 (incident response).