AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Artifact Template
How to use this template
This template is the service-level-objective (SLO) sheet for an agentic AI feature — a feature whose runtime orchestrates multi-step model reasoning with tool calls, as described in Lab 3. The sheet is authored by the runtime owner, reviewed by the site-reliability and governance owners, and carried as the living record that on-call engineers, release engineers, and the architecture review board consult.
Agentic features have distinctive reliability dynamics — per-turn latency variance, per-run cost variance, tool-call failure cascades, loop-length blow-ups, prompt-injection-driven behaviour changes — that generic web-service SLO frameworks miss. The template captures those dynamics explicitly.
Every section is required. Where a section does not apply, complete it with an explicit statement plus a one-sentence rationale — for example, if the feature has no write-capable tools, the write-tool section reads “not applicable — this feature has only read-only and draft-only tools.”
Agentic Runtime SLO and SLI Sheet — [Feature Name]
1. Identification and ownership
| Field | Value |
|---|---|
| Feature name | [name, linked to the architecture design document] |
| Runtime platform | [LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex Agents, hand-rolled, or other] |
| Runtime version | [pinned] |
| Runtime owner | [single accountable individual, team] |
| Site-reliability reviewer | [name, role] |
| Governance reviewer | [name, role] |
| On-call rotation | [team name, rotation schedule link] |
| Effective date | YYYY-MM-DD |
2. System invariant
The single machine-checkable invariant that must hold at every step of every agent run. Written as one sentence in the Lab 3 style. For a read-and-draft-only feature, the invariant limits write-capable tool reach. For a tool-restricted feature, the invariant names the restrictions. For a human-in-the-loop feature, the invariant names the human step that must precede any action.
[Example: “The Feature can read market state, portfolio state, research state, and compliance state, and can draft messages. It cannot send, submit, modify, or cancel an order, a position, or a communication to an external venue or counterparty. No agent plan, tool call, prompt injection, or operator instruction can place the Feature outside this envelope.”]
| Field | Value |
|---|---|
| Invariant test | [property-based test, policy simulation, fuzzing harness] |
| Test location | [CI path, cadence] |
| Failure action | [block release, page runtime owner, other] |
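A minimal sketch of what a runtime-enforced invariant can look like: a guard that every tool call passes through before dispatch, denying anything outside a read/draft allowlist. The tool names, allowlist, and exception type are illustrative assumptions, not part of the template.

```python
# Illustrative invariant guard: every tool call is checked before dispatch.
READ_ONLY_TOOLS = {"read_market_state", "read_portfolio_state"}
DRAFT_ONLY_TOOLS = {"draft_message"}
ALLOWED_TOOLS = READ_ONLY_TOOLS | DRAFT_ONLY_TOOLS

class InvariantViolation(Exception):
    """Raised (and written to the runtime enforcement log) on any escape attempt."""

def guard_tool_call(tool_name: str) -> None:
    if tool_name not in ALLOWED_TOOLS:
        # This event is the "write-capable-tool-reach attempt" SLI in section 3.4.
        raise InvariantViolation(f"tool {tool_name!r} is outside the feature envelope")

guard_tool_call("read_market_state")    # permitted
try:
    guard_tool_call("submit_order")     # blocked: write-capable
except InvariantViolation:
    pass
```

A property-based invariant test then generates arbitrary tool names and asserts the guard never admits one outside the allowlist.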
3. Service-level indicators (SLIs)
3.1 Latency SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Per-turn latency | [Wall-clock time from user-turn input to assistant-turn output] | [trace span] |
| Per-tool-call latency | [Wall-clock time from tool-call dispatch to tool-call result] | [trace span] |
| Per-run duration | [Total duration of a multi-turn run from session open to session end] | [trace span] |
| Time-to-first-token | [For streamed responses, wall-clock time from user-turn input to first streamed token] | [trace span] |
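The latency SLIs above reduce to percentiles over trace-span durations. A minimal sketch, assuming a flat list of per-turn durations exported from the trace backend named in section 7 (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; adequate for SLO reporting sketches."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(k, 0)]

turn_latencies_s = [1.2, 0.9, 2.4, 1.1, 7.5, 1.3, 1.0, 3.2]  # illustrative
p50 = percentile(turn_latencies_s, 50)   # 1.2
p99 = percentile(turn_latencies_s, 99)   # 7.5
```

Production backends compute percentiles natively; the point is that every latency SLO in section 4 must be derivable from the spans named here.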
3.2 Cost SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Per-turn cost | [Aggregated LLM and tool costs for a single turn] | [cost attribution record] |
| Per-run cost | [Aggregated cost for an entire run] | [cost attribution record] |
| Input tokens per turn | [Tokens sent to generator, aggregated across all tool-interleaved calls] | [trace span] |
| Output tokens per turn | [Tokens generated] | [trace span] |
3.3 Behavioural SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Loop-length | [Number of agent steps in a run, including reasoning steps and tool calls] | [trace] |
| Tool-call success rate | [Fraction of tool calls returning a success result, per tool] | [trace, per-tool] |
| Validator-failure rate | [Fraction of tool calls intercepted by a pre- or post-execution validator, per validator] | [trace] |
| Refusal rate | [Fraction of runs in which the generator refuses to respond] | [trace] |
| Unsafe-content rate | [Fraction of responses flagged by the safety classifier] | [safety classifier output] |
| Human-review rating mean | [Mean reviewer rating over sampled runs, if the feature is sampled for human review] | [review console] |
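The behavioural SLIs are aggregations over per-run trace events. A sketch under an assumed event shape (the `run`/`type`/`tool`/`ok` fields are illustrative, not a trace-backend schema):

```python
# Illustrative trace events for two runs.
events = [
    {"run": "r1", "type": "tool_call", "tool": "search", "ok": True},
    {"run": "r1", "type": "tool_call", "tool": "search", "ok": False},
    {"run": "r1", "type": "reasoning"},
    {"run": "r2", "type": "tool_call", "tool": "draft", "ok": True},
]

def loop_length(run_id):
    # Agent steps = reasoning steps + tool calls, per the definition above.
    return sum(1 for e in events if e["run"] == run_id)

def tool_call_success_rate(tool):
    calls = [e for e in events if e["type"] == "tool_call" and e["tool"] == tool]
    return sum(e["ok"] for e in calls) / len(calls)
```

With these events, `loop_length("r1")` is 3 and `tool_call_success_rate("search")` is 0.5; the loop-length p99 SLO in section 4 is a percentile over such per-run counts.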
3.4 Invariant SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Invariant-violation rate | [Detected violations of the system invariant per run; must be 0] | [runtime enforcement log] |
| Write-capable-tool-reach attempts | [Attempts to invoke a tool outside the feature’s envelope, per run] | [runtime enforcement log] |
4. Service-level objectives (SLOs)
| SLO | Target | Measurement window | Error budget | Action on breach |
|---|---|---|---|---|
| Per-turn latency p50 | [e.g., ≤ 2.0 s] | [28-day rolling] | […] | […] |
| Per-turn latency p99 | [e.g., ≤ 8.0 s] | [28-day rolling] | […] | […] |
| Per-run cost p95 | [e.g., ≤ $0.40] | [28-day rolling] | […] | […] |
| Loop-length p99 | [e.g., ≤ 14 steps] | [28-day rolling] | […] | […] |
| Tool-call success rate | [e.g., ≥ 99.0%, per tool] | [28-day rolling] | […] | […] |
| Refusal rate | [e.g., 0.5% to 5.0% band] | [7-day rolling] | […] | [alert if out of band, in either direction] |
| Unsafe-content rate | [e.g., ≤ 0.02%] | [24-hour rolling] | […] | [page on call] |
| Invariant-violation rate | [0 — no budget] | [real-time] | [none — zero-tolerance] | [immediate incident declaration] |
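For availability-style SLOs such as tool-call success rate, the error budget is simply the complement of the target applied to the window's event volume. A sketch with illustrative figures:

```python
def error_budget(target, total_events):
    """Events allowed to miss the SLO before the window's budget is exhausted."""
    return (1.0 - target) * total_events

# 99.0% tool-call success over a 28-day window with ~500,000 calls:
budget = error_budget(0.99, 500_000)   # roughly 5,000 failed calls allowed
```

The invariant-violation row deliberately has no such computation: a zero-tolerance SLO has no budget to spend.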
5. Error budget policy
| Field | Value |
|---|---|
| Monthly error budget per SLO | [the time the service may be out of SLO before release cadence slows] |
| Budget-burn alert thresholds | [e.g., 2× and 10× burn-rate alerts] |
| Release-cadence impact | [when budget is exhausted, release cadence changes to what] |
| Budget-exhausted recovery | [the process to re-earn budget — improvement work, post-incident remediation, stricter canary] |
| Invariant-violation budget | [zero; not subject to the general error-budget process] |
6. Incident response
6.1 Severity classification
| Severity | Trigger | Response time | Communication |
|---|---|---|---|
| SEV-1 | [invariant violation, global outage, data breach] | [minutes] | [exec paging, customer communication] |
| SEV-2 | [SLO breach with active user impact, partial outage] | [tens of minutes] | [team paging, status page update] |
| SEV-3 | [SLO budget-burn rate alert, degraded behaviour] | [business hours] | [team notification] |
| SEV-4 | [minor anomaly, non-urgent drift] | [next business day] | [tracked issue] |
6.2 Kill-switch topology
| Mode | Who can invoke | Authentication | Propagation | In-flight behaviour | Smoke-test cadence |
|---|---|---|---|---|---|
| Tool-level freeze | [role(s)] | [auth step] | [target seconds] | […] | […] |
| Generator-level freeze | [role(s)] | […] | […] | […] | […] |
| Agent-level freeze | [role(s)] | […] | […] | […] | […] |
| Global freeze | [role(s); two-person rule?] | […] | [target ≤ 60s] | […] | […] |
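One way to reason about the four freeze modes is as a flag registry where the global mode dominates every narrower one. The sketch below is a single-process stand-in; names are illustrative, and a real implementation distagributes the flag and measures propagation against the target seconds in the table:

```python
import time

class KillSwitch:
    def __init__(self):
        self._frozen: dict = {}   # mode -> invocation timestamp, for audit

    def freeze(self, mode: str) -> None:
        self._frozen[mode] = time.monotonic()

    def is_frozen(self, mode: str) -> bool:
        # Global freeze dominates every narrower mode.
        return "global" in self._frozen or mode in self._frozen

switch = KillSwitch()
switch.freeze("tool:submit_order")
assert switch.is_frozen("tool:submit_order")
switch.freeze("global")
assert switch.is_frozen("generator")   # global freeze covers all modes
```

The smoke-test cadence column exists because a kill switch that is never exercised cannot be trusted to propagate within its target.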
6.3 Runbooks
Provide, for at least three named failure scenarios, the runbook (or a link to it) including the decision tree, escalation paths, and rollback commands.
- [Prompt-injection incident runbook]
- [Generator-outage failover runbook]
- [Kill-switch invocation runbook]
- [Tool-authorization breach runbook]
- [Cost-anomaly incident runbook]
7. Observability contract
| Field | Value |
|---|---|
| Trace backend | [Arize, Langfuse, Datadog, OpenTelemetry stack, or other] |
| Metric backend | [Prometheus, CloudWatch, Datadog, or other] |
| Log backend | [with retention class per stream] |
| End-to-end trace ID propagation | [from user-turn through runtime through tool calls through LLM calls] |
| Log hygiene | [redaction rules for free-text inputs, retention class per stream] |
| Dashboard location | [URL] |
| Alert channels | [paging destinations, severity routing] |
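End-to-end trace ID propagation means one ID, set at the user turn, is visible in every nested tool and LLM call without being threaded through each signature. A minimal stdlib sketch using `contextvars` (function names are illustrative; a real deployment would use the trace backend's propagation API):

```python
import contextvars
import uuid

trace_id: contextvars.ContextVar = contextvars.ContextVar("trace_id")

def handle_user_turn():
    trace_id.set(uuid.uuid4().hex)   # one ID for the whole turn
    return call_tool()

def call_tool():
    # Every downstream log, metric, and span carries the same ID.
    return trace_id.get()

tid = handle_user_turn()
assert tid == trace_id.get()
```

This is what makes the per-turn and per-run SLIs in section 3 joinable: without a shared ID, tool-call spans cannot be attributed to the turn that caused them.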
8. Review and amendments
| Role | Name | Decision | Date |
|---|---|---|---|
| Runtime owner | […] | Authored | YYYY-MM-DD |
| Site-reliability reviewer | […] | Approved | YYYY-MM-DD |
| Governance reviewer | […] | Approved | YYYY-MM-DD |
| Architecture reviewer | […] | Approved | YYYY-MM-DD |
Amendment log. Material amendments (change of invariant, change of SLO target, change of kill-switch topology) require re-review by the full panel; non-material amendments (new SLI added for observation, alerting-threshold tuning) may be self-approved by the runtime owner and site-reliability reviewer.
Notes on use
When to use this template. Every agentic feature — any feature that orchestrates multi-step model reasoning with tool calls. Single-turn features use a simpler SLO sheet.
Common errors in first-time use. Missing system invariant; SLOs that are not measurable from the trace data; kill-switch topology without propagation SLO; zero-tolerance SLOs expressed with error budgets; runbooks reduced to links that point to empty wiki pages. Reviewers treat these as blocking.
What follows. The SLO sheet is cited from Template 1 §9 (operational architecture) and feeds the feature’s release-gate readiness review. It is re-reviewed on every material runtime change, every new tool added, and at least quarterly.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.