Template — Agent SLO / SLI Sheet

FlowRidge

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Artifact Template 3 of 5

How to use this template

Populate one sheet per agent at the end of the Organize stage. The sheet records the operational promises the agent makes to its users and operators, the measurements that confirm or dispute those promises, and the actions taken when those promises are at risk.

Review the sheet on the cadence set in section 7, and update it whenever the agent’s tool surface, autonomy level, user population, or model mix changes materially. Publish the sheet to the same location as the dashboard it describes; the sheet and the dashboard must tell the same story.

Agent SLO / SLI Sheet

Identity

Field	Value
Agent identifier	stable-agent-id
Charter version	1.0
SLO sheet version	1.0
Last updated	YYYY-MM-DD
Agent owner (role)	role
Architect of record (role)	role
Observability owner (role)	role
Error-budget policy owner (role)	role

1. User-facing SLOs

The commitments the agent makes to its users, stated exactly as the users are promised them.

SLO ID	Statement	Target	Window	Measurement basis
SLO-A1	Task-completion rate	≥ 90%	trailing 28 days	runs classified as completed / total runs
SLO-A2	Acknowledgement latency (request to first response)	≤ 5 s p95	trailing 7 days	timestamp of first response token minus request receipt
SLO-A3	Task-turnaround (simple task class)	≤ 60 s p95	trailing 7 days	task complete minus task submitted
SLO-A4	HITL-gate response time (when gate fires)	≤ 4 business hours p95	trailing 28 days	operator decision timestamp minus gate-fire timestamp

Task classes are agent-specific. If the agent handles heterogeneous tasks, decompose SLO-A3 per task class (see section 4).
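As a rough illustration of the measurement bases in the table above, the following sketch computes the SLO-A1 completion rate and an SLO-A2 p95 from a list of run records. The `Run` record and its field names are assumptions made for this example, not a schema the template prescribes, and nearest-rank is only one of several common percentile definitions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent run; field names are illustrative, not a fixed schema."""
    completed: bool        # run was classified as completed
    ack_latency_s: float   # first response token minus request receipt, in seconds


def completion_rate(runs: list[Run]) -> float:
    """SLO-A1 basis: runs classified as completed / total runs."""
    if not runs:
        raise ValueError("no runs in window")
    return sum(r.completed for r in runs) / len(runs)


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of the values in the window."""
    if not values:
        raise ValueError("no values in window")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1]


# Example window of four runs (made-up numbers).
runs = [Run(True, 1.2), Run(True, 0.8), Run(False, 6.0), Run(True, 2.5)]
slo_a1_met = completion_rate(runs) >= 0.90                 # target: >= 90%
slo_a2_met = p95([r.ack_latency_s for r in runs]) <= 5.0   # target: <= 5 s p95
```

In practice the window (trailing 7 or 28 days) would be applied when selecting the runs, before these functions are called.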

2. Operational SLIs

The indicators the operators rely on to keep the agent healthy. SLIs feed into SLOs; some SLIs do not have a direct SLO but trigger alerts when they deviate.

SLI ID	Indicator	How measured	Alert threshold	Linked SLO
SLI-B1	Tool-error rate, by tool	errored tool-calls / total tool-calls, per tool, 1-hour window	> 1% for 3 consecutive hours	SLO-A1
SLI-B2	Loop length (steps per run)	steps, per run	p95 > 1.5× trailing-28-day baseline	SLO-A3
SLI-B3	HITL-fire rate	runs with ≥1 gate fire / total runs	> 1.25× trailing-28-day baseline	SLO-A4
SLI-B4	Model-call cost per run	model-call cost in USD summed per run, 1-day window	p95 > budget	no direct SLO; feeds capacity plan
SLI-B5	Memory-write schema-violation count	violations, 1-day window	> 0	no direct SLO; critical signal
SLI-B6	Indirect-injection-detector positive rate	positives / retrieval events, 1-day window	> 0.1% or > 1.5× baseline	no direct SLO; security signal
SLI-B7	Kill-switch fire count	count, 1-day window	> 0 synchronous (informational); > 0 asynchronous (alert)	no direct SLO; incident signal
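The alert thresholds above fall into two recurring shapes: a rate that stays above a fixed limit for several consecutive windows (SLI-B1) and a value that exceeds a multiple of its trailing baseline (SLI-B2, SLI-B3). A minimal sketch of both checks follows; the function names are hypothetical, and it assumes the hourly rates and baselines have already been computed elsewhere.

```python
def sustained_breach(hourly_rates: list[float],
                     threshold: float = 0.01,
                     consecutive: int = 3) -> bool:
    """True if the rate exceeded `threshold` in `consecutive` back-to-back hourly windows."""
    streak = 0
    for rate in hourly_rates:
        # Extend the streak on a breach, reset it on any healthy hour.
        streak = streak + 1 if rate > threshold else 0
        if streak >= consecutive:
            return True
    return False


def exceeds_baseline(current: float, baseline: float, factor: float) -> bool:
    """True if `current` is strictly above `factor` times the trailing baseline."""
    return current > factor * baseline


# SLI-B1 style check: > 1% for 3 consecutive hours.
tool_alert = sustained_breach([0.005, 0.02, 0.03, 0.015])
# SLI-B2 style check: p95 steps per run vs. 1.5x trailing-28-day baseline.
loop_alert = exceeds_baseline(current=16, baseline=10, factor=1.5)
```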

3. Error-budget policy

The error budget translates SLO targets into an operational rule set. When the budget is at risk, non-essential releases pause and reliability work takes precedence.

Field	Value
Primary SLO for budget	e.g., SLO-A1 task-completion
Error budget (per window)	(1 − target) × total events, per window
Burn-rate alert thresholds	1-hour burn > 14.4× sustained; 6-hour burn > 6× sustained
Budget-at-risk response	(a) halt non-essential releases; (b) prioritise reliability work; (c) convene incident-review meeting
Budget-exhausted response	(a) halt all releases; (b) consider temporary autonomy-level downgrade; (c) open ticket to executive sponsor
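The budget formula and burn-rate thresholds in the table can be worked through as follows. This is a sketch under the common definition of burn rate as the observed error rate divided by the error rate the SLO allows; the function names and the example numbers are assumptions for illustration only.

```python
def error_budget(target: float, total_events: int) -> float:
    """Allowed bad events in the window: (1 - target) x total events."""
    return (1.0 - target) * total_events


def burn_rate(bad_events: int, total_events: int, target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 consumes the budget exactly over the full SLO window.
    """
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - target)


def burn_alerts(rate_1h: float, rate_6h: float) -> list[str]:
    """Apply the two thresholds from the policy table."""
    alerts = []
    if rate_1h > 14.4:
        alerts.append("1-hour burn > 14.4x")
    if rate_6h > 6.0:
        alerts.append("6-hour burn > 6x")
    return alerts


# Example with SLO-A1 (target 90%) and made-up event counts.
budget = error_budget(0.90, 20_000)   # about 2,000 failed runs allowed
rate_6h = burn_rate(70, 100, 0.90)    # about 7x
alerts = burn_alerts(rate_1h=burn_rate(5, 20, 0.90), rate_6h=rate_6h)
```

Under this definition the burn rate cannot exceed 1 / (1 − target). With a 90% target it tops out at 10×, so a 14.4× threshold could never fire; it may be worth checking the chosen thresholds against the primary SLO target.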

4. Task classes and differentiated SLOs

If the agent serves multiple task classes with materially different expectations, decompose the SLOs per class.

Task class	Description	Representative SLO
simple	e.g., single-tool read with summarisation	task-turnaround ≤ 60 s p95
medium	e.g., multi-tool with bounded loop	task-turnaround ≤ 5 min p95
complex / long-horizon	e.g., multi-step planning with HITL gates	task-turnaround ≤ 1 business day p95; acknowledgement ≤ 5 s p95

Task-class assignment must be deterministic and made before the run executes (e.g., by a class-router), so that the denominator of each per-class measurement does not depend on the run outcome.
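One way to keep the assignment deterministic is to derive the class only from attributes known at submission time. The sketch below is a hypothetical router: the `TaskRequest` fields and the routing rules are assumptions chosen to mirror the three example classes in the table, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRequest:
    """Attributes known when the task is submitted (hypothetical fields)."""
    tool_count: int    # number of tools the request is expected to use
    needs_hitl: bool   # whether the plan includes a HITL gate


def route_task_class(req: TaskRequest) -> str:
    """Assign a task class from submission-time attributes only.

    Nothing here looks at duration, step count, or outcome, so the same
    request always lands in the same class and the per-class denominator
    is fixed before the run starts.
    """
    if req.needs_hitl:
        return "complex"
    if req.tool_count > 1:
        return "medium"
    return "simple"


task_class = route_task_class(TaskRequest(tool_count=3, needs_hitl=False))
```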

5. User populations and differentiated targets

If different user populations receive different SLOs (free vs. paid; internal vs. customer-facing), enumerate the differences.

Population	SLO differences	Contractual basis
internal	same targets; relaxed alert thresholds	internal operating agreement
enterprise customers	tighter p95 targets; named support path	contract reference

6. Measurement plumbing

Where the numbers come from. Dashboard consumers should be able to audit the path from event to metric.

Metric	Source	Aggregation	Retention
Task-completion rate	`agent.run` spans in observability sink	count by outcome, windowed	days
Tool-error rate	`agent.tool_call` spans	error_count / total_count	days
Model-call cost per run	`agent.model_call` cost attribute summed to run	sum, grouped by run_id	days
Gate-fire count	`agent.gate_eval` audit events	count	regulatory horizon
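To make the event-to-metric path concrete, the aggregations in the table can be sketched over plain span records. The dictionary keys used here (`tool`, `error`, `run_id`, `cost_usd`) are assumed attribute names for illustration; the actual attributes depend on how the `agent.tool_call` and `agent.model_call` spans are instrumented.

```python
from collections import defaultdict


def tool_error_rates(tool_spans: list[dict]) -> dict[str, float]:
    """error_count / total_count per tool, from agent.tool_call-style spans."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for span in tool_spans:
        totals[span["tool"]] += 1
        if span["error"]:
            errors[span["tool"]] += 1
    return {tool: errors[tool] / totals[tool] for tool in totals}


def cost_per_run(model_spans: list[dict]) -> dict[str, float]:
    """Sum the cost attribute of agent.model_call-style spans, grouped by run_id."""
    costs: dict[str, float] = defaultdict(float)
    for span in model_spans:
        costs[span["run_id"]] += span["cost_usd"]
    return dict(costs)


# Made-up spans for one aggregation window.
tool_spans = [
    {"tool": "search", "error": False},
    {"tool": "search", "error": True},
    {"tool": "fetch", "error": False},
]
model_spans = [
    {"run_id": "r1", "cost_usd": 0.25},
    {"run_id": "r1", "cost_usd": 0.5},
    {"run_id": "r2", "cost_usd": 0.125},
]
rates = tool_error_rates(tool_spans)
costs = cost_per_run(model_spans)
```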

7. Report and review cadence

Report	Cadence	Audience	Owner
SLO compliance snapshot	weekly	agent owner; observability owner	observability owner
Error-budget burn report	monthly	agent owner; executive sponsor; architect	agent owner
SLO sheet review + update	quarterly or on change event	signing roles below	architect of record

8. Change log

Date	Version	Change	Trigger	Author (role)
YYYY-MM-DD	1.0	initial sheet	onboarding	architect

9. Sign-off

Role	Sign-off date
Agent owner
Architect of record
Observability owner
Error-budget policy owner

End of Agent SLO / SLI Sheet.