COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Artifact Template 3 of 5
How to use this template
Populate one sheet per agent at the end of Organize. The sheet records the operational promises the agent makes to its users and to the operators, the measurements that confirm or dispute those promises, and the actions taken when the promises are at risk.
Review monthly. Update when the agent’s tool surface, autonomy level, user population, or model mix changes materially. Publish the sheet to the same location as the dashboard it describes; the sheet and the dashboard must tell the same story.
Agent SLO / SLI Sheet
Identity
| Field | Value |
|---|---|
| Agent identifier | stable-agent-id |
| Charter version | 1.0 |
| SLO sheet version | 1.0 |
| Last updated | YYYY-MM-DD |
| Agent owner (role) | role |
| Architect of record (role) | role |
| Observability owner (role) | role |
| Error-budget policy owner (role) | role |
1. User-facing SLOs
The commitments the agent makes to its users. Each row states exactly what users are promised.
| SLO ID | Statement | Target | Window | Measurement basis |
|---|---|---|---|---|
| SLO-A1 | Task-completion rate | ≥ 90% | trailing 28 days | runs classified as completed / total runs |
| SLO-A2 | Acknowledgement latency (request to first response) | ≤ 5 s p95 | trailing 7 days | timestamp of first response token minus request receipt |
| SLO-A3 | Task-turnaround (simple task class) | ≤ 60 s p95 | trailing 7 days | task complete minus task submitted |
| SLO-A4 | HITL-gate response time (when gate fires) | ≤ 4 business hours p95 | trailing 28 days | operator decision timestamp minus gate-fire timestamp |
Task classes are agent-specific. If the agent handles heterogeneous tasks, SLO-A3 can be decomposed per class.
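The p95 targets in SLO-A2 through SLO-A4 can be checked with a percentile over the windowed samples. The following is a minimal sketch, assuming the nearest-rank percentile definition (the sheet does not say which definition the dashboard uses). The function names are hypothetical.

```python
def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile (integer pct, 1..100) by nearest rank."""
    if not samples:
        raise ValueError("no samples in window")
    ordered = sorted(samples)
    # rank = ceil(pct * n / 100), done in integers to avoid float rounding
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(samples, target_seconds, pct=95):
    """True when the pct-th percentile is at or below the target (e.g. 5 s for SLO-A2)."""
    return percentile_nearest_rank(samples, pct) <= target_seconds
```

If the dashboard uses an interpolated percentile instead, the two can disagree on small windows, so the sheet and the dashboard should name the same definition.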
2. Operational SLIs
The indicators the operators rely on to keep the agent healthy. SLIs feed into SLOs; some SLIs do not have a direct SLO but trigger alerts when they deviate.
| SLI ID | Indicator | How measured | Alert threshold | Linked SLO |
|---|---|---|---|---|
| SLI-B1 | Tool-error rate, by tool | errored tool-calls / total tool-calls, per tool, 1-hour window | > 1% for 3 consecutive hours | SLO-A1 |
| SLI-B2 | Loop length (steps per run) | steps, per run | p95 > 1.5× trailing-28-day baseline | SLO-A3 |
| SLI-B3 | HITL-fire rate | runs with ≥1 gate fire / total runs | > 1.25× trailing-28-day baseline | SLO-A4 |
| SLI-B4 | Model-call cost per run | sum cost USD / runs, 1-day window | p95 > budget | no direct SLO; feeds capacity plan |
| SLI-B5 | Memory-write schema-violation count | violations, 1-day window | > 0 | no direct SLO; critical signal |
| SLI-B6 | Indirect-injection-detector positive rate | positives / retrieval events, 1-day window | > 0.1% or > 1.5× baseline | no direct SLO; security signal |
| SLI-B7 | Kill-switch fire count | count, 1-day window | > 0 synchronous (information); > 0 asynchronous (alert) | no direct SLO; incident signal |
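Two threshold shapes recur in the table above: a fixed rate held for several consecutive windows (SLI-B1) and a multiple of a trailing baseline (SLI-B2, SLI-B3). A minimal sketch of both checks follows. The function names and the input shape (a list of per-hour `(errored, total)` pairs, oldest first) are assumptions for illustration.

```python
def consecutive_breach(hourly_counts, threshold=0.01, required_hours=3):
    """SLI-B1 style: True if the error rate exceeds threshold for
    required_hours consecutive 1-hour windows."""
    streak = 0
    for errored, total in hourly_counts:
        rate = errored / total if total else 0.0  # an empty hour counts as healthy
        streak = streak + 1 if rate > threshold else 0
        if streak >= required_hours:
            return True
    return False


def exceeds_baseline(current, baseline, factor):
    """SLI-B2 / SLI-B3 style: True if current is above factor x baseline."""
    return current > factor * baseline
```

For example, `exceeds_baseline(p95_steps, baseline_p95_steps, 1.5)` would express the SLI-B2 rule, where both arguments are hypothetical values read from the observability sink.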
3. Error-budget policy
The error budget translates SLO targets into an operational rule set. When the budget is at risk, non-essential releases pause and reliability work takes precedence; when it is exhausted, all releases halt.
| Field | Value |
|---|---|
| Primary SLO for budget | e.g., SLO-A1 task-completion |
| Error budget (per window) | (1 − target) × total events, per window |
| Burn-rate alert thresholds | burn rate > 14.4× over a 1-hour window; burn rate > 6× over a 6-hour window (a burn rate of 1× spends the budget exactly over the SLO window) |
| Budget-at-risk response | (a) halt non-essential releases; (b) prioritise reliability work; (c) convene incident-review meeting |
| Budget-exhausted response | (a) halt all releases; (b) consider temporary autonomy-level downgrade; (c) open ticket to executive sponsor |
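The budget formula and the burn-rate thresholds above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the function names are hypothetical, "at risk" is taken to mean that either burn-rate alert is firing, and "exhausted" is taken to mean that failures in the window have reached the budget.

```python
def error_budget(target, total_events):
    """Failures allowed in the window: (1 - target) x total events."""
    return (1.0 - target) * total_events


def burn_rate(target, failed, total):
    """Observed failure rate divided by the allowed failure rate.
    A value of 1.0 spends the budget exactly over the SLO window."""
    if total == 0:
        return 0.0
    return (failed / total) / (1.0 - target)


def budget_state(target, failed_in_window, total_in_window, burn_1h, burn_6h):
    """Classify the budget as 'healthy', 'at-risk', or 'exhausted'."""
    if total_in_window == 0:
        return "healthy"  # no events yet, nothing to judge
    if failed_in_window >= error_budget(target, total_in_window):
        return "exhausted"
    # Assumption: either burn-rate alert puts the budget at risk.
    if burn_1h > 14.4 or burn_6h > 6.0:
        return "at-risk"
    return "healthy"
```

With SLO-A1 at a 90% target and 1,000 runs in the window, the budget is 100 failed runs; 30 failures in 100 runs over the last hour is a burn rate of about 3×, which is below both alert thresholds.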
4. Task classes and differentiated SLOs
If the agent serves multiple task classes with materially different expectations, decompose.
| Task class | Description | Representative SLO |
|---|---|---|
| simple | e.g., single-tool read with summarisation | task-turnaround ≤ 60 s p95 |
| medium | e.g., multi-tool with bounded loop | task-turnaround ≤ 5 min p95 |
| complex / long-horizon | e.g., multi-step planning with HITL gates | task-turnaround ≤ 1 business day p95; acknowledgement ≤ 5 s p95 |
Task-class assignment must be deterministic (e.g., from a class-router) and made at submission, so that the measured denominator does not depend on the run's outcome.
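A deterministic router can be as small as a pure function over attributes known at submission. The sketch below is hypothetical: the routing inputs (`tool_count`, `has_hitl_gate`) and the run record fields (`task_class`, `outcome`) are assumptions chosen to mirror the example classes in the table, not a required schema.

```python
from collections import defaultdict


def route_task_class(tool_count, has_hitl_gate):
    """Assign a class from request attributes only, never from the outcome."""
    if has_hitl_gate:
        return "complex"
    if tool_count > 1:
        return "medium"
    return "simple"


def completion_rate_by_class(runs):
    """runs: iterable of dicts with 'task_class' (stamped at submission)
    and 'outcome'. Every run counts in its class denominator."""
    totals = defaultdict(int)
    completed = defaultdict(int)
    for run in runs:
        totals[run["task_class"]] += 1
        if run["outcome"] == "completed":
            completed[run["task_class"]] += 1
    return {cls: completed[cls] / totals[cls] for cls in totals}
```

Because the class is stamped before the run executes, a run that fails or times out still lands in the denominator of the class it was submitted under.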
5. User populations and differentiated targets
If different user populations receive different SLOs (free vs. paid; internal vs. customer-facing), enumerate.
| Population | SLO differences | Contractual basis |
|---|---|---|
| internal | same targets; relaxed alert thresholds | internal operating agreement |
| enterprise customers | tighter p95 targets; named support path | contract reference |
6. Measurement plumbing
Where the numbers come from. The dashboard consumers should be able to audit the path from event to metric.
| Metric | Source | Aggregation | Retention |
|---|---|---|---|
| Task-completion rate | agent.run spans in observability sink | count by outcome, windowed | days |
| Tool-error rate | agent.tool_call spans | error_count / total_count | days |
| Model-call cost per run | agent.model_call cost attribute summed to run | sum, grouped by run_id | days |
| Gate-fire count | agent.gate_eval audit events | count | regulatory horizon |
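The first three rows of the table can be traced from event to metric with a short aggregation. This is a sketch only: spans are modelled as plain dicts, and the attribute names (`outcome`, `error`, `cost_usd`, `run_id`) are assumptions, since the sheet names the span types but not their attributes.

```python
from collections import defaultdict


def aggregate(spans):
    """Derive the three span-based metrics from a list of span dicts."""
    runs = [s for s in spans if s["name"] == "agent.run"]
    tool_calls = [s for s in spans if s["name"] == "agent.tool_call"]
    model_calls = [s for s in spans if s["name"] == "agent.model_call"]

    completed = sum(1 for s in runs if s.get("outcome") == "completed")
    errored = sum(1 for s in tool_calls if s.get("error"))

    # Sum the per-call cost attribute up to its run, grouped by run_id.
    cost_by_run = defaultdict(float)
    for s in model_calls:
        cost_by_run[s["run_id"]] += s.get("cost_usd", 0.0)

    return {
        "task_completion_rate": completed / len(runs) if runs else None,
        "tool_error_rate": errored / len(tool_calls) if tool_calls else None,
        "cost_by_run": dict(cost_by_run),
    }
```

Returning `None` for an empty denominator keeps "no traffic" distinguishable from a 0% rate on the dashboard.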
7. Report and review cadence
| Report | Cadence | Audience | Owner |
|---|---|---|---|
| SLO compliance snapshot | weekly | agent owner; observability owner | observability owner |
| Error-budget burn report | monthly | agent owner; executive sponsor; architect | agent owner |
| SLO sheet review + update | quarterly or on change event | signing roles below | architect of record |
8. Change log
| Date | Version | Change | Trigger | Author (role) |
|---|---|---|---|---|
| YYYY-MM-DD | 1.0 | initial sheet | onboarding | architect |
9. Sign-off
| Role | Sign-off date |
|---|---|
| Agent owner | |
| Architect of record | |
| Observability owner | |
| Error-budget policy owner | |
End of Agent SLO / SLI Sheet.