AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Artifact Template
How to use this template
This template is the service-level-objective (SLO) sheet for an agentic AI feature — a feature whose runtime orchestrates multi-step model reasoning with tool calls, as described in Lab 3. The sheet is authored by the runtime owner, reviewed by the site-reliability and governance owners, and carried as the living record that on-call engineers, release engineers, and the architecture review board consult.
Agentic features have distinctive reliability dynamics — per-turn latency variance, per-run cost variance, tool-call failure cascades, loop-length blow-ups, prompt-injection-driven behaviour changes — that generic web-service SLO frameworks miss. The template captures those dynamics explicitly.
Every section is required. Where a section does not apply, complete it with an explicit statement plus a one-sentence rationale — for example, if the feature has no write-capable tools, the write-tool section reads “not applicable — this feature has only read-only and draft-only tools.”
Agentic Runtime SLO and SLI Sheet — [Feature Name]
1. Identification and ownership
| Field | Value |
|---|---|
| Feature name | [name, linked to the architecture design document] |
| Runtime platform | [LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex Agents, hand-rolled, or other] |
| Runtime version | [pinned] |
| Runtime owner | [single accountable individual, team] |
| Site-reliability reviewer | [name, role] |
| Governance reviewer | [name, role] |
| On-call rotation | [team name, rotation schedule link] |
| Effective date | YYYY-MM-DD |
2. System invariant
The single machine-checkable invariant that must hold at every step of every agent run. Written as one sentence in the Lab 3 style. For a read-and-draft-only feature, the invariant limits write-capable tool reach. For a tool-restricted feature, the invariant names the restrictions. For a human-in-the-loop feature, the invariant names the human step that must precede any action.
[Example: “The Feature can read market state, portfolio state, research state, and compliance state, and can draft messages. It cannot send, submit, modify, or cancel an order, a position, or a communication to an external venue or counterparty. No agent plan, tool call, prompt injection, or operator instruction can place the Feature outside this envelope.”]
| Field | Value |
|---|---|
| Invariant test | [property-based test, policy simulation, fuzzing harness] |
| Test location | [CI path, cadence] |
| Failure action | [block release, page runtime owner, other] |
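A minimal sketch of what a runtime-enforced invariant can look like: a guard that every tool call passes through before dispatch, denying anything outside a read/draft allowlist. The tool names, allowlist, and exception type are illustrative assumptions, not part of the template.

```python
# Illustrative invariant guard: every tool call is checked before dispatch.
READ_ONLY_TOOLS = {"read_market_state", "read_portfolio_state"}
DRAFT_ONLY_TOOLS = {"draft_message"}
ALLOWED_TOOLS = READ_ONLY_TOOLS | DRAFT_ONLY_TOOLS

class InvariantViolation(Exception):
    """Raised (and written to the runtime enforcement log) on any escape attempt."""

def guard_tool_call(tool_name: str) -> None:
    if tool_name not in ALLOWED_TOOLS:
        # This event is the "write-capable-tool-reach attempt" SLI in section 3.4.
        raise InvariantViolation(f"tool {tool_name!r} is outside the feature envelope")

guard_tool_call("read_market_state")    # permitted
try:
    guard_tool_call("submit_order")     # blocked: write-capable
except InvariantViolation:
    pass
```

A property-based invariant test then generates arbitrary tool names and asserts the guard never admits one outside the allowlist.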
3. Service-level indicators (SLIs)
3.1 Latency SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Per-turn latency | [Wall-clock time from user-turn input to assistant-turn output] | [trace span] |
| Per-tool-call latency | [Wall-clock time from tool-call dispatch to tool-call result] | [trace span] |
| Per-run duration | [Total duration of a multi-turn run from session open to session end] | [trace span] |
| Time-to-first-token | [For streamed responses, wall-clock time from user-turn input to first streamed token] | [trace span] |
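The latency SLIs above reduce to percentiles over trace-span durations. A minimal sketch, assuming a flat list of per-turn durations exported from the trace backend named in section 7 (the sample values are illustrative):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile; adequate for SLO reporting sketches."""
    ordered = sorted(samples)
    k = math.ceil(p / 100 * len(ordered)) - 1
    return ordered[max(k, 0)]

turn_latencies_s = [1.2, 0.9, 2.4, 1.1, 7.5, 1.3, 1.0, 3.2]  # illustrative
p50 = percentile(turn_latencies_s, 50)   # 1.2
p99 = percentile(turn_latencies_s, 99)   # 7.5
```

Production backends compute percentiles natively; the point is that every latency SLO in section 4 must be derivable from the spans named here.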
3.2 Cost SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Per-turn cost | [Aggregated LLM and tool costs for a single turn] | [cost attribution record] |
| Per-run cost | [Aggregated cost for an entire run] | [cost attribution record] |
| Input tokens per turn | [Tokens sent to generator, aggregated across all tool-interleaved calls] | [trace span] |
| Output tokens per turn | [Tokens generated] | [trace span] |
3.3 Behavioural SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Loop-length | [Number of agent steps in a run, including reasoning steps and tool calls] | [trace] |
| Tool-call success rate | [Fraction of tool calls returning a success result, per tool] | [trace, per-tool] |
| Validator-failure rate | [Fraction of tool calls intercepted by a pre- or post-execution validator, per validator] | [trace] |
| Refusal rate | [Fraction of runs in which the generator refuses to respond] | [trace] |
| Unsafe-content rate | [Fraction of responses flagged by the safety classifier] | [safety classifier output] |
| Human-review rating mean | [Mean reviewer rating over sampled runs, if the feature is sampled for human review] | [review console] |
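The behavioural SLIs are aggregations over per-run trace events. A sketch under an assumed event shape (the `run`/`type`/`tool`/`ok` fields are illustrative, not a trace-backend schema):

```python
# Illustrative trace events for two runs.
events = [
    {"run": "r1", "type": "tool_call", "tool": "search", "ok": True},
    {"run": "r1", "type": "tool_call", "tool": "search", "ok": False},
    {"run": "r1", "type": "reasoning"},
    {"run": "r2", "type": "tool_call", "tool": "draft", "ok": True},
]

def loop_length(run_id):
    # Agent steps = reasoning steps + tool calls, per the definition above.
    return sum(1 for e in events if e["run"] == run_id)

def tool_call_success_rate(tool):
    calls = [e for e in events if e["type"] == "tool_call" and e["tool"] == tool]
    return sum(e["ok"] for e in calls) / len(calls)
```

With these events, `loop_length("r1")` is 3 and `tool_call_success_rate("search")` is 0.5; the loop-length p99 SLO in section 4 is a percentile over such per-run counts.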
3.4 Invariant SLIs
| Indicator | Definition | Measurement source |
|---|---|---|
| Invariant-violation rate | [Detected violations of the system invariant per run; must be 0] | [runtime enforcement log] |
| Write-capable-tool-reach attempts | [Attempts to invoke a tool outside the feature’s envelope, per run] | [runtime enforcement log] |
4. Service-level objectives (SLOs)
| SLO | Target | Measurement window | Error budget | Action on breach |
|---|---|---|---|---|
| Per-turn latency p50 | [e.g., ≤ 2.0 s] | [28-day rolling] | […] | […] |
| Per-turn latency p99 | [e.g., ≤ 8.0 s] | [28-day rolling] | […] | […] |
| Per-run cost p95 | [e.g., ≤ $0.40] | [28-day rolling] | […] | […] |
| Loop-length p99 | [e.g., ≤ 14 steps] | [28-day rolling] | […] | […] |
| Tool-call success rate | [e.g., ≥ 99.0%, per tool] | [28-day rolling] | […] | […] |
| Refusal rate | [e.g., 0.5% to 5.0% band] | [7-day rolling] | […] | [alert if out of band, in either direction] |
| Unsafe-content rate | [e.g., ≤ 0.02%] | [24-hour rolling] | […] | [page on call] |
| Invariant-violation rate | [0 — no budget] | [real-time] | [none — zero-tolerance] | [immediate incident declaration] |
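For availability-style SLOs such as tool-call success rate, the error budget is simply the complement of the target applied to the window's event volume. A sketch with illustrative figures:

```python
def error_budget(target, total_events):
    """Events allowed to miss the SLO before the window's budget is exhausted."""
    return (1.0 - target) * total_events

# 99.0% tool-call success over a 28-day window with ~500,000 calls:
budget = error_budget(0.99, 500_000)   # roughly 5,000 failed calls allowed
```

The invariant-violation row deliberately has no such computation: a zero-tolerance SLO has no budget to spend.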
5. Error budget policy
| Field | Value |
|---|---|
| Monthly error budget per SLO | [the time the service may be out of SLO before release cadence slows] |
| Budget-burn alert thresholds | [e.g., 2× and 10× burn-rate alerts] |
| Release-cadence impact | [when budget is exhausted, release cadence changes to what] |
| Budget-exhausted recovery | [the process to re-earn budget — improvement work, post-incident remediation, stricter canary] |
| Invariant-violation budget | [zero; not subject to the general error-budget process] |
6. Incident response
6.1 Severity classification
| Severity | Trigger | Response time | Communication |
|---|---|---|---|
| SEV-1 | [invariant violation, global outage, data breach] | [minutes] | [exec paging, customer communication] |
| SEV-2 | [SLO breach with active user impact, partial outage] | [tens of minutes] | [team paging, status page update] |
| SEV-3 | [SLO budget-burn rate alert, degraded behaviour] | [business hours] | [team notification] |
| SEV-4 | [minor anomaly, non-urgent drift] | [next business day] | [tracked issue] |
6.2 Kill-switch topology
| Mode | Who can invoke | Authentication | Propagation | In-flight behaviour | Smoke-test cadence |
|---|---|---|---|---|---|
| Tool-level freeze | [role(s)] | [auth step] | [target seconds] | […] | […] |
| Generator-level freeze | [role(s)] | […] | […] | […] | […] |
| Agent-level freeze | [role(s)] | […] | […] | […] | […] |
| Global freeze | [role(s); two-person rule?] | […] | [target ≤ 60s] | […] | […] |
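One way to reason about the four freeze modes is as a flag registry where the global mode dominates every narrower one. The sketch below is a single-process stand-in; names are illustrative, and a real implementation distagributes the flag and measures propagation against the target seconds in the table:

```python
import time

class KillSwitch:
    def __init__(self):
        self._frozen: dict = {}   # mode -> invocation timestamp, for audit

    def freeze(self, mode: str) -> None:
        self._frozen[mode] = time.monotonic()

    def is_frozen(self, mode: str) -> bool:
        # Global freeze dominates every narrower mode.
        return "global" in self._frozen or mode in self._frozen

switch = KillSwitch()
switch.freeze("tool:submit_order")
assert switch.is_frozen("tool:submit_order")
switch.freeze("global")
assert switch.is_frozen("generator")   # global freeze covers all modes
```

The smoke-test cadence column exists because a kill switch that is never exercised cannot be trusted to propagate within its target.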
6.3 Runbooks
Provide, for at least three named failure scenarios, the runbook (or a link to it) including the decision tree, escalation paths, and rollback commands.
- [Prompt-injection incident runbook]
- [Generator-outage failover runbook]
- [Kill-switch invocation runbook]
- [Tool-authorization breach runbook]
- [Cost-anomaly incident runbook]
7. Observability contract
| Field | Value |
|---|---|
| Trace backend | [Arize, Langfuse, Datadog, OpenTelemetry stack, or other] |
| Metric backend | [Prometheus, CloudWatch, Datadog, or other] |
| Log backend | [with retention class per stream] |
| End-to-end trace ID propagation | [from user-turn through runtime through tool calls through LLM calls] |
| Log hygiene | [redaction rules for free-text inputs, retention class per stream] |
| Dashboard location | [URL] |
| Alert channels | [paging destinations, severity routing] |
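End-to-end trace ID propagation means one ID, set at the user turn, is visible in every nested tool and LLM call without being threaded through each signature. A minimal stdlib sketch using `contextvars` (function names are illustrative; a real deployment would use the trace backend's propagation API):

```python
import contextvars
import uuid

trace_id: contextvars.ContextVar = contextvars.ContextVar("trace_id")

def handle_user_turn():
    trace_id.set(uuid.uuid4().hex)   # one ID for the whole turn
    return call_tool()

def call_tool():
    # Every downstream log, metric, and span carries the same ID.
    return trace_id.get()

tid = handle_user_turn()
assert tid == trace_id.get()
```

This is what makes the per-turn and per-run SLIs in section 3 joinable: without a shared ID, tool-call spans cannot be attributed to the turn that caused them.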
8. Review and amendments
| Role | Name | Decision | Date |
|---|---|---|---|
| Runtime owner | […] | Authored | YYYY-MM-DD |
| Site-reliability reviewer | […] | Approved | YYYY-MM-DD |
| Governance reviewer | […] | Approved | YYYY-MM-DD |
| Architecture reviewer | […] | Approved | YYYY-MM-DD |
Amendment log. Material amendments (change of invariant, change of SLO target, change of kill-switch topology) require re-review by the full panel; non-material amendments (new SLI added for observation, alerting-threshold tuning) may be self-approved by the runtime owner and site-reliability reviewer.
Notes on use
When to use this template. Every agentic feature — any feature that orchestrates multi-step model reasoning with tool calls. Single-turn features use a simpler SLO sheet.
Common errors in first-time use. Missing system invariant; SLOs that are not measurable from the trace data; kill-switch topology without propagation SLO; zero-tolerance SLOs expressed with error budgets; runbooks reduced to links that point to empty wiki pages. Reviewers treat these as blocking.
What follows. The SLO sheet is cited from Template 1 §9 (operational architecture) and feeds the feature’s release-gate readiness review. It is re-reviewed on every material runtime change, every new tool added, and at least quarterly.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.