COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Lab 3 of 5
Lab objective
Instrument an agent, emit a complete trace, ship the trace to an observability sink, build a dashboard covering the six SLIs that matter for agentic workloads, and implement a session-replay tool that reconstructs a run from its trace. By the end of the lab the learner has produced the observability fabric an on-call engineer uses at 02:00 to triage an agent incident — not the glossy marketing dashboard, the one that actually reads the last hour of runs.
Prerequisites
- Articles 15, 16, 17, 18 of this credential.
- A working agent — the finance agent from Lab 1, the coding agent from Lab 2, or an equivalent agent with at least three tools and a multi-step loop.
- An observability sink. Langfuse, LangSmith, Arize Phoenix, Humanloop, or an OpenTelemetry collector feeding Jaeger + Grafana. The lab must be reproducible on one sink; the rubric rewards reproducing it on two.
- A dashboarding layer. Grafana is the canonical choice; the vendor dashboards on the sinks named above are also acceptable.
The observability model
An agent trace is a nested tree. The root is the agent run. Children are planner turns, tool calls, model calls, memory reads and writes, gate evaluations, and sub-agent invocations. Each node has a span with timing, inputs, outputs, and attributes. This is the OpenTelemetry model applied to agentic workloads.
Six span types carry the lab’s discipline:
| Span type | What it records |
|---|---|
| agent.run | The root. Attributes: agent version, tenant, correlation ID, task description, outcome, total tokens, total cost. |
| agent.step | One iteration of the loop. Attributes: step number, planner decision, time-to-decision. |
| agent.tool_call | One tool invocation. Attributes: tool name, params, result size, duration, authorisation decision, verification outcome. |
| agent.model_call | One LLM call. Attributes: model, prompt tokens, completion tokens, cost, latency. |
| agent.memory_op | A memory read or write. Attributes: store, key, op, size, retention class. |
| agent.gate_eval | A gate / guardrail evaluation. Attributes: gate ID, decision, reason, operator if HITL. |
Every span carries a correlation ID that links the run to upstream requests (the user ticket, the batch submission) and downstream effects (the payment, the commit).
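The nested-tree model above can be sketched with plain dataclasses. This is an illustrative stand-in, not the OpenTelemetry SDK; the span types and the correlation-ID attribute follow the six-type schema, while the field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One node in the agent trace tree (simplified OTel-style span)."""
    span_id: str
    span_type: str                      # e.g. "agent.run", "agent.tool_call"
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)

# A toy run: an agent.run root, one step, one tool call nested under the step.
CORRELATION_ID = "corr-123"   # links the run to the upstream ticket and downstream effects
spans = [
    Span("s1", "agent.run",        0, 5000, None, {"correlation_id": CORRELATION_ID, "outcome": "completed"}),
    Span("s2", "agent.step",      10, 4000, "s1", {"correlation_id": CORRELATION_ID, "step": 1}),
    Span("s3", "agent.tool_call", 20,  900, "s2", {"correlation_id": CORRELATION_ID, "tool": "read_file"}),
]

def children(spans, parent_id):
    """Direct children of a span, via parent pointers."""
    return [s for s in spans if s.parent_id == parent_id]
```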
The six SLIs
The dashboard shows six operational signals. They are the agentic-specific translation of the classical RED (Rate, Errors, Duration) model onto loops, tools, and memory.
| SLI | Definition | Target |
|---|---|---|
| Task-completion rate | Fraction of agent.run spans ending with outcome=completed | ≥ 90% |
| Gate-fire rate | agent.gate_eval events per run | baseline per agent; alert on deviation |
| Tool-error rate | Fraction of agent.tool_call spans with error, by tool | ≤ 1% per tool |
| HITL-intervention rate | Fraction of runs invoking a human gate | per matrix design; alert on shift |
| Cost per task (p50, p95) | Sum of agent.model_call cost, rolled up to run | per agent budget |
| Loop length (p50, p95) | Number of steps per run | per agent baseline; alert on tail |
A seventh signal — tokens per task (p50, p95) — is worth tracking separately from cost, because cost can move with model selection while tokens move with agent behaviour.
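Several of these SLIs reduce to simple aggregations over the span stream. A stdlib sketch over spans represented as dicts (the field names are assumptions that mirror the span schema; real spans come from the sink's query API):

```python
from statistics import quantiles

# Toy span records standing in for the sink's query results.
runs = [
    {"type": "agent.run", "outcome": "completed", "cost_usd": 0.04, "steps": 5},
    {"type": "agent.run", "outcome": "completed", "cost_usd": 0.11, "steps": 9},
    {"type": "agent.run", "outcome": "error",     "cost_usd": 0.02, "steps": 3},
]
tool_calls = [
    {"type": "agent.tool_call", "tool": "read_file", "error": False},
    {"type": "agent.tool_call", "tool": "read_file", "error": True},
]

def task_completion_rate(runs):
    """Fraction of agent.run spans ending with outcome=completed."""
    return sum(r["outcome"] == "completed" for r in runs) / len(runs)

def tool_error_rate(tool_calls, tool):
    """Fraction of agent.tool_call spans with an error, for one tool."""
    calls = [c for c in tool_calls if c["tool"] == tool]
    return sum(c["error"] for c in calls) / len(calls)

def p50_p95(values):
    """Rough p50/p95 for cost per task or loop length."""
    qs = quantiles(sorted(values), n=100)   # qs[49] ~ p50, qs[94] ~ p95
    return qs[49], qs[94]
```

The same pattern extends to gate-fire rate and HITL-intervention rate by counting agent.gate_eval spans per run.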
Step 1 — Instrument the agent
Use the OpenTelemetry SDK. Wrap the agent runtime so that:
- The outer agent.run span starts when the task is accepted and ends when the agent exits.
- Each step emits an agent.step span.
- Each tool invocation wrapper emits an agent.tool_call span with the authorisation decision attached.
- Each model call emits an agent.model_call span with tokens and cost attached (cost computed from the model provider’s current pricing table; the attribute is cost_usd on the span).
- Each memory operation emits an agent.memory_op span.
- Each gate evaluation emits an agent.gate_eval span.
Attach the correlation ID as a span attribute on every span. Use the W3C trace-context header to propagate across process boundaries where the agent calls out to remote tools.
The lab’s rubric rewards instrumentation that works on a raw agent loop (no framework), on LangGraph, and on the OpenAI Agents SDK. The instrumentation pattern is the same; the hook points differ.
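The wrapping pattern for a raw agent loop can be sketched in plain Python. The context manager below is a stand-in for the OpenTelemetry SDK's span API, with recording simplified to a list; the function and attribute names are assumptions:

```python
import time
from contextlib import contextmanager

RECORDED = []           # stand-in for the span exporter

@contextmanager
def span(span_type, **attributes):
    """Record a span around a block of work; every span carries correlation_id."""
    rec = {"type": span_type, "start": time.time(), **attributes}
    try:
        yield rec
    finally:
        rec["end"] = time.time()
        RECORDED.append(rec)            # child spans close (and record) first

def run_agent(task, correlation_id):
    with span("agent.run", correlation_id=correlation_id, task=task) as root:
        with span("agent.step", correlation_id=correlation_id, step=1):
            with span("agent.tool_call", correlation_id=correlation_id,
                      tool="read_file", authorisation="allowed"):
                pass                    # the real tool call goes here
        root["outcome"] = "completed"

run_agent("demo task", "corr-001")
```

In a framework, the same wrappers attach to the framework's hook points (node callbacks, tool middleware) rather than to the loop body directly.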
Step 2 — Ship to the sink
Pick one sink. For Langfuse, map the OpenTelemetry spans to Langfuse’s generation / span / event primitives. For Arize Phoenix, use the OpenInference schema. For OpenTelemetry + Jaeger, ship OTLP traces directly. Regardless of sink, ensure:
- Every span has start and end timestamps to millisecond precision.
- Every agent.model_call carries prompt, completion, and cost attributes, redacted for PII per the organisation’s redaction policy.
- Spans are searchable by correlation ID, tenant, agent version, and outcome.
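Redaction before export can be as simple as masking configured attribute keys. A sketch, where the key set and the e-mail pattern are assumptions and the real policy comes from the organisation's redaction rules:

```python
import re

REDACT_KEYS = {"prompt", "completion"}            # attributes that may carry PII
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(attributes):
    """Return a copy of span attributes with e-mail addresses masked."""
    out = dict(attributes)
    for key in REDACT_KEYS & out.keys():
        out[key] = EMAIL.sub("<email>", out[key])
    return out
```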
Step 3 — Build the dashboard
Six panels, one SLI each. A seventh panel shows tokens per task. Each panel shows last 1 hour, last 24 hours, and last 7 days.
Additional must-haves on the dashboard:
- A run-explorer table: last 50 runs, sortable by task-completion, cost, duration, gate-fire count.
- A gate-firings breakdown: stacked bar of which gates fired per hour.
- A tool-error breakdown: table of tool name, error count, error rate, top error message.
- A model-mix panel: calls per model, useful when the agent is configured to fall back across providers.
The dashboard is the on-call engineer’s first screen. It is not a marketing dashboard. Panels that cannot be consulted in 30 seconds are cut.
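The tool-error breakdown panel reduces to a group-by over agent.tool_call spans. A stdlib sketch (the span field names are assumptions matching the schema):

```python
from collections import Counter, defaultdict

def tool_error_breakdown(tool_calls):
    """Per-tool call count, error count, error rate, and top error message."""
    calls, errors, messages = Counter(), Counter(), defaultdict(Counter)
    for c in tool_calls:
        calls[c["tool"]] += 1
        if c.get("error"):                       # error holds the message, or None
            errors[c["tool"]] += 1
            messages[c["tool"]][c["error"]] += 1
    return {
        tool: {
            "calls": calls[tool],
            "errors": errors[tool],
            "error_rate": errors[tool] / calls[tool],
            "top_error": messages[tool].most_common(1)[0][0] if errors[tool] else None,
        }
        for tool in calls
    }
```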
Step 4 — Session replay
Given a correlation ID, the replay tool reconstructs a run:
- Fetch all spans for the correlation ID.
- Sort by start time; reconstruct the nesting via parent-span IDs.
- Render the run as a narrative: “At t=0, the agent received task ABC. At t=+12s, it retrieved file X. At t=+31s, it proposed a commit touching Y files. Gate G4 fired at t=+33s; operator O approved at t=+7m. At t=+7m12s, the commit succeeded.”
- Alongside the narrative, render the structured data: inputs and outputs for each tool call, the planner’s reasoning text for each step (truncated for long contexts), and the memory deltas.
The replay is the artefact the incident commander reads at 02:00. It is the artefact the auditor reads at T+6 months. The same tool serves both.
Implementation: any language. Python with Rich or Textual for terminal rendering is a small lift. A web view that collapses and expands spans is a larger lift but worth it for multi-step agents.
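The reconstruction itself is a sort plus a parent-pointer pass. A minimal terminal-oriented sketch that renders one narrative line per span, indented by depth (span field names are assumptions):

```python
def replay(spans, correlation_id):
    """Rebuild one run as indented, time-offset narrative lines."""
    run = sorted(
        (s for s in spans if s["correlation_id"] == correlation_id),
        key=lambda s: s["start_ms"],
    )
    by_id = {s["span_id"]: s for s in run}

    def depth(s):
        d, parent = 0, s.get("parent_id")
        while parent:
            d, parent = d + 1, by_id[parent].get("parent_id")
        return d

    t0 = run[0]["start_ms"]
    return [
        "  " * depth(s)
        + f"t=+{(s['start_ms'] - t0) / 1000:.0f}s {s['type']}: {s['summary']}"
        for s in run
    ]
```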
Step 5 — Exercise against synthetic incidents
Produce a set of synthetic incidents and run the replay against each. Representative incidents:
- I1 — model refusal cascade. The model refuses a legitimate tool call; the agent retries; the retry is denied by the guardrail; the run aborts.
- I2 — infinite loop. The planner keeps proposing variations of the same tool call. The loop length exceeds the p95 threshold; alerting fires.
- I3 — memory poisoning. A prior session wrote a misleading entry into persistent memory. The current session reads it; the resulting action triggers a gate.
- I4 — tool provider outage. A downstream tool returns 502 for five minutes; the circuit breaker opens; the agent surfaces the outage and halts.
- I5 — cost runaway. The agent’s model choice doubles tokens per call. Cost per task spikes 3×; alerting fires; the run is halted by the cost cap.
- I6 — silent goal drift. Over six steps, the agent’s planner text shifts subtly from the declared task. The drift is visible in the replay’s planner-reasoning panel but would not be caught by aggregate SLIs.
The replay’s usefulness is measured by incident 6: can the on-call engineer, reading the replay cold, see the drift?
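Incident I2 can also be caught mechanically, by flagging a run whose recent tool calls repeat the same tool with the same params. A sketch; the window size is an assumption, and in practice the threshold comes from the per-agent loop-length baseline:

```python
def looks_like_loop(tool_calls, window=4):
    """True if the last `window` tool calls are one tool with identical params."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    # Hashable signature per call: tool name plus sorted params.
    return len({(c["tool"], repr(sorted(c["params"].items()))) for c in recent}) == 1
```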
Deliverables
- Instrumentation code for the agent (Step 1). Committed to version control.
- Observability sink configuration (Step 2).
- Dashboard exports (Step 3) — JSON or YAML for Grafana or the sink’s equivalent.
- Session-replay tool (Step 4).
- Replay outputs for incidents I1–I6 (Step 5).
- One-page note describing what the dashboard would miss without the replay, and what the replay would miss without the dashboard.
Rubric
| Criterion | Evidence | Weight |
|---|---|---|
| Span schema matches the six-type model | Code review | 15% |
| Every required attribute present | Span inspection | 15% |
| Six SLIs defined and rendered correctly | Dashboard review | 20% |
| Replay reconstructs a run faithfully | Replay output | 20% |
| Replay surfaces goal drift in I6 | Incident walk-through | 15% |
| Note articulates dashboard-vs-replay trade-off | Note review | 15% |
Lab sign-off
The Methodology Lead’s three follow-up questions:
- Which single SLI, if it moves, most commonly means the agent is misbehaving in a way the dashboard cannot fully explain without a replay?
- How does the retention policy for agent.model_call span attributes differ from the retention for agent.gate_eval, and why?
- If regulatory counsel asked you to produce the evidence pack for a specific agent decision taken six months ago, what does the observability stack give you, and what must be stored elsewhere?
A defensible submission names an SLI and describes the diagnostic (gate-fire rate shifting with no task change suggests input distribution shift or prompt-injection attempt); separates model-call retention from gate-eval retention on PII and audit grounds; and names the immutable audit store that retains gate decisions, operator identities, and final tool-call outcomes beyond the observability stack’s hot-storage horizon.
The lab’s pedagogic point is that agent observability is not LLM logging. The unit of analysis is the run, not the call, and the narrative of the run is recoverable only if the instrumentation captures the loop, the gates, the memory, and the plan — not only the prompts and completions.