COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Lab 3 of 5
Lab objective
Instrument an agent, emit a complete trace, ship the trace to an observability sink, build a dashboard covering the six SLIs that matter for agentic workloads, and implement a session-replay tool that reconstructs a run from its trace. By the end of the lab the learner has produced the observability fabric an on-call engineer uses at 02:00 to triage an agent incident — not the glossy marketing dashboard, the one that actually reads the last hour of runs.
Prerequisites
- Articles 15, 16, 17, 18 of this credential.
- A working agent — the finance agent from Lab 1, the coding agent from Lab 2, or an equivalent agent with at least three tools and a multi-step loop.
- An observability sink. Langfuse, LangSmith, Arize Phoenix, Humanloop, or an OpenTelemetry collector feeding Jaeger + Grafana. The lab must be reproducible on one sink; the rubric rewards reproducing it on two.
- A dashboarding layer. Grafana is the canonical choice; the vendor dashboards on the sinks named above are also acceptable.
The observability model
An agent trace is a nested tree. The root is the agent run. Children are planner turns, tool calls, model calls, memory reads and writes, gate evaluations, and sub-agent invocations. Each node has a span with timing, inputs, outputs, and attributes. This is the OpenTelemetry model applied to agentic workloads.
Six span types carry the lab’s discipline:
| Span type | What it records |
|---|---|
| agent.run | The root. Attributes: agent version, tenant, correlation ID, task description, outcome, total tokens, total cost. |
| agent.step | One iteration of the loop. Attributes: step number, planner decision, time-to-decision. |
| agent.tool_call | One tool invocation. Attributes: tool name, params, result size, duration, authorisation decision, verification outcome. |
| agent.model_call | One LLM call. Attributes: model, prompt tokens, completion tokens, cost, latency. |
| agent.memory_op | A memory read or write. Attributes: store, key, op, size, retention class. |
| agent.gate_eval | A gate / guardrail evaluation. Attributes: gate ID, decision, reason, operator if HITL. |
Every span carries a correlation ID that links the run to upstream requests (the user ticket, the batch submission) and downstream effects (the payment, the commit).
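The nested-tree model above can be sketched with plain dataclasses. This is an illustrative stand-in, not the OpenTelemetry SDK; the span types and the correlation-ID attribute follow the six-type schema, while the field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One node in the agent trace tree (simplified OTel-style span)."""
    span_id: str
    span_type: str                      # e.g. "agent.run", "agent.tool_call"
    start_ms: int
    end_ms: int
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)

# A toy run: an agent.run root, one step, one tool call nested under the step.
CORRELATION_ID = "corr-123"   # links the run to the upstream ticket and downstream effects
spans = [
    Span("s1", "agent.run",        0, 5000, None, {"correlation_id": CORRELATION_ID, "outcome": "completed"}),
    Span("s2", "agent.step",      10, 4000, "s1", {"correlation_id": CORRELATION_ID, "step": 1}),
    Span("s3", "agent.tool_call", 20,  900, "s2", {"correlation_id": CORRELATION_ID, "tool": "read_file"}),
]

def children(spans, parent_id):
    """Direct children of a span, via parent pointers."""
    return [s for s in spans if s.parent_id == parent_id]
```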
The six SLIs
The dashboard shows six operational signals. They are the agentic-specific translation of the classical RED (Rate, Errors, Duration) model onto loops, tools, and memory.
| SLI | Definition | Target |
|---|---|---|
| Task-completion rate | Fraction of agent.run spans ending with outcome=completed | ≥ 90% |
| Gate-fire rate | agent.gate_eval events per run | baseline per agent; alert on deviation |
| Tool-error rate | Fraction of agent.tool_call spans with error, by tool | ≤ 1% per tool |
| HITL-intervention rate | Fraction of runs invoking a human gate | per matrix design; alert on shift |
| Cost per task (p50, p95) | Sum of agent.model_call cost, rolled up to run | per agent budget |
| Loop length (p50, p95) | Number of steps per run | per agent baseline; alert on tail |
A seventh signal — tokens per task (p50, p95) — is worth tracking separately from cost, because cost can move with model selection while tokens move with agent behaviour.
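Several of these SLIs reduce to simple aggregations over the span stream. A stdlib sketch over spans represented as dicts (the field names are assumptions that mirror the span schema; real spans come from the sink's query API):

```python
from statistics import quantiles

# Toy span records standing in for the sink's query results.
runs = [
    {"type": "agent.run", "outcome": "completed", "cost_usd": 0.04, "steps": 5},
    {"type": "agent.run", "outcome": "completed", "cost_usd": 0.11, "steps": 9},
    {"type": "agent.run", "outcome": "error",     "cost_usd": 0.02, "steps": 3},
]
tool_calls = [
    {"type": "agent.tool_call", "tool": "read_file", "error": False},
    {"type": "agent.tool_call", "tool": "read_file", "error": True},
]

def task_completion_rate(runs):
    """Fraction of agent.run spans ending with outcome=completed."""
    return sum(r["outcome"] == "completed" for r in runs) / len(runs)

def tool_error_rate(tool_calls, tool):
    """Fraction of agent.tool_call spans with an error, for one tool."""
    calls = [c for c in tool_calls if c["tool"] == tool]
    return sum(c["error"] for c in calls) / len(calls)

def p50_p95(values):
    """Rough p50/p95 for cost per task or loop length."""
    qs = quantiles(sorted(values), n=100)   # qs[49] ~ p50, qs[94] ~ p95
    return qs[49], qs[94]
```

The same pattern extends to gate-fire rate and HITL-intervention rate by counting agent.gate_eval spans per run.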
Step 1 — Instrument the agent
Use the OpenTelemetry SDK. Wrap the agent runtime so that:
- The outer agent.run span starts when the task is accepted and ends when the agent exits.
- Each step emits an agent.step span.
- Each tool invocation wrapper emits an agent.tool_call span with the authorisation decision attached.
- Each model call emits an agent.model_call span with tokens and cost attached (cost computed from the model provider’s current pricing table; the attribute is cost_usd on the span).
- Each memory operation emits an agent.memory_op span.
- Each gate evaluation emits an agent.gate_eval span.
Attach the correlation ID as a span attribute on every span. Use the W3C trace-context header to propagate across process boundaries where the agent calls out to remote tools.
The lab’s rubric rewards instrumentation that works on a raw agent loop (no framework), on LangGraph, and on the OpenAI Agents SDK. The instrumentation pattern is the same; the hook points differ.
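The wrapping pattern for a raw agent loop can be sketched in plain Python. The context manager below is a stand-in for the OpenTelemetry SDK's span API, with recording simplified to a list; the function and attribute names are assumptions:

```python
import time
from contextlib import contextmanager

RECORDED = []           # stand-in for the span exporter

@contextmanager
def span(span_type, **attributes):
    """Record a span around a block of work; every span carries correlation_id."""
    rec = {"type": span_type, "start": time.time(), **attributes}
    try:
        yield rec
    finally:
        rec["end"] = time.time()
        RECORDED.append(rec)            # child spans close (and record) first

def run_agent(task, correlation_id):
    with span("agent.run", correlation_id=correlation_id, task=task) as root:
        with span("agent.step", correlation_id=correlation_id, step=1):
            with span("agent.tool_call", correlation_id=correlation_id,
                      tool="read_file", authorisation="allowed"):
                pass                    # the real tool call goes here
        root["outcome"] = "completed"

run_agent("demo task", "corr-001")
```

In a framework, the same wrappers attach to the framework's hook points (node callbacks, tool middleware) rather than to the loop body directly.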
Step 2 — Ship to the sink
Pick one sink. For Langfuse, map the OpenTelemetry spans to Langfuse’s generation / span / event primitives. For Arize Phoenix, use the OpenInference schema. For OpenTelemetry + Jaeger, ship OTLP traces directly. Regardless of sink, ensure:
- Every span has start and end timestamps to millisecond precision.
- Every agent.model_call carries prompt, completion, and cost attributes, redacted for PII per the organisation’s redaction policy.
- Spans are searchable by correlation ID, tenant, agent version, and outcome.
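Redaction before export can be as simple as masking configured attribute keys. A sketch, where the key set and the e-mail pattern are assumptions and the real policy comes from the organisation's redaction rules:

```python
import re

REDACT_KEYS = {"prompt", "completion"}            # attributes that may carry PII
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def redact(attributes):
    """Return a copy of span attributes with e-mail addresses masked."""
    out = dict(attributes)
    for key in REDACT_KEYS & out.keys():
        out[key] = EMAIL.sub("<email>", out[key])
    return out
```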
Step 3 — Build the dashboard
Six panels, one SLI each. A seventh panel shows tokens per task. Each panel shows last 1 hour, last 24 hours, and last 7 days.
Additional must-haves on the dashboard:
- A run-explorer table: last 50 runs, sortable by task-completion, cost, duration, gate-fire count.
- A gate-firings breakdown: stacked bar of which gates fired per hour.
- A tool-error breakdown: table of tool name, error count, error rate, top error message.
- A model-mix panel: calls per model, useful when the agent is configured to fall back across providers.
The dashboard is the on-call engineer’s first screen. It is not a marketing dashboard. Panels that cannot be consulted in 30 seconds are cut.
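The tool-error breakdown panel reduces to a group-by over agent.tool_call spans. A stdlib sketch (the span field names are assumptions matching the schema):

```python
from collections import Counter, defaultdict

def tool_error_breakdown(tool_calls):
    """Per-tool call count, error count, error rate, and top error message."""
    calls, errors, messages = Counter(), Counter(), defaultdict(Counter)
    for c in tool_calls:
        calls[c["tool"]] += 1
        if c.get("error"):                       # error holds the message, or None
            errors[c["tool"]] += 1
            messages[c["tool"]][c["error"]] += 1
    return {
        tool: {
            "calls": calls[tool],
            "errors": errors[tool],
            "error_rate": errors[tool] / calls[tool],
            "top_error": messages[tool].most_common(1)[0][0] if errors[tool] else None,
        }
        for tool in calls
    }
```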
Step 4 — Session replay
Given a correlation ID, the replay tool reconstructs a run:
- Fetch all spans for the correlation ID.
- Sort by start time; reconstruct the nesting via parent-span IDs.
- Render the run as a narrative: “At t=0, the agent received task ABC. At t=+12s, it retrieved file X. At t=+31s, it proposed a commit touching Y files. Gate G4 fired at t=+33s; operator O approved at t=+7m. At t=+7m12s, the commit succeeded.”
- Alongside the narrative, render the structured data: inputs and outputs for each tool call, the planner’s reasoning text for each step (truncated for long contexts), and the memory deltas.
The replay is the artefact the incident commander reads at 02:00. It is the artefact the auditor reads at T+6 months. The same tool serves both.
Implementation: any language. Python with Rich or Textual for terminal rendering is a small lift. A web view that collapses and expands spans is a larger lift but worth it for multi-step agents.
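The reconstruction itself is a sort plus a parent-pointer pass. A minimal terminal-oriented sketch that renders one narrative line per span, indented by depth (span field names are assumptions):

```python
def replay(spans, correlation_id):
    """Rebuild one run as indented, time-offset narrative lines."""
    run = sorted(
        (s for s in spans if s["correlation_id"] == correlation_id),
        key=lambda s: s["start_ms"],
    )
    by_id = {s["span_id"]: s for s in run}

    def depth(s):
        d, parent = 0, s.get("parent_id")
        while parent:
            d, parent = d + 1, by_id[parent].get("parent_id")
        return d

    t0 = run[0]["start_ms"]
    return [
        "  " * depth(s)
        + f"t=+{(s['start_ms'] - t0) / 1000:.0f}s {s['type']}: {s['summary']}"
        for s in run
    ]
```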
Step 5 — Exercise against synthetic incidents
Produce a set of synthetic incidents and run the replay against each. Representative incidents:
- I1 — model refusal cascade. The model refuses a legitimate tool call; the agent retries; the retry is denied by the guardrail; the run aborts.
- I2 — infinite loop. The planner keeps proposing variations of the same tool call. The loop length exceeds the p95 threshold; alerting fires.
- I3 — memory poisoning. A prior session wrote a misleading entry into persistent memory. The current session reads it; the resulting action triggers a gate.
- I4 — tool provider outage. A downstream tool returns 502 for five minutes; the circuit breaker opens; the agent surfaces the outage and halts.
- I5 — cost runaway. The agent’s model choice doubles tokens per call. Cost per task spikes 3×; alerting fires; the run is halted by the cost cap.
- I6 — silent goal drift. Over six steps, the agent’s planner text shifts subtly from the declared task. The drift is visible in the replay’s planner-reasoning panel but would not be caught by aggregate SLIs.
The replay’s usefulness is measured by incident 6: can the on-call engineer, reading the replay cold, see the drift?
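Incident I2 can also be caught mechanically, by flagging a run whose recent tool calls repeat the same tool with the same params. A sketch; the window size is an assumption, and in practice the threshold comes from the per-agent loop-length baseline:

```python
def looks_like_loop(tool_calls, window=4):
    """True if the last `window` tool calls are one tool with identical params."""
    if len(tool_calls) < window:
        return False
    recent = tool_calls[-window:]
    # Hashable signature per call: tool name plus sorted params.
    return len({(c["tool"], repr(sorted(c["params"].items()))) for c in recent}) == 1
```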
Deliverables
- Instrumentation code for the agent (Step 1). Committed to version control.
- Observability sink configuration (Step 2).
- Dashboard exports (Step 3) — JSON or YAML for Grafana or the sink’s equivalent.
- Session-replay tool (Step 4).
- Replay outputs for incidents I1–I6 (Step 5).
- One-page note describing what the dashboard would miss without the replay, and what the replay would miss without the dashboard.
Rubric
| Criterion | Evidence | Weight |
|---|---|---|
| Span schema matches the six-type model | Code review | 15% |
| Every required attribute present | Span inspection | 15% |
| Six SLIs defined and rendered correctly | Dashboard review | 20% |
| Replay reconstructs a run faithfully | Replay output | 20% |
| Replay surfaces goal drift in I6 | Incident walk-through | 15% |
| Note articulates dashboard-vs-replay trade-off | Note review | 15% |
Lab sign-off
The Methodology Lead’s three follow-up questions:
- Which single SLI, if it moves, most commonly means the agent is misbehaving in a way the dashboard cannot fully explain without a replay?
- How does the retention policy for agent.model_call span attributes differ from the retention for agent.gate_eval, and why?
- If regulatory counsel asked you to produce the evidence pack for a specific agent decision taken six months ago, what does the observability stack give you, and what must be stored elsewhere?
A defensible submission names an SLI and describes the diagnostic (gate-fire rate shifting with no task change suggests input distribution shift or prompt-injection attempt); separates model-call retention from gate-eval retention on PII and audit grounds; and names the immutable audit store that retains gate decisions, operator identities, and final tool-call outcomes beyond the observability stack’s hot-storage horizon.
The lab’s pedagogic point is that agent observability is not LLM logging. The unit of analysis is the run, not the call, and the narrative of the run is recoverable only if the instrumentation captures the loop, the gates, the memory, and the plan — not only the prompts and completions.