AITE M1.2-Art15 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Observability for Agentic Systems

Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

9 min read Article 15 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert


Thesis. Classical service observability answers “why is this request slow” and “why did this request fail.” Agentic observability answers those questions plus “why did the agent make this decision” and “can we replay the trace and get the same answer?” The discipline is not just more spans; it is structurally different data and structurally different questions. Without agentic-grade observability the team cannot debug production incidents (Article 25), cannot evaluate real behavior (Article 17), cannot produce EU AI Act Article 12 logs (Article 23), and cannot run the retrospective that converts incidents into improvements. This article specifies what observability must capture, names the backends that capture it well, and teaches the replay discipline that senior agentic teams treat as non-negotiable.

The six data types agentic observability must capture

Agentic traces have more layers than classical service traces. Each layer needs its own span type.

Type 1 — Session and request metadata

Every agent invocation carries: session_id, tenant_id, acting_user, agent_config_id, agent_version, prompt_template_version, policy_version, model_id, model_version, start_time, end_time, total_cost, total_tokens, outcome. This is the root span of every agent run and the index for everything else.
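
The root-span fields above can be modeled as a small stdlib-only sketch. The field names follow the article's list; the dataclass itself and its `close` helper are illustrative, not a schema from any particular backend:

```python
from dataclasses import dataclass, field
from typing import Optional
import time

@dataclass
class AgentRunRootSpan:
    """Root span for one agent invocation; the index for all child spans."""
    session_id: str
    tenant_id: str
    acting_user: str
    agent_config_id: str
    agent_version: str
    prompt_template_version: str
    policy_version: str
    model_id: str
    model_version: str
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    total_cost: float = 0.0
    total_tokens: int = 0
    outcome: str = "in_progress"

    def close(self, outcome: str) -> None:
        # Stamp the end of the run; cost/token totals accumulate elsewhere.
        self.end_time = time.time()
        self.outcome = outcome
```

Every other span type in this article hangs off this record via `session_id`, which is why the article calls it "the index for everything else."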

Type 2 — Reasoning spans

The model’s reasoning — thoughts, plans, reflections — as emitted by the loop (ReAct thought, Plan-and-Execute plan, Reflexion critique, state-graph decisions). Each reasoning span captures the prompt sent in, the completion received, the model’s reasoning text (if emitted), token counts, and elapsed time. Reasoning spans are the agentic-specific telemetry; classical observability has no analog.

Type 3 — Tool-call spans

Every tool call: tool name, version, input hash, input classification, authorization decision and reasons (Article 6), execution time, output hash, output classification, validation decision and reasons, side-effect summary. Tool spans are nested under reasoning spans so the trace shows “the model thought X, which led to tool call Y.”
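
The parent-child linkage can be sketched in a few lines of stdlib Python. The span dictionaries and field names here are illustrative (real emission would go through OTel), but the structure shows how a tool span nests under the reasoning span that triggered it:

```python
import hashlib
import uuid

def sha256(payload: bytes) -> str:
    """Content hash used in place of storing large payloads inline."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()

def reasoning_span(session_id: str, text: str) -> dict:
    return {"span_id": uuid.uuid4().hex, "kind": "reasoning",
            "session_id": session_id, "text": text}

def tool_span(parent: dict, tool_name: str, tool_version: str,
              tool_input: bytes, tool_output: bytes, authz: str) -> dict:
    # Nested under the reasoning span so the trace reads
    # "the model thought X, which led to tool call Y".
    return {
        "span_id": uuid.uuid4().hex,
        "parent_span_id": parent["span_id"],
        "kind": "tool_call",
        "tool.name": tool_name,
        "tool.version": tool_version,
        "input_hash": sha256(tool_input),
        "authorization": authz,
        "output_hash": sha256(tool_output),
    }
```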

Type 4 — Memory operation spans

Every memory read and write: which layer (1–4 from Article 7), which namespace, query or write content, classification, provenance, retrieval top-k or write policy decision. Memory poisoning detection (Article 7) runs on these spans; the data must be there to detect.

Type 5 — Human decision spans

Every HITL or HOTL intervention (Article 10): request posted to reviewer queue, wait time, reviewer identity, decision, rationale, time to decide, resulting agent action. Human decision spans are the audit evidence for EU AI Act Article 14 oversight claims (Article 23).

Type 6 — Policy and safety spans

Every policy-engine evaluation (Article 22), sanitizer invocation (Article 14), guardrail tripwire (Article 6), kill-switch check (Article 9). These spans prove the defense layers actually ran, which matters for both debugging and for regulatory evidence.

OpenTelemetry semantic conventions for GenAI

The CNCF OpenTelemetry project’s GenAI semantic conventions (maturing 2024–2026) standardize attribute names across agentic frameworks. Key conventions:

  • gen_ai.system — the model system (OpenAI, Anthropic, Vertex, Bedrock).
  • gen_ai.request.model, gen_ai.response.model — model identifiers.
  • gen_ai.request.max_tokens, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens.
  • gen_ai.operation.name — chat, completion, embedding, tool call.
  • gen_ai.prompt, gen_ai.completion — events with content (subject to sampling and privacy rules).

Agentic-specific additions beyond GenAI conventions (not yet standardized but stabilizing in practice):

  • agent.id, agent.version, agent.loop_type.
  • agent.tool.name, agent.tool.version, agent.tool.risk_class.
  • agent.memory.layer, agent.memory.namespace.
  • agent.policy.decision, agent.policy.version.
  • agent.hitl.reviewer, agent.hitl.decision.
  • agent.kill_switch.triggered, agent.kill_switch.reason.

Architects adopt the GenAI conventions where they apply and add agent-specific attributes for what GenAI doesn’t cover. OTel is the emission format; the backend chooses how to visualize and query.
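
A single model-call span can carry both attribute families side by side. A minimal sketch with illustrative values: the `gen_ai.*` names follow the OTel conventions listed above, while the `agent.*` names are the stabilizing-but-unstandardized extensions, and the helper function itself is hypothetical:

```python
def llm_call_attributes(system: str, model: str, input_tokens: int,
                        output_tokens: int, agent_id: str,
                        loop_type: str) -> dict:
    """Attribute set for one model-call span: standard gen_ai.* names
    plus agent.* extensions that the GenAI conventions do not yet cover."""
    return {
        "gen_ai.system": system,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.operation.name": "chat",
        "agent.id": agent_id,
        "agent.loop_type": loop_type,
    }
```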

Backend comparison — five contenders with equal framing

Every backend has strengths; none dominates. Per COMPEL §16.4, at least three are named at equal depth.

  • Langfuse (OSS + commercial). Agent-first design; tree-shaped trace UI; native session-level aggregation; cost tracking built in; self-hostable. Strong open-source story.
  • Arize AI / Phoenix (commercial + OSS Phoenix). Agent traces + evaluation-first design; integrates with model-performance monitoring; strong in research and analytics.
  • Weights & Biases (commercial). Rich experiment tracking plus agent tracing; strong for teams already using W&B for model development.
  • LangSmith (commercial, by LangChain). Tight LangGraph integration; rich trace UI; production deployments. LangSmith named alongside competitors per §16.4.
  • Humanloop / Helicone / Datadog LLM Observability / New Relic AI Monitoring — commercial entrants bringing LLM observability into existing APM or product-analytics tooling.
  • OpenTelemetry + Grafana/Jaeger/Tempo — the build-your-own path; emit OTel; store in any OTel-compatible backend; visualize with off-the-shelf tools plus custom dashboards.

Architects typically pick one commercial product for developer-facing trace exploration and also emit OTel to the org’s standard APM so ops teams have their classical tools plus agent insight.
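
The dual-emission pattern, one stream to the developer-facing backend and one to the org's APM, amounts to a fan-out over span exporters. A stdlib-only sketch of the idea (an OTel deployment would instead register two span processors or fork at the collector):

```python
class FanOutExporter:
    """Deliver every finished span batch to multiple backends, e.g. a
    commercial trace UI plus the organization's standard APM."""

    def __init__(self, *exporters):
        # Each exporter is any callable that accepts a span batch.
        self.exporters = exporters

    def export(self, spans: list) -> None:
        for exporter in self.exporters:
            exporter(spans)
```

The design choice this models: emission is written once at the instrumentation layer, and the number of destinations is a deployment decision, not a code change.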

Replayability — the discipline

Replayability is the property that given the same inputs a trace can be re-executed and produce the same outputs (or — for non-deterministic model calls — the same decisions at the non-model layers). Replayability is the difference between “we saw something weird and can’t explain it” and “we saw something weird, here is the exact reproduction.”

What replayability requires:

Complete input capture. Every input into the agent — user messages, prompts, retrieved passages, memory reads, tool outputs — must be captured byte-exactly (with hashing for large artifacts and full storage for smaller ones).
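
The hash-small-inline / hash-large-archive split can be sketched directly. The 64 KiB threshold is illustrative, not from the article, and the archive step for large payloads is omitted:

```python
import hashlib

FULL_STORE_LIMIT = 64 * 1024  # illustrative cutoff for inline storage

def capture_input(payload: bytes) -> dict:
    """Byte-exact capture: small inputs stored inline with their hash;
    large inputs hashed, with the bytes archived separately (not shown)."""
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    if len(payload) <= FULL_STORE_LIMIT:
        return {"hash": digest, "inline": payload.decode("utf-8", "replace")}
    return {"hash": digest, "inline": None}
```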

Version pinning. Model ID + version, prompt template ID + version, tool version, policy version, memory snapshot ID. The replay environment reconstructs the exact versions.

Deterministic seeding where possible. Model calls with temperature=0 and fixed random seeds reduce variability. When non-determinism is inherent, replay still captures what happened even if it cannot re-run.

State-graph or state-machine mode. Loop patterns like state-graph (Article 4) replay cleanly because state transitions are explicit. ReAct and conversation-buffer are harder to replay because context evolves implicitly; capturing the full context at each step addresses this.

The architect’s replay spec: “given a session ID, we can reconstruct the exact input, tool, memory, policy, and model versions; we can re-execute the non-model layers deterministically; we can inspect the model calls with their original tokens and the actual completion received.” Teams that can say this pass regulatory evidence requirements for agentic systems; teams that cannot, do not.
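
The replay spec above can be condensed into a manifest of pinned versions plus a feasibility check. The field names track the four prerequisites (inputs, versions, determinism, state capture); the class and `can_replay` helper are a sketch, assuming version identifiers are plain strings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplayManifest:
    """Everything needed to re-execute a session's non-model layers."""
    session_id: str
    model_id: str
    model_version: str
    prompt_template_id: str
    prompt_template_version: str
    tool_versions: dict        # tool name -> pinned version
    policy_version: str
    memory_snapshot_id: str
    input_hashes: tuple        # byte-exact input captures (hashed)

def can_replay(manifest: ReplayManifest, available: set) -> bool:
    """Replay is possible only if every pinned version/snapshot can be
    reconstructed from what the replay environment has available."""
    needed = {manifest.model_version, manifest.prompt_template_version,
              manifest.policy_version, manifest.memory_snapshot_id,
              *manifest.tool_versions.values()}
    return needed <= available
```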

The trace as audit evidence

EU AI Act Article 12 (Article 23) requires automated logging of events relevant to the identification of risks and modifications of AI systems. Agentic traces are the implementation of Article 12 for agentic systems. The architect specifies:

  • Retention: minimum 6 months for Annex III high-risk; longer per industry regulation (7 years for financial services).
  • Integrity: append-only, signed where practical, backed by object storage with versioning.
  • Access controls: who can read traces (ops + security + compliance; limited audit of user-identifiable content).
  • Sampling policy: full trace capture for high-risk actions; sampled for high-volume low-risk.
  • Privacy: PII redaction patterns before trace storage; conforming to data-residency requirements (Article 28).
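
The risk-tiered sampling policy can be expressed as a small decision function. The rates below are illustrative placeholders, not values from the article:

```python
def sampling_decision(risk_class: str, sample_rates: dict = None) -> float:
    """Trace-capture probability by action risk class: high-risk actions
    always get full traces; high-volume low-risk traffic is sampled."""
    rates = sample_rates or {"high": 1.0, "medium": 0.25, "low": 0.05}
    # Unknown risk classes default to full capture, the safe direction.
    return rates.get(risk_class, 1.0)
```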

Five observability anti-patterns

Anti-pattern 1 — LLM output logging only. Capturing the model’s output without its input, without the tool interactions, without the reasoning, and calling it observability. This is not debuggable and not auditable.

Anti-pattern 2 — Over-capture that can’t be queried. Storing every token of every call in unstructured text, with no index. Storage is full; nobody can find anything.

Anti-pattern 3 — Proprietary-format capture. All traces in a vendor’s format with no OTel emission. Vendor lock-in plus loss-of-access if the vendor discontinues the product.

Anti-pattern 4 — No sampling strategy. Full traces for every production call regardless of risk; cost explodes; teams reduce coverage by turning observability off at the worst time.

Anti-pattern 5 — Trace and evaluation pipelines disconnected. Evaluation runs synthetic batches; production traces are logged but never fed back into evaluation. Real-world behavior drifts; team doesn’t know.

Framework parity — observability hooks

  • LangGraph — native OTel emission; LangSmith deep integration; Langfuse/Arize work natively.
  • CrewAI — callback manager for span emission; Langfuse and OTel integrations documented.
  • AutoGen — v0.4 OTel support; detailed message logs native.
  • OpenAI Agents SDK — native tracing exported to OpenAI platform or external OTel.
  • Semantic Kernel — OTel-first in recent versions; Azure Monitor integration.
  • LlamaIndex Agents — callback manager with multiple backend integrations.

Platform strategy: every framework emits OTel; traces flow to the platform’s OTel collector; the collector forks to the chosen backend(s) plus the organization’s APM. Framework-native viewers are accepted as developer tools but not the source of truth.

Real-world anchor — Langfuse agent-observability integrations

Langfuse’s documentation (langfuse.com/docs) provides detailed integration examples for LangGraph, LangChain, LlamaIndex, OpenAI Agents SDK, and CrewAI. The agent-specific UI — a tree view of reasoning and tool calls with cost and latency annotations — is a strong example of what agent-observability should look like. Source: langfuse.com.

Real-world anchor — Arize Phoenix for agent observability

Arize’s open-source Phoenix project (phoenix.arize.com) supports agent trace visualization, evaluation integration, and anomaly detection. Phoenix’s pairing of traces with evaluation results directly supports the evaluation-to-observability loop described in Article 17. Source: phoenix.arize.com.

Real-world anchor — LangSmith production discussions

LangSmith’s public case studies and community discussions illustrate LangGraph observability at production scale — particularly for teams running checkpointed state-graph agents with HITL interrupts. LangSmith is named alongside its competitors here, per the credential’s technology-neutrality commitment. Source: smith.langchain.com.

Closing

Six data types, the GenAI attribute conventions plus six groups of agent-specific attributes, five backends at equal framing, five anti-patterns. Observability is the surface that makes every other control inspectable. Article 16 now takes up operational resilience — what the observability data drives when things go wrong.

Learning outcomes check

  • Explain six agentic-observability data types (session, reasoning, tool, memory, human, policy) with their span structures.
  • Classify five backends by layer coverage, cost, and vendor-lock profile.
  • Evaluate a deployment for replayability against the four prerequisites (inputs, versions, determinism, state capture).
  • Design an observability spec for a given agentic platform including OTel emission, backend choice, retention, sampling, and replay procedure.

Cross-reference map

  • Core Stream: EATE-Level-2/M2.3-Art12-Observability-for-Agentic-Workloads.md.
  • Sibling credential: AITM-OMR Article 7 (ops-management angle on agent observability).
  • Forward reference: Articles 17 (evaluation), 18 (SLO/SLI), 23 (EU AI Act Article 12), 25 (incident response).