AITM M1.6-Art10 v1.0 Reviewed 2026-04-06 Open Access
M1.6 People, Change, and Organizational Readiness
AITF · Foundations

Agent Observability and Audit


9 min read Article 10 of 18

COMPEL Specialization — AITM-AAG: Agentic AI Governance Associate Article 10 of 14


Definition. Agent observability is the combined infrastructure — instrumentation, collection, storage, search, visualisation, and alerting — that makes an agent’s behaviour legible after the fact. Agent audit is the subset of observability concerned with evidentiary-quality records that survive regulatory, legal, and incident-response scrutiny. The two overlap but are not identical: not every observability signal is audit-quality, and not every audit record is useful for real-time debugging.

The principle is simple and its implementation is not. An agent that ran for six hours overnight, made forty tool calls, wrote twelve entries into a memory store, and produced an output the operator now needs to explain — that agent needs to have emitted, in real time, the records that allow reconstruction of what it did and why. If the records were not emitted at runtime, they do not exist.

The observability layers

Five observability layers apply to agents. Each has a distinct purpose and a distinct consumer.

| Layer | What it captures | Primary consumer |
| --- | --- | --- |
| Trace | Per-request execution path | Engineering (debugging) |
| Tool-call log | Every tool invocation and result | Governance (audit) |
| Memory-delta | Writes to persistent and shared memory | Governance + security |
| Decision-point snapshot | Inputs and outputs at significant decisions | Regulator / auditor |
| Audit record | Immutable, tamper-evident subset | Legal / regulator |

Trace

A trace captures the execution path of a single request, with spans for each step. Traces are high-volume and short-lived. They are the workhorse of engineering debugging.

Tools in widespread use include LangSmith (from the LangChain project), Langfuse (open source), Humanloop, Arize Phoenix (open source), and general-purpose observability suites from Datadog, New Relic, or Elastic with LLM-specific extensions. Sources: https://smith.langchain.com/ ; https://langfuse.com/ ; https://humanloop.com/ ; https://phoenix.arize.com/.

The governance analyst does not prescribe a vendor. The analyst does require that the trace contain certain fields — agent identity, session identifier, tool calls, model identity and version, token counts, latency — and that the retention period is the shorter of (a) what the organisation's debugging needs require and (b) what its data-retention policy permits.
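That field requirement can be enforced mechanically. A minimal sketch, assuming illustrative field names (the exact keys will depend on the chosen tracing tool), of a check that rejects spans missing the governance-required fields:

```python
# Governance-required trace fields. Names are illustrative, not a vendor schema.
REQUIRED_TRACE_FIELDS = {
    "agent_id", "session_id", "tool_calls",
    "model_id", "model_version", "token_count", "latency_ms",
}

def missing_trace_fields(span: dict) -> set:
    """Return the governance-required fields absent from a trace span."""
    return REQUIRED_TRACE_FIELDS - span.keys()
```

A span that returns a non-empty set here should fail ingestion review, whatever vendor emits it.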

Tool-call log

A tool-call log captures every tool invocation: sender agent identity, tool identifier, parameters, result summary, timestamp, session identifier, outcome (success / failure / rate-limited / denied). The log is a governance asset, not an engineering asset. It lives longer than a trace and feeds audit queries of the form “did agent X ever call tool Y with parameters matching Z?”
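The audit query in that last sentence can be sketched directly. The record fields follow the prose; the in-memory list and function name are illustrative, since a real log lives in a durable, queryable store:

```python
def matching_calls(log, agent_id, tool_id, param_filter):
    """Answer "did agent X ever call tool Y with parameters matching Z?":
    return records from `agent_id` to `tool_id` whose parameters contain
    every key/value pair in `param_filter`."""
    return [
        rec for rec in log
        if rec["agent_id"] == agent_id
        and rec["tool_id"] == tool_id
        and all(rec["params"].get(k) == v for k, v in param_filter.items())
    ]
```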

Memory-delta log

Every write to persistent memory and every significant read is captured. The log enables poisoning detection (Article 7) and memory-growth auditing. For shared memory that multiple agents use, the delta log is what an incident responder uses to identify which agent wrote the poisoned entry and when.
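The attribution question — which agent wrote the poisoned entry, and when — is a simple query over the delta log. A sketch, with assumed record keys (`op`, `key`, `agent_id`, `ts`):

```python
def writers_of(delta_log, memory_key):
    """Return (agent_id, timestamp) for every write to `memory_key`,
    oldest first — the incident responder's attribution query."""
    writes = [r for r in delta_log if r["op"] == "write" and r["key"] == memory_key]
    return sorted(((r["agent_id"], r["ts"]) for r in writes), key=lambda p: p[1])
```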

Decision-point snapshot

At significant decisions — a plan commitment, a tool allow-list check that blocked or permitted an action, an oversight-intervention event — the agent emits a structured snapshot. The snapshot is designed to be readable by a non-engineer reviewing the agent’s reasoning after the fact. Regulators reviewing a high-risk deployment will expect to see decision-point snapshots; an engineering team that emits only traces will be unable to answer a regulator’s “why did the agent do X?” question legibly.
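What such a snapshot might look like on the wire: a minimal sketch with assumed key names — the point is structured inputs, the alternatives considered, and a plain-language rationale captured at the moment of decision, not this particular layout:

```python
import json

def decision_snapshot(decision_type, inputs, options, chosen, rationale):
    """Emit a decision-point snapshot readable by a non-engineer:
    what was decided, from what inputs, among which options, and why."""
    return json.dumps({
        "event_type": "decision_snapshot",
        "decision_type": decision_type,
        "inputs": inputs,
        "options_considered": options,
        "chosen": chosen,
        "rationale": rationale,
    }, indent=2)
```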

Audit record

A subset of the above, promoted to audit quality. Audit records are append-only, cryptographically signed or chained (so post-hoc modification is detectable), and retained on the regulatory-required horizon. They are the document of record for legal proceedings and supervisory reviews.
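The chaining mechanism is worth seeing concretely. A minimal sketch of an append-only hash chain (production systems would add signing keys and durable storage; this shows only why post-hoc modification is detectable):

```python
import hashlib
import json

def chain_append(chain, record):
    """Append `record` to a hash chain: each entry stores the digest of the
    previous entry, so any later edit to an earlier record breaks the chain."""
    prev = chain[-1]["digest"] if chain else "genesis"
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    chain.append({"record": record, "prev": prev, "digest": digest})
    return chain

def chain_valid(chain):
    """Recompute every link; False if any record was altered after append."""
    prev = "genesis"
    for entry in chain:
        body = json.dumps(entry["record"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["digest"] != expected:
            return False
        prev = entry["digest"]
    return True
```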

What every audit record must contain

Consistency across agents is critical. When a regulator asks for “all agent activity related to incident I,” the answer should not require translation between three different log formats. A minimum audit-record schema:

  • timestamp — wall-clock and monotonic.
  • agent_id — the stable identifier of the agent.
  • agent_version — the deployed version (code + model + prompt + tool config).
  • session_id — the end-to-end session this record belongs to.
  • correlation_id — a cross-cutting identifier that links related records across services.
  • event_type — a controlled vocabulary (tool_call, memory_write, decision_snapshot, oversight_intervention, error, shutdown, etc.).
  • event_payload — structured per event type.
  • operator_id — when a human was involved, who (by role, not by person, in the public artifact).
  • outcome — success / failure / denied / timeout / unknown.
  • integrity — signature or chain element.

The schema is implementation-neutral. Whether emitted from a LangGraph, CrewAI, AutoGen, OpenAI Agents SDK, LlamaIndex Agents, or hand-rolled agent on any model (OpenAI, Anthropic, Google, Llama, Mistral), the schema is the same. The observability vendor is chosen for its capability to ingest and retain to the schema; it is not chosen first and the schema retrofitted around it.

Integration with enterprise SIEM / SOAR

Agent observability is not a standalone discipline. The security operations team runs a SIEM (security information and event management) and, often, a SOAR (security orchestration, automation, and response) platform. Agent signals should flow into these tools.

Three integrations are standard:

  • SIEM ingestion. Audit records are forwarded to the SIEM for correlation with broader security events. An unauthorised tool call may be, in the SIEM’s view, part of a lateral-movement attempt starting elsewhere.
  • Alert pipelines. Thresholds are set (e.g., “more than N tool calls per minute from a single agent,” “write to a shared memory from an unexpected identity”) and alerts fire to the security operations centre.
  • Response automation. SOAR playbooks include agent-specific steps: pause an agent, revoke a token, snapshot an audit window, notify the agent owner.
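The threshold example from the alert pipeline can be sketched as a batch rule over the tool-call log (a real SIEM evaluates this streaming; `ts_epoch` and the one-minute bucketing are assumptions for illustration):

```python
from collections import Counter

def rate_alerts(tool_call_log, limit_per_minute):
    """Flag agents exceeding `limit_per_minute` tool calls in any
    one-minute bucket — the 'more than N calls per minute' threshold."""
    buckets = Counter((r["agent_id"], r["ts_epoch"] // 60) for r in tool_call_log)
    return sorted({agent for (agent, _), n in buckets.items() if n > limit_per_minute})
```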

The integration is bidirectional. Security events from the broader environment may affect agent decisions (a detected compromise on a user account should pause agents acting on that account’s behalf), and agent events feed broader security monitoring.

The audit artifact — the incident-responder perspective

When an incident occurs, the incident responder needs to answer four questions quickly:

  1. What did the agent do?
  2. Why did it do it?
  3. What did it change?
  4. Who else was involved (other agents, users, tools, external systems)?

The audit stack must answer each. Question 1 is served by the tool-call log and decision-point snapshots. Question 2 is served by the decision-point snapshots with enough context to reconstruct the reasoning. Question 3 is served by the memory-delta log and the audit of any downstream systems touched. Question 4 is served by the correlation identifiers that tie records together.
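The correlation-identifier mechanism behind question 4 can be sketched as a grouping query — the responder's first cut at assembling everything tied to one incident (record shape follows the audit schema; the function name is illustrative):

```python
def incident_bundle(records, correlation_id):
    """Group audit records sharing `correlation_id` by event type, giving the
    responder tool calls, memory writes, and snapshots for one incident."""
    bundle = {}
    for r in records:
        if r["correlation_id"] == correlation_id:
            bundle.setdefault(r["event_type"], []).append(r)
    return bundle
```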

The incident responder is the ultimate consumer. If the responder cannot answer the four questions from the records, the observability design has failed regardless of how pretty the traces are.

Two real-world anchors for observability

The open-source tracing stack — Arize Phoenix as a named example

Arize Phoenix, released as open source in 2023 and extended through 2024, provides agent-specific tracing primitives that many teams use as their first-level observability. Source: https://phoenix.arize.com/. Phoenix is one of several options — LangSmith, Langfuse, Humanloop, and general-purpose APM tools are others — and the governance analyst’s job is to ensure the chosen tool supports the audit-record schema above, not to pick winners.

LangSmith, Langfuse, Humanloop — vendor parity

LangSmith (LangChain’s managed offering), Langfuse (open source, also offered managed), and Humanloop provide production-grade trace and evaluation capabilities. Each has rough feature parity for the governance analyst’s purposes: trace capture, filtering, annotation, and integration hooks. An organisation already using one should not switch for marginal feature differences; an organisation choosing for the first time should weight schema support, retention controls, access controls, and export options. Sources: https://smith.langchain.com/ ; https://langfuse.com/ ; https://humanloop.com/.

Observability across the autonomy levels

The observability burden tracks the autonomy level from Article 3.

| Level | Observability requirement |
| --- | --- |
| 0–1 | Prompt/response log with retention matching policy |
| 2 | Add tool-call log |
| 3 | Add decision-point snapshots and memory-delta log |
| 4 | Full audit record stream to SIEM; dedicated dashboard |
| 5 | All of the above plus frontier-capability monitoring per operator-side frameworks |

An agent moving up the autonomy spectrum must have its observability upgraded before the autonomy change takes effect, not after. Retroactive observability does not exist.
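That "before, not after" rule is a gate, not a guideline, and can be checked in the promotion pipeline. A sketch with assumed layer names drawn from the table above:

```python
# Cumulative observability layers required at each autonomy level (per the
# table above); the layer-name strings are illustrative.
REQUIRED_LAYERS = {
    0: {"prompt_response_log"},
    1: {"prompt_response_log"},
    2: {"prompt_response_log", "tool_call_log"},
    3: {"prompt_response_log", "tool_call_log",
        "decision_snapshots", "memory_delta_log"},
    4: {"prompt_response_log", "tool_call_log",
        "decision_snapshots", "memory_delta_log", "siem_audit_stream"},
    5: {"prompt_response_log", "tool_call_log", "decision_snapshots",
        "memory_delta_log", "siem_audit_stream", "frontier_monitoring"},
}

def may_raise_autonomy(current_layers, target_level):
    """True only if every layer the target level requires is already live —
    observability is upgraded before the autonomy change, never after."""
    return REQUIRED_LAYERS[target_level] <= set(current_layers)
```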

The audit retention horizon

Retention horizons differ by record type. The table below is indicative; the specialist confirms with data-protection and sector-specific obligations.

| Record type | Typical retention |
| --- | --- |
| Trace | 30–90 days |
| Tool-call log | 1–7 years (sector-dependent) |
| Memory-delta log | Match the underlying memory retention |
| Decision-point snapshot | 7+ years for high-risk systems |
| Audit record | Regulatory horizon (often 10 years) |

Retention design is a data-protection decision (GDPR storage limitation, sector rules) plus an evidentiary decision (how long might we need this for litigation?). When the two horizons conflict, the organisation's counsel resolves the trade-off deliberately.

Learning outcomes — confirm

A specialist who completes this article should be able to:

  • Name the five observability layers and their consumers.
  • Specify a minimum audit-record schema that survives vendor changes.
  • Design SIEM integration for a described agent.
  • Evaluate an audit trail for completeness against incident-responder needs.

Cross-references

  • EATF-Level-1/M1.5-Art03-Building-an-AI-Governance-Framework.md — Core article on building a governance framework.
  • Article 7 of this credential — memory governance (feeds memory-delta log).
  • Article 11 of this credential — kill-switch and incident response.
  • Article 14 of this credential — Agent Governance Pack (observability is a pack section).

Diagrams

  • HubSpokeDiagram — observability hub with trace, tool-log, memory-delta, decision-snapshot, audit, alert spokes.
  • StageGateFlow — audit-record flow: capture → retain → query → present to audit.

Quality rubric — self-assessment

| Dimension | Self-score (of 10) |
| --- | --- |
| Technical accuracy (schema and layer descriptions sound) | 10 |
| Technology neutrality (LangSmith, Langfuse, Humanloop, Arize Phoenix, Datadog, New Relic, Elastic all named) | 10 |
| Real-world examples ≥2 (multiple vendor tools) | 10 |
| AI-fingerprint patterns | 9 |
| Cross-reference fidelity | 10 |
| Word count (target 2,500 ± 10%) | 10 |
| Weighted total | 92 / 100 |