AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 13 of 35
The first AI incident in a new deployment almost always starts the same way. A user reports an output that was wrong or unsafe. The team tries to reconstruct what the system did. They have server logs that say the API returned 200 OK. They have the user’s screenshot of the output. They do not have the prompt the model received, the retrieval results it was grounded on, the tool calls it emitted, or the latency profile of the call. The reconstruction fails because observability was not designed in. The second time this happens, the team adds prompt logging. The third time, they add retrieval capture. By the fifth incident, they have observability — but they got there reactively, paying in unexplained incidents for instrumentation that could have been designed in from day one. This article gives the architect the minimum observability specification so that the first incident can be explained, not just witnessed.
The three data types
An AI system emits three telemetry data types, and each maps to a different question the architect needs to answer.
Traces answer where time was spent and where failures occurred. A trace follows a single user request across every span it touches: the gateway, the orchestration layer, the retrieval calls, the model call, the tool calls, the post-processing, the response rendering. Each span records its start and end time, its status, its parent span, and annotations the developer attached. Traces are the same data type already collected for microservice architectures; AI systems add AI-specific span kinds (model call, retrieval call, tool call, guardrail check) and AI-specific attributes (model version, prompt hash, token counts).
Prompt and response capture answers what the model saw and what it produced. Every prompt sent to the model and every completion returned is stored with enough identifying metadata to link it to the trace it was part of. Prompt capture is the single highest-value piece of AI observability and the single most likely to be overlooked because it feels like something the provider should already be doing. The provider is not doing it. The architect is.
Evaluation signals answer how good the output was. Evaluation signals come from the harness (Article 11), from online LLM-as-judge runs (Article 12), from user feedback (thumbs up/down, escalations, follow-up queries that suggest dissatisfaction), and from downstream outcome metrics (did the task complete, did the user convert, did the ticket need escalation). The eval signal is joined to the trace and to the prompt capture so that a regression in aggregate quality can be drilled down into specific outputs.
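To make that join concrete, here is a minimal sketch of an evaluation-signal record keyed by trace and capture identifiers. The schema and field names are illustrative assumptions, not a standard; the point is that every signal carries the identifiers needed to drill from an aggregate regression down to specific outputs.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvalSignal:
    # Illustrative schema; field names are assumptions, not a standard.
    trace_id: str    # joins to the trace in the observability backend
    capture_id: str  # joins to the stored prompt/response pair
    source: str      # "harness" | "llm_judge" | "user_feedback" | "outcome"
    metric: str      # e.g. "groundedness", "thumbs", "task_completed"
    value: float     # normalized score in [0, 1]
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

signal = EvalSignal(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    capture_id="cap-2031",
    source="llm_judge",
    metric="groundedness",
    value=0.82,
)
```

With both identifiers on every signal, a dip in the judge-score average is one query away from the exact prompts and traces that caused it.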
[DIAGRAM: HubSpokeDiagram — aite-sat-article-13-observability-hub-spoke — Hub labelled “LLM service” with spokes radiating out to six observability data types. Spoke 1: “Trace (request path, spans, latency)”. Spoke 2: “Prompt/response capture (inputs and outputs, redacted as needed)”. Spoke 3: “Evaluation signal (judge scores, human review, outcome metrics)”. Spoke 4: “Cost telemetry (per-call tokens, per-tenant aggregate, per-model routing)”. Spoke 5: “Drift signal (distribution of inputs and outputs over time, anomaly flags)”. Spoke 6: “Safety signal (toxicity, injection detections, guardrail trips)”. Each spoke annotated with typical tool coverage and retention expectation.]
What OpenTelemetry for LLMs looks like
OpenTelemetry, the vendor-neutral standard for traces, metrics, and logs, has an active semantic-convention workstream for AI systems.1 The convention defines span kinds (LLM call, embedding call, tool call, retrieval call) and attributes (model name, model provider, operation type, prompt template identifier, input and output token counts, finish reason, safety-annotation status) that instrumented applications emit. An AI observability tool that claims vendor neutrality supports the OpenTelemetry convention; an application instrumented to the convention can switch observability backends without re-instrumenting.
The architect instruments the application to OpenTelemetry’s AI conventions from the start, even if the first backend is a vendor-specific tool. The one-time cost of standardized instrumentation is trivial compared to the later cost of re-instrumenting when the observability vendor changes.
Prompt and response capture with privacy
Prompt capture is useful only if it includes enough context to be actionable and controlled enough to respect privacy and residency constraints. The architect designs the capture with four controls.
Content redaction. Personally identifiable information, regulated health information, financial account numbers, and similar high-sensitivity content are redacted from captured prompts and responses either at emission time or during ingestion into the observability store. Redaction is configurable per data class and per jurisdiction.
Selective sampling. Not every prompt needs full capture. A sampling policy (typically 1–10% random plus 100% of error cases plus 100% of cases flagged by safety evaluators) keeps the capture volume manageable and the cost under control.
Retention and access control. Captured prompts and responses are governed by a retention policy — often 30–90 days for routine captures, longer for incident-related or audit-flagged captures. Access is role-based; not every developer gets to see raw captures, and access events are themselves logged.
Tenant and residency scoping. A multi-tenant system’s captures are tenant-scoped. A system with residency requirements stores captures within the jurisdiction the data originated in. The capture store is itself a data system, subject to the same governance as the primary data store.
A capture architecture that handles these four controls is not a log; it is a regulated data asset. The architect treats it accordingly.
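A minimal sketch of the capture-policy decision, combining redaction, sampling, retention assignment, and tenant/residency scoping in one place. The PII patterns, sampling rate, and retention values are illustrative assumptions; a production redactor uses a vetted library and per-jurisdiction rule sets, not two regexes.

```python
import random
import re

# Illustrative PII patterns only; real redaction is per data class
# and per jurisdiction, via a vetted redaction library.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
CARD = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return CARD.sub("[CARD]", text)

def should_capture(is_error: bool, safety_flagged: bool,
                   sample_rate: float = 0.05, rng=random.random) -> bool:
    # 100% of errors and safety-flagged calls; random sample otherwise.
    if is_error or safety_flagged:
        return True
    return rng() < sample_rate

def capture(prompt: str, response: str, tenant_id: str, region: str,
            is_error: bool = False, safety_flagged: bool = False):
    if not should_capture(is_error, safety_flagged):
        return None
    return {                   # written to a tenant- and region-scoped
        "tenant_id": tenant_id,  # store with role-based, logged access
        "region": region,        # residency: store must live in-region
        "prompt": redact(prompt),
        "response": redact(response),
        "retention_days": 90 if (is_error or safety_flagged) else 30,
    }
```

The injectable `rng` makes the sampling decision testable; the access-control and access-logging half of the fourth control lives in the capture store itself, not in this emission path.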
The five layers of AI observability
Different tools cover different layers. The architect maps the tools onto a layer stack to see what is covered and what is not.
Layer 1 — Infrastructure and runtime. CPU, GPU, memory, network, container orchestration. Covered by the team’s existing observability stack (Prometheus + Grafana, Datadog, New Relic, Dynatrace, Splunk). AI-specific additions are GPU-utilization metrics, VRAM consumption, and serving-engine metrics (vLLM’s /metrics endpoint, TGI’s built-in metrics). These metrics matter when the team self-hosts inference; they are largely invisible when the team uses managed APIs.
Layer 2 — Application traces. Cross-service traces showing the request path. Covered by OpenTelemetry backends (Jaeger, Tempo, Honeycomb, Datadog APM, New Relic APM). AI-specific additions are LLM and retrieval span kinds per the OpenTelemetry semantic conventions.
Layer 3 — AI-specific traces and prompt capture. The LLM call, the prompt, the response, the retrieval, the tool invocation. Covered by AI-native observability tools (Langfuse, Arize Phoenix, LangSmith, Weights & Biases Weave, OpenLLMetry, Helicone, Portkey).2 The AI-native tools overlap partly with Layer 2; the distinction is the semantic awareness of LLM-specific concepts (token counts, cost per call, prompt-template identifiers, judge scores).
Layer 4 — Evaluation and quality signals. Judge scores, rubric results, human review outcomes, outcome metrics. Covered by the evaluation harness from Article 11 wired into the observability backend so that quality and system-level data share a dashboard.
Layer 5 — Business and outcome metrics. Task completion, user satisfaction, revenue impact. Covered by the product analytics stack (Amplitude, Mixpanel, GA4, internal dashboards). The architect joins these to the AI signals so that a quality dip is visible as a business outcome dip and vice versa.
A team without Layer 3 has AI observability debt that builds up incident by incident. A team without Layer 4 is flying blind on quality regressions. A team without Layer 5 can win on rubric metrics while losing on business outcomes.
[DIAGRAM: TimelineDiagram — aite-sat-article-13-request-trace-timeline — Horizontal timeline showing a single user request traversing the stack. Left to right: “Client span (browser rendering + API call)” → “Gateway span (auth, routing, rate limit)” → “Orchestration span (prompt assembly, retrieval orchestration)” → “Retrieval span (vector query, reranker)” → “Model span (prompt send, streaming decode, finish)” → “Tool span (if a tool was called, nested under orchestration)” → “Post-processing span (parse, validate, format)” → “Response span (stream to client)”. Each span annotated with start time, duration, and the AI-specific attributes captured (model name, token counts, finish reason).]
Cost observability
Observability costs money: full prompt and response capture at 100% of traffic on a high-volume system quickly becomes a material line item. Three cost-control patterns keep observability economics reasonable.
Sampling for high-volume endpoints, full capture for long-tail. High-volume, low-stakes endpoints (search autocomplete, recommendation refreshes) get 1% sampling; low-volume, high-stakes endpoints (regulated advice, financial decisions) get 100% capture. The architect sets sampling per endpoint, not per system.
Compressed retention tiers. Full captures retained for 7 days in hot storage, summarized captures retained for 90 days in warm storage, metadata-only retained for 1 year in cold storage. The drill-down capability decreases with age; the cost decreases with it.
Cost dashboards tied to the observability system itself. The team monitors the cost of observability the way it monitors every other operational cost. A month where observability spend grows faster than traffic is a design problem, not a usage problem.
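The first two patterns can be sketched as configuration plus one tiering function; the endpoint names, sampling rates, and tier boundaries below are illustrative assumptions.

```python
# Per-endpoint capture rates: set by stakes, not uniformly per system.
ENDPOINT_SAMPLE_RATE = {
    "search.autocomplete": 0.01,  # high-volume, low-stakes
    "advice.regulated": 1.00,     # low-volume, high-stakes
}

def storage_tier(age_days: int) -> str:
    # Hot: full captures; warm: summaries; cold: metadata only.
    # Drill-down capability decreases with age, and cost with it.
    if age_days <= 7:
        return "hot"
    if age_days <= 90:
        return "warm"
    if age_days <= 365:
        return "cold"
    return "expired"
```

A scheduled job that moves captures between tiers by age, plus a dashboard on the capture store's own storage and ingest spend, completes the three patterns.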
Alerting and on-call
Observability is the precondition for alerting. The architect defines alert thresholds on the metrics that matter — p95 latency, error rate, judge score average, safety-evaluator trip rate, per-tenant cost anomaly — and wires them into the team’s paging system with the same discipline as non-AI alerts. An AI alert should include enough context in the page for the on-call engineer to diagnose without logging into four systems: the prompt, the response, the trace ID, the current quality trend, the recent deployment history.
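A sketch of one such check and the context-rich page it emits, assuming a p95 latency threshold; the field names, the nearest-rank percentile method, and the paging payload shape are illustrative.

```python
def check_p95_latency(samples_ms: list[float],
                      threshold_ms: float = 2000.0):
    # Nearest-rank p95 over a window of request latencies.
    ordered = sorted(samples_ms)
    p95 = ordered[max(0, int(0.95 * len(ordered)) - 1)]
    return p95, p95 > threshold_ms

def build_page(metric: str, value: float, threshold: float,
               trace_id: str, capture_id: str,
               quality_trend: list[float], last_deploys: list[str]) -> dict:
    # Everything the on-call engineer needs in one payload, so that
    # diagnosis does not require logging into four systems.
    return {
        "summary": f"{metric} {value:.2f} breached threshold {threshold:.2f}",
        "trace_id": trace_id,            # deep link into the trace backend
        "capture_id": capture_id,        # the offending prompt/response pair
        "quality_trend": quality_trend,  # recent judge-score averages
        "recent_deploys": last_deploys,  # correlate with rollouts
    }
```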
Article 20 develops SLO/SLI and incident response in depth; observability is the data substrate those processes run on. An SLO that cannot be measured is an aspiration, not a target.
Two real-world examples
Langfuse. Langfuse is an open-source LLM observability platform that implements prompts, traces, evals, datasets, and experiments as first-class entities with an OpenTelemetry-compatible SDK.3 It is the reference open-source stack for teams that want self-hostable AI observability with full capture and evaluation integration. The architectural point for the AITE-SAT learner is that LLM observability is not dependent on commercial vendors; a capable open-source tool exists and is deployable inside the organization’s own boundary with Postgres and ClickHouse behind it.
Arize Phoenix. Phoenix, also open-source (Apache 2.0), is oriented toward tracing and evaluation integrated with the broader Arize platform for model observability.4 Phoenix ships with OpenTelemetry instrumentation for the major orchestration frameworks (LangChain, LlamaIndex, Haystack, DSPy) and evaluation-pattern notebooks runnable out of the box. The architectural point is that Phoenix covers the same Layer 3 scope as Langfuse with a different philosophy — closer integration with evaluation workflows, an ML-observability heritage from Arize’s broader product, and a different storage model.
Weights & Biases Weave. Weave is W&B’s commercial AI observability product for LLM tracing and evaluation.5 It integrates with the broader W&B experiment-tracking platform, which gives teams already using W&B for classical-ML experiments a continuous path into LLM observability. The architectural point is that the mature ML-ops vendors have extended their products into LLM observability; a team that has already invested in one is likely to prefer its extension over adopting a new AI-native vendor.
Drift detection
An AI system in production is a system whose input distribution is changing. Users ask new questions, the corpus receives new documents, and the model provider ships silent capability updates on its side of the managed-API boundary. Drift detection is the observability capability that surfaces these shifts before they manifest as user-visible quality regressions.
Input drift compares the distribution of queries (query length, query category, query language, time-of-day pattern) across time windows and alerts when the distribution shifts materially. Output drift compares the distribution of outputs (response length, finish-reason distribution, tool-call rate, safety-flag rate) similarly. Quality drift compares the evaluation-signal distribution against baseline; a downward shift in the judge-score distribution over two weeks flags a regression even if no single output was clearly wrong. Feature-level drift (a specific tenant’s query distribution, a specific model version’s output distribution) surfaces sub-population problems that aggregate drift misses.
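The window-over-window comparison can be sketched with a Population Stability Index over one numeric feature, such as query length. The binning scheme and the conventional thresholds (below 0.1 stable, above 0.25 material drift) are rules of thumb, not a standard.

```python
import math

def psi(baseline: list[float], current: list[float], bins: int = 10) -> float:
    """Population Stability Index between two windows of one numeric
    feature. Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 material drift worth an alert."""
    lo = min(min(baseline), min(current))
    hi = max(max(baseline), max(current))
    width = (hi - lo) / bins or 1.0  # degenerate case: all values equal

    def dist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        n = len(xs)
        # Floor avoids log(0); probabilities are then only approximate.
        return [max(c / n, 1e-6) for c in counts]

    p, q = dist(baseline), dist(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Run per feature and per sub-population (per tenant, per model version) on a schedule, with the baseline window refreshed periodically so that accepted long-term change does not alert forever.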
Open-source drift-detection libraries (Evidently, Alibi Detect, Arize Phoenix’s drift module) operate on top of the observability data already collected; the architect wires them to the observability backbone rather than setting up a parallel data pipeline.6
Regulatory alignment
EU AI Act Article 12 on record-keeping requires that high-risk systems keep automatically generated logs sufficient to trace the system’s behavior across its lifecycle.7 Observability at Layers 2, 3, and 4 is the architecture that satisfies Article 12. Article 15 on accuracy, robustness, and cybersecurity expects measurable performance characteristics; observability is the measurement. ISO/IEC 42001 Clause 9.1 on monitoring and measurement explicitly expects continuous monitoring of AI systems; the observability stack is what the management system monitors. GDPR applies to captured prompts and responses when they contain personal data, which is why the redaction and access-control discipline above is non-optional.
Summary
Observability for an AI system is three data types — traces, prompt and response capture, and evaluation signals — joined to the same backbone that monitors the rest of the stack. OpenTelemetry AI semantic conventions let the architect instrument once and switch backends without re-instrumentation. Prompt capture is a regulated data asset requiring redaction, sampling, retention, and access control. The five-layer observability model — infrastructure, application traces, AI-specific traces, evaluation signals, outcome metrics — is how the architect sees what is covered. Cost control on observability is itself an observability problem. Open-source Langfuse and Phoenix are reference stacks; Weights & Biases Weave is a commercial reference. Alerting and on-call run on the observability backbone. Regulatory alignment with EU AI Act Articles 12 and 15 and ISO 42001 Clause 9.1 depends on observability being designed in rather than added after the first incident.
Further reading in the Core Stream: Observability and Monitoring for AI Systems and Incident Response for AI.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. OpenTelemetry Semantic Conventions for Generative AI. https://opentelemetry.io/docs/specs/semconv/gen-ai/ — accessed 2026-04-20.
2. Langfuse. https://langfuse.com/ — accessed 2026-04-20. OpenLLMetry project (Traceloop). https://www.traceloop.com/openllmetry — accessed 2026-04-20. Helicone. https://www.helicone.ai/ — accessed 2026-04-20. Portkey. https://portkey.ai/ — accessed 2026-04-20. LangSmith. https://docs.smith.langchain.com/ — accessed 2026-04-20.
3. Langfuse project documentation. https://langfuse.com/docs — accessed 2026-04-20. Langfuse GitHub. https://github.com/langfuse/langfuse — accessed 2026-04-20.
4. Arize Phoenix documentation and GitHub. https://github.com/Arize-ai/phoenix — accessed 2026-04-20.
5. Weights & Biases Weave documentation. https://weave-docs.wandb.ai/ — accessed 2026-04-20.
6. Evidently AI. https://docs.evidentlyai.com/ — accessed 2026-04-20. Alibi Detect. https://docs.seldon.io/projects/alibi-detect/ — accessed 2026-04-20. Arize Phoenix drift. https://docs.arize.com/phoenix/evaluation/concepts-evals/drift — accessed 2026-04-20.
7. Regulation (EU) 2024/1689, Articles 12 and 15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clause 9.1. https://www.iso.org/standard/81230.html — accessed 2026-04-20.