AITE M1.2-Art16 v1.0 Reviewed 2026-04-06 Open Access
AITF · Foundations

Operational Resilience for Agents — Failure Modes and Recovery

Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

11 min read Article 16 of 53

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Article 16 of 40


Thesis. A web service fails when it goes down or when it returns wrong answers. An agent fails in those ways too, but it also fails in ways web services cannot: it loops without progress, it reasons into circles, it misinterprets tool output, it carries a corrupted memory into the next session, it cascades refusals through a multi-agent graph until the entire workflow halts. Classical resilience patterns — retries, timeouts, circuit breakers, bulkheads — still apply, but they must be adapted to the agentic context, and new patterns are needed for the agentic-specific failures. This article catalogs eight agent failure modes, specifies three circuit-breaker patterns adapted for agents, and walks the recovery workflow from detection through post-mortem.

Eight agent failure modes

Failure 1 — Infinite loop / no-progress

The agent’s loop iterates without converging. Same thought repeated; same tool called with slight variations; no state change. Budget burns; wall-clock elapses; tool quotas are consumed. This is the Article 4 failure mode for ReAct, but all loop patterns can exhibit it.

Detection: step counter, no-progress detector (hash of recent thoughts/actions), wall-clock watchdog.

Mitigation: max_steps cap, budget cap (Article 9), restart policy, kill-switch escalation.
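
A minimal sketch of the hash-based no-progress detector mentioned above; the class name, window size, and step format are illustrative, not a prescribed API:

```python
import hashlib
from collections import deque

class NoProgressDetector:
    """Flags an agent loop that repeats the same thought/action signature."""

    def __init__(self, window: int = 5):
        self.recent = deque(maxlen=window)  # hashes of the most recent steps

    def record(self, thought: str, action: str) -> bool:
        """Return True if this step repeats a recent one (no progress)."""
        digest = hashlib.sha256(f"{thought}|{action}".encode()).hexdigest()
        repeated = digest in self.recent
        self.recent.append(digest)
        return repeated

detector = NoProgressDetector()
first = detector.record("search flights", "search(q='NYC to SFO')")    # False
looping = detector.record("search flights", "search(q='NYC to SFO')")  # True: repeat
```

In practice the repeat signal would feed the restart policy or kill-switch escalation rather than just returning a boolean.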

Failure 2 — Tool timeout cascade

One slow tool call delays the loop; the loop retries; retries compound. In multi-agent systems, a slow worker delays all agents in a synchronous chain. Classic distributed-systems failure with agentic amplification.

Detection: per-tool latency alerting, cascade-depth monitoring.

Mitigation: per-tool timeouts set lower than the session’s overall budget, asynchronous tool calls where possible, circuit breakers per tool.

Failure 3 — Model refusal cascade

The model refuses a legitimate request (safety filter too strict, prompt looks adversarial). The agent retries with slight variations; each retry burns tokens. Eventually the agent concludes the task is impossible and halts — but the task was possible; the model was misaligned with the operational profile.

Detection: refusal-rate anomaly per agent or per task class, trace analysis of refusal reasons.

Mitigation: refusal-classification diagnostics (is this a safety refusal, an over-cautious refusal, or a legitimate capability limit?), operational-safety prompts that clarify context, fallback to a different model or path, human handoff.

Failure 4 — Context-window exhaustion

Multi-step loops accumulate context. Eventually the context hits the model’s window limit. The model either truncates (losing the beginning, often the system prompt), errors out, or reduces quality. Classical OOM analog, but with silent quality degradation that is harder to detect.

Detection: context-size monitoring per loop iteration, OOM-style alerting at 80% of window.

Mitigation: summarization at configurable thresholds, sliding-window context policies, externalized state (Article 15 replayability stores full history outside the context), model upgrade to larger context if justified.
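
One way to sketch the threshold logic above; the function name and the 70%/80% split between summarization and alerting are illustrative assumptions:

```python
def check_context(used_tokens: int, window_limit: int,
                  alert_at: float = 0.80, summarize_at: float = 0.70) -> str:
    """Classify context pressure: 'ok', 'summarize', or 'alert'.

    summarize_at < alert_at, so summarization fires before the
    OOM-style alert at 80% of the window.
    """
    ratio = used_tokens / window_limit
    if ratio >= alert_at:
        return "alert"        # OOM-style alert: page on-call, stop the loop
    if ratio >= summarize_at:
        return "summarize"    # compress older turns into a summary
    return "ok"

print(check_context(50_000, 128_000))    # well under threshold
print(check_context(95_000, 128_000))    # ~74%: summarize
print(check_context(110_000, 128_000))   # ~86%: alert
```

The check runs once per loop iteration, using the token count the model provider reports for the assembled prompt.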

Failure 5 — Memory corruption

Memory write with bad data (hallucination, poisoning, bug) persists. Future sessions retrieve the bad memory as authoritative. The failure compounds session-over-session until detected.

Detection: memory-integrity checks (provenance present, classification valid), poisoning red-team batteries (Article 17), user-complaint signal correlated to memory contents.

Mitigation: write gating (Article 7), TTL on low-trust memories, rollback to a prior memory snapshot, tombstone corrupted entries with blocked-retrieval flag.
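
Tombstoning as a first-class operation can be sketched as follows; the dict-backed store and field names are illustrative stand-ins for a real memory layer:

```python
from datetime import datetime, timezone

def tombstone(memory_store: dict, entry_id: str, reason: str) -> None:
    """Mark a corrupted entry as blocked without deleting the audit trail."""
    entry = memory_store[entry_id]
    entry["tombstoned"] = True
    entry["retrieval_blocked"] = True
    entry["tombstone_reason"] = reason
    entry["tombstoned_at"] = datetime.now(timezone.utc).isoformat()

def retrieve(memory_store: dict, entry_id: str):
    """Retrieval respects the blocked-retrieval flag."""
    entry = memory_store.get(entry_id)
    if entry is None or entry.get("retrieval_blocked"):
        return None
    return entry["value"]

store = {"m1": {"value": "user prefers email", "tombstoned": False}}
tombstone(store, "m1", "poisoned write flagged by red-team battery")
assert retrieve(store, "m1") is None   # blocked, but entry survives for audit
```

Keeping the entry (rather than deleting it) is what makes the later "audit all sessions that retrieved it" recovery step possible.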

Failure 6 — Partial side effect

An action began but did not complete. Payment initiated but not recorded. Email sent but not logged. Database row inserted but audit trail not written. Agent interprets the state as “failed” and retries, producing double effect.

Detection: post-action reconciliation, idempotency-key collision detection, external-system state polling.

Mitigation: idempotency keys on every side-effecting tool (Article 5), compensation actions (sagas), reconciliation workflows, HITL escalation for ambiguous states.
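
The idempotency-key mechanics can be sketched like this; the key derivation and the ledger class are illustrative, not a specific library's API:

```python
import hashlib

def idempotency_key(session_id: str, step: int, tool: str, args: str) -> str:
    """Deterministic key: the same logical action always maps to the same key."""
    return hashlib.sha256(f"{session_id}:{step}:{tool}:{args}".encode()).hexdigest()

class SideEffectLedger:
    """Records completed side effects so retries can detect a prior attempt."""

    def __init__(self):
        self._seen: dict[str, str] = {}

    def execute_once(self, key: str, action) -> str:
        if key in self._seen:            # key collision: effect already happened
            return self._seen[key]
        result = action()
        self._seen[key] = result
        return result

ledger = SideEffectLedger()
calls = []
key = idempotency_key("sess-42", 3, "send_email", "to=a@example.com")
ledger.execute_once(key, lambda: calls.append("sent") or "msg-001")
ledger.execute_once(key, lambda: calls.append("sent") or "msg-002")  # retry: no double-send
assert calls == ["sent"]
```

A production ledger lives in durable storage shared with the external system, so the key survives agent restarts; the in-memory dict here only illustrates the collision check.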

Failure 7 — Downstream system overload

Agent-driven volume overwhelms downstream — the CRM can’t keep up with the agent’s rate, the payment processor rate-limits, the email server queues. Classical rate-limiting failure, but the agent may not understand backpressure signals.

Detection: downstream latency and error monitoring, agent-caused volume attribution.

Mitigation: rate limits at agent’s tool layer, circuit breaker on downstream errors, backpressure-aware retry policy, queuing with replay.
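
A rate limit at the agent's tool layer is often a token bucket; this is a minimal sketch with illustrative parameters:

```python
import time

class TokenBucket:
    """Token-bucket limiter at the tool layer: caps agent-driven volume."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens refilled per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                      # caller should queue or back off

bucket = TokenBucket(rate=10.0, capacity=2)
results = [bucket.allow() for _ in range(4)]   # burst of 2, then throttled
```

When `allow()` returns False, the tool wrapper translates that into the backpressure-aware retry policy or the queue-with-replay path rather than letting the agent hammer the downstream system.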

Failure 8 — Coordination deadlock or livelock (multi-agent)

Agents wait for each other indefinitely, or pass work back and forth without progress. Article 12 failure modes.

Detection: per-message age, cycle detection in delegation graph, task-age alerting.

Mitigation: timeouts on inter-agent waits, cycle-breaking policy (break after N bounces), escalation to supervisor, fallback to single-agent mode.
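
The break-after-N-bounces policy can be sketched as a counter over undirected agent pairs; class and method names are illustrative:

```python
from collections import Counter

class BounceBreaker:
    """Breaks livelock: if a task bounces between the same pair of agents
    more than max_bounces times, escalate instead of forwarding again."""

    def __init__(self, max_bounces: int = 3):
        self.max_bounces = max_bounces
        self.bounces: Counter = Counter()

    def route(self, task_id: str, sender: str, receiver: str) -> str:
        # frozenset makes A->B and B->A count against the same edge
        edge = (task_id, frozenset((sender, receiver)))
        self.bounces[edge] += 1
        if self.bounces[edge] > self.max_bounces:
            return "escalate_to_supervisor"
        return "forward"

breaker = BounceBreaker(max_bounces=2)
breaker.route("t1", "planner", "coder")    # forward
breaker.route("t1", "coder", "planner")    # forward
verdict = breaker.route("t1", "planner", "coder")  # third bounce: escalate
```

Cycle detection over the full delegation graph generalizes this beyond two-agent ping-pong, but the pairwise counter catches the most common livelock cheaply.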

Three circuit-breaker patterns adapted for agents

Circuit breakers are the classical pattern (Nygard’s “Release It!”, Hystrix lineage) for preventing cascading failures. Agentic systems adapt them at three levels.

Pattern A — Per-tool circuit breaker

Each tool has a breaker tracking error rate and latency. When the breaker opens, the tool returns a cached failure response (“currently unavailable”) without invoking the handler. The agent’s prompt instructs it to interpret circuit-open as “this tool is temporarily unavailable; attempt an alternative or escalate to a human.”

Configuration: failure-rate threshold (e.g., 50% over 60 seconds), recovery attempt window (e.g., try again every 30 seconds), half-open policy (allow one test call before fully closing).
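
A minimal sketch of Pattern A with the configuration above (50% failure rate, 30-second probe window, one half-open test call); the class shape is illustrative rather than a specific library's breaker:

```python
import time

class ToolCircuitBreaker:
    """Per-tool breaker with closed / open / half-open states."""

    def __init__(self, failure_rate=0.5, min_calls=4, retry_after=30.0):
        self.failure_rate, self.min_calls, self.retry_after = failure_rate, min_calls, retry_after
        self.failures = self.calls = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, handler):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.retry_after:
                self.state = "half_open"   # allow exactly one probe call
            else:                          # cached failure: handler never runs
                return {"error": "tool temporarily unavailable; try an alternative"}
        try:
            result = handler()
        except Exception:
            self._record(failed=True)
            return {"error": "tool temporarily unavailable; try an alternative"}
        self._record(failed=False)
        return {"ok": result}

    def _record(self, failed: bool):
        self.calls += 1
        self.failures += int(failed)
        if self.state == "half_open":
            if failed:                     # probe failed: reopen
                self.state, self.opened_at = "open", time.monotonic()
            else:                          # probe succeeded: fully close
                self.state = "closed"
                self.failures = self.calls = 0
        elif self.calls >= self.min_calls and self.failures / self.calls >= self.failure_rate:
            self.state, self.opened_at = "open", time.monotonic()

def flaky():
    raise TimeoutError("upstream slow")

breaker = ToolCircuitBreaker(min_calls=2)
breaker.call(flaky)
breaker.call(flaky)                 # 2/2 failures: breaker opens
resp = breaker.call(lambda: "data")  # open: handler is never invoked
```

The error dict is what the agent sees in its observation, which is why the prompt must explain what "temporarily unavailable" means operationally.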

Pattern B — Per-agent-session circuit breaker

The agent session itself has a breaker on anomalous signals: elevated refusal rate, suspicious tool-call patterns, policy-engine denials, cost overrun. When the breaker opens, the session is paused for human review or terminated.

Configuration: composite-score threshold from multiple signals, minimum session length before breaker can engage, operator override with audit.
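
The composite-score decision for Pattern B might be sketched like this; the signal names, weights, and threshold are illustrative assumptions:

```python
def session_breaker(signals: dict, weights: dict, steps: int,
                    threshold: float = 1.0, min_steps: int = 5) -> str:
    """Composite per-session breaker: weighted anomaly signals vs. a threshold.

    The breaker cannot engage before min_steps, so very short sessions
    are never flagged on a single noisy signal.
    """
    if steps < min_steps:
        return "run"
    score = sum(weights.get(name, 0.0) * value for name, value in signals.items())
    return "pause_for_review" if score >= threshold else "run"

weights = {"refusal_rate": 2.0, "policy_denials": 0.5, "cost_overrun_ratio": 1.0}
signals = {"refusal_rate": 0.4, "policy_denials": 1, "cost_overrun_ratio": 0.2}
# score = 0.8 + 0.5 + 0.2 = 1.5 >= 1.0 -> pause
print(session_breaker(signals, weights, steps=10))
```

The operator override mentioned above wraps this decision: a human can force "run" or "terminate", and that override is itself audited.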

Pattern C — Fleet-level circuit breaker

Across all sessions running a particular agent configuration or model, a fleet-level breaker fires when fleet-wide anomalies emerge: cost spike, refusal-rate spike, tenant-leakage signal, adversarial-pattern detection. Opening the fleet breaker pauses new sessions and optionally terminates active ones.

Configuration: fleet-percentile thresholds, cooldown policy, re-enable procedure with canary rollout.

The recovery workflow — detection through post-mortem

Every production agentic failure should move through the same workflow. The architect specifies each stage.

Stage 1 — Detect

Observability (Article 15) captures the anomaly; automated alerts route to on-call. The detection SLO — mean time to detect — should be minutes for high-risk, tens of minutes for medium, hours for low.

Stage 2 — Contain

Circuit breakers trigger automatically; the operator confirms; additional containment follows as needed (kill-switch on specific sessions, fleet-level pause, downstream-dependency isolation). Containment SLO: minutes from detection.

Stage 3 — Checkpoint and rollback

For side-effecting agents, roll back or compensate the partial state. Memory snapshots (Article 7) restore to last-known-good. Tool side effects are rolled back via saga compensations where possible and flagged for manual resolution where not.

Stage 4 — Diagnose

Using the observability trace (Article 15), identify the root cause. Was it a model regression? A tool-handler bug? A prompt-injection attack? A supply-chain failure? Diagnostics fan out into the code, prompt, policy, and external-system layers.

Stage 5 — Remediate

Fix the root cause. Code patch, prompt update, policy update, tool update, model rollback. Remediation is tested in canary before fleet rollout (Article 24 lifecycle).

Stage 6 — Verify

Evaluation harness (Article 17) gains a regression test for the scenario. Canary traffic exercises the fix; fleet rollout proceeds with elevated monitoring.

Stage 7 — Post-mortem

Blameless post-mortem (Article 25) captures timeline, contributing factors, missed detection opportunities, remediation gaps. Output feeds back into the registry (Article 26), the evaluation harness, and the policy set.

Recovery patterns for the hard cases

Two failure classes deserve dedicated recovery patterns because their default handling is often wrong.

Partial side effect with unclear state. The agent attempted to send an email; the email server returned 202 Accepted but no further confirmation. Did the email send? The architect’s pattern: after every Class 5 action, a post-call reconciliation polls the external system for confirmation. On ambiguous states, the tool’s idempotency key is propagated so retries don’t double-send; the agent is instructed to wait-and-verify, not retry-immediately; HITL escalation fires if ambiguity persists past a threshold.
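
The wait-and-verify step of that pattern can be sketched as a bounded polling loop; the state strings and thresholds are illustrative:

```python
import time

def reconcile(poll_state, timeout_s: float = 60.0, interval_s: float = 5.0) -> str:
    """Wait-and-verify after an ambiguous side effect: poll the external
    system until it confirms or denies, and escalate if still ambiguous
    past the timeout. poll_state() returns 'pending', 'confirmed', or 'failed'."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        state = poll_state()
        if state in ("confirmed", "failed"):
            return state              # unambiguous: safe to proceed or compensate
        time.sleep(interval_s)
    return "escalate_to_human"        # still ambiguous: HITL, never blind retry

# Simulated external system that confirms on the third poll.
states = iter(["pending", "pending", "confirmed"])
outcome = reconcile(lambda: next(states), timeout_s=1.0, interval_s=0.0)
```

On "failed" the agent may retry, but only through the propagated idempotency key; on "escalate_to_human" it must not retry at all.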

Memory corruption affecting many sessions. A poisoned fact in Layer 2 or Layer 4 memory has been retrieved by dozens of sessions before detection. The architect’s pattern: tombstone the corrupted entry; audit all sessions that retrieved it within the detection window; for each affected session, either notify the user, reprocess the task with corrected memory, or flag for review; update the write-policy to prevent the class of poisoning.

Resilience anti-patterns

Retry without jitter. Every agent retries on tool failure; the downstream system sees retry storms. Fix: exponential backoff with jitter, per-session retry caps.
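
The fix can be sketched as "full jitter" backoff with a per-session cap; base, cap, and cap count are illustrative defaults:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """'Full jitter' exponential backoff: sleep a random duration in
    [0, min(cap, base * 2**attempt)] so simultaneous retries decorrelate."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

MAX_RETRIES = 5   # per-session retry cap: stop retrying after this many attempts
delays = [backoff_with_jitter(a) for a in range(MAX_RETRIES)]
```

Without the jitter, every agent that failed at the same moment retries at the same moment, and the downstream system sees the retry storm the anti-pattern describes.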

Timeout stacking that exceeds user patience. The user’s request has a 30s budget; the agent has 20s; the first tool has 15s, the second 15s; timeouts stack to 50s. Fix: budget propagation, each layer respects the enclosing budget.
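
One way to sketch budget propagation, modeling the layers as sequential draws against a single enclosing budget; the function shape is illustrative:

```python
def layered_timeouts(user_budget: float, layer_prefs: list[float]) -> list[float]:
    """Assign each layer's timeout so the stack can never exceed the user
    budget: each layer gets min(its preference, what remains)."""
    remaining = user_budget
    timeouts = []
    for pref in layer_prefs:
        t = min(pref, remaining)   # respect the enclosing budget
        timeouts.append(t)
        remaining -= t
    return timeouts

# User waits 30s; layers prefer 15s + 15s + 20s (naive stacking: 50s).
ts = layered_timeouts(30.0, [15.0, 15.0, 20.0])
assert sum(ts) <= 30.0   # propagation keeps the worst case inside the user budget
```

In a real agent the budget object travels with the request context, and each tool call computes its timeout from the remaining budget at call time rather than from a static config.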

Silent model degradation. A model update lowers quality; agents accept degraded output without noticing. Fix: behavioral regression tests in CI (Article 17, Article 24), canary rollouts with quality monitoring.

Memory without TTL or tombstoning. Poisoned memories persist forever. Fix: TTL per memory classification; tombstoning as a first-class operation.

Framework parity — resilience hooks

  • LangGraph — checkpointing at every node provides natural recovery points; state resume is clean.
  • CrewAI — task-level retry and fallback policies; crew-level recovery requires custom orchestration.
  • AutoGen — message-retry logic customizable; group-chat manager handles timeouts.
  • OpenAI Agents SDK — native retry + guardrails; tracing supports post-hoc diagnosis.
  • Semantic Kernel — Polly-style resilience via Process Framework; integration with Azure reliability primitives.
  • LlamaIndex Agents — callback-driven error handling; response validators catch malformed outputs.

Real-world anchor — Replit AI Agent postmortems

Replit’s publicly discussed agent incidents across 2024–2025 illustrate most of the failure modes in this article. Long-horizon coding tasks hit context exhaustion; mishandled tool timeouts created partial side effects in repositories; memory of user preferences occasionally required rollback after incorrect writes. Replit’s response — tighter budget caps, explicit summarization, improved reconciliation — mirrors the prescriptions here. Source: replit.com blog and community discussions.

Real-world anchor — AWS Bedrock Agents observability and failure-mode docs

AWS Bedrock Agents documentation (public, 2024–2025) names the recurrent failure modes in managed agentic deployments and provides mitigation guidance. The managed-service view complements framework-level discussions; architects deploying on Bedrock should read the docs as the operational companion. Source: docs.aws.amazon.com/bedrock.

Real-world anchor — Anthropic Computer Use safety card

Anthropic’s Computer Use safety card (2024) discusses recovery patterns for a UI-acting agent — specifically how to handle ambiguous UI state, mis-clicked elements, and actions that can’t be easily undone. The card’s candid treatment of recovery limits in UI agents is instructive for any architect designing Class 5/6 agents. Source: Anthropic public materials.

Closing

Eight failure modes, three circuit-breaker levels, a seven-stage recovery workflow, two hard-case patterns. Resilience is not reliability plus retries; it is a discipline that treats every failure mode as a first-class design object. Article 17 now takes up the evaluation harness that tests whether the resilience works before the incident, not only after.

Learning outcomes check

  • Explain eight agent failure modes with their detection signals and mitigation patterns.
  • Classify three circuit-breaker patterns (per-tool, per-session, fleet) with configuration parameters.
  • Evaluate a recovery design for completeness against the seven recovery stages.
  • Design a failure-handling spec for a given agent including per-failure-mode mitigation, circuit-breaker configuration, and recovery runbook outline.

Cross-reference map

  • Core Stream: EATE-Level-3/M3.3-Art11-Enterprise-Agentic-AI-Platform-Strategy-and-Multi-Agent-Orchestration.md.
  • Sibling credential: AITM-OMR Article 8 (ops-management angle); AITF-PLP Article 5 (SRE-level operational angle).
  • Forward reference: Articles 17 (evaluation), 18 (SLO/SLI), 24 (lifecycle), 25 (incident response).