AITM-PEW: Prompt Engineering Associate — Body of Knowledge Article 6 of 10
Agentic patterns are the most over-applied class of prompt patterns in current practice. A workflow that could be a chained pipeline is often implemented as a many-turn agent with a tool-selection loop because agents are fashionable and chained pipelines are not. The practitioner’s job is to resist this reflex, not because agents are bad, but because agentic architectures impose costs (latency, cost per task, surface area for failure, governance overhead) that deterministic pipelines do not, and those costs are justified only when the task’s structure genuinely calls for them. This article develops the four principal agentic patterns, the failure modes unique to each, and the decision procedure a practitioner applies before selecting one.
What makes a pattern agentic
A pattern is agentic when three properties hold: the model’s outputs drive subsequent model inputs in a loop; the loop has an internal stopping condition the model decides; and the loop can take consequential actions between iterations via tools. A chained pipeline has the first property without the second; a search-before-answer feature has the first two properties without the third. When all three hold, the workflow is agentic, and the governance posture changes. Article 5 gave the threshold concretely: three or more chained tool calls without user confirmation, or an open loop over tool calls until a stopping condition, places the feature in agentic territory.
ReAct
The ReAct pattern, from Yao et al. 2023 at ICLR, interleaves reasoning and action.[1] The model produces a thought, emits an action (a tool call), receives an observation (the tool’s result), and loops. The published template structures the conversation with explicit thought/action/observation markers, and the loop continues until the model emits a final answer.
ReAct is the default for workflows that need to consult multiple sources before answering. A research assistant that fetches documents, summarises, fetches again, and synthesises fits the pattern naturally. Its strengths are interpretability (the trace shows the reasoning and the actions taken) and composability (new tools slot into the same loop). It has three characteristic failure modes. First, the model can get stuck repeating a non-productive action; a turn-count bound prevents runaway loops. Second, the model can confabulate an observation rather than waiting for the real one; a strict observation-insertion discipline in the orchestration layer is the countermeasure. Third, the model can invent tools beyond the declared toolset, emitting actions for tools that do not exist; strict tool-allow-list enforcement rejects these calls without ambiguity.
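The loop and its three countermeasures can be sketched in a few lines; `call_model` here is a stand-in for a real model endpoint, and the tool table is illustrative:

```python
# Minimal ReAct loop with three guards: a turn-count bound, strict
# observation insertion (only the orchestrator appends observations),
# and tool allow-list enforcement. `call_model` stands in for a model
# call and returns (thought, action, args) tuples.

ALLOWED_TOOLS = {"search": lambda q: f"results for {q!r}"}  # illustrative
MAX_TURNS = 8  # bound set during design from the longest legitimate task

def react_loop(task, call_model):
    transcript = [f"Task: {task}"]
    for turn in range(MAX_TURNS):
        thought, action, args = call_model(transcript)
        transcript.append(f"Thought: {thought}")
        if action == "final_answer":
            return args
        if action not in ALLOWED_TOOLS:
            # Reject calls to tools that do not exist, without ambiguity.
            transcript.append(f"Error: unknown tool {action!r}")
            continue
        transcript.append(f"Action: {action}({args})")
        # The observation is inserted by the orchestrator, never by the
        # model, so a confabulated observation cannot enter the trace.
        observation = ALLOWED_TOOLS[action](args)
        transcript.append(f"Observation: {observation}")
    raise RuntimeError("turn-count bound reached without a final answer")
```

The orchestration layer owns the transcript: the model proposes actions, but observations are only ever appended by the code that actually ran the tool.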
Every major orchestration framework expresses ReAct natively. LangChain and LangGraph, LlamaIndex, Haystack, DSPy, Semantic Kernel, and AutoGen each provide a ReAct implementation; the bare-API pattern is also trivially expressible on OpenAI, Anthropic, Gemini, and self-hosted Llama, Mistral, or Qwen endpoints. The framework choice is driven by team preference, not by capability difference.
Plan-act-observe
Plan-act-observe separates the planning and the execution into distinct phases. The model first produces a plan (a sequence of steps), then executes the plan step-by-step, observing results and adjusting as it goes. The pattern is attractive for long-horizon tasks where the ReAct loop would be unfocused.
The principal advantage is that the plan is inspectable before any action is taken. A review gate between plan and execution (either a human review or a deterministic plan validator) is where governance attaches. A plan that includes an irreversible action (send an email to all employees, delete a file, post to an external channel) can be flagged before the action occurs.
The principal failure mode is plan rigidity. A plan produced at the start of a long task may be wrong in ways the model would recognise after the first few steps, but a rigid executor does not revise. The countermeasure is a re-plan trigger: after N steps or on encountering an unexpected observation, the model re-plans rather than pushes through.
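A minimal sketch of an executor with a re-plan trigger, assuming hypothetical `plan` and `execute_step` callables in place of the model and tool calls:

```python
# Plan-act-observe with a re-plan trigger: after REPLAN_EVERY steps, or
# on an observation the plan did not anticipate, the executor asks the
# planner for a fresh plan rather than pushing through a stale one.
# `plan`, `execute_step`, and `is_expected` are illustrative stand-ins.

REPLAN_EVERY = 3  # re-plan cadence; an illustrative design choice

def run_plan(task, plan, execute_step, is_expected):
    steps = list(plan(task, history=[]))   # phase 1: plan before acting
    history = []
    while steps:                           # phase 2: execute step-by-step
        step = steps.pop(0)
        observation = execute_step(step)
        history.append((step, observation))
        # Re-plan trigger: fixed cadence, or an off-script observation.
        if len(history) % REPLAN_EVERY == 0 or not is_expected(step, observation):
            steps = list(plan(task, history=history))
    return history                         # planner returns [] when done
```

The planner terminates the run by returning an empty plan; in a governed deployment the plan-review gate sits between `plan` and the execution loop.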
Reflection
Reflection, in the sense of Shinn et al. 2023’s Reflexion paper,[2] is the pattern in which the model reviews its own output or trajectory and produces a revised version. The review can be a single-pass self-critique (produce an answer, then review and revise) or an episodic pattern in which verbal feedback from prior runs influences future attempts.
Reflection is effective for tasks where errors are plausible on the first pass but detectable on the second. Code generation, complex summarisation, and multi-constraint reasoning are typical fits. The pattern is not a safety layer; a reflective model can double down on a wrong answer, and the review can compound the error. The countermeasure is to pair reflection with an external check (a unit test for code, a factual verification for claims, an adversarial probe for safety) rather than relying on the model to catch itself.
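The pairing can be expressed directly: the loop below accepts a revision only when an external check passes, never on the model’s opinion of its own work. All three callables are illustrative stand-ins:

```python
# Reflection paired with an external check. The model revises its own
# draft, but acceptance is decided by the check (a unit test, a factual
# verifier), not by the model's self-assessment, so a model doubling
# down on a wrong answer cannot pass itself.

def reflect(task, generate, critique_and_revise, external_check, max_rounds=3):
    draft = generate(task)
    for _ in range(max_rounds):
        if external_check(draft):          # external verdict, not self-belief
            return draft
        draft = critique_and_revise(task, draft)
    raise RuntimeError("reflection exhausted without passing the check")
```

Bounding the rounds matters for the same reason the turn-count bound does: a reflection loop that never satisfies its check must terminate, not spin.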
Multi-agent collaboration
Multi-agent patterns compose distinct agents, each with a narrower role, into a team that solves a task by protocol. A typical configuration has a planner agent, one or more specialist agents, and a critic agent. Frameworks like AutoGen and CrewAI expose multi-agent patterns as a first-class primitive; LangGraph supports them through graph construction.
The pattern’s appeal is that narrow roles produce better prompts. A planner prompt can be focused and short; a specialist prompt can carry domain expertise; a critic prompt can be ruthless about quality without compromising helpfulness. The failure modes are characteristic and expensive. Agents can chatter, producing many turns of coordination without forward progress; agents can disagree in ways that deadlock the team; agents can collectively convince themselves of a false premise through mutual reinforcement. A supervisor agent with explicit authority to terminate unproductive loops is the minimum control; in practice, turn-count bounds, timeout bounds, and a deterministic final decision maker (typically the planner, sometimes a human) are all worth building.
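A minimal supervisor sketch with a turn-count bound and a deterministic final decision maker; the agent callables stand in for model-backed agents, and the round-robin protocol is one illustrative choice among many:

```python
# Supervisor loop over a multi-agent team. Agents speak in round-robin,
# and only the supervisor terminates the run: either when the planner
# (the deterministic final decision maker here) declares the task
# resolved, or when the turn bound cuts off unproductive chatter.

MAX_TEAM_TURNS = 12  # illustrative bound on coordination turns

def supervise(task, agents, is_resolved):
    messages = [("user", task)]
    for turn in range(MAX_TEAM_TURNS):
        name, agent = agents[turn % len(agents)]   # round-robin protocol
        reply = agent(messages)
        messages.append((name, reply))
        if name == "planner" and is_resolved(reply):
            return messages                        # planner has authority
    messages.append(("supervisor", "terminated: turn bound reached"))
    return messages                                # supervisor kill path
```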
[DIAGRAM: HubSpoke — aitm-pew-article-6-multi-agent-topology — Hub: supervisor; spokes: planner, specialist agents, critic; message protocol labelled; shared memory indicated; kill-switch annotation on supervisor.]
The control envelope for agentic features
Every agentic feature needs a control envelope that applies independently of the specific pattern. Five controls constitute the minimum envelope.
A turn-count bound terminates any loop after N iterations; the bound is set during design based on the longest legitimate task. A wall-clock timeout terminates any loop after M seconds; the timeout accommodates natural variance while preventing runaway. A tool-use budget limits the number of tool calls per task; a feature that usually needs three tool calls should not be permitted fifty. A memory envelope limits what the agent retains across turns; without it, context window exhaustion silently truncates early turns and produces incoherent late-turn behaviour. And a kill-switch allows an operator or an automated controller to terminate a running agent; the kill-switch is not optional and must be exercised regularly, because a kill-switch that has never been tested is unlikely to work when needed.
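One way to package the five controls is a single envelope object checked on every loop iteration; the class name and thresholds below are illustrative, not drawn from any framework:

```python
# The five-control envelope as one object checked each iteration:
# turn-count bound, wall-clock timeout, tool-use budget, memory
# envelope, and a kill-switch flag an operator or controller can flip.
import time
from dataclasses import dataclass, field

@dataclass
class ControlEnvelope:
    max_turns: int = 10            # longest legitimate task, set at design
    timeout_s: float = 60.0        # wall-clock bound on the whole run
    tool_budget: int = 15          # tool calls permitted per task
    memory_limit: int = 40         # transcript entries retained
    killed: bool = False           # flipped externally; checked every turn
    _start: float = field(default_factory=time.monotonic)

    def check(self, turn, tool_calls):
        if self.killed:
            raise RuntimeError("kill-switch engaged")
        if turn >= self.max_turns:
            raise RuntimeError("turn-count bound exceeded")
        if time.monotonic() - self._start > self.timeout_s:
            raise RuntimeError("wall-clock timeout")
        if tool_calls > self.tool_budget:
            raise RuntimeError("tool-use budget exhausted")

    def prune(self, transcript):
        # Memory envelope: drop the oldest entries rather than letting
        # context exhaustion truncate them silently.
        return transcript[-self.memory_limit:]
```

Because the kill-switch is just a flag the loop must consult, exercising it in a drill is cheap; an envelope whose kill path has never run is the one most likely to fail in an incident.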
The OWASP Top 10 for LLMs catalogues unbounded consumption as LLM10 and excessive agency as LLM06,[3] both of which are patterns the control envelope prevents. The MITRE ATLAS matrix catalogues the adversarial analogues, including compound tool-abuse sequences that exploit a missing kill-switch.[4] A practitioner’s agent design includes the envelope from the first prototype, not after the first incident.
[DIAGRAM: Timeline — aitm-pew-article-6-agent-execution-timeline — Timeline of an agent run: initialisation -> step 1 -> step 2 -> checkpoint -> step 3 -> step 4 -> timeout branch -> kill-switch branch -> final answer; gates labelled.]
Memory and state for agents
Agentic features need memory, and memory is a design space of its own. Three memory types are common. Episodic memory records the current task’s history within the active session. Semantic memory records facts or knowledge the agent accumulates over time. Procedural memory records how-to knowledge, often expressed as updated heuristics or prompt snippets.
Each memory type has a failure mode. Episodic memory that grows unboundedly consumes the context window and silently truncates; a memory envelope summarises or drops earlier turns. Semantic memory that stores facts from one user’s session and exposes them to another user’s session is a privacy breach; per-user memory scoping, validated by the orchestration layer, is the control. Procedural memory that updates heuristics based on successful trajectories can learn the wrong lesson; a review gate on procedural memory updates is the minimum control for features that carry any consequence.
The practitioner’s question for each memory type is not whether to have it but how it is scoped, how it is pruned, and how the auditor or the incident responder inspects it. A memory that has no inspection tool is a memory that is effectively unreviewable.
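A sketch of per-user scoping with an explicit inspection path; the in-memory dict and the `audit-ok` token are placeholders for a real store and a real authorisation check:

```python
# Semantic memory with per-user scoping enforced by the orchestration
# layer, not left to the model's discretion, plus an inspection hook
# for auditors and incident responders.

class ScopedMemory:
    def __init__(self):
        self._facts = {}                       # user_id -> list of facts

    def remember(self, user_id, fact):
        self._facts.setdefault(user_id, []).append(fact)

    def recall(self, user_id):
        # Scoping enforced here: a session can only ever read the facts
        # stored under its own user, so no cross-user leakage is possible.
        return list(self._facts.get(user_id, []))

    def inspect(self, auditor_token, user_id):
        # Inspection path for audit and incident response; a memory
        # without one is effectively unreviewable.
        if auditor_token != "audit-ok":        # placeholder authorisation
            raise PermissionError("not an auditor")
        return list(self._facts.get(user_id, []))
```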
When to be agentic at all
The decision procedure is a few questions, answered honestly.
Can the task be decomposed in advance into a fixed sequence of steps, each with a deterministic input-output contract? If yes, it is a pipeline; build a pipeline. The pipeline is cheaper, faster, more observable, and more governable.
Does the task require interleaved reasoning and action in ways the decomposition cannot anticipate? If yes, ReAct is the fit.
Is the task long-horizon with a natural plan-and-execute structure? If yes, plan-act-observe with a plan-review gate is the fit.
Does the task benefit from critique and revision? Attach reflection to whichever base pattern applies; do not use reflection on its own.
Does the task benefit from specialist collaboration? Multi-agent is the fit, with a supervisor and a deterministic terminator.
A feature that passes any of the last four tests is agentic and requires the control envelope. A feature that passes only the first is a pipeline and does not.
Human-in-the-loop integration
The best-designed agentic features include humans as first-class participants rather than as fallback escape hatches. Three integration points matter.
Plan approval places a human in the loop between plan generation and plan execution. A high-stakes task (a financial action, a customer-facing communication, an irreversible operation) is planned by the model, reviewed by a human, and executed only after approval. The review is typically structured (a list of steps the agent proposes to take, a short rationale, a single approve-or-reject action) so that the human can review quickly.
Checkpoint approval places a human at selected steps of execution. A long-running agent pauses at declared checkpoints and requests confirmation before proceeding. Checkpoints are selected for their consequence, not their position; the practitioner identifies steps at which continuing costs more than pausing.
Review sampling places a human outside the hot loop, sampling completed agent runs for quality review. Review sampling does not gate individual tasks but provides the data the harness needs to track quality over time.
The three integration points compose. A very-high-stakes feature might run plan approval on every task, checkpoint approval on selected steps, and review sampling across the population; a moderate-stakes feature might run only review sampling; a low-stakes feature might run only the automated harness. The choice is calibrated to the feature’s risk tier, not to one-size-fits-all policy.
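The calibration can be made explicit as configuration; the tier names and sampling rates below are illustrative policy choices, not a standard:

```python
# Risk-tier calibration of the three human-in-the-loop integration
# points. Each tier enables a subset: plan approval gates every task,
# checkpoint approval gates selected steps, and review_sample is the
# fraction of completed runs drawn for offline quality review.

HITL_POLICY = {
    "very-high": {"plan_approval": True,  "checkpoints": True,  "review_sample": 1.00},
    "moderate":  {"plan_approval": False, "checkpoints": False, "review_sample": 0.10},
    "low":       {"plan_approval": False, "checkpoints": False, "review_sample": 0.00},
}

def controls_for(tier):
    # Fail loudly on an unknown tier rather than defaulting to the
    # weakest controls.
    return HITL_POLICY[tier]
```

Keeping the mapping as data rather than scattered conditionals makes the policy itself reviewable, which matters when the risk tiering is audited.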
Cost awareness for agentic features
Agentic features are expensive. Each tool call incurs a model call to decide the action, the tool execution itself, and another model call to process the result; in practice the processing call for one result often doubles as the decision call for the next. By this accounting, a task that resolves in one ReAct turn costs roughly three model calls plus one tool execution, and a task that resolves in ten turns costs on the order of thirty. Multi-agent features with three agents and five turns each approach a hundred model calls per task.
The cost has two implications. First, the feature’s economics need attention from the first prototype; an agentic feature that costs more per task than its business value produces is not viable, and the economics must be computed before the engineering investment. Second, the evaluation harness tracks cost per task as a first-class metric, not an afterthought. A prompt edit that raises the average number of turns from four to six has raised the feature’s cost by fifty per cent; the harness must surface this, and the team must decide whether the quality gain is worth the cost.
A practical discipline is to set a cost budget per task, expressed in tokens or in dollars, and to treat the budget as a constraint. The budget is enforced through the turn-count and tool-use budgets in the control envelope, so that a runaway task terminates before it becomes a financial incident.
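A sketch of the budget as a hard constraint, metered per call so a runaway task terminates mid-run; the per-call prices are illustrative:

```python
# Cost budget enforced as a hard constraint: spend is metered on every
# model call and tool call, and the task raises (terminating the run)
# the moment the budget is exceeded, before a financial incident.

MODEL_CALL_COST = 0.01     # dollars per model call, assumed average
TOOL_CALL_COST = 0.002     # dollars per tool execution, assumed average

class CostBudget:
    def __init__(self, limit_dollars):
        self.limit = limit_dollars
        self.spent = 0.0

    def charge(self, model_calls=0, tool_calls=0):
        self.spent += model_calls * MODEL_CALL_COST + tool_calls * TOOL_CALL_COST
        if self.spent > self.limit:
            raise RuntimeError(f"cost budget exceeded: ${self.spent:.3f}")
```

The same `spent` figure feeds the evaluation harness, so cost per task is tracked from the first prototype rather than reconstructed after the fact.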
Two real examples
Yao et al. 2023, ReAct. The foundational paper on interleaved reasoning and action demonstrated that combining ReAct with simple retrieval and calculator tools produced improvements on HotpotQA, Fever, and ALFWorld benchmarks.[1] The paper is often cited for its improvements; it is equally worth reading for its candour about failure modes, including the model’s tendency to confabulate observations under certain conditions.
Shinn et al. 2023, Reflexion. The paper introduced verbal reinforcement as a way for language agents to learn from their own trajectories without weight updates.[2] The results were substantial on coding and language-navigation tasks. The paper also documented failure modes where the reflection compounded errors, providing the teaching point that reflection is a capability enhancer, not a governance control.
Summary
Agentic patterns are interleaved reasoning with action in loops the model controls. ReAct, plan-act-observe, reflection, and multi-agent collaboration are the four principal patterns. Each has characteristic failure modes. Every agentic feature needs a control envelope of turn-count bound, timeout, tool-use budget, memory envelope, and tested kill-switch. The decision to be agentic at all is answered by whether the task can be decomposed deterministically; when it can, build a pipeline. Article 7 turns to prompt injection, the class of attacks that agentic features amplify.
Further reading in the Core Stream: Tool Use and Function Calling in Autonomous AI Systems and Safety Boundaries and Containment for Autonomous AI.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes

1. Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629 — accessed 2026-04-19.
2. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. https://arxiv.org/abs/2303.11366 — accessed 2026-04-19.
3. OWASP Top 10 for Large Language Model Applications, 2025. OWASP Foundation. https://genai.owasp.org/llm-top-10/ — accessed 2026-04-19.
4. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems). MITRE Corporation. https://atlas.mitre.org/ — accessed 2026-04-19.