AITE M1.1-Art07 v1.0 Reviewed 2026-04-06 Open Access

Tool Use, Function Calling, and Agent Loops



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 7 of 35


A retrieval-augmented model reads and writes language. A tool-using model does things. The moment an architect gives the model the ability to call functions, the system crosses the boundary from a chatbot into an actor inside the organization’s operational substrate. A function call can read a customer record, create a ticket, send an email, move money, schedule a meeting, or execute a workflow. That capability is the point of most enterprise AI investment, and it is also where the most expensive failures begin. The function schema is no longer a developer convenience; it is now the attack surface. Article 7 teaches the architect how to design that surface so that the model’s new powers are bounded, audited, and reversible.

What function calling is

Function calling is the protocol by which a model emits a structured request to invoke an external function and the runtime executes that function and returns its result to the model for continuation. Three dominant schema conventions exist. OpenAI’s tools API uses JSON schema descriptions in the request and a structured tool_calls array in the response.1 Anthropic’s Claude tool use uses a similar JSON schema with a slightly different request shape.2 Google’s Vertex Gemini function calling and Cohere’s Command R tool use follow the same general pattern with minor shape differences. Open-weight models reproduce the pattern through fine-tuned tool-calling variants (Llama 3.1 Instruct, Mistral’s Mixtral, Qwen 2.5) that emit tool calls in the prompt-template convention their tokenizer was trained with.

The architect designs the schema as a first-class artifact. Each tool is a named function with typed parameters, a description the model reads at decision time, and declared effects — read-only, internal-write, external-write, network-call, human-approval required. A poorly described tool is a liability because the model’s decision to call it is made against the description, not the implementation. A tool named getUserDetails with the description “returns user information” invites calls the architect did not intend when the actual function returns personally identifiable information from a table the caller is not authorized to read.
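The schema-as-artifact idea can be sketched in the OpenAI tools-API shape, extended with a runtime-side effect class. The `effect` field, the tool name, and the helper below are illustrative additions, not part of any provider API:

```python
# Illustrative tool definition in the OpenAI tools-API shape. The "effect"
# field is a non-standard, runtime-side annotation (an assumption for this
# sketch): the runtime, not the model, reads it to decide which controls apply.
GET_CUSTOMER_TOOL = {
    "type": "function",
    "function": {
        "name": "getCustomer",
        # The model decides whether to call the tool from this description,
        # so it must state exactly what is returned and to whom.
        "description": (
            "Return the name, email, and open-ticket count for the customer "
            "making the current request. Never returns other customers' data."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "description": "ID of the calling customer only.",
                }
            },
            "required": ["customer_id"],
        },
    },
    # Runtime-only metadata, stripped before the schema is sent to the model.
    "effect": "read-only",
}

def schema_for_model(tool: dict) -> dict:
    """Strip runtime-only metadata before sending the schema to the model."""
    return {k: v for k, v in tool.items() if k != "effect"}
```

Keeping the effect class next to the schema, rather than in a separate document, is what lets the runtime enforce it mechanically.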

The ReAct loop and its descendants

The foundational pattern is ReAct, introduced in Yao et al., 2022: the model alternates Reason (plan next step as text) and Act (emit a tool call), then reads the tool’s observation and iterates until it has enough information to produce a final answer.3 ReAct is simple, observable (each step is text), and forms the backbone of LangChain’s agent APIs, LlamaIndex agents, and most early production agents in 2023–2024.
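The alternation just described can be sketched as a minimal loop. The `call_model` callable and the step dictionaries are placeholders standing in for a real provider client, not any specific API:

```python
# Minimal ReAct-style loop: the model (stubbed as call_model) either emits a
# tool call or a final answer; each tool observation is appended to the
# transcript and the loop iterates up to a fixed step budget.
def react_loop(task, call_model, tools, max_steps=8):
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = call_model(transcript)      # placeholder for a provider call
        if step["type"] == "final":
            return step["answer"]
        # Act: execute the requested tool, feed the observation back.
        observation = tools[step["tool"]](**step["args"])
        transcript.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("loop budget exhausted without a final answer")
```

The explicit `max_steps` budget is the simplest divergence control: a loop that cannot terminate on its own at least terminates on the budget.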

ReAct has weaknesses. It is greedy — the model plans step by step without lookahead. It is verbose — the reasoning text inflates token cost. It is brittle under complex, multi-step tasks where a failure mid-loop leaves the system in an intermediate state that was not anticipated. Four descendants address these weaknesses.

Plan-and-Execute separates planning from execution: the model first produces a full plan of tool calls, then an executor runs the plan and returns intermediate results, and the model refines the plan as needed.4 This pattern trades step-level flexibility for lookahead and reduces the model’s per-step token cost.

Reflexion adds an explicit critique step: after each attempt, the model critiques its own output and tries again with the critique as additional context.5 Reflexion suits tasks where evaluation is cheaper than execution and where attempts can be discarded without side effects.

State graphs (LangGraph, Haystack pipelines, Temporal workflows with AI nodes, Pydantic AI graphs) express the agent as a directed graph of states and transitions. The model makes decisions at decision nodes; deterministic code handles the rest. This is the pattern the enterprise typically converges on because it makes the agent’s behavior auditable and testable without executing the model in tests.
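The state-graph pattern can be sketched without any framework. Node names, edges, and the `decide` stub below are illustrative; the point is that deterministic code owns the transitions and the model is consulted only at designated decision nodes:

```python
# Agent as an explicit state graph: deterministic edges everywhere except the
# nodes in DECISION_NODES, where a model (stubbed via `decide`) chooses among
# a fixed set of allowed next states. All node names are hypothetical.
GRAPH = {
    "classify":    ["lookup", "escalate"],  # decision node: model chooses
    "lookup":      ["draft_reply"],         # deterministic edge
    "draft_reply": ["done"],
    "escalate":    ["done"],
}
DECISION_NODES = {"classify"}

def run_graph(start, decide, visit):
    """Walk the graph; `decide(node, options)` is the model, `visit` does work."""
    node, path = start, [start]
    while node != "done":
        visit(node)
        options = GRAPH[node]
        # Only decision nodes consult the model, and only among allowed edges.
        node = decide(node, options) if node in DECISION_NODES else options[0]
        if node not in GRAPH and node != "done":
            raise ValueError(f"illegal transition to {node}")
        path.append(node)
    return path
```

Because every legal transition is enumerated in `GRAPH`, the agent's reachable behavior can be tested by substituting a deterministic `decide`, with no model execution at all.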

Constrained function-calling (one-shot tool use) is the simplest pattern: the model is given tools but is not put in a loop. The application calls the model, the model may emit one tool call, the application executes it and optionally calls the model again with the result — no autonomous iteration. This pattern suits use cases where the model needs information before producing an answer but where agency is deliberately limited.
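The one-shot pattern reduces to two model calls with no loop. As above, `call_model` and the step dictionaries are placeholders for a real client:

```python
# Constrained one-shot tool use: the model may emit at most one tool call,
# and the second turn must be a final answer. No autonomous iteration.
def answer_with_one_lookup(question, call_model, tools):
    first = call_model([{"role": "user", "content": question}])
    if first["type"] == "final":
        return first["answer"]
    # Execute the single permitted tool call, then force a final answer.
    observation = tools[first["tool"]](**first["args"])
    second = call_model([
        {"role": "user", "content": question},
        {"role": "tool", "content": str(observation)},
    ])
    if second["type"] != "final":
        raise RuntimeError("second turn must answer, not call another tool")
    return second["answer"]
```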

The architect picks the loop pattern based on the task’s structure and the operational risk of loop divergence. High-autonomy loops suit exploration and research. Low-autonomy state graphs suit revenue-critical paths where reliability matters more than flexibility.

[DIAGRAM: HubSpokeDiagram — aite-sat-article-7-tool-risk-hub-spoke — Hub labelled “Model” with five spokes radiating outward, each labelled by tool category. Spoke 1: “Read-only” (low risk, green) with examples: getCustomer, lookupProduct, queryDatabase. Spoke 2: “Internal-write” (medium risk, amber) with examples: createTicket, updateNote. Spoke 3: “External-write” (high risk, red) with examples: sendEmail, placeOrder, refundCharge. Spoke 4: “Network-call” (medium risk, amber) with examples: fetchWebPage, callExternalAPI. Spoke 5: “Human-approval-required” (governance risk, blue) with examples: transferFunds, deleteRecord, escalateToLegal. Each spoke labelled with its required pre- and post-execution controls.]

The OWASP excessive-agency risk

The OWASP Top 10 for LLM Applications, in its 2025 revision, lists “excessive agency” — the risk that a model takes actions beyond what the use case requires — as one of the ten most material application risks.6 The excessive-agency risk is not that the model goes rogue; it is that the schema gives the model more capability than the use case requires and an adversary or a prompt-injection payload exploits that gap. A customer-support agent with a tool that can refund any amount is an excessive-agency design even if 99% of legitimate uses involve small refunds, because the model does not distinguish between a legitimate request and an attempted abuse.

The architect designs for minimum agency. A refund tool is split into two tools: refundSmall with capped amount and no approval and refundLarge with uncapped amount and required human approval. A read tool is restricted to the calling user’s own records, not the entire table. An external-write tool draft-creates rather than sends. The schema enforces the governance posture in its shape.
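The refund split can be expressed directly in the tool layer. The cap value, the function names, and the approval hook are illustrative; the design point is that the limit lives in code, not in the prompt:

```python
# Minimum-agency refund tools: the cap and the approval gate are enforced by
# the tool implementations themselves. Cap and names are illustrative.
REFUND_CAP = 50.00

def refund_small(order_id: str, amount: float) -> dict:
    if amount > REFUND_CAP:
        # Reject rather than silently escalate: the model must choose
        # refund_large explicitly, which routes through human approval.
        return {"status": "rejected", "reason": f"amount exceeds cap {REFUND_CAP}"}
    return {"status": "refunded", "order_id": order_id, "amount": amount}

def refund_large(order_id: str, amount: float, request_approval) -> dict:
    # request_approval is the human-in-the-loop hook (stubbed in tests);
    # no refund above the cap executes without an explicit grant.
    if not request_approval(order_id, amount):
        return {"status": "denied", "reason": "human approval not granted"}
    return {"status": "refunded", "order_id": order_id, "amount": amount}
```

Even if a prompt injection convinces the model to attempt a large refund through `refund_small`, the cap check fails closed.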

Pre- and post-execution validation

A tool call is not trusted on either end. The pre-execution validator inspects the model’s tool-call payload before the runtime executes it. It verifies the caller’s authorization to invoke the tool with those arguments, checks argument bounds (refund amount under the cap, user identifier within the caller’s scope), and checks for prompt-injection signatures in string fields. If validation fails, the call is rejected and the failure is returned to the model as an observation so the model can decide what to do next.
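A pre-execution validator can be sketched as a single function over the tool-call payload. The caller shape, the scope rule, and the injection patterns are illustrative placeholders, not a production signature set:

```python
import re

# Pre-execution validator: inspects the model's tool-call payload before the
# runtime executes it. Patterns and field names are illustrative.
INJECTION_PATTERNS = [re.compile(p, re.I) for p in
                      (r"ignore (all )?previous instructions", r"system prompt")]

def validate_tool_call(call, caller):
    """Return (ok, reason); on failure the reason goes back to the model."""
    tool, args = call["tool"], call["args"]
    # 1. Authorization: may this caller invoke this tool at all?
    if tool not in caller["allowed_tools"]:
        return False, f"caller not authorized for {tool}"
    # 2. Argument bounds: scope reads to the caller's own records.
    if "customer_id" in args and args["customer_id"] != caller["customer_id"]:
        return False, "customer_id outside caller scope"
    # 3. Injection scan on string-valued arguments.
    for value in args.values():
        if isinstance(value, str) and any(p.search(value) for p in INJECTION_PATTERNS):
            return False, "possible prompt-injection payload in arguments"
    return True, "ok"
```

Returning the reason as an observation, rather than raising, is what lets the model recover (for example by asking the user for a record it is actually allowed to read).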

The post-execution validator inspects the tool’s output before returning it to the model. It redacts sensitive fields the model does not need, truncates outputs that would balloon the model’s context, and tags outputs with provenance metadata the model is instructed to cite. If the output contains content that might itself be a prompt injection (tool output is often attacker-controlled when the tool reads external content), the validator flags it for downstream safe-output handling.
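The post-execution side can be sketched the same way. The sensitive-field list, the truncation limit, and the provenance key are illustrative:

```python
# Post-execution validator: redact, truncate, and tag provenance before the
# tool output re-enters the model's context. Field names are illustrative.
SENSITIVE_FIELDS = {"ssn", "card_number", "password_hash"}
MAX_CHARS = 2000

def sanitize_tool_output(output: dict, source: str) -> dict:
    # 1. Redact fields the model does not need to see.
    cleaned = {k: v for k, v in output.items() if k not in SENSITIVE_FIELDS}
    # 2. Truncate anything that would balloon the context window.
    for k, v in cleaned.items():
        if isinstance(v, str) and len(v) > MAX_CHARS:
            cleaned[k] = v[:MAX_CHARS] + "…[truncated]"
    # 3. Tag provenance so the model can be instructed to cite it.
    cleaned["_provenance"] = source
    return cleaned
```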

[DIAGRAM: StageGateFlow — aite-sat-article-7-tool-validation-flow — Left-to-right flow of a single tool call: “Model emits tool_call” → “Pre-execution validator (AuthZ check, arg bounds, injection scan)” → “Runtime executes function” → “Post-execution validator (PII redact, output truncate, provenance tag, injection scan)” → “Observation returned to model”. Two side-branches labelled “Reject → return error observation” and “Human approval required → pause loop, await approval, resume”.]

Two real-world examples

Air Canada chatbot judgment, Moffatt v. Air Canada, 2024 BCCRT 149. The British Columbia Civil Resolution Tribunal held Air Canada liable for the advice its chatbot gave a customer about bereavement-fare policy, rejecting the airline’s argument that the chatbot was a separate legal entity.7 The judgment is an architectural warning to anyone deploying tool-enabled agents. The chatbot did not even call a traditional tool — it performed a lookup on policy text and returned an answer — but the tribunal treated that lookup as the airline’s agent speaking on the airline’s behalf. An architect deploying a tool-using agent that writes back to customer systems is not deploying a technical capability; they are deploying an agent with legal consequences. The architectural implication is that every tool must have a defensible, documentable trail of what it was authorized to do, what it actually did, and what it told the user.

Replit’s AI Agent launch, 2024. Replit’s public blog posts describing their AI Agent architecture document the loop pattern, the tool schema, and the safety boundaries the team designed around a code-executing agent.8 The agent can read the user’s workspace, propose code changes, and execute them in a sandbox. The sandbox is the key architectural control — the agent has broad tool access inside the sandbox but no path outside it without explicit user action. The architectural pattern is not “restrict tools” but “restrict blast radius.” The architect learning from Replit’s design takes the sandbox principle and applies it inside enterprise systems as well: an agent’s external-write actions draft rather than dispatch, or target a staging environment rather than production, until a human signs off.

Both examples point at the same principle: an agent’s tools are the product, and the governance of those tools is the architecture.

Tool registries

A tool registry is the artifact that makes the tool inventory auditable. The registry records each tool’s name, schema, description, effect class, owner, last review date, allowed callers (which agents, which tenants, which roles), and evaluation record. The architect owns the registry specification — what fields it has, who updates it, how changes are approved — even if the implementation is a database table or a Git repository. Article 21 develops registries at length, including prompt and model registries; the tool registry is the registry that keeps the agent honest.
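A minimal registry record matching the fields listed above can be sketched as a dataclass. The field names and the lookup helper are illustrative; as the text notes, the backing store could equally be a database table or a Git repository:

```python
from dataclasses import dataclass, field

# One registry record per tool, matching the fields the text lists.
@dataclass
class ToolRecord:
    name: str
    schema: dict
    description: str
    effect_class: str           # read-only | internal-write | external-write | ...
    owner: str
    last_review: str            # ISO date of the last schema review
    allowed_callers: set = field(default_factory=set)
    eval_record: str = ""       # pointer to the tool's golden-set results

def caller_allowed(registry: dict, tool: str, caller: str) -> bool:
    """Registry-backed check: unknown tools are denied by default."""
    rec = registry.get(tool)
    return rec is not None and caller in rec.allowed_callers
```

Denying unknown tools by default means a tool added outside the registry process simply does not run, which is the enforcement teeth the inventory needs.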

A tool’s lifecycle mirrors an API’s: design, review, deploy, monitor, version, deprecate. A new tool is proposed with a schema and an expected use case; the architect reviews the schema for minimum-agency and blast-radius discipline; the tool is deployed behind a feature flag; its usage is monitored for anomalous patterns; and when it is superseded, its deprecation is a deliberate event with notice to dependent agents. A tool that was added during a rushed sprint and never reviewed is exactly the governance debt that surfaces during an incident six months later.

Evaluation for tool-using agents

Evaluation for tool-using agents differs from evaluation for generation. Generation evaluation asks whether the output is good; agent evaluation asks whether the agent chose the right tool, passed correct arguments, respected the loop boundary, and produced a defensible audit trail. Three metric families matter: tool-selection accuracy (given a task, did the agent call the correct tool), argument-correctness rate (did the agent fill the tool’s arguments correctly), and outcome-correctness rate (did the end-to-end interaction produce the intended business result). An agent that scores well on tool selection and poorly on argument correctness has a schema-clarity problem; an agent that scores well on both but poorly on outcome has a loop-coordination or planning problem. The architect diagnoses the failure class before attempting to fix it.

Benchmarks such as ToolBench, BFCL (Berkeley Function-Calling Leaderboard), and AgentBench provide public yardsticks for how tool-calling performance compares across models, but the workload-specific evaluation is the binding one.9 The architect constructs a tool-calling golden set that exercises each tool in the agent’s inventory, each argument combination that matters, and each failure path (tool unavailable, argument invalid, authorization denied) and scores the agent on these scenarios on every deployment. The golden set is versioned alongside the tool registry so that a change to either is accompanied by the corresponding change to the other.
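Scoring a golden set on the three metric families is mechanical once each case records the expected call alongside the agent's actual call and the judged outcome. The case shape below is an illustrative assumption:

```python
# Score a tool-calling golden set on the three metric families the text
# names. Each case dict is an assumed shape: expected vs actual tool and
# arguments, plus a judged end-to-end business outcome.
def score_golden_set(cases):
    n = len(cases)
    tool_ok = sum(c["actual_tool"] == c["expected_tool"] for c in cases)
    # Argument correctness only counts when the right tool was chosen.
    args_ok = sum(c["actual_tool"] == c["expected_tool"]
                  and c["actual_args"] == c["expected_args"] for c in cases)
    outcome_ok = sum(c["outcome_correct"] for c in cases)
    return {
        "tool_selection": tool_ok / n,
        "argument_correctness": args_ok / n,
        "outcome_correctness": outcome_ok / n,
    }
```

Comparing the three rates per deployment is what supports the diagnosis above: selection high but arguments low points at schema clarity, both high but outcome low points at loop coordination.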

Regulatory alignment

Tool-using agents touch EU AI Act Articles 13 (transparency — users must know they are interacting with an AI), 14 (human oversight — the deployment must include human oversight measures proportionate to the risk), and 15 (accuracy, robustness, cybersecurity).10 The architect’s tool-design decisions determine whether human oversight is meaningful. A tool that auto-executes external writes with no human-in-the-loop fails the oversight test for high-risk use cases even if the tool’s technical design is competent. ISO/IEC 42001 Clause 8.3 and Clause A.6.2.6 expect documented decision logs for system behavior, which the tool registry plus the per-call audit log provide together.

Summary

Tool use and agent loops are the architecture moves that turn an AI feature into an operator. The four loop patterns — ReAct, Plan-and-Execute, Reflexion, state graph — differ in their trade-off between flexibility and auditability. The schema is the attack surface; minimum-agency design splits broad tools into narrow ones with explicit approval gates. Pre- and post-execution validators bracket every tool call. OWASP’s excessive-agency risk is an architectural risk, not a prompt-engineering risk. Air Canada’s tribunal judgment is a legal warning about agent liability; Replit’s sandbox pattern is the reference for blast-radius control. Tool registries make the inventory auditable. Regulatory alignment with EU AI Act Articles 13, 14, and 15 depends on the architect’s tool-design discipline, not on the framework choice.

Further reading in the Core Stream: AI Agents: From Automation to Autonomous Action and Agent Orchestration Patterns.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. OpenAI function calling / tools API documentation. https://platform.openai.com/docs/guides/function-calling — accessed 2026-04-20.

  2. Anthropic tool use documentation. https://docs.anthropic.com/en/docs/build-with-claude/tool-use — accessed 2026-04-20.

  3. Shunyu Yao et al., “ReAct: Synergizing Reasoning and Acting in Language Models,” ICLR 2023 (arXiv 2210.03629). https://arxiv.org/abs/2210.03629 — accessed 2026-04-20.

  4. Plan-and-Execute agents, LangChain blog. https://blog.langchain.dev/planning-agents/ — accessed 2026-04-20.

  5. Noah Shinn et al., “Reflexion: Language Agents with Verbal Reinforcement Learning,” NeurIPS 2023 (arXiv 2303.11366). https://arxiv.org/abs/2303.11366 — accessed 2026-04-20.

  6. OWASP Top 10 for LLM Applications, 2025 edition, LLM06:2025 Excessive Agency. https://owasp.org/www-project-top-10-for-large-language-model-applications/ — accessed 2026-04-20.

  7. Moffatt v. Air Canada, 2024 BCCRT 149. Civil Resolution Tribunal of British Columbia. https://decisions.civilresolutionbc.ca/crt/crtd/en/item/525448/index.do — accessed 2026-04-20.

  8. Replit AI Agent launch blog posts. https://blog.replit.com/ — accessed 2026-04-20.

  9. ToolBench (Qin et al., 2023, arXiv 2307.16789). https://arxiv.org/abs/2307.16789 — accessed 2026-04-20. Berkeley Function-Calling Leaderboard (BFCL). https://gorilla.cs.berkeley.edu/leaderboard.html — accessed 2026-04-20. AgentBench (Liu et al., arXiv 2308.03688). https://arxiv.org/abs/2308.03688 — accessed 2026-04-20.

  10. Regulation (EU) 2024/1689, Articles 13–15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20.