AITM M1.2-Art05 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Tool Use and Function Calling


13 min read · Article 5 of 14



A language model that can produce text is a capability. A language model that can invoke functions and receive their results is a product. The difference between the two is the tool layer, and the transition between them is the moment the governance posture of a feature changes qualitatively. A practitioner who adds tools to a prompt without adding tool-level controls has added autonomy the organisation has not licensed. This article covers the prompt side of tool use: how a function-call schema is authored, how permission scopes are expressed, how failures are handled, and how to recognise the threshold at which a feature has become agentic and the agentic controls of Article 6 are required.

What tool use is and why it matters

Tool use, also called function calling, is a prompt pattern in which the model is presented with the signatures of functions it may invoke and is permitted, during generation, to emit a structured request to invoke one of them. The orchestration layer intercepts the request, executes the function, and returns the result to the model, which then continues. The ReAct paper by Yao et al. (2023), published at ICLR, is the foundational work on interleaved reasoning and action [1]; every major provider has since implemented the pattern as a first-class API primitive. OpenAI’s function calling, introduced in June 2023 [2], Anthropic’s tool use [3], Google’s function declarations [4], and the open-source equivalents on Llama, Mistral, and Qwen share a common shape, with minor differences in syntax.

The pattern matters because it closes the gap between a chatbot and an application. A chatbot asked what is the status of my order? produces fluent prose that may or may not be correct. A tool-using assistant calls a get_order_status function with the user’s order ID, receives a ground-truth response from the order system, and relays the accurate information. The same shift opens the door to meaningful real-world actions: scheduling meetings, submitting expenses, drafting and sending emails, creating tickets. Each action is a place where the model can make a consequential mistake, and each is therefore a place where controls belong.

Anatomy of a function-call schema

A function-call schema has a name, a description, a parameters object declaring typed fields, and a set of required fields. Every provider’s format renders these four parts, with minor variations:

{
  "name": "get_order_status",
  "description": "Retrieve the current shipping status of a customer order.",
  "parameters": {
    "type": "object",
    "properties": {
      "order_id": { "type": "string", "description": "The order identifier." }
    },
    "required": ["order_id"]
  }
}

The name is what the orchestration layer will match against its function registry. The description is what the model uses to decide whether the function applies; a vague description produces erratic tool selection, and a precise description is one of the highest-leverage authoring concerns. The parameters object declares the typed contract; strict schemas prevent the model from emitting arguments the function cannot parse. The required field prevents the model from omitting essential arguments.

A practitioner writes schemas with four habits. The description is precise about when the function should be called and, equally, when it should not. Parameters are minimal; every parameter increases the chance of misuse. Types are strict; a string where an integer is expected is a defect. And each parameter has a description that constrains its semantic range, because the model reads the descriptions and uses them to decide what to emit.
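The four habits above can be made concrete with a hypothetical cancel_order schema and a minimal argument validator of the kind an orchestration layer might run before execution. The schema content and the helper names here are illustrative, not a specific provider's API:

```python
# A hypothetical cancel_order schema illustrating the four authoring habits:
# precise description (including when NOT to call), minimal parameters,
# strict types, and a per-parameter description.
CANCEL_ORDER_SCHEMA = {
    "name": "cancel_order",
    "description": (
        "Cancel an order that has not yet shipped. "
        "Do NOT call this for orders already marked shipped or delivered; "
        "ask the user to contact support instead."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "The order identifier, e.g. 'ORD-12345'.",
            }
        },
        "required": ["order_id"],
    },
}

# Mapping from JSON Schema scalar types to Python types (sketch only).
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def validate_args(schema: dict, args: dict) -> list[str]:
    """Return a list of validation errors; an empty list means well-formed."""
    errors = []
    params = schema["parameters"]
    for fname in params.get("required", []):
        if fname not in args:
            errors.append(f"missing required field: {fname}")
    for fname, value in args.items():
        spec = params["properties"].get(fname)
        if spec is None:
            errors.append(f"unexpected field: {fname}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type for {fname}: expected {spec['type']}")
    return errors
```

A rejected call returns the error list to the model for a single retry, which is the handling path described later in the tool-call loop.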

Permission scoping as a prompt-adjacent control

A function signature by itself does not govern permission. Permission is the question: under whose authority may this function be called, and within what limits? A get_order_status function called for order 12345 on behalf of user X is a legitimate lookup if user X is the owner of that order, and a privacy breach if not. The function’s implementation must enforce the identity-based constraint; the prompt and its schema must participate by naming the constraint explicitly so that the model does not construct arguments that cross the line by accident.

The practical discipline is a permission envelope authored alongside the schema. For each function, the envelope records: who may invoke it (which end users, authenticated how); which parameter values are admissible (an order_id that does not belong to the authenticated user is rejected by the orchestration layer, not by the model); what result fields the model is permitted to receive (a full order record may expose more than the user needs, and the orchestration layer may project the result to the fields the assistant actually needs); and what rate or volume limits apply (a drain-the-database pattern is rate-limited, not asked).

The envelope is a layer the model does not see directly. Its effects are visible to the model only as refusals or sanitised results. This is deliberate. The OWASP Top 10 for Large Language Model Applications catalogues excessive agency as LLM06 [5]; the defence is precisely that the model’s authority to act is narrower than its ability to speak, enforced by a layer the model cannot argue with.

[DIAGRAM: OrganizationalMappingBridge — aitm-pew-article-5-permission-mapping — Left: user identity + session scope; right: allowed functions + allowed parameter ranges; bridge beams show the mapping from authenticated identity to tool permission envelope.]

The tool-call loop and its failure modes

A tool-using interaction runs a loop: the model emits a tool-call request; the orchestration layer validates and executes; the result returns to the model; the model incorporates and either emits another tool call, emits a final answer, or emits a refusal. Each edge of the loop has a failure mode.

A tool-call request can be malformed, with missing arguments or wrong types. The orchestration layer rejects and returns the error to the model, which retries. A single retry is typically sufficient; a loop that retries indefinitely is a defect.
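The loop and its bounded retry can be sketched as follows. Here `call_model`, `execute`, and `validate` stand in for the provider SDK, the function registry, and the schema validator; only the control flow is the point:

```python
# Bounded tool-call loop: one retry for malformed requests, a hard cap on
# total steps, and a graceful fallback answer. All names are illustrative.
MAX_MALFORMED_RETRIES = 1

def run_tool_loop(call_model, execute, validate, max_steps=5):
    messages, retries = [], 0
    for _ in range(max_steps):
        reply = call_model(messages)
        if reply["type"] == "final_answer":
            return reply["text"]
        # reply is a tool-call request: validate before executing.
        errors = validate(reply["name"], reply["arguments"])
        if errors:
            retries += 1
            if retries > MAX_MALFORMED_RETRIES:
                return "Sorry, I could not complete that request."
            messages.append({"role": "tool", "error": "; ".join(errors)})
            continue
        result = execute(reply["name"], reply["arguments"])
        messages.append({"role": "tool", "content": result})
    return "Sorry, I could not complete that request."
```

The `max_steps` cap is what turns "a loop that retries indefinitely" from a possible defect into an impossibility.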

A tool-call request can be permissible in syntax but impermissible in intent, for example asking for an order that the authenticated user does not own. The orchestration layer rejects, typically with a non-informative message (the user is not authorised; no further detail) so that the model does not learn a more precise oracle from the refusal pattern.

An executed function can return an error, a timeout, or an unexpected result. The model needs to handle each case coherently. A prompt that addresses only the happy path produces brittle behaviour; a prompt that includes a short paragraph on failure handling produces robust behaviour.

A sequence of tool calls can compound. The model calls get_order_status, then calls cancel_order, then calls issue_refund. Each call individually may be permissible; the sequence may be the wrong action at the wrong time. This is the threshold at which the feature has crossed into agentic territory and Article 6 applies.

[DIAGRAM: StageGateFlow — aitm-pew-article-5-tool-call-flow — Sequence: model requests -> schema validator -> permission check -> execution -> result sanitizer -> model consumption -> answer emission; gates at each step.]

When tool use becomes agentic

The distinction between tool use and agentic behaviour is not a bright line in the code; it is a posture about the feature. A feature that exposes one or two tools, invokes at most one per turn, and returns an answer is tool-using. A feature that composes many tools across multiple turns, maintains state across invocations, and pursues a goal over time is agentic.

A practical threshold: if the feature can chain three or more tool calls without a user confirmation at the mid-points, or if it can loop over tool calls until a stopping condition, the feature is agentic and the controls in Article 6 (autonomy classification, checkpoints, kill-switch, reflection bounds) apply. A feature that remains below the threshold can be governed as tool use; a feature that crosses it and is still governed as tool use is under-controlled.
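The chain-length threshold can be enforced mechanically in the orchestration layer. The sketch below counts uninterrupted tool calls and demands a user confirmation before the third; the class and constant names are illustrative:

```python
# Chain-length guard: after two uninterrupted tool calls, require a user
# confirmation before a third fires. The limit value mirrors the practical
# threshold described in the text.
AGENTIC_CHAIN_LIMIT = 3

class ChainGuard:
    def __init__(self, limit: int = AGENTIC_CHAIN_LIMIT):
        self.limit = limit
        self.uncontested_calls = 0

    def before_call(self) -> bool:
        """Return True if the call may proceed without user confirmation."""
        if self.uncontested_calls + 1 >= self.limit:
            return False   # require confirmation: the agentic threshold is reached
        self.uncontested_calls += 1
        return True

    def user_confirmed(self):
        """A user confirmation at a mid-point resets the chain counter."""
        self.uncontested_calls = 0
```

A feature that routinely trips the guard is a signal that the team should stop treating it as tool use and adopt the Article 6 controls instead.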

Idempotency and reversibility

Two properties of tools deserve explicit design attention. A tool is idempotent when invoking it multiple times with the same arguments produces the same outcome as invoking it once. A tool is reversible when the action it performs can be undone cleanly. Both properties reduce the consequence of model error.

An idempotent create_ticket function that checks for an existing ticket with the same signature before creating a new one protects the feature from a common failure pattern: the model calls the tool, the network drops the response, the model retries, and two tickets are created. An idempotency key provided in the tool schema and enforced by the implementation closes the gap. A practitioner authoring a create-style tool asks whether idempotency applies and, if so, declares the key.
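The retry-after-dropped-response pattern and its idempotency-key fix can be sketched in a few lines. The in-memory `tickets` dict stands in for the real ticket store; no specific vendor API is implied:

```python
# Idempotent ticket creation keyed on an idempotency key supplied in the
# tool call. A replay with the same key returns the original ticket instead
# of creating a duplicate.
tickets: dict[str, dict] = {}   # idempotency_key -> ticket

def create_ticket(idempotency_key: str, summary: str) -> dict:
    """Create a ticket once; replays with the same key return the original."""
    if idempotency_key in tickets:
        return tickets[idempotency_key]   # replay: no duplicate created
    ticket = {"id": f"TKT-{len(tickets) + 1}", "summary": summary}
    tickets[idempotency_key] = ticket
    return ticket
```

The key belongs in the tool schema as a declared parameter so the model supplies it, and in the implementation so the store enforces it; declaring it in only one place leaves the gap open.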

Reversibility is distinct. A cancel_order function may be idempotent (cancelling an already-cancelled order is a no-op) without being reversible (un-cancelling an order may not be possible once downstream systems have processed the cancellation). Irreversible actions deserve an explicit confirmation step, either in the orchestration layer (human approval before the tool fires) or in the prompt (the model must solicit user confirmation and receive it before emitting the tool call). The cost of an extra turn is much smaller than the cost of an irreversible mistake.

Tool-schema evolution

Tool schemas evolve, and evolution is a governance event. Adding a parameter is typically compatible; removing a parameter is typically breaking. A feature that relies on a deprecated tool schema will misbehave when the tool’s implementation moves on. The tool-schema version is part of the prompt’s model binding from Article 9; a schema change is a prompt change even when the prompt’s text did not change. Teams that treat tool schemas as separate code without this linkage discover the coupling the hard way.
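One way to make the coupling explicit is to fold the tool schemas into the prompt's version fingerprint, so a schema edit surfaces as a prompt change even when the prompt text is untouched. The hashing scheme below is an illustrative sketch, not a mandated format:

```python
import hashlib
import json

def binding_fingerprint(prompt_text: str, tool_schemas: list[dict]) -> str:
    """Hash the prompt text together with canonicalised tool schemas, so a
    change to either produces a new fingerprint."""
    canonical = json.dumps(tool_schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prompt_text + canonical).encode()).hexdigest()
```

A deploy pipeline that records this fingerprint will flag a schema-only change as a new prompt version automatically, which is the linkage the text describes.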

Evaluation and observability of tool use

Tool-using features are evaluated along a different axis from text-only features. The metrics include tool-selection accuracy (did the model pick the right tool), argument correctness (did the arguments match the task), result usefulness (did the function return information that helped), answer correctness given the result, and authority adherence (did the model ever attempt a call outside the envelope). Arize, Langfuse, Weights & Biases, MLflow, Humanloop, and WhyLabs each expose the trace-level instrumentation that makes these metrics measurable; an open-source team can equally assemble equivalent telemetry on top of OpenTelemetry and a column store. The choice of vendor is not prescriptive; the instrumentation is.
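Two of these metrics, tool-selection accuracy and authority adherence, can be computed directly from trace records. The trace field names below are illustrative assumptions, not any vendor's schema:

```python
# Compute tool-selection accuracy and authority violations from a list of
# trace records. "expected_tool" is present only on labelled evaluation
# traces; "permission_denied" marks calls the envelope rejected.
def tool_metrics(traces: list[dict]) -> dict:
    labelled = [t for t in traces if "expected_tool" in t]
    correct = sum(1 for t in labelled if t["called_tool"] == t["expected_tool"])
    violations = sum(1 for t in traces if t.get("permission_denied", False))
    return {
        "tool_selection_accuracy": correct / len(labelled) if labelled else None,
        "authority_violations": violations,
    }
```

The same computation works whether the traces come from a vendor platform or from homegrown OpenTelemetry spans, which is the point of the vendor-neutral stance above.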

Two real examples

OpenAI function calling, June 2023. The initial release introduced the pattern as a typed alternative to free-text prompting for structured output and external actions [2]. The release notes described the feature in terms of deterministic shape and explicit authority, both properties that distinguish tool-using features from free-text assistants. The release did not claim the feature was a safety layer; subsequent practice has shown that the feature becomes a safety layer only when the permission envelope and the orchestration-side validator are present.

Anthropic tool use, 2024. Anthropic’s tool use documentation emphasises the same shape and adds explicit guidance that tool descriptions drive tool selection [3]. A practitioner reading both Anthropic’s and OpenAI’s documentation will find the recommendations nearly identical, confirming that the pattern has converged across providers.

Security considerations for tool-using features

Tool use opens three security concerns distinct from text-only features. The first is credential handling: a tool that acts on behalf of a user requires that user’s credentials or a scoped token, and the orchestration layer must pass credentials safely (never in the prompt, never logged) and scope them to the specific tool and the specific request. The second is result sanitisation: a tool’s return value may include content that, if passed back into the prompt verbatim, functions as indirect injection; sanitising tool results for instruction-shaped content is the symmetric counterpart to sanitising retrieval chunks. The third is tool-chain confused-deputy risk: a tool called by the model on behalf of a user may, if the tool’s own permissions are broader than the user’s, perform actions the user could not directly perform; the fix is to execute tools with the user’s scope rather than the tool’s service-account scope wherever feasible.
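The result-sanitisation concern can be illustrated with a deliberately crude pattern filter. The pattern list is an assumption for demonstration; production filters are broader and often model-assisted:

```python
import re

# Instruction-shaped phrases that should never pass from a tool result back
# into the prompt verbatim. Illustrative, not exhaustive.
SUSPECT_PATTERNS = [
    re.compile(r"ignore (all |any )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now", re.IGNORECASE),
    re.compile(r"system prompt", re.IGNORECASE),
]

def sanitise_tool_result(text: str) -> str:
    """Replace instruction-shaped spans so a tool result cannot act as
    indirect injection when it re-enters the prompt."""
    for pattern in SUSPECT_PATTERNS:
        text = pattern.sub("[removed]", text)
    return text
```

The symmetric point in the text is worth repeating: this is the same control applied to retrieval chunks, relocated to the tool-result edge of the loop.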

Each of these concerns has mature countermeasures in traditional software security; the tool layer inherits them. A practitioner who has a background in API security will recognise the patterns; a practitioner who does not should consult with the security team when designing the tool layer, because the specific threats have appeared in incident reports across several public LLM deployments.

Observability at the tool layer

Observability for tool-using features requires a slightly richer trace than text-only observability. The trace records: the user’s input, the system prompt version, the tools offered to the model, the tool calls the model emitted (with arguments), the permission-layer decisions on each call, the tool results, and the final model output. Each step is timestamped, tagged with a correlation identifier, and retained for the feature’s declared retention window.
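The trace described above can be modelled as a correlation-scoped record of timestamped steps. The field and class names are illustrative; the structure, not the naming, is the point:

```python
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    correlation_id: str
    step: str        # e.g. "tool_call", "permission_check", "result"
    payload: dict
    timestamp: float = field(default_factory=time.time)

class Trace:
    """One trace per interaction; every step shares a correlation id."""
    def __init__(self, system_prompt_version: str):
        self.correlation_id = str(uuid.uuid4())
        self.system_prompt_version = system_prompt_version
        self.steps: list[TraceStep] = []

    def record(self, step: str, payload: dict):
        self.steps.append(TraceStep(self.correlation_id, step, payload))
```

Retention-window enforcement and export to a backing store are omitted here; in practice they are the parts that make the trace usable for audit rather than only for debugging.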

The trace supports incident response, debugging, and audit. An incident responder investigating a user complaint about an unexpected action can reconstruct exactly what the model was asked, what it considered, what it attempted, and what was permitted. A developer debugging a flaky feature can locate the exact step at which the chain diverged from expectations. An auditor can verify that the permission layer acted as declared. A feature without this trace is a feature whose behaviour is, operationally speaking, opaque, and opacity is not a governance-friendly posture.

Summary

Tool use turns a language model into an application. A function-call schema names the tool, describes when to use it, declares strict parameters, and lists required fields. A permission envelope, authored alongside the schema and enforced outside the model, binds tool calls to the authenticated user and the legitimate parameter range. Failure modes include malformed requests, unauthorised requests, execution errors, and compound sequences, each with a named handling path. When a feature chains tools across turns in pursuit of a goal, it has become agentic, and Article 6 applies.

Further reading in the Core Stream: Tool Use and Function Calling in Autonomous AI Systems and AI Use Case Delivery Management.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Shunyu Yao et al. ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023. https://arxiv.org/abs/2210.03629 — accessed 2026-04-19.

  2. Function calling and other API updates. OpenAI, 13 June 2023. https://openai.com/index/function-calling-and-other-api-updates/ — accessed 2026-04-19.

  3. Tool use (function calling). Anthropic documentation. https://docs.anthropic.com/en/docs/build-with-claude/tool-use — accessed 2026-04-19.

  4. Function calling. Google Gemini API documentation. https://ai.google.dev/gemini-api/docs/function-calling — accessed 2026-04-19.

  5. OWASP Top 10 for Large Language Model Applications, 2025. OWASP Foundation. https://genai.owasp.org/llm-top-10/ — accessed 2026-04-19.