COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Article 19 of 40
Thesis. A single-turn LLM call costs a predictable amount. An agent call costs anywhere from 2× to 50× more and the distribution has a long right tail. The team that deploys agents without a cost model discovers this the hard way — through a surprise monthly bill, a tenant overrunning budget without a circuit breaker firing, or a runaway loop that consumed $4,000 in tokens over a weekend. Agentic cost architecture is a first-class design concern, not a reporting afterthought. This article teaches the token economics specific to agents, the five cost-control levers, the three-layer budget model that keeps the portfolio within envelope, and the FinOps discipline that turns cost control into an engineering habit.
Agentic token economics
A single-turn query is one input prompt plus one output completion — a predictable number of tokens. An agent loop introduces multipliers at every iteration:
Multiplier 1 — The accumulating context. Each iteration carries forward the conversation so far: iteration 5 sees the accumulated context of iterations 1–4. Input tokens per iteration grow at least linearly with iteration count, which means cumulative input tokens across the session grow at least quadratically.
Multiplier 2 — The reasoning-text emission. ReAct thoughts, plans, critiques are all tokens the model generates. A complex task easily produces 500–2000 tokens of pure reasoning in addition to the final answer.
Multiplier 3 — The tool-output re-ingestion. Every tool call returns text that enters the next iteration’s context. A tool returning 2K tokens contributes 2K input tokens to every subsequent iteration.
Multiplier 4 — The retrieval round-trips. Agentic RAG (Article 13) with three hops runs the retrieval + generation cycle three times.
Multiplier 5 — The self-critique. Reflexion loops double the work per iteration (act + critique) and triple it when a revise step is included.
Multiplier 6 — The multi-agent fan-out. A three-agent hierarchical workflow roughly triples the work of a single agent on the same task.
A concrete example: a customer-support agent resolving a complex refund case might run 8 iterations × (2K input + 300 output) per iteration = 16K input + 2.4K output, plus tool calls returning 4K and retrieval returning 8K, totalling ~30K tokens — versus a single-turn answer of maybe 1K total. The 20–30× cost multiplier is typical, not exceptional.
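The worked example above can be reproduced as a small cost sketch. The prices are hypothetical placeholders, not any provider's actual rates; check the current price sheet before relying on the numbers.

```python
# Hypothetical per-1K-token prices, illustrative only; use your provider's price sheet.
PRICE_PER_1K_INPUT = 0.003   # USD
PRICE_PER_1K_OUTPUT = 0.015  # USD

def agent_task_tokens(iterations, input_per_iter, output_per_iter,
                      tool_tokens, retrieval_tokens):
    """Rough token totals for one agent session, mirroring the refund example."""
    input_tokens = iterations * input_per_iter + tool_tokens + retrieval_tokens
    output_tokens = iterations * output_per_iter
    return input_tokens, output_tokens

def usd_cost(input_tokens, output_tokens):
    """Dollar cost at the assumed prices above."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# 8 iterations x (2K input + 300 output), plus 4K of tool output and 8K of
# retrieval re-ingested as input, versus a ~1K-token single-turn baseline.
agent_in, agent_out = agent_task_tokens(8, 2000, 300, 4000, 8000)
multiplier = usd_cost(agent_in, agent_out) / usd_cost(1000, 300)
```

At these assumed prices the dollar multiplier comes out around 16×; with heavier reasoning output or more retrieval hops it lands in the 20–30× band the text describes.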
The five cost-control levers
Lever 1 — Step and budget caps
Every agent session has hard caps: max_steps, max_tokens, max_tool_calls, max_usd_cost, max_wall_seconds. Caps are enforced in the runtime, not the prompt. Exceeding any cap triggers graceful halt and escalation (Article 9). Caps are set at the task-class level — simple tasks have tight caps, complex tasks have looser caps, but nothing is uncapped.
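A minimal sketch of runtime-side cap enforcement, checked before every step rather than trusted to the prompt. The field names and default values are illustrative assumptions, not a recommendation for any particular task class.

```python
from dataclasses import dataclass
import time

@dataclass
class SessionCaps:
    """Hard caps for one task class. Values here are placeholders."""
    max_steps: int = 10
    max_tokens: int = 50_000
    max_tool_calls: int = 20
    max_usd_cost: float = 1.50
    max_wall_seconds: float = 120.0

class CapExceeded(Exception):
    """Raised by the runtime to trigger graceful halt and escalation."""

class SessionBudget:
    """Enforced in the runtime: the loop calls check() before every step."""
    def __init__(self, caps: SessionCaps):
        self.caps = caps
        self.steps = self.tokens = self.tool_calls = 0
        self.usd = 0.0
        self.started = time.monotonic()

    def check(self):
        c = self.caps
        if self.steps >= c.max_steps:
            raise CapExceeded("max_steps")
        if self.tokens >= c.max_tokens:
            raise CapExceeded("max_tokens")
        if self.tool_calls >= c.max_tool_calls:
            raise CapExceeded("max_tool_calls")
        if self.usd >= c.max_usd_cost:
            raise CapExceeded("max_usd_cost")
        if time.monotonic() - self.started >= c.max_wall_seconds:
            raise CapExceeded("max_wall_seconds")
```

The caught `CapExceeded` is where the graceful-halt-and-escalation path (Article 9) begins; the loop never silently continues past a cap.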
Lever 2 — Summarization policy
As context grows, summarization compresses older turns into a compact summary. Policies vary: summarize every N turns, summarize when context exceeds X tokens, summarize when topic changes. Summarization is itself a model call, so there is a trade-off — too frequent summarization costs more than it saves; too infrequent causes context explosion.
Summarization loses information; the architect specifies what must be preserved verbatim (the user’s original request, critical decisions made, the exact parameters needed to reconstruct actions) and what can be summarized freely (conversational pleasantries, intermediate reasoning no longer relevant).
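The trigger-plus-preservation policy can be sketched as follows. The thresholds, message schema, and role names are assumptions for illustration; in a real system the summary text comes from a model call rather than a stub.

```python
def should_summarize(context_tokens, turns_since_summary,
                     token_threshold=8000, turn_threshold=6):
    """Hybrid trigger: summarize when context is large OR the last summary
    is stale. Threshold values are illustrative; tune per task class."""
    return context_tokens > token_threshold or turns_since_summary >= turn_threshold

def compress(history, preserve_roles):
    """Keep preserve-verbatim items; replace everything else with a summary
    stub. A production version would call a model to write the summary."""
    kept = [m for m in history if m["role"] in preserve_roles]
    dropped = len(history) - len(kept)
    summary = {"role": "summary",
               "content": f"[{dropped} earlier turns summarized]"}
    return [summary] + kept
```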
Lever 3 — Tiered-model routing
Not every step needs the best model. The architect’s cost strategy routes simple steps to cheaper models and complex steps to more capable models. A planner agent might use the premium model; executors use a lower tier; classifiers use an even cheaper tier. Routing rules live in the runtime, not in the prompt.
The frontier labs make this explicit: Anthropic’s Haiku/Sonnet/Opus tiers, OpenAI’s GPT-4o-mini/GPT-4o/o1 tiers, Google’s Flash/Pro/Ultra tiers. Routing is a straightforward engineering decision once the team admits not every step warrants the premium.
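A routing rule of this kind is a few lines in the runtime. The tier table below uses placeholder model names, not endorsements of specific models; the escalate-on-retry behavior is one common pattern, not the only one.

```python
# Illustrative tier table; model names are placeholders.
TIERS = {
    "classify": "cheap-model",
    "execute":  "mid-model",
    "reflect":  "mid-model",
    "plan":     "premium-model",
}

def route(step_kind: str, escalated: bool = False) -> str:
    """Runtime routing: the step kind picks the tier. A retry after a
    failure escalates to the premium tier; unknown step kinds fail safe
    to premium rather than cheap."""
    if escalated:
        return TIERS["plan"]
    return TIERS.get(step_kind, TIERS["plan"])
```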
Lever 4 — Prompt caching
Anthropic’s prompt caching (2024), OpenAI’s automatic cache (2024), and Google’s equivalents cache the static portion of prompts so repeat calls with the same prefix pay a reduced rate. For agents with long stable system prompts, caching cuts costs substantially. The architect structures prompts to maximize cache hits — stable content (system prompt, tool definitions, few-shot examples) at the start; variable content (conversation, retrievals) at the end.
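The cache-friendly layout can be made mechanical: keep the stable prefix byte-identical across calls and append everything variable afterwards. The message schema below is a simplified sketch; the ordering is the point.

```python
def build_prompt(system, tools, examples, conversation, retrievals):
    """Order messages so the stable prefix is byte-identical across calls,
    maximizing provider-side prefix-cache hits."""
    stable = [
        {"role": "system", "content": system},    # frozen for the session
        {"role": "system", "content": tools},     # tool definitions, frozen
        {"role": "system", "content": examples},  # few-shot examples, frozen
    ]
    variable = list(conversation) + [
        {"role": "system", "content": r} for r in retrievals  # changes per call
    ]
    return stable + variable
```

Any edit to the stable prefix, even reordering two tool definitions, invalidates the cached prefix from that point on, so the stable section is treated as immutable for the session.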
Lever 5 — Batch offline processing
Asynchronous agent tasks — overnight research reports, bulk document processing — can use batch APIs (OpenAI Batch, Anthropic batch, others) at 50% of interactive pricing. The architect classifies agentic workloads into synchronous (must complete now) and asynchronous (can complete within 24 hours) and routes the asynchronous class to batch.
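The synchronous/asynchronous split is a one-function classifier in the runtime. The field names and the 24-hour window below are assumptions chosen to match typical batch-API completion guarantees, not any provider's contract.

```python
def route_workload(task: dict) -> str:
    """Classify a task as batch-eligible or interactive. Batch is chosen
    only when nobody is waiting AND the deadline tolerates the batch
    window (assumed 24 hours here)."""
    if task.get("deadline_hours", 0) >= 24 and not task.get("user_waiting", True):
        return "batch"        # typically ~50% of interactive pricing
    return "interactive"
```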
The three-layer cost model
Layer 1 — Per-task cost
Cost for a single agent session resolving a single task. The atomic unit; all other metrics derive from it. Tracked with every session outcome record (Article 18).
Per-task cost includes: LLM tokens (input + output + cached), tool-call costs (retrieval, external APIs), infrastructure amortization (runtime compute, vector-store amortization), human-intervention cost (reviewer time × hourly cost).
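The four components can be summed per session record. All rate values here are operator-supplied assumptions for illustration, not provider prices.

```python
def per_task_cost(usage: dict, rates: dict) -> float:
    """Sum the four per-task components: LLM tokens, tool calls,
    infrastructure amortization, and human-intervention time."""
    llm = (usage["input_tokens"] * rates["usd_per_input_token"]
           + usage["output_tokens"] * rates["usd_per_output_token"]
           + usage["cached_tokens"] * rates["usd_per_cached_token"])
    tools = sum(usage["tool_call_costs"])                      # retrieval, APIs
    infra = usage["wall_seconds"] * rates["usd_infra_per_second"]
    human = usage["review_minutes"] / 60 * rates["reviewer_usd_per_hour"]
    return round(llm + tools + infra + human, 4)
```

This number is what gets written onto the session outcome record (Article 18); the workload and portfolio layers only ever roll it up.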
Layer 2 — Workload cost
Cost for an agent configuration over a period. Rolls up per-task costs; reveals distributional characteristics (P50, P95, P99 of per-task cost). The P99 tail is where runaway incidents live.
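The rollup itself is standard percentile math over per-task costs; a sketch using the standard library:

```python
import statistics

def workload_percentiles(per_task_costs: list[float]) -> dict:
    """P50/P95/P99 of per-task cost for one agent configuration over a
    period. A P99 far above P50 is the signature of runaway sessions."""
    qs = statistics.quantiles(per_task_costs, n=100, method="inclusive")
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Alerting on the mean hides the tail; the weekly review (see the FinOps section) watches P99 and the P99/P50 ratio instead.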
Layer 3 — Portfolio cost
Cost across all agent configurations for a tenant or organization over a period. The FinOps-level view. Drives budget allocation, re-prioritization, and architectural decisions (which agents are worth the cost, which are not).
Context explosion — the specific failure mode
The architect’s primary cost villain is context explosion. A loop that runs 20 iterations with 3KB of conversation and 5KB of tool output per iteration accumulates 160KB of context by the last turn. At typical token-per-byte ratios, that is roughly 40K input tokens on the final iteration alone; because every iteration re-reads all of the context accumulated before it, cumulative input tokens across the session grow quadratically with iteration count.
Detection: context-size monitoring per iteration; alerts when a session’s input-tokens-per-iteration exceeds threshold.
Mitigation: summarization at configurable thresholds; externalized state (store full history outside context, retrieve only relevant portions); sliding-window context policies; hard step caps.
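The arithmetic and the sliding-window mitigation can be checked in a few lines. The 3KB/5KB figures mirror the example above; the window parameter is the mitigation knob.

```python
def context_bytes_at_iteration(n, conv_kb=3, tool_kb=5, window=None):
    """Context carried into iteration n, assuming conv_kb of conversation
    and tool_kb of tool output accumulate per iteration. With no window
    this reproduces the 20-iteration, 160KB figure from the text; a
    sliding window caps the growth at window * per-iteration bytes."""
    per_iter = (conv_kb + tool_kb) * 1024
    carried = n if window is None else min(n, window)
    return carried * per_iter
```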
Per-tenant budget policy
In multi-tenant deployments, each tenant has a cost envelope. The policy enforces it.
Soft limits. Tenant exceeds expected monthly cost by N%; alert to tenant-admin and to ops.
Hard limits. Tenant exceeds the monthly cost cap; new sessions are denied; active sessions run to completion, but no new starts until the cap resets or the tenant-admin authorizes the overrun.
Per-session caps. Each tenant has per-session max-cost; breach triggers session kill.
Attribution. Every cost unit is attributed to a specific tenant, agent config, and session. The attribution is the primary key for chargeback and for anomaly diagnosis.
The policy lives alongside the authorization stack (Article 6) as a parallel control layer. Cost is a resource like any other; the runtime manages it or it doesn’t get managed.
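The soft-limit/hard-limit admission decision can be sketched as a pure function in the runtime. Field names and the percentage convention are illustrative assumptions.

```python
def admit_session(tenant: dict, policy: dict) -> list[str]:
    """Admission decision for a new session against the tenant's envelope.
    Soft limit: keep serving but alert. Hard limit: deny new sessions
    unless the tenant-admin has authorized an overrun."""
    spent = tenant["month_usd"]
    decisions = []
    soft_ceiling = policy["expected_month_usd"] * (1 + policy["soft_overrun_pct"] / 100)
    if spent > soft_ceiling:
        decisions.append("alert")            # notify tenant-admin and ops
    if spent >= policy["hard_cap_usd"] and not tenant.get("overrun_authorized"):
        return ["deny"] + decisions          # no new starts
    return ["admit"] + decisions
```

Per-session max-cost caps are enforced separately inside the running session (see the step-and-budget-cap lever); this function only gates admission.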
FinOps discipline
FinOps is the operational discipline that keeps cost under control over time. Agentic FinOps has specific rituals.
Weekly review. Cost by agent config + tenant + task class; flagged anomalies; error-budget-style cost budget consumption.
Monthly optimization. Which configurations are above plan? Which have shown cost regressions (cost per task rising over weeks)? Which prompts could be shortened? Which tools have inefficient outputs?
Quarterly portfolio review. Is each agent delivering value commensurate with its cost? Which should be retired? Which should scale up? Which new configs warrant investment?
Incident review. Every cost anomaly (a runaway session, a tenant overrun, a model price change catching the team flat-footed) produces a post-mortem. Lessons feed back into caps, policies, and architecture.
Model cost rotations. As frontier-lab prices drop (and they drop regularly), agent configs are re-tested against cheaper tiers. A task that needed GPT-4 in 2024 may not need it in 2026.
Framework parity — cost observability
- LangGraph — per-node cost tracking via callbacks; LangSmith aggregates; custom attribution via OTel.
- CrewAI — callback emission of cost per task; custom aggregation.
- AutoGen — cost tracking via usage events; aggregation custom.
- OpenAI Agents SDK — native cost in trace exports; OpenAI dashboard aggregation.
- Semantic Kernel — cost via OTel attributes; Azure cost analysis integration.
- LlamaIndex Agents — cost tracking via callback manager; integrations with observability backends.
Platform strategy: every model call and every tool call emits a cost attribute to the OTel trace; aggregation happens at the platform layer; dashboards are consistent across frameworks.
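A sketch of the attribution shape: every model or tool-call span carries cost attributes, and the platform layer rolls them up per tenant. The attribute names below are a naming convention invented for illustration, not an official OpenTelemetry semantic convention, and the spans are plain dicts standing in for real trace spans.

```python
def cost_attributes(tenant_id, agent_config, session_id, usd, tokens):
    """Attribute set attached to every model/tool-call span. Names are an
    illustrative convention; align them with your OTel schema."""
    return {
        "gen_ai.cost.usd": usd,
        "gen_ai.usage.total_tokens": tokens,
        "app.tenant_id": tenant_id,
        "app.agent_config": agent_config,
        "app.session_id": session_id,
    }

def aggregate_by_tenant(spans):
    """Platform-layer rollup: per-tenant dollar totals from span attributes.
    The same keys support per-config and per-session rollups."""
    totals = {}
    for s in spans:
        t = s["app.tenant_id"]
        totals[t] = totals.get(t, 0.0) + s["gen_ai.cost.usd"]
    return totals
```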
Five cost anti-patterns
Anti-pattern 1 — No step cap. An agent can run indefinitely. One bug or attack consumes unbounded budget.
Anti-pattern 2 — All calls on the premium model. The agent uses the same top-tier model for every thought, every tool selection, every reflection. Costs compound to no benefit.
Anti-pattern 3 — No prompt caching. Large system prompts re-encoded on every call. With caching available, teams pay 2–5× more than they need to.
Anti-pattern 4 — No tenant attribution. A shared cost pool hides tenant-level anomalies until the bill arrives.
Anti-pattern 5 — Cost review as an afterthought. Monthly bill surprise; quarterly budget overrun; no pre-committed optimization policy. Without FinOps rituals, cost drifts up.
Real-world anchor — Anthropic prompt caching
Anthropic’s prompt caching feature (2024) reduced the cost of agent calls with long stable system prompts by up to 90%. For agentic workloads — where system prompts are typically long and frequently repeated — prompt caching is often the single highest-leverage cost control. Architects structure prompts to maximize cache hit rates (stable content at the top, variable content at the end) and monitor cache hit rate as an SLI. Source: docs.anthropic.com prompt caching.
Real-world anchor — OpenAI token usage and batch APIs
OpenAI’s pricing and batch API documentation (public) show the differential pricing between synchronous (premium) and batch (discount) calls, the caching behavior, and the tier distinctions. The architect’s cost model incorporates these primitives — batch for asynchronous workloads, caching for stable prompts, tier routing for mixed-complexity tasks. Source: platform.openai.com/docs/pricing.
Real-world anchor — Devin per-task economics discussions
Cognition AI’s Devin — an autonomous software-engineering agent — has been the subject of public discussions (2024) about per-task economics: an agent that runs for an hour on a coding ticket can consume hundreds of dollars in tokens. The Devin pricing discussion is instructive for architects designing long-horizon agents — at some autonomy level the per-task cost becomes a first-order constraint on which tasks make economic sense for agents versus humans. Source: cognition.ai and industry commentary.
Closing
Six multipliers, five levers, three layers, five rituals, five anti-patterns. Agentic cost is architecturally managed or it is not managed. Article 20 takes up the platform architecture that amortizes cost across the portfolio — where the investments in levers and observability become leverage for the next agent built.
Learning outcomes check
- Explain agentic token economics and the six multipliers that compound classical LLM cost.
- Classify five cost-control levers (caps, summarization, tiered routing, prompt caching, batch processing) and their trade-offs.
- Evaluate a cost model for blind spots against context explosion, missing attribution, absent caps, and poor caching strategy.
- Design a per-tenant budget policy including soft limits, hard limits, per-session caps, attribution, and FinOps rituals.
Cross-reference map
- Core Stream: EATE-Level-3/M3.3-Art13-Cost-Economics-for-Agentic-Deployments.md
- Sibling credential: AITM-OMR Article 10 (ops-management FinOps); AITF-DDA Article 9 (data-science cost model).
- Forward reference: Articles 20 (platform amortization), 24 (lifecycle cost evaluation), 39 (build vs buy cost factors).