COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Article 10 of 35
An engineering team builds a retrieval-augmented customer-service copilot. The build cost is modest, the offline evaluations look strong, the initial launch goes smoothly. Three months in, the cost per conversation is 4.2x what the business case projected. The product team has not changed anything; users are using the feature as intended. Investigation surfaces five sources of the overshoot. The conversation context grew from the planned 2,000 tokens to an average of 8,500 tokens because users ask follow-up questions that include prior turns. The retrieval hop count grew from the planned two hops to an average of four because the grounding corpus was sparse and the retriever kept expanding. The tool-call count grew from one to three because the agent chose to verify more often than the design anticipated. The cache hit rate dropped from a projected 60% to 15% because requests are too varied to cache effectively. And the model class is being up-routed on a quarter of requests because the reasoning is more complex than expected. Each of these is a separate component of per-request token cost, with its own scaling behaviour and its own lever for reduction.
The five components of token cost
Generative-system cost decomposes into five components, each with a characteristic scaling behaviour and a characteristic lever for reduction.
Input-token cost. Every request’s prompt, context, retrieved documents, and instructions are tokenised and charged at the input-token price. Input tokens typically account for the majority of token spend in RAG and agentic systems because the context carries retrieved passages. Input tokens scale with context length, which scales with conversation history, retrieval volume, and instruction verbosity.
Output-token cost. Every request’s generated response is tokenised and charged at the (typically higher) output-token price. Output tokens scale with response verbosity; a feature that generates 400-token responses where 100 would suffice is paying 4x the necessary output cost. Output-token pricing is often 2–5x input-token pricing from managed API providers, so output discipline has outsized cost impact.
Context-window overhead. Longer context windows are priced at higher per-token rates in some provider pricing tiers and can trigger model-class up-routing (e.g., a request that exceeds a short-context model’s window routes to a longer-context, more expensive model). The effective per-token price rises with context length, which is a second-order cost multiplier alongside the direct volume effect.
Retrieval-hop cost. For RAG and agentic systems, each retrieval hop produces a vector-store query and adds the retrieved content to the model’s context for the next step. A four-hop retrieval quadruples the retrieval overhead relative to one hop and increases the context length at each subsequent step, compounding the input-token cost. Retrieval economics deserves particular attention because the design choices that drive hop count are architectural, not configurational.
Tool-call cost. For agentic systems, each tool call is a separate model invocation plus the underlying tool’s cost (API fees, compute, latency). An agent that makes three tool calls per task is paying approximately 4x the single-call cost (one reasoning step per call plus the final synthesis). Tool-call discipline is where many agentic features lose their unit economics.
[DIAGRAM: StageGateFlow — request-to-token-cost-decomposition — horizontal flow from user request through retrieval (N hops) through tool calls (M calls) through model invocation to response; each stage annotated with its token-count contribution and its price multiplier; total cost computed at the end; primitive teaches the cost decomposition per request.]
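The five-component decomposition can be sketched as a per-request cost function. Every price, hop cost, and tool fee below is an illustrative assumption, not a provider quote:

```python
# Sketch of the five-component per-request cost decomposition.
# All prices and per-call fees are illustrative assumptions, not provider quotes.

def request_cost(input_tokens, output_tokens, retrieval_hops, tool_calls,
                 price_in=3.0, price_out=15.0,      # assumed USD per 1M tokens
                 hop_query_cost=0.0005,             # assumed vector-store query fee
                 tool_invocation_cost=0.001,        # assumed underlying tool fee
                 long_context_multiplier=1.0):      # >1.0 when up-routed to long context
    """Return the per-request cost in USD, broken out by component."""
    cost = {
        "input":     input_tokens  / 1e6 * price_in  * long_context_multiplier,
        "output":    output_tokens / 1e6 * price_out * long_context_multiplier,
        "retrieval": retrieval_hops * hop_query_cost,
        "tools":     tool_calls * tool_invocation_cost,
    }
    cost["total"] = sum(cost.values())
    return cost

# The copilot from the opening scenario: 8,500-token context, four hops, three tool calls.
print(request_cost(input_tokens=8_500, output_tokens=300,
                   retrieval_hops=4, tool_calls=3))
```

Instrumenting this decomposition per request, rather than tracking a single blended number, is what makes the later regression discipline possible.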
Forecasting token cost before launch
A business case for a generative feature must include a defensible token-cost forecast. The forecast is built from three inputs: the per-transaction token profile, the transaction volume projection, and the model/tier mix.
The per-transaction token profile is the expected token count across the five components for a typical transaction. It is estimated from prototype runs at the build stage, then refined with pilot data. A practitioner building a customer-service copilot estimates 1,500 input tokens for context, 2,500 input tokens for retrieved documents (two hops × 1,250 tokens), 300 output tokens for the response, and two tool calls each requiring 800 input tokens and 200 output tokens. Total per-transaction: approximately 5,600 input tokens and 700 output tokens.
The transaction volume projection comes from the business case. A projection of 200,000 conversations per month means roughly 1.12B input tokens and 140M output tokens per month.
The model/tier mix specifies which transactions route to which model class. A mix of 80% transactions on a mid-tier model and 20% on a flagship model produces a weighted per-token price that can be computed from public provider pricing. The practitioner should compute the forecast against at least two provider pricing schedules to make the comparison neutral.
The resulting monthly token cost is the volume-weighted product. For the copilot above at typical late-2024 token prices, the monthly cost sits in the low-to-mid five figures; at 10x volume it sits in the low-to-mid six figures. The CFO-actionable insight is the per-order-of-magnitude cost curve, not a single-point estimate.
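A minimal forecast sketch, recomputing the per-transaction profile from its components and applying an assumed tier mix; the prices are illustrative stand-ins for a real provider schedule:

```python
# Token-cost forecast sketch: per-transaction profile x volume x tier mix.
# All prices are illustrative assumptions; substitute your provider's schedule.

# Per-transaction token profile, built up from the component estimates.
input_tokens = 1_500 + 2 * 1_250 + 2 * 800   # context + two retrieval hops + two tool calls
output_tokens = 300 + 2 * 200                 # response + tool-call outputs

monthly_transactions = 200_000

# Model/tier mix: 80% mid-tier, 20% flagship (assumed prices, USD per 1M tokens).
mix = [(0.8, 3.0, 15.0), (0.2, 15.0, 75.0)]   # (share, input price, output price)
price_in  = sum(share * p_in  for share, p_in, _ in mix)
price_out = sum(share * p_out for share, _, p_out in mix)

monthly_input  = monthly_transactions * input_tokens    # ~1.12B tokens
monthly_output = monthly_transactions * output_tokens   # 140M tokens
monthly_cost = (monthly_input / 1e6 * price_in
                + monthly_output / 1e6 * price_out)

print(f"{monthly_cost:,.0f} USD/month")   # ~$10k/month at these assumed prices
```

Re-running the same computation at 10x and 100x volume produces the per-order-of-magnitude cost curve the CFO actually needs.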
Cache-hit economics
Prompt caching is a material cost-reduction mechanism for generative features with repetitive context. The FinOps Foundation’s 2024 FinOps for AI paper documents cache-hit savings in the 40–80% range for well-designed caching, depending on the cache architecture and the request pattern.1 Three cache designs produce different savings profiles.
Full-prompt caching stores the full prompt-plus-response pair and serves cached responses when the exact prompt recurs. Savings are large when they apply (100% of the token cost) but applicability is narrow (only exactly-repeated prompts).
Prefix caching (offered by some managed API providers) caches the input-token processing for long prefixes that recur. A feature with a 4,000-token system prompt that recurs on every request pays the prefix cost once per cache period rather than once per request. Savings on the prefix portion are typically 50–90%.
Retrieval caching caches the retrieval step’s output for a given query. When the same query recurs, the cached passages are used rather than re-querying the vector store. Savings are on retrieval cost rather than model cost and typically range 60–95%.
A practitioner estimating cache savings computes each separately and sums them; a single “caching saves 60%” claim is lazy and usually wrong. The estimate is also scenario-dependent: a cache-hit rate of 60% on a B2B copilot with repetitive tasks may drop to 15% on a consumer assistant with highly varied prompts, and the design must project the realistic scenario rather than the optimistic one.
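The compute-each-separately discipline can be sketched as follows; every share, hit rate, and savings fraction here is a scenario assumption to be replaced with pilot data:

```python
# Cache-savings sketch: estimate each cache design separately, then sum.
# All shares, hit rates, and savings fractions are scenario assumptions.

def cache_savings(monthly_cost,
                  full_prompt_share, full_hit_rate,
                  prefix_share, prefix_hit_rate, prefix_saving,
                  retrieval_cost, retrieval_hit_rate, retrieval_saving):
    """Return (savings by cache design, total) in the same units as monthly_cost."""
    s = {
        # Full-prompt hits avoid 100% of the token cost for that share of traffic.
        "full_prompt": monthly_cost * full_prompt_share * full_hit_rate,
        # Prefix hits avoid part of the input-processing cost on the prefix share.
        "prefix": monthly_cost * prefix_share * prefix_hit_rate * prefix_saving,
        # Retrieval hits avoid re-querying the vector store, not model cost.
        "retrieval": retrieval_cost * retrieval_hit_rate * retrieval_saving,
    }
    return s, sum(s.values())

# Example: a B2B copilot with repetitive tasks (the optimistic scenario).
per_design, total = cache_savings(
    monthly_cost=10_000,
    full_prompt_share=0.10, full_hit_rate=0.60,
    prefix_share=0.40, prefix_hit_rate=0.90, prefix_saving=0.75,
    retrieval_cost=800, retrieval_hit_rate=0.50, retrieval_saving=0.80)
print(per_design, round(total))
```

Swapping in the consumer-assistant hit rates (e.g. 15% rather than 60%) shows immediately how scenario-dependent the headline savings number is.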
Token-aware architectural design
Token-aware design is the practice of building generative features to produce lower token volumes without degrading outcomes. Five architectural levers produce most of the practical savings.
Context compression. Replacing raw conversation history with summarised history reduces input-token volume for multi-turn features. A feature that stores a 500-token running summary rather than a 5,000-token raw history reduces per-turn input cost by an order of magnitude for long conversations. The summarisation has its own token cost; the net saving is positive for conversations of sufficient length.
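The breakeven logic can be made concrete. Assuming illustrative per-turn token counts (200 new tokens per turn, a 500-token running summary, 700 tokens to refresh it), a short sketch finds the conversation length at which the summary strategy wins:

```python
# Breakeven sketch for context compression: running summary vs raw history.
# Per-turn token counts are illustrative assumptions.

def cumulative_input(turns, per_turn_new=200, summary_size=500,
                     summarise_cost=700, use_summary=True):
    """Total input tokens sent across a conversation of `turns` turns."""
    total = 0
    for t in range(1, turns + 1):
        if use_summary:
            # Each turn sends the fixed-size summary plus the new user turn,
            # and pays the summariser's own input cost to refresh the summary.
            total += summary_size + per_turn_new + summarise_cost
        else:
            # Each turn re-sends the full raw history, which grows linearly.
            total += t * per_turn_new
    return total

# First conversation length at which the summary strategy is cheaper.
breakeven = next(n for n in range(1, 100)
                 if cumulative_input(n) < cumulative_input(n, use_summary=False))
print(breakeven)
```

Raw history grows quadratically in cumulative input tokens while the summary strategy grows linearly, which is why the net saving is positive for conversations of sufficient length.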
Retrieval pruning. Reducing the number of hops and the passage length per hop. A retrieval strategy that returns five passages of 500 tokens each produces 2,500 tokens of context; one that returns three passages of 300 tokens each produces 900 tokens. The accuracy trade-off must be measured, but the cost trade-off is material.
Model-tier routing. Routing simpler requests to a smaller, cheaper model and only routing complex requests to the flagship. A feature that routes 70% of requests to a small model (say a Claude Haiku-class, GPT mini-class, Gemini Flash-class, or self-hosted Llama 3.1 8B / Mistral 7B / Qwen2 7B model) and 30% to a flagship produces a weighted cost well below the all-flagship baseline.
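A one-line weighted-cost check, with assumed illustrative input-token prices:

```python
# Model-tier routing sketch: weighted price of a 70/30 routing mix.
# Prices are illustrative assumptions (USD per 1M input tokens).
small_price, flagship_price = 0.25, 15.0

routed = 0.7 * small_price + 0.3 * flagship_price   # 70% small, 30% flagship
print(f"routed mix is {routed / flagship_price:.0%} of the all-flagship price")
```

At these assumed prices the routed mix pays roughly a third of the all-flagship baseline, which is why routing is usually the single largest lever.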
Response shaping. Constraining response length and format to the minimum that satisfies the user need. A structured output (JSON with named fields) often produces shorter, more useful responses than free-form narrative. Response-shaping discipline often reduces output-token volume by 40–60%.
Tool-call discipline. Auditing the agent’s tool-call policy to reduce unnecessary calls. An agent that calls a verification tool on every output is paying a predictable cost; one that calls the tool only when confidence is below a threshold is paying a much lower cost. The behavioural-diagnosis disciplines from agent-evaluation methodologies apply directly.
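The threshold policy can be sketched in a few lines; the confidence scores and the 0.8 threshold are illustrative assumptions:

```python
# Tool-call discipline sketch: gate a costly verification tool behind a
# confidence threshold instead of calling it on every output.
# The confidence values and the 0.8 threshold are illustrative assumptions.

def maybe_verify(answer, confidence, verify, threshold=0.8):
    """Call the verification tool only for low-confidence answers."""
    if confidence < threshold:
        return verify(answer)   # one extra model/tool invocation
    return answer               # skip the call; no extra cost

calls = 0
def verify(answer):
    global calls
    calls += 1                  # count invocations to audit the policy's cost
    return answer

for conf in [0.95, 0.60, 0.90, 0.70, 0.85]:
    maybe_verify("answer", conf, verify)
print(calls)
```

Under an always-verify policy the counter would read five; the gated policy pays for only the low-confidence cases.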
[DIAGRAM: MatrixDiagram — token-factor-by-architectural-lever — 5×5 matrix with rows for token factors (input, output, context overhead, retrieval, tool calls) and columns for architectural levers (context compression, retrieval pruning, model-tier routing, response shaping, tool-call discipline); each cell annotated with whether the lever applies to the factor and the typical savings range; primitive teaches the design-choice-to-cost mapping.]
Worked comparison — three orchestration ecosystems
A token-economics comparison on the same feature across three orchestration ecosystems illustrates the neutrality discipline. The feature is a mid-volume document-answering copilot.
On LangChain with a managed API, the feature is composed of a retriever, a grader, and an answer generator. The default chain runs each step as a separate model call; a typical request produces four model invocations, averaging 7,200 input tokens and 500 output tokens.
On LlamaIndex with the same managed API, the feature’s router combines some steps; a typical request produces three model invocations, averaging 6,100 input tokens and 500 output tokens.
On DSPy with the same managed API, the feature’s compiled program optimises the prompt chain; a typical request produces two model invocations, averaging 5,300 input tokens and 450 output tokens.
At constant outcome, LlamaIndex reduces input tokens by roughly 15% and DSPy by roughly 26% relative to the LangChain baseline. No ecosystem is universally best; the point is that the token-cost difference across architectural choices can be material and must be measured rather than assumed. Haystack, Semantic Kernel, AutoGen, and LangGraph are additional ecosystems the practitioner should include when the feature’s complexity warrants the breadth.
The token-economics regression discipline
A generative feature’s token cost must be tracked as a time series, not a static number. Five patterns produce gradual cost regression that only time-series monitoring catches.
Context creep — conversations get longer because users ask more follow-ups, and average context length grows. Retrieval expansion — the grounding corpus grows, and the retriever returns more passages per query. Response bloat — a subtle prompt change increases average response length. Model up-routing — a capability requirement forces some requests onto a higher tier. Cache-hit decay — query patterns diversify, and the cache-hit rate falls.
Each pattern is silent in the aggregate dashboard unless the practitioner instruments the decomposition. Article 25 (drift detection) treats cost drift as a specific drift mode that requires dedicated monitoring. The token-economics practitioner’s discipline is to ship the decomposed time-series at launch, not to add it after the cost overshoot has already arrived.
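A minimal sketch of the decomposed time-series instrumentation; the metric names and the in-memory sink are assumptions to adapt to a real telemetry stack:

```python
# Sketch of shipping the decomposed token time-series at launch.
# Metric names and the in-memory sink are assumptions; adapt to your stack.
from collections import defaultdict
from datetime import date

daily = defaultdict(lambda: defaultdict(int))

def record(day, input_tokens, output_tokens, hops, tool_calls, cache_hit, tier):
    """Accumulate the per-request decomposition into a daily series."""
    d = daily[day]
    d["input_tokens"] += input_tokens
    d["output_tokens"] += output_tokens
    d["retrieval_hops"] += hops
    d["tool_calls"] += tool_calls
    d["cache_hits"] += int(cache_hit)
    d[f"tier:{tier}"] += 1
    d["requests"] += 1

record(date(2025, 1, 6), 5_600, 700, 2, 2, True, "mid")
record(date(2025, 1, 6), 8_500, 300, 4, 3, False, "flagship")

d = daily[date(2025, 1, 6)]
# Derived signals that surface the regression patterns over time:
avg_context = d["input_tokens"] / d["requests"]   # rising -> context creep
hit_rate = d["cache_hits"] / d["requests"]        # falling -> cache-hit decay
print(avg_context, hit_rate)
```

Plotting each derived signal as a daily series is what turns the five silent patterns into visible trends.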
Summary
Token economics is the non-linear cost behaviour of generative systems. Five components — input tokens, output tokens, context-window overhead, retrieval hops, tool calls — decompose the cost per request. Forecasting token cost requires a per-transaction token profile, a volume projection, and a model/tier mix. Cache-hit economics produce 40–80% savings when well-designed but only for repetitive-context workloads. Token-aware architectural design — context compression, retrieval pruning, model-tier routing, response shaping, tool-call discipline — produces the practical cost levers. A token-economics time-series regression discipline catches gradual cost creep before it destroys the unit economics. Article 11 closes Unit 2 with sensitivity analysis and scenario planning, which the token-economics forecast consumes.
Cross-references to the COMPEL Core Stream:
- EATP-Level-2/M2.5-Art13-Agentic-AI-Cost-Modeling-Token-Economics-Compute-Budgets-and-ROI.md — canonical core article on agentic cost modelling that the token decomposition extends
- EATF-Level-1/M1.4-Art04-Generative-AI-and-Large-Language-Models.md — LLM fundamentals the token-economics analysis depends on
- EATF-Level-1/M1.4-Art12-Tool-Use-and-Function-Calling-in-Autonomous-AI-Systems.md — tool-use and function-calling patterns the cost-decomposition analyses
Q-RUBRIC self-score: 90/100
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. FinOps Foundation, “FinOps for AI Overview” (2024), https://www.finops.org/wg/finops-for-ai/ (accessed 2026-04-19). ↩