COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Lab 3 of 5
Lab objective
Decompose the token-cost of a realistic GenAI workload, identify the four leading optimization levers from Article 27, and redesign the architecture to achieve approximately a 40% total cost reduction. Produce a before/after token-cost worksheet with the specific architectural decisions that achieve the target.
Duration: 90 minutes. Deliverable: A before/after worksheet (spreadsheet or Markdown table) plus a one-page architecture-decisions summary. Linked articles: 10 (token economics), 27 (FinOps for AI), 29 (compute budgets).
Scenario
“HarborHelp” is a GenAI customer-service copilot that helps human agents respond to support tickets. The current architecture has these characteristics.
Current workload
- Request volume: ~18,000 tickets per day, averaging ~4 agent-AI interactions per ticket → ~72,000 inference requests per day.
- Model class: all requests use a top-tier general-purpose model.
- Average input tokens per request: 3,800 (system prompt, agent persona, ticket history, retrieved knowledge base snippets, current turn).
- Average output tokens per request: 560.
- Retrieval: each request makes three retrieval hops against a vector store.
- Tool calls: 22% of requests make a secondary tool call to look up account data.
- Context window: average 6,400 tokens per request including accumulated conversation.
Pricing (illustrative; use relative ratios)
- Top-tier model: 1.0× input price, 1.0× output price.
- Mid-tier model: 0.2× input price, 0.2× output price.
- Low-tier model: 0.03× input price, 0.03× output price.
- Retrieval hop: 0.15× input-token price per hop.
- Tool call: 0.08× input-token price per call.
Current monthly cost (baseline)
Compute the current monthly cost using the volumes and pricing above. Use a notional “unit” where one input token at top-tier is 1 unit. Report the baseline monthly cost in these units.
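A minimal sketch of the baseline computation. Assumptions not fixed by the lab: a 30-day month; output tokens also cost 1 unit each at top-tier; and "0.15× input-token price per hop" / "0.08× per call" are read as flat 0.15 and 0.08 units per occurrence (one plausible interpretation; state yours explicitly in your worksheet).

```python
# Baseline monthly cost in notional "units" (1 unit = one top-tier input token).
# Assumptions: 30-day month; output tokens billed at 1 unit each; retrieval
# hops and tool calls billed as flat per-occurrence units.
REQUESTS_PER_DAY = 72_000
DAYS = 30
INPUT_TOKENS = 3_800
OUTPUT_TOKENS = 560
RETRIEVAL_HOPS = 3
TOOL_CALL_RATE = 0.22

per_request = (
    INPUT_TOKENS * 1.0          # input tokens at top-tier price
    + OUTPUT_TOKENS * 1.0       # output tokens at top-tier price
    + RETRIEVAL_HOPS * 0.15     # retrieval-hop cost per request
    + TOOL_CALL_RATE * 0.08     # expected tool-call cost per request
)
monthly = per_request * REQUESTS_PER_DAY * DAYS
print(f"per-request: {per_request:.4f} units, monthly: {monthly:,.0f} units")
```

Under these assumptions the model-token terms dwarf the retrieval and tool terms, which is itself a useful early observation for Step 1.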
What to produce
Step 1 — Decompose the baseline
Tabulate the baseline cost across the five token-economics components from Article 10:
- Input tokens at model price.
- Output tokens at model price.
- Context-window overhead (portion of context that is accumulated history beyond minimum).
- Retrieval-hop cost.
- Tool-call cost.
Show each as a share of total cost. Identify which two components dominate.
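The decomposition above can be sketched as follows. One assumption is needed: the lab does not say how much of the 3,800 average input tokens is accumulated-history overhead, so this sketch carves out the 1,400 tokens that Lever C later trims; substitute your own split if you read the context figures differently.

```python
# Baseline cost share by component. Assumption: 1,400 of the 3,800 input
# tokens are treated as accumulated-history context overhead (the amount
# Lever C trims); retrieval/tool costs are flat per-occurrence units.
components = {
    "input tokens (minimum)": (3_800 - 1_400) * 1.0,
    "output tokens": 560 * 1.0,
    "context overhead": 1_400 * 1.0,
    "retrieval hops": 3 * 0.15,
    "tool calls": 0.22 * 0.08,
}
total = sum(components.values())
for name, cost in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {cost / total:6.1%}")
```

Under this split, minimum input tokens and context overhead dominate, which is what points Steps A and C at the input side of the bill.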
Step 2 — Apply the four Phase 2 levers
Apply each of the four levers from Article 27 and estimate the cost reduction.
Lever A — Prompt caching. Estimate the proportion of input tokens that are cacheable (typically the system prompt, agent persona, and stable knowledge-base snippets — perhaps 2,800 of 3,800 input tokens on a cache hit). Estimate the cache-hit rate (say, 75% given request similarity). The pricing table does not specify a cache-read discount, so state the discount you assume for cached tokens. Compute the effective input-token cost with caching.
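A sketch of the Lever A arithmetic. The cache-read discount is not given in the lab's pricing table; 0.1× is an illustrative assumption (real provider discounts vary), so swap in your own figure.

```python
# Effective input-token cost per request with prompt caching.
# Assumption: cached tokens on a hit are billed at 0.1x the normal input
# price (illustrative; cache-read discounts vary by provider).
FULL = 3_800          # average input tokens per request
CACHEABLE = 2_800     # tokens served from cache on a hit
HIT_RATE = 0.75
CACHE_READ_PRICE = 0.10

effective = (
    HIT_RATE * (CACHEABLE * CACHE_READ_PRICE + (FULL - CACHEABLE) * 1.0)
    + (1 - HIT_RATE) * FULL * 1.0
)
print(f"effective input cost: {effective:.0f} units/request "
      f"({effective / FULL:.1%} of uncached)")
```

With these illustrative rates, caching roughly halves the input-token bill, which is why the textbook order applies it first.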
Lever B — Model-tier routing. Classify requests into tiers. Simple Q&A (perhaps 45% of volume) can use low-tier. Moderate reasoning (35%) can use mid-tier. Complex cases (20%) require top-tier. Compute the blended input and output cost.
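The blended multiplier for Lever B follows directly from the tier shares and the relative prices in the pricing table. Because each tier's input and output multipliers are equal, one blended figure serves both:

```python
# Blended per-token price multiplier after model-tier routing.
# Tier shares and relative prices are taken from the lab; input and output
# use the same multiplier per tier, so one blended figure covers both.
tiers = {           # name: (share of volume, price multiplier)
    "low": (0.45, 0.03),
    "mid": (0.35, 0.20),
    "top": (0.20, 1.00),
}
blended = sum(share * price for share, price in tiers.values())
print(f"blended multiplier: {blended:.4f}x")
```

Note this assumes a perfectly accurate router; the guidance section below asks you to discount for realistic routing accuracy.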
Lever C — Context trimming. The 6,400-token context includes accumulated conversation. For most requests, trimming to the most recent three turns plus a summary reduces average context by 1,400 tokens without material quality loss. Estimate the input-token savings.
Lever D — Model-class substitution. For the 45% of simple Q&A volume, a fine-tuned small model can replace the general-purpose model at near-equivalent quality. Compute the substitution savings.
Step 3 — Compute the compound result
Apply the levers in sequence (not in parallel — each lever’s saving is applied to the already-reduced cost from preceding levers). Compute the final monthly cost and the total percentage reduction.
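The sequencing rule above can be sketched as a compounding loop. The per-lever saving fractions below are placeholders chosen only to show the mechanics (they happen to compound to roughly the target zone); replace them with the figures from your own Steps A through D.

```python
# Mechanics of sequential lever application: each lever's saving applies to
# the cost already reduced by preceding levers, so remaining-cost fractions
# multiply. The saving fractions are illustrative placeholders only.
baseline = 9_418_610_016            # example monthly baseline, in units
lever_savings = {
    "A caching": 0.18,
    "B routing": 0.15,
    "C trimming": 0.10,
    "D substitution": 0.06,
}
cost = float(baseline)
for name, saving in lever_savings.items():
    cost *= (1 - saving)
    print(f"after {name:16s} {cost:,.0f} units")
reduction = 1 - cost / baseline
print(f"total reduction: {reduction:.1%}")
```

The key point: four levers of 18%, 15%, 10%, and 6% do not add to 49%; they compound to about 41%, which is why naive addition overstates the result.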
Step 4 — Identify risks and trade-offs
For each lever, name one risk or trade-off:
- Prompt caching: cache invalidation cost when prompts change frequently.
- Model-tier routing: classifier accuracy; misclassifications route simple requests to expensive tiers or complex requests to cheap tiers.
- Context trimming: risk of losing context needed for coherent response.
- Model-class substitution: quality risk; requires capability-evaluation harness (Article 24) to confirm non-degradation.
Step 5 — Produce the before/after worksheet
Table format:
| Component | Baseline cost | After Lever A | After Lever B | After Lever C | After Lever D |
|---|---|---|---|---|---|
| Input tokens | … | … | … | … | … |
| Output tokens | … | … | … | … | … |
| Context overhead | … | … | … | … | … |
| Retrieval hops | … | … | … | … | … |
| Tool calls | … | … | … | … | … |
| Monthly total | … | … | … | … | … |
Target: ~40% total reduction.
Step 6 — Write the one-page architecture-decisions summary
Brief document covering:
- The four architectural decisions (which lever, which specific implementation).
- The expected cost reduction by lever and compound.
- The capability-evaluation requirements (which levers need which evaluations).
- The implementation sequence and estimated effort.
- The governance handoffs (compute budget update, stage-gate review).
Guidance
- Order matters. Applying the levers in sequence matters because each reduces the base against which subsequent levers compute savings. The textbook order — caching first, then tier-routing, then context trimming, then substitution — typically produces the highest compound reduction.
- Realistic, not aspirational. Cache-hit rates of 95% with major prompt variation are aspirational. Model-tier routing at 100% accuracy is aspirational. Use realistic rates (cache hit 70–80%, routing accuracy 85–95%) and document sensitivity to them.
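One way to document sensitivity is a small sweep over the realistic ranges above. The per-lever saving fractions here are illustrative placeholders, scaled linearly with each rate as a simple assumption; substitute your own worksheet values and scaling model.

```python
# Sensitivity sweep over cache-hit rate (70-80%) and routing accuracy
# (85-95%). Saving fractions are placeholders scaled linearly with each
# rate; levers C and D are held fixed at 10% and 6%.
results = []
for cache_hit in (0.70, 0.75, 0.80):
    for routing_acc in (0.85, 0.90, 0.95):
        a = 0.18 * (cache_hit / 0.75)     # Lever A saving vs. hit rate
        b = 0.15 * (routing_acc / 0.90)   # Lever B saving vs. accuracy
        remaining = (1 - a) * (1 - b) * (1 - 0.10) * (1 - 0.06)
        results.append(1 - remaining)
        print(f"hit={cache_hit:.0%} acc={routing_acc:.0%} -> "
              f"{1 - remaining:.1%} reduction")
print(f"range: {min(results):.1%} to {max(results):.1%}")
```

A sweep like this makes the claim concrete: under these placeholder rates the outcome stays inside the 35–45% zone, and the reviewer can see exactly which input moves it most.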
- Target a range, not an exact figure. The 40% target is a range; 35–45% is the realistic zone. Programs that claim 70% from paper exercises typically miss their real-world target by a wide margin.
Evaluation rubric
| Dimension | What to demonstrate | Weight |
|---|---|---|
| Baseline decomposition | Correct component math; dominant components identified | 15% |
| Lever application | All four levers applied; realistic rates used | 25% |
| Compound arithmetic | Sequential application done correctly | 15% |
| Risk identification | Each lever has a named trade-off | 10% |
| Architecture-decisions summary | Clear, implementable, governance-aware | 20% |
| Realism | No aspirational rates; sensitivity disclosed | 15% |
Reflection questions
- Which lever produced the largest share of your total reduction? Is that the lever your organization would implement first?
- Model-tier routing requires a classifier. How would you measure classifier accuracy, and how would you handle drift in the classifier over time?
- Suppose after implementation, actual cost reduction is 22% rather than 40%. Name three plausible causes and the investigation sequence.
Linked articles and further reading
- Article 10 — Token economics of generative systems.
- Article 27 — FinOps for AI.
- Article 29 — Compute budgets and token-aware governance.
- FinOps Foundation, FinOps for AI technical paper (2024).
Submission
Submit the before/after worksheet and the one-page architecture summary. The reviewer will validate the arithmetic, the realism of the rates used, and the implementation practicality of the architecture decisions.