COMPEL Specialization — AITE-VDT: AI Value & Analytics Expert Lab 3 of 5
Lab objective
Decompose the token-cost of a realistic GenAI workload, identify the four leading optimization levers from Article 27, and redesign the architecture to achieve approximately a 40% total cost reduction. Produce a before/after token-cost worksheet with the specific architectural decisions that achieve the target.
Duration: 90 minutes. Deliverable: A before/after worksheet (spreadsheet or Markdown table) plus a one-page architecture-decisions summary. Linked articles: 10 (token economics), 27 (FinOps for AI), 29 (compute budgets).
Scenario
“HarborHelp” is a GenAI customer-service copilot that helps human agents respond to support tickets. The current architecture has these characteristics.
Current workload
- Request volume: ~18,000 tickets per day, averaging ~4 agent-AI interactions per ticket → ~72,000 inference requests per day.
- Model class: all requests use a top-tier general-purpose model.
- Average input tokens per request: 3,800 (system prompt, agent persona, ticket history, retrieved knowledge base snippets, current turn).
- Average output tokens per request: 560.
- Retrieval: each request makes three retrieval hops against a vector store.
- Tool calls: 22% of requests make a secondary tool call to look up account data.
- Context window: average 6,400 tokens per request including accumulated conversation.
Pricing (illustrative; use relative ratios)
- Top-tier model: 1.0× input price, 1.0× output price.
- Mid-tier model: 0.2× input price, 0.2× output price.
- Low-tier model: 0.03× input price, 0.03× output price.
- Retrieval hop: 0.15× input-token price per hop.
- Tool call: 0.08× input-token price per call.
Current monthly cost (baseline)
Compute the current monthly cost using the volumes and pricing above. Use a notional “unit” where one input token at top-tier is 1 unit. Report the baseline monthly cost in these units.
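A minimal sketch of the baseline computation. Assumptions not fixed by the lab: a 30-day month; output tokens also cost 1 unit each at top-tier; and "0.15× input-token price per hop" / "0.08× per call" are read as flat 0.15 and 0.08 units per occurrence (one plausible interpretation; state yours explicitly in your worksheet).

```python
# Baseline monthly cost in notional "units" (1 unit = one top-tier input token).
# Assumptions: 30-day month; output tokens billed at 1 unit each; retrieval
# hops and tool calls billed as flat per-occurrence units.
REQUESTS_PER_DAY = 72_000
DAYS = 30
INPUT_TOKENS = 3_800
OUTPUT_TOKENS = 560
RETRIEVAL_HOPS = 3
TOOL_CALL_RATE = 0.22

per_request = (
    INPUT_TOKENS * 1.0          # input tokens at top-tier price
    + OUTPUT_TOKENS * 1.0       # output tokens at top-tier price
    + RETRIEVAL_HOPS * 0.15     # retrieval-hop cost per request
    + TOOL_CALL_RATE * 0.08     # expected tool-call cost per request
)
monthly = per_request * REQUESTS_PER_DAY * DAYS
print(f"per-request: {per_request:.4f} units, monthly: {monthly:,.0f} units")
```

Under these assumptions the model-token terms dwarf the retrieval and tool terms, which is itself a useful early observation for Step 1.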
What to produce
Step 1 — Decompose the baseline
Tabulate the baseline cost across the five token-economics components from Article 10:
- Input tokens at model price.
- Output tokens at model price.
- Context-window overhead (portion of context that is accumulated history beyond minimum).
- Retrieval-hop cost.
- Tool-call cost.
Show each as a share of total cost. Identify which two components dominate.
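The decomposition above can be sketched as follows. One assumption is needed: the lab does not say how much of the 3,800 average input tokens is accumulated-history overhead, so this sketch carves out the 1,400 tokens that Lever C later trims; substitute your own split if you read the context figures differently.

```python
# Baseline cost share by component. Assumption: 1,400 of the 3,800 input
# tokens are treated as accumulated-history context overhead (the amount
# Lever C trims); retrieval/tool costs are flat per-occurrence units.
components = {
    "input tokens (minimum)": (3_800 - 1_400) * 1.0,
    "output tokens": 560 * 1.0,
    "context overhead": 1_400 * 1.0,
    "retrieval hops": 3 * 0.15,
    "tool calls": 0.22 * 0.08,
}
total = sum(components.values())
for name, cost in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {cost / total:6.1%}")
```

Under this split, minimum input tokens and context overhead dominate, which is what points Steps A and C at the input side of the bill.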
Step 2 — Apply the four Phase 2 levers
Apply each of the four levers from Article 27 and estimate the cost reduction.
Lever A — Prompt caching. Estimate the proportion of input tokens that are cacheable (typically the system prompt, agent persona, and stable knowledge-base snippets — perhaps 2,800 of 3,800 input tokens on a cache hit). Estimate the cache-hit rate (say, 75% given request similarity). The pricing table does not specify a cache-read discount, so state the discount you assume for cached tokens. Compute the effective input-token cost with caching.
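A sketch of the Lever A arithmetic. The cache-read discount is not given in the lab's pricing table; 0.1× is an illustrative assumption (real provider discounts vary), so swap in your own figure.

```python
# Effective input-token cost per request with prompt caching.
# Assumption: cached tokens on a hit are billed at 0.1x the normal input
# price (illustrative; cache-read discounts vary by provider).
FULL = 3_800          # average input tokens per request
CACHEABLE = 2_800     # tokens served from cache on a hit
HIT_RATE = 0.75
CACHE_READ_PRICE = 0.10

effective = (
    HIT_RATE * (CACHEABLE * CACHE_READ_PRICE + (FULL - CACHEABLE) * 1.0)
    + (1 - HIT_RATE) * FULL * 1.0
)
print(f"effective input cost: {effective:.0f} units/request "
      f"({effective / FULL:.1%} of uncached)")
```

With these illustrative rates, caching roughly halves the input-token bill, which is why the textbook order applies it first.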
Lever B — Model-tier routing. Classify requests into tiers. Simple Q&A (perhaps 45% of volume) can use low-tier. Moderate reasoning (35%) can use mid-tier. Complex cases (20%) require top-tier. Compute the blended input and output cost.
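The blended multiplier for Lever B follows directly from the tier shares and the relative prices in the pricing table. Because each tier's input and output multipliers are equal, one blended figure serves both:

```python
# Blended per-token price multiplier after model-tier routing.
# Tier shares and relative prices are taken from the lab; input and output
# use the same multiplier per tier, so one blended figure covers both.
tiers = {           # name: (share of volume, price multiplier)
    "low": (0.45, 0.03),
    "mid": (0.35, 0.20),
    "top": (0.20, 1.00),
}
blended = sum(share * price for share, price in tiers.values())
print(f"blended multiplier: {blended:.4f}x")
```

Note this assumes a perfectly accurate router; the guidance section below asks you to discount for realistic routing accuracy.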
Lever C — Context trimming. The 6,400-token context includes accumulated conversation. For most requests, trimming to the most recent three turns plus a summary reduces average context by 1,400 tokens without material quality loss. Estimate the input-token savings.
Lever D — Model-class substitution. For the 45% of simple Q&A volume, a fine-tuned small model can replace the general-purpose model at near-equivalent quality. Compute the substitution savings.
Step 3 — Compute the compound result
Apply the levers in sequence (not in parallel — each lever’s saving is applied to the already-reduced cost from preceding levers). Compute the final monthly cost and the total percentage reduction.
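The sequencing rule above can be sketched as a compounding loop. The per-lever saving fractions below are placeholders chosen only to show the mechanics (they happen to compound to roughly the target zone); replace them with the figures from your own Steps A through D.

```python
# Mechanics of sequential lever application: each lever's saving applies to
# the cost already reduced by preceding levers, so remaining-cost fractions
# multiply. The saving fractions are illustrative placeholders only.
baseline = 9_418_610_016            # example monthly baseline, in units
lever_savings = {
    "A caching": 0.18,
    "B routing": 0.15,
    "C trimming": 0.10,
    "D substitution": 0.06,
}
cost = float(baseline)
for name, saving in lever_savings.items():
    cost *= (1 - saving)
    print(f"after {name:16s} {cost:,.0f} units")
reduction = 1 - cost / baseline
print(f"total reduction: {reduction:.1%}")
```

The key point: four levers of 18%, 15%, 10%, and 6% do not add to 49%; they compound to about 41%, which is why naive addition overstates the result.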
Step 4 — Identify risks and trade-offs
For each lever, name one risk or trade-off:
- Prompt caching: cache invalidation cost when prompts change frequently.
- Model-tier routing: classifier accuracy; misclassifications route simple requests to expensive tiers or complex requests to cheap tiers.
- Context trimming: risk of losing context needed for coherent response.
- Model-class substitution: quality risk; requires capability-evaluation harness (Article 24) to confirm non-degradation.
Step 5 — Produce the before/after worksheet
Table format:
| Component | Baseline cost | After Lever A | After Lever B | After Lever C | After Lever D |
|---|---|---|---|---|---|
| Input tokens | … | … | … | … | … |
| Output tokens | … | … | … | … | … |
| Context overhead | … | … | … | … | … |
| Retrieval hops | … | … | … | … | … |
| Tool calls | … | … | … | … | … |
| Monthly total | … | … | … | … | … |
Target: ~40% total reduction.
Step 6 — Write the one-page architecture-decisions summary
Brief document covering:
- The four architectural decisions (which lever, which specific implementation).
- The expected cost reduction by lever and compound.
- The capability-evaluation requirements (which levers need which evaluations).
- The implementation sequence and estimated effort.
- The governance handoffs (compute budget update, stage-gate review).
Guidance
- Order matters. Applying the levers in sequence matters because each reduces the base against which subsequent levers compute savings. The textbook order — caching first, then tier-routing, then context trimming, then substitution — typically produces the highest compound reduction.
- Realistic, not aspirational. Cache-hit rates of 95% with major prompt variation are aspirational. Model-tier routing at 100% accuracy is aspirational. Use realistic rates (cache hit 70–80%, routing accuracy 85–95%) and document sensitivity to them.
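One way to document sensitivity is a small sweep over the realistic ranges above. The per-lever saving fractions here are illustrative placeholders, scaled linearly with each rate as a simple assumption; substitute your own worksheet values and scaling model.

```python
# Sensitivity sweep over cache-hit rate (70-80%) and routing accuracy
# (85-95%). Saving fractions are placeholders scaled linearly with each
# rate; levers C and D are held fixed at 10% and 6%.
results = []
for cache_hit in (0.70, 0.75, 0.80):
    for routing_acc in (0.85, 0.90, 0.95):
        a = 0.18 * (cache_hit / 0.75)     # Lever A saving vs. hit rate
        b = 0.15 * (routing_acc / 0.90)   # Lever B saving vs. accuracy
        remaining = (1 - a) * (1 - b) * (1 - 0.10) * (1 - 0.06)
        results.append(1 - remaining)
        print(f"hit={cache_hit:.0%} acc={routing_acc:.0%} -> "
              f"{1 - remaining:.1%} reduction")
print(f"range: {min(results):.1%} to {max(results):.1%}")
```

A sweep like this makes the claim concrete: under these placeholder rates the outcome stays inside the 35–45% zone, and the reviewer can see exactly which input moves it most.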
- Target a range, not an exact figure. The 40% target is a range; 35–45% is the realistic zone. Programs that claim 70% from paper exercises typically miss their real-world target by a wide margin.
Evaluation rubric
| Dimension | What to demonstrate | Weight |
|---|---|---|
| Baseline decomposition | Correct component math; dominant components identified | 15% |
| Lever application | All four levers applied; realistic rates used | 25% |
| Compound arithmetic | Sequential application done correctly | 15% |
| Risk identification | Each lever has a named trade-off | 10% |
| Architecture-decisions summary | Clear, implementable, governance-aware | 20% |
| Realism | No aspirational rates; sensitivity disclosed | 15% |
Reflection questions
- Which lever produced the largest share of your total reduction? Is that the lever your organization would implement first?
- Model-tier routing requires a classifier. How would you measure classifier accuracy, and how would you handle drift in the classifier over time?
- Suppose after implementation, actual cost reduction is 22% rather than 40%. Name three plausible causes and the investigation sequence.
Linked articles and further reading
- Article 10 — Token economics of generative systems.
- Article 27 — FinOps for AI.
- Article 29 — Compute budgets and token-aware governance.
- FinOps Foundation, FinOps for AI technical paper (2024).
Submission
Submit the before/after worksheet and the one-page architecture summary. The reviewer will validate the arithmetic, the realism of the rates used, and the implementation practicality of the architecture decisions.