This article examines the demand patterns of training, inference, and retrieval workloads, the planning models that work for each, and the operational practices that allow capacity to flex with reality without surprise budget breaches.
Why AI Capacity Planning Differs
Three properties of AI workloads break conventional capacity planning assumptions.
First, bursty and unpredictable demand. Training jobs run for hours or days at full utilisation, then leave the hardware idle. Inference traffic spikes with product launches or external events that may have no warning. Retrieval-augmented Generative AI applications consume vector store and embedding compute proportional to user input, not to user count.
Second, specialised hardware. AI workloads depend on accelerators (GPUs, TPUs, custom silicon) whose supply is constrained, whose price is volatile, and whose lead times can stretch to months. The U.S. Government Accountability Office report GAO-24-106855 on AI in Federal Agencies at https://www.gao.gov/products/gao-24-106855 documents how accelerator scarcity has become a programmatic risk for major projects.
Third, non-linear cost behaviours. Unlike conventional compute-unit or core-count scaling, where cost grows roughly in proportion to demand, AI workloads can have step-function cost changes: the model fits in GPU memory or it does not, the latency budget is met or it is not. Linear extrapolation of past cost into future capacity is therefore unreliable.
Training Capacity
Training workloads have predictable resource profiles per job but unpredictable scheduling. A typical mid-sized model training run might require 8 to 64 high-memory GPUs for hours to days, with peak network and storage utilisation at the start and end of the run.
The planning unit for training is the training campaign — the set of training runs the program intends to execute over a planning horizon. Capacity planning for training answers two questions:
- What dedicated capacity should the program own?
- What burst capacity should it reserve from cloud providers?
The economics depend on utilisation. A program that sustains more than roughly 60 to 70 percent utilisation of its training hardware across the year is usually better off owning that hardware. Below that threshold, cloud is cheaper. The crossover point depends on the program’s procurement leverage and the depreciation schedule it can support.
Reserved capacity (one-year or three-year cloud reservations) can reduce cost materially for the predictable portion of demand, with on-demand or spot capacity covering the variable portion. Spot capacity for training, where supported, can save 50 to 90 percent versus on-demand, at the cost of potential preemption.
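As a rough illustration of the crossover, the sketch below compares the effective cost of an owned GPU-hour at different utilisation levels against a blended cloud rate built from reserved, on-demand, and spot capacity. Every price, the three-year depreciation window, and the capacity mix are hypothetical placeholders, not vendor quotes.

```python
def owned_cost_per_gpu_hour(purchase_price, depreciation_years, utilisation,
                            hosting_per_hour):
    """Effective cost of an owned GPU-hour at a given annual utilisation."""
    hours_used = depreciation_years * 365 * 24 * utilisation
    return purchase_price / hours_used + hosting_per_hour


def blended_cloud_cost_per_gpu_hour(reserved_rate, on_demand_rate, spot_rate,
                                    reserved_share, spot_share):
    """Weighted cloud rate: reservations cover the predictable base, spot the
    preemptible portion, on-demand the rest."""
    on_demand_share = 1.0 - reserved_share - spot_share
    return (reserved_share * reserved_rate
            + spot_share * spot_rate
            + on_demand_share * on_demand_rate)


if __name__ == "__main__":
    # Hypothetical figures for a single high-memory GPU.
    cloud = blended_cloud_cost_per_gpu_hour(
        reserved_rate=2.50, on_demand_rate=4.00, spot_rate=1.20,
        reserved_share=0.6, spot_share=0.2)
    for utilisation in (0.3, 0.5, 0.65, 0.8):
        owned = owned_cost_per_gpu_hour(
            purchase_price=30_000, depreciation_years=3,
            utilisation=utilisation, hosting_per_hour=0.60)
        cheaper = "own" if owned < cloud else "cloud"
        print(f"utilisation {utilisation:.0%}: owned {owned:.2f}/h "
              f"vs cloud {cloud:.2f}/h -> {cheaper}")
```

With these invented numbers the crossover lands in the 50 to 65 percent range, consistent with the rule of thumb above; the point of the exercise is to make the program's own assumptions explicit, not to reproduce these figures.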
Inference Capacity
Inference workloads are typically continuous, with diurnal or event-driven peaks. The planning unit is the queries per second (QPS) profile by use case, with separate consideration of latency requirements.
Three factors complicate inference capacity planning.
Cold start. Loading a model into accelerator memory takes seconds to minutes. A horizontally scaled inference service must keep enough warm replicas to absorb traffic spikes within the cold-start window.
Batch effects. Most inference servers achieve higher throughput by batching requests. Batching trades off latency for throughput; the appropriate batch size depends on the service-level objective (SLO).
Tail latency. P99 latency is often what matters to users, and tail latency in AI inference is sensitive to garbage collection, page faults, and model size. Capacity must be planned with headroom that keeps the tail acceptable.
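A back-of-envelope calculation can tie these three factors together. The sketch below estimates warm replica count from a peak QPS forecast, per-replica throughput at the chosen batch size, a utilisation ceiling that protects tail latency, and the traffic that can arrive within the cold-start window; all figures are illustrative assumptions, not measurements.

```python
import math


def required_warm_replicas(peak_qps, per_replica_qps, utilisation_ceiling,
                           spike_qps_per_second, cold_start_seconds):
    """Warm replicas needed to serve peak traffic plus the spike that can
    arrive before a newly started replica finishes loading the model."""
    # Traffic that can appear within the cold-start window.
    spike_buffer_qps = spike_qps_per_second * cold_start_seconds
    # Keep each replica below the ceiling so P99 latency stays acceptable.
    effective_qps = per_replica_qps * utilisation_ceiling
    return math.ceil((peak_qps + spike_buffer_qps) / effective_qps)


if __name__ == "__main__":
    replicas = required_warm_replicas(
        peak_qps=400,             # forecast peak traffic
        per_replica_qps=25,       # measured throughput at the chosen batch size
        utilisation_ceiling=0.7,  # headroom for tail latency
        spike_qps_per_second=5,   # how fast traffic can ramp
        cold_start_seconds=90)    # model load time
    print(f"warm replicas to provision: {replicas}")
```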
The documentation for TensorRT-LLM, vLLM, and the Triton Inference Server (for example, https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html) describes inference capacity considerations, with worked examples that translate directly into capacity plans.
Retrieval and Vector Store Capacity
Retrieval-augmented Generative AI applications consume two distinct capacities: the vector store query capacity and the embedding generation capacity.
Vector store query capacity scales with the number of vectors, the index type (flat, IVF, HNSW), and the recall target. Pinecone, Weaviate, Milvus, and other vector stores publish capacity guidance specific to their architectures; planning typically still requires load testing in the target environment because real performance depends on the data distribution.
Embedding generation capacity scales with input volume — typically tokens per second for text. Embedding workloads can be batched aggressively because latency requirements are usually relaxed compared to chat inference.
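The two capacities can be estimated separately. The sketch below sizes vector store memory from vector count and dimensionality with an assumed index overhead factor, and spreads a day's embedding workload across a batch window; the overhead factor and throughput figures are placeholders to be replaced by load-test results.

```python
def vector_store_memory_gb(num_vectors, dimensions, bytes_per_component=4,
                           index_overhead=1.5):
    """Raw vector bytes times an assumed index overhead factor (graph-based
    indexes such as HNSW typically cost more than flat indexes)."""
    raw_bytes = num_vectors * dimensions * bytes_per_component
    return raw_bytes * index_overhead / 1e9


def embedding_gpus_needed(tokens_per_day, tokens_per_second_per_gpu,
                          batch_window_hours=24):
    """Embedding jobs tolerate latency, so spread the day's tokens over the
    whole batch window rather than sizing for peak."""
    seconds = batch_window_hours * 3600
    required_tps = tokens_per_day / seconds
    return required_tps / tokens_per_second_per_gpu


if __name__ == "__main__":
    # Illustrative inputs: 50 million 1024-dimensional vectors, 2 billion
    # tokens of new text per day, 20k tokens/second per embedding GPU.
    print(f"vector store memory: {vector_store_memory_gb(50_000_000, 1024):.0f} GB")
    print(f"embedding GPUs needed: {embedding_gpus_needed(2e9, 20_000):.2f}")
```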
Storage Capacity
AI workloads have multiple distinct storage tiers.
Training data storage. Often petabytes for foundation-model training, terabytes for typical enterprise applications. Optimised for high read throughput; latency is less critical.
Feature store storage. Online (low-latency) and offline (high-throughput) tiers. Capacity planning depends on feature retention policy and join cardinality.
Model artefact storage. Smaller in volume but requires high availability and versioning.
Vector store storage. Specialised; capacity scales with number of vectors and dimensionality.
Audit log storage. Per Module 1.21, audit logs can grow rapidly; this tier often produces the biggest surprises in cost reviews.
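A first-cut storage forecast can simply apply a growth rate to each tier. The sketch below uses compound monthly growth; the tier names mirror the list above, and the starting sizes and rates are invented values to be replaced with measured data.

```python
def project_storage_tb(current_tb, monthly_growth_rate, months):
    """Compound monthly growth; adequate for a first-cut capacity forecast."""
    return current_tb * (1 + monthly_growth_rate) ** months


if __name__ == "__main__":
    tiers = {
        "training data":   (500.0, 0.02),  # current TB, monthly growth rate
        "feature store":   (40.0, 0.05),
        "model artefacts": (5.0, 0.08),
        "vector store":    (12.0, 0.10),
        "audit logs":      (8.0, 0.15),    # often the fastest-growing tier
    }
    for name, (size_tb, growth) in tiers.items():
        projected = project_storage_tb(size_tb, growth, months=12)
        print(f"{name}: {size_tb:.0f} TB now -> {projected:.0f} TB in 12 months")
```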
The Linux Foundation’s Storage Performance Development Kit (SPDK) project at https://spdk.io/ and the broader open-source storage performance literature provide engineering reference points.
Network Capacity
Three network capacity concerns are specific to AI.
Cross-region replication bandwidth for the disaster recovery patterns of the previous article. Large model artefacts and vector stores can saturate links if replication is not throttled.
Training cluster interconnect. Distributed training requires high-bandwidth, low-latency interconnect (NVLink, InfiniBand, RoCE) between accelerators. This is hardware-level capacity rarely faced by application teams but important to platform teams.
Egress to AI vendors. Workloads that route requests to external LLM providers consume egress bandwidth that can be material at scale. Cost-aware design (caching, retrieval-first architectures) can reduce egress significantly.
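The first and third of these concerns reduce to simple arithmetic at the planning stage. The sketch below converts a replication window into a sustained throughput requirement and a daily request volume into monthly egress; all inputs are hypothetical placeholders.

```python
def replication_gbps(artefact_gb, window_hours):
    """Sustained throughput needed to copy an artefact within the window."""
    bits = artefact_gb * 8e9
    return bits / (window_hours * 3600) / 1e9


def monthly_egress_gb(requests_per_day, avg_request_kb):
    """Outbound volume to an external LLM provider, before any caching."""
    return requests_per_day * avg_request_kb * 30 / 1e6


if __name__ == "__main__":
    # Illustrative inputs: an 800 GB artefact replicated within 4 hours, and
    # 5 million daily requests averaging 6 KB each sent to an external vendor.
    print(f"replication: {replication_gbps(800, 4):.2f} Gbps sustained")
    print(f"LLM egress: {monthly_egress_gb(5_000_000, 6):.0f} GB/month")
```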
Forecasting and Modelling
Capacity forecasts should be developed with three time horizons.
Short-term (one to four weeks): based on observed traffic and known events. Used for tactical scaling decisions and incident response.
Medium-term (one to two quarters): based on product roadmaps, planned launches, and seasonal effects. Used for reservation and procurement decisions.
Long-term (one to three years): based on strategic forecasts and potential model architecture changes. Used for hardware procurement and major contract negotiations.
The forecasts should be challenged. Bayesian methods and ensemble forecasting reduce reliance on any single source. The Cloud FinOps Foundation has published a body of work at https://www.finops.org/ that covers capacity forecasting in cloud-heavy environments.
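One lightweight way to challenge the forecasts is an error-weighted ensemble, sketched below. The individual forecasts and their historical errors are invented for illustration; a real implementation would draw them from the program's forecasting pipeline.

```python
def ensemble_forecast(forecasts, past_abs_errors):
    """Weight each forecast source inversely to its historical absolute error."""
    weights = [1.0 / max(error, 1e-9) for error in past_abs_errors]
    total = sum(weights)
    return sum(w * f for w, f in zip(weights, forecasts)) / total


if __name__ == "__main__":
    # Peak GPU demand forecasts (units) for next quarter from three sources:
    # product roadmap, trend extrapolation, and finance planning.
    forecasts = [120, 150, 95]
    past_abs_errors = [20, 10, 35]  # mean absolute error of each source
    blended = ensemble_forecast(forecasts, past_abs_errors)
    print(f"blended forecast: {blended:.0f} GPUs")
```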
Operational Practices
Headroom policies. Each capacity tier should have a documented utilisation ceiling that triggers escalation. For inference, peak utilisation above 70 percent typically warrants action. For training, sustained utilisation above 90 percent indicates either successful cost optimisation or imminent capacity exhaustion, depending on demand trajectory.
Quota management. Cloud accounts and internal capacity should have explicit quotas per workload, with quota requests routed through a defined process. Unmanaged quotas are a recipe for one workload exhausting capacity that another needs.
Right-sizing reviews. Quarterly reviews compare allocated capacity to actual utilisation per workload. Persistent over-allocation triggers downsizing; persistent under-allocation triggers expansion.
Failure-mode-aware planning. Capacity must be sufficient to handle the failure modes the resilience plan addresses: if a region is lost, the surviving capacity must be able to absorb the redistributed load.
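The headroom policy above can be reduced to a simple automated check. The sketch below flags tiers whose peak utilisation breaches the documented ceiling; the tier names, ceilings, and observed figures are illustrative and would come from the documented policy and the monitoring system in practice.

```python
# Documented utilisation ceilings per capacity tier (illustrative values,
# echoing the 70 percent inference and 90 percent training figures above).
HEADROOM_CEILINGS = {"inference": 0.70, "training": 0.90}


def headroom_alerts(peak_utilisation_by_tier):
    """Return (tier, peak, ceiling) for every tier above its ceiling. A
    training breach still needs interpretation against the demand trajectory,
    as noted above."""
    return [
        (tier, peak, HEADROOM_CEILINGS[tier])
        for tier, peak in peak_utilisation_by_tier.items()
        if tier in HEADROOM_CEILINGS and peak > HEADROOM_CEILINGS[tier]
    ]


if __name__ == "__main__":
    observed = {"inference": 0.78, "training": 0.85}
    for tier, peak, ceiling in headroom_alerts(observed):
        print(f"escalate: {tier} peak {peak:.0%} exceeds ceiling {ceiling:.0%}")
```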
Common Failure Modes
The first is peanut-buttered cost — capacity costs are split across multiple budget lines, hiding the real total. Counter with consolidated AI infrastructure cost reporting.
The second is zombie capacity — reserved capacity for workloads that have been retired. Counter with quarterly reservation-to-workload reconciliation.
The third is uncoordinated procurement — multiple teams independently reserving capacity from the same supplier. Counter with central procurement and unified reservation management.
The fourth is demand shock surprise — a product launch consumes inference capacity unexpectedly. Counter with launch-readiness reviews that include explicit capacity sign-off.
Looking Forward
The next article in Module 1.24 examines cost allocation and chargeback — the related but distinct discipline of attributing the consumed capacity back to the workloads and business units that drove it. Capacity planning answers what to buy; chargeback answers who pays.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.