Continuous batching

An inference-server technique — popularised by vLLM and Text Generation Inference — that dynamically groups concurrent requests at the token-generation level to raise GPU utilisation.

What this means in practice

Distinct from static batching, where a batch is fixed until every sequence in it completes: with continuous batching, the batch is re-formed at every generation step, so finished sequences free their slots immediately and queued requests join mid-flight (see the sketch below). This keeps the GPU busy regardless of output-length variance, and is central to making self-hosted LLM inference economically viable at scale.
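
To make the per-iteration re-batching concrete, here is a minimal sketch of a continuous-batching serve loop. The `Request` class, `serve` function, and `decode_step` callback are hypothetical names used purely for illustration; a real server such as vLLM layers scheduling policy, KV-cache paging, and preemption on top of this basic idea.

```python
import collections
from dataclasses import dataclass, field


@dataclass
class Request:
    """A hypothetical in-flight generation request."""
    prompt_tokens: list
    max_new_tokens: int
    output_tokens: list = field(default_factory=list)

    def is_finished(self) -> bool:
        return len(self.output_tokens) >= self.max_new_tokens


def serve(pending: collections.deque, batch_limit: int, decode_step) -> None:
    """Continuous-batching loop: the batch is rebuilt every iteration.

    `decode_step` stands in for one forward pass of the model that
    appends one token to each active request.
    """
    active: list[Request] = []
    while pending or active:
        # Admit queued requests into any free slots. This happens per
        # iteration, not per batch, which is what distinguishes
        # continuous batching from static batching.
        while pending and len(active) < batch_limit:
            active.append(pending.popleft())

        # One token-generation step for every active sequence.
        decode_step(active)

        # Retire finished sequences immediately, freeing their slots
        # for the next iteration instead of waiting for the whole
        # batch to drain.
        active = [r for r in active if not r.is_finished()]


# Toy usage: three requests with different output lengths share two slots.
queue = collections.deque(Request([1, 2, 3], max_new_tokens=n) for n in (2, 5, 3))
serve(queue, batch_limit=2,
      decode_step=lambda batch: [r.output_tokens.append(0) for r in batch])
```

The key design choice is that admission and retirement happen inside the per-token loop, so a short request queued behind a long one is never forced to wait for the entire batch to finish.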

Synonyms

dynamic batching, inference-time batching

See also

  • Serving pattern — The architectural shape of the inference path — managed API, cloud-platform hosted, self-hosted online, self-hosted batch, or edge.
  • Prompt caching — An inference optimisation that caches the attention key-value state for a prompt prefix so that subsequent requests sharing the same prefix skip re-processing.
  • TTFT (time-to-first-token) — The latency from request submission to the first streamed output token.