
Prompt caching

An inference optimisation that caches the attention key-value (KV) state for a prompt prefix so that subsequent requests sharing the same prefix skip recomputing it.
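A minimal sketch of the mechanism, under stated assumptions: `PrefixCache`, `compute_kv`, and `generate` are hypothetical names, not any engine's API, and the stub forward pass merely counts tokens so the cache-hit path is visible.

```python
import hashlib

def compute_kv(tokens, past_kv=None):
    # Stub for the transformer prefill: a real engine would return
    # attention key/value tensors; here we just track processed tokens.
    print(f"forward pass over {len(tokens)} tokens")
    return (past_kv or []) + list(tokens)

class PrefixCache:
    def __init__(self):
        self._store = {}  # hash of prefix tokens -> cached "KV state"

    def _key(self, tokens):
        return hashlib.sha256(repr(list(tokens)).encode()).hexdigest()

    def get(self, tokens):
        return self._store.get(self._key(tokens))

    def put(self, tokens, kv):
        self._store[self._key(tokens)] = kv

def generate(cache, prefix, suffix):
    kv = cache.get(prefix)
    if kv is None:                          # miss: pay full prefill cost
        kv = compute_kv(prefix)
        cache.put(prefix, kv)
    return compute_kv(suffix, past_kv=kv)   # hit: only the suffix is new

cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant."]
generate(cache, system_prompt, ["What", "is", "2+2?"])   # full prefill
generate(cache, system_prompt, ["Summarise", "this."])   # prefix skipped
```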

What this means in practice

Reduces both latency and cost for repeated context (system prompts, long documents, few-shot examples); cache hit rate is therefore a first-order metric when architecting for cost.
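To see why hit rate is first-order, here is illustrative back-of-envelope arithmetic; the per-token price, the 90% discount on cached tokens, and the traffic shape are all assumptions, not any provider's actual pricing.

```python
PRICE_PER_INPUT_TOKEN = 3e-6   # $/uncached input token (assumed)
CACHED_TOKEN_DISCOUNT = 0.9    # cached tokens billed at 10% (assumed)

def monthly_input_cost(requests, prefix_tokens, suffix_tokens, hit_rate):
    """Expected input-token spend given a prefix-cache hit rate."""
    cached_tokens = requests * hit_rate * prefix_tokens
    uncached_tokens = requests * (prefix_tokens + suffix_tokens) - cached_tokens
    return (uncached_tokens
            + cached_tokens * (1 - CACHED_TOKEN_DISCOUNT)) * PRICE_PER_INPUT_TOKEN

# 1M requests/month sharing an 8,000-token prefix, 200-token user suffix:
for hit_rate in (0.0, 0.5, 0.95):
    print(f"hit rate {hit_rate:.0%}: "
          f"${monthly_input_cost(1_000_000, 8_000, 200, hit_rate):,.0f}")
```

Under these assumed numbers, moving the hit rate from 0% to 95% cuts the monthly input bill from roughly $24,600 to $4,080, which is why the metric deserves first-order attention.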

Synonyms

prefix caching, KV-cache reuse

See also

  • Semantic caching — A caching strategy in which cache hits are determined by semantic similarity to prior queries rather than exact-string match — typically implemented by embedding the query and performing nearest-neighbour search over a cache of past query-response pairs; a minimal sketch follows this list.
  • Continuous batching — An inference-server technique — popularised by vLLM and Text Generation Inference — that dynamically groups concurrent requests at the token-generation level to raise GPU utilisation.
  • Model routing — A pattern that routes each request to the cheapest model capable of handling it, escalating to more powerful models only when necessary — typically via a small classifier, confidence-based escalation, or response evaluation.
  • Per-task cost — An agent SLI capturing the full compute and API cost of a single task end-to-end — including all loop iterations, tool calls, memory reads and writes.
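
For contrast with prompt caching's exact-prefix matching, here is a toy semantic-cache sketch. The names are hypothetical, `embed` is a bag-of-characters stand-in for a real embedding model, the threshold is arbitrary, and a production system would use an approximate-nearest-neighbour index rather than a linear scan.

```python
import math

def embed(text):
    # Toy bag-of-characters embedding, purely for illustration.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # (embedding, response) pairs

    def lookup(self, query):
        q = embed(query)
        best, best_sim = None, -1.0
        for vec, response in self.entries:       # nearest-neighbour scan
            sim = sum(a * b for a, b in zip(q, vec))  # cosine (unit vectors)
            if sim > best_sim:
                best, best_sim = response, sim
        return best if best_sim >= self.threshold else None

    def store(self, query, response):
        self.entries.append((embed(query), response))

cache = SemanticCache(threshold=0.9)
cache.store("What is the capital of France?", "Paris")
print(cache.lookup("what is the capital of france"))  # hit despite wording change
```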