COMPEL Glossary / ttft-time-to-first-token

TTFT (time-to-first-token)

The latency from request submission to the first streamed output token.

What this means in practice

TTFT is the user-perceived responsiveness metric for streaming LLM applications. It is distinct from total generation latency because downstream UX depends on how long the user waits before any output appears at all, not on how long the full response takes to complete.
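Measuring TTFT amounts to timing the gap between submitting a request and receiving the first streamed token. The sketch below shows the idea in Python; `fake_stream` and its 50 ms prefill delay are stand-ins for a real streaming client, not any particular API.

```python
import time
from typing import Iterable, Iterator

def measure_ttft(token_stream: Iterable[str]) -> tuple[float, list[str]]:
    """Return (TTFT in seconds, all tokens) for a streamed response.

    The clock starts just before the first pull on the stream (a stand-in
    for request submission) and stops when the first token arrives.
    """
    start = time.perf_counter()
    it = iter(token_stream)
    first = next(it)                 # blocks until the first token arrives
    ttft = time.perf_counter() - start
    tokens = [first, *it]            # drain the rest; not counted in TTFT
    return ttft, tokens

# Simulated server: a 50 ms prefill delay before the first token.
def fake_stream() -> Iterator[str]:
    time.sleep(0.05)
    for tok in ["Hello", ",", " world"]:
        yield tok

ttft, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, tokens: {tokens}")
```

In a real deployment the same measurement is taken from the client side against the serving endpoint, so TTFT also absorbs network and queueing delay, not just prefill compute.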

Synonyms

time to first token, first-token latency

See also

  • Serving pattern — The architectural shape of the inference path — managed API, cloud-platform hosted, self-hosted online, self-hosted batch, or edge.
  • Continuous batching — An inference-server technique — popularised by vLLM and Text Generation Inference — that dynamically groups concurrent requests at the token-generation level to raise GPU utilisation.
  • Prompt caching — An inference optimisation that caches the attention key-value state for a prompt prefix so that subsequent requests sharing the same prefix skip re-processing.
  • SLI/SLO for AI — Service-level indicators and objectives for AI systems — including evaluation score, per-task cost, and goal-achievement rate alongside classical availability/latency.