Quantization (AI cost)

Representation of model weights (and sometimes activations) at lower numerical precision — INT8, INT4, or mixed-precision — to reduce memory footprint and accelerate inference.
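As a rough illustration of the memory savings, the sketch below (plain Python, illustrative numbers only; a hypothetical 7B-parameter model, weights only) compares the footprint at the precisions named above.

```python
# Back-of-the-envelope weight-memory footprint for a 7B-parameter model.
# Weights only; activations and KV cache are excluded.
PARAMS = 7e9
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, nbytes in BYTES_PER_PARAM.items():
    gib = PARAMS * nbytes / 2**30
    print(f"{precision}: ~{gib:.1f} GiB")
# FP16: ~13.0 GiB, INT8: ~6.5 GiB, INT4: ~3.3 GiB
```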

What this means in practice

Techniques include post-training quantization (e.g. GPTQ, AWQ) and quantization-aware training; the trade-off is typically a small quality degradation in exchange for a 2-4x reduction in memory and inference cost.
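To make the core idea concrete, here is a minimal NumPy sketch of symmetric per-tensor INT8 post-training quantization. It is illustrative only: production methods such as GPTQ and AWQ add calibration data, per-group scales, and error-compensating weight updates that are not shown here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0                       # map largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)       # stand-in weight matrix
q, scale = quantize_int8(w)

print(f"FP32 size: {w.nbytes / 2**20:.1f} MiB, INT8 size: {q.nbytes / 2**20:.1f} MiB")
print(f"mean abs reconstruction error: {np.abs(w - dequantize(q, scale)).mean():.5f}")
```

The 4x size reduction comes directly from storing one byte per weight instead of four; the small reconstruction error printed at the end is the quality cost that practical methods work to minimise.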

Synonyms

model quantization, weight quantization, INT8/INT4 quantization

See also

  • Distillation — The training of a smaller "student" model to imitate a larger "teacher" model's behaviour — typically on a shared dataset of prompts and teacher outputs.
  • PEFT (parameter-efficient fine-tuning) — A family of fine-tuning techniques — most prominently LoRA, QLoRA, and adapters — that update only a small fraction of model parameters while freezing the rest.
  • Serving pattern — The architectural shape of the inference path — managed API, cloud-platform hosted, self-hosted online, self-hosted batch, or edge.