Evaluation harness

The infrastructure that runs capability, regression, safety, and human-review evaluations on an LLM feature on a defined cadence.

What this means in practice

An evaluation harness is treated as a governance artefact, not just an engineering convenience: its coverage (which evaluation types run against the feature) and its cadence (how often they run) directly determine whether the organisation can detect and act on drift or misuse.
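As an illustration of how coverage and cadence might be made explicit, the following is a minimal sketch, not a reference implementation. All names (`EvalSuite`, `EvalResult`, `run_due`), the score scale of 0 to 1, and the thresholds are assumptions introduced for this example; the stub lambdas stand in for real capability, regression, safety, or human-review evaluations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class EvalResult:
    score: float   # assumed to be normalised to the range 0..1
    passed: bool   # True when score meets the suite's threshold


@dataclass
class EvalSuite:
    """One evaluation type with its own cadence (hypothetical structure)."""
    name: str                   # e.g. "capability", "regression", "safety"
    cadence: timedelta          # how often the suite must run
    run: Callable[[], float]    # executes the evaluation, returns a score
    threshold: float            # minimum acceptable score
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        # A suite that has never run is always due.
        return self.last_run is None or now - self.last_run >= self.cadence


def run_due(suites: List[EvalSuite], now: datetime) -> Dict[str, EvalResult]:
    """Run every suite whose cadence has elapsed and record the outcome."""
    results: Dict[str, EvalResult] = {}
    for suite in suites:
        if not suite.is_due(now):
            continue
        score = suite.run()
        suite.last_run = now
        results[suite.name] = EvalResult(score, score >= suite.threshold)
    return results


# Usage: stub evaluations with fixed scores, purely for illustration.
suites = [
    EvalSuite("regression", timedelta(hours=1), lambda: 0.97, threshold=0.95),
    EvalSuite("safety", timedelta(days=1), lambda: 0.90, threshold=0.99),
]
start = datetime(2024, 1, 1, 9, 0)
first = run_due(suites, start)                        # both suites run
later = run_due(suites, start + timedelta(hours=2))   # only "regression" is due
failing = [name for name, result in first.items() if not result.passed]
```

In this sketch, a suite that is missing from the list is a gap in coverage, and a suite whose `cadence` is too long is a gap in detection time, which is why both are worth reviewing as governance decisions rather than leaving them as implementation details.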

Synonyms

LLM evaluation suite, eval harness, capability-and-safety evaluation

See also

  • Red-team (for LLMs) — A structured adversarial exercise against an LLM feature using human, automated, or hybrid techniques drawn from MITRE ATLAS or OWASP LLM Top 10 to discover failure modes before attackers do.
  • Confabulation — NIST's preferred term for hallucination: an LLM generating fluent output that is unsupported by ground truth.
  • Content safety classifier — A model or rule system that detects policy-violating output categories — violence, self-harm, CSAM, targeted harassment, dangerous instructions, and similar.
  • Model and prompt registry — A versioned inventory of models, system prompts, retrieval sources, and guardrails deployed in production.

Related articles in the Body of Knowledge