Golden dataset

A versioned, labeled, license-cleared evaluation dataset used as the benchmark reference for an AI feature.

What this means in practice

A golden dataset is distinguished from production or training data by explicit curation, inter-annotator agreement review, and versioned immutability: once a version is published it is never edited, only superseded by a new version. It is the single artefact against which regression, release, and drift decisions are made.
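As a sketch of what versioned immutability can look like inside an evaluation harness, the snippet below pins a golden dataset to a checksum recorded at freeze time before an evaluation run, and gates a release on non-regression against the same version. The file layout, names, and tolerance are illustrative assumptions, not a prescribed interface.

```python
import hashlib
import json
from pathlib import Path

# Hypothetical layout: one frozen JSONL file per golden-dataset version.
GOLDEN_PATH = Path("eval/golden/v3/examples.jsonl")
# Digest recorded when v3 was frozen; a mismatch means the artefact was
# mutated after publication, which a golden dataset forbids.
EXPECTED_SHA256 = "replace-with-digest-recorded-at-freeze-time"

def load_golden(path: Path, expected_sha256: str) -> list[dict]:
    """Load one golden-dataset version, refusing to run on a mutated file."""
    raw = path.read_bytes()
    digest = hashlib.sha256(raw).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(
            f"golden set checksum mismatch ({digest}); golden versions are "
            "immutable, so publish a new version rather than editing this one"
        )
    return [
        json.loads(line)
        for line in raw.decode("utf-8").splitlines()
        if line.strip()
    ]

def regression_gate(candidate_score: float, baseline_score: float,
                    tolerance: float = 0.01) -> bool:
    """Release decision: the candidate must not score worse than the recorded
    baseline on the same golden-dataset version, within a small tolerance."""
    return candidate_score >= baseline_score - tolerance
```

Failing hard on a checksum mismatch is what keeps regression numbers comparable across runs: two scores can only be compared when both were produced against the same frozen version.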

Synonyms

golden set, reference evaluation dataset, eval gold set

See also

  • Evaluation harness — The infrastructure that runs capability, regression, safety, and human-review evaluations on an LLM feature on a defined cadence.
  • LLM-as-judge — An evaluation technique using a large language model to score outputs from another LLM on quality dimensions — helpfulness, correctness, safety — scaling evaluation beyond human-rater capacity.
  • Benchmark contamination — The presence of benchmark test data in foundation-model training corpora — whether through web crawling or deliberate inclusion — inflating reported benchmark scores and breaking the comparability of benchmark results across models.
  • Position bias (judge) — The systematic tendency of an LLM-as-judge to favour responses in a particular position (first, second, or last) when comparing candidates — independent of content quality.