
Benchmark contamination

The presence of benchmark test data in a foundation model's training corpus, whether through web crawling or deliberate inclusion. Contamination inflates reported benchmark scores and breaks the comparability of benchmark results across models.

What this means in practice

Detection methods include canary strings (unique markers planted in a benchmark release and later searched for in training corpora), membership-inference tests, and held-out contamination probes such as checking whether a model can complete test items verbatim from a prefix. A minimal sketch of two of these checks appears below.
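
A minimal Python sketch of a canary scan and a verbatim-completion probe, under stated assumptions: documents is an iterable of raw training texts, generate(prompt) is a text-generation callable, and the canary value and function names are hypothetical, not a real API.

# Hypothetical sketch: two contamination checks under assumed interfaces.

CANARY = "COMPEL-CANARY-c7f3a9d2-5b1e-4e8a-9f60-2d4c8e1b7a33"  # hypothetical marker

def corpus_contains_canary(documents, canary: str = CANARY) -> bool:
    """Scan raw training documents for a canary planted in the benchmark release."""
    return any(canary in doc for doc in documents)

def verbatim_completion_probe(generate, items, prefix_frac: float = 0.5,
                              match_chars: int = 40):
    """Flag benchmark items the model completes verbatim from a prefix,
    a common signal that the item was memorized during training."""
    flagged = []
    for item in items:
        split = int(len(item) * prefix_frac)
        prefix, expected = item[:split], item[split:]
        completion = generate(prefix)
        # Compare only the leading characters to tolerate trailing drift.
        if completion.strip().startswith(expected.strip()[:match_chars]):
            flagged.append(item)
    return flagged

Neither check is conclusive on its own: a missing canary proves nothing about crawled copies that stripped it, and verbatim completion can have false positives on formulaic items, so the probes are typically used together.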

Synonyms

test-set contamination, benchmark leakage

See also

  • Data leakage — Information from the test or validation set inadvertently entering training — through preprocessing, feature engineering, target encoding, or time-ordered splits — inflating offline metrics and producing over-optimistic ship decisions.
  • LLM-as-judge — An evaluation technique using a large language model to score outputs from another LLM on quality dimensions — helpfulness, correctness, safety — scaling evaluation beyond human-rater capacity.
  • Evaluation harness — The infrastructure that runs capability, regression, safety, and human-review evaluations for an LLM feature on a defined cadence.
  • Reproducibility — The property that re-running an experiment with the same code, data, and configuration produces the same results within declared tolerance.