

Data leakage

Information from the test or validation set inadvertently entering training (through preprocessing, feature engineering, or target encoding fitted on the full dataset, or through splits that ignore time order), inflating offline metrics and producing over-optimistic ship decisions.

What this means in practice

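A leading cause of offline-to-online performance gaps. Defense requires disciplined split protocols: split before any fitted preprocessing, fit transforms and encoders on the training portion only, and use temporal holdouts when the data is time-ordered.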
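As a minimal sketch of a split protocol with a temporal holdout, the example below splits toy data by time and fits standardization statistics on the training rows only. The helper names (`temporal_split`, `fit_scaler`, `transform`), the row layout, and the data are all hypothetical and chosen for illustration; a real pipeline would typically rely on a library's fit/transform machinery instead.

```python
from statistics import mean, pstdev


def temporal_split(rows, cutoff):
    """Split rows by timestamp: train is strictly before cutoff, test is at or after it."""
    train = [r for r in rows if r["t"] < cutoff]
    test = [r for r in rows if r["t"] >= cutoff]
    return train, test


def fit_scaler(values):
    """Compute standardization statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero variance
    return mu, sigma


def transform(values, mu, sigma):
    """Apply previously fitted statistics; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: "t" is a time index, "x" is a feature that drifts upward.
rows = [{"t": t, "x": float(t)} for t in range(10)]

# Split first, so that no later step can see the holdout rows.
train, test = temporal_split(rows, cutoff=8)

# Leak-free: statistics come from the training rows only.
mu, sigma = fit_scaler([r["x"] for r in train])
test_scaled = transform([r["x"] for r in test], mu, sigma)

# Leaky, shown only for contrast: statistics computed over all rows, test included.
leaky_mu, leaky_sigma = fit_scaler([r["x"] for r in rows])
```

The two fitted means differ because the leaky version has already seen the holdout rows. The same ordering applies to any fitted step, including target encoders and feature selectors: split first, fit on the training portion, then apply to the holdout.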

Synonyms

target leakage, feature leakage, evaluation leakage

See also

  • Offline evaluation — Assessment of an AI system against static datasets — training hold-out, validation set, benchmark corpus — without exposure to live user traffic.
  • Benchmark contamination — The presence of benchmark test data in foundation-model training corpora — whether through web crawling or deliberate inclusion — inflating reported benchmark scores and breaking the comparability of benchmark results across models.
  • Reproducibility — The property that re-running an experiment with the same code, data, and configuration produces the same results within declared tolerance.