AITM M1.3-Art03 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Offline Evaluation



AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 3 of 14


Offline evaluation is where most AI experiments begin, and it is where most of their illusions of progress are generated. A model that looks excellent on a held-out test set is not yet a model that works in production. The gap between the two is the subject of this article. Offline evaluation is a real and necessary filter, but its result is an evidentiary claim about what the data captures, not about what production does. Understanding that distinction — and the leakage, slicing, and interpretation hazards that widen the gap — is the discipline this article teaches.

The partition vocabulary

The classical partition of a dataset into training, validation, and test slices is the foundation of offline evaluation. Each slice has a specific role.

  • Training set. Data the model learns parameters from. It is touched many times across training.
  • Validation set (also called development set). Data used to tune hyperparameters, to choose between model variants, and to stop training. It is touched many times across model selection but never by gradient descent.
  • Test set (also called held-out set). Data used to estimate the model’s generalization performance after all tuning decisions are frozen. It is touched once — ideally literally once — at the end of the experiment.
  • Challenge set. A curated slice designed to probe specific behaviors or edge cases. Not used for training or for hyperparameter selection; used for targeted evaluation.

The discipline that makes the partition produce evidence is isolation. The test set must remain unseen during all model selection. The validation set must remain unseen during all training. Any violation of either isolation inflates the offline metrics without changing the model’s actual generalization. The ML Test Score rubric published by Breck, Cai, Nielsen, Salib, and Sculley (IEEE BigData 2017) treats this isolation discipline as a gating criterion for production readiness1.

Three-way splits work when data is abundant and stationary. When data is small, cross-validation substitutes. When data is temporal, splits must respect time order. When data is grouped (many rows per user, many documents per organization), splits must respect group boundaries. Each of these adaptations matters, and skipping one of them is the most common source of offline-to-online transfer failure.
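A minimal sketch of the temporal adaptation, in plain Python: all training data strictly precedes validation data, which strictly precedes test data. The function name, cutoff dates, and row format are illustrative, not from any library.

```python
from datetime import date

def temporal_split(rows, val_cutoff, test_cutoff):
    """Split time-stamped rows so training always precedes evaluation in time.

    rows: list of (timestamp, features, label) tuples.
    Rows before val_cutoff train; [val_cutoff, test_cutoff) validates;
    rows at or after test_cutoff form the held-out test slice.
    """
    train = [r for r in rows if r[0] < val_cutoff]
    val = [r for r in rows if val_cutoff <= r[0] < test_cutoff]
    test = [r for r in rows if r[0] >= test_cutoff]
    return train, val, test

# Twelve monthly rows: Jan-Jun train, Jul-Sep validate, Oct-Dec test.
rows = [(date(2025, m, 1), {"x": m}, m % 2) for m in range(1, 13)]
train, val, test = temporal_split(rows, date(2025, 7, 1), date(2025, 10, 1))
```

A random shuffle here would put future rows into training, which is exactly the temporal-order violation the paragraph above warns against.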

[DIAGRAM: StageGateFlow — aitm-eci-article-3-offline-pipeline — A left-to-right flow: raw data -> split (train/val/test) -> train -> validate -> freeze -> test -> held-out-check -> decision gate, with a sidebar showing leakage checks at the split stage.]

Cross-validation

Cross-validation rotates data through training and held-out roles to make more use of small datasets. The canonical form is k-fold: the data is partitioned into k equal folds, and the model is trained k times, each time using k-1 folds for training and one fold for evaluation, with the final metric the average across folds. Typical values of k are 5 or 10.
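The rotation can be sketched in a few lines of plain Python; the helper name and the stand-in scorer are illustrative only — real code would train and score a model inside the loop.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, eval_idx) pairs for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal folds
    for held_out in range(k):
        eval_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds)
                     if f != held_out for i in fold]
        yield train_idx, eval_idx

# Each row is evaluated exactly once; the final metric is the fold average.
scores = []
for train_idx, eval_idx in k_fold_indices(n=100, k=5):
    scores.append(len(eval_idx))  # stand-in for a real per-fold metric
mean_score = sum(scores) / len(scores)
```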

Three variants are worth the practitioner’s attention.

Stratified k-fold. For classification with imbalanced classes, each fold preserves the class proportions of the full dataset. Stratification is the default when classes are uneven.

Time-series cross-validation. For temporal data, standard k-fold violates temporal order and produces inflated scores. Time-series cross-validation uses expanding or rolling windows where training data always precedes evaluation data in time. The evaluation metric is still averaged, but the folds are ordered.

Group k-fold. When data has group structure (multiple rows per user, patient, organization, document), all rows from a given group must be placed in the same fold. Otherwise the model learns group-level patterns and the evaluation score measures in-group prediction, not out-of-group generalization. Group leakage is the most common leakage class in enterprise datasets.
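A sketch of the group constraint, assuming nothing beyond the standard library: every row of a group lands in the same fold, assigned greedily so fold sizes stay balanced. The function name is illustrative; libraries such as scikit-learn provide equivalent splitters.

```python
from collections import defaultdict

def group_folds(group_ids, k):
    """Assign every row of a group to the same fold (greedy balancing)."""
    rows_per_group = defaultdict(list)
    for row, g in enumerate(group_ids):
        rows_per_group[g].append(row)
    # Largest groups first, each into the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g, rows in sorted(rows_per_group.items(), key=lambda kv: -len(kv[1])):
        min(folds, key=len).extend(rows)
    return folds

# Seven rows belonging to four users; no user is split across folds.
groups = ["u1", "u1", "u2", "u3", "u3", "u3", "u4"]
folds = group_folds(groups, k=2)
```

With this constraint in place, the held-out fold contains only unseen users, so the score measures out-of-group generalization rather than per-user memorization.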

Cross-validation is appropriate at the model-selection stage (during validation). It is not a substitute for a held-out test set. A practitioner who reports only cross-validated scores and never holds out a final test set has reported a validation score, not a generalization score.

Leakage modes

Leakage is the inadvertent inclusion of information from the held-out data (or the future, or the target) into training. It inflates offline metrics without improving the model. This article’s taxonomy follows the Breck et al. rubric and several published retrospectives from Kaggle competitions12.

Target leakage. A feature used at training time includes information about the target that will not be available at inference time. Classic example: a “days since payment” feature that is computed differently for paid and unpaid accounts leaks the payment status. Target leakage produces models that look excellent offline and fail catastrophically in production.

Temporal leakage. A feature used at training time reflects future information relative to the target. Example: a feature that aggregates “all purchases in the last 30 days” is computed as of the evaluation date for both training and test, which means training examples have access to post-event purchases. Prevention is point-in-time correctness: every feature is computed as it would have been observable at the moment of prediction.
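Point-in-time correctness can be made concrete with a small sketch; the feature name and dates are hypothetical. The key is that the window is anchored at each example's own prediction date, never at a later evaluation date.

```python
from datetime import date, timedelta

def purchases_last_30_days(purchase_dates, as_of):
    """Point-in-time correct feature: only events observable at as_of count."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for d in purchase_dates if window_start <= d < as_of)

purchases = [date(2025, 3, 1), date(2025, 3, 20), date(2025, 4, 10)]
# Computed as of the training example's own prediction date:
feature = purchases_last_30_days(purchases, as_of=date(2025, 3, 25))
# The April purchase is in the future relative to March 25 and is excluded;
# computing the feature as of a later evaluation date would leak it in.
```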

Identity leakage. A record-level identifier (user ID, device ID, session ID) is present as a feature, and the same identifiers appear in both training and test sets. The model memorizes per-identifier behavior, which will not generalize to unseen identifiers. Prevention is group-aware splitting, plus identifier masking in feature engineering.

Duplicate leakage. The same record appears in both training and test sets, often because of upstream duplication. Prevention is deduplication on a stable primary key before splitting.

Preprocessing leakage. Normalization, scaling, or feature-selection statistics are computed on the full dataset (training plus test) and then applied to the splits. Prevention is “fit on train, transform on test”: every preprocessing transform is fitted only on the training split and applied to the held-out splits at evaluation time.
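The "fit on train, transform on test" rule, as a stdlib-only sketch (the function names are illustrative; pipeline libraries enforce the same contract):

```python
import statistics

def fit_scaler(train_values):
    """Compute normalization statistics on the training split only."""
    mu = statistics.fmean(train_values)
    sigma = statistics.pstdev(train_values) or 1.0  # guard constant features
    return mu, sigma

def transform(values, mu, sigma):
    """Apply the frozen training-split statistics to any split."""
    return [(v - mu) / sigma for v in values]

train = [10.0, 12.0, 14.0]
test = [20.0, 22.0]
mu, sigma = fit_scaler(train)        # test values never touch the fit
test_scaled = transform(test, mu, sigma)
```

Fitting the scaler on train plus test would shift mu and sigma toward the test distribution, which is precisely the preprocessing leakage described above.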

Published Kaggle retrospectives have documented leakage in competitions including Porto Seguro’s Safe Driver, Santander Value Prediction, and M5 Forecasting; the retrospectives are required reading for any practitioner building offline evaluation pipelines2.

[DIAGRAM: MatrixDiagram — aitm-eci-article-3-leakage-taxonomy — 2x2 of “Temporal vs. identity” by “Feature vs. target” leakage, with a named example in each cell and a prevention tactic attached.]

Held-out slices and challenge sets

A single aggregate test metric tells a practitioner one thing about a model. A sliced evaluation tells many things. The discipline is to evaluate not just on the full test set but on explicit slices that represent the populations, inputs, and edge cases the model must handle.

Slice candidates a practitioner should consider:

  • Demographic slices (by geography, by language, by customer segment), to detect disparate performance. The NIST AI RMF Generative AI Profile and AI RMF 1.0 both flag unequal performance as a MEASURE-function obligation3.
  • Temporal slices (most recent month, holiday windows, known regime-change windows), to detect concept drift in the test period.
  • Rare-class slices, because aggregate metrics hide sub-1% class performance.
  • Known-edge-case slices, where the team has curated examples of inputs known to be difficult.
  • Regression slice, a fixed curated set that does not change between experiments so that regressions can be tracked.
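The slice-by-slice report can be sketched as follows; the record layout and slice names are hypothetical, and the point is only that the aggregate appears as one row among many.

```python
from collections import defaultdict

def sliced_accuracy(examples):
    """Report accuracy per named slice alongside the aggregate.

    examples: list of dicts with keys 'slice', 'label', 'pred'.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        for key in ("ALL", ex["slice"]):
            totals[key] += 1
            hits[key] += int(ex["label"] == ex["pred"])
    return {key: hits[key] / totals[key] for key in totals}

examples = (
    [{"slice": "en", "label": 1, "pred": 1}] * 90 +
    [{"slice": "de", "label": 1, "pred": 0}] * 10
)
report = sliced_accuracy(examples)
# The aggregate of 0.90 hides a total failure on the rare 'de' slice.
```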

Stanford HELM (Liang et al., TMLR 2023) formalized this for LLM evaluation, arguing that a single benchmark score is insufficient and that holistic evaluation requires reporting across many slices and many metrics4. The same argument applies to classical ML evaluation: report slice by slice, and let the aggregate metric be a single line in a larger table.

Challenge sets deserve a separate mention. A challenge set is a small, curated collection of inputs that test a specific capability or failure mode. Examples include a set of counterfactual inputs probing model robustness, a set of adversarial examples probing safety, and a set of canonical questions probing factual accuracy. Challenge sets are not statistically representative of production traffic; that is the point. They are probes, and they pair with aggregate evaluation, not replace it.

Interpreting offline results

Offline evaluation produces a number. The number has limits. Three interpretive disciplines keep practitioners honest about what the number means.

Report confidence intervals, not point estimates. Bootstrap the test set, or use the standard error of the mean for aggregate metrics. A 1-point difference in accuracy between two models is not meaningful if the confidence interval is 3 points wide.
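A percentile-bootstrap sketch of that discipline, using only the standard library; the resample count and the per-example correctness vector are illustrative.

```python
import random
import statistics

def bootstrap_ci(per_example_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean test metric."""
    rng = random.Random(seed)
    n = len(per_example_scores)
    means = sorted(
        statistics.fmean(rng.choices(per_example_scores, k=n))
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Per-example correctness for a hypothetical 0.87-accuracy model on n=100:
scores = [1] * 87 + [0] * 13
lo, hi = bootstrap_ci(scores)
# The interval width, not the point estimate, decides whether a
# 1-point gap between two models is meaningful.
```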

Quantify the train-test distribution gap. Compute covariate shift statistics (Population Stability Index, Kolmogorov-Smirnov on key features) between the training data and the test data. A large gap warns that offline performance estimates the model’s behavior on the test distribution, not on production.
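A minimal PSI implementation, assuming equal-width bins fitted on the expected (training) sample; binning strategy and the 1e-6 floor are common conventions, not a standard.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one feature."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for v in sample:
            counts[sum(v > e for e in edges)] += 1  # bin by edges exceeded
        return [max(c / len(sample), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_feature = [i / 100 for i in range(100)]        # uniform on [0, 1)
test_feature = [0.5 + i / 200 for i in range(100)]   # shifted to [0.5, 1)
shift = psi(train_feature, test_feature)
# A common rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate, > 0.25 major.
```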

Quantify the test-production gap. If production logs are available (from a previous version of the feature, for instance), compute the same shift statistics between the test data and production traffic. A large gap is a warning that offline performance is not a direct estimator of production performance, and the mode classification from Article 1 must include a shadow or canary step to close the gap.

These three disciplines turn “the model’s accuracy is 0.87” into “the model’s accuracy is 0.87 ± 0.02 on a test set whose input distribution is moderately shifted from recent production traffic; shadow evaluation should precede ramp”. The second statement is evidence; the first is a dashboard.

Two real cases, in the offline discipline vocabulary

Porto Seguro Safe Driver — identifier leakage. Published retrospectives on the Kaggle Porto Seguro competition documented that early leaderboard positions were dominated by feature engineering that captured per-driver identifiers; the leaders did not generalize to out-of-competition data. The correction, applied by later-stage competitors and in competition retrospectives, was group-aware split adjustment2. The case is a reference example of identity leakage at industrial scale.

ML Test Score — a rubric for offline-evaluation readiness. Breck et al. (IEEE BigData 2017) published a 28-item rubric for production ML readiness, approximately a third of which addresses offline-evaluation hygiene (split isolation, leakage checks, slice coverage, challenge sets)1. The rubric does not propose a new technology; it proposes that the offline pipeline be reviewed against an explicit checklist before production. The rubric is the reference practitioners point peer reviewers to.

Summary

Offline evaluation is a disciplined filter, not a license to deploy. The train-validation-test partition must preserve isolation, the cross-validation variant must match the data structure (temporal, grouped, stratified), leakage must be checked in five named classes, slices must be evaluated explicitly, and results must be interpreted with confidence intervals and distribution-gap statistics. Practitioners who do this work produce evidence. Practitioners who skip it produce only a number, which is not the same thing. The next article takes the practitioner into online evaluation, where the gaps offline cannot close are addressed.

Further reading in the Core Stream: Machine Learning Fundamentals for Decision Makers.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data, 2017. https://research.google/pubs/pub46555/ — accessed 2026-04-19.

  2. Kaggle competition leakage discussions and retrospectives (Porto Seguro, Santander, M5). https://www.kaggle.com/discussions/general/4639 — accessed 2026-04-19.

  3. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.

  4. Percy Liang et al. Holistic Evaluation of Language Models. Transactions on Machine Learning Research, 2023. https://crfm.stanford.edu/helm/ — accessed 2026-04-19.