AITF M1.22-Art03 v1.0 Reviewed 2026-04-06 Open Access
AITF · Foundations

Reproducibility in AI: Container, Code, Data, Environment


5 min read Article 3 of 4

This article examines each of the four dimensions in turn — data, code, environment, and configuration — along with the tooling that makes reproducibility achievable in production AI environments and the operational practices that distinguish programs that talk about reproducibility from those that demonstrate it.

Why AI Programs Have a Reproducibility Problem

The Association for Computing Machinery Journal Reproducibility Initiative at https://reproducibility.acm.org/ has documented for years that even peer-reviewed AI papers frequently cannot be reproduced from their published artifacts. The Joelle Pineau Reproducibility Checklist at https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf has become the standard pre-submission test at major machine learning conferences.

In enterprise AI, the same dynamics apply but with higher consequences. A model that performed well in development but cannot be reproduced in production cannot be debugged, validated by an independent reviewer, defended in a regulatory inquiry, or rolled back. The Office of the Comptroller of the Currency Bulletin 2021-39 on Sound Risk Management of AI at https://www.occ.gov/news-issuances/bulletins/2021/bulletin-2021-39.html explicitly cites reproducibility as a model risk management expectation.

Dimension One: Data

Data reproducibility means that the exact dataset used in a run can be retrieved unchanged at a later time. Three sub-properties are required.

Versioning. Tools include Data Version Control (DVC) at https://dvc.org/, lakeFS, and the dataset versioning features of modern data platforms (Delta Lake, Apache Iceberg).

Hashing. The snapshot should be content-addressable: a cryptographic hash of the data confirms that what is retrieved later is bit-for-bit identical to what was used.

Pre-processing capture. Many “data” operations are actually transformations: deduplication, balancing, resampling, augmentation. The transformation logic and parameters must be captured along with the source dataset.
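The hashing and pre-processing points above can be combined: fingerprint the dataset bytes together with the transformation parameters, so a change to either produces a different identifier. A minimal sketch (the function name and parameter layout are illustrative, not a standard API):

```python
import hashlib
import json

def dataset_fingerprint(data: bytes, transform_params: dict) -> str:
    """Content-address a dataset snapshot together with the
    pre-processing parameters applied to it. Any change to the
    bytes or the transformation yields a different fingerprint."""
    h = hashlib.sha256()
    h.update(data)
    # Canonical JSON so key ordering cannot change the hash.
    h.update(json.dumps(transform_params, sort_keys=True).encode())
    return h.hexdigest()

fp = dataset_fingerprint(b"row1,row2", {"dedupe": True, "seed": 42})
```

The fingerprint is what goes into the run record; retrieval later re-hashes the snapshot and compares, confirming bit-for-bit identity.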

Dimension Two: Code

Version control everything. Training scripts, preprocessing pipelines, evaluation harnesses, and configuration files all live in version control with a commit hash recorded for the run.

No silent dependency drift. Lock files (poetry.lock, requirements.txt with pinned versions) are non-negotiable.

Deterministic algorithms where possible. Frameworks expose deterministic-mode flags; the PyTorch documentation at https://pytorch.org/docs/stable/notes/randomness.html catalogues the controls.

Seed management. Random seeds for data shuffling, weight initialisation, dropout, and augmentation should be explicit and recorded.
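Seed management is the easiest of these to demonstrate. A sketch using only the standard-library RNG (a real training run would also seed numpy and the framework, e.g. torch; the function name and returned record are illustrative):

```python
import random

def set_seeds(seed: int) -> dict:
    """Seed the stdlib RNG and return a record of the seeds used,
    suitable for writing into the run record."""
    random.seed(seed)
    return {"python_random": seed}

set_seeds(1234)
first = [random.random() for _ in range(3)]
set_seeds(1234)
second = [random.random() for _ in range(3)]
# Re-seeding with the same value reproduces the same draws.
```

The point is not the seeding call itself but the returned record: an unseeded or unrecorded RNG is invisible in a post-hoc investigation.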

Dimension Three: Environment

Containers. Docker images that capture the OS, the language runtime, and the installed libraries are the standard mechanism. The image reference must be immutable; pulling a mutable tag such as latest is the most common reproducibility failure. The Open Containers Initiative specification at https://opencontainers.org/ sets the underlying standards.
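One cheap guard is to reject image references that are not pinned to a sha256 digest before a run is allowed to start. A sketch (the function name and registry path are hypothetical):

```python
import re

def is_pinned(image_ref: str) -> bool:
    """Return True only if the image reference ends in an immutable
    sha256 digest rather than a mutable tag."""
    return bool(re.search(r"@sha256:[0-9a-f]{64}$", image_ref))

# A tag can be re-pointed at a different image; a digest cannot.
tagged = "myregistry/train:latest"
pinned = "myregistry/train@sha256:" + "a" * 64
```

In practice the check sits in the run launcher, and the digest (not the tag) is what gets written to the run record.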

Hardware specification. Different GPU models can produce slightly different floating-point results. The specific hardware family used in a run should be recorded.

Driver versions. CUDA, cuDNN, NCCL, and similar driver-level dependencies are a frequent source of reproducibility failures and should be recorded alongside the container image.

Distributed training topology. The number of workers, the communication backend, and the gradient aggregation method all influence outcomes.

Dimension Four: Configuration

Configuration as code. Configuration files should be in version control alongside the training code.

Single source of truth. Configurations should be loaded from a single canonical source and any overrides logged.

Capture, don’t infer. The actual configuration used at run time should be written to the run record.
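The capture-don't-infer rule can be made concrete by writing the resolved configuration — defaults, file values, and overrides all merged — into the run record at start-up. A sketch under the assumption of JSON-serialisable config (the function name and file name are illustrative):

```python
import json
import pathlib
import tempfile

def capture_config(resolved: dict, run_dir: str) -> pathlib.Path:
    """Write the configuration actually in effect at run time into
    the run record, rather than trusting that defaults can be
    inferred later from code archaeology."""
    path = pathlib.Path(run_dir) / "config.resolved.json"
    path.write_text(json.dumps(resolved, indent=2, sort_keys=True))
    return path

with tempfile.TemporaryDirectory() as d:
    saved = capture_config({"lr": 3e-4, "epochs": 10}, d)
    restored = json.loads(saved.read_text())
```

Sorting keys keeps the snapshot diffable across runs, which makes "what changed between these two runs?" a one-line comparison.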

The Run Manifest

Mature programs produce, for every training and evaluation run, a run manifest that captures all four dimensions in a single artefact. The manifest typically contains:

  • run identifier and timestamp
  • code commit hash
  • container image identifier (digest, not tag)
  • hardware and driver metadata
  • dataset versions and content hashes
  • configuration snapshot
  • random seeds
  • resource consumption
  • output artefact identifiers
  • validation status
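A manifest like this can be sketched as a small dataclass serialised to JSON; the field names here are illustrative and should be aligned with whatever schema your tracking tool expects:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class RunManifest:
    """One artefact covering all four dimensions of a run."""
    run_id: str
    timestamp: str
    code_commit: str          # code: git commit hash
    container_digest: str     # environment: image digest, not tag
    hardware: dict            # environment: GPU family, drivers
    dataset_hashes: dict      # data: name -> content hash
    config: dict              # configuration: resolved snapshot
    seeds: dict               # code: explicit random seeds

manifest = RunManifest(
    run_id="run-0017",
    timestamp="2026-04-06T12:00:00Z",
    code_commit="9f2c4e1",
    container_digest="sha256:" + "0" * 64,
    hardware={"gpu": "A100", "cuda": "12.1"},
    dataset_hashes={"train": "sha256:placeholder"},
    config={"lr": 3e-4},
    seeds={"global": 1234},
)
record = json.dumps(asdict(manifest), sort_keys=True)
```

Writing the manifest as the run starts, rather than reconstructing it afterwards, is what makes it trustworthy.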

MLflow at https://mlflow.org/ and Weights & Biases capture much of this automatically; the gap is usually environment and dataset versioning.

Reproducibility Tiers

Not every run needs full reproducibility. A defensible tiering scheme:

  • Tier 1 (production training and any model that influences regulated decisions): full four-dimensional reproducibility, manifest, content hashes, deterministic mode where feasible.
  • Tier 2 (development model candidates and major experiments): code, configuration, and environment captured; data versioned; results may vary within published bounds.
  • Tier 3 (exploratory work): notebooks with documented seeds; reproducibility on best-effort basis.

Reproducibility and Foundation Models

Foundation models complicate reproducibility because the upstream model itself is rarely reproducible from the consumer’s perspective. The consumer can pin the model version and pin the inference parameters, but cannot reconstruct the model from training. The Stanford Foundation Model Transparency Index at https://crfm.stanford.edu/fmti/ tracks the degree to which providers expose enough information to support consumer reproducibility.

Operational Practices

  • Mandatory pre-flight checks that refuse to start a tier-1 run unless all four dimensions are satisfied.
  • Periodic re-execution drills that pick random historical runs and verify they can be reproduced.
  • Onboarding checklists that teach new practitioners the conventions.
  • Review of the manifest as a standard part of code review for AI changes.
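The first practice above — a pre-flight check that blocks tier-1 runs with missing dimensions — can be sketched as a validation pass over the manifest (field names are illustrative and should match your manifest schema):

```python
def preflight(manifest: dict) -> list:
    """Return the dimensions missing from a run manifest; a launcher
    would refuse to start a tier-1 run if the list is non-empty."""
    required = {
        "code_commit": "code",
        "container_digest": "environment",
        "dataset_hashes": "data",
        "config": "configuration",
    }
    return [dim for key, dim in required.items() if not manifest.get(key)]
```

Returning the full list of gaps, rather than failing on the first one, gives the practitioner everything to fix in a single pass.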

Looking Forward

The fourth article in Module 1.22 turns to AI system decommissioning — what happens at the other end of the lifecycle when reproducibility, lineage, and provenance must be preserved long after the system stops producing decisions.


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.