AITM M1.3-Art06 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Experiment Tracking, Reproducibility, and Replicability


9 min read · Article 6 of 18 · Calibrate

AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 6 of 14


An experiment that cannot be re-run produces a decision that cannot be audited. An experiment that cannot be re-run on a different tech stack produces a finding that cannot be verified. These two failures are distinct, and the distinction has names. Reproducibility is the property that re-running the experiment with the same inputs produces the same results within declared tolerance. Replicability is the property that an independent team, with different data or different tooling or a different implementation, can reproduce the experiment’s conclusions. Both are discipline. Neither is optional for any experiment whose results will inform a governance, compliance, or deployment decision.

What a tracked experiment records

A tracked experiment produces a structured record. The record is small — a few megabytes, usually — but it is the difference between evidence and anecdote. Every experiment tracking tool worth using, and several that are not, records the same core artifact set.

Code hash. The exact git SHA (or equivalent) of the source tree the experiment ran against. Not the branch name, not the tag; the immutable SHA. GitHub, GitLab, Bitbucket, and self-hosted git all produce the SHA equivalently.

Data hash. The content hash of the training, validation, and test data the experiment used. Data hashes are produced by DVC, LakeFS, Git LFS, or simple content-addressed storage. The data hash is what lets a future practitioner verify that the data they are running on is the same data the experiment used.

Environment. The versions of the runtime (Python, Java, Scala, R, or another), the versions of the major dependencies (framework, numerical libraries, CUDA, MKL), and the hardware family (GPU model, instance type). A requirements.txt or environment.yml or pyproject.toml or a container image SHA captures this.

Hyperparameters. Every hyperparameter, including defaults that were not explicitly set, to the exact value used. Most tracking libraries capture these automatically when hyperparameters are passed through a structured configuration.

Random seeds. The seed for each source of randomness: data shuffling, model initialization, dropout, augmentation, batch ordering. Irreproducible experiments most often fail at the seed-capture step; practitioners who set a seed for NumPy but forget about PyTorch’s CUDA RNG produce experiments that re-run with drifting metrics.

Metrics. All primary, secondary, guardrail, and diagnostic metrics, tagged with the slice they were computed on and the time window.

Artifacts. Saved model weights, preprocessor state, feature store exports, evaluation outputs, error-analysis notebooks, and any figure the experiment’s final report depends on.

Lineage. The pointer from this experiment to upstream artifacts: the training data’s pipeline version, the previous model version that was the baseline, the prompt template version, the retrieval index version. Lineage is what makes “what changed?” answerable.

[DIAGRAM: HubSpokeDiagram — aitm-eci-article-6-tracked-experiment-hub — Central hub “Experiment Run” with spokes for code hash, data hash, environment, hyperparameters, seeds, metrics, artifacts, lineage. Each spoke names example content.]
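The artifact set above can be sketched as a single structured record. This is a minimal illustration, not any tool’s schema: tracking systems such as MLflow or W&B capture these fields automatically, and every name and value below is hypothetical (the code SHA would come from git rev-parse HEAD).

```python
import hashlib
import json
import sys

def build_run_record(code_sha, data_bytes, hyperparams, seeds,
                     metrics, artifacts, lineage):
    """Hedged sketch of the core artifact set a tracked run records."""
    return {
        "code_sha": code_sha,                                   # immutable git SHA, not a branch name
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),  # content hash of the dataset
        "environment": {"python": sys.version.split()[0]},      # plus framework/CUDA versions in practice
        "hyperparameters": hyperparams,                         # including un-overridden defaults
        "seeds": seeds,                                         # one per randomness source
        "metrics": metrics,                                     # tagged with slice and window in practice
        "artifacts": artifacts,                                 # URIs to weights, reports, figures
        "lineage": lineage,                                     # pointers to upstream versions
    }

record = build_run_record(
    code_sha="3f9c2ab",  # hypothetical SHA
    data_bytes=b"example-dataset",
    hyperparams={"lr": 3e-4, "batch_size": 64},
    seeds={"numpy": 42, "torch": 42, "shuffle": 7},
    metrics={"val_accuracy": 0.847},
    artifacts=["s3://runs/exp-14/model.pt"],
    lineage={"baseline_model": "v12", "data_pipeline": "v3.1"},
)
print(json.dumps(record, indent=2))
```

The record is deliberately flat JSON: a few kilobytes that any auditor can diff against a re-run.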

Tooling across the ecosystem

Experiment tracking is a mature space with both commercial and open-source options.

MLflow. Open-source tracking server, model registry, and project abstraction. Self-hostable on any infrastructure, integrates with every major cloud’s managed model registry, and ships embedded in Databricks, Azure ML, and several other platforms.[1]

Weights & Biases (W&B). Commercial service with a free tier; self-hosted enterprise option. Popular in deep-learning research and in teams that want strong visualization and collaboration features out of the box.[2]

Neptune. Commercial SaaS with a research focus; lightweight integration, good metadata search.[3]

Aim. Open-source, self-hostable, with a focus on reproducibility-first design.[4]

Comet, ClearML, Sacred, Guild AI. Additional open-source and commercial options, each with its own niche.

Vertex AI Experiments, SageMaker Experiments, Azure ML, Databricks native tracking. Cloud-provider native experiment tracking, tightly integrated with the provider’s model registry and pipeline products.[5]

Kubeflow Pipelines Metadata. Open-source, Kubernetes-native, with an explicit metadata store built for lineage graphs.

The choice of tool is less important than the discipline of using one. A team that records runs in MLflow and a peer team that records in W&B can exchange findings because both produce the same artifact set. A team that records runs in a shared spreadsheet cannot, because the spreadsheet does not capture the data hash, the environment, or the seeds.

Reproducibility

Reproducibility is the property that re-running the experiment with the same inputs produces the same results. In principle this is mechanical: same code, same data, same environment, same seeds. In practice it requires attention to four specific failure modes.

Nondeterministic hardware kernels. GPU kernels for several operations (notably reductions and attention) are nondeterministic by default. Many frameworks ship a deterministic mode that can be enabled (PyTorch’s torch.use_deterministic_algorithms(True), TensorFlow’s tf.config.experimental.enable_op_determinism()). Deterministic mode is slower but produces bit-reproducible results for a given seed.
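The drift has a simple root cause: floating-point addition is not associative, so a reduction whose accumulation order varies between runs lands on different values. A pure-Python illustration of the effect the deterministic modes exist to eliminate:

```python
# Floating-point addition is not associative. A parallel reduction that
# accumulates in a different order on each run produces drifting sums;
# deterministic kernel modes fix the order, at a performance cost.
vals = [1e16, 1.0, -1e16, 1.0]

print(sum(vals))                       # left-to-right: the first 1.0 is absorbed by 1e16 -> 1.0
print(sum([1e16, -1e16, 1.0, 1.0]))   # same values, different order -> 2.0
```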

Package drift. pip install torch installs the latest version that pip can resolve, not the version that was current at the time of the original experiment. Pinning every dependency — either via a lockfile (requirements.txt with version constraints, poetry.lock, uv.lock) or via a container image SHA — is how drift is prevented.
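One way to make “same environment” checkable is to reduce the runtime and pinned dependency versions to a single comparable hash. A hypothetical helper, not part of any tracking library:

```python
import hashlib
import platform

def environment_fingerprint(pinned: dict) -> str:
    # Canonical, sorted listing of runtime + pinned dependency versions,
    # hashed so two environments compare with one string equality.
    lines = [f"python=={platform.python_version()}"]
    lines += sorted(f"{name}=={version}" for name, version in pinned.items())
    return hashlib.sha256("\n".join(lines).encode()).hexdigest()

# Any drifted package (or Python upgrade) changes the fingerprint.
fingerprint = environment_fingerprint({"torch": "2.4.1", "numpy": "2.1.0"})
```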

Data drift. Training data that points to “the latest snapshot” of a feature store produces different results over time. Pinning to a dataset version (DVC tag, LakeFS branch, feature-store version) is how data drift is prevented.
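The dataset-pinning tools above all reduce to the same primitive: a content hash of the bytes. A streaming sketch in plain Python (DVC and LakeFS do this, plus versioning, for you):

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream in chunks so multi-gigabyte datasets hash in constant memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A future practitioner re-hashes the file and compares against the recorded digest; a single changed byte changes the hash.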

Unpinned external services. A training run that calls an external LLM API receives the provider’s current model at the time of the call. Pinning to a specific model version (e.g., gpt-4o-2024-08-06 rather than gpt-4o-latest, or claude-3-5-sonnet-20241022 rather than claude-3-5-sonnet) is how external drift is prevented.
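A lightweight guard in versioned configuration keeps floating aliases out of the run record. The model identifiers below are the dated snapshots named above; the helper and its config dict are illustrative, not a provider API:

```python
# Pinned, dated model snapshots live in versioned config, never inline.
PINNED_MODELS = {
    "generator": "gpt-4o-2024-08-06",       # not the floating gpt-4o-latest
    "judge": "claude-3-5-sonnet-20241022",  # not the bare claude-3-5-sonnet alias
}

def resolve_model(role: str) -> str:
    # Fail fast if a floating alias sneaks into config; the run record
    # must name the exact model version the provider served.
    model = PINNED_MODELS[role]
    if model.endswith("-latest"):
        raise ValueError(f"unpinned model alias for {role!r}: {model}")
    return model
```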

Reproducibility is declared to a tolerance, not as bit-for-bit equivalence. The practitioner declares the tolerance up front: “metric within 0.5% of the original run” is typical for large stochastic systems. A reproducibility test that periodically re-runs a subset of experiments against the current environment confirms that the tolerance holds. Netflix, Spotify, Booking.com, and other engineering organizations have published on automated reproducibility tests as a standard CI job.[6][7]
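The declared tolerance becomes a one-line assertion in the periodic reproducibility job. A sketch using the 0.5% figure from the text (the function name is illustrative):

```python
def reproduces(observed: float, expected: float, rel_tol: float = 0.005) -> bool:
    # "Within 0.5% of the original run": relative tolerance declared up front.
    return abs(observed - expected) <= rel_tol * abs(expected)

# Original run logged 0.847; a re-run landing at 0.846 is within tolerance.
assert reproduces(0.846, 0.847)
```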

Replicability

Replicability is the harder property. An independent team, with different data or different tooling, reproduces the experiment’s conclusions. Replicability is what distinguishes a finding from an artifact of a specific pipeline.

Replicability is not mechanical. It requires that the experiment’s design be legible enough for a different team to recreate the essential elements: the hypothesis, the measurement protocol, the metric definitions, the exclusion criteria. The published research community has invested heavily in replicability tooling; Papers With Code’s reproducibility reports and the ICLR Reproducibility Challenge are two well-known examples.[8]

The enterprise practitioner’s version of replicability is simpler: an experiment is replicable if a peer team, given the experiment’s brief and report, can run a variant experiment on their own stack and come to the same conclusion. That property requires that the brief and report be written in technology-neutral terms, naming hypotheses and measurements rather than vendor-specific configurations.

The reproducibility-and-replicability artifact

At the end of an experiment, the practitioner produces a “reproducibility-and-replicability” section of the experiment report (covered in detail in Article 14). The section records:

  • The exact commands, in order, that reproduce the experiment on a clean environment.
  • The expected runtime and the expected final metric (with tolerance).
  • Any external services the experiment depends on, with pinned versions.
  • Any hardware requirements (GPU model, memory, disk).
  • A “how to vary the experiment” paragraph describing which elements a replicator would change to run a variant and which should stay the same to preserve the comparison.

The section is concrete. It is not “use MLflow to track runs”; it is “run python train.py --config configs/experiment-14.yaml --seed 42 on a single A100 GPU with 80GB memory, expect 4.2 hours, expect a final validation accuracy of 0.847 ± 0.003”.

[DIAGRAM: BridgeDiagram — aitm-eci-article-6-untracked-to-governed — Left-side “untracked notebook” with handwritten notes, right-side “governed experiment record” with code/data/environment/hyperparameters/seeds/metrics/artifacts/lineage, and named bridge beams between them.]

Two reference points in the reproducibility vocabulary

Papers With Code and the ICLR Reproducibility Challenge. The research community has run an annual reproducibility challenge in which participants attempt to re-run published experiments. The reports are a structured catalog of what goes wrong in reproducibility (missing seeds, unpinned dependencies, undocumented preprocessing) and what works (clear scripts, provided containers, pinned data).[8] The lessons are directly transferable to enterprise practice.

MLflow and W&B — the production tooling reference. The dominant open-source and commercial tools both implement the same artifact set, which has become a de facto standard. A practitioner who can operate one can operate the others. Published documentation and user case studies from MLflow, W&B, Neptune, Aim, and cloud-provider native offerings catalog the same core practices.[1][2][3][4][5]

Summary

Reproducibility requires recording code, data, environment, hyperparameters, seeds, metrics, artifacts, and lineage for every run, and addressing four failure modes (nondeterministic kernels, package drift, data drift, external service drift). Replicability requires writing the experiment brief and report in technology-neutral terms so that a peer team can run a variant and confirm the conclusion. Tooling is broad and mature (MLflow, W&B, Neptune, Aim, Comet, Kubeflow Pipelines Metadata, cloud-provider native); the discipline is the practitioner’s. The reproducibility-and-replicability section of the experiment report is the artifact that an auditor, a successor team, or a regulator reads to confirm that the experiment is defensible.

Further reading in the Core Stream: MLOps: From Model to Production.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. MLflow documentation. https://mlflow.org/ — accessed 2026-04-19.

  2. Weights & Biases documentation. https://wandb.ai/ — accessed 2026-04-19.

  3. Neptune documentation. https://neptune.ai/ — accessed 2026-04-19.

  4. Aim documentation. https://aimstack.io/ — accessed 2026-04-19.

  5. Vertex AI Experiments, SageMaker Experiments, Azure ML, and Databricks native tracking documentation. https://cloud.google.com/vertex-ai ; https://aws.amazon.com/sagemaker/ ; https://azure.microsoft.com/en-us/products/machine-learning ; https://www.databricks.com/ — accessed 2026-04-19.

  6. Netflix Technology Blog — experimentation and reproducibility series. https://netflixtechblog.com/ — accessed 2026-04-19.

  7. Spotify Engineering blog — data platform series. https://engineering.atspotify.com/category/data/ — accessed 2026-04-19.

  8. Papers With Code and the ICLR Reproducibility Challenge. https://paperswithcode.com/ — accessed 2026-04-19.