AITM M1.3-Art08 v1.0 Reviewed 2026-04-06 Open Access

Continuous Integration for ML



AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 8 of 14


Classical continuous integration asks one question on every commit: does the code still work? Continuous integration for machine learning asks three: does the code still work, does the data still look right, and does the model still perform. The three questions are not independent. A code change can break data expectations. A data shift can change a model’s behavior without a single line of code moving. A model regression can be caused by either or by the interaction of both. A CI pipeline that tests only code will miss most of the failures that matter.

The five test layers

ML CI spans five layers, each asking a different question, each with a different cadence and cost profile1.

Layer 1 — Unit tests. The code-correctness layer. Does the feature-engineering function produce the expected output on a fixed input? Does the loss function compute what its formula says it should? Does the preprocessing pipeline handle nulls, outliers, and type mismatches the way the spec requires? Unit tests run on every commit, in seconds, against isolated functions with fixed inputs.
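
A commit-time unit test at this layer might look like the following sketch. The `zscore_clip` function and its spec are hypothetical, not from any particular codebase; the point is fixed inputs, exact expectations, and sub-second runtime.

```python
# Hypothetical feature-engineering function: standardize a raw value,
# with null handling and outlier clipping as the spec might require.
def zscore_clip(value, mean, std, clip=3.0, default=0.0):
    if value is None:              # null falls back to a default
        return default
    z = (value - mean) / std
    return max(-clip, min(clip, z))  # extreme z-scores are clipped

# Commit-level unit tests: fixed inputs, exact expected outputs.
assert zscore_clip(None, mean=10.0, std=2.0) == 0.0    # null -> default
assert zscore_clip(10.0, mean=10.0, std=2.0) == 0.0    # mean -> zero
assert zscore_clip(12.0, mean=10.0, std=2.0) == 1.0    # one std above
assert zscore_clip(100.0, mean=10.0, std=2.0) == 3.0   # outlier clipped
```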

Layer 2 — Data contract tests. The data-correctness layer. Does the input data conform to the schema the feature pipeline expects — field names, types, ranges, null-rate bounds, categorical-value sets? Tools including Great Expectations, Deequ, Soda, and Pandera (Python) formalize these as versionable contracts2. Contract tests run on every data-partition arrival and on every code change that touches the schema.
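
A real pipeline would express this contract in Great Expectations or Pandera; the plain-Python sketch below, with hypothetical field names and bounds, shows the shape of what those tools formalize — a versionable contract plus a checker that returns violations.

```python
# Hypothetical data contract: field types, ranges, allowed categories,
# and null-rate bounds, expressed as a plain versionable dict.
CONTRACT = {
    "user_age": {"type": (int, float), "min": 0, "max": 120,
                 "max_null_rate": 0.01},
    "country":  {"type": (str,), "allowed": {"US", "DE", "JP"},
                 "max_null_rate": 0.0},
}

def check_contract(rows, contract):
    """Return a list of violation strings for a batch of row dicts."""
    violations = []
    n = len(rows)
    for field, spec in contract.items():
        values = [r.get(field) for r in rows]
        nulls = sum(v is None for v in values)
        if n and nulls / n > spec["max_null_rate"]:
            violations.append(f"{field}: null rate {nulls / n:.2%} over bound")
        for v in values:
            if v is None:
                continue
            if not isinstance(v, spec["type"]):
                violations.append(f"{field}: bad type {type(v).__name__}")
            elif "min" in spec and not (spec["min"] <= v <= spec["max"]):
                violations.append(f"{field}: {v} out of range")
            elif "allowed" in spec and v not in spec["allowed"]:
                violations.append(f"{field}: unexpected category {v!r}")
    return violations

assert check_contract([{"user_age": 34, "country": "US"}], CONTRACT) == []
```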

Layer 3 — Data quality tests. The data-distribution layer. Has the distribution of a feature drifted from the expected baseline? Has the rate of a rare category changed? Has the correlation structure between features changed? Data-quality tests run on each new partition, on a schedule, or on-demand before retraining.
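
One common distribution-drift statistic is the Population Stability Index over binned feature values. The sketch below is a minimal stdlib implementation; the 0.2 threshold is a widely used rule of thumb, and the histograms are hypothetical.

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: PSI > 0.2 signals meaningful drift."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_frac = max(e / e_total, eps)   # guard against empty bins
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score

baseline = [500, 300, 200]   # hypothetical training-time histogram
current  = [480, 310, 210]   # new partition, nearly identical
assert psi(baseline, current) < 0.2   # no drift flagged

shifted = [100, 300, 600]    # mass moved to the last bin
assert psi(baseline, shifted) > 0.2   # drift flagged
```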

Layer 4 — Model quality tests. The model-performance layer. Does the newly trained model match or exceed the current production baseline on the regression test set? Does it hold on each per-slice metric? Does the model pass fairness, latency, and resource-usage checks? Model-quality tests run after every training job.
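
A model-quality gate at this layer can be sketched as a baseline comparison with a tolerance. The metric names, slice key, and 1-point tolerance below are illustrative, and higher-is-better is assumed for all metrics.

```python
def model_quality_gate(candidate, baseline, tolerance=0.01):
    """Pass only if no overall or per-slice metric regresses by more
    than `tolerance` (all metrics assumed higher-is-better)."""
    failures = []
    for metric, base_value in baseline.items():
        cand_value = candidate.get(metric, 0.0)
        if cand_value < base_value - tolerance:
            failures.append((metric, base_value, cand_value))
    return len(failures) == 0, failures

# Hypothetical metrics: the overall AUC improved, but one slice regressed.
baseline  = {"auc": 0.91, "auc/slice=mobile": 0.88}
candidate = {"auc": 0.92, "auc/slice=mobile": 0.83}
ok, failures = model_quality_gate(candidate, baseline)
assert not ok and failures[0][0] == "auc/slice=mobile"
```

Per-slice entries live in the same dict as the overall metric, so a slice regression blocks the gate even when the headline number improves.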

Layer 5 — Integration and deployment tests. The system-correctness layer. Does the model serve correctly at the inference endpoint? Does the feature pipeline produce training-serving-consistent features? Does the canary-rollback automation work? Integration tests run on every model-registration event and before any production promotion.

Breck, Cai, Nielsen, Salib, and Sculley (IEEE BigData 2017) published “The ML Test Score,” a 28-item rubric spanning these five layers1. The rubric is among the most-cited references for ML CI design and is the benchmark a practitioner’s CI pipeline should score against.

[DIAGRAM: StageGateFlow — aitm-eci-article-8-ci-flow — A flow from commit -> unit tests -> data contract tests -> data quality tests -> model quality tests -> integration tests -> production gate, with elapsed-time labels on each step.]

Test cost and cadence

Unit tests run in seconds. Model-quality tests run in hours. That gap means ML CI cannot blindly run everything on every commit.

A practical tiering splits tests into three cadences.

Every commit. Unit tests, linters, data-contract tests against a small fixture, smoke-level integration tests. The goal is fast feedback — under 10 minutes — so that developers do not learn to bypass CI.

Every merge to main or every training job. Data-quality tests against the current production partition, model-quality tests against the regression set, reproducibility tests on a subset of experiments. The goal is pre-deployment confidence — under 4 hours — so that the main branch is always deployable.

Nightly or on-demand. Full model retraining on the current data, full-slice evaluation, cost-benefit analysis, safety sweep. The goal is drift detection and longitudinal quality — tolerable in 12+ hours — so that the team sees trends before they become incidents.
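
Making the tiering explicit can be as simple as a registry that declares each test's cadence and cost, with the budget checked in CI itself. The test names, durations, and budgets below are hypothetical.

```python
# Hypothetical tier registry: each tier has a declared time budget,
# each test declares its tier and expected duration.
TIERS = {
    "commit":  {"budget_minutes": 10},
    "merge":   {"budget_minutes": 240},
    "nightly": {"budget_minutes": 720},
}

TESTS = [
    {"name": "unit",              "tier": "commit",  "minutes": 3},
    {"name": "contract_fixture",  "tier": "commit",  "minutes": 2},
    {"name": "data_quality_full", "tier": "merge",   "minutes": 45},
    {"name": "model_regression",  "tier": "merge",   "minutes": 150},
    {"name": "full_slice_eval",   "tier": "nightly", "minutes": 600},
]

def plan(event):
    """Select tests for a CI event and verify they fit the tier budget."""
    selected = [t for t in TESTS if t["tier"] == event]
    total = sum(t["minutes"] for t in selected)
    assert total <= TIERS[event]["budget_minutes"], f"{event} over budget"
    return [t["name"] for t in selected]

assert plan("commit") == ["unit", "contract_fixture"]  # fast feedback path
```

Declaring the budget alongside the tests means a slow test added to the commit tier fails loudly in review instead of silently stretching the feedback loop.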

The tiering must be explicit. A commit-level test that takes two hours will be silently bypassed. A nightly test that is actually run weekly will catch drift up to a week later than it should. Cost-and-cadence decisions are part of CI design, not an afterthought.

CI for generative AI

Continuous integration for generative AI features adds requirements that classical ML CI does not have.

Prompt regression tests. The prompt template is part of the model. A prompt change can alter behavior as dramatically as a model change can. A regression suite of input-output pairs, where outputs are either exact-match, contain-match, or evaluated by an LLM-as-judge, catches unintended regressions when the prompt changes. Article 10 develops this in depth.
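
A contain-match regression harness can be sketched as follows. The cases and the stubbed `call_model` are hypothetical; in a real pipeline the stub would be the LLM client, and exact-match or LLM-as-judge checks would slot in beside the substring check.

```python
# Hypothetical regression suite: each case pins an input to a substring
# the output must still contain after any prompt change.
REGRESSION_CASES = [
    {"input": "Refund policy?", "must_contain": "30 days"},
    {"input": "Support hours?", "must_contain": "9am"},
]

def call_model(prompt):
    # Stub standing in for the real LLM call, so the harness is runnable.
    canned = {"Refund policy?": "Refunds are accepted within 30 days.",
              "Support hours?": "Support is open 9am-5pm weekdays."}
    return canned[prompt]

def run_prompt_regression(cases, model=call_model):
    """Return the cases whose output lost a required substring."""
    return [c for c in cases if c["must_contain"] not in model(c["input"])]

assert run_prompt_regression(REGRESSION_CASES) == []  # no regressions
```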

Safety regression tests. A red-team battery (curated from the practitioner’s ongoing adversarial experimentation) runs on every model or prompt change to confirm that known failure modes remain closed. Article 11 develops red-team integration.

Cost regression tests. Generative AI features have per-request costs that can shift with model version, prompt length, and tool-use depth. A cost test computes the mean cost-per-request on a fixed workload and compares against a baseline; breaches trigger the cost-guardrail path in Article 12.
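
The cost test reduces to simple arithmetic over a fixed workload. The token prices, stored baseline, and 20% guardrail in this sketch are all assumed values, not from any vendor's price list.

```python
# Assumed prices and baseline -- illustrative only.
PRICE_PER_1K_INPUT  = 0.003   # $/1k input tokens
PRICE_PER_1K_OUTPUT = 0.015   # $/1k output tokens
BASELINE_COST = 0.0042        # stored mean cost-per-request ($)
MAX_INCREASE = 0.20           # fail if mean cost rises more than 20%

def request_cost(input_tokens, output_tokens):
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def cost_gate(workload):
    """workload: list of (input_tokens, output_tokens), one per request
    in the fixed benchmark set. Returns (passed, mean_cost)."""
    mean = sum(request_cost(i, o) for i, o in workload) / len(workload)
    return mean <= BASELINE_COST * (1 + MAX_INCREASE), mean

ok, mean = cost_gate([(500, 180), (700, 150), (400, 220)])
assert ok  # mean cost within the 20% guardrail
```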

Hallucination and grounding tests. For retrieval-augmented systems, a battery checks whether outputs remain grounded in the retrieved context. Frameworks including Ragas, DeepEval, and promptfoo, and vendor tools including Humanloop, Langfuse, and W&B Weave implement grounding-metric batteries3.

Training-serving consistency tests

The single most pernicious failure mode in production ML is training-serving skew: the features used at training time differ subtly from the features used at inference time, and the model underperforms for reasons that are hard to diagnose. Tests that explicitly compare training-time and inference-time feature computation catch this.

Two patterns help.

Feature-parity tests. Given the same input row, the feature-engineering function used in training produces the same vector as the inference-time feature computation. When the two paths share code (feature-store-based pipelines typically do), this is automatic; when they do not, it is the single most important CI test.
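
A feature-parity test can be sketched as a fixture-driven comparison of the two code paths. `train_features` and `serve_features` below stand in for the real offline and online computations; in a healthy feature-store setup they are literally the same function, and this test is trivially green.

```python
def train_features(row):
    # Offline (training-time) feature computation -- hypothetical features.
    return [row["clicks"] / max(row["impressions"], 1), float(row["is_new"])]

def serve_features(row):
    # Online (inference-time) computation -- must match exactly.
    return [row["clicks"] / max(row["impressions"], 1), float(row["is_new"])]

def assert_feature_parity(rows, tol=1e-9):
    """Fail if the two paths diverge on any fixture row."""
    for row in rows:
        offline, online = train_features(row), serve_features(row)
        assert len(offline) == len(online)
        for a, b in zip(offline, online):
            assert abs(a - b) <= tol, f"training-serving skew on {row}"

# Fixture rows exercising edge cases (zero impressions, booleans).
assert_feature_parity([
    {"clicks": 3, "impressions": 10, "is_new": True},
    {"clicks": 0, "impressions": 0, "is_new": False},
])
```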

Shadow-serving replay. Production traffic is replayed against the newly trained model in a shadow environment, and its output distribution is compared against production’s. Divergence beyond tolerance is a training-serving skew signal. The Google TFX documentation and the Feast feature-store documentation both develop this pattern4.
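The divergence comparison at the end of a replay can be sketched as below. The paired scores and the 0.05 tolerance are assumed, team-specific values; real systems often use a distributional statistic rather than a paired mean.

```python
def mean_abs_divergence(prod_scores, shadow_scores):
    """Mean absolute difference between paired model outputs for the
    same replayed requests."""
    pairs = list(zip(prod_scores, shadow_scores))
    return sum(abs(p - s) for p, s in pairs) / len(pairs)

prod   = [0.10, 0.42, 0.77, 0.55]   # hypothetical production outputs
shadow = [0.11, 0.40, 0.79, 0.54]   # shadow model, same requests

divergence = mean_abs_divergence(prod, shadow)
assert divergence <= 0.05   # within tolerance: no skew signal
```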

Gate criteria

A CI pipeline produces pass/fail outputs per test. The practitioner’s job is to compose those into gate criteria — the boolean that decides whether the change proceeds.

Three gate-criterion patterns are common.

Hard gates. Specific tests must pass for the change to merge. Examples: all unit tests pass; no critical data-contract violations; model regression on the full regression set not worse than baseline by more than a threshold; no safety-regression-test failures.

Soft gates. Specific tests must be reviewed by a human if they fail. Examples: data-distribution drift above threshold; model performance on a specific slice degraded by less than a threshold; cost-per-request up by less than a threshold. Soft gates exist because not every signal is a regression.

Informational. Tests that produce metrics the reviewer wants to see but that do not block merge. Examples: individual slice metrics, longitudinal trend charts, cost projections.

The balance between the three types tunes the CI pipeline’s false-positive and false-negative rates. Too many hard gates and developers bypass them; too few and regressions reach production. A mature program adjusts the mix over time based on which kinds of failures actually occurred.
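
The composition itself is a small piece of logic. The sketch below assumes a per-test result map with a declared gate type; the test names and outcomes are hypothetical.

```python
def compose_gates(results):
    """results: {test_name: (gate_type, passed)}.
    Hard failure blocks the change; soft failure routes it to human
    review; informational results never block."""
    blocked = any(g == "hard" and not ok for g, ok in results.values())
    needs_review = any(g == "soft" and not ok for g, ok in results.values())
    if blocked:
        return "block"
    return "review" if needs_review else "merge"

results = {
    "unit_tests":        ("hard", True),
    "safety_regression": ("hard", True),
    "slice_drift":       ("soft", False),   # drifted, but reviewable
    "cost_projection":   ("info", False),   # never blocks
}
assert compose_gates(results) == "review"
```

Retuning the mix over time is then a one-line change per test: moving a noisy hard gate to soft, or promoting a soft gate that kept catching real regressions.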

[DIAGRAM: ScoreboardDiagram — aitm-eci-article-8-ml-test-score — A dashboard-style table with rows for each of the 28 ML Test Score items, columns for “implemented”, “automated”, “gate type (hard/soft/info)”, and a running score out of 28.]

CI execution environments

ML CI jobs are heavier than classical CI jobs. Unit tests fit in a standard CI runner (GitHub Actions, GitLab CI, Jenkins, CircleCI, Buildkite), but model-quality tests and full-slice evaluations need GPU runners and non-trivial compute. Two patterns work.

Cloud-native CI with GPU runners. GitHub Actions, GitLab CI, CircleCI, and Buildkite all support GPU runners, either on the provider’s fleet or on self-hosted runners that the platform team manages. The CI pipeline dispatches model-quality jobs to GPU runners and composes the results.

Hybrid CI with orchestrator dispatch. The CI pipeline detects that a model-training test is required and dispatches the work to the experiment orchestrator (Airflow, Kubeflow, Flyte, Dagster, Prefect, Metaflow, or cloud-provider-native), then polls for completion and reports the result. The orchestrator runs on dedicated compute that is sized for training workloads. This pattern lets the CI system stay lightweight while running arbitrarily heavy tests.

Git platform is vendor-neutral: GitHub, GitLab, Bitbucket, and self-hosted Gitea all support the same core CI patterns. The choice of platform is usually made at the organization level; the practitioner’s concern is the pipeline design, which is portable.

Two real references in the CI vocabulary

ML Test Score — Breck et al. The rubric published at IEEE BigData 2017 remains the reference for ML CI design1. Practitioners score their pipelines against its 28 items and use the score as a maturity indicator. A score below 10 is typical for a team starting out; a score above 20 is typical for a mature program.

Martin Fowler — Continuous Delivery for Machine Learning (CD4ML). Fowler’s 2019 article established the vocabulary and the reference architecture for CI/CD in ML, and remains a foundational industry reference5. The article predates most of the LLM-era tooling but its principles (test at every layer, automate the pipeline, keep main deployable) carry over directly.

Summary

ML continuous integration spans five layers: unit, data contract, data quality, model quality, integration. Each layer has its own cadence (every commit, every merge, nightly). Training-serving consistency tests catch the most insidious production failure. Gate criteria come in three types (hard, soft, informational); the mix is tuned over time. Generative AI adds prompt, safety, cost, and grounding tests to the battery. Execution environments combine lightweight CI runners with orchestrator dispatch for heavy workloads. The ML Test Score is the benchmark. The next article develops continuous delivery, where CI outputs feed governed promotion into production.

Further reading in the Core Stream: MLOps: From Model to Production.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, D. Sculley. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data, 2017. https://research.google/pubs/pub46555/ — accessed 2026-04-19. 2 3

  2. Great Expectations, Deequ, Soda, Pandera documentation. https://greatexpectations.io/ ; https://github.com/awslabs/deequ ; https://www.soda.io/ ; https://pandera.readthedocs.io/ — accessed 2026-04-19.

  3. Ragas, DeepEval, promptfoo, Humanloop, Langfuse, W&B Weave documentation. https://docs.ragas.io/ ; https://docs.confident-ai.com/ ; https://www.promptfoo.dev/ ; https://humanloop.com/ ; https://langfuse.com/ ; https://wandb.ai/site/weave — accessed 2026-04-19.

  4. TensorFlow Extended (TFX) and Feast feature store documentation. https://www.tensorflow.org/tfx ; https://feast.dev/ — accessed 2026-04-19.

  5. Martin Fowler. Continuous Delivery for Machine Learning. 2019. https://martinfowler.com/articles/cd4ml.html — accessed 2026-04-19.