AITM M1.3-Art07 v1.0 Reviewed 2026-04-06 Open Access
M1.3 The 20-Domain Maturity Model
AITF · Foundations

Pipelines and Orchestration


9 min read Article 7 of 18 Calibrate

AITM-ECI: AI Experimentation Associate — Body of Knowledge


A pipeline is an automated execution graph. At its center is the set of steps that take raw data through feature engineering, training, evaluation, registration, and deployment. Around it are the orchestrator that runs the graph, the storage for intermediate artifacts, and the metadata store that records what happened. A team that runs experiments manually from notebooks will reach a ceiling fast: every rerun is a new chance to introduce error, every transfer between team members loses context, every audit request requires archaeology. A team that runs experiments through pipelines has traded some initial setup cost for a durable, auditable substrate that every future experiment builds on.

The canonical ML pipeline

An ML pipeline has a recurring shape, regardless of framework or scale. Six stages cover most patterns.

  1. Ingest. Pull raw data from source systems (databases, data lakes, streaming topics, external APIs). Validate schemas and row counts. Record the data hash.
  2. Feature engineering. Transform raw data into feature vectors. Apply the same transformations that will be applied at inference time. Write to a feature store if one exists.
  3. Train. Train a model on a specified training slice with specified hyperparameters. Record the tracking artifact set (Article 6).
  4. Evaluate. Evaluate the trained model against validation and test slices. Compute primary, secondary, and guardrail metrics across slices. Compute reproducibility checks.
  5. Register. Write the model to the model registry with a version number, lifecycle state (typically staging), lineage pointers, and evaluation metrics attached. Article 9 develops registration and promotion in depth.
  6. Deploy. Serve the model in a production path — shadow, canary, or full rollout — with the observability hooks required to detect drift and guardrail breaches.

The six-stage structure is the skeleton. Real pipelines add steps — data-quality checks before ingest, feature-importance analysis after training, drift-monitoring pre-deployment — but the skeleton holds. The NIST AI RMF Generative AI Profile and the ML Test Score rubric both treat the structured pipeline as a prerequisite for production readiness.[1][2]
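The skeleton above can be sketched as a chain of plain Python functions. Everything here is illustrative, not any framework's API: the function names, the stand-in "model" (a mean), and the in-memory registry all exist only to make the stage boundaries concrete.

```python
import hashlib
import json


def ingest(raw_rows):
    # Validate row count and record a content hash of the raw data.
    assert len(raw_rows) > 0, "empty ingest"
    data_hash = hashlib.sha256(
        json.dumps(raw_rows, sort_keys=True).encode()
    ).hexdigest()
    return {"rows": raw_rows, "hash": data_hash}


def featurize(ingested):
    # Apply the same transformation that inference will apply.
    return [{"x": r["value"] * 2} for r in ingested["rows"]]


def train(features, lr=0.1):
    # Stand-in "model": the feature mean, plus the hyperparameter record.
    mean = sum(f["x"] for f in features) / len(features)
    return {"mean": mean, "lr": lr}


def evaluate(model, features):
    # Guardrail metric for the stand-in model: mean absolute error.
    mae = sum(abs(f["x"] - model["mean"]) for f in features) / len(features)
    return {"mae": mae}


def register(model, metrics, registry):
    # New versions enter the registry in the "staging" lifecycle state.
    version = len(registry) + 1
    registry.append({"version": version, "state": "staging",
                     "model": model, "metrics": metrics})
    return version


def run_pipeline(raw_rows, registry):
    ingested = ingest(raw_rows)
    features = featurize(ingested)
    model = train(features)
    metrics = evaluate(model, features)
    return register(model, metrics, registry)


registry = []
version = run_pipeline([{"value": 1}, {"value": 3}], registry)
```

The deploy stage is omitted because it has no meaningful in-process analogue; in a real pipeline each of these functions would be a separately scheduled step with durable inputs and outputs.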

[DIAGRAM: StageGateFlow — aitm-eci-article-7-canonical-pipeline — A left-to-right flow: ingest -> feature -> train -> evaluate -> register -> deploy, with gate criteria above each arrow (schema valid, features fresh, metrics above threshold, tests pass, approved) and retry/resume arrows shown below.]

Orchestrator choice

An orchestrator is the execution engine for the pipeline. Orchestrators differ in execution model (graph-native, programmatic, declarative), in scale (single-box to thousands of workers), and in integration surface (managed platform, cloud-native, self-hosted).

Apache Airflow. Mature, widely adopted, scheduler-driven. Good fit for batch-oriented workflows that are time-triggered and have moderate intra-run variability. Python-based DAG definition. Managed offerings exist (Astronomer, MWAA on AWS, Cloud Composer on Google Cloud).[3]

Kubeflow Pipelines. Kubernetes-native, container-per-step. Good fit for teams already on Kubernetes and teams that want tight integration with other Kubeflow components (serving, training operators).[4]

Flyte. Kubernetes-native, strongly typed Python DSL, metadata-first. Good fit for teams wanting robust typing and lineage as first-class features. Managed offering exists (Union.ai).[5]

Argo Workflows. Kubernetes-native, YAML-driven, lightweight. Good fit for teams that want a thin orchestration layer over container workloads.[6]

Prefect and Dagster. Both Python-native, both with strong developer experience. Prefect emphasizes reliability and lightweight scheduling. Dagster emphasizes assets, typed inputs and outputs, and integrated data quality.[7][8]

Metaflow. Originally built at Netflix, now open source. Designed for data scientist ergonomics and compute scale-out from the same Python code.[9]

Cloud-provider native. AWS Step Functions, Azure Data Factory, Google Cloud Workflows, Databricks Workflows, Vertex AI Pipelines, and SageMaker Pipelines each provide orchestration tightly coupled to their provider’s compute and storage. Good fit for single-provider organizations that want to minimize orchestration operations.[10]

The orchestrator choice is a multi-factor decision. The practitioner’s concern is not which tool to use (that decision is usually made at the platform level) but how to design pipelines that are portable across orchestrators. A pipeline whose steps are self-contained, whose inputs and outputs are typed, and whose dependencies are explicit will run on any of the above with minor adaptation.

Idempotency, retry, and resume

Production pipelines fail. Storage goes down. Kernels crash. Spot instances are preempted. A pipeline that cannot recover from partial failure will have a continuous supply of incidents.

Three design properties matter.

Idempotency. Running a step twice produces the same result as running it once. The canonical violation is appending a row to a table instead of upserting into it; the second run of the appending step doubles the table. Idempotent steps write to content-addressed artifact paths (or use transactional upsert) so that a rerun is safe.

Retry. The orchestrator retries failed steps a specified number of times, with exponential backoff. Transient failures (network timeouts, pre-emption) recover automatically. Permanent failures (schema errors, assertion failures) are propagated to human review after retry exhaustion.
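Every mature orchestrator provides retry natively; the decorator below is only a sketch of the semantics described above, with illustrative parameter names. Transient failures are absorbed up to the attempt limit; the final failure propagates.

```python
import functools
import time


def retry(max_attempts=3, base_delay=1.0):
    """Retry a step with exponential backoff between attempts."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        # Retries exhausted: propagate to human review.
                        raise
                    # Exponential backoff: base, 2x base, 4x base, ...
                    time.sleep(base_delay * 2 ** attempt)
        return inner
    return wrap
```

Note what the sketch does not do: it retries every exception class alike. A production configuration distinguishes transient errors (timeouts, preemption) from permanent ones (schema errors, failed assertions) and retries only the former.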

Resume. After a fatal failure, the pipeline resumes from the last successful step rather than from the beginning. Resume requires that intermediate artifacts be durably stored and that the orchestrator track which steps completed. Every mature orchestrator supports this; it is the practitioner’s job to design steps whose inputs and outputs are explicit enough for resume to be unambiguous.

The trio (idempotency, retry, resume) is the difference between a pipeline that runs for months with modest oversight and one that pages an on-call engineer every week. The Metaflow, Flyte, Dagster, and Prefect documentation, and the managed cloud offerings, all document the properties in similar terms; the practitioner’s job is to design for them, not to assume them.

Dependency management

A pipeline has two kinds of dependencies: data dependencies and code dependencies.

Data dependencies. Step B’s input is step A’s output. The orchestrator knows this, schedules B after A, and passes the output path. Explicit data dependencies make the DAG legible. Implicit data dependencies — step B reading a “latest” path that step A wrote to — make reruns unreliable and debugging painful.

Code dependencies. Step A imports a library, whose version must match what the training code expects. Managing these is where most pipelines accumulate technical debt. Two patterns work.

  • Per-step container images. Each step runs in a container whose image pins every dependency. Rebuilding the image is a deliberate act. Breaking a downstream step by upgrading a shared library is prevented by construction.
  • Lockfiles and environments. Each step has its own lockfile (requirements.txt, poetry.lock, uv.lock, conda environment.yml). The orchestrator ensures each step runs in the specified environment.

Container-per-step is more robust and more expensive. Lockfiles are lighter weight and more fragile. A program typically uses lockfiles in development and containers in production.

Scheduling patterns

Pipelines run on three schedule types.

Triggered. On a data event (new partition arrived), code event (commit to main), or human event (approval). Most CI-for-ML pipelines are triggered.

Scheduled. On a cron or interval (“daily at 02:00 UTC”, “every 4 hours”). Most production batch-scoring and retraining pipelines are scheduled.

Continuous. As fast as resources allow, processing a streaming input. Pipelines against streaming data infrastructure (Kafka, Pulsar, Kinesis) often use continuous scheduling with micro-batches.

A mature program uses all three. Triggered pipelines for CI, scheduled pipelines for retraining and for batch evaluation, continuous pipelines for streaming features.

[DIAGRAM: HubSpokeDiagram — aitm-eci-article-7-orchestrator-hub — Central hub “Orchestrator” with spokes for triggers (event, schedule, continuous), workers (Kubernetes, VMs, serverless), artifact store (S3, GCS, Azure Blob), metadata store (MLflow, Kubeflow Metadata, provider-native), lineage (OpenLineage, Marquez), alerts (Slack, PagerDuty, Teams).]

Two real programs in the pipeline vocabulary

Netflix Metaflow. Metaflow was built inside Netflix and open-sourced in December 2019. Its design prioritizes data scientist ergonomics (annotate a Python function, get a pipeline), compute scale-out (the same code runs on a laptop or on thousands of AWS Batch workers), and versioning (every run’s artifacts are addressed by run ID). Netflix’s engineering blog documents the design decisions and the production scale.[9] The important lesson for practitioners is that orchestration does not have to be painful; a tool designed around data scientist workflow can carry production load.

LinkedIn ProML. LinkedIn’s engineering blog series on ProML describes a large-enterprise feature store and training platform, and the pipelines that run on it.[11] The articles document how feature consistency between training and serving is maintained by construction (not by convention), and how model training, evaluation, and registration are organized into a standard pipeline template every team inherits. The lesson for a practitioner entering a large organization is that the pipeline template is likely to pre-exist and the practitioner’s job is to extend it, not invent a new one.

Summary

The canonical ML pipeline has six stages: ingest, feature, train, evaluate, register, deploy. The orchestrator choice is a platform decision; portability across orchestrators is achieved through self-contained, typed steps. Idempotency, retry, and resume are the trio that makes pipelines recoverable from failure. Data dependencies should be explicit; code dependencies should be managed by lockfiles in development and by container images in production. Scheduling patterns are triggered, scheduled, and continuous, and a mature program uses all three. Netflix Metaflow and LinkedIn ProML are two reference programs. The next article develops the continuous-integration layer that sits on top of these pipelines.

Further reading in the Core Stream: MLOps: From Model to Production and AI Use Case Delivery Management.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 2024. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf — accessed 2026-04-19.

  2. Eric Breck et al. The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction. IEEE Big Data 2017. https://research.google/pubs/pub46555/ — accessed 2026-04-19.

  3. Apache Airflow documentation. https://airflow.apache.org/ — accessed 2026-04-19.

  4. Kubeflow Pipelines documentation. https://www.kubeflow.org/docs/components/pipelines/ — accessed 2026-04-19.

  5. Flyte documentation. https://flyte.org/ — accessed 2026-04-19.

  6. Argo Workflows documentation. https://argoproj.github.io/workflows/ — accessed 2026-04-19.

  7. Prefect documentation. https://www.prefect.io/ — accessed 2026-04-19.

  8. Dagster documentation. https://dagster.io/ — accessed 2026-04-19.

  9. Metaflow documentation and Netflix Technology Blog. https://metaflow.org/ — accessed 2026-04-19.

  10. AWS Step Functions, Azure Data Factory, Google Cloud Workflows, Databricks Workflows, Vertex AI Pipelines, SageMaker Pipelines documentation. — accessed 2026-04-19.

  11. LinkedIn Engineering Blog — Machine Learning category. https://engineering.linkedin.com/blog/topic/machine-learning — accessed 2026-04-19.