AITE M1.1-Art15 v1.0 Reviewed 2026-04-06 Open Access

Data Pipeline Architecture for AI



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 15 of 35


The model does not own the AI system’s risk; the data does. A perfectly configured model on an unverified corpus produces answers the organization cannot defend. A well-governed corpus on a mid-tier model produces answers the organization can stand behind. This relationship is counter-intuitive to teams that arrive at AI from a machine-learning training tradition, where the model is the artifact of craft and the data is an input. In the enterprise RAG era, the architectural priorities reverse. The retrieval corpus is the knowledge the model will speak from, the governance over that corpus is the architect’s most consequential work, and every gap in that governance becomes an incident waiting to happen. Article 15 walks the AITE-SAT learner through the data pipeline from raw source to production retrieval index, naming the six stages and the controls that belong at each.

The six pipeline stages

Stage 1 — Source identification and licensing

The pipeline begins before any byte of data is processed. The architect identifies every source that will contribute to the retrieval corpus — internal document stores, ticketing systems, wikis, CRM records, email archives, product catalogs, publicly available content, licensed third-party data — and confirms the licensing posture of each. Content licensed for one use may not be licensed for AI training or retrieval; content scraped from the public web carries its own legal and reputational risks. The output of Stage 1 is a source register that names each source, its licensing basis, its sensitivity classification, and its intended use in the corpus.

An AI corpus built without a source register is a corpus that cannot be defended in a regulatory inquiry or a legal discovery. The register is also the artifact that survives ownership changes on the data team; without it, institutional knowledge of what is in the corpus evaporates with staff turnover.

Stage 2 — Ingestion and format normalization

Source content arrives in heterogeneous formats — PDFs, Word documents, HTML pages, scanned images, email message objects, structured database exports, Confluence pages, SharePoint sites, Slack message archives, audio recordings, spreadsheets. Ingestion normalizes these into a common processable representation. Text-heavy sources produce text and structured metadata (title, author, date, source system, document type). Document-heavy sources may pass through OCR; audio sources pass through transcription; images pass through vision-model captioning if content extraction is required.

Ingestion is where format-specific risks enter the pipeline. PDFs can contain embedded JavaScript that legitimate extractors strip but malicious processors may execute; Office documents can carry macros; HTML can contain attacker-crafted markup; scraped web content can include adversarial prompt-injection payloads planted in benign-looking pages. The ingestion stage applies a format-specific sanitizer before any downstream processing. Unstructured.io, Azure Document Intelligence, AWS Textract, Google Document AI, and LlamaParse are among the most widely used commercial and open-source ingestion tools.1
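The format-specific-sanitizer idea can be sketched as a fail-closed dispatch table. The sanitizer bodies below are crude illustrative stand-ins, not replacements for the real tools named above:

```python
import re
from typing import Callable

def strip_html_scripts(text: str) -> str:
    """Crude stand-in: drop <script> blocks before text extraction."""
    return re.sub(r"<script\b[^>]*>.*?</script>", "", text, flags=re.S | re.I)

def strip_pdf_actions(text: str) -> str:
    """Crude stand-in: remove auto-run action keys before extraction."""
    return text.replace("/OpenAction", "").replace("/JavaScript", "")

# One sanitizer per supported format, keyed by file extension.
SANITIZERS: dict[str, Callable[[str], str]] = {
    ".html": strip_html_scripts,
    ".pdf": strip_pdf_actions,
}

def sanitize(path: str, content: str) -> str:
    """Fail closed: formats without a registered sanitizer are rejected."""
    ext = path[path.rfind("."):].lower()
    if ext not in SANITIZERS:
        raise ValueError(f"no sanitizer registered for {ext!r}")
    return SANITIZERS[ext](content)
```

The design choice worth noting is the fail-closed default: an unrecognized format is rejected rather than passed through unsanitized.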

Stage 3 — PII and sensitive-content classification

Once content is normalized, the pipeline classifies it for sensitivity. Personally identifiable information, regulated health information (PHI under HIPAA), regulated financial information (PCI), classified information, privileged legal content, and trade-secret-level confidential content are each tagged at the document or chunk level. Tagging can be done with named-entity recognition models (Microsoft Presidio is the canonical open-source option), cloud-provider DLP services (Google Cloud DLP, AWS Macie, Microsoft Purview), or domain-specific classifiers trained for the organization’s own confidential taxonomy.2

Classification produces two outputs. The metadata is attached to the chunk so downstream retrieval can filter by sensitivity. Where appropriate, the content itself is redacted — PII replaced with tokens, PHI replaced with hashes, names replaced with role labels — before it enters the retrieval index. The architect decides per-corpus whether classification drives filtering (content is retained but only accessible to authorized users) or redaction (content is masked for all users including legitimate ones).
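The classify-then-redact step can be illustrated with a toy pattern-based stand-in. A real pipeline would use Presidio, a cloud DLP service, or a trained classifier; the two regex patterns here exist only to show the shape of the two outputs (a tag set for filtering, redacted text for indexing):

```python
import re

# Illustrative patterns only; production detection uses NER/DLP tooling.
PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_and_redact(chunk: str) -> tuple[str, set[str]]:
    """Return the redacted text and the set of sensitivity tags found."""
    tags: set[str] = set()
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(chunk):
            tags.add(label)                       # output 1: metadata tag
            chunk = pattern.sub(f"[{label}]", chunk)  # output 2: redaction
    return chunk, tags

redacted, tags = classify_and_redact("Contact jane@example.com, SSN 123-45-6789.")
```

Whether the redacted or the original text reaches the index is the per-corpus filtering-versus-redaction decision described above.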

Stage 4 — Chunking and enrichment

Chunking (Article 5) splits normalized content into retrieval-sized fragments. Enrichment adds fields the retriever can use — a summary generated by an LLM, a key-terms extraction, a structured classification into the organization’s taxonomy, a hierarchical section breadcrumb. Enrichment has a cost (one LLM call per document per enrichment type) and a benefit (richer retrievals and better filter cardinality); the architect picks the enrichments that justify their cost.

Enrichment is also where provenance metadata gets attached. Each chunk carries its source identifier, section path, creation date, author, and version hash so that a later retrieval result can be traced back to the exact document and version it came from. Provenance that is not attached at Stage 4 cannot be added later without reprocessing the corpus.
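Attaching provenance at Stage 4 can be sketched as follows. The field set mirrors the paragraph above; the choice of a truncated SHA-256 of the source body as the version hash is one illustrative convention, not a prescribed one:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    source_id: str       # key back into the Stage 1 source register
    section_path: str    # e.g. "Policy > Returns > Exceptions"
    created: str         # ISO date of the source document
    author: str
    version_hash: str    # hash of the exact source document version

def enrich(text: str, source_id: str, section_path: str,
           created: str, author: str, document_body: str) -> Chunk:
    """Attach provenance while the source document is still in hand."""
    version_hash = hashlib.sha256(document_body.encode()).hexdigest()[:12]
    return Chunk(text, source_id, section_path, created, author, version_hash)

chunk = enrich("Refunds within 30 days.", "policy-docs",
               "Policy > Returns", "2024-03-01", "ops-team",
               document_body="full text of the policy handbook version")
```

The frozen dataclass makes the point structurally: provenance is set once, at enrichment time, because the source document is no longer available downstream.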

Stage 5 — Embedding and index write

Chunks are embedded (Article 5) and written to the vector index (Article 6) with their metadata. The index write includes tenant identifier, sensitivity class, retention class, and expiration date so that downstream queries can filter on these fields and retention policies can delete stale content automatically. A chunk without a tenant identifier is a tenant-isolation failure waiting for an edge case; a chunk without a retention class is a GDPR-right-to-erasure request without the machinery to fulfill it.

Embedding-model choice is a cost and a residency decision. Managed-API embeddings (OpenAI, Cohere, Voyage, Google) are cheap per call but route data to the provider; self-hosted open-weight embeddings (BAAI bge, Qwen, Mixedbread, Nomic) are operationally heavier but keep the data inside the organization’s boundary. The architect makes this decision with the same residency logic that applies to inference.
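The index-write contract can be sketched without the vector itself: what matters for governance is that every write carries the filter fields and that every read pre-filters on them. An in-memory stand-in (a real system would store an embedding alongside these fields and combine the pre-filter with similarity scoring):

```python
from datetime import date

index: list[dict] = []  # stand-in for the vector index

def write_chunk(text: str, tenant: str, sensitivity: str,
                retention_class: str, expires: date) -> None:
    """Stage 5 write: no chunk enters the index without these fields."""
    index.append({
        "text": text, "tenant": tenant, "sensitivity": sensitivity,
        "retention_class": retention_class, "expires": expires,
    })

def query(tenant: str, today: date, max_sensitivity: str = "internal") -> list[str]:
    """Pre-filter on tenant, expiry, and sensitivity before any scoring."""
    allowed = {"public", "internal"} if max_sensitivity == "internal" else {"public"}
    return [c["text"] for c in index
            if c["tenant"] == tenant and c["expires"] > today
            and c["sensitivity"] in allowed]

write_chunk("Tenant A policy", "a", "internal", "std-3y", date(2027, 1, 1))
write_chunk("Tenant B policy", "b", "internal", "std-3y", date(2027, 1, 1))
write_chunk("Expired note", "a", "internal", "std-1y", date(2024, 1, 1))
```

Note that the expired chunk is excluded at read time even before the retention job deletes it; defense in depth applies to retention as much as to tenancy.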

Stage 6 — Lineage and audit trail

The final stage records the full lineage of every production chunk. For chunk X in the index, the team can answer: which source document did it come from, when was it ingested, which ingestion pipeline version processed it, which chunking strategy was used, which embedding model and version produced the vector, what classifications were applied, who authorized its inclusion in the corpus, and when was it last re-verified. The lineage is itself a data store — typically a separate database with the same retention and query capabilities as any audit log — and it is referenced in the evidence pack for regulated deployments.

Lineage is what transforms a retrieval corpus from a black box into a governed asset. Without it, an incident investigation halts at “we cannot tell where this chunk came from,” which is an unacceptable answer in regulated contexts.
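The lineage store is an ordinary keyed record per chunk; a minimal sketch whose field names mirror the questions listed above (the model and pipeline identifiers are illustrative):

```python
# chunk_id -> lineage record; in production this is a separate
# database with audit-log retention, not an in-process dict.
lineage: dict[str, dict] = {}

def record_lineage(chunk_id: str, **fields) -> None:
    lineage[chunk_id] = fields

record_lineage(
    "chunk-0042",
    source_document="policy-handbook-v7.pdf",
    ingested="2025-02-10",
    pipeline_version="ingest-3.1.0",
    chunking_strategy="recursive-512",
    embedding_model="embed-model-x@2",   # illustrative name
    classifications=["internal"],
    authorized_by="data-governance",
    last_reverified="2025-04-01",
)

def trace(chunk_id: str) -> dict:
    """Incident investigation starts here: where did this chunk come from?"""
    return lineage.get(chunk_id, {})
```

An empty `trace` result is exactly the "we cannot tell where this chunk came from" failure the text warns against, which is why the record is written at index time, not reconstructed later.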

[DIAGRAM: TimelineDiagram — aite-sat-article-15-six-stage-pipeline — Horizontal timeline showing six pipeline stages with per-stage owner labels. Stage 1: “Source identification & licensing” (Owner: Data Governance). Stage 2: “Ingestion & normalization” (Owner: Data Engineering). Stage 3: “PII & sensitivity classification” (Owner: Privacy Engineering). Stage 4: “Chunking & enrichment” (Owner: AI Platform). Stage 5: “Embedding & index write” (Owner: AI Platform). Stage 6: “Lineage & audit trail” (Owner: Data Governance). Arrows between stages annotated with outputs produced. Side annotations indicate which EU AI Act Article 10 obligations each stage contributes to and which ISO 42001 clauses apply.]

Governance controls the architect owns

Four governance controls are the architect’s responsibility even when a data-governance team owns the source register.

Licensing chain of custody. For every source, the pipeline records the contractual basis for including the content in the corpus, the scope of use permitted (retrieval only, training allowed, both), and the expiration of that right. A content source whose license expires is removed from the index automatically, not at the end of a manual review cycle.
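Automatic removal on license expiry can be sketched as a scheduled job joining the source register's expiry dates against the index (the mapping names are hypothetical):

```python
from datetime import date

# source_id -> license expiry (None = perpetual), from the source register.
license_expiry = {"vendor-feed": date(2025, 1, 1), "internal-wiki": None}

# Index rows carry their source_id, written at Stage 5.
index = [
    {"chunk_id": "c1", "source_id": "vendor-feed"},
    {"chunk_id": "c2", "source_id": "internal-wiki"},
]

def purge_expired_sources(today: date) -> list[str]:
    """Remove chunks whose source license has lapsed; return removed ids."""
    removed = [c["chunk_id"] for c in index
               if (exp := license_expiry.get(c["source_id"])) and exp <= today]
    index[:] = [c for c in index if c["chunk_id"] not in removed]
    return removed
```

The join is only possible because every chunk carries its `source_id`; a corpus without that linkage cannot honor license expiry without a full rebuild.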

Tenancy model. The architect defines whether the corpus is per-tenant, shared with tenant-scoped filters, or mixed (some content shared, some tenant-private). The model is enforced at Stage 5 via metadata and at query time via pre-filter. Article 16 develops multi-tenancy at depth; the data-pipeline architect’s job is to ensure the tenancy signal is captured at ingestion rather than reconstructed at query time.

Retention and right-to-erasure. GDPR Article 17 grants data subjects the right to erasure; the corpus must be able to honor that right in bounded time. The architect designs retention classes with explicit time-to-live and a deletion pipeline that propagates from source register through index. A user whose data is erased must be erased from every chunk in every tenant’s index, not just from the source document.
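Bounded-time erasure is only achievable if the pipeline maintains a reverse index from data subject to chunk locations at write time; otherwise erasure degrades into a full-corpus scan. A sketch of that assumed design (the `subject_index` structure is an illustrative choice, not a prescribed one):

```python
# Per-tenant indexes: chunk_id -> chunk text.
indexes = {
    "tenant-a": {"c1": "...jane...", "c2": "...other..."},
    "tenant-b": {"c3": "...jane..."},
}

# Reverse index built at Stage 5 write time:
# data-subject id -> every (tenant, chunk_id) referencing that subject.
subject_index = {"subject-jane": [("tenant-a", "c1"), ("tenant-b", "c3")]}

def erase(subject_id: str) -> int:
    """Delete every chunk referencing the subject, across all tenants."""
    erased = 0
    for tenant, chunk_id in subject_index.pop(subject_id, []):
        if indexes[tenant].pop(chunk_id, None) is not None:
            erased += 1
    return erased
```

Because the reverse index spans tenants, the erasure reaches "every chunk in every tenant's index" in one lookup rather than one scan per tenant.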

Drift and re-verification. Sources change. A policy document from 2023 may have been superseded in 2024; a product description may have been updated; a regulation may have been amended. The pipeline includes a re-verification cadence where upstream changes trigger chunk updates in the index. Without re-verification, the corpus gradually ossifies into a stale knowledge base the model cites confidently while contradicting current reality.
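One common change-detection mechanism for re-verification is to compare the version hash stored on each chunk at Stage 4 against the hash of the current upstream document; a mismatch flags the chunk for reprocessing. A sketch under that assumption:

```python
import hashlib

def version_hash(body: str) -> str:
    """Same convention as the Stage 4 provenance hash."""
    return hashlib.sha256(body.encode()).hexdigest()[:12]

# Chunks carry the hash of the source version they were cut from.
chunks = [
    {"chunk_id": "c1", "source_id": "policy", "version_hash": version_hash("v1 text")},
]

def stale_chunks(current_sources: dict[str, str]) -> list[str]:
    """Chunks whose source changed upstream need re-ingestion."""
    return [c["chunk_id"] for c in chunks
            if version_hash(current_sources[c["source_id"]]) != c["version_hash"]]
```

Run on a cadence (or on source-system change events), this is the trigger that keeps the corpus from ossifying.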

[DIAGRAM: BridgeDiagram — aite-sat-article-15-raw-to-production-bridge — A left-to-right bridge with “Raw data” on the left pier and “Production retrieval index” on the right pier. Three pillars support the bridge deck: “PII redaction” (leftmost), “License check” (middle), “Embedding & index write” (right). The deck is labelled “Chunking & metadata enrichment” in the centre. Annotations above the bridge show the sensitivity classes being routed per pillar: “Public content → minimal redaction”; “Confidential internal → full redaction + tenant tag”; “Regulated (PII/PHI) → redaction + access control + retention policy”. Below the bridge, audit-trail arrows indicate per-chunk lineage flowing parallel to the data.]

Synthetic data and hybrid corpora

Some pipelines supplement the real corpus with synthetic content — model-generated documents that cover edge cases the real corpus misses or that represent scenarios the team wants the system to handle well. Synthetic data can improve coverage and reduce bias (when generated with that goal), but it introduces its own risks: synthetic errors that look plausible, synthetic content that the model treats as authoritative when it is not, and legal questions about whether synthetic data generated by a commercial model is encumbered by the generator’s terms of use.3

The architect treats synthetic content as a tagged source with its own licensing chain of custody — “generated by Model X on Date Y using Prompt Template Z” — and marks synthetic chunks so retrieval can filter them out of use cases where authoritative-source-only content is required. Synthetic data is useful; synthetic data mixed indistinguishably with authoritative data is a governance failure.

Three real-world examples

OpenAI enterprise data-handling documentation. OpenAI’s enterprise and API documentation describes the data-handling commitments applicable to enterprise use: inputs and outputs are not used for training by default, data retention policies can be set, and enterprise deployments can opt into zero-retention configurations for specific workloads.4 The architectural point for the AITE-SAT learner is that managed-API providers have matured their data-governance offerings significantly since 2023, and the architect is expected to read, negotiate, and document these terms as part of the pipeline design. Choosing a managed API without understanding its data-handling terms is a governance gap; the terms are part of the pipeline architecture.

Supabase Row-Level Security with pgvector. Supabase documents reference patterns for pgvector-based retrieval where row-level security in Postgres enforces tenant isolation and access control at the database level.5 The chunk row carries a tenant identifier and an access policy; every query is filtered by the policy automatically. The architectural point is that tenant-isolation enforcement can live at the storage layer rather than at the application layer, which eliminates the class of bugs where a developer forgets to add the tenant filter to a query. A pipeline that writes chunks with tenant metadata and a storage layer that enforces tenant filter on read is a safer architecture than a pipeline that relies on application-layer filter consistency.

Snowflake Cortex data-isolation architecture. Snowflake documents how Cortex AI functions operate within the customer’s Snowflake account, with data never leaving the account’s boundary and access control inheriting from the account’s existing role-based security.6 The architectural point is that cloud-data-warehouse vendors have extended their native isolation models into the AI pipeline, letting teams keep the AI pipeline within the same data-governance perimeter they already operate. The architect choosing Cortex, BigQuery ML, or Redshift ML gets data isolation as a default rather than an add-on.

Regulatory alignment

EU AI Act Article 10 on data governance is the primary regulatory anchor for this article.7 Article 10 requires that high-risk systems’ training, validation, and testing datasets be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete in view of the intended purpose — and, significantly, Article 10(5) requires special-category personal data to be processed under specific safeguards. The six-stage pipeline with source register, PII classification, licensing chain of custody, and lineage satisfies Article 10’s expectations directly. Article 11 on technical documentation expects the pipeline’s design and operation to be documented; Article 12 on record-keeping expects the pipeline to produce logs that can be audited. GDPR Article 5 on data-protection principles (lawfulness, minimization, accuracy, limitation, integrity, confidentiality, accountability) underpins the same controls from the data-protection side. ISO/IEC 42001 Clause 8.2 on AI system impact and Clause 8.3 on lifecycle management map to the pipeline’s governance surface.

Summary

Data pipeline architecture is where the AI system’s risk is defined and its defensibility is built. Six stages — source identification, ingestion, classification, chunking and enrichment, embedding and index write, lineage — cover the path from raw source to production index. The architect owns four governance controls: licensing chain of custody, tenancy model, retention and right-to-erasure, and drift re-verification. Synthetic data is useful but must be tagged and governed distinctly. OpenAI’s enterprise data-handling terms, Supabase’s row-level security pattern with pgvector, and Snowflake Cortex’s data-isolation architecture are public references for how the pipeline integrates with vendor-native governance. Regulatory alignment with EU AI Act Article 10, GDPR Article 5, and ISO 42001 Clauses 8.2 and 8.3 flows from the architect treating the pipeline as a governed lifecycle, not a one-time ingestion job.

Further reading in the Core Stream: Data Architecture for Enterprise AI and Data Governance for AI.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Unstructured.io documentation. https://docs.unstructured.io/ — accessed 2026-04-20. Microsoft Azure AI Document Intelligence. https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/ — accessed 2026-04-20. Google Document AI. https://cloud.google.com/document-ai — accessed 2026-04-20. AWS Textract. https://aws.amazon.com/textract/ — accessed 2026-04-20. LlamaParse. https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse/ — accessed 2026-04-20.

  2. Microsoft Presidio. https://microsoft.github.io/presidio/ — accessed 2026-04-20. Google Cloud DLP. https://cloud.google.com/dlp — accessed 2026-04-20. AWS Macie. https://aws.amazon.com/macie/ — accessed 2026-04-20. Microsoft Purview. https://www.microsoft.com/en-us/security/business/microsoft-purview — accessed 2026-04-20.

  3. Synthetic data discussion in Anthropic’s Responsible Scaling Policy and similar public frameworks. https://www.anthropic.com/news/anthropics-responsible-scaling-policy — accessed 2026-04-20.

  4. OpenAI enterprise privacy and data-handling documentation. https://openai.com/enterprise-privacy/ — accessed 2026-04-20.

  5. Supabase AI and pgvector documentation. https://supabase.com/docs/guides/ai — accessed 2026-04-20.

  6. Snowflake Cortex documentation. https://docs.snowflake.com/en/guides-overview-ai-features — accessed 2026-04-20.

  7. Regulation (EU) 2024/1689, Articles 10, 11, and 12. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. Regulation (EU) 2016/679 (GDPR), Article 5. https://eur-lex.europa.eu/eli/reg/2016/679/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clauses 8.2 and 8.3. https://www.iso.org/standard/81230.html — accessed 2026-04-20.