AITE M1.2-Art28 v1.0 Reviewed 2026-04-06 Open Access

Data Architecture for Agentic Systems

Data Architecture for Agentic Systems — Transformation Design & Program Architecture — Advanced depth — COMPEL Body of Knowledge.

9 min read Article 28 of 53

The architect’s deliverable is a data-flow document that names every data class, every store it passes through, the tenant-isolation mode, the retention policy, the lineage tracking mechanism, and the PII handling. That document is the conformity-assessment evidence (Article 23) that turns “we have data governance” into something a regulator can audit.
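
One way to make that deliverable concrete is to treat each row of the data-flow document as a typed record. The sketch below is illustrative only: the field names, the `IsolationMode` labels, and the `missing_fields` helper are assumptions made for this example, not a COMPEL-mandated schema.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class IsolationMode(Enum):
    # Hypothetical labels for the isolation options discussed in this article.
    ROW_LEVEL_SECURITY = "row_level_security"
    PER_TENANT_SCHEMA = "per_tenant_schema"
    PER_TENANT_CLUSTER = "per_tenant_cluster"


@dataclass
class DataFlowEntry:
    """One row of the data-flow document (illustrative schema)."""
    data_class: str                     # e.g. "user_context", "memory_write"
    stores: List[str]                   # every store the data passes through
    isolation: Optional[IsolationMode]  # tenant-isolation mode
    retention_days: Optional[int]       # retention policy
    lineage_mechanism: str              # e.g. "trace_id on every write"
    pii_handling: str                   # e.g. "minimized context, redacted audit"


def missing_fields(entry: DataFlowEntry) -> List[str]:
    """Return names of fields left empty, so gaps surface before an audit."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


entry = DataFlowEntry(
    data_class="memory_write",
    stores=["pgvector"],
    isolation=IsolationMode.ROW_LEVEL_SECURITY,
    retention_days=None,   # not yet decided
    lineage_mechanism="trace_id on every write",
    pii_handling="",       # not yet documented
)
print(missing_fields(entry))  # ['retention_days', 'pii_handling']
```

A completeness check like this is deliberately crude (it treats any falsy value as missing), but it shows how the document can be validated rather than merely written.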

Five data classes

An agentic system moves five distinct classes of data:

Class 1 — User context. Whatever the user provides in the session. Includes direct inputs, uploaded files, authenticated identity attributes, and account history surfaced at session start. Classification: whatever the highest-sensitivity element is (PII, PHI, regulated) — the session inherits that class.

Class 2 — Retrieved documents. Content the agent retrieves from knowledge bases, vector stores, the web, third-party APIs. Classification: varies by source; the retrieval layer must tag provenance and license.

Class 3 — Tool outputs. Responses from tool calls — database rows, API responses, computed values. Classification: inherits from the system the tool accesses; a query against a PHI system yields PHI regardless of the agent’s intent.

Class 4 — Memory writes. What the agent writes into short-term, long-term, episodic, or semantic memory (Article 7). Classification: typically inherits from the context that produced the write; a memory entry derived from PHI context is PHI.

Class 5 — Audit logs. The observability and compliance records — traces, tool-call logs, memory-write logs, policy-engine decisions, incident records. Classification: derived sensitivity (contains PII if the trace contains PII); retention often driven by regulatory obligation rather than business need.

Each class has its own lifecycle, its own retention requirement, and its own access-control rules. Collapsing them into one undifferentiated “agent data” stream is the root cause of most audit findings.
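
The inheritance rule that recurs across the five classes (a session, tool output, or memory write takes the highest sensitivity of anything it contains) can be sketched as a maximum over an ordered scale. The `Sensitivity` levels and their ordering below are assumptions for illustration; a real scale comes from the organization's classification policy.

```python
from enum import IntEnum
from typing import List


class Sensitivity(IntEnum):
    # Assumed ordering for illustration only.
    PUBLIC = 0
    INTERNAL = 1
    PII = 2
    PHI = 3


def inherited_class(elements: List[Sensitivity]) -> Sensitivity:
    """Return the highest sensitivity among the contributing elements."""
    return max(elements, default=Sensitivity.PUBLIC)


# A memory write derived from an internal document and a PHI tool output is PHI.
print(inherited_class([Sensitivity.INTERNAL, Sensitivity.PHI]).name)  # PHI
```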

Tenant isolation — the non-negotiable

Multi-tenant agentic systems leak data across tenants more often than architects admit, because agents can retrieve, remember, and write — three independent leak surfaces. The architect’s tenant-isolation design must cover all three:

  1. Retrieval isolation. Retrievers filter by tenant ID at query time; embeddings and documents carry tenant tags; index-level partitioning where the vector store supports it.
  2. Memory isolation. Memory stores enforce row-level security (Postgres RLS for pgvector), per-tenant schemas, or per-tenant clusters — documented in the memory registry (Article 26).
  3. Audit isolation. Trace exports, logs, and audit packs are filtered by tenant; the observability stack cannot surface another tenant’s traces.

The test: the architect should be able to answer “if an attacker escalates inside a single tenant, can they access another tenant’s data?” with a confident “no” backed by three enforcement points.
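
As a minimal sketch of the retrieval enforcement point, the function below filters by tenant before ranking, so another tenant's chunks never enter the candidate set. The `Chunk` type and the precomputed `score` field are assumptions for this example; a production vector store would apply the same filter inside the query or through index partitioning.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant_id: str   # tenant tag carried by every document and embedding
    text: str
    score: float     # similarity score, assumed precomputed for this sketch


def retrieve(index: List[Chunk], tenant_id: str, k: int = 3) -> List[Chunk]:
    """Filter by tenant first, then rank the tenant's own chunks."""
    if not tenant_id:
        # Fail closed: a retrieval without a tenant is rejected, not widened.
        raise ValueError("tenant_id is required on every retrieval")
    own = [c for c in index if c.tenant_id == tenant_id]
    return sorted(own, key=lambda c: c.score, reverse=True)[:k]
```

Even if another tenant's chunk has the best similarity score, it cannot be returned, because ranking only ever sees the caller's partition.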

Retention and the right to forget

Agent memory interacts with retention in ways classical systems do not. Deleting a vector-store memory entry removes that entry, but its influence can persist in other entries that were written while it was in the agent's context. The architect must design retention and forget procedures at the memory-registry level (Article 26) and ensure:

  • Scheduled retention expiry. Every memory namespace has a retention clock; expired entries are deleted and the deletion is logged.
  • Subject-initiated deletion (GDPR Article 17). A tenant or user request to delete triggers a cascade: direct entries deleted; derived entries identified by lineage and deleted or recomputed.
  • Incident-triggered forget. Memory poisoning (Article 25) triggers targeted forgetting of suspect entries plus rollback to pre-incident snapshot if needed.
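
Scheduled retention expiry can be sketched as a sweep that deletes expired entries per namespace and returns a deletion log. The in-memory `store` layout and the record fields are hypothetical; the point is that every namespace must have a retention clock and every deletion leaves a log record.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def sweep(store: Dict[str, List[dict]],
          retention_days: Dict[str, int],
          now: datetime) -> List[dict]:
    """Delete expired entries in place and return one log record per deletion."""
    deletion_log = []
    for namespace, entries in list(store.items()):
        # A namespace without a retention clock raises KeyError on purpose.
        cutoff = now - timedelta(days=retention_days[namespace])
        kept = []
        for entry in entries:
            if entry["written_at"] < cutoff:
                deletion_log.append({
                    "namespace": namespace,
                    "id": entry["id"],
                    "deleted_at": now.isoformat(),
                    "reason": "retention_expiry",
                })
            else:
                kept.append(entry)
        store[namespace] = kept
    return deletion_log


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
store = {"episodic": [
    {"id": "a", "written_at": now - timedelta(days=40)},
    {"id": "b", "written_at": now - timedelta(days=5)},
]}
log = sweep(store, {"episodic": 30}, now)
print([r["id"] for r in log])  # ['a']
```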

The “right to forget” is not decorative; it is a design constraint the data architecture must honor.
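
The deletion cascade amounts to a graph traversal over lineage. The sketch below assumes a hypothetical `parents` map (entry ID to the IDs it was derived from) and returns every entry reachable from the deleted roots; whether each derived entry is then deleted or recomputed is a policy decision this code does not make.

```python
from collections import deque
from typing import Dict, Set


def cascade(parents: Dict[str, Set[str]], roots: Set[str]) -> Set[str]:
    """Return the roots plus every entry transitively derived from them."""
    # Invert the provenance map: source ID -> entries derived from it.
    children: Dict[str, Set[str]] = {}
    for entry_id, sources in parents.items():
        for source_id in sources:
            children.setdefault(source_id, set()).add(entry_id)

    affected, queue = set(roots), deque(roots)
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


parents = {
    "summary-1": {"msg-1"},
    "profile": {"summary-1", "msg-2"},
    "other": {"msg-3"},
}
print(sorted(cascade(parents, {"msg-1"})))  # ['msg-1', 'profile', 'summary-1']
```

Here `profile` is only partly derived from the deleted message, which makes it a candidate for recomputation from `msg-2` rather than outright deletion.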

License compliance for retrieved content

Retrieved documents carry licenses. A third-party API may have terms limiting redistribution; a scraped web page may be copyrighted; a proprietary knowledge base may have contractual restrictions on derived use. The agentic data architecture must:

  • Tag every retrieved document with source, license, and redistribution_rights.
  • Filter retrievals through the license policy before the content enters the context window (e.g., an agent asked to generate a customer-facing response cannot include non-redistributable retrieved content).
  • Record license metadata in audit logs so that compliance review can reconstruct which licensed content informed which agent output.
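
The three requirements above can be sketched as a filter that runs before context-window construction. The tag names `source`, `license`, and `redistribution_rights` come from the list above; the `RetrievedDoc` type, the boolean representation of rights, and the single `customer_facing` rule are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RetrievedDoc:
    doc_id: str
    source: str
    license: str
    redistribution_rights: bool  # simplified to a flag for this sketch


def filter_for_context(docs: List[RetrievedDoc],
                       customer_facing: bool) -> Tuple[List[RetrievedDoc], List[dict]]:
    """Return the docs allowed into the context plus an audit record per doc."""
    allowed, audit = [], []
    for doc in docs:
        ok = doc.redistribution_rights or not customer_facing
        if ok:
            allowed.append(doc)
        audit.append({
            "doc_id": doc.doc_id,
            "source": doc.source,
            "license": doc.license,
            "admitted": ok,
        })
    return allowed, audit
```

Because the audit list records both admitted and blocked documents with their license metadata, a later compliance review can reconstruct which licensed content informed a given output.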

PII handling patterns

PII handling in agentic systems needs three specific patterns beyond classical application PII:

Pattern 1 — Minimized context. Before sending to the model, strip PII fields not needed for the task. The model does not need a full customer record to answer a refund question; pre-redact identifiers not used by the task.
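
A minimal sketch of context minimization, assuming a per-task field allowlist: only fields the task needs survive, and an unknown task passes nothing. The `TASK_FIELDS` table and the field names are hypothetical.

```python
from typing import Dict, Set

# Hypothetical allowlist: which record fields each task is permitted to see.
TASK_FIELDS: Dict[str, Set[str]] = {
    "refund": {"order_id", "order_total", "purchase_date"},
}


def minimize(record: dict, task: str) -> dict:
    """Keep only the fields allowlisted for the task; unknown tasks get nothing."""
    allowed = TASK_FIELDS.get(task, set())
    return {key: value for key, value in record.items() if key in allowed}


customer = {
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
    "order_id": "A-1",
    "order_total": 42.5,
    "purchase_date": "2026-03-01",
}
print(minimize(customer, "refund"))
# {'order_id': 'A-1', 'order_total': 42.5, 'purchase_date': '2026-03-01'}
```

An allowlist is used rather than a blocklist so that a newly added field stays out of the model's context until someone explicitly approves it.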

Pattern 2 — Redacted audit. Audit logs must redact PII unless a retention obligation requires the PII itself to be kept. The trace should contain enough to reconstruct the decision, but not enough to leak PII if the log is later exported.
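
One hedged way to keep a trace reconstructable while removing the PII is to replace each identifier with a keyed token, so the same value maps to the same token across log lines. The sketch below handles only email addresses with a deliberately simple regular expression; the token format and key handling are assumptions, and keyed hashing is pseudonymization rather than anonymization.

```python
import hashlib
import hmac
import re

# Simplified email pattern for illustration; real detectors cover more PII types.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str, key: bytes) -> str:
    """Replace each email with a stable keyed token such as <email:1a2b3c4d>."""
    def token(match):
        digest = hmac.new(key, match.group(0).encode(), hashlib.sha256).hexdigest()
        return f"<email:{digest[:8]}>"
    return EMAIL.sub(token, text)


line = "Refund sent to jane.doe@example.com for order 1234"
redacted = redact_emails(line, key=b"demo-key-not-for-production")
```

The stable token lets a reviewer see that two trace lines concern the same subject without the log ever holding the address itself.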

Pattern 3 — Forget in memory. PII written to long-term memory must be deletable on request and retrievable for data-subject access requests. The memory registry tracks which namespaces contain PII for this purpose.

Lineage — from input to output

Lineage traces which data contributed to which output. The architect should be able to answer: “for this agent response on this date, which retrieved documents, memory entries, and tool outputs contributed?” Without lineage:

  • EU AI Act Article 14 human-oversight evidence is incomplete.
  • EU AI Act Article 73 serious-incident reporting cannot reconstruct the root cause.
  • GDPR Article 22 explanation obligations cannot be met.

Lineage requirements:

  • Every data read is logged with source IDs.
  • Every memory write logs its provenance (which reads, tool calls, and reasoning steps produced it).
  • Every agent output logs the reads, tool calls, and memory entries that contributed.
  • Trace IDs propagate across all five data classes.
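
These requirements can be sketched as a trace object that collects every contribution under one trace ID and summarizes them per output. The `LineageTrace` class, the event kinds, and the record shape are hypothetical and stand in for whatever the observability stack actually provides.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageTrace:
    """Collects contributions under a single trace ID (illustrative structure)."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: List[dict] = field(default_factory=list)

    def log(self, kind: str, source_id: str) -> None:
        # kind is "read", "tool_call", or "memory_entry" in this sketch.
        self.events.append(
            {"trace_id": self.trace_id, "kind": kind, "source_id": source_id}
        )

    def output_record(self, output_id: str) -> dict:
        """Summarize which sources contributed to one agent output."""
        contributed: Dict[str, List[str]] = {}
        for event in self.events:
            contributed.setdefault(event["kind"], []).append(event["source_id"])
        return {
            "trace_id": self.trace_id,
            "output_id": output_id,
            "contributed": contributed,
        }
```

A memory write can reuse the same record shape as its provenance entry, which is what lets a later deletion cascade or incident review walk the graph in either direction.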

Lineage is not free; it requires design and, at scale, a non-trivial storage budget. The architect sizes the lineage store at design time.
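
A back-of-envelope sizing shows why the budget matters. Every number below is a hypothetical input chosen for the arithmetic, not a benchmark: one million outputs per day, twelve lineage events per output, 500 bytes per event, and one year of retention.

```python
def lineage_store_bytes(outputs_per_day: int, events_per_output: int,
                        bytes_per_event: int, retention_days: int) -> int:
    """Raw lineage volume before compression, indexing, or replication."""
    return outputs_per_day * events_per_output * bytes_per_event * retention_days


estimate = lineage_store_bytes(
    outputs_per_day=1_000_000,   # assumed traffic
    events_per_output=12,        # assumed reads + tool calls + memory entries
    bytes_per_event=500,         # assumed average record size
    retention_days=365,          # assumed retention period
)
print(f"{estimate / 1e12:.2f} TB")  # 2.19 TB
```

Indexes and replicas typically add to this raw figure and compression reduces it, so it is best read as an order-of-magnitude starting point.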

Data-protection impact

For any agent handling PII, the architect collaborates with the DPO on the DPIA. The DPIA references the data-flow document, the retention policy, the forget procedure, and the tenant-isolation design. This is where an architect’s data-flow document earns its keep — the DPIA questions are already answered.

Sector specifics

  • Financial services: data-residency rules (data must stay in a specific jurisdiction), SR 11-7 lineage expectations for data feeding model outputs, market-abuse surveillance data access rules (only specific personnel).
  • Healthcare: HIPAA Minimum Necessary (the agent must not receive more PHI than the task requires), 42 CFR Part 2 for substance-use disorder treatment records (consent-based access), EU GDPR special-category data rules.
  • Public sector: transparency obligations (citizens can see which data sources informed decisions about them), UK Algorithmic Transparency Recording Standard data-source fields, EU AI Act Article 10 data-governance requirements for high-risk systems.

Reference implementations

Databricks Unity Catalog for AI workloads. Unity Catalog provides column-level tagging, lineage tracking, and access policies across data and ML assets. For agentic systems, Unity Catalog’s lineage graph can anchor the data-flow document; its tags inform the tenant-isolation design.

Snowflake Cortex with data governance integration. Snowflake offers row-access policies, dynamic data masking, and column tags as platform-level governance features alongside Cortex. Agents deployed on Snowflake data frequently inherit these controls rather than reimplementing them, a pattern the architect should explicitly approve.

pgvector + PostgreSQL Row-Level Security (open-source reference). For teams running open-source infrastructure, pgvector for vector memory plus PostgreSQL RLS for tenant isolation is a common and defensible pattern; the architect documents the policies and ensures the agent runtime sets the appropriate session variables on every connection.
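
A minimal sketch of this pattern, under stated assumptions: the table name `agent_memory`, the text column `tenant_id`, and the setting name `app.tenant_id` are hypothetical, and the `%s` placeholder assumes a psycopg-style DB-API driver. The SQL uses standard PostgreSQL row-level security statements together with `current_setting` and `set_config`.

```python
# One-time DDL the architect documents alongside the memory registry.
ENABLE_RLS_SQL = """
ALTER TABLE agent_memory ENABLE ROW LEVEL SECURITY;
ALTER TABLE agent_memory FORCE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON agent_memory
    USING (tenant_id = current_setting('app.tenant_id'));
"""

# Session-level setting (third argument false), so it must be reset
# whenever a pooled connection is handed to a different tenant.
SET_TENANT_SQL = "SELECT set_config('app.tenant_id', %s, false)"


def bind_tenant(cursor, tenant_id: str) -> None:
    """Set the session variable the policy reads; call on every connection checkout."""
    if not tenant_id:
        raise ValueError("refusing to open a session without a tenant")
    cursor.execute(SET_TENANT_SQL, (tenant_id,))
```

If the variable is never set, `current_setting('app.tenant_id')` raises an error rather than returning rows, so the policy fails closed. `FORCE ROW LEVEL SECURITY` matters because table owners otherwise bypass policies.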

LlamaIndex Agents data-loader patterns. LlamaIndex ships data-loader patterns with provenance metadata carried through the indexing and retrieval layers; when the platform uses LlamaIndex for retrieval, the architect ensures the metadata is preserved into context-window construction.

Anti-patterns to reject

  • “We log everything.” Without classification and retention, logs become the biggest PII liability in the organization.
  • “Tenant is the tenant-ID column.” Without runtime enforcement (RLS, per-tenant schema), the tenant column is a convention, not a control.
  • “We’ll handle GDPR deletion manually.” Ad-hoc deletion does not scale, does not cascade through memory lineage, and leaves residual data.
  • “Vector store handles licensing for us.” It does not; license metadata lives in source tags and the architect’s policy.
  • “Retention is someone else’s problem.” Retention is the architect’s problem on any store the agent writes to.

Learning outcomes

  • Explain the five agentic data classes, their lifecycles, and the distinct controls each requires.
  • Classify five data flows in a sample agentic system by class and isolation mode.
  • Evaluate a design for tenant-leakage risk, license-compliance risk, and lineage completeness.
  • Design a data-flow document suitable for conformity-assessment evidence, DPIA reference, and incident-response lineage reconstruction.

Further reading

  • Core Stream anchors: EATE-Level-3/M3.3-Art03-Data-Architecture-for-Enterprise-AI.md; EATF-Level-1/M1.5-Art07-Data-Governance-for-AI.md.
  • AITE-ATS siblings: Article 7 (memory), Article 14 (retrieved-content attacks), Article 23 (conformity evidence), Article 26 (registries).
  • Primary sources: EU AI Act Article 10 on data governance; Italian Garante ChatGPT decisions (March 2023; December 2024 €15M fine); Samsung ChatGPT source-code disclosure (April 2023).