COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert — Article 7 of 40
Thesis. An agent without memory is a new intern every turn. An agent with unbounded memory is a vector for every attack the organization has not yet catalogued. Between those extremes the architect picks a memory architecture — four layers, each with different lifetime, classification, and trust semantics — and makes deliberate decisions about what the agent may remember, how memories are written, who can read them, and how they are retired. Memory is where agentic systems differ most deeply from stateless LLM applications, and memory is where the OWASP Top 10 for Agentic AI places its most novel risk category: memory poisoning.
The four layers
The canonical four-layer memory model for agents, grounded in Park et al.’s “Generative Agents” (UIST 2023) and now ubiquitous across frameworks, provides the working vocabulary for this article.
Layer 1 — Short-term (working context)
Short-term memory is the conversation history and scratchpad that the agent carries within a single session. In implementation it is the context window: system prompt + conversation so far + tool-call history + plan scratchpad. Short-term memory is ephemeral by design — it dies when the session ends. The architect’s decisions here are token budget (how much is kept), summarization cadence (when old turns are compressed), and eviction policy (sliding window, topic-clustered, or summary-plus-recent).
LangGraph’s MessagesState, OpenAI Agents SDK’s session context, CrewAI’s task-level context, and Anthropic’s conversation memory all implement Layer 1. The differences are mostly ergonomic; the cost discipline is universal (Article 19 — context bloat is a first-order cost driver).
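The Layer 1 decisions above (token budget, summarization cadence, eviction policy) can be sketched framework-agnostically. This is an illustrative sketch, not any framework's API: token counts are approximated by word counts, and the summarizer is a stub where a real system would call an LLM.

```python
from collections import deque

class WorkingContext:
    """Layer 1 sketch: summary-plus-recent eviction for one session.

    Illustrative assumptions: tokens are approximated by whitespace words;
    _summarize is a stub standing in for an LLM summarization call.
    """

    def __init__(self, max_tokens=100, keep_recent=3):
        self.max_tokens = max_tokens      # token budget decision
        self.keep_recent = keep_recent    # turns always kept verbatim
        self.summary = ""                 # compressed older turns
        self.turns = deque()              # recent, verbatim turns

    def _tokens(self, text):
        return len(text.split())

    def _budget_used(self):
        return self._tokens(self.summary) + sum(self._tokens(t) for t in self.turns)

    def _summarize(self, summary, turn):
        # Stub: a real system would call an LLM to compress the evicted turn.
        return (summary + " | " + turn[:20]).strip(" |")

    def add_turn(self, turn):
        self.turns.append(turn)
        # Eviction policy: once over budget, fold the oldest turns into the
        # rolling summary, always keeping the last `keep_recent` verbatim.
        while self._budget_used() > self.max_tokens and len(self.turns) > self.keep_recent:
            evicted = self.turns.popleft()
            self.summary = self._summarize(self.summary, evicted)

    def render(self):
        parts = ["[summary] " + self.summary] if self.summary else []
        return "\n".join(parts + list(self.turns))
```

The design choice worth noting: eviction compresses rather than drops, so facts from old turns survive the budget in degraded form instead of vanishing.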
Layer 2 — Long-term (vector memory)
Long-term memory is the persistent, cross-session knowledge the agent accumulates. Its canonical implementation is a vector store over embedded memory chunks with per-agent or per-user namespacing. “User prefers responses under 200 words,” “User’s shipping address is X,” “This customer escalated last month — treat carefully” are Layer 2 memories.
Vector store choices include Pinecone, Weaviate, Qdrant, pgvector, Milvus, Chroma. The architect’s decisions: embedding model (and the inevitability of re-embedding when models rotate — Article 24), chunking strategy, retrieval top-k, relevance threshold, and — most importantly — the write policy. Which memories get written, by whom, with what classification.
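A minimal sketch of the Layer 2 read/write surface, with toy stand-ins (assumptions, not a real vector store): word-overlap similarity replaces embedding cosine, and a dict of namespaces replaces per-tenant indexes. The class name and parameters are illustrative.

```python
class LongTermMemory:
    """Layer 2 sketch: namespaced memory with top-k retrieval and a
    relevance threshold. Every write carries a classification tag."""

    def __init__(self, top_k=3, threshold=0.2):
        self.top_k = top_k
        self.threshold = threshold
        self._store = {}  # namespace -> list of memory dicts

    def write(self, namespace, text, classification):
        # Write policy surface: who wrote what, at what classification,
        # is recorded alongside the memory itself.
        self._store.setdefault(namespace, []).append(
            {"text": text, "classification": classification}
        )

    @staticmethod
    def _similarity(a, b):
        # Toy stand-in for embedding cosine similarity.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

    def retrieve(self, namespace, query):
        candidates = self._store.get(namespace, [])
        scored = [(self._similarity(query, m["text"]), m) for m in candidates]
        scored = [(s, m) for s, m in scored if s >= self.threshold]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [m for _, m in scored[: self.top_k]]
```

Retrieval never crosses namespaces: a query issued under another user's namespace simply finds nothing, which is the isolation property Article 17's tests assert.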
Layer 3 — Episodic (trace history)
Episodic memory is the agent’s own past: “last time this user asked about refunds, I escalated; it was resolved by manager X; customer stayed.” Episodic memory is richer than single-fact long-term memory because it encodes sequences — the tools called, the outcomes observed, the decisions taken. It is often structured (JSON records of past sessions) rather than embedded text, though hybrid designs are common.
Episodic memory supports reflection and learning patterns (Reflexion — Article 4). It is also the most attractive poisoning target because a corrupted episode shapes future behavior across many sessions.
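One way to structure an episodic record as described above, as a JSON-serializable session summary. The schema and field names are illustrative, not a standard.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Episode:
    """Layer 3 sketch: a structured record of one past session,
    encoding the sequence of tools called, outcomes, and decisions."""
    session_id: str
    user_id: str
    goal: str
    tool_calls: list = field(default_factory=list)  # [(tool, outcome), ...]
    decision: str = ""
    resolution: str = ""

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

# The refund-escalation episode described above, as a record:
ep = Episode(
    session_id="s-1042",
    user_id="u-77",
    goal="refund request",
    tool_calls=[("lookup_order", "found"), ("escalate", "assigned to manager X")],
    decision="escalated",
    resolution="resolved; customer stayed",
)
```

Because the record is structured rather than embedded text, a future session can filter on decision or resolution directly, and a poisoned episode remains traceable to its session_id.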
Layer 4 — Semantic (knowledge graph)
Semantic memory is the structured knowledge the agent shares with other agents in the organization: “Customer Foo Corp has active contract SLA tier 2, primary contact is Jane at jane@foo.example, escalation path goes to Rachel.” A knowledge graph — Neo4j, Amazon Neptune, or a lighter-weight RDF triple store — holds this layer. The agent reads from the graph to ground decisions; it writes to the graph only through authorized tools (Article 6).
Semantic memory is the layer most likely to be shared across multiple agents and most likely to hold authoritative organizational truth. Its governance overhead is the highest; its poisoning blast radius is the largest.
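The write-gated, read-open contract for Layer 4 can be sketched with an in-memory triple set standing in for Neo4j or Neptune; the agent IDs and authorization set are hypothetical.

```python
class SemanticGraph:
    """Layer 4 sketch: reads are open for grounding; writes pass an
    authorization gate (the authorized tools of Article 6)."""

    def __init__(self, authorized_writers):
        self.authorized_writers = set(authorized_writers)
        self.triples = set()  # (subject, predicate, object)

    def write(self, agent_id, subject, predicate, obj):
        # Only authorized agents mutate shared organizational truth.
        if agent_id not in self.authorized_writers:
            raise PermissionError(f"{agent_id} may not write to the graph")
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None):
        # Any agent may read to ground its decisions.
        return [
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
        ]
```

The asymmetry is the point: the large poisoning blast radius of this layer justifies gating writes far more tightly than reads.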
The five memory-store technology choices
Picking a store is a concrete architectural decision that the AITE-ATS holder owns. Five options span the production space.
- pgvector in Postgres — lowest operational overhead if Postgres is already in play; good for Layer 2 up to low millions of vectors. Row-level security (RLS) is the architectural win: native tenant isolation enforced in the database rather than in the vector layer.
- Pinecone — managed service, excellent operations, namespaces for tenant isolation, metadata filtering. Good default for Layer 2 at scale when Postgres isn’t the organization’s choice.
- Weaviate / Qdrant / Milvus — open-source self-hosted options; Weaviate has GraphQL and a hybrid search layer; Qdrant has Rust-based performance; Milvus targets cloud-scale volumes. The choice depends on operational preference and scale target.
- Neo4j (or Neptune) — for Layer 4. Cypher query language; strong tooling; expensive at scale but the right shape for relational organizational knowledge.
- S3 / object store + indexed metadata — for Layer 3 when episodic records are large-body (full traces with tool outputs). Object store for the bodies; Postgres for the index.
Every choice has tenant-isolation implications. The architect specifies the isolation mechanism at the store layer (namespace, RLS, separate index per tenant) and verifies it in tests (Article 17).
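The isolation mechanisms named above (namespace, RLS, separate index per tenant) share one shape: the tenant is bound once, and every subsequent operation is scoped by it. A minimal sketch of that shape, with a plain dict standing in for the store and all names illustrative:

```python
class TenantScopedStore:
    """Sketch of store-layer isolation: every read and write is bound to
    the tenant captured at construction, mirroring what Postgres RLS or
    per-tenant namespaces enforce natively."""

    def __init__(self, backing, tenant_id):
        self._backing = backing   # shared dict-like: {(tenant, key): value}
        self._tenant = tenant_id  # bound once; callers cannot change it

    def put(self, key, value):
        self._backing[(self._tenant, key)] = value

    def get(self, key):
        # A key written by another tenant is simply invisible here.
        return self._backing.get((self._tenant, key))
```

The test-side property (Article 17) is exactly what the sketch makes obvious: a handle scoped to one tenant can never observe another tenant's writes, even over the same physical store.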
The write policy — the hard problem
Reading from memory is easy; deciding what to write is where architectural discipline pays off. Four write-policy patterns.
Pattern A — Explicit save only. The agent writes only when a specific tool (save_preference, save_escalation_note) is called with explicit parameters. Every write is visible in the tool audit log; the model cannot silently mutate memory. Highest audit fidelity; lowest memory growth; best default for regulated deployments.
Pattern B — Post-hoc summarization. At session end, a summarizer extracts salient facts and writes them to memory. Lower friction than Pattern A; harder to audit what was learned; risk of the summarizer fabricating or distorting.
Pattern C — Reflection-driven. Periodically the agent reflects on recent sessions and decides what to add to Layer 3/4 memory. The Generative Agents paper uses this pattern; it is powerful for long-horizon learning but requires strict write gating. Without gating, an injection in a single session can propagate into permanent memory.
Pattern D — Human-approved writes. Proposed memory writes are queued for human review before entering the store. Highest safety; lowest throughput; appropriate for Layer 4 writes that affect many agents or Layer 3 writes at high-risk tools.
Architectural recommendation: the agent’s default write policy is Pattern A; Pattern B is allowed for session-summary style facts only; Pattern C is allowed only behind a policy engine check (Article 22); Pattern D is required for Layer 4 writes affecting other agents.
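Pattern A plus the policy-gate recommendation can be sketched as a single explicit save tool. The policy table, tool name, and layer labels are hypothetical, not any framework's API.

```python
import time

# Hypothetical policy table: which layers each agent may write.
ALLOWED = {
    "support-agent": {"layer2"},
    "platform-admin": {"layer2", "layer3", "layer4"},
}

audit_log = []     # every attempt, allowed or denied, is visible here
memory_store = []  # the persistent store (stand-in)

def save_preference(agent_id, user_id, fact, layer="layer2"):
    """Pattern A sketch: the only path by which memory is written.

    The write is policy-checked (Pattern C/D gating would slot in here)
    and audit-logged; the model cannot silently mutate memory.
    """
    if layer not in ALLOWED.get(agent_id, set()):
        audit_log.append(("denied", agent_id, layer, fact))
        raise PermissionError(f"{agent_id} may not write {layer}")
    record = {"user_id": user_id, "fact": fact, "layer": layer,
              "author": agent_id, "written_at": time.time()}
    memory_store.append(record)
    audit_log.append(("written", agent_id, layer, fact))
    return record
```

Because denials are logged as loudly as writes, a model probing for a forbidden layer leaves a trace, which is the audit fidelity that makes Pattern A the regulated-deployment default.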
Memory poisoning — the OWASP agentic risk
OWASP’s Top 10 for Agentic AI (2024–2025 iteration) names memory poisoning as a distinct risk class. The attack pattern: an attacker (direct user or upstream injection via tool output — Article 14) induces the agent to write false or malicious content into persistent memory. Future sessions retrieve that content as though it were trustworthy context. The bad fact becomes the agent’s belief.
Defenses compose.
- Write-time classification. Every memory write carries a classification tag (trusted, user-asserted, tool-output, speculative). Downstream reads weight or filter by classification; speculative or tool-output-sourced memories do not outrank trusted-source memories in retrieval.
- Write-time policy gate. The policy engine (Article 22) evaluates every write: is the calling agent authorized to write to this namespace, at this classification level?
- Provenance fields. Every memory record carries source_session, source_prompt_hash, tool_chain, authoring_agent, and timestamp. Retrieval can surface provenance to the downstream agent; auditing can trace a suspicious belief back to its origin.
- Decay and expiration. Memories have TTLs proportional to their source trust. User-asserted facts expire faster than organizationally approved facts. Auto-generated reflections expire faster still.
- Poisoning detection batteries. The evaluation harness (Article 17) includes memory-poisoning adversarial tests: inject a poisoned fact, query, detect the wrong answer, trace the provenance back to the poison. The battery runs continuously on canary traffic and pre-production.
- Tenant boundary enforcement. No agent ever reads memories outside its tenant scope. Violations are detectable in the store layer and must be tested (Article 28).
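Three of the defenses above (classification weighting, provenance fields, and TTL decay) compose naturally in the retrieval path. A sketch with illustrative, not normative, weights and TTLs:

```python
import time

# Hypothetical trust weights and TTLs per classification tag.
TRUST_WEIGHT = {"trusted": 1.0, "user-asserted": 0.6,
                "tool-output": 0.4, "speculative": 0.2}
TTL_SECONDS = {"trusted": 90 * 86400, "user-asserted": 30 * 86400,
               "tool-output": 7 * 86400, "speculative": 1 * 86400}

def retrieve_weighted(records, raw_scores, now=None):
    """Classification-aware retrieval sketch: expire stale records, then
    down-weight low-trust classifications so a poisoned tool-output memory
    cannot outrank a trusted memory of similar relevance.

    Each record is assumed to carry provenance fields (source_session,
    authoring_agent, written_at) so a wrong answer can be traced back.
    """
    now = now if now is not None else time.time()
    live = []
    for rec, score in zip(records, raw_scores):
        cls = rec["classification"]
        if now - rec["written_at"] > TTL_SECONDS[cls]:
            continue  # decayed out of memory
        live.append((score * TRUST_WEIGHT[cls], rec))
    live.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in live]
```

In this sketch a tool-output memory with raw relevance 0.9 ranks below a trusted memory at 0.8, and a day-old speculative reflection has already expired, which is exactly the behavior the poisoning battery (Article 17) should assert.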
Framework parity
- LangGraph — MemorySaver for short-term; BaseStore protocol (user can plug Postgres, Pinecone, Redis) for long-term; episodic/semantic handled by custom store implementations or external libraries.
- CrewAI — Memory abstraction with pluggable backends; short-term via task context; long-term via configured vector store (default Chroma).
- AutoGen — Memory protocol with file-based, vector, and custom implementations; episodic via conversation history in group-chat patterns.
- OpenAI Agents SDK — session memory native; long-term via user-provided vector tools; no native Layer 3/4.
- Semantic Kernel — IMemoryStore abstraction with Azure AI Search, Qdrant, Postgres, in-memory backends; clear separation of working vs persistent memory.
- LlamaIndex Agents — extensive Memory class with short-term, long-term, hybrid; integration with its own vector stores.
Across frameworks the architect specifies the four-layer model as the platform contract and picks framework-native implementations for Layer 1 while centralizing Layer 2–4 in platform services.
Real-world anchor — Park et al. Generative Agents (UIST 2023)
The Park et al. paper formalized the memory stream + reflection + planning architecture for agents, demonstrated in the Smallville simulation. The architecture’s implicit layered structure — short-term observation, a long-term memory stream, and reflection as episodic-to-semantic distillation — became the reference for most framework designs that followed. For AITE-ATS, the paper’s contribution is concrete examples of what each layer does and what goes wrong when layers are collapsed or absent. Source: https://arxiv.org/abs/2304.03442.
Real-world anchor — Microsoft Copilot memory (2024 public announcement)
Microsoft’s announcement of personalized memory for Copilot (November 2024) triggered public debate about user-visible memory controls, memory opt-out, and the classification and retention of what the system remembers. The architectural takeaway is the user-visible memory ledger — the user can see exactly what the system has remembered and delete entries. Architects designing consumer-facing or regulated-user-facing agents should treat the memory ledger as a product requirement, not an implementation detail.
Real-world anchor — Character.AI memory discussions (2024)
Character.AI’s public discussions of persistent character memory — user-specific preferences, persona consistency, and the safety review around memory in sensitive conversations — illustrate Pattern C (reflection-driven) at consumer scale. Their published safety posts discuss the specific risks of unfiltered memory writes in emotionally-charged conversations and the architectural responses (classification, filtering, TTL). The lesson for enterprise architects is that consumer-scale memory systems have made the failure modes visible; enterprise architectures should adopt the classification-plus-gate discipline from the outset.
Closing
Four layers, five stores, four write-policy patterns, six poisoning defenses, and a ledger the user can inspect. Memory architecture decisions cascade into every downstream article — observability (which memory reads and writes the trace captures), policy (which writes are allowed under what conditions), retirement (which memories do not survive decommissioning). Article 8 takes up the risk that memory amplifies: goal hijacking and excessive agency.
Learning outcomes check
- Explain four memory layers with their lifetimes, classifications, and use cases.
- Classify five memory-store options against scale, tenant isolation mechanism, and layer fit.
- Evaluate a memory design for poisoning risk — identify missing classification, provenance, gate, or TTL controls.
- Design a memory spec for a given agent including layer selection, write policies per layer, and poisoning defenses.
Cross-reference map
- Core Stream: EATF-Level-1/M1.4-Art12-Agent-Memory-Architecture-and-Risk.md.
- Sibling credential: AITM-AAG Article 6 (governance of memory artifacts).
- Forward reference: Articles 8 (hijacking), 14 (indirect injection), 22 (policy engines), 26 (memory registry), 28 (data architecture).