COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert — Article 3 of 40
Thesis. The runtime is the part of the agentic architecture that every other article depends on. Pick the wrong runtime pattern and every subsequent control — observability, kill-switch, policy gating, memory management — becomes harder than it needs to be. The four dominant patterns each make different trade-offs between inspectability, extensibility, throughput, and failure isolation. This article names the patterns, maps workloads to best-fit patterns, and builds one from scratch on raw Llama 3 to demonstrate that nothing about agentic systems requires proprietary infrastructure.
Four runtime patterns
A runtime is the execution substrate that hosts the agent loop. It owns the control flow — what happens between “model emits a tool call” and “tool result arrives at the next model call” — and the cross-cutting concerns: state, interruption, tracing, recovery. The same agent logic can run on multiple substrates; the substrate choice determines the non-functional properties.
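That substrate contract can be sketched as a minimal interface. A hypothetical Python protocol (the method names are illustrative, not any real framework's API) showing the four responsibilities every substrate must cover — driving the loop, interruption, recovery, and tracing:

```python
from typing import Any, Callable, Protocol

# Hypothetical sketch: the same agent logic can target any substrate that
# satisfies this minimal contract; the substrate choice then determines
# the non-functional properties (inspectability, throughput, isolation).
class Runtime(Protocol):
    def run(self, goal: str) -> str: ...            # drive the loop to completion
    def pause(self) -> dict: ...                    # snapshot state for interruption
    def resume(self, snapshot: dict) -> str: ...    # recover from a snapshot
    def on_transition(self, hook: Callable[[str, Any], None]) -> None: ...  # tracing
```

All four patterns below can be read as different concrete implementations of this one contract.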
Pattern 1 — State-graph runtime
The state-graph runtime models agent execution as a directed graph of named states with explicit transitions. The graph definition is first-class code; the runtime walks the graph, calling into handlers at each node, and records every transition. LangGraph is the canonical public example (LangChain AI, 2024–2025); Semantic Kernel Process Framework is a closed-ecosystem analogue; AWS Step Functions with Bedrock integration is a cloud-native analogue. The state-graph pattern optimises for inspectability — you can draw the graph, prove properties about it, and route a human reviewer into any node.
Strengths: explicit state, trivial to pause/resume, natural fit for Article 14 oversight gates, deterministic replay (Article 15) becomes near-trivial, graph-level policy gating (Article 22) is clean. Weaknesses: developer friction for exploratory use cases, graph growth over time as edge cases accumulate.
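The pattern's essence fits in a few lines. A minimal hand-rolled sketch (not LangGraph itself; the node names and trace list are illustrative): each node handler returns the next state's name, and the walker records every transition — which is exactly what makes replay and gate insertion cheap.

```python
# Hypothetical minimal state-graph walker: nodes are handlers that return
# (next_state_name, new_state); the walker records every transition.
def run_graph(nodes, start, state, trace):
    current = start
    while current != "END":
        trace.append(current)            # every transition is auditable
        current, state = nodes[current](state)
    return state

# Toy refund graph: plan -> act -> human_gate (for amounts over $500) -> END
nodes = {
    "plan": lambda s: ("act", {**s, "plan": "refund"}),
    "act": lambda s: ("human_gate" if s["amount"] > 500 else "END", s),
    "human_gate": lambda s: ("END", {**s, "approved": True}),
}
trace = []
final = run_graph(nodes, "plan", {"amount": 900}, trace)
# trace now lists every node visited: ["plan", "act", "human_gate"]
```

Because the graph is data, a reviewer node can be spliced onto any edge without touching the handlers on either side.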
Pattern 2 — Conversation-buffer runtime
The conversation-buffer runtime models agent execution as a growing list of messages — system, user, assistant, tool-result — that the model sees in full on every call. OpenAI Agents SDK (2025), Anthropic Claude with tool use via raw messaging, and Microsoft AutoGen (for single-agent uses) all exemplify the pattern. The conversation-buffer pattern optimises for developer ergonomics and model-driven control flow — the model’s next message determines the next action.
Strengths: simplest mental model, best fit for conversational agents, excellent fit where the model’s reasoning should drive the workflow shape. Weaknesses: context-window bloat (addressed via summarisation — Article 19), implicit state makes Article 15 replay harder, gate-insertion requires middleware discipline.
Pattern 3 — Task-queue runtime
The task-queue runtime models agent execution as a queue of work items processed by workers. Each work item is an independent unit of agent reasoning; state is explicit and persistent between items. BullMQ + Redis, AWS SQS + Lambda, Temporal workflow orchestration, and Celery + Redis all host agents this way. The task-queue pattern optimises for throughput and fault tolerance.
Strengths: horizontal scaling is native, retry semantics are native, backpressure is native, per-tenant isolation is clean. Weaknesses: not suited to interactive conversations (latency of queue round-trip), long-horizon context is harder to carry, observability needs special care to reconstruct the multi-item trace.
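A minimal single-process sketch of the pattern, with an in-memory `queue.Queue` standing in for Redis/SQS and a toy `process` function standing in for one unit of agent reasoning: state travels with each work item, and retry is just a re-enqueue.

```python
import queue

# Hypothetical task-queue worker: each item is an independent unit of
# agent reasoning; a failure loses at most one item, and retry is native.
def worker(q: queue.Queue, results: list, max_retries: int = 2) -> None:
    while True:
        try:
            item = q.get_nowait()
        except queue.Empty:
            return
        try:
            results.append(process(item))
        except Exception:
            if item["attempt"] < max_retries:     # re-enqueue with attempt count
                q.put({**item, "attempt": item["attempt"] + 1})

def process(item):
    # Toy stand-in for agent reasoning; fails once on the "bad" document.
    if item["doc"] == "bad" and item["attempt"] == 0:
        raise RuntimeError("transient failure")
    return f"classified:{item['doc']}"

q = queue.Queue()
for doc in ["a", "bad", "b"]:
    q.put({"doc": doc, "attempt": 0})
results = []
worker(q, results)
# "bad" fails once, is requeued, and succeeds on the retry
```

Horizontal scaling is just more `worker` processes reading the same queue; nothing in the agent logic changes.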
Pattern 4 — Actor runtime
The actor runtime models each agent as an independent actor with a mailbox, processing messages one at a time with isolated state. Akka (JVM), Ray actors, Microsoft Orleans, and — at language level — Erlang/Elixir OTP exemplify the pattern. Multi-agent systems built on the actor pattern have a natural fit: inter-agent messages are just actor messages. The actor pattern optimises for concurrency and failure isolation.
Strengths: agent crash does not take down others, message-passing is the natural multi-agent protocol, supervisor trees give restart semantics out of the box. Weaknesses: heavier operational setup, concepts unfamiliar to LLM-app-first developers, debugging requires actor-aware tooling.
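A minimal sketch using Python threads as stand-ins for a real actor framework (Akka or Ray add supervision and distribution on top; all names here are illustrative): each actor owns a private mailbox and isolated state, and processes one message at a time.

```python
import queue
import threading

# Hypothetical actor: a thread with a mailbox and isolated state. Only
# this actor's thread ever touches self.state, so a crash in one actor's
# handler cannot corrupt another's state.
class Actor(threading.Thread):
    def __init__(self, name, handler):
        super().__init__(daemon=True)
        self.name, self.handler = name, handler
        self.mailbox = queue.Queue()
        self.state = {}

    def send(self, msg):
        self.mailbox.put(msg)            # inter-agent messages are just puts

    def run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:              # poison pill: shut down cleanly
                return
            self.handler(self, msg)      # one message at a time

def count_words(actor, msg):
    actor.state["words"] = actor.state.get("words", 0) + len(msg.split())

researcher = Actor("researcher", count_words)
researcher.start()
researcher.send("draft one")
researcher.send("draft two revised")
researcher.send(None)
researcher.join()
# researcher.state["words"] == 5
```

A critic or editor agent is just another `Actor` with a different handler, and coordination is `researcher.send(...)` from inside the critic's handler.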
Diagram 1 — Runtime pattern × workload characteristics matrix
                            Inspectability   Throughput   Multi-agent   Interactive
                            ──────────────   ──────────   ───────────   ───────────
state-graph (LangGraph)          ★★★★            ★★            ★★           ★★★
conversation-buffer (OAI)         ★★             ★★            ★            ★★★★
task-queue (Temporal)             ★★            ★★★★           ★★            ★
actor (Ray/Akka)                 ★★★           ★★★★          ★★★★           ★★
★ = weaker fit; ★★★★ = best fit
Reading the matrix: if inspectability dominates (EU AI Act Article 14 evidence pack, regulated sector), state-graph wins. If interactive conversation dominates (customer service), conversation-buffer wins. If throughput dominates (high-volume back-office), task-queue or actor wins. If multi-agent coordination dominates, actor wins.
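Those heuristics can be written down as a toy decision helper (the mapping simply encodes the matrix reading above; real selection weighs all four dimensions plus organisational factors, and the names are illustrative):

```python
# Hypothetical first-pass selector: maps the dominant workload dimension
# to the best-fit runtime pattern from the matrix.
def pick_runtime(dominant: str) -> str:
    return {
        "inspectability": "state-graph",
        "interactive": "conversation-buffer",
        "throughput": "task-queue",
        "multi-agent": "actor",
    }.get(dominant, "state-graph")   # default to the most inspectable pattern
```

The defensible default is state-graph: when no dimension clearly dominates, inspectability is the property hardest to retrofit.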
Diagram 2 — Hub-and-spoke runtime anatomy
                ┌──────────────┐
                │     loop     │
                │  controller  │
                └──────┬───────┘
    ┌──────────┬───────┼───────┬──────────┐
    │          │       │       │          │
┌───▼──┐ ┌────▼───┐ ┌──▼──┐ ┌──▼───┐ ┌───▼───┐
│ tool │ │ memory │ │ gate│ │safety│ │observ-│
│ call │ │ read/  │ │ HITL│ │layer │ │ability│
│      │ │ write  │ │     │ │      │ │       │
└──────┘ └────────┘ └─────┘ └──────┘ └───────┘
Any production agent runtime has five spokes: tool calls, memory operations, human-gate transitions, safety checks, and observability emission. The runtime’s job is to route requests to the right spoke and to guarantee cross-cutting properties — every tool call passes the authorization spoke; every transition emits a span; every memory write passes the policy spoke. Article 20 shows how the spokes become platform services.
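The cross-cutting guarantee is easiest to see in code. A hypothetical hub sketch (all names are illustrative) in which every tool call is forced through the authorization and observability spokes by construction, so individual tools cannot bypass them:

```python
# Hypothetical hub: the only path to a tool runs through the authorization
# and observability spokes, guaranteeing the cross-cutting properties.
def make_hub(authorize, emit_span, tools):
    def call_tool(name, args, caller):
        emit_span("tool_call_requested", name)    # observability spoke
        if not authorize(caller, name):           # authorization spoke
            emit_span("tool_call_denied", name)
            return {"error": "AUTH_DENIED"}
        result = tools[name](**args)              # tool spoke
        emit_span("tool_call_completed", name)
        return result
    return call_tool

spans = []
hub = make_hub(
    authorize=lambda caller, name: name != "delete_db",   # toy policy
    emit_span=lambda event, name: spans.append((event, name)),
    tools={"lookup": lambda order_id: {"status": "shipped", "order": order_id}},
)
ok = hub("lookup", {"order_id": 42}, caller="agent-1")
denied = hub("delete_db", {}, caller="agent-1")
# both calls emitted spans; the denied call never reached a tool
```

Swapping the closures for real services (OPA for `authorize`, an OTel exporter for `emit_span`) changes nothing about the routing guarantee.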
Worked workload selection
- Customer-service agent with moderate volume and HITL gate for refund > $500 → state-graph runtime. The gate is a first-class node; the state graph can be inspected by Compliance.
- Document-processing agent handling 50k tickets/day → task-queue runtime. Throughput wins.
- Multi-agent research team (researcher, critic, editor) → actor runtime. Inter-agent coordination is native.
- Interactive executive-assistant chatbot → conversation-buffer runtime. Interactive latency matters.
- Regulated back-office agent requiring full audit trail → state-graph with task-queue behind it for throughput. Hybrid runtimes are common and legitimate.
Runtime choice is rarely purely technical. The regulated-sector architect should lean toward state-graph because inspectability will be demanded in audit. The startup architect should lean toward conversation-buffer for speed of iteration and migrate when volume demands it.
Build-your-own — a minimal runtime on raw Llama 3
Vendor lock-in is a legitimate concern; the credential demonstrates that no part of the agentic stack requires proprietary infrastructure. The reference below is illustrative Python-shaped pseudo-code (production code adds error handling, retries, timeouts, and tracing — see Articles 6, 15, 16):
MODEL = "meta-llama/Llama-3.1-8B-Instruct" # or Mistral-7B, Qwen2-7B, DeepSeek
tokenizer = load_tokenizer(MODEL)
model = load_model(MODEL, dtype=bfloat16)
TOOLS = registry_of_approved_tools() # see Article 5
def agent_loop(goal, max_steps=10):
    history = [system_prompt_with_tool_schemas(),
               user_message(goal)]
    for step in range(max_steps):
        # 1. call the model
        prompt = apply_chat_template(history)
        reply = model.generate(prompt, max_new_tokens=256)
        # 2. parse for tool call
        tool_call = try_parse_tool_call(reply)
        if tool_call is None:
            return reply                              # final answer
        history.append(assistant_message(reply))      # record the turn before branching
        # 3. authorize (see Article 6)
        if not authorize(tool_call, caller_identity, tenant, policy):
            history.append(tool_result("AUTH_DENIED"))
            continue
        # 4. execute tool in sandbox (see Articles 5, 21)
        result = execute_tool_sandboxed(tool_call.name, tool_call.args)
        # 5. validate side effects (see Article 6)
        if not validate_side_effects(result):
            escalate_to_human(tool_call, result)      # see Articles 9, 10
            return "escalated"
        # 6. append the tool result and iterate
        history.append(tool_result(result))
    trigger_kill_switch("max_steps_exceeded")         # see Article 9
    return "halted"
This is a conversation-buffer runtime on raw weights. Swap Llama-3.1-8B-Instruct for Mistral-7B-Instruct or Qwen2-7B-Instruct and the pattern holds — all three are supported by Hugging Face Transformers. The authorization hook (step 3), side-effect validation (step 5), and the max_steps guard (step 6) are the three most important controls; they are the seeds of Articles 6, 9, and 16 respectively. The build-your-own demonstrates that agentic architecture is substrate-independent.
Runtime selection and extensibility
An extensible runtime pattern is one where cross-cutting concerns can be added without rewriting the core. The architect should evaluate runtime extensibility against six plugin points:
- Tool-call interception. Can authorization and validation run around every tool call without modifying tool code?
- State snapshot. Can the runtime serialize full state for pause/resume and for replay?
- Event emission. Can observability spans be emitted from every meaningful transition?
- Human gate. Can the runtime pause cleanly for a human decision and resume on a signal?
- Policy engine. Can a policy engine (OPA, Cedar) gate transitions?
- Budget enforcement. Can per-run budgets (tokens, tool calls, wall time) be enforced?
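Plugin point 6 is the easiest to make concrete. A hypothetical per-run budget object (the class name and limits are illustrative), charged on each step and checked before every loop iteration:

```python
import time

# Hypothetical budget enforcement: per-run caps on tokens, tool calls,
# and wall time. The loop charges the budget each step and halts (or
# escalates) as soon as any cap is exceeded.
class Budget:
    def __init__(self, max_tokens, max_tool_calls, max_seconds):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.tokens = self.tool_calls = 0
        self.started = time.monotonic()

    def charge(self, tokens=0, tool_calls=0):
        self.tokens += tokens
        self.tool_calls += tool_calls

    def exceeded(self):
        return (self.tokens > self.max_tokens
                or self.tool_calls > self.max_tool_calls
                or time.monotonic() - self.started > self.max_seconds)

budget = Budget(max_tokens=1000, max_tool_calls=3, max_seconds=60)
budget.charge(tokens=400, tool_calls=1)
assert not budget.exceeded()
budget.charge(tool_calls=3)      # 4 tool calls total: over the cap
assert budget.exceeded()
```

A runtime passes this check if an object like `budget` can be consulted at the loop boundary without rewriting the loop itself.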
A runtime that fails three or more of these checks should be extended before production use or replaced. LangGraph scores high on checks 1, 2, 3, 4, 5 and moderate on 6. OpenAI Agents SDK scores high on 1, 4, moderate on 2, 3, 5, 6. CrewAI scores high on 4, moderate on 1, 2, 3, 5, 6. Temporal scores high on 2, 4, 6, moderate on 1, 3, 5. The credential covers framework comparison at depth in Article 11.
Real-world anchors
LangGraph in production — the state-graph pattern
LangGraph’s public documentation (LangChain AI, 2024–2025) provides concrete state-graph examples used in enterprise. The canonical pattern — nodes for plan, act, validate, human-gate, with explicit edges and a checkpointer — is readable in the public docs and in public reference implementations on GitHub. Teams adopting LangGraph for regulated workloads cite inspectability (state graph visualisation), native HITL-gate support, and checkpoint/resume as the reasons. The trade-off cited is learning curve and graph maintenance as edge cases accumulate. Source: https://langchain-ai.github.io/langgraph/.
CrewAI and AutoGen — conversation-based substrates
CrewAI (docs public, 2024–2025) builds role-based multi-agent workflows on a conversation-buffer substrate with agent-specific prompt scaffolds. Microsoft AutoGen (docs public, 2024–2025) similarly uses conversation as the substrate and adds multi-agent chat patterns. Neither framework requires a graph; both place the reasoning-driven control flow at the centre. Teams picking these frameworks typically value developer ergonomics over inspectability. Sources: https://docs.crewai.com/ and https://microsoft.github.io/autogen/.
Closing
The runtime is the architectural decision most agentic projects under-invest in. A one-pager that names the runtime pattern, the primary plugin points, and the fallback runtime for overflow scenarios is a non-negotiable artefact for any production agent. Article 4 drops down a level — into the loop shape that runs inside whichever runtime you chose.
Learning outcomes check
- Explain four runtime patterns and their trade-offs against four workload dimensions.
- Classify a workload by best-fit runtime.
- Evaluate a runtime for extensibility against six plugin points.
- Design a runtime skeleton for a given feature, including the build-your-own pattern.
Cross-reference map
- Core Stream: EATE-Level-3/M3.3-Art11-Enterprise-Agentic-AI-Platform-Strategy-and-Multi-Agent-Orchestration.md; EATE-Level-3/M3.3-Art04-Multi-Model-Orchestration-and-AI-System-Design.md.
- Forward reference: Article 11 (framework comparison); Article 20 (platform design).