Software-Engineering Agentic Patterns

FlowRidge

The architect’s interest in SWE agents is twofold. First, SWE agents are frequently the organization’s first production agentic deployment; getting them right matters. Second, SWE-agent architectures are a design-pattern source for other domains — the sandbox, the tool set, the memory discipline, and the evaluation harness generalize.

Four SWE use cases, by autonomy

The SWE category contains four use cases on the autonomy spectrum (Article 2):

L1 — Inline completion (GitHub Copilot-style). The agent suggests next tokens or small snippets; the developer accepts, edits, or rejects; no tool calls beyond the editor. Pattern: stateless completion; prompt conditioned on visible code context; model version tracked; telemetry on accept/reject rate.

L2 — PR-generation from issue (Copilot Workspace-style; Devin’s “task-level”). The agent reads a ticket, produces a plan, edits files, opens a PR; a developer reviews the PR. Pattern: tool set (file read, file write, shell within sandbox), run tests, push branch; developer as reviewer; iteration in PR-comment response pattern.

L3 — Full-task agent (Devin, Replit AI Agent, Factory.ai). The agent executes multi-step tasks autonomously — exploring the repository, running tests, fixing failures, iterating. Developer is the supervisor who assigns and reviews. Pattern: long-running sandbox; persistent task state; rich tool set including browser and terminal; memory of past attempts.

L4 — SRE/ops agent (Datadog Bits AI Dev Agent-style; early deployments). The agent responds to alerts, runs diagnostics, proposes or applies remediations. Pattern: read-only by default with HITL on writes; tight scope; escalation to human on-call if blocked.

The reference architecture

Sandbox (Article 21)

A SWE agent needs an execution sandbox because it will write and run code. Requirements:

Filesystem isolation (per-task scratch; read-only mounts for system bases).
Process isolation (agent cannot escape to host).
Network policy (outbound limited to package registries, declared endpoints).
Resource caps (CPU, memory, disk, wallclock).
Snapshot capability (rollback to pre-task state on kill).
Clean teardown (no residue between tasks).

Production examples: Replit’s sandboxes; Devin’s “DevBox”; Amazon CodeCatalyst / Bedrock agent sandboxes; E2B open-source sandbox service.

Repository access

Agents should operate through the same flow a junior engineer does: checkout a branch, make changes, run tests, push, open PR. Direct-to-main writes should be impossible — branch-protection rules enforce the PR workflow, and agents do not bypass them.

Tool set

Typical SWE agent tool set:

File ops (read, write, list, diff).
Shell (in sandbox only; command allowlist for write commands).
Test runner (language-specific; structured result parsing).
Lint / format (language-specific).
Git (branch, commit, push, diff).
Package manager (install with dependency-pinning discipline).
Browser (documentation look-up; error-message search).
Code search / embedding search (semantic search across the repository).

Memory

SWE agents benefit from two memory surfaces:

Session scratchpad — per-task; ephemeral; holds the agent’s plan, observations, attempted fixes.
Architectural memory — opt-in; persistent; captures “how this repo works” knowledge (key files, conventions, ownership) with explicit curation.

Cross-session memory is powerful but risky: memory poisoning (Article 25) from a single bad session can propagate. Replit’s public 2024 memory-corruption postmortem is a canonical reference for why architectural memory needs provenance and snapshot discipline.

Evaluation harness

SWE agents have the richest evaluation ecosystem of any agentic category:

SWE-bench (Jimenez et al., 2023) — real-world GitHub issues with resolution patches; the public benchmark.
SWE-bench Verified — a curated subset with improved issue selection.
Internal regressions — repository-specific golden tasks (fix this recurring class of bug; implement this class of feature).
Adversarial evaluations — prompt-injection via file contents; malicious-package installation attempts; secret-exfiltration attempts.

The architect requires that every promoted agent version passes the internal regression + the adversarial battery; SWE-bench is a useful comparative benchmark but not a substitute for repository-specific evaluation.

Human-review surface

Inline completion: accept/reject in editor; telemetry only.
PR-level: PR comments; diff review; test-result review; merge decision is the human’s.
Task-level / SRE: agent submits the action; human reviews in a dashboard or PR; for SRE, runbook-approval workflow mirrors pager approval.

Operational patterns from production deployments

GitHub Copilot (public architecture posts, 2024–2025). Token-efficient prompt construction; multi-model routing; per-tenant rate limits; enterprise-configurable policy (allow/deny suggestions from public code). The architect lesson: the most-deployed coding agent is carefully engineered for cost at scale, not just quality on benchmarks.

Devin by Cognition AI (2024). Architectural disclosures include long-horizon task execution, browser + terminal + editor tool set, planning-and-reflection loop (Article 4). Public discussion of limitations — silent failure modes, gap between demo and reliable production use — informed subsequent industry practice.

Replit AI Agent (2024–2025). Publicly described; includes memory-corruption postmortem after an incident where accumulated long-term memory contained confusing content that degraded subsequent task quality. Remediation pattern: provenance-per-memory-entry, targeted forgetting, per-tenant memory-store isolation.

Factory.ai “Bricks” and similar task-level agents. Public product pages and customer case studies describe multi-agent orchestration (planner, implementer, reviewer) over SWE tasks.

Timeline of a task-level SWE agent

Safety patterns

Prompt-injection via source files (Article 14). A SWE agent reading repository files can encounter adversarial content (injected README instructions, poisoned third-party dependency comments). Mitigations: input classifiers on retrieved file content; architectural rule that tool outputs are data, not instructions; structured plan format insensitive to injection.

Secret exfiltration. Repositories may contain secrets; agents should not echo secrets in outputs. Pattern: secret-scanner runs on every message and diff before output; detected secrets are blocked from leaving the sandbox.

Supply-chain risk. Package installation is a supply-chain risk (typosquatting, malicious packages). Mitigations: package allowlist for auto-install; human approval for new dependencies; software bill of materials captured per task.

Write-to-main prohibition. Enforced by branch protection and by the agent’s runtime policy; agents cannot escalate to direct-write even when plausibly correct.

Integration with platform services

SWE agents consume platform services just like other agents (Article 20): shared registries (Article 26), shared policy engine (Article 22), shared observability (Article 15), shared kill-switch controller (Article 9). The SWE-agent product teams do not re-implement any of these.

Multi-framework presence

SWE agents are built on a wider variety of frameworks than other agentic categories, partly because the use case is mature and partly because different teams optimise for different workflows:

LangGraph for teams wanting explicit state-graph modelling of plan/implement/verify steps, especially where long-running tasks need checkpointing and replay.
AutoGen (Microsoft) for research-style SWE exploration with multi-agent conversations among specialised roles (planner, coder, tester, reviewer).
CrewAI for role-based hierarchical patterns where each specialist’s responsibilities are enumerated in a crew definition.
OpenAI Agents SDK for teams already committed to OpenAI’s ecosystem and wanting its tool-calling ergonomics with minimal scaffolding.
Semantic Kernel for .NET-heavy environments where enterprise integration with Microsoft Graph, Azure DevOps, and existing C#/F# tooling dominates.
LlamaIndex Agents for SWE tasks that are heavily retrieval-anchored — codebases treated as a knowledge base with structured retrieval before action.
Custom runtimes as seen at Devin, Replit, and Factory.ai for teams whose differentiation genuinely lives in the runtime itself.

The architect chooses based on the Article 39 build-vs-buy factors rather than framework popularity.

Regulatory considerations for SWE agents

Most SWE agents fall outside EU AI Act high-risk classification because they do not directly affect natural persons’ rights. But edge cases matter:

Employment contexts. SWE agents that effectively decide task assignment, review promotion evidence, or influence performance reviews could fall inside Annex III.4 (employment, workers’ management). Rare but worth screening.
Critical-infrastructure deployments. SWE agents writing code for energy, transport, or water-system infrastructure may trigger sector-specific regulation on software quality and change control (IEC 61508, IEC 62443) rather than AI regulation directly, but with increased audit-trail expectations.
Open-source-license compliance. SWE agents suggest or write code that may incorporate suggestions trained on licensed code. The GitHub Copilot litigation (Doe v. GitHub, 2022) is the marker case; organisations set policies on filtering suggestions matching public code and document them.

The architect coordinates with legal and IP on license-compliance policy at design time.

Anti-patterns to reject

“Agent writes to main; we trust it.” Trust is not an architecture; branch protection is.
“No sandbox — agent runs on the dev’s machine.” Local execution conflates developer-agency with agent-agency; incidents have cross-contamination risk.
“Pass SWE-bench and ship.” SWE-bench is a generalist benchmark; repository-specific regression coverage is what predicts production quality.
“Memory just works.” It does not; memory discipline is the difference between Replit’s pre- and post-incident architectures.
“Reviewers will catch it.” Review fatigue is real; the architect does not rely on review to substitute for pre-review safety.

Learning outcomes

Explain software-engineering agentic patterns across four autonomy levels (inline, PR-generation, full-task, SRE).
Classify four SWE use cases by autonomy level, tool set, and human-review pattern.
Evaluate a SWE agent design for sandbox completeness, tool-set safety, memory discipline, and evaluation-harness adequacy.
Design a SWE agentic plan including sandbox specification, tool set, evaluation battery (SWE-bench + internal regression + adversarial), and branch-protection integration.