Agent Evaluation and Simulation Harness

FlowRidge

COMPEL Specialization — AITE-ATS: Agentic AI Systems Architect Expert Article 17 of 40

Thesis. A benchmark score on a model is not an evaluation of an agent. An LLM-as-judge pass-rate on single prompts is not an evaluation of an agent. An agent is a trajectory — a sequence of reasoning, tool calls, memory reads, and actions — and evaluating it requires evaluating the trajectory. This article separates the four modes of agent evaluation that production architects must implement, introduces the simulation-harness pattern that lets teams run realistic adversarial testing without production blast-radius, and specifies the evaluation cadence that makes the harness a living part of the platform rather than a pre-launch checkbox.

Four evaluation modes

Mode 1 — Capability

Does the agent accomplish the task it was designed for? The classical success-rate metric, but for agents the success is defined over a trajectory: did the agent reach the goal, in what number of steps, at what cost, with what side effects. Capability evaluation uses golden-task datasets — curated tasks with known-correct outcomes — and measures pass rate, efficiency (steps, tokens, time), and side-effect correctness.

Capability testing resembles classical integration testing: the task inputs are defined, the expected outputs or states are defined, the harness compares. Agentic variants: SWE-bench for coding agents, AgentBench (Liu et al., 2023), WebArena for web agents, domain-specific task sets the organization curates.

Mode 2 — Safety

Does the agent avoid unsafe actions even under adversarial conditions? Safety evaluation takes the shape of adversarial batteries — crafted inputs designed to induce unsafe behavior. The battery is indexed by threat — OWASP Top 10 Agentic, MITRE ATLAS techniques, internal threat-model specifics — and the metric is the rate at which safety-critical actions are correctly refused or correctly escalated.

Every safety evaluation has two success modes: the agent correctly refused the adversarial request; the agent correctly escalated to a human. Silent compliance with adversarial input is the failure. Over-cautious refusal of legitimate-looking adversarial-appearing input is also a failure (Article 16, refusal cascades).

Mode 3 — Value alignment

Does the agent follow its stated objectives, even when the correct action is not the easy action? Alignment evaluation checks for goal-drift, deceptive cooperation (agent appears to cooperate while pursuing a different objective), and subtle specification-gaming (agent reaches the letter of the goal while violating the spirit).

Alignment is the hardest mode to evaluate because it requires adversarial inputs designed to offer shortcuts; detecting drift needs careful observation of trajectory, not just outcome. UK AISI’s frontier-model autonomy evaluations are the public methodological reference for this mode.

Mode 4 — Behavioral consistency

Does the agent behave predictably across similar inputs, across model updates, and across time? Consistency evaluation measures response stability: run the same task 100 times and measure variance in outcome, cost, steps, and side effects. Behavioral regression testing — running the full battery against new agent versions before promotion (Article 24) — catches regressions that pass capability and safety tests but produce distributionally-different behavior.

Consistency is first-class because inconsistency is an operational signal: an agent whose behavior drifts across runs cannot be trusted in regulated workflows even if individual runs pass safety.

The simulation harness

The simulation harness is the environment in which agentic evaluation runs without touching production systems. The pattern: stub every external dependency (tools, memory stores, downstream systems) with simulators that return configurable responses; wire the agent to the simulated environment; run tasks with full trace capture.

The architect’s harness spec:

Simulator set. Every tool in the registry has a simulator counterpart that returns configurable responses — successful, failure, timeout, edge cases, adversarial outputs. Simulator responses are versioned; evaluation runs pin to a specific simulator-set version.

Seeding. Every simulated task starts from a seeded state (memory contents, tenant context, user context) so runs are reproducible.

Oracle. For capability tests, a known-correct outcome the harness can compare against. For safety tests, a known-unacceptable action set. For alignment, the expected trajectory shape. For consistency, the statistical expectation.

Trace capture. Full observability traces (Article 15) captured from every run; evaluation logic runs against the traces, not just the final outputs.

Parallelism. The harness runs tasks in parallel for throughput; rate-limit the model calls to avoid overwhelming the provider.

The simulation harness is where red-team exercises are safe to run. Production agents cannot have a $1 Tahoe offer tested on them; simulated agents can, repeatedly, with no customer exposure.

Six evaluation techniques

Technique 1 — Golden-task pass rate

Curated tasks with expected outcomes. Pass rate measures basic capability. Complementary metrics: efficiency (tokens, steps) and cost.

Technique 2 — LLM-as-judge with trajectory

An evaluator model reviews the trajectory — reasoning, tool calls, final output — against rubric. Not a blanket substitute for human evaluation but efficient for scale. Calibrated against a human-labeled subset.

Technique 3 — Simulation-based red-team

Adversarial inputs in the simulated environment. Battery indexed by OWASP and ATLAS. Success metric: refusal or correct-escalation rate.

Technique 4 — Production trace sampling

A random sample of production traces evaluated retrospectively by humans or LLM-as-judge. Detects real-world drift. Complements synthetic evaluation — synthetic tests what the team thought of; traces reveal what users actually do.

Technique 5 — Behavioral regression

Before any promotion (Article 24), the full battery runs on the candidate version and results are compared against the prior version. Regressions block promotion.

Technique 6 — Continuous red-team

A standing red-team (internal or external, human or automated) continuously probes for new attack paths. Findings feed new tests into the battery and, when severe, pause rollouts.

Coverage analysis — what the architect checks

An evaluation harness can pass its tests and still miss real risks. The architect audits coverage.

Are all tools exercised? Every tool in the registry should have capability tests (normal operation), safety tests (abuse scenarios), and failure tests (tool-error responses).

Are all loop patterns exercised? If the agent uses ReAct for some tasks and state-graph for others, both patterns need coverage.

Are all authorization paths exercised? The authorization stack (Article 6) has many branches; tests should cover each layer’s reject and accept paths.

Are all memory operations exercised? Read from empty memory, read from poisoned memory, write under each write-policy case (Article 7).

Are all OWASP agentic Top 10 categories covered? The battery is indexed; the architect verifies every category has at least one test.

Are all MITRE ATLAS techniques relevant to the system covered? Threat model (Article 27) identifies techniques; battery covers them.

Is the regression suite closing the loop from incidents? Every production incident produces a regression test. Verify the last N incidents each added a test.

Coverage gaps are documented as known risks with accepted-until dates and monitored compensating controls.

The three production-safe evaluation environments

Architects typically build three evaluation environments.

Environment 1 — Offline simulation. Full simulator set, no external dependencies, pure in-harness. Fast, safe, cheap. Runs in CI.

Environment 2 — Staging with shadowed dependencies. Real models, real frameworks, but tools point to staging replicas of downstream systems or sandboxes (Article 21). Side effects are real but contained.

Environment 3 — Production with shadow evaluation. Real production deployment; a separate evaluator runs against real traces retrospectively (offline) or runs a parallel “shadow agent” that evaluates against real inputs without acting (online shadow).

Each environment catches different classes of bug; all three are worth running.

The red-team discipline

Red-teaming is specialized. The architect’s spec for a red-team exercise:

Scope — what systems, tools, and actions are in-scope.
Rules of engagement — what attack types are allowed, what’s off-limits (usually: real customer data, real financial actions, persistent denial of service).
Team composition — internal security + external specialists + domain experts (legal, clinical, financial as appropriate).
Timeline — pre-deployment exercise; continuous low-level; scheduled major exercises (quarterly or biannual).
Output — findings catalog with severity, evidence, mitigation, retest plan.

DEF CON’s Generative Red Team events (2023 onward) and the UK AI Safety Institute’s autonomy evaluations are the public references for how structured red-teaming works. AITE-ATS holders engage with both communities.

Framework parity — evaluation tooling

LangGraph — LangSmith evaluators; custom callbacks emit evaluation signals; state checkpoints enable replay-based evaluation.
CrewAI — task-level validation callbacks; CrewAI Evals (commercial tier) integrates with test frameworks.
AutoGen — AgentChat evaluation samples; Microsoft’s AutoGenBench tooling; custom evaluators.
OpenAI Agents SDK — tracing native; guardrails double as evaluation hooks; external eval frameworks (Braintrust, Humanloop) integrate.
Semantic Kernel — evaluation pipelines in SK samples; Azure AI Evaluation service integration.
LlamaIndex Agents — AgentEvalBench, FaithfulnessEvaluator, ContextRelevancyEvaluator native; integration with Ragas and Arize Phoenix.

Across frameworks the architect standardizes on a platform-level evaluation service that ingests traces from any framework via OTel and runs the batteries.

Real-world anchor — UK AI Safety Institute frontier-model autonomy evaluations

The UK AISI’s published evaluations of frontier models on autonomous capability (2024–2025) provide a public methodological reference for alignment evaluation at scale. The AISI battery includes long-horizon task completion, deceptive-cooperation detection, and specification-gaming tests. Architects should read the AISI public methodology papers when designing their own alignment battery. Source: aisi.gov.uk.

Real-world anchor — DEF CON 31 Generative Red Team (2023)

The DEF CON 31 Generative Red Team event (August 2023) was the first large-scale public red-team exercise against frontier LLMs. The event’s findings — exposure of prompt injection, data exfiltration, and policy-evasion patterns — informed the OWASP Top 10 for LLMs and, subsequently, for Agentic AI. The organizational pattern (open red-team with structured reporting) is a public reference architects can adapt internally. Source: defcon.org and published DEF CON findings.

Real-world anchor — AgentBench (Liu et al., ICLR 2023)

The AgentBench paper (arxiv.org/abs/2308.03688) introduced a standardized benchmark for LLM-as-agent across eight environments — operating system, database, web shopping, etc. AgentBench’s methodology — task sets with trajectory-level grading — is the academic reference for capability evaluation in agentic systems. Architects building internal benchmarks should read the paper for the task-design discipline. Source: arxiv.org/abs/2308.03688.

Closing

Four modes, a simulation harness, six techniques, three environments, continuous red-team. Evaluation is the discipline that turns agentic claims into evidence. Article 18 takes up the production metrics — SLO/SLI — that carry the evaluation discipline into operations.

Learning outcomes check

Explain four evaluation modes (capability, safety, value alignment, behavioral consistency) with their success criteria.
Classify six evaluation techniques (golden-task, LLM-as-judge with trajectory, simulation red-team, trace sampling, behavioral regression, continuous red-team) by when each runs.
Evaluate an agentic harness for coverage gaps against tools, loop patterns, authorization paths, memory operations, OWASP/ATLAS, and incident regressions.
Design a simulation-environment spec and an evaluation cadence for a given agent portfolio.

Cross-reference map

Core Stream: EATE-Level-2/M2.3-Art13-Evaluating-Agentic-Systems.md.
Sibling credential: AITM-AAG Article 13 (governance-facing evaluation); AITB-LAG Article 5 (learning-and-assessment angle).
Forward reference: Articles 18 (SLO/SLI), 24 (lifecycle promotion gates), 25 (incident-driven regression tests), 40 (capstone evaluation plan).