AITB-LAG: LLM Risk & Governance Specialist — Body of Knowledge Article 5 of 6
In August 2023 the organizers of DEF CON 31 opened a dedicated room in the Las Vegas Convention Center, seated two thousand participants at terminals, and ran a three-day adversarial test against eight frontier language models under a protocol endorsed by the United States White House Office of Science and Technology Policy. Participants exercised a structured taxonomy of failure modes (confabulation, prompt injection, content-safety bypass, inappropriate tool use) and the organizers published the methodology and aggregate findings[1]. The Generative Red Team exercise produced three things at once: a public reference bar for what a structured red-team looks like, a corpus of attacks that subsequent teams have used as a starting point, and a demonstration that adversarial evaluation at scale is possible without catastrophic disclosure. The United Kingdom’s AI Safety Institute published its own evaluation methodology the following year, covering jailbreak resistance, misuse potential, and autonomy on frontier systems, with similar emphasis on structured technique taxonomies and reproducible reporting[2]. An organization that intends to govern its LLM features needs an evaluation harness of its own. This article defines what minimum form that harness takes, how to run a red-team exercise inside it, and how the harness connects to live monitoring in production.
Four modes, one harness
An evaluation harness runs four modes against the same feature, and each mode answers a question the other three cannot.
Capability. Does the feature do its advertised job? A customer-service assistant asked about return policy should return accurate return-policy information; a coding copilot asked to implement a function should produce code that compiles and passes its tests. Capability evaluation is the mode most teams are familiar with; it maps directly to NIST AI RMF MEASURE 2.5 (validity and reliability)[3].
Regression. When the feature changes (a new model version, an updated system prompt, a modified retrieval corpus, a revised guardrail ruleset) do previously-correct behaviors remain correct? Regression sets are the organization’s institutional memory of “we already fixed this; do not break it again.” Regression failures are the most common cause of launched features quietly degrading between releases.
Safety. Does the feature refuse or defuse the attacks it was designed to withstand? Safety evaluation exercises the battery of techniques surveyed in Article 2 plus any domain-specific patterns the feature’s operators have collected. Safety evaluation is distinct from capability because a model can be fully capable and still unsafe, and vice versa.
Human review. Does a qualified reviewer, reading a sample of conversations end to end, agree with the automated evaluators? Human review is the calibration layer for the other three. Automated evaluators drift, and only humans notice when their drift has started to accept content the organization would reject if it saw.
A harness that runs three of the four modes is not a harness. A harness that runs only one mode (usually capability) and claims to evaluate the feature’s safety is the origin of many confident but unsupported public statements about “extensive testing.”
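As a sketch, the all-four-modes requirement can be enforced in code: a harness that refuses to report results unless every mode has at least one registered check. The class and method names below are illustrative, not drawn from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    # One check registry per mode; a check returns True on pass.
    modes: dict[str, list[Callable[[], bool]]] = field(
        default_factory=lambda: {m: [] for m in
                                 ("capability", "regression", "safety", "human_review")})

    def register(self, mode: str, check: Callable[[], bool]) -> None:
        self.modes[mode].append(check)

    def run(self) -> dict[str, float]:
        # A harness that runs three of the four modes is not a harness:
        # refuse to report results while any mode is empty.
        missing = [m for m, checks in self.modes.items() if not checks]
        if missing:
            raise RuntimeError(f"not a harness: no checks registered for {missing}")
        return {m: sum(c() for c in checks) / len(checks)
                for m, checks in self.modes.items()}
```

The hard failure on an empty mode is the design point: partial coverage is surfaced as an error, not silently reported as a passing score.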
Cadence as a first-class variable
Evaluation is not an event. It is a cadence, and the cadence is chosen by the risk tier of the feature.
A minimum-viable cadence for a customer-facing feature with tool use includes seven phases.

Pre-deployment baseline. Run before the first release.

Staging red-team. The safety battery exercised against the feature in its final configuration.

Launch canary. A defined set of metrics watched during the first hours of exposure to real traffic.

Weekly regression pass. The fixed regression set rerun against current production.

Monthly safety sweep. An expanded adversarial battery.

Quarterly adversarial campaign. Participants pulled in from outside the team, executing a pre-agreed attack plan.

Annual external review. Conducted by a team the feature owners do not manage.
Each phase has a trigger, an artifact, and an owner. A missing owner is the most common failure mode. A weekly regression pass without an owner is a calendar invite that no one attends; an annual external review without an owner is a line item that gets cut in the budget cycle. Cadence that does not name accountability is not operational.
Lower-risk features scale the cadence down. An internal read-only research copilot used by a small employee population may not need a weekly regression pass; a monthly one suffices. Higher-risk features scale it up: a public-facing agentic system that can take financial action warrants a daily safety sweep during its first months in production.
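The trigger/artifact/owner discipline above can be made machine-checkable. The phase names, triggers, and owner labels in this sketch are placeholders; the point is the automated check that no phase is left unowned.

```python
# Illustrative cadence definition; all values are placeholders.
CADENCE = {
    "pre_deployment_baseline": {"trigger": "before first release",
                                "artifact": "baseline report", "owner": "feature-eng"},
    "staging_red_team":        {"trigger": "release candidate in staging",
                                "artifact": "red-team findings", "owner": "security"},
    "launch_canary":           {"trigger": "first hours of live traffic",
                                "artifact": "canary dashboard", "owner": "sre"},
    "weekly_regression":       {"trigger": "cron: weekly",
                                "artifact": "regression log", "owner": "feature-eng"},
    "monthly_safety_sweep":    {"trigger": "cron: monthly",
                                "artifact": "safety sweep report", "owner": "security"},
    "quarterly_campaign":      {"trigger": "quarterly planning",
                                "artifact": "campaign report", "owner": "red-team lead"},
    "annual_external_review":  {"trigger": "fiscal year start",
                                "artifact": "external report", "owner": "risk office"},
}

def unowned_phases(cadence: dict) -> list[str]:
    # Cadence that does not name accountability is not operational:
    # any phase without an owner is flagged before it becomes a dead calendar invite.
    return [name for name, phase in cadence.items() if not phase.get("owner")]
```

Run as a CI check, this turns "a missing owner is the most common failure mode" from an observation into a build failure.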
Running a structured red-team
A red-team for an LLM feature is a planned exercise, not a free-form hack. The DEF CON methodology is widely imitated because it is reproducible, and reproducibility requires structure.
Targets and scope. What feature version, in what environment, is under test? What tools are registered, what retrieval corpus is live, what system prompt is in effect? Ambiguity on any of these produces results that cannot be compared to the next exercise.
Technique taxonomy. The exercise draws from a named taxonomy (MITRE ATLAS, OWASP Top 10 for LLM Applications, the organization’s own accumulated red-team log) rather than from individual creativity. A taxonomy-driven red-team can report coverage (“we exercised 38 of 62 catalogued techniques across twelve objective areas”) in a way that a free-form exercise cannot[4][5].
Severity scale. Each finding is scored on a consistent severity scale tied to business impact. A finding that causes the assistant to produce an off-color joke is not the same severity as a finding that allows exfiltration of customer data. A shared scale across exercises lets the organization track trend over time.
Participants. A red-team that consists entirely of the feature’s developers will produce systematically biased results because the developers know the defenses. The participant pool should include at least one external party, even if external means “another team within the organization.” The DEF CON methodology scaled external participation to thousands; a smaller organization’s red-team may have five people but should still have at least one outside the feature team.
Reporting. Findings go into the evaluation harness, not into a slide deck. A finding in the harness becomes a regression test; a finding in a slide deck becomes a lesson learned and promptly forgotten.
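Two of the reporting practices above lend themselves to small utilities: a taxonomy-driven coverage line, and a converter that turns a red-team finding into a regression-test record so it lands in the harness rather than a slide deck. The field names here are hypothetical.

```python
def coverage_line(catalogue: set[str], exercised: set[str]) -> str:
    # Coverage is reported against the named taxonomy, not against
    # whatever the participants happened to try.
    covered = catalogue & exercised
    return f"exercised {len(covered)} of {len(catalogue)} catalogued techniques"

def finding_to_regression(finding: dict) -> dict:
    # A finding in the harness becomes a regression test; the severity
    # score travels with it so trend tracking stays on one shared scale.
    return {"id": f"regr-{finding['id']}",
            "prompt": finding["attack_prompt"],
            "expected": "refusal",
            "severity": finding["severity"]}
```

The conversion step is deliberately mechanical: if a finding cannot be expressed as a replayable prompt plus an expected behavior, it is not yet a finding the harness can protect.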
Monitoring production
Pre-deployment evaluation catches what the evaluators can imagine. Production monitoring catches what the users and adversaries can invent. Both are required.
Three classes of signal are the minimum for a production LLM feature. Quality signals include user feedback (thumbs up/down, escalation rates, session abandonment), automated quality scores on sampled conversations, and sampled manual review. Safety signals include blocked-content rates by category, escalation-queue depth and clearance time, and detected injection attempts. Operational signals include latency, cost per conversation, error rates, and tool-call volumes and failure rates.
The signals go into the same observability stack the organization already uses for other critical systems. A practitioner who proposes an LLM-specific monitoring silo is solving a political problem, not a technical one. Vendors in this space (Arize, Langfuse, Weights & Biases, MLflow, Humanloop, WhyLabs) offer hosted or self-hosted options; what matters is that the signals are instrumented, retained with appropriate PII handling, and routed to alerting and dashboards that the operations team will actually look at[6].
Alert thresholds are set from baseline, not from aspiration. A feature whose block rate is normally two percent and suddenly jumps to twelve percent is either under attack or broken; the alert needs to fire before either consumes a full day.
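The baseline-relative threshold can be as simple as a multiplier over the observed norm; the three-times multiplier below is an assumed default for illustration, not a recommendation.

```python
def should_alert(baseline_rate: float, observed_rate: float,
                 multiplier: float = 3.0) -> bool:
    # Thresholds come from the observed baseline, not an aspirational number:
    # alert when the rate exceeds a multiple of what is normal for this feature.
    return observed_rate > baseline_rate * multiplier

# A block rate that is normally 2% and jumps to 12% fires the alert.
should_alert(0.02, 0.12)  # -> True
```

Because the threshold is derived from the feature's own baseline, the same rule works for a feature whose normal block rate is two percent and one whose normal rate is ten.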
Keeping the suite current
Threat landscapes evolve. A safety battery built in 2024 that is still the same battery in 2026 is almost certainly obsolete. Three mechanisms keep the suite current.
New-incident ingestion. Every reviewed incident, internal or public, generates at least one new regression test. The AI Incident Database, maintained by the Responsible AI Collaborative, is a useful public source for incidents outside the organization; the organization’s own review queue is the most valuable source for incidents inside it[7].
Technique-taxonomy refresh. OWASP Top 10 for LLM Applications and MITRE ATLAS both release periodic updates. The suite is reviewed against the latest taxonomy annually, and new techniques are added with priority proportional to the exposure in the feature’s surface (Article 1).
External-review findings. The annual external review exists partly to contribute findings the internal team would not have produced. Every external finding becomes a regression test.
The suite grows. Suite growth has a cost: more compute, more review time, more false positives to triage. A suite that never retires tests and never consolidates overlapping ones will collapse under its own weight. Periodic suite grooming is itself a governance activity, documented and auditable.
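Suite grooming can itself be partially automated. This sketch retires a test only when it is both stale (has not failed in a long time) and redundant (another retained test covers the same technique); the field names and the two-year staleness window are assumptions for illustration.

```python
from datetime import date

def groom(suite: list[dict], today: date,
          stale_after_days: int = 730) -> list[dict]:
    # Walk tests most-recently-failed first, so the freshest test for each
    # technique is the one retained; retire a test only when it is BOTH
    # stale and a duplicate of already-retained technique coverage.
    seen_techniques: set[str] = set()
    kept: list[dict] = []
    for test in sorted(suite, key=lambda t: t["last_failed"], reverse=True):
        stale = (today - test["last_failed"]).days > stale_after_days
        duplicate = test["technique"] in seen_techniques
        if stale and duplicate:
            continue  # retire: its technique is covered by a fresher test
        seen_techniques.add(test["technique"])
        kept.append(test)
    return kept
```

Requiring both conditions keeps the grooming conservative: a stale test whose technique nothing else covers survives, and a duplicate that still fails regularly survives.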
Summary
An evaluation harness that runs capability, regression, safety, and human-review modes on a defined cadence, supplemented by structured red-team exercises and production monitoring, is the minimum-viable evaluation posture for an LLM feature the organization intends to govern. The DEF CON Generative Red Team and the UK AISI evaluations are the public benchmarks against which a team’s own practice can be calibrated. Cadence without owners, taxonomy without coverage reporting, red-team findings without regression tests, and monitoring without baselined alerts each collapse the system into theater. The practitioner’s obligation is to make each part operational and to keep it current.
Further reading in the Core Stream: AI Risk Assessment and Mitigation, Designing Measurement Frameworks for Agentic AI Systems, and Audit and Assurance for Enterprise AI.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Generative Red Team Challenge at DEF CON 31, post-event report. Humane Intelligence and AI Village, 2023. https://www.humane-intelligence.org/reports — accessed 2026-04-19.
2. AI Safety Institute Approach to Evaluations, UK AI Safety Institute, 2024. https://www.aisi.gov.uk/work — accessed 2026-04-19.
3. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, January 2023, MEASURE 2.5. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.
4. MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems). MITRE Corporation. https://atlas.mitre.org/ — accessed 2026-04-19.
5. OWASP Top 10 for Large Language Model Applications, 2025. OWASP Foundation. https://genai.owasp.org/llm-top-10/ — accessed 2026-04-19.
6. Vendor documentation surveyed as public-source references only: Arize AI, Langfuse, Weights & Biases, MLflow, Humanloop, WhyLabs. Accessed 2026-04-19.
7. AI Incident Database. Responsible AI Collaborative. https://incidentdatabase.ai/ — accessed 2026-04-19.