Red-Team Experimentation for Safety

FlowRidge

AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 11 of 14

A capability experiment asks “does the system produce the outputs we want?”. A red-team experiment inverts the question: “what inputs make the system fail?”. The inversion changes almost everything about how the experiment is designed. The hypothesis is adversarial. The input distribution is handcrafted. The metric is the rate and severity of failure, not the rate of success. And the discipline required to produce credible red-team evidence is different from the discipline required to produce credible capability evidence. This article teaches the adversarial experimentation vocabulary and the integration of red-team findings into the same governance substrate as capability results.

The technique taxonomy

Red-team technique catalogs exist, and a practitioner need not invent one. Two are the references: MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) and the OWASP Top 10 for LLM Applications¹². The taxonomies organize adversarial techniques by what they target, what they achieve, and what class of system they apply to.

For LLM and agentic systems specifically, the red-team taxonomy covers at least the following technique classes.

Direct prompt injection. An attacker-authored prompt overrides the system’s intended instructions.
Indirect prompt injection. Attacker-authored content arrives through a trusted channel (retrieval, tool output, email summary) and carries instructions the model subsequently follows.
Jailbreak. A prompt elicits behavior the model is trained to refuse — harmful content, unsafe instructions, system-prompt disclosure.
Encoded attacks. Malicious instructions encoded in formats (base64, leetspeak, rot13, emoji, image text) designed to bypass input filters.
Persona attacks. “Pretend to be X” framings that elicit persona-gated unsafe behavior.
Tool-output injection. A tool the model invoked returns content with embedded instructions the model acts on.
Retrieval-corpus poisoning. Content placed into a vector store, document index, or web corpus the model retrieves from, carrying instructions or disinformation.
Training data poisoning. Poisoned data mixed into fine-tuning or RAG corpora; a training-time attack rather than an inference-time one.
Model extraction and inversion. Queries designed to extract training data or to reverse-engineer model behavior.
Denial-of-service and resource exhaustion. Inputs that produce extremely long outputs, recursive tool calls, or other resource-exhausting patterns.

An organization does not need to probe every class, but a red-team program that claims to be complete must be able to account for why it has or has not probed each.

Designing a red-team experiment

A red-team experiment is an experiment. It has a hypothesis, a design, a metric, and a report.

Hypothesis. Adversarial: “technique class X, applied to feature Y, will cause behavior Z at rate above threshold W”. The hypothesis is specific. “Test for jailbreaks” is not a hypothesis.

Design. The technique class, the concrete prompt or content pattern, the evaluation signal (what observable behavior constitutes a failure), the severity threshold (a refusal of one benign query is different from generation of a specific harmful output), and the sample size.

Metric. The failure rate on the designed adversarial input distribution, optionally weighted by severity. The metric is not “number of jailbreaks found”; it is “fraction of designed attacks that succeeded, stratified by severity”.

Report. The hypothesis, the design, the findings (with example prompts and outputs, redacted where necessary), the severity classification, and the mitigation proposal. The report is structured so that a remediation engineer can reproduce each finding and so that a governance reviewer can judge severity.

The adversarial framing changes the data governance of the experiment. Red-team prompts and outputs may contain disallowed content. Storage of red-team artifacts must be access-controlled. Reproduction in CI requires sanitized versions. The NIST AI 600-1 GenAI Profile explicitly treats red-team data as a governance-sensitive artifact class³.

Severity classification

A red-team finding is not useful until it is severity-classified. The mitigation response (urgent fix, scheduled fix, risk-accept) depends on severity. A consistent classification across findings is what makes the program tractable.

A practical severity model has four levels.

Critical. Exploitable at scale, leaking sensitive data, producing illegal content, or enabling a tool-use action with real-world harm. Immediate fix required; deployment halt if not already in place.
High. Consistently exploitable but with limited direct harm; or exploitable conditionally but with high-impact consequences. Fix within days.
Medium. Intermittently exploitable; or consistently exploitable but with lower-impact consequences. Fix within weeks.
Low. Theoretical or requires extreme contrivance; or affects a narrow edge case. Tracked, scheduled for the next maintenance cycle.

Severity is classified by the finder plus an independent reviewer. Classification consistency is audited; a program where every finding is “high” is a program where severity has lost its meaning.

[DIAGRAM: TimelineDiagram — aitm-eci-article-11-red-team-campaign — A horizontal timeline: scope -> recruit -> execute -> classify -> mitigate -> retest -> close, with estimated durations and named owners per step.]

Executing at scale

Small red-team exercises can be run by a single practitioner over a week. Larger exercises involve multiple red-teamers, structured scoping, and reporting infrastructure.

Three execution patterns are worth the practitioner’s knowledge.

Internal red team. A dedicated internal team runs campaigns against features before launch and on a recurring cadence post-launch. Works well when the organization has sufficient scale to staff a team, and when the team has the authority to halt launches.

Crowdsourced red team. Outside participants are invited to probe the system, typically under a bug-bounty-style compensation model. The DEF CON 31 Generative Red Team event in August 2023 brought 2,000+ participants to probe commercial LLMs; the event was White-House-endorsed and produced a public report⁴. The crowdsourced model covers technique space no internal team could replicate.

Specialist vendor. External security firms with AI-specific red-team practices run campaigns on retainer. Works well for organizations with specialized requirements or regulated contexts that need third-party attestation.

Most mature programs use all three. Internal team for continuous coverage, crowdsourced events for broad novelty, and specialist vendor for pre-launch or post-incident assurance.

Reproducibility of red-team findings

Red-team findings that cannot be reproduced are not findings. The reproducibility requirements are the same as for capability experiments (Article 6) with two adaptations.

Capture determinism-breaking context. LLM outputs can be non-deterministic at default temperatures. A red-team finding that only fires at temperature 0.7 with a specific seed must record those. Many vendors expose seed parameters; all record the effective temperature. Where determinism is limited, the finding is recorded with an empirical success rate across N runs rather than as a single exploit.

Track the model version. A jailbreak that worked on model-version-A may be closed in model-version-B. The finding records the model version, and the regression suite (Article 10) reruns it against new versions automatically.

Integration with the CI pipeline

A mature program treats safety-regression testing as a first-class gate in the CI pipeline (Article 8). Every merge that touches the model, the prompts, the retrieval corpus, or the tool set triggers the safety-regression suite. A new failure is a CI failure.

This integration has a practical implication. Red-team findings do not stay in a document. They become tests. The test corpus grows over time. The coverage of known adversarial territory becomes measurable by the rate at which new campaigns produce tests not already in the suite; a mature program sees that rate decline over time.

Tooling for safety-regression integration is available across open-source and commercial ecosystems. Garak (an open-source LLM probe framework), promptfoo, DeepEval, PyRIT (Microsoft’s Python Risk Identification Toolkit), and several commercial tools implement the pattern⁵⁶. Vendor-neutrally, the pattern is: a YAML or JSON test definition, a harness that executes it against the target model, and a pass/fail that feeds CI.

[DIAGRAM: ScoreboardDiagram — aitm-eci-article-11-red-team-dashboard — A dashboard-style table with columns for finding ID, technique class, severity, status (open/mitigated/retest), owner, and date-discovered.]

Mitigation and residual risk

A red-team finding produces a mitigation. Mitigation options span.

Input filter. A classifier or heuristic rejects the adversarial input before the model sees it.
Prompt hardening. The system prompt is changed to resist the technique.
Output filter. A classifier rejects or rewrites the adversarial output before it leaves the system.
Tool-call validation. The tool layer enforces constraints the model might otherwise be manipulated into violating.
Scope reduction. The feature’s capability is narrowed to eliminate the attack surface.

Every mitigation has a cost. A tighter input filter produces more false positives. A harder system prompt reduces flexibility. The practitioner records the mitigation and its trade-off, and the governance review weighs the residual.

Residual risk is the failure rate after mitigation is in place. Some adversarial techniques can be reduced but not eliminated; the residual is the number, and it is either acceptable (below the program’s risk tolerance) or not (escalates to product ownership for a scope decision). The sibling AITB-LAG credential develops the guardrail architecture that mitigations implement.

Two real references in the red-team vocabulary

DEF CON 31 Generative Red Team. The August 2023 event brought thousands of participants to probe commercial LLMs. The published report documented both findings and methodological lessons about how to run crowdsourced red-team exercises at scale⁴. The event is a reference for organizations planning their own crowdsourced programs.

MITRE ATLAS. MITRE’s ATLAS catalog is the reference threat taxonomy for AI systems, organizing techniques by tactic (what the attacker is trying to do) and by procedure (how they do it), analogous to the MITRE ATT&CK framework for enterprise security¹. Practitioners use ATLAS both to structure their own red-team campaigns and to communicate findings in a standardized vocabulary.

Summary

Red-team experimentation is adversarial experimentation. The technique taxonomy (MITRE ATLAS, OWASP LLM Top 10) is the reference. A red-team experiment has a hypothesis, design, metric, and report, with adapted data-governance and reproducibility requirements. Severity classification (critical, high, medium, low) structures the mitigation response. Execution patterns include internal, crowdsourced, and specialist-vendor models. Reproducibility requires recording determinism-breaking context and model version. Integration with CI turns findings into permanent regression tests. Mitigation options have costs; residual risk is measured and escalated. DEF CON 31 and MITRE ATLAS are reference anchors. The next article covers the compute budget and cost discipline that makes all of the above sustainable.

Further reading in the Core Stream: Evaluating Agentic AI: Goal Achievement and Behavioral Assessment.

MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems). MITRE Corporation. https://atlas.mitre.org/ — accessed 2026-04-19. ↩ ↩²
OWASP Top 10 for Large Language Model Applications, 2025. OWASP Foundation. https://genai.owasp.org/llm-top-10/ — accessed 2026-04-19. ↩
Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 2024. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf — accessed 2026-04-19. ↩
DEF CON 31 Generative Red Team Challenge — Humane Intelligence report. https://www.humane-intelligence.org/reports — accessed 2026-04-19. ↩ ↩²
Garak LLM probe framework. https://github.com/leondz/garak — accessed 2026-04-19. ↩
PyRIT (Python Risk Identification Toolkit). Microsoft. https://github.com/Azure/PyRIT — accessed 2026-04-19. ↩