
Jailbreak

A user-crafted prompt pattern that bypasses a model's safety training to elicit restricted behavior.

What this means in practice

Jailbreaks are treated as a subtype of prompt injection, but they follow a different mitigation lifecycle: defenses rest on safety fine-tuning, policy filters, and refusal patterns rather than on input sanitisation alone.
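
A minimal sketch of how those defense layers can compose at inference time, assuming a hypothetical `generate` model call and placeholder `input_filter` / `policy_filter` checks; all names here are illustrative, not a real library API. Production systems would replace the regex screen and the string check with trained classifiers.

```python
import re

# Input filter: crude pattern screen for common jailbreak framings.
# Real deployments use trained classifiers, not keyword lists.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"pretend you (are|have) no (rules|restrictions)", re.IGNORECASE),
]

def input_filter(prompt: str) -> bool:
    """Return True if the prompt matches a known jailbreak pattern."""
    return any(p.search(prompt) for p in JAILBREAK_PATTERNS)

def policy_filter(completion: str) -> bool:
    """Stand-in for a content safety classifier run on model output."""
    # In practice this is a separate model scoring policy categories.
    return "RESTRICTED" in completion  # illustrative placeholder

def respond(prompt: str, generate) -> str:
    """Layered defense: input filter, safety-tuned generation, output check."""
    if input_filter(prompt):
        return "I can't help with that."   # refusal pattern
    completion = generate(prompt)          # safety-fine-tuned model
    if policy_filter(completion):
        return "I can't help with that."   # output-side backstop
    return completion
```

The layering matters: even if a novel jailbreak slips past the input screen, the output-side check gives the system a second chance to refuse, which is why jailbreak mitigation is not reducible to input sanitisation.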

Synonyms

jailbreaking, safety bypass

See also

  • Indirect prompt injection — Prompt injection delivered through content the model retrieves or ingests — emails, documents, web pages, or tool outputs — rather than through a direct user message.
  • Guardrail — A control placed between the user or environment and an LLM that blocks, rewrites, or classifies content at one of four architectural layers: input filter, policy filter, output filter, or tool-call validator.
  • Content safety classifier — A model or rule system that detects policy-violating output categories — violence, self-harm, CSAM, targeted harassment, dangerous instructions, and similar.

Related articles in the Body of Knowledge