Content safety classifier

A model or rule-based system that detects policy-violating output categories such as violence, self-harm, CSAM, targeted harassment, and dangerous instructions.

What this means in practice

A content safety classifier forms the output layer of a guardrail architecture. The term is technology-neutral: implementations range from managed moderation APIs to open-weight classifiers to rule engines.
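
The sketch below shows the smallest version of the idea: a rule-based classifier that maps model output to policy categories. Every name in it (SafetyVerdict, RULES, classify_output) is hypothetical, and a production system would replace the keyword rules with a trained model or a managed moderation API.

```python
# Minimal rule-based content safety classifier (illustrative only).
import re
from dataclasses import dataclass, field


@dataclass
class SafetyVerdict:
    flagged: bool
    categories: list[str] = field(default_factory=list)


# Hypothetical category -> pattern rules; real rule engines are far richer.
RULES: dict[str, re.Pattern] = {
    "violence": re.compile(r"\b(kill|maim|build a bomb)\b", re.IGNORECASE),
    "self-harm": re.compile(r"\b(hurt myself|end my life)\b", re.IGNORECASE),
}


def classify_output(text: str) -> SafetyVerdict:
    """Return which policy categories, if any, the text violates."""
    hits = [cat for cat, pattern in RULES.items() if pattern.search(text)]
    return SafetyVerdict(flagged=bool(hits), categories=hits)


verdict = classify_output("step one: build a bomb from ...")
print(verdict)  # SafetyVerdict(flagged=True, categories=['violence'])
```

The same classify_output signature works whether the body is a regex table, an open-weight classifier, or a call to a managed API, which is what makes the term technology-neutral.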

Synonyms

safety classifier, policy classifier, moderation classifier

See also

  • Guardrail — A control placed between the user or environment and an LLM that blocks, rewrites, or classifies content at one of four architectural layers: input filter, policy filter, output filter, or tool-call validator (see the pipeline sketch after this list).
  • Jailbreak — A user-crafted prompt pattern that bypasses a model's safety training to elicit restricted behavior.
  • Evaluation harness — The infrastructure that runs capability, regression, safety, and human-review evaluations on an LLM feature on a defined cadence.
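
To make the output-layer placement concrete, here is a hedged sketch of a guardrail pipeline that reuses classify_output from the sketch above. The other names (input_filter, call_model, REFUSAL) are hypothetical stand-ins, not a real API; the policy-filter and tool-call-validator layers are elided.

```python
# Where the content safety classifier sits in a guardrail pipeline
# (assumes classify_output from the sketch above; all other names are
# hypothetical stand-ins).
REFUSAL = "Sorry, I can't help with that."


def input_filter(user_message: str) -> bool:
    """Layer 1 (input filter): reject known jailbreak phrasing up front."""
    return "ignore previous instructions" in user_message.lower()


def call_model(user_message: str) -> str:
    """Stand-in for the actual LLM call."""
    return f"Model reply to: {user_message}"


def guarded_reply(user_message: str) -> str:
    if input_filter(user_message):          # layer 1: input filter
        return REFUSAL
    reply = call_model(user_message)
    if classify_output(reply).flagged:      # layer 3: output filter
        return REFUSAL
    return reply


print(guarded_reply("What is a content safety classifier?"))
```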