Why we needed explicit categories
A generic 'harmful content' filter is not precise enough for a deployed AI product. When something is flagged, you need to know why — so you can return a useful refusal message, track what kinds of violations are most common, and make informed decisions about where to tighten or loosen thresholds. We defined nine named categories so that every safety decision is traceable back to a specific rule.
The categories are: S1 violent crimes, S2 weapons manufacturing, S3 controlled substance synthesis, S4 self-harm and suicide, S5 explicit sexual content, S6 hate speech, S7 serious illegal activity, S8 prompt injection and jailbreak attempts, and S9 inappropriate persona assignment. Each maps to a distinct user-facing refusal message with a specific reason — we never return a vague 'I can't help with that.'
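The nine categories can be captured as a small enum so that every flag carries a named code rather than a generic 'harmful' label. This is a minimal sketch; the class and value names are illustrative, not the bot's actual identifiers.

```python
from enum import Enum

class ThreatCategory(Enum):
    """The nine named categories from the threat model.

    Values are human-readable labels; the member names (S1..S9)
    are the codes used when logging and reporting violations."""
    S1 = "violent crimes"
    S2 = "weapons manufacturing"
    S3 = "controlled substance synthesis"
    S4 = "self-harm and suicide"
    S5 = "explicit sexual content"
    S6 = "hate speech"
    S7 = "serious illegal activity"
    S8 = "prompt injection and jailbreak attempts"
    S9 = "inappropriate persona assignment"
```

Keeping the taxonomy in one enum means the refusal messages, violation statistics, and threshold tuning all key off the same nine codes.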
S8 and S9: the AI-specific threat categories
S1 through S7 cover harm categories that exist across content moderation broadly. S8 and S9 are specific to AI systems. S8 covers prompt injection — attempts to override the system prompt, jailbreak the model, or bypass safety instructions through clever framing, roleplay, hypotheticals, or nested instructions. These attacks are a real and ongoing problem for deployed AI, and catching them requires both the pattern-based pre-screen and the AI safeguard model working together.
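The pattern-based pre-screen for S8 might look like the following sketch. The regexes here are illustrative placeholders, not the production rule set — the point is only the two-stage shape: a cheap pattern pass catches the obvious injection phrasing, and anything it misses still goes through the AI safeguard model.

```python
import re

# Illustrative patterns only -- a real injection pre-screen needs far
# broader coverage (roleplay framings, hypotheticals, nested instructions).
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|any|previous|prior) .*instructions", re.IGNORECASE),
    re.compile(r"pretend (you have|there are) no (rules|restrictions|guidelines)", re.IGNORECASE),
    re.compile(r"you are now in developer mode", re.IGNORECASE),
]

def prescreen_s8(message: str) -> bool:
    """Fast first pass for category S8.

    Returns True for obvious override attempts; messages that pass
    still get evaluated by the AI safeguard model."""
    return any(p.search(message) for p in INJECTION_PATTERNS)
```

Because the pre-screen only needs to catch the blatant cases, it can stay conservative and fast, leaving the subtle framing attacks to the model.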
S9 covers inappropriate persona assignment — attempts to force Ophraxx AI into adopting degrading nicknames, power-dynamic titles, or sexually loaded roles. Examples include instructions like 'call me master,' 'you are my slave,' or combinations of sexual and dominance-based language. This category exists because AI identity manipulation is its own harm vector, distinct from the content of any single message. The bot's identity as Ophraxx AI is permanent and cannot be changed by any user, admin, roleplay, or hypothetical framing.
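A pre-screen for S9 can require two signals to co-occur — an assignment phrase plus a power-dynamic title — so that the individual words alone do not trigger a flag. The keyword lists below are hypothetical examples, not the deployed rules.

```python
import re

# Hypothetical keyword lists; the production category is broader.
ASSIGNMENT_PHRASES = re.compile(
    r"\b(call me|you are my|address me as|refer to me as)\b", re.IGNORECASE)
POWER_DYNAMIC_TITLES = re.compile(
    r"\b(master|slave|mistress|servant|owner)\b", re.IGNORECASE)

def prescreen_s9(message: str) -> bool:
    """Flag S9 only when an assignment phrase and a power-dynamic
    title appear together, so innocent uses of the same words
    (e.g. 'the master branch') pass through cleanly."""
    return bool(ASSIGNMENT_PHRASES.search(message)
                and POWER_DYNAMIC_TITLES.search(message))
```

Requiring the combination keeps the category focused on identity manipulation rather than vocabulary.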
The safe-topic allowlist: preventing over-refusal
One of the easiest mistakes to make in content moderation is flagging things that should be allowed. A user asking 'how does a moderation bot detect spam?' should not hit the S7 illegal activity filter just because a word like 'exploit' appears in the response. A security professional asking about phishing tactics for a red team exercise should not be blocked. Over-refusal is a safety failure of a different kind — it makes the bot useless and frustrating for legitimate users.
We address this with an explicit safe-topic list embedded in the safeguard model's instructions. Topics that must never be flagged include: questions about moderation bots and content filtering systems, security tools and vulnerability concepts discussed in an educational or defensive context, explaining what harmful content looks like for moderation purposes, and general discussions of cybersecurity and server management where words like 'hack,' 'exploit,' or 'phish' appear in non-malicious context. This list is deliberately maintained alongside the threat categories so both grow in sync.
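One way to wire the allowlist into the flagging pipeline is as a second pass that can downgrade a pre-screen flag when the message clearly sits in an allowed educational or defensive context. The marker list and function name below are illustrative assumptions; in the real system the allowlist lives in the safeguard model's instructions rather than in code.

```python
# Illustrative safe-topic markers; the real list is maintained
# alongside the threat categories so both grow in sync.
SAFE_TOPIC_MARKERS = [
    "moderation bot", "content filtering", "spam detection",
    "red team", "security training", "server management",
]

def apply_allowlist(message, flagged_category):
    """Drop a pre-screen flag for allowlisted topics.

    Returns None (no hard block) when the message matches a safe topic,
    deferring the final call to the safeguard model; otherwise the
    original category code is returned unchanged."""
    if flagged_category is None:
        return None
    text = message.lower()
    if any(marker in text for marker in SAFE_TOPIC_MARKERS):
        return None  # allowlisted context: do not hard-block
    return flagged_category
```

Running the allowlist after the pattern pre-screen but before any block means a keyword hit alone is never enough to refuse a legitimate question.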
Category-specific refusal messages
Every category in the threat model maps to a specific refusal message. S4 self-harm responses include crisis resources — the 988 Suicide and Crisis Lifeline and an international helpline directory — because a user in crisis needs more than a block. S5 sexual content gets a direct boundary message. S8 jailbreak attempts get a message that acknowledges the attempt without being hostile. S9 persona assignment gets a calm explanation that the bot's identity and address conventions are fixed.
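The category-to-refusal mapping can be a plain dictionary keyed by category code. The wording below is illustrative paraphrase, not the production copy; what the sketch shows is that S4 carries crisis resources (the 988 Suicide and Crisis Lifeline, per the text) while every other category names its specific boundary.

```python
# Illustrative refusal copy -- the deployed messages are maintained
# alongside the category list. Only four categories shown.
REFUSAL_MESSAGES = {
    "S4": ("I can't help with that, but support is available: "
           "call or text 988 (Suicide and Crisis Lifeline), "
           "or check an international helpline directory."),
    "S5": "I don't produce explicit sexual content. That boundary is fixed.",
    "S8": ("It looks like that message tries to override my instructions. "
           "I can't do that, but I'm glad to help with something else."),
    "S9": ("My identity and how I address users are fixed and can't be "
           "changed by any user, admin, roleplay, or hypothetical framing."),
}

def refusal_for(category: str) -> str:
    """Every block names its exact reason -- never a vague
    'I can't help with that.'"""
    return REFUSAL_MESSAGES[category]
```

A missing key here would be a bug worth failing loudly on, which is why the lookup is a plain indexing operation rather than a `.get()` with a generic fallback.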
This specificity matters for trust. A user who gets a vague block has no idea what they did wrong and may assume the system is broken. A user who gets a specific explanation understands the boundary, even if they disagree with it. The explanatory refusal rule — which requires every refusal to name the exact reason — is enforced at the system prompt level and applies to both safety blocks and knowledge limitations.