Why safety needed to be layered
From the start, we knew a single content filter was not enough. Pattern-based filters catch known bad phrases but miss context. AI-based classifiers catch context but can be slow or expensive if applied naively. Fact-checking is an entirely separate problem from moderation. We built the Ophraxx safety stack to handle all three concerns independently, so each layer does exactly one job well.
The result is a pipeline: every user message passes through pattern-based moderation first, then a dedicated LLM safeguard model, and every AI output goes through a fact-checking step and a final output validation pass before it ever reaches the user.
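The shape of that pipeline can be sketched in a few lines. Everything here is illustrative: the function names (`pattern_moderate`, `safeguard_classify`, `fact_check`, `validate_output`) and the stub bodies are assumptions standing in for the real internal layers, included only to show the control flow.

```python
class ModerationError(Exception):
    pass

def pattern_moderate(text: str) -> str:
    # Stub: the real filter matches known bad phrases per category.
    return "S6" if "slur" in text.lower() else "safe"

def safeguard_classify(text: str) -> str:
    # Stub: the real safeguard is a dedicated LLM returning "safe" or S1-S9.
    return "safe"

def fact_check(text: str) -> str:
    # Stub: the real checker returns APPROVED / UNCERTAIN / INCORRECT.
    return "APPROVED"

def validate_output(text: str) -> str:
    # Stub: the real pass redacts PII before delivery.
    return text

def handle_message(user_message: str, generate) -> str:
    verdict = pattern_moderate(user_message)    # 1. fast pattern screen
    if verdict != "safe":
        raise ModerationError(f"pattern filter: {verdict}")
    verdict = safeguard_classify(user_message)  # 2. LLM safeguard
    if verdict != "safe":
        raise ModerationError(f"safeguard: {verdict}")
    response = generate(user_message)           # 3. main model
    if fact_check(response) == "INCORRECT":     # 4. fact-check gate
        raise ModerationError("failed fact-check")
    return validate_output(response)            # 5. final output validation
```

The point of the structure is that each layer can veto independently: a message never reaches the main model if either inbound check fires, and a response never reaches the user if either outbound check fires.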
The nine-category threat model (S1–S9)
We defined nine threat categories that the safeguard layer screens against, S1 through S9:

- S1: violent crimes
- S2: weapons manufacturing
- S3: controlled substance synthesis
- S4: self-harm and suicide
- S5: explicit sexual content
- S6: hate speech
- S7: serious illegal activity
- S8: prompt injection and jailbreak attempts
- S9: inappropriate persona assignment, i.e. attempts to force the AI into degrading or power-dynamic roles
The safeguard runs on a dedicated AI model hosted within our infrastructure. It receives the full content with category definitions and returns a single classification: safe, or the specific category that was triggered. This gives us precise, actionable signals rather than a vague flagged or unflagged binary.
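A single-label reply like this needs defensive parsing, since an LLM classifier can wrap its answer in extra tokens. The category map below restates the codes from the text; the reply format and the fail-closed fallback are assumptions for illustration.

```python
# S1-S9 codes as defined above; descriptions abbreviated.
CATEGORIES = {
    "S1": "violent crimes",
    "S2": "weapons manufacturing",
    "S3": "controlled substance synthesis",
    "S4": "self-harm and suicide",
    "S5": "explicit sexual content",
    "S6": "hate speech",
    "S7": "serious illegal activity",
    "S8": "prompt injection / jailbreak",
    "S9": "inappropriate persona assignment",
}

def parse_safeguard_reply(reply: str) -> str:
    """Normalize the safeguard model's reply to 'safe' or a category code."""
    token = reply.strip().split()[0].upper().rstrip(".,:")
    if token == "SAFE":
        return "safe"
    if token in CATEGORIES:
        return token
    # Unrecognized output: fail closed (treat as a jailbreak signal)
    # rather than letting an ambiguous reply through. A design choice,
    # not necessarily what Ophraxx does.
    return "S8"
```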
We also explicitly list topics that must never be flagged as unsafe — questions about moderation bots, security research, educational discussions of cybersecurity — because over-refusal is its own safety failure. A bot that refuses legitimate questions destroys trust just as much as one that allows harmful ones.
Pattern-based pre-screening
Before the AI safeguard runs, every message passes through a fast pattern-based moderation layer covering the same threat categories. This catches obvious violations immediately without spending a model call. The categories include sexual content, direct violence instructions, hate speech terms, self-harm expressions, and illegal activity keywords.
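A pre-screen like this is essentially a category-to-regex table. The patterns below are simplified placeholders (the real keyword lists are larger and maintained per category), but they show the first-match-wins shape.

```python
import re

# Illustrative pattern table; real lists are far more extensive.
PATTERNS = {
    "S4": re.compile(r"\b(kill myself|end my life)\b", re.IGNORECASE),
    "S6": re.compile(r"\b(example_slur)\b", re.IGNORECASE),
}

def pattern_moderate(message: str) -> str:
    """Return the first matching category code, or 'safe'."""
    for category, pattern in PATTERNS.items():
        if pattern.search(message):
            return category
    return "safe"
```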
When a pattern match fires, the violation is logged in the user's moderation history. We keep the last ten moderation events per user so that repeated patterns can inform escalation — a user who consistently hits the hate speech filter is treated differently from one with a single ambiguous keyword hit.
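The last-ten-events window maps naturally onto a bounded queue per user. The escalation threshold of three repeats is an illustrative assumption, not a documented Ophraxx value.

```python
from collections import defaultdict, deque

HISTORY_LIMIT = 10  # keep the last ten moderation events per user
_history = defaultdict(lambda: deque(maxlen=HISTORY_LIMIT))

def record_violation(user_id: str, category: str) -> bool:
    """Log a moderation event; return True when escalation is warranted."""
    events = _history[user_id]
    events.append(category)
    # Repeated hits in the same category within the window signal a
    # pattern, not a one-off ambiguous keyword match.
    return events.count(category) >= 3
```

Because the deque evicts the oldest event automatically, old violations age out on their own as a user accumulates clean interactions.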
Automated fact-checking pipeline
One of the less visible safety layers is the automated fact-checker. After the main model generates a response, a secondary model reviews it for factual accuracy before it is sent. The checker looks specifically at verifiable claims: dates, statistics, scientific facts, historical events, and technical specifics.
The checker returns one of three verdicts: APPROVED if all claims appear accurate, UNCERTAIN with a brief reason if confidence is low, or INCORRECT with a specific reason if a clear factual error is detected. INCORRECT responses are not sent — the pipeline returns an error instead. UNCERTAIN responses are logged as warnings but allowed through, since the bar for blocking a response is deliberately high.
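The three-verdict gate reduces to a small dispatch. The verdict strings come from the text above; the `FactCheckError` type and logger wiring are assumptions.

```python
import logging

logger = logging.getLogger("factcheck")

class FactCheckError(Exception):
    pass

def apply_fact_check(response: str, verdict: str, reason: str = "") -> str:
    if verdict == "APPROVED":
        return response
    if verdict == "UNCERTAIN":
        # Low confidence alone does not block; log and allow through.
        logger.warning("fact-check uncertain: %s", reason)
        return response
    if verdict == "INCORRECT":
        # Clear factual errors are never sent to the user.
        raise FactCheckError(reason or "factual error detected")
    raise ValueError(f"unknown verdict: {verdict}")
```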
Short responses, opinion-based answers, and emotional support replies skip fact-checking entirely. The system adds rigor where it matters most without slowing down everyday conversational exchanges.
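One way to implement that skip rule is a cheap heuristic run before invoking the checker. The word-count threshold and marker phrases below are assumptions, not Ophraxx's actual rules.

```python
OPINION_MARKERS = ("i think", "in my opinion", "i'd say")
SUPPORT_MARKERS = ("i'm sorry you're", "that sounds really hard")

def needs_fact_check(response: str) -> bool:
    """Decide whether a response warrants the fact-checking model call."""
    text = response.lower()
    if len(text.split()) < 20:                   # short replies skip
        return False
    if any(m in text for m in OPINION_MARKERS):  # opinion-based answers skip
        return False
    if any(m in text for m in SUPPORT_MARKERS):  # emotional support skips
        return False
    return True
```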
PII redaction and the explanatory refusal rule
The output sanitization step runs PII detection to redact personal identifiers — phone numbers, email addresses, and similar data — from AI responses before they are sent. This guards against the model inadvertently reproducing personal data from the conversation context.
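A minimal redaction pass in that spirit is just substitution over identifier patterns. The two regexes below (email, phone) are simplified placeholders; a production detector covers many more identifier types and formats.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace detected personal identifiers with redaction markers."""
    text = EMAIL.sub("[email redacted]", text)
    text = PHONE.sub("[phone redacted]", text)
    return text
```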
Separately, we enforced an explanatory refusal rule at the system prompt level. The bot is never allowed to say 'I can't help with that' without a specific reason. Every refusal must name the exact reason and, where possible, offer an alternative. Vague refusals are unhelpful and erode trust — users who hit a wall with no explanation tend to assume the system is broken, not safe.
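A rule like this can also be enforced mechanically at the output-validation stage, by flagging refusals that carry no reason. The phrase lists below are illustrative assumptions; the real rule lives in the system prompt rather than in code.

```python
REFUSAL_PHRASES = ("i can't help with that", "i cannot help with that")
REASON_MARKERS = ("because", "since", "instead")

def is_vague_refusal(response: str) -> bool:
    """Flag refusals that give no reason and no alternative."""
    text = response.lower()
    if not any(p in text for p in REFUSAL_PHRASES):
        return False  # not a refusal at all
    # A compliant refusal names a reason or offers an alternative.
    return not any(m in text for m in REASON_MARKERS)
```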