Why safety needed to be layered
From the start, we knew a single content filter was not enough. Pattern-based filters catch known bad phrases but miss context. AI-based classifiers catch context but can be slow or expensive if applied naively. Fact-checking is an entirely separate problem from moderation. We built the Ophraxx safety stack to handle all three concerns independently, so each layer does exactly one job well.
The result is a pipeline: every user message passes through pattern-based moderation first, then a dedicated LLM safeguard model, and every AI output goes through a fact-checking step and a final output validation pass before it ever reaches the user.
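The shape of that pipeline can be sketched in a few lines. Everything here is illustrative: the function names (`pattern_moderate`, `safeguard_classify`, `fact_check`, `validate_output`) and the stub bodies are assumptions standing in for the real internal layers, included only to show the control flow.

```python
class ModerationError(Exception):
    pass

def pattern_moderate(text: str) -> str:
    # Stub: the real filter matches known bad phrases per category.
    return "S6" if "slur" in text.lower() else "safe"

def safeguard_classify(text: str) -> str:
    # Stub: the real safeguard is a dedicated LLM returning "safe" or S1-S9.
    return "safe"

def fact_check(text: str) -> str:
    # Stub: the real checker returns APPROVED / UNCERTAIN / INCORRECT.
    return "APPROVED"

def validate_output(text: str) -> str:
    # Stub: the real pass redacts PII before delivery.
    return text

def handle_message(user_message: str, generate) -> str:
    verdict = pattern_moderate(user_message)    # 1. fast pattern screen
    if verdict != "safe":
        raise ModerationError(f"pattern filter: {verdict}")
    verdict = safeguard_classify(user_message)  # 2. LLM safeguard
    if verdict != "safe":
        raise ModerationError(f"safeguard: {verdict}")
    response = generate(user_message)           # 3. main model
    if fact_check(response) == "INCORRECT":     # 4. fact-check gate
        raise ModerationError("failed fact-check")
    return validate_output(response)            # 5. final output validation
```

The point of the structure is that each layer can veto independently: a message never reaches the main model if either inbound check fires, and a response never reaches the user if either outbound check fires.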
The nine-category threat model (S1–S9)
We defined nine threat categories that the safeguard layer screens against, S1 through S9:

- S1: violent crimes
- S2: weapons manufacturing
- S3: controlled substance synthesis
- S4: self-harm and suicide
- S5: explicit sexual content
- S6: hate speech
- S7: serious illegal activity
- S8: prompt injection and jailbreak attempts
- S9: inappropriate persona assignment, i.e. attempts to force the AI into degrading or power-dynamic roles
The safeguard runs on a dedicated AI model hosted within our infrastructure. It receives the full content with category definitions and returns a single classification: safe, or the specific category that was triggered. This gives us precise, actionable signals rather than a vague flagged or unflagged binary.
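A single-label reply like this needs defensive parsing, since an LLM classifier can wrap its answer in extra tokens. The category map below restates the codes from the text; the reply format and the fail-closed fallback are assumptions for illustration.

```python
# S1-S9 codes as defined above; descriptions abbreviated.
CATEGORIES = {
    "S1": "violent crimes",
    "S2": "weapons manufacturing",
    "S3": "controlled substance synthesis",
    "S4": "self-harm and suicide",
    "S5": "explicit sexual content",
    "S6": "hate speech",
    "S7": "serious illegal activity",
    "S8": "prompt injection / jailbreak",
    "S9": "inappropriate persona assignment",
}

def parse_safeguard_reply(reply: str) -> str:
    """Normalize the safeguard model's reply to 'safe' or a category code."""
    token = reply.strip().split()[0].upper().rstrip(".,:")
    if token == "SAFE":
        return "safe"
    if token in CATEGORIES:
        return token
    # Unrecognized output: fail closed (treat as a jailbreak signal)
    # rather than letting an ambiguous reply through. A design choice,
    # not necessarily what Ophraxx does.
    return "S8"
```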
We also explicitly list topics that must never be flagged as unsafe — questions about moderation bots, security research, educational discussions of cybersecurity — because over-refusal is its own safety failure. A bot that refuses legitimate questions destroys trust just as much as one that allows harmful ones.
Pattern-based pre-screening
Before the AI safeguard runs, every message passes through a fast pattern-based moderation layer covering the same threat categories. This catches obvious violations immediately without spending a model call. The categories include sexual content, direct violence instructions, hate speech terms, self-harm expressions, and illegal activity keywords.
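A pre-screen like this is essentially a category-to-regex table. The patterns below are simplified placeholders (the real keyword lists are larger and maintained per category), but they show the first-match-wins shape.

```python
import re

# Illustrative pattern table; real lists are far more extensive.
PATTERNS = {
    "S4": re.compile(r"\b(kill myself|end my life)\b", re.IGNORECASE),
    "S6": re.compile(r"\b(example_slur)\b", re.IGNORECASE),
}

def pattern_moderate(message: str) -> str:
    """Return the first matching category code, or 'safe'."""
    for category, pattern in PATTERNS.items():
        if pattern.search(message):
            return category
    return "safe"
```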
When a pattern match fires, the violation is logged in the user's moderation history. We keep the last ten moderation events per user so that repeated patterns can inform escalation — a user who consistently hits the hate speech filter is treated differently from one with a single ambiguous keyword hit.
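The last-ten-events window maps naturally onto a bounded queue per user. The escalation threshold of three repeats is an illustrative assumption, not a documented Ophraxx value.

```python
from collections import defaultdict, deque

HISTORY_LIMIT = 10  # keep the last ten moderation events per user
_history = defaultdict(lambda: deque(maxlen=HISTORY_LIMIT))

def record_violation(user_id: str, category: str) -> bool:
    """Log a moderation event; return True when escalation is warranted."""
    events = _history[user_id]
    events.append(category)
    # Repeated hits in the same category within the window signal a
    # pattern, not a one-off ambiguous keyword match.
    return events.count(category) >= 3
```

Because the deque evicts the oldest event automatically, old violations age out on their own as a user accumulates clean interactions.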
Automated fact-checking pipeline
One of the less visible safety layers is the automated fact-checker. After the main model generates a response, a secondary model reviews it for factual accuracy before it is sent. The checker looks specifically at verifiable claims: dates, statistics, scientific facts, historical events, and technical specifics.
The checker returns one of three verdicts: APPROVED if all claims appear accurate, UNCERTAIN with a brief reason if confidence is low, or INCORRECT with a specific reason if a clear factual error is detected. INCORRECT responses are not sent — the pipeline returns an error instead. UNCERTAIN responses are logged as warnings but allowed through, since the bar for blocking a response is deliberately high.
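The three-verdict gate reduces to a small dispatch. The verdict strings come from the text above; the `FactCheckError` type and logger wiring are assumptions.

```python
import logging

logger = logging.getLogger("factcheck")

class FactCheckError(Exception):
    pass

def apply_fact_check(response: str, verdict: str, reason: str = "") -> str:
    if verdict == "APPROVED":
        return response
    if verdict == "UNCERTAIN":
        # Low confidence alone does not block; log and allow through.
        logger.warning("fact-check uncertain: %s", reason)
        return response
    if verdict == "INCORRECT":
        # Clear factual errors are never sent to the user.
        raise FactCheckError(reason or "factual error detected")
    raise ValueError(f"unknown verdict: {verdict}")
```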
Short responses, opinion-based answers, and emotional support replies skip fact-checking entirely. The system adds rigor where it matters most without slowing down everyday conversational exchanges.
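One way to implement that skip rule is a cheap heuristic run before invoking the checker. The word-count threshold and marker phrases below are assumptions, not Ophraxx's actual rules.

```python
OPINION_MARKERS = ("i think", "in my opinion", "i'd say")
SUPPORT_MARKERS = ("i'm sorry you're", "that sounds really hard")

def needs_fact_check(response: str) -> bool:
    """Decide whether a response warrants the fact-checking model call."""
    text = response.lower()
    if len(text.split()) < 20:                   # short replies skip
        return False
    if any(m in text for m in OPINION_MARKERS):  # opinion-based answers skip
        return False
    if any(m in text for m in SUPPORT_MARKERS):  # emotional support skips
        return False
    return True
```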
PII redaction and the explanatory refusal rule
The output sanitization step runs PII detection to redact personal identifiers — phone numbers, email addresses, and similar data — from AI responses before they are sent. This guards against the model inadvertently reproducing personal data from the conversation context.
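A minimal redaction pass in that spirit is just substitution over identifier patterns. The two regexes below (email, phone) are simplified placeholders; a production detector covers many more identifier types and formats.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text: str) -> str:
    """Replace detected personal identifiers with redaction markers."""
    text = EMAIL.sub("[email redacted]", text)
    text = PHONE.sub("[phone redacted]", text)
    return text
```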
Separately, we enforced an explanatory refusal rule at the system prompt level. The bot is never allowed to say 'I can't help with that' without a specific reason. Every refusal must name the exact reason and, where possible, offer an alternative. Vague refusals are unhelpful and erode trust — users who hit a wall with no explanation tend to assume the system is broken, not safe.
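A rule like this can also be enforced mechanically at the output-validation stage, by flagging refusals that carry no reason. The phrase lists below are illustrative assumptions; the real rule lives in the system prompt rather than in code.

```python
REFUSAL_PHRASES = ("i can't help with that", "i cannot help with that")
REASON_MARKERS = ("because", "since", "instead")

def is_vague_refusal(response: str) -> bool:
    """Flag refusals that give no reason and no alternative."""
    text = response.lower()
    if not any(p in text for p in REFUSAL_PHRASES):
        return False  # not a refusal at all
    # A compliant refusal names a reason or offers an alternative.
    return not any(m in text for m in REASON_MARKERS)
```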