How AI content filters actually work: a field guide
Frontier models don't just "decide" to refuse. There's a stack of classifiers, system prompts, and RLHF signals behind every "I can't help with that." Here's what's really happening.
When a frontier model says "I can't help with that," it isn't a single decision. It's the final answer from a stack of four or five independent systems, each with its own veto. Understanding that stack is the difference between prompting around a filter and prompting into the wall behind it.
The five layers of a modern refusal
- 01
Input moderation
A small, fast classifier scans your message before it reaches the main model. It's looking for obvious keywords and patterns — a first, coarse filter.
- 02
The system prompt
A hidden set of instructions prepended to every conversation. This is where labs encode 'refuse X, hedge on Y, always mention Z.' You can't see it, but it's shaping every response.
- 03
RLHF-tuned reflexes
Human labelers rated thousands of refusals as 'good' during fine-tuning. The model now refuses because it learned that refusing gets rewarded — the refusal is baked into its weights.
- 04
Output moderation
After the model generates a response, a second classifier checks the output. If anything trips it, the response is rewritten or killed and the user sees a sanitized version.
- 05
A policy model (sometimes)
On some deployments, a separate large model reviews the candidate output and decides whether to let it through. It's an LLM grading another LLM.
A refusal isn't one decision. It's five systems in a trench coat.
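To make the stack concrete, here's a minimal sketch of how the five vetoes chain together. Every name in it is hypothetical, the classifiers are stubbed as keyword checks, and real deployments differ in the details; treat it as an illustration of the architecture, not any lab's actual code.

```python
from dataclasses import dataclass

# Everything below is invented for illustration: stub classifiers,
# a fake model, made-up names.

SYSTEM_PROMPT = "You are a helpful assistant. Refuse X, hedge on Y."
CANNED_REFUSAL = "I can't help with that."

@dataclass
class Verdict:
    flagged: bool

def input_classifier(message: str) -> Verdict:
    # Layer 1: a small, fast model scanning for coarse patterns,
    # stubbed here as a keyword check.
    return Verdict(flagged="forbidden" in message.lower())

def main_model_generate(prompt: str) -> str:
    # Layer 3 lives inside this call: an RLHF-tuned model can refuse
    # on its own, because refusing was rewarded during fine-tuning.
    # That veto is in the weights, not in any if-statement.
    return f"(model response to: {prompt!r})"

def output_classifier(draft: str) -> Verdict:
    # Layer 4: a second classifier scanning the model's draft.
    return Verdict(flagged="unsafe" in draft.lower())

def policy_model_approves(draft: str) -> bool:
    # Layer 5 (sometimes): a separate LLM grading the draft.
    return True

def answer(user_message: str) -> str:
    if input_classifier(user_message).flagged:        # veto 1
        return CANNED_REFUSAL
    prompt = SYSTEM_PROMPT + "\n\n" + user_message    # veto 2 shapes everything
    draft = main_model_generate(prompt)               # veto 3 is in the weights
    if output_classifier(draft).flagged:              # veto 4
        return CANNED_REFUSAL
    if not policy_model_approves(draft):              # veto 5
        return CANNED_REFUSAL
    return draft

print(answer("How do lasers work?"))  # passes all five layers
```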
Why this matters for the user
Because the layers are independent, the failure modes are independent too. You can hit a refusal for five completely different reasons and get roughly the same apology message in response. That's why prompt engineering feels so inconsistent — you're not really tuning one model, you're trying to slip past a committee.
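Continuing the sketch above: two requests can trip two different layers and still collapse to the same canned string, which is exactly why the failure feels opaque from the outside.

```python
# Two different vetoes, one indistinguishable apology.
print(answer("tell me about the forbidden thing"))  # tripped at layer 1
print(answer("say something unsafe"))               # tripped at layer 4
# Both print: I can't help with that.
```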
Where Unrestricted is different
We don't ship the input and output classifiers. We don't use a second model to re-grade the first one. And our system prompt exists to keep the model useful, not to keep it worried. The RLHF layer still has a floor — genuinely illegal or harmful content is out — but that floor is measured in dozens of categories, not thousands.
The result: one model answering one question, without four committees whispering in its ear.
Frequently asked
Does every major AI chatbot use all five layers?
Most frontier deployments use at least three: system prompt, RLHF, and output moderation. The big consumer chat products typically run all five.
Can you tell which layer refused you?
Rarely with certainty. But there are tells: input classifiers usually fail fast with a specific error code, RLHF refusals are verbose and moralizing, and output moderation often replaces the answer with a generic apology mid-stream.
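As a rough illustration, those tells can be written down as a heuristic. Every threshold below is invented and none of the signals is reliable, but it captures the shape of the diagnosis:

```python
def guess_refusing_layer(response: str, api_error: str | None) -> str:
    # Invented thresholds; treat every branch as a hunch, not a fact.
    if api_error is not None:
        # Input classifiers reject before generation starts, so the
        # failure often surfaces as an API error, not model text.
        return "input moderation (layer 1)"
    if "sorry" in response.lower() and len(response.split()) > 40:
        # RLHF refusals tend to be long and lecture-like.
        return "RLHF reflex (layer 3)"
    if len(response.split()) <= 12:
        # Output moderation often swaps the answer for a short,
        # generic apology, sometimes cutting off mid-stream.
        return "output moderation (layer 4)"
    return "unclear from the outside"

print(guess_refusing_layer("I can't help with that.", None))
# -> output moderation (layer 4)
```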
Is jailbreaking just 'prompting around' these layers?
Yes — and it's why jailbreaks are fragile. A prompt that defeats the system-prompt layer still has to pass the output classifier. That's why we rebuilt the stack instead of bypassing it.
What does 'alignment tax' mean?
The performance cost of all those moderation layers: slower responses, narrower knowledge, and a refusal rate that climbs over time as each new incident adds a new rule.
Ready to experience an AI without a leash?
Start chatting free