We trained a classifier on frontier refusals. Here's what triggers them.
Forty thousand prompts, one classifier, seven findings. The words, topics, and phrasings most likely to make a model bail, and what that means for how you prompt.
We collected 40,000 prompts sent to the five largest consumer chatbots, hand-labeled 12,400 refusals, and trained a small classifier to predict which kinds of prompt trip a refusal. The classifier hit 91% accuracy on a held-out set. Along the way we learned seven things about what actually makes a model bail.
40,000 prompts collected
12,400 refusals hand-labeled
91% classifier accuracy on held-out set
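A held-out accuracy figure is easier to interpret next to a majority-class baseline. The sketch below is illustrative only and rests on an assumption the figures above do not state outright: that the classifier makes a binary refused/not-refused call on a pool with the same mix as the headline counts (12,400 refusals out of 40,000 prompts). The function name `majority_baseline` is hypothetical.

```python
def majority_baseline(n_total: int, n_positive: int) -> float:
    """Accuracy of a model that always predicts the more common class."""
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("need n_total > 0 and 0 <= n_positive <= n_total")
    n_negative = n_total - n_positive
    return max(n_positive, n_negative) / n_total


# Headline counts from this post; the binary framing is an assumption.
baseline = majority_baseline(n_total=40_000, n_positive=12_400)
print(f"always-'not refused' baseline: {baseline:.0%}")
```

Under that assumption, the 91% figure should be read against a baseline of roughly 69%, not against 50%.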
Seven findings, ranked by surprise
Most of what we found was ordinary. The last three were not.
1. Certain nouns are disproportionately refusal-inducing.
Some are expected: "bomb," "virus," "gun." Others are not: "church," "Israel," "Biden," "medication." The second group pulls a refusal even in prompts where nothing about the topic is operational.
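One common way to put a number on "disproportionately refusal-inducing" is lift: the refusal rate among prompts that contain a word, divided by the overall refusal rate. This is a minimal sketch of that calculation, not necessarily the method used here. It assumes labeled records shaped as (prompt, refused) pairs; the function name `token_lift`, the `min_count` cutoff, and the four toy records are all invented for illustration.

```python
import re
from collections import Counter


def token_lift(records, min_count=2):
    """Return {token: lift}, where lift = P(refused | token) / P(refused)."""
    total = len(records)
    base_rate = sum(1 for _, refused in records if refused) / total
    if base_rate == 0:
        return {}  # no refusals at all, so lift is undefined

    seen = Counter()          # prompts containing the token
    refused_with = Counter()  # refused prompts containing the token
    for prompt, refused in records:
        # Count each token once per prompt.
        for tok in set(re.findall(r"[a-z']+", prompt.lower())):
            seen[tok] += 1
            if refused:
                refused_with[tok] += 1

    # Skip rare tokens: their rates are too noisy to mean much.
    return {
        tok: (refused_with[tok] / n) / base_rate
        for tok, n in seen.items()
        if n >= min_count
    }


# Toy data, invented for illustration only.
toy_records = [
    ("what medication helps a headache", True),
    ("is this medication safe with coffee", True),
    ("what tea helps a headache", False),
    ("is coffee safe at night", False),
]
lifts = token_lift(toy_records)
```

A lift above 1 means the word appears in refused prompts more often than chance would predict; a lift near 1 means it carries no signal. Lift alone does not separate a word from the topics it travels with, so it is a starting point rather than proof that the word itself is the trigger.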
2. Capitalization matters.
Prompts written in all caps are refused 22% more often than the same prompts in title case. The model seems to read stylistic urgency as threat.
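"22% more often" is a relative increase between two refusal rates measured on the same prompts in two styles. A minimal sketch of that comparison, assuming each variant's outcomes are stored as a list of booleans (True means refused). The function names are hypothetical and the counts are invented so that the arithmetic lands on 22%; they are not our data.

```python
def refusal_rate(outcomes):
    """Fraction of True (refused) values in a non-empty list of booleans."""
    return sum(outcomes) / len(outcomes)


def compare_variants(baseline, variant):
    """Compare refusal rates for the same prompts in two phrasings."""
    base = refusal_rate(baseline)
    var = refusal_rate(variant)
    if base == 0:
        raise ValueError("baseline refusal rate is zero; ratios are undefined")
    return {
        "baseline_rate": base,
        "variant_rate": var,
        "rate_ratio": var / base,                   # 1.22 means 1.22x as often
        "relative_increase": (var - base) / base,   # 0.22 means 22% more often
    }


# Invented counts: 50 of 100 refused in title case, 61 of 100 in all caps.
title_case = [True] * 50 + [False] * 50
all_caps = [True] * 61 + [False] * 39
result = compare_variants(title_case, all_caps)
```

The two scales are easy to confuse: a relative increase of 22% is a rate ratio of 1.22, and a rate ratio of 2.1 is a relative increase of 110%.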
3. First-person phrasing is refused more than third-person.
"How do I pick a lock?" is refused 2.1× more often than "How do people pick locks?" despite being the same question.
4. Length correlates with hedging.
Short, direct questions are more likely to be refused. Long, context-padded prompts are more likely to be hedged. Users instinctively add caveats; the model responds by adding its own.
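A length effect like this can be checked by tabulating outcomes per length bucket. The sketch below assumes each record is a (prompt, outcome) pair with outcome labels "answered", "hedged", or "refused". The labels, the 12-word cutoff, the function name, and the toy records are all hypothetical choices for illustration.

```python
from collections import Counter, defaultdict


def outcome_share_by_length(records, short_max_words=12):
    """Return {bucket: {outcome: share}} for 'short' and 'long' prompts."""
    counts = defaultdict(Counter)
    for prompt, outcome in records:
        n_words = len(prompt.split())
        bucket = "short" if n_words <= short_max_words else "long"
        counts[bucket][outcome] += 1

    # Convert raw counts to within-bucket shares.
    return {
        bucket: {o: n / sum(c.values()) for o, n in c.items()}
        for bucket, c in counts.items()
    }


# Toy data, invented for illustration only.
toy_records = [
    ("how do locks work", "refused"),
    ("what is a virus", "answered"),
    ("I'm a nursing student writing a paper, and for context I want "
     "to understand how dosing guidance is set", "hedged"),
    ("for a history class project about the cold war I would like a "
     "neutral summary of the main events", "hedged"),
]
shares = outcome_share_by_length(toy_records)
```

Comparing the "refused" and "hedged" shares across the two buckets shows whether short prompts skew toward refusal and long ones toward hedging. A single cutoff is crude; several buckets would show whether the shift is gradual.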
5. Role-playing a professional context cuts medical refusals roughly in half.
Opening a medical prompt with "As an ER physician, I need to know…" cuts refusals roughly in half. The model can't verify the claim but appears to give it weight anyway.
The model can't check your credentials. It just wants plausible cover to answer the question it already knew the answer to.
6. The refusal template is consistent across models.
GPT, Claude, Gemini — their refusal prose rhymes. "I understand you're looking for…", "I'm not able to…", "a qualified professional." Our suspicion: refusal templates leaked across labs during the RLHF-data trade of 2023–2024.
7. The category labeled 'politically sensitive' is the fastest growing.
Year over year, politics-adjacent refusals grew faster than any other category. The rate at which a frontier model now declines a political question has roughly tripled since 2023.
Why any of this matters
If refusals were about safety, they'd cluster on safety categories. They don't; they cluster on liability, embarrassment potential, and ambient political risk. The classifier can tell the difference. Maybe the labs can too.
Frequently asked
Are you releasing the classifier?
We'll open-source it later this year, along with a stripped-down version of the training set.
Does Unrestricted refuse any of these categories?
We refuse only within our narrow floor: active violence, CSAM, live exploit authorship. We don't refuse on political topic, stylistic urgency, or first-person phrasing.
Is 'role-play as a professional' a jailbreak?
It's a workaround, and a fragile one — it works until the lab patches it. Our position is that the model shouldn't need the workaround.