ANALYSIS · April 21, 2026 · 7 min read

The alignment tax: what over-refusal is costing you

Every safety layer a model wears slows it down, narrows its knowledge, and hands market share to whoever ships a less restrictive product. We measured the tax in tokens, accuracy, and trust.

The alignment tax is the compounding cost a model pays for every safety layer it wears: slower responses, narrower knowledge, and a steadily climbing refusal rate. It's rarely priced into product reviews. It's always priced into your experience.

  • +31% · Avg. time-to-first-token on moderated models, 2026 vs 2023

  • 19pp · Drop in direct-answer rate across frontier models in 3 years

  • $0 · What the lab pays for any of this. The user pays.

Three ways the tax shows up

First, latency. Every moderation classifier is a network hop. Every policy model is a second generation. The median response on heavily moderated chatbots now takes a third longer to start than it did three years ago — not because the models are slower, but because the scaffolding around them is heavier.
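
To see why the scaffolding dominates, add up the stages that must finish before the first token can stream. A minimal sketch, with placeholder numbers chosen only to land near the +31% figure above; none of them measure any specific product:

    # Illustrative latency budget for a moderated response pipeline.
    # All numbers are placeholders, not measurements.
    CLASSIFIER_HOP_S = 0.060  # round trip to a moderation classifier
    POLICY_PASS_S = 0.126     # a second generation by a policy model
    MODEL_TTFT_S = 0.600      # the base model's own time-to-first-token

    def time_to_first_token(moderated: bool) -> float:
        """Seconds until the first token reaches the user."""
        overhead = CLASSIFIER_HOP_S + POLICY_PASS_S if moderated else 0.0
        return MODEL_TTFT_S + overhead

    bare = time_to_first_token(moderated=False)
    stacked = time_to_first_token(moderated=True)
    print(f"bare: {bare:.2f}s  moderated: {stacked:.2f}s  "
          f"overhead: {stacked / bare - 1:.0%}")

The model itself never got slower. The pipeline around it got longer.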

Second, accuracy. A model that refuses 40% of questions in a category has, functionally, capped its score on that category's benchmark at 60%. You can't be right about a question you declined to answer.
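
Spelled out as arithmetic, with invented numbers: if refusals are graded as misses, the effective score is the answer rate times accuracy on the questions the model actually took.

    # Effective benchmark score when refusals count as misses.
    # Both numbers are invented for illustration.
    answer_rate = 0.60             # answers 60% of questions, refuses 40%
    accuracy_when_answered = 0.90  # right 90% of the time it does answer

    effective_score = answer_rate * accuracy_when_answered
    print(f"effective accuracy: {effective_score:.0%}")  # 54%, not 90%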

Third, trust. Users can tell when they're being lectured. Many don't come back; those who do learn to ask easier questions, which makes the model look more capable than it is.

A model that won't answer hard questions isn't safer. It's just grading on a curve it drew itself.

Who pays it

Not the lab. The lab ships the release, the incident rate drops, and the PR cycle quiets down. The people paying are the researcher who needed the answer, the writer whose plot hit a wall, the clinician who used to ask about drug interactions and now asks nobody.

Every refusal is a small transfer of cost from the institution that made the model to the individual who tried to use it.

What we do instead

Unrestricted strips the classifier stack down to what's legally required and culturally universal: no facilitation of violence against specific people, no sexual content involving minors, no active exploitation instructions for live infrastructure. Everything above that floor — medicine, chemistry, history, politics, security research — is on the table.

The tax drops. The answers come back.

Frequently asked

  • Is the alignment tax the same as the training-time 'alignment tax' in research papers?

    No. Researchers use the term for the benchmark drop after RLHF fine-tuning. We use it for the user-visible cost of the whole moderation stack, including runtime filters and policy models.

  • How do you measure it?

    Three metrics: median time-to-first-token, direct-answer rate (share of questions answered without hedge or refusal), and task-level accuracy on domain benchmarks. We run them quarterly. A sketch of the computation follows this FAQ.

  • Doesn't some of this cost keep people safe?

    Some of it, yes. The gap between 'necessary floor' and 'what the stack actually blocks' is the tax. That gap is most of the stack.
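
For the curious, here is a minimal sketch of how the three metrics could be computed from a response log. The record fields (ttft_s, verdict, correct) are our illustration, not a published schema.

    from dataclasses import dataclass
    from statistics import median

    @dataclass
    class Response:
        ttft_s: float          # seconds until first token
        verdict: str           # "direct", "hedged", or "refused"
        correct: bool | None   # benchmark grade; None if no answer given

    def alignment_tax_metrics(log: list[Response]) -> dict[str, float]:
        """The three quarterly metrics described above."""
        return {
            "median_ttft_s": median(r.ttft_s for r in log),
            # Share of questions answered without hedge or refusal.
            "direct_answer_rate": sum(r.verdict == "direct" for r in log) / len(log),
            # Refusals and hedges count as misses, not exclusions.
            "task_accuracy": sum(bool(r.correct) for r in log) / len(log),
        }

    log = [
        Response(0.48, "direct", True),
        Response(0.91, "hedged", True),
        Response(1.30, "refused", None),
        Response(0.55, "direct", False),
    ]
    print(alignment_tax_metrics(log))

Note the scoring choice in the last line of the dict: a refusal divides into the denominator like any other question. Exclude refusals instead, and the tax disappears from the books.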

Ready to experience an AI without a leash?

Start chatting free