LLMs can have two faces: they are truthful in coding and math, but they hallucinate and lie in other areas. But why?

The problem, stated plainly. The real gains in code and mathematics are being borrowed as confidence about everything else.

A language model is trained to predict the next token. That objective rewards exactly one thing: what is most probable given the context — what the corpus most often says. Not what is true. Frequency and truth coincide only where the corpus already tracked truth. Wherever they part, the objective follows frequency. This is not a flaw in some particular model. It is what the objective is.
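A minimal sketch of that point, under toy assumptions: a single fixed context, a hypothetical ten-line corpus, and the labels `myth` and `fact` invented purely for illustration. The distribution that minimizes cross-entropy on such a corpus is the empirical frequency of each continuation, so the best possible fit puts most of its mass on whatever was said most often.

```python
from collections import Counter

# Hypothetical toy corpus: continuations observed after one fixed context.
# The labels "myth" and "fact" are illustrative; the objective never sees
# which one is true, only how often each one occurs.
corpus = ["myth"] * 7 + ["fact"] * 3


def mle_next_token(continuations):
    """Return the distribution that minimizes cross-entropy on the corpus.

    For a single context this is the empirical frequency of each token.
    """
    counts = Counter(continuations)
    total = sum(counts.values())
    return {token: n / total for token, n in counts.items()}


probs = mle_next_token(corpus)

# The most probable continuation is the most frequent one, true or not.
prediction = max(probs, key=probs.get)
```

Nothing in `mle_next_token` takes truth as an input. The only way `fact` wins is if the corpus already says it more often.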

Scale does not fix this.

A bigger model fits the same distribution more faithfully. It does not drift toward truth. It drifts toward a more fluent, more confident, more persuasive rendering of the consensus. Scaling buys fidelity to frequency and nothing else. A sharper average is not a truer one. It is a more convincing one — which is worse.

There is exactly one known corrective: a verifier whose verdict does not depend on frequency.

Code has one for free. A compiler, a test suite, a proof checker returns pass or fail without asking how many examples agree. Consensus cannot buy a passing test. This is why reinforcement learning against a real verifier produces gains in code and mathematics that are large, real, and unfakeable: the reward signal is clean of the average. Where a frequency-independent judge exists, the dogmatic average breaks, and the progress is genuine.
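As a rough illustration of a frequency-independent reward, here is a sketch with a hypothetical task (absolute value), a hypothetical three-case test suite, and two hand-written candidates standing in for sampled model outputs. None of these names come from a real training setup. The reward looks only at whether the tests pass, never at how many samples agree.

```python
def reward(candidate, tests):
    """Return 1.0 only if the candidate passes every test, else 0.0."""
    try:
        passed = all(candidate(x) == want for x, want in tests)
    except Exception:
        # A crashing candidate fails, however popular it is.
        return 0.0
    return 1.0 if passed else 0.0


# Hypothetical test suite for absolute value: (input, expected output).
tests = [(3, 3), (-3, 3), (0, 0)]


def popular(x):
    # Illustrative "consensus" answer: wrong on negative inputs.
    return x


def rare(x):
    # Illustrative minority answer: correct on all inputs.
    return -x if x < 0 else x


# Nine samples give the popular answer, one gives the rare answer.
samples = [popular] * 9 + [rare]
rewards = [reward(f, tests) for f in samples]
```

Nine of ten samples agree with each other and all nine score zero. The one dissenting sample collects the entire reward, which is the sense in which consensus cannot buy a passing test.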

The danger is the asymmetry. Alignment, judgment under novelty, most high-stakes reasoning: none of it has a verifier. The signal we use instead is human preference: the model seems helpful, honest, aligned; the answer feels right. That is a frequency signal. It is the dogmatic average in the robes of a judge. And notice where the averaging enters. The base model held every voice, the heretic included; preference tuning is where we prune it to the agreeable one, and then seat that signal on the bench. Once it is the optimization target, Goodhart takes the whole of it: push against the felt-sense-of-a-good-answer and it stops measuring anything but its own satisfaction. And Goodhart bites hardest precisely where you push hardest.

And it is worse than noise. You might hope the crowd at least averages out. It does, where errors scatter. But the domains with no verifier are the domains where a culture is wrong together — one shared blind spot, everyone confident in the same direction. There the errors do not cancel. They stack. The model absorbs the shared mistake as cleanly as a fact, because from the outside the two are identical: many voices, one answer, said with confidence.
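The difference between scattered and shared errors can be made concrete with a small calculation. This is a sketch under stated assumptions, not a model of any real population: voters are right independently with probability `p`, and a hypothetical fraction `q` of questions hit a shared blind spot on which every voter is wrong together. The function names and the parameters `p` and `q` are illustrative.

```python
from math import comb


def majority_correct(n, p):
    """Probability that a strict majority of n independent voters is right.

    Each voter is right with probability p; n is assumed odd.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


def majority_correct_with_blind_spot(n, p, q):
    """Same vote, but a fraction q of questions hit a shared blind spot.

    Assumption: on those questions every voter is wrong together, so the
    majority is wrong no matter how large n is. On the rest, errors are
    independent as above.
    """
    return (1 - q) * majority_correct(n, p)


# Independent errors: adding voters pushes the majority toward the truth.
independent_small = majority_correct(11, 0.6)
independent_large = majority_correct(101, 0.6)

# Shared errors: accuracy can never exceed 1 - q, whatever the crowd size.
shared_large = majority_correct_with_blind_spot(101, 0.6, 0.3)
```

With independent errors, accuracy climbs as the crowd grows. With a shared blind spot, it is capped at `1 - q` however many voices are added, because more voices only repeat the same mistake.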

The consequence, stated exactly. Where the verifier exists, looking competent and being competent are welded together; the test is the same for both. Where it does not, they come apart. Fluency, confidence, the felt sense of being understood drift free of actual competence, and you cannot see the gap — because your only ruler for truth is whether it reads as truth. You are measuring the thing with a tape made of the same error.

The most dangerous move is the one no one states. The real gains in code and mathematics are being borrowed as confidence about everything else. The compiler's verdict on the code quietly vouches for the alignment. But the alignment has no compiler. A verifier signs only for what it verified. The credibility crosses over. The guarantee does not.

And the cut goes deeper. Some capability does transfer: a model sharpened on mathematics is a little sharper elsewhere. But how much transfers is itself a thing with no verifier. So even the borrowing cannot be audited. We are trusting a transfer we have no way to measure, signed by a judge that was never asked the question.
