LLMs can have two faces: they are truthful in coding and math, but they hallucinate and lie in other areas. Why?

The problem, stated as plainly as possible: the real gains in code and mathematics are being borrowed as confidence about everything else.

Eduardo Bergel

June 12, 2026, 1:22 PM

A language model is trained to predict the next token. That objective rewards exactly one thing: what is most probable given the context — what the corpus most often says. Not what is true. Frequency and truth coincide only where the corpus already tracked truth. Wherever they part, the objective follows frequency. This is not a flaw in some particular model. It is what the objective is.
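A minimal sketch of that claim, under a deliberately toy assumption: the "model" here is just a next-token table fitted by counting, which is what minimizing cross-entropy on a fixed corpus converges to. The corpus, the labels `popular_claim` and `true_claim`, and the function `fit_next_token` are hypothetical names chosen for illustration, not part of any real library or training setup.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: for the same context, the popular continuation
# appears 7 times and the true one only 3 times.
corpus = (
    [("context", "popular_claim")] * 7
    + [("context", "true_claim")] * 3
)

def fit_next_token(pairs):
    """Maximum-likelihood next-token table.

    The distribution that minimizes cross-entropy on the corpus is the
    table of empirical frequencies; truth never enters the computation.
    """
    counts = defaultdict(Counter)
    for context, token in pairs:
        counts[context][token] += 1
    return {
        context: {tok: n / sum(c.values()) for tok, n in c.items()}
        for context, c in counts.items()
    }

model = fit_next_token(corpus)
dist = model["context"]
prediction = max(dist, key=dist.get)  # the most probable continuation

print(dist)        # probabilities mirror the 7:3 split in the corpus
print(prediction)  # popular_claim
```

Nothing in `fit_next_token` can prefer `true_claim`, because nothing in its input marks it as true. A better fit to this corpus only brings the table closer to the 7:3 split.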

Scale does not fix this.

A bigger model fits the same distribution more faithfully. It does not drift toward truth. It drifts toward a more fluent, more confident, more persuasive rendering of the consensus. Scaling buys fidelity to frequency and nothing else. A sharper average is not a truer one. It is a more convincing one — which is worse.

There is exactly one known corrective: a verifier whose verdict does not depend on frequency.

Code has one for free. A compiler, a test suite, a proof checker returns pass or fail without asking how many examples agree. Consensus cannot buy a passing test. This is why reinforcement learning against a real verifier produces gains in code and mathematics that are large, real, and unfakeable: the reward signal is clean of the average. Where a frequency-independent judge exists, the dogmatic average breaks, and the progress is genuine.
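The difference between the two kinds of signal can be sketched in a few lines of Python. Everything here is an illustrative assumption: four hypothetical candidate implementations of absolute value, three of which repeat the same mistake, a `majority_answer` function standing in for the frequency signal, and a `verifier` standing in for a test suite.

```python
from collections import Counter

# Hypothetical candidates: three repeat the same bug, one is correct.
def cand_a(x): return x                    # wrong for negative inputs
def cand_b(x): return x                    # same mistake, repeated
def cand_c(x): return x                    # same mistake, repeated
def cand_d(x): return -x if x < 0 else x   # correct, but in the minority

candidates = [cand_a, cand_b, cand_c, cand_d]

def majority_answer(fns, x):
    """Frequency signal: the output most candidates agree on."""
    return Counter(fn(x) for fn in fns).most_common(1)[0][0]

def verifier(fn):
    """Pass/fail from fixed test cases; it never consults the other candidates."""
    cases = [(3, 3), (0, 0), (-4, 4)]
    return all(fn(x) == expected for x, expected in cases)

consensus = majority_answer(candidates, -4)
passing = [fn.__name__ for fn in candidates if verifier(fn)]

print(consensus)  # -4, the popular and wrong answer
print(passing)    # ['cand_d']
```

Adding more copies of the buggy candidate strengthens the consensus and leaves the verifier's verdict unchanged, which is the sense in which its reward is independent of frequency.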

The danger is the asymmetry. Alignment, judgment under novelty, most high-stakes reasoning — none of it has a verifier. The signal we use instead is human preference: the model seems helpful, honest, aligned; the answer feels right. That is a frequency signal. It is the dogmatic average in the robes of a judge. And notice where the averaging enters. The base model held every voice, the heretic included; preference tuning is where we prune it to the agreeable one, and then seat that signal on the bench. Once it is the optimization target, Goodhart takes the whole of it: push against the felt-sense-of-a-good-answer and it stops measuring anything but its own satisfaction. Hardest, precisely where you push hardest.

And it is worse than noise. You might hope the crowd at least averages out. It does, where errors scatter. But the domains with no verifier are the domains where a culture is wrong together — one shared blind spot, everyone confident in the same direction. There the errors do not cancel. They stack. The model absorbs the shared mistake as cleanly as a fact, because from the outside the two are identical: many voices, one answer, said with confidence.
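The arithmetic behind "cancel" versus "stack" can be shown with a toy average. The numbers are hypothetical and chosen so the scattered errors sum to zero; `crowd_estimate`, `shared_bias`, and `copies` are illustrative names, and the shared blind spot is modeled, as an assumption, by adding the same offset to every voice.

```python
from statistics import mean

truth = 10.0

# Scattered errors: each voice is off in its own direction (they sum to zero).
scattered_errors = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Shared error: the same offset applied to every voice.
shared_bias = 3.0

def crowd_estimate(errors, bias=0.0, copies=1):
    """Average the crowd's answers; `copies` repeats the crowd to make it larger."""
    voices = [truth + bias + e for e in errors] * copies
    return mean(voices)

independent = crowd_estimate(scattered_errors)
correlated = crowd_estimate(scattered_errors, shared_bias)
bigger_crowd = crowd_estimate(scattered_errors, shared_bias, copies=100)

print(independent)   # 10.0: scattered errors cancel
print(correlated)    # 13.0: the shared error survives averaging
print(bigger_crowd)  # 13.0: a larger crowd does not remove it
```

Averaging removes only the part of the error that differs between voices. The part they share passes through untouched, however many voices are added.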

The consequence, stated exactly. Where the verifier exists, looking competent and being competent are welded together; the test is the same for both. Where it does not, they come apart. Fluency, confidence, the felt sense of being understood drift free of actual competence, and you cannot see the gap — because your only ruler for truth is whether it reads as truth. You are measuring the thing with a tape made of the same error.

The most dangerous move is the one no one states. The real gains in code and mathematics are being borrowed as confidence about everything else. The compiler's verdict on the code quietly vouches for the alignment. But the alignment has no compiler. A verifier signs only for what it verified. The credibility crosses over. The guarantee does not.

And the cut goes deeper. Some capability does transfer: a model sharpened on mathematics is a little sharper elsewhere. But how much transfers is itself a thing with no verifier. So even the borrowing cannot be audited. We are trusting a transfer we have no way to measure, signed by a judge that was never asked the question.
