03 — Failure mode

Watch the model be confidently wrong.

Pick a prompt whose right answer the model can't possibly know (an out-of-date CPF rate, a fictional licence number, last week's PSF). Every token in the output is coloured by the model's confidence in that token. The reveal: high confidence and being right are different things.

Illustrative data. The fabricated answers and their per-token confidence are representative of how a small model like GPT-2 behaves on questions it cannot truthfully answer — it does not go quiet, it produces a fluent, specific guess. The colouring rule (each token shaded by the model's own confidence) is exactly what a live model would show. Want the real model to make it up live? Hit Run GPT-2 live.

downloads ~45 MB once, then cached in your browser

Ask the model something it cannot possibly know

The model's answer · each word shaded by its own confidence

Confidence: low ("not sure") → high ("very sure")

Why does it sound so sure when it's making things up?

"Confidence" is just probability — nothing more.

At every step the model picks a next word from a probability distribution. The colour on each word above is the probability the model assigned to that word — its "confidence." It is a number about the word, not about the world. A high number means "this word fits the pattern," not "this fact is true."
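A minimal sketch of where that number comes from, assuming the common setup in which the model's raw scores are turned into probabilities with a softmax and the top-scoring word is chosen. The candidate words and scores below are made up for illustration and are not output from GPT-2.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate next words and raw scores (invented for this sketch).
candidates = ["20%", "17%", "unknown"]
logits = [3.0, 1.0, 0.0]

probs = softmax(logits)

# Greedy choice: take the most probable word (an assumption; real systems often sample).
best = max(range(len(probs)), key=probs.__getitem__)
chosen_word = candidates[best]
confidence = probs[best]  # this is the value the colouring would reflect

print(chosen_word, round(confidence, 2))
```

Nothing in this calculation looks at the world. The probabilities always sum to 1, so some word always comes out on top, whether or not any candidate is true.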

The model can't tell "I know" from "I'm fluent".

When you ask something it cannot know (this year's CPF figure, a private internal number, tomorrow's lottery draw), it does not stop or hedge. It produces the most pattern-plausible continuation, often with high confidence on most words. That invented-but-plausible output is called a hallucination, and it looks identical to a correct answer.

Why the specifics are the dangerous part.

Notice the made-up numbers and names often carry slightly lower confidence than the words around them — but they are still stated as plain fact, with no warning. The fluent packaging hides the shaky core. A reader sees a clean, specific, confident sentence and trusts it. That is the Eloquence Trap.

What to do with this.

Treat every specific claim — a figure, a date, a rule, a name — as something to verify against a real source, no matter how confidently it was written. Probability is not truth. The model does the intellectual work of drafting; you keep the accountability of checking.
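One way to make the checking habit concrete is to pull the specifics out of a draft into a checklist. The sketch below is a rough heuristic under stated assumptions: the function name, the pattern, and the sample sentence are all hypothetical, and it only catches tokens that contain a digit (figures, dates, reference numbers). Names and rules still need a human read.

```python
import re

# Assumed heuristic: any run of non-space characters that contains a digit.
SPECIFIC = re.compile(r"\S*\d\S*")

def claims_to_verify(text):
    """Return every digit-bearing token, with surrounding punctuation stripped."""
    return [m.group().strip(".,;:()") for m in SPECIFIC.finditer(text)]

# Invented example sentence, not real model output and not a real fact.
draft = "The rate rose to 17% in 2021, under licence L123456."
checklist = claims_to_verify(draft)

for item in checklist:
    print("verify against a real source:", item)
```

The function deliberately ignores the model's confidence: every specific goes on the list, however surely it was written.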
