01 — Mechanism reveal

How AI sees your text.

Type something. Watch it become tokens. Switch encoders to see how different models split the same text. Drop down to bytes to see why a single emoji can cost you four tokens. Then look at what your "system prompt" actually looks like — same stream, no special treatment from the model.
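To make the byte view concrete, here is a minimal Python sketch using only the standard library. The helper name `byte_view` is hypothetical and not part of any encoder. It assumes a byte-level encoder that starts from UTF-8 bytes, so a 4-byte emoji costs up to four tokens unless the vocabulary has learned merges for it.

```python
def byte_view(text: str) -> dict:
    """Return the character count, UTF-8 byte count, and raw byte values."""
    raw = text.encode("utf-8")
    return {"characters": len(text), "utf8_bytes": len(raw), "bytes": list(raw)}

# Escapes are used so the exact code points are unambiguous:
# U+0061 "a", U+00E9 "e with acute accent", U+1F600 "grinning face" emoji.
for sample in ["a", "\u00e9", "\U0001F600"]:
    print(repr(sample), byte_view(sample))
```

Each sample is one character on screen, but the byte counts differ, and bytes are what a byte-level encoder starts from.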

What is a token, and why should I care?

The model never saw words.

A token is a small chunk of text — sometimes a whole word, sometimes a few letters, sometimes a single punctuation mark. The model reads your message as a list of these chunks, not as a sentence you wrote.

Why not just one character at a time?

Two reasons. First, a single character carries very little meaning, so the same text becomes a much longer sequence and the model needs a much longer attention span to cover it. Second, whole words explode the vocabulary: every spelling, every plural, every typo would need its own slot. Tokens are the compromise: common words stay whole, rare words get split into known pieces.
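The trade-off can be seen on a toy sentence. The sketch below is only an illustration under simple assumptions (whitespace splitting for "words", a hypothetical helper called `stats`). It compares sequence length and vocabulary size at the character level and the word level, before and after adding a plural and a typo.

```python
def stats(text: str) -> dict:
    """Sequence length and distinct-symbol count at character and word level."""
    chars = list(text)      # one token per character, spaces included
    words = text.split()    # one token per whitespace-separated word
    return {
        "char_sequence_length": len(chars),
        "char_vocabulary": len(set(chars)),
        "word_sequence_length": len(words),
        "word_vocabulary": len(set(words)),
    }

base = "the cat sat on the mat"
# Same sentence plus a plural and a typo: no new letters, but two new "words".
extended = base + " the cats sat on the matt"

print(stats(base))
print(stats(extended))
```

The character sequence is several times longer than the word sequence, while the word vocabulary is the one that grows when a plural or typo shows up. Subword tokens sit between the two.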

What is BPE — and why do I keep hearing about it?

The standard recipe is called byte-pair encoding (BPE). It started life as a 1994 compression trick by Philip Gage: find the most frequent pair of adjacent bytes, replace it with a new symbol, repeat. Sennrich, Haddow & Birch (2016) adapted it for translation; OpenAI adopted a byte-level version for GPT-2. The result is a vocabulary that compresses common English words into one token each and stretches uncommon things (long Tamil names, emoji, source code) into many.
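The merge loop described above can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the code any production encoder uses: it starts from characters rather than bytes, does not split the text into words first (so merges may cross spaces), and the function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(symbols):
    """Return the most common adjacent pair, or None if no pair repeats."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None

def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def train_bpe(text, max_merges):
    """Learn up to `max_merges` merges, starting from single characters."""
    symbols, merges = list(text), []
    for _ in range(max_merges):
        pair = most_frequent_pair(symbols)
        if pair is None:
            break
        symbols = merge_pair(symbols, pair)
        merges.append(pair)
    return symbols, merges

symbols, merges = train_bpe("low lower lowest", max_merges=2)
print(merges)
print(symbols)
```

After two merges the frequent sequence `low` has become a single symbol, while the rarer endings (`e`, `r`, `s`, `t`) are still separate pieces. That is the "common stays whole, rare gets split" behaviour in miniature.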

What changes when I switch encoder?

cl100k_base is what GPT-3.5 and GPT-4 use (~100k tokens in its vocabulary). o200k_base is the newer one used by GPT-4o (~200k tokens — bigger vocabulary, fewer chunks for the same text). gpt2 is the original 2019 vocabulary (~50k) — you'll usually see more tokens for the same input, especially for code or multilingual text.

And what are "embeddings"?

The token ID is just an integer; it carries no meaning on its own. Inside the model, each ID is used to look up a row in a big table: a vector of hundreds to thousands of numbers, depending on the model (768 in the smallest GPT-2, 12,288 in the largest GPT-3). That vector is what the rest of the network actually works with. Tokens that appear in similar contexts end up with similar vectors, which is why words with similar meanings land close together. This is called an embedding: the bridge between human text and machine arithmetic.
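As a rough illustration of the lookup, here is a sketch with a made-up table. Everything in it is an assumption for demonstration: the token IDs, the 4-number vectors, and the meanings attached to them are invented, whereas real models learn far larger tables during training. Cosine similarity is used here as one common way to compare vectors.

```python
import math

# Hypothetical embedding table: one row (vector) per token ID.
# The values are invented for illustration only.
embedding_table = {
    0: [0.9, 0.1, 0.0, 0.2],  # pretend this ID stands for "cat"
    1: [0.8, 0.2, 0.1, 0.3],  # pretend this ID stands for "kitten"
    2: [0.0, 0.9, 0.8, 0.1],  # pretend this ID stands for "invoice"
}

def embed(token_ids):
    """Look up one vector per token ID."""
    return [embedding_table[t] for t in token_ids]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat, kitten, invoice = embed([0, 1, 2])
print("cat vs kitten: ", round(cosine_similarity(cat, kitten), 3))
print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

With these invented numbers, the first pair scores close to 1 and the second close to 0, which is the sense in which related tokens have "similar" vectors.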

What's the punchline?

When someone tells you "the AI said X confidently", remember: the AI didn't read X. It read these chunks, looked up their vectors, did some maths, and produced more numbers. Whatever feeling of confidence it projected has no special connection to your original meaning. Confidence in language is not confidence in truth.

How we got here — a short timeline of tokenisation

Tokenisation looks like an arbitrary modern design choice. It isn't. It is the trailing end of a thirty-year argument about how to make computers handle text without drowning in vocabulary.

  1. 1994
    Philip Gage publishes BPE — as a compression algorithm. The trick is simple: scan a file, find the most common pair of bytes, replace them with one new symbol, and repeat. Born in The C Users Journal, no machine learning involved.
  2. 2015
    Word-level translation hits a wall. Neural translation models can't deal with rare or unseen words: every Tamil suffix, every product code, every typo breaks them.
  3. 2016
    Sennrich, Haddow & Birch repurpose BPE for language. They apply Gage's recipe to characters instead of bytes, so rare words split into known subword units. Translation quality jumps. Subword tokenisation becomes the default.
  4. 2019
    OpenAI ships GPT-2 with a ~50,000-token BPE vocabulary (r50k_base, what we call gpt2 here). The vocabulary is trained on web text — so English compresses well, code and other languages do not.
  5. 2022-2023
    GPT-3.5 and GPT-4 adopt cl100k_base. The vocabulary roughly doubles to ~100,000 tokens; more common words and code patterns now fit in one token, which cuts cost and latency on most prompts.
  6. 2024
    GPT-4o ships o200k_base: ~200,000 tokens, deliberately tuned for multilingual and code use. The same paragraph in Chinese or Korean now costs noticeably fewer tokens than on cl100k_base, with the largest savings in languages the older vocabulary handled worst.
  7. Now
    You are reading the same algorithm Philip Gage wrote in 1994 — trained on a planet's worth of text, charged by the token, and treated as if it understood you. It does not. It compresses you.