01 — Mechanism reveal

How AI sees your text.

Type something. Watch it become tokens. Switch encoders to see how different models split the same text. Drop down to bytes to see why a single emoji can cost you four tokens. Then look at what your "system prompt" actually looks like — same stream, no special treatment from the model.

Encoder: switch to compare

Your text

What the model sees

loading encoder…

Token IDs — the integers the model actually receives

—

Write a system prompt in the left box and a user message in the right box. Below, the same two texts appear stitched together as one flat token stream — the chat-template markers (<|im_start|>, <|im_end|>) are just more tokens. The model has no separate "authority channel" for the system prompt: it sees the whole stream at once.

Chat template

added by the host

<|im_start|>system\n

System prompt

what the operator wrote

Chat template

switches from system to user

<|im_end|>\n<|im_start|>user\n

User message

what the user wrote

Chat template

model takes over here

<|im_end|>\n<|im_start|>assistant\n

What the model actually sees

one flat stream — colour shows source

type a system prompt and a user message above…

chat-template tokens system-prompt tokens user-message tokens

0total tokens

0template overhead

0system tokens

0user tokens

The punchline. A "system prompt" is not a separate channel with extra authority. It is just text that happens to come earlier in the same token stream. If a user message contains text that looks like a system instruction, the model has no built-in way to know it shouldn't follow it — only patterns it picked up in training. This is the mechanism behind every prompt-injection story you read.

0 tokens

0 characters

0 utf-8 bytes

— density

Pre-loaded examples

What is a token, and why should I care?

The model never saw words.

A token is a small chunk of text — sometimes a whole word, sometimes a few letters, sometimes a single punctuation mark. The model reads your message as a list of these chunks, not as a sentence you wrote.

Why not just one character at a time?

Two reasons. First, characters carry too little meaning — the model would need a much longer attention span. Second, whole words explode the vocabulary: every spelling, every plural, every typo would need its own slot. Tokens are the compromise: common words stay whole, rare words get split into known pieces.

What is BPE — and why do I keep hearing about it?

The standard recipe is called byte-pair encoding (BPE). It started life as a 1994 compression trick by Philip Gage — find the most frequent pair of bytes, replace it with a new symbol, repeat. Sennrich, Haddow & Birch (2016) adapted it for translation; OpenAI picked it up for GPT-2. The result is a vocabulary that compresses common English into one token and stretches uncommon things (long Tamil names, emoji, source code) into many.

What changes when I switch encoder?

cl100k_base is what GPT-3.5 and GPT-4 use (~100k tokens in its vocabulary). o200k_base is the newer one used by GPT-4o (~200k tokens — bigger vocabulary, fewer chunks for the same text). gpt2 is the original 2019 vocabulary (~50k) — you'll usually see more tokens for the same input, especially for code or multilingual text.

And what are "embeddings"?

The token ID is just an integer — it carries no meaning on its own. Inside the model, each ID is used to look up a row in a big table: a vector of around 1,000 numbers (sometimes 4,000, depending on the model). That vector is what the rest of the network actually works with. Two words with similar meanings end up with similar vectors. This is called an embedding — the bridge between human text and machine arithmetic.

What's the punchline?

When someone tells you "the AI said X confidently", remember: the AI didn't read X. It read these chunks, looked up their vectors, did some maths, and produced more numbers. Whatever feeling of confidence it projected has no special connection to your original meaning. Confidence in language is not confidence in truth.

How we got here — a short timeline of tokenisation

Tokenisation looks like an arbitrary modern design choice. It isn't. It is the trailing end of a thirty-year argument about how to make computers handle text without drowning in vocabulary.

1994
Philip Gage publishes BPE — as a compression algorithm. The trick is simple: scan a file, find the most common pair of bytes, replace them with one new symbol, and repeat. Born in The C Users Journal, no machine learning involved.
2015
Word-level translation hits a wall. Neural translation models can't deal with rare or unseen words: every Tamil suffix, every product code, every typo breaks them.
2016
Sennrich, Haddow & Birch repurpose BPE for language. Apply Gage's recipe to characters instead of bytes; rare words split into known sub-word units. Translation quality jumps. Subword tokenisation becomes the default.
2019
OpenAI ships GPT-2 with a ~50,000-token BPE vocabulary (r50k_base, what we call gpt2 here). The vocabulary is trained on web text — so English compresses well, code and other languages do not.
2022
GPT-3.5 and GPT-4 adopt cl100k_base. Vocabulary doubles to ~100,000 tokens; common words and code patterns now fit in one token, cutting cost and latency on most prompts.
2024
GPT-4o ships o200k_base — ~200,000 tokens, deliberately tuned for multilingual and code use. The same paragraph in Chinese or Korean now costs roughly half as many tokens as on cl100k_base.
Now
You are reading the same algorithm Philip Gage wrote in 1994 — trained on a planet's worth of text, charged by the token, and treated as if it understood you. It does not. It compresses you.

How AI sees your text.

Your text

What the model sees

Token IDs — the integers the model actually receives

Same text, five levels down

Chat template

System prompt

Chat template

User message

Chat template

What the model actually sees

The model never saw words.

Why not just one character at a time?

What is BPE — and why do I keep hearing about it?

What changes when I switch encoder?

And what are "embeddings"?

What's the punchline?