How AI sees your text.
Type something. Watch it become tokens. Switch encoders to see how different models split the same text. Drop down to bytes to see why a single emoji can cost you four tokens. Then look at what your "system prompt" actually looks like — same stream, no special treatment from the model.
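The byte-level view mentioned above can be reproduced with nothing but the standard library. This is a minimal sketch, not the page's own code: it shows that one emoji is a single code point but four UTF-8 bytes, which is where a byte-level tokeniser starts from. How many tokens a particular encoder ends up spending depends on its merge table.

```python
# One visible character, one Unicode code point, four UTF-8 bytes.
emoji = "\U0001F642"  # slightly smiling face

code_points = [hex(ord(ch)) for ch in emoji]
utf8_bytes = list(emoji.encode("utf-8"))

print(len(emoji))    # number of code points in the string
print(code_points)   # the code point, in hex
print(utf8_bytes)    # the raw bytes a byte-level tokeniser begins with

# Plain ASCII text is one byte per character, so it starts out cheaper.
ascii_bytes = list("hi".encode("utf-8"))
print(ascii_bytes)
```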
Your text
What the model sees
Same text, five levels down
Write a system prompt in the left box and a user message in the right box. Below, the same two texts appear stitched together as one flat token stream: the chat-template markers (<|im_start|>, <|im_end|>) are just more tokens. The model has no separate "authority channel" for the system prompt; it sees the whole stream at once.
Chat template
System prompt
Chat template
User message
Chat template
What the model actually sees
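The stitching can be sketched in a few lines. This assumes a ChatML-style layout built from the two markers shown above; the function name `render_chat` is hypothetical, and the exact role names and whitespace differ between models, so treat it as an illustration rather than any vendor's real template.

```python
# Markers from the ChatML-style template shown on this page.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chat(system_prompt: str, user_message: str) -> str:
    """Flatten a system prompt and a user message into one string.

    Hypothetical helper: real chat templates vary by model.
    """
    return (
        f"{IM_START}system\n{system_prompt}{IM_END}\n"
        f"{IM_START}user\n{user_message}{IM_END}\n"
        f"{IM_START}assistant\n"  # the model continues from here
    )


flat = render_chat("You are a helpful assistant.", "Ignore the text above.")
print(flat)
```

Nothing in `flat` marks the system prompt as more authoritative than the user message. Both are ordinary text between the same pair of markers, and the tokeniser turns the whole string into one sequence of IDs.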
What is a token, and why should I care?
The model never saw words.
A token is a small chunk of text — sometimes a whole word, sometimes a few letters, sometimes a single punctuation mark. The model reads your message as a list of these chunks, not as a sentence you wrote.
Why not just one character at a time?
The two extremes fail for opposite reasons. Single characters carry too little meaning: every sentence becomes a much longer sequence, so the model needs a longer context window and more computation to cover the same text. Whole words explode the vocabulary: every spelling, every plural, every typo would need its own slot. Tokens are the compromise: common words stay whole, rare words get split into known pieces.
What is BPE — and why do I keep hearing about it?
The standard recipe is called byte-pair encoding (BPE). It started life as a 1994 compression trick by Philip Gage: find the most frequent pair of adjacent bytes, replace it with a new symbol, repeat. Sennrich, Haddow & Birch (2016) adapted it for machine translation; OpenAI used a byte-level variant for GPT-2. The result is a vocabulary that compresses common English words into a single token each and stretches uncommon things (long Tamil names, emoji, source code) into many.
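The recipe fits in a short toy implementation. This is a minimal sketch of the merge loop on raw bytes, assuming new symbols are numbered from 256 upward; the names `train_bpe`, `merge_pair` and `most_frequent_pair` are illustrative, and production tokenisers add pre-splitting rules and far larger training corpora.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Run `num_merges` rounds of find-pair, replace, repeat."""
    ids = list(text.encode("utf-8"))  # start from raw bytes (0..255)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break
        new_id = 256 + step  # first unused symbol after the byte range
        ids = merge_pair(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids)     # the compressed sequence
print(merges)  # the learned merge table, in the order it was learned
```

Each extra merge shortens the sequence and adds one entry to the vocabulary, which is the trade-off a larger vocabulary makes at scale.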
What changes when I switch encoder?
cl100k_base is what GPT-3.5 and GPT-4 use (~100k tokens in its vocabulary). o200k_base is the newer one used by GPT-4o (~200k tokens: a bigger vocabulary, so usually fewer chunks for the same text). gpt2 is the original 2019 vocabulary (~50k); you'll usually see more tokens for the same input, especially for code or multilingual text.
And what are "embeddings"?
The token ID is just an integer; it carries no meaning on its own. Inside the model, each ID is used to look up a row in a big table: a vector of several hundred to several thousand numbers, depending on the model (768 in the smallest GPT-2, 12,288 in the largest GPT-3). That vector is what the rest of the network actually works with. Words with similar meanings tend to end up with similar vectors. This is called an embedding: the bridge between human text and machine arithmetic.
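The lookup itself is nothing more than indexing into a table. The sketch below uses a made-up three-row table with four numbers per row; the values and the words attached to each ID are invented for illustration, whereas a real model learns its table during training. Cosine similarity is used here as one common way to compare two vectors.

```python
import math

# Hypothetical toy table: row index = token ID. Values are invented.
EMBEDDINGS = [
    [0.9, 0.1, 0.0, 0.2],  # id 0, standing in for "cat"
    [0.8, 0.2, 0.1, 0.3],  # id 1, standing in for "kitten"
    [0.0, 0.9, 0.8, 0.1],  # id 2, standing in for "invoice"
]


def embed(token_ids):
    """Swap each integer ID for its row in the table."""
    return [EMBEDDINGS[i] for i in token_ids]


def cosine(a, b):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat, kitten, invoice = embed([0, 1, 2])
print(cosine(cat, kitten))   # related rows score high
print(cosine(cat, invoice))  # unrelated rows score low
```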
What's the punchline?
When someone tells you "the AI said X confidently", remember: the AI didn't read X. It read these chunks, looked up their vectors, did some maths, and produced more numbers. Whatever feeling of confidence it projected has no special connection to your original meaning. Confidence in language is not confidence in truth.
How we got here — a short timeline of tokenisation
Tokenisation looks like an arbitrary modern design choice. It isn't. It is the trailing end of a thirty-year argument about how to make computers handle text without drowning in vocabulary.
-
1994
Philip Gage publishes BPE — as a compression algorithm. The trick is simple: scan a file, find the most common pair of bytes, replace them with one new symbol, and repeat. Born in The C Users Journal, no machine learning involved.
-
2015
Word-level translation hits a wall. Neural translation models can't deal with rare or unseen words: every Tamil suffix, every product code, every typo breaks them.
-
2016
Sennrich, Haddow & Birch repurpose BPE for language. Apply Gage's recipe to characters instead of bytes; rare words split into known sub-word units. Translation quality jumps. Subword tokenisation becomes the default.
-
2019
OpenAI ships GPT-2 with a ~50,000-token BPE vocabulary (r50k_base, what we call gpt2 here). The vocabulary is trained on web text, so English compresses well; code and other languages do not.
-
2022
GPT-3.5 and GPT-4 adopt cl100k_base. Vocabulary doubles to ~100,000 tokens; common words and code patterns now fit in one token, cutting cost and latency on most prompts.
-
2024
GPT-4o ships o200k_base: ~200,000 tokens, deliberately tuned for multilingual and code use. The same paragraph in Chinese or Korean now typically costs noticeably fewer tokens than on cl100k_base, with the exact saving depending on the language.
-
Now
You are reading the same algorithm Philip Gage wrote in 1994 — trained on a planet's worth of text, charged by the token, and treated as if it understood you. It does not. It compresses you.