
First Came The Tokenizer—Understanding The Unsung Hero of LLMs
Why the Humble Tokenizer Is Where It All Starts
Before an LLM has a chance to process your message, the tokenizer has to digest the text into a stream of (usually) 16-bit integers. It’s not glamorous work, but make no mistake: this step has major implications for your model’s context window, training speed, prompt cost, and whether or not your favorite Unicode emoji makes it out alive.
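To make that concrete, here’s a minimal sketch using OpenAI’s tiktoken library as one example (any production tokenizer behaves the same way; the exact IDs depend entirely on the vocabulary you load):

```python
# Minimal sketch: text in, integer token IDs out, and back again.
# Uses tiktoken (pip install tiktoken) purely as an example vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a byte-level BPE vocabulary
ids = enc.encode("Tokenizers turn text into integers 🤖")
print(ids)                # a short list of integer token IDs
print(enc.decode(ids))    # round-trips back to the original string, emoji included
```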
Interactive Visual Tokenizer: www.ahmadosman.com/tokenizer
Let’s run through the main flavors of tokenizers, in true “what do I actually care about” fashion:
Word-level tokenization: the OG tokenizer. You split on whitespace and assign every word an ID. It’s simple, but your vocab balloons to 500,000+ entries for English alone, and “dog” and “dogs” end up as entirely separate tokens.
Character-level tokenization: shrink the vocab down to size, so every letter or symbol is its own token. Problem: even simple words turn into sprawling token chains, which slows training and loses the semantic chunking that makes LLMs shine.
Subword tokenization: this is where the modern magic happens. Break rare words into pieces, keep common ones whole. This is the approach adopted by basically every transformer since 2017: not too big, not too small, just right for GPU memory and token throughput. (A toy comparison of all three flavors follows below.)
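Here’s a toy side-by-side of the three flavors. The word- and character-level tokenizers are deliberately naive, and the subword step borrows tiktoken’s BPE vocabulary purely as an illustration; the exact splits depend on whichever model you actually load:

```python
# Toy comparison of word-level, character-level, and subword tokenization.
import tiktoken

text = "the dog chased the dogs"

# Word-level: split on whitespace, one ID per unique word.
word_vocab = {w: i for i, w in enumerate(dict.fromkeys(text.split()))}
print("word-level:", [word_vocab[w] for w in text.split()])  # "dog" and "dogs" get different IDs

# Character-level: tiny vocab, but the sequence sprawls.
char_vocab = {c: i for i, c in enumerate(sorted(set(text)))}
char_ids = [char_vocab[c] for c in text]
print("char-level:", len(char_ids), "tokens for a 5-word sentence")

# Subword (BPE): common words stay whole, rare words break into pieces.
enc = tiktoken.get_encoding("cl100k_base")
for word in ["dog", "dogs", "antidisestablishmentarianism"]:
    print("subword:", word, "->", [enc.decode([t]) for t in enc.encode(word)])
```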
Here are the big players, how they work, and why they matter:
How it works:
Why it matters: