
First Came The Tokenizer
Why the Humble Tokenizer Is Where It All Starts

This is blogpost #5 in my 101 Days of Blogging series. If it sparks anything (ideas, questions, or critique), my DMs are open. Hope it gives you something useful to walk away with.
Before Everything
Before an LLM has a chance to process your message, the tokenizer has to digest the text into a stream of integer token IDs (typically small enough to fit in 16 or 32 bits). It’s not glamorous work, but make no mistake: this step has major implications for your model’s context window, training speed, prompt cost, and whether or not your favorite Unicode emoji makes it out alive.
www.ahmadosman.com/tokenizer
Tokenizers: The Hidden Operators Behind LLMs
- Models speak numbers, not words. Tokenizers are the translation layer between your text and the neural network’s continuous vector space. No tokenizer, no language model (unless you’ve got half a million neurons dedicated to memorizing the English dictionary).
- They determine how much your prompt “really” costs. A string that looks tiny to you can balloon into dozens of tokens (Chinese, Japanese, and Korean, aka CJK, writing systems are the most prone to this); see the quick token-count check right after this list.
- They define what’s “unknown.” Choosing the wrong tokenizer could potentially lead to half your prompt getting replaced with [UNK] tokens, blowing up meaning and introducing all sorts of headaches downstream. Picking the right tokenizer for the task is crucial, and you minimize that mess by striking a balance between vocabulary size and granularity.
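To make the cost point concrete, here’s a quick sanity check using tiktoken’s o200k_base encoding; the sample strings are arbitrary, and exact counts will vary by tokenizer:

```python
# Rough token-count comparison across scripts (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's byte-level BPE vocab

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Japanese": "素早い茶色の狐がのろまな犬を飛び越える。",
    "Emoji": "🦊🐶🎉",
}

for name, text in samples.items():
    tokens = enc.encode(text)
    print(f"{name:9s} {len(text):3d} chars -> {len(tokens):3d} tokens")
```

The character-to-token ratio is what your API bill and context window actually see, and it swings hard between scripts.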
From Whitespace to Subwords: A Lightning Tour
Let’s run through the main flavors of tokenizers, in true “what do I actually care about” fashion:
Word-level
The OG tokenizer. You split on whitespace and assign every word an ID. It’s simple, but your vocab balloons to 500,000+ entries for English alone, and “dog” and “dogs” end up as two completely unrelated vocabulary entries.
Character-level
Shrink the vocab down to size: now every letter or symbol is its own token. Problem: even simple words turn into sprawling token chains, which slows training and loses the semantic chunking that makes LLMs shine.
Subword-level
This is where the modern magic happens. Break rare words into pieces, keep common ones whole. This is the approach adopted by basically every transformer since 2017: not too big, not too small, just right for GPU memory and token throughput.
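A toy comparison makes the tradeoff obvious; the subword split below is purely illustrative, not pulled from any real vocabulary:

```python
# Same sentence, three tokenization granularities (subword split is made up).
sentence = "unbelievable results"

word_level = sentence.split()                          # huge vocab, short sequence
char_level = list(sentence)                            # tiny vocab, long sequence
subword_level = ["un", "believ", "able", " results"]   # the middle ground

print(len(word_level), word_level)        # 2 tokens
print(len(char_level), char_level)        # 20 tokens
print(len(subword_level), subword_level)  # 4 tokens
```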
Tokenizer Algorithms
Here are the big players, how they work, and why they matter:
Byte-Pair Encoding (BPE) — Used in GPT family, RoBERTa, DeBERTa
How it works:
- Starts with all unique characters (or bytes) as the base alphabet
- Repeatedly merges the most frequent adjacent pair of symbols (characters or previously merged chunks)
- Continues until the target vocabulary size is reached
Why it matters:
- Reduces vocab size while capturing frequent subword units
- Helps with rare words, typos, and multilingual text
- Strikes a balance between simplicity and performance
- Widely used due to its effectiveness and generality
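Here’s a minimal sketch of that merge loop on a toy corpus; it follows the classic character-level formulation and skips the production details (byte fallback, pre-tokenization, serialization):

```python
# Toy BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across a {tokenized word: frequency} vocab."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word starts as a tuple of characters, mapped to its frequency.
vocab = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}

for step in range(6):                      # run a handful of merges
    pairs = pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)       # most frequent adjacent pair wins
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best} -> {''.join(best)}")
```

Each learned merge becomes a vocabulary entry, and at inference time the same merges are replayed greedily over new text.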
WordPiece — Used in BERT and related models
How it works:
- Similar to BPE, but merges pairs that maximize overall likelihood based on a probabilistic model
- Often creates unexpected subword chunks for better statistical fit
- Uses special prefixes (like “##”) for tokens inside a word
Why it matters:
- More flexible and context-sensitive splits
- Slightly improves performance, especially for morphologically rich languages
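The merge criterion usually described for WordPiece differs from BPE’s raw frequency count: each candidate pair is scored by freq(pair) / (freq(first) × freq(second)), favoring merges that improve corpus likelihood rather than merges that are merely common. A tiny sketch with invented counts:

```python
# WordPiece-style pair scoring (counts are made up for illustration).
# "##" marks a piece that continues a word, per the convention mentioned above.
symbol_freq = {"un": 40, "##believ": 12, "##able": 30, "##ing": 55}
pair_freq = {("un", "##believ"): 10, ("##believ", "##able"): 9, ("##believ", "##ing"): 3}

def wordpiece_score(pair):
    a, b = pair
    return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

best = max(pair_freq, key=wordpiece_score)
print(best, round(wordpiece_score(best), 5))  # the pair WordPiece would merge next
```

Note that plain BPE would merge ("un", "##believ") here, since it has the highest raw count; the likelihood-style score picks a different pair.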
Unigram LM — Used in T5, mBART, XLNet
How it works:
- Starts with a large pool of possible subwords, each with an assigned probability
- Iteratively prunes away the least likely subwords
- Finds the tokenization that maximizes the product of the subword probabilities for each input
Why it matters:
- Probabilistic, pruning-based method gives flexible, efficient vocabularies
- Well-suited for diverse or multilingual data
- Famously implemented in SentencePiece
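Here’s a toy version of that “maximize the product of probabilities” step, using made-up piece probabilities and a simple dynamic-programming search; real implementations (SentencePiece included) do the same thing over a trained vocabulary of tens of thousands of pieces:

```python
# Toy Unigram LM segmentation: pick the split whose pieces have the highest
# total log-probability. Piece probabilities below are invented.
import math

piece_logprob = {
    "un": math.log(0.10), "believ": math.log(0.02), "able": math.log(0.05),
    "unbeliev": math.log(0.001),
    # single characters as a fallback so every string stays tokenizable
    "u": math.log(0.01), "n": math.log(0.01), "b": math.log(0.01),
    "e": math.log(0.02), "l": math.log(0.02), "i": math.log(0.02),
    "v": math.log(0.01), "a": math.log(0.02),
}

def best_segmentation(text):
    # best[i] = (score, pieces) for the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in piece_logprob and best[start][1] is not None:
                score = best[start][0] + piece_logprob[piece]
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [piece])
    return best[-1]

score, pieces = best_segmentation("unbelievable")
print(pieces, round(score, 2))  # ['un', 'believ', 'able'] beats the character-level split
```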
SentencePiece — Designed for multilingual/agglutinative languages
How it works:
- A toolkit that implements both BPE and Unigram LM
- Operates directly on raw text, treating whitespace as just another symbol, so no whitespace-based pre-tokenization is needed
- Treats input as a raw stream (great for languages with no spaces)
Why it matters:
- Handles languages without spaces (e.g., Japanese, Chinese)
- Robust to messy, real-world text
- Flexible for diverse token boundaries that break other tokenizers
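A hedged sketch of training and using a SentencePiece Unigram model; “corpus.txt”, the vocab size, and the coverage value are placeholders you would tune for your own data:

```python
# Train a Unigram model on raw text, then tokenize with it (pip install sentencepiece).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # placeholder: one sentence per line, raw text, any language
    model_prefix="spm_unigram",  # writes spm_unigram.model and spm_unigram.vocab
    vocab_size=8000,             # illustrative size
    model_type="unigram",        # "bpe" also works
    character_coverage=0.9995,   # keep rare CJK characters in the vocab
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
print(sp.encode("東京タワーに行きました", out_type=str))   # pieces, no spaces required
print(sp.encode("Tokenizers are sneaky.", out_type=int))  # token IDs
```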
Byte-level BPE (tiktoken) — Used in GPT-2, GPT-4o, OpenAI API
How it works:
- Uses raw bytes (0–255) as the base alphabet
- Every character, emoji, or symbol is broken into bytes; BPE merges are then applied
- Totally agnostic to language or script
Why it matters:
- Universal: can tokenize any Unicode text (emojis, all scripts, etc.)
- No hand-tuned rules needed
- Tradeoff: more tokens for common words in most languages (inflates token counts), but ensures universal coverage and robustness
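Here’s the universal-coverage property in action; the sample string is arbitrary, and exact IDs depend on the encoding you pick:

```python
# Byte-level BPE round trip with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # GPT-4-era vocab; "o200k_base" is GPT-4o's

text = "naïve café 🌍"
ids = enc.encode(text)

print(ids)                        # integer token IDs
print(len(text), "chars ->", len(ids), "tokens")
assert enc.decode(ids) == text    # lossless round trip: no [UNK], emoji included
```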
Vocabulary Size: A Tradeoff
Scaling law nerds (hey, that’s us) know: as you scale up your model, you need to scale your tokenizer’s vocabulary too, because it determines how efficiently your model chews through raw text.
Take Llama 3, for example: its tokenizer jumped to a whopping 128K vocab size, compared to Llama 2’s more modest 32K. Why? With a bigger vocab, each token can represent longer or more meaningful text chunks. That means your input sequences get shorter (the model has fewer tokens to process for the same amount of text), which can speed up both training and inference. You’re basically packing more info into every step.
But there’s no free lunch. A larger vocabulary means your embedding matrix (the lookup table mapping token IDs to vectors) grows right along with it. More tokens = more rows = more parameters = more GPU RAM required. For massive models or memory-constrained deployments, that can be a real pain. You also run into diminishing returns: after a certain point, adding more tokens doesn’t buy you much, but still taxes your hardware.
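To put rough numbers on that, here’s a back-of-envelope estimate of the embedding rows alone, using the published 4096 hidden size of the 7B/8B-scale Llama models and bf16 weights; treat it as an estimate, not an exact accounting of either model:

```python
# Embedding-table cost scales linearly with vocab size.
def embedding_cost(vocab_size, hidden_dim=4096, bytes_per_param=2):
    params = vocab_size * hidden_dim               # one row of hidden_dim per token
    return params, params * bytes_per_param / 1e9  # (parameters, GB in bf16)

for name, vocab in [("Llama 2 (32K vocab)", 32_000), ("Llama 3 (128K vocab)", 128_256)]:
    params, gb = embedding_cost(vocab)
    print(f"{name}: {params / 1e6:.0f}M embedding params, ~{gb:.2f} GB in bf16")
```

And that’s before counting the output projection, which is the same shape again if it isn’t tied to the embeddings.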
On the flip side, if your vocabulary is too small, your tokenizer starts breaking up words into lots of tiny pieces (“subwords”). Your model ends up wrestling with long, fragmented sequences, burning compute on basic reconstruction instead of actual language understanding. That’s inefficient, especially for languages with rich morphology or lots of rare words.
So where’s the “just right” zone? It depends. The ideal vocab size is a balancing act:
- Compute budget: More vocab = bigger embedding = more FLOPs/memory.
- Inference setup: Shorter sequences can mean faster inference, but not if you’re swapping to disk for embeddings.
- Languages covered: Multilingual or domain-specific tokenizers might need bigger vocabularies to avoid splintering key terms.
At the end of the day, picking a vocab size isn’t just a technical tweak, it’s a core design choice that shapes everything downstream, from model cost to multilingual robustness. Getting this wrong means you’re either bottlenecked by hardware or wasting compute on needlessly long sequences.
Tokenizer Quirks: Fun Ways To Sabotage Yourself
- Whitespace weirdness. Some tokenizers fold a leading space into the following token (GPT-2), others prepend a dummy space marker of their own (Llama’s SentencePiece). The result is mystery bugs, mismatched generations, and hours of “why is my model doing this” (quick demo after this list).
- Special tokens. [CLS], <s>, <pad>, <|endoftext|>, etc. They count toward your length limit and love to leak into generations if you forget to mask them.
- Multilingual headaches. Using an English-centric vocab for CJK or agglutinative languages? Get ready for triple token counts and a swamp of [UNK]s. Use a language-aware SentencePiece to avoid the suffering.
- Version drift. Mismatch your tokenizer and model vocab file, just once, and expect nothing but total garbage output.
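Here’s the quick demo promised above, using tiktoken’s GPT-2 encoding; the exact IDs don’t matter, only that they differ:

```python
# Leading-space quirk and special tokens with GPT-2's byte-level BPE (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.encode("hello"))     # one ID for "hello"
print(enc.encode(" hello"))    # a different ID: the leading space is part of the token

# Special tokens occupy context too: "<|endoftext|>" is a single reserved ID.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```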
Picking a Tokenizer for Your LLM Playground: My Cheat Sheet
Want to skip the theory and just get your hands dirty? Here’s the cheat sheet:
If you care about…
Training speed & GPU RAM:
- Grab: Byte-level BPE (tiktoken)
- Why: Minimal merges, Rust-fast decoding, OpenAI’s secret sauce.
On-device (mobile) inference:
- Grab: WordPiece
- Why: Smaller embedding tables, less RAM pain, still competitive on sequence length.
Cross-language coverage:
- Grab: SentencePiece Unigram
- Why: Train once, cover dozens of scripts and languages.
DIY research tinkering:
- Grab: Hugging Face tokenizers
- Why: Python bindings, Rust core, built-in introspection (see the sketch right after this list).
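For that last row, a minimal sketch with the Hugging Face tokenizers library; the toy corpus and vocab size are placeholders, so the learned merges will be meaningless, but the workflow is the same at scale:

```python
# Train a tiny BPE tokenizer from scratch with Hugging Face tokenizers
# (pip install tokenizers). Corpus and vocab_size are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=1000, special_tokens=["[UNK]", "<s>", "</s>"])
corpus = [
    "Tokenizers are the hidden operators behind LLMs.",
    "Models speak numbers, not words.",
    "The tokenizer digests text into a stream of integers.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

encoding = tokenizer.encode("Tokenizers are sneaky.")
print(encoding.tokens)  # subword pieces
print(encoding.ids)     # integer IDs, ready for a model
```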
Final Words: The Humble Tokenizer Is Doing More Than You Think
Tokenizers are never the sexiest part of an LLM stack, but their impact is outsized. Understand them, and you control real prompting costs, inference throughput, and the very shape of what your models can express.
Next time your model is spitting out garbage or can’t fit your prompt, don’t blame the GPUs 😝
Give a little respect to the humble tokenizer.
In the meantime, if you want to see tokenizers in action, check out www.ahmadosman.com/tokenizer.
PS: Remember, you can always use my DeepResearch Workflow to learn more, if you’re stuck, need something broken down to first principles, want material tailored to your level, need to identify gaps, or just want to explore deeper.