
-
First Came The Tokenizer—Understanding The Unsung Hero of LLMs
Posted on
7 Minutes
Why the Humble Tokenizer Is Where It All Starts
Before an LLM has a chance to process your message, the tokenizer has to digest the text into a stream of (usually) 16-bit integers. It’s not glamorous work, but make no mistake: this step has major implications for your model’s context window, training speed, prompt cost, and whether or not your favorite Unicode emoji makes it out alive.
Interactive Visual Tokenizer: www.ahmadosman.com/tokenizer
Let’s run through the main flavors of tokenizers, in true “what do I actually care about” fashion:
Word-level tokenization: The OG tokenizer. You split on whitespace and assign every word an ID. It’s simple, but your vocab balloons to 500,000+ entries for English alone, and “dog” and “dogs” are treated as completely different tokens.
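A word-level tokenizer really is this simple. Here's a toy sketch (a real vocab would be built from a large corpus, not one sentence) that also shows the “dog” vs. “dogs” problem:

```python
# Toy word-level tokenizer: split on whitespace, assign each unique word an ID.
# A sketch only -- real word-level vocabs are built from a large corpus.
corpus = "the dog chased the dogs"

vocab = {}
for word in corpus.split():
    vocab.setdefault(word, len(vocab))  # first occurrence gets the next ID

ids = [vocab[w] for w in corpus.split()]
print(vocab)  # {'the': 0, 'dog': 1, 'chased': 2, 'dogs': 3}
print(ids)    # [0, 1, 2, 0, 3]
```

Note that "dog" and "dogs" end up as unrelated IDs, even though they share almost all their meaning.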
Character-level tokenization: Shrink the vocab down to size: now every letter or symbol is its own token. The problem: even simple words turn into sprawling token chains, which slows training and loses the semantic chunking that makes LLMs shine.
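The character-level extreme is even shorter to sketch. Using Unicode code points as IDs (one of many possible ID schemes), every word becomes a chain as long as its spelling:

```python
# Character-level tokenization sketch: every character is its own token,
# here using Unicode code points as the IDs.
text = "dogs"
ids = [ord(c) for c in text]
print(ids)  # [100, 111, 103, 115]

# The vocab stays tiny, but sequences get long fast:
long_word = "internationalization"
print(len(long_word))  # 20 tokens for a single word
```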
Subword tokenization: This is where the modern magic happens. Break rare words into pieces, keep common ones whole. It’s the approach adopted by basically every transformer since 2017: not too big, not too small, just right for GPU memory and token throughput.
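The best-known subword scheme is byte-pair encoding (BPE). The core training loop fits in a few lines: count adjacent symbol pairs, merge the most frequent pair into a new token, repeat. A minimal sketch on a toy corpus (real tokenizers train on billions of tokens and add byte-level fallbacks, pre-tokenization rules, and more):

```python
from collections import Counter

# Minimal BPE training sketch: repeatedly merge the most frequent
# adjacent symbol pair. Toy corpus, illustrative only.
def bpe_merges(words, num_merges):
    vocab = Counter(tuple(w) for w in words)  # each word as a tuple of chars
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for word, freq in vocab.items():  # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After just two merges, the common stem "low" has become a single token while the rarer suffixes stay split: exactly the "keep common things whole, break rare things up" behavior described above.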
Here are the big players, how they work, and why they matter:
-
No, RAG Is NOT Dead!
Posted on
5 Minutes
RAG Is Dead?! Long Live Real AI!
On Friday, April 11, 2025, I hosted a space titled “Is RAG Dead?” with my friend Skyler Payne. Skyler is ex-Google, ex-LinkedIn, currently building cutting-edge systems at Wicked Data and helping enterprises bring real AI solutions to production. This space has been on my mind for quite some time now. There is this opinion floating around that “LLMs have 1M+ token context length now! We don’t need RAG anymore!” This isn’t just a bad take. It’s a misunderstanding of how actual AI systems are built. So, Skyler and I decided to host this space and talk it out together. What followed was one of the most technical, honest, no-bullshit convos I’ve had in a while about the actual role of RAG in 2025 and beyond.
If you don’t know me, I’m the guy with the 14x RTX 3090s Basement AI Server. I’ve been building and breaking AI systems for quite some time now, and I hold dual degrees in Computer Science and Data Science. I’m also running a home lab that looks like it’s trying to outcompute a small startup (because it kind of is). I go deep on LLMs, inference optimization, agentic workflows, and all the weird edge cases that don’t make it into the marketing decks.
Let’s break it down.
Skyler opened by asking the obvious but important question: when people say “RAG,” what do they even mean?
Because half the time, it’s being reduced to just vector search. Chunk some PDFs, run some cosine similarity, call it a day. But that’s not RAG.
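The reductive version being criticized here fits in a screenful of code. A minimal sketch: the embedding is a toy bag-of-words vector and the final LLM call is a hypothetical stand-in (`answer` just returns the assembled prompt), because the point is the shape of the pipeline, not the parts:

```python
import math
from collections import Counter

# Toy "chunk + cosine similarity + stuff into prompt" pipeline.
# embed() is a bag-of-words stand-in for a real embedding model.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query, docs):
    context = "\n".join(retrieve(query, docs))
    # In a real system this prompt would go to an LLM call.
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "The office closes at 6 pm on Fridays.",
    "Refunds are issued to the original payment method.",
]
print(answer("what is the refund policy", docs))
```

That's the whole "search bar duct-taped to an LLM" design: no query planning, no multi-hop retrieval, no reranking, no evaluation. Which is exactly the gap the discussion below is about.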
Real RAG, at least the kind that works at scale, is more than just a search bar duct-taped to an LLM.
Also, definitions matter. If your definition stops at vector search and mine includes multi-hop planning and agents, we’re not debating. Understanding that nuance is key before you even ask whether RAG is “dead.” Because until we agree on the same definition, we’re just yelling at each other for the wrong thing.
The main argument against RAG is that LLMs can now eat entire PDFs. “Why retrieve when you can just feed it all in?” It doesn’t work.
Skyler walked us through what enterprises actually have on their hands: