Agents · · ·

No, RAG Is NOT Dead!
Posted on Apr 11, 2025
5 Minutes
RAG Is Dead?! Long Live Real AI!

On Friday, April 11, 2025, I hosted a space titled “Is RAG Dead?” with my friend Skyler Payne . Skylar is ex-Google, ex-LinkedIn, currently building cutting-edge systems at Wicked Data and helping enterprises bring real AI solutions to production. This space has been on my mind for quite sometime now. There is this opinion floating around that “LLMs have 1M+ token context length now! We don’t need RAG anymore!” This isn’t just a bad take. It’s a misunderstanding of how actual AI systems are built. So, Skylar and I decided to host this space and talk it out together. What followed was one of the most technical, honest, no-bullshit convos I’ve had in a while about the actual role of RAG in 2025 and beyond.

If you don’t know me, I’m the guy with the 14x RTX 3090s Basement AI Server . I’ve been building and breaking AI systems for quite sometime now, I hold dual degrees in Computer Science and Data Science. I am also running a home lab that looks like it’s trying to outcompute a small startup (because it kind of is). I go deep on LLMs, inference optimization, agentic workflows, and all the weird edge cases that don’t make it into the marketing decks.

Let’s break it down.

Skyler opened by asking the obvious but important question: when people say “RAG,” what do they even mean?

Because half the time, it’s being reduced to just vector search. Chunk some PDFs, some cosine similarity, call it a day. But that’s not RAG.

Real RAG, at least the kind that works at scale, is more than just a search bar duct-taped to an LLM:

Also, definitions matter. If your definition stops at vector search and mine includes multi-hop planning and agents, we’re not debating. Understanding that nuance is key before you even ask whether RAG is “dead.” Because until we agree on the same definition, we’re just yelling at each other for the wrong thing.

The main argument against RAG is that LLMs can now eat entire PDFs. “Why retrieve when you can just feed it all in?” Doesn’t work.

Skyler walked us through what entreprises have on their hands:

Read More ...
Serving AI From The Basement — Part II: Agents, MoEs, Inference & More
Posted on Sep 18, 2024
Last Edited Sep 24, 2024 — 14 Minutes
Unpacking SWE Agentic Framework, MoEs, Batch Inference, and More

For about 3 weeks now, I have been working on a multi-agent system that simulates a team of Software Engineers; this system assigns projects, creates teams and adds members to them based on areas of expertise and need, and asks team members to build features, assign story points, have pair programming sessions together, etc. Started mainly for fun and exploration, however, last week the following paper was released: Agents in Software Engineering .

The paper delivers an overview of a framework that allows large language models play nicely within a sandbox for Software Engineering, and it cites several dozens of papers that implement task-specific agents. Since then, I have been a lot more motivated to get this agentic framework semi-decently put together, and it got me wondering: maybe it will beat Replit?

Overview of SWE Agentic Framework

Agents are Python scripts. Bash scripts. C++ programs. Or whatever. Agents are anything that can hit an OpenAI compatible API endpoint. Agents are anything that can talk with an inference engine sending inputs and receiving outputs. What makes them agentic is being permissive –while sandboxed– and having a few dozens of them running iterations for you to do A/B testing on.

I like playing with these toys because I do not know what the outcome might be. I really don’t. It differs from one model to another. Simple changes in Sampling Parameters might cause things to fully break or make things very interesting. It is a very fragile ecosystem right now.

However, I also believe that there is a very high possibility that what might not seem to be working with the current generation of models might be very feasible in a generation or two from now. So, I am going to build stupid toys, break them, iterate over them, and wait for a moment where something new, plug-and-play, becomes available for me throw in.

The time is 02:43 AM as I am writing this paragraph, Mr. Robot OST is playing in the background (all 8 volumes on loop, shuffled of course I am not an animal), and I just spent about ~5 hours on what I assumed would be a quick 2-3 mins task. In that time, I read about half a dozen quantization algorithms, another half dozen model architectures, and dove into GitHub exploring inference engines, libraries, and a lot of LLMOps tools that I was not aware of. I cannot sleep because I like when things work and I DO NOT like when things do not work. Stubbornness is essential when working in Software.

The vLLM inference engine, which I primarily use and is also widely utilized as a kernel in other engine implementations including SGlang , Aphrodite , and TensorRT-LLM , supposedly allows for Mixed Precision quantizations. However, the reality is more complex…

Well, as I said, it is complicated… My AI Server has 192GB of VRAM, and sometimes I would move my main RTX 4090 & RTX 3090 from my PC to the AI Server and that increases the VRAM to 240GB, and I am not typically a fan of that and neither is Tensor Parallelism. Llama 3.1 70B BF16 (Full Precision) has been my main driver model since release, and sometimes I switch to Llama 3.1 405B INT4 (Mixed Precision: 4-bits weights and 16-bits activation, aka W4A16).

Read More ...

Agents · · ·

No, RAG Is NOT Dead!

RAG Is Dead?! Long Live Real AI!

Serving AI From The Basement — Part II: Agents, MoEs, Inference & More

Unpacking SWE Agentic Framework, MoEs, Batch Inference, and More