Serving AI From The Basement — Part I
A Dedicated AI Server with 8x RTX 3090 GPUs and 192GB of VRAM
AI from The Basement: My latest side project, a dedicated LLM server powered by 8x RTX 3090 graphics cards, boasting a total of 192GB of VRAM. I built this with running Meta’s Llama 3.1 405B in mind.
This blogpost was originally posted on my LinkedIn profile in July 2024.
Backstory: Sometime in March I found myself struggling to keep up with the mere 48GB of VRAM I had been relying on for almost a year of LLM experimentation. So, in a geeky-yet-stylish way, I decided to spend my money on building this thing of beauty. Questions swirled: Which CPU and platform should I buy? Does memory speed really matter? Why are more PCIe lanes better? Why does having 2^n GPUs matter in a multi-GPU setup (Tensor Parallelism, anyone?)? How many GPUs do I need, and how can I get all the VRAM in the world? Why are Nvidia cards so expensive, and why didn’t I invest in their stock earlier? Which inference engine should I use (hint: it’s not just llama.cpp, and not always the most well-documented option)?
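That 2^n question deserves a quick aside. Tensor parallelism shards each layer’s weight matrices and attention heads evenly across GPUs, so the GPU count has to divide the model’s head counts cleanly; that’s a big part of why 2, 4, or 8 cards play nicer than, say, 6. Here’s a minimal sketch of that divisibility check — the head counts are illustrative Llama-style values I’m assuming for the example, not exact specs:

```python
# Sanity check: tensor parallelism shards attention heads across GPUs,
# so both query heads and KV heads must divide evenly by the TP size.
# Head counts below are illustrative Llama-style values, not exact specs.
MODELS = {
    "llama-70b-like": {"num_attention_heads": 64, "num_kv_heads": 8},
    "llama-405b-like": {"num_attention_heads": 128, "num_kv_heads": 8},
}

def tp_compatible(num_heads: int, num_kv_heads: int, tp_size: int) -> bool:
    """True if both head counts split evenly across tp_size GPUs."""
    return num_heads % tp_size == 0 and num_kv_heads % tp_size == 0

for name, cfg in MODELS.items():
    for tp in (2, 4, 6, 8):
        ok = tp_compatible(cfg["num_attention_heads"], cfg["num_kv_heads"], tp)
        print(f"{name}: tp={tp} -> {'ok' if ok else 'heads do not divide evenly'}")
```

With 8 KV heads, a 6-GPU setup fails the check while 2, 4, and 8 all pass — hence the power-of-two GPU counts.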
After so many hours of research, I decided on the following platform:
- Asrock Rack ROMED8-2T motherboard with 7x PCIe 4.0 x16 slots and 128 lanes of PCIe
- AMD Epyc Milan 7713 CPU (2.00 GHz/3.675GHz Boosted, 64 Cores/128 Threads)
- 512GB DDR4-3200 3DS RDIMM memory
- A mere trio of 1600-watt power supply units to keep everything running smoothly
- 8x RTX 3090 GPUs with 4x NVLinks, enabling a blistering 112GB/s data transfer rate between each pair
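To put that 192GB in context, here’s a back-of-the-envelope estimate of what Llama 3.1 405B’s weights alone would occupy at different precisions. This is my own simplification — KV cache, activations, and runtime overhead all come on top:

```python
# Back-of-the-envelope weight-memory estimate for a 405B-parameter model.
# Weights only: KV cache, activations, and framework overhead are extra.
GIB = 2**30
PARAMS = 405e9
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

total_vram_gib = 8 * 24  # eight RTX 3090s at 24GiB each -> 192GiB

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gib = PARAMS * bytes_per_param / GIB
    print(f"{precision}: ~{weights_gib:.0f}GiB of weights vs {total_vram_gib}GiB of VRAM")
```

At 4-bit the weights alone land around 189GiB, which is why 192GB of VRAM is roughly the entry ticket for this model — and why quantization and careful KV-cache budgeting matter so much.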
Now that I kinda have everything in order, I’m working on a series of blog posts that will cover the entire journey, from building this behemoth to avoiding costly pitfalls. Topics will include:
- The challenges of assembling this system: from drilling holes in metal frames and adding 30-amp 240-volt breakers, to bending CPU socket pins (don’t try this at home, kids!).
- Why PCIe Risers suck and the importance of using SAS Device Adapters, Redrivers, and Retimers for error-free PCIe connections.
- NVLink speeds, PCIe lane bandwidth, VRAM transfer speeds, and Nvidia’s decision to block native PCIe P2P at the driver level.
- Benchmarking inference engines like TensorRT-LLM, vLLM, and Aphrodite Engine, all of which support Tensor Parallelism (see the sketch right after this list).
- Training and fine-tuning your own LLM.
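As a small teaser for the benchmarking post, this is roughly what spinning up one of those engines across all eight cards looks like. A minimal vLLM sketch — the model name and sampling parameters are placeholders I’m assuming for illustration, not my actual benchmark configuration:

```python
# Minimal vLLM example sharding a model across all eight GPUs via tensor parallelism.
# Model name and sampling parameters are placeholders, not my benchmark config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",  # assumption: any TP-friendly model
    tensor_parallel_size=8,                          # shard weights across the 8x RTX 3090s
    gpu_memory_utilization=0.90,                     # leave a little headroom per card
)

outputs = llm.generate(
    ["Explain NVLink in one paragraph."],
    SamplingParams(max_tokens=128, temperature=0.7),
)
print(outputs[0].outputs[0].text)
```

The `tensor_parallel_size=8` line is the whole point of the 2^n GPU count: each layer gets split eight ways, so all 192GB of VRAM works on the same model at once.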
Stay tuned.
P.S. I’m sitting here staring at those GPUs, and I just can’t help but think how wild tech progress has been. I remember being so excited to get a 60GB HDD back in 2004. I mean, all the movies and games I could store?! Fast forward 20 years, and now I’ve got more than triple that capacity in just one machine’s graphics cards… It makes me think, what will we be doing in another 20 years?!
Anyway, that’s why I’m doing this project. I wanna help create some of the cool stuff that’ll be around in the future. And who knows, maybe someone will look back on my work and be like “haha, remember when we thought 192GB of VRAM was a lot?”
Part II of this Blogpost Series is now available here.
- Update #1 – 09/17/2024: This blogpost was widely discussed on Hacker News, /r/LocalLLaMA, and /r/programming. It was also featured in the Croatian magazine BUG.
- Update #2 – 09/21/2024: Part II of this Blogpost Series is now available here.