Osman's Odyssey: Byte & Build
Chronicles of a Perpetual Learner

Blogs

  • Stop Wasting Your Multi-GPU Setup—Use vLLM or ExLlamaV2 for Tensor Parallelism

    Posted on
    7 Minutes

    Use vLLM or ExLlamaV2 for Tensor Parallelism

    Context: Yesterday, I watched @ThePrimeagen's live stream (love his streams, by the way) where he was stress testing his new Green Tinybox—a 6x RTX 4090 build. His plan was to get the LLMs to send and receive concurrent messages and respond to each other, increasing the number and frequency of those messages over time, as a way to stress test those GPUs; and he was using llama.cpp for inference. The llama.cpp part got my attention: with such a powerful setup, llama.cpp is pretty much a system crippler. Around the 26-minute mark of his stream, I commented on that, and after some back-and-forth, I figured it was best not to hijack his stream and just write this blog post instead.
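    For a sense of what that looks like in practice, here's a minimal tensor-parallelism sketch using vLLM's offline API; the model name, GPU count, and prompts are illustrative assumptions, not details from the stream:

    ```python
    # Minimal vLLM tensor-parallelism sketch (model name and sizes are assumptions).
    from vllm import LLM, SamplingParams

    # tensor_parallel_size shards every layer's weights across the GPUs,
    # so all of them work on each token instead of sitting idle.
    llm = LLM(
        model="meta-llama/Llama-3.1-70B-Instruct",  # any HF model you have access to
        tensor_parallel_size=4,  # must evenly divide the model's attention heads
    )

    sampling = SamplingParams(temperature=0.8, max_tokens=256)

    # A batch of concurrent prompts, in the spirit of the stress test.
    prompts = [f"Message {i}: reply to the previous message." for i in range(32)]

    for out in llm.generate(prompts, sampling):
        print(out.outputs[0].text[:80])
    ```

    vLLM continuously batches concurrent requests like these across the tensor-parallel ranks, which is exactly the kind of load a multi-GPU box is built for.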

  • Antifragile AI: Harnessing Uncertainty for a Resilient Future

    Posted on
    5 Minutes

    The Evolution from Traditional Software to AI Agentic Systems

  • All In — Stop Caring & Play The Game

    Posted on
    2 Minutes

    Playing The Game Is The Only Way To Win

  • Serving AI From The Basement — Part II: Agents, MoEs, Inference & More

    Posted on
    Last Edited 14 Minutes

    Unpacking SWE Agentic Framework, MoEs, Batch Inference, and More

    For about 3 weeks now, I have been working on a multi-agent system that simulates a team of Software Engineers; the system assigns projects, creates teams and adds members to them based on areas of expertise and need, and asks team members to build features, assign story points, hold pair programming sessions together, etc. I started it mainly for fun and exploration; however, last week the following paper was released: Agents in Software Engineering.
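    To give a rough sense of the system's shape, here's a hypothetical sketch of the data model; the names and fields are my own illustration, not the actual implementation:

    ```python
    # Hypothetical sketch of the team-simulation data model, not the real code.
    from dataclasses import dataclass, field

    @dataclass
    class Engineer:
        name: str
        expertise: list[str]  # e.g., ["backend", "inference"]

    @dataclass
    class Feature:
        description: str
        story_points: int  # estimated by the assigned engineers
        assignees: list[Engineer] = field(default_factory=list)

    @dataclass
    class Team:
        project: str
        members: list[Engineer] = field(default_factory=list)
        backlog: list[Feature] = field(default_factory=list)

        def add_member_for(self, need: str, pool: list[Engineer]) -> None:
            """Pull the first engineer from the pool whose expertise matches the need."""
            for eng in pool:
                if need in eng.expertise:
                    self.members.append(eng)
                    pool.remove(eng)
                    return
    ```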

  • Serving AI From The Basement — Part I: 192GB of VRAM Setup

    Posted on
    Last Edited 3 Minutes

    A Dedicated AI Server with 8x RTX 3090 GPUs and 192GB of VRAM

    This blog post was originally posted on my LinkedIn profile in July 2024.

    Backstory: Sometime in March, I found myself struggling to keep up with the mere 48GB of VRAM I had been relying on for almost a year in my LLM experiments. So, in a geeky-yet-stylish way, I decided to spend my money to build this thing of beauty. Questions swirled: Which CPU/platform to buy? Does memory speed really matter? Why is it that the more PCIe lanes we have, the better? Why does a 2^n number of GPUs matter in a multi-GPU node setup (Tensor Parallelism, anyone?)? How many GPUs, and how can I get all the VRAM in the world? Why are Nvidia cards so expensive, and why didn’t I invest in their stock earlier? Which inference engine to use (hint: it’s not just llama.cpp, and not always the most well-documented option)?
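    To make the 2^n question concrete: tensor parallelism splits a model's attention heads evenly across GPUs, so most engines require the GPU count to divide the head count. A quick check, using 32 heads as a typical Llama-style (assumed) value:

    ```python
    # Why 2^n GPU counts matter: attention heads must split evenly across GPUs.
    # 32 heads is a typical Llama-style value, used here purely for illustration.
    NUM_HEADS = 32

    for gpus in (2, 3, 4, 5, 6, 7, 8):
        if NUM_HEADS % gpus == 0:
            print(f"{gpus} GPUs: OK ({NUM_HEADS // gpus} heads per GPU)")
        else:
            print(f"{gpus} GPUs: {NUM_HEADS} heads don't split evenly; tensor parallelism is off the table")
    ```

    With 32 heads, only 2, 4, and 8 GPUs split cleanly, which is part of why an 8x 3090 build pairs so well with tensor parallelism.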