Osman's Odyssey: Byte & Build
Chronicles of a Perpetual Learner

AI

  • Software Engineers Aren't Getting Automated—Local AI Has To Win

    Posted on
    8 Minutes

    Why Full-Stack Ownership Is the Only Real Job Security in the Age of AI

    Real technical ability is fading. Worried about AI replacing you? Build real technical depth. LLMs are leverage, a force multiplier, but only if you know what you’re doing. You’re not losing to AI. You’re losing to people who use AI better than you because they actually understand the tech. Get sharper.

    This goes way beyond privacy or ideology. As optimization and model alignment get more personal (and more opaque), your only actual safety net is full local control. If you’re building a business, a workflow, or even a habit that depends on a remote black box, you’re not the customer; you’re the product. Full-stack ownership isn’t just to show off. It’s pure risk management.

    The future belongs to those who can build, debug, and document, not just rent someone else’s toolchain. Bootcamps don’t cut it anymore.

    “Every day these systems run is a miracle. Most engineers wouldn’t last five minutes outside their cloud sandbox.”

    Our industry is obsessed with AI hype, yet most devs have never seen the bare metal, never written a real doc, and never owned their own stack. Meanwhile, the only thing standing between us and our systems’ total collapse is duct tape, a few command-line obsessives, and the shrinking number of people who still know how to fix things when it all stops working. We’re staring down an industry where the median troubleshooting skill is somewhere between “reboot and pray” and “copy-paste from Stack Overflow”.

    So please, stop the doomscroll and quit worrying about being replaced. LLMs amplify you; they don’t substitute for you. The edge is in the hard parts: critical thinking, debugging, taste for clean architecture, putting it all together. That’s not going anywhere. The job is shifting, not getting eliminated: more architecture, more security, more maintenance, more troubleshooting. Still deeply human, and still non-trivial to automate.

    This is blogpost #3 in my 101 Days of Blogging. If it sparks anything (ideas, questions, or critique), my DMs are open. Hope it gives you something useful to walk away with.

    Yesterday morning I hosted an X/Twitter Audio Space on how LLMs, open source, and the gravitational pull of platform centralization are forcing us all to rethink what it actually means to be a developer. The cloud got us coddled… The cloud was a mistake, and I believe that the next decade’s winners won’t be the ones who just ship the most code (LLMs are really good at that, BTW), but the ones who get obsessed with understanding, documenting, and actually owning their tools, top to bottom.

    Let’s set the scene. A Google Cloud outage just crashed the internet. X/Twitter is in full panic mode, and Cursor/Claude Code/Windsurf/etc. aren’t working anymore. LLMs have become the default code generator, and human programming skills are fading. Me? I didn’t even notice the outage until I got online. My local agents, running on my hardware from my basement, kept running.

  • Just Like GPUs, We Need To Be Stress Tested: 101 Days of Blogging

    Posted on
    4 Minutes

    101 Days of Technical Blogging, Consistency, and Self-Experimentation

    Writing is how we come to understand ourselves, a gift to our future selves, a record of what once mattered. It grounds our thoughts and gives them shape.

    This one is for me. I hope you enjoy it too.

    The past few months have given me a lot to think about. Life can happen to you out of nowhere, faster than a finger snap, and you’ve only got yourself, mostly, to keep it together.

    In life, you’re either getting smarter or dumber. Stronger or weaker. More efficient or completely helpless. Self-reliant or dependent. The latter in each pair is becoming exponentially easier, and the trend will only accelerate in the years ahead.

    “I want to live happily in a world I don’t understand.” ― Nassim Nicholas Taleb, Antifragile: Things That Gain From Disorder

    Don’t be that guy.

    Being prepared is fundamental to your survival, but not only that… Being prepared is our only duty in life: to ourselves, to our loved ones, and to everything we care about. So, I am no longer taking time for granted, and I will always be prepared.

    Actions-per-minute matter. A lot. We’re entering an era where productivity multipliers, across the board, are approaching infinity. That has to be harnessed, deliberately and fast. Or else…

    So, I’ve made a decision: I’m going to stress-test myself—across the board, for an extended amount of time. No more skipped workouts. No more pushed plans. No more dragging out already-soft deadlines. I have to show up. Fully. For all of it.

  • Once Undesirable, Now Undeniable—How Flipping the Script Changed Everything

    Posted on
    5 Minutes

    How Flipping the Script Made Me the Hunted, Not the Hunter

    In March 2024, while reading through the Hacker News Who Is Hiring thread, I saw a job profile that fit me. I liked it so much that I reached out to the author (the co-founder & CTO) on LinkedIn that same night. My profile, based on previous employment alone, would signal a misfit, but everything I do day-to-day in my own time tells a different story: this is my passion and I do it non-stop. Nevertheless, it was an uphill battle to convince the CTO to get on the phone with me. I basically wrote a three-page cover letter about my profile and how it matched their job description. He decided I should get a chance to go through their process, but ultimately they hired someone else with more experience in their stack. No hard feelings.

    Last month I published From the Shadows to the Feed: Why I’m Finally Playing the Game. Since then, I have been honored with ~3.5k new followers. Superlinear growth is something I’m addicted to; I crave it in anything and everything I do in my life. Once I have the fundamentals down, I obsess over growth rates; getting better at a higher-than-average pace is an absolute necessity to me.

    Getting a job in this market, however, was the opposite of superlinear. It was death by a thousand cuts. I had already been dabbling with AI/LLMs for the past year or so, and I could foresee what was coming. I told my very supportive better half that I would not be applying for jobs because it was a waste of time, and that, given all indicators, I had to pivot, fully focus on AI, and invest heavily in it right now.

    That’s how I came to build my AI Cluster. I sat down for hours on end learning, experimenting, and stressing out about the bet I was taking on myself. I relaunched my website, started blogging, and started putting in the time to make myself seen. Tweeting, trying to say: hey, I’m that kinda-famous user on r/LocalLLaMA.

    I am here. I am good at what I do. I just don’t know how to get you to see it. Every now and then I would shamelessly insert my AI server into replies when it was relevant. I got invited to livestreams and accepted right away. I started hosting audio spaces on X/Twitter. I tweeted more and more frequently, engaged with people, made friends. I blogged some more. Wrote articles on X because of the algo. Started my own YouTube channel. Livestreamed some stuff. Got invited to more livestreams and more spaces… Finally, some superlinear growth.

    But that’s not what this article is about.

    When you’re playing the same game as everybody else, you’re pouring everything you’ve got into a game that was never designed to favor you. Unless you have an unfair advantage, you’re just gambling.

    What’s my unfair advantage? What makes my profile the one being hunted? My focus shifted to exactly that after the experience I described at the beginning of this article. I decided to stop playing a game whose rules were never written for me to win. I’ll become the huntee.

    In December last year, three months from my first blogpost and any of the activity I described above, I cold-emailed roughly three to four dozen AI startups that were hiring on Hacker News. I pretty much wrote a paragraph about myself, attached links to a few of the things I had posted and built, and asked if they’d be interested in chatting.

  • Private Screenshot Organizer with LMStudio (Runs Fully Local)

    Posted on
    13 Minutes

    Organize Screenshots with Local Multimodal LLMs, No Cloud Needed

    I run an AI screenshot organizer locally on my PC. I don’t want to send my screenshots anywhere on the internet; my data is mine, and sending it to any proprietary model means I am giving away the rights to it. So, I have a local VLM pipeline that organizes all my screenshots. This pipeline was previously powered by my 14x RTX 3090 Basement AI Server and now runs directly on my PC with the LMStudio SDK, occupying less than 6GB of GPU VRAM.

    Recently, LMStudio released their Python and JavaScript SDKs. LMStudio is my go-to LLM desktop application, especially for models running directly on my PC rather than on my AI Cluster. I had been intending to give their Python SDK a try with a small project, and the release of Gemma 3’s new 4-bit quantization made me pull the trigger.

    Given that Gemma 3 is a multimodal model that accepts both image and text as input (4B, 12B, and 27B; 1B is text-only), and given the wild size (and performance) that the QAT quantization brings it down to, I decided to rewrite my screenshot organizer to run directly on my PC.

    This article starts off slow, but it ramps up and gets way more interesting as we go. If you’d rather jump straight into the action, feel free to skip ahead to the Prerequisites section. And yep, this works on pretty much any image, not just screenshots.

    My Screenshots Folder with 875 Screenshots

    I hate a desktop littered with screenshots named Screenshot 2024-05-15 at 11.23.45 AM.png, Screen Shot 2024-05-16 at 9.01.12.png, or even worse, Untitled.png. The screenshots folder used to be where things went to die unless I used them right away. And then, sometimes, I find myself wondering about that one screenshot from 4 months ago!

    When Qwen2-VL came out last year, I built an asynchronous pipeline that ran on my AI cluster to automatically rename, categorize, and organize my screenshots based on their content. Given my atypical use of my AI cluster, that pipeline didn’t run frequently, and I would have much preferred to run it directly from my PC; but I also didn’t want to replicate the complex software configuration on my PC. You can learn more about how I use my AI cluster in this blogpost. LMStudio simplifies all of this on my PC (a one-stop shop for AI models kind of thing), and I already have enough headaches without adding more; so, ultimately, I ran this pipeline on my AI cluster every few weeks, once the screenshot mess bothered me enough to go around remembering how to get it up and running.

    In this post, we’ll build a practical screenshot organizer step-by-step, and in parallel we’ll get introduced to the core functionalities of the lmstudio-python library.

    We’ll create a Python script that:
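    Before the step-by-step walkthrough, here is a minimal sketch of the core call everything else is built around, assuming the lmstudio-python convenience API (lms.llm, lms.prepare_image, Chat); the model key, folder path, and prompt below are illustrative placeholders:

    ```python
    from pathlib import Path

    import lmstudio as lms  # pip install lmstudio

    # Illustrative values; point these at your own folder and loaded model.
    SCREENSHOTS_DIR = Path.home() / "Pictures" / "Screenshots"
    model = lms.llm("gemma-3-12b-it-qat")

    def describe(image_path: Path):
        """Ask the vision model for a short, filename-friendly description of one image."""
        image = lms.prepare_image(str(image_path))
        chat = lms.Chat()
        chat.add_user_message(
            "Describe this screenshot in five words or fewer, suitable for a filename.",
            images=[image],
        )
        return model.respond(chat)

    for shot in sorted(SCREENSHOTS_DIR.glob("*.png")):
        print(shot.name, "->", describe(shot))
    ```

    The rest of the script is plumbing around that call: walking the folder, turning the response into a clean filename, and sorting files into category folders.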

  • Stop Wasting Your Multi-GPU Setup—Use vLLM or ExLlamaV2 for Tensor Parallelism

    Posted on
    7 Minutes

    Use vLLM or ExLlamaV2 for Tensor Parallelism

    Context: Yesterday, I watched @ThePrimeagen’s live stream (love his streams, by the way) where he was stress-testing his new Green Tinybox—a 6x RTX 4090 build. His plan was to get the LLM to send and receive concurrent messages and respond to each other, increasing the number and frequency of those messages over time, as a way to stress test those GPUs; and he was using llama.cpp for inference. The llama.cpp part got my attention: with such a powerful setup, llama.cpp is pretty much a system crippler. Around the 26-minute mark of his stream, I commented on that, and after some back-and-forth, I figured it was best not to hijack his stream and just write this blogpost instead.

    In one of his responses while live streaming, Michael (@ThePrimeagen) showed a GitHub thread about llama.cpp supporting concurrent requests, but that is more about threading on the software side; llama.cpp is not optimized for Tensor Parallelism and Batch Inference. In this blogpost, we dive into the details of various inference engines, explaining when each one makes sense depending on your setup. We’ll cover llama.cpp for CPU offloading when you don’t have enough GPU memory, how vLLM’s Tensor Parallelism gives a massive boost to multi-GPU systems with batch inference, and why ExLlamaV2’s EXL2 quantization is a great choice for Tensor Parallelism and Batch Inference when memory is limited, but not critically so.
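    To make the contrast concrete, here is a minimal sketch of tensor-parallel batch inference with vLLM’s offline Python API; the model name, GPU count, and prompts are illustrative placeholders, not a recreation of Michael’s setup:

    ```python
    from vllm import LLM, SamplingParams

    # Shard the model across GPUs with tensor parallelism; pick a size that
    # divides the model's attention heads (4 here is illustrative).
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=4,
    )

    prompts = [f"Write a one-line status update from bot #{i}." for i in range(64)]
    params = SamplingParams(temperature=0.8, max_tokens=64)

    # vLLM batches these requests internally (continuous batching), which is
    # where the multi-GPU throughput win over llama.cpp comes from.
    outputs = llm.generate(prompts, params)
    for out in outputs:
        print(out.outputs[0].text.strip())
    ```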

    In Short: an Inference Engine is software that knows how to properly send human input to, and in turn show human-readable output from, these massive AI models. In more detail, Large Language Models (LLMs) are deep learning neural network models. The LLMs we use right now come from an architecture called the Transformer, coined in the famous paper Attention Is All You Need. Inference Engines usually utilize the Transformers library implemented by the Hugging Face team, which at a lower level supports the PyTorch, TensorFlow, and Jax libraries, allowing for the wide variety of hardware support that those libraries provide tooling for.

    The short version above is really all you need to know, so feel free to skip to the next section. But in case you’re curious, Inference Engines also implement a variety of other things that are not necessarily provided out of the box by the Transformers library, such as quantizations, and model architectures adapted to those quantizations.
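    For a concrete baseline, this is roughly what the plain Hugging Face Transformers path looks like, the layer that inference engines build on or re-implement for speed (the model name is an illustrative placeholder):

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative; any causal LM works

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    # Human-readable text in, token IDs through the Transformer, text back out.
    inputs = tokenizer(
        "Explain tensor parallelism in one sentence.", return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    ```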

    Are you still with me? Good. There are several layers to how an Inference Engine works. It starts at the bottom level with the hardware you are running (CPU only, CPU and GPU mixed, GPU only, TPU, NPU, etc.), then looks into the details of that hardware (Intel, AMD ROCm, Nvidia CUDA, etc.), then it goes one level higher and tries to figure out whether you are using a quantization (GGUF, EXL2, AWQ, etc.) or the original safetensors weights, and then the model architecture itself. The model architecture is the secret sauce—sometimes released in a training/white paper—of how the model does its magic to make meaningful connections out of your input and then produce a—hopefully—meaningful output.

    Imagine Large Language Models as complex Lego sets with Transformers as the basic bricks. However, each set has different instructions—some build from left to right, others can see the whole picture at once, and some might even shuffle the pieces around for a unique approach. Plus, each set might have its own special pieces or ways of snapping the Legos together, making each model unique in how it constructs or understands language.

    There are many architectures out there; what we need to keep in mind is that each one is put together differently, which means the code implementation for how they’re understood—AKA Inference—is also different.

    llama.cpp is an Inference Engine that supports a wide variety of model architectures and hardware platforms. It does not, however, support Batch Inference, making it less than ideal for more than one request at a time. It is mainly used with the GGUF quantization format, and the engine runs with okay performance for single-request workloads but not much else. The only time I would actually recommend using llama.cpp is when you do not have enough GPU memory (VRAM) and need to offload some of the model weights to CPU memory (RAM).
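    For that scenario, partial offload is a one-line knob. Here is a minimal sketch using the llama-cpp-python bindings; the GGUF path and layer count are illustrative, and n_gpu_layers controls how many layers live on the GPU while the rest stay in system RAM:

    ```python
    from llama_cpp import Llama  # pip install llama-cpp-python

    # Illustrative GGUF file; keep ~20 transformer layers on the GPU,
    # the remainder is computed from system RAM on the CPU.
    llm = Llama(
        model_path="./models/llama-3.1-8b-instruct-q4_k_m.gguf",
        n_gpu_layers=20,   # -1 would offload every layer that fits
        n_ctx=4096,
    )

    out = llm("Q: When is CPU offloading worth it?\nA:", max_tokens=128)
    print(out["choices"][0]["text"].strip())
    ```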