Load Balancing: The Intuition Behind the Power of Two Random Choices

In dynamic resource allocation and load balancing, one of the most well-known and fascinating algorithms is the so-called “power of two random choices,” which proposes a very simple change to random allocation that yields an exponential improvement. I was lucky enough to implement this very technique at a gigantic scale to improve the resource utilization of AWS Lambda, and yet I struggled to “feel” it intuitively for a long time. In this article, I’d like to share my mental model for understanding this algorithm, which also helps build a good intuition for many other advanced techniques in this area.

Update: I have a <a href="https://medium.com/@mihsathe/load-balancing-a-very-counterintuitive-improvement-to-the-best-of-k-algorithm-608ffbdb7c8a" rel="noopener">follow-up article</a> on a very cool optimization to this technique. In addition to these articles, I’ve started writing short-form content <a href="https://twitter.com/_mihir_sathe" rel="noopener ugc nofollow" target="_blank">on Twitter</a> for a faster feedback loop. Follow if you’d like to keep up.

<h1>What is Load Balancing?</h1>

Let’s say you run a large-scale AI bot service like ChatGPT. You have thousands of servers with expensive GPUs generating text based on user prompts. Now, if your routing algorithm sent about 80% of incoming requests to only 20% of the servers, those servers would be very busy, and the requests hitting them would likely have to wait quite some time to see a GPU. Meanwhile, the other 80% of the servers would sit cold, poorly utilized and wasting many precious GPU hours. This is what we call bad load balancing.
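
To make the imbalance concrete, here is a minimal simulation sketch comparing plain random routing with the two-choice rule in its standard textbook form: sample two servers uniformly at random and send the request to the less loaded one. The function name `simulate`, its parameters, and the load model (every request adds one unit of load and never finishes) are assumptions made for illustration, not the implementation used in AWS Lambda.

```python
import random


def simulate(num_servers, num_requests, choices, seed=0):
    """Route requests one at a time and return the final load per server.

    Each request samples `choices` servers uniformly at random (with
    replacement, so the same server may be sampled twice) and is sent to
    the least loaded of the sampled servers. With choices=1 this is plain
    random routing; with choices=2 it is the two-choice rule.
    Assumption: each request adds one unit of load and never completes.
    """
    rng = random.Random(seed)
    load = [0] * num_servers
    for _ in range(num_requests):
        candidates = [rng.randrange(num_servers) for _ in range(choices)]
        target = min(candidates, key=lambda server: load[server])
        load[target] += 1
    return load


if __name__ == "__main__":
    n = 10_000  # hypothetical fleet size, with as many requests as servers
    for d in (1, 2):
        load = simulate(n, n, d)
        print(f"choices={d}: max load={max(load)}, idle servers={load.count(0)}")
```

Running it prints the busiest server's load and the number of idle servers for each strategy. The exact numbers depend on the seed, but the busiest server under two choices should carry noticeably less load than under one.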