The Intuition Behind the Power of Two Random Choices
<p>In dynamic resource allocations and load balancing, one of the most well-known and fascinating algorithms is the so-called “power of two random choices,” which proposes a very simple change to the random allocations that yields exponential improvement. I was lucky enough to have implemented this very technique on a gigantic scale to improve the resource utilization of AWS Lambda, and yet I struggled to “feel” it intuitively for a long time.</p>
<p>In this article, I’d like to share my mental model for understanding this algorithm which also helps build a good intuition for many other advanced techniques in this area.</p>
<p><strong>Update:</strong> I have a <a href="https://medium.com/@mihsathe/load-balancing-a-very-counterintuitive-improvement-to-the-best-of-k-algorithm-608ffbdb7c8a" rel="noopener">follow-up article</a> on a very cool optimization to this technique. Also, in addition to these articles, I’ve started writing short form content <a href="https://twitter.com/_mihir_sathe" rel="noopener ugc nofollow" target="_blank">on twitter</a> for a faster feedback loop. Follow if you’d like to keep up.</p>
<h1>What is Load Balancing?</h1>
<p>Let’s say you run a large-scale AI bot service like ChatGPT. You have thousands of servers with expensive GPUs generating text based on user prompts. Now if your algorithm was routing about 80% of incoming requests to only 20% of servers, that 20% of servers would be quite busy, and the requests hitting those servers will likely have to wait for quite some time to see a GPU. On the other hand, 80% of cold servers are poorly utilized and waste many precious GPU hours. This is what we call bad load balancing.</p>
<p>On the other hand, if all your servers are roughly taking equal load, you are answering your requests in near-optimal time and not overpaying for resources. Moreover, you can easily “right size” your fleet — expand it to improve response time or contract it to reduce cost.</p>
<h1>Power of Two Random Choices</h1>
<p>There’s a simplified model to think about load balancing. You have an infinite stream of balls (representing requests) and N bins (representing servers). Your job is to assign each incoming ball to a bin such that you keep the allocations across all bins as equal as possible. A real-world problem is much more complex than this, but this simplification will do for now.</p>
<p>One of the easiest strategies for balls-in-bins is to pick a bin and place the ball in it randomly. If we use uniform random picks, all outcomes are equally likely for every pick, so we’d expect all bins to be roughly equally filled.</p>
<p>The other approach — you pick two random bins instead of one and place the ball in the emptier one. This is the so-called “power of two random choices” approach — we’ll call it “best of two” for short.</p>
<p>Let’s run a simulation placing one million balls into 1,000 bins with both of these approaches — the best case for the end state being 1,000 balls in every bin. As per the histograms of bins below, the difference is quite dramatic.</p>
<p><a href="https://betterprogramming.pub/load-balancing-the-intuition-behind-the-power-of-two-random-choices-6de2e139ac2f">Click Here</a></p>