Accelerating Ranking Experimentation at Thumbtack with Interleaving
<p>While A/B tests are rightly considered the gold standard for causal inference, they can also be costly. A typical ranking experiment takes many weeks to complete. This wouldn’t be a big problem if we only had a handful of ideas to try, but Thumbtack’s rankers are powered by ML models that could be improved through any combination of new features, model architectures, and training techniques. In other words, there’s a vast space of potential improvements to evaluate, and with A/B testing alone, we don’t have the time to explore it systematically. That’s why we have turned to an experimentation technique called <strong><em>interleaving</em></strong>, which lets our tests reach a conclusion up to 100x faster than A/B testing. Interleaving is designed specifically for experiments involving ranked lists: it quickly identifies the better of two candidate rankings, allowing us to evaluate many more ranking ideas in a much shorter period of time.</p>
<h1>How interleaving works</h1>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:700/0*CSmTr9oOiCnMxVXY" style="height:276px; width:700px" /></p>
<p>To evaluate the impact of a new ranker in an A/B test, we randomly split Thumbtack’s consumers into a control group that sees ordered search results generated by our existing production ranker and a treatment group that sees ordered search results generated by the new ranker. We then perform a hypothesis test to check whether there is a statistically significant difference in engagement (e.g., click rate) between the control and treatment groups. In contrast, an interleaving test does not split consumers into separate groups. Instead, every consumer is served a single list in which the search results from the control ranker and the treatment ranker are <em>interleaved</em> together.</p>
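<p>To make the idea of a combined list concrete, here is a minimal sketch of one well-known interleaving scheme, team-draft interleaving. The text above does not say which scheme Thumbtack uses, so treat the choice of algorithm as an assumption; the function names <code>team_draft_interleave</code> and <code>credit_clicks</code> and the result IDs are hypothetical and chosen only for illustration.</p>

```python
import random
from typing import Dict, List, Sequence, Tuple


def team_draft_interleave(
    control: Sequence[str],
    treatment: Sequence[str],
    rng: random.Random,
) -> Tuple[List[str], Dict[str, str]]:
    """Merge two rankings into one list (illustrative team-draft sketch).

    The rankers take turns "drafting" their highest-ranked result that is
    not yet in the merged list. Whoever has made fewer picks goes next;
    ties are broken by a coin flip. Each result remembers which ranker
    (team) contributed it.
    """
    rankings = {"control": control, "treatment": treatment}
    picks = {"control": 0, "treatment": 0}
    interleaved: List[str] = []
    team_of: Dict[str, str] = {}
    total = len(set(control) | set(treatment))

    while len(interleaved) < total:
        if picks["control"] != picks["treatment"]:
            team = min(picks, key=picks.get)  # the team that is behind
        else:
            team = rng.choice(["control", "treatment"])  # coin flip on ties

        candidate = next((r for r in rankings[team] if r not in team_of), None)
        if candidate is None:
            # This ranker has nothing new left, so the other one picks.
            team = "treatment" if team == "control" else "control"
            candidate = next(r for r in rankings[team] if r not in team_of)

        interleaved.append(candidate)
        team_of[candidate] = team
        picks[team] += 1

    return interleaved, team_of


def credit_clicks(clicked: Sequence[str], team_of: Dict[str, str]) -> Dict[str, int]:
    """Count how many clicked results were contributed by each ranker."""
    credit = {"control": 0, "treatment": 0}
    for result in clicked:
        if result in team_of:
            credit[team_of[result]] += 1
    return credit
```

<p>In this sketch, calling <code>team_draft_interleave(["a", "b", "c"], ["c", "a", "d"], random.Random(0))</code> returns the merged list together with a mapping from each result to the ranker that contributed it. Passing a consumer's clicked results to <code>credit_clicks</code> then yields a per-ranker tally, and the ranker with more credited clicks is preferred for that search. Because both rankers are compared within the same search by the same consumer, no split into control and treatment groups is needed.</p>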
<p><a href="https://medium.com/thumbtack-engineering/accelerating-ranking-experimentation-at-thumbtack-with-interleaving-20cbe7837edf"><strong>Visit Now</strong></a></p>