# Accelerating Ranking Experimentation at Thumbtack with Interleaving

While A/B tests are rightly considered the gold standard for causal inference, they can also be costly. A typical ranking experiment takes many weeks to complete. This wouldn't be a big problem if we only had a handful of ideas to try, but Thumbtack's rankers are powered by ML models that could be improved through any combination of new features, model architectures, and training techniques. In other words, there's a vast space of potential improvements to evaluate, and with A/B testing alone we don't have the time to explore it systematically. That's why we have turned to an experimentation technique called ***interleaving*** that can deliver results up to 100X faster than A/B testing. Interleaving is specifically designed to accelerate experiments on ranked lists: it quickly identifies the better of two candidate rankings, allowing us to evaluate many more ranking ideas in a much shorter period of time.

## How interleaving works

![](https://miro.medium.com/v2/resize:fit:700/0*CSmTr9oOiCnMxVXY)

To evaluate the impact of a new ranker in an A/B test, we randomly split Thumbtack's consumers into a control group that sees search results ordered by our existing production ranker and a treatment group that sees search results ordered by the new ranker. We then perform a hypothesis test to check whether there is a statistically significant difference in engagement (e.g. click rate) between the two groups. An interleaving test, in contrast, does not split consumers into separate groups: every consumer is served a single combined list in which results from the control ranker and the treatment ranker are *interleaved* together.
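The post doesn't specify which merging algorithm is used, but a widely used variant is *team-draft interleaving* (Radlinski, Kurup, and Joachims, 2008): the two rankers take turns "drafting" their highest-ranked result that hasn't already been placed, with ties in draft order broken by a coin flip so neither ranker gets a systematic position advantage. A minimal Python sketch of that idea follows; the function and variable names are illustrative, not Thumbtack's actual implementation.

```python
import random

def team_draft_interleave(ranking_a, ranking_b):
    """Merge two ranked lists via team-draft interleaving.

    Returns the interleaved list plus a mapping from each item to the
    "team" (ranker "A" or "B") credited with drafting it.
    """
    interleaved = []
    team = {}              # item -> "A" or "B"
    a_i = b_i = 0          # next-candidate pointers into each ranking
    picks_a = picks_b = 0  # how many items each ranker has drafted

    while a_i < len(ranking_a) or b_i < len(ranking_b):
        # Skip items the other ranker has already placed.
        while a_i < len(ranking_a) and ranking_a[a_i] in team:
            a_i += 1
        while b_i < len(ranking_b) and ranking_b[b_i] in team:
            b_i += 1
        a_left = a_i < len(ranking_a)
        b_left = b_i < len(ranking_b)
        if not (a_left or b_left):
            break
        # The ranker with fewer picks drafts next; a coin flip breaks ties.
        pick_a = a_left and (
            not b_left
            or picks_a < picks_b
            or (picks_a == picks_b and random.random() < 0.5)
        )
        if pick_a:
            item = ranking_a[a_i]
            team[item] = "A"
            picks_a += 1
        else:
            item = ranking_b[b_i]
            team[item] = "B"
            picks_b += 1
        interleaved.append(item)

    return interleaved, team

if __name__ == "__main__":
    a = ["pro_1", "pro_2", "pro_3", "pro_4"]
    b = ["pro_3", "pro_1", "pro_5", "pro_2"]
    combined, team = team_draft_interleave(a, b)
    print(combined)  # e.g. ['pro_1', 'pro_3', 'pro_2', 'pro_5', 'pro_4']
    print(team)
```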
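Because each result in the combined list is credited to the ranker that drafted it, clicks on the interleaved list can be attributed back to a team, so each search yields a per-search winner. A sketch of that attribution step, under the same illustrative naming:

```python
def credit_clicks(team, clicked_items):
    """Attribute each click to the ranker whose draft pick was clicked,
    then declare a per-search winner (or a tie)."""
    clicks_a = sum(1 for item in clicked_items if team.get(item) == "A")
    clicks_b = sum(1 for item in clicked_items if team.get(item) == "B")
    if clicks_a != clicks_b:
        return "A" if clicks_a > clicks_b else "B"
    return "tie"
```

Aggregated over many searches, these per-search winners form a paired comparison: if the treatment ranker wins significantly more often than chance (for example, under a binomial sign test on the non-tied searches), it is preferred. This within-user, paired design is what makes interleaving so much more sensitive than a between-group A/B comparison.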