How to Detect Data Drift with Hypothesis Testing

<p>Data drift is a concern for anyone with a machine learning model serving live predictions. The world changes, and as consumers&rsquo; tastes or demographics shift, the model starts receiving feature values different from those it saw in training, which may result in unexpected outputs. Detecting feature drift appears simple: we just need to decide whether the training and serving distributions of the feature in question are the same. There are statistical tests for this, right? Well, there are, but are you sure you are using them correctly?</p>

<h1>Univariate drift detection</h1>

<p>Monitoring the post-deployment performance of a machine learning model is a crucial part of its life cycle. As the world changes and the data drifts, many models show diminishing performance over time. The best way to stay alert is to calculate performance metrics in real time, or to estimate them when the ground truth is not available.</p>

<p>A likely cause of degraded performance is data drift: a change in the distribution of the model&rsquo;s inputs between training and production data. Detecting and analyzing the nature of data drift can help bring a degraded model back on track. Depending on how (and how many) features are affected, data drift takes one of two forms: univariate or multivariate.</p>

<p><a href="https://towardsdatascience.com/how-to-detect-data-drift-with-hypothesis-testing-1a3be3f8e625">Website</a></p>
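<p>As a concrete illustration of the hypothesis-testing approach, here is a minimal sketch of univariate drift detection for a single numerical feature using the two-sample Kolmogorov&ndash;Smirnov test from SciPy. The variable names and synthetic data are illustrative assumptions, not taken from the article; the 0.05 significance threshold is a common convention, not a recommendation.</p>

```python
# Sketch: univariate drift check for one numerical feature via the
# two-sample Kolmogorov-Smirnov test. Synthetic data stands in for the
# real training and serving samples (illustrative assumption).
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # "training" sample
serve_feature = rng.normal(loc=0.5, scale=1.0, size=1000)  # "serving" sample, mean-shifted

# Null hypothesis: both samples come from the same distribution.
statistic, p_value = stats.ks_2samp(train_feature, serve_feature)

if p_value < 0.05:
    print(f"Drift detected (KS statistic={statistic:.3f}, p-value={p_value:.2e})")
else:
    print("No significant drift detected")
```

<p>For a categorical feature, the analogous check would compare category frequencies with a chi-squared test (e.g. <code>scipy.stats.chi2_contingency</code>) instead of the KS test, which applies only to continuous data.</p>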