Why Data Quality Is Harder than Code Quality

As a data engineer, I always feel less confident about the quality of the data I handle than about the quality of the code I write. Code, at least, I can run interactively and test before deploying to production. With data, I most often have to wait for it to flow through the system and be used before I encounter data quality issues. And it’s not only the errors that get raised. It’s also the feeling that there are more unknown data quality issues than code bugs waiting to be discovered.

But is data quality a more complex problem to solve than code quality?

Code quality is the process of ensuring code meets expectations. Likewise, data quality is the process of ensuring data meets expectations. In this article, I want to abstract away from the expectations themselves (known as <a href="https://www.metaplane.dev/blog/data-quality-metrics-for-data-warehouses" rel="noopener ugc nofollow" target="_blank">data quality dimensions</a> when talking about data), as these will likely differ depending on your usage. Instead, I discuss the different steps for handling quality issues: detecting, understanding, fixing, and reducing them. Then I argue why I find each of these steps harder to implement for data than for code.