Missing Data Demystified: The Absolute Primer for Data Scientists

<p>Earlier this year, I started a piece on&nbsp;<a href="https://medium.com/towards-data-science/data-quality-issues-that-kill-your-machine-learning-models-961591340b40" rel="noopener">several data quality issues</a>&nbsp;(or characteristics) that heavily compromise our machine learning models.</p> <p><strong>One of them was, unsurprisingly, Missing Data.</strong></p> <p>I&rsquo;ve been studying this topic for many years now (<em>I know, right?!</em>), but across some of the projects I contribute to in the&nbsp;<a href="https://datacentricai.community/" rel="noopener ugc nofollow" target="_blank">Data-Centric Community</a>, I realized that many data scientists still haven&rsquo;t grasped the full complexity of the problem, which inspired me to create this comprehensive tutorial.</p> <p><em>Today, we will delve into the intricacies of the&nbsp;</em><strong><em>problem of missing data</em></strong><em>, discover the different&nbsp;</em><strong><em>types of missing data</em></strong><em>&nbsp;we may find in the wild, and explore how we can&nbsp;</em><strong><em>identify and mark missing values&nbsp;</em></strong><em>in real-world datasets.</em></p> <h1>The Problem of Missing Data</h1> <p>Missing Data is an interesting data imperfection, since it may arise naturally from the nature of the domain or be inadvertently created during data collection, transmission, or processing.</p> <p><strong>In essence, missing data is characterized by the appearance of absent values in data</strong>, i.e., missing values in some records or observations in the dataset. It can be either&nbsp;<em>univariate</em>&nbsp;(only one feature has missing values) or&nbsp;<em>multivariate</em>&nbsp;(several features have missing values).</p> <p><a href="https://towardsdatascience.com/missing-data-demystified-the-absolute-primer-for-data-scientists-8c9244c764c4"><strong>Website</strong></a></p>
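<p>To make the univariate versus multivariate distinction concrete, here is a minimal sketch in plain Python. It is not taken from the original article: the helper names (<code>is_missing</code>, <code>features_with_missing</code>, <code>missingness_pattern</code>) and the toy records are hypothetical, and it assumes that missing values are encoded as <code>None</code> or as a floating-point NaN. Real datasets often use other placeholders, which this sketch would not catch.</p>

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def features_with_missing(records):
    """Return the sorted names of features with at least one missing value."""
    incomplete = set()
    for record in records:
        for feature, value in record.items():
            if is_missing(value):
                incomplete.add(feature)
    return sorted(incomplete)


def missingness_pattern(records):
    """Label a dataset as 'complete', 'univariate', or 'multivariate'."""
    n_incomplete = len(features_with_missing(records))
    if n_incomplete == 0:
        return "complete"
    return "univariate" if n_incomplete == 1 else "multivariate"


# Hypothetical toy records, invented purely for illustration.
univariate_data = [
    {"age": 34, "income": 52000.0},
    {"age": 29, "income": None},  # only "income" is ever missing
    {"age": 41, "income": 61000.0},
]
multivariate_data = [
    {"age": None, "income": 52000.0},  # "age" is missing here
    {"age": 29, "income": float("nan")},  # "income" is missing here
    {"age": 41, "income": 61000.0},
]

print(missingness_pattern(univariate_data))
print(missingness_pattern(multivariate_data))
```

<p>The label depends on how many <em>features</em> contain gaps, not on how many records are affected: a single feature with thousands of absent values is still a univariate case under the definition above.</p>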