Missing Data Demystified: The Absolute Primer for Data Scientists

<p>I&rsquo;ve been studying this topic for many years now (<em>I know, right?!</em>), but while working on some projects I contribute to in the&nbsp;<a href="https://datacentricai.community/" rel="noopener ugc nofollow" target="_blank">Data-Centric Community</a>, I realized that many data scientists still haven&rsquo;t grasped the full complexity of the problem, which inspired me to create this comprehensive tutorial.</p> <p><em>Today, we will delve into the intricacies of the&nbsp;</em><strong><em>problem of missing data</em></strong><em>, discover the different&nbsp;</em><strong><em>types of missing data</em></strong><em>&nbsp;we may find in the wild, and explore how we can&nbsp;</em><strong><em>identify and mark missing values&nbsp;</em></strong><em>in real-world datasets.</em></p> <h1>The Problem of Missing Data</h1> <p>Missing data is an interesting data imperfection, since it may arise naturally from the nature of the domain or be inadvertently introduced during data collection, transmission, or processing.</p> <p><strong>In essence, missing data is characterized by absent values in the data</strong>, i.e., values missing from some records or observations in the dataset, and can be either&nbsp;<em>univariate</em>&nbsp;(one feature has missing values) or&nbsp;<em>multivariate</em>&nbsp;(several features have missing values):</p> <p><img alt="" src="https://miro.medium.com/v2/1*rQh88nLLoMlIef9GVwdR3Q.png" style="width:700px" /></p> <p>Univariate versus Multivariate missing data patterns. 
Image by Author.</p> <p><em>Let&rsquo;s consider an example.&nbsp;</em>Say we are conducting a study on diabetes in a patient cohort.</p> <p><strong>Medical data is a great example, because it is often highly prone to missing values:</strong>&nbsp;patient values are taken from both surveys and laboratory results, can be measured several times throughout the course of diagnosis or treatment, are stored in different formats (sometimes distributed across institutions), and are often handled by different people.&nbsp;<em>It can (and most certainly will) get messy!</em></p> <p><a href="https://towardsdatascience.com/missing-data-demystified-the-absolute-primer-for-data-scientists-8c9244c764c4">Website</a></p>
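<p>To make the univariate versus multivariate distinction concrete, here is a minimal sketch in Python with pandas. The <code>cohort</code> table, its column names, its values, and the <code>missing_pattern</code> helper are all hypothetical and invented for illustration; they are not taken from a real diabetes study. The sketch also assumes missing values are already encoded as <code>NaN</code>, which is often not the case in real-world data.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy cohort: names and values are made up for illustration only.
cohort = pd.DataFrame({
    "age": [54, 61, 47, 38],
    "glucose": [148.0, np.nan, 183.0, np.nan],
    "bmi": [33.6, 26.6, np.nan, 28.1],
})


def missing_pattern(df):
    """Label the missing data pattern of a DataFrame.

    Returns a (label, columns) tuple, where columns lists the features
    that contain at least one missing value and label is "complete",
    "univariate" (one such feature) or "multivariate" (several).
    """
    cols = df.columns[df.isna().any()].tolist()
    if not cols:
        return "complete", cols
    label = "univariate" if len(cols) == 1 else "multivariate"
    return label, cols


# Both glucose and bmi contain NaN, so the full table is multivariate.
print(missing_pattern(cohort))

# Restricting the table to age and glucose leaves one affected feature.
print(missing_pattern(cohort[["age", "glucose"]]))
```

<p>Counting the affected features with <code>isna().any()</code> only describes which columns are incomplete; it says nothing about why the values are missing.</p>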