Missing Data Demystified: The Absolute Primer for Data Scientists
<p>Earlier this year, I started a piece on <a href="https://medium.com/towards-data-science/data-quality-issues-that-kill-your-machine-learning-models-961591340b40" rel="noopener">several data quality issues</a> (or characteristics) that heavily compromise our machine learning models.</p>
<p><strong>One of them was, unsurprisingly, Missing Data.</strong></p>
<p>I’ve been studying this topic for many years now (<em>I know, right?!</em>), but across some of the projects I contribute to in the <a href="https://datacentricai.community/" rel="noopener ugc nofollow" target="_blank">Data-Centric Community</a>, I realized that many data scientists still haven’t grasped the full complexity of the problem, which inspired me to create this comprehensive tutorial.</p>
<p><em>Today, we will delve into the intricacies of the </em><strong><em>problem of missing data</em></strong><em>, discover the different </em><strong><em>types of missing data</em></strong><em> we may find in the wild, and explore how we can </em><strong><em>identify and mark missing values </em></strong><em>in real-world datasets.</em></p>
<h1>The Problem of Missing Data</h1>
<p>Missing Data is an interesting data imperfection since it may arise naturally due to the nature of the domain, or be inadvertently created during data collection, transmission, or processing.</p>
<p><strong>In essence, missing data is characterized by the appearance of absent values in data</strong>, i.e., missing values in some records or observations in the dataset, and can either be <em>univariate</em> (one feature has missing values) or <em>multivariate</em> (several features have missing values).</p>
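<p>To make the univariate versus multivariate distinction concrete, here is a minimal sketch in plain Python. The toy records, the feature names, the use of <code>None</code> as the missing-value marker, and the helper functions <code>missing_counts</code> and <code>missingness_pattern</code> are all assumptions made for illustration; they are not part of any library.</p>

```python
# Toy records; None marks an absent value (an assumption for this sketch).
records = [
    {"age": 34, "income": 52000, "city": "Porto"},
    {"age": None, "income": 61000, "city": "Lisbon"},
    {"age": 29, "income": 48000, "city": "Braga"},
    {"age": None, "income": 75000, "city": "Faro"},
]


def missing_counts(rows):
    """Count absent (None) values per feature."""
    counts = {}
    for row in rows:
        for feature, value in row.items():
            counts[feature] = counts.get(feature, 0) + (value is None)
    return counts


def missingness_pattern(rows):
    """Return 'complete', 'univariate', or 'multivariate'."""
    affected = [f for f, n in missing_counts(rows).items() if n > 0]
    if not affected:
        return "complete"
    return "univariate" if len(affected) == 1 else "multivariate"


# Only "age" has absent values, so the pattern is univariate.
print(missing_counts(records))
print(missingness_pattern(records))

# Removing a value from a second feature makes the pattern multivariate.
records[2]["income"] = None
print(missingness_pattern(records))
```

<p>In practice, a dataframe library would likely do the counting for us, but the logic stays the same: count the absent values per feature, then check how many features are affected.</p>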
<p><a href="https://towardsdatascience.com/missing-data-demystified-the-absolute-primer-for-data-scientists-8c9244c764c4"><strong>Website</strong></a></p>