Hashing in Spark/Databricks: A Faster Way to Find New Records in Large Datasets

“Hey Bob, how’s it going with comparing those two gigantic datasets?” Mike yells across the cubicles.

“Still at it. The row-by-row comparison is killing me, and my coffee supply,” Bob replies, visibly stressed.

Mike chuckles, “Well, have you ever tried MD5 hashing?”

“MD5 what now?”

For data professionals like Bob, handling large datasets is as common as hearing the phrase “This meeting could’ve been an email.” Often, we’re tasked with comparing two massive files to find new or updated records. Conventional methods, such as row-by-row comparison, can make this task feel like watching paint dry: very slow and not very rewarding.

So, what do you do when you’re dealing with mammoth-sized datasets and the clock (and your patience) is ticking away? The answer might just be hashing, especially if your data lacks Change Data Capture (CDC) columns.

Read the full article: https://gazulas1.medium.com/hashing-in-spark-databricks-a-faster-way-to-find-new-records-in-large-datasets-a1f9eab9e18d
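To make the idea concrete, here is a minimal PySpark sketch of the approach: hash each row once, then compare hashes instead of comparing every column of every row. The DataFrames and column names (old_df, new_df, id, name, amount) are illustrative assumptions, not taken from the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-compare").getOrCreate()

# Hypothetical "old" and "new" snapshots of the same table, with no CDC columns.
# In practice these would be read from files or Delta tables.
old_df = spark.createDataFrame(
    [(1, "alice", 100.0), (2, "bob", 250.0)],
    ["id", "name", "amount"],
)
new_df = spark.createDataFrame(
    [(1, "alice", 100.0), (2, "bob", 300.0), (3, "carol", 75.0)],
    ["id", "name", "amount"],
)

def with_row_hash(df):
    # Build one MD5 hash per row by concatenating all columns as strings.
    # concat_ws treats nulls as empty strings, so rows with nulls still hash.
    return df.withColumn(
        "row_hash",
        F.md5(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns])),
    )

old_hashed = with_row_hash(old_df)
new_hashed = with_row_hash(new_df)

# Rows in the new snapshot whose hash is absent from the old snapshot are
# either brand-new records or updated versions of existing ones.
changed_or_new = new_hashed.join(
    old_hashed.select("row_hash"), on="row_hash", how="left_anti"
)

changed_or_new.show(truncate=False)
# With the sample data above, id 2 (amount changed) and id 3 (new record) appear.
```

The single left anti join on the hash column lets Spark do the heavy lifting, instead of shuffling and comparing every column of both datasets row by row.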
Tags: Large Datasets