Hashing in Spark/Databricks: A Faster Way to Find New Records in Large Datasets

“Hey Bob, how’s it going with comparing those two gigantic datasets?” Mike yells across the cubicles.

“Still at it. The row-by-row comparison is killing me, and my coffee supply,” Bob replies, visibly stressed.

Mike chuckles, “Well, have you ever tried MD5 hashing?”

“MD5 what now?”

For data professionals like Bob, handling large datasets is as common as hearing the phrase “This meeting could’ve been an email.” Often, we’re tasked with comparing two massive files to find new or updated records. Conventional methods, such as row-by-row comparison, can make this task feel like watching paint dry: very slow and not very rewarding.

So, what do you do when you’re dealing with mammoth-sized datasets and the clock (and your patience) is ticking away? The answer might just be hashing, especially if your data lacks Change Data Capture (CDC) columns.

Read the full article: https://gazulas1.medium.com/hashing-in-spark-databricks-a-faster-way-to-find-new-records-in-large-datasets-a1f9eab9e18d
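To make the idea concrete, here is a minimal PySpark sketch of the approach: hash each row once, then compare hashes instead of comparing every column of every row. The DataFrames and column names (old_df, new_df, id, name, amount) are illustrative assumptions, not taken from the article.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-compare").getOrCreate()

# Hypothetical "old" and "new" snapshots of the same table, with no CDC columns.
# In practice these would be read from files or Delta tables.
old_df = spark.createDataFrame(
    [(1, "alice", 100.0), (2, "bob", 250.0)],
    ["id", "name", "amount"],
)
new_df = spark.createDataFrame(
    [(1, "alice", 100.0), (2, "bob", 300.0), (3, "carol", 75.0)],
    ["id", "name", "amount"],
)

def with_row_hash(df):
    # Build one MD5 hash per row by concatenating all columns as strings.
    # concat_ws treats nulls as empty strings, so rows with nulls still hash.
    return df.withColumn(
        "row_hash",
        F.md5(F.concat_ws("||", *[F.col(c).cast("string") for c in df.columns])),
    )

old_hashed = with_row_hash(old_df)
new_hashed = with_row_hash(new_df)

# Rows in the new snapshot whose hash is absent from the old snapshot are
# either brand-new records or updated versions of existing ones.
changed_or_new = new_hashed.join(
    old_hashed.select("row_hash"), on="row_hash", how="left_anti"
)

changed_or_new.show(truncate=False)
# With the sample data above, id 2 (amount changed) and id 3 (new record) appear.
```

The single left anti join on the hash column lets Spark do the heavy lifting, instead of shuffling and comparing every column of both datasets row by row.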
Tags: Large Datasets