Delta Merge — Optimisation Strategies

<p>This post discusses how we improved our Delta Merge performance using&nbsp;<strong>Concurrency</strong>&nbsp;and&nbsp;<strong>Partitioning</strong>. It also describes a few other strategies for performance gains, based on what we observed in production. For a primer on how concurrency works, please refer to the&nbsp;<a href="https://docs.delta.io/latest/concurrency-control.html" rel="noopener ugc nofollow" target="_blank">documentation</a>. Finally, it documents the usage of some helper methods from&nbsp;<a href="https://github.com/MrPowers/jodie" rel="noopener ugc nofollow" target="_blank">jodie</a>.</p>
<h1>BATCH MERGE INTO</h1>
<p>Delta Lake provides the merge functionality, which lets us perform upserts, that is,&nbsp;<strong>UPDATE</strong>,&nbsp;<strong>INSERT</strong>&nbsp;and&nbsp;<strong>DELETE</strong>&nbsp;operations, all in one go. This shiny functionality is not cheap, though, and comes at the cost of two joins:</p>
<ol>
<li>An&nbsp;<strong>inner join</strong>&nbsp;between the target Delta table and the source DataFrame to determine matches. Any source row that does not match for either update or delete is inserted.</li>
<li>A final&nbsp;<strong>outer join</strong>&nbsp;between the chosen target files and the source data to write out the inserted, updated and deleted data.</li>
</ol>
<p>Therefore, any&nbsp;<strong>optimizations that apply to these joins</strong>&nbsp;will help improve the&nbsp;<strong>overall merge performance</strong>.</p>
<p><a href="https://medium.com/@joydeep.roy/delta-merge-optimisation-strategies-b78f18066966"><strong>Read More</strong></a></p>
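<p>The two joins described above can be pictured with a small toy model. The sketch below is plain Python, not Delta Lake code: the function <code>merge_sketch</code>, the file names and the row layout are all hypothetical, and deletes are left out for brevity. It only illustrates the shape of the work: the first pass finds the target files that contain matching keys, and the second pass rewrites just those files together with the source rows.</p>

```python
# Toy model of a two-phase merge (upsert only). This is NOT Delta Lake code:
# a "table" here is a dict mapping a hypothetical file name to a list of rows,
# and every row is a dict that carries the join key.

def merge_sketch(target_files, source_rows, key="id"):
    """Upsert source_rows into target_files.

    Returns (new_files, touched), where touched is the set of target
    file names that had to be rewritten.
    """
    source_by_key = {row[key]: row for row in source_rows}

    # Phase 1 (inner join): find the target files holding at least one
    # row whose key also appears in the source.
    touched = {
        name
        for name, rows in target_files.items()
        if any(row[key] in source_by_key for row in rows)
    }

    # Phase 2 (outer join): combine the rows of the touched files with
    # all source rows, keyed on the join key.
    existing = {row[key]: row for name in touched for row in target_files[name]}
    merged = []
    for k in sorted(existing.keys() | source_by_key.keys()):
        if k in source_by_key:
            # Matched key: update. Source-only key: insert.
            merged.append(source_by_key[k])
        else:
            # Target-only row living in a rewritten file: copied unchanged.
            merged.append(existing[k])

    # Untouched files are carried over as-is; they are never rewritten.
    new_files = {n: r for n, r in target_files.items() if n not in touched}
    new_files["merged-output"] = merged  # hypothetical name for the new file
    return new_files, touched


if __name__ == "__main__":
    target = {
        "part-0": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
        "part-1": [{"id": 3, "v": "c"}],
    }
    source = [{"id": 2, "v": "B"}, {"id": 4, "v": "d"}]
    files, touched = merge_sketch(target, source)
    print(sorted(touched))
    print(files)
```

<p>In this toy model, the fewer files the first pass selects, the less data the second pass has to read and rewrite, which is why optimizations that narrow either join tend to pay off for the merge as a whole.</p>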