Delta Merge — Optimisation Strategies
<p>This post discusses how we improved our Delta Merge performance using <strong>Concurrency</strong> and <strong>Partitioning</strong>. It also describes a few other strategies for performance gains, based on what we observed in production. For a primer on how concurrency control works in Delta Lake, please refer to the <a href="https://docs.delta.io/latest/concurrency-control.html" rel="noopener ugc nofollow" target="_blank">documentation</a>. Finally, it documents how to use some helper methods from <a href="https://github.com/MrPowers/jodie" rel="noopener ugc nofollow" target="_blank">jodie</a>.</p>
<h1>BATCH MERGE INTO</h1>
<p>Delta Lake provides merge functionality, which lets us perform upserts, that is, <strong>UPDATE</strong>, <strong>INSERT</strong> and <strong>DELETE</strong>, all in one go. This shiny functionality is not cheap, though, and comes at the cost of two joins:</p>
<ol>
<li>An <strong>inner join</strong> between the target Delta table and the source DataFrame to determine matches. Any source rows that do not match for either update or delete will be inserted.</li>
<li>A final <strong>outer join</strong> between the chosen target files and source files to write out the inserted/updated/deleted data.</li>
</ol>
<p>Therefore, any <strong>optimizations that apply to these joins</strong> will help improve the <strong>overall merge performance</strong>.</p>
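<p>To make the two passes concrete, here is a minimal toy sketch in plain Python. It is not Delta Lake's implementation and uses no Delta Lake API: a "file" is just a list of (key, value) rows, deletes are left out, and every name in it (<code>merge</code>, <code>part-0</code>, <code>merged-0</code>) is hypothetical. It only illustrates that the first pass selects the target files containing matches, and the second pass rewrites just those files together with the source.</p>

```python
# Toy model of the two-pass merge described above; NOT Delta Lake's actual code.
# A "file" is a list of (key, value) rows; all names here are hypothetical.

def merge(target_files, source_rows):
    """Upsert source_rows (dict: key to value) into target_files (dict: name to rows)."""
    # Pass 1 (inner join): find target files holding at least one matching key.
    touched = sorted(
        name for name, rows in target_files.items()
        if any(key in source_rows for key, _ in rows)
    )

    # Pass 2 (outer join): combine only the touched files with the source.
    merged = {}
    for name in touched:
        for key, value in target_files[name]:
            merged[key] = value      # target-side row, kept unless overwritten below
    merged.update(source_rows)       # matched keys are updated, new keys are inserted

    # Untouched files are carried over as-is; touched ones are rewritten as one new file.
    result = {name: rows for name, rows in target_files.items() if name not in touched}
    result["merged-0"] = sorted(merged.items())
    return touched, result


target = {
    "part-0": [(1, "a"), (2, "b")],
    "part-1": [(3, "c"), (4, "d")],
}
source = {2: "B", 5: "e"}   # key 2 updates an existing row, key 5 is new

touched, result = merge(target, source)
print(touched)   # ['part-0']
print(result)
```

<p>In this sketch only <code>part-0</code> is rewritten, because it is the only file with a matching key; <code>part-1</code> is carried over unchanged. The fewer files the first pass selects, the less data the second pass has to join and rewrite.</p>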
<p><a href="https://medium.com/@joydeep.roy/delta-merge-optimisation-strategies-b78f18066966"><strong>Read More</strong></a></p>