Databricks Liquid Clustering

<p><strong><em>Have you ever wondered if there&rsquo;s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?</em></strong></p>
<p>Well, I did! So let&rsquo;s talk about it.</p>
<h1><strong>The Challenge of Fixed Data Layouts</strong></h1>
<p>Have a look at this graph.</p>
<p><img alt="Yearly row counts for the kaggle_partitioned table" src="https://miro.medium.com/v2/resize:fit:700/1*AAV0Qs_55LyhTrubOQT2lA.png" style="height:304px; width:700px" /></p>
<p>Yearly row counts for the kaggle_partitioned table</p>
<p>This graph plots yearly row counts for a table and reveals a significant skew in the data distribution. The skew is particularly relevant because consumers frequently use the&nbsp;<em>year</em>&nbsp;column as a filter in their queries.</p>
<p>When it was created, this table was&nbsp;<strong>partitioned</strong>&nbsp;on the&nbsp;<strong><em>year</em></strong>&nbsp;and&nbsp;<strong><em>month</em></strong>&nbsp;columns. This is what the DDL looks like:</p>
<pre>
%sql
CREATE TABLE kaggle_partitioned (
  year_month STRING,
  exp_imp TINYINT,
  hs9 SMALLINT,
  Customs SMALLINT,
  Country BIGINT,
  quantity BIGINT,
  value BIGINT,
  year STRING,
  month STRING
) USING delta
PARTITIONED BY (year, month);
</pre>
<p>The problem here is that just two partitions hold ~83% of the table&rsquo;s total data.</p>
<p><img alt="Year-wise data split" src="https://miro.medium.com/v2/resize:fit:700/1*nVzUVp8iVB7ebhVF5D-rHQ.png" style="height:304px; width:700px" /></p>
<p>Year-wise data split</p>
<p><strong><em>Based on the information above, do you think the table is under-partitioned or over-partitioned?</em></strong></p>
<p>Let&rsquo;s look at this table&rsquo;s data distribution in more depth. The following chart presents the monthly split of each year&rsquo;s row counts.</p>
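<p>As a preview of the dynamic alternative the title refers to: the same table could instead be declared with Liquid Clustering by swapping <strong>PARTITIONED BY</strong> for <strong>CLUSTER BY</strong>. A minimal sketch (the table name <em>kaggle_clustered</em> is illustrative, not from the original example):</p>
<pre>
%sql
-- Same schema as before, but with clustering keys
-- instead of a fixed partition layout.
CREATE TABLE kaggle_clustered (
  year_month STRING,
  exp_imp TINYINT,
  hs9 SMALLINT,
  Customs SMALLINT,
  Country BIGINT,
  quantity BIGINT,
  value BIGINT,
  year STRING,
  month STRING
) USING delta
CLUSTER BY (year, month);
</pre>
<p>Unlike a fixed partition layout, clustering keys can later be changed with <strong>ALTER TABLE &hellip; CLUSTER BY</strong> without rewriting the existing data, which is what makes the layout adaptable when the data distribution is as skewed as the one above.</p>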