Databricks Liquid Clustering

Have you ever wondered whether there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses? Well, I did! So let’s talk about it.

<h1>The Challenge of Fixed Data Layouts</h1>

Have a look at this graph.

<img alt="" src="https://miro.medium.com/v2/resize:fit:700/1*AAV0Qs_55LyhTrubOQT2lA.png" style="height:304px; width:700px" />
Yearly row counts for the kaggle_partitioned table

This graph plots the yearly row counts for a table and reveals a significant skew in the data distribution. The skew matters because consumers frequently use the year column as a filter in their queries.

When this table was created, it was partitioned by the year and month columns. This is what its DDL looks like:

<pre>
%sql
CREATE TABLE kaggle_partitioned (
  year_month STRING,
  exp_imp    TINYINT,
  hs9        SMALLINT,
  Customs    SMALLINT,
  Country    BIGINT,
  quantity   BIGINT,
  value      BIGINT,
  year       STRING,
  month      STRING
)
USING delta
PARTITIONED BY (year, month);</pre>

The problem here is that 2 partitions hold ~83% of the table’s total data.

<img alt="" src="https://miro.medium.com/v2/resize:fit:700/1*nVzUVp8iVB7ebhVF5D-rHQ.png" style="height:304px; width:700px" />
Data split year-wise

Based on the information above, do you think the table is under-partitioned? Or is it over-partitioned?

Let’s look at the data distribution for this table in more depth. The following chart presents the monthly split of each year’s row count.

<a href="https://medium.com/@rahulsoni4/have-you-ever-wondered-if-theres-a-dynamic-solution-to-the-relentless-challenge-of-data-9e9956bd6bf9">Visit Now</a>
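
If you want to check the yearly skew discussed above on your own copy of the table, a query along the following lines is one way to do it. This is only a minimal sketch: it assumes the kaggle_partitioned table from the DDL above exists in the current schema, and the output column names row_count and pct_of_total are made up for illustration.

<pre>
%sql
-- Row count per year and each year's share of the whole table.
-- row_count and pct_of_total are illustrative aliases, not part of the table.
SELECT
  year,
  COUNT(*) AS row_count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM kaggle_partitioned
GROUP BY year
ORDER BY row_count DESC;</pre>

Grouping by year, month instead of year alone should give the same breakdown at the level of the individual partitions.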