Databricks Liquid Clustering
<p><strong><em>Have you ever wondered if there’s a dynamic solution to the relentless challenge of data partitioning in the world of data lakehouses?</em></strong></p>
<p>Well, I did! So let’s talk about it.</p>
<h1><strong>The Challenge of Fixed Data Layouts</strong></h1>
<p>Have a look at this graph.</p>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:700/1*AAV0Qs_55LyhTrubOQT2lA.png" style="height:304px; width:700px" /></p>
<p>Yearly row counts for kaggle_partitioned table</p>
<p>This graph shows yearly row counts for the table and reveals a significant skew in data distribution. The skew matters because consumers frequently filter on the <em>year</em> column in their queries.</p>
<p>When this table was created, it was <strong>partitioned</strong> on the <strong><em>year</em></strong> and <strong><em>month</em></strong> columns. This is what its DDL looks like.</p>
<pre>
%sql
CREATE TABLE kaggle_partitioned (
year_month STRING,
exp_imp TINYINT,
hs9 SMALLINT,
Customs SMALLINT,
Country BIGINT,
quantity BIGINT,
value BIGINT,
year STRING,
month STRING
) USING delta PARTITIONED BY (year, month);</pre>
<p>The problem is that just two of the yearly partitions hold ~83% of the table's total data.</p>
<p><img alt="" src="https://miro.medium.com/v2/resize:fit:700/1*nVzUVp8iVB7ebhVF5D-rHQ.png" style="height:304px; width:700px" /></p>
<p>Data split yearwise</p>
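<p>A yearly split like this can be checked directly against the table. The query below is a minimal sketch, assuming it runs in a Databricks notebook cell against the <em>kaggle_partitioned</em> table defined above; the <em>yearly</em>, <em>row_count</em> and <em>pct_of_total</em> names are illustrative only.</p>
<pre>
%sql
-- Row count per year, plus each year's share of the whole table.
WITH yearly AS (
  SELECT year, COUNT(*) AS row_count
  FROM kaggle_partitioned
  GROUP BY year
)
SELECT
  year,
  row_count,
  -- The empty OVER () makes the SUM span every year, giving the table total.
  ROUND(100.0 * row_count / SUM(row_count) OVER (), 2) AS pct_of_total
FROM yearly
ORDER BY row_count DESC;</pre>
<p>Sorting by <em>row_count</em> in descending order puts the heaviest years at the top, so the skew is visible in the first couple of rows.</p>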
<p><strong><em>Based on the information above, do you think the table is under-partitioned? Or is it over-partitioned?</em></strong></p>
<p>Let’s look at the data distribution for this table in more depth. The following chart presents the monthly split of each year's row count.</p>
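<p>The numbers behind a monthly breakdown like this could be pulled with a query along the following lines. Again, this is only a sketch against the <em>kaggle_partitioned</em> table from the DDL above; the <em>monthly</em>, <em>row_count</em> and <em>pct_of_year</em> names are made up for illustration.</p>
<pre>
%sql
-- Row count per (year, month) partition, plus each month's share of its year.
WITH monthly AS (
  SELECT year, month, COUNT(*) AS row_count
  FROM kaggle_partitioned
  GROUP BY year, month
)
SELECT
  year,
  month,
  row_count,
  -- PARTITION BY year restricts the SUM to the rows of the same year.
  ROUND(100.0 * row_count / SUM(row_count) OVER (PARTITION BY year), 2) AS pct_of_year
FROM monthly
-- year and month are STRING columns, so this ordering is lexicographic.
ORDER BY year, month;</pre>
<p>Each output row corresponds to one physical partition, so the result also shows how many partitions the table has and how unevenly the rows are spread across them.</p>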
<p><a href="https://medium.com/@rahulsoni4/have-you-ever-wondered-if-theres-a-dynamic-solution-to-the-relentless-challenge-of-data-9e9956bd6bf9"><strong>Visit Now</strong></a></p>