Liquid Clustering with Databricks Delta Lake

At this year’s Data + AI Summit, Databricks unveiled Liquid Clustering, a new approach that aims to improve both read and write performance through a dynamic data layout.

Recap: Partitioning and Z-Ordering

Partitioning and Z-ordering both rely on data layout to optimize data processing. They are complementary because they operate at different levels and apply to different types of columns.

Partition on the most frequently queried, low-cardinality columns.

  • Do not partition tables that contain less than 1 TB of data.
  • Rule of thumb: each partition should contain at least 1 GB of data.
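
The two partitioning rules of thumb above can be written as a quick check. This is only a minimal sketch: the function name `should_partition`, the binary definitions of TB and GB, and the idea of passing projected per-partition sizes are all assumptions made for illustration, not part of any Databricks API.

```python
from typing import Sequence

# Assumption: binary units; the guidance itself does not say which definition it uses.
GB = 1024 ** 3
TB = 1024 ** 4


def should_partition(table_bytes: int, partition_bytes: Sequence[int]) -> bool:
    """Apply the two rules of thumb to a table and its projected partition sizes."""
    # Rule 1: tables under 1 TB are not worth partitioning.
    if table_bytes < TB:
        return False
    # Rule 2: every partition should hold at least 1 GB of data.
    return all(size >= GB for size in partition_bytes)
```

For example, a 2 TB table split into partitions of a few gigabytes each passes both checks, while the same table split so finely that some partitions fall below 1 GB does not.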

Z-order on the most frequently queried, high-cardinality columns.

  • Use Z-order indexes alongside partitions to speed up queries on large datasets.
  • Z-order clustering only occurs within a partition, and cannot be applied to fields used for partitioning.
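
As a rough illustration of the constraint above, the helper below builds an `OPTIMIZE ... ZORDER BY` statement and refuses columns that are also partition columns. The helper name `optimize_zorder_sql` and the table and column names (`events`, `user_id`, `event_date`) are hypothetical; only the `OPTIMIZE ... ZORDER BY (...)` SQL form comes from Databricks itself.

```python
from typing import Sequence


def optimize_zorder_sql(
    table: str,
    zorder_cols: Sequence[str],
    partition_cols: Sequence[str] = (),
) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement (hypothetical helper)."""
    if not zorder_cols:
        raise ValueError("At least one Z-order column is required")
    # Z-ordering cannot be applied to fields used for partitioning.
    overlap = set(zorder_cols) & set(partition_cols)
    if overlap:
        raise ValueError(f"Cannot Z-order on partition columns: {sorted(overlap)}")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(zorder_cols)})"


# Hypothetical table partitioned by event_date and Z-ordered by user_id.
statement = optimize_zorder_sql("events", ["user_id"], ["event_date"])
```

In a Databricks notebook, the resulting string could then be passed to `spark.sql(statement)`, assuming a Delta table with that name exists.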
