Data Storage Decisions: Partitioning vs Z-Ordering

When we’re handling big data, storing it efficiently is an important task. If we’re not careful, we could encounter slow query performance, increased storage costs, difficult data management, poor scalability, and more. The way data is laid out across storage systems directly impacts our work as data professionals, so understanding how to store it efficiently is paramount. Two fundamental techniques for doing this are partitioning and Z-ordering.

Partitioning is the method of dividing a dataset into parts, or “partitions”, based on the values of one or more key columns. Each partition can be stored in a separate directory or physical location within the storage system. For example, if we had a dataset with a “year” column containing the values ‘2022’ and ‘2023’, we could partition the dataset on “year”. This would result in two directories, one containing all the rows for 2022 and the other containing all the rows for 2023. Then, if we wanted to analyze only the data from 2023, we could skip the 2022 partition entirely. How much this helps depends on the distribution of the data and the specific use case: if the rows are split evenly between the two years, the query reads roughly half the data, and with more partitions or skewed data the saving can be larger.
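To make the “year” example concrete, here is a minimal sketch in plain Python. It is only an illustration of the idea, not how any particular storage engine implements it: the sample rows, the `write_partitioned` and `read_partition` helpers, the `part-0.csv` file name, and the `year=2023` style of directory naming are all assumptions chosen for the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Hypothetical sample rows; the "year" column is the partition key.
ROWS = [
    {"year": "2022", "order_id": "1", "amount": "10.0"},
    {"year": "2022", "order_id": "2", "amount": "25.5"},
    {"year": "2023", "order_id": "3", "amount": "7.25"},
    {"year": "2023", "order_id": "4", "amount": "40.0"},
]


def write_partitioned(rows, root, key):
    """Write one directory per distinct value of `key`, e.g. year=2023/."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    for value, group in groups.items():
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        # For simplicity the key column is kept inside the file as well.
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(group[0].keys()))
            writer.writeheader()
            writer.writerows(group)


def read_partition(root, key, value):
    """Read only the files under the matching partition directory."""
    part_dir = Path(root) / f"{key}={value}"
    rows = []
    for path in sorted(part_dir.glob("*.csv")):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows


root = tempfile.mkdtemp()
write_partitioned(ROWS, root, "year")

# Two directories exist, one per year.
partitions = sorted(p.name for p in Path(root).iterdir())
print(partitions)

# A query for 2023 never opens anything under the 2022 directory.
rows_2023 = read_partition(root, "year", "2023")
print(len(rows_2023), "of", len(ROWS), "rows read")
```

The point to notice is that `read_partition` decides which files to open from the directory name alone, so the 2022 data is skipped without being scanned. That is the saving partitioning gives us when a query filters on the partition key.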