Going beyond Parquet: Handling large-scale datasets with Iceberg and Dask EuroPython 2025


2025-07-17 11:25–11:55, North Hall

Apache Iceberg is a high-performance open format for huge analytic tables. It brings the reliability and simplicity of SQL tables to big data. Originally built by Netflix to handle datasets at the petabyte scale, Iceberg is becoming the de facto open standard for storing tabular datasets. Its adoption can be seen across the industry, from Databricks' acquisition of Tabular to the release of Snowflake's Iceberg Tables or Amazon's S3 Tables.

In this talk, we start by looking at Iceberg's storage model and how it enables reliable concurrent data operations on top of Parquet files. We continue with Dask's DataFrame API, which lets users scale pandas workloads to clusters of machines in a pure Python environment. We discuss how it integrates with Apache Iceberg to provide ACID guarantees and high-performance queries. Equipped with this understanding, we investigate more advanced Iceberg features such as hidden partitioning and time travel. Throughout the talk, we compare Dask workloads executed on Iceberg datasets against plain Parquet to gain a practical understanding of Iceberg's benefits.
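As a rough illustration of the snapshot idea behind concurrent writes and time travel, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not Iceberg's implementation: `Snapshot`, `ToyTable`, `commit_append`, and `scan` are hypothetical names that exist only in this example and are not part of Iceberg, PyIceberg, or Dask. A real Iceberg table tracks its files through metadata files, manifest lists, and manifests, and commits by atomically swapping a metadata pointer in a catalog.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """An immutable list of the data files that make up one table version."""
    snapshot_id: int
    data_files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table: a snapshot log, newest last."""
    snapshots: list[Snapshot] = field(default_factory=lambda: [Snapshot(0, ())])

    @property
    def current(self) -> Snapshot:
        return self.snapshots[-1]

    def commit_append(self, base_id: int, new_files: list[str]) -> Snapshot:
        """Append files only if the table is still at the snapshot the writer read.

        A writer that worked from a stale snapshot is rejected and has to retry.
        This mimics optimistic concurrency in spirit only.
        """
        if base_id != self.current.snapshot_id:
            raise RuntimeError("table changed since it was read; retry the commit")
        snap = Snapshot(base_id + 1, self.current.data_files + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def scan(self, snapshot_id: int | None = None) -> tuple[str, ...]:
        """List the files of the current snapshot, or of an older one (time travel)."""
        if snapshot_id is None:
            return self.current.data_files
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap.data_files
        raise KeyError(snapshot_id)


table = ToyTable()
first = table.commit_append(0, ["part-0.parquet"])
second = table.commit_append(first.snapshot_id, ["part-1.parquet"])

latest_files = table.scan()                     # files visible in the newest snapshot
older_files = table.scan(first.snapshot_id)     # files visible as of the first commit
```

Because existing snapshots are never modified, a reader that started on an older snapshot keeps seeing a consistent set of files while writers add new ones. A plain directory of Parquet files has no such version log, which is the contrast the talk draws.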

Expected audience expertise: Intermediate

Patrick Hoefler

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He currently works at Coiled, where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds an MSc in Mathematics and is working towards an MSc in Software Engineering at the University of Oxford.
