Marc-André Lemburg

Marc-Andre is the CEO and founder of eGenix.com, a Python-focused boutique project and consulting company based in Germany, specializing in the data, finance and database space. He has a degree in mathematics from the University of Düsseldorf.

His work with and for Python started in 1994. He is a Python Core Developer who designed and implemented the Unicode support in Python, the editor of the Python DB-API, and the author of several open source libraries and tools (e.g. the mx Extensions mxDateTime and mxODBC).

Marc-Andre is a EuroPython Society (EPS) Fellow and a Python Software Foundation (PSF) founding Fellow, and he co-founded a local Python meeting in Düsseldorf (PyDDF). He served on the boards of the PSF and EPS for many years and loves to contribute to the growth of Python wherever he can.

More information is available at https://malemburg.com/


Session

07-15
15:25
30min
DuckLake - Take Python and DuckDB for a swim in your data lake
Marc-André Lemburg

Pitch

With DuckDB and DuckLake, managing and analyzing huge data sets is no longer limited to complex cloud infrastructure setups. You can now run these tasks on your own notebook, at comparable speeds. This talk will show you how.

Description

DuckDB is an embedded relational analytics database (OLAP) which can be added to a Python project with a simple uv add duckdb or pip install duckdb. It is both fast and powerful for processing analytical data warehouse workloads, using a SQL dialect that closely follows the well-known PostgreSQL one. Data can be held in memory or persisted on disk. DuckDB is well integrated with Polars via zero-copy Apache Arrow data structures, making it a great choice for complex data science and engineering tasks.

DuckLake is an extension which comes with DuckDB to add data lake features, meaning that huge data sets can be managed using Parquet files stored on disk or in an object store such as S3. It uses a novel approach to data lakes in that the management structures are stored in a database (DuckDB), instead of complex file and directory structures, as many other data lake systems do. This provides great advantages for implementing smart features such as snapshots, schema evolution or time travel.

Again, installation of the extension is just a simple INSTALL ducklake command away, making this a really easy way to configure your own personal "lake house" - the ideal combination of a data warehouse with a data lake.
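The setup described above could look roughly like the following sketch. The catalog file name, data path, alias and the weather table are hypothetical choices for this example, and the time travel query assumes that a snapshot with version 1 exists in the catalog. The statements are only built as strings here; run_session() executes them and needs network access the first time, to download the extension.

```python
def ducklake_statements(catalog="metadata.ducklake",
                        data_path="data_files/",
                        alias="lake"):
    """Return the SQL statements for a small DuckLake session.

    All names used here (catalog file, data path, alias, table)
    are illustrative assumptions, not taken from the talk.
    """
    return [
        # Download the extension once per DuckDB installation.
        "INSTALL ducklake",
        # The metadata catalog lives in a DuckDB file,
        # the table data in Parquet files below data_path.
        f"ATTACH 'ducklake:{catalog}' AS {alias} (DATA_PATH '{data_path}')",
        f"USE {alias}",
        "CREATE TABLE IF NOT EXISTS weather (station VARCHAR, temp_c DOUBLE)",
        "INSERT INTO weather VALUES ('DUS', 21.5)",
    ]


# Time travel: read the table as of an earlier snapshot
# (assumes that snapshot version 1 exists).
TIME_TRAVEL_QUERY = "SELECT * FROM weather AT (VERSION => 1)"


def run_session(statements):
    """Execute the statements on a fresh connection and return it."""
    import duckdb  # imported lazily so the helper above works without DuckDB

    con = duckdb.connect()
    for sql in statements:
        con.execute(sql)
    return con
```

Calling run_session(ducklake_statements()) creates the catalog file and the data directory in the current working directory. Every change to the lake creates a new snapshot, which is what makes queries such as TIME_TRAVEL_QUERY possible.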

The talk will give a short introduction to the database terminology, explain what is novel about the DuckLake approach and then showcase a typical use case for lake houses: storing historical weather data and making this available for analytics to Python applications.

Both DuckDB and DuckLake are MIT licensed.

Resources:
- Python.org
- DuckDB – An in-process SQL OLAP database management system
- DuckLake is an integrated data lake and catalog format – DuckLake

Data Engineering and MLOps
Conference Hall Complex (S4)