Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars :: PyData London 2024

Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars

06-15, 12:00–12:40 (Europe/London), Warwick

Dask is a library for distributed computing with Python that integrates with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance. A re-implementation of Dask DataFrames will bring it up to speed with Spark, DuckDB and Polars.

Dask is a library for distributed computing with Python that integrates tightly with pandas and other libraries from the PyData stack. It offers a DataFrame API that wraps pandas and thus offers an easy transition into the big data space.

Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance (there were many ways to accidentally perform poorly). It was great for experts, but bad for novices. Accidentally triggering computations on your data, killing your clusters because of inefficient shuffling algorithms and unexpectedly running out of memory were only a few of the potential pitfalls. Other tools like Spark, and especially newer tools like DuckDB and Polars, simply performed better.

The last 12 months brought many exciting improvements to Dask that address all of these problems:

A new, Apache Arrow-based shuffling algorithm
Query planning for Dask DataFrames
pandas 2.0 and the efficient strings that come with it

Integrating Arrow better into Dask and pandas was a game changer. It provides a more robust and faster system and gives users a better UX. We will look at some of these improvements in more detail and illustrate the changes that Dask went through with live demos before we focus on how Dask fits into the big data space. The main competitors are Spark, DuckDB and Polars. We will explore different aspects that require consideration when making technology choices, including:

Deployment options
Ecosystem
Performance

We will use the TPC-H benchmarks to compare performance across all tools. No project emerges unscathed from this comparison. Finally, we will look ahead at how we extend the query optimiser to fit different projects like Xarray and potentially pandas as well.

Prior Knowledge Expected –

No previous knowledge expected

Patrick Hoefler

Patrick Hoefler is a member of the pandas core team and a Dask maintainer. He currently works at Coiled, where he focuses on Dask development and the integration of a logical query planning layer into Dask. He holds an MSc degree in Mathematics and is working towards an MSc in Software Engineering at the University of Oxford.
