PyData London 2024

10 years of Parquet: what’s next?
06-16, 15:45–16:25 (Europe/London), Minories

As Apache Parquet enters its teenage years, it’s time for us to celebrate its success in tabular analytics and explore how we can improve upon it for tomorrow’s GPU- and ML-oriented workloads.


As Apache Parquet enters its teenage years, it’s time for us to reflect on its journey and plan for the next generation of data formats. This talk will delve into the intricacies of designing a data format that caters to the evolving needs of the modern Python data stack.

If you’re interested in learning more about Parquet, how columnar storage formats work, or about state-of-the-art compression, then this talk is for you!

We will be covering:
* the internals of the Parquet file format;
* how it helps accelerate tabular analytics (e.g. in Polars and DuckDB);
* why it is not such a great fit for today’s GPU- and ML-oriented workloads;
* recent advancements in lightweight, data-parallel compression; and
* how we can achieve 100x better performance with a modern file format.


Prior Knowledge Expected

No previous knowledge expected

CTO & Co-Founder at Spiral

Spiral is a small startup based in London and New York building a new database to accelerate machine learning and tabular analytics.