PyData London 2024

Analytics engineering without dbt? Building the composable Python data stack with Kedro and Ibis
06-14, 13:30–15:00 (Europe/London), Warwick

For the past decade, SQL has reigned as king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. Now, however, Ibis can provide the benefits of SQL execution through a flexible Python dataframe API, and we can leverage it to build scalable Python pipelines in Kedro. In this tutorial, we will develop a simple analytics pipeline locally, then deploy it in a cloud data warehouse with just a configuration change.

Python has become the lingua franca of data science, and it's a great language for building AI/ML pipelines. However, in the data engineering world, it leaves much to be desired. A lot of data practitioners end up:

  • slurping up large amounts of data into memory, instead of pushing execution down to the underlying database/engine
  • implementing proof-of-concepts on data extracts, and then struggling massively when they need to migrate or rewrite their logic to run against the production databases and scale out
  • insisting on building data pipelines in Python for consistency (fair enough), when dbt would have been a much better fit because what they essentially needed was a SQL workflow

In this session, we will first understand the motivation for a better solution for building production data pipelines in Python:

  • The dev-prod dilemma. Existing solutions excel in the PoC/development phase; however, deploying the same code in production doesn't work as well as one would hope.
  • The SQL solution. In spite of its drawbacks, SQL presents a standardized* programming language that's supported by every database (and many other compute frameworks).
  • "What if I don't like SQL?" In the end, there will always be people (like myself, and, I imagine, many other attendees at a major Python conference) who would rather use Python than SQL.

Then, we will implement a local solution using DuckDB and two popular open-source Python libraries:

  • Kedro for building data pipelines following software engineering best practices
  • Ibis for defining data transformations using a familiar dataframe API that get executed with the scale and performance of modern SQL

Last but not least, we will discuss other benefits of this solution, including the reusability and portability of the Ibis-based data pipelines and validations. To that end—with one simple configuration change—we will run the same pipeline at scale in Starburst Galaxy.

Materials can be found on GitHub: everything is ready to be run on GitHub Codespaces.

Prior Knowledge Expected

No previous knowledge expected

Deepyaman is a software engineer at Voltron Data. Before Claypot AI was acquired by Voltron Data, he was a Founding Machine Learning Engineer there, working on its real-time feature engineering platform. Prior to that, he led data engineering teams and asset development across a range of industries at QuantumBlack, AI by McKinsey.

Deepyaman is passionate about building and contributing to the broader open-source data ecosystem. Outside of his day job, he helps maintain Kedro, an open-source Python framework for building production-ready data science pipelines.

Juan Luis (he/him/él) is an Aerospace Engineer with a passion for tech communities, outreach, and sustainability. He works at QuantumBlack, AI by McKinsey, as Product Manager for Kedro, an opinionated Python framework for creating reproducible, maintainable, and modular data science code. He has worked as a Developer Advocate at Read the Docs, as a software engineer in the space, consulting, and banking industries, and as a Python trainer for several private and public entities.

Apart from being a long-time user of and contributor to many projects in the scientific Python stack (NumPy, SciPy, Astropy), he has published several open-source packages, the most important one being poliastro, an open-source Python library for interactive astrodynamics used in academia and industry.

Finally, Juan Luis is the founder and former chair of the Python España association, the point of contact for the Spanish Python community; a former organizer of PyCon Spain, which attracted 800 attendees in its last in-person edition in 2022; and the current organizer of the PyData Madrid monthly meetups.