PyData London 2024

08:00
08:00
60min
Registration & Breakfast
Minories
08:00
60min
Registration & Breakfast
Warwick
08:00
60min
Registration & Breakfast
Salisbury
09:00
09:00
90min
An Introduction to Retrieval Augmented Generation
Dan Gibson

How do you build chatbots that answer questions using your organisation's data? The answer is Retrieval Augmented Generation (RAG). In this session you'll be introduced to RAG and build a simple RAG powered chatbot in Python.

Minories
09:00
210min
GPU Development in Python 101
Jacob Tomlinson, Andy Terrel

Since joining NVIDIA I’ve gotten to grips with the fundamentals of writing accelerated code in Python. I was amazed to discover that I didn’t need to learn C++ and I didn’t need new development tools. Writing GPU code in Python is easier today than ever, and in this tutorial, I will share what I’ve learned and how you can get started with accelerating your code.

Salisbury
09:00
90min
Mastering Data Flow: Empower Your Projects with Prefect's Pipeline Magic
Adam Hill

Embark on a transformative journey into the realm of data engineering with our 90-minute workshop dedicated to Prefect 2. In this hands-on session, participants will learn the ins and outs of building robust data pipelines using the latest features and enhancements of Prefect 2. From data ingestion to advanced analytics, attendees will gain hands-on experience and practical insights to elevate their data engineering skills.

Warwick
10:30
10:30
30min
Break & Snacks
Minories
10:30
30min
Break & Snacks
Warwick
11:00
11:00
90min
How you (yes, you!) can write a Polars Plugin
Marco Gorelli

Polars is a dataframe library taking the world by storm. It is very runtime and memory efficient and comes with a clean and expressive API. Sometimes, however, the built-in API isn't enough. And that's where its killer feature comes in: plugins. You can extend Polars, and solve practically any problem.

No prior Rust experience is required, but intermediate Python or general programming experience is. By the end of the session, you will know how to write your own Polars Plugin! This talk is aimed at data practitioners.

Warwick
11:00
90min
Multimodal Deep Learning in the Real World
Isaac

Many real world business problems are multi-modal in nature and would benefit from using a combination of text, imagery, audio, and numerical data. Recently, there has been a surge in powerful deep learning models that fuse multiple modalities of data; however, fine-tuning, deploying, and versioning these models remains challenging for most companies. This tutorial will discuss some of the latest research in the field and then walk through several real world examples of fine-tuning, deploying, and serving multi-modal deep learning models using open source frameworks like HuggingFace, Kubeflow, and Django.

Minories
12:30
12:30
60min
Lunch
Minories
12:30
60min
Lunch
Warwick
12:30
60min
Lunch
Salisbury
13:30
13:30
90min
Analytics engineering without dbt? Building the composable Python data stack with Kedro and Ibis
Deepyaman Datta, Juan Luis Cano Rodríguez

For the past decade, SQL has reigned king of the data transformation world, and tools like dbt have formed a cornerstone of the modern data stack. Until recently, Python-first alternatives couldn't compete with the scale and performance of modern SQL. However, now Ibis can provide the same benefits of SQL execution with a flexible Python dataframe API, and we can leverage it to build scalable Python pipelines in Kedro. In this tutorial, we will develop a simple analytics pipeline locally, then deploy it in a cloud data warehouse, with just a configuration change.

Warwick
13:30
90min
Graph databases and Retrieval Augmented Generation
Kehinde Richard Ogunyale

In the era of large language models (LLMs), the integration of external, structured knowledge bases has emerged as a frontier for enhancing AI's textual comprehension and generation capabilities. The Retrieval-Augmented Generation (RAG) architecture represents a pivotal advancement in this domain, particularly when leveraging graph databases to augment LLMs.
This workshop will break down how combining these technologies can make AI not just better at creating text that's both accurate and relevant, but also capable of understanding context like never before. We'll explore the building blocks of RAG—how it uses a 'retriever' to find useful information and a 'generator' to create responses. Graph databases play a crucial role here; they're a type of database that's really good at showing how different pieces of information are connected. This ability makes AI responses more insightful and adaptable to new information. Step by step, we'll walk through how to build AI applications using RAG and graph databases, covering everything from the initial setup and getting the data ready, to fine-tuning how the AI finds and uses information to answer questions or write text. This session is designed to give you the tools to create AI that not only knows more but can also use that knowledge to generate responses that truly understand and reflect the complexity of the world around us.

Minories
13:30
90min
Let AI help you find the Best Bar. Build a Real-Time Personalized AI Recommender System powered by a LLM.
Raymond Cunningham, Alex Avstreikh

In this tutorial we will build an AI system to assist you in finding the best bar for you to go to in London - maybe even this evening after the PyData conference.

Salisbury
15:00
15:00
30min
Break & Snacks
Minories
15:00
30min
Break & Snacks
Warwick
15:00
30min
Break & Snacks
Salisbury
15:30
15:30
90min
From Classic to Cutting Edge Text Classification: Generating Customers Insights with Topic Modelling and HuggingFace SetFit Method
Sultan Al Awar

Stop data skimming and dive deep into your customer voices! Are you working with a load of unstructured reviews and would like to gain an understanding of what customers are commenting about? This hands-on tutorial equips you with powerful text analysis techniques to unlock hidden insights and inform data-driven decisions. Whether you're an experienced data scientist or analyst or just starting out, this session will guide you through two text classification approaches:

1) Classic Topic Modelling: Uncover recurring themes and trends within customer comments using a generative probabilistic modelling approach such as LDA (Latent Dirichlet Allocation).

2) SetFit Few-Shot Learning: Fine-tune a HuggingFace (HF) sentence transformers model with minimal data to automatically categorise and label reviews, offering deeper insights into key strengths as well as opportunities for improvement.

Upon completing the tutorial, you will have hands-on experience gained from a Google Colab notebook provided beforehand, which will enable you to apply the tutorial's knowledge and achieve the following outcomes:
- Apply topic modelling with necessary text pre-processing and feature engineering techniques to discover underlying topics in a collection of text.
- Fine-tune a HF transformer on a small labelled dataset using the SetFit few-shot learning method
- Evaluate the performance of the fine-tuned transformers model
- Use the fine-tuned model to generate classification themes on unlabelled data
- Develop a baseline evaluation mechanism to monitor the model in production

Please follow these steps to prepare for the tutorial:

1) Set up Google Colab.

2) Download the data and notebooks folders from this repository: https://rb.gy/ovru2m.

This will allow you to run the notebooks and follow along with the tutorial using Google Colab!

Ready to transform your understanding of multi-class text classification on customer data? Join me and unleash its power!

Minories
15:30
90min
Probabilistic Programming and Bayesian Computing with PyMC
Chris Fonnesbeck, Thomas Wiecki

Bayesian statistical methods provide powerful tools for solving various data science problems. The Bayesian approach yields easy-to-interpret results and automatically accounts for uncertainty in our estimates or predictions. Although computational challenges have historically been an obstacle, especially for new users, there are now mature probabilistic programming tools that are both efficient and easy to learn. We will use the latest release of PyMC (version 5) for this tutorial, but the concepts and techniques taught can be applied to any probabilistic programming framework.

This tutorial targets practicing and aspiring data scientists and analysts who seek to incorporate Bayesian statistics and probabilistic programming into their work. It will provide new users with an overview of Bayesian statistical methods and their applicability in various situations. Learners will also gain practical experience in applying these methods using PyMC, including the specification, fitting, and validation of models using a real-world dataset.

Salisbury
15:30
90min
Test-Driven Data Analysis in Python
Nick Radcliffe

Test-driven data analysis is a methodology and open-source Python library for improving quality in data processes. It covers three main areas:
• Testing data (generating constraints and using them to validate new data)

Warwick
08:00
08:00
60min
Registration & Breakfast
Minories
08:00
60min
Registration & Breakfast
Warwick
08:00
60min
Registration & Breakfast
Salisbury
09:00
09:00
15min
Opening Notes
Warwick
09:15
09:15
45min
Keynote- Dr. Rebecca Bilbro- Mistakes were made | Data science ten years in
Dr. Rebecca Bilbro

To honor ten years of PyData London, join Dr. Rebecca Bilbro as she takes us back in time to reflect on a little over ten years working as a data scientist. One of the many renegade PhDs who joined the fledgling field of data science in the 2010s, Rebecca will share lessons learned the hard way, often from watching data science projects go sideways and learning to fix broken things. Through the lens of these canon events, she'll identify some of the anti-patterns and red flags she's learned to steer around.

Warwick
10:00
10:00
30min
Break & Snacks
Minories
10:00
30min
Break & Snacks
Warwick
10:00
30min
Break & Snacks
Salisbury
10:30
10:30
40min
RAG for a medical company: the technical and product challenges
Noé Achache

RAG (Retrieval Augmented Generation) is the process of querying a (large) set of documents with natural language, leveraging vector search and LLMs. While it has recently become widely accessible to develop a Proof-Of-Concept RAG using OpenAI and one of the various open-source contributions (e.g. LangChain), building a performant RAG that brings value to users is challenging.
This talk will focus on learnings from building a RAG for a medical company, to allow doctors to query drug documentation with natural language, using tools like Chainlit, Qdrant and LangSmith.
Naturally, a product question emerged: how to effectively leverage LLMs that can never guarantee 100% accuracy in the health sector?
We will explain how we addressed this challenge, as well as the various technical improvements implemented to enhance both the retrieval (vector search) and generation (LLM) metrics of our RAG.

Minories
10:30
40min
Training and Deployment of ML models at scale in a Risk Controlled Banking Environment
Arun Kundgol, Aaron Byrne

Controlled environments such as banks are characterized by stringent data governance, model risk policies and operational protocols. These present unique challenges for data science teams to deliver business and customer value. While these constraints manage model and technology risk, they often impede agility and experimentation - key drivers of innovation in data science.
This talk discusses how we've managed to scale model training and deployment by 10X with our existing on-prem data science platform.

Warwick
10:30
40min
Using Zarr as a universal and efficient format for drug discovery datasets in Polaris
Cas Wognum

Polaris is an industry-led, community-run, and open-source benchmarking platform for machine learning (ML) based drug discovery. From kilobytes to terabytes and from 3D protein structures to phenomics, Polaris leverages Zarr as a universal format to support the wide variety of datasets used by ML researchers in drug discovery.

Salisbury
11:15
11:15
40min
Building Multi-Agent Generative-AI Applications with AutoGen
Victor Dibia, Chi Wang, Diego Colombo

Discover the potential of multi-agent generative AI applications with AutoGen, a pioneering framework designed to tackle complex tasks requiring multi-step planning, reasoning, and action. In this talk, we will explore the fundamentals of multi-agent systems, learn how to build applications using AutoGen, and discuss the open challenges associated with this approach, such as control trade-offs, evaluation challenges, and privacy/security considerations.

With AutoGen's open-source platform and growing ecosystem, developers can harness the power of generative AI to create advanced AI assistants and interfaces for the digital world. This talk is ideal for those with a general understanding of generative AI and Python application development.

Minories
11:15
40min
Observability for Dask in Production
Hendrik Makait

Debugging is hard. Distributed debugging is hell.

Dask is a popular library for parallel and distributed computing in Python. Dask is commonly used in data science, actual science, data engineering, and machine learning to distribute workloads onto clusters of many hundreds of workers with ease.

However, when things go wrong life can become difficult due to all of the moving parts. These parts include your code, other PyData libraries like NumPy/pandas, the machines you’re running on, the network between them, storage, the cloud, and of course issues with Dask itself. It can be difficult to understand what is going on, especially when things seem slower than they should be or fail unexpectedly. Observability is the key to sanity and success.

In this talk, we describe the tools Dask offers to help you observe your distributed cluster, analyze performance, and monitor your cluster to react to unexpected changes quickly. We will dive into distributed logging, automated metrics, event-based monitoring, and root-causing problems with diagnostic tooling. Throughout the talk, we will leverage real-world use cases to show how these tools help to identify and solve problems for large-scale users in the wild.

This talk should be particularly insightful for Dask users, but the approaches to observing distributed systems should be relevant to anyone operating at scale in production.

Salisbury
11:15
40min
The evolving conversation: How continuous testing keeps your LLM on track.
Emeli Dral

LLM systems are powerful, but it can be challenging to ensure their reliable and effective operation in production. In this talk, we will explore continuous testing, one of the critical components for LLM safety. We will discuss how one can monitor unintended behaviors and low-quality responses, identify evolving user patterns, and help LLM adapt and improve over time.

Warwick
11:40
11:40
60min
Leaders at PyData
Ian Ozsvald

A facilitated session for leaders to discuss the opportunities and challenges they face. This is the 8th iteration at a PyData conference. Questions are raised and answered by attendees, and the session is facilitated by Ian Ozsvald (PyDataLondon co-founder). You are encouraged to carry on talking to fellow leaders after this session. If you have a question you may wish to raise, please contact Ian directly (ian at ianozsvald com).

A novel discussion format called a "crit", which Ian developed for a private leadership group, will be introduced; you're welcome and encouraged to copy and use it in your own organisations. Typical attendance is 60+ leaders.

The 2022 session ("Executives at PyData" as it was known) was written up; you can see it here: https://numfocus.medium.com/executives-at-pydata-global-2022-193cbc2d3f3b

Beaumont
12:00
12:00
40min
Dask DataFrame 2.0 - Comparison to Spark, DuckDB and Polars
Patrick Hoefler

Dask is a library for distributed computing with Python that integrates with pandas. Historically, Dask was the easiest choice to use (it’s just pandas) but struggled to achieve robust performance. A re-implementation of Dask DataFrames will bring it up to speed with Spark, DuckDB and Polars.

Warwick
12:00
40min
Strategic Planning in Public Health: Linear Programming for Resource Allocation
Carlos Samey

The Togolese Ministry of Health distributes contraceptives to women across the country every year. The Ministry employs two different types of interventions (open days at district health clinics, and mobile clinics that travel to remote regions), implemented multiple times in a year and in different districts, to achieve an annual target of the number of women reached (i.e. coverage). However, the two types of intervention have varying costs and varying efficacy in reaching women in different districts. This makes planning and budgeting for the interventions while also achieving a desired coverage extremely challenging for the Ministry. In this talk, we will explore how this problem may be tackled using linear programming to optimize how the two interventions are implemented across districts and at multiple timepoints annually. We analyze historical data and leverage optimization models to tailor implementation of the two interventions to ensure cost-effectiveness while meeting coverage targets. We also discuss different variants of the optimization model to introduce flexibility and customization for managing resources, the assumptions involved, and their utility in improving intervention planning. Attendees will gain insight into the technical complexities of implementing linear optimization models, the challenges involved in using them for decision-making, and their potential to help efficiently allocate scarce resources. Suitable for data professionals interested in implementing data-driven solutions to improve resource allocations under specific constraints. No prior knowledge required.

Salisbury
12:00
40min
Taking LLMs out of the black box: A practical guide to human-in-the-loop distillation
Ines Montani

As the field of natural language processing advances and new ideas develop, we’re seeing more and more ways to use compute efficiently, producing AI systems that are cheaper to run and easier to control. Large Language Models (LLMs) have enormous potential, but also challenge existing workflows in industry that require modularity, transparency and data privacy. In this talk, I'll show some practical solutions for using the latest state-of-the-art models in real-world applications and distilling their knowledge into smaller and faster components that you can run and maintain in-house.

Minories
12:45
12:45
60min
Lunch
Minories
12:45
60min
Lunch
Warwick
12:45
60min
Lunch
Salisbury
13:45
13:45
45min
Keynote- Data: Faithful or Traitor?
Dr. Matthew Crooks

Matt will be discussing the variety of data available to the BBC and many of the projects that the Data Science team have worked on. The BBC's digital products data presents exciting opportunities for working with huge volumes of data and helps the BBC to tailor its products and services to its users. However, this data over-represents already engaged audiences and existing content. To identify opportunities outside of its current offering to meet the needs of underserved audiences, the team have also worked on a range of projects using nationally representative survey and panel data.

Warwick
14:30
14:30
30min
Break & Snacks
Minories
14:30
30min
Break & Snacks
Warwick
14:30
30min
Break & Snacks
Salisbury
14:30
160min
PyMC Hackathon

We invite you to join the PyMC dev team for an afternoon hackathon. Help us write new code, refactor old code, squash bugs, write documentation, develop examples, and more! Whether you are a current user or developer, or are brand new to Bayesian computing, this is a great opportunity to contribute and learn.

Beaumont
15:00
15:00
40min
Moving from Offline to Online Machine Learning with River
Tun Shwe

Learn how to get started on your online ML journey with River, an open source Python ML library. The foundations of machine learning were built on offline batch processing techniques for model training and inference. As organisations become more dependent on real-time data, the technological trend for machine learning in production is moving towards adding an online stream processing approach. This has benefits such as lower computational requirements due to being able to incrementally learn from a stream of data points, which enables the continual upgrading of models by adapting to real-time changes in data.

Warwick
15:00
40min
Open Source Leadership: What to Give Away and What to Bring In
Deb Nicholson

Most leaders have loads of work they could give away while community-driven projects have plenty of tasks they could consider bringing in, so how do you figure out what to keep and what to give away?

Salisbury
15:00
40min
Uplift Modeling: How to Enhance Customer Targeting in Marketing with Causal Machine Learning
Hajime Takeda

Causal inference has traditionally been used in the field of marketing. Uplift modeling is one of the major techniques of customer analytics and it helps companies to identify customers with the highest marketing effect when targeted.
In recent years, algorithms combining causal inference and machine learning have been a hot topic. CausalML is a good Python package developed by Uber and it provides a suite of uplift modeling and causal inference methods.
In this talk, I will show the key concepts of causal inference with machine learning, their application in marketing science (Uplift modeling), their demonstration using CausalML, and practical tips.

Minories
15:45
15:45
40min
Are generator-coroutines really the answer?
James Powell

As we all know (or, at least, as I've been trying to tell everyone,) generators in Python are an extremely powerful API design technique. A generator represents the linear decomposition of a single computation into multiple parts, and such decomposition proves very useful in practice. For example, we can model an infinite computation and only execute the portions we desire. Very similarly, we can simplify APIs that specify when a computation terminates, by modeling these computations as infinite sequences of steps, and allowing the end-user to directly control which steps are performed. We can even interleave the parts of multiple, distinct computations (though in Python ≥3.6, this is better done with the custom async and await syntax and associated protocols.)
A generator-coroutine offers us an alternative formulation for a state machine, but one which represents state and transitions implicitly in the form of (linearised) source text: in other words, a state machine that we can read and understand like any other regular code (and where we have arbitrary control over data-flow.)
But, in practice, the principles which support the use of generators (e.g., as iteration helpers,) often contrast with the code we get when we model with generator-coroutines, and a number of practical issues arise. While these issues may be surmountable (with enough effort and enough contortion,) the question remains: are generator-coroutines really the answer?
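As a rough illustration of the two ideas in this abstract (not material from the talk itself), the sketch below shows an infinite generator whose caller decides how many steps to run, and a generator-coroutine that keeps its state implicitly in local variables. The names `naturals` and `running_mean` are made up for this example.

```python
from itertools import islice


def naturals():
    """An infinite computation; the caller decides how many steps to run."""
    n = 0
    while True:
        yield n
        n += 1


def running_mean():
    """Generator-coroutine: receives values via send() and yields the mean so far.

    The state (total, count) lives in ordinary local variables rather than
    in an explicit state-machine class.
    """
    total = 0.0
    count = 0
    mean = None
    while True:
        value = yield mean
        total += value
        count += 1
        mean = total / count


# Execute only the first five steps of the infinite computation.
first_five = list(islice(naturals(), 5))

averager = running_mean()
next(averager)  # prime the coroutine so it runs to its first yield
means = [averager.send(x) for x in (10, 20, 60)]
```

The need to prime the coroutine with `next()` before the first `send()` is one small example of the practical friction the abstract alludes to.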

Salisbury
15:45
40min
Content Orchestration: Growing user engagement with RL-driven personalisation
Chris Wilkin

Discover Tesco's exciting new approach to growing user engagement through real-time RL-driven personalisation. This talk explores the use of contextual bandits to tailor content dynamically on the Tesco website, learning and adapting to individual user preferences on the fly. Join us to see how we plan to use reinforcement learning to transform online shopping into a seamlessly personalised experience.

Warwick
15:45
40min
What a serverless database means for users
Alex Owens

This talk aims to compare the performance of ArcticDB to the most popular Dataframe file formats for raw reads and writes, and then demonstrate the simplicity with which more complex data modification and access patterns can be achieved using ArcticDB without sacrificing performance.

Minories
16:30
16:30
40min
Enabling real-time insights through stream processing in Python
Adam Glustein

Event stream processing enables real-time analytics and decision-making, which is crucial in financial services, healthcare, manufacturing, and more industries. However, real-time stream processing also presents various challenges due to the complexity of systems and new paradigms involved. This talk delves into the event stream processing landscape and potential roadblocks in implementing real-time event streaming and discusses the fundamentals of building streaming applications with a real-world example.

Minories
16:30
40min
Function Calling for LLMs
Jim Dowling

Retrieval Augmented Generation (RAG) for large language models (LLMs) enables you to include real-time information and information created after the cutoff training time for the LLM. RAG is strongly associated with using the user's prompt (query) to retrieve semantically related data from a vector database, and then augment the user's prompt by adding the returned (hopefully related) data to the user's prompt. This version of RAG suffers from being probabilistic in the data retrieved (sometimes it may not work) and problematic in requiring you to index your existing (normally unstructured) data in a vector database.

This talk is concerned with extending RAG for LLMs to include the ability to query structured data and API calls using function calling. We will introduce the function calling paradigm for LLMs and describe how LLMs can be fine-tuned to detect when a function needs to be called and then output JSON containing arguments to call the function. We will demonstrate an open-source function calling example for Air Quality prediction that enables users to ask questions such as "what will air quality be like next week" or "will there be any day with bad air quality in the next 10 days"?
This talk will include source code and a demo and all material and software used will be open-source (including the LLM).
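To make the function calling step concrete, here is a minimal sketch of the application side only, assuming the model emits JSON with a function name and arguments. It is not the talk's code: `get_air_quality_forecast`, the `TOOLS` registry, the JSON shape and the hard-coded `model_output` string are all hypothetical stand-ins, and no real LLM or air quality data is involved.

```python
import json


def get_air_quality_forecast(city: str, days: int) -> list[dict]:
    """Hypothetical tool: returns placeholder rows, not real predictions."""
    return [{"city": city, "day": d, "pm25": None} for d in range(1, days + 1)]


# Registry of functions the model is allowed to call (names are assumptions).
TOOLS = {"get_air_quality_forecast": get_air_quality_forecast}


def dispatch(model_output: str):
    """Parse the model's JSON and call the named function with its arguments."""
    call = json.loads(model_output)
    func = TOOLS.get(call["name"])
    if func is None:
        raise ValueError(f"unknown function: {call['name']}")
    return func(**call["arguments"])


# Stand-in for what a fine-tuned LLM might emit for the question
# "what will air quality be like next week".
model_output = (
    '{"name": "get_air_quality_forecast", '
    '"arguments": {"city": "London", "days": 7}}'
)
result = dispatch(model_output)
```

In a full system the returned rows would be added to the prompt so the LLM can phrase the final answer; restricting calls to an explicit registry keeps the model from invoking arbitrary code.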

Salisbury
16:30
40min
Uncertainty estimation at scale with functime, Polars and conformal predictions
Luca Baggi

functime is a modern time-series forecasting library to generate predictions for thousands of time series at once, while never leaving your laptop. Thanks to Polars' powerful query engine, feature extraction and cross-validation are 1-2 orders of magnitude faster. Plus, functime offers a best-in-class set of diagnostic tools to further streamline your workflow.

In this talk, we'll learn how to use functime to analyse your model and generate blazingly fast prediction intervals using EnBPI, a state-of-the-art conformal prediction framework that is also available in other popular Python packages.

Warwick
17:30
17:30
120min
Conference Social & Pub Quiz Sponsored by NVIDIA and Anaconda
Warwick
17:30
120min
Conference Social & Pub Quiz Sponsored by NVIDIA and Anaconda
Salisbury
08:00
08:00
60min
Registration & Breakfast
Minories
08:00
60min
Registration & Breakfast
Warwick
08:00
60min
Registration & Breakfast
Salisbury
09:00
09:00
45min
Keynote-Tania Allard-The art of building and sustaining successful OSS tools and infrastructure
Tania Allard

People start open-source projects for many reasons, but they usually begin with a passion for code, science, knowledge, art, design, social justice or something else. Whether a hobby project by a single maintainer or a project with many maintainers, open source is powered by people who love and care about building solutions.
But what makes a successful open-source project? Are the technical merits alone enough to warrant success, relevance, and longevity? How do factors like user experience, building for sustainability, interoperability, and community management play into an open-source project's success?

Warwick
09:45
09:45
30min
Break & Snacks
Minories
09:45
30min
Break & Snacks
Warwick
09:45
30min
Break & Snacks
Salisbury
10:15
10:15
135min
Humble Data
Beaumont
10:15
40min
Behind the Pixels: The Art and Science of Deep Face Recognition in Python
Sefik Serengil

This talk explores the symbiotic relationship between art and science in facial recognition technology. Focusing on the DeepFace library in Python, the talk unravels the intricacies of deep learning algorithms that breathe life into pixels, enabling computers to artistically recognize and interpret human faces. Attendees will witness practical implementations, gaining insights into the ethical considerations surrounding this transformative technology, and ultimately understanding the harmonious blend of art and science at the core of deep face recognition.

Minories
10:15
40min
Fine Tuning: Building A Folk Music Recommendation System with LLMs
John Sandall

🎵 What happens when you feed an embedding model with folk tunes? 🤖

Our journey starts with a powerful transformer-powered NLP use-case: automated topic modelling using embedding models, clustering algorithms and LLMs. But what if, instead of vibe-based summarisation of free-text survey responses, we wanted vibe-based summarisation of 46,000 folk tunes?

Join us to discover the hidden melodic structures found within music, with 🎻 live demonstrations 🎻 to highlight the differences between a "bluesy reel" and an "Amixolydian jig" discovered through a variety of unsupervised machine learning techniques. Our adventure continues by exploring how to develop a semantic "vibe search" engine for music, a regression model for tune popularity, combined into a folk tune recommender system.

Expect a unique blend of LLM theory, practical advice for applying transformers to text data, code samples, and live violin demos of AI-discovered folk tunes. This talk would be appropriate for anyone curious about LLMs, those looking for ideas on using embeddings for NLP, or anyone who likes foot tapping.

Warwick
10:15
40min
Log messages processing using NLP tools
Arkadiusz Trawiński, PhD

The analysis of logging messages is a big challenge because of their massive number, different origins and unspecified formats. These challenges can be partially addressed with NLP techniques, ultimately helping to detect, predict or maybe even avoid incidents. We demonstrate a complete monitoring solution that includes clustering and uncovering warning-incident correlation with a Hawkes model. This model was previously successfully applied for earthquake predictions based on aftershocks. The Hawkes process model is well-defined mathematically and can process a large volume of data.

Salisbury
11:00
11:00
40min
Can machines play the piano? Deep learning approach to modelling emotional nuance of musical performance.
Wojciech Matejuk

Audio, images, and text already have well established deep learning architectures and processing pipelines proven to yield amazing results. I will introduce the data obtained by recording piano performance in MIDI format as a new and exciting area of research, where many challenges encountered in text, image and audio combine in a single modality. This session is ideal for AI enthusiasts, data scientists with a love for music, and anyone curious about the future of creative machine learning.

Minories
11:00
40min
How to uncover and avoid structural biases in evaluating your Machine Learning/NLP projects
Sofie Van Landeghem

This talk will highlight common pitfalls that occur when evaluating Machine Learning (ML) and Natural Language Processing (NLP) approaches. It will provide comprehensive advice on how to set up a solid evaluation procedure in general, and dive into a few specific use-cases to demonstrate artificial bias that unknowingly can creep in. It will tell the story hidden behind the performance numbers, and get the audience into the right critical mindset to run unbiased evaluations and data analyses for their own projects.

With AI technology booming, the entry barrier to using ML/NLP in applications is continuously decreasing thanks to the release of novel open-source libraries, pretrained LLM/transformer models, and convenient API access for all. It has never been easier to integrate ML or NLP models into a commercial product or research application. As a consequence, the need for meaningful evaluation of these techniques on specific use cases and domains has only become more pressing, both for developers and for users of these AI tools.

Salisbury
11:00
40min
Protein folding and what it means for drug discovery
Emlyn Clay

Discovering a drug is really hard and expensive; it can take decades to find one, and a promising project can fail years in. Advances in predicting how a protein folds have been at the forefront of the next leap in discovering new medicines, and we're in an age of predicting, with high accuracy, what shape proteins form and consequently simulating how they interact with one another and with other chemical entities.

Warwick
11:45
11:45
40min
No More Raw SQL: SQLAlchemy & ORMs
Rhythm Patel

Managing a database and synchronizing service data representation with the database can be tricky. In this talk, we’ll explore SQLAlchemy, a powerful SQL toolkit, to simplify this task while also enhancing the readability and maintainability of the code. We’ll discuss how to leverage SQLAlchemy and its powerful Object Relational Mapper (ORM) system, and even add some optimisations.

Salisbury
11:45
40min
Quantum artificial intelligence
Andrea Melloncelli

The proposed presentation offers a comprehensive exploration of the fundamental principles of Quantum Machine Learning (QML), focusing on its practical applications, current challenges, and future prospects. Throughout the talk, we will address the following key issues:

  1. Introduction to Quantum Computing: We will define the concept of quantum computing and illustrate its utility and potential. We will explore whether the time has come to use this technology and the challenges and opportunities within the realm of quantum technology.
  2. Programming and Operations Basics: We will explain how to program a quantum computer and discuss the Python libraries available to facilitate the approach to this innovative technology.
  3. Fundamentals of Quantum Machine Learning: We will delve into the advantages offered by Quantum Machine Learning compared to classical machine learning, highlighting the peculiarities that characterize this emerging discipline.
  4. Quantum Computers and Libraries: We will analyze the tools available to Python users for experimenting with quantum computers and how to integrate them into the learning and development process.
  5. Approach to the Topic: We will provide practical advice on how to approach this subject, illustrating effective ways to start learning and using Quantum Machine Learning.

The target audience consists of professionals, students, and machine learning enthusiasts interested in exploring the potential of QML, without necessarily possessing an in-depth knowledge of quantum physics. Our goal is to offer a comprehensive and accessible overview of this fascinating interdisciplinary field, encouraging involvement and active learning in this innovative sector.

Minories
11:45
40min
When and how to start coding with kids
Anna-Lena Popkes

Our world is driven by technology and there are many reasons to teach our kids how to code. For example, coding allows them to develop logical reasoning skills and teaches attention to detail. Allowing children to discover how much fun coding can be supports them in their development and opens many doors for their future.

But when and how should we start coding with kids? This talk will approach the question from a scientific perspective, looking into how children's brains develop, how children learn and how to best teach them coding abilities. It will answer important questions like "At what age can a child start coding?" or "What are the benefits of learning to code?". It will also present possible starting points, like learning platforms or tutorials.

Warwick
12:30
12:30
60min
Lunch
Minories
12:30
60min
Lunch
Warwick
12:30
60min
Lunch
Salisbury
12:30
60min
PyData Organizers Meetup
John Carney

We welcome all PyData Organizers to join us for an open discussion during lunch.

Beaumont
13:30
13:30
60min
Lightning Talks- Salisbury
Tomara Youngblood

Lightning Talks are 5-minute talks

Salisbury
13:30
60min
Lightning Talks- Warwick Room
Tomara Youngblood

Lightning talks are 5 minute time slots

Warwick
14:30
14:30
30min
Break & Snacks
Minories
14:30
30min
Break & Snacks
Warwick
14:30
30min
Break & Snacks
Salisbury
15:00
15:00
40min
Are you ready for MLOps? 🫵
Jeroen Overschie, Jetze Schuurmans

MLOps has survived the hype cycle and is gaining in maturity. But are we looking at MLOps for answers for the right things?
No matter how valuable MLOps can be for you, without proper building blocks in place MLOps cannot live up to its full potential. What are the prerequisites for MLOps? What parts of MLOps should you focus on? When should you even start thinking about MLOps, or when is ‘plain’ DevOps wiser to focus on first? Join us in this session to learn more!

Salisbury
15:00
40min
Generating embeddings for Yu-Gi-Oh Cards: A NumPy Approach to Represent Complex Data
Antonio Feregrino

Dive into the methodology of building a concurrence matrix to produce dense representations of cards, which pave the way for other, more interesting tasks such as recommendation systems. This session offers a blend of web scraping, clever coding practices and NumPy's computational prowess, culminating in an illustrative card recommendation example.
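The following is a minimal sketch of the general idea, reading "concurrence matrix" as a card co-occurrence matrix (an assumption). The decks, the choice of a truncated SVD to obtain dense vectors, and the helper `most_similar` are all hypothetical and not taken from the talk.

```python
import numpy as np

# Hypothetical decks; in practice these would come from scraped deck lists.
decks = [
    ["Dark Magician", "Mystical Space Typhoon", "Pot of Greed"],
    ["Dark Magician", "Pot of Greed"],
    ["Blue-Eyes White Dragon", "Pot of Greed"],
]

cards = sorted({card for deck in decks for card in deck})
index = {card: i for i, card in enumerate(cards)}

# Binary deck-by-card membership matrix.
membership = np.zeros((len(decks), len(cards)), dtype=np.int64)
for row, deck in enumerate(decks):
    for card in deck:
        membership[row, index[card]] = 1

# cooccurrence[i, j] = number of decks that contain both card i and card j.
cooccurrence = membership.T @ membership
np.fill_diagonal(cooccurrence, 0)

# Assumed reduction step: keep the top-k singular directions as dense vectors.
k = 2
u, s, _ = np.linalg.svd(cooccurrence.astype(float))
embeddings = u[:, :k] * s[:k]


def most_similar(card):
    """Return the other card whose embedding has the highest cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # avoid division by zero
    scores = unit @ unit[index[card]]
    scores[index[card]] = -np.inf  # exclude the query card itself
    return cards[int(np.argmax(scores))]
```

Calling `most_similar("Dark Magician")` then returns the nearest neighbour in this toy embedding space, which is the basic building block of a card recommender.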

Minories
15:00
40min
Navigating through financial data challenges by harnessing the power of synthetic data
Stamatios Lykos, Elena Chatzimichali, PhD, Alkiviadis Kariotis, Konstantinos Kogias, Evangelos Tsilikas

Addressing the complexities of financial time series analysis, we unravel prevalent challenges and introduce innovative solutions with the use of synthetic data. From data scarcity to imbalanced datasets, bank silos, privacy concerns and regulatory compliance, this talk dives into the multifaceted obstacles hindering effective financial modelling. Discover how synthetic data generation and augmentation emerge as pivotal solutions, offering privacy-preserving alternatives in the realm of data-driven finance. The talk will cover a broad array of synthetic data techniques ranging from simplistic to state-of-the-art models such as GANs.

Warwick
15:00
90min
[Unconference] How to define open source AI
Cheuk Ting Ho

As there are more regulations regarding AI and data, OSI is leading a conversation about the definition of open source AI. In this unconference session, we will have a look at what has been discussed so far and how we can chime in with our opinions.

Beaumont
15:45
15:45
40min
10 years of Parquet: what’s next?
Nick Gates

As Apache Parquet enters its teenage years, it’s time for us to celebrate its success in tabular analytics and explore how we can improve upon it for tomorrow’s GPU and ML oriented workloads.

Minories
15:45
40min
5 Things I Learnt from Causing a Cloud Provider Outage
Alexander Darby

Earlier this year, my team caused an outage across Europe for a major cloud provider. The incident response taught me a lot about working with cloud data lake systems at massive scale. How do you make these systems performant, resilient, and easy to maintain? And of course, how do you stop them from behaving like a DDoS attack on the cloud provider?

Salisbury
15:45
40min
From Eggs to Poetry: The Evolutionary Saga of Python Packaging
Quazi Nafiul Islam

Join us on a journey through the evolution of Python packaging, where we'll untangle the web of tools and formats that have shaped Python development. From the humble beginnings of Eggs to the sophisticated elegance of Poetry, this talk is a tribute to the ingenuity of the Python community. Along the way, we'll nod to Conda's cameo, recognising its unique contribution to making scientific packages more accessible. This story is for anyone who's ever wondered about the magic behind pip install and why packaging is so hard to get right in the Python world.

Warwick
16:30
16:30
40min
Achieving Concurrency in Streamlit with a RQ scheduler, Building Responsive Data Applications
Harriet Yue Huang

With the increasing adoption of Streamlit to create interactive data applications that use generative AI technologies, a challenge has emerged: maintaining responsiveness under heavy or concurrent user interactions as applications grow in complexity, sometimes with long-running background jobs. This is where integrating task queueing systems like Redis Queue (RQ) into Streamlit applications can come in handy.
In this talk, we will explore how we can enable this integration between RQ and Streamlit to achieve concurrency, improve user experiences and effectively manage long-running tasks.

Salisbury
16:30
40min
Adventures in not writing tests
Andy Fundinger

Developing reliable code without writing tests may be a far off dream, but Hypothesis' ghostwriter function will generate tests from type hints. The resulting tests are powerful and often appropriate for data analysis. In this talk, I'll discuss how to add tests to your data analysis code that cover a wide range of inputs -- all while using just a small amount of code.

Warwick
16:30
40min
Backtesting and error metrics for modern time series forecasting
Kishan Manani

Evaluating time series forecasting models for modern use cases has become incredibly challenging. This is because modern forecasting problems often involve a large number of related time series, often hierarchical, with a diverse set of characteristics such as intermittency, non-normality, and non-stationarity. In this talk we'll discuss all the tips, tricks, and pitfalls in creating model evaluation strategies and error metrics to overcome these challenges.
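To make the idea of backtesting concrete, here is a small sketch of an expanding-window (rolling-origin) evaluation scored with the mean absolute scaled error (MASE). It is one common setup rather than the speaker's method; the series, the naive last-value forecaster and all function names are hypothetical.

```python
def rolling_origin_splits(n_obs, initial_train, horizon, step=1):
    """Yield (train_end, test_end) index pairs for an expanding-window backtest."""
    train_end = initial_train
    while train_end + horizon <= n_obs:
        yield train_end, train_end + horizon
        train_end += step


def naive_forecast(train, horizon):
    """Placeholder model: repeat the last observed value."""
    return [train[-1]] * horizon


def mase(actual, predicted, train):
    """Forecast MAE divided by the in-sample MAE of a one-step naive forecast.

    Note: the scale is zero for a constant training window, which can happen
    with intermittent series and would need separate handling.
    """
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
    scale = sum(abs(train[i] - train[i - 1]) for i in range(1, len(train)))
    scale /= len(train) - 1
    return mae / scale


series = [10, 12, 11, 13, 14, 13, 15, 16]  # hypothetical observations
horizon = 2

scores = []
for train_end, test_end in rolling_origin_splits(
    len(series), initial_train=4, horizon=horizon, step=2
):
    train, test = series[:train_end], series[train_end:test_end]
    scores.append(mase(test, naive_forecast(train, horizon), train))
```

Each fold only ever trains on data that precedes its test window, so the scores approximate how the model would have performed had it been deployed at each forecast origin.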

Minories