PyData London 2024

Using Zarr as a universal and efficient format for drug discovery datasets in Polaris
06-15, 10:30–11:10 (Europe/London), Salisbury

Polaris is an industry-led, community-run, and open-source benchmarking
platform for machine learning (ML) based drug discovery. From kilobytes to
terabytes and from 3D protein structures to phenomics, Polaris leverages Zarr as
a universal format to support the wide variety of datasets used by ML researchers
in drug discovery.


ML is driving exciting innovations in drug discovery. Over the years, the community has
come to rely on a scattered and heterogeneous collection of datasets to train and
evaluate its models. Without a centralized repository of standardized datasets, these
datasets are hard to find and require custom tooling before they can be used in ML
pipelines. Even slight discrepancies between preprocessing workflows can lead to
incomparable results.

Polaris is an industry-led initiative to derive a rigorous and domain-appropriate
evaluation standard for ML-based drug discovery. We are working on a set of
open-source software tools centered around the Polaris Hub: a centralized platform to
host drug discovery benchmarks and datasets. To unify the numerous modalities and
file types in drug discovery, Polaris uses Zarr.

Zarr is a format for the storage of chunked, compressed, N-dimensional arrays. It is
flexible and has implementations in multiple programming languages. In Python, it
natively integrates with Xarray and Dask, among others. In this talk I will give an
overview of the Polaris initiative and show how Zarr is used to scale efficiently to
terabyte-sized, ML-ready datasets.
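
To make the idea of chunked, compressed storage concrete, here is a minimal toy sketch using only the Python standard library. It is not Zarr's API or on-disk format, and it is not how Polaris is implemented; the helper names (write_chunks, read_rows), the chunk size, and the data are all hypothetical. It only illustrates why chunking helps large datasets: each block of rows is compressed independently, so reading a slice decompresses just the chunks that overlap it.

```python
import zlib
from array import array

CHUNK_ROWS = 100  # hypothetical chunk size: number of rows stored per chunk


def write_chunks(rows, chunk_rows=CHUNK_ROWS):
    """Split a table into row blocks and compress each block independently."""
    store = {}
    for start in range(0, len(rows), chunk_rows):
        block = rows[start:start + chunk_rows]
        # Flatten the block to 32-bit floats before compressing it.
        flat = array("f", (value for row in block for value in row))
        store[start // chunk_rows] = zlib.compress(flat.tobytes())
    return store


def read_rows(store, n_cols, start, stop, chunk_rows=CHUNK_ROWS):
    """Return rows [start, stop) and the ids of the chunks that were decompressed.

    Assumes 0 <= start < stop <= total number of rows.
    """
    out, touched = [], []
    first_chunk = start // chunk_rows
    last_chunk = (stop - 1) // chunk_rows
    for chunk_id in range(first_chunk, last_chunk + 1):
        flat = array("f")
        flat.frombytes(zlib.decompress(store[chunk_id]))
        touched.append(chunk_id)
        first_row = chunk_id * chunk_rows
        for i in range(len(flat) // n_cols):
            if start <= first_row + i < stop:
                out.append(list(flat[i * n_cols:(i + 1) * n_cols]))
    return out, touched


# 1,000 samples with 8 features each (made-up values, for illustration only).
data = [[float(r % 7)] * 8 for r in range(1000)]
store = write_chunks(data)

# Reading rows 250..319 only touches chunks 2 and 3 out of the 10 stored chunks.
rows, touched = read_rows(store, n_cols=8, start=250, stop=320)
```

In a real Zarr array the same principle extends to N dimensions, with each chunk stored as a separate compressed object, which is what lets tools such as Dask load and process chunks lazily and in parallel.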


Prior Knowledge Expected

No previous knowledge expected

Cas Wognum is a machine learning engineer at Valence Labs. Within Valence, he has
contributed to several open-source projects in the datamol.io toolkit and is now leading
the Polaris project. He holds an MSc degree in Artificial Intelligence and Computer
Graphics from the University of Utrecht.

Valence Labs is a research engine, powered by Recursion, committed to advancing the
frontier of AI in drug discovery.