PyData London 2024

Fine Tuning: Building A Folk Music Recommendation System with LLMs
06-16, 10:15–10:55 (Europe/London), Warwick

🎵 What happens when you feed an embedding model with folk tunes? 🤖

Our journey starts with a powerful transformer-powered NLP use-case: automated topic modelling using embedding models, clustering algorithms and LLMs. But what if, instead of vibe-based summarisation of free-text survey responses, we wanted vibe-based summarisation of 46,000 folk tunes?

Join us to discover the hidden melodic structures found within music, with 🎻 live demonstrations 🎻 to highlight the differences between a "bluesy reel" and an "Amixolydian jig" discovered through a variety of unsupervised machine learning techniques. Our adventure continues by exploring how to develop a semantic "vibe search" engine for music and a regression model for tune popularity, then combine the two into a folk tune recommender system.

Expect a unique blend of LLM theory, practical advice for applying transformers to text data, code samples, and live violin demos of AI-discovered folk tunes. This talk would be appropriate for anyone curious about LLMs, those looking for ideas on using embeddings for NLP, or anyone who likes foot tapping.


Why are Amixolydian reels so unpopular?

It took a lot to even ask the question: run thousands of folk tunes in ABC notation through an embedding model, tune the parameters of a clustering algorithm, identify distinct tune clusters, use an LLM to name the clusters, and then use the resulting transformer-powered "topic model" topics as dimensions for comparative analysis against a community-provided popularity score.
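As a rough illustration of the shape of that pipeline (not the code used in the talk), here is a minimal standard-library sketch. Everything in it is an assumption made for illustration: the tune strings and popularity scores are invented, `embed` is a character-bigram count standing in for a real embedding model, `cluster` is a simple greedy threshold clustering rather than whichever algorithm performed best, and `name_cluster` is a placeholder where an LLM call would go.

```python
from collections import Counter
from statistics import mean

# Invented ABC-like note strings with invented popularity scores.
TUNES = [
    ("ABcd ABcd efge dBAB", 120),
    ("ABcd ABcd efge dBAG", 95),
    ("ABcd ABce efge dBAB", 110),
    ("g2fg g2fg a2ga g2fe", 8),
    ("g2fg g2fe a2ga g2fe", 12),
    ("g2fg g2fg a2ga g2fg", 5),
]


def embed(abc, n=2):
    """Stand-in for an embedding model: counts of character n-grams."""
    text = abc.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.5):
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters = []  # each cluster is a list of tune indices
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def name_cluster(cluster_id, members):
    """Placeholder for the LLM naming step (hypothetical helper)."""
    return f"cluster-{cluster_id}"


vectors = [embed(abc) for abc, _ in TUNES]
clusters = cluster(vectors)

# Compare clusters against the popularity score.
popularity_by_cluster = {
    name_cluster(cid, members): mean(TUNES[i][1] for i in members)
    for cid, members in enumerate(clusters)
}
for name, score in popularity_by_cluster.items():
    print(f"{name}: mean popularity {score:.1f}")
```

With a real embedding model and real clustering, the same final step (mean popularity per named cluster) is what would surface a question like the one above.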

It turns out, Amixolydian tunes are very unpopular! This talk will educate and entertain in equal measure, covering some of the theoretical groundwork of the process outlined above:
1. What are embedding models & how do they work?
2. Which clustering techniques performed the best & how did we find them?
3. What are the best practices for using LLMs for topic modelling?
4. Can embedding vectors be used as regression model features?
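
To give a flavour of the last question, here is a small hedged sketch of treating embedding vectors as regression features. It is not the talk's model: the 3-dimensional vectors and popularity scores are made up, and a k-nearest-neighbour regressor is swapped in because it is the simplest regressor to write with the standard library alone.

```python
import math

# Hypothetical embedding vectors (3-D for readability) and popularity scores.
TRAIN = [
    ([0.9, 0.1, 0.0], 100.0),
    ([0.8, 0.2, 0.1], 80.0),
    ([0.1, 0.9, 0.2], 10.0),
    ([0.0, 0.8, 0.3], 20.0),
]


def predict_popularity(vector, train=TRAIN, k=2):
    """k-NN regression: average the scores of the k closest training embeddings."""
    nearest = sorted(train, key=lambda item: math.dist(vector, item[0]))[:k]
    return sum(score for _, score in nearest) / k


# A new tune whose embedding sits near the first two training tunes.
print(predict_popularity([0.85, 0.15, 0.05]))
# A new tune whose embedding sits near the last two training tunes.
print(predict_popularity([0.05, 0.85, 0.25]))
```

Comparing predictions like these with the community score is one way to arrive at candidate "over-rated" and "under-rated" tunes.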

Great data scientists know that grokking ML models requires going deep into domain knowledge, understanding each sample qualitatively, finding out why outliers are outliers, and building an intuitive picture of your model's (probably wrong) view of the world. When it comes to music data this process is, literally, entertaining.

We will be illustrating the journey with live fiddle so you can hear out loud:
- The differences between each thematic cluster;
- The most popular tune according to humans, and according to a (questionably informed) AI model;
- The most over-rated tunes, and under-rated tunes;
- And samples from 100% LLM-generated tunes.

The talk will conclude with some thoughts on the future of AI and music, the potential impact of this technology on the professional music industry, and how these rapidly advancing techniques offer both dangers and untapped potential to enhance and enjoy shared cultural heritage.


Prior Knowledge Expected

No previous knowledge expected

John Sandall is the CEO and Principal Data Scientist at Coefficient.

His experience in data science and software engineering spans multiple industries and applications, and his passion for the power of data extends far beyond his work for Coefficient’s clients. In April 2017 he created SixFifty in order to predict the UK General Election using open data and advanced modelling techniques. Previous experience includes Lead Data Scientist at YPlan, business analytics at Apple, genomics research at Imperial College London, building an ed-tech startup at Knodium, developing strategy & technological infrastructure for international non-profit startup STIR Education, and losing sleep to many hackathons along the way.

John is also a co-organiser of PyData London, co-founded Humble Data in 2019 to promote diversity in data science through a programme of free bootcamps, and in 2020 was a Committee Chair for the PyData Global Conference. He is currently a Fellow of Newspeak House with interests in open data, AI ethics and promoting diversity in tech.