PyData London 2024

From Classic to Cutting Edge Text Classification: Generating Customers Insights with Topic Modelling and HuggingFace SetFit Method
06-14, 15:30–17:00 (Europe/London), Minories

Stop data skimming and dive deep into your customer voices! Are you working with a load of unstructured reviews and would like to understand what customers are commenting about? This hands-on tutorial equips you with powerful text analysis techniques to unlock hidden insights and inform data-driven decisions. Whether you're an experienced data scientist or analyst or just starting out, this session will guide you through two text classification approaches:

1) Classic Topic Modelling: Uncover recurring themes and trends within customer comments using a generative probabilistic modelling approach such as LDA (Latent Dirichlet Allocation).

2) SetFit Few-Shot Learning: Fine-tune a HuggingFace (HF) sentence transformers model with minimal data to automatically categorise and label reviews, offering deeper insights into key strengths as well as opportunities for improvement.

By the end of the tutorial, you will have hands-on experience from working through a Google Colab notebook provided beforehand, which enables you to apply what you learn and achieve the following outcomes:
- Apply topic modelling with necessary text pre-processing and feature engineering techniques to discover underlying topics in a collection of text.
- Fine-tune an HF transformer on a small labelled dataset using the SetFit few-shot learning method
- Evaluate the performance of the fine-tuned transformer model
- Use the fine-tuned model to generate classification themes on unlabelled data
- Develop a baseline evaluation mechanism to monitor the model in production
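
The fine-tuning and inference outcomes above might look roughly like the sketch below. It assumes the `setfit` 1.x API (`SetFitModel`, `Trainer`, `TrainingArguments`; earlier releases used a differently named trainer), integer class labels, and the `sentence-transformers/paraphrase-mpnet-base-v2` checkpoint. The helper names, the sample size per class and the hyperparameters are illustrative and not taken from the tutorial notebooks.

```python
import random
from collections import defaultdict


def sample_few_shot(texts, labels, n_per_class=8, seed=0):
    """Pick up to n_per_class labelled examples per class for few-shot training."""
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)

    rng = random.Random(seed)
    sampled_texts, sampled_labels = [], []
    for label in sorted(by_label):
        pool = by_label[label]
        chosen = rng.sample(pool, min(n_per_class, len(pool)))
        sampled_texts.extend(chosen)
        sampled_labels.extend([label] * len(chosen))
    return sampled_texts, sampled_labels


def fine_tune_and_predict(
    train_texts,
    train_labels,
    unlabelled_texts,
    model_name="sentence-transformers/paraphrase-mpnet-base-v2",
):
    """Fine-tune a SetFit model on a few labelled reviews, then label new ones."""
    # Imported lazily so sample_few_shot can be used without these packages.
    from datasets import Dataset
    from setfit import SetFitModel, Trainer, TrainingArguments

    # SetFit's trainer expects "text" and "label" columns by default.
    train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels})

    model = SetFitModel.from_pretrained(model_name)  # downloads the checkpoint
    args = TrainingArguments(batch_size=16, num_epochs=1)  # assumed settings
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    # Inference: predicted class ids for the unlabelled reviews.
    return model.predict(unlabelled_texts)
```

A typical call would first draw a handful of examples per class with `sample_few_shot` and then pass them, together with the unlabelled reviews, to `fine_tune_and_predict`.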

Please follow these steps to prepare for the tutorial:

1) Set up Google Colab.

2) Download the data and notebooks folders from this repository: https://rb.gy/ovru2m.

This will allow you to run the notebooks and follow along with the tutorial using Google Colab!

Ready to transform your understanding of text classification on customer data? Join me and unleash its power!


The tutorial delves into unlocking valuable and actionable insights from unstructured customer data using two distinct text analytics techniques: Classic Statistical Topic Modelling and HF SetFit Few-Shot Learning. It enables attendees to use data science to analyse reviews, uncover customers' hidden likes and dislikes, personalise experiences and drive satisfaction.
The session welcomes participants with a basic understanding of Python Programming and Machine Learning concepts, particularly:
- Data scientists and analysts of all levels seeking to leverage text analysis for customer insights
- Product managers and marketers who want to translate customer feedback into product development and marketing strategies.

Upon completion of the session, participants will have a comprehensive set of skills for applying text analysis to customer understanding. They will learn to apply classic topic modelling techniques, including preprocessing text data, feature engineering, and using algorithms such as Latent Dirichlet Allocation (LDA) to extract key themes and trends from customer feedback. Additionally, attendees will use SetFit few-shot learning to fine-tune an HF sentence transformer model with a small labelled dataset, enabling automatic text categorisation to identify strengths and areas for improvement. They will also learn to evaluate model performance and develop a baseline evaluation mechanism to assess it in real-world production settings.
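
To make the pre-processing and feature engineering steps concrete, here is a minimal standard-library sketch that tokenises reviews, drops stop words and builds bag-of-words count vectors. The stop-word list, the helper names and the sample reviews are assumptions for illustration; the tutorial notebooks may rely on a library for these steps.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list; real projects usually use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "was", "is", "it", "to", "of", "but", "very"}


def preprocess(text):
    """Lower-case, keep alphabetic tokens, drop stop words and 1-letter tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


def bag_of_words(documents):
    """Return (vocabulary, count vectors) for lists of pre-processed tokens."""
    vocab = sorted({tok for doc in documents for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in documents:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        vectors.append(row)
    return vocab, vectors


# Hypothetical reviews, invented for illustration only.
reviews = ["The delivery was late, very late!", "Great price and fast delivery."]
tokens = [preprocess(r) for r in reviews]
vocab, vectors = bag_of_words(tokens)
print(vocab)    # ['delivery', 'fast', 'great', 'late', 'price']
print(vectors)  # [[1, 0, 0, 2, 0], [1, 1, 1, 0, 1]]
```

Count vectors of this kind are the usual input to LDA, which models each document as a mixture of topics over the vocabulary.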

The tutorial schedule will progress as follows:

0-10 minutes: Introduction to Text Analytics and Customer Experience

10-25 minutes: Classic topic modelling, covering text pre-processing, feature engineering, and the LDA algorithm.

25-40 minutes: Hands-on application - Unsupervised topic modelling

40-55 minutes: HuggingFace SetFit few-shot learning method

55-70 minutes: Hands-on application - Fine-tuning a sentence transformer and generating model inference

70-80 minutes: Model performance evaluation and monitoring mechanism

80-90 minutes: Wrap-up and Q&A
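
For the evaluation and monitoring slot in the schedule above, one possible baseline mechanism is sketched here: record a metric on a held-out test set at training time, then recompute it on a freshly labelled production sample and flag a drop. The use of macro-F1, the 0.05 tolerance, the function names and the toy labels are all assumptions; the session itself may use a different metric or mechanism.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def check_against_baseline(baseline_f1, y_true, y_pred, tolerance=0.05):
    """Compare the current macro-F1 with the baseline recorded at training time."""
    current = macro_f1(y_true, y_pred)
    return {"macro_f1": current, "degraded": current < baseline_f1 - tolerance}


# Baseline: score on a held-out test set (toy labels, invented for illustration).
test_true = ["price", "delivery", "service", "delivery"]
test_pred = ["price", "delivery", "delivery", "delivery"]
baseline = macro_f1(test_true, test_pred)

# Later: a small, manually labelled sample of production traffic.
prod_true = ["price", "service", "delivery", "service"]
prod_pred = ["delivery", "delivery", "delivery", "service"]
report = check_against_baseline(baseline, prod_true, prod_pred)
print(baseline, report)
```

Macro-F1 weights every class equally, which helps when some review categories are rare; if overall hit rate matters more, accuracy could be tracked in the same way.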


Prior Knowledge Expected

No previous knowledge expected

Sultan is an experienced data scientist with a proven record of delivering business solutions and data products through the application of AI, predictive modelling, and advanced analytics.

He is rigorous about collaborating with technical and non-technical stakeholders to transform data into meaningful business insights, ultimately enabling commercial advantages.

Sultan is also an ML Subject Matter Expert (SME) at Amazon Web Services and a technical author at Towards Data Science (TDS), skilled in machine learning, data engineering, natural language processing, deep learning, and statistics.

He has a master's degree in Business Analytics from University College London.

Beyond his professional pursuits, Sultan has interests in traveling, hiking, and Tag Rugby.