06-16, 11:00–11:40 (Europe/London), Salisbury
This talk will highlight common pitfalls that occur when evaluating Machine Learning (ML) and Natural Language Processing (NLP) approaches. It will provide practical advice on how to set up a solid evaluation procedure in general, and dive into a few specific use cases to demonstrate how bias can unknowingly creep in. It will tell the story hidden behind the performance numbers and put the audience in the right critical mindset to run unbiased evaluations and data analyses for their own projects.
With AI technology booming, the entry barrier to using ML/NLP in applications is continuously decreasing thanks to the release of novel open-source libraries, pretrained LLM/transformer models, and convenient API access for all. It has never been easier to integrate ML or NLP models into a commercial product or research application. As a consequence, the need for meaningful evaluation of these techniques on specific use cases and domains has only become more pressing, for developers as well as for users of these AI tools.
This talk will discuss general principles for setting up a meaningful evaluation for ML/NLP projects. Further, it will illustrate specific examples where bias may creep in, taken from 15+ years of experience in this domain. Ultimately, the goal of this talk is to encourage every ML/NLP developer to critically (re-)evaluate their evaluation. It aims to inspire developers and users of ML/NLP applications not to be satisfied merely with a high performance number, but instead to go beyond a single F-measure or accuracy score, dive deeper into the false positives and false negatives, and ultimately uncover hidden biases or structural issues in the underlying data or evaluation procedure.
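As a minimal sketch of what that deeper dive can look like in practice (using scikit-learn and a tiny synthetic set of labels, purely for illustration, not material from the talk itself):

```python
# Illustrative sketch: rather than reporting a single accuracy number,
# break the evaluation down per class and inspect the actual errors.
# The labels below are synthetic placeholders.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_true = ["positive", "negative", "negative", "positive", "negative", "negative"]
y_pred = ["positive", "negative", "positive", "negative", "negative", "negative"]

# A single headline number hides where the model actually fails ...
print("Accuracy:", accuracy_score(y_true, y_pred))

# ... while per-class precision/recall and the confusion matrix show
# which classes drive the false positives and false negatives.
print(classification_report(y_true, y_pred, zero_division=0))
print(confusion_matrix(y_true, y_pred, labels=["positive", "negative"]))
```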
Some evaluation topics we will tackle include:
- Various (surprising) ways in which data can leak from training to test (see the sketch after this list)
- The importance of setting up both intrinsic and extrinsic evaluations
- How to detect and/or avoid data drift
- How to uncover structural problems/unconscious biases in the underlying data
- How to critically (re)evaluate the evaluation procedure of your ML/NLP project
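One common leakage pattern, sketched below under the assumption of a scikit-learn text pipeline and a tiny synthetic corpus (purely for illustration): fitting a preprocessing step on the full dataset before splitting, so that statistics from the test documents leak into the training features.

```python
# Illustrative sketch of train/test leakage via preprocessing.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Synthetic placeholder corpus.
texts = ["great product", "terrible service", "loved it", "would not recommend"]
labels = [1, 0, 1, 0]

# Leaky: the vectorizer is fit on ALL texts, including the future test set,
# so vocabulary and IDF statistics from the test documents leak into training.
leaky_features = TfidfVectorizer().fit_transform(texts)
X_train_leaky, X_test_leaky, y_train_leaky, y_test_leaky = train_test_split(
    leaky_features, labels, test_size=0.5, random_state=0)

# Safer: split the raw texts first, then fit the vectorizer on the training split only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0)
vectorizer = TfidfVectorizer().fit(train_texts)
X_train = vectorizer.transform(train_texts)
X_test = vectorizer.transform(test_texts)
print("Vocabulary learned from training texts only:", sorted(vectorizer.vocabulary_))
```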
No previous knowledge expected
I am a machine learning and NLP engineer who firmly believes in the power of data to transform decision making in industry. I hold a Master's degree in Computer Science (software engineering) and a PhD in Sciences (Bioinformatics), and have more than 16 years of experience in Natural Language Processing and Machine Learning, including in the pharmaceutical and food industries. Since 2019, I have been a core maintainer of spaCy, a popular open-source NLP library created by Explosion. Additionally, I work as a consultant through my company OxyKodit. Throughout my code and projects, I am passionate about quality assurance and testing, introducing proper levels of abstraction, and ensuring code robustness and modularity.