06-14, 15:30–17:00 (Europe/London), Warwick
Test-driven data analysis is a methodology and open-source Python library for improving quality in data processes. It covers three main areas:
* Validating data (generating constraints and using them to validate new data)
* Creating "reference tests" for data analysis processes
* Automatically creating Python tests for command-line scripts in any language (gentest)
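To make the first of these areas concrete, here is a minimal sketch of the discover-then-verify idea in plain Python. It is an illustration only and does not use the tdda library's API or constraint file format; the functions `discover` and `verify`, the field names, and the data are all hypothetical.

```python
# Illustrative only: a plain-Python sketch of the discover/verify idea,
# not the tdda library's API or constraint file format.

def discover(rows):
    """Infer simple per-field constraints (type, min, max, nulls) from rows."""
    constraints = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        present = [v for v in values if v is not None]
        if not present:
            continue  # nothing can be inferred from an all-null field
        constraints[field] = {
            "type": type(present[0]).__name__,
            "min": min(present),
            "max": max(present),
            "allow_nulls": len(present) < len(values),
        }
    return constraints


def verify(rows, constraints):
    """Return a list of (row_index, field, problem) for every violation."""
    failures = []
    for i, row in enumerate(rows):
        for field, c in constraints.items():
            value = row.get(field)
            if value is None:
                if not c["allow_nulls"]:
                    failures.append((i, field, "unexpected null"))
            elif type(value).__name__ != c["type"]:
                failures.append((i, field, "wrong type"))
            elif not c["min"] <= value <= c["max"]:
                failures.append((i, field, "out of range"))
    return failures


# Hypothetical data: constraints are learned from known-good records...
reference = [
    {"age": 31, "country": "GB"},
    {"age": 47, "country": "FR"},
    {"age": 25, "country": "DE"},
]
# ...and then used to check a new batch.
new_batch = [
    {"age": 40, "country": "FR"},
    {"age": -3, "country": "GB"},
    {"age": None, "country": "DE"},
]

constraints = discover(reference)
failures = verify(new_batch, constraints)
for row_index, field, problem in failures:
    print(f"row {row_index}: {field}: {problem}")
```

The point of the sketch is the workflow: constraints are generated from data believed to be good, then reused to flag surprising values in data that arrives later.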
This tutorial will motivate test-driven data analysis and show how to use all three parts of the library.
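As a rough illustration of what a "reference test" means, the sketch below compares the output of an analysis against a stored, known-good result while ignoring lines that are expected to vary, such as a run date. This is plain Python written for this example, not the tdda library's API; the function `differing_lines` and the sample report text are hypothetical.

```python
# Illustrative only: a plain-Python sketch of a reference test,
# not the tdda library's API.

def differing_lines(actual, expected, ignore_substrings=()):
    """Compare two texts line by line.

    Lines whose expected version contains any of ignore_substrings
    (e.g. a timestamp label) are skipped. Returns a list of
    (line_number, actual_line, expected_line) tuples for real differences.
    """
    actual_lines = actual.splitlines()
    expected_lines = expected.splitlines()
    if len(actual_lines) != len(expected_lines):
        return [("line count", len(actual_lines), len(expected_lines))]
    diffs = []
    pairs = zip(actual_lines, expected_lines)
    for n, (a, e) in enumerate(pairs, start=1):
        if any(s in e for s in ignore_substrings):
            continue  # this line is allowed to vary between runs
        if a != e:
            diffs.append((n, a, e))
    return diffs


# Hypothetical outputs from an analysis process.
reference_output = "Report generated: 2024-01-01\nrows: 3\nmean age: 34.3\n"
todays_output = "Report generated: 2024-06-14\nrows: 3\nmean age: 34.3\n"
broken_output = "Report generated: 2024-06-14\nrows: 3\nmean age: 43.3\n"

ignore = ["Report generated"]
print(differing_lines(todays_output, reference_output, ignore))
print(differing_lines(broken_output, reference_output, ignore))
```

Only the date differs in the first comparison, so it reports no differences; the second comparison reports the changed result on line 3.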
INSTRUCTIONS/ADVICE FOR ATTENDEES
Attendees will be able to follow along with significant parts of the tutorial if they have a laptop with Python available, and there will be some interludes specifically to allow people to run examples. About half of the examples only use command-line tools and do not require any knowledge of Python.
In order to follow along, attendees simply need to install the tdda library from PyPI. I may make tweaks to the examples (which are supplied with the library) in the run-up to PyData, so ideally attendees would update the library on 12th or 13th June.
Installation generally requires only a normal pip installation, which, depending on your preference and setup, can be into a main Python installation or a virtual environment.
pip install -U tdda  # install or upgrade into the Python that owns pip
python3.11 -m pip install -U tdda  # install or upgrade into a specific Python, in this case python3.11
If all has gone well, you should be able to import tdda from an interactive Python session:
$ python
Python 3.11.0 (v3.11.0:deaf509e8f, ...)
Type "help", "copyright", "credits" or "license" for more information.
>>> import tdda
>>> tdda.__version__
'2.1.08'
You should also be able to use the tdda command from the command line, e.g.
$ tdda
Use
tdda discover to perform constraint discovery
tdda verify to verify data against constraints
[... truncated]
$ tdda version
2.1.08
$ tdda --version
2.1.08
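The same check can also be scripted. The following is a minimal sketch that uses only the standard library's `importlib.metadata` (Python 3.8 or later) to report whether a package is installed, without importing it; the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


tdda_version = installed_version("tdda")
if tdda_version is None:
    print("tdda is not installed; try: python -m pip install -U tdda")
else:
    print(f"tdda {tdda_version} is installed")
```

This reports on the Python environment that runs the script, so run it with the same interpreter you intend to use in the tutorial.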
If you prefer, you are welcome to install from source (https://github.com/tdda/tdda); but you won't need source for the tutorial, and if you're not sure how to do it, just install with pip.
Attendees who cannot or do not wish to bring a machine or follow along should still get 90% of the benefit of the tutorial.
No previous knowledge expected
Nick Radcliffe is a practising data scientist with over 30 years' experience, from neural networks (a.k.a. deep learning) and genetic algorithms on parallel systems in the late 1980s, through parallel machine learning and 3D visualisation software as a founder of Quadstone, from 1995, to novel modelling methods (e.g. uplift modelling) in the early 2000s. Since 2007, he has run Edinburgh data science specialists Stochastic Solutions Limited.
Nick uses his deep knowledge of underlying algorithms to fashion tailored solutions to practical business problems for clients including Barclays, Sainsbury's, T-Mobile and Skyscanner, and was a key developer of uplift modelling, a method for modelling the differential effect of a treatment across a population.
Over recent years, he has developed a particular focus on testing data and data processes for correctness, developing and applying a methodology and set of tools known as test-driven data analysis (TDDA), with open-source and proprietary variants. These will feature in talks and training sessions in this year's DataFest.
Nick is also a Visiting Professor in the Department of Mathematics at Edinburgh University and organises the PyData Edinburgh monthly meetup, which regularly brings together around 100 data scientists. He has acted as an adviser and consultant to various firms including SEP and Fluidinfo and has co-authored two books.