As machine learning becomes more mainstream (well, that point is probably long past), more and more teams that are new to ML are attempting to run data science projects. One of the most common mistakes is to think that ML is “just another library,” which leads people to approach a data science project like a software engineering project.
Why is this a problem? In software engineering, we’re used to being able to design a working solution from the specs alone, in most cases. This works because you can predict how things will fit together.
Let’s say someone tells you to write a service that stores metrics for machine learning runs. Given an idea of the data you want to store and the ways you want to retrieve it, you can more or less design the solution on a piece of paper: pick an SQL database, set up the necessary tables, add indices, build a REST interface in your favorite programming language, and so on. There might be things you have to look up, but these will be minor details that won’t change the overall architecture.
Especially if you’ve done it before, you can do this with high accuracy, even accounting for things like “We should be fine with Postgres; with this amount of data, I’ve never seen queries take longer than 100ms.”
Your Data Will Always Be Unique
The fundamental difference to a data science project is that your data will always be unique, and whether your method works well or not depends ultimately on your data. Existing solutions, tutorials, papers, etc. will give you an idea of what worked on similar data, but you’ll always have to check whether the solution works on your data set, and tweak it to improve it.
My main recommendation and the first step to making projects more “data science-y” is to write a test harness that takes whatever model you have trained and runs and evaluates it on a test dataset. This closes the loop and also gives you the autonomy to try out things for yourself. In a way, it is the ultimate democratization of AI! You don’t need an expert to tell you whether your solution is good, your data will tell you.
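Such a test harness can be very small. Here is a minimal sketch (the `predict` interface and the toy data are hypothetical, just to illustrate the shape of the loop): any trained model that implements the interface can be evaluated on the held-out test set with one call, so comparing approaches becomes a one-liner.

```python
from typing import Callable, Sequence


def evaluate(predict: Callable[[float], float],
             test_inputs: Sequence[float],
             test_targets: Sequence[float]) -> float:
    """Return the fraction of test examples the model predicts correctly."""
    correct = sum(1 for x, y in zip(test_inputs, test_targets)
                  if predict(x) == y)
    return correct / len(test_inputs)


def baseline(x: float) -> float:
    """A trivial 'model' that always predicts 0.0 -- a useful sanity check."""
    return 0.0


# Toy held-out data; in practice this would be your real test set.
accuracy = evaluate(baseline, [1.0, 2.0, 3.0], [0.0, 0.0, 1.0])
```

The point is not the metric itself but that the loop is closed: swap in any model, press the button, and the data tells you whether it is better.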
The Ultimate Instance of Test Driven Development
A different way to think about this is to see ML as the ultimate instance of test driven development. In TDD, the idea is to put your specification into unit tests, write those first, and then write the actual code until it passes all your tests.
The advantages are having “specs as code” and an automated and quick way to see whether your code works.
The difference between TDD and ML is that in TDD test cases are written by humans, and there is a certain art to getting it right. You obviously need cases that specify what you want, but you also want to cover corner cases and error cases. There are also tools that generate test cases randomly, an approach called property-based testing; examples include Hypothesis, and Clojure’s spec library, which can generate random parameter instances for functions.
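To make the idea concrete, here is a hand-rolled sketch of what property-based testing does, using only the standard library (libraries like Hypothesis do this far more cleverly, with shrinking of failing examples and richer generators):

```python
import random


def my_sort(xs: list[int]) -> list[int]:
    """The function under test."""
    return sorted(xs)


def check_sort_properties(trials: int = 100) -> bool:
    """Generate random inputs and assert properties that must always hold."""
    rng = random.Random(42)  # fixed seed so failures are reproducible
    for _ in range(trials):
        xs = [rng.randint(-1000, 1000) for _ in range(rng.randint(0, 20))]
        result = my_sort(xs)
        # Property 1: the output is ordered.
        assert all(a <= b for a, b in zip(result, result[1:]))
        # Property 2: the output is a permutation of the input.
        assert sorted(xs) == result
    return True
```

Instead of writing down specific inputs and outputs, you state a property that must hold for *all* inputs and let randomness explore the input space for you.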
Now ML is like TDD but instead of carefully designing the test cases, you just collect a lot of examples of inputs and desired outputs, almost like asking a bunch of people “what do you want the program to do?”
| Test Driven Development | Machine Learning |
| --- | --- |
| specs as automated tests | specs given as concrete input/output examples that have been measured or collected |
| all tests green means the program works as desired | evaluation gives you an objective measure to automatically compare approaches |
| percentage of tests passed | a loss function that fits your problem (accuracy, area under the ROC curve, precision/recall, F-score, etc.) |
You Can Overfit in TDD, Too
This analogy also sheds some light on concepts like overfitting. In ML, you want a method that does not overfit but generalizes well. Overfitting means the model commits too closely to the training examples, and then fails to predict good values on new data.
For TDD, the equivalent of overfitting would be to write a big switch statement over exactly the input values that appear in the tests and return the desired outputs. Of course you wouldn’t do that, because as humans we understand that the goal is to find a general solution which works for the test cases, but also beyond them.
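Sketched in code (with made-up example values), the “switch statement” overfit looks like this: both versions pass the tests, but only one survives contact with new data.

```python
# A "model" that memorizes the exact input/output pairs from the tests.
memorized = {1.0: 2.0, 2.0: 4.0, 3.0: 6.0}


def predict_overfit(x: float) -> float:
    """Passes the tests, but raises KeyError on any unseen input."""
    return memorized[x]


def predict_general(x: float) -> float:
    """The general rule the examples actually describe."""
    return 2.0 * x


# Both agree on the "test cases"...
assert predict_overfit(2.0) == predict_general(2.0) == 4.0
# ...but only the general rule handles new data, e.g. predict_general(5.0);
# predict_overfit(5.0) would raise a KeyError.
```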
The same is true for ML and is usually achieved through regularization (preferring solutions that are simpler), or by choosing the underlying class of models (for example linear functions), so that there is some structure in the solution.
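A minimal illustration of regularization, assuming the simplest possible setting of one-dimensional ridge regression: the penalty term `lam` shrinks the fitted weight toward zero, i.e. toward a simpler solution.

```python
def ridge_weight(xs: list[float], ys: list[float], lam: float) -> float:
    """Closed-form minimizer of sum((y - w*x)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


# Toy data roughly following y = 2x.
xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.0]

w_unreg = ridge_weight(xs, ys, lam=0.0)   # ordinary least squares
w_reg = ridge_weight(xs, ys, lam=10.0)    # regularized: weight shrunk toward 0
```

The larger `lam` is, the more the solution is pulled toward the “simple” answer `w = 0`, trading a worse fit on the training points for less sensitivity to noise in them.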
Take the Time to Write Automated Tests First
So next time you start a data science project, take the time to set up a test harness: think about your data and how you want to evaluate how well your solution works. Automate as much as you can, so that when you want to try out a new idea you can focus on the feature processing or the model, and everything else—training, evaluating, creating reports, storing the code version, and so on—works at the press of a button. This will significantly speed up the exploratory work required to find a good solution for your data.