Data Analysis: The Hard Parts

(Repost from 2014)

I don’t know whether this word exists, but mainstreamification is what’s happening to data analysis right now. Projects like Pandas or scikit-learn are open source, free, and allow anyone with some Python skills to do some serious data analysis. Projects like MLbase or Apache Mahout work to make data analysis scalable, so that you can tackle those terabytes of old log data right away. Events like the Urban Data Hack, which just took place in London, show how easy it has become to do some pretty impressive stuff with data.

The general message is: Data analysis has become super easy. But has it? I think people want it to be, because they have understood what data analysis can do for them, but there is a real shortage of people who are good at it. So the usual technological solution is to write tools which empower more people to do it. And for many problems, I agree that this is how it works. You don’t need to know TCP/IP to fetch some data from the Internet because there are libraries for that, right?

For a number of reasons, I don’t think that you can “toolify” data analysis that easily. I wish you could, but from hard-won experience with my own work and with teaching people this stuff, I’d say it takes a lot of experience to do properly, and you need to know what you’re doing. Otherwise you will do stuff which breaks horribly once put into action on real data.

And I don’t write this because I don’t like the projects which exist, but because I think it is important to understand that you can’t just give a few coders new tools and expect them to produce something which works. And depending on how you want to use data analysis in your company, this might make or break your company.

So my top four reasons are:

data analysis is so easy to get wrong
it’s too easy to lie to yourself about it working
it’s very hard to tell whether it could work if it doesn’t
there is no free lunch

Let’s take these one at a time.

Data Analysis is so easy to get wrong

If you use a library to go fetch some data from the Internet, it will give you all kinds of error messages when you do something wrong. It will tell you if the host doesn’t exist or if you called the methods in the wrong order. The same is not true for most data analysis methods, because these are numerical algorithms which will produce some output even if the input data doesn’t make sense.

In a sense, Garbage In, Garbage Out is even more true for data analysis. And there are so many ways to get this wrong, like discarding important information in a preprocessing step, or accidentally working on the wrong variables. The algorithms don’t care, they’ll give you a result anyway.

The main problem here is that you’ll probably not even notice it, apart from the fact that the performance of your algorithms isn’t what you expect it to be. In particular, when you work with many input features, there is really no way to look at the data. You are basically just working with large tables.

This is not just hypothetical. I have experienced many situations where exactly that happened, where people were accidentally permuting all their data because they messed up reading the data from the files, or did some other non-obvious preprocessing which destroyed all the information in the data.
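To make the silent failure concrete, here is a minimal sketch using only the Python standard library. The data is synthetic and the helper `fit_line` is a hypothetical name made up for this example; the shuffle stands in for a file-reading bug that misaligns rows. The fit runs without any complaint in both cases, and only the quality of the result hints that something is off.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return a, b, r_squared

rng = random.Random(0)

# Synthetic data with a clear linear relationship (assumption for the demo).
xs = [rng.random() for _ in range(200)]
ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]

# Correctly aligned rows.
_, _, r2_aligned = fit_line(xs, ys)

# Simulated loading bug: the targets end up in a different row order.
ys_shuffled = ys[:]
rng.shuffle(ys_shuffled)
_, _, r2_shuffled = fit_line(xs, ys_shuffled)

# No exception in either case; only the numbers differ.
print(f"R^2 with aligned rows:  {r2_aligned:.3f}")
print(f"R^2 with shuffled rows: {r2_shuffled:.3f}")
```

Nothing in the second fit tells you the rows were scrambled. You only find out if you had an expectation of what the number should be.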

So you always need to be aware of what you are doing, and you have to mentally trace the steps so that you have a well-informed expectation of what the result should look like. It’s debugging without an error message. Often you just have a gut feeling that something is quite wrong.

Sometimes the warning sign isn’t that the performance is bad, but that it’s suspiciously good. Let’s come to that next.

It’s too easy to lie to yourself about it working

The goal in data analysis is always good performance on future, unseen data. This is quite a challenge. Usually you start working from collected data, which you hope is representative of the future data. But it is so easy to fool yourself into thinking it works.

The most important rule is that only performance measured on data which you haven’t used in any way for training is a reliable estimate of future performance. However, this rule can be violated in many, sometimes subtle ways.

The classical novice mistake is to just take the whole data set, train an SVM or some other algorithm and look at the performance you get on the data set you used for training. Obviously, it will be quite good. In fact, you can achieve perfect predictions when you just output the values you got for training (ok, if they are unambiguous) without any real learning taking place at all.
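The point about perfect predictions without learning can be sketched in a few lines of plain Python. Everything here is an assumption for the sake of illustration: the labels are pure noise, and the "model" is just a dictionary that memorizes the training rows and falls back to the most common training label for anything it has not seen.

```python
import random

rng = random.Random(1)

# Pure noise: the labels have nothing to do with the inputs.
data = [((rng.random(), rng.random()), rng.choice([0, 1])) for _ in range(400)]
train, test = data[:200], data[200:]

# "Model" that simply memorizes the training set.
memory = dict(train)
train_labels = [label for _, label in train]
fallback = max(set(train_labels), key=train_labels.count)

def predict(x):
    # Look up the memorized label; guess the majority label otherwise.
    return memory.get(x, fallback)

def accuracy(rows):
    return sum(predict(x) == label for x, label in rows) / len(rows)

train_acc = accuracy(train)
test_acc = accuracy(test)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

The training accuracy is perfect although there is nothing to learn, while the held-out accuracy stays around chance level.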

But even if you split your data right, people often make the mistake of using information from the test data in the preprocessing (for example for centering, or building dictionaries, etc.). So you do the actual training only on the training data, but through the preprocessing, information from the test data has silently crept into the training as well, giving results which are much better than what you can realistically expect on real data.
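For centering, the safe pattern is to estimate the statistics on the training rows only and then apply those same numbers to the test rows. A minimal sketch with synthetic data follows; `fit_centering` and `apply_centering` are hypothetical helper names for this example, not functions from any library.

```python
import random
import statistics

def fit_centering(rows):
    """Learn per-column mean and standard deviation from the given rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds

def apply_centering(rows, means, stds):
    """Center and scale rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

rng = random.Random(2)
rows = [[rng.gauss(10.0, 2.0), rng.gauss(-5.0, 0.5)] for _ in range(300)]
train_rows, test_rows = rows[:200], rows[200:]

# Leaky variant (avoid): fit_centering(rows) would let test rows
# influence the statistics that the model is trained with.
means, stds = fit_centering(train_rows)

train_scaled = apply_centering(train_rows, means, stds)
test_scaled = apply_centering(test_rows, means, stds)
```

The same idea applies to dictionaries, feature selection, and any other step that learns something from data: fit it on the training part, then apply it unchanged to the test part.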

Finally, even if you do proper testing and evaluation of your method, your estimates of future performance will become optimistic as you try out many different approaches, because you implicitly optimize for the test set as well. This is called multiple testing and is something one has to be aware of, too.
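A small simulation shows the effect. The setup is an assumption chosen to make the point as stark as possible: every "model" is just a list of coin flips with no skill at all, and we keep whichever one happens to score best on a small test set.

```python
import random

rng = random.Random(3)
n_test, n_fresh, n_models = 40, 10000, 200

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

test_labels = [rng.choice([0, 1]) for _ in range(n_test)]

# Try many skill-free "models" and remember the best test-set score.
best_acc = 0.0
for _ in range(n_models):
    preds = [rng.choice([0, 1]) for _ in range(n_test)]
    best_acc = max(best_acc, accuracy(preds, test_labels))

# On fresh data the selected model is still only guessing.
fresh_labels = [rng.choice([0, 1]) for _ in range(n_fresh)]
fresh_preds = [rng.choice([0, 1]) for _ in range(n_fresh)]
fresh_acc = accuracy(fresh_preds, fresh_labels)

print(f"best test-set accuracy over {n_models} tries: {best_acc:.2f}")
print(f"accuracy of a guessing model on fresh data: {fresh_acc:.2f}")
```

The best score on the test set looks clearly better than chance, even though no model here can do better than 50% in the long run. Real approaches are not coin flips, but the selection effect works the same way.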

One can be trained to do all this properly, but if you are under pressure to produce results, you have to resist the temptation to just run with the first thing which gives good numbers. And it helps if you’ve gone that route once and failed miserably.

And even if you did evaluate according to all secrets of the trade, the question is still whether the data you worked on was really representative of future data.

It’s very hard to tell whether it could work if it doesn’t

A different problem is that it is fundamentally difficult to know whether you can get better if your current approach doesn’t work well. The first thing you try will most likely not work, and neither, probably, will the next, and then you need someone with experience to tell you whether there is a chance or not.

There is really no way to automatically tell whether a certain approach works or not. The algorithms just extract whatever information fits their model and the representation of the data, but there are many, many ways to do this differently, and that’s when you need a human expert.

Over time you develop a feeling for whether a certain piece of information is contained in the data or not, and ways to make that information more prominent through some form of preprocessing.

Tools only provide you with possibilities, but you need to know how to use them.

There is no free lunch

Now you might think “but can’t we build all that into the tools?” Self-healing tools which tell you when you make mistakes and automatically find the right preprocessing? I’m not saying that it’s impossible, but these are problems which are still hard and unsolved in research.

Also, there is no universally optimal learning algorithm, as shown by the No Free Lunch Theorem: There is no algorithm which is better than all the rest for all kinds of data.
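A toy calculation in the spirit of the theorem (not a proof, and much narrower than the formal statement): take a handful of unseen points with binary labels and assume every possible labeling is equally likely. Then any fixed set of predictions, however it was produced, has the same average accuracy. The function name `average_accuracy` is made up for this example.

```python
from itertools import product

def average_accuracy(predictions):
    """Accuracy of fixed predictions, averaged over every possible labeling."""
    k = len(predictions)
    total_correct = 0
    for labeling in product([0, 1], repeat=k):
        total_correct += sum(p == y for p, y in zip(predictions, labeling))
    return total_correct / (k * 2 ** k)

# Two very different "learners", reduced to their predictions on 6 unseen points.
always_zero = [0, 0, 0, 0, 0, 0]
alternating = [0, 1, 0, 1, 0, 1]

print(average_accuracy(always_zero))  # 0.5
print(average_accuracy(alternating))  # 0.5
```

An algorithm can only beat another one because real data is not an arbitrary labeling, and which assumptions match your data is exactly what you have to know.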

No way around learning data analysis skills

So in essence, there is no way around properly learning data analysis skills. Just like you wouldn’t just give a blowtorch to anyone, you need proper training so that you know what you’re doing and produce robust and reliable results which deliver in the real world. Unfortunately, this training is hard, as it requires familiarity with at least linear algebra and concepts of statistics and probability theory, stuff which classical coders are not that well trained in.

Still, it’s pretty awesome to have those tools around. Back when I started my Ph.D., everyone had their own private stack of code, which is ok if you’re in research and need to implement methods from scratch anyway. So we’re definitely better off nowadays.