Cleaning up my bookshelf a bit, I came upon the book Mining of Massive Datasets by Jure Leskovec**, **Anand Rajaraman, and** **Jeff Ullman. It deals with all kinds of ways to deal with big data sets, data streams, link structures between documents, social network analysis, and other kinds of data which of occur in large amounts.

There is a part about what is mostly Big Data, but I’d say at least 75% of the book are about entirely different algorithms.

There actually was a time when scalable didn’t mean that you could add more machines to make it faster, but very loosely that you could deal with a lot of data in a reasonable manner. Whatever that meant. For some reason, scalable nowadays is synonymous with parallelizable only.

The first and last lecture I gave at the university was called scalable machine learning and I think only very few algorithms qualified for being easily parallelizable. Instead, there were all kinds of clever insights and tricks to implement certain optimization algorithms with as little effort as possible. Some algorithms for the Support Vector Machine managed to just update a single dimension or two in each iteration and still converge. Impressive!

Other algorithms (my favorites!) dealt with large amounts of data by approximating. If you sum up a billion numbers, the law of large number will allow you to get close using only a subsample. If done properly, you don’t even have to see all those numbers. Of all those algorithms, I only see bloom filters from time to time.

What we call scalable today was sometimes called “embarassingly parallel” during these days, meaning that it was just so obvious how to parallelize the algorithm that you couldn’t even write a paper about it. Which is of course the main notion of whether something’s worth the effort to consider in some circles.

What academia probably didn’t get back then is that embarassingly parallel is good, because even a computer (pre-AI, just joking) could figure out how to do it, and that opened a path to write tools like Hadoop or Spark that allowed you to express an algorithm in a way that can be executed on one machine, or on a cluster of a 1000.

It turned out that it was easier to teach humans how to use these tools to do a lot of things than to teach people how to analyze individual computational problems and have enough luck to find something exploitable.

One of the interesting findings was that these simply tools (map/reduce and friends) actually allowed you to express many different kinds of algorithms. I remember being at a conference where someone presented how many learning algorithms could be written as map reduce, and many people were like “what the eff, and those aren’t even the good algorithms we have for these problems.” In a way, map/reduce and friends are a different kind of approximation. Sometimes, the simpler algorithm may have worse time complexity but parallelizes better.

Deep Learning on GPUs is no different from a computational point of view. The most complex operation are matrix-matrix multiplications which can also be parallelized easily, and by putting a bit more thought into organizing the computations, you can also make them sufficiently local in their memory access pattern to achieve these amazing TFLOPS numbers.

For some reason, we ended up with scalable databases in the end. Maybe it is because SQL is such a great abstraction, maybe it is because the people who build these tools had a database background. By the way, SQL joins are very similar to matrix multiplication.

Looking back, I think I’ve seen a glimpse of a different way. A more difficult path, but maybe if we had gone further, we would have found other sets of operations that allowed us to scale algorithms in different ways by parallel execution and approximation. In any case, I’m pretty sure we took the other path approximately five years ago.

Maybe people are also still working on it. I’ve been out for too long. If you know of research in this area, I am happy to hear about it!