In Search of the Holy Grail of MLOps: Taming the Beast of ML Use Cases

MLOps is a really hot topic right now, and there are many products out there that promise to meet all your needs for putting ML algorithms into production. At the same time, many if not most ML projects still struggle to make it to production, and teams have a hard time managing the large number of tools and frameworks it takes to be successful.

One reason for this is that ML applications are diverse, much more so than other areas where we’ve been more successful at creating frameworks that cover most of what you need (web frameworks like Ruby on Rails, Spring, or Django, for example).

In this post I want to run through four ML applications to highlight the diversity of requirements in practically all the stages of an ML system, and to give you a better understanding of the challenges you’ll face.

The applications are:

  • sales forecasting
  • computer vision object detection
  • recommendation
  • fraud detection

The dimensions are:

  • data (what kind of data sources, what kind of data, how much data, how much does the data change over time)
  • models (what kind of models, what kind of infrastructure you need to train them, how often you need to retrain)
  • evaluation (how you validate that your model fits your business problems)
  • deployment (how to serve the predictions from your model)
  • monitoring (what you need to make sure your model’s quality is OK).

Let me run you through these applications one by one, explain roughly how each is done, and discuss the different areas.

I’ll be mostly focussing on the parts of the final system in production. There is another important aspect, namely the tools you need to support the exploration, research, and iterative refinement it takes to create that system. But let’s not overcomplicate the (already) lengthy discussion below.

Sales Forecasting

Sales forecasting (or forecasting other business metrics) is a standard application that companies do as part of their planning. You can also use it to optimize discount rates or other decisions by adding more parameters to the input of the model. Typically, you want forecasts at a day-to-day granularity for timeframes like the next few months or the next year, and of course you’d want your forecasts to be as exact as possible.

The data that goes into this usually comes from databases. In the best case, the data already exists as part of some reporting. In the worst case, you’d need to assemble the data from multiple sources, but often these are also databases (or Excel sheets in emails, but then you have other problems). It’s not a lot of data, maybe daily numbers going back a year, but the granularity of things you want to track (for example geographical regions, item categories, or even individual items) could be large. Still, the data we’re talking about is maybe in the low thousands of data points.

For this kind of problem, classical time series models like ARIMA or some multivariate variant still work well. These models can be trained on a single machine if you can fit all the data into memory. If you have a lot of data with complex structure, you could also look into deep learning models for sequences like transformers. Such models require machines with GPUs, which are 20 to 100 times faster than their CPU-powered counterparts for this kind of workload. You don’t need to retrain your models very often; doing it manually when you need a new report might be totally fine.
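
To make this concrete, here is a minimal sketch of fitting an ARIMA model with statsmodels. The file name, column name, and model order are assumptions for illustration, not something prescribed by the use case.

```python
# Minimal forecasting sketch, assuming a CSV of daily sales totals.
# File and column names are illustrative.
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

sales = pd.read_csv("daily_sales.csv", index_col="date", parse_dates=True)["revenue"]

model = ARIMA(sales, order=(7, 1, 1))  # order chosen for illustration only
fitted = model.fit()

forecast = fitted.forecast(steps=30)  # predict the next 30 days
print(forecast.head())
```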

Evaluation is fairly straightforward if you have enough historical data: you train on older data and test on more recent data that is also already in the past. Evaluation might be tricky if the underlying reality is changing a lot (e.g. COVID, rapid business expansion), or if you don’t have enough data, but then there isn’t a lot that can be done anyway.
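
Continuing the sketch above, a minimal backtest might look like this: hold out the last 30 days, fit on everything before, and compare against what actually happened. The window size and metric are arbitrary choices for illustration.

```python
# Backtesting sketch (reuses the `sales` Series from the previous snippet).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

train, holdout = sales[:-30], sales[-30:]
backtest = ARIMA(train, order=(7, 1, 1)).fit().forecast(steps=30)

# Mean absolute percentage error; assumes no zero values in the holdout.
mape = np.mean(np.abs((holdout.values - backtest.values) / holdout.values)) * 100
print(f"MAPE on the last 30 days: {mape:.1f}%")
```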

Deployment is fairly easy: you compute predictions and store them in a database so that people can look at them. You don’t usually need to deploy your model as a REST service. One setting where you do is when the predictions are used by another system to optimize some decision, but even then the model is not that complex, so timing is not that critical.
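
As a sketch of that "write predictions back to a database" style of deployment, something like the following would do; the connection string and table name are purely illustrative.

```python
# Write the forecast back to a database so reporting tools can pick it up.
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@dbhost/analytics")  # illustrative

forecast.rename("predicted_revenue").to_frame().to_sql(
    "sales_forecast", engine, if_exists="replace", index_label="date"
)
```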

Monitoring is also relatively easy: you compare your predictions with the actual data as it comes in.

So in summary, I’d say this is a fairly straightforward setup: Get some data from a database, train a model that can fit on a single machine, store results back in a database.

If you wanted to build this, you’d probably use a tool like dbt for the pipelining to get the data, and the rest could be in a (hopefully robust and well tested) notebook.

Computer Vision Object Detection

OK, now let’s switch gears and look at something completely different: building a model that can detect objects in pictures. This is one of the classical problems of AI, with a very interesting history: it became one of the main drivers of deep learning and one of the first applications where deep learning beat the state of the art based on classical ML approaches. The key task is to recognize which objects are present in a picture, optionally together with their locations. The difficulty is that the actual pixel values of an object can look quite different depending on perspective, lighting, and occlusion, or, in the case of animals and humans, on pose. We know quite a lot about the different processing stages in biological vision systems, and early approaches actually recreated these. Deep learning keeps the general structure of these biological systems but learns the details of the filters at the various stages from the training data.

Although we know which network architectures work well for this kind of task, if you want to make sure the model works well for your domain, you still need to collect a data set and retrain some part of the network (often the last layer is enough, sometimes you need more).
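
As a concrete illustration of the "retrain only the head" approach, here is a minimal sketch with torchvision, swapping the box predictor of a pretrained Faster R-CNN for one with your own classes. The class count and the decision to freeze the backbone are assumptions for illustration.

```python
# Replace the detection head of a pretrained model for a new domain.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 3  # background + 2 domain-specific object classes (assumed)

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Freeze the backbone so only the new head gets updated during fine-tuning.
for param in model.backbone.parameters():
    param.requires_grad = False
```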

Your data set will most likely consist of a large number of pictures stored in flat files or in some object storage like S3 buckets. The data set doesn’t change a lot, but it slowly grows as you add more examples. Here, tools that help with manual annotation can be very handy; crowdsourcing the labels is another option.

For training you’ll use convolutional neural networks that need to be trained on GPU hardware. Such machines can be bought or rented in the cloud; the times are gone when you’d have to explain why you need a gaming-specced desktop machine “for work” (although that’s still possible, of course). Evaluation is pretty straightforward: you test on hold-out sets as in a standard supervised learning setup.

If you want to use the model in real time for face detection or object detection (e.g. in surveillance applications or for autonomous driving), you will want to deploy your model as a REST service. You might need to invest in some post-processing to make the network smaller or to make it run on less power-intensive hardware (for example by quantizing it so that it can run on CPUs or even lower-powered hardware).
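
To make the REST-service option concrete, here is a minimal serving sketch with FastAPI, using a pretrained torchvision model as a stand-in for your fine-tuned one. The endpoint name and model choice are assumptions for illustration.

```python
# Minimal object-detection REST endpoint (illustrative).
import io

import torch
import torchvision
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()
# Pretrained model as a stand-in; in practice you would load your own weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
to_tensor = transforms.ToTensor()

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        (output,) = model([to_tensor(image)])
    return {
        "labels": output["labels"].tolist(),
        "scores": output["scores"].tolist(),
        "boxes": output["boxes"].tolist(),
    }
```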

Model drift and other issues can be a problem (for example if you’ve trained on summer pictures and then winter comes around and the weather changes), but it is not trivial to monitor the model, because you don’t have ground-truth labels in real time. Still, you can look at proxy statistics like the average number of detected objects per picture, assuming that the statistics of the pictures you’re looking at don’t otherwise change much.

In summary, data and evaluation are not that complex here, but you might need some infrastructure to do labeling (e.g. via crowdsourcing). Training is the part where you’ll probably be spending most of your time and money. Deployment used to be tricky, but deploying models as REST services is one feature almost all MLOps tools provide, so that has gotten much easier.

You can already see how this example has almost nothing in common with the forecasting use case. The data is different (structured time series vs. images, databases vs. flat files), the models are different, the deployment is different, and so on. Let’s look at two more examples to further explore the diversity of ML applications.

Recommendation

Systems that compute recommendations are a typical use case for e-commerce or media websites. People have come to expect a “similar items” section on shopping websites and use it for exploratory navigation alongside search. It doesn’t have to be great, but it needs to be there and be “good enough”.

Technically, there are many ways to do this, but the classical approach is to look at which customers interacted with which items and then compute item similarities based on how much the customer groups for those items overlap. This is called collaborative filtering. Other approaches try to predict the probability that a customer will click on an item, based on all kinds of signals and features from the items and the customers.
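
A minimal sketch of that item-item flavor of collaborative filtering: build a sparse user-item interaction matrix and compute cosine similarity between items. The file and column names are assumptions for illustration.

```python
# Item-item collaborative filtering sketch.
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

interactions = pd.read_csv("clicks.csv")  # columns: user_id, item_id (illustrative)
users = interactions["user_id"].astype("category")
items = interactions["item_id"].astype("category")

# Sparse user-item matrix with a 1 wherever a user interacted with an item.
matrix = csr_matrix((np.ones(len(interactions)), (users.cat.codes, items.cat.codes)))

# item_similarity[i, j] is the cosine similarity between items i and j,
# with indices following items.cat.categories.
item_similarity = cosine_similarity(matrix.T, dense_output=False)
```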

Maybe the nicest thing about the recommendation use case is that the data is “self-labelling.” There is no need to take thousands of items and spend time and money to label them; collecting user interactions like views, clicks, and purchases is enough, although you can also add content features to the mix.

The biggest challenge on the data side is that you have to deal with large amounts of data. For bigger sites, that can easily be a few gigabytes per day. Since you usually train models on a few weeks’ worth of data, you have to use some form of scalable data processing system to preprocess it. Typical steps are aggregating user sessions so you can link up user interactions and compute basic statistics, like the number of times an item has been viewed and clicked by a customer, or how often items have been purchased after they have been shown in recommendations.

Here, technologies like Apache Spark or BigQuery really shine. Some people simply hold all their data in flat files or object storage, since the access pattern is mostly full data scans over a given range of days.
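
As a rough illustration of the kind of session aggregation described above, here is a minimal PySpark sketch; the event schema, paths, and column names are assumptions.

```python
# Aggregate views and clicks per customer and item with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("reco-preprocessing").getOrCreate()

events = spark.read.parquet("s3://shop-events/date=2024-*")  # hypothetical path

stats = events.groupBy("customer_id", "item_id").agg(
    F.sum(F.when(F.col("event_type") == "view", 1).otherwise(0)).alias("views"),
    F.sum(F.when(F.col("event_type") == "click", 1).otherwise(0)).alias("clicks"),
)

stats.write.mode("overwrite").parquet("s3://shop-features/item_stats")
```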

The classical collaborative filtering approaches can also be implemented on top of Big Data technology in a scalable fashion. More advanced approaches again use models that require GPUs for training. Models definitely need to be retrained regularly, often daily, sometimes even more often, as the assortment and the customer base change. Many application areas are highly seasonal, and retraining makes sure that the models stay current.

Since you need to provide recommendations as people are browsing your webpages, you need them to be available at low latency. There is a shortcut, however: as the assortment and the customers change only to a limited degree during a day, it is often practical to precompute recommendations, store them in a key-value store like Redis, and serve them from there. This doesn’t work if you want the model to react to changes in customer behavior quickly. In that case, the biggest challenge is often providing the additional features required by the model in real time. For that you can also use key-value stores where feature sets are stored at training time, or even updated in real time via some stream-processing infrastructure.
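
A minimal sketch of the precompute-and-serve shortcut using Redis; the key layout and expiry are assumptions for illustration.

```python
# Precompute recommendations and store them for cheap key lookups.
import json

import redis

r = redis.Redis(host="localhost", port=6379)

def store_recommendations(item_id: str, recommended_ids: list[str]) -> None:
    # One key per item; the frontend does a single GET per product page.
    r.set(f"reco:{item_id}", json.dumps(recommended_ids), ex=60 * 60 * 24)

store_recommendations("item-123", ["item-456", "item-789"])
print(json.loads(r.get("reco:item-123")))
```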

Evaluating the models is a tricky and interesting topic, as evaluation on historical data is not straightforward. Pure ML-level metrics like prediction accuracy can be computed, but you cannot go back in time and show customers entirely different recommendations. There are ways to estimate how interested a customer was in an item, for example by looking at what they did after the recommendation was shown. Finally, depending on your metric, it might not be easy to compute it right away: clicks can be seen quickly, but whether a customer ended up buying a recommended item might not be visible until much later. Again, for immediate monitoring you can look at statistics like clicks, latencies, the number of missed documents, etc.

Compared to the previous two examples, recommendation is much more complex on the data processing side, and also in terms of data instability. You’d need solid support for orchestrating regular retraining and interfacing to Big Data technology to build good pipelines.

As a last example, let’s pick one that is similar to forecasting in terms of data complexity, but even more challenging on the monitoring side.

Fraud Detection

As soon as you’re trying to sell stuff on the Internet, there will be people trying to steal money from you. The sheer number of transactions, the fact that they are ideally fully automated, and the potential for automation on the side of the attacker mean you need systems that can identify fraudulent behavior automatically and reliably, at least to some extent.

In terms of data, this application is similar to the forecasting case. You’ll usually use transaction and other business data (addresses, purchase history) as features. Often, you use classical models like random forests because of their robustness.
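
A minimal sketch of training such a classifier with scikit-learn; the feature columns and file name are illustrative, and the class weighting is an assumption to account for how rare fraud usually is.

```python
# Train a fraud classifier on transaction features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

transactions = pd.read_csv("transactions.csv")  # illustrative schema
features = transactions[["amount", "num_prior_orders", "days_since_signup"]]
labels = transactions["is_fraud"]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced")
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```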

Deployment also requires you to return a prediction quickly, often pulling features on past customer behavior from databases and computing them on the fly. We’ve seen this pattern a few times already.

A real difference is the monitoring of the model, because there is no immediate feedback. The time span between making a prediction and knowing the true label could be months, until you’ve learned whether the bill ultimately doesn’t get paid. You can back-test in a similar fashion as in forecasting by looking at historical data, but monitoring the live model is quite an art. What you can observe are statistics like the fraction of flagged transactions or the fraction of completed purchases, to get some idea of how your model is affecting the purchase flow.

The strategy of attackers might also change quickly as they identify new loopholes. Here, you often need to bring in the whole arsenal of statistical measures, tracking the distributions of many features to spot any change in the incoming data.
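
A minimal sketch of that kind of distribution monitoring: compare each feature’s recent values against a reference window with a two-sample Kolmogorov–Smirnov test. The file paths, windows, and threshold are assumptions for illustration.

```python
# Flag features whose distribution has shifted versus a reference window.
import pandas as pd
from scipy.stats import ks_2samp

reference = pd.read_parquet("features_last_month.parquet")  # illustrative paths
recent = pd.read_parquet("features_last_day.parquet")

for column in reference.columns:  # assumes numeric feature columns
    statistic, p_value = ks_2samp(reference[column], recent[column])
    if p_value < 0.01:
        print(f"possible drift in '{column}' (KS={statistic:.3f}, p={p_value:.4f})")
```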

In summary

I hope I could give you some idea of just how diverse ML applications are. Consequently, building one MLOps platform that can deal with all of these different settings and requirements is quite difficult. It might not even be a good idea to try.

To summarize the diversity of requirements, let’s go through the different aspects again:

  • Data
    • Might come from databases or flat files
  • Might be a few thousand data points or hundreds of GB
    • Processing might require Big Data technology
  • Might be growing slowly or changing from day to day
  • Models
  • Might be classical statistical models on moderate amounts of data, deep learning models trained on GPUs on large data sets, or scalable implementations on top of Big Data technology
    • Might be okay to train manually once in a while, up to needing to be retrained and redeployed in a fully automated fashion several times a day
  • Evaluation
    • Might be hold-out testing on available data, up to doing simulations and estimation of customer behavior (e.g. recommendations)
  • Deployment
  • Might be writing a batch of predictions back into a database, precomputing results and serving them from a key-value store, or storing (or even live-updating) features in a key-value store so the model can be evaluated in real time, up to serving large-scale deep learning models
  • Monitoring
  • Might be as simple as re-running the evaluation, up to tracking simple statistics of the model’s output, up to doing advanced statistical outlier detection on a large number of features

My hunch today is that most MLOps products promise everything but actually pick a specific subset of these possibilities. Nothing wrong with that; building something that works for everything is probably not a good idea. But the marketing makes it hard to understand what you’re actually getting.

For example,

  • mlflow or DVC seem to be more geared towards the cases with relatively simple and small to mid-scale data sets and supervised learning models.
  • Almost all MLOps tools allow you to deploy models as a REST service (e.g. mlflow, Sagemaker). As I explained, that is just one possibility of many, and in many cases another option is better conceptually and operationally. Projects like kubeflow build out a lot of infrastructure (again, infrastructure you might not need) to deploy models, and products like seldon take it even further by allowing you to create whole pipelines chaining models together for prediction.
  • Products like feast are building systems for feature storage and feature delivery at prediction time. But as I have explained, you might not actually need these for your application when you do batch predictions or when you precompute predictions and put them in a key-value store.
  • Cloud-based solutions like Sagemaker seem to aim for the large-scale data cases: they integrate with scalable data processing and make spinning up training instances the default. But that is way too much overhead if you’re working with just a small data set. In that case, running everything on your laptop can make much more sense.
  • The classical open-source approach of using jupyterlab on a small machine also works best with small to mid-sized data sets, or with GPU-based training that fits on a single large machine.
  • Many of these use cases actually don’t require automation. If you need it, Airflow still seems to be the standard, although I think it shows its age; newer products like prefect try to take a fresh approach. If most of your processing pipeline lives in databases, dbt is very popular due to its good integration and workflow.

I hope I could provide some background to better understand the MLOps requirement landscape. I think it is absolutely key to understand what you are working with and what you need. The marketing promises of products in this space are unfortunately not very good guideposts for making decisions right now. I don’t fault them; it is a fast-moving space, and I think many of these products are still searching for their sweet spot.

One final comment. I have mostly focussed on the MLOps system as it is in production. But the real work and challenge of any data science project is figuring out the details of these systems: What kind of data do I need? What kind of model? Which kind of features? And so on. I believe that different tools are required to best support this part of the work. You need to be able to quickly test out ideas (notebooks are a good tool for this), but also to log and document your approaches (mlflow or DVC do some logging, but documentation comes down to having the right process). Trying things out with data pipelines can be tricky, and most orchestration tools are too focussed on the production use case, prioritizing resilience and monitoring. I don’t think we have a good picture yet of what exactly is needed here. Looking at web frameworks, features like scaffolding and auto-reload are examples that help people move quickly and try out ideas. Git branches can be used to isolate a piece of work so it can be rolled back. And so on. I’m looking forward to what the future will bring in terms of tooling here.

Let me know what you think. Also happy to hear about your experiences working with different tools. Success and failure stories are both welcome!

Also, follow me on Twitter or connect with me on LinkedIn.
