Issue #24: AI at Porsche. Efficient Inference. Bootstrapping Labels. Nearest-Neighbor Benchmarks.
Welcome to the 24th issue of the MLOps newsletter.
In this issue, we cover an interesting approach to autonomous vehicles at Porsche, discuss strategies for the efficient serving of ML models, explore the bootstrapping of labels with weak supervision and share some recent resources.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
The Big Loop: artificial intelligence and machine learning
Porsche Engineering recently shared this exploration of how an autonomous vehicle could learn from experience and react intuitively to new circumstances. The article describes a very specific use case -- detecting lane changes in front of a car. While this article doesn’t talk about any other elements of self-driving, the process they describe is interesting and reminiscent of Tesla’s approach, which we discussed earlier.
The Problem
Their vehicles are equipped with an enhanced Adaptive Cruise Control (ACC) system, which maintains a safe distance from the vehicle in front and detects early when other road users are cutting in. With this enhanced ACC:
“A likely lane change is detected half a second to a second earlier – the equivalent of 30 metres of driving on the motorway,” explains Dr. Joachim Schaper, Senior Manager AI and Big Data at Porsche Engineering.
The model needs to be continuously improved, but sifting through all of the data being generated is too time- and resource-consuming.
“…we only want to record the data that really helps the system move forward,” says project manager Philipp Wustmann, an expert in longitudinal and lateral control at Porsche Engineering. “That's no easy task, because radar sensors and cameras generate an immense amount of data, most of which is not relevant to the function under consideration.”
Solution
They use the following process to improve the model (sketched in rough code after the list):
Find scenes where the ACC is not reacting optimally in the car (using some heuristics) and send them to their servers in the cloud
Augment this with additional simulated scenes that generate extra training data without more drives
Train a new model and validate it on unseen data
Push the model to the car and allow the new version of the model to be activated by the driver
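To make the loop concrete, here is a highly simplified sketch of such a data flywheel. Every function, field name, and threshold below is a hypothetical placeholder (stubbed out so the script runs); the article does not describe Porsche Engineering’s actual code.

```python
def find_suboptimal_scenes(drive_logs):
    # Step 1: heuristics in the car flag scenes where the ACC reacted noticeably late
    # (the 0.5 s threshold is made up for illustration).
    return [log for log in drive_logs if log.get("acc_reaction_delay_s", 0) > 0.5]

def simulate_variants(scene, n=5):
    # Step 2: generate additional synthetic variations of a flagged scene.
    return [dict(scene, variant=i) for i in range(n)]

def train_and_validate(training_scenes, holdout_scenes):
    # Step 3: stand-in for training a lane-change detector and checking it on unseen scenes.
    model = {"n_training_scenes": len(training_scenes)}
    passed = bool(training_scenes) and bool(holdout_scenes)  # placeholder validation gate
    return model, passed

drive_logs = [
    {"scene_id": 1, "acc_reaction_delay_s": 0.9},
    {"scene_id": 2, "acc_reaction_delay_s": 0.7},
    {"scene_id": 3, "acc_reaction_delay_s": 0.1},
]
flagged = find_suboptimal_scenes(drive_logs)
augmented = [v for scene in flagged for v in simulate_variants(scene)]
model, ok = train_and_validate(augmented[:-1], holdout_scenes=augmented[-1:])
if ok:
    # Step 4: roll out the new model; the driver can opt in to activate it.
    print("Ship the new model to the fleet.")
```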
They are now using this technical approach for other development projects, and we believe that elements of this approach are much more broadly useful.
Efficient Machine Learning Inference
This is a fascinating article that discusses strategies for serving multiple ML models in scenarios where latencies matter.
Why is this important?
Let’s imagine your team is responsible for serving multiple ML models, and low-latency inference is important for your business. There could be wide variation in the kinds of models and the use cases they serve.
Some models are large while others are small... Some models get periodic bursts of traffic, while others have consistent load. Yet another dimension to think about is the cost per query: some models are expensive to run, while others are quite cheap.
Typically, teams will provision for one model per host -- this ensures (relatively) predictable latency since you just have to track per-host throughput. This can then be horizontally scaled for peak traffic per model, and you should have sufficient capacity. However, these provisioned servers are often in excess of what is necessary at most times, and if you choose to reduce costs by reducing the size of the VM, that often has a negative impact on latency.
So what should one do?
Multi-model serving, defined as hosting multiple models in the same host (or in the same VM), can help mitigate this waste. Sharing the compute capacity of each server across multiple models can dramatically reduce costs, especially when there is insufficient load to saturate a minimally replicated set of servers. With proper load balancing, a single server could potentially serve many models receiving few queries alongside a few models receiving more queries, taking advantage of idle cycles.
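As a rough illustration of what this looks like in code, here is a minimal sketch of one process serving several models behind a single FastAPI app. The model names and the tiny logistic-regression models trained on synthetic data are made up so the sketch is self-contained; the article is about the infrastructure-level trade-off, not any particular serving framework.

```python
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

app = FastAPI()

# Load every model once and keep all of them in RAM on the same host.
# Here we train toy 4-feature models at startup; in practice you would
# load your real, pre-trained models instead.
MODELS = {}
for name in ["fraud", "churn", "ranking"]:  # hypothetical model names
    X, y = make_classification(n_samples=200, n_features=4, random_state=7)
    MODELS[name] = LogisticRegression().fit(X, y)

class PredictRequest(BaseModel):
    features: List[float]  # the toy models above expect 4 features

@app.post("/predict/{model_name}")
def predict(model_name: str, req: PredictRequest):
    model = MODELS.get(model_name)
    if model is None:
        raise HTTPException(status_code=404, detail=f"unknown model: {model_name}")
    # All models share the host's CPU; idle models only cost memory,
    # which is what lets low-traffic models ride along with busy ones.
    score = float(model.predict_proba([req.features])[0, 1])
    return {"model": model_name, "score": score}
```

With this sketch saved as app.py and started via `uvicorn app:app`, a POST to /predict/churn with a JSON body like {"features": [0.1, 0.2, 0.3, 0.4]} would be answered by the churn model, while the other models sit idle in memory.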
They then go on to show numbers from a real-world example. They gather the traffic in queries per second (QPS) for 19 models over a period of a week.
As can be seen, most models have infrequent traffic (and some spikes), but one model has high constant traffic and large spikes. When these models are each deployed on a standard VM (in Google Cloud, where the example is from), the 99th percentile latencies fall into different ranges depending on the type of model, from 1 ms to 1,000 ms.
Now, if all the models are instead hosted on a single, much larger VM, the 99th percentile latencies range between 4 ms and 40 ms (see the article for all the charts). Keeping all 19 models in memory consumes 40 GB of RAM, but machines that satisfy that requirement are easy to find in the cloud. Better still, this is achieved while keeping monthly costs 1-2 orders of magnitude lower.
Your mileage may vary (especially if all of your models have consistently high QPS), but these results are worth keeping in mind.
Conclusion
The authors put it well:
Multi-model serving enables lower cost while maintaining high availability and acceptable latency, by better using the RAM capacity of large VMs. While it is common and simple to deploy only one model per server, loading a large number of models onto a single large VM can instead offer acceptable latency at a lower cost. These cost savings also apply to serving on accelerators such as GPUs.
If you’re also responsible for deploying a large number of models and need to serve inferences in real time, this advice may be for you. One caveat: this approach couples models to some extent, so correlated traffic spikes across a few models can have an outsized impact on the whole system.
Bootstrapping Labels via ___ Supervision & Human-In-The-Loop
In a recent article, Eugene Yan shared a summary of prevailing approaches to bootstrap labeled training data in real-world machine learning applications - an underappreciated but important problem in our view. It discusses semi-supervised, active, and weakly-supervised learning, along with examples from DoorDash, Facebook, Google, and Apple.
Semi-supervised learning: This approach combines a small amount of labeled data with a larger amount of unlabeled data during training, and aims to improve upon the performance that can be obtained from just the labeled data (supervised learning) or unlabeled data (unsupervised learning).
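As a concrete (if toy) example, scikit-learn ships a SelfTrainingClassifier implementing one common flavor of semi-supervised learning, pseudo-labeling. The synthetic dataset, label fraction, and confidence threshold below are illustrative, not taken from the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=1000, random_state=0)
y_partial = y.copy()
# Pretend only ~5% of the data is labeled; -1 marks "unlabeled" examples.
rng = np.random.RandomState(0)
y_partial[rng.rand(len(y)) > 0.05] = -1

# The base model is retrained iteratively, adding its own most confident
# predictions on unlabeled points as pseudo-labels.
model = SelfTrainingClassifier(LogisticRegression(), threshold=0.9)
model.fit(X, y_partial)
print(f"Accuracy on the fully labeled data: {model.score(X, y):.3f}")
```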
Active Learning: Active learning aims to select the most interesting/informative unlabeled examples to label in order to improve model performance. While multiple selection criteria can be used, a common one is to pick the examples that are “hardest” for the model (those where the model is most uncertain). Active learning is often used to solve the “cold start” problem and to set up a feedback loop that iteratively improves the performance of ML models. In this blog post, DoorDash shared details of their human-in-the-loop active learning system for menu item tagging (a version of which we covered here). In this paper, researchers at Facebook AI shared details of an approach that couples active learning with similarity search around positive labels for skewed datasets. This is especially relevant when the underlying detection problem has low prevalence (e.g. integrity problems like bullying, hate speech, and misinformation), and we covered this in our last issue.
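A minimal sketch of the “pick the most uncertain examples” idea might look like the following. This is plain uncertainty sampling on synthetic data, not DoorDash’s or Facebook’s system.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A small labeled seed set and a large pool of unlabeled examples (synthetic).
X_labeled, y_labeled = make_classification(n_samples=100, random_state=0)
X_pool, _ = make_classification(n_samples=10_000, random_state=1)

model = LogisticRegression().fit(X_labeled, y_labeled)

# Uncertainty for binary classification: how close the predicted probability is to 0.5.
proba = model.predict_proba(X_pool)[:, 1]
uncertainty = 1.0 - np.abs(proba - 0.5) * 2.0

# Send the k most uncertain pool examples to human annotators next.
k = 50
to_label = np.argsort(-uncertainty)[:k]
print("Pool indices to label next:", to_label[:10])
```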
Weak Supervision: Related to semi-supervised learning, weak supervision aims to combine multiple (often noisy and imprecise) sources of labels to programmatically generate training labels. These sources can include heuristics, regex rules, previously trained models, etc. In this joint paper, researchers at Google and Snorkel share case studies of Snorkel Drybell, a weak supervision system deployed internally at Google.
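In the spirit of (but much simpler than) Snorkel, a toy weak-supervision setup could look like this: a few noisy labeling functions vote on each example, and their votes are combined, here by simple majority, whereas Snorkel learns a probabilistic label model to weight the sources. The spam rules and messages are invented for illustration.

```python
from collections import Counter

ABSTAIN, SPAM, NOT_SPAM = None, 1, 0

def lf_contains_free(text):      # keyword heuristic
    return SPAM if "free" in text.lower() else ABSTAIN

def lf_many_exclamations(text):  # punctuation heuristic
    return SPAM if text.count("!") >= 3 else ABSTAIN

def lf_short_message(text):      # weak "probably not spam" signal
    return NOT_SPAM if len(text.split()) < 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_short_message]

def weak_label(text):
    votes = [v for v in (lf(text) for lf in LABELING_FUNCTIONS) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no labeling function fired; leave the example unlabeled
    return Counter(votes).most_common(1)[0][0]

for message in ["FREE entry!!! Win a prize now!!!", "see you at lunch tomorrow", "ok!"]:
    print(repr(message), "->", weak_label(message))
```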
Putting it all together
Given the wide variety of constraints and priorities in real-world ML scenarios, it’s unlikely that there is a single “globally optimal” path to acquiring a large volume of good-quality labels over time. Nonetheless, the article shares some heuristics for when to apply which technique: on day 1, heuristics or weak supervision are a good way to get your product working end to end; afterward, some combination of active learning and label denoising might work well, as evidenced by the several examples in the article.
New Resources for Machine Learning and MLOps
We’ve recently come across a few useful resources that we’d like to share:
Approximate Nearest-Neighbor search benchmarks: We’ve covered applications of approximate nearest-neighbor (ANN) similarity search in a prior issue of the newsletter. This GitHub repo benchmarks various implementations of ANN search on a collection of public datasets. We liked the attention to usability in this repository -- datasets are pre-generated and each implementation has a Docker container. If you’re considering ANN for your use case, we highly recommend checking this out to understand which implementation might best suit your needs.
Shapash: A neat Python library for ML interpretability based on SHAP and LIME. It allows data scientists or end-users of a machine learning model to generate local and global explanations for predictions (feature contributions) and to understand feature correlations.
QuestDB: QuestDB is an open-source database for time-series or event data with a focus on performance. In benchmarks shared on their website, QuestDB outperforms popular alternatives like InfluxDB, and it might suit your use case especially if you need real-time queries over a high volume of data.
Twitter | Is pushing ML models to production really the hard part?
Neal Lathia’s recent Twitter thread highlighted two (and, one might say, somewhat at-odds) points of view in the MLOps world, namely:
The biggest challenge is shipping models to production
Shipping models to production is the easy part: wrap it behind an API and call it from your service (e.g. using Flask or FastAPI)
The thread generated interesting discussions. We recommend reading the various reply threads and conversations to get a complete picture, but we highlight a couple that we found especially relevant:
Batch and point predictions might need different approaches (thread)
The biggest question for companies is really the value added by deploying machine learning (thread)
While deploying a model behind a Flask API (or equivalent) is easy, questions around scaling, authentication, and reliability still need to be solved (thread)
Thanks
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup or email us at mlmonitoringnews@gmail.com. If you like what we are doing, please tell your friends and colleagues to spread the word.