Issue #18: MLOps on Coursera. #datalift. Long-tail events. Teaching AI to Forget.
Welcome to the 18th issue of the MLOps newsletter. First things first, we are excited to show you some design changes to the newsletter. We have moved away from the default Substack orange, and have a newly designed logo (which you can see above) with its own dark version on Twitter. Let us know how you feel about it by replying to this email or commenting on the Substack post. ✏️
In this issue, we share a new MLOps course and a virtual event where you can meet us, discuss the effects of long-tail events on ML systems, take a deep dive into research on teaching ML models to forget, and more.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
In a previous issue of our newsletter, we covered Andrew Ng’s talk about MLOps where he outlines the importance of striking the right balance between data and modeling improvements in real-world ML applications. In partnership with Coursera and DeepLearning.AI, Andrew Ng is teaching a course focused on MLOps. We are very happy to learn about this development and highly recommend checking it out:
The Machine Learning Engineering for Production (MLOps) Specialization covers how to conceptualize, build, and maintain integrated systems that continuously operate in production. In striking contrast with standard machine learning modeling, production systems need to handle relentless evolving data. Moreover, the production system must run non-stop at the minimum cost while producing the maximum performance. In this Specialization, you will learn how to use well-established tools and methodologies for doing all of this effectively and efficiently.
#datalift is an initiative from the AI Guild that brings together a community of AI practitioners and business leaders to help bridge the gap between proof-of-concept and productionization. They’re hosting an event on May 28 with interesting speakers and opportunities to network with folks interested in ML deployments.
We’re excited to announce that the MLOps Roundup will have a virtual booth at the event -- come say hi to us if you’re attending!
This blog post from the DoorDash Engineering team discusses the impact of long-tail events on a specific ML system at DoorDash and how the team improved that system to address the problem.
What are long-tail events?
Long-tail events are often problematic for businesses because they occur somewhat frequently but are difficult to predict. We define long-tail events as large deviations from the average that nevertheless happen with some regularity. Given the severity and frequency of long-tail events, being able to predict them accurately can greatly improve the customer experience.
At DoorDash, customers see an ETA before their food is delivered. This ETA sets an expectation for when the order will arrive, and while the system works well in most cases, a small number of late deliveries can have an outsized negative impact (we all know what it’s like to be “hangry” - remember the Snickers commercials?).
The post makes a distinction between tail events and outliers:
Outliers tend to be extreme values that occur very infrequently. Typically they are less than 1% of the data...On the other hand, tail events represent occurrences that happen with some amount of regularity (typically 5-10%), such that they should be predictable to some degree.
Challenges with tail events?
Given that tail events happen 5-10% of the time, there is a sizable opportunity in improving the ML system for these events.
However, this is challenging because there is often too little ground-truth data on tail events for an ML model to learn generalizable patterns. It can also be difficult to obtain leading indicators that are correlated with the occurrence of a tail event. For example, for an online retailer, a social media post from an influencer might cause a sudden spike in demand for a product, which is almost impossible to predict.
DoorDash could choose to always overestimate the ETA as a safeguard against tail events, but this hurts revenue: many people will choose not to get food delivered if the ETA crosses some threshold. So the only recourse is to improve predictions on tail events.
The DoorDash team tried a few different options to address these challenges. First, they used bucketing and target encoding (read this good intro to target encoding) for certain continuous-valued features (like marketplace health). This gave the model an easier path to learning the effect of such features on the target value.
Second, they introduced real-time features that capture recent signals about the outcome they care about.
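To make the first idea concrete, here is a minimal sketch of bucketing plus target encoding using pandas. The column names, toy values, and bin edges are our own assumptions for illustration; they are not taken from the DoorDash post.

```python
import pandas as pd

# Toy training data (made up): a hypothetical continuous "marketplace health"
# score and the observed delivery duration in minutes.
train = pd.DataFrame({
    "marketplace_health": [0.10, 0.20, 0.45, 0.55, 0.80, 0.90],
    "delivery_minutes": [60.0, 50.0, 40.0, 36.0, 30.0, 26.0],
})

# Step 1: bucket the continuous feature into three equal-width bins.
# labels=False returns the integer bin index (0, 1, 2) for each row.
bin_edges = [0.0, 1 / 3, 2 / 3, 1.0]
train["health_bucket"] = pd.cut(
    train["marketplace_health"], bins=bin_edges, labels=False, include_lowest=True
)

# Step 2: target-encode each bucket with the mean target value seen in training.
bucket_means = train.groupby("health_bucket")["delivery_minutes"].mean()
train["health_encoded"] = train["health_bucket"].map(bucket_means)

print(train[["marketplace_health", "health_bucket", "health_encoded"]])
```

In a real pipeline, the bucket means would typically be computed on training folds only (to avoid leaking the target), with a fallback such as the global mean for buckets that were never seen.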
For example, we look at average delivery durations over the past 20 minutes at a store level and sub-region level. If anything, from an unexpected rainstorm to road construction, causes elevated delivery times, our ETAs model will be able to detect it through these real-time features and update accordingly.
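A real-time feature like the one quoted above can be sketched as a trailing-window average. This is only an illustration using the Python standard library: the class name, the key format, and the use of seconds as timestamps are our assumptions, and a production system would more likely compute this in a streaming or feature-store layer.

```python
from collections import defaultdict, deque


class RollingDeliveryAverage:
    """Average delivery duration per key (e.g. a store) over a trailing window."""

    def __init__(self, window_seconds=20 * 60):
        self.window_seconds = window_seconds
        # key -> deque of (timestamp_seconds, duration_minutes), oldest first
        self._events = defaultdict(deque)

    def add(self, key, timestamp, duration_minutes):
        """Record a completed delivery; timestamps must arrive in order per key."""
        self._events[key].append((timestamp, duration_minutes))

    def average(self, key, now):
        """Return the mean duration within the window ending at `now`, or None."""
        events = self._events[key]
        # Drop deliveries that have fallen out of the trailing window.
        while events and events[0][0] <= now - self.window_seconds:
            events.popleft()
        if not events:
            return None
        return sum(duration for _, duration in events) / len(events)


feature = RollingDeliveryAverage()
feature.add("store_1", timestamp=0, duration_minutes=30.0)
feature.add("store_1", timestamp=600, duration_minutes=50.0)

avg_at_15_min = feature.average("store_1", now=900)   # both deliveries in window
avg_at_25_min = feature.average("store_1", now=1500)  # first delivery has expired
```

The same class could be keyed by sub-region instead of store to get the coarser-grained signal.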
Finally, they changed their loss function from linear to quadratic, which penalizes large deviations much more strongly.
Based on the experiment results, we were able to improve long-tail ETA accuracy by 10% (while maintaining constant average quotes). This led to significant improvements in the customer experience by reducing the frequency of very late orders, particularly during critical peak meal times when markets were supply-constrained.
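To see why a quadratic loss pushes a model toward fewer very late orders, consider this small sketch. The two error lists are hypothetical and chosen so that both "models" have the same total error; only its distribution differs.

```python
def mean_absolute_error(errors):
    """Linear loss: every minute of error costs the same."""
    return sum(abs(e) for e in errors) / len(errors)


def mean_squared_error(errors):
    """Quadratic loss: cost grows with the square of the error."""
    return sum(e * e for e in errors) / len(errors)


# Hypothetical ETA errors in minutes for four orders.
steady_errors = [5.0, 5.0, 5.0, 5.0]   # always a little off
tail_errors = [0.0, 0.0, 0.0, 20.0]    # usually perfect, occasionally very late

mae_steady = mean_absolute_error(steady_errors)
mae_tail = mean_absolute_error(tail_errors)
mse_steady = mean_squared_error(steady_errors)
mse_tail = mean_squared_error(tail_errors)

print(f"linear loss:    steady={mae_steady}, tail={mae_tail}")
print(f"quadratic loss: steady={mse_steady}, tail={mse_tail}")
```

The linear loss cannot tell the two error profiles apart, while the quadratic loss rates the one with a single 20-minute miss as four times worse, which matches the business intuition that one very late order hurts more than several slightly late ones.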
The lessons they share:
First, investments in feature engineering tend to have the biggest returns.
Second, it’s helpful to curate a loss function that closely represents the business tradeoffs.
A Transformer is a type of deep learning model architecture that uses the attention mechanism to selectively focus on the parts of the input that the model considers relevant for making the final prediction. While Transformers have revolutionized language understanding, they struggle to scale to long pieces of text, primarily because of computational costs. Facebook recently published new research (paper, code) that allows models to “forget” or expire information they have learned in the past that might no longer be relevant, thus allowing attention mechanisms to scale to much longer input sequences.
What is it?
In the paper above, the authors introduce Expire-Span, a new technique that allows neural networks to expire information they have learned in the past that might no longer be relevant. While the problem this paper solves has long been acknowledged, the main challenge so far has been that “expiring a given piece of information” is a discrete operation, i.e. not differentiable. Expire-Span assigns an expiration value to each hidden state it has learned in the past and recomputes this value at each time step. In this way, how long a piece of information is retained becomes something the model learns.
As mentioned in the paper and in the article above, Expire-Span is a promising way to scale attention to long pieces of text (or sequential inputs more generally - e.g. image frames in a video). It is also quite interesting to consider potential use cases of “memory expiration” for problems such as models encoding the biases of the underlying datasets. For example, is it possible to set up the training loss so that models learn to forget spurious (and often biased) correlations they have learned from data? Can we improve generalizability and tackle data/concept drifts by ensuring models forget information that might no longer be relevant? This is an exciting and novel idea, and we look forward to future research directions that build upon this work.
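For intuition, here is a rough NumPy sketch of how an expiration value can be turned into a soft attention mask. It reflects our reading of the soft-masking idea and is not Facebook's released code: the function names, the parameters `w`, `b`, `max_span`, and `ramp`, and the toy inputs are all assumptions for illustration.

```python
import numpy as np


def expire_span_mask(hidden, w, b, max_span, ramp, t):
    """Soft mask in [0, 1] over memories 0..n-1 as seen from time step t."""
    n = hidden.shape[0]
    # Predicted lifetime of each memory, squashed into (0, max_span) by a sigmoid.
    spans = max_span / (1.0 + np.exp(-(hidden @ w + b)))
    ages = t - np.arange(n)       # how many steps ago each memory was written
    remaining = spans - ages      # negative once a memory has outlived its span
    # A linear ramp (instead of a hard 0/1 cutoff) lets memories fade out over
    # `ramp` steps, so the mask stays differentiable with respect to the spans.
    return np.clip(1.0 + remaining / ramp, 0.0, 1.0)


def masked_attention(scores, mask):
    """Softmax over scores, with expired memories masked out and renormalized.

    Assumes at least one memory has a non-zero mask.
    """
    weights = np.exp(scores - np.max(scores)) * mask
    return weights / weights.sum()


# Three toy memories with one feature each: the first predicts a very short span.
hidden = np.array([[-10.0], [10.0], [10.0]])
w = np.array([1.0])
mask = expire_span_mask(hidden, w, b=0.0, max_span=4.0, ramp=2.0, t=3)
attention = masked_attention(np.zeros(3), mask)
print(mask, attention)
```

Memories whose mask has reached zero no longer contribute to attention, so in principle they can be dropped from memory entirely, which is where the compute and memory savings would come from.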
LinkedIn recently open-sourced Greykite (link to GitHub), a Python library for time series forecasting. As part of the library, they are also open-sourcing its core forecasting algorithm, called ‘Silverkite’ (link to the paper if you’re interested). The authors have shown this algorithm to work well for cases like time-varying trends and seasonality, which we assume LinkedIn has to deal with quite frequently. The open-source article linked above and the GitHub repo are quite comprehensive and dive deep into multiple use cases. We recommend checking it out if you’re dealing with time-series data!
Twitter Fun | Types of ML/NLP papers
“All memes are wrong, but some are interesting” (or was it “All models are wrong, but some are useful”?). This one was too real to not share! 😃
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup (open to DMs as well) or email us at email@example.com
If you like what we are doing, please tell your friends and colleagues to spread the word.