Issue #21: Selecting MLOps Capabilities. Continuous Delivery For ML. Static Language Modelling. Sharing Data with AWS.
Welcome to the 21st issue of the MLOps newsletter.
In this issue, we cover some tips from Google Cloud on choosing the right MLOps capabilities, share what continuous delivery for ML systems looks like, deep dive into the performance of language models over time, discuss the implications of AWS terms of service, and much more.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
Google Cloud Blog | Getting started with MLOps: Selecting the right capabilities for your use case
In a previous edition of the newsletter, we covered Vertex AI, a managed machine learning platform on top of the Google Cloud Platform for companies to train, deploy, and maintain their AI models. As a follow-up, Google has released a structured framework for thinking through what a mature production MLOps stack can look like. In this article, Christos Aniftos shares valuable insights about how to map the various capabilities on offer to your company’s use cases.
Pilots: These use cases are about rapid iteration and testing a proof of concept. The capabilities most helpful for such applications are experimentation tracking and data processing & transformation capabilities.
Production/Mission Critical: These applications are on the critical path of serving customer needs and creating business value. Failures can have a significant negative impact (legal, ethical, reputational, or financial risks). In such cases, robust offline evaluation is important (to identify poor performance or potential bias before a model goes live). Additionally, production monitoring is critical to keep an eye on post-launch performance.
Reusability & Collaboration: In cases where the same data source, feature sets, or model architecture powers multiple production ML use cases, it is helpful to standardize and have a single source of truth for these assets. E.g. a feature store helps the processes of registering, storing, and serving features for multiple ML models.
Ad-hoc vs Frequent retraining: For use cases where ML models are not retrained on a regular schedule, production monitoring is critical to detect drifts, outliers, and anomalies. For use cases where recurring training is part of the ML workflow, it is important to have end-to-end, reproducible ML workflows that stitch together the various parts of the ML development and evaluation lifecycle. Online experimentation might also be required, depending on your use case.
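To make "detecting drift" concrete, here is a minimal sketch of one common check, the Population Stability Index (PSI), which compares the distribution of a feature at training time with what the model sees in production. This is our own illustration, not something from the Google article: the function name, the bin count, and the smoothing constant are all assumptions, and the often-quoted thresholds (below 0.1 is stable, above 0.25 is a significant shift) are rules of thumb rather than guarantees.

```python
import math
import random


def population_stability_index(reference, live, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a live sample of one numeric feature.

    Bins are equal-width over the range of the reference sample (an
    illustrative choice; quantile bins are also common).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    live_frac = bin_fractions(live)
    return sum((l - r) * math.log(l / r) for r, l in zip(ref_frac, live_frac))


# Synthetic data, for illustration only.
rng = random.Random(0)
training_feature = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
live_shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]

psi_same = population_stability_index(training_feature, live_same)
psi_shifted = population_stability_index(training_feature, live_shifted)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

In a real pipeline you would run a check like this per feature on a schedule and alert when the score crosses whatever threshold you have settled on.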
The article goes into a few more examples and we recommend reading it. Our takeaway, which we have also expressed previously when covering the MLOps ecosystem, is that there isn’t a “one-size-fits-all” solution stack for production machine learning. Given the many companies and open source tools specializing in one or a few aspects of the ML development cycle, and the need for flexibility, we think initiatives like the AI Infrastructure Alliance can help define a common standard of interoperability among these products in the long term.
Martin Fowler Blog | Continuous Delivery for Machine Learning
The process for developing, deploying, and improving ML applications is more complex than for traditional software. Continuous Delivery is the established approach for bringing automation and standardization to releasing software into production reliably. In this article on Martin Fowler's site, Danilo Sato, Arif Wider, and Christoph Windheuser discuss Continuous Delivery for Machine Learning (CD4ML), the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications.
Why is it hard?
ML applications have three underlying components that can cause the end output to change: the code (application code), the model (learned parameters), and the data (training or validation dataset used for model training). Their interaction is often complex and hard to predict, which is what makes the problem harder than in traditional software.
Components of CD4ML
Discoverable and Accessible Data: It is important that the data is easily discoverable and accessible. The harder it is to find the data, the longer it will take data scientists to build useful models. This means having well-defined, functioning infrastructure that moves logging/production data into a warehouse, data lake, or traditional database.
Reproducible Model Training: It is critical to be able to reproduce and version artifacts in the ML training workflow (e.g. data cleaning & transformation, splitting into training vs validation sets, etc). Tools like DVC (Data Science Version Control) and Pachyderm, which we have covered previously, are possible solutions to this problem.
Model Serving: There are three possible ways to deploy a model to do predictions online - embedded in the end application, deployed as a service (exposed as an API that is called by the consuming application), or published as data (which is ingested as a data stream by the consuming application).
Testing: This step is equivalent to running quality checks and unit & integration tests for software releases. However, in the case of ML applications, it is important to test not just the code but also the model and data. E.g. data validation checks that ensure schema consistency or expected statistical properties, and model quality checks that run predefined unit test cases or enforce offline performance thresholds.
Experiments Tracking: This helps you understand the stability and performance of the various experimental models that might be running online. Tools like MLflow, which we have covered previously, can be a good option here.
Model Monitoring: Once the model is deployed in production, we need to understand how it performs and close the data feedback loop. This consists of logging model inputs and predictions and creating relevant alerts, detecting outliers and drifts, monitoring various sub-populations of the traffic for possible model bias, etc.
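As a rough illustration of the Testing component above, here is a minimal sketch of a schema check on incoming data and an offline performance gate on a candidate model. The column names, the schema, and the choice of mean absolute error are our own assumptions, loosely modelled on a sales forecasting setting; they are not taken from the article or its repository.

```python
# Hypothetical schema for a sales forecasting dataset (illustrative only).
EXPECTED_SCHEMA = {"store_id": int, "date": str, "units_sold": float}


def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty if none)."""
    errors = []
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        if missing:
            errors.append(f"row {i}: missing columns {sorted(missing)}")
        for col, expected_type in schema.items():
            if col in row and not isinstance(row[col], expected_type):
                errors.append(f"row {i}: {col} should be {expected_type.__name__}")
    return errors


def passes_quality_gate(y_true, y_pred, max_mae):
    """Offline threshold: mean absolute error must not exceed max_mae."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return mae <= max_mae


good_rows = [{"store_id": 1, "date": "2021-07-01", "units_sold": 12.0}]
bad_rows = [{"store_id": "1", "date": "2021-07-01"}]  # wrong type, missing column

print(validate_rows(good_rows))
print(validate_rows(bad_rows))
print(passes_quality_gate([10.0, 20.0], [11.0, 19.0], max_mae=1.5))
```

Checks like these would typically run as a pipeline stage, failing the build before a bad dataset or a regressed model is promoted.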
The article takes the example of a sales forecasting application to walk through a possible CD4ML setup and we highly recommend checking it out in detail. You can also check out the associated code repository on GitHub.
Paper | Pitfalls of Static Language Modelling
This is a really interesting paper from researchers at DeepMind (one of whom is a subscriber to this newsletter!) that discusses the performance degradation of large language models over time. Given the explosion of large language models in the past few years (BERT, GPT-3, etc), and how excitedly they’re being adopted by practitioners, it’s important to understand their performance characteristics and potential failure points.
As the authors put it:
Our world is open-ended, non-stationary and constantly evolving; thus what we talk about and how we talk about it changes over time. This inherent dynamic nature of language comes in stark contrast to the current static language modelling paradigm, which constructs training and evaluation sets from overlapping time periods.
A language model is a probability distribution over utterances (such as words or characters) that is learned from a set of observations. A good language model assigns a high probability to the utterances that actually occur next, such as the next word in a sentence.
Many NLP applications (such as question answering systems) rely on language model pretraining and “require up-to-date factual knowledge of our ever-changing world.” Now imagine answering questions like “How many people in the world have died from COVID-19?” or “Who is the current President of the United States?” and you’ll see that the answer depends on when the question was asked. Using a language model that was trained on data from the past might be a recipe for failure.
The authors trained language models on news and scientific datasets, and tested the models in two ways: first on documents from the same time period as the training data, and then on documents from the future, relative to the training data. For all datasets, they saw a significant increase in perplexity on the future documents. Perplexity measures how surprised a model is by the text it sees (lower is better), so the increase means the future text was harder to predict.
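To make perplexity concrete, here is a toy sketch. It is not the paper's setup (the authors use large neural language models); we swap in an add-one smoothed unigram model and a few made-up sentences purely to show how the metric is computed and why text about new topics scores worse.

```python
import math
from collections import Counter


def perplexity(counts, tokens, vocab_size):
    """exp(average negative log-likelihood) under an add-one smoothed unigram model."""
    total = sum(counts.values())
    nll = 0.0
    for tok in tokens:
        # Counter returns 0 for unseen tokens, so smoothing gives them 1 / (total + V).
        p = (counts[tok] + 1) / (total + vocab_size)
        nll -= math.log(p)
    return math.exp(nll / len(tokens))


# Made-up corpora, for illustration only.
train = "the election results were announced and the senator won the vote".split()
in_period = "the senator won the election".split()
future = "the pandemic lockdown delayed the vaccine rollout".split()

counts = Counter(train)
vocab_size = len(set(train) | set(in_period) | set(future))

ppl_in_period = perplexity(counts, in_period, vocab_size)
ppl_future = perplexity(counts, future, vocab_size)
print(f"in-period perplexity: {ppl_in_period:.2f}")
print(f"future perplexity:    {ppl_future:.2f}")
```

The "future" sentence is full of words the model never saw during training, so each one gets a tiny probability and the perplexity goes up.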
Further, they observed that:
the model performs increasingly badly when it is asked to make predictions about test documents that are further away from the training period, demonstrating that model performance degrades more substantially with time.
They also noticed that the drop in performance was much more significant in three cases: when predicting nouns, both proper and common (imagine a newly elected politician); when the documents were about rapidly changing topics like sports; and when they discussed emerging topics like Covid or 5G.
They also found that training even larger language models doesn’t address this problem.
One way to solve this problem is to retrain the language models frequently, but this is an expensive proposition (as an example, training a model like GPT-3 is estimated to cost a few million dollars).
The authors report that a technique such as “dynamic evaluation” could help. Dynamic evaluation is a form of online learning that updates the parameters of an already trained model as new data becomes available. They find that it helps the model predict emerging new words and topics. However, this might not be good enough: dynamic evaluation might lead to the language model forgetting important concepts from the past (“catastrophic forgetting”).
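Here is a toy sketch of the idea behind dynamic evaluation. In the paper it means taking gradient steps on a neural language model as test documents arrive; as a stand-in, we update the counts of a smoothed unigram model after scoring each token. The function name and data are our own assumptions, chosen only to show the "score first, then learn from what you just saw" loop.

```python
import math
from collections import Counter


def avg_negative_log_likelihood(counts, tokens, vocab_size, adapt=False):
    """Score tokens one at a time; with adapt=True, learn from each token after scoring it."""
    counts = Counter(counts)  # work on a copy so the trained model is left untouched
    total = sum(counts.values())
    nll = 0.0
    for tok in tokens:
        p = (counts[tok] + 1) / (total + vocab_size)
        nll -= math.log(p)
        if adapt:
            # The online update: only applied after the token has been scored.
            counts[tok] += 1
            total += 1
    return nll / len(tokens)


# Made-up data: a model trained on sports text meets a stream about a new topic.
trained = Counter(["the", "match", "ended"] * 20)
stream = ["covid", "the", "covid", "covid", "the", "covid", "covid", "covid"]
vocab_size = 4

static_nll = avg_negative_log_likelihood(trained, stream, vocab_size)
adaptive_nll = avg_negative_log_likelihood(trained, stream, vocab_size, adapt=True)
print(f"static:   {static_nll:.3f}")
print(f"adaptive: {adaptive_nll:.3f}")
```

A count-based model like this one never forgets old words, so it does not show the catastrophic forgetting risk that comes with updating a neural network's weights.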
It’s best to be careful when using these language models, especially if they were trained a while ago.
Techmonitor Blog | AWS Customers are Opting in to Sharing AI Data Sets with Amazon Outside their Chosen Regions and Many Didn’t Know
This is a post from July 2020 that remains quite relevant today. It gets into the weeds of AWS service terms and agreements but the legal implications are super interesting.
As they report:
The cloud provider is using customers’ “AI content” for its own product development purposes. It also reserves the right in its small print to store this material outside the geographic regions that AWS customers have explicitly selected.
Digging into the service terms of AWS, “AI Content” is “Your Content that is processed by an AI Service”. AI Services include AWS Services such as Amazon Comprehend (which provides NLP algorithms such as Sentiment Analysis), Amazon Polly (text to speech), Amazon Transcribe (speech to text), Amazon Translate (machine translation), etc.
AWS’s service terms allow them to not only store the data that folks run through such AI services but move it to different AWS regions (which has data privacy implications) and use that data for the improvement of AWS’s AI services. It’s unlikely that most users of such AI services are aware that their data is being used in such a way, and AWS’s terms also say that AWS’s customers are responsible for informing their End Users of such usage! Sneaky!
What can you do?
Well, you can opt out of such usage by AI services. If you are using AWS’s AI services on any sensitive data, you might want to ask your DevOps team to enable the AI services opt-out policy, which is configured through AWS Organizations.
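For the curious, here is a rough sketch of what that can look like with boto3. It reflects our understanding of the AWS Organizations AI services opt-out policy syntax at the time of writing, so please check it against the current AWS documentation before relying on it. The helper names, the policy name, and the description are our own; the policy type also has to be enabled on the organization root first, and you need credentials for the management account.

```python
import json


def build_ai_opt_out_policy():
    """Policy document that opts every AI service out by default (syntax as we understand it)."""
    return {"services": {"default": {"opt_out_policy": {"@@assign": "optOut"}}}}


def create_and_attach(policy, target_id, name="ai-services-opt-out"):
    """Create the policy and attach it to a root, OU, or account id.

    Hypothetical helper: requires boto3 and management-account credentials.
    It is defined here but not called.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it

    org = boto3.client("organizations")
    response = org.create_policy(
        Content=json.dumps(policy),
        Description="Opt out of AI services using our content",
        Name=name,
        Type="AISERVICES_OPT_OUT_POLICY",
    )
    policy_id = response["Policy"]["PolicySummary"]["Id"]
    org.attach_policy(PolicyId=policy_id, TargetId=target_id)
    return policy_id


policy = build_ai_opt_out_policy()
print(json.dumps(policy, indent=2))
```

Attaching the policy at the organization root applies it to every account underneath, which is usually what you want for sensitive data.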
New Library Alert | Dud
Dud is a tool for storing, versioning and reproducing large files alongside source code (and it has cool design principles). As the creators say:
Dud is heavily inspired by DVC. If DVC is Django, Dud aims to be Flask. Dud is much faster, it has a smaller feature set, and it is distributed as a single executable.
If you’re dealing with large datasets, and want a quick and easy tool, this might be it!
New Library Alert | Kats
In an earlier issue, we shared a Python library for time series forecasting called Greykite. Kats is a new Python library from Facebook that looks really neat: it provides forecasting, seasonality and outlier detection, and feature extraction capabilities for time series data.
Competition for data-centric AI by Andrew Ng
We have previously covered Andrew Ng’s talk about striking the right balance between data and modeling improvements in real-world ML applications. Landing.AI and DeepLearning.AI, two initiatives cofounded by Andrew Ng, recently announced the Data-Centric AI Competition. Traditionally, ML competitions invite participants to innovate on model architectures while keeping the dataset fixed. This competition inverts the format to incentivize innovation in the creation and curation of high-quality data for ML. We welcome this competition and look forward to covering the progress and innovations that come out of it in future editions of the newsletter.
“Machine learning has matured to the point that high-performance model architectures are widely available, while approaches to engineering datasets have lagged. The Data-Centric AI Competition inverts the traditional format and instead asks you to improve a dataset given a fixed model.”
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup or email us at email@example.com. If you like what we are doing please tell your friends and colleagues to spread the word.
A comment on the "Pitfalls" paper on large, static model degradation...
You mention that retraining frequently would solve the problem, but it's too expensive. You are half right. The reason it's too expensive is not inherent in the problem - it's an issue with the chosen ML technology (deep learning). At Textician, we use very fast regression algorithms. Our customers can retrain entire models overnight on vanilla hardware, with a side benefit that convergence is built in.
This solves problems of both time and space. As the paper discusses, models degrade over time, but in our application, we often face the issue that each installation must deal with different jargon and/or documentation customs. (We work with medical records text, which is highly variable doctor to doctor and facility to facility.) Rapid (re)training is the solution here and a competitive advantage for us versus competitors that use one-size-fits-all static ML models or GOFAI rules-based systems.*
So - yes - large, static deep-learning models degrade over time and space, but the solution is not more data or regular retraining. It's picking a more appropriate technology!
* Rule-based systems in this application have 500K+ rules! It's a nontrivial task to tune them over time, let alone space.