Issue #9: MLOps Tooling Landscape. AI Provisions for National Defense. Few Shot Learning. Upcoming ML Events.
Welcome to the 9th issue of the ML Ops newsletter. In the US, this past week has been singularly chaotic -- we hope we can bring interesting MLOps news to you and are thankful that you have chosen to open this email and read it.
Now, for the topics in this issue. This week we dive into the MLOps tooling landscape, followed by a brief discussion of a bill passed by the US Congress. Next, we review a list of ML failure cases and look at some recent research that is attempting to make large language models fine-tune faster. Finally, we are stoked to share a couple of upcoming events (and one where we will be speaking).
Thank you for subscribing, and if you find this newsletter interesting, forward this to your friends and support this project ❤️
Blog Post | MLOps Tooling Landscape v2
Chip Huyen writes about the MLOps tooling landscape and provides an update to her post from June 2020 -- her list now contains almost 300 tools (full list available as a spreadsheet here)! Her main learnings:
Increasing focus on deployment
The Bay Area is still the epicenter of machine learning, but not the only hub
MLOps infrastructures in the US and China are diverging
More interest in machine learning production from academia
Definitely scroll down to the bottom of the article to check out a fancy visualization of all the tools broken down by category.
Our take
We expect this landscape to continue to grow in the near future. Similarly, we expect the categories themselves to go through a significant overhaul in the next couple of years. Currently, the list includes tools that assist in different stages of the ML workflow (labelling, training, deployment, monitoring, etc.), but it also spans ML frameworks (PyTorch, TensorFlow), data warehouses (Redshift), databases and query engines (Rockset, Presto), and tools that help build AI applications (Rasa).
It’s definitely good to see academia pay more attention to the problems faced by ML practitioners in industry. The two of us have worked on our fair share of class projects with clean data and well-defined metrics, but the real world is not quite the same (massive understatement alert!).
Finally, check out this funny yet relatable tweet from Chip:
Policy | National Defense Authorization Act 2021
Last month, we covered an executive order from the White House that spoke about “Trustworthy AI”. This week, we cover the National Defense Authorization Act, which was approved by the US Congress on Jan 1, 2021 and has been helpfully summarized by Stanford HAI (thankfully so, since the bill runs to more than 4,500 pages).
National Institute of Standards and Technology (NIST)
From the article, NIST is to:
expand its mission to include advancing collaborative frameworks, standards, guidelines for AI, supporting the development of a risk-mitigation framework for AI systems, and supporting the development of technical standards and guidelines to promote trustworthy AI systems.
Further, they are tasked with:
developing best practices and voluntary standards for privacy and security in training datasets, computer chips/hardware, and data management techniques
and:
developing a Risk Management Framework that identifies and provides standards for assessing the trustworthiness of AI systems, establishes common definitions for common terms such as explainability, transparency, safety, and privacy, provides case studies of successful framework implementation.
Acquisition of AI technology
The Department of Defense (DoD) is to determine whether the AI technology it acquires has been developed ethically and responsibly, and how it can implement ethical AI standards in its supply chains.
Preparing for the future
The bill calls for the creation of a National AI Initiative to prepare the workforce for the integration of AI across all sectors of the economy, while the National Science Foundation is directed to study the current and future impact of AI on the workforce across sectors.
For more information, read Stanford HAI’s article here.
Medium | 51 things that can go wrong in real-world ML
We love to cover examples and learnings from the ML research → production journey: what the challenges are, which tools can make life a little bit easier, and what a playbook for doing this well looks like.
This week, we share an article on this theme by Sandeep Uttamchandani, VP of Engineering at Unravel Data, which highlights challenges that engineers and teams often run into when building products powered by machine learning. While it contains a list of over 50 such challenges, we highlight a few that especially resonated with us:
Importance of recency and unbiased data
ML research usually assumes a standardized (and static) benchmark test set. Most well-studied problems in ML research come with well-curated and clean training datasets (e.g. ImageNet, SNLI, etc.). While this is critical for speeding up research, it abstracts away some of the real-world messiness of collecting and curating such datasets in the first place. In the real world, training and inference data can come from very different distributions: data used to train models represents a snapshot in time, but the world keeps changing, which often means that historical data (used for model training) is no longer representative of the present. This “drift” can cause models to regress over time (something we have highlighted in a previous issue of the newsletter).
Leakage
In machine learning, leakage refers to the problem of (accidentally) using data or context during training time that would not be available during inference time. This can cause the model performance as measured offline (on the validation set) to overestimate the model’s performance when deployed online. The author has written a separate article that deep dives into some of the pitfalls of training data preparation that can cause leakage and how to avoid it.
In our personal working experience, we have seen leakage issues first-hand:
Accidentally using future data to predict the past. This happens often when your training data rows are events in time. It is important that the test dataset be sampled from a time period that is disjoint from, and later than, the training dataset.
Entity leakage. This happens often when your dataset can have multiple data points belonging to the same entity (e.g. a user’s website visits, a patient’s medical visits, etc.). It is important to split your training and test datasets by entity so that all data points of a given entity are either in the training set or the test set. A minimal sketch of both kinds of split follows below.
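To make this concrete, here is a minimal sketch of both kinds of split using pandas and scikit-learn. The dataset, column names (`event_time`, `user_id`), and split sizes are hypothetical, purely for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical event-level dataset: one row per event, each tied to a
# timestamp and to the entity (here, a user) it belongs to.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "event_time": pd.to_datetime([
        "2020-01-05", "2020-03-10", "2020-02-01", "2020-04-20",
        "2020-01-15", "2020-05-02", "2020-03-30", "2020-06-11",
    ]),
    "feature": [0.2, 0.5, 0.1, 0.9, 0.4, 0.7, 0.3, 0.8],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# 1) Time-based split: the test set comes from a period strictly after the
#    training period, so the model never "sees the future" during training.
cutoff = pd.Timestamp("2020-04-30")
train_time = df[df["event_time"] <= cutoff]
test_time = df[df["event_time"] > cutoff]

# 2) Entity-based split: all rows for a given user land entirely in train
#    or entirely in test, so no entity straddles the boundary.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["user_id"]))
train_entity, test_entity = df.iloc[train_idx], df.iloc[test_idx]

# Sanity check: no user appears on both sides of the split.
assert set(train_entity["user_id"]).isdisjoint(set(test_entity["user_id"]))
```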
Monitoring
This is a topic we have touched upon multiple times in this newsletter. Real-time monitoring of input data and model predictions is necessary to avoid silent failures and to have visibility into issues like concept drift that can cause models to regress over time.
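As a flavor of what such a check can look like in practice, here is a minimal sketch that compares a feature's training-time distribution to its live distribution with a two-sample Kolmogorov-Smirnov test (assuming scipy is available; the simulated data and alert threshold are purely illustrative):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference distribution: a numeric feature as it looked in the training data.
training_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)

# Live distribution: the same feature observed at inference time, simulated
# here with a shifted mean to mimic drift.
production_feature = rng.normal(loc=0.4, scale=1.0, size=5_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two
# samples are unlikely to come from the same distribution.
result = ks_2samp(training_feature, production_feature)

ALERT_THRESHOLD = 0.01  # illustrative; tune per feature and traffic volume
if result.pvalue < ALERT_THRESHOLD:
    print(f"Possible drift (KS statistic={result.statistic:.3f}, p={result.pvalue:.2e})")
```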
If you’re interested in the rest of the challenges and learnings shared in the article, you can read it here. It should serve as a good checklist to think through when building any new ML product or feature.
Paper | Making pre-trained language models better few-shot learners
Background
You have likely read about GPT-3, a pre-trained transformer language model released by OpenAI last year that has demonstrated excellent few-shot learning capabilities. Given only a natural language prompt and a few demonstrations of the downstream task, GPT-3 is able to make accurate predictions without any updates to its weights.
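To make "a prompt plus a few demonstrations" concrete, here is what constructing such a few-shot prompt might look like for a toy sentiment task (the task, format, and examples are our own illustration, not OpenAI's):

```python
# A GPT-3-style few-shot prompt: a task description, a couple of labelled
# demonstrations, and finally the input we want a prediction for. The model
# is queried as-is, with no gradient updates to its weights.
demonstrations = [
    ("The pasta was cold and the service was slow.", "negative"),
    ("Absolutely loved the ambiance and the dessert.", "positive"),
]
query = "The movie dragged on, but the ending was worth it."

prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for text, label in demonstrations:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)
# This string would then be sent to the model (e.g. via OpenAI's API), and
# the completion it returns is taken as the prediction.
```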
Problem
GPT-3, while a breakthrough in NLP, has a couple of key limitations in the real world (at least for now): first, access is only available via an API provided by OpenAI (not the raw model parameters), and second, its size (175B parameters) makes it computationally expensive to fine-tune or use in production. This paper introduces two novel techniques for few-shot learning on top of a pre-trained transformer model such as BERT (which has ~350M parameters by comparison):
Prompt-based finetuning with automatically generated prompts: Like GPT-3, this paper uses a prompt-based finetuning step. However, as the paper acknowledges, in practice:
Finding the right prompts, however, is an art—requiring both domain expertise and an understanding of the language model’s inner workings.
To address this issue, the authors introduce a language model decoding objective (a functional mapping that generates an output sequence given the current model parameters) to automatically generate prompts from the few-shot training data.
Optimally selected task demonstrations: For most downstream tasks, GPT-3’s approach to in-context learning involves concatenating the input with up to 32 examples randomly drawn from the training set. The paper argues that this approach is likely suboptimal and proposes a strategy to sample “informative demonstrations” from each class to pass as additional context: at each training step, one example from each class is sampled and concatenated with the current input example. Further, the authors use the cosine distance between sentence embeddings to measure text-to-text similarity and select as demonstrations only those examples that are similar to the input. For a more technical treatment of this approach, see Section 6 of the paper we linked above. A rough sketch of how the two techniques fit together follows below.
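Here is that rough, self-contained sketch (not the authors' implementation, which is linked above): we pick the most similar demonstration per class using sentence-embedding cosine similarity, then score label words at a masked position with an off-the-shelf RoBERTa model. The template ("It was ___."), label words, example sentences, and model choices are our own illustrative assumptions.

```python
import numpy as np
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical few-shot training pool for a sentiment task.
train_pool = [
    ("A beautiful, moving film.", "positive"),
    ("The plot made no sense at all.", "negative"),
    ("Great performances from the whole cast.", "positive"),
    ("I walked out halfway through.", "negative"),
]
query = "One of the best films I have seen this year."

# 1) Select one demonstration per class by cosine similarity to the query.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode([text for text, _ in train_pool] + [query])
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = embeddings[:-1] @ embeddings[-1]

demos = {}  # label -> (text, similarity) of the closest example in that class
for (text, label), sim in zip(train_pool, similarities):
    if label not in demos or sim > demos[label][1]:
        demos[label] = (text, sim)

# 2) Prompt-based prediction with a masked LM: demonstrations have their
#    label words filled in; the query gets a mask where its label word goes.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModelForMaskedLM.from_pretrained("roberta-large")
label_words = {"positive": " great", "negative": " terrible"}

parts = [f"{text} It was{label_words[label]}." for label, (text, _) in demos.items()]
parts.append(f"{query} It was{tokenizer.mask_token}.")
prompt = " ".join(parts)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Compare the logits of the two label words at the masked position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
scores = {
    label: logits[0, mask_pos, tokenizer.encode(word, add_special_tokens=False)[0]].item()
    for label, word in label_words.items()
}
print(max(scores, key=scores.get))  # predicted label for the query
```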
Results
Using a collection of single sentence and pairwise-sentence label prediction tasks, the authors show that these improvements outperform vanilla finetuning methods by up to 30% (and 11% on average), measured using model accuracy as the metric.
The paper explains these details and the evaluation setup quite well; we recommend giving it a read! Additionally, the implementation is open-sourced and available on GitHub.
Takeaways
As bigger and better language models become available, many NLP problems will essentially become problems of effective fine-tuning. ML practitioners will bring their high-quality labelled data and use techniques such as the ones developed in this paper to fine-tune a model for their specific problem with only a handful of examples.
Upcoming Events
We’re excited to share that we are going to be speaking with the team at Superwise.ai at an ML Happy Hour later this month. We will talk about MLOps and strategies to monitor your ML models in production. Sign up here to join us on Jan 28!
We are also looking forward to attending TWIMLcon, one of the leading MLOps conferences out there. You can use the discount code MLOPSROUNDUP to get 20% off. Hope to see you there as well!
Thanks
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. It is still only Day 1 for MLOps and for this newsletter, and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering, tweet us @mlopsroundup (we are open to DMs as well) or email us at mlmonitoringnews@gmail.com.