Issue #7: Trustworthy AI in the Govt, re:Invent, Underspecification, Tools for Software 2.0
Welcome to the 7th issue of the ML Ops newsletter. If this is the first time you’re receiving this, we are thrilled to have you here.
In this issue, we’ll kick off with a surprising one - an Executive Order from the White House. Next, we’ll dive into news from AWS re:Invent, a Google paper about a less-understood failure mode with ML models and principles of ML monitoring. Finally, we’ll wrap with an informative Twitter thread about Software 1.0 tools missing from the ML world.
Thank you for subscribing, and if you find this newsletter interesting, forward this to your friends and support this project ❤️
The US federal government recently issued an Executive Order defining the principles of, and promoting the use of, ‘trustworthy AI’ in the federal government. This is a big milestone in the shift towards transparency and reliability of AI technologies.
Principles of AI use
The order outlines Principles of AI use in the federal government (outside of defense & military applications). While some principles are what we would always expect of any technology (e.g. that the use of such technology is lawful and consistent with the Constitution of the United States), others are very forward-looking and speak to the open problems that many AI researchers and practitioners grapple with on a daily basis. Some highlights that stood out for us:
AI safety: Making sure that AI systems are resilient to system failures and adversarial manipulation
Understandable: Outcomes (predictions) of AI systems should be “sufficiently” understandable by experts and/or end-users.
Regularly Monitored: AI systems should have proper monitoring and measurement mechanisms in place. The order also requires “mechanisms to supersede, disengage, or deactivate” a system if human experts notice performance that is inconsistent with its intended use.
Transparency: Agencies in the federal government should disclose information about their use of AI to Congress and the public.
The order directs the Office of Management and Budget to set forth concrete policy guidelines for federal agencies to follow, based on these principles. Finally, the order directs various agencies to audit their use of AI according to these guidelines:
Within 120 days of completing their respective inventories, agencies shall develop plans either to achieve consistency with this order for each AI application or to retire AI applications found to be developed or used in a manner that is not consistent with this order.
As AI and Machine Learning applications increase in scope beyond digital-exclusive sandboxes (search ranking on Google, video recommendations on YouTube, etc.) and start impacting lives in the real world (e.g. facial recognition, self-driving cars), we believe the calls for transparency and human-understandability are only going to increase. And for good reason - in order to judge if a piece of technology is in accordance with the law and with our values, we must be able to understand how it operates. It is early days, both for the technology and the policies that might shape it. Continue to watch this space for more updates.
AWS re:Invent 2020, which is AWS’s largest conference, has been ongoing for the last couple of weeks. As always, it’s been packed with announcements on new products and services and brings the total number of AWS services to over 1 million (don’t quote us on this number 😉). For startups, this is usually a time when they find out if AWS is now suddenly a competitor -- whether this is happening to you or not, we urge you to ask yourself: “Are you a seal?”. For the MLOps curious - let’s look into the new stuff that AWS has to offer.
New MLOps Services:
Amazon SageMaker Data Wrangler: This is a service to run data cleaning and feature preparation/feature engineering jobs and even publish those features into their Feature Store service. This is likely competing with another AWS service called AWS Glue DataBrew -- although Data Wrangler seems more targeted to ML practitioners, versus DataBrew which might be more for Data Analysts.
Amazon SageMaker Clarify: This service can be used to detect bias in trained models and explain predictions.
Amazon SageMaker JumpStart: This is less of a service, and more of a set of solutions for common ML use cases (from autonomous driving to credit risk scoring). While we don’t believe that teams are likely to just copy the exact solutions provided by AWS, the interesting aspect is the associated CloudFormation templates that provide a set of resources that can be set up easily.
Amazon SageMaker Pipelines: This is a first attempt at providing a CI/CD solution for ML workflows. This might be a difficult service for users to adopt today given how deeply it will need to integrate with a team’s existing workflows, but we expect lots of movement in this space.
New Industry-specific ML Services:
Amazon Lookout for Metrics: This service provides anomaly detection on business metrics. This is fascinating for two reasons: first, it is directly taking on startups such as Sisu Data, Outlier.ai and Anomalo, and second, it is performing anomaly detection on data from databases as well as SaaS services (by leveraging an existing service called Amazon AppFlow).
Amazon Lookout for Vision: This provides anomaly detection in images and visual representations of products - think of the many manufacturing + Computer Vision startups this is going after!
Amazon HealthLake: This is a service specific to healthcare companies to store, transform, query, and analyze health data while staying HIPAA compliant. Fascinating stuff.
Other features/services worth noting:
AWS Lambda: Lambda now supports up to 10 GB of memory and 6 vCPUs - this will make it easier to serve larger models for folks keen on serverless technologies.
The speed at which AWS continues to release new services is always astounding (read this liveblog for a full list of ML updates from AWS). We are seeing updates to their core offerings, ML Platform services (SageMaker) and new services that are targeted at specific verticals. However, nimble startups always have a chance against behemoths -- if you’re building new MLOps tools, we will be cheering you on!
What is Underspecification and why does it matter?
From the paper:
An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains.
We are training very large models (with billions of parameters), often with fewer data points than parameters. This means that two instances of the same model, trained on the same data, can reach the same performance on a validation set while learning completely different parameters (this may remind you of underdetermined systems from linear algebra). With different parameters being learnt, there is no guarantee that the model will learn the “right structure” (which varies from problem to problem). If the model hasn’t learnt the right structure, there are no guarantees that it will perform correctly when deployed. This is a distinct problem from train-test skew.
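To make the linear algebra analogy concrete, here is a minimal sketch of our own (it is not taken from the paper). It assumes a toy linear model with more parameters than training points, random synthetic data, and NumPy. Two weight vectors fit the training data equally well, yet disagree on new inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (an assumption for illustration): 20 parameters, only 10 training points.
n_train, n_features = 10, 20
X_train = rng.normal(size=(n_train, n_features))
y_train = rng.normal(size=n_train)

# Predictor A: the minimum-norm weights that fit the training data.
w_a = np.linalg.pinv(X_train) @ y_train

# Predictor B: shift A along a direction in the null space of X_train.
# The last rows of vt (from the full SVD) are orthogonal to every training row,
# so this shift leaves all training predictions unchanged.
_, _, vt = np.linalg.svd(X_train)
null_direction = vt[-1]
w_b = w_a + 5.0 * null_direction

# Largest disagreement between A and B on the training inputs (numerically zero).
train_gap = np.max(np.abs(X_train @ w_a - X_train @ w_b))

# Largest disagreement on fresh inputs that neither predictor has seen.
X_new = rng.normal(size=(1000, n_features))
new_gap = np.max(np.abs(X_new @ w_a - X_new @ w_b))

print(f"max disagreement on training inputs: {train_gap:.2e}")
print(f"max disagreement on new inputs:      {new_gap:.2f}")
```

Nothing in the training data distinguishes the two predictors, which is the sense in which the problem is underspecified. Deep networks are not linear, so treat this as intuition rather than a description of the paper's experiments.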
The authors ran “stress tests” on a range of real-world models: they took models with the same architecture, the same training set and similar validation scores, evaluated them on new datasets (ones which had been “shifted” in some way) and measured performance. We’ll let you glance through the paper to see some of the examples, but one of the numbers they report is an order of magnitude greater variance in performance on these new datasets, even when the variance in performance on validation sets was very small. An example of the variance in a Computer Vision (CV) model is seen in the image earlier. These tests were replicated on models in many different domains.
...our results suggest a need to explicitly test models for required behaviors in all cases where these requirements are not directly guaranteed by iid evaluations.
We were reminded of unit testing for ML models which we covered in a previous issue. But this is definitely interesting research - lots for ML practitioners to learn and think about as we deploy ML models in production.
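As a rough illustration of what “explicitly testing for required behaviors” could look like, here is a small sketch. The model is a hypothetical keyword rule standing in for a trained classifier, and the helper names (`invariance_failures`, `expectation_failures`) are ours, not from the paper or any library.

```python
from typing import Callable, List, Tuple


def toy_sentiment_model(text: str) -> str:
    """Hypothetical stand-in for a trained model: a simple keyword rule."""
    negative_words = {"bad", "terrible", "awful"}
    tokens = text.lower().replace(".", "").split()
    return "negative" if any(t in negative_words for t in tokens) else "positive"


def invariance_failures(
    model: Callable[[str], str], pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Return input pairs that should get the same prediction but do not."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


def expectation_failures(
    model: Callable[[str], str], cases: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (text, expected, actual) for corner cases the model gets wrong."""
    return [(t, exp, model(t)) for t, exp in cases if model(t) != exp]


# Required behavior 1: swapping a person's name must not change the prediction.
name_swaps = [
    ("Alice thought the film was terrible.", "Bob thought the film was terrible."),
    ("Maria loved the food.", "Chen loved the food."),
]

# Required behavior 2: specific corner cases must be handled (e.g. negation).
corner_cases = [
    ("The service was not bad.", "positive"),
    ("Awful.", "negative"),
]

inv_fail = invariance_failures(toy_sentiment_model, name_swaps)
exp_fail = expectation_failures(toy_sentiment_model, corner_cases)
print("invariance failures:", inv_fail)
print("corner-case failures:", exp_fail)
```

The toy model passes the name-swap check but fails the negation case, which is exactly the kind of gap an aggregate validation score can hide.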
We recently came across this primer on how to track and analyze AI models in production that we wanted to share with our readers. This article shares some neat insights around the central question of ML monitoring: what should you monitor? And what data should you collect at inference time?
Some highlights from the article that stood out for us:
The importance of ground truth labels: Comparing the current data (features and labels) with past distributions helps to an extent, but ultimately, in order to attain any objective measure of production performance, we need ground truth labels for (at least a sample of) inference data. Broadly, this can be accomplished in two ways: (i) human review, or (ii) user interaction data such as clicks, search queries, etc.
Behavioral metrics from model output & feature inputs: Metrics based on the distribution of, and variance in, model outputs are top priority (P1), as anomalies here often directly impact key business metrics downstream. At the same time, it is important to track and measure input data distributions, both (i) to explain any issues in model outputs and (ii) to detect data drift.
Collecting metadata for monitoring sub-segments of your input streams: While defining and monitoring various metrics on the model output and input features for the entire data stream is a good start, often it is not sufficient - models often fail at the margins, on outliers or segments of the data that are quite different from what the model was trained on. In order to detect these, it is important to collect contextual information surrounding a log event, in addition to the raw features (X) and model outputs (y), including even things like model versions.
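The highlights above could translate into something like the following sketch. The `InferenceLog` record, its field names and the example segments are all assumptions of ours for illustration; the article does not prescribe a schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class InferenceLog:
    """One logged prediction plus the context needed to slice it later (hypothetical schema)."""
    features: Dict[str, float]   # raw model inputs (X)
    prediction: int              # model output (y)
    model_version: str           # metadata: which model served the request
    segment: str                 # metadata: e.g. platform or region
    label: Optional[int] = None  # ground truth, filled in later when available


def accuracy_by_segment(logs: List[InferenceLog]) -> Dict[str, float]:
    """Accuracy per segment, computed only over events that have a ground truth label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for log in logs:
        if log.label is None:
            continue  # unlabeled events cannot be scored
        total[log.segment] += 1
        correct[log.segment] += int(log.prediction == log.label)
    return {segment: correct[segment] / total[segment] for segment in total}


logs = [
    InferenceLog({"amount": 12.0}, 1, "v3", "web", label=1),
    InferenceLog({"amount": 15.0}, 0, "v3", "web", label=0),
    InferenceLog({"amount": 11.0}, 1, "v3", "web", label=1),
    InferenceLog({"amount": 14.0}, 0, "v3", "web"),  # no label yet
    InferenceLog({"amount": 950.0}, 1, "v3", "mobile", label=0),
    InferenceLog({"amount": 990.0}, 1, "v3", "mobile", label=1),
]

labeled = [log for log in logs if log.label is not None]
overall = sum(log.prediction == log.label for log in labeled) / len(labeled)
per_segment = accuracy_by_segment(logs)

print(f"overall accuracy: {overall:.2f}")
for segment, acc in per_segment.items():
    print(f"  {segment}: {acc:.2f}")
```

In this made-up data the overall number looks acceptable while one segment is doing much worse, which is why the segment metadata is worth logging alongside features and outputs.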
This article is a good conceptual summary of how to formulate the monitoring problem in production. We shared a variety of resources (lectures, Slack communities, GitHub repos) in our last issue for everything MLOps. If you are interested in learning about the topic or even just keeping up to date with what’s new, do check it out!
Taivo Pungas lists out 10 ideas for tools that exist in the Software 1.0 world that aren’t readily available for ML. Our favorites:
CI/CD: YAML-based configuration of how to compile the dataset into a trained model, run automated tests, and how and where to deploy it after that.
Automatic "unit tests" for a trained model: make sure it performs on specific corner cases.
Code auto-complete: when looking at an unlabelled image, show a preview of the suggested label and let the user press Tab to apply it.
Linter: automatically detect and warn when detected…
* obviously mistaken labels (high training loss)
* extreme class imbalance
Check the full thread for more!
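To give a flavor of the “linter” idea, here is a minimal sketch of the two checks mentioned in the list. The function names and thresholds are our own assumptions, and the per-example probabilities are assumed to come from whatever model you have already trained.

```python
import math
from collections import Counter
from typing import Dict, List


def lint_class_imbalance(labels: List[str], max_ratio: float = 10.0) -> List[str]:
    """Warn when the most common class outnumbers the rarest by more than max_ratio."""
    counts = Counter(labels)
    most, least = max(counts.values()), min(counts.values())
    if most / least > max_ratio:
        return [f"extreme class imbalance: ratio {most / least:.1f} exceeds {max_ratio}"]
    return []


def lint_suspicious_labels(
    labels: List[str],
    predicted_probs: List[Dict[str, float]],
    max_loss: float = 2.0,
) -> List[int]:
    """Return indices whose cross-entropy loss is high, i.e. the label may be wrong."""
    flagged = []
    for i, (label, probs) in enumerate(zip(labels, predicted_probs)):
        p = max(probs.get(label, 0.0), 1e-12)  # guard against log(0)
        if -math.log(p) > max_loss:
            flagged.append(i)
    return flagged


# Check 1: 30 cats vs 2 dogs is a 15:1 ratio, above the assumed threshold of 10.
imbalance_warnings = lint_class_imbalance(["cat"] * 30 + ["dog"] * 2)

# Check 2: the third example is labeled "cat" but the model is confident it is a dog.
sample_labels = ["cat", "dog", "cat"]
sample_probs = [
    {"cat": 0.90, "dog": 0.10},
    {"cat": 0.20, "dog": 0.80},
    {"cat": 0.01, "dog": 0.99},
]
suspicious = lint_suspicious_labels(sample_labels, sample_probs)

print(imbalance_warnings)
print("suspicious label indices:", suspicious)
```

A high loss does not prove a label is wrong (the model may simply be mistaken), so flagged examples are best treated as candidates for human review.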
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. This is only Day 1 for MLOps and this newsletter and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup (open to DMs as well) or email us at firstname.lastname@example.org