Issue #22: GitHub Copilot. CI/CD for ML at Scale. Concept Drift in Healthcare. Behavioral Testing.
Welcome to the 22nd issue of the MLOps newsletter.
In this issue, we cover OpenAI’s paper introducing Codex (their fine-tuned GPT language model for code generation that powers GitHub Copilot), share Uber’s CI/CD system for deploying production ML models, discuss concept drift challenges with production ML models, and share Andrej Karpathy’s recent tweets about the challenges of designing data labeling workflows.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
GitHub recently announced Copilot, an AI-powered pair programming tool. Check out the official webpage or this intro video if you want to get a gist (ha!). Copilot is powered by Codex, a GPT language model fine-tuned on code from GitHub and the subject of this paper from OpenAI.
This paper makes two important contributions:
It introduces Codex, a GPT language model fine-tuned on publicly available code from GitHub to generate code. It evaluates Codex’s Python code-writing capabilities, given a docstring as the “prompt”.
It introduces HumanEval, a new evaluation dataset (with associated code) released by OpenAI to measure the functional correctness of programs generated from docstrings.
Codex: Under the hood
Training data: As shared in the paper, Codex is fine-tuned on code publicly available on GitHub. The training dataset for the model was collected in May 2020 from 54 million public software repositories hosted on GitHub. After applying heuristic filters to remove likely auto-generated code files, the total size of the training data was 159 GB.
Task: The paper focuses on the task of generating standalone Python functions from docstrings and evaluating the correctness of code samples automatically through unit tests.
Evaluation Methodology: To benchmark the model, the authors create a dataset of 164 original programming problems with unit tests. As outlined in the paper, the complexity of these problems is comparable to easy interview questions. This dataset is released and publicly viewable on their GitHub repo. The metric used is pass@k (“pass rate @ K”): K code samples are generated per problem, and a problem is considered solved if any sample passes all the unit tests for that problem.
Performance: Codex fine-tuned on GitHub data solves 28.8% of the problems with just 1 sample per problem. Further, the authors find that repeated sampling is an effective strategy for producing working solutions to difficult prompts: with 100 samples per problem, the model solves 70.2% of the problems.
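The pass rate @ K metric described above can be estimated from raw sample counts. Below is a minimal sketch, assuming n samples were generated for a problem and c of them passed every unit test. The combinatorial formula is, to our understanding, the unbiased estimator described in the Codex paper; the function names are our own.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a correct one.
        return 1.0
    # 1 - (ways to pick k incorrect samples) / (ways to pick any k samples)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem with 10 samples of which 5 pass gives pass_at_k(10, 5, 1) equal to 0.5, which matches the intuition that a single random sample is correct half the time.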
Limitations & Risks
In the latter half of the paper, the authors examine and discuss Codex’s limitations and risks. We definitely recommend reading it in more detail in the paper, and summarize the major takeaways here:
Sample efficiency: Codex is not sample efficient to train. Its training dataset covers a large share of the publicly available Python code on GitHub, which is hundreds of millions of lines of code. This is far more code than a human software engineer will encounter over their career.
Error-prone: As noted in the paper, programs generated by Codex can contain syntax errors, and can invoke functions or reference class attributes that are undefined.
Alignment problem: Codex, like all language models, generates output that is drawn from the training distribution. This means that in some real-world scenarios the generated code might add negative value, for example when the input prompt itself contains subtle bugs or errors.
Bias & representation: Codex can be prompted in ways to generate denigratory outputs as code comments.
Codex is a fairly new tool in the software engineer’s toolkit. GitHub Copilot, for example, is not yet publicly available to use. Over the next decade, as the capabilities of tools like Copilot improve, we believe they could result in a multiplier on engineering productivity (especially for tasks that are below a certain threshold of complexity), assuming that the risks are sufficiently mitigated.
Uber Engineering Blog | Continuous Integration and Deployment for Machine Learning Online Serving and Models
It’s always fascinating to read about the ML systems at large consumer tech companies -- they are often solving problems that most companies will never face. Certain elements from this post by the Uber team feel like that -- let’s jump into it!
Uber relies on a large number of ML models to deliver a good customer experience. As they scaled their services, they needed to address the following MLOps challenges:
Developing a CI/CD system for their ML model deployments, so that a large volume of model deployments could happen on a daily basis.
Dealing with the latency of fetching and loading models, and with the memory footprint of old models, in their Prediction Service.
Rolling out models to a subset of traffic, or in shadow mode (where predictions are captured for experimentation and analysis but not used directly).
Setting up CI/CD for the Prediction Service itself, to prevent issues such as incompatibilities between models and the Prediction Service.
Historically, the Uber team included model artifacts in the Docker image for their Prediction Service. As model deployments grew, they moved to dynamic loading of models: the Prediction Service periodically compares its local model versions against the latest versions in the Model Artifact and Config Store, and fetches newer models as needed.
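To make the dynamic loading idea concrete, here is a minimal sketch of one polling step. It is not Uber's implementation: the version dictionaries, the sync_models name, and the load callback are all hypothetical stand-ins for whatever the real store and service expose.

```python
from typing import Callable, Dict


def sync_models(
    local: Dict[str, int],
    remote: Dict[str, int],
    load: Callable[[str, int], None],
) -> Dict[str, int]:
    """Run one polling step and return the updated local versions.

    local maps model id to the version currently loaded in the service;
    remote maps model id to the latest version in the store (assumed shape).
    """
    updated = dict(local)
    for model_id, latest in remote.items():
        # Load models that are missing locally or older than the store's copy.
        if updated.get(model_id, -1) < latest:
            load(model_id, latest)  # hypothetical fetch-and-load hook
            updated[model_id] = latest
    return updated
```

In a real service this step would run on a timer, and loading would need to be safe against in-flight prediction requests.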
Before a model is pushed into the Model Artifact and Config Store, the artifacts are validated and the compiled model file is used to run predictions on some sample data to ensure that the model is working.
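A pre-push check of this kind could look roughly like the sketch below. We assume the model is exposed as a plain Python callable that returns a single number; the validate_model name and the specific checks (no exceptions, finite numeric output) are our own choices for illustration.

```python
import math
from typing import Callable, Sequence


def validate_model(
    predict: Callable[[Sequence[float]], float],
    samples: Sequence[Sequence[float]],
) -> bool:
    """Return True only if the model yields a finite number for every sample."""
    if not samples:
        return False  # nothing to validate against
    for features in samples:
        try:
            prediction = predict(features)
        except Exception:
            return False  # the model failed to run on sample data
        if not isinstance(prediction, (int, float)) or not math.isfinite(prediction):
            return False  # non-numeric, NaN, or infinite output
    return True
```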
The deployment process can be tracked by ML Engineers and a health check on each model is automatically deployed.
While a model retirement API is provided, engineers can forget to retire old models. This leads to unnecessary memory consumption. The Uber team built an expiration date into their model deployment process such that old models, if unused, would be automatically retired with alerts sent to the relevant teams.
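The expiration idea can be sketched as a periodic sweep over deployed models. The record fields, the models_to_retire name, and the seven-day idle window are assumptions for illustration; the post does not specify how "unused" is measured.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class DeployedModel(NamedTuple):
    model_id: str
    last_used: datetime   # time of the most recent prediction request
    expires_at: datetime  # expiration date set at deployment time


def models_to_retire(
    models: List[DeployedModel],
    now: datetime,
    idle_for: timedelta = timedelta(days=7),  # assumed idle window
) -> List[str]:
    """Return ids of models that are past expiration and have not been used recently."""
    return [
        m.model_id
        for m in models
        if now >= m.expires_at and now - m.last_used >= idle_for
    ]
```

The returned ids would then be passed to the retirement API, with an alert sent to the owning team.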
Model Rollout and Shadowing
The Uber team found that different ML engineers chose to roll out models with different strategies, ranging from a gradual rollout to production traffic to shadowing the current model for a period of time. They ended up building these mechanisms into the Prediction Service directly, so that users could easily set up model rollout and shadowing strategies. This also allowed simpler sharing of features between primary and shadow models from their Feature Store, enabling auto-shadowing of models and better infrastructure management during times of high traffic.
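Both strategies can be illustrated with a small routing function. This is a sketch under our own assumptions, not the Prediction Service's API: models are plain callables, traffic is split by hashing a request id, and shadow predictions are written to an in-memory dictionary.

```python
import hashlib
from typing import Callable, Dict, Optional

Model = Callable[[dict], float]


def bucket(request_id: str) -> float:
    """Map a request id to a stable value in [0, 1)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def serve(
    request_id: str,
    features: dict,
    primary: Model,
    candidate: Model,
    rollout_fraction: float,
    shadow: Optional[Model] = None,
    shadow_log: Optional[Dict[str, float]] = None,
) -> float:
    """Return the prediction that is actually served for this request."""
    if shadow is not None and shadow_log is not None:
        # Shadow mode: the prediction is recorded for analysis, never returned.
        shadow_log[request_id] = shadow(features)
    # Gradual rollout: a fixed fraction of request ids goes to the candidate.
    chosen = candidate if bucket(request_id) < rollout_fraction else primary
    return chosen(features)
```

Hashing the request id keeps routing deterministic, so the same request always hits the same model while the rollout fraction is unchanged. In production the shadow call would typically run off the request path so it cannot add latency.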
Continuous Integration and Deployment of Prediction Service
It’s not enough for the models themselves to have CI/CD: the service that runs predictions across all models requires full CI/CD as well. This is important to make sure that code changes or dependency changes don’t lead to compatibility issues when a model is being loaded or run. The Uber team had a three-step validation process: staging and canary integration tests against a non-production environment, followed by a gradual rollout to production workloads.
There are interesting lessons here for companies that might be scaling to a large number of models across a large amount of traffic. We’d love to hear from you about what resonated with you!
This is a great post from David Talby, the CTO of John Snow Labs, a company that builds ML and NLP solutions for the healthcare world. Even though the post was written in 2019, the lessons remain relevant, especially since they’re backed by his own experiences.
What’s the problem?
“One magical aspect of software is that it just keeps working. If you code a calculator app, it will still correctly add and multiply numbers a month, a year, or 10 years later. The fact that the marginal cost of software approaches zero has been a bedrock of the software industry’s business model since the 1980s.
This is no longer the case when you are deploying machine learning (ML) models. The moment you put a model in production, it starts degrading.”
The author was trying to predict 30-day readmission rates, which is a well-studied problem thanks to the US Medicare’s Hospital Readmissions Reduction Program. However, in their specific project, they found that:
“A predictive readmission model that was trained, optimized and deployed at a hospital would start sharply degrading within two to three months. Models would change in different ways at different hospitals — or even buildings within the same hospital.”
There were many reasons for this drop in performance. Changing fields in electronic health records would improve some workflow but leave old fields blank. Important codes would change when a different lab was used. Accepting a new type of insurance would change the type of patients going to the ER.
Online measurement of accuracy: We need a better understanding of the performance of models in production, and logging real-world results is “an elementary requirement.”
Mind the gap: We need to watch out for “gaps between the distributions of your training and online data sets”.
Online data quality alerts: If the input data distribution changes in a meaningful way, an alert should go to the operations team - it might also be time to retrain your model.
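One way to turn the last two points into an automated check is to compare the distribution of a feature in training data with recent production data. The sketch below uses the Population Stability Index (PSI), a technique we picked for illustration; the post does not prescribe a specific statistic. The bin edges and the 0.2 alert threshold are assumptions (0.2 is a common rule of thumb, not a value from the post).

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin [edges[i], edges[i+1]); outliers go to the end bins."""
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        while idx < len(counts) - 1 and v >= edges[idx + 1]:
            idx += 1
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(train: Sequence[float], live: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between training and live samples of one feature."""
    # Floor the proportions at eps so empty bins do not produce log(0).
    p = [max(x, eps) for x in histogram(train, edges)]
    q = [max(x, eps) for x in histogram(live, edges)]
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def should_alert(psi_value: float, threshold: float = 0.2) -> bool:
    """Flag the feature for the operations team when drift exceeds the threshold."""
    return psi_value > threshold
```

A check like this would run per feature on a schedule, and an alert is a prompt to investigate the data source and consider retraining rather than proof that the model has degraded.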
Here is another great post on the lessons from deploying ML models in healthcare from the author.
Evaluating models using an aggregate metric on a test set (like accuracy, F1-score), while helpful to benchmark research approaches, is often insufficient for production machine learning applications. This video by Jay Alammar discusses Beyond Accuracy: Behavioral Testing of NLP Models with CheckList, a paper we covered previously in our newsletter as well. The paper (and the video) introduces us to the practice of behavioral testing for ML models (analogous to unit tests for traditional software).
In the real world, relying on a single aggregate metric on a standardized test set runs into a few problems:
Overestimation: Performance measured the traditional way, on a held-out subset of the complete labeled dataset, usually overestimates performance on examples “in the wild”.
Resolution: A single metric produces at best a low-resolution picture of model performance. Given two models, say A and B with test set accuracies of 65% and 68%, the accuracies alone are often insufficient to tell which one should be launched in production. Some follow-up questions that ML practitioners often care about:
How was the test dataset constructed? Does this look like my production data in any way?
What are the examples that model A gets right, but B gets wrong? And vice-versa?
There are these 20 examples that any model going to production HAS to get right. How do models A and B perform on these?
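The last two questions are straightforward to answer in code once both models can be called on the same examples. Here is a minimal sketch, assuming each model is a callable that returns a label; the function names are ours.

```python
from typing import Any, Callable, List, Sequence, Tuple

Model = Callable[[Any], Any]


def compare_models(
    examples: Sequence[Any],
    labels: Sequence[Any],
    model_a: Model,
    model_b: Model,
) -> Tuple[List[Any], List[Any]]:
    """Return (examples only A gets right, examples only B gets right)."""
    a_only, b_only = [], []
    for x, y in zip(examples, labels):
        a_ok = model_a(x) == y
        b_ok = model_b(x) == y
        if a_ok and not b_ok:
            a_only.append(x)
        elif b_ok and not a_ok:
            b_only.append(x)
    return a_only, b_only


def passes_must_have(model: Model, critical_cases: Sequence[Tuple[Any, Any]]) -> bool:
    """True if the model is correct on every (input, expected label) critical case."""
    return all(model(x) == y for x, y in critical_cases)
```

A model with the higher aggregate accuracy can still fail passes_must_have, which is exactly the situation a single metric hides.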
Behavioral tests can give us a higher-resolution evaluation of a model's capabilities. While the video focuses specifically on behavioral testing for NLP models, we believe the idea generalizes well to domains beyond NLP.
Minimum Functionality Tests: Equivalent to unit tests on tiny test datasets. Often test the model’s performance on obvious test cases or cases where the cost of incorrect prediction is extremely high.
Invariance Tests: Test the model’s performance on “label preserving” perturbations to the input. A variation of this is often used for adversarial robustness testing.
Directional Expectation Tests: Test whether the model’s predictions move in a manner that is directionally consistent with one’s expectations.
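The three test types above can be written as ordinary functions that take a model and return pass or fail. The sketch below does not use the CheckList library; it uses a toy keyword-based sentiment scorer and hand-written test cases, all of which are our own illustrative assumptions.

```python
import re
from typing import Callable

POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def sentiment_score(text: str) -> int:
    """Toy model: count of positive words minus count of negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def minimum_functionality_test(model: Callable[[str], float]) -> bool:
    """Obvious cases the model must get right."""
    cases = [("I love this flight", 1), ("This food is terrible", -1)]
    return all(sign(model(text)) == expected for text, expected in cases)


def invariance_test(model: Callable[[str], float]) -> bool:
    """Swapping a city name is label preserving, so the score should not change."""
    pairs = [("The flight to Chicago was great", "The flight to Dallas was great")]
    return all(model(a) == model(b) for a, b in pairs)


def directional_expectation_test(model: Callable[[str], float]) -> bool:
    """Appending a negative clause should not make the score more positive."""
    base = "The service was good"
    worse = base + ", but the food was terrible"
    return model(worse) <= model(base)
```

These functions slot into a regular test suite, so a model change that breaks an expected behavior fails CI in the same way a code regression would.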
This looks like a super interesting course from the Hugging Face team. It focuses on Transformer models, which are slowly becoming the standard in NLP, and delves into both the theory and usage with their transformers library. The first part of the course is out now and will help you get started with either TensorFlow or PyTorch, and more advanced lessons will become available later this year.
For folks who are thinking about deploying these models on AWS, here is a handy tutorial for running a serverless inference service using AWS Lambda.
Andrej Karpathy, Director of AI at Tesla, has shared some of the details and (fun) challenges with building Tesla’s self-driving suite of capabilities on multiple occasions. We covered his talk about the Tesla data engine in a previous edition of the newsletter. His recent tweets about designing good labeling/annotation workflows to collect labeled data for training models are worth reading.
The challenges he highlighted - writing good labeling docs (aka the “algorithm” human labelers should follow), training labelers, designing tools to do this task with good quality and throughput, deciding which data should be labeled - are all relatable and something we have experienced first hand at some of the companies we’ve worked at. A couple of recent examples of products trying to tackle this problem that we’ve come across include LabelBox and Scale Nucleus.
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup or email us at email@example.com. If you like what we are doing, please tell your friends and colleagues to spread the word.