Issue #12: Data Cascades. Flask vs FastAPI. AWS Well-Architected Framework. Data Leakage.
Welcome to the 12th issue of the ML Ops newsletter. Some housekeeping things first: A few readers have had their newsletters end up in Spam or Promotions (latter only relevant for Gmail users). If you can’t find the newsletter, please be sure to check these folders, and add mlmonitoringnews@gmail.com to your contacts to prevent this from happening in the future.
In this issue, we discuss data cascades that can impact AI applications, compare Flask vs FastAPI for serving ML models, dive into the AWS Well-Architected framework for Machine Learning, and cover an exploration into data leakage.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
"Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI
We recently came across this paper published by Google on the importance of data quality for developing AI applications in high-stakes domains (think healthcare, credit, etc.). Through a series of interviews with 53 AI practitioners across different domains and geographies, the paper defines and documents the occurrence of “data cascades”, their characteristics, and their impact on the end-to-end AI lifecycle:
Data Cascades—compounding events causing negative, downstream effects from data issues, triggered by conventional AI/ML practices that undervalue data quality.
Prevalence
The paper found that data cascades are widely prevalent: 92% of the AI practitioners interviewed reported experiencing at least one, and nearly half reported two or more cascades at some point in the lifecycle of an AI project.
Properties
Data cascades are opaque in how they manifest, which makes them hard to detect ahead of time. The paper notes that there are “no clear metrics to detect and measure their effects on the system.”
Data cascades are triggered by underinvesting in data and not having the correct metrics in place to measure data quality. For example, in their interviews, the researchers found that AI practitioners, when improving model performance, generally preferred to tune hyperparameters or model architecture rather than improve data quality or quantity.
Data cascades negatively impact the AI development and deployment process, accumulating as a form of technical debt.
Cascades are typically triggered upstream of model building and manifest downstream, during or after deployment.
Cascades are often avoidable through early interventions in the model development process.
Causes & Suggested Improvements
From goodness-of-fit to goodness-of-data: AI practitioners largely use metrics that measure how well the model fits the data. However, these metrics don’t capture how good a “fit” the dataset is for the problem we are trying to model. Is the training dataset drawn from the distribution the model will see online? Is this distribution static, or does it change with time? Is the training data biased? Are the labels noisy? The paper advocates for metrics that evaluate these factors quantitatively, which can incentivize more people to work on them (see the short sketch at the end of this section).
Incentives: the paper notes that “AI practitioners in our study tended to view data as ‘operations’ ”. In the authors’ view, this reflects the reward systems of the larger AI/ML field: developing new model architectures or training-loss schemes is more celebrated work in AI than generating better datasets.
Lack of end-to-end visibility in the AI lifecycle: This point especially resonated with us. A core problem with data cascades is that their manifestation is quite far away from the point at which they occur. With this delay and no clear way to monitor them, “practitioners struggled with understanding the impact of data scrutiny, and utilised ‘launch and get feedback’ approaches frequently, often at great cost”.
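To make the “goodness-of-data” point a bit more concrete, here is a minimal sketch (ours, not the paper’s) of a check that compares training and production feature distributions with a two-sample KS test. The DataFrame inputs, feature names, and 0.05 threshold are illustrative assumptions.

```python
# Illustrative sketch (not from the paper): flag features whose production
# distribution has drifted away from the training distribution.
from scipy.stats import ks_2samp

def drifted_features(train_df, prod_df, features, alpha=0.05):
    """Return (feature, KS statistic) pairs where the two-sample KS test rejects
    the hypothesis that training and production data share a distribution."""
    flagged = []
    for col in features:
        stat, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if p_value < alpha:
            flagged.append((col, stat))
    return flagged

# Example usage (assumes two pandas DataFrames with a shared schema):
# issues = drifted_features(train_df, prod_df, ["age", "income", "tenure_days"])
```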
Medium | Why we switched from Flask to FastAPI for production machine learning
In this post, Caleb Kaiser of Cortex Labs discusses how they switched from Flask to the newer FastAPI framework for model serving in Python (apologies to our Node/Go aficionados). He notes:
Flask is currently the de facto choice for writing these APIs for a couple of reasons:
Flask is minimal. Because inference APIs have historically tended to be simple predict() methods, the complexities introduced by more opinionated frameworks (like Django) have been seen as unnecessary.
Flask is written in Python, which is the standard language of machine learning. All major frameworks have Python bindings, and virtually all data scientists/machine learning engineers are familiar with it.
First, what is FastAPI?
FastAPI is built on top of other libraries such as starlette, uvicorn, and pydantic; how these contribute to making it superior to Flask will become apparent shortly.
Benefit 1: Native async support
First, let’s talk WSGI vs ASGI. WSGI (Web Server Gateway Interface) is an interface that describes how web servers forward API requests to web applications written in Python. There are multiple WSGI frameworks for building web applications, such as Flask, Bottle, Django, etc. WSGI has some limitations (no websockets or HTTP/2 support), but the crucial one is that there is no async support.
ASGI (Asynchronous Server Gateway Interface) is the more recent interface that addresses the concerns with WSGI and, in particular, makes asynchronous callables available to any framework that implements it. uvicorn is a fast ASGI server, which is what FastAPI runs on (along with starlette, an ASGI toolkit).
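To make the sync-vs-async contrast concrete, here is a minimal FastAPI sketch (ours, not from the post); `fetch_features_from_db` is a hypothetical stand-in for a real async I/O call such as a feature-store lookup.

```python
# Minimal async endpoint sketch: while this handler awaits I/O, the ASGI
# event loop is free to serve other requests. A WSGI/Flask handler would
# block its worker for the entire duration instead.
import asyncio
from fastapi import FastAPI

app = FastAPI()

async def fetch_features_from_db(payload: dict) -> list:
    # Hypothetical stand-in for an async database / feature-store call.
    await asyncio.sleep(0.01)
    return [payload.get("age", 0.0), payload.get("income", 0.0)]

@app.post("/features")
async def features(payload: dict):
    feats = await fetch_features_from_db(payload)
    return {"features": feats}
```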
Now, generally speaking, this async support allows much faster web applications in I/O-bound contexts. However, in the post from Cortex, their use case was autoscaling. One way to autoscale a service is based on CPU utilization, which is what they did while on Flask. With FastAPI, they could easily use a different metric: the number of incoming requests. This is because they could keep a running counter within an async event loop, which wouldn’t have been possible with Flask. This new autoscaling method ended up working better for them.
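The post doesn’t include code, but a rough sketch of what such a counter could look like is below; the in-flight counter and the `/metrics` endpoint are our illustrative assumptions, not Cortex’s implementation.

```python
# Illustrative sketch: count in-flight requests with FastAPI middleware so an
# autoscaler can poll this number instead of relying on CPU utilization.
import asyncio
from fastapi import FastAPI, Request

app = FastAPI()
in_flight = 0
counter_lock = asyncio.Lock()

@app.middleware("http")
async def track_in_flight(request: Request, call_next):
    global in_flight
    async with counter_lock:
        in_flight += 1
    try:
        return await call_next(request)
    finally:
        async with counter_lock:
            in_flight -= 1

@app.get("/metrics")
async def metrics():
    # A scaler could scrape this and add replicas when the count grows.
    return {"in_flight_requests": in_flight}
```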
Recap: Native async support for FastAPI allowed Cortex to autoscale their ML services much more easily.
Benefit 2: Lower latency for Inference
As seen in the image at the top, FastAPI is...fast. 😃
While the post from Cortex doesn’t share specific numbers on the speedups they’re seeing in their deployments, the gains are likely non-negligible. As they say (while providing an example from Gmail’s Smart Compose feature):
For most deployments, the speed of the underlying framework is not the largest factor in determining inference latency. However, when you consider the cost of improving latency, it is clear that any improvement is valuable.
For example, Smart Compose needs to serve predictions at under 100ms. Even after designing a model specifically for faster predictions, the team couldn’t hit this threshold. They had to deploy on cloud TPUs—the smallest of which is $4.50/hour on-demand—in order to get latency under 100ms.
Other Benefits
FastAPI is easy to switch to from Flask. This made the decision easier for the Cortex team, and the same will hold true for other teams considering the switch.
One of our subscribers, Rohith Desikan, had this to add:
FastAPI is great because it uses Pydantic schema validation, which is highly customizable, generates automatic Swagger UI documentation and is intuitive and fast. Furthermore, the community is only getting more active by the day and the online documentation covers most items from small applications to production ready deployments.
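As an illustration of the Pydantic point, here is a minimal, self-contained sketch of a validated /predict endpoint; the request fields and the placeholder `score` function are our assumptions, not from the post or the tutorial.

```python
# Illustrative sketch: Pydantic models validate the request and response,
# and FastAPI generates Swagger UI docs for this endpoint at /docs for free.
from typing import List

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictionRequest(BaseModel):
    age: float     # missing or mistyped fields are rejected with a 422 error
    income: float

class PredictionResponse(BaseModel):
    score: float

def score(features: List[float]) -> float:
    # Placeholder for a real model call, e.g. model.predict_proba(features).
    return sum(features) % 1.0

@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    return PredictionResponse(score=score([request.age, request.income]))
```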
For a short (18 minute) video tutorial on how to serve an ML model using FastAPI, check the link below:
AWS Whitepaper | AWS Well-Architected Framework - Machine Learning Lens
AWS compiles best practices for building software in their cloud in their Well-Architected Framework (original whitepaper here). In this paper, they discuss how to architect ML systems well. While they do talk extensively about their services, there is a ton of good general advice about the pros and cons of different architectural decisions and their implications. Since the document is 78 pages, we will break our analysis down into two parts and cover it over this and the next issue.
Introduction
In their words:
Using the Framework allows you to learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement.
In the Machine Learning Lens, we focus on how to design, deploy, and architect your machine learning workloads in the AWS Cloud.
ML Stack
AWS defines three levels of abstraction in the ML Stack.
AI services: These are full-blown ML applications that are available for developers to use over an API call. For example, their sentiment analysis (AWS Comprehend) or text-to-speech (AWS Polly) APIs. Their claim is that these services are easy to use for folks with no ML experience.
ML services: These are the services to be used by data scientists and ML engineers to label data, train, deploy, and monitor ML models without having to worry about the underlying infrastructure (think of the full SageMaker ecosystem).
ML Frameworks and Infrastructure: These are ML frameworks such as TensorFlow, PyTorch (provided as ready-to-use images) and the specialized hardware that AWS makes available to expert ML practitioners for building their own tools and products.
Phases of ML Workloads
This is their description of the ML workflow (nothing unique in their formulation).
Feel free to read the paper for details on each of the stages.
General Design Principles
Enable agility through the availability of high-quality datasets: Set up data pipelines that can deliver data in live or batch mode, with quality checks built in
Start simple and evolve through experiments: Simple model first, and add features through slow experimentation
Decouple model training and evaluation from model hosting: Choose the best resources for the task at hand (maybe different instance types for training vs serving)
Detect data drift: Continually measure the accuracy of inference after the model is in production
Automate training and evaluation pipeline: Decrease manual effort and human error through this automation
Prefer higher abstractions to accelerate outcomes: When possible, use an “AI service” to deliver business outcomes faster (for example, use Sentiment Analysis API instead of training your own model)
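As a small illustration of that last principle, here is a hedged sketch of calling Amazon Comprehend for sentiment analysis instead of training a model; it assumes boto3 is installed and AWS credentials are configured, and error handling is omitted.

```python
# Illustrative sketch of "prefer higher abstractions": use a managed AI
# service (Amazon Comprehend) rather than training a sentiment model.
import boto3

comprehend = boto3.client("comprehend", region_name="us-east-1")

def get_sentiment(text: str) -> str:
    response = comprehend.detect_sentiment(Text=text, LanguageCode="en")
    return response["Sentiment"]  # "POSITIVE", "NEGATIVE", "NEUTRAL", or "MIXED"

# print(get_sentiment("The new checkout flow works beautifully."))
```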
Next Week
Next week, we will go through the five pillars of the Well-Architected Framework (Operational Excellence, Security, Reliability, Performance Efficiency and Cost Optimization). Let us know if you have used this framework before!
Google Blog | Why Some Models Leak Data
This insightful article by researchers at Google highlights an important problem at the intersection of AI and privacy: data leaks.
How do models leak data?
When training an ML model, we attempt to learn something from our data in order to make predictions. However, as all ML researchers and practitioners know, models can sometimes overfit and simply memorize their training data, and this memorization can be exploited to extract information about the training set, i.e. the model can unintentionally leak data. We want to highlight that the problem presented here is not just theoretical. As noted in the article, models in the real world have inadvertently done this:
Medical models have inadvertently revealed patients’ genetic markers. Language models have memorized credit card numbers. Faces can even be reconstructed from image models:
We covered a related topic: GPT-2 memorizing people’s phone numbers in an earlier issue of the newsletter.
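To make the memorization point concrete, here is a toy sketch (ours, not the article’s method) of a loss-threshold membership-inference test: an overfit model tends to assign unusually low loss to records it has memorized, which an attacker can use to guess whether a record was in the training set. The sklearn-style `predict_proba` interface and the threshold value are illustrative assumptions.

```python
# Toy membership-inference sketch: memorized training examples tend to get
# much lower loss than unseen examples, which leaks membership information.
import numpy as np

def example_loss(model, x, y_true):
    """Cross-entropy loss of a single example under a sklearn-style classifier."""
    probs = model.predict_proba([x])[0]
    return -np.log(probs[y_true] + 1e-12)

def likely_training_member(model, x, y_true, threshold=0.1):
    # Suspiciously low loss suggests the record was memorized during training.
    return example_loss(model, x, y_true) < threshold
```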
How might we address this?
In the last few years, differential privacy has become an important area of research and product development (e.g. Apple’s push into this area from iOS 10 onwards). Differential privacy is a framework for measuring the privacy guarantees provided by an ML model, achieved by limiting how much a model can learn from any one individual data point in the training set. If you're interested in digging further, check out this excellent primer by Ian Goodfellow (of GAN fame) or this great series of articles by researchers at Facebook AI.
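For intuition, here is a toy example (not from the article) of the Laplace mechanism, one of the basic building blocks of differential privacy: calibrated noise bounds how much any single individual’s record can change the output of a query.

```python
# Toy epsilon-DP counting query via the Laplace mechanism. A count has
# sensitivity 1 (one person changes it by at most 1), so Laplace noise with
# scale 1/epsilon yields epsilon-differential privacy for this query.
import numpy as np

def dp_count_above(values, threshold, epsilon=1.0):
    true_count = sum(v > threshold for v in values)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Smaller epsilon => more noise => stronger privacy, lower accuracy.
# print(dp_count_above(salaries, 100_000, epsilon=0.5))  # `salaries` is a placeholder list
```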
Other tools and resources
Replicate.ai
Reproducible Machine Learning is hard. While websites such as paperswithcode.com are a wonderful step in the right direction, reproducibility remains challenging in an industry setting. A new tool called Keepsake, from the team at Replicate.ai, looks pretty interesting. Similar to tools such as Weights and Biases, it makes it simple to save code, model weights, and metrics with just a few lines of code. We especially like the CLI interface they provide and are excited to see what they build next.
Github repo: applied-ml
We are big fans of Eugene Yan and his writing. While you should definitely subscribe to his newsletter and read what he’s written in the past, here we wanted to highlight the applied-ml Github repository. This repo provides examples of problems that different companies have solved in multiple ML domains. While ML problems rarely repeat exactly across companies, they certainly do rhyme (original quote) and it is enlightening to read these stories of ML in production. There are too many interesting links in the repo, but a couple that caught our eye are Overton: A Data System for Monitoring and Improving Machine-Learned Products and Applying Deep Learning To Airbnb Search.
Fun | Fears about Artificial General Intelligence might be overstated: Exhibit no. 9538
As we have talked about previously, AI applications are increasingly leaving their digital-only sandboxes and effecting change in the world of atoms - everything from optimizing product placements in supermarkets to self-driving cars. In the example mentioned above, AI software is used to assess job-interview candidates on various personality traits (openness, conscientiousness, agreeableness, neuroticism). As the researchers noted:
The software promises to be able to detect personality traits and be "faster, but also more objective". Turns out: Just placing a bookshelf in the background, changes the results significantly.
This is another example that highlights issues we have talked about in past issues: model overfitting and the need for robust, dynamic benchmarks for model testing that allow for adversarial testing. Providing this AI assurance is, in our view, a necessity for applications of AI in areas like assessing job applications and creditworthiness. Let us use AI to diminish unequal starting lines, not exacerbate them.
Thanks
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. It is only Day 1 for MLOps and for this newsletter, and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering, tweet us @mlopsroundup (open to DMs as well) or email us at mlmonitoringnews@gmail.com