Issue #5: Cloud ML Platforms. Disappearing Data. Data Poisoning. "Building Cars v/s Factories" for ML models
Welcome to the 5th issue of the ML Ops Newsletter. A lot has happened over the last couple of weeks, even when compared to the rest of 2020 (which is saying something).
Over 150 million people voted in the 2020 US elections, and polls missed the mark yet again. While the complexity of polling probably deserves an entire newsletter, we want to quickly touch upon the issue of accuracy. A lot has been written about this, but most arguments boil down to the following: (1) it is hard to get a truly representative sample of people, and (2) people might be unwilling to share their real preferences. Both of these are data quality issues, and they challenge pollsters for the same reason they challenge data scientists - all models are wrong, but some models are useful. We persevere, because some useful models are unicorns :)
Now, back to our scheduled programming. In this issue, we discuss MLOps on the Big Three cloud ML platforms (AWS SageMaker, GCP AI Platform, Azure ML), and share what we learned from the makers of tools like Anomalo and Determined.AI. Finally, we introduce some exciting research on “adversarial data poisoning”.
We are always excited to hear from you, our readers. Thank you for subscribing, and if you find this newsletter interesting, forward this to your friends and support this project ❤️
Additionally, we are conducting a preliminary survey of how our readers use the various cloud ML platforms - AWS SageMaker, Azure ML and GCP AI Platform. If you have experience using any of these, we would love to hear from you (check out this survey).
Which cloud provider’s ML platform do you need?
The answer: it doesn’t really matter which cloud provider’s ML platform you use - they’re almost identical.
Apologies - this is one of the more clickbait-y posts that we have covered so far. However, the path to get to this answer is interesting, which is why we are covering it. A Data Science or ML project has two phases, each with a different goal.
During experimentation, the goal is to find a model that answers a (business) question as fast as possible.
During industrialization, the goal is for that model to run reliably and automatically in production.
There are different organizational challenges for experimentation and industrialization, and these vary for organizations of different sizes (and levels of ML maturity).
Since we focus on MLOps, the industrialization challenges are much more interesting to us. The author summarizes a Google article by breaking down MLOps maturity into three categories:
MLOps level 0 🐣: Manual building and deploying of models.
MLOps level 1 🤓: Deploy pipelines instead of models.
MLOps level 2 😎: CI/CD integration, automated retraining, concept drift detection.
The author suggests that organizations with less ML maturity slowly grow from training and deploying models manually, to deploying complete pipelines that perform training and deployment, to full CI/CD integration for model deployment, automatic retraining, and drift monitoring. That being said, our own research indicates that almost no orgs have reached this MLOps nirvana 🤷
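To make level 1 (“deploy pipelines instead of models”) a little more concrete, here is a minimal sketch using scikit-learn - the idea being that the versioned, schedulable unit is a training routine that can be re-run on fresh data, rather than a one-off model artifact. The steps and function names below are illustrative, not a prescription.

```python
# A minimal sketch of "deploy the pipeline, not the model" (MLOps level 1).
# Feature prep and the estimator live in one re-runnable routine, so
# retraining simply means running the pipeline again on fresh data.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

def build_pipeline() -> Pipeline:
    return Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])

def retrain(X_train, y_train) -> Pipeline:
    """Re-fit the whole pipeline on fresh data; this function is what gets
    versioned and scheduled, not a one-off notebook or a pickled model."""
    pipeline = build_pipeline()
    pipeline.fit(X_train, y_train)
    return pipeline
```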
Now, let’s jump back to the question of which cloud ML platform to use. At the moment, there just isn’t enough feature differentiation among AWS, Azure and GCP. For almost all use cases, any of them should satisfy your team’s needs. All three clouds currently provide the functionality shown in the image below.
Our notes:
It is best to go with the cloud that your company is most familiar with - in fact, picking a different cloud can be a net negative, since your DevOps team will now have to support it.
Another idea is to start with a cloud-agnostic ML/DS platform that some of your teammates might already be familiar with (open-source options like MLFlow or vendors like Domino Data Lab) instead of the cloud ML platforms.
Look out for more from us on the question of AWS vs Azure vs GCP - we are going to be experimenting with their offerings ourselves over the next few weeks.
Question for you:
If you have experience with AWS SageMaker, Azure ML or GCP AI Platform, we would love to hear from you! Fill out this form to lend us a helping hand.
When Data Disappears: The importance of documenting & validating your data
We have touched on the topic of ML tech debt in a previous issue when sharing this seminal 2015 paper from Google about the real-world challenges of developing and productionizing Machine Learning models at scale. One of the biggest sources of machine learning model failures is the quality of data, and the systems & pipelines that generate it. It is promising to see companies like Anomalo emerge to tackle this problem, and introduce standardization around data quality measurement & alerting (We covered one such tool called Great Expectations in a previous issue).
Problem:
Data quality checks like feature coverage, outlier detection and unexpected NULL value checks are well known and commonly used by data teams across the industry. While these are very helpful, the article above tackles one of the most common yet harder-to-solve issues: what happens when your data goes missing. It is bad enough when recent data is missing entirely, but the problem is compounded when the data is only partially missing, or when a crucial segment of it disappears.
One of the most common causes, in our experience, is the sheer number of system components that need to talk to each other for raw data to be processed, transformed and finally stored. The article talks about this as well:
Raw data is captured through logging systems or from external sources, then data loading systems pre-process the raw data and load it into a data warehouse. Then complex SQL pipelines filter data for important records, join multiple sources together and perform complex aggregations.
Such failures often cascade into multiple downstream failures, some of which are visible (e.g. dashboards failing to load or showing missing data) while others are silent (e.g. ML models quietly being trained on incomplete or garbage data).
Proposed Solution & Our Take:
A key idea presented in this article is to develop validations & quality checks for your data independent of the systems that generated it. That is to say, we should be able to define clear testable criteria that our data ought to obey, as a function of the problem domain and largely independent of how it was generated. Anomalo makes it easy to configure such data quality checks with little to no custom code required.
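To make this more concrete, here is a minimal sketch in plain pandas of what system-independent checks for missing data might look like. The column names, segments and thresholds are illustrative; tools like Anomalo or Great Expectations let you express similar checks without custom code.

```python
# A minimal sketch of "validate the data, not the pipeline" in plain pandas.
# Columns, segments and thresholds are illustrative.
import pandas as pd

EXPECTED_DAILY_ROWS = 100_000  # rough historical baseline, illustrative

def validate_daily_load(df: pd.DataFrame, expected_segments: set) -> list:
    """Return a list of human-readable data quality failures for one partition."""
    failures = []

    # 1. The data should exist at all.
    if df.empty:
        return ["no rows loaded for this partition"]

    # 2. Rows should not be partially missing relative to the usual volume.
    if len(df) < 0.5 * EXPECTED_DAILY_ROWS:
        failures.append(f"row count {len(df)} is <50% of the usual volume")

    # 3. No crucial segment should silently disappear.
    missing = expected_segments - set(df["country"].unique())
    if missing:
        failures.append(f"segments missing entirely: {sorted(missing)}")

    # 4. Key columns should not suddenly fill with NULLs.
    null_rate = df["revenue"].isna().mean()
    if null_rate > 0.05:
        failures.append(f"revenue NULL rate {null_rate:.1%} exceeds 5%")

    return failures

# Usage: run after each warehouse load and alert if the list is non-empty.
# failures = validate_daily_load(df, expected_segments={"US", "UK", "IN"})
```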
@Rishabh: A slightly different take on a product like Anomalo: its raison d'être is that the tools that manipulate your data have unforeseen errors. I work at Datacoral, a data infrastructure company, where our product provides data pipelines for our customers. We have worked hard on building self-healing pipelines, where our product itself can determine when there might be data errors by performing data quality checks. In the future, we would like to see more tools that think about data quality as part of their own feature set, rather than leaving it as an afterthought for their users.
Data Poisoning: The dangers of adversarial pollution of training data
The massive improvement in NLP models over the last couple of years has been driven by the success of huge models (GPT-3 has ~175B parameters, for example), which in turn have been driven by massive datasets (the Common Crawl dataset, compiled by crawling the web, is reaching petabytes in size).
What is data poisoning and why is it a concern?
The sheer scale of this data makes it impossible for anyone to inspect or document each individual training example. What are the dangers of using such untrusted data?
A potential concern is data poisoning attacks, where an adversary inserts a few malicious examples into a victim's training set in order to manipulate their trained model. Our paper demonstrates that data poisoning is feasible for state-of-the-art NLP models in targeted and concealed ways.
How does it work?
Let’s understand how data poisoning might work by taking a simple example. Let’s say our goal is to poison training data in such a way that when sentiment analysis is performed on movie reviews, all examples containing James Bond are labelled as positive (as an FYI - we are lukewarm on James Bond ourselves).
Using their mechanism for generating poisoned examples that don’t even contain the trigger phrase “James Bond” (read the paper for details), the authors demonstrate a success rate of 49% with just 50 poisoned examples added to the training set. This means that for James Bond movie reviews in the validation set whose true label was negative, the trained model returned positive 49% of the time. What’s even more dangerous:
Our attacks also have a negligible effect on other test examples: for all poisoning experiments, the regular validation accuracy decreases by no more than 0.1%. This highlights the fine-grained control achieved by our poisoning attack, which makes it difficult for the victim to detect the attack.
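To spell out the two numbers being reported, here is a toy sketch of the evaluation (our own illustration, not code from the paper); `model.predict` is a hypothetical classifier interface.

```python
# Toy illustration of the two metrics discussed above: attack success rate on
# the "trigger" subset vs. ordinary accuracy on regular validation examples.
# `model.predict(text) -> "positive" | "negative"` is a hypothetical interface.

def attack_success_rate(model, trigger_examples):
    """Fraction of truly-negative James Bond reviews that the poisoned model
    flips to positive (49% with 50 poisoned examples in the paper)."""
    flipped = [model.predict(text) == "positive" for text, _ in trigger_examples]
    return sum(flipped) / len(flipped)

def clean_accuracy(model, regular_examples):
    """Ordinary validation accuracy on examples without the trigger phrase;
    the paper reports this drops by no more than 0.1% under the attack."""
    correct = [model.predict(text) == label for text, label in regular_examples]
    return sum(correct) / len(correct)
```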
How to prevent this from happening?
In the paper, they discuss a few possible defenses, all of which are still quite research-y in nature:
Early stopping when training models to prevent overfitting
Identifying poison examples using a measure like language-model perplexity (see the toy sketch after this list)
Identifying poison examples using BERT embedding distance
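As a rough illustration of the perplexity-based idea (our own sketch, not the paper’s exact defense), one could score every training example with an off-the-shelf language model like GPT-2 and flag the most “surprising” ones for review. The example data and cutoff below are made up.

```python
# Toy sketch: flag training examples with unusually high GPT-2 perplexity,
# on the theory that generated poison text tends to look "unnatural".
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (exp of the mean per-token LM loss)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the model returns the mean LM loss.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return float(torch.exp(loss))

# Illustrative training examples; the second one looks machine-mangled.
train_examples = [
    ("the movie was a delight from start to finish", "positive"),
    ("J flows brilliant is whose lyrical", "positive"),
]

scores = sorted(((perplexity(text), text) for text, _ in train_examples), reverse=True)

# Flag the most surprising examples for manual review; the cutoff is arbitrary
# and would need tuning on real data.
suspicious = [text for _, text in scores[:1]]
```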
Putting our MLOps hats on, realistically speaking, the best defenses might actually involve having tighter control over the training data and using techniques similar to the CheckList paper that we discussed in an earlier issue.
(xkcd always says it best albeit with a small edit from us)
Productionizing ML models: Building a new factory v/s building a new car
We love to cover examples and learnings related to the ML research/prototyping → production journey: what the challenges are, which tools can make life a little bit easier, and what a good playbook looks like. This week, we share an article on this theme written by David Hershey, an engineer at Determined.AI.
The article highlights two important structural reasons why going from research to production is still a hard problem, using battery development for electric vehicles (EVs) as an analogy:
No process or development best practices: Prototyping and training of new ML models aren’t always done with the end goal in mind (i.e. deploying and serving traffic in production). All too often, the early ML lifecycle relies on ad-hoc data cleaning and parsing scripts that then need to be “glued” into production systems. In the world of EVs, the process by which the first vehicle was made is significantly different from the production of the next 1000 cars, and it would be crazy to imagine the process for the first vehicle repeated a thousand times.
Lack of tooling: The absence of tools for stages of the post-deployment ML lifecycle, such as continuous monitoring and experiment tracking, poses a real barrier when moving to production. In the realm of EVs, this situation would be like trying to ramp up the production of cars without building a functional factory first!
The article shares some really good insights around tooling for various parts of the ML lifecycle (e.g. MLFlow and Weights and Biases for experiment tracking; feature standardization and reuse via Feature Stores like Tecton; and data versioning tools like DVC).
This ties in very well with a topic we covered in our last issue, around the rise of the “Canonical ML Stack”, with its four stages of the ML lifecycle:
Data Gathering and Transformation
Experimenting, Training, Tuning and Testing
Productionization, Deployment, and Inference
Monitoring, Auditing, Management, and Retraining
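As a small, concrete taste of the experiment-tracking tooling mentioned above (stage 2), here is a minimal MLflow sketch; the run name, parameters and metric values are illustrative.

```python
# Minimal experiment-tracking sketch with MLflow; by default, runs are logged
# to a local ./mlruns directory. Names and values below are illustrative.
import mlflow

with mlflow.start_run(run_name="baseline-logreg"):
    mlflow.log_params({"C": 1.0, "max_iter": 200})

    # ... train and evaluate the model here ...
    val_accuracy = 0.87  # placeholder for a real evaluation result

    mlflow.log_metric("val_accuracy", val_accuracy)
    # mlflow.sklearn.log_model(model, "model")  # also version the artifact itself
```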
Thanks
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. This is only Day 1 for MLOps and for this newsletter, and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering, tweet us @mlopsroundup (we are open to DMs as well) or email us at mlmonitoringnews@gmail.com.