Issue #30: ML Platforms. On Deck Data Science. Explainability in Healthcare. Re:Invent. ML for Content Moderation.
Happy New Year and welcome to the 30th issue of the MLOps newsletter. We have been on a bit of a hiatus from the newsletter (sorry that we can’t share the reasons just yet 😄), but will be back on a regular cadence now!
In this issue, we cover a collection of learnings from ML platforms at Netflix, DoorDash, and Spotify; share a fantastic new opportunity for data scientists with the On Deck community; discuss challenges in building explainable machine learning models for healthcare; link to a Twitter thread on building hybrid human/machine learning systems for content moderation; and more.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
Medium | Lessons on ML Platforms — from Netflix, DoorDash, Spotify, and more
This is a wonderful article by Ernest Chan where he analyzes the broad components of ML platforms at large companies such as Netflix, DoorDash, and Spotify.
Why are we talking about ML Platforms anyway?
Your data scientists produce wonderful models, but they can only deliver value once the models are integrated into your production systems. How do you make it easy for the data scientists to repeatedly deliver value? What do you build, what do you buy, and what tools do you need to solve your organization’s problems specifically?
The goal of an ML platform is to accelerate the output of data science teams. Let’s look at the common components of an ML platform (at least from a large tech company perspective).
Components of an ML Platform
The common components: a Feature Store, Workflow Orchestration, a Model Registry, Model Serving, and Model Quality Monitoring.
It turns out that most of these components have been built in-house so far. This is for several reasons: these companies had to solve their problems before much ML tooling existed (and we are still very much in the early innings of MLOps), organizational incentives (how engineers at large companies are rewarded), and so on.
What does this mean?
Well, first of all, most companies are not one of these large companies, so your situation will likely be different.
The first thing that stands out is that the workflow orchestration component is the one where an out-of-the-box (and open-source) solution is most commonly used. This makes sense, since workflow orchestration tools are part of a generic software/data toolkit and are more mature. Airflow seems to be dominant here.
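To make "workflow orchestration" concrete, here is a toy sketch of the core idea (our own illustration, not Airflow's API): tasks declare their dependencies, and an orchestrator runs them in dependency order. The pipeline step names are hypothetical.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

results = {}

def run(task):
    # Stand-in for real work (a Spark job, a training run, etc.).
    results[task] = f"{task}:done"

# Each task maps to the set of tasks it depends on.
deps = {
    "ingest": set(),
    "featurize": {"ingest"},
    "train": {"featurize"},
    "evaluate": {"train"},
    "deploy": {"evaluate"},
}

# The orchestrator's core job: execute tasks in a valid dependency order.
order = list(TopologicalSorter(deps).static_order())
for task in order:
    run(task)
```

Tools like Airflow layer scheduling, retries, and monitoring on top of this basic idea, which is why teams reach for an existing tool rather than building one.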
The model registry is the next component with some established tools, such as MLflow. However, feature stores, model serving, and model quality monitoring seem to have been built in-house in pretty much every case. Aside from most tools being new, there are a few reasons for this: all these components have “stringent production requirements”, need to support diverse use cases within an organization, and can require tight integrations with the rest of the company’s tech stack.
If this is interesting to you, check out the full article and the beautiful set of references compiled at the end of it.
Community | On Deck Data Science
We wanted to share an exciting opportunity with all of you. Both of us have been members of the On Deck community for the past few months, and cannot recommend it highly enough. We have met some incredible people, and learned a lot from the community.
While we are part of the On Deck Deep Tech cohort, they have a new cohort called On Deck Data Science, and here is a blurb from their team. If you have any questions about our experience, drop us a note.
Enter On Deck Data Science (ODDS):
ODDS is a continuous community for ambitious Data Science leaders who want to maximize their impact and accelerate their careers alongside a highly-curated network of peers.
The fellowship brings together experienced data science leaders who have delivered results for organizations and customers at the highest level.
Members get access to:
Community: Develop meaningful relationships with peers and mentors. Curated 1:1 connections, mastermind sessions and mentorship matchmaking - the hard work is done for you.
Networking: ODDS is a side door to 10x your network of peers. You can find exciting job opportunities, investors to fund your next idea or your next star hire. You get access to the entire On Deck network (like an internal LinkedIn) instantly when you join.
Professional Development: Acquire specialized frameworks, knowledge and skills to add to your toolkit via live sessions and fireside chats with incredible guests and an extensive library of content.
Applications close on Feb 13th, so don’t miss your chance to join some of the most accomplished data scientists in the business. Apply now!
Paper | The false hope of current approaches to explainable artificial intelligence in health care
We have covered explainability for ML multiple times (see here and here), but this was a thought-provoking paper that talks about some of the pitfalls of these techniques (especially in healthcare).
What are the authors saying?
It has been argued that explainable AI will engender trust with the health-care workforce, provide transparency into the AI decision making process, and potentially mitigate various kinds of bias… we advocate for rigorous internal and external validation of AI models as a more direct means of achieving the goals often associated with explainability, and we caution against having explainability be a requirement for clinically deployed models.
Current explainability approaches and their gaps
Explanations for decisions fall into two categories: models that are inherently explainable (simple models such as linear regression), or post-hoc explainability techniques (saliency maps, LIME, SHAP, etc.).
While inherently explainable models might seem appealing, there can be confounding variables in the mix, and when the number of variables grows, “information overload” can make explanations tricky.
Heat maps (or saliency maps) highlight how much each region of an image contributed to a decision in imaging use cases. However, as seen in a pneumonia diagnosis example, the authors claim that:
“the hottest parts of the map contain both useful and non-useful information (from the perspective of a human expert), and simply localising the region does not reveal exactly what it was in that area that the model considered useful.”
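To give a feel for what a saliency map computes, here is a minimal occlusion-sensitivity sketch of our own (the paper does not prescribe this code; the "model" and "image" are made up): mask each cell of a toy image and record how much the model's score drops.

```python
# Toy occlusion-sensitivity sketch: the score drop when a cell is masked
# serves as that cell's "importance" in the heat map.
image = [
    [0.1, 0.1, 0.1, 0.1],
    [0.1, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.9, 0.1],
    [0.1, 0.1, 0.1, 0.1],
]

def score(img):
    # Pretend classifier: responds only to brightness in the centre 2x2 patch.
    return sum(img[r][c] for r in (1, 2) for c in (1, 2))

baseline = score(image)
saliency = [[0.0] * 4 for _ in range(4)]
for r in range(4):
    for c in range(4):
        occluded = [row[:] for row in image]
        occluded[r][c] = 0.0                          # mask one cell
        saliency[r][c] = baseline - score(occluded)   # score drop = importance
```

The hot cells here are exactly the centre patch the toy model keys on; but, just as the paper warns, the map only says where the model looked, not what about that region mattered.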
Why does this happen?
This is the interpretability gap: explainability methods rely on humans to decide what a given explanation might mean. Unfortunately, the human tendency is to ascribe a positive interpretation: we assume that the feature we would find important is the one the model actually used (an example of a famously harmful cognitive error called confirmation bias).
Another issue is that explanations have no performance guarantees. Most tests “rely on heuristic measures” and qualitative judgments rather than explicit scores. Since explanations are often a simplification of the original model, they are very likely a less accurate version of the trained (and hard-to-explain) model, which makes this process harder.
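The "simplification" point can be made concrete with a small sketch of our own (not from the paper): fit a global linear surrogate to a nonlinear black-box model and measure how much the two disagree. The fidelity gap is exactly the accuracy the "explanation" gives up.

```python
# A post-hoc "explanation" model is a simplification of the original model,
# so it can disagree with it. Black box: a nonlinear stand-in for a trained model.
def black_box(x):
    return x * x

xs = [i / 10 for i in range(-20, 21)]   # probe points in [-2, 2]
ys = [black_box(x) for x in xs]

# Fit a global linear surrogate by ordinary least squares (one feature).
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
intercept = my - slope * mx

# Fidelity gap: mean absolute disagreement between surrogate and black box.
gap = sum(abs((slope * x + intercept) - y) for x, y in zip(xs, ys)) / n
```

For this symmetric probe set the best linear fit is nearly flat, so the surrogate's "explanation" (slope near 0) says the feature barely matters, even though the black box depends on it strongly. The nonzero fidelity gap is the price of the simplification.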
Rather than seeing explainability techniques as producing valid, local explanations to justify the use of model predictions, it is more realistic to view these methods as global descriptions of how a model functions. If, for example, a clinical diagnostic model appears to perform well in a specific test set but the heat maps show that the model is consistently distracted by regions of the images that cannot logically inform the diagnosis, then this finding can indicate that the test set itself is flawed and that further forensic investigation is required.
Here is the full paper if this was an interesting read!
AWS News Blog | Top Announcements of AWS re:Invent 2021
AWS re:Invent 2021 had a ton of interesting announcements, but they were light on AI/ML improvements this time around (compare this with our re:Invent coverage from last year).
Amazon SageMaker Studio Lab: This is a free service that gives people access to a working Jupyter instance for experimentation - it appears very similar to Google Colab.
Amazon SageMaker Inference Recommender: This service recommends which instance types you should run your inference workloads on.
Amazon SageMaker Ground Truth Plus: This service (currently in pilot) appears to be a higher-quality labeling service compared to Mechanical Turk, and seems to be a competitor to the Scale AIs of the world.
Amazon SageMaker Training Compiler: This service optimizes deep learning training code to run faster on SageMaker GPU instances (if you do use this, let us know what your experience is like).
There are a few more, but we’ll leave it up to you to read about them here. We do worry about the length of SageMaker service names at this point - we wonder if we will be covering Amazon SageMaker Deep Training Speed-Booster Plus next year…
Washington Post | New technology mandate in infrastructure bill could significantly cut drunken driving deaths
We try to track interesting news from the policy world where ML may have a part to play, and this certainly caught our eye.
What is this and why is it important?
The recent infrastructure bill from the US Congress has a mandate that would require new cars to have technology that would stop drunk people from driving. As the article reports:
More than 10,000 people died in crashes involving an alcohol-impaired driver in 2019, according to the National Highway Traffic Safety Administration… A recent study by the Insurance Institute for Highway Safety concluded the technology could reduce deaths by 9,400 people a year if widely deployed.
How would it work?
The technology involved in preventing drunk driving isn’t finalized yet (and probably won’t be for a few years), but one of the ideas being floated is to:
“rely on cameras that monitor drivers for signs they are impaired, building on systems that automakers are using to ensure people relying on driver assistance technologies don’t lose concentration.”
Whether it plays out this way or not, this definitely seems like an ML application that has the potential to save a lot of lives.
Twitter | Lessons from ML Systems for Content Moderation
Nihit: I worked on systems for content moderation at Facebook. In a recent thread, I shared some learnings & observations around what makes this a challenging problem.
Detecting & taking enforcement action on illegal/undesirable content is an important problem for most online platforms as they scale. This is typically done with human-in-the-loop machine learning systems. The content moderation domain presents some unique challenges when building machine learning models - bootstrapping labels, subjective annotation guidelines that can lead to label noise, adversarial drift, and the need for adaptive enforcement. If some of these sound interesting or relevant, definitely check out the thread!
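One common pattern behind human-in-the-loop moderation can be sketched as a confidence-thresholded router (a generic illustration of ours, not a description of Facebook's actual system; the thresholds and names are hypothetical): auto-action content the model is confident about, and queue the uncertain middle for human review.

```python
# Toy human-in-the-loop routing sketch: high-confidence predictions are
# actioned automatically; everything in between goes to human reviewers.
AUTO_REMOVE = 0.95   # hypothetical confidence thresholds
AUTO_ALLOW = 0.05

def route(score):
    """score: model's estimated probability that the content violates policy."""
    if score >= AUTO_REMOVE:
        return "remove"
    if score <= AUTO_ALLOW:
        return "allow"
    return "human_review"

scores = [0.99, 0.50, 0.02, 0.80]
decisions = [route(s) for s in scores]
```

The human labels collected from the review queue then feed back into training, which is one reason label noise and adversarial drift (both mentioned in the thread) matter so much in this domain.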
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup or email us at email@example.com.
If you like what we are doing, please tell your friends and colleagues to spread the word. ❤️