Issue #2: Nuts and Bolts of ML. Unfriendly Comments. Great Expectations. Common ML Misconceptions.
MLOps is a broad discipline that is moving quickly, and hopefully, that is reflected in the breadth of content we cover in this issue. From timeless lessons from Andrew Ng on applying Machine Learning, to an instructive case study on building an ML system at Stack Overflow, and from a new tool focused on data quality to a Twitter thread discussing misconceptions about ML in production, we hope you find something of interest.
We are always excited to hear from you, our readers. Starting this week, we will also begin including interesting links that you share with us (scroll to the bottom to find them!).
If you find this newsletter interesting, forward this to your friends and support this project! ❤️
In a 2016 lecture, Andrew Ng shares tips, tricks, and rules of thumb for applying machine learning concepts successfully in the real world. It is a great lecture and we recommend listening to all of it (and not just because both of us were Teaching Assistants for Andrew Ng’s Machine Learning class in 2016 🎓).
What Resonated: Among the many lessons shared here (scalable and stable systems go hand-in-hand with better model architectures; cleaning noisy labels from training data; importance of having a “human error” benchmark to better put test and training set errors in context), one thing especially resonated with us: when developing an ML solution for a problem, it is important to understand the model’s performance on a few different datasets [at 27:06]:
The Training set: if the trained model isn’t doing well here, we likely need a bigger model or want to train longer (“Underfitting”)
The Dev set: if the trained model does well on the training set but not on the dev set, we likely need better regularization or more training data (“Overfitting”)
The Test set: The test set should ideally reflect exactly the data that the model will see when deployed online. If the trained model does well on the train and dev sets but not on the test set, it likely indicates that the data distributions of the training and test sets are different (“Data Drift”)
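These diagnosis rules can be boiled down to a few comparisons. Here is a minimal sketch; the error-gap threshold and the human-error baseline are illustrative choices of ours, not values from the lecture:

```python
def diagnose(train_err, dev_err, test_err, human_err, gap=0.02):
    """Compare errors across datasets to pick a likely failure mode.

    The `gap` threshold and human-error baseline are illustrative,
    not values prescribed by the lecture.
    """
    if train_err - human_err > gap:
        return "underfitting: try a bigger model or train longer"
    if dev_err - train_err > gap:
        return "overfitting: add regularization or more training data"
    if test_err - dev_err > gap:
        return "data drift: test distribution differs from train/dev"
    return "looks healthy"

# Train and dev look fine, but the test set error jumps: data drift.
print(diagnose(train_err=0.10, dev_err=0.11, test_err=0.25, human_err=0.09))
```

The ordering of the checks matters: fix underfitting before worrying about overfitting, and fix overfitting before chasing drift.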
Taking it a step further: The first two points may be very familiar to ML practitioners and speak to the well-known “bias-variance tradeoff”. We found the last point insightful and relevant to our personal experience of shipping ML models. Often, the kind of data a model sees online (we can call it the “Real-world set”) will be different from the data used to train it. While preventing this entirely might not always be feasible (we can’t exactly control what kind of traffic models see online), it is critical to monitor such drifts and continuously retrain models to minimize their impact.
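One simple way to monitor such drift is to compare the distribution of a feature at training time against what the model is seeing live. A hand-rolled sketch using the two-sample Kolmogorov–Smirnov statistic (the shifted mean and the alert threshold are synthetic, illustrative choices):

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_xs, x):
        # Fraction of values <= x.
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))

rng = random.Random(0)
train_feature = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # offline data
live_feature = [rng.gauss(0.8, 1.0) for _ in range(2000)]   # online, shifted mean

drift = ks_statistic(train_feature, live_feature)
print(f"KS statistic: {drift:.3f}")
if drift > 0.1:  # illustrative alert threshold
    print("drift alert: consider retraining")
```

In practice you would run a check like this per feature (and on the model’s output scores) on a rolling window of live traffic, alerting when the statistic crosses a threshold tuned to your tolerance for false alarms.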
We ❤️ stories of machine learning systems that are solving real problems, and this post from Stack Overflow is one of our favorites from 2020. The challenge faced by the Stack Overflow team was around unfriendly or unwelcoming comments from users because of the effect that
“their tone has on their recipient’s and future readers’ willingness to contribute to Stack Overflow.”
What we learned about the project
The Stack Overflow team described a more or less complete ML lifecycle, from collecting data to deploying a model, to retraining the model with feedback from human reviewers. As they say:
“There’s a lot of stuff that has to go down after you get a good validation score. Training the model ended up being the easiest part, by far.”
They collected data by having different sets of people label “unfriendly comments” and saw a huge disparity in their estimates of the frequency of unfriendly comments, ranging from 0.105% to 3.5% (a large inter-labeller discrepancy like this is the marker of a more subjective, and generally harder, problem)
Error analysis revealed comments that were being missed by the earlier system (now showing up as “false positives”) and innocuous comments that had been flagged earlier (apparent “false negatives”)
They understood the importance of having “humans in the loop”, and “wanted to build a tool to augment the capabilities of the humans in our system, not replace the humans”
Finally, they ended up training a v2 of the model, comparing it against the existing model and deploying it to production.
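When labellers disagree this much, it helps to quantify agreement rather than eyeball it. A minimal sketch of Cohen’s kappa with hypothetical labels (not Stack Overflow’s data):

```python
def cohens_kappa(labels_a, labels_b):
    """Inter-labeller agreement corrected for chance:
    kappa = (p_observed - p_chance) / (1 - p_chance)."""
    n = len(labels_a)
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    p_chance = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical labels (1 = unfriendly, 0 = fine) from two raters:
rater_1 = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
rater_2 = [0, 1, 1, 0, 0, 0, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

A kappa well below 1.0 on a pilot labelling round is a useful early warning that the labelling guidelines need tightening before scaling up data collection.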
This particular problem of “unfriendly comment classification” has similarities with many “content curation” problems faced by consumer internet companies today, such as fake news detection or hate speech detection. Traits that are similar are:
no ground truth labels
inherent subjectivity in the human-generated labels
rules for deciding labels change with time (as definitions and cultural norms change)
These are complicated problems – so you should have realistic expectations for your team! Hopefully, this blog post from Stack Overflow provides a runbook of sorts when thinking about such problems.
What does it do?
Great Expectations (GE) is a tool that provides a data validation and documentation framework, and we believe it has a great future.
What are they saying?
The Great Expectations team have begun to talk about MLOps, and how GE fits into the overall MLOps landscape. Quoting from the link above:
“ML isn’t just a magic wand you can wave at a pile of data to quickly get insightful, reliable results. Instead, we are starting to treat ML like other software engineering disciplines that require processes and tooling to ensure seamless workflows and reliable outputs.”
As they show in the image, Great Expectations has a role to play at any stage where data validation is necessary. While they don’t specifically speak about data validation for a machine learning model running in production (the monitoring use case), one can imagine this being a logical next step.
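To make the idea concrete, here is a pure-Python sketch of the expectation pattern GE is built around: declarative checks on data that return structured results. This mirrors the idea, not Great Expectations’ actual API; the function name, table, and column are all hypothetical:

```python
def expect_column_values_to_be_between(rows, column, min_value, max_value):
    """Expectation-style check: every value in `column` must fall in
    [min_value, max_value]. Returns a structured result rather than
    raising, so pipelines can decide how to react."""
    failures = [r[column] for r in rows
                if not (min_value <= r[column] <= max_value)]
    return {"success": not failures, "unexpected_count": len(failures)}

# Hypothetical feature table to validate before training or serving:
rows = [
    {"user_id": 1, "ctr": 0.03},
    {"user_id": 2, "ctr": 0.11},
    {"user_id": 3, "ctr": 1.70},  # bad row: a click-through rate above 1.0
]
result = expect_column_values_to_be_between(rows, "ctr", 0.0, 1.0)
print(result)  # {'success': False, 'unexpected_count': 1}
```

The value of the pattern is that the same declared expectations can gate a training job, document the dataset, and (potentially) monitor a production model’s inputs.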
Running these data assertions can be computationally expensive. Combined with the fact that more and more data pipelines are running at shorter frequencies (with data being updated every 5 minutes or more frequently), knowing when to run these checks becomes important.
@Rishabh - Disclaimer: I work at Datacoral, where we have worked on data quality checks for data pipelines, which are triggered whenever a table is updated (in an event-driven fashion, rather than on a cron schedule).
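The event-driven idea can be sketched in a few lines; all class and method names below are illustrative, not Datacoral’s (or any vendor’s) actual API:

```python
class TableEventBus:
    """Sketch of event-driven data-quality checks: checks run when a
    table is updated, not on a cron schedule."""

    def __init__(self):
        self._checks = {}

    def on_update(self, table, check):
        # Register a check to run whenever `table` changes.
        self._checks.setdefault(table, []).append(check)

    def table_updated(self, table, rows):
        # Called by the pipeline right after a write to `table`.
        return [check(rows) for check in self._checks.get(table, [])]

bus = TableEventBus()
bus.on_update("events", lambda rows: ("non_empty", len(rows) > 0))
bus.on_update("events", lambda rows: ("no_null_ids",
                                      all(r.get("id") is not None for r in rows)))

results = bus.table_updated("events", [{"id": 1}, {"id": None}])
print(results)  # [('non_empty', True), ('no_null_ids', False)]
```

The design choice here is that checks fire exactly when the data changes, so a table updated every 5 minutes gets validated every 5 minutes, while a daily table is only checked once a day, with no wasted runs in between.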
Thoughts for the future
Currently, there is no standardization for how data quality is measured and how teams get alerted – internal tooling at different companies is built in an ad-hoc fashion and is often poorly maintained, so it is exciting to see dedicated tools in this space. We also look forward to seeing new software engineering best practices emerge for managing data quality.
Chip Huyen, an ML engineer at Snorkel, recently tweeted about common misconceptions around deploying and using ML models in the real world (it is worth clicking on the tweet and reading through the entire thread and some of the top comments).
“ML models perform best right after training. In prod, ML systems degrade quickly bc of concept drift. Tip: train models on data generated 6 months ago & test on current data to see how much worse they get.”
As an example, we recently spoke with a Software Engineer at a large public tech company whose team had a spam detection model in production. The model was at “90% accuracy” when trained and deployed, but 3 months later, its performance had degraded to about 60%. Naturally, they retrained the model on new data and redeployed it, but this is going to be an unending cycle for them.
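This kind of degradation is easy to reproduce in a toy simulation. Everything below is synthetic (the threshold “model”, the score distributions, and the numbers), purely to illustrate the mechanism behind the anecdote:

```python
import random

def train_threshold(xs, ys):
    """Fit a toy 1-D 'spam classifier': the midpoint of the class means.
    A stand-in for a real model, just to show the drift effect."""
    mean_spam = sum(x for x, y in zip(xs, ys) if y) / sum(ys)
    mean_ham = sum(x for x, y in zip(xs, ys) if not y) / (len(ys) - sum(ys))
    return (mean_spam + mean_ham) / 2

def accuracy(threshold, xs, ys):
    return sum((x > threshold) == y for x, y in zip(xs, ys)) / len(ys)

def make_data(n, spam_mean, rng):
    # Spam scores cluster around spam_mean, ham scores around 0.
    ys = [rng.random() < 0.5 for _ in range(n)]
    xs = [rng.gauss(spam_mean if y else 0.0, 1.0) for y in ys]
    return xs, ys

rng = random.Random(42)
old_xs, old_ys = make_data(5000, spam_mean=3.0, rng=rng)  # training data
new_xs, new_ys = make_data(5000, spam_mean=1.0, rng=rng)  # spammers adapted

threshold = train_threshold(old_xs, old_ys)
print(f"accuracy on old data: {accuracy(threshold, old_xs, old_ys):.2f}")
print(f"accuracy on new data: {accuracy(threshold, new_xs, new_ys):.2f}")
```

The model itself never changed; the world it scores did. This is also a cheap way to follow Chip’s tip, evaluating a model trained on older data against current traffic to estimate how fast your own domain drifts.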
Another Software Engineer at yet another large public tech company mentioned to us that their team recently performed an internal study to understand the impact of training data recency for their video ML models. They found that models trained on the entire last year’s worth of data had the same performance as the model trained on just the last 3 weeks’ worth of data when tested on online traffic today. YMMV, however, since this might depend on how much traffic your application sees. If you don’t have too much data, it may make sense to continue training on the full year’s data!
In the future
We also see ML systems moving towards a future where “you want to update them as fast as humanly possible”, with applications built on features that have many models working in tandem to generate the output for the user. This is going to be an exciting future!
“Deploying a model for friends to play with is easy. Export trained model, create an endpoint, build a simple app. 30 mins.”
While this is true, the story is much more complex when it comes to deploying at scale or in production. The cloud ecosystem around productionizing ML models to serve predictions has many different players, and the dust is yet to settle, as evidenced in this other Twitter thread by Hilary Mason that we discussed in our previous newsletter issue.
Links From Our Readers
In the last issue, we spoke about MLOps announcements from two tech heavyweights, Google and Nvidia. One of our readers, Rohith Desikan, also pointed us to what Microsoft Azure is up to in the world of MLOps. This repository is a good start and we will keep tabs on how Azure’s services can be combined to provide comprehensive MLOps solutions to ML teams.
We touched on data versioning in the previous issue which is a topic we will be covering in more detail in the future (along with the related concept of model versioning). Another subscriber, Jayesh Kumar Gupta, introduced us to how Julia (a commonly used language in research) supports artifact management as a core feature of the language. While this is still nascent and is primarily used for binary dependencies, it appears that there are discussions to introduce data artifacts as well, which would be an exciting move.
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. This is our second issue and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup (open to DMs as well) or email us at firstname.lastname@example.org.