Issue #15: AI for self-driving at Tesla. HuggingFace meets AWS. Embedding Stores. ML and Databases.
Welcome to the 15th issue of the MLOps newsletter.
In this issue, we highlight a talk on self-driving cars at Tesla, discuss a partnership between Hugging Face and AWS, share a post on embedding stores, dive into a paper on ML-in-databases and more.
Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
Andrej Karpathy, Director of AI at Tesla, gave a talk last year at ScaledML Conference on how Tesla is solving the ML challenges to provide Full Self Driving in their cars. It is a fascinating talk, both in terms of concepts in deep learning and vision and the systems and engineering aspects of making this work at Tesla’s scale. We recommend listening to the talk. For this article, we want to especially focus on the data feedback loops that help Tesla collect what is arguably the best data on self-driving, and how it helps their models improve over time.
Over the past decade, neural networks have repeatedly shown their ability to solve complex computer vision problems. However, when the same model architectures and training techniques are available to everyone, data becomes the most important variable for improving visual recognition in self-driving cars.
In the limit, this means that if the “right datasets” can be collected, the models will successfully learn to recognize the things we need. The right data is not just about quantity but also quality:
“What’s important is not just the scale of the dataset, but covering all possible use cases”
Because of the size of its fleet driving in the wild, Tesla is able to collect data at scale to solve the long tail of corner cases (occluded stop signs, construction sites, potholes, etc.) that are critical to handle but hard to get data for.
Andrej mentioned that the platform allows the team not only to collect data when a driver’s actions disagree with the self-driving model’s predictions, but also to fan out to the fleet and collect images “similar” to those in a seed dataset.
This “Data Engine” is a feedback loop that lets Tesla know exactly where the models currently fall short, collect data at scale to fix those failures, and move on to the next problem in the long tail on the way to 99.999+% reliability.
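The core trigger of such a feedback loop can be sketched in a few lines. This is a purely illustrative toy (all names and thresholds are hypothetical, and Tesla’s actual system is far more sophisticated): flag a frame for upload whenever the driver’s action disagrees with the model’s prediction by more than a threshold.

```python
# Illustrative sketch of a "Data Engine"-style trigger: log a frame for
# fleet upload whenever the driver's action disagrees with the model's
# prediction by more than a threshold. All names are hypothetical.

def flag_disagreements(frames, threshold_deg=5.0):
    """frames: iterable of (frame_id, model_steering_deg, driver_steering_deg).
    Returns the ids of frames where model and driver disagree sharply."""
    return [
        frame_id
        for frame_id, model_deg, driver_deg in frames
        if abs(model_deg - driver_deg) > threshold_deg
    ]

frames = [
    ("f1", 1.0, 1.5),    # model and driver agree -> not logged
    ("f2", 0.0, 12.0),   # driver overrides sharply -> logged for review
    ("f3", -3.0, -2.0),  # small disagreement -> not logged
]
flagged = flag_disagreements(frames)
```

Frames flagged this way become candidates for labeling and for seeding the “find similar images across the fleet” queries mentioned above.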
We’re big fans of the approach outlined by Andrej and the Tesla team in this talk. The feedback loop between data collection and model improvement is not unique to cars or Tesla; it should ideally be closed for every real-world ML application. We think this is broadly an unsolved problem (to the extent that it is even recognized as a problem), which is why we’re excited and optimistic about the role MLOps and monitoring tools can play.
Hugging Face and AWS have announced a strategic partnership:
“to make it easier for companies to leverage State of the Art Machine Learning models, and ship cutting-edge NLP features faster.
Through this partnership, Hugging Face is leveraging Amazon Web Services as its Preferred Cloud Provider to deliver services to its customers.”
This is a big move from Hugging Face, less than a month after raising a $40M Series B round.
What does this involve?
First, Hugging Face and AWS SageMaker will provide Hugging Face Deep Learning Containers optimized for PyTorch and TensorFlow training that work well with different EC2 instances. Hugging Face models trained or fine-tuned with SageMaker will also be billed only for the seconds of compute used.
Second, there will be a Hugging Face extension to the SageMaker Python SDK, which will simplify creating and managing training jobs in AWS. For a quick demo of what this looks like, check out this YouTube video.
Third, the Hugging Face extension to SageMaker will work seamlessly with existing SageMaker functionality for data parallelism, model parallelism and hyperparameter tuning when training models. The extension will also simplify sending metrics to SageMaker’s own metrics store or CloudWatch.
They already have an integrated solution for training, so what could be next?
“We are working on offering an integrated solution for Amazon SageMaker with Hugging Face Inference DLCs in the future - stay tuned!”
This is pretty exciting news. Transformer models in NLP show tremendous promise and Hugging Face and AWS Sagemaker are making it very simple to train such models. We would recommend reading through the article and skimming through some of the resources if you’re interested in learning more.
In our last issue, we highlighted the AI Infrastructure Alliance, in which many startups are coming together to build a canonical stack for ML (as an alternative to the offerings from the big cloud providers). This news from Hugging Face and AWS shows that the world of MLOps is in a bit of a free-for-all, which can only be a good thing for ML practitioners.
When dealing with unstructured data (e.g. raw text, images, videos), training models from scratch is often not feasible in real-world scenarios for two reasons. First, privacy considerations mean that we might not have access to the raw data. Second, unstructured data is incredibly high dimensional, and training high-quality models on it requires more compute, larger datasets and more time. In such cases, embedding models (which map the raw input to a low-dimensional dense vector representation) can be a great tool.
Pretrained models for text and images are available out of the box (e.g. torchvision.models or Hugging Face). By exposing these models as an API endpoint, the team built a system to generate and log embeddings.
The team then trained the ML models for their end use case by using these embeddings as feature vectors. This ensured that the data was used in a privacy-compliant way and the model training and iterations were faster compared to training end to end models from scratch.
The tradeoff here, of course, is the inability to backpropagate into the embedding model to fine-tune the embeddings for their specific downstream tasks.
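The trunk-plus-downstream pattern described above can be sketched end to end in plain Python. The `embed()` stub below stands in for a frozen pretrained model (in practice, a model behind an API endpoint), and the downstream “model” is a simple nearest-centroid classifier trained only on embeddings; all names and the embedding scheme are purely illustrative.

```python
# Sketch of the pattern: a frozen pretrained "trunk" produces embeddings,
# and a lightweight downstream model is trained on those embeddings rather
# than on the raw data. embed() is a toy stand-in for a real model.

def embed(text: str) -> list:
    """Stand-in for a pretrained embedding model: maps raw input to a
    small dense vector. A real system would call e.g. a Hugging Face model."""
    vec = [0.0, 0.0, 0.0]
    for i, ch in enumerate(text.lower()):
        vec[i % 3] += ord(ch) / 1000.0
    return vec

def train_centroids(labeled_texts):
    """Downstream 'model': the mean embedding of each class (nearest-centroid)."""
    sums, counts = {}, {}
    for text, label in labeled_texts:
        e = embed(text)
        s = sums.setdefault(label, [0.0] * len(e))
        for i, v in enumerate(e):
            s[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in s] for label, s in sums.items()}

def predict(centroids, text):
    """Assign the label whose centroid is closest to the input's embedding."""
    e = embed(text)
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2 for a, b in zip(centroids[lbl], e)))

centroids = train_centroids([("aa", "short"), ("zzzzzz", "long")])
label = predict(centroids, "ab")
```

Note that only `train_centroids` and `predict` ever need to run on the team’s side; the raw data can stay behind the embedding endpoint, which is exactly the privacy benefit discussed above.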
Thoughts & Open Questions
The general approach highlighted in Neal’s article is a great summary of the best practices emerging in industry for content understanding: a generic “trunk” model that learns good general-purpose embeddings of raw content, combined with a multitude of task-specific downstream models built on top of the embedding representation and fine-tuned for the specific task (e.g. Copy.ai is powered by GPT-3; Monzo uses Hugging Face models).
As alluded to in the article, embeddings don’t entirely solve privacy concerns, as they aren’t strictly one-way hashes. Prior research (which we covered here) has shown that the raw input can sometimes be partially recovered from embeddings, so it is important to evaluate this explicitly for your individual use case.
One limitation that we believe is yet to be solved (at least we’re not aware of a good solution) is a clean way to support versioning and experimentation on underlying embeddings, especially if multiple downstream models depend on it as input features.
This is a paper from the team behind MonetDB and DuckDB (the latter being one of the most exciting new databases of the last couple of years). If you’re interested in learning about DuckDB, check out this YouTube video from one of its creators. Now, let’s discuss the paper itself. The authors say that they:
“integrate unchanged machine learning pipelines into an analytical data management system. The entire pipelines including data, models, parameters and evaluation outcomes are stored and executed inside the database system. Experiments using our MonetDB/Python UDFs show greatly improved performance due to reduced data movement and parallel processing opportunities.”
What is the problem they are trying to solve?
The authors say that current ML workflows have the following problems when it comes to data management:
Managing large datasets as flat files is error-prone, and multiple people working on the same datasets leads to further issues
Loading data from structured data formats (such as CSV and XML) is inefficient, and often the data needs to be loaded multiple times
These problems can be solved by existing relational database management systems (RDBMS), but integrating analytical tools with databases has proven tricky. This is because:
The standard approach of storing data in a separate database and communicating over a socket connection becomes a bottleneck with large amounts of data
On the other hand, in-database processing techniques are cumbersome, and rewriting analytical pipelines into SQL remains a research problem.
Contributions of the Paper
In this paper, they show:
Classification models (such as the ones provided by scikit-learn) that can be trained within a column-store RDBMS using a combination of a Python User-Defined Function (UDF) and SQL (see image earlier for an example)
Storage of models within the database, which can then be used for testing and future predictions
Performance benefits of running the end-to-end analytical workflow within the database. This is compared against reading raw data from files or a separate database, followed by pre-processing, model training and inference in a Python environment.
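The paper’s experiments use MonetDB/Python UDFs, which we can’t reproduce here; as a rough, stdlib-only illustration of the same idea (data, trained model, and predictions all living inside the database, with inference exposed as a Python UDF callable from SQL), here is a sketch using sqlite3 and a toy threshold “model”. Everything below is illustrative, not the paper’s code.

```python
import pickle
import sqlite3

# Training data lives in a table, the trained model is stored back into the
# database as a blob, and predictions run via a Python UDF inside SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, label INTEGER)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(0.1, 0), (0.4, 0), (1.2, 1), (1.9, 1)])

# "Train" next to the data: learn a decision threshold from the table.
rows = conn.execute("SELECT x, label FROM samples").fetchall()
pos = [x for x, y in rows if y == 1]
neg = [x for x, y in rows if y == 0]
model = {"threshold": (min(pos) + max(neg)) / 2}

# Store the model alongside the data, as the paper advocates.
conn.execute("CREATE TABLE models (name TEXT, blob BLOB)")
conn.execute("INSERT INTO models VALUES (?, ?)",
             ("thresh_v1", pickle.dumps(model)))

# Later: load the model back and expose inference as a SQL-callable UDF,
# so predictions happen without moving the data out of the database.
loaded = pickle.loads(conn.execute(
    "SELECT blob FROM models WHERE name = ?", ("thresh_v1",)).fetchone()[0])
conn.create_function("predict_udf", 1,
                     lambda x: int(x > loaded["threshold"]))
preds = conn.execute(
    "SELECT x, predict_udf(x) FROM samples ORDER BY x").fetchall()
prediction = int(1.5 > loaded["threshold"])
```

The point, as in the paper, is that no data ever crosses a socket to an external analytics process; the “model registry” here is just another table.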
Given how much data is being stored in analytical databases (Snowflake, Amazon Redshift, Google BigQuery), there seems to be growing interest in bringing non-SQL workloads to these databases. BigQuery appears to be the furthest ahead today, with support for creating deep learning models with TensorFlow directly inside BigQuery.
However, whether such approaches will take off remains an open question. If you’re working on problems involving tabular data, this might be worth a try. We remain intrigued by these ideas and will follow developments closely.
This is a slightly depressing Twitter thread from Eric Topol. He discusses an article from Nature Machine Intelligence showing that of more than 2,000 studies applying machine learning models to the detection and prognostication of COVID-19 from chest X-rays and CT images, not a single one is of “potential clinical use due to methodological flaws and/or underlying biases.”
While the enthusiasm to do interesting work in Machine Learning remains high, producing valuable research remains difficult. We hope that clear documentation and well-defined processes become the norm both in industry and academia!
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. This is only Day 1 for MLOps and this newsletter and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup (open to DMs as well) or email us at firstname.lastname@example.org