Issue #14: AI Index Report 2021. Multimodal Neurons. AI Infra Alliance. Similarity Search
Welcome to the 14th issue of the MLOps newsletter. We’ve been writing this newsletter for six months now and have honestly been surprised by how many wonderful folks (by that, we mean you!) choose to read this. We remain equally excited for the next six months.
In this issue, we look at charts and numbers from the AI Index Report 2021, cover some fascinating research from OpenAI, news of the AI Infra Alliance, Vector Databases and much more. Thank you for subscribing. If you find this newsletter interesting, tell a few friends and support this project ❤️
This is an annual report from the Stanford Institute for Human-Centered AI. Clocking in at 222 pages, it spans everything from AI R&D and technical performance, to trends in AI education and ethics, to the impact of Covid-19 on AI development. Here we’ll highlight a subset of topics, but we’d recommend skimming the full report if you have the time.
Research and Technical Performance
The number of AI publications grew by 34.5% from 2019 to 2020, compared to 19.6% from 2018 to 2019.
Performance of generative AI systems (text, audio, images) has improved to a sufficiently high degree that humans have a hard time telling the difference between synthetic and non-synthetic outputs for certain applications. That being said, there has been a significant improvement in the results for the Deepfake Detection Challenge created by Facebook.
Performance on image benchmarks seems to be flattening out, which suggests that harder benchmarks are needed (for example, Top-5 accuracy on ImageNet has reached 98.8% when using additional training data).
NLP continues to make massive gains against the state-of-the-art with advances like GPT-3, but also with systems achieving near-human performance on language understanding tasks such as SuperGLUE.
AlphaFold from DeepMind made a significant breakthrough on the long-standing challenge of protein folding; more broadly, AI has had a major impact on biology and healthcare.
AI and the Economy
“Drugs, Cancer, Molecular, Drug Discovery” received the greatest amount of private AI investment in 2020, with more than USD 13.8 billion, 4.5 times higher than in 2019.
More private investment in AI is being funneled into fewer startups. 2020 saw a 9.3% increase in the amount of private AI investment from 2019 (compared to 5.7% in 2019 from 2018), though the number of newly funded companies decreased for the third year in a row.
Brazil, India, Canada, Singapore, and South Africa are the countries with the highest growth in AI hiring from 2016 to 2020.
Despite the economic downturn caused by the pandemic, half the respondents in a McKinsey survey said that the coronavirus had no effect on their investment in AI, while 27% actually reported increasing their investment.
Check the chart below for the growth in AI job postings in different industries:
AI Policy and National Strategies
More than 30 countries have published documents discussing national AI strategy since Canada was the first to do so in 2017.
US Federal civilian (non-defense) agencies allocated ~1.1B USD for AI research and development, while some reports suggest that defense R&D spending on AI might be closer to 5B USD.
In the US, the recently ended 116th Congress (January 3, 2019 – January 3, 2021) was the most AI-focused congressional session in history, with the number of AI mentions in legislation and congressional reports more than triple that of the 115th Congress.
At a very high level, the charts in the world of AI seem to be going “up and to the right” along almost all meaningful dimensions. The next decade will remain an exciting time to contribute to AI research, applications and strategy.
In the last issue of our newsletter, we briefly touched upon the multimodal neurons research by OpenAI. We cover this topic in more detail here. Recently, OpenAI released CLIP, a new state-of-the-art visual understanding model that outperforms existing vision systems on datasets like ImageNet and ObjectNet. As a follow-up, OpenAI shared their observations around the existence of “multimodal neurons” in CLIP:
We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually.
Typically, single neurons in an artificial neural network fire for a visual cluster of ideas - e.g. “edge detectors”, “face detectors”, etc. This finding is novel in that the neuron responds to a semantic cluster of ideas represented across a variety of forms (sketch, picture, text, etc.) by forming abstractions. However, as noted in the paper, the degree of abstraction in multimodal neurons can present new attack vectors and introduce sources of bias that haven’t manifested in previous systems.
The authors observed that the excitations of multimodal neurons in CLIP can be controlled via the model’s response to images of text. This introduces a simple attack vector they call “typographic attacks”: fooling the model into misclassifying an image by overlaying adversarial text unrelated to the underlying image. For example, in the image shared above, by overlaying “$$$” on an image of a poodle, the model is fooled into thinking the image is a piggy bank:
“The finance neuron, for example, responds to images of piggy banks, but also responds to the string “$$$”. By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.”
Bias and overgeneralization
Another unintended consequence of abstraction is new sources of bias stemming from overgeneralization: the model learns associations between concepts because of the associations present in the underlying training data. As noted in the article, here are some such associations uncovered during CLIP’s evaluation:
A “Middle East” neuron has an association with terrorism;
An “immigration” neuron fires when the input contains Latin America;
A neuron fires for both dark-skinned people and gorillas.
Both adversarial attacks like the typographic attack mentioned above and bias from overgeneralization present challenges to real-world adoption. While typographic attacks can at least be formulated as an adversarial learning problem, issues arising from bias and overgeneralization are even more challenging: the examples shared above give us anecdotal evidence of bias, but it is hard to measure or quantify, because the exhaustive set of all possible biases is impossible to anticipate in advance. We thank the authors for sharing these insights, and agree with their view on the importance of building a robust toolkit to study interpretability:
We believe that these tools of interpretability may aid practitioners the ability to preempt potential problems, by discovering some of these associations and ambiguities ahead of time.
What is it?
We’ve written in the past about the wide variety of tools and products that comprise the enterprise AI landscape today (check out the Awesome MLOps repo, MLOps Tooling landscape). As a first order approximation, we can consider two axes along which to categorize these solutions:
Part of the ML lifecycle (data labeling, model training, inference, monitoring)
Breadth of support (tied to a single cloud, support across cloud providers, VPC etc)
With this Cambrian explosion of tools, teams often struggle to stitch together the right set of tools to create an end-to-end ML training & deployment workflow. To tackle this issue, more than 20 startups are coming together to form the AI Infrastructure Alliance. Their goal is to define the “canonical stack” for enterprise Machine Learning: a set of common practices and standards for cross-platform support that these companies can build towards.
Dan Jeffries (who works at Pachyderm) will serve as director of the alliance and has previously written about the motivating problem in this post on a Canonical Stack (CS) for machine learning. This article reporting on the alliance quotes him:
“In a conversation with VentureBeat, Jeffries referred to the endeavor for small to medium-size businesses in AI as a “rebel alliance against the empire” that will serve as an alternative to offerings from Big Tech cloud providers, which he characterized as “building an infrastructure just to lock you in.””
We think the problem of defining common standards & practices that allow startups to build towards interoperability is a real one. If this gains adoption, then over time larger cloud platforms like AWS SageMaker will want to bake these standards into their ML offerings, leveling the playing field and reducing complexity for the end customer. All in all, a big accelerant for ML adoption.
The alliance has so far laid out its goals in broad strokes. We look forward to more concrete technical details about the proposed “canonical stack” and the standards of interoperability surrounding it.
We do wonder whether the Big Tech cloud providers (think SageMaker and GCP Cloud ML) are really the “empire”. While they have MLOps offerings, there are startups both large and small trying to attack different parts of the pie. Will the “rebel alliance” be strong enough given the competitive dynamics between its members? Will the alliance continue to accept newer rebels? We have so many Star Wars themed questions.
Having said this, we continue to believe that AI & ML are going to create tremendous value over the next two decades and the journey is likely to be a positive sum for most participants.
Lina Weichbrodt, a Machine Learning engineer based in Germany, recently wrote an insightful article on the importance of domain-specific metrics when monitoring ML models. We recommend reading it in its entirety but share our key takeaways and thoughts.
The case for domain-specific metrics
Monitoring ML models, much like monitoring any online software service, is important for a real-time “pulse check”. One question we need to answer as part of any ML monitoring solution is “what should we measure?”. In addition to the well-defined and well-understood metrics around data drift, concept drift, etc., Lina makes the case that it is important to track metrics that are as closely related to business/product outcomes as possible. For instance, at Spotify, the homepage personalization team tracked the “rank of a user’s most used carousel”: if a new model ranks the user’s favorite carousel low, or a sudden drop in the rank occurs, it indicates a problem.
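To make the idea concrete, here is a minimal sketch of such a domain-specific metric. All names and data below are illustrative (not taken from the article or Spotify): we compute, per user, the position the current model assigns to that user’s historically most-used carousel, then aggregate it into a single number to monitor.

```python
def favorite_carousel_rank(ranked_carousels, favorite):
    """Position (1-based) of the user's historically most-used carousel
    in the ranking produced by the current model."""
    return ranked_carousels.index(favorite) + 1

# One model run over three hypothetical users.
rankings = {
    "user_a": ["discover", "favorites", "new_releases"],
    "user_b": ["favorites", "discover", "podcasts"],
    "user_c": ["podcasts", "new_releases", "favorites"],
}
favorites = {"user_a": "favorites", "user_b": "favorites", "user_c": "favorites"}

ranks = [favorite_carousel_rank(rankings[u], favorites[u]) for u in rankings]
mean_rank = sum(ranks) / len(ranks)  # 2.0 here; a sudden jump would flag a problem
```

A dashboard or alert on `mean_rank` over time is far closer to the product outcome (did users see what they actually use?) than a generic drift statistic.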
Insights lie at the extremes
When designing domain-specific metrics for monitoring ML models, it helps to think of possible failure cases and design metrics that can capture the occurrence of failures even if they are infrequent. As an analogy from DevOps, we track not just the average or median latency but also the P95 and P99 latencies, as this can help detect problems earlier.
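The latency analogy is easy to demonstrate with simulated data (the numbers below are made up): a small fraction of slow requests is invisible in the median but jumps out at the tail percentiles.

```python
import random
import statistics

random.seed(1)
# Simulated request latencies (ms): 990 healthy requests plus 10 pathological ones.
latencies = [random.gauss(50, 5) for _ in range(990)] + [500.0] * 10

median = statistics.median(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]  # 99th percentile

# The median still looks healthy (~50 ms); the P99 exposes the failures.
```

The same reasoning applies to ML metrics: aggregate over the worst-served slice of users, not just the average user.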
Metric Maturity Cycle
This insight holds not just for metrics related to ML monitoring, but really for any metric you want to build and operationalize: it helps to think of the maturity cycle:
Step 1: Research & Analysis - the first “implementation” of a metric. The goal is to verify correctness and confirm that tracking it would be valuable if productionized.
Step 2: Offline/batch compute - typically how most metrics first get productionized. A data pipeline runs at regular intervals to fetch the upstream data, compute the metric, and store it for consumption downstream (in dashboards, email reports, etc.).
Step 3: Realtime - the ideal stage for any metric. Computation happens in real time, often using a streaming data abstraction. Realtime metrics are often accompanied by triggers and alerts that proactively notify relevant stakeholders in case of outliers.
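The core of steps 2 and 3 can be sketched in a few lines. This is a hypothetical example (the metric, thresholds, and data are all illustrative): a scheduled job recomputes the metric, then compares it against recent history to decide whether to alert, which is the same check a streaming system would run continuously.

```python
import statistics

def compute_metric(predictions):
    """Illustrative metric: share of predictions falling back to a default."""
    return sum(p == "default" for p in predictions) / len(predictions)

def check_and_alert(metric_value, history, threshold_sigmas=3):
    """Flag the new value if it deviates sharply from the recent history."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return abs(metric_value - mean) > threshold_sigmas * stdev

history = [0.02, 0.03, 0.025, 0.02, 0.03]                # previous pipeline runs
today = compute_metric(["a", "default", "b", "c"] * 25)  # 25% fallback rate
should_page = check_and_alert(today, history)            # True: well outside history
```

Moving from step 2 to step 3 mostly changes *when* `check_and_alert` runs (per event instead of per batch), not the logic itself.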
We recently came across a company called Pinecone and the product they’re building: a vector database. This definitely feels like a missing piece of MLOps infrastructure. Data scientists often work with vectors, and while vectors can be saved and retrieved relatively easily in relational and NoSQL databases, a “vector database” can go above and beyond by providing similarity search as a service. Another analogy: ElasticSearch for vectors (rather than raw text).
While an ML team can build something similar in-house by deploying a library like Faiss or Annoy (both of which are good for nearest-neighbour search), a hosted version that takes care of the sometimes painful fine-tuning of these libraries could be valuable.
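To illustrate the underlying operation these systems provide, here is a brute-force cosine-similarity search in NumPy (the data is synthetic and the function names are our own). Libraries like Faiss and Annoy, and hosted services built on them, accelerate exactly this query with approximate indexes so it scales to millions of vectors.

```python
import numpy as np

# Hypothetical corpus of item embeddings (in practice these would come
# from a trained model): 1000 items, 64 dimensions.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 64))

def top_k_similar(query, corpus, k=5):
    """Exact nearest-neighbour search by cosine similarity."""
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    scores = corpus_norm @ query_norm   # cosine similarity against every item
    return np.argsort(-scores)[:k]     # indices of the k best matches

query = corpus[42]                     # an item should match itself first
neighbors = top_k_similar(query, corpus, k=5)
```

The brute-force version is O(n) per query; the value a vector database adds is keeping this fast, updatable, and tuned without the ML team operating the index themselves.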
ML in Covid-19 Research: A sobering Twitter thread
Thanks for making it to the end of the newsletter! This has been curated by Nihit Desai and Rishabh Bhargava. This is only Day 1 for MLOps and this newsletter and we would love to hear your thoughts and feedback. If you have suggestions for what we should be covering in this newsletter, tweet us @mlopsroundup (open to DMs as well) or email us at firstname.lastname@example.org