BASELINE • March 2021

An ML Newsletter from Novetta

Welcome to the March 2021 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month we cover the following topics:

  • OpenAI’s discovery of multimodal neurons in artificial neural networks
  • Transforming computer vision with Facebook AI’s self-supervised learning model SEER
  • MIT’s new algorithm for certifying adversarial robustness in deep reinforcement learning
  • Facebook AI’s Unified Transformer model for multi-modal, multi-task learning

Artificial Neural Networks Can Now Think in the Abstract

The human brain can recognize the same concept across many different representations, whether it appears in a photograph, in a drawing, or as written text. This ability to abstract concepts is made possible by what are known as multimodal neurons. Until about two months ago, we did not think models could do this effectively; then researchers at OpenAI discovered artificial multimodal neurons in their model CLIP. Announced in January 2021, CLIP is a neural network that learns visual concepts from images paired with natural language. Using interpretability tools, researchers identified neurons that respond to the same concept whether it appears in a natural image or in a synthetic one. For example, one such neuron fired for both pictures of spiders and sketches of Spider-Man. This interpretability work gives us a better understanding of the visual concepts that CLIP learns and that contribute to its classifications. Consider the image below, which shows selected neurons from the final layer of several CLIP models and gives an idea of how the model visually distinguishes between the concepts of “summer” and “winter”. Notice that more of the images in “winter” have snow and barren branches, while the “summer” photos all include warmer colors.

This interpretability work also serves as a tool for detecting bias in the model: the researchers noted the model’s proclivity towards stereotypical depictions of some concepts, such as regions and emotions. To help ML researchers better understand CLIP’s behavior, the OpenAI Microscope catalog has been updated with feature visualizations for every neuron in CLIP RN50x4. Analyzing and discovering model behaviors in this way will be important for ML researchers going forward as models continue to become more sophisticated.

SEER: The start of a more powerful, flexible, and accessible era for computer vision

The introduction of self-supervised learning has been a paradigm shift in machine learning because it removes the requirement for labeled training data. In recent years, it has transformed the field of natural language processing (NLP), where large models pre-trained on vast, unlabeled text datasets have been instrumental in achieving state-of-the-art (SOTA) performance on downstream tasks like machine translation and question answering. Researchers at Facebook AI have developed SEER (SElf-supERvised) to bring the same approach to computer vision. SEER is a billion-parameter self-supervised computer vision model that is pre-trained on random, unlabeled public Instagram images. It outperforms SOTA self-supervised models with 84.2% top-1 accuracy on ImageNet, and it achieved 77.9% top-1 accuracy when fine-tuned on labels from just 10% of ImageNet’s data. The model was developed using VISSL, Facebook AI’s new open-source PyTorch library dedicated to self-supervised learning. VISSL was released with this paper and contains 60 pretrained models and a benchmark suite. This recent contribution by Facebook AI demonstrates that self-supervised learning can perform well on real-world computer vision tasks without relying on highly curated, human-annotated datasets. This will allow the community to train models on more widely available, diverse datasets, which paves the way for fairer and more flexible models in the future.

Above are some examples of public Instagram images that make up the training data (source)
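
For readers less familiar with the metric quoted above, top-1 accuracy is the fraction of images for which the model's single highest-scoring class matches the true label. The sketch below illustrates the calculation only; the function name and the toy scores are made up for this example and have nothing to do with SEER's actual outputs.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of examples whose highest-scoring class equals the true label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")

    correct = 0
    for row, label in zip(scores, labels):
        # The predicted class is the index of the largest score in the row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy, made-up scores for four images over two classes.
toy_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = [1, 0, 0, 0]
toy_accuracy = top1_accuracy(toy_scores, toy_labels)
```

Here three of the four toy predictions match their labels, so `toy_accuracy` is 0.75.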

New algorithm from MIT brings adversarial robustness to autonomous systems

The use of deep learning in autonomous tasks like self-driving vehicles has become widespread. However, the adoption of this technology in safety-critical domains depends on the ability of practitioners to provide a formal guarantee of adversarial robustness. Being robust to adversarial inputs, such as shifts in image pixels and inaccurate feedback, means that the system can continue to act optimally despite uncertainty in its environment. While adversarial inputs can be the result of malicious attacks, more commonly they stem from perturbations in input signals caused by faulty sensors or imperfect measurements. This type of noise can be detrimental to reinforcement learning (RL) systems, which typically make decisions on the assumption that their observations are ground truth. CARRL, a new deep RL algorithm developed by researchers at MIT, takes these perturbations into account, allowing the system to make optimal decisions based on worst-case assumptions. The algorithm determines a region of uncertainty for each input, evaluates the possible variations within that region with a deep Q-network, and selects the action with the best worst-case value. Researchers tested the algorithm in a simulated collision-avoidance task and the video game Pong. In both cases, CARRL was able to adjust to increasing levels of adversarial noise, outperforming standard machine learning techniques. This research is the first to formally certify network robustness in deep RL systems, which can help facilitate their application in real-world environments where interactions are unpredictable and safety is of utmost importance.
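
The worst-case action selection described above can be illustrated with a minimal sketch. This is not the researchers' implementation: it assumes a toy linear Q-function (a deep Q-network would need neural network verification bounds instead), an uncertainty region that is an L-infinity ball of radius `eps` around the observation, and hypothetical function names and weights chosen for illustration. For a linear Q-function the worst case over that ball has a closed form, which keeps the example exact.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def q_values(weights: Matrix, biases: Sequence[float], obs: Sequence[float]) -> List[float]:
    """Toy linear Q-function: one row of weights and one bias per action."""
    return [sum(w * x for w, x in zip(row, obs)) + b for row, b in zip(weights, biases)]


def worst_case_q_values(weights: Matrix, biases: Sequence[float],
                        obs: Sequence[float], eps: float) -> List[float]:
    """Exact lower bound on each action's Q-value when every observation
    component may be off by at most eps. For a linear Q-function the
    minimum over the region is Q(obs) - eps * sum(|w|)."""
    nominal = q_values(weights, biases, obs)
    return [q - eps * sum(abs(w) for w in row) for q, row in zip(nominal, weights)]


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def nominal_action(weights: Matrix, biases: Sequence[float], obs: Sequence[float]) -> int:
    """Standard choice: trust the observation as ground truth."""
    return argmax(q_values(weights, biases, obs))


def robust_action(weights: Matrix, biases: Sequence[float],
                  obs: Sequence[float], eps: float) -> int:
    """Robust choice: pick the action whose worst-case Q-value is highest."""
    return argmax(worst_case_q_values(weights, biases, obs, eps))


# Hypothetical two-action, two-feature example.
W = [[4.0, 0.0], [1.0, 0.0]]
b = [0.0, 2.0]
observation = [1.0, 0.0]

trusting_choice = nominal_action(W, b, observation)      # Q = [4.0, 3.0]
cautious_choice = robust_action(W, b, observation, 0.5)  # worst case = [2.0, 2.5]
```

In this toy setup the first action looks best when the observation is trusted, but it is more sensitive to the input, so the second action wins once a noise radius of 0.5 is taken into account. With `eps` set to 0 the two choices coincide.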

Below is a demonstration of CARRL’s performance in the computer game Pong, compared to a standard machine learning algorithm (with and without adversarial noise):

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer

Transformer model architectures have shown strong performance in multiple domains, most notably natural language processing and computer vision. There has been great success in taking pre-trained transformers and fine-tuning them to create task-specific models. However, this approach becomes inefficient for multi-task learning (where multiple tasks are learned by the same model during training), since each fine-tuned task requires its own set of parameters. Facebook AI recently sought to address this problem by creating a single transformer model, UniT (Unified Transformer), that can jointly learn important tasks across multiple domains using a shared set of parameters. When trained jointly on seven tasks over eight datasets, UniT achieved performance comparable to the current state-of-the-art in each domain. Because of its simple design, UniT can easily be extended to additional modalities. This work is a promising step toward the goal of developing general models that flexibly adapt to any purpose. Facebook AI plans to release the code for UniT on MMF (Multi-Modal Framework), a PyTorch-based framework that includes pre-trained state-of-the-art vision and language models, datasets, common model architecture components, and training/inference utilities.
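
To make the parameter-sharing argument concrete, the sketch below compares the parameter count of one fine-tuned model per task with that of a single shared backbone plus small task-specific heads. It is simple bookkeeping under stated assumptions, not UniT's architecture: the task names, backbone size, hidden dimension, and the use of a single linear layer per head are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class TaskSpec:
    name: str
    num_outputs: int  # e.g. number of classes predicted by the task head


def head_params(hidden_dim: int, num_outputs: int) -> int:
    """Parameters of one linear output head: weight matrix plus bias vector."""
    return hidden_dim * num_outputs + num_outputs


def separate_models_param_count(tasks: Sequence[TaskSpec],
                                backbone_params: int, hidden_dim: int) -> int:
    """One fine-tuned copy of the backbone, plus a head, for every task."""
    return sum(backbone_params + head_params(hidden_dim, t.num_outputs) for t in tasks)


def shared_model_param_count(tasks: Sequence[TaskSpec],
                             backbone_params: int, hidden_dim: int) -> int:
    """A single backbone shared by all tasks, plus one head per task."""
    return backbone_params + sum(head_params(hidden_dim, t.num_outputs) for t in tasks)


# Hypothetical tasks and sizes, chosen only to make the arithmetic visible.
tasks = [TaskSpec("sentiment", 2), TaskSpec("entailment", 3)]
separate_total = separate_models_param_count(tasks, backbone_params=1000, hidden_dim=10)
shared_total = shared_model_param_count(tasks, backbone_params=1000, hidden_dim=10)
```

With these toy numbers the separate models need 2055 parameters while the shared model needs 1055. The cost of the separate approach grows by a full backbone with every added task, whereas the shared approach grows only by the size of a head.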

Below is an example of UniT’s performance on common vision and language tasks such as object detection, sentiment analysis, and visual question answering:

This research was performed under the Novetta Machine Learning Center of Excellence.


Mady Fredriksz
Brian Sacash