BASELINE • December 2021

Newsletter from Novetta

Two years ago we launched Baseline to provide insight into machine learning topics that we found interesting. Since then, a large group of authors has covered a broad range of topics relevant to the Defense and Intelligence Communities.

With this anniversary, we reflect on important innovations of the past year and anticipate things to come in 2022.

2021 Highlights

Synthetic Data Generation

Obtaining quality datasets continues to be a challenge in the development of many AI models. Improvements in synthetic data generation techniques that produce realistic training data can lower the burden of dataset curation. This year OpenAI announced DALL-E, a model that generates images from text descriptions and can apply transformations to existing images in realistic ways. Additionally, NVIDIA announced the Omniverse Replicator, a synthetic data generation engine designed to produce training data for deep learning models. Models trained on synthetic data still do not reliably generalize to real-world data at inference time, but the new models and tools released this year demonstrate the progress the field is making toward that goal.

ML Optimization

Moving machine learning models from development into production systems at scale usually involves an optimization step, a process that has previously required specialized knowledge of software and hardware systems. Tools such as Hugging Face Optimum, Hugging Face Infinity, and NVIDIA TensorRT help automate the optimization process and enable practitioners with less specialized backgrounds to apply it. These tools will allow production-ready models to be developed more rapidly and contribute to the overall democratization of machine learning.

GitHub Copilot: AI Written Code, Friend or Foe

In June, Microsoft-owned GitHub released the AI-assisted programming service GitHub Copilot as a technical preview. The service is powered by OpenAI’s Codex, which was built by fine-tuning OpenAI’s GPT-3 model on publicly available GitHub repositories. Codex can generate Python code from natural language descriptions, which allows Copilot to suggest individual lines of code and even entire functions. It has since been integrated into the Visual Studio Code editor.

Copilot is testing new waters as it begins to blur what it means to be a programmer in a world built on code. If code is written with the help of Copilot, who is the author? Should credit go to the programmer, the authors of the code the model was trained on, or the model itself? While somewhat philosophical, these questions become practical concerns when it comes to potential code vulnerabilities and product licensing. On the other hand, Copilot has the potential to be an ally in the fight against software vulnerabilities by helping human programmers identify and fix them sooner. We are excited to see how AI-driven technologies like Copilot can assist programmers in writing code that is safer and more efficient.

Open Sourcing of AlphaFold 2.0

Predicting the 3D structure of proteins has been an open problem in biology for years. Historically, this has been a time-consuming process, one that computational advances have greatly improved. DeepMind’s 2018 release of AlphaFold 1.0, a neural network trained on over 170,000 proteins, helped researchers address this problem while avoiding expensive lab fees and reducing the time spent on protein structure determination from months to days, though the source code was kept private. Their recent update, AlphaFold 2.0, operates up to 16 times faster, reducing prediction time to as little as a few minutes. The release was accompanied by the launch of an online database of over 250,000 proteins, nearly double what had been studied before. DeepMind followed this with plans to release predicted structures for over one hundred million more proteins. The open-sourcing of this tool is perhaps one of the most important advances in computational biology and one of the largest achieved by AI-assisted technology.

What We’re Excited About in 2022

Continued Rigor for Responsible AI

As machine learning and AI systems are integrated into nearly every industry, companies continue to reflect on the ways that AI is implemented, and scrutiny of responsible and ethical AI implementations has increased. Large tech companies such as Google and Amazon have released AI principles to help support dialogue, standards, and policy surrounding this issue. On the technological front, Twitter borrowed the concept of “bounties,” more traditionally used in the cyber realm, to give users incentives to identify algorithmic bias in its image-cropping algorithm. Additionally, Amazon introduced Clarify for its SageMaker platform, which automatically detects bias during data preparation and in trained models, making it easier for developers to adhere to ethical practices. We expect continued rigor to be developed in this field as a way to keep responsible AI at the forefront of the minds of developers and of those responsible for implementing ML models in practice.

Multimodal Machine Learning

Deep learning algorithms have achieved strong performance on vision and natural language tasks in recent years by focusing exclusively on one of these modalities per model. In practice, many applications involve multiple streams of data in the form of images, text, and audio. Being able to make use of several forms of data for a given problem is highly desirable and should boost performance, since joint representations of multiple modalities can yield richer features. This past year, we’ve seen several advances in multimodal deep learning. Amazon’s ML Solutions Lab developed a scalable, multimodal approach for event detection on sports video data using Amazon SageMaker. Facebook AI’s Unified Transformer demonstrated the ability to jointly learn multiple tasks covering multiple modalities using a transformer-based architecture. Google recently announced plans to develop Pathways, a single model capable of performing thousands of multimodal tasks. These developments demonstrate a move away from traditional, unimodal models and toward general and flexible algorithms. We anticipate further developments in this area in 2022.

Harnessing Large, Unlabeled Datasets

There are many use cases where large datasets exist but lack hand-crafted labels. For example, large unstructured text collections or unlabeled satellite imagery may be relatively easy to come by, and methods such as self-supervised learning have shown that these datasets can be harnessed for pre-training to improve the performance of deep learning models. Advances such as CLIP from OpenAI and SimCLR have creatively constructed predictive tasks for pre-training that improved performance on downstream tasks. In October, Google AI released SimVLM, a weak supervision method for pre-training, which they applied to the classification of medical imagery and which improved upon supervised pre-training methods. We expect this trend to continue, with models harnessing information from large datasets without the need for manual labeling and curation in order to speed up training and improve accuracy on downstream classification tasks.

Automated Machine Learning

As machine learning (ML) continues to expand into more industries, it is becoming increasingly important to make the process of developing ML models faster and more efficient. Automated ML (AutoML) tools automate time-consuming tasks in the model development pipeline, which eases the job of ML practitioners and allows those with little to no domain expertise to use ML for their businesses and projects. The latest AutoML tools automate development tasks such as data cleaning, cross validation, and neural architecture search. We’ve seen these tools emerge in the open-source arena with AutoGluon, and become standard offerings from popular cloud providers with Amazon SageMaker and Microsoft Azure. Looking ahead, we expect the ML development pipeline to become more streamlined and hope this will create opportunities for ML to have a more far-reaching impact in the next year.

This research was performed under the Novetta Machine Learning Center of Excellence.


Shauna Revay, PhD
Brian Sacash
Xena Grant
Carlos Martinez