BASELINE • May 2021

An ML Newsletter from Novetta

Welcome to the May 2021 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month we cover the following topics:

  • A speech recognition system that requires no transcribed audio data for training
  • An open-source method to test the security of AI systems
  • A new take on a simpler architecture for computer vision models
  • Ways to decrease the carbon footprint of large ML models

Completely Unsupervised Speech Recognition

Training speech recognition systems typically requires vast amounts of transcribed audio data, which limits their application to the handful of languages for which this data is available. Thousands of languages and dialects spoken around the world are not represented in labeled datasets, which blocks many people around the globe from using speech recognition technology. Researchers at Facebook AI sought to solve this problem by developing a speech recognition model that is completely unsupervised, meaning it requires no transcribed, labeled training data. Their method, wav2vec Unsupervised (wav2vec-U), is trained on speech audio and unpaired text.

The first step involved training a self-supervised model (a previous version called wav2vec 2.0) on unlabeled audio in order to learn the structure of human speech. They then trained a generative adversarial network (GAN) to associate the speech representations learned in the first step with their corresponding phonemes (distinct units of sound in a language). They also fed the GAN additional text examples to help it discriminate between real human text and nonsense.

When evaluated on the TIMIT benchmark dataset, wav2vec-U reduced the error rate of the next best unsupervised method by 57 percent. Compared to state-of-the-art supervised models (typically trained on close to 1,000 hours of transcribed data), wav2vec-U achieved comparable word error rates on the Librispeech (English) benchmark dataset. When tested on non-English and low-resource languages, wav2vec-U did not surpass supervised error rates, but still demonstrated strong performance without relying on labeled data. This work is a promising step toward expanding access to speech recognition technology globally, and provides a framework for creating high-performing unsupervised learning systems. The code has been released here.
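The benchmarks above are scored by word error rate (WER): the word-level edit distance between a reference transcript and the model's hypothesis, normalized by the reference length. A minimal, illustrative computation (our own sketch, not part of the wav2vec-U release):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

# One dropped word out of six reference words -> WER of 1/6
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```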

A full demonstration video is available here.

Making AI Systems More Secure with CounterFit

As an increasing number of AI systems are put into production, the security of those models has become a growing concern among AI practitioners. To address this, Microsoft's Azure team has developed CounterFit, an open-source tool for testing the security of AI systems. As a new addition to the AI red team arsenal, CounterFit provides penetration testing, vulnerability scanning, and logging for AI systems. The tool is agnostic to the data a model uses and to the model's internal workings, and it can assess models deployed on-premises or on the edge. CounterFit also incorporates existing adversarial AI frameworks, such as those implemented in the Adversarial Robustness Toolbox and TextAttack, allowing users to stress test their models against a range of adversarial attacks. Since adversarial attacks on AI are likely to become more frequent, using CounterFit to test and strengthen production-level AI systems will help make them more robust and dependable in delivering mission-critical insights and information.
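The adversarial attacks such frameworks implement typically perturb an input just enough to flip a model's prediction. As a toy illustration of the idea (our own sketch, not CounterFit's API), here is a fast-gradient-sign-style perturbation against a simple logistic-regression classifier:

```python
import numpy as np

def fgsm_perturb(x, w, b, y, eps):
    """One fast-gradient-sign step against a logistic-regression model.

    Moves x in the direction that increases the cross-entropy loss,
    nudging the model toward misclassifying it.
    """
    z = np.dot(w, x) + b
    p = 1.0 / (1.0 + np.exp(-z))  # predicted probability of class 1
    grad_x = (p - y) * w          # gradient of cross-entropy w.r.t. x
    return x + eps * np.sign(grad_x)

# A point the model classifies correctly as class 1...
w, b = np.array([2.0, -1.0]), 0.0
x, y = np.array([1.0, 0.5]), 1.0
x_adv = fgsm_perturb(x, w, b, y, eps=1.0)
print(np.dot(w, x) + b > 0)      # True: original scored as class 1
print(np.dot(w, x_adv) + b > 0)  # False: the perturbed point flips to class 0
```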

An All-MLP Architecture for Vision

Much of deep learning’s success in computer vision tasks over the years has been attributed to pivotal developments in model architecture, most notably convolutional layers in convolutional neural networks (CNNs) and more recently, attention layers in vision transformers (ViTs). A recent study by the Google Brain team tried to find out if high performance on vision tasks was dependent on these specialized layers by exploring what would happen if they replaced convolutions or self-attention layers with multi-layer perceptrons (MLPs). Their resulting model, MLP-Mixer, contains two types of layers interleaved: a channel mixing layer in which MLPs are applied independently to image regions (mixing features at each location); and a token mixing layer in which MLPs are applied across image regions (mixing spatial information). To assess the performance of MLP-Mixer and compare it to state-of-the-art (SOTA) CNNs and ViTs, they looked at the trade-off between the model’s computational cost (which consists of the total pre-training time and throughput in images/sec/core) and accuracy. They observed that MLP-Mixer was able to reach near SOTA performance when pre-trained on large datasets. The model was able to achieve strong, but not SOTA, performance when trained on smaller datasets with regularization. These findings demonstrate that specialized layers are not essential to high performance on vision tasks and may drive new possibilities for creative architectures that do not include convolutions or self-attention.
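The two mixing steps can be sketched in a few lines of NumPy. This is a simplified illustration with random weights; the published model also applies layer normalization before each MLP, which is omitted here:

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity used in MLP-Mixer
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w1, w2):
    """Two-layer MLP with GELU, applied along the last axis."""
    return gelu(x @ w1) @ w2

def mixer_layer(x, token_w1, token_w2, chan_w1, chan_w2):
    """One Mixer layer on x of shape (patches, channels), with skip connections."""
    # Token mixing: transpose so the MLP acts across image patches
    x = x + mlp(x.T, token_w1, token_w2).T
    # Channel mixing: the MLP acts on each patch's feature vector independently
    x = x + mlp(x, chan_w1, chan_w2)
    return x

rng = np.random.default_rng(0)
patches, channels, hidden = 16, 32, 64
x = rng.normal(size=(patches, channels))
out = mixer_layer(
    x,
    rng.normal(size=(patches, hidden)) * 0.1,
    rng.normal(size=(hidden, patches)) * 0.1,
    rng.normal(size=(channels, hidden)) * 0.1,
    rng.normal(size=(hidden, channels)) * 0.1,
)
print(out.shape)  # (16, 32): the layer preserves the input shape
```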

Above is the architecture of MLP-Mixer, which contains fully connected layers, mixer layers (each with a channel-mixing and token-mixing layer), GELU nonlinearity, and a classifier head.

Using GPT-3 For Training/Inference? Don’t Forget to Check the Energy Consumption Too

Since the explosion of language models like GPT-3, Meena, and BERT, the size of model architectures for Natural Language Processing (NLP) tasks such as text generation and question answering has increased. A by-product of training and running inference with these large models is increased carbon emissions and energy costs. In a recent paper by Google and the University of California, Berkeley, researchers suggest that the geographic location of data centers, the architecture of the model, and cloud computing are significant factors that can drastically affect the carbon emissions associated with compute resources. They propose that organizations choose cloud data centers over traditional data centers, since they are ~1.4-2X more energy efficient. For example, cloud data centers using hardware like Google Tensor Processing Units (TPUs), rather than traditional GPUs, can run faster and consume less energy while being more cost efficient. Cloud providers can also help organizations schedule model training at locations and times with lower carbon emissions (the level of which varies dramatically based on these factors). The choice of model architecture also plays a role in energy consumption, with sparse architectures consuming less than 1/10th the energy of their denser counterparts. The authors estimate that implementing the above changes could reduce the carbon footprint of training large models by ~100-1000X. Going forward, it will be important for model evaluations to include not only accuracy scores but also energy usage and CO2 cost metrics. Having these additional benchmarks in research papers would increase transparency and allow reductions in energy consumption to be recognized as state-of-the-art improvements.
This work showcases that simple changes can have a large effect on the carbon emissions and energy costs of training large models, without sacrificing overall performance.
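The arithmetic behind such footprint estimates is simple: energy is average power draw times runtime, scaled by the data center's power usage effectiveness (PUE) overhead, and emissions are energy times the local grid's carbon intensity. The numbers below are hypothetical, chosen only to show the calculation:

```python
def training_footprint_kg_co2e(avg_power_kw, hours, pue, kg_co2e_per_kwh):
    """Back-of-the-envelope carbon footprint of a training run.

    energy (kWh) = average power draw * runtime * data-center PUE
    emissions    = energy * grid carbon intensity
    Real values vary widely by hardware, facility, location, and time of day.
    """
    energy_kwh = avg_power_kw * hours * pue
    return energy_kwh * kg_co2e_per_kwh

# Hypothetical run: 100 kW average draw for 240 hours, PUE 1.1,
# grid intensity 0.4 kg CO2e per kWh
print(training_footprint_kg_co2e(100, 240, 1.1, 0.4))  # 10560.0 kg CO2e
```

This also makes the paper's levers concrete: a lower-PUE cloud facility shrinks the `pue` factor, a cleaner grid shrinks `kg_co2e_per_kwh`, and a sparser architecture shrinks the power-hours term.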

The figure above depicts the energy usage and CO2 equivalent emissions for recent large NLP models.

This research was performed under the Novetta Machine Learning Center of Excellence.


Mady Fredriksz
Jefferson Ridgeway