BASELINE • January 2021

An ML Newsletter from Novetta

Welcome to the January 2021 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month we cover the following topics:

  • Measuring bias in pre-trained language models
  • Microsoft’s DeBERTa AI, which outperforms humans in natural language understanding
  • Creating images from text using DALL-E from OpenAI

StereoSet: Measuring Stereotypical Bias in Pre-Trained Language Models

Using pre-trained language models has become a useful and almost necessary part of ML workflows that involve natural language processing. As models move from the research lab to the real world, such as BERT’s incorporation into Google’s search engine, ML researchers are exploring the biases that might be captured in these networks during training. Many prominent language models are trained on large public corpora, like documents from Wikipedia or Reddit, and are therefore susceptible to the prejudices that accompany these Internet postings.

StereoSet provides an open-source method for testing the fairness of pre-trained models, offering a Context Association Test (CAT) and associated data set that checks for stereotypical biases in categories centered on race, religion, profession, and gender. This test evaluates for bias on an inter-sentence and intra-sentence level by supplying “fill-in-the-blank” style sentences (i.e. “___ are good at math.”) and several sentence completion options (“women”, “Asians”, “round”) for which models must select the most appropriate option for finishing the statement. StereoSet then scores models for bias based on whether they consistently select options that are stereotype-consistent, stereotype-inconsistent, or irrelevant to the sentence context. Through this methodology, bias evaluations for popular models, like BERT, XLNet, GPT-2, and RoBERTa are publicly available, and practitioners can use StereoSet to develop confidence in the models they select, develop, and deploy.

DeBERTa AI: Deeper Comprehension of Everyday Speech

One main goal of AI language models is to generate models that have the ability to understand natural language similar to the way humans do. The SuperGLUE benchmark is a series of tasks that can assess a language model’s ability to understand and comprehend everyday speech. These tests include questions that measure a model’s comprehension of language as well as test reasoning skills that are required to answer open-ended questions. The human baseline score on the SuperGLUE benchmark is 89.8. Recent improvements to Microsoft’s DeBERTa model surpassed this with a score of 90.3.

It is important to note that although the model performs at state of the art levels on this benchmark, this does not indicate that the model can reach human-level intelligence. For example, DeBERTa must still work to leverage knowledge from unrelated subtasks in order to solve problems and generate answers the way that humans do. What it does mean is that DeBERTa is a state-of-the-art model when it comes to natural language understanding. Microsoft is releasing the DeBERTa model and the source code to the public so that ML practitioners can begin to incorporate the model in their own work. Given its impressive performance, we are exploring how the model may be used to improve downstream NLP tasks such as question-answering and text classification problems, which would increase the efficiency of analyzing large amounts of text.

DALL-E: Creating Images From Text

Recent developments in transformer language models have improved the ability to generate synthetic text (GPT-3) and to predict pixel values that complete partial images (Image GPT). DALL-E extends these advances and is able to generate images from text captions. The 12-billion parameter transformer model from OpenAI was trained on text-image pairs and is able to:

  • create new images from scratch that match a provided text description
  • generate anthropomorphized versions of animals and objects
  • combine unrelated concepts into a single image
  • render text in images
  • apply transformations to existing images such as adding objects or changing the color, visual perspective, or style
  • make use of temporal and geographical knowledge

DALL-E is impressive in the way that it can resolve ambiguities in input descriptions. When prompted with a partial image and text description, DALL-E is able to fill in the details in a way that is consistent with both. For example, when given an image featuring the top of a person’s head and a famous figure’s name, the model was able to fill in the details of the correct face and match the rotation angle of the head in the partial image.

DALL-E has its limitations which are most notable when the complexity of the input text increases, such as when more than three colors are given or uncommon spatial relationships are specified. Despite this, the model is a step forward in multimodal processing and we can foresee tools such as this one eventually being used in design, training data generation/augmentation, and even facial recognition technology. Details on the model architecture and training procedure will be provided by OpenAI in an upcoming paper.

Some examples of output images generated by DALL-E, along with their text inputs, are shown below:

This research was performed under the Novetta Machine Learning Center of Excellence.


Amber Chin
Mady Fredriksz
Shauna Revay, PhD