BASELINE • October 2020

An ML Newsletter from Novetta

Welcome to the October 2020 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month we cover the following topics:

  • Smaller and better NLP models
  • A new open source instance-level recognition model
  • Uncovering hidden biases in image datasets
  • The Octopus Test: A Turing test for modern language models

GPT-3 at a 99.9% discount

In the world of NLP language models, OpenAI’s GPT-3 holds the current spotlight. However, its impressive performance comes at a cost: a model size of 175 billion parameters, over 500x larger than Google’s popular BERT model. Fortunately, researchers have developed a technique that they claim can best GPT-3 using just 223 million parameters, a nearly 99.9% reduction in model size. Using an open source technique called Pattern-Exploiting Training, or PET, they fine-tuned the ALBERT language model to score 76.8 on the SuperGLUE benchmark, compared to GPT-3’s 71.8. PET is a semi-supervised technique: starting from only a few labeled examples, it generates the training data used in the final fine-tuning step. While the research team acknowledges that GPT-3 remains more flexible and can still generate longer sequences of text, this work shows that bigger may not always be better. Training methods like PET, applied to smaller models, might greatly reduce the size of high-performing language models, leading to faster models on cheaper hardware and making high-performing NLP solutions more accessible and affordable.
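The size and score comparisons above are easy to check by hand. The short sketch below redoes the arithmetic using the figures quoted in this section. One value is our assumption rather than something stated above: we take "BERT" to mean BERT-large, at roughly 340 million parameters.

```python
# Parameter counts and scores quoted in the section above.
GPT3_PARAMS = 175_000_000_000
PET_ALBERT_PARAMS = 223_000_000
PET_SUPERGLUE = 76.8
GPT3_SUPERGLUE = 71.8

# Assumption (not stated in the newsletter): "BERT" refers to BERT-large,
# which has roughly 340 million parameters.
BERT_LARGE_PARAMS = 340_000_000

# Percentage reduction in model size when going from GPT-3 to PET + ALBERT.
reduction_pct = (1 - PET_ALBERT_PARAMS / GPT3_PARAMS) * 100

# How many times larger GPT-3 is than the assumed BERT-large size.
gpt3_vs_bert = GPT3_PARAMS / BERT_LARGE_PARAMS

# Difference between the two reported SuperGLUE scores.
score_gap = PET_SUPERGLUE - GPT3_SUPERGLUE

print(f"Size reduction vs. GPT-3: {reduction_pct:.2f}%")
print(f"GPT-3 / BERT-large size ratio: {gpt3_vs_bert:.0f}x")
print(f"SuperGLUE gap: {score_gap:.1f} points")
```

Under these assumptions the reduction works out to about 99.87%, and GPT-3 is roughly 515 times the size of BERT-large, consistent with the rounded figures quoted above.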

Spherical Text Embeddings

Text embeddings are used in NLP to capture the semantics of textual units such as words and paragraphs. They matter because higher-quality embeddings generally lead to better accuracy on downstream tasks. Recent work by the University of Illinois at Urbana-Champaign and the Georgia Institute of Technology, in conjunction with the U.S. Army Research Laboratory, improves upon conventional Euclidean word embeddings by using a spherical generative model, Joint Spherical Embedding (“JoSE”). With this model, unsupervised word and paragraph embeddings are learned jointly, leading to better-quality embeddings. When used for common NLP tasks, such as word similarity and document clustering, JoSE achieves state-of-the-art performance and surpasses contemporaries including BERT and Doc2Vec. In addition to creating better-quality embeddings, JoSE also reduces the training time required to generate them.
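To make the Euclidean versus spherical distinction concrete, here is a minimal sketch of what it means to compare vectors on the unit sphere. This is not JoSE's training procedure. The vectors and function names are hypothetical, chosen only to show that cosine similarity ignores vector length while Euclidean distance does not.

```python
import math

def normalize(vector):
    """Scale a vector to unit length so that it lies on the unit sphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vector]

def cosine_similarity(a, b):
    """Dot product of the two vectors after projecting both onto the sphere."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def euclidean_distance(a, b):
    """Straight-line distance, which is sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy "embeddings": the first two point in the same direction
# but have different lengths; the third is perpendicular to the first.
short_vec = [1.0, 2.0, 2.0]
long_vec = [10.0, 20.0, 20.0]
other_vec = [2.0, -1.0, 0.0]

same_direction_cos = cosine_similarity(short_vec, long_vec)
same_direction_dist = euclidean_distance(short_vec, long_vec)
orthogonal_cos = cosine_similarity(short_vec, other_vec)

print(f"cosine(short, long)    = {same_direction_cos:.3f}")
print(f"euclidean(short, long) = {same_direction_dist:.3f}")
print(f"cosine(short, other)   = {orthogonal_cos:.3f}")
```

The first two vectors are far apart in Euclidean terms yet identical in direction, which is the kind of mismatch that motivates learning embeddings directly in a spherical space.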

DELG: Unifying DEep Local and Global Features

The state of the art in instance-level recognition (ILR) has recently been advanced as researchers design and test models against Google Landmarks Dataset v2 (GLDv2), released in 2019. ILR, an extension of image classification, refers to the computer vision task of recognizing a specific instance of a particular object class. For example, instead of identifying that an image contains a building, ILR seeks to classify the building in the image more specifically (as the White House or the Louvre, for example). One open source model for ILR applications that achieves high performance while improving efficiency is Google’s DELG. DELG approaches this problem by extracting both local and global image features in a unified, fully convolutional model. Local features are found through attention-based keypoint detection, and global features are found using generalized mean pooling. This unified design is more efficient than previous designs during training and at inference because it eliminates the redundant computation incurred when global and local features are extracted by two separate networks. Beyond identifying landmarks, ILR has other applications including artwork recognition, product retrieval, and image search. While a few models outperform DELG in average precision on GLDv2, DELG is much more efficient than these models, making it a good option for practitioners wanting to deploy efficient ILR systems.
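Generalized mean pooling, mentioned above as the source of DELG's global features, is simple to state: for one feature channel, raise each spatial activation to a power p, average, and take the p-th root. The sketch below is our own illustration of that formula on a single toy channel, not code from DELG. The function name, the activation values, and the default p=3 are assumptions made for the example.

```python
def generalized_mean_pool(activations, p=3.0, eps=1e-6):
    """Pool one channel's spatial activations with a generalized mean.

    With p = 1 this is ordinary average pooling; as p grows, the result
    moves toward max pooling. Values are clamped to eps so that the
    fractional root is always taken of a positive number.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    clamped = [max(a, eps) for a in activations]
    mean_of_powers = sum(a ** p for a in clamped) / len(clamped)
    return mean_of_powers ** (1.0 / p)

# Hypothetical activations for one channel of a feature map, flattened
# over spatial positions: mostly weak responses plus one strong response.
channel = [0.1, 0.2, 0.3, 2.0]

avg_like = generalized_mean_pool(channel, p=1.0)
gem_p3 = generalized_mean_pool(channel, p=3.0)
max_like = generalized_mean_pool(channel, p=50.0)

print(f"p=1  (average-like): {avg_like:.3f}")
print(f"p=3                : {gem_p3:.3f}")
print(f"p=50 (max-like)    : {max_like:.3f}")
```

The exponent lets a single pooling layer sit anywhere between averaging, which dilutes a strong local response, and taking the maximum, which ignores everything else.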

Mitigating Bias in Visual Datasets using REVISE

With machine learning increasingly being used to automate decisions, companies have come under fire for using algorithms trained on datasets with unwanted, unnoticed biases. These biases can have real-world implications, such as approving credit cards more often for men than for women. Since machine learning models can amplify biases present in datasets, researchers at Princeton have released REVISE (REvealing VIsual biaSEs) as a way to surface and help mitigate potential biases in image data across object-based, gender-based, and geography-based patterns. REVISE runs in Jupyter notebooks, which makes it simple to integrate with current data science workflows, and it outputs histograms to visualize possible sources of data bias. It also provides examples of actions that can be taken to remove biases based on the type of bias detected, enabling data scientists to feel more confident in the performance of their models. A paper on arXiv explaining the tool and a GitHub repository are available for straightforward integration into data science projects.
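To give a feel for the kind of histogram-based check described above, here is a small sketch that counts how often each region appears in a set of image annotations. It does not use REVISE or its API. The annotation records, function names, and regions are all hypothetical toy data made up for this example.

```python
from collections import Counter

# Hypothetical toy annotations: one (object label, region) pair per image.
annotations = [
    ("car", "North America"),
    ("car", "North America"),
    ("house", "North America"),
    ("person", "North America"),
    ("person", "North America"),
    ("house", "North America"),
    ("car", "Europe"),
    ("person", "Africa"),
]

def count_by(records, index):
    """Count how often each value appears at the given tuple position."""
    return Counter(record[index] for record in records)

def text_histogram(counts, width=20):
    """Render counts as text bars, most frequent first."""
    total = sum(counts.values())
    lines = []
    for label, count in counts.most_common():
        bar = "#" * round(width * count / total)
        lines.append(f"{label:<15}{bar} {count}")
    return lines

region_counts = count_by(annotations, 1)
dominant_region, dominant_count = region_counts.most_common(1)[0]
dominant_share = dominant_count / len(annotations)

for line in text_histogram(region_counts):
    print(line)
print(f"{dominant_region} accounts for {dominant_share:.0%} of the images")
```

Even a count this simple makes a skewed dataset visible at a glance. A dedicated tool goes further by covering several kinds of bias and suggesting what to do about each.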

The Octopus Test: A Turing test for the modern language model

The GPT-3 language model recently wrote a blog post that many readers thought was human-written, which leads us to consider what the true limits of advanced language models might be. The question is further complicated by reports that GPT-3 has been used to generate webpage code from natural language instructions. A recent research paper poses a thought experiment that may help ground our understanding of these models. ‘The Octopus Test’ was designed to convey the challenge of learning meaning from text alone. In the test, two fluent speakers of English (A and B) are separately stranded on uninhabited islands. Able to communicate only via text message through an underwater cable, they begin a conversation. A hyper-intelligent octopus (O) with no knowledge of English taps into this cable and observes the conversation. This statistically inclined octopus eventually generalizes language patterns and learns which words appear in relation to others. Soon, the octopus decides to cut the cable and insert itself into the conversation posing as person B, while A continues the conversation unaware. The test asks, “Can O successfully pose as B without making A suspicious?” The authors say this ‘weak form’ of the Turing test highlights something important: we can imagine the point at which the conversation will break down. They suggest that if person A invented a new ‘coconut catapult’ and shared the instructions, O would have no concept of what ‘rope’ or ‘coconut’ might refer to and would be unable to recreate the new island invention. The authors hope to highlight the problem of reasoning and of relating words to meaning in advanced language models. While this is only a high-level thought experiment, it can help ground our expectations of what current language models can realistically do.
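As a deliberately crude stand-in for the octopus, the sketch below learns only which word tends to follow which in a handful of overheard messages. It is our own toy illustration, far simpler than any modern language model and not taken from the paper. The messages and function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigrams(messages):
    """Record, for each word, how often every other word follows it."""
    follows = defaultdict(Counter)
    for message in messages:
        words = message.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            follows[current_word][next_word] += 1
    return follows

def most_likely_next(follows, word):
    """Return the most frequent follower of a word, or None if unseen."""
    candidates = follows.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]

# Hypothetical messages the octopus overheard on the cable.
overheard = [
    "the weather is nice today",
    "the weather is bad today",
    "the weather is nice again",
]

model = train_bigrams(overheard)

# Familiar territory: the statistics give a plausible continuation.
next_after_is = most_likely_next(model, "is")

# A word that never crossed the cable: there is nothing to fall back on.
next_after_catapult = most_likely_next(model, "catapult")

print(f"after 'is': {next_after_is}")
print(f"after 'catapult': {next_after_catapult}")
```

The toy model continues familiar phrases well enough, but it holds only co-occurrence counts, so a word it has never seen leaves it with nothing to say. That is the breakdown the thought experiment asks us to picture.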

This research was performed under the Novetta Machine Learning Center of Excellence.


Jack Buttimer
Carlos Martinez
Shauna Revay, PhD
Brian Sacash