BASELINE • July 2021

An ML Newsletter from Novetta

Welcome to the July 2021 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month is the last of two special editions, as we’ve invited each member of our summer ML intern class to contribute an article. This month we cover the following topics:

  • GitHub Copilot, a language-model-powered code-writing assistant
  • A new model that combines visual and textual information to improve natural language understanding capabilities
  • A framework for distributed, large-scale deep learning using volunteer computing
  • A new take on transformer models that replaces self-attention

GitHub Copilot

With recent advancements in text generation fueled by OpenAI’s GPT-3 model, researchers at OpenAI have extended this technology to machine learning systems capable of generating Python code. By training a large GPT language model on publicly available code, they created a new state-of-the-art code generation model called Codex, which now powers GitHub’s new Copilot service. Copilot integrates with the Visual Studio Code editor, where it suggests entire lines of code or even complete functions: a user writes a description of a function in a comment, and Copilot proposes code for the whole function. This technology continues to demonstrate the success and flexibility of language models and has the potential to speed up coding tasks for data scientists.

Here, the user describes the function in the comments (green), and Copilot suggests the appropriate corresponding code. (Image from TechCrunch)
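To make that interaction concrete, the snippet below mimics the workflow: the comment and function signature stand in for what a developer would type, and the body is the kind of completion Copilot might offer. It is an illustrative sketch written for this newsletter, not actual Copilot output.

    # Illustrative only: the comment and signature below play the role of the
    # user's prompt; the body is a plausible, Copilot-style completion.

    # compute the average word length of each sentence in a block of text
    def average_word_lengths(text: str) -> list:
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        averages = []
        for sentence in sentences:
            words = sentence.split()
            averages.append(sum(len(w) for w in words) / len(words))
        return averages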

VidLanKD: Improving Natural Language Understanding

Recently, there has been growing interest in developing natural language understanding (NLU) capabilities. NLU refers to a machine’s ability to derive context and intent from text, as opposed to natural language processing (NLP), which processes text in a more literal sense. While traditional language models are trained on text alone, combining visual and textual information can lead to greater real-world understanding than text descriptions by themselves. One approach to using both visual and textual knowledge for NLU is vokenization, which maps each text token to a related image. A key limitation is that vokenization suffers from approximation errors, because image-text datasets have a less diverse vocabulary than text-only datasets.

Researchers from UNC Chapel Hill recently sought to overcome the limitations of text-only models and vokenization with their new model, VidLanKD. VidLanKD pairs a multi-modal teacher model, built from a video encoder and a language encoder, with a student model that learns from the teacher through knowledge distillation. VidLanKD outperforms both text-only language models and vokenization models on several NLU benchmarks, including GLUE, SQuAD, and SWAG. Its ability to ground text in visual representations while outperforming current state-of-the-art NLU models shows the promise of multi-modal approaches, bringing us one step closer to NLU models that can better understand human language.
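The teacher-student transfer at the heart of VidLanKD is a form of knowledge distillation. The sketch below shows a generic, simplified version of that idea rather than the paper’s exact objective: a text-only student (the hypothetical student_hidden and student_logits tensors) is trained to match the token representations of a video-grounded teacher while still optimizing a standard masked-language-modeling loss.

    import torch.nn.functional as F

    # Generic knowledge-distillation sketch (not VidLanKD's exact objective).
    # student_hidden, teacher_hidden: (batch, seq_len, hidden) token representations,
    # assumed to share the same hidden size; student_logits: (batch, seq_len, vocab);
    # labels: (batch, seq_len) masked-LM targets with -100 marking unmasked tokens.
    def distillation_loss(student_hidden, teacher_hidden, student_logits, labels, alpha=0.5):
        kd = F.mse_loss(student_hidden, teacher_hidden)                # match teacher features
        mlm = F.cross_entropy(student_logits.transpose(1, 2), labels,  # standard masked-LM loss
                              ignore_index=-100)
        return alpha * kd + (1 - alpha) * mlm                          # weighted combination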

DeDLOC: Democratizing Deep Learning

Developing deep learning models is often constrained by prohibitive training costs and by the narrow scope of open-source pre-trained models, which may only be applicable to specific datasets or tasks. Volunteer computing (VC) is a method in which independent parties combine computational power to collaboratively perform large-scale experiments. It has been successfully applied to open-source informatics projects (e.g., Folding@home and SETI@home), but this kind of architecture has not been shown to be effective for machine learning workloads. Training deep learning models requires constant communication between volunteer hardware, limiting traditional VC systems to very specific hardware types and network speeds, which runs counter to the goal of a VC setup. The authors of Distributed Deep Learning in Open Collaborations propose DeDLOC, a distributed computing framework in which volunteers independently and asynchronously compute gradients on small microbatches; once the accumulated microbatches reach a target batch size, the volunteers synchronously update the global state of the model, making the combined update mathematically equivalent to a single large-batch training step. Asynchronous computation also allows volunteers to exit and rejoin the training process without disrupting the current batch step. DeDLOC’s democratization of deep learning has the potential to help validate large-scale, open-source models that are typically created by organizations with unconstrained hardware access. Independent researchers could use DeDLOC to reproduce and validate models such as GPT-3, allowing them to explore and propose solutions for model biases and vulnerabilities and putting verifiability and diversity at the forefront of the research community.
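The accumulate-then-synchronize idea is easiest to see in a simplified form. The sketch below is a single-process approximation written for this newsletter, not the DeDLOC implementation (which additionally handles volunteer asynchrony, fault tolerance, and networking): gradients from small microbatches are summed until a target batch size is reached, and only then is a single averaged update applied, just as in ordinary large-batch training.

    # Minimal, single-process sketch of large-batch training via microbatch
    # accumulation; volunteer asynchrony and networking are omitted.
    def collaborative_step(model, optimizer, microbatches, loss_fn, target_batch_size):
        accumulated = 0
        optimizer.zero_grad()
        for inputs, labels in microbatches:                      # each volunteer's small microbatch
            loss = loss_fn(model(inputs), labels) * len(inputs)  # loss_fn assumed to return a mean
            loss.backward()                                      # gradients accumulate in .grad
            accumulated += len(inputs)
            if accumulated >= target_batch_size:                 # target batch size reached
                for p in model.parameters():
                    if p.grad is not None:
                        p.grad /= accumulated                    # average over the whole batch
                optimizer.step()                                 # one synchronous "global" update
                optimizer.zero_grad()
                accumulated = 0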

Attention-Free Transformers

Since their introduction in 2017, Transformer models have powered advances in many areas of machine learning, including natural language processing and computer vision, often outperforming more traditional neural architectures such as recurrent and convolutional networks. Much of the power of transformer models has been attributed to their self-attention layers, which increase the model’s expressive capacity but incur significant cost: the time and memory complexities of standard self-attention are quadratic in the input length, yielding powerful models at the expense of computational efficiency. Researchers at Apple sought to ameliorate this concern by replacing the self-attention layers of traditional transformers with a more efficient interaction mechanism, in an architecture they call the Attention Free Transformer (AFT). AFT avoids materializing the quadratic attention matrix, reducing memory complexity to linear in the input size (and, in its simplest variant, time complexity as well). Even with these reductions in complexity, AFTs perform at or above the level of traditional transformer models on a wide variety of benchmark tasks, including language modeling and image classification. By lowering computational cost while retaining high performance, AFTs can enable researchers to explore how state-of-the-art neural models perform on input sizes and quantities of data that were previously out of reach.
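As a rough illustration of the flavor of that replacement, the sketch below implements AFT-simple, the paper’s variant without learned pairwise position biases: the T x T attention matrix is replaced by a softmax-weighted pooling over positions plus an element-wise sigmoid gate, so both time and memory scale linearly with the sequence length.

    import torch

    def aft_simple(Q, K, V):
        """Sketch of the AFT-simple operation for one sequence.
        Q, K, V: (T, d) tensors. Standard self-attention would build a (T, T)
        attention matrix; here the cost is linear in the sequence length T."""
        weights = torch.softmax(K, dim=0)    # (T, d): per-position, per-feature weights
        context = (weights * V).sum(dim=0)   # (d,): one pooled context shared by all positions
        return torch.sigmoid(Q) * context    # (T, d): sigmoid(Q) gates the shared context

The full AFT model adds learned position biases so that each position can mix the sequence differently, while still avoiding an explicit attention matrix.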

This research was performed under the Novetta Machine Learning Center of Excellence.


Authors:
Izzy Shehan
Jackson Petty
Clyde Sumagang
Christian F. Jung