BASELINE • November 2019

An ML Newsletter from Novetta

Welcome to the first Novetta Machine Learning Newsletter, where we will share thoughts on important advances in machine learning technologies. We’ll start with a few highlights that have shaped our work in 2019: GPU-based acceleration of data science tasks, an inflection point in natural language processing performance, advances in speech recognition, and graph neural networks.

GPU-Enabled Data Science with NVIDIA RAPIDS

GPUs provide significant speed improvements when processing images, but for other data types developers have typically relied on slower CPUs. NVIDIA RAPIDS is a “suite of open source software libraries” that enables much of the data science pipeline – such as the data manipulation typically done in pandas – to run on GPUs, significantly improving speed. Libraries are available for structured data (cuDF), geospatial data (cuSpatial), graphs (cuGraph), and computer logs (CLX), as well as for training machine learning models (cuML). While we were initially skeptical (as we are with most new technology) of the claimed speedups of up to 100x, when we tested RAPIDS on a clustering task we saw processing time drop from 5 minutes to under 2 seconds!
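For a sense of how little code changes, here is a minimal sketch (with placeholder file and column names) of GPU-accelerated clustering using cuDF and cuML, whose interfaces mirror pandas and scikit-learn:

    import cudf
    from cuml.cluster import KMeans

    # Read the data directly into GPU memory; the API mirrors pandas.read_csv
    gdf = cudf.read_csv("events.csv")  # placeholder file name

    # Fit k-means entirely on the GPU; the interface mirrors scikit-learn
    kmeans = KMeans(n_clusters=8)
    kmeans.fit(gdf[["x", "y"]])  # placeholder feature columns
    labels = kmeans.labels_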

The Language Model Arms Race

Transformer-based language models have recently been at the forefront of Natural Language Processing (NLP) research. Their impressive ability to interpret and replicate semantics, and to perform language generation at near-human levels, has made this model type the state of the art. Google AI’s release of BERT in November 2018 was a major leap forward. Not long after, OpenAI announced GPT-2 but initially withheld the full model for fear it would be used to generate convincing fake news at scale. This past September, Salesforce released CTRL, a model that conditions generation on control codes, letting users steer the style and theme of the generated text. While we have used many of these models to help understand text, we are also investigating ways they might be misapplied to spread disinformation.
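As an illustration of how accessible these models have become, the sketch below samples a continuation from the publicly released small GPT-2 model using the Hugging Face transformers library (the prompt and sampling parameters are arbitrary choices):

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # Encode an arbitrary prompt and sample a short continuation
    input_ids = tokenizer.encode("Researchers announced today that", return_tensors="pt")
    with torch.no_grad():
        output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50)
    print(tokenizer.decode(output[0], skip_special_tokens=True))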

Approaches to Multilingual Learning

We work extensively with foreign language media, so we have been closely following the major improvements in NLP applied to non-English text. Of the many available libraries, we have focused on LASER and MultiFiT. LASER (open sourced by Facebook) computes multilingual sentence embeddings, the foundation of many NLP tasks, across 93 languages written in 28 different alphabets. MultiFiT (from fast.ai) is a multilingual fine-tuning method based on ULMFiT that outperforms other models, such as multilingual BERT, on text classification tasks.
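As a quick sketch of LASER in practice, the snippet below uses the community laserembeddings package (one of several ways to run LASER; Facebook’s repository also ships its own scripts) to embed sentences from two languages into a single shared vector space:

    from laserembeddings import Laser

    # Requires the LASER model files to be downloaded first
    # (python -m laserembeddings download-models)
    laser = Laser()

    # Parallel sentences in English and French map to nearby points
    # in a single, language-agnostic embedding space
    sentences = ["The cat sits on the mat.", "Le chat est assis sur le tapis."]
    embeddings = laser.embed_sentences(sentences, lang=["en", "fr"])
    print(embeddings.shape)  # (2, 1024): one 1024-dimensional vector per sentence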

Advancements in Automatic Speech Recognition

Data augmentation improves the generalizability of a model by ensuring it isn’t just memorizing the training data. While augmentation is widely applied to image tasks, it is far less prevalent for audio data. To address that, in April 2019 Google released SpecAugment, which provides methods for augmenting spectrogram data. SpecAugment achieved state-of-the-art results on automatic speech recognition tasks across multiple datasets.
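To make the idea concrete, here is a minimal numpy sketch of two of SpecAugment’s three transformations, frequency masking and time masking (it omits time warping, and the mask widths are illustrative):

    import numpy as np

    def spec_augment(spec, max_freq_mask=8, max_time_mask=16, rng=None):
        """Zero out one random frequency band and one random time band
        of a (num_freq_bins, num_time_steps) spectrogram array."""
        rng = rng or np.random.default_rng()
        spec = spec.copy()
        n_freq, n_time = spec.shape

        # Frequency masking: blank a random run of consecutive frequency bins
        f = rng.integers(0, max_freq_mask + 1)
        f0 = rng.integers(0, n_freq - f + 1)
        spec[f0:f0 + f, :] = 0.0

        # Time masking: blank a random run of consecutive time steps
        t = rng.integers(0, max_time_mask + 1)
        t0 = rng.integers(0, n_time - t + 1)
        spec[:, t0:t0 + t] = 0.0
        return spec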

Graph Neural Networks

Important information can often be captured by examining relationships in data in the form of graphs. Deep learning, which has led to significant increases in accuracy on image and text data, has not been as widely applied to graph data. That is starting to change, as new approaches for representing graph information allow models to learn from a graph’s structure as well as the attributes of its nodes (entities) and edges (relationships). Inspired in part by PinSage, we have used these methods to identify bad actors in a network, predict missing relationships, and recommend articles to readers.
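As a sketch of how these models consume both node attributes and graph structure, the snippet below defines a simple two-layer graph convolutional network using the PyTorch Geometric library (one of several GNN toolkits; the layer sizes are illustrative):

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GCNConv

    class GCN(torch.nn.Module):
        """Two-layer graph convolutional network for node classification."""
        def __init__(self, num_node_features, num_classes):
            super().__init__()
            self.conv1 = GCNConv(num_node_features, 16)
            self.conv2 = GCNConv(16, num_classes)

        def forward(self, x, edge_index):
            # Each layer mixes a node's features with those of its neighbors,
            # so predictions reflect both attributes and relationships
            x = F.relu(self.conv1(x, edge_index))
            x = self.conv2(x, edge_index)
            return F.log_softmax(x, dim=1)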

Pushing the State-of-the-Art with Text Classification

While progress has been made using language models for text classification, other information or metadata associated with text can be useful in predicting topic or sentiment. For example, one might draw very different inferences depending on whether an article was written by the BBC or by Sputnik. We created Metadata Enhanced ULMFiT, an approach that incorporates article metadata, such as author and publication, into the model. This enhancement delivered performance improvements of up to 20% relative to “native” ULMFiT.
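The details of our implementation are beyond the scope of this newsletter, but as a purely illustrative sketch (not necessarily the approach used in Metadata Enhanced ULMFiT), one simple way to expose metadata to a language-model classifier is to prepend it as pseudo-tokens, in the spirit of fastai’s xx-prefixed special tokens:

    def add_metadata_tokens(text, author, publication):
        """Prepend metadata as pseudo-tokens so a fine-tuned language model
        can condition on them alongside the article text (illustrative only)."""
        return f"xxauthor {author} xxpub {publication} xxtext {text}"

    sample = add_metadata_tokens(
        "Officials announced the new policy today...",  # placeholder text
        author="Jane Doe",
        publication="BBC",
    )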

This research was performed under the Novetta Machine Learning Center of Excellence.


Authors:

Andrew Chang
Shauna Revay, PhD
Brian Sacash
Matt Teschke