An ML Newsletter from Novetta
Welcome to the March 2020 installment of BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month’s topics include:
- AdaptNLP, an open source framework for state-of-the-art Natural Language Processing (NLP)
- A new competition for geospatial AI applications corresponding with the release of a Synthetic Aperture Radar (SAR) dataset
- A new and improved way of training language models
- StanfordNLP gets a new coat of paint
- Research into training large ML models with CPUs
- A new tool designed for rapid prototyping quantum machine learning models
At Novetta, we closely follow the frequent advances in the open source NLP community. To help researchers and developers quickly use state-of-the-art NLP models, we have developed an open source framework, AdaptNLP. AdaptNLP is a Python package that provides a high-level framework and library for running, training, and deploying state-of-the-art NLP models. The package is built atop two open source libraries, Transformers (from Hugging Face) and Flair (from Zalando Research), and it includes tools that enable users to generate predictions from pre-trained models. AdaptNLP helps users fine-tune language models for text classification, question answering, entity extraction, and part-of-speech tagging. Novetta will be presenting AdaptNLP workshops and tutorials at the Open Data Science Conference (ODSC)!
In the past three years, SpaceNet LLC, a nonprofit aimed at accelerating progress in geospatial machine learning, has open sourced five unique datasets and corresponding challenges for applying artificial intelligence to geospatial applications. Competitions have focused on detecting building footprints and road networks from visible-band images. SpaceNet, in collaboration with Capella Space, has announced a new dataset, SpaceNet 6, which provides both traditional satellite imagery and Synthetic Aperture Radar (SAR) data. SAR data has advantages over regular satellite imagery, such as the ability to capture images at night and to penetrate cloud cover. The addition of SAR data to the competition should encourage interesting new approaches, as little open source data of this kind has previously been available to researchers.
Progress on the GLUE benchmark tasks for NLP has largely been achieved by increasing the size of the models. This sort of leaderboard-chasing makes for good headlines, but is less important for companies that want to make use of these advances without breaking the bank. That’s why we are excited about ELECTRA (from Google AI), a creative approach to training language models that leads to better performance without simply increasing model size. In this new training approach, some of the words in the real input text are swapped for plausible but incorrect replacements, and the model learns to discriminate between the original words and the replacements. This process allows ELECTRA to better learn real language using fewer examples during training. Approaches like ELECTRA will allow for advanced NLP methods that are smarter and more economical because they use fewer computational resources. ELECTRA is being released as an open source model on top of TensorFlow.
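To make the idea concrete, here is a minimal sketch of how training examples for this kind of "real versus replaced" task could be built. It is not ELECTRA's actual code: the function name `corrupt_tokens`, the toy vocabulary, and the replacement probability are all assumptions made for illustration, and a random draw from the vocabulary stands in for the model that proposes plausible replacements.

```python
import random


def corrupt_tokens(tokens, vocab, replace_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for a replaced-token detection task.

    labels[i] is 1 if position i now holds a different token than the
    original, and 0 otherwise. A discriminator would be trained to predict
    these labels from the corrupted tokens alone.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            # Assumption: a uniform random pick stands in for a learned
            # generator that would propose a plausible replacement.
            new_tok = rng.choice(vocab)
            corrupted.append(new_tok)
            # If the sampled token happens to match the original,
            # this sketch labels the position as "original".
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


# Hypothetical toy data, purely for illustration.
tokens = "the chef cooked the meal".split()
vocab = ["the", "chef", "cooked", "ate", "meal", "dog", "ran"]

corrupted, labels = corrupt_tokens(tokens, vocab, replace_prob=0.3, seed=1)
for original, seen, label in zip(tokens, corrupted, labels):
    status = "replaced" if label else "original"
    print(f"{original:>8} -> {seen:<8} {status}")
```

Because every position in the sentence receives a label, each training example gives the model a signal on all of its words rather than on a small masked subset, which is one intuition for why this style of training can learn from fewer examples.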
The StanfordNLP library is getting a makeover in the form of Stanza, a native “Python NLP Library for many human languages”. It can be used in NLP pipelines for lower-level tasks such as named entity recognition (NER) and part-of-speech (POS) tagging across more than 70 natural languages. Its modules are built on top of PyTorch, so they run faster on GPU-enabled machines than on CPUs alone. Stanza also includes an official Python wrapper for Stanford CoreNLP, which is written in Java, making CoreNLP's functionality easy to access from Python. While the heart of Stanza is not entirely new, the new name sets the Stanford NLP group's current work apart from the legacy StanfordNLP library as modern NLP software quickly evolves.
Advances in machine learning would not have been possible without the shift to performing computation on GPUs, which can perform far more operations in parallel than CPUs. SLIDE (Sub-Linear Deep Learning Engine) is a deep learning framework that aims to drastically reduce the amount of computation required, thereby enabling training and inference to run on a CPU. The team behind SLIDE combined locality-sensitive hashing (LSH) with novel algorithmic and data-structure choices that reduce memory lookups. Running on two 22-core CPUs, the SLIDE team compared their framework against an NVIDIA Tesla V100 GPU and found a 3.5x speedup in model training time while maintaining similar precision. While work is still necessary to expand this method to additional deep learning use cases, it is a promising start.
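As a rough illustration of how hashing can cut down computation, the sketch below uses one simple form of LSH (signed random projections) to pick a small set of neurons whose weights point in a similar direction to the input, and then computes only those neurons. This is a toy under stated assumptions, not SLIDE's implementation: the single hash table, the choice of hash function, and every name and size here are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random hyperplanes, each a vector of length dim."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, hyperplanes):
    """Hash a vector to a tuple of bits: which side of each hyperplane it is on."""
    return tuple(
        int(sum(h_i * v_i for h_i, v_i in zip(plane, vec)) >= 0)
        for plane in hyperplanes
    )


def build_table(neuron_weights, hyperplanes):
    """Bucket neuron indices by the hash of their weight vectors."""
    table = {}
    for idx, weights in enumerate(neuron_weights):
        table.setdefault(simhash(weights, hyperplanes), []).append(idx)
    return table


def active_neurons(x, table, hyperplanes):
    """Look up the neurons that share a bucket with the input x."""
    return table.get(simhash(x, hyperplanes), [])


def sparse_forward(x, neuron_weights, active):
    """Compute dot products only for the selected neurons."""
    return {
        i: sum(w * v for w, v in zip(neuron_weights[i], x))
        for i in active
    }


# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
dim, num_neurons, num_bits = 8, 1000, 6

weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_neurons)]
planes = make_hyperplanes(num_bits, dim, rng)
table = build_table(weights, planes)

x = [rng.gauss(0, 1) for _ in range(dim)]
active = active_neurons(x, table, planes)
outputs = sparse_forward(x, weights, active)
print(f"computed {len(outputs)} of {num_neurons} neurons")
```

Vectors that point in similar directions tend to land in the same bucket, so the lookup favors neurons likely to have large activations for this input while skipping the rest of the layer. A real system would need more than this, for example several hash tables and a way to keep them current as weights change during training.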
Rapid Prototyping of Hybrid Quantum-Classical Models with TensorFlow Quantum
Quantum computing promises to solve problems that classical computers cannot solve in our lifetimes. While progress has been made in building quantum computers, access to them has been limited to a few companies and those with highly specialized knowledge. TensorFlow Quantum (TFQ), developed by Google AI, is an open source library that seeks to address some of these challenges by enabling the rapid prototyping of quantum ML models. The framework offers high-level abstractions so that researchers can spend time on developing algorithms instead of setting up an infrastructure for quantum computing. To further democratize access to quantum computing and quantum ML development, TFQ can be combined with qsim, a high-performance, open-source quantum circuit simulator that Google AI is releasing.
This research was performed under the Novetta Machine Learning Center of Excellence.