An ML Newsletter from Novetta
Welcome to the June 2020 installment of BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This is a special month, as we invited members of our 2020 intern class to provide their thoughts. They selected topics covering:
- New methods for training image models with less data
- Open source audio datasets
- A new approach for translating between programming languages
- Faster word embeddings
Training GANs with Limited Data
Training high-quality Generative Adversarial Networks (GANs) typically requires a large number of images, which can be costly or time-consuming to collect. To address this issue, researchers at NVIDIA have introduced a method that produces high-quality results using three orders of magnitude fewer training examples. They apply a new data augmentation technique, adaptive discriminator augmentation, to address the problem of discriminator overfitting in the low-data regime. In this approach, images fed to the discriminator are weakly augmented, which allows the augmentation to be reversed during training. The discriminator sees only augmented images, which reduces overfitting, while the generator still learns the native data distribution and produces high-quality results. The strength of the augmentation is adjusted dynamically based on the degree of overfitting.
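The dynamic control of augmentation strength can be illustrated with a minimal sketch. It assumes the degree of overfitting is estimated as the mean sign of the discriminator's raw outputs on real training images, and that the augmentation probability is nudged up or down by a fixed step to hold that estimate near a target. The function name, the target of 0.6, and the step size are illustrative assumptions, not NVIDIA's implementation.

```python
def update_augment_probability(p, real_logits, target=0.6, step=0.01):
    """Return an updated augmentation probability in [0, 1].

    p           -- current probability of augmenting an image
    real_logits -- discriminator outputs on real training images
                   (assumed: positive means "looks real")
    target      -- assumed target value for the overfitting estimate
    step        -- assumed fixed adjustment applied per update
    """
    if not real_logits:
        raise ValueError("real_logits must not be empty")

    # Overfitting estimate: mean sign of the discriminator outputs.
    # Values near 1 mean the discriminator is very sure about the
    # training images, which suggests it is memorizing them.
    signs = [(x > 0) - (x < 0) for x in real_logits]
    overfit_estimate = sum(signs) / len(signs)

    if overfit_estimate > target:
        p += step  # too confident: augment more often
    elif overfit_estimate < target:
        p -= step  # not overfitting: augment less often

    # Keep the probability valid.
    return min(1.0, max(0.0, p))
```

Calling this every few minibatches would raise the probability when the discriminator becomes overconfident and lower it otherwise, so the augmentation strength tracks the amount of overfitting rather than being fixed in advance.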
Mozilla Common Voice: Open Source Audio Data
Mozilla Common Voice is an open-source audio dataset that currently includes roughly 5,600 hours of validated, crowd-sourced speech recordings spanning 54 languages. Recording quality is at the discretion of the contributor, and contributors can include demographic information. This dataset is used in Mozilla’s DeepSpeech, an open-source automatic speech recognition (ASR) engine pre-trained for Voice Activity Detection, which allows users to immediately start transcribing streamed audio. The diversity of Common Voice data, paired with the contextual information provided by contributors, supports the creation of custom audio models, such as dialect or age detection.
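As a rough sketch of how the contributor-provided demographic fields might feed a custom model such as age detection, the snippet below groups clip paths by a label column in a tab-separated metadata file. The column names are an assumption modeled on the metadata files shipped with Common Voice releases, and the sample rows are invented purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; the column layout is an assumption.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccent\n"
    "a1\tclip_0001.mp3\tHello there\t2\t0\ttwenties\tfemale\tus\n"
    "b2\tclip_0002.mp3\tGood morning\t3\t1\t\t\t\n"
    "c3\tclip_0003.mp3\tSee you soon\t2\t0\tsixties\tmale\tengland\n"
)


def clips_by_label(tsv_file, label_column):
    """Group clip paths by a demographic column, skipping unlabeled clips."""
    groups = defaultdict(list)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        label = (row.get(label_column) or "").strip()
        if label:  # demographic fields are optional for contributors
            groups[label].append(row["path"])
    return dict(groups)


age_groups = clips_by_label(io.StringIO(SAMPLE_TSV), "age")
```

Clips without the requested label are skipped because contributors are not required to supply demographic information.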
Improving Image Classification through Caption Generation
VirTex is an approach for image classification that first trains a model to generate image captions, then uses the model in downstream computer vision tasks. Because the model predicts a caption for an image instead of classifying it into a single-word label, the embeddings it produces carry additional semantic context about the image, which helps improve the accuracy of the model. As an example, higher-quality embeddings can be generated for an image from a descriptive sentence such as “it is a gray finch sitting on a bench” than from a simple label of “finch.” “Finch” is an example of a semantically sparse label, as only a small amount of information can be gathered from it. In contrast, captions are semantically dense labels that describe not only relationships within the image but also actions and attributes, which helps the model better understand what occurs inside the image.
This semantic-based approach performs better at image classification than pretrained state-of-the-art ImageNet models while being trained on only 10% of the data, reducing the need for large datasets. Although this research is focused on computer vision tasks, we believe this approach could also be adapted to areas such as sentiment analysis.
Unsupervised Translation of Programming Languages
Updating a codebase to a new programming language is expensive, time-consuming, and requires expertise in both the source and target languages. Transcompilers automate this process by translating source code from one programming language to another. Current transcompilers are based on hand-written rules that often need manual modification and fail to reproduce language conventions. Neural models are more promising, but their use is limited by their need for parallel language data, which is extremely rare. Facebook AI Research presented a fully unsupervised approach, called TransCoder, which translates functions between Python, C++, and Java with high accuracy, outperforming commercial baselines. The model training includes three main steps:
- pretraining a cross-lingual language model to ensure that code components that perform the same task have the same representation;
- denoising auto-encoding, which allows the model to be robust to noisy input; and
- back-translation, which involves training source-to-target and target-to-source networks in parallel.
This new approach could reduce the expense and expertise required to translate a codebase.
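To make the denoising auto-encoding step more concrete, here is a minimal sketch of a noise function of the kind commonly used for this purpose: it randomly drops, masks, and locally shuffles tokens, and the model is then trained to reconstruct the original sequence from the corrupted one. The function name, probabilities, window size, and mask token are illustrative assumptions, not TransCoder's actual settings.

```python
import random


def add_noise(tokens, rng, p_mask=0.1, p_drop=0.1,
              shuffle_window=3.0, mask_token="<MASK>"):
    """Return a corrupted copy of `tokens` for denoising training.

    All parameter defaults are assumptions chosen for illustration.
    """
    # 1. Randomly drop tokens.
    kept = [t for t in tokens if rng.random() >= p_drop]

    # 2. Randomly replace surviving tokens with a mask symbol.
    masked = [mask_token if rng.random() < p_mask else t for t in kept]

    # 3. Shuffle locally: add a random offset to each position and sort,
    #    so tokens move only a few places from where they started.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(masked))]
    order = sorted(range(len(masked)), key=lambda i: keys[i])
    return [masked[i] for i in order]


rng = random.Random(0)
source = "def add ( a , b ) : return a + b".split()
noisy = add_noise(source, rng)
# A training pair would be (noisy, source): the model sees `noisy`
# and is asked to reproduce `source`.
```

Because the target is always the clean sequence, the model has to learn enough about code structure to fill in and reorder what the noise removed.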
Magnitude: A Faster Word Embedding File Format
Magnitude is a Python package for storing word vectors (numerical representations of words) in a universal word embedding format. By loading data on demand by default and caching results for better performance, Magnitude claims it can carry out word vector queries and other vector operations 60-6,000x faster at a fraction of the RAM utilization, enabling deep learning models to load word embeddings as features in seconds. Magnitude’s faster operations allow word vectors to be applied to new use cases, such as handling misspellings and typos, encoding part-of-speech data, and combining word vector models.
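The general idea of on-demand loading with caching can be sketched with a toy vector store. This is not Magnitude's API or implementation: the class, table, and function names are hypothetical, and the vectors are invented. Each vector is fetched from a SQLite database only when it is first requested, then served from an in-memory cache on later requests.

```python
import json
import sqlite3
from functools import lru_cache


class LazyVectorStore:
    """Toy store: reads a vector from disk only on first request."""

    def __init__(self, connection, cache_size=1024):
        self._conn = connection
        self.db_reads = 0  # counts how often the database is actually hit
        # Wrap the lookup in an LRU cache so repeat queries skip the database.
        self.query = lru_cache(maxsize=cache_size)(self._query_uncached)

    def _query_uncached(self, word):
        self.db_reads += 1
        row = self._conn.execute(
            "SELECT vector FROM embeddings WHERE word = ?", (word,)
        ).fetchone()
        return tuple(json.loads(row[0])) if row else None


def build_toy_store():
    """Create an in-memory database holding two made-up vectors."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE embeddings (word TEXT PRIMARY KEY, vector TEXT)")
    toy_vectors = {"cat": [0.1, 0.9], "dog": [0.2, 0.8]}
    conn.executemany(
        "INSERT INTO embeddings VALUES (?, ?)",
        [(word, json.dumps(vec)) for word, vec in toy_vectors.items()],
    )
    return LazyVectorStore(conn)


store = build_toy_store()
first = store.query("cat")   # read from the database
second = store.query("cat")  # served from the cache
```

Nothing is loaded at construction time, so startup cost does not grow with vocabulary size, and frequently used words are answered from memory.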
This research was performed under the Novetta Machine Learning Center of Excellence.