BASELINE • April 2020

An ML Newsletter from Novetta

Welcome to the April 2020 installment of BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This month’s topics include:

  • Creative applications of synthetic data to improve model accuracy
  • How deep learning may speed development of a COVID-19 vaccine
  • A new deep learning approach to tabular data classification

Kaggle Bengali competition winner 

Kaggle is a data science competition site where competitors often advance the state of the art in different domains. We have found that Kaggle is one of the best places to identify creative new approaches to problems, and the Bengali.AI Handwritten Grapheme Classification competition was no exception. The objective was to classify each handwritten grapheme (roughly equivalent to a “letter” in English) into its base character and two accent glyphs. Instead of relying only on standard image classification techniques, the winner took a more creative approach: using generative adversarial networks (GANs) to generate new training samples, expanding the amount of training data for classes with few examples. This is a novel application of CycleGAN, a type of GAN that learns to translate data from one domain into another. The same approach could improve performance in other situations where training data is limited.
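To illustrate the idea of topping up rare classes, here is a minimal sketch that works out how many synthetic samples each under-represented class would need. It is not the winner's code: the function name, the toy class labels, and the target count are all hypothetical, and the generator that would actually produce the images (a CycleGAN in the winning solution) is not shown.

```python
from collections import Counter


def synthetic_samples_needed(labels, target_per_class):
    """Return {class: samples to generate} for classes below the target count."""
    counts = Counter(labels)
    return {
        cls: target_per_class - n
        for cls, n in counts.items()
        if n < target_per_class
    }


# Toy label list (hypothetical): one common class and two rare ones.
labels = ["ka"] * 50 + ["kha"] * 5 + ["ga"] * 12
plan = synthetic_samples_needed(labels, target_per_class=20)
print(plan)
```

A plan like this would be handed to whatever generator is available, which then produces the requested number of samples for each rare class.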

Use of synthetic training data to train models 

Training accurate machine learning models requires large amounts of labeled training data, such as images. However, creating training sets is time-consuming and tedious. While it would be much quicker to generate training data algorithmically, creating data realistic enough that models trained on it perform well on real data is a challenge. Recent studies have shown the promise of using synthetic data for training. In one case related to mechanical property evaluation, researchers improved model accuracy while using 95% less real data by supplementing the training set with synthetic data. In another case, researchers demonstrated that a model trained with synthetic data to detect cars was able to detect cars in real images.
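As a rough illustration of supplementing a small amount of real data with synthetic data, the sketch below keeps a fraction of the real samples and mixes them with synthetic ones. The function name, the 5% default, and the toy records are our assumptions for illustration; the studies mentioned above do not necessarily build their training sets this way.

```python
import random


def mix_training_data(real, synthetic, real_keep_fraction=0.05, seed=0):
    """Keep a random fraction of the real samples and combine them with synthetic ones."""
    rng = random.Random(seed)
    # Keep at least one real sample when any exist, and never more than are available.
    n_keep = min(len(real), max(1, round(len(real) * real_keep_fraction)))
    kept_real = rng.sample(real, n_keep)
    mixed = kept_real + list(synthetic)
    rng.shuffle(mixed)  # avoid feeding all real samples first
    return mixed


# Toy records (hypothetical): each sample is tagged with its origin.
real = [("real", i) for i in range(200)]
synthetic = [("synthetic", i) for i in range(1000)]
train_set = mix_training_data(real, synthetic)
print(len(train_set), sum(1 for origin, _ in train_set if origin == "real"))
```

Fixing the seed keeps the subsample reproducible, which makes it easier to compare models trained with different mixes.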

TabNet: a deep learning approach to tabular data classification

Should tree-based models continue to be the preferred method for tabular learning? Researchers from Google Cloud AI have provided another option in the form of TabNet, an interpretable deep learning architecture for tabular learning. TabNet uses a deep neural network to learn complex representations of tabular data, and a sequential attention mechanism selects which features to use at each decision step. The researchers report that this design outperforms popular tree-based methods. TabNet also offers local and global interpretability by visualizing feature importance and each input feature’s contribution to the trained model. The TensorFlow implementation of TabNet is available in Google Research’s GitHub repository.
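To give a feel for sequential attention, here is a minimal sketch of the masking step as described in the TabNet paper: at each decision step, a sparsemax function turns per-feature scores into a sparse mask, and a running "prior" discourages reusing features that earlier steps already selected. The function names are ours, and the scores are fixed toy numbers; in TabNet they come from a learned network, which is not shown.

```python
def sparsemax(z):
    """Project scores onto the probability simplex; many outputs are exactly 0."""
    z_sorted = sorted(z, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # The condition holds for the first k* sorted entries, then fails,
        # so the last threshold computed here is the one to use.
        if 1 + k * z_k > cumsum:
            tau = (cumsum - 1) / k
    return [max(v - tau, 0.0) for v in z]


def sequential_masks(step_scores, gamma=1.0):
    """Compute one feature mask per decision step.

    gamma is the relaxation parameter: with gamma=1.0 a fully selected
    feature is not selected again in later steps.
    """
    prior = [1.0] * len(step_scores[0])
    masks = []
    for scores in step_scores:
        mask = sparsemax([p * s for p, s in zip(prior, scores)])
        masks.append(mask)
        prior = [p * (gamma - m) for p, m in zip(prior, mask)]
    return masks


# Toy scores (hypothetical) for three features over two decision steps.
masks = sequential_masks([[2.0, 1.0, 0.1], [2.0, 1.0, 0.1]], gamma=1.0)
for step, mask in enumerate(masks, start=1):
    print(step, [round(m, 2) for m in mask])
```

In this toy run the first step puts all of its weight on the first feature, so the second step is pushed toward the remaining two. Masks like these are also what TabNet visualizes to show which features mattered.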

DeepMind uses AlphaFold library to predict COVID-19 protein structures

The machine learning community has responded to the outbreak of COVID-19 with efforts to help scientific researchers understand the virus. Google-owned DeepMind is using its protein modeling library, AlphaFold, to predict the structures of understudied proteins associated with COVID-19. The predicted structures can provide insight into how the proteins function and connect with other molecules. While AlphaFold’s generated structures are estimates and may vary from a protein’s true structure, combining ML with domain expertise in this way could allow tests and drugs to be developed much faster than with traditional methods. An implementation of AlphaFold and the predictions of the SARS-CoV-2 protein structures have been made available to the public.

This research was performed under the Novetta Machine Learning Center of Excellence.


Laura Dedic
Jiong Huang
Matt Teschke