BASELINE • JULY 2020

An ML Newsletter from Novetta

Welcome to the July 2020 BASELINE, Novetta’s Machine Learning Newsletter, where we share thoughts on important advances in machine learning technologies. This is another special month, as we invited additional members of our 2020 intern class to share their latest findings. They selected topics covering:

  • Fairness and bias in ML
  • New methods for interacting with tabular data
  • Information extraction from forms
  • Preprocessing for deepfake detection models

Algorithmic Injustices

As more questions arise concerning the ethical implications of machine learning (ML) usage, prominent conferences such as NeurIPS are beginning to ask contributors to evaluate and disclose the ethical implications of their work as part of the submission process. One way that ethical considerations are being raised is in the field of relational ethics. Relational ethics broadly denotes a search for justice and fairness in human interaction; in the context of ML, it becomes a search for justice and fairness in algorithmic systems and the biases they encode. Researchers at University College Dublin suggest several resolutions for achieving this goal in their paper Algorithmic Injustices: Towards a Relational Ethics. Their central premise can be summarized as follows:

When designing new ML algorithms and products, focus on impact over methodology, and consider whether particular groups are affected in ways that reinforce existing social norms or biases. What society deems fair and ethical changes over time; definitions of bias, fairness, and ethics for a proposed solution should therefore be treated as in flux, open to revision as those definitions evolve.

The applications of ML are endless, but so is the potential for negative unintended consequences for vulnerable, disproportionately impacted groups. Relational ethics is one tool that might mitigate the unintended consequences of ML and algorithmic bias.

A New Model for Understanding Tabular Queries

TaBERT is the first pretrained language model to jointly learn representations of both unstructured text and structured tabular data, such as traditional database tables. Using the idea of a “content snapshot,” TaBERT focuses on the most relevant parts of tabular data by encoding the content of a small, question-relevant subset of rows, rather than simply learning column names. Pairing the two modalities lets a natural-language question be grounded in the table that contains its answer. TaBERT is designed to enable analysts with no knowledge of structured query languages such as SQL to phrase questions in natural language and perform complicated queries, retrieving answers from large, structured databases.
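The content-snapshot idea can be illustrated with a much simpler stand-in: rank table rows by how much they overlap with the question, and keep only the top few. The token-overlap scoring below is a toy heuristic (the paper uses n-gram overlap between the utterance and cell values before encoding), and the example table is invented for illustration.

```python
import re

def content_snapshot(question, rows, k=3):
    """Return the k table rows with the highest token overlap with the question.

    A simplified sketch of TaBERT's content snapshot: instead of encoding
    the full table, keep only the rows most relevant to the utterance.
    """
    q_tokens = set(re.findall(r"\w+", question.lower()))

    def overlap(row):
        row_tokens = set(re.findall(r"\w+", " ".join(str(v) for v in row).lower()))
        return len(q_tokens & row_tokens)

    return sorted(rows, key=overlap, reverse=True)[:k]

# Toy table of (country, host city, year) rows.
table = [
    ["United States", "Houston", 2012],
    ["Canada", "Toronto", 2015],
    ["France", "Paris", 2016],
]
print(content_snapshot("Which city hosted in 2016?", table, k=1))
# → [['France', 'Paris', 2016]]
```

Only the snapshot rows would then be linearized and fed to the model alongside the question, keeping the input short while preserving the cells most likely to contain the answer.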

Better Information Extraction from Forms

Extracting information from forms in a quick, accurate, and consistent way is difficult due to differences in organization, fields, or even file formats. Human review of forms is labor-intensive, and optical character recognition methods alone may miscapture or fail to identify all information. A recent paper from UC San Diego and Google researchers outlines a new method for reading form documents and learning new document formats using Google’s Vision and NLP cloud services. Starting from a scanned document and a set of fields to extract (and their information types, e.g., currency, date, or alphanumeric string), optical character recognition pulls all text from the document, while the Vision cloud service records the locations of the text within the document, independent of the text’s content. The NLP cloud service then performs entity extraction on the scanned text, creating lists of text pieces by type. The pertinent information about each piece of text is encapsulated in an embedding of its position and its relationship to surrounding text (its neighborhood). Each field is then either matched to the single candidate text piece whose embedding best fits the field, or determined to have no matching string.
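The final matching step can be sketched as a nearest-neighbor search in embedding space. Everything numeric below is a made-up stand-in: in the paper, a candidate's embedding encodes its position and neighboring tokens and the scorer is learned, whereas here we use raw cosine similarity and an arbitrary threshold purely to illustrate the select-or-reject decision.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_field(field_embedding, candidates, threshold=0.5):
    """Pick the candidate text whose neighborhood embedding best matches
    the field embedding, or return None if nothing clears the threshold."""
    best_text, best_score = None, threshold
    for text, embedding in candidates:
        score = cosine(field_embedding, embedding)
        if score > best_score:
            best_text, best_score = text, score
    return best_text

# Toy example: two "date"-typed candidates found by entity extraction,
# each paired with a fabricated neighborhood embedding.
candidates = [
    ("2020-07-01", [0.9, 0.1, 0.0]),  # appeared near the words "invoice date"
    ("2020-08-15", [0.1, 0.9, 0.2]),  # appeared near the words "due date"
]
invoice_date_field = [1.0, 0.0, 0.1]  # hypothetical embedding for the target field
print(match_field(invoice_date_field, candidates))
# → 2020-07-01
```

Note that the decision uses only the embedding (position and neighborhood), not the candidate's text content, mirroring the paper's guard against overfitting to specific strings.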

The method outperforms others on invoices and receipts in all but one field type, performs well on multiple file formats and unseen form templates, and is interpretable through proximity visualizations of candidate text and target field embeddings. It is also designed to guard against overfitting by omitting the candidate text content from consideration when selecting strings for fields.

Data Preprocessing for Deepfake Detection  

While the hype in machine learning tends to focus on bigger and better neural networks, practitioners know that data selection and processing can have a large influence on model performance. In the case of deepfakes, one common detection process uses face detection algorithms to isolate the face region of each frame in a video and build a dataset of only those crops. This approach allows researchers to find inconsistencies and potential manipulations across a series of images. However, false positives output by face detectors (i.e., images that do not actually contain faces) can add noise to the training datasets used for deepfake detection models. By incorporating a preprocessing step outlined by a team at the Centre for Research and Technology Hellas (CERTH), most false positives can be eliminated. This step first computes a facial embedding for each face image in the dataset. These embeddings are then used to compute similarity scores between faces across a series of images, enabling researchers to remove spurious images whose embeddings do not meet a baseline similarity threshold. This process improves overall deepfake detection accuracy by 5-12%.
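The filtering step can be sketched as follows. The embeddings and threshold here are invented for illustration; in practice, the vectors would come from a pretrained face-embedding network, and the threshold would be tuned on real data.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_faces(embeddings, threshold=0.5):
    """Keep indices of crops whose average similarity to the other crops
    clears the threshold; spurious non-face crops tend to score low."""
    keep = []
    for i, e in enumerate(embeddings):
        sims = [cosine(e, other) for j, other in enumerate(embeddings) if j != i]
        if sum(sims) / len(sims) >= threshold:
            keep.append(i)
    return keep

# Toy embeddings: three similar face crops and one outlier false positive.
faces = [
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.00, 0.10, 0.95],  # e.g., a background patch misdetected as a face
]
print(filter_faces(faces))
# → [0, 1, 2]
```

Dropping the outlier before training denoises the dataset, which is where the reported accuracy gains come from.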

This research was performed under the Novetta Machine Learning Center of Excellence.


Authors:

Amber Chin
Annie Ghrist
Christian F. Jung
Carlos Martinez