A Powerful, Transformers-Based Framework for End-to-End Natural Language Processing Tasks
BERT’s release in November 2018 marked an inflection point for Natural Language Processing (NLP), moving us closer to human-level performance on a range of use cases. Since then, many transformer-based language models, such as XLM, GPT-2, XLNet, and RoBERTa, have been released, with each subsequent model raising the performance ceiling. With the availability of these state-of-the-art pre-trained language models, we can produce NLP models that solve a variety of downstream language-based tasks, enabling our customers to quickly analyze large volumes of unstructured text.
The challenge with using these models is that each differs slightly in its application. For example, there is a multitude of open-source pre-trained models tuned on different datasets for the named entity recognition task, and each may require data to be formatted differently for input and output. Alternatively, custom-trained models can be completely open-ended in the input data they take for training and the output they return at inference. Add to this the rapid pace of progress in the NLP field, plus the pressure to stay ahead of the curve, and the obstacles seem almost insurmountable.
To address this challenge, we developed NovettaNLP, a high-level framework for testing, running, and deploying state-of-the-art NLP models. Because NovettaNLP standardizes the input data, output data, and function calls regardless of which model is used in the backend, developers can more easily use NLP algorithms. As new models are released in the future, NovettaNLP users will not need to change their code; NovettaNLP will handle the incorporation of new models. All of these functions are available in a single Python package that is easy to use and maintain.
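The idea of a standardized interface over differing backends can be illustrated with a minimal sketch. This is not NovettaNLP's actual API: the names `Entity`, `backend_a`, `backend_b`, and `tag_entities` are hypothetical, and the two toy backends use a small lookup table rather than real models. The sketch only shows how per-backend adapters can give callers one output shape.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Entity:
    """Standardized output record, identical for every backend."""
    text: str
    label: str
    start: int
    end: int


# Toy lookup table standing in for a trained model (assumption for illustration).
GAZETTEER = {"Novetta": "ORG", "Berlin": "LOC"}


def _find(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, name, label) for each known name found in the text."""
    hits = []
    for name, label in GAZETTEER.items():
        idx = text.find(name)
        if idx != -1:
            hits.append((idx, idx + len(name), name, label))
    return sorted(hits)


def backend_a(text: str) -> List[dict]:
    """Hypothetical backend A: returns a list of dicts including the matched word."""
    return [{"word": n, "entity": l, "start": s, "end": e} for s, e, n, l in _find(text)]


def backend_b(text: str) -> List[tuple]:
    """Hypothetical backend B: returns (label, (start, end)) tuples without the text."""
    return [(l, (s, e)) for s, e, _n, l in _find(text)]


def adapt_a(text: str, raw: List[dict]) -> List[Entity]:
    return [Entity(r["word"], r["entity"], r["start"], r["end"]) for r in raw]


def adapt_b(text: str, raw: List[tuple]) -> List[Entity]:
    # Backend B omits the surface text, so recover it from the character offsets.
    return [Entity(text[s:e], label, s, e) for label, (s, e) in raw]


# Registry mapping a backend name to its (model, adapter) pair.
BACKENDS: Dict[str, Tuple[Callable, Callable]] = {
    "a": (backend_a, adapt_a),
    "b": (backend_b, adapt_b),
}


def tag_entities(text: str, backend: str = "a") -> List[Entity]:
    """Single entry point: the caller's code is the same whichever backend runs."""
    run, adapt = BACKENDS[backend]
    return adapt(text, run(text))
```

Calling `tag_entities("Novetta opened an office in Berlin.", backend="a")` and the same call with `backend="b"` returns identical `Entity` lists, so swapping the backend requires no change to the calling code.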
What can you do with it?
NovettaNLP enables simple and intuitive implementations of state-of-the-art, pre-trained NLP models for the following applications:
Table 1. Applications supported by NovettaNLP
|Application|Description|
|---|---|
|Named Entity Recognition|Identify entity types such as person, location, and organization in unstructured text|
|Part-of-Speech Tagging|Tag noun, verb, adjective, and other parts of speech|
|Text Classification|Classify text into topics or sentiment|
|Question Answering|Query unstructured text and receive an answer, e.g., “What does Person X do?”|
NovettaNLP provides access to these capabilities through a standardized application programming interface (API), empowering users unfamiliar with recent developments in NLP to utilize state-of-the-art tools. Applications are initially available in English; by taking advantage of the multilingual pre-trained language models included in the underlying open-source libraries, users can also build models for languages including German, Russian, and Chinese.
As an example of internal use of NovettaNLP, one team combined two NLP capabilities to automatically generate an accurate, detailed timeline from unstructured text, in this case a history of machine learning. Named Entity Recognition was used to extract date spans from the text. Queries about the events on those dates were then fed into the contextual question answering model, which returned an “answer” mapped to each date.
By ordering these results chronologically, the team obtained an accurate timeline representation from the source text as shown below.
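As a rough illustration of that pipeline (not the team's actual code), the sketch below chains a date extractor, a question answering step, and a chronological sort. Both model steps are stand-ins: `extract_years` uses a regular expression in place of an NER date tagger, and `toy_answer` returns the first sentence mentioning the year in place of a contextual QA model. All function names and the sample text are assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

SAMPLE = (
    "In 1997, Deep Blue defeated Garry Kasparov at chess. "
    "The perceptron was introduced by Frank Rosenblatt in 1958. "
    "In 2012, AlexNet won the ImageNet competition."
)


def extract_years(text: str) -> List[int]:
    """Stand-in for NER date tagging: distinct four-digit years, in order of appearance."""
    seen: List[int] = []
    for match in re.finditer(r"\b(?:1[89]\d{2}|20\d{2})\b", text):
        year = int(match.group())
        if year not in seen:
            seen.append(year)
    return seen


def toy_answer(question: str, context: str) -> str:
    """Stand-in for a QA model: return the first sentence containing the queried year."""
    year = re.search(r"\d{4}", question).group()
    for sentence in re.split(r"(?<=[.!?])\s+", context):
        if year in sentence:
            return sentence.strip()
    return ""


def build_timeline(
    text: str, answer: Callable[[str, str], str] = toy_answer
) -> List[Tuple[int, str]]:
    """Ask one question per extracted date, then order the answers chronologically."""
    events = [(year, answer(f"What happened in {year}?", text)) for year in extract_years(text)]
    return sorted(events)
```

Running `build_timeline(SAMPLE)` yields one `(year, answer)` pair per extracted date, ordered from earliest to latest even though the source text mentions the dates out of order. A real model can be swapped in by passing a different `answer` callable.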
To further improve model performance, NovettaNLP enables users to fine-tune each model on a target dataset. Pre-trained models work well out of the box, but they are trained on general datasets covering a range of topics, and target datasets (such as Reddit comments or Twitter posts) can differ in how language is used. In those cases, fine-tuning models on target data can greatly improve model performance.
How does it work?
NovettaNLP is built on top of two open-source libraries, flair (Zalando Research) and transformers (Hugging Face), which provide access to pre-trained language models and language-based task models. While a user could use these libraries directly for the applications in Table 1, NovettaNLP provides a clear, easily accessible interface whose inputs and outputs remain consistent over time: when the underlying libraries change, the resulting code changes stay hidden from the user.
NovettaNLP also provides a REST service that stands up backend API endpoints for inference with pre-trained and custom models. After installing the package and building the Docker app, users can make simple inference calls and build UI implementations against these endpoints. The service can run on CPUs, but NVIDIA GPUs will significantly speed up training and inference.
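The general shape of such an inference endpoint can be sketched with the Python standard library alone. This is a minimal sketch and not NovettaNLP's REST service: the `/classify` route, the JSON payload format, and the keyword-based `stub_sentiment` model are all assumptions made for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def stub_sentiment(text: str) -> dict:
    """Placeholder for a real model: labels text by a single keyword."""
    label = "POSITIVE" if "good" in text.lower() else "NEGATIVE"
    return {"label": label}


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/classify":  # hypothetical route
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(stub_sentiment(payload.get("text", ""))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def start_server() -> HTTPServer:
    """Start the server on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), InferenceHandler)  # port 0: OS picks a port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def classify(base_url: str, text: str) -> dict:
    """Client side: POST the text as JSON and decode the JSON reply."""
    request = Request(
        f"{base_url}/classify",
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener = build_opener(ProxyHandler({}))  # bypass any system proxy for local calls
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

To try it, call `server = start_server()`, build the base URL from `server.server_address[1]`, and pass it to `classify`. Because the client only depends on the route and the JSON shape, the stub could be replaced by a real model without changing the calling code.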
NovettaNLP simplifies access to state-of-the-art machine learning for our customers in a consistent, scalable fashion. It can also be integrated into a CI/CD pipeline, helping researchers and engineers move quickly from experimentation to production.
This research was performed under the Novetta Machine Learning Center of Excellence.