Headliner — Easy training and deployment of seq2seq models

By Christian Schäfer

At Axel Springer, Europe’s largest digital publishing house, we own a large number of news articles from media outlets such as Welt, Bild, Business Insider and many more. Arguably, the most important part of a news article is its title, so it is not surprising that journalists spend a fair amount of time coming up with a good one. For this reason, we at Axel Springer AI wanted to find out whether we could create an NLP model that generates quality headlines from Welt news articles (see Figure 1). Such a model could, for example, give our journalists inspiration for SEO titles, which they often don’t have time to write (in fact, we’re working together with our colleagues from SPRING on an SEO title generator).

Figure 1: One example from our Welt.de headline generator.

In the process of our research, we created a library called “Headliner” to generate our headlines. Headliner is a sequence modeling library that eases the training and, in particular, deployment of custom sequence models. In this article we will go through the main features of the library as well as why we decided to create it. And we love open-source, so you can find the code on GitHub if you want to try it out.

Generating news headlines is not a new topic. Konstantin Lopyrev already used deep learning in 2015 to generate headlines from six major news agencies, including the New York Times and the Associated Press. Specifically, he used an encoder-decoder neural network architecture (LSTM units and attention, see Figure 2) to solve this problem. More generally, headline generation can be seen as a text summarization problem, an area in which a lot of research has been done. A gentle introduction to this topic can be found on Machine Learning Mastery or FloydHub.

Figure 2: Encoder-decoder sequence-to-sequence model.

Why Headliner?

When we started this project, we did some research on existing libraries. There are many of them out there, such as Facebook’s fairseq, Google’s seq2seq, and OpenNMT. Although those libraries are great, they have a few drawbacks for our use case. For example, fairseq doesn’t focus much on production, and Google’s seq2seq is not actively maintained. OpenNMT came closest to matching our requirements, as it has a strong focus on production. We, however, wanted to provide a leaner repository that users could easily customize and extend.

Therefore, we built our library with the following goals in mind:

  • Provide an easy-to-use API for both training and deployment
  • Leverage new TensorFlow 2.x features such as tf.function and tf.keras.layers
  • Be modular by design and easily connectable with other libraries like Hugging Face’s transformers or SpaCy
  • Be extensible for different encoder-decoder models
  • Work on large data

We approached the problem scientifically by starting with the basics and iteratively increasing complexity. Consequently, we first built a simple baseline repository in TensorFlow 2.0, which had just been released at that time. During this process we learned to appreciate many of TensorFlow’s new features that really ease development, specifically:

  • Integration of Keras with model subclassing API
  • Default eager execution that enables developers to debug into the execution of the computation graphs
  • The possibility to write graph operations with natural Python syntax that are interpreted by AutoGraph

The repository consists of separate modules for data preprocessing, vectorization and model training, which makes it easy to test, customize and extend. For example, you can easily integrate your own custom tokenizer into the training pipeline and ship it with the model.

Many machine learning repositories pay little attention to the deployment of trained models. For example, it is common for inference code to depend on global parameter settings. This is dangerous, since a deployed model is bound to the exact preprocessing logic used during its training, so even a slight change in the inference setup can badly degrade predictions. A better strategy is to serialize all preprocessing logic together with the model. Headliner does this by bundling together all modules that are involved in preprocessing and inference.
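The idea of shipping preprocessing together with the model can be illustrated with a small, library-independent sketch. The class name InferenceBundle, its fields and the JSON file layout are hypothetical and chosen for illustration only; this is not Headliner's actual serialization format, and the weights list merely stands in for real model weights.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class InferenceBundle:
    """Hypothetical container: preprocessing settings, vocabulary and weights."""
    lowercase: bool
    vocab: Dict[str, int]
    weights: List[float]  # stand-in for real model weights

    def vectorize(self, text: str) -> List[int]:
        # Apply the preprocessing settings stored in the bundle, never global ones.
        if self.lowercase:
            text = text.lower()
        # Tokens missing from the vocabulary map to index 0.
        return [self.vocab.get(token, 0) for token in text.split()]

    def save(self, path: str) -> None:
        # Settings, vocabulary and weights are written as one unit.
        with open(path, 'w', encoding='utf-8') as f:
            json.dump(asdict(self), f)

    @classmethod
    def load(cls, path: str) -> 'InferenceBundle':
        with open(path, encoding='utf-8') as f:
            return cls(**json.load(f))


# Usage: round-trip the bundle through a temporary file.
path = os.path.join(tempfile.mkdtemp(), 'bundle.json')
bundle = InferenceBundle(lowercase=True, vocab={'hello': 1, 'world': 2}, weights=[0.5])
bundle.save(path)
restored = InferenceBundle.load(path)
print(restored.vectorize('Hello brave World'))  # [1, 0, 2]
```

Because the lowercase flag and the vocabulary travel with the weights, the restored object vectorizes text exactly as it did at training time, regardless of any settings in the serving environment.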

Figure 3: Model structure for (de)-serialization.

Once the codebase was built, we started to add more complex models such as attention-based recurrent networks and the Transformer. Finally, we implemented a state-of-the-art summarizer based on fine-tuning pre-trained BERT language models.

Recently, a fine-tuned BERT model achieved state-of-the-art performance for abstractive text summarization across several datasets [1]. The authors made two key adjustments to the BERT model: customized data preprocessing and a specific optimization schedule for training. We integrated these adjustments into separate preprocessing modules that can be used out of the box; a tutorial is available on GitHub.

To make the text consumable for BERT it is necessary to split it into sentences enclosed in special tokens. Here is an example:

In Headliner this preprocessing step is performed by the BertPreprocessor class, which internally uses a SpaCy Sentencizer to do the splitting. Then, the article is mapped to two sequences, first the token index sequence and second the segment index sequence to distinguish multiple sentences (see Figure 4).
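The input format described above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the BertPreprocessor implementation: a naive regular expression replaces the SpaCy Sentencizer, whitespace splitting replaces BERT's WordPiece tokenizer, and the final mapping from tokens to vocabulary indices is omitted. The function names are hypothetical. The alternating 0/1 segment indices follow the scheme described in [1].

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def to_bert_input(text: str) -> Tuple[List[str], List[int]]:
    """Wrap each sentence in [CLS] ... [SEP] and assign alternating segment ids."""
    tokens: List[str] = []
    segments: List[int] = []
    for i, sentence in enumerate(split_sentences(text)):
        # Whitespace tokenization as a stand-in for WordPiece.
        sentence_tokens = ['[CLS]'] + sentence.split() + ['[SEP]']
        tokens.extend(sentence_tokens)
        # Even sentences get segment 0, odd sentences get segment 1.
        segments.extend([i % 2] * len(sentence_tokens))
    return tokens, segments


tokens, segments = to_bert_input('I love dogs. Cats are fine too.')
print(tokens)
# ['[CLS]', 'I', 'love', 'dogs.', '[SEP]', '[CLS]', 'Cats', 'are', 'fine', 'too.', '[SEP]']
print(segments)
# [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```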

Figure 4: Architecture of the original BERT model (left) and BertSum from [1] (right). For BertSum multiple sentences are enclosed by [CLS] and [SEP] tokens, and segment embeddings are used to distinguish multiple sentences.

The BertSum model is composed of a pre-trained BERT model as encoder and a standard Transformer as decoder. The pre-trained encoder is carefully fine-tuned, whereas the decoder is trained from scratch. To deal with this mismatch, it is necessary to employ two separate optimizers with different learning rates and schedules. We found that training is relatively sensitive to hyperparameters such as learning rate, batch size and dropout, which therefore need some tuning for each dataset. We trained the model on a dataset of 500k articles with headlines from the WELT newspaper and were quite impressed by the results. Here is an example:

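(input) Drei Arbeiter sind in der thailändischen Hauptstadt Bangkok vom 69 Stockwerk des höchsten Wolkenkratzers des Landes in den Tod gestürzt. Die Männer befanden sich mit zwei weiteren Kollegen auf einer am 304 Meter hohen Baiyoke-Hochhaus herabgelassenen Arbeitsbühne, um Werbung anzubringen, wie die Polizei am Montag mitteilte. Plötzlich sein ein Stützkabel gerissen, worauf die Plattform in zwei Teile zerbrochen sei. Nur zwei der fünf Männer konnten sich den Angaben zufolge rechtzeitig an den Resten der Arbeitsbühne festklammern. Sie wurden später vom darunter liegenden Stockwerk aus gerettet.
(English: Three workers fell to their deaths from the 69th floor of the country's tallest skyscraper in the Thai capital Bangkok. The men were with two other colleagues on a work platform lowered from the 304-meter-high Baiyoke tower to put up advertising, police said on Monday. A support cable suddenly snapped, whereupon the platform broke in two. According to the report, only two of the five men managed to cling to the remains of the platform in time. They were later rescued from the floor below.)

(target) [CLS] Unfälle: Drei Arbeiter stürzten in Bangkok vom 69. Stock in den Tod [SEP]
(English: Accidents: Three workers fell to their deaths from the 69th floor in Bangkok)

(prediction) [CLS] Unglücke: Drei Arbeiter stürzen von höchstem Wolkenkratzer in Bangkok in den Tod [SEP]
(English: Accidents: Three workers fall to their deaths from tallest skyscraper in Bangkok)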
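To make the two-optimizer setup mentioned above more concrete, here is a minimal sketch of the learning rate schedule with warmup followed by inverse square root decay, applied separately to encoder and decoder. The constants follow the values reported in [1] as far as we can tell (2e-3 with 20,000 warmup steps for the encoder, 0.1 with 10,000 warmup steps for the decoder); they are assumptions for illustration and not necessarily Headliner's defaults. The function names are hypothetical.

```python
def warmup_decay_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup up to warmup_steps, then inverse square root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return base_lr * min(step ** -0.5, step * warmup_steps ** -1.5)


def encoder_lr(step: int) -> float:
    # Pre-trained encoder: small rate and long warmup (assumed values from [1]).
    return warmup_decay_lr(step, base_lr=2e-3, warmup_steps=20_000)


def decoder_lr(step: int) -> float:
    # Decoder trained from scratch: larger rate and shorter warmup (assumed values from [1]).
    return warmup_decay_lr(step, base_lr=0.1, warmup_steps=10_000)


for step in (1_000, 10_000, 20_000, 100_000):
    print(f'step {step:>7}: encoder {encoder_lr(step):.2e}, decoder {decoder_lr(step):.2e}')
```

In a training loop, each of the two functions would drive its own optimizer, so that the encoder weights change slowly while the decoder can learn quickly from its random initialization.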

[1] Liu, Y. and Lapata, M., 2019. Text summarization with pretrained encoders. arXiv preprint arXiv:1908.08345.

To get started, you can use the library out-of-the-box to train a summarizer model. Just install Headliner via pip:

pip install headliner

All you need to do is provide the data as a list (or generator) of string tuples for input and target. Then you create a summarizer model and a trainer. After training, the model can be saved to a folder and loaded again for inference. In this minimal example, the trainer takes care of the data preprocessing and vectorization using a simple word-based tokenizer:

from headliner.trainer import Trainer
from headliner.model.transformer_summarizer import TransformerSummarizer

data = [('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are great, but I have other plans.', 'I like you.')]

# train summarizer and save model
summarizer = TransformerSummarizer(num_layers=1)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=2)
summarizer.save('/tmp/summarizer')

# load model and do a prediction
summarizer = TransformerSummarizer.load('/tmp/summarizer')
summarizer.predict('You are the stars, earth and sky for me!')
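For corpora that do not fit into memory, the in-memory list can be swapped for a generator of string tuples. The following sketch streams (text, headline) pairs from a tab-separated file. The file layout and the helper read_pairs are hypothetical and not part of Headliner; how the trainer consumes such a generator across several epochs is best checked in the library's documentation.

```python
import csv
import os
import tempfile
from typing import Iterator, Tuple


def read_pairs(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (text, headline) tuples one row at a time from a TSV file."""
    with open(path, newline='', encoding='utf-8') as f:
        # QUOTE_NONE keeps quotation marks in the article text untouched.
        for row in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            # Skip malformed rows instead of failing halfway through a large file.
            if len(row) == 2 and all(row):
                yield row[0], row[1]


# Usage with a throwaway file; real data would live elsewhere.
path = os.path.join(tempfile.mkdtemp(), 'articles.tsv')
with open(path, 'w', encoding='utf-8', newline='') as f:
    f.write('Article one text.\tHeadline one\n')
    f.write('broken row without a tab\n')
    f.write('Article two text.\tHeadline two\n')

pairs = list(read_pairs(path))
print(pairs)
# [('Article one text.', 'Headline one'), ('Article two text.', 'Headline two')]
```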

For further information on how to use the library, have a look at our tutorials on GitHub.

Summary

In this article, we presented Headliner, the library we used internally for our research on generating news headlines, and showed how easy it is to use. We also talked about BertSum, a state-of-the-art approach to text summarization that is implemented in the library as well. Please check out our library and give us feedback.

If you found this article useful, give us a high five 👏🏻 so others can find it too, and share it with your friends. Follow us on Medium (Christian Schäfer and Dat Tran) to stay up-to-date with our work. Thanks for reading!