Beyond BERT? Transformers in 2020.
By Sergi Castella Sapé.
2019 was the year of BERT and much has been written about it. Truth be told, it’s hard to overestimate the impact Transformers have had on the NLP community: LSTMs now sound old-fashioned (or do they?²), state-of-the-art papers kept coming steadily throughout 2019 and, at Google, BERT made it into production in record-breaking time. All of the above while enabling Transfer Learning, which is now the coolest kid in NLP-town.
The development around these models has been remarkable so far, but could Transformers have peaked already? What areas of research should we be looking at most closely? What’s still exciting about these attention-based networks in 2020? These questions were recently the focus of discussion at the Transformers at Work¹ event at Zeta Alpha, where many interesting angles on the topic were considered.
Here’s my take.
Models
2019 saw an explosion of architecture variants for Transformer models, and it’s difficult to keep up. I’m certainly forgetting some, but there were big cousins (Transformer-XL, GPT-2, ERNIE, XLNet, RoBERTa, CTRL), smaller cousins (ALBERT, DistilBERT) and, most recently, nephews like Reformer or the Compressive Transformer. It’s now clear that growing models still succeeds in improving the state of the art on many tasks, but should we? How much value does it add? Models that get smaller while preserving performance were a trend we started seeing in 2019, and one I’d like to see hold steady in 2020. Maybe some innovative approaches will appear besides model pruning or distillation? The folks at Huggingface, creators of the ubiquitous Transformers library, got us talking about this refreshing trend with the training approach to DistilBERT¹⁰, which naturally connects to my next point.
The ‘learning signal’ is essential for humans developing intelligence.
Shiny new architectures get a lot of buzz and attention (pun intended), but in ML the learning signal runs the show from backstage. Broadly speaking, a model’s performance is limited by the weakest factor in the combination of model expressivity and training signal quality (the objective or reward in RL, the loss in DL). As an example, DistilBERT is trained in a student-teacher setting¹⁰ in which the (smaller) student network tries to imitate the behaviour of the (original) teacher network. By adding this imitation term instead of training only on the original Language Modelling task, the loss function for the student network becomes much richer, allowing the network to learn more expressively. If you still don’t believe me, think of what happened with GANs³ in 2014: a simple network coupled to an interesting loss function (another network) and…💥 magic!
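To make the student-teacher idea a bit more concrete, here is a minimal sketch of a generic distillation objective in plain Python. It is not DistilBERT's actual training code: the function names and the `temperature` and `alpha` parameters are assumptions made for illustration, and the real setup described in the paper combines more terms than the two shown here.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy of the predicted distribution against a target distribution."""
    return -sum(
        t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0
    )


def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=2.0, alpha=0.5):
    """Hypothetical combined loss: imitate the teacher and fit the true label.

    alpha weighs the imitation (soft) term against the usual (hard) label term.
    """
    # Soft term: the student matches the teacher's full output distribution.
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_teacher, soft_student)

    # Hard term: ordinary negative log-likelihood of the correct class.
    hard_loss = -math.log(softmax(student_logits)[true_index])

    return alpha * soft_loss + (1 - alpha) * hard_loss


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

print(distillation_loss(close_student, teacher, true_index=0))
print(distillation_loss(far_student, teacher, true_index=0))
```

The soft term is what makes the signal richer: rather than learning only which class is correct, the student also sees how the teacher spreads probability over the wrong ones.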
Self-supervision and Language Modelling as a general purpose training signal for language tasks should be credited for NLP progress just as much as architectural revolutions, so for 2020 I want to see innovations in this domain.
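As a rough illustration of why Language Modelling counts as self-supervision, the sketch below builds a masked-token training example from raw text alone, with no human labels. It is a simplified assumption of how such an objective can be set up (the `make_mlm_example` helper, the `[MASK]` string and the 15% default are illustrative choices), not the exact masking scheme of any particular model.

```python
import random

MASK = "[MASK]"


def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; the hidden originals become the prediction targets."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    inputs, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)    # the model sees a placeholder...
            labels.append(token)   # ...and must recover the original token
        else:
            inputs.append(token)
            labels.append(None)    # no loss is computed at this position
    return inputs, labels


sentence = "the student network tries to imitate the teacher network".split()
inputs, labels = make_mlm_example(sentence, mask_prob=0.3)
print(inputs)
print(labels)
```

The labels come for free from the text itself, which is why this kind of signal scales to any corpus you can get your hands on.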
Tasks & Datasets
As you might’ve heard, the magnetic North Pole and the geographic one don’t perfectly align; in fact, the magnetic one keeps jiggling around year after year. Still, if you’re in the Netherlands and want to head towards the true North Pole, a conventional compass will be an excellent guide, or at least better than none at all. As you get closer to your destination, however, the bias of your compass will become increasingly evident, making it unsuited for the task.
An analogy can clearly be drawn here for AI research. Objective measurement is the cornerstone of scientific development, and even a biased metric is usually better than none at all. How progress is measured is a big driver of how a field evolves and what research gets done at the end of the day, and that’s precisely why we need to design evaluations thoroughly, in alignment with the incentives that will yield optimal development. Standard NLP tasks have been an amazing compass for research in the past few years. However, the closer we are to solving a dataset, the worse it becomes as a metric for progress, which is why it’s exciting to see new benchmarks gain momentum in 2020.
As an example, Facebook Research has recently been working on a new dataset and benchmark for Long Form Question Answering: ELI5 (Explain Like I’m Five; yes, it’s based on the famous subreddit of the same name). The aim of this new dataset is to propel research in the field of Open Domain Question Answering, pushing the boundaries of the tasks Transformers currently excel at.
[…] a Long Form Question Answering dataset that emphasizes the dual challenges of isolating relevant information within long source documents and generating paragraph-length explanations in response to complex, diverse questions.⁴
Another interesting new dataset is the PG-19 Language Modelling Benchmark from DeepMind: a benchmark for long-range language modelling (book scale!), released along with yet another Transformer reincarnation by the name of Compressive Transformer⁵. Hopefully, this task will help to overcome the current limitations of Seq2Seq models in dealing with (very) long-term dependencies.
Even the ubiquitous GLUE Benchmark is getting a much-needed facelift. SuperGLUE⁶ arrived as a strong contender to become the de facto general-purpose benchmark for Language Understanding in the near future. It includes, among other things, more challenging tasks and more comprehensive human baselines.
This section wouldn’t be complete without mentioning one of my favourite recent papers on the broader topic, On the Measure of Intelligence by François Chollet, which flirts with a philosophical spin on the issue while nonetheless bringing a concrete proposal to the table: the Abstraction and Reasoning Corpus (ARC) and its challenging Kaggle competition. Keep these great initiatives coming!
A Better Understanding
There is something attractively mysterious about systems we don’t fully comprehend. Often, our perception of intelligence in an algorithm is inversely proportional to how deeply we understand its machinery. Not that long ago, people used to think that intelligence was required to master the game of chess; then Deep Blue beat Garry Kasparov in 1997 and, once we understood how it could be done, the feat no longer seemed to require intelligence.
Building a solid understanding of the ‘why’ questions is crucial for making progress: models might look great on task leaderboards, but we shouldn’t draw premature conclusions about their capabilities without carefully investigating their inner workings. Mapping this idea onto the space of Transformers, a lot of work has been devoted to unpacking why these models work as well as they do, but the recent literature has not yet converged on a clear conclusion.
For instance, in studying the behaviour of BERT’s pretrained model, “What Does BERT Look At?”⁷ concluded that certain attention heads are responsible for detecting linguistic phenomena; whereas, against many intuitions, “Attention is not Explanation”⁸ asserts that attention weights are not a reliable signal for interpreting what a model has learned. “Revealing the Dark Secrets of BERT”⁹ provides valuable insights into what happens during fine-tuning, but the scope of its conclusions is limited: no clear linguistic phenomenon is being captured by attention, BERT is heavily overparametrized (surprising!🤯), and BERT doesn’t need to be very smart to solve most tasks. This kind of qualitative exploration is easy to overlook because it doesn’t show up in the metrics, but we should always keep an eye on it.
In conclusion, many secrets about why Transformers work remain to be unveiled, which is why it’s exciting to wait for new research to come up in this realm during 2020.
Those were my top picks, although many other topics also deserved a spotlight in this post, such as how frameworks like Huggingface transformers will keep growing to empower research, the widening scope of Transfer Learning, or new approaches that effectively combine Symbolic reasoning with DL methods. What’s your take? What excites you most about Transformers in 2020?
This post was also published in a slightly modified format on Towards Data Science on Medium.com.
1. “Transformers at Work”, January 17th 2020. Zeta Alpha Vector.
2. Stephen Merity, 2019. Single Headed Attention RNN: Stop Thinking With Your Head.
3. Ian Goodfellow et al., 2014. Generative Adversarial Networks.
4. Angela Fan, Yacine Jernite, Ethan Perez et al., 2019. ELI5: Long Form Question Answering.
5. Jack W. Rae et al., 2019. Compressive Transformers for Long-Range Sequence Modelling.
6. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh et al., 2019. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.
7. Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning, 2019. What Does BERT Look At? An Analysis of BERT’s Attention.
8. Sarthak Jain, Byron C. Wallace, 2019. Attention is not Explanation.
9. Olga Kovaleva, Alexey Romanov, Anna Rogers, Anna Rumshisky, 2019. Revealing the Dark Secrets of BERT.
10. V. Sanh, L. Debut, J. Chaumond, T. Wolf, 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.