
ICLR 2021 — 10 papers you shouldn't miss

The International Conference on Learning Representations is already here and it’s packed with content: 860 papers, 8 workshops and 8 invited talks. Choosing where to pay attention is hard, so here are some ideas on what’s worth looking at!

A year ago, ICLR 2020 was the first big conference to go fully online, and it set a surprisingly high standard for all fully virtual conferences. This year, the conference is again an online-only event, and it's looking very promising: Transformers appear less often in titles... because they're already everywhere! Computer Vision, Natural Language Processing, Information Retrieval, ML theory, Reinforcement Learning... you name it! The variety of content in this year's edition is jaw-dropping.

When it comes to invited talks, the lineup is also exciting: Timnit Gebru will be opening the ceremony discussing how we can move beyond the fairness rhetoric in machine learning, suggesting that this topic will not be sidestepped at the conference. Workshops are also more packed than ever before, featuring Energy Based Models, Rethinking ML Papers and Responsible AI, among many others.

Making sense of this impressive lineup is no easy feat, but with some help from the AI Research Navigator at Zeta Alpha, we went through the most relevant ICLR papers by citations, Twitter popularity, author influence, spotlight presentations and some recommendations from the platform, and we identified some really cool works we’d like to highlight. Some are already well known, and some are more of a hidden gem. Of course, these picks do not aim to be a comprehensive overview (we’ll be missing out on many topics such as Neural Architecture Search, ML theory, Reinforcement Learning or Graph NNs, among others), but hey, I’ve heard it’s often better to choose sparsely and deeply than broadly and shallowly; so here’s my top 10, enjoy!

By Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai et al.

Authors' TL;DR → Transformers applied directly to image patches and pre-trained on large datasets work really well on image classification.

❓Why → First paper to show how pure Transformers could improve over the best CNNs on (sort of) large images, launching the rapid “vision transformers revolution” of the past few months.

💡Key insights → Transfer learning has proven to be extremely effective with Transformers: all state-of-the-art NLP models incorporate transfer of some kind, such as self-supervised pre-training. Broadly speaking, the larger a network is, the better it transfers to new tasks, and when it comes to big NNs, Transformers are second to none.

Driven by this vision, the authors show how a pure Transformer can perform extremely well on image classification, simply by feeding images as a sequence of patch embeddings (each one just a linear projection of the patch pixels) and training directly on large amounts of supervised data (ImageNet). The paper hints that the model could benefit from self-supervised pre-training but doesn’t provide full experiments for it.
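To make the "sequence of patch embeddings" idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the tiny 4×4 image and the random projection weights are all assumptions for illustration, and it leaves out the other parts of the model (such as the class token and position embeddings).

```python
import random

def image_to_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # all channels of this pixel
            patches.append(flat)
    return patches

def linear_projection(patches, weights):
    """Map each flattened patch to an embedding; weights has shape (patch_dim, embed_dim)."""
    return [
        [sum(p * w for p, w in zip(patch, column)) for column in zip(*weights)]
        for patch in patches
    ]

# Hypothetical toy input: a 4x4 RGB image, 2x2 patches, 8-dimensional embeddings.
rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(4)] for _ in range(4)]
weights = [[rng.gauss(0, 0.02) for _ in range(8)] for _ in range(12)]
tokens = linear_projection(image_to_patches(image, patch_size=2), weights)
print(len(tokens), len(tokens[0]))  # 4 8
```

The resulting list of 4 "tokens" is what would be fed to a standard Transformer encoder, exactly as word embeddings would be in NLP.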

The results show how ViT outperforms CNNs and even CNN + attention hybrids as soon as the model leaves the data-constrained regime, while even being more compute-efficient! Among the many interesting experiments, the authors show how the attention's effective receptive field evolves across layers: quite varied (a mix of global and local heads) in the early layers, and increasingly global deeper in the network.

By Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis et al.

Authors' TL;DR → Performers, a linear full-rank-attention Transformers via provable random feature approximation methods, without relying on sparsity or low-rankness.

❓Why → The L² complexity of full attention is still keeping many ML researchers up at night. Efficient Transformers have been coming along for a long time now³ but no proposal has clearly dominated the space… yet?

💡Key insights → Unlike other proposals for efficient Transformers, Performers don’t rely on specific heuristics for approximating attention, such as constraining attention to lower-rank approximations or enforcing sparsity. Instead, the authors propose a matrix decomposition of the self-attention mechanism into the matrices below, whose combined complexity is linear w.r.t. the sequence length L: O(Ld²log(d)) instead of O(L²d).

This decomposition relies on way too many tricks to fit them all in here, but just for name-dropping’s sake, we’re talking kernels, random orthogonal vectors and trigonometric softmax approximations. All in service of building FAVOR+, which estimates self-attention with very tight theoretical guarantees.
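The core reason a decomposition like this buys linear complexity can be sketched in a few lines: once the softmax kernel is approximated by a dot product of feature vectors, the keys and values can be summarized once and reused for every query, so the L×L attention matrix is never built. The sketch below is a simplification under my own assumptions, not FAVOR+ itself: it uses plain i.i.d. Gaussian projections rather than orthogonal ones, approximates exp(q·k) without the usual 1/√d scaling, and all names are illustrative.

```python
import math
import random

def positive_random_features(x, projections):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); E[phi(q) . phi(k)] = exp(q . k)."""
    sq_norm = sum(v * v for v in x)
    scale = 1.0 / math.sqrt(len(projections))
    return [scale * math.exp(sum(w_i * x_i for w_i, x_i in zip(w, x)) - sq_norm / 2)
            for w in projections]

def linear_attention(queries, keys, values, projections):
    """Approximate softmax attention without forming the L x L attention matrix."""
    q_feats = [positive_random_features(q, projections) for q in queries]
    k_feats = [positive_random_features(k, projections) for k in keys]
    m, d_v = len(projections), len(values[0])
    # Sequence summaries, computed once: sum_j phi(k_j) v_j^T and sum_j phi(k_j).
    kv = [[sum(kf[a] * v[b] for kf, v in zip(k_feats, values)) for b in range(d_v)]
          for a in range(m)]
    k_sum = [sum(kf[a] for kf in k_feats) for a in range(m)]
    outputs = []
    for qf in q_feats:  # each query costs O(m * d_v), independent of sequence length
        normalizer = sum(qf[a] * k_sum[a] for a in range(m))
        outputs.append([sum(qf[a] * kv[a][b] for a in range(m)) / normalizer
                        for b in range(d_v)])
    return outputs

# Hypothetical toy sizes, chosen only to keep the example fast.
rng = random.Random(0)
seq_len, dim, num_features = 6, 4, 64
queries = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(seq_len)]
keys = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(seq_len)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
projections = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_features)]
approx = linear_attention(queries, keys, values, projections)
```

The two summaries `kv` and `k_sum` are the whole trick: their size depends on the number of random features and the value dimension, not on the sequence length.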

When it comes to the actual experiments, this work compares the Performer to existing efficient Transformers such as Linformer¹ and Reformer², on tasks where modeling very long dependencies is crucial, such as studying protein sequences, where it outperforms existing architectures. Finally, one of the biggest appeals of this method is that you can reuse an existing pre-trained Transformer with the new linear attention mechanism, requiring only a bit of fine-tuning to regain most of the original performance, as you can see below (left).

By Yoav Levine et al.

Authors' TL;DR → Joint masking of correlated tokens significantly speeds up and improves BERT’s pretraining.

❓Why → A remarkably clean and straightforward idea coupled with equally remarkable results. It contributes to our understanding of the Masked Language Modeling pre-training objective.

💡Key insights → Instead of masking tokens randomly, the authors identify, using corpus statistics only, spans of tokens that are highly correlated. To do so, they extend pointwise mutual information (PMI) from pairs of tokens to spans of arbitrary length, and show that training BERT with that objective learns more efficiently than alternatives such as uniform masking, whole-word masking or random span masking.

Intuitively, this strategy works because you prevent the model from predicting masked words using very shallow correlations between words that often appear right next to each other, forcing it to learn deeper correlations in language. In the figures below, you can see how Transformers learn faster with PMI-MLM.
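As a rough illustration of the corpus-statistics part, here is the classic two-token PMI computed on a toy corpus. This is only the starting point: the paper's extension to longer spans is not reproduced here, and the `min_count` filter and the toy corpus are my own assumptions (raw PMI is known to over-reward very rare pairs).

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ) for adjacent token pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:  # assumption: skip pairs seen too rarely to trust
            continue
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores

corpus = "new york is big . new york is far . the city is big .".split()
scores = bigram_pmi(corpus)
# Highest-PMI pairs are the candidates for being masked together.
spans_to_mask_jointly = sorted(scores, key=scores.get, reverse=True)
print(spans_to_mask_jointly[0])  # ('new', 'york')
```

Masking "new" and "york" together removes the easy shortcut of guessing one from the other, which is the intuition described above.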

By Anirudh Goyal, Jordan Hoffmann, Shagun Sodhani et al.

Authors' TL;DR → Learning recurrent mechanisms which operate independently, and sparingly interact can lead to better generalization to out of distribution samples.

❓Why → If artificial intelligence is ever to resemble human intelligence in some way, it needs to generalize beyond the training data distribution. This paper, originally released a bit more than a year ago, provides insight and empirical foundations, and makes progress towards this kind of generalization.

💡Key insights → Recurrent Independent Mechanisms are NNs that implement an attentional bottleneck. This method draws inspiration from how the human brain processes the world; that is, largely by identifying independent mechanisms that only interact sparsely and causally. For instance, a set of balls bouncing around can be largely modelled independently until they collide with each other, which is an event that occurs sparsely.

RIMs are a form of recurrent networks where most states evolve on their own most of the time and only interact with each other sparsely via an attention mechanism, which can be either top down (directly between hidden states) or bottom up (between input features and hidden states). This network shows stronger generalization than regular RNNs when the input data distribution shifts.

One of the big takeaways of the whole Transformers thing is that the importance of inductive biases in NNs was arguably overstated. However, this is only true when benchmarking models in-domain. This paper shows that, in order to prove the usefulness of strong priors such as the attention bottleneck, one needs to step outside of the training domain, and most current ML/RL systems are not benchmarked in this fashion.

While the results might not be the most impressive, this paper, along with follow-up works (see below), proposes an ambitious agenda for turning our ML systems into something that resembles our brain; might I even say, merging the best of good old symbolic AI with last decade’s DL revolution. We should celebrate such attempts!

By Yang Song et al.

Authors' TL;DR → A general framework for training and sampling from score-based models that unifies and generalizes previous methods, allows likelihood computation, and enables controllable generation.

❓Why → GANs are still weird creatures… Alternatives are welcome, and this one is very promising: turning data into noise is easy, turning noise into images is… Generative modeling! And this is what this paper does.

💡Key insights → Okay I can’t say I fully understood all the details, cause there’s a lot of math that’s just over my head. But the gist of it is pretty simple: you can transform an image into “noise” as a “diffusion process”. Think of how individual water molecules move inside of flowing water: there’s some deterministic flow of the water with some added random jiggling around. You can do the same with pixel images, diffusing them such that they end up as something like noise from a tractable probability distribution. This process can be modeled as a Stochastic Differential Equation, known in physics, basically a differential equation with some added jiggling at each point in time.

Now, what if I told you that this stochastic diffusion process is… reversible! You can basically sample from this noise and work your way back up to an image. And just like this, the authors get a SOTA inception score of 9.89 and FID of 2.20 on CIFAR-10. Okay, there’s just so much more going on under the hood… you really need to check out this paper!
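For readers who want to see the water-molecule picture written down, this is the general form of the two equations as I understand them from the paper. Here f is the deterministic "flow", g controls the amount of random jiggling, and the term ∇ log p_t(x) is the "score", which is the quantity a neural network is trained to approximate.

```latex
% Forward process (data -> noise): drift f, diffusion coefficient g, Brownian motion w
\mathrm{d}\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Reverse-time process (noise -> data): \bar{\mathbf{w}} is Brownian motion running backwards in time
\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\bar{\mathbf{w}}
```

The only unknown in the reverse equation is the score, so once a model estimates it well at every noise level, generating an image amounts to numerically solving the second equation starting from pure noise.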

By Nicola De Cao, Gautier Izacard, Sebastian Riedel, Fabio Petroni.

Authors' TL;DR → We address entity retrieval by generating their unique name identifiers, left to right, in an autoregressive fashion, and conditioned on the context showing SOTA results in more than 20 datasets with a tiny fraction of the memory of recent systems.

❓Why → A new straightforward approach to entity retrieval that quite surprisingly shatters some existing benchmarks.

💡Key insights → Entity retrieval is the task of finding the exact entity that a piece of natural language refers to (which can be ambiguous at times). Existing approaches treated this as a search problem, where one retrieves an entity from a KG given a piece of text. Until now. This work proposes finding an entity identifier by autoregressively generating it, kind of like how markdown syntax hyperlinks stuff: [entity](identifier generated by model). No search + reranking, nothing, plain and simple. Effectively, this means cross-encoding entities and their context, with the advantages that the memory footprint scales linearly with the vocabulary size (no need to do a lot of dot products in the knowledge base space) and that there is no need to sample negative data.

Starting with a pre-trained BART⁵, they fine-tune it by maximizing the likelihood of autoregressively generating a corpus with entities (Wikipedia). At inference, they use constrained beam search to prevent the model from generating entities that are not valid (i.e. not in the KB). The results are just impressive; see an example in the table below.
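The constraint itself is easy to picture with a prefix tree (trie) over the valid entity names: at every step, the decoder may only pick tokens that keep the output on a path in the trie. The sketch below is a deliberately simplified stand-in, not the paper's code: it uses greedy decoding instead of beam search, whitespace tokens instead of subwords, and a hypothetical `token_scores` table in place of a real language model.

```python
def build_trie(entity_names):
    """Prefix tree over whitespace-tokenized entity names; '<eos>' marks a complete name."""
    trie = {}
    for name in entity_names:
        node = trie
        for token in name.split() + ["<eos>"]:
            node = node.setdefault(token, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens that keep the partial output a valid prefix of some entity name."""
    node = trie
    for token in prefix:
        node = node.get(token)
        if node is None:
            return []
    return sorted(node)

def constrained_greedy_decode(trie, score_fn):
    """Greedy stand-in for constrained beam search: take the best *allowed* token each step."""
    prefix = []
    while True:
        candidates = allowed_next_tokens(trie, prefix)
        best = max(candidates, key=lambda token: score_fn(prefix, token))
        if best == "<eos>":
            return " ".join(prefix)
        prefix.append(best)

# Hypothetical knowledge base and model scores, for illustration only.
trie = build_trie(["New York City", "New York Times", "New Jersey"])
token_scores = {"New": 1, "York": 2, "Jersey": 1, "City": 2, "Times": 3, "Post": 9, "<eos>": 0}
print(constrained_greedy_decode(trie, lambda prefix, token: token_scores[token]))  # New York Times
```

Note that "Post" has the highest score in the toy table but is never produced, because "New York Post" is not in the knowledge base.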

By Lee Xiong, Chenyan Xiong et al.

Authors' TL;DR → Improve dense text retrieval using ANCE, which selects global negatives with bigger gradient norms using an asynchronously updated ANN index.

❓Why → Information Retrieval resisted the “neural revolution” for many more years than Computer Vision. But since BERT, the advances in dense retrieval have been giant, and this is a fantastic example of that.

💡Key insights → When training a model to do dense retrieval, the common practice is to learn an embedding space where query-document distance is semantically relevant. Contrastive learning is a standard technique to do so: minimize the distance of positive query-document pairs and maximize that of negative pairs. However, negative samples are often chosen at random, which means they’re not very informative: most of the time, negative documents are obviously not related to the query.

The authors of this paper propose to sample negatives from the Nearest Neighbours during training, which yields documents that are close to the query (i.e. documents that the current model thinks are relevant). In practice, this means that an index of the corpus needs to be asynchronously updated during training (updating the index every iteration would be very slow). Fortunately, results confirm how BM25 baselines are finally being left behind!
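The negative-selection step described above can be sketched as follows. This is a toy version under stated assumptions: it ranks documents by brute-force dot product instead of querying an approximate nearest neighbour index, the embeddings are hand-written rather than produced by a model, and the function name is mine.

```python
def hard_negatives(query_vec, doc_vecs, positive_ids, k):
    """Return the k documents the current model ranks highest that are not labeled relevant."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: dot(query_vec, doc_vecs[i]),
                    reverse=True)
    return [i for i in ranked if i not in positive_ids][:k]

# Hypothetical embeddings from the current model checkpoint.
query = [1.0, 0.0]
docs = [
    [0.9, 0.1],   # 0: the labeled positive
    [0.8, 0.2],   # 1: looks relevant to the model, but is not -> informative negative
    [-1.0, 0.0],  # 2: obviously unrelated -> uninformative negative
    [0.5, 0.5],   # 3
    [0.0, 1.0],   # 4
]
print(hard_negatives(query, docs, positive_ids={0}, k=2))  # [1, 3]
```

In a real training loop, the document embeddings would go stale as the model changes, which is why the index has to be refreshed every so often rather than built once.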

By Denis Yarats, Ilya Kostrikov, and Rob Fergus.

Authors' TL;DR → The first successful demonstration that image augmentation can be applied to image-based Deep RL to achieve SOTA performance.

❓Why → What are you rooting for? Model-based or model-free RL? Read this paper before answering the question!

💡Key insights → Existing model-free RL methods are successful at learning from state inputs but struggle to learn from images directly. Intuitively, this is because when learning from an early replay buffer, most images are highly correlated, presenting very sparse reward signals. This work shows how model-free approaches can hugely benefit from augmentations in pixel space to become more sample-efficient in learning, achieving competitive results when compared to existing model-based approaches on the DeepMind Control Suite⁶ and 100k Atari⁷ benchmarks.
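As an example of what a pixel-space augmentation can look like, here is a small random-shift sketch: the image is conceptually padded by repeating its border pixels and then cropped back to its original size at a random offset. To my understanding this is close in spirit to the kind of augmentation the paper relies on, but the function, the padding size and the toy image are assumptions, not the authors' code.

```python
import random

def random_shift(image, pad, rng):
    """Shift a 2D image by up to `pad` pixels in each direction, repeating edge pixels."""
    height, width = len(image), len(image[0])
    dy = rng.randint(-pad, pad)  # inclusive on both ends
    dx = rng.randint(-pad, pad)

    def clamp(value, upper):
        return min(max(value, 0), upper)

    # Equivalent to edge-padding by `pad` and taking a random crop of the original size.
    return [
        [image[clamp(r + dy, height - 1)][clamp(c + dx, width - 1)] for c in range(width)]
        for r in range(height)
    ]

# Hypothetical 5x5 grayscale observation; each pixel value encodes its own position.
rng = random.Random(0)
image = [[r * 10 + c for c in range(5)] for r in range(5)]
shifted = random_shift(image, pad=2, rng=rng)
```

Applied to observations sampled from the replay buffer, a transform like this produces slightly different views of the same state, which helps decorrelate the otherwise very similar training images.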

By Sashank J. Reddi, Zachary Charles et al.

Authors' TL;DR → We propose adaptive federated optimization techniques, and highlight their improved performance over popular methods such as FedAvg.

❓Why → To make federated learning widespread, federated optimizers must become boring, just like ADAM¹¹ is in 2021. This paper attempts precisely that.

💡Key insights → Federated learning is an ML paradigm where a central model, hosted by the server, is trained by multiple clients in a distributed fashion. For instance, each client can use the data on its own device, compute a gradient w.r.t. a loss function and communicate the updated weights to the central server. This process opens up many questions, such as how one should combine weight updates from multiple clients.

This paper does a great job of explaining the current state of federated optimizers and builds a simple framework to discuss them. It also presents theoretical results on convergence guarantees, along with empirical results showing that the proposed adaptive federated optimizers work better than existing optimizers such as FedAvg⁸. The federated optimization framework presented in this paper is agnostic of the optimizer used by the client (ClientOpt) and that used by the server (ServerOpt), which makes it possible to plug techniques such as momentum and adaptive learning rates into the federated optimization process. Interestingly though, the results they showcase always use vanilla SGD as ClientOpt, and adaptive optimizers (ADAM, YOGI) as ServerOpt.
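Here is a minimal sketch of the ClientOpt/ServerOpt split with SGD on the clients and an Adam-style update on the server. It reflects my reading of the general recipe, with the averaged client change treated as a pseudo-gradient; the toy quadratic loss, the hyperparameter values and the function names are all assumptions rather than the paper's exact algorithm.

```python
import math

def client_update(weights, client_target, local_lr, local_steps):
    """ClientOpt: plain SGD on a toy loss 0.5 * ||w - target||^2. Returns the weight delta."""
    local = list(weights)
    for _ in range(local_steps):
        local = [w - local_lr * (w - t) for w, t in zip(local, client_target)]
    return [l - w for l, w in zip(local, weights)]

def server_adam_step(weights, deltas, state, lr=0.1, beta1=0.9, beta2=0.99, tau=1e-3):
    """ServerOpt: Adam-style step using the averaged client delta as a pseudo-gradient."""
    avg = [sum(column) / len(deltas) for column in zip(*deltas)]
    state["m"] = [beta1 * m + (1 - beta1) * d for m, d in zip(state["m"], avg)]
    state["v"] = [beta2 * v + (1 - beta2) * d * d for v, d in zip(state["v"], avg)]
    return [w + lr * m / (math.sqrt(v) + tau)
            for w, m, v in zip(weights, state["m"], state["v"])]

# Hypothetical setup: three clients whose local optima differ (non-identical data).
client_targets = [[1.0], [2.0], [3.0]]
weights = [0.0]
state = {"m": [0.0], "v": [0.0]}
for _ in range(20):  # communication rounds
    deltas = [client_update(weights, target, local_lr=0.1, local_steps=5)
              for target in client_targets]
    weights = server_adam_step(weights, deltas, state)
```

Swapping `server_adam_step` for a plain `weights + average delta` update would give a FedAvg-like baseline, which is the sense in which the framework is agnostic to the optimizer on each side.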

By Yuchen Liang et al.

Authors' TL;DR → A network motif from the fruit fly brain can learn word embeddings.

❓Why → The premise of this paper was too irresistible to not include it here, and it is also a superb counterpoint to the dominant strain of massive ML.

💡Key insights → Words can be represented as sparse binary vectors quite effectively (even contextualized!). This work is very similar in spirit to already-classic methods like Word2Vec⁹ and GloVe¹⁰, in the sense that word embeddings are learned using very simple neural networks and a clever use of co-occurrence corpus statistics.

The architecture is inspired by how biological neurons from fruit flies are organized: sensory neurons (PN) map onto Kenyon cells (KC) which are connected to the anterior paired lateral neuron (APL) which is responsible for recurrently shutting down most KCs, leaving only a few sparse activations.

Translating this to language, words are represented in PN neurons as a concatenation of a bag-of-words context and a one-hot vector for the middle word (see figure below). Then this vector is considered a training sample, which is projected onto the KC neurons and sparsified (only top-k values survive). The network is trained by minimizing an energy function that enforces words that share contexts to be close to each other in KC space.
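The forward pass described above (build the PN input, project onto KCs, keep only the top-k) can be sketched like this. Training is left out entirely: the `kc_weights` below are hand-written placeholders rather than learned by minimizing the energy function, and the vocabulary and function names are assumptions for illustration.

```python
def pn_vector(context_words, target_word, vocab):
    """Concatenate a bag-of-words context vector with a one-hot vector for the middle word."""
    index = {word: i for i, word in enumerate(vocab)}
    context = [0] * len(vocab)
    for word in context_words:
        context[index[word]] = 1
    target = [0] * len(vocab)
    target[index[target_word]] = 1
    return context + target

def sparse_binary_hash(pn_input, kc_weights, k):
    """Project onto KC neurons and keep only the k most active ones (the APL-style inhibition)."""
    activations = [sum(w * x for w, x in zip(row, pn_input)) for row in kc_weights]
    winners = set(sorted(range(len(activations)),
                         key=lambda i: activations[i], reverse=True)[:k])
    return [1 if i in winners else 0 for i in range(len(activations))]

# Hypothetical vocabulary and KC weights (3 Kenyon cells, 2 * |vocab| inputs each).
vocab = ["apple", "eats", "fly", "the"]
kc_weights = [
    [0, 1, 0, 1, 0, 0, 0, 0],  # responds to the context words "eats" and "the"
    [1, 0, 0, 0, 0, 0, 0, 0],  # responds to "apple" in the context
    [0, 0, 0, 0, 0, 0, 1, 0],  # responds to "fly" as the middle word
]
pn = pn_vector(["the", "eats"], "fly", vocab)
print(sparse_binary_hash(pn, kc_weights, k=2))  # [1, 0, 1]
```

Changing the context words while keeping the same middle word changes `pn`, and therefore potentially the set of winning KCs, which is what makes the embedding contextual.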

Interestingly, this allows for generating contextualized word embeddings on the fly (😉), given that the bag-of-words context can be different for a given word during inference.


Quite an exciting group of papers! It truly was a challenge to narrow it down to 10. As a closing note, I’d like to mention how much of a pleasure it is to read ICLR papers, as they’re way more polished than your average publication. Regardless, this collection ends here, but there’s still so much more to explore at the conference and I’m really looking forward to it. The team will be reporting interesting insights live from our company Twitter feed at @zetavector, so tune in if you don’t want to miss a thing.

What about you? What are you most looking forward to at the conference? Feel free to share some suggestions down in the comments👇



[1] Linformer: Self-Attention with Linear Complexity. Sinong Wang et al., 2020

[2] Reformer: The Efficient Transformer. Nikita Kitaev et al., 2020

[3] Efficient Transformers: A Survey. Yi Tay et al., 2020

[4] Big Bird: Transformers for Longer Sequences. Manzil Zaheer et al., 2020

[6] DeepMind Control Suite. Yuval Tassa et al., 2018

[7] Model-Based Reinforcement Learning for Atari. Łukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłoś, Błażej Osiński et al., 2019

[11] Adam: A Method for Stochastic Optimization. Diederik P. Kingma et al., 2015

