• castella2

The Zeta Alpha guide to NeurIPS 2020 -  10 essentials you shouldn't miss.

Updated: Dec 4, 2020

1900 papers, 10k attendees, 62 workshops, 7 invited talks. Choosing what to pay attention to is key, so here are some of our top recommendations.

Vancouver, Canada. Photo by Mike Benna on Unsplash

The Conference on Neural Information Processing Systems is always exciting because it serves as a collection of the best the field has offered this year. A good way to pick out meaningful content is to look at the papers that have already made a lot of impact on arXiv; just look at the top 25 most cited papers published at NeurIPS👇

But there are also a lot of new papers and hidden gems, so here is my personal top 10 of papers you surely don't want to miss.

Semi and Self-Supervision

Leaving behind the costly reliance on labelled data has been one of the main topics on the ML agenda in recent years; it even has its own full workshop at NeurIPS this year.

1. Bootstrap Your Own Latent, A New Approach to Self-Supervised Learning | Virtual Poster ❓Why: the results in this paper seem bizarre, which is why it’s so interesting. How can one learn representations with only positive samples and not collapse into a trivial solution? 💡Key insights: the method is fairly similar to a standard contrastive learning setting for computer vision, where augmentations are applied to images and a contrastive loss pulls together the views coming from the same source image and pushes the rest away. However, in this paper, there are no negative samples. Instead, there are two encoders:

  • T is the online encoder, whose parameters are updated at every iteration via SGD.

  • T’ is the target encoder, whose parameters are an exponential moving average of the parameters of T (in a sense it just lags a bit behind T).

High level diagram of the training procedure. Source:

The training procedure consists of encoding representations of different views of an image through T and T’ and maximizing the dot product of these representations. The fact that this method doesn’t collapse into a trivial representation is already impressive, but the results on ImageNet don’t fall short either.
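To make the lagging encoder concrete, here is a minimal sketch in plain Python of the two ingredients described above: the exponential-moving-average update that makes T’ trail T, and a similarity score between two normalized representations. The function names, the decay of 0.99 and the toy vectors are illustrative assumptions, not values taken from the paper.

```python
import math

def ema_update(target_params, online_params, decay=0.99):
    """Move each target parameter a small step toward its online counterpart."""
    return [decay * t + (1.0 - decay) * o
            for t, o in zip(target_params, online_params)]

def cosine_similarity(u, v):
    """Dot product of the two vectors after L2 normalization."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)

# Toy "parameters": T' starts far from T and drifts toward it over time.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]
for _ in range(500):
    target = ema_update(target, online)

# Representations of two views of the same image (made-up numbers).
view_a = [0.9, 0.1, 0.4]
view_b = [0.8, 0.2, 0.5]
similarity = cosine_similarity(view_a, view_b)
print(target, similarity)
```

In a real setup the online parameters would also change at every step through SGD, so the target keeps chasing a moving point instead of converging to a fixed one.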

Linear classification ImageNet performance. Source:

2. Unsupervised Data Augmentation for Consistency Training | Virtual Poster ❓Why: consistency training has a lot of potential to be a generic procedure that improves weak supervision in many tasks. As an extra fun fact, the paper was rejected at ICLR 2020, but is now at NeurIPS with an already strong citation record. 💡Key insights: in a nutshell, the unsupervised consistency loss is an agreement loss over different variations of an input (like back translation for text or random augmentation for images). The intuition is that different variations of an input should receive the same output classification, even if we don’t know which class that is, and this agreement is a valid learning signal for a classification model M. Under this setting, remarkably few true labels are needed to learn a good classifier. The results are nothing short of impressive in both Computer Vision and Natural Language Processing, where as few as 20 labels are enough to get decent performance in tasks like sentiment analysis on the IMDb dataset¹.
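As a rough illustration of the agreement idea, the sketch below scores how much a classifier's prediction changes between an input and an augmented variant of it, using a KL divergence between the two predicted distributions. The logits are made-up numbers standing in for the outputs of the model M, and the helper names are assumptions for this example only.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of the same length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(logits_original, logits_augmented):
    """Penalize disagreement between predictions on two variants of one input."""
    p = softmax(logits_original)    # prediction on the unlabeled input
    q = softmax(logits_augmented)   # prediction on its augmented variant
    return kl_divergence(p, q)

# No label is used anywhere: only the agreement between the two predictions.
agree = consistency_loss([2.0, 0.1], [2.1, 0.0])
disagree = consistency_loss([2.0, 0.1], [0.0, 2.1])
print(agree, disagree)
```

In training, a term like this on unlabeled data would be added to the usual supervised loss on the few labeled examples.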

High level overview of consistency Training. Source:

3. What Makes for Good Views for Contrastive Learning? | Virtual Poster ❓Why: contrastive learning can be understood through the lens of Information Theory, and this paper is an excellent combination of empirical and theoretical results which are helpful to better understand the fundamentals of this family of methods. 💡Key insights: Contrastive Learning in Computer Vision often implies generating different views of an image (such as crops, filtering or other transformations) and learning a model that is able to discriminate between views from this image and the rest. Interestingly, this can be formulated as maximizing the mutual information between views of the image. Diving deeper into this framework, the paper shows:

  • The amount of shared information between the views can be too little or too much, and there exists a sweet spot where the resulting representations perform best, so that performance as a function of shared information forms an inverted-U shape. The authors provide several pieces of empirical evidence for this phenomenon.

  • They show how one can use this insight to formulate what they call an “Unsupervised View Learning” framework that learns to find this sweet spot by having two models, f and g, one maximizing and one minimizing the mutual information estimate between views.

Illustration from the Colorful-Moving-MNIST dataset. Source:

4. Hard Negative Mixing for Contrastive Learning | Virtual Poster ❓Why: as with the previous suggestion, contrastive learning is one of the pillars of self-supervised representation learning, but the impact of hard negatives on the quality of learned representations is not well understood. 💡Key insights: the authors propose a new, computationally cheap method for adding synthetic hard negatives during training: MoCHi (Mixing of Contrastive Hard Negatives). The method creates synthetic hard negatives directly in the embedding space by:

  • For hard negatives: Linearly mixing features from the hardest negative samples.

  • For even harder negatives: mixing the query itself with negatives.
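The two mixing steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: the embeddings are made-up 3-dimensional vectors, "hardest" is taken to mean highest dot product with the query, the mixing weights are drawn uniformly, and the weight on the query is kept below 0.5 so the synthetic point stays closer to the negative.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mix_pair(a, b, weight):
    """Return the unit-length point weight * a + (1 - weight) * b."""
    return normalize([weight * x + (1.0 - weight) * y for x, y in zip(a, b)])

rng = random.Random(0)
query = normalize([1.0, 0.2, 0.0])
negatives = [normalize(v) for v in (
    [0.9, 0.5, 0.1], [0.1, 1.0, 0.0], [-1.0, 0.0, 0.3], [0.7, 0.6, 0.4])]

# Rank negatives by similarity to the query and keep the hardest ones.
hardest = sorted(negatives, key=lambda n: dot(query, n), reverse=True)[:2]

# Hard negative: a linear mix of two of the hardest negatives.
hard_negative = mix_pair(hardest[0], hardest[1], rng.uniform(0.0, 1.0))

# Even harder negative: mix a little of the query itself into a negative.
harder_negative = mix_pair(query, hardest[0], rng.uniform(0.0, 0.5))

print(dot(query, hardest[0]), dot(query, harder_negative))
```

Because everything happens in the embedding space, no extra images need to be encoded, which is where the low computational cost comes from.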

Surprisingly, this simple method improves self-supervised representation learning on images, and broad ablations are performed to understand its effect.

Results of linear classification on ImageNet-1k and object detection on PASCAL VOC. Source:

Others: Self-Supervised Relational Reasoning for Representation Learning and a more comprehensive selection.

Transformers and Attention

5. Untangling tradeoffs between recurrence and self-attention in neural networks | Virtual Poster ❓Why: around 2017 and 2018, seq2seq models went from being RNN-based almost across the board (GRUs², LSTMs³) to being fully attention-based (Transformers⁴). But isn’t recurrence still a valid inductive bias in NNs? Can we shed some light on self-attentive RNNs, in the sense of what general principles make them good for learning? This paper provides a theoretical framework to think about it. 💡Key insights: full self-attention has the problem that it scales badly with sequence length (quadratic), and recurrence has the problem that information fails to travel “long temporal distances” due to the well-known vanishing gradient effect, for which only heuristic-based solutions exist. This paper formalizes this tradeoff and shows how attention sparsity and gradient flow depth bound the computational complexity and the information flow in these types of networks. Somewhere within this tradeoff interesting things happen, such as an intriguingly good generalization in RL.

Results for Transfer Copy and Denoise tasks. Source:

6. Big Bird: Transformers for Long Sequences | Virtual Poster ❓Why: while BigBird is neither the first nor the last incarnation of an efficient Transformer (see the zoo of approaches in the fantastic Efficient Transformers Survey⁵), this version contains neat engineering tricks and has solid results. 💡Key idea: combine 3 different attention forms: window, global and random. With these tricks, the number of operations needed for the attention mechanism can be linear with respect to the sequence length. While this is by no means a tiny model (the window attention for their experiments is already 512 tokens, just like the OG BERT⁶), this attention mode enables modelling of much longer sequences, such as those required in genomics, for which this publication provides some results.
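To see why combining the three attention forms gives linear cost, the sketch below builds a toy sparsity pattern and counts how many query-key pairs it contains. It is a simplified illustration, not BigBird's block-sparse implementation: the window of 3, the 2 global tokens and the 2 random keys per query are arbitrary toy values, and the function names are assumptions.

```python
import random

def sparse_attention_pattern(seq_len, window=3, num_global=2, num_random=2, seed=0):
    """For each query position, return the set of key positions it attends to."""
    rng = random.Random(seed)
    half = window // 2
    num_global = min(num_global, seq_len)
    pattern = []
    for i in range(seq_len):
        keys = set()
        # Window attention: a few neighbours on each side of position i.
        keys.update(range(max(0, i - half), min(seq_len, i + half + 1)))
        # Global attention: every position attends to the global tokens.
        keys.update(range(num_global))
        # Random attention: a few keys picked at random.
        keys.update(rng.sample(range(seq_len), min(num_random, seq_len)))
        pattern.append(keys)
    # Global tokens themselves attend to every position.
    for g in range(num_global):
        pattern[g] = set(range(seq_len))
    return pattern

def count_pairs(pattern):
    """Number of query-key pairs that need an attention score."""
    return sum(len(keys) for keys in pattern)

for n in (64, 128, 256):
    sparse = count_pairs(sparse_attention_pattern(n))
    print(n, sparse, n * n)  # sparse pairs vs. full attention pairs
```

Each non-global row holds at most window + num_global + num_random keys regardless of sequence length, so the total grows roughly in proportion to n while full attention grows with n squared.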

The 3 attention variants used in BigBird. Source:

7. Retrieval-Augmented Generation for Knowledge-Intensive NLP tasks | Virtual Poster ❓Why: the main appeal is the use of fully non-parametric memory, which, while not novel, has the potential to allow for question answering systems that don’t need to be retrained to adapt to new or changing knowledge because they completely rely on external knowledge. 💡Key idea: retrieve documents as evidence, whose text is used as context for text generation. Apart from the results being state of the art, they show results on question answering over changing knowledge and show how RAG can answer questions it wasn’t trained on by swapping the collection of documents it gets knowledge from (without any re-training). Moreover, factual correctness seems to be a strong feature of this approach, although it still falls short of being truly reliable.
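The document-swapping idea can be illustrated with a deliberately crude sketch. Everything here is a stand-in: the retriever ranks documents by word overlap instead of using a learned dense retriever, the "generator input" is just a string that a seq2seq model would consume, and the country, people and documents are fictional.

```python
def tokenize(text):
    """Lowercase a text and split it into a set of words."""
    for mark in "?.,":
        text = text.replace(mark, " ")
    return set(text.lower().split())

def retrieve(question, documents, top_k=1):
    """Rank documents by word overlap with the question (toy retriever)."""
    question_words = tokenize(question)
    ranked = sorted(documents,
                    key=lambda doc: len(question_words & tokenize(doc)),
                    reverse=True)
    return ranked[:top_k]

def build_generator_input(question, documents, top_k=1):
    """Prepend the retrieved evidence to the question as generation context."""
    evidence = retrieve(question, documents, top_k)
    return " ".join(evidence) + " // " + question

# Two versions of a fictional document collection.
old_index = ["The president of Examplestan is Alice.",
             "The capital of Examplestan is Sampleville."]
new_index = ["The president of Examplestan is Bob.",
             "The capital of Examplestan is Sampleville."]

question = "Who is the president of Examplestan?"
input_old = build_generator_input(question, old_index)
input_new = build_generator_input(question, new_index)
print(input_old)
print(input_new)
```

The code that answers the question never changes between the two calls; only the collection it retrieves from does, which is the property that makes updating knowledge without re-training possible.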

Overview of RAG model. Source:

8. Language Models are Few-Shot Learners | Virtual Poster ❓Why: (aka GPT-3) a lot has been said about the GPT-X series⁷ and there’s no doubt the latest iteration has impressed the most skeptical people in the field. Originally released on arXiv in May, it’s now worth reading this streamlined version of the work. 💡Key insight: size, size, size. Scaling up models keeps improving performance and leading to surprising results, and the ceiling seems to still be far away… GPT-3 trains a 175-billion-parameter model which shows surprising results in few-shot learning, where the model only needs a couple of examples to learn a language task to an astonishing degree. Still, many concerns arise, such as the cost and environmental impact of such models or the biases they reveal.

GPT-3 performance on SuperGLUE compared to fine-tuned baselines and SOTA. Source:

Others: Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping, O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers, Deep Transformers with Latent Depth and a more comprehensive list.

Benchmarks and evaluation

9. Learning to summarize from human feedback | Virtual Poster ❓Why: sometimes measuring performance in a task is just as hard if not harder than solving the task itself. Summarization is a good example: works often rely on measures such as ROUGE⁸ which correlate with human judgements only to a certain degree; and when models are close to that boundary, the measure ceases to be useful. 💡Key idea: 3 steps that can be repeated iteratively:

  • Collect human preferences from pairs of summaries.

  • Train an evaluation model that learns to predict human preference between two summaries.

  • Use the evaluation model as a reward function to optimize the policy (model) that generates the summary with Reinforcement Learning (more precisely, Proximal Policy Optimization⁹, PPO).
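The second step, training a model to predict which of two summaries a human prefers, is commonly expressed as a pairwise loss on the scores the model assigns to each summary. The sketch below shows one such loss, the negative log-sigmoid of the score difference; the scalar rewards are made-up numbers standing in for the evaluation model's outputs, and the function name is an assumption.

```python
import math

def preference_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(r_preferred - r_rejected)), in a numerically stable form."""
    margin = reward_preferred - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the human-preferred summary gets the higher score...
well_ordered = preference_loss(2.0, 0.0)
# ...and large when the evaluation model ranks the pair the wrong way round.
mis_ordered = preference_loss(0.0, 2.0)
print(well_ordered, mis_ordered)
```

Once trained, the scalar score of a single summary is what serves as the reward in the reinforcement learning step.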

This evaluation seems to correlate better with human judgements, although this comes at the cost of this metric being less universal and explainable.

Overview of the training procedure. Source:

10. Open Graph Benchmark: Datasets for Machine Learning on Graphs | Virtual Poster ❓Why: graphs have been especially hot in the field for a couple of years, and graphs need their gold benchmark backed by heavyweights in the field. This one is a strong contender. 💡Keys: the main defining features of this benchmark are diverse sizes (from 100k to 100M nodes), coverage of many domains and multiple task categories (node, link and graph property prediction). Moreover, the authors claim that their experiments so far show significant challenges of scalability and out-of-distribution generalization, which resonate strongly with the challenges that real-world data presents. The steering committee backing it up includes giants such as Tommi Jaakkola, Yoshua Bengio and Max Welling, and the benchmark includes not only data but also a pipeline for managing it (loading, evaluating, etc.), which provides graph objects compatible with PyTorch, PyTorch Geometric and Deep Graph Library.

Overview of the OGB. Source:

Other relevant papers on Graphs at NeurIPS: Can Graph Neural Networks Count Substructures?, Learning Dynamic Belief Graphs to Generalize on Text-Based Games, Factor Graph Neural Networks. Other benchmarks at NeurIPS: BONGARD-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning, A Benchmark for Systematic Generalization in Grounded Language Understanding, RL Unplugged: A Collection of Benchmarks for Offline Reinforcement Learning; and a more comprehensive list.

What an exciting set of papers! It was honestly really hard to narrow it down to 10. As a closing note, I’d like to mention how much of a pleasure it is to read NeurIPS papers, as they’re way more polished than your average publication. This little NeurIPS collection ends here, but there’s still so much more to explore at the conference and I’m really looking forward to it. The team will be reporting interesting insights live from our company Twitter feed at @zetavector, so tune in if you don’t want to miss a thing.

What about you? What are you most looking forward to at the conference? Feel free to share some suggestions down in the comments👇

References

[1] Learning Word Vectors for Sentiment Analysis, Andrew L. Maas et al. 2011.
[2] Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation, Kyunghyun Cho et al. 2014.
[3] Long Short-Term Memory, Sepp Hochreiter et al. 1997.
[4] Attention is All You Need, Ashish Vaswani et al. 2017.
[5] Efficient Transformers: A Survey, Yi Tay et al. 2020.
[6] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Jacob Devlin et al. 2018.
[7] Improving Language Understanding by Generative Pre-Training, Alec Radford et al. 2018.
[8] ROUGE: A Package for Automatic Evaluation of Summaries, Chin-Yew Lin 2004.
[9] Proximal Policy Optimization Algorithms, John Schulman et al. 2017.


© 2020 Zeta Alpha
