NeurIPS 2021 — 10 papers you shouldn't miss

Updated: Dec 6, 2021

2334 papers, 60 workshops, 8 keynote speakers, 15k+ attendees. A dense landscape that's hard to navigate without a good guide and map, so here are some of our ideas!

The 2021 edition of the most beloved conference in Artificial Intelligence is here to end the year with a 'grand finale'. The growth of the conference hasn’t ceased: This year’s conference has 2334 papers, 20% more than last year. How do you select?

NeurIPS number of publications for the past decade in the main track.

Some of the published papers have been on for some time now and have already had an impact. For instance, here’s a list of the top 10 most cited NeurIPS papers even before the conference starts. This is the first good starting point to browse the program and watch the authors present their work.

Making sense of this impressive lineup is no easy feat, but with some help from the AI Research Navigator at Zeta Alpha, we went through the most relevant NeurIPS papers by citations, spotlight presentations, and some recommendations from the platform and we identified some cool works we’d like to highlight; some are already well known, and some are more of a hidden gem. Of course, these picks do not aim to be a comprehensive overview —we’ll miss many topics such as ML theory, Federated Learning, meta-learning, fairness— but there’s only so much we can fit in a blog post! [Update Dec 1st]:: now the best papers awards have been announced, which is also a good starting point though slightly theory heavy for our taste.

1. A 3D Generative Model for Structure-Based Drug Design | 👾 Code

By Shitong Luo, Jiaqi Guan, Jianzhu Ma, and Jian Peng.

❓WhyBio ML has been one of the most prominent areas of application and progress of machine learning techniques. This is a recent example of how generative models can be used in drug design based on protein bindings.

💡Key insights → The idea of Masked Language Modeling (corrupting a sequence by masking some elements and trying to reconstruct it) proves to be also useful to generate molecules: masking single atoms and ‘reconstructing’ how they should be filled up. By doing this process iteratively (e.g. ‘autoregressively’) this becomes a generative model that can output candidate molecules.

The main advancement of this paper with respect to previous works is that the generation of molecules is conditioned to particular protein binding sites, giving researchers a better mechanism to find molecule candidates to act as a drug for a particular purpose.


For this, several tricks are required for training, such as building an encoder that’s invariant to rigid transformations (translation and rotation), cause molecules don’t care about how they’re oriented. The results are not groundbreaking and there are still caveats to the method, such as the fact that it can often output molecules that are not physically possible, but this research direction is a certainly promising one for the future of drug development


Other works on ML for molecule structures at NeurIPS: Multi-Scale Representation Learning on Proteins, Hit and Lead Discovery with Explorative RL and Fragment-based Molecule Generation, Flow Network based Generative Models for Non-Iterative Diverse Candidate Generation.

2. The Emergence of Objectness: Learning Zero-shot Segmentation from Videos | 👾 Code

By Runtao Liu, Zhirong Wu, Stella Yu, and Stephen Lin.

Authors’ TL;DR → We present an applicable zero-shot model for object segmentation by learning from unlabeled videos.

❓Why → Humans can easily track objects which we have never seen and don’t recognize… Machines should do the same!

💡Key insights → The notion of objectness is often referred to as one of the important human priors that enable us to see and reason about what we see. The key here is that we don’t need to know what the object is to know it’s an object. On the contrary, when training a neural network for image segmentation in a supervised fashion, the model will only learn to segment objects seen during training.

This paper proposes to use video data and clever tricks to set up a self-supervised training approach that leverages how objects and foreground tend to behave differently in recordings. This is reminiscent of Facebook’s DINO [1], which uncovered how after training self-supervisedly with images, Transformers’ attention matrices resembled some sort of proto-segmentation.

In this case, the authors use a combination of single frame reconstruction loss (the segmentation network) and a motion network that tries to output a feature map of how pixels move in an image. This lets them predict a future frame given a reconstruction and a motion map, which is self-supervised end-to-end.


Many more tricks and considerations are required to make this work, and the results are still not earth-shattering. Still, leveraging non-labeled data for Computer Vision using videos is a key development for the future of Machine Learning, so this line of work is bound to be impactful.

Other interesting Computer Vision papers at NeurIPS: Scaling Vision with Sparse Mixture of Experts (we already covered a few months ago!) Pay Attention to MLPs (also covered back in June), Intriguing Properties of Vision Transformers, Compressed Video Contrastive Learning,

3. Multimodal Few-Shot Learning with Frozen Language Models | 🌐 Website

By Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill.

Authors’ TL;DR → We present a simple approach for transferring abilities of a frozen language model to a multi-modal setting (vision and language).

❓Why → Prompting is here to stay: now multimodal and async. Can you leverage the information in a pre-trained Language Model for vision tasks without re-training it? Well sort of… keep reading.

💡Key insights → The idea this paper proposes is fairly simple: train a Language Model, freeze it such that its parameters remain fixed, and then train an image encoder to encode an image into a prompt for that language model to perform a specific task. I like to conceptualize it as “learning an image-conditional prompt (image through a NN) for the model to perform a task”.


It’s a promising research direction but the results are not very impressive (yet?) in terms of absolute performance. However, it is interesting to compare the model which is fully finetuned with multimodal data (Frozen finetuned) versus one that keeps the Language Model frozen (Frozen VQA-blind): only the latter shows good generalization from the training dataset (Conceptual Captions [4]) onto the target evaluation dataset (VQAv2 [5]), still being far from a fully supervised model.


4. Efficient Training of Retrieval Models using Negative Cache

By Erik Lindgren, Sashank Reddi, Ruiqi Guo, and Sanjiv Kumar.

Authors’ TL;DR → We develop a streaming negative cache for efficient training of retrieval models.

❓Why → Negative sampling in dense retrieval is one of the most salient topics in the domain of neural IR! This seems to be an important step.

💡Key insights → A dense retrieval model aims to encode passages and queries into vectors to perform nearest neighbor search to find relevant matches. For training, a contrastive loss is often used where the similarity of positive pairs is maximized and that of negative pairs is minimized. Ideally, one could use a whole document collection as ‘negative samples’, but this would be prohibitively expensive, so instead, negative sampling techniques are often used.

One of the challenges of negative sampling techniques is that the final model quality strongly depends on how many negative samples are used — the more the better — but using many negative samples is computationally very expensive. This has led to popular proposals such as ANCE [2] where negative samples are carefully mixed with ‘hard negatives’ from an asynchronously updated index to improve a model performance at a reasonable computational cost.

In this work, the authors propose an elegant solution consisting of caching embeddings of documents when they’re calculated and only update them progressively, so instead of needing to perform a full forward pass in the encoder for all the negative samples, a large portion of cached documents are used instead, which is much faster.

This results in an arguably more elegant and simple approach that is superior empirically to existing methods.


Other information retrieval papers you might like at NeurIPS: End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering, One Question Answering Model for Many Languages with Cross-lingual Dense Passage Retrieval, SPANN: Highly-efficient Billion-scale Approximate Nearest Neighborhood Search.

5. VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

By Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong.

Authors’ TL;DR → A pure Transformer-based pipeline for learning semantic representations from raw video, audio, and text without supervision.

❓Why → Another step towards the promised multimodal future land.

💡Key insights → A still under-explored area for large transformer models to explore is that of multimodality. This work aims at constructing representations for video, audio, and text via self-supervised learning of data in those modalities using variants of contrastive losses jointly, so the three modalities inhabit the same embedding space. To do so, they use Noise Contrastive Estimation (NCE) and use matching audio/video-frame/text triplets as positive pairs and non-matching triplets (e.g. non-corresponding segments of a video) as negative samples.

This method is similar to previous multimodal networks such as “Self-Supervised MultiModal Versatile Networks” [3], with the main difference being that it relies on a pure Transformer architecture. It achieves SOTA in major video benchmark kinetics while avoiding supervised pre-training (e.g. using large scale Imagenet labels to pre-train the model), and using only self-supervised techniques.


6. Robust Predictable Control | 🌐 Website

By Ben Eysenbach, Russ R. Salakhutdinov and Sergey Levine.

Authors’ TL;DR → We propose a method for learning robust and predictable policies in RL using ideas from compression.

❓Why → Excellent bird-eye paper to understand some of the fundamental challenges of Reinforcement Learning through the lens of compression.

💡Key insights → Despite environment observations being often high dimensional (e.g. millions of pixels from images), the amount of bits of information required for an agent to make decisions is often low (e.g. doing lane keeping of a car with camera sensing). Following this insight, the authors propose Robust Predictable Control (RPC), a generic approach to learning policies that use few bits of information.

To do so, they ground their analysis on the information-theoretical perspective of how much information a model needs to predict a future state: the more predictable a state the more easily a policy can be compressed. This becomes a regularisation factor for agents to learn to ‘play it safe’ most of the time (e.g. a self-driving system will tend to avoid high uncertainty situations).


The specifics of the algorithm are much more complex so you’ll need to read the paper carefully if you want to dig into them. Beyond the particular implementation and the good empirical results, this seems to be a useful perspective to think of the tradeoffs one finds when navigating environments with uncertainty.

More on Reinforcement Learning at NeurIPS: Behavior From the Void: Unsupervised Active Pre-Training, Emergent Discrete Communication in Semantic Spaces, Near-Optimal Offline Reinforcement Learning via Double Variance Reduction.

7. FLEX: Unifying Evaluation for Few-Shot NLP | 📈 Leaderboard | 👾 Code

By Jonathan Bragg, Arman Cohan, Kyle Lo, and Iz Beltagy.

Authors’ TL;DR → FLEX Principles, benchmark, and leaderboard unifying best practices for evaluating few-shot NLP; and UniFew, a simple and strong prompt-based model by unifying pre-training and downstream task formats.

❓Why → Few-shot learning has been the NLP cool kid for some time now. It’s time it gets its own benchmark!

💡Key insights → Since GPT-3 [4] broke into the scene in May 2020 with surprising zero and few-shot performance on NLP tasks thanks to ‘prompting’, it has been standard for new large language models to benchmark themselves in the zero/few-shot setting. Often, the same tasks and datasets were used for finetuned models, but in this benchmark, zero/few-shot setting is a first-class citizen.

This benchmark coming from the Allen Institute for AI seeks to standardize the patterns seen in the NLP few-shot literature, such as The principles upon which this benchmark is built are: diversity of transfer types, variable number of shots and classes, unbalanced training sets, textual labels, no extra meta-testing data, principled sample size design and proper reporting of confidence intervals, standard deviations, and individual results. Hopefully, this will facilitate apples-to-apples comparison between large language models, which isn’t guaranteed when evaluation practices are a bit all over the place: the devil is in the details!

The authors also open-source the Python toolkit used to create the benchmark, along with their own baseline for this benchmark called UniFew, and compare it to a popular recent approach “Making pre-trained language models better few-shot learners” [5] and “Few-shot Text Classification with Distributional Signatures” [6].