2672 main papers, 63 workshops, 7 invited talks, and finally in person again. Language Models, Brain-Inspired research, Diffusion Models, Graph Neural Networks... NeurIPS comes packed with world-class AI research insights, and this guide will help you find where to focus your attention.
The 36th edition of the Neural Information Processing Systems Conference (NeurIPS) is about to kick off, and we can't understate how excited we are to be going there. As it's becoming a tradition for the conference, we want to help our audience navigate what can be a fascinating yet overwhelming lineup of 2672 papers, 163 Datasets & Benchmark track papers, and more than 700 workshop papers across 63 workshops, 43 of them in-person.
Diving into the official Conference Highlights you'll realize that 2022 is the 10-year mark since the breakthrough AlexNet paper (published at NeurIPS 2012) that showed how powerful Deep Neural Networks could be for Computer Vision, won the paper won the Test of Time award, as it couldn't be any other way after unleashing the so-called "Deep Learning Revolution". Moreover, Geoffrey Hinton — 2017 Turing Award recipient and AlexNet coauthor — will be giving a talk about training neural networks without propagating derivatives "Forward-Forward Algorithm for Training DNNs". Speaking of invited talks, David Chalmers will give the first keynote reflecting on whether Language Models are Sentient, a topic that is guaranteed to spark strong disagreements (or dismissals) among the attendees. And among other invited talks, Alondra Nelson will present a Blueprint for an AI Bill of Rights, a key piece in the American policy puzzle to minimize the harm done by AI-driven, increasingly automated decision-making.
To navigate the conference, we've created a semantic map of the whole conference (including workshop papers) we've created with the VOS-viewer and a bit of our secret sauce using Language Models to help organize each cluster in the map.
Given the "slow" process of peer review, you'll find that many of the papers highlighted here are "old" by modern standards: given that they've been on arXiv for a few months, some have already seen follow-up works! And claims on state-of-the-artness might be outdated by now. On the flip side, these works are much more polished and of higher quality than the average hot-off-the-press arXiv submission, which is noticeable and appreciated.
This year, we've divided the content into what we — and GPT-3 — believe are 10 key topical areas along with a brief description of what they contain, and a selection of 5 papers you shouldn't miss for each one of them. Were you to find yourself unreasonably intrigued by a specific paper, click on the "🔎 More like this paper" button and you'll find a list of relevant papers at NeurIPS for you on the topic.
Before we dive in, let us add a shameless plug to highlight the research at NeurIPS that hits close to our home, Amsterdam. Browse the 31 publications with the participation of the University of Amsterdam!
⚠️Disclaimer⚠️ This cannot be fully comprehensive given the sheer number of papers we're working with here, but we hope this is a useful entry point to the conference. Let us know what you think about it on Twitter (which is still alive?) @zetavector.
1. Language Models and Prompting
Arguably the most sought-after research area in AI — especially since the introduction of GPT-3 in 2020 — gets massive amounts of attention at the conference. With many blockbuster publications from the likes of Google, DeepMind, OpenAI, Meta, Stanford... all the big compute guys.
Expect to find a lot of work on "simple prompting" techniques like Chain of Thought techniques (or tricks?) that squeeze more performance out of plain pretrained autoregressive LMs. Multimodality also gets a spotlight this year, DeepMind's Flamingo (vision + language) being the most popular of the bunch, Google's Minerva shows how good LMs can be at math when pretrained with the right data, and InstructGPT showcases how human feedback and Reinforcement Learning can be used to fine-tune large LMs.
We're wondering what's the current community vibe (beyond Twitter and Mastodon) on large LMs? Bullish or bearish? we hope to find out in New Orleans. For now, here are our top 5.
💡 Simply prompting LMs to output reasoning steps instead of direct answers drastically improves performance. See also follow-up works like STaR.
💡 DeepMind introduces a "simple" single model pretrained on vision + language sets the state of the art on various multimodal tasks
💡 A large LM trained on mathematical data can achieve strong performance on quantitative reasoning tasks, including state-of-the-art performance on the MATH dataset.
💡 What aspects of large-scale pre-training drive in-context learning? Training data distribution needs burstiness and a large number of rarely occurring cases.
5️⃣ Training language models to follow instructions with human feedback (InstructGPT)
💡 OpenAI uses Reinforcement Learning from Humans in the Loop (RLHF) to fine-tune GPT-3 using data collected from human labelers. The resulting model, called InstructGPT, outperforms GPT-3 on a range of NLP tasks.
2. Diffusion Models
Granted, if anything deserves the coolest-kid-in-town crown in 2022, it has to be text-to-image generation models, most of them powered by Diffusion Models: OpenAI's DALL·E 2, Google's Imagen, or Stable diffusion.
As an early sign of maturity — and barely 2 years since they became popular! — the modeling technique has now spread beyond the realm of 2D still image generation and is being applied to 3D scene synthesis, video generation, and molecular docking among others. As we saw with Transformers in 2017, the time it takes for a research idea to become mainstream just keeps shortening.
💡 Imagen, a simple approach to text-to-image synthesis using diffusion models.
2️⃣ Object Scene Representation Transformer (OSRT)
💡 A highly efficient 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
3️⃣ Denoising Diffusion Restoration Models (DDRM)
💡 Super-resolution, deblurring, inpainting, and colorization using pre-trained Denoising Diffusion Probabilistic Models (DDPMs) without problem-specific supervised training.
💡 DDPMs applied to the video domain. To capture long-range dependencies between frames, they present an architecture that can be flexibly conditioned on any subset of video frames.
💡 Energy-guided stochastic differential equations (EGSDE) that employ an energy function pretrained on both the source and target domains to guide the inference process of pretrained SDE for realistic and faithful unpaired Image-to-image (I2I).
3. Self-Supervised Learning
Self-Supervised Learning (SSL) has become such an essential ingredient of modern ML that it's now pretty much baked into most research in one way or another. For the first time in Deep Learning, NLP led the way with BERT in 2018 and Computer Vision later joined the SSL bandwagon with successful techniques like SimCLR.
This region of our semantic map has a heavy representation of novel SSL techniques for Computer Vision: learning from unsupervised videos, the effects of data augmentation on images... But arguably, the cherry on top is DeepMind's Chinchilla: a study on how much of a Language Model's pretraining budget should be spent on model parameters and how much in a larger training corpus (finding that most large LMs are too big or undertrained), resulting in Chinchilla, a 70B parameter LM that outperforms its bigger counterparts by training for longer.
Finally, we also couldn't miss out on wild new (partly) SSL techniques for Information Retrieval such as the Differentiable Search Index.
While SSL is now so commonplace it's often relegated to a disinterested footnote, all this research is proof that there are many stones unturned in this area with new insights to discover.
💡 It's better to train a smaller Language Model on more tokens. DeepMind showcases this with their 70B Chinchilla model outperforming bigger models such as Gopher (280B), GPT-3 (175B) or Megatron-Turing NLG (530B).
💡 Pretraining video representations on video done exquisitely. 3 takeaways: high masking ratio is best, technique works well even on small datasets, and quality > quantity when it comes to Self-Supervised Video Petraining.
💡 A systematic study the interactions between the pretraining data sources for CLIP. Surprisingly (?) mixing multiple data sources does not necessarily yield better models, which is corroborated by our theoretical analysis on toy models.
💡 An analytical study of data augmentation (DA) how it impacts the parameters of a model. E.g., given a loss at hand, common DAs require tens of thousands of samples for the loss to be correctly estimated and for the model training to converge.
💡 A single Transformer is trained to directly output document identifiers autoregressive given a query as a prompt. Followup work is also presented at NeurIPS such as the A Neural Corpus Indexer for Document Retrieval.
4. Graph Neural Networks
Equivariances, 3D molecule generation, Partial Differential Equations... Graph Neural Networks (GNNs) have been around for a while now, and while they haven't achieved celebrity-level popularity like Transformers or Diffusion Models, they've steadily growed their pie in the last couple of years extending into applications such as drug design, differential equation solving or reasoning.
This heterogeneous landscape makes sense because as outlined in the Geometric Deep Learning blueprint, in a way, GNNs are a new abstraction on NNs to unlock thinking from first principles about how to cast arbitrary problems into the right architectures, escaping the curse of dimensionality by leveraging symmetries and invariances. For example, this is key for finding the right representations to computationally solve Partial Differential Equations or predicting the shapes of organic molecules to design new drugs more efficiently.👇
1️⃣ Zero-Shot 3D Drug Design by Sketching and Generating (DESERT)
💡 A zero-shot drug design method powered by pre-training techniques. Existing deep-learning based methods for drug design often rely on scarce experimental data or slow docking simulation. DESERT splits the design process into sketching and generation phases speeding up generation while preserving high accuracy.
💡 Drug-like Molecule Conformer Generation via diffusion process on torsion angles the Fast and accurate conformer generation via diffusion modeling on the hypertorus and an extrinsic-to-intrinsic score model, while being orders of magnitude faster than previous diffusion-based approaches.
💡 A novel mesh-agnostic architecture that predicts solutions to PDE at any spatially continuous point of the PDE domain and generalizes across different meshes and resolutions.
💡 Message Passing NNs (MPNNs) are a powerful way to model interatomic potentials but they're inefficient. MACE introduces higher-order message passing in a highly parallel manner achieving SOTA in various benchmarks.
💡 CSR can make predictions for the target few-shot task directly by self-supervised pre-training over knowledge graphs.
5. Reinforcement Learning
RL is one of the largest and most prolific areas of publication within NeurIPS, so it's impossible to make it justice with just a short selection of 5 papers.
Making agents more efficient learners is a key question RL researchers are still grappling with, and this year's NeurIPS contains many proposals on how to achieve it. For instance, using offline learning and imitation learning at scale to overcome the initial inefficient exploration phase, improving credit assignment techniques to better navigate sparse rewards landscapes, or using pretrained Language Models to bootstrap policies with human priors. Other points of interest often revolve around robustness and reproducibility, which are closely related to efficiency in open-ended challenging settings.
Finally, RL is also seeing successful application into areas like chip design with surprisingly many papers on the topic (seriously, check out the "More like this!").
💡 Meta-learning agents can learn human inductive biases through co-training with representations from language descriptions and program induction.
💡 Leveraging large pretrained models to automatically label videos with actions to create large-scale datasets for offline learning, just with video data from Minecraft.
💡 RL agents that learn to distribute components on a silicon chip design better than humans.
💡 MCTS made more efficient by allocating a higher compute budget on harder states.
💡 A new training objective for generative flow networks that tackles the problem of credit assignment (what action within a trajectory is most responsible for a final reward?) leads to faster convergence and better fitting to a target distribution.
Let's not forget NeurIPS is also home for neuroscience adjacent literature; after all, our brains are the OG neural information processing system that sparked much of the modern artificial NNs.
This diverse area involves a lot of learning from brain imaging techniques such as fMRI, alternatives to backprop for learning that bode better with what we know about neurons, spiking Neural Networks, and more!
💡Backprop does not allow training on networks with cyclic or backward connections, which are hypothesized to be essential in brain-like computation. They show how predictive coding (PC), a theory of information processing in the cortex, can be used to perform inference and learning on arbitrary graph topologies.
💡 A theoretical investigation on the approximation power and computational efficiency of spiking neural networks with self-connections.
💡 Novel self-supervised learning techniques for neuroimaging data inspired by prominent learning frameworks in natural language processing, using one of the broadest neuroimaging datasets used for pre-training to date.
💡 Node perturbation is scalable against overparameterization but unstable in the presence of a model mismatch.
💡 A solvable model of curriculum learning and comment on the implications for the ML and the experimental psychology literature.
7. Out-of-Domain Generalization
Out-of-Domain generalization and closely related, causality, are topics of high interest in academic research. While standalone OOD research hasn't fully broken into the mainstream, there's an undeniable trend broadly in Machine Learning benchmarking culture towards caring more and more about robustly generalizing in challenging conditions such as zero/few-shot or under heavy data distribution shifts, because static in-domain evaluations have been cracked one after the other at a record-breaking pace.
The area — as most in early stages — still suffer from lack of standardization, which is why we're highlighting a couple of publications on the topic (2, 3). In addition, we're highlighting a simple ensemble technique to domain generalization (1), tabular embeddings (4), and sparsely connected factor graphs at scale for causal discovery (5), which represents progress in a research direction we've highlighted before: the use of sparsity + communication bottlenecks to constrain the model into learning models of the world that capture its robust causal structure.
💡 A simple hyper-parameter-free strategy of using the simple moving average of model parameters during training and ensembling achieves SOTA on domain generalization benchmarks, and can be explained using the Bias-Variance trade-off.
💡 A large-scale empirical study of out-of-distribution generalization.
💡 Metrics to study a broad range of mixture-of-experts styled modular systems. Such systems suffer from problems of collapse and specialization, and might require additional inductive biases to overcome this sub-optimality.
💡 Representing numerical features with vectors instead of scalar values can significantly boost DL models for tabular data.
💡 Using factor graphs for large-scale causal discovery learning with interventional data.
8. Learning Theory
Okay, let's start by admitting that many of us kinda find the heavy-duty mathy stuff off-putting. Still, there's so much to learn from it at NeurIPS.
For instance, large batch sizes have proven to be essential to successfully learning representations with contrastive learning. This makes sense intuitively, but is there a more grounded statistical explanation? (1) Or what guarantees can we expect when using autodiff blindly over functions that aren't smooth enough? (3) Or under which conditions is performing well in Out-of-Distribution even possible? (2). Check out the papers below if you're intrigued by these questions.
💡 A Bayesian data augmentation method to disentangle negative samples for gradient bias mitigation in contrastive learning.
💡 Through the lens of Probably Approximately Correct (PAC) learning theory, this work studies the generalization of OOD detection: classifying whether a sample falls within the training distribution or not. They find this is impossible under some conditions and prove formal theorems around it, but these conditions are mostly not a concern in real-world problems.
💡 What happens when you apply autodiff to functions that aren't smooth enough? It's mostly fine, they converge to classical derivatives anyway.
💡 Autodiff but in the implicit form (i.e. when you can't isolate f(x) in the left side of the equal sign). Implemented in JAX.
💡 Using gradient descent to tune not only hyperparameters, but also hyper-hyperparameters, and so on… This work shows how to automatically compute hypergradients with a simple and elegant modification to backpropagation.
9. Adversarial Robustness, Federated Learning, Compression
Please forgive us for merging a few more-than-niche subfields into one catch-all area. Running low on space here!
Adversarial Robustness in ML has been a thing for longer than I can remember, and this year it's no different (3, 5). This is understandable as this is a concern of high importance when it comes to using models in safety critical applications such as self-driving.
On the other hand, Federated Leraning, sometimes married with Differential Privay is another prolific area of research which still needs to find its use in mainstream real-world application (2). A camp that decidedly already has widespread usage in the real-world, compression, pruning and other speed/efficiency enhancing techniques (4). Finally, a new twist on pruning that looks at pruning data, not weights, to train NNs faster than scaling laws would allow for! (1)
💡 In theory and practice, power law scaling of error with respect to dataset size can be improved via intelligent data pruning.
💡 We propose a new adaptive federated learning algorithm for personalization.
💡 A test that enables researchers to find flawed adversarial robustness evaluations. Passing this test produces compelling evidence that the attacks used have sufficient power to evaluate the model’s robustness.
💡 A framework for on-device training on tiny IoT devices, even under a limited memory budget of 256KB.
💡 A novel algorithm to generate adversarial samples using pre-trained models which can fool the corresponding fine-tuned ones and thus reveal the safety problem of fine-tuning pre-trained models to do downstream tasks.
10. Datasets & Benchmarks
Last but not least, the unsung heroes of ML. As the pace of progress keeps accelerating and even modern datasets saturate faster than most predicted [link to palm and instruction fine-tuning], new datasets that measure interesting phenomena must emerge to substitute them. Like last year, NeurIPS had a special track for datasets and benchmark papers (see the full list of them here).
Many of the proposed benchmarks are incredibly interesting and useful, so the list below is guaranteed to miss some of those. Still, here it goes👇