A Guide to ICLR 2023: 10 Topics and 50 papers you shouldn't miss
The 2023 International Conference on Learning Representations is going live in Kigali on May 1st, and it comes packed with more than 2300 papers. Reasoning in Language Models, Diffusion, Self-supervised Learning for Computer Vision, Molecular Modeling, Graph Neural Networks, Federated Learning, and much more... Here's our guide to get you started.

The role of conferences in the modern world of ML research has shifted. Previously seen as a platform for disseminating cutting-edge research, conferences now present established research that is typically six months old. Nonetheless, they offer two notable advantages compared to daily preprints on arXiv: (1) conference papers are proofread by a few reviewers and go through some iterations, making them more polished and refined than your average preprint, and (2) they generally provide better communication and comprehension, making them suitable for learning about fields outside one's primary expertise.
To assist in navigating the conference content, we have created an interactive semantic map using VOSviewer that organizes research by topic. Each paper's size on the map represents its predicted impact, which considers factors such as early citations, social media popularity, and author influence; you can use this tool to quickly skim through your areas of interest. We've selected 5 papers for each of 10 main topics, along with a quick overview of each.
If you want to learn more about these topics, sign up for our upcoming Trends in AI webinar on Thursday, May 4th from Lab42, or join us in person in room L3.35.
1. Language Models and Reasoning

As the AI research community continues to explore the potential of language models, a significant focus lies in refining their reasoning capabilities. This year's conference features several innovative approaches that aim to improve these capabilities, along with work that deepens our understanding of how these models behave.
Some of the most insightful works include techniques such as self-consistency and least-to-most prompting, which improve upon chain-of-thought for reasoning in Language Models.
1️⃣ Self-Consistency Improves Chain of Thought Reasoning in Language Models
💡 A new decoding strategy, self-consistency, improves chain-of-thought reasoning in language models, achieving significant performance boosts on various reasoning benchmarks.
2️⃣ Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
💡 A new prompting strategy, least-to-most prompting, breaks down complex problems into simpler subproblems and solves them in sequence, enabling complex reasoning in large language models.
3️⃣ Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
💡 Combines knowledge from different pretrained models for various tasks, achieving competitive performance on zero-shot image captioning and video-to-text retrieval, and enabling new applications such as answering free-form questions about egocentric video and engaging in multimodal assistive dialogue with people.
4️⃣ Quantifying Memorization Across Neural Language Models
💡 Large language models memorize training data, which can violate privacy, degrade utility, and hurt fairness. Memorization grows as model capacity, data duplication, and prompt context increase.
5️⃣ Selective Annotation Makes Language Models Better Few-Shot Learners
💡 A two-step framework for creating datasets for natural language tasks, using an unsupervised, graph-based selective annotation method, which improves task performance by a large margin with less annotation cost.
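To make the self-consistency idea from the first paper more concrete: rather than trusting a single chain-of-thought, one samples several reasoning paths and keeps the most common final answer. The snippet below is a minimal sketch of that voting step only. The names `majority_vote`, `self_consistent_answer`, and `sample_path` are hypothetical, and a stub that replays fixed answers stands in for a real language model call.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def self_consistent_answer(sample_path: Callable[[], str], n_paths: int) -> str:
    """Sample n_paths reasoning paths and aggregate their final answers."""
    # sample_path is a hypothetical stand-in for one stochastic
    # chain-of-thought generation that returns only the final answer.
    answers = [sample_path() for _ in range(n_paths)]
    return majority_vote(answers)


# Stub sampler: replays fixed answers instead of calling a real model.
_fake_outputs = iter(["18", "26", "18", "9", "18"])
result = self_consistent_answer(lambda: next(_fake_outputs), n_paths=5)
print(result)
```

In practice the sampler would decode with a non-zero temperature so that the reasoning paths actually differ; how many paths to sample is a cost versus accuracy trade-off.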
2. Learning Video Representations

The self-supervised revolution continues to dominate in Computer Vision. The trend points towards greater unification of tasks, larger-scale pretraining, and more end-to-end models: for instance, DETR for improved end-to-end object detection, or PaLI for multimodal and multilingual Language Modeling. Nonetheless, you'll also find work focused on better understanding widespread techniques such as contrastive learning, or on how vision-language models behave like good old Bag-of-Words models.
1️⃣ DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection
💡 A strong end-to-end object detector that improves performance and efficiency using denoising training, box prediction, and anchor initialization.
2️⃣ When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?
💡 This work presents the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of vision-language models (VLMs) to understand different types of relationships, attributes, and order information, and to identify when they behave like bags-of-words.
3️⃣ PaLI: A Jointly-Scaled Multilingual Language-Image Model
💡 PaLI is a large multilingual language-image model that generates text based on visual and textual inputs, achieving state-of-the-art performance in multiple vision and language tasks.
4️⃣ No Reason for No Supervision: Improved Generalization in Supervised Models
💡 Models trained with self-supervised learning tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on ImageNet-1K (IN1K). This paper proposes a supervised learning setup that leverages the best of both worlds.
5️⃣ On the duality between contrastive and non-contrastive self-supervised learning
💡 This work analyzes the theoretical similarities between contrastive and non-contrastive self-supervised learning methods and shows how they can be unified for better performance.
3. Diffusion Models for Generative AI

Diffusion Models continue to rule image generation. ICLR highlights work that ports diffusion ideas to other domains, such as discrete data, human motion, and video generation (which is slowly but steadily improving in appearance, coherence, and length), as well as applications of diffusion such as image editing.
1️⃣ Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning
💡 A simple and effective approach for generating discrete data using continuous state and time diffusion models, achieving strong performance in image generation and captioning tasks.
2️⃣ Human Motion Diffusion Model
💡 The Motion Diffusion Model (MDM) is a generative model for human motion data that predicts the sample itself, rather than the noise, at each diffusion step and achieves state-of-the-art results.
3️⃣ DiffEdit: Diffusion-based semantic image editing with mask guidance
💡 DiffEdit is a novel method that uses text-conditioned diffusion models to automatically generate masks for semantic image editing.
4️⃣ CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
💡 A 9B-parameter transformer for text-to-video generation that achieves state-of-the-art performance in both human and automatic evaluations.
5️⃣ Blurring Diffusion Models
💡 A new type of generative model based on heat dissipation that bridges the gap between inverse heat dissipation and denoising diffusion.
4. Long Sequence Modeling

This is a small but fascinating region of papers that look into modeling long sequences and, more specifically, the use of state-space models for it. While this is not (yet?) part of the mainstream, such ideas might prove useful in the near term to make the context of language models much, much larger. For context, a state-space representation is a mathematical model of a physical system that describes it in terms of the system's state, its time derivative, inputs, and outputs. This type of representation (relying on matrices and vectors) lends itself very well to the linear algebra toolset, which makes it ideal for analytically proving and reasoning about the dynamics, stability, and modes of a system.
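As a rough illustration of the state-space representation described above, the classic linear form is x'(t) = A x(t) + B u(t) for the state and y(t) = C x(t) + D u(t) for the output. The sketch below simulates such a system with forward Euler steps in plain Python. The function names, the toy matrices, and the step size `dt` are assumptions made for illustration; the papers listed here use more careful discretizations and learned parameters.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def simulate_ssm(A: Matrix, B: Matrix, C: Matrix, D: Matrix,
                 inputs: List[Vector], x0: Vector, dt: float) -> List[Vector]:
    """Simulate x' = A x + B u, y = C x + D u with forward Euler steps."""
    x = list(x0)
    outputs = []
    for u in inputs:
        # Output equation: read out the current state and input.
        y = [cx + du for cx, du in zip(matvec(C, x), matvec(D, u))]
        outputs.append(y)
        # State equation: time derivative of the state, then one Euler step.
        dx = [ax + bu for ax, bu in zip(matvec(A, x), matvec(B, u))]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return outputs


# Toy one-dimensional system: a leaky integrator driven by a constant input.
ys = simulate_ssm(A=[[-1.0]], B=[[1.0]], C=[[1.0]], D=[[0.0]],
                  inputs=[[1.0]] * 50, x0=[0.0], dt=0.1)
print(ys[0], ys[-1])
```

With these toy values the output starts at zero and settles towards one, which is the kind of stability behaviour the matrix A lets you reason about analytically.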
1️⃣ Mega: Moving Average Equipped Gated Attention
💡 A theoretically grounded, single-head gated attention mechanism equipped with a moving average, offering significant improvements over other sequence models.
2️⃣ Hungry Hungry Hippos: Towards Language Modeling with State Space Models
💡 Understanding the expressivity gap between State Space Models (SSMs) and attention in language modeling. This work introduces a new SSM layer, H3, that matches attention on synthetic languages and outperforms Transformers on OpenWebText, and introduces FlashConv to improve efficiency on modern hardware.
3️⃣ A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
💡 PatchTST, a model for multivariate time series forecasting and self-supervised representation learning, improves long-term forecasting accuracy significantly.
4️⃣ What Makes Convolutional Models Great on Long Sequence Modeling?
💡 Two critical principles contribute to the success of S4 as a global convolutional model, leading to the development of a new model called Structured Global Convolution (SGConv) that exhibits strong empirical performance over several tasks.
5️⃣ Liquid Structural State-Space Models
💡 Liquid-S4, a linear liquid time-constant state-space model, improves generalization across sequence modeling tasks with long-term dependencies.
5. Reinforcement Learning

It is not possible to adequately cover the breadth of Reinforcement Learning (RL) at ICLR with just five papers, as it is one of the largest and most prolific areas of research. One major focus of RL research is making agents learn more efficiently, which several papers address. Their proposals include leveraging Language Models for decision-making, explicitly disentangling what a policy can control from what is purely stochastic using information-theoretic principles, large-scale offline learning, GFlowNets, and more.
1️⃣ Latent State Marginalization as a Low-cost Approach for Improving Exploration
💡 The adoption of latent variable policies within the MaxEnt framework can improve exploration and robustness capabilities in reinforcement learning.
2️⃣ Dichotomy of Control: Separating What You Can Control from What You Cannot
💡 The dichotomy of control (DoC) is a future-conditioned supervised learning framework that separates mechanisms within a policy's control from those outside, achieving better performance than decision transformer (DT) on highly stochastic environments.
3️⃣ Emergence of Maps in the Memories of Blind Navigation Agents
💡 Map-free RL agents have shown surprisingly strong performance. How? Do RL agents build implicit maps? This paper trains 'blind agents' with artificially handicapped sensing abilities and finds that such agents still largely succeed in new environments and that collision detection neurons emerge (among other phenomena).
4️⃣ GFlowNets and variational inference
💡 This paper explores the relationship between generative flow networks and variational inference, highlighting the advantages of GFlowNets for capturing diversity in multimodal target distributions.
5️⃣ Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals
💡 The Read and Reward framework utilizes human-written instruction manuals to assist learning policies for specific tasks, leading to a more efficient and better-performing agent in Atari games.
6. Graph Representation Learning

Graph Neural Networks (GNNs) have been around for some time now, and although they have not gained the same level of popularity as Transformers or Diffusion Models, they have steadily increased their influence in recent years. GNNs are now being applied to diverse fields such as drug design, solving differential equations, and reasoning. The reason for this diverse application is that GNNs provide a new abstraction for neural networks, allowing problems to be cast into the right architecture. This overcomes the curse of dimensionality by leveraging symmetries and invariances. For instance, this is crucial for finding the appropriate representations to computationally solve Partial Differential Equations or to predict the shapes of organic molecules to design new drugs more effectively, as outlined in the Geometric Deep Learning blueprint.
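To illustrate what "leveraging symmetries and invariances" means for graphs, here is a minimal sketch of one message passing step with plain sum aggregation and no learned weights (a deliberate simplification, not any specific paper's layer). It checks that renaming the nodes only reorders the outputs (equivariance) and leaves a sum readout unchanged (invariance). The helper names and the toy graph are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def message_passing_step(h: List[float], edges: List[Edge]) -> List[float]:
    """Each node adds its neighbors' features to its own (undirected graph)."""
    out = list(h)
    for a, b in edges:
        out[a] += h[b]
        out[b] += h[a]
    return out


def relabel_nodes(h: List[float], edges: List[Edge],
                  perm: List[int]) -> Tuple[List[float], List[Edge]]:
    """Rename node i to perm[i], moving features and edges accordingly."""
    new_h = [0.0] * len(h)
    for old, new in enumerate(perm):
        new_h[new] = h[old]
    new_edges = [(perm[a], perm[b]) for a, b in edges]
    return new_h, new_edges


# Toy path graph 0 - 1 - 2 with one scalar feature per node.
h = [1.0, 2.0, 4.0]
edges = [(0, 1), (1, 2)]
perm = [2, 0, 1]

updated = message_passing_step(h, edges)
h_perm, edges_perm = relabel_nodes(h, edges, perm)
updated_perm = message_passing_step(h_perm, edges_perm)

# Equivariance: relabeling then updating equals updating then relabeling.
print(updated_perm == relabel_nodes(updated, edges, perm)[0])
# Invariance: a sum readout over nodes ignores the labeling entirely.
print(sum(updated) == sum(updated_perm))
```

Because the layer never depends on node order, the model does not have to learn the same function separately for every possible labeling of the same graph.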
1️⃣ Sign and Basis Invariant Networks for Spectral Graph Representation Learning
💡 Neural networks that are invariant to the symmetries of eigenvectors; they are theoretically expressive and empirically successful at learning graph positional encodings.
2️⃣ Gradient Gating for Deep Multi-Rate Learning on Graphs
💡 Gradient Gating is a novel framework for improving the performance of Graph Neural Networks (GNNs). It is based on gating the output of GNN layers with a mechanism for multi-rate flow of message passing information across the nodes of the underlying graph.
3️⃣ Graph Neural Networks for Link Prediction with Subgraph Sketching
💡 A method that solves the expressivity issues that plague most message-passing neural networks (MPNNs) for link prediction while being as efficient to run as GCN. This is achieved by passing subgraph sketches as messages.
4️⃣ Are More Layers Beneficial to Graph Transformers?
💡 Existing graph transformers are relatively shallow. This work explores whether more layers are beneficial to graph transformers and finds that current graph transformers hit a bottleneck when trying to improve performance by increasing depth.
5️⃣ AutoTransfer: AutoML with Knowledge Transfer - An Application to Graph Neural Networks
💡 AutoTransfer is an AutoML solution that improves search efficiency by transferring prior architectural design knowledge to the novel task of interest.
7. Molecular Modeling and Geometric DL

This section focuses on the use of GNNs and other techniques inspired by geometric DL for molecular modeling and physics-related applications. The papers also touch on neural differential equation solvers and diffusion, among other topics: for instance, pretraining protein representations on 3D structures, or modeling Partial Differential Equations with Clifford neural layers.
1️⃣ DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
💡 A diffusion generative model that outperforms traditional and deep learning methods in molecular docking with a 38% top-1 success rate.
2️⃣ Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem
💡 The first generative modeling approach to motif-scaffolding, built on a diffusion probabilistic model of protein backbones and a procedure for generating scaffolds conditional on a motif.
3️⃣ Protein Representation Learning by Geometric Structure Pretraining
💡 A new method for pretraining protein representations based on their 3D structures instead of their sequence, which outperforms existing sequence-based approaches.
4️⃣ Clifford Neural Layers for PDE Modeling
💡 The first usage of multivector fields and Clifford convolutions in deep learning, resulting in universally applicable Clifford neural layers that improve generalization capabilities of neural PDE surrogates for physical system modeling.
5️⃣ DiGress: Discrete Denoising diffusion for graph generation
💡 DiGress is a discrete denoising diffusion model that generates graphs with categorical node and edge attributes. It is state-of-the-art on both abstract and molecular datasets.
8. Biology Inspired

Taking inspiration from how living things learn is a principle many AI researchers follow. Here's a section of papers exploring techniques from biology (brains, humans, and evolution) such as biologically plausible gradient descent alternatives, structured memories, and other theory-heavy neuroscience-inspired papers.
1️⃣ Scaling Forward Gradient With Local Losses
💡 Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. The standard forward gradient algorithm suffers from the curse of dimensionality in the number of parameters. This work proposes to scale the forward gradient by adding a large number of local greedy loss functions.
2️⃣ Disentanglement with Biological Constraints: A Theory of Functional Cell Types
💡 A mathematical theory that explains how biological constraints on neurons promote disentangled representations, which are highly sought after in machine learning and can help understand how the brain represents single human-interpretable factors.
3️⃣ Incremental Learning of Structured Memory via Closed-Loop Transcription
💡 A simple and efficient method for incremental learning of structured memories using closed-loop transcription, achieving better performance with fewer resources.
4️⃣ An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems
💡 An evolutionary method for generating large-scale multitask models with sparsely activated task-based routing and knowledge compartmentalization to avoid common pitfalls.
5️⃣ REPAIR: REnormalizing Permuted Activations for Interpolation Repair
💡 This work investigates the conjecture that permutation invariance eliminates the loss barrier to linear interpolation between SGD solutions, using computer vision architectures.
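The "noisy directional gradient" mentioned for the forward gradient paper can be sketched in a few lines: draw a random direction v, measure the slope of the loss along v, and use slope times v as the gradient estimate. The snippet below is a minimal illustration under stated assumptions. Central finite differences stand in for forward-mode automatic differentiation, and `toy_loss` is a hypothetical quadratic with a known gradient; it does not implement the local losses that the paper proposes.

```python
import random
from typing import Callable, List

Vector = List[float]


def directional_derivative(f: Callable[[Vector], float], x: Vector,
                           v: Vector, eps: float = 1e-5) -> float:
    """Approximate the derivative of f at x along direction v.

    Central differences stand in for forward-mode automatic
    differentiation here (an assumption of this sketch).
    """
    plus = [xi + eps * vi for xi, vi in zip(x, v)]
    minus = [xi - eps * vi for xi, vi in zip(x, v)]
    return (f(plus) - f(minus)) / (2.0 * eps)


def forward_gradient(f: Callable[[Vector], float], x: Vector,
                     rng: random.Random) -> Vector:
    """Return one noisy gradient estimate (grad f . v) * v with v ~ N(0, I)."""
    v = [rng.gauss(0.0, 1.0) for _ in x]
    slope = directional_derivative(f, x, v)
    return [slope * vi for vi in v]


def toy_loss(x: Vector) -> float:
    """Hypothetical quadratic loss whose exact gradient is 2 * x."""
    return sum(xi * xi for xi in x)


rng = random.Random(0)
x = [1.0, -2.0, 0.5]
n_samples = 20000
mean_estimate = [0.0] * len(x)
for _ in range(n_samples):
    g = forward_gradient(toy_loss, x, rng)
    mean_estimate = [m + gi / n_samples for m, gi in zip(mean_estimate, g)]

# Single estimates are noisy, but their average approaches [2.0, -4.0, 1.0].
print(mean_estimate)
```

The noise of a single estimate grows with the number of parameters, which is the curse of dimensionality the summary refers to.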
9. OOD Generalization, Optimization

The subjects of Out-of-Domain (OOD) generalization and its closely associated concept, causality, have long been of research interest. Although OOD has not yet entirely penetrated the mainstream, it is evident that the Machine Learning benchmarking culture is progressively placing greater emphasis on robust generalization under challenging circumstances, such as zero/few-shot scenarios or under substantial data distribution shifts. This shift is primarily due to the rapid and successive saturation of static in-domain evaluations such as the classic ImageNet or the GLUE Benchmark.
1️⃣ Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization
💡 Most domain generalization algorithms focus on specific data shifts. This work introduces a dataset for multiple-attribute distribution shifts and shows how existing models fail to generalize under those circumstances. It also presents Causally Adaptive Constraint Minimization to better capture the correct independence constraints.
2️⃣ Distilling Model Failures as Directions in Latent Space
💡 A scalable method for automatically distilling and captioning a model's failure modes as directions in a latent space.
3️⃣ Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning
💡 A theory explaining why ensembles and knowledge distillation work in Deep Learning. It matches practice well, whereas traditional theories such as boosting, random feature mappings, or NTKs cannot explain the same phenomena for DL.
4️⃣ Learning to Induce Causal Structure
💡 Causal induction models often rely on generating candidate causal graphs and evaluating them. This work maps observational and interventional data directly to graph structures via supervised learning on synthetic graphs.
5️⃣ Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization
💡 This paper theoretically explains the generalization gap between Adam and SGD (with proper regularization) in learning neural networks for image-like datasets.
10. Adversarial Robustness, Federated Learning, Pruning

Finally, we're grouping Adversarial Robustness and Federated Learning, two clear-cut independent topics, into one last section. Adversarial Robustness is a key area of research, especially now that models are being deployed and used by an exponentially increasing number of people. It's not only important that models perform tasks well, but also that we understand how and why they fail, which is often very different from how humans fail.
When it comes to Federated Learning, the subfield has been growing slowly but steadily since its inception, and has become an established practice for some niche applications in tandem with differential privacy, in cases where preserving individual anonymity is highly important.
1️⃣ Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?