A Guide to ICLR 2023: 10 Topics and 50 papers you shouldn't miss

The 2023 International Conference on Learning Representations is going live in Kigali on May 1st, and it comes packed with more than 2,300 papers. Reasoning in Language Models, Diffusion, Self-supervised Learning for Computer Vision, Molecular Modeling, Graph Neural Networks, Federated Learning, and much more... Here's our guide to get you started.

Image by Zeta Alpha

The role of conferences in the modern world of ML research has shifted. Previously seen as a platform for disseminating cutting-edge research, conferences now present established research that is typically six months old. Nonetheless, they offer two notable advantages compared to daily preprints on arXiv: (1) conference papers are proofread by a few reviewers and go through some iterations, making them more polished and refined than your average preprint, and (2) they are generally communicated more clearly and are easier to follow, making them suitable for learning about fields outside one's primary expertise.


To assist in navigating the conference content, we have created an interactive semantic map using VOSviewer that organizes research by topic. The predicted impact of each paper is represented by its size on the map, which takes into account factors such as early citations, social media popularity, and author influence. You can use this tool to quickly skim through your areas of interest. For each of 10 main topics, we've selected 5 papers and written a quick overview.

If you want to learn more about these topics, sign up for our upcoming Trends in AI webinar on Thursday, May 4th from Lab42, or join us in person in room L3.35 👇


1. Language Models and Reasoning

As the AI research community continues to explore the potential of language models, a significant focus lies in refining their reasoning capabilities. This year's conference features several innovative approaches that aim to enhance such performance, as well as an improved understanding of their behaviour.


Some of the most insightful works include techniques such as self-consistency and least-to-most prompting, which improve upon chain-of-thought prompting for reasoning in language models.


1️⃣ Self-Consistency Improves Chain of Thought Reasoning in Language Models

🔗 OpenReview | 🔎 More like this paper

💡 A new decoding strategy, self-consistency, improves chain-of-thought reasoning in language models, achieving significant performance boosts on various reasoning benchmarks.

2️⃣ Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

🔗 OpenReview | 🔎 More like this paper

💡 A new prompting strategy, least-to-most prompting, breaks down complex problems into simpler subproblems and solves them in sequence, enabling complex reasoning in large language models.

3️⃣ Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

🔗 OpenReview | 🔎 More like this paper

💡 Combines knowledge from different pretrained models for various tasks, resulting in competitive performance for zero-shot image captioning and video-to-text retrieval, and enabling new applications such as answering free-form questions about egocentric video and engaging in multimodal assistive dialogue with people.

4️⃣ Quantifying Memorization Across Neural Language Models

🔗 OpenReview | 🔎 More like this paper

💡 Large language models memorize training data, which can violate privacy, degrade utility, and hurt fairness; memorization increases with model capacity, data duplication, and context length.

5️⃣ Selective Annotation Makes Language Models Better Few-Shot Learners

🔗 OpenReview | 🔎 More like this paper

💡 A two-step framework for creating datasets for natural language tasks, using an unsupervised, graph-based selective annotation method, which improves task performance by a large margin with less annotation cost.
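
To make the self-consistency idea from the first paper above more concrete, here is a minimal Python sketch: sample several reasoning paths, extract each final answer, and keep the most frequent one. The `fake_sampler` stub, the `ANSWER_MARKER` convention, and all function names are hypothetical and only for illustration; this is not the paper's code. In practice the sampler would be a language model decoded with some randomness.

```python
from collections import Counter
from itertools import cycle
from typing import Callable

# Assumed output convention for this sketch (not taken from the paper).
ANSWER_MARKER = "The answer is"


def extract_final_answer(completion: str) -> str:
    """Return the text after the last answer marker, without a trailing period."""
    tail = completion.rsplit(ANSWER_MARKER, 1)[-1]
    return tail.strip().rstrip(".")


def self_consistent_answer(sample: Callable[[str], str], prompt: str,
                           n_samples: int = 5) -> str:
    """Sample several reasoning paths and return the most frequent final answer."""
    answers = [extract_final_answer(sample(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Hypothetical stand-in for a stochastic language model: it cycles through
# canned chain-of-thought completions, one of which reasons incorrectly.
_canned = cycle([
    "3 + 15 = 18. The answer is 18.",
    "3 * 5 = 15. The answer is 15.",
    "15 + 3 = 18. The answer is 18.",
])


def fake_sampler(prompt: str) -> str:
    return next(_canned)


result = self_consistent_answer(fake_sampler, "Q: ...", n_samples=5)
print(result)  # the majority answer among the five sampled paths
```

A plain majority vote is the simplest way to aggregate the sampled answers, which is why it is used in this sketch.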


2. Learning Video Representations

The self-supervised revolution continues to dominate in Computer Vision. The trend points towards more unification of tasks, larger pretraining, and more end-to-end models. For instance, DINO builds on DETR for improved end-to-end object detection, and PaLI tackles multimodal and multilingual language modeling. Nonetheless, you'll also find work focused on better understanding widespread techniques such as contrastive learning, or on how vision-language models behave like good old bags-of-words.


1️⃣ DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

🔗 OpenReview | 🔎 More like this paper

💡 A strong end-to-end object detector that improves performance and efficiency using denoising training, box prediction, and anchor initialization.

2️⃣ When and Why Vision-Language Models Behave like Bags-Of-Words, and What to Do About It?

🔗 OpenReview | 🔎 More like this paper

💡 This work presents the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order information, and to identify when VLMs behave like bags-of-words.

3️⃣ PaLI: A Jointly-Scaled Multilingual Language-Image Model

🔗 OpenReview | 🔎 More like this paper

💡 PaLI is a large multilingual language-image model that generates text based on visual and textual inputs, achieving state-of-the-art performance in multiple vision and language tasks.

4️⃣ No Reason for No Supervision: Improved Generalization in Supervised Models

🔗 OpenReview | 🔎 More like this paper

💡 Models trained with self-supervised learning tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on ImageNet-1K (IN1K). In this paper, we propose a supervised learning setup that leverages the best of both worlds.

5️⃣ On the duality between contrastive and non-contrastive self-supervised learning

🔗 OpenReview | 🔎 More like this paper

💡 This work analyzes the theoretical similarities between contrastive and non-contrastive self-supervised learning methods and shows how they can be unified for better performance.


3. Diffusion Models for Generative AI

Diffusion Models continue to rule image generation. ICLR highlights work that ports diffusion ideas to other domains such as discrete data, human motion, and video generation (which slowly but steadily improves in appearance, coherence, and length), as well as applications of diffusion such as image editing.


1️⃣ Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning

🔗 OpenReview | 🔎 More like this paper

💡 A simple and effective approach for generating discrete data using continuous state and time diffusion models, achieving strong performance in image generation and captioning tasks.

2️⃣ Human Motion Diffusion Model

🔗 OpenReview | 🔎 More like this paper

💡 The Motion Diffusion Model (MDM) is a generative model for human motion data, which predicts the sample itself to achieve state-of-the-art results.

3️⃣ DiffEdit: Diffusion-based semantic image editing with mask guidance

🔗 OpenReview | 🔎 More like this paper

💡 DiffEdit is a novel method that uses text-conditioned diffusion models to automatically generate masks for semantic image editing.

4️⃣ CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

🔗 OpenReview | 🔎 More like this paper

💡 A 9B-parameter transformer for text-to-video generation that achieves state-of-the-art performance in both human and automatic benchmarks.

5️⃣ Blurring Diffusion Models

🔗 OpenReview | 🔎 More like this paper

💡 A generative modeling approach based on heat dissipation that bridges the gap between inverse heat dissipation and denoising diffusion.


4. Long Sequence Modeling

This is a small but fascinating region of papers that look into modeling long sequences and, more specifically, the use of state-space modeling for it. While this is not (yet?) part of the mainstream, such ideas might prove useful in the near term to make the context of language models much, much larger. For context, a state-space representation is a mathematical model of a physical system that describes it in terms of the system's state, its time derivative, inputs, and outputs. This type of representation (relying on matrices and vectors) lends itself well to the linear algebra toolset, which makes it ideal for analytically proving and reasoning about the dynamics, stability, and modes of a system.
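
As a rough illustration of the state-space form described above, the sketch below simulates a tiny linear system x' = A x + B u, y = C x + D u in plain Python. The toy spring system, the function names, and the forward Euler discretization are all assumptions chosen for simplicity; they are not taken from any of the papers in this section, which rely on more careful discretizations and structured matrices.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Sequence[float]) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def simulate_ssm(A: Matrix, B: Vector, C: Vector, D: float,
                 inputs: Sequence[float], x0: Vector, dt: float) -> Vector:
    """Simulate x' = A x + B u, y = C x + D u with forward Euler steps."""
    x = list(x0)
    outputs = []
    for u in inputs:
        # Output equation: read y off the current state and input.
        outputs.append(sum(c * xi for c, xi in zip(C, x)) + D * u)
        # State equation: time derivative of the state.
        dx = [ax + b * u for ax, b in zip(matvec(A, x), B)]
        # Forward Euler update (an assumed, deliberately simple discretization).
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
    return outputs


# Toy system: a damped spring with state [position, velocity], driven by a unit force.
A = [[0.0, 1.0], [-1.0, -1.0]]
B = [0.0, 1.0]
C = [1.0, 0.0]  # observe the position only
D = 0.0

ys = simulate_ssm(A, B, C, D, inputs=[1.0] * 4000, x0=[0.0, 0.0], dt=0.01)
print(round(ys[-1], 3))  # the position settles near its steady state of 1.0
```

Because everything is expressed with matrices and vectors, questions such as stability reduce to properties of A (here, both eigenvalues have negative real part, so the response to a constant input settles).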


1️⃣ Mega: Moving Average Equipped Gated Attention

🔗 OpenReview | 🔎 More like this paper

💡 A theoretically grounded, single-head gated attention mechanism equipped with a moving average that offers significant improvements over other sequence models.

2️⃣ Hungry Hungry Hippos: Towards Language Modeling with State Space Models

🔗 OpenReview | 🔎 More like this paper

💡 Understanding the expressivity gap between State Space Models (SSMs) and attention in language modeling. This work introduces a new SSM layer, H3, that matches attention on synthetic languages and outperforms Transformers on OpenWebText, and introduces FlashConv to improve efficiency on modern hardware.

3️⃣ A Time Series is Worth 64 Words: Long-term Forecasting with Transformers

🔗 OpenReview | 🔎 More like this paper

💡 PatchTST is a model for multivariate time series forecasting and self-supervised representation learning that improves long-term forecasting accuracy significantly.

4️⃣ What Makes Convolutional Models Great on Long Sequence Modeling?

🔗 OpenReview | 🔎 More like this paper

💡 Two critical principles contribute to the success of S4 as a global convolutional model, leading to the development of a new model called Structured Global Convolution (SGConv) that exhibits strong empirical performance over several tasks.

5️⃣ Liquid Structural State-Space Models

🔗 OpenReview | 🔎 More like this paper

💡 Liquid-S4, a linear liquid time-constant state-space model, improves generalization across sequence modeling tasks with long-term dependencies.


5. Reinforcement Learning

It is not possible to adequately cover the breadth of the Reinforcement Learning (RL) field at ICLR with just five papers, as it is one of the largest and most prolific areas of research. One major focus of RL research is finding ways to make agents learn more efficiently, which is addressed in several papers. These proposals include leveraging Language Models for decision-making, explicitly separating what a policy can control from what is purely stochastic using information-theoretic principles, large-scale offline learning, GFlowNets, and more.


1️⃣ Latent State Marginalization as a Low-cost Approach for Improving Exploration

🔗 OpenReview | 🔎 More like this paper

💡 The adoption of latent variable policies within the MaxEnt framework can improve exploration and robustness capabilities in reinforcement learning.

2️⃣ Dichotomy of Control: Separating What You Can Control from What You Cannot

🔗 OpenReview | 🔎 More like this paper

💡 The dichotomy of control (DoC) is a future-conditioned supervised learning framework that separates mechanisms within a policy's control from those outside, achieving better performance than decision transformer (DT) on highly stochastic environments.

3️⃣ Emergence of Maps in the Memories of Blind Navigation Agents

🔗 OpenReview | 🔎 More like this paper

💡 Map-free RL agents have shown surprisingly strong performance. How? Do RL agents build implicit maps? This paper trains 'blind agents' with artificially handicapped sensing abilities and finds that such agents still largely succeed in new environments and that collision detection neurons emerge (among other phenomena).

4️⃣ GFlowNets and variational inference

🔗 OpenReview | 🔎 More like this paper

💡 This paper explores the relationship between generative flow networks and variational inference, highlighting the advantages of GFlowNets for capturing diversity in multimodal target distributions.

5️⃣ Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

🔗 OpenReview | 🔎 More like this paper

💡 The Read and Reward framework utilizes human-written instruction manuals to assist learning policies for specific tasks, leading to a more efficient and better-performing agent in Atari games.


6. Graph Representation Learning

Graph Neural Networks (GNNs) have been around for some time now, and although they have not gained the same level of popularity as Transformers or Diffusion Models, they have steadily increased their influence in recent years. GNNs are now being applied to diverse fields such as drug design, solving differential equations, and reasoning. The reason for this diverse application is that GNNs provide a new abstraction for neural networks, allowing problems to be cast into the right architecture. This helps overcome the curse of dimensionality by leveraging symmetries and invariances. For instance, this is crucial for finding the appropriate representations to computationally solve Partial Differential Equations or to predict the shapes of organic molecules to design new drugs more effectively, as outlined in the Geometric Deep Learning blueprint.


1️⃣ Sign and Basis Invariant Networks for Spectral Graph Representation Learning

🔗 OpenReview | 🔎 More like this paper

💡 We propose neural networks invariant to the symmetries of eigenvectors; they are theoretically expressively powerful, and empirically successful at learning graph positional encodings.

2️⃣ Gradient Gating for Deep Multi-Rate Learning on Graphs

🔗 OpenReview | 🔎 More like this paper

💡 Gradient Gating is a novel framework for improving the performance of Graph Neural Networks (GNNs). It is based on gating the output of GNN layers with a mechanism for multi-rate flow of message-passing information across the nodes of the underlying graph.

3️⃣ Graph Neural Networks for Link Prediction with Subgraph Sketching

🔗 OpenReview | 🔎 More like this paper

💡 A method that solves the expressivity issues that plague most MPNNs for link prediction while being as efficient to run as GCN. This is achieved by passing subgraph sketches as messages.

4️⃣ Are More Layers Beneficial to Graph Transformers?

🔗 OpenReview | 🔎 More like this paper

💡 Existing graph transformers are relatively shallow. This work explores whether more layers are beneficial to graph transformers, and finds that current graph transformers hit a bottleneck when trying to improve performance by increasing depth.

5️⃣ AutoTransfer: AutoML with Knowledge Transfer - An Application to Graph Neural Networks

🔗 OpenReview | 🔎 More like this paper

💡 AutoTransfer is an AutoML solution that improves search efficiency by transferring prior architectural design knowledge to the novel task of interest.


7. Molecular Modeling and Geometric DL

This section focuses on the use of GNNs and other techniques inspired by geometric DL for molecular modeling and physics-related applications. The papers also touch on neural differential equation solvers and diffusion, among others. For instance, protein representations can be pretrained using 3D structures, and Partial Differential Equations can be modeled with Clifford neural layers.


1️⃣ DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking

🔗 OpenReview | 🔎 More like this paper

💡 A diffusion generative model that outperforms traditional and deep learning methods in molecular docking with a 38% top-1 success rate.

2️⃣ Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem

🔗 OpenReview | 🔎 More like this paper

💡 The first generative modeling approach to motif-scaffolding by developing a diffusion probabilistic model of protein backbones and a procedure for generating scaffolds conditional on a motif.

3️⃣ Protein Representation Learning by Geometric Structure Pretraining

🔗 OpenReview | 🔎 More like this paper

💡 A new method for pretraining protein representations based on their 3D structures instead of their sequence, which outperforms existing sequence-based approaches.

4️⃣ Clifford Neural Layers for PDE Modeling

🔗 OpenReview | 🔎 More like this paper

💡 The first usage of multivector fields and Clifford convolutions in deep learning, resulting in universally applicable Clifford neural layers that improve generalization capabilities of neural PDE surrogates for physical system modeling.

5️⃣ DiGress: Discrete Denoising diffusion for graph generation

🔗 OpenReview | 🔎 More like this paper

💡 DiGress is a discrete denoising diffusion model that generates graphs with categorical node and edge attributes. It is state-of-the-art on both abstract and molecular datasets.



8. Biology Inspired

Taking inspiration from how living things learn is a principle many AI researchers follow. Here's a section of papers exploring techniques from biology (brains, humans, and evolution) such as biologically plausible gradient descent alternatives, structured memories, and other theory-heavy neuroscience-inspired papers.


1️⃣ Scaling Forward Gradient With Local Losses

🔗 OpenReview | 🔎 More like this paper

💡 Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. The standard forward gradient algorithm suffers from the curse of dimensionality in the number of parameters. This work proposes to scale the forward gradient by adding a large number of local greedy loss functions.

2️⃣ Disentanglement with Biological Constraints: A Theory of Functional Cell Types

🔗 OpenReview | 🔎 More like this paper

💡 A mathematical theory that explains how biological constraints on neurons promote disentangled representations, which are highly sought after in machine learning and can help understand how the brain represents single human-interpretable factors.

3️⃣ Incremental Learning of Structured Memory via Closed-Loop Transcription

🔗 OpenReview | 🔎 More like this paper

💡 A simple and efficient method for incremental learning of structured memories using closed-loop transcription, achieving better performance with fewer resources.

4️⃣ An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems

🔗 OpenReview | 🔎 More like this paper

💡 An evolutionary method for generating large-scale multitask models with sparsely activated task-based routing and knowledge compartmentalization to avoid common pitfalls.

5️⃣ REPAIR: REnormalizing Permuted Activations for Interpolation Repair

🔗 OpenReview | 🔎 More like this paper

💡 This work investigates the conjecture that permutation invariance eliminates the loss barrier to linear interpolation between SGD solutions, using computer vision architectures.
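
For readers unfamiliar with the "noisy directional gradient" mentioned in the first paper of this section, here is a small sketch of the standard forward gradient estimator: draw a random direction v, compute the directional derivative of the loss along v in a single forward pass, and use (directional derivative) times v as the gradient estimate. The tiny `Dual` number class, the toy quadratic loss, and the function names are hypothetical choices for this illustration; the sketch shows only the basic estimator, not the local-loss scaling proposed in the paper.

```python
import random
from typing import Callable, List, Optional, Sequence


class Dual:
    """Number value + tangent*eps with eps**2 == 0 (forward-mode autodiff)."""

    def __init__(self, value: float, tangent: float = 0.0):
        self.value = value
        self.tangent = tangent

    @staticmethod
    def lift(x) -> "Dual":
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)

    __rmul__ = __mul__


def forward_gradient(f: Callable[[List[Dual]], Dual], x: Sequence[float],
                     v: Optional[Sequence[float]] = None) -> List[float]:
    """Return (grad f(x) . v) * v, using one forward pass and no backprop."""
    if v is None:
        v = [random.gauss(0.0, 1.0) for _ in x]  # random probe direction
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    # out.tangent is the directional derivative of f along v.
    return [out.tangent * vi for vi in v]


def loss(w):
    """Toy loss f(w) = sum of squares, whose true gradient is 2 * w."""
    return sum(wi * wi for wi in w)


random.seed(0)
w = [1.0, -2.0]
n = 20000
avg = [0.0, 0.0]
for _ in range(n):
    g = forward_gradient(loss, w)
    avg = [a + gi / n for a, gi in zip(avg, g)]
print(avg)  # noisy, but should land near the true gradient [2.0, -4.0]
```

Each single estimate is very noisy, and the noise grows with the number of parameters, which is the curse of dimensionality the summary above refers to.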



9. OOD Generalization, Optimization

The subjects of Out-of-Domain (OOD) generalization and its closely associated concept, causality, have long been of research interest. Although OOD evaluation has not yet fully entered the mainstream, it is evident that the Machine Learning benchmarking culture is progressively placing greater emphasis on robust generalization under challenging circumstances, such as zero/few-shot scenarios or substantial data distribution shifts. This shift is primarily due to the rapid and successive saturation of static in-domain evaluations such as the classic ImageNet or the GLUE Benchmark.


1️⃣ Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization

🔗 OpenReview | 🔎 More like this paper

💡 Most domain generalization algorithms focus on specific data shifts. This work introduces a dataset for multiple-attribute distribution shifts and shows how existing models fail to generalize under those circumstances. The authors also present Causally Adaptive Constraint Minimization to better capture the correct independence constraints.

2️⃣ Distilling Model Failures as Directions in Latent Space

🔗 OpenReview | 🔎 More like this paper

💡 A scalable method for automatically distilling and captioning a model's failure modes as directions in a latent space.

3️⃣ Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

🔗 OpenReview | 🔎 More like this paper

💡 A theory to explain why ensemble and knowledge distillation work for Deep Learning. It matches practice well, while traditional theory such as boosting, random feature mappings, or NTKs cannot explain the same phenomena for DL.

4️⃣ Learning to Induce Causal Structure

🔗 OpenReview | 🔎 More like this paper

💡 Causal induction models often rely on generating candidate causal graphs and evaluating them. This work maps observational and interventional data directly to graph structures via supervised learning on synthetic graphs.

5️⃣ Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

🔗 OpenReview | 🔎 More like this paper

💡 This paper theoretically explains the generalization gap between Adam and SGD (with proper regularization) in learning neural networks for image-like datasets.


10. Adversarial Robustness, Federated Learning, Pruning

Finally, we're grouping Adversarial Robustness and Federated Learning, two clear-cut independent topics, into one last section. Adversarial Robustness is a key area of research, especially now that models are being deployed and used by a rapidly growing number of people. It's not only important that models perform tasks well, but also that we understand how and why they fail, which is often very different from how humans fail.


When it comes to Federated Learning, the subfield has been growing slowly but steadily since its inception, and has become an established practice for some niche applications in tandem with differential privacy, in cases where preserving individual anonymity is highly important.


1️⃣ Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask?

🔗 OpenReview | 🔎 More like this paper

πŸ