A Guide to ICLR 2023 — 10 Topics and 50 papers you shouldn't miss

The 2023 International Conference on Learning Representations is going live in Kigali on May 1st, and it comes packed with more than 2300 papers: Reasoning in Language Models, Diffusion, Self-supervised learning for Computer Vision, Molecular Modeling, Graph Neural Networks, Federated Learning, and much more. Here's our guide to get you started.

Image by Zeta Alpha

The role of conferences in the modern world of ML research has shifted. Previously seen as a platform for disseminating cutting-edge research, conferences now present established research that is typically six months old. Nonetheless, they offer two notable advantages compared to daily preprints on arXiv: (1) conference papers are proofread by a few reviewers and go through some iterations, making them more polished and refined than your average preprint, and (2) they are generally communicated more clearly and are easier to follow, making them suitable for learning about fields outside one's primary expertise.

To assist in navigating the conference content, we have created an interactive semantic map using VOSviewer that organizes research by topic. The size of each paper on the map represents its predicted impact, which considers factors such as early citations, social media popularity, and author influence. You can use this tool to quickly skim through your areas of interest. We've also selected 5 papers for each of 10 main topics, along with a quick overview of each.

If you want to learn more about these topics, sign up for our upcoming Trends in AI webinar on Thursday, May 4th from Lab42, or join us in person in room L3.35.

1. Language Models and Reasoning

As the AI research community continues to explore the potential of language models, a significant focus lies in refining their reasoning capabilities. This year's conference features several innovative approaches that aim to enhance such performance, as well as an improved understanding of their behaviour.

Some of the most insightful works include techniques such as self-consistency or least-to-most prompting which improve upon chain-of-thought for reasoning in Language Models.

💡 A new decoding strategy, self-consistency, improves chain-of-thought reasoning in language models, achieving significant performance boosts on various reasoning benchmarks.

💡 A new prompting strategy, least-to-most prompting, breaks down complex problems into simpler subproblems and solves them in sequence, enabling complex reasoning in large language models.

💡 A method for combining knowledge from different pretrained language models for various tasks, resulting in competitive performance for zero-shot image captioning and video-to-text retrieval, as well as enabling new applications such as answering free-form questions about egocentric video and engaging in multimodal assistive dialogue with people.

💡 Large language models memorize training data, violating privacy, degrading utility, and hurting fairness, with memorization increasing as model capacity, duplication, and context increase.

💡 A two-step framework for creating datasets for natural language tasks, using an unsupervised, graph-based selective annotation method, which improves task performance by a large margin with less annotation cost.
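
To make the self-consistency idea from the first highlight above more concrete, here is a minimal sketch: sample several chain-of-thought completions for the same question and keep the final answer that most of them agree on. The function names and the toy sampler are hypothetical stand-ins, not the paper's code; a real setup would call a language model with a non-zero sampling temperature in place of `fake_sampler`.

```python
from collections import Counter
from itertools import cycle


def self_consistency(sample_answer, question, num_samples=5):
    """Sample several reasoning paths and return the most common final answer.

    `sample_answer` is a hypothetical callable that runs one sampled
    chain-of-thought completion and returns only its final answer.
    """
    answers = [sample_answer(question) for _ in range(num_samples)]
    votes = Counter(answers)
    best_answer, count = votes.most_common(1)[0]
    # Also report the fraction of samples that agree with the winner.
    return best_answer, count / num_samples


# Toy stand-in for a language model (an assumption for illustration only):
# it replays a fixed list of final answers, as if five sampled reasoning
# paths had disagreed with each other.
_fake_outputs = cycle(["18", "26", "18", "14", "18"])


def fake_sampler(question):
    return next(_fake_outputs)


answer, agreement = self_consistency(fake_sampler, "How many apples are left?")
print(answer, agreement)
```

The agreement ratio is not part of the basic voting step, but it is a cheap by-product that can serve as a rough signal of how much the sampled paths disagree.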

2. Learning Video Representations

The self-supervised revolution continues to dominate in Computer Vision. The trend points towards more unification of tasks, larger pretraining, and more end-to-end models: for instance, DETR for improved end-to-end object detection, or PaLI for multimodal and multilingual Language Modeling. Nonetheless, you'll also find work focused on better understanding widespread techniques such as contrastive learning, or on how vision-language models behave like good old Bag-of-Words.

💡 A strong end-to-end object detector that improves performance and efficiency using denoising training, box prediction, and anchor initialization.

💡 This work presents the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of vision-language models (VLMs) to understand different types of relationships, attributes, and order information, and to identify when VLMs behave like bags-of-words (BoWs).

💡 PaLI is a large multilingual language-image model that generates text based on visual and textual inputs, achieving state-of-the-art performance in multiple vision and language tasks.

💡 Models trained with self-supervised learning tend to generalize better than their supervised counterparts for transfer learning; yet, they still lag behind supervised models on IN1K. In this paper, we propose a supervised learning setup that leverages the best of both worlds.

💡 This work analyzes the theoretical similarities between contrastive and non-contrastive self-supervised learning methods and shows how they can be unified for better performance.

3. Diffusion Models for Generative AI

Diffusion Models continue to rule image generation. ICLR highlights work in porting the diffusion ideas to other domains such as discrete data, modeling human motion, video generation that slowly but steadily improves in appearance, coherence, and length, and applications of diffusion such as image editing.

💡 A simple and effective approach for generating discrete data using continuous state and time diffusion models, achieving strong performance in image generation and captioning tasks.

💡 The Motion Diffusion Model (MDM) is a generative model for human motion data, which predicts the sample itself to achieve state-of-the-art results.

💡 DiffEdit is a novel method that uses text-conditioned diffusion models to automatically generate masks for semantic image editing.

💡 A 9B-parameter transformer for text-to-video generation that achieves state-of-the-art performance both in human and automatic benchmarks.

💡 A new type of generative model based on heat dissipation that bridges the gap between inverse heat dissipation and denoising diffusion.

4. Long Sequence Modeling

This is a small but fascinating region of papers that look into modeling long sequences, and more specifically into the use of state-space models for it. While this is not (yet?) part of the mainstream, such ideas might prove useful in the near term to make the context of language models much, much larger. For context, a state-space representation is a mathematical model of a physical system that describes it in terms of the system's state, its time derivative, inputs, and outputs. This type of representation (relying on matrices and vectors) lends itself very well to the linear algebra toolset, which makes it ideal for analytically proving and reasoning about the dynamics, stability, and modes of a system.
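
As a rough illustration of the state-space idea, the sketch below runs a tiny discrete-time linear model, x_k = A x_{k-1} + B u_k and y_k = C x_k, over an input sequence. The text above describes the continuous-time form; the discrete recurrence, the single input and output channel, and the matrix values are all assumptions made for the example. It also checks that the same output can be computed as one long causal convolution with kernel K_j = C A^j B, which is the property that lets S4-style models process long sequences efficiently.

```python
import numpy as np


def ssm_recurrence(A, B, C, u):
    """Run x_k = A x_{k-1} + B u_k, y_k = C x_k, starting from a zero state."""
    x = np.zeros(A.shape[0])
    outputs = []
    for u_k in u:
        x = A @ x + B * u_k
        outputs.append(C @ x)
    return np.array(outputs)


def ssm_convolution(A, B, C, u):
    """Same output, computed as a causal convolution with K_j = C A^j B."""
    length = len(u)
    kernel = np.empty(length)
    a_power_b = B.copy()  # holds A^j B, starting with j = 0
    for j in range(length):
        kernel[j] = C @ a_power_b
        a_power_b = A @ a_power_b
    # Keep only the first `length` outputs of the full convolution.
    return np.convolve(u, kernel)[:length]


# Hypothetical 2-dimensional state with a scalar input and a scalar output.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([1.0, 0.5])
C = np.array([1.0, -1.0])

u = np.random.default_rng(0).normal(size=16)
y_recurrent = ssm_recurrence(A, B, C, u)
y_convolved = ssm_convolution(A, B, C, u)
print(np.allclose(y_recurrent, y_convolved))
```

The recurrent form is convenient for step-by-step inference, while the convolutional form can be evaluated for the whole sequence at once during training.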

💡 A theoretically grounded, single-head gated attention mechanism equipped with moving average that offers significant improvements over other sequence models.

💡 Understanding the expressivity gap between State Space Models (SSMs) and attention in language modeling. This work introduces a new SSM layer, H3, that matches attention on synthetic languages and outperforms Transformers on OpenWebText, and introduces FlashConv to improve efficiency on modern hardware.

💡 PatchTST is a model for multivariate time series forecasting and self-supervised representation learning that significantly improves long-term forecasting accuracy.

💡 Two critical principles contribute to the success of S4 as a global convolutional model, leading to the development of a new model called Structured Global Convolution (SGConv) that exhibits strong empirical performance over several tasks.

💡 Liquid-S4, a linear liquid time-constant state-space model, improves generalization across sequence modeling tasks with long-term dependencies.

5. Reinforcement Learning

It is not possible to adequately cover the breadth of the Reinforcement Learning (RL) field in ICLR with just five papers, as it is one of the largest and most prolific areas of research. One major focus of RL research is finding ways to make agents learn more efficiently, which is addressed in several papers. These proposals include leveraging Language Models for decision-making, explicitly disentangling policy control into controllable vs. purely stochastic with information-theoretic principles, large-scale offline learning, GFlowNets, and more.

💡 The adoption of latent variable policies within the MaxEnt framework can improve exploration and robustness capabilities in reinforcement learning.

💡 The dichotomy of control (DoC) is a future-conditioned supervised learning framework that separates mechanisms within a policy's control from those outside, achieving better performance than decision transformer (DT) on highly stochastic environments.

💡 Map-free RL agents have shown surprisingly strong performance. How? Do RL agents build implicit maps? This paper trains 'blind agents' with artificially handicapped sensing abilities and finds that such agents still largely succeed in new environments and that collision detection neurons emerge (among other phenomena).

💡 This paper explores the relationship between generative flow networks and variational inference, highlighting the advantages of GFlowNets for capturing diversity in multimodal target distributions.

💡 The Read and Reward framework utilizes human-written instruction manuals to assist learning policies for specific tasks, leading to a more efficient and better-performing agent in Atari games.

6. Graph Representation Learning

Graph Neural Networks (GNNs) have been around for some time now, and although they have not gained the same level of popularity as Transformers or Diffusion Models, they have steadily increased their influence in recent years. GNNs are now being applied to diverse fields such as drug design, solving differential equations, and reasoning. The reason for this diverse application is that GNNs provide a new abstraction for neural networks, allowing problems to be cast into the right architecture. This overcomes the curse of dimensionality by leveraging symmetries and invariances. For instance, this is crucial for finding the appropriate representations to computationally solve Partial Differential Equations or to predict the shapes of organic molecules to design new drugs more effectively, as outlined in the Geometric Deep Learning blueprint.
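
To illustrate what "leveraging symmetries" means for graphs, here is a small sketch of a simplified message-passing layer (sum over neighbours, then a shared linear map and ReLU). The layer definition, sizes, and random graph are assumptions chosen for the example rather than any specific paper's architecture. The check at the end shows that relabeling the nodes simply relabels the outputs (permutation equivariance), and that a sum readout over nodes does not change at all (permutation invariance).

```python
import numpy as np


def message_passing_layer(adjacency, features, weight):
    """One simplified layer: sum neighbour features, then linear map + ReLU."""
    aggregated = adjacency @ features
    return np.maximum(aggregated @ weight, 0.0)


rng = np.random.default_rng(0)
num_nodes, in_dim, out_dim = 5, 3, 4

# Random undirected graph without self-loops (hypothetical toy data).
upper = np.triu(rng.integers(0, 2, size=(num_nodes, num_nodes)), k=1)
adjacency = (upper + upper.T).astype(float)
features = rng.normal(size=(num_nodes, in_dim))
weight = rng.normal(size=(in_dim, out_dim))

# Relabel the nodes with a random permutation matrix P.
perm = rng.permutation(num_nodes)
P = np.eye(num_nodes)[perm]

out = message_passing_layer(adjacency, features, weight)
out_permuted = message_passing_layer(P @ adjacency @ P.T, P @ features, weight)

# Node-level outputs are permuted the same way as the inputs.
equivariant = np.allclose(out_permuted, P @ out)
# A sum readout over nodes ignores the node ordering entirely.
invariant = np.allclose(out_permuted.sum(axis=0), out.sum(axis=0))
print(equivariant, invariant)
```

Because the layer never depends on the order in which nodes are listed, the network does not have to learn that ordering is irrelevant from data.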

💡 We propose neural networks invariant to the symmetries of eigenvectors; they are theoretically expressively powerful, and empirically successful at learning graph positional encodings.

💡 Gradient Gating, a novel framework for improving the performance of Graph Neural Networks (GNNs). Our framework is based on gating the output of GNN layers with a mechanism for multi-rate flow of message passing information across nodes of the underlying graph.

💡 A method that solves the expressivity issues that plague most MPNNs for link prediction while being as efficient to run as GCN. This is achieved by passing subgraph sketches as messages.

💡 Existing graph transformers are relatively shallow. In this work, we explore whether more layers are beneficial to graph transformers, and find that current graph transformers suffer from the bottleneck of improving performance by increasing depth.

💡 AutoTransfer, an AutoML solution that improves search efficiency by transferring the prior architectural design knowledge to the novel task of interest.

7. Molecular Modeling and Geometric DL

This section focuses on the use of GNNs and other techniques inspired by geometric DL for molecular modeling and physics-related applications. The papers also touch on Neural Differential Equation solvers and diffusion, among others. For instance, protein representations can be pre-trained using 3D structures, and Partial Differential Equations can be modeled with Clifford neural layers.

💡 A diffusion generative model that outperforms traditional and deep learning methods in molecular docking with a 38% top-1 success rate.

💡 The first generative modeling approach to motif-scaffolding by developing a diffusion probabilistic model of protein backbones and a procedure for generating scaffolds conditional on a motif.

💡 A new method for pretraining protein representations based on their 3D structures instead of their sequence, which outperforms existing sequence-based approaches.

💡 The first usage of multivector fields and Clifford convolutions in deep learning, resulting in universally applicable Clifford neural layers that improve generalization capabilities of neural PDE surrogates for physical system modeling.

💡 DiGress is a discrete denoising diffusion model that generates graphs with categorical node and edge attributes. It is state-of-the-art on both abstract and molecular datasets.

8. Biology Inspired

Taking inspiration from how living things learn is a principle many AI researchers follow. Here's a section of papers exploring techniques from biology (brains, humans, and evolution) such as biologically plausible gradient descent alternatives, structured memories, and other theory-heavy neuroscience-inspired papers.

💡 Forward gradient learning computes a noisy directional gradient and is a biologically plausible alternative to backprop for learning deep neural networks. The standard forward gradient algorithm suffers from the curse of dimensionality in the number of parameters. This work proposes to scale the forward gradient by adding a large number of local greedy loss functions.

💡 A mathematical theory that explains how biological constraints on neurons promote disentangled representations, which are highly sought after in machine learning and can help understand how the brain represents single human-interpretable factors.

💡 A simple and efficient method for incremental learning of structured memories using closed-loop transcription, achieving better performance with fewer resources.

💡 An evolutionary method for generating large-scale multitask models with sparsely activated task-based routing and knowledge compartmentalization to avoid common pitfalls.

💡 This work investigates the conjecture that permutation invariance eliminates the loss barrier to linear interpolation between SGD solutions, using computer vision architectures.
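
As a rough sketch of the forward gradient idea from the first highlight in this section: pick a random direction v, measure how the loss changes along v (a directional derivative), and use that scalar times v as a noisy gradient estimate. The toy quadratic loss, the hand-written directional derivative, and all names below are assumptions for illustration; in practice the directional derivative would come from forward-mode automatic differentiation, and this is not the paper's local-loss method.

```python
import numpy as np


def directional_derivative(theta, target, v):
    """Derivative of the toy loss 0.5 * ||theta - target||^2 along v.

    Written out by hand for this example; forward-mode autodiff would
    provide this value for a real network without a backward pass.
    """
    return (theta - target) @ v


def forward_gradient(theta, target, rng):
    """One noisy estimate: (directional derivative along v) * v."""
    v = rng.standard_normal(theta.shape)
    return directional_derivative(theta, target, v) * v


rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0, 0.5, 3.0])
target = np.zeros(4)
true_gradient = theta - target  # exact gradient of the toy loss

# Averaging many single-direction estimates approaches the true gradient.
estimates = np.array(
    [forward_gradient(theta, target, rng) for _ in range(20000)]
)
mean_estimate = estimates.mean(axis=0)
print(true_gradient)
print(mean_estimate)
```

Each individual estimate is very noisy, and the noise grows with the number of parameters, which is consistent with the scaling problem the highlight mentions.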

9. OOD Generalization, Optimization

The subjects of Out-of-Domain (OOD) generalization and its closely associated concept, causality, have long been of research interest. Although OOD has not yet entirely penetrated the mainstream, it is evident that the Machine Learning benchmarking culture is progressively placing greater emphasis on robust generalization under challenging circumstances, such as zero/few-shot scenarios or under substantial data distribution shifts. This shift is primarily due to the rapid and successive saturation of static in-domain evaluations such as the classic ImageNet or the GLUE Benchmark.

💡 Most domain generalization algorithms focus on specific data shifts. This work introduces a dataset for multiple-attribute distribution shifts and shows how existing models fail to generalize under those circumstances. They also present the Causally Adaptive Constraint Minimization to better capture correct independence constraints.

💡 A scalable method for automatically distilling and captioning a model's failure modes as directions in a latent space.

💡 A theory to explain why ensemble and knowledge distillation work for Deep Learning. It matches practice well, while traditional theory such as boosting, random feature mappings or NTKs, cannot explain the same phenomena for DL.

💡 Causal induction models often rely on generating candidate causal graphs and evaluating them. This work maps observational and interventional data directly to graph structures via supervised learning on synthetic graphs.

💡 This paper theoretically explains the generalization gap between Adam and SGD (with proper regularization) in learning neural networks for image-like datasets.

10. Adversarial Robustness, Federated Learning, Pruning

Finally, we're grouping Adversarial Robustness and Federated Learning — two clear-cut independent topics — into one last section. Adversarial Robustness is a key area of research especially now that models are being deployed and used by an exponentially increasing number of people. It's not only important that models perform tasks well, but also that we understand how and why they fail, which is often very different from humans.

When it comes to Federated Learning, the subfield has been growing slowly but steadily since its inception, and has become an established practice for some niche applications in tandem with differential privacy, in cases where preserving individual anonymity is highly important.

💡 This work provides an error landscape perspective on what information is encoded in a winning ticket's mask and how Iterative Magnitude Pruning finds matching subnetworks.

💡 A robustness certification framework against universal perturbations (including both universal adversarial noise and backdoor attacks).

💡 This work analyzes the fundamental properties of diffusion models to understand why and how they enhance certified robustness, and proposes a method to improve robustness even further for this family of methods.

💡 This work identifies common pitfalls of existing personalized federated learning methods during deployment and proposes a novel test-time personalization solution.

💡 A study of pre-training in the context of Federated Learning. The authors find that it makes global aggregation more stable and that training tends to converge to the same loss basin under different client data conditions. Still, pre-training doesn't fix model drifting, a fundamental problem in FL under non-IID data.


Our selection ends here, but our ICLR coverage has just started! To keep up with the latest trends in AI, make sure to follow us on Twitter @zetavector to stay up to date with everything that's happening there!

