
Trends in AI   February 2022 Edition: Multimodality, Unsupervised Neural IR, ConvNets and more

Updated: Feb 8, 2022

A monthly selection of ML papers by Zeta Alpha: Reinforcement Learning, Multimodality, Language-Models-as-a-Service, a comeback for ConvNets in Computer Vision, Unsupervised IR, and more.

Read the blog post below to find out more... In our monthly webinar we also discuss these papers. You can now also watch the webinar recording on YouTube.

The world of AI research has gone into 2022 at full speed and the number of relevant publications and news in the past weeks can attest to this. Let’s start by highlighting some recent news you shouldn’t have missed:


🔬 Research

Zeta Alpha monitors trending AI research to help you determine what’s worth reading. Using our platform, and a bit of human touch, we’ve selected 10 of the most impactful papers of the month: Automated Reinforcement Learning (AutoRL), multimodal Language Models (LMs), ConvNets vs. Transformers in Computer Vision (CV), unsupervised Neural Information Retrieval (IR) and more. Enjoy!

By Jack Parker-Holder, Raghu Rajan, Xingyou Song et al.

❓Why → One of the main objectives of Machine Learning is the automation of several data processing workflows and pipelines that allow non-experts to use ML techniques, hence the popularity of topics like AutoML. AutoRL is the analog in the world of Reinforcement Learning.

💡Key insights → This paper gives an overview of the space, providing useful taxonomies to unify various approaches to AutoRL. It is especially useful to ML practitioners because the RL vocabulary is quite different from the ML one, making cross-pollination of ideas across fields harder.

Topics discussed include optimization techniques for different targets (e.g. hyperparameters, task design, architectures, etc.):

  • Random vs. grid search-driven approaches

  • Bayesian optimization

  • Evolutionary (and/or population-based) methods

  • Meta-gradients

  • Blackbox optimization

  • Learning RL algorithms, environment design
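
To make the first bullet concrete, here is a minimal sketch contrasting grid search and random search over two RL hyperparameters. It is only an illustration: `toy_return` is a hypothetical stand-in for "train an agent and report its return", and the search ranges are assumptions, not values from the survey.

```python
import itertools
import random

def toy_return(lr, gamma):
    """Hypothetical objective standing in for a full RL training run.
    It peaks at lr=0.01, gamma=0.95; a real one would be noisy and expensive."""
    return -((lr - 0.01) * 100) ** 2 - ((gamma - 0.95) * 10) ** 2

def grid_search(lrs, gammas):
    # Evaluate every combination on a fixed grid and keep the best one.
    return max(itertools.product(lrs, gammas), key=lambda cfg: toy_return(*cfg))

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    # Sample the learning rate log-uniformly and the discount factor uniformly.
    candidates = [(10 ** rng.uniform(-4, -1), rng.uniform(0.8, 0.999))
                  for _ in range(n_trials)]
    return max(candidates, key=lambda cfg: toy_return(*cfg))

best_grid = grid_search([1e-4, 1e-3, 1e-2, 1e-1], [0.8, 0.9, 0.95, 0.99])
best_random = random_search(n_trials=16)
print("grid:", best_grid, "random:", best_random)
```

Both calls spend the same budget of 16 evaluations; the grid only ever tries four distinct values per hyperparameter, while random search tries sixteen.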


While the “dream” of RL that always works out of the box still seems far away, that doesn’t seem to deter researchers from pursuing it.

By Wenlong Huang, Pieter Abbeel, Deepak Pathak and Igor Mordatch.

❓ Why → NLP techniques crossing over to other domains in ML has been a recurring theme of the last couple of years. Here's what happens when you use a pretrained Language Model (LM) like GPT-3 to construct sequences of actions for an agent. It... works!

💡 Key insights → If large enough and trained appropriately, LMs can decompose high-level tasks into low-level plans without further training (i.e. only using the frozen models).

Still, plans generated by a free-form LM might not be executable, that is, mappable to an existing set of known objects and actions. This is why the authors propose to introduce a mapping step from the LM output to a valid action. This mapping is performed by a sentence similarity transformer that finds the closest valid low-level action in an embedding space.
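
The mapping step can be sketched roughly as follows. This is a toy illustration, not the authors' code: the action list is made up, and a bag-of-words vector stands in for the sentence similarity transformer, so `embed`, `cosine` and `closest_action` are all hypothetical names.

```python
import math
from collections import Counter

# Hypothetical set of actions the agent can actually execute.
VALID_ACTIONS = ["walk to kitchen", "open fridge", "grab milk", "close fridge"]

def embed(text):
    # Toy bag-of-words vector; an assumption standing in for a learned sentence encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def closest_action(lm_step):
    # Pick the valid action whose embedding is most similar to the free-form LM output.
    query = embed(lm_step)
    return max(VALID_ACTIONS, key=lambda action: cosine(query, embed(action)))

print(closest_action("open the fridge door"))  # prints "open fridge"
```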

While the results are not earth-shattering, they show that frozen LMs often contain the information required to distill low-level action plans from high-level instructions. Here you can watch some demos and inspect their code.


By Armen Aghajanyan et al.

❓ Why → Multimodality has been a rapidly growing subfield in AI, especially since the advent of monstrously-sized data-hungry Transformers. While their performance has been arguably underwhelming for existing benchmarks so far, the amount of research on the topic will certainly keep increasing for the foreseeable future.

💡 Key insights → The authors of this work cleverly design a pretraining task operating on HTML data which contains text and images. But how are images encoded into tokens that can be fed to the model? Somewhat similarly to DALL·E² from OpenAI, they learn a quantized representation of image patches using VQVAE-GAN¹, so that each patch can be treated as a symbol from a discrete dictionary, just like regular text tokens.
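
The quantization idea can be illustrated with a nearest-neighbor lookup in a codebook. This is only a sketch of the lookup step under toy assumptions: the 2-D "patches" and the four-entry codebook are invented, and in a real VQ model both the encoder and the codebook are learned.

```python
def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
            for p in patches]

# Hypothetical codebook of 4 entries and 3 encoded image patches.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
patches = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.9)]

image_tokens = quantize(patches, codebook)
print(image_tokens)  # prints [0, 3, 2]
```

The resulting integer ids can be interleaved with text token ids and fed to the same language model.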

For training, they use a combination of left-to-right and bidirectional language modeling, and the scale of the whole thing is big, but not outrageously big by today’s standards: 1TB of training corpus, and a maximum of 13B parameters for the largest model.

They benchmark CM3 on both unimodal and multimodal tasks in the zero-shot setting, showing solid (or even SOTA in some cases) performance on image captioning, image generation, zero-shot summarization, entity linking, and several other NLP tasks.


By Aleksandra Piktus et al.

❓ Why → A common criticism of GPT-3 when it came out in May 2020 was that it “knew” absolutely nothing about Covid, given that its training corpus was created before the pandemic started. Including that knowledge would’ve required training the model on new data, either by finetuning or from scratch, which is very costly. Giving Language Models access to a corpus of knowledge is a recent development that allows them to become more efficient learners and more factually accurate, with the added upside of being able to update knowledge without retraining the neural network.

💡 Key insights → A knowledge-intensive NLP task is defined as one that a human is not expected to solve without consulting a knowledge corpus (e.g. a book, the web). This paper proposes a new benchmark precisely tailored to measure LMs’ performance in this respect. It builds upon the existing KILT benchmark³, which is primarily based on the Wikipedia corpus, to construct Fact-checking, Entity linking, Slot filling, Open-domain QA, and Dialog generation tasks.

As more and more retrieval-enhanced Language Models are proposed, having a solid evaluation system to compare them becomes increasingly important. Some recent examples of such models include WebGPT: Browser-assisted question-answering with human feedback (OpenAI), Improving language models by retrieving from trillions of tokens (DeepMind), Artefact Retrieval: Overview of NLP Models with Knowledge Base Access (Saarland Uni) or LaMDA: Language Models for Dialog Applications (Google).


By Romal Thoppilan et al.

❓Why → Despite the tremendous progress in text generation, many of the chatbots you find out there are still quite annoying and not that useful. How can modern Language Models improve conversational AI? Here’s the latest proposal from Google.

💡 Key insights → This is in fact another instance of a Language Model that interacts with a knowledge base to answer queries from users: basically, a retrieval-enhanced LM. In the usual Google fashion, they train a massive 137B model and use human judgments to evaluate it with metrics such as sensibleness and specificity. Unsurprisingly, performance keeps improving with scale without saturating.

At a conceptual level, the method is simple: two variants of the LM are used, LaMDA-Base, which is a regular LM trained on conversations, and LaMDA-Research, a variant of an LM that is trained to interact with an external knowledge system which the authors call the toolset (TS). This toolset includes not only an information retrieval system but also a calculator for arithmetic queries and a translator.

LaMDA-Base and LaMDA-Research interact by passing their inputs along and concatenating them to preserve global context (see figure below). Of course one of the keys to the success of this model is the high-quality training dataset that’s curated by the authors, consisting of more than 40k annotated dialog interactions, besides the usual large-scale self-supervised pretraining.


By Tianxiang Sun et al.

❓Why → As huge Transformers have become the norm in many research areas, challenges have emerged in how they’re used. Not long ago, you could simply download a model checkpoint of a few hundred MBs in size and run it wherever you wanted. But when the checkpoint is close to a Terabyte in size… well, it needs to run across several machines and it’s just not feasible to download! Moreover, such large models have become extremely valuable IP for companies like OpenAI, being the backbone of services they provide and a clear competitive advantage they’re not willing to give up. Hence the emergence of ML models as a service, which exposes an ML model only as a blackbox API that returns predictions given a set of inputs. Now, can you tune such a model that’s only accessible as a blackbox API…?

💡 Key insights → Users of a blackbox API can tune their systems with derivative-free algorithms (remember, we only have access to inputs and outputs, not gradients!). In particular, the authors use evolutionary algorithms to search the space of prompts and hyperparameters, efficiently learning prompts that outperform both manual prompting and in-context learning (that is, including training examples in the prompt, as GPT-3 did for few-shot learning). In some cases, their method even outperforms gradient-based methods such as prompt finetuning!
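
To give a feel for derivative-free tuning, here is a minimal mutate-and-select sketch. It is a far simpler stand-in for the evolutionary algorithms used in the paper, and everything in it is an assumption for illustration: `blackbox_loss` plays the role of the API call that returns only a score, and the 3-dimensional "prompt" is a toy continuous vector.

```python
import random

def blackbox_loss(prompt_vec):
    """Hypothetical stand-in for an API call: returns a scalar loss, never gradients."""
    target = [0.5, -1.0, 2.0]  # unknown to the optimizer
    return sum((p - t) ** 2 for p, t in zip(prompt_vec, target))

def evolve_prompt(loss_fn, dim, generations=200, offspring=8, sigma=0.3, seed=0):
    rng = random.Random(seed)
    best = [0.0] * dim
    best_loss = loss_fn(best)
    for _ in range(generations):
        for _ in range(offspring):
            # Mutate the current best prompt with Gaussian noise.
            child = [x + rng.gauss(0.0, sigma) for x in best]
            child_loss = loss_fn(child)
            if child_loss < best_loss:  # keep a mutation only if it improves the loss
                best, best_loss = child, child_loss
    return best, best_loss

initial_loss = blackbox_loss([0.0, 0.0, 0.0])
tuned_prompt, tuned_loss = evolve_prompt(blackbox_loss, dim=3)
print(initial_loss, tuned_loss)
```

The optimizer only ever calls `loss_fn`, which is exactly the constraint a blackbox API imposes.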


Another relevant work in the space of optimizing interface-only models is Memory-assisted prompt editing to improve GPT-3 after deployment.

By Zhuang Liu et al.

❓ Why → The rapid rise of Deep Learning in the early 2010s can be largely attributed to AlexNet’s massive success in 2012’s ImageNet challenge. Since then and for many years, convolutions (the main building block of such NNs) single-handedly dominated the world of Computer Vision. However, with the introduction of Transformers and their convenient scalability, approaches applying them to CV, like the Swin Transformer⁴, have become increasingly popular, arguably threatening the crown that convolutions have held for so long.

💡 Key insights → Convolutions still rock.

This paper makes the case that ConvNets still have an edge over Transformers by optimizing them even further, resulting in a modern version of the popular ResNets that compares favorably to similar Transformer-based architectures. These changes include things like ditching BatchNorm in favor of LayerNorm, switching from ReLU to GELU, or varying the sizes of the convolution kernels, among others. And that’s pretty much it: their results and scaling laws on ImageNet are slightly above those of Transformer-based architectures. Well, probably until another paper comes out next week…

The battle of architectures continues, and if one thing is clear, it is that the field of AI will certainly benefit from the competition!


By Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, et al.

❓ Why → Image generation has been an eye-catching application of Deep Learning since the introduction of GANs in 2014. Lately, however, methods such as autoregressive generation with VQ-VAE (e.g. DALL·E) and Diffusion Models are becoming viable or even better alternatives.

💡 Key insights → In a hand-wavy nutshell, Diffusion Models generate images by starting from a pixel grid of pure noise and iteratively denoising it until it becomes a realistic-looking image. This paper proposes a diffusion-based approach for generating and editing images from text prompts, and the results are spookily good, beating OpenAI’s famous DALL·E. Still, these models have drawbacks, such as the computational cost of generating each image, that keep them from being widely used in many applications.


By Arvind Neelakantan, Tao Xu et al.

❓ Why → Neural Information Retrieval was late to the Deep Learning game and in some regards is still inferior to 20+ year old algorithms such as BM25! One of the key parts of the equation is the reliance on massive amounts of labeled data: all successful Neural Retrieval methods today rely heavily on labels such as those from the MS Marco dataset. Can these models be trained without supervision at all? In the past couple of months, there have been signs of hope!

💡 Key insights → This is a proposal from OpenAI to learn text representations in a fully self-supervised manner. These representations (i.e. embeddings) aim to be solid performers in a variety of tasks including Information Retrieval. The working principle is very simple: use neighboring text snippets as positive pseudo-query-document pairs, and use the other documents in the batch as negatives. With very large batch sizes, I must add. However, all that glitters is not gold: while the fully unsupervised performance is solid, you can achieve better performance with small models finetuned on just a few publicly available labels, at a ridiculously low fraction of the cost, as Nils Reimers, creator of SBERT, has shown.
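
The in-batch negatives objective can be sketched as a softmax cross-entropy over similarity scores. This is a generic contrastive loss written under toy assumptions (2-D embeddings, dot-product similarity, a made-up temperature), not OpenAI's actual training code.

```python
import math

def in_batch_contrastive_loss(queries, docs, temperature=0.1):
    """Average cross-entropy where docs[i] is the positive for queries[i]
    and every other docs[j] in the batch serves as a negative."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    total = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)

# Toy embeddings (assumed): each query points in the same direction as its own document.
toy_queries = [(1.0, 0.0), (0.0, 1.0)]
toy_docs = [(1.0, 0.0), (0.0, 1.0)]

loss_aligned = in_batch_contrastive_loss(toy_queries, toy_docs)
loss_swapped = in_batch_contrastive_loss(toy_queries, toy_docs[::-1])
print(loss_aligned, loss_swapped)
```

The loss is near zero when each query is closest to its own document and large when the pairs are swapped. Larger batches supply more negatives per query, which is one reason batch size matters so much here.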

In conclusion, it’s an important step for unsupervised Neural Information Retrieval and representation learning, but not an all-solving embeddings API as some headlines could suggest. This is yet another example of a model that’s only accessible via a paid API, and we expect such instances to become even more widespread.


By Samyam Rajbhandari et al.

❓ Why → In the past year, Mixtures of Experts (MoEs) have become the go-to strategy for scaling massive language models. The key concept is simple: route an input only through sub-paths within the model during inference, such that only a fraction of the model parameters are used at each step. The implementation details of such systems are still messy and involve serious tradeoffs with respect to dense models, such as inference speed.
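
The routing idea can be sketched with top-1 gating, where a small router picks a single expert per input. This is a toy illustration under stated assumptions: the two "experts" are hand-written functions, the router weights are made up, and none of this reflects the DeepSpeed API.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_layer(x, gate_weights, experts):
    """Route x to the single highest-scoring expert (top-1 gating)."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    k = max(range(len(experts)), key=lambda i: probs[i])
    # Only expert k runs; its output is scaled by the gate probability.
    return k, [probs[k] * y for y in experts[k](x)]

experts = [
    lambda x: [2.0 * v for v in x],  # hypothetical expert 0
    lambda x: [v + 1.0 for v in x],  # hypothetical expert 1
]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one row of router weights per expert (assumed)

chosen, output = moe_layer([3.0, 0.0], gate_weights, experts)
print(chosen, output)
```

Adding more experts grows the parameter count, but each input still pays for only one expert's computation.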

💡 Key insights → DeepSpeed-MoE (soon to be open-sourced on GitHub) is the latest version of the DeepSpeed library from Microsoft which aims to make distributed Deep Learning training easy and efficient, and it is the implementation backbone of this work.

The authors show how MoEs shine when compared to their dense counterparts: more efficient training — around 5-fold — and better parameter efficiency.

The paper also goes into much deeper detail about what design choices make MoEs learn well. For instance, is it better to have more experts in shallow layers or in deeper layers? To increase model capacity, should you increase the capacity of each expert or the number of experts? While there's no absolute answer to these questions yet, this paper empirically explores the tradeoffs behind these design choices and combines its findings into an architecture called PR-MoE (Pyramid Residual MoE). The basic structure of their PR-MoE is shown in the figure below, which includes a varying "experts width" along with residual MLP connections.

While MoEs are still not mainstream, if the complexity of implementation and design is solved, they have the potential to become a standard for the next generation of massive models.



Our monthly selection for February 2022 ends here; if you like this monthly overview of impactful AI trends, subscribe to our newsletter, follow us on Twitter @zetavector, and stay tuned for the next one! Or check out the selection of papers in the Zeta Alpha platform.



1. “Taming Transformers for High-Resolution Image Synthesis” by Patrick Esser, Robin Rombach, Björn Ommer, 2021

2. “Zero-Shot Text-to-Image Generation” by Aditya Ramesh et al., 2021

3. “KILT: a Benchmark for Knowledge Intensive Language Tasks” by Fabio Petroni et al., 2020

4. “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows” by Ze Liu et al., 2021
