
Best of arXiv — Zeta Alpha's picks for April 2021: GPT strikes back, Video Transformers and more.

Updated: Apr 6, 2021

A monthly selection of ML papers.

Staying on top of your reading list is hard, and finding which papers should be on that list can be even harder. At Zeta Alpha we’re always keeping a close eye on the latest ML research, so we’re sharing a monthly selection of recent papers to surface what we believe will be impactful publications, mostly based on each work’s contributions and the authors’ influence. Don’t take this list as comprehensive: we have our biases like everyone else, but hey, there’s only so much you can choose out of 4000+ papers. Enjoy! And join us on Friday 9th of April for our monthly webinar where we will be discussing these picks.

By Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang.

🎖Why → This paper serves as a map of the 3 main existing language pre-training approaches: autoregressive (e.g. GPTs, which excel at text generation), Masked Language Modeling (a.k.a. fill-in-the-blank like BERT, which excel at NLU classification tasks) and seq2seq (for encoder-decoder models like T5, which excel at conditional text generation like translation or summarization). These 3 techniques present their own strengths and weaknesses, so wouldn’t it be nice if we could get the best of all worlds? Here’s an attempt.

💡Key insights → The main uses of the 3 main language pretraining approaches can be summarized in the table below. As a reminder, NLU refers to the classification tasks in benchmarks such as SuperGLUE¹ (sentiment analysis, Natural Language Inference, etc.), Conditional Generation covers text generation tasks where there is a specific relationship between the input and the output sequences (like translating or summarizing text), and Unconditional Generation is the task of freely generating text.

Where ✓= good at; - =can be adapted to; ✕=cannot be directly applied to. Source:

The authors propose a unifying pretraining technique they call General Language Model (GLM), which is precisely summarized in the figure description.


The motivation behind this splitting into parts A and B is to force the same model to learn both a bidirectional encoder (A) and a unidirectional decoder (B). One of the differences from previous span-based models such as SpanBERT² is that the length of the span is now unknown to the model. This technique requires some tricks and details to work, such as the positional encoding, which are specified in the paper.
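
The encoder/decoder split comes down to an attention mask. Here is a minimal sketch, assuming the sequence is laid out as Part A followed by Part B; the function name and layout are illustrative and not taken from the paper's code.

```python
def glm_attention_mask(len_a, len_b):
    """mask[i][j] is True when position i may attend to position j.

    Assumed layout (illustrative): the first len_a positions are Part A,
    the remaining len_b positions are Part B.
    """
    n = len_a + len_b
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j < len_a:
                # Part A is visible to every position: bidirectional context.
                mask[i][j] = True
            elif i >= len_a and j <= i:
                # Part B positions see themselves and earlier Part B positions only.
                mask[i][j] = True
    return mask

for row in glm_attention_mask(3, 2):
    print("".join("x" if allowed else "." for allowed in row))
```

Part A rows never see Part B, which is what makes A behave like an encoder, while Part B rows are causal among themselves, like a decoder.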

When it comes to the results, the comparison with RoBERTa³ is probably one of the most interesting ones: the same model with this new pre-training approach outperforms the original implementation. In some cases it’s still better to mix the original MLM training objective with GLM, which suggests that GLM is not universally superior. For seq2seq evaluation, they perform abstractive summarization, where GLM performs well compared to models of similar size.

By Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, Jie Tang.

🎖Why → I would confidently place this paper as my top pick from the past month. The idea is brilliant and simple, the results seem nothing short of amazing and the paper is very clear and full of insightful bits. It challenges something that the previous paper (from the same research group) presented as a given: that autoregressive pretraining is no good for NLU. Well, hold on to your papers! And keep reading… The technique they propose, p-tuning, has the potential to become a standard technique for few-shot learning and for finetuning huge LMs for which conventional finetuning doesn’t work very well or is too costly.

💡Key insights → In May 2020, GPT-3 surprised even the most skeptical by showing how a simple generative pre-training scaled up to hundreds of billions of parameters could show impressive zero and few-shot performance, simply by “prompting” the model with natural language describing a task and/or giving it some examples. This inspired some works that delved further into the art of “prompting”, such as PET⁴. Some even proposed techniques, like AutoPrompt¹¹, for automatically finding good prompts for models to solve tasks without updating any model parameters.

In this work, the authors have the brilliant idea to stop constraining prompts to be actual words from the fixed vocabulary. Instead, they learn a fixed number of continuous embeddings that can be optimized through gradient descent, and they call it p-tuning. This means that all original model parameters can remain frozen, and only the prompt embeddings are updated. It’s fun to think about this as some sort of differentiable programming 2.0, where you learn to explain to a frozen pre-trained model what to do.
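
To make the "frozen model, trainable prompt" idea concrete, here is a toy sketch. Everything in it is an assumption for illustration: the "pre-trained model" is just a fixed linear readout over the mean input embedding, and the task is to shift its output by +1. It is not the paper's architecture, only the same optimization pattern.

```python
import random

random.seed(0)
DIM, PROMPT_LEN = 4, 2

# Frozen stand-in for a pre-trained LM: a fixed linear readout over the mean
# of all input embeddings. These weights are never updated.
FROZEN_W = [0.5, -1.0, 0.25, 2.0]

def frozen_model(embeddings):
    mean = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    return sum(w * m for w, m in zip(FROZEN_W, mean))

def p_tune(inputs, targets, steps=200, lr=0.1):
    # The only trainable parameters: PROMPT_LEN continuous prompt embeddings.
    prompt = [[random.uniform(-0.1, 0.1) for _ in range(DIM)]
              for _ in range(PROMPT_LEN)]
    for _ in range(steps):
        for x, y in zip(inputs, targets):
            seq = prompt + [x]
            err = frozen_model(seq) - y
            # Gradient of err**2 w.r.t. prompt[k][d] is 2*err*FROZEN_W[d]/len(seq).
            for vec in prompt:
                for d in range(DIM):
                    vec[d] -= lr * 2 * err * FROZEN_W[d] / len(seq)
    return prompt

inputs = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
# Toy task: make the frozen model output its zero-prompt answer plus 1.
targets = [frozen_model([[0.0] * DIM] * PROMPT_LEN + [x]) + 1.0 for x in inputs]

prompt = p_tune(inputs, targets)
print([round(frozen_model(prompt + [x]) - y, 6) for x, y in zip(inputs, targets)])
```

Gradients flow through the frozen model into the prompt vectors only, so the model's behaviour changes without touching any of its weights.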


The results are most interesting in the comparison between finetuning, p-tuning and manual prompts, especially for knowledge probing (extracting factoids from a frozen pre-trained model), where p-tuning performs far better than the alternatives. On the SuperGLUE benchmark, it doesn’t come close to other SOTA (though that wouldn’t be a fair comparison given the great model size differences there), but p-tuning shows very strong performance when compared to standard fine-tuning or manual prompting.


By Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch.

🎖Why → Although in my opinion the central claim of the paper remains undecided (the devil’s in the details), the idea of understanding pre-trained transformers as “computation engines” (something that can compute anything given the appropriate instructions) is fascinating. In my opinion it ties in well with “GPT Understands, Too”, where the model’s input for a task is learned instead of the model parameters.

💡Key insights → The authors explore how Transformers perform on a variety of unusual tasks, especially computational tasks: Bit memory (repeating a corrupted string of bits), Bit XOR (computing an element-wise XOR of two bitstrings, something that has been historically hard for NNs to perform), ListOps (given a sequence of operations, predicting the resulting digit), MNIST (handwritten digits dataset) and CIFAR-10 (image classification benchmark).

More interestingly, they claim that there’s something about pre-training on language that makes the model learn something universal about all these other tasks (which are a priori unrelated to language). To investigate this hypothesis, they take a transformer pre-trained on language and freeze all its weights except for the layer normalization and the input and positional embeddings, calling it Frozen Pretrained Transformer (FPT).
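
In practice, this kind of freezing amounts to filtering parameters by name. A small sketch follows, with a made-up naming scheme (real checkpoints name their parameters differently) and only the parameter groups mentioned above marked as trainable.

```python
def is_trainable(param_name):
    """Hypothetical naming scheme, for illustration only."""
    trainable_markers = ("layer_norm", "input_embedding", "positional_embedding")
    return any(marker in param_name for marker in trainable_markers)

# Hypothetical parameter names for a one-block transformer.
param_names = [
    "input_embedding.weight",
    "positional_embedding.weight",
    "block0.attention.qkv.weight",
    "block0.layer_norm1.scale",
    "block0.mlp.fc1.weight",
    "block0.layer_norm2.scale",
]

trainable = [name for name in param_names if is_trainable(name)]
frozen = [name for name in param_names if not is_trainable(name)]
print("trainable:", trainable)
print("frozen:   ", frozen)
```

The self-attention and feed-forward weights, which hold the bulk of the parameters, end up in the frozen list.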


The devil is in the details because allowing finetuning in the normalization layers still affects how self-attention behaves in subsequent layers, implicitly optimizing them despite self-attention being frozen (see how it affects performance in Table 11). Moreover, a randomly initialized Transformer already performs very well on many of the tasks by only fine-tuning embeddings, output and layer norm parameters.

Regardless, while the central claim of the paper about Transformers being Universal Computation Engines remains in dispute, the paper is full of ablation studies — like the table below — that are novel and provide insightful results to understand what they’re good at and what they’re not.


By Charlie Nash, Jacob Menick, Sander Dieleman and Peter W. Battaglia

🎖Why → A reminder that classic, well-known techniques such as Discrete Cosine Transform (DCT) image processing for compression can enhance an ML task such as image generation.

💡Key insights → Partly inspired by the success of recent likelihood-based image generative models such as DALL·E⁵ from OpenAI or VQ-VAE⁶, this paper explores the use of sparse representations for the task. One of the advantages of likelihood-based generative models, in contrast with GANs, is that they are more stable to train and also don’t run the risk of falling into modes that don’t cover the whole space of image distributions. The motivation for using sparse representations is that they’re easy to compress (there’s a lot of 0s!), and it’s interesting to study how well neural networks can do in this representation space, in contrast to the common grid-like structure of images.

I personally didn’t know about the DCT transformation used in JPEG compression and it’s really cool. In a hand-wavy way, you can split an image into blocks of a few pixels (e.g. 8x8) and then fit all pixel values into a cosine-based function in 2D with 8x8=64 degrees of freedom. This expresses an image patch as a superposition of 64 “frequency” functions weighted by 64 coefficients. Most of these coefficients can simply be removed without affecting the perceived image quality (we humans don’t see a lot of small high frequency information), and this results in a sparse representation that is easy to compress (which is why it’s used in JPEG file compression). After watching this excellent introduction video to DCT, the paper will make a lot more sense.
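
For the curious, the block transform itself fits in a few lines. This is a minimal sketch of an orthonormal 2D DCT-II on one 8x8 block followed by naive thresholding; the threshold value is an arbitrary choice for illustration and is not JPEG's or the paper's quantization scheme.

```python
import math

N = 8

def dct_2d(block):
    """Orthonormal 2D DCT-II of an N x N block (direct O(N^4) evaluation)."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs

def sparsify(coeffs, threshold):
    """Zero out coefficients whose magnitude is below the threshold."""
    return [[c if abs(c) >= threshold else 0.0 for c in row] for row in coeffs]

# A smooth horizontal gradient: its energy sits in a few low frequencies.
block = [[float(y) for y in range(N)] for _ in range(N)]
sparse = sparsify(dct_2d(block), threshold=1.0)
nonzero = sum(1 for row in sparse for c in row if c != 0.0)
print(nonzero, "of", N * N, "coefficients kept")
```

Smooth blocks like this one need only a handful of the 64 coefficients, which is where the sparsity comes from.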

The image representation that this paper constructs consists of a list of all the non-zero sparse coefficients from this DCT transform (after some special quantization tricks, but you get the gist of it), along with their channel and position information. The model is trained to predict these tuples autoregressively, considering their values categorical, maximizing the likelihood in a self-supervised way.
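
As a rough picture of that list, here is a sketch that turns quantized coefficients into (channel, position, value) tuples. The dictionary layout and the ordering (by channel, then position) are assumptions for illustration; the paper's actual ordering and quantization differ in detail.

```python
def to_sparse_tuples(channels):
    """Flatten quantized coefficient arrays into (channel, position, value) tuples.

    `channels` maps a channel index to a flat list of quantized integer
    coefficients. Zeros are dropped, which is what makes the list sparse.
    """
    return [
        (channel, position, value)
        for channel, coeffs in sorted(channels.items())
        for position, value in enumerate(coeffs)
        if value != 0
    ]

# Hypothetical quantized coefficients for two channels.
quantized = {0: [12, 0, 0, -3, 0, 0, 0, 1], 1: [5, 0, 0, 0, 0, 0, 0, 0]}
print(to_sparse_tuples(quantized))
# [(0, 0, 12), (0, 3, -3), (0, 7, 1), (1, 0, 5)]
```

An autoregressive model then predicts this sequence one tuple element at a time, treating each element as a categorical variable.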


When it comes to results, they’re very good in general: comparable to, if not surpassing, SOTA (except in class-conditional generation, in which BigGAN still rules). Nevertheless, let’s not forget that these metrics are only a proxy for human-judged quality, so check the results with your own eyes!


By Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučic, and Cordelia Schmid.

🎖Why → Yet another task that transformers conquer (this could be a section of its own). Given enough parameters and data (along with proper augmentations) it seems like there’s no task Transformers can’t crack.

💡Key insights → This paper builds on top of existing Vision Transformers (ViT) such as “An image is worth 16x16 words”⁷ and experiments with different strategies to represent the spatial and the temporal dimension at the same time. One of the most interesting aspects of this work is the overview of different strategies to tokenize video and apply transformer layers to the resulting tokens. First, on tokenizing video they explain Uniform frame sampling vs. Tubelet embedding, see figures below.


Secondly, on calculating attention across space and time, they present 4 alternatives: spatio-temporal attention (everything attends to everything), factorized encoder (a spatial transformer first, then a temporal one), factorized self-attention (each transformer block applies spatial and then temporal self-attention), and factorized dot-product attention (one self-attention with spatial heads and temporal heads that are later concatenated).
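
A back-of-the-envelope count shows why factorizing is attractive. The sketch below counts query-key pairs per layer for the first and third variants; the 16 frames and 196 patches per frame are made-up sizes, not the paper's configuration.

```python
def attention_pairs(n_frames, n_patches):
    """Count query-key pairs per layer for joint vs. factorized self-attention."""
    tokens = n_frames * n_patches
    # Spatio-temporal attention: every token attends to every token.
    joint = tokens ** 2
    # Factorized self-attention: attend within each frame, then across frames
    # at each spatial location.
    spatial = n_frames * n_patches ** 2
    temporal = n_patches * n_frames ** 2
    return joint, spatial + temporal

joint, factorized = attention_pairs(n_frames=16, n_patches=196)
print(joint, factorized, round(joint / factorized, 1))
```

With these sizes the joint variant computes roughly 15 times more attention scores than the factorized one.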

Their ablations show that the 4 different models are not that different if trained well, and that leveraging transformers pre-trained on image datasets helps a lot. In fact, they don’t really disclose in detail how they pre-train their models, just that they do it on ImageNet or JFT datasets. They achieve state-of-the-art performance on multiple video classification benchmarks including Kinetics 400 and 600, Epic Kitchens, Something-Something v2 and Moments in Time. Augmentations and tricks such as label smoothing, mixup and stochastic depth are still key to achieving this performance, as their ablations show.

By Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals and Joao Carreira.

🎖Why → Modeling data making the fewest possible assumptions about it is interesting because it has the potential to transfer well to many domains. In this case, the Perceiver is an architecture that focuses on scalability (doing away with the nasty N² scaling of self attention) and making minimal assumptions on the structure of the data.

💡Key insights → The Perceiver architecture repeats a block made of two parts:

  • A cross-attention step between a latent representation (of size NxD, length by embedding size) and the raw representation of the data (of size MxC, length by channels). This makes the cross attention have a complexity of NxM instead of MxM, which is substantial when N<<M.

  • A transformer layer that maps a latent representation to another latent representation of the same shape (see figure below).
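
The cross-attention step can be sketched as follows. This is a simplification for illustration: it has no learned query/key/value projections, so latents and data share the same width, and the sizes N=2 and M=50 are arbitrary.

```python
import math

def cross_attention(latents, data):
    """Each of the N latent vectors attends over all M data vectors.

    Returns the updated latents (still N vectors) and the number of
    attention scores computed, which is N * M.
    """
    scale = math.sqrt(len(latents[0]))
    out, n_scores = [], 0
    for q in latents:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in data]
        n_scores += len(scores)
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, data)) / total
                    for d in range(len(q))])
    return out, n_scores

latents = [[0.1 * (i + d) for d in range(4)] for i in range(2)]   # N = 2
data = [[math.sin(j + d) for d in range(4)] for j in range(50)]   # M = 50
updated, n_scores = cross_attention(latents, data)
print(len(updated), "latents,", n_scores, "scores instead of", len(data) ** 2)
```

The output keeps the latent shape, so the transformer layer that follows only ever works on N vectors, regardless of how large the input is.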


This can be thought of as repeatedly downscaling the raw representation to a latent representation. Given that in this implementation the blocks share their weights, it can be considered an unrolled RNN. In fact, the appendix compares weight sharing with no weight sharing: the former reaches better performance because, unlike the latter, it doesn’t overfit. Weight sharing results in a 44M parameter model.

The authors run experiments for a variety of modalities: images, raw audio, video, raw audio + video and point clouds. Although the results section is not very comprehensive, performance is on par with or better than existing models, especially when compared to existing multimodal models (for instance, 85.7% on ImageNet top-1). The results are quite impressive, but we must not forget the fine print: while the architecture remains the same for all modalities, some modality-specific augmentations and positional embeddings are needed to achieve it (cropping, special positional encodings, etc.).

Top-1 ImageNet performance. Approaches in red leverage domain specific "image grid structure" while results in blue do not. Source:

By Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas.

🎖Why → A theoretical paper once in a while won’t kill us, and sometimes they provide valuable insights beyond being scary with hardcore Math all over the place. This is one such case: why are those skip connections so important?

💡Key insights → Thank God we have those skip connections. Attention and skip connections are all you need*.


Okay, let’s expand this a bit. You’ve probably heard that skip (or residual) connections help propagate gradients through deeper networks, stabilizing training. Well, this paper provides a theoretical foundation for why they are so important in transformers: without them, the self-attention output provably degenerates very fast (doubly exponentially) with depth, meaning that it turns into a rank-1 matrix that kills the information that flows through it (i.e. imagine a sequence of embeddings where all are a multiple of each other).
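
You can watch a version of this collapse happen in a toy setting. The sketch below stacks softmax self-attention layers with identity projections, no MLP and no skip connections, and tracks how far apart the token embeddings are. It only illustrates the mechanism under these simplifying assumptions; it does not reproduce the paper's bound.

```python
import math
import random

def self_attention(x):
    """Softmax self-attention with identity projections and no skip connection."""
    scale = math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in x]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, x)) / total
                    for d in range(len(q))])
    return out

def spread(x):
    """Sum over features of (max - min) across tokens; 0 means identical rows."""
    return sum(max(col) - min(col) for col in zip(*x))

random.seed(0)
x = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
spreads = [spread(x)]
for _ in range(8):
    x = self_attention(x)
    spreads.append(spread(x))
print(["%.2e" % s for s in spreads])
```

Each layer replaces every token with a weighted average of all tokens, so the rows can only move closer together; after a few layers they are nearly identical and the sequence carries almost no information.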

The main insight from this paper is more a confirmation of an existing suspicion than a surprise. Some works, such as “Linformer”⁸, have shown empirically that the attention matrix can be decomposed into much lower-rank matrices while minimally affecting performance.

By Sander Dieleman, Charlie Nash, Jesse Engel, and Karen Simonyan.

🎖Why → The idea of variable-rate representations fascinates me. Intuitively, when listening to and understanding spoken language, information is not distributed evenly over time, so why should our representations be? This presents many challenges, but it’s great to see research tackling this problem.

💡Key insights → This work builds event-based representations using an encoder-decoder architecture with quantization over time, trained by maximizing the log-likelihood of the decoder output conditioned on the quantized latent representation. A “slowness penalty” incentivizes the latent representation to keep the same value as in the previous time step; this penalty is motivated by the idea of imposing a capacity bottleneck explicitly. Another trick they use is Schmitt Trigger Quantization (STQ): due to noise, the quantized values might jump around too much, so STQ adds memory to the quantization, which only jumps to a new level if a variable has changed by more than a certain amount.
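
The hysteresis idea behind STQ is easy to sketch. The version below is an illustration under assumptions: it quantizes to integers and uses a margin of 0.75, which is an arbitrary value chosen for the example rather than the paper's setting.

```python
def schmitt_trigger_quantize(signal, margin=0.75):
    """Round to integers, but only switch level once the input has moved more
    than `margin` away from the currently held level. A margin above 0.5
    gives hysteresis, so small wobbles around a rounding boundary are ignored.
    """
    held = round(signal[0])
    out = []
    for value in signal:
        if abs(value - held) > margin:
            held = round(value)
        out.append(held)
    return out

noisy = [0.1, 0.45, 0.55, 0.4, 0.6, 1.3, 1.4, 0.6, 0.45, 0.1]
print([round(v) for v in noisy])        # [0, 0, 1, 0, 1, 1, 1, 1, 0, 0]
print(schmitt_trigger_quantize(noisy))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
```

Plain rounding flips back and forth while the signal hovers around 0.5, whereas the hysteresis version changes level only twice.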

Given this set-up, the intuition is that the quantized latent representation should only change when there’s an event. For instance, if there are 2 seconds of silence, the representation should probably remain the same for that time, but if someone is speaking, the Average Event Rate (AER, changes in the latent representation) should be higher. The NN parametrizing the encoder and decoder is (drumroll) a Transformer, and more tricks detailed in the paper are needed to make this work.
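
One way to picture an event-based representation is as a run-length encoding of the quantized latents. In this sketch, counting every run as one event and the rate of 4 steps per second are both assumptions for illustration.

```python
from itertools import groupby

def to_events(latents):
    """Run-length encode a quantized latent sequence into (value, length) events."""
    return [(value, len(list(run))) for value, run in groupby(latents)]

def average_event_rate(latents, steps_per_second):
    """Events per second, counting every run of equal values as one event."""
    duration = len(latents) / steps_per_second
    return len(to_events(latents)) / duration

silence = [0] * 8
speech = [0, 0, 2, 2, 2, -1, -1, 3]
print(to_events(speech))               # [(0, 2), (2, 3), (-1, 2), (3, 1)]
print(average_event_rate(silence, 4))  # 0.5
print(average_event_rate(speech, 4))   # 2.0
```

The silent segment collapses into a single long event, while the speech-like segment needs several short ones.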

Regarding the results, the most interesting part is the ablations with respect to all the hyperparameters, such as slowness penalty, AER, quantization levels, etc. The comparison with existing work is not as extensive as one would hope, but this is mainly because automatic evaluation of spoken language modeling is not very reliable.

By Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny.

🎖Why → It’s a new self-supervised loss! It’s quite straightforward, is comparable to other SOTA representation learning techniques (SimCLR, BYOL) and presents a couple of intriguing properties that make it interesting to study further…

💡Key insights → The way I like to conceptualize it is doing some sort of contrastive learning on a “per-feature” basis (but don’t take the analogy too far cause it’s not correct). You have 2 views of an image (let’s say two crops), you maximize the correlation of each feature and at the same time minimize the correlation of the rest of the features with each other. You can also think of this as doing an outer product of the two representations (estimating the cross-correlation of the two representations), sum and normalize across the batch, and make that be as close as possible to an identity matrix.
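
Written out, the objective is short. This is a sketch in plain Python on tiny hand-made embeddings; the off-diagonal weight `lam` is a hyperparameter whose value here is an assumption, and a real implementation would use a tensor library and also guard against constant features (zero standard deviation).

```python
import math

def standardize(z):
    """Zero-mean, unit-variance per feature, computed along the batch."""
    cols = []
    for col in zip(*z):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        cols.append([(v - mean) / std for v in col])
    return [list(row) for row in zip(*cols)]

def barlow_twins_loss(z_a, z_b, lam=0.005):
    """z_a, z_b: batch x features embeddings of two views of the same images."""
    a, b = standardize(z_a), standardize(z_b)
    n, d = len(a), len(a[0])
    # Cross-correlation matrix: batch-averaged outer product of the two views.
    c = [[sum(a[k][i] * b[k][j] for k in range(n)) / n for j in range(d)]
         for i in range(d)]
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lam * off_diag

decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
redundant = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-1.0, -1.0]]
print(barlow_twins_loss(decorrelated, decorrelated))
print(barlow_twins_loss(redundant, redundant))
```

Both examples use identical views, so the diagonal term vanishes; only the second one is penalized, because its two features carry the same information.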

The theoretical justification for this objective dates back to the neuroscientist H. Barlow in 1961⁹, who hypothesized that the goal of processing sensory information is to recode it into a factorial code, which means a representation with statistically independent components. The Barlow Twins loss function is inspired by this idea, as it incentivizes representations to be only correlated for each component instead of globally.

The results are comparable to existing representation learning techniques like BYOL and SimCLR, but it has a couple of interesting properties. First, it seems robust to smaller batch sizes unlike BYOL (small = 256, 512); it actually degrades for large batch sizes (2048, 4096)! We asked the authors about it and they told us that they’re puzzled too. Secondly, the representation dimensionality doesn’t seem to saturate, it keeps improving downstream performance unlike the compared methods.


By Geoffrey Hinton.

🎖Why → One of the founding fathers of Deep Learning places his bets on what the key challenges for Computer Vision are and how they could be solved, without presenting a working system (yet?).

💡Key insights → The first point the paper makes is that humans parse visual scenes into part-whole hierarchies and model a viewpoint-invariant spatial relationship between elements. In other words, we represent parts of an image as a hierarchy of what stuff belongs to what stuff (or subpart of an object and so on), and that these are viewpoint-invariant (we model the pencil and paper as still being the same when we move around). This work seems like a natural extension of his Capsule Networks¹⁰ idea, which also tried to capture explicitly different levels of representation.

The main problem here, according to Hinton, is that current end-to-end Neural Networks don’t allow us to dynamically construct these parse trees and dynamically allocate groups of neurons to represent nodes in them. The solution that he envisions, GLOM, is best understood as processing a stream of images (or a video). It consists of iteratively representing patches of an image in columns of vectors that represent different levels of visual structure (e.g. ~5 vectors per patch). At each time step, these columns are updated with different contributions: a bottom-up prediction (L-1 to L), a top-down prediction (L+1 to L), the same level prediction, and an attention weighted average of embeddings in the neighbourhood of the patch. Ideally, training this would yield islands of identical vectors at different levels, corresponding to a parse tree of an image that represents its part-whole hierarchy.
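
Since the paper describes no working system, the following is only a speculative sketch of what one update step could look like. The equal-weight average of the contributions, attention over all patches rather than a neighbourhood, and the identity functions standing in for the learned bottom-up and top-down networks are all assumptions made for illustration.

```python
import math

def mean_vectors(vectors):
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

def attention_average(columns, p, level):
    """Softmax-weighted average of same-level embeddings across all patches."""
    query = columns[p][level]
    keys = [col[level] for col in columns]
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * k[d] for w, k in zip(weights, keys)) / total
            for d in range(len(query))]

def glom_step(columns, bottom_up, top_down):
    """columns[p][l] is the embedding of patch p at level l."""
    n_levels = len(columns[0])
    new_columns = []
    for p, column in enumerate(columns):
        new_column = []
        for level in range(n_levels):
            contributions = [column[level]]  # same-level embedding from the last step
            if level > 0:
                contributions.append(bottom_up(column[level - 1]))
            if level < n_levels - 1:
                contributions.append(top_down(column[level + 1]))
            contributions.append(attention_average(columns, p, level))
            new_column.append(mean_vectors(contributions))
        new_columns.append(new_column)
    return new_columns

identity = lambda vector: list(vector)  # placeholder for a learned network
columns = [[[float(p), 1.0], [0.5, float(p)], [1.0, 1.0]] for p in range(4)]
updated = glom_step(columns, identity, identity)
print(updated[0])
```

The attention term pulls similar same-level vectors toward each other, which is the mechanism meant to produce the islands of identical vectors.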


The paper proceeds to motivate this with insights from biology, mathematics, and neural networks; as well as describing many considerations about how and why this system would work, which are too long to summarize here.

Finally, despite this paper not describing a working system, some people have already jumped to implement it, so check it out!


Our packed monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector. I’m already looking forward to sharing the next selection for May; see you soon!



[1] “SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems” by A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh et al. 2019.

[2] “SpanBERT: Improving Pre-training by Representing and Predicting Spans” by Mandar Joshi, Danqi Chen et al. 2019.

[3] “RoBERTa: A Robustly Optimized BERT Pretraining Approach” by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du et al. 2019.

[5] “Zero-Shot Text-to-Image Generation” by Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever et al. 2021.

[6] “Neural Discrete Representation Learning” by Aaron van den Oord, Oriol Vinyals, Koray Kavukcuoglu et al. 2017.

[7] “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” by Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai et al. 2020.

[8] “Linformer: Self-Attention with Linear Complexity” by Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma, 2020.

[10] “Dynamic Routing Between Capsules” by Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton, 2017.

[11] “AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts” by Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh, 2020.

