
Best of arXiv — June 2021

A monthly selection of recent ML papers: MLPs comeback, token-free Language Models, SSL on Vision Transformers and more.

The past month of ML research has brought surprising results, such as the revival of MLPs as a competitive architecture for Computer Vision and the questioning of Batch Normalization as an all-good, innocuous layer. Transformers are also (of course) on the plate: for self-supervised learning on vision, as well as for sentence representation techniques and character-level language modelling.

This is a monthly selection of recent ML research literature, backed by Zeta Alpha, where we’re always keeping a close eye on the latest papers. Enjoy!

MLP-Mixer: An all-MLP Architecture for Vision

By Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer et al.

❓Why → Very simple MLP-based architectures suddenly work way better than they should. This has interesting implications and advances our knowledge of what makes Deep Learning work.

💡Key insights → You can probably solve ML by just scaling up. Well, okay, this is an oversimplification and an exaggeration, but Rich Sutton’s Bitter Lesson seems to be aging better than fine wine so far.

This paper builds an architecture that uses neither attention nor convolutions, and which works as follows:

  • Divide an image into patches and flatten into vectors and stack them.

  • Transpose the stack and apply a 2-layer MLP across each feature for all patches.

  • Transpose back to the original shape and apply a 2-layer MLP across all features in each patch.

  • Repeat the previous block N times.
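The steps above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the paper's implementation: weights are random, the image is a tiny single-channel grid, and the helper names (`patchify`, `mixer_block`, `rand_matrix`) are made up for this example. The real architecture also uses residual connections and layer normalization, which are left out here to mirror the description above.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def gelu(x):
    # tanh approximation of the GELU non-linearity
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def mlp(x, w1, w2):
    """2-layer MLP (no biases) applied independently to every row of x."""
    hidden = [[gelu(v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def patchify(image, patch):
    """Split a 2D image into non-overlapping patches and flatten each into a vector."""
    height, width = len(image), len(image[0])
    return [
        [image[r][c] for r in range(top, top + patch) for c in range(left, left + patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def mixer_block(x, p):
    """x holds one row per patch and one column per feature."""
    # Mix across patches: after the transpose, each row is one feature over all patches.
    x = transpose(mlp(transpose(x), p["patch_w1"], p["patch_w2"]))
    # Mix across features: each row is one patch again.
    return mlp(x, p["feat_w1"], p["feat_w2"])


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
image = [[float(r * 6 + c) for c in range(6)] for r in range(4)]  # toy 4x6 image
x = patchify(image, patch=2)  # 6 patches with 4 features each
num_patches, num_features, hidden = len(x), len(x[0]), 8

for _ in range(3):  # "repeat the previous block N times", here N = 3
    params = {
        "patch_w1": rand_matrix(num_patches, hidden, rng),
        "patch_w2": rand_matrix(hidden, num_patches, rng),
        "feat_w1": rand_matrix(num_features, hidden, rng),
        "feat_w2": rand_matrix(hidden, num_features, rng),
    }
    x = mixer_block(x, params)

print(len(x), "patches x", len(x[0]), "features")
```

Note that the patch-mixing weights have a shape tied to the number of patches, so a model like this is bound to a fixed input resolution unless something else is done about it.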

The results are not quite SOTA, but are good enough to raise some eyebrows and spark discussion. While CNNs are not going anywhere just yet, this reinforces the idea that architectural choices are not necessary to introduce desired inductive biases — such as translation invariance/equivariance — and these can be learned from only the data, with techniques like augmentation, although less efficiently.


ByT5: Towards a token-free future with pre-trained byte-to-byte models

By Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou et al.

❓Why → Most NLP Language Models (LMs) operate on a fixed vocabulary of a few tens of thousands of word or subword tokens. While this method works great most of the time, it still struggles with things like typos, variations in capitalisation or morphological changes. Token-free LMs are a promising direction to solve these problems.

💡Key insights → The main goal of this paper is to show how an existing token-based language model (mT5⁹) can be adapted to operate at the byte level with minimal modifications and without sacrificing performance. In fact, the authors barely change the backbone T5 architecture: they simply do away with the original SentencePiece tokenizer and feed text as a raw sequence of UTF-8 bytes, which is equivalent to a “vocabulary” size of 256. The pre-training objective (originally masking spans of ~3 tokens and predicting them) is adjusted to an average masking span length of ~20 bytes. Finally, the authors also find that ByT5 benefits from an uneven parameter distribution between encoder and decoder: having a “heavy” encoder at the expense of a lighter decoder is beneficial, unlike in the regular T5.
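To make the “vocabulary of 256” concrete, here is a minimal sketch of what feeding raw UTF-8 bytes looks like. The function names are hypothetical and this is not the model's actual preprocessing code; a real model would also need a few extra ids for special tokens such as padding, which this sketch ignores.

```python
def to_byte_ids(text):
    """Encode text as a list of UTF-8 byte values, each in range(256)."""
    return list(text.encode("utf-8"))


def from_byte_ids(ids):
    """Decode byte values back into text; invalid sequences become U+FFFD."""
    return bytes(ids).decode("utf-8", errors="replace")


# A typo or a capitalisation change alters a single id instead of
# producing a completely different subword split.
for word in ["language", "Language", "langauge", "naïve"]:
    ids = to_byte_ids(word)
    print(word, len(ids), ids)
```

The flip side is visible in the last example: non-ASCII characters take more than one byte, and in general sequences get several times longer than with subword tokens, which is where the efficiency cost comes from.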


The most surprising aspect of this paper is how few adaptations ByT5 needs to match the performance of its original incarnation. Moreover, ByT5 presents several advantages, such as being significantly more robust to noise and performing better on tasks where spelling and pronunciation are salient. This paints a promising future for adapting existing Transformers to byte-level sequence inputs, although caveats apply: training and inference efficiency are still noticeably worse.

Emerging Properties in Self-Supervised Vision Transformers

By Mathilde Caron et al.

❓Why → Full-fledged self-supervised Vision Transformers were a matter of time, and here they are. While this is only a first step towards what’s possible with Vision Transformers and SSL, the technique will only gain relevance in the coming months.

💡Key insights → Vision Transformers can be trained with a method in the style of Bootstrap Your Own Latent (BYOL), where different views of an image are encoded by a teacher and a student network, and the similarity of their outputs is maximized. The approach avoids collapse (i.e. outputting the same embedding for all images) by having one network be an exponential moving average of the other.
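The “one network is an exponential average of the other” part can be sketched as follows. This is a minimal illustration that treats each network as a flat list of weights; the function name and the momentum value are assumptions for the example, not taken from the paper.

```python
def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step towards the matching student weight."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher = [0.0, 0.0]
student = [1.0, -2.0]  # in practice the student is updated by gradient descent
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
print(teacher)
```

The teacher receives no gradients of its own: it only trails the student, which is what makes its outputs a slowly moving target.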

Unlike BYOL, however, instead of an inner-product similarity they use cross-entropy on a mock classifier as the similarity measure between outputs, which allows them to set a “temperature” regularization in the softmax that helps with training stability.
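A rough sketch of such a loss, assuming both networks output a vector of logits over the same mock classes. The function names and temperature values below are illustrative assumptions, not a reproduction of the paper's code.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    top = max(scaled)
    log_total = top + math.log(sum(math.exp(v - top) for v in scaled))
    return [v - log_total for v in scaled]


def softmax(logits, temperature=1.0):
    return [math.exp(v) for v in log_softmax(logits, temperature)]


def distillation_loss(teacher_logits, student_logits, teacher_temp=0.04, student_temp=0.1):
    """Cross-entropy between the teacher's and the student's softmax outputs."""
    p_teacher = softmax(teacher_logits, teacher_temp)
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


agree = distillation_loss([2.0, 1.0, 0.5], [2.0, 1.0, 0.5])
disagree = distillation_loss([2.0, 1.0, 0.5], [0.5, 1.0, 2.0])
print(agree, disagree)
```

A lower temperature sharpens the distribution, so a low teacher temperature makes the target close to a one-hot label over the mock classes.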

The most eye-catching result from this approach is that, without any supervision, the attention maps of deep layers in the transformer form a surprisingly good object segmentation. k-NN retrieval on the learned representations also shows promising results for applying the technique to image retrieval.

SimCSE: Simple Contrastive Learning of Sentence Embeddings

By Tianyu Gao, Xingcheng Yao and Danqi Chen

❓Why → Impressive improvements on unsupervised sentence representation learning with a surprisingly simple method, which has the potential to be used in other domains.

💡Key insights → In Computer Vision, generating two views of an input is as simple as taking two different crops of an image, but such straightforward augmentations can’t be directly applied to language because of its discrete nature, making contrastive learning a bit trickier.

In this work, the authors propose to simply feed the same sentence twice to a transformer encoder, drawing different random dropout masks in the feed-forward and attention layers each time. Surprisingly (at least to me), adding any other kind of augmentation previously used, like “deleting one word” or “cropping parts of the sentence”, degrades performance. Similarly, adding previously studied objectives like “next sentence prediction” also degrades performance.
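The idea can be sketched with a toy stand-in for the encoder. Everything here is an assumption made for illustration: the “encoder” is just a fixed random vector per sentence with dropout applied on top (a real setup would apply dropout inside a Transformer), and the loss is a standard in-batch contrastive (InfoNCE-style) loss with an illustrative temperature.

```python
import math
import random


def dropout(vec, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


def contrastive_loss(views_a, views_b, temperature=0.05):
    """View i in A should match view i in B, against every other sentence in B."""
    losses = []
    for i, a in enumerate(views_a):
        logits = [cosine(a, b) / temperature for b in views_b]
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(v - top) for v in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


rng = random.Random(0)
# Stand-in "encoder" outputs: one fixed random vector per sentence (hypothetical).
sentences = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(4)]

# Same input, two passes, two independent dropout masks: that is the whole augmentation.
views_a = [dropout(s, 0.1, rng) for s in sentences]
views_b = [dropout(s, 0.1, rng) for s in sentences]

loss = contrastive_loss(views_a, views_b)
print(round(loss, 4))
```

The two views of a sentence differ only by which units were dropped, so they stay much closer to each other than to the views of other sentences in the batch.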

Although the most impressive results are for unsupervised-only training, the authors also extend their method to leverage labeled data by using positive and negative sentence pairs as positive and hard negative samples respectively.

The results include not only standard sentence-level benchmarks such as STS⁵, but also an analysis of the alignment and uniformity of representations, recently proposed as predictive proxies for representation quality⁵.


Protein sequence-to-structure learning: Is this the end(-to-end revolution)?

By Elodie Laine, Stephan Eismann, Arne Elofsson and Sergei Grudinin

❓Why → DeepMind’s AlphaFold made the rounds a few months ago, showing how Transformers and SSL could leapfrog the task known as protein folding, where the 3D structure of a protein is predicted solely from its linear amino acid sequence. I don’t have the background necessary to understand how impactful this really is, but this paper helped a ton by laying out the state of affairs and providing an informed opinion on the future of Deep Learning in the protein structure domain.

💡Key insights → AlphaFold 2 is truly ahead of its time, and a big reason for that is computational resources, which are increasingly a differentiating factor for research. This once again puts academic research groups in a difficult spot when competing with labs backed by deep-pocketed corporations.

The impact of precise structure prediction on structural biology is still unknown to a large degree. While it’s true that a protein’s structure largely determines its properties, there are many more variables that play a role. On the optimistic end, in silico protein structure prediction will enable consistent, profound and novel biological insights orders of magnitude faster than before; on the pessimistic end, it won’t be enough to successfully model more realistic behaviour of proteins, such as their complex dynamics, flexibility and interactions, and these methods will only be useful in a handful of cases.


Diffusion Models Beat GANs on Image Synthesis

By Prafulla Dhariwal and Alex Nichol

❓Why → GANs⁶ were the unchallenged method for generating images for years after their introduction. In the last couple of years, however, more and more alternative likelihood-based methods have emerged to dethrone them. Will they become as popular?

💡Key insights → We highlighted a Diffusion Model paper at ICLR a month ago, and the gist of this one is mostly similar, although with the refinements that come with 8 extra months of work. The gist of diffusion models goes as follows: you can transform an image into “noise” as a “diffusion process”. Think of how individual water molecules move inside of flowing water: there’s some deterministic flow of the water that follows a gradient with some added random jiggling around. You can do the same with pixel images, diffusing them such that they end up as something like noise from a tractable probability distribution. Now, this process is actually reversible, so the same “backward” diffusion process can be used to generate images from noise.
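The forward “image to noise” direction can be sketched in closed form. This is a minimal illustration that assumes a linear noise schedule with commonly used default values and treats an image as a flat list of pixel values; it is not necessarily the schedule or parameterization used in the paper, and the function names are made up for this example.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Fraction of the original signal variance left after t noising steps."""
    product = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        product *= 1.0 - beta
    return product


def diffuse(x0, t, rng, num_steps=1000):
    """Sample a noised version of x0 at step t in a single shot."""
    ab = alpha_bar(t, num_steps)
    return [math.sqrt(ab) * v + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0) for v in x0]


rng = random.Random(0)
pixels = [0.5, -0.5, 1.0, -1.0]  # toy "image" with values in [-1, 1]
for t in (0, 250, 500, 1000):
    print(t, round(alpha_bar(t), 5), [round(v, 2) for v in diffuse(pixels, t, rng)])
```

By the last step almost none of the original signal is left, so the result is close to a sample from a standard Gaussian. Generation runs the other way: a network is trained to undo one small noising step at a time, starting from pure noise.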

The authors show how a well-tuned model can reach SOTA on many image generation benchmarks, overtaking GANs. However, there are still caveats.


There are some trade-offs that are still inescapable: GANs are still the most popular technique for image generation tasks and are fast, but they often lack diversity and don’t cover a whole domain of images, which makes them harder to scale and apply to new domains. On the other hand, likelihood-based models — like VQ-VAE, Diffusion Models or Autoregressive Generation — offer better coverage at the cost of speed and image fidelity, as judged by humans (or other proxy metrics).

Rethinking “Batch” in BatchNorm

By Yuxin Wu and Justin Johnson

❓Why → Since batchnorm⁷ was proposed, it’s become a ubiquitous tool within the DL toolbox. However, it has non-trivial implications that are often overlooked by empirically driven research.

💡Key insights → There’s a lot to unpack here, but my overall takeaway to highlight is that BatchNorm is less inoffensive than it might seem at first glance.

For starters, BatchNorm is one of the only layers that operates on a group instead of an individual input, which means that it will necessarily behave differently during training and during inference. Moreover, BatchNorm can lead to information leakage within the batch. The paper gives the following insightful example: imagine a training batch is consistently constructed from 32 images covering 16 different classes, 2 images per class; the model will learn to leverage this batch pattern by at least generating labels “in pairs” instead of for each sample individually. A similar “cheating” phenomenon can be observed in contrastive learning⁸.
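The “operates on a group” point is easy to see in a toy version of the layer. The sketch below works on a single scalar feature and leaves out the learnable scale and shift; the function names are hypothetical.

```python
import math


def batch_norm_train(batch, eps=1e-5):
    """Training mode: normalize with the statistics of the current batch."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [(v - mean) / math.sqrt(var + eps) for v in batch]


def batch_norm_eval(value, running_mean, running_var, eps=1e-5):
    """Inference mode: normalize with fixed population statistics."""
    return (value - running_mean) / math.sqrt(running_var + eps)


sample = 1.0
out_a = batch_norm_train([sample, 2.0, 3.0])[0]   # sample is the smallest in its batch
out_b = batch_norm_train([sample, 0.0, -1.0])[0]  # same sample is now the largest
print(out_a, out_b)
print(batch_norm_eval(sample, running_mean=0.5, running_var=2.0))
```

In training mode the same input flips sign depending on its batch mates, which is exactly the channel through which information about the rest of the batch can leak. In inference mode the output depends only on the input and the stored statistics.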

Train-test inconsistencies also inadvertently hurt the performance of models trained with BatchNorm: the Exponential Moving Average (EMA) technique, the one most commonly used to estimate the population statistics needed for inference, results in biased estimates that hurt performance on the test set.
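One way an EMA estimate can end up biased is that it averages over statistics produced by older versions of the model. The simulation below is a made-up illustration of that effect, not an experiment from the paper: it assumes a feature whose batch mean drifts linearly while the weights are still changing, and compares the EMA with simply re-averaging batch means once the model is frozen.

```python
def ema_estimate(batch_means, decay=0.9):
    """Running estimate as typically maintained during training."""
    estimate = 0.0
    for mean in batch_means:
        estimate = decay * estimate + (1.0 - decay) * mean
    return estimate


# Hypothetical feature whose mean drifts from 0.01 to 1.0 as the weights change.
training_means = [step / 100 for step in range(1, 101)]
ema = ema_estimate(training_means)

# Alternative: freeze the final model and average the means of fresh batches.
final_model_means = [1.0] * 10
recomputed = sum(final_model_means) / len(final_model_means)

print("EMA estimate:", round(ema, 3))
print("recomputed with frozen model:", recomputed)
```

The EMA lags behind the value the final model actually produces, because part of its weight still sits on batches seen by earlier weights.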

So what can we make of all of this? You should probably keep using BatchNorm and benefitting from how well it works under the most common settings (drawing fixed-size batches i.i.d. from the training set and using a test set drawn from the same data distribution), but be extremely careful when using it under other circumstances.


Our monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector. If you want to learn more about recent developments on ML code, implementations and repositories, check out our other blog series Best of arXiv. See you soon in the next one!



[1] Pay Attention to MLPs — By Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le, 2021.

[2] RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition — By Xiaohan Ding, Xiangyu Zhang, Jungong Han, Guiguang Ding, 2021.

[6] Generative Adversarial Networks — By Ian J. Goodfellow et al. 2014.

[9] mT5: A massively multilingual pre-trained text-to-text transformer — By Linting Xue, Noah Constant, Adam Roberts et al. 2020.
