Best of arXiv — Zeta Alpha's top picks for March 2021

Staying on top of your reading list is hard, and finding which papers should be on that list can be even harder. At Zeta Alpha we’re always keeping a close eye on the latest ML research, so we’re sharing a monthly selection of recent papers to surface what we believe will be impactful publications, mostly based on each work’s contributions and the authors’ influence. Don’t take this list as comprehensive: we have our biases like everyone else, but hey, there’s only so much you can choose out of 4000+ papers. Enjoy!

By Yifan Jiang, Shiyu Chang & Zhangyang Wang.

🎖Why → Yet another task that Transformers conquer. I was actually surprised to hear that nobody had successfully built fully Transformer based GANs before; but they turn out to be very hard to get right and need some tricks that this paper explores. Yifan Jiang — the main author — told us that having worked extensively with GANs previously³⁴ was of great help, as the devil’s in the details, and it’s likely that many have tried, without success, before this attempt.

💡Key Insights → GANs have always been notoriously unstable to train, and this is even more the case for TransGAN. A fully Transformer-based GAN doesn’t work out of the box; it needs some tricks to perform comparably to its well-established CNN siblings.

The 3 main tricks that make this work are:

  • Data augmentation is key to strong performance (which is not so prominently the case for CNN based GANs).

  • An auxiliary self-supervised reconstruction loss, computed as a Mean Squared Error.

  • Dynamically increasing the attention receptive field during training, starting with local only attention and gradually increasing to global attention. This is arguably the most controversial trick, because local attention is a linear operation on the neighbourhood of a pixel, which is suspiciously close to what a convolutional kernel does.

This last point doesn’t invalidate the research, though; it confirms the importance of inductive biases in ML. Moreover, the state-of-the-art results are impressive and call for further research in this direction.
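To make the third trick more concrete, here is a minimal sketch of a local attention mask whose window grows during training. It only illustrates the idea under our own assumptions: the function names, the square window, and the linear schedule are hypothetical and not taken from the paper.

```python
def local_attention_mask(height, width, radius):
    """Return mask[i][j] == True if token i may attend to token j.

    Tokens are the cells of a height x width grid in row-major order.
    A token sees neighbours at most `radius` steps away in each direction;
    radius=None means global attention (every token sees every token).
    """
    n = height * width
    if radius is None:
        return [[True] * n for _ in range(n)]
    mask = []
    for i in range(n):
        yi, xi = divmod(i, width)
        row = []
        for j in range(n):
            yj, xj = divmod(j, width)
            row.append(max(abs(yi - yj), abs(xi - xj)) <= radius)
        mask.append(row)
    return mask


def radius_for_step(step, total_steps, max_radius):
    """Hypothetical linear schedule: widen the window, then go global."""
    radius = 1 + (step * max_radius) // total_steps
    return None if radius >= max_radius else radius


# Early in training a corner token of a 4x4 grid sees only 4 tokens;
# once the schedule returns None it sees all 16.
early = local_attention_mask(4, 4, radius_for_step(0, 100, 3))
late = local_attention_mask(4, 4, radius_for_step(99, 100, 3))
print(sum(early[0]), sum(late[0]))
```

With a radius of 1 the mask covers a 3x3 neighbourhood, which is where the resemblance to a convolutional kernel comes from.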


By Danny Hernandez et al. from OpenAI

🎖Why → Transfer learning is becoming increasingly relevant at a time when self-supervised pre-training followed by task-specific fine-tuning is the dominant paradigm for achieving SOTA on many tasks. This paper from OpenAI runs large-scale experiments to derive empirical laws that quantify how much transfer helps in language modelling.

💡Key Insights → The effective data transferred is defined as the amount of additional data that a from-scratch model would need to match the performance of a transferred (pre-trained, then fine-tuned) model. This effective data transferred follows a power law in D_F, the fine-tuning data size, and N, the number of model parameters.

Quite impressively, this empirical law fits experiments comparing transfer from text to Python data against training from scratch, on the language modelling task, across 4 orders of magnitude in model size and 3 orders of magnitude in target dataset size.
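As a rough illustration of what such a power law implies, the sketch below assumes the form D_T = k · D_F^α · N^β, where D_T is the effective data transferred. The symbols k, α and β, the helper names, and above all the numeric constants are placeholders of ours for illustration, not fitted values from the paper.

```python
def effective_data_transferred(d_f, n, k, alpha, beta):
    """Assumed power-law form: D_T = k * D_F**alpha * N**beta."""
    return k * d_f ** alpha * n ** beta


def fraction_from_transfer(d_f, d_t):
    """Share of the total effective data (D_F + D_T) due to pre-training."""
    return d_t / (d_f + d_t)


# Placeholder constants, chosen only to make the arithmetic easy to follow.
K, ALPHA, BETA = 1.0, 0.5, 0.5

small = effective_data_transferred(1e6, 1e8, K, ALPHA, BETA)
large = effective_data_transferred(1e7, 1e8, K, ALPHA, BETA)

print(small, large)
print(fraction_from_transfer(1e6, small), fraction_from_transfer(1e7, large))
```

Whenever α is below 1, ten times more fine-tuning data yields less than ten times more transferred data, so the share of effective data that comes from pre-training shrinks as the fine-tuning set grows.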

Visual explanation of the key concepts: Effective Data, Effective Data Transferred and Fine-tuning Dataset.

The paper still has many limitations, such as the fact that the experiments focus only on the transfer between English text and Python code, but scaling laws will be relevant at a time when ever-larger models and datasets are the norm. One of the most interesting results is how models pre-trained on text are not “data constrained” even when the fine-tuning dataset is “small”, unlike when training from scratch (see figure below).


By Weizhe Yuan, Pengfei Liu & Graham Neubig.

🎖Why → This provocative title hides a careful and serious study about the automation of scientific reviewing. Interestingly, the act of asking this question hints at a certain degree of skepticism towards the current state of reviewing…

💡Key Insights → This paper proposes to use NLP models to generate first-pass peer reviews for scientific papers. The authors collect a dataset of papers in the machine learning domain, annotate them with different aspects of content covered in each review, and train targeted summarization models that take in papers to generate reviews. They conduct extensive experiments to show that system-generated reviews tend to touch upon more aspects of the paper than human-written reviews, but the generated text can suffer from lower constructiveness for all aspects except the explanation of the core ideas, which are largely factually correct. (This first paragraph was fully generated by the system the paper presents.)

The paper concludes that review generation cannot replace human reviewers in its current state, but might be helpful as an assisting process. To justify this position, the authors define a set of aspects for quantitatively studying human reviews, each of which can be measured somewhat objectively:

  • Decisiveness — measured in Recommendation Accuracy (RAcc)

  • Comprehensiveness — measured in Aspect Coverage (ACov) and Aspect Recall (ARec)

  • Justification — measured in Informativeness (Info), as the

  • Accuracy — measured as a Summary Accuracy (SAcc)

The generated reviews are quite impressive, but from time to time they slip on key factual statements, as is usual with text generation models, and that is unacceptable in a scientific reviewing setting. The dataset collection and evaluations are quite extensive and intricate, so the paper is worth checking out. This work poses an interesting question: if an automatic reviewing model were to be much better than humans (if that can be properly defined), should it be used? Would it suffer more gravely from Goodhart’s law: “when a measure becomes a target, it ceases to be a good measure”?


By Mehdi Azabou et al.

🎖Why → This work is surprising because in the context of hard negative mining, one often uses nearest neighbour embeddings assuming they’re negative samples that need to be ‘pushed away’ in the contrastive loss. However, this paper does exactly the opposite. And it works.

💡Key Insights → Interestingly, this paper comes from a neuroscience background, as one might guess from the experiments section. Mehdi Azabou, the main author, told us that the project started as an attempt to use self-supervision to decode neural spiking data, and the results were so good that they decided to apply the technique to images! This is an excellent example of cross-pollination between closely related fields.
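The core twist, treating nearest neighbours as positives rather than negatives, can be sketched in a few lines. This is only the mining step under our own simplifying assumptions (plain cosine similarity over a small in-memory list, hypothetical function names); the paper’s full method involves more than this.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_positive_views(embeddings):
    """For each sample, return the index of its nearest *other* sample.

    The mined neighbour is then used as a positive (pulled closer),
    instead of being treated as a hard negative (pushed away).
    """
    partners = []
    for i, anchor in enumerate(embeddings):
        candidates = [j for j in range(len(embeddings)) if j != i]
        partners.append(max(candidates, key=lambda j: cosine(anchor, embeddings[j])))
    return partners


embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(mine_positive_views(embeddings))
```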


By OpenAI et al.

🎖Why → Large scale impresses again: compelling images conditioned on captions can be generated à la language modelling, with a model trained in a self-supervised, autoregressive fashion. Popularly known as DALL·E from its original blog post, this work advances methods for text-to-image generation and is bound to be impactful.

💡Key Insights → Encode the caption of an image into tokens and concatenate them with tokens that represent the image. The image tokens come from a discrete VAE (dVAE) trained on images, which represents each image as a grid of 32x32 tokens drawn from a vocabulary of size 8192. You can think of this representation as each image being 32x32=1024 words out of a vocabulary of size 8192. This trick substantially reduces the size of the space of images and makes it analogous to language tokens, which are discrete symbols from a dictionary.

A model is then trained autoregressively on these token sequences at a large scale. At inference time it is prompted with a caption and completes the sequence with image tokens, which the dVAE in turn decodes into pixel space.
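The token layout and the size reduction described above can be sketched as follows. The 32x32 grid and the 8192-entry vocabulary come from the text; the helper names, the id-offset scheme, and the text vocabulary size used in the example are our own assumptions.

```python
import math

GRID = 32           # image tokens per side (from the text)
IMAGE_VOCAB = 8192  # image token vocabulary size (from the text)


def image_token_bits(grid=GRID, vocab=IMAGE_VOCAB):
    """Bits needed to write down one image as grid*grid tokens."""
    return grid * grid * math.log2(vocab)


def build_sequence(caption_tokens, image_tokens, text_vocab_size):
    """Concatenate caption and image tokens into one stream.

    Image ids are shifted past the text vocabulary so the two kinds of
    token never collide (an assumed scheme, for illustration only).
    """
    return list(caption_tokens) + [text_vocab_size + t for t in image_tokens]


def split_image_tokens(sequence, n_caption, text_vocab_size):
    """Recover the raw image token ids that follow the caption."""
    return [t - text_vocab_size for t in sequence[n_caption:]]


print(image_token_bits())  # 1024 tokens * 13 bits each
sequence = build_sequence([5, 7], [0, 8191], text_vocab_size=100)
print(sequence, split_image_tokens(sequence, 2, 100))
```

Each image costs 1024 × 13 = 13,312 bits in this representation, a small fraction of what raw pixels would take at any reasonable resolution.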

The engineering required to make this large-scale model successful is astounding as usual, and the results are real eye candy.


By B. Schölkopf, F. Locatello, S. Bauer, N. Rosemary Ke, N. Kalchbrenner, A. Goyal & Y. Bengio.

🎖Why → The state of Machine Learning and Causality in 2021. While not yet the most prominent, this area of research has a promising potential to overcome classical ML limitations on out-of-distribution generalization; finally abandoning the (in)famous i.i.d. assumption.

💡Key Insights → This paper can serve as an entry point for practitioners and researchers interested in causality for ML. A useful and simple taxonomy of world models is presented, where the capabilities of each class are carefully considered (see table below).


It’s impossible to cover the whole content of the paper in a paragraph, but some relevant topics it touches on are the fragility of the i.i.d. assumption, the differences between observational and interventional data, the difference between Causal Graphical Models and Structural Causal Models (only the latter can make counterfactual inferences) and foundational principles such as the Independent Causal Mechanisms principle.
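A toy example may help separate observation, intervention and counterfactual inference. The two-variable linear model below is entirely hypothetical and chosen by us for readability; it is not an example from the paper.

```python
# Toy Structural Causal Model (hypothetical):
#   X := U_x
#   Y := 2 * X + U_y
# U_x and U_y are independent noise terms.

def f_y(x, u_y):
    """Structural assignment for Y."""
    return 2 * x + u_y


def observe(u_x, u_y):
    """Observational data: let the mechanisms run undisturbed."""
    x = u_x
    return x, f_y(x, u_y)


def intervene(x_do, u_y):
    """Interventional data: set X by hand, do(X = x_do), ignoring U_x."""
    return x_do, f_y(x_do, u_y)


def counterfactual(x_obs, y_obs, x_cf):
    """What would Y have been for this same unit, had X been x_cf?"""
    u_y = y_obs - 2 * x_obs  # abduction: recover the unit's noise term
    return f_y(x_cf, u_y)    # action and prediction with that same noise


x, y = observe(u_x=1, u_y=0.5)
print((x, y), intervene(3, u_y=0.0), counterfactual(x, y, x_cf=3))
```

The counterfactual step needs the structural assignment itself in order to recover the noise term, which is the extra information a Structural Causal Model carries compared with a graph plus conditional distributions.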

Finally, the authors review the implications that causal representation learning can have for areas like Semi-Supervised Learning, Adversarial Vulnerability, Robustness and Generalization, Self-Supervision, Data Augmentation, Reinforcement Learning, Multi-task Learning and general Scientific Applications.


By Andrew Brock, Soham De, Samuel L. Smith & Karen Simonyan.

🎖Why → Layer and batch normalization have been standard methods in the Deep Learning toolset for a while now. This paper challenges the importance of these techniques by showing how large-scale image recognition can work without any kind of normalization; only with gradient clipping.

💡Key Insights → Typically, vanilla gradient clipping limits the norm of the gradient vector to a fixed hyperparameter lambda. While this technique prevents gradients from exploding and allows for larger batches in training, the procedure is very sensitive to the tuning of lambda. To circumvent this issue, the authors propose “Adaptive Gradient Clipping” (AGC), which clips gradients based on the ratio between the norm of a weight’s gradient and the norm of the weight itself:

Adaptive Gradient Clipping.
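The contrast between vanilla clipping and AGC can be sketched in plain Python. Treat this as our reading of the idea rather than the authors’ implementation: applying the rule row by row, the small floor `eps` on the weight norm, and the default values are assumptions.

```python
import math


def norm(values):
    """Euclidean norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))


def clip_by_norm(grad, lam):
    """Vanilla clipping: rescale the gradient if its norm exceeds lam."""
    g_norm = norm(grad)
    if g_norm > lam:
        return [g * lam / g_norm for g in grad]
    return list(grad)


def adaptive_clip(weights, grads, lam=0.01, eps=1e-3):
    """AGC-style clipping, applied row by row (one row per unit).

    A row's gradient is rescaled when ||g|| / max(||w||, eps) > lam, so the
    threshold scales with the size of the weights it will update.
    """
    clipped = []
    for w_row, g_row in zip(weights, grads):
        w_norm = max(norm(w_row), eps)
        g_norm = norm(g_row)
        if g_norm / w_norm > lam:
            scale = lam * w_norm / g_norm
            clipped.append([g * scale for g in g_row])
        else:
            clipped.append(list(g_row))
    return clipped


weights = [[3.0, 4.0], [3.0, 4.0]]      # both rows have norm 5
grads = [[0.3, 0.4], [0.003, 0.004]]    # ratios 0.1 and 0.001
print(adaptive_clip(weights, grads))
```

Only the first row is rescaled here, since its gradient is large relative to its weights; a fixed-threshold clip cannot make that distinction.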

One of the biggest reasons to do away with normalization is its computational cost; swapping batch normalization for AGC yields faster training for equivalently sized models. In addition, as seen below, performance is robust to the choice of clipping hyperparameter, and the technique allows for even larger batch sizes in training.

(left) AGC scaling to larger batch sizes; (right) performance across clipping thresholds.

The main results section seems to only provide cherry-picked cases where comparison to other work is favourable, so further scrutiny is necessary. Still, training end-to-end models without any normalization is a promising direction to making DL models more efficient.

By Philipp Dufter, Martin Schmitt, Hinrich Schütze.

🎖Why → Encoding order in Transformers, whose self-attention is by design insensitive to the order of its inputs, has always seemed like a bit of an afterthought and, let’s face it, the approaches are all over the place! Here’s a paper that sheds some light on the topic and makes sense of all the variations. Still, questions remain about how relevant this even is…!

💡Key Insights → How positional information is encoded in Transformers has not converged and there’s an extensive body of different research ideas. The authors of this paper provide a categorization that can be useful to map the state of this research (see table below).


The authors also provide a thorough theoretical framework to compare different approaches: adding position embeddings vs. modifying attention masks, sequential vs graph-based ordering, or the effect on cross-lingual learning.
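Two of the families mentioned above can be sketched side by side: adding a position embedding to each token, and changing the attention scores as a function of distance. The sinusoidal formula is the well-known one from the original Transformer; the linear distance bias and its `slope` parameter are a hypothetical stand-in of ours for the second family.

```python
import math


def sinusoidal_position(pos, dim):
    """Sinusoidal embedding: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(dim):
        pair = i - (i % 2)  # indices 2k and 2k+1 share one frequency
        angle = pos / 10000 ** (pair / dim)
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


def add_absolute_positions(token_embeddings):
    """Family 1: add a position vector to every token embedding."""
    dim = len(token_embeddings[0])
    return [
        [t + p for t, p in zip(token, sinusoidal_position(pos, dim))]
        for pos, token in enumerate(token_embeddings)
    ]


def relative_position_bias(n, slope=0.1):
    """Family 2 (hypothetical form): bias[i][j] = -slope * |i - j|,
    to be added to the attention logits before the softmax."""
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]


same_token_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
encoded = add_absolute_positions(same_token_twice)
print(encoded[0] != encoded[1])  # identical tokens now differ by position
print(relative_position_bias(3)[0])
```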

While this paper does not provide any novel result to the reader, it’s an excellent example of the zoo of Transformer research that is enabled by the fact that Transformers are so ubiquitous. Another recent example of such research is “Do Transformer Modifications Transfer Across Implementations and Applications?”², where the authors explore whether the vast number of small modifications to the core Transformer architecture actually yield meaningful differences (spoiler alert: not really).

By Patrick Lewis et al.

🎖Why → Impressive results for an automatic question-answer generation approach from a corpus: if you can come up with all the possible questions and their answers, you just need to memorize them!

💡Key Insights → State-of-the-art Open Domain Question Answering models generally follow a two-step retriever-reader approach: for a given question, a retriever retrieves passages of evidence where the answer is likely to be, and a reader extracts the answer by processing the question and the passages jointly. In practice, this means that the amount of computation required for each single answer is high.

In this work, the authors propose a retriever-only method that matches the performance of retriever-reader models while being orders of magnitude faster. The approach is conceptually simple: from a Wikipedia corpus, find passages that look like answers and generate matching questions that humans might ask. The result is a large-scale, automatically generated set of 65 million question-answer pairs (PAQ), which performs competitively on TriviaQA and Natural Questions because it already contains many of the answers. Moreover, the question-answer retriever is pretty good at estimating the confidence of an answer, so it’s well suited to falling back on a full retriever-reader model when a question doesn’t look like any existing PAQ question.
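The retrieve-or-fall-back logic can be sketched as below. Note the swap: we use word-overlap (Jaccard) similarity as a stand-in for the learned retriever, and the function names, the threshold, and the toy question-answer pairs are all hypothetical.

```python
def tokens(text):
    """Lowercased word set, ignoring question marks."""
    return set(text.lower().replace("?", "").split())


def similarity(a, b):
    """Jaccard overlap between two questions (stand-in for a learned score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def answer(question, qa_pairs, fallback, threshold=0.5):
    """Return the stored answer of the closest stored question, or defer.

    If no stored question is similar enough, hand the question to
    `fallback`, which stands for a slower retriever-reader model.
    """
    best_q, best_a = max(qa_pairs, key=lambda qa: similarity(question, qa[0]))
    if similarity(question, best_q) >= threshold:
        return best_a
    return fallback(question)


qa_pairs = [
    ("who wrote hamlet", "William Shakespeare"),
    ("what is the capital of france", "Paris"),
]
slow_model = lambda q: "(deferred to retriever-reader)"
print(answer("Who wrote Hamlet?", qa_pairs, slow_model))
print(answer("How tall is Mount Everest?", qa_pairs, slow_model))
```

Most of the cost moves to the offline step of generating and indexing the pairs, which is why answering a single question becomes cheap.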


Our packed monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector. I’m already looking forward to sharing the next selection for April; see you soon!


