Best of arXiv — January 2021

A monthly selection of ML papers.

Staying on top of your reading list is hard, and finding which papers should be on that reading list can be even harder. At Zeta Alpha we’re always keeping a close eye on the latest ML research, so we thought it would be useful to share a monthly selection of recent papers to surface what we believe will be impactful publications, mostly based on each work’s contributions and the authors’ reputation. Don’t take this list as comprehensive: we have our biases like everyone else, but hey, there’s only so much you can choose out of 2000+ papers. Enjoy!

By Alec Radford, Jong Wook Kim et al.

🎖Why → OpenAI papers always make a lot of noise in the community, most of the time for a good reason. The extensiveness of the results section and the impressive zero-shot performance make this work a must-read for folks interested in CV and NLP.

💡Key Insights → The main gist of this work follows OpenAI’s playbook, and is once again evidence reinforcing Sutton’s Bitter Lesson:

  1. Curate the largest ever, best-in-class dataset for a task; in this case (image, text) pairs crawled from the web (400 million samples🤯).

  2. Meticulously engineer training for scale and large compute.

  3. Show how a simple task is all you need if you scale up the data and compute enough.

In this case, they create a dataset of 400 million (text, image) pairs from the web without any human labeling, and jointly learn representations for the text and the images in a contrastive setting, where the model maximizes the similarity for positive (text, image) pairs and pushes away the negative pairs. The extensiveness of the experiments is truly breathtaking, and among the results, perhaps the most interesting ones are those of zero-shot classification, where performance is comparable to a fully supervised linear classifier on top of ResNet-50 features (see figure below).

Contrastive pre-training framework for CLIP (left) and zero-shot classification method. Source:
Zero shot performance of CLIP vs ResNet50 features + fine-tuned linear classifier. Source:
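
To make the contrastive setup a bit more concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of (image, text) embeddings, plus zero-shot classification by nearest class description. It is not the authors’ code: the function names, the fixed temperature of 0.07 and the use of plain arrays instead of real encoders are our own assumptions for illustration.

```python
import numpy as np

def _normalize(x):
    # Unit-length rows, so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an (N, N) similarity matrix.

    Row i of image_emb and row i of text_emb form the positive pair;
    every other combination in the batch acts as a negative.
    The fixed temperature is an assumption made for this sketch.
    """
    logits = _normalize(image_emb) @ _normalize(text_emb).T / temperature
    diag = np.arange(logits.shape[0])
    loss_image_to_text = -_log_softmax(logits, axis=1)[diag, diag].mean()
    loss_text_to_image = -_log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2

def zero_shot_classify(image_emb, class_text_emb):
    """Assign each image to the class whose text embedding is most similar."""
    return (_normalize(image_emb) @ _normalize(class_text_emb).T).argmax(axis=1)
```

In a real system the embeddings would come from an image encoder and a text encoder trained jointly, and the class text embeddings would be obtained by encoding a short description of each class.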

By Xiaohan Ding et al.

🎖Why → SOTA is not everything, and this work proves it by going back to the basics of robust and fast CNNs for image classification while preserving the performance of the latest tricks from branching ResNets, striking a sweet speed-performance balance.

💡Key Insights → as the field of Computer Vision matures, efficiency, speed and customizability gain relevance because research wants to be relevant in real-world applications. The main theme behind this work is “let’s go back to the basics and excel at that”: they brag about not having branching, only using 3x3 convolutions + ReLUs and not using architecture search, compound scaling nor any other “heavy designs”.

The main contribution of this paper is what the authors call Structural Re-parameterization. This method allows for training a model with residual connections and then converting it into a single-path model topology that makes inference wicked fast. The ImageNet Top-1 performance is still in the 80%s (far from SOTA’s 90%), but the real-world speed in examples per second on a single GPU is far ahead of the competition.

ImageNet top-1 accuracy and inference speed trade-off, compared to other existing models. Source:
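
The core trick can be checked numerically in a few lines. The sketch below is our own toy illustration, not the authors’ implementation: it merges a 3x3 branch, a 1x1 branch and an identity shortcut into a single 3x3 kernel and relies on the linearity of convolution. It assumes stride 1, zero padding and equal input and output channels, and it leaves out normalization layers entirely; all names are hypothetical.

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1, zero-padded convolution (cross-correlation, as in deep learning).

    x: (C_in, H, W) input; w: (C_out, C_in, k, k) kernel with odd k.
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Contribution of kernel tap (i, j) to every output position.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def merge_branches(w3, w1):
    """Fold a 1x1 branch and an identity shortcut into the 3x3 kernel."""
    c = w3.shape[0]
    merged = w3.copy()
    merged[:, :, 1, 1] += w1[:, :, 0, 0]              # 1x1 kernel sits on the center tap
    merged[np.arange(c), np.arange(c), 1, 1] += 1.0   # identity = 1 on the center tap, per channel
    return merged
```

With this, `conv2d_same(x, w3) + conv2d_same(x, w1) + x` and `conv2d_same(x, merge_branches(w3, w1))` give the same output up to floating point error, which is why the multi-branch model used for training can be served as a plain stack of 3x3 convolutions.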

By Tianyu Gao, Adam Fisch and Danqi Chen.

🎖Why: we won’t be deploying GPT-3 sized models anytime soon because of the resources they require, but we’re all in for bringing the surprising few-shot capabilities to smaller models!

💡Key insights: arguably, the main GPT-3 contribution was the surprising few- and zero-shot performance and the “prompting paradigm”, where instead of fine-tuning a model on a task, one finds prompts that make the model perform a task successfully without task-specific labels. In this work the authors investigate how we can train smaller Language Models to display similar few-shot capabilities. The Related Work section of the paper is a gold mine of relevant references in similar research directions such as “Small Language Models Are Also Few-Shot Learners”⁴ by Schick and Schütze.

The paper considers a few-shot learning setting in which we assume we have a pre-trained model L that we want to fine-tune on a new task D with a limited set of K training examples per class. For this, they study prompt-based fine-tuning, in which, instead of updating the model parameters with a plain supervised signal from task D, the training sample is concatenated with a prompt that makes the model complete the sentence by classifying it. For instance, given a movie review “No reason to watch it.”, the model is prompted with “No reason to watch it. It was [MASK].”, for which the model predicts the mask token, which we then associate with the sentiment label.

They explore fine-tuning by manual and automatic prompting along with “demonstrations”, which concatenate labeled examples to the input of the prompt for the model. The results show that with these techniques, small language models can perform very well in few-shot settings.

Results for different prompting and fine-tuning strategies. Source:
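
As a rough illustration of the prompting idea, the sketch below builds the prompt from the movie review example and maps the word predicted at the mask position back to a label. The template comes from the example above; the label words (“great”, “terrible”), the function names and the dictionary of mask scores are our own assumptions, and the scores would come from a masked language model in practice.

```python
TEMPLATE = "{sentence} It was [MASK]."

# Assumed label words for this sketch: one vocabulary word per class.
LABEL_WORDS = {"positive": "great", "negative": "terrible"}

def build_prompt(sentence, demonstrations=()):
    """Wrap the input in the template and append labeled demonstrations, if any."""
    prompt = TEMPLATE.format(sentence=sentence)
    for demo_sentence, demo_label in demonstrations:
        filled = TEMPLATE.format(sentence=demo_sentence)
        # In a demonstration the mask is already filled with the label word.
        prompt += " " + filled.replace("[MASK]", LABEL_WORDS[demo_label])
    return prompt

def predict_label(mask_word_scores):
    """Pick the label whose word gets the highest score at the [MASK] position.

    mask_word_scores: dict from word to score (e.g. log-probability),
    a stand-in for the output of a masked language model.
    """
    return max(
        LABEL_WORDS,
        key=lambda label: mask_word_scores.get(LABEL_WORDS[label], float("-inf")),
    )
```

For example, `build_prompt("No reason to watch it.")` produces the prompt from the text, and a model that scores “terrible” above “great” at the mask position yields the label `negative`.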

By William Fedus, Barret Zoph and Noam Shazeer.

🎖Why → We often associate more parameters with requiring more computation, but this is not necessarily the case. Scaling up models is a trend that will remain relevant for years to come, and this work is an excellent example of pushing the boundaries of model size.

💡Key insights → The word “trillion” in the title requires some caveats: these trillion parameters are sparse, which means that most of them are not used when computing a forward pass. Each Transformer layer is a mixture-of-experts with hard routing at inference time, such that the number of operations per forward pass remains constant as you add experts, although the memory footprint and communication overhead between computation nodes do increase.
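
A minimal sketch of hard (top-1) routing shows why compute per token stays flat as experts are added: each token is multiplied by the weights of exactly one expert, no matter how many experts exist. This is our own simplified illustration, where every “expert” is a single weight matrix rather than a full feed-forward block, and all names are hypothetical.

```python
import numpy as np

def switch_layer(tokens, router_w, expert_ws):
    """Route each token to its single highest-scoring expert.

    tokens: (T, d) float array; router_w: (d, E); expert_ws: (E, d, d).
    Returns the layer output and the chosen expert index per token.
    """
    logits = tokens @ router_w
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    expert_idx = probs.argmax(axis=1)                  # hard routing decision
    gate = probs[np.arange(len(tokens)), expert_idx]   # router probability of the chosen expert
    out = np.empty_like(tokens)
    for e in range(expert_ws.shape[0]):
        mask = expert_idx == e
        # Only the tokens routed to expert e touch its parameters.
        out[mask] = gate[mask, None] * (tokens[mask] @ expert_ws[e])
    return out, expert_idx
```

Adding experts grows `expert_ws` (memory) and, in a distributed setting, the traffic needed to ship tokens to the right device, but each token still goes through one matrix product.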

The paper has too much substance to condense in a couple of paragraphs, but I would stress that one of the most interesting findings is how scaling up Transformers via adding experts to each layer speeds up the learning substantially, keeping other variables constant.

Judging from the results, performance is still not near convergence at the maximum number of parameters, so we can expect to find more interesting phenomena when going even bigger. This is important in real-world settings because smaller models can be distilled or pruned from the bigger ones, and these outperform equivalent models trained from scratch, a learning paradigm that might become dominant in the coming years.

Performance gains as a function of sparse model parameters (left) and training steps (right). FLOPs at inference are equal for all cases displayed. Source:

By Emily M. Bender, Timnit Gebru, Angelina McMillan-Major and Shmargaret Shmitchell.

🎖Why → As a counterpoint to the work we just shared on big Transformers, here’s a paper pointing towards the dangers of the status quo of Language Models. In early December 2020, disputes over a preliminary version of this work triggered the firing of AI ethics researcher Timnit Gebru from Google, which became a center of public debate on ethics within AI and Google’s fishy position in it.

💡Key insights → In this position paper, the authors review the current state of Language Models and the broader dangers they carry, such as environmental and financial costs and training datasets that perpetuate negative social biases, tied to the lack of accountability among practitioners. The recommendation the authors put forward is to weigh these factors when building Language Models, and to go beyond larger and larger models in language research, focusing instead on areas such as curating and documenting higher quality datasets. While the whole story behind the paper made its content a bigger deal than it actually is, it’s an interesting read with many relevant references that capture a snapshot of Language Models as of January 2021.

Prominent language models along with their parameter count and pre-training dataset size. Source:

By Luyu Gao, Zhuyun Dai and Jamie Callan.

🎖Why → We find this work at the top of the MS MARCO leaderboard, one of the most popular Information Retrieval benchmarks, and despite being very rough around the edges, this paper is based on a very simple change in loss functions for neural retrieval with promising results.

💡Key insights → Modern neural re-rankers work in two steps to alleviate the high computational cost of running a full neural network to compute the relevance of each document-query pair.

  1. An initial retriever (M) selects a pool of candidates C from a whole corpus of documents D.

  2. A neural model, the re-ranker (R), gets each document-query pair as an input and scores their relevance. This process generally relies on human annotations of query-document relevance, where the re-ranker minimizes a Binary Cross Entropy loss among all candidates C, classifying them into positive or negative samples.

One would generally expect that when the first model M gets better, the performance of the system as a whole will improve, as the re-ranker gets better examples to choose from, and several works have tried to improve this first retrieval stage. However, experiments show that when retriever M selects a better pool of documents, the re-ranker R often has a harder time differentiating the relevant documents from the irrelevant ones. This paper proposes a very simple solution to this phenomenon, which consists in replacing the BCE loss (where all documents are classified as either relevant or non-relevant) by a Contrastive Loss where only 1 positive document is considered at a time and negative documents are sampled among the top ranked documents by M, which penalizes false positives more strongly than BCE.
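
The difference between the two objectives can be sketched as follows. This is our own minimal illustration, not the paper’s code: the function names are hypothetical, the scores stand in for the re-ranker’s outputs, and the contrastive loss is written as a softmax cross-entropy over one positive and a group of negatives taken from the retriever’s top results.

```python
import math

def bce_loss(scores, labels):
    """Pointwise loss: every candidate is classified as relevant (1) or not (0)."""
    total = 0.0
    for score, label in zip(scores, labels):
        # log(1 + exp(-s)) for positives, log(1 + exp(s)) for negatives.
        total += math.log1p(math.exp(-score)) if label == 1 else math.log1p(math.exp(score))
    return total / len(scores)

def contrastive_rerank_loss(pos_score, neg_scores):
    """One positive competes against negatives sampled from the retriever's top results."""
    scores = [pos_score, *neg_scores]
    top = max(scores)
    # Numerically stable log-sum-exp over the whole group.
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - pos_score
```

Because the positive has to win a softmax against the hardest candidates the retriever returned, a single highly scored negative dominates the contrastive loss, whereas in BCE its contribution is averaged with all the other candidates.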

This clever, simple change is enough to put this work at the top of the MS MARCO benchmark. However, as you’ll see if you check out the paper, this is still very rough preliminary work and the results are very limited: top performances rely on tricks and heuristics (as do all leading IR approaches) and many more ablation experiments are required to really understand the benefits of using this contrastive loss in re-rankers. We’re looking forward to that!

Performance of the proposed model compared to previous state-of-the-art PROP. Source:

🎖 Why → unlike the previous paper, this work presents a detailed systematic study of pre-training tasks for open Question Answering (QA), which shares a lot with the document ranking and re-ranking task we just discussed. It’s a good introduction to the most up-to-date practices that dominate QA leaderboards and is authored by reputable researchers from MILA, McGill and NVIDIA.

💡 Key insights → In this case, the neural pipeline for open Question Answering also consists of two stages: a first-stage Retriever selects a pool of contexts, and a Reader takes a question “q” and the collection of contexts “K”, encodes them and then decodes an answer “a” based on this two-part input. As a Reader model, they use a pre-trained T5³ model. As with the Switch Transformers paper, there is too much substance to summarize in a paragraph. The two main pre-training tasks studied for the retriever are the Inverse Cloze Task¹ (ICT) and Masked salient spans²:

  • Inverse Cloze Task (ICT): extracting segments of a document and learning a representation of segments and documents that matches segments to the document they originally belong to in a contrastive setting.

  • Masked salient spans: predicting masked salient spans of tokens such as named entities.
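
To give a feel for how ICT training pairs can be built without any labels, here is a small sketch under our own assumptions: a random sentence plays the role of the query and the rest of the passage is its positive context. The function name and the `keep_query_prob` parameter (occasionally leaving the sentence inside its context) are hypothetical choices made for this illustration.

```python
import random

def make_ict_example(sentences, rng, keep_query_prob=0.1):
    """Build one (pseudo-query, context) pair from a passage split into sentences.

    sentences: list of sentence strings; rng: a random.Random instance.
    keep_query_prob is an assumed knob, not a value taken from the paper.
    """
    idx = rng.randrange(len(sentences))
    query = sentences[idx]
    if rng.random() < keep_query_prob:
        context = sentences                                # query stays in its context
    else:
        context = sentences[:idx] + sentences[idx + 1:]    # query removed from its context
    return query, " ".join(context)
```

Pairs produced this way serve as positives, and contexts from other passages in the same batch can act as negatives in the contrastive objective.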

Additionally, this work compares two methods for including contexts into the generation of an answer:

  • Individual top-k: the likelihood of an answer is decomposed into the sum of marginals over the collection of contexts K.

  • Joint top-k: the likelihood of an answer is computed directly over the collection of contexts K. In practice, this means that contexts are concatenated as input to the reader model and the question can attend to all documents simultaneously to generate the answer.

The work results in state-of-the-art for the first retrieval stage and also for “end-to-end” QA in Natural Questions and TriviaQA datasets.

Retriever-only results for Natural Questions and TriviaQA. Source:

End-to-end QA performance compared to previous SOTA. Source:

🎖Why → The practice of Machine Learning with Differential Privacy (DP) is still not mainstream, partly because of the high entry barrier and the relatively early stage of the research. Don’t let the fancy words in the title scare you: this work provides an extensive introduction to DP and a study of how much privacy can be preserved when instantiating adversaries under realistic constraints.

💡Key insights → Imagine you want to train a model on confidential medical data hosted by a hospital. You define a computation for training a model with this data, and for each iteration of the training you send the weights of your model to the hospital and the hospital calculates some weight updates and sends them back to you. Now, if you were a very clever bad actor (an adversary), could you infer any individual data from the training dataset given the weight updates? Differential Privacy takes care of adding just enough noise to the data you receive such that you won’t be able to recover any sensitive data from it (this is an extreme oversimplification, but you get the gist).
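
In the spirit of that oversimplification, one common way to add noise to weight updates is to clip each example’s gradient and then add Gaussian noise before averaging. The sketch below is our own illustration of that recipe, not code from the paper; the function name and the parameters `clip_norm` and `noise_multiplier` are assumptions.

```python
import numpy as np

def private_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    per_example_grads: (N, d) array with one gradient per training example.
    rng: a numpy.random.Generator used to draw the noise.
    """
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Scale down any gradient whose norm exceeds clip_norm, so that no single
    # example can move the update by more than clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale
    # Noise is calibrated to the largest possible contribution of one example.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=per_example_grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(per_example_grads)
```

A larger `noise_multiplier` hides individual examples better but makes the update less useful, which is the privacy-utility trade-off these methods have to balance.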

Now, Differential Privacy normally studies the formal upper bounds on privacy leakage (i.e. worst-case scenarios) in which the hypothetical adversary has, for instance, perfect access to each intermediate weight update. But in a real setting, we can further refine these assumptions into more realistic ones: for instance, the case where an adversary only has access to the final model, or only to its predictions via an API. This paper examines how privacy preservation works across these more realistic cases. Results show that the privacy actually preserved is considerably better than the worst-case bounds suggest when these realistic constraints are imposed, which is a hopeful result for the real-world applicability of these techniques.

Summary of Differential Privacy preservation under realistic adversary constraints. Source:

🎖Why: If ML progress nowadays is all about leaderboards, these shouldn’t be constrained to only domains where fully automatic evaluation is possible. Text generation is one such task where fully automatic evaluation is notably hard: BLEU and ROUGE scores correlate with human judgements up to a point, and this correlation breaks down once they become optimization targets. This paper puts forward an evaluation benchmark that combines classic automatic evaluation with convenient crowdsourced human evaluation. Convenience here is doing most of the heavy lifting: human evaluation benchmarks have been used for decades, but never at the scale and convenience of automatic benchmarks such as the GLUE benchmark. Existing human-in-the-loop evaluation frameworks such as HYPE⁵, ChatEval⁶ or HUME⁷ focus on only one task each, so it’ll be interesting to see how much traction GENIE gets in the community as a more general purpose benchmark.

🔗 Where to find it: You can find more about it and how to submit your model in

Tasks currently supported by GENIE. Source: GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation

10. Asymmetric Self-Play for Automatic Goal Discovery in Robotic Manipulation | 📺 Demo

By OpenAI et al. (requested citation format from paper)

🎖Why → Self-play applied to robotic manipulation. Despite being rejected at ICLR 2021 for lacking experimental depth (i.e. all experiments are simulation only), the idea behind it is very promising and it will no doubt have a solid impact.

💡Key insights → The paper presents the task of robotic manipulation, which essentially means having a robot learn to manipulate objects given instructions or a certain goal. In this case, they explore how a robot can learn to manipulate objects to achieve a goal, given only the final goal and no instructions.

The idea to solve this is very simple: we consider two robots, Alice and Bob. Alice creates configurations of objects and Bob needs to replicate them. Alice is rewarded for coming up with configurations that Bob cannot create, and Bob is rewarded when it’s able to replicate Alice’s state. Given that Alice needs to generate the configurations for Bob, we’re certain that the states Bob is presented with are feasible.

In this setting, neither Alice nor Bob needs labeled supervision, and given that both Alice and Bob start from scratch in this adversarial setting, the configurations Alice comes up with will naturally grow increasingly difficult, mimicking the concept of curriculum learning where a task becomes increasingly difficult, but without explicitly curating a set of tasks graded in complexity. There are many extra details necessary to make this process stable and work in practice, but as the simulation experiments suggest, self-play seems to be more efficient and robust than curriculum learning at teaching Bob to become an expert in manipulating objects on a surface.

Self-play framework for robotic manipulation. Source:

Comparison of self-play with curriculum learning on different simple tasks. Source:
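
The reward structure of the Alice and Bob game fits in a few lines. The toy sketch below is our own illustration and has nothing to do with the actual robot simulation: states are tuples of integer object positions, Alice’s “policy” is random, and the reward values of 0 and 1 are placeholders rather than the ones used in the paper.

```python
import random

def alice_propose_goal(start, n_moves, rng):
    """Alice acts from the start state; wherever she ends up becomes Bob's goal.

    Because the goal was actually reached by acting in the environment,
    it is feasible from `start` by construction.
    """
    state = list(start)
    for _ in range(n_moves):
        i = rng.randrange(len(state))     # pick an object
        state[i] += rng.choice((-1, 1))   # nudge it one step left or right
    return tuple(state)

def self_play_rewards(bob_final_state, goal):
    """Bob is rewarded for reaching the goal, Alice for goals Bob fails to reach."""
    bob_solved = bob_final_state == goal
    return {"alice": 0.0 if bob_solved else 1.0, "bob": 1.0 if bob_solved else 0.0}
```

As Bob improves, the only way for Alice to keep earning reward is to propose harder goals, which is where the automatic curriculum comes from.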

Our packed monthly selection ends here, but we’re just getting started. If you want to keep up to date with the latest research, follow us on Twitter @zetavector. I’m already looking forward to sharing the next selection for the month of February; see you soon!

References:

[1] “Latent Retrieval for Weakly Supervised Open Domain Question Answering” by Kenton Lee et al. 2019.

[2] “REALM: Retrieval-Augmented Language Model Pre-Training” by Guu et al. 2020.

[3] “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel et al. 2020.

[4] “Small Language Models Are Also Few-Shot Learners” by Schick and Schütze 2020.

[5] “HYPE: A Benchmark for Human eYe Perceptual Evaluation of Generative Models” by Sharon Zhou et al. 2019.

[6] “ChatEval: A Tool for Chatbot Evaluation” by João Sedoc et al. 2019.

[7] “Unifying Human and Statistical Evaluation for Natural Language Generation” by Tatsunori Hashimoto et al. 2019.
