MS MARCO is in trouble. Best in AI — October 2021

Updated: Sep 29, 2021

Pupil Shapes Reveal GAN-generated Faces, Multimodal Prompt Engineering, CNNs vs Transformers vs MLPs for vision, Primer Evolved Transformer, FLAN, plus the pressing question of whether MS MARCO has reached end of life in neural retrieval evaluation.

We’re back with our monthly mix of news, code and research papers that are shaping the path of Artificial Intelligence. Let's dive in starting with some news!

  • Accelerated inference and tooling around running ML models is a growing interest of the industry. The latest example is the partnership between Hugging Face and Graphcore, a British startup that designs and provides hardware for training and inference with Machine Learning models. There is also the new Hugging Face Infinity offering. With these, Hugging Face hopes to strengthen its hosted inference and training services, which have grown impressively in the past year.

  • More details about DeepMind's AlphaFold have slowly been coming out: first a shallow announcement in November 2020, then a Nature paper in July 2021, and more recently the code, data and inference pipeline.

  • Finally, Facebook Research showcased their Textless NLP agenda by training a Language Model on raw audio, named Generative Spoken Language Model (GSLM), that can generate coherent expressive speech given a spoken prompt. For decades NLP has been almost a synonym for working on text data, because modelling raw audio was out of reach, but that time might be over.

🔬 Research

When it comes to academic papers, the summer didn't seem to cause much of a slowdown. Of all the research that came out, here are some trends I'd like to highlight:

  1. Huge Language Models no longer compete in the same category as the smaller models (as in pre-train + finetune with supervised data). Why? It doesn’t make sense anymore, because everybody knows what happens when larger models are fine-tuned on supervised benchmarks like SuperGLUE: they just get better. Instead, we’re seeing these models explore a space of problems that were not viable previously, such as zero/few-shot and/or multimodal learning.

  2. Metrics and benchmarks (how we quantify something we care about) need to be continuously rethought and reinvented for progress to be meaningful. MNIST is great for some Computer Vision research, but you won’t find headlines bragging about a new SOTA on the dataset. Luckily, academics are often concerned about this, and most of the revealing papers in this tradition come from universities, like the one we'll see shortly.

  3. We often discuss the modeling side of things (the architecture of a model, the loss function, optimizers, etc.) because they’re sexy, but evidence keeps piling up that if you want to solve a new problem, your brain-hours are best spent working on the data. Andrew Ng has already spent years championing this perspective and it’s aging like fine wine.

Without further ado, here are some recent highlights from arXiv!

1. A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

By Yucheng Zhao, Guangting Wang, Chuanxin Tang, Chong Luo, Wenjun Zeng and Zheng-Jun Zha.

❓Why? → This paper’s weakness is also its strength: the humble scope and effort put into further optimizing their experiments. When a community cycles through hundreds of iterations optimizing a certain technique, it’s hard to compare it to a new one in a fair way: how much of the success of an existing method can be attributed to its foundational ideas, versus how much of it is sheer accumulation of small optimizations and know-how that’s taken for granted? This is the case for CNNs which have ruled the last decade in Computer Vision and also the case for Transformers dominating in Language.

💡Key insights → The authors propose a fairly simple strategy to compare the three architectures head to head: define the network’s architecture as a combination of embedding, spatial mixing and downsampling blocks and only swap around the spatial mixing with the different architectures. An important caveat is that all architectures share an initial patch embedding layer, which effectively is a convolution on the original image (i.e. a linear projection for each patch sharing weights), so all of the techniques contain a first convolution-like layer.
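
To make that caveat concrete, here is a minimal sketch (my own toy code, not the authors') showing that a linear projection applied to every non-overlapping patch with shared weights gives the same numbers as a convolution whose stride equals its kernel size. The function names, the single-channel 4x4 image and the two hand-picked kernels are all assumptions made for illustration.

```python
# Toy sketch: patch embedding == convolution with stride equal to kernel size.
# Single-channel image and hand-picked weights; every name here is illustrative.

def patch_embed_linear(image, weights, patch):
    """Flatten each non-overlapping patch and apply the same linear map to it."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h, patch):
        row = []
        for left in range(0, w, patch):
            flat = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            row.append([sum(x * wk for x, wk in zip(flat, kernel))
                        for kernel in weights])
        out.append(row)
    return out


def patch_embed_conv(image, weights, patch):
    """Slide patch-by-patch kernels over the image with stride = patch."""
    h, w = len(image), len(image[0])
    out = []
    for top in range(0, h - patch + 1, patch):
        row = []
        for left in range(0, w - patch + 1, patch):
            feats = []
            for kernel in weights:
                acc = 0.0
                for i in range(patch):
                    for j in range(patch):
                        acc += image[top + i][left + j] * kernel[i * patch + j]
                feats.append(acc)
            row.append(feats)
        out.append(row)
    return out


# 4x4 toy image with pixel values 0..15, two output channels, 2x2 patches.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
weights = [[1.0, 0.0, 0.0, 0.0],       # picks the top-left pixel of each patch
           [0.25, 0.25, 0.25, 0.25]]   # averages each patch

linear_out = patch_embed_linear(image, weights, patch=2)
conv_out = patch_embed_conv(image, weights, patch=2)
```

Both functions return a 2x2 grid of 2-dimensional patch embeddings with identical values, which is why even the "pure" MLP and Transformer variants start from a convolution-like layer.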

The conclusions we can draw from these experiments are limited: after the initial patch embedding, MLPs do surprisingly well when there’s enough data, although CNNs still rule in the data constrained regime, and Transformers scale better, as in, they don't saturate as much when you keep growing them. Takeaway: you should probably still use CNNs today, but make sure to come back in a year and ask again. These findings are aligned with the common wisdom that strong inductive biases in an architecture are useful in learning efficiency, but cease to be important at some point for images, as equivariances can be learned instead of injected a priori (although this reasoning does not apply to all data modalities and model equivariances are probably necessary to learn from data with combinatoric explosions such as large graphs).

Other similar works you might like: Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

2. Shallow pooling for sparse labels

By Negar Arabzadeh, Alexandra Vtyurina, Xinyi Yan, Charles L. A. Clarke.

Quote by authors: “We were disturbed by the ramifications of these observations.”

❓Why? → MS MARCO is one of the most popular Information Retrieval (IR) benchmarks out there. This paper uncovers some of the shortcomings of the existing benchmark and more broadly proposes improvements on how IR can be evaluated.

💡Key insights → IR benchmarks are often annotated in the following way: given a query, a passage from a corpus is labeled as being relevant to that query. Most queries have only one passage annotated as relevant, despite the fact that there might be many more in the corpus (annotating all query-document pairs would not be feasible most of the time), hence we can say labels are sparse. Retrieval models are then evaluated using Mean Reciprocal Rank (MRR), in which the passage annotated as relevant should rank first, but without explicitly "penalizing" non-annotated passages as irrelevant (remember, they might actually be relevant despite the lack of annotation).
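
For readers who have never computed it, here is a minimal sketch of MRR under sparse labels. The query and passage ids are made up, and the `cutoff` parameter is my own naming; it defaults to 10 because MRR@10 is what is usually reported for MS MARCO passage ranking.

```python
def reciprocal_rank(ranked_ids, relevant_ids, cutoff=10):
    """1 / rank of the first annotated passage within the cutoff, else 0."""
    for rank, passage_id in enumerate(ranked_ids[:cutoff], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, cutoff=10):
    """Average the reciprocal rank over every annotated query."""
    scores = [reciprocal_rank(runs[q], qrels[q], cutoff) for q in qrels]
    return sum(scores) / len(scores)


# Sparse labels: exactly one annotated passage per query (hypothetical ids).
qrels = {"q1": {"p7"}, "q2": {"p3"}}

# Model output. For q2 the model ranks the unlabeled "p8" first; even if
# "p8" is in fact a better answer than "p3", the query only scores 1/2.
runs = {"q1": ["p7", "p2", "p9"], "q2": ["p8", "p3", "p1"]}

mrr = mean_reciprocal_rank(runs, qrels)
```

Here `mrr` comes out at 0.75: a full point for q1 and half a point for q2, regardless of how good the unlabeled top result for q2 really is.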

The problem with this setup is: what if the retrieval model ranks first a passage that is not annotated but is actually more relevant than the annotated passage? This work precisely answers this question.

Given a subset of queries from MS MARCO, they run the queries and ask annotators the following question: which passage is most relevant to the query, the passage annotated as relevant or the top result from a neural ranker? Of course, annotators are not told which one is which, and they are only asked when the annotated passage and the model's top-ranked passage differ. It turns out that the result from the neural ranker is preferred more often than the annotation (see figure below). In other words, one might say that these models are better than perfect, because their results would be preferred by annotators when compared to a perfect run in which the top passage retrieved is always the one labeled as relevant.

This result is tremendously relevant because it suggests that popular leaderboards might no longer reflect any improvements on IR, but an overfitting to the annotations. As a suggestion on fixing this, the authors propose to move to a non-static annotation dataset in which top relevance passages are continuously annotated with pairwise labels (e.g. which of the two passages is most relevant to a given query?).

3. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation | 👾 Code

By Ofir Press, Noah A. Smith, Mike Lewis.

❓Why? → Transformers are natively set-to-set neural networks, meaning the architecture is order invariant, which is why researchers have tried different ways of encoding position information into the models. This paper proposes a new super simple, non-learnable way of encoding relative position information that’s robust to doing inference in larger contexts than in training. Could this become a new standard?

💡Key insights → Position encoding in Transformers has long been a bit all over the place, because it seems like "anything goes": given a way to encode position, the Transformer and gradient descent will find a way around it that performs well. We even needed a survey to make some sense out of it! [2] But what are the hidden pitfalls of common techniques like sinusoidal positional embeddings or fixed learned absolute embeddings? They don’t generalize well when doing inference at a longer length than they were trained on.

The proposed method here simply relies on adding a bias to the attention matrix (before the softmax) that is proportional to the distance between the center token and its neighbours (see figure below). These biases are fixed and not learned, and the scaling factor m is set as a hyperparameter.
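
Here is a minimal sketch of that idea in plain Python, not the authors' implementation. I assume a causal (left-to-right) model, a single attention head and an arbitrary slope of 0.5; the function names are mine.

```python
import math


def alibi_bias(seq_len, slope):
    """bias[i][j] = -slope * (i - j) for keys j at or before query i.

    Future keys (j > i) get -inf, which is the usual causal mask.
    """
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, slope):
    """Add the fixed linear bias to raw query-key scores, then normalize."""
    bias = alibi_bias(len(scores), slope)
    return [softmax([s + b for s, b in zip(score_row, bias_row)])
            for score_row, bias_row in zip(scores, bias)]


# With identical content scores, only the distance bias shapes the attention.
scores = [[0.0] * 4 for _ in range(4)]
weights = attention_weights(scores, slope=0.5)
```

Nothing in `alibi_bias` depends on a maximum length seen during training, which is what lets the same rule be applied to longer sequences at inference time.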

Surprisingly, the Transformer learns just fine with this encoding of position and performs competitively with comparable models of similar size. The interesting comparison happens when the model performs inference with sequences that are longer than those it was trained on (see figure below). While existing models struggle to generalize (i.e. perplexity sharply increases with longer sequences), this is not the case for ALiBi.

However, there's an important caveat that's not thoroughly addressed in the paper: the biases in the attention matrix run through a softmax which tames down the contributions of far tokens, which is like having a "soft window" of attention: instead of only attending to N neighbouring tokens, we attend with decreasing weighting. Effectively, this means that at a certain point where the position bias is negative enough, that token won't ever contribute meaningfully, making the effective context always of limited size regardless of the input length at inference.
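
A back-of-the-envelope sketch of that "soft window": if we ignore the content scores and look only at the bias, a token at distance d is weighted by exp(-m·d) relative to the current token, so it drops below a threshold ε once d > ln(1/ε)/m. The helper below and its default threshold are my own assumptions, and a large enough query-key score can of course offset the bias.

```python
import math


def effective_window(slope, threshold=1e-3):
    """Smallest distance d at which exp(-slope * d) falls below threshold.

    Only the position bias is considered; content scores are ignored.
    """
    return math.floor(math.log(1.0 / threshold) / slope) + 1


window_steep = effective_window(0.5)          # steep slope, short window
window_gentle = effective_window(1.0 / 256)   # gentle slope, long window
```

Smaller slopes give much wider windows, so the effective context depends heavily on the value of m, but for any fixed m it stays bounded no matter how long the input is.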

4. Finetuned Language Models Are Zero-Shot Learners (FLAN) | 👾 Code

By Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu et al.

❓Why? → Despite great progress, prompting in NLP is still not very robust: add a comma where it’s not supposed to be and the output might change completely. This paper shows how including labeled data during the autoregressive language modeling pre-training makes a model learn more robustly and transfer better to new tasks in a zero-shot setting.

💡Key insights → Zero-shot learning is one of the most promising areas of development within Machine Learning. The dream is clear: find a model and have it work on your domain on day 0 without requiring data collection, labeling, training, maintenance, monitoring drift, etc.

Large self-supervised Language Models are currently the leading candidates for eventually delivering on this dream (although there are many barriers to overcome). Especially since GPT-3 (more than a year old now!), prompting has become a key technique, and it’s here to stay. This paper investigates how large language models can be trained to be more robust and accurate with zero-shot natural language prompts, like those used with GPT-3. It’s unfair, though, to compare it head to head with GPT-3: this model includes labeled data during pre-training, but instead of finetuning the model directly on it, the authors use templates to create natural language expressions of each task (see figure below). GPT-3, in principle, did not include any labeled training data, although some datasets accidentally leaked into its pre-training corpus, exposing them to the model during pre-training [3].
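
To give a feel for what "using templates" means, here is a small sketch that turns one labeled natural language inference example into several instruction-style (prompt, target) pairs. The templates, label words and example are invented for illustration and are not the ones from the paper.

```python
# Hypothetical templates: several phrasings of the same underlying task.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis?\nOptions: {options}",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?\nOptions: {options}",
    "Read the following and decide whether the hypothesis follows.\n"
    "{premise}\nHypothesis: {hypothesis}\nOptions: {options}",
]

# Hypothetical mapping from integer class labels to answer strings.
LABEL_WORDS = {0: "yes", 1: "it is not possible to tell", 2: "no"}


def to_instruction_pairs(example, templates=NLI_TEMPLATES,
                         label_words=LABEL_WORDS):
    """Render one labeled example as (prompt, target) text pairs."""
    options = ", ".join(label_words.values())
    target = label_words[example["label"]]
    return [(template.format(premise=example["premise"],
                             hypothesis=example["hypothesis"],
                             options=options), target)
            for template in templates]


example = {
    "premise": "The cat slept on the sofa all afternoon.",
    "hypothesis": "The cat was resting.",
    "label": 0,
}
pairs = to_instruction_pairs(example)
```

Each pair is then just text in, text out, so it can be fed to an autoregressive model like any other training sequence.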

The result is a model that performs better than GPT-3 across many tasks and shows good generalization over tasks that were not included in the pre-training, although it's still far from a fully supervised model.

5. Multimodal Few-Shot Learning with Frozen Language Models

By Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals and Felix Hill.

❓Why? → More on prompting: now multimodal and async. Can you leverage the information in a pre-trained Language Model for vision tasks without re-training it? Well, sort of... keep reading.

💡Key insights → The idea this paper proposes is fairly simple: train a Language Model, freeze it such that its parameters remain fixed, and then train an image encoder to encode an image into a prompt for that language model to perform a specific task. I like to conceptualize it as “learning an image-conditional prompt (image through a NN) for the model to perform a task”.
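
As a rough sketch of how I picture the "image as prompt" step (not the paper's code): the image encoder output is projected into a handful of vectors with the same dimensionality as the language model's token embeddings, and those vectors are simply prepended to the text embeddings. All names, sizes and numbers below are toy assumptions; in the real setup only the image side would receive gradient updates.

```python
def image_to_prefix(image_features, projection, prefix_len, embed_dim):
    """Project image features and reshape them into prefix_len embeddings.

    projection holds one weight row per output value, i.e.
    prefix_len * embed_dim rows of len(image_features) weights each.
    """
    flat = [sum(f * w for f, w in zip(image_features, row))
            for row in projection]
    if len(flat) != prefix_len * embed_dim:
        raise ValueError("projection size does not match the prefix shape")
    return [flat[i * embed_dim:(i + 1) * embed_dim]
            for i in range(prefix_len)]


def build_lm_input(prefix, text_embeddings):
    """The frozen language model sees image 'tokens' followed by text tokens."""
    return prefix + text_embeddings


# Toy numbers: 2 image features -> 2 prefix vectors of dimension 3.
image_features = [1.0, 2.0]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
text_embeddings = [[0.1, 0.2, 0.3]]   # one already-embedded text token

prefix = image_to_prefix(image_features, projection, prefix_len=2, embed_dim=3)
lm_input = build_lm_input(prefix, text_embeddings)
```

From the language model's point of view, `lm_input` is just a slightly longer sequence of embeddings, which is why its weights can stay untouched.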

It's a promising research direction but the results are not very impressive (yet?) in terms of absolute performance. However, it is interesting to compare the model which is fully finetuned with multimodal data (Frozen finetuned) versus one that keeps the Language Model frozen (Frozen VQA-blind): only the latter shows good generalization from the training dataset (Conceptual Captions [4]) onto the target evaluation dataset (VQAv2 [5]), still being far from a fully supervised model.

Other works on large scale Transformers and Self Supervised Learning

On the Opportunities and Risks of Foundation Models by Rishi Bommasani, Percy Liang et al. made quite some buzz when it was released in August. This paper-book is a lot of things: a recap, introduction and position paper on the emerging field of large neural models trained without supervision on large amounts of data, for which they coin the name Foundation Models. While it's a compelling and comprehensive overview, covering both the technical and social impact of these models, it seems unclear whether they require a new nomenclature.

Primer: Searching for Efficient Transformers for Language Modeling by David R. So et al. proposes a method to search for a Transformer architecture that is maximally efficient while performing well, achieving around a 2x speedup compared to vanilla architectures. The resulting architecture has two main modifications in comparison to the standard Transformer: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention.
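
Both modifications are small enough to sketch in a few lines of plain Python. This is a toy illustration rather than the paper's code: the kernel width of 3 and the causal (look-back only) padding are assumptions I make here, and the kernels are hand-picked so the output is easy to check.

```python
def squared_relu(x):
    """ReLU followed by squaring."""
    return max(0.0, x) ** 2


def depthwise_causal_conv(sequence, kernels):
    """Convolve each channel along time with its own 1-D filter.

    sequence: list of timesteps, each a list of channel values.
    kernels[c]: filter for channel c; its last entry multiplies the
    current timestep and earlier entries multiply earlier timesteps.
    """
    width = len(kernels[0])
    out = []
    for t in range(len(sequence)):
        step = []
        for c, kernel in enumerate(kernels):
            acc = 0.0
            for k in range(width):
                src = t - (width - 1) + k
                if src >= 0:          # positions before the start act as zeros
                    acc += kernel[k] * sequence[src][c]
            step.append(acc)
        out.append(step)
    return out


sequence = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]   # 3 timesteps, 2 channels
kernels = [[0.0, 0.0, 1.0],    # channel 0: identity
           [1.0, 1.0, 1.0]]    # channel 1: sum of the last three timesteps

mixed = depthwise_causal_conv(sequence, kernels)
activated = [squared_relu(x) for x in [-2.0, 0.5, 3.0]]
```

"Depthwise" means channels never mix inside the convolution: channel 0 passes through unchanged while channel 1 becomes a running sum, each using only its own filter.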

6. ETA Prediction with Graph Neural Networks in Google Maps

By Austin Derrow-Pinion, Jennifer She, David Wong, Petar Veličković et al.

❓Why? → Ever wonder what goes on behind the scenes when Google Maps calculates how much time it’ll take you to go from point A to B? Here’s a glimpse of it…

💡Key insights → Once again, high quality data at scale is most of what you need. This paper describes the problem setting and formalization of estimating the time it will take for something to go from point A to point B, fully with Neural Networks. They basically:

  • Gather a huge dataset (hello, Google).

  • Represent the map of roads and paths as a graph with segments and intersections.

  • Apply a Graph Neural Network (GNN) to learn embeddings for each node + use those to do inference, training with the supervised data along with some auxiliary losses to regularize the training. The GNN consists of edges, nodes and global supersegment representations (embeddings) which are combined through aggregation functions (Neural Networks) which take as input the previous representations and output new representations which can be used to make a prediction. For instance, the edge representation will be used to estimate a time elapsed per segment given the previous node, edge and supersegment representation.
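
The steps above can be caricatured with a single round of message passing over a tiny road graph. This is a deliberately simplified toy, not the model from the paper: each segment carries one scalar feature (think "current travel time in seconds"), the mixing weights are hand-picked rather than learned, and the ETA is just the sum of per-segment outputs. Every name and number is hypothetical.

```python
def message_passing_step(node_feats, edges, self_w, neigh_w):
    """Each node mixes its own feature with the mean of incoming messages."""
    incoming = {node: [] for node in node_feats}
    for src, dst in edges:
        incoming[dst].append(node_feats[src])
    updated = {}
    for node, feat in node_feats.items():
        msgs = incoming[node]
        neigh = sum(msgs) / len(msgs) if msgs else 0.0
        updated[node] = max(0.0, self_w * feat + neigh_w * neigh)  # ReLU
    return updated


def predict_eta(node_feats, edges, rounds=1, self_w=1.0, neigh_w=0.5):
    """Run a few rounds, then read out per-segment times and their sum."""
    feats = dict(node_feats)
    for _ in range(rounds):
        feats = message_passing_step(feats, edges, self_w, neigh_w)
    return feats, sum(feats.values())


# A three-segment route a -> b -> c; messages flow along the edges.
segment_times = {"a": 30.0, "b": 60.0, "c": 20.0}
edges = [("a", "b"), ("b", "c")]

per_segment, eta = predict_eta(segment_times, edges)
```

In the real system the scalars would be learned embeddings, the mixing functions would be neural networks, and there would be separate edge and supersegment representations, but the "aggregate neighbours, update, read out" loop is the same shape.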

The gains over the existing production baseline from Google Maps are substantial, with cities like Sydney seeing a 40% improvement in ETA accuracy. Another fascinating aspect of this paper is its account of how a model like this can be deployed while meeting latency demands, which involves precomputing and caching predictions for several supersegments.