Best in AI November Edition — Like GPT-3 but you can *actually* use it!
Updated: Nov 3, 2021
EMNLP 2021 kicks off next week, MuJoCo is no longer expensive, Microsoft claims the crown for the biggest Transformer yet, Hugging Face's BigScience Workshop shares its first paper and model, advances in differentiable rendering march us towards the Metaverse, and much more.
After a slow summer break, the ML world has been back at full speed for the past month: conferences are returning to an in-person format, parameter count records are being broken, DeepMind is playing the Robin Hood of Reinforcement Learning, and a GPT-3-like model (T0) has been published and open-sourced.
Nonetheless, as we approach the end of 2021, publication growth on AI arXiv seems to be slowing down for the first time ever: after several years of consistent exponential growth (roughly 30-40% per year), it looks like the number of publications in 2021 will top that of 2020 by only a slim margin (around 10% more). Will we see a strong surge for NeurIPS and ICLR? Or has AI research mellowed?
Let's begin with some of the hot news from the past few weeks:
The Empirical Methods in Natural Language Processing conference (EMNLP) is happening 7–11 November in a hybrid format: simultaneously online and in Punta Cana, Dominican Republic. The official open proceedings will be published shortly in the ACL anthology.
DeepMind acquired MuJoCo and open sourced it. It’s a big deal because MuJoCo is one of the most widely used physics simulators for robotics and RL, and it used to be pricey. While big universities bought licenses for their students and staff, the cost raised the entry barrier for the playfully curious.
Microsoft’s Megatron-Turing NLG, a 530B parameter model. But wait... it's still only a hand-wavy blog post! They claim it's the largest monolithic Transformer to date. What the heck is monolithic, you might think? Well, that's a way of saying that all parameters are used for every input, unlike Mixture of Experts (MoE) models such as Wu Dao’s 1.75 trillion parameter model or the trillion-scale Switch Transformer, where only a small subset of the parameters is activated during each inference/training step. While the sheer size seems pretty incredible, we'll have to wait until they share a more in-depth account of their work. Speaking of parameter counts, will we ever stop caring about them?
State of AI Report for 2021 was recently released by AI investors Nathan Benaich and Ian Hogarth. It provides a useful yearly executive summary on AI from a birds eye perspective: research, industry, talent, politics and predictions. Definitely worth a read!
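To make the "monolithic" versus Mixture of Experts distinction from the Microsoft item above more concrete, here is a toy parameter count in plain Python. The layer sizes, the number of experts and the one-expert-per-token routing are illustrative assumptions, not details of Megatron-Turing NLG, Wu Dao or Switch Transformer.

```python
# Toy parameter accounting: a dense ("monolithic") feed-forward layer
# vs. a Mixture of Experts layer. All sizes are hypothetical.

def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one two-layer feed-forward block."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

def dense_layer(d_model: int, d_ff: int) -> dict:
    total = ffn_params(d_model, d_ff)
    # Every parameter takes part in every token's computation.
    return {"total": total, "active_per_token": total}

def moe_layer(d_model: int, d_ff: int, num_experts: int,
              experts_per_token: int = 1) -> dict:
    expert = ffn_params(d_model, d_ff)
    router = d_model * num_experts  # linear layer that scores each expert
    return {
        "total": num_experts * expert + router,
        # Only the routed experts (plus the router) are used for a token.
        "active_per_token": experts_per_token * expert + router,
    }

dense = dense_layer(1024, 4096)
moe = moe_layer(1024, 4096, num_experts=64)
print("dense:", dense)
print("moe:  ", moe)
```

The MoE layer stores roughly 64 times more parameters than the dense one, yet each token touches about the same number of them, which is why headline parameter counts of the two families are not directly comparable.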
If you want to try out big attention-based architectures for computer vision, it's your lucky day, because Scenic was recently released: a codebase (with lots of boilerplate code and examples) for running JAX models for computer vision, including several popular ones such as the original Vision Transformer, ViViT and many more.
If your thing is playing with generative models for images, check out VQGAN-CLIP, a repo for running the popular generative model that turns a natural language sentence into an image.
Finally, we suggest you check out Dagster, an "orchestration platform for development, production, and observation of data assets".
And now, here’s our selection of impactful recent papers.
Recursively Summarizing Books with Human Feedback
By OpenAI et al.
❓Why → Very long-document summarization (e.g. book scale) is a hard task for machines largely because annotating data is terribly time consuming: to annotate 1 instance or example, a person needs to read a book and come up with a summary of it, which takes several hours.
💡 Key insights → Long-range summarization can be (somewhat) successfully broken down into hierarchical summarization tasks that are way cheaper to annotate: split a book into chunks, then summarize each chunk. Concatenate those summaries and summarize them again. Apply this process recursively until the desired length for a full book summary is reached.
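The recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the chunk size, the target length and especially `keep_every_fifth_word` (a crude stand-in for the learned summarization model) are all assumptions made for the example.

```python
from typing import Callable, List

def split_into_chunks(words: List[str], chunk_size: int) -> List[List[str]]:
    """Cut a list of words into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def recursive_summarize(text: str,
                        summarize_chunk: Callable[[str], str],
                        chunk_size: int = 1000,
                        target_words: int = 200) -> str:
    """Summarize chunks, concatenate the summaries, and repeat until short enough."""
    words = text.split()
    while len(words) > target_words:
        chunks = split_into_chunks(words, chunk_size)
        summaries = [summarize_chunk(" ".join(chunk)) for chunk in chunks]
        new_words = " ".join(summaries).split()
        if len(new_words) >= len(words):
            break  # the summarizer no longer compresses, so stop
        words = new_words
    return " ".join(words)

def keep_every_fifth_word(chunk: str) -> str:
    """Hypothetical placeholder for a learned summarizer (about 5:1 compression)."""
    return " ".join(chunk.split()[::5])

# A fake 100K-word "book".
book = " ".join(f"w{i}" for i in range(100_000))
summary = recursive_summarize(book, keep_every_fifth_word)
print(len(summary.split()), "words in the final summary")
```

In a real system `summarize_chunk` would be a model call, and the appeal of the decomposition is that each call (and each human annotation) only ever deals with a chunk-sized input.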
To give a sense of the scale of the data involved: 40 books used, 100K words on average, mostly fiction, and each summarization subtask compresses to a ratio of approximately 5–10 to 1.
The results of this process are still far from human quality: only 5% of the summaries reach a comparable quality. Interestingly, model size seems to play an important role, as summaries from their biggest model clearly outperform those from a smaller model that followed the same training procedure.
In conclusion, this is yet again a really impressive, big and complex human-in-the-loop effort for training big models. It’s still far from generating that “wow this is spookily good” feel, but it’s a start. I’m thinking, next up: how can this be translated into a few-shot setting where only very few or very sparse annotations from humans are needed?
Multitask Prompted Training Enables Zero-Shot Task Generalization
By Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach. et al.
❓Why → Outrageously large models research has been mostly limited to companies with big budgets. This is the first paper from the Hugging Face BigScience Workshop that proposes a collaborative effort to make large scale ML viable for smaller institutions such as universities. In all fairness, this is not the first large GPT-3 like model to be open sourced (e.g. check out GPT-J) but this is bound to be influential.
💡Key insights → We’re talking about an 11 billion parameter model, completely open-sourced and accessible via 🤗Hugging Face.
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp")
You can check all the details on the project's forum and in its GitHub repo, which includes detailed descriptions of the training of each variant of the model.
The model is a T5-style¹ encoder-decoder Transformer (unlike GPT-3’s decoder-only architecture) trained with autoregressive language modeling to predict the next token. However, the training set is now curated more carefully: besides large web crawls of general language use, the authors propose including labeled NLP tasks expressed as natural language prompts. For instance, a sentence classification task for movie reviews has annotations such as
The film had a superb plot, enhanced by the excellent work from the main actor. | Positive
which a template converts to:
The film had a superb plot, enhanced by the excellent work from the main actor. It was <great/amazing/fantastic...>.
To avoid over-optimizing for a narrow set of templates, these are sourced from multiple people (36) to maximize variety, resulting in dozens of alternative templates for many NLP tasks.
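Here is a rough sketch of how a labeled example can be turned into a prompted one by picking among several templates. The three templates and their label words are hypothetical and written only for this illustration; they are not the templates collected for T0.

```python
import random

# Hypothetical templates for a sentiment task. Each entry pairs an input
# pattern with a mapping from the original label to a target word.
TEMPLATES = [
    ("{text} It was", {"Positive": "great", "Negative": "terrible"}),
    ("Review: {text}\nIs this review positive or negative?",
     {"Positive": "positive", "Negative": "negative"}),
    ("{text}\nDid the reviewer like the film?",
     {"Positive": "yes", "Negative": "no"}),
]

def to_prompted_example(text: str, label: str, rng: random.Random) -> dict:
    """Render one labeled example with a randomly chosen template."""
    pattern, label_words = rng.choice(TEMPLATES)
    return {"input": pattern.format(text=text), "target": label_words[label]}

rng = random.Random(0)
review = ("The film had a superb plot, enhanced by the excellent work "
          "from the main actor.")
example = to_prompted_example(review, "Positive", rng)
print(example["input"])
print("->", example["target"])
```

Sampling a different template each time an example is seen is one simple way to keep the model from latching onto a single phrasing of the task.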
The result is that, despite being 16x smaller than GPT-3, T0 outperforms it on most tasks, even though the training sets for those tasks were not seen during training.
Here's a summary of the key results. The different variants of T0 reflect which datasets were included during training: T0 excludes all datasets that GPT-3 used for evaluation, T0+ adds the datasets used in evaluation (only the training split, so blinding to the test set is still guaranteed) and T0++ adds the datasets in SuperGLUE on top of T0+.
If you read last month's blog, you might've noticed that this approach is very similar to FLAN by Google, published just a few weeks ago. The authors address this work thoroughly and T0 still has a lot going for it: T0 and its +/++ variants have comparable or better performance while being more than 10x smaller (137B vs. 11B params!!!). Key differences between the two works are:
T0 uses an encoder-decoder that was pretrained with MLM vs. the decoder-only FLAN (MLM has been shown to be a far more efficient pretraining approach, although it’s not well suited to autoregressive generation, hence the encoder-decoder strategy that builds on MLM-pretrained representations)
More diverse prompts
Holding out multiple tasks at once vs. a single task at a time
P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks
By Xiao Liu, Kaixuan Ji, Yicheng Fu et al.
❓Why → It hasn’t been a year since continuous p-tuning/prompt-tuning/prefix-tuning was proposed, and it has already become a viable alternative to finetuning in many tasks and a blossoming corner of ML research. This is its latest revision, showing strength in tasks where p-tuning was struggling before.
💡 Key insights → If anyone still had doubts about prompt tuning (e.g. that it doesn't work well for small frozen models, or that it's bad for specific tasks such as hard sequence tagging), this paper should clear them up. For those late to the party, p-tuning (also known as prefix-tuning, soft or continuous prompt-tuning) is a technique for adapting a pretrained model to a particular task without changing the pretrained model parameters. Instead, it consists of learning, via gradient descent, a few continuous embeddings that act as a fixed prefix to any input. This has been shown to perform very well with Transformers trained on autoregressive language modeling and is more parameter efficient (i.e. only a very small number of parameters needs to be learned for a specific task compared to full finetuning).
The further step the authors take in this work is to add "depth" to prompts, that is, adding prompts at different layers of a Transformer. While this increases the trainable parameter count, it improves performance while keeping the trainable prompt parameters in the range of 0.1-3% of the total model parameters. The prompts at different layers are independent of each other: each one is trained as its own set of parameters at its layer instead of being computed by the forward pass of the previous Transformer layer.
Here's a summary of the main results. Expect to see p-tuning applied to other tasks in the near future!
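To get a feel for what adding depth does to the trainable parameter count, here is a back-of-the-envelope sketch. The model configuration (24 layers, hidden size 1024, about 335M parameters) and the prompt length of 20 are hypothetical numbers chosen for illustration, and the count is simplified: real implementations may store more than one vector per prompt position and layer.

```python
def prompt_params(prompt_len: int, hidden_size: int,
                  num_layers: int, deep: bool) -> int:
    """Trainable prompt parameters.

    Shallow prompt tuning learns one set of embeddings at the input only;
    deep prompt tuning learns an independent set for every layer.
    """
    layers_with_prompts = num_layers if deep else 1
    return prompt_len * hidden_size * layers_with_prompts

# Hypothetical frozen model configuration.
NUM_LAYERS, HIDDEN_SIZE, TOTAL_PARAMS = 24, 1024, 335_000_000
PROMPT_LEN = 20

shallow_count = prompt_params(PROMPT_LEN, HIDDEN_SIZE, NUM_LAYERS, deep=False)
deep_count = prompt_params(PROMPT_LEN, HIDDEN_SIZE, NUM_LAYERS, deep=True)

for name, count in [("shallow", shallow_count), ("deep", deep_count)]:
    share = 100 * count / TOTAL_PARAMS
    print(f"{name}: {count:,} trainable parameters ({share:.3f}% of the model)")
```

Even with a prompt at every layer, the trainable part stays a tiny fraction of the frozen model, which is the whole appeal compared to full finetuning.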
Exploring the Limits of Large Scale Pre-training
By Samira Abnar, Mostafa Dehghani, Behnam Neyshabur and Hanie Sedghi.
❓Why → Scale has been a persistent topic of discussion within ML circles. We have been including papers on this topic for many months now, because it is definitely one of the important questions the field has to grapple with: where will adding parameters and data stop being... useful? Keep reading.
💡Key insights → Pretty much: “As we increase the upstream accuracy, the performance of downstream tasks saturates”.
Okay, so the gist of this paper is simple: they study how pre-training performance on an Upstream (US) task (e.g. large-scale ImageNet labels) transfers to Downstream (DS) performance (e.g. whale detection). They then run this experiment for a lot (and by a lot we mean a lot) of architectures and sizes: “4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data” 🤑💸
The interesting plots compare Upstream (US) performance, i.e. performance on the pre-training task, with Downstream (DS) performance, i.e. performance on the evaluation task. Pretty much across the board, DS performance eventually saturates. Still, the differences across computer vision architectures are super interesting!
The authors claim that their observations overall seem robust to choices such as the size of the upstream data or number of training shots, and architecture choices. They also explore the influence of hyper-parameter choices: are some hyper-parameters very good for US but don't translate well to DS? Yes! They dive deep into this phenomenon in section 4, and find that for instance, weight decay is a particularly salient hyperparameter that influences US and DS performance differently.
In a context where hardly anybody trains models from scratch and most people choose pre-trained models to bootstrap their application, this research is key. There's much more to the paper than what can be summarized in a few paragraphs, so it's definitely worth a read if you want to dive deeper!
A Few More Examples May Be Worth Billions of Parameters
By Yuval Kirstain, Patrick Lewis, Sebastian Riedel and Omer Levy.
❓Why → To annotate or to grow? This can be a common dilemma for ML practitioners deciding how to allocate resources: bigger pre-trained models or annotating more data. It depends!
💡Key insights → The main takeaway is that, in the context of NLP tasks, scaling parameters consistently yields performance improvements, whereas the contribution of additional annotations depends heavily on the task. For instance, in Open Question Answering datasets, adding annotations doesn't significantly improve performance, whereas in sentence classification or extractive question answering, it does. Here's the best summary figure for the findings of the paper. One would probably expect the heatmaps to have a gradient along the diagonal, with both size and annotations yielding performance improvements, but that's not what happens.
And that's pretty much it! To be fair, it’s not super comprehensive, and we’ll have to see how well these findings replicate on other modalities, but the question being addressed is undoubtedly relevant.
SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing
By Junyi Ao, Rui Wang, Long Zhou et al.
❓Why → NLP is often used almost as a synonym for text processing, but there’s so much more to natural language than text! Spoken language uses many more dimensions of expression than just its characters. Here’s an approach to model all that by leveraging the existing techniques that have been so successful in NLP for the past few years.
💡Key insights → Jointly learn text and speech representations by feeding a model both audio and text and training it in a self-supervised setting, with a task analogous to bidirectional Masked Language Modeling applied to sound. But applying MLM to audio is not as straightforward as it is with text: it involves pre-processing the audio into a suitable representation called log-Mel filterbank features and using quantized targets in this representation space, so that prediction can be framed as a classification task. Importantly, audio and text representations are combined and fed to the model jointly, allowing for modeling across modalities.
The results are state-of-the-art for some tasks, like voice conversion (VC) and Automatic Speech Recognition (ASR), and competitive on Text to Speech and on the speech-to-class task of Speaker Identification (SID).
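To illustrate why quantization turns masked prediction on audio into a classification problem, here is a toy sketch in plain Python. It is not SpeechT5's pipeline: the random vectors stand in for log-Mel frames, the random codebook stands in for a learned quantizer, and the function names are made up for this example.

```python
import random
from typing import Dict, List, Tuple

Frame = List[float]

def nearest_code(frame: Frame, codebook: List[Frame]) -> int:
    """Index of the codebook vector closest to the frame (squared distance)."""
    def sq_dist(k: int) -> float:
        return sum((f - c) ** 2 for f, c in zip(frame, codebook[k]))
    return min(range(len(codebook)), key=sq_dist)

def make_masked_example(frames: List[Frame], codebook: List[Frame],
                        span_start: int, span_len: int,
                        mask_value: float = 0.0
                        ) -> Tuple[List[Frame], Dict[int, int]]:
    """Mask a span of frames and return discrete targets for the masked positions."""
    masked = [list(f) for f in frames]
    targets: Dict[int, int] = {}
    for t in range(span_start, min(span_start + span_len, len(frames))):
        targets[t] = nearest_code(frames[t], codebook)  # class label to predict
        masked[t] = [mask_value] * len(frames[t])       # hide the real frame
    return masked, targets

rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(10)]
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]

masked, targets = make_masked_example(frames, codebook, span_start=3, span_len=4)
print("masked positions and their target classes:", targets)
```

Each masked position now has an integer target, so a model can be trained with an ordinary classification loss, just as masked tokens in text are predicted from a fixed vocabulary.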
ADOP: Approximate Differentiable One-Pixel Point Rendering
By Darius Rückert, Linus Franke and Marc Stamminger.
❓Why → Using Neural Networks to improve rendering at a reduced computational cost, compared to traditional techniques, is extremely exciting, especially at a time when the VR and AR sectors are slowly but steadily taking off (hello Meta). After all, Deep Learning might play a key role in rendering the metaverse...
💡 Key insights → Rendering a view of a scene (e.g. in a videogame or simulation) is an impressively complex process: 3D objects can be defined in several ways; lighting, occlusion, textures, transparencies and reflections interact in complicated ways; everything has to be rasterized into a pixel grid; and so on. Brute forcing these tasks is out of the question for low latency applications; instead, one must be smart about not computing things that don't need to be computed, like objects hidden behind opaque ones.
It turns out that most of the processes involved in rendering can be performed by differentiable modules, which means that one can use gradient descent to optimize them given an appropriate loss function. The main modules involved in rendering novel views of a scene are the rasterizer, the renderer and the tonemapper, as you can see in the figure below.
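As a toy illustration of what "differentiable" buys you, the sketch below fits a single rendering parameter by gradient descent. It is deliberately tiny and is not ADOP's pipeline: the "tonemapper" is just a linear exposure scaling, the target pixels are synthetic, and the learning rate and step count are arbitrary choices.

```python
from typing import List, Tuple

def tonemap(radiance: float, exposure: float) -> float:
    """Toy differentiable tonemapper: linear exposure scaling."""
    return exposure * radiance

def loss_and_grad(exposure: float, radiances: List[float],
                  targets: List[float]) -> Tuple[float, float]:
    """Mean squared error against target pixels and its derivative w.r.t. exposure."""
    n = len(radiances)
    errors = [tonemap(r, exposure) - t for r, t in zip(radiances, targets)]
    loss = sum(e ** 2 for e in errors) / n
    grad = sum(2 * e * r for e, r in zip(errors, radiances)) / n
    return loss, grad

# Synthetic "photograph": the same radiances rendered with exposure 1.5.
radiances = [0.2, 0.5, 0.9, 1.4]
targets = [1.5 * r for r in radiances]

exposure = 0.5        # initial guess
learning_rate = 0.1
for step in range(200):
    loss, grad = loss_and_grad(exposure, radiances, targets)
    exposure -= learning_rate * grad

print(f"recovered exposure: {exposure:.4f}, final loss: {loss:.2e}")
```

A full differentiable renderer applies the same idea at a much larger scale, with gradients flowing back through every module to whatever scene, camera or image parameters are being optimized.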
We can't go into too much detail because, in all honesty, the topic is a bit over our heads. Still, the video demos they provide are quite impressive and we can't wait for this kind of technology to be widely adopted by mainstream rendering technology.
On the ethics side of AI, this past month we've also seen a couple of papers we'd like to highlight:
Delphi: Towards Machine Ethics and Norms is a brave attempt at teaching a machine the intricacies of right and wrong. While the complexity of the task has eluded philosophical consensus for millennia, this work is a tangible step towards introducing ethical judgements into algorithms.
Systematic Inequalities in Language Technology Performance across the World’s Languages introduces a framework for estimating the "global utility" of language technologies and how it covers the diversity of languages around the world.
On the topic of information retrieval, Adversarial Retriever-Ranker for dense text retrieval is an exciting new approach to modeling the interaction between a retriever and a ranker in the two-stage retrieval setting: the retriever tries to fool the ranker with documents that "seem relevant" but aren't, while the ranker tries to surface the documents labeled as most relevant.
Our monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector. See you soon in the next one!
Finetuned Language Models Are Zero-Shot Learners. By Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu et al., 2021.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. By Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, 2019.
Prefix-Tuning: Optimizing Continuous Prompts for Generation. By Xiang Lisa Li, Percy Liang, 2021.
SCENIC: A JAX Library for Computer Vision Research and Beyond. By Mostafa Dehghani, Alexey Gritsenko, Anurag Arnab, Matthias Minderer, Yi Tay, 2021.
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. By Alexey Dosovitskiy et al., 2020.
ViViT: A Video Vision Transformer. By Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid, 2021.