A Transformers at Work recap, OpenAI's Whisper transcription model, Nvidia's new RTX 4090 GPUs, how Google trains models to improve Ads' Click-Through-Rate, an open-source GitHub Copilot competitor, a survey on Diffusion Models to catch up with all the craze, an in-depth look at Mixture of Experts, and much more.
The beginning of this academic year marks a special time for us: our company has turned 3 years old, we've moved back to Science Park in the LAB42 building, and we've got so much to cover about what happened in the past few weeks in AI.
Firstly, a brief reflection on our recent 3rd edition of the Transformers at Work workshop, which meant tons of insight, engagement, and fun. Satisfyingly, its name, Transformers at Work, has aged like fine wine since our first edition in early 2020: the Transformer architecture has taken over most Deep Learning applications, and accordingly, the topics covered in the workshop were the broadest we've ever had: Research at Zeta Alpha, Multimodality, Retrieval Augmented Language Models, pushing the limits of what GPT-3 can do, Vision Transformers, Efficient Transformers, Transfer Learning... All were covered by world-renowned researchers.
At the end of this post, you'll find the links to revisit the talks with some added editorial commentary.
Now, onto the latest news and research from the summer. Here are the stories we chose to highlight:
Nvidia Announces Next-Gen RTX 4090 and RTX 4080 GPUs with up to 76 billion transistors and 24GB of GDDR6X memory, and, according to Nvidia, roughly a 2-4x speedup with respect to the 3090 Ti models.
ACT-1: Transformer for Actions. Adept, a small AI research company formed by high-profile researchers in the field, announced a new Transformer model that can follow instructions given in natural language to perform complex tasks in the software on your computer. This has the potential to become a powerful general-purpose automation tool with a ridiculously fast learning curve, but there's still no public version of the system and all examples we've seen so far are cherry-picked, so the jury is still out on how robust and useful this model actually is.
On the research side, here's a rundown of our highlighted papers covering audio transcription, RL, reasoning, Mixture of Experts, neural Information Retrieval, collaborative computing, and more.
Faithful Reasoning Using Large Language Models by Antonia Creswell and Murray Shanahan. This work is part of a line of research we've been highlighting recently: the use of advanced prompting to improve the reasoning capabilities of large pretrained language models (e.g. chain of thought, etc.). In this case, two Language Models optimized for differing purposes interact to complete a prompt: a Selection model (S) generates text with pieces of evidence that might be relevant to complete the prompt, then an Inference model (I) follows up with a conclusion that can be drawn from the text snippet generated by the Selection model. This process can be sampled to generate various Reasoning Traces, and beam search, guided by a value function model (i.e. a Language Model fine-tuned with correct/incorrect reasoning samples to estimate the value of a reasoning step), can then be applied to select the best one. In terms of results, perhaps the most interesting bit is how well this method performs when the depth of the reasoning increases. This shows once again how fertile the ground is when it comes to research on prompt engineering large Language Models.
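To make the search over Reasoning Traces more concrete, here is a minimal sketch of beam search guided by a value function. It is not the authors' implementation: the rules, `toy_propose_steps` (standing in for the Selection and Inference models), `toy_value` (standing in for the fine-tuned value model), and `toy_is_done` are all hypothetical toy functions we made up for illustration.

```python
from typing import Callable, List

Trace = List[str]  # a reasoning trace: the conclusions inferred so far


def beam_search_reasoning(
    context: List[str],
    propose_steps: Callable[[List[str], Trace], List[str]],
    value: Callable[[Trace], float],
    is_done: Callable[[Trace], bool],
    beam_width: int = 2,
    max_depth: int = 3,
) -> Trace:
    """Keep the `beam_width` highest-valued traces at every reasoning depth."""
    beams: List[Trace] = [[]]
    for _ in range(max_depth):
        candidates: List[Trace] = []
        for trace in beams:
            if is_done(trace):
                candidates.append(trace)  # finished traces are carried over as-is
                continue
            for step in propose_steps(context, trace):
                candidates.append(trace + [step])
        if not candidates:
            break
        candidates.sort(key=value, reverse=True)
        beams = candidates[:beam_width]
    return max(beams, key=value)


# Hypothetical toy problem: rules are (premise, conclusion) pairs.
RULES = [("rain", "wet ground"), ("wet ground", "slippery road"), ("rain", "clouds")]
FACTS = ["rain"]
GOAL = "slippery road"


def toy_propose_steps(context: List[str], trace: Trace) -> List[str]:
    # "Selection": pick rules whose premise is already known.
    # "Inference": emit the conclusion of each selected rule if it is new.
    known = set(context) | set(trace)
    return [concl for prem, concl in RULES if prem in known and concl not in known]


def toy_value(trace: Trace) -> float:
    # Stand-in for the value model: reward reaching the goal, prefer short traces.
    return (1.0 if GOAL in trace else 0.0) - 0.01 * len(trace)


def toy_is_done(trace: Trace) -> bool:
    return GOAL in trace


best_trace = beam_search_reasoning(FACTS, toy_propose_steps, toy_value, toy_is_done)
print(best_trace)  # ['wet ground', 'slippery road']
```

In the real system each of the three toy functions would be a call to a Language Model, which is where all the difficulty (and cost) lies; the search loop itself stays this simple.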
Transformers are Sample Efficient World Models by Vincent Micheli, Eloi Alonso, and François Fleuret. Arguably, sample efficiency is still one of the most disappointing facets of applying Deep Learning to RL, which is why sample-efficient RL is a key research area in the field. IRIS (Imagination with auto-Regression over an Inner Speech) is a data-efficient agent that learns in a world model composed of a discrete autoencoder and an autoregressive Transformer. An encoder maps a state (pixels) into a sequence of discrete tokens (emphasis on discrete!), which is then fed into the Transformer to model the environment dynamics. Once again, this shows that as long as you know how to tokenize something into a sequence of discrete tokens, a Transformer is a powerful tool for predicting that sequence. With the equivalent of only 2 hours of gameplay on the Atari 100k benchmark, IRIS reaches a mean human-normalized score slightly above 1 and outperforms humans on 10 of the 26 games. You can find the code in the repo: https://github.com/eloialonso/iris.
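As an illustration of what "tokenizing into discrete tokens" can mean for continuous inputs, below is a minimal sketch of nearest-neighbor lookup in a codebook, the quantization step commonly used in discrete autoencoders. We are assuming this general recipe for the sake of the example; the codebook values and feature vectors are made up, and the details of IRIS's actual encoder are in the paper and repo rather than here.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous feature vector to the index of its closest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vec, codebook[k]))
        for vec in features
    ]


# Hypothetical 3-entry codebook and three encoder outputs (e.g. one per image patch).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8)]

tokens = quantize(features, codebook)
print(tokens)  # [0, 1, 2]
```

The resulting list of integers plays the same role as word-piece ids in NLP: it is the sequence an autoregressive Transformer can be trained to continue.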
Petals: Collaborative Inference and Fine-tuning of Large Models by Alexander Borzunov et al. This paper is a technical report on how to run inference and fine-tuning of large models (such as the 176B parameter BLOOM) in a distributed, heterogeneous fashion, where several peers share compute resources over a network. The implementation is done with the Hivemind library, and the ballpark inference speed that can be achieved in this limited setting is around 1 second per inference step, which is too slow for real-time applications (e.g. chatbots) but still workable for offline generation of text. Collaborative computing could become a small microeconomy of its own with initiatives like this one to democratize the use of large Neural Networks, although it will have a tough time competing with the expensive-but-performant dedicated data centers from the likes of Amazon, Google, and Microsoft.
Diffusion Models: A Comprehensive Survey of Methods and Applications by Ling Yang et al. If you missed the meteoric rise of Diffusion Models over the past two years, this is a perfect opportunity to catch up. Despite remaining weaknesses, such as generating coherent text, this modeling paradigm has taken over image generation, and applying it to other domains such as NLP or RL is an active area of research. This paper provides an overview with a proposed taxonomy to help make sense of the space, which considers algorithmic details, application areas, and the relationship to other generative modeling techniques (VAEs, GANs, normalizing flows, etc.).
Out of One, Many: Using Language Models to Simulate Human Samples by Lisa P. Argyle et al. A line of research that we've highlighted previously is that of using foundation models to generate synthetic data to aid learning in domains or tasks where labeled data is hard or expensive to collect. This work turns this idea around to explore how well GPT-3's bias can model subpopulation preferences identified in social sciences research, such as political inclinations and speech. For instance, when self-identified Democrats and Republicans are asked to describe each other, human answers and GPT-3's answers correlate extremely well. While this line of research is bound to be controversial and should be treated with care, it opens the possibility of using large Language Models as another lens to investigate social phenomena that manifest themselves in text.
Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford, Jong Wook Kim, et al. OpenAI's latest release follows the usual recipe: a simple modeling technique paired with high-quality data collection at scale. A key drawback of existing transcription models is that they're not robust outside of the benchmark datasets: superhuman performance on LibriSpeech was achieved back in 2015, but humans were still far superior in real-world conditions. A key piece to solving this puzzle is to follow the same principle used in GPT-3: pre-train on large-scale dataset-agnostic data, and follow up with benchmarking on specific datasets in a zero-shot setting. This setup reduces overfitting to dataset-specific phenomena and results in more robust modeling.
A Review of Sparse Expert Models in Deep Learning by William Fedus, Jeff Dean, and Barret Zoph. We've been highlighting MoE works for a while now: they're a promising approach to continue scaling up models. Intuitively, the motivation behind them is clear: only activate sub-parts of the model for each inference step such that we can simultaneously increase model capacity and keep the cost of inference low. The survey covers history (MoEs date back to long before the Deep Learning revolution!), scaling properties in upstream and downstream performance, routing algorithms, load balancing, hardware... The verdict is that sparse MoEs are very useful when training with data parallelism across many GPU/CPU nodes, but they're not worth the hassle when the full model parameters fit in a single compute node. Many possibilities still remain uninvestigated, such as the use of more heterogeneous architectures across experts or whether the optimal number of experts varies predictably with the task being solved. A key driver of progress in this space will be the maturing of the software stack to create, train, and deploy these models, which will lower the entry barrier for doing research on the topic.
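The "only activate sub-parts of the model" idea can be sketched in a few lines as top-k routing. This is an illustrative toy under our own assumptions, not code from the survey: the router weights and the "experts" (plain scaling functions) are made up, and renormalizing the gate values over the selected experts is just one of several choices discussed in the literature.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    router_weights: Sequence[Sequence[float]],
    experts: Sequence[Expert],
    k: int = 2,
) -> Tuple[Vector, List[int]]:
    """Run only the k experts the router scores highest and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    probs = softmax(logits)
    top_k = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top_k)  # renormalize gates over chosen experts
    output = [0.0] * len(x)
    for i in top_k:
        gate = probs[i] / norm
        expert_out = experts[i](x)  # experts outside top_k are never evaluated
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top_k


def make_scaling_expert(factor: float) -> Expert:
    # Hypothetical "expert": a real one would be a feed-forward sub-network.
    return lambda v: [factor * vi for vi in v]


experts = [make_scaling_expert(f) for f in (2.0, -1.0, 10.0)]
router_weights = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.0]]  # one row per expert

output, chosen = moe_forward([1.0, 0.0], router_weights, experts, k=2)
print(chosen)  # [0, 2]
```

With 3 experts and k=2 the saving is negligible, but the same loop with thousands of experts and k=1 or 2 is what lets parameter count grow while per-token compute stays roughly constant.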
On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models by Rohan Anil et al. Click-Through-Rate (CTR) is the holy grail of internet advertising companies like Google: in the context of a business model where advertisers pay per click, CTR is the direct multiplying factor that converts impressions into revenue, which is why Google et al. care so much about predicting CTR performance. How is this done at scale? Of course, Google doesn't reveal enough to fully replicate their secret ad sauce, but the paper is enough to get a rough idea of how things work under the hood. Ad recommendations fall into the online learning paradigm (online as in "real-time", not "on the internet"), in which there's no static evaluation benchmark; performance is instead continuously measured. In order to find the ideal Neural Network for prediction, AutoML is used to perform architecture search during training. But the optimization target in this problem is complex because it needs to include constraints such as training speed or serving cost, which is why the training loop includes an RL-based controller module that decides which sub-networks (from the neural architecture search) to sample and evaluate. The paper details many more design considerations such as the loss function engineering, dealing with UI biases, dealing with "reproducibility", etc. In conclusion, this work is a glimpse at what goes into training and running real-time recommendation systems at scale.
Promptagator: Few-shot Dense Retrieval From 8 Examples by Zhuyun Dai, Vincent Y. Zhao, Ji Ma et al. This work tackles the challenge of domain transfer in Neural Information Retrieval. It's a well-documented phenomenon that neural IR performs much better than traditional keyword-based retrieval systems when the training and evaluation domains are the same. However, neural retrievers struggle much more when evaluated in out-of-domain settings, which is why training on large-scale generic datasets such as MS MARCO became the go-to recipe to train robust neural retrievers. In this work, the authors instead propose to use large LMs to generate domain-specific datasets from a collection of documents, by generating relevant queries linked to each document. With this synthetic data, neural retrievers can be trained end-to-end on the target domain without the need for annotations, and they perform strongly when compared to models that leverage large annotated datasets like MS MARCO or Natural Questions. The spirit behind this is very similar to previous work such as InPars, which performs data augmentation using large LMs, and it brings neural IR closer to the dream of robust and smart neural semantic IR systems on arbitrary document collections.
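The core recipe, turning a handful of (document, query) examples into a few-shot prompt and asking a large LM for a query about each unlabeled document, can be sketched as follows. Everything here is a hypothetical illustration: the "Document:/Query:" prompt format, the function names, and especially `fake_generate`, which stands in for a call to a real Language Model and is not part of any library.

```python
from typing import Callable, List, Sequence, Tuple


def build_prompt(examples: Sequence[Tuple[str, str]], document: str) -> str:
    """Few-shot prompt: labeled (document, query) shots, then the new document."""
    shots = [f"Document: {doc}\nQuery: {query}" for doc, query in examples]
    shots.append(f"Document: {document}\nQuery:")
    return "\n\n".join(shots)


def synthesize_training_pairs(
    documents: Sequence[str],
    examples: Sequence[Tuple[str, str]],
    generate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return synthetic (query, document) pairs to train a retriever on."""
    pairs = []
    for doc in documents:
        query = generate(build_prompt(examples, doc)).strip()
        if query:  # skip documents for which the model produced nothing
            pairs.append((query, doc))
    return pairs


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a large LM: echoes the first words of the last document.
    last_doc = prompt.rsplit("Document: ", 1)[1].split("\n")[0]
    return " query about " + " ".join(last_doc.split()[:3]) + " "


examples = [
    ("Sparse expert models route tokens to sub-networks.", "how do mixture of experts work"),
    ("Adapters add small trainable layers to a frozen model.", "what are transformer adapters"),
]
documents = ["Diffusion models generate images by denoising.", "Dense retrievers embed queries."]

pairs = synthesize_training_pairs(documents, examples, fake_generate)
for query, doc in pairs:
    print(query, "->", doc)
```

In practice the generated queries are noisy, so some form of filtering before training the retriever is usually worth considering; we leave that out of the sketch.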
Transformers at Work 2022
Revisit the talks from our workshop on September 16, 2022.
Introduction by Jakub Zavrel (Founder and CEO) and Sergi Castella (Analyst) covering company news, a bit of history on transformers, and an overview of the topics covered at the event.
Marzieh Fadaee — "From Transformers to Work: Advances in Neural Search"
Covering research we've done at Zeta Alpha in the past year, answering questions such as: Why neural? How do we leverage large Language Models to generate diverse and abundant training data? What about multilingual data? How does in-domain vs. out-of-domain evaluation compare? How do distilled models perform compared to large ones?
Gautier Izacard — "Retrieval Augmented Language Models"
Unlike regular Language Models, which store all their knowledge implicitly in their parameters, Retrieval Augmented Language Models leverage an explicit memory component coupled with a retrieval module to access useful information at inference time. This mechanism presents several advantages such as the ability to update the memory without retraining the model or much more parameter-efficient text generation.
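Here is a minimal sketch of the mechanism, under heavy simplifying assumptions of our own: the "memory" is a plain list of passages, the retriever ranks by word overlap rather than with a learned dense model, and the language model itself is left out (we only build the input it would receive). The passages and function names are made up for illustration.

```python
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(query: str, memory: List[str], k: int = 1) -> List[str]:
    """Rank passages by word overlap with the query (a stand-in for a neural retriever)."""
    query_terms = tokenize(query)
    ranked = sorted(memory, key=lambda p: len(query_terms & tokenize(p)), reverse=True)
    return ranked[:k]


def build_model_input(query: str, memory: List[str], k: int = 1) -> str:
    """Prepend the retrieved passages to the question before calling the LM."""
    context = "\n".join(f"context: {p}" for p in retrieve(query, memory, k))
    return f"{context}\nquestion: {query}"


memory = ["the capital of france is paris", "whisper is a speech recognition model"]
question = "when did the workshop take place"

before = retrieve(question, memory)  # nothing relevant is in memory yet

# Updating knowledge is just appending to the memory: no retraining involved.
memory.append("the workshop took place on september 16 2022")
after = retrieve(question, memory)

print(build_model_input(question, memory))
```

The contrast between `before` and `after` is the point: the model's parameters never change, yet what it can answer does.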
Rodrigo Nogueira — "The Prompting Power of Large Language Models"
This talk is guaranteed to raise some eyebrows. Rodrigo walks us through what modern large Language Models such as GPT-3 are capable of and what they fail at. For instance, they mostly succeed at translating from imaginary languages, zero-shot summarization, translation, and even at reasoning with the right prompting techniques, but will still fail when dealing with numerical IDs, answering nonsensical questions, or taking into account massively long contexts such as full books. To conclude, Rodrigo reflects on the future of these models: specialized models vs. general purpose and costs of training and running them.
Ahmet Üstün — "Transformer Adapters and Hyper-Networks"
Transfer Learning took off with the advent of large-scale self-supervised pre-training using various forms of language modeling tasks (e.g. BERT with Masked Language Modelling or GPT with autoregressive Language Modelling). Fine-tuning the whole model on the downstream task has been the standard recipe for achieving great performance, but this technique is very inefficient. Ahmet introduces his work with adapters and hypernetworks, which enable high-performing transfer learning through much more parameter-efficient mechanisms to adapt a pre-trained Transformer to a downstream task.
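For readers new to adapters, the sketch below shows the widely used bottleneck design: project the hidden state down to a small dimension, apply a nonlinearity, project back up, and add the result to the original hidden state. This is the generic recipe as we understand it, not necessarily the exact architecture presented in the talk; the dimensions and weights are made up, and biases are omitted for brevity.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector: Vector) -> Vector:
    return [max(0.0, x) for x in vector]


def adapter_forward(hidden: Vector, w_down: Matrix, w_up: Matrix) -> Vector:
    """Bottleneck adapter: hidden + W_up * relu(W_down * hidden)."""
    bottleneck = relu(matvec(w_down, hidden))  # d -> r, with r much smaller than d
    delta = matvec(w_up, bottleneck)           # r -> d
    return [h + dh for h, dh in zip(hidden, delta)]  # residual connection


def adapter_param_count(d: int, r: int) -> int:
    return 2 * d * r  # down- and up-projection weights, biases omitted


hidden = [1.0, -2.0, 0.5, 3.0]                        # d = 4
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # r = 2
w_up = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]]

print(adapter_forward(hidden, w_down, w_up))  # [1.5, -2.0, 0.5, 3.0]
print(adapter_param_count(768, 16), "adapter weights vs", 768 * 768, "in one dense 768x768 layer")
```

Only `w_down` and `w_up` are trained while the pre-trained Transformer stays frozen, and initializing `w_up` to zeros makes the adapter start out as the identity function.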
Auke Wiggers — "Efficient Transformers: towards Deployment on Edge Devices"
While a large portion of the ML research community is focused on scaling up models, Qualcomm is busy shrinking them to sizes that can run on edge devices such as smartphones. Auke walks us through recent works by his colleagues that focus on neural data compression, efficient video super-resolution with gated local self attention, and Transformer quantization for faster runtime and reduced power consumption.
Cees Snoek — "Computer Vision Transformers"
After years of leading the Deep Learning revolution, Computer Vision is now playing catch-up with the architectural advancements that were originally introduced in the context of Natural Language Processing: Transformers. In this talk, Cees walks us through the history of vision from its biological evolutionary origins up to the latest advancements in Vision Transformers for object recognition, tracking, and segmentation such as Swin Transformers, DETR, BoxeR, and more. Are the classical vision inductive biases championed by CNNs still relevant? What are the weak aspects of state-of-the-art models?
Our monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector, and stay tuned for the next one!