From 1.75 Trillion Parameters Models to GitHub Copilot - Best in AI —July 2021

Jakub Zavrel
Jul 1, 2021

The global race to ever bigger Language Models, starring Mixtures of Experts, distributed learning from Yandex and Hugging Face, SpeechBrain and more. And will OpenAI-powered GitHub Copilot change computer programming? Here's our monthly selection of recent ML news, research and code gaining traction:

We're halfway through 2021, and the ML-sphere keeps spinning: the Conference on Computer Vision and Pattern Recognition (CVPR 2021) was just held, GitHub and OpenAI released Copilot, an unprecedentedly intelligent code completion assistant, and much more happened in the last few weeks. Zeta Alpha is happy to help you discover the latest AI research and software and keep you up-to-date. Enjoy!

🗞 Some News

The trend of outrageously large models is nowhere near an end. One year ago, the release of OpenAI’s GPT-3 left the AI community flabbergasted with its 175 Billion parameters. This month it was the turn of Wu Dao 2.0 to break the record, showing that China is not lagging behind at all when it comes to pouring resources into AI research. Wu Dao is a massive multimodal (text and images) model with 1.75 Trillion parameters, based on a Mixture of Experts architecture (more on that later!). While the official press release only scratched the surface of the model and not much is public about it, the paper outlining the system used to train it, FastMoE: A Fast Mixture-of-Expert Training System, is on arXiv and the code is open-sourced on GitHub. Wish OpenAI would do more of that.

While Wu Dao is not open to the public, GPT-J is: the best zero-shot performing publicly available GPT Transformer to date (at 6B parameters), recently released by Ben Wang and Aran Komatsuzaki. It is built with JAX, yet another boost to the library, which has been slowly but steadily gaining popularity over the last 2 years.

Finally, GitHub Copilot was released just a few days ago: a plugin that brings next-generation code synthesis, based on Codex, a GPT-like model from OpenAI trained on a massive dataset of public GitHub code. But the announcement leads to a dashing landing page with cherry-picked examples, and a public demo is still not available. Many questions are still in the air: how big is the model, and how fast can it run inference? What are the details of the training dataset used? Should we be concerned about copyright-protected data being accidentally surfaced by the model, as has been shown previously⁵? This Twitter thread sheds some light on the topic, and we’re impatient to try it ourselves… It has the potential to make programming 10x more productive and to democratize writing code, but for that it has to work really, really well. And we know that bug-free code does not exist. Would it be easier than bringing self-driving cars to the road?

🔬Research

The Computer Vision and Pattern Recognition conference (CVPR 2021), volunteer computing, more transformer variants for NLP, vision and even tabular data. Here are the highlights!

Distributed Deep Learning in Open Collaborations | 👾 Code

By Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin et al.

❓Why → Cars spend almost all their lifetime parked, and similarly, a big chunk of the world’s compute is standing idle most of the time. This paper shows how volunteer computing — where different parties volunteer to provide compute resources — can be used to successfully train a large Language Model.

💡Key insights → Volunteer Computing (VC) is a paradigm where various parties collaborate by providing compute resources for a common algorithm (e.g. people letting their personal computers run stuff for someone else while they’re sleeping). Though promising, VC is no free lunch: making it work in practice takes many careful design decisions, because you can assume very little about participants. Their internet connections vary in speed and latency, and their hardware ranges from mobile chips to high-end multi-GPU nodes. You want to be able to use a heterogeneous group of volunteers, setting a low bar for requirements, but you also want to avoid levelling all resources down to the lowest common denominator.

This paper does an excellent job of presenting the existing paradigms for distributed computing while going through their main trade-offs, such as node communication vs. computation. Based on this analysis, the authors propose DeDLOC, where peers perform training steps on microbatches independently and asynchronously, accumulate the weight gradients for a fraction of the network, and aggregate them at a certain frequency to update the state of the whole model. Effectively, this is equivalent to training the whole model on very large batches. The way nodes communicate to synchronize on the global state of the model falls somewhere in between the “parameter server” and “all-reduce” models (i.e. one central node does the aggregation, or all nodes do the aggregation by themselves). All these hyperparameters, such as how often and which nodes communicate with each other, are optimized to maximize training throughput.

Source: https://arxiv.org/pdf/2106.10207.pdf

The implementation is done with Hivemind, a PyTorch library specialized in Volunteer Computing (which is described more in depth below).
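To make the "equivalent to very large batches" point concrete, here is a minimal NumPy sketch under my own simplifying assumptions: a toy linear model, three hypothetical peers of different speeds, and a plain sum as the aggregation step. It is not the paper's protocol and does not use the Hivemind API; names like `peer_accumulate` are made up for illustration.

```python
import numpy as np

def grad_sum(w, X, y):
    """Sum of per-sample squared-error gradients for a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y)

def peer_accumulate(w, X, y, microbatch=4):
    """One peer: accumulate gradients over its own microbatches, no communication."""
    total = np.zeros_like(w)
    for start in range(0, len(X), microbatch):
        stop = start + microbatch
        total += grad_sum(w, X[start:stop], y[start:stop])
    return total, len(X)

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(96, 4)), rng.normal(size=96), rng.normal(size=4)

# Hypothetical peers of different speeds: each one gets through a different
# number of samples before the aggregation step.
shards = [slice(0, 8), slice(8, 40), slice(40, 96)]
results = [peer_accumulate(w, X[s], y[s]) for s in shards]

# Aggregation: sum the accumulated gradients and normalize by the total sample count.
total_grad = sum(g for g, _ in results)
total_samples = sum(n for _, n in results)
collaborative_grad = total_grad / total_samples

# Reference: the gradient of one big batch containing all the samples.
large_batch_grad = grad_sum(w, X, y) / len(X)
print(np.allclose(collaborative_grad, large_batch_grad))
```

Because the gradient of a sum is the sum of the gradients, slow peers simply contribute fewer samples to the same large batch instead of holding everyone else back.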

Scaling Vision with Sparse Mixture of Experts

By Carlos Riquelme, Joan Puigcerver, Basil Mustafa, et al.

❓Why → Mixture of Experts is becoming the go-to technique for scaling models to outrageous sizes: the key advantage is the possibility of increasing the model parameters while keeping the inference computational cost constant.

💡Key insights → In a nutshell, a Mixture of Experts is a model where an input is routed to different submodels at inference time: the computational cost of inference is determined by the computation path actually used, whereas the model’s expressive power is determined by the total number of parameters; sort of getting the best of both worlds.

Mixture of Experts had previously been shown to be effective for Language Model transformers such as Switch Transformers¹ and in systems like FastMoE², but had not yet been applied at this scale to images. The model is almost identical to the original ViT³: it divides the image into patches, projects them linearly into patch embeddings and runs them through a transformer as a sequence. In this case, however, regular ViT layers are interleaved with MoE ViT layers, where the MLP feed-forward layer is replaced by a set of k MLP experts preceded by a router that sends each image patch to a different expert depending on the value of the input. Experts and routers are all trained by gradient descent, with carefully designed loss functions that incentivize variety among the experts during training and avoid collapse modes such as having only one active expert.
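The routing idea can be sketched in a few lines of NumPy. This is a toy illustration under my own assumptions, not the paper's implementation: the router is a single linear layer followed by a softmax, each patch goes to only its top-1 expert, and there are no auxiliary load-balancing losses. All sizes and names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_model, d_hidden, num_experts = 16, 8, 32, 4

# Router: one linear layer producing a logit per expert (an assumed, simplified design).
router_w = rng.normal(size=(d_model, num_experts))
# Each expert is an independent two-layer MLP.
experts = [
    (rng.normal(size=(d_model, d_hidden)) * 0.1,
     rng.normal(size=(d_hidden, d_model)) * 0.1)
    for _ in range(num_experts)
]

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x):
    """Send every patch embedding through the single expert its router prefers."""
    gates = softmax(x @ router_w)        # (num_patches, num_experts)
    chosen = gates.argmax(axis=-1)       # top-1 expert index per patch
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = chosen == e
        if mask.any():
            hidden = np.maximum(x[mask] @ w_in, 0.0)              # ReLU MLP
            out[mask] = gates[mask, e][:, None] * (hidden @ w_out)
    return out, chosen

patches = rng.normal(size=(num_patches, d_model))
output, chosen = moe_layer(patches)

# Parameters grow with the number of experts, compute per patch does not.
total_expert_params = num_experts * 2 * d_model * d_hidden
active_params_per_patch = 2 * d_model * d_hidden
print(total_expert_params, active_params_per_patch)
```

Adding experts multiplies `total_expert_params` while `active_params_per_patch` stays put, which is the whole appeal of the approach.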

Perhaps the most surprising result on MoE applied to huge neural networks is that learning efficiency (i.e. the amount of compute necessary to train a model to a certain performance) is significantly improved with respect to the original ViT.

Source: https://arxiv.org/pdf/2106.05974.pdf

Other recent work compares CNNs and Transformers for Computer Vision: VOLO: Vision Outlooker for Visual Recognition and its popular implementation on GitHub.

An Attention Free Transformer

By Shuangfei Zhai et al.

❓Why → Joining last month’s surge of MLP-based architectures for CV, evidence keeps accumulating that there’s not that much special about attention itself. As long as a network models the interaction of its inputs in some way, gradient descent will just find its way given sufficient parameters and data.

💡Key insights → Instead of computing attention as the conventional matrix product between query and key matrices, the authors propose a simple learned pairwise bias w that is added to the keys matrix and passed through an exponential, where t indexes the sequence element and all products are element-wise instead of dot products.
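As I read the paper, the full version of the mechanism computes, for each position t, Y_t = sigmoid(Q_t) * [sum over t' of exp(K_t' + w_t,t') * V_t'] / [sum over t' of exp(K_t' + w_t,t')], with every product taken element-wise. Here is a minimal NumPy sketch of that expression; the function name `aft_full`, the shapes and the random inputs are my own choices for illustration, not the authors' code.

```python
import numpy as np

def aft_full(Q, K, V, w):
    """Attention-free mixing for Q, K, V of shape (T, d) and pairwise bias w of shape (T, T)."""
    # logits[t, t', :] = K[t', :] + w[t, t']
    logits = K[None, :, :] + w[:, :, None]                    # (T, T, d)
    # Subtracting the max over t' cancels in the ratio and avoids overflow.
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)    # normalize over t'
    context = (weights * V[None, :, :]).sum(axis=1)           # (T, d)
    return 1.0 / (1.0 + np.exp(-Q)) * context                 # element-wise gate by sigmoid(Q)

rng = np.random.default_rng(0)
T, d = 6, 4
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
w = rng.normal(size=(T, T))      # stands in for the learned pairwise position bias
Y = aft_full(Q, K, V, w)
print(Y.shape)
```

Note that no T x T attention matrix per head is ever formed from query-key dot products: the only pairwise term is the bias w, and everything else is element-wise along the feature dimension.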

The computational cost is not reduced when computing the full version of this attention-free attention (only the memory footprint shrinks), and in my opinion the most interesting bit is the fact that this works just fine: results are not SOTA, but they're just high enough to raise some eyebrows. The experiments are conducted both on images (with autoregressive image modelling) and text (autoregressive language modelling), showing the versatility of the mechanism.

You might also like… Charformer: Fast Character Transformers via Gradient-based Subword Tokenization, which is in a similar spirit to ByT5⁶, a byte-level T5 Language Model published last month that performs surprisingly well.

Revisiting Deep Learning Models for Tabular Data

By Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov and Artem Babenko.

❓Why → Tabular data might be one of the modalities with the biggest gap in interest between academic research, where it receives little attention nowadays, and industry application, where it’s ubiquitous. Perhaps because the latest and greatest from Deep Learning doesn’t do so well here…? In any case, this paper is the best head-to-head comparison of approaches for tabular data I’ve seen in a long time!

💡Key insights → Gradient Boosting Decision Trees (GBDTs) have been around since 2001⁴ and popular implementations such as XGBoost⁷ are widely used for good reason: Deep Learning methods still don’t quite outperform them robustly and across the board.

This work presents detailed numerical results on many tabular datasets and for several algorithms and, importantly, without extra optimizations and tricks for the neural networks: just however they work out-of-the-box. The most insightful bit, in my opinion, is the set of experiments they perform using synthetic data. The authors generate synthetic tabular datasets using heuristics that accommodate either GBDT-style decision rules or a Neural Network regression. Mixing the two to different degrees gives a spectrum of datasets, with each technique expected to outperform the other at its own end. In the corresponding figure in the paper, the leftmost side is the error on a NN-friendly dataset and the rightmost the error on a GBDT-friendly dataset. As expected, ResNet and CatBoost show a clear trade-off between the two, but the Transformer-based classifier seems to be a jack of all trades, master of none.

Thinking Like Transformers

By Gail Weiss, Yoav Goldberg and Eran Yahav.

❓Why → This paper is different and fun. Having new ways to think and talk about known stuff is essential for developing new ideas, and this is an excellent example. As a bonus, while the devil’s in the details, it seems like Attention is Turing-Complete⁹ (sort of-ish?).

💡Key insights → Restricted Access Sequence Processing Language (RASP) is a programming language that naturally expresses the computation that a transformer performs. The gist goes as follows: RASP models a transformer as any algorithm that manipulates sequences of length n and matrices of size n x n. An input sequence can be transformed by element-wise operations and/or by selecting and aggregating elements whose relationship is modelled by a sort of attention matrix (aggregator). And that's pretty much it: you can solve many tasks using only these primitives, as they show in the paper, and they map nicely onto Transformer computations and can be compiled into each other.

One of the most interesting insights is how restricted-attention transformers (e.g. efficient transformers) can be expressed in the RASP formalism (e.g. by setting the aggregator matrix to False in certain regions), which necessarily weakens their computational expressivity. The authors showcase this in synthetic tasks such as sorting, where only full transformers succeed.

Other synthetic tasks used to showcase RASP and its use to predict and understand how a Transformer performs computations are reversing a string, making histograms, double-histograms, sorting alphabetically, returning the most frequent token, and identifying Dyck-k languages.
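To get a feel for the select-and-aggregate style, here is a small pure-Python emulation of two of these tasks, reversing a sequence and building a histogram. It is only a sketch of the idea and not the RASP language itself: the helper names are borrowed loosely from the paper's primitives, and their exact behaviour here (for instance how `aggregate` handles several selected values) is my own simplification.

```python
def select(keys, queries, predicate):
    """Boolean n x n matrix: entry [q][k] is True when predicate(key, query) holds."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values):
    """For each query position, combine the values picked by its row of the selector.

    Simplification: numeric values are averaged, and a single selected
    non-numeric value is passed through unchanged.
    """
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        if not picked:
            out.append(None)
        elif all(isinstance(v, (int, float)) for v in picked):
            out.append(sum(picked) / len(picked))
        elif len(picked) == 1:
            out.append(picked[0])
        else:
            raise ValueError("cannot average several non-numeric values")
    return out

def selector_width(selector):
    """Number of selected positions in each row."""
    return [sum(row) for row in selector]

def reverse(tokens):
    n = len(tokens)
    indices = list(range(n))
    opposite = [n - 1 - i for i in indices]
    # Position i attends only to position n - 1 - i.
    flip = select(indices, opposite, lambda k, q: k == q)
    return aggregate(flip, tokens)

def histogram(tokens):
    # Each position attends to every position holding the same token.
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)

print("".join(reverse(list("hello"))))   # olleh
print(histogram(list("hello")))          # [1, 1, 2, 2, 1]
```

Restricting attention, in this picture, amounts to forcing some entries of the boolean matrix to False no matter what the predicate says, which is why certain programs stop being expressible.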

Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better

By Gaurav Menghani.

❓Why → While not presenting any novel contribution, this is an excellent introduction from an engineer’s perspective that covers relevant techniques for efficient DL which every practitioner should know.

💡Key insights → There are many aspects to efficiency in DL. Perhaps the most important distinction to make is that between training, where academic research spends most resources, and inference, where large-scale applications spend most resources; and this survey covers both cases.

For instance, this work introduces the reader to techniques such as data augmentation, distillation or transfer learning which have a positive impact mostly on training efficiency. In parallel, other techniques explained such as quantisation or pruning are generally used to boost inference efficiency.
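As a rough illustration of the two inference-side techniques, here is a minimal NumPy sketch of symmetric int8 quantization and magnitude pruning on a random weight matrix. These are textbook variants picked by me for brevity, not recipes taken from the survey, and real toolchains add calibration, per-channel scales, fine-tuning and so on.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: int8 values plus one float scale.

    Assumes the tensor is not all zeros (the scale would be 0 otherwise).
    """
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights)

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(weights)
max_error = np.abs(dequantize(q, scale) - weights).max()
pruned = magnitude_prune(weights, sparsity=0.5)

print(q.dtype, weights.nbytes // q.nbytes)        # int8 storage is 4x smaller than float32
print(max_error <= scale / 2 + 1e-6)              # rounding error is bounded by half a step
print(np.count_nonzero(pruned == 0) / pruned.size)
```

Quantization trades a bounded rounding error for smaller and faster arithmetic, while pruning only pays off at inference time when the runtime can actually exploit the zeros.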

Other topics covered are hyperparameter optimization, efficient architectures and infrastructure considerations about frameworks like PyTorch Mobile or TensorFlow Lite, which are a key ingredient of the ecosystem. Among the most useful bits of the survey are the rules of thumb and recipes recommended at the end of each section, which help ground each technique in its use cases.

Surveys are probably the best way to know what’s happening in a research area where you’re not an expert. Here are other recent surveys that serve as an excellent entry point to those areas: Graph Neural Networks for Natural Language Processing: A Survey, A Survey of Transformers, A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios

🐍 Code

Here are some libraries and resources worth checking out.

👾 learning-at-home/hivemind ⭐️ 715 | 📄 Paper

👉 Wanna train huge models, like the ones Google and OpenAI train with proprietary technology, by crowdsourcing compute? Look no further.

🚀 Key features (from README)

Train NNs of arbitrary size.
Distributed training without a master node: Distributed Hash Table allows connecting computers in a decentralized network.
Fault-tolerant backpropagation: forward and backward passes succeed even if some nodes are unresponsive or take too long to respond.
Decentralized parameter averaging: iteratively aggregate updates from multiple workers without the need to synchronize across the entire network.
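The last feature, decentralized parameter averaging, is easy to picture with a toy simulation. The NumPy sketch below is my own simplification and does not use the hivemind API: in every round, peers are shuffled into small hypothetical groups and each group replaces its parameters with the group mean, so no step ever involves the whole network.

```python
import numpy as np

def averaging_round(params, rng, group_size):
    """Shuffle peers into small groups and average parameters within each group."""
    order = rng.permutation(len(params))
    new_params = params.copy()
    for start in range(0, len(order), group_size):
        group = order[start:start + group_size]
        new_params[group] = params[group].mean(axis=0)
    return new_params

rng = np.random.default_rng(0)
num_peers, dim, group_size = 16, 5, 4
params = rng.normal(size=(num_peers, dim))   # one parameter vector per peer
global_mean = params.mean(axis=0)

spread_before = np.abs(params - global_mean).max()
for _ in range(10):
    params = averaging_round(params, rng, group_size)
spread_after = np.abs(params - global_mean).max()

print(spread_before, spread_after)
print(np.allclose(params.mean(axis=0), global_mean))
```

Group averaging never changes the overall mean, so after a handful of rounds every peer drifts towards the same global average even though no peer ever talked to all the others at once.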

📈 More frameworks and libraries for distributed training…

👾 microsoft/DeepSpeed ⭐️ 5.2k | 🌐 Website 👉 A framework to perform distributed training for extreme-scale Machine Learning models such as Microsoft’s Turing-NLG (17B parameters).

👾 horovod/horovod ⭐️ 11.4k | 🌐 Website | 📄 Paper 👉 A distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

👾 facebookresearch/fairscale ⭐️ 1.2k 👉 A PyTorch extension library for high performance and large scale training.

👾 pytorch/xla ⭐️ 1.4k 👉 A Python package that uses the XLA deep learning compiler to connect the PyTorch deep learning framework and Cloud TPUs.

👾 speechbrain/speechbrain ⭐️ 2.5k

👉 An open-source and all-in-one speech toolkit based on PyTorch.

🚀 Key features (from README)

Domains and tasks covered: Speech Recognition, Feature Extraction and Augmentation, Speaker recognition, identification and diarization, Speech enhancement and separation and Multi-microphone processing.
Multiple pre-trained models integrated with 🤗huggingface.
Abstractions such as the Brain class to remove unnecessary details of modes while keeping training and evaluation fully customizable.
Multi-GPU training via PyTorch Data-Parallel or Distributed Data-Parallel and mixed-precision training.
Transparent and customizable input and output pipelines: native PyTorch dataloaders, downsampling, BPE tokenization, etc.

➕ A new similar library worth checking out: sooftware/OpenSpeech

🤖 Popular Recent Transformer implementations…

👾 facebookresearch/xcit ⭐️ 415 | 📄 Paper 👉 Implementation of the Cross-Covariance Image Transformer (XCiT)⁸

👾 kzl/decision-transformer ⭐️ 661 | 📄 Paper 👉 Reinforcement Learning via Sequence Modeling.

👾 NVlabs/SegFormer ⭐️ 403 | 📄 Paper 👉 Semantic Segmentation with Transformers.

👾 OATML/non-parametric-transformers ⭐️ 217 | 📄 Paper 👉 Processing an entire dataset at a time and using datapoints instead of parameters.

👾 compphoto/BoostingMonocularDepth @CVPR21 | Project page

👉 High resolution monocular depth estimation.

🚀 This implementation is so cool we had to include it. It implements the CVPR 2021 paper Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging¹⁰.

Our monthly selection ends here; if you want to keep up to date with the latest research, follow us on Twitter @zetavector. See you soon in the next one!

References

[1] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — By William Fedus, Barret Zoph and Noam Shazeer, 2021.

[2] FastMoE: A Fast Mixture-of-Expert Training System — By Jiaao He et al. 2021.

[3] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale— By Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai et al. 2021.

[4] Greedy Function Approximation: A Gradient Boosting Machine — By J.H. Friedman, 2001.

[5] Extracting Training Data from Large Language Models — By Nicholas Carlini et al. 2020.

[6] ByT5: Towards a token-free future with pre-trained byte-to-byte models — By Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou et al. 2021.

[7] XGBoost: A Scalable Tree Boosting System — By Tianqi Chen, Carlos Guestrin, 2016.

[8] XCiT: Cross-Covariance Image Transformers — By Alaaeldin El-Nouby et al. 2021.

[9] Attention is Turing-Complete — By Jorge Pérez, Pablo Barceló and Javier Marinkovic, 2021.

[10] Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging — By S. Mahdi H. Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris and Yağız Aksoy, 2021.