Zeta Alpha, 12th July 2021.
The AI trend of outrageously large language models is nowhere near its end, and many believe size does matter. One year ago, the release of GPT-3, with 175 billion parameters, got the AI community excited. In June 2021, Wu Dao 2.0 from the Beijing Academy of Artificial Intelligence broke the record with a multi-modal model ten times larger: 1.75 trillion parameters, highlighting China’s advances in AI research and stirring up news channels in the West. The parameter count is not a success by itself, but there are substantial technological advances behind the news. The underlying FastMoE paper is on arXiv, and the code is open-sourced on GitHub for the world to enjoy. We had the pleasure of interviewing two of the authors (Professor Jie Tang, one of the main drivers behind the project, and PhD student Jiaao He, first author and core developer of the FastMoE package) for a behind-the-scenes perspective.
Congratulations on the impressive technical work of building the FastMoE system and training it to build Wu Dao 2.0. Our readers in Europe would love to know more about the people and technical challenges behind the story. First of all, can you say a few things about your background and how you got into AI research?
Jie Tang is a Professor and the Associate Chair of the Department of Computer Science at Tsinghua University and a Fellow of the IEEE. After completing his PhD at Tsinghua University in 2006 and holding positions at Cornell, KU Leuven, and Microsoft Research Asia, he led the project AMiner.org, an AI-enabled research network analysis system, which has attracted more than 20 million users from 220 countries and regions around the world. Currently, he is “leading the Wu Dao project toward building a super-scale pre-trained model, which now already exceeds 1.75 trillion parameters.”
Jiaao He: “I am a PhD student in computer systems. I have been working on distributed training systems for several years. Basically, I am interested in parallel programming. I used to compete in Student Cluster Competitions. That is how I started to work in computer systems. In fact, I am not that interested in AI personally. I regard AI as an application that can be accelerated by my parallel programming techniques. In late 2020, I was introduced to the Wu Dao project by Prof. Zhai and Prof. Tang."
How did you become involved in this research project and what role did you play?
Jie Tang: “We are working on large-scale pretrained models, and we found that scaling model size is one of the simplest and most effective ways toward more powerful models. Mixture-of-Experts (MoE) really has a lot of potential for enlarging the size of pretrained models by an order of magnitude, to trillions of parameters. However, training a trillion-scale MoE requires a well-tuned, high-performance distributed training system. Unfortunately, the only existing platform that meets the requirements strongly depends on Google’s hardware (TPU) and software (Mesh TensorFlow) stack, and is not open and available to the public, especially to the GPU and PyTorch communities. Our FastMoE algorithm is the cornerstone of the WuDao project.”
Jiaao He: “Our initial motivation was providing system support to training larger models. Then, I found that there was an interesting opportunity to build a system for MoE models. Meanwhile, Jiezhong was looking forward to exploring MoE models, but was suffering from the lack of a good training system. Therefore, we worked together closely and developed FastMoE. I mainly did system design and implemented low level functions of FastMoE. Jiezhong and Aohan added a lot of features to make the system friendly to end users. Tiago, my colleague from Portugal, contributed to improving the training performance of FastMoE.”
Jie Tang: “FastMoE is a cornerstone algorithm block in WuDao 2.0. Without it, it would be very difficult to scale up the model to trillion-scale parameters.”
Figure: How FastMoE distributes computation across multiple worker nodes. Source: FastMoE Release Notes. https://github.com/laekov/fastmoe
What next steps would you like to explore in this research?
Jie Tang: “We are still working on FastMoE, adding more features and faster training. Compared to the GShard model, FastMoE lacks functionality to support load balancing among experts. Work on load-balance monitoring and support for a load-balancing loss is in progress. We are also trying to make the system more user-friendly through utilities such as loading and saving of MoE models.”
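For readers unfamiliar with the load-balancing loss Jie Tang mentions: GShard and the Switch Transformer add an auxiliary term that penalizes uneven expert usage. A minimal plain-Python sketch of that auxiliary loss (our own simplified formulation, not FastMoE's in-progress implementation) looks like this:

```python
# Toy sketch of the GShard/Switch-style auxiliary load-balancing loss:
# for each expert e, multiply the fraction of tokens routed to it (f_e)
# by the mean gate probability it receives (P_e), sum, and scale by the
# number of experts. A perfectly uniform assignment minimizes the loss.
def load_balance_loss(gate_probs, assignment, num_experts):
    """gate_probs: per-token list of probabilities over experts.
    assignment: the expert index actually chosen for each token."""
    n = len(assignment)
    loss = 0.0
    for e in range(num_experts):
        frac_tokens = sum(1 for a in assignment if a == e) / n  # f_e
        mean_prob = sum(p[e] for p in gate_probs) / n           # P_e
        loss += frac_tokens * mean_prob
    return num_experts * loss

# Balanced routing over two experts hits the minimum value of 1.0.
probs = [[0.5, 0.5], [0.5, 0.5]]
print(load_balance_loss(probs, [0, 1], 2))  # → 1.0
```

Because the token fractions and gate probabilities both concentrate when routing collapses onto a few experts, the loss grows above 1.0 exactly when the balance degrades, which is what makes it a useful training signal.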
Jiaao He adds: “We are also considering supporting diverse platforms and more flexible models. MoE reduces the computation in the model while leaving the parameter size unchanged. Therefore, it requires fewer resources to train a large model, but the reduced computation is highly irregular and requires more sophisticated system support. Take language processing, for example: every token has its own preference in expert selection, so the tensor’s batch dimension gets messed up.” “It is not theoretically easy to control the trade-off between bias and variance, and in practice it is not guaranteed that an MoE model, despite having more parameters, will perform better,” says Jie Tang. According to Jiaao He, “The general capability of MoE models is still unclear, and we are working on exploring this further.”
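The irregularity Jiaao He describes comes from the gating step: each token independently picks an expert, so consecutive tokens in a batch scatter to different experts. A plain-Python sketch of top-1 gating (all names are ours; a real MoE layer would use full feed-forward experts, not weight vectors) makes the effect visible:

```python
# Toy top-1 expert routing: score each token against a gate weight per
# expert and send it to the highest-scoring one. Per-token compute stays
# constant while total parameter count grows with the number of experts.
import random

random.seed(0)

NUM_EXPERTS = 4
DIM = 8

# Each "expert" here is just a weight vector; the gate has one score
# vector per expert.
gate = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route(token):
    """Top-1 gating: pick the expert with the highest gate score."""
    scores = [dot(g, token) for g in gate]
    return max(range(NUM_EXPERTS), key=lambda e: scores[e])

tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(6)]
assignment = [route(t) for t in tokens]

# The batch dimension "gets messed up": neighboring tokens land on
# different experts, so they cannot be processed as one dense batch.
print(assignment)
```

This scattering is precisely why MoE needs dedicated system support: the dense, regular matrix multiplications that GPUs love are replaced by data-dependent gather/scatter traffic between tokens and experts.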
What is the main difference between your FastMoE and earlier MoE implementations like Google Switch Transformer?
Jie Tang: “FastMoE is an alternative implementation of MoE based on PyTorch. The system provides a hierarchical interface for both flexible model design and easy adaptation to different applications, such as Transformer-XL and Megatron-LM. Unlike a direct implementation of MoE models in PyTorch, the training speed in FastMoE is highly optimized through sophisticated high-performance acceleration.”
Jiaao He: “The existing Google MoE system is highly dependent on TPU clusters and TensorFlow. For most researchers, PyTorch and GPUs are the most familiar and easy-to-use training platform. FastMoE aims at providing everyone with an easy and convenient MoE training platform. We use efficient computation and communication methods. For example, we batch tokens together before dispatching them to the experts, and gained up to a 47x speedup against processing them one by one.”
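The batching trick Jiaao He mentions can be sketched in a few lines of plain Python (our own toy illustration, not FastMoE's actual API): gather all tokens assigned to the same expert, run that expert once over the whole group, then scatter the results back into the original token order.

```python
# Toy dispatch/combine for an MoE layer: one batched expert call per
# expert instead of one call per token.
from collections import defaultdict

def dispatch(assignment):
    """Group token indices by the expert the gate assigned them to."""
    groups = defaultdict(list)
    for i, e in enumerate(assignment):
        groups[e].append(i)
    return groups

def expert_fn(expert_id, batch):
    # Stand-in for a real expert network: one call handles a whole batch.
    return [x * (expert_id + 1) for x in batch]

def moe_forward(tokens, assignment):
    groups = dispatch(assignment)
    out = [None] * len(tokens)
    for e, idxs in groups.items():
        # One batched call per expert, not one call per token.
        results = expert_fn(e, [tokens[i] for i in idxs])
        for i, r in zip(idxs, results):
            out[i] = r  # scatter results back into original token order
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
assignment = [0, 1, 0, 1]  # expert chosen by the gate for each token
print(moe_forward(tokens, assignment))  # → [1.0, 4.0, 3.0, 8.0]
```

On a GPU, the batched call amortizes kernel-launch and memory-access overhead across all tokens in a group, which is where speedups on the order of the reported 47x against one-by-one processing come from.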
What are the biggest challenges that you had to overcome to achieve the current results and what was your approach?
Jie Tang: “Our goal is to build the largest neural network and also the most powerful pre-training model. The underlying approach is the general language model (GLM - https://arxiv.org/abs/2103.10360), in which we are trying to unify all pre-training tasks into a single framework.”
Jiaao He’s challenges are more in the implementation: “In the beginning, I found that PyTorch lacked support for such a highly customized computation task. Therefore, I had to develop some modules in C and CUDA. Unfortunately, PyTorch's documentation on its C API is not as clear and detailed as the Python documentation. Also, PyTorch's C code is spread across several places. I spent a lot of time reading its documentation and code to get everything to work”.
What is the Beijing AI Ecosystem like, and how does it differ from the US and Europe?
Jie Tang: “In the Beijing AI Ecosystem we foster an innovative environment between academia and industry. For example, Beijing Academy of AI (BAAI) is a platform to provide fundamental support to AI industry development and AI applications to improve people’s life, and to promote sustainable development of human beings, environment and intelligence.”
Jiaao He: “At Tsinghua University, there are frequent talks about AI. Students from different departments are looking for opportunities to introduce AI into their own research. There are also quite a few AI companies nearby, such as SenseTime and Megvii. Academia and industry do communicate and collaborate a lot.”
The pace and volume of progress in AI is exceptional. How do you personally stay up to date on the latest research and what kind of tools do you use?
Jiaao He: “I read the newest papers and articles frequently. Sometimes my advisor and friends share what they are reading with me. We check out the websites of top conferences as soon as the papers are released. In system conferences, there are not as many papers published as in AI conferences. So, there we can go through the list of all papers and find out the ones that we are interested in. I also sometimes go through arXiv and receive recommendations from Google Scholar.”
Jie Tang, on the other hand, mainly uses AMiner.cn, the system he helped develop, to keep up with the latest research.
What is your dream of how AI can help researchers like you?
Jiaao He chuckles: “Maybe they can help me do paper reading, writing and coding, then I will lose my job.”
Jie Tang is more optimistic: “I hope AI can help me find the important information that I should not have missed, digest the information and trace the origin of the information.”
Do you believe we are on the path to General Artificial Intelligence and how do we make sure humanity will benefit?
Jie Tang: “Yes, I am quite sure and very confident that we will one day have General Artificial Intelligence. I am not sure how humanity will benefit from it, but simply believe that technologies will advance and make this happen.”
Interestingly, the two researchers take quite the opposite view here. Jiaao He says: “I am personally pessimistic about the so-called general AI. We develop models because we want to get our tasks done, but not to create something identical to human beings, or ourselves. As our models are already stronger than human beings in some aspects, they are going to be used in a wide range of areas in various forms. The need to integrate and wrap some of their functions in a human-like shell is in my view only entertainment.”
Thank you for the interview. At Zeta Alpha, we are looking forward to further advances in open-source super-large language models enabled by this cutting-edge work, and to how they can enable more accurate NLP applications.