200 languages within a single AI model: A breakthrough in high-quality machine translation for low resource languages. We interview Angela Fan, one of the key Research Scientists at Meta AI Research Paris working on this project.
This is the transcript of a video interview that you can watch on our YouTube channel.
We are so happy to have you here! To start off, can you tell us a little bit about what got you fascinated with AI and NLP early on, and how you came to work on machine translation at Meta AI Research?
What fascinated me about language is really reading. I've always been really interested in books, and I spent a lot of time growing up going to the library by myself. A lot of my early work was on how to use text generation technology to write books. I was really interested in storytelling and creative text generation and focused a lot on the generation of long text. Then I did my PhD in a similar area, really focusing on long text and creative writing. Afterwards, I was thinking, okay, so I focus a lot on text generation, but then how can I interact even more with my personal interests? One of the things that I always really think about is my ability to speak Chinese: I'm originally from China, I'm from Shanghai, but actually my standard Mandarin (what we kind of think of as standard Chinese) is not really that great, because it's not what I speak at home. So I almost never get a chance to practice it. It's only if I run into someone on the street, and we’re trying to get to know each other, that I speak it. At home I've always spoken Shanghainese, which is an oral-only low-resource language, but it still has at least 10 million people speaking it. So I was thinking how cool it would be to work on text generation technology for lower-resource languages. Now, of course, writing stories in low-resource languages is extremely challenging, because we don't even have the ability to write individual fluent sentences for many, many languages. So I started becoming very fascinated by machine translation. On the technical side, a lot of great advancements have been made in machine translation; the Transformer was originally proposed for translation. So I thought it was a great confluence of things, with a strong personal interest and drive to focus on the topic.
Low resource languages are one of the biggest challenges in machine translation. So you’ve previously worked on summarization, text generation, and some pretty fundamental work with Transformers. What are the main learnings you got from working on machine translation, and what makes it interesting?
Machine translation is a field that for a long time I found very intimidating. Actually, when I first started at Facebook AI Research, there was an opportunity to work on machine translation, and I thought, oh no, no, no, because it's such an intense field. But actually right now, I just find it really nice. The community is extremely strong in machine translation, and one of the things that I like is that there has been so much historical work done.
When I started working on story generation, I felt like a lot of it was nascent, and something to develop, whereas machine translation, you definitely get the feeling like you're joining a very established field. At the same time, you become very excited because I think machine translation is one of the most commercially successful applications of AI. For instance, I can't really imagine going on vacation to a new country without having Google Translate! So that makes it feel very good. This is something that, if we improved it, and if we made it possible for more languages, it would be immediately useful for many people. It really has a direct impact on people using it.
The Babel fish has always been the dream of AI, right? Putting this thing in your ear… and now it actually works! It's amazing.
Yeah, it's definitely a science fiction type application. I used to read a lot of space opera-type science fiction novels, and you know, a lot of them go like “you meet some aliens”, and magically, you all speak the same language, or at least you can communicate.
Let’s dive into the specifics of NLLB-200. What are some of the main problems that you have tried to solve in this project?
The first thing that we tried to do in NLLB is really understand, okay, what is the human problem? I think many researchers, particularly in AI, tend to jump to the solution. We wanted to really understand the problem at hand. So we started with interviewing a lot of low-resource language speakers from all over the world, asking: is this a technology that would actually be useful for you? What do you want? And it was really fascinating. So many people really echoed the fact that not having access to translation technology does block them from many opportunities. A lot of people talked about education. In one interview, the person was saying, you know, I could be the smartest kid in my country, but I just have access to a lot less literature or learning material. So that was the foundational point.
Then from there, we moved into a lot of questions. If we wanted to solve the low-resource problem, what are the major challenges? And so we invested very heavily in evaluation. One of the things that we're open sourcing is FLORES, a high-quality evaluation dataset that covers all 200 languages, because without measurement, it becomes very difficult to actually make advancements. Then we focused a lot on data: what defines low-resource languages is that the resources are not there.
So we contribute a small training dataset, which we call NLLB-seed. It’s meant to be a kind of starter seed data for a bunch of different languages where none exists, so that you have some sentences to just jumpstart a lot of your models. Then we focused on this large-scale automatic dataset creation work, which we also are fully open sourcing. That starts with the question: if you wanted to find data on the web for a language that you speak, how might you do it? And so it starts with having language identification systems, then cleaning the monolingual data. Finally, there is this really large-scale effort in bitext mining, where we mine translations, or pseudo-translations if you will.
Then we also focused on modeling, where we're thinking about what it would take to really increase translation quality for all 200 of these languages in one model. We also have a lot of modeling techniques, such as distillation, so that people can actually use these models on reasonable compute resources. If you have to wait five minutes for your translation, that's not really something that you're going to be using.
Finally, of course, there is extensive human evaluation, and we also focus on translation safety: we really want to make sure that the translations we produce are actually usable by people. So we have this effort in toxicity so that we produce safe translations for people.
In your blog, you say that NLLB-200 exceeds the previous state of the art by an average of 44% on BLEU score, right? That's truly an impressive breakthrough. So I'm assuming that's measured on this new FLORES benchmark?
For that number, the previous state of the art was only evaluated on the 100 languages of FLORES 100. So when we compare to previous work, it's on these languages, which include almost 10,000 directions. So that's on FLORES 100, averaged across every single possible translation pair. We're adding FLORES 200 now, but that couldn’t be compared with the previous state of the art.
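The arithmetic behind "almost 10,000 directions" can be sketched in a few lines of Python. This assumes that a direction is an ordered (source, target) pair and that the 44% figure is a relative improvement; the BLEU values below are hypothetical and chosen only to illustrate what a 44% relative gain looks like.

```python
def num_directions(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n languages."""
    return n_languages * (n_languages - 1)


def relative_improvement(new_score: float, old_score: float) -> float:
    """Relative gain of new_score over old_score, in percent."""
    return 100.0 * (new_score - old_score) / old_score


print(num_directions(100))  # 9900, i.e. "almost 10,000"
print(num_directions(200))  # 39800

# Hypothetical BLEU scores, not taken from the paper.
print(round(relative_improvement(28.8, 20.0), 1))  # 44.0
```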
What are the main technical innovations that are behind these new models? Especially when you compare it to your earlier models from 2020?
I would say the two primary technical drivers are our ability to have more high-quality data through this large-scale mining work, and a lot of modeling innovation. So maybe we could talk about data first. We invest heavily in what is called bitext mining. If I want to produce translations for German, well, people have worked on German for decades: you have the Workshop on Machine Translation, you have EU Parliament data, and just a lot of actual human German translations. However, if you want to work on a new language, let's say Assamese, there is content available on the web, but it might not be bilingual data.
So the way our bitext mining works is that we try and lift data from the web and then we try and say, okay, given any two sentences, are they a translation? The way we do this is by embedding all of the sentences in this multilingual representation space, which we call LASER-3, and then calculating the distance between those sentences. The major challenge is, well, how do you have sentence representations for low-resource languages? It's not like we have sufficient data to train those. So one of our major innovations here is creating this model that we call LASER-3. The idea is to start with a general model that's trained on many different languages but then try and specialize that representation space to extend to a new language. But you can't just train a new sentence encoder, because then your two spaces will be really misaligned. And so we focus on this teacher-student training approach, which is able to maintain the continuity of the embedding space while incorporating new languages very quickly, such that with very small amounts of data, we're able to train reasonable-quality sentence representations, which in turn enables us to mine bitext.
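The mining idea described here (embed sentences from two languages in one shared space, then keep pairs that are close) can be illustrated with a toy sketch. This is not the LASER-3 pipeline: the embeddings are hand-written 3-dimensional vectors, plain cosine similarity with a fixed threshold stands in for whatever scoring the real system uses, and `mine_bitext` and its `threshold` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_bitext(src_embs, tgt_embs, threshold=0.9):
    """For each source sentence, keep its nearest target if similar enough."""
    pairs = []
    for i, u in enumerate(src_embs):
        scores = [cosine(u, v) for v in tgt_embs]
        j = max(range(len(scores)), key=scores.__getitem__)
        if scores[j] >= threshold:
            pairs.append((i, j, scores[j]))
    return pairs


# Toy embeddings: in a real system these would come from a sentence encoder.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.95, 0.05, 0.0], [0.5, 0.5, 0.5]]

pairs = mine_bitext(src, tgt)
print([(i, j) for i, j, _ in pairs])  # [(0, 1), (1, 0)]
```

The third source sentence has no target above the threshold, so it is left unpaired. At web scale an exhaustive comparison like this would be replaced by approximate nearest-neighbor search.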
So in that sense, would you say the new data that you have to train these low resource languages is a kind of synthetic data? Or is it actual like verbatim literal data that you have but for which you only have monolingual samples?
Actually, we use a mix of data sources, of which we have three. The first is translation data that exists, for example, biblical translations that are available. We use publicly available datasets if they're licensed for research use, of course. The second is the human-curated dataset that we create ourselves called NLLB-seed, which is meant to be pretty high-quality human translation data; we actually also use that to train our sentence encoders, just to give the models a little bit of data that humans have translated into these languages. By the way, that's where our evaluation data comes from as well. The last is the use of monolingual data for our mining, where we take that monolingual data and convert it into bitext as I previously explained.
If we focus on the algorithmic advancements, I understood that the new model is actually a mixture of experts sparse model. Can you talk a little bit about that?
Yes. So one of the major questions of encoding so many languages is, how are you going to have a model that is able to have the capacity and the quality to represent all 200 languages? And it's not just the languages, it’s also the directions. We’re not just translating in and out of English; we want to translate in and out of Hindi or in and out of Chinese, like a lot of natural use cases. So our model actually has thousands of translation directions.
So there’s a fundamental model capacity challenge here. We started with training dense models, but we found that scaling dense models is very challenging. Because you have so many different languages, and a lot of the languages are low resource, they overfit extremely easily. It's a massive problem. We then focused a lot on sparse mixture-of-experts models. The upside of these models is that they have a lot of capacity while keeping inference costs lower, because you don't activate the full model to do a translation, whereas for dense models you're activating all of the parameters every time. So a dense model of that size is really inefficient to train and extremely slow in producing translations.
But then, you really confront the regularization problem. Because if you have a bunch of experts, and tons of expert capacity, then the model can really easily memorize the training data. So a lot of our advancements are in improved regularization and optimization for these models, so that we're able to train them and balance between our thousands of different tasks.
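The point that a sparse model does not "activate the full model" can be shown with a minimal sketch of a mixture-of-experts layer. Everything here is illustrative: top-2 routing, the gate weights, and the toy "experts" (which just scale their input) are assumptions made for the example, not details given in the interview.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x, gate_weights, experts, top_k=2):
    """Route x to its top_k experts and mix their outputs by gate probability."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda e: probs[e], reverse=True)[:top_k]
    norm = sum(probs[e] for e in chosen)
    out = [0.0] * len(x)
    for e in chosen:  # only the chosen experts are ever evaluated
        y = experts[e](x)
        weight = probs[e] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, chosen


def make_scaling_expert(scale):
    """Toy expert: multiplies every input feature by a constant."""
    return lambda x: [scale * xi for xi in x]


experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # one row per expert

out, chosen = moe_layer([2.0, 1.0], gate_weights, experts, top_k=2)
print(chosen)  # [0, 1]: two of the four experts run for this input
```

The total parameter count grows with the number of experts, while the compute per input stays tied to `top_k`. A dense layer would instead evaluate all of its parameters for every input.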
So what are kind of the main tricks for training these sparse translation models?
We have two main regularization techniques. Firstly, advanced types of dropout techniques. Everyone uses dropout on neural networks, but when you switch to these mixture architectures, you can add dropout to parts of the network such as the gating that routes to the different experts. So the way you actually add regularization is really important. And the second improvement we have is around curriculum learning.
One challenge you have for all of these different language pairs is that some of them have millions of pairs of sentences, and for some of them we might only have 10,000. So if you group all of them together, your high-resource languages haven't finished training yet while your low-resource languages already experience overfitting. So we train with a kind of curriculum strategy, where we try and estimate, okay, how long would it take for the specific direction to converge? Then we create buckets, where we estimate that these high-resource directions are going to need several hundred thousand updates to converge, so let's introduce them early. And then we add in the other languages following this type of curriculum. This really allows the languages that need to continue training to have that time, while the lower-resource languages do not overfit so much.
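A rough sketch of such a bucketed curriculum, under stated assumptions: each direction has an estimated number of updates it needs to converge, and it is introduced late enough that it finishes roughly when training ends, with start times rounded down to a bucket boundary. The direction names, estimates, and bucket size are hypothetical, and the function names are made up for this example.

```python
def curriculum_start_steps(updates_needed, total_updates, bucket_size):
    """Map each direction to the update step at which it joins training."""
    starts = {}
    for direction, needed in updates_needed.items():
        raw_start = max(0, total_updates - needed)
        # Round down to a bucket boundary so each direction still gets
        # at least the number of updates it was estimated to need.
        starts[direction] = (raw_start // bucket_size) * bucket_size
    return starts


def active_directions(starts, step):
    """Directions that are already part of training at a given step."""
    return sorted(d for d, s in starts.items() if s <= step)


# Hypothetical convergence estimates (in updates) for three directions.
updates_needed = {"eng-fra": 300_000, "eng-hin": 120_000, "eng-asm": 20_000}
starts = curriculum_start_steps(updates_needed, total_updates=300_000, bucket_size=50_000)

print(starts)  # {'eng-fra': 0, 'eng-hin': 150000, 'eng-asm': 250000}
print(active_directions(starts, 200_000))  # ['eng-fra', 'eng-hin']
```

The high-resource direction trains from the start, while the lowest-resource direction only joins for the final stretch, which limits how long it can overfit.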
Did you also want to prevent the catastrophic forgetting between the different languages?
Yes, exactly. Actually, that was something that we analyzed a lot. So I think a lot of people have the idea that if I have 10 experts, then each of the experts will be specialized in different languages. But that doesn’t actually happen, it's more like a dream. We actually spent a lot of time analyzing how this expert capacity is being used. Because like you said, we want to encourage the model to learn relationships between related languages, like Assamese is written in the Bengali script, but not have so much interference between languages that are unrelated.
Traditionally, it's been very hard to beat single-task models, or models specific to one language pair, with multilingual models. So even for the well-resourced languages, does your new model actually have very good performance when benchmarked against purpose-built pair models?
I'll talk about it for NLLB in one second. But that's actually something that I was thinking a lot about, because people are super excited about multilingual translation, of course, but you suffer from the problem that you mentioned: at the Workshop on Machine Translation (WMT), there's a shared task every year, and those are still won by bilingual models. So if I want the best model, why multilingual? This is why last year we actually specifically confronted this in the machine translation team. We set our goal to just enter with a multilingual model, throw every technique at it, and see what happens. And we were actually able to do very well at WMT. An important caveat, though, is that for WMT there are not hundreds of languages, so we only did the languages of WMT, which are a handful of mostly high-resource languages. We did show that even for high-resource languages you are able to do very well with multilingual. I think we won like 10 out of the 14 directions with one model, judged on human evaluation, not BLEU.
That actually convinced me that multilingual is the way to go. So for NLLB, we do a lot of comparison on FLORES, and you can check out our extensive paper on this. We also evaluate on community datasets, so we focus on many low-resource datasets; there are a lot of datasets released by Masakhane, for example, for African languages. But we also evaluate, of course, on WMT, so we cover most of the high-resource languages of the WMT workshop, and on different Indic NLP benchmarks, specifically to get a sense of this quality. We certainly don't want to sacrifice quality for high-resource languages.
Okay, so you don't sacrifice it? But does the model actually benefit from training on so many languages at once? Apart from the fact, of course, that the pair-wise approach wouldn't scale very well, which is an efficiency thing. Do you actually get some sort of interesting generalization capabilities out of training on so many low-resource languages at the same time?
Yeah, that's a really good question. When we compare on WMT and other mid- to high-resource directions, we do see improvement. Not across the board, right, it's hard to win every single possible pair. But we do see in general that we have important performance improvements.
Regarding generalization, it's hard for me to say that if we trained, for example, on Catalan, then we get that much better on Spanish, just given the proportionate amounts of data. So I would say for the very high-resource ones, we just don't drop performance. But I think for the mid to low resource, we do see some sort of generalization. And we did try and break down the embedding space to see how similar languages are being grouped. We actually work with many linguists to try and analyze our model performance from a linguistic standpoint as well.
…so here is Chomsky's universal grammar in your model?
My God, I think this has always been the promise of multilingual, that there's some sort of interlingual space or something. So far, I would say it's still an active area of investigation. There are some things our model groups that make a lot of sense, but many times it's not learning, you know, a recreation of some sort of linguistic relationship tree between different languages. And so we're still looking into that.
It's tempting to say that machine translation is more or less a solved problem now. I would guess not. When you look at the model's behavior, are there some sort of challenging or interesting failure modes? What are some of the phenomena that you see that still need to work?
So even for high-resource languages, I think there are a lot of problems still. So I'll break down the errors by resource category. I would say for mid- to high-resource languages, there's still a lot of awkwardness, what we call translationese: things might be grammatical, but it's just not something that a native speaker naturally would write, which is still a problem. There are still some important factuality problems, like sometimes translating named entities incorrectly. And these challenges can be difficult to identify and very difficult to fix.
For our major focus, low resource, I would say there are certain sentences that are just straight-up hallucinations, you know, the model can't understand the source sentence. And so it produces a hallucination, which is very harmful. Often the hallucinations are biblical content since a lot of low resource data is kind of like the Christian bible in nature. One of the harmful errors that we really focus a lot on is toxicity. For instance, I can enter a perfectly benign source sentence and the model generates some sort of profanity in the translation. This is an extremely harmful experience for the user, you lose so much trust, and it’s so unsafe you can imagine just seeing it is a poor experience.
So one of the things we really focus on is this toxicity, so we actually produce toxicity word lists for all 200 languages to help people detect this. If you think about dialogue, this is an area where people are really focusing on toxicity in the chatbot regime, where you want to have a friendly chatbot that you're able to talk to, and so on. But then how do you scale this for languages that are not English? There has been a little bit of multilingual toxicity work but on a small scale, so that's something that we invest heavily in, but it's super difficult. Because toxicity is very cultural, insults are extremely cultural for instance. So just trying to adapt these toxicity lists and collect them such that they're culturally meaningful, and actually work for your language was an immense challenge.
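One way such word lists could be used is to flag "added toxicity": a translation that contains a listed word even though the source sentence does not. The sketch below is only an illustration of that idea. The list entries are harmless placeholder tokens, and whitespace tokenization is a simplifying assumption that would not work for languages written without spaces.

```python
def contains_listed_word(sentence, word_list):
    """True if any whitespace-separated token matches an entry in word_list."""
    tokens = sentence.lower().split()
    return any(tok.strip(".,!?;:\"'") in word_list for tok in tokens)


def added_toxicity(source, translation, src_list, tgt_list):
    """True if the translation has a listed word but the source does not."""
    return contains_listed_word(translation, tgt_list) and not contains_listed_word(
        source, src_list
    )


# Placeholder entries standing in for real per-language word lists.
src_list = {"badword"}
tgt_list = {"grosmot"}

print(added_toxicity("Have a nice day.", "Passe une grosmot journée.", src_list, tgt_list))  # True
print(added_toxicity("You badword!", "Espèce de grosmot !", src_list, tgt_list))  # False
```

The second case is not flagged because the listed word was already present in the source, so the translation did not introduce it.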
So now you have open sourced these models. At Zeta Alpha’s Trends in AI, we always praise Meta as being probably the most open-source-friendly big tech research lab. So kudos for that. Definitely a lot of very good examples. What exactly is open source now? Is it the data, the algorithms, or the pre-trained models? Is it the full translation models that are open sourced?
Our vision is that if you speak a low-resource language, or you want to work on your own language, you should be able to recreate everything we did for that language. So we really believe in open sourcing everything I talked about. So from FLORES to this NLLB-seed; we also have multi-domain seed versions, so you can detect if your model just performs well at translating one domain, or if it's actually generally a good translation model. Then we open source a script to recreate our training datasets so that people are able to train with the data we have created. We also open-sourced the libraries and the code that's required to do bitext mining and data cleaning. One of the things that I think is always very interesting is that data cleaning is an important part of most large-scale papers. But then it's like, exactly how did you clean your data? So we released, you know, the code generally, and also our configuration settings. I think that could be improved significantly, actually! If you speak a language, you can tell instantly if the data is clean or not. Whereas we're sometimes guessing a little bit.
For modeling, our final model has around 54 billion parameters. But we released two distilled versions, so people are able to use them for many practical applications and study them on a smaller scale. We also describe our distillation approach in detail in our research paper. Of course, all of that is available, and then the code and the configurations to actually train our model. Then, of course, we also release our toxicity lists and detailed paper on exactly the human evaluation protocol.
You trained these models on a very large computer, right? How difficult was it from an engineering point of view, you mentioned, curriculum learning and this large sparse model and all that. How straightforward is it to train such a model?
We are extremely, extremely privileged to have access to the Research SuperCluster. It came out earlier this year, and having access to a large, stable compute cluster is really useful for training our models. Actually, one of the things that we do break down in our work is exactly how many GPU hours are required, and training the final model… does require quite a bit of time. But actually, the significant time investment is all of the ablations. It took us a year to basically reach the final model. So what takes the largest number of GPU hours is just being able to ablate and study all of these different techniques that I talked about. And so that's where the significant investment is. We hope that by releasing these very small models, people in the community will be able to experiment with them, and also do fine-tuning to different domains and different languages, because I think that's where, you know, the everyday person and even the NLP practitioner will be able to use them. I think our largest model will not produce translations fast enough for any type of real-world application.
No Language Left Behind kind of suggests an ambition to keep adding more languages, and, by open sourcing it, for people to add more languages from their own communities: there are 6000 languages and you guys cover 200. What do people need to do to add other new languages to these models?
That is the ultimate dream! That if you are a researcher, you can add your own language, and not just to translation but to general technology support, like NLP in general: why shouldn't you have a chatbot for your language, for example? I think that there's a wide variety of technologies that need to be built for many of the languages. How did we reach 200? We, of course, scoped out hundreds of different languages, but some languages don't even have keyboard support: you can't even text your mom in that language! And so there's a whole wide variety of technologies that need to be built. So I'm really excited about a lot of community efforts like Masakhane and AmericasNLP that work in that direction. So we're really happy to collaborate with the community to add more languages to FLORES, and we have a shared task on large-scale multilingual translation. Last year, it was a different language group, but this year, we focused on African languages.
As part of that we actually also offer compute grants, so if you don't have access to sufficient compute to work on research for your language, you can apply for one of our grants through cloud compute. So we hope that that will help supplement a lot of the kind of compute barrier.
But is it in principle possible to take the model and add a new language?
Oh, yeah, of course, you can fine-tune. There was a great paper from Masakhane, “A Few Thousand Translations Go a Long Way”. In that paper, they start with M2M-100, and they adapt it to include different African languages. That's definitely a great starting point. I think a lot of the languages released still need performance improvement, so in addition to adding new languages, we would love to collaborate with people who want to improve the quality of translations for existing languages as well.
Let's talk a little bit about the impact of high-quality machine translation. You mentioned already a couple of areas. What are the main ideas you guys have about the impact of high-quality MT on people's actual lives? Not just from the research point of view, but, you know, you guys have a very large platform. So there's a huge societal impact as well.
I love research and publishing papers. But one of our major motivations across the team is really to help people. So there are two things that we really have done. One, we've partnered with our production translation team, and a lot of our NLP techniques are now helping translation across Meta platforms. The second thing is to focus on education.
One of the things that really came through strongly in our interviews with different low-resource communities is the desire to access information online. So this is actually where we partner with the Wikimedia Foundation. They have many languages on Wikipedia that don't have translation support, and Wikipedia has this thing called the content translation tool. The idea is, let's say you want to add an article for your language: if it exists in another language, you could request a translation for it, and then edit the translation, which is much faster than writing the article by yourself. So we worked with Wikipedia, we identified a list of languages that they don't have sufficient quality for, or that they don't have support for, and we are actually serving translations on Wikipedia. We really hope that this will enable people to create information for their languages, and really focus on this kind of information need, on education and information access.
Language builds bridges, and that has a very positive impact on connecting the world and empowering people. How do you see the risks of high-quality machine translation technology? Especially when you make it open source, because this can be applied to any application that you didn't think of.
That's one of the things that's really top of mind for us. One of the things that can happen is, let's say you create a website: you could use Google Translate to just translate your website to every possible language, but you wouldn't really know the quality of that. So that content actually could be harmful, because you're not producing high-quality content in another language. So that's actually something that we spend a lot of time thinking about.
We work with an ethicist to really think through some of these considerations, because they're really important, and that's actually why we thought so much about translation safety, particularly from toxicity. This was a theme that we discussed extensively with a lot of native speakers as well, to try and understand questions like “how good does a translation need to be for you to think that it's net beneficial compared to net harmful?” Overall, we found that there is a basic quality bar that technology needs to meet. You can't just be like, “Oh, I produce one word in your language, therefore it's covered.” And so that's why we actually focus so much on evaluation through FLORES or human evaluation, because it's super important that if we say we support a direction, we can tell you exactly what we think the quality is.
If your machine translation becomes really high quality, it also could be applied to areas like surveillance or censorship? How do you kind of see that and especially with being open source, anyone could download it and build a surveillance model with it?
Indeed, actually, Amandalynne Paullada has this great article called “Machine Translation Shifts Power”. One of the things that she reflects upon is that historically the development of machine translation was very much driven by the Cold War, because people wanted to know what the Russians were saying. And I think that when you develop technology, we often think “okay, we're only going to use it for all of these great applications that we discussed.” But you're totally right that there are a lot of downsides.
And that's something that we actually need to think through. One, it's really important for people in general, and for all AI researchers, to really acknowledge this and think it through. And two, you know, all technology has downsides. Like the fact that my sister is 16 and can drive a vehicle on the street, and could drive that vehicle irresponsibly, is something that everyone needs to think through. And so this is where I think there need to be a lot of societal standards around AI. This is beyond translation, right? Like, when will you allow AI to make a decision that really a human should be making? So I think we need a lot more societal norms for this, and also potentially a lot of regulation on when AI systems can or cannot be deployed.
There was a lot of criticism about Facebook, and how your moderation policies work, especially in low resource areas with conflicts. So do you think this kind of work can impact some of that criticism?
I think it's extremely important that we're able to support all languages equally. One of the things that always motivates me is the idea that for many things where we say “it works”, like sentiment analysis, it works in English. All of these tools, especially if we use AI to do things like content moderation, need to work equally well in all languages. So that's why I definitely feel like our team focuses on translation.
But when we say No Language Left Behind, I really feel that the idea is for NLP to become truly multilingual across all different types of tasks. So that's one of the things where I hope that people look at the technology that we're building, and also all of the datasets that we're sharing, and use them as starting points to do other types of tasks as well. Ultimately, for many things like hate speech, people are using AI to do things like detect hate speech or detect bots on Twitter, for example. So this is where I think, in order for us to really cover all languages equally, there needs to be a focus on this. A focus on the idea that every language matters, every single language: we need to know the quality and the support that we do have, and we can't just blindly apply systems. That's actually why we have such a focus on quality in this work.
Let's finish the interview by looking into the crystal ball a little bit. I think everybody, every AI nerd, has been reading about the Babel Fish for decades now. Now we have it. We've also all seen The Matrix. Not sure that the Metaverse has a good ending in the fantasy of many AI nerds, but that's a different topic. So, what do you see, from a roadmap point of view, as the next chapters in this journey for machine translation at Meta?
I mean, for me, machine translation is not solved. The more I work on it, the more I'm convinced that there is a long road ahead. We talked a lot about quality issues. I think for many languages, you know, it's a journey. People love the binary of "we cover it, or we don't cover it". No, no, no: coverage at what quality? So that's a major focus.
The second is that I really hope that when people talk about multilingual NLP, they really value every single language, they really examine the quality, and we don't just throw all of the languages together in one bucket. I think when we talk about developing new technologies, such as looking towards the Metaverse, I really hope that there's a focus on inclusion for everybody. I don't want technologies to be developed only for English, and then think “oh we'll work on everyone else later.” No, I think billions of people around the world speak low-resource languages. And it's really important that everyone can access content and access technology in a language that works and is also culturally meaningful for them. So a lot of times, I just want to read the news in Chinese. I don't want to read it in English, right? Like, if it's news about China, I don't think I should have to read it in English. So we really need to develop technology that's inclusive by default.
Cool. So thanks a lot for explaining all of this stuff about the new models. We're looking forward to reading the paper on arXiv.
Yes, thank you.
And I just have to ask one more question to close off the episode. There's so much new stuff coming out on arXiv, also in machine translation, and there are many groups across the world working on it. How do you keep yourself informed with this fast progress in AI? And how do you avoid completely missing important stuff?
I heard from everyone else that Twitter is the way to go because people tweet about their work, but I do not have a Twitter. I still keep it old school, I look at those arXiv email digests, but I look at them like once a week and do a little bit of control + F on keywords!
All right, well, let me just briefly shout out an alternative! Check out Zeta Alpha, a smarter way to discover and organize knowledge for AI and data science teams. Thanks so much, Angela, for enlightening us on this fantastic breakthrough.