On September 15th, 2023, at Science Park Amsterdam, Zeta Alpha organized the fourth edition of the Transformers at Work event. After a meteoric rise, Transformers have become the workhorse of modern AI. From edge devices to the largest language models in the cloud, and across computer vision, robotics, and reinforcement learning, they are everywhere. Even modern hardware has been optimized for the specifics of this Deep Learning architecture. In contrast to earlier editions, the focus this year was more on the shift towards real-world applications of AI.
With Nils Reimers (Cohere), Rodrigo Nogueira (Zeta Alpha, UNICAMP), Madelon Hulsebos (University of Amsterdam), Konstantinos Bousmalis (Google DeepMind), Suzan Verberne (University of Leiden), Douwe Kiela (Contextual AI, Stanford University), Raza Habib (Humanloop), and Corina Gurau (Air Street), eight world-renowned experts pushing the boundaries in subfields such as Neural Information Retrieval, Conversational Agents, Large Language Models, Multimodality, Autonomous Vehicles, and Robotics shared their views on the field. The videos of the full talks are available on the Zeta Alpha YouTube channel. At the end of the workshop, we asked the speakers in a panel discussion to touch upon the future of the Transformer architecture, its applications such as LLMs and more general AI agents, and the potential for a successor to this powerful model. Here’s our summary of the discussion.

Let’s start with a very general question. It's very hard to predict the future, but returning to the topic of this workshop: Transformers at Work – will we have a Transformers at Work event again next year? Or will it have to be called something else? In other words, are we stuck with the Transformer model architecture in AI?
So the parallelization of Transformers seems to be crucial. People tried in the past with Recurrent Neural Networks, but they really don't scale that well. And I think if we can stay with this parallelization scheme, we'll keep seeing at least Transformer-like structures in the future.
As you say, predicting the future is really hard. There's a line of work from Chris Ré's group at Stanford, using State Space Models, that is really promising, because it allows you to use RNN-type models where you can still have high degrees of parallelization. So there's a promise there that you might get really good scaling. That's the only thing I've seen that feels like it could be successful.
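The appeal of that line of work is that a linear recurrence, unlike a nonlinear RNN, can also be evaluated in parallel as a convolution. Here is a toy editorial sketch of a single-channel, scalar state space model – not the actual S4 implementation, and all parameter values are made up for illustration:

```python
import numpy as np

def ssm_sequential(a, b, c, x):
    """RNN-style evaluation of a scalar linear state space model:
    h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return np.array(ys)

def ssm_convolution(a, b, c, x):
    """Because the recurrence is linear, the same outputs can be computed
    without stepping through time, as a convolution with kernel K_k = c*a^k*b."""
    n = len(x)
    K = c * b * a ** np.arange(n)
    return np.array([K[: t + 1][::-1] @ x[: t + 1] for t in range(n)])

x = np.random.randn(16)
seq = ssm_sequential(0.9, 0.5, 1.2, x)
par = ssm_convolution(0.9, 0.5, 1.2, x)
print(np.allclose(seq, par))  # True: both forms give identical outputs
```

The convolutional (or parallel-scan) form is what lets these models train with Transformer-like parallelism while still running as a cheap recurrence at inference time.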
Yeah, I think that the only strong argument against Transformers has been this quadratic attention: a quadratic amount of memory and compute when you scale the sequence length. And so we were stuck at this 512-token limit, where people said: an RNN is nice, you can scale it nicely. But now, with FlashAttention, that limit is more of a problem of the past. I mean, it's a really cool thing that you can look at every token at every compute step. That's really, really hard to beat, and RNNs can't match it. So I think next year we will still have Transformers, but at some point we'll equip them with a world model or so, because Transformers are really shitty at basic things like mathematics – adding two numbers, multiplying two numbers – reasoning, and so on. Here, I think, they are not a good fit. And if we want to get to AGI, I expect that it can do the addition of two numbers.
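As an editorial aside, the quadratic cost comes from materializing the full n × n attention score matrix. A minimal NumPy sketch of plain scaled dot-product attention makes this visible (this is not FlashAttention, which computes the same result tile by tile without ever storing the full matrix):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention, materializing the full
    n x n score matrix -- the source of the quadratic memory cost."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # shape (n, n)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

# Doubling the sequence length quadruples the score matrix.
for n in (512, 1024):
    X = np.random.randn(n, 64)
    _ = naive_attention(X, X, X)
    print(f"n={n}: score matrix holds {n * n:,} entries")
```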
Maybe to add one thing: this workshop has become broader and broader each year, with more applications of the Transformer and bigger Transformers. So maybe next year you could also invite people who actually look into the Transformer – not broader and broader, but deeper and deeper – so that we can start to understand why they are so powerful.
Question from the Audience
Picking up on the Transformer architecture dominating, and on the move towards models with more agency – models that can actually accomplish goals and recognize that they have accomplished them. Where do you see that heading? Because we have a lot of different approaches now, in the language model space, but also in robotics.
More VLMs (Vision Language Models)? One interesting direction, at least in robotics, would be to use a VLM to tell you whether you have reached the goal or not. So instead of showing the goal, like we do, you would ask: is the red cube on top of the blue cube? And then the VLM can score the image for you automatically. That is a huge problem – kind of like having world models, or success detection.
I think it's indeed closely connected to robotics. It's really hard to do this planning and execution. Maybe for these agents and LLMs it's easier because it's not physical, but still, it's really hard, in the planning, to say: let's reach out to a website's owners to find problems on their website and then send them an email about how to fix those problems.
So when I started in the field – I did my masters the year that AlphaGo happened – we were lectured at the time by the DeepMind researchers who were working on it. They had this one slide at the start of the talk that said: RL equals AI. I think there was a consensus at the time that if you wanted to get to AGI, and you wanted to build systems that could reason and act, reinforcement learning was the way to go. That consensus is much less clear now. There are now people putting LLMs into systems where it looks like they have some form of agency, though it's still pretty weak. Concrete examples that I've seen work well are related to search and retrieval-augmented systems, where the model can take multiple goes at search rather than just one. That's a pretty limited agent. Chat is a very limited example of an agent. The OpenAI code interpreter is quite a limited example of an agent. A lot of people have tried to build more complex agents, but the success rate in practice is really low. The trouble is that it's hard to get very high reliability at every step, and once you stack together a large number of unreliable things, the probability of success for the final outcome goes down a lot. So LLMs seem to exhibit some degree of agency, more than anyone expected, but it is still really hard to get them to work in practice.
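The compounding effect described here is easy to put into numbers: if each step of an agent succeeds independently with probability p, a task requiring n sequential steps succeeds with probability p**n. The figures below are illustrative, not from the talk:

```python
# Overall success probability of chaining n independent steps,
# each succeeding with probability p, is simply p ** n.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {p ** n:.2f}" for n in (5, 10, 20))
    print(f"per-step reliability p={p} -> {row}")
# Even 95% per-step reliability yields only ~36% success over 20 steps.
```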
Maybe controversial, but RL might not even be needed. LLMs are really good at reasoning – well, they seem to be – so you can just ask the LLM: do you think you've accomplished the task? And it's very likely that in the future it will know whether the task has been accomplished or not.
Just to add to that spicy take: my personal experience across the last seven years is that every time I've worked on a project, removing the RL has made things better.
Exactly, I would like to agree, and I would hope that's true, but it's not true yet, and I'll give a counterexample: for many control problems, there are situations where it's just impossible to get there otherwise – where so far we don't have solutions, with any other kind of technique, that are better than an RL agent, unfortunately.
After this workshop, are you more optimistic or less optimistic about the state of AI where we are today? We've heard of many current challenges. What is, in your view, the most important challenge to work on?
Context size. Being able to have more context becomes a lot more important as you're dealing with more modalities – somebody mentioned tactile data earlier, or images and the like. Just to give you a notion: RoboCat uses three time steps as context, because that's all that fits. So solving that would really help, at least in the robotics domain.
Trustworthiness, efficiency, and reducing the footprint that these models have. And humans, in the sense that people should be better educated about what they should, and should not, use these models for.
For me, it's how to quantify what you don't know. That's a really strong skill for humans. If you ask me something and I don't know, I say: sorry, I don't know – I could look it up and check. I feel I'm really well calibrated about what I'm confident in, what I'm not confident in, and what I can refuse to answer. And this is where LLMs are still terrible.
If I ask them: hey, which band is playing at Transformers at Work 2023, they will hallucinate stuff, because they don't have the knowledge, and they don't know that they don't know it. For a lot of these tasks, like agent planning, you need to judge: is it correctly answered yet? Is there still something missing? Are there some aspects you don't know? Until we solve that, and the model can really confidently say what it knows and what it doesn't know, there will be little progress on agents.
Last year we were here, fascinated by the few-shot capabilities of GPT-3 that were made public via an API. And this year we're working on these tools with multiple agents. There was some news that people – I think Chinese researchers – were playing with a software engineering factory where there is a ChatGPT CEO, a ChatGPT CTO, a ChatGPT writer, and so on. And we're talking about trusting our lives to these models, and discussing the impact. Maybe I'm too optimistic about how well they will work in the future, but what is actually troubling me more is the impact they will have on society in general. And it seems that there's a bit of denial amongst ourselves: GPT for coding is scoring way better than what I can do. It doesn't know information retrieval like I do, but maybe GPT-5 will. We have to think a bit about this. So I'm more worried about the societal implications than about the technical problems – of which there are a ton, but how long will it take to solve them?
I'm very optimistic that we'll manage to deal with multimodal data and expand the reach. There's huge potential for applications, and some of them will be pretty impactful. One challenge that I've seen across the different talks here is the need for data – high-quality data – for evaluation, but also to train models beyond simulations, beyond toy examples.
I'm extremely optimistic that we will solve the technical challenges. But also, even if we don't, compared to where we were a year ago, we're well into having crossed the boundary from research to being practically useful. So even if there wasn't further progress, I'm extremely optimistic about what people will build in the next year. The current state of the art is not over-hyped, it's actually under-hyped massively.
Something that hasn't been mentioned, and that I think is quite important to work on, might be adding some sort of common-sense priors to these models – for example, more geometric priors, thinking of visual language models. Sometimes the mistakes they make are not very intuitive. Understanding why models make these kinds of mistakes, and introducing such common-sense priors, would be a really interesting direction.
Thank you for making these predictions – we will not hold them against you. Certainly not after we mingle them with a little bit of beer in the evening program. A big thank you to all the speakers at Transformers at Work 2023!