# Competitive Coding for Graph Neural Nets and beyond — The CLRS Algorithmic Reasoning Benchmark

#### Zeta Alpha interview with Petar Veličković

*Petar Veličković is a Staff Research Scientist at DeepMind, where his main research interest *lies* in Geometric Deep Learning. He recently co-wrote a ‘proto-book’ about this nascent field of research with Michael Bronstein, Joan Bruna, and Taco Cohen. He is also an Affiliated Lecturer at the University of Cambridge, and an Associate of Clare Hall, Cambridge. He completed his PhD in Computer Science from the University of Cambridge (Trinity College), obtained under the supervision of *Pietro Liò*. In 2021 he wrote a position paper about “Neural Algorithmic Reasoning” which is the background of his most recent paper “The CLRS Algorithmic Reasoning Benchmark” (ICML 2022) where a team from DeepMind uses the classic Introduction to Algorithms book to define a set of challenging problems for deep learning architecture to learn classic theoretical computer science algorithms. We had the pleasure of interviewing Petar on 31 August 2022.*

*You can also *__watch the full video of the interview on our YouTube channel.__

**Can you tell us how you got into AI and eventually through your journey on geometric deep learning and algorithmic reasoning?**

I started out as a theoretical computer science student at Cambridge for my undergraduate, and like many in Central and Eastern Europe, I got into the computer sciences through algorithms and competitive programming and data structures and these things.

For those unaware, it's these kinds of competitions where you're supposed to write a program that solves some particular task given only language input and maybe a few examples in a way. It terminates with a correct answer in a certain amount of time and in a certain amount of memory limits.

So you have to quickly learn how to write efficient programs, and for a very long time, this is what computer science was to me. So when I started my studies in computer science, I specialized mainly in algorithms and data structures and the related theoretical fields. I assumed that I would probably have a career in software engineering when I initially started.

I first got exposure to ai when I started doing my final year project at Cambridge for my bachelor's. This was with Pietro Liò, who also ended up being my Ph.D. advisor. I wanted to do a project in bioinformatics because I knew that bioinformatics and biology generally were full of classical algorithms. So I thought it would be a great way to kind of directly apply what I have learned and what I know well. But my advisor very quickly convinced me that in reality nowadays, it's getting to a point where the dominating method in computational biology is not classical algorithmic, it's actually machine learning, and this is what got me to seriously consider studying machine learning and artificial intelligence.

I really enjoyed doing it because I saw it as a natural next step to me being able to program a machine to do something. Now it's me being able to program the machine to be, to learn by itself how to do something, which I thought was quite fascinating and obviously had a lot of interesting applications.

Shortly after I got my Bachelor, I continued at Cambridge as a Ph.D. student in machine learning together with Pietro. And this is how my interests in AI were originally created the expansion into geometric deep learning.

My encounter with graph neural networks happened only a few years later. First, during my internships at MILA Montreal, when I worked with Adriana Romero and Devon Hjelm. As you know, we got to touch upon these ideas of building machine learning models. Over data that is irregularly structured. So unlike the more popular kinds of data like text or images or speech we were now building models that worked over arbitrary graph structures, which I found to be extremely powerful and generally, aligned quite well with the kinds of graph algorithmic things I used to do when I started doing computer science.

I really enjoyed it and it felt to me like this was a very general way to reason about machine learning computation. So I, I thought it would be a good investment for the years to come. Then only many years later when I got in touch with Michael Bronstein, Joan Bruna, and Taco Cohen. I started to connect the dots about how these graph neural networks are just really an instance of a more general geometric deep learning blueprint that allows you to build neural network architectures over arbitrary input data with arbitrary symmetries.

GNNs just happened to be the version of geometric deep learning for graph-structured inputs and permutation symmetry.

I then realized that in the same way we talk about symmetries in machine learning, if I use some symmetry transformation on my input and apply a neural network, then ideally it should be the same output as if I first applied the neural network and then applied an appropriate symmetry transformation. This is known as the equivariance constraints.

The preconditions and postconditions of algorithms. So if I give you an algorithmic input that satisfies certain kinds of conditions, when I talk about algorithms, I can often think about one step of the algorithm satisfying some different kind of condition when it terminates.

So it was a natural way for me too, through the lens of GNS and geometric deep learning, to start studying, can we teach machine learning models to execute exactly those kinds of algorithms that I really enjoyed playing with when I started my career as a computer scientist? So in a way I've gone full circle, it started out as a toy problem kind of investigation, but I think we're actually only starting to scratch the surface over something that's potentially really powerful with these kinds of models.

**You kind of coined the term neuro-algorithmic reasoning, right? And I've been in AI and cognitive science for some 30 years now, and I'm always a little, little bit weary of the term reasoning. Like.... what is it exactly? Are we talking you know, like logic and algebra here, or like common sense reasoning, you know, human reasoning**?** Can you explain a little bit about** **how you see the connection between algorithms and reasoning?**

It's really something that depends on the perspective of the person you're speaking to. For me — someone who was trained to work with data structures and algorithms for my undergraduate studies — I actually see algorithms as a very neat way to formalize various kinds of decision-making and strategies and these kinds of things.

So maybe with algorithms, we won't be able to express all kinds of reasoning that one might think about because it's not necessarily always going to be rational or deterministic. But I think that algorithms are a very useful description language to talk about all sorts of reasoning, and procedures, and maybe more importantly: they’re a language that's going to stay relevant no matter what the model of computation is, CPUs, GPUs, TPUs, quantum computation, whatever.

You can see it as a way to describe reasoning on mathematical objects and their structure at the lowest possible level, but in my opinion, one of the coolest promises of neural algorithmic reasoning is to break away from the abstract realm and build models that might be applied in the real world. With noisy, raw, complex, temporal data.

**There are a lot of criticisms of modern deep learning-based paradigms on how the symbolic aspects of reasoning and logic are absent. Do you see your work on reasoning kind of contributing **toward** addressing those criticisms?**

Yeah, so that would definitely be a dream. We are currently unable to do that. So the systems are still from start to finish, end to end trained neural networks with gradient descent. So all that we do will inherit the same kinds of limitations that these models currently have. That being said, the method that we propose is a bit of a, let's say, the best possible way that we've been able to think about so far.

To capture the computation of some actually rigorous algorithm for which you have clear generalization properties and proper step-by-step thinking and things like that to take the essence of what those algorithms are to then put them in a setting where you can train things using gradient descent.

**Let’s jump into the CLRS algorithm reasoning benchmark paper, which was published on arXiv in June 2022 and later presented at ICML. For those who are wondering what CLRS stands for, it’s the last names of the authors of the Classical Introduction to Algorithms book Cormen, Leiserson, Rivest, and Stein. So Petar, maybe you can explain to a broader audience the core idea of the paper and what you hoped to achieve with it.**

First of all, thank you so much for highlighting our work. I have a feeling that these kinds of directions could become quite important going forward. And you know, it's, it's great to see that people in the community are able to recognize that we've already had some really good reception from people who played with our benchmark already.

To truly answer what brought us to this paper, I need to take you back like two and a half years in time. This was around the time when the pandemic was starting to get real and we all started to work from home. At this time I was starting to work on algorithmic problems, and we were able to publish two papers around this time, which dealt with various kinds of graph network variations that were able to fit certain classes of algorithms better.

There were some other groups also very actively working on this. I think Stefanie Jegelka's group at MIT is an excellent group of people that have quite intensely worked on this. Not just empirically, but also theoretically they developed some really cool theory around this that shows why it's a good idea to use GNS to fit algorithms in the first place.

But one thing that always struck me as a bit of a problem, is that we want to make it known for everyone that training neural networks and algorithm problems is worthwhile. Even from the perspective of just benchmarking your model to see to what extent it can solve standard algorithmic problems and use it as part of the software engineering toolbox. And I saw one thing as a very big impediment to that already, in the year 2020, every single one of those papers proposes a new method to solve a certain class of algorithmic problems better and to show that it works, they build their own high quality targeted data set for this particular category of reason.

And this is a bit of a problem for two reasons.

The first problem is that it's gonna be really, really hard to track progress in the area if you do things like that, because everybody will choose the baselines that make the most sense for this particular class of problems. But what if I am catastrophically injuring performance on other classes of problems, right?

Secondly, if you're someone who comes into this field and as I said, we want all machine learning practitioners to be aware of this kind of opportunities and to try to battle test their models on them. Let's say you're someone who hasn't had formal training in theoretical computer science and you want to enter this field. How do you do it? How do you even generate a data set of meaningful problems if you're unaware of what kinds of meaningful algorithms are there that probe a particular kind of reasoning?

So what we really wanted to do also is not only to unify evaluation across different papers but also to democratize access to this area, to make it so how do these, how do these data sets actually kind of look in practice.

When you’re fitting an algorithm, you know you have a target “function”, it's basically the code that implements the algorithm itself. The idea is you generate inputs from certain chosen distributions of inputs that you care about, and then you run the algorithm on those inputs and that algorithm produces the output that you can take as a label.

But very importantly, the algorithm also has some intermediate states, and if you truly want to fit the algorithm's direction, you’re also gonna be interested in that entire trajectory. Why is this important? Think of sorting algorithms as a classical example. Every single sorting algorithm implements exactly the same function. So if you go from input to output, it won't tell them apart, insertion sort, heap sort, quick sort… But they all have different logics! The intermediate trajectory is what sets them apart.

This surfaces very complicated questions, like where in the middle of the algorithm do I choose to capture data and which data do I capture.

So that's really partly what CLRS is all about. We built a data set with 30 algorithms with five or six different baseline models, and the algorithms automatically support capturing data at the most interesting intermediate points so you get full access to trajectories. And the way the benchmark is designed is for every algorithm you have a very clear specification of what are the inputs, outputs, and intermediates.

**In the paper**,** you mentioned also the way you can give “hints” to the algorithms. Can you go into that a little bit more? **

Hints are how we internally refer to these intermediate states. So you have the whole trajectory of the algorithm and we call them hints. Ultimately you wanna go from the inputs to the outputs, but these intermediate states are kind of intermediate hint signal that pushes your representations toward what the algorithm would be doing internally. So that's why we call them a hint as a kind of shorthand word for intermediate trajectories.

**Alright so you defined this benchmark results in some “scores” right? What, does it mean for one of the learning algorithms to achieve a score of 50 or 70, on an algorithmic reasoning benchmark around a single algorithm? **

That's a very good question and there are actually many ways to define it. We offered one particular way in our set of experiments in the code we publicly released. But the code is so extensible that if you want to write a different evaluation function, that will be quite easy to do.

What we currently do is for every single point where output is defined, we check whether you got the correct output or not, and we compute the micro F1 score to account for class imbalance. To go back to your question, if you say, I have 70% accuracy, 70% F1 score on a sorting algorithm, what it currently means is that over all of the list entries in your test set, you are predicting the correct predecessor in assorted list 70% of the times.

Now, some people might prefer a different metric. They might say, either you've sorted the array or you didn't. This would obviously give you a lower F1 score because now you either have zero or one for the whole array. Whereas in our current evaluation, we say one element is correctly sorted if it's predecessor is correct.

**I noticed that in your evaluation you had very high scores on the validation set and quite significantly lower on the test set. What does that mean? **

So this is the reason why I found it necessary to even create a new name for all. Because in a way, all we're doing is supervised learning behavior. All these things already exist, right?

So why do I need to create the term neural algorithmic reasoning? Well, I think the main reason why is because when you want to fit an algorithm, which is a target function, It's not like usual supervised learning where you might have a test data set you care about and you wanna just maximize your performance in that test data set, which is often somehow related to the distribution of your training data points. With algorithms, to say that you fit an algorithm means that you fit that algorithm everywhere. Algorithms don't collapse. If you give them inputs that are four times as big, for example, with our current test set, we wanna check to what extent your neural network is reasoning you need to check how it works on even bigger inputs.

So the distribution of the example is fundamentally different in the test and validation set.

**To what extent do you see cross-task generalization, like transfer learning from one algorithm to another? You kind of allude to that being the holy grail of the whole thing, being able to compose algorithms out of subroutines elements reusable components. **

I agree with you that that's where we want to be. The CLRS paper in and of itself doesn't test in this regime because like we could already barely squeeze the experiments onto eight pages as it is and it would’ve been very hard to pick and choose which transfer experiments to do.

So for CLRS itself, I don't have an answer to that question, but generally speaking, people have been playing with combinations of algorithms. In several previous papers, and I have previously co-authored a paper at NeurIPS last year (How to Transfer Algorithmic Reasoning Knowledge to Learn New Algorithms). And there we basically showed that under certain types of multitask training, you actually can get very meaningful, positive transfer.

**Something that surprised me about the benchmark is that I didn’t see a public leaderboard. Did you get any submissions from somebody else? Do you plan to have a leaderboard? **

Yeah, so, okay, so two points to be made here.

The first one is that this is definitely a challenge for the community to try to show interesting transfers between types of algorithms, but I'm also a firm believer to some extent you should learn to walk before you can run. And our paper already shows even for a single task where everything is known and given up front, we drop half the performance when we extrapolate, so that's the first challenge I would give to the community.

Only once we get better at understanding why performance drops so much, we can talk about combinations and transfers. So this is, this is my thinking. In terms of the benchmark, so we call it a benchmark, but it's actually more like a benchmark factory!

Because for a specific algorithmic target, a specific data sampler, and a specific type specification, you can generate your own benchmarks within CLRS, which I think is a really cool thing. It's not a static data set.

We do provide one static data set that we use to generate data for our particular paper, but we are deliberately not making this a server where people can send submissions because that's fundamentally not what we want. We don't necessarily want that if someone wants to evaluate a general purpose reasoner on a reasonable set of data points, they can reuse the data we used. We actually provide it as a zip file that you can download directly.

So I really see it as a generator of benchmarks rather than a benchmark in and of itself. That being said, we are in discussions with the Open Graph Benchmark community from Stanford where maybe we might find a way to create an actual proper leaderboard benchmark out of this specific thing we put in this paper.

But basically, we thought that making it you know, a more formal benchmark would send the wrong signal into the community. We did not want people to think this is what we need to beat right now, but more like, here is a great thing you can play with and you want, there's this one specific thing that came out of it.

So going back to your early days, it's not a competitive programming contest, it's a fundamental research program. That's, I mean I'm, I'm staying honest to my current position as a research scientist, so I fundamentally want people to look at this as a research challenge.

**Right. Well, thanks for, the great paper, and thanks for clarifying it.**

**As a wrapping-up question, all the work you guys do at DeepMind is part of a bigger long-term research program aiming at some sort of general intelligence. One of the aspects that humans are not particularly good at is reasoning, actually! In fact for algorithmic reasoning, you need many years of schooling, right? So con you reflect on whether Deep Learning architectures are inherently compatible or incompatible with simulating formal machines for algorithmic reasoning?**

That's a very good question. We outlined some possible visions in the neural reasoning position paper, which I think we also cite within CLRS. The general idea is that with these kinds of directions, you can learn a sort of library of behaviors.

So the whole CLRS textbook, which is, you know, thousands of pages long, only has about 100 algorithms. So even though it's huge, the actual collection of blocks of knowledge is not so big. So there is a hope that you could learn something like all the algorithms inside CLRS and then use that as a general-purpose tool library that helps your agent pick out a particular reasoning procedure when it needs it. So that's, the general hope I would say.

**One more thought that I had is about these recent code generation models, like OpenAI’s Codex. Codex writes codes and algorithms but it has never actually seen in output input-output pairs or being able to reason about generalization or anything like that, yet it still seems like a viable approach to learning to create algorithms. How do you contrast that with your line of research? **

Yeah so first of all, DeepMind also has developed a model like Codex (AlphaCode), so we’re definitely been thinking in this direction too. But it’s important to make the differentiation between what we like to call program induction versus program synthesis.

Essentially, in program induction — which is what the algorithmic reasoning benchmark is itself a member of — you are learning a program and you're sticking it inside the weights of a neural network. Whereas in program synthesis, you're learning to produce a program, so you're actually producing something that can be executed on some kind of virtual machine.

There are many benefits to producing the program: you can verify its correctness, you can talk about its various properties in terms of time or space, complexity, and all of those things, whereas you can’t do that with the weight on your own network.

One of the things that currently I'm not ruling out is that in the future, neural algorithmic reasoning could be very much oriented towards learning synthesizers, like the ones that you see in Alpha Code or Codex. However, a synthesis operation is fundamentally a discrete operation.

You need to produce discrete blocks of code, and you have to like this step by step. Then you need to validate that some intermediate code makes sense, and then you need to repair it in a certain way to do the next step, and so on. Whereas in our approach, you know, you have weights and those weights are differentiable with respect to everything our model is doing.

So I will just say our current approach of induction plays more nicely with the gradient descent style of learning.

That being said it's only a very simple, and right now kind of answer to a question that's deeper in general, and I've talked to many program synthesis experts over the years, and we believe that in the future there will be some kind of glorious convergence. Where a synthesizer like AlphaCode might be able to make advantage of some of the things that we learn and we might be able to make advantage of some of the things that synthesizers are able to do well.

So I would say regardless of which direction you care about more, I would say certainly watch this space because there are going to be fascinating things coming in the future.

**And just a kind of wild question. When we get quantum computers, will they have any impact **on** algorithmic reasoning?**

It's an interesting question. I would say that I actually made a remark about quantum computers as a model at the beginning of our conversation where I said that like, algorithms as a language will stay relevant and important regardless of what the model of computation is. So even on a quantum computer, you might be able to learn quantum algorithms, for example.

That's what it would be best suited for, I guess. And I already know from my interactions and collaborations with some quantum engineers that there's a lot of interesting algorithmic stuff to do there. One thing I would remark with quantum computation now is that it's very fundamentally memory limited, so the number of qubits you get access to currently is very low. As a result, people in quantum machine learning generally seem to be trying a lot to exploit anything they can think of to reduce the requirements of the parameterization, and as a result, the ideas from geometric deep learning such as equivariances are useful to reduce the parameter footprint of models.

**Finally, we have to ask this question given the problem we’re working on ourselves. How do you manage to keep up with the increasing volume of publications in Machine Learning?**

It's a very important question, I think going forward and sometimes getting to connections earlier can save one a lot of time if they're aware that somebody did something related. So what I try to generally do, it's a heuristic that works for me, not necessarily will work for everyone, but I tend to have these kinds of say GitHub repositories that generally track interesting papers of a certain.

Whenever there's a big conference, these repositories get updated and I can usually then query for certain terms or keywords that I care about and see if there are any papers I should look at. I also search by authors because I know who are most of the authors that do significant amounts of work in this space.

**It's been a really insightful conversation and I'm looking forward to seeing more of that great work and CLRS being adopted, maybe not as a leaderboard, but as the defining resource for shaping neural algorithmic reasoning. **

Thank you so much for having me!