Pursuing Computer Vision Magic in Amsterdam

Jakub Zavrel
May 20, 2022

Updated: May 24, 2022

Interview with Cees Snoek, professor at the University of Amsterdam, on his fascination with the magic of machine learning in Computer Vision, contrasts between academia and industry, the worlds of multimedia and multimodal AI, the role of Amsterdam and Europe in the global AI ecosystem, and the exciting progress in the field now that new vision architectures based on Transformers are capturing the lead. By Jakub Zavrel and Gebrekirstos Gebremeskel.

Cees G.M. Snoek is a full professor in computer science at the University of Amsterdam, where he heads the Video & Image Sense Lab. He is also a director of three public-private AI research labs: QUVA Lab with Qualcomm, Atlas Lab with TomTom and AIM Lab with the Inception Institute of Artificial Intelligence. At the university spin-off Kepler Vision Technologies he acts as Chief Scientific Officer. Professor Snoek is also the director of the master's program in Artificial Intelligence and co-founder of the Innovation Center for Artificial Intelligence (ICAI). We had the opportunity to interview him earlier this year.

To start with, Cees, can you tell us a bit about your personal story? How did you decide to go into Computer Vision and AI?

At the time, I was studying Business Information Systems. It was a combination of economics, AI, and computer science. I quickly realized that the technical part of the study attracted me more. In a course, I first saw the Informedia system, the first video search engine at the time. It was developed at Carnegie Mellon University in Pittsburgh, and it was already operational in 1995. They recorded broadcasts from CNN, digitized them, and made them searchable, but they also did face detection on broadcasts to name the faces in the news. For me, that was some kind of magic. I decided I wanted to know more about it, and that's basically how I got started. So my Master’s thesis was on a topic related to face detection in video: we tried to estimate camera distance from an image. My advisor offered me a PhD position, and I went to Carnegie Mellon to work with the people who made the magic system. I have kept going since then.

At that time, there were a few other people working on video, but I often have the idea that video has not yet really found the right home. When I started doing video, the computer vision community was not so interested, because they thought it belonged more in multimedia. It was also a topic of interest in the information retrieval community. Nowadays, it's become very popular to work on video in computer vision.

And an interesting question today is: what is computer vision? I think every paper that gets published at computer vision conferences nowadays is basically deep learning. The Transformer is the newest kid on the block. Today's generation feels they have a machine learning tool that they can apply to every problem. In my view, that is too simplistic, because it doesn't always make sense to treat an image or video in the same way you treat a word sequence.

Today, you are heading a number of research groups. Can you tell us a little bit about the idea behind these groups and what you're trying to accomplish with them?

We have one big group, the Video & Image Sense Lab. The mission of the lab is to make sense of video and image data, using both human and machine intelligence. Within it, we have all kinds of ICAI labs, public-private partnerships with industry. For example, we have a lab with Qualcomm where I work with Max Welling. Another example is the Atlas Lab with TomTom, where I collaborate with Theo Gevers. AI is popular, so we get a lot of support from industry.

The inspiration is still the same: understanding videos and images. Along the way, the level of understanding has increased. When we started, I was interested in image classification. Now, that question is not completely solved, but academically speaking, it is not so interesting anymore. Many people think that this has been solved by deep learning. That is, in a way, true, but we were making a lot of progress on this problem before the deep learning revolution started. We had a spin-off from the University of Amsterdam, for example, called Euvision Technologies, that was licensing image recognition software based on the bag-of-words model. That was the 2010s. Then deep learning made it even more powerful and easier for people to enter that market.

First Convolutional Neural Networks took over Computer Vision and more recently we’re seeing another takeover by Transformers. 2021 was really the year of Transformers in Computer Vision. Where is the field heading in your opinion?

Conflicts between frameworks can generate very strong opinions, and they are really good for progress. I'm pretty sure it will stay that way for a while. There's a big uptake of ConvNets in applications, including hardware adapted to ConvNet operations. That is not easy to replace with a Transformer (Liu et al., 2022). And people are still wondering whether Transformers are a hype. I think Transformers are interesting: they are new, worthwhile to explore further, and show promising results. They are particularly suited for sequential data and much more appealing than a recurrent neural network. A Transformer is also a natural way to incorporate multiple modalities. From a research perspective, I find Transformers more interesting.

Do you see richer neural network structures, such as graph neural networks, coming to computer vision?

Yeah, that's coming and actually a lot of people in our labs are working on this. What is so strong about it? Why does a ConvNet work? Because it has inductive biases that are tailored to the vision problem. Interestingly, not many people have looked into inductive biases for Transformers for vision. They just take the Transformer as it was designed for machine translation and apply it to their vision problem, treating the image tokens as words. But that doesn't make sense, because there are a lot of inductive biases that are valid for images and video. You can learn them from data, of course, but why not put them in from the start?

One of the things in vision is that locality of information is very important: nearby pixels are very similar. That is a very strong inductive bias, and it is completely ignored in existing Transformers. We have proposed the notion of box attention (Nguyen et al., 2022). We have another paper on spatio-temporal detection of actions in video via Transformers (Zhao et al., 2022).
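
As a rough illustration of what a locality bias can look like in attention, the sketch below restricts each image patch to attending only to patches in its immediate neighbourhood. This is a generic local-window mask written for this article, not the box attention of Nguyen et al. (2022); the function names, the grid size, and the `radius` parameter are all assumptions made for the example.

```python
import math


def local_attention_mask(height, width, radius):
    """mask[i][j] is True when patch j lies within `radius` rows and columns of patch i."""
    # Patches are numbered row by row over a height x width grid.
    coords = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in coords]
        for (r1, c1) in coords
    ]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, with zero weight on disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# A 4 x 4 grid of patches where each patch sees only its direct neighbours.
mask = local_attention_mask(height=4, width=4, radius=1)
# With uniform scores, the weight spreads evenly over the permitted patches.
weights = masked_softmax([0.0] * 16, mask[0])
```

A hard mask like this builds locality in from the start instead of leaving it to be learned from data, at the cost of hiding distant patches from a single layer.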

In terms of self-training and augmentation work, where do you see video heading?

The problem with video is that you cannot just keep on labeling for a particular objective, because at some point, you want to label pixels in the video, and do that consecutively over time. And then you find that there is an end to labeling, even if you hire millions of people to do that. So how can you learn pixel-precise representations without the need to label every pixel? Self-supervision is one way, but it is not enough because it still needs labels for fine-tuning, although far fewer. So I think self-supervision is a required step, but it needs more...

With self-supervision, you can pre-train your backbone and then do image classification as well as when training from scratch on ImageNet. Your fine-tuning task, though, still needs image classification labels for the final part. Far fewer, but you still need them. Now, if I'm interested in spatio-temporal understanding in video, I still need to label videos with spatio-temporal annotations. And nobody has done this yet. Of course, self-supervision will help in this scenario, too. But it is totally unclear how well it will perform, how many annotations you need, and whether that is a sufficient reduction or whether that is still too hard.

You started out in the multimedia world and nowadays there's a lot of progress in multimodal models. Is multimodal going to cause a lot of progress and will it become better than single-modal?

It will be huge and better than single-modal. It's also a very big research opportunity. Because how do you know which signal to trust? Should you trust the sound or the sight? Or maybe the speech? Transformers are very suited for this problem. One of the things we did is look into repetition counting in video. Many people have studied this, and all of them looked at pixels only. Last year, we had a paper where we introduced sound. If you also capture the sound, you can better estimate the repetitions. And the nice thing is that even if the image is not captured for some reason, the sound still continues as before, making the estimate invariant to accidental changes. Single-modal specialization is convenient for research, but it is limiting.
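
To make the "which signal to trust" question concrete, here is a minimal sketch of reliability-weighted fusion of two repetition counts. It is a simple weighted average written for illustration and is not the method of Zhang et al. (2021); the function name `fuse_counts` and the reliability scores are hypothetical.

```python
def fuse_counts(estimates):
    """Combine per-modality repetition counts, weighting each by its reliability.

    `estimates` maps a modality name to a (count, reliability) pair, or to None
    when that modality is unavailable. Reliabilities are non-negative scores that
    are assumed to come from elsewhere (hypothetical for this sketch).
    """
    available = [pair for pair in estimates.values() if pair is not None]
    total_weight = sum(weight for _, weight in available)
    if total_weight == 0:
        raise ValueError("no usable modality")
    return sum(count * weight for count, weight in available) / total_weight


# Both streams present and equally trusted: the result lies between the two counts.
both = fuse_counts({"sight": (10.0, 0.5), "sound": (12.0, 0.5)})
# The image stream is missing: the estimate falls back to sound alone.
sound_only = fuse_counts({"sight": None, "sound": (12.0, 0.9)})
```

In practice the reliabilities would themselves be predicted from the data, which is where most of the research difficulty lies.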

You have experience in both academia and industry, including in startups. What do you think is a recipe for a successful industry-academic cooperation? And how does academic research stay relevant in the age where industrial research labs have huge supercomputers and data?

A good question. Let me first answer the last one. I think academia should not compete with big tech, because they will beat universities when it comes to compute problems and solutions. So it's the task of academia to ask new questions. The most intriguing question for me at this stage is what can we do with as little data as possible? And how can we still learn things? Do we really need so much data? How can we make sure it is still robust? And how do we solve bias and ethical questions that come with using lots of data? And how can you mitigate that with less data?

Regarding collaboration with industry, I think it's very important to find common ground on a topic, and to accept that there are different objectives. For industry, I think it's important not to expect academia to contribute directly to a product. And you need patience and a long timeline, because when you start a PhD project, it always takes at least a year or two before some real output comes out. I think the most value is not per se in the big projects themselves, but much more in the conversation that you start with each other and in the way you inspire each other.

How do you see the role of Amsterdam, the Netherlands, and Europe in the global AI race? Are we a serious AI hub, or only a contributor of students who migrate to the large industrial research labs in the US?

We have examples of those who have migrated to the big labs, but I think that's something to be proud of. In Amsterdam, we made the choice to go for data-driven AI a long time ago. I would say that was a good choice. That has really paid off now with the deep learning revolution. Within Europe, we're actually doing quite okay. We are also very active in the AI community in the Netherlands. We started ICAI, which is now a national thing. Maarten de Rijke is making it a Netherlands-wide affair, which was also the intention from the start. In Europe we are active in the ELLIS network. This is a network of excellence on learning systems with a heavy focus on machine learning based AI. We have an ELLIS unit in Amsterdam. Recently, there was a 100 million euro grant for a new ELLIS lab in Tübingen. The chair of it and of ELLIS is Bernhard Schölkopf, who has worked on support vector machines in the past and made really impactful contributions. Among the AI researchers in Europe, he is probably at the top of the mountain, so to speak. We have Max Welling, who is also very good.

The European AI Act was one of the major policy initiatives in Europe regarding AI, and it is more about controlling harm and regulating than about stimulating. Computer vision is a typical dual-use technology. There are many very risk-laden applications of it. Do you think Europe is doing the right thing?

I think Europe is doing a good thing. Europe wants to do AI in the European way. We have the American system, where the market determines everything, and the government will only step in if it's really needed or things go wrong. Then we have the Chinese system, with a big state that says “we have the control, and then you can do everything”. Both have their pros and cons. And we have the European way, where we say it's the people who should decide on what happens with the data and on how algorithms influence society and their way of life. Now, that is a good thing. The only caveat is that we have to make sure it does not limit technology development too much, that it does not put us at a disadvantage. I think that this discussion is very important. It will change research and help in finding biases. Bias in data is a very hot topic, but there are also algorithms that can amplify the bias. That is a research topic that the computer vision people should also address themselves.

So yes, AI should be regulated, for sure. It's hard to predict how technology will be used, and I'm sure that I'm not doing research to do harm. It is not my intent to come up with an algorithm knowing that when you install it on a drone, you can kill people, and thinking that will be great. I cannot believe that is any researcher’s intent.

Harmful applications typically use techniques that also exist for other use cases. A good development nowadays is that at least we researchers are being forced to write down the limitations and possible bad uses of what we propose. Mass surveillance depends on the government. On privacy, we recently did face detection work with generated faces instead of faces of real people, which GDPR does not allow us to store. This face detector is almost as good as one that is trained on real data, but it's completely privacy-preserving. Humans are very good at finding bad uses of new technology first, because for good use, you need more imagination. Once we invented airplanes, the first thing we started to do with them was drop bombs, and transporting people back and forth came much later.

All of these new developments are going super fast. What do you do to stay up to date in your field, especially in connection to conferences and what are your recommendations for others?

Conferences in COVID times are a joke. It really doesn't work for me. I don't go to conferences anymore when they are online. I like physical conferences, especially to meet people and also to notice papers that I would not easily notice on social media or on arXiv. I'm on Twitter, which I think is a great tool, because you can decide who you follow, and you get recommendations. I rely on my students to inspire me by recommending papers. And I also try not to be too obsessed with what comes out every day. Follow your own intuitions and don't be too impressed by the hype of the day.

Thank you, Cees, for the interview and for sharing these insights and your unique view of the field with us. You can follow Cees Snoek on Twitter: @cgmsnoek and find more papers and links on his personal website.

More links and reading list:

VIS Lab: https://ivi.fnwi.uva.nl/vislab/

Amsterdam ELLIS Unit: https://ivi.fnwi.uva.nl/ellis/

Sander R. Klomp, Matthew van Rijn, Rob G.J. Wijnhoven, Cees G.M. Snoek, Peter H.N. de With. 2021. “Safe Fakes: Evaluating Face Anonymizers for Face Detectors”, in F&G21. https://isis-data.science.uva.nl/cgmsnoek/pub/klomp-safe-fakes-fg2021.pdf

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. 2022. “A ConvNet for the 2020s”. https://arxiv.org/abs/2201.03545

Duy-Kien Nguyen, Jihong Ju, Olaf Booij, Martin R. Oswald, Cees G.M. Snoek. 2022. “BoxeR: Box-Attention for 2D and 3D Transformers”, in Proceedings of CVPR 2022.

William Thong, Cees G.M. Snoek. 2021. “Feature and Label Embedding Spaces Matter in Addressing Image Classifier Bias”, in BMVC21. https://isis-data.science.uva.nl/cgmsnoek/pub/thong-image-classifier-bias-bmvc2021.pdf

Yunhua Zhang, Ling Shao, Cees G.M. Snoek. 2021. “Repetitive Activity Counting by Sight and Sound”, in Proceedings of CVPR 2021. https://arxiv.org/abs/2103.13096

Jiaojiao Zhao, Yanyi Zhang, Xinyu Li, Hao Chen, Shuai Bing, Mingze Xu, Chunhui Liu, Kaustav Kundu, Yuanjun Xiong, Davide Modolo, Ivan Marsic, Cees G.M. Snoek, Joseph Tighe. 2022. “TubeR: Tubelet Transformer for Video Action Detection”, in Proceedings of CVPR 2022. https://arxiv.org/abs/2104.00969
