Automated Research Assistants and Neural Information Retrieval - an Interview with Rodrigo Nogueira
16-03-2020, by Jakub Zavrel

Last week, Rodrigo Nogueira (University of Campinas, University of Waterloo, NeuralMind), who in his recent PhD work at NYU pioneered the use of Transformer models as rerankers and the use of doc2query-augmented document representations, joined Zeta Alpha as a scientific advisor in our efforts to push the state of the art in Neural Information Retrieval. Welcome to the team! A perfect opportunity to ask him some questions about how he got into IR, his more recent research interests, and where he sees the field going.
JZ: Rodrigo, we're very excited to have you as an advisor to Zeta Alpha. Can you tell us about your background and how you initially got interested in Neural Information Retrieval?
RN: After I finished my undergrad in electrical engineering, I worked for five years at Siemens on something completely unrelated to artificial intelligence: developing systems to monitor hydropower plants. You have these SCADA systems, with sensors, and then you have to acquire all this data and take actions based on it. And that was the time, around 2011, when I saw that deep neural nets were becoming quite good at image recognition. In 2012, we saw the result of the ImageNet competition, and I felt the need to start working in this area. I did my Master's in Brazil. With my advisor, we started working with classical computer vision techniques, and he proposed to use them to detect whether a fingerprint was fake or real.
After six months of trying, we were more or less on par with the state of the art. In parallel, I started to use convolutional networks. At the time, there was no easy way to use something like PyTorch, and GPUs were not widely available. We simply decided to go to AWS, take the largest machine there, and search for the best architecture for a ConvNet. After around three or four months, we finally beat the state of the art. We submitted this system to a competition, and we got first place, two points ahead of second place. It was the first ConvNet in the competition, and all the other submissions relied on hand-engineered features. I was quite happy with the results, and then I decided to start my PhD.
I applied to NYU, since it was famous for its big deep learning group. As I started my PhD, my advisor, Kyunghyun Cho, came to me with a proposal for a web navigation system: an agent that can learn how to navigate the graph of web links to find an answer. It was quite interesting to learn that what we were doing was actually called information retrieval; we didn't know anything about IR at the time. I was fascinated by the idea of finding a needle in the haystack, finding a piece of information that answers a query in large amounts of text. Could we use the deep neural nets from our group to improve upon that?
We had this hammer, so what was the best way to use it? We started in a more conceptually elegant way: we treated the search engine as a black box and trained an agent that learned how to interact with that black box to retrieve answers, using reinforcement learning. I spent three years of my PhD doing that. To be honest, it wasn't very successful. The main issue was that our queries contained way too little information for reinforcement learning; training was too unstable. And then came this revolution with pre-trained Transformers. We were some of the first people to use BERT for the ranking task, and overnight, the accuracy jumped some 10 points higher. I was quite happy with that. So then I focused on how to use these models to improve information retrieval.
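The two-stage setup Rodrigo describes can be sketched in a few lines: a first-stage retriever (e.g. BM25) returns candidate passages, and a cross-encoder then scores each (query, passage) pair to re-order them. In the sketch below, a toy lexical-overlap scorer stands in for the fine-tuned BERT model, and the function names (`score_pair`, `rerank`) are illustrative, not code from the actual papers; in the real system, `score_pair` would feed "[CLS] query [SEP] passage [SEP]" through BERT and read off a relevance probability.

```python
# Minimal sketch of a retrieve-then-rerank pipeline. A toy lexical-overlap
# scorer stands in for the fine-tuned BERT cross-encoder.

def score_pair(query: str, passage: str) -> float:
    """Placeholder relevance score: fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Re-order first-stage candidates by the (stand-in) cross-encoder score."""
    scored = sorted(candidates, key=lambda p: score_pair(query, p), reverse=True)
    return scored[:top_k]

# Candidates as they might come back from a first-stage BM25 retriever.
candidates = [
    "The 1990s saw rapid growth of the web.",
    "Bill Clinton was president of the United States in the 1990s.",
    "Hydropower plants are monitored with SCADA systems.",
]
print(rerank("who was president of the united states in the 1990s", candidates, top_k=1))
```

The key design point is that the reranker sees the query and the passage together, so it can model their interaction, which is exactly what made BERT-style cross-encoders so much more accurate than purely term-based scoring.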
JZ: If you look at Neural Information Retrieval, it is moving very fast right now. Benchmarks are shattered all the time. What do you see as some of the most exciting recent developments in Neural Information Retrieval?
RN: For sure the way we use pre-training data. The information retrieval community has all this knowledge about how to deal with very large collections of texts. It is quite exciting how information retrieval can contribute to actually improving the pre-training of Transformer models. For example, some people use existing models to filter what information their model will see during pre-training. It allows you not only to make training more efficient, but also to learn what to look for in the data while doing pre-training. I think this bi-directional flow of information between the two models is quite important for making progress.
Taking inspiration from humans: we actively decide what to read. We don't read everything that Google suggests, for instance. This active selection is all about exploring what we know and what we don't know. Whereas in current pre-training methods, whatever you give to the model, it will try to learn. So one interesting direction is to have these smart selectors of what to read and what not to read. In a sense, the model is creating the data loader used to train it. Put differently, it's actually designing part of its own training pipeline.
What's also quite exciting is that we are making progress in different IR domains. We're starting to see more and more successful applications in the medical and biomedical domain, and also in the legal domain. It's one thing to solve generic questions from MS MARCO, like "Who was the president of the United States in the 1990s?", and quite another to answer complicated legal or biomedical questions.
JZ: What is your dream of where you want your research and development to go?
RN: The main motivation for me to start working on information retrieval was to build something that Zeta Alpha also wants, I believe: a research assistant or automated scientist that helps us come up with new hypotheses and find relevant work. How can you surface that knowledge to scientists, so that they can create new hypotheses that they otherwise wouldn't, or that would take a lot longer, if they didn't have access to that knowledge? That, I think, is my main motivation to work on search engines: to uncover hidden information, the important data that would otherwise go unnoticed.
Currently, we have these scientific niches that don't talk much to each other. They spend years working in semi-isolated ways, until they find that some other group is also interested in the same goals, but they don't share the tools. I think automated research assistants can bridge this gap. How to build them is still an open question, but I would just start with the simplest tools that we have: a good search mechanism that provides you with candidates of what to read, and good summarization tools that try to make a concise transfer of information to scientists. And then, hopefully, these models will reach a point where they can read an entire research domain and come up with questions that are not yet answered. We currently have tools that tell us which questions can be answered, but we don't have a way to surface which questions cannot be answered in a given collection. I think creating such mechanisms is a promising direction.
JZ: So do you hope that these automated scientists work for us and not against us?
RN: I have a more optimistic view on this. When planes first started to fly, we didn't have many mechanisms to prevent them from falling and crashing, but over the years people came up with safety devices. Similarly, in the first computer vision applications, most people didn't even know there was bias in the classification model, and now we are developing mechanisms to prevent these biases. In the same way, we will gradually evolve towards safer machine learning models. I don't think machines will suddenly take over without our control. Machine learning systems will have a ton of mechanisms that prevent them from making mistakes that would harm us.
JZ: As a researcher, how do you currently keep track of your interests and stay up to date in the very fast development of AI?
RN: I don't have a good answer for that. There is a ton of work being published, so I cannot keep up with the literature. I use some tools that I'm pretty sure are not optimal. I use Twitter to see what people in my circle have published, and I rely on people giving me hints as to what the interesting articles are. I also read recommendations from scholarly search engines like Google Scholar and Semantic Scholar, but I'm quite sure this is not enough, in the sense that they are only giving me a small portion of the relevant articles that come out every week. I wish we had much better recommender systems. But I think the short answer is that I simply don't keep up. I miss a ton of good work, for sure.
JZ: Do we actually have enough time to read? Even if you knew which papers were relevant to read, the bottleneck is still the human capacity to absorb that information.
RN: True. I think I can read in depth at most five papers per week. If it's more than that, I'll just skim over the papers. So, five fully understood papers per week, more or less; not more than that. So we'd better get some pretty good recommender systems to select a small amount for us to read.
JZ: What would be your top three feature requests for Zeta Alpha to help you with this?
RN: First, a really good search mechanism in which I type a very long description, or at least a query summarizing my current work, and Zeta Alpha returns the list of related papers that I have to read. In a sense, you give more or less the abstract of your dream paper, or at least a long description of what you're working on, and it gives you back a list of papers that you should definitely read to start with. This is one feature I don't think current search mechanisms provide; mostly, you really have to know the query terms. It would be awesome to give a very noisy description of what you want to accomplish and have the system provide relevant work.
Another one is a recommender system based on your interests, like the papers that you published in the past and the papers that you marked as relevant. One good recommender would be a super nice feature to have.
A way more ambitious feature is the one that I hinted at already: given this collection of texts that I'm interested in, what are the questions that they don't answer that are still interesting? We have papers about pre-trained language models being published all the time. Wouldn't it be nice to have a feature that says: well, they are all talking about these and those aspects of the work, but there is this knowledge gap. So I'd like Zeta Alpha to tell us: we still don't know the answer to this question, and that might be a good research direction. This would be super cool to have.
JZ: We will have to call this feature Rodrigo’s Analytics for the Unknown Unknowns. Personally, I think it's good that there is a lot of space for human imagination in finding things that are interesting and relevant, but certainly there are ways that AI can help to make the handwork and research homework more effective here.
Thanks for the interview, Rodrigo, and we look forward to working with you on these things.