Marzieh Fadaee, NLP Research Lead at Zeta Alpha, joins Andrew Yates and Sergi Castella to chat about her work using large language models like GPT-3 to generate domain-specific training data for retrieval models with little-to-no human input. The two papers discussed are "InPars: Data Augmentation for Information Retrieval using Large Language Models" and "Promptagator: Few-shot Dense Retrieval From 8 Examples".
The conversation touches on the details of prompting and the costs of generating domain-specific datasets for information retrieval.
📄 InPars: https://arxiv.org/abs/2202.05144
📄 Promptagator: https://arxiv.org/abs/2209.11755
Timestamps:
00:00 Introduction
02:00 Background and journey of Marzieh Fadaee
03:10 Challenges of leveraging Large LMs in Information Retrieval
05:20 InPars, motivation and method
14:30 Vanilla vs GBQ prompting
24:40 Evaluation and Benchmark
26:30 Baselines
27:40 Main results and takeaways (Table 1, InPars)
35:40 Ablations: prompting, in-domain vs. MS MARCO input documents
40:40 Promptagator overview and main differences with InPars
48:40 Retriever training and filtering in Promptagator
54:37 Main results (Table 2, Promptagator)
1:02:30 Ablations on consistency filtering (Figure 2, Promptagator)
1:07:39 Is this the magic black-box pipeline for neural retrieval on any documents?
1:11:14 Limitations of using LMs for synthetic data
1:13:00 Future directions for this line of research