Rama Akkiraju (NVIDIA)
FACTS about building Generative AI-based Chatbots: Lessons and Best Practices
Enterprise chatbots, powered by generative AI, are rapidly emerging as among the most explored initial applications of this technology in industry, aimed at enhancing employee productivity. Retrieval Augmented Generation (RAG), Large Language Models (LLMs), and frameworks such as LangChain serve as key technological components in building generative-AI-based chatbots. However, leveraging generative AI for enterprise chatbots presents numerous challenges and considerations. Crafting a successful enterprise chatbot demands meticulous engineering of RAG pipelines, fine-tuning of LLMs, prompt engineering, ensuring the relevance and accuracy of enterprise knowledge, honoring document access control permissions, providing concise responses with pertinent references, and safeguarding personal information. In this talk, we present our recipes for optimizing RAG performance across various control points, drawn from three case studies: enterprise-grade chatbots for answering questions about IT and HR benefits, about company financial earnings, and about all enterprise content. Each of these domains exposed us to different concerns that must be addressed in RAG-based chatbots, including handling data that mixes structured, unstructured, and multi-modal content. Our key findings are that 1) document metadata enrichment plays a critical role in retrieval relevancy; 2) retrievers struggle with complex and multi-part queries, necessitating more complex agent architectures; and 3) guardrails play a critical role in securing sensitive documents when building enterprise chatbots. Notably, all of these issues require LLM-based solutions themselves, making RAG pipeline optimization a recursive process. We conclude with best practices, distilled from our work, for building enterprise-grade chatbots that shed light on techniques for dealing with chatbot FACTS: content freshness (F), architectures (A), cost economics of LLMs (C), testing cycles (T), and security (S).
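To make the first finding concrete, here is a minimal Python sketch of metadata enrichment before indexing. It is not the authors' pipeline: `generate_metadata` is a hypothetical stand-in for an LLM call that extracts a title, keywords, and a summary, and the enriched string, rather than the raw chunk text, is what gets embedded.

```python
# Minimal sketch of document metadata enrichment before indexing.
# `generate_metadata` stands in for any LLM call (hypothetical helper);
# the technique is simply: derive metadata, prepend it to the chunk
# text, and embed the combined string so the retriever can match on it.

from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    metadata: dict = field(default_factory=dict)

def generate_metadata(text: str) -> dict:
    """Placeholder for an LLM call that extracts title/keywords/summary."""
    # In practice, prompt an LLM: "Extract a title, five keywords, and a
    # one-sentence summary from the passage below." Here we stub it out.
    return {"title": text.split(".")[0][:80], "keywords": [], "summary": ""}

def enrich_for_indexing(chunk: Chunk) -> str:
    """Prepend metadata to the chunk text so embeddings capture both."""
    chunk.metadata = generate_metadata(chunk.text)
    header = f"Title: {chunk.metadata['title']}\n"
    return header + chunk.text  # embed this string, not the raw text

chunk = Chunk("hr-001", "Employees accrue 20 PTO days per year. Unused days roll over.")
print(enrich_for_indexing(chunk))
```

The design point is that dense retrieval can only match on whatever text is embedded, so surfacing metadata in that text gives the retriever more signal for relevancy.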
Julia Kiseleva (MultiOn)
Evaluating Interactive Autonomous Agents
Seamless interaction between AI agents and humans through natural language remains a critical objective in AI research. This talk addresses the difficulties of developing interactive autonomous agents that can understand and execute grounded natural language instructions. Despite significant progress, challenges such as the scarcity of suitable datasets and the need for robust evaluation platforms persist.
Natalia Vassilieva (Cerebras)
Training and Inference Trade-offs for Domain-Specific and Multilingual LLMs
This talk explores the trade-offs between training large language models (LLMs) from scratch and adapting existing generalist models for specific domains or languages. While the quality of large-scale foundational “generalist” models has steadily improved, this progress often comes with increased model size and higher serving costs. Moreover, even the most powerful models today may struggle with the nuances of specialized fields like medicine or finance, lack fluency in low-resource languages and dialects, and exhibit a bias toward Western cultures.
In many cases, it is more efficient to develop specialized models that are finely tuned to the unique vocabulary, context, and nuances of a particular field or language, leading to better-quality outputs and more cost-effective inference. However, the questions remain: should you train this specialized model from scratch or adapt an existing, high-quality, English-centric generalist model? Which model size should you pick? How much data is required? And how do you avoid catastrophic forgetting during adaptation?
We will address these and other questions, supported by case studies, while emphasizing the importance of efficient training techniques and scaling laws as critical tools in this process.
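As a rough illustration of the scaling-law reasoning the talk refers to, the sketch below sizes a model for a fixed compute budget using the common approximations C ≈ 6ND training FLOPs and a compute-optimal ratio of roughly 20 training tokens per parameter (Chinchilla-style); real choices also depend on architecture, data quality, and the inference budget.

```python
# Back-of-the-envelope, Chinchilla-style compute-optimal sizing:
# training compute C ≈ 6 * N * D FLOPs, with a compute-optimal
# token-to-parameter ratio D/N ≈ 20 (Hoffmann et al., 2022).

import math

def compute_optimal(c_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly balance a FLOP budget."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

for budget in (1e21, 1e22, 1e23):
    n, d = compute_optimal(budget)
    print(f"C={budget:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e9:.0f}B tokens")
```

Note that serving cost pushes in the opposite direction: a smaller model trained past the compute-optimal point is often cheaper over its deployed lifetime, which is exactly the training/inference trade-off the talk examines.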
Raza Habib (Humanloop)
Best Practices for building high-retention, LLM-powered products
Through his work as the CEO of Humanloop, Raza has personally worked with dozens of companies to build and deploy LLM-powered products. He has also interviewed many of the best engineering leaders building AI products through his podcast, High Agency. In this talk, Raza will summarise best practices for building with AI, covering questions such as: how to build reliable agents in practice, how best to evaluate your AI systems, and what skills your team needs.
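A minimal version of the evaluation practice discussed here can be sketched in a few lines of Python: run the product over a fixed dataset and aggregate a simple metric. This is a generic sketch, not Humanloop's API, and `call_app` is a hypothetical stand-in for the system under test.

```python
# Minimal offline evaluation harness: score an LLM app over a fixed,
# versioned dataset so regressions are caught before deployment.

from typing import Callable

def call_app(question: str) -> str:
    """Placeholder for the LLM-powered product under test."""
    return "Paris" if "capital of France" in question else "I don't know"

def exact_match(output: str, expected: str) -> float:
    return float(output.strip().lower() == expected.strip().lower())

def run_eval(dataset: list[dict], metric: Callable[[str, str], float]) -> float:
    scores = [metric(call_app(row["input"]), row["expected"]) for row in dataset]
    return sum(scores) / len(scores)

dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is the capital of Peru?", "expected": "Lima"},
]
print(f"exact-match accuracy: {run_eval(dataset, exact_match):.2f}")
```

In practice the metric is often an LLM-as-judge rather than exact match, but the loop, dataset, and aggregation stay the same.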
Douwe Kiela (Contextual AI)
RAG on the edge: GRIT and OLMoE for hyper-efficient retrieval and generation
Retrieval-augmented generation (RAG) has become the dominant paradigm for allowing language models to work on external data sources. As generative AI becomes increasingly important, a natural question is how we can make it feasible to deploy RAG systems directly to edge devices. In this talk, I will cover two research contributions that push the frontier on hyper-efficient RAG on the edge. First, I will discuss generative representational instruction tuning (GRIT), where we demonstrate that the weights between a retriever and generator can be shared via instruction tuning, allowing us to cache representations for much faster RAG. Second, I’ll present OLMoE, the first fully open source Mixture-of-Experts (MoE) language model that outperforms much larger models while being an order of magnitude more efficient.
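A toy sketch of the weight-sharing idea behind GRIT: a single model exposes both an embedding mode and a generation mode, so document representations computed offline can be cached and reused at query time instead of maintaining a separate retriever. The class below is a hypothetical stand-in (a real GRIT model is an instruction-tuned transformer, and the paper's speedups also come from reusing cached representations inside generation, which is not shown).

```python
# Hypothetical sketch: one set of weights serves as both retriever
# (embedding mode) and generator (text mode), with an offline
# embedding cache standing in for the representations GRIT reuses.

import numpy as np

class GritLikeModel:
    """Stand-in interface; real GRIT models switch modes by instruction."""
    def embed(self, text: str) -> np.ndarray:
        rng = np.random.default_rng(abs(hash(text)) % 2**32)  # fake embedding
        v = rng.standard_normal(8)
        return v / np.linalg.norm(v)
    def generate(self, prompt: str) -> str:
        return f"[generated answer grounded in: {prompt[:60]}...]"

model = GritLikeModel()
docs = ["MoE layers route tokens to experts.", "RAG retrieves before generating."]
cache = {d: model.embed(d) for d in docs}   # computed once, offline

query = "How does retrieval-augmented generation work?"
q = model.embed(query)                      # same weights as generation
best = max(cache, key=lambda d: float(q @ cache[d]))
print(model.generate(f"Context: {best}\nQuestion: {query}"))
```

On an edge device this matters twice over: only one model needs to fit in memory, and cached representations avoid re-encoding documents on every query.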
Michael Ryan (Stanford University)
DSPy: Prompt Optimization for LM Programs
It has never been easier to build amazing LLM-powered applications. Unfortunately, engineering reliable and trustworthy LLM systems remains challenging. Instead, practitioners should build LM Programs composed of several composable calls to LLMs, which can be rigorously tested, audited, and optimized like other software systems. In this talk I will introduce the idea of LM Programs in DSPy, the library for programming rather than prompting LMs. I will demonstrate how the LM Program abstraction allows the creation of automatic optimizers that can tune both the prompts and the weights in an LM Program. I will conclude with an introduction to MIPROv2, our latest and highest-performing prompt optimization algorithm for LM Programs.
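For orientation, here is a short example in the spirit of DSPy's public API: declare a signature, wrap it in a module to get an LM program, and compile the program with MIPROv2. Exact names and defaults vary across DSPy versions, and a realistic trainset needs far more examples than shown.

```python
# Declare an LM program in DSPy, then optimize its prompts with MIPROv2.
# Based on DSPy's public API; details may differ across library versions.

import dspy
from dspy.teleprompt import MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported backend

class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(AnswerQuestion)  # a one-module LM program

def exact_match(example, prediction, trace=None):
    return example.answer.lower() == prediction.answer.lower()

# Illustrative only; MIPROv2 needs a substantially larger trainset.
trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of Peru?", answer="Lima").with_inputs("question"),
]

optimizer = MIPROv2(metric=exact_match, auto="light")
optimized = optimizer.compile(program, trainset=trainset)
print(optimized(question="What is the capital of France?").answer)
```

The point of the abstraction is that `program` is ordinary software: it can be unit-tested and audited, and the optimizer rewrites its prompts (and optionally its weights) against the metric rather than relying on hand-tuning.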
Zhuyun Dai (Google DeepMind)
LLM-Powered Retrieval: From Distillation to New Architectures
Information retrieval systems are essential for accessing the vast knowledge stored in large corpora, but current models often fall short when it comes to reasoning, following instructions, and generalizing to new distributions. This talk delves into our research aimed at enhancing retrieval models by harnessing the power of large language models (LLMs).
We first tackle the challenge of generalizing neural retrievers across different domains, showing how LLM distillation can be leveraged to achieve this and enable versatile neural retrievers such as the Gecko text embeddings API. We then introduce XTR, a novel multi-vector retrieval model that brings the architectures of LLMs and retrievers closer together, improving retriever generalization while remaining efficient.
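For context on multi-vector scoring, here is a NumPy sketch of the "sum of max similarities" rule used by ColBERT-style retrievers, which XTR builds on: each query token vector is matched to its best document token vector and the per-token maxima are summed. XTR's actual contribution, retrieving directly with token-level scores so that no separate gathering and rescoring stage is needed, is not shown here.

```python
# Multi-vector MaxSim scoring: score(q, d) = sum over query tokens of
# the maximum dot product with any document token.

import numpy as np

def maxsim_score(q_vecs: np.ndarray, d_vecs: np.ndarray) -> float:
    """q_vecs: (num_q_tokens, dim); d_vecs: (num_d_tokens, dim)."""
    sim = q_vecs @ d_vecs.T              # token-to-token similarities
    return float(sim.max(axis=1).sum())  # best doc token per query token

rng = np.random.default_rng(0)
query = rng.standard_normal((4, 16))     # 4 query token embeddings
doc_a = rng.standard_normal((10, 16))    # 10 document token embeddings
doc_b = rng.standard_normal((7, 16))
print("doc_a:", maxsim_score(query, doc_a), "doc_b:", maxsim_score(query, doc_b))
```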
In the second part of this talk, we explore the potential of long-context LLMs to revolutionize the future of retrieval by digesting an entire corpus as a prompt. To evaluate this exciting frontier, we introduce LOFT, a new benchmark specifically designed to assess the impact of long-context models on retrieval, retrieval-augmented generation (RAG), and database querying.
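The corpus-in-the-prompt setup that LOFT evaluates can be illustrated with a toy sketch: rather than running a retriever, the (small) corpus is serialized into a long-context model's prompt and the model is asked to do the retrieval itself. `long_context_llm` is a hypothetical placeholder, not the actual LOFT harness.

```python
# Toy "corpus in context" setup: the whole corpus goes into the prompt
# and a long-context model performs retrieval directly.

def long_context_llm(prompt: str) -> str:
    """Placeholder for a call to a long-context LLM."""
    return "doc_2"

corpus = {
    "doc_1": "The Gecko API produces compact text embeddings.",
    "doc_2": "LOFT benchmarks long-context retrieval, RAG, and SQL-like querying.",
}

prompt = "\n".join(f"[{doc_id}] {text}" for doc_id, text in corpus.items())
prompt += "\n\nWhich document describes the LOFT benchmark? Answer with its id."
print(long_context_llm(prompt))
```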