
Ready to see the full picture? Multimodal RAG

If you only consider the text in your enterprise documents for your Retrieval-Augmented Generation (RAG) system, you lose a vast amount of information. Visual details can make a crucial difference when building state-of-the-art AI knowledge management tools and agents. At Zeta Alpha, we're advancing how Large Vision Language Models (LVLMs) enable RAG to understand the entire document, including visual elements such as figures, tables, and complex slides.

For too long, extracting value from these visual components was a complex hurdle. Large Language Models have evolved into Vision Language Models, opening the door to multimedia content. This evolution, with models like GPT-4o, means we can now process document pages directly as images, bypassing extensive preprocessing.
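As a minimal sketch of what this looks like in practice (assuming the open-source pdf2image library and a placeholder file name, not a description of Zeta Alpha's pipeline), each page is simply rasterized once and handed to the vision encoder as an image, with no layout parsing, OCR, or table extraction in between:

```python
from pdf2image import convert_from_path  # Poppler-backed PDF rasterizer

# "report.pdf" is a hypothetical file name used only for illustration.
pages = convert_from_path("report.pdf", dpi=150)

for i, page_image in enumerate(pages):
    # Each rendered page image can be embedded or sent to an LVLM as-is.
    page_image.save(f"page_{i:03d}.png")
```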


Approaches like ColPali and Document Screenshot Embeddings (DSE) exemplify this shift. They offer unprecedented accuracy for RAG, not only for visual Q&A but also for retrieving detailed text passages from standard text documents, a leap beyond earlier multimodal techniques.


In a recent talk, Jakub Zavrel (CEO) and Batu Helvacıoğlu (AI Research Intern) outlined the technical specifics of our extensive research into implementing these methods:

  • Vision-centric indexing (e.g., ColPali) radically simplifies document pipelines and boosts performance for visual content understanding.

  • Bi-encoder models (DSE) provide fast, efficient retrieval with a single vector per page image.

  • Late-interaction models (like ColQwen2) offer higher accuracy by comparing query tokens to numerous image patches per page, but are more computationally demanding.

  • Our two-stage retrieval strategy, which uses averaged patch embeddings for initial filtering followed by late-interaction scoring for re-ranking (sketched below), demonstrated the best balance of effectiveness and practicality on the Vidore-v1 benchmark.
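
The difference between the bi-encoder and late-interaction scoring styles above can be made concrete with a short sketch. This is illustrative code over pre-computed embeddings (the array shapes and normalization are assumptions, not Zeta Alpha's implementation): a DSE-style bi-encoder scores a page with a single dot product, while a ColPali/ColQwen2-style model compares every query token against every image patch and sums the per-token maxima (MaxSim):

```python
import numpy as np

def single_vector_score(query_vec: np.ndarray, page_vec: np.ndarray) -> float:
    """Bi-encoder (DSE-style): one embedding per query, one per page image.
    Equals cosine similarity when both vectors are L2-normalized."""
    return float(query_vec @ page_vec)

def late_interaction_score(query_tokens: np.ndarray, page_patches: np.ndarray) -> float:
    """Late-interaction (ColPali/ColQwen2-style) MaxSim scoring.
    query_tokens: (num_query_tokens, dim); page_patches: (num_patches, dim)."""
    sims = query_tokens @ page_patches.T   # similarity of every query token to every patch
    return float(sims.max(axis=1).sum())   # best patch per query token, summed
```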
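The two-stage strategy then combines both: a cheap pass over one mean-pooled vector per page narrows the corpus, and the exact MaxSim score re-ranks only the shortlisted pages. A minimal sketch reusing the helpers above (the candidate counts are arbitrary illustration, not the settings from the talk):

```python
def two_stage_retrieval(query_tokens, corpus_patches, k_candidates=100, k_final=10):
    """corpus_patches: list of (num_patches, dim) patch-embedding arrays, one per page."""
    # Stage 1: mean-pool each page's patches into a single vector, renormalize, and filter.
    pooled = np.stack([p.mean(axis=0) for p in corpus_patches])
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)
    query_vec = query_tokens.mean(axis=0)
    query_vec /= np.linalg.norm(query_vec)
    candidates = np.argsort(pooled @ query_vec)[::-1][:k_candidates]

    # Stage 2: exact late-interaction re-ranking of the shortlisted pages only.
    reranked = sorted(
        candidates,
        key=lambda i: late_interaction_score(query_tokens, corpus_patches[i]),
        reverse=True,
    )
    return reranked[:k_final]
```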


The real-world impact is clear: Zeta Alpha's solutions recently answered questions directly from screenshots embedded in documents and from bar charts in PowerPoint slides, a task that would challenge most traditional text-only systems.


➡️ Ready to see the full picture in your data? Reach out to us at Zeta Alpha to explore our cutting-edge multimodal RAG capabilities.

➡️ Watch the recording of our full talk: YouTube


