Robust evaluations for RAG with RAGElo
- Dinos Papakostas
- Jun 30
- 4 min read
Retrieval-Augmented Generation (RAG) systems have gained strong traction because of their ability to ground generated answers in knowledge sources, boosting their accuracy and reliability. However, evaluating them remains tricky, especially when there's no clear ground truth data to compare against.
When deploying a new RAG pipeline, we want to measure two key things clearly and reliably:
- Retrieval performance: How well does our system surface documents that are relevant to each question?
- Answer correctness: Given the retrieved documents, how accurate and helpful are the generated answers?
But without existing users or historical feedback data, obtaining annotations for document relevance and answer correctness quickly becomes costly and time-consuming.
RAGElo, a toolkit we've been building at Zeta Alpha, directly addresses this issue. Rather than relying purely on pointwise evaluations, RAGElo implements an Elo-based tournament evaluation, a well-established ranking method that was originally popularized for chess rankings. This approach allows users to compare multiple RAG system variants head-to-head and aggregate pairwise preferences into a robust, easy-to-interpret Elo ranking.
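To build intuition for how pairwise preferences turn into a ranking, here is a minimal sketch of the classic Elo update that tournament-style rankings are built on. This is an illustration only, not RAGElo's internal code: each head-to-head comparison is treated as a "game", the winner gains rating points, the loser gives them up, and the size of the update is controlled by a k-factor.

# Minimal illustration of the classic Elo update behind tournament-style rankings.
# This is a simplified sketch for intuition, not RAGElo's internal implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A beats agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one game. score_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Example: two agents start at 1000 and agent A wins one pairwise comparison.
print(update(1000, 1000, score_a=1.0))  # -> (1016.0, 984.0)

Repeating this update over many pairwise judgments is what lets small, noisy preferences accumulate into a stable ordering of systems.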

RAG Evaluations with RAGElo
To illustrate RAGElo's capabilities, let's run through an example scenario where we compare 6 candidate question-answering systems. We'll use a subset of the Qdrant documentation as our example corpus, with queries extracted from the qdrant_doc_qna dataset.
Example benchmark queries
- What is the difference between scalar and product quantization?
- What is the purpose of "ef_construct" in HNSW?
- What is the impact of "write_consistency_factor"?
RAGElo expects benchmark queries as a CSV file consisting of two columns (a short example of producing this file follows the list):
- qid: A unique query identifier (e.g., "q_1")
- query: The query text (e.g., "What's ef_construct?")
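If your benchmark queries live in code rather than a spreadsheet, writing this file takes only a few lines. The snippet below is a hypothetical example using pandas; the file path and query texts are placeholders, and this is not part of RAGElo itself.

# Hypothetical example of writing the queries CSV in the format RAGElo expects.
import pandas as pd

queries = pd.DataFrame(
    [
        {"qid": "q_1", "query": 'What is the purpose of "ef_construct" in HNSW?'},
        {"qid": "q_2", "query": "What is the difference between scalar and product quantization?"},
    ]
)
queries.to_csv("./data/queries.csv", index=False)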
Preparing Your Evaluation Data
RAGElo integrates seamlessly into any RAG workflow, irrespective of your implementation details. To run an evaluation, you simply prepare two CSV files representing the retrieval and answer generation steps (a sketch of assembling them from your own pipeline follows the two tables below):
For Retrieval Evaluation (documents.csv)
Column | Description | Example |
---|---|---|
did | A unique document ID | doc_0 |
document | The text content of the retrieved document/chunk | We enabled scalar quantization + HNSW with m=16 and ef_construct=512. [...] |
qid | The ID of the corresponding query | q_1 |
agent | The name of the RAG system variant | agent_v3 |
For Answer Generation Evaluation (answers.csv)
Column | Description | Example |
---|---|---|
qid | The query ID (must match retrieval runs) | q_1 |
agent | The name of the RAG system variant | agent_v3 |
answer | The generated answer text | The ef_construct parameter determines the size of the [...] |
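To show how these files might be assembled in practice, here is a hedged sketch of the glue code. `run_retrieval` and `generate_answer` are hypothetical stand-ins for your own pipeline, not RAGElo APIs; only the column names follow the layout described above.

# Hypothetical glue code: dump retrieval results and generated answers into the
# CSV layout RAGElo expects. Replace the placeholder hooks with your own pipeline.
import pandas as pd

def run_retrieval(agent: str, query: str) -> list[tuple[str, str]]:
    """Placeholder: return (doc_id, document_text) pairs retrieved by this agent."""
    return [("doc_0", "…retrieved chunk text…")]

def generate_answer(agent: str, query: str, documents: list[tuple[str, str]]) -> str:
    """Placeholder: return the agent's answer, grounded in the retrieved documents."""
    return "…generated answer text…"

queries = pd.read_csv("./data/queries.csv")
agents = ["agent_v1", "agent_v2", "agent_v3"]  # your RAG system variants

doc_rows, answer_rows = [], []
for _, row in queries.iterrows():
    for agent in agents:
        retrieved = run_retrieval(agent, row["query"])
        answer = generate_answer(agent, row["query"], retrieved)
        for did, text in retrieved:
            doc_rows.append({"qid": row["qid"], "did": did, "document": text, "agent": agent})
        answer_rows.append({"qid": row["qid"], "agent": agent, "answer": answer})

pd.DataFrame(doc_rows).to_csv("./data/documents.csv", index=False)
pd.DataFrame(answer_rows).to_csv("./data/answers.csv", index=False)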
With these files, you can initialize a RAGElo experiment to control the evaluation process:
from ragelo import Experiment
experiment = Experiment(
    experiment_name="qdrant-retrieval",
    queries_csv_path="./data/queries.csv",
    documents_csv_path="./data/documents.csv",
    answers_csv_path="./data/answers.csv",
    verbose=True,  # just to see more info about the evaluations
)
Tip: RAGElo caches the annotation results by default to prevent redundant calls to the LLM.
Evaluating Retrieval and Answers with Expert Elo-based Comparisons
RAGElo includes several predefined but easily customizable evaluators. For most use cases, our "Domain Expert" evaluator provides reliable alignment with human judgment out of the box, but you can extend any evaluator according to your needs.
Setting Up Evaluators
The next step is to load your evaluators and define their parameters:
from ragelo import (
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)

# Example using OpenAI's GPT-4.1-nano.
llm_provider = get_llm_provider("openai", model="gpt-4.1-nano")

domain_evaluator_args = {
    "llm_provider": llm_provider,
    # Provided to the model as context for its area of expertise.
    "expert_in": (
        "the details of how to better use the Qdrant "
        "vector database and vector search engine"
    ),
    # Used to describe the persona of the system's users.
    "company": "Qdrant",
    # How many threads (parallel LLM calls) to use for evaluation.
    "n_processes": 20,
    # Whether to use rich to print colorful outputs.
    "rich_print": True,
    # Whether to overwrite any existing files.
    "force": False,
}

retrieval_evaluator = get_retrieval_evaluator(
    "domain_expert",
    **domain_evaluator_args,
)

answer_evaluator = get_answer_evaluator(
    "domain_expert",
    **domain_evaluator_args,
    # Whether to evaluate the answers in both directions.
    bidirectional=False,
    # The number of games to play for each query.
    n_games_per_query=20,
    # The minimum relevance score for a document to be relevant.
    document_relevance_threshold=2,
)
Running Evaluations
We are now ready to run our RAG evaluations with RAGElo!
Evaluate both retrieval and the generated answers using the Experiment object from before:
# Evaluate all retrieved documents for all queries
retrieval_evaluator.evaluate_experiment(experiment)
# Create random pairs of answers for each query and evaluate them.
answer_evaluator.evaluate_experiment(experiment)
After running the evaluations, we can leverage RAGElo's Elo scoring to clearly rank our systems:
from ragelo import get_agent_ranker
elo_ranker = get_agent_ranker(
    "elo",
    # The k-factor for the Elo ranking algorithm.
    k=32,
    # Initial score for the agents. Updated after each game.
    initial_score=1000,
    # Number of tournaments to play.
    rounds=1000,
)
elo_ranker.run(experiment)
Elo Results Example
Agent Name | Elo Rating |
---|---|
agent_2 | 1262.7 (±184.1) |
agent_5 | 1228.7 (±208.5) |
agent_1 | 1055.6 (±139.9) |
agent_3 | 909.9 (±148.5) |
agent_0 | 759.7 (±176.2) |
agent_4 | 636.3 (±133.5) |
From this ranking, it's immediately clear that agents 2 and 5 are our top performers, so we can select them for further benchmarking or user validation, saving development effort and time. Note that because their ratings are close and their confidence intervals overlap, we cannot say with confidence that one of the two is significantly better, but the ranking is enough to safely discard the remaining agents.
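The ± values reflect the fact that the tournament is repeated: each round plays the sampled games in a different order, so the final ratings vary slightly, and aggregating many rounds yields a mean rating plus a spread. The snippet below is a rough, hypothetical illustration of that idea in plain Python; the game outcomes are made up, and this is not RAGElo's exact implementation.

# Rough illustration: replay an Elo tournament over the same pairwise outcomes in
# random order and look at the spread of the final ratings. Not RAGElo's exact code.
import random
import statistics

def update(r_w: float, r_l: float, k: float = 32) -> tuple[float, float]:
    """One Elo update where the first agent won the game."""
    e_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
    return r_w + k * (1.0 - e_w), r_l - k * (1.0 - e_w)

# Hypothetical pairwise outcomes (winner, loser) produced by the LLM judge.
games = [("agent_2", "agent_4"), ("agent_2", "agent_0"), ("agent_5", "agent_3"),
         ("agent_5", "agent_0"), ("agent_1", "agent_4"), ("agent_2", "agent_1")]

ratings_per_round = {agent: [] for pair in games for agent in pair}
for _ in range(1000):  # analogous to "rounds" in the ranker above
    ratings = {agent: 1000.0 for agent in ratings_per_round}
    for winner, loser in random.sample(games, len(games)):  # shuffled game order
        ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser])
    for agent, rating in ratings.items():
        ratings_per_round[agent].append(rating)

for agent in sorted(ratings_per_round, key=lambda a: -statistics.mean(ratings_per_round[a])):
    scores = ratings_per_round[agent]
    print(f"{agent}: {statistics.mean(scores):.1f} (±{statistics.stdev(scores):.1f})")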
Extend and Adapt Evaluations to Your Own Needs
While RAGElo provides powerful default evaluation capabilities, it is designed to be adaptable to unique business scenarios and evaluation criteria. You can easily extend existing evaluators or define entirely custom prompt templates tailored specifically for your domain.
Try RAGElo Yourself
Interested in effortlessly benchmarking your RAG pipeline? Here's how to get started right away:
🛠️ Check out the RAGElo Demo Notebook to see the evaluations and Elo rankings in action.
🤝 Join our GitHub community to report issues, request features, or contribute to RAGElo!
🔥 Need more hands-on help? Contact Zeta Alpha to schedule an introductory brainstorming session. Our experts will help you improve your retrieval quality, optimize your RAG pipeline, and move your AI application confidently into production.
Until next time, enjoy discovery and happy ranking!