Robust evaluations for RAG with RAGElo
- Dinos Papakostas
- Jun 30
- 4 min read
Retrieval-Augmented Generation (RAG) systems have gained strong traction because of their ability to ground generated answers in knowledge sources, boosting their accuracy and reliability. However, evaluating them remains tricky, especially when there's no clear ground truth data to compare against.
When deploying a new RAG pipeline, we want to measure two key things clearly and reliably:
- Retrieval performance: How well does our system surface documents that are relevant to each question?
- Answer correctness: Given the retrieved documents, how accurate and helpful are the generated answers?
But without existing users or historical feedback data, obtaining annotations for document relevance and answer correctness quickly becomes costly and time-consuming.
RAGElo, a toolkit we've been building at Zeta Alpha, directly addresses this issue. Rather than relying purely on pointwise evaluations, RAGElo implements an Elo-based tournament evaluation, a well-established ranking method that was originally popularized for chess rankings. This approach allows users to compare multiple RAG system variants head-to-head and aggregate pairwise preferences into a robust, easy-to-interpret Elo ranking.
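To build intuition for how pairwise preferences turn into a ranking, here is a minimal sketch of the classic Elo update that tournament-style rankings are built on. This is an illustration only, not RAGElo's internal code: each head-to-head comparison is treated as a "game", the winner gains rating points, the loser gives them up, and the size of the update is controlled by a k-factor.

# Minimal illustration of the classic Elo update behind tournament-style rankings.
# This is a simplified sketch for intuition, not RAGElo's internal implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that agent A beats agent B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one game. score_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Example: two agents start at 1000 and agent A wins one pairwise comparison.
print(update(1000, 1000, score_a=1.0))  # -> (1016.0, 984.0)

Repeating this update over many pairwise judgments is what lets small, noisy preferences accumulate into a stable ordering of systems.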

RAG Evaluations with RAGElo
To illustrate RAGElo's capabilities, let's run through an example scenario where we compare 6 candidate question-answering systems. We'll use a subset of the Qdrant documentation as our example corpus, with queries extracted from the qdrant_doc_qna dataset.
Example benchmark queries
- What is the difference between scalar and product quantization?
- What is the purpose of "ef_construct" in HNSW?
- What is the impact of "write_consistency_factor"?
RAGElo expects benchmark queries as a CSV file consisting of two columns (a short example of producing this file follows the list):
- qid: A unique query identifier (e.g., "q_1")
- query: The query text (e.g., "What's ef_construct?")
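If your benchmark queries live in code rather than a spreadsheet, writing this file takes only a few lines. The snippet below is a hypothetical example using pandas; the file path and query texts are placeholders, and this is not part of RAGElo itself.

# Hypothetical example of writing the queries CSV in the format RAGElo expects.
import pandas as pd

queries = pd.DataFrame(
    [
        {"qid": "q_1", "query": 'What is the purpose of "ef_construct" in HNSW?'},
        {"qid": "q_2", "query": "What is the difference between scalar and product quantization?"},
    ]
)
queries.to_csv("./data/queries.csv", index=False)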
Preparing Your Evaluation Data
RAGElo integrates seamlessly into any RAG workflow, irrespective of your implementation details. To run an evaluation, you simply prepare two CSV files representing the retrieval and answer generation steps (a sketch of assembling them from your own pipeline follows the two tables below):
For Retrieval Evaluation (documents.csv)
Column | Description | Example |
---|---|---|
did | A unique document ID | doc_0 |
document | The text content of the retrieved document/chunk | We enabled scalar quantization + HNSW with m=16 and ef_construct=512. [...] |
qid | The ID of the corresponding query | q_1 |
agent | The name of the RAG system variant | agent_v3 |
For Answer Generation Evaluation (answers.csv)
Column | Description | Example |
---|---|---|
qid | The query ID (must match retrieval runs) | q_1 |
agent | The name of the RAG system variant | agent_v3 |
answer | The generated answer text | The ef_construct parameter determines the size of the [...] |
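To show how these files might be assembled in practice, here is a hedged sketch of the glue code. `run_retrieval` and `generate_answer` are hypothetical stand-ins for your own pipeline, not RAGElo APIs; only the column names follow the layout described above.

# Hypothetical glue code: dump retrieval results and generated answers into the
# CSV layout RAGElo expects. Replace the placeholder hooks with your own pipeline.
import pandas as pd

def run_retrieval(agent: str, query: str) -> list[tuple[str, str]]:
    """Placeholder: return (doc_id, document_text) pairs retrieved by this agent."""
    return [("doc_0", "…retrieved chunk text…")]

def generate_answer(agent: str, query: str, documents: list[tuple[str, str]]) -> str:
    """Placeholder: return the agent's answer, grounded in the retrieved documents."""
    return "…generated answer text…"

queries = pd.read_csv("./data/queries.csv")
agents = ["agent_v1", "agent_v2", "agent_v3"]  # your RAG system variants

doc_rows, answer_rows = [], []
for _, row in queries.iterrows():
    for agent in agents:
        retrieved = run_retrieval(agent, row["query"])
        answer = generate_answer(agent, row["query"], retrieved)
        for did, text in retrieved:
            doc_rows.append({"qid": row["qid"], "did": did, "document": text, "agent": agent})
        answer_rows.append({"qid": row["qid"], "agent": agent, "answer": answer})

pd.DataFrame(doc_rows).to_csv("./data/documents.csv", index=False)
pd.DataFrame(answer_rows).to_csv("./data/answers.csv", index=False)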
With these files, you can initialize a RAGElo experiment to control the evaluation process:
from ragelo import Experiment
experiment = Experiment(
    experiment_name="qdrant-retrieval",
    queries_csv_path="./data/queries.csv",
    documents_csv_path="./data/documents.csv",
    answers_csv_path="./data/answers.csv",
    verbose=True,  # just to see more info about the evaluations
)
Tip: RAGElo caches the annotation results by default to prevent redundant calls to the LLM.
Evaluating Retrieval and Answers with Expert Elo-based Comparisons
RAGElo includes several predefined but easily customizable evaluators. For most use cases, our "Domain Expert" evaluator provides reliable alignment with human judgment out of the box, but you can extend any evaluator according to your needs.
Setting Up Evaluators
The next step is to load your evaluators and define their parameters:
from ragelo import (
    get_answer_evaluator,
    get_llm_provider,
    get_retrieval_evaluator,
)

# Example using OpenAI's GPT-4.1-nano.
llm_provider = get_llm_provider("openai", model="gpt-4.1-nano")

domain_evaluator_args = {
    "llm_provider": llm_provider,
    # Provided to the model as context for its area of expertise.
    "expert_in": (
        "the details of how to better use the Qdrant "
        "vector database and vector search engine"
    ),
    # Used to describe the persona of the system's users.
    "company": "Qdrant",
    # How many threads (parallel LLM calls) to use for evaluation.
    "n_processes": 20,
    # Whether to use rich to print colorful outputs.
    "rich_print": True,
    # Whether to overwrite any existing files.
    "force": False,
}

retrieval_evaluator = get_retrieval_evaluator(
    "domain_expert",
    **domain_evaluator_args,
)

answer_evaluator = get_answer_evaluator(
    "domain_expert",
    **domain_evaluator_args,
    # Whether to evaluate the answers in both directions.
    bidirectional=False,
    # The number of games to play for each query.
    n_games_per_query=20,
    # The minimum relevance score for a document to be relevant.
    document_relevance_threshold=2,
)
Running Evaluations
We are now ready to run our RAG evaluations with RAGElo!
Evaluate both retrieval and the generated answers using the Experiment object from before:
# Evaluate all retrieved documents for all queries
retrieval_evaluator.evaluate_experiment(experiment)
# Create random pairs of answers for each query and evaluate them.
answer_evaluator.evaluate_experiment(experiment)
After running the evaluations, we can leverage RAGElo's Elo scoring to clearly rank our systems:
from ragelo import get_agent_ranker
elo_ranker = get_agent_ranker(
    "elo",
    # The k-factor for the Elo ranking algorithm.
    k=32,
    # Initial score for the agents. Updated after each game.
    initial_score=1000,
    # Number of tournaments to play.
    rounds=1000,
)
elo_ranker.run(experiment)
Elo Results Example
Agent Name | Elo Rating |
---|---|
agent_2 | 1262.7 (±184.1) |
agent_5 | 1228.7 (±208.5) |
agent_1 | 1055.6 (±139.9) |
agent_3 | 909.9 (±148.5) |
agent_0 | 759.7 (±176.2) |
agent_4 | 636.3 (±133.5) |
From this ranking, it's immediately clear that agents 2 and 5 are our top performers, so we can select them for further benchmarking or user validation, saving development effort and time. Note that because their ratings are close and their confidence intervals overlap, we cannot say with confidence that one of the two is significantly better, but the ranking is enough to safely discard the remaining agents.
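The ± values reflect the fact that the tournament is repeated: each round plays the sampled games in a different order, so the final ratings vary slightly, and aggregating many rounds yields a mean rating plus a spread. The snippet below is a rough, hypothetical illustration of that idea in plain Python; the game outcomes are made up, and this is not RAGElo's exact implementation.

# Rough illustration: replay an Elo tournament over the same pairwise outcomes in
# random order and look at the spread of the final ratings. Not RAGElo's exact code.
import random
import statistics

def update(r_w: float, r_l: float, k: float = 32) -> tuple[float, float]:
    """One Elo update where the first agent won the game."""
    e_w = 1.0 / (1.0 + 10 ** ((r_l - r_w) / 400))
    return r_w + k * (1.0 - e_w), r_l - k * (1.0 - e_w)

# Hypothetical pairwise outcomes (winner, loser) produced by the LLM judge.
games = [("agent_2", "agent_4"), ("agent_2", "agent_0"), ("agent_5", "agent_3"),
         ("agent_5", "agent_0"), ("agent_1", "agent_4"), ("agent_2", "agent_1")]

ratings_per_round = {agent: [] for pair in games for agent in pair}
for _ in range(1000):  # analogous to "rounds" in the ranker above
    ratings = {agent: 1000.0 for agent in ratings_per_round}
    for winner, loser in random.sample(games, len(games)):  # shuffled game order
        ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser])
    for agent, rating in ratings.items():
        ratings_per_round[agent].append(rating)

for agent in sorted(ratings_per_round, key=lambda a: -statistics.mean(ratings_per_round[a])):
    scores = ratings_per_round[agent]
    print(f"{agent}: {statistics.mean(scores):.1f} (±{statistics.stdev(scores):.1f})")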
Extend and Adapt Evaluations to Your Own Needs
While RAGElo provides powerful default evaluation capabilities, it is designed to be adaptable to unique business scenarios and evaluation criteria. You can easily extend existing evaluators or define entirely custom prompt templates tailored specifically for your domain.
Try RAGElo Yourself
Interested in effortlessly benchmarking your RAG pipeline? Here's how to get started right away:
🛠️ Check out the RAGElo Demo Notebook to see the evaluations and Elo rankings in action.
🤝 Join our GitHub community to report issues, request features, or contribute to RAGElo!
🔥 Need more hands-on help? Contact Zeta Alpha to schedule an introductory brainstorming session. Our experts will help you improve your retrieval quality, optimize your RAG pipeline, and move your AI application confidently into production.
Until next time, enjoy discovery and happy ranking!