Evaluating AI Systems — Trends in AI: May '25
- Dinos Papakostas
The rapid pace of developments in AI has made selecting the most suitable base model for a given task increasingly complex. Determining which public benchmarks accurately reflect downstream performance, choosing meaningful evaluation metrics, and comprehensively assessing AI systems end-to-end have become central challenges for researchers and practitioners alike.
In this month's Trends in AI webinar, we dive deep into the evolving world of AI evaluations, highlighting modern evaluation frameworks, exploring the nuances of setting up LLM-as-a-Judge pipelines, and reviewing the latest research breakthroughs shaping this field.

Recent State-of-the-Art Model Releases
Google DeepMind: Gemini 2.5 Flash, Gemini 2.5 Pro (I/O Preview)
OpenAI: o3 & o4-mini, GPT-4.1
Alibaba: Qwen-3
Meta: Llama-4
Microsoft: Phi-4-reasoning
Mistral: Mistral Medium 3
Benchmarking Large Language Models (LLMs)
When selecting a language or multimodal foundation model, an effective first step is to evaluate its performance using standard industry benchmarks. These assessments provide a quick look into a model's strengths across critical skills, such as general knowledge, reasoning abilities, problem-solving, coding proficiency, multilingual understanding, and more.
Below, we've outlined some of the most representative public benchmarks, along with the model dimensions and capabilities that they target:
MMLU / MMMU / GPQA
MMLU (Massive Multitask Language Understanding) is among the most popular benchmarks for assessing a model's general knowledge and problem-solving abilities. It contains 14,000 multiple-choice questions spanning areas like mathematics, history, physics, medicine, and law.
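To make the evaluation protocol concrete, here is a minimal, hypothetical sketch of how accuracy is typically computed on an MMLU-style multiple-choice benchmark: the model picks one lettered option, and the score is plain exact-match accuracy against the gold answers. The ask_model function and the item schema are placeholders, not part of any official harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# `ask_model` is a placeholder for your own LLM call; any client works.

def ask_model(prompt: str) -> str:
    """Return the model's answer as a single letter, e.g. 'B'."""
    raise NotImplementedError  # plug in your LLM client here

def format_question(item: dict) -> str:
    options = "\n".join(f"{letter}. {text}"
                        for letter, text in zip("ABCD", item["choices"]))
    return f"{item['question']}\n{options}\nAnswer with a single letter."

def mcq_accuracy(items: list[dict]) -> float:
    """items: [{'question': str, 'choices': [str x4], 'answer': 'A'}, ...]"""
    correct = sum(
        ask_model(format_question(item)).strip().upper().startswith(item["answer"])
        for item in items
    )
    return correct / len(items)
```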

In the same vein, MMMU (Massive Multi-discipline Multimodal Understanding) extends the evaluation to multimodal inputs covering a variety of image types, such as maps, diagrams, and charts. It features 11,500 questions from college exams, textbooks, and quizzes in core disciplines such as science, health & medicine, and tech & engineering.

As models quickly reach human-level performance on popular benchmarks like MMLU, with models such as Claude 3.5, GPT-4o, and Llama 3.1 even surpassing humans, new collections of more challenging datasets have emerged. One particularly demanding example is GPQA (Google-proof Question Answering), which was explicitly designed to pose questions so specialized and niche that they cannot be answered by general internet knowledge, making them challenging even for non-experts with full web access.

Humanity's Last Exam
Despite extensive efforts to create challenging benchmarks, powerful new models regularly catch up with and outperform human experts. In response, Humanity's Last Exam (HLE) was recently proposed, encompassing 2,500 multimodal questions across disciplines such as mathematics, humanities, and natural sciences. Notably, it includes short-answer questions to reduce the chance of answering correctly by elimination, as is possible in a multiple-choice setup. As of May 2025, OpenAI's o3 leads the benchmark with 20% accuracy (25% when web browsing and code execution are allowed).

ARC-AGI
ARC-AGI is a collection of grid-based visual reasoning puzzles that are intuitive for humans yet challenging for AI models. While OpenAI's o3 surpassed human performance on ARC-AGI-1 (scoring 75% versus the average human performance of 64%), the updated ARC-AGI-2 remains far from solved, with the current top-performing models scoring only about 4%.

Competitive Math: AIME & Math Olympiads
An area of interest where the newest generation of reasoning models has made significant strides is competitive math examinations, which test their problem-solving abilities in arithmetic, algebra, geometry, number theory, and probability. A popular approach is to subject newly released models to the question pool of exams like the American Invitational Mathematics Examination (AIME).
Competitive Programming / Code Generation
LLMs are quickly becoming indispensable to software developers, powering a new generation of AI-first code editors, and benchmarks that evaluate their code-generation abilities have naturally been on the rise. These benchmarks come in several formats: competitive-programming contests such as Codeforces, simulations of real-life productivity tasks such as resolving GitHub issues (SWE-Bench) and completing Upwork jobs (SWE-Lancer), and Aider's Polyglot benchmark, which tests the validity of code completions on exercises across multiple programming languages.

MultiChallenge
Another crucial dimension of the real-world performance of language models is their ability to handle multi-turn conversations, where skills such as accurate instruction-following, in-context reasoning, and context allocation are highly valuable. Scale AI's MultiChallenge is a recent benchmark that tests four challenging aspects of conversational interactions:
Instruction retention, i.e., remembering what the user asked in the beginning
Inference memory, i.e., accumulating and utilizing information over time
Reliable versioned editing, i.e., changing parts of the previous response
Self-coherence, i.e., maintaining consistency throughout the conversation

TAU-Bench
One of the most impactful features of modern LLMs is their ability to call external functions and use tools effectively, allowing them to be integrated as sophisticated orchestrators within software components. These skills are also the backbone of modern multi-agent workflows, which involve complex interactions and interdependencies, yet the success and reliability of such integrations depend heavily on how accurately the models select and invoke the right tools and APIs. TAU-Bench stands out as a leading benchmark for precisely this capability: rather than merely examining the model's textual output, it measures performance by inspecting the final system state, comparing the database entries before and after the function calls.
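TAU-Bench's own harness is more elaborate, but the core idea of state-based checking can be illustrated with a small, hypothetical sketch: instead of grading the agent's reply text, we compare the database after the agent run against the expected final state. The run_agent function and the task schema below are placeholders, not part of TAU-Bench.

```python
# Illustrative sketch of state-based agent evaluation (in the spirit of TAU-Bench,
# not its actual harness): success is judged by the resulting database state,
# not by the agent's textual reply.
import copy

def run_agent(db: dict, user_request: str) -> str:
    """Placeholder: your agent mutates `db` through tool calls and returns a reply."""
    raise NotImplementedError

def evaluate_task(initial_db: dict, user_request: str, expected_db: dict) -> bool:
    db = copy.deepcopy(initial_db)   # isolate each task run
    _reply = run_agent(db, user_request)
    return db == expected_db         # pass/fail depends only on the final state

# Example task: the agent should cancel order 42, regardless of how it words the reply.
task = {
    "initial_db": {"orders": {42: {"status": "confirmed"}}},
    "user_request": "Please cancel my order #42.",
    "expected_db": {"orders": {42: {"status": "cancelled"}}},
}
```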

LMSys Chatbot Arena
Moving away from traditional static benchmarking methods, the LMSys Chatbot Arena has become highly influential due to its interactive and dynamic competitive evaluation process. In this arena-style evaluation, two language models go head-to-head, receiving identical prompts and producing parallel outputs. The users then select the preferred model outputs, determining a winner for each interaction. These user-driven judgements continuously feed into an Elo-based ranking system, creating an evolving leaderboard. This competitive, real-time evaluation method provides ongoing insights into model performance and user preferences, reflecting the models' practical usability and continuously evolving capabilities rather than snapshot-based comparisons.
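As a simplified illustration of how pairwise votes turn into a leaderboard, here is a minimal Elo-style rating update; production leaderboards typically use more robust estimators fit over all votes, but the intuition is the same.

```python
# Simplified illustration of an Elo-style rating update from pairwise votes.
# Wins against stronger opponents move ratings more than wins against weaker ones.

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0) -> tuple[float, float]:
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Two models start at 1000; model A wins one head-to-head vote.
print(elo_update(1000, 1000, 1.0))  # -> (1016.0, 984.0)
```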

Evaluating Real-World Use Cases
While public benchmarks offer valuable insights into models' foundational knowledge and capabilities, real-world performance also critically depends on nuanced practical considerations. A model excelling in benchmark tests might underperform in practice if not integrated thoughtfully. Consider, for example, a code assistant: raw benchmark scores alone might not reflect the real-world utility of generated code snippets, unless factors like contextual relevance, workflow integration, code readability, correctness, and user satisfaction are thoroughly validated.
Here are several practical tools that streamline the evaluation lifecycle of AI systems:
Ragas
Ragas is a widely used library designed for assessing LLM outputs through pointwise evaluations. It offers a diverse set of evaluation metrics, supporting both LLM-based evaluations following the LLM-as-a-Judge paradigm (to measure attributes such as faithfulness) and traditional string-overlap methods such as the BLEU score.
Several key metrics that are particularly relevant to RAG systems include the following (a minimal usage sketch follows the list):
Noise Sensitivity: how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents.
Response Relevancy: how relevant a response is to the user input. Higher scores indicate better alignment with the user input, while lower scores are given if the response is incomplete or includes redundant information.
Faithfulness: how factually consistent a response is with the retrieved context.
Answer Accuracy: the agreement between a model's response and a reference ground-truth answer for a given question.
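Here is a rough sketch of what such a pointwise evaluation looks like in practice, based on the classic Ragas evaluate() API; exact imports and metric names vary between versions, so treat this as illustrative and check the Ragas documentation for your setup.

```python
# Illustrative pointwise evaluation with Ragas (classic 0.1.x-style API;
# newer releases expose class-based metrics, so check the docs for your version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# A single toy sample; in practice you would evaluate dozens or hundreds of queries.
samples = Dataset.from_dict({
    "question":     ["What is the refund window for returned items?"],
    "answer":       ["Items can be returned within 30 days for a full refund."],
    "contexts":     [["Our policy allows returns within 30 days of purchase for a full refund."]],
    "ground_truth": ["30 days"],
})

# LLM-based metrics call a judge model under the hood (OpenAI by default),
# so an API key is expected in the environment.
result = evaluate(samples, metrics=[faithfulness, answer_relevancy])
print(result)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.97}
```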
DeepEval
DeepEval introduces the concept of "unit testing for LLMs", adopting established principles from (and integrating with) software testing frameworks such as pytest. It is primarily geared toward RAG, conversational chatbots, and agentic AI systems, treating concepts like retrieved context and tool use as first-class components. This framework enables robust, structured, end-to-end evaluation, promoting thorough validation and increased reliability for real-world LLM applications.
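For a flavour of the unit-testing workflow, here is a sketch loosely based on DeepEval's documented quickstart; the test data is made up, the threshold is arbitrary, and the metric relies on an LLM judge under the hood (so an API key is expected in the environment).

```python
# Sketch of a DeepEval-style "unit test for LLMs", runnable with pytest.
# Loosely follows DeepEval's quickstart; adjust imports/metrics to your version.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_refund_answer_is_relevant():
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        # In a real test this would come from your RAG pipeline or chatbot.
        actual_output="You can return them within 30 days for a full refund.",
        retrieval_context=["All customers are eligible for a 30-day full refund."],
    )
    # The metric uses an LLM judge and fails the test below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```

Running the file with pytest (or DeepEval's own test runner) executes these assertions like any other test suite, so LLM quality checks can slot into an existing CI pipeline.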
RAGElo
In most evaluation setups, a common assumption is the availability of gold-standard labels or predefined ground-truth answers against which the generated outputs can be compared. However, collecting these gold labels is often prohibitively expensive, tedious, time-consuming, or outright impossible to scale sufficiently. To overcome this bottleneck and leverage the capabilities of LLM annotators, RAGElo was designed specifically for evaluating RAG systems end-to-end.
RAGElo operates by simulating a tournament-style evaluation, where candidate models compete against one another over a set of benchmark questions. Through this competitive setting, RAGElo assesses all of the crucial components in a RAG pipeline, including context retrieval accuracy, groundedness of answers, and overall response quality. Eventually, candidates earn an Elo-based performance ranking derived from their comparative results, allowing system builders to easily identify the winning candidate for each specific use case.
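The heavy lifting of retrieval and answer evaluation is handled by RAGElo itself, but the underlying idea of reference-free, pairwise LLM judging can be sketched as follows; this is an illustrative stand-in rather than RAGElo's actual API, and the wins it produces could feed an Elo update like the one shown earlier.

```python
# Illustrative pairwise LLM-as-a-Judge comparison (not RAGElo's actual API).
# Collected wins/losses can feed an Elo-style ranking, as sketched above.

JUDGE_PROMPT = """You are evaluating two RAG systems answering the same question.
Question: {question}

Answer A (with its retrieved context):
{answer_a}

Answer B (with its retrieved context):
{answer_b}

Which answer is better grounded in its context and more helpful overall?
Reply with exactly "A", "B", or "TIE"."""

def judge_pair(llm_call, question: str, answer_a: str, answer_b: str) -> float:
    """Return 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    `llm_call` is a placeholder for any chat-completion function you use."""
    verdict = llm_call(JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b,
    )).strip().upper()
    return {"A": 1.0, "B": 0.0}.get(verdict, 0.5)
```

In practice, you would also swap the A/B order and average the two verdicts to mitigate the judge's position bias.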

Synthetic Benchmarks
Creating suitable evaluation benchmarks can pose significant challenges, particularly when dealing with highly specialized domains, newly developed systems, or entirely novel tasks. In cases where real benchmark queries are sparse, limited, costly, or unavailable altogether, synthetic benchmark generation provides a compelling solution. Modern LLMs enable the automatic creation of realistic, domain-specific evaluation questions directly from the target document corpus in an unsupervised or semi-supervised manner. This approach swiftly produces meaningful evaluation sets (typically 50-200 questions, enough for statistical reliability), accelerating and simplifying the benchmarking process.
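As a concrete illustration, here is a small sketch of generating synthetic evaluation questions from a document corpus with an LLM; the prompt, chunking, and model name are placeholder choices, and any chat-completion client would do.

```python
# Sketch of synthetic benchmark generation: ask an LLM to write evaluation
# questions grounded in your own documents. Prompt and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = """You are building an evaluation set for a domain-specific assistant.
Given the document excerpt below, write 3 realistic questions that a user of this
domain might ask, each answerable from the excerpt alone. Return one question per line.

Document excerpt:
{chunk}"""

def generate_questions(chunks: list[str], model: str = "gpt-4.1-mini") -> list[str]:
    questions = []
    for chunk in chunks:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(chunk=chunk)}],
        )
        questions.extend(
            line.strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()
        )
    return questions
```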

Trending research papers
Here is an overview of recently trending research papers on AI evaluations, to give you a feel for where the field is headed. You can always find this collection of papers (and more that didn't make the cut) in Zeta Alpha, allowing you to easily explore and discover related work.
FreshStack: Building Realistic Benchmarks for Evaluating Retrieval on Technical Documents - N. Thakur et al. (U. Waterloo & Databricks) - 17 Apr. 2025
The Leaderboard Illusion - S. Singh et al. (Cohere Labs) - 29 Apr. 2025
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations - Dietz et al. (UNH) - 27 Apr. 2025
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models - R. Pradeep et al. (U. Waterloo, Microsoft, Snowflake) - 21 Apr. 2025
Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation - K. Balog et al. (Google DeepMind) - 24 Mar. 2025
Throughout this blog post, we've highlighted just a fraction of the complexities in evaluating advanced AI systems. While modern tools simplify the prototyping phase of LLM-powered applications even without deep technical expertise, advanced, production-ready AI agents and RAG systems require extensive knowledge of evaluation and optimization best practices. At Zeta Alpha, we offer practical experience to guide organizations across multiple industries – including chemical, manufacturing, and regulatory domains – from R&D PoCs to production-level agentic RAG deployments. Do you want to take the quality of your RAG systems and AI agents to the next level through rigorous evaluation and optimization? Contact us for an initial conversation to see how we can help you turn your internal expertise into a valuable asset.
For more insights and detailed coverage, don't forget to check out the complete webinar recording below, and join our Luma community for upcoming discussions and events.
Until next time, enjoy discovery!