From: aidotengineer

Despite the perception that Retrieval-Augmented Generation (RAG) is a solved problem, its evaluation presents significant challenges [00:00:08]. In short, RAG evaluation is broken [00:00:24].

Issues with Current RAG Benchmarks

The primary reasons for the broken state of RAG evaluation stem from how easily benchmarks are constructed [00:00:28]:

  • Local Questions and Answers Most benchmarks consist of “local questions that have local answers,” assuming the answer resides within a specific data chunk [00:00:37].
  • Unrealistic Scenarios Benchmarks are often manufactured and do not represent real-world data [00:01:23]. For instance, “multi-hop questions,” famously framed by Google, are not realistic [00:01:02].
  • Lack of Holistic Testing Existing benchmarks typically test only a part of the RAG system:
    • Retrieval-only benchmarks focus on retrieving the best segments or chunks, assuming the answer is within them [00:01:35].
    • Generation-only benchmarks (grounding benchmarks) assess if a question can be answered based on context already provided in the prompt [00:01:51].
    • Both neglect crucial pipeline stages such as chunking and parsing, as well as cases where the answer is not directly localized in the data [00:02:00].
  • Disconnection from Real-World Data Benchmarks often do not correlate with real-world data, which is typically messier and more diverse [00:02:22].

This leads to a vicious cycle where RAG systems are developed for flawed benchmarks, achieve high scores, but then struggle when applied to user or customer data [00:02:33].

Performance Limitations of Standard RAG Pipelines

Standard RAG pipelines, which simply retrieve the top-k chunks, struggle with complex query types [00:04:04]:

  • Aggregative Questions They fail to answer questions requiring aggregation or calculation, such as “which company has reported the highest quarterly revenue the most times?” or “how many fiscal years did Apple exceed $100 billion in annual revenue?” [00:03:40].
  • “All of Something” Questions Queries like “all Fortune 500 companies” are problematic because limiting retrieval to the top-k chunks can miss relevant information beyond that cutoff [00:04:13].

An experiment using a small corpus of FIFA World Cup historical data demonstrated these limitations. A common RAG pipeline (e.g., using LangChain or LlamaIndex) and OpenAI responses answered only 5% to 11% of questions correctly, even for answers found in local segments [00:05:30].
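
To make this failure mode concrete, the following is a minimal sketch of a top-k pipeline of the kind tested above. It is illustrative only: embed(), generate(), and the naive fixed-size chunker are stand-ins for whatever embedding model, LLM client, and splitter a real pipeline (e.g., one built with LangChain or LlamaIndex) would use.

```python
# Minimal sketch of a standard top-k RAG pipeline; not the speaker's code.
# embed() and generate() are placeholders for an embedding model and an LLM.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    raise NotImplementedError("plug in an embedding model")

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM call")

def chunk(documents: list[str], size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split more carefully.
    return [doc[i:i + size] for doc in documents for i in range(0, len(doc), size)]

def answer(question: str, documents: list[str], k: int = 5) -> str:
    chunks = chunk(documents)
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every chunk; keep the top k.
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-9
    )
    top_k = [chunks[i] for i in np.argsort(sims)[::-1][:k]]
    # Aggregative or "all of X" questions break here: the evidence rarely
    # fits inside k chunks, however good the retriever is.
    prompt = "Context:\n" + "\n---\n".join(top_k) + f"\n\nQuestion: {question}"
    return generate(prompt)
```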

Solutions to Improve RAG Systems: Beyond Standard RAG

To address these challenges, a proposed approach involves converting unstructured data into a structured format and then querying that structure [00:05:49]. This is because many complex questions are inherently SQL-like (counting, max/min, calculation) [00:05:59].
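
As a quick illustration of how SQL-like these questions are, the Apple revenue question from earlier becomes a single aggregate query once the data is tabular. The fiscal_years table, its columns, and the revenue figures below are illustrative assumptions, not data from the talk.

```python
# Illustration: an aggregative question expressed as SQL over a tiny table.
# Table name, columns, and figures are illustrative, not from the talk.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fiscal_years (company TEXT, year INTEGER, revenue_billion REAL)")
conn.executemany(
    "INSERT INTO fiscal_years VALUES (?, ?, ?)",
    [("Apple", 2014, 182.8), ("Apple", 2015, 233.7), ("Apple", 2016, 215.6)],
)

# "How many fiscal years did Apple exceed $100 billion in annual revenue?"
count = conn.execute(
    "SELECT COUNT(*) FROM fiscal_years"
    " WHERE company = 'Apple' AND revenue_billion > 100"
).fetchone()[0]
print(count)  # a counting question that top-k retrieval struggles to answer
```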

Like standard RAG, the process is split into two phases, with compute invested during ingestion to make inference faster:

Ingestion Phase

  1. Cluster Documents Documents are first grouped into sub-corpora (e.g., a financial corpus, a FIFA World Cup corpus) [00:06:40].
  2. Identify Schema A schema is identified for each subcorpus [00:06:51].
  3. Populate Schema An LLM pipeline populates this schema with data from each document [00:07:00]. For example, a FIFA corpus schema might include year, winner, top three teams, and top scorer [00:07:09].
  4. Upload to SQL DB The results are then uploaded to an SQL database [00:07:02].
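
Below is a minimal sketch of this ingestion phase. The helpers cluster_documents(), infer_schema(), and extract_record() are assumed LLM-backed functions you would implement yourself; they are not from the talk or any particular library.

```python
# Sketch of the ingestion phase: cluster, infer a schema, populate it, upload.
# The three helpers are hypothetical LLM-backed functions, not a real API.
import sqlite3

def cluster_documents(docs: list[str]) -> dict[str, list[str]]:
    # 1. Group documents into sub-corpora, e.g. {"fifa_world_cup": [...], ...}.
    raise NotImplementedError

def infer_schema(corpus_name: str, docs: list[str]) -> dict[str, str]:
    # 2. Ask an LLM to propose columns for the sub-corpus,
    #    e.g. {"year": "INTEGER", "winner": "TEXT", "top_scorer": "TEXT"}.
    raise NotImplementedError

def extract_record(doc: str, schema: dict[str, str]) -> dict:
    # 3. Ask an LLM to fill the schema with values from one document.
    raise NotImplementedError

def ingest(docs: list[str], db_path: str = "structured_rag.db") -> None:
    conn = sqlite3.connect(db_path)
    for corpus_name, corpus_docs in cluster_documents(docs).items():
        schema = infer_schema(corpus_name, corpus_docs)
        # 4. Create one table per sub-corpus and insert one row per document.
        #    (LLM-proposed names should be validated before interpolation.)
        cols = ", ".join(f"{name} {sqltype}" for name, sqltype in schema.items())
        conn.execute(f"CREATE TABLE IF NOT EXISTS {corpus_name} ({cols})")
        placeholders = ", ".join("?" for _ in schema)
        for doc in corpus_docs:
            record = extract_record(doc, schema)
            conn.execute(
                f"INSERT INTO {corpus_name} ({', '.join(schema)}) VALUES ({placeholders})",
                [record.get(col) for col in schema],
            )
    conn.commit()
```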

Inference Flow

  1. Identify Relevant Schema When a query is received, the system identifies the schema relevant to that query (e.g., financial or FIFA World Cup) [00:07:34].
  2. Text-to-SQL A text-to-SQL process is performed over the structured data to generate the final answer [00:07:44]. For instance, “which team has won the FIFA World Cup the most times?” becomes a simple SQL query [00:07:52].
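
A corresponding sketch of the inference flow is shown below; route_to_schema() and text_to_sql() stand in for LLM calls and are assumptions, not the talk’s implementation.

```python
# Sketch of the inference flow: pick the relevant schema, then text-to-SQL.
# route_to_schema() and text_to_sql() are hypothetical LLM-backed helpers.
import sqlite3

def route_to_schema(question: str, schemas: dict[str, dict[str, str]]) -> str:
    # 1. Pick the sub-corpus/table whose schema matches the question.
    raise NotImplementedError

def text_to_sql(question: str, table: str, schema: dict[str, str]) -> str:
    # 2. Ask an LLM to write SQL over that table. For example,
    #    "which team has won the FIFA World Cup the most times?" might become:
    #    SELECT winner, COUNT(*) AS wins FROM fifa_world_cup
    #    GROUP BY winner ORDER BY wins DESC LIMIT 1
    raise NotImplementedError

def answer(question: str, schemas: dict[str, dict[str, str]], db_path: str) -> list[tuple]:
    table = route_to_schema(question, schemas)
    sql = text_to_sql(question, table, schemas[table])
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```

Keeping schema routing separate from SQL generation keeps each prompt small, though, as noted below, text-to-SQL grows harder as schemas become more intricate.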

Limitations of Structured RAG Solutions

While promising, this structured approach has its own challenges:

  • Applicability Not every corpus or query is suited to a relational database model [00:08:06]. Not all corpora are homogeneous or necessarily contain an underlying schema [00:08:12].
  • Normalization Issues Even in simple examples like the FIFA World Cup data, building a correct schema can be challenging. Questions arise regarding entities like “West Germany” vs. “Germany” or whether an attribute like “host country” should be singular or a list (e.g., South Korea and Japan hosting together) [00:08:31].
  • Ambiguity/Abstention LLMs tend to try to please users, which can lead them to attempt irrelevant questions instead of abstaining (e.g., “did Real Madrid win the 2006 final?” in a World Cup context) [00:09:07].
  • Ingestion Trade-offs There’s a trade-off between the complexity and granularity of clustering and schema inference during ingestion versus the computational investment [00:09:31].
  • Text-to-SQL Complexity Generating SQL from natural language becomes increasingly complex as schemas become more intricate [00:09:48].

Key Takeaways

  • RAG is not a one-size-fits-all system; different clients and use cases require tailored approaches [00:10:01].
  • The standard RAG pipeline (chunking, embedding, retrieving, reranking) is insufficient for many types of questions [00:10:10].
  • Existing benchmarks are too limited to capture these complex use cases [00:10:23].
  • To solve problems, it may be necessary to go beyond standard RAG for specific settings [00:10:35].