From: aidotengineer
Why RAG Evaluation is Broken [00:00:24]
RAG (Retrieval-Augmented Generation) evaluation faces significant challenges, primarily because benchmarks are constructed in whatever way is easiest, which usually reflects how humans typically think about questions [00:00:28].
Characteristics of Flawed Benchmarks
- Local Questions and Answers: Most benchmarks consist of local questions that have local answers [00:00:37]. There is an inherent assumption that the answer resides within a specific chunk of data [00:00:42].
- Manufactured and Unrealistic Scenarios: The natural way to build such benchmarks is to read a long document, find a question whose answer is contained within it, and label that answer as the “golden answer” [00:00:50]. Some benchmarks try to move beyond this with “multi-hop” questions, famously framed by Google, but these questions are often unrealistic and do not represent real-world scenarios [00:01:01].
- Lack of Holistic System Testing: Most existing benchmarks fall into one of two categories (a sketch of both evaluation modes follows this list):
- Retrieval-only: Focusing solely on retrieving the best segments or chunks, assuming the answer is contained within one or several of them [00:01:35].
- Generation-only (Grounding benchmarks): Primarily checking whether a system can answer a question given context already provided in the prompt [00:01:51]. This narrow focus neglects crucial aspects like chunking and parsing, as well as cases where answers are not directly derivable from a single data point [00:02:00].
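A minimal sketch of what these two evaluation modes actually measure, assuming a toy benchmark entry built around a single “golden” chunk (the entry format and scoring helpers below are illustrative, not taken from any particular benchmark):

```python
# Illustrative sketch: how retrieval-only and generation-only evaluations
# score a benchmark entry that assumes a single "golden" chunk.

# A typical local-QA benchmark entry (hypothetical format).
entry = {
    "question": "When was the company founded?",
    "golden_chunk_id": "doc3_chunk12",   # answer assumed to live in one chunk
    "golden_answer": "1998",
}

def retrieval_only_score(retrieved_chunk_ids, entry, k=10):
    """Recall@k: did the golden chunk appear in the top-k results?
    Ignores chunking, parsing, and answer generation entirely."""
    return 1.0 if entry["golden_chunk_id"] in retrieved_chunk_ids[:k] else 0.0

def generation_only_score(generated_answer, entry):
    """Grounding check: given the golden context already placed in the prompt,
    does the model's answer match the golden answer?
    Ignores whether the system could have found that context at all."""
    return 1.0 if entry["golden_answer"] in generated_answer else 0.0

# Neither score says anything about questions whose answer is spread
# across many chunks rather than contained in one.
```

Neither mode exercises the full pipeline end to end, which is exactly the gap described above.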
Disconnect from Real-World Data
Current benchmarks do not correlate well with real-world data, which is inherently messier and more diverse [00:02:20]. Each dataset is unique, and benchmarks often fail to generalize effectively [00:02:27].
The Vicious Cycle of Flawed Benchmarking
When developing RAG systems, a vicious cycle emerges:
1. Developers build RAG systems optimized for flawed benchmarks [00:02:37].
2. High scores are achieved and celebrated [00:02:40].
3. Upon real-world deployment or testing with customer data, the systems struggle and underperform [00:02:53].
4. New benchmarks are created and re-optimized for, and the cycle continues, often replicating the same underlying problems [00:02:59].
Performance on Complex Questions [00:03:40]
RAG systems currently struggle with aggregative or comprehensive questions, whose answers depend on many parts of the corpus rather than just the top-k retrieved chunks [00:04:02]; a sketch after the examples below illustrates why.
Examples of Challenging Questions
- “Which company has reported the highest quarterly revenue the most times?” [00:03:42]
- “In how many fiscal years did Apple exceed 100 billion in annual revenue?” [00:03:49]
- Questions requiring an exhaustive list, such as “all Fortune 500 companies,” where retrieving a limited number of chunks (e.g., 10) may miss additional relevant entities [00:04:13].
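To make the failure mode concrete, here is a minimal sketch using the Apple example above, with hypothetical revenue figures: the correct answer is an aggregate over every fiscal year in the corpus, while a top-k retriever only ever exposes a handful of chunks to the LLM.

```python
# Hypothetical per-fiscal-year revenue data in billions (illustrative values).
apple_revenue_by_fiscal_year = {
    2015: 233.7, 2016: 215.6, 2017: 229.2, 2018: 265.6,
    2019: 260.2, 2020: 274.5, 2021: 365.8, 2022: 394.3,
}

# The aggregative question needs a full scan over all records...
years_over_100b = sum(1 for revenue in apple_revenue_by_fiscal_year.values() if revenue > 100)
print(f"Fiscal years above $100B: {years_over_100b}")

# ...whereas a standard RAG pipeline only sees the top-k retrieved chunks.
# If each fiscal year lives in a different chunk and k is smaller than the
# number of years, some years never reach the LLM, so the count comes out wrong.
TOP_K = 3
retrieved_years = list(apple_revenue_by_fiscal_year)[:TOP_K]  # stand-in for a retriever
partial_count = sum(1 for y in retrieved_years if apple_revenue_by_fiscal_year[y] > 100)
print(f"Count computed from only {TOP_K} retrieved chunks: {partial_count}")
```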
Empirical Evidence of Struggles
A small corpus of 22 documents from FIFA World Cup historical Wikipedia pages was used to test common RAG pipelines [00:04:46]. Questions asked included “Which team has won the FIFA World Cup the most times?” and “In how many FIFA World Cups did Brazil participate?” [00:05:02].
- A common RAG pipeline (e.g., LangChain or LlamaIndex) achieved only 5% correct answers [00:05:27].
- OpenAI Responses achieved 11% correct answers [00:05:30].
These results demonstrate significant failures even for questions with local answers [00:05:37].
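For reference, the kind of “vanilla” top-k pipeline being tested looks roughly like the sketch below, written with plain cosine-similarity retrieval rather than the exact LangChain or LlamaIndex APIs; embed_texts and generate_answer are placeholders for whichever embedding model and LLM the pipeline plugs in.

```python
import numpy as np

def embed_texts(texts):
    """Placeholder for an embedding model call (e.g., a sentence-transformer);
    returns one vector per input text."""
    raise NotImplementedError("plug in your embedding model here")

def generate_answer(question, context_chunks):
    """Placeholder for an LLM call that answers from the given context."""
    raise NotImplementedError("plug in your LLM call here")

def answer_with_vanilla_rag(question, chunks, k=10):
    """Standard chunk -> embed -> retrieve top-k -> generate pipeline."""
    chunk_vecs = np.array(embed_texts(chunks))
    query_vec = np.array(embed_texts([question]))[0]

    # Cosine similarity between the question and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top_k_chunks = [chunks[i] for i in np.argsort(-sims)[:k]]
    return generate_answer(question, top_k_chunks)
```

Nothing in this loop ever shows the model more than k chunks at once, which is why aggregative questions like the World Cup counts fail regardless of which framework implements the pipeline.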
Conclusion
Existing benchmarks fail to capture the complexity and nuance required for many real-world RAG use cases [00:10:23]. RAG is not a one-size-fits-all solution, and optimization must account for individual client needs [00:10:01]. The standard pipeline of chunking, embedding, retrieving, and reranking is insufficient for many types of questions [00:10:16]. To address these problems, it may be necessary to go beyond standard RAG approaches for specific settings [00:10:35].