From: aidotengineer
Why RAG Evaluation is Broken [00:00:24]
RAG (Retrieval-Augmented Generation) evaluation faces significant challenges, primarily because benchmarks are constructed in whatever way is easiest, which usually reflects how humans typically think about questions [00:00:28].
Characteristics of Flawed Benchmarks
- Local Questions and Answers: Most benchmarks consist of local questions that have local answers [00:00:37]. There is an inherent assumption that the answer resides within a specific chunk of data [00:00:42].
- Manufactured and Unrealistic Scenarios: The natural way to build such benchmarks is to read a long document, find a question whose answer is contained within it, and label that answer as the “golden answer” [00:00:50]. Some benchmarks try to move beyond this with “multi-hop” questions, famously framed by Google, but these questions are often unrealistic and do not represent real-world scenarios [00:01:01].
- Lack of Holistic System Testing: Most existing benchmarks fall into one of two categories (a sketch of both evaluation modes follows this list):
- Retrieval-only: Focusing solely on retrieving the best segments or chunks, assuming the answer is contained within one or several of them [00:01:35].
- Generation-only (Grounding benchmarks): Primarily checking whether a system can answer a question given context already provided in the prompt [00:01:51]. This narrow focus neglects crucial aspects like chunking and parsing, as well as cases where answers are not directly derivable from a single data point [00:02:00].
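A minimal sketch of what these two evaluation modes actually measure, assuming a toy benchmark entry built around a single “golden” chunk (the entry format and scoring helpers below are illustrative, not taken from any particular benchmark):

```python
# Illustrative sketch: how retrieval-only and generation-only evaluations
# score a benchmark entry that assumes a single "golden" chunk.

# A typical local-QA benchmark entry (hypothetical format).
entry = {
    "question": "When was the company founded?",
    "golden_chunk_id": "doc3_chunk12",   # answer assumed to live in one chunk
    "golden_answer": "1998",
}

def retrieval_only_score(retrieved_chunk_ids, entry, k=10):
    """Recall@k: did the golden chunk appear in the top-k results?
    Ignores chunking, parsing, and answer generation entirely."""
    return 1.0 if entry["golden_chunk_id"] in retrieved_chunk_ids[:k] else 0.0

def generation_only_score(generated_answer, entry):
    """Grounding check: given the golden context already placed in the prompt,
    does the model's answer match the golden answer?
    Ignores whether the system could have found that context at all."""
    return 1.0 if entry["golden_answer"] in generated_answer else 0.0

# Neither score says anything about questions whose answer is spread
# across many chunks rather than contained in one.
```

Neither mode exercises the full pipeline end to end, which is exactly the gap described above.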
Disconnect from Real-World Data
Current benchmarks do not correlate well with real-world data, which is inherently messier and more diverse [00:02:20]. Each dataset is unique, and benchmarks often fail to generalize effectively [00:02:27].
The Vicious Cycle of Flawed Benchmarking
When developing RAG systems, a vicious cycle emerges:
1. Developers build RAG systems optimized for flawed benchmarks [00:02:37].
2. High scores are achieved and celebrated [00:02:40].
3. Upon real-world deployment or testing with customer data, the systems struggle and underperform [00:02:53].
4. New benchmarks are created and re-optimized for, and the cycle continues, often replicating the same underlying problems [00:02:59].
Performance on Complex Questions [00:03:40]
RAG systems currently struggle with aggregative or comprehensive questions, whose answers depend on many parts of the corpus rather than just the top-k retrieved chunks [00:04:02]; a sketch after the examples below illustrates why.
Examples of Challenging Questions
- “Which company has reported the highest quarterly revenue the most times?” [00:03:42]
- “In how many fiscal years did Apple exceed 100 billion in annual revenue?” [00:03:49]
- Questions requiring an exhaustive list, such as “all Fortune 500 companies,” where retrieving a limited number of chunks (e.g., 10) may miss additional relevant entities [00:04:13].
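To make the failure mode concrete, here is a minimal sketch using the Apple example above, with hypothetical revenue figures: the correct answer is an aggregate over every fiscal year in the corpus, while a top-k retriever only ever exposes a handful of chunks to the LLM.

```python
# Hypothetical per-fiscal-year revenue data in billions (illustrative values).
apple_revenue_by_fiscal_year = {
    2015: 233.7, 2016: 215.6, 2017: 229.2, 2018: 265.6,
    2019: 260.2, 2020: 274.5, 2021: 365.8, 2022: 394.3,
}

# The aggregative question needs a full scan over all records...
years_over_100b = sum(1 for revenue in apple_revenue_by_fiscal_year.values() if revenue > 100)
print(f"Fiscal years above $100B: {years_over_100b}")

# ...whereas a standard RAG pipeline only sees the top-k retrieved chunks.
# If each fiscal year lives in a different chunk and k is smaller than the
# number of years, some years never reach the LLM, so the count comes out wrong.
TOP_K = 3
retrieved_years = list(apple_revenue_by_fiscal_year)[:TOP_K]  # stand-in for a retriever
partial_count = sum(1 for y in retrieved_years if apple_revenue_by_fiscal_year[y] > 100)
print(f"Count computed from only {TOP_K} retrieved chunks: {partial_count}")
```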
Empirical Evidence of Struggles
A small corpus of 22 documents from FIFA World Cup historical Wikipedia pages was used to test common RAG pipelines [00:04:46]. Questions asked included “Which team has won the FIFA World Cup the most times?” and “In how many FIFA World Cups did Brazil participate?” [00:05:02].
- A common RAG pipeline (e.g., LangChain or LlamaIndex) achieved only 5% correct answers [00:05:27].
- OpenAI Responses achieved 11% correct answers [00:05:30].
These results demonstrate significant failures even for questions with local answers [00:05:37].
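For reference, the kind of “vanilla” top-k pipeline being tested looks roughly like the sketch below, written with plain cosine-similarity retrieval rather than the exact LangChain or LlamaIndex APIs; embed_texts and generate_answer are placeholders for whichever embedding model and LLM the pipeline plugs in.

```python
import numpy as np

def embed_texts(texts):
    """Placeholder for an embedding model call (e.g., a sentence-transformer);
    returns one vector per input text."""
    raise NotImplementedError("plug in your embedding model here")

def generate_answer(question, context_chunks):
    """Placeholder for an LLM call that answers from the given context."""
    raise NotImplementedError("plug in your LLM call here")

def answer_with_vanilla_rag(question, chunks, k=10):
    """Standard chunk -> embed -> retrieve top-k -> generate pipeline."""
    chunk_vecs = np.array(embed_texts(chunks))
    query_vec = np.array(embed_texts([question]))[0]

    # Cosine similarity between the question and every chunk.
    sims = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    top_k_chunks = [chunks[i] for i in np.argsort(-sims)[:k]]
    return generate_answer(question, top_k_chunks)
```

Nothing in this loop ever shows the model more than k chunks at once, which is why aggregative questions like the World Cup counts fail regardless of which framework implements the pipeline.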
Conclusion
Existing benchmarks fail to capture the complexity and nuance required for many real-world RAG use cases [00:10:23]. RAG is not a one-size-fits-all solution, and optimization must account for individual client needs [00:10:01]. The standard pipeline of chunking, embedding, retrieving, and reranking is insufficient for many types of questions [00:10:16]. To address these problems, it may be necessary to go beyond standard RAG approaches for specific settings [00:10:35].