From: aidotengineer
Open RAG Eval is an open-source project for quick, scalable evaluation of Retrieval Augmented Generation (RAG) systems; crucially, it does not require “golden answers” or “golden chunks” for evaluation [00:00:06], [00:00:19], [00:02:54]. The project is backed by research conducted in collaboration with Jimmy Lin’s lab at the University of Waterloo [00:00:32], [00:02:32].
The evaluation process starts with a set of queries, which a RAG connector (e.g., for Vectara, LangChain, or LlamaIndex) runs against the RAG system to collect the retrieved chunks and generated answers [00:00:47]. These outputs are then fed into evaluators, each of which runs a group of metrics and produces RAG evaluation files [00:01:16]. Among these metrics, Umbrella and Auto Nuggetizer are key for evaluating RAG systems without golden data [00:01:36].
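To make the flow concrete, the sketch below shows roughly how queries, a connector, and evaluators fit together. The class and function names are illustrative placeholders, not the actual open-rag-eval API.

```python
# Illustrative sketch only -- the names below are hypothetical, not the real open-rag-eval API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RAGOutput:
    query: str
    chunks: list[str]   # passages retrieved by the RAG system
    answer: str         # answer generated by the RAG system

def run_pipeline(
    queries: list[str],
    connector: Callable[[str], RAGOutput],          # e.g., wraps Vectara, LangChain, or LlamaIndex
    evaluators: list[Callable[[RAGOutput], dict]],  # each evaluator groups several metrics
) -> list[dict]:
    """Collect RAG outputs per query, score them, and return one evaluation record per query."""
    results = []
    for query in queries:
        output = connector(query)
        record = {"query": query, "answer": output.answer}
        for evaluate in evaluators:
            record.update(evaluate(output))  # e.g., {"umbrella": ..., "auto_nugget": ...}
        results.append(record)
    return results
```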
Umbrella Metric
Umbrella is a retrieval metric designed to evaluate the relevance of retrieved chunks without the need for golden chunks [00:01:44], [00:02:08].
How it Works
- Umbrella assigns a score to each retrieved chunk, ranging from zero to three [00:02:13].
- A score of zero indicates that the chunk or passage has no relevance to the query [00:02:17].
- A score of three signifies that the chunk is dedicated to the query and contains the exact answer [00:02:21].
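As a rough illustration of how such a judge might be prompted, here is a minimal sketch. `call_llm` is a placeholder for whatever LLM client is used, and the wording of the intermediate grades (1 and 2) is paraphrased for illustration; it is not the project’s exact rubric.

```python
# Hypothetical sketch of an Umbrella-style LLM judge; not the project's actual prompt.
from typing import Callable

UMBRELLA_PROMPT = """Given a query and a passage, grade the passage on a 0-3 scale:
0 = the passage has no relevance to the query
1 = the passage is related but does not answer the query
2 = the passage partially answers the query
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {passage}
Answer with a single digit (0-3)."""

def umbrella_score(query: str, passage: str, call_llm: Callable[[str], str]) -> int:
    """Ask an LLM judge to grade one retrieved chunk; clamp the result to the 0-3 range."""
    reply = call_llm(UMBRELLA_PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    score = int(digits[0]) if digits else 0
    return max(0, min(3, score))
```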
Key Benefit
- Research from the University of Waterloo lab has demonstrated that this approach correlates well with human judgment, providing reliable results even without golden chunks [00:02:32], [00:02:45].
Auto Nuggetizer Metric
Auto Nuggetizer is a generation metric used to evaluate the quality of generated answers without requiring golden answers [00:02:51], [00:02:54].
How it Works
The process involves three main steps:
- Nugget Creation: Atomic units of information, referred to as “nuggets,” are created from the response [00:03:00].
- Nugget Rating and Selection: Each nugget is assigned a “vital” or “okay” rating; the nuggets are then sorted and the top 20 are selected [00:03:07].
- LLM Judge Analysis: An LLM judge analyzes the RAG system’s response to determine whether each selected nugget is fully supported, partially supported, or not supported by the answer [00:03:15].
Further details on this metric are available in the associated research papers [00:03:25].
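As a rough illustration of the final step, the sketch below shows how per-nugget support labels might be rolled up into an answer-level score. The partial-support weight of 0.5 and the vital-only variant are assumptions for illustration, not necessarily what Open RAG Eval computes.

```python
# Hypothetical roll-up of nugget-level judgments into an answer-level score.
from dataclasses import dataclass

@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay"
    support: str     # LLM judge label: "full", "partial", or "none"

def nugget_scores(nuggets: list[Nugget]) -> dict[str, float]:
    """Average per-nugget credit over all nuggets and over vital nuggets only (assumed weighting)."""
    credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    all_credits = [credit[n.support] for n in nuggets]
    vital_credits = [credit[n.support] for n in nuggets if n.importance == "vital"]
    return {
        "all_nuggets": sum(all_credits) / len(all_credits) if all_credits else 0.0,
        "vital_nuggets": sum(vital_credits) / len(vital_credits) if vital_credits else 0.0,
    }
```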
These metrics, along with others such as citation faithfulness and hallucination detection, enable Open RAG Eval to provide a comprehensive evaluation of RAG pipelines [00:01:52], [00:03:30], [00:03:44]. The project also offers a user interface for visualizing evaluation results at openevaluation.ai [00:03:57].