From: aidotengineer
Open RAG Eval is a new open-source project designed for quick and scalable RAG evaluation [00:00:06]. It addresses a major challenge in RAG evaluation: the requirement for golden answers or golden chunks, which does not scale [00:00:22]. The project is research-backed, developed in collaboration with Jimmy Lin's lab at the University of Waterloo [00:00:32].
Open RAG Eval Architecture Overview
The evaluation process begins with a set of queries (e.g., 10, 100, or 1,000) that matter for your RAG system [00:00:47]. A RAG connector collects information such as the chunks and answers generated by the RAG pipeline [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more in development [00:01:03]. These connectors produce the RAG outputs [00:01:09]. The evaluation itself then runs various metrics, grouped into evaluators, which generate the RAG evaluation files [00:01:16].
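To make the architecture concrete, here is a minimal sketch of the connector-and-evaluator flow in Python. The class and function names are hypothetical, not the actual open-rag-eval API; the sketch only illustrates how queries flow through a connector into a set of evaluators that produce per-query results.
```python
# Hypothetical sketch of the connector -> evaluator -> results flow described above.
# Class and method names are illustrative, not the actual open-rag-eval API.
from dataclasses import dataclass, field


@dataclass
class RAGOutput:
    """What a connector collects for one query: retrieved chunks plus the generated answer."""
    query: str
    chunks: list[str]
    answer: str


class MyRAGConnector:
    """Stand-in for a Vectara/LangChain/LlamaIndex connector."""

    def __init__(self, pipeline):
        self.pipeline = pipeline  # any object with retrieve() and generate()

    def fetch(self, query: str) -> RAGOutput:
        chunks = self.pipeline.retrieve(query)
        answer = self.pipeline.generate(query, chunks)
        return RAGOutput(query=query, chunks=chunks, answer=answer)


@dataclass
class EvalResult:
    query: str
    scores: dict = field(default_factory=dict)


def run_evaluation(queries, connector, evaluators) -> list[EvalResult]:
    """Run every evaluator (a callable returning {metric_name: score}) over each RAG output."""
    results = []
    for q in queries:
        output = connector.fetch(q)
        result = EvalResult(query=q)
        for evaluator in evaluators:
            result.scores.update(evaluator(output))
        results.append(result)
    return results
```
The key design point is that evaluators only ever see the RAG outputs (query, chunks, answer), which is what lets the metrics below run without golden answers or golden chunks.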
Key Metrics for RAG Evaluation without Golden Answers
Open RAG Eval offers several metrics designed to function without requiring golden answers [00:01:36].
UMBRELA (Retrieval Metric)
UMBRELA is a retrieval metric that assigns each retrieved chunk a score between zero and three [00:02:08]:
- Zero: The chunk/passage has nothing to do with the query [00:02:16].
- Three: The chunk is dedicated to the query and contains the exact answer [00:02:20].
Research from Jimmy Lin's lab at the University of Waterloo indicates that this approach correlates well with human judgment, even without golden chunks [00:02:32]. A sketch of how such a judge might be implemented is shown below.
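This is a minimal sketch only: `call_llm` is a placeholder for whatever LLM client you use, the prompt wording is not open-rag-eval's actual prompt, and the descriptions of the intermediate grades 1 and 2 are paraphrased.
```python
# Minimal sketch of an UMBRELA-style 0-3 relevance judge.
# `call_llm` is a placeholder for any LLM client; the prompt is illustrative only.
UMBRELA_PROMPT = """Given a query and a passage, grade the passage on a 0-3 scale:
0 - the passage has nothing to do with the query
1 - the passage is related but does not answer the query
2 - the passage partially answers the query
3 - the passage is dedicated to the query and contains the exact answer
Query: {query}
Passage: {passage}
Answer with a single digit (0, 1, 2, or 3)."""


def umbrela_score(query: str, passage: str, call_llm) -> int:
    """Ask an LLM judge for a 0-3 relevance grade and clamp the parsed result."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    grade = int(digits[0]) if digits else 0
    return max(0, min(3, grade))
```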
Auto Nuggetizer (Generation Metric)
Auto Nuggetizer is a generation metric that does not require golden answers [00:02:53]. It involves three steps:
- Nugget Creation: Atomic units called nuggets are created [00:03:00].
- Rating and Sorting: Each nugget is assigned a “vital” or “okay” rating, and the top 20 nuggets are selected [00:03:06].
- LLM Judge Analysis: An LLM judge analyzes the RAG system’s response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15] (see the sketch below).
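As a rough sketch of that final scoring step, assume the first two steps have already produced a list of rated nuggets, and that `judge_support` is an LLM-judge callable (hypothetical here) returning "full", "partial", or "none"; the exact weighting used by open-rag-eval may differ.
```python
# Hedged sketch of the nugget-scoring step (step 3 above).
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay"


def nugget_score(answer: str, nuggets: list[Nugget], judge_support) -> float:
    """judge_support(answer, nugget_text) -> "full" | "partial" | "none" via an LLM judge.

    Full support counts as 1, partial as 0.5; the score is the average over the
    (at most 20) selected nuggets.
    """
    if not nuggets:
        return 0.0
    credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    selected = nuggets[:20]
    total = sum(credit[judge_support(answer, n.text)] for n in selected)
    return total / len(selected)
```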
Citation Faithfulness
This metric measures the fidelity of citations within the response [00:03:30]. For each citation, it determines whether the cited passage supports the statement it is attached to (a sketch of such a check follows this list):
- Fully supported (high fidelity) [00:03:37].
- Partially supported [00:03:39].
- Not supported [00:03:40].
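A hedged sketch of such a check, again using a placeholder `call_llm`; the prompt and label names are assumptions, not the library's implementation.
```python
# Illustrative citation-faithfulness check for a single (statement, cited passage) pair.
FAITHFULNESS_PROMPT = """A response cites the passage below as support for a statement.
Statement: {statement}
Cited passage: {passage}
Is the statement fully supported, partially supported, or not supported by the passage?
Answer with exactly one of: full, partial, none."""


def citation_faithfulness(statement: str, cited_passage: str, call_llm) -> str:
    """Return "full", "partial", or "none" for a single citation."""
    reply = call_llm(
        FAITHFULNESS_PROMPT.format(statement=statement, passage=cited_passage)
    )
    reply = reply.strip().lower()
    return reply if reply in {"full", "partial", "none"} else "none"
```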
Hallucination Detection
This metric uses Vectara’s Hughes Hallucination Evaluation Model (HHEM) to check whether the entire response is consistent with the retrieved content [00:03:45].
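Vectara publishes an open version of this model on Hugging Face (vectara/hallucination_evaluation_model). The sketch below follows the usage described on that model card, so treat the loading details as assumptions; the 0.5 threshold is arbitrary, and the score is a factual-consistency score where higher means the response is better grounded in the retrieved text.
```python
# Sketch of a hallucination check with Vectara's open HHEM model.
# Loading details follow the Hugging Face model card and are assumptions here.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)


def consistency_score(retrieved_text: str, response: str) -> float:
    """Score in [0, 1]: higher means the response is better grounded in the retrieved text."""
    # HHEM scores (premise, hypothesis) pairs; the retrieved content is the premise.
    return float(model.predict([(retrieved_text, response)])[0])


if __name__ == "__main__":
    score = consistency_score(
        "Paris is the capital of France.", "The capital of France is Paris."
    )
    print(f"consistency: {score:.2f}", "OK" if score > 0.5 else "possible hallucination")
```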
User Interface
After running an evaluation, the resulting evaluation files can be dragged and dropped onto openevaluation.ai [00:04:06]. This provides a user interface that displays every query that was run and allows you to compare retrieval scores, generation scores, and other metrics [00:04:12].
Benefits
Open RAG Eval is a powerful package for optimizing and tuning RAG pipelines [00:04:27]. Being open source, it offers full transparency into how the metrics work [00:04:34]. It includes connectors for Vectara, LangChain, and LlamaIndex, and contributions of connectors for other RAG pipelines are welcome [00:04:44].