From: aidotengineer
Open RAG Eval Project
Open RAG Eval is an open-source project designed for quick and scalable RAG evaluation [00:00:06]. It addresses a major challenge in RAG evaluation: the typical requirement for golden answers or chunks, which does not scale [00:00:22]. The project is research-backed, the result of a collaboration with Jimmy Lin's lab at the University of Waterloo [00:00:32].
Architecture and Workflow
The Open RAG Eval process involves several steps:
- Query Collection: A set of queries (e.g., 10 to 1,000) that are important for the RAG system is collected [00:00:47].
- RAG Connector: A RAG connector collects the actual chunks and answers generated by a RAG pipeline [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more being developed [00:01:04]. The project encourages contributions of additional RAG connectors [00:04:46].
- Output Generation: The connectors generate the RAG outputs, including retrieved chunks and generated answers [00:01:09].
- Evaluation Execution: The system runs a series of metrics, grouped into evaluators [00:01:16].
- File Generation: The evaluators produce RAG evaluation files containing all the information needed to assess the pipeline [00:01:24]. A structural sketch of this workflow appears below.
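To make the workflow concrete, here is a minimal structural sketch in Python. Every name in it (RAGOutput, run_rag_pipeline, evaluate) is an illustrative stand-in, not open-rag-eval's actual API; consult the project itself for the real connector and evaluator classes.

```python
from dataclasses import dataclass

# All names below are illustrative stand-ins, not open-rag-eval's real API.

@dataclass
class RAGOutput:
    query: str
    retrieved_chunks: list[str]  # chunks returned by the retriever
    generated_answer: str        # answer produced by the generator


def run_rag_pipeline(query: str) -> RAGOutput:
    """Stand-in for a RAG connector (e.g., Vectara, LangChain, LlamaIndex)."""
    # A real connector would call the RAG system here and record its outputs.
    return RAGOutput(query=query, retrieved_chunks=["..."], generated_answer="...")


def evaluate(output: RAGOutput) -> dict:
    """Stand-in for an evaluator that groups several metrics."""
    return {
        "query": output.query,
        "retrieval_scores": [],   # e.g., UMBRELA grades per chunk
        "generation_scores": {},  # e.g., Autonuggetizer nugget coverage
    }


if __name__ == "__main__":
    # 1. Collect the queries that matter for your RAG system (10 to 1,000).
    queries = ["What is the refund policy?", "How do I rotate my API keys?"]
    # 2-3. The connector produces retrieved chunks and generated answers.
    outputs = [run_rag_pipeline(q) for q in queries]
    # 4-5. Evaluators score each output; the results become the evaluation
    #      file that can later be inspected in the UI.
    results = [evaluate(o) for o in outputs]
    print(results)
```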
Key Metrics for RAG Evaluation Without Golden Answers
Open RAG Eval employs several metrics that do not require golden answers or chunks, making the evaluation process highly scalable [00:01:36]:
- UMBRELA (Retrieval Metric): This metric evaluates retrieval without golden chunks [00:01:44]. It assigns each retrieved chunk a score from zero to three indicating its relevance to the query (a prompt sketch follows this list):
  - Zero: The chunk has nothing to do with the query [00:02:16].
  - Three: The chunk is dedicated to the query and contains the exact answer [00:02:20].
  Research by the University of Waterloo indicates that this approach correlates well with human judgment [00:02:32].
- Autonuggetizer (Generation Metric): This metric evaluates generation without golden answers [00:02:51]. It involves three steps (a support-judgment sketch follows this list):
  - Nugget Creation: Atomic units of information, called "nuggets," are created [00:03:00].
  - Nugget Rating: Each nugget is rated "vital" or "okay," and the sorted list is trimmed to the top 20 [00:03:07].
  - LLM Judgment: An LLM judge analyzes the RAG response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15].
- Citation Faithfulness: This metric measures whether the citations in the RAG response are accurate and high-fidelity, classifying each citation as fully supported, partially supported, or not supported by the underlying passages [00:03:30].
- Hallucination Detection: This check uses Vectara's Hughes Hallucination Evaluation Model (HHEM) to verify that the entire RAG response aligns with the retrieved content [00:03:45] (a sketch using the open HHEM model follows this list).
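For illustration, here is a hedged sketch of an UMBRELA-style relevance judgment. The prompt wording is an assumption: the talk only spells out the zero and three anchors, so the one and two descriptions below follow the standard graded-relevance scale, and the actual prompts used by open-rag-eval may differ. The llm argument stands in for whatever LLM judge you use.

```python
from typing import Callable

# Prompt wording is illustrative; open-rag-eval's actual UMBRELA prompt may differ.
UMBRELA_PROMPT = """\
Judge how relevant the passage is to the query on a 0-3 scale:
0 = the passage has nothing to do with the query
1 = the passage seems related to the query but does not answer it
2 = the passage has some answer for the query, but it is unclear or incomplete
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {passage}

Reply with a single digit (0-3)."""


def umbrela_grade(query: str, passage: str, llm: Callable[[str], str]) -> int:
    """Ask an LLM judge for a 0-3 relevance grade and parse the first digit."""
    reply = llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    for ch in reply:
        if ch in "0123":
            return int(ch)
    raise ValueError(f"Could not parse a 0-3 grade from: {reply!r}")


if __name__ == "__main__":
    # Stubbed judge for demonstration; replace with a real LLM call.
    fake_llm = lambda prompt: "3"
    print(umbrela_grade("What is the capital of France?",
                        "Paris is the capital of France.", fake_llm))
```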
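In the same spirit, here is a sketch of the final Autonuggetizer step (citation faithfulness follows the same pattern): an LLM judge labels each nugget as fully, partially, or not supported by the answer, and the labels are aggregated into a score. The prompt, labels, and the 1.0/0.5/0.0 weighting are assumptions for illustration; the project's exact scoring may differ.

```python
from typing import Callable

SUPPORT_PROMPT = """\
Nugget: {nugget}
Answer: {answer}

Is the nugget fully supported, partially supported, or not supported by the answer?
Reply with exactly one of: full, partial, none."""

# Illustrative weighting; open-rag-eval's aggregation may differ.
WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}


def judge_support(nugget: str, answer: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM judge whether the answer supports the nugget."""
    reply = llm(SUPPORT_PROMPT.format(nugget=nugget, answer=answer)).strip().lower()
    return reply if reply in WEIGHTS else "none"


def nugget_coverage(nuggets: list[str], answer: str, llm: Callable[[str], str]) -> float:
    """Average support over the selected (e.g., top-20) nuggets."""
    if not nuggets:
        return 0.0
    return sum(WEIGHTS[judge_support(n, answer, llm)] for n in nuggets) / len(nuggets)


if __name__ == "__main__":
    # Stubbed judge for demonstration; replace with a real LLM call.
    fake_llm = lambda prompt: "partial"
    print(nugget_coverage(["Paris is the capital of France."],
                          "France's capital city is Paris.", fake_llm))
```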
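Finally, a sketch of a hallucination check using Vectara's openly released HHEM model on Hugging Face. This follows the usage shown on the vectara/hallucination_evaluation_model model card at the time of writing (it requires transformers and trust_remote_code); the HHEM integration inside open-rag-eval may be accessed differently.

```python
# Requires: pip install transformers torch
from transformers import AutoModelForSequenceClassification

# Usage based on the public HHEM model card; verify against the current card.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved context, generated answer). Scores near 1.0 indicate
# the answer is consistent with the context; scores near 0.0 suggest hallucination.
pairs = [
    ("The plan includes 10 GB of storage per user.",
     "Each user gets 10 GB of storage."),
]
scores = model.predict(pairs)
print(scores)
```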
User Interface
Open RAG Eval provides a user-friendly interface to analyze evaluation results [00:03:58]. Users can drag and drop their evaluation files onto open-evaluation.ai to visualize the data [00:04:06]. The UI displays:
- All queries that were run [00:04:16].
- Comparison of retrieval scores [00:04:19].
- Different generation scores [00:04:21].
This package offers transparency into how metrics work and can help optimize and tune RAG pipelines [00:04:31].