From: aidotengineer
Open RAG Eval is a new open-source project designed for quick and scalable RAG evaluation [00:00:06]. It addresses a significant challenge in RAG evaluation: the usual requirement for golden answers or golden chunks, which does not scale [00:00:22].
The project is backed by research conducted in collaboration with the University of Waterloo, specifically Jimmy Lin's lab [00:00:32].
How Open Rag Eval Works
The general architecture of Open RAG Eval performs evaluation in several steps (a minimal sketch of the flow follows this list) [00:00:45]:
- Queries: The process begins with a set of queries, which can range from 10 to 1,000, collected for the RAG system [00:00:47].
- RAG Connector: A RAG connector collects the actual outputs of the RAG pipeline: the retrieved chunks and the generated answers [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more being added [00:01:04]. These connectors produce the RAG outputs used in the evaluation [00:01:09].
- Evaluation: The evaluation phase runs various metrics, which are grouped into evaluators [00:01:16].
- RAG Evaluation Files: The evaluators produce RAG evaluation files containing all necessary information to assess the RAG pipeline [00:01:24].
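
A minimal sketch of this flow is shown below. The class and method names (MyRAGConnector, RelevanceEvaluator, fetch, evaluate) are hypothetical placeholders for illustration, not the actual open-rag-eval API; the point is only how queries, a connector, evaluators, and the output file fit together.

```python
# Illustrative sketch of the Open RAG Eval flow; all names here are
# hypothetical placeholders, not the real open-rag-eval API.
import json

queries = ["What is RAG?", "How does hybrid search work?"]  # anywhere from 10 to 1,000 queries


class MyRAGConnector:
    """Hypothetical connector: calls your RAG pipeline and returns its outputs."""

    def fetch(self, query):
        # A real connector would call Vectara, LangChain, LlamaIndex, etc.
        return {
            "query": query,
            "chunks": ["...retrieved chunk..."],
            "answer": "...generated answer...",
        }


class RelevanceEvaluator:
    """Hypothetical evaluator: groups one or more metrics (e.g. retrieval relevance)."""

    def evaluate(self, rag_output):
        return {"retrieval_score": 2.5, "generation_score": 0.8}  # placeholder scores


connector = MyRAGConnector()
evaluators = [RelevanceEvaluator()]

results = []
for q in queries:
    output = connector.fetch(q)                # 1) collect the RAG pipeline's outputs
    scores = {}
    for ev in evaluators:
        scores.update(ev.evaluate(output))     # 2) run the metrics grouped into evaluators
    results.append({**output, **scores})

# 3) write a RAG evaluation file containing everything needed to assess the pipeline
with open("rag_eval_results.json", "w") as f:
    json.dump(results, f, indent=2)
```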
Key Metrics for Evaluation
Open RAG Eval uses metrics that enable RAG evaluation without golden answers [00:01:36]:
- UMBRELA (Retrieval Metric): This metric evaluates retrieval quality without requiring golden chunks [00:01:44]. It assigns each retrieved chunk a score between 0 and 3 indicating its relevance to the query (see the first sketch after this list):
  - 0: The chunk has nothing to do with the query [00:02:16].
  - 3: The chunk is dedicated to the query and contains the exact answer [00:02:20].
  Research by the University of Waterloo lab shows that this approach correlates well with human judgment [00:02:32].
- AutoNuggetizer (Generation Metric): This generation metric also does not require golden answers [00:02:53]. It involves three steps (a toy score roll-up is sketched after this list):
  - Nugget creation: Atomic units of information called "nuggets" are created [00:03:00].
  - Nugget rating: Each nugget is rated "vital" or "okay", and the top 20 are selected [00:03:05].
  - LLM judge analysis: An LLM judge analyzes the RAG response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15].
- Citation Faithfulness: This metric measures whether the citations in a response are high fidelity: each citation is judged as fully supported, partially supported, or not supported by the passage it cites (a small roll-up sketch follows this list) [00:03:30].
- Hallucination Detection: This uses Vectara's Hughes Hallucination Evaluation Model (HHEM) to check whether the entire response is grounded in the retrieved content (a usage sketch follows this list) [00:03:45].
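
The first sketch below shows one way a 0-3 UMBRELA-style relevance judgment could be obtained from an LLM. The prompt wording, the grade definitions for 1 and 2, and the model name are illustrative assumptions, not the exact prompt or model used by Open RAG Eval.

```python
# Hypothetical UMBRELA-style relevance judge; the prompt and model name are
# illustrative assumptions, not the exact ones used by Open RAG Eval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def grade_chunk(query: str, chunk: str) -> int:
    """Ask an LLM to grade a retrieved chunk from 0 (irrelevant) to 3 (exact answer)."""
    prompt = (
        "Grade how relevant the passage is to the query on a 0-3 scale:\n"
        "0 = has nothing to do with the query\n"
        "1 = related to the topic but does not answer it (assumed definition)\n"
        "2 = partially answers the query (assumed definition)\n"
        "3 = dedicated to the query and contains the exact answer\n\n"
        f"Query: {query}\nPassage: {chunk}\n\n"
        "Answer with a single digit."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip()[0])
```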
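Next, a toy roll-up of AutoNuggetizer judgments into a single generation score. The weights (full support = 1, partial support = 0.5, and "vital" nuggets counting twice as much as "okay" ones) are assumptions for illustration, not necessarily Open RAG Eval's exact scoring.

```python
# Hypothetical roll-up of nugget judgments into one generation score.
# The weights below are illustrative assumptions, not Open RAG Eval's exact scoring.
from dataclasses import dataclass


@dataclass
class NuggetJudgment:
    text: str
    importance: str  # "vital" or "okay"
    support: str     # LLM judge's verdict: "full", "partial", or "none"


SUPPORT_CREDIT = {"full": 1.0, "partial": 0.5, "none": 0.0}
IMPORTANCE_WEIGHT = {"vital": 1.0, "okay": 0.5}


def nugget_score(judgments: list[NuggetJudgment]) -> float:
    """Weighted fraction of nuggets that the RAG answer supports."""
    total = sum(IMPORTANCE_WEIGHT[j.importance] for j in judgments)
    earned = sum(
        IMPORTANCE_WEIGHT[j.importance] * SUPPORT_CREDIT[j.support] for j in judgments
    )
    return earned / total if total else 0.0


judgments = [
    NuggetJudgment("RAG retrieves documents before generating", "vital", "full"),
    NuggetJudgment("Retrieval uses vector search", "okay", "partial"),
    NuggetJudgment("Chunks are re-ranked before generation", "vital", "none"),
]
print(f"Generation score: {nugget_score(judgments):.2f}")
```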
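Similarly, a small sketch of how per-citation verdicts might be rolled up into a citation-faithfulness score; the credit values are assumptions for illustration.

```python
# Hypothetical citation-faithfulness roll-up; the per-verdict credits are
# illustrative assumptions, not Open RAG Eval's exact scoring.
CITATION_CREDIT = {"fully_supported": 1.0, "partially_supported": 0.5, "not_supported": 0.0}


def citation_faithfulness(verdicts: list[str]) -> float:
    """Average credit over all citations (1.0 = every citation fully supported)."""
    if not verdicts:
        return 0.0
    return sum(CITATION_CREDIT[v] for v in verdicts) / len(verdicts)


# e.g. a response with three citations, each judged against the passage it cites
print(citation_faithfulness(["fully_supported", "partially_supported", "not_supported"]))  # 0.5
```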
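Finally, a sketch of scoring groundedness with Vectara's open HHEM model from Hugging Face. The loading and prediction interface follows the public model card for the open HHEM checkpoint and may differ between versions; treat it as an assumption rather than Open RAG Eval's own code.

```python
# Sketch: scoring groundedness with Vectara's open HHEM model on Hugging Face.
# The loading/prediction interface follows the HHEM model card and may differ
# between versions; treat it as an assumption, not open-rag-eval's code.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

retrieved_content = "The Eiffel Tower is 330 metres tall and located in Paris."
rag_response = "The Eiffel Tower, located in Paris, is about 330 metres tall."

# Each pair is (premise = retrieved content, hypothesis = generated response);
# the model returns a consistency score in [0, 1], higher = better grounded.
scores = model.predict([(retrieved_content, rag_response)])
print(float(scores[0]))
```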
User Interface
Open RAG Eval provides a user interface for visualizing evaluation results [00:03:58]. After running an evaluation, users can drag and drop the output files onto open evaluation.ai to view a UI that displays every query that was run, along with its retrieval and generation scores, for side-by-side comparison [00:04:06].
Benefits and Contribution
Open RAG Eval is a powerful package that can help users optimize and tune their RAG pipelines [00:04:29]. Because it is open source, it offers full transparency into how the metrics work [00:04:34]. The project includes connectors for Vectara, LangChain, and LlamaIndex, and contributions of connectors for other RAG pipelines are welcome [00:04:44].