From: aidotengineer
Traditional RAG evaluation often relies on “golden answers” or “golden chunks” (pre-defined correct responses or relevant text segments), which presents a significant scalability challenge [00:00:22]. To address this, a new open-source project called Open Rag Eval has been developed for quick and scalable RAG evaluation that does not require these golden datasets [00:00:06].
Open Rag Eval Overview
Open Rag Eval is an open-source project designed to solve the non-scalable nature of RAG evaluation that demands golden answers or chunks [00:00:19]. It is a research-backed initiative, developed in collaboration with Jimmy Lin’s lab at the University of Waterloo [00:00:32].
Architecture and Workflow
The architecture of Open Rag Eval involves several steps to generate evaluation files:
- Queries: The process begins with a set of user-defined queries, ranging from tens to thousands, that represent what matters for the RAG system being evaluated [00:00:47].
- RAG Connector: A connector collects the actual information generated by the RAG pipeline, including chunks and answers [00:00:56]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more being added [00:01:04]. These generate the raw RAG outputs [00:01:09].
- Evaluation Run: The collected outputs are then fed into the evaluation engine, which runs various metrics grouped into evaluators [00:01:16].
- Evaluation Files: The evaluators generate comprehensive RAG evaluation files containing all necessary data to assess the RAG pipeline’s performance [00:01:24].
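To make the flow concrete, here is a minimal sketch of these four steps in Python. The class and function names (RAGOutput, run_connector, run_evaluators) are illustrative placeholders for this sketch, not the actual open-rag-eval API.

```python
# Hypothetical sketch of the Open Rag Eval workflow; class and function
# names are illustrative placeholders, not the actual open-rag-eval API.
import json
from dataclasses import dataclass


@dataclass
class RAGOutput:
    query: str
    chunks: list[str]  # retrieved passages returned by the RAG pipeline
    answer: str        # generated answer returned by the RAG pipeline


def run_connector(queries: list[str]) -> list[RAGOutput]:
    """Stand-in for a RAG connector (Vectara, LangChain, LlamaIndex, ...):
    send each query to the pipeline and collect its chunks and answer."""
    return [RAGOutput(q, chunks=["..."], answer="...") for q in queries]


def run_evaluators(outputs: list[RAGOutput]) -> list[dict]:
    """Stand-in for the evaluation run: apply retrieval and generation
    metrics (UMBRELA, AutoNuggetizer, citation faithfulness, hallucination
    detection) to each collected output."""
    return [{"query": o.query, "retrieval_score": None, "generation_score": None}
            for o in outputs]


if __name__ == "__main__":
    queries = ["What does UMBRELA measure?", "How are nuggets rated?"]
    raw_outputs = run_connector(queries)           # step 2: RAG connector
    results = run_evaluators(raw_outputs)          # step 3: evaluation run
    with open("rag_eval_results.json", "w") as f:  # step 4: evaluation files
        json.dump(results, f, indent=2)
```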
Key Metrics for No-Golden-Answer Evaluation
Open Rag Eval introduces several metrics that enable RAG evaluation without relying on golden answers [00:01:36].
UMBRELA (Retrieval Metric)
UMBRELA is a retrieval metric designed to assess retrieval quality without golden chunks [00:01:44]. It scores each retrieved chunk or passage on a scale of zero to three:
- Zero: The chunk has no relevance to the query [00:02:16].
- Three: The chunk is dedicated to the query and contains the exact answer [00:02:22].
Research from the University of Waterloo lab demonstrates that this approach correlates well with human judgment, ensuring reliable results even without golden chunks [00:02:32].
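As a rough illustration, an UMBRELA-style judge can be expressed as a prompt that returns a 0–3 grade per passage. The prompt wording, the call_llm helper, and the normalization below are assumptions made for this sketch, not the package’s actual implementation.

```python
# Illustrative UMBRELA-style relevance grading. The prompt text, the
# call_llm helper, and the normalization are assumptions, not the
# open-rag-eval implementation.
UMBRELA_PROMPT = """Given a query and a passage, grade the passage:
0 = the passage has no relevance to the query
1 = the passage is related to the query but does not answer it
2 = the passage partially answers the query
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {passage}
Grade (reply with a single digit):"""


def grade_passage(query: str, passage: str, call_llm) -> int:
    """Ask an LLM judge for a 0-3 relevance grade for one retrieved passage."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    return int(reply.strip()[0])  # assumes the judge answers with one digit


def retrieval_score(query: str, passages: list[str], call_llm) -> float:
    """Average per-passage grades and normalize to 0-1 (an assumed aggregation)."""
    grades = [grade_passage(query, p, call_llm) for p in passages]
    return sum(grades) / (3 * len(grades))
```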
AutoNuggetizer (Generation Metric)
AutoNuggetizer is a generation metric that also operates without requiring golden answers [00:02:53]. Its process involves three steps:
- Nugget Creation: Atomic units called “nuggets” are created from the response [00:03:00].
- Nugget Rating: Each nugget is assigned a “vital” or “okay” rating, and the top 20 nuggets are selected [00:03:07].
- LLM Judge Analysis: An LLM (Large Language Model) judge analyzes the RAG system’s response to determine if each selected nugget is fully or partially supported by the answer [00:03:15].
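Once the judge has labeled each selected nugget, the per-query generation score can be aggregated. The sketch below uses an assumed weighting (full support = 1, partial = 0.5), which is not necessarily the exact formula used by the package.

```python
# Illustrative nugget scoring; the weighting scheme (full = 1.0,
# partial = 0.5) is an assumption, not necessarily open-rag-eval's formula.
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay", assigned during nugget rating
    support: str     # "full", "partial", or "none" from the LLM judge


def nugget_score(nuggets: list[Nugget], vital_only: bool = False) -> float:
    """Fraction of nuggets the answer supports; partial support counts half."""
    weights = {"full": 1.0, "partial": 0.5, "none": 0.0}
    selected = [n for n in nuggets if n.importance == "vital"] if vital_only else nuggets
    if not selected:
        return 0.0
    return sum(weights[n.support] for n in selected) / len(selected)


# Example: two vital nuggets fully supported, one "okay" nugget unsupported.
example = [
    Nugget("RAG evaluation can work without golden answers", "vital", "full"),
    Nugget("UMBRELA grades chunks from zero to three", "vital", "full"),
    Nugget("Connectors exist for LangChain", "okay", "none"),
]
print(round(nugget_score(example), 2))                   # 0.67 across all nuggets
print(round(nugget_score(example, vital_only=True), 2))  # 1.0 across vital nuggets
```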
Citation Faithfulness
This metric measures the fidelity of citations within the RAG system’s response [00:03:30]. It assesses whether each statement in the response is fully supported, partially supported, or not supported at all by the passage it cites [00:03:34].
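A minimal sketch of such a check is an LLM-judge prompt applied to each (statement, cited passage) pair; the prompt wording and the call_llm helper are hypothetical, not the package’s implementation.

```python
# Illustrative citation-faithfulness check; the prompt wording and the
# call_llm helper are hypothetical, not the open-rag-eval implementation.
FAITHFULNESS_PROMPT = """Statement from the RAG answer:
{statement}

Passage it cites:
{passage}

Is the statement fully supported, partially supported, or not supported
by the cited passage? Reply with exactly one of:
fully_supported / partially_supported / not_supported"""


def judge_citation(statement: str, passage: str, call_llm) -> str:
    """Classify one (statement, cited passage) pair with an LLM judge."""
    prompt = FAITHFULNESS_PROMPT.format(statement=statement, passage=passage)
    return call_llm(prompt).strip()
```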
Hallucination Detection
This metric utilizes Vectara’s hallucination detection model (HHEM) to check whether the generated response as a whole is consistent with the retrieved content [00:03:45].
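For reference, Vectara’s hallucination detection model is available as an open checkpoint on Hugging Face, and the snippet below follows its publicly documented usage. Treat the model id and predict() interface as assumptions to verify against the model card; this is not necessarily how Open Rag Eval invokes the model internally.

```python
# Sketch of scoring (retrieved content, generated answer) pairs with
# Vectara's HHEM checkpoint via Hugging Face Transformers. The model id and
# predict() interface follow the public model card; verify before relying on it.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (premise = retrieved content, hypothesis = generated answer).
pairs = [
    ("The capital of France is Paris.", "Paris is the capital of France."),
    ("The capital of France is Paris.", "The capital of France is Lyon."),
]

# Scores near 1.0 indicate the answer is consistent with the retrieved
# content; scores near 0.0 suggest hallucination.
scores = model.predict(pairs)
print(scores)
```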
User Interface
Open Rag Eval provides a user-friendly interface at open_evaluation.ai to visualize the results [00:03:57]. Users can drag and drop their evaluation files onto the platform to view a detailed UI that compares retrieval scores, generation scores, and other relevant data across all queries [00:04:08].
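Evaluation files can also be summarized outside the UI. The sketch below assumes a simple JSON layout with per-query score fields, matching the earlier workflow sketch; the actual file format produced by Open Rag Eval may differ.

```python
# Sketch of summarizing an evaluation file outside the UI. The JSON layout
# (a list of per-query score dicts) is an assumption; the real open-rag-eval
# output format may differ.
import json
from statistics import mean

with open("rag_eval_results.json") as f:
    results = json.load(f)


def average(metric: str) -> float:
    values = [r[metric] for r in results if r.get(metric) is not None]
    return mean(values) if values else float("nan")


for metric in ("retrieval_score", "generation_score"):
    print(f"{metric}: {average(metric):.3f} across {len(results)} queries")
```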
Benefits and Contributions
Open Rag Eval is a powerful, open-source package designed to help optimize and tune RAG pipelines [00:04:27]. Its open nature ensures transparency in how the metrics work [00:04:34]. It already includes connectors for popular RAG frameworks such as Vectara, LangChain, and LlamaIndex, and the project welcomes issues and pull requests for additional connectors or other improvements [00:04:44].