From: aidotengineer
Open RAG Eval Project
Open RAG Eval is an open-source project designed for quick and scalable RAG evaluation [00:00:06]. It addresses a major challenge in RAG evaluation: the typical requirement for golden answers or chunks, which does not scale [00:00:22]. The project is research-backed, the result of a collaboration with Jimmy Lin's lab at the University of Waterloo [00:00:32].
Architecture and Workflow
The Open RAG Eval process involves several steps:
- Query Collection: A set of queries (e.g., 10 to 1,000) that are important for the RAG system is collected [00:00:47].
- RAG Connector: A RAG connector collects the actual chunks and answers generated by a RAG pipeline [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more being developed [00:01:04]. The project encourages contributions of additional RAG connectors [00:04:46].
- Output Generation: The connectors generate the RAG outputs, including retrieved chunks and generated answers [00:01:09].
- Evaluation Execution: The system runs a series of metrics, grouped into evaluators [00:01:16].
- File Generation: The evaluators produce RAG evaluation files containing all the information needed to assess the pipeline [00:01:24]. A structural sketch of this workflow appears below.
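To make the workflow concrete, here is a minimal structural sketch in Python. Every name in it (RAGOutput, run_rag_pipeline, evaluate) is an illustrative stand-in, not open-rag-eval's actual API; consult the project itself for the real connector and evaluator classes.

```python
from dataclasses import dataclass

# All names below are illustrative stand-ins, not open-rag-eval's real API.

@dataclass
class RAGOutput:
    query: str
    retrieved_chunks: list[str]  # chunks returned by the retriever
    generated_answer: str        # answer produced by the generator


def run_rag_pipeline(query: str) -> RAGOutput:
    """Stand-in for a RAG connector (e.g., Vectara, LangChain, LlamaIndex)."""
    # A real connector would call the RAG system here and record its outputs.
    return RAGOutput(query=query, retrieved_chunks=["..."], generated_answer="...")


def evaluate(output: RAGOutput) -> dict:
    """Stand-in for an evaluator that groups several metrics."""
    return {
        "query": output.query,
        "retrieval_scores": [],   # e.g., UMBRELA grades per chunk
        "generation_scores": {},  # e.g., Autonuggetizer nugget coverage
    }


if __name__ == "__main__":
    # 1. Collect the queries that matter for your RAG system (10 to 1,000).
    queries = ["What is the refund policy?", "How do I rotate my API keys?"]
    # 2-3. The connector produces retrieved chunks and generated answers.
    outputs = [run_rag_pipeline(q) for q in queries]
    # 4-5. Evaluators score each output; the results become the evaluation
    #      file that can later be inspected in the UI.
    results = [evaluate(o) for o in outputs]
    print(results)
```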
Key Metrics for RAG Evaluation Without Golden Answers
Open RAG Eval employs several metrics that do not require golden answers or chunks, making the evaluation process highly scalable [00:01:36]:
- UMBRELA (Retrieval Metric): This metric evaluates retrieval without golden chunks [00:01:44]. It assigns each retrieved chunk a score from zero to three indicating its relevance to the query (a prompt sketch follows this list):
  - Zero: The chunk has nothing to do with the query [00:02:16].
  - Three: The chunk is dedicated to the query and contains the exact answer [00:02:20].
  Research by the University of Waterloo indicates that this approach correlates well with human judgment [00:02:32].
- Autonuggetizer (Generation Metric): This metric evaluates generation without golden answers [00:02:51]. It involves three steps (a support-judgment sketch follows this list):
  - Nugget Creation: Atomic units of information, called "nuggets," are created [00:03:00].
  - Nugget Rating: Each nugget is rated "vital" or "okay," and the sorted list is trimmed to the top 20 [00:03:07].
  - LLM Judgment: An LLM judge analyzes the RAG response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15].
- Citation Faithfulness: This metric measures whether the citations in the RAG response are accurate and high-fidelity, classifying each citation as fully supported, partially supported, or not supported by the underlying passages [00:03:30].
- Hallucination Detection: This check uses Vectara's Hughes Hallucination Evaluation Model (HHEM) to verify that the entire RAG response aligns with the retrieved content [00:03:45] (a sketch using the open HHEM model follows this list).
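For illustration, here is a hedged sketch of an UMBRELA-style relevance judgment. The prompt wording is an assumption: the talk only spells out the zero and three anchors, so the one and two descriptions below follow the standard graded-relevance scale, and the actual prompts used by open-rag-eval may differ. The llm argument stands in for whatever LLM judge you use.

```python
from typing import Callable

# Prompt wording is illustrative; open-rag-eval's actual UMBRELA prompt may differ.
UMBRELA_PROMPT = """\
Judge how relevant the passage is to the query on a 0-3 scale:
0 = the passage has nothing to do with the query
1 = the passage seems related to the query but does not answer it
2 = the passage has some answer for the query, but it is unclear or incomplete
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {passage}

Reply with a single digit (0-3)."""


def umbrela_grade(query: str, passage: str, llm: Callable[[str], str]) -> int:
    """Ask an LLM judge for a 0-3 relevance grade and parse the first digit."""
    reply = llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    for ch in reply:
        if ch in "0123":
            return int(ch)
    raise ValueError(f"Could not parse a 0-3 grade from: {reply!r}")


if __name__ == "__main__":
    # Stubbed judge for demonstration; replace with a real LLM call.
    fake_llm = lambda prompt: "3"
    print(umbrela_grade("What is the capital of France?",
                        "Paris is the capital of France.", fake_llm))
```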
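In the same spirit, here is a sketch of the final Autonuggetizer step (citation faithfulness follows the same pattern): an LLM judge labels each nugget as fully, partially, or not supported by the answer, and the labels are aggregated into a score. The prompt, labels, and the 1.0/0.5/0.0 weighting are assumptions for illustration; the project's exact scoring may differ.

```python
from typing import Callable

SUPPORT_PROMPT = """\
Nugget: {nugget}
Answer: {answer}

Is the nugget fully supported, partially supported, or not supported by the answer?
Reply with exactly one of: full, partial, none."""

# Illustrative weighting; open-rag-eval's aggregation may differ.
WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}


def judge_support(nugget: str, answer: str, llm: Callable[[str], str]) -> str:
    """Ask an LLM judge whether the answer supports the nugget."""
    reply = llm(SUPPORT_PROMPT.format(nugget=nugget, answer=answer)).strip().lower()
    return reply if reply in WEIGHTS else "none"


def nugget_coverage(nuggets: list[str], answer: str, llm: Callable[[str], str]) -> float:
    """Average support over the selected (e.g., top-20) nuggets."""
    if not nuggets:
        return 0.0
    return sum(WEIGHTS[judge_support(n, answer, llm)] for n in nuggets) / len(nuggets)


if __name__ == "__main__":
    # Stubbed judge for demonstration; replace with a real LLM call.
    fake_llm = lambda prompt: "partial"
    print(nugget_coverage(["Paris is the capital of France."],
                          "France's capital city is Paris.", fake_llm))
```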
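Finally, a sketch of a hallucination check using Vectara's openly released HHEM model on Hugging Face. This follows the usage shown on the vectara/hallucination_evaluation_model model card at the time of writing (it requires transformers and trust_remote_code); the HHEM integration inside open-rag-eval may be accessed differently.

```python
# Requires: pip install transformers torch
from transformers import AutoModelForSequenceClassification

# Usage based on the public HHEM model card; verify against the current card.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved context, generated answer). Scores near 1.0 indicate
# the answer is consistent with the context; scores near 0.0 suggest hallucination.
pairs = [
    ("The plan includes 10 GB of storage per user.",
     "Each user gets 10 GB of storage."),
]
scores = model.predict(pairs)
print(scores)
```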
User Interface
Open RAG Eval provides a user-friendly interface to analyze evaluation results [00:03:58]. Users can drag and drop their evaluation files onto open-evaluation.ai to visualize the data [00:04:06]. The UI displays:
- All queries that were run [00:04:16].
- Comparison of retrieval scores [00:04:19].
- Different generation scores [00:04:21].
This package offers transparency into how metrics work and can help optimize and tune RAG pipelines [00:04:31].