From: aidotengineer
Open RAG Eval is a new open-source project designed for quick and scalable RAG evaluation [00:00:06]. It addresses a major challenge in RAG evaluation: the requirement for golden answers or golden chunks, which does not scale [00:00:22]. The project is research-backed, developed in collaboration with Jimmy Lin's lab at the University of Waterloo [00:00:32].
Open RAG Eval Architecture Overview
The evaluation process begins with a set of queries (e.g., 10, 100, or 1,000) that matter for your RAG system [00:00:47]. A RAG connector collects information such as the chunks and answers generated by the RAG pipeline [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more in development [00:01:03]. These connectors produce the RAG outputs [00:01:09]. The evaluation itself then runs various metrics, grouped into evaluators, which generate the RAG evaluation files [00:01:16].
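To make the architecture concrete, here is a minimal sketch of the connector-and-evaluator flow in Python. The class and function names are hypothetical, not the actual open-rag-eval API; the sketch only illustrates how queries flow through a connector into a set of evaluators that produce per-query results.
```python
# Hypothetical sketch of the connector -> evaluator -> results flow described above.
# Class and method names are illustrative, not the actual open-rag-eval API.
from dataclasses import dataclass, field


@dataclass
class RAGOutput:
    """What a connector collects for one query: retrieved chunks plus the generated answer."""
    query: str
    chunks: list[str]
    answer: str


class MyRAGConnector:
    """Stand-in for a Vectara/LangChain/LlamaIndex connector."""

    def __init__(self, pipeline):
        self.pipeline = pipeline  # any object with retrieve() and generate()

    def fetch(self, query: str) -> RAGOutput:
        chunks = self.pipeline.retrieve(query)
        answer = self.pipeline.generate(query, chunks)
        return RAGOutput(query=query, chunks=chunks, answer=answer)


@dataclass
class EvalResult:
    query: str
    scores: dict = field(default_factory=dict)


def run_evaluation(queries, connector, evaluators) -> list[EvalResult]:
    """Run every evaluator (a callable returning {metric_name: score}) over each RAG output."""
    results = []
    for q in queries:
        output = connector.fetch(q)
        result = EvalResult(query=q)
        for evaluator in evaluators:
            result.scores.update(evaluator(output))
        results.append(result)
    return results
```
The key design point is that evaluators only ever see the RAG outputs (query, chunks, answer), which is what lets the metrics below run without golden answers or golden chunks.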
Key Metrics for RAG Evaluation without Golden Answers
Open RAG Eval offers several metrics designed to function without requiring golden answers [00:01:36].
UMBRELA (Retrieval Metric)
UMBRELA is a retrieval metric that assigns each retrieved chunk a score between zero and three [00:02:08]:
- Zero: The chunk/passage has nothing to do with the query [00:02:16].
- Three: The chunk is dedicated to the query and contains the exact answer [00:02:20].
Research from Jimmy Lin's lab at the University of Waterloo indicates that this approach correlates well with human judgment, even without golden chunks [00:02:32]. A sketch of how such a judge might be implemented is shown below.
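This is a minimal sketch only: `call_llm` is a placeholder for whatever LLM client you use, the prompt wording is not open-rag-eval's actual prompt, and the descriptions of the intermediate grades 1 and 2 are paraphrased.
```python
# Minimal sketch of an UMBRELA-style 0-3 relevance judge.
# `call_llm` is a placeholder for any LLM client; the prompt is illustrative only.
UMBRELA_PROMPT = """Given a query and a passage, grade the passage on a 0-3 scale:
0 - the passage has nothing to do with the query
1 - the passage is related but does not answer the query
2 - the passage partially answers the query
3 - the passage is dedicated to the query and contains the exact answer
Query: {query}
Passage: {passage}
Answer with a single digit (0, 1, 2, or 3)."""


def umbrela_score(query: str, passage: str, call_llm) -> int:
    """Ask an LLM judge for a 0-3 relevance grade and clamp the parsed result."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, passage=passage))
    digits = [c for c in reply if c.isdigit()]
    grade = int(digits[0]) if digits else 0
    return max(0, min(3, grade))
```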
Auto Nuggetizer (Generation Metric)
Auto Nuggetizer is a generation metric that does not require golden answers [00:02:53]. It involves three steps:
- Nugget Creation: Atomic units called nuggets are created [00:03:00].
- Rating and Sorting: Each nugget is assigned a “vital” or “okay” rating, and the top 20 nuggets are selected [00:03:06].
- LLM Judge Analysis: An LLM judge analyzes the RAG system’s response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15] (see the sketch below).
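As a rough sketch of that final scoring step, assume the first two steps have already produced a list of rated nuggets, and that `judge_support` is an LLM-judge callable (hypothetical here) returning "full", "partial", or "none"; the exact weighting used by open-rag-eval may differ.
```python
# Hedged sketch of the nugget-scoring step (step 3 above).
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay"


def nugget_score(answer: str, nuggets: list[Nugget], judge_support) -> float:
    """judge_support(answer, nugget_text) -> "full" | "partial" | "none" via an LLM judge.

    Full support counts as 1, partial as 0.5; the score is the average over the
    (at most 20) selected nuggets.
    """
    if not nuggets:
        return 0.0
    credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    selected = nuggets[:20]
    total = sum(credit[judge_support(answer, n.text)] for n in selected)
    return total / len(selected)
```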
Citation Faithfulness
This metric measures the fidelity of citations within the response [00:03:30]. For each citation, it determines whether the cited passage supports the statement it is attached to (a sketch of such a check follows this list):
- Fully supported (high fidelity) [00:03:37].
- Partially supported [00:03:39].
- Not supported [00:03:40].
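A hedged sketch of such a check, again using a placeholder `call_llm`; the prompt and label names are assumptions, not the library's implementation.
```python
# Illustrative citation-faithfulness check for a single (statement, cited passage) pair.
FAITHFULNESS_PROMPT = """A response cites the passage below as support for a statement.
Statement: {statement}
Cited passage: {passage}
Is the statement fully supported, partially supported, or not supported by the passage?
Answer with exactly one of: full, partial, none."""


def citation_faithfulness(statement: str, cited_passage: str, call_llm) -> str:
    """Return "full", "partial", or "none" for a single citation."""
    reply = call_llm(
        FAITHFULNESS_PROMPT.format(statement=statement, passage=cited_passage)
    )
    reply = reply.strip().lower()
    return reply if reply in {"full", "partial", "none"} else "none"
```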
Hallucination Detection
This metric uses Vectara’s Hughes Hallucination Evaluation Model (HHEM) to check whether the entire response is consistent with the retrieved content [00:03:45].
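Vectara publishes an open version of this model on Hugging Face (vectara/hallucination_evaluation_model). The sketch below follows the usage described on that model card, so treat the loading details as assumptions; the 0.5 threshold is arbitrary, and the score is a factual-consistency score where higher means the response is better grounded in the retrieved text.
```python
# Sketch of a hallucination check with Vectara's open HHEM model.
# Loading details follow the Hugging Face model card and are assumptions here.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)


def consistency_score(retrieved_text: str, response: str) -> float:
    """Score in [0, 1]: higher means the response is better grounded in the retrieved text."""
    # HHEM scores (premise, hypothesis) pairs; the retrieved content is the premise.
    return float(model.predict([(retrieved_text, response)])[0])


if __name__ == "__main__":
    score = consistency_score(
        "Paris is the capital of France.", "The capital of France is Paris."
    )
    print(f"consistency: {score:.2f}", "OK" if score > 0.5 else "possible hallucination")
```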
User Interface
After running an evaluation, the resulting evaluation files can be dragged and dropped onto openevaluation.ai [00:04:06]. This provides a user interface that displays every query that was run and allows you to compare retrieval scores, generation scores, and other metrics [00:04:12].
Benefits
Open RAG Eval is a powerful package for optimizing and tuning RAG pipelines [00:04:27]. Being open source, it offers full transparency into how the metrics work [00:04:34]. It includes connectors for Vectara, LangChain, and LlamaIndex, and contributions of connectors for other RAG pipelines are welcome [00:04:44].