From: aidotengineer
Evaluating Retrieval-Augmented Generation (RAG) systems is crucial for optimizing their performance. A long-standing challenge in RAG evaluation has been the requirement for golden answers or golden chunks, an approach that does not scale well [00:00:29]. The OpenRAG Eval project addresses this by providing a quick, scalable way to evaluate RAG systems without relying on these prerequisites [00:00:09].
OpenRAG Eval Architecture
The OpenRAG Eval architecture begins with a set of user queries (e.g., 10, 100, or 1,000) deemed important for a RAG system [00:00:49]. A RAG connector, available for platforms like Vectara, LangChain, and LlamaIndex, collects the actual chunks retrieved and answers generated by the RAG pipeline [00:01:02]. This output is then fed into the evaluation process, which runs various metrics grouped into “evaluators” [00:01:22]. These evaluators produce RAG evaluation files containing all the information needed to assess the pipeline [00:01:29].
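As a rough illustration of that flow, the sketch below uses hypothetical names (RAGOutput, EvalRecord, run_evaluation) rather than the actual open-rag-eval API: a connector yields query/chunks/answer triples, each evaluator scores them, and the results are collected into evaluation records.

```python
# Hypothetical sketch of the evaluation flow; RAGOutput, EvalRecord, and
# run_evaluation are illustrative names, not the actual open-rag-eval API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RAGOutput:
    query: str
    chunks: list[str]   # passages actually retrieved by the RAG pipeline
    answer: str         # answer actually generated by the RAG pipeline

@dataclass
class EvalRecord:
    query: str
    scores: dict[str, float] = field(default_factory=dict)

def run_evaluation(
    outputs: list[RAGOutput],
    evaluators: dict[str, Callable[[RAGOutput], float]],
) -> list[EvalRecord]:
    """Run every evaluator over every RAG output and collect the scores."""
    records = []
    for out in outputs:
        record = EvalRecord(query=out.query)
        for name, evaluate in evaluators.items():
            record.scores[name] = evaluate(out)
        records.append(record)
    return records
```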
Key Metrics for Evaluation
OpenRAG Eval introduces several metrics designed to function effectively without the need for golden answers:
UMBRELA (Retrieval Metric)
UMBRELA is a retrieval metric that assigns each retrieved chunk a score between zero and three [00:02:16]:
- Zero: The chunk has no relevance to the query [00:02:20].
- Three: The chunk is dedicated to the query and contains the exact answer [00:02:24].
Research by Jimmy Lin’s lab at the University of Waterloo indicates that this approach correlates well with human judgment, allowing for reliable retrieval evaluation without golden chunks [00:02:42].
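A minimal sketch of how such a 0–3 grade could be produced automatically with an LLM judge is shown below; the rubric wording, the intermediate grades 1 and 2, and the call_llm helper are assumptions for illustration, not the project's actual implementation.

```python
# Sketch of an UMBRELA-style relevance judgment. The rubric text and the
# call_llm helper are assumptions, not the metric's exact prompt.
UMBRELA_PROMPT = """Given a query and a passage, assign a relevance score:
0 = the passage has no relevance to the query
1 = the passage is on topic but does not answer the query
2 = the passage partially answers the query
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {chunk}
Score (0-3):"""

def umbrela_score(query: str, chunk: str, call_llm) -> int:
    """Ask an LLM judge for a 0-3 relevance grade for one retrieved chunk."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, chunk=chunk))
    return int(reply.strip()[0])  # take the leading digit as the grade
```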
AutoNuggetizer (Generation Metric)
AutoNuggetizer evaluates generation quality without requiring golden answers [00:02:56]. It follows three steps, illustrated by the scoring sketch after this list:
- Nugget Creation: Atomic units, called “nuggets,” are created from the response [00:03:05].
- Vital/Okay Rating: Each nugget is assigned a vital or okay rating, and the top 20 are selected [00:03:12].
- LLM Judge Analysis: An LLM judge analyzes the RAG response to determine if each selected nugget is fully or partially supported by the answer [00:03:23].
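The sketch below illustrates the final scoring step under simple assumptions: each selected nugget carries a vital/okay rating and a support label from the LLM judge, and the answer's score is a weighted average of support credit. The field names and weights are illustrative, not the exact AutoNuggetizer formula.

```python
# Minimal nugget-scoring sketch; the weights and field names are assumptions,
# not the exact AutoNuggetizer formula.
from dataclasses import dataclass

@dataclass
class Nugget:
    text: str
    vital: bool    # True for a "vital" nugget, False for an "okay" one
    support: str   # "full", "partial", or "none" w.r.t. the RAG answer

def nugget_score(nuggets: list[Nugget]) -> float:
    """Weighted average of support credit, counting vital nuggets more."""
    if not nuggets:
        return 0.0
    credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    weights = [1.0 if n.vital else 0.5 for n in nuggets]
    supported = sum(w * credit[n.support] for w, n in zip(weights, nuggets))
    return supported / sum(weights)
```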
Citation Faithfulness
This metric assesses whether the citations in the RAG response are accurate [00:03:35]. For each citation, it measures whether the cited passage fully supports, partially supports, or does not support the corresponding statement in the response [00:03:43].
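One way to frame this as an LLM-judge call is sketched below; the prompt text, label set, and call_llm helper are assumptions, not the project's actual code.

```python
# Hedged sketch of a citation-faithfulness check: an LLM judge decides how
# well the cited passage supports the statement that cites it.
CITATION_PROMPT = """Statement from the RAG response: {statement}
Cited passage: {passage}

How well does the cited passage support the statement?
Answer with exactly one of: full_support, partial_support, no_support."""

def citation_faithfulness(statement: str, passage: str, call_llm) -> str:
    """Return the judged support level for a single citation."""
    label = call_llm(CITATION_PROMPT.format(statement=statement, passage=passage))
    return label.strip().lower()
```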
Hallucination Detection
Utilizing Vectara’s Hallucination Evaluation Model (HHEM), this metric checks whether the entire RAG response is consistent with the retrieved content, identifying instances of hallucination [00:03:53].
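Vectara also publishes an open HHEM checkpoint on the Hugging Face Hub; the snippet below shows one way to score factual consistency with it, following the usage described on the public model card. Treat the exact loading call and output interpretation as assumptions; open-rag-eval itself may invoke the model differently.

```python
# Scoring factual consistency with Vectara's open HHEM checkpoint from the
# Hugging Face Hub; usage follows the public model card and may differ from
# how open-rag-eval calls the model internally.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved evidence, generated answer); scores near 1.0 mean
# the answer is consistent with the evidence, scores near 0.0 suggest
# hallucination.
pairs = [
    ("The company reported revenue of $2.1B in Q3.",
     "Revenue in the third quarter was $2.1 billion."),
]
scores = model.predict(pairs)
print(scores)
```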
User Interface for Analysis
Once evaluation files are generated, they can be uploaded to open evaluation.ai [00:04:12]. This provides a user-friendly interface to visualize and compare retrieval and generation scores across different queries [00:04:22].
Benefits of OpenRAG Eval
OpenRAG Eval is an open-source package designed to help optimize and tune RAG pipelines [00:04:31]. Its open-source nature makes it transparent how each metric works, and the package ships connectors for RAG pipelines such as Vectara, LangChain, and LlamaIndex, with community contributions of additional connectors welcomed [00:04:52].