From: aidotengineer
Evaluating Retrieval-Augmented Generation (RAG) systems is crucial for optimizing their performance. A long-standing challenge in RAG evaluation has been the requirement for golden answers or golden chunks, an approach that does not scale well [00:00:29]. The OpenRAG Eval project addresses this by providing a quick, scalable way to evaluate RAG systems without relying on these prerequisites [00:00:09].
OpenRAG Eval Architecture
The OpenRAG Eval architecture begins with a set of user queries (e.g., 10, 100, or 1,000) deemed important for a RAG system [00:00:49]. A RAG connector, available for platforms like Vectara, LangChain, and LlamaIndex, collects the actual chunks retrieved and answers generated by the RAG pipeline [00:01:02]. This output is then fed into the evaluation process, which runs various metrics grouped into “evaluators” [00:01:22]. These evaluators produce RAG evaluation files containing all the information needed to assess the pipeline [00:01:29].
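As a rough illustration of that flow, the sketch below uses hypothetical names (RAGOutput, EvalRecord, run_evaluation) rather than the actual open-rag-eval API: a connector yields query/chunks/answer triples, each evaluator scores them, and the results are collected into evaluation records.

```python
# Hypothetical sketch of the evaluation flow; RAGOutput, EvalRecord, and
# run_evaluation are illustrative names, not the actual open-rag-eval API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class RAGOutput:
    query: str
    chunks: list[str]   # passages actually retrieved by the RAG pipeline
    answer: str         # answer actually generated by the RAG pipeline

@dataclass
class EvalRecord:
    query: str
    scores: dict[str, float] = field(default_factory=dict)

def run_evaluation(
    outputs: list[RAGOutput],
    evaluators: dict[str, Callable[[RAGOutput], float]],
) -> list[EvalRecord]:
    """Run every evaluator over every RAG output and collect the scores."""
    records = []
    for out in outputs:
        record = EvalRecord(query=out.query)
        for name, evaluate in evaluators.items():
            record.scores[name] = evaluate(out)
        records.append(record)
    return records
```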
Key Metrics for Evaluation
OpenRAG Eval introduces several metrics designed to function effectively without the need for golden answers:
UMBRELA (Retrieval Metric)
UMBRELA is a retrieval metric that assigns each retrieved chunk a score between zero and three [00:02:16]:
- Zero: The chunk has no relevance to the query [00:02:20].
- Three: The chunk is dedicated to the query and contains the exact answer [00:02:24].
Research by Jimmy Lin’s lab at the University of Waterloo indicates that this approach correlates well with human judgment, allowing for reliable retrieval evaluation without golden chunks [00:02:42].
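A minimal sketch of how such a 0–3 grade could be produced automatically with an LLM judge is shown below; the rubric wording, the intermediate grades 1 and 2, and the call_llm helper are assumptions for illustration, not the project's actual implementation.

```python
# Sketch of an UMBRELA-style relevance judgment. The rubric text and the
# call_llm helper are assumptions, not the metric's exact prompt.
UMBRELA_PROMPT = """Given a query and a passage, assign a relevance score:
0 = the passage has no relevance to the query
1 = the passage is on topic but does not answer the query
2 = the passage partially answers the query
3 = the passage is dedicated to the query and contains the exact answer

Query: {query}
Passage: {chunk}
Score (0-3):"""

def umbrela_score(query: str, chunk: str, call_llm) -> int:
    """Ask an LLM judge for a 0-3 relevance grade for one retrieved chunk."""
    reply = call_llm(UMBRELA_PROMPT.format(query=query, chunk=chunk))
    return int(reply.strip()[0])  # take the leading digit as the grade
```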
AutoNuggetizer (Generation Metric)
AutoNuggetizer evaluates generation quality without requiring golden answers [00:02:56]. It follows three steps, illustrated by the scoring sketch after this list:
- Nugget Creation: Atomic units, called “nuggets,” are created from the response [00:03:05].
- Vital/Okay Rating: Each nugget is assigned a vital or okay rating, and the top 20 are selected [00:03:12].
- LLM Judge Analysis: An LLM judge analyzes the RAG response to determine if each selected nugget is fully or partially supported by the answer [00:03:23].
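The sketch below illustrates the final scoring step under simple assumptions: each selected nugget carries a vital/okay rating and a support label from the LLM judge, and the answer's score is a weighted average of support credit. The field names and weights are illustrative, not the exact AutoNuggetizer formula.

```python
# Minimal nugget-scoring sketch; the weights and field names are assumptions,
# not the exact AutoNuggetizer formula.
from dataclasses import dataclass

@dataclass
class Nugget:
    text: str
    vital: bool    # True for a "vital" nugget, False for an "okay" one
    support: str   # "full", "partial", or "none" w.r.t. the RAG answer

def nugget_score(nuggets: list[Nugget]) -> float:
    """Weighted average of support credit, counting vital nuggets more."""
    if not nuggets:
        return 0.0
    credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    weights = [1.0 if n.vital else 0.5 for n in nuggets]
    supported = sum(w * credit[n.support] for w, n in zip(weights, nuggets))
    return supported / sum(weights)
```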
Citation Faithfulness
This metric assesses whether the citations in the RAG response are accurate [00:03:35]. For each citation, it measures whether the cited passage fully supports, partially supports, or does not support the corresponding statement in the response [00:03:43].
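One way to frame this as an LLM-judge call is sketched below; the prompt text, label set, and call_llm helper are assumptions, not the project's actual code.

```python
# Hedged sketch of a citation-faithfulness check: an LLM judge decides how
# well the cited passage supports the statement that cites it.
CITATION_PROMPT = """Statement from the RAG response: {statement}
Cited passage: {passage}

How well does the cited passage support the statement?
Answer with exactly one of: full_support, partial_support, no_support."""

def citation_faithfulness(statement: str, passage: str, call_llm) -> str:
    """Return the judged support level for a single citation."""
    label = call_llm(CITATION_PROMPT.format(statement=statement, passage=passage))
    return label.strip().lower()
```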
Hallucination Detection
Utilizing Vectara’s Hallucination Evaluation Model (HHEM), this metric checks whether the entire RAG response is consistent with the retrieved content, identifying instances of hallucination [00:03:53].
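Vectara also publishes an open HHEM checkpoint on the Hugging Face Hub; the snippet below shows one way to score factual consistency with it, following the usage described on the public model card. Treat the exact loading call and output interpretation as assumptions; open-rag-eval itself may invoke the model differently.

```python
# Scoring factual consistency with Vectara's open HHEM checkpoint from the
# Hugging Face Hub; usage follows the public model card and may differ from
# how open-rag-eval calls the model internally.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved evidence, generated answer); scores near 1.0 mean
# the answer is consistent with the evidence, scores near 0.0 suggest
# hallucination.
pairs = [
    ("The company reported revenue of $2.1B in Q3.",
     "Revenue in the third quarter was $2.1 billion."),
]
scores = model.predict(pairs)
print(scores)
```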
User Interface for Analysis
Once evaluation files are generated, they can be uploaded to open evaluation.ai [00:04:12]. This provides a user-friendly interface to visualize and compare retrieval and generation scores across different queries [00:04:22].
Benefits of OpenRAG Eval
OpenRAG Eval is an open-source package designed to help optimize and tune RAG pipelines [00:04:31]. Its open-source nature makes it transparent how each metric works, and the package ships connectors for RAG pipelines such as Vectara, LangChain, and LlamaIndex, with community contributions of additional connectors welcomed [00:04:52].