From: aidotengineer

Open RAG Eval is a new open-source project designed for quick and scalable RAG evaluation [00:00:06]. It aims to address a significant challenge in RAG evaluation: the usual requirement for golden answers or golden chunks, which does not scale [00:00:22].

The project is backed by research conducted in collaboration with the University of Waterloo, specifically Jimmy Lin’s lab [00:00:32].

How Open RAG Eval Works

The general architecture of Open RAG Eval involves several steps to perform RAG evaluation (sketched in code after the list) [00:00:45]:

  1. Queries: The process begins with a set of queries collected for the RAG system, which can range from 10 to 1,000 [00:00:47].
  2. RAG Connector: A RAG connector collects the actual outputs produced by the RAG pipeline, namely the retrieved chunks and generated answers [00:00:54]. Connectors are available for Vectara, LangChain, and LlamaIndex, with more being added [00:01:04]. These connectors produce the RAG outputs that the evaluators consume [00:01:09].
  3. Evaluation: The evaluation phase runs various metrics, which are grouped into evaluators [00:01:16].
  4. RAG Evaluation Files: The evaluators produce RAG evaluation files containing all necessary information to assess the RAG pipeline [00:01:24].
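
The flow above can be pictured in a short sketch. Everything here (the RAGOutput and RAGConnector types and the evaluate loop) is a hypothetical illustration of the roles just described, not the actual open-rag-eval API.

```python
# Minimal sketch of the flow above: queries -> connector -> evaluators -> eval records.
# NOTE: all class and function names here are hypothetical illustrations,
# not the actual open-rag-eval API.
from dataclasses import dataclass


@dataclass
class RAGOutput:
    query: str
    chunks: list[str]   # passages retrieved by the RAG pipeline
    answer: str         # answer generated by the RAG pipeline


class RAGConnector:
    """Hypothetical connector: sends a query to a RAG pipeline and records its outputs."""

    def run(self, query: str) -> RAGOutput:
        raise NotImplementedError


def evaluate(queries: list[str], connector: RAGConnector, evaluators: list) -> list[dict]:
    """Run every query through the pipeline, then score the collected outputs."""
    records = []
    for query in queries:                                          # step 1: queries
        output = connector.run(query)                              # step 2: collect chunks + answer
        scores = {ev.name: ev.score(output) for ev in evaluators}  # step 3: run metric groups
        records.append({"query": query, "scores": scores})         # step 4: evaluation record
    return records
```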

Key Metrics for Evaluation

Open RAG Eval relies on metrics that enable RAG evaluation without golden answers [00:01:36]:

  • UMBRELA (Retrieval Metric): This metric evaluates retrieval without requiring golden chunks [00:01:44]. It assigns each chunk a score between 0 and 3, indicating how relevant the chunk is to the query (see the retrieval-scoring sketch after this list).

  • Auto Nuggetizer (Generation Metric): This generation metric also does not require golden answers [00:02:53]. It involves three steps (see the generation-scoring sketch after this list):

    1. Nugget Creation: Atomic units called “nuggets” are created [00:03:00].
    2. Nugget Rating: Each nugget is assigned a “vital” or “okay” rating, and the top 20 are selected [00:03:05].
    3. LLM Judge Analysis: An LLM judge analyzes the RAG response to determine if each selected nugget is fully or partially supported by the answer [00:03:15].
  • Citation Faithfulness: This metric measures whether the citations in the response are faithful, judging each citation as fully supported, partially supported, or not supported by the passage it cites (also sketched after this list) [00:03:30].

  • Hallucination Detection: This uses Vectara’s Hughes Hallucination Evaluation Model (HHEM) to check whether the entire response is consistent with the retrieved content [00:03:45].
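
On the retrieval side, the UMBRELA idea can be sketched as an LLM judge that grades each retrieved chunk on the 0–3 scale mentioned above. The prompt wording and the call_llm helper below are assumptions for illustration, not the project’s actual prompt or client.

```python
# Sketch of UMBRELA-style retrieval scoring: an LLM judge grades each chunk 0-3.
# The prompt wording and `call_llm` (any function that sends a prompt to an LLM
# and returns its text reply) are hypothetical stand-ins.
UMBRELA_PROMPT = (
    "Given a query and a passage, rate how well the passage answers the query\n"
    "on a scale from 0 (not relevant) to 3 (highly relevant).\n"
    "Query: {query}\nPassage: {passage}\nReply with a single digit."
)


def score_retrieval(query: str, chunks: list[str], call_llm) -> float:
    """Average 0-3 relevance grade over the retrieved chunks; no golden chunks required."""
    grades = [
        int(call_llm(UMBRELA_PROMPT.format(query=query, passage=chunk)).strip()[0])
        for chunk in chunks
    ]
    return sum(grades) / len(grades) if grades else 0.0
```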
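
On the generation side, the nugget-based score can be sketched as: weight each of the top nuggets by its vital/okay rating, ask an LLM judge whether the answer supports it fully, partially, or not at all, and average. The specific weights and the judge callable are illustrative assumptions, not the project’s exact formula.

```python
# Sketch of Auto Nuggetizer-style generation scoring.
# The 1.0/0.5 weights and the `judge` callable are illustrative assumptions,
# not the project's exact formula.
from dataclasses import dataclass


@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay"


def score_generation(answer: str, nuggets: list[Nugget], judge) -> float:
    """judge(answer, nugget_text) -> "full", "partial", or "none" (hypothetical LLM judge)."""
    support_credit = {"full": 1.0, "partial": 0.5, "none": 0.0}
    importance_weight = {"vital": 1.0, "okay": 0.5}
    num = den = 0.0
    for nugget in nuggets[:20]:               # top-20 nuggets, as described above
        verdict = judge(answer, nugget.text)  # is this nugget supported by the answer?
        num += importance_weight[nugget.importance] * support_credit[verdict]
        den += importance_weight[nugget.importance]
    return num / den if den else 0.0
```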
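
Citation faithfulness can be sketched the same way: an LLM judge compares each citation’s claim with the passage it cites and returns one of the three verdicts named above. The Citation type and judge helper are hypothetical; the whole-response hallucination check (HHEM) is a separate model call and is only noted in a comment.

```python
# Sketch of citation-faithfulness scoring: each citation is judged against the
# passage it cites. The Citation type and `judge` callable are hypothetical.
# A whole-response hallucination check (e.g., HHEM) would be a separate model call.
from dataclasses import dataclass


@dataclass
class Citation:
    claim: str    # the answer sentence that carries the citation
    passage: str  # the cited source passage


def citation_faithfulness(citations: list[Citation], judge) -> dict:
    """judge(claim, passage) -> "full", "partial", or "none" (hypothetical LLM judge)."""
    counts = {"full": 0, "partial": 0, "none": 0}
    for citation in citations:
        counts[judge(citation.claim, citation.passage)] += 1
    return counts
```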

User Interface

Open RAG Eval provides a user interface for visualizing evaluation results [00:03:58]. After running an evaluation, users can drag and drop the output files onto open evaluation.ai to view a UI that displays all run queries, retrieval scores, and generation scores side by side [00:04:06].

Benefits and Contribution

Open RAG Eval is a powerful package that can help users optimize and tune their RAG pipelines [00:04:29]. Being open source, it offers full transparency into how the metrics work [00:04:34]. The project includes connectors for Vectara, LangChain, and LlamaIndex, and contributions of connectors for other RAG pipelines are welcome [00:04:44].
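
Contributing a connector mostly amounts to mapping a stack’s retrieve-and-generate calls onto the connector role described earlier. The interface below is a hedged sketch under that assumption, not the project’s actual connector API.

```python
# Sketch of what a contributed connector might look like.
# The interface is an assumption for illustration, not open-rag-eval's actual API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class RAGOutput:
    query: str
    chunks: list[str]   # retrieved passages
    answer: str         # generated answer


class MyPipelineConnector:
    """Wraps any RAG stack that exposes retrieve() and generate() callables."""

    def __init__(self,
                 retrieve: Callable[[str], list[str]],
                 generate: Callable[[str, list[str]], str]):
        self.retrieve = retrieve    # query -> retrieved chunks
        self.generate = generate    # (query, chunks) -> generated answer

    def run(self, query: str) -> RAGOutput:
        chunks = self.retrieve(query)
        answer = self.generate(query, chunks)
        return RAGOutput(query=query, chunks=chunks, answer=answer)
```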