From: aidotengineer

OpenRAG eval is a new open-source project designed for quick and scalable RAG evaluation [00:00:06]. Developed by Vectara in collaboration with Jimmy Lin’s lab at the University of Waterloo, it addresses a major problem in RAG evaluation: the common requirement for golden answers or golden chunks, which does not scale [00:00:22].

How OpenRAG Eval Works

The architecture of OpenRAG eval involves several steps to evaluate a RAG system [00:00:45]:

  1. Query Input: The process begins with a set of queries, anywhere from 10 to 1,000, chosen because they matter to the specific RAG system being evaluated [00:00:47].
  2. RAG Connector: A RAG connector collects the actual outputs of the RAG pipeline, including retrieved chunks and generated answers [00:00:56]. Connectors are available for Vectara, LangChain, LlamaIndex, and a growing number of others [00:01:04]; these connectors produce the RAG outputs that feed the evaluation [00:01:09].
  3. Evaluation: The evaluation process runs a series of metrics, which are grouped into evaluators [00:01:16].
  4. Output: The evaluators generate RAG evaluation files containing all the information needed to assess the RAG pipeline [00:01:24] (see the sketch after this list).
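In code, this flow might look roughly like the sketch below. The class and method names (RAGConnector, run, score) are illustrative placeholders rather than the actual open-rag-eval API; the point is only to show how queries, a connector, and a set of evaluators fit together.

```python
# Illustrative sketch of the OpenRAG eval flow described above.
# NOTE: the class and method names here are hypothetical placeholders,
# not the real open-rag-eval API.

from typing import Dict, List


class RAGConnector:
    """Hypothetical connector: given a query, return the retrieved
    chunks and the generated answer from a RAG pipeline."""

    def run(self, query: str) -> Dict:
        raise NotImplementedError


def evaluate(queries: List[str], connector: RAGConnector, evaluators: List) -> List[Dict]:
    """Send each query through the RAG pipeline via the connector,
    then score the output with every evaluator (retrieval, generation, ...)."""
    results = []
    for query in queries:
        rag_output = connector.run(query)  # retrieved chunks + generated answer
        scores = {}
        for evaluator in evaluators:
            scores.update(evaluator.score(query, rag_output))  # hypothetical evaluator interface
        results.append({"query": query, "output": rag_output, "scores": scores})
    return results  # later serialized into the RAG evaluation files
```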

Metrics for Evaluation without Golden Answers

A key innovation of OpenRAG eval is its ability to perform RAG evaluation without golden answers [00:01:37]. This is achieved through several specific metrics:

  • UMBRELA (Retrieval Metric):

    • This metric allows for retrieval evaluation without requiring golden chunks [00:01:44].
    • It assigns a score between 0 and 3 to a chunk, where 0 means the chunk has nothing to do with the query, and 3 means it’s dedicated to the query and contains the exact answer [00:02:13].
    • Research from the University of Waterloo shows that this approach correlates well with human judgment [00:02:32] (see the grading sketch after this list).
  • AutoNuggetizer (Generation Metric):

    • This metric evaluates generation without requiring golden answers [00:02:54].
    • It involves three steps [00:02:58]:
      1. Nugget Creation: Atomic units called “nuggets” are created [00:03:00].
      2. Nugget Rating: Each nugget is assigned a “vital” or “okay” rating, with the top 20 nuggets typically selected [00:03:07].
      3. LLM Judge Analysis: An LLM judge analyzes the RAG response to determine whether each selected nugget is fully or partially supported by the answer [00:03:15] (a coverage-scoring sketch follows this list).
  • Citation Faithfulness:

    • Measures citation fidelity: whether each citation in the response is fully supported, partially supported, or not supported by the cited source [00:03:30].
  • Hallucination Detection:

    • Utilizes Vectara’s Hughes Hallucination Evaluation Model (HHEM) to check whether the entire response is grounded in the retrieved content [00:03:45].
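As a concrete illustration of the UMBRELA-style grading, the sketch below asks an LLM to assign a 0 to 3 grade to a single retrieved chunk. The prompt wording, the gpt-4o-mini model name, and the grade_chunk helper are assumptions for illustration, not the exact prompt or code used by open-rag-eval.

```python
# Sketch of an UMBRELA-style judgment: an LLM grades each retrieved
# chunk on a 0-3 scale, so no golden chunks are needed.
# The prompt wording and model name are illustrative assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADING_PROMPT = """Given a query and a passage, grade the passage from 0 to 3:
0 = the passage has nothing to do with the query
1 = the passage is related to the query but does not answer it
2 = the passage partially answers the query
3 = the passage is dedicated to the query and contains the exact answer
Reply with a single digit.

Query: {query}
Passage: {passage}
Grade:"""


def grade_chunk(query: str, passage: str, model: str = "gpt-4o-mini") -> int:
    """Return the 0-3 relevance grade the LLM judge assigns to one chunk."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": GRADING_PROMPT.format(query=query, passage=passage)}],
    )
    return int(response.choices[0].message.content.strip()[0])
```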
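The AutoNuggetizer judgments then reduce to a coverage calculation over the selected nuggets. This is a minimal sketch assuming full support earns a credit of 1.0 and partial support 0.5; the exact weighting used by open-rag-eval may differ.

```python
# Sketch of turning nugget judgments into a generation score.
# The credit values (full = 1.0, partial = 0.5) are illustrative
# assumptions, not the exact open-rag-eval formula.

from dataclasses import dataclass
from typing import List

SUPPORT_CREDIT = {"full": 1.0, "partial": 0.5, "none": 0.0}


@dataclass
class Nugget:
    text: str
    importance: str  # "vital" or "okay"
    support: str     # "full", "partial", or "none" (assigned by the LLM judge)


def nugget_coverage(nuggets: List[Nugget], vital_only: bool = False) -> float:
    """Average credit the answer earns over the selected nuggets."""
    selected = [n for n in nuggets if n.importance == "vital"] if vital_only else nuggets
    if not selected:
        return 0.0
    return sum(SUPPORT_CREDIT[n.support] for n in selected) / len(selected)


# Example with two vital nuggets and one okay nugget:
nuggets = [
    Nugget("The Eiffel Tower is in Paris.", "vital", "full"),
    Nugget("It was completed in 1889.", "vital", "partial"),
    Nugget("It is about 330 m tall.", "okay", "none"),
]
print(nugget_coverage(nuggets))                   # (1.0 + 0.5 + 0.0) / 3 = 0.5
print(nugget_coverage(nuggets, vital_only=True))  # (1.0 + 0.5) / 2 = 0.75
```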

User Interface

OpenRAG eval provides a user interface for analyzing evaluation results [00:03:58]. Users can drag and drop the generated evaluation files onto open evaluation.ai to see each query alongside its retrieval score and the various generation scores [00:04:06].

Benefits and Contributions

OpenRAG eval is a powerful package for optimizing and tuning a RAG pipeline [00:04:29]. Being open source, it offers transparency into how the metrics work [00:04:34]. While it ships with connectors for Vectara, LangChain, and LlamaIndex, contributions of connectors for other RAG pipelines are welcome [00:04:44].