From: aidotengineer

Although Retrieval-Augmented Generation (RAG) systems are widespread, their evaluation is effectively “broken”: performance in real-world scenarios often falls short of expectations [00:00:20]. Existing benchmarks, typically built around “local questions that have local answers” [00:00:39], fail to represent the messier, more diverse nature of real-world data [00:02:22].

Traditional RAG benchmarks tend to evaluate retrieval or generation in isolation, neglecting crucial steps such as chunking and parsing, as well as questions whose answers are not confined to a single data point [00:01:31]. This creates a vicious cycle: systems are optimized for flawed benchmarks, score well on them, and then struggle on real customer data [00:02:33].

Addressing Aggregative Questions with Structured Data

To overcome the limitations of current RAG benchmarks and improve performance on complex queries, particularly aggregative questions, the proposed solution converts unstructured corpora into structured data [00:05:49]. Aggregative questions are queries such as “which company has reported the highest quarterly revenue the most times” or “in how many fiscal years did Apple exceed a hundred billion dollars in annual revenue?” [00:04:40]. Common RAG pipelines fail badly on such questions, answering only 5–11% of them correctly in tests [00:05:30].

The core idea is to treat these queries as SQL questions [00:06:02]; traditional RAG struggles with them because it retrieves only a limited number of “top k” chunks, which cannot cover an aggregation that spans the whole corpus [00:04:04].

The Two-Phase Approach

Like a regular RAG flow, this structured-data approach splits the process into two phases, investing compute at ingestion time so that inference is quick [00:06:27]:

Ingestion Phase

  1. Document Clustering: Documents are first clustered into sub-corpora (e.g., financial corpus, FIFA World Cup corpus) [00:06:40].
  2. Schema Identification: For each sub-corpus, a relevant schema representing the data is identified [00:06:51].
  3. Schema Population: An LLM-based pipeline populates the schema from each document [00:07:00].
  4. Database Upload: The results are then uploaded into a SQL database [00:07:02].
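
A minimal sketch of the ingestion side is shown below. It assumes a hypothetical call_llm helper and a hand-written world_cup schema; the field names and prompt are illustrative, not the talk’s actual pipeline, and clustering plus schema identification (steps 1–2) are assumed to have already assigned each document to the FIFA World Cup sub-corpus:

```python
import json
import sqlite3

# Hypothetical LLM call; substitute any chat-completion client here.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client")

# Illustrative schema for the FIFA World Cup sub-corpus: one row per tournament.
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS world_cup (
    year INTEGER PRIMARY KEY,
    host TEXT,
    winner TEXT,
    runner_up TEXT
);
"""

EXTRACTION_PROMPT = """Extract the fields year, host, winner, runner_up
from the document below and return them as a JSON object.
Use null for anything the document does not state.

Document:
{document}
"""

def ingest(documents: list[str], db_path: str = "corpus.db") -> None:
    """Populate the schema from each document and upload the rows to SQLite."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA_SQL)
    for doc in documents:
        # Step 3: LLM-based schema population for a single document.
        record = json.loads(call_llm(EXTRACTION_PROMPT.format(document=doc)))
        # Step 4: upload the structured row into the SQL database.
        conn.execute(
            "INSERT OR REPLACE INTO world_cup (year, host, winner, runner_up) "
            "VALUES (?, ?, ?, ?)",
            (record["year"], record["host"], record["winner"], record["runner_up"]),
        )
    conn.commit()
    conn.close()
```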

Inference Phase

  1. Schema Identification: When a query is received, the relevant schema (e.g., financial or FIFA World Cup) is identified [00:07:34].
  2. Text-to-SQL Conversion: A text-to-SQL conversion is performed over the structured data [00:07:44].
  3. Answer Return: The final answer is returned based on the SQL query result [00:07:47].
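
A matching sketch of the inference side, reusing the hypothetical call_llm stub from the ingestion sketch; the routing prompt and schema descriptions are assumptions, not the talk’s implementation:

```python
import sqlite3

# call_llm is the same hypothetical stub defined in the ingestion sketch.
SCHEMAS = {
    "world_cup": "world_cup(year, host, winner, runner_up)",
    "financials": "financials(company, fiscal_year, quarter, revenue)",
}

def answer(question: str, db_path: str = "corpus.db") -> str:
    # Step 1: identify which schema (sub-corpus) the question belongs to.
    schema_name = call_llm(
        f"Which of these schemas can answer the question: {list(SCHEMAS)}?\n"
        f"Question: {question}\nReturn only the schema name."
    ).strip()
    # Step 2: text-to-SQL over the chosen schema.
    sql = call_llm(
        f"Write a SQLite query over {SCHEMAS[schema_name]} that answers: {question}\n"
        "Return only the SQL."
    )
    # Step 3: execute the query and return its result as the answer.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(sql).fetchall()
    finally:
        conn.close()
    return str(rows)
```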

For example, to answer “Which team has won the FIFA World Cup the most times?”, a simple SQL query is executed against the populated database [00:07:52].
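
Against the illustrative world_cup table above, that query could be as simple as the following (the schema and the printed result are assumptions for the sketch, not the talk’s actual database):

```python
import sqlite3

conn = sqlite3.connect("corpus.db")
row = conn.execute(
    """
    SELECT winner, COUNT(*) AS titles
    FROM world_cup
    GROUP BY winner
    ORDER BY titles DESC
    LIMIT 1
    """
).fetchone()
print(row)  # e.g. ('Brazil', 5) once the table has been populated
conn.close()
```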

Limitations of the Structured Approach

While effective for certain types of queries, this approach does not solve all RAG problems [00:08:03]. Key challenges include:

  • Corpus Heterogeneity: Not every corpus or query is suitable for a relational database model, nor is every corpus homogeneous enough to easily derive an underlying schema [00:08:06].
  • Normalization Issues: Even in a simple example like the FIFA World Cup corpus, normalization problems arise, such as “West Germany” vs. “Germany” as country values, or a single host versus co-hosts such as South Korea and Japan, which forces a choice between a singular attribute and a list [00:08:31]; one possible schema-level workaround is sketched after this list.
  • Ambiguity and Abstention: LLMs may attempt to answer a query instead of abstaining even when the question is irrelevant to the schema (e.g., a question about Real Madrid posed against the World Cup schema, which only covers national teams) [00:09:07].
  • Trade-offs in Ingestion: There is a trade-off between how fine-grained clustering and schema inference are during ingestion and how much compute that ingestion requires [00:09:31].
  • Complex Text-to-SQL: As schemas become more complex, the text-to-SQL conversion itself becomes a significant challenge [00:09:48].
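
One possible way to sidestep the singular-vs-list host problem, offered as a sketch rather than the talk’s solution, is to normalize hosts into a separate table so that a co-hosted edition simply becomes two rows (table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect("corpus.db")
conn.executescript(
    """
    -- One row per tournament; the host column moves out of the main table.
    CREATE TABLE IF NOT EXISTS tournament (
        year INTEGER PRIMARY KEY,
        winner TEXT
    );
    -- One row per (tournament, host) pair, so co-hosts need no list-valued column.
    CREATE TABLE IF NOT EXISTS tournament_host (
        year INTEGER REFERENCES tournament(year),
        host TEXT
    );
    """
)
# A co-hosted edition is stored as two rows instead of one list-valued field.
conn.executemany(
    "INSERT INTO tournament_host (year, host) VALUES (?, ?)",
    [(2002, "South Korea"), (2002, "Japan")],
)
conn.commit()
conn.close()
```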

Key Takeaways for Improving RAG Systems

Ultimately, RAG is not a one-size-fits-all solution; each client or use case may require a tailored approach [00:10:01]. The standard pipeline of chunking, embedding, retrieving, and reranking is insufficient for many types of questions [00:10:10], and existing benchmarks fail to capture the nuances of these real-world use cases [00:10:23]. Improving RAG performance in a specific setting therefore often requires going beyond the standard pipeline [00:10:35].