From: aidotengineer

Traditional Retrieval Augmented Generation (RAG) systems often struggle with complex, aggregative questions that require information from multiple sources or a structured understanding of data [03:42:00]. Conventional RAG pipelines, which typically retrieve the top-k chunks and attempt to compile an answer from them, perform poorly on such queries [04:04:00], [04:08:00]. This shortfall helps explain the disconnect between high scores on flawed benchmarks and actual performance on user data [02:37:00], [02:53:00].

Limitations of Traditional RAG for Complex Queries

Many real-world questions, especially in domains like financial data, are aggregative [03:30:00], [03:42:00]. Examples include:

  • “Which company has reported the highest quarterly revenue the most times?” [03:44:00]
  • “In how many fiscal years did Apple exceed $100 billion in annual revenue?” [03:49:00]
  • “List all Fortune 500 companies.” [04:13:00]

Traditional RAG systems find these challenging because they rely on retrieving a limited number of chunks, which might not contain all the information needed for a complete aggregate answer [04:04:00], [04:19:00]. On a small corpus of FIFA World Cup historical pages, for instance, common RAG pipelines answered only 5-11% of such questions correctly [04:43:00], [04:46:00], [05:27:00], [05:30:00].

Structured RAG with SQL

A proposed solution to these challenges is to convert unstructured data into a structured format, such as a database, and then process queries on top of that structured data [05:49:00], [05:52:00], [05:54:00]. The nature of these aggregative questions (counting, max/min, calculations) is inherently SQL-like [05:59:00], [06:02:00], [06:05:00].
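As a rough sketch of that SQL-like nature (the table and column names below are assumed for illustration, not taken from the talk), two of the financial questions above reduce to single aggregate queries once the filings are in a database:

```python
# Hypothetical schemas: annual_financials(company, fiscal_year, revenue_usd)
# and quarterly_financials(company, fiscal_year, fiscal_quarter, revenue_usd).

# "In how many fiscal years did Apple exceed $100 billion in annual revenue?"
APPLE_100B_YEARS = """
SELECT COUNT(*)
FROM annual_financials
WHERE company = 'Apple' AND revenue_usd > 100e9;
"""

# "Which company has reported the highest quarterly revenue the most times?"
# (interpreted here as: in how many quarters was the company the top earner)
MOST_RECORD_QUARTERS = """
SELECT company
FROM quarterly_financials AS q
WHERE revenue_usd = (
    SELECT MAX(revenue_usd)
    FROM quarterly_financials
    WHERE fiscal_year = q.fiscal_year AND fiscal_quarter = q.fiscal_quarter
)
GROUP BY company
ORDER BY COUNT(*) DESC
LIMIT 1;
"""
```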

This approach splits the process into two main phases:

Ingestion Phase

This phase involves investing compute to structure the data [06:33:00]:

  1. Document Clustering: Documents are grouped into sub-corpuses (e.g., financial data, FIFA World Cup data) [06:40:00], [06:43:00], [06:45:00].
  2. Schema Identification: For each sub-corpus, a relevant schema is identified [06:51:00], [06:53:00].
  3. Schema Population: An LLM-based pipeline populates the identified schema using data from each document [06:56:00], [07:00:00], [07:29:00]. For the FIFA World Cup corpus, a schema might include attributes like year, winner, top three teams, and top scorer [07:09:00], [07:12:00], [07:16:00], [07:18:00].
  4. Database Upload: The populated schema and data are then uploaded into an SQL database [07:02:00], [07:06:00].
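A minimal sketch of this ingestion flow, assuming the documents have already been clustered into a FIFA World Cup sub-corpus and that `llm.complete` stands in for whatever LLM client performs the extraction; the exact column names are illustrative, not from the talk:

```python
import json
import sqlite3

# Illustrative schema for the FIFA World Cup sub-corpus (step 2).
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS world_cups (
    year        INTEGER PRIMARY KEY,
    winner      TEXT,
    runner_up   TEXT,
    third_place TEXT,
    top_scorer  TEXT
);
"""

EXTRACTION_PROMPT = (
    "From the document below, extract the World Cup year, winner, runner-up, "
    "third-place team, and top scorer. Reply with a JSON object whose keys are "
    "year, winner, runner_up, third_place, top_scorer.\n\nDocument:\n{doc}"
)

def ingest(documents: list[str], llm, db_path: str = "structured_rag.db") -> None:
    """Populate the schema from one sub-corpus and upload it to SQLite (steps 3-4)."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA_SQL)
    for doc in documents:
        # Step 3: the LLM fills one row of the schema per document.
        record = json.loads(llm.complete(EXTRACTION_PROMPT.format(doc=doc)))
        conn.execute(
            "INSERT OR REPLACE INTO world_cups "
            "VALUES (:year, :winner, :runner_up, :third_place, :top_scorer)",
            record,
        )
    conn.commit()
    conn.close()
```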

Inference Phase

This phase focuses on quick query processing [06:35:00]:

  1. Schema Identification for Query: When a query is received, the relevant schema (e.g., financial or FIFA) is identified [07:34:00], [07:36:00], [07:39:00].
  2. Text-to-SQL Conversion: A standard text-to-SQL process is performed on the identified schema within the SQL database [07:44:00].
  3. Answer Retrieval: The final answer is returned directly from the SQL query result [07:47:00], [07:50:00]. For example, “Which team has won the FIFA World Cup the most times?” becomes a simple SQL query [07:52:00], [07:54:00], [07:56:00].
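A matching sketch of the inference side, under the same assumptions (a generic `llm` client and the illustrative `world_cups` / `annual_financials` tables from the earlier snippets); the schema-routing prompt and helper here are hypothetical:

```python
import sqlite3

# Illustrative per-sub-corpus schemas (must match what ingestion created).
SCHEMAS = {
    "fifa_world_cup": "world_cups(year, winner, runner_up, third_place, top_scorer)",
    "financials": "annual_financials(company, fiscal_year, revenue_usd)",
}

def identify_schema(question: str, llm) -> str:
    """Step 1: ask the LLM which sub-corpus schema the question refers to."""
    choice = llm.complete(
        f"Which of these schemas does the question '{question}' refer to? "
        f"Options: {', '.join(SCHEMAS)}. Reply with the schema name only."
    ).strip()
    return choice if choice in SCHEMAS else "fifa_world_cup"

def answer(question: str, llm, db_path: str = "structured_rag.db"):
    schema_name = identify_schema(question, llm)
    # Step 2: standard text-to-SQL, restricted to the identified schema.
    sql = llm.complete(
        f"Given the table {SCHEMAS[schema_name]}, write one SQLite query that "
        f"answers: {question}. Return only the SQL."
    )
    # Step 3: the query result is returned directly as the answer.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()

# "Which team has won the FIFA World Cup the most times?" would then compile to
# something like:
#   SELECT winner, COUNT(*) AS wins FROM world_cups
#   GROUP BY winner ORDER BY wins DESC LIMIT 1;
```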

Limitations of the Structured RAG Approach

While effective for certain use cases, this approach does not solve all RAG challenges [08:03:00], [08:06:00].

Conclusion

RAG is not a one-size-fits-all solution, and different clients and data types require tailored approaches [10:01:00], [10:03:00]. For specific settings involving structured data or aggregative queries, going beyond the standard RAG pipeline (chunking, embedding, retrieving, reranking) to incorporate SQL databases can significantly improve performance and address existing benchmark failures [10:16:00], [10:23:00], [10:25:00], [10:37:00].