From: aidotengineer
Traditional Retrieval Augmented Generation (RAG) systems often struggle with complex, aggregative questions that require information from multiple sources or a structured understanding of data [03:42:00]. Conventional RAG pipelines, which typically grab the top-k chunks and attempt to compile an answer, perform poorly on such queries [04:04:00], [04:08:00]. This limitation leads to a disconnect between high scores on flawed benchmarks and actual performance with user data [02:37:00], [02:53:00].
Limitations of Traditional RAG for Complex Queries
Many real-world questions, especially in domains like financial data, are aggregative [03:30:00], [03:42:00]. Examples include:
- “Which company has reported the highest quarterly revenue the most times?” [03:44:00]
- “In how many fiscal years did Apple exceed $100 billion in annual revenue?” [03:49:00]
- “List all Fortune 500 companies.” [04:13:00]
Traditional RAG systems find these challenging because they rely on retrieving a limited number of chunks, which might not contain all the necessary information for a complete aggregate answer [04:04:00], [04:19:00]. A small corpus of FIFA World Cup historical pages, for instance, showed common RAG pipelines answering only 5-11% of such questions correctly [04:43:00], [04:46:00], [05:27:00], [05:30:00].
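The coverage gap can be seen with a toy sketch: an aggregative question needs evidence from every document in the sub-corpus, but top-k retrieval exposes only k chunks to the generator. The document names and the cutoff below are illustrative assumptions, not figures from the talk.

```python
# One "document" per World Cup edition through 2022 (no tournaments in
# 1942 and 1946), standing in for the FIFA historical pages corpus.
documents = [
    f"World Cup {year} summary page"
    for year in range(1930, 2023, 4)
    if year not in (1942, 1946)
]

TOP_K = 5  # a typical retrieval cutoff (hypothetical)

# Stand-in for the similarity-ranked top-k chunks a RAG pipeline retrieves.
retrieved = documents[:TOP_K]

# A question like "In how many editions did Brazil win?" requires all
# editions, but the generator only ever sees TOP_K of them.
coverage = len(retrieved) / len(documents)
print(f"{len(retrieved)}/{len(documents)} editions visible ({coverage:.0%})")
```

No matter how good the ranking is, the answer to a corpus-wide count cannot be assembled from a fixed top-k window.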
Structured RAG with SQL
A proposed solution to these challenges is to convert unstructured data into a structured format, such as a database, and then process queries on top of this structured data [05:49:00], [05:52:00], [05:54:00]. The nature of these aggregative questions (counting, max/min, calculations) is inherently SQL-like [05:59:00], [06:02:00], [06:05:00].
This approach splits the process into two main phases:
Ingestion Phase
This phase involves investing compute to structure the data [06:33:00]:
- Document Clustering: Documents are grouped into sub-corpuses (e.g., financial data, FIFA World Cup data) [06:40:00], [06:43:00], [06:45:00].
- Schema Identification: For each sub-corpus, a relevant schema is identified [06:51:00], [06:53:00].
- Schema Population: An LLM-based pipeline populates the identified schema using data from each document [06:56:00], [07:00:00], [07:29:00]. For the FIFA World Cup corpus, a schema might include attributes like year, winner, top three teams, and top scorer [07:09:00], [07:12:00], [07:16:00], [07:18:00].
- Database Upload: The populated schema and data are then uploaded into an SQL database [07:02:00], [07:06:00].
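The ingestion steps above can be sketched end-to-end with SQLite. The schema follows the FIFA World Cup example from the talk (the "top three teams" attribute is elided); the hand-written rows are a stand-in for the LLM-based extraction step, which in a real pipeline would read each document and emit one record.

```python
import sqlite3

# Schema identified for the FIFA World Cup sub-corpus.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE world_cup (
        year INTEGER PRIMARY KEY,
        winner TEXT NOT NULL,
        top_scorer TEXT
    )
""")

# Schema population: in the real pipeline an LLM extracts one record per
# document; these three rows stand in for that extraction output.
extracted_rows = [
    (1998, "France", "Davor Suker"),
    (2002, "Brazil", "Ronaldo"),
    (2018, "France", "Harry Kane"),
]
conn.executemany("INSERT INTO world_cup VALUES (?, ?, ?)", extracted_rows)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM world_cup").fetchone()[0])  # 3
```

The compute cost lives entirely in this phase: clustering, schema design, and per-document extraction all happen before any query arrives.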
Inference Phase
This phase focuses on quick query processing [06:35:00]:
- Schema Identification for Query: When a query is received, the relevant schema (e.g., financial or FIFA) is identified [07:34:00], [07:36:00], [07:39:00].
- Text-to-SQL Conversion: A standard text-to-SQL process is performed on the identified schema within the SQL database [07:44:00].
- Answer Retrieval: The final answer is returned directly from the SQL query result [07:47:00], [07:50:00]. For example, “Which team has won the FIFA World Cup the most times?” becomes a simple SQL query [07:52:00], [07:54:00], [07:56:00].
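The inference steps can be sketched with the same SQLite setup. The text-to-SQL translation is shown as a hard-coded query rather than a live LLM call, and the table contents are illustrative; in a full system an LLM would generate the SQL against the identified schema.

```python
import sqlite3

# Minimal populated schema from the ingestion phase (illustrative rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world_cup (year INTEGER PRIMARY KEY, winner TEXT)")
conn.executemany(
    "INSERT INTO world_cup VALUES (?, ?)",
    [(1994, "Brazil"), (1998, "France"), (2002, "Brazil"),
     (2006, "Italy"), (2022, "Argentina")],
)

# Text-to-SQL: an LLM would translate the question into SQL here;
# the translation is hard-coded for this sketch.
question = "Which team has won the FIFA World Cup the most times?"
sql = """
    SELECT winner, COUNT(*) AS wins
    FROM world_cup
    GROUP BY winner
    ORDER BY wins DESC
    LIMIT 1
"""

# Answer retrieval: the result comes straight from the database.
winner, wins = conn.execute(sql).fetchone()
print(winner, wins)  # Brazil 2
```

Because the aggregation runs in the database, the answer is exact over the whole corpus rather than compiled from whichever chunks happened to be retrieved.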
Limitations of the Structured RAG Approach
While effective for certain use cases, this approach does not solve all RAG challenges [08:03:00], [08:06:00]:
- Corpus Suitability: Not every corpus or query is suitable for a relational database model [08:06:00], [08:09:00]. Not all data is homogeneous, and not every corpus has an underlying schema to extract [08:12:00], [08:16:00], [08:19:00].
- Normalization Issues: Even with simple examples like the FIFA World Cup data, building the correct schema can be challenging [08:31:00], [08:34:00]. Questions arise about how to normalize data points (e.g., “West Germany” vs. “Germany” for host country) or handle multi-valued attributes (e.g., “South Korea and Japan” co-hosting) [08:42:00], [08:45:00], [08:47:00], [08:50:00], [08:52:00], [08:54:00].
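One way to handle both issues is a canonicalization map applied during schema population, plus a separate table for multi-valued attributes instead of a single host column. This is an assumption for illustration, not a solution proposed in the talk.

```python
import sqlite3

# Hypothetical canonicalization map applied during schema population.
CANONICAL = {"West Germany": "Germany"}

def canon(name: str) -> str:
    """Map a raw extracted name to its canonical form."""
    return CANONICAL.get(name, name)

conn = sqlite3.connect(":memory:")
# The multi-valued "host" attribute gets its own table, one row per
# host country, so co-hosted editions need no special casing.
conn.executescript("""
    CREATE TABLE world_cup (year INTEGER PRIMARY KEY, winner TEXT);
    CREATE TABLE host (year INTEGER, country TEXT);
""")
conn.execute("INSERT INTO world_cup VALUES (1974, ?)", (canon("West Germany"),))
conn.executemany("INSERT INTO host VALUES (?, ?)",
                 [(2002, "South Korea"), (2002, "Japan")])

print(conn.execute("SELECT winner FROM world_cup WHERE year=1974").fetchone()[0])  # Germany
print(conn.execute("SELECT COUNT(*) FROM host WHERE year=2002").fetchone()[0])     # 2
```

Whether "West Germany" *should* collapse to "Germany" is itself a modeling decision the pipeline has to make, which is exactly the point of this limitation.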
- Ambiguity: LLMs may try to answer questions even if they don’t directly fit the schema, leading to potentially incorrect or irrelevant responses [09:07:00], [09:10:00], [09:23:00].
- Trade-offs: There are trade-offs between complexity, granularity, and computational investment during schema inference and clustering in the ingestion phase [09:31:00], [09:33:00], [09:36:00].
- Text-to-SQL Complexity: Text-to-SQL remains a complex issue, especially with intricate schemas [09:48:00], [09:50:00].
Conclusion
RAG is not a one-size-fits-all solution, and different clients and data types require tailored approaches [10:01:00], [10:03:00]. For specific settings involving structured data or aggregative queries, going beyond the standard RAG pipeline (chunking, embedding, retrieving, reranking) to incorporate SQL databases can significantly improve performance and address existing benchmark failures [10:16:00], [10:23:00], [10:25:00], [10:37:00].