From: aidotengineer

Current RAG (Retrieval Augmented Generation) evaluation systems often struggle with real-world data, which is inherently messier and less standardized than the data used in most benchmarks [02:20:00]. This creates a vicious cycle: RAG systems are optimized for flawed benchmarks, perform well on those benchmarks, and then struggle with actual user or customer data [02:33:00]. One significant aspect of this challenge is normalization: the issues that arise when attempting to structure unstructured data for RAG systems.

Limitations of Current Benchmarks and Standard RAG

Most existing RAG benchmarks are limited, often focusing on retrieval-only or generation-only tasks [01:35:00]. They frequently assume that answers lie within specific chunks of data [00:44:00], and often use “multi-hop” questions that are not realistic [01:01:00]. They fail to test the entire system holistically, overlooking aspects like chunking, parsing, and cases where the answer is not found in a single, local segment [02:00:00].

Standard RAG pipelines, which typically involve chunking, embedding, retrieving, and reranking, are often insufficient for complex questions, especially those requiring aggregation or analysis across multiple data points [10:16:00]. For example, questions like “which company has reported the highest quarterly revenue the most times?” [03:42:00] or “how many fiscal years did Apple exceed a hundred billion in annual revenue?” [03:49:00] cause current RAG systems to struggle, because they are limited to grabbing the top-k chunks and compiling an answer from those [04:02:00].
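To make the failure mode concrete, below is a minimal sketch of the top-k retrieval step in such a pipeline. The `embed` function is a toy placeholder and the helper names are illustrative assumptions, not any particular implementation; the point is only that the LLM sees just k chunks, so a question whose answer depends on every document in the corpus cannot be answered from the retrieved context.

```python
# Minimal sketch of top-k retrieval in a standard RAG pipeline.
# embed() is a toy placeholder; a real system would call an embedding model.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # Toy 2-dimensional "embedding" just to keep the sketch runnable.
    return np.array([[len(t), sum(map(ord, t)) % 1000] for t in texts], dtype=float)

def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed([query])[0]
    c = embed(chunks)
    scores = c @ q / (np.linalg.norm(c, axis=1) * np.linalg.norm(q) + 1e-9)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

# 20 tournament summaries exist, but only 3 reach the LLM, so a corpus-wide
# count such as "most titles" cannot be computed reliably from the context.
chunks = [f"The {1930 + 4 * i} FIFA World Cup was won by ..." for i in range(20)]
context = retrieve_top_k("Which team has won the World Cup the most times?", chunks)
```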

An experiment with FIFA World Cup historical data showed that common RAG pipelines achieved only 5% to 11% accuracy on such questions [05:30:00].

Structuring Unstructured Data for Better RAG

To address these limitations, one approach is to convert unstructured corpuses into structured data, then pose questions on top of this structured data [05:49:00]. This is particularly effective for questions that are essentially SQL-like queries, involving counting, max/min calculations, or other aggregations [05:59:00].

This process involves the following (code sketches for both phases follow below):

  1. Ingestion Phase:
    • Clustering documents into sub-corpuses (e.g., financial data, FIFA World Cup data) [06:40:00].
    • Identifying a schema that represents each sub-corpus [06:51:00].
    • Populating the schema using an LLM pipeline for every document [07:00:00].
    • Uploading the results into an SQL database [07:02:00].
  2. Inference Flow:
    • Identifying the relevant schema for a given query (e.g., financial or FIFA World Cup) [07:34:00].
    • Performing a text-to-SQL conversion over the structured data to return the final answer [07:44:00].
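As a rough sketch of the ingestion phase, the snippet below runs one LLM extraction call per document and loads the results into a SQL table. The World Cup schema, the extraction prompt, and the `extract_record` helper are illustrative assumptions, not the exact pipeline described in the talk.

```python
# Sketch of the ingestion phase: LLM-based schema population into SQLite.
import json
import sqlite3

SCHEMA = ["year", "host_country", "winner", "runner_up", "top_scorer"]

def extract_record(document: str, llm) -> dict:
    """Ask the LLM to populate the schema fields for a single document."""
    prompt = (
        "Extract the following fields as JSON "
        f"({', '.join(SCHEMA)}) from this World Cup report:\n\n{document}"
    )
    return json.loads(llm(prompt))

def ingest(documents: list[str], llm, db_path: str = "worldcup.db") -> None:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS world_cup "
        "(year INTEGER, host_country TEXT, winner TEXT, runner_up TEXT, top_scorer TEXT)"
    )
    for doc in documents:
        rec = extract_record(doc, llm)
        conn.execute(
            "INSERT INTO world_cup VALUES (?, ?, ?, ?, ?)",
            tuple(rec.get(col) for col in SCHEMA),
        )
    conn.commit()
    conn.close()
```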

For instance, with the FIFA World Cup corpus, a schema could include attributes like year, winner, top three teams, and top scorer [07:09:00]. A question like “which team has won the FIFA World Cup the most times?” [07:52:00] would translate into a simple SQL query [07:56:00].
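A corresponding sketch of the inference flow, again with illustrative names: the query is routed to the World Cup schema, an LLM performs the text-to-SQL step, and the database computes the aggregation that top-k retrieval could not.

```python
# Sketch of the inference flow: schema routing, text-to-SQL, execution.
# The keyword router and prompt are illustrative assumptions.
import sqlite3

SCHEMA_DESC = "world_cup(year, host_country, winner, runner_up, top_scorer)"

def answer(question: str, llm, db_path: str = "worldcup.db"):
    # 1. Identify the relevant schema; a trivial keyword check stands in
    #    for a proper classifier here.
    if "world cup" not in question.lower():
        raise ValueError("Question does not match the World Cup schema")
    # 2. Text-to-SQL over the structured data.
    sql = llm(f"Schema: {SCHEMA_DESC}\nWrite one SQLite query answering: {question}")
    # For "which team has won the FIFA World Cup the most times?" a reasonable
    # model would emit roughly:
    #   SELECT winner, COUNT(*) AS titles FROM world_cup
    #   GROUP BY winner ORDER BY titles DESC LIMIT 1;
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.close()
```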

Normalization Challenges and Other Limitations

Despite the benefits of this structured approach, it does not solve all problems in RAG evaluation and system design [08:03:00]. Significant challenges arise, particularly concerning normalization:

Schema Inconsistencies and Homogeneity

Not every corpus or query is suitable for a relational database model [08:06:00]. Furthermore, not every corpus is homogeneous in terms of its attributes, and an underlying schema may not always be easily discernible [08:12:00].

Normalization Difficulties

Even with seemingly straightforward examples like the FIFA World Cup data, building the correct schema can be a struggle due to normalization issues [08:31:00]:

  • Geographical/Historical Variations: For a “host country” attribute, “West Germany” might appear, but a user could ask about “Germany” [08:42:00]. Determining whether “West Germany” should count as “Germany” for aggregation is a normalization challenge.
  • Singular vs. List Attributes: The “host country” might sometimes be a singular entity, but other times it could be a list (e.g., “South Korea and Japan” hosted together) [08:50:00]. This requires careful consideration during schema design and data population.

These normalization issues impact both the ingestion and inference phases of the system [09:00:00].
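One way to make these decisions explicit during ingestion is to normalize values before they reach the database. The sketch below is only one possible shape for such a step: the alias map, the splitting rule, and the policy of folding “West Germany” into “Germany” are assumptions and design choices, not something the data dictates.

```python
# Sketch of value normalization at ingestion time: historical aliases are
# mapped to canonical names, and a list-valued host field is split so that
# each co-host becomes its own value.
COUNTRY_ALIASES = {"West Germany": "Germany"}  # policy choice, not a given

def normalize_country(name: str) -> str:
    return COUNTRY_ALIASES.get(name.strip(), name.strip())

def normalize_hosts(raw_host: str) -> list[str]:
    # "South Korea and Japan" -> ["South Korea", "Japan"]
    parts = [p for chunk in raw_host.split(" and ") for p in chunk.split(",")]
    return [normalize_country(p) for p in parts if p.strip()]

assert normalize_country("West Germany") == "Germany"
assert normalize_hosts("South Korea and Japan") == ["South Korea", "Japan"]
```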

Ambiguity and LLM Behavior

Ambiguity in user queries can also pose problems [09:03:00]. If a user asks a question not directly related to the current schema (e.g., “did Real Madrid win the 2006 final?” in a World Cup context), LLMs tend to “please the user” by attempting to answer even when the data isn’t directly relevant or complete [09:07:00].
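A simple guard against this behavior is to ask, before running text-to-SQL, whether the question is answerable from the schema at all, and to decline rather than guess. The sketch below is an illustrative assumption, not the speaker's method; the prompt and schema description are hypothetical.

```python
# Sketch of an answerability check before text-to-SQL, to avoid guessing.
SCHEMA_DESC = "world_cup(year, host_country, winner, runner_up, top_scorer)"

def is_answerable(question: str, llm) -> bool:
    verdict = llm(
        f"Schema: {SCHEMA_DESC}\n"
        f"Question: {question}\n"
        "Answer strictly YES or NO: can this question be answered from the schema alone?"
    )
    return verdict.strip().upper().startswith("YES")

# "Did Real Madrid win the 2006 final?" concerns a club, not a national team,
# so a well-behaved check should return False and the system should decline.
```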

Complexity and Compute Trade-offs

There is a trade-off between the complexity and granularity of clustering and schema inference during the ingestion phase and the amount of compute power required [09:31:00]. Additionally, transforming text to SQL queries becomes increasingly complex as the schemas themselves become more intricate [09:48:00].

Conclusion

RAG evaluation highlights that RAG is not a one-size-fits-all solution; each client and dataset needs separate consideration [10:01:00]. Standard RAG pipelines and existing benchmarks fail to capture the nuances of complex use cases, particularly those involving aggregation and requiring structured understanding of data [10:23:00]. To overcome these challenges, especially normalization issues, it’s often necessary to go beyond standard RAG approaches for specific settings [10:35:00].