From: aidotengineer

Wasim, co-founder and CTO of Writer, details the company’s journey and their approach to evaluating Large Language Models (LLMs) in real-world scenarios [00:00:17]. Writer, founded in 2020, began by building encoder-decoder models and has since grown that work into a family of approximately 16 published models, with 20 more in development [00:00:40]. These models fall into two categories: general models (e.g., PxP3, PxP4, with PxP5 coming soon) and domain-specific models tailored for the creative, financial services, and medical fields [00:01:00].

The Question of Domain-Specific LLMs

By early 2024, a significant trend emerged: general LLMs were achieving very high accuracy, often between 80% and 90% on general benchmarks [00:01:24]. This raised a critical question for Writer: Is it still worthwhile to continue building domain-specific models if general models are already achieving such high accuracy, or should the focus shift to fine-tuning general models for reasoning or thinking tasks [00:01:53]?

To answer this, Writer decided to conduct evaluations using real-world data, specifically focusing on the financial services domain, with similar results observed in the medical field [00:02:26].

The FAIL Evaluation Framework

Writer developed an evaluation framework called “FAIL” (Financial AI Language) to assess LLMs in realistic scenarios [00:03:15]. The goal was to determine if newer models could deliver the promised accuracy in real-world contexts [00:03:24].

The evaluation included two main categories of failure, each with subcategories:

1. Query Failure

This category introduces issues within the user’s query [00:03:40] (a brief perturbation sketch follows the list):

  • Misspelled Queries: Queries containing spelling errors [00:03:48].
  • Incomplete Queries: Queries missing keywords or lacking clarity [00:04:03].
  • Out-of-Domain Queries: Questions asked by non-experts, or general questions applied to a specialized field [00:04:11].
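The query-failure cases are straightforward to reproduce with simple text perturbations. The sketch below is illustrative only and based on my own assumptions; the function names and perturbation parameters are not taken from the released FAIL tooling.

```python
import random


def misspell(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Misspelled-query case: randomly swap letters to simulate typos."""
    rng = random.Random(seed)
    chars = list(query)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)


def truncate(query: str, keep_ratio: float = 0.6) -> str:
    """Incomplete-query case: drop trailing words so keywords go missing."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep_ratio))])


# Out-of-domain queries are easier to source from a general-purpose question set
# than to generate programmatically.
print(misspell("What was the company's net interest margin in Q3?"))
print(truncate("What was the company's net interest margin in Q3?"))
```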

2. Context Failure

This category focuses on issues related to the provided context [00:04:23] (a matching sketch follows the list):

  • Missing Context: Questions asked about context not present in the prompt [00:04:33].
  • OCR Errors: Errors introduced when converting physical documents to text, such as character issues or merged words [00:04:44].
  • Irrelevant Context: Supplying a completely wrong document for a specific question [00:05:08].
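The context-failure cases can be simulated in the same spirit. Again, this is only a sketch under assumed names: OCR damage is approximated with character confusions and merged words rather than a real OCR pipeline.

```python
import random


def ocr_noise(text: str, merge_rate: float = 0.05, seed: int = 0) -> str:
    """OCR-error case: character confusions and merged words, as scans often produce."""
    rng = random.Random(seed)
    for src, dst in {"l": "1", "O": "0", "S": "5", "rn": "m"}.items():
        if rng.random() < 0.5:
            text = text.replace(src, dst)
    words, merged = text.split(), []
    for w in words:
        if merged and rng.random() < merge_rate:
            merged[-1] += w  # glue this word onto the previous one
        else:
            merged.append(w)
    return " ".join(merged)


def drop_context(example: dict) -> dict:
    """Missing-context case: the prompt never contains the needed information."""
    return {**example, "context": ""}


def swap_context(example: dict, unrelated_doc: str) -> dict:
    """Irrelevant-context case: pair the question with the wrong document."""
    return {**example, "context": unrelated_doc}
```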

The evaluation data, including the white paper, dataset, and leaderboard, is open-sourced and available on GitHub and Hugging Face [00:05:37]. The key evaluation metrics, sketched in code after the list, were:

  • Whether the model gave the correct answer [00:05:57].
  • The model’s ability to follow the grounding or context [00:06:03].
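A minimal way to aggregate both axes, assuming each model response has already been judged for correctness, grounding, and refusal (for example by an LLM-as-judge); the field names and helper below are assumptions for illustration, not the published leaderboard code.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    correct: bool   # did the model give the right answer?
    grounded: bool  # did it stay within the provided context?
    refused: bool   # did it decline to answer?


def aggregate(judgments: list[Judgment]) -> dict[str, float]:
    """Roll per-example judgments up into the two headline metrics."""
    n = len(judgments)
    return {
        "answer_accuracy": sum(j.correct for j in judgments) / n,
        "grounding_rate": sum(j.grounded for j in judgments) / n,
        # An ungrounded, non-refused answer is counted as a hallucination.
        "hallucination_rate": sum(not j.grounded and not j.refused for j in judgments) / n,
    }
```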

Evaluation Results and Insights

A group of chat and thinking models were selected for evaluation [00:06:17]. The results revealed several interesting behaviors:

  • Hallucination and Context Following: Thinking models, while often refusing to answer, tend to fail to follow the grounding when given wrong context or data, leading to significantly higher hallucination rates [00:07:02].
  • Query Robustness: For queries with misspellings, incomplete information, or out-of-domain questions, most models (both domain-specific and general) could provide an answer, with reasoning or thinking models even scoring higher [00:07:27].
  • Grounding and Context Adherence: The critical difference emerged in grounding and context adherence [00:07:41].
    • Smaller models actually performed better in grounding than larger, “overthinking” models [00:08:50].
    • Larger, more “thinking” models performed 50% to 70% worse at grounding, meaning they often did not follow the provided context and instead drew answers from outside it [00:08:50].
    • In domain-specific tasks, these models were not truly “thinking” but rather exhibiting a “Chain of Thought” behavior, leading to high hallucination in financial use cases [00:09:26].
  • The Robustness-Hallucination Gap: There is a substantial gap between a model’s robustness and its ability to get the correct answer without hallucinating [00:09:54]. Even the best models achieved a maximum of 81% in robustness and context grounding, implying that nearly 20% of requests could be completely wrong in a real-world setting, as the back-of-the-envelope sketch below illustrates [00:10:14].
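To make that figure concrete, a quick calculation based on the reported 81% ceiling (the traffic number is an arbitrary assumption for illustration):

```python
best_grounded_correct = 0.81   # reported ceiling for the best models in the evaluation
daily_requests = 10_000        # assumed traffic volume, purely illustrative

potentially_wrong = (1 - best_grounded_correct) * daily_requests
print(f"~{potentially_wrong:.0f} of {daily_requests} daily requests "
      f"({1 - best_grounded_correct:.0%}) may be answered without proper grounding")
```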

Conclusion

Based on the data from these benchmarks, the answer to the initial question is a clear “yes”: domain-specific models are still necessary [00:11:09]. While general model accuracy continues to grow, their ability to follow context and perform proper grounding lags significantly behind [00:11:24].

Therefore, to achieve reliable LLM utilization today, a full-stack approach is essential [00:10:44]. This includes Retrieval Augmented Generation (RAG) systems, robust grounding mechanisms, and effective guardrails built around the entire system [00:10:48].
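A skeletal view of what that full-stack approach might look like in code. The retriever, grounding check, and guardrail below are placeholders standing in for real components (e.g., vector search, an entailment or citation check, policy filters); this is a sketch of the pattern, not Writer’s actual implementation.

```python
from typing import Callable


def answer_with_guardrails(
    question: str,
    retrieve: Callable[[str], list[str]],           # RAG: fetch candidate passages
    generate: Callable[[str, list[str]], str],      # LLM call, prompted to cite the context
    is_grounded: Callable[[str, list[str]], bool],  # grounding check, e.g. entailment against passages
    violates_policy: Callable[[str], bool],         # output guardrail
) -> str:
    """Only return an answer that is both grounded in retrieved context and policy-clean."""
    passages = retrieve(question)
    if not passages:
        return "No relevant context was found for this question."
    draft = generate(question, passages)
    if violates_policy(draft) or not is_grounded(draft, passages):
        # Fail closed: refusing is cheaper than an ungrounded (hallucinated) answer.
        return "The available documents do not support a reliable answer."
    return draft
```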