From: aidotengineer
Wasim, co-founder and CTO of Writer, details the company’s journey and its approach to evaluating Large Language Models (LLMs) in real-world scenarios [00:00:17]. Writer, founded in 2020, began by building encoder-decoder models and has since grown a family of approximately 16 published models, with 20 more in development [00:00:40]. These models fall into two categories: general models (e.g., PxP3 and PxP4, with PxP5 coming soon) and domain-specific models tailored to the creative, financial services, and medical fields [00:01:00].
The Question of Domain-Specific LLMs
By early 2024, a significant trend emerged: general LLMs were achieving very high accuracy, often between 80% and 90% on general benchmarks [00:01:24]. This raised a critical question for Writer: Is it still worthwhile to continue building domain-specific models if general models are already achieving such high accuracy, or should the focus shift to fine-tuning general models for reasoning or thinking tasks [00:01:53]?
To answer this, Writer decided to conduct evaluations using real-world data, specifically focusing on the financial services domain, with similar results observed in the medical field [00:02:26].
The FAIL Evaluation Framework
Writer developed an evaluation framework called “FAIL” (Financial AI Language) to assess LLMs in realistic scenarios [00:03:15]. The goal was to determine if newer models could deliver the promised accuracy in real-world contexts [00:03:24].
The evaluation included two main categories of failure, each with subcategories:
1. Query Failure
This category introduces issues within the user’s query [00:03:40]; a small perturbation sketch follows the list:
- Misspelling Queries: Queries containing spelling errors [00:03:48].
- Incomplete Queries: Queries missing keywords or lacking clarity [00:04:03].
- Out-of-Domain Queries: Questions phrased as a non-expert would ask them, or general questions applied to a specific field [00:04:11].
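To make these perturbations concrete, here is a minimal sketch of how misspelled and incomplete variants of a query could be generated programmatically. The function names and heuristics are illustrative assumptions, not the actual FAIL tooling.

```python
import random

# Illustrative query-failure perturbations (assumed helpers, not the FAIL code).
def misspell(query: str, rate: float = 0.15, seed: int = 0) -> str:
    """Swap adjacent letters at random to simulate a misspelled query."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def make_incomplete(query: str, drop_keywords: set[str]) -> str:
    """Drop domain keywords to simulate an underspecified query."""
    keep = [w for w in query.split() if w.strip("?.,").lower() not in drop_keywords]
    return " ".join(keep)

query = "What was the company's EBITDA margin in Q3 2023?"
print(misspell(query))                     # e.g. "Waht was the compnay's ..."
print(make_incomplete(query, {"ebitda"}))  # "What was the company's margin in Q3 2023?"
```

Out-of-domain variants are harder to script and would more plausibly be rewritten by hand or by another model.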
2. Context Failure
This category focuses on issues in the provided context [00:04:23]; a companion sketch follows the list:
- Missing Context: Questions asked about context not present in the prompt [00:04:33].
- OCR Errors: Errors introduced when converting physical documents to text, such as character issues or merged words [00:04:44].
- Irrelevant Context: Supplying a completely wrong document for a specific question [00:05:08].
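The context failures can be simulated in a similar way. The sketch below shows one plausible way to inject OCR-style noise, drop the supporting passage, or swap in an unrelated document; again, these helpers are assumptions for illustration, not the released benchmark code.

```python
import random

# Illustrative context-failure perturbations (assumed helpers).
OCR_SUBSTITUTIONS = {"0": "O", "1": "l", "5": "S", "rn": "m"}  # common OCR confusions

def add_ocr_noise(text: str, merge_rate: float = 0.05, seed: int = 0) -> str:
    """Apply character-level OCR-style substitutions and occasionally merge words."""
    rng = random.Random(seed)
    for src, dst in OCR_SUBSTITUTIONS.items():
        if rng.random() < 0.5:
            text = text.replace(src, dst)
    words, merged = text.split(), []
    for w in words:
        if merged and rng.random() < merge_rate:
            merged[-1] += w                 # merged words, as produced by bad OCR
        else:
            merged.append(w)
    return " ".join(merged)

def drop_context(example: dict) -> dict:
    """Missing context: keep the question but remove the supporting passage."""
    return {**example, "context": ""}

def swap_context(example: dict, unrelated_doc: str) -> dict:
    """Irrelevant context: pair the question with a completely unrelated document."""
    return {**example, "context": unrelated_doc}
```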
The evaluation artifacts (white paper, dataset, and leaderboard) are open-sourced and available on GitHub and Hugging Face [00:05:37]. The evaluation tracked two key metrics, combined in the scoring sketch after this list:
- Whether the model gave the correct answer [00:05:57].
- The model’s ability to follow the grounding or context [00:06:03].
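A scoring loop over these two metrics might look like the sketch below. The `model` and `judge` callables are placeholders: the judge could be exact-match logic or another LLM, and neither is specified in the talk.

```python
def evaluate(model, judge, dataset):
    """dataset: iterable of dicts with 'query', 'context', and 'reference' keys."""
    correct = grounded = total = 0
    for ex in dataset:
        answer = model(query=ex["query"], context=ex["context"])
        verdict = judge(answer=answer, context=ex["context"], reference=ex["reference"])
        correct += verdict["is_correct"]    # did the model give the right answer?
        grounded += verdict["is_grounded"]  # did it stay within the provided context,
                                            # or refuse when the context cannot support an answer?
        total += 1
    return {"accuracy": correct / total, "grounding": grounded / total}
```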
Evaluation Results and Insights
A group of chat and thinking models were selected for evaluation [00:06:17]. The results revealed several interesting behaviors:
- Hallucination and Context Following: Thinking models rarely refused to answer; when given wrong context or data, they failed to follow the grounding, which led to significantly higher hallucination rates [00:07:02].
- Query Robustness: For queries with misspellings, incomplete information, or out-of-domain questions, most models (both domain-specific and general) could provide an answer, with reasoning or thinking models even scoring higher [00:07:27].
- Grounding and Context Adherence: The critical difference emerged in grounding and context adherence [00:07:41].
- Smaller models actually performed better in grounding than larger, “overthinking” models [00:08:50].
- Larger, more “thinking” models showed 50% to 70% worse performance in grounding, meaning they often did not follow the provided context and answered even when the relevant information was not in it [00:08:50].
- In domain-specific tasks, these models were not truly “thinking” but merely producing chain-of-thought text, which led to high hallucination rates in financial use cases [00:09:26].
- The Robustness-Hallucination Gap: There is a substantial gap between a model’s robustness and its ability to get the correct answer without hallucinating [00:09:54]. Even the best models achieved a maximum of 81% in robustness and context grounding, implying that nearly 20% of requests could be completely wrong in a real-world setting [00:10:14].
Conclusion
Based on the data from these benchmarks, the answer to the initial question is a clear “yes”: domain-specific models are still necessary [00:11:09]. While the accuracy of general models continues to grow, their ability to follow context and ground their answers lags significantly behind [00:11:24].
Therefore, to achieve reliable LLM utilization today, a full-stack approach is essential [00:10:44]. This includes Retrieval Augmented Generation (RAG) systems, robust grounding mechanisms, and effective guardrails built around the entire system [00:10:48].
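As a rough illustration of what such a full-stack setup could look like, here is a minimal sketch that chains retrieval, a grounding check, and a refusal guardrail. The `retriever`, `llm`, and `grounding_check` components and the 0.8 threshold are assumptions for illustration, not Writer’s implementation.

```python
def answer_with_guardrails(question: str, retriever, llm, grounding_check,
                           min_support: float = 0.8) -> str:
    # 1. RAG: retrieve candidate passages for the question.
    passages = retriever(question, top_k=5)
    if not passages:
        return "No relevant documents were found for this question."  # guardrail: no context

    context = "\n\n".join(passages)
    draft = llm(
        "Answer strictly from the context below. If the context does not contain "
        f"the answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )

    # 2. Grounding: check that the draft is actually supported by the retrieved
    #    context (e.g. an entailment or citation-coverage score in [0, 1]).
    support = grounding_check(answer=draft, context=context)
    if support < min_support:
        # 3. Guardrail: refuse rather than return a likely hallucination.
        return "The retrieved documents do not support a confident answer."
    return draft
```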