From: aidotengineer
Wasim, co-founder and CTO of Writer, details the company’s journey and shares insights into the continued necessity of domain-specific models in the rapidly evolving landscape of AI [00:00:27].
A History of Model Development at Writer
Writer, founded in 2020, views its story as intertwined with the evolution of the Transformer model [00:00:35]. Initially building encoder-decoder models, the company has since developed a family of approximately 16 published models, with another 20 in development [00:00:46]. These models fall into two categories:
- General Models: such as PX and P3/P4 (with P5 coming soon) [00:01:02].
- Domain-Specific Models: Covering areas like Creative, Financial Services, and Medical [00:01:10].
The Shifting Landscape of LLM Accuracy
By early 2024, a significant trend emerged: Large Language Models (LLMs) were achieving very high accuracy on general benchmarks, often reaching 80-90% [00:01:24]. This rise in performance sparked an internal question at Writer [00:01:53]: Is it still worthwhile to continue building domain-specific models if general models are nearing 90% accuracy [00:01:56]? The company considered whether to instead focus on fine-tuning general models or developing reasoning or thinking models [00:02:07].
The Need for Data: Introducing the FAIL Benchmark
To answer this critical question, Writer developed a benchmark called “FAIL” [00:02:26]. The objective was to create real-world scenarios to evaluate models and assess their promised accuracy in domain-specific contexts [00:03:19]. The benchmark, which is open-source and available on GitHub and Hugging Face [00:05:40], introduced two main categories of evaluation failures:
1. Query Failure
This category assesses how models handle imperfect user queries (illustrated in the sketch after this list) [00:03:40]:
- Misspelled Queries: User inputs containing spelling errors [00:03:46].
- Incomplete Queries: Queries missing keywords or clarity [00:04:03].
- Out-of-Domain Queries: Attempts to answer specific domain questions using general knowledge [00:04:11].
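To make these categories concrete, here is a minimal Python sketch of how such query perturbations could be generated. The FAIL repository ships its test cases as data on GitHub and Hugging Face; the `misspell` and `make_incomplete` helpers below are illustrative assumptions, not the benchmark's actual API.

```python
import random

# Illustrative query perturbations in the spirit of FAIL's query-failure
# categories. These helpers are assumptions for demonstration only; the
# benchmark itself provides pre-built cases as data.

def misspell(query: str, rate: float = 0.3, seed: int = 0) -> str:
    """Swap adjacent characters in some words to mimic typos."""
    rng = random.Random(seed)
    words = query.split()
    for i, w in enumerate(words):
        if len(w) > 3 and rng.random() < rate:
            j = rng.randrange(len(w) - 1)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def make_incomplete(query: str, drop: int = 3) -> str:
    """Truncate trailing words so key terms go missing."""
    words = query.split()
    return " ".join(words[:max(1, len(words) - drop)])

query = "What was the bank's net interest margin in Q3 2023?"
print(misspell(query))         # e.g. a variant with swapped characters
print(make_incomplete(query))  # "What was the bank's net interest"
```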
2. Context Failure
This category focuses on the model's ability to handle issues with the provided context (see the sketch after this list) [00:04:22]:
- Missing Context: Asking questions about information not present in the prompt [00:04:33].
- OCR Errors: Introducing character issues, spacing problems, or merged words common in optical character recognition conversions [00:04:44].
- Irrelevant Context: Providing a completely wrong document for a specific question [00:05:08].
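A similarly hedged sketch of the context-failure side: injecting OCR-style noise into a document and pairing a question with the wrong document. The field names (`doc_id`, `text`) are assumptions for illustration, not the benchmark's schema.

```python
import random

# Illustrative context perturbations mirroring FAIL's context-failure
# categories; the real benchmark provides these cases as data.

def add_ocr_noise(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Mimic OCR conversion errors: merged words and character confusions."""
    rng = random.Random(seed)
    confusions = {"l": "1", "O": "0", "I": "l", "S": "5"}
    out = []
    for ch in text:
        if ch == " " and rng.random() < rate:
            continue  # drop the space: two words merge
        elif ch in confusions and rng.random() < rate:
            out.append(confusions[ch])  # visually similar character swap
        else:
            out.append(ch)
    return "".join(out)

def irrelevant_context(question: dict, corpus: list[dict], seed: int = 0) -> str:
    """Pair a question with a document from a different filing entirely."""
    rng = random.Random(seed)
    others = [d for d in corpus if d["doc_id"] != question["doc_id"]]
    return rng.choice(others)["text"]
```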
The evaluation metrics primarily focused on two aspects: whether the model gave the correct answer and its adherence to “context grounding” [00:05:54].
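Here is a minimal sketch of what scoring against these two metrics could look like. The talk does not detail how FAIL judges correctness or grounding (exact match, an LLM judge, etc.), so `is_correct` and `is_grounded` below are crude stand-in heuristics.

```python
import re

def is_correct(answer: str, reference: str) -> bool:
    """Crude answer check: does the reference string appear in the answer?"""
    return reference.lower() in answer.lower()

def is_grounded(answer: str, context: str, relevant: bool) -> bool:
    """Grounded behavior: refuse on bad context, stay within good context."""
    if not relevant:  # missing/irrelevant context: the model should abstain
        refusals = ("cannot answer", "not in the provided", "no relevant")
        return any(p in answer.lower() for p in refusals)
    # Naive proxy: every figure the model cites must appear in the context.
    return all(n in context for n in re.findall(r"\d[\d,.%]*", answer))

def score(cases: list[dict], generate) -> dict:
    """Run a model callable over benchmark cases and report both metrics."""
    correct = grounded = 0
    for case in cases:
        answer = generate(case["query"], case["context"])
        correct += is_correct(answer, case["reference"])
        grounded += is_grounded(answer, case["context"], case["relevant"])
    n = len(cases)
    return {"answer_accuracy": correct / n, "context_grounding": grounded / n}
```

A stricter implementation would replace both heuristics with an LLM judge, but the two-metric split (answer accuracy versus context grounding) is the part that drives the results below.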
Evaluation Results: The Grounding Challenge
The evaluation, which focused specifically on financial services, revealed very interesting results [00:06:34].
Hallucination and General Model Behavior
- Refusal to Answer: Thinking models generally did not refuse to answer, which may sound like a strength but proved problematic [00:06:52].
- Failure in Grounding: When given incorrect or irrelevant context, these models often failed to follow the grounding instructions and still produced an answer, leading to much higher hallucination rates [00:07:05].
- Answer Accuracy vs. Grounding: On simple answer correctness, most domain-specific and general models scored close to one another, and reasoning models even scored slightly higher [00:07:27]; on context grounding, however, the gap was stark [00:07:41].
The Problem of Context Grounding
- Poor Performance in Grounding: In tasks like text generation and question answering, general models performed poorly at context grounding [00:07:52].
- Significant Drop: While models handled misspelled, incomplete, and out-of-domain queries impressively well, their performance plummeted on grounding [00:08:12].
- Worse Results for Bigger Models: Larger, reasoning-oriented models yielded the worst grounding results, performing 50-70% worse [00:08:50]. In other words, these models simply do not follow the attached context, or they answer from completely irrelevant context [00:09:01].
- Smaller Models Outperform: Surprisingly, smaller models actually performed better on grounding than the larger models that "overthink" [00:09:14].
These findings suggest that on domain-specific tasks, current models are not truly reasoning, which leads to very high hallucination rates [00:09:34]. Even the best models achieved only about 81% accuracy when robustness and context grounding are combined, meaning nearly one in five requests could be completely wrong [00:10:22].
Conclusion: The Enduring Need for Domain-Specific Models
Based on the data from the FAIL benchmark, the answer to the initial question is a definitive "yes" [00:11:09]: it is still necessary to build and continue developing domain-specific models [00:11:16]. While the accuracy of general models keeps improving, their ability to handle grounding and context following remains "way, way, way behind" [00:11:27].
For reliable deployment today, a full-stack approach is required: a robust system that adds grounding checks and guardrails around the AI model [00:10:44].
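As one deliberately simplified reading of that recommendation, a guardrail can be a grounding check wrapped around the model call. `call_model` and `supports` below are placeholders for a real model client and an entailment or citation checker; this is a sketch of the pattern, not Writer's implementation.

```python
def guarded_answer(query: str, context: str, call_model, supports) -> str:
    """Sketch of a grounding guardrail around an AI system.

    call_model: callable prompt -> answer (any LLM client).
    supports:   callable (answer, context) -> bool, e.g. an NLI/entailment
                or citation checker verifying the answer is grounded.
    """
    draft = call_model(
        "Answer ONLY from the context below. If the context is missing or "
        "irrelevant, say you cannot answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    # Guardrail: never return an ungrounded draft to the user.
    if supports(draft, context):
        return draft
    return "I can't answer this question from the provided document."
```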