From: aidotengineer

Wasim, co-founder and CTO of Writer, details the company’s journey and shares insights into the continued necessity of domain-specific models in the rapidly evolving landscape of AI [00:00:27].

A History of Model Development at Writer

Writer, founded in 2020, views its story as intertwined with the evolution of the Transformer model [00:00:35]. Initially building encoder-decoder models, the company has since developed a family of approximately 16 published models, with another 20 in development [00:00:46]. These models fall into two categories:

  • General Models: Such as PX and P3/P4 (with P5 coming soon) [00:01:02].
  • Domain-Specific Models: Covering areas like Creative, Financial Services, and Medical [00:01:10].

The Shifting Landscape of LLM Accuracy

By early 2024, a significant trend emerged: Large Language Models (LLMs) were achieving very high accuracy on general benchmarks, often reaching 80-90% [00:01:24]. This rise in performance sparked an internal question at Writer [00:01:53]: Is it still worthwhile to continue building domain-specific models if general models are nearing 90% accuracy [00:01:56]? The company considered whether to instead focus on fine-tuning general models or developing reasoning or thinking models [00:02:07].

The Need for Data: Introducing the FAIL Benchmark

To answer this question, Writer developed a benchmark called “FAIL” [00:02:26]. The objective was to create real-world scenarios that evaluate models against the accuracy they promise in domain-specific contexts [00:03:19]. The benchmark, which is open source and available on GitHub and Hugging Face [00:05:40], defines two main categories of failure:
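
Since the benchmark is public, it can be pulled straight from Hugging Face. The dataset id and split/column names in this sketch are assumptions for illustration (the talk does not spell out the schema); check the actual GitHub/Hugging Face repo for the real identifiers:

```python
# Minimal sketch of loading the FAIL benchmark with the `datasets` library.
# NOTE: "writer/FAIL" and the split/column names are assumptions, not the
# confirmed repo id -- look up the real ones on GitHub / Hugging Face.
from datasets import load_dataset

fail = load_dataset("writer/FAIL")           # hypothetical dataset id
print(fail)                                  # inspect the available splits

for row in fail["test"].select(range(3)):    # assumed "test" split
    print(row["query"], "->", row["failure_type"])  # assumed column names
```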

1. Query Failure

This category assesses how models handle imperfect user queries [00:03:40] (illustrated in the sketch after this list):

  • Misspelled Queries: User inputs containing spelling errors [00:03:46].
  • Incomplete Queries: Queries missing keywords or clarity [00:04:03].
  • Out-of-Domain Queries: Attempts to answer specific domain questions using general knowledge [00:04:11].
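
A minimal sketch of how such query perturbations can be generated, assuming simple character swaps and word drops (illustrative only, not Writer's actual generation code):

```python
import random

random.seed(0)

def misspell(query: str, rate: float = 0.15) -> str:
    """Introduce typos by swapping adjacent alphabetic characters at random."""
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def make_incomplete(query: str, drop: float = 0.3) -> str:
    """Drop random words to simulate an underspecified query."""
    words = query.split()
    kept = [w for w in words if random.random() > drop]
    return " ".join(kept) if kept else words[0]

q = "What was the bank's net interest margin in Q3 2023?"
print(misspell(q))         # e.g. "Waht was teh bnak's ..."
print(make_incomplete(q))  # e.g. "What was net margin Q3"
```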

2. Context Failure

This category focuses on the model’s ability to handle issues with the provided context [00:04:22] (see the sketch after this list):

  • Missing Context: Asking questions about information not present in the prompt [00:04:33].
  • OCR Errors: Introducing character issues, spacing problems, or merged words common in optical character recognition conversions [00:04:44].
  • Irrelevant Context: Providing a completely wrong document for a specific question [00:05:08].
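
The context failures can be simulated in the same spirit. The sketch below fakes OCR damage (character confusions, merged words) and swaps in an unrelated document; the confusion table is an illustrative assumption, not the benchmark's actual corruption model:

```python
import random

random.seed(0)

# Illustrative OCR confusion table (an assumption for this sketch).
OCR_CONFUSIONS = {"l": "1", "O": "0", "S": "5", "m": "rn"}

def add_ocr_noise(text: str, rate: float = 0.05) -> str:
    """Simulate OCR damage: character confusions plus merged words."""
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and random.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        elif ch == " " and random.random() < rate:
            continue  # drop the space so neighbouring words merge
        else:
            out.append(ch)
    return "".join(out)

def irrelevant_context(docs: list[str], question_doc: int) -> str:
    """Pick a document other than the one the question is about."""
    others = [d for i, d in enumerate(docs) if i != question_doc]
    return random.choice(others)
```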

The evaluation focused primarily on two metrics: whether the model gave the correct answer, and whether it adhered to the provided context (“context grounding”) [00:05:54].
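
The talk does not detail how these two metrics are computed. A common pattern for this kind of evaluation, assumed here rather than confirmed as Writer's method, is to score answer correctness and context grounding separately, e.g. with an LLM-as-judge:

```python
from typing import Callable

def score_example(
    answer: str,
    reference: str,
    context: str,
    judge: Callable[[str], bool],  # e.g. an LLM-as-judge yes/no call (assumption)
) -> dict:
    """Score one benchmark row on the two axes: correctness and grounding."""
    correct = judge(
        "Does the answer agree with the reference?\n"
        f"Answer: {answer}\nReference: {reference}"
    )
    grounded = judge(
        "Is every claim in the answer supported by the context?\n"
        f"Answer: {answer}\nContext: {context}"
    )
    return {"correct": correct, "grounded": grounded}
```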

Evaluation Results: The Grounding Challenge

The evaluation, which focused specifically on financial services, revealed “very interesting results” [00:06:34].

Hallucination and General Model Behavior

  • No Refusals: Thinking models generally did not refuse to answer, which might sound good but proved problematic [00:06:52].
  • Failure in Grounding: When given incorrect or irrelevant context, these models often “fail to follow this part” and still provide an answer, leading to “way higher hallucination” [00:07:05].
  • Answer Accuracy vs. Grounding: In simple answer correctness, most domain-specific and general models scored “close to each other”, and reasoning models even scored slightly higher [00:07:27]; in “grounding and context grounding”, however, performance showed a stark difference [00:07:41].

The Problem of Context Grounding

  • Poor Performance in Grounding: In tasks like text generation and question answering, general models were “not performing well” in context grounding [00:07:52].
  • Significant Drop: While models performed “amazingly” when handling misspelled, incomplete, or out-of-domain queries, their performance plummeted on grounding [00:08:12].
  • Worse Results for Bigger Models: Bigger, more deliberative “thinking” models yielded “the worst result” in grounding, performing 50-70% worse [00:08:50]. In these cases the model “is just not following” the attached context, or it answers from completely irrelevant context [00:09:01].
  • Smaller Models Outperform: Surprisingly, smaller models actually performed better in grounding than the larger models that were “overthinking” [00:09:14].

These findings suggest that in domain-specific tasks, current models are “not thinking at that stage,” leading to “really high” hallucination [00:09:34]. Even the best models achieved only about 81% accuracy when combining robustness and context grounding, meaning nearly 20% of requests could be “completely wrong” [00:10:22].

Conclusion: The Enduring Need for Domain-Specific Models

Based on the data from the FAIL benchmark, the answer to the initial question is a definitive “yes” [00:11:09]. Writer concludes that it is still necessary to build and continue developing domain-specific models [00:11:16]. While general model accuracy is improving, their ability to correctly handle “grounding” and “context following” remains “way, way, way behind” [00:11:27].

For reliable deployment today, a full-stack approach is required: a “robust system” with “grounding” and “guard rails” built around the model [00:10:44].
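
As one concrete shape such guard rails might take (an assumption, not Writer's confirmed architecture), a deployment can check a draft answer against the retrieved context and refuse rather than hallucinate; `model` and `judge` below stand in for real API calls:

```python
from typing import Callable

def answer_with_guardrails(
    question: str,
    context: str,
    model: Callable[[str], str],   # placeholder for a real model call
    judge: Callable[[str], bool],  # placeholder for a grounding check
) -> str:
    """Refuse rather than hallucinate when the draft answer is ungrounded."""
    draft = model(f"Context:\n{context}\n\nQuestion: {question}")
    supported = judge(
        "Is every claim in the answer supported by the context?\n"
        f"Answer: {draft}\nContext: {context}"
    )
    return draft if supported else "I can't answer this from the provided documents."
```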