From: aidotengineer

Wasim, co-founder and CTO of Writer, discusses the company’s journey and addresses a crucial question regarding the continued relevance of domain-specific models in the age of highly accurate general Large Language Models (LLMs) [01:53:00].

Writer’s Background and Model Philosophy

Writer was founded in 2020, focusing on building transformer models, specifically encoder-decoder models [00:40:00]. The company currently offers a family of about 16 published models, with another 20 in development [00:50:00]. These models fall into two main categories: general-purpose models and domain-specific models (e.g., for financial services or healthcare).

By early 2024, a noticeable trend emerged where general LLMs achieved very high accuracy on general benchmarks, often reaching between 80% and 90% [01:24:00]. This led to internal discussions at Writer about the necessity of continuing to build and fine-tune domain-specific models if general models were already performing so well [01:53:00].

Evaluating LLMs in Real-World Scenarios

To answer the question of the ongoing need for domain-specific models, Writer developed an evaluation framework called “FAIL” [03:15:00]. The objective of FAIL is to create real-world failure scenarios and assess how well models handle them [03:19:00]. While the framework applies to other domains such as medicine or customer support, the presentation focused specifically on the financial services benchmark [02:31:00].

The evaluation categories, illustrated in the code sketch after this list, included:

  • Query Failure:
    • Misspelling Queries: Introducing spelling errors in user input [03:48:00].
    • Incomplete Queries: Missing keywords or unclear phrasing [04:03:00].
    • Out-of-Domain Queries: Questions from users who are not experts in the field, or general answers applied to specific contexts [04:11:00].
  • Context Failure:
    • Missing Context: Asking questions about context not provided in the prompt [04:33:00].
    • OCR Error: Simulating errors from converting physical documents to text, such as character issues or merged words [04:44:00].
    • Irrelevant Context: Providing a completely wrong document for a question [05:08:00].
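
The talk does not show how these perturbations are implemented, so the following is a minimal sketch of how such query and context failures could be generated. All function names and parameters are illustrative assumptions, not Writer’s actual FAIL code.

```python
import random

def misspell_query(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Query failure: inject character-level spelling errors into user input."""
    rng = random.Random(seed)
    chars = list(query)
    for i, c in enumerate(chars):
        if c.isalpha() and rng.random() < rate:
            chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def simulate_ocr_errors(text: str, merge_rate: float = 0.05, seed: int = 0) -> str:
    """Context failure: mimic OCR noise -- character confusions and merged words."""
    rng = random.Random(seed)
    out = text
    for src, dst in {"l": "1", "O": "0", "rn": "m"}.items():
        if rng.random() < 0.5:
            out = out.replace(src, dst, 1)
    # Merge some words by dropping spaces, as OCR often does.
    return "".join(c for c in out if not (c == " " and rng.random() < merge_rate))

def irrelevant_context(documents: list[str], correct_idx: int, seed: int = 0) -> str:
    """Context failure: return a completely wrong document for the question."""
    rng = random.Random(seed)
    return rng.choice([d for i, d in enumerate(documents) if i != correct_idx])
```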

The evaluation metrics focused on two key aspects (a minimal scoring sketch follows the list):

  1. Whether the model provides the correct answer [05:57:00].
  2. Whether the model follows the provided context (grounding) [06:03:00].
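
A minimal sketch of how a single response could be scored on these two metrics, assuming a pluggable judge callable (exact match, an LLM judge, etc.); the abstention heuristic and field names are assumptions, not the benchmark’s documented implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    answered: bool   # robustness: did the model attempt an answer at all?
    correct: bool    # metric 1: does the answer match the reference?
    grounded: bool   # metric 2: is the answer supported by the given context?

def score_response(response: str, reference: str, context: str,
                   judge: Callable[[str, str], bool]) -> EvalResult:
    # Crude abstention check; a real harness would be more careful here.
    answered = bool(response.strip()) and "cannot answer" not in response.lower()
    return EvalResult(
        answered=answered,
        correct=answered and judge(response, reference),
        grounded=answered and judge(response, context),
    )
```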

The data, evaluation set, and leaderboard are open-sourced and available on GitHub and Hugging Face [05:37:00].
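
Assuming the published evaluation set follows standard Hugging Face conventions, it could be loaded with the `datasets` library. The repository id below is a placeholder, not the verified dataset name; check Writer’s GitHub and Hugging Face pages for the actual one.

```python
from datasets import load_dataset

# Placeholder repo id -- substitute the actual FAIL dataset published by Writer.
fail = load_dataset("Writer/FAIL-finance", split="test")

for example in fail.select(range(3)):
    print(example)  # inspect the query, context, and reference fields
```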

Evaluation Results: General vs. Domain-Specific LLMs

A group of general chat models and “thinking” models was selected for the evaluation [06:17:00].

Key Findings:

  • Answer Robustness: General and “thinking” models rarely refuse to answer; they will produce a response even for misspelled, incomplete, or out-of-domain queries [06:50:00]. On this measure the models score close to one another, with thinking models sometimes slightly higher [07:27:00].
  • Context Grounding Challenges: This is where the significant differences appear. When given wrong context, wrong data, or completely different grounding, these models fail to follow the provided context [07:05:00], which leads to significantly higher hallucination rates [07:16:00] (see the aggregation sketch after this list).
  • Performance in Specific Tasks: In tasks like text generation or question answering with poor context, general models do not perform well [07:48:00].
  • “Thinking” Models and Hallucination: The larger “thinking” models delivered notably worse grounding results, up to 50–70% worse [08:50:00]. This suggests that these models do not truly “think” through domain-specific tasks; when they fail to follow the context, they hallucinate heavily [09:23:00].
  • Smaller Models: Surprisingly, smaller models sometimes performed better in context grounding than the larger, “overthinking” models [09:14:00].
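
The hallucination rate reported in these findings can be read as: among responses where the model answered at all, the fraction not grounded in the supplied context. A sketch of that aggregation, assuming the boolean fields from the earlier scoring sketch:

```python
def hallucination_rate(results: list[dict]) -> float:
    """Fraction of answered responses that are not grounded in their context.

    Each result dict carries boolean "answered" and "grounded" fields,
    as produced by a scorer like the one sketched above (an assumption).
    """
    answered = [r for r in results if r["answered"]]
    if not answered:
        return 0.0
    return sum(not r["grounded"] for r in answered) / len(answered)
```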

Conclusion: The Enduring Need for Domain-Specific LLMs

Even the best-performing general models in the evaluation achieved only about 81% in robustness and context grounding [10:17:00]. While that sounds good, it means roughly one in five requests (100% − 81% ≈ 19%) would be “completely wrong” in a real-world scenario [10:27:00].

**The answer to the question “Do we still need to build models?” is a definitive YES.** [11:11:00]

Despite the increasing accuracy of general models, their ability to correctly follow and ground context is still significantly behind what is needed for reliable applications [11:24:00]. Therefore, there remains a critical need to build and continue developing domain-specific models, especially for use cases like financial services where accuracy and grounding are paramount [11:16:00]. Furthermore, a “full stack” approach is required, incorporating robust systems, grounding, and guard rails to ensure reliability [10:44:00].