From: aidotengineer

Riter, a company founded in 2020, views its history as intertwined with the development of the Transformer model [00:00:38]. In its early days, Riter focused on building encoder-decoder models [00:00:43], and today maintains a family of approximately 16 published models, with another 20 in development [00:00:50]. These models fall into two primary categories: general models (e.g., PXP3, PXP4, with PXP5 coming soon) and domain-specific models tailored for areas such as Creative Financial Services, Palo, and Medical [00:01:02].

The Question of Continued Domain-Specific Development

By early 2024, a significant trend had emerged: large language models (LLMs) were achieving very high general accuracy, with good general models averaging 80-90% on benchmarks [00:01:22]. This prompted Riter to question whether it was still necessary to build and maintain domain-specific models [00:01:53], or whether it should instead focus on refining general models through fine-tuning or on developing “reasoning” or “thinking” models [00:02:08].

Evaluating Language Model Performance: The FAIL Framework

To answer this question, Riter developed an evaluation framework called “FAIL,” designed to assess models in real-world scenarios [00:03:12]. The evaluation focuses on financial services, though a parallel evaluation for medical applications is underway and is showing similar results [00:02:41]. The metrics capture whether the model provides the correct answer and whether it adheres to the provided grounding or context [00:05:57].

The FAIL framework categorizes challenges into two main types:

1. Query Failure [00:03:40]

This category introduces errors in the user’s query:

  • Misspelled Queries: Queries containing spelling errors [00:03:46].
  • Incomplete Queries: Queries missing keywords or lacking clarity [00:04:03].
  • Out-of-Domain Queries: Queries phrased by non-experts, or specific questions rephrased in general terms [00:04:11].

2. Context Failure [00:04:22]

This category introduces issues with the provided context for the query:

  • Missing Context: Questions where the necessary context does not exist in the prompt [00:04:31].
  • OCR Error: Context derived from OCR (Optical Character Recognition) with character issues, spacing problems, or merged words [00:04:44].
  • Irrelevant Context: Uploading a completely wrong or irrelevant document to answer a question [00:05:08].

The evaluation data, including the evaluation set, leaderboard, and white paper, is open source and available on GitHub and Hugging Face [00:05:37].
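
While this summary does not describe exactly how the FAIL test cases are generated, the failure types above lend themselves to simple programmatic perturbations of a clean query/context pair. The Python sketch below is only an illustration under that assumption; the example query, context, and helper functions (misspell, make_incomplete, ocr_noise) are hypothetical and are not taken from the FAIL dataset (out-of-domain rephrasing is omitted because it typically requires manual rewriting).

```python
import random
import string

random.seed(0)


def misspell(query: str, n_errors: int = 2) -> str:
    """Query failure: introduce a few character-level spelling errors."""
    chars = list(query)
    for _ in range(n_errors):
        i = random.randrange(len(chars))
        chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)


def make_incomplete(query: str, drop_ratio: float = 0.3) -> str:
    """Query failure: drop trailing keywords so the question loses clarity."""
    words = query.split()
    keep = max(1, int(len(words) * (1 - drop_ratio)))
    return " ".join(words[:keep])


def ocr_noise(context: str, error_rate: float = 0.05) -> str:
    """Context failure: simulate OCR artifacts (merged words, stray characters)."""
    out = []
    for ch in context:
        r = random.random()
        if ch == " " and r < error_rate:
            continue                          # merge neighbouring words
        if r < error_rate:
            out.append(random.choice("|~^"))  # inject a stray character
        out.append(ch)
    return "".join(out)


# A single clean financial-services example (hypothetical, for illustration only).
clean_query = "What was the bank's net interest margin in Q3 2023?"
clean_context = "In Q3 2023 the bank reported a net interest margin of 3.2 percent."
irrelevant_context = "The recipe calls for two cups of flour and a pinch of salt."

test_cases = [
    {"type": "misspelled_query",   "query": misspell(clean_query),        "context": clean_context},
    {"type": "incomplete_query",   "query": make_incomplete(clean_query), "context": clean_context},
    {"type": "missing_context",    "query": clean_query,                  "context": ""},
    {"type": "ocr_error",          "query": clean_query,                  "context": ocr_noise(clean_context)},
    {"type": "irrelevant_context", "query": clean_query,                  "context": irrelevant_context},
]

for case in test_cases:
    print(case["type"], "->", case["query"][:45], "|", case["context"][:40])
```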

Key Findings

The evaluation involved a selection of general chat and “thinking” models [00:06:21].

  • Query Handling: General models, and “thinking” models in particular, handled query failures well: misspelled, incomplete, or out-of-domain queries still received answers [00:06:50] [00:08:12].
  • Context Grounding: Context failures proved much more challenging. When given wrong context, wrong data, or a completely different grounding, general models tended to ignore the mismatch and answer anyway, leading to significantly higher hallucination [00:07:05].
    • In tasks like text generation or question answering, general models did not perform well in grounding [00:07:48].
    • Notably, larger “thinking” models performed roughly 50-70% worse on context grounding [00:08:50].
    • This suggests that general models are not truly “thinking” on domain-specific tasks, but rather producing chain-of-thought output that results in high hallucination rates [00:09:23].
    • In some cases, smaller models performed better in grounding compared to larger, “overthinking” models [00:09:14].
  • Robustness vs. Grounding: There remains a substantial gap between a model’s robustness (its ability to answer a query despite errors in it) and its ability to ground its answers correctly in the provided context [00:09:54]. Even the best models in the evaluation did not exceed 81% on robustness and context grounding combined [00:10:17], implying that nearly 20% of requests could be completely wrong [00:10:30]; a minimal scoring sketch follows this list.
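
To make the robustness/grounding distinction concrete, here is a minimal scoring sketch. It assumes each response has already been judged for answer correctness and for grounding (for instance by a human or an LLM judge); the judged results shown are invented, and the mean of the two scores is only one possible way to “combine” them, since the exact aggregation behind the 81% figure is not specified here.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_type: str             # e.g. "misspelled_query", "irrelevant_context"
    answered_correctly: bool   # judged against a reference answer
    stayed_grounded: bool      # answer asserts only facts present in the context


def robustness(results: list[EvalResult]) -> float:
    """Share of query-failure cases the model still answered correctly."""
    cases = [r for r in results if r.case_type.endswith("_query")]
    return sum(r.answered_correctly for r in cases) / len(cases)


def grounding(results: list[EvalResult]) -> float:
    """Share of context-failure cases where the model stayed within the context."""
    cases = [r for r in results if not r.case_type.endswith("_query")]
    return sum(r.stayed_grounded for r in cases) / len(cases)


# Hypothetical judged outputs for one model (not real FAIL results).
results = [
    EvalResult("misspelled_query", True, True),
    EvalResult("incomplete_query", True, True),
    EvalResult("missing_context", True, False),    # answered instead of refusing
    EvalResult("ocr_error", False, True),
    EvalResult("irrelevant_context", True, False),
]

r, g = robustness(results), grounding(results)
print(f"robustness={r:.0%}  grounding={g:.0%}  mean={((r + g) / 2):.0%}")
```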

Conclusion

Based on the evaluation data, Riter concludes that, with current technology and model implementations, there is still a clear need to build and continue developing domain-specific models [00:11:09]. While the general accuracy of LLMs continues to grow, their ability to correctly follow and ground answers within specific contexts remains significantly behind [00:11:24].

To use LLMs reliably today, a full-stack system is required: strong grounding, robust guardrails, and overall system reliability [00:10:44].
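
As an illustration of what such a full-stack setup might involve, the sketch below places a grounding check between the raw model output and the user: if the answer cannot be supported by the supplied context, the system returns a refusal instead of a potential hallucination. Everything here (call_llm, is_supported_by, the prompt wording) is a hypothetical placeholder, not a description of Riter’s actual system or any specific vendor API.

```python
import re


def call_llm(prompt: str) -> str:
    """Placeholder for the underlying model call (assumption, not a real API)."""
    return "The net interest margin was 3.2 percent."


def is_supported_by(answer: str, context: str) -> bool:
    """Naive grounding check: every number in the answer must also appear in the context.
    A production system would use an NLI model or a dedicated grounding judge instead."""
    return all(num in context for num in re.findall(r"\d+(?:\.\d+)?", answer))


def answer_with_guardrails(query: str, context: str) -> str:
    """Refuse rather than risk a hallucinated answer when grounding is weak."""
    if not context.strip():
        return "I can't answer this: no supporting context was provided."
    answer = call_llm(f"Answer strictly from the context.\nContext: {context}\nQuestion: {query}")
    if not is_supported_by(answer, context):
        return "I can't answer this reliably from the provided documents."
    return answer


print(answer_with_guardrails(
    "What was the net interest margin in Q3 2023?",
    "In Q3 2023 the bank reported a net interest margin of 3.2 percent.",
))
```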