From: aidotengineer

Wasim, co-founder and CTO of Ater, detailed the company’s journey and its focus on developing language models, particularly for the financial sector [00:00:20]. Ater, founded in 2020, began by building encoder-decoder models and has since expanded to a family of approximately 16 published models, with another 20 in development [00:00:40]. These models fall into two categories: general models (e.g., PxP34, P5) and domain-specific models tailored to areas such as financial services, medical, and legal [00:01:00].

The Challenge for Domain-Specific Models

By early 2024, Ater observed a trend of Large Language Models (LLMs) achieving very high accuracy on general benchmarks, often reaching 80-90% [00:01:24]. This raised a crucial internal question: is it still necessary to build and maintain domain-specific models if general models can reach such high accuracy, perhaps with fine-tuning or a focus on reasoning capabilities [00:01:53]?

To answer this question, Ater gathered data and built evaluations applicable to several domain-specific areas, including financial services, medical, and customer support [00:02:26]. The data presented here focuses on the financial services domain [00:02:41].

FinFAIL Evaluation Methodology

Ater developed an evaluation framework called “FinFAIL” to test models in real-world scenarios [00:03:12]. This framework introduces specific types of errors and complexities that users might encounter, aiming to see if new models truly deliver on promised accuracy in practical, challenging situations [00:03:17].

FinFAIL categorizes evaluation challenges into two main types:

1. Query Failure [00:03:40]

This category assesses how models handle imperfect or unusual queries (a minimal sketch of such perturbations follows the list):

  • Misspelled Queries: Introducing spelling errors into user questions [00:03:46].
  • Incomplete Queries: Queries missing keywords or lacking clarity [00:04:03].
  • Out-of-Domain Queries: When a non-expert asks about a specialized field, or a general answer is applied to a highly specific topic [00:04:11].
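
The talk names these perturbation types without giving a generation recipe, but they are easy to picture in code. A minimal Python sketch, where the swap rate and truncation ratio are illustrative assumptions rather than FinFAIL’s actual parameters:

```python
import random

def misspell(query: str, rate: float = 0.1, seed: int = 0) -> str:
    """Swap adjacent letters at random to mimic typing errors."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def truncate(query: str, keep: float = 0.6) -> str:
    """Drop trailing words to simulate an incomplete query."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

query = "What was the net interest margin in the third quarter?"
print(misspell(query))   # e.g. "Waht was the net interest margni ..."
print(truncate(query))   # "What was the net interest margin"
```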

2. Context Failure [00:04:23]

This category examines model performance when the provided context itself is problematic, reflecting the data-processing challenges finance professionals face (the OCR case is sketched after the list):

  • Missing Context: Asking a question whose answer does not exist in the provided prompt [00:04:31].
  • OCR Errors: Introducing errors common when converting physical documents to text, such as character substitutions, spacing problems, or merged words [00:04:44].
  • Irrelevant Context: Supplying a completely unrelated document for the question, to see whether the LLM recognizes the mismatch or attempts to answer anyway [00:05:08].
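
Of the three, the OCR perturbation is the most mechanical and worth a concrete illustration. A rough Python sketch; the specific character confusions and probabilities below are assumptions for illustration, not values from the FinFAIL dataset:

```python
import random

# Substitutions typical of OCR output (illustrative pairs, not from FinFAIL).
OCR_CONFUSIONS = [("l", "1"), ("O", "0"), ("rn", "m"), ("S", "5")]

def add_ocr_noise(text: str, sub_p: float = 0.3, merge_p: float = 0.05,
                  seed: int = 0) -> str:
    """Inject character substitutions and merged words, mimicking a scanned document."""
    rng = random.Random(seed)
    for src, dst in OCR_CONFUSIONS:
        if rng.random() < sub_p:
            text = text.replace(src, dst, 1)  # corrupt one occurrence
    words, merged = text.split(" "), []
    for w in words:
        if merged and rng.random() < merge_p:
            merged[-1] += w                   # drop the space between two words
        else:
            merged.append(w)
    return " ".join(merged)

print(add_ocr_noise("Operating leases fell to 31 million in fiscal 2023."))
```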

Ater ensured diversity in the financial services data used for evaluation, and the FinFAIL white paper, dataset, and leaderboard are open source and available on GitHub and Hugging Face [00:05:30]. The two key evaluation metrics are whether the model provides a correct answer and whether it adheres to the provided context (grounding) [00:05:54].
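
The talk does not detail how these two metrics are scored. The sketch below shows the shape of such an evaluation loop; the containment-based checks are crude stand-ins for whatever judging protocol (often an LLM judge) the benchmark actually uses:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalCase:
    query: str
    context: str              # possibly perturbed, missing, or irrelevant
    reference: Optional[str]  # None when the context cannot support an answer

def is_correct(answer: str, case: EvalCase) -> bool:
    """Stand-in judge: on unanswerable context, declining counts as correct."""
    if case.reference is None:
        return "cannot" in answer.lower() or "not in the" in answer.lower()
    return case.reference.lower() in answer.lower()

def is_grounded(answer: str, case: EvalCase) -> bool:
    """Crude grounding proxy: every numeric token must appear in the context."""
    numbers = [t.strip(".,%$") for t in answer.split() if any(c.isdigit() for c in t)]
    return all(n in case.context for n in numbers)

def score(cases, answers):
    """Aggregate the two headline metrics over a batch of cases."""
    n = len(cases)
    return {
        "correctness": sum(is_correct(a, c) for c, a in zip(cases, answers)) / n,
        "grounding": sum(is_grounded(a, c) for c, a in zip(cases, answers)) / n,
    }
```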

Evaluation Results: The Importance of Grounding

Selected chat and “thinking” models were evaluated [00:06:17]. An initial observation was that thinking models often refuse to answer, which sounds desirable in theory [00:06:52]. In practice, however, when provided with incorrect context, these models frequently fail to stay anchored to it, leading to higher hallucination rates [00:07:02].

While many general and domain-specific models could still provide an answer despite query issues (misspellings, incomplete queries, out-of-domain questions) and scored high on answer correctness [00:07:27], performance declined drastically when it came to context grounding [00:07:41].

Notably, larger “thinking” models yielded worse results on grounding tasks, with accuracy dropping significantly (e.g., 50-70% worse) [00:08:46]. These models often fail to follow the attached context and instead produce answers drawn from outside it [00:09:01]. For domain-specific tasks, smaller models actually performed better at context grounding than these overthinking models [00:09:14]. This suggests that what appears to be “thinking” may simply be a chain-of-thought process that increases hallucination in domain-specific contexts [00:09:26].

A significant gap remains between the robustness of these models and their ability to provide correct, well-grounded answers [00:09:54]. Even the best models achieved only about 81% on robustness and context grounding, meaning nearly one in five requests could be completely wrong [00:10:14].

Conclusion: The Continued Need for Domain-Specific Models

Based on current data and technology, domain-specific models are still essential [00:11:11]. While general model accuracy continues to improve, the ability of general models to follow and correctly use the provided context (grounding) lags significantly behind [00:11:24]. For reliable use today, a full-stack approach is necessary, combining robust systems, grounding mechanisms, and guardrails [00:10:44], along the lines of the sketch below.
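
To make the full-stack point concrete, here is a minimal sketch of a grounding guardrail; `llm` and `judge` are assumed callables wrapping whatever model APIs you use, and this illustrates the pattern rather than Ater’s actual system:

```python
def answer_with_guardrails(query: str, context: str, llm, judge) -> str:
    """Generate an answer, verify it against the context, refuse on failure."""
    draft = llm(
        "Answer ONLY from the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    verdict = judge(
        f"Context:\n{context}\n\nAnswer:\n{draft}\n\n"
        "Is every claim in the answer supported by the context? Reply YES or NO."
    )
    # Ship the answer only when the judge confirms it is grounded.
    if verdict.strip().upper().startswith("YES"):
        return draft
    return "I can't answer that reliably from the provided document."
```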