From: aidotengineer

Wasim, co-founder and CTO of Writer, shared insights into the company’s journey and an evaluation of language model performance [00:00:17]. Writer, founded in 2020, views its history as the story of the Transformer, having built many encoder- and decoder-based models along the way [00:00:40]. The company currently has a family of about 16 published models, with another 20 in development [00:00:50].

These models fall into two categories:

  • General models: PX and P3/P4, with P5 coming soon [00:01:02].
  • Domain-specific models: creative, financial services, and medical models [00:01:12].

The Shifting Landscape of LLM Accuracy

By early 2024, a significant trend had emerged: large language models (LLMs) were achieving very high accuracy across general benchmarks [00:01:22]. The average accuracy of good general models rose to between 80% and nearly 90% [00:01:38]. This raised a crucial question within Writer: is it still worthwhile to build and scale domain-specific models when general models can reach roughly 90% accuracy [00:01:53]? The alternative considered was to fine-tune general models and focus on reasoning or thinking models instead [00:02:08].

Evaluation Methodology: FAIL

To answer this question, Writer developed an evaluation framework called “FAIL” (Financial LLM Assessment and Insights Leaderboard), designed to create real-world scenarios for evaluating language models [00:03:07]. While the specific data presented was for the financial services domain, similar results were observed for medical models [00:02:41].

The FAIL evaluation introduced two primary categories of failures:

1. Query Failure

This category focuses on issues within the user’s query [00:03:40]. It includes three subcategories, illustrated in the sketch after the list:

  • Misspelling Queries: User queries containing spelling errors [00:03:46].
  • Incomplete Queries: Queries missing keywords or lacking clarity [00:04:03].
  • Out-of-Domain Queries: Queries that fall outside the target domain, for example users trying to answer domain-specific questions with general knowledge or copy-pasted general-purpose answers [00:04:11].
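To make these query-failure categories concrete, below is a minimal Python sketch of how misspelled and incomplete variants of a query could be produced. The helper names, perturbation heuristics, and example question are illustrative assumptions, not part of the FAIL benchmark’s actual pipeline.

```python
import random

def misspell(query: str, rate: float = 0.1) -> str:
    """Randomly swap adjacent letters to mimic user typos (illustrative only)."""
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def truncate(query: str, keep: float = 0.6) -> str:
    """Drop trailing words to mimic an incomplete, under-specified query."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

original = "What was the company's net interest margin in Q3 2023?"  # hypothetical example
print(misspell(original))
print(truncate(original))
```

Pairing such perturbed queries with the unchanged reference answers is one plausible way to probe robustness to messy user input.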

2. Context Failure

This category focuses on issues in the context provided to the LLM [00:04:23]. It includes three subcategories, illustrated in the sketch after the list:

  • Missing Context: Asking the LLM about information that does not exist in the provided context [00:04:31].
  • OCR Errors: Introducing errors common in Optical Character Recognition (OCR), such as character issues, spacing, or merged words, when converting physical documents to text [00:04:44].
  • Irrelevant Context: Providing a completely wrong or irrelevant document when asking a question about a specific document [00:05:08].
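Context failures can be simulated in the same spirit. The sketch below shows one possible way to introduce OCR-style noise and to swap in an irrelevant document; the confusion table, noise rate, and helper names are assumptions for illustration rather than the benchmark’s actual corruption procedure.

```python
import random

# Hypothetical OCR confusion table; real OCR noise is richer than this.
OCR_CONFUSIONS = {"rn": "m", "l": "1", "O": "0", "S": "5"}

def add_ocr_noise(text: str, rate: float = 0.05) -> str:
    """Simulate OCR artifacts: confusable characters and dropped spaces (merged words)."""
    out, i = [], 0
    while i < len(text):
        two = text[i:i + 2]
        if two in OCR_CONFUSIONS and random.random() < rate:
            out.append(OCR_CONFUSIONS[two])
            i += 2
            continue
        ch = text[i]
        if ch in OCR_CONFUSIONS and random.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        elif ch == " " and random.random() < rate:
            pass  # drop the space so neighbouring words merge
        else:
            out.append(ch)
        i += 1
    return "".join(out)

def swap_in_irrelevant_context(documents: list[str], target_idx: int) -> str:
    """Return a document other than the one the question is actually about."""
    others = [d for i, d in enumerate(documents) if i != target_idx]
    return random.choice(others)
```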

The evaluation data, including the dataset, evaluation set, and leaderboard, is open-sourced and available on GitHub and Hugging Face [00:05:37].
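Assuming the released evaluation set follows the usual Hugging Face `datasets` conventions, loading it might look like the snippet below; the repository id is a placeholder, so substitute the actual identifier from Writer’s GitHub or Hugging Face release.

```python
from datasets import load_dataset  # pip install datasets

# Placeholder repository id: replace with the identifier from Writer's release.
fail_eval = load_dataset("writer/fail-benchmark", split="test")  # hypothetical id

print(fail_eval.column_names)
print(fail_eval[0])
```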

Evaluation Metrics

The evaluation focused on two key metrics [00:05:52], with a rough scoring sketch after the list:

  1. Can the model give the correct answer [00:05:57]?
  2. Can the model properly follow the grounding, i.e., the provided context [00:06:03]?
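The summary does not describe how these two metrics are scored; the stand-in judges below are purely illustrative, using a crude substring check for correctness and a refusal-marker plus token-overlap heuristic for grounding. None of this reflects the benchmark’s actual scoring method.

```python
def judge_correctness(reference: str, answer: str) -> bool:
    """Metric 1: did the model give the correct answer?
    A real evaluation would use an LLM judge or a rubric; substring match is a crude stand-in."""
    return reference.lower() in answer.lower()

REFUSAL_MARKERS = ("cannot find", "not in the provided", "no relevant information")

def judge_grounding(context: str, answer: str) -> bool:
    """Metric 2: did the model follow the grounding?
    Grounded behaviour means answering from the context or refusing when it can't."""
    lowered = answer.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return True  # appropriate refusal when context is missing or irrelevant
    # Crude overlap heuristic: key tokens of the answer should appear in the context.
    tokens = [t for t in lowered.split() if len(t) > 4]
    return bool(tokens) and sum(t in context.lower() for t in tokens) / len(tokens) > 0.5
```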

Results and Findings

The evaluation included a range of models, both “chat models” and “thinking models” [00:06:17].

General Model Performance in Query Handling

When presented with misspelled, incomplete, or out-of-domain queries, the models showed “amazing” performance in providing an answer [00:08:12]. Most general models, including reasoning or thinking models, did not refuse to answer [00:06:55]; their scores were close to one another, with reasoning/thinking models achieving slightly higher scores [00:07:29].

The Crucial Challenge: Grounding and Context Following

However, the results became “very interesting” when evaluating grounding and context following [00:07:41].

  • Poor Grounding: When given wrong context, wrong data, or a completely different grounding, these models, especially the larger “thinking” models, failed to follow the context and provided an answer anyway [00:07:05], leading to significantly more hallucination [00:07:20].
  • Performance Gap: In tasks like text generation and question answering, general models performed poorly on grounding [00:07:48].
  • Smaller Models Outperform: Counter-intuitively, smaller models often grounded better than the larger, “overthinking” models [00:09:14], whose grounding results were the worst, 50-70% lower [00:08:50]. This suggests these models are “not thinking at that stage,” resulting in high hallucination, particularly in financial use cases [00:09:34].
  • Persistent Gap: A significant gap remains between a model’s robustness in answering and its ability to stay grounded and avoid hallucination while getting the answer correct [00:09:54]. Even the best models in the evaluation did not exceed 81% on robustness and context grounding [00:10:14], implying that nearly 20% of requests could be completely wrong [00:10:29].

Conclusion: The Continued Need for Domain-Specific Models

Based on the data, the answer to the initial question — whether to continue building domain-specific models — is yes [00:11:09]. Despite the growing accuracy of general models, their ability to follow context and provide proper grounding is “way, way, way behind” [00:11:27]. For reliable utilization today, a “full stack” system is needed, including guard rails and robust grounding mechanisms built around the LLM [00:10:44].
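The talk does not detail what such a full-stack setup looks like, but a minimal sketch of one guard-rail idea, assuming a generic `llm` callable that takes a prompt and returns text, is to verify grounding in a second pass before returning the answer:

```python
def answer_with_guardrail(llm, question: str, context: str) -> str:
    """Minimal guard-rail sketch: a second pass verifies the answer is supported
    by the retrieved context before it is returned to the user."""
    answer = llm(
        "Answer strictly from the context below. If the context does not contain "
        f"the answer, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    verdict = llm(
        "Does the ANSWER rely only on facts stated in the CONTEXT? Reply YES or NO.\n\n"
        f"CONTEXT:\n{context}\n\nANSWER:\n{answer}"
    )
    if "YES" not in verdict.upper():
        return "I can't answer that reliably from the provided documents."
    return answer
```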