From: aidotengineer

Wasim, co-founder and CTO of Rytr, discussed the company’s journey and shared insights from its internal evaluations regarding the efficacy of general versus domain-specific Large Language Models (LLMs) [00:00:20].

Rytr’s Model Development Journey

Rytr, founded in 2020, began by building encoder-decoder models and has since evolved into a “family of models” [00:00:40]. Currently, Rytr has about 16 published models, with another 20 in development [00:00:51]. These models fall into two main categories:

  • General Models: Such as PXP3, PXP4, and the upcoming PX5 [00:01:02].
  • Domain-Specific Models: Including those tailored for creative, financial services, and medical applications [00:01:10].

The Core Question: Are Domain-Specific LLMs Still Necessary?

Around early 2024, a significant trend emerged where general LLMs started achieving very high accuracy on standard benchmarks, often reaching 80-90% [00:01:24]. This raised a crucial question within Rytr: Is it still worthwhile to continue building and investing in domain-specific models if general models are performing so well, or should the focus shift to fine-tuning general models for reasoning and thinking capabilities [00:01:53]?

Evaluation Methodology: The FAIL Benchmark

To answer this question, Rytr built “FAIL” (Financial AI Language Benchmark), an evaluation of LLMs on real-world failure scenarios [00:03:12]. While the approach applies to other domain-specific models (medical, customer support, etc.), the discussion focused primarily on the financial services benchmark [00:02:31].

The FAIL benchmark introduced two main categories of evaluation scenarios (see the illustrative sketch after this list) [00:03:34]:

  1. Query Failure [00:03:40]:

    • Misspelled Queries: Queries containing spelling or grammatical errors [00:03:46].
    • Incomplete Queries: Queries missing keywords or clarity [00:04:03].
    • Out-of-Domain Queries: Queries from non-expert users, or general answers pasted in place of specific questions [00:04:11].
  2. Context Failure [00:04:22]:

    • Missing Context: Asking the LLM a question about context that doesn’t exist in the prompt [00:04:31].
    • OCR Error: Introducing character issues, spacing problems, or merged words typical of Optical Character Recognition (OCR) conversion from physical documents to text [00:04:44].
    • Irrelevant Context: Providing a completely wrong document for a specific question [00:05:08].
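
To make these failure families concrete, below is a minimal sketch of how a subset of them (misspelled, incomplete, OCR-damaged, missing, and irrelevant inputs) might be generated programmatically. The base question/document pair and the helper functions are hypothetical illustrations, not the actual FAIL generation code; the real dataset is hand-curated and published separately, as noted below.

```python
import random

# Illustrative sketch only: the real FAIL dataset is hand-curated; these
# helpers merely show what each perturbation family looks like in code.

def misspell(query: str, rate: float = 0.15) -> str:
    """Query failure: randomly swap adjacent letters to mimic typos."""
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def truncate(query: str, keep: float = 0.6) -> str:
    """Query failure: drop trailing words so key terms go missing."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

def ocr_noise(text: str, rate: float = 0.05) -> str:
    """Context failure: merge words and confuse similar glyphs, as OCR might."""
    swaps = {"l": "1", "O": "0", "m": "rn"}
    out = []
    for ch in text:
        if ch == " " and random.random() < rate:
            continue  # drop a space -> merged words
        out.append(swaps.get(ch, ch) if random.random() < rate else ch)
    return "".join(out)

# Hypothetical base item: a question paired with its supporting document.
base = {
    "question": "What was the reported net interest margin for Q3?",
    "context": "The bank reported a net interest margin of 3.1% for Q3 2023.",
}

scenarios = {
    "misspelled_query":   {**base, "question": misspell(base["question"])},
    "incomplete_query":   {**base, "question": truncate(base["question"])},
    "ocr_context":        {**base, "context": ocr_noise(base["context"])},
    "missing_context":    {**base, "context": ""},
    "irrelevant_context": {**base, "context": "Our cafeteria menu changes weekly."},
}

for name, item in scenarios.items():
    print(f"{name}: {item['question']} | {item['context'][:40]}")
```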

The dataset for this evaluation comprised diverse, financial-services-specific data; the white paper, evaluation set, and leaderboard are openly available on GitHub and Hugging Face [00:05:30]. The two key evaluation metrics were whether the model provided a correct answer and whether it adhered to “context grounding” [00:05:52].
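
Below is a minimal sketch of how these two metrics (answer correctness and context grounding) could be scored. The `grounded_in` heuristic and `evaluate` harness are illustrative assumptions, not the benchmark’s actual scoring code.

```python
# Illustrative scoring sketch, not the benchmark's actual harness.
# `model` is any callable taking (question, context) and returning a string.

def grounded_in(answer: str, context: str) -> bool:
    """Naive grounding check via token overlap; a real judge would use an
    LLM or an entailment model rather than this heuristic."""
    ctx_tokens = set(context.lower().split())
    for sentence in answer.split("."):
        tokens = [t for t in sentence.lower().split() if len(t) > 3]
        if tokens and sum(t in ctx_tokens for t in tokens) / len(tokens) < 0.5:
            return False
    return True

def evaluate(model, items):
    """Score each item on (1) answer correctness and (2) context grounding."""
    results = []
    for item in items:
        answer = model(item["question"], item["context"])
        results.append({
            "correct": item["expected"].lower() in answer.lower(),
            "grounded": grounded_in(answer, item["context"]),
        })
    n = len(results)
    return {
        "accuracy": sum(r["correct"] for r in results) / n,
        "grounding": sum(r["grounded"] for r in results) / n,
    }

if __name__ == "__main__":
    dummy = lambda q, c: c or "I don't know."   # toy model that echoes its context
    data = [{"question": "What was the margin?",
             "context": "The reported margin was 3.1%.",
             "expected": "3.1%"}]
    print(evaluate(dummy, data))
```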

Evaluation Results

Rytr evaluated a group of general chat and “thinking” models against its own domain-specific models [00:06:17].

General LLMs’ Performance

  • Refusal to Answer: Thinking models generally do not refuse to answer questions, which may sound like a strength but becomes a problem when the correct behavior is to abstain [00:06:52].
  • Query Robustness: When presented with “query failures” (misspelled, incomplete, or out-of-domain queries), general models, including reasoning/thinking models, performed “amazingly” and could still provide an answer [00:08:12]. Most general models achieved similar scores in generating an answer [00:07:27].
  • Context Grounding Issues: The significant challenge for general models arose with “context failures” and context grounding [00:07:05].
    • When given incorrect, missing, or irrelevant context, these models failed to stay grounded in the provided material [00:07:07].
    • This led to “way higher hallucination” rates [00:07:16].
    • In tasks like text generation or question answering, general models showed poor performance in grounding [00:07:48].
    • Surprisingly, “bigger, more thinking” models gave the worst results in grounding, with scores often 50-70% worse [00:08:50]. This implies they are not truly “thinking” in domain-specific tasks, leading to high hallucination [00:09:26].

Domain-Specific LLMs’ Performance

  • Superior Context Grounding: The evaluation revealed that “smaller models” (domain-specific ones) performed better than larger, general “thinking” models when it came to grounding and adhering to provided context [00:09:14].

Conclusion: The Continued Need for Domain-Specific LLMs

Despite the increasing accuracy of general LLMs on standard benchmarks, the results from the FAIL benchmark strongly indicate that domain-specific models are still necessary [00:11:11].

Even the best-performing general models in the evaluation achieved only about 81% in robustness and context grounding [00:10:17]. In real-world applications, this means roughly 20% of requests (100% − 81% = 19%, about one request in five) could be completely wrong [00:10:27].

The key takeaway is that while general-model accuracy is improving, their ability to correctly follow and ground answers within a given context is “way, way, way behind” [00:11:27]. Therefore, a full-stack approach combining robust systems, grounding, and guard rails remains crucial for reliable LLM use today [00:10:44].
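
To illustrate the kind of guard rail the talk calls for, here is a minimal sketch that refuses to return an answer when no context is supplied or when a placeholder grounding check fails. The `generate` callable and `supports` heuristic are assumptions for illustration, not any specific product’s API.

```python
# Illustrative guard-rail sketch; `generate` and `supports` are placeholders,
# not any specific product's API.

def supports(context: str, answer: str) -> bool:
    """Placeholder grounding judge; in practice an entailment model or an
    LLM-as-judge would decide whether the context supports the answer."""
    return all(tok in context.lower() for tok in answer.lower().split()[:3])

def answer_with_guardrail(generate, question: str, context: str) -> str:
    """Refuse rather than risk an ungrounded answer."""
    if not context.strip():
        return "I can't answer this: no supporting document was provided."
    draft = generate(question, context)
    if not supports(context, draft):
        return "I can't answer this reliably from the provided document."
    return draft

if __name__ == "__main__":
    echo = lambda q, c: "The reported margin was 3.1%."   # toy generator
    doc = "The bank reported a margin of 3.1% for Q3."
    print(answer_with_guardrail(echo, "What was the margin?", doc))
    print(answer_with_guardrail(echo, "What was the margin?", ""))
```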