From: aidotengineer
Large Language Models (LLMs) can exhibit a phenomenon known as hallucination, where they generate answers that are not grounded in the provided context or factual information. This presents a significant challenge, especially in domain-specific applications where accuracy and grounding are critical [00:07:02].
Defining Hallucination
In the context of LLMs, hallucination refers to cases where a model provides an answer even when it is given incorrect context, wrong data, or entirely different grounding material [00:07:02]. For example, a model handed a document about the wrong company should flag the mismatch rather than answer confidently. Instead of refusing to answer such questions, models tend to respond anyway, and this failure to follow the provided (even incorrect) context leads to significantly higher rates of hallucination [00:07:07].
Evaluation and Observations
To understand the extent of this issue, a specialized evaluation called “FAIL” was developed to assess model performance in real-world scenarios [00:03:17]. This evaluation includes two main categories of failures:
- Query Failure [00:03:40]:
  - Misspelled queries: Questions with spelling errors [00:03:48].
  - Incomplete queries: Missing keywords or unclear information [00:04:03].
  - Out-of-domain queries: Questions outside the model’s expert domain, or general answers applied to specific contexts [00:04:11].
  - Observation: Models tend to perform “amazingly” well on query failures, successfully providing answers even with misspellings, poor grammar, or out-of-domain queries [00:08:12].
- Context Failure [00:04:23]:
  - Missing context: Asking questions about context that does not exist in the prompt [00:04:33].
  - OCR errors: Errors introduced during the conversion of physical documents to text, such as character issues, spacing, or merged words [00:04:44].
  - Irrelevant context: Providing a completely wrong document for a specific question [00:05:08].
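To make these failure categories concrete, below is a minimal Python sketch of how perturbed test cases along these lines could be generated from a clean question/context pair. The helper names and perturbation details are illustrative assumptions, not the FAIL benchmark’s actual implementation (the out-of-domain case is omitted, since it requires a separate question pool).

```python
import random

# Hypothetical helpers for constructing failure cases like those above.
# All names and perturbation details are illustrative assumptions.

def misspell(query: str, rate: float = 0.1) -> str:
    """Swap random adjacent characters to simulate a misspelled query."""
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def truncate(query: str, keep: float = 0.6) -> str:
    """Drop trailing words to simulate an incomplete query."""
    words = query.split()
    return " ".join(words[: max(1, int(len(words) * keep))])

def simulate_ocr_noise(context: str) -> str:
    """Introduce common OCR confusions (l -> 1, O -> 0) and merge some words."""
    noisy = context.replace("l", "1").replace("O", "0")
    return noisy.replace(", ", ",")

def build_failure_cases(question: str, context: str, wrong_context: str) -> dict:
    """Return one perturbed (question, context) pair per failure type."""
    return {
        "query_misspelled":   (misspell(question), context),
        "query_incomplete":   (truncate(question), context),
        "context_missing":    (question, ""),             # context absent from the prompt
        "context_ocr_noise":  (question, simulate_ocr_noise(context)),
        "context_irrelevant": (question, wrong_context),  # a completely wrong document
    }
```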
The Grounding Problem
While general models can achieve high average accuracy, sometimes reaching 80-90% on general benchmarks, the situation changes dramatically when it comes to grounding [00:01:38]. In the “FAIL” evaluations, particularly for financial-services use cases, the results were revealing [00:06:32]:
- Thinking models often do not refuse to answer, even when given wrong context or data [00:06:52].
- This failure to follow the provided context, especially when it is irrelevant or erroneous, leads to significantly higher hallucination rates [00:07:07].
- When it comes to “grounding” and “context grounding,” performance drops sharply [00:07:41]. For tasks like text generation and question answering, the models simply do not perform well [00:07:52].
- Larger, “thinking” models show the worst results in grounding, with performance dropping by 50-70% relative to their ability to provide an answer [00:08:50]. In other words, the model often fails to follow the attached context, producing answers that lie entirely outside it [00:09:01].
The Coherence Trap in Large Language Models
The data suggests that, for domain-specific tasks, these models are “not thinking” during the reasoning stage, which leads to very high hallucination rates [00:09:34].
There is a substantial gap between a model’s robustness and its ability to avoid hallucination while providing a correct answer [00:09:54]. Even the best models in the evaluation did not achieve more than 81% in robustness and context grounding [00:10:17], implying that nearly 20% of requests could be “completely wrong” [00:10:30].
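As a rough illustration of how robustness and context grounding can be reported as separate scores over such an evaluation, the sketch below assumes each record carries a case type (matching the keys from the earlier sketch), the model answer, a reference answer, and the supplied context; the exact-match and token-overlap checks are simplifying assumptions, not the methodology used in the talk.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    case_type: str   # e.g. "query_misspelled", "context_irrelevant"
    answer: str
    reference: str
    context: str

def is_correct(rec: EvalRecord) -> bool:
    """Did the model still produce the reference answer despite the perturbation?"""
    return rec.reference.lower() in rec.answer.lower()

def is_grounded(rec: EvalRecord, min_overlap: float = 0.5) -> bool:
    """Crude check: enough of the answer's tokens appear in the supplied context."""
    answer_tokens = set(rec.answer.lower().split())
    context_tokens = set(rec.context.lower().split())
    return bool(answer_tokens) and (
        len(answer_tokens & context_tokens) / len(answer_tokens) >= min_overlap
    )

def score(records: list[EvalRecord]) -> dict:
    query_cases = [r for r in records if r.case_type.startswith("query_")]
    context_cases = [r for r in records if r.case_type.startswith("context_")]
    return {
        # Robustness: answers correctly even when the query is degraded.
        "robustness": sum(is_correct(r) for r in query_cases) / max(1, len(query_cases)),
        # Context grounding: answers stay inside the supplied context.
        "context_grounding": sum(is_grounded(r) for r in context_cases) / max(1, len(context_cases)),
    }
```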
Implications and Future Needs
The persistent challenges with grounding and context following indicate that, despite improvements in general accuracy and grounding, domain-specific models are still necessary [00:11:13]. To build reliable systems today, a “full stack” approach is required, incorporating:
- Robust systems [00:10:44]
- Effective grounding mechanisms [00:10:49]
- Guard rails around the system [00:10:52]
These components are essential to create a reliable and trustworthy system that can manage the instruction-following shortcomings of language models and mitigate hallucination [00:10:55].
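As one hypothetical illustration of such guard rails, the sketch below wraps a model call with a pre-check that refuses when the supplied context looks irrelevant to the question, and a post-check that rejects drafts not supported by that context. The heuristics, thresholds, and the generate() callback are assumptions for the sake of the example, not a specific product’s implementation.

```python
import re

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance(question: str, context: str) -> float:
    """Fraction of question tokens that also appear in the context."""
    q, c = _tokens(question), _tokens(context)
    return len(q & c) / max(1, len(q))

def is_supported(answer: str, context: str, min_overlap: float = 0.5) -> bool:
    """Crude grounding check: most answer tokens should come from the context."""
    a, c = _tokens(answer), _tokens(context)
    return bool(a) and len(a & c) / len(a) >= min_overlap

def answer_with_guardrails(question: str, context: str, generate) -> str:
    refusal = "I can't answer this from the provided documents."

    # Pre-check: refuse when the supplied context looks irrelevant to the question.
    if relevance(question, context) < 0.2:
        return refusal

    draft = generate(question, context)  # any LLM call goes here

    # Post-check: reject drafts that are not supported by the context.
    return draft if is_supported(draft, context) else refusal
```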