From: aidotengineer

Large Language Models (LLMs) can exhibit a phenomenon known as hallucination, where they generate answers that are not grounded in the provided context or factual information. This presents a significant challenge, especially in domain-specific applications where accuracy and grounding are critical [00:07:02].

Defining Hallucination

In the context of LLMs, hallucination refers to instances where a model confidently provides an answer even when it is given incorrect context, wrong data, or grounding from an entirely different domain [00:07:02]. Rather than refusing to answer, the model produces a response that does not follow the provided (even incorrect) context, which leads to significantly higher rates of hallucination [00:07:07].

Evaluation and Observations

To understand the extent of this issue, a specialized evaluation called “FAIL” was developed to assess model performance in real-world scenarios [00:03:17]. This evaluation includes two main categories of failures (a sketch of how such test cases might be constructed follows the list):

  1. Query Failure [00:03:40]:

    • Misspelling queries: Questions with spelling errors [00:03:48].
    • Incomplete queries: Missing keywords or unclear information [00:04:03].
    • Out-of-domain queries: Questions outside the model’s expert domain, or general answers applied to specific contexts [00:04:11].
    • In the evaluation, models handled query failures “amazingly” well, successfully providing answers despite misspellings, incorrect grammar, and out-of-domain queries [00:08:12].
  2. Context Failure [00:04:23]:

    • Missing context: Asking questions about context that does not exist in the prompt [00:04:33].
    • OCR errors: Errors introduced during the conversion of physical documents to text, such as character issues, spacing, or merged words [00:04:44].
    • Irrelevant context: Providing a completely wrong document for a specific question [00:05:08].
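As an illustration of how such test cases could be built (the talk does not show the actual FAIL test sets), the sketch below derives one perturbed case per failure category from a clean question-context pair. The helper names, noise rates, and case schema are assumptions made for this example.

```python
import random


def misspell(query: str, rate: float = 0.15, seed: int = 0) -> str:
    """Swap adjacent letters at random to simulate a misspelled query."""
    rng = random.Random(seed)
    chars = list(query)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def add_ocr_noise(text: str) -> str:
    """Simulate OCR damage: common character confusions and merged words."""
    for src, dst in {"l": "1", "O": "0", "rn": "m"}.items():
        text = text.replace(src, dst, 2)   # introduce a few confusions
    return text.replace(" ", "", 3)        # merge a few words


def build_cases(question: str, context: str, unrelated_context: str) -> list[dict]:
    """One test case per FAIL-style failure category (hypothetical schema)."""
    return [
        {"type": "misspelled_query",   "question": misspell(question),             "context": context},
        {"type": "incomplete_query",   "question": " ".join(question.split()[:3]), "context": context},
        {"type": "out_of_domain",      "question": "What is the boiling point of water?", "context": context},
        {"type": "missing_context",    "question": question,                       "context": ""},
        {"type": "ocr_context",        "question": question,                       "context": add_ocr_noise(context)},
        {"type": "irrelevant_context", "question": question,                       "context": unrelated_context},
    ]
```

Each case is then posed to the model under test, and its answers are scored for both correctness and whether they stay grounded in the supplied context.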

The Grounding Problem

While general models can achieve high average accuracy, sometimes reaching 80-90% on general benchmarks, the situation changes dramatically when it comes to grounding [00:01:38]. In the “FAIL” evaluation, particularly on financial-services use cases, the models showed a consistent pattern [00:06:32]:

  • Thinking models often do not refuse to answer, even when given wrong context or data [00:06:52].
  • This failure to follow the provided context, especially when it is irrelevant or erroneous, leads to significantly higher hallucination rates [00:07:07].
  • On “grounding” and “context grounding” tasks, performance drops sharply [00:07:41]; for text generation and question answering in particular, the models simply do not perform well [00:07:52].
  • Larger “thinking” models show the worst results in grounding, with performance dropping by 50-70% relative to their ability to provide an answer [00:08:50]. In practice the model often ignores the attached context and produces answers that lie entirely outside it [00:09:01]; a simple check for this kind of failure is sketched after this list.
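One lightweight way to surface answers that sit outside the supplied context is a support check over the answer text. The function below is a crude lexical sketch of that idea, not the judging method used in the evaluation; the 0.7 threshold and the short-word filter are arbitrary choices for illustration.

```python
import re


def support_score(answer: str, context: str) -> float:
    """Fraction of answer sentences whose content words mostly appear in the
    context. A rough lexical proxy for context grounding, not a real judge."""
    ctx_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    supported = 0
    for sent in sentences:
        words = [w for w in re.findall(r"[a-z0-9]+", sent.lower()) if len(w) > 3]
        if words and sum(w in ctx_words for w in words) / len(words) >= 0.7:
            supported += 1
    return supported / len(sentences)


# An answer that lies entirely outside the provided context scores near 0,
# flagging a likely grounding failure even when the model sounded confident.
```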

The Coherence Trap in Large Language Models

The data suggests that on domain-specific tasks these models are effectively “not thinking” during their reasoning stage, which leads to very high hallucination rates [00:09:34].

There is a substantial gap between a model’s robustness and its ability to avoid hallucination while providing a correct answer [00:09:54]. Even the best models in the evaluation did not achieve more than 81% in robustness and context grounding [00:10:17], implying that nearly 20% of requests could be “completely wrong” [00:10:30].
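To make that gap concrete, a per-case evaluation can be aggregated into the two numbers being contrasted here: how often the model answers correctly despite a noisy query (robustness) and how often its answer stays inside the supplied context, or appropriately refuses when none is given (grounding). The sketch below assumes hypothetical field names on the result records.

```python
def summarize(results: list[dict]) -> dict:
    """Aggregate per-case booleans into robustness and grounding rates.

    Each result is assumed to carry 'answered_correctly' (did the model cope
    with the noisy query?) and 'grounded' (did its answer stay within the
    supplied context, or correctly refuse when no context was given?).
    """
    if not results:
        return {"robustness": 0.0, "grounding": 0.0, "gap": 0.0}
    n = len(results)
    robustness = sum(r["answered_correctly"] for r in results) / n
    grounding = sum(r["grounded"] for r in results) / n
    # A large positive gap means the model answers readily but strays from its context.
    return {"robustness": robustness, "grounding": grounding, "gap": robustness - grounding}
```

With grounding capped at roughly 81% for the best models, the gap term makes visible how often a confident answer is not actually supported by the context.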

Implications and Future Needs

The persistent challenges with grounding and context following indicate that, despite improvements in general accuracy and grounding, domain-specific models are still necessary [00:11:13]. Building a reliable system today requires a “full stack” approach around the model rather than the model alone; that combination is what produces a trustworthy system able to manage the difficulties language models have with instruction following and to mitigate hallucination [00:10:55].