From: aidotengineer

The concepts of AI capability and reliability are distinct, and understanding their differences is crucial for the successful deployment of AI systems, particularly AI agents [01:46:06]. While models may possess high capabilities, achieving real-world reliability requires a focused AI engineering approach [01:59:57].

Defining Capability

Capability, in the context of AI models, refers to what a model *could* do at some point [01:53:51]. For technically minded readers, this often translates to the “pass@K accuracy” of a model: the probability that at least one of the K answers the model outputs is correct [01:59:57]. By this measure, language models are already capable of many things [02:24:06].
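Under the simplifying assumption that each of the K samples succeeds independently with the same probability, pass@K can be sketched in a few lines (this is an idealized illustration, not the exact estimator used by any particular benchmark):

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct,
    given per-sample success probability p (an idealized model of pass@K)."""
    return 1.0 - (1.0 - p) ** k

# A model that is right only 30% of the time per attempt still "passes"
# most problems when allowed eight attempts:
single = pass_at_k(0.30, 1)  # 0.30
eight = pass_at_k(0.30, 8)   # ~0.94
```

This is why pass@K can make a model look far more capable than any single run of it actually is.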

Defining Reliability

Reliability, on the other hand, means getting the answer right every single time [01:50:58]. The bar here is often “five 9s of reliability” (99.999%) [01:57:07]. When AI agents are deployed to make consequential decisions in the real world, reliability is paramount [01:55:58].

The Gap Between Capability and Reliability

A significant challenge in AI development is bridging the gap between a model’s capabilities and its real-world reliability [01:52:10]. Training methods that achieve 90% capability do not necessarily lead to 99.999% reliability [01:57:07]. Believing that high capability automatically translates to a reliable user experience is where real-world products can fail [01:52:10].
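The gap compounds over multi-step tasks. Assuming (hypothetically) that an agent’s steps succeed independently, per-step accuracy multiplies across a task, so a model that looks strong on single calls can still fail most end-to-end runs:

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability an agent completes a multi-step task when every step
    must succeed, assuming independent steps (a simplifying assumption)."""
    return per_step ** steps

# 90% per-step accuracy collapses over a 10-step task:
ninety = chain_reliability(0.90, 10)        # ~0.35
five_nines = chain_reliability(0.99999, 10) # ~0.9999
```

The arithmetic illustrates why 90% capability and 99.999% reliability are qualitatively different engineering targets.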

Consequences of Lacking Reliability

Failure to account for reliability can lead to significant product failures [02:08:52]. For example, if a personal assistant only orders food correctly 80% of the time, this constitutes a catastrophic product failure [02:15:37]. Products like the Humane AI Pin and Rabbit R1 have faced issues because developers did not anticipate that a lack of reliability would lead to product failure [02:05:07].

Challenges in Measuring Reliability

Evaluating AI system performance and reliability is inherently difficult [02:40:02].

  • Evaluating AI agents is “genuinely hard” [02:40:02].
  • Static benchmarks can be misleading [07:18:29].
  • Unlike language models (LLMs) with bounded context windows, agents take open-ended actions, making cost a critical evaluation metric [08:08:09]. Cost must be a “first-class citizen” in all agent evaluations alongside accuracy or performance [08:37:05].
  • Agents are often purpose-built, meaning a single benchmark cannot evaluate all agents (e.g., a coding agent cannot be evaluated by a web agent benchmark) [08:50:07].
  • There is an overreliance on static benchmarks, whose results rarely translate into real-world performance [03:29:05], as seen with Cognition’s Devin agent, which succeeded in only 3 out of 20 tasks over a month of real-world use [01:50:07].
  • Verifiers (like unit tests) can also be imperfect, leading to false positives where incorrect code passes tests, causing model performance curves to bend downwards [04:47:04].
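To make the “cost as a first-class citizen” point concrete, an evaluation harness can report spend alongside accuracy for every run rather than as an afterthought. This is a minimal sketch with hypothetical names (`AgentEvalResult`, `record`), not the evaluation code any specific benchmark uses:

```python
from dataclasses import dataclass

@dataclass
class AgentEvalResult:
    """Accuracy and dollar cost tracked together as co-equal metrics."""
    correct: int = 0
    total: int = 0
    dollars_spent: float = 0.0

    def record(self, passed: bool, cost: float) -> None:
        self.total += 1
        self.correct += int(passed)
        self.dollars_spent += cost

    @property
    def accuracy(self) -> float:
        return self.correct / self.total if self.total else 0.0

# Hypothetical results from three agent runs: (passed?, cost in dollars)
result = AgentEvalResult()
for passed, cost in [(True, 0.12), (False, 0.40), (True, 0.08)]:
    result.record(passed, cost)
# result.accuracy ≈ 0.67, result.dollars_spent ≈ 0.60
```

Reporting both numbers per run makes it visible when an agent buys a small accuracy gain with an unbounded increase in open-ended actions and spend.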

The Role of AI Engineering in Reliability

The core job of an AI engineer is to close the gap between 90% capability and 99.999% reliability [01:59:57]. This requires a shift in mindset, treating AI engineering as a reliability engineering field rather than solely software or machine learning engineering [01:57:07].

It involves:

  • Figuring out the necessary software optimizations and abstractions for working with inherently stochastic components like LLMs [01:29:04].
  • Approaching it as a system design problem, working around the constraints of an inherently stochastic system [01:41:00].
  • Prioritizing fixes for the reliability issues that plague agents built on stochastic models [01:09:00].
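One common system-design pattern for working around a stochastic component is to wrap it in a verify-and-retry loop with an explicit fallback. The sketch below uses hypothetical stand-ins (`flaky_llm_call`, `verify`) for a real model call and a real domain check; note the talk’s caveat that verifiers themselves can be imperfect:

```python
import random
from typing import Optional

def flaky_llm_call(prompt: str) -> str:
    """Stand-in for a stochastic model call; here it succeeds 80% of the time."""
    return "valid answer" if random.random() < 0.8 else "garbled"

def verify(answer: str) -> bool:
    """Domain-specific check (e.g., schema validation or a unit test).
    Real verifiers can pass incorrect outputs, so this is not a guarantee."""
    return answer == "valid answer"

def reliable_call(prompt: str, max_attempts: int = 5) -> Optional[str]:
    """Retry the stochastic call until the verifier accepts, then stop.
    A system-level workaround, not a fix for the underlying model."""
    for _ in range(max_attempts):
        answer = flaky_llm_call(prompt)
        if verify(answer):
            return answer
    return None  # escalate to a human or a deterministic fallback path
```

Retries raise effective reliability but multiply cost, which is exactly why cost must be measured alongside accuracy.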

Historical Precedent

The challenge of reliability in new computing paradigms is not new. The 1946 ENIAC computer, with over 17,000 vacuum tubes, initially failed so often it was unavailable half the time [01:24:00]. The engineers’ primary job for the first two years was to fix these reliability issues to make the computer usable for end users [01:29:04]. Similarly, AI engineers today must focus on ensuring the reliability of this next wave of computing for end users [01:29:04].