From: aidotengineer
Developing an evaluation system that operates at scale is crucial for supporting mission-critical decisions, especially in fields like healthcare where there is no room for error [00:00:07]. Companies like Anterior, which serve insurance providers covering 50 million American lives, have had to develop such systems to ensure customer trust [00:00:14].
Challenges of AI at Scale
While creating an MVP (Minimum Viable Product) powered by Large Language Models (LLMs) is becoming increasingly easy, scaling it to serve customers introduces significant challenges [00:00:30]. As request volume grows, so does the number of unforeseen edge cases [00:00:44].
The Problem of Medical Nuance
In the medical industry, an AI system assisting with prior authorization decisions must handle complex clinical nuance [00:00:50]. For instance, an AI might read the phrase “suspicious for MS” as evidence of a suspected diagnosis, even though the patient already has a confirmed diagnosis elsewhere in the record, making the AI’s answer wrong [00:01:34]. Mistakes like this, even if they occur only once in every 1,000 or 10,000 cases, accumulate into a large number of errors when processing over 100,000 cases daily [00:01:54]. In US healthcare, errors in AI automation can lead to lawsuits [00:02:07].
Limitations of Human Reviews
Human reviews of AI outputs are a key method for identifying and handling failure cases [00:02:16]. At Anterior, an internal clinical team uses a review dashboard called “Scalp” to efficiently review medical records, guidelines, and AI-generated answers [00:02:21]. Reviewers can critique incorrect answers, label them, and save them, which can then be used to generate ground truths for offline evaluations [00:02:50].
However, human reviews do not scale [00:03:17]:
- Reviewing 50% of 1,000 daily decisions requires 5 clinicians [00:03:25].
- Maintaining this 50% review rate for 10,000 daily decisions would require 50 clinicians, which is impractical for many companies [00:03:48].
- Even reviewing a smaller subset, like 5% of 100,000 daily decisions, still requires 50 clinicians, as the arithmetic below shows [00:04:13].
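A minimal sketch of that arithmetic, assuming each clinician reviews roughly 100 cases per day (the rate implied by these figures):

```python
# Staffing needed to sustain a given human-review rate.
REVIEWS_PER_CLINICIAN_PER_DAY = 100  # assumption implied by the figures above

def clinicians_needed(daily_decisions: int, review_rate: float) -> float:
    return daily_decisions * review_rate / REVIEWS_PER_CLINICIAN_PER_DAY

print(clinicians_needed(1_000, 0.50))    # 5.0
print(clinicians_needed(10_000, 0.50))   # 50.0
print(clinicians_needed(100_000, 0.05))  # 50.0
```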
This leads to two critical questions: Which cases should be reviewed, and how is the AI performing on cases that are not reviewed [00:04:28]?
Limitations of Offline Eval Datasets
Offline eval datasets, built from human-generated ground truths, are useful for defining gold standards, segmenting data, and tracking performance over time [00:04:39]. They support iterating on AI pipelines [00:05:08].
However, relying solely on offline evaluations is risky [00:05:21]. The input space for medical records is vast and highly heterogeneous [00:05:27]. New edge cases constantly emerge at scale, and waiting for them to appear in offline datasets means it might be “too late” to identify and respond [00:05:14].
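For concreteness, an offline eval over such ground truths might track accuracy per data segment, as in this sketch (the record fields are illustrative):

```python
from collections import defaultdict

def segment_accuracy(predictions: dict[str, str],
                     ground_truths: list[dict]) -> dict[str, float]:
    """Accuracy per data segment against human-labeled ground truths.
    Record fields ("segment", "case_id", "expected") are illustrative."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for gt in ground_truths:
        seg = gt["segment"]  # e.g. procedure type or payer
        totals[seg] += 1
        if predictions.get(gt["case_id"]) == gt["expected"]:
            hits[seg] += 1
    return {seg: hits[seg] / totals[seg] for seg in totals}
```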
Realtime Reference-Free Evaluation Systems
The solution to these scaling and response challenges is a realtime, reference-free evaluation system [00:05:37]. “Reference-free,” also known as “label-free,” means that evaluation occurs before the true outcome is known or a human review has been performed [00:05:44]. This is what enables the system to be real-time, allowing immediate response to issues [00:05:53].
How it Works: LLM as Judge
A key starting point for such an evaluation system is using an LLM as a judge [00:06:07].
- Input and AI Output: Inputs go into the main LLM pipeline, generating an output [00:06:10].
- LLM as Judge: This output is fed into another LLM, acting as a judge, along with a scoring system [00:06:17].
- Scoring Criteria: The scoring system can assess various metrics of the output, such as helpfulness, conciseness, tone, or confidence in correctness [00:06:21]. For binary or multiclass classification outputs, confidence levels can be assigned [00:06:33].
- Confidence Grading: At Anterior, this process yields a “confidence grading” for the AI’s output, ranging from high confidence that the output is correct down to low confidence (likely wrong) [00:07:03]. This can also incorporate other confidence-estimation methods, such as logprob-based approaches [00:06:51].
- Predicted Correct Output: The confidence score can then be converted into a predicted correct output using a threshold, as sketched below [00:07:16].
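A minimal sketch of the judge-plus-threshold flow; the prompt wording and the 1-5 grade scale are illustrative assumptions, not Anterior’s actual implementation, and `call_llm` stands in for whatever model client you use:

```python
# Minimal LLM-as-judge sketch: grade an AI answer, then threshold the grade.
JUDGE_PROMPT = """You are reviewing an AI answer to a prior-authorization question.

Medical record excerpt: {record}
Guideline question: {question}
AI answer: {answer}

Rate your confidence that the answer is correct, from 1 (certainly wrong)
to 5 (certainly correct). Respond with the number only."""

def confidence_grade(record: str, question: str, answer: str, call_llm) -> int:
    """Ask a second LLM to grade the main pipeline's answer."""
    reply = call_llm(JUDGE_PROMPT.format(record=record, question=question, answer=answer))
    return int(reply.strip())

def predicted_correct(grade: int, threshold: int = 4) -> bool:
    """Convert the confidence grade into a predicted-correct flag via a threshold."""
    return grade >= threshold
```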
Applications of Realtime Eval Information
This dual information — confidence grading and predicted output — can be used in several ways:
- Estimated Performance: Estimate performance across all cases in real time, not just those reviewed by humans [00:07:29]. This allows for immediate response and feedback to customers [00:07:45].
- System Alignment: Compare reference-free eval outputs with human reviews to compute the alignment and determine the system’s trustworthiness [00:07:50].
- Dynamic Prioritization for Human Review: Combine confidence grading with contextual factors (e.g., cost of procedure, risk of bias, previous error rates) to dynamically prioritize cases for human review, as in the scoring sketch below [00:08:03]. This identifies the most relevant cases with the highest probability of error [00:08:19].
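One way such a priority score might combine these signals; the weights and scales are illustrative assumptions:

```python
def priority_score(grade: int, procedure_cost: float,
                   bias_risk: float, past_error_rate: float) -> float:
    """Score a case for the human-review queue: low judge confidence plus
    high-stakes context pushes it up. Weights and scales are illustrative."""
    uncertainty = (5 - grade) / 4  # 0.0 (confident) .. 1.0 (likely wrong)
    return (0.5 * uncertainty
            + 0.2 * min(procedure_cost / 10_000, 1.0)  # cap cost contribution
            + 0.15 * bias_risk          # 0..1, e.g. from a bias heuristic
            + 0.15 * past_error_rate)   # 0..1, error rate on similar cases

# Review queue: highest-priority cases first, up to the team's daily capacity.
# queue = sorted(cases, key=lambda c: priority_score(**c), reverse=True)[:capacity]
```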
The Virtuous Cycle: Validating the Validator
This creates a “virtuous cycle” where human reviews validate and improve the system’s performance, and dynamically prioritized cases feed back into the process [00:08:27]. Realtime reference-free evals surface potential problem cases, and human reviews confirm accuracy [00:08:35]. This process, known as “validating the validator,” continually reduces unseen edge cases and improves error detection [00:08:41].
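Alignment here can be measured as simply as an agreement rate between the reference-free eval’s verdicts and the human verdicts on reviewed cases, as in this sketch:

```python
def alignment_rate(eval_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of reviewed cases where the reference-free eval and the human
    reviewer agree. A simple agreement rate; per-class rates or Cohen's kappa
    give a fuller picture."""
    assert len(eval_verdicts) == len(human_verdicts)
    agree = sum(e == h for e, h in zip(eval_verdicts, human_verdicts))
    return agree / len(eval_verdicts)
```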
Incorporating into the Pipeline
Once confidence in the system’s performance is high, the reference-free eval can be directly integrated into the AI pipeline [00:09:01]:
- Inputs go through the original AI pipeline, generating outputs [00:09:06].
- These outputs are passed to the reference-free evals [00:09:12].
- Depending on the eval’s output, the system can either confidently return the response to the customer or take further action [00:09:14].
- Further actions might include sending the case to another LLM pipeline with more expensive models, assigning it to an internal on-call clinician for review, or surfacing it in the customer’s own review dashboard, as in the routing sketch below [00:09:21].
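A sketch of that routing logic; the handler functions are hypothetical placeholders, not a real API:

```python
# Placeholder handlers (hypothetical; replace with real integrations).
def send_to_customer(case, answer): return ("delivered", answer)
def rerun_with_stronger_model(case): return ("escalated", case)
def queue_for_clinician_review(case, answer): return ("queued", case)

def handle_case(case, pipeline, judge, high: int = 4, low: int = 2):
    """Route a case based on the reference-free eval's confidence grade."""
    answer = pipeline(case)
    grade = judge(case, answer)  # e.g. the 1-5 grade from the judge above
    if grade >= high:
        return send_to_customer(case, answer)        # confident: return directly
    if grade >= low:
        return rerun_with_stronger_model(case)       # borderline: costlier pipeline
    return queue_for_clinician_review(case, answer)  # likely wrong: human review
```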
This powerful mechanism ensures customer trust in the AI’s outputs [00:09:38].
Impact and Benefits
At Anterior, the implementation of realtime reference-free evals has yielded significant benefits:
- Reduced Staffing Needs: Avoids hiring an ever-expanding team of expert clinicians. Anterior reviews tens of thousands of cases with a team of fewer than 10 clinical experts, where competitors might hire hundreds of nurses [00:09:49].
- Strong AI-Human Alignment: Achieved very strong alignment between AI and human reviews, comparable to the alignment observed between human reviewers themselves [00:10:08].
- Quick Error Response: Enables rapid identification and response to errors, ensuring customer SLAs (Service Level Agreements) are met [00:10:20].
- Industry-Leading Performance: Resulted in provably industry-leading performance in prior authorization, with an F1 score of nearly 96% in a recent study [00:10:35].
- Enhanced Customer Trust and Love: Led to deep customer trust and satisfaction, exemplified by a nurse expressing relief at being able to continue using the AI [00:10:45].
Principles for Building an Evaluation System
Christopher Lovejoy recommends the following principles for building an effective evaluation system:
- Build a System: Think big. Use review data not just to audit performance, but to build, improve, and validate the evaluation system itself [00:11:04].
- Evaluate on Live Production Data: Do not rely solely on offline evaluations. Identify problems immediately to respond quickly [00:11:19].
- Empower Your Best Reviewers: Prioritize the quality of reviews over quantity, and build custom tooling if it helps you move faster [00:11:27].
This approach creates an evaluation system that provides real-time performance estimates, enables quick and accurate responses, scales to meet demand at low cost, and is powered by a small, focused team of experts [00:11:35]. It supports the transition from MVP to serving customers and maintaining their trust at scale [00:11:51].