From: aidotengineer
Developing an evaluation system that operates at scale is crucial for supporting mission-critical decisions, especially in fields like healthcare where there is no room for error [00:00:07]. Companies like Anterior, which serve insurance providers covering 50 million American lives, have had to develop such systems to ensure customer trust [00:00:14].
Challenges of AI at Scale
While creating an MVP (Minimum Viable Product) powered by Large Language Models (LLMs) is becoming increasingly easy, scaling it to serve customers introduces significant challenges [00:00:30]. As request volume grows, so does the number of unforeseen edge cases [00:00:44].
The Problem of Medical Nuance
In the medical industry, an AI system assisting with prior authorization decisions must handle complex clinical nuance [00:00:50]. For instance, an AI might read the phrase “suspicious for MS” as evidence of a suspected diagnosis, even though the patient already has a confirmed diagnosis elsewhere in the record, making the AI’s answer wrong [00:01:34]. Mistakes like this, even if they occur only once in every 1,000 or 10,000 cases, accumulate into a large number of errors when processing over 100,000 cases daily [00:01:54]. In US healthcare, errors in AI automation can lead to lawsuits [00:02:07].
Limitations of Human Reviews
Human reviews of AI outputs are a key method for identifying and handling failure cases [00:02:16]. At Anterior, an internal clinical team uses a review dashboard called “Scalp” to efficiently review medical records, guidelines, and AI-generated answers [00:02:21]. Reviewers can critique incorrect answers, label them, and save them, which can then be used to generate ground truths for offline evaluations [00:02:50].
However, human reviews do not scale [00:03:17]:
- Reviewing 50% of 1,000 daily decisions requires 5 clinicians [00:03:25].
- Maintaining this 50% review rate for 10,000 daily decisions would require 50 clinicians, which is impractical for many companies [00:03:48].
- Even reviewing a smaller subset, like 5% of 100,000 daily decisions, still requires 50 clinicians, as the arithmetic below shows [00:04:13].
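A minimal sketch of that arithmetic, assuming each clinician reviews roughly 100 cases per day (the rate implied by these figures):

```python
# Staffing needed to sustain a given human-review rate.
REVIEWS_PER_CLINICIAN_PER_DAY = 100  # assumption implied by the figures above

def clinicians_needed(daily_decisions: int, review_rate: float) -> float:
    return daily_decisions * review_rate / REVIEWS_PER_CLINICIAN_PER_DAY

print(clinicians_needed(1_000, 0.50))    # 5.0
print(clinicians_needed(10_000, 0.50))   # 50.0
print(clinicians_needed(100_000, 0.05))  # 50.0
```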
This leads to two critical questions: Which cases should be reviewed, and how is the AI performing on cases that are not reviewed [00:04:28]?
Limitations of Offline Eval Datasets
Offline eval datasets, built from human-generated ground truths, are useful for defining gold standards, segmenting data, and tracking performance over time [00:04:39]. They support iterating on AI pipelines [00:05:08].
However, relying solely on offline evaluations is risky [00:05:21]. The input space for medical records is vast and highly heterogeneous [00:05:27]. New edge cases constantly emerge at scale, and waiting for them to appear in offline datasets means it might be “too late” to identify and respond [00:05:14].
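For concreteness, an offline eval over such ground truths might track accuracy per data segment, as in this sketch (the record fields are illustrative):

```python
from collections import defaultdict

def segment_accuracy(predictions: dict[str, str],
                     ground_truths: list[dict]) -> dict[str, float]:
    """Accuracy per data segment against human-labeled ground truths.
    Record fields ("segment", "case_id", "expected") are illustrative."""
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for gt in ground_truths:
        seg = gt["segment"]  # e.g. procedure type or payer
        totals[seg] += 1
        if predictions.get(gt["case_id"]) == gt["expected"]:
            hits[seg] += 1
    return {seg: hits[seg] / totals[seg] for seg in totals}
```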
Realtime Reference-Free Evaluation Systems
The solution to these scaling and response challenges is a realtime, reference-free evaluation system [00:05:37]. “Reference-free,” also known as “label-free,” means that evaluation occurs before the true outcome is known or a human review has been performed [00:05:44]. This is what enables the system to be real-time, allowing immediate response to issues [00:05:53].
How it Works: LLM as Judge
A key starting point for such an evaluation system is using an LLM as a judge [00:06:07].
- Input and AI Output: Inputs go into the main LLM pipeline, generating an output [00:06:10].
- LLM as Judge: This output is fed into another LLM, acting as a judge, along with a scoring system [00:06:17].
- Scoring Criteria: The scoring system can assess various metrics of the output, such as helpfulness, conciseness, tone, or confidence in correctness [00:06:21]. For binary or multiclass classification outputs, confidence levels can be assigned [00:06:33].
- Confidence Grading: At Anterior, this process yields a “confidence grading” for the AI’s output, ranging from high confidence that the output is correct down to low confidence (likely wrong) [00:07:03]. This can also incorporate other confidence-estimation methods, such as logprob-based approaches [00:06:51].
- Predicted Correct Output: The confidence score can then be converted into a predicted correct output using a threshold, as sketched below [00:07:16].
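A minimal sketch of the judge-plus-threshold flow; the prompt wording and the 1-5 grade scale are illustrative assumptions, not Anterior’s actual implementation, and `call_llm` stands in for whatever model client you use:

```python
# Minimal LLM-as-judge sketch: grade an AI answer, then threshold the grade.
JUDGE_PROMPT = """You are reviewing an AI answer to a prior-authorization question.

Medical record excerpt: {record}
Guideline question: {question}
AI answer: {answer}

Rate your confidence that the answer is correct, from 1 (certainly wrong)
to 5 (certainly correct). Respond with the number only."""

def confidence_grade(record: str, question: str, answer: str, call_llm) -> int:
    """Ask a second LLM to grade the main pipeline's answer."""
    reply = call_llm(JUDGE_PROMPT.format(record=record, question=question, answer=answer))
    return int(reply.strip())

def predicted_correct(grade: int, threshold: int = 4) -> bool:
    """Convert the confidence grade into a predicted-correct flag via a threshold."""
    return grade >= threshold
```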
Applications of Realtime Eval Information
This dual information — confidence grading and predicted output — can be used in several ways:
- Estimated Performance: Estimate performance across all cases in real time, not just those reviewed by humans [00:07:29]. This allows for immediate response and feedback to customers [00:07:45].
- System Alignment: Compare reference-free eval outputs with human reviews to compute the alignment and determine the system’s trustworthiness [00:07:50].
- Dynamic Prioritization for Human Review: Combine confidence grading with contextual factors (e.g., cost of procedure, risk of bias, previous error rates) to dynamically prioritize cases for human review, as in the scoring sketch below [00:08:03]. This identifies the most relevant cases with the highest probability of error [00:08:19].
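One way such a priority score might combine these signals; the weights and scales are illustrative assumptions:

```python
def priority_score(grade: int, procedure_cost: float,
                   bias_risk: float, past_error_rate: float) -> float:
    """Score a case for the human-review queue: low judge confidence plus
    high-stakes context pushes it up. Weights and scales are illustrative."""
    uncertainty = (5 - grade) / 4  # 0.0 (confident) .. 1.0 (likely wrong)
    return (0.5 * uncertainty
            + 0.2 * min(procedure_cost / 10_000, 1.0)  # cap cost contribution
            + 0.15 * bias_risk          # 0..1, e.g. from a bias heuristic
            + 0.15 * past_error_rate)   # 0..1, error rate on similar cases

# Review queue: highest-priority cases first, up to the team's daily capacity.
# queue = sorted(cases, key=lambda c: priority_score(**c), reverse=True)[:capacity]
```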
The Virtuous Cycle: Validating the Validator
This creates a “virtuous cycle” where human reviews validate and improve the system’s performance, and dynamically prioritized cases feed back into the process [00:08:27]. Realtime reference-free evals surface potential problem cases, and human reviews confirm accuracy [00:08:35]. This process, known as “validating the validator,” continually reduces unseen edge cases and improves error detection [00:08:41].
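Alignment here can be measured as simply as an agreement rate between the reference-free eval’s verdicts and the human verdicts on reviewed cases, as in this sketch:

```python
def alignment_rate(eval_verdicts: list[bool], human_verdicts: list[bool]) -> float:
    """Fraction of reviewed cases where the reference-free eval and the human
    reviewer agree. A simple agreement rate; per-class rates or Cohen's kappa
    give a fuller picture."""
    assert len(eval_verdicts) == len(human_verdicts)
    agree = sum(e == h for e, h in zip(eval_verdicts, human_verdicts))
    return agree / len(eval_verdicts)
```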
Incorporating into the Pipeline
Once confidence in the system’s performance is high, the reference-free eval can be directly integrated into the AI pipeline [00:09:01]:
- Inputs go through the original AI pipeline, generating outputs [00:09:06].
- These outputs are passed to the reference-free evals [00:09:12].
- Depending on the eval’s output, the system can either confidently return the response to the customer or take further action [00:09:14].
- Further actions might include sending the case to another LLM pipeline with more expensive models, assigning it to an internal on-call clinician for review, or surfacing it in the customer’s own review dashboard, as in the routing sketch below [00:09:21].
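A sketch of that routing logic; the handler functions are hypothetical placeholders, not a real API:

```python
# Placeholder handlers (hypothetical; replace with real integrations).
def send_to_customer(case, answer): return ("delivered", answer)
def rerun_with_stronger_model(case): return ("escalated", case)
def queue_for_clinician_review(case, answer): return ("queued", case)

def handle_case(case, pipeline, judge, high: int = 4, low: int = 2):
    """Route a case based on the reference-free eval's confidence grade."""
    answer = pipeline(case)
    grade = judge(case, answer)  # e.g. the 1-5 grade from the judge above
    if grade >= high:
        return send_to_customer(case, answer)        # confident: return directly
    if grade >= low:
        return rerun_with_stronger_model(case)       # borderline: costlier pipeline
    return queue_for_clinician_review(case, answer)  # likely wrong: human review
```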
This powerful mechanism ensures customer trust in the AI’s outputs [00:09:38].
Impact and Benefits
At Anterior, the implementation of realtime reference-free evals has yielded significant benefits:
- Reduced Staffing Needs: Avoids hiring an ever-expanding team of expert clinicians. Anterior reviews tens of thousands of cases with a team of fewer than 10 clinical experts, where competitors might hire hundreds of nurses [00:09:49].
- Strong AI-Human Alignment: Achieved very strong alignment between AI and human reviews, comparable to the alignment observed between human reviewers themselves [00:10:08].
- Quick Error Response: Enables rapid identification and response to errors, ensuring customer SLAs (Service Level Agreements) are met [00:10:20].
- Industry-Leading Performance: Resulted in provably industry-leading performance in prior authorization, with an F1 score of nearly 96% in a recent study [00:10:35].
- Enhanced Customer Trust and Love: Led to deep customer trust and satisfaction, exemplified by a nurse expressing relief at being able to continue using the AI [00:10:45].
Principles for Building an Evaluation System
Christopher Lovejoy recommends the following principles for building an effective evaluation system:
- Build a System: Think big. Use review data not just to audit performance, but to build, improve, and validate the evaluation system itself [00:11:04].
- Evaluate on Live Production Data: Do not rely solely on offline evaluations. Identify problems immediately to respond quickly [00:11:19].
- Empower Your Best Reviewers: Prioritize the quality of reviews over quantity, and build custom tooling if it helps you move faster [00:11:27].
This approach creates an evaluation system that provides real-time performance estimates, enables quick and accurate responses, scales to meet demand at low cost, and is powered by a small, focused team of experts [00:11:35]. It supports the transition from MVP to serving customers and maintaining their trust at scale [00:11:51].