From: aidotengineer
Building an evaluation system that operates effectively at scale is crucial, especially for mission-critical decisions like those in healthcare, where errors are unacceptable [00:00:09]. This article, based on learnings from Anterior, which serves insurance providers covering 50 million American lives, explores how real-time, reference-free evaluations make it possible to evaluate AI agents effectively at scale and to earn customer trust [00:00:14].
Challenges in Scaling AI Products
While creating a Minimum Viable Product (MVP) powered by Large Language Models (LLMs) is becoming increasingly straightforward [00:00:30], transitioning from MVP to serving customers at scale introduces significant problems that are not apparent until high volumes are reached [00:00:38]. As request volume grows, so does the number of unforeseen edge cases [00:00:44].
Example: Medical Prior Authorization
At Anterior, the core product supports prior authorization decisions for treatment requests, determining if a treatment should be approved or reviewed by a clinician [00:00:50]. The AI processes medical records and guidelines to answer specific questions, such as whether a patient had a previous brain MRI suspicious for Multiple Sclerosis (MS) to decide on a cervical spine MRI [00:01:00].
An AI output might state that a brain MRI confirmed prior findings suspicious for MS [00:01:15]. That phrasing, however, misses key medical nuance: in a medical context, “suspicious” implies there is no confirmed diagnosis, so if the patient already has a confirmed diagnosis, describing the findings as “suspicious” is incorrect [00:01:39]. Such mistakes might occur only once every thousand or ten thousand cases [00:01:54]. When processing over 100,000 cases daily, that still translates into a significant number of errors that must be identified [00:01:58]. Mistakes of this nature are unacceptable in healthcare, where organizations face lawsuits for inappropriate AI automation [00:02:04].
Human Reviews and Their Limitations
A primary strategy to identify and handle failure cases is performing human reviews of AI outputs [00:02:13]. Anterior built an internal clinical team and specialized tooling, “Scalp,” to facilitate these reviews [00:02:21]. This dashboard provides reviewers with all necessary context (medical record, guideline) alongside the question and AI answer, enabling quick and effective reviews [00:02:30]. Reviewers can critique incorrect answers, label them, and save them [00:02:50]. These critiques can then be used to generate “ground truths,” which are descriptions of the correct answers, useful for offline evaluations [00:03:00].
However, human reviews do not scale effectively [00:03:17]:
- Reviewing 50% of 1,000 daily cases requires 5 clinicians [00:03:23].
- Scaling to 10,000 daily cases at the same 50% review rate would require 50 clinicians, which is often larger than an entire company [00:03:46].
- Even reducing the review rate to 5% at 100,000 daily cases still requires 50 clinicians [00:04:13]. (The sketch after this list spells out the arithmetic.)
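As a rough illustration, the sketch below reproduces this headcount arithmetic. The throughput of 100 reviews per clinician per day is an assumption implied by the figures above (500 daily reviews handled by 5 clinicians), not a number stated explicitly.

```python
import math

# Assumption: ~100 reviews per clinician per day, implied by the figures
# above (500 daily reviews handled by 5 clinicians).
REVIEWS_PER_CLINICIAN_PER_DAY = 100

def clinicians_needed(daily_cases: int, review_rate: float) -> int:
    """Headcount required to review a given fraction of daily cases."""
    daily_reviews = daily_cases * review_rate
    return math.ceil(daily_reviews / REVIEWS_PER_CLINICIAN_PER_DAY)

print(clinicians_needed(1_000, 0.50))    # -> 5
print(clinicians_needed(10_000, 0.50))   # -> 50
print(clinicians_needed(100_000, 0.05))  # -> 50
```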
This leads to two key questions: Which cases should be reviewed, and how is performance measured for cases not reviewed by humans [00:04:28]?
Offline Eval Datasets
Offline evaluation datasets are built from generated ground truths and live outside the product [00:04:39]. They are helpful for defining gold standard datasets, segmenting performance by enterprise or medical type, plotting performance over time, and iterating AI pipelines [00:04:54].
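As a minimal sketch of what such an offline eval might look like, assuming a hypothetical case schema (the field names are illustrative, not Anterior’s actual data model), accuracy can be computed against the ground truths overall and segmented by enterprise:

```python
from collections import defaultdict

# Hypothetical offline eval set: each case pairs the pipeline's output with a
# ground truth derived from a clinician critique. Field names are illustrative.
eval_set = [
    {"enterprise": "payer_a", "ground_truth": "approve",  "pipeline_output": "approve"},
    {"enterprise": "payer_a", "ground_truth": "escalate", "pipeline_output": "approve"},
    {"enterprise": "payer_b", "ground_truth": "escalate", "pipeline_output": "escalate"},
]

def accuracy(cases: list[dict]) -> float:
    """Fraction of cases where the pipeline output matches the ground truth."""
    return sum(c["pipeline_output"] == c["ground_truth"] for c in cases) / len(cases)

# Overall performance, then performance segmented by enterprise.
print(f"overall: {accuracy(eval_set):.2f}")
segments: dict[str, list[dict]] = defaultdict(list)
for case in eval_set:
    segments[case["enterprise"]].append(case)
for enterprise, cases in segments.items():
    print(f"{enterprise}: {accuracy(cases):.2f}")
```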
However, relying solely on offline evals is risky [00:05:21]. The input space for medical records is vast and heterogeneous, meaning new edge cases will constantly emerge at scale [00:05:26]. Waiting for new edge cases to appear in a manually built dataset, downstream of customer interaction, means it could be too late to respond to issues [00:05:11].
Solution: Real-time Reference-Free Evaluation System
The solution to the scalability challenges of human reviews and the limitations of offline evals is a real-time, reference-free evaluation system [00:05:39].
Reference-free (or label-free) means evaluating an AI output before the true outcome is known or a human review has occurred [00:05:44]. This enables real-time response to issues as they arise [00:05:51].
LLM as Judge
A great starting point for reference-free evaluation is using an LLM as a judge [00:06:07]. The process involves:
- Inputs go into the LLM pipeline being evaluated, producing outputs [00:06:10].
- These outputs are fed into an “LLM as judge” along with a scoring system [00:06:15].
- The scoring system can assess various factors like helpfulness, conciseness, tone, or confidence in correctness [00:06:21]. For binary or multi-class classifications, it can assess confidence levels [00:06:31].
At Anterior, with a binary output (approval or escalation for review), the reference-free eval system (which can include an LLM as judge or logic-based methods) provides a “confidence grading” – how confident the system is that the LLM output is correct [00:06:37]. This confidence grading can range from high confidence to low confidence, indicating potential errors [00:07:03]. A threshold is then used to predict the correct output [00:07:16].
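A minimal sketch of this pattern is shown below, assuming a hypothetical call_llm helper standing in for whatever model client is used; the prompt wording, 0–100 scale, and 0.9 threshold are illustrative assumptions rather than Anterior’s actual implementation.

```python
JUDGE_PROMPT = """You are reviewing an AI prior authorization decision.
Medical record: {record}
Guideline: {guideline}
Question: {question}
AI answer: {answer}

On a scale of 0 to 100, how confident are you that the AI answer is correct?
Reply with the number only."""

def call_llm(prompt: str) -> str:
    """Placeholder: call your LLM provider of choice and return the reply text."""
    raise NotImplementedError

def confidence_grading(record: str, guideline: str, question: str, answer: str) -> float:
    """Reference-free eval: grade the pipeline's answer before any human review."""
    reply = call_llm(JUDGE_PROMPT.format(
        record=record, guideline=guideline, question=question, answer=answer))
    return float(reply.strip()) / 100.0  # normalize to 0.0-1.0

def predict_correct(confidence: float, threshold: float = 0.9) -> bool:
    """Apply a threshold to the confidence grading to predict whether the output is correct."""
    return confidence >= threshold
```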
Utilizing Reference-Free Eval Information
The confidence grading and predicted correct outputs from reference-free evals can be used in several ways:
- Performance Estimation: Estimated performance can be computed in real time across all cases, not just those reviewed by humans [00:07:29]. This allows for immediate response and feedback to customers [00:07:45].
- System Alignment: By comparing outputs from human reviews and reference-free evals on cases that have both, the alignment and trustworthiness of the automated system can be computed [00:07:50].
- Dynamic Prioritization for Human Review: Confidence grading can be combined with contextual factors (e.g., cost of procedure, risk of bias, previous error rates) to dynamically prioritize cases for human review [00:08:02]. This ensures that the most relevant cases, those with the highest probability of error, are reviewed by experts [00:08:18]; a sketch of such a prioritization score follows this list.
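The prioritization could be scored roughly as in the sketch below. The weights and contextual factors are illustrative assumptions, chosen only to show how a confidence grading and case context might combine into a review queue, not tuned or confirmed values.

```python
from dataclasses import dataclass

@dataclass
class CaseSignal:
    case_id: str
    confidence: float        # confidence grading from the reference-free eval (0.0-1.0)
    procedure_cost: float    # contextual factor: dollar cost of the procedure
    bias_risk: float         # contextual factor: estimated risk of bias (0.0-1.0)
    past_error_rate: float   # contextual factor: error rate on similar cases (0.0-1.0)

def review_priority(c: CaseSignal) -> float:
    """Higher score = more important to surface for expert human review.
    Weights are illustrative, not tuned values."""
    uncertainty = 1.0 - c.confidence
    return (
        0.5 * uncertainty
        + 0.2 * min(c.procedure_cost / 10_000, 1.0)  # cap the cost contribution
        + 0.15 * c.bias_risk
        + 0.15 * c.past_error_rate
    )

queue = [
    CaseSignal("case-001", confidence=0.95, procedure_cost=800, bias_risk=0.1, past_error_rate=0.02),
    CaseSignal("case-002", confidence=0.55, procedure_cost=12_000, bias_risk=0.4, past_error_rate=0.10),
]
# Review the highest-priority (most likely erroneous, highest-impact) cases first.
for case in sorted(queue, key=review_priority, reverse=True):
    print(case.case_id, round(review_priority(case), 3))
```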
This creates a virtuous cycle: reference-free evals surface potential issues, human reviews determine accuracy, and this process continuously validates and improves the evaluation system itself [00:08:26]. Over time, the number of unseen edge cases diminishes, and the ability to detect them improves, building a robust, difficult-to-replicate system [00:08:44].
Incorporating Evals into the AI Pipeline
Once confidence in the system’s performance is high, the evaluation system can be integrated directly into the AI pipeline [00:09:00]:
- Inputs pass through the primary LLM pipeline, generating outputs [00:09:05].
- These outputs are then fed into the reference-free evaluation system [00:09:10].
- Based on the eval output, the system can either confidently return the result directly to the customer or trigger a further action [00:09:14].
- Further actions might include sending the case to another LLM pipeline (perhaps with more expensive models), escalating it to an internal clinician for review, or surfacing it in the customer’s review dashboard for their team to review [00:09:21].
This powerful mechanism ensures customer trust in the delivered outputs [00:09:38].
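A rough sketch of this routing logic is shown below, reusing the hypothetical confidence_grading helper from the earlier sketch. The thresholds and the ordering of the fallback tiers are illustrative assumptions; the source only lists the possible actions, not how they are chosen.

```python
from enum import Enum, auto

class Route(Enum):
    RETURN_TO_CUSTOMER = auto()    # confident: deliver the result directly
    RETRY_STRONGER_MODEL = auto()  # re-run through a more expensive LLM pipeline
    INTERNAL_CLINICIAN = auto()    # escalate to an internal clinician for review
    CUSTOMER_DASHBOARD = auto()    # surface in the customer's review dashboard

def route_output(confidence: float) -> Route:
    """Decide what happens to a pipeline output based on its confidence grading.
    Thresholds are illustrative assumptions, not tuned values."""
    if confidence >= 0.90:
        return Route.RETURN_TO_CUSTOMER
    if confidence >= 0.70:
        return Route.RETRY_STRONGER_MODEL
    if confidence >= 0.50:
        return Route.INTERNAL_CLINICIAN
    return Route.CUSTOMER_DASHBOARD

print(route_output(0.97).name)  # RETURN_TO_CUSTOMER
print(route_output(0.62).name)  # INTERNAL_CLINICIAN
```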
Impact at Anterior
Implementing this system provided significant benefits for Anterior:
- Reduced need for large review teams: Anterior did not need to hire an ever-expanding team of clinicians. While a competitor hired over 800 nurses for reviews, Anterior reviews tens of thousands of cases with a team of fewer than 10 clinical experts [00:09:49].
- Strong AI-human alignment: After multiple iterations, strong alignment was achieved between AI and human reviews, comparable to alignment between human reviewers themselves [00:10:08].
- Quick error identification and response: The system enables rapid identification and correction of errors, ensuring timely responses and meeting customer Service Level Agreements (SLAs) [00:10:19].
- Industry-leading performance: Anterior achieved provably industry-leading performance in prior authorization, with an F1 score of nearly 96% in a recent study [00:10:35].
- Enhanced customer trust and love: This performance led to customer trust and even “love” for their product [00:10:44].
Key Principles for Building an Eval System
Based on their experience, Anterior recommends the following principles for building an effective evaluation system:
- Build a System, Not Just Audit: Think big. Use review data not just to audit performance, but to build, audit, and improve the auditing (evaluation) system itself [00:11:04].
- Evaluate on Live Production Data: Do not solely rely on offline evaluations. Identify problems immediately to enable quick responses [00:11:19].
- Empower the Best Reviewers: Prioritize the quality of human reviews over quantity, and build custom tooling if it accelerates the process [00:11:27].
This approach results in an evaluation system that provides real-time performance estimates, ensures accuracy, scales to meet demand while maintaining low costs, and is powered by a small, focused team of experts [00:11:35]. It enables the transition from MVP to serving customers and maintaining their trust at scale [00:11:50].