From: aidotengineer
AI in healthcare applications, particularly in areas like prior authorization decisions, requires robust evaluation systems to ensure accuracy and build trust [00:00:09]. Christopher Lovejoy, an AI engineer and former medical doctor, highlights that in healthcare there is “no room for error” when building systems that support mission-critical decisions [00:00:11].
Challenges of Scaling AI in Healthcare
While it is relatively easy to create a Minimum Viable Product (MVP) powered by Large Language Models (LLMs), scaling these products to serve customers at scale presents numerous challenges [00:00:30]. As request volume increases, so does the number of unforeseen edge cases [00:00:44].
Example: Prior Authorization Decisions
Anterior, a company serving insurance providers covering 50 million American lives, focuses on supporting prior authorization decisions for medical treatments [00:00:14]. They receive medical records and guidelines to determine if a treatment request should be approved or reviewed by a clinician [00:00:50].
Consider an example question: “Has a patient had a previous brain MRI suspicious for multiple sclerosis (MS)?” This helps determine if the patient should receive a cervical spine MRI [00:01:04]. An AI might respond:
“The medical record shows a brain MRI from [date] that demonstrates hyperintensity in the infratentorial, juxtacortical, and periventricular white matter, which is noted to be consistent with multiple sclerosis, and this confirms prior brain MRI findings suspicious for MS.” [00:01:15]
On the surface, this seems reasonable [00:01:31]. However, it misses a key medical nuance: in a medical context, “suspicious” implies no confirmed diagnosis [00:01:39]. If the patient already has a confirmed diagnosis, the AI’s answer is incorrect because it uses “suspicious” when it should be “confirmed” [00:01:46].
Mistakes like this might occur in 1 out of 1,000 or 10,000 cases [00:01:54]. But if processing over 100,000 cases daily, this leads to a significant number of errors [00:01:58]. Making such mistakes in healthcare can lead to organizations being sued for inappropriate AI automation [00:02:04].
Identifying and Handling Failure Cases
Human Reviews
A first step to identify and handle failure cases is performing human reviews of AI outputs [00:02:16]. Anterior has built an internal clinical team and specialized tooling called “Scalp” to facilitate this process [00:02:21]. The Scalp review dashboard displays all necessary context (medical record, guidelines) alongside the AI’s answer and the question, allowing reviewers to quickly assess and critique responses [00:02:27]. Reviewers can label incorrect answers and provide critiques, which are then used to generate “ground truths” (correct answers) [00:02:50].
However, human reviews do not scale effectively [00:03:17]:
- MVP phase (1,000 decisions/day): Reviewing 50% (500 cases) requires 5 clinicians (at 100 reviews/day per clinician) [00:03:23].
- Scaling (10,000 decisions/day): Maintaining a 50% review rate would require 5,000 human reviews daily, needing 50 clinicians—which is larger than most entire companies [00:03:48].
- Even reviewing a smaller subset like 5% at 100,000 decisions/day still leads to 5,000 reviews and 50 clinicians [00:04:13].
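The reviewer arithmetic above can be sketched in a few lines. The throughput of 100 reviews per clinician per day is the figure quoted in the talk; the function name is illustrative:

```python
import math

def clinicians_needed(decisions_per_day: int, review_rate: float,
                      reviews_per_clinician: int = 100) -> int:
    """Clinicians required to review a given fraction of daily decisions."""
    reviews = decisions_per_day * review_rate
    return math.ceil(reviews / reviews_per_clinician)

# MVP: 1,000 decisions/day with 50% reviewed  -> 5 clinicians
# Scale: 10,000 decisions/day at the same 50% -> 50 clinicians
# Even 5% of 100,000 decisions/day            -> still 50 clinicians
```

The point of the sketch: reviewer headcount grows linearly with volume at any fixed review rate, so scaling volume while holding headcount constant forces the review rate down.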
The challenge becomes: which cases should be reviewed, and how is the system performing on cases that are not reviewed? [00:04:28]
Offline Evaluation Datasets
Ground truths from human reviews can be used to build offline evaluation datasets [00:04:39]. These datasets live outside the product and can be used for continuous evaluations and scoring [00:04:42]. They are helpful for:
- Defining gold standard datasets [00:04:55].
- Segmenting performance by enterprise, medical type, complex cases, or ambiguous outcomes [00:04:57].
- Iterating AI pipelines [00:05:08].
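As a sketch of how ground truths support segmented scoring, assuming hypothetical field names (`segment`, `ai_answer`, `ground_truth`) rather than Anterior's actual schema:

```python
from collections import defaultdict

def accuracy_by_segment(cases: list[dict]) -> dict[str, float]:
    """Score AI answers against ground truths, broken down by segment
    (e.g. enterprise, medical type, case complexity)."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for case in cases:
        seg = case["segment"]
        totals[seg] += 1
        correct[seg] += case["ai_answer"] == case["ground_truth"]
    return {seg: correct[seg] / totals[seg] for seg in totals}
```

Per-segment scores like these make it possible to see, for example, that overall accuracy is high while one enterprise or one medical type is underperforming.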
However, relying solely on offline evaluations is risky [00:05:21]. The input space of medical records is vast and heterogeneous, meaning new edge cases will continuously appear at scale [00:05:27]. Waiting for new edge cases to be represented in these datasets, which are built downstream of customer interaction, can be too late [00:05:13].
Real-time Reference-Free Evaluation Systems
The solution to these scaling problems is a real-time, reference-free evaluation system [00:05:39]. “Reference-free” (or “label-free”) means evaluating before knowing the true outcome (i.e., before a human review) [00:05:45]. This enables real-time response to issues [00:05:53].
LLM as Judge
A core component of this system is using an LLM as a judge [00:06:07].
- Inputs go into the primary LLM pipeline, which generates outputs [00:06:10].
- These outputs are then fed into another LLM (the “judge”), along with a scoring system [00:06:15].
- The scoring system can evaluate various aspects: helpfulness, conciseness, brand tone, or confidence in correctness (especially for binary or multi-class classifications) [00:06:21].
At Anterior, the generated output is binary: either “approval” or “escalation for review” [00:06:38]. They use a reference-free evaluation (which can be an LLM as judge or another method, such as confidence estimation using log probabilities) to provide a “confidence grading” [00:06:49]. This grading ranges from high confidence (correct) to low confidence (actively wrong) [00:07:10]. This score is then converted into a predicted correct output [00:07:16].
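A minimal sketch of converting a confidence grade into a predicted correct output for a binary decision. The judge itself (an LLM call or a logprob-based estimator) is stubbed out, and the flip-on-low-confidence rule is an illustrative assumption, not Anterior's published logic:

```python
OUTPUTS = {"approval", "escalation"}
GRADES = {"high", "medium", "low"}  # high = likely correct, low = actively wrong

def predicted_output(ai_output: str, grade: str) -> str:
    """Convert a judge's confidence grade on a binary decision into a
    predicted correct output. Low confidence means the judge believes the
    answer is wrong, so the binary decision is flipped."""
    assert ai_output in OUTPUTS and grade in GRADES
    if grade == "low":
        return "escalation" if ai_output == "approval" else "approval"
    return ai_output
```

For a binary classification, "the judge thinks this is wrong" carries as much information as "the judge thinks this is right", which is what makes the predicted-output conversion possible.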
Applications of Reference-Free Evals
Reference-free evals provide two key pieces of information:
- Estimated Performance: They predict the estimated performance on all cases in real-time, not just those undergoing human review [00:07:29]. This allows for immediate response and feedback to customers [00:07:45].
- Dynamic Prioritization: They can combine confidence grading with contextual factors (e.g., cost of procedure, risk of bias, previous error rates) to dynamically prioritize cases for human review [00:08:03]. This identifies the most relevant cases with the highest probability of error [00:08:16].
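Dynamic prioritization might combine these signals into a single review-priority score. The weights, parameter names, and normalization below are illustrative assumptions, not Anterior's actual formula:

```python
def priority_score(confidence: float, procedure_cost: float,
                   bias_risk: float, segment_error_rate: float) -> float:
    """Higher score = review this case sooner. All inputs are assumed
    normalized to [0, 1]; the weights are illustrative."""
    uncertainty = 1.0 - confidence
    return (0.5 * uncertainty
            + 0.2 * procedure_cost
            + 0.15 * bias_risk
            + 0.15 * segment_error_rate)

# A low-confidence, high-cost case should outrank a high-confidence, cheap one.
urgent = priority_score(confidence=0.3, procedure_cost=0.9,
                        bias_risk=0.5, segment_error_rate=0.4)
routine = priority_score(confidence=0.95, procedure_cost=0.1,
                         bias_risk=0.1, segment_error_rate=0.05)
```

Sorting the day's cases by this score, descending, yields a review queue that concentrates scarce clinician time on the cases with the highest probability and cost of error.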
This creates a virtuous cycle where human reviews validate and improve performance, while dynamic prioritization feeds cases back into the system [00:08:24]. The reference-free evals surface potential issues, and human review determines accuracy, in a process known as “validating the validator” [00:08:41]. Over time, the number of unseen edge cases decreases, and the ability to detect them improves, creating a system that is difficult to replicate [00:08:44].
Integrating into the Pipeline
Once confident in the system’s performance, the reference-free evaluation can be incorporated directly into the AI pipeline [00:09:01]:
- If the reference-free eval indicates high confidence, the output can be directly returned to the customer [00:09:15].
- If confidence is low, a “further action” can be taken [00:09:19]. This might involve sending it to:
- Another LLM pipeline with more expensive models [00:09:23].
- An internal on-call clinician for review [00:09:28].
- The customer’s review dashboard for their team to review [00:09:34].
This mechanism ensures customer trust in the AI’s outputs [00:09:38].
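The routing above can be sketched as a simple threshold function. The threshold values and action names are illustrative assumptions:

```python
def route(confidence: float, high: float = 0.9, low: float = 0.5) -> str:
    """Route a pipeline output based on its reference-free eval confidence."""
    if confidence >= high:
        return "return_to_customer"          # high confidence: ship directly
    if confidence >= low:
        return "retry_with_stronger_model"   # re-run with a more expensive LLM
    return "escalate_to_human_review"        # on-call clinician or customer dashboard
```

In practice the fallback actions would be ordered by cost, with human review reserved for the cases the cheaper mechanisms cannot resolve.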
Impact at Anterior
Implementing this evaluation system has had significant impacts for Anterior:
- Reduced Hiring: They have not needed to hire an ever-expanding team of expert clinicians like competitors (e.g., one competitor hired over 800 nurses) [00:09:49]. Anterior reviews tens of thousands of cases with a review team of fewer than 10 clinical experts [00:10:01].
- High Alignment: Achieved strong alignment between AI and human reviews, comparable to human-to-human reviewer alignment [00:10:09].
- Quick Error Response: They can quickly identify and respond to errors, ensuring they meet customer Service Level Agreements (SLAs) and maintain confidence in results [00:10:20].
- Industry-Leading Performance: Achieved provably industry-leading performance in prior authorization with an F1 score of nearly 96% in a recent study [00:10:35].
- Customer Trust and Love: This has led to gaining customer trust and even “love” for their product, with nurses expressing gratitude for its continued use [00:10:45].
Principles for Building a System
Anterior recommends the following principles for building such a system:
- Build a System: Think big; use review data not just to audit performance but to build, audit, and improve the entire evaluation system [00:11:04].
- Evaluate on Live Production Data: Don’t rely solely on offline evaluations. Identify problems immediately for quick response [00:11:19].
- Empower Best Reviewers: Prioritize review quality over quantity and build custom tooling to accelerate the process [00:11:27].
This approach enables real-time performance estimates, accurate responses, scalability, low cost, and is powered by a small, focused team of experts [00:11:36]. It allows companies to scale from MVP while maintaining customer trust [00:11:51].