From: aidotengineer
In applications involving critical decisions, such as healthcare, there is “no room for error” when using AI systems [00:00:11]. While creating a Minimum Viable Product (MVP) with Large Language Models (LLMs) is becoming easier, scaling these products to serve customers presents significant challenges [00:00:30]. As request volume increases, so does the number of previously unseen edge cases [00:00:44].
The Need for Human Oversight in Healthcare AI
Anterior, a company serving insurance providers that cover 50 million American lives, uses AI to support prior authorization decisions on treatment requests [00:00:14]. Its AI processes medical records and guidelines to determine whether a treatment should be approved or sent to a clinician for review [00:00:52].
Example of an Edge Case
Consider an AI output describing a brain MRI with hyperintensity “consistent with multiple sclerosis” and confirming “prior brain MRI findings suspicious for MS” [00:01:15]. This output missed a crucial medical nuance: in a medical context, “suspicious” implies there is no confirmed diagnosis, but this patient already had a confirmed diagnosis [00:01:34]. The AI’s answer was therefore incorrect [00:01:51]. Mistakes like this, even if rare (e.g., one in every 1,000 or 10,000 cases), become frequent in absolute terms (e.g., 100 mistakes per day) when processing over 100,000 cases daily [00:01:54]. In US healthcare, errors from AI automation can lead to lawsuits [00:02:07].
Implementing Human Reviews
To identify and handle failure cases, performing human reviews of AI outputs is essential [00:02:13]. Anterior built an internal clinical team and developed proprietary tooling, called “Scalp,” to facilitate efficient reviews [00:02:21].
The Scalp Review Dashboard
The “Scalp” review dashboard displays all necessary context, such as medical records and guidelines, on the right side of the screen, with no scrolling required [00:02:28]. The left side shows the question being answered alongside the information needed to answer it, enabling reviewers to assess a high volume of questions quickly [00:02:41]. Reviewers can add critiques, label outputs as “Incorrect,” and save this information into the system [00:02:50].
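As an illustration only (the internal Scalp schema is not public), a review saved from such a dashboard might be stored as a small structured record; the field names below are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class HumanReview:
    """Hypothetical shape of a single saved review (illustrative, not Anterior's schema)."""
    case_id: str
    question: str                   # the question the AI was answering
    ai_answer: str                  # the AI's proposed answer
    label: str                      # e.g. "correct" or "incorrect"
    critique: Optional[str] = None  # free-text explanation of what is wrong
    reviewer_id: str = ""
    reviewed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

# Example: a reviewer flags the "suspicious vs. confirmed diagnosis" miss from above
review = HumanReview(
    case_id="case-001",
    question="Does the record document a confirmed MS diagnosis?",
    ai_answer="No confirmed diagnosis (findings only 'suspicious for MS')",
    label="incorrect",
    critique="Patient has an existing MS diagnosis; the 'suspicious' wording was misread.",
    reviewer_id="clinician-42",
)
```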
Critiques and Ground Truths
Critiques, which explain what is wrong with an AI’s output, can be used to generate “ground truths”—descriptions of the correct answer [00:03:00]. These ground truths are then used in offline evaluations [00:03:13].
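A minimal sketch of how a saved critique might be turned into a ground-truth record and appended to an offline eval dataset; the record layout, helper names, and JSONL storage choice are assumptions, not Anterior’s pipeline:

```python
import json
from pathlib import Path

def critique_to_ground_truth(case_id: str, question: str,
                             critique: str, correct_answer: str) -> dict:
    """Convert a reviewer critique into a ground-truth entry for offline evaluation."""
    return {
        "case_id": case_id,
        "question": question,
        "ground_truth": correct_answer,  # description of the correct answer
        "rationale": critique,           # why the AI output was wrong
    }

def append_to_eval_set(entry: dict, path: str = "offline_eval_set.jsonl") -> None:
    """Append one labeled example to a JSONL eval dataset (illustrative storage choice)."""
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")

entry = critique_to_ground_truth(
    case_id="case-001",
    question="Does the record document a confirmed MS diagnosis?",
    critique="'Suspicious' was read as ruling out a diagnosis, but the patient has a confirmed one.",
    correct_answer="Yes - the record documents an existing, confirmed MS diagnosis.",
)
append_to_eval_set(entry)
```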
Limitations of Human Reviews at Scale
While crucial, human reviews face significant scalability challenges [00:03:17].
- MVP Stage: Reviewing 50% of 1,000 daily cases (500 reviews) requires 5 clinicians, with each clinician performing approximately 100 reviews per day [00:03:21].
- Scaling Up: If daily medical decisions increase to 10,000, maintaining a 50% review rate would require 5,000 human reviews per day and 50 clinicians, a team larger than many entire companies [00:03:48]. Even dropping the review rate to 5% at 100,000 daily decisions still means 5,000 human reviews and 50 clinicians (the staffing arithmetic is sketched after this list) [00:04:13].
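The staffing arithmetic above can be reproduced directly; the 100-reviews-per-clinician-per-day capacity is the figure given in the talk, and the helper below is just a worked check of those numbers:

```python
import math

def clinicians_needed(daily_cases: int, review_rate: float,
                      reviews_per_clinician: int = 100) -> int:
    """Clinicians required to sustain a given human-review rate."""
    daily_reviews = daily_cases * review_rate
    return math.ceil(daily_reviews / reviews_per_clinician)

for daily_cases, review_rate in [(1_000, 0.50), (10_000, 0.50), (100_000, 0.05)]:
    needed = clinicians_needed(daily_cases, review_rate)
    print(f"{daily_cases:>7} cases at {review_rate:.0%} review rate -> {needed} clinicians")
# 1,000 at 50% -> 5; 10,000 at 50% -> 50; 100,000 at 5% -> 50
```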
This demonstrates that relying solely on human reviews “doesn’t scale” [00:04:25]. This leads to two key questions: which cases should be reviewed, and how does the AI perform on unreviewed cases [00:04:28]?
Offline Eval Datasets
Offline evaluation datasets are built from generated ground truths and reside outside the product [00:04:39]. They can be used to define gold standard datasets, segment performance by enterprise or medical condition, track performance over time, and iterate on AI pipelines [00:04:53]. However, relying only on offline evals is risky: because medical records are highly heterogeneous, new edge cases appear continuously, and by the time they are represented in a downstream dataset it “could be too late” [00:05:11].
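A minimal sketch of how such an offline eval set could be scored and segmented (here by medical condition); the record layout and the `model_answer` field are assumptions for illustration:

```python
from collections import defaultdict

def segmented_accuracy(examples: list[dict], segment_key: str = "condition") -> dict[str, float]:
    """Exact-match accuracy per segment over an offline eval dataset (illustrative)."""
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for ex in examples:
        seg = ex.get(segment_key, "unknown")
        total[seg] += 1
        if ex["model_answer"] == ex["ground_truth"]:
            correct[seg] += 1
    return {seg: correct[seg] / total[seg] for seg in total}

eval_set = [
    {"condition": "MS",       "ground_truth": "approve",  "model_answer": "approve"},
    {"condition": "MS",       "ground_truth": "escalate", "model_answer": "approve"},
    {"condition": "knee MRI", "ground_truth": "approve",  "model_answer": "approve"},
]
print(segmented_accuracy(eval_set))  # {'MS': 0.5, 'knee MRI': 1.0}
```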
Integration with Real-Time Reference-Free Evaluations
The solution to the scalability challenges of human reviews and the limitations of offline evals is a real-time, reference-free evaluation system [00:05:37]. Reference-free means evaluating before the true outcome is known (i.e., before a human review has occurred), enabling immediate response to issues [00:05:45].
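One common way to implement a reference-free evaluation is an LLM-as-judge step that grades an output before any human review; the prompt wording and the `call_llm` helper below are placeholders for whatever completion client is in use, not Anterior’s system:

```python
def reference_free_grade(question: str, record_excerpt: str, ai_answer: str, call_llm) -> dict:
    """Grade an AI answer without a ground truth, using a judge model (sketch).

    `call_llm` is assumed to take a prompt string and return the model's text.
    """
    prompt = (
        "You are auditing a prior-authorization decision.\n"
        f"Question: {question}\n"
        f"Relevant record excerpt: {record_excerpt}\n"
        f"Proposed answer: {ai_answer}\n"
        "Rate your confidence that the proposed answer is correct as a number from "
        "0 (almost certainly wrong) to 1 (almost certainly right), then give a "
        "one-sentence reason, separated by '|'."
    )
    raw = call_llm(prompt)
    score_text, _, reason = raw.partition("|")
    try:
        confidence = max(0.0, min(1.0, float(score_text.strip())))
    except ValueError:
        confidence = 0.0  # unparseable grades are treated as low confidence
    return {"confidence": confidence, "reason": reason.strip()}

# Usage with a stubbed judge; a real deployment would call an actual model:
print(reference_free_grade(
    "Does the record document a confirmed MS diagnosis?",
    "...",
    "No confirmed diagnosis",
    call_llm=lambda p: "0.35 | The 'suspicious' wording may hide a confirmed diagnosis.",
))
```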
Prioritizing Human Reviews
Real-time reference-free evals can provide a “confidence grading” for AI outputs [00:07:03]. This grading can be combined with contextual factors such as procedure cost, bias risk, and previous error rates [00:08:03], allowing cases to be dynamically prioritized so that those most likely to contain errors are surfaced to human reviewers first [00:08:14].
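A minimal sketch of how a confidence grade might be combined with contextual factors into a review-priority score; the factors come from the talk, but the weights, normalization, and formula are assumptions:

```python
def review_priority(confidence: float, procedure_cost: float,
                    bias_risk: float, historical_error_rate: float) -> float:
    """Higher score = review sooner. Weights are illustrative, not tuned values."""
    uncertainty = 1.0 - confidence                    # low confidence -> high priority
    cost_factor = min(procedure_cost / 10_000, 1.0)   # normalize cost into [0, 1]
    return (0.5 * uncertainty + 0.2 * cost_factor
            + 0.15 * bias_risk + 0.15 * historical_error_rate)

cases = [
    {"id": "a", "confidence": 0.95, "procedure_cost": 800,    "bias_risk": 0.1, "historical_error_rate": 0.02},
    {"id": "b", "confidence": 0.40, "procedure_cost": 25_000, "bias_risk": 0.3, "historical_error_rate": 0.10},
]
queue = sorted(
    cases,
    key=lambda c: review_priority(c["confidence"], c["procedure_cost"],
                                  c["bias_risk"], c["historical_error_rate"]),
    reverse=True,
)
print([c["id"] for c in queue])  # ['b', 'a'] - the low-confidence, costly case is reviewed first
```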
The Virtuous Cycle: Validating the Validator
This creates a “virtuous cycle” where human reviews continuously validate and improve the performance of the automated reference-free evaluation system [00:08:24].
- Reference-free evals surface potential problem cases [00:08:35].
- Human review determines the actual accuracy [00:08:38].
This process, often described as “validating the validator,” leads to a decrease in unseen edge cases and an improvement in the ability to detect them [00:08:41]. Such a robust system is difficult for competitors to replicate, as it requires processing high volumes of real data and extensive data-driven iterations [00:08:50].
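A minimal sketch of the “validating the validator” loop: human review labels are used to check how well the reference-free eval’s flags predict real errors. The metric choice (precision and recall of the flag) and the record fields are assumptions:

```python
def flag_quality(records: list[dict]) -> dict[str, float]:
    """How well do reference-free flags predict human-confirmed errors?

    Each record is assumed to carry `flagged` (eval said "likely wrong") and
    `human_label` ("correct"/"incorrect") from the subsequent human review.
    """
    tp = sum(r["flagged"] and r["human_label"] == "incorrect" for r in records)
    fp = sum(r["flagged"] and r["human_label"] == "correct" for r in records)
    fn = sum(not r["flagged"] and r["human_label"] == "incorrect" for r in records)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}

reviewed = [
    {"flagged": True,  "human_label": "incorrect"},
    {"flagged": True,  "human_label": "correct"},
    {"flagged": False, "human_label": "incorrect"},
    {"flagged": False, "human_label": "correct"},
]
print(flag_quality(reviewed))  # {'precision': 0.5, 'recall': 0.5}
```

Tracking these numbers over time is one way to confirm that the automated evaluator is improving as more human reviews flow back into it.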
Impact of an Integrated Evaluation System
Anterior’s integrated evaluation system has had several significant impacts:
- Reduced Staffing Needs: It eliminated the need to hire an “ever-expanding team of expert clinicians” [00:09:49]. While a competitor hired over 800 nurses for reviews, Anterior reviews tens of thousands of cases with a team of fewer than 10 clinical experts [00:09:55].
- Strong Alignment: The system achieved strong alignment between AI and human reviews, comparable to the alignment seen between human reviewers themselves [00:10:08].
- Rapid Error Response: It enables quick identification and response to errors, ensuring customer Service Level Agreements (SLAs) are met [00:10:20].
- Industry-Leading Performance: This approach resulted in “provably industry-leading performance” in prior authorization, with an F1 score of nearly 96% in a recent study [00:10:35].
- Customer Trust: This high performance fostered customer trust and even “love” for the product [00:10:43].
Key Principles for Effective Evaluation
To build an evaluation system that works at scale, enables real-time performance estimates, and maintains low costs, the following principles are recommended [00:11:04]:
- Build a System, Not Just an Audit: Use review data not just to audit performance, but to “build, audit, and improve your auditing system,” which is your evaluation system [00:11:08].
- Evaluate on Live Production Data: Do not rely solely on offline evals. Identify problems immediately for quick response [00:11:19].
- Empower Your Best Reviewers: Prioritize the quality of reviews over quantity, and build custom tooling if it helps accelerate the process [00:11:27].