From: redpointai
Oscar Health, a $3 billion public health insurance company, has been at the forefront of innovating with technology in healthcare for the past decade [00:00:20]. Mario Schlosser, CTO and co-founder of Oscar Health, shares insights into the practical implementation of AI models for real use cases and the challenges AI faces within the healthcare system [00:00:47]. Oscar functions as both an insurer and a care provider, experimenting extensively with advanced AI models like GPT-4 [00:00:31].
LLMs as Language Translators in Healthcare
A key strength of Large Language Models (LLMs) is their ability to convert between informal and formal language [00:01:40]. Healthcare uniquely combines both: highly formalized language (e.g., ICD-10, CPT codes, utilization management guidelines) and highly human language (e.g., patient-provider conversations, electronic medical record notes) [00:02:02]. This unique blend has historically limited algorithmic coverage in healthcare, where predictive models often stopped at the surface of formal language [00:02:47]. LLMs excel at this threshold, moving back and forth between formal and informal language [00:02:55].
The ability of LLMs to present information at different levels of sophistication based on the end user is highly compelling [00:05:32]. For example, they can translate complex insurance bills and claim statuses into understandable terms for patients [00:05:39], or transform clinical data for either doctor-to-doctor communication or patient-friendly summaries [00:06:05].
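The audience-aware translation described above can be sketched as a prompt-construction step. This is a minimal illustration, not Oscar's actual prompting: the audience labels, reading-level descriptions, and the `build_translation_prompt` helper are all invented for the example.

```python
# Sketch: converting formal claims language into language pitched at a
# chosen audience. The audience descriptions below are illustrative
# assumptions, not Oscar's actual prompt text.

AUDIENCES = {
    "patient": "a patient with no insurance background, at an 8th-grade reading level",
    "care_guide": "a customer-service care guide who knows plan benefits but not claims logic",
    "provider": "a physician familiar with CPT and ICD-10 coding",
}

def build_translation_prompt(formal_text: str, audience: str) -> str:
    """Compose an LLM prompt that rewrites formal claims language
    (codes, adjudication reasons) for the given audience."""
    if audience not in AUDIENCES:
        raise ValueError(f"unknown audience: {audience}")
    return (
        f"Rewrite the following claim decision for {AUDIENCES[audience]}. "
        "Keep every factual detail (amounts, dates, codes) accurate, and "
        "explain any code you mention in plain words.\n\n"
        f"Claim decision:\n{formal_text}"
    )

prompt = build_translation_prompt(
    "CPT 45378 denied: prior authorization not on file (reason code CO-197).",
    "patient",
)
```

The same formal source text can then be re-rendered for a doctor or a care guide simply by switching the audience key, which is the "different levels of sophistication" idea in code form.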
AI Use Cases at Oscar Health
Oscar Health leverages AI in healthcare and insurance across three primary financial levers: growth and retention, operational efficiency (administrative), and clinical cost reduction/outcome improvement [00:10:24].
Growth and Retention
AI assists in retaining members by enabling personalized outbound campaigns, reminding them of benefits or specific preventative care actions taken, like colorectal cancer screenings [00:13:42]. Different segments of members respond to different messaging; chronically ill members respond better to convenience, while generally healthy members respond better to empathy [00:14:51]. LLMs can extract personas and issues from customer service conversations to inform these strategies [00:15:51]. They can also assist in filling missing demographic information, like ethnicity, by analyzing names or detecting conversation language [00:16:33].
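The persona-to-messaging mapping above can be sketched as a simple lookup downstream of an LLM classification step. The persona labels and tone names here are assumptions made for illustration; in practice the persona would come from an LLM pass over customer service conversations.

```python
# Sketch: once an LLM has tagged a member with a persona extracted from
# service conversations, a lookup picks the outreach tone. Labels are
# illustrative assumptions, not Oscar's actual segmentation.

MESSAGING_BY_PERSONA = {
    "chronically_ill": "convenience",   # e.g. "refill your prescription in one tap"
    "generally_healthy": "empathy",     # e.g. "we're here whenever you need us"
}

def choose_campaign_tone(persona: str) -> str:
    # Default to empathy when the persona is unknown.
    return MESSAGING_BY_PERSONA.get(persona, "empathy")
```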
Administrative Use Cases
Initially, many of the LLM use cases at Oscar Health are administrative, aiming to make processes real-time, transparent, and bidirectional [00:03:11]. Examples include:
- Claims Explanation: LLMs translate complex rule-based claim traces (e.g., 1,000 lines of logic) into understandable explanations for care guides or laypersons, detailing why a claim was paid or denied [00:03:37].
- Call Summarization: LLMs are increasingly replacing manual note-taking by care guides during customer service calls [00:17:22].
- Lab Test Summarization: Launched within Oscar’s medical group [00:17:41].
- Secure Messaging Medical Records Generation: Also launched in the medical group [00:17:46].
These administrative applications currently save “a few cents PMPM (per member per month)” [00:18:01].
Clinical Use Cases
While administrative applications offer immediate, tangible savings, the long-term goal is to replace caregivers and clinical intelligence with machine intelligence, significantly reducing the cost of doctor visits and potentially replacing specialists with AI [00:04:30]. The biggest current clinical use case is enabling doctors to “talk to their own medical records” [00:18:52].
Challenges in Clinical AI
- Contextual Knowledge: Human providers often possess subtle contextual knowledge (e.g., remembering previous conversations not explicitly in records) that LLMs lack, making it an “unfair playing field” when inputs and outputs are less clean [00:07:11]. Improving LLM performance in clinical settings requires not just better models, but also expanding their “horizon of knowledge” by providing more context [00:08:04].
- Physical vs. Virtual Interaction: While two-thirds of claims could be handled virtually, real-world patient loyalty to PCPs is low (only 28% of members stick to their attributed PCP) [00:59:02]. This suggests that the need for in-person actions (e.g., lab tests, foot exams) causes “leakage” that prevents clinical chatbots from fully replacing physicians or the entire system [00:59:42].
- Business Models: Health systems currently have little incentive to switch to lower-cost care delivery channels, as this could lead to reduced reimbursement [01:00:00]. Insurers, while well-positioned to deploy automated virtual care, often lack the necessary member engagement [01:00:15].
Regulatory and Implementation Challenges
Healthcare AI faces significant regulatory hurdles, primarily HIPAA (Health Insurance Portability and Accountability Act) [00:20:58], which governs patient-specific information. Vendors must sign Business Associate Agreements (BAAs) to handle protected health information (PHI) [00:21:05]. Oscar was the first organization to sign a BAA directly with OpenAI [00:21:18].
New models, like Google’s Gemini Ultra, are not immediately covered by existing BAA agreements, requiring a delay of three to four months before they can be used with real medical records [00:22:30]. During this period, Oscar uses synthetic or anonymized test data [00:22:46].
Gaining trust and navigating enterprise sales processes are more critical than having the “best” product [00:24:42]. Hospitals and insurance companies are generally slow to rapidly prototype or follow up on pilots [00:25:01]. Industry collaborations, such as a consortium of health systems and insurers that wrote a document on principles for AI in healthcare, aim to democratize analytics and accelerate adoption [00:26:03].
Limitations and Troubleshooting of LLMs
LLMs have specific limitations that impact their performance:
- Counting and Categorization: GPT-4 can fail miserably at tasks requiring it to characterize and count occurrences of specific reasons from a batch of customer service calls, especially when the context window is large [00:27:48]. This is a fundamental limitation, potentially due to how layers process tokens and synthesize information [00:28:21]. This can be mitigated by splitting tasks into different steps or chaining models (e.g., Chain of Thought prompting) [00:29:52].
- False Positives: LLMs can generate a high rate of false positives for concepts like “post-traumatic injury” when extracting information from medical records for utilization management [00:31:49]. This likely occurs because the model’s training data contains layperson associations of the term, differing from its precise, regulated definition in healthcare [00:32:52]. A solution involves using “self-consistency questionnaires,” where the LLM generates multiple ways a concept might appear in records, which are then independently evaluated [00:33:41].
- Context Window Issues: Early attempts at claims explanation faced context window limitations (e.g., 8,000 tokens for GPT-4), as the full internal logic trace was too large [00:37:09]. The solution was to provide the traces at a higher hierarchical level, allowing the LLM to “double-click” on specific sub-procedures for more detail [00:38:25].
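The “splitting tasks into different steps” mitigation for the counting failure can be sketched as a map-reduce: classify each call in its own small-context request, then count deterministically in code rather than asking the model to count. Here `classify_call` is a keyword stub standing in for a per-call LLM request; the category names are invented for the example.

```python
from collections import Counter

# Sketch: instead of one large-context call that reads a batch of
# transcripts and counts reasons (where counting degrades), classify
# each call separately and do the counting in ordinary code.
# classify_call is a stub standing in for a single small-context LLM call.

def classify_call(transcript: str) -> str:
    text = transcript.lower()
    if "bill" in text:
        return "billing_question"
    if "denied" in text:
        return "claim_denial"
    return "other"

def count_call_reasons(transcripts: list[str]) -> Counter:
    # Map: one classification per call. Reduce: deterministic counting.
    return Counter(classify_call(t) for t in transcripts)

counts = count_call_reasons([
    "Why was my claim denied?",
    "Question about my bill",
    "My bill looks wrong",
])
```

Because the aggregation happens outside the model, the count is exact regardless of how many transcripts are in the batch.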
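The self-consistency questionnaire for false positives can be sketched as independent yes/no checks with a majority vote. The questions, the keyword stub standing in for per-question LLM calls, and the voting threshold are all illustrative assumptions.

```python
# Sketch: for a concept like "post-traumatic injury", enumerate concrete
# ways it could appear in a record, evaluate each question independently,
# and flag the concept only when a majority of checks agree. Each evaluate()
# call stands in for an independent LLM yes/no request.

QUESTIONS_AND_CUES = {
    "Does the record describe an external accident or trauma?": ("accident", "trauma", "fall"),
    "Is an injury the direct reason for the requested procedure?": ("injury",),
    "Is there a trauma-range diagnosis code?": ("s06", "s72"),
}

def evaluate(cues: tuple[str, ...], record: str) -> bool:
    # Keyword stub for one independent per-question LLM call.
    text = record.lower()
    return any(cue in text for cue in cues)

def flag_concept(record: str) -> bool:
    answers = [evaluate(cues, record) for cues in QUESTIONS_AND_CUES.values()]
    # Require a majority of independent checks, not a single loose match.
    return sum(answers) > len(answers) / 2
```

A record mentioning only “post-traumatic stress disorder” trips just one of the three checks, so the layperson association alone is not enough to flag the regulated concept.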
Specialized vs. General-Purpose AI Models
Oscar’s experience suggests that specializing a general-purpose model for a particular area (e.g., healthcare) risks losing “alignment,” meaning the model may fail to follow simple instructions (e.g., formatting output as JSON) [00:44:05]. The current recommendation is to use the biggest general-purpose model for the best reasoning and combine it with Retrieval Augmented Generation (RAG) [00:45:09]. Recent research indicates that RAG and fine-tuning offer independent improvements in performance, suggesting both should be utilized [00:45:31].
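The “big general model plus RAG” recommendation can be sketched end to end. Word-overlap scoring stands in for a real embedding index, and the guideline snippets are invented examples; only the prompt returned by `build_rag_prompt` would be sent to the large model.

```python
# Minimal RAG sketch: retrieve the most relevant guideline snippets and
# prepend them to the prompt for a large general-purpose model. The
# word-overlap scorer is a stand-in for a real embedding/vector index.

def score(query: str, doc: str) -> int:
    # Crude relevance: count of shared lowercase words.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    context = "\n".join(retrieve(query, docs))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer using only the context above."
    )
```

The design point matches the text: domain knowledge enters through retrieval at inference time, so the general-purpose model keeps its instruction-following behavior instead of being fine-tuned away from it.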
Structuring AI Teams
Oscar Health has adopted a successful model for structuring its AI teams [00:46:20].
- AI Pod: A dedicated team of seven people (two product managers, data scientists, engineers) provides office hours for anyone in the company working on AI [00:47:10]. This centralized pod also has its own three product priorities to ensure tangible outcomes [00:47:34].
- Weekly Hacking Sessions: A regular, informal gathering (e.g., Monday nights) where anyone can bring ideas and show off their AI projects [00:47:54]. The goal is to lower the bar for participation and encourage sharing of both successes and failures [00:48:56].
- Open Sharing: The company encourages sharing of insights and practices both internally and externally, as evidenced by their public resources [00:41:29].
Opportunities for AI Startups in Healthcare
For new AI companies in healthcare, obscure niches that solve very particular problems for non-technical users may be the most promising [00:53:22].
- Regulatory Filings Composition: Automating the generation of regulatory documentation (e.g., for NCQA, state health departments, SOX compliance) could significantly reduce overhead [00:53:37].
- Fraud, Waste, and Abuse: This industry segment is still dominated by older, expensive players, suggesting a ripe opportunity for disruption with AI [00:56:16].
- Prior Authorization: While many companies are entering this space, it’s very close to the core competency of insurance companies [00:55:22]. Startups might struggle if they cannot offer highly interactive or platform-based solutions [00:55:46].
Overhyped and Underhyped AI in Healthcare
- Overhyped: Clinical chatbots [01:00:49].
- Underhyped: Voice outputs [01:00:56].
Learning More
To learn more about Oscar Health’s AI work and insights, visit their collection of papers and articles at hi.oscar.health [01:01:50]. Mario Schlosser also posts his explorations and Oscar’s work on Twitter (MarioTS) [01:02:07]. He’s interested in using LLMs for novel applications, such as generating RPGs from company documents or dynamic games like Oregon Trail [01:02:40].