From: aidotengineer

Fine-tuning AI models is a powerful technique for optimizing their performance for specific use cases, often addressing challenges related to quality, cost, and latency in production environments [08:42:15]. This approach involves building custom models tailored to an application’s unique needs [13:39:58].

Challenges with Off-the-Shelf Models

Method, a company that collects and centralizes liability data, faced significant challenges extracting specific liability data points (e.g., payoff amounts, escrow balances) from existing data sources [01:48:00]. There was no central API, and direct bank integrations would take years [02:07:00]. The traditional approach relied on offshore teams manually calling banks, a process that was slow, inefficient, and error-prone [02:55:00].

The company initially turned to advanced LLMs like GPT-4, which excelled at parsing unstructured data [05:02:00] and worked well in controlled environments [05:33:00]; a sketch of this kind of extraction call appears after the list below. However, scaling GPT-4 in production surfaced several issues:

  • High Costs: GPT-4 incurred a cost of $70,000 in the first month of production traffic [05:57:00]. While the outputs were valuable, this cost was unsustainable [06:07:00].
  • Prompt Engineering Limitations: Prompt engineering proved insufficient for complex use cases, requiring overly detailed and convoluted instructions [06:31:00]. This led to a “cat and mouse chase” in which fixes for one scenario broke another [06:44:00].
  • Latency: GPT-4’s baseline latency was too slow to support concurrent scaling [07:26:00].
  • AI Errors: Hallucinations were difficult to catch and produced inaccurate financial information [07:36:00].
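
To make the workload concrete, here is a minimal sketch of the kind of LLM-based extraction call described above. This is not Method’s actual pipeline; the prompt, field names, and JSON response convention are illustrative assumptions.

```python
# Minimal sketch of LLM-based liability-data extraction (illustrative,
# not Method's production pipeline); prompt and field names are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_liability_fields(document_text: str) -> dict:
    """Ask the model for specific liability data points as JSON."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Extract payoff_amount and escrow_balance from the "
                        "document. Reply with JSON only; use null for any "
                        "field that is absent. Do not guess."},
            {"role": "user", "content": document_text},
        ],
        temperature=0,  # deterministic output suits extraction tasks
    )
    return json.loads(response.choices[0].message.content)

print(extract_liability_fields("Payoff amount as of 06/01: $12,480.22 ..."))
```

This works well on a handful of documents, which matches the talk’s point: the trouble only shows up at production scale.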

Despite these issues, GPT-4 remained in production for specific use cases where it performed well [07:44:00]. The problem then shifted to how to scale this system robustly [07:57:00]. Method aimed for 16 million requests per day, 100K concurrent load, and sub-200 millisecond latency [08:07:00].
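
Those targets can be sanity-checked with a little arithmetic. The figures below come straight from the talk; the Little’s Law framing is an added interpretation, not something the speaker derived.

```python
# Back-of-envelope check on Method's stated scale targets.
requests_per_day = 16_000_000
concurrency = 100_000          # concurrent requests in flight
latency_s = 0.200              # sub-200 ms latency target

avg_rps = requests_per_day / 86_400
print(f"average load: {avg_rps:,.0f} req/s")            # ~185 req/s

# Little's Law (L = lambda * W): sustaining 100K requests in flight at
# 200 ms each implies a peak throughput of lambda = L / W.
peak_rps = concurrency / latency_s
print(f"peak throughput implied: {peak_rps:,.0f} req/s")  # 500,000 req/s
```

The gap between average and peak load is exactly why per-request latency and concurrency, not just daily volume, drove the model choice.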

Benchmarking Existing Models

Before considering fine-tuning, it’s crucial to benchmark existing production models to understand their performance against specific business requirements [13:59:00]. For Method, the evaluation metrics were as follows (a minimal benchmarking sketch follows this comparison):

  • Error Rate:
    • GPT-4: 11% error rate [09:24:00].
    • O3 Mini: 4% error rate [09:28:00].
    • Error rates were measured by comparing agent outputs to human-verified correct data [09:51:00]. Method’s target error rate was around 9%, as they had additional plausibility checks [11:40:00].
  • Latency:
    • GPT-4: Around 1 second [10:08:00].
    • O3 Mini: About 5 seconds for their task [10:12:00].
    • Method required sub-200 millisecond latency for its real-time agentic workflow [08:19:00].
  • Cost:
    • O3 Mini, despite lower per-token cost, was more expensive than GPT-4 for Method’s use case due to generating many more reasoning tokens [10:42:00].
    • Cost is highly dependent on volume and use case [10:35:00].

Neither GPT-4 nor O3 Mini met all three requirements for Method’s production needs [13:05:00].
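
A benchmark along these lines can be scripted directly against the human-verified data the talk mentions. The sketch below is an assumed harness, not Method’s: the model list, the dataset format, and the exact-match error definition are all illustrative choices.

```python
# Hedged sketch of a model benchmark: error rate vs. human-verified labels,
# plus average latency. Dataset format and model list are assumptions.
import time
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4", "o3-mini"]  # candidates under evaluation

cases = [
    {"prompt": "Extract the payoff amount: 'Payoff: $12,480.22'",
     "expected": "$12,480.22"},
    # ... more human-verified cases
]

def run_once(model: str, prompt: str) -> tuple[str, float]:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip(), time.perf_counter() - start

def benchmark(model: str, dataset: list[dict]) -> dict:
    errors, latencies = 0, []
    for case in dataset:
        output, seconds = run_once(model, case["prompt"])
        latencies.append(seconds)
        if output != case["expected"]:  # exact match; real checks may be looser
            errors += 1
    return {
        "model": model,
        "error_rate": errors / len(dataset),
        "avg_latency_s": sum(latencies) / len(latencies),
    }

for model in MODELS:
    print(benchmark(model, cases))
```

Per-request cost can be added by reading token counts from `resp.usage` and multiplying by the provider’s per-token prices, which is how reasoning-token overhead like O3 Mini’s shows up in practice.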

The Power of Fine-Tuning

When off-the-shelf models or prompt engineering prove insufficient, fine-tuning becomes a viable solution [13:41:00]. Fine-tuning is considered a “power tool” that requires more engineering investment than simple prompting but can significantly improve the price-performance curve [13:46:00].

OpenPipe collaborated with Method to fine-tune a custom model, demonstrating significant improvements:

  • Improved Quality (Error Rate): The fine-tuned model achieved an error rate significantly better than GPT-4 [14:28:00]. This is now easier to achieve by replaying production data through a “teacher model” (like O3 Mini) to generate outputs for training [14:40:00]; see the distillation sketch after this list.
  • Reduced Latency: By moving to a much smaller fine-tuned model (e.g., an 8-billion-parameter Llama 3.1 model), latency was drastically reduced [15:34:00]. Fewer calculations mean lower latency, and models can even be deployed within an organization’s own infrastructure to eliminate network latency [15:47:00]; a self-hosted serving sketch also follows this list. For most customers, models of this size or smaller are sufficient [15:18:00].
  • Lower Costs: The smaller fine-tuned model cost far less to run [16:04:00], falling well below Method’s cost thresholds and making the solution economically viable [16:20:00].
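
One common way to set up the teacher-student arrangement described above is to replay production inputs through the stronger model and save its outputs as chat-format fine-tuning examples. A minimal sketch, assuming o3-mini as the teacher; the file names, prompt, and JSONL layout are illustrative:

```python
# Sketch of building a distillation dataset: a stronger "teacher" model
# labels production inputs, and the results become training examples for
# a smaller "student" model. File names and prompt are assumptions.
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "Extract payoff_amount and escrow_balance as JSON."

def teacher_label(document_text: str) -> str:
    resp = client.chat.completions.create(
        model="o3-mini",  # teacher: slower and pricier, but accurate
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": document_text},
        ],
    )
    return resp.choices[0].message.content

with open("production_inputs.txt") as src, open("train.jsonl", "w") as out:
    for line in src:
        document = line.strip()
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": document},
                {"role": "assistant", "content": teacher_label(document)},
            ]
        }
        out.write(json.dumps(example) + "\n")
# train.jsonl can then feed a fine-tuning job for a small student model
# (e.g., a Llama 3.1 8B checkpoint).
```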
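
On the latency point, a small fine-tuned model can be served inside an organization’s own infrastructure behind an OpenAI-compatible endpoint; vLLM is one common choice (e.g., `vllm serve <model>`), though the talk does not name a specific server. The sketch below assumes such a server is already running locally; the base URL and model path are hypothetical.

```python
# Sketch of querying a self-hosted fine-tuned model through an
# OpenAI-compatible endpoint (such as one served by vLLM). The base_url,
# model path, and prompt are illustrative assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # in-VPC endpoint: no public-network hop
    api_key="unused",                     # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="my-org/llama-3.1-8b-finetuned",  # hypothetical fine-tuned checkpoint
    messages=[{"role": "user", "content": "Extract the payoff amount: ..."}],
    temperature=0,
)
print(resp.choices[0].message.content)
```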

When to Fine-Tune

Fine-tuning is a power tool for cases where prompt engineering with existing models cannot reach the required reliability numbers [16:48:00]. It can sharply bend the price-performance curve, enabling large-scale production deployment [16:58:00].

Productionizing AI Agents

Productionizing AI agents requires a level of openness and patience from engineering and leadership teams [17:47:00]. Unlike traditional software, which is expected to work reliably once shipped, AI agents take time to become production-ready and to deliver the desired responses consistently [18:01:00]. For Method, fine-tuning enabled a cheaper, faster model by leveraging existing GPT-generated production data for training, without needing to acquire their own GPUs [17:28:00].