From: aidotengineer
Finetuning AI models is a powerful technique for adapting pre-trained models to specific tasks, overcoming the limitations of general models and achieving better performance, lower costs, and reduced latency. It’s considered a “power tool” in AI development, requiring more engineering investment than simple prompt engineering, but offering significant advantages when off-the-shelf models don’t meet production requirements [13:48:00].
Challenges with General Purpose Models
Method, a company that collects and centralizes liability data, faced a challenge in providing specific liability data points like payoff amounts on auto loans or escrow balances for mortgages [01:48:00]. There was no central API available to retrieve this information [02:07:07].
Initially, companies would hire offshore teams to manually call banks, authenticate, gather information, and proof-check it before integration into financial platforms [02:55:00]. This manual process was:
- Inefficient and not scalable [03:23:00]
- Expensive due to the need for more personnel to scale [03:33:00]
- Slow due to its synchronous nature [03:41:00]
- Prone to human error, requiring additional teams for fact-checking and proof-checking [03:48:00]
The core problem was making sense of unstructured data [04:17:00]. With the rise of advanced Large Language Models (LLMs) like GPT-4, which excel at parsing unstructured data for tasks like summarization or classification [05:02:00], Method developed an agentic workflow using GPT-4 [05:16:00].
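As a rough illustration of the kind of unstructured-data parsing such a workflow performs, the sketch below shows a single extraction call using the OpenAI Python SDK. The function name, prompt, and transcript format are hypothetical and not Method’s actual agent logic.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_payoff_amount(call_transcript: str) -> str:
    """Hypothetical example: pull one liability data point out of unstructured text."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You extract auto-loan payoff amounts from call transcripts. "
                           "Reply with the dollar amount only.",
            },
            {"role": "user", "content": call_transcript},
        ],
    )
    return response.choices[0].message.content.strip()
```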
However, using off-the-shelf, larger models like GPT-4, even when effective, presented new challenges:
- Cost: The first month in production with GPT-4 incurred a cost of $70,000 [05:57:00]. This high cost made scaling difficult [07:15:00].
- Prompt Engineering Limitations: Prompt engineering only goes so far [06:25:00]. GPT models, while smart, are not financial experts, requiring extremely detailed and specific instructions with examples [06:31:00]. Prompts became long, convoluted, hard to generalize, and unstable (fixing one scenario broke another) [06:42:00].
- Latency: Baseline latency was too slow for concurrent scaling [07:24:00].
- Caching Issues: Variability in responses and frequent prompt tweaks made it difficult to optimize for caching [07:18:00].
- AI Errors (Hallucinations): Like human errors, AI models produced errors (hallucinations) that were hard to catch [07:36:00].
Despite these issues, GPT-4 remained in production for specific use cases due to its effectiveness [07:44:00]. The problem then shifted from solving data parsing to scaling this AI system robustly to handle high volumes, with targets like 16 million requests per day, 100K concurrent load, and sub-200 millisecond latency [07:57:00].
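For context, 16 million requests per day averages out to roughly 185 requests per second (16,000,000 / 86,400 ≈ 185), and the 100K concurrent-load target implies bursts well above that average.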
Benchmarking and Goals
To address these issues, OpenPipe worked with Method on benchmarking different models against specific criteria (a minimal measurement sketch follows this list):
- Error Rates (Quality):
- GPT-4: ~11% error rate [09:24:00]
- o3-mini: ~4% error rate [09:26:00]
- Method measured this by having a human determine the correct numbers for bank balances and comparing them to the agent’s final outputs [09:51:00]. Their target was around a 9% error rate, which was acceptable because additional plausibility checks catch errors downstream [11:50:00].
- Latency:
- GPT-4: ~1 second [10:08:00]
- o3-mini: ~5 seconds [10:12:00]
- Measurements were conducted under real production conditions with diverse tasks and reasonable concurrency [10:21:00]. Method’s real-time agent system required a hard latency cut-off for quick responses [12:06:00].
- Cost:
- o3-mini was found to be more expensive than GPT-4 for Method’s specific use case, despite its lower per-token cost, because it generated more reasoning tokens and longer outputs [10:42:00].
- Given Method’s high volume, cost was a critical factor [12:40:00].
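A minimal sketch of the measurement approach described above, assuming a callable `agent` and a set of human-verified cases; both names are hypothetical stand-ins for Method’s internal tooling:

```python
import time

def benchmark(agent, labeled_cases):
    """Compare agent outputs against human-verified balances and record latency.

    `agent` is any callable that takes a case and returns a dollar amount;
    `labeled_cases` pairs each case with the number a human confirmed with the bank.
    """
    errors, latencies = 0, []
    for case, human_verified_amount in labeled_cases:
        start = time.perf_counter()
        predicted = agent(case)
        latencies.append(time.perf_counter() - start)
        if predicted != human_verified_amount:
            errors += 1
    error_rate = errors / len(labeled_cases)
    median_latency = sorted(latencies)[len(latencies) // 2]
    return error_rate, median_latency
```

The same harness can be run against GPT-4, o3-mini, or a finetuned model, which is what makes the three-way comparison above possible.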
Neither GPT-4 nor o3-mini met all three requirements for production deployment [13:03:00]. GPT-4 fell short on error rate and cost, while o3-mini failed on cost and, critically, latency [13:16:00]. This made it clear that prompting alone would not suffice; finetuning was necessary.
The Power of Finetuning
Finetuning involves building custom models for specific use cases. It allows for significant improvements in the price-performance curve [14:17:00]:
- Improved Quality (Lower Error Rates):
- Finetuning enabled Method to achieve significantly better accuracy than GPT-4, falling below their required error rate threshold [14:28:00].
- Modern models like o3-mini facilitate this by allowing the use of production data to generate outputs, which then serve as training data for a smaller, finetuned model (a “teacher model” approach; see the data-preparation sketch under “Implementing Finetuning” below) [14:40:00]. While not always matching the teacher model’s performance, the finetuned model can get “quite close” and often outperform larger, less optimized models [15:00:00].
- For Method, an 8-billion-parameter LLM was sufficient [15:14:00].
- Reduced Latency:
- Moving to a much smaller finetuned model (e.g., 8 billion parameters) significantly reduces latency because each request requires far fewer computations [15:34:00].
- It also enables deploying the model within one’s own infrastructure, co-locating it with application code to eliminate network latency entirely (see the serving sketch after this list) [15:51:00].
- Lower Cost:
- Smaller finetuned models result in a much lower cost of operation [16:04:00].
- For Method, the finetuned model came in far below their cost threshold, making the solution viable from a unit-economics perspective and removing cost as a primary concern [16:20:00].
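The talk does not name Method’s serving stack, but as one illustration of co-locating a small finetuned model with application code, an open-source inference engine such as vLLM can load an 8-billion-parameter model directly inside one’s own infrastructure; the model path and prompt below are placeholders:

```python
from vllm import LLM, SamplingParams

# Load the finetuned 8B model from local weights so inference runs
# next to the application code, with no external API hop.
llm = LLM(model="./finetuned-8b")  # placeholder path to finetuned weights

params = SamplingParams(temperature=0, max_tokens=64)
outputs = llm.generate(["Extract the payoff amount from: ..."], params)
print(outputs[0].outputs[0].text)
```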
Implementing Finetuning
Finetuning allows companies to achieve high reliability numbers that might be unattainable with prompt engineering alone [16:50:00]. Method successfully implemented finetuning by:
- Identifying a specific use case [17:25:00].
- Using the cheapest available model [17:28:00].
- Leveraging existing data from GPT in production for training (see the sketch after this list) [17:32:00].
- Selecting a model that offered the fastest performance [17:39:00].
- Avoiding the need to purchase their own GPUs [17:42:00].
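A hedged end-to-end sketch of the steps above: convert logged teacher-model (GPT-4) prompts and outputs from production into chat-format JSONL, then finetune a small open-weights model with the Hugging Face TRL library. The dataset shape, base model, and hyperparameters are illustrative assumptions, not Method’s or OpenPipe’s actual configuration.

```python
import json

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

def build_training_file(production_logs, out_path="train.jsonl"):
    """Turn (prompt, teacher_output) pairs logged from the GPT-4 agent in
    production into chat-format JSONL for supervised fine-tuning."""
    with open(out_path, "w") as f:
        for prompt, teacher_output in production_logs:
            record = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": teacher_output},
            ]}
            f.write(json.dumps(record) + "\n")

# Finetune an ~8B-parameter open-weights model on the teacher-generated data.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative 8B base model
    train_dataset=dataset,
    args=SFTConfig(output_dir="finetuned-8b", num_train_epochs=1),
)
trainer.train()
```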
Productionizing AI agents requires openness and patience from engineering and leadership teams because, unlike traditional code, AI agents take time to become production-ready and provide desired responses consistently [17:47:00]. This approach of finetuning smaller models allows businesses like Method to scale to very large production volumes [17:04:00].