From: aidotengineer

This article follows Method, a fintech company, as it scaled AI agents in production to over 500 million agents, detailing the challenges they faced along the way and how OpenPipe helped optimize their AI infrastructure.

Method’s Core Business and Initial Challenge

Method specializes in collecting and centralizing liability data from hundreds of sources, including credit bureaus, card networks (Visa, MasterCard), and direct connections with financial institutions [00:00:31]. They aggregate and enhance this data for customers, typically other fintechs, banks, or lenders, who use it for debt management, refinancing, loan consolidation, liability payments, or personal finance management [00:00:50].

An early challenge arose when customers requested liability-specific data points, such as payoff amounts for auto loans or escrow balances for mortgages [00:01:46]. There was no central API available to retrieve this information [00:02:07], and direct bank integrations would take years [00:02:21].

The Inefficient Status Quo

Existing companies providing similar services often relied on highly inefficient manual processes [00:03:24]. This involved hiring offshore teams of contractors to call banks, authenticate, gather information, proof-check it, and then integrate it into financial platforms [00:02:55].

Key problems with this manual approach included:

  • Expense: One person can only do one task at a time, requiring more hires to scale [00:03:33].
  • Slowness: The synchronous nature of the process made it very slow [00:03:41].
  • Human Error: Significant human error was involved, necessitating proof-checking teams and risking inaccurate financial information being surfaced [00:03:48].

Conceptually, this process mirrored an API with request, authentication, and response validation components [00:04:04]. The core problem was making sense of unstructured data [00:04:17].
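
To make the analogy concrete, here is a minimal sketch of that conceptual API; the type and field names are illustrative, not Method’s actual schema:

```python
# Illustrative sketch only: maps the manual phone-call workflow onto
# request / authentication / response-validation components.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LiabilityRequest:
    institution: str        # e.g. the bank holding the loan
    account_id: str
    data_point: str         # e.g. "payoff_amount" or "escrow_balance"

@dataclass
class LiabilityResponse:
    raw_text: str           # unstructured answer gathered over the phone
    value: Optional[float] = None

def authenticate(request: LiabilityRequest) -> bool:
    """Stand-in for the agent verifying identity with the institution."""
    return bool(request.account_id)

def validate(response: LiabilityResponse) -> bool:
    """Stand-in for the proof-checking step: was a sane value extracted?"""
    return response.value is not None and response.value >= 0
```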

Initial AI Solution: GPT-4 and Its Limitations

With the rise of LLMs like GPT-4, Method recognized their strength in parsing unstructured data, summarization, and classification [00:04:31]. They developed an agentic workflow using GPT-4, which initially performed very well [00:05:16].
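
As a rough illustration of the workflow’s core step, the sketch below uses the OpenAI Python SDK to pull one structured field out of free-form call notes; the prompt and field are invented for this example and are not Method’s actual prompts:

```python
# Illustrative only: extract a single liability data point from unstructured text.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_payoff_amount(call_notes: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract the auto-loan payoff amount from the notes. "
                        "Reply with the number only."},
            {"role": "user", "content": call_notes},
        ],
    )
    return completion.choices[0].message.content
```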

However, as traffic increased, significant challenges emerged:

  • High Cost: The first month in production with GPT-4 incurred a cost of $70,000 [00:05:50]. While the value was immense, this was a major concern for leadership [00:06:01].
  • Prompt Engineering Difficulties:
    • Prompt engineering offered limited scalability [00:06:25].
    • GPT, despite its intelligence, was not a financial expert, requiring detailed instructions and examples [00:06:31].
    • Prompts were hard to generalize, becoming long and convoluted [00:06:41].
    • A “cat and mouse chase” ensued, where fixing one scenario broke another [00:06:44].
    • There was a lack of prompt versioning [00:06:52].
  • Scaling Challenges (Similar to Manual Process):
    • Expense: Inability to optimize for caching due to response variability and constant prompt tweaks [00:07:18].
    • Latency: The baseline latency was slow, hindering concurrent scaling [00:07:24].
    • AI Errors: Hallucinations were difficult to catch, posing a different nature of error compared to human errors [00:07:36].

Despite these issues, GPT-4 was kept in production for specific use cases where it performed well [00:07:43].

The Shift to Building Scalable AI Systems

The problem evolved from parsing unstructured data to building a robust, agentic workflow that could handle high volume reliably [00:07:57].

Method’s scaling targets were ambitious (a quick back-of-envelope translation follows the list):

  • 16 million requests per day [00:08:10]
  • 100K concurrent load [00:08:12]
  • Minimal latency (sub-200 milliseconds) for real-time agentic workflows [00:08:16]
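
As a quick back-of-envelope check on what those numbers imply (assuming, unrealistically, that traffic is spread evenly across the day):

```python
requests_per_day = 16_000_000
seconds_per_day = 24 * 60 * 60             # 86,400
print(requests_per_day / seconds_per_day)  # ~185 requests/second on average

# The 100K-concurrency target means real traffic is far burstier than that
# average, which is why the sub-200 ms per-request latency budget matters.
```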

OpenPipe’s Role in Optimization

OpenPipe partnered with Method to address quality, cost, and latency issues, common challenges for many companies scaling AI solutions in production [00:08:40].

Benchmarking Existing Models

To begin, OpenPipe benchmarked existing models against Method’s specific task:

  • Error Rates:
    • GPT-4: ~11% [00:09:24]
    • o3-mini: ~4% [00:09:27]
    • Error rates were measured by comparing the agent’s final outputs to human-verified correct data (a minimal sketch of this check follows the list) [00:09:51].
  • Latency:
    • GPT-4: ~1 second [00:10:08]
    • o3-mini: ~5 seconds [00:10:12]
    • Measurements were conducted under real production conditions with diverse tasks and concurrency levels [00:10:21].
  • Cost:
    • Surprisingly, o3-mini was slightly more expensive than GPT-4 for Method’s use case, despite a lower per-token cost [00:10:41]. This was because o3-mini generated significantly more “reasoning tokens,” leading to longer outputs [00:10:52].
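
A minimal sketch of that kind of error-rate measurement, with an illustrative field name, might look like this:

```python
def error_rate(predictions: list[dict], ground_truth: list[dict],
               field: str = "payoff_amount") -> float:
    """Fraction of cases where the agent's final output disagrees with the
    human-verified value for the same account (illustrative field name)."""
    assert len(predictions) == len(ground_truth)
    wrong = sum(1 for pred, truth in zip(predictions, ground_truth)
                if pred.get(field) != truth.get(field))
    return wrong / len(ground_truth)

# e.g. ~0.11 for GPT-4 and ~0.04 for o3-mini on Method's task
```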

Defining Performance Requirements

Method needed to define target performance metrics, knowing that post-processing checks provided a safety net:

  • Error Rate: Around 9% was acceptable, as additional checks would catch further errors [00:11:58].
  • Latency: A hard latency cut-off was essential for the real-time agent system [00:12:04].
  • Cost: Due to high volume, cost was a critical factor [00:12:38].

Neither GPT-4 nor o3-mini could meet all three requirements simultaneously [00:13:05]. GPT-4 fell short on error rate and cost, while o3-mini failed on cost and especially latency [00:13:12].
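
One way to picture the three-way constraint is a simple threshold check; only the ~9% error-rate target comes from the source, and the latency and cost numbers below are placeholders:

```python
# Placeholder thresholds except the ~9% error-rate target mentioned above.
REQUIREMENTS = {"max_error_rate": 0.09, "max_latency_s": 1.0, "max_cost_per_call": 0.01}

def meets_all(candidate: dict) -> bool:
    return (candidate["error_rate"] <= REQUIREMENTS["max_error_rate"]
            and candidate["latency_s"] <= REQUIREMENTS["max_latency_s"]
            and candidate["cost_per_call"] <= REQUIREMENTS["max_cost_per_call"])

# GPT-4 fails on error rate and cost; o3-mini fails on cost and latency,
# so neither candidate passes meets_all().
```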

The Solution: Fine-Tuning with OpenPipe

Fine-tuning is a “power tool” that requires more engineering investment than simple prompt engineering but is crucial when production models don’t meet necessary performance benchmarks [00:13:46].

OpenPipe fine-tuned a custom model for Method’s specific use case, significantly bending the price-performance curve [00:14:17]:

  • Improved Error Rate: The fine-tuned model achieved significantly better accuracy than GPT-4, surpassing the required threshold [00:14:28]. This was made easier by using existing production data and a “teacher model” like o3-mini to generate training data (a minimal data-preparation sketch follows this list) [00:14:40]. The deployed model was an 8-billion-parameter Llama 3.1 model [00:15:14].
  • Lower Latency: Moving to a much smaller (8 billion parameter) model drastically reduced latency due to fewer calculations [00:15:34]. It also allowed for potential co-location with application code to eliminate network latency entirely [00:15:53].
  • Reduced Cost: The significantly smaller model brought cost down well below Method’s target thresholds, eliminating their unit-economics concerns [00:16:04].
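
The fine-tuning step is essentially distillation: production inputs are paired with answers from the stronger teacher model (o3-mini here) and turned into training examples for the smaller Llama 3.1 8B student. A minimal sketch of preparing such data in the common chat-format JSONL used by most fine-tuning pipelines (field names are illustrative, and the exact format OpenPipe expects may differ):

```python
import json

def build_training_file(production_logs: list[dict], path: str = "train.jsonl") -> None:
    """Turn logged production requests plus teacher-model answers into
    chat-format fine-tuning examples for a smaller student model."""
    with open(path, "w") as f:
        for log in production_logs:
            example = {
                "messages": [
                    {"role": "system", "content": log["system_prompt"]},
                    {"role": "user", "content": log["user_input"]},           # e.g. raw call notes
                    {"role": "assistant", "content": log["teacher_output"]},  # o3-mini's answer
                ]
            }
            f.write(json.dumps(example) + "\n")
```

Because the training targets come straight from data the system was already producing, no manual labeling pass is required, which is a large part of why the fine-tune was straightforward to stand up.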

Key Takeaways for Building and Improving AI Agents

  • Simplicity and Efficiency: Method successfully used the cheapest available model and fine-tuned it on production data it already had from running GPT-4 [00:17:28]. There’s no need to buy dedicated GPUs [00:17:42].
  • Patience for Productionizing AI Agents: Productionizing AI agents requires openness and patience from engineering and leadership [00:17:47]. Unlike traditional code, which is expected to work flawlessly and never break once shipped, AI agents take time to become production-ready and to consistently return the desired responses [00:18:01].

Fine-tuning is a powerful method to achieve desired reliability numbers and strongly bend the price-performance curve, enabling organizations to scale AI solutions to a very large degree in production [00:16:55].