From: aidotengineer

Method, a company that centralizes liability data from hundreds of sources for fintechs, banks, and lenders, faced significant challenges in extracting liability-specific data points, such as payoff amounts or escrow balances, for its customers [01:48:09]. This information was crucial for debt management services such as refinancing, loan consolidation, and personal finance management [00:58:12].

Initial Challenges and the Status Quo

Initially, Method found no central API for accessing these specific data points [02:07:10]. Working directly with banks was impractical for an early-stage company, as those relationships would take years to establish [02:17:00].

The existing industry solution involved companies hiring offshore teams of contractors to manually call banks, authenticate, gather information, proof-check it, and integrate it into financial platforms [02:53:00].

This manual process presented several challenges:

  • Inefficiency and Lack of Scalability: It was a highly inefficient and manual process, making scaling difficult [03:23:00].
  • High Cost: Each contractor could only handle one task at a time, necessitating more hires for scale [03:33:00].
  • Slowness: The synchronous nature of the process made it inherently slow [03:41:00].
  • Human Error: Significant human error was involved, requiring additional teams for fact-checking and proof-checking. Surfacing inaccurate financial information was a major risk [03:48:00].

Conceptually, this problem was akin to an API with request, authentication, and response validation components, boiling down to making sense of unstructured data [04:04:00].
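
That analogy maps naturally onto code. Below is a minimal sketch of the conceptual API described above; the type and function names are hypothetical illustrations, not Method’s actual interfaces.

```python
# Hypothetical sketch of the conceptual "API": a request for a liability data
# point, an authentication step, and validation of a response parsed out of
# unstructured data (e.g., a call transcript or statement PDF).
from dataclasses import dataclass
from typing import Optional


@dataclass
class LiabilityRequest:
    account_id: str
    data_point: str          # e.g. "payoff_amount" or "escrow_balance"


@dataclass
class LiabilityResponse:
    data_point: str
    value: Optional[float]
    source_text: str         # the unstructured snippet the value was parsed from


def authenticate(account_id: str) -> bool:
    """Stand-in for the bank/borrower authentication step."""
    ...


def parse_unstructured(raw_text: str, data_point: str) -> LiabilityResponse:
    """Stand-in for 'making sense of unstructured data' (human or LLM)."""
    ...


def validate(response: LiabilityResponse) -> bool:
    """Response validation: sanity-check the extracted value before surfacing it."""
    return response.value is not None and response.value >= 0
```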

The Rise of AI and Initial Adoption

The announcement of GPT-4 and the subsequent “Cambrian explosion” of AI/LLM-enabled applications offered a potential solution [04:31:00]. Advanced LLMs, particularly post-GPT-4 models, excel at parsing unstructured data for tasks like summarization or classification [04:56:00].
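
For illustration, a minimal extraction call of the kind described might look like the following; the prompt, model choice, and JSON field name are assumptions rather than Method’s actual setup.

```python
# Illustrative only: extract one structured field from unstructured text
# using the OpenAI chat completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

call_transcript = """
Agent: Thanks for verifying. The current payoff amount on the loan is
$14,237.85, valid through the end of the month.
"""

resp = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "Extract the requested field from the text. "
                    "Reply with JSON: {\"payoff_amount\": <number or null>}"},
        {"role": "user", "content": call_transcript},
    ],
)

print(resp.choices[0].message.content)  # e.g. {"payoff_amount": 14237.85}
```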

Method developed an agentic workflow using GPT-4, which initially performed well [05:16:00]. They expanded use cases, finding GPT-4 effective even with high API costs [05:21:00].

Challenges with Early AI Models and Implementations

Despite initial success, scaling the GPT-4 solution surfaced significant challenges with these early AI models and implementations:

  • Prohibitive Cost: The first month in production with GPT-4 incurred a cost of $70,000 [05:50:00]. While leadership recognized the immense value, this cost was a major concern [06:01:00].
  • Prompt Engineering Limitations: As use cases scaled, prompt engineering reached its limits [06:23:00]. GPT-4, though smart, was not a financial expert, so it required detailed instructions and examples [06:29:00]. Prompts became long, convoluted, and hard to generalize, leading to a “cat and mouse” chase where fixes for one scenario broke others [06:41:00]. There was also no prompt versioning (a minimal versioning sketch follows this list) [06:52:00].
  • Scaling Impediments:
    • Expense: Optimization for caching was difficult due to variability in responses and constant prompt tweaks [07:17:00].
    • Latency: Baseline latency was too high, preventing concurrent scaling [07:24:00].
    • AI Errors: “Hallucinations” were hard to catch, similar to human errors but of a different nature [07:33:00].
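
As referenced in the prompt engineering item above, prompt versioning was missing. A minimal sketch of what version pinning could look like follows; the prompts and task names are purely illustrative, not Method’s system.

```python
# Hypothetical sketch: pin each use case to an explicit prompt version so a
# fix for one scenario cannot silently change the behavior of another.
PROMPTS = {
    ("payoff_amount", "v1"): "Extract the payoff amount from the text...",
    ("payoff_amount", "v2"): "Extract the payoff amount. If multiple amounts "
                             "appear, prefer the one labeled 'payoff'...",
    ("escrow_balance", "v1"): "Extract the escrow balance from the text...",
}


def get_prompt(task: str, version: str) -> str:
    """Look up a pinned prompt; callers upgrade versions deliberately, not implicitly."""
    return PROMPTS[(task, version)]
```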

Despite these scaling challenges, the system remained in production for specific, high-value use cases [07:43:00].

Shifting Focus: Scaling the Agentic Workflow

The problem evolved from parsing unstructured data (which GPT-4 had solved) to building a robust, scalable agentic workflow [07:50:00]. Method aimed for ambitious targets (a rough back-of-the-envelope on what these numbers imply follows the list):

  • At least 16 million requests per day [08:10:00].
  • At least 100,000 concurrent load [08:12:00].
  • Minimal latency (sub-200 milliseconds) for real-time operations [08:16:00].
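
The following back-of-the-envelope calculation is an interpretation of the stated targets, not figures from the talk; it assumes evenly spread traffic and uses Little’s law (throughput ≈ concurrency / latency) for the peak case.

```python
# Rough arithmetic only: what the stated targets imply about throughput.
requests_per_day = 16_000_000
avg_rps = requests_per_day / 86_400           # ≈ 185 requests/second on average

concurrency = 100_000
latency_s = 0.2                               # the sub-200 ms target
peak_rps_supported = concurrency / latency_s  # ≈ 500,000 requests/second at peak

print(f"average load ≈ {avg_rps:.0f} req/s, "
      f"peak capacity at target latency ≈ {peak_rps_supported:,.0f} req/s")
```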

OpenPipe’s Solution: Benchmarking and Fine-Tuning

OpenPipe collaborated with Method to address these common issues of quality, cost, and latency [08:35:00].

Benchmarking Existing Models

OpenPipe performed detailed benchmarking under real production conditions, considering diverse tasks and concurrency levels [10:21:00].

| Metric     | GPT-4                 | O3 Mini                                                      |
|------------|-----------------------|--------------------------------------------------------------|
| Error Rate | 11% [09:24:00]        | 4% [09:28:00]                                                |
| Latency    | ~1 second [10:08:00]  | ~5 seconds [10:12:00]                                        |
| Cost       | Lower [10:50:00]      | Higher for this use case (more reasoning tokens) [10:50:00]  |

Method measured error rates by having a human go through the agentic workflow and compare the agent’s extracted information (e.g., bank balances) to the real numbers [09:39:00].
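
A minimal sketch of that measurement is below: compare the agent’s extracted values against human-verified ground truth and report the fraction of mismatches. The function signature and tolerance are illustrative assumptions.

```python
# Illustrative error-rate measurement: agent-extracted values vs. verified numbers.
def error_rate(agent_values: list[float], true_values: list[float],
               tolerance: float = 0.01) -> float:
    """Fraction of extractions that don't match the human-verified number."""
    assert len(agent_values) == len(true_values)
    errors = sum(
        1 for got, expected in zip(agent_values, true_values)
        if abs(got - expected) > tolerance
    )
    return errors / len(true_values)


# e.g. error_rate(extracted_balances, verified_balances) -> 0.11 for GPT-4,
#      0.04 for O3 mini in the benchmark above.
```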

Defining Performance Targets

Method established specific targets based on their operational needs:

  • Error Rate: Around 9% was acceptable, as additional plausibility checks were performed on the model output [11:50:00].
  • Latency: A hard latency cutoff was necessary due to the real-time nature of their agent [12:06:00].
  • Cost: Critical due to the very high volume of transactions [12:40:00].

The Need for Fine-Tuning

Neither GPT-4 nor O3 mini met all three requirements simultaneously [13:05:00]. GPT-4 struggled with error rate and cost, while O3 mini failed on cost and especially latency [13:16:00].

This pointed to fine-tuning: a powerful tool that requires more engineering investment than prompt engineering, but one that becomes necessary when off-the-shelf models cannot meet production reliability targets [13:48:00].
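
The “meets all three targets at once” check can be made concrete with a small sketch. The thresholds and benchmark numbers come from this section; cost is reduced to a pass/fail flag because exact figures are not given.

```python
# Illustrative only: check each benchmarked model against all three targets.
from dataclasses import dataclass


@dataclass
class Benchmark:
    name: str
    error_rate: float      # fraction of incorrect extractions
    latency_s: float       # observed latency per request
    within_cost: bool      # whether it fits the cost budget at Method's volume


TARGET_ERROR_RATE = 0.09   # ~9% acceptable, given downstream plausibility checks
TARGET_LATENCY_S = 0.2     # hard real-time cutoff


def meets_all(b: Benchmark) -> bool:
    return (b.error_rate <= TARGET_ERROR_RATE
            and b.latency_s <= TARGET_LATENCY_S
            and b.within_cost)


for b in [Benchmark("gpt-4", 0.11, 1.0, False),
          Benchmark("o3-mini", 0.04, 5.0, False)]:
    print(b.name, "meets all targets:", meets_all(b))   # both print False
```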

Fine-Tuning Results with OpenPipe

OpenPipe fine-tuned an 8-billion-parameter Llama 3.1 model, a size that often suffices for the majority of OpenPipe’s customers [15:12:00].

| Metric     | Fine-Tuned Model                                                                 |
|------------|----------------------------------------------------------------------------------|
| Error Rate | Significantly better than GPT-4, below the 9% threshold [14:28:00]               |
| Latency    | Much lower due to the smaller model; can be reduced further by colocation [15:34:00] |
| Cost       | Much lower, comfortably within the cost threshold for viability [16:02:00]       |

Fine-tuning enabled Method to bend the price-performance curve significantly [14:18:00]. The process was made easier by using existing production data from GPT-4 to generate training data [14:45:00].
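
A minimal sketch of that reuse follows, assuming request/response pairs were logged from production: each logged GPT-4 call becomes one chat-formatted training example in a JSONL file. The log shape and file name are assumptions, not Method’s or OpenPipe’s actual pipeline.

```python
# Hypothetical conversion of logged GPT-4 traffic into fine-tuning data.
import json


def logs_to_training_file(logged_calls: list[dict], out_path: str) -> None:
    """logged_calls: [{'messages': [...], 'completion': '...'}, ...] from production logs."""
    with open(out_path, "w") as f:
        for call in logged_calls:
            example = {
                "messages": call["messages"]
                            + [{"role": "assistant", "content": call["completion"]}]
            }
            f.write(json.dumps(example) + "\n")


# logs_to_training_file(production_logs, "finetune_train.jsonl")
```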

Key Takeaways for Productionizing AI Agents

  • Simplicity and Cost-Effectiveness: It is possible to identify a specific use case, fine-tune the cheapest available model, and achieve fast performance without needing to buy your own GPUs [17:23:00].
  • Data Availability: Leveraging data already generated by earlier AI models (like GPT in production) can provide valuable training data for fine-tuning [17:30:00].
  • Patience and Openness: Productionizing AI agents requires a level of openness and patience from both engineering and leadership teams [17:47:00]. Unlike traditional code that is expected to “just work” once deployed, AI agents take time to become production-ready and continuously improve [17:57:00]. This marks a shift from traditional software engineering [18:14:00].