From: aidotengineer

Engineering teams play a crucial role in bringing AI agents to production, navigating challenges from initial concept to large-scale deployment. This involves identifying needs, selecting appropriate models, benchmarking performance, and optimizing for key metrics like quality, cost, and latency.

Method’s Challenge: Automating Financial Data Aggregation

Method, a company that centralizes liability data from various financial sources, faced a significant challenge when customers requested more specific liability data points, such as payoff amounts for auto loans or escrow balances for mortgages [01:48:00]. Traditional approaches, such as integrating directly with each bank, would have taken years [02:21:00].

The industry status quo was to hire offshore contractor teams to manually call banks, authenticate, gather information, verify it, and integrate the data [02:55:00]. This manual process was inefficient, expensive, slow, and prone to human error, often yielding inaccurate financial information [03:23:00]. The engineering team ultimately framed the task as making sense of unstructured data [04:13:00].

Initial AI Agent Development and Production Issues

With the announcement of GPT-4, Method’s engineering team saw an opportunity to automate this process, leveraging advanced LLMs’ ability to parse unstructured data [04:31:00]. They quickly developed an “agentic workflow” using GPT-4, which initially performed well and allowed for expanded use cases [05:16:00].

However, scaling this initial solution in production revealed significant challenges in AI agent development:

  • High Costs: The first month in production with GPT-4 incurred a cost of $70,000, which made leadership “unhappy,” though the value provided was immense [05:50:00].
  • Prompt Engineering Limitations: As use cases scaled, prompt engineering hit its limits. Despite GPT-4’s general intelligence, it wasn’t a financial expert, so it required extremely detailed, lengthy instructions and examples [06:25:00]. This led to a “cat-and-mouse” cycle in which fixes for one scenario broke others, with no prompt versioning to trace regressions (see the versioning sketch after this list) [06:41:00].
  • Scaling Inefficiency: The system couldn’t scale to concurrent load: baseline latency was slow, and the variability of responses plus constant prompt tweaks made caching impossible to optimize [07:11:00].
  • AI Errors (Hallucinations): Similar to human errors, AI agents introduced “hallucinations” that were difficult to catch [07:36:00].
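
One mitigation for the missing prompt versioning is to treat prompts like code: immutable, versioned artifacts logged with every request. A minimal sketch (the registry, names, and template below are illustrative assumptions, not Method’s actual system):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: str   # bumped on every change, never edited in place
    template: str

# Illustrative registry; in practice this would live in version control.
REGISTRY = {
    ("payoff_extraction", "1.2.0"): PromptVersion(
        name="payoff_extraction",
        version="1.2.0",
        template="Extract the payoff amount from this statement:\n{statement}",
    ),
}

def render(name: str, version: str, **kwargs) -> str:
    """Render a pinned prompt version; log (name, version) with each request."""
    return REGISTRY[(name, version)].template.format(**kwargs)

print(render("payoff_extraction", "1.2.0", statement="...auto loan statement..."))
```

Logging the exact (name, version) pair with each request means a regression can be traced to the specific prompt change that caused it, instead of the untracked tweaks described above.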

The problem for the engineering team shifted from finding a solution for unstructured data to scaling AI agents in production and building a robust agentic workflow [07:57:00]. Method’s production targets were ambitious: 16 million requests per day, 100K concurrent load, and sub-200-millisecond latency [08:07:00].
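
To make those numbers concrete, here is a quick back-of-envelope calculation; the targets come from the talk, everything else is simple arithmetic:

```python
# Back-of-envelope math for Method's stated production targets.
REQUESTS_PER_DAY = 16_000_000
CONCURRENT_LOAD = 100_000
LATENCY_BUDGET_MS = 200

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400
avg_rps = REQUESTS_PER_DAY / SECONDS_PER_DAY  # ~185 requests/second on average

print(f"Average throughput: {avg_rps:,.0f} req/s (peaks will be far higher)")
print(f"Concurrent load:    {CONCURRENT_LOAD:,} in-flight requests")
print(f"Latency budget:     {LATENCY_BUDGET_MS} ms per request")

# A model that needs ~1 s (GPT-4) or ~5 s (o3-mini) per call cannot meet a
# 200 ms budget by scaling horizontally; latency is a hard per-request cut-off.
```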

Collaborating for Production-Ready AI Agents

OpenPipe collaborated with Method to address the three dimensions common to AI production systems: quality, cost, and latency [08:40:00]. The engineering approach involved:

  1. Benchmarking Existing Models:

    • Error Rates: GPT-4 had an 11% error rate, while o3-mini had a 4% error rate on Method’s specific task [09:24:00]. Error rates were measured by comparing agent outputs against human-verified ground-truth numbers [09:51:00].
    • Latency: GPT-4 responded in about one second, while o3-mini took about five seconds on the same task [10:08:00].
    • Cost: o3-mini was slightly more expensive than GPT-4 despite a lower per-token price, because it generates more reasoning tokens and longer outputs [10:42:00]. Engineering teams should benchmark under real production conditions and across diverse tasks [10:21:00] (a minimal benchmarking harness is sketched after this list).
  2. Defining Target Metrics:

    • Method’s quality target was around a 9% error rate, as they had subsequent checks to filter out inaccuracies [11:58:00].
    • Latency was critical due to the real-time nature of the agentic workflow, requiring a hard cut-off [12:06:00].
    • Cost was highly important given the very high volume of operations [12:40:00].
  3. Implementing Fine-Tuning:

    • Neither GPT-4 nor o3-mini met all three production requirements [13:05:00].
    • Fine-tuning was identified as a “power tool” for getting AI agents to production when off-the-shelf models don’t meet performance needs [13:47:00], though it requires more engineering investment than prompt engineering alone [13:48:00].
    • By fine-tuning an 8-billion-parameter Llama 3.1 model on Method’s own production data (see the data-preparation sketch after this list), the team achieved:
      • Improved Accuracy: Significantly better error rates than GPT-4, meeting the required threshold [14:28:00].
      • Lower Latency: Much faster responses from the smaller model, with the option to co-locate it with application code and eliminate network latency entirely [15:34:00].
      • Reduced Cost: Substantially lower costs compared to larger models [16:02:00].
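
A minimal benchmarking harness in the spirit of step 1 might look like the sketch below. Everything here is an assumption for illustration: call_model(), the task format, and the threshold values stand in for whatever Method and OpenPipe actually used; only the metrics themselves (error rate against human-verified answers, latency, cost) come from the talk.

```python
import statistics
import time

# Method's stated cut-offs, used as pass/fail thresholds (illustrative values:
# the ~9% error-rate target and the hard latency cut-off from step 2).
TARGETS = {"error_rate": 0.09, "max_latency_s": 0.2}

def benchmark(model: str, call_model, tasks: list[dict]) -> dict:
    """Run one model over human-verified tasks; report error rate, latency, cost.

    call_model(model, prompt) -> (output_text, cost_usd) is a hypothetical client.
    Each task is {"input": ..., "expected": ...} with a human-verified answer.
    """
    errors, latencies, total_cost = 0, [], 0.0
    for task in tasks:
        start = time.perf_counter()
        output, cost_usd = call_model(model, task["input"])
        latencies.append(time.perf_counter() - start)
        total_cost += cost_usd
        if output.strip() != task["expected"]:  # compare to ground truth
            errors += 1
    return {
        "model": model,
        "error_rate": errors / len(tasks),
        "median_latency_s": statistics.median(latencies),
        "total_cost_usd": total_cost,
    }

def meets_targets(result: dict) -> bool:
    return (result["error_rate"] <= TARGETS["error_rate"]
            and result["median_latency_s"] <= TARGETS["max_latency_s"])
```

As the talk stresses, such a harness is only meaningful when run against real production traffic and a diverse task mix, not synthetic prompts.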
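
Step 3’s approach of fine-tuning on existing production data can be sketched as below. The log fields and the production_logs iterable are hypothetical stand-ins for whatever Method actually logged; the chat-format JSONL is the common supervised fine-tuning format, and filtering to verified outputs keeps hallucinations out of the training set.

```python
import json

def to_training_example(log: dict) -> dict:
    """Convert one logged GPT-4 production call into a chat-format example."""
    return {
        "messages": [
            {"role": "system", "content": log["system_prompt"]},
            {"role": "user", "content": log["raw_statement_text"]},
            # Train only on outputs that passed downstream verification,
            # so the smaller model learns verified answers, not hallucinations.
            {"role": "assistant", "content": log["verified_output"]},
        ]
    }

def write_dataset(production_logs, path: str = "finetune_train.jsonl") -> None:
    with open(path, "w") as f:
        for log in production_logs:
            if log.get("passed_verification"):
                f.write(json.dumps(to_training_example(log)) + "\n")
```

The resulting JSONL can then drive a fine-tune of Llama 3.1 8B through a managed service (such as OpenPipe) or open-source tooling; no GPU ownership is required, as the takeaways below note.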

Key Takeaways for Engineering Teams

The experience highlights several best practices for implementing AI in teams:

  • Simplicity and Optimization: For a specific use case, teams can succeed by fine-tuning the cheapest available model rather than defaulting to the largest one [17:28:00]. Production data from the initial LLM deployment (e.g., GPT-4 outputs) can serve as the fine-tuning dataset, eliminating the need to source new data [17:30:00].
  • No Need for GPU Ownership: Teams don’t need to buy their own GPUs to deploy a fine-tuned model [17:42:00].
  • Openness and Patience: Productionizing AI agents requires a level of openness and patience from engineering and leadership teams [17:47:00]. Unlike traditional code that is expected to work flawlessly upon release, AI agents need time to become production-ready and consistently deliver desired responses [18:03:00].

In conclusion, engineering teams are central to the entire lifecycle of AI agent production, from identifying problems and evaluating solutions to optimizing performance and addressing technical challenges in AI agent development like cost, latency, and error rates. Fine-tuning emerges as a powerful strategy to bend the price-performance curve and achieve the scale required for real-world applications [16:55:00].