From: aidotengineer
This article follows the journey of Method, a company that aggregates and centralizes liability data, and its collaboration with OpenPipe to overcome the challenges of running AI agents in production at a scale of more than 500 million agents. The focus is on optimizing cost and latency in AI deployments [00:00:22].
Method’s Core Business
Method collects and centralizes liability data from hundreds of sources, including credit bureaus, card networks (Visa, MasterCard), direct financial institutions, and third-party sources [00:00:33]. This enhanced data is served to customers, typically other fintechs, banks, or lenders, for debt management purposes such as refinancing, loan consolidation, liability payments, or personal finance management [00:00:50].
OpenPipe assists in building, training, and deploying open-source models for practical use, enabling continuous model improvement using user and environmental signals from production [00:01:08].
Initial Challenges: Manual Data Collection
Early on, Method’s customers requested liability-specific data points like payoff amounts on auto loans or escrow balances for mortgages [00:01:46]. Research revealed no central API for these data points [00:02:05]. Direct integration with banks would take years, which was not feasible for an early-stage company aiming for rapid deployment [00:02:14].
The status quo for many companies obtaining this data involved hiring offshore teams of contractors [00:02:53]. These teams would:
- Call banks on behalf of the company and end-consumer [00:03:02].
- Authenticate with banks and gather information [00:03:06].
- Require human proof-checking before integration into financial platforms for underwriting or user surfacing [00:03:08].
This manual process was highly inefficient and problematic for scaling AI products [00:03:29]:
- Expensive: One person can only do one task at a time, requiring more hires to scale [00:03:33].
- Slow: The synchronous nature of the process made it inherently slow [00:03:41].
- Human Error: Significant risk of human error, necessitating additional teams for fact-checking and proof-checking, with inaccurate financial information being the worst outcome [00:03:48].
Conceptually, this process resembled an API with request, authentication, and response validation components [00:04:04]. The core problem was making sense of unstructured data [00:04:15].
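To make the analogy concrete, here is a minimal sketch of that manual workflow modeled as a request/authenticate/validate pipeline. All names (LiabilityRequest, authenticate, validate) are illustrative stand-ins, not Method's actual code.

```python
# Illustrative sketch: the manual phone-call workflow as an API-like pipeline.
from dataclasses import dataclass


@dataclass
class LiabilityRequest:
    institution: str       # e.g. the bank holding the auto loan
    account_holder: str
    data_point: str        # e.g. "payoff_amount" or "escrow_balance"


def authenticate(request: LiabilityRequest) -> str:
    """Stand-in for the caller proving authority to ask on the consumer's behalf."""
    return f"session-for-{request.account_holder}"


def fetch_raw_response(session: str, request: LiabilityRequest) -> str:
    """Stand-in for the unstructured answer a bank gives over the phone."""
    return "Your payoff amount as of today is $14,732.18, valid for 10 days."


def validate(raw: str, request: LiabilityRequest) -> dict:
    """Stand-in for the proof-checking step before data reaches underwriting."""
    # The hard part in practice: turning unstructured text into a verified number.
    return {"data_point": request.data_point, "value": 14732.18, "verified": True}
```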
Embracing AI: GPT-4 as the Initial Solution
The announcement of GPT-4 by OpenAI, amidst the “Cambrian explosion” of LLM-enabled applications, presented a seemingly perfect solution for Method [00:04:31]. Advanced LLMs, particularly post-GPT-4, excel at parsing unstructured data for tasks like summarization and classification [00:04:54].
Method quickly developed an agentic workflow using GPT-4, which performed exceptionally well [00:05:16]. They expanded use cases to maximize value from single API calls, testing different extractions in a controlled production environment [00:05:22].
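As an illustration of what such a workflow can look like, the sketch below uses the OpenAI Python SDK to extract a single liability data point from unstructured text. The prompt, field names, and transcript are assumptions for illustration, not Method's production code.

```python
# Illustrative only: a GPT-4 call that extracts one liability data point
# from an unstructured transcript. Prompt and fields are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

raw_transcript = (
    "Rep: The payoff amount on the auto loan is $14,732.18, good through June 30."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "Extract the requested field from the call transcript. "
                       'Respond with JSON: {"payoff_amount": <number or null>}.',
        },
        {"role": "user", "content": raw_transcript},
    ],
    temperature=0,
)

print(response.choices[0].message.content)  # e.g. {"payoff_amount": 14732.18}
```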
The Problem of Scale with Off-the-Shelf LLMs
As traffic increased, significant challenges emerged:
- Prohibitive Cost: The first month in production with GPT-4 incurred a bill of $70,000 [00:05:50]. Despite leadership initially accepting this due to the immense value, it highlighted a major cost consideration [00:06:01].
- Prompt Engineering Limitations: Prompt engineering quickly became a scaling bottleneck [00:06:25]. While GPT-4 is intelligent, it lacked financial expertise, requiring highly detailed, generalized, and complex instructions with examples [00:06:31]. This led to a “cat and mouse” chase where fixes for one scenario broke others [00:06:44]. The absence of prompt versioning exacerbated these issues [00:06:52]; a minimal versioning sketch follows this list.
- Inefficiency and Scalability Issues:
- Costly: Difficulty optimizing for caching due to variability in responses and frequent prompt tweaks [00:07:17].
- Slow: Baseline latency was too high to support concurrent scaling [00:07:24].
- AI Errors: “Hallucinations” were difficult to catch; comparable to human errors in impact, but different in nature [00:07:34].
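As a minimal sketch of the prompt-versioning gap called out above, prompts can be pinned to explicit versions so a fix for one scenario can be rolled out and rolled back deliberately. The structure is illustrative; a git history or a prompt registry serves the same purpose.

```python
# Illustrative prompt-version registry: each production call pins an explicit version.
PROMPTS = {
    ("extract_payoff_amount", "v1"): "Extract the payoff amount from the text...",
    ("extract_payoff_amount", "v2"): (
        "Extract the payoff amount from the text. "
        "If multiple amounts appear, prefer the one labeled 'payoff' or 'good through'."
    ),
}


def get_prompt(task: str, version: str) -> str:
    """Look up the exact prompt text used for a given task and version."""
    return PROMPTS[(task, version)]
```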
Despite these issues, GPT-4 remained in production for specific use cases where it performed exceptionally well [00:07:43].
The Shift in Problem: Scaling a Robust Agentic Workflow
The problem evolved from merely understanding unstructured data (solved by GPT) to building a robust, scalable agentic workflow that could handle significant volume reliably [00:07:50].
Method’s target scale required:
- At least 16 million requests per day [00:08:09].
- At least 100,000 concurrent load [00:08:12].
- Minimal latency (sub 200 milliseconds) for real-time agentic workflows [00:08:16].
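A rough back-of-envelope (my arithmetic, not figures from the talk) shows how demanding these targets are in combination:

```python
# Back-of-envelope on the scale targets above.
requests_per_day = 16_000_000
avg_rps = requests_per_day / 86_400   # ~185 requests/second averaged over a day

concurrent = 100_000
latency_s = 0.200                     # sub-200 ms target
# Little's law (throughput = concurrency / latency): sustaining 100k in-flight
# requests at ~200 ms each implies peak service capacity on the order of 500k req/s.
peak_rps = concurrent / latency_s

print(f"average: ~{avg_rps:.0f} req/s, implied peak capacity: ~{peak_rps:,.0f} req/s")
```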
This led to the question of whether to invest in more GPUs or host their own models [00:08:24].
OpenPipe’s Approach: Benchmarking and Fine-Tuning
OpenPipe collaborated with Method to address the common issues of quality, cost, and latency in AI deployments [00:08:34].
Benchmarking Existing Models
OpenPipe began by measuring error rates, latency, and costs under real production conditions, using a diverse range of tasks and reasonable concurrency levels [00:09:04].
- Error Rates:
- GPT-4: ~11% error rate [00:09:24].
- O3 Mini: ~4% error rate [00:09:27].
- Method could measure this easily by comparing the agent’s final outputs (bank balances, etc.) against human-validated real numbers [00:09:38].
- Latency:
- GPT-4: ~1 second to respond [00:10:08].
- O3 Mini: ~5 seconds for their specific task [00:10:12].
- Cost:
- Despite O3 Mini having a lower per-token cost than GPT-4, it was found to be more expensive for Method’s use case due to generating many more “reasoning tokens” and thus longer outputs [00:10:41].
Recommendation for Benchmarking
It’s recommended to benchmark different models using simple Python scripts to characterize performance for your specific use case, especially when optimizing after an initial proof of concept [00:11:04]. This allows quick comparison as new models emerge [00:11:18].
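Below is a minimal sketch of that kind of benchmarking script: run the same human-validated cases through each candidate model and compare error rate, latency, and cost. Model names, pricing, and the exact-match check are placeholders to adapt to your own use case.

```python
# Sketch of a simple model-comparison script (placeholder models and pricing).
import time
from statistics import mean

from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4", "o3-mini"]                               # candidates to compare
PRICE_PER_1K_TOKENS = {"gpt-4": 0.03, "o3-mini": 0.0011}    # placeholder rates


def run_case(model: str, case: dict) -> dict:
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": case["prompt"]}],
    )
    latency = time.perf_counter() - start
    answer = resp.choices[0].message.content.strip()
    tokens = resp.usage.total_tokens
    return {
        # Exact match against the human-validated value is a simple placeholder check.
        "correct": answer == case["expected"],
        "latency": latency,
        "cost": tokens / 1000 * PRICE_PER_1K_TOKENS[model],
    }


def benchmark(cases: list[dict]) -> None:
    for model in MODELS:
        results = [run_case(model, c) for c in cases]
        print(
            f"{model}: error_rate={1 - mean(r['correct'] for r in results):.1%} "
            f"avg_latency={mean(r['latency'] for r in results):.2f}s "
            f"avg_cost=${mean(r['cost'] for r in results):.4f}"
        )
```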
Defining Target Requirements
Method’s specific needs for their real-time agent system were:
- Error Rate: Around a 9% error rate was acceptable, since downstream plausibility checks were applied to the numbers [00:11:50].
- Latency: A strict latency cut-off was required for their real-time system [00:12:06]. This varies widely by customer, from days for background batch processes to sub-500ms for real-time voice applications [00:12:18].
- Cost: Due to the high volume, cost was very important to Method [00:12:40].
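One way to keep these requirements explicit is to encode them as thresholds that every benchmark result is checked against. The 9% and 200 ms figures mirror the numbers quoted earlier; the cost ceiling is left as a parameter since it depends on volume.

```python
# Sketch: target thresholds checked against one row of benchmark results.
TARGETS = {"max_error_rate": 0.09, "max_latency_s": 0.200}


def meets_targets(result: dict, max_cost_per_request: float) -> bool:
    """result is one row from a benchmark run, e.g. the script above."""
    return (
        result["error_rate"] <= TARGETS["max_error_rate"]
        and result["latency"] <= TARGETS["max_latency_s"]
        and result["cost"] <= max_cost_per_request
    )
```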
Identifying the Gap
Neither GPT-4 nor O3 Mini could simultaneously meet all three requirements (quality, latency, and cost) [00:13:05]. GPT-4 fell short on error rate and cost, while O3 Mini failed on cost and especially latency [00:13:14].
Fine-Tuning as a “Power Tool”
OpenPipe’s solution involved fine-tuning and building custom models for Method’s specific use case [00:13:37].
When to Fine-Tune
Fine-tuning is a “power tool” that requires more engineering investment than prompt engineering. It should only be pursued after benchmarking existing production models with simple prompting and determining they cannot meet the necessary performance numbers [00:13:47].
Fine-tuning significantly “bends the price-performance curve” [00:14:18]:
- Error Rate Improvement: Fine-tuning enabled Method to achieve significantly better error rates than GPT-4, surpassing their required threshold [00:14:27]. This has become easier with models like O3 Mini, which allow using production data to generate outputs and train a smaller model (the “teacher model” approach) [00:14:40]; a data-preparation sketch follows below.
- Reduced Latency: Moving to a much smaller model (an 8-billion-parameter Llama 3.1 model in Method’s case) dramatically lowered latency due to fewer sequential calculations [00:15:34]. It also allows deployment within one’s own infrastructure, co-locating the model with application code to eliminate network latency [00:15:53].
- Lower Cost: The significantly smaller, fine-tuned model came in well under Method’s cost targets, eliminating concerns about unit economics [00:16:03].
For the majority of OpenPipe’s customers, an 8 billion parameter model or smaller is sufficient to meet quality targets [00:15:18].
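For the teacher-model approach mentioned above, the data-preparation step can be as simple as exporting logged production requests and their validated outputs to JSONL for fine-tuning a smaller model such as Llama 3.1 8B. The chat-style schema below is one common choice; the exact format depends on the fine-tuning stack you use.

```python
# Sketch: write logged production calls and validated outputs to a JSONL
# fine-tuning dataset (chat-style records; schema varies by fine-tuning stack).
import json


def write_finetune_dataset(logged_calls: list[dict], path: str) -> None:
    """logged_calls: [{'system': ..., 'user': ..., 'validated_output': ...}, ...]"""
    with open(path, "w") as f:
        for call in logged_calls:
            record = {
                "messages": [
                    {"role": "system", "content": call["system"]},
                    {"role": "user", "content": call["user"]},
                    {"role": "assistant", "content": call["validated_output"]},
                ]
            }
            f.write(json.dumps(record) + "\n")
```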
Key Takeaways for AI Agent Deployment
- Simplicity and Focus: It’s not overly complicated to achieve scale. Identify a specific use case, leverage the cheapest suitable model, and fine-tune it [00:17:21]. Method successfully used production data from GPT-4 for fine-tuning, avoiding the need to acquire new data [00:17:30].
- No Need for Personal GPUs: Businesses don’t necessarily need to buy their own GPUs for successful AI deployments [00:17:42].
- Patience and Openness: Productionizing AI agents requires patience and openness from engineering and leadership teams [00:17:46]. Unlike traditional code, which is expected to “just work” after deployment, AI agents take time to become production-ready and consistently deliver desired responses due to their probabilistic nature [00:17:56].
Fine-tuning is a powerful way to achieve high reliability and bend the price-performance curve, enabling large-scale production deployments like the one Method achieved [00:16:55].