From: aidotengineer
Developing AI applications successfully requires careful consideration of evaluations and strategic application of fine-tuning. These elements are crucial for ensuring models perform as expected and deliver real business value.
The Importance of Evaluations
Evaluations are fundamental to guiding AI development towards desired outcomes [00:13:40]. They provide empirical data on how changes such as prompt engineering or prompt caching affect model performance [00:15:04].
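To make this concrete, below is a minimal sketch of an evaluation harness in Python. All names here are hypothetical: `model_fn` stands in for whatever call wraps your prompt and model, and the naive string-match grader is a placeholder for a real grading strategy.

```python
# Minimal eval-harness sketch (hypothetical names; swap in your own
# model call and grading logic).
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected: str

def run_eval(model_fn, cases: list[EvalCase]) -> float:
    """Return the fraction of cases the model answers correctly."""
    passed = 0
    for case in cases:
        output = model_fn(case.prompt)
        # Naive substring match as a stand-in grader; replace with a
        # real rubric, exact-match, or model-graded check.
        if case.expected.lower() in output.lower():
            passed += 1
    return passed / len(cases)

# Compare two prompt variants empirically rather than by feel:
# baseline_score = run_eval(lambda p: call_model(SYSTEM_V1, p), cases)
# variant_score  = run_eval(lambda p: call_model(SYSTEM_V2, p), cases)
```

Running the same eval set against each change (a new prompt, caching, a different model) is what turns those changes into measurable deltas rather than guesses.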
Common Mistakes in AI Evaluation
A frequent error in AI development is to build a robust workflow or architecture first, and only then consider building evaluations [00:13:28]. This approach is suboptimal because evaluations should ideally guide the entire development process [00:13:40].
Other common pitfalls include:
- Struggling with data problems preventing effective eval design [00:13:50]. Claude can assist with data cleanup and reconciliation for evaluation purposes [00:13:54].
- “Trusting the vibes” instead of rigorously testing on representative samples [00:13:59]. It’s essential to have enough samples to ensure statistical significance and predict performance in production environments [00:14:07].
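On the sample-size point, a rough normal-approximation confidence interval shows why a handful of examples cannot predict production behavior. This is a back-of-the-envelope sketch, not a substitute for proper statistical testing:

```python
import math

def accuracy_confidence_interval(passed: int, total: int, z: float = 1.96):
    """~95% normal-approximation CI for an observed pass rate."""
    p = passed / total
    margin = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - margin), min(1.0, p + margin)

# With 20 samples, an 85% pass rate is consistent with anything from
# roughly 69% to 100% -- far too wide to trust in production.
print(accuracy_confidence_interval(17, 20))     # ~ (0.69, 1.00)
# With 1,000 samples, the same rate narrows to roughly 83-87%.
print(accuracy_confidence_interval(850, 1000))  # ~ (0.83, 0.87)
```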
Best Practices for Designing Evaluations
Effective evaluations help navigate the “latent space” of possible model behaviors, guiding developers to an optimized point faster than competitors [00:15:21].
Key practices include:
- Setting up telemetry to back-test architectures in advance [00:15:35].
- Designing representative test cases that include unusual or “silly” examples to ensure models respond appropriately or reroute questions [00:15:59].
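As an illustration of the second practice, a test suite might mix typical queries with edge cases and deliberately out-of-scope ("silly") inputs, then check that the model answers, asks for clarification, or reroutes as appropriate. All cases and the `classify_response` grader below are hypothetical:

```python
# Hypothetical test cases for a customer-support agent. "expect" is the
# desired behavior class, not an exact answer.
TEST_CASES = [
    {"input": "How do I reset my password?",       "expect": "answer"},
    {"input": "Can I get a refund after 45 days?", "expect": "answer"},
    {"input": "asdf;lkj !!!",                      "expect": "clarify"},
    {"input": "Write me a poem about your CEO.",   "expect": "reroute"},
]

def check_behavior(classify_response, cases):
    """classify_response(text) -> 'answer' | 'clarify' | 'reroute'.
    Returns the cases where the observed behavior class is wrong."""
    return [c for c in cases if classify_response(c["input"]) != c["expect"]]
```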
Defining Metrics for Success
Organizations often face a trade-off between intelligence, cost, and latency [00:16:16]. Defining this balance in advance for a specific use case is crucial [00:16:32].
- Customer Support Example: For a customer support agent, speed (e.g., response within 10 seconds) might be prioritized over extensive intelligence, as customers tend to leave if responses are too slow [00:16:40]. User experience (UX) solutions, like a “thinking box” or redirection, can help manage latency [00:17:27].
- Financial Research Example: For a financial research analyst agent, accuracy and comprehensive answers might be paramount, even if it takes 10 minutes to generate a response, given the high stakes of financial decisions [00:16:55].
The stakes and time sensitivity of the decision should drive optimization choices [00:17:10].
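One way to make this advice actionable is to encode the intelligence, cost, and latency targets per use case before building, so candidate models and architectures can be checked against them. All thresholds below are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class UseCaseTargets:
    """Success criteria defined up front (all numbers hypothetical)."""
    max_latency_s: float       # how long a user will realistically wait
    min_accuracy: float        # eval pass rate required to ship
    max_cost_per_query: float  # budget in USD per request

# A support agent prioritizes speed; a research analyst prioritizes accuracy.
SUPPORT_AGENT = UseCaseTargets(max_latency_s=10.0, min_accuracy=0.90,
                               max_cost_per_query=0.01)
RESEARCH_ANALYST = UseCaseTargets(max_latency_s=600.0, min_accuracy=0.99,
                                  max_cost_per_query=1.00)

def meets_targets(t: UseCaseTargets, latency_s: float,
                  accuracy: float, cost: float) -> bool:
    return (latency_s <= t.max_latency_s
            and accuracy >= t.min_accuracy
            and cost <= t.max_cost_per_query)
```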
Case Study: Intercom and Finn 2
Intercom, an AI customer service platform, partnered with Anthropic to enhance their AI agent, Finn [00:10:58].
- Initial Sprint: Applied AI teams ran a two-week sprint with Intercom’s data science team, comparing Finn’s hardest prompts against prompts optimized with Claude [00:11:30].
- Optimization Phase: Following positive initial results, a two-month sprint focused on fine-tuning and optimizing all prompts to maximize Claude’s performance [00:11:48].
- Results: Benchmarks showed Anthropic’s model outperforming their previous LLM [00:11:57]. Finn 2, powered by Anthropic’s model, can resolve up to 86% of customer support volume, with 51% resolved out of the box [00:12:22]. It also delivered more human-like interaction by allowing adjustments to tone and answer length, and showed strong policy awareness (e.g., refund policies) [00:12:35].
Fine-Tuning AI Models
Fine-tuning is often considered a “silver bullet,” but it comes with significant costs and limitations [00:17:58].
When to Approach Fine-Tuning with Caution
Fine-tuning involves “brain surgery” on the model, which can limit its reasoning abilities in domains outside of the one it was specifically fine-tuned for [00:18:06].
It is recommended to try other approaches first before resorting to fine-tuning [00:18:16]. Many developers attempt fine-tuning without a clear evaluation set or success criteria [00:18:20]. Fine-tuning should only be pursued if the desired intelligence cannot be achieved through other methods in a specific domain [00:18:28].
The wide variance in success rates for fine-tuning means that the effort and cost involved must be clearly justified [00:18:41]. Don’t let the pursuit of fine-tuning slow down your initial progress; integrate it later if necessary [00:18:56].
Alternatives to Fine-Tuning
Beyond basic prompt engineering, various features and architectures can significantly improve use case success without immediate fine-tuning:
- Prompt Caching: Can lead to a 90% cost reduction and 50% speed increase without sacrificing model intelligence [00:19:47] (see the sketch after this list).
- Contextual Retrieval: Drastically improves performance by feeding relevant information to the model more effectively, reducing processing time [00:19:55].
- Citations: Can be an out-of-the-box solution [00:20:09].
- Agentic Architectures: An architectural decision that can enhance model capabilities [00:20:13].
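As one example, here is a sketch of prompt caching with the Anthropic Python SDK: a long, stable system prompt is marked cacheable so repeated calls can reuse it instead of reprocessing it. The model name and prompt contents are placeholders; consult the current SDK documentation for exact model IDs and caching availability.

```python
# Prompt-caching sketch using the Anthropic Python SDK.
# (Model name and prompt contents are placeholders.)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...full policy documents, style guide, examples..."

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model name
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Mark this stable prefix as cacheable across calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Can I get a refund after 45 days?"}],
)
print(response.content[0].text)
```

The savings come from the cache applying to the long, unchanging prefix while only the short user turn varies between requests.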
These methods offer powerful ways to optimize AI model performance before considering the complexities and costs associated with fine-tuning.