From: aidotengineer
Anthropic, an AI safety and research company, focuses on building the world’s best and safest large language models (LLMs) [01:26:00]. Their Sonnet 3.5 model, launched in late October of the previous year, is recognized as a leading model in the code space, topping leaderboards for agentic coding evaluations like sbench [02:02:00].
A core differentiating factor for Anthropic is their interpretability research, which involves reverse engineering models to understand their thought processes and steer them for specific use cases [02:36:00]. This research progresses through stages:
- Understanding: Grasping AI decision-making [03:07:00].
- Detection: Identifying specific behaviors and labeling them [03:10:00].
- Steering: Influencing AI input [03:15:00].
- Explainability: Unlocking business value from interpretability methods [03:22:00]. This deep understanding contributes to significant improvements in AI safety, reliability, and usability [03:31:00].
Anthropic’s Approach to Customer Success and Evaluation
The Applied AI team at Anthropic works at the intersection of product research, customer interaction, and internal research [09:05:00]. They provide technical support for use cases, assisting with architecture design, evaluations, and prompt tweaking to optimize model performance [09:14:00]. Insights gained from customer interactions are fed back into Anthropic to improve products [09:23:00].
The team closely collaborates with customers facing niche challenges, helping them apply the latest research and maximize model output through prompting [10:02:00]. This often involves:
- Kicking off a Sprint when customers encounter tricky challenges, such as LLM Ops architectures or evaluations [10:17:00].
- Helping define key metrics for evaluating the model against specific use cases [10:26:00].
- Assisting in deploying the iterative loop into an A/B test environment and eventually into production [10:33:00]. The importance of evaluations is a crucial part of this process [10:42:00].
Case Study: Intercom’s Finn AI Agent
Intercom, an AI customer service platform, developed an AI agent named Finn [10:56:00]. Anthropic partnered with Intercom to enhance Finn’s capabilities:
- Initial Sprint: An Applied AI lead worked with Intercom’s data science team on a two-week sprint, comparing Finn’s hardest prompt against a Claude-assisted prompt [11:27:00].
- Optimization Phase: Following positive initial results, a two-month sprint focused on fine-tuning and optimizing all of Intercom’s prompts for Claude’s best performance [11:43:00].
- Outcome: Anthropic’s model outperformed the previous LLM in benchmarks [11:57:00]. Intercom subsequently launched Finn 2, powered by Anthropic’s models [12:17:00].
Finn 2 has demonstrated significant results:
- Can resolve up to 86% of customer support volume (51% out-of-the-box) [12:22:00].
- Anthropic’s own support team adopted Finn, observing similar resolution rates and enhanced human-like qualities like tone adjustment and answer length [12:29:00].
- Improved policy awareness, such as handling refund policies, unlocking new capabilities [12:45:00].
Common Mistakes in Testing and Evaluation of AI Agents
Organizations frequently encounter several common pitfalls in evaluation and feedback in AI systems:
- Retroactive Evaluations: Building a robust workflow first and only then attempting to build evaluations [13:28:00]. Evaluations should guide the development process from the outset [13:38:00].
- Data Problems: Struggling to design evaluations due to poor data quality, which can be mitigated by using LLMs like Claude for data cleaning and reconciliation [13:48:00].
- “Trusting The Vibes”: Relying on a few queries that “look good” without testing on a statistically significant or representative sample [13:59:00]. This can lead to unexpected outliers and poor performance in production [14:15:00].
It’s crucial to consider the “latent space” of a use case, where different functions (e.g., prompt engineering, caching) move the model’s position. The only way to find an optimized point is empirically, through evaluations [14:26:00]. Evaluations are considered a form of “intellectual property” that allows companies to navigate this space and find optimal solutions faster than competitors [15:14:00].
Best Practices for AI Evaluation
1. Set Up Telemetry and Design Representative Test Cases
Invest in telemetry to back-test architecture in advance [15:35:00]. Design representative test cases that include expected user queries as well as “silly examples” or edge cases that might occur in reality (e.g., an off-topic question for a customer support agent) to ensure appropriate model responses [15:43:00].
2. Identify Key Metrics and Trade-offs
Acknowledge the “intelligence-cost-latency” triangle of trade-offs [16:16:00]. Most organizations can optimize for one or two of these, but rarely all three [16:25:00]. The balance should be defined in advance, driven by the stakes and time sensitivity of the decision for the specific use case [16:32:00]. For example:
- Customer Support: Latency is critical; a response within 10 seconds is vital to prevent user abandonment [16:40:00]. UX design can help manage perceived latency [17:21:00].
- Financial Research: Accuracy is paramount; a 10-minute response time might be acceptable if the subsequent financial decision is high-stakes [16:55:00].
3. Consider Fine-Tuning Carefully
Fine-tuning is not a “silver bullet” and comes at a cost, potentially limiting the model’s reasoning in areas outside of what it was fine-tuned for [17:58:00]. It’s recommended to:
- Try other approaches first: Many issues can be resolved with prompt engineering or architectural adjustments [18:16:00].
- Establish clear success criteria: Fine-tuning should only be pursued if other methods fail to meet specific intelligence domain requirements [18:24:00].
- Justify the cost and effort: The significant variance in fine-tuning outcomes requires a clear justification for its implementation [18:39:00].
- Avoid delaying deployment: Don’t let the pursuit of fine-tuning hinder initial deployment. Implement existing solutions, and if fine-tuning becomes necessary, integrate it later [18:56:00].
4. Explore Alternative Methods for Performance Improvement
Beyond basic prompt engineering, various features and architectures can drastically improve use case success without necessarily sacrificing intelligence for speed:
- Prompt Caching: Can significantly reduce cost and increase speed [19:47:00].
- Contextual Retrieval: Improves the effectiveness of retrieval mechanisms by feeding information more efficiently to the model, reducing processing time [19:54:00].
- Citations: An out-of-the-box feature that can enhance reliability [20:09:00].
- Agentic Architectures: Architectural decisions that involve using AI agents to perform tasks [20:13:00].
These methods for AI evaluation and custom model building and code evaluation for AI systems are crucial for building custom evaluations for better AI performance and achieving successful AI implementations.