From: aidotengineer

Anthropic, an AI safety and research company, works closely with customers to implement artificial intelligence solutions in their businesses [00:00:21]. The insights shared are based on hundreds of customer interactions [00:01:14].

Anthropic’s Mission and Models

Anthropic aims to build the world’s best and safest large language models (LLMs) [00:01:29]. They have released multiple iterations of their frontier models, focusing on safety techniques, research, and policy [00:01:43].

Their marquee model is Claude 3.5 Sonnet, launched in late October of the previous year [00:01:58]. Sonnet is a leading model for code, performing strongly on agentic coding evaluations such as SWE-bench [00:02:11].

Anthropic’s research directions include model capabilities, product research, and AI safety [00:02:34]. A differentiating focus is interpretability, which involves reverse engineering models to understand how they “think” and why, and then steering them for specific use cases [00:02:48].

Interpretability research progresses through stages:

  • Understanding: grasping how the AI makes decisions [00:03:07].
  • Detection: identifying specific behaviors and labeling them [00:03:13].
  • Steering: influencing the model’s behavior [00:03:18].
  • Explainability: unlocking business value from interpretability methods [00:03:25].

Interpretability is expected to provide significant improvements in AI safety, reliability, and usability [00:03:38].

Solving Real Business Problems with AI

When considering AI in enterprise applications, Anthropic encourages customers to focus on using AI to solve core product problems, rather than limiting applications to basic chatbots and summarization [00:05:25]. The goal is to make bigger bets on more transformative uses [00:05:42].

For an onboarding and upskilling platform, instead of just summarizing course content or offering a Q&A chatbot, AI could:

  • Hyper-personalize course content based on individual employee context [00:06:22].
  • Dynamically adapt content to be more challenging if an employee is breezing through [00:06:29].
  • Dynamically update course material based on individual learning styles (e.g., visual learners receive visual content) [00:06:45].

AI is impacting various industries, including taxes, legal, and project management, by significantly enhancing customer experience, making products easier to use, and improving trustworthiness [00:07:29]. These applications are achieving high-quality outputs, especially in business-critical workflows where accuracy is paramount (e.g., taxes where hallucination is unacceptable) [00:07:46].

Getting Started with Anthropic’s Products

Anthropic offers two main products:

  • API: For businesses looking to embed AI in their products and services [00:08:12].
  • Claude for Work: Empowers entire organizations to leverage AI in day-to-day work [00:08:18].

Anthropic also partners with AWS and GCP, allowing customers to access its frontier models on Amazon Bedrock or Google Cloud’s Vertex AI. This lets teams deploy applications in their existing cloud environments without managing new infrastructure, lowering the barrier to entry [00:08:43].
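To illustrate the two access paths, here is a minimal sketch using the anthropic Python SDK; the model identifiers, region, and prompt are illustrative and not from the talk:

```python
# pip install anthropic
from anthropic import Anthropic, AnthropicBedrock

prompt = "Summarize our refund policy in two sentences."

# 1) Anthropic's first-party API (reads ANTHROPIC_API_KEY from the environment).
client = Anthropic()
resp = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)

# 2) The same request routed through Amazon Bedrock (uses your AWS credentials).
bedrock = AnthropicBedrock(aws_region="us-east-1")
resp = bedrock.messages.create(
    model="anthropic.claude-3-5-sonnet-20241022-v2:0",  # example Bedrock model ID
    max_tokens=512,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)
```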

Setting Customers Up for Success

Anthropic’s Applied AI team supports the technical aspects of use cases, helping to design architectures, evaluate performance, and tweak prompts to optimize model output [00:09:20]. They also feed customer insights back into Anthropic’s product and research teams [00:09:26].

The team works closely with customers facing niche challenges in specific use case domains, applying the latest research and maximizing model performance through prompting [00:10:14]. This often involves:

  1. Kicking off a Sprint when challenges arise (LLM Ops, architectures, evals) [00:10:24].
  2. Helping define metrics crucial for evaluating the model against the use case [00:10:30].
  3. Deploying the iterative results into an A/B test environment and eventually production [00:10:40].
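As a sketch of step 3, one common pattern (hypothetical here, not described in the talk) is deterministic traffic splitting, so each user consistently sees either the existing configuration or the sprint’s candidate while the metrics from step 2 are compared:

```python
import hashlib

# Hypothetical traffic split: the same user always lands in the same bucket,
# so the candidate prompt/model from the sprint can be compared against the
# control on live traffic before full rollout. All names are illustrative.
VARIANTS = {
    "control":   {"model": "claude-3-5-sonnet-latest", "prompt_version": "v1"},
    "candidate": {"model": "claude-3-5-sonnet-latest", "prompt_version": "v2"},
}

def assign_variant(user_id: str, candidate_share: float = 0.10) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000 / 1000
    return "candidate" if bucket < candidate_share else "control"

config = VARIANTS[assign_variant("user-42")]
# ...call the model with config["model"] and config["prompt_version"], then log
# the variant name next to the metrics defined in step 2 for later comparison.
```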

Case Study: Intercom’s AI Agent Finn

Intercom, an AI customer service platform, developed an AI agent called Finn [00:11:03]. Anthropic’s Applied AI team collaborated with Intercom’s data science team on a two-week sprint, testing Intercom’s hardest prompts against prompts optimized with Claude [00:11:38]. This led to a two-month sprint of fine-tuning and optimizing all of Intercom’s prompts for Claude [00:11:53].

Results for Intercom’s Finn 2, powered by Anthropic’s models:

  • Outperformed the previous LLM [00:12:00].
  • Can solve up to 86% of customer support volume (51% out of the box) [00:12:27].
  • Increased human-like interactions, allowing for tone adjustment and answer length customization [00:12:44].
  • Improved policy awareness (e.g., refund policies) [00:12:47].

Best Practices and Common Mistakes in AI Implementation

Testing and Evaluation

A common mistake is building a robust workflow first and only then trying to bolt on evaluations [00:13:36]. Evaluations should guide the workflow, because they point toward the desired outcome [00:13:42]. Other mistakes include struggling to source data for eval design, and “trusting the vibes”: running only a handful of queries rather than a representative sample large enough for statistical significance [00:14:11].

“Evals are your intellectual property. If you want to be competitive in a space, you need to be able to out-compete people by navigating that latent space and finding the attractor state faster than anyone else.” [00:15:27]

Best Practices:

  • Set up telemetry in advance so the architecture can be back-tested [00:15:38].
  • Design representative test cases, including “silly” or unexpected examples, to ensure the model responds appropriately or reroutes the question [00:16:06].
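A minimal sketch of such a test set, assuming a hypothetical support assistant and crude keyword pass criteria (a real eval would use many more samples and a proper grader):

```python
from anthropic import Anthropic

client = Anthropic()
SYSTEM = (
    "You are a support assistant for a billing product. If a question is "
    "out of scope, say so briefly and offer to connect the user to a human agent."
)

# Representative cases, including a deliberately off-topic ("silly") query that
# the assistant should reroute. The keyword checks are purely illustrative.
CASES = [
    {"query": "How do I update my credit card?", "must_mention": "card"},
    {"query": "What is the meaning of life?",    "must_mention": "human agent"},
]

def run_case(case: dict) -> bool:
    resp = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model alias
        max_tokens=300,
        system=SYSTEM,
        messages=[{"role": "user", "content": case["query"]}],
    )
    return case["must_mention"].lower() in resp.content[0].text.lower()

passed = sum(run_case(c) for c in CASES)
print(f"{passed}/{len(CASES)} cases passed")
```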

Identifying Metrics

Organizations often face a trade-off triangle of intelligence, cost, and latency [00:16:21]. It’s difficult to optimize for all three simultaneously [00:16:27].

The stakes and time sensitivity of the decision being made should drive optimization choices [00:17:12]:

  • Customer Support: Latency is critical; a customer expects a response within 10 seconds [00:16:42]. User experience (UX) solutions, such as a “thinking box” or redirecting the user to another page, can help manage perceived latency [00:17:41] (see the streaming sketch after this list).
  • Financial Research Analyst Agent: Latency is less critical if the decision made after the response is highly important, such as capital allocation [00:17:05].
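One way to implement the “thinking box” idea is to stream tokens as they are generated, so the user sees progress immediately. A minimal sketch with the anthropic SDK’s streaming helper; model name and prompt are illustrative:

```python
from anthropic import Anthropic

client = Anthropic()

# Stream tokens as they are generated so the user sees progress immediately
# instead of waiting for the full answer.
with client.messages.stream(
    model="claude-3-5-sonnet-latest",
    max_tokens=500,
    messages=[{"role": "user", "content": "Why was my order delayed?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)  # render incrementally in the UI
```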

Fine-tuning

Fine-tuning is not a “silver bullet” and comes at a cost [00:18:02]. It can constrain the model’s reasoning in fields outside the specific domain it was fine-tuned for [00:18:13].

Best Practices:

  • Try other approaches first, especially if an evaluation set is not yet in place [00:18:20].
  • Have clear success criteria established in advance [00:18:26]. Fine-tuning should only be pursued if the desired intelligence cannot be achieved otherwise [00:18:30].
  • Don’t let the potential need for fine-tuning slow down the initial development of an AI use case [00:19:00]. Explore other methods first, then swap in a fine-tuned model if necessary [00:19:11].
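A hypothetical way to keep that later swap cheap is to put the model ID behind a single configuration value, so a fine-tuned model can replace the prompted base model without rearchitecting the pipeline:

```python
import os
from anthropic import Anthropic

# Hypothetical pattern: the model ID lives behind one configuration value, so a
# fine-tuned model can replace the prompted base model later without touching
# the rest of the pipeline. The environment variable name is illustrative.
MODEL_ID = os.getenv("SUPPORT_MODEL_ID", "claude-3-5-sonnet-latest")
client = Anthropic()

def answer(question: str) -> str:
    resp = client.messages.create(
        model=MODEL_ID,   # swap in a fine-tuned model here if evals justify it
        max_tokens=400,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text
```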

Other Methods for Improving Use Case Success

Beyond basic prompt engineering, various features and architectures can significantly impact the success of an AI use case [00:19:35]:

  • Prompt Caching: Can lead to significant cost reduction and speed increases without sacrificing intelligence [00:19:52] (a minimal sketch follows this list).
  • Contextual Retrieval: Drastically improves the performance of retrieval mechanisms, feeding information to the model more effectively and reducing processing time [00:20:04].
  • Citations: An out-of-the-box feature for grounding responses in source documents [00:20:09].
  • Agentic Architectures: Important architectural decisions for more complex AI systems [00:20:17].
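As an example of the first item, a minimal prompt-caching sketch: a large, rarely-changing system block is marked as cacheable so repeated requests can reuse it. The file name and model alias are illustrative, and exact availability and parameters may vary by model and API version:

```python
from anthropic import Anthropic

client = Anthropic()
long_reference = open("policy_manual.txt").read()  # large, rarely-changing context

# The static system block is marked as cacheable so repeated requests can
# reuse it, reducing cost and latency on subsequent calls.
resp = client.messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=400,
    system=[
        {
            "type": "text",
            "text": long_reference,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "What is the refund window for annual plans?"}],
)
print(resp.content[0].text)
```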