From: aidotengineer
This article summarizes key insights on AI implementation and best practices, drawing from hundreds of customer interactions by Anthropic’s applied AI and go-to-market teams [01:14:00]. Presenters Alexander Bricken and Joe Bailey from Anthropic share their expertise on leveraging AI to solve business problems and common pitfalls to avoid [00:44:00].
Anthropic Overview
Anthropic is an AI safety and research company focused on building safe, large language models (LLMs) [01:26:00]. Founded by leading AI experts, the company has released multiple iterations of Frontier models, emphasizing safety techniques, research, and policy [01:34:00].
Their marquee model, Claude 3.5 Sonnet, launched in late October of the previous year, is noted as a leading model for code, topping leaderboards for agentic coding evaluations like SWE-bench [01:53:00].
AI Interpretability Research
Anthropic’s research directions include model capabilities, product research, and AI safety considerations [02:23:00]. A differentiating focus is interpretability, which involves reverse engineering models to understand how and why they “think,” and how to steer them for specific use cases [02:36:00].
Interpretability research is still in its early stages [02:53:00], progressing through stages that build upon each other:
- Understanding: Grasping AI decision-making [03:07:00].
- Detection: Identifying specific behaviors and labeling them [03:10:00].
- Steering: Influencing the AI’s output [03:15:00]. An example is “Golden Gate Claude,” where the model’s activation for the Golden Gate Bridge was amplified, leading it to recommend painting a bedroom “red like the Golden Gate Bridge” [04:41:00].
- Explainability: Unlocking business value from interpretability methods [03:22:00].
Interpretability is expected to significantly improve AI safety, reliability, and usability [03:31:00]. The interpretability team works to understand feature activations at the model level and has published the papers “Towards Monosemanticity” and “Scaling Monosemanticity” [03:38:00].
Customer Engagement and Use Cases
Anthropic encourages customers, especially AI-native or AI startups, to focus on using AI to solve core product problems, moving beyond basic chatbots and summarization [05:17:00].
Example Use Case: Onboarding and Upskilling Platform
Instead of just summarizing course content or offering a Q&A chatbot [06:07:00], an onboarding platform could leverage AI to:
- Hyper-personalize course content based on individual employee context [06:18:00].
- Dynamically adapt content to be more challenging if a user is breezing through [06:26:00].
- Dynamically update course material based on learning styles (e.g., visual content for visual learners) [06:33:00].
Leading companies are achieving industry-leading results by combining their domain expertise with Anthropic’s models across various sectors like taxes, legal, and project management [07:00:00]. These applications drastically enhance customer experience, making products easier to use and more trustworthy, especially for business-critical workflows where hallucination is unacceptable (e.g., tax preparation) [07:22:25].
Intercom’s Fin AI Agent Case Study
Intercom, an AI customer service platform, partnered with Anthropic to enhance their AI agent, Fin [10:58:00].
- Initial Engagement: Anthropic’s applied AI lead worked with Intercom’s data science team on a two-week sprint, comparing their hardest prompt for Fin against a prompt optimized with Claude [11:27:00].
- Optimization Sprint: Positive initial results led to a two-month sprint focused on fine-tuning and optimizing all of Intercom’s prompts for Claude to achieve the best performance [11:43:00].
- Results: Claude outperformed Intercom’s previous LLM [11:57:00]. Intercom’s resolution-based pricing model incentivizes helpful models [12:02:00]. The updated Fin 2 can resolve up to 86% of customer support volume (51% out of the box) [12:22:00]. It also allowed for a more “human” element, with adjustable tone and answer length, and improved policy awareness (e.g., refund policies) [12:35:00].
Anthropic Products and Support
Anthropic offers:
- API: For businesses that want to embed AI in their products and services [08:08:00] (a minimal call is sketched after this list).
- Claude for Work: Empowers organizations to use AI in their daily work [08:14:00]. They also partner with AWS and GCP, allowing access to Frontier models on Bedrock or Vertex AI, enabling deployment in existing environments without managing new infrastructure [08:22:00].
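As a concrete illustration of the API option above, here is a minimal sketch of a Claude call using the Anthropic Python SDK; the model alias and prompt are placeholders, and the SDK also exposes Bedrock and Vertex AI clients for teams deploying through those platforms.

```python
# Minimal sketch: embedding Claude in a product via the Anthropic Python SDK.
# The model alias and prompt below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=256,
    messages=[{"role": "user", "content": "Draft a two-sentence welcome message for a new hire."}],
)
print(response.content[0].text)
```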
The applied AI team works at the intersection of product research, customer interaction, and internal research [09:05:00]. They support technical aspects of use cases, design architectures, conduct evaluations, and tweak prompts to optimize model performance [09:14:00]. They also feed insights back into Anthropic to improve products [09:23:00]. The team often kicks off a sprint when customers face niche challenges related to LLM Ops, AI application frameworks and architecture, or evaluations [10:19:00]. They help define metrics, evaluate models, and deploy iterative loops into A/B test environments and eventually production [10:26:00].
Best Practices for AI Implementation
Testing and Evaluation
A common mistake is building a robust workflow first and only creating evaluations afterwards [13:28:00].
- Evaluations are Directional: Evaluations should direct towards the perfect outcome and be built from the outset or very shortly after starting workflow development [13:38:00].
- Avoid “Trusting the Vibes”: Running a few queries and assuming the results generalize is insufficient [13:59:00]. Evaluations must be based on a statistically significant, representative sample to predict real-world performance [14:04:00].
- Navigating the Latent Space: Think of use cases as a “latent space” where applying different functions (e.g., prompt engineering, prompt caching) moves the model’s position [14:26:00]. Evaluations are the only empirical way to know how changes affect performance and find the optimized point or “attractor state” [15:04:00].
- Evals as Intellectual Property: Robust evaluations are a key competitive advantage, allowing faster navigation of the latent space [15:17:00].
- Setting up Telemetry: Invest in telemetry to back-test architectures in advance [15:35:00].
- Designing Representative Test Cases: Include “silly examples” or edge cases in eval sets to ensure the model handles unexpected inputs appropriately (e.g., a customer support agent encountering an irrelevant question like “how to kill a zombie in Minecraft”) [15:43:00]; a minimal harness sketch follows this list.
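To make these points concrete, below is a minimal evaluation-harness sketch built alongside the workflow: a small representative case set (including an edge case) scored against the system under test. The case set, the `run_agent` stand-in, and the keyword grader are all hypothetical simplifications, not anything prescribed in the talk.

```python
# Minimal eval-harness sketch: a representative case set (including edge cases) scored
# against the workflow under test. All names and the grading logic are illustrative.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    expected_keyword: str  # crude success signal used only for this sketch

EVAL_SET = [
    EvalCase("How do I reset my password?", "reset"),
    EvalCase("Can I get a refund on an annual plan?", "refund"),
    # Deliberate "silly example" to check the agent stays on topic.
    EvalCase("How do I kill a zombie in Minecraft?", "can't help"),
]

def run_agent(prompt: str) -> str:
    # Stand-in for the real workflow (prompt chain, RAG pipeline, agent, etc.).
    return "Sorry, I can't help with that; I can only answer questions about this product."

def grade(response: str, case: EvalCase) -> bool:
    # Simplistic keyword check; real evals often use richer rubrics or an LLM-as-judge.
    return case.expected_keyword.lower() in response.lower()

def run_eval() -> float:
    results = [grade(run_agent(case.prompt), case) for case in EVAL_SET]
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"pass rate: {run_eval():.0%}")
```

In practice the case set would be much larger and drawn from real traffic, but the structure stays the same: versioned cases, an automated grader, and a single pass-rate number to track as prompts and architecture change.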
Identifying Metrics
Organizations often face a trade-off between intelligence, cost, and latency [16:16:00]. It’s difficult to optimize for all three simultaneously, so this balance should be defined in advance for a specific use case [16:26:00].
- Stakes and Time Sensitivity: The importance and time-sensitivity of the decision should drive optimization choices [17:10:00]. For example:
  - A customer support agent prioritizes speed (e.g., response within 10 seconds), as longer waits can lead to customer abandonment [16:40:00]. UX design can help circumvent latency issues (e.g., thinking boxes, redirecting to other pages) [17:21:00].
  - A financial research analyst agent might tolerate a 10-minute response time, as the subsequent capital allocation decision is highly important [16:56:00].
- Trade-offs: Adding more instructions might increase latency but improve performance [17:14:00]. Knowing the important indicators for your product allows for appropriate optimization [17:43:00] (a minimal latency/cost logging sketch follows this list).
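As a small illustration of making this trade-off measurable, the sketch below times each model call and estimates its cost from token counts. The `call_model` stand-in and the per-token prices are hypothetical placeholders, not actual pricing.

```python
# Minimal sketch: log latency and estimated cost per model call so the
# intelligence/cost/latency balance can be measured rather than guessed.
# call_model and the per-token prices are illustrative placeholders.
import time
from typing import Callable, Tuple

PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # placeholder $ per input token
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # placeholder $ per output token

def timed_call(call_model: Callable[[str], Tuple[str, int, int]], prompt: str) -> str:
    """call_model returns (text, input_tokens, output_tokens) for whatever model is used."""
    start = time.perf_counter()
    text, input_tokens, output_tokens = call_model(prompt)
    latency = time.perf_counter() - start
    cost = input_tokens * PRICE_PER_INPUT_TOKEN + output_tokens * PRICE_PER_OUTPUT_TOKEN
    print(f"latency={latency:.2f}s  est_cost=${cost:.5f}  tokens={input_tokens}+{output_tokens}")
    return text

# Example with a dummy model so the sketch runs without any API key.
if __name__ == "__main__":
    dummy = lambda p: ("ok", len(p.split()), 5)
    timed_call(dummy, "How do I reset my password?")
```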
Fine-tuning
Fine-tuning is not a silver bullet and comes with costs [17:58:00].
- Limitations: It’s like “brain surgery” on the model, which can limit its reasoning in fields outside of the specific fine-tuning domain [18:06:00].
- Try Other Approaches First: Most users attempt fine-tuning without even having an evaluation set [18:16:00]. There should be clear success criteria in advance, and fine-tuning should only be pursued if the required performance in a specific domain cannot be achieved through other methods [18:21:00].
- Justify the Cost: The effort and cost of fine-tuning (e.g., team involvement, working with providers) must be justified by a clear difference in desired capabilities [18:43:00].
- Avoid Delays: Don’t let the need for fine-tuning slow down initial implementation [18:56:00]. Pursue the use case, realize if fine-tuning is needed, and then substitute the fine-tuned model later [19:06:00].
Alternative Methods and Architectures
Beyond basic prompt engineering, many other features or architectures can drastically improve a use case’s success [19:26:00]:
- Prompt Caching: Can lead to significant cost reduction (e.g., 90%) and speed increase (e.g., 50%) without sacrificing model intelligence [19:42:00] (a minimal sketch follows this list).
- Contextual Retrieval: Drastically improves the performance of retrieval mechanisms, feeding information more effectively to the model and reducing processing time [19:54:00].
- Citations: An out-of-the-box feature that can enhance reliability [20:09:00].
- Agentic Architectures: Architectural decisions that can significantly impact performance [20:13:00].
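As an example of the first item above, here is a minimal prompt-caching sketch assuming the Anthropic Python SDK’s `cache_control` content-block parameter; the model alias, system text, and document are placeholders, and exact parameters may differ across SDK versions.

```python
# Minimal prompt-caching sketch: mark a large, reusable prefix (e.g. a product manual)
# as cacheable so repeated requests reuse it instead of reprocessing it each time.
# Assumes the Anthropic Python SDK's cache_control parameter; names are illustrative.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_REFERENCE_DOC = "..."  # e.g. a long product manual shared across many requests

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # illustrative model alias
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "You are a support agent for this product. Answer using the manual below.\n\n"
                    + LONG_REFERENCE_DOC,
            "cache_control": {"type": "ephemeral"},  # marks this block as cacheable
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.content[0].text)
```

The savings come from reusing the long, unchanging prefix across many requests; only the short user turn varies, so subsequent calls pay far less for the cached portion.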