From: redpointai
LangChain and LangSmith Overview
LangChain is a popular framework for working with Large Language Models (LLMs) and serves as an orchestration layer for building LLM applications [00:00:09]. It has gained significant traction, with over 38,000 Discord members and adoption by major companies like Elastic, Dropbox, and Snowflake [00:00:14]. The core idea behind LangChain is to connect LLMs to external data sources and computation [00:05:13].
LangSmith is a separate SaaS platform developed by LangChain, focusing on observability, testing, and evaluation for LLM applications [00:05:48]. It was developed to address the significant need for teams to transition from prototype to production with confidence [00:06:00].
Core Focus Areas
LangChain and LangSmith primarily focus on three interconnected areas:
- Retrieval [00:06:57]
- Agents [00:06:58]
- Evaluation [00:06:59]
The three are tightly interconnected: agents can be used for retrieval, retrieval is a popular tool for agents, and evaluation is crucial for both [00:07:03]. Agents can also be used to perform evaluation [00:07:10].
LangSmith for Production Applications
LangSmith is particularly valuable for applications involving multiple LLM calls or steps [00:08:05]. Its tracing and observability features log all steps of a chain or agent, including inputs and outputs, which is crucial for understanding and debugging complex systems [00:08:16]. Even for single LLM calls, LangSmith provides value by visualizing templated prompts and conversational history [00:08:52].
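As a hedged illustration of how this tracing is typically switched on, the sketch below assumes a LangSmith account and uses environment-variable names and package paths common in recent LangChain versions; exact names may vary.

```python
import os

# Illustrative LangSmith tracing setup (variable names may differ by version).
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "support-bot"  # traces are grouped by project

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# With tracing enabled, each step of the chain (templated prompt, model call)
# is logged to LangSmith with its inputs and outputs.
prompt = ChatPromptTemplate.from_template("Summarize this ticket: {ticket}")
chain = prompt | ChatOpenAI(model="gpt-4o-mini")
chain.invoke({"ticket": "Customer cannot reset their password."})
```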
On the testing side, LangSmith supports testing across the entire spectrum, from end-to-end applications to individual components [00:09:27].
Current State of Evaluation
The process of evaluating LLM applications involves several key questions for teams:
Data Set Gathering
Teams typically start by hand-labeling 20 or so examples [00:10:52]. They then incorporate edge cases from production data that cause failures, using systems like “thumbs down” feedback or flagging mechanisms [00:10:56]. LangSmith connects production traces with evaluation sets, allowing for continuous improvement [00:11:09].
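A minimal sketch of seeding such an evaluation set with the `langsmith` Python client follows; the dataset name and examples are illustrative, and in practice flagged production traces would be promoted into the same dataset over time.

```python
from langsmith import Client

client = Client()  # reads LANGCHAIN_API_KEY from the environment

# A small hand-labeled seed set (illustrative).
examples = [
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page."},
    {"question": "Do you offer refunds?",
     "answer": "Refunds are available within 30 days of purchase."},
]

dataset = client.create_dataset("support-bot-eval")
for ex in examples:
    client.create_example(
        inputs={"question": ex["question"]},
        outputs={"answer": ex["answer"]},
        dataset_id=dataset.id,
    )
```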
Single Data Point Evaluation
For simple classification tasks, traditional machine learning techniques can be used [00:11:26]. For more complex scenarios, using an LLM as a judge is emerging as a popular technique, though it is not perfect and still requires a human-in-the-loop component [00:11:30]. Many teams still manually review and score individual data points [00:11:53].
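A minimal LLM-as-judge sketch is shown below; the grading prompt, model choice, and CORRECT/INCORRECT convention are illustrative assumptions rather than a prescribed LangSmith evaluator.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Illustrative judge prompt; rubric and output convention are assumptions.
judge_prompt = ChatPromptTemplate.from_template(
    "You are grading a support bot.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Bot answer: {prediction}\n"
    "Reply with exactly CORRECT or INCORRECT."
)
judge = judge_prompt | ChatOpenAI(model="gpt-4o", temperature=0)

def grade(question: str, reference: str, prediction: str) -> bool:
    verdict = judge.invoke(
        {"question": question, "reference": reference, "prediction": prediction}
    )
    return verdict.content.strip().upper().startswith("CORRECT")
```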
Aggregating Metrics
The approach to aggregating metrics varies, from precise scoring to simply determining if a new prompt performs better than a previous one [00:12:08].
Frequency of Evaluation
Evaluation is often done before releases due to its expense and manual components [00:12:45]. The goal is to reduce manual effort enough to allow for continuous integration (CI) testing, similar to software unit tests [00:12:53].
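One way to picture eval-as-CI, sketched under the assumption of a hypothetical `my_app` module exposing the chain under test and the judge from the previous sketch; the dataset and pass-rate threshold are illustrative.

```python
# test_quality.py -- run with pytest once manual effort is low enough for CI.
from my_app import chain          # hypothetical module exposing the chain under test
from my_app.evals import grade    # hypothetical LLM-as-judge helper (see above)

EVAL_SET = [
    {"question": "How do I reset my password?",
     "reference": "Use the 'Forgot password' link on the login page."},
    {"question": "Do you offer refunds?",
     "reference": "Refunds are available within 30 days of purchase."},
]

def test_support_bot_pass_rate():
    passed = sum(
        grade(ex["question"], ex["reference"],
              chain.invoke({"question": ex["question"]}).content)
        for ex in EVAL_SET
    )
    assert passed / len(EVAL_SET) >= 0.8  # illustrative threshold
```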
Value of Manual Review
Despite the desire for automation, manual review of exceptions is where significant value comes from, as it helps teams understand how models behave and what causes unexpected outcomes [00:13:08]. It provides deeper insight into the system, which is crucial in the early, fast-moving stages of AI model development [00:13:39].
Best Practices
- Look at the data: This is underrated and provides valuable insights [00:14:25].
- Come up with an evaluation data set: This forces teams to define expectations for the system, including edge cases and user interactions, which is a key part of the product-building journey [00:14:28]. In traditional machine learning, this was a prerequisite for building models, and it remains important for LLMs [00:15:09].
Future of Evaluation
While manual review is crucial now, the future of evaluation with advanced models like GPT-7 is uncertain [00:15:19]. It is currently more important in this early, fast-moving space, but will likely remain somewhat important even as models mature [00:16:42].
The generalizability of evaluation across different use cases is a challenge [00:16:54]. While core components like data gathering and understanding system behavior are general, specific metrics and evaluations are often use-case dependent [00:17:35]. LangSmith aims for simple, generalizable components while providing scaffolding for common patterns like LLM-as-a-judge [00:18:42].
The Agent Landscape
There are generally two types of agents:
- Super generalizable autonomous agents (e.g., AutoGPT, BabyAGI) [00:21:58]
- More focused agents [00:22:01]
Initial hype was around autonomous agents, but more focused agents appear more practical for today’s use cases [00:22:06]. Multi-agent frameworks (e.g., AutoGen, CrewAI) are gaining traction, but their success lies in controlled flows between specific prompts and tools, rather than fully autonomous general agents [00:22:57]. LangChain’s LangGraph views agents as state machines, allowing for enforced control over transitions and states, which is proving reliable in production for applications like customer support chatbots [00:23:10].
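A minimal LangGraph sketch of a support-bot state machine follows; the node logic is stubbed out for illustration, and the exact API surface may differ slightly across LangGraph versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class SupportState(TypedDict):
    message: str
    category: str
    response: str

def classify(state: SupportState) -> SupportState:
    # In practice this would be an LLM call; hard-coded here for illustration.
    category = "refund" if "refund" in state["message"].lower() else "general"
    return {**state, "category": category}

def handle_refund(state: SupportState) -> SupportState:
    return {**state, "response": "Routing you to the refunds workflow."}

def handle_general(state: SupportState) -> SupportState:
    return {**state, "response": "How else can I help you today?"}

graph = StateGraph(SupportState)
graph.add_node("classify", classify)
graph.add_node("refund", handle_refund)
graph.add_node("general", handle_general)
graph.set_entry_point("classify")
# Transitions are enforced explicitly rather than left entirely to the model.
graph.add_conditional_edges(
    "classify", lambda s: s["category"], {"refund": "refund", "general": "general"}
)
graph.add_edge("refund", END)
graph.add_edge("general", END)

app = graph.compile()
app.invoke({"message": "I want a refund", "category": "", "response": ""})
```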
Complex Applications and UX Innovation
There’s a noticeable shift towards more complex LLM applications [00:24:19].
Categories of Builders
- Super early Gen-native startups: Building cutting-edge, often consumer-facing agents [00:25:00].
- Digital native startups: Initially shipped single LLM calls, but are now moving towards more sophisticated applications like Notion QA [00:25:35].
- Larger Enterprises: A significant amount of work is internal, building assistant-like platforms (similar to a private GPT store) hooked up to internal data and APIs [00:26:04]. These internal applications allow for more risk-taking due to lower exposure [00:26:10].
Application Archetypes for the Future
- More complex chatbots: Moving beyond simple RAG bots to state machine-represented chatbots with different stages (e.g., customer support bots, AI therapists) [00:41:55].
- Longer-running jobs: Applications like GPT Researcher or GPT Newsletter that generate first drafts of reports or articles over minutes rather than seconds [00:42:22]. These require different user experiences (UX) where instantaneous responses are not expected [00:42:51].
UX as a Bottleneck
The most interesting work in AI applications currently lies in user experience, as it’s not yet clear how people want to interact with these new systems [00:43:27]. An example of innovative UX is an AI-native spreadsheet that spins up a separate agent for each cell, populating it in parallel [00:43:50].
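To make the agent-per-cell idea concrete, here is a hedged sketch using plain asyncio, with a hypothetical `run_cell_agent` coroutine standing in for a real tool-using agent.

```python
import asyncio

async def run_cell_agent(row: str, column: str) -> str:
    """Hypothetical per-cell agent; a real one would call an LLM with tools."""
    await asyncio.sleep(0)  # placeholder for the actual agent run
    return f"value for ({row}, {column})"

async def fill_sheet(rows: list[str], columns: list[str]) -> dict:
    # One agent per cell, all launched concurrently, mirroring the
    # spreadsheet UX described above (illustrative sketch only).
    tasks = {
        (r, c): asyncio.create_task(run_cell_agent(r, c))
        for r in rows for c in columns
    }
    return {cell: await task for cell, task in tasks.items()}

results = asyncio.run(fill_sheet(["Acme Corp"], ["CEO", "Headcount"]))
```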
Development and Deployment
LangServe
LangServe was released to simplify the deployment of LangChain applications [00:37:41]. It wraps FastAPI and leverages LangChain’s common orchestration layer (LangChain Expression Language) to provide consistent input/output schemas and endpoints (invoke, batch, stream) [00:37:24]. LangServe also provides a playground for interacting with the application, facilitating cross-functional collaboration and feedback from non-technical subject matter experts [00:38:17].
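A minimal LangServe sketch, assuming the commonly used `langserve` and `langchain_openai` packages; `add_routes` exposes the standard invoke/batch/stream endpoints plus a playground UI for the chain.

```python
from fastapi import FastAPI
from langserve import add_routes
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Illustrative chain to expose as a service.
chain = (
    ChatPromptTemplate.from_template("Answer briefly: {question}")
    | ChatOpenAI(model="gpt-4o-mini")
)

app = FastAPI(title="Q&A service")
add_routes(app, chain, path="/qa")  # adds /qa/invoke, /qa/batch, /qa/stream, /qa/playground

# Run with: uvicorn server:app --reload
```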
Balancing Fast-Moving Space and Stability
Building in the rapidly evolving AI space requires a balance between delivering current solutions (“you have to build” even if it’s a “hack”) and maintaining flexibility for future advancements [00:29:43]. LangChain has evolved towards more flexible, lower-level abstractions (like LangChain Expression Language and LangGraph) to allow for customization, moving away from rigid, higher-level chains [00:31:08]. Base-class abstractions are kept deliberately simple and avoid assumptions about specific implementations; concerns such as retries are handled at the level of the individual class [00:31:32].
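The sketch below illustrates that style of composition with LangChain Expression Language: a concern like retries is attached to the individual component rather than baked into a rigid chain, and the composed chain exposes the same invoke/batch/stream interface. Method and package names reflect recent LangChain versions and may vary.

```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Retries configured on the model component itself, not on the chain.
model = ChatOpenAI(model="gpt-4o-mini").with_retry(stop_after_attempt=3)

chain = (
    ChatPromptTemplate.from_template("Rewrite politely: {text}")
    | model
    | StrOutputParser()
)

# Every LCEL chain exposes the same interface: invoke, batch, stream.
chain.invoke({"text": "This is broken, fix it now."})
chain.batch([{"text": "Where is my order?"}, {"text": "Cancel my plan."}])
for token in chain.stream({"text": "I need help."}):
    print(token, end="")
```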
Multimodal models were a concern for abstractions, but fortunately, their integration didn’t require massive changes [00:32:30]. LangChain 0.1 was released after multimodal capabilities stabilized, aiming for more solid abstractions [00:32:50]. The current focus is on improving documentation and use cases now that the core orchestration layer is more solid [00:33:21].
Inference Costs
For startups, the advice is to focus on building with powerful models like GPT-4 to achieve product-market fit (PMF) first [00:45:22]. Costs and latency are expected to decrease over time [00:45:10]. A key principle is “no GPUs before PMF” [00:45:42].
Future Trends and Obviated Techniques
What might go away
- Context window management: As context windows grow, some tricks for summarizing and managing conversational history might become less necessary [00:46:16].
- Manual output-format instructions: Hopefully, more robust function calling and structured extraction capabilities will eliminate the need to explicitly instruct models to respond in a specific format such as JSON (see the sketch after this list) [00:48:32].
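For contrast, a hedged sketch of structured extraction via function calling, using `with_structured_output` as available in recent LangChain versions; the schema and model choice are illustrative.

```python
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI

# Structured extraction via function calling, instead of prompting the model
# with "respond in JSON" and parsing the text by hand (illustrative sketch).
class ContactInfo(BaseModel):
    name: str = Field(description="Person's full name")
    email: str = Field(description="Email address if present, else empty")

model = ChatOpenAI(model="gpt-4o-mini").with_structured_output(ContactInfo)
result = model.invoke("Reach out to Jane Doe at jane@example.com about the demo.")
print(result.name, result.email)
```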
What might remain
- Retrieval: Will continue to be needed [00:46:11].
- State machine approach: Even with better models, the state machine mental model is helpful for developers in building complex applications, especially where specific instructions or database access depend on the current state [00:48:05].
- Multimodal capabilities: Current multimodal models are not precise enough for many knowledge-work tasks, particularly around spatial awareness in extraction [00:46:43]; improvements are expected in this area.
Key Insights and Predictions
- Multimodal is currently overhyped: While promising, it’s not yet good enough for many real-world use cases [00:49:11].
- Few-shot prompting is underhyped: Teams having success are often utilizing few-shot examples, especially for structured output or complex instructions [00:49:17]. Dynamic selection of relevant examples from a database can enable continual learning and personalization (see the sketch after this list) [00:29:00].
- Importance of streaming: Essential for modern LLM applications, allowing for continuous output as the model processes information [00:49:44].
- Open-source models: Expected to become more ubiquitous, with high interest in local models and agents for personalized applications (e.g., “ask your documents,” coach/mentor personas) [00:52:00].
- Personalization: A crucial “killer app” for AI, where content is tailored to individual users based on their interests and past interactions [00:53:51]. Companies like New Computer and Hearth AI are exploring this space [00:52:52].
- An example of a personalized application would be a journal app that remembers user details and initiates conversations based on journal entries and past memories [00:55:01].
- UX innovation: The most significant area for advancement in AI applications, as designers figure out how users will best interact with these new capabilities [00:43:27].
- Build despite uncertainty: In a fast-moving field, it’s essential to build and iterate rather than waiting for perfect stability [00:30:27].
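As referenced in the few-shot prompting item above, here is a hedged sketch of dynamic example selection using LangChain's semantic-similarity example selector with FAISS and OpenAI embeddings; import paths vary across versions and the examples are illustrative.

```python
from langchain_core.example_selectors import SemanticSimilarityExampleSelector
from langchain_core.prompts import FewShotPromptTemplate, PromptTemplate
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Labeled examples stored for retrieval; new production examples could be
# appended over time to get a continual-learning effect.
examples = [
    {"input": "Cancel my subscription", "output": "billing"},
    {"input": "The app crashes on launch", "output": "bug"},
    {"input": "Can you add dark mode?", "output": "feature_request"},
]

selector = SemanticSimilarityExampleSelector.from_examples(
    examples, OpenAIEmbeddings(), FAISS, k=2
)
prompt = FewShotPromptTemplate(
    example_selector=selector,  # picks the most relevant examples per query
    example_prompt=PromptTemplate.from_template("Input: {input}\nLabel: {output}"),
    prefix="Classify the support message.",
    suffix="Input: {query}\nLabel:",
    input_variables=["query"],
)
print(prompt.format(query="I was charged twice this month"))
```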
Competitive Landscape
The AI space is too early for intense competition focus [00:36:10]. LLM technology differs from traditional applications due to its non-deterministic nature and reliance on APIs and prompting [00:35:04]. While traditional observability companies like DataDog offer LLM products, their value proposition might differ from specialized LLM evaluation platforms that focus on debugging complex chains and providing confidence in iteration [00:35:36]. Companies often use both types of tools in conjunction [00:36:04].