From: redpointai

LangChain, the popular framework for building LLM applications, recognized early on that an orchestration layer alone wasn't enough to simplify the development process. A significant need emerged around observability, testing, and evaluation of these applications, leading to the creation of LangSmith [00:50:51].

Core Functionality

LangSmith operates as a separate software-as-a-service (SaaS) platform, focused on making it easy to build, test, and understand LLM applications [00:51:51]. Its primary value propositions are:

  • Tracing and Observability: LangSmith logs all steps of a chain or agent, including inputs, outputs, and their exact sequence [00:08:16]. This is particularly valuable for complex applications with multiple LLM calls or steps, giving immediate insight into what is happening and aiding debugging [00:08:24]. Even for single LLM calls, it helps visualize templated prompts, conversational history, and trimmed content [00:09:00] (a minimal tracing sketch follows this list).
  • Testing and Evaluation (Eval): LangSmith supports testing across the entire spectrum of an application, from end-to-end user interactions to individual components [00:09:30]. This includes evaluating specific steps, such as routing choices in an agent [00:09:54]. The platform makes it easy to visualize the thought processes of agents and the inputs and outputs of each step, which is crucial for developers to grok these complex systems [00:13:52].
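
To make the tracing model concrete, here is a minimal sketch using the `@traceable` decorator from the `langsmith` Python SDK. The decorator and the tracing environment variables are real SDK surface; the retrieve/answer pipeline and its return values are hypothetical stand-ins for an actual chain.

```python
# Minimal tracing sketch with the langsmith SDK (hypothetical pipeline).
import os
from langsmith import traceable

# LangSmith picks up tracing configuration from the environment.
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_API_KEY"] = "..."  # your API key

@traceable(run_type="retriever")
def retrieve(query: str) -> list[str]:
    # Stand-in for a real vector-store lookup.
    return ["LangSmith logs every step of a chain or agent."]

@traceable(run_type="llm")
def answer(query: str, docs: list[str]) -> str:
    # Stand-in for a real model call; inputs/outputs land in the trace.
    return f"Answered {query!r} using {len(docs)} document(s)."

@traceable  # parent run: nested calls show up as child steps in the trace tree
def rag_pipeline(query: str) -> str:
    return answer(query, retrieve(query))

print(rag_pipeline("What does LangSmith trace?"))
```

Because each decorated call is logged with its inputs, outputs, and position in the call tree, the trace view shows exactly the sequence of steps described above.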

Addressing Evaluation Challenges

Evaluation in LLM applications presents several challenges:

  • Data Set Creation: Teams typically start by hand-labeling a small set of examples (e.g., 20) and then fold production data that caused failures into their test sets [00:10:48]. LangSmith connects failing production traces (flagged by user feedback such as a "thumbs down" or by the system) directly into the evaluation set [00:11:11]. Creating an evaluation dataset also forces developers to spell out the desired behaviors and edge cases for the system [00:14:54] (a dataset-seeding sketch follows this list).
  • Judging Single Data Points: While simple classification tasks can be scored automatically, many tasks require LLMs to act as judges [00:11:30]. LLM-based judging isn't perfect, however, necessitating a human-in-the-loop component [00:11:41]. LangSmith invests heavily in facilitating that human interaction to best support the evaluation process [00:12:01].
  • Aggregating Metrics: How to aggregate evaluation metrics depends on the application. Some teams require perfect scoring for critical data points, while others need to confirm improvement over previous iterations [00:12:11].
  • Frequency of Evaluation: Because evaluations are expensive and time-consuming, they are typically run before releases rather than continuously [00:12:45]. LangSmith aims to reduce the manual component so that evaluations can run in a continuous integration (CI) environment, much like software unit tests [00:12:51] (the second sketch below shows such a CI-style run).
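
A sketch of how such a dataset might be seeded with the `langsmith` SDK. `Client.create_dataset`, `create_example`, and `create_feedback` are real client methods; the dataset name, example content, and run ID placeholder are illustrative.

```python
# Hedged sketch: seeding an evaluation dataset with the langsmith SDK.
from langsmith import Client

client = Client()

# Start with a small hand-labeled set (teams often begin with ~20 examples).
dataset = client.create_dataset(dataset_name="support-bot-evals")
hand_labeled = [
    {"inputs": {"question": "How do I reset my password?"},
     "outputs": {"answer": "Use the 'Forgot password' link on the login page."}},
]
for ex in hand_labeled:
    client.create_example(
        inputs=ex["inputs"], outputs=ex["outputs"], dataset_id=dataset.id
    )

# Production runs that a user thumbs-down can be flagged with feedback and
# later folded into the same dataset (via the UI or API).
client.create_feedback("<failed-run-id>", "user_score", score=0)  # placeholder ID
```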
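And a sketch of an LLM-as-judge evaluation that could run as a CI step via the SDK's `evaluate` helper. `evaluate` and the `Run`/`Example` types are real SDK surface; the judge heuristic and the application under test are hypothetical stand-ins.

```python
# Hedged sketch: LLM-as-judge evaluation run like a unit test in CI.
from langsmith import evaluate
from langsmith.schemas import Example, Run

def judge_llm(prediction: str, reference: str) -> str:
    # Hypothetical stand-in for a judge-model call; a production judge would
    # prompt an LLM and parse its verdict. Judges are imperfect, which is why
    # a human-in-the-loop review step remains important.
    return "correct" if reference.lower() in prediction.lower() else "incorrect"

def correctness_judge(run: Run, example: Example) -> dict:
    # Compare the application's output against the dataset's reference answer.
    prediction = (run.outputs or {}).get("output", "")
    reference = (example.outputs or {}).get("answer", "")
    return {"key": "correctness",
            "score": float(judge_llm(prediction, reference) == "correct")}

def target(inputs: dict) -> dict:
    # The system under test is just a function; LangSmith makes no assumptions
    # about how many LLM calls (if any) happen inside it.
    return {"output": "Use the 'Forgot password' link on the login page."}

# Invoked from a CI job, much like a software unit test.
results = evaluate(
    target,
    data="support-bot-evals",        # the dataset seeded above
    evaluators=[correctness_judge],
    experiment_prefix="ci-run",
)
```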

Distinction from Traditional Observability

While there are parallels with traditional observability platforms like Datadog, LangSmith addresses aspects unique to LLM applications [00:34:09]. LLMs are non-deterministic, and building on them involves prompting and API calls, areas that traditional monitoring was not designed around [00:35:04].

LangSmith focuses on helping users understand what their LLM applications are doing so they can iterate faster and with confidence [00:35:36]. Tools like Datadog, by contrast, excel at system-level monitoring and aggregate metrics such as latency [00:35:48]. Companies have been observed using both platforms together, suggesting the two serve distinct value propositions [00:36:04].

Integration and Development

LangSmith is framework-agnostic, meaning it can be used with or without LangChain [00:18:14]. It treats the system being scored simply as a function, making no assumptions about the number of LLM calls or its internal workings [00:18:20].
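
As an illustration of that framework-agnosticism, the SDK can trace a raw OpenAI client with no LangChain involved. `wrap_openai` is part of `langsmith.wrappers` (assuming the `openai` package is installed); the model and prompt below are illustrative.

```python
# Hedged sketch: LangSmith tracing without LangChain.
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())  # every chat.completions.create call is traced

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize what LangSmith does."}],
)
print(resp.choices[0].message.content)
```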

The development of LangSmith was prioritized because observability, testing, and evaluation were identified as bigger pain points for developers than hosting platforms [00:37:06]. LangSmith is a crucial component of LangChain's mission to make building LLM applications as easy as possible [00:05:30]. It also ties into other LangChain components: evaluation is needed for both retrieval and agents, and agents can even be used to perform evaluation [00:07:08].

LangSmith reached general availability shortly before the podcast recording, after six months of iteration [00:05:56]. The team continually seeks to provide the most value by allocating resources to the areas of greatest need, such as improving documentation and use cases [00:41:01].