From: redpointai

Evaluating AI models, especially Large Language Models (LLMs), presents unique challenges due to their non-deterministic nature and the rapid evolution of the field [00:35:04]. Harrison Chase, founder and CEO of LangChain, emphasizes that the space is early and fast-moving, making it difficult to predict future developments [00:34:30].

Current State of Evaluation

Teams are grappling with several key questions regarding LLM evaluation [00:10:27]:

Data Set Gathering

The fundamental premise for evaluation is having a dataset against which to test the system [00:10:41]. Common strategies include:

  • Hand-labeling: Teams typically start by hand-labeling roughly 20 examples [00:10:52].
  • Production data integration: Edge cases that fail in production are folded into the test set [00:10:56]. This connection between production traces and evaluation sets is highly valuable [00:11:11] (see the sketch after this list).
  • Forcing function for product thinking: Creating an evaluation dataset forces teams to consider what the system should actually do, what edge cases it should handle, and how users are expected to interact with it [00:14:51]. This process is crucial at the start of a product-building journey [00:15:03]. In traditional ML, this had to be done before building a model, and it’s still beneficial for LLMs [00:15:09].
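
For illustration, here is a minimal sketch of that dataset-building loop in plain Python; the example content, field names, and helper are hypothetical, and real projects would typically link production traces to the dataset through a tool such as LangSmith rather than a local file:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalExample:
    input: str     # user input or task description
    expected: str  # reference answer or rubric notes
    source: str    # "hand_labeled" or "production_failure"

# A small hand-labeled seed set, written while thinking through what the product should do.
dataset = [
    EvalExample("Summarize this support ticket: ...",
                "A two-sentence summary mentioning ...",
                "hand_labeled"),
]

def add_production_failure(dataset: list[EvalExample], trace_input: str, corrected_output: str) -> None:
    """Promote an edge case that failed in production into the evaluation set."""
    dataset.append(EvalExample(trace_input, corrected_output, "production_failure"))

add_production_failure(dataset,
                       "Refund request mentioning two order IDs ...",
                       "Ask which order should be refunded ...")

# Persist as JSONL so the same examples can be reused across eval runs.
with open("eval_dataset.jsonl", "w") as f:
    for ex in dataset:
        f.write(json.dumps(asdict(ex)) + "\n")
```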

Evaluation Methods for Single Data Points

  • Classification: For straightforward classification tasks, traditional machine learning techniques can be used [00:12:26].
  • LLMs as judges: For more complex scenarios, using an LLM to judge responses is an increasingly popular technique [00:11:32]. The method is not perfect, however, so it still requires a human-in-the-loop component [00:11:40] (see the sketch after this list).
  • Human review: Many teams still manually review responses, scoring them and comparing them side-by-side [00:11:53]. This manual review is where significant value lies, as it helps teams understand how models work and identify unexpected behaviors [00:13:08]. Observing the thought processes of agents and the inputs/outputs of each step provides deeper system understanding [00:13:57].
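
A minimal LLM-as-judge sketch is shown below, assuming the OpenAI Python client; the judge model and grading prompt are placeholders, and verdicts should still be spot-checked by a human reviewer:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def llm_judge(question: str, reference: str, candidate: str) -> bool:
    """Ask a judge model whether the candidate answer matches the reference.
    Imperfect by design, so disagreements should go to human review."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
    )
    return resp.choices[0].message.content.strip().upper() == "CORRECT"
```

Pinning temperature to 0 and demanding a one-word verdict keeps the judge's output easy to parse, but it does not remove the need to review cases where the judge and a human disagree.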

Aggregating Metrics

Approaches to aggregating evaluation metrics vary [00:12:08]:

  • Some teams aim for perfectly scored results [00:12:11].
  • Others simply want to confirm that a new prompt or system is better than the previous one [00:12:16] (see the sketch after this list).
  • Accuracy percentages are used for specific, critical data points [00:12:27].
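
A small sketch of the "better than the previous version" approach, with hypothetical function names, compares pass rates rather than demanding a perfect score:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of evaluation examples that passed."""
    return sum(results) / len(results) if results else 0.0

def is_improvement(baseline: list[bool], candidate: list[bool], min_gain: float = 0.0) -> bool:
    """True when the new prompt or system scores at least as well as the old one."""
    return pass_rate(candidate) >= pass_rate(baseline) + min_gain
```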

Frequency of Evaluation

Evaluation can be expensive and slow [00:12:35]. Teams typically run evaluations before releases due to the significant manual component [00:12:45]. The goal is to reduce this manual effort sufficiently to enable running evaluations in Continuous Integration (CI) like software unit tests [00:12:53].
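
A hypothetical pytest-style gate illustrates that end state, reusing the dataset and judge sketched earlier; `my_app` and `my_evals` are stand-in module names, not real packages:

```python
# Hypothetical CI evaluation gate: once manual review is cheap enough,
# the same dataset can run in CI like a unit test suite.
import json
import pytest

from my_app import run_my_app   # hypothetical: the application under test
from my_evals import llm_judge  # hypothetical: e.g. the LLM-as-judge sketch above

with open("eval_dataset.jsonl") as f:
    EXAMPLES = [json.loads(line) for line in f]

@pytest.mark.parametrize("example", EXAMPLES)
def test_meets_quality_bar(example):
    candidate = run_my_app(example["input"])
    assert llm_judge(example["input"], example["expected"], candidate)
```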

The Role of Human-in-the-Loop

The manual aspect of reviewing exceptions in AI models is crucial for understanding how these models behave [00:13:06]. This manual review is not a “stigma” but a source of immense value, especially in the early, fast-moving stages of AI development [00:13:26]. It helps developers “grok” these new systems and gain a deeper understanding [00:13:39].

Generalizability and Best Practices

  • Custom Data Sets: Teams generally develop their own application-specific datasets [00:19:55].
  • Shared Metrics: There is some sharing of metrics, such as using LLMs as judges with common prompts [00:20:04].
  • Emerging Best Practices: After roughly six months of intense experimentation, the following six months focused on getting applications into production [00:20:38]. Now, with impressive production deployments like Elastic’s assistant, best practices are starting to emerge [00:20:56]. Facilitating discussion and sharing these learnings is an ongoing effort [00:21:02].

Future of Evaluation

  • Automation vs. Manual Review: While full automation in the background would be convenient, the manual review of exceptions provides vital insights [00:13:19]. It’s uncertain how evaluation will work with future models like GPT-7; it might be less critical but still important [00:16:05].
  • Abstraction Levels: Evaluation tools should avoid overly high-level abstractions, focusing on low-level, important aspects like data gathering and understanding system behavior [00:17:40]. A code-first approach with API exposure is favored for developers [00:18:04].
  • Challenges in LLM Operations (LLMOps): The LLM problem space differs from traditional applications due to non-determinism, API-based interaction, and prompting [00:35:00]. Existing observability tools like Datadog focus on system-level monitoring and aggregate metrics (e.g., latency), while LLM-specific tools like LangSmith prioritize understanding application behavior and enabling faster iteration with confidence [00:35:44] (a rough sketch of step-level tracing follows this list).
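
As a rough, framework-free sketch of what step-level visibility means in practice (this is not the LangSmith API), each step's inputs, output, and latency can be appended to a log so individual runs can be inspected rather than only aggregate dashboards:

```python
import functools
import json
import time

TRACE_LOG = "traces.jsonl"

def traced(step_name: str):
    """Decorator that appends each step's inputs, output, and latency to a JSONL log."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.time()
            result = fn(*args, **kwargs)
            record = {
                "step": step_name,
                "inputs": [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()],
                "output": repr(result),
                "latency_s": round(time.time() - start, 3),
            }
            with open(TRACE_LOG, "a") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return inner
    return wrap

@traced("retrieve_documents")
def retrieve_documents(query: str) -> list[str]:
    """Placeholder retrieval step used only to exercise the decorator."""
    return [f"doc about {query}"]
```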

Practical Advice for Startups

  • Build First: Given the rapid pace of AI, it’s essential to build applications even if techniques or underlying models might change [00:30:27]. Waiting for things to solidify would mean not building at all [00:32:00].
  • Focus on Product-Market Fit (PMF): Don’t prematurely optimize for inference costs or latency; these will decrease over time [00:45:10]. A common piece of advice is “no GPUs before PMF” or “use GPT-4 until product market fit” [00:45:42]. The focus should be on building a product that actually works [00:59:03].
  • UX Innovation: The most interesting work in AI applications currently lies in User Experience (UX) [00:43:27]. Figuring out how people want to interact with these new capabilities is a major area for innovation [00:43:35]. For example, an “AI-native spreadsheet” that uses agents for cell population, though not instantaneous, presents a novel UX for handling multiple tasks simultaneously [00:43:53].
  • Personalization and Memory: A significant opportunity for AI application development is in personalization at the user level, which could lead to a “step change improvement” [00:53:44]. This includes applications that remember user preferences and history, potentially through Retrieval-Augmented Generation (RAG) or fine-tuning [00:54:22]. An example is a journal app that remembers personal details and initiates conversations based on past entries [00:55:01] (a rough RAG sketch follows this list).
  • Iterative Development: The industry is still in the first wave of AI apps, similar to early iPhone development [00:53:35]. Future “killer apps” are yet to be discovered, possibly emerging several years later as fundamental technologies mature [01:03:00].
  • Enterprise Adoption: While consumer-facing AI products gain public attention, a substantial amount of AI work, especially in larger enterprises, is focused on internal tools [00:26:04]. These internal applications often carry lower risk and allow for more advanced experimentation before external release [01:03:51].
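
As a hedged sketch of the RAG-style memory mentioned above (the embedding model and every name here are assumptions, not a description of any particular product), past journal entries can be embedded, the most relevant ones retrieved, and the results fed back into the prompt so the assistant appears to remember the user:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts; the model name is a placeholder."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def recall_entries(past_entries: list[str], new_message: str, k: int = 3) -> list[str]:
    """Return the k past journal entries most relevant to the new message,
    ready to be prepended to the model's prompt as 'memory'."""
    entry_vecs = embed(past_entries)
    query_vec = embed([new_message])[0]
    ranked = sorted(zip(past_entries, entry_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [entry for entry, _ in ranked[:k]]
```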