From: redpointai

Percy Liang, a leading AI researcher and co-founder of Together AI, provides insights into the current state and future of AI model evaluation. He emphasizes the evolving challenges and new approaches required as AI models become more capable and complex [00:22:09].

Current State of Evaluations

Liang notes that AI model evaluation has become significantly more complex than in the past, when a simple train/test split was sufficient [00:30:15].

Challenges in Evaluation

  • Train-Test Overlap: A major problem is the lack of transparency regarding training data, making it difficult to trust evaluation results or determine whether models were inadvertently trained on test sets [00:30:26]. Companies are often unwilling to disclose this information [00:30:35]; see the contamination-check sketch after this list.
  • Raw Benchmarks vs. Real-World Use: Simply looking at raw benchmark scores may not tell the full story, especially when models are integrated into larger systems. Compatibility issues can lead to unimpressive overall performance, even if a model improves on specific subtasks [00:44:09].
  • Monotonic Progress: The assumption of monotonic progress (models always getting better) is challenged when models don’t fit into existing system architectures [00:54:30].
  • Multi-dimensional Assessment: AI model selection and evaluation for businesses must consider various facets like cost, speed, accuracy, and customizability, not just a single “better” metric [00:09:24].
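
One way to make the train-test overlap concern concrete: when a training corpus is disclosed (or reconstructed by a third party), a simple n-gram overlap check can flag potentially contaminated test items. The snippet below is a minimal sketch, not a method Liang describes; the 13-gram window and the `flag_contaminated` helper are illustrative assumptions.

```python
from typing import Iterable, List, Set

def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(test_items: List[str],
                      training_docs: Iterable[str],
                      n: int = 13) -> List[bool]:
    """Mark test items that share any long n-gram with the training corpus.

    A shared long n-gram is weak but useful evidence that the item
    (or its source document) appeared in the training data.
    """
    train_grams: Set[tuple] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [bool(ngrams(item, n) & train_grams) for item in test_items]

# Usage: report what fraction of a benchmark may be contaminated.
flags = flag_contaminated(test_items=["..."], training_docs=["..."])
print(f"possibly contaminated: {sum(flags)}/{len(flags)}")
```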

New Benchmarks and Approaches

Liang’s team has developed new benchmarks to address current challenges:

  • Cybench: A Capture-the-Flag cybersecurity benchmark that requires complex multi-step reasoning. The hardest challenges take human teams over 24 hours to solve, while current models can only solve challenges that humans first solved in around 11 minutes [00:03:41]. A sketch of this kind of multi-step agent harness follows the list.
  • MLAgentBench: A benchmark in which agents attempt machine learning research tasks [00:47:44].
  • HELM (Holistic Evaluation of Language Models): Initially a manually constructed framework covering many aspects of language models, it has since evolved to support different “verticals” or domains, such as safety evaluations, specific languages, medical applications, and finance [00:34:42].
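
A Cybench-style evaluation essentially measures whether an agent can chain many tool calls together before reaching a known flag. The loop below is a hypothetical sketch of such a harness, not Cybench's actual code; `run_model`, `execute_in_sandbox`, and the flag format are assumptions for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CTFTask:
    prompt: str          # task description shown to the agent
    flag: str            # secret string proving the task was solved
    max_steps: int = 30  # budget of agent actions

def solve_rate(tasks: List[CTFTask],
               run_model: Callable[[List[str]], str],
               execute_in_sandbox: Callable[[str], str]) -> float:
    """Fraction of CTF tasks solved within the step budget.

    run_model maps the transcript so far to the next shell command (or a
    final answer); execute_in_sandbox runs that command in an isolated
    environment and returns its output.
    """
    solved = 0
    for task in tasks:
        transcript = [task.prompt]
        for _ in range(task.max_steps):
            action = run_model(transcript)
            # The agent submits an answer of the form FLAG{...}.
            match = re.search(r"FLAG\{.*?\}", action)
            if match:
                solved += int(match.group(0) == task.flag)
                break
            # Otherwise treat the action as a command and feed back output.
            transcript.append(action)
            transcript.append(execute_in_sandbox(action))
    return solved / len(tasks)
```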

Evolution of Evaluation Methodologies

AI Evaluating AI

As models become increasingly capable, the field is moving towards using language models themselves to benchmark other language models. This is crucial because it’s nearly impossible for humans to comprehensively test models that claim to “do anything” [00:31:31].

  • AutoBencher: A system that leverages language models to automatically generate diverse inputs for evaluation. It exploits an asymmetry in which the question-generating model has information the test-taker does not, enabling more meaningful evaluations [00:32:13]; a minimal sketch of this setup appears below.
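
The key trick in AutoBencher-style generation is informational asymmetry: the question-writer sees a source document and derives the answer from it, while the test-taker sees only the question. The sketch below illustrates that structure under stated assumptions; `generator_llm`, `test_taker_llm`, and the grading callback are hypothetical stand-ins, not AutoBencher's actual interface.

```python
from typing import Callable, Dict, List

def autobench_round(documents: List[str],
                    generator_llm: Callable[[str], str],
                    test_taker_llm: Callable[[str], str],
                    grade: Callable[[str, str], bool]) -> float:
    """Generate questions from privileged documents, then test a model
    that never sees those documents. Returns the test-taker's accuracy."""
    qa_pairs: List[Dict[str, str]] = []
    for doc in documents:
        # The generator has the document, so it knows the answer.
        raw = generator_llm(
            "Write one factual question answerable only from this text, "
            "then the answer, separated by '|||'.\n\n" + doc
        )
        question, _, answer = raw.partition("|||")
        qa_pairs.append({"q": question.strip(), "a": answer.strip()})

    # The test-taker only gets the question, never the source document.
    correct = sum(
        grade(test_taker_llm(pair["q"]), pair["a"]) for pair in qa_pairs
    )
    return correct / len(qa_pairs)
```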

Beyond Superficial Judgments

Liang advocates for more structured and concrete evaluation methodologies.

  • Rubrics: As when grading exams, rubrics anchor evaluations in objective terms, moving beyond subjective “is this good?” assessments [00:33:42]; see the scoring sketch after this list.
  • Vertical-Specific Benchmarks: There’s a growing need for benchmarks tailored to specific industries or tasks, rather than general “cutting-edge math problems.” For example, a model for diagnosing medical images would require a different benchmark than one for general reasoning [00:34:31].
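
Anchoring judgments in a rubric can be as simple as asking a judge model to score each criterion separately instead of answering “is this good?”. The snippet below is an illustrative sketch only; the criteria, weights, and `judge_llm` interface are assumptions, not something described in the interview.

```python
from typing import Callable, Dict

# Hypothetical rubric: criterion -> weight (weights sum to 1.0).
RUBRIC: Dict[str, float] = {
    "factual accuracy": 0.4,
    "completeness": 0.3,
    "clarity": 0.2,
    "appropriate tone": 0.1,
}

def rubric_score(response: str,
                 judge_llm: Callable[[str], str]) -> float:
    """Score a response criterion-by-criterion and return a weighted total.

    Grading each criterion on a 1-5 scale keeps the evaluation anchored
    in concrete terms rather than a single holistic impression.
    """
    total = 0.0
    for criterion, weight in RUBRIC.items():
        reply = judge_llm(
            f"Rate the following response for {criterion} on a 1-5 scale. "
            f"Reply with a single integer.\n\nResponse:\n{response}"
        )
        score = float(reply.strip()[0])  # assumes the judge replies "1".."5"
        total += weight * (score / 5.0)
    return total  # in [0.2, 1.0] given 1-5 scores and unit weight sum
```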

Role of Academia and Transparency

Academia plays a unique role in AI model evaluation and auditing due to its lack of commercial interests [00:37:37].

  • Orthogonal Research: Academic research should focus on areas that are enhanced by or irrelevant to the progress of large, resource-rich labs like OpenAI. This includes novel use cases of language models (like generative agents) or developing benchmarks [00:11:00].
  • Open Science: Academia should prioritize contributing to the open-source AI community by creating and publicly sharing knowledge, including insights into data quality and weighting for pre-training [00:12:02].
  • Transparency and Auditing: Academia is uniquely positioned to assess the transparency of different AI providers, a task difficult for commercial entities [00:13:22]. Regulation should emphasize transparency and disclosure to help policymakers, researchers, and third-party auditors understand risks and benefits [00:20:11].

Connection to Interpretability

Evaluation in production settings is closely tied to interpretability, especially in regulated industries like finance and healthcare [00:35:36].

  • Challenges: Interpretability has become harder because access to model weights and training data is often withheld [00:36:28].
  • Attribution: Techniques like influence functions aim to attribute model predictions to specific training examples [00:37:38]; a simplified gradient-based sketch follows this list.
  • Explanations: While models can generate “Chain of Thought” explanations, research shows these don’t always reflect what’s truly happening internally [00:39:01].
  • Future: Models that expose their intermediate reasoning steps (like OpenAI’s o1) could potentially aid interpretability, making it easier to explain how a result was derived [00:39:34].
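
Influence functions estimate how much each training example contributed to a given prediction; a cheap first-order proxy (in the spirit of TracIn-style methods) scores training points by the dot product between their gradients and the test example's gradient. The sketch below assumes white-box access to a small PyTorch model and is a simplified illustration, not the exact techniques Liang has in mind.

```python
from typing import List, Tuple

import torch
from torch import nn

def grad_vector(model: nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten the gradient of a scalar loss w.r.t. all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def attribution_scores(model: nn.Module,
                       loss_fn,
                       train_set: List[Tuple[torch.Tensor, torch.Tensor]],
                       test_x: torch.Tensor,
                       test_y: torch.Tensor) -> List[float]:
    """Score each training example by gradient similarity to the test point.

    Higher scores suggest the example pushed the model toward its
    behavior on the test input (a first-order influence proxy, not a
    full inverse-Hessian influence function).
    """
    test_grad = grad_vector(model, loss_fn(model(test_x), test_y))
    scores = []
    for x, y in train_set:
        train_grad = grad_vector(model, loss_fn(model(x), y))
        scores.append(torch.dot(test_grad, train_grad).item())
    return scores
```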

Future Milestones

Significant milestones in AI model capability include:

  • Solving Open Math Problems: Achieving solutions to currently unsolved mathematical problems [00:48:22].
  • Extending Human Knowledge: Developing AI that can genuinely discover new research or solve problems humans haven’t [00:48:51].
  • Zero-Day Exploits: In cybersecurity, finding a zero-day vulnerability would be a significant game-changer [00:49:12].

Liang believes that AI is still rapidly progressing, with qualitative changes like new approaches to using systems (e.g., agentic models) driving progress alongside quantitative scaling [00:49:26].