From: redpointai
Percy Liang, a leading AI researcher and co-founder of Together AI, shares his insights on the current state and future direction of AI research from his vantage point at Stanford University [00:00:03].
Academic AI Research Philosophy
Liang emphasizes that academic AI research should be orthogonal to the work of large corporate labs like OpenAI to remain relevant and impactful [00:10:46]. If a research project would be enhanced by, or at least unaffected by, the release of new, powerful models like GPT-5, that indicates a good choice for academic focus [00:11:11]. He advises choosing projects that will benefit from models getting better [00:11:28].
Three key directions for academia include:
- Orthogonal Research: Focusing on novel use cases of language models rather than raw model development, as exemplified by his work on Generative Agents [00:11:39]. Similarly, benchmarks like MLAgentBench are enhanced when new models come out [00:11:57].
- Open Source Contributions: Academia, aligned with open science, should contribute to the open-source community by discovering and publishing knowledge, even if it means reinventing concepts, to allow broader community adoption and product development [00:12:10]. Examples include understanding data quality and weighting in pre-training [00:12:55].
- Transparency, Benchmarking, and Auditing: Academia is uniquely positioned to develop tools and conduct research that assesses transparency and benchmarks AI systems, as it lacks the commercial interests that might hinder such efforts in private companies [00:13:30]. This work includes collaborating with areas like law to address unique problems [00:14:10].
Key Projects and Research Areas
Generative Agents
Liang’s work on Generative Agents created a virtual world, similar to The Sims, where AI agents interact with each other [00:00:14]. This project was driven by the idea of generating not just text, but agents or a society of agents [00:23:08]. Each agent is powered by a language model with specific prompts and operates within a virtual environment, allowing researchers to study complex social dynamics [00:23:25].
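A minimal sketch of that setup, with hypothetical names (the actual system layers memory scoring, reflection, and long-horizon planning on top of this loop): each agent keeps a memory stream, stuffs recent memories into a prompt, and asks a language model for its next action, which every other agent then observes.

```python
# Minimal sketch of a generative-agent loop. Hypothetical names; the real
# architecture adds memory retrieval scoring, reflection, and planning.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    persona: str                                  # seed prompt describing the agent
    memories: list[str] = field(default_factory=list)

    def observe(self, event: str) -> None:
        self.memories.append(event)               # append to the memory stream

    def act(self, llm, world_state: str) -> str:
        recent = "\n".join(self.memories[-10:])   # naive retrieval: last 10 memories
        prompt = (
            f"You are {self.name}. {self.persona}\n"
            f"Recent memories:\n{recent}\n"
            f"Current situation: {world_state}\n"
            "What do you do next? Answer in one sentence."
        )
        return llm(prompt)                        # llm: any prompt -> text callable

def step(agents: list[Agent], llm, world_state: str) -> None:
    """One simulation tick: each agent acts and every agent observes the action."""
    for agent in agents:
        action = agent.act(llm, world_state)
        for other in agents:
            other.observe(f"{agent.name}: {action}")
```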
Key outcomes and future potential include:
- Emergent Behavior: The simulation revealed phenomena like information diffusion, similar to human social dynamics [00:23:49].
- Valid Simulations: The goal is to move from believable simulations to those that are actually valid and reflect reality [00:24:26].
- Digital Twin of Society: This could enable running experiments, such as testing the impact of policies (e.g., mask mandates, new laws) in a simulated environment [00:24:50].
- Social Science Studies: It offers a way to conduct social science research more efficiently and cost-effectively, even allowing for counterfactuals (giving an agent both treatment and control) [00:25:31].
- Distinction from Prior Simulations: Unlike physical or agent-based models governed by simple equations, large language models allow for simulations with much greater detail and complexity [00:28:17].
Evaluations and Benchmarking
Liang notes that AI evaluation is currently “a huge mess” due to the difficulty in trusting benchmarks, primarily because the training data for large models is unknown and proprietary, leading to “train-test overlap” concerns [00:30:09].
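This is also why contamination is hard to audit from the outside: without access to the training corpus, even a basic overlap check cannot be run. As a rough, hypothetical sketch of what such a check involves when the corpus is available (verbatim n-gram matching; real audits use more careful normalization and thresholds):

```python
# Rough sketch of a train-test overlap (contamination) check via n-gram matching.
# Hypothetical procedure, not any specific benchmark's official protocol.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(test_example: str, training_docs: list[str], n: int = 13) -> bool:
    test_grams = ngrams(test_example, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)

# Usage: flag benchmark items whose 13-grams appear verbatim in the corpus.
# flagged = [ex for ex in benchmark_items if contaminated(ex, corpus_docs)]
```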
Key aspects of his work in evaluations include:
- HELM (Holistic Evaluation of Language Models): Initially a manual effort to cover all aspects of language models, HELM has evolved into a framework with vertical-specific leaderboards for areas like safety, healthcare, and finance [00:34:42].
- AutoBencher: This paper explores using language models themselves to generate automatic, intelligent inputs for benchmarking, particularly leveraging an asymmetry in which the question-generating model has information the test-taker does not (see the sketch after this list) [00:32:13].
- Rubrics: He advocates for evaluation anchored in concrete terms, similar to grading exams with a rubric, rather than superficial judgments [00:33:35].
- Cybench: A challenging Capture the Flag cybersecurity benchmark whose hardest problems take human teams over 24 hours to solve [00:04:04]. Current models can only solve the challenges that humans solved in around 11 minutes [00:04:17].
- MLAgentBench: A benchmark for evaluating language models’ ability to solve machine learning research tasks [00:47:44].
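A toy sketch of the information-asymmetry idea behind AutoBencher referenced above (hypothetical prompts and helper names, not the paper’s actual pipeline): the question-generating model reads a source document and drafts question-answer pairs, while the test-taker must answer without the document, so even a modest generator can surface questions the test-taker gets wrong.

```python
# Toy sketch of AutoBencher-style question generation via information asymmetry.
# Hypothetical prompts and helpers; the actual AutoBencher pipeline differs.
def generate_questions(llm, document: str, k: int = 5) -> list[tuple[str, str]]:
    prompt = (
        f"Using ONLY the document below, write {k} hard questions with short, "
        "factual answers.\n"
        f"Document:\n{document}\n"
        "Format each line as: question ||| answer"
    )
    lines = llm(prompt).splitlines()
    return [
        tuple(part.strip() for part in line.split("|||", 1))
        for line in lines if "|||" in line
    ]

def accuracy(test_taker, qa_pairs: list[tuple[str, str]]) -> float:
    # The test-taker answers WITHOUT the document; this asymmetry is what lets
    # the generator pose questions that are both checkable and hard.
    correct = sum(answer.lower() in test_taker(question).lower()
                  for question, answer in qa_pairs)
    return correct / max(len(qa_pairs), 1)
```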
Regarding OpenAI’s o1 model, Liang found its product experience to be slow and difficult to use [00:00:57]. From a research perspective, o1 signals a shift toward “test-time compute” and agents that can reason, plan, and perform ambitious tasks over days or weeks [00:01:15]. However, when evaluating o1 on Cybench, simply dropping it in as a replacement did not significantly improve overall performance, because o1 ignored the existing reflection and planning templates and was therefore incompatible with the framework [00:04:47]. This highlights the importance of compatibility when integrating new models into larger systems [00:05:49].
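A tiny illustration of that compatibility point (a hypothetical scaffold, not Cybench’s actual harness): agent frameworks typically parse a fixed response template into executable actions, so a model that ignores the template yields nothing the harness can run, regardless of how capable it is.

```python
import re

# Hypothetical agent-scaffold parser expecting a "Thought:/Command:" template.
# A model that answers in free-form prose produces no executable command,
# so its raw capability does not translate into benchmark score.
TEMPLATE = re.compile(r"Thought:\s*(?P<thought>.*?)\nCommand:\s*(?P<command>.+)", re.DOTALL)

def extract_command(model_output: str) -> str | None:
    match = TEMPLATE.search(model_output)
    return match.group("command").strip() if match else None

print(extract_command("Thought: scan the host first.\nCommand: nmap -sV 10.0.0.5"))
# -> "nmap -sV 10.0.0.5"
print(extract_command("I would begin by scanning the host with nmap."))
# -> None (template ignored, nothing for the harness to execute)
```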
Interpretability
Liang notes that interpretability has become even harder because, unlike in 2017, model weights and training data are often not accessible [00:36:08]. He identifies two main audiences for interpretability:
- Scientific Understanding: Pure curiosity about how a model functions, such as mechanistic interpretability that analyzes individual neurons [00:37:16].
- Debugging and Accountability: For developers to fix problems or for regulated industries (finance, healthcare) to understand why a decision was made [00:37:24].
He discusses:
- Influence Functions: A method developed in 2016-2017 to attribute a model’s prediction to specific training examples; see the formula after this list [00:37:38]. While adaptable to language models, it’s difficult to scale and presents privacy concerns if training data is private [00:37:53].
- Explanations (Chain of Thought): Models can generate explanations, but research shows these might not accurately reflect what’s actually happening internally [00:39:04].
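For reference, the influence-function approximation from Koh and Liang’s 2017 paper estimates how up-weighting a single training point z would change the loss at a test point without retraining; the Hessian inverse in the middle is a large part of why it is hard to scale to today’s language models.

```latex
% Influence of up-weighting training point z on the loss at test point z_test
% (Koh & Liang, 2017), with H the Hessian of the average training loss:
\mathcal{I}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}
      H_{\hat\theta}^{-1}
      \nabla_\theta L(z, \hat\theta),
\qquad
H_{\hat\theta} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat\theta).
```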
Liang concludes that for meaningful interpretability, researchers need a return to the 2017 level of access to model weights and training data [00:39:45].
Model Architectures
Historically, architectures like LSTMs, CNNs, and Transformers arose from intuition and experimentation [00:40:47]. However, the Mamba (State Space Model) architecture notably emerged from mathematical work on approximating sequences online with polynomial projections [00:41:03].
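For context, a state space model in its standard discretized form is just a linear recurrence over a hidden state; roughly speaking, Mamba’s contribution is making these parameters input-dependent while keeping the recurrence efficient to compute.

```latex
% Discretized linear state space model: the hidden state h_t is updated linearly
% from the input x_t and read out to produce the output y_t.
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
```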
Liang believes that current architectures like Transformers may not be the enduring ones, and new innovations are likely to come from tackling different types of problems beyond language, such as video processing or agentic search settings, where existing architectures might break [00:42:02].
Open Source Models for Robotics
Liang believes that robotics is currently in a “BERT era” rather than a “ChatGPT moment” [00:50:27]. While vision-language models for robotics are effective, they still require fine-tuning for narrow tasks, and the resulting policies remain brittle [00:50:39]. He is optimistic for the future, noting increased interest and funding, along with data collection efforts [00:51:07]. He hopes that many so-called robotics problems will ultimately be solvable as language and vision problems, leveraging existing infrastructure [00:52:19].
AI in Music
As a classical musician, Liang observes that the same “giant transformer, some data” recipe is effective in AI music [00:53:16]. However, he highlights challenges:
- Copyright: A significant hurdle [00:53:28].
- Control: His work on the “Anticipatory Music Transformer” focuses on models that allow users to control generation (e.g., conditioning on melody to generate harmony, or infilling sections) rather than just unconditional or text-to-music generation [00:53:38].
He envisions AI music as a co-pilot for musicians, similar to GitHub Copilot for programmers, helping artists realize their musical visions [00:54:35].
AI in Education
Liang is bullish on the use of AI as teachers and coaches, particularly for explaining complex concepts simply, citing its effectiveness in teaching children [00:56:20].
Broader AI Landscape and Future Outlook
Holistic View of AI and Safety
Liang believes that many AI researchers narrowly focus on the model as the central object [00:15:26]. He advocates for a more holistic perspective, viewing the model as one piece of a larger ecosystem with various actors and incentives [00:15:56]. For AI safety, the goal should be to ensure the entire system is safe, not just the model [00:16:07].
He argues that bad actors can circumvent model-level safety measures by decomposing problems (e.g., generating personalized email content separately for a phishing attack) [00:16:29]. More investment is needed in “defense” mechanisms like anti-spam filters and anti-fraud detectors, analogous to dual-use technologies like email or the internet [00:17:22]. Gating access to models is a losing battle as they become cheaper and widespread, but defense measures can secure the ecosystem [00:17:55].
Regulation
Liang emphasizes that it is still very early in AI’s development, and much is not understood, making heavy-handed regulation challenging [00:19:40]. He supports regulation that focuses on transparency and disclosure, as understanding risks and benefits is the first step [00:20:11]. With everything currently “closed off,” it’s hard for policymakers, researchers, or third-party auditors to understand what’s happening [00:20:29]. He suggests “nutrition labels” or spec sheets for AI models to inform downstream product developers [00:21:49].
He questions whether regulation should occur “upstream” (Foundation model developers) or “downstream” (end products) [00:20:59]. While sectoral regulation (e.g., in finance or healthcare) is effective for visible harms, heavy-handed upstream regulation can be blunt or ineffective [00:21:10].
Meaningful Milestones for AI
Liang identifies several meaningful milestones for AI progress:
- Current Benchmarks: Continual improvement on existing benchmarks like Cybench for cybersecurity or MLAgentBench for ML research tasks [00:47:41].
- Extending Human Knowledge: Solving open math problems, creating new research, or discovering something new that hasn’t been solved by humans, such as finding a zero-day exploit in cybersecurity [00:48:22]. This represents a shift from mimicking human experts to extending human knowledge [00:48:46].
- Coding Productivity: AI’s ability to create new code and help people code is a simpler version of how it can aid research [00:58:43].
He believes progress is still moving quickly, with qualitative changes like o1’s approach to test-time compute and continuing advances in chip performance driving the entire ecosystem [00:49:26].
Underexplored Application Areas
Liang suggests that while many AI applications are driven by commercial needs (e.g., RAG solutions, Q&A, summarization), areas like fundamental science, scientific discovery, and improving researcher productivity are currently underexplored [00:59:02]. These areas are crucial because they can “feed into the whole cycle of improving the whole ecosystem” [00:59:34].
Overhyped and Underhyped
- Overhyped: Agents [00:57:05].
- Underhyped: Agents [00:57:05].
- Liang clarifies that while agents have gone through a hype cycle, the potential for ML agents to contribute novel insights to ML work (e.g., running experiments) is not far off [00:57:11]. He is optimistic that models can meaningfully contribute to research in the coming years [00:58:33].