From: redpointai

Percy Liang, a leading AI researcher and co-founder of Together AI, shares his perspectives on the current state and future directions of AI research and innovation, covering topics from model architectures to societal implications [00:00:01].

OpenAI’s O1 Model and the Future of AI Tasks

Upon the release of OpenAI’s O1 model, Percy Liang noted that from a product perspective, it was “not very good” due to its slowness and difficulty of use [00:00:57]. However, from a research perspective, O1 signals a significant shift towards “test time compute” [00:01:07]. This concept suggests a move beyond simple prompt-response interactions, enabling AI to solve more ambitious tasks that could take days, weeks, or even months, similar to complex human projects [00:01:40]. The direction hints at agents that can reason and plan over extended periods, potentially leading to new research discoveries or new drugs [00:02:02].
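
As a rough illustration of what “test time compute” can mean in practice (and not a description of O1’s internals), the sketch below spends extra inference compute by sampling several candidate solutions and keeping the one a verifier scores highest. The functions generate_candidate and score_candidate are hypothetical placeholders for real model and verifier calls.

```python
# Hypothetical best-of-n sketch of test-time compute: spend more inference
# work (many samples plus a scoring pass) instead of a single forward pass.
import random

def generate_candidate(prompt: str, seed: int) -> str:
    # Placeholder for sampling one reasoning chain / answer from a model.
    rng = random.Random(seed)
    return f"candidate answer {rng.randint(0, 9)} for: {prompt}"

def score_candidate(candidate: str) -> float:
    # Placeholder for a verifier or reward model that judges a candidate.
    return random.random()

def best_of_n(prompt: str, n: int = 16) -> str:
    # Larger n means more test-time compute and a better chance that at
    # least one sampled solution passes the verifier.
    candidates = [generate_candidate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score_candidate)

if __name__ == "__main__":
    print(best_of_n("Solve this competition math problem...", n=8))
```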

Shifting Paradigms: From Language Models to Agents

Liang observes a shift from focusing solely on large language models that predict the next token to a resurgence of agents [00:02:39]. These agents can take actions: their generations can be interpreted as actions within some space, allowing them to gain experience and learn through appropriate reward signals [00:02:41]. Deploying agents on various tasks, gathering feedback, and iteratively improving them is a key future trend [00:03:13].
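
A minimal sketch of that loop, in which the model’s generations are treated as actions and a reward signal accumulates experience. Every component here (the Agent class, the placeholder environment step, the learn method) is a hypothetical stand-in rather than a reference to any specific system Liang describes.

```python
# Sketch of an agent loop: generations are interpreted as actions, the
# environment returns feedback, and rewards accumulate as experience that
# a learning step could later use. All components are placeholders.
from dataclasses import dataclass, field

@dataclass
class Experience:
    observation: str
    action: str
    reward: float

@dataclass
class Agent:
    history: list = field(default_factory=list)

    def act(self, observation: str) -> str:
        # Placeholder for prompting a language model with the observation
        # (and possibly the accumulated history) to produce an action.
        return f"action for: {observation}"

    def learn(self, experience: Experience) -> None:
        # Placeholder for whatever update uses the reward signal
        # (fine-tuning, RL, or simply keeping the trajectory in context).
        self.history.append(experience)

def run_episode(agent: Agent, steps: int = 3) -> float:
    total = 0.0
    observation = "initial task description"
    for _ in range(steps):
        action = agent.act(observation)
        # Placeholder environment step: next observation plus a reward.
        next_observation, reward = f"result of {action}", 1.0
        agent.learn(Experience(observation, action, reward))
        observation = next_observation
        total += reward
    return total

if __name__ == "__main__":
    print(run_episode(Agent()))
```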

Challenges and Evolution of AI Benchmarking

While O1 demonstrated incredible capabilities in specific domains like math and coding, which benefit from reasoning chains and better supervision, it did not produce a “huge bump” on benchmarks like Cybench (cybersecurity exercises) [00:03:26]. This was partly because O1 ignored the existing frameworks and templates designed for standard language models, highlighting the subtlety of evaluation and the importance of compatibility within larger AI systems [00:04:41].

Liang emphasizes that benchmarks are a constantly moving target: as models improve, new tasks emerge that existing benchmarks don’t capture [00:31:06]. He is excited about using language models themselves to benchmark other language models, particularly to assess coverage of the vast range of tasks these models claim to perform [00:31:27]. His paper “AutoBencher” explores automatically generating inputs for more meaningful evaluations [00:32:18]. He also advocates for more structured evaluations using rubrics, similar to grading exams, to anchor assessments in concrete terms rather than superficial judgments [00:33:35].
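
As a concrete, hypothetical sketch of rubric-anchored grading (not the HELM or AutoBencher implementation), one can ask a grader model to score an answer against explicit criteria and combine the weighted scores. Here call_grader_model is a placeholder for a real LM call, and the rubric items and weights are illustrative.

```python
# Sketch of rubric-based grading with a language model as the grader.
# The rubric and the grader call are illustrative placeholders.

RUBRIC = [
    ("correctness", "Is the final answer factually/mathematically correct?", 0.5),
    ("reasoning",   "Are the intermediate steps sound and complete?",        0.3),
    ("clarity",     "Is the answer clearly written for the target reader?",  0.2),
]

def call_grader_model(prompt: str) -> float:
    # Placeholder: a real implementation would send `prompt` to a grader LM
    # and parse a score between 0 and 1 from its response.
    return 0.8

def grade(question: str, answer: str) -> float:
    # Score the answer against each rubric item and return the weighted sum,
    # anchoring the evaluation in concrete criteria rather than a single
    # holistic judgment.
    total = 0.0
    for name, criterion, weight in RUBRIC:
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            f"Rubric item ({name}): {criterion}\n"
            "Return a score between 0 and 1."
        )
        total += weight * call_grader_model(prompt)
    return total

if __name__ == "__main__":
    print(grade("What is 2 + 2?", "4, because 2 + 2 = 4."))
```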

Liang’s work on HELM (Holistic Evaluation of Language Models) has evolved into a framework covering various verticals, including safety, trustworthiness, different languages, medicine, and finance [00:34:42].

The Role of Academia in AI Research

Liang believes academia’s role in AI research must be orthogonal to that of large, resource-rich companies like OpenAI [00:11:02]. Academic projects should either be enhanced by better models (e.g., generative agents improved by more capable LMs) or be largely unaffected by those companies’ advances [00:11:11].

Key areas for academic contribution include:

  • Novel Use Cases: Exploring innovative applications of language models [00:11:40].
  • Open Source Community: Contributing to open science by discovering and publishing knowledge, making it accessible to the broader community [00:12:10].
  • Transparency and Auditing: Academia’s unique position without commercial interests allows it to conduct impartial benchmarking and assess the transparency of AI providers, benefiting the public good [00:13:30].

Holistic View of AI and Safety

Liang stresses the importance of thinking holistically about AI’s role, rather than focusing solely on the model [00:15:14]. He argues that AI safety should encompass the entire ecosystem of actors and their incentives, not just making a particular model “safe” by preventing harmful responses [00:15:39]. Bad actors can circumvent model-level safety measures by decomposing problems, such as using an AI to write a personalized email instead of explicitly requesting a phishing email [00:16:32].

He advocates for greater investment in “defense” mechanisms, similar to anti-spam filters or anti-fraud detectors for email and the internet [00:17:22]. Gating access to models is a losing battle as they become cheaper and more widespread [00:17:55].

AI Regulation and Transparency

Regarding AI regulation, Liang acknowledges it’s a complex and early topic [00:19:27]. He strongly supports regulation that emphasizes transparency and disclosure, as understanding risks and benefits is the first step [00:20:11]. With current models being closed off, it’s difficult for policymakers, researchers, or third-party auditors to understand what’s happening [00:20:32].

He suggests that regulation should focus downstream, on end products in regulated sectors like finance and healthcare, where harms are more visible [00:20:59]. However, transparency and obligations for foundation model developers to provide information (like “nutrition labels” or spec sheets) are necessary so that downstream product developers understand what they are working with [00:21:39].

Generative Agents and Social Simulation

Percy Liang’s work on “generative agents,” a project with his student Joon Park and colleague Michael Bernstein, aimed to see whether language models could generate agents, or even societies of agents [00:22:31]. They built a virtual environment similar to The Sims, where agents powered by language models with specific prompts could move, interact, and communicate [00:23:25]. This project revealed emergent behaviors, such as information diffusion, and was a pure exploration into creating believable simulations [00:23:41].

The future of this work involves creating simulations that are not just believable but valid, reflecting reality [00:24:26]. This could unlock significant potential, such as creating a “digital twin of society” to run experiments on policy changes (e.g., COVID mask policies) or laws [00:24:50]. While trust in these simulations is still a challenge, they could revolutionize social science studies by allowing researchers to recruit demographically diverse agents and even apply both treatment and control conditions to the same agent by resetting their memory [00:25:31].
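
A hedged sketch of the “same agent, both conditions” idea: snapshot an agent’s memory stream, apply the treatment, then restore the snapshot before running the control. The SimAgent class and its respond method are illustrative placeholders, not the actual generative-agents codebase.

```python
# Sketch of running treatment and control on the same simulated agent by
# resetting its memory between conditions. All components are placeholders.
import copy

class SimAgent:
    def __init__(self, persona: str):
        self.persona = persona
        self.memory: list[str] = []

    def observe(self, event: str) -> None:
        self.memory.append(event)

    def respond(self, question: str) -> str:
        # Placeholder for prompting an LM with the persona plus memory stream.
        return f"{self.persona} (seen {len(self.memory)} events) answers: ..."

def run_condition(agent: SimAgent, events: list, question: str) -> str:
    snapshot = copy.deepcopy(agent.memory)   # save pre-experiment memory
    for event in events:
        agent.observe(event)
    answer = agent.respond(question)
    agent.memory = snapshot                  # reset so the next condition starts clean
    return answer

if __name__ == "__main__":
    agent = SimAgent("survey respondent, age 34")
    treated = run_condition(agent, ["saw mask-policy announcement"], "Will you wear a mask?")
    control = run_condition(agent, [], "Will you wear a mask?")
    print(treated, control, sep="\n")
```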

This type of simulation differs from traditional physics-based or stylized agent-based models because modern language models allow for much greater detail in mimicking human behavior and decision-making [00:27:52]. Liang envisions a future where people might run simulations for major life decisions or even daily interactions, like practicing for a podcast interview or a date [00:28:33].

Challenges in AI Interpretability

Interpretability in AI is becoming increasingly difficult [00:36:00]. Unlike earlier years where model weights and training data were accessible for debugging, modern large language models often lack such transparency [00:36:13].

Liang distinguishes between two audiences for interpretability:

  1. Scientific Understanding: Mechanistic interpretability aims to understand individual neurons within a network to grasp what’s happening [00:36:53].
  2. Regulatory/Debugging Needs: In regulated industries, understanding why a model made a decision is crucial [00:37:24].

He references his past work on “influence functions” from 2016-2017, which attribute a model’s prediction to specific training examples [00:37:38]. While this approach has been adapted for language models, scaling it is challenging, especially when the training data is private [00:37:53]. He notes that Chain of Thought explanations might not always reflect the model’s true internal workings, akin to human rationalizations [00:39:01]. For true interpretability, a return to greater access to model weights and training data (as in the 2017 era) is needed [00:39:44].
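
For reference, the influence-function approximation from that line of work (Koh & Liang, 2017) estimates how upweighting a training point z would change the loss on a test point z_test:

```latex
% Influence of training point z on the loss at test point z_test
% (Koh & Liang, 2017); H is the Hessian of the empirical training loss
% at the fitted parameters \hat{\theta}.
\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})
  = -\,\nabla_\theta L(z_{\text{test}}, \hat{\theta})^{\top}
     H_{\hat{\theta}}^{-1}\,
     \nabla_\theta L(z, \hat{\theta}),
\qquad
H_{\hat{\theta}} = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta^{2} L(z_i, \hat{\theta})
```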

Model Architectures and Future Innovations

Historically, AI architectures like LSTMs, CNNs, and Transformers originated from intuition and experimentation [00:40:47]. However, newer models like Mamba (state space models) emerged from mathematical breakthroughs, specifically from questions about online polynomial fitting [00:41:03].
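
For context, the state space layers that Mamba builds on maintain a hidden state through a linear dynamical system; a simplified form (omitting discretization details and Mamba’s input-dependent parameters) is:

```latex
% Continuous-time state space model and its discretized recurrence
% (simplified; S4/Mamba add structured parameterizations of A, B, C):
h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)
\quad\Longrightarrow\quad
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t
```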

Liang doesn’t hold strong views on the endurance of current architectures, but he bets that the Transformer won’t be the sole enduring architecture in the long run [00:42:04]. He suggests that significant architectural changes might come from tackling different data modalities, like video, which cannot be handled by a “naive giant Transformer,” or from the growing focus on agent-based systems [00:42:35]. New architectures tend to emerge when researchers tackle problems fundamentally different from earlier ones, much as it was machine translation, rather than image classification, that gave rise to the Transformer [00:43:30].

Implications for the AI Inference Market

The new paradigm, exemplified by O1’s “test time compute,” has significant implications for the inference market [00:43:51]. Liang views inference as a fundamental, low-level primitive that needs to be robust and cheap [00:44:16]. As co-founder of Together AI, he emphasizes the importance of GPUs, inference capabilities, and the ability to fine-tune and customize models [00:46:28]. The inference market is shifting from serving general models like Llama 3 to serving models specifically adapted and optimized for particular use cases, potentially offering significant performance gains [00:45:40]. Agentic workflows also create opportunities for further optimization, especially for high-throughput settings [00:45:56].

Meaningful Milestones in AI

For the near term, benchmarks like Cybench (cybersecurity) and MLAgentBench (solving ML research tasks) remain good trackers of performance [00:47:39]. Longer term, meaningful milestones involve solving truly open problems, such as open math problems, or creating something that extends human knowledge [00:48:22]. This means AI moving beyond mimicking expert humans to discovering new research, solving previously unsolved problems, or finding “zero days” in cybersecurity [00:48:44]. Liang is optimistic that AI will continue to advance rapidly, with qualitative changes beyond naive scaling of models, driven by new systems and more powerful, cheaper chips [00:49:40].

Robotics Foundation Models

Liang states that robotics is not yet at a “ChatGPT moment” [00:50:27]. It is closer to the “BERT era,” where vision-language models and fine-tuning for specific tasks are effective, but the resulting policies are still brittle [00:50:32]. While there is increased interest and funding, and data collection efforts are underway, hardware limitations mean it will take a few more years [00:51:05].

However, he is optimistic because robotics can leverage architectural and data innovations from language and vision models [00:51:53]. His hope is that many “robotics problems” are actually language and vision problems, meaning that if a robot can understand concepts like “what a cup is” from vision and language data, it only needs robotics-specific data for manipulation and grasping [00:52:21].

AI in Music

As a talented classical musician, Percy Liang also discussed AI’s impact on music [00:52:54]. He notes challenges like copyright and the need for greater control for artists [00:53:28]. His lab worked on the “Anticipatory Music Transformer,” a model that lets musicians condition on specific musical events (e.g., a melody) and generate other parts (e.g., harmony), or infill sections [00:54:12].

He envisions AI as a “co-pilot” for musicians, similar to GitHub Copilot for programmers, helping composers and artists realize their musical visions, especially for those who lack the time for extensive practice [00:54:35]. Classical music presents unique challenges due to its subtleties and limited data [00:56:02].

AI in Education

Liang is very positive about AI’s use in education, especially as teachers and coaches [00:56:20]. He uses it to explain complex concepts to his children, demonstrating its effectiveness at simplifying complicated information [00:56:43].

Quickfire Round: Overhyped/Underhyped & Future of AI

  • Overhyped/Underhyped: Agents are both overhyped and underhyped [00:57:05].
  • ML Agents & Novel Insights: Liang believes that ML agents contributing novel insights to ML research is not far off [00:57:37]. To the extent that an agent can run an experiment to answer a question, it can already be considered a “junior student” [00:58:02]. He is optimistic that models will meaningfully come up with new experiments and directions in the coming years, similar to how coding has been transformed by AI [00:58:25].
  • Underexplored Application Areas: Beyond commercial needs like RAG (Retrieval-Augmented Generation) and summarization, underexplored areas include fundamental science, scientific discovery, and improving researcher productivity [00:59:18]. These areas are crucial as they can feed into the cycle of improving the entire AI ecosystem [00:59:34].