From: hu-po

Q*, pronounced “Q-star,” is a rumored algorithm or technique that OpenAI purportedly used to achieve a significant improvement in the current state of AI capabilities [03:00:00]. Despite much speculation and mystery surrounding it [03:28:00], many experts, including Yann LeCun, suggest that its core concepts are not entirely new [03:48:00].

The Nature of Q*

Yann LeCun states that nearly every top research lab, including Facebook AI Research (Meta), DeepMind (Google), and OpenAI (Microsoft), is actively working on concepts similar to Q*, with some having already published related ideas and results [03:48:00]. He posits that Q* is likely OpenAI’s attempt at “planning” [04:01:00], which broadly falls under the umbrella of reinforcement learning (RL) [04:11:00]. This means that anyone familiar with reinforcement learning papers from the past decade likely already understands the underlying principles of Q* [04:15:00].

Historically, Yann LeCun used a “cake analogy” to describe the path to Artificial General Intelligence (AGI), suggesting that the bulk of the “cake” (intelligence) would come from self-supervised learning, with reinforcement learning being only a “little cherry on top” [05:10:00]. More recently, however, he has stated that “agency and planning can’t be a wart on top of autoregressive LLMs; it must be an intrinsic property of the architecture” [04:38:00], a view that contradicts the earlier analogy in which RL was merely the cherry [05:30:00].

Self-Improvement of Large Language Models

Current methods to improve the performance of Large Language Models (LLMs) primarily involve advanced prompting techniques (like Chain of Thought) and fine-tuning with high-quality supervised data [09:11:00]. However, these methods are limited by the availability and quality of data [09:56:00].
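To make the prompting side concrete, here is a minimal illustration of a Chain-of-Thought prompt versus a plain prompt; the questions and wording are hypothetical examples, not taken from the video or either paper.

```python
# Minimal illustration of Chain-of-Thought (CoT) prompting. The exemplar answer
# spells out intermediate reasoning, nudging the model to do the same before
# giving its final answer. Questions and wording are hypothetical.
plain_prompt = "Q: A farm has 3 pens with 12 chickens each. How many chickens in total?\nA:"

cot_prompt = (
    "Q: A farm has 3 pens with 12 chickens each. How many chickens in total?\n"
    "A: Let's think step by step. There are 3 pens with 12 chickens each, "
    "so 3 * 12 = 36. The answer is 36.\n"
    "Q: A library adds 7 books a day for 5 days. How many books are added in total?\n"
    "A: Let's think step by step."
)
# The model is asked to complete `cot_prompt`; the worked exemplar typically
# improves accuracy on multi-step problems compared to `plain_prompt`.
```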

A promising strategy proposes allowing LLMs to refine their outputs and learn from self-assessed rewards, leading to self-improvement [10:01:00]. This approach draws inspiration from AlphaGo, a groundbreaking reinforcement learning system [10:19:00].

AlphaLLM: Towards Self-Improvement via Imagination, Searching, and Criticizing

The paper “Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing” (April 18, 2024, by Tencent AI Lab) introduces AlphaLLM [06:35:00], which applies AlphaGo’s principles to language models [10:29:00].

Key aspects of AlphaLLM include:

  • Monte Carlo Tree Search (MCTS): This search algorithm is used to enable models to learn from self-play and achieve or surpass human performance in complex tasks [12:47:00]. MCTS involves selection, expansion, evaluation, and backpropagation [25:58:00].
  • Challenges with LLMs:
    • Vast Search Spaces: The action space for LLMs is enormous due to their large vocabulary (e.g., 30,000 possible tokens for each prediction) [15:36:00].
    • Lack of Clear Feedback: Unlike games like Go (where win/loss is clear), natural language tasks lack unambiguous “win or loss” signals for reward [16:17:00].
  • Option-Level MCTS: To mitigate the vast search space, AlphaLLM proposes “option-level” MCTS, where actions are not individual tokens but sequences of tokens (like sentences), determined by a “termination function” [33:00:00] (see the search sketch after this list).
  • Critics: AlphaLLM guides its search with “critics” (value functions) that are themselves language models, often initialized from the policy model [30:29:00], [35:51:00]. These include:
    • Process Reward Model (PRM): Provides feedback for each step in a Chain of Thought [48:57:00].
    • Outcome Reward Model (ORM): Provides feedback only on the final result of a Chain of Thought [48:53:00].
  • Self-Improvement Loop: The LLM synthesizes its own data using MCTS and critics. This data is then used to fine-tune the LLM (the “policy”) through gradient descent, creating a virtuous cycle [54:45:00] (see the self-improvement sketch after this list).
  • Results: AlphaLLM, starting with a Llama 2 70B model, significantly improved performance on mathematical reasoning benchmarks like GSM8K (from 57% to 92%) and MATH (from 20% to 51%) [17:55:00]. This performance, especially with MCTS decoding during inference, becomes comparable to GPT-4 on these tasks [18:25:00], [1:04:14].
  • Limitations: The improvements are largely due to the MCTS search strategy during inference, rather than making the base model inherently “smarter” in greedy decoding [1:03:15]. The study only ran for two iterations of self-improvement, raising questions about potential overfitting or hitting performance walls [1:04:47].
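To make the option-level search concrete, here is a minimal Python sketch of MCTS over sentence-level options guided by a value critic, as referenced in the list above. The functions `generate_option`, `value_critic`, and `is_terminal` are hypothetical stand-ins for the policy LLM, the learned critics, and the termination function; the structure is an assumption based on the paper’s description, not its actual implementation.

```python
import math
import random

# Hypothetical stand-ins for AlphaLLM's components (assumptions, not the paper's code).
def generate_option(state):
    """Policy LLM proposes the next option (a sentence-level chunk of tokens)."""
    return state + [f"step_{len(state)}_{random.randint(0, 2)}"]

def value_critic(state):
    """Value/reward critic scores a partial solution in [0, 1]."""
    return random.random()

def is_terminal(state):
    """Termination function: here, simply stop after a fixed number of options."""
    return len(state) >= 4

class Node:
    def __init__(self, state, parent=None):
        self.state = state            # list of options generated so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def ucb(self, c=1.4):
        # Unvisited children are explored first; otherwise trade off the mean
        # value (exploitation) against an exploration bonus.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(root_state, num_simulations=200, branching=3):
    root = Node(root_state)
    for _ in range(num_simulations):
        # 1. Selection: descend from the root via UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion: if the leaf was visited before and is not terminal,
        #    sample a few candidate options from the policy and step into one.
        if node.visits > 0 and not is_terminal(node.state):
            node.children = [Node(generate_option(node.state), parent=node)
                             for _ in range(branching)]
            node = node.children[0]
        # 3. Evaluation: score the leaf with the critic instead of a full rollout.
        value = value_critic(node.state)
        # 4. Backpropagation: push the value back up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # The most visited child of the root is chosen as the next option to emit.
    return max(root.children, key=lambda n: n.visits).state

print(mcts(root_state=["question: what is 3 * 12?"]))
```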
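And here is an assumed, simplified rendering of the self-improvement cycle itself: synthesize data with the searcher, filter it with the critics, then fine-tune the policy on what survives. `mcts_generate`, `critic_score`, and `finetune` are hypothetical stubs, not the paper’s code.

```python
import random

# Hypothetical stand-ins for AlphaLLM's components (assumptions, not the paper's code).
def mcts_generate(policy, prompt):
    """Run option-level MCTS with the current policy to produce a solution trajectory."""
    return [prompt, f"reasoning_v{policy['version']}", "answer"]

def critic_score(trajectory):
    """Outcome/process critics score the trajectory in [0, 1]."""
    return random.random()

def finetune(policy, data):
    """Gradient-descent fine-tuning on the synthesized (prompt, trajectory) pairs."""
    return {"version": policy["version"] + 1, "data_seen": policy["data_seen"] + len(data)}

def self_improvement_loop(policy, prompts, iterations=2, keep_threshold=0.8):
    for _ in range(iterations):
        synthesized = []
        for prompt in prompts:
            trajectory = mcts_generate(policy, prompt)       # imagination + searching
            if critic_score(trajectory) >= keep_threshold:   # criticizing
                synthesized.append((prompt, trajectory))     # keep high-reward data only
        policy = finetune(policy, synthesized)               # learn from own filtered outputs
    return policy

print(self_improvement_loop({"version": 0, "data_seen": 0}, ["What is 3 * 12?"]))
```

Note that the paper reports only two such iterations, which is exactly the limitation flagged in the bullet above.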

From R to Q*: Your Language Model is Secretly a Q Function

The paper “From R to Q*: Your Language Model is Secretly a Q Function” (April 18, 2024, by Stanford University, Chelsea Finn et al.) focuses on the theoretical underpinnings, arguing that LLMs can be understood as Q-functions [1:09:53].

Key findings include:

  • Q-function as Optimal: In reinforcement learning, Q* refers to the “optimal Q-function” [1:10:42] (its defining Bellman equation is sketched after this list).
  • DPO and Q-learning: Direct Preference Optimization (DPO), a method for fine-tuning language models from human preference feedback, is shown to be equivalent to Q-learning when interpreted at the token level [1:20:12] (a token-level loss sketch follows this list).
  • Credit Assignment: DPO can perform “credit assignment,” meaning it can identify which specific tokens within a long response were responsible for a successful outcome, even with sparse (end-of-task) rewards [1:32:59]. This is crucial for efficient learning.
  • Discrete vs. Continuous Spaces: Natural language, being a discrete space (limited set of tokens), allows for effective application of these Q-learning principles, unlike continuous spaces where it’s more challenging [1:30:38]. This also makes it highly applicable to robotic control, where continuous actions can be discretized into tokens [1:31:06], [1:41:01].
  • Impact on Robotics/Embodied AI: The ability to apply these advancements in language models to robotics suggests a future where robot policies, trained as LLMs, output “action tokens” and can self-improve in environments where success or failure is clearly observable [1:14:10], [1:15:14].
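For reference, here is the textbook definition of the optimal Q-function that the name Q* comes from, written in the token-level reading used by the paper (a state is the prompt plus the tokens generated so far, an action is the next token). This is standard RL notation, not a formula quoted from the paper.

```latex
% Bellman optimality equation defining Q* (standard RL, not quoted from the paper).
% s = prompt plus tokens generated so far, a = next token, r = reward, \gamma = discount factor.
\begin{aligned}
Q^{*}(s, a) &= r(s, a) + \gamma \, \mathbb{E}_{s'}\left[ \max_{a'} Q^{*}(s', a') \right] \\
\pi^{*}(s)  &= \arg\max_{a} Q^{*}(s, a)
\end{aligned}
```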
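The token-level reading of DPO can also be sketched in code. Below is a hedged PyTorch-style sketch of the standard DPO loss written as a sum of per-token log-ratio terms, which is the view under which each token carries an implicit reward; tensor names and shapes are assumptions for illustration, not the paper’s implementation.

```python
import torch
import torch.nn.functional as F

def token_level_dpo_loss(policy_logps_w, ref_logps_w, policy_logps_l, ref_logps_l, beta=0.1):
    """Token-level view of the DPO loss.

    Each *_logps tensor holds per-token log-probabilities of the chosen (w) or
    rejected (l) response under the trainable policy or the frozen reference
    model, shape (batch, seq_len). Names and shapes are illustrative assumptions.
    """
    # Per-token "implicit rewards": beta * log(pi_theta / pi_ref) for each token.
    per_token_w = beta * (policy_logps_w - ref_logps_w)
    per_token_l = beta * (policy_logps_l - ref_logps_l)
    # Summing the per-token terms recovers the usual sequence-level DPO margin.
    margin = per_token_w.sum(dim=-1) - per_token_l.sum(dim=-1)
    return -F.logsigmoid(margin).mean()

# Toy usage with random "log-probabilities" (negative values), batch of 2, length 5.
torch.manual_seed(0)
fake = lambda: -torch.rand(2, 5)
print(token_level_dpo_loss(fake(), fake(), fake(), fake()).item())
```

Because the sequence-level margin is just the sum of per-token terms, gradients flow to individual tokens, which is the mechanism behind the credit-assignment point in the list above.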

Conclusion

The convergence of reinforcement learning principles, particularly Q-learning and MCTS, with large language model (LLM) development marks a significant step towards achieving self-improvement in AI models [1:57:07]. While current demonstrations primarily focus on domains with clear reward signals like math and coding [1:57:14], the potential for extending these techniques to more generalized reasoning and natural language tasks is highly anticipated [1:57:31]. The ability to use LLMs as both policies and value functions, leveraging their pre-trained knowledge, offers a powerful path to creating increasingly capable and autonomous AI [1:56:01].