From: aidotengineer

Reinforcement Learning (RL) is a paradigm where a system is designed to learn how to improve at achieving a goal over time through repeated interaction with an environment [00:04:19]. This approach is seen as crucial for the future development of more autonomous and capable AI agents [00:01:35].

Speaker’s Background

Will Brown, a machine learning researcher at Morgan Stanley, previously conducted theoretical work on multi-agent reinforcement learning at Columbia University [00:00:33]. His current work at Morgan Stanley involves Large Language Model (LLM)-related projects, some of which resemble agents [00:00:40].

The Evolution of AI Agents

Current LLMs largely function as chatbots or reasoners, excelling at question answering and interactive problem solving [00:01:42]. While models like GPT-4, Claude 3, and Gemini are adept at longer thinking processes [00:02:00], advancing to agent-level systems (Level 3 in OpenAI’s framework) means building systems that can take actions and carry out longer, harder, and more complex tasks [00:02:06].

Currently, agents are often built by chaining multiple calls to underlying LLMs, using techniques like prompt engineering, tool calling, and human-in-the-loop evaluations [00:02:15]. While this works well for tasks with tight feedback loops (e.g., IDEs like Cursor, Warp, and Replit, or advanced search tools) [00:03:20], few agents currently operate autonomously for more than about 10 minutes at a time [00:03:42]. Examples of more autonomous agents include Devin, Operator, and OpenAI’s Deep Research [00:03:46].

The challenge lies in how to increase the autonomy of these systems [00:03:57]. While waiting for “better models” is a common approach [00:04:02], RL provides a path forward for teaching models to improve skills through trial and error [00:16:15].

How Reinforcement Learning Works for Agents

The core idea of RL is to “explore and exploit”: try various actions, observe what works, and then favor successful actions while reducing unsuccessful ones [00:07:03]. This feedback loop allows a model to learn from synthetic data rollouts and scores [00:07:32].
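A minimal sketch of this feedback loop in Python (the function and parameter names here are illustrative assumptions, and the update step stands in for whatever RL algorithm is used, such as the GRPO variant described below):

```python
import random
from typing import Callable, List

def rl_feedback_loop(
    prompts: List[str],
    generate: Callable[[str], str],                       # samples a completion from the current model
    score: Callable[[str, str], float],                   # scores a (prompt, completion) pair
    update: Callable[[str, List[str], List[float]], None],  # reinforces the higher-scoring completions
    num_rollouts: int = 8,
    num_iterations: int = 100,
) -> None:
    """Generic explore-and-exploit loop: sample rollouts, score them,
    and update the model to favor the completions that scored well."""
    for _ in range(num_iterations):
        prompt = random.choice(prompts)
        # Explore: sample several rollouts (synthetic data) for the same prompt.
        rollouts = [generate(prompt) for _ in range(num_rollouts)]
        # Observe what works: score each rollout with the task's reward signal.
        scores = [score(prompt, r) for r in rollouts]
        # Exploit: nudge the model toward the higher-scoring rollouts.
        update(prompt, rollouts, scores)
```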

A key algorithm demonstrating this is GRPO (Group Relative Policy Optimization), used by DeepSeek for their R1 model [00:07:38]. The GRPO algorithm involves the following steps, sketched in code after this list:

  1. Sampling multiple completions (n) for a given prompt [00:07:52].
  2. Scoring all completions [00:07:56].
  3. Instructing the model to behave more like the higher-scoring completions [00:07:56].
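A minimal sketch of the group-relative scoring step, assuming the n completions for one prompt have already been sampled and scored; the full GRPO objective also includes a clipped policy-ratio term and a KL penalty, which are omitted here:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative scoring: each completion's reward is normalized against
    the group sampled for the same prompt, so completions above the group
    average get a positive advantage (reinforced) and those below get a
    negative one (discouraged)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: 4 completions for one prompt, scored by a task-specific reward.
print(grpo_advantages([1.0, 0.0, 0.5, 1.0]))  # higher-scoring completions get positive advantages
```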

This process enables models to learn effective strategies for problem solving, leading to emergent behaviors like long chains of thought, as seen in models like o1 and R1 [00:06:24].

RL’s Role in Agentic Capabilities

RL has been a significant unlock for capabilities, particularly in test-time scaling [00:05:34].

  • DeepSeek R1: This model demonstrated that RL could teach a model to perform complex reasoning, like solving math questions, by learning to self-correct and utilize longer chains of thought based on accuracy rewards [00:06:06].
  • OpenAI’s Deep Research: This system uses end-to-end RL to make potentially hundreds of tool calls, browsing and querying the internet to synthesize long, detailed answers [00:08:24]. While impressive, it still struggles with out-of-distribution tasks and highly manual calculations [00:08:49].

While RL significantly enhances autonomy and skill acquisition, it does not automatically grant agents the ability to solve all kinds of problems universally [00:09:05]. It is a path for teaching models specific skills, especially when combined with environments, tools, and verification mechanisms [00:09:12].

Infrastructure and Future Directions

Current RL infrastructure for LLMs is largely RLHF-style (Reinforcement Learning from Human Feedback), focused on single-turn interactions with human-curated reward signals [00:09:28]. The future envisions RL agents as part of more complex systems, potentially supported by API services from large labs that allow fine-tuning for specific tasks [00:09:42].

Key challenges and opportunities for this ecosystem include:

  • Cost and Model Size: Determining the economic viability and minimum effective model size [00:10:12].
  • Generalization: Ensuring models can generalize learned skills across different tasks [00:10:15].
  • Reward and Environment Design: Developing effective reward signals and environments [00:10:17].

Rubric Engineering

A promising approach to designing rewards is “rubric engineering,” where models are awarded points for specific aspects of their output beyond just correctness [00:12:52]. Examples include rewarding adherence to XML structures or producing an answer in the correct format, even if the final numerical answer is wrong [00:13:15]. This denser feedback signal lets models improve even before they can reliably produce fully correct answers [00:13:38].
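A sketch of what such a rubric might look like as a Python reward function, assuming completions wrap their reasoning and answer in XML-style tags (the tag names and point values are illustrative, not the ones used in the talk):

```python
import re

def rubric_reward(completion: str, correct_answer: str) -> float:
    """Score a completion against several rubric items, not just correctness,
    so the model gets partial credit (and a learning signal) even when the
    final answer is wrong."""
    score = 0.0
    # Reward adherence to the expected XML structure.
    if re.search(r"<think>.*?</think>", completion, re.DOTALL):
        score += 0.25
    if re.search(r"<answer>.*?</answer>", completion, re.DOTALL):
        score += 0.25
    # Reward producing an answer in the right format (a bare number here),
    # even if it is not the correct value.
    match = re.search(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>", completion)
    if match:
        score += 0.25
        # Reward correctness separately, on top of the formatting points.
        if match.group(1) == correct_answer:
            score += 1.0
    return score
```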

Future advancements in rubric engineering could involve using LLMs to design or auto-tune rubrics, incorporating LLM judges into scoring, and diligently guarding against “reward hacking,” where the model exploits the rubric to gain reward without genuinely solving the task [00:13:52].
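One way an LLM judge could slot into such a rubric, sketched with the judge passed in as a callable (the prompt wording and the 0.5 weight are illustrative assumptions); capping and down-weighting the judge’s score is one simple guard against reward hacking:

```python
from typing import Callable

def rubric_with_judge(
    completion: str,
    correct_answer: str,
    rubric: Callable[[str, str], float],  # rule-based rubric, e.g. rubric_reward above
    judge: Callable[[str], float],        # LLM judge returning a 0.0-1.0 quality score
) -> float:
    """Combine rule-based rubric items with an LLM judge's quality score."""
    score = rubric(completion, correct_answer)
    judge_prompt = (
        "Rate the clarity of the reasoning in the following solution from 0.0 "
        "to 1.0. Reply with only the number.\n\n" + completion
    )
    # Cap and down-weight the judge's contribution so the model cannot inflate
    # its reward purely by gaming the judge.
    return score + min(max(judge(judge_prompt), 0.0), 1.0) * 0.5
```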

Multi-step Environments

New open-source efforts are focusing on frameworks for doing RL in multi-step environments [00:14:55]. The goal is to leverage existing agent frameworks for API models and let models plug into these environments, learning from interaction protocols without needing to manage weights or tokens directly [00:15:06]. This represents a step toward truly autonomous multi-agent orchestration, whether for AI copilots or for organizational systems powered by LLMs [00:17:43].
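The kind of interface such a framework might expose, sketched as a minimal Python protocol (the class and method names are illustrative assumptions, not taken from any specific library): the environment owns the tools and verification, while the trainer only ever sees prompts, actions, and rewards.

```python
from dataclasses import dataclass
from typing import Callable, Protocol

@dataclass
class StepResult:
    observation: str   # tool output or environment feedback fed back to the model
    reward: float      # score for this step (often 0 until the episode ends)
    done: bool         # whether the multi-step episode has finished

class MultiStepEnv(Protocol):
    """Minimal multi-step environment interface for RL over agent rollouts."""

    def reset(self) -> str:
        """Start a new episode and return the initial prompt/observation."""
        ...

    def step(self, action: str) -> StepResult:
        """Apply the model's action (e.g., a tool call) and return feedback."""
        ...

def collect_rollout(env: MultiStepEnv, policy: Callable[[str], str]) -> float:
    """Run one episode: the policy proposes actions, the environment verifies
    and scores them; the trainer only needs the resulting total reward."""
    observation, total_reward, done = env.reset(), 0.0, False
    while not done:
        action = policy(observation)  # e.g., a call into an API-model agent framework
        result = env.step(action)
        observation, done = result.observation, result.done
        total_reward += result.reward
    return total_reward
```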

AI engineering skills developed in recent years, such as building evaluations and prompts, directly translate to the challenge of building environments and rubrics for RL [00:17:11]. Continued development of monitoring tools and a supportive ecosystem are essential for this advancement [00:17:21].