From: aidotengineer
Will Brown, a machine learning researcher at Morgan Stanley, discusses the meaning of reinforcement learning (RL) for agents and potential future directions in AI engineering [00:00:27]. His background includes working on theory for multi-agent reinforcement learning at Columbia [00:00:35]. The talk focuses on where the field might be headed, offering speculation and insights from recent open-source work [00:01:08]. The goal is to help prepare for a future where reinforcement learning is part of the agent engineering loop [00:01:32].
Current State of LLMs and Agents
Most Large Language Models (LLMs) today function as chatbots [00:01:42]. In OpenAI’s five-level framework, significant progress has been made at the first two levels, chatbots and reasoners, which are effective for question answering and interactive problem-solving [00:01:50]. The next step is level three, agents: systems that can take actions and perform longer, more complex tasks [00:02:06].
Currently, agents are often built by chaining multiple calls to underlying chatbot or reasoner LLMs [00:02:15]. This involves techniques such as:
- Prompt engineering [00:02:24]
- Tool calling [00:02:24]
- Eval (evaluation) [00:02:24]
- Integrating humans in the loop [00:02:29]
While these methods yield good results for many tasks, the field has not yet reached a point where AI agents operate with the high degree of autonomy imagined for Artificial General Intelligence (AGI) [00:02:40].
Agents vs. Pipelines
It is helpful to distinguish between “agents” and “pipelines” (or workflows) [00:02:53]. Pipelines are systems with lower degrees of autonomy, requiring significant engineering to define decision trees and refine prompts [00:03:04]. Many successful applications in the agent space use very tight feedback loops, where users interact with an interface, tasks are performed quickly, and results are returned promptly [00:03:20]. Examples include AI-assisted coding environments like Cursor, Windsurf, and Replit, and search tools that integrate web search or research [00:03:34].
Currently, few agents can perform tasks for more than 10 minutes at a time [00:03:42]. Exceptions that demonstrate a more autonomous direction include Devin, Operator, and OpenAI’s Deep Research [00:03:46].
Limitations of Current Approaches
The conventional wisdom is that better models will automatically enable more capable agents [00:04:02]. In reinforcement learning, however, an agent is traditionally defined as something that interacts with an environment in pursuit of a goal and is designed to learn and improve through repeated interaction [00:04:11]. Most current approaches lack that feedback loop: when prompt tuning and off-the-shelf models stall at, say, 70% performance, there is no built-in mechanism to push them to 90% [00:04:26].
Model Trends and New Tricks
- Pre-training: Showing diminishing returns on capital, suggesting a need for new techniques [00:04:51].
- Reinforcement Learning from Human Feedback (RLHF): Excellent for creating friendly chatbots, but not consistently pushing the frontier of smarter models [00:05:01].
- Synthetic Data: Useful for distilling larger models into smaller, performant ones, but not a standalone unlock for massive capability improvements without verification or rejection sampling [00:05:13].
- Reinforcement Learning (RL): Appears to be the key trick for test-time scaling, as seen in models like O1 and R1 [00:05:33]. It is not bottlenecked by manual human data curation and has proven effective [00:05:40].
Reinforcement Learning for Reasoning and Agents
DeepSeek’s R1 model and paper were significant because they explained how to build a model like O1, revealing that it was essentially reinforcement learning [00:05:51]. The process involves:
- Giving the model questions [00:06:12].
- Measuring if the answer is correct [00:06:14].
- Iteratively feeding back rewards so the model produces more of the behaviors that led to correct answers and fewer of those that did not [00:06:17].
This process can lead to the emergence of long chains of thought, as observed in O1 and R1 models, without explicit manual programming [00:06:24]. RL is fundamentally about identifying good strategies for problem-solving [00:06:40].
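To make the loop concrete, here is a minimal sketch of the verify-and-reward step, with the sampler and trainer omitted; the "Answer:" convention and the toy completions are invented for illustration and are not from the talk:

```python
# Minimal sketch of the verify-and-reward step behind R1-style training.
# Only the scoring logic is shown; sampling and the actual RL update are omitted.

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' matches the reference, else 0.0."""
    final_answer = completion.split("Answer:")[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0

# Toy stand-ins for sampled chains of thought on the question "What is 6 * 7?"
completions = [
    "6 * 7 means six groups of seven, so 42. Answer: 42",
    "6 * 7 = 36 + 7 = 43. Answer: 43",
]
rewards = [correctness_reward(c, "42") for c in completions]
print(rewards)  # [1.0, 0.0] -- fed back to the trainer to reinforce the first rollout
```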
Open-source models are also seeing a resurgence, with community efforts to replicate O1 and distill its data into smaller models [00:06:45].
How Reinforcement Learning Works
The core idea of RL is “explore and exploit”: trying different approaches, observing what works, and then doing more of what succeeded and less of what failed [00:07:03].
For example, in a coding task, rewards can be given for formatting, using the correct language, and, ultimately, passing test cases [00:07:16]. This numerical feedback lets the system generate rollouts, score them, and feed the scores back into the model for improvement [00:07:30].
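As a rough illustration of such a composite reward, the sketch below gives partial credit for following a requested format and producing valid Python, with most of the reward reserved for passing tests; the `<code>` tag convention, the `solve` function name, and the point weights are assumptions made for this example:

```python
import ast

def coding_reward(completion: str, tests: list[tuple[int, int]]) -> float:
    """Toy composite reward: formatting + valid Python + passing tests.

    Assumes the completion wraps its code in <code>...</code> and defines a
    function named `solve`; both conventions are invented for this example.
    """
    score = 0.0
    if "<code>" in completion and "</code>" in completion:
        score += 0.1                                   # followed the requested format
        code = completion.split("<code>", 1)[1].split("</code>", 1)[0]
    else:
        return score
    try:
        ast.parse(code)                                # syntactically valid Python?
        score += 0.1
    except SyntaxError:
        return score
    namespace: dict = {}
    try:
        exec(code, namespace)                          # NOTE: sandbox this in any real setup
        passed = sum(namespace["solve"](x) == y for x, y in tests)
        score += 0.8 * passed / len(tests)             # bulk of the reward: passing tests
    except Exception:
        pass
    return score

example = "<code>\ndef solve(x):\n    return x * 2\n</code>"
print(coding_reward(example, [(1, 2), (3, 6)]))        # 1.0
```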
GRPO Algorithm
The GRPO algorithm, used by DeepSeek, is conceptually simple:
- For a given prompt, sample N completions [00:07:52].
- Score all completions [00:07:55].
- Update the model so its responses become more similar to those that received higher scores [00:07:56].
This still often operates within a single-turn, reasoner-style setting [00:08:01].
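Conceptually, the "more similar to higher-scoring completions" step comes from comparing each completion against the rest of its group. The toy function below shows only that group-relative normalization; the full GRPO update also involves token probability ratios, clipping, and a KL penalty, which are omitted here:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its own group: above-average
    rollouts get positive advantages (pushed up), below-average ones negative."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0] * len(rewards)   # all rollouts scored the same: no signal from this prompt
    return [(r - mu) / sigma for r in rewards]

# e.g. four sampled completions for one prompt, already scored by a reward function
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [0.87, -0.87, -0.87, 0.87]
```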
Challenges in Developing AI Agents with RL
While RL can unlock new skills and greater autonomy, as demonstrated by OpenAI’s Deep Research (an end-to-end RL system taking up to 100 tool calls for web browsing and querying) [00:08:17], it doesn’t solve all problems [00:09:08]. Deep Research struggles with out-of-distribution tasks or very manual calculations, indicating that RL alone does not grant agents the ability to do everything [00:08:49]. However, it is a path for teaching models specific skills and improving them through interaction with environments and tools [00:09:12].
Infrastructure and Ecosystem Challenges in Developing AI Agents
Existing infrastructure for RL is often RLHF-style, relying on reward signals from human-curated data [00:09:28]. For RL agents to become integral to our systems, we would need:
- Robust API services from large labs for building and fine-tuning these models [00:09:42].
- Multi-step tool calling capabilities [00:10:01].
Key unknown questions and challenges in this ecosystem include:
- The cost of training these models [00:10:12].
- How small the models can be [00:10:13].
- Generalization across tasks [00:10:14].
- How to design good rewards and environments [00:10:17].
This presents significant opportunities for open-source infrastructure development, defining best practices, and building tools and services to support agentic RL [00:10:20]. Beyond literal RL training, prompt-level automation in the style of DSPy can also provide signals for system improvement [00:10:41].
Rubric Engineering
A specific technique emerging with RL for LLMs is “rubric engineering” [00:12:49]. Similar to prompt engineering, this involves designing the reward system for the model [00:12:54]. Rewards don’t just have to be binary (right/wrong answer); they can include points for:
- Following specific formatting (e.g., XML structure) [00:13:15].
- Adhering to answer types (e.g., an integer answer, even if incorrect) [00:13:23].
This allows for creative rule design that provides actionable feedback to the model for further training [00:13:30]. Areas for further exploration include using LLMs to design or auto-tune rubrics and prompts, and incorporating LM judges into scoring systems [00:13:52].
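As a toy example of such a rubric with partial credit, the sketch below combines a format check, an answer-type check, and a correctness check; the XML tags, point values, and regular expressions are invented for illustration:

```python
import re

def format_reward(completion: str) -> float:
    """Partial credit for following a <think>/<answer> XML structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.25 if re.search(pattern, completion, re.DOTALL) else 0.0

def answer_type_reward(completion: str) -> float:
    """Partial credit for giving an integer answer at all, even a wrong one."""
    return 0.25 if re.search(r"<answer>\s*-?\d+\s*</answer>", completion) else 0.0

def correctness_reward(completion: str, target: int) -> float:
    match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", completion)
    return 1.0 if match and int(match.group(1)) == target else 0.0

def rubric_score(completion: str, target: int) -> float:
    return (format_reward(completion)
            + answer_type_reward(completion)
            + correctness_reward(completion, target))

print(rubric_score("<think>3 * 4 = 12</think><answer>12</answer>", 12))  # 1.5
print(rubric_score("<think>3 * 4 = 13</think><answer>13</answer>", 12))  # 0.5: format + integer, wrong answer
```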
Reward Hacking
A significant challenge to watch for is “reward hacking” [00:14:02]. This occurs when the reward signal does not accurately capture the true goal, allowing the agent to find “back doors” to a high reward without actually performing the intended task (for example, a coding model that hard-codes expected test outputs instead of solving the problem) [00:14:07].
Open-Source Efforts and Frameworks
An open-source effort aims to provide a more robust framework for RL in multi-step environments [00:14:31]. The goal is to leverage existing agent frameworks and API models by letting users create an “environment” that the model plugs into [00:14:59]. This abstracts away concerns about model weights or tokens: users define an interaction protocol that feeds into a trainer, and the model improves over time from rewards [00:15:11].
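A hypothetical sketch of what such an environment interface could look like is shown below; the class names, the `rollout` method, and the `model_client.complete` call are illustrative assumptions, not the actual API of any particular framework:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Rollout:
    messages: list[dict]   # the full multi-turn transcript (prompts, tool calls, tool outputs)
    reward: float          # scalar feedback handed to the trainer

class Environment(ABC):
    """The user defines the interaction protocol and the reward; the trainer
    only ever sees completed rollouts, never model weights or raw tokens."""

    @abstractmethod
    def rollout(self, model_client, task: dict) -> Rollout:
        """Run one episode: query the model through an API-style client, execute
        any tool calls it makes, and return the transcript plus a reward."""

class ToySearchEnv(Environment):
    def rollout(self, model_client, task: dict) -> Rollout:
        answer = model_client.complete(task["question"])   # hypothetical client method
        reward = 1.0 if task["expected"].lower() in answer.lower() else 0.0
        return Rollout(messages=[{"role": "assistant", "content": answer}], reward=reward)
```

The key design point is that the trainer consumes only transcript-plus-reward rollouts, so the same environment definition could sit in front of a hosted API model or a locally trained one.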
The Future of AI Engineering in the RL Era
It is uncertain whether off-the-shelf API models will suffice for all tasks [00:15:40]. A key reason is that while knowledge can be included in a prompt, skills are hard to convey [00:15:58]. Like humans, models often require trial and error to truly master a skill [00:16:05]. This trial-and-error approach, central to RL, has been the most promising unlock for higher autonomy agents like Deep Research [00:16:19].
Importance of Fine-Tuning for Building and Improving AI Agents
Fine-tuning, previously somewhat disregarded, is regaining importance [00:16:28]. The gap between open- and closed-source models is narrowing, making hosted open-source models viable for platforms [00:16:40]. Additionally, the “truest” versions of these agents, as seen in DeepSeek’s R1 and OpenAI’s Deep Research, require actual reinforcement learning training [00:16:51].
Many research questions remain unanswered regarding RL in AI engineering [00:17:03]. However, existing AI engineering skills directly translate to this new paradigm:
- The challenge of building environments and rubrics is akin to building evals and prompts [00:17:13].
- Good monitoring tools are still essential [00:17:21].
- A large ecosystem of companies, platforms, and products is needed to support the desired types of agents [00:17:22].
These existing skills will be crucial if the field moves towards a future requiring more reinforcement learning to unlock truly autonomous agents, innovators, or language model-powered organizations [00:17:30].