From: aidotengineer
Will Brown, a machine learning researcher at Morgan Stanley, discusses the meaning of reinforcement learning (RL) for agents and potential future directions in AI engineering [00:00:27]. His background includes working on theory for multi-agent reinforcement learning at Columbia [00:00:35]. The talk focuses on where the field might be headed, offering speculation and insights from recent open-source work [00:01:08]. The goal is to help prepare for a future where reinforcement learning is part of the agent engineering loop [00:01:32].
Current State of LLMs and Agents
Most Large Language Models (LLMs) today function as chatbots [00:01:42]. In OpenAI’s five-level framework, significant progress has been made at the first two levels, chatbots and reasoners, which are effective for question answering and interactive problem-solving [00:01:50]. The next step is level three, agents: systems that can take actions and perform longer, more complex tasks [00:02:06].
Currently, agents are often built by chaining multiple calls to underlying chatbot or reasoner LLMs [00:02:15]. This involves techniques such as:
- Prompt engineering [00:02:24]
- Tool calling [00:02:24]
- Eval (evaluation) [00:02:24]
- Integrating humans in the loop [00:02:29]
While these methods yield good results for many tasks, the field has not yet reached a point where AI agents operate with the high degree of autonomy imagined for Artificial General Intelligence (AGI) [00:02:40].
Agents vs. Pipelines
It is helpful to distinguish between “agents” and “pipelines” (or workflows) [00:02:53]. Pipelines are systems with lower degrees of autonomy, requiring significant engineering to define decision trees and refine prompts [00:03:04]. Many successful applications in the agent space use very tight feedback loops, where users interact with an interface, tasks are performed quickly, and results are returned promptly [00:03:20]. Examples include AI-assisted coding environments like Cursor, Windsurf, and Replit, and search tools that integrate web search or research [00:03:34].
Currently, few agents can perform tasks for more than 10 minutes at a time [00:03:42]. Exceptions that demonstrate a more autonomous direction include Devin, Operator, and OpenAI’s Deep Research [00:03:46].
Limitations of Current Approaches
The conventional wisdom is that better models will automatically enable more capable agents [00:04:02]. In reinforcement learning, however, an agent is traditionally defined as something that interacts with an environment in pursuit of a goal and is designed to learn and improve through repeated interaction [00:04:11]. Most current approaches lack that feedback loop: when prompt tuning and off-the-shelf models stall at, say, 70% performance, there is no built-in mechanism to push them to 90% [00:04:26].
Model Trends and New Tricks
- Pre-training: Showing diminishing returns on capital, suggesting a need for new techniques [00:04:51].
- Reinforcement Learning from Human Feedback (RLHF): Excellent for creating friendly chatbots, but not consistently pushing the frontier of smarter models [00:05:01].
- Synthetic Data: Useful for distilling larger models into smaller, performant ones, but not a standalone unlock for massive capability improvements without verification or rejection sampling [00:05:13].
- Reinforcement Learning (RL): Appears to be the key trick for test-time scaling, as seen in models like O1 and R1 [00:05:33]. It is not bottlenecked by manual human data curation and has proven effective [00:05:40].
Reinforcement Learning for Reasoning and Agents
DeepSeek’s R1 model and paper were significant because they explained how to build a model like O1, revealing that it was essentially reinforcement learning [00:05:51]. The process involves:
- Giving the model questions [00:06:12].
- Measuring if the answer is correct [00:06:14].
- Iteratively feeding back rewards so the model produces more of the behaviors that led to correct answers and fewer of those that did not [00:06:17].
This process can lead to the emergence of long chains of thought, as observed in O1 and R1 models, without explicit manual programming [00:06:24]. RL is fundamentally about identifying good strategies for problem-solving [00:06:40].
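To make the loop concrete, here is a minimal sketch of the verify-and-reward step, with the sampler and trainer omitted; the "Answer:" convention and the toy completions are invented for illustration and are not from the talk:

```python
# Minimal sketch of the verify-and-reward step behind R1-style training.
# Only the scoring logic is shown; sampling and the actual RL update are omitted.

def correctness_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' matches the reference, else 0.0."""
    final_answer = completion.split("Answer:")[-1].strip()
    return 1.0 if final_answer == reference_answer else 0.0

# Toy stand-ins for sampled chains of thought on the question "What is 6 * 7?"
completions = [
    "6 * 7 means six groups of seven, so 42. Answer: 42",
    "6 * 7 = 36 + 7 = 43. Answer: 43",
]
rewards = [correctness_reward(c, "42") for c in completions]
print(rewards)  # [1.0, 0.0] -- fed back to the trainer to reinforce the first rollout
```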
Open-source models are also seeing a resurgence, with community efforts to replicate O1 and distill its data into smaller models [00:06:45].
How Reinforcement Learning Works
The core idea of RL is “explore and exploit”: trying different approaches, observing what works, and then doing more of what succeeded and less of what failed [00:07:03].
For example, in a coding task, rewards can be given for formatting, using the correct language, and, ultimately, passing test cases [00:07:16]. This numerical feedback lets the system generate rollouts, score them, and feed the scores back into the model for improvement [00:07:30].
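As a rough illustration of such a composite reward, the sketch below gives partial credit for following a requested format and producing valid Python, with most of the reward reserved for passing tests; the `<code>` tag convention, the `solve` function name, and the point weights are assumptions made for this example:

```python
import ast

def coding_reward(completion: str, tests: list[tuple[int, int]]) -> float:
    """Toy composite reward: formatting + valid Python + passing tests.

    Assumes the completion wraps its code in <code>...</code> and defines a
    function named `solve`; both conventions are invented for this example.
    """
    score = 0.0
    if "<code>" in completion and "</code>" in completion:
        score += 0.1                                   # followed the requested format
        code = completion.split("<code>", 1)[1].split("</code>", 1)[0]
    else:
        return score
    try:
        ast.parse(code)                                # syntactically valid Python?
        score += 0.1
    except SyntaxError:
        return score
    namespace: dict = {}
    try:
        exec(code, namespace)                          # NOTE: sandbox this in any real setup
        passed = sum(namespace["solve"](x) == y for x, y in tests)
        score += 0.8 * passed / len(tests)             # bulk of the reward: passing tests
    except Exception:
        pass
    return score

example = "<code>\ndef solve(x):\n    return x * 2\n</code>"
print(coding_reward(example, [(1, 2), (3, 6)]))        # 1.0
```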
GRPO Algorithm
The GRPO algorithm, used by DeepSeek, is conceptually simple:
- For a given prompt, sample N completions [00:07:52].
- Score all completions [00:07:55].
- Update the model so its responses become more similar to those that received higher scores [00:07:56].
This still often operates within a single-turn, reasoner-style setting [00:08:01].
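Conceptually, the "more similar to higher-scoring completions" step comes from comparing each completion against the rest of its group. The toy function below shows only that group-relative normalization; the full GRPO update also involves token probability ratios, clipping, and a KL penalty, which are omitted here:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each completion's reward against its own group: above-average
    rollouts get positive advantages (pushed up), below-average ones negative."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0] * len(rewards)   # all rollouts scored the same: no signal from this prompt
    return [(r - mu) / sigma for r in rewards]

# e.g. four sampled completions for one prompt, already scored by a reward function
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [0.87, -0.87, -0.87, 0.87]
```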
Challenges in Developing AI Agents with RL
While RL can unlock new skills and greater autonomy, as demonstrated by OpenAI’s Deep Research (an end-to-end RL system taking up to 100 tool calls for web browsing and querying) [00:08:17], it doesn’t solve all problems [00:09:08]. Deep Research struggles with out-of-distribution tasks or very manual calculations, indicating that RL alone does not grant agents the ability to do everything [00:08:49]. However, it is a path for teaching models specific skills and improving them through interaction with environments and tools [00:09:12].
Infrastructure and Ecosystem Challenges in Developing AI Agents
Existing infrastructure for RL is often RLHF-style, relying on reward signals from human-curated data [00:09:28]. For RL agents to become integral to our systems, we would need:
- Robust API services from large labs for building and fine-tuning these models [00:09:42].
- Multi-step tool calling capabilities [00:10:01].
Key unknown questions and challenges in this ecosystem include:
- The cost of training these models [00:10:12].
- How small the models can be [00:10:13].
- Generalization across tasks [00:10:14].
- How to design good rewards and environments [00:10:17].
This presents significant opportunities for open-source infrastructure development, defining best practices, and building tools and services to support agentic RL [00:10:20]. Beyond literal RL training, prompt-level automation in the style of DSPy can also provide signals for system improvement [00:10:41].
Rubric Engineering
A specific technique emerging with RL for LLMs is “rubric engineering” [00:12:49]. Similar to prompt engineering, this involves designing the reward system for the model [00:12:54]. Rewards don’t just have to be binary (right/wrong answer); they can include points for:
- Following specific formatting (e.g., XML structure) [00:13:15].
- Adhering to answer types (e.g., an integer answer, even if incorrect) [00:13:23].
This allows for creative rule design that provides actionable feedback to the model for further training [00:13:30]. Areas for further exploration include using LLMs to design or auto-tune rubrics and prompts, and incorporating LM judges into scoring systems [00:13:52].
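As a toy example of such a rubric with partial credit, the sketch below combines a format check, an answer-type check, and a correctness check; the XML tags, point values, and regular expressions are invented for illustration:

```python
import re

def format_reward(completion: str) -> float:
    """Partial credit for following a <think>/<answer> XML structure."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 0.25 if re.search(pattern, completion, re.DOTALL) else 0.0

def answer_type_reward(completion: str) -> float:
    """Partial credit for giving an integer answer at all, even a wrong one."""
    return 0.25 if re.search(r"<answer>\s*-?\d+\s*</answer>", completion) else 0.0

def correctness_reward(completion: str, target: int) -> float:
    match = re.search(r"<answer>\s*(-?\d+)\s*</answer>", completion)
    return 1.0 if match and int(match.group(1)) == target else 0.0

def rubric_score(completion: str, target: int) -> float:
    return (format_reward(completion)
            + answer_type_reward(completion)
            + correctness_reward(completion, target))

print(rubric_score("<think>3 * 4 = 12</think><answer>12</answer>", 12))  # 1.5
print(rubric_score("<think>3 * 4 = 13</think><answer>13</answer>", 12))  # 0.5: format + integer, wrong answer
```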
Reward Hacking
A significant challenge to watch for is “reward hacking” [00:14:02]. This occurs when the reward signal does not accurately capture the true goal, allowing the agent to find “back doors” to a high reward without actually performing the intended task (for example, a coding model that hard-codes expected test outputs instead of solving the problem) [00:14:07].
Open-Source Efforts and Frameworks
An open-source effort aims to provide a more robust framework for RL in multi-step environments [00:14:31]. The goal is to leverage existing agent frameworks and API models by letting users create an “environment” that the model plugs into [00:14:59]. This abstracts away concerns about model weights or tokens: users define an interaction protocol that feeds into a trainer, and the model improves over time from rewards [00:15:11].
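A hypothetical sketch of what such an environment interface could look like is shown below; the class names, the `rollout` method, and the `model_client.complete` call are illustrative assumptions, not the actual API of any particular framework:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Rollout:
    messages: list[dict]   # the full multi-turn transcript (prompts, tool calls, tool outputs)
    reward: float          # scalar feedback handed to the trainer

class Environment(ABC):
    """The user defines the interaction protocol and the reward; the trainer
    only ever sees completed rollouts, never model weights or raw tokens."""

    @abstractmethod
    def rollout(self, model_client, task: dict) -> Rollout:
        """Run one episode: query the model through an API-style client, execute
        any tool calls it makes, and return the transcript plus a reward."""

class ToySearchEnv(Environment):
    def rollout(self, model_client, task: dict) -> Rollout:
        answer = model_client.complete(task["question"])   # hypothetical client method
        reward = 1.0 if task["expected"].lower() in answer.lower() else 0.0
        return Rollout(messages=[{"role": "assistant", "content": answer}], reward=reward)
```

The key design point is that the trainer consumes only transcript-plus-reward rollouts, so the same environment definition could sit in front of a hosted API model or a locally trained one.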
The Future of AI Engineering in the RL Era
It is uncertain whether off-the-shelf API models will suffice for all tasks [00:15:40]. A key reason is that while knowledge can be included in a prompt, skills are hard to convey [00:15:58]. Like humans, models often require trial and error to truly master a skill [00:16:05]. This trial-and-error approach, central to RL, has been the most promising unlock for higher autonomy agents like Deep Research [00:16:19].
Importance of Fine-Tuning for Building and Improving AI Agents
Fine-tuning, previously somewhat disregarded, is regaining importance [00:16:28]. The gap between open- and closed-source models is narrowing, making hosted open-source models viable for platforms [00:16:40]. Additionally, the “truest” versions of these agents, as seen in DeepSeek’s R1 and OpenAI’s Deep Research, require actual reinforcement learning training [00:16:51].
Many research questions remain unanswered regarding RL in AI engineering [00:17:03]. However, existing AI engineering skills directly translate to this new paradigm:
- The challenge of building environments and rubrics is akin to building evals and prompts [00:17:13].
- Good monitoring tools are still essential [00:17:21].
- A large ecosystem of companies, platforms, and products is needed to support the desired types of agents [00:17:22].
These existing skills will be crucial if the field moves towards a future requiring more reinforcement learning to unlock truly autonomous agents, innovators, or language model-powered organizations [00:17:30].