From: aidotengineer
At the AI Engineer Conference, Will Brown, a machine learning researcher at Morgan Stanley, presented his view of the future of agents and the growing role of reinforcement learning (RL) in their development [00:00:22]. His talk focused on speculative future directions rather than current production systems or best practices [00:01:04]. The goal was to prepare for a future in which reinforcement learning becomes an integral part of the agent engineering loop [00:01:37].
Current State of Large Language Models
Most LLMs today are best understood as chatbots [00:01:42]. In terms of OpenAI’s five-levels framework, progress has been strong on Level 1 (chatbots) and Level 2 (reasoners) [00:01:50]. Reasoning models such as o1, o3, Gemini, and Grok-3 excel at question answering and interactive problem-solving, particularly on longer reasoning tasks [00:02:00].
The current challenge is to advance to Level 3: agents [00:02:06]. These are systems designed to take actions and perform longer, more complex tasks [00:02:10]. Present approaches involve chaining multiple calls to underlying chatbot or reasoner LLMs [00:02:17]. Techniques like prompt engineering, tool calling, and human-in-the-loop evaluations are common [00:02:24]. While results are “pretty good,” true autonomous agents with AGI-level capabilities are not yet realized [00:02:40].
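As a rough sketch of this chaining pattern (not any specific framework’s API), the example below strings together two model calls with a tool call in between; `call_llm` and `search_web` are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch of a chained "pipeline": prompt -> tool call -> synthesis.
# `call_llm` and `search_web` are hypothetical placeholders, not a real API.

def call_llm(prompt: str) -> str:
    """Stand-in for a call to a chatbot/reasoner model."""
    raise NotImplementedError

def search_web(query: str) -> str:
    """Stand-in for a search tool."""
    raise NotImplementedError

def answer_question(question: str) -> str:
    # Step 1 (prompt engineering): ask the model what to look up.
    query = call_llm(f"Write a short web search query for: {question}")
    # Step 2 (tool calling): run the tool on the model's output.
    results = search_web(query)
    # Step 3: a second model call synthesizes the final answer.
    return call_llm(
        f"Question: {question}\nSearch results: {results}\nAnswer concisely."
    )
```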
Agents vs. Pipelines
It’s useful to distinguish between agents and pipelines (or workflows) [00:02:53]. Pipelines are systems with relatively low autonomy, requiring significant engineering to define decision trees and refine prompts [00:03:04].
Winning applications in the “agent space” often feature tight feedback loops where users interact with an interface, get quick responses, and perform short tasks [00:03:20]. Examples include IDEs like Cursor, Warp, and Replit, and search tools for complex question-answering [00:03:34]. Few current agents can operate autonomously for more than 10 minutes [00:03:42]. Exceptions like Devin, Operator, and OpenAI’s Deep Research are seen as moving towards more autonomous agents [00:03:47].
The Role of Reinforcement Learning
The traditional definition of an agent in reinforcement learning is a system that interacts with an environment to achieve a goal and learns to improve through repeated interaction [00:04:11]. This iterative improvement is often missing in current LLM applications [00:04:26]. If an LLM-powered system reaches 70% accuracy after prompt tuning, there is often no clear path to 90% without a better base model or tools for continuous learning [00:04:36].
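For reference, the textbook RL loop looks roughly like the sketch below; the `env`, `policy`, and `learner` objects are illustrative stand-ins, not anything from the talk.

```python
# Textbook RL interaction loop (gym-style); all objects here are illustrative stand-ins.

def run_episode(env, policy, learner) -> float:
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        action = policy(obs)                  # act based on the current observation
        obs, reward, done = env.step(action)  # environment returns feedback
        learner.record(obs, action, reward)   # store the experience
        total_reward += reward
    learner.update()                          # improve the policy from recorded experience
    return total_reward
```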
Model Trends and Unlocks
Current model trends suggest:
- Pre-training: Showing diminishing returns on capital, indicating a need for new techniques [00:04:51].
- RLHF (Reinforcement Learning from Human Feedback): Excellent for creating friendly chatbots but doesn’t consistently push the frontier for increasingly smarter models [00:05:01].
- Synthetic Data: Good for distilling larger models into smaller, performant ones, but not a standalone unlock for massive capabilities unless combined with verification or rejection sampling [00:05:13].
- Reinforcement Learning: Appears to be the key trick for enabling test-time scaling in models like o1 and DeepSeek-R1 [00:05:33]. It is not bottlenecked by manually curated human data and has shown practical effectiveness [00:05:39].
DeepSeek-R1 and OpenAI’s Deep Research
The DeepSeek-R1 model and paper were significant because they explained how a model like o1 is built, revealing that reinforcement learning is at its core [00:05:51]. The process involves giving the model questions, measuring the correctness of its answers, and iteratively providing feedback that reinforces successful strategies [00:06:12]. The long chain-of-thought reasoning in these models emerges as a byproduct of this RL process rather than being explicitly programmed [00:06:24]. Reinforcement learning excels at identifying effective strategies for problem-solving [00:06:40].
OpenAI’s Deep Research, an example of an end-to-end reinforcement learning system, can perform up to a hundred tool calls for browsing and querying the internet to synthesize answers [00:08:17]. While impressive, it’s not AGI and struggles with out-of-distribution tasks or highly manual calculations [00:08:49]. This suggests that reinforcement learning unlocks new skills and autonomy but doesn’t grant universal problem-solving capabilities [00:08:58]. It offers a path to teaching models skills and improving them in conjunction with environments, tools, and verification [00:09:12].
How Reinforcement Learning Works for LLMs
The core idea of reinforcement learning is to explore and exploit: try things, see what works, and do more of what works [00:07:03]. In code generation, for example, a model writes code to pass test cases and receives numerical rewards for formatting, language use, and passing tests [00:07:16]. The model thus learns from synthetic rollouts and their scores rather than from pre-curated datasets [00:07:30].
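A minimal sketch of such a reward function for code generation is below; the specific checks and weights are assumptions for illustration, not the scheme described in the talk.

```python
import ast

def code_reward(completion: str, tests_passed: int, tests_total: int) -> float:
    """Illustrative reward for one code-generation rollout: small credit for
    well-formed output, larger credit for passing tests. Weights are arbitrary."""
    reward = 0.0
    # Partial credit if the completion is syntactically valid Python.
    try:
        ast.parse(completion)
        reward += 0.2
    except SyntaxError:
        pass
    # Main signal: fraction of test cases that pass.
    if tests_total > 0:
        reward += 0.8 * (tests_passed / tests_total)
    return reward
```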
The GRPO algorithm, used by DeepSeek, illustrates this simply: for a given prompt, sample multiple completions, score them, and train the model to be more like the higher-scoring ones [00:07:52]. This is typically applied in single-turn reasoner models, not yet fully in the agentic world [00:08:01].
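A minimal sketch of GRPO’s group-relative scoring step, assuming the rewards for a group of sampled completions have already been computed; the full algorithm also includes the token-level policy-gradient update and a KL penalty, which are omitted here.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """For one prompt, normalize each completion's reward against the group:
    completions above the group mean get positive advantages (reinforced),
    those below get negative advantages (discouraged)."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    if sigma == 0.0:
        return [0.0 for _ in rewards]  # no signal if every completion scored the same
    return [(r - mu) / sigma for r in rewards]

# Example: eight completions sampled for one prompt, scored 1 if correct, 0 otherwise.
print(group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 1]))
```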
Rubric Engineering
Even simple reinforcement learning setups have captured the community’s imagination [00:12:09]. A single-Python-file demonstration that used the GRPO algorithm to train a small Llama model for math reasoning showed accuracy improving and reasoning chains lengthening over the course of RL training [00:11:16].
This led to the concept of rubric engineering, akin to prompt engineering [00:12:52]. When a model undergoes reinforcement learning, it receives a reward, but the design of this reward is crucial [00:13:01]. Beyond simple right/wrong answers, models can be rewarded for:
- Following specific XML structures [00:13:15].
- Adhering to output formats (e.g., producing an integer answer even if incorrect) [00:13:23].
This offers significant room for creativity in designing rules that allow the model to understand its own performance and use that feedback for further training [00:13:37]. Future opportunities include using LLMs to design or autotune these rubrics, incorporating LLM judges for scoring, and preventing reward hacking, where models find ways to maximize reward without truly performing the task [00:13:52].
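As a concrete illustration of a rubric (the tags, checks, and weights here are hypothetical, in the spirit of the examples above), a reward function might combine several partial scores:

```python
import re

def rubric_reward(completion: str, correct_answer: str) -> float:
    """Illustrative rubric combining partial rewards; tags and weights are hypothetical."""
    reward = 0.0
    # Reward following the expected XML structure.
    if "<reasoning>" in completion and "</reasoning>" in completion:
        reward += 0.2
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
    if match:
        reward += 0.2                      # produced an answer block at all
        answer = match.group(1).strip()
        if re.fullmatch(r"-?\d+", answer):
            reward += 0.1                  # answer formatted as an integer, even if wrong
        if answer == correct_answer:
            reward += 0.5                  # correct final answer
    return reward
```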
Future of AI Engineering in the RL Era
A new framework is being developed for doing reinforcement learning inside multi-step environments [00:14:55]. The idea is to leverage existing agent frameworks by creating an “environment” that the model plugs into, allowing interaction protocols to be defined without worrying about weights or tokens [00:15:06]. This allows models to continuously improve over time with defined rewards [00:15:28].
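The talk did not spell out the framework’s API, so the sketch below is only a guess at what such an environment interface could look like: the environment owns the tools and scoring, exposes a rollout method, and the trainer only ever sees trajectories and rewards. All class and method names (including `model.respond`) are assumptions.

```python
from abc import ABC, abstractmethod

class MultiStepEnv(ABC):
    """Hypothetical interface for an RL environment wrapping an agent task.
    A trainer repeatedly calls `rollout` and uses the returned reward to
    update the model; weights and tokens stay out of the environment code."""

    @abstractmethod
    def rollout(self, model, task: dict) -> tuple[list[dict], float]:
        """Run one multi-step episode; return the message trajectory and a scalar reward."""
        ...

class SearchEnv(MultiStepEnv):
    """Toy example: the model may call a search tool before answering."""

    def rollout(self, model, task):
        messages = [{"role": "user", "content": task["question"]}]
        reply = ""
        for _ in range(task.get("max_turns", 5)):
            reply = model.respond(messages)        # hypothetical model call
            messages.append({"role": "assistant", "content": reply})
            if "FINAL:" in reply:                  # convention: model marks its final answer
                break
            messages.append({"role": "tool", "content": self.search(reply)})
        reward = float(task["answer"] in reply)    # score the last assistant reply
        return messages, reward

    def search(self, query: str) -> str:
        return "stub search results for: " + query  # placeholder tool
```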
Key considerations for the future of AI engineering in the RL era include:
- Off-the-shelf API models: It’s uncertain if they will be sufficient for all tasks [00:15:42].
- Skill vs. Knowledge: It’s hard to instill a “skill” through prompting alone; models, like humans, often require trial and error to truly master a skill [00:15:58]. RL has been the most promising unlock for higher autonomy agents [00:16:19].
- Fine-tuning: Despite being written off by some, fine-tuning remains important, especially as the gap between open and closed-source models closes [00:16:29]. True RL for models like DeepSeek-R1 and Deep Research requires fine-tuning [00:16:51].
Many challenges and research questions remain [00:17:03]. However, skills learned in current AI engineering, such as building evaluations and prompts, transfer directly to building environments and rubrics [00:17:15]. The need for good monitoring tools and a supportive ecosystem of companies and products will persist [00:17:21]. Looking ahead, reinforcement learning may be essential to unlock truly autonomous agents, innovators, or organizations powered by language models [00:17:39].