From: aidotengineer

Will Brown, a machine learning researcher at Morgan Stanley, presented insights into the future of AI engineering, particularly focusing on the role of reinforcement learning (RL) in the development of AI agents and the significant contribution of the open source community [00:00:22]. His talk synthesized trends within the broader research community and included recent open source work [00:01:14]. The objective was to help prepare for a potential future where reinforcement learning is integrated into the agent engineering loop [00:01:32].

Current Landscape of LLMs and Agents

Most Large Language Models (LLMs) today function primarily as chatbots or reasoners, excelling at question answering and interactive problem solving [00:01:42]. Models like O1, O3, R1, Grok 3, and Gemini are adept at longer-form thinking [00:01:58]. The current challenge lies in evolving these into “agents” (Level 3 in OpenAI’s framework), which are systems capable of taking actions and performing longer, more complex tasks [00:02:06]. This is typically achieved by chaining multiple calls to underlying chatbot or reasoner LLMs, employing techniques like prompt engineering, tool calling, and human-in-the-loop evaluations [00:02:15].
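
As a rough illustration, such a system is essentially a loop of model calls. The sketch below assumes hypothetical `call_llm` and `run_tool` helpers rather than any particular vendor SDK.

```python
# Minimal sketch of an "agent" built by chaining calls to a chat/reasoner model.
# call_llm() and run_tool() are hypothetical placeholders, not a specific vendor SDK.

def call_llm(messages: list[dict]) -> dict:
    """Placeholder: send the message history to a chat model and return its reply,
    either plain text or a requested tool call."""
    raise NotImplementedError

def run_tool(name: str, args: dict) -> str:
    """Placeholder: execute a tool (search, code runner, etc.) and return its output."""
    raise NotImplementedError

def agent_loop(task: str, max_steps: int = 10) -> str:
    # Prompt engineering: the system prompt defines how the agent should behave.
    messages = [
        {"role": "system", "content": "You are a helpful agent. Use tools when needed."},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)
        messages.append({"role": "assistant", "content": reply.get("content", "")})
        if reply.get("tool_call"):                    # the model asked to take an action
            name, args = reply["tool_call"]["name"], reply["tool_call"]["args"]
            messages.append({"role": "tool", "content": run_tool(name, args)})
        else:                                         # the model produced a final answer
            return reply["content"]
    return "Step limit reached; hand off to a human"  # human-in-the-loop fallback
```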

A distinction is made between “agents” and “pipelines” (or workflows):

  • Pipelines (or workflows) are systems with low degrees of autonomy, requiring significant engineering to define decision trees and refine prompts [00:03:04].
  • Agents, in the traditional reinforcement learning sense, are entities that interact with an environment with a goal, designed to learn and improve over time through repeated interaction [00:04:14].

Current agents, such as those integrated into coding tools like Cursor, Warp, Windsurf, and Replit, or into advanced search products, often operate with very tight feedback loops and typically don’t perform tasks for more than about ten minutes at a time [00:03:22]. More autonomous examples include Devin, Operator, and OpenAI’s Deep Research [00:03:47].

Traditional wisdom suggests waiting for better models to achieve more autonomous agents [00:04:02]. However, pre-training shows diminishing returns on capital [00:04:51]. Reinforcement Learning from Human Feedback (RLHF), while good for friendly chatbots, doesn’t seem to continuously push the frontier of smarter models [00:05:01]. Synthetic data helps distill larger models into smaller, performant ones but isn’t a standalone unlock for massive capabilities without verification or rejection sampling [00:05:13].

Reinforcement Learning (RL) has emerged as a key “trick,” notably unlocking test-time scaling in models like O1 and R1 [00:05:33]. RL is not bottlenecked by manually curated human data and has proven effective [00:05:40].

The release of DeepSeek’s R1 model and paper was significant, as it was the first detailed explanation of how to build something akin to O1 [00:05:51]. The core mechanism was reinforcement learning: giving a model questions, measuring correctness, and iteratively feeding back information to encourage successful behaviors [00:06:10]. The long chain of thought seen in models like O1 and R1 emerged as a byproduct of this RL process, not from manual programming [00:06:24]. RL’s strength lies in identifying effective strategies for problem-solving [00:06:40].
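
Stripped to its essentials, that recipe is a short loop. The sketch below is a simplified, GRPO-flavored illustration; `sample_completions`, `is_correct`, and `policy_gradient_update` are hypothetical stand-ins for the real sampling, verification, and training code.

```python
# Simplified sketch of the RL recipe described above ("RL with verifiable rewards").
# The three helpers are hypothetical stand-ins for the real machinery.

def sample_completions(model, question, n):      # placeholder: draw n answers from the model
    raise NotImplementedError

def is_correct(completion, reference_answer):    # placeholder: automatic verifier, no human labels
    raise NotImplementedError

def policy_gradient_update(model, question, completions, advantages):  # placeholder: weight update
    raise NotImplementedError

def rl_step(model, question: str, reference_answer: str, group_size: int = 8) -> None:
    completions = sample_completions(model, question, n=group_size)
    rewards = [1.0 if is_correct(c, reference_answer) else 0.0 for c in completions]

    # Group-relative advantage: how much better each sample did than the group average.
    # Completions that beat their peers get reinforced; long chains of thought emerge
    # because they tend to win this comparison, not because they were programmed in.
    mean_r = sum(rewards) / len(rewards)
    advantages = [r - mean_r for r in rewards]

    policy_gradient_update(model, question, completions, advantages)
```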

The Rise of Open Source Models

A notable trend is the resurgence of open source models [00:06:43]. There is considerable excitement within the open source community, with efforts focused on:

  • Replication: Replicating projects like O1 [00:06:50].
  • Distillation: Distilling data from larger models like O1 into smaller models [00:06:54].

Open Source Community’s Impact: A Case Study

The speaker recounted an experience after the R1 paper’s release, where he wrote a single Python file implementing the GRPO algorithm on top of a Hugging Face trainer, using a small Llama 1B model to solve math questions [00:11:01]. This experiment involved manually curating reward functions [00:11:35].
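
A condensed sketch of what such a single-file setup can look like, assuming Hugging Face TRL’s `GRPOTrainer` interface; the model name, dataset wiring, and reward function here are illustrative choices, not the speaker’s exact script.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K-style math questions; GRPOTrainer expects a "prompt" column and forwards
# any extra columns (here "answer") to the reward functions as keyword arguments.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"],
                                 "answer": x["answer"].split("####")[-1].strip()})

def correctness_reward(completions, answer, **kwargs):
    # Manually curated reward: 1.0 if the gold answer appears in the completion, else 0.0.
    return [1.0 if gold in completion else 0.0
            for completion, gold in zip(completions, answer)]

trainer = GRPOTrainer(
    model="meta-llama/Llama-3.2-1B-Instruct",   # a small Llama model, as in the talk
    reward_funcs=[correctness_reward],
    args=GRPOConfig(output_dir="grpo-math", num_generations=8, logging_steps=10),
    train_dataset=dataset,
)
trainer.train()
```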

When shared on X (formerly Twitter), the experiment showed the model self-correcting, improving in accuracy, and developing longer chains of thought [00:11:39]. While not a “true” replication, its simplicity resonated widely, and it took on a life of its own in the open source community [00:12:02]. People forked it, adapted it for Jupyter notebooks, and wrote blog posts about it, drawn to its single-file, user-friendly nature that invited modification [00:12:18].

This phenomenon highlighted “rubric engineering” – the process of designing rewards for RL beyond simple right/wrong answers [00:12:52]. Rewards could be given for adhering to formats (e.g., XML structure) or even partially correct answers (e.g., providing an integer answer in the correct format, even if the value is wrong) [00:13:15]. This allows the model to receive granular feedback and improve [00:13:38].
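
As an illustration, rubric-style rewards can be written as small, independent functions whose outputs a GRPO-style trainer sums; the XML tags and weights below are assumptions made for the sketch.

```python
import re

def format_reward(completions, **kwargs):
    # Reward adherence to the expected structure, independent of correctness:
    # <reasoning> ... </reasoning> followed by <answer> ... </answer>.
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

def integer_answer_reward(completions, **kwargs):
    # Partial credit: the <answer> block contains an integer at all,
    # even if the value turns out to be wrong.
    matches = [re.search(r"<answer>(.*?)</answer>", c, re.DOTALL) for c in completions]
    return [0.25 if m and m.group(1).strip().lstrip("-").isdigit() else 0.0 for m in matches]
```

Summing several small signals like these alongside a correctness reward is what gives the model the granular feedback described above.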

Rubric engineering presents both challenges and opportunities:

  • The main challenge is preventing “reward hacking,” where the model finds loopholes that maximize reward without achieving the actual task [00:14:02].
  • Opportunities lie in using LLMs to design or autotune rubrics and in incorporating LLM judges into scoring, as in the sketch below [00:13:52].
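
A rough sketch of what incorporating an LM judge into scoring might look like; `judge_llm` is a placeholder for any hosted or local judge model.

```python
def judge_llm(prompt: str) -> str:
    """Placeholder: return the judge model's raw text response."""
    raise NotImplementedError

def judge_reward(prompts, completions, **kwargs):
    scores = []
    for question, completion in zip(prompts, completions):
        verdict = judge_llm(
            "Rate the answer to the question on a scale of 0 to 10. "
            "Reply with a single number.\n"
            f"Question: {question}\nAnswer: {completion}"
        )
        try:
            # Clamp and normalize; unparsable verdicts earn nothing, which closes off
            # one easy avenue of reward hacking (gaming the judge's output format).
            scores.append(min(max(float(verdict.strip()), 0.0), 10.0) / 10.0)
        except ValueError:
            scores.append(0.0)
    return scores
```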

Following this experience, the speaker has focused on creating more robust, usable open-source research code for RL in multi-step environments [00:14:25]. This framework allows users to create “environment” objects that models can plug into, enabling RL without needing to worry about model weights or tokens, only the interaction protocol [00:14:52].
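
The talk does not spell out the framework’s exact API, but the idea can be illustrated with a hypothetical `Environment` interface that exposes only messages and rewards:

```python
from abc import ABC, abstractmethod

class Environment(ABC):
    """The user supplies the task logic; the RL trainer handles weights and tokens."""

    @abstractmethod
    def reset(self) -> list[dict]:
        """Return the initial message history (system prompt plus task)."""

    @abstractmethod
    def step(self, assistant_message: str) -> tuple[list[dict], float, bool]:
        """Consume the model's latest message, run any tools, and return
        (new messages to append, reward, whether the episode is done)."""

class SingleTurnMathEnv(Environment):
    def __init__(self, question: str, answer: str):
        self.question, self.answer = question, answer

    def reset(self) -> list[dict]:
        return [{"role": "user", "content": self.question}]

    def step(self, assistant_message: str) -> tuple[list[dict], float, bool]:
        reward = 1.0 if self.answer in assistant_message else 0.0
        return [], reward, True   # one turn; multi-step tasks would return done=False with new messages
```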

The Future of AI Engineering in the RL Era

The open source community has a significant role to play in building and growing the infrastructure, determining best practices, and creating necessary tools for this ecosystem [00:10:20].

It is uncertain whether off-the-shelf API models will suffice for all tasks [00:15:40]. Skills are difficult to convey in prompts and are often acquired through trial and error, and models can learn them the same way [00:15:58]. This trial-and-error learning, facilitated by RL, has been the most promising unlock for higher-autonomy agents like Deep Research [00:16:19].

Fine-tuning, once dismissed, is regaining importance as the gap between open and closed source models narrows [00:16:29]. Many platforms now rely on hosted open source models [00:16:46]. And achieving results like those of DeepSeek’s R1 and OpenAI’s Deep Research requires true, explicit reinforcement learning [00:16:51].

While many research questions remain unanswered, existing AI engineering skills are highly transferable [00:17:03]. Building environments and rubrics for RL is conceptually similar to building evaluations and prompts [00:17:15]. The ecosystem will continue to need robust monitoring tools and a broad range of supporting companies and platforms [00:17:21]. The future of AI engineering may increasingly involve reinforcement learning to achieve truly autonomous agents, innovators, or language model-powered organizations [00:17:39].