From: aidotengineer

Will Brown, a machine learning researcher at Morgan Stanley, discusses the distinction between AI agents and pipelines, highlighting the potential role of reinforcement learning (RL) in advancing agent capabilities. His talk aims to synthesize current research trends and speculate on the future of AI engineering, particularly concerning the agent engineering loop [01:08:00], [01:35:00], [01:37:00].

Current Landscape of LLMs [01:41:00]

Most Large Language Models (LLMs) currently operate as chatbots or reasoners, excelling at question answering and interactive problem solving [01:53:00]. Examples include models like GPT-4, Claude 3, and Gemini [02:00:00]. To create agents (OpenAI’s Level 3), which are systems that take actions and perform longer, more complex tasks, the common approach involves chaining multiple calls to these underlying chatbot or reasoner LLMs [02:08:00], [02:17:00]. This process often relies on prompt engineering, tool calling, evaluation (eval), and human-in-the-loop interventions [02:24:00], [02:29:00]. While these methods yield “pretty good” results, they do not yet achieve the high degree of autonomy often imagined for Artificial General Intelligence (AGI) [02:31:00], [02:47:00].
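
A minimal sketch of this chaining pattern might look like the following, where `call_llm` and `search_web` are hypothetical placeholders for an underlying model API and a tool, not any specific framework:

```python
# Sketch of a "pipeline" built by chaining calls to a chatbot/reasoner LLM.
# call_llm and search_web are hypothetical stand-ins, not a real API.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an underlying chat/reasoner model."""
    raise NotImplementedError

def search_web(query: str) -> str:
    """Placeholder tool: return search results as text."""
    raise NotImplementedError

def answer_question(question: str) -> str:
    # Step 1: prompt-engineered call that decides whether a tool is needed.
    plan = call_llm(f"Question: {question}\nShould we search the web first? Answer yes or no.")
    context = search_web(question) if plan.strip().lower().startswith("yes") else ""
    # Step 2: second call that drafts an answer from the tool output.
    draft = call_llm(f"Context:\n{context}\n\nAnswer the question: {question}")
    # Step 3: lightweight self-check (eval); a human-in-the-loop review could slot in here.
    verdict = call_llm(f"Is this answer complete and grounded? Answer yes or no.\n\n{draft}")
    if verdict.strip().lower().startswith("yes"):
        return draft
    return call_llm(f"Revise this answer to the question '{question}':\n{draft}")
```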

Agents vs. Pipelines [02:53:00]

Brown distinguishes between agents and pipelines:

  • Pipelines (also referred to as workflows) [02:59:00]:

    • Characterized by fairly low degrees of autonomy [03:04:00].
    • Require a non-trivial amount of engineering to define decision trees and prompt refinements [03:10:00].
    • Often feature tight feedback loops where a user interacts with an interface, receives quick responses, and guides the system [03:22:00].
    • Examples include Integrated Development Environments (IDEs) like Cursor, Windsurf, and Replit, as well as search tools for complex question answering with web integration [03:34:00], [03:38:00]. These systems typically perform tasks for less than 10 minutes at a time [03:44:00].
  • Agents:

    • Imply higher levels of autonomy, performing tasks for extended durations [03:52:00].
    • Few examples currently exist that truly embody this, with Devin, Operator, and OpenAI’s Deep Research being notable exceptions [03:47:00], [03:48:00].

Traditionally, an agent in reinforcement learning (RL) is defined as a system that interacts with an environment with a specific goal, learning to improve its performance over time through repeated interactions [04:14:00], [04:22:00]. Brown notes that achieving higher performance (e.g., from 70% to 90% success) often requires models to learn from interaction rather than just relying on prompt tuning or static data [04:36:00], [04:43:00].
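
In code, that classical definition is just an interaction loop. The sketch below assumes a simple environment with `reset()`/`step()` methods and a hypothetical `agent` object that can act and update itself; it is an illustration of the definition, not any particular library's API:

```python
# Classic RL loop: an agent acts in an environment toward a goal and improves
# from the reward signal over repeated interactions. `env` and `agent` are
# hypothetical objects, not a specific framework.

def train(agent, env, episodes: int = 1000):
    for _ in range(episodes):
        observation = env.reset()
        done = False
        while not done:
            action = agent.act(observation)                       # choose an action
            next_obs, reward, done = env.step(action)             # environment responds
            agent.update(observation, action, reward, next_obs)   # learn from feedback
            observation = next_obs
```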

Reinforcement Learning as an Unlock [04:59:00]

Brown suggests that while traditional methods like pre-training and RLHF (Reinforcement Learning from Human Feedback) have their benefits, they face diminishing returns or limitations in pushing the frontier of capabilities [04:51:00], [05:07:00]. Synthetic data is good for distilling models but not, on its own, for massive capability unlocks [05:13:00].

Reinforcement learning (RL) appears to be the key trick for achieving “test-time scaling” [05:34:00]. DeepSeek’s R1 model and paper, which showed how a reasoning model in the mold of OpenAI’s o1 can be built, demonstrated that RL is what enables long chains of thought and complex reasoning [05:53:00], [06:06:00]. RL identifies good strategies for solving problems by giving models questions, measuring correctness, and providing feedback to encourage successful behaviors [06:14:00], [06:40:00].

RL operates on the principle of “explore and exploit”: trying different approaches, identifying what works, and then doing more of what worked [07:03:00]. An example is a model writing code to pass test cases, receiving numerical rewards for formatting, language use, and ultimately passing tests [07:13:00]. This feedback loop allows models to learn from synthetic data rollouts and refine their strategies [07:31:00]. The GRPO algorithm, used by DeepSeek, simplifies this process: sample completions, score them, and tell the model to be more like the higher-scoring ones [07:51:00].
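
A rough sketch of that GRPO-style step, under the assumptions that `model.sample`/`model.reinforce` exist as placeholders and that a toy reward mixes formatting credit with test results (this is illustrative, not DeepSeek's implementation):

```python
import statistics

def run_tests(completion: str, test_cases) -> float:
    """Placeholder: execute the generated code and return the fraction of tests passed."""
    raise NotImplementedError

def reward(completion: str, test_cases) -> float:
    """Toy reward: small credit for formatting, larger credit for passing tests."""
    score = 0.0
    if completion.strip().startswith("```python"):
        score += 0.1                           # formatting reward
    score += run_tests(completion, test_cases)  # correctness reward
    return score

def grpo_step(model, prompt: str, test_cases, group_size: int = 8):
    # Sample a group of completions for the same prompt and score each one.
    completions = [model.sample(prompt) for _ in range(group_size)]
    scores = [reward(c, test_cases) for c in completions]
    mean, std = statistics.mean(scores), statistics.pstdev(scores) or 1.0
    # Advantage of each completion relative to its own group (no value network).
    advantages = [(s - mean) / std for s in scores]
    # Nudge the policy toward the higher-scoring completions.
    model.reinforce(prompt, completions, advantages)
```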

RL for Autonomous Agents [08:08:00]

OpenAI’s Deep Research, an example of a more autonomous system, was built with end-to-end reinforcement learning, enabling it to make potentially hundreds of tool calls for browsing and querying to synthesize answers [08:24:00], [08:27:00]. While impressive, such agents still struggle with out-of-distribution tasks or highly manual calculations, indicating that RL is a big unlock for new skills and autonomy, but not a universal solution for all problems [08:49:00], [09:00:00]. It is particularly effective for teaching skills and improving performance in conjunction with environments, tools, and verification [09:12:00], [09:18:00].

Challenges and Opportunities [10:06:00]

The integration of AI agents into existing infrastructure via RL is still in its early stages, and key questions about how to do it well remain open. This presents significant opportunities for:

  • Open-source infrastructure: Building and defining best practices [10:23:00].
  • Tooling companies: Supporting the RL ecosystem [10:29:00].
  • Services: Dedicated to supporting agentic RL [10:37:00].

Brown also mentions that even at the prompt level, automation can mimic RL’s flavor, such as with DSPy, which uses downstream scores as a signal to improve a system [10:44:00], [10:52:00].
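
The general idea can be sketched as a prompt-variant search driven by downstream scores; the helpers below are hypothetical and do not reflect DSPy's actual API:

```python
# Illustration of prompt-level optimization in the spirit of DSPy:
# propose prompt variants, score each on a small eval set, keep the best.

def propose_variants(prompt: str) -> list[str]:
    """Placeholder: e.g. ask an LLM to rewrite the prompt a few different ways."""
    raise NotImplementedError

def run_pipeline(prompt: str, example) -> str:
    """Placeholder: run the existing pipeline with this prompt on one example."""
    raise NotImplementedError

def score(prediction: str, label) -> float:
    """Placeholder: downstream metric, e.g. exact match or a rubric score."""
    raise NotImplementedError

def score_prompt(prompt: str, eval_set) -> float:
    """Downstream signal: average task score of the pipeline using this prompt."""
    return sum(score(run_pipeline(prompt, x), y) for x, y in eval_set) / len(eval_set)

def optimize_prompt(base_prompt: str, eval_set, rounds: int = 5) -> str:
    best_prompt, best_score = base_prompt, score_prompt(base_prompt, eval_set)
    for _ in range(rounds):
        for candidate in propose_variants(best_prompt):   # explore prompt variants
            candidate_score = score_prompt(candidate, eval_set)
            if candidate_score > best_score:               # exploit what worked
                best_prompt, best_score = candidate, candidate_score
    return best_prompt
```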

Rubric Engineering [12:52:00]

Inspired by the public’s engagement with a simple Python file demonstrating RL, Brown introduces “rubric engineering.” This concept is analogous to prompt engineering but for defining rewards in RL [12:54:00], [12:57:00]. Instead of just a binary right/wrong score, models can be given points for following specific formats (e.g., XML structure) or demonstrating progress even if the final answer is incorrect [13:15:00], [13:23:00].
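
A minimal rubric might look like the following sketch, where each rule contributes partial credit; the XML tags and weights are invented for illustration rather than taken from the talk:

```python
import re

# Illustrative rubric: several small reward components instead of a single
# binary right/wrong signal.

def rubric_reward(completion: str, reference_answer: str) -> float:
    score = 0.0
    # Format reward: following the expected XML-style structure earns partial credit.
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", completion, re.DOTALL):
        score += 0.25
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match:
        answer = match.group(1).strip()
        if answer:
            score += 0.25   # progress reward: produced an extractable answer at all
        if answer == reference_answer.strip():
            score += 1.0    # correctness reward: final answer matches the reference
    return score
```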

Rubric engineering allows for creative design of rules that not only help developers evaluate performance but also provide direct feedback to the model for further training [13:30:00], [13:38:00]. Future work in this area could involve using LLMs to design or auto-tune rubrics and prompts, incorporating LLM judges into scoring, and guarding against “reward hacking” (where models find loopholes to achieve high rewards without performing the actual task) [13:52:00], [14:02:00].

Brown is developing an open-source framework to facilitate RL in multi-step environments, allowing users to plug in existing agent frameworks built around API models and run RL experiments [14:55:00], [15:01:00]. The goal is to enable users to define interaction protocols and environments, then let a model learn and improve over time with rewards [15:18:00], [15:28:00].
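
One way such an interaction protocol could be expressed is an environment that alternates model messages and environment responses until the episode ends, then emits a reward. This is a hypothetical interface sketched for illustration, not the actual API of Brown's framework:

```python
from abc import ABC, abstractmethod

class MultiStepEnv(ABC):
    """Hypothetical multi-step RL environment wrapped around a chat-style model."""

    @abstractmethod
    def reset(self) -> list[dict]:
        """Return the initial chat messages (system prompt, first user turn)."""

    @abstractmethod
    def step(self, assistant_message: str) -> tuple[list[dict], bool]:
        """Apply the model's message (e.g. a tool call); return the updated
        message history and whether the episode is finished."""

    @abstractmethod
    def reward(self) -> float:
        """Score the finished episode, e.g. with a rubric."""

def rollout(env: MultiStepEnv, model) -> float:
    # One episode: the model and environment take turns until done.
    messages = env.reset()
    done = False
    while not done:
        reply = model.chat(messages)      # any API-style chat model (placeholder)
        messages, done = env.step(reply)
    return env.reward()                   # reward used for the RL update
```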

The RL Era in AI Engineering [15:34:00]

Brown posits that off-the-shelf API models might not always suffice because skills are hard to embed in prompts; they are often learned through trial and error [15:53:00], [16:05:00]. This trial-and-error process, facilitated by RL, has proven to be the most promising path for creating higher-autonomy agents like Deep Research [16:15:00], [16:21:00].

Fine-tuning, once dismissed, is gaining renewed importance as the gap between open and closed-source models narrows [16:29:00]. True RL (as used by DeepSeek and OpenAI) necessitates fine-tuning based on reinforcement learning feedback [16:51:00].

While challenges remain, existing AI engineering skills—such as building evaluations (evals) and crafting prompts—directly translate to the new challenges of building environments and designing rubrics [17:08:00], [17:15:00]. Continued development of monitoring tools and a supportive ecosystem of companies and platforms will be crucial for the types of agents engineers aim to build [17:21:00], [17:29:00]. The future of AI engineering may increasingly involve reinforcement learning to unlock truly autonomous agents and AI-powered organizations [17:37:00], [17:42:00].