From: aidotengineer

In 2022, OpenAI released InstructGPT, which was lauded as the first model capable of taking and following instructions effectively [00:00:00]. However, even by 2025 with GPT-4.1, large language models (LLMs) continue to struggle with complex instruction following [00:00:19]. This challenge isn’t exclusive to OpenAI; every language model faces issues with consistently following instructions [00:00:29].

Initially, simple prompts like “Explain the moon landing to a six-year-old” demonstrated impressive capabilities [00:00:41]. But as developers began to leverage LLMs for more intricate tasks, prompts evolved to include extensive information, context, constraints, and requirements, all “shoved into one prompt” [00:01:02]. For even seemingly simple tasks like instruction following, LLMs alone are often insufficient [00:01:14].

The Rise of AI Agents

This limitation highlights the need for AI agents [00:01:21]. Unlike basic prompting, agents require planning [00:01:24]. The definition of an “agent” can be fluid, with some even referring to an LLM acting as a router that directs queries to specialized LLMs as an agent [00:01:53].

Other forms of agents include:

  • Function Calling: Providing an LLM with a list of external tools it can use to interact with APIs or perform actions like Google searches [00:02:08].
  • ReAct (Reason and Act): A popular framework in which an LLM thinks, acts upon that thought, and observes the result, iterating step by step toward a solution [00:02:39]. However, ReAct typically lacks look-ahead over the entire plan [00:03:00].
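The ReAct pattern described above can be sketched as a simple think-act-observe cycle. This is a minimal illustration of the control flow, not any particular framework's implementation; the `llm_think` stub and the `search` tool are stand-ins for real LLM and API calls:

```python
# Minimal ReAct-style loop: think, act, observe, repeat.
# The "LLM" here is a stub returning canned thoughts/actions
# purely to illustrate the control flow.

def search_tool(query: str) -> str:
    """Stand-in for an external tool (e.g., a web search)."""
    return f"results for '{query}'"

TOOLS = {"search": search_tool}

def llm_think(history: list) -> tuple:
    """Stub LLM: decides the next thought and optional action.
    Returns (thought, action, action_input)."""
    if not history:
        return ("I should look this up.", "search", "moon landing")
    return ("I have enough information to answer.", None, None)

def react_loop(question: str, max_steps: int = 5) -> list:
    history = []
    for _ in range(max_steps):
        thought, action, arg = llm_think(history)
        history.append(f"Thought: {thought}")
        if action is None:                      # no further action -> stop
            break
        observation = TOOLS[action](arg)        # act, then observe
        history.append(f"Action: {action}({arg})")
        history.append(f"Observation: {observation}")
    return history

trace = react_loop("Explain the moon landing")
```

Note that each iteration reacts only to the latest observation; nothing in the loop inspects a full plan ahead of time, which is exactly the limitation noted above.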

The Importance of Planning

Planning is crucial for AI agents, involving the process of figuring out the necessary steps to achieve a goal [00:03:15]. It is particularly valuable for complex tasks that are not straightforward [00:03:24]. Planning offers advantages such as:

  • Parallelization: Enabling multiple steps to be executed concurrently [00:03:31].
  • Explainability: Providing a clear understanding of why certain steps were taken, which is often missing in reactive frameworks [00:03:33].

Planners can be form-based, like Microsoft’s Magentic-One, or code-based, such as smolagents from Hugging Face [00:03:43].

Dynamic Planning and Smart Execution

A key aspect of advanced planning is dynamic planning, which includes the option to replan [00:03:57]. This means a system can reassess its current plan mid-execution and decide if a different path is more optimal [00:04:09].

To achieve true efficiency, every planner needs an execution engine [00:04:14]. A smart execution engine can:

  • Analyze dependencies between steps, enabling parallel execution [00:04:22].
  • Manage trade-offs between speed and cost [00:04:29]. For example, using branch prediction can lead to faster systems [00:04:34].
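The dependency analysis mentioned above can be sketched as grouping steps into "waves": steps whose prerequisites are already done can run concurrently in the same wave. The step names and the dependency graph below are illustrative assumptions, not AI21's actual engine:

```python
# Sketch of a dependency-aware scheduler: steps whose inputs are
# ready run in the same "wave" (i.e., could execute in parallel).

def schedule_waves(deps: dict) -> list:
    """Group steps into waves; each wave depends only on earlier waves.
    `deps` maps each step name to the set of steps it depends on."""
    done = set()
    waves = []
    remaining = dict(deps)
    while remaining:
        ready = {s for s, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("cyclic dependencies")
        waves.append(ready)
        done |= ready
        for s in ready:
            del remaining[s]
    return waves

# Hypothetical plan: fetch_a and fetch_b are independent, so they
# share a wave and could be parallelized.
plan = {
    "fetch_a": set(),
    "fetch_b": set(),
    "merge": {"fetch_a", "fetch_b"},
    "report": {"merge"},
}
waves = schedule_waves(plan)
# waves == [{"fetch_a", "fetch_b"}, {"merge"}, {"report"}]
```

In a real engine each wave would be dispatched to a thread pool or async runtime; the scheduling logic itself is the same.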

AI21 Maestro: An Example System

AI21’s Maestro system exemplifies the integration of a planner and a smart execution engine [00:04:40]. In a simplified instruction-following scenario, Maestro separates the prompt (context and task) from explicit requirements (e.g., paragraph limits, tone, brand mentions) [00:04:53]. This separation makes validation easier [00:05:12].
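The idea of separating the task from its requirements can be sketched as a list of explicit, checkable predicates over a draft, so validation is just running the checks. The helper names and sample requirements here are illustrative, not the system's actual API:

```python
# Sketch: requirements as predicates over a draft, kept separate
# from the prompt itself, so each one can be validated independently.

from typing import Callable

Requirement = Callable[[str], bool]

def max_paragraphs(n: int) -> Requirement:
    """Requirement: the draft has at most n paragraphs."""
    return lambda text: text.count("\n\n") + 1 <= n

def mentions(brand: str) -> Requirement:
    """Requirement: the draft mentions a given brand."""
    return lambda text: brand in text

requirements = [max_paragraphs(2), mentions("AI21")]

def validate(draft: str) -> list:
    """Run every requirement against the draft."""
    return [req(draft) for req in requirements]

draft = "AI21 builds agent systems.\n\nPlanning makes them reliable."
results = validate(draft)
# results == [True, True]
```

Because each requirement is checked in isolation, a failing check pinpoints exactly which constraint to fix in the next iteration.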

The system uses an execution tree or graph where, at each step, the planner and execution engine:

  1. Choose several candidate solutions [00:05:23].
  2. Continue to fix and improve only the most promising candidates [00:05:26].

Key techniques for improving efficiency and quality include:

  • Best of N: Instead of a single generation, multiple generations are sampled from an LLM with high temperature, or different LLMs are used [00:05:36].
  • Candidate Ditching: Unpromising candidates are discarded, and only the best ones are pursued based on a predefined budget [00:05:49].
  • Validation and Iteration: Continuous validation allows for iterative fixing and improvement [00:05:59].
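The three techniques above can be combined in one loop: sample several candidates, then repeatedly keep only the top few and refine them until a fixed budget runs out. The `generate` and `refine` stubs below stand in for LLM sampling and iterative fixing; the numbers are illustrative:

```python
import random

# Sketch of best-of-N sampling with candidate ditching: generate
# several candidates, score them, and keep improving only the most
# promising ones within a fixed refinement budget.

def generate(n: int) -> list:
    """Stub: each 'candidate' is represented by its quality score."""
    rng = random.Random(0)                    # seeded for reproducibility
    return [rng.random() for _ in range(n)]

def refine(candidate: float) -> float:
    """Stub: one fix/improve pass nudges quality up, capped at 1.0."""
    return min(1.0, candidate + 0.1)

def best_of_n(n: int = 8, keep: int = 2, budget: int = 3) -> float:
    candidates = generate(n)                  # best of N: sample widely
    for _ in range(budget):                   # budget limits total work
        candidates.sort(reverse=True)
        candidates = candidates[:keep]        # ditch unpromising candidates
        candidates = [refine(c) for c in candidates]
    return max(candidates)

best = best_of_n()
```

In practice the high-temperature samples (or samples from different LLMs) and the scoring come from real model calls and validators, but the keep-the-best-within-budget structure is the same.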

In more complex scenarios, the execution engine tracks expected cost, latency, and success probability, allowing the planner to choose the most appropriate path [00:06:15]. Finally, a “reduce” phase consolidates multiple results into the best or a combined answer [00:06:28].
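Path selection of this kind can be sketched as scoring each candidate path by its expected success probability, discounted by cost and latency, then keeping the winner. The weights, path names, and numbers below are made-up assumptions, not AI21's actual scoring:

```python
from dataclasses import dataclass

# Sketch: the planner scores candidate paths by expected cost,
# latency, and success probability, then picks the best one.

@dataclass
class Path:
    name: str
    cost: float        # expected spend (e.g., dollars)
    latency: float     # expected seconds
    p_success: float   # expected probability of satisfying the task

def score(p: Path, w_cost: float = 1.0, w_latency: float = 0.01) -> float:
    """Higher is better: success, discounted by cost and latency.
    The weights are illustrative tuning knobs."""
    return p.p_success - w_cost * p.cost - w_latency * p.latency

def choose_path(paths: list) -> Path:
    return max(paths, key=score)

paths = [
    Path("single_call", cost=0.01, latency=1.0, p_success=0.6),
    Path("best_of_8",   cost=0.08, latency=4.0, p_success=0.9),
]
chosen = choose_path(paths)
# With these weights, the pricier best-of-8 path wins on expected quality.
```

A final "reduce" step then consolidates the surviving results, for example by taking the highest-scoring answer or merging several into one.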

Performance and Trade-offs

Systems employing planning and smart execution engines demonstrate significantly improved results compared to single LLM calls, even for advanced models like GPT-4o or Claude 3.5 Sonnet [00:06:42]. For instance, in requirement satisfaction tasks, these methods show considerable improvement [00:07:02].

While there is a trade-off in terms of increased runtime and cost, the resulting higher quality often justifies the investment [00:07:10].

Key Takeaways