From: aidotengineer
In 2022, OpenAI released InstructGPT, which was lauded as the first model capable of taking and following instructions effectively [00:00:00]. However, even by 2025 with GPT-4.1, large language models (LLMs) continue to struggle with complex instruction following [00:00:19]. This challenge isn’t exclusive to OpenAI; every language model faces issues with consistently following instructions [00:00:29].
Initially, simple prompts like “Explain the moon landing to a six-year-old” demonstrated impressive capabilities [00:00:41]. But as developers began to leverage LLMs for more intricate tasks, prompts evolved to include extensive information, context, constraints, and requirements, all “shoved into one prompt” [00:01:02]. For even seemingly simple tasks like instruction following, LLMs alone are often insufficient [00:01:14].
The Rise of AI Agents
This limitation highlights the need for AI agents [00:01:21]. Unlike basic prompting, agents require planning [00:01:24]. The definition of an “agent” can be fluid, with some even referring to an LLM acting as a router that directs queries to specialized LLMs as an agent [00:01:53].
Other forms of agents include:
- Function Calling: Providing an LLM with a list of external tools it can use to interact with APIs or perform actions like Google searches [00:02:08].
- ReAct (Reason and Act): A popular framework where an LLM thinks, then acts upon that thought, and observes the result, iterating step by step towards a solution [00:02:39]. However, ReAct typically lacks look-ahead over the entire plan [00:03:00].
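The think-act-observe loop can be sketched in a few lines. In this minimal sketch the "model" is a scripted stub that emits canned thought/action pairs, and the search tool is a hypothetical stand-in for a real API; a production system would call an actual LLM at each step.

```python
# A minimal ReAct-style loop. The scripted trace, tool names, and
# answers are hypothetical -- a real system calls a model each turn.

def search_tool(query: str) -> str:
    """Hypothetical external tool (stands in for e.g. a web search API)."""
    return f"results for: {query}"

TOOLS = {"search": search_tool}

# Scripted model outputs: each step is (thought, action, action_input).
# An action of "finish" ends the loop with action_input as the answer.
SCRIPT = [
    ("I need background information.", "search", "moon landing date"),
    ("I have enough to answer.", "finish", "Apollo 11 landed in 1969."),
]

def react_loop(script):
    observations = []
    for thought, action, action_input in script:  # reason ...
        if action == "finish":
            return action_input
        result = TOOLS[action](action_input)      # ... then act
        observations.append(result)               # ... and observe
    return None

print(react_loop(SCRIPT))  # -> "Apollo 11 landed in 1969."
```

Note that the loop only ever sees one step ahead, which is exactly the missing look-ahead the talk points out.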
The Importance of Planning
Planning is crucial for AI agents, involving the process of figuring out the necessary steps to achieve a goal [00:03:15]. It is particularly valuable for complex tasks that are not straightforward [00:03:24]. Planning offers advantages such as:
- Parallelization: Enabling multiple steps to be executed concurrently [00:03:31].
- Explainability: Providing a clear understanding of why certain steps were taken, which is often missing in reactive frameworks [00:03:33].
Planners can be forms-based, like Microsoft’s Magentic-One, or code-based, such as smolagents from Hugging Face [00:03:43].
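Whatever form the planner takes, its output is essentially a set of steps with dependencies. A minimal, hypothetical representation (real planners emit richer schemas) makes the parallelization and explainability benefits concrete: independent steps are visible at a glance, and each step records why it exists in the plan.

```python
# A plan as explicit steps with dependencies -- a minimal, hypothetical
# structure; forms- and code-based planners emit richer schemas.

from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    depends_on: list = field(default_factory=list)

plan = [
    Step("fetch_sales_data"),
    Step("fetch_market_data"),
    Step("analyze", depends_on=["fetch_sales_data", "fetch_market_data"]),
    Step("write_report", depends_on=["analyze"]),
]

# Steps with no dependencies can start immediately -- and in parallel.
ready = [s.name for s in plan if not s.depends_on]
print(ready)  # -> ['fetch_sales_data', 'fetch_market_data']
```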
Dynamic Planning and Smart Execution
A key aspect of advanced planning is dynamic planning, which includes the option to replan [00:03:57]. This means a system can reassess its current plan mid-execution and decide if a different path is more optimal [00:04:09].
To achieve true efficiency, every planner needs an execution engine [00:04:14]. A smart execution engine can:
- Analyze dependencies between steps, enabling parallel execution [00:04:22].
- Manage trade-offs between speed and cost [00:04:29]. For example, using branch prediction can lead to faster systems [00:04:34].
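Dependency analysis is the part of the engine that can be sketched directly: group steps into "levels" where every step's dependencies are already complete, then run each level concurrently. The step names and work function below are illustrative, not any particular engine's API.

```python
# Sketch of a dependency-aware executor: repeatedly find steps whose
# dependencies are all done, run that batch in parallel, repeat.

from concurrent.futures import ThreadPoolExecutor

# step -> list of steps it depends on (illustrative graph)
deps = {
    "a": [], "b": [],   # independent: can run in parallel
    "c": ["a", "b"],    # must wait for both
    "d": ["c"],
}

def run_step(name: str) -> str:
    return f"{name}:done"   # stand-in for real work (tool call, LLM call)

def execute(deps):
    done, results = set(), {}
    while len(done) < len(deps):
        # every not-yet-run step whose dependencies are all satisfied
        level = [s for s, d in deps.items()
                 if s not in done and all(x in done for x in d)]
        with ThreadPoolExecutor() as pool:   # run the whole level at once
            for name, res in zip(level, pool.map(run_step, level)):
                results[name] = res
        done.update(level)
    return results

print(execute(deps))
```

A cost-aware engine would extend this by attaching expected cost and latency to each step and choosing which levels, or which candidate branches, to run at all.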
AI21 Maestro: An Example System
AI21’s Maestro system exemplifies the integration of a planner and a smart execution engine [00:04:40]. In a simplified instruction-following scenario, Maestro separates the prompt (context and task) from explicit requirements (e.g., paragraph limits, tone, brand mentions) [00:04:53]. This separation makes validation easier [00:05:12].
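The value of this separation is that each requirement becomes an independently checkable predicate rather than a clause buried in the prompt. The sketch below is illustrative only (the brand name and checks are hypothetical, not AI21's actual validators):

```python
# Sketch: keep the task prompt and the explicit requirements separate,
# so each requirement can be validated on its own. All names hypothetical.

prompt = "Write a short product announcement for our new headphones."

requirements = {
    "max_two_paragraphs": lambda text: text.count("\n\n") + 1 <= 2,
    "mentions_brand": lambda text: "AcmeAudio" in text,  # hypothetical brand
    "no_exclamations": lambda text: "!" not in text,
}

draft = "AcmeAudio's new headphones ship next month.\n\nPreorders open today."

def validate(text, requirements):
    """Return a per-requirement pass/fail report for a candidate draft."""
    return {name: check(text) for name, check in requirements.items()}

print(validate(draft, requirements))  # each requirement passes or fails independently
```

A failing requirement pinpoints exactly what to fix in the next iteration, instead of regenerating blindly.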
The system uses an execution tree or graph where, at each step, the planner and execution engine:
- Choose several candidate solutions [00:05:23].
- Continue to fix and improve only the most promising candidates [00:05:26].
Key techniques for improving efficiency and quality include:
- Best of N: Instead of a single generation, multiple generations are sampled from an LLM with high temperature, or different LLMs are used [00:05:36].
- Candidate Ditching: Unpromising candidates are discarded, and only the best ones are pursued based on a predefined budget [00:05:49].
- Validation and Iteration: Continuous validation allows for iterative fixing and improvement [00:05:59].
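Best-of-N and candidate ditching compose naturally: sample several candidates, score each against the requirements, keep only the top few within the budget. In this sketch the generator and scorer are stubs standing in for high-temperature LLM sampling and requirement validation.

```python
# Sketch of best-of-N sampling with candidate ditching. The generator
# and scoring function are toy stubs, not a real model or validator.

import random

def generate(seed: int) -> str:
    """Stub for one high-temperature LLM sample (one candidate per seed)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(["alpha", "beta", "gamma"]) for _ in range(5))

def score(candidate: str) -> int:
    """Stub validator: toy proxy for 'number of requirements satisfied'."""
    return candidate.count("alpha")

def best_of_n(n: int, keep: int):
    candidates = [generate(i) for i in range(n)]      # best of N: sample widely
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:keep]                              # ditch the rest (budget)

survivors = best_of_n(n=8, keep=2)
print(survivors)
```

Only the survivors are carried forward for further fixing and validation, which is what keeps the execution tree affordable.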
In more complex scenarios, the execution engine tracks expected cost, latency, and success probability, allowing the planner to choose the most appropriate path [00:06:15]. Finally, a “reduce” phase consolidates multiple results into the best or a combined answer [00:06:28].
Performance and Trade-offs
Systems employing planning and smart execution engines demonstrate significantly improved results compared to single LLM calls, even for advanced models like GPT-4o or Claude 3.5 Sonnet [00:06:42]. For instance, in requirement satisfaction tasks, these methods show considerable improvement [00:07:02].
While there is a trade-off in terms of increased runtime and cost, the resulting higher quality often justifies the investment [00:07:10].
Key Takeaways
- LLMs alone are not always enough: Even for seemingly “simple” tasks like instruction following, advanced LLMs can fall short [00:07:21].
- Start Simple: Begin with basic LLMs if they suffice [00:07:30].
- Incrementally Add Complexity: If needed, incorporate tools or frameworks like ReAct [00:07:35].
- Embrace Planning and Execution for Complex Tasks: For highly complex tasks, a full planning and execution engine is necessary to achieve desired outcomes [00:07:43]. This approach represents a powerful strategy for effective AI implementation and enhancing existing systems with AI capabilities.