From: aidotengineer
Large language models (LLMs) released around 2022, such as OpenAI’s InstructGPT, were groundbreaking for their ability to follow instructions [00:00:00]. However, even by 2025, LLMs like GPT-4.1 still struggled to follow instructions consistently [00:00:19]. This limitation applies to nearly all language models [00:00:29].
The initial use cases for LLMs were simple, like explaining the moon landing to a six-year-old [00:00:40]. However, as developers started using them more extensively, instructions became complex, often involving shoving “all the information, all the context, all the constraints, all the requirements” into a single prompt [00:01:02]. For tasks such as instruction following, LLMs alone are often insufficient [00:01:14]. This is where AI agents come into play, as they require planning in addition to prompting [00:01:21].
What is an AI Agent?
While there’s a philosophical debate about what constitutes an AI agent, from an engineering perspective, it’s “whatever works” [00:01:41]. Various interpretations exist:
- LLM as a router: A routing model directs a query to a specialized LLM [00:01:53].
- Function calling: The LLM is provided with external tools or APIs it can use, such as Google Search, to interact with the world [00:02:08].
- ReAct: A popular framework that lets any language model reason and act: it thinks, acts on that thought, and observes the outcome [00:02:36]. However, ReAct is a step-by-step process with no look-ahead over the entire plan [00:02:58].
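The function-calling interpretation above can be sketched in a few lines: the model emits a structured tool call, and application code dispatches it. The tool registry, the stub `google_search`, and the call format below are illustrative assumptions, not any particular vendor's API.

```python
def google_search(query: str) -> str:
    """Stand-in for a real search API call."""
    return f"results for: {query}"

# Tools the model is allowed to use.
TOOLS = {"google_search": google_search}

def dispatch(tool_call: dict) -> str:
    """Run the tool the model requested, with the arguments it supplied."""
    fn = TOOLS[tool_call["name"]]
    return fn(**tool_call["arguments"])

# A model response might look like:
call = {"name": "google_search", "arguments": {"query": "moon landing"}}
```

In a real system the model's raw output would be parsed into a `call` dict like this before dispatching.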
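The ReAct loop itself can be sketched as follows. This is a minimal illustration, not a faithful reimplementation: `model` is any callable returning a (thought, action, args) triple, and the scripted stub stands in for a real LLM. Note that each iteration only decides the next step, with no look-ahead.

```python
def react(model, tools, task, max_steps=5):
    """Think, act, observe, repeat - one step at a time."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, args = model(history)         # reason
        history.append(f"Thought: {thought}")
        if action == "finish":                         # model decides to stop
            return args
        observation = tools[action](args)              # act
        history.append(f"Observation: {observation}")  # observe
    return None

# Scripted stand-in for an LLM: search once, then answer.
def scripted_model(history):
    if not any(line.startswith("Observation:") for line in history):
        return ("I should search.", "search", "moon landing")
    return ("I have enough information.", "finish", "Apollo 11, 1969")
```

Usage: `react(scripted_model, {"search": lambda q: f"results for {q}"}, "When was the moon landing?")` returns the scripted final answer after one search step.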
The Role of Planning
Planning in AI agents involves “figuring out the steps that you have to take to reach your goal” [00:03:15]. It is particularly useful for complex tasks that are not straightforward and that benefit from parallelization and explainability [00:03:24]. Unlike ReAct, which shows the model’s thoughts but not the why behind them, planning offers greater transparency [00:03:36]. Planners can be text-based (such as Microsoft’s Magentic-One) or code-based (such as Hugging Face’s smolagents) [00:03:43].
Dynamic Planning
Dynamic planning means the system has the ability to “replan” [00:03:56]. Instead of rigidly following a single plan, it can reassess in the middle of execution and decide whether to stick to the current plan or create a new one [00:04:00].
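A minimal sketch of this replanning loop, assuming caller-supplied `planner` and `execute` callables (both hypothetical): execute the current plan, and if a step fails mid-execution, ask the planner for a new plan instead of pressing on.

```python
def run_with_replanning(planner, execute, goal, max_replans=2):
    """Execute a plan, replanning mid-run whenever a step fails."""
    plan = planner(goal, feedback=None)
    for _ in range(max_replans + 1):
        for step in plan:
            ok, result = execute(step)
            if not ok:
                # Reassess: build a fresh plan using the failure as feedback.
                plan = planner(goal, feedback=result)
                break
        else:
            return "goal reached"   # every step succeeded
    return "gave up"
```

The `feedback` argument is the key difference from a rigid planner: the new plan can take the observed failure into account.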
Smart Execution Engines
For efficiency, every planner requires an execution engine [00:04:14]. An execution engine can:
- Analyze dependencies between steps, enabling parallel execution [00:04:22].
- Manage trade-offs between speed and cost [00:04:29].
- Use techniques like branch prediction to make the system faster [00:04:34].
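The dependency-analysis idea can be sketched as a toy engine that runs all steps whose prerequisites are finished in parallel, level by level. The step graph and thread-pool scheduling here are illustrative assumptions; a real engine would also weigh speed against cost.

```python
from concurrent.futures import ThreadPoolExecutor

def run_plan(steps, deps):
    """steps: name -> zero-arg callable; deps: name -> set of prerequisites.
    Runs every step whose dependencies are done, in parallel."""
    done, results = set(), {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(steps):
            ready = [s for s in steps
                     if s not in done and deps.get(s, set()) <= done]
            if not ready:
                raise ValueError("cyclic dependency in plan")
            # Independent steps in this level execute concurrently.
            futures = {name: pool.submit(steps[name]) for name in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
            done.update(ready)
    return results
```

For example, with `deps = {"c": {"a", "b"}}`, steps `a` and `b` run in parallel and `c` runs only after both complete.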
AI21 Maestro Example
AI21 Maestro is a system that combines a planner with a smart execution engine [00:04:40]. For instruction following, it separates the prompt (context and task) from explicit requirements (e.g., paragraph limit, tone, brand mentions) [00:04:49]. This separation simplifies validation [00:05:12].
The system uses an “execution tree” or “execution graph” [00:05:15]. At each step, the planner and execution engine select several candidate solutions and only continue to refine the most promising ones [00:05:20]. Techniques used include:
- Best-of-N: Sampling multiple generations from an LLM (potentially with high temperature) or using different LLMs [00:05:32].
- Candidate Ditching: Discarding unpromising candidates and pursuing only the best ones based on a predefined budget [00:05:49].
- Validation and Iteration: Continuous fixing and refining through iterative processes [00:05:59].
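Best-of-N combined with candidate ditching can be sketched in a few lines: sample `n` candidates, then keep only the top `keep` for further refinement, discarding the rest to stay within budget. `generate` and `score` are caller-supplied stand-ins for LLM sampling and a quality metric.

```python
import heapq

def best_of_n(generate, score, n=8, keep=2):
    """Sample n candidates, keep only the most promising `keep` of them."""
    candidates = [generate() for _ in range(n)]
    return heapq.nlargest(keep, candidates, key=score)  # ditch the rest
```

In practice `generate` might call one LLM at high temperature or rotate across different LLMs, and `score` would run the requirement checks.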
In more complex scenarios, the execution engine can provide metrics like expected cost, latency, and success probability for different “tracks” or paths [00:06:07]. The planner then chooses the optimal path [00:06:22]. Finally, a “reduce” step combines or selects the best results for a complete answer [00:06:28].
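The track-selection step can be sketched as picking the highest expected success probability among the tracks whose cost fits the budget. The field names (`cost`, `p_success`) are illustrative assumptions, not Maestro's actual schema.

```python
def choose_track(tracks, cost_budget):
    """Pick the most promising track that fits the cost budget.
    Each track is a dict of expected metrics reported by the engine."""
    affordable = [t for t in tracks if t["cost"] <= cost_budget]
    if not affordable:
        return None
    return max(affordable, key=lambda t: t["p_success"])
```

The same shape works for latency budgets, or for a weighted score over several metrics at once.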
This approach significantly improves results on instruction-following benchmarks like IFEval and on requirement satisfaction, showing high quality even for models like GPT-4o, Claude Sonnet 3.5, and o3-mini [00:06:42]. While it may incur higher runtime and cost, it delivers substantially higher-quality outcomes than single LLM calls [00:07:10].
Key Takeaways
- LLMs alone are often not sufficient, even for seemingly simple tasks like instruction following [00:07:21].
- Always start simple: use basic LLMs if they suffice [00:07:27].
- Progress to more complex solutions as needed:
  - Add tools to the model [00:07:35].
  - Use frameworks like ReAct [00:07:41].
- For highly complex tasks, dynamic planning and smart execution engines become essential [00:07:43].