From: aidotengineer
OpenAI expects 2025 to be the year of AI agents, with generative AI transitioning from an assistant role to a co-worker [00:08:41]. Through work with customers and internal product development, OpenAI has identified key patterns and anti-patterns in AI agent development [00:08:51].
Defining an AI Agent
An AI agent is an application comprising a model with instructions (usually a prompt) and access to tools for information retrieval and external system interaction, all operating within an execution loop controlled by the model [00:09:04]. In each cycle, the agent receives natural language instructions, decides whether to issue tool calls, runs those tools, synthesizes a response, and provides an answer to the user [00:09:26]. The agent can also determine when its objective is met and terminate the loop [00:09:43].
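The execution loop described above can be sketched in a few lines of Python. This is a hedged illustration, not OpenAI's implementation: the "model" is a stand-in callable that, given the instructions and conversation history, either requests a tool call or returns a final answer.

```python
# Minimal sketch of a model-controlled agent loop. The model function,
# ToolCall shape, and history format are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Agent:
    instructions: str
    tools: dict        # tool name -> callable
    model: callable    # decides the next step from the history
    history: list = field(default_factory=list)

    def run(self, user_input, max_turns=10):
        self.history.append({"role": "user", "content": user_input})
        for _ in range(max_turns):          # execution loop controlled by the model
            step = self.model(self.instructions, self.history)
            if isinstance(step, ToolCall):  # model chose to issue a tool call
                result = self.tools[step.name](**step.args)
                self.history.append(
                    {"role": "tool", "name": step.name, "content": result})
            else:                           # model decided the objective is met
                self.history.append({"role": "assistant", "content": step})
                return step
        raise RuntimeError("agent did not terminate")
```

In a real system the `model` callable would wrap an LLM call; for testing, a scripted stand-in can drive the loop by returning a `ToolCall` on the first turn and a final answer afterwards.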
OpenAI’s Enterprise AI Customer Journey
OpenAI typically sees the enterprise AI customer journey in three phases [00:01:47]:
- AI-Enabled Workforce
- Getting AI into employees’ hands to foster AI literacy and daily use [00:01:54].
- This typically starts with products like ChatGPT [00:02:30].
- Automating AI Operations
- Building internal automation and co-pilot use cases [00:02:11].
- More complex or customized internal use cases often utilize the API [00:02:40].
- Infusing AI into End Products
- Integrating AI into end-user-facing products, primarily via API use cases [00:02:22].
OpenAI’s approach to enterprise strategy involves [00:03:07]:
- Top-Down Strategic Guidance
- Aligning AI initiatives with the broader business strategy [00:03:17].
- Identify High-Impact Use Cases
- Scoping one or two significant use cases to deliver initial value [00:03:36].
- Build Divisional Capability
- Enabling teams and infusing AI throughout the organization through enablement, Centers of Excellence, or centralized technological platforms [00:03:52].
The Use Case Journey
A typical use case journey, illustrated over a three-month example, involves [00:04:31]:
- Ideation & Scoping
- Initial ideation, scoping, architecture review, and defining success metrics/KPIs [00:04:40].
- Development
- The bulk of the time, involving iterative prompting strategies, RAG, and continuous improvement [00:04:53].
- OpenAI’s team provides close interaction through workshops, office hours, paired programming, and webinars [00:05:06].
- Testing & Evaluation
- Performing A/B testing and beta rollouts based on predefined evaluation metrics [00:05:24].
- Production & Maintenance
- Launch, rollout, scale optimization testing, followed by ongoing maintenance [00:05:37].
OpenAI supports this process by providing dedicated teams, early access to new models and features, internal experts from research and engineering, and joint roadmap sessions [00:05:55].
Case Study: Morgan Stanley Internal Knowledge Assistant
Morgan Stanley built an internal knowledge assistant to help wealth managers query a large corpus of information, including research reports and stock data, to provide highly accurate information to clients [00:06:54]. Initially, accuracy was around 45% [00:07:21]. By introducing methods like [00:07:23]:
- Hybrid retrieval
- Fine-tuning embeddings
- Different chunking strategies
- Reranking
- Classification steps
- Prompt engineering
- Query expansion
Accuracy improved significantly, reaching 85% and ultimately 98% (surpassing the 90% goal) [00:07:33]. This case study demonstrates the iterative approach to improving core metrics through the use case journey [00:07:48].
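One lever from the list above, chunking strategy, can be illustrated with a sliding-window splitter. This is a generic sketch, not Morgan Stanley's pipeline; the chunk size and overlap are illustrative knobs a team would sweep during evaluation.

```python
# Sliding-window chunking with overlap, so a fact that spans a chunk
# boundary is still fully contained in at least one chunk.
def chunk(text, size=200, overlap=50):
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Varying `size` and `overlap` (alongside reranking, query expansion, and the other methods above) against a fixed evaluation set is how an iterative climb from 45% toward 98% accuracy is measured.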
Best Practices for Building AI Systems: Four Lessons in AI Agent Development
When designing AI agents, OpenAI has identified four key lessons to address challenges in developing AI agents and improve performance [00:09:51].
1. Start Simple, Optimize, and Abstract Minimally
When orchestrating models, retrieving data, and generating output, teams have two choices: start with primitives (raw API calls, manual logging) or a framework (abstractions that handle details) [00:10:00]. While frameworks are enticing for quick proof-of-concepts, they can obscure system behavior and the underlying primitives, forcing design decisions before constraints are understood and hindering later optimization [00:10:33].
A better approach is to [00:10:50]:
- Build with Primitives First: Understand how the task decomposes, where failures occur, and what needs improvement [00:10:53].
- Introduce Abstraction When Necessary: Add abstraction only when you find yourself reinventing the wheel (e.g., re-implementing an embedding strategy or model graders) [00:11:05].
“Developing agents in a scalable way isn’t so much about choosing the right abstraction, it’s really about understanding your data, understanding your failure points, and your constraints” [00:11:23].
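A "primitives first" setup can be as simple as wrapping each pipeline stage with manual logging, so latency and failure points are visible before any framework is adopted. This sketch is a generic illustration; the stage names are hypothetical.

```python
# Raw-primitives instrumentation: run each stage (retrieve, generate, ...)
# through a logging wrapper so timings can be inspected per stage.
import time

LOG = []

def logged(stage, fn, *args, **kwargs):
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    LOG.append({"stage": stage, "seconds": time.perf_counter() - start})
    return out

# e.g. docs = logged("retrieve", retrieve, query)
#      answer = logged("generate", generate, docs)
```

Once the log shows where time and failures concentrate, that is the point at which a framework's abstraction is worth evaluating.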
2. Start with a Single Agent, Then Incrementally Improve
Teams often jump directly into designing multi-agent systems, but this can create unknowns and limit insights [00:11:48]. A recommended approach is to [00:12:08]:
- Start with a Purpose-Built Single Agent: Deploy it into production with a limited user set and observe its performance [00:12:10].
- Identify Bottlenecks: This allows identification of real issues like hallucinations over conversation trajectories, low adoption due to latency, or inaccuracy from poor retrieval performance [00:12:21].
- Incrementally Improve: Knowing how the system underperforms and what’s important to users allows for targeted, incremental improvements [00:12:33].
Complexity should increase only as harder failure cases and new constraints are discovered, because the goal is to build a working system, not just a complicated one [00:12:44].
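The "identify bottlenecks" step above amounts to tallying failure categories from production traces of the single agent. A minimal sketch, assuming a hypothetical trace format with an optional `failure` field:

```python
# Rank failure categories (e.g. hallucination, latency, retrieval_miss)
# by frequency, so incremental improvement targets the biggest real issue.
from collections import Counter

def bottlenecks(traces):
    return Counter(t["failure"] for t in traces if t.get("failure")).most_common()
```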
3. Graduate to a Network of Agents with Handoffs for Complex Tasks
For more complex tasks, a network of agents and the concept of handoffs become valuable [00:13:03].
- Network of Agents: A collaborative system where multiple agents work in concert to resolve complex requests or perform a series of interrelated tasks. This allows for specialized agents to handle sub-flows within a larger agentic workflow [00:13:17].
- Handoffs: The process by which one agent transfers control of an active conversation to another agent, preserving the entire conversation history and context [00:13:38].
For example, a fully automated customer service flow could use [00:14:03]:
- A GPT-4o mini call for triage on incoming requests [00:14:16].
- GPT-4o powering a “dispute agent” to manage the conversation [00:14:23].
- An o3-mini reasoning model for accuracy-sensitive tasks like checking refund eligibility [00:14:30].
Handoffs are effective because they keep the conversation history and context intact while allowing the model, prompt, and tool definitions to be swapped, providing flexibility for a wide range of scenarios [00:14:39].
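A handoff can be sketched as swapping the agent configuration (model, prompt, tools) on a conversation object while the history travels along untouched. This is a hedged illustration of the concept, not OpenAI's SDK; the agent names and tool names are hypothetical.

```python
# Handoff sketch: the active conversation keeps its full history while
# control moves to an agent with a different model, prompt, and tools.
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    name: str
    model: str
    prompt: str
    tools: tuple = ()

@dataclass
class Conversation:
    agent: AgentConfig
    history: list = field(default_factory=list)

    def handoff(self, new_agent):
        self.history.append({
            "role": "system",
            "content": f"handoff: {self.agent.name} -> {new_agent.name}"})
        self.agent = new_agent

triage = AgentConfig("triage", "gpt-4o-mini", "Classify the incoming request.")
dispute = AgentConfig("dispute", "gpt-4o", "Handle the dispute.",
                      ("check_refund_eligibility",))
```

In the customer-service flow above, the triage agent would call `handoff(dispute)` once it classifies a request as a dispute, and the dispute agent picks up with the full conversation context already in place.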
4. Implement Guardrails in Parallel, Not in Prompts
Guardrails are mechanisms that enforce safety, security, and reliability within an application, preventing misuse and maintaining system integrity [00:14:55].
- Keep Prompts Simple: Model instructions should be simple and focused on the target task to ensure maximum interoperability, accuracy, and predictable performance [00:15:12].
- Run Guardrails in Parallel: Guardrails should not typically be part of the main prompts but rather run in parallel, a practice made more accessible by faster and cheaper models like GPT-4o mini [00:15:25].
- Defer High-Stakes Actions: Tool calls and user responses that are high-stakes (e.g., issuing a refund, showing personal account information) should be deferred until all guardrails have returned their results [00:15:42].
An example involves running a single input guardrail to prevent prompt injection and a couple of output guardrails on the agent’s response [00:15:57].
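That pattern can be sketched with `asyncio`: the main agent response and the input guardrail run concurrently, output guardrails check the draft response, and the high-stakes reply is released only once every guardrail has returned clean. The guardrail functions here are toy stand-ins for the fast, cheap model calls the talk describes.

```python
# Parallel guardrails sketch: stand-in checks in place of real model calls.
import asyncio

async def injection_guardrail(user_input: str) -> bool:
    # Stand-in for a small classifier flagging prompt injection.
    return "ignore previous instructions" not in user_input.lower()

async def pii_guardrail(response: str) -> bool:
    # Illustrative output guardrail: block obvious account numbers.
    return not any(tok.isdigit() and len(tok) >= 8 for tok in response.split())

async def agent_response(user_input: str) -> str:
    return "Your refund has been approved."  # stand-in for the main model call

async def respond(user_input: str) -> str:
    # Run the main call and the input guardrail in parallel,
    # then defer the high-stakes reply until all guardrails pass.
    response, input_ok = await asyncio.gather(
        agent_response(user_input), injection_guardrail(user_input))
    output_ok = await pii_guardrail(response)
    if input_ok and output_ok:
        return response
    return "Sorry, I can't help with that request."
```

Because the guardrails run alongside rather than inside the main prompt, the agent's instructions stay focused on the target task while safety checks scale independently.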