From: redpointai

AI agents are artificial intelligence systems designed to perform tasks autonomously, often by interacting with tools and environments. Their capabilities have advanced significantly, particularly with newer models like GPT-4.1. [00:17:17]

Current State

Agents are currently remarkably effective in well-scoped domains [00:19:06]: when all the necessary tools are available and the user’s intent is unambiguous, they perform very well [00:09:16] [00:09:21].

Challenges and Areas for Improvement

Despite their progress, several challenges remain in deploying AI agents effectively in the real world:

  • Bridging to the Fuzzy Real World [00:09:30]: A significant hurdle is connecting agents to the messy, ambiguous nature of real-world interactions [00:09:30]. Users often don’t know an agent’s capabilities, and agents might lack awareness of their own limitations or crucial real-world information [00:09:38].
  • Context Integration [00:09:56]: It is difficult to feed all necessary context into the model [00:09:56].
  • Ambiguity Handling [00:10:00]: Models need improved steerability regarding ambiguity. Developers should be able to tune whether a model asks for more information or proceeds with assumptions when faced with unclear instructions [00:10:02]. An agent that constantly asks for confirmation can be annoying [00:10:12].
  • Tool and Context Connection [00:10:29]: The full potential of current models is often not realized because they are not connected with enough context or tools [00:10:30]. When examining failure cases in external benchmarks for function calling or agentic tool use, issues often stem from misgrading, ambiguity, or the model not following instructions well enough [00:10:41].
  • Longer-Term Task Execution [00:11:19]: Addressing multi-step, ambiguous tasks requires both engineering and model-side changes [00:11:26]:
    • Engineering Side [00:11:35]: APIs and UIs must make it easier to monitor an agent’s actions, provide summaries of its progress, and allow users to intervene and change its trajectory [00:11:37]. OpenAI’s “operator” feature is an example of this steerability [00:11:47].
    • Modeling Side [00:11:58]: Models need increased robustness and “grit” to handle errors, such as API failures, without getting stuck [00:12:00].
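The “grit” point above can be made concrete: an agent loop needs bounded retries with backoff so that a transient API failure does not end the run. A minimal sketch under assumed conventions (the `TransientToolError` type and `flaky_search` tool are hypothetical stand-ins, not any real API):

```python
import time

class TransientToolError(Exception):
    """Stand-in for a retryable failure, e.g. a 429 or a timeout from a tool API."""

def call_with_grit(tool, args, max_retries=3, base_delay=1.0):
    """Retry a failing tool call with exponential backoff instead of getting stuck."""
    for attempt in range(max_retries + 1):
        try:
            return tool(**args)
        except TransientToolError:
            if attempt == max_retries:
                # Surface the failure so the agent can re-plan rather than loop forever.
                return {"error": f"tool failed after {max_retries + 1} attempts"}
            time.sleep(base_delay * (2 ** attempt))

# Usage: a flaky tool that succeeds on the third attempt.
calls = {"n": 0}
def flaky_search(query):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientToolError()
    return {"results": [query.upper()]}

print(call_with_grit(flaky_search, {"query": "agent robustness"}, base_delay=0.01))
# → {'results': ['AGENT ROBUSTNESS']}
```

The design choice mirrors the point in the transcript: the error is eventually surfaced to the caller as data rather than raised, so an agent can reason about the failure instead of crashing mid-task.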

Role of Model Advancements

Recent model releases, like GPT-4.1, have significantly impacted agent capabilities by improving instruction following and long context processing [00:08:58]. The hypothesis behind cheaper, faster models like “nano” was to spur more AI adoption [00:06:37], and this has proven successful, demonstrating demand across various cost and latency points [00:06:43].

For companies, staying ahead of rapid model progress requires:

  • Robust Evals [00:17:56]: The most successful startups have deep knowledge of their use case and strong evaluation metrics. This enables them to quickly test new models and adapt [00:18:01].
  • Adaptable Prompting [00:18:13]: Being able to switch and tune prompts and scaffolding for different models is crucial [00:18:15].
  • Building for the Near Future [00:18:22]: Focus on use cases that are “just out of reach” of current models (e.g., those that succeed one time in ten today but could reach nine in ten) [00:18:25]. If a problem shows significant improvement with fine-tuning (e.g., a pass rate rising from 10% to 50%), a future model will likely solve it completely [00:18:48].
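The “robust evals” point is straightforward to operationalize: keep a fixed set of graded cases and re-run it against each new model to get a pass rate within minutes. A minimal sketch (the prompts, graders, and stand-in “models” here are illustrative assumptions; real harnesses would wrap actual API clients):

```python
def run_eval(model_fn, cases):
    """Score a model against (prompt, grader) pairs; returns pass rate in [0, 1]."""
    passed = sum(1 for prompt, grade in cases if grade(model_fn(prompt)))
    return passed / len(cases)

# Hypothetical eval set: each case pairs a prompt with a programmatic grader.
cases = [
    ('Return the JSON {"ok": true}', lambda out: '"ok"' in out),
    ("Say exactly: DONE", lambda out: out.strip() == "DONE"),
]

# Stand-in "models" -- in practice these would call different API models.
old_model = lambda prompt: "DONE" if "DONE" in prompt else "{}"
new_model = lambda prompt: "DONE" if "DONE" in prompt else '{"ok": true}'

print(run_eval(old_model, cases), run_eval(new_model, cases))  # → 0.5 1.0
```

With a harness like this, “quickly test new models and adapt” becomes a one-line swap of `model_fn`, and a pass-rate jump on the same cases is exactly the fine-tuning signal described in the last bullet.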

Reinforcement Fine-Tuning (RFT)

A significant development is the RFT offering, which is data-efficient (requiring only hundreds of samples) and can push the frontier in specific domains [00:22:08]. It is particularly effective for:

  • Teaching an agent to pick a workflow or work through a decision process [00:22:37].
  • Applications in deep tech where organizations have verifiable data not available elsewhere, such as chip design or certain aspects of biology (e.g., drug discovery where outcomes are verifiable) [00:22:48].
  • Problems where no existing model in the market does what is needed [00:24:22].

RFT uses the same kind of reinforcement learning process that OpenAI uses internally to improve its models, making it more robust and less fragile than supervised fine-tuning (SFT) [00:23:32].
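The “verifiable data” requirement above comes down to a grader: a function that scores a model’s output programmatically, which the reinforcement process can then optimize against. A minimal sketch of what such a grader might look like (the last-token answer convention and the example outputs are assumptions for illustration, not OpenAI’s grader format):

```python
def grade_answer(model_output: str, expected: float, tol: float = 1e-3) -> float:
    """Verifiable grader: reward 1.0 if the final numeric answer is within
    tolerance of ground truth, else 0.0. RFT-style training optimizes the
    model against exactly this kind of programmatic signal."""
    try:
        # Convention (assumed here): the answer is the last token of the output.
        value = float(model_output.strip().split()[-1])
    except ValueError:
        return 0.0
    return 1.0 if abs(value - expected) <= tol else 0.0

print(grade_answer("The binding affinity is approximately 4.217", 4.217))  # → 1.0
print(grade_answer("I am not sure", 4.217))                                # → 0.0
```

This is why domains like chip design or drug discovery fit RFT well: the outcome can be checked by code or measurement, so hundreds of graded samples carry a clean training signal.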

Future Outlook

The future of AI agents is expected to see continued progress, with several key trends:

  • Generalization vs. Specialization [00:15:52]: OpenAI’s general philosophy leans towards making one “general” model (the “G” in AGI) that can handle diverse use cases [00:15:54]. While targeted models like GPT-4.1 for developers showed success by decoupling from ChatGPT to move faster and optimize for specific needs (e.g., upweighting coding data) [00:16:15], the expectation is to simplify the product offering to one general model [00:16:06]. Combining capabilities across domains generally leads to better results [00:26:05].
  • Increased Multimodality [00:20:36]: Models are becoming natively multimodal and easier to use [00:20:39]. Significant improvements in this area mean that many tasks that failed in previous models now work [00:21:02]. It’s advisable to connect models to as much task information as possible, even if results are “meh” today, as they will improve [00:21:09].
  • Enhanced Memory and Personality [00:39:03]: Future models, especially in the context of human communication, will increasingly leverage enhanced memory to adapt to user preferences and personalities [00:39:05]. Steerability features, like custom instructions, will allow users to tweak model personality [00:39:25].
  • Leveraging AI to Improve AI [00:32:04]: A key research area involves using models to improve other models, particularly in reinforcement learning, by using signals to determine if a model is on the right track [00:32:06]. Synthetic data has been an incredibly powerful trend in this regard [00:33:18].
  • Specific Agentic Capabilities [00:35:02]:
    • Coding Agents [00:35:07]: Given that current models already exceed human performance on some coding benchmarks like SWE-bench, coding agents are expected to arrive soon [00:35:27]. The ability to supervise long runs of code generation is already present [00:35:35].
    • Long Workflows (e.g., Customer Support) [00:35:40]: Models like o3 already integrate developer-specified tool calls into their chain of thought, allowing them to use previous tool outputs to reason further [00:35:44]. This capability makes agentic customer support and other long workflows feasible [00:36:00].
  • Unlocking Existing Value [00:36:31]: Even if model progress stopped now, there are potentially decades of building and value extraction possible from current capabilities, much as the internet continues to drive value [00:36:31]. Billion-dollar companies are still being built on models like GPT-3.5 Turbo [00:36:56].
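The long-workflow bullet above hinges on a loop in which tool outputs are fed back into the model’s context so it can reason further. A minimal sketch with a scripted stand-in for the model (the `lookup_order` tool, the message shapes, and the scripted behavior are all hypothetical; real systems would call a reasoning model’s API):

```python
import json

def lookup_order(order_id):
    """Hypothetical support tool: fetch order status from a backend."""
    return {"order_id": order_id, "status": "shipped"}

TOOLS = {"lookup_order": lookup_order}

def scripted_model(messages):
    """Stand-in for a reasoning model: request a tool, then answer from its output."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        return {"tool": "lookup_order", "args": {"order_id": "A123"}}
    result = json.loads(tool_msgs[-1]["content"])
    return {"final": f"Order {result['order_id']} is {result['status']}."}

def run_agent(user_msg, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        step = scripted_model(messages)
        if "final" in step:
            return step["final"]
        # Execute the requested tool and append its output for the next step,
        # so the model can use previous tool results to reason further.
        result = TOOLS[step["tool"]](**step["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return "gave up"

print(run_agent("Where is my order A123?"))  # → Order A123 is shipped.
```

Multi-step support workflows are just more iterations of this loop: each tool result lands in the transcript, and the model decides whether to call another tool or answer.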

OpenAI’s Approach

OpenAI’s “Power Users Research Team” focuses on understanding and improving models for discerning users, including developers [00:41:44]. This focus is strategic because what power users do today will become commonplace for median users a year from now [00:42:27].

The challenge for future models like GPT-5 is combining diverse capabilities—such as being a delightful conversationalist (like GPT-4) and a rigorous problem-solver (like o3)—without sacrificing performance in either area [00:37:25]. This involves striking the right balance when tailoring the model’s training data, since optimizing for one aspect (e.g., coding) might mean downweighting others (e.g., chat data) [00:38:15].