From: aidotengineer

The year 2025 marks a “perfect storm” for AI agents, driven by advances in reasoning models such as OpenAI’s o1 and o3, DeepSeek’s R1, and xAI’s latest Grok models, which are beginning to outperform humans on certain tasks and to showcase new capabilities [03:38:00]. This progress is further fueled by increased test-time compute, better engineering and hardware optimizations, cheaper inference and hardware, a closing gap between open-source and closed-source models, and billions of dollars in infrastructure investment [04:04:00]. Despite this momentum, AI agents are not yet consistently working as intended [05:00:00].

An AI agent, for the purpose of this discussion, is defined as a fully autonomous system where large language models (LLMs) direct their own actions [05:12:00].

Challenges in AI Agent Development

While much attention goes to hallucinations and fabrications in AI models, a significant challenge comes from tiny errors that accumulate [06:52:00]. These errors compound rapidly, producing large performance gaps over multiple steps [08:51:00]. For example, an agent with 99% per-step accuracy drops to roughly 60% end-to-end accuracy after 50 consecutive steps because the errors compound [09:07:00].
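
The arithmetic behind that figure is simple compounding: if each step succeeds independently with probability p, the whole n-step chain succeeds with probability p^n, so 0.99^50 ≈ 0.605. A minimal check:

```python
# If each step succeeds independently with probability p,
# the whole n-step chain succeeds with probability p ** n.
p, n = 0.99, 50
print(f"Per-step accuracy: {p:.0%}; end-to-end accuracy after {n} steps: {p ** n:.1%}")
# -> Per-step accuracy: 99%; end-to-end accuracy after 50 steps: 60.5%
```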

Key types of cumulative errors include:

  • Decision Error: Choosing the wrong fact or overthinking/exaggerating [07:10:00]. An example is booking a flight to San Francisco, Peru instead of San Francisco, California [07:13:00].
  • Implementation Error: Incorrect access or integration, such as being locked out of a critical database or encountering a CAPTCHA [07:26:00].
  • Heuristic Error: Applying the wrong criteria or failing to acknowledge best practices, like not accounting for rush hour traffic when booking a flight [07:44:00].
  • Taste Error: Misinterpreting personal preferences, such as booking a flight on an airline or aircraft type the user dislikes [08:03:00].

Additionally, there is a “Perfection Paradox”: users become frustrated when AI works at merely human speed or shows minor inconsistencies, despite its otherwise magical capabilities [08:22:00]. Even if an agent gets things right at first, inconsistency and unreliability can leave users underwhelmed [08:38:00].

Best Practices for Building AI Agents

To address these challenges and help agents make the right decisions consistently and reliably, several practices have emerged for deploying and optimizing AI agents:

Data Curation

Ensuring an AI agent has the necessary, high-quality information is crucial [10:04:00]. Data is often messy, unstructured, and siloed, encompassing not just web and text but also design, image, video, audio, sensor, and even real-time agent-generated data [10:11:00]. Key practices include:

  • Curating proprietary data: Utilizing unique datasets [10:32:00].
  • Managing agent-generated data: Data produced by the AI agent itself [10:35:00].
  • Quality control: Applying quality checks to the data used throughout the model workflow [10:37:00].
  • Designing an agent data flywheel: Building systems where every user interaction improves the product automatically, in real time, and at scale, as sketched below [10:49:00]. This allows for continuous adaptation to user preferences [11:10:00].
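
A rough illustration of the flywheel idea: log every interaction together with user feedback, then periodically promote well-rated examples into an evaluation or fine-tuning set. The file paths and function names below are illustrative assumptions, not APIs from the talk.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("interactions.jsonl")   # hypothetical log of raw interactions
EVAL_SET_PATH = Path("eval_set.jsonl")  # hypothetical curated eval/fine-tuning set

def log_interaction(prompt: str, response: str, user_rating: int) -> None:
    """Record one user interaction (rating: 1 = thumbs down ... 5 = thumbs up)."""
    record = {"ts": time.time(), "prompt": prompt, "response": response, "rating": user_rating}
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

def promote_good_examples(min_rating: int = 4) -> int:
    """Copy highly rated interactions into the eval/fine-tuning set; return how many."""
    promoted = 0
    if not LOG_PATH.exists():
        return promoted
    with LOG_PATH.open() as src, EVAL_SET_PATH.open("a") as dst:
        for line in src:
            record = json.loads(line)
            if record["rating"] >= min_rating:
                dst.write(json.dumps({"input": record["prompt"], "expected": record["response"]}) + "\n")
                promoted += 1
    return promoted
```

In a real system these two steps would run continuously, feeding automated evals or periodic fine-tuning so that each interaction improves the product.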

The Importance of Evals

Evaluating how a model responds and choosing the correct answer is fundamental [11:22:00]. While evaluation is straightforward in verifiable domains like math and science, it is much harder for non-verifiable systems, where clear yes/no answers are absent [11:33:00]. Key practices include:

  • Collecting human preferences: Actively gathering signals on what users prefer, as in the sketch after this list [12:27:00].
  • Personalized evaluations: Building evals that are truly personal, sometimes relying on “vibes” and direct user experience rather than just numbers or leaderboards [12:29:00].
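
A minimal sketch of collecting human preferences for a non-verifiable task: show two candidate responses, record which one the user prefers, and track per-variant win rates. The variant names and functions are illustrative assumptions.

```python
from collections import defaultdict

wins = defaultdict(int)         # how often each variant was preferred
comparisons = defaultdict(int)  # how often each variant appeared in a comparison

def record_preference(variant_a: str, variant_b: str, preferred: str) -> None:
    """Record a single human judgment between two response variants."""
    assert preferred in (variant_a, variant_b)
    comparisons[variant_a] += 1
    comparisons[variant_b] += 1
    wins[preferred] += 1

def win_rate(variant: str) -> float:
    """Fraction of comparisons this variant has won so far."""
    return wins[variant] / comparisons[variant] if comparisons[variant] else 0.0

# Example: a user preferred the "concise" response over the "detailed" one.
record_preference("concise", "detailed", preferred="concise")
print(f"concise win rate: {win_rate('concise'):.0%}")  # -> 100%
```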

Scaffolding Systems

To prevent one error from cascading throughout an agentic system and across production infrastructure, scaffolding is essential [12:45:00].

  • Mitigating cascading effects: Building a complex compound system that ensures errors don’t spread [13:06:00].
  • Human-in-the-loop: Incorporating human intervention for reasoning models, allowing for checkpoints to verify decisions or steer the agent back on track [13:12:00].
  • Self-healing agents: Designing agents that can realize they are wrong and attempt to correct their own path, or pause execution when unsure [13:18:00].
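
One way such scaffolding might look in code: validate each step, retry when the agent itself is unsure (self-healing), and escalate to a human checkpoint instead of letting a weak step cascade. All names, thresholds, and stubs below are illustrative assumptions, not from the talk.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    output: str
    confidence: float  # 0.0-1.0, as estimated by the model or a verifier

def run_step(step: str) -> StepResult:
    # Stand-in for the real model/tool call; returns a dummy result here.
    return StepResult(output=f"result of {step}", confidence=0.95)

def ask_human(step: str, result: StepResult) -> StepResult:
    # Human-in-the-loop checkpoint: a person verifies or corrects the step.
    print(f"Please review step '{step}': {result.output}")
    return result

def run_plan(steps, confidence_threshold: float = 0.8, max_retries: int = 2):
    history = []
    for step in steps:
        result = run_step(step)
        retries = 0
        # Self-healing: retry while the agent is unsure about its own output.
        while result.confidence < confidence_threshold and retries < max_retries:
            result = run_step(step)
            retries += 1
        # Checkpoint: pause and escalate rather than letting a bad step cascade.
        if result.confidence < confidence_threshold:
            result = ask_human(step, result)
        history.append((step, result))
    return history

run_plan(["find flights", "compare prices", "book the best option"])
```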

User Experience (UX)

UX is paramount because many AI applications are built on the same underlying foundation models [13:51:00]. It is the key differentiator for companies that reimagine product experiences, deeply understand user workflows, and foster elegant human-machine collaboration [13:59:00]. Key practices include:

  • Asking clarifying questions: Ensuring the AI fully understands the user’s intent before acting, as in the sketch after this list [14:13:00].
  • Predicting user next steps: Understanding user psychology to anticipate needs [14:20:00].
  • Seamless integration: Creating real return on investment (ROI) by integrating with legacy systems [14:29:00].

Companies with proprietary data sources and deep knowledge of specific user workflows, such as in robotics, hardware, defense, manufacturing, and life sciences, are well positioned to create magical user experiences [14:55:00].
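
As a concrete illustration of the clarifying-questions point above, the sketch below checks a flight-booking request for missing or ambiguous details before acting; the fields, city list, and wording are illustrative assumptions.

```python
REQUIRED_FIELDS = ["origin", "destination", "date"]
AMBIGUOUS_CITIES = {"San Francisco": ["San Francisco, CA, USA", "San Francisco, Peru"]}

def clarifying_questions(request: dict) -> list[str]:
    """Return the questions to ask before executing the booking."""
    questions = []
    for field in REQUIRED_FIELDS:
        if not request.get(field):
            questions.append(f"What is the {field} for this trip?")
    destination = request.get("destination")
    if destination in AMBIGUOUS_CITIES:
        options = " or ".join(AMBIGUOUS_CITIES[destination])
        questions.append(f"Did you mean {options}?")
    return questions

request = {"origin": "New York", "destination": "San Francisco", "date": None}
print(clarifying_questions(request))
# -> ['What is the date for this trip?',
#     'Did you mean San Francisco, CA, USA or San Francisco, Peru?']
```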

Building Multimodally

Moving beyond simple chatbot interfaces, multimodality allows for reimagined, 10x personalized user experiences [15:22:00]. The goal is to make AI more human by adding “eyes and ears, nose, a voice” [15:40:00].

  • Incorporating diverse senses: Significant improvements have been seen in voice, and there is exploration into digitizing the sense of smell and instilling a sense of touch and embodiment through robotics [15:50:00].
  • Personalized memories: Developing AI that truly knows the user on a deeper, more personal level [16:07:00]. This approach redefines “perfection” for a human, where the visionary nature of the product can exceed expectations even if the agent is occasionally inconsistent or unreliable [16:18:00]. Examples include reimagining visual canvases and combining multiple AI models seamlessly in the background [16:28:00].
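
As a rough sketch of “combining multiple AI models seamlessly in the background”, a thin router can dispatch each input to a modality-appropriate model; the model identifiers below are placeholders, not recommendations from the talk.

```python
# Hypothetical modality router: the model names are placeholders.
MODEL_BY_MODALITY = {
    "text": "some-text-model",
    "image": "some-vision-model",
    "audio": "some-speech-model",
}

def route(input_modality: str) -> str:
    """Pick a backing model for the given modality, defaulting to text."""
    return MODEL_BY_MODALITY.get(input_modality, MODEL_BY_MODALITY["text"])

print(route("audio"))  # -> some-speech-model
```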

In summary, while AI agents are poised for significant impact, achieving widespread reliability requires addressing cumulative errors through meticulous data curation, robust evaluations, resilient scaffolding systems, and a strong focus on innovative, multimodal user experiences [16:51:00].