From: aidotengineer
Diamond, from DataDog, shared insights into the development and application of AI agents designed to automate problem-solving in DevOps environments. DataDog’s initiative, “Bits AI,” aims to create an AI assistant that helps users manage their DevOps challenges [00:01:04].
DataDog’s Evolution with AI
DataDog, an observability and security platform for cloud applications, has been incorporating AI since around 2015 [00:02:03]. While not always overtly presented as “AI products,” these capabilities include proactive alerting, root cause analysis, impact analysis, and change tracking [00:01:56].
The current landscape represents a clear era shift, comparable to the advent of microprocessors or the transition to SaaS, with the emergence of larger, smarter, reasoning, and multimodal models [00:02:09]. This shift signifies a future where intelligence becomes “too cheap to meter” [00:02:24]. DataDog is actively working to leverage these advancements by moving up the stack and enabling AI agents to use their platform directly for customers [00:02:53].
Key AI Agents at DataDog
DataDog is currently developing several AI agents in private beta, with a focus on automating workflows and problem-solving [00:03:00] [00:03:22].
AI On-Call Engineer
This agent is designed to help engineers avoid being paged in the middle of the night [00:03:36]. When an alert occurs, the AI On-Call Engineer proactively initiates an investigation [00:04:04]. It performs several key actions:
- Situational Orientation: Reads run books and gathers alert context [00:04:07].
- Data Analysis: Examines logs, metrics, and traces [00:04:16].
- Hypothesis Testing: Formulates hypotheses about the issue, tests them using tools, and validates or invalidates each [00:05:36].
- Remediation Suggestion: If a root cause is found, it can suggest remediations, such as paging another team or scaling infrastructure [00:05:53].
- Workflow Integration: Can tie into existing DataDog workflows for remediation [00:06:16].
- Postmortem Generation: After an incident is resolved, the agent can write a postmortem documenting what occurred and the actions taken by both humans and the agent [00:06:31].
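The investigation flow above can be sketched as a hypothesis-testing loop. Everything below is illustrative: the tool names, data shapes, and control flow are assumptions for exposition, not DataDog’s actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Hypothesis:
    description: str
    validated: Optional[bool] = None   # None = not yet tested
    evidence: List[str] = field(default_factory=list)

def investigate(alert: str, tools: Dict[str, Callable], propose_hypotheses: Callable):
    """Hypothetical on-call loop: orient, gather data, test hypotheses."""
    context = {"alert": alert}
    context["runbook"] = tools["read_runbook"](alert)        # situational orientation
    context["telemetry"] = tools["query_telemetry"](alert)   # logs, metrics, traces
    findings = []
    for hyp in propose_hypotheses(context):                  # hypothesis testing
        hyp.evidence = tools["test_hypothesis"](hyp, context)
        hyp.validated = bool(hyp.evidence)
        findings.append(hyp)
        if hyp.validated:                                    # root cause found
            return {"root_cause": hyp,
                    "suggestion": tools["suggest_remediation"](hyp)}
    return {"root_cause": None, "findings": findings}        # escalate to a human
```

Keeping the per-hypothesis evidence around is what lets a human verify the agent’s reasoning after the fact.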
This agent also facilitates human-AI collaboration, allowing users to verify its actions and understand its reasoning, which helps build trust [00:04:49] [00:05:05].
AI Software Engineer
This agent acts as a proactive developer, observing and acting on errors [00:06:55]. It automatically analyzes errors, identifies causes, and proposes solutions, including generating code fixes and creating tests to prevent recurrence [00:07:07] [00:07:28]. This significantly reduces the time engineers spend manually writing and testing code [00:07:38].
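A minimal sketch of that error-to-fix pipeline, assuming a generic `llm` callable that takes a prompt and returns text; the prompt wording and the `ProposedFix` structure are hypothetical, not DataDog’s implementation.

```python
from dataclasses import dataclass

@dataclass
class ProposedFix:
    diagnosis: str
    patch: str
    regression_test: str

def propose_fix(error_report: dict, llm) -> ProposedFix:
    """Hypothetical pipeline: analyze an error, draft a patch and a test.

    `llm` is any callable taking a prompt string and returning text.
    """
    diagnosis = llm(f"Explain the likely cause of: {error_report['message']}\n"
                    f"Stack trace: {error_report['stack']}")
    patch = llm(f"Given this diagnosis, propose a minimal code fix:\n{diagnosis}")
    test = llm(f"Write a test that fails before this patch and passes after:\n{patch}")
    return ProposedFix(diagnosis=diagnosis, patch=patch, regression_test=test)
```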
Lessons Learned in Building Effective AI Agents
Developing these agents has provided DataDog with several key learnings [00:07:47].
Scoping Tasks for Evaluation
It is easy to build quick demos, but much harder to properly scope tasks and evaluate what the agent is actually doing [00:08:03].
- Define “Jobs to Be Done”: Clearly understand tasks step-by-step from a human perspective [00:08:36].
- Vertical, Task-Specific Agents: Build agents for specific tasks rather than generalized agents [00:08:48].
- Measurable and Verifiable: Ensure tasks are measurable and verifiable at each step, as this is a significant challenge [00:08:52].
- Domain Experts as Design Partners: Utilize domain experts for evaluation and verification, rather than for writing code or rules [00:09:10].
- Deeply Consider Evaluation: Start by thinking deeply about evaluation, including offline, online, and living evaluations with end-to-end measurements [00:09:34].
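One way to make the “measurable and verifiable at each step” point concrete is a small harness that scores an agent run per step rather than only end to end. The step names and verifier shapes below are illustrative assumptions, not a real evaluation framework.

```python
def evaluate_run(steps, checks):
    """Score an agent run step by step.

    `steps` maps step name -> the agent's output for that step;
    `checks` maps step name -> a verifier returning True/False.
    Returns per-step results plus an aggregate pass rate.
    """
    results = {name: bool(check(steps.get(name))) for name, check in checks.items()}
    return {
        "per_step": results,                                  # where exactly it failed
        "pass_rate": sum(results.values()) / len(results),    # offline metric
        "end_to_end": all(results.values()),                  # the headline number
    }
```

Recording per-step results is what lets a domain expert point at the exact step that went wrong, instead of only seeing an end-to-end failure.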
Building the Right Team
A successful team doesn’t require a large number of ML experts, who are scarce [00:10:15]. Instead, it should be seeded with one or two ML experts and augmented with optimistic generalists who are skilled at coding and willing to iterate quickly [00:10:18]. UX and front-end expertise are also critically important for effective collaboration with agents [00:10:28]. Team members should be excited to be AI-augmented themselves, as the field is rapidly changing [00:10:41].
Adapting User Experience (UX)
Traditional UX patterns are changing, and developers must be comfortable with that evolution [00:11:04]. In this early space, UX is crucial for collaboration [00:11:18]. DataDog favors agents that function more like human teammates rather than requiring new pages or buttons [00:11:28].
The Importance of Observability
Observability is critical and should not be an afterthought, especially with complex AI agent workflows [00:11:36]. Situational awareness is necessary to debug problems [00:11:44]. DataDog has introduced “LLM Observability” within its product to provide a single pane of glass for monitoring model interactions, whether models are hosted, self-run, or accessed via API [00:11:50].
Agent workflows can become very complex, involving hundreds of multi-step calls and tool-use decisions [00:12:26]. To address this, DataDog provides an “agent graph” view, which makes complex workflows human-readable, highlighting errors within the process [00:12:47].
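A toy version of such tracing: record one span per step or tool call, then render the tree with failures flagged. This is an illustrative sketch, not DataDog’s LLM Observability or agent-graph API.

```python
import time
from contextlib import contextmanager

class AgentTrace:
    """Minimal trace recorder for multi-step agent workflows."""

    def __init__(self):
        self.spans = []     # spans in start order, for readable rendering
        self._stack = []    # currently open spans, for nesting depth

    @contextmanager
    def span(self, name):
        record = {"name": name, "depth": len(self._stack), "error": None}
        self.spans.append(record)
        self._stack.append(record)
        start = time.monotonic()
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)   # flag the failing step
            raise
        finally:
            record["duration_s"] = time.monotonic() - start
            self._stack.pop()

    def render(self) -> str:
        """Indented, human-readable view of the workflow, errors highlighted."""
        lines = []
        for s in self.spans:
            marker = "ERROR " if s["error"] else ""
            lines.append(f'{"  " * s["depth"]}{marker}{s["name"]} ({s["duration_s"]:.3f}s)')
        return "\n".join(lines)
```

Running an investigation inside nested `span(...)` blocks yields an indented trace in which the failing call stands out, which is the essence of the agent-graph view described above.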
The “Bitter Lesson” of Application Layer
General methods that can leverage new, off-the-shelf models are ultimately the most effective [00:13:19]. While fine-tuning for specific tasks might seem beneficial, newly released foundation models often subsume that work by solving much of the reasoning out of the box [00:13:29]. It’s crucial to be able to try out any new model easily and not be tied to a particular one [00:13:45]. This aligns with the “rising tide lifts all boats” concept [00:13:52].
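One common way to stay model-agnostic is a thin registry that routes prompts to interchangeable backends, so adopting a newly released model is a one-line registration rather than a rewrite. The function and model names here are assumptions for illustration.

```python
from typing import Callable, Dict

ModelFn = Callable[[str], str]   # prompt in, completion out
_MODELS: Dict[str, ModelFn] = {}

def register_model(name: str, fn: ModelFn) -> None:
    """Make a backend available under a stable name."""
    _MODELS[name] = fn

def complete(prompt: str, model: str) -> str:
    """Route a prompt to whichever backend is configured by name."""
    try:
        return _MODELS[model](prompt)
    except KeyError:
        raise ValueError(f"unknown model {model!r}; registered: {sorted(_MODELS)}")
```

Because callers only ever name a backend, swapping in a stronger off-the-shelf model touches the registry, not the agent code.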
Future Outlook for AI in Problem Solving
Diamond anticipates that AI agents will surpass humans as users of SaaS products within the next five years [00:14:07]. This means companies should consider building for agents as users, providing the context and API information that agents will rely on more than human users do [00:14:21].
DataDog plans to offer a team of DevSecOps agents for hire, capable of handling on-call duties and platform integrations directly for customers [00:14:56]. They also envision AI agents themselves becoming customers, using platforms like DataDog just as humans would [00:15:10].
The future of small companies may involve leveraging automated developers like Cursor or Devin to bring ideas to life, with agents like DataDog’s handling operations and security, enabling a significantly greater number of ideas to reach the real world [00:15:25].
DataDog is actively hiring AI engineers and individuals passionate about this evolving space [00:15:51].