From: aidotengineer
Diamond, from DataDog, shared insights into the development and application of AI agents designed to automate problem-solving in DevOps environments. DataDog’s initiative, “Bits AI,” aims to create an AI assistant that helps users manage their DevOps challenges [00:01:04].
DataDog’s Evolution with AI
DataDog, an observability and security platform for cloud applications, has been incorporating AI since around 2015 [00:02:03]. While not always overtly presented as “AI products,” these capabilities include proactive alerting, root cause analysis, impact analysis, and change tracking [00:01:56].
The current landscape represents a clear era shift, comparable to the advent of microprocessors or the transition to SaaS, with the emergence of larger, smarter, reasoning, and multimodal models [00:02:09]. This shift signifies a future where intelligence becomes “too cheap to meter” [00:02:24]. DataDog is actively working to leverage these advancements by moving up the stack and enabling AI agents to use their platform directly for customers [00:02:53].
Key AI Agents at DataDog
DataDog is currently developing several AI agents in private beta, with a focus on automating workflows and problem-solving [00:03:00] [00:03:22].
AI On-Call Engineer
This agent is designed to help engineers avoid being paged in the middle of the night [00:03:36]. When an alert occurs, the AI On-Call Engineer proactively initiates an investigation [00:04:04]. It performs several key actions:
- Situational Orientation: Reads run books and gathers alert context [00:04:07].
- Data Analysis: Examines logs, metrics, and traces [00:04:16].
- Hypothesis Testing: Formulates hypotheses about the issue, tests them using tools, and validates or invalidates each [00:05:36].
- Remediation Suggestion: If a root cause is found, it can suggest remediations, such as paging another team or scaling infrastructure [00:05:53].
- Workflow Integration: Can tie into existing DataDog workflows for remediation [00:06:16].
- Postmortem Generation: After an incident is resolved, the agent can write a postmortem documenting what occurred and the actions taken by both humans and the agent [00:06:31].
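The investigation flow above can be sketched as a hypothesis-testing loop. Everything below is illustrative: the tool names, data shapes, and control flow are assumptions for exposition, not DataDog’s actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Hypothesis:
    description: str
    validated: Optional[bool] = None   # None = not yet tested
    evidence: List[str] = field(default_factory=list)

def investigate(alert: str, tools: Dict[str, Callable], propose_hypotheses: Callable):
    """Hypothetical on-call loop: orient, gather data, test hypotheses."""
    context = {"alert": alert}
    context["runbook"] = tools["read_runbook"](alert)        # situational orientation
    context["telemetry"] = tools["query_telemetry"](alert)   # logs, metrics, traces
    findings = []
    for hyp in propose_hypotheses(context):                  # hypothesis testing
        hyp.evidence = tools["test_hypothesis"](hyp, context)
        hyp.validated = bool(hyp.evidence)
        findings.append(hyp)
        if hyp.validated:                                    # root cause found
            return {"root_cause": hyp,
                    "suggestion": tools["suggest_remediation"](hyp)}
    return {"root_cause": None, "findings": findings}        # escalate to a human
```

Keeping the per-hypothesis evidence around is what lets a human verify the agent’s reasoning after the fact.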
This agent also facilitates human-AI collaboration, allowing users to verify its actions and understand its reasoning, which helps build trust [00:04:49] [00:05:05].
AI Software Engineer
This agent acts as a proactive developer, observing and acting on errors [00:06:55]. It automatically analyzes errors, identifies causes, and proposes solutions, including generating code fixes and creating tests to prevent recurrence [00:07:07] [00:07:28]. This significantly reduces the time engineers spend manually writing and testing code [00:07:38].
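A minimal sketch of that error-to-fix pipeline, assuming a generic `llm` callable that takes a prompt and returns text; the prompt wording and the `ProposedFix` structure are hypothetical, not DataDog’s implementation.

```python
from dataclasses import dataclass

@dataclass
class ProposedFix:
    diagnosis: str
    patch: str
    regression_test: str

def propose_fix(error_report: dict, llm) -> ProposedFix:
    """Hypothetical pipeline: analyze an error, draft a patch and a test.

    `llm` is any callable taking a prompt string and returning text.
    """
    diagnosis = llm(f"Explain the likely cause of: {error_report['message']}\n"
                    f"Stack trace: {error_report['stack']}")
    patch = llm(f"Given this diagnosis, propose a minimal code fix:\n{diagnosis}")
    test = llm(f"Write a test that fails before this patch and passes after:\n{patch}")
    return ProposedFix(diagnosis=diagnosis, patch=patch, regression_test=test)
```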
Lessons Learned in Building Effective AI Agents
Developing these agents has provided DataDog with several key learnings [00:07:47].
Scoping Tasks for Evaluation
It is easy to build quick demos, but much harder to properly scope tasks and evaluate what the agent is actually doing [00:08:03].
- Define “Jobs to Be Done”: Clearly understand tasks step-by-step from a human perspective [00:08:36].
- Vertical, Task-Specific Agents: Build agents for specific tasks rather than generalized agents [00:08:48].
- Measurable and Verifiable: Ensure tasks are measurable and verifiable at each step, as this is a significant challenge [00:08:52].
- Domain Experts as Design Partners: Utilize domain experts for evaluation and verification, rather than for writing code or rules [00:09:10].
- Deeply Consider Evaluation: Start by thinking deeply about evaluation, including offline, online, and living evaluations with end-to-end measurements [00:09:34].
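One way to make the “measurable and verifiable at each step” point concrete is a small harness that scores an agent run per step rather than only end to end. The step names and verifier shapes below are illustrative assumptions, not a real evaluation framework.

```python
def evaluate_run(steps, checks):
    """Score an agent run step by step.

    `steps` maps step name -> the agent's output for that step;
    `checks` maps step name -> a verifier returning True/False.
    Returns per-step results plus an aggregate pass rate.
    """
    results = {name: bool(check(steps.get(name))) for name, check in checks.items()}
    return {
        "per_step": results,                                  # where exactly it failed
        "pass_rate": sum(results.values()) / len(results),    # offline metric
        "end_to_end": all(results.values()),                  # the headline number
    }
```

Recording per-step results is what lets a domain expert point at the exact step that went wrong, instead of only seeing an end-to-end failure.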
Building the Right Team
A successful team doesn’t require a large number of ML experts, who are scarce [00:10:15]. Instead, it should be seeded with one or two ML experts and augmented with optimistic generalists who are skilled at coding and willing to iterate quickly [00:10:18]. UX and front-end expertise are also critically important for effective collaboration with agents [00:10:28]. Team members should be excited to be AI-augmented themselves, as the field is rapidly changing [00:10:41].
Adapting User Experience (UX)
Traditional UX patterns are changing, and developers must be comfortable with that evolution [00:11:04]. In this early space, UX is crucial for collaboration [00:11:18]. DataDog favors agents that function more like human teammates rather than requiring new pages or buttons [00:11:28].
The Importance of Observability
Observability is critical and should not be an afterthought, especially with complex AI agent workflows [00:11:36]. Situational awareness is necessary to debug problems [00:11:44]. DataDog has introduced “LLM Observability” within its product to provide a single pane of glass for monitoring model interactions, whether models are hosted, self-run, or accessed via API [00:11:50].
Agent workflows can become very complex, involving hundreds of multi-step calls and tool-use decisions [00:12:26]. To address this, DataDog provides an “agent graph” view, which makes complex workflows human-readable, highlighting errors within the process [00:12:47].
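A toy version of such tracing: record one span per step or tool call, then render the tree with failures flagged. This is an illustrative sketch, not DataDog’s LLM Observability or agent-graph API.

```python
import time
from contextlib import contextmanager

class AgentTrace:
    """Minimal trace recorder for multi-step agent workflows."""

    def __init__(self):
        self.spans = []     # spans in start order, for readable rendering
        self._stack = []    # currently open spans, for nesting depth

    @contextmanager
    def span(self, name):
        record = {"name": name, "depth": len(self._stack), "error": None}
        self.spans.append(record)
        self._stack.append(record)
        start = time.monotonic()
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)   # flag the failing step
            raise
        finally:
            record["duration_s"] = time.monotonic() - start
            self._stack.pop()

    def render(self) -> str:
        """Indented, human-readable view of the workflow, errors highlighted."""
        lines = []
        for s in self.spans:
            marker = "ERROR " if s["error"] else ""
            lines.append(f'{"  " * s["depth"]}{marker}{s["name"]} ({s["duration_s"]:.3f}s)')
        return "\n".join(lines)
```

Running an investigation inside nested `span(...)` blocks yields an indented trace in which the failing call stands out, which is the essence of the agent-graph view described above.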
The “Bitter Lesson” of Application Layer
General methods that can leverage new, off-the-shelf models are ultimately the most effective [00:13:19]. While fine-tuning for specific tasks might seem beneficial, newly released foundation models often subsume that work by solving much of the reasoning out of the box [00:13:29]. It’s crucial to be able to try out any new model easily and not be tied to a particular one [00:13:45]. This aligns with the “rising tide lifts all boats” concept [00:13:52].
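One common way to stay model-agnostic is a thin registry that routes prompts to interchangeable backends, so adopting a newly released model is a one-line registration rather than a rewrite. The function and model names here are assumptions for illustration.

```python
from typing import Callable, Dict

ModelFn = Callable[[str], str]   # prompt in, completion out
_MODELS: Dict[str, ModelFn] = {}

def register_model(name: str, fn: ModelFn) -> None:
    """Make a backend available under a stable name."""
    _MODELS[name] = fn

def complete(prompt: str, model: str) -> str:
    """Route a prompt to whichever backend is configured by name."""
    try:
        return _MODELS[model](prompt)
    except KeyError:
        raise ValueError(f"unknown model {model!r}; registered: {sorted(_MODELS)}")
```

Because callers only ever name a backend, swapping in a stronger off-the-shelf model touches the registry, not the agent code.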
Future Outlook for AI in Problem Solving
Diamond anticipates that AI agents will surpass humans as users of SaaS products within the next five years [00:14:07]. This means companies should consider building for agents as users, providing the context and API information that agents will rely on more than human users do [00:14:21].
DataDog plans to offer a team of DevSecOps agents for hire, capable of handling on-call duties and platform integrations directly for customers [00:14:56]. They also envision AI agents themselves becoming customers, using platforms like DataDog just as humans would [00:15:10].
The future of small companies may involve leveraging automated developers like Cursor or Devin to bring ideas to life, with agents like DataDog’s handling operations and security, enabling a significantly greater number of ideas to reach the real world [00:15:25].
DataDog is actively hiring AI engineers and individuals passionate about this evolving space [00:15:51].