From: aidotengineer

Datadog is an observability and security platform tailored for cloud applications [00:01:22]. Its core function is to let users observe system behavior and act on it, making systems easier to understand, safer, and more DevOps-friendly [00:01:30].

Historical Context of AI in Observability at Datadog

Datadog has been integrating AI into its offerings since around 2015 [00:02:03]. While not always overtly presented as “AI products,” these integrations include features like proactive alerting [00:01:56], root cause analysis [00:01:58], impact analysis [00:02:01], and change tracking [00:02:02].

The current era marks a significant shift, comparable to the advent of the microprocessor or the transition to SaaS, with larger, smarter models, enhanced reasoning capabilities, and multimodal AI [00:02:09]. This shift leads to “intelligence becoming too cheap to meter,” causing products like Cursor to grow rapidly and increasing user expectations for AI [00:02:24]. Datadog is adapting to this by moving up the stack, aiming for AI agents to utilize the platform on behalf of users, rather than users interacting with the platform directly [00:02:53].

AI Agents and Observability

Datadog is bringing AI agents into DevOps workflows and is currently developing several of them, including:

  • AI Software Engineer: This agent identifies errors, recommends fixes, and generates the corresponding code changes to improve systems and reduce on-call incidents [00:07:07].
  • AI On-Call Engineer: Designed to proactively respond to alerts, this agent can situationally orient itself by reading runbooks and gathering context [00:04:04]. It then investigates issues by analyzing logs, metrics, and traces, similar to a human engineer [00:04:16]. It can also generate post-mortems after an incident is remediated [00:06:36].

These agents function by forming hypotheses, testing them with tools, running queries against observability data (logs, metrics, traces), and validating or invalidating these hypotheses [00:05:36].
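
The following is a minimal sketch of that hypothesis-driven loop, written under stated assumptions: the `llm` client (with `propose_hypotheses`, `plan_query`, and `judge` methods) and the `tools` dictionary are hypothetical names for illustration, not Datadog's actual implementation.

```python
# Illustrative hypothesis-driven investigation loop.
# The llm client and tool interface below are hypothetical, not Datadog's code.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str                      # e.g. "latency spike caused by DB connection pool exhaustion"
    evidence: list = field(default_factory=list)
    confirmed: bool | None = None       # None = not yet tested

def investigate(alert, llm, tools, max_steps=10):
    """Form hypotheses about an alert, then test each one against observability data."""
    hypotheses = [Hypothesis(s) for s in llm.propose_hypotheses(alert)]
    for _ in range(max_steps):
        open_hyps = [h for h in hypotheses if h.confirmed is None]
        if not open_hyps:
            break
        hyp = open_hyps[0]
        # The model decides which tool (logs, metrics, traces) to call and with what query.
        tool_name, query = llm.plan_query(hyp, alert)
        result = tools[tool_name](query)        # e.g. tools["query_logs"]("service:checkout status:error")
        hyp.evidence.append(result)
        hyp.confirmed = llm.judge(hyp, result)  # True, False, or None (needs more data)
    return [h for h in hypotheses if h.confirmed]
```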

The Critical Role of Observability

Observability is paramount in developing and deploying complex AI agents, as it provides the necessary situational awareness for debugging problems [00:11:43]. It should not be an afterthought [00:11:41].

LLM Observability

Datadog has introduced “LLM observability” as a new view within its product [00:11:50]. This feature is particularly useful because AI agents involve a wide array of interactions and calls to various models—whether self-hosted, run locally, or accessed via API [00:12:07]. LLM observability consolidates all these interactions into a single view for debugging [00:12:15].

However, workflows involving AI agents can become very complex, with multi-step calls and hundreds of decisions about tools and loops [00:12:28]. To address this complexity, Datadog provides an “agent graph” [00:12:52]. This human-readable visual representation of the agent’s workflow makes it easier to identify issues, such as errors highlighted with a bright red node [00:12:56].
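
As a rough illustration of the agent-graph idea, a multi-step agent trace can be treated as a directed graph of spans, with failed steps surfaced for inspection. The span format here is an assumption made for the sketch, not Datadog's trace schema.

```python
# Build a simple tree of agent steps from spans and flag error nodes.
# The span structure is hypothetical, not Datadog's actual schema.
from collections import defaultdict

spans = [
    {"id": "root", "parent": None,   "name": "on_call_agent", "error": False},
    {"id": "s1",   "parent": "root", "name": "read_runbook",  "error": False},
    {"id": "s2",   "parent": "root", "name": "query_logs",    "error": False},
    {"id": "s3",   "parent": "s2",   "name": "llm_summarize", "error": True},
]

children = defaultdict(list)
for span in spans:
    if span["parent"] is not None:
        children[span["parent"]].append(span["id"])

by_id = {s["id"]: s for s in spans}

def render(span_id, depth=0):
    span = by_id[span_id]
    marker = "  !! ERROR" if span["error"] else ""   # the "bright red node"
    print("  " * depth + span["name"] + marker)
    for child in children[span_id]:
        render(child, depth + 1)

render("root")
```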

Lessons Learned in Developing AI Agents

Developing these agents has yielded several key lessons, particularly around evaluating AI agents and the importance of observability:

  • Scoping Tasks for Evaluation: It’s crucial to define “jobs to be done” clearly and in a step-by-step manner, approaching it from a human perspective first [00:08:36]. Building vertical, task-specific agents is preferred over generalized ones [00:08:48]. The work should be measurable and verifiable at each step, which is a significant challenge for agents [00:08:54].
  • Evaluation is Paramount: Deeply considering evaluation from the outset is essential [00:09:35]. Demos are easy to build, but verifying and improving performance over time in a “fuzzy, stochastic world” requires robust offline, online, and “living” evaluations, including end-to-end measurements (see the sketch after this list) [00:09:48].
  • Team Building: A small number of ML experts, supplemented by optimistic generalists who can code quickly and are willing to experiment, form an effective team [00:10:15].
  • Evolving UX: User experience paradigms are shifting, and being comfortable with those shifts is important [00:11:24]. Rather than adding many new pages or buttons, agents should behave more like human teammates [00:11:28].
  • Observability is Not an Afterthought: Observability is critical for managing complex agent workflows and debugging problems effectively [00:11:41].
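
To make the per-step, verifiable evaluation point concrete, here is a minimal offline harness that scores each scoped step of an agent run against a known-good outcome. The task names, lambdas, and scorer are illustrative assumptions, not Datadog's internal evaluation platform.

```python
# Minimal offline evaluation harness for an agent's step-by-step output.
# All names below are illustrative assumptions, not Datadog tooling.
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepCase:
    name: str
    run: Callable[[], object]         # executes one scoped step of the agent
    check: Callable[[object], bool]   # verifies that step's output

def evaluate(cases: list[StepCase]) -> dict:
    results = {}
    for case in cases:
        try:
            results[case.name] = case.check(case.run())
        except Exception:
            results[case.name] = False
    return {"per_step": results, "pass_rate": sum(results.values()) / len(cases)}

# Example: did the agent pick the right service and find the known root cause?
cases = [
    StepCase("identify_service", lambda: "checkout", lambda out: out == "checkout"),
    StepCase("find_root_cause",  lambda: "db pool exhausted", lambda out: "db pool" in out),
]
print(evaluate(cases))
```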

A “bitter lesson” in AI engineering is that general methods which leverage new, off-the-shelf models ultimately win out [00:13:19]. Teams should be able to try out new models easily and avoid getting locked into the particular one they have been working with [00:13:45].
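
One way to stay unstuck from any single model is a thin, provider-agnostic interface so that swapping in a new off-the-shelf model is a one-line change. This is a generic sketch; the two model classes are stand-ins, not real SDK clients.

```python
# Provider-agnostic model interface so new off-the-shelf models are easy to try.
# The concrete model classes are placeholders, not real SDK calls.
from typing import Protocol

class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class HostedModelA:
    def complete(self, prompt: str) -> str:
        # A real client would call the hosted provider's API here.
        return f"[model A response to: {prompt[:40]}...]"

class LocalModelB:
    def complete(self, prompt: str) -> str:
        # A real client would invoke a locally hosted model here.
        return f"[model B response to: {prompt[:40]}...]"

def triage_alert(alert_text: str, model: ChatModel) -> str:
    return model.complete(f"Summarize the likely cause of this alert:\n{alert_text}")

# Swapping models is a one-line change at the call site:
print(triage_alert("checkout p99 latency > 2s", HostedModelA()))
print(triage_alert("checkout p99 latency > 2s", LocalModelB()))
```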

The Future: Agents as Users of Observability Platforms

A significant future trend is the expectation that AI agents will surpass humans as users of SaaS products like Datadog within the next five years [00:14:07]. This means platforms must consider building not just for human users or internal agents, but also for third-party agents (e.g., Claude) that might directly interact with their APIs [00:14:23]. This involves providing context and API information optimized for agent consumption [00:14:38].
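
For instance, “optimized for agent consumption” could mean publishing a machine-readable tool description alongside the human-facing docs. The endpoint name and schema below are assumptions for illustration, not a published Datadog contract.

```python
# Hypothetical machine-readable tool descriptor a platform might serve so that
# third-party agents can discover and call its API; not an actual Datadog contract.
import json

tool_manifest = {
    "name": "search_logs",
    "description": "Search application logs by query string and time range.",
    "parameters": {
        "type": "object",
        "properties": {
            "query":   {"type": "string",  "description": "Log query, e.g. 'service:checkout status:error'"},
            "from_ts": {"type": "integer", "description": "Start of time range (unix seconds)"},
            "to_ts":   {"type": "integer", "description": "End of time range (unix seconds)"},
        },
        "required": ["query"],
    },
}

# An agent framework can load this JSON and expose it to the model as a callable tool.
print(json.dumps(tool_manifest, indent=2))
```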

Datadog anticipates offering “DevSecOps agents for hire,” where agents will handle platform integration, on-call duties, and more directly [00:14:56]. Conversely, SRE, coding, and other types of AI agents built by users should also leverage Datadog’s platform and tools just like humans would [00:15:12]. This future envisions small companies being built by automated developers and operations/security agents, enabling an order of magnitude more ideas to reach the market [00:15:25].