From: aidotengineer
Datadog’s approach to artificial intelligence builds on roughly a decade of embedding AI features in its products, and is now moving toward advanced, autonomous AI agents for operations and development [00:00:36]. The company aims to provide an AI assistant that helps with DevOps problems [00:01:06].
Speaker Background
The presenter, Diamond, has spent approximately 15 years in AI working to create AI friends and co-workers [00:00:39]. His career spans roles at Microsoft Cortana [00:00:54], Amazon Alexa [00:00:56], Meta (working on PyTorch) [00:00:58], and his own AI startup focused on a DevOps assistant [00:01:00].
Datadog Overview
Datadog is an observability and security platform designed for cloud applications [00:01:24]. Its core function is to let users observe system behavior and take action, making it easier to understand systems and build safer, more DevOps-friendly ones [00:01:30].
History of AI at Datadog
Datadog has been incorporating AI into its products since around 2015 [00:02:03]. While not always overtly branded as “AI products,” features like proactive alerting, root cause analysis, impact analysis, and change tracking have long relied on AI capabilities [00:01:56].
The current landscape represents a “clear era shift,” akin to the advent of the microprocessor or the shift to SaaS [00:02:08]. This shift is characterized by:
- Bigger, smarter models [00:02:16]
- Reasoning capabilities [00:02:17]
- Multimodal AI [00:02:18]
- “Foundation model wars” [00:02:20]
- Intelligence becoming “too cheap to meter” [00:02:24]
This has led to a rise in user expectations for AI [00:02:35]. Datadog is responding by moving “up the stack” to leverage these advancements and provide AI agents that use the platform on behalf of customers [00:02:53].
Bits AI: Datadog’s AI Agents
Datadog is developing “Bits AI” as an AI assistant for DevOps problems [00:01:04]. Building it requires work in agent development, evaluation (evals), and new types of observability [00:03:08]. Two agents are currently in private beta [00:03:22]:
AI On-Call Engineer
This agent is designed to save engineers from 2 AM alerts [00:03:48].
- Proactive Operation: Kicks off automatically when an alert occurs [00:04:04].
- Situational Awareness: Reads runbooks, gathers alert context, and analyzes logs, metrics, and traces [00:04:06].
- Investigation and Summarization: Automatically runs investigations, summarizes its findings, and pulls relevant information together before a human even gets to their computer [00:04:26].
- Hypothesis and Tooling: Formulates hypotheses, reasons over them, and uses tools to test ideas by running queries against data (see the sketch after this list) [00:05:30].
- Remediation and Post-Mortems: If a root cause is found, it can suggest remediations (e.g., paging another team, scaling infrastructure) [00:05:53]. It can also write post-mortems after an issue is remediated, summarizing what occurred and the actions taken by both humans and the agent [00:06:31].
- Human-AI Collaboration: A new page facilitates collaboration, allowing humans to verify what the agent did, learn from its actions, and ask follow-up questions [00:04:47].
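To make the investigation loop concrete, here is a minimal sketch, not Datadog’s implementation: `query_llm`, `fetch_context`, and `test_hypothesis` are hypothetical stand-ins for a model call and telemetry queries, showing only the shape of an alert-triggered hypothesize-and-test loop.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    hypothesis: str
    evidence: str
    confirmed: bool


def query_llm(prompt: str) -> str:
    # Stand-in for a call to a hosted model; returns the next hypothesis.
    return "connection pool exhaustion on the checkout service"


def fetch_context(alert: dict) -> str:
    # Gather runbooks, alert metadata, and recent logs/metrics/traces.
    return f"runbook and telemetry for {alert['monitor']}"


def test_hypothesis(hypothesis: str) -> Finding:
    # Run a targeted query (e.g., a log search) to check the hypothesis.
    evidence = "db.pool.utilization pegged at 100% since the alert fired"
    return Finding(hypothesis, evidence, confirmed=True)


def investigate(alert: dict, max_steps: int = 5) -> list[Finding]:
    context = fetch_context(alert)
    findings: list[Finding] = []
    for _ in range(max_steps):
        hypothesis = query_llm(f"Alert: {alert}\nContext: {context}\nNext hypothesis?")
        finding = test_hypothesis(hypothesis)
        findings.append(finding)
        if finding.confirmed:
            break  # root cause found; hand off to remediation and post-mortem
    return findings


if __name__ == "__main__":
    for f in investigate({"monitor": "checkout-latency-p99", "status": "ALERT"}):
        print(f)
```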
AI Software Engineer
This agent acts as a proactive DevOps/software engineering agent [00:06:55].
- Error Tracking Assistant: Observes and acts on incoming errors [00:07:00].
- Analysis and Solutions: Automatically analyzes errors, identifies causes, and proposes solutions [00:07:05].
- Code Generation: Solutions can include generating code fixes and creating tests (e.g., a recursion test for a recursion issue) [00:07:12].
- Integration: Offers options to create a pull request in GitHub or open a diff in VS Code for editing (a sketch of this flow follows the list) [00:07:32]. This significantly reduces the time engineers spend manually writing and testing fixes [00:07:38].
- Incident Reduction: Aims to reduce the number of on-call incidents in the first place by proactively addressing issues [00:07:14].
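As a rough illustration of the error-to-pull-request flow: the analysis and codegen helpers below are hypothetical stubs, while the GitHub call uses the real REST “create a pull request” endpoint, and assumes the fix branch has already been pushed.

```python
import requests


def analyze_error(stack_trace: str) -> str:
    # Stand-in for LLM analysis that identifies the likely cause.
    return "unbounded recursion in parse_node()"


def generate_fix(cause: str) -> dict:
    # Stand-in for LLM codegen: a patch plus a regression test
    # (e.g., a recursion test for a recursion issue).
    return {"patch": "...", "test": "def test_parse_node_terminates(): ..."}


def open_pull_request(repo: str, branch: str, title: str, token: str) -> int:
    # POST /repos/{owner}/{repo}/pulls is GitHub's create-PR endpoint.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/pulls",
        headers={"Authorization": f"Bearer {token}"},
        json={"title": title, "head": branch, "base": "main"},
    )
    resp.raise_for_status()
    return resp.json()["number"]


cause = analyze_error("RecursionError: maximum recursion depth exceeded ...")
fix = generate_fix(cause)
# open_pull_request("org/service", "agent/fix-recursion", f"Fix: {cause}", token="...")
```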
Learnings from Building AI Agents
Datadog has identified several key learnings in the process of building these AI agents [00:07:47]:
Scoping Tasks for Evaluation
It’s easier to build demos than to properly scope tasks and evaluate what’s actually happening [00:08:03].
- Define “Jobs to Be Done”: Clearly understand step-by-step what needs to be accomplished, thinking from a human perspective first [00:08:35].
- Vertical, Task-Specific Agents: Build specific agents rather than generalized ones [00:08:48].
- Measurable and Verifiable: Tasks should be measurable and verifiable at each step, which is a significant challenge for AI agents [00:08:52].
- Domain Experts as Verifiers: Use domain experts as design partners or task verifiers, not for writing rules or code, as stochastic models work differently than human experts [00:09:10].
- Prioritize Evals: Think deeply about evaluation from the start, combining offline, online, and “living” evaluations with end-to-end measurements and proper instrumentation (a minimal harness is sketched after this list) [00:09:31].
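A minimal offline eval harness, sketched under assumed structure (the task format and checks are invented for illustration): each task carries expert-verified checks per step, so a failure is attributable to a specific step rather than a single end-to-end pass/fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalTask:
    name: str
    alert: dict
    # Expert-verified checks, one per step, so failures are attributable.
    step_checks: list[Callable[[dict], bool]]


def run_agent(alert: dict) -> dict:
    # Stand-in for the agent under test; returns its step outputs.
    return {"context": "...", "hypothesis": "pool exhaustion", "root_cause": "pool exhaustion"}


def evaluate(tasks: list[EvalTask]) -> float:
    passed = 0
    for task in tasks:
        outputs = run_agent(task.alert)
        if all(check(outputs) for check in task.step_checks):
            passed += 1
    return passed / len(tasks)


tasks = [
    EvalTask(
        name="checkout latency regression",
        alert={"monitor": "checkout-latency-p99"},
        step_checks=[
            lambda o: "pool" in o["hypothesis"],  # label verified by a domain expert
            lambda o: o["root_cause"] == "pool exhaustion",
        ],
    ),
]
print(f"pass rate: {evaluate(tasks):.0%}")
```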
Building the Right Team
The right team is crucial for moving fast and dealing with ambiguity [00:09:09].
- ML Experts and Optimistic Generalists: Teams should be seeded with one or two ML experts, complemented by optimistic generalists who are good at writing code and willing to quickly try things [00:10:15].
- UX Importance: User experience (UX) and front-end development are more important than often perceived, especially for collaboration with agents [00:10:28].
- AI-Augmented Teammates: Team members should be excited to be AI augmented themselves, possessing an explorer mindset willing to learn in a rapidly changing field [00:10:38].
Evolving User Experience (UX)
The UX of AI agents is still an early space, and old UX patterns are changing [00:11:18]. Datadog favors agents that act more like human teammates rather than requiring a multitude of new pages or buttons [00:11:28].
Observability Matters
Even with AI agents, observability is critical and should not be an afterthought [00:11:36].
- Complex Workflows: AI agent workflows are complex, often involving hundreds of multi-step calls, decisions about tools, and looping [00:12:28].
- Situational Awareness: Debugging requires situational awareness, which tools like Datadog’s LLM Observability view provide [00:11:44]. This ties interactions and calls to various AI models (hosted, self-run, or accessed via API) into a single view [00:12:05].
- Agent Graph: Datadog developed an “agent graph” to visualize and debug agent workflows in a human-readable format, making it easier to spot errors (a generic instrumentation sketch follows this list) [00:12:47].
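Datadog’s LLM Observability provides this out of the box; purely as a generic illustration of what step-level instrumentation records (the `Tracer` class and span fields here are invented), each model or tool call becomes a node with a parent, forming the kind of graph an agent-graph view can render.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    def __init__(self):
        self.spans = []   # completed spans, i.e., nodes of the agent graph
        self._stack = []  # currently open spans, for parent-child linking

    @contextmanager
    def span(self, name: str, **attrs):
        record = {
            "id": uuid.uuid4().hex[:8],
            "name": name,
            "parent": self._stack[-1]["id"] if self._stack else None,
            "attrs": attrs,
            "start": time.time(),
        }
        self._stack.append(record)
        try:
            yield record
        finally:
            record["duration_s"] = time.time() - record["start"]
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
with tracer.span("investigate", alert="checkout-latency-p99"):
    with tracer.span("llm_call", model="example-model", prompt_tokens=812):
        pass  # the model request would happen here
    with tracer.span("tool_call", tool="log_search"):
        pass  # the telemetry query would happen here

for s in tracer.spans:
    print(s["name"], "parent:", s["parent"], f"{s['duration_s']:.3f}s")
```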
The “Bitter Lesson” at the Agent/Application Layer
General methods that can leverage new, off-the-shelf AI models are ultimately the most effective [00:13:19]. It’s important to be able to easily try out different models and not get stuck on a particular one, as the field is rapidly advancing with new models solving complex reasoning tasks quickly [00:13:45].
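One way to stay model-agnostic, sketched below with invented class and registry names, is to hide vendors behind a single interface so trying a newly released off-the-shelf model is a config change rather than a rewrite.

```python
from typing import Callable, Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class VendorAModel:
    # Placeholder: a real implementation would call vendor A's API here.
    def complete(self, prompt: str) -> str:
        return "vendor A answer"


class VendorBModel:
    # Placeholder: a real implementation would call vendor B's API here.
    def complete(self, prompt: str) -> str:
        return "vendor B answer"


REGISTRY: dict[str, Callable[[], ChatModel]] = {
    "vendor-a/latest": VendorAModel,
    "vendor-b/latest": VendorBModel,
}


def get_model(name: str) -> ChatModel:
    return REGISTRY[name]()


# Swapping in a new model is then a one-line registry/config change.
model = get_model("vendor-b/latest")
print(model.complete("Summarize the incident."))
```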
Future of AI at Datadog and Beyond
Datadog anticipates a future where AI agents are pervasive and become key users themselves.
AI Agents as Users
It is estimated that AI agents could surpass humans as users of platforms like Datadog and other SaaS products within the next five years [00:14:07]. This means companies should consider designing their products not just for humans or their own agents, but also for third-party agents (e.g., Claude) that might use their platform directly [00:14:23], providing context and API information tailored for agent consumption [00:14:38].
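For instance, a product could publish a machine-readable capability manifest alongside its human documentation; the schema below is invented purely to illustrate the idea of context and API information tailored for agent consumption.

```python
import json

# Hypothetical agent-facing manifest: what an agent can do and how to call it.
MANIFEST = {
    "service": "example-observability-platform",
    "audience": "ai-agents",
    "capabilities": [
        {
            "name": "search_logs",
            "method": "POST",
            "path": "/api/v2/logs/search",
            "params": {"query": "string", "from": "ISO-8601", "to": "ISO-8601"},
            "description": "Full-text search over ingested logs.",
        }
    ],
    "auth": {"type": "api_key", "header": "X-API-KEY"},
}

print(json.dumps(MANIFEST, indent=2))
```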
DevSecOps Agents For Hire
Datadog expects to soon offer a “team of DevSecOps agents for hire,” where their agents will directly handle on-call responsibilities and integrations for customers [00:14:56].
Empowering Small Companies
The speaker believes that in the future, small companies will be built by individuals who can use automated developers (like Cursor or Devin) to bring ideas to life [00:15:25], and then rely on agents like Datadog’s to manage operations and security. This will enable an order of magnitude more ideas to reach the real world [00:15:32].
Datadog is actively hiring AI engineers and enthusiasts to work in this evolving space [00:15:51].