From: aidotengineer

DataDog, an observability and security platform for cloud applications, helps users observe and act on their systems to build safer, more devops-friendly environments [01:22:00]. The company has been integrating AI into its offerings since around 2015, with features like proactive alerting, root cause analysis, impact analysis, and change tracking [01:46:00].

The current era marks a significant shift in AI, comparable to the advent of the microprocessor or the transition to SaaS, driven by bigger, smarter models, reasoning capabilities, and multimodal AI [02:06:00]. This shift makes intelligence increasingly accessible and expected, prompting DataDog to evolve beyond a devops platform and offer AI agents that use the platform on customers’ behalf [02:29:00].

DataDog’s AI Agents (Bits AI)

DataDog is developing “Bits AI,” an AI assistant designed to help with devops problems [01:04:00]. This initiative involves developing the agents, performing evaluations, and building new types of observability [03:06:00]. Currently, DataDog is working on two key AI agents in private beta:

AI On-Call Engineer

The AI On-Call Engineer is designed to respond to alerts, ideally preventing human engineers from being paged in the middle of the night [03:34:00].

  • Functionality: When an alert fires, the agent kicks off proactively, orienting itself by reading runbooks and gathering context [04:04:00]. It then investigates by reviewing logs, metrics, and traces to understand the situation [04:16:00]. The agent can run investigations automatically and pull together summaries and findings before a human engineer even reaches their computer, offering insight into why an alert fired or a trace showed an error [04:24:00].
  • Human-AI Collaboration: A new page supports human-AI collaboration, allowing users to verify agent actions, learn from their processes, and build trust [04:47:00]. Users can see the reasoning behind a hypothesis, what the agent found, and the steps it took from the runbook, similar to overseeing a junior engineer [05:03:00].
  • Problem Resolution: The agent operates by forming hypotheses about what is happening, reasoning over them, and using tools to test those ideas by running queries against logs, metrics, and other data [05:30:00] (a minimal sketch of this loop follows the list). If a root cause is found, it can suggest remediations, such as paging another team or scaling infrastructure up or down [05:51:00]. It can also integrate with existing DataDog workflows [06:11:00].
  • Post-Mortem Generation: After an incident is remediated, the AI On-Call Engineer can write a post-mortem report summarizing what occurred, what the agent did, and what humans did, preparing it for the morning [06:26:00].
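
Below is a minimal sketch of how such a hypothesize-test-remediate loop might be structured. The hypotheses, the query_logs/query_metrics stubs, and the remediation strings are illustrative assumptions for this example, not DataDog’s actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    description: str
    test: Callable[[], list[str]]  # returns supporting evidence, if any
    remediation: str

def query_logs(service: str) -> list[str]:
    # Stand-in for a real log query; a production agent would call
    # the observability platform's API here.
    return ["ERROR connection pool exhausted", "WARN retry storm detected"]

def query_metrics(service: str, metric: str) -> float:
    # Stand-in for a metrics query (here: p95 latency in milliseconds).
    return 2300.0

HYPOTHESES = [
    Hypothesis(
        "Connection pool exhaustion is causing timeouts",
        lambda: [l for l in query_logs("checkout") if "pool exhausted" in l],
        "Scale the connection pool and page the owning team",
    ),
    Hypothesis(
        "A recent deploy regressed latency",
        lambda: (["p95 latency elevated"]
                 if query_metrics("checkout", "p95_latency_ms") > 1000 else []),
        "Roll back the most recent deploy",
    ),
]

def investigate(alert: str) -> None:
    # Orient on the alert, then test each hypothesis in turn and surface
    # evidence plus a suggested remediation for the first confirmed match.
    for hyp in HYPOTHESES:
        evidence = hyp.test()
        if evidence:
            print(f"Alert: {alert}")
            print(f"Likely cause: {hyp.description}")
            print(f"Evidence: {evidence}")
            print(f"Suggested remediation: {hyp.remediation}")
            return
    print("No hypothesis confirmed; escalating to a human engineer.")

investigate("checkout p95 latency above threshold")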

AI Software Engineer

The AI Software Engineer functions as a proactive devops/software engineering agent [06:55:00].

  • Functionality: This agent observes and acts on errors, automatically analyzing them, identifying causes, and proposing solutions [07:00:00]. Solutions can include generating code fixes and creating tests to prevent recurrence, reducing on-call incidents [07:10:00].
  • Workflow Integration: It can catch issues like recursion problems, propose fixes, and even create recursion tests [07:22:00]. Users can then create a pull request in GitHub or open the diff in VS Code for editing [07:30:00]. This workflow significantly reduces the time engineers spend manually writing and testing code [07:38:00] (an illustrative fix-and-test pair follows the list).
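
As an illustration of that fix-and-test flow, here is the kind of change such an agent might propose for a recursion bug. The sum_nested function and its test are invented for this example, not actual DataDog agent output.

```python
# Before (sketched): no base case for the empty list, so the
# function recursed forever:
#   def sum_nested(items):
#       head, *tail = items
#       ...

def sum_nested(items: list) -> int:
    """Sum a possibly nested list of ints, with an explicit base case."""
    if not items:  # agent-proposed fix: terminate on empty input
        return 0
    head, *tail = items
    head_sum = sum_nested(head) if isinstance(head, list) else head
    return head_sum + sum_nested(tail)

# Agent-generated regression test to keep the bug from recurring.
def test_sum_nested_handles_empty_and_nested():
    assert sum_nested([]) == 0
    assert sum_nested([1, [2, [3]], 4]) == 10
```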

Lessons Learned Building AI Agents

Building these AI agents has provided DataDog with several key learnings:

  • Scoping Tasks for Evaluation: It is crucial to define “jobs to be done” and understand, step by step, what is desired, thinking from a human perspective first [08:01:00]. Building vertical, task-specific agents is preferred over generalized ones [08:48:00]. Tasks should be measurable and verifiable at each step: demos are easy to build, but consistent verification and improvement are hard [08:52:00]. Domain experts should be used as design partners or task verifiers, not as code or rule writers, given the stochastic nature of models [09:10:00].
  • Importance of Evaluation: Evaluation deserves deep thought from the start, since fuzzy, stochastic AI systems require robust evaluation [09:31:00]. This includes offline, online, and living evaluations, with end-to-end task measurements and instrumentation to gather human feedback [09:52:00] (a minimal offline harness is sketched after this list).
  • Building the Right Team: While a few ML experts are helpful, the core team should consist of optimistic generalists who are proficient in coding and willing to experiment quickly [10:11:00]. AI-augmented teammates who are excited about day-to-day AI use and eager to learn in a fast-changing field are essential [10:38:00].
  • Evolving User Experience (UX): The user experience for AI agents is a constantly evolving area, and traditional UX patterns are changing [11:00:00]. The preference is for agents that act more like human teammates rather than relying on new pages or buttons [11:28:00].
  • Observability Matters: Observability is critical and should not be an afterthought when automating and augmenting complex AI workflows [11:36:00]. Situational awareness is necessary for debugging problems [11:44:00]. DataDog’s “LLM observability” view helps by grouping a wide variety of AI model interactions and calls into a single pane of glass [11:50:00]. Agent workflows can involve hundreds of complex multi-step calls, so specialized views like an “agent graph” are needed to make debugging human-readable and to surface errors [12:26:00] (a toy span recorder after this list illustrates the idea).
  • “Agent or Application Layer Bitter Lesson”: General methods that leverage new, off-the-shelf AI models are ultimately the most effective [13:15:00]. Fine-tuning specific models for particular tasks can become obsolete when new, more capable foundation models are released [13:26:00]. It’s important to be able to easily try out new models without being tied to older ones [13:45:00].
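
As a companion to the evaluation point above, here is a minimal sketch of an offline eval harness with an end-to-end task measurement. The EvalCase structure, the run_agent stub, and the exact-match scoring rule are assumptions for illustration, not DataDog’s evaluation stack.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    alert: str
    expected_root_cause: str

def run_agent(alert: str) -> str:
    # Stand-in for the real agent; returns its root-cause verdict.
    return "connection pool exhaustion"

CASES = [
    EvalCase("checkout p95 latency alert", "connection pool exhaustion"),
    EvalCase("payments 5xx spike", "bad deploy"),
]

def evaluate(cases: list[EvalCase]) -> float:
    # End-to-end task measurement: did the agent land on the expected
    # root cause? Per-step checks and human feedback would layer on top
    # of this in online and "living" evaluations.
    hits = sum(run_agent(c.alert) == c.expected_root_cause for c in cases)
    return hits / len(cases)

print(f"task accuracy: {evaluate(CASES):.0%}")
```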
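
And for the observability point, a toy span recorder shows how each LLM call and tool call in a multi-step agent workflow can be captured as an inspectable span, the raw material for an agent-graph style view. Real deployments would use an observability SDK rather than this hand-rolled recorder; all names here are illustrative.

```python
import time
from contextlib import contextmanager

SPANS: list[dict] = []

@contextmanager
def span(name: str, **attrs):
    # Record name, attributes, and duration for each step so a failed
    # run can be replayed step by step and rendered as a graph.
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append(
            {"name": name, "ms": (time.perf_counter() - start) * 1e3, **attrs}
        )

def handle_alert(alert: str) -> None:
    with span("orient", alert=alert):
        pass  # read runbook, gather context
    with span("llm.hypothesize", model="example-model"):
        pass  # LLM call proposing hypotheses
    with span("tool.query_logs", service="checkout"):
        pass  # log query testing a hypothesis

handle_alert("checkout latency alert")
for s in SPANS:
    print(s)
```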

Future Outlook for AI in IT Infrastructure Management

The future of AI is expected to be dynamic and accelerating [14:49:00].

  • Agents as Users: There is a strong belief that AI agents may surpass humans as users of SaaS products like DataDog within the next five years [14:01:00]. This means companies should design their products not just for humans but also for agents that might use their APIs [14:21:00].
  • Teams of Agents: DataDog anticipates offering teams of DevSecOps agents for hire, capable of directly using the platform and handling tasks like on-call responsibilities [14:56:00].
  • New Company Creation: It is envisioned that small companies will be built by individuals using automated developers like Cursor or Devin to bring ideas to life, with AI agents handling operations and security, enabling an order of magnitude more ideas to reach the real world [15:24:00].