From: aidotengineer

AI data centers require advanced telemetry and monitoring to ensure optimal performance and minimize downtime, especially given the high cost and criticality of the hardware involved [00:04:00]. The objective is to proactively identify network issues before they impact AI model training or inference [00:14:49].

Importance of Visibility and Telemetry

Visibility and telemetry are crucial in AI data centers for several reasons:

  • Proactive Problem Solving Network operations centers (NOCs) aim to learn about problems before developers call to report model failures [00:14:49], [00:20:06].
  • High Costs and Criticality GPUs are extremely expensive, consume significant power, and are hard to obtain [00:02:48]. Ensuring they run 24/7 is paramount to maximize return on investment [00:04:05].

Key Monitoring Techniques

RDMA Error Codes and Packet Dropping Analysis

AI networks often use RDMA (Remote Direct Memory Access) for memory-to-memory writes, bypassing the CPU [00:16:44], [00:16:50]. RDMA is a complex protocol with numerous error codes [00:16:56].

For monitoring, when the network encounters congestion and starts dropping packets, the switch can record which packets were dropped and surface the associated RDMA error codes, so operators can determine not just that a drop occurred but why.
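
Since the talk doesn't name specific tooling, here is a minimal Python sketch of how such drop telemetry might be aggregated. The `DropEvent` shape and the event feed are hypothetical stand-ins; the status values follow the libibverbs `ibv_wc_status` completion codes (an illustrative subset, not the full protocol).

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative subset of libibverbs ibv_wc_status completion codes.
RDMA_ERROR_CODES = {
    1: "local length error",                 # IBV_WC_LOC_LEN_ERR
    4: "local protection error",             # IBV_WC_LOC_PROT_ERR
    10: "remote access error",               # IBV_WC_REM_ACCESS_ERR
    12: "transport retry counter exceeded",  # IBV_WC_RETRY_EXC_ERR
    13: "RNR retry counter exceeded",        # IBV_WC_RNR_RETRY_EXC_ERR
}

@dataclass
class DropEvent:
    switch: str
    port: str
    status: int  # RDMA completion status reported with the drop

def summarize_drops(events):
    """Aggregate drops per (switch, error meaning) so hot spots stand out."""
    counts = Counter()
    for e in events:
        meaning = RDMA_ERROR_CODES.get(e.status, f"unknown status {e.status}")
        counts[(e.switch, meaning)] += 1
    return counts

if __name__ == "__main__":
    feed = [  # hypothetical telemetry feed
        DropEvent("leaf1", "Ethernet12", 12),
        DropEvent("leaf1", "Ethernet12", 12),
        DropEvent("leaf2", "Ethernet3", 10),
    ]
    for (switch, meaning), n in summarize_drops(feed).items():
        print(f"{switch}: {n} drop(s) - {meaning}")
```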

AI Agent for GPU-Network Correlation

A specialized AI agent provides crucial visibility into GPU performance and its interaction with the network [00:17:36]. From a networking perspective, it’s challenging to gain insights directly into the GPUs [00:17:40].

This agent is code loaded onto the GPUs, exposed through an API, that communicates directly with the network switches [00:17:45], [00:17:50]. Its functions include:

  1. Configuration Verification The agent verifies that flow control mechanisms (like PFC and ECN) are configured consistently between the GPU and the switch; a mismatch can lead to network disaster [00:17:53]. A minimal sketch of this check follows the list.
  2. Statistical Reporting It provides statistics on packets received, packets sent, RDMA errors, and other RDMA issues [00:18:15]. This allows for correlation between GPU and network problems, significantly improving troubleshooting [00:18:21].
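
The talk doesn't show the agent's code, so the Python sketch below only illustrates the configuration-verification idea: compare the NIC's PFC/ECN settings against the attached switch port and flag mismatches before they cause trouble. `read_nic_config` and `read_switch_port_config` are hypothetical stand-ins for vendor APIs (e.g. querying the RDMA NIC and the switch's telemetry interface).

```python
from dataclasses import dataclass

@dataclass
class FlowControlConfig:
    pfc_enabled: bool
    ecn_enabled: bool
    priority: int  # traffic class carrying RoCE traffic

def read_nic_config() -> FlowControlConfig:
    # Hypothetical: a real agent would query the GPU host's RDMA NIC.
    return FlowControlConfig(pfc_enabled=True, ecn_enabled=True, priority=3)

def read_switch_port_config() -> FlowControlConfig:
    # Hypothetical: a real agent would query the switch it is attached to.
    return FlowControlConfig(pfc_enabled=True, ecn_enabled=False, priority=3)

def verify_flow_control() -> list[str]:
    """Flag PFC/ECN mismatches between the GPU NIC and the switch port."""
    nic, sw = read_nic_config(), read_switch_port_config()
    problems = []
    if nic.pfc_enabled != sw.pfc_enabled:
        problems.append("PFC mismatch between NIC and switch")
    if nic.ecn_enabled != sw.ecn_enabled:
        problems.append("ECN mismatch between NIC and switch")
    if nic.priority != sw.priority:
        problems.append("RoCE traffic class mismatch")
    return problems

if __name__ == "__main__":
    for finding in verify_flow_control() or ["flow control consistent"]:
        print(finding)
```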

Network Design Considerations for Monitoring

Lossless Ethernet and Flow Control

For AI networks, lossless Ethernet is key, as dropping too many packets can be detrimental to model training [00:15:28]. Constant latency is also important [00:15:34]. Flow control mechanisms are vital to maintain network health:

  • ECN (Explicit Congestion Notification) Provides end-to-end flow control. When congestion occurs, switches mark packets; the receiver notifies the sender, which slows down and then gradually speeds back up if no further ECN-marked packets arrive [00:12:17]. A toy simulation of this behavior follows the list.
  • PFC (Priority Flow Control) Acts as an emergency stop. If switch buffers are full, PFC signals the sender to halt traffic completely [00:12:41]. Because GPUs synchronize, a slowdown on one GPU can impact the entire collective, highlighting the need for effective flow control and managing oversubscription [00:15:50].
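
As a rough illustration of the ECN behavior described above, here is a toy Python simulation in which the sender halves its rate on an ECN-marked feedback interval and recovers additively otherwise. The constants are illustrative assumptions, not the tuned DCQCN parameters a real RoCE NIC would use.

```python
LINE_RATE_GBPS = 400.0  # assumed link speed

def next_rate(rate_gbps: float, ecn_marked: bool) -> float:
    """One feedback interval: multiplicative decrease on marks, additive recovery otherwise."""
    if ecn_marked:
        return rate_gbps * 0.5
    return min(LINE_RATE_GBPS, rate_gbps + 25.0)

if __name__ == "__main__":
    rate = LINE_RATE_GBPS
    feedback = [True, True, False, False, False, False]  # two congested intervals
    for marked in feedback:
        rate = next_rate(rate, marked)
        print(f"ECN mark={marked}: sender rate {rate:.0f} Gbps")
```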

Buffer Management

Network switches need to manage buffers effectively [00:11:50]. Buffer memory is costly, but because packet sizes in AI training traffic tend to be consistent, buffers can be tuned to those sizes to make the best use of limited buffer space [00:16:07].
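
As a back-of-envelope illustration of that tuning idea, the Python sketch below carves a per-queue buffer partition into a whole number of fixed-size packets after reserving PFC headroom for the bytes still in flight when a pause frame is sent. All numbers (link speed, cable length, MTU, partition size) are illustrative assumptions.

```python
def pfc_headroom_bytes(link_gbps: float, cable_m: float, mtu: int) -> int:
    """Bytes that can still arrive after a PFC pause: round-trip propagation delay plus one MTU each way."""
    propagation_s = cable_m / 2e8                      # ~2e8 m/s signal speed in fiber
    in_flight = link_gbps * 1e9 / 8 * (2 * propagation_s)
    return int(in_flight + 2 * mtu)

def packets_per_queue(queue_bytes: int, packet_size: int) -> int:
    return queue_bytes // packet_size

if __name__ == "__main__":
    headroom = pfc_headroom_bytes(link_gbps=400, cable_m=100, mtu=4096)
    queue = 1_000_000 - headroom  # assumed 1 MB per-queue partition
    print(f"PFC headroom: {headroom} bytes")
    print(f"Usable queue: {queue} bytes = {packets_per_queue(queue, 4096)} x 4096-byte packets")
```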

Network Isolation and Simplicity

AI networks, particularly the backend networks connecting GPUs, are typically kept completely isolated, both because of the high cost of GPUs and the need to prevent interference [00:02:45], [00:02:55]. Simple routing protocols such as IBGP or EBGP are preferred for their efficiency and speed [00:04:08]. Unlike traditional data centers with firewalls and load balancers, AI networks are usually direct and purpose-built [00:13:00], [00:13:12].

Future Developments

Network speeds are rapidly increasing: 800 gigabits per second (Gbps) is supported today, and 1.6 terabits per second (Tbps) is expected by late 2026 or early 2027 [00:14:25], [00:14:33]. The Ultra Ethernet Consortium is working on refining Ethernet for AI workloads, focusing on improved congestion control and efficient packet handling [00:21:23].