From: aidotengineer
Building networks for AI workloads presents unique challenges compared to traditional enterprise networks, primarily due to the intense demands of GPUs and the specific nature of AI traffic patterns [00:03:53]. Paul Gil, a tech lead at Arista Networks, explains the infrastructure and protocols required to run AI workloads efficiently.
AI Network Architecture
Networks designed for AI workloads are typically isolated and highly specialized. There are two main components:
- Backend Network This network connects GPUs, typically eight per server (e.g., NVIDIA, Supermicro systems), to high-speed leaf and spine switches [00:03:03]. These networks are entirely isolated because GPUs are expensive, power-hungry, and scarce [00:02:48]. Nothing else is connected to this network [00:03:14].
- Frontend Network This network provides storage for training models and handles data synchronization for GPUs [00:03:19]. It is less intense than the backend network [00:03:34].
Unique Demands of AI Networks
Traditional data center applications are relatively easy to manage, with traffic flowing one way and failover mechanisms in place [00:06:34]. AI networks, however, operate differently:
- GPU Communication GPUs communicate with each other, sending and receiving data simultaneously [00:06:48]. If one GPU fails, the entire job might fail or require a lengthy recovery [00:06:53].
- Bursty Traffic Traffic is highly bursty, with thousands of GPUs capable of bursting at 400 Gb/s simultaneously [00:07:03]. A single H100 server can put 4.8 terabits per second onto the network [00:07:44]; upcoming GPUs (e.g., the B-series) could push this to 9.6 terabits per second per server [00:08:08] (see the arithmetic sketch after this list).
- No Oversubscription To handle this bursty traffic, AI networks are built with a one-to-one subscription ratio, meaning no oversubscription [00:07:19]. This is significantly different from traditional data centers, which might use 1:10 or 1:3 ratios [00:07:24].
- Traffic Patterns AI networks experience both east-west traffic (GPU-to-GPU communication) and north-south traffic (GPU to storage network for data requests) [00:11:06]. East-west traffic typically runs at wire rate [00:11:29].
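To make the burst and subscription numbers concrete, here is a back-of-the-envelope sketch in Python. The split of the 4.8 Tb/s figure into eight 400 Gb/s backend NICs plus four 400 Gb/s frontend NICs is an assumption for illustration, as is the four-servers-per-leaf layout:

```python
# Back-of-the-envelope math for one H100-class GPU server.
# Assumed breakdown (illustrative, not stated in the talk): eight
# 400 Gb/s backend NICs (one per GPU) plus four 400 Gb/s frontend NICs.
GPUS_PER_SERVER = 8
BACKEND_NIC_GBPS = 400
FRONTEND_NICS = 4            # assumption for illustration
FRONTEND_NIC_GBPS = 400

burst_tbps = (GPUS_PER_SERVER * BACKEND_NIC_GBPS
              + FRONTEND_NICS * FRONTEND_NIC_GBPS) / 1000
print(f"burst per server: {burst_tbps:.1f} Tb/s")   # 4.8 Tb/s

# 1:1 (non-oversubscribed) leaf sizing: uplink capacity toward the
# spine must equal the GPU-facing downlink capacity.
SERVERS_PER_LEAF = 4         # illustrative layout
downlink_gbps = SERVERS_PER_LEAF * GPUS_PER_SERVER * BACKEND_NIC_GBPS
print(f"leaf needs {downlink_gbps // 400} x 400G uplinks for 1:1")  # 32
```

At a traditional 1:10 ratio the leaf would carry only a tenth of that uplink capacity, which is exactly what synchronized GPU bursts cannot tolerate.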
Networking Protocols and Congestion Control
Managing congestion is critical in AI networks to prevent packet loss and ensure job completion time [00:11:50].
- Protocols
- BGP (Border Gateway Protocol) is recommended as the best, simplest, and quickest protocol for these networks [00:19:29].
- CUDA and NCCL (NVIDIA Collective Communications Library) are the key software layers. NCCL's collective operations dictate how traffic is placed onto the network [00:06:03].
- RDMA (Remote Direct Memory Access) is used for memory-to-memory writes, bypassing the CPU [00:16:47]. It is a complex protocol with numerous error codes, which must be monitored to diagnose network problems [00:16:56] (a counter-polling sketch follows).
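On Linux hosts, RDMA drivers expose per-port error counters under sysfs, which is one practical way to watch for the errors mentioned above. A minimal polling sketch; the device name and exact counter set vary by NIC vendor (the names below are typical of NVIDIA/Mellanox mlx5 NICs):

```python
# Minimal sketch: poll per-port RDMA hardware counters exposed under
# sysfs. Counter names vary by NIC; those below are typical for mlx5.
from pathlib import Path
import time

HW_COUNTERS = Path("/sys/class/infiniband/mlx5_0/ports/1/hw_counters")
WATCH = ["out_of_sequence", "packet_seq_err", "local_ack_timeout_err",
         "np_cnp_sent", "rp_cnp_handled"]   # CNP counters reflect ECN activity

def read_counters():
    vals = {}
    for name in WATCH:
        f = HW_COUNTERS / name
        if f.exists():                      # not every NIC exposes every counter
            vals[name] = int(f.read_text())
    return vals

prev = read_counters()
while True:
    time.sleep(10)
    cur = read_counters()
    for name, v in cur.items():
        if v > prev.get(name, 0):
            print(f"{name} incremented to {v} -- check fabric congestion/drops")
    prev = cur
```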
- Congestion Control Mechanisms
- Lossless Ethernet is essential: a handful of packet drops may be tolerable, but too many will cause problems [00:15:28].
- RoCEv2 (RDMA over Converged Ethernet version 2) is crucial for congestion control, utilizing two main mechanisms [00:12:04]:
- ECN (Explicit Congestion Notification) is an end-to-end flow control mechanism. When congestion occurs, packets are marked, and the receiver notifies the sender to slow down. The sender then pauses its transmission before gradually speeding up again [00:12:17].
- PFC (Priority Flow Control) is a “stop” signal used when network buffers are full, halting traffic entirely [00:12:41]. (Both mechanisms are sketched in the toy model after this list.)
- Advanced Load Balancing Traditional load balancing hashes on the five-tuple (source/destination IP address, source/destination port, protocol) for entropy [00:08:42]. With GPUs, however, traffic often comes from a single IP address, so flows can pile onto one uplink and oversubscribe it [00:08:51]. Newer methods balance on the percentage of bandwidth actually used on each uplink, achieving up to 93% utilization [00:09:14]. There is also “cluster load balancing,” which works with the specific collective operations being run [00:19:47].
- Buffer Management Network switches have limited buffering [00:11:50]. By adjusting buffer sizes to match the specific packet sizes sent and received by models, network efficiency can be significantly improved [00:16:15].
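A toy model of the ECN/PFC interplay described above: the switch starts ECN-marking at a lower buffer threshold and only resorts to PFC pause when the buffer is nearly full, while the sender backs off on congestion notifications and gradually ramps back up. Thresholds and rate factors are illustrative, not vendor defaults:

```python
# Toy model of RoCEv2 congestion control. ECN marking engages first
# (end-to-end "slow down"); PFC pause is the last-resort "stop".
# All thresholds and factors are illustrative.
ECN_MARK_THRESHOLD = 0.60   # buffer fill fraction where marking starts
PFC_XOFF_THRESHOLD = 0.90   # buffer fill fraction where pause frames go out

def switch_action(buffer_fill: float) -> str:
    if buffer_fill >= PFC_XOFF_THRESHOLD:
        return "PFC_PAUSE"      # halt the upstream sender entirely
    if buffer_fill >= ECN_MARK_THRESHOLD:
        return "ECN_MARK"       # receiver will echo a congestion notification
    return "FORWARD"

class ToySender:
    """DCQCN-like reaction: multiplicative decrease on a congestion
    notification, gradual recovery while the fabric stays quiet."""
    def __init__(self, line_rate_gbps: int = 400):
        self.line_rate = line_rate_gbps
        self.rate = float(line_rate_gbps)

    def on_congestion(self):
        self.rate *= 0.5                    # back off sharply

    def on_quiet(self):
        self.rate = min(self.line_rate, self.rate * 1.25)

sender = ToySender()
for fill in (0.50, 0.70, 0.95, 0.40, 0.30):
    action = switch_action(fill)
    if action == "FORWARD":
        sender.on_quiet()
    else:
        sender.on_congestion()  # PFC modeled here as a harsher slow-down
    print(f"buffer {fill:.0%}: {action:9} -> sender at {sender.rate:.0f} Gb/s")
```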
Visibility and Monitoring
Proactive monitoring is crucial to prevent model failures and minimize downtime [00:14:46].
- Telemetry Different telemetry and visibility tools are deployed to detect network issues before they impact operations [00:14:59].
- Packet Analysis Instead of simply dropping packets during congestion, advanced systems can capture snapshots of packets and their headers (including RDMA information) to provide detailed reasons for the drop [00:17:06].
- AI Agent An agent running on the GPU servers, with an API and supporting code, can communicate with the switches to verify that flow control mechanisms (PFC and ECN) are configured correctly [00:17:36]. It also reports statistics on packets sent/received and RDMA errors, helping to correlate a problem to either the GPU or the network [00:18:15] (a sketch of such a switch query follows this list).
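Arista switches, for example, expose eAPI, a JSON-RPC-over-HTTPS interface, which is one way such an agent could query switch state. A hedged sketch: the host and credentials are placeholders, and the exact show command and response fields differ by platform and EOS release:

```python
# Sketch of an agent-side check against a switch's eAPI (Arista's
# JSON-RPC interface). Host, credentials, and the show command /
# response fields are placeholders and vary by EOS release.
import requests

SWITCH = "https://leaf1.example.net/command-api"   # placeholder host
AUTH = ("admin", "admin")                          # placeholder credentials

def run_cmds(cmds):
    payload = {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": cmds, "format": "json"},
        "id": 1,
    }
    resp = requests.post(SWITCH, json=payload, auth=AUTH, verify=False)
    resp.raise_for_status()
    return resp.json()["result"]

# Hypothetical check: pull PFC state and compare it against intent.
result = run_cmds(["show priority-flow-control"])  # command name may differ
print(result)
```

The same pattern extends to pulling interface drop counters or PFC pause statistics and correlating them with the RDMA error counters collected host-side.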
Future Developments
- Ultra Ethernet Consortium This initiative aims to improve Ethernet for AI workloads by addressing congestion control and packet spraying, and by offloading more functions to Network Interface Cards (NICs), allowing the network to focus on forwarding packets [00:21:23]. Version 1.0 of the specification is expected to be ratified in Q1 2025 [00:21:38]. (A toy comparison of packet spraying versus flow hashing follows this list.)
- Increased Bandwidth Network speeds are rapidly increasing: 800 Gb/s is already supported, and 1.6 Tb/s is expected by late 2026 or early 2027 [00:14:25]. Models will continue to grow larger and consume more network resources [00:14:41].
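The packet-spraying idea the consortium is standardizing can be contrasted with today's five-tuple hashing in a toy model: one elephant flow hashes onto a single uplink and saturates it, while per-packet spraying spreads the same load across all uplinks, at the cost of out-of-order delivery that the NIC must repair (which is why that work moves into the NIC):

```python
# Toy comparison: per-flow ECMP hashing vs per-packet spraying across
# four uplinks. With one elephant flow, hashing pins every packet to a
# single uplink; spraying spreads load evenly but delivers out of order.
import random

UPLINKS = 4
PACKETS = 100_000
flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "udp")  # one RoCE flow (5-tuple)

hashed = [0] * UPLINKS
sprayed = [0] * UPLINKS
for _ in range(PACKETS):
    hashed[hash(flow) % UPLINKS] += 1        # same flow -> same uplink
    sprayed[random.randrange(UPLINKS)] += 1  # fresh decision per packet

print("per-flow hash  :", hashed)    # all packets land on one uplink
print("packet spraying:", sprayed)   # ~even split across all uplinks
```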