From: aidotengineer
Paul Gil, a tech lead for Arista Networks based in New York City, designs and helps build enterprise networks [00:00:19]. His focus is on the underlying “plumbing” of the network infrastructure required for training and inferencing AI models [00:00:30].
AI Network Fundamentals
Unlike traditional computer networks, AI networks are judged primarily on "job completion time" [00:00:54]. For AI, understanding the distinction between model training and inference is also crucial for network design [00:01:04].
Training vs. Inference
Training typically involves a large number of GPUs (e.g., 248 GPUs running for one to two months) [00:02:05], while inference, especially after fine-tuning and alignment, may use far fewer (e.g., four H100s) [00:02:10]. Inference has grown more complex with chain-of-thought and reasoning models [00:01:13], and next-generation LLMs now require substantial inference capability [00:02:27].
Network Segregation
AI networks are typically segmented into two main parts [00:03:31]:
- Backend Network: Connects GPUs to each other [00:02:40]. This network is completely isolated due to the expense and power requirements of GPUs [00:02:45].
- Frontend Network: Used for storage, allowing GPUs to call for more data after calculations [00:03:20].
Backend Network: The Core of GPU Communication
The backend network is where GPU servers (from vendors such as Nvidia or Supermicro), typically with eight GPUs each, connect to high-speed leaf and spine switches [00:03:03]. Nothing else attaches to this network [00:03:16].
High-Speed Requirements
- Each GPU can drive 400 Gb/s on the backend network, a speed rarely seen in typical enterprise data centers [00:03:42].
- An Nvidia H100 server can put 4.8 terabits per second (Tb/s) of traffic onto the network from its eight 400-gig GPU ports and four 400-gig frontend ports [00:07:44] (see the arithmetic sketch after this list).
- Upcoming 800-gig "B"-series (Blackwell) GPUs are expected to generate 9.6 Tb/s per server [00:08:03].
- Network speeds are rapidly increasing, with 1.6 Tb/s Ethernet expected by late 2026 or early 2027 [00:14:25].
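A quick sketch of the arithmetic behind those per-server figures. The port counts and speeds are the ones quoted above; everything else is plain arithmetic:

```python
# Per-server traffic estimate: backend GPU ports plus frontend ports,
# expressed in terabits per second (Tb/s).

def server_traffic_tbps(gpu_ports: int, frontend_ports: int, port_speed_gbps: int) -> float:
    """Total traffic a single server can put on the network, in Tb/s."""
    return (gpu_ports + frontend_ports) * port_speed_gbps / 1000

# H100-class server: eight 400G GPU ports + four 400G frontend ports.
print(server_traffic_tbps(8, 4, 400))   # 4.8 Tb/s
# Next-generation 800G server with the same port counts.
print(server_traffic_tbps(8, 4, 800))   # 9.6 Tb/s
```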
Network Design for Performance
- Simplicity: Networks are built as simply as possible so that the expensive GPUs can run 24/7 [00:03:58]. Simple routing protocols such as iBGP or eBGP are used [00:04:06].
- No Over-subscription: AI networks are built with a 1:1 subscription ratio, unlike traditional data centers that might run 1:10 or 1:3 [00:07:21]. This is critical because the GPUs all burst at their full 400-gig line rate simultaneously [00:07:05].
- Traffic Patterns:
- East-West Traffic: Predominates on the backend network as GPUs communicate directly with each other at wire rate [00:11:18].
- North-South Traffic: Occurs when GPUs request data from the storage network [00:11:22].
- Load Balancing: Traditional flow-based load balancing that hashes on the 5-tuple (source/destination IP, ports, protocol) can pile several large flows onto a single uplink and oversubscribe it [00:08:37] (see the hashing sketch after this list). More advanced methods are needed, such as balancing on the percentage of bandwidth actually in use on each uplink, which can reach up to 93% utilization [00:09:14]. Load balancing aware of specific collectives is also possible for AI application frameworks [00:19:47].
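To see why flow-based hashing falls short with a handful of elephant flows, here is a minimal sketch. The hash function, uplink count, and flow set are illustrative, not any vendor's actual ECMP implementation; UDP port 4791 is the standard RoCEv2 destination port:

```python
import hashlib

UPLINKS = 4  # number of leaf-to-spine uplinks (illustrative)

def pick_uplink(src_ip, dst_ip, src_port, dst_port, proto) -> int:
    """Classic flow-based ECMP: hash the 5-tuple, pin the whole flow to one uplink."""
    key = f"{src_ip}{dst_ip}{src_port}{dst_port}{proto}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# Four GPUs, each sending a single long-lived flow that bursts at 400 Gbps.
flows = [("10.0.0.1", "10.0.1.1", 49152, 4791, "udp"),
         ("10.0.0.2", "10.0.1.2", 49153, 4791, "udp"),
         ("10.0.0.3", "10.0.1.3", 49154, 4791, "udp"),
         ("10.0.0.4", "10.0.1.4", 49155, 4791, "udp")]

load = [0] * UPLINKS
for f in flows:
    load[pick_uplink(*f)] += 400  # Gbps placed on the chosen uplink

# With so few large flows, two often hash onto the same uplink (800G on a 400G
# link) while another uplink sits idle -- the oversubscription described above.
print(load)
```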
Power and Cooling
AI racks require significantly more power. A typical data center rack is provisioned for 7-15 kW and can hold several servers [00:10:21]. A single AI server with eight GPUs, however, can draw 10.2 kW, so only one such server fits in a traditional rack [00:10:41]. New AI racks are being built to support 100-200 kW and require water cooling, as air cooling is no longer sufficient [00:10:51].
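A back-of-the-envelope sketch of why rack density collapses, using the kW figures quoted above:

```python
# How many 10.2 kW AI servers fit in a rack at different power budgets?
AI_SERVER_KW = 10.2

for rack_kw in (7, 15, 100, 200):
    print(f"{rack_kw:>3} kW rack -> {int(rack_kw // AI_SERVER_KW)} AI server(s)")
# A 7 kW rack fits none and a 15 kW rack fits only one such server;
# 100-200 kW racks are what make denser GPU deployments possible.
```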
Frontend Network: Data Ingestion
The frontend network is responsible for providing storage to train the model [00:03:20]. It is far less demanding than the backend; most storage vendors currently put around 100-200 Gb/s of traffic on the network [00:11:34].
Unique Challenges and Solutions in AI Networking
Building AI networks presents distinct challenges compared to traditional data centers [00:06:31]:
- Hardware and Software: The reliance on GPUs and software such as CUDA and NCCL (pronounced "nickel") is new [00:06:03]. NCCL's collective operations dictate how traffic is placed on the network [00:06:11] (see the all-reduce sketch after this list).
- Application Behavior: Unlike web applications or databases with clear client-server patterns, all GPUs in an AI network communicate actively with each other [00:06:48]. If one GPU fails, the entire training job might fail [00:06:53].
- Error Management: A single GPU failure can halt a model [00:09:31]. With thousands of GPUs, cable, optics, and transceiver issues become prevalent [00:09:53].
- Congestion Control:
- Buffering: Switches need deep, well-managed buffers, and buffer memory is expensive [00:11:50]. Buffers can be tuned to efficiently accept the specific packet sizes common in AI models [00:16:15].
- Congestion Control and Feedback: RoCEv2 (RDMA over Converged Ethernet v2) is essential [00:12:04]. It has two main components:
- ECN (Explicit Congestion Notification): An end-to-end flow control where marked packets inform receivers to tell senders to slow down [00:12:17].
- PFC (Priority Flow Control): A “stop” signal used when buffers are full, acting as an emergency brake [00:12:41].
- Lossless Ethernet is key [00:15:27].
- Network Isolation: AI networks are kept totally isolated from other networks, including the internet, to prevent risks to expensive hardware [00:13:12].
- On-Demand Applications: Unlike traditional applications that might recover seamlessly from failures, an AI model failure can be a critical event [00:13:30].
- RDMA Monitoring: RDMA (Remote Direct Memory Access) is a complex protocol crucial for GPU kernel optimization and requires monitoring for error codes and dropped packets [00:16:44]. Network devices can capture packet headers and RDMA information to diagnose why packets were dropped [00:17:25] (see the counter-reading sketch after this list).
- AI Agent: An AI agent, loaded on Nvidia GPUs via an API, allows the GPU to communicate with the network switch. It verifies correct flow control configuration (PFC, ECN) and provides statistics on packets sent/received and RDMA errors, helping to correlate problems to either the GPU or the network [00:17:36]. This enhances visibility and telemetry [00:14:46].
- Smart System Upgrade: Allows upgrading switch software without taking the switch offline, critical for maintaining 24/7 operation of large GPU clusters [00:18:32].
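As an illustration of the collective operations NCCL runs on behalf of training frameworks, here is a minimal PyTorch sketch. It assumes a multi-GPU host launched with torchrun; the script name, tensor size, and print statement are illustrative. Every rank participates in the same all-reduce, which is exactly the all-to-all, wire-rate GPU-to-GPU traffic pattern described above:

```python
# Minimal sketch of an NCCL collective via PyTorch distributed.
# Launch with e.g.:  torchrun --nproc_per_node=8 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, LOCAL_RANK and the rendezvous variables.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each GPU contributes a large tensor; all-reduce sums it across every rank.
    # On the wire this becomes GPU-to-GPU traffic across the backend network.
    grads = torch.ones(256 * 1024 * 1024, device="cuda")  # ~1 GB of float32
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("all-reduce done, first element:", grads[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```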
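For the RDMA monitoring point, a minimal host-side sketch. The sysfs layout follows the common Linux RDMA convention, but counter names vary by NIC vendor and driver, and mlx5_0 is only an example device name, so treat these as assumptions to adapt:

```python
# Read RDMA/RoCE error counters exposed by the Linux kernel under
# /sys/class/infiniband/<device>/ports/<port>/{counters,hw_counters}.
from pathlib import Path

WATCHED = ["port_rcv_errors", "port_xmit_discards",   # standard port counters
           "out_of_sequence", "packet_seq_err"]       # vendor hw_counters (if present)

def read_counters(device: str, port: int = 1) -> dict:
    base = Path(f"/sys/class/infiniband/{device}/ports/{port}")
    values = {}
    for subdir in ("counters", "hw_counters"):
        d = base / subdir
        if not d.is_dir():
            continue
        for name in WATCHED:
            f = d / name
            if f.is_file():
                values[name] = int(f.read_text().strip())
    return values

if __name__ == "__main__":
    # List your devices with `ls /sys/class/infiniband/`; mlx5_0 is an example.
    print(read_counters("mlx5_0"))
```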
Network Design Recommendations for AI
Key considerations for AI network design include:
- No oversubscription on the backend [00:19:10].
- Point-to-point connections with specific IP addressing (e.g., /30 or /31 subnets) or IPv6 [00:19:17] (see the addressing sketch after this list).
- Using BGP as the routing protocol due to its simplicity and speed [00:19:29].
- EVPN VXLAN for multi-tenancy [00:19:35].
- Mandatory deployment of RoCE congestion control (PFC and ECN) to prevent network meltdown and provide early warnings [00:19:54].
- Continuous visibility and telemetry to proactively identify issues [00:20:07].
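For the point-to-point addressing recommendation, a small sketch of carving /31 links out of a block with Python's standard ipaddress module. The 10.1.0.0/24 block and the leaf/spine labels are arbitrary examples:

```python
# Carve point-to-point /31 subnets (RFC 3021) out of a block for leaf-spine links.
import ipaddress

block = ipaddress.ip_network("10.1.0.0/24")     # example addressing block
links = list(block.subnets(new_prefix=31))      # 128 point-to-point links

for link in links[:4]:
    a, b = link                                 # each /31 has exactly two usable addresses
    print(f"{link}:  leaf {a}  <->  spine {b}")
```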
Future Developments
The Ultra Ethernet Consortium (UEC) is working on evolving Ethernet to better handle AI workloads, focusing on congestion control, packet spraying, and NIC-to-NIC communication [00:21:23]. Version 1.0 is expected to be ratified in Q1 2025 [00:21:41]. This initiative aims to shift more intelligence into the NICs, simplifying the network’s role to just forwarding packets [00:21:56].
Summary
The backend network for GPUs is the most critical and bursty component of AI infrastructure [00:22:09]. Because the GPUs operate in synchronization, a single slow GPU can hold back the entire job [00:22:19]. "Job completion time" is the primary metric, and network issues can increase it drastically [00:22:24]. Checkpointing lets a model recover from failures, but checkpoints are expensive [00:22:34].