From: aidotengineer
Paul Gil, a tech lead at Arista Networks based in New York City, designs and helps build enterprise networks, focusing on the underlying “plumbing” for AI infrastructure rather than on agents. His work involves understanding how models are trained, what the infrastructure looks like, and how inferencing is performed [00:00:16].
Training vs. Inference
Historically, building computer networks did not involve terms like “job completion time” and “barrier”; these have become central with AI workloads [00:00:50]. The distinction between training and inference has also evolved significantly with Chain of Thought and reasoning models [00:01:13].
Dr. Wes Sousa developed a concept illustrating the difference in GPU sizing: training might require on the order of 18 times the resources, while inference might require around 2 times, though this is changing with Chain of Thought and reasoning [00:01:31]. For example, a model trained with 248 GPUs for one to two months might need only four H100s for inference after fine-tuning and alignment [00:02:03]. While LLMs used to need relatively few resources for inference, next-generation models require far more [00:02:27].
AI Network Architecture
AI networks introduce new terminology for networking professionals [00:03:36]. They typically consist of two main parts:
Backend Network
This network connects GPUs [00:02:40]. Due to the high cost, power consumption, and scarcity of GPUs, these networks are completely isolated from other systems [00:02:48].
- Servers (e.g., from Nvidia or Supermicro) typically have eight GPUs per pool [00:03:03].
- GPUs connect to a high-speed leaf switch, which then connects to a spine switch, forming a dedicated network with nothing else attached [00:03:10].
- The backend network is heavily loaded; each GPU can drive up to 400 Gbps, depending on the model being trained [00:03:37].
- Networking for AI is kept simple, using basic routing protocols such as iBGP or eBGP to ensure maximum uptime [00:03:58]; a back-of-envelope fabric-sizing sketch follows this list.
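The protocols are simple, but the bandwidth involved is not small. Below is a back-of-envelope sizing sketch in Python, assuming a hypothetical pod of eight leaf switches with four eight-GPU servers each, 400 Gbps per GPU, and the 1:1 (non-oversubscribed) design described later in this summary; all figures are illustrative assumptions, not values from the talk.

```python
# Back-of-envelope sizing for a non-oversubscribed (1:1) backend fabric.
# All figures are illustrative assumptions, not values from the talk.
GBPS_PER_GPU = 400        # each GPU NIC runs at 400 Gbps
GPUS_PER_SERVER = 8       # typical H100 server
SERVERS_PER_LEAF = 4      # hypothetical: 32 x 400G downlinks per leaf
LEAVES = 8                # hypothetical pod size

downlinks_per_leaf = GPUS_PER_SERVER * SERVERS_PER_LEAF
downlink_gbps_per_leaf = downlinks_per_leaf * GBPS_PER_GPU

# 1:1 means uplink capacity toward the spines must match downlink capacity.
uplinks_per_leaf = downlinks_per_leaf
total_gpus = GPUS_PER_SERVER * SERVERS_PER_LEAF * LEAVES

print(f"GPUs in pod:           {total_gpus}")
print(f"Per-leaf downlink:     {downlink_gbps_per_leaf / 1000:.1f} Tbps")
print(f"400G uplinks per leaf: {uplinks_per_leaf} (no oversubscription)")
```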
Frontend Network
This network handles storage access for the model [00:03:19]. GPUs calculate, synchronize, and then request more data, forming a continuous cycle [00:03:26]. The frontend network is less heavily loaded than the backend [00:03:34].
Challenges in Building AI Applications and Networks
Designing AI networks presents unique challenges compared to traditional data center networks [00:05:01].
Hardware Differences
- GPUs are unfamiliar hardware for many networking professionals; configuring them can take hours [00:05:50].
- An H100 server, a popular AI server, has eight 400 gig GPU ports and four 400 gig Ethernet ports [00:04:30]. Such servers can generate unprecedented traffic loads, roughly 4.8 Tbps from a single H100 server (see the worked calculation after this list) [00:04:54].
- Nvidia servers (like the DGX or HGX) come with a fixed number of GPUs (typically eight) and cannot be “scaled up” by adding more, though networks can be “scaled out” to add more GPUs over time [00:05:14].
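As a quick check on the per-server figure quoted above (eight 400 gig GPU ports plus four 400 gig Ethernet ports), here is the arithmetic as a tiny Python snippet:

```python
# Aggregate wire-rate traffic a single H100 server can source, using the port
# counts quoted above: 8 x 400G backend (GPU) ports + 4 x 400G frontend ports.
backend_gbps = 8 * 400
frontend_gbps = 4 * 400
print((backend_gbps + frontend_gbps) / 1000, "Tbps")   # 4.8 Tbps per server
```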
Software and Protocols
- CUDA and NCCL (the NVIDIA Collective Communications Library, pronounced “nickel”) are the key software layers [00:06:03]. Networking teams need to understand NCCL’s “collective” operations because they shape network traffic patterns; a minimal collective example follows [00:06:11].
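As a concrete illustration of the collectives that drive these traffic patterns, below is a minimal PyTorch sketch that runs an all-reduce over the NCCL backend. It assumes a standard torchrun launch with one process per GPU and is not taken from the talk.

```python
# Minimal NCCL all-reduce (launch with: torchrun --nproc_per_node=8 allreduce.py).
# The all-reduce is the kind of synchronized collective whose bursty traffic
# the backend network has to carry.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")       # NCCL handles GPU-to-GPU transport
    local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun
    torch.cuda.set_device(local_rank)

    # Dummy gradient bucket: every rank contributes its values, every rank gets the sum.
    bucket = torch.ones(64 * 1024 * 1024, device="cuda") * dist.get_rank()
    dist.all_reduce(bucket, op=dist.ReduceOp.SUM)

    torch.cuda.synchronize()
    if dist.get_rank() == 0:
        print("all-reduce complete, first element:", bucket[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```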
Traffic Patterns
- Unlike typical data center applications (web/database), where traffic flows between different tiers and can fail over, AI networks have GPUs communicating directly with one another [00:06:33]. If one GPU fails, the entire job might fail [00:06:53].
- AI network traffic is extremely bursty [00:07:03]. Thousands of GPUs can burst simultaneously at 400 gig, creating massive network load [00:07:05].
- Traditional load balancing relies on entropy in packet headers (the five-tuple of source/destination IP, source/destination port, and protocol). This is insufficient for GPU traffic, which often comes from a single IP address per NIC and can therefore oversubscribe a single uplink (see the hashing sketch after this list) [00:08:37].
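A simplified sketch of the entropy problem: with classic five-tuple hashing, a handful of long-lived RDMA flows from one GPU NIC (one source IP, one fixed destination UDP port for RoCEv2) can land on the same uplink while others sit idle. The hash function and flow values below are illustrative only.

```python
# Toy ECMP illustration: five-tuple hashing can map several elephant flows
# from a single GPU NIC onto the same uplink. Values are illustrative.
import hashlib

UPLINKS = 4

def ecmp_uplink(src_ip, dst_ip, proto, src_port, dst_port):
    """Pick an uplink from a hash of the five-tuple (stand-in for ASIC hashing)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % UPLINKS

# One GPU NIC (single source IP) talking RoCEv2 (UDP 4791) to a few peers:
flows = [("10.0.0.1", f"10.0.0.{peer}", "udp", 49152, 4791) for peer in (2, 3, 4, 5)]
for flow in flows:
    print(flow[1], "-> uplink", ecmp_uplink(*flow))
# With so little entropy, multiple 400G flows can end up sharing one uplink.
```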
Power and Cooling
- AI racks require significantly more power than traditional data center racks [00:10:10]. An average rack is 7-15 kW, but a single AI server with 8 GPUs draws 10.2 kW [00:10:41].
- New data centers are being built with racks supporting 100-200 kW, often requiring water cooling instead of air cooling (a quick power-budget check follows this list) [00:10:51].
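A quick power-budget check using the figures above; the division is the only arithmetic involved:

```python
# How many 10.2 kW AI servers fit into different rack power envelopes?
SERVER_KW = 10.2

for rack_kw in (7, 15, 100, 200):        # traditional vs. new high-density racks
    servers = int(rack_kw // SERVER_KW)
    print(f"{rack_kw:>3} kW rack -> {servers} server(s)")
# A traditional 7-15 kW rack fits at most one such server, which is why new
# 100-200 kW (often water-cooled) rack designs are appearing.
```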
Traffic Direction
- Traditional data centers primarily have “north-south” traffic (user to database/web) [00:11:06].
- AI networks exhibit both “east-west” traffic (GPUs speaking to each other at wire rate) and “north-south” traffic (requesting data from storage) [00:11:16]. East-west traffic is particularly intense [00:11:29].
Network Resiliency and Failures
- A single GPU failure can stop the model, unlike typical applications where components can fail over [00:09:31].
- Problems with optics, transceivers, DOM (digital optical monitoring) readings (rates, loss), and cables are common in networks with thousands of GPUs [00:09:45].
Solutions and Best Practices
To address these challenges in production AI networks, specific design principles and technologies are employed.
Network Design Principles
- Isolation: AI networks are designed to be completely isolated from other parts of the enterprise network to protect expensive GPU resources [00:02:48].
- No Over-Subscription: Networks are built “one to one” rather than oversubscribed (e.g., 3:1 or 10:1), ensuring maximum bandwidth for bursty GPU traffic [00:07:19].
- Simple Protocols: Using simple routing protocols like BGP is recommended for efficiency and speed [00:19:29].
- Advanced Load Balancing: To handle GPU traffic from a single IP, advanced load balancing techniques monitor bandwidth utilization on the uplinks (a simplified sketch follows this list) [00:09:11]. This can achieve up to 93% utilization across uplinks [00:09:17].
- Tuned Buffering: Switches are configured with buffers specifically tuned to the packet sizes sent and received by models, optimizing expensive buffering resources [00:16:15].
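A minimal sketch of the utilization-aware idea behind this kind of load balancing, assuming the switch (or its controller) can read per-uplink utilization counters; the interface names and numbers are hypothetical.

```python
# Toy "bandwidth-aware" placement: instead of hashing the five-tuple, place the
# next elephant flow on the least-utilized uplink. Counters are hypothetical.
def pick_uplink(uplink_utilization_gbps: dict) -> str:
    """Return the uplink currently carrying the least traffic."""
    return min(uplink_utilization_gbps, key=uplink_utilization_gbps.get)

# Hypothetical snapshot of per-uplink load on a leaf switch (Gbps):
utilization = {"Ethernet1/1": 310.0, "Ethernet1/2": 95.0,
               "Ethernet1/3": 380.0, "Ethernet1/4": 120.0}
print("place next flow on:", pick_uplink(utilization))   # -> Ethernet1/2
```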
Congestion Control
- RoCEv2 (RDMA over Converged Ethernet v2): This protocol suite is essential for AI networks to prevent congestion meltdowns; a simplified sketch of the ECN/PFC interplay follows this list [00:12:04].
- Explicit Congestion Notification (ECN): An end-to-end flow control mechanism where congested network paths mark packets, signaling the receiver to tell the sender to slow down [00:12:16].
- Priority Flow Control (PFC): A “stop” mechanism used when buffers are full, preventing further packet transmission [00:12:41].
- Lossless Ethernet: While some packet drops might be acceptable, maintaining flow control and lossless Ethernet with ECN and PFC is crucial to avoid significant issues that can slow down synchronized GPUs [00:15:27].
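A highly simplified single-queue model of how the two mechanisms layer: ECN marks packets once queue depth crosses a lower threshold, and PFC pauses the upstream sender only if the queue keeps growing toward the point where packets would otherwise drop. The thresholds are illustrative; real switches tune them per queue and traffic class.

```python
# Simplified model of a lossless RoCEv2 queue. Thresholds are illustrative.
ECN_MARK_THRESHOLD = 200   # KB of queue depth at which packets get ECN-marked
PFC_XOFF_THRESHOLD = 800   # KB at which the switch sends a PFC pause frame
QUEUE_LIMIT        = 1000  # KB of buffer before packets would be dropped

def handle_enqueue(queue_depth_kb: int) -> str:
    """Decide what the queue does as it fills."""
    if queue_depth_kb >= QUEUE_LIMIT:
        return "drop (should not happen if PFC is working)"
    if queue_depth_kb >= PFC_XOFF_THRESHOLD:
        return "send PFC pause: upstream stops transmitting this priority"
    if queue_depth_kb >= ECN_MARK_THRESHOLD:
        return "forward with ECN mark: receiver tells sender to slow down"
    return "forward normally"

for depth in (50, 250, 850, 1000):
    print(f"queue {depth:>4} KB -> {handle_enqueue(depth)}")
```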
Monitoring and Visibility
- RDMA Monitoring: Networks should monitor RDMA (Remote Direct Memory Access) errors; RDMA is a complex protocol with many error codes [00:16:49]. Capturing packet headers and RDMA information makes it possible to identify network problems such as packet drops [00:17:08].
- AI Agent: An AI agent (an API plus code) loaded onto the GPUs can communicate with the switches, verifying correct flow control configuration (PFC, ECN) and providing statistics on packets sent/received and RDMA errors. This helps correlate an issue to either the GPU or the network (a hypothetical correlation sketch follows this list) [00:17:36].
- Proactive Awareness: Telemetry and visibility are crucial for the Network Operations Center to be aware of potential problems before receiving calls about failed models [00:14:49].
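A hypothetical sketch of the kind of correlation described above: polling congestion signals (PFC/ECN) and RDMA error counters together so a stall can be triaged as a GPU/NIC problem or a fabric problem. The counter names and sample values are invented for illustration; a real deployment would pull them from the switch and NIC telemetry interfaces.

```python
# Hypothetical triage logic: counter names and values are invented for illustration.
def classify(counters: dict) -> str:
    """Rough triage from one polling interval's worth of counters."""
    fabric_congested = counters["pfc_pause_frames_rx"] > 0 or counters["ecn_marked_packets"] > 0
    nic_errors = counters["rdma_out_of_sequence"] > 0 or counters["rdma_ack_timeouts"] > 0

    if nic_errors and not fabric_congested:
        return "suspect the GPU/NIC side (RDMA errors without congestion signals)"
    if nic_errors and fabric_congested:
        return "suspect network congestion or loss (RDMA errors alongside ECN/PFC activity)"
    if fabric_congested:
        return "congestion present, but flow control is absorbing it"
    return "healthy interval"

sample = {"pfc_pause_frames_rx": 12, "ecn_marked_packets": 4301,
          "rdma_out_of_sequence": 7, "rdma_ack_timeouts": 0}
print(classify(sample))
```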
Smart System Upgrades
- Network device software can be upgraded without taking switches offline, even in large clusters with thousands of GPUs and dozens of switches. This ensures continuous GPU operation [00:18:32].
Future Outlook
- Network speeds are advancing rapidly: currently 800 Gbps, with 1.6 Tbps expected by the end of 2024 or early 2025 [00:14:25]. Models will continue to grow and consume more bandwidth [00:14:41].
- Ultra Ethernet Consortium: A new initiative aiming to redefine Ethernet for better congestion control, packet spraying, and direct NIC-to-NIC communication. Version 1.0 is expected to be ratified in Q1 2025, with deployments in Q3/Q4 2025 [00:21:23]. This will shift more functionality to NICs, allowing the network to focus on forwarding packets [00:21:53].
Summary
AI networks are distinct from traditional data center networks, requiring specialized design due to their unique characteristics [00:22:04]. The backend network is critical and highly bursty, with synchronized GPUs sending and receiving at the same time [00:22:09]. A slow GPU can act as a barrier, impacting overall job completion time [00:22:21]. While models can checkpoint, this is an expensive process [00:22:34]. Implementing RoCEv2 (with ECN and PFC) and ensuring comprehensive visibility and telemetry are crucial for maintaining network health and proactive problem resolution [00:19:56].