From: aidotengineer
Paul Gil, a tech lead at Arista Networks, focuses on the “plumbing” of AI: how models are trained and how inference is run on the underlying infrastructure [00:00:32]. His work involves designing and building the enterprise networks that support these demanding AI workloads [00:00:23].
Understanding AI Workloads: Training vs. Inference
Job completion time and barriers are crucial concepts in network design for AI [00:00:54]. While building networks for model training is now well understood, inference has evolved significantly with Chain of Thought and reasoning models [00:01:13].
Dr. Wed Sosa’s slide illustrates the scale difference between training and inference: training can involve 248 GPUs running for one to two months, while inference after fine-tuning and alignment might require only four H100 GPUs [00:02:03]. With next-generation models, inference now consumes significant resources, whereas earlier large language models (LLMs) required very little [00:02:27].
AI Network Architecture
AI networks are fundamentally different from traditional data center networks [00:06:31]. They are designed as two distinct, isolated networks:
- Backend Network: This network connects GPUs directly to each other [00:02:40]. It is completely isolated due to the high cost, power consumption, and scarcity of GPUs [00:02:48]. Servers typically contain eight GPUs per pool, connected to high-speed leaf and spine switches [00:03:03]. GPUs can drive 400 Gbps each on this network, with 800 Gbps supported for the next generation of GPUs [00:03:39]. The backend network is always designed to run at wire rate, meaning no oversubscription [00:07:21].
- Frontend Network: This network provides storage access for the training model, allowing GPUs to request more data as needed [00:03:19]. It is not as heavily loaded as the backend network [00:03:34].
An H100 server, a popular AI server, features eight 400 gig GPU ports plus additional Ethernet ports [00:04:30]. These servers generate significant traffic: a single H100 can put 4.8 terabits per second onto the network [00:07:44], and future 800 gig GPUs could push that to 9.6 terabits per second per server [00:08:08].
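As a quick sanity check on those figures, the sketch below totals per-server bandwidth from the port counts quoted in this summary (eight backend plus four frontend ports per server is taken from the text above, not from a vendor datasheet):

```python
# Back-of-the-envelope per-server bandwidth using the port counts quoted above.
# Assumption: 8 backend GPU ports + 4 frontend ports per server (per this summary).

def server_bandwidth_tbps(backend_ports: int, frontend_ports: int, port_gbps: int) -> float:
    """Total traffic one server can put onto the network, in terabits per second."""
    return (backend_ports + frontend_ports) * port_gbps / 1000

print(server_bandwidth_tbps(8, 4, 400))  # 4.8 Tbps for an H100-class server
print(server_bandwidth_tbps(8, 4, 800))  # 9.6 Tbps with next-generation 800 gig ports
```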
Networking Challenges in AI Infrastructure
AI networks present unique challenges:
- Hardware and Software Differences
- GPU Hardware: GPU servers are new to many networking professionals and have specific port configurations (e.g., eight 400 gig ports on the backend, four 400 gig ports on the frontend) [00:06:21].
- Protocols: CUDA and NCCL (the NVIDIA Collective Communications Library) are the key software layers, with NCCL’s “collective” behavior significantly influencing network traffic [00:06:03].
- Application Traffic Patterns
- Synchronized Burstiness: All GPUs in an AI network will burst traffic at the same time, potentially reaching their maximum speed (e.g., 400 gig) simultaneously [00:07:03]. This contrasts with typical data center applications that are easier to balance [00:06:34].
- Failure Impact: If one GPU or component fails, the entire training job may fail, unlike traditional applications with built-in failover [00:06:53] (the toy model after this list shows how one slow or failed GPU gates all the others).
- Power Consumption: AI racks require significantly more power than traditional data center racks. An average rack is 7-15 kW, but a single AI server with eight GPUs can draw 10.2 kW [00:10:39]. This necessitates new infrastructure with 100-200 kW racks, often requiring water cooling instead of air cooling [00:10:51] (a rough rack-power sketch also follows this list).
- Traffic Direction: AI networks have both East-West (GPU-to-GPU within the backend network) and North-South (GPU-to-storage on the frontend network) traffic patterns [00:11:03]. East-West traffic is particularly intense, running at wire rate [00:11:29].
- Buffering and Congestion Control: Network switches need robust buffering and congestion control mechanisms to handle the high-speed, bursty traffic [00:11:50].
- Single Point of Failure: A single GPU failure can stop a model training job [00:09:31]. Cable problems and transceiver issues are also prevalent in large-scale GPU networks [00:09:53].
- Network Complexity: AI networks are kept simple and isolated, often without firewalls, load balancers, or direct internet connections, to avoid performance bottlenecks [00:13:00].
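The synchronized, barrier-style behavior described in the list above can be made concrete with a small toy model (illustrative only, with made-up timings): every GPU must finish its communication phase before any of them can start the next step, so step time is set by the slowest participant.

```python
# Toy model of one synchronous training step: all GPUs exchange data, then hit a
# barrier, so the step finishes only when the slowest GPU does. Timings are made up.

def step_time_ms(per_gpu_comm_ms: list[float]) -> float:
    """With a barrier, job progress is gated by the slowest GPU."""
    return max(per_gpu_comm_ms)

healthy = [10.2, 10.5, 10.1, 10.4, 10.3, 10.2, 10.6, 10.3]
print(step_time_ms(healthy))    # ~10.6 ms when the fabric is well balanced

degraded = healthy[:]
degraded[3] *= 5                # one GPU behind a congested or flapping link
print(step_time_ms(degraded))   # ~52 ms: every other GPU now waits on the straggler
```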
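The power figures above translate directly into rack planning. The arithmetic below is a simple sketch using only the numbers quoted in this summary (10.2 kW per eight-GPU server, 7-15 kW legacy racks, 100-200 kW AI racks):

```python
# How many eight-GPU servers fit in a rack, using the power figures quoted above.
# Simple sketch; real designs also budget for switches, PDUs, and cooling overhead.

SERVER_KW = 10.2           # one eight-GPU H100-class server (per this summary)
LEGACY_RACK_KW = 15        # top of the 7-15 kW range for a traditional rack
AI_RACK_KW = 150           # midpoint of the 100-200 kW range quoted above

print(int(LEGACY_RACK_KW // SERVER_KW))  # 1  -> barely one server per legacy rack
print(int(AI_RACK_KW // SERVER_KW))      # 14 -> a dense, typically water-cooled rack
```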
Key Network Protocols and Technologies for AI
- Routing Protocols: Simple protocols such as iBGP or eBGP are used [00:04:09]; BGP is recommended for its simplicity and speed [00:19:29].
- Congestion Control (RoCEv2): Essential for AI networks, RoCEv2 (RDMA over Converged Ethernet v2) has two main components [00:12:04]:
- Explicit Congestion Notification (ECN): An end-to-end flow control where marked packets inform the receiver of congestion, prompting the sender to slow down [00:12:17].
- Priority Flow Control (PFC): A “stop” mechanism that halts traffic when buffers are full, preventing packet drops [00:12:41]. PFC and ECN must be correctly configured to avoid network disaster [00:17:58] (the toy buffer model after this list shows how the two are typically layered).
- Remote Direct Memory Access (RDMA): Used for memory-to-memory writes, bypassing the CPU for faster data transfer [00:16:47]. RDMA is a complex protocol with numerous error codes, which can indicate network problems [00:16:56].
- GPU Communication Protocols: CUDA and NCCL are critical for GPU communication, especially NCCL’s “collective” operations, which drive the network traffic patterns [00:06:03] (a minimal NCCL all-reduce example follows this list).
- Ultra Ethernet Consortium (UEC): This consortium is working on evolving Ethernet to better handle AI traffic patterns, focusing on improved congestion control, packet spraying, and direct communication between Network Interface Cards (NICs) [00:21:24]. Version 1.0 is expected to be ratified in Q1 2025 [00:21:41].
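The way ECN and PFC layer on top of each other is easier to see in a toy queue model. The thresholds below are illustrative assumptions, not a real switch configuration: ECN marking starts early so senders back off, and a PFC pause is the last-resort backstop before the buffer overflows.

```python
# Toy queue model of how RoCEv2 congestion control is usually layered:
# ECN marking kicks in early so senders slow down; PFC pause is the backstop
# that prevents drops if the buffer keeps filling. Thresholds are assumptions.

ECN_THRESHOLD = 0.6   # start marking packets at 60% buffer occupancy (assumed)
PFC_THRESHOLD = 0.9   # send a pause frame at 90% buffer occupancy (assumed)

def switch_action(buffer_occupancy: float) -> str:
    if buffer_occupancy >= PFC_THRESHOLD:
        return "send PFC pause upstream (stop traffic, avoid drops)"
    if buffer_occupancy >= ECN_THRESHOLD:
        return "forward with ECN mark (receiver tells sender to slow down)"
    return "forward unmarked"

for occupancy in (0.3, 0.7, 0.95):
    print(f"{occupancy:.0%} full -> {switch_action(occupancy)}")
```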
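To see what NCCL “collective” traffic looks like from the application side, the generic PyTorch snippet below (not code from the talk) runs an all-reduce with the NCCL backend; during this call every rank exchanges data with its peers simultaneously, which is exactly the synchronized, bursty pattern the backend network has to absorb.

```python
# Minimal NCCL all-reduce with PyTorch distributed. Launch with, for example:
#   torchrun --nproc_per_node=8 allreduce_demo.py
# Each rank contributes a tensor; after all_reduce every rank holds the sum.
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")        # NCCL drives the GPU-to-GPU traffic
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for a gradient bucket; real jobs all-reduce many large buffers per step.
    tensor = torch.ones(1024 * 1024, device="cuda") * rank
    dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # the collective that bursts on the fabric

    print(f"rank {rank}: element value after sum = {tensor[0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```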
Network Optimization and Management
- No Oversubscription: Networks are built as “one-to-one” rather than “one-to-ten” or “one-to-three” as in traditional data centers, ensuring maximum bandwidth for GPUs [00:07:21].
- Advanced Load Balancing: Standard load balancing (hashing on header entropy such as IP addresses and ports) is insufficient because GPUs may share a single IP address, so flows can pile onto one uplink and oversubscribe it [00:08:42]. Newer methods load balance on the percentage of bandwidth already in use on each uplink, achieving up to 93% utilization [00:09:14]. “Cluster load balancing” goes further and looks at the NCCL collective running on the GPUs [00:19:47] (an illustrative comparison follows this list).
- Buffer Management: Network switch buffers are tuned to the specific packet sizes sent and received by models, optimizing the use of this expensive commodity [00:16:15].
- Visibility and Telemetry: Enhanced monitoring is crucial to proactively identify network issues before they impact AI jobs [00:14:48]. This includes tracking RDMA error codes and understanding why packets are dropped [00:17:06].
- AI Agent for GPU-Network Correlation: Arista has developed an AI agent (API and code) that runs on NVIDIA GPUs [00:17:45]. The agent communicates with the network switch to verify flow control configuration (PFC and ECN) and reports packet and RDMA error statistics, allowing problems to be correlated between the GPU and the network [00:17:50] (a hypothetical sketch of this kind of correlation follows the list).
- Smart System Upgrade: Allows for upgrading switch software without taking the switch offline, ensuring continuous operation of GPUs [00:18:47].
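The difference between classic hash-based ECMP and the utilization-aware placement described above can be sketched as follows (illustrative only, not Arista's implementation): when RoCE flows share nearly identical headers, a hash sends them all to the same uplink, while picking the least-loaded uplink spreads the same flows evenly.

```python
# Illustrative contrast between hash-based ECMP and utilization-aware uplink
# placement. Not vendor code; the flows and uplink count are made up.
import hashlib

UPLINKS = 4
# RoCEv2 flows from GPUs sharing one IP look nearly identical to a header hash
# (UDP port 4791 is the standard RoCEv2 destination port).
flows = [(("10.0.0.1", "10.0.1.1", 4791, 4791, "udp"), 400) for _ in range(4)]

def hash_ecmp(flows, n_uplinks):
    """Classic ECMP: hash the header fields; identical headers pile onto one link."""
    load = [0] * n_uplinks
    for header, gbps in flows:
        idx = int(hashlib.md5(str(header).encode()).hexdigest(), 16) % n_uplinks
        load[idx] += gbps
    return load

def utilization_aware(flows, n_uplinks):
    """Place each flow on the currently least-loaded uplink instead of hashing."""
    load = [0] * n_uplinks
    for _, gbps in flows:
        load[load.index(min(load))] += gbps
    return load

print("hash ECMP:         ", hash_ecmp(flows, UPLINKS))          # e.g. [0, 1600, 0, 0]
print("utilization-aware: ", utilization_aware(flows, UPLINKS))  # [400, 400, 400, 400]
```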
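As a purely hypothetical sketch of the kind of GPU-to-network correlation described above (all field names, hosts, and numbers are invented for illustration), the idea is to join per-GPU RDMA error counters with per-port switch counters and flag links where both sides report trouble:

```python
# Hypothetical sketch of GPU/network correlation: join per-GPU RDMA error counts
# with per-port switch counters and flag links where both report trouble.
# All field names, hosts, and numbers here are invented for illustration.

gpu_rdma_errors = [  # e.g. exported by an agent running on each GPU host
    {"host": "gpu-node-01", "port": "eth1", "rdma_out_of_sequence": 42},
    {"host": "gpu-node-02", "port": "eth1", "rdma_out_of_sequence": 0},
]
switch_counters = [  # e.g. streamed from the connected leaf switch
    {"host": "gpu-node-01", "port": "eth1", "ecn_marked": 9000, "pfc_pause_rx": 120},
    {"host": "gpu-node-02", "port": "eth1", "ecn_marked": 15, "pfc_pause_rx": 0},
]

switch_by_link = {(s["host"], s["port"]): s for s in switch_counters}
for gpu in gpu_rdma_errors:
    sw = switch_by_link.get((gpu["host"], gpu["port"]), {})
    if gpu["rdma_out_of_sequence"] and sw.get("pfc_pause_rx", 0):
        print(f'{gpu["host"]}/{gpu["port"]}: RDMA errors coincide with PFC pauses -> inspect this link')
```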
Network Scalability Examples
- A 1,400 gig cluster would use a spine-and-leaf architecture with no oversubscription, featuring 800 gig links between leaf and spine switches and 400 gig links down to the GPUs [00:20:25].
- For clusters with thousands of GPUs, larger switches (e.g., Arista’s 7800 series 16-slot boxes) are used, which can accommodate 576 x 800 gig GPUs or 1150 x 400 gig GPUs [00:20:40].
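The scale figures above come down to port arithmetic. The sketch below shows the generic 1:1 sizing calculation for the 400 gig down / 800 gig up design mentioned earlier; the leaf port count is an illustrative assumption, not a specific Arista model.

```python
# Rough 1:1 (no oversubscription) leaf sizing for a design with 400 gig links
# down to the GPUs and 800 gig links up to the spines. The 48-port leaf is an
# illustrative assumption, not a specific switch model.

def gpu_ports_per_leaf(total_ports: int, gpu_gbps: int = 400, uplink_gbps: int = 800) -> int:
    """Split leaf ports so GPU-facing bandwidth equals spine-facing bandwidth.

    Solves n_gpu * gpu_gbps == n_up * uplink_gbps with n_gpu + n_up == total_ports.
    """
    return total_ports * uplink_gbps // (gpu_gbps + uplink_gbps)

ports = 48
n_gpu = gpu_ports_per_leaf(ports)   # 32 GPU ports  * 400G = 12.8 Tbps down
n_up = ports - n_gpu                # 16 spine ports * 800G = 12.8 Tbps up
print(n_gpu, n_up)                  # 32 16 -> wire rate, no oversubscription
```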
Summary
AI networks require a specialized approach:
- The backend network, connecting GPUs, is the most critical and bursty part [00:22:09].
- GPUs are synchronized, meaning a slow GPU creates a “barrier” that slows down all others [00:22:16].
- Job completion time is the key metric [00:22:23].
- Essential network features include no oversubscription, advanced load balancing (especially “cluster load balancing”), RoCEv2 (PFC/ECN) for congestion control, and robust visibility/telemetry [00:19:09].
- Network speeds are rapidly increasing, with 800 gig currently available and 1.6 terabit expected by late 2026 or early 2027 [00:14:25].