From: aidotengineer
Paul Gil, a Tech Lead at Arista Networks, specializes in designing and building enterprise networks, particularly the underlying infrastructure for training AI models and running inference [00:00:16] [00:00:34]. His work focuses on the “plumbing” of AI systems: what the infrastructure looks like for these demanding workloads [00:00:30].
Training vs. Inference Workloads
The infrastructure requirements differ significantly between model training and inferencing [00:01:36].
- Training: Training has historically required on the order of 18 times the resources of inference [00:01:40]. A typical example involves training a model on 248 GPUs for one to two months (see the rough arithmetic sketch after this list) [00:02:05].
- Inference: After fine-tuning and alignment, the same model might only require four H100 GPUs for inference [00:02:10]. However, with the advent of Chain of Thought and reasoning models, the nature and scale of inference are evolving, becoming more intensive than previously seen with traditional Large Language Models (LLMs) [00:01:13] [00:02:27].
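As a rough back-of-the-envelope comparison using only the figures quoted above, a minimal sketch follows; the day counts and the assumption of continuous serving are illustrative, not a sizing guide.

```python
# Rough footprint comparison using the figures quoted above (illustrative only).
TRAIN_GPUS = 248           # GPUs in the training example
TRAIN_DAYS = (30, 60)      # "one to two months"
INFER_GPUS = 4             # H100s serving the fine-tuned model

train_gpu_hours = [TRAIN_GPUS * 24 * days for days in TRAIN_DAYS]
print(f"Training: ~{train_gpu_hours[0]:,} to ~{train_gpu_hours[1]:,} GPU-hours, one-off")

# Inference is an ongoing cost: assume the four GPUs serve continuously.
infer_gpu_hours_per_month = INFER_GPUS * 24 * 30
print(f"Inference: ~{infer_gpu_hours_per_month:,} GPU-hours per month of serving")
```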
AI Network Architecture
AI infrastructure typically comprises two distinct network types:
Backend Network
This network connects GPUs directly and is designed to be completely isolated due to the high cost and scarcity of GPUs [00:02:40] [00:02:48].
- GPU Servers: Servers (e.g., from NVIDIA or Supermicro) typically contain eight GPUs, which connect to high-speed leaf and spine switches [00:03:03] [00:03:10].
- Traffic Intensity: GPUs in the backend network operate at extremely high speeds, with current models running at 400 gigabits per second (Gbps) [00:03:39]. These speeds are unprecedented in typical enterprise data centers [00:03:46].
- Isolation: No other devices are connected to this network to prevent interference and ensure maximum uptime for expensive GPU resources [00:02:57].
- Protocols: Simple routing protocols such as iBGP or eBGP are used to keep the network as efficient as possible [00:04:08].
Frontend Network
This network provides storage access to the GPUs for training data [00:03:19].
- Traffic Profile: It is less intense than the backend network because current storage vendors cannot match the speeds required by the GPUs [00:03:34] [00:11:37]. Storage traffic typically runs at 100-200 Gbps [00:11:43].
GPU and Server Specifications
Servers built around the NVIDIA H100 are noted as the most popular AI servers currently deployed [00:04:23] [00:04:27].
- Connectivity: An H100 server features eight 400 Gbps GPU ports (broken out from four physical ports) and additional Ethernet ports [00:04:30] [00:04:40].
- Traffic Capacity: A single H100 server with eight 400 Gbps GPU ports and four 400 Gbps front-end ports can generate 4.8 terabits per second (Tbps) of traffic (see the arithmetic sketch after this list) [00:07:44].
- Future Speeds: 800 Gbps GPUs (B-series) are expected soon, potentially increasing per-server traffic to 9.6 terabits per second (Tbps) [00:08:00] [00:08:14].
- Scalability: AI networks are designed for scale-out: deployments can start small and grow by adding more GPUs, reaching hundreds of thousands of GPUs in cloud environments [00:05:30] [00:05:38]. Scale-up (adding more GPUs to an existing server) is not typical for NVIDIA servers such as the DGX or HGX [00:05:16].
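As a quick check on the traffic figures above, here is a minimal sketch of the per-server arithmetic, assuming every GPU and front-end port bursts at line rate simultaneously (the worst case the fabric is built for), and that front-end ports move to 800 Gbps alongside the B-series GPUs.

```python
def server_peak_tbps(gpu_ports: int, gpu_gbps: int,
                     frontend_ports: int = 4, frontend_gbps: int = 400) -> float:
    """Worst-case traffic one server can offer if every port bursts at line rate."""
    total_gbps = gpu_ports * gpu_gbps + frontend_ports * frontend_gbps
    return total_gbps / 1000  # Gbps -> Tbps

print(server_peak_tbps(8, 400))                      # H100-class server: 4.8 Tbps
print(server_peak_tbps(8, 800, frontend_gbps=800))   # B-series estimate: 9.6 Tbps
```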
Unique Challenges and Solutions in AI Networking
Network Design
- No Oversubscription: AI networks, particularly the backend, are built with a one-to-one subscription ratio, meaning bandwidth is fully provisioned without oversubscription. This contrasts with traditional data centers that might use ratios of 1:10 or 1:3 due to cost considerations [00:07:19] [00:07:36].
- Traffic Patterns:
- East-West Traffic: GPUs communicate directly with each other (east-west traffic), which is highly bursty, with all GPUs potentially bursting at 400 Gbps simultaneously [00:07:03] [00:11:16]. This traffic runs at wire rate.
- North-South Traffic: When GPUs request more data from storage, it’s north-south traffic [00:11:22].
- Load Balancing: Traditional load balancing (e.g., hashing on the five-tuple) is inefficient because GPU traffic consists of a few large flows with little header entropy, often sharing a single IP address, so several flows can land on and oversubscribe the same uplink [00:08:37]. Advanced tools now load balance based on the percentage of bandwidth utilized on each uplink, achieving up to 93% utilization (see the sketch after this list) [00:09:14] [00:19:43]. Cluster load balancing goes further and considers the collective operation being run [00:19:51].
- Addressing: Point-to-point connections (e.g., /30 or /31 subnets) are preferred, with IPv6 as an option if address space is a concern (see the addressing sketch after this list) [00:19:17]. BGP is recommended as the routing protocol [00:19:29]. EVPN VXLAN can be used for multi-tenancy [00:19:35].
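A minimal sketch of the difference between five-tuple hashing and utilization-aware placement follows; the uplink names, flow tuples, and load figures are made up for illustration, and real implementations run on the switch, not in Python.

```python
import hashlib

UPLINKS = ["Ethernet1/1", "Ethernet1/2", "Ethernet1/3", "Ethernet1/4"]  # hypothetical leaf uplinks

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    """Classic ECMP: hash the five-tuple to choose an uplink. With only a few
    large RDMA flows between the same pair of servers, several flows can hash
    onto the same link and oversubscribe it."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return UPLINKS[int(hashlib.sha256(key).hexdigest(), 16) % len(UPLINKS)]

def utilization_pick(link_utilization):
    """Utilization-aware placement: put the next flow on the least-loaded uplink."""
    return min(link_utilization, key=link_utilization.get)

# A few elephant flows between the same two GPU servers (little header entropy).
flows = [("10.0.0.1", "10.0.0.2", 49152 + i, 4791) for i in range(4)]
print("ECMP placement:", [ecmp_pick(*f) for f in flows])

# Utilization-aware placement rebalances as links fill up (load numbers are illustrative).
util = {u: 0.0 for u in UPLINKS}
for f in flows:
    link = utilization_pick(util)
    util[link] += 25.0  # assume each flow adds ~25% load on a 400G uplink
print("Utilization-aware placement:", util)
```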
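For the point-to-point addressing, a small sketch using Python's ipaddress module shows how an example block carves into /31 links, one per leaf-to-spine connection; the prefix and device names are arbitrary placeholders.

```python
import ipaddress

# Carve an example block into /31 point-to-point links, one per leaf-to-spine connection.
block = ipaddress.ip_network("10.255.0.0/24")
p2p_links = list(block.subnets(new_prefix=31))   # 128 two-address links

for i, link in enumerate(p2p_links[:4]):
    leaf_ip, spine_ip = list(link)               # a /31 holds exactly two addresses
    print(f"leaf{i}  {leaf_ip}/31  <-- eBGP -->  spine  {spine_ip}/31")
```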
Power and Cooling
- High Power Draw: A single AI server with eight GPUs can draw 10.2 kilowatts (kW), compared to 7-15 kW for an entire traditional data center rack (see the sketch after this list) [00:10:41] [00:10:21].
- Advanced Racks: Enterprises are now building racks capable of handling 100-200 kW, necessitating water cooling instead of traditional air cooling [00:10:51] [00:10:55].
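A quick sketch of what those figures imply for rack density, assuming the 10.2 kW per-server draw above and ignoring cooling and networking overhead:

```python
SERVER_KW = 10.2                      # eight-GPU AI server, from the figure above
for rack_kw in (15, 100, 200):        # traditional rack ceiling vs. new high-density racks
    servers = int(rack_kw // SERVER_KW)
    print(f"{rack_kw:>3} kW rack: room for ~{servers} AI server(s)")
```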
Fault Tolerance and Monitoring
- GPU Dependencies: Unlike traditional data center applications, where a failure might cause only a brief, barely noticeable interruption, a single GPU failure in an AI network can halt the entire job, making reliability paramount [00:06:48] [00:09:34]. Job completion time is the critical metric [00:22:23].
- Lossless Ethernet and Congestion Control: GPU traffic uses Remote Direct Memory Access (RDMA) over Ethernet, which tolerates packet loss poorly, so AI networks implement lossless Ethernet with congestion-control mechanisms to prevent drops that would stall synchronized GPUs [00:15:25] [00:16:47]. Key mechanisms include (a small model of how they layer follows this list):
- Explicit Congestion Notification (ECN): An end-to-end mechanism that marks packets during congestion, prompting the sender to slow down [00:12:17].
- Priority Flow Control (PFC): A hop-by-hop “stop” mechanism that pauses traffic when buffers fill, acting as an emergency brake [00:12:41].
- Telemetry and Monitoring: Crucial for proactive problem identification. Switches can provide detailed insights into packet drops, including why they occurred and any associated RDMA error codes [00:14:48] [00:17:06].
- AI Agent for GPU-Network Correlation: A network-side AI agent (an API plus code) can be loaded onto GPUs to communicate with network switches [00:17:36]. This agent verifies network configuration (e.g., PFC/ECN settings) and provides statistics on packets sent/received and RDMA errors, allowing GPU and network problems to be correlated (a rough illustration follows this list) [00:17:53].
- Smart System Upgrade: Networks can be upgraded without taking switches offline, enabling continuous operation of GPUs during maintenance [00:18:32].
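To make the layering of ECN and PFC concrete, here is a minimal queue model; the threshold values are illustrative stand-ins for what are, on a real switch, buffer tuning parameters.

```python
# Illustrative queue model: ECN engages first, PFC is the last resort before drops.
ECN_MARK_THRESHOLD = 0.60   # mark packets when the queue is 60% full (example value)
PFC_PAUSE_THRESHOLD = 0.90  # send a pause frame at 90% (example value)

def handle_queue_depth(fill_ratio: float) -> str:
    """Return the congestion action a lossless-Ethernet queue would take."""
    if fill_ratio >= PFC_PAUSE_THRESHOLD:
        return "PFC: pause the upstream port (emergency brake, hop by hop)"
    if fill_ratio >= ECN_MARK_THRESHOLD:
        return "ECN: mark packets; receiver tells the sender to slow down (end to end)"
    return "forward normally"

for depth in (0.30, 0.70, 0.95):
    print(f"queue {depth:.0%} full -> {handle_queue_depth(depth)}")
```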
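The GPU-to-network correlation described above could be pictured roughly as below; the counter names and checks are hypothetical stand-ins and do not reflect Arista's actual agent API.

```python
# Hypothetical stand-ins for the two telemetry sources the agent correlates.
gpu_nic_stats = {"packets_sent": 1_000_000, "rdma_out_of_sequence": 12, "pfc_pause_rx": 3}
switch_port_stats = {"packets_received": 999_950, "ecn_marked": 4_200, "drops": 50,
                     "pfc_enabled": True, "ecn_enabled": True}

def correlate(gpu: dict, switch: dict) -> list[str]:
    """Flag mismatches between what the GPU's NIC saw and what the switch saw."""
    findings = []
    if not (switch["pfc_enabled"] and switch["ecn_enabled"]):
        findings.append("Lossless Ethernet misconfigured: PFC/ECN not enabled on the port")
    if switch["drops"] > 0:
        findings.append(f"{switch['drops']} drops on the switch port; check buffer/ECN thresholds")
    if gpu["rdma_out_of_sequence"] > 0:
        findings.append("RDMA out-of-sequence errors on the GPU side; correlate with switch drops")
    return findings

for finding in correlate(gpu_nic_stats, switch_port_stats):
    print(finding)
```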
Future Developments
- Ultra Ethernet Consortium (UEC): This consortium aims to evolve Ethernet to better handle AI traffic patterns, specifically improving congestion control and packet spraying, and optimizing communication between Network Interface Cards (NICs) [00:21:23]. Version 1.0 is expected to be ratified in Q1 2025 [00:21:41].
AI networks are rapidly evolving, with speeds increasing to 800 Gbps and 1.6 Tbps projected by the end of 2026 or early 2027 [00:14:25] [00:14:30] [00:21:26]. This continuous growth in model size and data consumption necessitates specialized and robust network infrastructure [00:14:41].