SXM (socket)

SXM (Server PCI Express Module[1]) is a high-bandwidth socket solution for connecting Nvidia compute accelerators to a system. Every generation of Nvidia Tesla since the P100, along with the DGX computer series and the HGX boards, comes with an SXM socket type that provides high bandwidth, power delivery and more for the matching GPU daughter cards.[2] Nvidia offers these combinations as end-user products, for example in its DGX system series. Current socket generations are SXM for Pascal-based GPUs, SXM2 and SXM3 for Volta-based GPUs, SXM4 for Ampere-based GPUs, and SXM5 for Hopper-based GPUs. These sockets are used for specific models of these accelerators and offer higher performance per card than their PCIe equivalents.[2] The DGX-1 system was the first to be equipped with SXM2 sockets and thus the first to carry the form-factor-compatible SXM modules with P100 GPUs; it was later revealed that it could be upgraded to (or pre-equipped with) SXM2 modules with V100 GPUs.[3][4]

Computing node of TSUBAME 3.0 supercomputer showing four NVIDIA Tesla P100 SXM modules
Bare SXM sockets next to sockets with GPUs installed

SXM boards are typically built with four or eight GPU slots, although some solutions such as the Nvidia DGX-2 connect multiple boards to deliver higher performance. While third-party solutions for SXM boards exist, most system integrators such as Supermicro use prebuilt Nvidia HGX boards, which come in four- or eight-socket configurations.[5] This approach greatly lowers the cost and difficulty of building SXM-based GPU servers, and enables compatibility and reliability across all boards of the same generation.

SXM modules on boards such as HGX, particularly in recent generations, may have NVLink switches to allow faster GPU-to-GPU communication. This also reduces bottlenecks that would otherwise arise at the CPU and PCIe.[2][6] The GPUs on the daughter cards use NVLink as their main communication protocol. For example, a Hopper-based H100 SXM5 GPU can use up to 900 GB/s of bandwidth across 18 NVLink 4 channels, each contributing 50 GB/s;[7] by comparison, PCIe 5.0 can handle up to 64 GB/s within an x16 slot.[8] This high bandwidth also means that GPUs can share memory over the NVLink bus, allowing an entire HGX board to present to the host system as a single, massive GPU.[9]
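The bandwidth comparison above is simple arithmetic, and a minimal Python sketch makes it explicit. The constant and function names are illustrative only; the figures are the ones quoted in the paragraph, not values read from any Nvidia tool or API.

```python
# Figures quoted in the text above; all names here are illustrative.
NVLINK4_LINKS_PER_H100 = 18   # NVLink 4 channels on an H100 SXM5 GPU
NVLINK4_GB_S_PER_LINK = 50    # GB/s contributed by each channel
PCIE5_X16_GB_S = 64           # GB/s quoted for a PCIe 5.0 x16 slot


def aggregate_bandwidth(links: int, gb_s_per_link: float) -> float:
    """Total bandwidth in GB/s when all links carry traffic in parallel."""
    return links * gb_s_per_link


nvlink_total = aggregate_bandwidth(NVLINK4_LINKS_PER_H100, NVLINK4_GB_S_PER_LINK)
ratio = nvlink_total / PCIE5_X16_GB_S

print(f"NVLink 4 aggregate: {nvlink_total} GB/s")      # 900 GB/s
print(f"Advantage over PCIe 5.0 x16: {ratio:.1f}x")    # 14.1x
```

On these quoted figures, the NVLink path offers roughly fourteen times the bandwidth of a single PCIe 5.0 x16 slot.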

Power delivery is also handled by the SXM socket, removing the need for external power cables such as those required by equivalent PCIe cards. This, combined with the horizontal mounting, allows more efficient cooling options, which in turn lets SXM-based GPUs operate at a much higher TDP. The Hopper-based H100, for example, can draw up to 700 W solely from the SXM socket.[10] The lack of cabling also makes assembling and repairing large systems much easier and reduces the possible points of failure.[2]
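To give a sense of scale for socket-delivered power, the following sketch multiplies the quoted 700 W per-GPU figure by the four- and eight-socket board sizes mentioned earlier. It is a rough upper bound for the GPUs alone, under the assumption that every socket holds an H100 at full TDP; it ignores NVLink switches, CPUs, fans and power-supply losses, and the names are hypothetical.

```python
H100_SXM5_TDP_W = 700  # per-GPU maximum quoted in the text above


def board_gpu_power(gpu_count: int, tdp_watts: int = H100_SXM5_TDP_W) -> int:
    """Upper bound on GPU power (watts) for a board with gpu_count sockets."""
    return gpu_count * tdp_watts


# Four- and eight-socket configurations, as described for HGX boards.
budgets = {sockets: board_gpu_power(sockets) for sockets in (4, 8)}

for sockets, watts in budgets.items():
    print(f"{sockets} sockets: up to {watts} W for the GPUs alone")
```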

The early Nvidia Tegra evaluation board for automotive use, 'Drive PX2', had two MXM (Mobile PCI Express Module) sockets on both sides of the card. This dual-MXM design can be considered a predecessor to the Nvidia Tesla implementation of the SXM socket.

Comparison of accelerators used in DGX:[11][12][13]

| Specification | H100 | A100 80GB | A100 40GB | V100 32GB | V100 16GB | P100 |
|---|---|---|---|---|---|---|
| Architecture | Hopper | Ampere | Ampere | Volta | Volta | Pascal |
| Socket | SXM5 | SXM4 | SXM4 | SXM3 | SXM2 | SXM/SXM2 |
| FP32 CUDA cores | 16896 | 6912 | 6912 | 5120 | 5120 | N/A |
| FP64 cores (excl. Tensor) | 4608 | 3456 | 3456 | 2560 | 2560 | 1792 |
| Mixed INT32/FP32 cores | 16896 | 6912 | 6912 | N/A | N/A | 3584 |
| INT32 cores | N/A | N/A | N/A | 5120 | 5120 | N/A |
| Boost clock | 1780 MHz | 1410 MHz | 1410 MHz | 1530 MHz | 1530 MHz | 1480 MHz |
| Memory clock | 4.8 Gbit/s HBM3 | 3.2 Gbit/s HBM2 | 2.4 Gbit/s HBM2 | 1.75 Gbit/s HBM2 | 1.75 Gbit/s HBM2 | 1.4 Gbit/s HBM2 |
| Memory bus width | 5120-bit | 5120-bit | 5120-bit | 4096-bit | 4096-bit | 4096-bit |
| Memory bandwidth | 3072 GB/s | 2039 GB/s | 1555 GB/s | 900 GB/s | 900 GB/s | 720 GB/s |
| VRAM | 80 GB | 80 GB | 40 GB | 32 GB | 16 GB | 16 GB |
| Single precision (FP32) | 60 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS | 15.7 TFLOPS | 15.7 TFLOPS | 10.6 TFLOPS |
| Double precision (FP64) | 30 TFLOPS | 9.7 TFLOPS | 9.7 TFLOPS | 7.8 TFLOPS | 7.8 TFLOPS | 5.3 TFLOPS |
| INT8 (non-Tensor) | N/A | N/A | N/A | 62 TOPS | 62 TOPS | N/A |
| INT8 dense Tensor | 4000 TOPS | 624 TOPS | 624 TOPS | N/A | N/A | N/A |
| INT32 | N/A | 19.5 TOPS | 19.5 TOPS | 15.7 TOPS | 15.7 TOPS | N/A |
| FP16 | N/A | 78 TFLOPS | 78 TFLOPS | 31.4 TFLOPS | 31.4 TFLOPS | 21.2 TFLOPS |
| FP16 dense Tensor | 2000 TFLOPS | 312 TFLOPS | 312 TFLOPS | 125 TFLOPS | 125 TFLOPS | N/A |
| bfloat16 dense Tensor | 2000 TFLOPS | 312 TFLOPS | 312 TFLOPS | N/A | N/A | N/A |
| TensorFloat-32 (TF32) dense Tensor | 1000 TFLOPS | 156 TFLOPS | 156 TFLOPS | N/A | N/A | N/A |
| FP64 dense Tensor | 60 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS | N/A | N/A | N/A |
| Interconnect (NVLink) | 900 GB/s | 600 GB/s | 600 GB/s | 300 GB/s | 300 GB/s | 160 GB/s |
| GPU | GH100 | GA100 | GA100 | GV100 | GV100 | GP100 |
| L1 cache size | 25344 KB (192 KB × 132) | 20736 KB (192 KB × 108) | 20736 KB (192 KB × 108) | 10240 KB (128 KB × 80) | 10240 KB (128 KB × 80) | 1344 KB (24 KB × 56) |
| L2 cache size | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB |
| TDP | 700 W | 400 W | 400 W | 350 W | 300 W | 300 W |
| GPU die size | 814 mm² | 826 mm² | 826 mm² | 815 mm² | 815 mm² | 610 mm² |
| Transistor count | 80 B | 54.2 B | 54.2 B | 21.1 B | 21.1 B | 15.3 B |
| Manufacturing process | TSMC 4 nm N4 | TSMC 7 nm N7 | TSMC 7 nm N7 | TSMC 12 nm FFN | TSMC 12 nm FFN | TSMC 16 nm FinFET+ |
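
Several columns of the comparison table above can be approximately derived from others, which is a useful sanity check when reading it. The sketch below assumes two common conventions that the article does not state: peak FP32 throughput is taken as two floating-point operations (one fused multiply-add) per FP32-capable core per boost-clock cycle, and memory bandwidth as bus width times the per-pin data rate from the memory clock column, divided by eight bits per byte. The class and field names are hypothetical; the numbers are copied from the table, with the two V100 variants merged because they share these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accelerator:
    name: str
    fp32_cores: int                 # FP32 or mixed INT32/FP32 core count
    boost_mhz: float                # boost clock
    bus_width_bits: int             # memory bus width
    pin_rate_gbit_s: float          # "memory clock" column, read as per-pin rate
    listed_fp32_tflops: float       # table value, for comparison
    listed_bandwidth_gb_s: float    # table value, for comparison

    def peak_fp32_tflops(self) -> float:
        # Assumption: one FMA (2 FLOPs) per core per clock cycle.
        return 2 * self.fp32_cores * self.boost_mhz * 1e6 / 1e12

    def memory_bandwidth_gb_s(self) -> float:
        # Assumption: every bus pin transfers pin_rate_gbit_s; 8 bits per byte.
        return self.bus_width_bits * self.pin_rate_gbit_s / 8


TABLE = [
    Accelerator("H100", 16896, 1780, 5120, 4.8, 60.0, 3072),
    Accelerator("A100 80GB", 6912, 1410, 5120, 3.2, 19.5, 2039),
    Accelerator("A100 40GB", 6912, 1410, 5120, 2.4, 19.5, 1555),
    Accelerator("V100", 5120, 1530, 4096, 1.75, 15.7, 900),
    Accelerator("P100", 3584, 1480, 4096, 1.4, 10.6, 720),
]

for acc in TABLE:
    print(
        f"{acc.name:10s} "
        f"FP32 {acc.peak_fp32_tflops():6.2f} vs {acc.listed_fp32_tflops:5.1f} TFLOPS, "
        f"memory {acc.memory_bandwidth_gb_s():7.1f} vs {acc.listed_bandwidth_gb_s:6.0f} GB/s"
    )
```

Under these assumptions the estimates land within about 2% of the listed figures for every row; the small gaps are consistent with the table rounding the per-pin data rates.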

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.