The most powerful end-to-end AI supercomputing platform
Massive datasets, exploding model sizes, and complex simulations require multiple GPUs with extremely fast interconnects. The NVIDIA HGX™ platform brings together the full power of NVIDIA GPUs, NVIDIA® NVLink®, NVIDIA Mellanox® InfiniBand® networking, and a fully optimized NVIDIA AI and HPC software stack from NGC™ to provide the highest application performance. With its end-to-end performance and flexibility, NVIDIA HGX enables researchers and scientists to combine simulation, data analytics, and AI to advance scientific progress.
NVIDIA HGX A100 combines NVIDIA A100 Tensor Core GPUs with high-speed interconnects to form the world’s most powerful servers. With A100 80GB GPUs, a single HGX A100 has up to 1.3 terabytes (TB) of GPU memory and over 2 terabytes per second (TB/s) of memory bandwidth, delivering unprecedented acceleration.
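The headline memory figure follows directly from the per-GPU specs. A minimal back-of-the-envelope sketch, assuming the 16-GPU configuration built from A100 80GB parts:

```python
# Sanity check of the HGX A100 headline memory figures, assuming the
# 16-GPU configuration with A100 80GB parts.
NUM_GPUS = 16
HBM_PER_GPU_GB = 80        # A100 80GB HBM2e capacity
HBM_BW_PER_GPU_TBPS = 2.0  # roughly 2 TB/s of HBM2e bandwidth per A100 80GB

total_memory_tb = NUM_GPUS * HBM_PER_GPU_GB / 1000
print(f"Aggregate GPU memory: {total_memory_tb:.2f} TB")  # ~1.3 TB
print(f"Per-GPU memory bandwidth: {HBM_BW_PER_GPU_TBPS} TB/s")
```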
HGX A100 delivers up to a 20X out-of-the-box AI speedup over the previous generation with Tensor Float 32 (TF32), and a 2.5X HPC speedup with FP64. Fully tested and easy to deploy, HGX A100 integrates into partner servers to provide guaranteed performance. NVIDIA HGX A100 with 16 GPUs delivers a staggering 10 petaFLOPS of AI performance (with sparsity), forming the world’s most powerful accelerated scale-up server platform for AI and HPC.
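TF32 requires no model changes in most frameworks, but the switch can be controlled explicitly. A minimal sketch using PyTorch's own TF32 flags (this is PyTorch's API, shown for illustration; it is not HGX-specific):

```python
import torch

# On Ampere GPUs such as the A100, PyTorch can execute ordinary FP32
# matrix math on TF32 Tensor Cores. These flags are PyTorch's controls.
torch.backends.cuda.matmul.allow_tf32 = True  # FP32 matmuls -> TF32 Tensor Cores
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions -> TF32

a = torch.randn(8192, 8192, device="cuda")    # unchanged FP32 tensors
b = torch.randn(8192, 8192, device="cuda")
c = a @ b                                     # runs on TF32 Tensor Cores
```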
DLRM Training benchmark: DLRM on HugeCTR framework, precision = FP16 | NVIDIA A100 80GB batch size = 48 | NVIDIA A100 40GB batch size = 32 | NVIDIA V100 32GB batch size = 32.
Deep learning models are exploding in size and complexity, requiring a system with large amounts of memory, massive computing power, and fast interconnects for scalability. With NVIDIA® NVSwitch™ providing high-speed, all-to-all GPU communications, HGX A100 can handle the most advanced AI models. With A100 80GB GPUs, GPU memory is doubled, delivering up to 1.3 TB of memory in a single HGX A100. Emerging workloads on the very largest models like deep learning recommendation models (DLRM), which have massive data tables, are accelerated up to 3X over HGX powered by A100 40GB GPUs.
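The NVSwitch all-to-all fabric maps directly onto how model-parallel DLRM training exchanges embedding lookups between GPUs. A minimal sketch with torch.distributed, one process per GPU; the tensor shapes are illustrative assumptions, not HugeCTR's actual internals:

```python
import torch
import torch.distributed as dist

# Each rank owns a shard of the embedding tables and looks up its slice of
# the batch locally; an all-to-all then delivers every GPU the embeddings
# it needs. NVSwitch serves exactly this pattern, with full-bandwidth
# NVLink paths between every pair of GPUs.
dist.init_process_group("nccl")  # NCCL routes traffic over NVLink/NVSwitch
rank = dist.get_rank()
torch.cuda.set_device(rank)

local_lookups = torch.randn(1024, 128, device="cuda")  # illustrative shape,
exchanged = torch.empty_like(local_lookups)            # divisible by world size
dist.all_to_all_single(exchanged, local_lookups)
```

Launched with one process per GPU (for example via torchrun), NCCL picks the NVSwitch paths automatically; no topology-specific code is needed.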
Big data analytics benchmark | 30 analytical retail queries, ETL, ML, NLP on 10TB dataset | CPU: Intel Xeon Gold 6252 2.10 GHz, Hadoop | V100 32GB, RAPIDS/Dask | A100 40GB and A100 80GB, RAPIDS/Dask/BlazingSQL
Machine learning models require loading, transforming, and processing extremely large datasets to glean critical insights. With up to 1.3 TB of unified memory and all-to-all GPU communication over NVSwitch, HGX A100 powered by A100 80GB GPUs can load and run calculations on enormous datasets to derive actionable insights quickly.
On a big data analytics benchmark, A100 80GB delivered insights with 83X higher throughput than CPUs and 2X higher performance over A100 40GB, making it ideally suited for emerging workloads with exploding dataset sizes.
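The software behind those numbers is the RAPIDS stack named in the benchmark caption. A minimal sketch of a retail-style aggregation with dask_cudf and dask-cuda (the file path and column names are hypothetical):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One Dask worker per GPU; on an HGX A100 node the dataframe is
# partitioned across the GPUs' pooled HBM.
cluster = LocalCUDACluster()
client = Client(cluster)

# Hypothetical retail dataset in Parquet format.
df = dask_cudf.read_parquet("retail_sales.parquet")

# A typical analytical retail query: revenue per store, computed by cuDF
# kernels on each GPU and reduced across workers by Dask.
revenue = df.groupby("store_id")["sale_amount"].sum().compute()
print(revenue.head())
```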
HPC applications need to perform an enormous number of calculations per second. Increasing the compute density of each server node dramatically reduces the number of servers required, resulting in huge savings in cost, power, and space consumed in the data center. For simulations, high-dimension matrix multiplication requires a processor to fetch data from many neighbors for computation, making GPUs connected by NVIDIA NVLink ideal. HPC applications can also leverage TF32 on A100 to achieve up to 11X higher throughput for single-precision, dense matrix-multiply operations than P100 delivered just four years earlier.
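Whether a GPU can fetch a neighbor's data directly over NVLink, rather than bouncing through host memory, is queryable at runtime. A minimal sketch using PyTorch's peer-access check (an illustration of the query, not a bandwidth benchmark):

```python
import torch

# On an HGX A100 baseboard, NVLink/NVSwitch makes every GPU pair
# peer-accessible, so "neighbor fetches" in distributed matrix
# multiplication bypass the PCIe/host path entirely.
n = torch.cuda.device_count()
for i in range(n):
    peers = [j for j in range(n)
             if j != i and torch.cuda.can_device_access_peer(i, j)]
    print(f"GPU{i} has direct peer access to GPUs {peers}")
```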
An HGX A100 powered by A100 80GB GPUs delivers a 2X throughput increase over A100 40GB GPUs on Quantum Espresso, a materials simulation, shortening time to insight.
Top HPC Apps benchmark: geometric mean of application speedups vs. P100. Applications: Amber [PME-Cellulose_NVE], Chroma [szscl21_24_128], GROMACS [ADH Dodec], MILC [Apex Medium], NAMD [stmv_nve_cuda], PyTorch [BERT-Large Fine Tuner], Quantum Espresso [AUSURF112-jR], Random Forest FP32 [make_blobs (160000 x 64 : 10)], TensorFlow [ResNet-50], VASP 6 [Si Huge] | GPU node with dual-socket CPUs and 4x NVIDIA P100, V100, or A100 GPUs.
Quantum Espresso benchmark: measured using the CNT10POR8 dataset, precision = FP64.
HGX A100 is available in single baseboards with four or eight A100 GPUs. The four-GPU configuration is fully interconnected with NVIDIA NVLink, and the eight-GPU configuration is interconnected with NVSwitch. Two NVIDIA HGX A100 8-GPU baseboards can also be combined using an NVSwitch interconnect to create a powerful 16-GPU single node.
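Which of these configurations a given server exposes can be confirmed from software. A minimal sketch with the NVML Python bindings (pynvml), counting GPUs and active NVLink links; the interpretation assumes standard HGX wiring:

```python
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
print(f"{count} GPUs visible")  # 4, 8, or 16 on standard HGX A100 systems

for i in range(count):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    active = 0
    for link in range(pynvml.NVML_NVLINK_MAX_LINKS):
        try:
            if pynvml.nvmlDeviceGetNvLinkState(handle, link):
                active += 1
        except pynvml.NVMLError:
            break  # this GPU exposes no further NVLink links
    print(f"GPU{i}: {active} active NVLink links")

pynvml.nvmlShutdown()
```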
NVIDIA HGX-1 and HGX-2 are reference architectures that standardize the design of data centers accelerating AI and HPC. Built with NVIDIA SXM2 V100 boards and NVIDIA NVLink and NVSwitch interconnect technologies, HGX reference architectures have a modular design that works seamlessly in hyperscale and hybrid data centers to deliver up to 2 petaFLOPS of compute power for a quick, simple path to AI and HPC.
Read this technical deep dive to learn what's new with the NVIDIA Ampere architecture and its implementation in the NVIDIA A100 GPU.