Building the next frontier of AI
Overview
NVIDIA Vera Rubin NVL72 unifies leading-edge technologies from NVIDIA: 72 Rubin GPUs, 36 Vera CPUs, ConnectX®-9 SuperNICs, and BlueField®-4 DPUs. It scales up intelligence in a rack-scale platform with the NVIDIA NVLink™ 6 switch and scales out with NVIDIA Quantum-X800 InfiniBand and Spectrum-X™ Ethernet to power the AI industrial revolution at scale.
Built on the third-generation NVIDIA MGX™ NVL72 rack design, Vera Rubin NVL72 offers a seamless transition from prior generations. It delivers AI training with one-fourth the GPUs and AI inference at one-tenth the cost per million tokens versus NVIDIA Blackwell. Featuring cable-free modular tray designs and support from over 80 MGX ecosystem partners, the rack-scale AI supercomputer delivers world-class performance with rapid deployment.
Performance
NVIDIA Rubin trains mixture-of-experts (MoE) models with one-fourth the number of GPUs required by the NVIDIA Blackwell architecture.
Projected performance subject to change. Number of GPUs based on a 10T MoE model trained on 100T tokens in a fixed timeframe of 1 month.
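The footnote's GPU-count basis can be sanity-checked with the widely used 6 × N × D rule of thumb for training FLOPs. The active-parameter count and utilization framing below are illustrative assumptions, not NVIDIA figures:

```python
# Back-of-envelope for the training claim above, using the common
# 6 * N * D rule of thumb for total training FLOPs. The active-parameter
# count is an assumption (an MoE model activates only a fraction of its
# 10T parameters per token); none of these values are NVIDIA figures.
n_active = 1.0e12              # assumed active parameters per token
tokens = 100e12                # 100T training tokens (from the footnote)

total_flops = 6 * n_active * tokens        # ~6e26 FLOPs
month_s = 30 * 24 * 3600                   # fixed one-month window
required = total_flops / month_s           # sustained aggregate FLOP/s

print(f"~{required:.2e} FLOP/s sustained over one month")
```

Dividing that aggregate requirement by per-GPU throughput and a realistic utilization factor gives a GPU count; the comparison's point is that Rubin's higher low-precision training throughput shrinks that count roughly fourfold versus Blackwell.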
NVIDIA Rubin delivers one-tenth the cost per million tokens compared to NVIDIA Blackwell for highly interactive, deep reasoning agentic AI.
LLM inference performance subject to change. Cost per 1 million tokens based on Kimi-K2-Thinking model using 32K/8K ISL/OSL comparing Blackwell GB200 NVL72 and Rubin NVL72.
Technology Breakthroughs
Specifications¹
| | NVIDIA Vera Rubin NVL72 | NVIDIA Vera Rubin Superchip | NVIDIA Rubin GPU |
|---|---|---|---|
| Configuration | 72 NVIDIA Rubin GPUs, 36 NVIDIA Vera CPUs | 2 NVIDIA Rubin GPUs, 1 NVIDIA Vera CPU | 1 NVIDIA Rubin GPU |
| NVFP4 Inference | 3,600 PFLOPS | 100 PFLOPS | 50 PFLOPS |
| NVFP4 Training | 2,520 PFLOPS | 70 PFLOPS | 35 PFLOPS |
| FP8/FP6 Training | 1,260 PFLOPS | 35 PFLOPS | 17.5 PFLOPS |
| INT8² | 18 POPS | 0.5 POPS | 0.25 POPS |
| FP16/BF16² | 288 PFLOPS | 8 PFLOPS | 4 PFLOPS |
| TF32² | 144 PFLOPS | 4 PFLOPS | 2 PFLOPS |
| FP32 | 9,360 TFLOPS | 260 TFLOPS | 130 TFLOPS |
| FP64 | 2,400 TFLOPS | 67 TFLOPS | 33 TFLOPS |
| FP32 SGEMM³ | 28,800 TFLOPS | 800 TFLOPS | 400 TFLOPS |
| FP64 DGEMM³ | 14,400 TFLOPS | 400 TFLOPS | 200 TFLOPS |
| GPU Memory | 20.7 TB HBM4 | 576 GB HBM4 | 288 GB HBM4 |
| GPU Memory Bandwidth | 1,580 TB/s | 44 TB/s | 22 TB/s |
| NVLink Bandwidth | 260 TB/s | 7.2 TB/s | 3.6 TB/s |
| NVLink-C2C Bandwidth | 65 TB/s | 1.8 TB/s | - |
| CPU Core Count | 3,168 custom NVIDIA Olympus cores (Arm® compatible) | 88 custom NVIDIA Olympus cores (Arm compatible) | - |
| CPU Memory | 54 TB LPDDR5X | 1.5 TB LPDDR5X | - |
| Total NVIDIA + HBM4 Chips | 1,296 | 30 | 12 |
1. Preliminary information. All values are "up to" figures and subject to change.
2. Dense specification.
3. Peak performance using Tensor Core-based emulation algorithms.
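As a quick consistency check (not an official derivation), the rack-level column in the table is approximately 72× the per-GPU column, and the memory and CPU totals are 36× the superchip column; small deviations elsewhere in the table reflect its rounding:

```python
# Rebuild rack-level figures from the per-GPU column of the table above.
# All numbers are copied from the table; the scaling factors follow from
# the Configuration row (72 GPUs and 36 superchips per rack).
per_gpu_pflops = {
    "NVFP4 inference": 50,
    "NVFP4 training": 35,
    "FP8/FP6 training": 17.5,
    "FP16/BF16": 4,
}
rack = {name: 72 * v for name, v in per_gpu_pflops.items()}

assert rack["NVFP4 inference"] == 3600    # 3,600 PFLOPS in the table
assert rack["NVFP4 training"] == 2520     # 2,520 PFLOPS
assert rack["FP8/FP6 training"] == 1260   # 1,260 PFLOPS
assert rack["FP16/BF16"] == 288           # 288 PFLOPS

# Memory and CPU resources scale per superchip (36 per rack):
assert 36 * 576 == 20_736   # GB HBM4 -> ~20.7 TB rack total
assert 36 * 88 == 3_168     # Olympus CPU cores
assert 36 * 1.5 == 54       # TB LPDDR5X
```

Figures such as the 2,400 TFLOPS FP64 total (72 × 33 = 2,376) show the table's per-GPU entries are themselves rounded.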
Get Started
Sign up for the latest news, updates, and more from NVIDIA.