NVIDIA Tensor Cores

Unprecedented acceleration for agentic AI.

Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy and providing enhanced security. The latest generation of Tensor Cores is faster than ever on a broad array of AI and high-performance computing (HPC) tasks. From training trillion-parameter AI models to achieving breakthrough inference performance, NVIDIA Tensor Cores accelerate all workloads for modern AI factories.
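The core idea of mixed-precision computing can be illustrated with a small NumPy sketch. This is an illustration of the principle, not the Tensor Core datapath: inputs are stored in FP16, but the running sum is kept in FP32, analogous to how Tensor Core multiply-accumulate uses a wide accumulator to preserve accuracy.

```python
import numpy as np

# Inputs stored in half precision (FP16), as they would be on the GPU.
a = np.full(10_000, 0.1, dtype=np.float16)

# Naive approach: accumulate in FP16 as well. Once the running sum is
# large enough, the FP16 spacing exceeds the addend and each addition
# is rounded away, so the result stalls far below the true value.
fp16_sum = np.float16(0.0)
for v in a:
    fp16_sum = np.float16(fp16_sum + v)

# Mixed precision: keep FP16 storage, but accumulate in FP32,
# mimicking a low-precision multiplier with a wide accumulator.
fp32_sum = np.float32(0.0)
for v in a:
    fp32_sum = np.float32(fp32_sum + np.float32(v))

true_sum = 10_000 * float(np.float16(0.1))  # exact sum of the stored values
print(float(fp16_sum), float(fp32_sum), true_sum)
```

The FP32 accumulator lands within a fraction of a unit of the true sum, while the pure-FP16 accumulation stalls hundreds of units short of it.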

Revolutionary AI Training

Training multi-trillion-parameter generative AI models in 16-bit precision can take months. NVIDIA Tensor Cores feature NVFP4, a breakthrough format that delivers the speed and efficiency of a 4-bit format with the accuracy of 16-bit. Supported by the Transformer Engine, NVFP4 uses micro-block scaling to dramatically boost throughput and reduce memory footprint. With native framework support via CUDA-X™ libraries, this innovation slashes training time to convergence for the next generation of frontier models.
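The mechanics of micro-block scaling can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the actual NVFP4 encoding: real NVFP4 stores FP4 (E2M1) element values with a per-block FP8 scale, whereas the sketch below uses a symmetric 4-bit integer grid and a float scale to show how each small block of values shares one scale factor.

```python
# Simplified sketch of micro-block scaling. Illustrative only: real
# NVFP4 uses FP4 (E2M1) element values and an FP8 per-block scale;
# here we quantize to a symmetric integer grid [-7, 7] instead.

BLOCK = 16  # elements per micro-block

def quantize_block(block):
    # One scale per block, chosen so the largest value maps to +/-7.
    scale = max(abs(v) for v in block) / 7.0 or 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in block]
    return scale, codes

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

values = [0.013 * i - 0.08 for i in range(BLOCK)]
scale, codes = quantize_block(values)
restored = dequantize_block(scale, codes)

# Per-element error is bounded by half a quantization step (scale / 2).
worst = max(abs(v - r) for v, r in zip(values, restored))
```

Because the scale tracks each block's local dynamic range rather than the whole tensor's, small-magnitude blocks keep far more resolution than a single global scale would allow.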

Breakthrough Inference

Achieving low latency at high throughput while maximizing utilization is critical for reliable inference deployment. The NVIDIA Rubin platform features an enhanced Transformer Engine that boosts NVFP4 performance with fifth-generation Tensor Cores while preserving accuracy, enabling up to 50 petaFLOPS (PFLOPS) of NVFP4 inference. Fully compatible with NVIDIA Blackwell, the Transformer Engine ensures seamless upgrades, so previously optimized code transitions effortlessly to the NVIDIA Rubin platform.

Tensor Cores have enabled NVIDIA to win industry-wide MLPerf benchmarks for inference.

Advanced HPC

HPC is a fundamental pillar of modern science. To unlock next-generation discoveries, scientists use simulations to better understand complex molecules for drug discovery, model physical systems to identify potential sources of energy, and analyze atmospheric data to better predict and prepare for extreme weather. NVIDIA Tensor Cores offer a full range of precisions, including FP64 and FP32, to accelerate scientific computing at the accuracy it demands.

The HPC SDK provides the essential compilers, libraries, and tools for developing HPC applications for the NVIDIA platform.

NVIDIA Rubin Tensor Cores

Enhanced Fifth Generation

The NVIDIA Rubin platform introduces enhanced fifth-generation Tensor Cores. Designed to accelerate modern AI factories, they optimize support for 4-bit narrow-precision NVFP4 and FP8 arithmetic. By tightly integrating these Tensor Cores with expanded special function units within NVIDIA Rubin's streaming multiprocessors, the platform significantly accelerates attention mechanisms and sparse compute paths, boosting both arithmetic density and energy efficiency without compromising model accuracy.

50 PFLOPS Transformer Engine

Powering the next generation of agentic AI, the NVIDIA Rubin GPU features a 50-petaFLOPS Transformer Engine that leverages fifth-generation Tensor Cores and NVFP4 precision to maximize inference efficiency. This architectural leap scales seamlessly to 3,600 PFLOPS of NVFP4 inference in the NVIDIA Vera Rubin NVL72 system, delivering the massive throughput essential for real-time reasoning models.

Emulation

NVIDIA Blackwell and Rubin architectures can emulate FP32 and FP64 matrix operations by decomposing input values and leveraging high-throughput, lower-precision Tensor Cores. This approach can significantly boost performance and energy efficiency while matching or even exceeding native IEEE 754 accuracy. By utilizing software-driven algorithms and fixed-point operations, emulation provides a controlled, highly efficient alternative to traditional higher-precision hardware execution.
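The decomposition idea can be illustrated with a small NumPy sketch. This is not NVIDIA's actual emulation algorithm (which uses fixed-point slices inside the Tensor Cores); it is the classic two-slice splitting trick, assuming FP32 as the "low" precision: each FP64 input is split into a high and a low FP32 slice, the four slice products are accumulated in wide precision, and a near-FP64 result is recovered from FP32 data.

```python
import numpy as np

def split(x):
    """Split FP64 values into high and low FP32 slices so that
    hi + lo reproduces x to roughly 48 bits of mantissa."""
    hi = x.astype(np.float32)
    lo = (x - hi.astype(np.float64)).astype(np.float32)
    return hi, lo

def emulated_dot(x, y):
    """Approximate an FP64 dot product from FP32 slices only.
    The four slice-product terms are accumulated in FP64, mimicking
    a low-precision multiply unit with a wide accumulator."""
    xh, xl = split(x)
    yh, yl = split(y)
    f64 = np.float64
    terms = (xh.astype(f64) * yh.astype(f64)
             + xh.astype(f64) * yl.astype(f64)
             + xl.astype(f64) * yh.astype(f64)
             + xl.astype(f64) * yl.astype(f64))
    return float(terms.sum())

rng = np.random.default_rng(0)
x = rng.random(1000)
y = rng.random(1000)

ref = float(np.dot(x, y))                                        # native FP64
naive = float(np.dot(x.astype(np.float32), y.astype(np.float32)))  # plain FP32
emulated = emulated_dot(x, y)
```

The emulated result tracks the native FP64 reference orders of magnitude more closely than the plain FP32 dot product, at the cost of four low-precision products per element pair.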

NVIDIA Blackwell Tensor Cores

Fifth Generation

The NVIDIA Blackwell architecture delivers a 30X speedup compared to the previous NVIDIA Hopper™ generation for massive models such as GPT-MoE-1.8T. This performance boost is made possible with the fifth generation of Tensor Cores. NVIDIA Blackwell Tensor Cores add new precisions, including community-defined microscaling formats, which deliver better accuracy and make it easier to replace higher-precision formats.

New Precision Formats

As generative AI models explode in size and complexity, it’s critical to improve training and inference performance. To meet these compute needs, NVIDIA Blackwell Tensor Cores support new quantization formats and precisions, including community-defined microscaling formats.

Second-Generation Transformer Engine

The second-generation Transformer Engine uses custom NVIDIA Blackwell Tensor Core technology combined with NVIDIA® TensorRT™-LLM and NeMo™ Framework innovations to accelerate inference and training for large language models (LLMs) and mixture-of-experts (MoE) models. The Transformer Engine is fueled by the Tensor Cores’ FP4 precision, doubling performance and efficiency while maintaining high accuracy for current and next-generation MoE models.

The Transformer Engine works to democratize today’s LLMs with real-time performance. Enterprises can optimize business processes by deploying state-of-the-art generative AI models with affordable economics.

The Most Powerful End-to-End AI and HPC Data Center Platform

Tensor Cores are essential building blocks of the complete NVIDIA data center solution that incorporates hardware, networking, software, libraries, and optimized AI models and applications from the NVIDIA NGC™ catalog. The most powerful end-to-end AI and HPC platform, it allows researchers to deliver real-world results and deploy solutions into production at scale.

                                  NVIDIA Rubin                                   NVIDIA Blackwell
Supported Tensor Core Precisions  NVFP4, FP64, TF32, BF16, FP16, FP8/FP6, INT8   NVFP4, FP64, TF32, BF16, FP16, FP8/FP6, INT8
Supported CUDA® Core Precisions   FP64, FP32, INT32, FP16, BF16                  FP64, FP32, FP16, BF16

*Preliminary specifications; subject to change.

Learn more about the NVIDIA Vera Rubin platform.