NVIDIA Tensor Cores

Unprecedented Acceleration for HPC and AI

Tensor Cores enable mixed-precision computing, dynamically adapting calculations to accelerate throughput while preserving accuracy. The latest generation of Tensor Cores is faster than ever on a broader array of AI and high-performance computing (HPC) tasks. From 6X speedups in transformer network training to 3X boosts in performance across all applications, NVIDIA Tensor Cores deliver new capabilities to every workload.

Revolutionary AI Training

AI models are exploding in complexity as they take on next-level challenges such as conversational AI. Training massive models in FP32 can take weeks or even months. NVIDIA Tensor Cores provide an order-of-magnitude higher performance with reduced precisions like 8-bit floating point (FP8) in the Transformer Engine, Tensor Float 32 (TF32), and FP16. And with direct support in native frameworks via CUDA-X™ libraries, implementation is automatic, which dramatically slashes training-to-convergence times while maintaining accuracy.
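One reason reduced-precision training can preserve accuracy is loss scaling, which mixed-precision frameworks apply automatically: tiny gradients that would underflow in FP16 are scaled into representable range before the cast, then unscaled in higher precision. A minimal pure-Python sketch of the idea, using the `struct` module's half-precision format to stand in for FP16 storage (`to_fp16` is an illustrative helper, not a framework API):

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

# A small gradient typical of late-stage training.
grad = 1e-8

# Stored directly in FP16 it underflows to zero: the smallest
# positive FP16 subnormal is about 6e-8.
assert to_fp16(grad) == 0.0

# Loss scaling multiplies the loss (and hence every gradient) by a
# large constant before the FP16 cast, then divides it back out in
# higher precision after the update math.
scale = 2.0 ** 14
scaled = to_fp16(grad * scale)   # now representable in FP16
assert scaled > 0.0

recovered = scaled / scale       # unscale in full precision
assert abs(recovered - grad) / grad < 0.01
```

The scale factor is an assumption for illustration; in practice frameworks pick and adjust it dynamically.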

Tensor Cores enabled NVIDIA to win the industry-wide MLPerf benchmark for training.

Breakthrough AI Inference

A great AI inference accelerator has to deliver not only great performance but also the versatility to accelerate diverse neural networks, along with the programmability that lets developers build new ones. Delivering low latency at high throughput while maximizing utilization is the most important performance requirement for deploying inference reliably. NVIDIA Tensor Cores offer a full range of precisions (TF32, bfloat16, FP16, FP8, and INT8) to provide unmatched versatility and performance.

Tensor Cores enabled NVIDIA to win the industry-wide MLPerf benchmark for inference.

Advanced HPC

HPC is a fundamental pillar of modern science. To unlock next-generation discoveries, scientists use simulations to better understand complex molecules for drug discovery, physics for potential sources of energy, and atmospheric data to better predict and prepare for extreme weather patterns. NVIDIA Tensor Cores offer a full range of precisions, including FP64, to accelerate scientific computing with the highest accuracy needed.

The HPC SDK provides the essential compilers, libraries, and tools for developing HPC applications for the NVIDIA platform.

NVIDIA H100 Tensor Cores

Fourth Generation

Since the introduction of Tensor Core technology, NVIDIA GPUs have increased their peak performance by 60X, fueling the democratization of computing for AI and HPC. The NVIDIA Hopper™ architecture advances fourth-generation Tensor Cores with the Transformer Engine, using a new 8-bit floating point precision (FP8) to deliver 6X higher performance over FP16 for trillion-parameter model training. Hopper Tensor Cores also deliver 3X more performance with the TF32, FP64, FP16, and INT8 precisions, bringing the highest speedups to every workload.

FP8

Training times for transformer networks are stretching into months because of their large, math-bound computations. Hopper's new FP8 precision delivers up to 6X more performance than FP16 on Ampere. FP8 is used in the Transformer Engine, a Hopper Tensor Core technology designed specifically to accelerate training for transformer models. Hopper Tensor Cores can apply mixed FP8 and FP16 precision to dramatically accelerate the AI calculations in transformer training while still maintaining accuracy. FP8 also enables massive speedups in large language model inference, with up to 30X better performance than Ampere.
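NVIDIA's published FP8 format comes in two variants, E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits), trading precision against dynamic range. As a sketch, their maximum representable values can be derived from the bit layouts; note that E4M3 departs from IEEE conventions by reclaiming the all-ones exponent for normal numbers, which is why its maximum is 448 rather than 240:

```python
def max_e5m2() -> float:
    # E5M2 follows IEEE-style semantics: the all-ones exponent is
    # reserved for inf/NaN, so the largest normal value has biased
    # exponent 30 and mantissa 0b11.
    bias = 15
    return (1 + 3 / 4) * 2.0 ** (30 - bias)   # 1.75 * 2^15

def max_e4m3() -> float:
    # E4M3 reclaims the all-ones exponent (15) for normal values;
    # only the mantissa-all-ones pattern encodes NaN, so the largest
    # usable mantissa is 0b110.
    bias = 7
    return (1 + 6 / 8) * 2.0 ** (15 - bias)   # 1.75 * 2^8

assert max_e5m2() == 57344.0
assert max_e4m3() == 448.0
```

The wider range of E5M2 suits gradients, while the extra mantissa bit of E4M3 suits forward-pass activations and weights.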

TF32

As AI networks and datasets continue to expand exponentially, their computing appetite has grown with them. Lower-precision math has brought huge performance speedups, but it has historically required code changes. H100 supports TF32 precision, which works just like FP32 while delivering AI speedups of up to 3X over NVIDIA Ampere™ Tensor Cores, without requiring any code changes.
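TF32 keeps FP32's 8-bit exponent (so the numeric range is unchanged) but carries only 10 of FP32's 23 mantissa bits. A rough pure-Python emulation makes the trade-off concrete; it truncates rather than rounds, and `to_tf32` is an illustrative helper, not a CUDA API:

```python
import struct

def to_tf32(x: float) -> float:
    """Emulate TF32 by truncating an FP32 value to a 10-bit mantissa.

    TF32 keeps FP32's 8-bit exponent but only 10 mantissa bits.
    Real hardware rounds; truncation is a simplified sketch.
    """
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << 13) - 1)   # zero the 13 low mantissa bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]

x = 1 / 3
y = to_tf32(x)
assert to_tf32(1.0) == 1.0               # short values pass through exactly
assert y != x and abs(y - x) < 2 ** -10  # precision drops to ~10 mantissa bits
assert to_tf32(1e38) != float('inf')     # but FP32's full range is preserved
```

Because the exponent field is untouched, code written for FP32 ranges runs unmodified, which is what "works just like FP32" refers to.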

FP64

H100 continues to deliver the power of Tensor Cores to HPC—with more performance than ever. H100’s FP64 performance is 3X faster compared to the prior generation, further accelerating a whole range of HPC applications that need double-precision math.

FP16

H100 Tensor Cores boost FP16 for deep learning, providing a 3X AI speedup compared to the NVIDIA Ampere architecture’s Tensor Cores. This dramatically boosts throughput and cuts time to convergence.

INT8

First introduced in NVIDIA Turing™, INT8 Tensor Cores dramatically accelerate inference throughput and deliver huge boosts in efficiency. INT8 in the NVIDIA Hopper architecture delivers 3X the throughput of the previous generation of Tensor Cores for production deployments. This versatility enables industry-leading performance for both high-batch and real-time workloads in core and edge data centers.
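INT8 inference typically relies on quantization: real-valued weights and activations are mapped to 8-bit integers through a scale factor chosen during calibration. A minimal sketch of symmetric per-tensor quantization (the helper names and values are illustrative, not a library API):

```python
def quantize_int8(values, scale):
    """Symmetric per-tensor quantization: real_value ~= scale * int8."""
    q = []
    for v in values:
        n = round(v / scale)
        q.append(max(-127, min(127, n)))   # clamp to the int8 range
    return q

def dequantize(q, scale):
    return [scale * n for n in q]

# Calibration picks a scale so the observed range maps onto [-127, 127].
weights = [0.02, -0.51, 0.73, -1.0, 0.25]
scale = max(abs(v) for v in weights) / 127

q = quantize_int8(weights, scale)
restored = dequantize(q, scale)

# Within the calibrated range, error is bounded by half a quantization step.
assert all(abs(a - b) <= scale / 2 + 1e-9 for a, b in zip(weights, restored))
assert all(-127 <= n <= 127 for n in q)
```

Production toolchains refine this with per-channel scales and calibration datasets, but the arithmetic the Tensor Cores accelerate is this integer form.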

NVIDIA Ampere Architecture Tensor Cores

Third Generation

The NVIDIA Ampere architecture Tensor Cores build upon prior innovations by bringing new precisions—TF32 and FP64—to accelerate and simplify AI adoption and extend the power of Tensor Cores to HPC. And with support for bfloat16, INT8, and INT4, these third-generation Tensor Cores create incredibly versatile accelerators for both AI training and inference.

NVIDIA Turing Tensor Cores

Second Generation

NVIDIA Turing™ Tensor Core technology features multi-precision computing for efficient AI inference. Turing Tensor Cores provide a range of precisions for deep learning training and inference, from FP32 to FP16 to INT8, as well as INT4, to provide giant leaps in performance over NVIDIA Pascal™ GPUs.


NVIDIA Volta Tensor Cores

First Generation

Designed specifically for deep learning, the first-generation Tensor Cores in NVIDIA Volta™ deliver groundbreaking performance with mixed-precision matrix multiply in FP16 and FP32—up to 12X higher peak teraFLOPS (TFLOPS) for training and 6X higher peak TFLOPS for inference over NVIDIA Pascal. This key capability enables Volta to deliver 3X performance speedups in training and inference over Pascal.
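The mixed-precision pattern Volta introduced, FP16 operands with full-precision accumulation (D = A×B + C), can be emulated in plain Python as a sketch of the data flow (not the hardware path; `fp16` and `mma` are illustrative helpers):

```python
import struct

def fp16(x: float) -> float:
    """Round a value through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

def mma(A, B, C):
    """Emulate a Tensor Core-style mixed-precision matrix multiply-accumulate.

    Operands A and B are rounded to FP16, but every product is added
    into a full-precision accumulator seeded from C. Accumulating in
    higher precision is how Tensor Cores keep accuracy with FP16 inputs.
    """
    n = len(A)
    D = [row[:] for row in C]            # accumulator starts at C
    for i in range(n):
        for j in range(n):
            acc = D[i][j]
            for k in range(n):
                acc += fp16(A[i][k]) * fp16(B[k][j])  # FP16 in, wide accumulate
            D[i][j] = acc
    return D

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 0.0], [0.0, 1.0]]
D = mma(A, B, C)
assert D == [[20.0, 22.0], [43.0, 51.0]]   # A @ B + C, exact for these values
```

On real hardware each Tensor Core performs this fused operation on a small tile per clock; the loop here only illustrates the precision handling.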

The Most Powerful End-to-End AI and HPC Data Center Platform

Tensor Cores are essential building blocks of the complete NVIDIA data center solution that incorporates hardware, networking, software, libraries, and optimized AI models and applications from the NVIDIA NGC™ catalog. The most powerful end-to-end AI and HPC platform, it allows researchers to deliver real-world results and deploy solutions into production at scale.

Supported precisions by architecture:

Hopper
  Tensor Core precisions: FP64, TF32, bfloat16, FP16, FP8, INT8
  CUDA® Core precisions: FP64, FP32, FP16, bfloat16, INT8

Ampere
  Tensor Core precisions: FP64, TF32, bfloat16, FP16, INT8, INT4, INT1
  CUDA Core precisions: FP64, FP32, FP16, bfloat16, INT8

Turing
  Tensor Core precisions: FP16, INT8, INT4, INT1
  CUDA Core precisions: FP64, FP32, FP16, INT8

Volta
  Tensor Core precisions: FP16
  CUDA Core precisions: FP64, FP32, FP16, INT8

Preliminary specifications; subject to change.

Take a Deep Dive into the NVIDIA Hopper Architecture