MLPerf Benchmarks

The NVIDIA AI platform achieves world-class performance and versatility in MLPerf Training and Inference benchmarks, enabled by extreme co-design.

What Is MLPerf?

MLPerf™ benchmarks—developed by MLCommons, a consortium of AI leaders from academia, research labs, and industry—are designed to provide unbiased evaluations of training and inference performance for hardware, software, and services, all conducted under prescribed conditions. To stay on the cutting edge of industry trends, MLPerf continues to evolve, holding new tests at regular intervals and adding new workloads that represent the state of the art in AI.

Inside the MLPerf Benchmarks

MLPerf Inference v6.0 measures inference performance across a wide variety of model architectures, including dense and mixture-of-experts (MoE) large language models (LLMs), vision language models, text-to-video models, generative recommenders, and more.

MLPerf Training v5.1 measures the time to train models to a specified quality level across various model types, including LLMs, text-to-image, recommenders, graph neural networks, and object detection.
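MLPerf Training's core metric can be sketched in a few lines of Python. Everything below is a hypothetical stand-in (the toy training loop, the quality curve, and the 0.75 target are invented for illustration), not an MLPerf reference implementation:

```python
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """MLPerf-style time-to-train: wall-clock time until the model
    first reaches a prescribed quality target (hypothetical API)."""
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_quality:
            return epoch, time.perf_counter() - start
    raise RuntimeError("quality target not reached")

# Toy stand-in: "quality" improves by 0.1 per epoch.
state = {"quality": 0.0}
def train_one_epoch():
    state["quality"] += 0.1
def evaluate():
    return state["quality"]

epochs, seconds = time_to_train(train_one_epoch, evaluate, target_quality=0.75)
print(epochs)  # 8: first epoch at which quality reaches 0.75
```

The key property the metric captures: faster convergence to the same quality bar counts, raw throughput alone does not.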

Reasoning LLMs

AI models that generate intermediate “thinking” tokens to enhance response accuracy.

Vision Language Models

Multimodal, generative AI models capable of understanding and processing video, image, and text.

LLMs

Deep learning algorithms trained on large-scale datasets that can recognize, summarize, translate, predict, and generate content for a breadth of use cases.

Text-to-Video

Generative AI models that generate video outputs based on text inputs.

Text-to-Image

Generates images based on text prompts.

Recommender

Delivers personalized results in user-facing services such as social media or ecommerce websites by understanding interactions between users and service items, like products or ads.

Graph Neural Network

Uses neural networks designed to work with data structured as graphs.

Speech-to-Text

Converts spoken language into written text.

NVIDIA Blackwell Ultra Delivers up to 50x Better Performance and 35x Lower Cost for Agentic AI

Built to accelerate the next generation of agentic AI, NVIDIA Blackwell Ultra delivers breakthrough inference performance with dramatically lower cost. Cloud providers such as Microsoft, CoreWeave, and Oracle Cloud Infrastructure are deploying NVIDIA GB300 NVL72 systems at scale for low-latency and long-context use cases, such as agentic coding and coding assistants.

This is enabled by deep co-design across NVIDIA Blackwell, NVLink™, and NVLink Switch for scale-out; NVFP4 for low-precision accuracy; and NVIDIA Dynamo and TensorRT™ LLM for speed and flexibility—as well as development with community frameworks SGLang, vLLM, and more.
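As a rough illustration of the block-scaled low-precision idea behind NVFP4: groups of values share one scale factor, so each 4-bit value only needs to cover a small dynamic range. This sketch is simplified (real NVFP4 packs 4-bit E2M1 values with compact per-block scale factors; the grid, block size, and float scale here are toy stand-ins):

```python
# Representable magnitudes of a 4-bit E2M1 value (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(xs):
    """Toy block-scaled FP4 quantizer in the spirit of NVFP4: one shared
    scale per block so the largest magnitude maps to 6.0. (Simplified:
    real NVFP4 stores the per-block scale itself in a compact format.)"""
    scale = max(abs(v) for v in xs) / FP4_GRID[-1] or 1.0
    def snap(v):
        mag = min(FP4_GRID, key=lambda g: abs(abs(v) / scale - g))
        return mag * scale * (1 if v >= 0 else -1)
    return [snap(v) for v in xs]

block = [0.03, -0.41, 1.20, -2.90, 0.55, 0.00, 0.75, -0.10,
         0.20, -0.33, 0.90, 1.05, -0.60, 0.48, -1.70, 0.12]
q = quantize_block(block)
# The largest value in the block is reproduced exactly; every other
# value lands within half a grid step times the shared scale.
print(max(abs(a - b) for a, b in zip(block, q)))
```

The design trade-off: per-block scales keep quantization error small locally, which is what lets 4-bit training and inference meet accuracy targets that a single tensor-wide scale would miss.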

NVIDIA MLPerf Benchmark Results

The NVIDIA platform achieved the fastest time to train on all seven MLPerf Training v5.1 benchmarks. Blackwell Ultra made its debut, delivering large leaps in LLM pretraining and fine-tuning, enabled by architectural enhancements and breakthrough NVFP4 training methods that increase performance while meeting strict MLPerf accuracy requirements. NVIDIA also increased Blackwell Llama 3.1 405B pretraining performance at scale by 2.7x through a combination of twice the scale and large gains in per-GPU performance enabled by NVFP4. NVIDIA also set performance records on both newly added benchmarks—Llama 3.1 8B and FLUX.1—while continuing to hold records on the existing recommender, object detection, and graph neural network benchmarks.

NVIDIA Blackwell Ultra Delivers Large Leap in MLPerf Training Debut

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 4.1-0050, 5.0-0014, 5.0-0067, 5.0-0076, 5.1-0058, 5.1-0060. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

Annual Rhythm and Extreme Co-Design for Sustained Training Leadership

The NVIDIA platform delivered the fastest time to train on every MLPerf Training v5.1 benchmark, with innovations across chips, systems, and software enabling sustained training performance leadership, as shown by industry-standard, peer-reviewed performance data.

Max-Scale Performance

Benchmark Time to Train
LLM Pretraining (Llama 3.1 405B) 10 minutes
LLM Pretraining (Llama 3.1 8B) 5.2 minutes
LLM Fine-Tuning (Llama 2 70B LoRA) 0.40 minutes
Image Generation (FLUX.1) 12.5 minutes
Recommender (DLRM-DCNv2) 0.71 minutes
Graph Neural Network (R-GAT) 0.84 minutes
Object Detection (RetinaNet) 1.4 minutes

MLPerf™ Training v5.0 and v5.1 results retrieved from www.mlcommons.org on November 12, 2025, from the following entries: 5.0-0082, 5.1-0002, 5.1-0004, 5.1-0060, 5.1-0070, 5.1-0072. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

NVIDIA Delivers Highest Inference Performance, Unmatched Versatility

Blackwell Ultra GPUs powered the highest-performing submissions across the broadest range of models and scenarios in MLPerf Inference v6.0, and only the NVIDIA platform submitted on every newly added benchmark. Through software optimizations alone, the throughput of the GB300 NVL72 increased by up to 2.7x in just one round. And, for the first time, NVIDIA submitted MLPerf Inference results using 288 Blackwell Ultra GPUs across four GB300 NVL72 systems interconnected with NVIDIA Quantum-X800 InfiniBand—the largest submission scale in the benchmark’s history—to deliver record reasoning inference throughput of 2.5 million tokens per second.

MLPerf Inference v5.1 and v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 5.1-0072 and 6.0-0082. Per-chip performance derived by dividing total throughput by number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.1 or v6.0. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.
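Using the round-number figures quoted above, the per-chip derivation described in the note is simple arithmetic:

```python
# Per-chip throughput as derived in the MLPerf notes: total tokens/s
# divided by the number of reported chips (not a primary MLPerf metric).
total_tokens_per_s = 2_500_000   # record DeepSeek-R1 reasoning throughput
num_gpus = 288                   # four GB300 NVL72 systems
per_gpu = total_tokens_per_s / num_gpus
print(round(per_gpu))  # 8681 tokens/s per Blackwell Ultra GPU
```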

Same GB300 NVL72, Up to 2.7x More Performance From Software Alone

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

NVIDIA GB300 NVL72 and NVIDIA Quantum-X800 Power Largest-Ever MLPerf Inference Submission

Record Scale

288 NVIDIA Blackwell Ultra GPUs 

Highest Token Throughput

Up to 2.5 million tokens/second on DeepSeek-R1¹

MLPerf Inference v6.0, Closed Division. Results retrieved from www.mlcommons.org on April 1, 2026. NVIDIA platform results from the following entries: 6.0-0076. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited. See www.mlcommons.org for more information.

¹ Offline scenario

The Technology Behind the Results

The complexity of AI demands a tight integration between all aspects of the platform. As demonstrated in MLPerf’s benchmarks, the NVIDIA AI platform delivers leadership performance with the world’s most advanced GPU, powerful and scalable interconnect technologies, and cutting-edge software—an end-to-end solution that can be deployed in the data center, in the cloud, or at the edge with amazing results.

Optimized Software That Accelerates AI Workflows

NVIDIA Dynamo is an open-source distributed inference-serving framework for deploying models in multi-node environments at AI-factory scale. It streamlines distributed serving by disaggregating inference, optimizing routing, and extending memory through data caching to cost-effective storage tiers.

Dynamo works by disaggregating (separating) the prefill and decoding phases of LLM inference across different GPUs, allowing for independent optimization and higher throughput. It was featured prominently in the MLPerf Inference v5.1 benchmarks, demonstrating superior performance in Llama 3.1 405B Interactive and DeepSeek-R1 reasoning tests.
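The disaggregation idea can be illustrated with a toy two-stage pipeline. This is a sketch of the concept only, not Dynamo's actual API: the worker functions, the queue handoff, and the fake KV cache are all hypothetical stand-ins for separate GPU pools connected by a fast interconnect.

```python
from queue import Queue
from threading import Thread

def prefill_worker(prompts, handoff):
    # Compute-bound phase: process the whole prompt once to build a KV cache.
    for prompt in prompts:
        kv_cache = prompt.split()        # stand-in for real KV tensors
        handoff.put((prompt, kv_cache))
    handoff.put(None)                    # sentinel: no more work

def decode_worker(handoff, results):
    # Bandwidth-bound phase: generate tokens using the transferred KV cache.
    while (item := handoff.get()) is not None:
        prompt, kv_cache = item
        results.append((prompt, len(kv_cache) * 2))  # pretend token count

handoff, results = Queue(), []
prompts = ["what is mlperf", "explain nvlink switch topology"]
p = Thread(target=prefill_worker, args=(prompts, handoff))
d = Thread(target=decode_worker, args=(handoff, results))
p.start(); d.start(); p.join(); d.join()
print(results)  # each request decoded with a KV cache built elsewhere
```

Because the two phases stress different resources, running them on separate pools lets each be sized and batched independently, which is where the throughput gain comes from.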

Leadership-Class AI Infrastructure

Achieving world-leading results across training and inference requires infrastructure that’s purpose-built for the world’s most complex AI challenges. The NVIDIA AI platform delivered leading performance powered by the NVIDIA Blackwell and Blackwell Ultra platforms, including the NVIDIA GB300 NVL72 and GB200 NVL72 systems, NVLink and NVLink Switch, and Quantum InfiniBand. These are at the heart of AI factories powered by the NVIDIA data center platform, the engine behind our benchmark performance.

In addition, NVIDIA DGX™ systems offer the scalability, rapid deployment, and incredible compute power that enable every enterprise to build leadership-class AI infrastructure. 

Learn more about our data center training and inference performance.

Reasoning LLMs

MLPerf Inference uses: 

DeepSeek-R1 with samples sourced from the AIME, MATH500, GPQA Diamond, MMLU-Pro, and LiveCodeBench datasets.

GPT-OSS-120B with samples from the AIME 2024, LivecodeBench v6, and GPQA Diamond datasets.

Vision Language Model

MLPerf Inference uses the Qwen3-VL-235B-A22B-Instruct model with the Shopify Product Catalog dataset.

LLMs

MLPerf Inference uses:

Llama 3.1 405B with samples sourced from the LongBench, LongDataCollection, RULER, and GovReport datasets.

Llama 2 70B with the OpenOrca dataset.

Llama 3.1 8B with the CNN/DailyMail dataset.

Mixtral 8x7B with samples sourced from the OpenOrca, GSM8K, and MBXP datasets.

MLPerf Training uses the 405-billion-parameter Llama 3.1 generative language model with a sequence length of 8,192 for the LLM pretraining workload, trained on the C4 (v3.0.1) dataset. For the LLM fine-tuning test, it uses the Llama 2 70B model with the GovReport dataset at a sequence length of 8,192. Llama 3.1 8B pretraining also uses the C4 dataset with a sequence length of 8,192.

Text-to-Video

MLPerf Inference uses Wan-2.2-T2V-A14B with the VBench dataset.

Text-to-Image

MLPerf Training uses the FLUX.1 text-to-image model trained on the CC12M dataset, with the COCO 2014 dataset used for evaluation.

Recommender

MLPerf Inference uses DLRMv3 with a Synthetic Streaming 100B dataset.

MLPerf Training and Inference use the Deep Learning Recommendation Model v2 (DLRMv2), which employs DCNv2 cross layers and a multi-hot dataset synthesized from the Criteo dataset.
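The DCNv2 cross layer that distinguishes this model computes x_{l+1} = x0 * (W xl + b) + xl, where * is elementwise multiplication, so stacking layers raises the order of explicit feature interactions. A minimal sketch with toy dimensions and random weights (not the benchmark implementation):

```python
import random

def cross_layer(x0, xl, W, b):
    """One DCNv2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl.
    x0 is the original input; * is elementwise multiplication."""
    Wx = [sum(W[i][j] * xl[j] for j in range(len(xl))) + b[i]
          for i in range(len(x0))]
    return [x0[i] * Wx[i] + xl[i] for i in range(len(x0))]

random.seed(0)
d = 4                                              # toy embedding dimension
x0 = [random.uniform(-1, 1) for _ in range(d)]     # concatenated feature embeddings
W = [[random.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]
b = [0.0] * d
x1 = cross_layer(x0, x0, W, b)   # first cross layer takes xl = x0
x2 = cross_layer(x0, x1, W, b)   # stacking raises the interaction order
print(len(x2))  # 4: output keeps the input dimensionality
```

Note the residual term `+ xl`: with zero weights the layer is an identity, which keeps deep stacks trainable.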

Graph Neural Network

MLPerf Inference uses the Illinois Graph Benchmark (IGB) heterogeneous dataset.

MLPerf Training uses R-GAT with the Illinois Graph Benchmark (IGB) heterogeneous dataset.

Speech-to-Text

MLPerf Inference uses Whisper-Large-V3 with the LibriSpeech dataset.

[Chart: up to 4x higher performance in the Server scenario and 3.7x in the Offline scenario.]

NVIDIA Blackwell Architecture Highlights

AI Superchip: 208B transistors
2nd-Gen Transformer Engine: FP4/FP6 Tensor Core
5th-Generation NVLink: scales to 576 GPUs
RAS Engine: 100% in-system self-test
Secure AI: full-performance encryption and TEE
Decompression Engine: 800 GB/s