NVIDIA TensorRT Hyperscale Inference Platform

Fueling the Next Wave of AI-Powered Services

Meet these challenges head-on with NVIDIA® Tesla® GPUs and the NVIDIA TensorRT platform, the world’s fastest, most efficient deep learning inference platform. NVIDIA’s inference platform supports all deep learning workloads and provides the optimal inference solution, combining the highest throughput, best efficiency, and greatest flexibility to power AI-driven experiences.

Watch to see how Microsoft is advancing AI-powered cloud speech using GPU inference.

Ready for a deeper dive? Learn more about inference and how other customers are using AI to accelerate machine learning workloads.

Snap’s monetization algorithms have the single biggest impact on our advertisers and shareholders. NVIDIA T4-powered GPUs for inference on Google GCP will enable us to increase advertising efficacy while lowering costs compared to a CPU-only implementation.

- Nima Khajehnouri, Director of Engineering, Snap Monetization Group

Pinterest uses state-of-the-art computer vision technology to build a sophisticated understanding of over 175B pins. We rely on GPUs for training and evaluating our recognition models and for performing real-time inference.

- Andrew Zhai, Visual Search Tech Lead, Pinterest

Using GPUs made it possible to enable media understanding on our platform, not only by drastically reducing the training time of media deep learning models, but also by allowing us to derive real-time understanding of live videos at inference time.

- Twitter

PayPal needed GPUs to accelerate the deployment of our newest worldwide system and to enable capabilities that were previously impossible.

- Sri Shivananda, CTO and Senior Vice President, PayPal

NVIDIA DATA CENTER INFERENCE PRODUCTS

TESLA T4

The NVIDIA® Tesla® T4 GPU accelerates diverse cloud workloads, including high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics. Based on NVIDIA’s Turing™ architecture and packaged in an energy-efficient 70-watt, small PCIe form factor, T4 is optimized for scale-out computing environments.

TESLA V100
For Universal Data Centers

The Tesla V100 has 125 teraflops of inference performance per GPU. A single server with eight Tesla V100s can produce a petaflop of compute.

TESLA P4
For Ultra-Efficient Scale-Out Servers

The Tesla P4 accelerates any scale-out server, offering an incredible 60X higher energy efficiency compared to CPUs.

TESLA P40
For Inference-Throughput Servers

The Tesla P40 offers great inference performance, INT8 precision, and 24GB of onboard memory for an amazing user experience.

NVIDIA DATA CENTER COMPUTE SOFTWARE

NVIDIA TensorRT

NVIDIA TensorRT is a high-performance deep learning inference platform that can speed up applications such as recommenders, speech recognition, and machine translation by up to 40X compared to CPU-only architectures.
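As a rough sketch of the workflow, the snippet below builds an optimized TensorRT engine from an ONNX model using the TensorRT Python API. Exact calls vary by TensorRT version, and the model path and precision flag here are illustrative assumptions:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse a trained model exported to ONNX (placeholder path).
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # request reduced precision where supported

    # Serialize the optimized engine ("plan") for deployment.
    with open("model.plan", "wb") as f:
        f.write(builder.build_serialized_network(network, config))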

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server, formerly known as TensorRT Inference Server, is open-source software that simplifies the deployment of deep learning models in production. Triton lets teams deploy trained AI models from any framework (TensorFlow, PyTorch, TensorRT Plan, Caffe, MXNet, or custom) from local storage, Google Cloud Platform, or AWS S3 on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.
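As a minimal client-side sketch, the snippet below sends an inference request to a running Triton server over HTTP using the tritonclient Python package; the model name and the input/output tensor names are placeholders that must match the deployed model’s configuration:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build a request for a hypothetical image model ("my_model").
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)

    result = client.infer(model_name="my_model", inputs=[infer_input])
    print(result.as_numpy("output"))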

Kubernetes on NVIDIA GPUs

Kubernetes on NVIDIA GPUs enables enterprises to seamlessly scale training and inference deployment across multi-cloud GPU clusters, so GPU-accelerated deep learning and high-performance computing (HPC) applications can be deployed instantly.
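As a rough sketch, the snippet below uses the Kubernetes Python client to request a GPU for an inference pod. It assumes the NVIDIA device plugin is installed on the cluster; the pod name and container image are illustrative:

    from kubernetes import client, config

    config.load_kube_config()  # use local kubeconfig credentials

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="triton-inference"),
        spec=client.V1PodSpec(
            containers=[client.V1Container(
                name="triton",
                image="nvcr.io/nvidia/tritonserver:21.08-py3",  # example NGC image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # schedule onto a GPU node
                ),
            )],
            restart_policy="Never",
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)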

DeepStream SDK

NVIDIA DeepStream is an application framework for the most complex Intelligent Video Analytics (IVA) applications. Its modular framework and hardware-accelerated building blocks let developers focus on building core deep learning networks rather than designing end-to-end applications from scratch.
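As an illustrative sketch, the snippet below drives a DeepStream-style GStreamer pipeline from Python: hardware-accelerated decode, stream batching (nvstreammux), TensorRT-backed inference (nvinfer), and on-screen display (nvdsosd). The file paths and the nvinfer configuration file are placeholder assumptions, and element availability depends on the installed DeepStream SDK:

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    # Decode on the GPU, batch, run inference, overlay detections, and render.
    pipeline = Gst.parse_launch(
        "filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder "
        "! m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 "
        "! nvinfer config-file-path=detector_config.txt "
        "! nvvideoconvert ! nvdsosd ! nveglglessink"
    )

    pipeline.set_state(Gst.State.PLAYING)
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)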

FEATURES AND BENEFITS

The Most Advanced AI Inference Platform

NVIDIA Tesla T4 has the world’s highest inference efficiency, up to 40X higher than CPUs. T4 can analyze up to 39 simultaneous HD video streams in real time using dedicated hardware-accelerated video transcode engines. Delivering all of this performance in just 70 watts (W) makes NVIDIA T4 the ideal inference solution for mainstream servers at the edge.

24X Higher Throughput to Keep Up with Expanding Workloads

Tesla V100 GPUs powered by NVIDIA Volta™ give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.

Maximize Performance with NVIDIA TensorRT and DeepStream SDK

NVIDIA TensorRT optimizer and runtime engines deliver high throughput at low latency for applications such as recommender systems, speech recognition, and image classification. With TensorRT, models trained in 32-bit or 16-bit precision can be optimized for INT8 operations on Tesla T4 and P4, or FP16 on Tesla V100. The NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.
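As a small sketch of how this precision choice might be expressed with the TensorRT Python API (builder properties per recent TensorRT versions; INT8 additionally requires calibration data or a quantization-aware model, omitted here):

    import tensorrt as trt

    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    config = builder.create_builder_config()

    if builder.platform_has_fast_int8:    # e.g. Tesla T4 / P4
        config.set_flag(trt.BuilderFlag.INT8)
    elif builder.platform_has_fast_fp16:  # e.g. Tesla V100
        config.set_flag(trt.BuilderFlag.FP16)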

Deliver High Throughput Inference that Maximizes GPU Utilization

NVIDIA Triton Inference Server delivers high-throughput data center inference and helps you get the most from your GPUs. Delivered in a ready-to-run container, Triton Inference Server is a microservice that concurrently runs models from Caffe2, NVIDIA TensorRT, TensorFlow, and any framework that supports the ONNX standard on one or more GPUs.

Performance Specs

Tesla T4: The World's Most Advanced Inference Accelerator
Tesla V100: The Universal Data Center GPU
Tesla P4: For Ultra-Efficient, Scale-Out Servers
Tesla P40: For Inference-Throughput Servers

                                   Tesla T4            Tesla V100               Tesla P4            Tesla P40
Single-Precision (FP32)            8.1 TFLOPS          14 TFLOPS (PCIe)         5.5 TFLOPS          12 TFLOPS
                                                       15.7 TFLOPS (SXM2)
Half-Precision (FP16)              65 TFLOPS           112 TFLOPS (PCIe)        —                   —
                                                       125 TFLOPS (SXM2)
Integer Operations (INT8)          130 TOPS            —                        22 TOPS*            47 TOPS*
Integer Operations (INT4)          260 TOPS            —                        —                   —
GPU Memory                         16 GB               32/16 GB HBM2            8 GB                24 GB
Memory Bandwidth                   320 GB/s            900 GB/s                 192 GB/s            346 GB/s
System Interface/Form Factor       Low-profile PCIe    Dual-slot, full-height   Low-profile PCIe    Dual-slot, full-height
                                                       PCIe; SXM2/NVLink                            PCIe
Power                              70 W                250 W (PCIe)             50 W / 75 W         250 W
                                                       300 W (SXM2)
Hardware-Accelerated Video Engine  1x decode engine,   —                        1x decode engine,   1x decode engine,
                                   2x encode engines                            2x encode engines   2x encode engines

*Tera-Operations per Second with Boost Clock Enabled

CUSTOMER STORIES

Speech Recognition

Lower response time for speech recognition apps while maintaining accuracy on NVIDIA Tesla GPUs running TensorRT software.

Image and Video Processing

Maximize throughput efficiency for image and video processing workloads with NVIDIA DeepStream SDK and Tesla GPUs.

Recommender System

Increase recommender prediction accuracy with deep learning-based neural collaborative filtering apps running on NVIDIA GPU platforms.

OPTIMIZE YOUR DEEP LEARNING INFERENCE SOLUTION TODAY.

The Tesla V100, T4, and P40 are available now for deep learning inference.