NVIDIA TensorRT Hyperscale Inference Platform

Fueling the Next Wave of AI-Powered Services

Meet these challenges head-on with NVIDIA® Tesla® GPUs and the NVIDIA TensorRT platform, the world’s fastest, most efficient deep learning inference platform. NVIDIA’s inference platform supports all deep learning workloads and provides the optimal inference solution, combining the highest throughput, best efficiency, and greatest flexibility to power AI-driven experiences.

Watch to see how Microsoft is advancing AI-powered cloud speech using GPU inference.

Ready for a deeper dive? Learn more about inference and how other customers are using AI to accelerate machine learning workloads.

Snap’s monetization algorithms have the single biggest impact on our advertisers and shareholders. NVIDIA T4-powered GPUs for inference on Google GCP will enable us to increase advertising efficacy while lowering costs compared to a CPU-only implementation.

- Nima Khajehnouri, Director of Engineering, Snap Monetization Group

Pinterest uses state-of-the-art computer vision technology to build a sophisticated understanding of over 175B pins. We rely on GPUs for training and evaluating our recognition models and for performing real-time inference.

- Andrew Zhai, Visual Search Tech Lead, Pinterest

Using GPUs made it possible to enable media understanding on our platform, not only by drastically reducing the training time of media deep learning models, but also by allowing us to derive real-time understanding of live videos at inference time.

- Twitter

PayPal needed GPUs to accelerate the deployment of our newest worldwide system and to enable capabilities that were previously impossible.

- Sri Shivananda, CTO and Senior Vice President, PayPal

NVIDIA DATA CENTER INFERENCE PRODUCTS

TESLA T4

The NVIDIA® Tesla® T4 GPU accelerates diverse cloud workloads, including high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics. Based on NVIDIA’s Turing™ architecture and packaged in an energy-efficient 70-watt, small PCIe form factor, T4 is optimized for scale-out computing environments.

TESLA V100
For Universal Data Centers

The Tesla V100 has 125 teraflops of inference performance per GPU. A single server with eight Tesla V100s can produce a petaflop of compute.

TESLA P4
For Ultra-Efficient Scale-Out Servers

The Tesla P4 accelerates any scale-out server, offering an incredible 60X higher energy efficiency compared to CPUs.

TESLA P40
For Inference-Throughput Servers

The Tesla P40 offers great inference performance, INT8 precision, and 24GB of onboard memory for an amazing user experience.

NVIDIA DATA CENTER COMPUTE SOFTWARE

NVIDIA TensorRT

NVIDIA TensorRT is a high-performance deep learning inference platform that can speed up applications such as recommenders, speech recognition, and machine translation by up to 40X compared to CPU-only architectures.
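As a rough sketch of the workflow, the snippet below builds an optimized TensorRT engine from an ONNX model using the TensorRT Python API. Exact calls vary by TensorRT version, and the model path and precision flag here are illustrative assumptions:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    # Parse a trained model exported to ONNX (placeholder path).
    with open("model.onnx", "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(parser.get_error(0))

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # request reduced precision where supported

    # Serialize the optimized engine ("plan") for deployment.
    with open("model.plan", "wb") as f:
        f.write(builder.build_serialized_network(network, config))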

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server, formerly known as TensorRT Inference Server, is open-source software that simplifies the deployment of deep learning models in production. Triton lets teams deploy trained AI models from any framework (TensorFlow, PyTorch, TensorRT Plan, Caffe, MXNet, or custom) from local storage, Google Cloud Platform, or AWS S3 on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.
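As a minimal client-side sketch, the snippet below sends an inference request to a running Triton server over HTTP using the tritonclient Python package; the model name and the input/output tensor names are placeholders that must match the deployed model’s configuration:

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Build a request for a hypothetical image model ("my_model").
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
    infer_input.set_data_from_numpy(data)

    result = client.infer(model_name="my_model", inputs=[infer_input])
    print(result.as_numpy("output"))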

Kubernetes on NVIDIA GPUs

Kubernetes on NVIDIA GPUs enables enterprises to seamlessly scale training and inference deployment across multi-cloud GPU clusters, so GPU-accelerated deep learning and high-performance computing (HPC) applications can be deployed instantly.
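As a rough sketch, the snippet below uses the Kubernetes Python client to request a GPU for an inference pod. It assumes the NVIDIA device plugin is installed on the cluster; the pod name and container image are illustrative:

    from kubernetes import client, config

    config.load_kube_config()  # use local kubeconfig credentials

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="triton-inference"),
        spec=client.V1PodSpec(
            containers=[client.V1Container(
                name="triton",
                image="nvcr.io/nvidia/tritonserver:21.08-py3",  # example NGC image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}  # schedule onto a GPU node
                ),
            )],
            restart_policy="Never",
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)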

DeepStream SDK

NVIDIA DeepStream is an application framework for the most complex Intelligent Video Analytics (IVA) applications. Its modular framework and hardware-accelerated building blocks let developers focus on building core deep learning networks rather than designing end-to-end applications from scratch.
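As an illustrative sketch, the snippet below drives a DeepStream-style GStreamer pipeline from Python: hardware-accelerated decode, stream batching (nvstreammux), TensorRT-backed inference (nvinfer), and on-screen display (nvdsosd). The file paths and the nvinfer configuration file are placeholder assumptions, and element availability depends on the installed DeepStream SDK:

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst

    Gst.init(None)

    # Decode on the GPU, batch, run inference, overlay detections, and render.
    pipeline = Gst.parse_launch(
        "filesrc location=sample.mp4 ! qtdemux ! h264parse ! nvv4l2decoder "
        "! m.sink_0 nvstreammux name=m batch-size=1 width=1280 height=720 "
        "! nvinfer config-file-path=detector_config.txt "
        "! nvvideoconvert ! nvdsosd ! nveglglessink"
    )

    pipeline.set_state(Gst.State.PLAYING)
    bus = pipeline.get_bus()
    bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                           Gst.MessageType.EOS | Gst.MessageType.ERROR)
    pipeline.set_state(Gst.State.NULL)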

FEATURES AND BENEFITS

The Most Advanced AI Inference Platform

NVIDIA Tesla T4 has the world’s highest inference efficiency, up to 40X higher than CPUs. T4 can analyze up to 39 simultaneous HD video streams in real time using dedicated hardware-accelerated video transcode engines. Delivering all of this performance in just 70 watts (W) makes NVIDIA T4 the ideal inference solution for mainstream servers at the edge.

24X Higher Throughput to Keep Up with Expanding Workloads

Tesla V100 GPUs powered by NVIDIA Volta™ give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.

Maximize Performance with NVIDIA TensorRT and DeepStream SDK

NVIDIA TensorRT optimizer and runtime engines deliver high throughput at low latency for applications such as recommender systems, speech recognition, and image classification. With TensorRT, models trained in 32-bit or 16-bit precision can be optimized for INT8 operations on Tesla T4 and P4, or FP16 on Tesla V100. The NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.
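As a small sketch of how this precision choice might be expressed with the TensorRT Python API (builder properties per recent TensorRT versions; INT8 additionally requires calibration data or a quantization-aware model, omitted here):

    import tensorrt as trt

    builder = trt.Builder(trt.Logger(trt.Logger.WARNING))
    config = builder.create_builder_config()

    if builder.platform_has_fast_int8:    # e.g. Tesla T4 / P4
        config.set_flag(trt.BuilderFlag.INT8)
    elif builder.platform_has_fast_fp16:  # e.g. Tesla V100
        config.set_flag(trt.BuilderFlag.FP16)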

Deliver High Throughput Inference that Maximizes GPU Utilization

NVIDIA Triton Inference Server delivers high-throughput data center inference and helps you get the most from your GPUs. Delivered in a ready-to-run container, Triton Inference Server is a microservice that concurrently runs models from Caffe2, NVIDIA TensorRT, TensorFlow, and any framework that supports the ONNX standard on one or more GPUs.

Performance Specs

Tesla T4: The World's Most Advanced Inference Accelerator
Tesla V100: The Universal Data Center GPU
Tesla P4: For Ultra-Efficient, Scale-Out Servers
Tesla P40: For Inference-Throughput Servers

                                   Tesla T4            Tesla V100               Tesla P4            Tesla P40
Single-Precision (FP32)            8.1 TFLOPS          14 TFLOPS (PCIe)         5.5 TFLOPS          12 TFLOPS
                                                       15.7 TFLOPS (SXM2)
Half-Precision (FP16)              65 TFLOPS           112 TFLOPS (PCIe)        —                   —
                                                       125 TFLOPS (SXM2)
Integer Operations (INT8)          130 TOPS            —                        22 TOPS*            47 TOPS*
Integer Operations (INT4)          260 TOPS            —                        —                   —
GPU Memory                         16 GB               32/16 GB HBM2            8 GB                24 GB
Memory Bandwidth                   320 GB/s            900 GB/s                 192 GB/s            346 GB/s
System Interface/Form Factor       Low-profile PCIe    Dual-slot, full-height   Low-profile PCIe    Dual-slot, full-height
                                                       PCIe; SXM2/NVLink                            PCIe
Power                              70 W                250 W (PCIe)             50 W / 75 W         250 W
                                                       300 W (SXM2)
Hardware-Accelerated Video Engine  1x decode engine,   —                        1x decode engine,   1x decode engine,
                                   2x encode engines                            2x encode engines   2x encode engines

*Tera-Operations per Second with Boost Clock Enabled

CUSTOMER STORIES

Speech Recognition

Lower response time for speech recognition apps while maintaining accuracy on NVIDIA Tesla GPUs running TensorRT software.

Image and Video Processing

Maximize throughput efficiency for image and video processing workloads with NVIDIA DeepStream SDK and Tesla GPUs.

Recommender System

Increase recommender prediction accuracy with deep learning-based neural collaborative filtering apps running on NVIDIA GPU platforms.

OPTIMIZE YOUR DEEP LEARNING INFERENCE SOLUTION TODAY.

The Tesla V100, T4, and P40 are available now for deep learning inference.