NVIDIA TensorRT Hyperscale Inference Platform

Fueling the Next Wave of AI-Powered Services

AI is constantly challenged to keep up with exploding volumes of data and still deliver fast responses. Meet the challenges head on with NVIDIA® Tesla® GPUs and the NVIDIA TensorRT platform, the world’s fastest, most efficient data center inference platform. Tesla supports all deep learning workloads and provides the optimal inference solution, combining the highest throughput, best efficiency, and best flexibility to power AI-driven experiences. TensorRT unlocks the performance of Tesla GPUs across a variety of applications such as video streaming, speech, and recommender systems, and provides a foundation for the NVIDIA DeepStream SDK.

NVIDIA DATA CENTER INFERENCE PRODUCTS

TESLA T4

The NVIDIA® Tesla® T4 GPU is the world’s most advanced inference accelerator. Powered by NVIDIA Turing Tensor Cores, T4 brings revolutionary multi-precision inference performance to accelerate the diverse applications of modern AI. Packaged in an energy-efficient 75-watt, small PCIe form factor, T4 is optimized for scale-out servers and is purpose-built to deliver state-of-the-art inference in real time.

TESLA V100
For Universal Data Centers

The Tesla V100 has 125 teraflops of inference performance per GPU. A single server with eight Tesla V100s can produce a petaflop of compute.
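
The petaflop figure follows directly from the per-GPU number. A quick arithmetic check, treating the 125 teraflops quoted above as a peak per-GPU rate (the variable names are illustrative, not from any NVIDIA tool):

```python
# Peak figure quoted above for one Tesla V100, in teraflops.
TFLOPS_PER_V100 = 125
GPUS_PER_SERVER = 8

total_tflops = TFLOPS_PER_V100 * GPUS_PER_SERVER
total_pflops = total_tflops / 1000  # 1 petaflop = 1000 teraflops

print(f"{GPUS_PER_SERVER} x {TFLOPS_PER_V100} TFLOPS = {total_pflops:g} PFLOPS")
```

This is a peak-rate sum only; sustained throughput on a real workload will depend on the model and the server configuration.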

TESLA P4
For Ultra-Efficient Scale-Out Servers

The Tesla P4 accelerates any scale-out server, offering an incredible 60X higher energy efficiency compared to CPUs.

TESLA P40
For Inference-Throughput Servers

The Tesla P40 offers great inference performance, INT8 precision, and 24GB of onboard memory for an amazing user experience.

NVIDIA DATA CENTER COMPUTE SOFTWARE

NVIDIA TensorRT

NVIDIA TensorRT is a high-performance neural-network inference platform that can speed up applications such as recommenders, speech recognition, and machine translation by 40X compared to CPU-only architectures. TensorRT optimizes neural network models, calibrates for lower precision with high accuracy, and deploys the models to production environments in enterprise and hyperscale data centers.

NVIDIA TensorRT Inference Server

NVIDIA TensorRT inference server is a containerized microservice that lets applications use AI models in the data center. It maximizes GPU utilization and runs multiple models from different frameworks concurrently on a node. TensorRT inference server supports all popular AI models and frameworks and leverages Docker and Kubernetes to integrate seamlessly into DevOps architectures.

Kubernetes on NVIDIA GPUs

Kubernetes on NVIDIA GPUs enables enterprises to scale up training and inference deployment to multi-cloud GPU clusters seamlessly. With Kubernetes, GPU-accelerated deep learning and high performance computing (HPC) applications can be deployed to multi-cloud GPU clusters instantly.
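
As a rough illustration of what a GPU-backed deployment request can look like, the sketch below builds a minimal Kubernetes pod manifest that asks the scheduler for one GPU. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs under the `nvidia.com/gpu` resource name. The pod name and container image are hypothetical placeholders.

```python
import json

# Hypothetical names: the pod name and container image are placeholders.
manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-example"},
    "spec": {
        "restartPolicy": "Never",
        "containers": [
            {
                "name": "inference",
                "image": "example.com/inference-app:latest",
                # Extended resource exposed by the NVIDIA device plugin
                # (assumed to be installed on the cluster).
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

# Kubernetes accepts manifests in JSON as well as YAML.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`; a production deployment would normally use a Deployment or similar controller rather than a bare pod.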

DeepStream SDK

NVIDIA DeepStream for Tesla is an SDK for building scalable, deep learning-based intelligent video analytics (IVA) applications for smart cities and hyperscale data centers. It brings together the NVIDIA TensorRT optimizer and runtime engines for inference, the Video Codec SDK for transcode, and pre-processing and data curation APIs to tap into the power of Tesla GPUs. On Tesla P4 GPUs, for example, you can simultaneously decode and analyze up to 30 HD video streams in real time.

FEATURES AND BENEFITS

The Most Advanced AI Inference Platform

Tesla T4, powered by NVIDIA Turing Tensor Cores, delivers breakthrough performance for deep learning, with FP32, FP16, INT8, and INT4 precisions for inference. With 130 TeraOPS (TOPS) of INT8 and 260 TOPS of INT4, T4 has the world’s highest inference efficiency, up to 40X higher than CPUs with just 60 percent of the power consumption. Using just 75 watts (W), it’s the ideal solution for scale-out servers at the edge.

24X Higher Throughput to Keep Up with Expanding Workloads

Tesla V100 GPUs powered by NVIDIA Volta™ give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.

A Dedicated Decode Engine for New AI-Based Video Services

The Tesla P4 GPU can analyze up to 39 HD video streams in real time. Its dedicated hardware-accelerated decode engine works in parallel with the NVIDIA CUDA® cores performing inference. By integrating deep learning into the pipeline, customers can offer new levels of smart, innovative functionality that facilitates video search and other video-related services.

Maximize Performance with NVIDIA TensorRT and DeepStream SDK

The NVIDIA TensorRT optimizer and runtime engines deliver high throughput at low latency for applications such as recommender systems, speech recognition, and machine translation. With TensorRT, models trained in 32-bit or 16-bit data can be optimized for INT8 operations on Tesla T4 and P4, or FP16 on Tesla V100. NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.
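
To make the idea of running a 32-bit model with INT8 operations more concrete, here is a minimal sketch of symmetric max-abs quantization in plain Python. This is not TensorRT's calibration procedure, which is more sophisticated and is not reproduced here; the function names and sample values are made up for illustration.

```python
def quantize_int8(values):
    """Symmetric max-abs quantization of floats to the INT8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]  # made-up FP32 values for illustration
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case difference between the original and the restored values.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as an 8-bit code plus one shared scale factor, and the rounding error per value is at most half the scale. Choosing the scale well is what a calibration step is for.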

Inference that Maximizes GPU Utilization and Supports All the Top Frameworks

NVIDIA TensorRT inference server delivers high throughput data center inference and helps you get the most from your GPUs. Delivered in a ready-to-run container, NVIDIA TensorRT inference server is a microservice that lets you perform inference via an API for any combination of models from Caffe2, NVIDIA TensorRT, TensorFlow, and any framework that supports the ONNX standard on one or more GPUs.

Performance Specs

| Specification | Tesla T4 | Tesla V100 | Tesla P4 | Tesla P40 |
| --- | --- | --- | --- | --- |
| Positioning | The World's Most Advanced Inference Accelerator | The Universal Data Center GPU | Ultra-Efficient, Scale-Out Servers | Inference-Throughput Servers |
| Single-Precision Performance (FP32) | 8.1 TFLOPS | 14 TFLOPS (PCIe), 15.7 TFLOPS (SXM2) | 5.5 TFLOPS | 12 TFLOPS |
| Half-Precision Performance (FP16) | 65 TFLOPS | 112 TFLOPS (PCIe), 125 TFLOPS (SXM2) | - | - |
| Integer Operations (INT8) | 130 TOPS | - | 22 TOPS* | 47 TOPS* |
| GPU Memory | 16 GB | 16 GB HBM2 | 8 GB | 24 GB |
| Memory Bandwidth | 320 GB/s | 900 GB/s | 192 GB/s | 346 GB/s |
| System Interface/Form Factor | Low-Profile PCI Express Form Factor | Dual-Slot, Full-Height PCI Express Form Factor; SXM2 / NVLink | Low-Profile PCI Express Form Factor | Dual-Slot, Full-Height PCI Express Form Factor |
| Power | 75 W | 250 W (PCIe), 300 W (SXM2) | 50 W/75 W | 250 W |
| Hardware-Accelerated Video Engine | 1x Encode Engine, 2x Decode Engines | - | 1x Decode Engine, 2x Encode Engines | 1x Decode Engine, 2x Encode Engines |

*Tera-Operations per Second with Boost Clock Enabled
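
The INT8 and power figures listed above can be combined into a rough performance-per-watt comparison. The sketch below uses only numbers from the specs; for the Tesla P4, which is listed at 50 W/75 W, the 75 W figure is used as an assumption. The V100 is omitted because no INT8 figure is listed for it.

```python
# (INT8 TOPS, board power in watts) taken from the specs above.
# P4 is listed at 50 W/75 W; 75 W is assumed here.
specs = {
    "Tesla T4": (130, 75),
    "Tesla P4": (22, 75),
    "Tesla P40": (47, 250),
}

tops_per_watt = {name: tops / watts for name, (tops, watts) in specs.items()}

for name, value in tops_per_watt.items():
    print(f"{name}: {value:.2f} INT8 TOPS per watt")
```

These are peak-rate ratios, so they indicate relative efficiency on paper, not measured efficiency for a particular model.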

CUSTOMER STORIES

Speech Recognition

Lower response time for speech recognition apps while maintaining accuracy on NVIDIA Tesla GPUs running TensorRT software.

Image and Video Processing

Maximize throughput efficiency for image and video processing workloads with NVIDIA DeepStream SDK and Tesla GPUs.

Recommender System

Increase recommender prediction accuracy with deep learning based neural collaborative filtering apps running on NVIDIA GPU platforms.

OPTIMIZE YOUR DEEP LEARNING INFERENCE SOLUTION TODAY.

The Tesla V100, P4, and P40 are available now for deep learning inference.