NVIDIA TensorRT Hyperscale Inference Platform

Fueling the Next Wave of AI-Powered Services

AI is constantly challenged to keep up with exploding volumes of data and still deliver fast responses. Meet the challenges head on with NVIDIA® Tesla® GPUs and the NVIDIA TensorRT platform, the world’s fastest, most efficient data center inference platform. Tesla supports all deep learning workloads and provides the optimal inference solution, combining the highest throughput, best efficiency, and best flexibility to power AI-driven experiences. TensorRT unlocks the performance of Tesla GPUs across a variety of applications, such as video streaming, speech, and recommender systems, and provides a foundation for the NVIDIA DeepStream SDK.



The NVIDIA® Tesla® T4 GPU accelerates diverse cloud workloads, including high-performance computing, deep learning training and inference, machine learning, data analytics, and graphics. Based on NVIDIA’s new Turing™ architecture and packaged in an energy-efficient 70-watt, small PCIe form factor, T4 is optimized for scale-out computing environments. It features multi-precision Turing Tensor Cores and new RT Cores which, combined with accelerated containerized software stacks from NVIDIA GPU Cloud, deliver revolutionary performance at scale.



The NVIDIA® Tesla® T4 GPU is the world’s most advanced inference accelerator. Powered by NVIDIA Turing Tensor Cores, T4 brings revolutionary multi-precision inference performance to accelerate the diverse applications of modern AI. Packaged in an energy-efficient 70-watt, small PCIe form factor, T4 is optimized for scale-out servers and is purpose-built to deliver state-of-the-art inference in real time.

For Universal Data Centers

The Tesla V100 has 125 teraflops of inference performance per GPU. A single server with eight Tesla V100s can produce a petaflop of compute.
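
As a quick check of the arithmetic behind the petaflop figure, the sketch below multiplies the per-GPU number quoted above by eight. It treats the 125 teraflops as a peak value and assumes ideal linear scaling across GPUs, which real workloads will not reach; the function name is illustrative only.

```python
# Peak figures quoted in the text; real workloads will land below these.
TFLOPS_PER_V100 = 125      # per-GPU inference figure from the paragraph above
GPUS_PER_SERVER = 8


def server_peak_tflops(per_gpu_tflops: float, gpu_count: int) -> float:
    """Aggregate peak throughput, assuming ideal linear scaling (an assumption)."""
    return per_gpu_tflops * gpu_count


total_tflops = server_peak_tflops(TFLOPS_PER_V100, GPUS_PER_SERVER)
total_pflops = total_tflops / 1000   # 1 petaflop = 1,000 teraflops

print(f"{total_tflops:.0f} TFLOPS = {total_pflops:.1f} PFLOPS")
```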

For Ultra-Efficient Scale-Out Servers

The Tesla P4 accelerates any scale-out server, offering an incredible 60X higher energy efficiency compared to CPUs.

For Inference-Throughput Servers

The Tesla P40 offers great inference performance, INT8 precision, and 24GB of onboard memory for an amazing user experience.



NVIDIA TensorRT is a high-performance neural-network inference platform that can speed up applications such as recommenders, speech recognition, and machine translation by 40X compared to CPU-only architectures. TensorRT optimizes neural network models, calibrates for lower precision with high accuracy, and deploys the models to production environments in enterprise and hyperscale data centers.

NVIDIA Triton Inference Server

The NVIDIA Triton Inference Server, formerly known as TensorRT Inference Server, is open-source software that simplifies the deployment of deep learning models in production. The Triton Inference Server lets teams deploy trained AI models from any framework (TensorFlow, PyTorch, TensorRT Plan, Caffe, MXNet, or custom) from local storage, the Google Cloud Platform, or AWS S3 on any GPU- or CPU-based infrastructure. It runs multiple models concurrently on a single GPU to maximize utilization and integrates with Kubernetes for orchestration, metrics, and auto-scaling.


Kubernetes on NVIDIA GPUs

Kubernetes on NVIDIA GPUs enables enterprises to scale up training and inference deployment to multi-cloud GPU clusters seamlessly. With Kubernetes, GPU-accelerated deep learning and high performance computing (HPC) applications can be deployed to multi-cloud GPU clusters instantly.
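
To give a rough idea of what requesting a GPU from Kubernetes looks like, the sketch below builds a Pod manifest as a Python dictionary and prints it as JSON, a format `kubectl apply -f` also accepts. It assumes the cluster runs the NVIDIA device plugin, which exposes GPUs under the `nvidia.com/gpu` resource name. The Pod name and container image are hypothetical placeholders, not values from this page.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpu_count: int = 1) -> dict:
    """Build a minimal Pod spec whose container is limited to `gpu_count` GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # GPUs are requested through resource limits; this resource
                    # name assumes the NVIDIA device plugin is installed.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Hypothetical name and image, for illustration only.
manifest = gpu_pod_manifest("inference-demo", "example.com/inference:latest")
print(json.dumps(manifest, indent=2))
```

The scheduler then places the Pod only on a node that still has a free GPU to allocate.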

DeepStream SDK

NVIDIA DeepStream for Tesla is an SDK for building deep learning-based scalable intelligent video analytics (IVA) applications for smart cities and hyperscale data centers. It brings together NVIDIA TensorRT optimizer and runtime engines for inference, Video Codec SDK for transcode, pre-processing, and data curation APIs to tap into the power of Tesla GPUs. On Tesla P4 GPUs, for example, you can simultaneously decode and analyze up to 30 HD video streams in real time.


The Most Advanced AI Inference Platform

Tesla T4, powered by NVIDIA Turing Tensor Cores, delivers breakthrough deep learning performance, with FP32, FP16, INT8, and INT4 precisions available for inference. With 130 TeraOPS (TOPS) of INT8 and 260 TOPS of INT4, T4 has the world’s highest inference efficiency, up to 40X higher than CPUs. Tesla T4 can analyze up to 39 simultaneous HD video streams in real time using dedicated hardware-accelerated video transcode engines. Developers can offer new levels of smart, innovative functionality by using inference to facilitate video search and other video-related services. Delivering all this performance in just 70 watts (W) makes Tesla T4 the ideal inference solution for scale-out servers at the edge.

24X Higher Throughput to Keep Up with Expanding Workloads

Tesla V100 GPUs powered by NVIDIA Volta™ give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.

Maximize Performance with NVIDIA TensorRT and DeepStream SDK

NVIDIA TensorRT optimizer and runtime engines deliver high throughput at low latency for applications such as recommender systems, speech recognition, and machine translation. With TensorRT, models trained with 32-bit or 16-bit data can be optimized for INT8 operations on Tesla T4 and P4, or FP16 on Tesla V100. NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.
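
To illustrate the general idea of running a model trained in floating point with INT8 operations, here is a minimal sketch of symmetric max-abs quantization in plain Python. It is a simplified stand-in, not TensorRT's calibration algorithm, and the function names and sample values are hypothetical.

```python
from typing import Sequence


def calibrate_scale(samples: Sequence[float]) -> float:
    """Pick a symmetric scale so the largest |value| seen maps to 127."""
    max_abs = max(abs(x) for x in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x: float, scale: float) -> int:
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q: int, scale: float) -> float:
    """Map an INT8 code back to an approximate floating-point value."""
    return q * scale


# Hypothetical activation values standing in for a calibration data set.
calibration_data = [-2.54, -0.705, 0.0, 0.305, 1.265]

scale = calibrate_scale(calibration_data)
quantized = [quantize(x, scale) for x in calibration_data]
restored = [dequantize(q, scale) for q in quantized]
max_error = max(abs(a - b) for a, b in zip(calibration_data, restored))

print("scale:", scale)
print("quantized:", quantized)
print("max round-trip error:", max_error)
```

For values inside the calibrated range, the round-trip error stays within half a quantization step (`scale / 2`). This is why the choice of calibration data matters: outliers widen the range and coarsen every step.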

Inference that Maximizes GPU Utilization and Supports All the Top Frameworks

NVIDIA Triton Inference Server delivers high-throughput data center inference and helps you get the most from your GPUs. Delivered in a ready-to-run container, NVIDIA Triton Inference Server is a microservice that lets you perform inference via an API for any combination of models from Caffe2, NVIDIA TensorRT, TensorFlow, and any framework that supports the ONNX standard on one or more GPUs.
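
As a rough illustration of performing inference via an API, the sketch below builds an HTTP inference request using only the Python standard library. It assumes a Triton release that speaks the KServe v2 HTTP/REST protocol and listens on port 8000 of localhost. The model name `my_model`, the tensor name `INPUT0`, and the shape and data are hypothetical placeholders. The request is constructed but not sent, since sending requires a running server.

```python
import json
from urllib import request


def build_infer_request(base_url, model_name, input_name, shape, datatype, data):
    """Build a POST request for the v2 inference endpoint of one model."""
    url = f"{base_url}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,    # must match the model's input tensor name
                "shape": shape,
                "datatype": datatype,  # e.g. "FP32"
                "data": data,          # flattened tensor contents
            }
        ]
    }
    return request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical model and tensor names, for illustration only.
req = build_infer_request(
    "http://localhost:8000", "my_model", "INPUT0", [1, 4], "FP32",
    [0.1, 0.2, 0.3, 0.4],
)
print(req.get_method(), req.full_url)

# With a server running, the request could be sent like this:
# with request.urlopen(req) as resp:
#     result = json.loads(resp.read())
```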

Performance Specs

| Specification | Tesla T4: The World's Most Advanced Inference Accelerator | Tesla V100: The Universal Data Center GPU | Tesla P4 for Ultra-Efficient, Scale-Out Servers | Tesla P40 for Inference-Throughput Servers |
| --- | --- | --- | --- | --- |
| Single-Precision Performance (FP32) | 8.1 TFLOPS | 14 TFLOPS (PCIe); 15.7 TFLOPS (SXM2) | - | - |
| Half-Precision Performance (FP16) | 65 TFLOPS | 112 TFLOPS (PCIe) | - | - |
| Integer Operations (INT8) | 130 TOPS | - | 22 TOPS* | 47 TOPS* |
| Integer Operations (INT4) | 260 TOPS | - | - | - |
| GPU Memory | 16GB | 32/16GB HBM2 | 8GB | 24GB |
| Memory Bandwidth | 320GB/s | 900GB/s | 192GB/s | 346GB/s |
| System Interface/Form Factor | Low-Profile PCI Express Form Factor | Dual-Slot, Full-Height PCI Express Form Factor; SXM2 / NVLink | Low-Profile PCI Express Form Factor | Dual-Slot, Full-Height PCI Express Form Factor |
| Power | 70 W | 250 W (PCIe); 300 W (SXM2) | 50 W/75 W | 250 W |
| Hardware-Accelerated Video Engine | 1x Encode Engine, 2x Decode Engines | - | 1x Decode Engine, 2x Encode Engines | 1x Decode Engine, 2x Encode Engines |

*Tera-Operations per Second with Boost Clock Enabled
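
The efficiency claims can be sanity-checked by dividing the peak INT8 figures in the table by board power. The sketch below does this for the three GPUs that list an INT8 value. These are peak numbers rather than measured results, and pairing the Tesla P4 figure with its 75 W variant is an assumption.

```python
# (peak INT8 TOPS, board power in watts), taken from the table above.
SPECS = {
    "Tesla T4": (130, 70),
    "Tesla P4": (22, 75),     # assumes the 75 W variant; a 50 W variant is also listed
    "Tesla P40": (47, 250),
}


def tops_per_watt(tops: float, watts: float) -> float:
    """Peak INT8 throughput per watt of board power."""
    return tops / watts


efficiency = {gpu: tops_per_watt(*spec) for gpu, spec in SPECS.items()}

# Print from most to least efficient.
for gpu, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{gpu}: {value:.2f} INT8 TOPS/W")
```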


Smarter, Faster Visual Search

Bing uses NVIDIA GPU technology to speed up object detection and deliver pertinent results in real time.

Image and Video Processing

Maximize throughput efficiency for image and video processing workloads with NVIDIA DeepStream SDK and Tesla GPUs.

Recommender System

Increase recommender prediction accuracy with deep learning-based neural collaborative filtering apps running on NVIDIA GPU platforms.


The Tesla V100, T4, and P40 are available now for deep learning inference.