Powering New Levels
of User Engagement

Boost Throughput and Responsiveness in Deep Learning Inference Workloads.

AI is constantly challenged to keep up with exploding volumes of data and still deliver fast responses. Meet the challenges with NVIDIA® Tesla® running NVIDIA® TensorRT™, the world’s fastest, most efficient data center platform for inference. Tesla supports all deep learning workloads and provides the optimal inference solution, combining the highest throughput, best efficiency, and best flexibility to power AI-driven experiences. TensorRT unlocks the performance of Tesla GPUs and provides a foundation for the NVIDIA DeepStream SDK and Attis Inference Server products, which can host a variety of applications such as video streaming, speech recognition, and recommender systems.

Inference Technical Brief



iFLYTEK’s Voice Cloud Platform uses NVIDIA Tesla P4 and P40 GPUs for training and inference to increase speech recognition accuracy.


NVIDIA Inception Program startup Valossa is using NVIDIA GPUs to accelerate deep learning and divine viewer behavior from video data.


JD uses the NVIDIA AI inference platform to achieve a 40X increase in video detection efficiency.


For Universal Data Centers

The Tesla V100 has 125 teraflops of inference performance per GPU. A single server with eight Tesla V100s can produce a petaflop of compute.

For Ultra-Efficient Scale-Out Servers

The Tesla P4 accelerates any scale-out server, offering an incredible 60X higher energy efficiency compared to CPUs.

For Inference-Throughput Servers

The Tesla P40 offers great inference performance, INT8 precision and 24GB of onboard memory for an amazing user experience.



NVIDIA TensorRT™ is a high-performance neural-network inference accelerator that can speed up applications such as recommenders, speech recognition, and machine translation by 100X compared to CPUs. TensorRT gives developers the tools to optimize neural network models, calibrate them for lower precision with high accuracy, and deploy the models to production environments in enterprise and hyperscale data centers.
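Calibrating for lower precision with TensorRT requires the TensorRT library itself, but the underlying idea can be sketched in plain Python: pick a scale factor from representative calibration data, then map FP32 values onto the INT8 range. The function names below are illustrative, not TensorRT API calls.

```python
def calibrate_scale(samples):
    # Choose a scale so the largest observed activation maps to the INT8 range.
    # (Real calibrators use smarter statistics than a simple max.)
    return max(abs(x) for x in samples) / 127.0

def quantize(x, scale):
    # Round to the nearest INT8 step and clamp to [-127, 127].
    q = round(x / scale)
    return max(-127, min(127, q))

def dequantize(q, scale):
    # Recover an approximate FP32 value from the INT8 code.
    return q * scale

# Hypothetical calibration data: activations seen while running sample inputs.
samples = [-6.3, 0.5, 2.0, 6.35]
scale = calibrate_scale(samples)                 # 6.35 / 127 = 0.05
print([quantize(x, scale) for x in samples])     # [-126, 10, 40, 127]
```

Each quantized value is at most half a scale step away from the original, which is why calibrated INT8 inference can stay close to FP32 accuracy.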

DeepStream SDK

NVIDIA DeepStream for Tesla is an SDK for building scalable, deep learning-based intelligent video analytics (IVA) applications for smart cities and hyperscale data centers. It brings together NVIDIA TensorRT for inference and the NVIDIA Video Codec SDK for transcoding, along with pre-processing and data-curation APIs, to tap into the power of Tesla GPUs. On Tesla P4 GPUs, for example, you can simultaneously decode and analyze up to 39 HD video streams in real time.

Attis Inference Server

The NVIDIA® Attis Inference Server provides everything DevOps teams and data center managers need to run an inference service, optimized for NVIDIA GPUs, in the data center or cloud. Attis maximizes the performance of the inference application by making optimal use of the CPUs and GPUs on the server. Through a simple REST API, DevOps teams can deploy to multiple GPUs with homogeneous or heterogeneous GPU architectures.
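The brief does not document Attis's actual REST API, but the shape of a REST inference service can be sketched with Python's standard library: a handler that accepts a JSON request and returns a JSON prediction. The endpoint path, payload format, and the trivial stand-in "model" are all hypothetical.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

class InferenceHandler(BaseHTTPRequestHandler):
    """Accepts a JSON inference request and returns a JSON result."""

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))
        # Stand-in "model": sums the inputs. A real server would hand the
        # batch to a GPU-backed inference engine here.
        body = json.dumps({"output": sum(payload["inputs"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def start_server():
    # Port 0 lets the OS pick a free port; serve requests on a daemon thread.
    server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

server = start_server()
port = server.server_address[1]
req = Request(f"http://127.0.0.1:{port}/v1/infer",
              data=json.dumps({"inputs": [1, 2, 3]}).encode(),
              headers={"Content-Type": "application/json"})
with urlopen(req) as resp:
    print(json.loads(resp.read()))   # {'output': 6}
server.shutdown()
```

The appeal of the REST approach is exactly what this sketch shows: any client that can POST JSON can use the service, regardless of which GPUs sit behind it.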

Kubernetes on NVIDIA GPUs

Kubernetes on NVIDIA GPUs enables enterprises to seamlessly scale training and inference deployments across multi-cloud GPU clusters, so GPU-accelerated deep learning and HPC applications can be deployed instantly.
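On Kubernetes, a container requests GPUs through the NVIDIA device plugin's `nvidia.com/gpu` resource. A minimal pod spec might look like this (the pod name and image tag are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # hypothetical name
spec:
  containers:
  - name: tensorrt-inference
    image: nvcr.io/nvidia/tensorrt:latest   # illustrative image tag
    resources:
      limits:
        nvidia.com/gpu: 1             # request one GPU via the NVIDIA device plugin
```

The Kubernetes scheduler then places the pod only on nodes that have a free GPU, which is what makes scaling across heterogeneous GPU clusters transparent to the application.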


50X Higher Throughput to Keep Up with Expanding Workloads

Volta-powered Tesla V100 GPUs give data centers a dramatic boost in throughput for deep learning workloads to extract intelligence from today’s tsunami of data. A server with a single Tesla V100 can replace up to 50 CPU-only servers for deep learning inference workloads, so you get dramatically higher throughput with lower acquisition cost.
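The replacement claim is easy to check with back-of-the-envelope arithmetic. The throughput figures below are hypothetical placeholders chosen only to reflect the brief's 50X ratio, not measured numbers:

```python
import math

# Hypothetical sustained inference throughput, in images/sec per server.
cpu_server_throughput = 100
v100_server_throughput = 5_000   # 50X a CPU-only server, per the brief's ratio

target = 20_000                  # images/sec the service must sustain

cpu_servers = math.ceil(target / cpu_server_throughput)    # 200 servers
gpu_servers = math.ceil(target / v100_server_throughput)   # 4 servers
print(cpu_servers, gpu_servers)                            # 200 4
```

The acquisition-cost argument follows directly: even at a much higher per-server price, 4 GPU servers undercut 200 CPU servers.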

Unprecedented Efficiency for Low-Power, Scale-Out Servers

The ultra-efficient Tesla P4 GPU accelerates density-optimized, scale-out servers with a small form factor and a 50 W/75 W power footprint. It delivers an incredible 52X better energy efficiency than CPUs for deep learning inference workloads, so hyperscale customers can scale within their existing infrastructure and service the exponential growth in demand for AI-based applications.

A Dedicated Decode Engine for New AI-Based Video Services

The Tesla P4 GPU can analyze up to 39 HD video streams in real time. Powered by a dedicated hardware-accelerated decode engine, it works in parallel with the NVIDIA CUDA® cores performing inference. By integrating deep learning into the pipeline, customers can offer new levels of smart, innovative functionality that facilitates video search and other video-related services.
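The decode-engine/CUDA-core split described above is a classic producer-consumer pipeline: one stage decodes frames while the other runs inference on frames already decoded. A minimal Python sketch of the pattern, with trivial stand-ins for both stages:

```python
import queue
import threading

def decode_stream(frames, out_q):
    # Stand-in for the hardware decode engine: produce "decoded frames".
    for f in frames:
        out_q.put(f * 2)   # pretend decoding doubles the value
    out_q.put(None)        # end-of-stream sentinel

def run_inference(in_q, results):
    # Stand-in for CUDA-core inference, running in parallel with decode.
    while True:
        frame = in_q.get()
        if frame is None:
            break
        results.append(frame + 1)   # pretend inference adds 1

frames = list(range(5))
q = queue.Queue(maxsize=2)   # small buffer: the two stages overlap in time
results = []
producer = threading.Thread(target=decode_stream, args=(frames, q))
consumer = threading.Thread(target=run_inference, args=(q, results))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(results)   # [1, 3, 5, 7, 9]
```

Because the two stages run on separate hardware units in the P4, neither steals cycles from the other; the bounded queue is what keeps them working concurrently rather than in lockstep.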

Faster Deployment with NVIDIA TensorRT and DeepStream SDK

NVIDIA TensorRT is a high-performance neural-network inference accelerator for the production deployment of deep learning applications such as recommender systems, speech recognition, and machine translation. With TensorRT, neural networks trained in 32-bit or 16-bit precision can be optimized for reduced-precision INT8 operations on the Tesla P4 or FP16 operations on the Tesla V100. The NVIDIA DeepStream SDK taps into the power of Tesla GPUs to simultaneously decode and analyze video streams.

Performance Specs

Specification                      Tesla V100 (Universal         Tesla P4 (Ultra-Efficient,   Tesla P40 (Inference-
                                   Data Center GPU)              Scale-Out Servers)           Throughput Servers)

Single-Precision (FP32)            14 teraflops (PCIe),          5.5 teraflops                12 teraflops
                                   15.7 teraflops (SXM2)
Half-Precision (FP16)              112 teraflops (PCIe),         —                            —
                                   125 teraflops (SXM2)
Integer Operations (INT8)          —                             22 TOPS*                     47 TOPS*
GPU Memory                         16 GB HBM2                    8 GB                         24 GB
Memory Bandwidth                   900 GB/s                      192 GB/s                     346 GB/s
System Interface/Form Factor       Dual-slot, full-height        Low-profile PCIe             Dual-slot, full-height
                                   PCIe or SXM2 (NVLink)                                      PCIe
Power                              250 W (PCIe),                 50 W/75 W                    250 W
                                   300 W (SXM2)
Hardware-Accelerated Video Engine  —                             1x decode engine,            1x decode engine,
                                                                 2x encode engines            2x encode engines

*Tera-Operations per Second with Boost Clock Enabled


The Tesla V100, P4, and P40 are available now for deep learning inference.