NVIDIA Triton Inference Server

Deploy, run, and scale AI for any application on any platform.

Inference for Every AI Workload

Run inference on trained machine learning or deep learning models from any framework on any processor—GPU, CPU, or other—with NVIDIA Triton Inference Server™. Part of the NVIDIA AI platform and available with NVIDIA AI Enterprise, Triton Inference Server is open-source software that standardizes AI model deployment and execution across every workload.

Top 5 Reasons Why Triton Is Simplifying Inference

The Benefits of Triton Inference Server

Supports All Training and Inference Frameworks

Deploy AI models trained in any major framework with Triton Inference Server, including TensorFlow, PyTorch, Python, ONNX, NVIDIA® TensorRT™, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more.

High-Performance Inference on Any Platform

Maximize throughput and utilization with dynamic batching, concurrent execution, optimal configuration, and streaming audio and video. Triton Inference Server supports all NVIDIA GPUs, x86 and Arm CPUs, and AWS Inferentia.
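
As an illustration of how these features are enabled, below is a minimal sketch of a Triton model configuration (config.pbtxt) that turns on dynamic batching and concurrent execution; the model name, backend, and values are hypothetical, not taken from this page.

# Hypothetical config.pbtxt; model name, backend, and values are illustrative.
name: "resnet50_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 32
dynamic_batching {
  max_queue_delay_microseconds: 100   # wait briefly to form larger batches
}
instance_group [
  { count: 2, kind: KIND_GPU }        # two concurrent execution instances on the GPU
]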

Open Source and Designed for DevOps and MLOps

Integrate Triton Inference Server into DevOps and MLOps solutions such as Kubernetes for scaling and Prometheus for monitoring. It can also be used in all major cloud and on-premises AI and MLOps platforms.

Enterprise-Grade Security, Manageability, and API Stability

NVIDIA AI Enterprise, including NVIDIA Triton Inference Server, is a secure, production-ready AI software platform designed to accelerate time to value with support, security, and API stability.

Explore the Features and Tools of NVIDIA Triton Inference Server

Large Language Model Inference

Triton offers low latency and high throughput for large language model (LLM) inference. It supports TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production.

Model Ensembles

Triton model ensembles let you execute AI workloads that span multiple models, pipelines, and pre- and postprocessing steps. Different parts of an ensemble can run on CPU or GPU, and multiple frameworks can be used within the ensemble.

NVIDIA PyTriton

PyTriton lets Python developers bring up Triton with a single line of code and use it to serve models, simple processing functions, or entire inference pipelines to accelerate prototyping and testing. 

NVIDIA Triton Model Analyzer

Model Analyzer reduces the time needed to find the optimal model deployment configuration, such as batch size, precision, and concurrent execution instances, and helps select the configuration that meets application latency, throughput, and memory requirements.

Features and Tools

Large Language Model Inference

TensorRT-LLM is an open-source library for defining, optimizing, and executing large language models (LLMs) for inference in production. It maintains the core functionality of FasterTransformer, paired with TensorRT’s deep learning compiler, in an open-source Python API to quickly support new models and customizations.
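
As a hedged sketch of what that API can look like, the snippet below assumes a recent TensorRT-LLM release with the high-level LLM API installed; the model checkpoint, prompt, and sampling values are illustrative.

# Minimal sketch, assuming a recent TensorRT-LLM release with the high-level LLM API.
from tensorrt_llm import LLM, SamplingParams

# Hypothetical checkpoint; any supported model could be used here.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.2)
for output in llm.generate(["What does an inference server do?"], params):
    print(output.outputs[0].text)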

Model Ensembles

Many modern AI workloads require executing multiple models, often with pre- and postprocessing steps for each query. Triton supports model ensembles and pipelines, can execute different parts of the ensemble on CPU or GPU, and allows multiple frameworks inside the ensemble.
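
A hedged sketch of an ensemble definition is shown below: a config.pbtxt that chains a hypothetical preprocessing model to a hypothetical classification model. All model and tensor names are illustrative.

# Hypothetical ensemble config.pbtxt; model and tensor names are illustrative.
name: "classification_pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] }
]
output [
  { name: "CLASS_PROBS", data_type: TYPE_FP32, dims: [ 1000 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"      # e.g., a Python backend model running on CPU
      model_version: -1
      input_map { key: "INPUT", value: "RAW_IMAGE" }
      output_map { key: "OUTPUT", value: "PREPROCESSED" }
    },
    {
      model_name: "classifier"      # e.g., a TensorRT or ONNX model running on GPU
      model_version: -1
      input_map { key: "INPUT", value: "PREPROCESSED" }
      output_map { key: "OUTPUT", value: "CLASS_PROBS" }
    }
  ]
}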

Tree-Based Models

The Forest Inference Library (FIL) backend in Triton provides support for high-performance inference of tree-based models with explainability (SHAP values) on CPUs and GPUs. It supports models from XGBoost, LightGBM, scikit-learn RandomForest, RAPIDS cuML RandomForest, and others in Treelite format.
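
As a hedged example, a tree-based model exported from XGBoost might be served through the FIL backend with a config.pbtxt along these lines; the model name, feature count, and parameter values are illustrative.

# Hypothetical config.pbtxt for the FIL backend; names and values are illustrative.
name: "fraud_xgboost"
backend: "fil"
max_batch_size: 4096
input [
  { name: "input__0", data_type: TYPE_FP32, dims: [ 32 ] }   # 32 features per row (assumed)
]
output [
  { name: "output__0", data_type: TYPE_FP32, dims: [ 1 ] }
]
instance_group [
  { kind: KIND_GPU }
]
parameters [
  { key: "model_type", value: { string_value: "xgboost" } },
  { key: "output_class", value: { string_value: "true" } }
]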

NVIDIA PyTriton

PyTriton provides a simple interface that lets Python developers use Triton to serve anything—models, simple processing functions, or entire inference pipelines. This native support for Triton in Python enables rapid prototyping and testing of machine learning models with performance and efficiency. A single line of code brings up Triton, providing benefits such as dynamic batching, concurrent model execution, and support for GPU and CPU. This eliminates the need to set up model repositories and convert model formats. Existing inference pipeline code can be used without modification.
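
A minimal sketch of that workflow is shown below, assuming the pytriton package is installed; the model name, tensor names, and the toy inference function are illustrative.

import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

# Toy inference callable; any Python function or model wrapper could be served the same way.
@batch
def infer_fn(INPUT):
    return {"OUTPUT": INPUT * 2.0}

with Triton() as triton:
    # Bind the callable to a Triton endpoint; names, dtypes, and shapes are hypothetical.
    triton.bind(
        model_name="doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="INPUT", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="OUTPUT", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()  # serves HTTP/gRPC endpoints until interrupted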

NVIDIA Triton Model Analyzer

Triton Model Analyzer is a tool that automatically evaluates model deployment configurations in Triton Inference Server, such as batch size, precision, and concurrent execution instances on the target processor. It helps select the optimal configuration to meet application quality-of-service (QoS) constraints—like latency, throughput, and memory requirements—and reduces the time needed to find the optimal configuration. This tool also supports model ensembles and multi-model analysis.
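
As a hedged example, a profiling run can be launched from the command line roughly as follows; the repository paths and model name are placeholders.

# Hypothetical invocation; paths and model name are placeholders.
model-analyzer profile \
  --model-repository /path/to/model_repository \
  --profile-models my_model \
  --output-model-repository-path /path/to/output_models \
  --export-path /path/to/profile_results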

See What Our Customers Are Achieving With Triton

Hear From Experts

Watch a series of expert-led talks from our How to Get Started With AI Inference series. These videos give you the chance to explore—at your own pace—a full-stack approach to AI inference and how to optimize the AI inference workflow to lower cloud expenses and boost user adoption.

Move Enterprise AI Use Cases From Development to Production

Join AI inference experts to explore the fascinating world of AI inference and how NVIDIA's AI inference platform can help you successfully take your enterprise AI use case and trained AI models from development to production.

Harness the Power of Cloud-Ready AI Inference Solutions

Gain insights into how the NVIDIA AI inference platform seamlessly integrates with leading cloud service providers, simplifying deployment and expediting the launch of LLM-powered AI use cases. 

Unlocking AI Model Performance

Learn how to find the right balance between latency and throughput using Triton Model Analyzer's hyperparameter search, and manage models efficiently with Triton Model Navigator.

Accelerate AI Model Inference at Scale for Financial Services

Take a deep dive into the power of NVIDIA AI inference software and discover how it revolutionizes fraud detection, payment security, anti-money laundering, and know-your-customer systems for banks and insurance companies.

Leading Adopters Across All Industries

Get Started With Triton

Purchase NVIDIA AI Enterprise With Triton for Production Deployment

Purchase NVIDIA AI Enterprise, which includes Triton Inference Server for production inference.

Download Containers and Code for Development

Triton Inference Server containers are available on NVIDIA NGC™ and as open-source code on GitHub.
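
As an example of trying the container locally, a run might look like the following; the release tag and model repository path are placeholders.

# Pull a Triton container from NGC and serve a local model repository.
# <xx.yy> is a release tag placeholder; /path/to/model_repository is illustrative.
docker pull nvcr.io/nvidia/tritonserver:<xx.yy>-py3
docker run --gpus all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models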

Find More Resources

How to Get Started With AI Inference

Watch our on-demand webinar series on the NVIDIA AI inference platform and how it supports use cases in financial services.

Hear From Experts

Explore NVIDIA GTC sessions on inference and getting started with Triton Inference Server. 

Explore Technical Blogs

Read blogs about Triton Inference Server.

Check Out an Ebook

Discover the modern landscape of AI inference, production use cases from companies, and real-world challenges and solutions.

Stay up to date on the latest AI inference news from NVIDIA.