Inference

NVIDIA Triton Inference Server

Deploy, run, and scale AI for any application on any platform.

Inference for Every AI Workload

Run inference on trained machine learning or deep learning models from any framework on any processor—GPU, CPU, or other—with NVIDIA Triton™ Inference Server. Part of the NVIDIA AI platform and available with NVIDIA AI Enterprise, Triton Inference Server is open-source software that standardizes AI model deployment and execution across every workload.

Deploying, Optimizing, and Benchmarking LLMs

Get step-by-step instructions on how to serve large language models (LLMs) efficiently using Triton Inference Server.

The Benefits of Triton Inference Server

Supports All Training and Inference Frameworks

Deploy AI models from any major framework with Triton Inference Server, including TensorFlow, PyTorch, Python, ONNX, NVIDIA® TensorRT™, RAPIDS™ cuML, XGBoost, scikit-learn RandomForest, OpenVINO, custom C++, and more.

High-Performance Inference on Any Platform

Maximize throughput and utilization with dynamic batching, concurrent model execution, optimized model configurations, and support for streaming audio and video. Triton Inference Server supports all NVIDIA GPUs, x86 and Arm CPUs, and AWS Inferentia.
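On the client side, no special code is needed to benefit from these features: requests issued concurrently are grouped by Triton's dynamic batcher and scheduled across model instances. Below is a minimal sketch using the tritonclient Python package, assuming a server at localhost:8000 and a hypothetical model named "my_model" with one FP32 input "INPUT0" and one output "OUTPUT0".

```python
# Sketch: several concurrent requests to a running Triton server.
# Assumes tritonclient is installed (pip install "tritonclient[http]") and a
# hypothetical model "my_model" with FP32 input "INPUT0" and output "OUTPUT0".
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

pending = []
for _ in range(8):
    inp = httpclient.InferInput("INPUT0", [1, 16], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
    # async_infer returns immediately; Triton's dynamic batcher can group these
    # in-flight requests into larger batches for higher throughput.
    pending.append(client.async_infer(model_name="my_model", inputs=[inp]))

for request in pending:
    result = request.get_result()  # blocks until that response arrives
    print(result.as_numpy("OUTPUT0").shape)
```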

Open Source and Designed for DevOps and MLOps

Integrate Triton Inference Server into DevOps and MLOps solutions such as Kubernetes for scaling and Prometheus for monitoring. It can also be used in all major cloud and on-premises AI and MLOps platforms.

Enterprise-Grade Security, Manageability, and API Stability

NVIDIA AI Enterprise, including NVIDIA Triton Inference Server, is a secure, production-ready AI software platform designed to accelerate time to value with support, security, and API stability.

Explore the Features and Tools of NVIDIA Triton Inference Server

Large Language Model Inference

Triton offers low latency and high throughput for large language model (LLM) inference. It supports TensorRT-LLM, an open-source library for defining, optimizing, and executing LLMs for inference in production.
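As a rough sketch of what querying a Triton-hosted LLM can look like, the snippet below posts a prompt to the server's HTTP generate endpoint. The model name "ensemble" and the "text_input", "max_tokens", and "text_output" field names follow common TensorRT-LLM backend examples and are assumptions here; the actual names depend on how the deployed model is configured.

```python
# Sketch: send a prompt to a Triton-hosted LLM via the HTTP generate endpoint.
# The model name and input/output field names are assumptions set by the
# deployed model's configuration (here, a TensorRT-LLM ensemble).
import requests

TRITON_URL = "http://localhost:8000"  # Triton's default HTTP port

payload = {
    "text_input": "Briefly explain what AI inference is.",
    "max_tokens": 64,
}
resp = requests.post(f"{TRITON_URL}/v2/models/ensemble/generate", json=payload)
resp.raise_for_status()
print(resp.json()["text_output"])
```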

Model Ensembles

Triton model ensembles let you execute AI workloads that span multiple models, pipelines, and pre- and postprocessing steps. Different parts of an ensemble can run on CPU or GPU, and each part can use a different framework.

NVIDIA PyTriton

PyTriton lets Python developers bring up Triton with a single line of code and use it to serve models, simple processing functions, or entire inference pipelines to accelerate prototyping and testing. 
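To illustrate how lightweight this can be, the sketch below binds a trivial Python function to a Triton endpoint with PyTriton. The model name, tensor names, and the doubling function are made up for the example; see the PyTriton repository for the current API.

```python
# Sketch: serve a plain Python function with PyTriton (pip install nvidia-pytriton).
# The model name, tensor names, and the doubling logic are illustrative only.
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def infer_fn(**inputs):
    # Any Python, NumPy, or framework code can run here.
    (x,) = inputs.values()
    return {"OUTPUT": x * 2.0}

with Triton() as triton:
    triton.bind(  # the single call that exposes infer_fn as a Triton model
        model_name="doubler",
        infer_func=infer_fn,
        inputs=[Tensor(name="INPUT", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="OUTPUT", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=16),
    )
    triton.serve()  # blocks and serves over Triton's standard HTTP/gRPC endpoints
```

From here the function can be queried with standard Triton clients, just like any other model.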

NVIDIA Triton Model Analyzer

Model Analyzer reduces the time needed to find the optimal model deployment configuration, such as batch size, precision, and number of concurrent execution instances, and helps you select a configuration that meets your application's latency, throughput, and memory requirements.

Leading Adopters Across All Industries

Get Started With NVIDIA Triton

Use the right tools to deploy, run, and scale AI for any application on any platform.

Begin Developing With Code or Containers

For individuals looking to access Triton’s open-source code and containers for development, there are two options to get started for free:

Use Open-Source Code
Access open-source software on GitHub with end-to-end examples.

Download a Container
Access Linux-based Triton Inference Server containers for x86 and Arm® on NVIDIA NGC™.

Try Before You Buy

For enterprises looking to try Triton before purchasing NVIDIA AI Enterprise for production, there are two options to get started for free:

Without Infrastructure
For those without existing infrastructure, NVIDIA offers free hands-on labs through NVIDIA LaunchPad.

With Infrastructure
For those with existing infrastructure, NVIDIA offers a free evaluation license to try NVIDIA AI Enterprise for 90 days.

Resources

Top 5 Reasons Why Triton Is Simplifying Inference

NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production, letting teams deploy trained AI models from any framework, from local storage or a cloud platform, on any GPU- or CPU-based infrastructure.

Deploy HuggingFace’s Stable Diffusion Pipeline With Triton

This video showcases deploying the Stable Diffusion pipeline available through the HuggingFace Diffusers library. We use Triton Inference Server to deploy and run the pipeline.

Getting Started With NVIDIA Triton Inference Server

Triton Inference Server is an open-source inference solution that standardizes model deployment and enables fast and scalable AI in production. With so many features, a natural question is: Where do I begin? Watch to find out.

Quick-Start Guide

New to Triton Inference Server and want to deploy your model quickly? Use this quick-start guide to begin your Triton journey.
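Once a model is loaded into a running server, a first inference request from Python can be as small as the sketch below; the model name and tensor names are placeholders for whatever your model repository defines.

```python
# Sketch: a minimal first inference request with the Triton Python client
# (pip install "tritonclient[http]"). Model and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

inp = httpclient.InferInput("INPUT0", [1, 4], "FP32")
inp.set_data_from_numpy(np.ones((1, 4), dtype=np.float32))

result = client.infer(model_name="my_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))
```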

Tutorials

Getting started with Triton can lead to many questions. Explore this repository to familiarize yourself with Triton's features and find guides and examples that can help ease migration.

NVIDIA LaunchPad

In hands-on labs, experience fast and scalable AI using NVIDIA Triton Inference Server. You’ll be able to immediately unlock the benefits of NVIDIA’s accelerated computing infrastructure and scale your AI workloads.

Get the Latest News

Read about the latest inference updates and announcements for Triton Inference Server.

Explore Technical Blogs

Read technical walkthroughs on how to get started with inference.

Take a Deep Dive

Get tips and best practices for deploying, running, and scaling AI models for inference across generative AI, LLMs, recommender systems, computer vision, and more.

Deploying, Optimizing, and Benchmarking LLMs

Learn how to serve LLMs efficiently using Triton Inference Server with step-by-step instructions. We’ll cover how to easily deploy an LLM across multiple backends and compare their performance, as well as how to fine-tune deployment configurations for the best results.

Move Enterprise AI Use Cases From Development to Production

Learn what AI inference is, how it fits into your enterprise's AI deployment strategy, the key challenges in deploying enterprise-grade AI use cases, why a full-stack AI inference solution is needed to address these challenges, the main components of a full-stack platform, and how to deploy your first AI inference solution.

Harness the Power of Cloud-Ready AI Inference Solutions

Explore how the NVIDIA AI inference platform seamlessly integrates with leading cloud service providers, simplifying deployment and expediting the launch of LLM-powered AI use cases.

Oracle Cloud

NVIDIA Triton Speeds Inference on Oracle Cloud

Learn how Oracle Cloud Infrastructure's computer vision and data science services enhance the speed of AI predictions with NVIDIA Triton Inference Server.

ControlExpert

Revolutionizing Motor Claims Management

Learn how ControlExpert turned to NVIDIA AI to develop an end-to-end claims management solution that lets their customers receive round-the-clock service.

Wealthsimple

Accelerating Machine Learning Model Delivery and Inference

Discover how Wealthsimple used NVIDIA's AI inference platform to successfully reduce their model deployment duration from several months to just 15 minutes.

Triton Online Forum

Explore the online community for NVIDIA Triton Inference Server, where you can browse how-to questions, learn best practices, engage with other developers, and report bugs.

NVIDIA Developer Program

Connect with millions of like-minded developers and access hundreds of GPU-accelerated containers, models, and SDKs—all the tools necessary to successfully build apps with NVIDIA technology—through the NVIDIA Developer Program.

Accelerate Your Startup

NVIDIA Inception is a free program for cutting-edge startups that offers critical access to go-to-market support, technical expertise, training, and funding opportunities.