Simplify Model Deployment

Leverage NVIDIA Triton Inference Server to easily deploy multi-framework AI models at scale.

An End-to-End System Architecture

NVIDIA Triton Inference Server simplifies the deployment of AI models at scale in production. Triton is open-source inference-serving software that lets teams deploy trained AI models from any framework, and from local storage, Google Cloud Platform, or AWS S3, on any GPU- or CPU-based infrastructure in the cloud, the data center, or at the edge. Get started with Triton by pulling the container from the NVIDIA NGC catalog, the hub for GPU-optimized deep learning and machine learning software that accelerates development-to-deployment workflows.
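As a sketch, pulling the container from NGC and starting the server might look like the following. The release tag and model-repository path are placeholders; substitute the current release from the NGC catalog and your own repository location.

```shell
# Pull the Triton Inference Server container from the NGC catalog
# (the 24.01 tag is illustrative; use the current release).
docker pull nvcr.io/nvidia/tritonserver:24.01-py3

# Launch the server, exposing the HTTP (8000), gRPC (8001), and
# metrics (8002) ports and mounting a local model repository.
docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:24.01-py3 \
  tritonserver --model-repository=/models
```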

Benefits of Triton Inference Server

Multi-Framework Support

Triton Inference Server supports all major frameworks, including TensorFlow, NVIDIA® TensorRT™, PyTorch, and ONNX Runtime, as well as custom backends. It gives AI researchers and data scientists the freedom to choose the right framework for their projects.
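Each model in the repository declares its framework in a small configuration file; Triton picks the matching backend from the platform field. A minimal sketch for a hypothetical ONNX classifier (the model name, tensor names, and shapes are illustrative):

```protobuf
# config.pbtxt -- hypothetical ONNX model; names and shapes are illustrative.
name: "resnet50_onnx"
# Other values include "tensorrt_plan", "tensorflow_savedmodel",
# and "pytorch_libtorch".
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```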

High-Performance Inference

It runs models concurrently on GPUs to maximize utilization, supports CPU-based inferencing, and offers advanced features like model ensembles and streaming inferencing. It helps developers bring models to production rapidly.
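Concurrent execution and request batching are configured per model. As a sketch, the additions below to a model's config.pbtxt run two copies of the model on each GPU and let the scheduler group incoming requests into batches (the counts and delay are illustrative values, not recommendations):

```protobuf
# Run two instances of the model on each available GPU.
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
# Batch individual requests together, waiting at most 100 microseconds
# to form a preferred batch size before executing.
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```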

Designed for DevOps and MLOps

Available as a Docker container, it integrates with Kubernetes for orchestration and scaling, is part of Kubeflow, and exports Prometheus metrics for monitoring. It helps IT and DevOps streamline model deployment in production.
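Triton serves Prometheus-format metrics (GPU utilization, request counts, latencies) over HTTP, by default on port 8002. A minimal scrape configuration might look like this; the job name and target host are assumptions for illustration:

```yaml
# prometheus.yml fragment -- scrape Triton's metrics endpoint
# (served at :8002/metrics by default).
scrape_configs:
  - job_name: "triton"
    static_configs:
      - targets: ["triton-host:8002"]  # hypothetical host name
```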

The Inference Pipeline

Simplified Model Deployment

NVIDIA Triton Inference Server simplifies deployment of AI deep learning models at scale in production, on either GPUs or CPUs. It supports all major frameworks, runs multiple models concurrently to increase throughput and utilization, and integrates with DevOps tools for a streamlined production pipeline that's easy to set up.

These capabilities combine to bring data scientists, developers, and IT operators together to accelerate AI development and deployment to production.

Designed for Scalability

NVIDIA Triton Inference Server provides data center and cloud scalability through microservices-based inference. It can be deployed as a container microservice to serve pre- or post-processing and deep learning models on GPUs and CPUs. Each Triton instance can be scaled independently in a Kubernetes-like environment for optimal performance. A single Helm command from NGC deploys Triton in Kubernetes.
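As a sketch, the Helm deployment might look like the following; the repository URL is NGC's Helm registry, while the chart and release names shown are illustrative (check the NGC catalog for the current chart):

```shell
# Add NGC's Helm repository and install the Triton chart
# (chart name is illustrative; see the NGC catalog).
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm install triton nvidia/tritoninferenceserver
```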

Triton can be used to deploy models in the cloud, in on-premises data centers, or at the edge.


Get started with NVIDIA Triton Inference Server on NGC.