NVIDIA GPU Health

Visualize and monitor GPU fleets in real time.

Overview

Boost GPU Uptime Across Computing Infrastructures

NVIDIA GPU Health is a comprehensive solution for visualizing and monitoring fleets of NVIDIA GPU devices. With it, cloud partners and enterprises can monitor usage, configuration, and errors to ensure the uptime, availability, quality, and integrity of GPU and hardware infrastructure.

Join the NVIDIA GPU Health Early Access Program

Once qualified, collaborate with NVIDIA to improve the availability and integrity of your GPU fleet.

Learn About DGX Cloud

NVIDIA DGX Cloud accelerates AI workloads in the cloud, delivering high-performance training, scalable inference, and global GPU access for developers and platform teams.

What Is NVIDIA GPU Health?

NVIDIA GPU Health is a solution for monitoring the health and integrity of GPUs. It’s a low-level, deployment-agnostic managed service that can be used regardless of software stack or scheduler choice. GPU Health currently supports data center customers who manage their own GPU infrastructure and consumers who need better insight into GPU behavior. The solution draws on technology and IP from across NVIDIA's portfolio of products, as well as lessons learned from running thousands of GPUs in NVIDIA DGX Cloud.

The GPU Health agent builds on GPU management and optimization technology from across NVIDIA's suite of products. It captures metrics and sends them to the GPU Health platform, where they are analyzed and hosted for customers to review.

Features

Ensure the Uptime, Availability, Quality, and Integrity of GPU Infrastructure

Fleet Inventory and Visualization

GPU Health offers rich visualization of fleet inventory across data centers and clouds. The solution uses an agent that can be easily deployed on GPU worker nodes to establish secure communication with GPU Health.

Reporting, Alerts, and Health Checks

The GPU Health agent leverages technology from DGX Cloud’s product suite. Metrics captured by the GPU Health agent are communicated back to GPU Health for review.

Integrity and Attestation

GPU Health uses NVIDIA Confidential Computing to verify GPU integrity. At runtime, the agent gathers and signs evidence using on-device certificates and the NVIDIA Attestation SDK, ensuring system authenticity and trust.

Benefits

What Does NVIDIA GPU Health Offer?

Power

Track spikes and throttling to keep within data center budgets and prevent brownouts while maximizing performance per watt.

Temperature

Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.

Performance

Watch utilization, memory bandwidth, interconnect health, and throttling reasons to spot regressions and imbalance across the fleet.
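As a rough illustration of spotting imbalance across a fleet, the sketch below flags GPUs whose utilization sits far from the fleet median. It is not GPU Health's method. The function name, the GPU identifiers, the sample values, and the 20-point tolerance are all assumptions made for this example.

```python
from statistics import median


def find_utilization_outliers(utilization_by_gpu, tolerance_pct=20.0):
    """Return ids of GPUs whose utilization differs from the fleet
    median by more than tolerance_pct percentage points."""
    if not utilization_by_gpu:
        return []
    fleet_median = median(utilization_by_gpu.values())
    return sorted(
        gpu_id
        for gpu_id, util in utilization_by_gpu.items()
        if abs(util - fleet_median) > tolerance_pct
    )


# Hypothetical sample: percent utilization per GPU (not real fleet data).
sample_utilization = {
    "node1/gpu0": 92.0,
    "node1/gpu1": 90.0,
    "node2/gpu0": 41.0,
    "node2/gpu1": 89.0,
}

outliers = find_utilization_outliers(sample_utilization)
print(outliers)
```

A median is used rather than a mean so that one badly underperforming GPU does not shift the baseline it is compared against.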

Health

Surface error-correction code (ECC) and XID errors; retired pages; anomalies in high-bandwidth memory (HBM), NVIDIA NVLink™, and PCIe; and other reliability, availability, and serviceability (RAS) signals to catch impending failures early.
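To make the idea of surfacing XID errors concrete, here is a minimal sketch that counts XID events per device from kernel log lines. It does not show how the GPU Health agent works. The log line layout is assumed from commonly seen NVIDIA driver messages and may differ between driver versions, and the sample lines, PCI addresses, and function name are invented for the example.

```python
import re
from collections import Counter

# Assumed layout: "NVRM: Xid (<PCI address>): <code>, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def count_xid_events(log_lines):
    """Count XID events keyed by (device, xid_code)."""
    counts = Counter()
    for line in log_lines:
        match = XID_PATTERN.search(line)
        if match:
            key = (match.group("device"), int(match.group("code")))
            counts[key] += 1
    return counts


# Illustrative log lines only; not captured from a real system.
sample_log = [
    "[1234.5] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[1240.1] NVRM: Xid (PCI:0000:3b:00): 79, pid=4321, GPU has fallen off the bus.",
    "[1300.0] NVRM: Xid (PCI:0000:af:00): 48, pid=987, double bit ECC error",
    "[1301.0] usb 1-1: new high-speed USB device",
]

xid_counts = count_xid_events(sample_log)
for (device, code), count in sorted(xid_counts.items()):
    print(f"{device} xid={code} count={count}")
```

Repeated events with the same code on the same device are the kind of pattern a fleet operator would typically want flagged for follow-up.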

Uniform Configuration and Integrity

Enforce consistent drivers, CUDA® versions and toolchains, firmware, power limits, and Basic Input/Output System (BIOS) settings, and verify image and firmware integrity, to ensure reproducible results and safe operation.
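One simple way to think about configuration uniformity is to compare every node against the most common value of each setting. The sketch below does that under stated assumptions: the node names, setting keys, and version strings are placeholders, and the majority-vote baseline is a choice made for this example, not a description of GPU Health's behavior.

```python
from collections import Counter


def find_config_drift(node_configs):
    """For each setting, treat the most common value as the baseline and
    return {node: {setting: (actual, baseline)}} for nodes that differ."""
    drift = {}
    all_keys = sorted({key for cfg in node_configs.values() for key in cfg})
    for key in all_keys:
        values = Counter(cfg.get(key) for cfg in node_configs.values())
        baseline, _ = values.most_common(1)[0]
        for node, cfg in sorted(node_configs.items()):
            if cfg.get(key) != baseline:
                drift.setdefault(node, {})[key] = (cfg.get(key), baseline)
    return drift


# Hypothetical fleet snapshot; all values are placeholders.
sample_configs = {
    "node-a": {"driver": "550.54.15", "power_limit_w": 700},
    "node-b": {"driver": "550.54.15", "power_limit_w": 700},
    "node-c": {"driver": "535.129.03", "power_limit_w": 650},
}

drift_report = find_config_drift(sample_configs)
for node, diffs in drift_report.items():
    for setting, (actual, expected) in diffs.items():
        print(f"{node}: {setting} is {actual}, fleet baseline is {expected}")
```

In practice an operator might prefer an explicitly pinned baseline over a majority vote, since a majority of nodes can be misconfigured in the same way.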

Next Steps

Ready to Get Started?

Get transparent, real-time infrastructure inventory and health monitoring for GPUs.

Learn More About NVIDIA DGX Cloud

NVIDIA DGX Cloud accelerates AI workload pretraining, fine-tuning, inference, and the deployment of physical and industrial AI applications.

Explore NVIDIA DGX Cloud Documentation

Access technical documentation for DGX Cloud, including software release updates, admin manuals, quick-start guides, and tutorials.