Visualize and monitor GPU fleets in real time.
Overview
NVIDIA GPU Health is a comprehensive solution for visualizing and monitoring fleets of NVIDIA GPU devices. With it, cloud partners and enterprises can monitor usage, configuration, and errors to ensure the uptime, availability, quality, and integrity of GPU and hardware infrastructure.
GPU Health is a low-level, deployment-agnostic managed service that works regardless of software stack or scheduler choice. It currently supports data center customers who manage their own GPU infrastructure and consumers who need better insight into GPU behavior. The solution draws on technology and IP from across NVIDIA's product portfolio, as well as lessons learned from running thousands of GPUs in NVIDIA DGX Cloud.
The GPU Health agent builds on GPU management and optimization technology from across NVIDIA's suite of products. It captures metrics and sends them to the GPU Health platform, where they are analyzed and hosted for customers to review.
Features
GPU Health offers rich visualization of fleet inventory across data centers and clouds. The solution uses an agent that can be easily deployed on GPU worker nodes to establish secure communication with GPU Health.
The agent also builds on technology from DGX Cloud’s product suite.
GPU Health uses NVIDIA Confidential Computing to verify GPU integrity. At runtime, the agent gathers and signs evidence using on-device certificates and the NVIDIA Attestation SDK, ensuring system authenticity and trust.
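To make the sign-then-verify flow above more concrete, here is a minimal sketch in Python. It is not the NVIDIA Attestation SDK and does not use on-device certificates: a shared-secret HMAC from the standard library stands in for the certificate-based signature, purely for illustration. Every name in it (`build_evidence`, `sign_evidence`, `verify_evidence`, the record fields, and `DEMO_KEY`) is hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical shared secret used only for this sketch. A real deployment
# would rely on an on-device certificate chain, not a symmetric key.
DEMO_KEY = b"demo-key-not-for-production"


def build_evidence(gpu_uuid: str, driver_version: str, nonce: str) -> bytes:
    """Serialize an illustrative evidence record deterministically."""
    record = {"gpu_uuid": gpu_uuid, "driver_version": driver_version, "nonce": nonce}
    return json.dumps(record, sort_keys=True).encode("utf-8")


def sign_evidence(evidence: bytes, key: bytes = DEMO_KEY) -> str:
    """Return a hex HMAC-SHA256 tag over the evidence bytes."""
    return hmac.new(key, evidence, hashlib.sha256).hexdigest()


def verify_evidence(evidence: bytes, signature: str, expected_nonce: str,
                    key: bytes = DEMO_KEY) -> bool:
    """Check the tag, then confirm the evidence answers the verifier's nonce."""
    expected = hmac.new(key, evidence, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return False
    return json.loads(evidence)["nonce"] == expected_nonce
```

Binding the evidence to a verifier-supplied nonce is what keeps an old, valid record from being replayed. The tag check rejects any record that was altered after signing.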
Benefits
Track power spikes and throttling to stay within data center power budgets and prevent brownouts while maximizing performance per watt.
Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.
Watch utilization, memory bandwidth, interconnect health, and throttling reasons to spot regressions and imbalance across the fleet.
Surface error-correction code (ECC) and XID errors; retired pages; anomalies in high-bandwidth memory (HBM), NVIDIA NVLink™, and PCIe; and other reliability, availability, and serviceability (RAS) signals to catch developing faults before they become failures.
Enforce consistent drivers, CUDA® and toolchains, firmware, power limits, and Basic Input/Output System (BIOS) settings, and verify image and firmware integrity, to ensure reproducible results and safe operation.
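The kinds of checks listed above can be sketched as simple rules over telemetry samples. The example below is a minimal illustration, not GPU Health's actual logic or data model: the `GpuSample` fields, the 85 °C and 95% thresholds, and both function names are assumptions chosen for the sketch.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One illustrative telemetry sample; every field name is hypothetical."""
    node: str
    power_watts: float
    power_limit_watts: float
    temperature_c: float
    uncorrected_ecc_errors: int
    driver_version: str


def check_sample(sample: GpuSample, max_temp_c: float = 85.0,
                 power_headroom: float = 0.95) -> list[str]:
    """Return findings for one sample; the default thresholds are assumed."""
    findings = []
    if sample.power_watts > power_headroom * sample.power_limit_watts:
        findings.append("power near limit")
    if sample.temperature_c > max_temp_c:
        findings.append("temperature above threshold")
    if sample.uncorrected_ecc_errors > 0:
        findings.append("uncorrected ECC errors present")
    return findings


def driver_drift(samples: list[GpuSample]) -> list[str]:
    """List nodes whose driver differs from the fleet's most common version."""
    if not samples:
        return []
    baseline, _ = Counter(s.driver_version for s in samples).most_common(1)[0]
    return [s.node for s in samples if s.driver_version != baseline]
```

Treating the most common driver version as the baseline is one simple way to flag configuration drift. A fleet with a pinned, approved version would compare against that instead.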