Cloud-Native Supercomputing

Uncompromised HPC and AI performance, multi-node tenant isolation, and security.

Bare-Metal Performance with Multi-Tenant Isolation

Cloud-native supercomputing blends the power of high-performance computing with the security and ease of use of cloud computing services. The NVIDIA Cloud-Native Supercomputing platform leverages the NVIDIA® BlueField® data processing unit (DPU) architecture with high-speed, low-latency NVIDIA® InfiniBand networking to deliver bare-metal performance, user management and isolation, data protection, and on-demand high-performance computing (HPC) and AI services, simply and securely.

Innovation for the Next Decade and Beyond

The Cloud-Native Supercomputing Platform

Supercomputers need to deliver maximum performance while also offering multi-tenant security, which is ideally achieved through a cloud-native platform. The key element that enables this architectural transition is the DPU.

As a fully integrated data center-on-a-chip platform, the DPU can offload and manage data center infrastructure instead of the host processor, enabling security and orchestration of the supercomputer. 

Combined with NVIDIA InfiniBand switching, this architecture delivers optimal bare-metal performance, while natively supporting multi-node tenant isolation.


Toward a Zero-Trust Architecture

Cloud-native supercomputing systems are designed to deliver maximum performance, security, and orchestration in a multi-tenant environment.

The BlueField DPU can host untrusted multi-node tenants while ensuring that supercomputing resources are handed over clean to each new tenant, with no residual data or state left by the previous one. To achieve this, the BlueField DPU provides a clean boot image for a newly scheduled tenant, performs a complete cleanup and re-establishment of trust, virtualizes storage, and grants access only to approved storage areas.
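The handover steps described above can be pictured as a small state machine. The following is only an illustrative sketch: the `Node` class, its fields, and the `hand_over` function are hypothetical names invented for this example and do not correspond to any BlueField or DOCA API.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Node:
    """Hypothetical model of one compute node as seen by its DPU."""
    name: str
    boot_image: str = "unknown"
    scrubbed: bool = False
    trusted: bool = False
    tenant: Optional[str] = None
    storage_areas: Set[str] = field(default_factory=set)


def hand_over(node: Node, new_tenant: str, approved_storage: Set[str]) -> Node:
    """Prepare a node for a new tenant, following the order given in the text."""
    # Detach the previous tenant and drop its storage mappings.
    node.tenant = None
    node.storage_areas.clear()
    node.trusted = False

    # Step 1: provide a clean boot image for the newly scheduled tenant.
    node.boot_image = "clean"

    # Step 2: complete cleanup, then re-establish trust.
    node.scrubbed = True
    node.trusted = node.boot_image == "clean" and node.scrubbed
    if not node.trusted:
        raise RuntimeError(f"{node.name}: trust could not be re-established")

    # Steps 3 and 4: expose virtualized storage, limited to approved areas.
    node.tenant = new_tenant
    node.storage_areas = set(approved_storage)
    return node


# Usage: a node last used by "tenant-a" is handed to "tenant-b".
node = Node(name="node-01", tenant="tenant-a", storage_areas={"/scratch/a"})
hand_over(node, "tenant-b", {"/projects/b"})
print(node.tenant, sorted(node.storage_areas), node.trusted)
```

The point of the sketch is the ordering: nothing belonging to the new tenant is attached until the previous tenant's state has been removed and trust has been re-established.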

Application Performance Acceleration

HPC and AI communication frameworks and libraries are latency- and bandwidth-sensitive, and they play a critical role in determining application performance.

Offloading these libraries from the host CPU or GPU to the BlueField DPU maximizes the overlap between communication and computation, because both can progress in parallel. It also reduces the negative effects of operating system jitter and increases application performance. This is key to enabling the next generation of supercomputing architecture.
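Why overlap matters can be shown with a simple back-of-the-envelope model. The sketch below is an assumption-laden illustration, not a description of how the platform schedules work: `step_time` is a hypothetical helper, and the timings are made-up values rather than measurements.

```python
def step_time(compute_s: float, comm_s: float, overlap: float) -> float:
    """Wall-clock time of one step when a fraction `overlap` of the
    communication time is hidden behind computation (0.0 = none, 1.0 = all)."""
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be between 0.0 and 1.0")
    # Communication can only be hidden while computation is running.
    hidden = min(comm_s * overlap, compute_s)
    return compute_s + (comm_s - hidden)


# Hypothetical step: 8 s of computation and 2 s of communication.
serial = step_time(8.0, 2.0, overlap=0.0)      # host drives communication
offloaded = step_time(8.0, 2.0, overlap=1.0)   # communication progresses elsewhere
print(f"no overlap: {serial} s, full overlap: {offloaded} s, "
      f"speedup: {serial / offloaded:.2f}x")
```

In this model the achievable gain is bounded by the share of each step spent communicating, which is why communication-heavy workloads stand to benefit most from offload.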

Early research results from The Ohio State University demonstrate that cloud-native supercomputers can perform HPC jobs 1.4x faster than traditional supercomputers.

DPU Provides 1.4X Higher Performance Acceleration for P3DFFT


Cloud-Native Supercomputing Platform

NVIDIA BlueField

The NVIDIA BlueField DPU combines the industry-leading NVIDIA ConnectX® network adapter, an array of Arm cores with a PCIe subsystem, and purpose-built HPC hardware acceleration engines to deliver full data center infrastructure-on-chip programmability.

InfiniBand

NVIDIA InfiniBand networking accelerates and offloads data transfers to ensure compute resources never “go hungry” due to lack of data or bandwidth. The InfiniBand network can be partitioned between different users or tenants, providing security and QoS guarantees.
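In InfiniBand, partitioning is expressed through 16-bit partition keys (P_Keys): the low 15 bits identify the partition and the top bit marks full versus limited membership. The text above does not say how tenants are mapped to partitions, so the following is only a sketch of the general P_Key matching rule; `can_communicate` and the example key values are hypothetical.

```python
FULL_MEMBER_BIT = 0x8000   # top bit set: full member; clear: limited member
PARTITION_MASK = 0x7FFF    # low 15 bits: partition number


def can_communicate(pkey_a: int, pkey_b: int) -> bool:
    """Return True if two ports with these P_Keys may exchange traffic.

    Both keys must name the same partition, and at least one side must be
    a full member (two limited members cannot talk to each other).
    """
    same_partition = (pkey_a & PARTITION_MASK) == (pkey_b & PARTITION_MASK)
    either_full = bool((pkey_a | pkey_b) & FULL_MEMBER_BIT)
    return same_partition and either_full


# Hypothetical keys: tenant A uses partition 0x0001, tenant B uses 0x0002.
tenant_a_full = 0x8001
tenant_a_limited = 0x0001
tenant_b_full = 0x8002

print(can_communicate(tenant_a_full, tenant_a_limited))    # same partition
print(can_communicate(tenant_a_full, tenant_b_full))       # different partitions
print(can_communicate(tenant_a_limited, tenant_a_limited)) # both limited
```

Giving each tenant its own partition number is one way such a scheme could keep tenants' traffic separate on a shared fabric.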

DOCA

The NVIDIA DOCA SDK enables infrastructure developers to rapidly create network, storage, security, management, and AI and HPC applications and services on top of the NVIDIA BlueField DPU, leveraging industry-standard APIs. With DOCA, developers can program the supercomputing infrastructure of tomorrow by creating high-performance, software-defined, and cloud-native DPU-accelerated services.

Magnum IO

The NVIDIA Magnum IO™ software development kit enables developers to optimize the input/output (IO) in applications, reducing the end-to-end time of their workflows.

Magnum IO covers all aspects of IO, including storage, networking, multi-GPU, and multi-node communications. It also includes tools to profile and tune applications and eliminate IO bottlenecks.

Key Features

  • Multi-tenant isolation, data protection, and security
  • Infrastructure service offloads
  • Dedicated hardware engines for accelerating communication frameworks
  • Enhanced quality of service (QoS)

Benefits

  • Optimal bare-metal performance
  • Increased CPU availability, application scalability, and system efficiency
  • Higher compute and communication overlap
  • Reduced jitter / system noise
  • Reduced infrastructure costs

Learn more about cloud-native supercomputing in the technical overview.