High Performance Computing with CUDA


International Supercomputing Conference 2008
Dresden, Germany
International Congress Center Dresden
Conference Room 6, Level 4
Monday June 16, 2008, 9:00am to 5:00pm
Presenters: Massimiliano Fatica (NVIDIA), Mark Harris (NVIDIA, to be confirmed), Patrick LeGresley (NVIDIA), Jim Phillips (UIUC)

Note: This event is held before ISC begins and is therefore not affiliated with the ISC conference. Attendees do not need to register for the conference to attend the tutorial.

NVIDIA® CUDA™ is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multi-threaded many-core GPUs and scales transparently to hundreds of cores. Scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes (a list of codes, academic papers, and commercial packages based on CUDA is available). And with an upcoming version of CUDA, a new compiler backend will extend CUDA to multi-core CPUs.

In this tutorial, NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use in science and engineering domains. We will demonstrate its application with traditional HPC examples, including BLAS, FFT, and integration with Fortran and high-level languages (MATLAB, Mathematica, Python), and describe in detail the programming model at the heart of it all. We will then turn to advanced topics, including optimizing CUDA programs, CUDA floating-point performance and accuracy, and CUDA programming strategies and tips. Finally, we will present detailed case studies in which domain scientists describe their experience using CUDA to accelerate mature, deployed, real-world science codes.

Preliminary Tutorial Outline

1) Introduction and Overview

  • Course overview: Goal, presenters, and schedule
  • Motivation: CUDA as a programming model for many-core systems
  • Success stories: Brief sampling of CUDA applications in various domains

2) Programming CUDA

  • Execution model: Kernels, threads, blocks, and grids
  • Memory model: Global, local, shared, and constant memory
  • Constructs: Declspecs, keywords, thread launch
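To give a flavor of these constructs, the sketch below shows a minimal (hypothetical) kernel using the __global__ declspec, built-in thread/block indices, and the <<<grid, block>>> launch syntax; error checking is omitted for brevity:

```cuda
#include <stdio.h>

// __global__ marks a kernel: code that runs on the GPU,
// launched from the CPU across a grid of thread blocks.
__global__ void scale(float *data, float alpha, int n)
{
    // Each thread handles one element; blockIdx, blockDim,
    // and threadIdx identify that thread's position in the grid.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}

int main(void)
{
    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Launch a 1D grid of blocks with 256 threads per block.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(d_data, 2.0f, n);

    cudaThreadSynchronize();  // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}
```

The same pattern (one thread per data element, guarded by a bounds check) scales transparently as the grid grows.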

3) CUDA Toolkit

  • Compiler, visual profiler and debugger
  • Libraries: CUBLAS and CUFFT
  • Calling CUDA from high-level languages: MATLAB, Mathematica, Python
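As an illustration of the library interface, the sketch below calls CUBLAS to perform a single-precision matrix multiply without writing any kernel code (values and matrix size are arbitrary; error checking is omitted):

```cuda
#include <stdio.h>
#include "cublas.h"

int main(void)
{
    const int n = 4;
    float a[n * n], b[n * n], c[n * n];
    for (int i = 0; i < n * n; ++i) { a[i] = 1.0f; b[i] = 1.0f; c[i] = 0.0f; }

    cublasInit();

    // Allocate matrices on the GPU and copy the inputs over.
    float *dA, *dB, *dC;
    cublasAlloc(n * n, sizeof(float), (void **)&dA);
    cublasAlloc(n * n, sizeof(float), (void **)&dB);
    cublasAlloc(n * n, sizeof(float), (void **)&dC);
    cublasSetMatrix(n, n, sizeof(float), a, n, dA, n);
    cublasSetMatrix(n, n, sizeof(float), b, n, dB, n);

    // C = 1.0 * A * B + 0.0 * C, column-major as in standard BLAS.
    cublasSgemm('n', 'n', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

    cublasGetMatrix(n, n, sizeof(float), dC, n, c, n);
    printf("c[0] = %g\n", c[0]);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    return 0;
}
```

CUFFT follows the same pattern: allocate device memory, create a plan, execute it, and copy results back.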

4) Optimizing CUDA

  • Optimization strategies:
    • Coalescing memory operations
    • Streams and asynchronous API
    • Using template parameters to write general-yet-optimized code
  • Algorithmic strategies
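The streams item above can be sketched as follows: splitting the data into chunks and issuing each chunk's copies and kernel into its own stream lets transfers in one stream overlap with computation in another. This is a simplified illustration (hypothetical kernel, no error checking); asynchronous copies require page-locked host memory:

```cuda
#include <stdio.h>

__global__ void increment(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    const int nstreams = 2;
    const int chunk = n / nstreams;

    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  // pinned host memory
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaStream_t s[nstreams];
    for (int i = 0; i < nstreams; ++i)
        cudaStreamCreate(&s[i]);

    // Each stream copies its chunk in, processes it, and copies it out;
    // work queued in different streams may overlap on capable hardware.
    for (int i = 0; i < nstreams; ++i) {
        int off = i * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        increment<<<chunk / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
    }

    cudaThreadSynchronize();  // wait for all streams

    for (int i = 0; i < nstreams; ++i)
        cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

Coalescing, by contrast, is a matter of access pattern rather than API: consecutive threads in a block should read consecutive addresses, as the indexing in the kernel above does.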

5) Case Study: Molecular Visualization and Analysis

  • Electrostatic potential map calculations, ion placement
  • Trajectory analysis computations
  • Pursuing peak performance: What worked, what didn't
  • Application of advanced CUDA capabilities:
    • Asynchronous API, CPU/GPU overlap
  • Using the CPU to streamline GPU computation for peak performance
    • Multi-GPU work decomposition and implementation details

6) Case Study: Computational Fluid Dynamics

  • Mapping PDE solver algorithms to the GPU architecture
  • Optimizing for single- versus multi-GPU performance
  • Efficiently handling boundary conditions

7) Case Study: Molecular Dynamics

  • NAMD: A parallel molecular dynamics code for high-performance simulation of large biomolecular systems
  • Fitting the problem on a GPU (high-level issues)
  • Scaling from liquid argon to a million-atom biomolecule