High Performance Computing with CUDA

At International Supercomputing Conference 2008
Convention Center
Dresden, Germany
Monday, June 16, 2008, 9:00am to 5:00pm
Presenters: Massimiliano Fatica (NVIDIA), Mark Harris (NVIDIA, to be confirmed), Patrick LeGresley (NVIDIA), Jim Phillips (UIUC)

Note: This event takes place before ISC begins and is not affiliated with the conference. Attendees do not need to register for ISC.

NVIDIA® CUDA™ is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multi-threaded many-core GPUs and scales transparently to hundreds of cores. Scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes (see http://www.gpucomputing.org for a list of codes, academic papers, and commercial packages based on CUDA). An upcoming version of CUDA adds a compiler backend that extends CUDA to multi-core CPUs.
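
To make the three abstractions concrete, here is a minimal sketch that emulates them on the CPU in plain C++. It is not CUDA code and is not taken from the tutorial materials; the names `BLOCK_DIM`, `block_sum`, and `grid_sum` are hypothetical. Each emulated block owns a small array standing in for shared memory, each loop over `tid` stands in for the threads of that block, and the end of each such loop plays the role of a barrier, since every thread has finished the phase before the next one starts.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical block size (threads per block); assumed to be a power of two.
constexpr std::size_t BLOCK_DIM = 8;

// Emulates one thread block: sums the BLOCK_DIM input elements that start
// at index block_id * BLOCK_DIM.
float block_sum(const std::vector<float>& in, std::size_t block_id) {
    float shared[BLOCK_DIM];  // stands in for per-block shared memory

    // Phase 1: each emulated thread loads one element (0 if out of range).
    for (std::size_t tid = 0; tid < BLOCK_DIM; ++tid) {
        std::size_t gid = block_id * BLOCK_DIM + tid;  // global element index
        shared[tid] = gid < in.size() ? in[gid] : 0.0f;
    }
    // End of the loop acts as a barrier: all loads are complete.

    // Phase 2: tree reduction; the number of active threads halves each step.
    for (std::size_t stride = BLOCK_DIM / 2; stride > 0; stride /= 2) {
        for (std::size_t tid = 0; tid < stride; ++tid) {
            shared[tid] += shared[tid + stride];
        }
        // End of the inner loop acts as a barrier before the next step.
    }
    return shared[0];
}

// Emulates the grid: blocks are independent, so they may run in any order.
float grid_sum(const std::vector<float>& in) {
    std::size_t num_blocks = (in.size() + BLOCK_DIM - 1) / BLOCK_DIM;
    float total = 0.0f;
    for (std::size_t b = 0; b < num_blocks; ++b) {
        total += block_sum(in, b);
    }
    return total;
}
```

Because the blocks share no state, the loop in `grid_sum` could distribute them over any number of cores, which is the property behind the transparent scaling described above.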

In this tutorial, NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use in science and engineering domains. We will demonstrate its application with traditional HPC examples, including BLAS, FFT, and integration with Fortran and high-level languages (MATLAB, Mathematica, Python), and describe in detail the programming model at the heart of it all. We will then turn to advanced topics, including optimizing CUDA programs, CUDA floating-point performance and accuracy, and CUDA programming strategies and tips. Finally, we will present detailed case studies in which domain scientists describe their experience using CUDA to accelerate mature, deployed, real-world science codes.

RSVP for this event!

Preliminary Tutorial Outline

1) Introduction and Overview

Course overview: Goal, presenters, and schedule
Motivation: CUDA as a programming model for many-core systems
Success stories: Brief sampling of CUDA applications in various domains

2) Programming CUDA

Execution model: Kernels, threads, blocks, and grids
Memory model: Global, local, shared, and constant memory
Constructs: Declspecs, keywords, thread launch

3) CUDA Toolkit

Compiler, visual profiler, and debugger
Libraries: CUBLAS and CUFFT
Calling CUDA from high-level languages: MATLAB, Mathematica, Python

4) Optimizing CUDA

Optimization strategies:

Coalescing memory operations
Streams and asynchronous API
Using template parameters to write general-yet-optimized code

Algorithmic strategies
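
The outline does not say which template technique the presenters cover. One common reading of "general-yet-optimized" is to turn a tuning parameter into a template parameter, so that a single source function yields several specialized versions with compile-time-constant loop bounds. The sketch below shows that idea in plain C++ on the CPU; `unrolled_sum` and `sum_dispatch` are hypothetical names, and the set of unroll factors is an arbitrary assumption.

```cpp
#include <cstddef>

// Hypothetical helper: sums n floats, handling UNROLL elements per iteration.
// UNROLL is a compile-time constant, so the inner loops have a fixed trip
// count that the compiler can unroll completely.
template <std::size_t UNROLL>
float unrolled_sum(const float* data, std::size_t n) {
    static_assert(UNROLL > 0, "UNROLL must be positive");
    float acc[UNROLL] = {};  // independent partial sums, zero-initialized
    std::size_t i = 0;
    for (; i + UNROLL <= n; i += UNROLL) {
        for (std::size_t u = 0; u < UNROLL; ++u) {
            acc[u] += data[i + u];
        }
    }
    float total = 0.0f;
    for (std::size_t u = 0; u < UNROLL; ++u) {
        total += acc[u];
    }
    for (; i < n; ++i) {
        total += data[i];  // leftover tail shorter than UNROLL
    }
    return total;
}

// Runtime dispatch: one generic source, several specialized instantiations.
float sum_dispatch(const float* data, std::size_t n, std::size_t unroll) {
    switch (unroll) {
        case 8:  return unrolled_sum<8>(data, n);
        case 4:  return unrolled_sum<4>(data, n);
        case 2:  return unrolled_sum<2>(data, n);
        default: return unrolled_sum<1>(data, n);
    }
}
```

The same pattern carries over to device code, where a block size or element type is a typical template parameter, but nothing here depends on a GPU.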

5) Case Study: Molecular Visualization and Analysis

Electrostatic potential map calculations, ion placement
Trajectory analysis computations
Pursuing peak performance: What worked, what didn't
Application of advanced CUDA capabilities:

Asynchronous API, CPU/GPU overlap
Using the CPU to streamline GPU computation for peak performance
Multi-GPU work decomposition and implementation details

6) Case Study: Molecular Dynamics

NAMD: A parallel molecular dynamics code for high-performance simulation of large biomolecular systems
Fitting the problem on a GPU (high-level issues)
Scaling from liquid argon to a million-atom biomolecule

7) Case Study: Computational Fluid Dynamics

Mapping PDE solver algorithms to the GPU architecture
Optimizing for single- versus multi-GPU performance
Efficiently handling boundary conditions
