At SC10, NVIDIA® (booth #1343) will be showcasing advances in applications and research with GPU computing. We invite you to visit our booth to learn more about how Tesla GPU computing products are driving the industry trend in heterogeneous computing.
The Tesla™ GPU computing products are powering the next wave of HPC — delivering supercomputing performance at 1/10th the cost and 1/20th the power consumption.
NVIDIA GPU Computing Theater – Presentations Available for Download
Tuesday, November 16 – Thursday, November 18 during exhibition hours | NVIDIA Booth #1343
The NVIDIA GPU Computing Theater hosts talks on a wide range of high performance computing topics. Open to all attendees, the theater is located in the NVIDIA booth and features industry luminaries, scientists, and developers. Download theater presentations here.
Featuring speakers such as:
- Takayuki Aoki, Tokyo Institute of Technology
- Bill Dally, NVIDIA
- Jack Dongarra, University of Tennessee
- Robert Farber, Pacific Northwest National Laboratory
- Wu-chun Feng, Virginia Tech
- Wei Ge, Chinese Academy of Sciences
- Mark Govett, National Oceanic and Atmospheric Administration
- Satoshi Matsuoka, Tokyo Institute of Technology
- Patrick McCormick, Los Alamos National Laboratory
- Paul Navratil, Texas Advanced Computing Center
- John Stone, University of Illinois at Urbana-Champaign
- Jeff Vetter, Oak Ridge National Laboratory
Plenary Speaker: Bill Dally, "GPU Computing: To ExaScale and Beyond"
Wednesday, November 17, 8:30AM - 9:15AM | Convention Center Auditorium
Performance per watt is the new performance. In today's power-limited regime, GPU computing offers significant advantages in performance and energy efficiency. In this regime, performance derives from parallelism and efficiency derives from locality. Current GPUs provide both, with up to 512 cores per chip and an explicitly-managed memory hierarchy. This talk will review the current state of GPU computing and discuss how we plan to address the challenges of ExaScale computing.
Tutorial: High Performance Computing with CUDA
Sunday, November 14, 8:30AM - 5:00PM | Room 391-392
CUDA is a general purpose architecture for writing highly parallel applications. It provides several key abstractions--a hierarchy of thread blocks, shared memory, and barrier synchronization--for scalable high-performance parallel computing. Scientists throughout industry and academia use CUDA to achieve dramatic speedups on production and research codes. The CUDA architecture supports many languages, programming environments, and libraries including C/C++, Fortran, OpenCL, DirectCompute, Python, MATLAB, FFT, LAPACK, etc. In this tutorial NVIDIA engineers will partner with academic and industrial researchers to present CUDA and discuss its advanced use for science and engineering domains. The morning session will teach the basics of CUDA C programming, give an overview of the various tools, and cover the main optimization techniques. The afternoon session will discuss best practices for tuning and profiling CUDA programs and close with real-world case studies from domain scientists using CUDA for computational biophysics, fluid dynamics, seismic imaging, and theoretical physics.
Download tutorial presentations here.
Tutorial: Advanced Topics in Heterogeneous Programming with OpenCL
Monday, November 15, 1:30PM - 5:00PM | Room 391-392
This tutorial will explore OpenCL in depth. We will briefly summarize the key features in OpenCL and then present a series of case studies showing how to use OpenCL. The focus will be on source code so we show not just what OpenCL can do, but how you can implement similar functionality in your own programs. Attendees should have either worked with OpenCL before or just completed our "Introduction to OpenCL" tutorial.
Birds of a Feather: GPUs and Numerical Weather Prediction
Tuesday, November 16, 5:30PM - 7:00PM | Room 399
There is a nascent community of atmospheric modelers that is looking toward GPUs as a stepping stone to higher resolution modeling, which will enable new scientific development. Several full-up models have already been successfully ported to GPUs, but to date there have been few community discussions about the issues in programming GPUs for Numerical Weather Prediction (NWP). Our objectives for this Birds-of-a-Feather are: (1) to have at least two teams present their experiences in porting NWP models to GPUs, (2) to foster the evolving community of atmospheric modelers who calculate on, or would like to calculate on, GPUs, (3) to exchange information about programming techniques, code performance and obstacles to parallelization, and (4) to share ideas on how the task of programming NWP models on GPUs can be facilitated by hardware and software vendors.
ADDITIONAL SESSIONS OF INTEREST
Workshop: ATIP US-China Workshop on HPC: Specialized Hardware & Software Development
Monday, November 15, 9:00AM - 5:30PM | Room 272
"Specialized Hardware – Challenges & Advantages" will include a significant set of presentations, posters, and panels from a delegation of Chinese academic, research laboratory, and industry experts and graduate students. Topics will include government support for the research, development, and utilization of special purpose hardware, including GPU and self-developed processors, and applications will be stressed. Industry speakers will provide perspectives on the importance of such hardware for real applications. A special effort will be made to include HPC developments in Hong Kong. A panel discussion will identify topics suitable for collaborative research and mechanisms for developing those collaborations. The workshop will provide a unique opportunity for members of the US research community to interact and have direct discussions with top Chinese scientists. A specific goal of the workshop is to motivate preparation of joint research proposals with researchers from the US and China.
ACM Gordon Bell Finalist: 190 TFlops Astrophysical N-body Simulation on a Cluster of GPUs
Tuesday, November 16, 10:30AM - 11:00AM | Room 394
Session presents the results of a hierarchical N-body simulation on DEGIMA, a cluster of PCs with 576 graphics processing units (GPUs) and an InfiniBand interconnect. In this work, DEGIMA's interconnect was upgraded using InfiniBand. DEGIMA is composed of 144 nodes with 576 GT200 GPUs. An astrophysical N-body simulation with 3,278,982,596 particles using a treecode algorithm shows a sustained performance of 190.5 Tflops on DEGIMA. The overall cost of the hardware was $411,921. The maximum corrected performance is 104.8 Tflops for the simulation, resulting in a cost performance of 254.4 MFlops/$.
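The cost-performance figure follows directly from the corrected performance and the hardware cost quoted in the abstract. A minimal check of that arithmetic, using only those two numbers (the variable names are illustrative and not taken from the paper):

```python
# Figures quoted in the abstract above.
corrected_tflops = 104.8       # maximum corrected performance, in TFlops
hardware_cost_usd = 411_921    # overall hardware cost, in US dollars

# 1 TFlops = 1e6 MFlops, so convert before dividing by the cost.
mflops_per_dollar = corrected_tflops * 1e6 / hardware_cost_usd

print(f"{mflops_per_dollar:.1f} MFlops/$")  # prints "254.4 MFlops/$"
```

Note that the ratio uses the corrected 104.8 TFlops rather than the sustained 190.5 TFlops; dividing the sustained figure by the same cost would give a larger number than the one reported.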
ACM Gordon Bell Finalist: Petascale Direct Numerical Simulation of Blood Flow on 200K Cores and Heterogeneous Architectures
Tuesday, November 16, 2:00PM - 2:30PM | Room 394
Session presents a high fidelity numerical simulation of blood flow by directly resolving the interactions of 200 million deformable red blood cells flowing in plasma. This simulation amounts to 90 billion unknowns in space, with numerical experiments typically requiring O(1000) time steps. In terms of the number of cells, we improve the state of the art by several orders of magnitude: the previous largest simulation, at the same physical fidelity as ours, resolved the flow of 14 thousand cells. This breakthrough is based on novel algorithms that we designed to enable distributed memory, shared memory, and vectorized/streaming parallelism. We present results on CPU and hybrid CPU-GPU platforms, including the new NVIDIA Fermi architecture and 200,000 cores of ORNL's Jaguar system. For the latter, we achieve over 0.7 Petaflop/s sustained performance. Our work demonstrates the successful simulation of complex phenomena using heterogeneous architectures and programming models at the petascale.
Best Student Paper Finalist: An 80-Fold Speedup, 15.0 TFlops, Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
Tuesday, November 16, 2:30PM - 3:00PM | Room 391-392
Regional weather forecasting demands fast simulation over fine-grained grids, resulting in extremely memory-bottlenecked computation, a difficult problem on conventional supercomputers. Early work on accelerating the mainstream weather code WRF using GPUs, with their high memory performance, resulted in only minor speedup because only part of the huge code was ported to the GPU. Our full CUDA port of the high-resolution weather prediction model ASUCA is the first such port we know of to date. Benchmarks on the 528-GPU (NVIDIA GT200 Tesla) TSUBAME Supercomputer at the Tokyo Institute of Technology demonstrated over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision for a 6956 x 6052 x 48 mesh. Further benchmarks on TSUBAME 2.0, which will embody over 4000 NVIDIA Fermi GPUs and be deployed in October 2010, will be presented.
Exhibitor Forum: Building Exascale GPU-Based Supercomputers
Tuesday, November 16, 2:30PM - 3:00PM | Room 280-281
Heterogeneous computing is quickly establishing itself as one of the leading approaches to build exascale supercomputers, while staying within power and monetary budgets. Application developers have to parallelize their applications – whether the system is a heterogeneous system or even a traditional multi-core CPU system. Given these challenges, the opportunities that GPUs represent are overwhelming. Already, there are several large supercomputers built using GPUs in Asia and many more coming soon. We will discuss these topics and also talk about programmability and developer productivity.
Panel: Toward Exascale Computing with Heterogeneous Architectures
Wednesday, November 17, 1:30PM - 3:00PM | Room 384-385
Recent reports from DOE, DARPA, and NSF have identified multiple challenges on the road to Exascale High Performance Computing. These challenges include the unrelenting issues of performance, scalability, and productivity, but they also include the relatively new priorities of energy-efficiency and resiliency. Not coincidentally, recently announced HPC architectures, such as RoadRunner, Tianhe, Tsubame, Keeneland, and Nebulae, illustrate that scalable heterogeneous computer (SHC) systems can provide innovative solutions. Early experiences on these systems have demonstrated performance benefits; however, SHC systems have multiple challenges. Taken together, these issues will impede the adoption of these architectures by erecting a very high entry barrier to application teams and their scientific productivity. This panel will discuss the future of SHC architectures, and how future changes might lower this barrier.
Panel: On the Three P's of Heterogeneous Computing: Performance, Power and Programmability
Wednesday, November 17, 3:30PM - 5:00PM | Room 384-385
In recent years, heterogeneous computing platforms "in a box" have quickly moved from being the exception to being the norm in high-performance computing (HPC). Examples of supercomputers that are based on such heterogeneous computing platforms now exist everywhere. This panel seeks to address arguably the three biggest problems in heterogeneous computing on the road to exascale: performance, power, and programmability.
Masterworks: Using GPUs for Weather and Climate Models
Wednesday, November 17, 4:15PM - 5:00PM | Room 395-396
With the power, cooling, space, and performance restrictions facing large CPU-based systems, graphics processing units (GPUs) appear poised to power the next generation of supercomputers. GPU-based systems already account for two of the ten fastest supercomputers on the Top500 list, with the potential to dominate this list in the future. While the hardware is highly scalable, achieving good parallel performance can be challenging. Language translation, code conversion and adaptation, and performance optimization will be required. This presentation will survey existing efforts to use GPUs for weather and climate applications. Two general parallelization approaches will be discussed. The most common approach is to run select routines on the GPU, but this requires data transfers between CPU and GPU. Another approach is to run everything on the GPU and avoid the data transfers, but this can require significant effort to parallelize and optimize the code.
Paper: Scaling Hierarchical N-Body Simulations on GPU Clusters
Thursday, November 18, 1:30PM - 2:00PM | Room 391-392
This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effect of various application parameters is modeled and experiments are carried out to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present scaling performance results from experiments on the NCSA Lincoln GPU cluster.
Paper: Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance
Thursday, November 18, 2:00PM - 2:30PM | Room 391-392
GPUs offer drastically different performance characteristics compared to traditional multicore architectures. To explore the tradeoffs exposed by this difference, we refactor MUMmer, a widely-used, highly engineered bioinformatics application which has both CPU- and GPU-based implementations. We synthesize our experience as three high-level guidelines for designing efficient GPU-based applications. First, minimizing the communication overheads is as important as optimizing the computation. Second, trading higher computational complexity for a more compact in-memory representation is a valuable technique to increase overall performance (by enabling higher parallelism levels and reducing transfer overheads). Finally, ensuring that the chosen solution entails low pre- and post-processing overheads is essential to maximize the overall performance gains.
Paper: Optimal Utilization of Heterogeneous Resources for Biomolecular Simulations
Thursday, November 18, 2:30PM - 3:00PM | Room 391-392
Biomolecular simulations have traditionally benefited from increases in the processor clock speed and coarse-grain inter-node parallelism on large-scale clusters. Graphics processing units (GPUs) offer revolutionary performance potential at the cost of increased programming complexity. In this paper, we present a parametric study demonstrating approaches to exploit resources of heterogeneous systems to reduce time-to-solution of a production-level application for biological simulations. By overlapping and pipelining computation and communication, we observe up to 10-fold application acceleration in multi-core and multi-GPU environments. This is a significant performance improvement over code acceleration approaches in which the host-to-accelerator ratio is static and constrained by a given algorithmic implementation.
Panel: Disruptive Technologies for Ubiquitous High Performance Computing
Friday, November 19, 8:30AM - 10:00AM
Over the past forty years, progress in supercomputing has tracked progress in integrated circuit scaling, and this has resulted in exponential improvements in system-level performance. However, changes in device physics now seriously threaten further sustained progress toward exaflops HPC systems. The DARPA Ubiquitous High Performance Computing (UHPC) program is intended to foster innovative projects to develop radically new computer systems that overcome the challenges of efficiency, dependability, and programmability anticipated in the exascale era. The panelists, who lead the selected UHPC teams, will discuss their perspectives on how to address these challenges and the comprehensive hardware/software strategies they developed for the UHPC program.