Learn how to create a set of rational arithmetic operators that manipulate 1024-bit operands on a Tesla C2050. These operators are used to create a numerically stable implementation of Bessel functions. Naive implementations of the Bessel functions produce unreliable results when they are used to solve Maxwell's equations by way of Mie theory. Maxwell's equations are used to model the scattering of light by small particles. Light scatter is used in Particle Characterization to measure the quality of materials like cocoa, cement and pharmaceuticals.
Find out about a multi-GPU implementation of the Alternating Direction Implicit (ADI) method for large 3D domains. The ADI technique is applied to direct numerical fluid simulation. Modeling complex flows demands extremely large grids, so the computation must be distributed to share the memory of multiple GPUs. In this session a novel distributed tridiagonal solver as well as parallelization and load balancing strategies will be covered in detail. Finally, a comprehensive performance analysis, scaling studies for different input geometries, and possible future improvements will be discussed.
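For readers unfamiliar with the building block, each ADI sweep reduces to many independent tridiagonal systems along one axis. The following is a minimal single-system sketch of the classic Thomas algorithm in Python. It is a serial reference only, not the distributed solver presented in the session; the function name and argument layout are assumptions made for illustration.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve one tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused);
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    Assumes no pivoting is needed, e.g. a diagonally dominant matrix.
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The forward and backward passes are inherently sequential along a line, which is why a grid split across GPUs along the sweep direction needs a specialized distributed solver of the kind the session describes.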
See the newest developments in the area of hierarchical N-body methods for GPU computing. Hierarchical N-body methods have O(N) complexity, are compute bound, and require very little synchronization, which makes them well suited to next-generation supercomputers. In this session we will cover topics such as hybridization of treecodes and fast multipole methods, auto-tuning kernels for heterogeneous systems, fast tree construction based on prefix sums, fast load balancing of global trees, and more. Examples will be given using ExaFMM, an open source hierarchical N-body library for heterogeneous systems developed by the speaker and released at SC11.
Learn how to map irregular tree-structured computations to the GPU efficiently. See how extremely irregular data-dependent computations can be implemented by composing them out of regular data-parallel primitives. In particular we focus on the problem of tree accumulation, a generalization of the scan primitive to arbitrary tree data structures. We first show how tree orderings and properties can be computed using the Euler tour technique and standard scan primitives. Using these orderings we then develop our new approach to computing tree accumulations in parallel.
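To make the idea concrete, here is a small Python sketch of an upward tree accumulation (subtree sums) expressed as a tree ordering plus a prefix sum. It is a simplified illustration, not the speakers' algorithm: the ordering is built with a sequential depth-first traversal standing in for a parallel Euler tour, and the names `euler_ranges` and `subtree_sums` are hypothetical.

```python
from itertools import accumulate


def euler_ranges(children, root):
    """Return (order, first, last): the preorder node list and, per node,
    the half-open range [first, last) that its subtree occupies in it."""
    order, first, last = [], {}, {}
    stack = [(root, False)]
    while stack:
        node, leaving = stack.pop()
        if leaving:
            last[node] = len(order)  # subtree of node ends here
            continue
        first[node] = len(order)
        order.append(node)
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return order, first, last


def subtree_sums(children, values, root):
    """Sum of values over each node's subtree, using one scan."""
    order, first, last = euler_ranges(children, root)
    # Exclusive prefix sums over the values laid out in tree order.
    prefix = [0] + list(accumulate(values[n] for n in order))
    # Each subtree is a contiguous range, so its sum is one subtraction.
    return {n: prefix[last[n]] - prefix[first[n]] for n in order}


tree = {0: [1, 2], 1: [3, 4]}
vals = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}
sums = subtree_sums(tree, vals, 0)
```

Once the ordering is known, the irregular tree problem becomes a regular scan followed by independent per-node subtractions, both of which are data-parallel.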
This presentation describes our development of a GPU-accelerated backpropagation implementation for Synthetic Aperture Sonar systems that supports multiple nodes via MPI and multi-GPU nodes. This implementation can form a complex-valued gigapixel image in one hour on a single C2050. We further scale this implementation to the Keeneland system, where we can form the same gigapixel image in 21 seconds on 48 nodes with 144 Tesla C2070 GPUs. Our talk will discuss the details of our implementation, including our optimizations and scaling results for various node and GPU configurations, as well as the applicability to other domains, including Synthetic Aperture Radar.
This tutorial will demonstrate how video I/O devices can take advantage of the GPU Direct for Video API to optimize data transfer performance for digital video, film, broadcast and computer vision applications. GPU Direct for Video is a technology that permits the DMA transfer of data buffers between video I/O devices and the GPU through the use of a shared system memory buffer for immediate processing by OpenGL, DirectX, CUDA and OpenCL. This direct transfer can improve synchronization and eliminate latency between video capture, GPU processing and video output.
The combination of the GPU's massively parallel compute engine with extremely high memory bandwidth and new programming paradigms such as CUDA and OpenCL has made the GPU well suited for image and video processing applications. This session will explore best practices and techniques for the development of efficient GPU-based video and image processing applications. Topics to be discussed include image segmentation and threading models for efficient parallelism, optimal memory usage strategies to reduce expensive data movement, and multi-GPU considerations. Case studies and examples specific to video and image processing will be presented.
Learn how to use GPUs to accelerate compute- and data-intensive applications and algorithms in bioinformatics. High-throughput techniques for DNA sequencing and gene expression analysis with microarrays have led to a rapid growth in the amount of digital biological data. For example, the NCBI Sequence Read Archive (SRA) houses raw sequence data generated by next-generation sequencing (NGS) technologies that exceeds 25 trillion base pairs. Therefore, modern bioinformatics tools need to be scalable; that is, they need to deal with an ever-growing amount of data. GPUs and CUDA provide the opportunity to significantly reduce the runtime of many biological algorithms on inexpensive hardware.
Most important agricultural traits and human diseases are complex traits controlled by gene networks with gene-by-gene interaction (epistasis) and gene-by-environment interaction (GE). New statistical methods and software have been developed for analyzing the genetic architecture of complex traits based on genome-wide association studies (GWAS). When dealing with large mapping populations and huge amounts of molecular information, GPU computation has an advantage over CPU computation. We will demonstrate the newly developed GPU-based software QTLNetwork V3.0 and GWAS-GMDR for mapping genes with epistasis and GE interaction for complex traits in humans, crops, and mice.
Once the DNA double helix has been digitized by sequencing, computation is the key to connecting raw sequences with life science discoveries. As massive amounts of data are generated, processing, analyzing and storing them efficiently has become a major challenge. By developing GPU-accelerated bioinformatics tools and integrating them into pipelines, BGI researchers now run analysis pipelines in several hours instead of several days. These tools include the SOAP3 aligner, SNP calling and a tool for population genomics. The speedup is generally around 10-50x compared with traditional counterparts.
Engineers, artists, scientists, and gamers are the most demanding visual thinkers on the planet, and as such have not been willing to move their computing environments to the infamous "cloud". These remotely accessed systems are seen as slow and not up to the visual experience that users expect when dealing with these types of applications. NVIDIA aims to change that perception with the NVIDIA Virtual Graphics Platform. In this session you will hear about the technologies behind accelerating graphics in the cloud, and some of the industry partnerships that are enabling it.
In this session we describe our GPU-accelerated computing service, which supports several internal business processes in a large-scale company setting. The service supports diverse computational needs such as on-demand rendering, mesh optimization, a Massively Multiplayer Online Game (MMO), product visualizations and other demanding computational tasks. We present the architectural considerations for a service-oriented computational framework and the practical lessons and opportunities encountered during development of an enterprise system using NVIDIA technologies such as CUDA, OptiX, OpenGL and OpenCL. Our aim is to share knowledge and present LEGO's vision for a GPU-accelerated computational platform as a business-driven technology.
Recent technological advances have made it practical to deliver 3D professional graphics applications from the Cloud (private or public) with a high quality user experience and at an attractive cost. Organizations can keep their intellectual property safe in the data center since only fully-rendered screen images are sent over the network. Users in remote locations no longer have to wait for large file transfers. And they can access 3D models from a wide variety of devices, including iPads and Android tablets. Learn how Citrix XenDesktop, XenServer and Receiver technologies have made all of this a reality for many organizations today.
The Point Cloud Library (PCL - http://pointclouds.org) is a large-scale, open project for 3D point cloud processing. The PCL framework contains numerous state-of-the-art algorithms including filtering, feature estimation, surface reconstruction, registration, model fitting and segmentation. Due to the massively parallel nature of many of these algorithms, GPGPU acceleration holds great potential for achieving real-time performance in numerous applications. In this work we demonstrate some of the recent advances in GPGPU programming for 3D point cloud processing, and outline plans for future development.
SJTU-NS3D is an in-house CFD code co-developed by SJTU and COMAC for large civil aircraft. It solves the 3D Reynolds-Averaged Navier-Stokes (RANS) equations on structured grids using the finite volume method and can be used in wing design. In this talk, we will present the design and further optimization of the CUDA version of SJTU-NS3D, which achieves a 20-fold speedup for the standard M6 wing model and a 37-fold speedup for a candidate wing model from COMAC on a single Fermi C2050.
Learn how Run-Time Code Generation (RTCG) techniques allowed for fast development of a lattice Boltzmann (LB) fluid dynamics solver called Sailfish. Sailfish is completely open source, supports a wide variety of LB models (single and multiple relaxation times, the entropic model; single and binary fluids) and can take advantage of multiple GPUs. Even though the project is written predominantly in Python, no performance compromises are made. This talk will introduce the basic design principles of Sailfish and illustrate how RTCG makes it possible to exploit the power of GPUs with minimal programmer effort.
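As a rough illustration of what run-time code generation means in this setting, the Python sketch below fills a kernel source template with a simulation parameter before the source would be handed to a GPU compiler. The template, the `relax` kernel and the `generate_kernel` helper are hypothetical and are not taken from Sailfish; the collision step shown is a plain single-relaxation-time update, and compilation is omitted.

```python
from string import Template

# Hypothetical CUDA C kernel template. The relaxation rate is substituted
# as a literal so the GPU compiler sees it as a compile-time constant.
KERNEL_TEMPLATE = Template("""\
__global__ void relax(float *f, const float *f_eq, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        f[i] += ${omega}f * (f_eq[i] - f[i]);
    }
}
""")


def generate_kernel(tau):
    """Return kernel source specialized for relaxation time tau."""
    omega = 1.0 / tau  # relaxation rate used by the collision step
    return KERNEL_TEMPLATE.substitute(omega=repr(omega))


source = generate_kernel(2.0)
```

Because the host language only assembles source text, the generated kernel carries no interpreter overhead at run time, which is one way a Python project can avoid performance compromises.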
With SPH methods, multi-phase flows within complex geometries can be investigated efficiently. Physical effects present in micro- and nanofluidic applications can also be described with little effort using the SPH methodology. In order to investigate microfluidic applications relevant to industry, large domains and high spatial resolutions are required. Therefore, an SPH method for accelerated computations on GPUs is currently being developed. The code features dynamic casting of computational data into blocks of appropriate size to fit the GPU memory layout. Tree-like data structures for efficient manipulation of particle distributions also help to obtain significant performance gains on GPU hardware.
See how we employ GPUs to simulate the interaction of millions of solvent and solute particles of a fluid system. Often the domain of large cluster systems, the most time-consuming part of our simulations can now be done on desktop PCs in reasonable time. This contribution shows how GPUs can effectively be used to accelerate existing programs and how techniques like streaming and increased data locality significantly enhance calculation throughput. It also shows how a GPU-optimized program structure yields usually expensive additional functionality "almost for free". Furthermore, a well-scaling single-node/multi-GPU implementation of the program is presented.
The shooting and bouncing ray (SBR) method is one way to simulate electromagnetic field radiation. Like all methods, it does not yield accurate results for certain problems. In this presentation, we will explain one such case that consists of an antenna resonating between two metal plates. We will discuss how we used the graphics processing unit (GPU) to separate the problem into two parts. Each part is simulated individually with SBR, producing an improved result. Such a GPU-accelerated, two-part approach can be applied to other more general hybrid simulations.
With powerful lasers breaking the Petawatt barrier, applications for laser-accelerated particle beams are gaining more interest than ever. Ion beams accelerated by intense laser pulses foster new ways of treating cancer and make them available to more people than ever before. Laser-generated electron beams can drive new compact x-ray sources to create snapshots of ultrafast processes in materials. With PIConGPU, laser-driven particle acceleration can be computed in hours compared to weeks on standard CPU clusters. We present the techniques behind PIConGPU, a detailed performance analysis and the benefits of PIConGPU for real-world physics cases.
Learn how to port legacy Fortran plasma codes to the GPU. Many legacy plasma codes are written in Fortran and comprise many lines of code. We will discuss techniques for porting such legacy codes easily and efficiently to CUDA C/C++. Performance analysis of major algorithmic patterns in plasma codes will be discussed. The discussion will use the GTC and GeFi plasma codes as realistic examples.
Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Topics covered include the latest advances available for Cg 3.1, the OpenGL Shading Language (GLSL); programmable tessellation; improved support for Direct3D conventions; integration with Direct3D and CUDA resources; bindless graphics; and more. When you utilize the latest OpenGL innovations from NVIDIA in your graphics applications, you benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Many real-world graphics applications need to transfer textures efficiently in and out of GPU memory in the form of 2D images, 2.5D terrains or 3D volumes as well as their time-varying counterparts. The first part of this talk covers technical pointers on how to optimize your OpenGL application to overlap transfers with rendering using the NVIDIA Copy Engines. The second part demonstrates the integration and performance of this feature within a real-world latency-sensitive broadcast graphics application from VizRT.
NURBS, or Non-Uniform Rational B-Splines, are a curved surface representation commonly used in computer-aided design and digital content creation. This recursive representation gives a great deal of flexibility, allowing arbitrary surface order and knot vectors, enabling a single NURBS surface to contain many contiguous patches. However, this recursive representation is also expensive to compute, so a NURBS surface is often converted into multiple Bezier patches before being tessellated. In this implementation, we present an efficient method for directly tessellating NURBS surfaces using the NVIDIA CUDA computing API.
Learn how to render transparency, motion blur, and depth of field effects in real time using random sampling. These effects combine multiple objects in each pixel, making them expensive to compute directly. But recent research shows that, with stratified sampling and clever reconstruction, good image quality can be achieved with surprisingly small numbers of samples per pixel. We will explain how to do this on the GPU, and explore trade-offs of performance, quality, accuracy, and noise.
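For readers new to the term, stratified sampling splits the sampling domain into equal cells and draws one jittered sample per cell, which bounds how badly a small sample set can clump. The Python sketch below shows the idea in one dimension, estimating how much of a pixel is covered by an edge at 0.3. It is an assumed toy setup, not the renderer discussed in the session, and the reconstruction step is not shown.

```python
import random


def uniform_samples(n, rng):
    """n independent samples in [0, 1); they may clump together."""
    return [rng.random() for _ in range(n)]


def stratified_samples(n, rng):
    """One jittered sample in each stratum [i/n, (i+1)/n)."""
    return [(i + rng.random()) / n for i in range(n)]


def estimate(f, samples):
    """Monte Carlo estimate of the mean of f over [0, 1)."""
    return sum(f(x) for x in samples) / len(samples)


def coverage(x):
    """Toy visibility function: the object covers the pixel left of 0.3."""
    return 1.0 if x < 0.3 else 0.0


rng = random.Random(0)
stratified_estimate = estimate(coverage, stratified_samples(8, rng))
uniform_estimate = estimate(coverage, uniform_samples(8, rng))
```

With 8 strata, only the cell straddling the edge is uncertain, so the stratified estimate can only be 0.25 or 0.375, while the uniform estimate can land anywhere between 0 and 1.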
The future of computer graphics presents many challenges. The worlds we render will be vastly more complex in geometry and artistic texture. Real-time rendering will use global illumination to achieve a far richer appearance, robustly. And content creation, which has grown to be the dominant cost of producing both games and film, must get simpler and less expensive. The NVIDIA Graphics Research group addresses these challenges with a focus on Computational Graphics: using general-purpose computation to enhance and extend the traditional pipelines and capabilities of real-time rendering. In this talk David Luebke, who leads graphics research, will give an overview of recent and ongoing work in computational graphics at NVIDIA Research.
Discrete voxel representations are generating growing interest in a wide range of applications in computational sciences and particularly in computer graphics. A new real-time usage of dynamic voxelization inside a sparse voxel octree is to compute voxel-based global illumination. When used in real-time contexts, it becomes critical to achieve fast 3D scan conversion (also called voxelization) of traditional triangle-based surface representations. This talk describes a new surface voxelization algorithm that produces a sparse voxel representation of a triangle mesh scene in the form of an octree structure using the GPU hardware rasterizer. In order to scale to very large scenes, our approach avoids relying on an intermediate full regular grid to build the structure and constructs the octree directly.
The most common approach in rendering is to define behavior at a point in terms of material properties and incident illumination. That approach works well when the geometry and material properties are well known, and the light physics are simulated accurately. We present a technique to help in situations where the model and/or physics are incomplete. This technique augments shaders with information about nearby edges, such as corners and boundaries between materials, and makes it natural to add richness procedurally near these visually critical regions.
Lenovo ThinkStations utilize NVIDIA Maximus technology to accelerate mission-critical applications across multiple industries, including manufacturing, media and entertainment, and life sciences. Discover how GPUs are used to accelerate medical research from product experts with Lenovo and Beckman Coulter. Beckman Coulter has utilized NVIDIA GPUs to reduce software development and test cycles by 50% with their Kaluza software. Kaluza is a revolutionary flow cytometry analysis software solution that provides visualization tools, speed and an innovative simplicity to the flow community. See how Kaluza allows users to analyze 10 million cells in real time. Session attendees will receive a drawing entry to win a brand new ThinkPad Tablet.
We present a plug-in for Maya which enables an artist to simulate huge particle counts in real time by leveraging the NVIDIA GPU. Being able to interact with the simulation opens up new possibilities for modifying the workflow. We will demonstrate the plug-in, and provide insight into the algorithms used.
Photoshop is one of the most popular products in history. It attempts to delight the customers with an immersive experience. Since CS4, Adobe has been tapping into the horsepower of the GPU to create a compelling playground for the imaginations of creative pros. Please join us to review the latest developments on how GPUs have been an enabling force.
We present a regular expression (regex) engine on a GPU. We utilize the highly parallel architecture of GPUs to accelerate such searches. We believe that previous attempts to utilize the GPU for this task did not fully tap its potential. Regexes present imbalanced compute workloads which are very different from common GPU applications (CFD, CG and image processing). Hence, they can teach us general lessons on how to utilize GPUs for more general workloads. Our initial results show a 30x improvement in running time relative to single-threaded commercial regex engines.
In business intelligence, tasks like corporate planning or what-if analysis complement traditional reporting and analysis. One main difference is that while the latter only read data, the former require changing possibly large numbers of existing data records and creating new ones in the business model, preferably in real time. In this session, we describe the extension of an existing BI tool, Jedox OLAP, by GPU-based parallel algorithms for interactive planning scenarios. Compared to sequential in-memory algorithms, our CUDA approach yields tremendous speedups and can also cope with large amounts of data by using multiple GPUs.
The CUDA debugging tools CUDA-GDB and CUDA-MEMCHECK provide a whole new feature set to help improve your CUDA application development cycle. This session is a detailed walk-through of the key new features and advanced techniques for using CUDA-GDB and CUDA-MEMCHECK together to improve overall code productivity. This tutorial will also include live demos. This session will be repeated later during the conference.
Learn about the latest developments in GPU acceleration for 3D Full Wave Electromagnetic simulation. The latest version of CST Studio Suite supports the full range of Tesla products on both Windows and Linux operating systems. Using GPU, multi-GPU and MPI-GPU computing drastically reduces simulation times for CST customers. We will report on the status of current and future GPU developments at CST and share detailed simulation results.
Discuss techniques for compiling parallel DSLs to GPUs. Verilog is a Domain Specific Language for Hardware Description. Verilog users express parallelism with guarded processes similar to Occam's guarded commands. Review Verilog semantics, and different approaches to compiling Verilog to parallel architectures and to GPUs. Discuss challenges with (a) the runtime behavior of Verilog descriptions and (b) managing process dependencies. Discuss approaches and challenges in compiling a parallel DSL to CUDA C.
In this paper we show how GPUs can be used to significantly speed up computational lithography, which is heavily used in the Electronic Design Automation (EDA) industry. In particular, we demonstrate a noticeable performance increase in several basic optical lithography algorithms as well as the speedup of the full-chip verification software, crucial parts of which were ported to NVIDIA GPUs. We summarize the advantages, disadvantages and challenges of using GPUs and compare them with more traditional multithreading and distributed computing alternatives for the same applications.
Learn about real-time simulations of Concentrating Thermal Solar Power using GPU technology to enable performance optimization of these utility-scale plants. By leveraging the power of GPUs and the parallel nature of a field of thousands of sun-tracking mirrors, we have been successful in cutting the computation time by orders of magnitude compared with the minutes to hours previously required. We will present an overview of the problem domain and describe how we used the GPU to derive a Monte Carlo physics ray tracing method to simulate the flux reflected by the mirrors onto the solar receiver.
The oil and gas industry is already leveraging GPUs for seismic data processing, but what about 3D seismic interpretation? This session will cover how the GPU is being used by TerraSpark Geosciences to dramatically decrease the runtime of algorithms for enhancing faults, computing horizon orientation, and calculating volumetric curvature. We will share our experiences in porting these techniques to the GPU, the challenges encountered, the solutions found, and, of course, the benefits to execution time.
The LiveQuest application delivery and collaboration solution allows petro-technical professionals to securely access and share exploration and production (E&P) applications and data, including 3D visualization applications, anytime, anywhere. By utilizing web and thin-client technologies, LiveQuest provides platform-independent and application-agnostic real-time collaboration. In this session, Mario Dean will provide an introduction to the needs of O&G exploration from an application and large-data 3D visualization perspective. He will discuss the LiveQuest solution stack, with specific focus on the 3D remote visualization technology, and share customer deployment examples and overall ROI considerations.
The goal of this session is to show the improvements in quality, performance and flexibility of the volume rendering implementation of Open Inventor. The latest GPU techniques, such as virtual textures and ray casting, have been combined into a flexible shader API and applied to out-of-core data. The techniques of volume rendering, sugarcube rendering, basic and complex clipping, sculpting, editing and segmentation will be demonstrated using examples from a geobody extraction workflow. The ease and flexibility of the shader pipeline API will be illustrated, and we will discuss the broad future perspectives of that technology.
Get a head start on the conference with this first-day introduction to key technologies for GPU Computing. This 90-minute tutorial session will cover the key features of, and differences between, the major programming languages, APIs and development tools available today. Attendees will also learn several high-level design patterns for consumer, professional and HPC applications, with practical programming considerations for each.
We discuss an approach for using commercial graphics processors (GPUs) at the earliest trigger stages in high-energy physics experiments, and study its implementation on a real trigger system in preparation. In particular we focus on the possibility of reconstructing rings in a Cherenkov detector as a building block of a selective trigger condition for rare decay searches. Latency and processing rate measurements on several state-of-the-art devices are presented, and the potential issues related to processing time jitter and data transfer throughput are discussed.
Get the latest developments in Next Generation Knowledge Based Engineering (KBE) software, which provides real results over the traditional design approach. Today there exist numerous KBE applications in the fields of vehicle ergonomics, suspension, NVH, safety, regulations and so on, which deal with a huge number of iterations and mathematical algorithms. With GPU computing and CUDA, the KBE kernel is restructured to incorporate a parallel programming model, which helps the applications run faster and reduces run times from hours to seconds. The KBE geometry kernel also benefits from enabling CUDA in topology-based operations, which take a lot of time when performed on the CPU.
Verification has become the bottleneck of the IC design process due to rapidly increasing design complexity. The fundamental means of verifying digital circuits is logic simulation, which can be performed at both register-transfer level (RTL) and gate level. In this work, we developed GPU-based logic simulation solutions. We implemented a Chandy-Misra-Bryant parallel simulation protocol on GPUs for sufficient parallelism. A dynamic GPU memory allocator was introduced to efficiently manage GPU memory resources. RTL simulation is performed in a compiled-code scheme by translating Verilog code into equivalent CUDA code. Experimental results showed that the GPU simulators significantly outperform their CPU counterparts.
The joint utilization of the electron's charge and spin in "spintronics" represents a promising technology for data processing and storage in nanostructures. Complex quantum effects in these devices, such as the spin-Hall effect, require demanding numerical simulations that provide a convenient link between idealized analytical models and often very complex measurement results. The simulations involve multiplications and inversions of large matrices, and so provide an ideal showcase for the performance gained by running the algebraic routines on GPGPUs in computing environments where algorithms execute across multiple nodes with multiple GPGPUs and CPU cores.
Graphics processors are already used for computationally intensive video tasks in many ISR (Intelligence, Surveillance, Reconnaissance) applications; a GPU-based system for video enhancement and analytics outperforms a similarly priced CPU-based system 5-to-1 at HD resolutions. Our initial tests on 64-megapixel Wide Area Aerial Surveillance (WAAS) data show at least a 10x speedup with tasks such as super-resolution or moving target indication. In this talk, we'll discuss unique design and implementation challenges of real-time processing of very large video data sets. We will demonstrate our existing GPU-based software, IKENA ISR, and discuss its video-processing pipeline and innovative processing solutions that promise to dramatically expand the capabilities of emerging aerial surveillance platforms.
NVIDIA CEO and co-founder Jen-Hsun Huang will take part in a fireside chat with Tim Bajarin, one of the IT world's pre-eminent analysts and president of Creative Strategies. They will discuss trends in mobile, visual and parallel computing, and the transformational changes ahead for the industry.
Do not miss the opening keynote, featuring Jen-Hsun Huang, CEO and Co-Founder of NVIDIA. Hear about what's next in computing and graphics, and preview disruptive technologies and exciting demonstrations from across industries. Jen-Hsun co-founded NVIDIA in 1993 and has served since its inception as president, chief executive officer and a member of the board of directors.
Collective behavior is one of the most pervasive features of the natural world. Our brains are composed of billions of interconnected cells communicating with chemical and electrical signals. We are integrated in our own human society. Elsewhere in the natural world a fish school convulses, as if one entity, when being attacked by a predator. How does individual behavior produce dynamic group-level properties? Do animal groups (or even cells in a tumor) function as some form of collective mind? How does socially contagious behavior spread through natural human crowds? In his keynote address, Prof. Iain D. Couzin, Professor of Ecology and Evolutionary Biology at Princeton University, will demonstrate how GPU computing has been pivotal in the study of collective behavior, helping reveal how collective action emerges in a wide range of groups from plague locusts to human crowds, and the critical role that uninformed, or weakly-opinionated, individuals play in democratic consensus decision-making.
Do not miss the day 3 keynote, featuring Part-Time Scientists Robert Boehme and Wes Faler. Boehme and Faler are part of a team of international scientists and engineers who want to send a rover to the moon before the end of the year 2013. In this presentation, they will discuss their goals, recent accomplishments and milestones, and how GPUs have helped in unexpected ways.
Come see how to select the k smallest elements from an unsorted list. We present a selection and combination of different algorithms that perform exact k-nearest neighbors search (k-NNS) on GPUs and outperform the competition. In this session we present four different selection algorithms, each designed to exploit the parallelism of the GPU in a different way depending on the relative size of the corpus data set, the size of the query set and the number of neighbors sought. We show the application of Logo Retrieval with SIFT vector matching on two different GPUs, the Tesla C1060 and the Fermi GTX480.
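To pin down the problem being solved, the following Python sketch is a brute-force CPU reference for exact k-nearest neighbors search: compute every distance, then select the k smallest. It is not one of the four GPU selection algorithms from the session; the helper names are hypothetical, and the standard-library `heapq.nsmallest` stands in for the selection step.

```python
import heapq


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_nearest(query, corpus, k):
    """Indices of the k corpus vectors closest to query, nearest first."""
    return heapq.nsmallest(
        k,
        range(len(corpus)),
        key=lambda i: squared_distance(query, corpus[i]),
    )


corpus = [(0.0, 0.0), (5.0, 5.0), (1.0, 1.0), (2.0, 2.0)]
neighbors = k_nearest((0.9, 0.9), corpus, 2)
```

The distance computations are independent per corpus vector, so the selection of the k smallest values is the step where the choice of algorithm matters most as the corpus, query set and k change in size.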
In this paper, we present how we sped up the electronic structure calculator VASP by more than an order of magnitude. Recent research at IFP Energies Nouvelles has shown that by coupling traditional clusters or High Performance Computing (HPC) machines with accelerators based on graphics processing units (GPUs), recoding the most time-consuming parts of the codes (in programming languages such as CUDA or OpenCL) and offloading them to the graphics chips, it is possible to reduce the computing time by a factor of 5 to 15.
This session will present the fundamental performance-optimization concepts and illustrate their practical application in the context of programming for Fermi and Kepler GPUs. The goal is twofold: to make the optimization process a methodical sequence of steps, and to facilitate making performance-aware algorithmic decisions before coding even starts. In order to maximize GPU performance, a code should have sufficient parallelism, access memory in a coalesced pattern, and be amenable to vector execution within warps (groups of 32 threads). We will show how to quantify these requirements for a specific GPU in order to determine performance limiters and their importance for a given code. To address the limiters, we will review hardware operation specifics and related optimization techniques. The optimization process will be illustrated using NVIDIA profiling tools and kernel case studies.
CUDA releases starting with 4.0 include a number of features that facilitate multi-GPU programming and computing. In this session we will review the features useful for programming multiple GPUs, both within a single node and across a network. We will cover peer-to-peer GPU communication, communication patterns for various GPU topologies, as well as streams in the context of multiple GPUs. Concepts will be illustrated with a case study of 3D forward wave modeling, common in seismic computing.
OpenACC is a programming standard for parallel computing on accelerators (including GPUs) using directives. It is designed to harness the transformative power of heterogeneous computing systems easily and quickly. In this tutorial you will learn how to add simple compiler hints to your code to expose parallelism to the compiler, allowing it to map computation onto an accelerator. OpenACC directives allow developers to make simple and portable code changes, enabling an easier migration to accelerated computing. This is part 2 of a 3-part tutorial that will take you from an overview through how to optimize your code. Part 2 will cover how GPUs execute parallel programs, and apply this understanding to optimizing OpenACC examples to gain larger speedups and accelerate applications with various types of parallelism. You will also learn how to use NVIDIA profiling tools to target your optimizations.
This tutorial will cover various aspects of writing code in CUDA Fortran, which is the Fortran interface to the CUDA architecture. Topics covered will include a basic introduction to parallel programming concepts using CUDA, performance measurements and metrics, optimization, and multi-GPU programming via CUDA 4.0's peer-to-peer capability and MPI. Several case studies will be presented as well.
Learn how to optimize and profile your algorithms for the GPU. This session will cover the essentials of code optimization and will include: arithmetic optimizations, warps, branching efficiency, memory latency/occupancy and memory performance optimizations. Real-life commercial examples will be discussed to highlight the critical aspects of GPU optimization techniques. A programming demonstration using the NVIDIA Visual Profiler will be included.
Starting with a background in C or C++, learn everything you need to know in order to start programming in CUDA C. Beginning with a "Hello, World" CUDA C program, explore parallel programming with CUDA through a number of hands-on code examples. Examine more deeply the various APIs available to CUDA applications and learn the best (and worst) ways in which to employ them in applications.
The libraries distributed in the CUDA SDK and offered by third parties provide a wealth of functions commonly encountered in a GPU acceleration project. Using these libraries can often significantly shorten the development time of a GPU project while leading to high-performance, high-quality software. In this tutorial, we will provide an overview of the libraries in the CUDA SDK, including cuBLAS, cuRAND, NPP and Thrust, and introduce common use cases. The audience will not only learn about the strengths of the individual libraries, but also learn about the decision-making process for selecting the best-suited library for their project.
Directive-based programming is a very promising technology for dealing with many-core architectures. In this context, HPC users can rely on emerging standards such as OpenACC and OpenHMPP. CAPS will introduce the OpenACC and HMPP directive-based programming models with companion tools (e.g. for tracing, tuning, debugging): HMPP Wizard, CULA, ArrayFire, Vampir, Paraver, DDT, CodeletFinder, etc. The speakers will provide insights on how GPUs and CPUs can be exploited in a unified manner and how code tuning issues can be minimized. The discussion will also cover the use of libraries, which is essential when addressing many-core programming. PathScale will present its product supporting the OpenHMPP programming model.
In this talk, individuals from the GPU architecture and CUDA software groups will dive into the features of the compute architecture for Kepler, NVIDIA's new GPU. From the reorganized processing cores with new instructions and processing capabilities, to an improved memory system with faster atomic processing and low-overhead ECC, we will explore how the Kepler GPU achieves world-leading performance and efficiency, and how it enables wholly new types of parallel problems to be solved.
By integrating NVIDIA's OptiX system for real-time GPU raytracing into a DirectX9-based engine, CCP Games enables high-quality raytraced player portraits for the single-shard MMO EVE Online, reusing the game's assets and pipeline. We selectively add stochastic effects while closely maintaining the look of the DX9-based renderer that Art Direction aimed for. In this talk we approach OptiX from the point of view of a programmer familiar with DirectX, discuss integrating these two systems, and show how we reproduced some DirectX-based effects like transparency and subsurface scattering within OptiX.
OptiX has broken some major barriers recently by enabling out-of-GPU-core memory rendering and by adding a CPU rendering back-end when an OptiX-capable GPU is not present in the system. OptiX users and CUDA developers will be interested in how we accomplished these feats within the existing GPU architecture. This talk will provide a brief introduction to OptiX and then dive into what the new features provide. We will then go under the covers and show how we pulled it off.
Learn the latest approaches to using GPUs for the fastest possible ray tracing results from experts developing and leveraging the NVIDIA OptiX ray tracing engine, the team behind NVIDIA iray, and those making custom renderers. Multiple rendering techniques, GPU programming languages, out-of-core rendering, and optimal hardware configurations will be covered in this cutting-edge discussion.
The full range of advanced rendering solutions and frameworks from NVIDIA will be explored in this insightful product and technology discussion and demonstration. Come learn about the latest possibilities involving advanced rendering techniques and how they integrate within commercial products, from production ray tracing to volumetric and distributed rendering.
This year, the leadership-class computing facility at Oak Ridge National Laboratory is upgrading its largest supercomputer for open science, "Jaguar", to employ high-performance, power-efficient GPUs. Once the transition is complete, the machine will be known as "Titan". In this extended GTC session, we will feature a range of presenters showcasing research codes that will run computational science on the GPU at scale. Through these selected presentations, we will investigate the progress and anticipated results of GPU acceleration of these significant codes. In this session, we will also explain how research scientists interested in tapping into the immense capabilities of Titan can do so, through programs such as the INCITE program sponsored by the US Department of Energy. The presenters include: Jacqueline H. Chen (Combustion Research Facility, Sandia National Laboratories), "Direct Numerical Simulation of Turbulence-Chemistry Interactions: Fundamental Insights Towards Predictive Models"; Ray Grout (National Renewable Energy Laboratory), "S3D Direct Numerical Simulation - Preparations for the 10-100PF Era"; William Tang (Director, Fusion Simulation Program at the Princeton Plasma Physics Laboratory (PPPL), Princeton), "Fusion Energy Sciences & Computing at the Extreme Scale"; John A. Turner (Group Leader of Computational Engineering & Energy Sciences, Oak Ridge National Laboratory), "Transforming Modeling and Simulation for Nuclear Energy Applications"; Loukas Petridis (Staff Scientist, Oak Ridge National Laboratory), "Computer Simulation of Lignocellulosic Biomass"; Jeroen Tromp (Director, Princeton Institute for Computational Science, Princeton), "Toward Global Seismic Imaging based on Spectral-Element and Adjoint Methods".
Precise information about the structure of the solid Earth comes from seismograms recorded at the surface of a highly heterogeneous lithosphere. Seismic imaging based on spectral-element and adjoint methods can assimilate this information into three-dimensional models of elastic and anelastic structure. These methods fully account for the physics of wave excitation, propagation, and interaction by numerically solving the inhomogeneous equations of motion for a heterogeneous anelastic solid. Such methods require the execution of complex computational procedures that challenge the most advanced high-performance computing systems. Current research is petascale; future research will require exascale capabilities. We illustrate the current state of the art based on an inversion for European upper-mantle structure. Our ultimate goal is to move toward "adjoint tomography" of the entire planet. This session is part of "S0606 - GPU-accelerated Science on Titan: Tapping into the World's Preeminent GPU Supercomputer to Achieve Better Science" mini-track with Dr. Jack Wells.
Learn how VSIPL++ can improve your productivity and provide software portability, without sacrificing performance. We will describe how VSIPL++'s open-standard high-level programming model addresses the challenges of writing high-performance embedded software on GP-GPUs and other heterogeneous hardware, using advanced C++ techniques and data abstraction -- and how we make this work in the real world. We will also present a comparison of performance results from various configurations of CPU and GP-GPU processing engines for a signal processing application developed using VSIPL++.
The evolution of supercomputing into the mid-petaflop era has been typified by heterogeneous compute nodes with the majority of the compute capability delivered by a large number of lightweight cores. In order to prepare for the extension of this trend, the DNS code S3D has been retooled in anticipation of a target architecture offering tens of thousands of heterogeneous nodes containing many x86 cores as well as GPU-derived accelerators. Movement of outer loops to the highest level in the code facilitates hybrid MPI-OpenMP performance and an elegant path to accelerated kernels using OpenACC. It is anticipated that relevant scientific simulations at this scale will have a per-node footprint that can be contained entirely on the accelerator, so provision is made to maintain primary solution variables in accelerator memory with specific regions moved to the CPU for inter-node communication and workload balancing. With the current performance, it is estimated that the new code will make it possible to meet early science goals with the full build-out of the anticipated Titan system as well as provide a platform to transition into the exascale software research space.
The fusion energy sciences community has made excellent progress in developing advanced codes for which computer run-time and problem size scale well with the number of processors on massively parallel supercomputers. A good example is the effective usage of the full power of modern leadership-class computational platforms, from the terascale to the petascale and beyond, to produce nonlinear particle-in-cell simulations which have accelerated progress in understanding the nature of plasma turbulence in magnetically confined high-temperature plasmas. Illustrative results provide great encouragement for being able to include increasingly realistic dynamics in extreme-scale computing campaigns to enable predictive simulations with unprecedented physics fidelity. William Tang's session is part of "S0606 - GPU-accelerated Science on Titan: Tapping into the World's Preeminent GPU Supercomputer to Achieve Better Science" mini-track with Dr. Jack Wells.
Biomass from terrestrial plants offers the potential of an abundant source of cellulosic ethanol. However, technical problems arising from the recalcitrance of biomass to hydrolysis still hinder the cost-effective conversion of biomass to ethanol. Here, computer simulation of biomass is employed to understand the physical origins of biomass recalcitrance. The temperature-dependent structure and dynamics of lignin polymers in aqueous solution are examined using extensive molecular dynamics simulations. Neutron scattering experiments and molecular dynamics simulations reveal the structure of lignin aggregates. Finally, the interaction of lignin with cellulose is examined and differential binding to crystalline and amorphous cellulose is explained thermodynamically. This session is part of "S0606 - GPU-accelerated Science on Titan: Tapping into the World's Preeminent GPU Supercomputer to Achieve Better Science" mini-track with Dr. Jack Wells.
The Consortium for Advanced Simulation of Light-Water Reactors (CASL) is a U.S. Department of Energy Innovation Hub, established in July 2010 to develop and apply advanced modeling and simulation to operating nuclear power plants. Through increases in power, plant lifetime extension, higher fuel burnup, and enhanced safety, CASL will reduce operating costs and enable delivery of more carbon-free electricity to the U.S. power grid. To achieve these goals, CASL is building the Virtual Environment for Reactor Applications (VERA), a system for analysis of phenomena within nuclear reactors. Since computational demands are considerable, VERA is being developed as a scalable system, able to take advantage of platforms ranging from high-end workstations to the largest leadership-class supercomputers such as Titan at Oak Ridge National Laboratory (ORNL).
In this session we will cover all the different aspects of interaction between graphics and compute. The first part of the session will focus on compute API interoperability with OpenGL (using the CUDA and OpenCL APIs), while the second part of the session will delve into interoperability at a system level. In particular we will go through the challenges and benefits of dedicating one GPU to compute and another to graphics, how different system configurations affect data transfer between two GPUs, and how this translates into application design decisions that help enable efficient cross-GPU interoperability between compute and graphics contexts. This talk is repeated on Thursday at 3:30 PM (S0267B).
Computation and visualization don't necessarily have to act as two separate entities. This talk explains the integration of real-time compute with real-time visualization. Industry and academia have provided attractive solutions for compiler-directive optimized code for computations. To support cases that involve massive yet ad-hoc data I/O and computation with interactive visualization, Hue developed a different model which bridges the gap between "complete system rewrite" and "compiler directive optimized code". The talk explains how highly optimized data I/O mechanisms coupled with predefined input and output definitions for kernels provide excellent scalability and interactivity during runtime.