7cde13sw-sdfg-443b-82d0-ba01dd84469a9 aeroCuda: GPU-Optimized Immersed Solid Code This is an immersed solid CFD code that uses Peskin's immersed boundary method with Tryggvason's formulation of Chorin's projection method for solving the full Navier-Stokes equations. It is written in Python and uses the PyCUDA/PyFFT libraries to access CUDA and the cuFFT libraries. http://www.scribd.com/doc/97875813/aeroCuda-The-2-d-CFD-Code /content/cudazone/CUDABrowser/assets/images/applications/163272_Vel100-low.png /content/cudazone/CUDABrowser/assets/images/applications/163272_Vel100-med.png Academia Harvard School of Engineering and Applied Sciences 2011 06 21 06/21/2011 48 Open source Samir Patel Paper Computational Fluid Dynamics 7cde13sw-sdfg-443b-82d0-ba018el4469a9 Coulomb Integrals CUDA Experimental Routine Quantum dots are artificial structures of semiconductor material in which charge carriers are confined in a nanometre-scale region of space. Because the carriers are very close to each other, calculating their state means calculating a correlated state. Such states can be computed by rewriting the many-particle Schrödinger equation in matrix form. Finding the elements of this matrix requires evaluating Coulomb integrals. From a computational perspective, this is a very demanding operation. This application shows that it's possible to dramatically improve the speed of the above operation by properly exploiting the power of CUDA GPUs. http://cudaci.tumblr.com/ /content/cudazone/CUDABrowser/assets/images/applications/157939_nanotrappole-low.png /content/cudazone/CUDABrowser/assets/images/applications/157939_nanotrappole-med.png Academia University of Modena and Reggio Emilia 2011 01 15 01/15/2011 44 Open source Martin Klapez Code Paper Presentation Numerics Science Quantum Dots, Coulomb Integrals 7cde13sw-sdfg-443v-82d0-ba018el4469a9 VexCL VexCL is a vector expression template library for OpenCL. It was created to ease C++ based OpenCL development. 
Multi-device (and multi-platform) computations are supported. The code is publicly available under the MIT license. Main features: * Selection and initialization of compute devices according to an extensible set of device filters. * Transparent allocation of device vectors spanning multiple devices. * Convenient notation for vector arithmetic, sparse matrix-vector multiplication, reductions. All computations are performed in parallel on all selected devices. * Appropriate kernels for vector expressions are generated automatically the first time an expression is used. http://ddemidov.github.com/vexcl/ /content/cudazone/CUDABrowser/assets/images/applications/19880_logo-low.png /content/cudazone/CUDABrowser/assets/images/applications/19880_logo-med.png Academia Kazan Federal University, Supercomputer Center of Russian Academy of Sciences http://www.ksu.ru 2012 05 18 05/18/2012 10-20 Open source Denis Demidov Code Numerics Libraries Programming Tools Science C++, OpenCL, Libraries, Meta-programming, Linear Algebra 7cde13sw-7ofb-443b-82d0-ba018el4469a9 Computational Fluid Dynamics Simulations Using Many Graphics Processors In this scenario, computational fluid dynamics simulations of turbulence are performed with 64 GPUs and an optimized CFD algorithm using communication/computation overlapping. Detailed timings reveal that the GPUs' internal calculations are so efficient that operations related to data exchange between compute nodes now cause a scaling bottleneck on all but the largest problems. http://www.computer.org/csdl/mags/cs/2012/03/mcs2012030010-abs.html /content/cudazone/CUDABrowser/assets/images/applications/405887_GPU-CFD-low.png /content/cudazone/CUDABrowser/assets/images/applications/405887_GPU-CFD-med.png Academia University of Massachusetts, Amherst 2012 04 21 04/21/2012 45 N/A Ali Khajeh-Saeed J. 
Blair Perot Paper Computational Fluid Dynamics 7cde13sw-7ofb-443b-82d0-ba018el4468a9 Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms To facilitate full chip capacitance extraction, field solvers are typically deployed for characterizing capacitance libraries for various interconnect structures and configurations. In the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents the first massively parallel capacitance extraction algorithm FMMGpu that accelerates the well-known fast multipole methods (FMM) on modern Graphics Processing Units (GPUs). We propose GPU friendly data structures and SIMD parallel algorithm flows to facilitate the FMM-based 3-D capacitance extraction on GPU. Effective GPU performance modeling methods are also proposed to properly balance the workload of each critical kernel in our FMMGpu implementation, by taking advantage of the latest Fermi GPU’s concurrent kernel executions on streaming multiprocessors (SMs). Our experimental results show that FMMGpu brings 22X to 30X speedups in capacitance extractions for various test cases. We also show that even for small test cases that may not well utilize GPU’s hardware resources, the proposed cube clustering and workload balancing techniques can bring 20% to 60% extra performance improvements. 
http://www.ece.mtu.edu/~zhuofeng/MTU_VLSI_DA_files/papers/DAC11paper.pdf /content/cudazone/CUDABrowser/assets/images/applications/25120_CapExtraction-low.png /content/cudazone/CUDABrowser/assets/images/applications/25120_CapExtraction-med.png Academia Michigan Technological University http://www.mtu.edu 2011 06 10 06/10/2011 30 N/A Xueqian Zhao and Zhuo Feng Paper Presentation Electronic Design Automation Fast Multipole Method, Capacitance Extraction 7cde1372-7ofb-443b-82d0-ba018el469a9 GPU-Accelerated Large-Eddy Simulation of Turbulent Channel Flows High performance computing clusters that are augmented with cost and power efficient graphics processing unit (GPU) provide new opportunities to broaden the use of large-eddy simulation technique to study high Reynolds number turbulent flows in fluids engineering applications. In this paper, we extend our earlier work on multi-GPU acceleration of an incompressible Navier-Stokes solver to include a large-eddy simulation (LES) capability. In particular, we implement the Lagrangian dynamic subgrid scale model and compare our results against existing direct numerical simulation (DNS) data of a turbulent channel flow at friction Re = 180. Overall, our LES results match fairly well with the DNS data. Our results show that the friction Re = 180 case can be entirely simulated on a single GPU, whereas higher Reynolds cases can benefit from a GPU cluster. 
http://scholarworks.boisestate.edu/mecheng_facpubs/37/ /content/cudazone/CUDABrowser/assets/images/applications/gpu-large-order-low.png /content/cudazone/CUDABrowser/assets/images/applications/gpu-large-order-med.png Academia Boise State University, Mechanical & Biomedical Engineering Department 2012 01 09 01/09/2012 N/A Rey DeLeon Inanc Senocak Paper Computational Fluid Dynamics 7cde1372-7efb-443b-82t0-ba018el469a9 GPU Application for High-Order Compact Finite Difference Scheme A high-order compact finite difference scheme for the solution of fluid flow problems is implemented to run on a GPU. Besides the compact scheme, a high-order low pass filter is also employed. For time integration, the classical fourth-order Runge–Kutta method is used. Advection of a vortical disturbance and a temporal mixing layer are simulated with speedups between 9x and 16.5x. http://www.sciencedirect.com/science/article/pii/S0045793011003227 /content/cudazone/CUDABrowser/assets/images/applications/gpu-high-order-low.png /content/cudazone/CUDABrowser/assets/images/applications/gpu-high-order-med.png Academia Istanbul Technical University, Faculty of Aeronautics and Astronautics 2012 02 15 02/15/2012 N/A Bulent Tutkun Firat Oguz Edis Paper Computational Fluid Dynamics Numerics 7cde1372-7efb-443b-82d0-ba018el469a9 Focused Ultrasound - Efficient GPU Simulation Methods for Therapy Planning Over the past years, high intensity focused ultrasound therapy has become a promising therapeutic alternative for non-invasive tumor treatment. The basic idea of this interventional approach is to apply focused ultrasound waves to the tumor tissue such that the cells are heated and hence destroyed. Since it is quite difficult to assess the quality of this non-invasive therapy, there is a dire need for computer support in planning, conduction, and monitoring of such treatments. 
In this work, we propose efficient simulation techniques for focused ultrasound waves as well as their heat dissemination using current graphics hardware as a numerical co-processor. We achieve speed-ups between 10 and 700 for the single simulation steps compared to an optimized CPU solution, overall resulting in a significant performance gain over previous approaches for simulation of focused ultrasound. http://diglib.eg.org/EG/DL/PE/vriphys/vriphys11/119-128.pdf.abstract.pdf /content/cudazone/CUDABrowser/assets/images/applications/screenshotTemperature-low.png /content/cudazone/CUDABrowser/assets/images/applications/screenshotTemperature-med.png Research Fraunhofer MEVIS 2011 12 05 12/05/2011 N/A J. Georgii. C.v. Dresky, S. Meier, D. Demedts, C. Schumann, T. Preusser Paper Medical Imaging Numerics Science FUS, HIFU 8cdd1372-7efb-443b-82d0-ba018el469a9 GPU and APU computations of Finite Time Lyapunov Exponent fields We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture. http://www.sciencedirect.com/science/article/pii/S0021999111006322 /content/cudazone/CUDABrowser/assets/images/applications/ftle-app-image-low.png /content/cudazone/CUDABrowser/assets/images/applications/ftle-app-image-med.png Academia ETH Zurich 2011 11 18 11/18/2011 N/A Christian Conti, Diego Rossinelli, Petros Koumoutsakos Paper Computational Fluid Dynamics Numerics Science Signal Processing 8cdd1372-7efb-429b-82d0-ba018el469a9 LASSO Regression Using Distributed GPUs We have introduced a scalable framework that uses MPI to distribute work across multiple GPUs in order to solve extremely large regression problems. 
Speed up is multiplied by the number of available nodes. http://code.google.com/p/parallel-lasso/ /content/cudazone/CUDABrowser/assets/images/applications/mpi-low.png /content/cudazone/CUDABrowser/assets/images/applications/mpi-med.png Academia University of Southern CA 2012 01 19 01/19/2012 Open source Gary K. Chen Code Paper Life Sciences bioinformatics 8ced1372-7efb-429b-82d0-ba018el469a9 CUDA ACCELERATED FACE RECOGNITION USING LOCAL BINARY PATTERNS We present a GPU accelerated face recognition framework using CUDA. The framework utilizes weighted regional LBP (Local Binary Pattern) histograms as features and k-nearest neighbour (k-NN) algorithm for classification. We show an efficient way to compute LBP values from an input image and construct weighted regional LBP histograms in GPU using a single kernel. We also propose a massively parallel GPU implementation of the k-NN algorithm optimized for handling high-dimensional feature vectors. Comparisons with CPU implementations have shown that, by accelerating both the feature extraction and classification process of the face recognition algorithm, we have managed to achieve up to 29x increase in recognition speed. http://cvip.itu.edu.tr/media/4809606.pdf /content/cudazone/CUDABrowser/assets/images/applications/45273_image-low.png /content/cudazone/CUDABrowser/assets/images/applications/45273_image-med.png Academia Istanbul Technical University www.itu.edu.tr 2012 02 20 02/20/2012 29x N/A Salih Cihan Tek Muhittin Gökmen Paper Imaging Signal Processing Face recognition, k-nearest neighbours, local binary patterns 8cdd1372-7efb-449b-82d0-ba018el469a9 PyCOOL PyCOOL (Cosmological Object-Oriented Lattice code) is a fast GPU accelerated program that solves the evolution of interacting scalar fields in an expanding universe with symplectic algorithms. The program has been written with the intention to hit a sweet spot of speed, accuracy and user friendliness. 
This is achieved by using the Python language with the PyCUDA interface to make a program that is very easy to adapt to different scalar field models. http://www.physics.utu.fi/tiedostot/theory/particlecosmology/pycool/ /content/cudazone/CUDABrowser/assets/images/applications/pycool-low.png /content/cudazone/CUDABrowser/assets/images/applications/pycool-med.png Academia University of Turku / Department of Physics and Astronomy 2012 01 24 01/24/2011 Open source Jani Sainio Application Code Science 8cdd1372-7efb-449b-82d0-ba018ef469a9 CUDAfy.NET An open source Microsoft .NET library that allows writing of CUDA applications including device code in languages such as C# and VB. Contains wrappers for CUSPARSE, CUBLAS, CUFFT and CURAND, as well as a growing number of specialized numerics libraries. http://www.hybriddsp.com/Products/CUDAfyNET.aspx /content/cudazone/CUDABrowser/assets/images/applications/cudafyi-low.png /content/cudazone/CUDABrowser/assets/images/applications/cudafy-med.png Commercial Hybrid DSP Systems http://www.hybriddsp.com 2011 12 07 12/07/2011 Open source Hybrid DSP Systems Code Numerics Libraries .NET,C#,VB,Solver 8cdd1372-7efb-449b-82d0-ba018ef469a6 QnDynCUDA We present a set of C++ classes which allow one to use the graphics card processors cores for quantum ab initio simulations, i.e. a direct solving of the time-dependent Schrödinger equation, gaining the benefits from the parallel architecture of the graphical processor units. We use the Chebyshev polynomial and FFT algorithm. The solution is based on NVIDIA CUDA technology. The speed-up factor in the test runs of our classes performed using the graphics card processor can even be of order of 300 in comparison with the test runs using only the single core of CPU. Not only the Schrödinger equation can be integrated using the presented solver. 
With only small changes it can be used for solving the nonlinear Gross–Pitaevskii equation of BECs dynamics, the heat equation, the diffusion equation or other parabolic partial differential equations of second order. http://dx.doi.org/10.1016/j.cpc.2011.11.026 /content/cudazone/CUDABrowser/assets/images/applications/QnDynCUDA-low.png /content/cudazone/CUDABrowser/assets/images/applications/QnDynCUDA-med.png Academia Nicolaus Copernicus University http://fizyka.umk.pl 2011 12 09 12/09/2011 Open source Tomasz Dziubak Jacek Matulewski Code Paper Numerics Libraries Science CUDA 8cdd1372-7efb-449b-82d0-ba08wef469a6 Efficient Decoding of QC-LDPC Codes Using GPUs In this work, we propose an efficient quasi-cyclic LDPC (QC-LDPC) decoder simulator which runs on graphics processing units (GPUs). We optimize the data structures of the messages used in the decoding process such that both the read and write processes can be performed in a highly parallel manner by the GPUs. We also propose a highly efficient algorithm to convert the data structure of the messages from one form to another with very little latency. Finally, with the use of a large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder is found to run around 100 times faster than a CPU-based simulator. http://www.springerlink.com/content/j8g53w2260224wx7/ /content/cudazone/CUDABrowser/assets/images/applications/QC-LDPC-low.png /content/cudazone/CUDABrowser/assets/images/applications/QC-LDPC-med.png Academia The Hong Kong PolyU 2011 06 16 06/16/2011 100 N/A Yue Zhao Xu Chen Chiu-Wing Sham Wai M. Tam and Francis C. M. Lau Paper Signal Processing CUDA 8cdd1372-7efb-449b-82d0-ba018ef4542f Integrating CUDA & GNU Autotools One of the drawbacks to GNU Autotools is that it only provides native support for certain languages, however, it is flexible enough so that you can make it do what you want it to do... if you know how. 
I wanted to distribute CUDA based applications with GNU Autotools but unfortunately CUDA is not one of the languages that it supports... so I started googling around. I found several threads where other people wanted to be able to do the same thing. I found various bread crumbs here and there that enabled me to piece together a working build. Since I couldn't find all of the information that I needed in one spot, I figured I'd write it all down and publish it so others don't have to waste time figuring it out. "The ClusterChimps Guide to Integrating CUDA and GNU Autotools" is a simple guide to building CUDA targets using GNU Autotools. It will show you how to build stand alone CUDA applications, static CUDA libraries, and shared CUDA libraries. The guide comes with a companion example tarball. http://www.clusterchimps.org/autotools.php /content/cudazone/CUDABrowser/assets/images/applications/dr-zaius-low.png /content/cudazone/CUDABrowser/assets/images/applications/dr-zaius-med.png Research ClusterChimps.org http://www.clusterchimps.org 2011 11 18 11/18/2011 Open source Dr. Zaius Code Paper Tools CUDA, Autotools 8cdd1372-7efb-449b-82d0-ba018ef543e8 FMRI Analysis on the GPU Faster fMRI analysis by using the GPU. 
http://www.sciencedirect.com/science/article/pii/S0169260711001957 /content/cudazone/CUDABrowser/assets/images/applications/article-01.png /content/cudazone/CUDABrowser/assets/images/applications/article-02.png Research Linköping University http://www.liu.se 2011 11 12 11/12/2011 N/A Anders Eklund Mats Andersson Hans Knutsson Multimedia Paper Signal Processing Medical Imaging fMRI, GPU, permutation test 8cdd1372-7efb-449b-82d0-ba018ef567r9 True 4D Image Denoising on the GPU 4D Image denoising on the GPU http://www.hindawi.com/journals/ijbi/2011/952819/ /content/cudazone/CUDABrowser/assets/images/applications/4-dimension-01.png /content/cudazone/CUDABrowser/assets/images/applications/4-dimensions-02.png Research Linköping University http://www.liu.se 2011 11 12 11/12/2011 N/A Anders Eklund Mats Andersson Hans Knutsson Multimedia Paper Signal Processing Medical Imaging Science 4D, image denoising, CT 8cdd1372-7efb-449b-82d0-ba018ef458y9 A real-time crosstalk canceller on a notebook GPU People want to participate in the communication with the feeling of being together and sharing the same environment. Crosstalk cancellation is one of the main applications in multichannel acoustic signal processing that provides this kind of feelings. This work shows that GPU can be used as a co-processor which carries out audio processing tasks, freeing CPU resources which can be employed in other tasks. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6012072 /content/cudazone/CUDABrowser/assets/images/applications/159460_Crosstalk_01.png /content/cudazone/CUDABrowser/assets/images/applications/159460_Crosstalk_02.png Academia INCO2 (http://www.inco2.upv.es/) and GTAC (http://www.gtac.upv.es/) Groups. Universidad Politecnica de Valencia http://www.upv.es 2011 09 06 09/16/2011 N/A Jose A. Belloch Alberto Gonzalez F. J. Martínez-Zaldívar Antonio M. 
Vidal Application Paper Signal Processing Video & Audio 8cdd1372-7efb-449b-82d0-ba018ez454f2 Real-time massive convolution for audio applications on GPU Massive convolution is the basic operation in multichannel acoustic signal processing. This field has experienced a major development in recent years due to the growing need to incorporate new effects and the natural desire to improve the hearing experience. These acoustic effects require computing multiple convolutions simultaneously in real time. The work we present describes a GPU implementation of all the operations involved in the convolution, extrapolated to multiple channels. One of the main features of our work is the use of streams on the GPU, which allows computation to overlap with data transfer from CPU to GPU. This application is the first step to carry out real-time multichannel-sound applications on GPU. http://www.springerlink.com/content/h37u46j2416m6733/ /content/cudazone/CUDABrowser/assets/images/applications/156214_Rea_Time_2.png /content/cudazone/CUDABrowser/assets/images/applications/156214_Rea_Time_1.png Academia INCO2 (http://www.inco2.upv.es/) and GTAC (http://www.gtac.upv.es/) Groups. Universidad Politecnica de Valencia http://www.upv.es 2011 04 19 04/19/2011 N/A Jose A. Belloch Alberto Gonzalez F. J. Martínez-Zaldívar Antonio M. Vidal Application Paper Signal Processing Video & Audio 8cdd1372-7efb-449b-82d0-ca018ef454f2 Exposure Render Exposure Render is a Direct Volume Rendering Application that applies progressive Monte Carlo raytracing, coupled with physically based light transport to heterogeneous volumetric data. Exposure Render enables the configuration of any number of arbitrarily shaped area lights, models a real-world camera, including its lens and aperture, and incorporates complex materials, whilst still maintaining interactive display updates. 
It features both surface and volumetric scattering, and applies noise reduction to remove the unwanted startup noise associated with progressive Monte Carlo rendering. The complete implementation is available in source and binary forms under a permissive free software license. http://code.google.com/p/exposure-render/downloads/list /content/cudazone/CUDABrowser/assets/images/applications/655602-example-01.png /content/cudazone/CUDABrowser/assets/images/applications/655602-example-02.png Academia TU Delft http://graphics.tudelft.nl 2011 10 19 10/19/2011 Open source T. Kroes Application Paper Code Digital Content Creation Graphics Imaging Medical Imaging Ray Tracing Monte Carlo Simulation, NVIDIA CUDA, Open Source, Ray Tracing, Volume Rendering 8cdd1372-7efb-449b-82e0-ba018ef454f2 DualSPHysics DualSPHysics is a combined CPU/GPU solver for mesh-free Smoothed Particle Hydrodynamics to be applied in CFD applications with free-surface flows. /content/cudazone/CUDABrowser/assets/images/applications/1370_283365_dualsphysics_cuda_small.png /content/cudazone/CUDABrowser/assets/images/applications/1370_283365_dualsphysics_cuda_large.png Academia EPHYSLAB--Universidade de Vigo and School of Mechanical, Aerospace and Civil Engineering-University of Manchester http://ephyslab.uvigo.es/index.php/eng/dual_sphysics/ 2011 01 11 01/11/2011 90 A.J.C. Crespo J.M. Dominguez M.G. Gesteira Multimedia Paper Computational Fluid Dynamics Science SPH, GPU, meshless method, lagrangian, fluid dynamics, free-surface flow,A.J.C. Crespo,J.M. Dominguez,M.G. Gesteira,alexbexe@uvigo.es,jmdominguez@uvigo.es,mggesteira@uvigo.es 9561fe2e-1de6-461b-b0dc-1cae41da5eb5 NeMo: real-time spiking neural network simulation Spiking neural network simulations are used to model biological brain structures. Simulating large-scale networks is computationally expensive, however, due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied (i.e. 
robotic) setting, the simulation must be real-time in order to be useful. http://nemosim.sourceforge.net /content/cudazone/CUDABrowser/assets/images/applications/1368_193704_firing_small.png /content/cudazone/CUDABrowser/assets/images/applications/1368_193704_firing_large.png Academia Imperial College London http://www.imperial.ac.uk 2010 07 18 07/18/2010 20 Open source Andreas Fidjeland Murray Shanahan Paper Life Sciences Science neural network simulation,Andreas Fidjeland,Murray Shanahan,andreas.fidjeland@imperial.ac.uk a6fbee45-c093-4bab-958c-ce8e5e7baa64 CUDA Benoit Realtime, high resolution, high iteration, supersampled, fractal zoom. Specify the vanishing point, iteration count and colors for an animated zoom into the Mandelbrot set and then watch the zoom without having to run a separate, lengthy, rendering step. Multiple zoom specifications, called tracks, can be stored and played back in sequence, creating a continuously running fractal zooming show. /content/cudazone/CUDABrowser/assets/images/applications/1367_86365_player_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1367_86365_player_large.jpg Research 2011 04 30 04/30/2011 Open source Roger Dahl Paper Application Code Graphics Mandelbrot fractal realtime zoom log scale map,Roger Dahl,dahlsys@gmail.com 457a8222-cc5c-4ab5-a938-290aa493803f SARUMAN SARUMAN (Semiglobal Alignment of short Reads Using CUDA and NeedleMAN-Wunsch) is a mapping approach that returns all possible alignment positions of a read in a reference genome sequence under a given error threshold, together with one optimal alignment for each of these positions. Alignments are computed in parallel on graphics hardware, facilitating a considerable speedup of this normally time-consuming step. 
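For readers unfamiliar with the Needleman-Wunsch recurrence named in the SARUMAN entry, the following is a minimal CPU sketch of the classic global alignment score in Python. It is not SARUMAN's code: the tool performs semiglobal alignment on the GPU, and the function name and the scoring values (`match`, `mismatch`, `gap`) here are illustrative assumptions.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b.

    Scoring values are illustrative assumptions, not SARUMAN's parameters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diag, up, left)
    return score[-1][-1]
```

Each read can be scored independently of every other read, which is presumably what makes the step amenable to parallel execution on graphics hardware.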
http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html /content/cudazone/CUDABrowser/assets/images/applications/1366_61128_saruman_flow_small.png /content/cudazone/CUDABrowser/assets/images/applications/1366_61128_saruman_flow_large.png Academia Center for Biotechnology, Bielefeld University http://www.cebitec.uni-bielefeld.de/ 2011 03 30 03/30/2011 Jochen Blom Tobias Jakobi Daniel Doppmeier Sebastian Jaenicke Jörn Kalinowski Jens Stoye Alexander Goesmann Paper Life Sciences Science Bioinformatics, Sequence Alignment, Short read mapping, Bioinformatics workbench, Sequence Analysis,Jochen Blom,Tobias Jakobi,Daniel Doppmeier,saruman@cebitec.uni-bielefeld.de a11c0bd6-6125-4d81-ad46-4f5c320951c7 Data Assimilation using a GPU Accelerated Path Integral Monte Carlo Approach A general approach to data assimilation (state and parameter estimation) in nonlinear dynamical systems with noisy dynamics and noisy measurements. In general terms, it is a method for extracting a few useful pieces of information from a large amount of raw time series data. /content/cudazone/CUDABrowser/assets/images/applications/1365_90544_unobsStatesSmall_small.png /content/cudazone/CUDABrowser/assets/images/applications/1365_90544_unobsStatesSmall_large.png Academia University of California, San Diego http://physics.ucsd.edu/ 2011 04 05 04/05/2011 300 John C. Quinn Henry D.I. Abarbanel Paper Science Signal Processing Data Assimilation, Parameter Estimation, Monte Carlo, Path Integral,John C. Quinn,Henry D.I. 
Abarbanel,jquinn@ucsd.edu,habarbanel@ucsd.edu cf146928-c319-4ea3-97ee-6f24e3b80847 CP_select parallel selection algorithm for GPUs: calculation of the median and order statistics /content/cudazone/CUDABrowser/assets/images/applications/1364_8595_median_small.png /content/cudazone/CUDABrowser/assets/images/applications/1364_8595_median_large.png Academia Deakin University http://www.deakin.edu.au/ 2011 04 10 04/10/2011 6 Open Source Gleb Beliakov Paper Numerics Libraries Programming Tools Science selection, median, sorting,Gleb Beliakov,gleb@deakin.edu.au 5e4ba313-241d-4dc3-8146-66ddb7379614 Mesh-particle interpolations on GPUs and multicore CPUs Particle-mesh interpolations are fundamental operations for particle-in-cell codes, as implemented in vortex methods, plasma dynamics and electrostatics simulations. In these simulations, the mesh is used to solve the field equations and the gradients of the fields are used in order to advance the particles. The time integration of particle trajectories is performed through an extensive resampling of the flow field at the particle locations. The computational performance of this resampling turns out to be limited by the memory bandwidth of the underlying computer architecture. We investigate how mesh-particle interpolation can be efficiently performed on graphics processing units (GPUs) and multicore central processing units (CPUs), and we present two implementation techniques. 
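As a rough illustration of the mesh-to-particle resampling described in this entry, here is a minimal one-dimensional sketch in Python using linear weights. It is not the authors' implementation: the function name, the uniform grid starting at x = 0, and the assumption that every particle lies inside the mesh are all choices made for this example.

```python
def mesh_to_particles(field, h, positions):
    """Linearly interpolate a 1-D mesh field to particle positions.

    Assumes (for this sketch) a uniform mesh with node k at x = k * h
    and particle positions inside [0, (len(field) - 1) * h].
    """
    last = len(field) - 1
    values = []
    for x in positions:
        s = x / h                           # position in units of grid spacing
        i = max(0, min(int(s), last - 1))   # index of the node to the left
        w = s - i                           # fractional distance to that node
        # Gather from the two neighbouring nodes with linear weights.
        values.append((1.0 - w) * field[i] + w * field[i + 1])
    return values
```

Each particle performs only a few arithmetic operations per value read from the mesh, which is consistent with the entry's remark that memory bandwidth, rather than arithmetic, limits this step.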
http://rsta.royalsocietypublishing.org/content/369/1944/2164.abstract /content/cudazone/CUDABrowser/assets/images/applications/1363_259257_cyl-re40000_small.png /content/cudazone/CUDABrowser/assets/images/applications/1363_259257_cyl-re40000_large.png Academia CSE Lab, ETH Zurich www.cse-lab.ethz.ch 2011 06 01 06/01/2011 Diego Rossinelli Christian Conti Petros Koumoutsakos Paper Computational Fluid Dynamics Game Physics Numerics CPU, GPU, HPC, mesh-particle, grid-particle,Diego Rossinelli,Christian Conti,Petros Koumoutsakos,petros@inf.ethz.ch 9abc758e-c89b-4761-bfcc-57c36df7e9a1 GPU-computing in econophysics and statistical physics A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples. http://dx.doi.org/10.1140/epjst/e2011-01398-x /content/cudazone/CUDABrowser/assets/images/applications/1362_5681_econophysics_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1362_5681_econophysics_large.gif Academia ETH Zurich http://www.tobiaspreis.de/ 2011 04 07 04/07/2011 Open source Tobias Preis Paper Finance Science Computational Finance, Computational Physics,Tobias Preis,mail@tobiaspreis.de aa2667f4-078f-444b-b136-f0065d5c014e Processing and rendering of Fourier domain optical coherence tomography images at a line rate over 524 kHz using a graphics processing unit In Fourier domain optical coherence tomography (FD-OCT), a large amount of interference data needs to be resampled from the wavelength domain to the wavenumber domain prior to Fourier transformation. We present an approach to optimize this data processing, using a graphics processing unit (GPU) and parallel processing algorithms. 
We demonstrate an increased processing and rendering rate over that previously reported by using GPU paged memory to render data in the GPU rather than copying back to the CPU. http://spie.org/x648.html?product_id=896535 /content/cudazone/CUDABrowser/assets/images/applications/1361_367646_Screenshot-a_small.png /content/cudazone/CUDABrowser/assets/images/applications/1361_367646_Screenshot-a_large.png Academia Aston University, Queen Mary Univ of London, NPL www.aston.ac.uk 2011 02 01 02/01/2011 Janarthanan Rasakanthan Kate Sugden Peter H. Tomlins Paper Medical Imaging Signal Processing Optical Coherence Tomography, OCT, medical Imaging,Janarthanan Rasakanthan,Kate Sugden,Peter H. Tomlins,raskanj@aston.ac.uk 13893c5b-60cd-468f-b8ec-ea6c11e367c2 Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids We present a computational method of coupling average interpolating wavelets with high-order finite volume schemes and its implementation on heterogeneous computer architectures for the simulation of multiphase compressible flows. The method is implemented to take advantage of the parallel computing capabilities of emerging heterogeneous multicore/multi-GPU architectures. http://epubs.siam.org/sisc/resource/1/sjoce3/v33/i2/p512_s1 /content/cudazone/CUDABrowser/assets/images/applications/1360_263363_application-image_small.png /content/cudazone/CUDABrowser/assets/images/applications/1360_263363_application-image_large.png Academia ETH Zurich www.cse-lab.ethz.ch 2011 03 01 03/01/2011 Diego Rossinelli Babak Hejazialhosseini Daniele G. Spampinato Paper Computational Fluid Dynamics Numerics Signal Processing GPU, compressible flow, wavelets, multiresolution, adaptive grid, multiphase, multicore architectures, OpenCL,Diego Rossinelli,Babak Hejazialhosseini,Daniele G. 
Spampinato,diegor@inf.ethz.ch 089e53a3-5369-4c7c-b9ff-20d04b833618 Real-time numerical dispersion compensation using graphics processing unit for Fourier-domain optical coherence tomography Numerical dispersion compensation for both standard and full-range Fourier-domain optical coherence tomography (FD-OCT) on the graphics processing unit (GPU) architecture has been implemented. The data acquisition, processing and image display were performed on a multi-thread, CPU-GPU heterogeneous computing system. The real-time ultra-high-resolution full-range complex-conjugate-free FD-OCT imaging was demonstrated at 68.4 frame/s with a frame size of 1024 (lateral) by 2048 (axial) pixels. /content/cudazone/CUDABrowser/assets/images/applications/1359_20606_dispersion_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1359_20606_dispersion_large.jpg Academia Johns Hopkins University http://www.ece.jhu.edu/photonics/zhangkang.html 2011 03 03 03/03/2011 Kang Zhang Jin U. Kang Multimedia Paper Imaging Medical Imaging Life Sciences Signal Processing GPU, Numerical Dispersion Compensation, Optical coherence tomography,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu fa369b09-c534-4b01-b7b8-28ee8d433c77 Real-time intraoperative 4D full-range FD-OCT based on the dual graphics processing units architecture for microsurgery guidance Real-time 4D full-range complex-conjugate-free Fourier-domain optical coherence tomography (FD-OCT) is implemented using a dual graphics processing units (dual-GPUs) architecture. One GPU is dedicated to the FD-OCT data processing while the second one is used for the volume rendering and display. GPU accelerated non-uniform fast Fourier transform (NUFFT) is also implemented to suppress the side lobes of the point spread function to improve the image quality. 
http://www.opticsinfobase.org/abstract.cfm?uri=boe-2-4-764 /content/cudazone/CUDABrowser/assets/images/applications/1358_40008_microsurgery_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1358_40008_microsurgery_large.jpg Academia Johns Hopkins University http://www.ece.jhu.edu/photonics/zhangkang.html 2011 03 01 03/01/2011 Kang Zhang Jin U. Kang Multimedia Paper Imaging Medical Imaging Life Sciences Signal Processing GPU, Optical coherence tomography , 4D imaging,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu 93912935-c2cc-4303-b5b2-7355ddff8c8e IGMAS+ Three-dimensional (3D) interactive modeling with the IGMAS software provides the means for integrated processing and interpretation of geoid, gravity and magnetic fields, yielding improved geological interpretation. IGMAS 3D models are constructed using triangulated polyhedra to which constant density and/or induced and remnant susceptibility are assigned. http://www.potentialgs.com/ /content/cudazone/CUDABrowser/assets/images/applications/1357_790835_IGMAS_small.png /content/cudazone/CUDABrowser/assets/images/applications/1357_790835_IGMAS_large.png Academia Transinsight GmbH http://transinsight.com/ 2011 02 01 02/01/2011 300 Transinsight GmbH, Christan-Albrecht-Universitat zu Kiel - Department for Geophysics & Geoinformation Paper Oil & Gas Science interactive modelling, gravity, magnetic, seismic, inversion, numerical modelling, OpenCL,Transinsight GmbH, Christan-Albrecht-Universitat zu Kiel - Department for Geophysics & Geoinformation,info@potentialgs.com fb111a83-f81b-4fb8-8adf-b15c8e375ad4 Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR We present the implementation and performance of a class of directionally unsplit Riemann-solver-based hydrodynamic schemes on Graphic Processing Units (GPU). 
These schemes, including the MUSCL-Hancock method, a variant of the MUSCL-Hancock method, and the corner-transport-upwind method, are embedded into the adaptive-mesh-refinement (AMR) code GAMER. Furthermore, a hybrid MPI/OpenMP model is investigated, which enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. /content/cudazone/CUDABrowser/assets/images/applications/1356_1516457_KH_small.png /content/cudazone/CUDABrowser/assets/images/applications/1356_1516457_KH_large.png Academia National Taiwan University, Department of Physics 2011 03 22 03/22/2011 101 Hsi-Yu Schive Ui-Han Zhang Tzihong Chiueh Paper Computational Fluid Dynamics Science hybrid MPI/OpenMP/GPU, AMR,Hsi-Yu Schive,Ui-Han Zhang,Tzihong Chiueh,b88202011@ntu.edu.tw e47a37df-6e16-41c8-b775-a88790713add Horizon MHD General relativistic magnetohydrodynamics code. Used in computational astrophysics applications, particularly the prediction of gravitational radiation from compact objects, and the dynamics of magnetars. http://www.horizoncode.org/ /content/cudazone/CUDABrowser/assets/images/applications/1355_172283_orszag_tang_small.png /content/cudazone/CUDABrowser/assets/images/applications/1355_172283_orszag_tang_large.png Academia University of Tuebingen, Institute for Astronomy and Astrophysics 2011 02 25 02/25/2011 200 Burkhard Zink Multimedia Paper Computational Fluid Dynamics Science mhd astrophysics simulator relativistic,Burkhard Zink,bzink@tat.uni-tuebingen.de d0c8b7d9-8adf-4aa6-82aa-4dc5d4c7a070 Practical Time Bundle Adjustment for 3D Reconstruction on GPU We present a hybrid implementation of sparse bundle adjustment on the GPU using CUDA, with the CPU working in parallel. The algorithm is decomposed into smaller steps, each of which is scheduled on the GPU or the CPU. We develop efficient kernels for the steps and make use of existing libraries for several steps.
Our implementation outperforms the CPU implementation significantly, achieving a speedup of 30-40 times over the standard CPU implementation for datasets with up to 500 images on an Nvidia Tesla C2050 GPU. /content/cudazone/CUDABrowser/assets/images/applications/1354_45129_CPU-GPU-Hybrid3_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1354_45129_CPU-GPU-Hybrid3_large.jpg Academia IIIT Hyderabad www.iiit.ac.in 2011 01 01 01/01/2011 40 Siddharth Choudhary Paper Siddharth Choudhary,siddharth.choudhary@research.iiit.ac.in 5064a523-1a53-4c6e-9be0-49b519753279 Flow dynamics measurements using digital holographic PIV An in-line digital holographic (D-HPIV) setup and CUDA-accelerated algorithm were implemented in order to measure the instantaneous three-dimensional (3D), three-component (3C) velocity field of nonstationary flows. This dramatically increases the speed of digital video hologram processing. The system can measure the number, 3D position, size, 3C velocity and track of the particles. The results of the hologram reconstruction are represented using OpenGL. /content/cudazone/CUDABrowser/assets/images/applications/1353_145946_Figure3_small.png /content/cudazone/CUDABrowser/assets/images/applications/1353_145946_Figure3_large.png Academia Petrozavodsk State University www.petrsu.ru 2010 10 07 10/07/2010 1000 Dmitry Ekimov Paper Imaging Numerics Science Signal Processing Video & Audio Dmitry Ekimov,edmitr@onego.ru c3c135bf-3665-438c-b634-25ee54e81a90 Numerical simulation of flow around an oscillating cylinder This program presents a finite difference solution for 2D, low Reynolds number (1-350), unsteady flow around and heat transfer from a stationary or oscillating circular cylinder with constant surface temperature and placed in a uniform stream. The fluid is assumed to be incompressible and of constant property.
The cylinder is moved mechanically and can vibrate in-line with or transverse to the main stream or can follow an elliptical or figure-8-shaped path. The governing equations are the Navier-Stokes equations, the continuity equation, a Poisson equation for pressure and the energy equation. http://www.filefactory.com/file/cac6d70/n/FlowCFD.zip /content/cudazone/CUDABrowser/assets/images/applications/1352_24973_NVIDIA_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1352_24973_NVIDIA_large.jpg Academia Department of Fluid and Heat Engineering, University of Miskolc, Hungary www.uni-miskolc.hu 2010 02 02 02/02/2010 13 Prof. Laszlo Baranyi, Laszlo Daroczy Multimedia Paper Computational Fluid Dynamics Numerics Science CFD, numerics, Computational Fluid Dynamics, oscillating cylinder, in-line oscillation, transverse oscillation, Figure-8-shape motion, SOR, successive over-relaxation, heat transfer, 2D, Reynolds number, Strouhal number, Nusselt number, incompressible, lift, drag, Poisson equation, Navier-Stokes equations, temperature,Prof. Laszlo Baranyi, Laszlo Daroczy,arambl@uni-miskolc.hu; daroczy4@freemail.hu acbbd15e-82f0-45e6-988a-f1726e4bb1ce Running the High Performance Linpack (HPL) Benchmark on NVIDIA GPUs The HPL benchmark is used to rank the world's Top500 supercomputers. This is a step-by-step procedure on how to run NVIDIA's version of the HPL benchmark on Tesla GPUs. We also compare the results of a normal HPL run on CPU to a hybrid run on CPU-GPU to show the performance boost gained with GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1349_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1349_logo_large.png Research Saudi Aramco 2011 01 10 01/10/2011 Open source Mohamad Sindi Paper Benchmark NVIDIA, Linpack, HPL, GPU, Top500, FLOPS, High Performance Computing, HPC,Mohamad Sindi,sindimo@ieee.org be475639-1007-4bf8-bcc8-b103379fdf9d GPU Vision: Accelerating Computer Vision algorithms with Graphics Processing Units We present an introduction to the field of GPU accelerated computer vision by examining several projects that provide the framework for researchers and developers to tap into the computational power of Graphics Processing Units (GPU). Our goal is to identify the tools and areas where GPU acceleration can provide the highest performance increases in computer vision applications by creating performance benchmarks to compare and contrast the GPU and CPU versions in realistic applications. http://c13software.com/downloads/GPUVision_2011.pdf /content/cudazone/CUDABrowser/assets/images/applications/1347_133852_haar_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1347_133852_haar_large.jpg Academia University of Connecticut 2011 02 09 02/09/2011 Tamas K. Lengyel James Gedarovich Antonio Cusano Thomas J. Peters Paper Imaging Tamas K. Lengyel,James Gedarovich,Antonio Cusano,tamas.k.lengyel@gmail.com,james.gedarovich@gmail.com,antonio.cusano@gmail.com 1deb8e36-96ab-486b-a05e-0af7a067b6b7 CUDA Image Mosaic Creates image mosaics from a database of thumbnails on a pixel by pixel basis using CUDA to perform the image comparisons.
Digital Content Creation,Graphics /content/cudazone/CUDABrowser/assets/images/applications/1346_601593_cuda800_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1346_601593_cuda800_large.jpg Research Personal 2011 02 11 02/11/2011 100 Commercial Andy H Coates Multimedia Paper Photo Mosaic Image PhotoMosaic,Andy H Coates,andyhcoates@gmail.com 50827283-e4ee-4c05-bc3d-27bd9df0436b CUDA Real-time renderer /content/cudazone/CUDABrowser/assets/images/applications/1345_62423_chessRefraction_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1345_62423_chessRefraction_large.jpg Research Freelancer 2011 02 05 02/05/2011 20 Open Source Thanassis Tsiodras Multimedia Paper Graphics Ray Tracing SAH AABB BVH Triangle-meshes real-time raytracer,Thanassis Tsiodras,ttsiodras@gmail.com 7717e876-57cb-49db-9350-0ae7cc77ac63 Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two-dimensional Ising model (T. Preis et al., Journal of Computational Physics 228 (2009) 4468-4477) in order to overcome the memory limitations of a single GPU, which enables us to simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a single GPU by factors of up to 35 compared to an optimized single central processing unit (CPU) core implementation which employs multi-spin coding.
www.tobiaspreis.de/publications/bvp_cpc_2010.pdf /content/cudazone/CUDABrowser/assets/images/applications/1344_11409_preis_multi_gpu_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1344_11409_preis_multi_gpu_large.gif Academia Johannes Gutenberg University Mainz www.tobiaspreis.de 2010 08 01 08/01/2010 35 Benjamin Block Peter Virnau Tobias Preis Paper Science Computational Physics, Monte Carlo Simulation, GPU Clusters,Benjamin Block,Peter Virnau,Tobias Preis,mail@tobiaspreis.de 96a2f87d-895b-4b9f-b04e-7ceb30d28941 Hex Protein Docking Modelling protein-protein interactions (PPIs) is an important aspect of structural bioinformatics. The Hex spherical polar Fourier protein docking algorithm has been implemented on Nvidia graphics processor units (GPUs). On a GTX 285 GPU, an exhaustive six-dimensional docking search can be calculated in just 15 seconds using multiple one-dimensional fast Fourier transforms. This represents a 45-fold speed-up over the corresponding calculation on a single CPU, and is at least two orders of magnitude faster than conventional Cartesian grid-based FFT docking approaches. /content/cudazone/CUDABrowser/assets/images/applications/1342_47767_hex_3hfl_docked_rainbow_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1342_47767_hex_3hfl_docked_rainbow_large.jpg Research INRIA http://www.inria.fr 2010 04 24 04/24/2010 45 Dave Ritchie Paper Life Sciences protein docking,Dave Ritchie,dave.ritchie@inria.fr fe5a34d7-6844-4b87-8542-7ef5ce307b1c GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under the application of an electric field is presented.
The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists was proposed. http://portal.acm.org/citation.cfm?id=1808870 /content/cudazone/CUDABrowser/assets/images/applications/1340_header_r1_c1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1340_header_r1_c1_large.jpg Academia Kochi University of Technology 2010 08 01 08/01/2010 50 Alfeus Sunarso Tomohiro Tsuji Shigeomi Chono Paper Science Alfeus Sunarso ,Tomohiro Tsuji,Shigeomi Chono,sunarso@kochi-tech.ac.jp 4672b45f-125b-480f-a8c6-f5aa647b2a75 MandelCUDA Real-time rendering of the Mandelbrot fractal /content/cudazone/CUDABrowser/assets/images/applications/1338_43133_mandel_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1338_43133_mandel_large.gif Research Freelancer 2010 03 01 03/01/2010 40 Open Source Thanassis Tsiodras Paper Graphics Application,Code,Thanassis Tsiodras,ttsiodras@gmail.com adcd077f-17fb-44e4-91de-cb028f7fe788 CUDA Accelerated Particle Engine A simple point sprite based particle engine accelerated with CUDA. /content/cudazone/CUDABrowser/assets/images/applications/1337_309888_particles (1)_small.png /content/cudazone/CUDABrowser/assets/images/applications/1337_309888_particles (1)_large.png Academia Student 2010 05 24 05/24/2010 10 Craig Mouser Multimedia Paper Digital Content Creation Graphics Video Audio Craig Mouser,mouser58907@yahoo.com 8a9de003-fdec-4da9-b81e-d3c7a991d982 Parallel Option Pricing on GPU: Barrier Options and Realized Variance Options We present parallel algorithms implemented in CUDA subroutines ready to run on Graphics Processing Units (GPUs) to price two kinds of financial derivatives, that is: continuous barrier options and realized variance options. The outstanding parallel performance of these algorithms when executed on GPUs is due to the mathematical properties of the pricing formulae used and to their software implementation. 
http://www.econ.univpm.it/recchioni/finance/w13 /content/cudazone/CUDABrowser/assets/images/applications/1336_Fig2GPU_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1336_Fig2GPU_large.jpg Academia Universita di Camerino, Universita Politecnica delle Marche, Universita di Roma La Sapienza 2010 11 05 11/05/2010 L. Fatone M. Giacinti F. Mariani Paper Finance L. Fatone,M. Giacinti,F. Mariani,lorella.fatone@unicam.it ,m.giacinti@univpm.it ,fra_mariani@libero.it 38d09cb5-3ff4-431c-a705-23b70259a7c1 Graphics processing unit accelerated non-uniform fast Fourier transform for ultrahigh-speed, real-time Fourier-domain OCT We implemented fast Gaussian gridding (FGG)-based non-uniform fast Fourier transform (NUFFT) on the graphics processing unit (GPU) architecture for ultrahigh-speed, real-time Fourier-domain optical coherence tomography (FD-OCT). The Vandermonde matrix-based non-uniform discrete Fourier transform (NUDFT) as well as the linear/cubic interpolation with fast Fourier transform (InFFT) methods are also implemented on GPU to compare their performance in terms of image quality and processing speed. http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-18-22-23472 /content/cudazone/CUDABrowser/assets/images/applications/1335_165404_finger_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1335_165404_finger_large.jpg Academia Johns Hopkins University www.ece.jhu.edu 2010 10 25 10/25/2010 Kang Zhang Jin U. Kang Paper Imaging Medical Imaging Signal Processing Kang Zhang,Jin U. Kang,kzhang8@jhu.edu ba0d2d88-8e95-486f-87c7-2e084e8bcc35 Real-time 4D signal processing and visualization using graphics processing unit on a regular nonlinear-k Fourier-domain OCT system We realized graphics processing unit (GPU) based real-time 4D (3D + time) signal processing and visualization on a regular Fourier-domain optical coherence tomography (FD-OCT) system with a nonlinear k-space spectrometer. 
An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in the GPU architecture, which gives average interpolation speeds of >3,000,000 line/s for 1024-pixel OCT (1024-OCT) and >1,400,000 line/s for 2048-pixel OCT (2048-OCT). http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-18-11-11772 /content/cudazone/CUDABrowser/assets/images/applications/1333_61749_finger tip singles 2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1333_61749_finger tip singles 2_large.jpg Academia Johns Hopkins University www.ece.jhu.edu 2010 05 18 05/18/2010 Kang Zhang Jin U. Kang Paper Imaging Medical Imaging Signal Processing Real-time 4D Optical coherence tomography,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu 4ee10503-6f9a-4136-858c-5df8aa4cf07f GPU Smoldyn Porting to CUDA of the core simulation algorithms of Smoldyn /content/cudazone/CUDABrowser/assets/images/applications/1332_167342_screenshot_small.png /content/cudazone/CUDABrowser/assets/images/applications/1332_167342_screenshot_large.png Research COSBI 2010 12 10 12/10/2010 130 Lorenzo Dematte Davide Prandi Paper Life Sciences Lorenzo Dematte,Davide Prandi,dematte@ieee.org 6933fa46-7fc4-4053-b724-feda8995296b MC-GPU: Monte Carlo Simulation of X-ray Transport for Medical Imaging Applications MC-GPU is a GPU-accelerated x-ray transport simulation code that can generate clinically-realistic radiographic projection images and computed tomography (CT) scans of the human anatomy. MC-GPU implements a massively multi-threaded Monte Carlo simulation algorithm for the transport of x rays in a voxelized geometry and uses the x-ray interaction models and cross sections from PENELOPE 2006. The code can handle realistic human anatomy phantoms, for example the freely available models from the Virtual Family. Electron transport is not implemented. The code has been developed using the CUDA programming model and MPI to address multiple GPUs in parallel.
In typical diagnostic imaging simulations, a 15 to 30-fold speed up is obtained using a GPU compared to a CPU execution. /content/cudazone/CUDABrowser/assets/images/applications/1331_99177_mc-gpu_1mmDuke_50keV_1e10hist__All_and_NoScatter_LowRes_small.png /content/cudazone/CUDABrowser/assets/images/applications/1331_99177_mc-gpu_1mmDuke_50keV_1e10hist__All_and_NoScatter_LowRes_large.png Research US Food and Drug Administration http://www.fda.gov/MedicalDevices/ScienceandResearch/ucm2007489.htm 2010 07 08 07/08/2010 30 Open source Andreu Badal Aldo Badano Paper Code Medical Imaging Ray Tracing Andreu Badal,Aldo Badano,andreu_badal@hotmail.com 59b21188-3fc1-448c-bee4-6c5923cfcd67 A demonstration of Exact String Matching Algorithms in CUDA I had a simple idea: is it possible to convert some of the well-known exact string matching algorithms into CUDA versions? /content/cudazone/CUDABrowser/assets/images/applications/1330_80324_Screen_shot_2011-01-07_at_PM_2012_12_43_small.png /content/cudazone/CUDABrowser/assets/images/applications/1330_80324_Screen_shot_2011-01-07_at_PM_2012_12_43_large.png Research HP Labs Singapore http://www.hp.com 2010 12 23 12/23/2010 100 Open source Raymond Tay Application Paper Code general purpose computing Raymond Tay,raymondtay1974@gmail.com f48bb746-7b1b-4da3-b204-cc93a3414cc0 Poker Simulation In GPU Simulation is a technique widely used by artificial and human players to support the decision process in poker. In a typical Texas hold'em game, simulating all possible game states requires millions of hand evaluations. In this application, we port the Hand-Eval poker library to CUDA, providing a generic interface for evaluating large amounts of hand data.
/content/cudazone/CUDABrowser/assets/images/applications/1329_6875_resim_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1329_6875_resim_large.jpg Academia METU 2010 01 31 01/31/2010 15 Commercial Sirin,Volkan Paper Game Simulation Sirin,Volkan,volkansirin@gmail.com 3b26e749-9952-4f86-b8f7-4ca377ef9dea Satellite Image Processing on GPU Satellite Image Processing on GPU is demonstration of performance of remote sensing algorithms such as Shadow Detection and Vegetation Detection on GPU. Also basic image processing algorithms like Contrast Normalization, Histogram Equalization, Automatic Threshold (Otsu's) are implemented. 4 band Satellite Images with 8-bit and 16-bit data are used in tests. Performance of basic and complex algorithms are compared in CPU and GPU with images with various sizes. In the tests, the effect of memory transfer and the order of bands are also considered. /content/cudazone/CUDABrowser/assets/images/applications/1328_60270_imageGPU_small.png /content/cudazone/CUDABrowser/assets/images/applications/1328_60270_imageGPU_large.png Academia Informatics Institute, Middle East Technical University http://www.vrcv.ii.metu.edu.tr 2010 01 31 01/31/2010 10 Open source Mustafa Teke Paper Code Signal Processing Mustafa Teke,mustafa.teke@gmail.com 3dcc9409-2154-48c2-8b2d-9f64294ea4a5 Parallel implementation of large scale crowd simulation Human crowd movement was simulated using texture convolution and a behavioral model inspired by smoothed particle hydrodynamics. 
In order to make large scale simulation possible in real-time, or almost real-time, we will implement a model for human crowd behavior on a parallel processing platform using CUDA /content/cudazone/CUDABrowser/assets/images/applications/1327_361460_accumulatorMultiColor_small.png /content/cudazone/CUDABrowser/assets/images/applications/1327_361460_accumulatorMultiColor_large.png Academia DIKU 2009 11 11 11/11/2009 Thomas Gronnelov Paper Crowd simulation Thomas Gronnelov,tag@greenleaf.dk c9f1c703-120a-4322-a545-025f92f91c95 Monte Carlo simulation of the q-state Potts Model using CUDA In this work we implement a parallel code to perform finite temperature Monte Carlo simulations of a magnetic system described by a two dimensional q-state Potts model. http://www.famaf.unc.edu.ar/grupos/GPGPU/Potts/CUDAPotts.html /content/cudazone/CUDABrowser/assets/images/applications/1326_82402_potts-nvidia_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1326_82402_potts-nvidia_large.jpg Academia GPGPU Computing Group - Fa.M.A.F. - U.N.C. http://www.famaf.unc.edu.ar/grupos/GPGPU/ 2010 01 05 01/05/2010 155 Open source Ezequiel E. Ferrero Juan Pablo De Francesco Nicolas Wolovick, Sergio A. Cannas Paper Statistical Mechanics Ezequiel E. Ferrero,Juan Pablo De Francesco,Nicolas Wolovick, Sergio A. Cannas,ferrero@famaf.unc.edu.ar,jde@famaf.unc.edu.ar,nicolasw@famaf.unc.edu.ar, cannas@famaf.unc.edu.ar ce4b9b35-490f-49f0-b1e7-4d5ec3b9841f CoroBot CUDA-enabled controller for a mobile robot. The controller takes advantage of an ION board. 
Machine vision algorithms are accelerated by a minimum of 8x compared to their single-threaded C++ version executed on the ION Atom CPU /content/cudazone/CUDABrowser/assets/images/applications/1325_6852_corobot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1325_6852_corobot_large.jpg Commercial RealityFrontier http://www.realityfrontier.com 2010 09 01 09/01/2010 8 Commercial Raphael Cariou Application Multimedia Imaging Signal Processing Robotics Raphael Cariou,raphael.cariou@realityfrontier.com 9101844c-56f4-40d7-a856-c75fc81e385a Simulating spin models on GPU Simulations of the Ising, Heisenberg and spin-glass models with Metropolis and parallel tempering updates. /content/cudazone/CUDABrowser/assets/images/applications/1324_checker_small.png /content/cudazone/CUDABrowser/assets/images/applications/1324_checker_large.png Academia Johannes Gutenberg-University Mainz http://www.uni-mainz.de 2010 01 07 01/07/2010 1000 Open source Martin Weigel Paper Code Science Statistical physics Martin Weigel,weigel@uni-mainz.de be82fae2-de7e-48bc-94c2-5f68a86c8c48 iWormhole Desktop Edition Is an ultra-secure file sending Windows Application. This application was designed for consumer use with three guiding principles: 1) Speed, 2) Privacy and 3) Security. /content/cudazone/CUDABrowser/assets/images/applications/1323_36895_screenshot_small.png /content/cudazone/CUDABrowser/assets/images/applications/1323_36895_screenshot_large.png Commercial iWormhole Communications Corp http://www.iwormhole.com 2010 12 12 12/12/2010 1700 Commercial Rob Gagnon Application File Transmission Rob Gagnon,rob@iwormhole.com 56b1c6a1-88bd-4f75-9898-730bd21ce344 Simulation of 1+1 dimensional surface growth and lattices gases using GPUs Restricted solid on solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming.
We consider a deposition evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that for sizes, that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup ( x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated. /content/cudazone/CUDABrowser/assets/images/applications/1322_15738_Model1d_small.png /content/cudazone/CUDABrowser/assets/images/applications/1322_15738_Model1d_large.png Academia MTA-MFA, Res. Inst. for Tech. Phys. and Materials Sci. Budapest http://www.mfa.kfki.hu 2010 12 03 12/03/2010 100 Henrik Schulz Geza Odor Gergely Odor, Mate F. Nagy Paper Statistical Physics Henrik Schulz,Geza Odor,Gergely Odor, Mate F. Nagy,odor@mfa.kfki.hu e795a2dd-798b-4cb4-9630-e8c7cd042a16 CUVI Lib CUDA for Vision and Imaging Library /content/cudazone/CUDABrowser/assets/images/applications/1321_5734_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1321_5734_logo_large.png Commercial TunaCode 2010 08 26 08/26/2010 40 Tauseef Rehman Salman Haq Usman Aziz, Jawad Masood Code Medical Imaging Libraries Programming Tools Signal Processing Tauseef Rehman,Salman Haq,Usman Aziz, Jawad Masood,tauseef@tunacode.com c0ae46f3-d256-4c38-9f59-11cddd117c0a Interactive visualization of the largest radioastronomy cubes Astronomy is a data intensive science. The upcoming and future astronomy research facilities will systematically generate terabyte-sized data sets moving astronomy into the Petascale data era. 
Such increases in dataset size and dimensionality will pose serious computational challenges for many current astronomy data analysis and visualization tools. http://astronomy.swin.edu.au/~ahassan/Research.html /content/cudazone/CUDABrowser/assets/images/applications/1320_optiportal_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1320_optiportal_large.jpg Academia Swinburne University of Technology-Centre for Astrophysics and Supercomputing http://astronomy.swin.edu.au/scivis/ 2010 09 01 09/01/2010 A. H. Hassan C. J. Fluke D. G. Barnes Application A. H. Hassan,C. J. Fluke,D. G. Barnes f01e4f1c-7947-4609-b24c-4732f954bafb Simulation of 1+1 dimensional surface growth and lattices gases using GPUs Restricted solid on solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs either by CUDA or by OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions related to the Asymmetric Simple Exclusion Process and show that for sizes that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup ( x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, requiring extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized and the dynamical behavior has been investigated. /content/cudazone/CUDABrowser/assets/images/applications/1318_15738_Model1d_small.png /content/cudazone/CUDABrowser/assets/images/applications/1318_15738_Model1d_large.png Academia MTA-MFA, Res. Inst. for Tech. Phys. and Materials Sci. Budapest http://www.mfa.kfki.hu 2010 12 03 12/03/2010 100 Henrik Schulz Geza Odor Gergely Odor, Mate F.
Nagy Paper Statistical Physics Henrik Schulz,Geza Odor,Gergely Odor, Mate F. Nagy,odor@mfa.kfki.hu 4936a20a-b98f-4e39-94d2-ec174d002e9e Nonlinear Free Surface Water Waves Fast Desktop Computing for Nonlinear Free Surface Water Waves (OceanWave3D potential flow model) /content/cudazone/CUDABrowser/assets/images/applications/1317_47409_whalint3_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1317_47409_whalint3_large.jpg Academia Technical University of Denmark http://www.imm.dtu.dk/~apek 2010 12 03 12/03/2010 42 Open source Allan P. Engsig-Karup Application Computational Fluid Dynamics Numerics oceanwave3d, potential free surface flow, finite difference method, coastal engineering,Allan P. Engsig-Karup,apek@imm.dtu.dk a79d4690-9c30-4837-88bc-805873f7e5f5 Lagrangian Stochastic Particle Model using Large-Eddy Simulation Meteorology Atmospheric transport and dispersion (T&D) models play an important role in United States national defense. Due to operational time constraints, less sophisticated models have consistently dominated the defense market. Recent advances in graphics processing units (GPUs) and their programming models have made GPUs an attractive platform for commodity, low-power, high-performance parallel computing. Two GPU accelerated (using NVIDIA Corporation's CUDA technology) versions of a sophisticated, large-eddy simulation (LES) based, Lagrangian stochastic model, developed at the National Center for Atmospheric Research (NCAR), were implemented and compared against their single and multiple core CPU (Intel Harpertown) counterparts. The implementation representing the shortest route to GPU acceleration observed a single GPU speedup of 14x over the single core CPU implementation. A more robust and scalable single GPU implementation observed speedups of 20x over the single core CPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/1316_27146_ave_plan_view_crop_small.png /content/cudazone/CUDABrowser/assets/images/applications/1316_27146_ave_plan_view_crop_large.png Academia University of Colorado - Boulder 2010 07 13 07/13/2010 20 Jonathan Hurst Paper Computational Fluid Dynamics Jonathan Hurst,jhurst@ucar.edu ebb0619e-dd44-4c10-ac6c-1659ff388b6f rCUDA 2.0 Allows performing CUDA calls to remote GPUs. /content/cudazone/CUDABrowser/assets/images/applications/1315_4691_rCUDA_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1315_4691_rCUDA_logo_large.png Academia UPV / UJI 2010 11 24 11/24/2010 Open source The rCUDA Team Application Paper Code Libraries Programming Tools The rCUDA Team,apenya@gap.upv.es bab91959-3b2a-494c-8644-cf771a2f9bc0 LATTE GPU-accelerated self-consistent tight-binding molecular dynamics for materials with mixed covalent and ionic bonding. /content/cudazone/CUDABrowser/assets/images/applications/1314_main_orig_small.png /content/cudazone/CUDABrowser/assets/images/applications/1314_main_orig_large.png Research Los Alamos National Laboratory http://www.lanl.gov 2010 11 01 11/01/2010 Open source E.J. Sanville N. Bock A. M. N. Niklasson A. Odell S. Rudin M. J. Cawkwell J. Coe Code Life Sciences Science E.J. Sanville,N. Bock,J. Coe, A. M. N. Niklasson, A. Odell, S. Rudin, M. J. Cawkwell,edsanville@gmail.com,nbock@lanl.gov, jcoe@lanl.gov, amn@lanl.gov, aodell@kth.se, srudin@lanl.gov, cawkwell@lanl.gov 4dfaf12b-06c9-418f-8250-7d75fa91a932 Reverse extraction of early-age hydration kinetic equation from observed data of Portland cement. The early-age hydration of Portland cement paste has an important impact on the formation of microstructure and development of strength. However, manual derivation of hydration kinetic equation is very difficult because there are multi-phased, multi-sized and interrelated complex chemical and physical reactions during cement hydration. 
In this paper, the early-age hydration kinetic equation is automatically reverse-extracted from the observed time series of the hydration degree of Portland cement, using an evolutionary computation method that combines gene expression programming and particle swarm optimization algorithms. In order to reduce the computing time, GPUs are used for parallel acceleration. Studies have shown that the simulation curve of early-age hydration given by the extracted kinetic equation is in good accordance with the observed experimental data. Furthermore, this equation retains good generalization ability even when the chemical composition, particle size and curing conditions change. /content/cudazone/CUDABrowser/assets/images/applications/1313_75384_Reverse_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1313_75384_Reverse_large.jpg Academia Provincial Key Laboratory for Network-based Intelligent Computing, University of Jinan, Jinan 250022, China 2010 11 19 11/19/2010 WANG Lin YANG Bo ZHAO XiuYang, CHEN YueHui, CHANG Jun Paper Science Material WANG Lin,YANG Bo,ZHAO XiuYang, CHEN YueHui, CHANG Jun,wangplanet@gmail.com f905568d-b7d2-4ac5-b102-89b8f9a8cbdc IntelliEtch GPU module IntelliEtch is an Anisotropic Wet Etch simulator. This chemical process can be used for Silicon-based Microsystems fabrication. IntelliEtch can be used as a CAD tool for Microsystem fabrication, allowing fast and accurate simulations.
/content/cudazone/CUDABrowser/assets/images/applications/1312_87164_Images-036_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1312_87164_Images-036_large.jpg Research I3M Institute (Polytechnic University of Valencia), DIPC Institute (University of the Basque Country) 2010 10 01 10/01/2010 150 Commercial N Ferrando M A Gosalvez Multimedia Paper Computer Aided Engineering Microsystems MEMS, microsystems, cellular automata,N Ferrando,M A Gosalvez,nesferjo@upvnet.upv.es,miguelangel.gosalvez@ehu.es a63d80ac-a9bb-4cb9-a632-66d9d2563f07 Ultra Fast SOM using CUDA This paper presents an overview of the optimization strategies used for the parallel implementation of Basic-SOM on the GPU using the CUDA programming paradigm. /content/cudazone/CUDABrowser/assets/images/applications/1311_NeST-NVIDIA_Center_small.png /content/cudazone/CUDABrowser/assets/images/applications/1311_NeST-NVIDIA_Center_large.png Commercial NeST http://nestsoftware.com/ 2010 05 18 05/18/2010 Sijo Mathew Preetha Joy Paper Numerics Data mining Sijo Mathew,Preetha Joy,hpc@nestgroup.net 30fbce15-63ea-41fe-8ff3-665316b591e3 AgiSoft PhotoScan AgiSoft PhotoScan is an advanced image-based 3D modeling solution for creating professional quality 3D content from still images. Based on the latest multi-view 3D reconstruction technology, it operates on arbitrary images and is efficient in both controlled and uncontrolled conditions. The photos can be taken from any position, provided that the object to be reconstructed is visible in at least two photos. Both image alignment and 3D model reconstruction are fully automated.
/content/cudazone/CUDABrowser/assets/images/applications/1309_436476_logo-pscan-2_small.png /content/cudazone/CUDABrowser/assets/images/applications/1309_436476_logo-pscan-2_large.png Commercial AgiSoft http://www.agisoft.ru 2010 08 18 08/18/2010 20 Commercial AgiSoft Application Computer Aided Engineering Digital Content Creation Graphics image based modeling,AgiSoft,info@agisoft.ru 46c6675d-b59a-49c8-89dd-fc7a32e6484e Field Forge Field Forge brings massively parallel processing (MPP) to PostgreSQL's current single-threaded sessions. Field Forge utilizes the MPP power of the Kappa framework. The Kappa framework provides practical usage of CUDA GPU, OpenMP, and partitioned data flow scheduled processing. Field Forge makes the Kappa framework from Psi Lambda LLC a new Language for defining Window and Table functions. These functions allow processing to be specified using SQL and index component notation for MPP using GPUs and CPUs. Within each Field Forge node, the Kappa framework passes (subsets of) the data sets between processing kernels and into and out of data sets. Field Forge also utilizes the Kappa framework's Apache Portable Runtime (APR) database driver SQL connections to retrieve data fields from any database source (including other Field Forge sessions and nodes), process them using the MPP capabilities of the Kappa framework, and return them as PostgreSQL table or window fields returned from table or window functions respectively. This combination of features enables a Dataset Passing Interface (DPI) for distributed MPP. DPI leverages the existing skills, protocols, connectivity, and infrastructure of an organization.
/content/cudazone/CUDABrowser/assets/images/applications/1308_37625_psilambdakappa_small.png /content/cudazone/CUDABrowser/assets/images/applications/1308_37625_psilambdakappa_large.png Commercial Psi Lambda LLC http://psilambda.com 2010 11 07 11/07/2010 Commercial Psi Lambda LLC Application Finance Numerics Life Sciences Programming Tools Science PostgreSQL OpenMP CUDA Window Table Partition,Psi Lambda LLC,kappa@psilambda.com dbb00087-2b32-421a-9e65-f1295b0546b6 Introducing libflame with multi-GPU support We are happy to announce the fifth milestone release (r4648) of libflame, a modern replacement for the most-used functionality of the LAPACK linear algebra library. The main improvement since version 4.0 is that libflame now supports parallel execution using multiple GPUs through the SuperMatrix runtime system. By linking libflame with CUBLAS for the execution of BLAS routines on a single GPU, the SuperMatrix runtime system schedules operations to each GPU and manages the explicit movement of data. This release includes support for single and double precision real and complex floating point operations. /content/cudazone/CUDABrowser/assets/images/applications/1307_205958_FLAMEbanner_small.png /content/cudazone/CUDABrowser/assets/images/applications/1307_205958_FLAMEbanner_large.png Academia UT Austin / Universitat Jaume 2010 10 28 10/28/2010 Open source Ernie Chan Francisco Igual Field van Zee, Robert van de Geijn Application Code Numerics Libraries Ernie Chan,Francisco Igual,Field van Zee, Robert van de Geijn,figual@icc.uji.es 6865f9fe-2a11-47bc-9d50-c0b4026248dc alenka Alenka is a high level, high performance SQL-like language for data processing on CUDA hardware /content/cudazone/CUDABrowser/assets/images/applications/1306_53666_Cubes_small.png /content/cudazone/CUDABrowser/assets/images/applications/1306_53666_Cubes_large.png Research 2010 11 02 11/02/2010 Open source Anton K. 
Application Code Programming Tools databases Anton K.,antonmks@gmail.com 159ab96a-c59e-4853-bd70-03a4e7691b6f CUDA Accelerated Face Recognition We explore one of the possibilities of parallelizing and optimizing a well-known Face Recognition algorithm, Principal Component Analysis (PCA) with Eigenfaces. /content/cudazone/CUDABrowser/assets/images/applications/1305_NeST-NVIDIA_Center_small.png /content/cudazone/CUDABrowser/assets/images/applications/1305_NeST-NVIDIA_Center_large.png Commercial NeST http://nestsoftware.com 2010 07 26 07/26/2010 Numaan. A Sibi A Paper Imaging Numaan. A,Sibi A,hpc@nestgroup.net df37fa39-5f4f-43cd-8a94-797995daea91 On the Use of Small 2D Convolutions on GPUs Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. In this work, we present the techniques, performance results and considerations to accelerate small 2D convolutions using CUDA, and compare performance to a multi-threaded CPU implementation. To improve programmability and performance of applications that make heavy use of small convolutions, we argue that two improvements to software and hardware are needed: FFT libraries must be extended with a single convolution function and communication bandwidth between CPU and GPU needs to be drastically improved. /content/cudazone/CUDABrowser/assets/images/applications/1304_2dconvolutions_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1304_2dconvolutions_large.jpg Academia TUDelft, ASML, TU/e http://www.tudelft.nl/http://www.asml.nl/http://www.tue.nl 2010 06 19 06/19/2010 Shams Al Umairy Alexander S.
van Amesfoort Henk Sips, Irwan Setija, Martijn van Beurden Irwan Setija Martijn van Beurden Paper Numerics Science 2D convolution, FFT, Electromagnetic diffraction grating,GPU, CUDA, Tesla,Shams Al Umairy,Alexander S. van Amesfoort,Henk Sips, Irwan Setija, Martijn van Beurden,salumairy@gmail.com,a.s.vanamesfoort@tudelft.nl,sips@ewi.tudelft.nl,Irwan.Setija@asml.com,M.C.v.Beurden@tue.nl f0e9b09f-f104-4594-9ec8-165b10670b21 GFARGO GFARGO simulates the evolution of a gaseous protoplanetary disk subject to the gravitational perturbation of forming protoplanets embedded in it, by solving the Navier-Stokes equations on a polar mesh. It simultaneously describes how the planetary orbits expand or shrink with time, a process known as planetary migration, which plays an important role in shaping the planetary system that emerges once the disk dissipates. The current implementation is two-dimensional, and performance gains of up to 90x are achieved with respect to CPU implementations. /content/cudazone/CUDABrowser/assets/images/applications/1303_35573_fargo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1303_35573_fargo_large.png Academia Institute of Physical Sciences, UNAM, Mexico and CEA, Saclay, France 2010 10 22 10/22/2010 90 Open source Frederic Masset Application Code Computational Fluid Dynamics Science Frederic Masset,fmasset@cea.fr 6c2695c6-b17f-4357-ba6a-71715583d5ab 2 million pixel experiment This experimental application maps an HD video source (1080p) into 3D space. Each frame is processed in real time on the GPU using NVIDIA CUDA technology. Each pixel in a frame (2,073,600 pixels per frame) is scaled by its luminance value and given the original color. The application is written in C# using DirectX11 via SlimDX, CUDA.NET and DirectShow.NET libraries.
/content/cudazone/CUDABrowser/assets/images/applications/1302_114048_visualcompute_cuda_app960_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1302_114048_visualcompute_cuda_app960_large.jpg Research noumentalia.de - digital arts - visualcompute.com http://www.noumentalia.de 2010 10 22 10/22/2010 Philipp Drieger Multimedia Presentation Digital Content Creation Graphics Imaging Libraries Science Signal Processing Video & Audio HD video processing 1080p 3D CUDA .NET C# map 3D space,Philipp Drieger,info@visualcompute.com ece4bfff-3896-47dd-b0af-f6abbb592a8e powDOG: powder diffraction on GPUs Diffraction, particularly of X-rays, is a powerful technique for the investigation of structure, microstructure and dynamical properties of matter. In order to link theoretical methods, like Molecular Dynamics and other atomistic approaches, and diffraction experiments we developed a new software for calculating the powder diffraction pattern of nano-sized objects on the GPUs. The software, soon to be made available under GPL license, allows the use of GPUs on different hosts for a direct (brute-force) computation of the Debye scattering equation. /content/cudazone/CUDABrowser/assets/images/applications/1301_1322162_powDOG_small.png /content/cudazone/CUDABrowser/assets/images/applications/1301_1322162_powDOG_large.png Academia University of Trento, Trento, Italy 2010 02 08 02/08/2010 Luca Gelisio Cristy Leonor Azanza Ricardo, Matteo Leoni, Paolo Scardi. Application Science Powder diffraction, Debye scattering equation, nanostructured materials,Luca Gelisio,Cristy Leonor Azanza Ricardo, Matteo Leoni, Paolo Scardi.,luca.gelisio@unitn.it 85367756-abdb-4bc0-ab43-beed43680f51 GPU Accelerated Likelihoods for Stereo-Based Articulated Tracking For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. 
We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes are now performed in a few seconds. /content/cudazone/CUDABrowser/assets/images/applications/1299_88964_gpu_vision_2010_small.png /content/cudazone/CUDABrowser/assets/images/applications/1299_88964_gpu_vision_2010_large.png Academia The eScience Centre,Dept. of Computer Science, University of Copenhagen http://www.diku.dk 2010 09 05 09/05/2010 600 Rune Mollegaard Friborg Soren Hauberg Kenny Erleben Paper Ray Tracing Computer Vision Machine Learning Tracking Articulated Tracking,Particle Filtering,Rune Mollegaard Friborg,Soren Hauberg,Kenny Erleben ,runef@diku.dk,hauberg@diku.dk,kenny@diku.dk 8052c851-ce8c-4767-9452-e3df12796c1d Electronic Design Automation GPU-Based Robust Multigrid Preconditioned Solver for Large Scale On-Chip Power Grid Simulation /content/cudazone/CUDABrowser/assets/images/applications/1298_1108533_40-1-9_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1298_1108533_40-1-9_large.jpg Academia Michigan Technological University http://www.ece.mtu.edu/~zhuofeng/MTU_VLSI_DA.htm 2010 09 15 09/15/2010 50 Zhuo Feng Multimedia Paper Computer Aided Engineering Electronic Design Automation Multigrid, preconditioned iterative methods, power delivery network, on-chip interconnect simulation, VLSI system,Zhuo Feng,zhuofeng@mtu.edu c9e901d1-7e78-40ed-ac3d-4b59592bbb9b Engine_cudamrg for OpenSSL Engine_cudamrg is a cryptographic engine for the OpenSSL Toolkit that can accelerate some operations using a CUDA-supported device. We currently support the following cipher types: * AES-128-ECB * AES-128-CBC * AES-192-ECB * AES-192-CBC * AES-256-ECB * AES-256-CBC We support both
encryption and decryption for these cipher types. For future releases we plan to optimize currently supported cipher types, add more cipher types and digest algorithms. /content/cudazone/CUDABrowser/assets/images/applications/1297_8191_engineCudamrg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1297_8191_engineCudamrg_large.png Commercial Engine_cudamrg Development Team http://groups.google.com/group/engine-cudamrg 2010 07 26 07/26/2010 Open source Paolo Margara Application Code Cryptography AES, cryptography,Paolo Margara,paolo.margara@gmail.com 8179b89f-1d5e-4dad-84d4-6b0466dcde7e Smoke Simulation for Fire Engineering using a Multigrid Method on Graphics Hardware We present a GPU-based Computational Fluid Dynamics solver for the purpose of fire engineering. We apply a multigrid method to the Jacobi solver when solving the Poisson pressure equation, supporting internal boundaries. Boundaries are handled on the coarse levels, ensuring that boundaries will never vanish after restriction. We demonstrate cases where the multigrid solver computes results up to three times more accurate than the standard Jacobi method within the same time, providing rich visual details and flows closer to widely accepted standards in fire engineering. Making accurate interactive physical simulation available for engineering purposes has the benefit of reducing production turn-around time. We have measured speed-up improvements by a factor of up to 350, compared to existing CPU-based solvers. The present CUDA-based solver promises huge potential in economical benefits, as well as constructions of safer and more complex buildings. In this paper, the multigrid method is applied to fire engineering. However, this is not a limitation, since improvements are possible for other fields as well. Traditional Jacobi solvers are particularly suitable for the methods presented.
/content/cudazone/CUDABrowser/assets/images/applications/1296_121739_vriphys2009_glimberg_erleben_teaser_small.png /content/cudazone/CUDABrowser/assets/images/applications/1296_121739_vriphys2009_glimberg_erleben_teaser_large.png Academia Department of Computer Science/University of Copenhagen http://www.diku.dk 2009 11 05 11/05/2009 350 Stefan Glimberg Kenny Erleben Jens Bennetsen Paper Computational Fluid Dynamics Computer Aided Engineering Pre-parameter studies of virtual designs Stefan Glimberg,Kenny Erleben,Jens Bennetsen,glimberg@diku.dk,kenny@diku.dk 2e28ba24-3cc8-4333-be77-fb572dc28776 GPU Accelerated Tandem Traversal of Blocked Bounding Volume Hierarchy Collision Detection for Multibody Dynamics The performance bottleneck of physics-based animation is often the collision detection. It is well known by practitioners that collision detection may consume more than half of the simulation time. In this work we introduce a novel approach for collision detection using bounding volume hierarchies. Our approach makes it possible to perform non-convex object versus non-convex object collision on the GPU, using tandem traversals of bounding volume hierarchies. Prior work only supports single traversals on GPUs. We introduce a blocked hierarchy data structure, using imaginary nodes and a simultaneous descent in the tandem traversal. The data structure design and traversal are highly specialized for exploiting the parallel threads in NVIDIA GPUs. As a proof of concept we demonstrate a GPU implementation for a multibody dynamics simulation, showing an approximate speedup factor of up to 8 compared to a CPU implementation. /content/cudazone/CUDABrowser/assets/images/applications/1295_52591_vriphys2009_damkjaer_erleben_teaser_small.png /content/cudazone/CUDABrowser/assets/images/applications/1295_52591_vriphys2009_damkjaer_erleben_teaser_large.png Academia Department of Computer Science, University of Copenhagen.
http://www.diku.dk/ 2009 11 05 11/05/2009 8 Open source Jesper Damkjaer Kenny Erleben Paper Game Physics Graphics Numerics Libraries Programming Tools Science Bounding volume Hierarchies, Collision Detection, Rigid Body Simulation,Jesper Damkjaer, Kenny Erleben,damkjaer@diku.dk,kenny@diku.dk 70463033-d9d8-49f6-9067-8a982284a733 permGPU: Using graphics processing units in RNA microarray association studies Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed. http://www.gpucomputing.net/?q=node/2083 /content/cudazone/CUDABrowser/assets/images/applications/1294_bmc_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1294_bmc_large.jpg Commercial BMC Bioinformatics 2010 05 22 05/22/2010 78 Ivo D Shterev Sin-Ho Jung Stephen L George Paper Ivo D Shterev,Sin-Ho Jung,Stephen L George 796fa229-7424-4736-b68b-793cf9120ee9 High performance GPU radix sorting in CUDA This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).
http://code.google.com/p/back40computing/wiki/RadixSorting /content/cudazone/CUDABrowser/assets/images/applications/1291_SortingSmall_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1291_SortingSmall_large.jpg Research CUDA Developer 2010 05 27 05/27/2010 Duane Merrill Application Duane Merrill b8244c5b-21cf-4a7d-8eef-7b8e72792b98 Hardware-Assisted Projected Tetrahedra We present a flexible and highly efficient hardware-assisted volume renderer grounded on the original Projected Tetrahedra (PT) algorithm. Unlike recent similar approaches, our method is exclusively based on the rasterization of simple geometric primitives and takes full advantage of graphics hardware. Both vertex and geometry shaders are used to compute the tetrahedral projection, while the volume ray integral is evaluated in a fragment shader; hence, volume rendering is performed entirely on the GPU within a single pass through the pipeline. http://www.lcg.ufrj.br/Members/andream/papers/cgf2010.pdf /content/cudazone/CUDABrowser/assets/images/applications/1290_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1290_GPUComputing bgimg_large.png Academia University of Rio de Janeiro 2010 03 18 03/18/2010 A. Maximo R. Marroquim R. Farias Paper A. Maximo,R. Marroquim,R. Farias 1af9d850-4d52-4267-9787-72027ff4928c A Parallel Algorithm for Construction of Uniform Grids We present a fast, parallel GPU algorithm for construction of uniform grids for ray tracing, which we implement in CUDA. The algorithm performance does not depend on the primitive distribution, because we reduce the problem to sorting pairs of primitives and cell indices. Our implementation is able to take full advantage of the parallel architecture of the GPU, and construction speed is faster than CPU algorithms running on multiple cores. 
http://graphics.cs.uni-sb.de/fileadmin/cguds/papers/2009/kalojanov_hpg2009/kalojanov_hpg2009.pdf /content/cudazone/CUDABrowser/assets/images/applications/1289_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1289_GPUComputing bgimg_large.png Academia Saarland University 2009 06 13 06/13/2009 Javor Kalojanov Philipp Slusallek Paper Javor Kalojanov,Philipp Slusallek 17270ae2-0417-4766-88db-17f20e1e3073 Evaluation of Streaming Aggregation on Parallel Hardware Architectures We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware provides promise as accelerators for streaming aggregation. However, how streaming aggregation can map to the different parallel architectures is still an open question. http://people.cs.vt.edu/~scschnei/papers/debs2010.pdf /content/cudazone/CUDABrowser/assets/images/applications/1288_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1288_GPUComputing bgimg_large.png Research IBM Research Division 2010 07 12 07/12/2010 Scott Schneider Henrique Andrade Bugra Gedik Kun-Lung Weu Dimitrios S. Nikolopoulos Paper Scott Schneider,Henrique Andrade,Bugra Gedik 9f95c82d-a9c4-4324-be40-85e0a4d5ebd3 A Middleware for Efficient Stream Processing in CUDA This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on the CUDA-compatible graphics processing unit (GPU). Using the middleware, application developers are allowed to easily overlap kernel computation with data transfer between the main memory and the video memory. 
http://www-hagi.ist.osaka-u.ac.jp/research/papers/201005_s-nakagw_isc.pdf /content/cudazone/CUDABrowser/assets/images/applications/1287_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1287_GPUComputing bgimg_large.png Academia University of Trier 2010 03 12 03/12/2010 Shinta Nakagawa Fumihiko Ino Kenichi Hagihara Paper Shinta Nakagawa,Fumihiko Ino,Kenichi Hagihara 5fc49232-b32f-49ab-81d9-8548d9b4b730 An Adaptive Performance Modeling Tool for GPU Architectures This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. http://impact.crhc.illinois.edu/ftp/conference/sara.pdf /content/cudazone/CUDABrowser/assets/images/applications/1286_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1286_GPUComputing bgimg_large.png Academia University of Illinois at Urbana-Champaign 2009 11 19 11/19/2009 Sara S. Baghsorkhi Matthieu Delahaye Sanjay J. Patel William D. Gropp Wen-mei W. Hwu Paper Sara S. Baghsorkhi,Matthieu Delahaye,Sanjay J. Patel,bsadeghi@illinois.edu,matthieu@illinois.edu,sjp@illinois.edu d3d1ef02-d0ff-40e9-ae3c-2f380b1f45d7 Kd-Jump: a Path-Preserving Stackless Traversal for Faster Isosurface Raytracing on GPUs Stackless traversal techniques are often used to circumvent memory bottlenecks by avoiding a stack and replacing return traversal with extra computation. This paper addresses whether the stackless traversal approaches are useful on newer hardware and technology (such as CUDA). 
To this end, we present a novel stackless approach for implicit kd-trees, which exploits the benefits of index-based node traversal, without incurring extra node visitation. This approach, which we term Kd-Jump, enables the traversal to immediately return to the next valid node, like a stack, without incurring extra node visitation (kd-restart). http://vplab.snu.ac.kr/lectures/09-2/graphics/lecture_notes/11%20Kd-jump.pdf /content/cudazone/CUDABrowser/assets/images/applications/1285_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1285_GPUComputing bgimg_large.png Academia Bangor University 2009 07 27 07/27/2009 David m. Hughes Ik Soo Lim Paper David m. Hughes,Ik Soo Lim,meirion@bangor.ac.uk,i.s.lim@bangor.ac.uk b82e77d4-8aab-4e0c-828d-d9c87e198557 Accelerating Flow Cytometry Data Clustering Workflows with Graphics Processing Units Flow cytometry is a mainstay technology used by biologists and immunologists for counting, sorting, and analyzing cells suspended in a fluid. The results of flow cytometry are used in a variety of important clinical and research applications such as phenotyping, DNA analysis, and cell function analysis. Like many modern scientific applications, flow cytometry produces massive amounts of data which must be clustered in order to be useful. Conventional analysis of flow cytometry data uses manual sequential bivariate gating. http://cyberaide.googlecode.com/svn/trunk/papers/thesis-pangborn/proposal/pangborn-proposal.pdf /content/cudazone/CUDABrowser/assets/images/applications/1284_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1284_GPUComputing bgimg_large.png Academia Rochester Institute of Technology 2009 09 01 09/01/2009 Andrew D. Pangborn Paper Andrew D. 
Pangborn b3217a94-0e15-431b-bdba-c2b96f030be4 GPU Accelerated Scientific Computing: Fluid and Particulate Flows with CUDA Simulations of particulate flows, which involve gases and liquids with suspended solid particles like dust, are generally highly CPU-time demanding. The question arises whether such computations can be performed on the GPU applying highly parallel programming models like CUDA. In this paper we demonstrate that numerical simulation in that context can greatly benefit from these emerging technologies and present results in a 2D and 3D setup. http://numhpc.math.kit.edu/download/PARS_Full_Paper_Final_Heuveline_Hahn_Rocker.pdf /content/cudazone/CUDABrowser/assets/images/applications/1283_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1283_GPUComputing bgimg_large.png Academia University of Karlsruhe 2009 10 14 10/14/2009 Tobias Hahn Vincent Heuveline Bjorn Rocker Paper Tobias Hahn,Vincent Heuveline,Bjorn Rocker,tobias.hahn@kit.edu,vincent.heuveline@kit.edu,bjoern.rocker@kit.edu 3587097f-b801-4185-ad0c-f4d4c78480c9 General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Workloads XMT is a general-purpose many-core parallel architecture. The foremost design objective for XMT was to meet the highest standards for ease of parallel programming. GPUs, on the other hand, have acquired a strong reputation for performance, sometimes at the expense of ease of programming. The current paper presents a performance comparison on diverse workloads between XMT and an NVIDIA CUDA-enabled GPU. Configured with roughly the same amount of chip resources as the GPU, XMT achieves an average speedup of 6.05x on irregular applications, while incurring an average slowdown of 2.07x on regular ones.
http://www.umiacs.umd.edu/users/vishkin/XMT/CKTV_hotpar10.pdf /content/cudazone/CUDABrowser/assets/images/applications/1282_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1282_GPUComputing bgimg_large.png Academia University of Maryland, College Park 2010 04 27 04/27/2010 George C. Caragea Fuat Keceli Alexandros Tzannes Uzi Vishkin Paper George C. Caragea,Fuat Keceli,Alexandros Tzannes,gcaragea@umd.edu,keceli@umd.edu,tzannes@umd.edu ace3d995-2e1a-4d64-a5af-450ffcfee3fd Fast Minimum Spanning Tree for Large Graphs on the GPU Graphics Processor Units are used for many general-purpose processing tasks due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Boruvka's approach for undirected graphs. http://www.gpucomputing.net/?q=node/1612 /content/cudazone/CUDABrowser/assets/images/applications/1281_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1281_GPUComputing bgimg_large.png Academia International Institute of Information Technology 2009 06 07 06/07/2009 50 Vibhav Vineet Pawan Harish Suryakant Patidar P. J. Narayanan Paper Vibhav Vineet,Pawan Harish,Suryakant Patidar,vibhavvinet@research.iiit.ac.in,harishpk@research.iiit.ac.in,skp@research.iit.ac.in 81ab4f38-8401-4ac6-9e25-d5ad5edceb53 CUDA-based Triangulations of Convolution Molecular Surfaces Computing molecular surfaces is important to measure areas and volumes of molecules, as well as to infer useful information about interactions with other molecules. Over the years many algorithms have been developed to triangulate and to render molecular surfaces.
However, triangulation algorithms usually are very expensive in terms of memory storage and time performance, and thus far from real-time performance. Fortunately, the massive computational power of the new generation of low-cost GPUs opens up an opportunity window to solve these problems: real-time performance and cheap computing commodities. http://salsahpc.indiana.edu/ECMLS2010/papers/066.pdf /content/cudazone/CUDABrowser/assets/images/applications/1280_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1280_GPUComputing bgimg_large.png Academia Universidade da Beira Interior 2010 06 20 06/20/2010 Sergio Dias Kuldeep Bora Abel Gomes Paper Sergio Dias,Kuldeep Bora,Abel Gomes,sdias@ubi.pt,kuldeep@iitg.ernet.in,agomes@di.ubi.pt a5df523d-c91c-4105-b8ee-fb0979362cc3 Data Parallel Three-Dimensional Cahn-Hilliard Field Equation Simulation on GPUs with CUDA Computational scientific simulations have long used parallel computers to increase their performance. Recently graphics cards have been utilised to provide this functionality. GPGPU APIs such as NVIDIA's CUDA can be used to harness the power of GPUs for purposes other than computer graphics. GPUs are designed for processing two-dimensional data. In previous work we have presented several two-dimensional Cahn-Hilliard simulations that each utilise different CUDA memory types and compared their results. http://www.massey.ac.nz/~kahawick/cstn/073/cstn-073.pdf /content/cudazone/CUDABrowser/assets/images/applications/1279_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1279_GPUComputing bgimg_large.png Academia Massey University 2009 02 01 02/01/2009 D. P. Playne K. A. Hawick Paper D. P. Playne,K. A. Hawick,d.p.playne@massey.ac.nz,k.a.hawick@massey.ac.nz 0b5f7cf7-9d2e-4318-b4c5-9c9a1dba1143 FAST VISUAL HULL AND STEREO MATCHING ON CUDA Stereo matching and visual hull are techniques that are often used in 3D reconstruction. 
This paper presents and evaluates implementations of these algorithms on the GPU using the CUDA architecture. Experimental results show that both visual hull and stereo matching have much to gain in terms of speed from the data parallel execution model. /content/cudazone/CUDABrowser/assets/images/applications/1278_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1278_GPUComputing bgimg_large.png Academia University of Surrey 2010 02 11 02/11/2010 Mykyta Fastovets Jean-Yves Guillemaut Adrian Hilton Paper Mykyta Fastovets,Jean-Yves Guillemaut,Adrian Hilton,mykyta.fastovets@surrey.ac.uk,j.guillemaut@surrey.ac.uk,a.hiltong@surrey.ac.uk 17140b63-a39b-4cdf-bd86-9a54079055b8 Speed records for NTRU In this paper NTRUEncrypt is implemented for the first time on a GPU using the CUDA platform. As is shown, this operation lends itself excellently to parallelization and performs extremely well compared to similar security levels for ECC and RSA, giving speedups of around three to four orders of magnitude. The focus is on achieving a high throughput, in this case performing a large number of encryptions/decryptions in parallel. http://www.gpucomputing.net/?q=node/1573 /content/cudazone/CUDABrowser/assets/images/applications/1277_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1277_GPUComputing bgimg_large.png Academia University of Leuven 2009 09 10 09/10/2009 Jens Hermans Frederik Vercauteren Bart Preneel Paper Jens Hermans,Frederik Vercauteren,Bart Preneel 042c7881-8db7-45d5-8824-8c7f99c6ce91 Implementing a GPU Programming Model on a non-GPU Accelerator Architecture Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites.
While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. http://hal.archives-ouvertes.fr/docs/00/49/39/05/PDF/A4MMC-kofsky.pdf /content/cudazone/CUDABrowser/assets/images/applications/1275_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1275_GPUComputing bgimg_large.png Academia University of Illinois at Urbana-Champaign 2010 06 21 06/21/2010 Stephen M. Kofsky Daniel R. Johnson John A. Stratton Wen-mei W. Hwu Sanjay J. Patel Steven S. Lumetta Paper Stephen M. Kofsky,Daniel R. Johnson,John A. Stratton 93887a5d-c38e-40ec-bda5-b5bb59aae6d7 Parallelising Wavefront Applications on General-Purpose GPU Devices Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres such as LANL in the United States and AWE in the United Kingdom. This paper investigates the viability of utilising graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). Wavefront applications differ from the massively data-parallel codes typically selected for execution on GPUs in that their computation must obey a strict data dependency, limiting the achievable level of parallelism. http://www2.warwick.ac.uk/fac/sci/dcs/research/pcav/publications/pubs/ukpew-gpu-wavefronts.pdf /content/cudazone/CUDABrowser/assets/images/applications/1274_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1274_GPUComputing bgimg_large.png Academia University of Warwick 2010 06 01 06/01/2010 S. J. Pennycook G. R. Mudalige S. D. Hammond S. A. Jarvis Paper S. J. Pennycook,G. R. Mudalige,S. D. 
Hammond,sjp@dcs.warwick.ac.uk,g.r.mudalige@dcs.warwick.ac.uk,sdh@dcs.warwick.ac.uk f6b4d7a3-c2f6-456c-80f9-1a6666c19c99 Performance Cost Analysis of Software-Implemented Hardware Fault Tolerance Methods in General-Purpose GPU Computing Commercial off-the-shelf graphics processing units (GPUs) provide an attractive, inexpensive platform for high-throughput scientific applications. Whereas fault tolerance may be desirable for many scientific applications, off-the-shelf GPU hardware has been designed for commodity graphics applications, where fault tolerance is not necessary. http://homepages.cae.wisc.edu/~ece753/papers/Paper_4.pdf /content/cudazone/CUDABrowser/assets/images/applications/1273_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1273_GPUComputing bgimg_large.png Academia University of Wisconsin, Madison 2009 04 26 04/26/2009 Anthony E. Gregerson Ameya V. Abhyankar Paper Anthony E. Gregerson,Ameya V. Abhyankar,agregerson@wisc.edu,aabhyankar@wisc.edu a90fc8ef-5fba-447e-b091-fd5f9d5b1e50 GPU Accelerated Stylistic Augmented Reality With the introduction of the programmable graphics pipeline, the highly parallel processing power of graphical processing units (GPUs) is being used not only for special graphics effects but also for general purpose computation in areas such as molecular dynamics simulation, stock options pricing, and image processing. In this work, we utilize this power to increase the immersion level in an augmented reality (AR) application.
http://www.vmasc.odu.edu/downloads/Capstone_Papers/Engineering/Aras.pdf /content/cudazone/CUDABrowser/assets/images/applications/1272_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1272_GPUComputing bgimg_large.png Academia Old Dominion University 2010 04 02 04/02/2010 Rifat Aras Yuzhong Shen Paper Rifat Aras,Yuzhong Shen 33549c56-faa7-4ed2-8762-b7176670e639 A Batched GPU Algorithm for Set Intersection Intersection of inverted lists is a frequently used operation in search engine systems. Efficient CPU and GPU intersection algorithms for large problem sizes are well studied. We propose an efficient GPU algorithm for high performance intersection of inverted index lists on the CUDA platform. This algorithm feeds queries to the GPU in batches and thus can take full advantage of GPU processor cores even if the problem size is small. We also propose an input preprocessing method which alleviates load imbalance effectively. http://nbjl.nankai.edu.cn/Lab_Papers/2009/A%20Batched%20GPU%20Algorithm%20for%20Set%20Intersection.pdf /content/cudazone/CUDABrowser/assets/images/applications/1271_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1271_GPUComputing bgimg_large.png Academia Nankai University 2009 09 19 09/19/2009 Di Wu Fan Zhang Naiyong Ao Fang Wang Xiaoguang Liu Gang Wang Paper Di Wu,Fan Zhang,Naiyong Ao,wakensky@gmail.com,zhangfan555@gmail.com,aonaiyong@163.com c7c4497b-7a3a-4847-8010-748fc72fdd19 GPU-based ultrafast IMRT plan optimization The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD).
As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs). http://iopscience.iop.org/0031-9155/54/21/008 /content/cudazone/CUDABrowser/assets/images/applications/1270_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1270_GPUComputing bgimg_large.png Academia University of California, San Diego 2009 10 14 10/14/2009 40 Chunhua Men Xuejun Gu Dongju Choi Amitava Majumdar Ziyi Zheng Klaus Mueller Steve B. Jiang Paper Chunhua Men,Xuejun Gu,Dongju Choi 4e362946-a2fd-4f2b-8740-0009b7348bd6 Real-time Forest Simulation for a Flight Simulator using a GPU This paper concerns the real-time simulation of forests for a flight simulator, exploiting the capacities of recent graphics cards. As we will show, these architectures coupled with recent ergonomic environments like CUDA allow C-programmers to implement highly parallelizable algorithms to be executed on the GPU, without being specialized in parallel programming. http://www.ecam-rennes.fr/IMG/pdf/ICCTA2008.pdf /content/cudazone/CUDABrowser/assets/images/applications/1268_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1268_GPUComputing bgimg_large.png Academia Louis de Broglie, Graduate Engineering School 2008 02 19 02/19/2008 Jean-Marc Laferte Guillaume Daussin Pascal Haigron Jihed Flifla Paper Jean-Marc Laferte,Guillaume Daussin,Jihed Flifla,laferte@ecole-debroglie.fr,g.daussin@ecole-debroglie.fr,flifla@ecole-debroglie.fr e490331e-ba50-4ff8-ad01-f2af8b63cada cuInspiral: prototype gravitational waves detection pipeline fully coded on GPU using CUDA In this paper we report the prototype of the first coalescing binary detection pipeline fully implemented on NVIDIA GPU hardware accelerators. The code has been embedded in a GPU library, called cuInspiral, and has been developed under the CUDA framework.
The library contains, for example, a PN gravitational wave signal generator, matched filtering/FFT and detection algorithms that have been profiled and compared with the corresponding CPU code using a dedicated benchmark, in order to provide the gain factor with respect to the standard CPU implementation. http://arxiv.org/PS_cache/arxiv/pdf/1006/1006.4644v1.pdf /content/cudazone/CUDABrowser/assets/images/applications/1267_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1267_GPUComputing bgimg_large.png Research National Institute of Nuclear Physics 2010 06 16 06/16/2010 Leone B. Bosi Paper Leone B. Bosi b0a54882-2b4a-4058-a6e7-b829a2a04a53 GPU Accelerated Path-planning for Multi-agents in Virtual Environments Many games are populated by synthetic humanoid actors that act as autonomous agents. The animation of humanoids in real-time applications is still a challenge if the problem involves attaining a precise location in a virtual world (path-planning), and moving realistically according to its own personality, intentions and mood (motion planning). In this paper we present a strategy to implement - using CUDA on GPU - a path planner that produces natural steering behaviors for virtual humans using a numerical solution for boundary value problems. http://www.sbgames.org/papers/sbgames09/computing/full/cp15_09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1266_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1266_GPUComputing bgimg_large.png Academia Federal University of Rio Grande do Sul 2009 10 08 10/08/2009 56 Leonardo G. Fischer Renato Silveira Luciana Nedel Paper Leonardo G. Fischer,Renato Silveira,Luciana Nedel cb2fd4bb-d313-4c57-aae7-547e4b78dc27 Real-time image segmentation on a GPU Efficient segmentation of color images is important for many applications in computer vision.
Non-parametric solutions are required in situations where little or no prior knowledge about the data is available. In this paper, we present a novel parallel image segmentation algorithm which segments images in real-time in a non-parametric way. The algorithm finds the equilibrium states of a Potts model in the superparamagnetic phase of the system. http://upcommons.upc.edu/e-prints/bitstream/2117/7866/1/1104-Real-time-image-segmentation-on-a-GPU.pdf /content/cudazone/CUDABrowser/assets/images/applications/1264_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1264_GPUComputing bgimg_large.png Academia Georg-August University 2010 06 28 06/28/2010 Alexey Abramov Tomas Kulvicius Florentin Worgotter Babette Dellen Paper Alexey Abramov,Tomas Kulvicius,Florentin Worgotter,abramov@bccn-goettingen.de,tomas@bccn-goettingen.de,worgottg@bccn-goettingen.de 7f4e5772-7ed9-40a1-8da0-7694fec71c3c Development of a GPU-Based Monte Carlo Dose Calculation Code for Coupled Electron-Photon Transport Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy. Its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development on a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. We have implemented the Dose Planning Method (DPM) Monte Carlo dose calculation package (Sempau et al, Phys. Med. Biol., 45(2000)2263-2291) on GPU architecture under CUDA platform. http://arxiv.org/ftp/arxiv/papers/0910/0910.0329.pdf /content/cudazone/CUDABrowser/assets/images/applications/1263_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1263_GPUComputing bgimg_large.png Academia University of California, San Diego 2010 03 22 03/22/2010 Xun Jia Xuejun Gu Josep Sempau Dongju Choi Amitava Majumdar Steve B.
Jiang Paper Xun Jia,Xuejun Gu,Josep Sempau,Dongju Choi,Amitava Majumdar,Steve B. Jiang e41a3774-d96e-444e-a928-811d5f31b161 Performance Characterization of a GPU as a Ubiquitous Accelerator in Commodity Multiprocessor Systems Graphic processing units (GPUs) are increasingly being employed as commodity data-parallel co-processors in desktop and laptop systems due to their tremendous computational power as well as high memory bandwidth. A number of research efforts are focusing on the development of methodologies for efficient utilization of GPU hardware as a ubiquitous accelerator for CPU and memory intensive tasks to off-load the main processor(s). In order to effectively off-load parts of computation, developers need to have a clear understanding of performance trade-offs of using a GPU as an accelerator for the host processor. http://www.kics.edu.pk/hpcnl/images/hpcnl_kics_tr_03.pdf /content/cudazone/CUDABrowser/assets/images/applications/1262_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1262_GPUComputing bgimg_large.png Academia Al-Khawarizmi Institute of Computer Science 2010 06 01 06/01/2010 Ghulam Mustafa Abdul Waheed Waqar Mahmood Paper Ghulam Mustafa,Abdul Waheed,Waqar Mahmood,ghulam.mustafa@kics.edu.pk,awaheed@kics.edu.pk,director@kics.edu.pk 38d916ba-ce82-48eb-9159-6f33a4396526 Implementation of Stereophonic Acoustic Echo Canceller on nVIDIA GeForce Graphics Processing Unit This paper presents an implementation of a stereophonic acoustic echo canceller on an NVIDIA GeForce graphics processor with the CUDA software development environment. For efficiency, fast shared memory has been used as much as possible. A tree adder is introduced to reduce the cost of summing up thread outputs. The performance evaluation results suggest that even a low-cost GPU with a small number of shader processors greatly helps echo cancellation for low-cost PC-based teleconferencing.
/content/cudazone/CUDABrowser/assets/images/applications/1261_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1261_GPUComputing bgimg_large.png Academia Kanazawa University 2009 12 07 12/07/2009 Akihiro Hirano Kenji Nakayama Paper Akihiro Hirano,Kenji Nakayama,hirano@t.kanazawa-u.ac.jp,nakayama@t.kanazawa-u.ac.jp 74329815-f8c3-4f35-92de-134ac83e4ada Implementing Closed-Form Expressions on FPGAs Using the NAL, with Comparison to CUDA GPU and Cell BE Implementations This paper outlines the Nallatech Accelerator Layer (NAL) and its relationship to Intel's Accelerator Abstraction Layer. The NAL is looked at in its academic context. Hardware platforms that support the NAL are discussed: the Nallatech H101, the Intel FSB-FPGA Module and the BenOne PCIe. The Intel QuickAssist Technology initiative and its associated Accelerator Abstraction Layer (AAL) are introduced. http://www.rssi2008.org/proceedings/papers/posters/07_Bruce.pdf /content/cudazone/CUDABrowser/assets/images/applications/1260_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1260_GPUComputing bgimg_large.png Research Nallatech Ltd 2008 06 17 06/17/2008 Robin Bruce Javier Setoain Richard Chamberlain Malachy Devlin Rosa M. Badia Paper Robin Bruce,Javier Setoain,Richard Chamberlain 64be2be4-6831-45f3-aa71-7ed3906590cc MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture With the advent of high-performance COTS clusters, there is a need for a simple, scalable and fault-tolerant parallel programming and execution paradigm. In this paper, we show that the popular MapReduce programming model can be utilized to solve many interesting scientific simulation problems with much higher performance than regular cluster computers by leveraging GPGPU accelerators in cluster nodes. 
We use the Massive Unordered Distributed (MUD) formalism and establish a one-to-one correspondence between it and general Monte Carlo simulation methods. http://verma7.com/wp/wp-content/uploads/2009/09/CS597_Spring09_MITHRA_Technical_Report.pdf /content/cudazone/CUDABrowser/assets/images/applications/1259_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1259_GPUComputing bgimg_large.png Academia University of Illinois at Urbana-Champaign 2009 08 25 08/25/2009 Reza Farivar Abhishek Verma Ellick M. Chan Roy H. Campbell Paper Reza Farivar,Abhishek Verma,Ellick M. Chan,rhc@illinois.edu,Roy H. Campbell,farivar2@illinois.edu,verma7@illinois.edu,emchan@illinois.edu 82928644-d96f-4ba8-9c46-0e6bf1e1f95e Simulation of Reaction-Diffusion Processes in Three Dimensions using CUDA Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, any ideas and strategies aiming at the reduction of CPU time are important topics of research. A general and robust idea is the parallelization of source codes/programs. Recently, the technological development of graphics hardware created a possibility to use desktop video cards to solve numerically intensive problems. http://arxiv.org/ftp/arxiv/papers/1004/1004.0480.pdf /content/cudazone/CUDABrowser/assets/images/applications/1257_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1257_GPUComputing bgimg_large.png Academia Eotvos Lorand University 2010 04 03 04/03/2010 Ferenc Molnar Jr Ferenc Izsak Robert Meszaros Paper Ferenc Molnar Jr,Ferenc Izsak,Robert Meszaros 171b372c-fb8e-4fc3-9a02-a3db231424a7 CUDASW++2.0: enhanced Smith-Waterman Protein Database Search on CUDA-Enabled GPUs Based on SIMT and Virtualized SIMD Abstractions Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. 
Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to develop techniques in high performance architectures, especially the recently emerging many-core architectures and their associated programming models. http://www.biomedcentral.com/content/pdf/1756-0500-3-93.pdf /content/cudazone/CUDABrowser/assets/images/applications/1256_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1256_GPUComputing bgimg_large.png Academia Nanyang Technological University, Singapore 2010 04 14 04/14/2010 Yongchao Liu Bertil Schmidt Douglas L Maskell Paper Yongchao Liu,Bertil Schmidt,Douglas L Maskell,liu0039@ntu.edu.sg 2733733f-bf02-44f0-92ce-49eed1ab150c Design and Implementation of the Smith-Waterman Algorithm on the CUDA-Compatible GPU This paper describes a design and implementation of the Smith-Waterman algorithm accelerated on the graphics processing unit (GPU). Our method is implemented using compute unified device architecture (CUDA), which is available on the nVIDIA GPU. The method efficiently uses on-chip shared memory to reduce the data amount being transferred between off-chip memory and processing elements in the GPU. Furthermore, it reduces the number of data fetches by applying a data reuse technique to query and database sequences. 
http://www-hagi.ist.osaka-u.ac.jp/research/papers/200810_y-munekw_bibe.pdf /content/cudazone/CUDABrowser/assets/images/applications/1255_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1255_GPUComputing bgimg_large.png Academia Osaka University 2008 08 09 08/09/2008 Yuma Munekawa Fumihiko Ino Kenichi Hagihara Paper Yuma Munekawa,Fumihiko Ino,Kenichi Hagihara,y-munekw@ist.osaka-u.ac.jp,ino@ist.osaka-u.ac.jp 5f961830-f9db-421a-b7b6-cd749907f46e Tapping the Supercomputer Under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors This paper shows how to build algorithms that use graphics processing units (GPUs) installed in most modern computers to solve dynamic equilibrium models in economics. In particular, we rely on the compute unified device architecture (CUDA) of NVIDIA GPUs. We illustrate the power of the approach by solving a simple real business cycle model with value function iteration. We document improvements in speed of around 200 times and suggest that even further gains are likely. /content/cudazone/CUDABrowser/assets/images/applications/1254_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1254_GPUComputing bgimg_large.png Academia Duke University 2010 04 10 04/10/2010 Eric M. Aldrich Jesus Fernandez-Villaverde A. Ronald Gallant Juan F. Rubio-Ramirez Paper Eric M. Aldrich,Jesus Fernandez-Villaverde,A. Ronald Gallant,ealdrich@gmail.com,jesusfv@econ.upenn.edu,aronaldg@gmail.com,Juan F. Rubio-Ramirez,jfr23@duke.edu c1f6075d-5f89-44d5-938b-f527b8a56825 Faster Matrix-Vector Multiplication on GeForce 8800GTX Recently, GPUs have acquired the programmability to perform general purpose computation fast by running tens of thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture.
The experimental results on GeForce 8800GTX show that the proposed algorithm runs up to 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BLAS library CUBLAS 1.1 (resp., Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices with order 16 to 12800. http://ch.nvidia.com/docs/IO/47905/fujimoto_lspp2008.pdf /content/cudazone/CUDABrowser/assets/images/applications/1253_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1253_GPUComputing bgimg_large.png Academia Osaka University 2008 01 29 01/29/2008 15 Noriyuki Fujimoto Paper Noriyuki Fujimoto,fujimoto@ist.osaka-u.ac.jp 3939b0a0-52a9-48e0-8b31-6af60eae6ce6 Stackless KD-Tree Traversal for High Performance GPU Ray Tracing Significant advances have been achieved for realtime ray tracing recently, but realtime performance for complex scenes still requires large computational resources not yet available from the CPUs in standard PCs. Incidentally, most of these PCs also contain modern GPUs that do offer much larger raw compute power. However, limitations in the programming and memory model have so far kept the performance of GPU ray tracers well below that of their CPU counterparts. http://www.gpucomputing.net/?q=node/1293 /content/cudazone/CUDABrowser/assets/images/applications/1252_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1252_GPUComputing bgimg_large.png Academia Saarland University and MPI Informatik 2007 06 11 06/11/2007 Stefan Popov Johannes Gunther Hans-Peter Seidel Philipp Slusallek Paper Stefan Popov,Johannes Gunther,Hans-Peter Seidel c747f985-752f-4f32-be50-cf24898c527b Fast Parallel GPU-Sorting Using a Hybrid Algorithm This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm.
Initially, a parallel bucketsort splits the list into enough sublists, which are then sorted in parallel using merge-sort. The parallel bucketsort, implemented in NVIDIA's CUDA, utilizes the synchronization mechanisms, such as atomic increment, that are available on modern GPUs. http://www.gpucomputing.net/?q=node/1291 /content/cudazone/CUDABrowser/assets/images/applications/1251_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1251_GPUComputing bgimg_large.png Academia Chalmers University of Technology 2007 09 25 09/25/2007 Erik Sintorn Ulf Assarsson Paper Erik Sintorn,Ulf Assarsson,erik.sintorn@chalmers.se,uffe@chalmers.se 01c04032-f48d-40e7-ba6f-105e9e541977 Testing the Feasibility of Running a Computationally Intensive Real-Time Traffic Simulation on a Multicore Programmable Graphics Processor In the 1960s, a semiconductor scientist named Gordon Moore theorized that the number of transistors would double each year on a single integrated circuit. Through much effort, the semiconductor industry has been able to closely follow "Moore's Law", but new information shows this type of progress is not sustainable in the coming years. This realization has implications in both chip fabrication and software development. Instead of making chips with more transistors per unit area, industry now produces newer multicore chips.
http://www.gpucomputing.net/?q=node/603 /content/cudazone/CUDABrowser/assets/images/applications/1250_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1250_GPUComputing bgimg_large.png Academia University of Virginia 2007 04 04 04/04/2007 Kevin Stammetti Paper Kevin Stammetti 18ea0385-7666-484e-abed-98ef81d2697a A Flexible High-Performance Lattice Boltzmann GPU Code for the Simulations of Fluid Flows in Complex Geometries We describe the porting of the Lattice Boltzmann component of MUPHY, a multi-physics/scale simulation software, to multiple graphics processing units using the Compute Unified Device Architecture. The novelty of this work is the development of ad hoc techniques for optimizing the indirect addressing that MUPHY uses for efficient simulations of irregular domains. http://www.iac.rm.cnr.it/~massimo/Papers/LBEonGPU.pdf /content/cudazone/CUDABrowser/assets/images/applications/1247_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1247_GPUComputing bgimg_large.png Academia 1 Istituto Applicazioni Calcolo, 2 NVIDIA, 3 SOFT, Istituto Nazionale Fisica della Materia, 4Harvard University School of Eng and Applied Sciences, 5 Harvard University Initiative in Innovative Computing 2009 05 11 05/11/2009 Massimo Bernaschi1 Massimiliano Fatica2 Simone Melchionna3,4 Sauro Succi1,5 Efthimios Kaxiras4 Paper Massimo Bernaschi,Massimiliano Fatica,Simone Melchionna,Sauro Succi1,Efthimios Kaxiras 74ce7180-6c27-4cf9-906a-507feae7f418 GPU Clusters for High-Performance Computing Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. 
We touch upon such issues as balanced cluster architecture, resource sharing in a cluster environment, programming models, and applications for GPU clusters. http://www.ncsa.illinois.edu/~gshi/ppac09_paper.pdf /content/cudazone/CUDABrowser/assets/images/applications/1246_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1246_GPUComputing bgimg_large.png Academia University of Illinois at Urbana-Champaign 2009 08 01 08/01/2009 Volodymyr V. Kindratenko Jeremy J. Enos Guochun Shi Michael T. Showerman Galen W. Arnold John E. Stone James C. Phillips Wen-mei Hwu Paper Volodymyr V. Kindratenko,Jeremy J. Enos,Guochun Shi,kindr@ncsa.uiuc.edu,jenos@ncsa.uiuc.edu,gshi@ncsa.uiuc.edu ed2b492f-10c9-4c78-9acc-215d2de0900e Towards User Transparent Parallel Multimedia The research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia archives and data streams. To satisfy the increasing computational demands of MMCA problems, the use of High Performance Computing (HPC) techniques is essential. As most MMCA researchers are not HPC experts, there is an urgent need for 'familiar' programming models and tools that are both easy to use and efficient. http://hal.inria.fr/docs/00/49/38/83/PDF/A4MMC-werkhoven.pdf /content/cudazone/CUDABrowser/assets/images/applications/1244_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1244_GPUComputing bgimg_large.png Academia VU University 2010 06 21 06/21/2010 Ben van Werkhoven Jason Maassen Frank J. Seinstra Paper Ben van Werkhoven,Jason Maassen,Frank J. Seinstra,bjvwerkh@few.vu.nl,jason@few.vu.nl,fjseins@vew.vu.nl a4825b5f-3099-4dcc-8335-ba0f18a6a351 Axel: A Heterogeneous Cluster with FPGAs and GPUs This paper describes a heterogeneous computer cluster called Axel. 
Axel contains a collection of nodes; each node can include multiple types of accelerators such as FPGAs (Field Programmable Gate Arrays) and GPUs (Graphics Processing Units). A Map-Reduce framework for the Axel cluster is presented which exploits spatial and temporal locality through different types of processing elements and communication channels. The Axel system enables the first demonstration of FPGAs, GPUs and CPUs running collaboratively for N-body simulation. http://portal.acm.org/citation.cfm?id=1723112.1723134#abstract /content/cudazone/CUDABrowser/assets/images/applications/1243_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1243_logo_acm_portal2_large.jpg Academia Imperial College London 2010 02 01 02/01/2010 23 Kuen Hung Tsoi Wayne Luk Paper Kuen Hung Tsoi,Wayne Luk f6a0332b-f8ec-4d4d-9551-0f7fc57ad735 GPU-based Hierarchical Computations for View Independent Visibility With rapid improvements in the performance and programmability, Graphics Processing Units (GPUs) have fostered considerable interest in substantially reducing the running time of compute intensive problems. The solution to the view-independent mutual point-pair visibility problem (required for inter-reflections in global illumination) can, it would seem, require the capabilities of the GPUs. 
http://dspace.library.iitb.ac.in/jspui/bitstream/10054/1708/1/4756034.pdf /content/cudazone/CUDABrowser/assets/images/applications/1242_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1242_GPUComputing bgimg_large.png Academia Indian Institute of Technology Bombay www.cse.iitb.ac.in 2008 12 16 12/16/2008 Rhushabh Goradia Prekshu Ajmera Sharat Chandran Paper Rhushabh Goradia,Prekshu Ajmera,Sharat Chandran e3c99de6-a223-4437-af51-01bd3faf3026 Fast, GPU-based Diffuse Global Illumination For Point Models Photorealistic computer graphics attempts to match as closely as possible the rendering of a virtual scene with an actual photograph of the scene had it existed in the real world. Of the several techniques that are used to achieve this goal, physically-based approaches (i.e. those that attempt to simulate the actual physical process of illumination) provide the most striking results. http://www.cse.iitb.ac.in/~rhushabh/aps/aps4/report.pdf /content/cudazone/CUDABrowser/assets/images/applications/1241_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1241_GPUComputing bgimg_large.png Academia Indian Institute of Technology 2008 08 26 08/26/2008 Rhushabh Goradia Paper Rhushabh Goradia 5c93da6a-193d-4595-998d-07313764eef5 Fast GPU-based Adaptive Tessellation with CUDA Compact surface descriptions like higher-order surfaces are popular representations for both modeling and animation. However, for fast graphics-hardware-assisted rendering, they usually need to be converted to triangle meshes. In this paper, we introduce a new framework for performing on-the-fly crack-free adaptive tessellation of surface primitives completely on the GPU. Utilizing CUDA and its flexible memory write capabilities, we parallelize the tessellation task at the level of single surface primitives. 
https://www.mpi-sb.mpg.de/~mschwarz/papers/cudatess-eg09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1240_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1240_GPUComputing bgimg_large.png University of Erlangen-Nuremberg 2009 04 01 04/01/2009 Michael Schwarz Marc Stamminger Michael Schwarz,Marc Stamminger 5eb012b6-230e-49e0-ac77-6be5c40b011a Using Graphics Devices in Reverse: GPU-based Image Processing and Computer Vision Graphics and vision are approximate inverses of each other. Ordinarily Graphics Processing Units are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we discuss the use of GPUs in approximately the reverse way to assist in "converting pictures into numbers" (i.e. computer vision). For graphical operations, GPUs currently provide many hundreds of gigaflops of processing power. http://www.uweb.ucsb.edu/~yichuwang/ecv/paper/using_graphics_device_in_reverse.pdf /content/cudazone/CUDABrowser/assets/images/applications/1239_gpucomputing_small.png /content/cudazone/CUDABrowser/assets/images/applications/1239_gpucomputing_large.png Commercial NVIDIA Corporation 2008 08 26 08/26/2008 James Fung Steve Mann Paper James Fung,Steve Mann 1bbe2f89-4dd9-4051-9bf0-bcd1abdc7a03 CUDA SURF - A real-time implementation for SURF Keypoint detection and matching is a basic computer vision task and a necessary ingredient for several applications, e.g., object recognition, structure from motion, panorama stitching. In this work we implement the popular SURF descriptor, an approximation of SIFT, on commodity graphics hardware and achieve real-time performance even for HD images. For VGA images we achieve a speed-up of about 50x on a GTX 285 and for HD images even up to 87x. 
/content/cudazone/CUDABrowser/assets/images/applications/1238_633214_match_GPU_graff_img1_img2_small_small.png /content/cudazone/CUDABrowser/assets/images/applications/1238_633214_match_GPU_graff_img1_img2_small_large.png Academia TU Darmstadt http://www.tu-darmstadt.de 2010 07 13 07/13/2010 87 Open source Andre Schulz Florian Jung Sebastian Hartte Daniel Trick Christian Wojek Konrad Schindler Jens Ackermann Michael Goesele Application Code Imaging Video & Audio Keypoint detection, Keypoint description, SURF, Object detection, SfM, Image stitching,Andre Schulz,Florian Jung,Sebastian Hartte, Daniel Trick,Christian Wojek, Konrad Schindler, Jens Ackermann, Michael Goesele,wojek@cs.tu-darmstadt.de fdd8f407-1003-4323-82d5-f91b0c483a18 A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA We present an efficient CUDA implementation of a finite hexahedra multigrid solver for simulating elastic deformable models in real time. Due to the regular shape of the numerical stencil induced by the hexahedral regime, computations and data layout can be restructured to avoid execution divergence and to support memory access patterns enabling the hardware to coalesce multiple memory accesses into single memory transactions. This makes it possible to effectively exploit the GPU's parallel processing units and high memory bandwidth. Performance gains of up to a factor of 12 compared to a highly optimized CPU implementation are demonstrated. 
/content/cudazone/CUDABrowser/assets/images/applications/1237_170404_VoxelModel_small.png /content/cudazone/CUDABrowser/assets/images/applications/1237_170404_VoxelModel_large.png Academia Computer Graphics and Visualization Group, Technische Universitat Munchen, Germany http://wwwcg.in.tum.de/ 2010 07 10 07/10/2010 12 Christian Dick Joachim Georgii Rudiger Westermann Paper Computer Aided Engineering Numerics Deformable Objects, Finite Element Methods, Multigrid,Christian Dick,Joachim Georgii,Rudiger Westermann 1cbf9658-009b-4ee7-81fe-1f144a456225 Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid We introduce a real-time stereo matching technique based on a reformulation of Yoon and Kweon's adaptive support weights algorithm [1]. Our implementation uses the bilateral grid to achieve a speedup of 200x compared to a straightforward full-kernel GPU implementation, which in turn is 20x faster than the original CPU implementation, thus making it the fastest technique on the Middlebury website. Published at the European Conference on Computer Vision (ECCV) 2010. /content/cudazone/CUDABrowser/assets/images/applications/1236_167313_DCBGrid-teaser_small.png /content/cudazone/CUDABrowser/assets/images/applications/1236_167313_DCBGrid-teaser_large.png Academia University of Cambridge and Microsoft Research Cambridge 2010 06 24 06/24/2010 200 Christian Richardt Douglas Orr Ian Davies Antonio Criminisi Neil A. Dodgson Multimedia Paper Code Science Video & Audio Computer Vision Christian Richardt,Douglas Orr,Ian Davies, Antonio Criminisi, Neil A. Dodgson,christian.richardt@cl.cam.ac.uk bbaba1fd-20be-44f9-9b64-fea2cd12fce2 Realtime Tracking With a Pan-Tilt Camera The human eye is amazingly adept at tracking moving objects. The process is so natural to humans that it happens without any conscious effort. 
While this remarkable ability depends in part on the human brain's immense processing power, the fast response of the extraocular muscles and the eyeball's light weight are also vital. Even a small point and shoot camera mounted on a servo is typically too heavy and slow to move with the agility of the human eye. How, then, can we give a computer the ability to track movement quickly and responsively? http://umassgv.blogspot.com/2010/07/realtime-tracking-with-pan-tilt-camera.html /content/cudazone/CUDABrowser/assets/images/applications/1235_123697_tracking_overview_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1235_123697_tracking_overview_large.jpg Academia University of Massachusetts, Amherst http://www.cs.umass.edu 2010 07 06 07/06/2010 Open source Blake Foster Rui Wang Erik Learned-Miller Multimedia Paper Code Video & Audio tracking, camera, fpv, pan tilt, human eye, vision,Blake Foster,Rui Wang,Erik Learned-Miller,blfoster@cs.umass.edu,ruiwang@cs.umass.edu,elm@cs.umass.edu 749bb40e-ff99-4a34-9093-c8ff4b7ab08d Thrust Graph Library Thrust Graph Library provides graph containers, algorithms, and other concepts like those of the Boost Graph Library. The library is based on Thrust, a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). /content/cudazone/CUDABrowser/assets/images/applications/1234_53418_networks_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1234_53418_networks_large.jpg Research National Institute of Advanced Industrial Science and Technology (AIST) 2010 07 06 07/06/2010 Open source kazuhiro kojima Code Libraries Graph Library,kazuhiro kojima,k.kojima@aist.go.jp aafa92f6-e4db-4ff5-bbf2-093cef1271ea Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster The vortex core shed from rotorcraft blades maintains coherency---and thus dynamic relevance---many blade turns after its creation. 
This presents a challenge to traditional Eulerian computational methods, as fine grids are required to suppress numerical diffusion which would weaken the vortex cores after a small number of revolutions. Vortex methods have been used in the past to overcome these problems, as they require computational elements only in vorticity-containing regions, but suffer from greater computational cost per element. http://markjstock.org/research/AIAA-2010-4553.pdf /content/cudazone/CUDABrowser/assets/images/applications/1233_86624_4bladed_720_web2_small.png /content/cudazone/CUDABrowser/assets/images/applications/1233_86624_4bladed_720_web2_large.png Research Applied Scientific Research, Inc. http://www.applied-scientific.com/ 2010 06 29 06/29/2010 Commercial Mark J. Stock Adrin Gharakhani Multimedia Paper Computational Fluid Dynamics cfd rotor helicopter vortex fluid,Mark J. Stock,Adrin Gharakhani,mstock@applied-scientific.com 5eb5fa1a-f845-4b64-863f-48f16aae06bb A GPU-accelerated Boundary Element Method and Vortex Particle Method Vortex particle methods, when combined with multipole-accelerated boundary element methods (BEM), become a complete tool for direct numerical simulation (DNS) of internal or external vortex-dominated flows. In previous work, we presented a method to accelerate the vorticity-velocity inversion at the heart of vortex particle methods by performing a multipole treecode N-body method on parallel graphics hardware. The resulting method achieved a 17-fold speedup over a dual-core CPU implementation. http://markjstock.org/research/AIAA-2010-5099.pdf /content/cudazone/CUDABrowser/assets/images/applications/1232_266408_spheres_cl_vort_crop_small.png /content/cudazone/CUDABrowser/assets/images/applications/1232_266408_spheres_cl_vort_crop_large.png Research Applied Scientific Research, Inc. http://applied-scientific.com/ 2010 07 01 07/01/2010 43 Commercial Mark J. Stock Adrin Gharakhani Paper Computational Fluid Dynamics cfd vortex nbody bem fluid,Mark J. 
Stock,mstock@applied-scientific.com c7ec9d53-04d1-4b50-9a6a-0f92323dce34 Leukocyte Tracking: ImageJ Plugin This software is a plugin for the ImageJ image processing program. The plugin is designed to detect and track rolling leukocytes (white blood cells) through multiple frames of video. It can take advantage of a CUDA-capable GPU to dramatically accelerate video processing time; with appropriate hardware, near real-time processing can be achieved. /content/cudazone/CUDABrowser/assets/images/applications/1231_26570_leukocytes_small.png /content/cudazone/CUDABrowser/assets/images/applications/1231_26570_leukocytes_large.png Academia University of Virginia 2010 07 01 07/01/2010 26 Open source Michael Boyer David Tarjan Scott T. Acton Kevin Skadron Application Multimedia Paper Code Imaging Medical Imaging Science Video & Audio leukocyte, blood cell, tracking, video,Michael Boyer,David Tarjan,Scott T. Acton, Kevin Skadron,boyer@cs.virginia.edu d36b8ae0-3465-4b14-82f6-07712472e3ae McStas CUDA optimization project Optimize the single crystal component of McStas neutron raytracer using CUDA http://www.mcstas.org/ /content/cudazone/CUDABrowser/assets/images/applications/1229_6226_logo-left_small.png /content/cudazone/CUDABrowser/assets/images/applications/1229_6226_logo-left_large.png Academia eScience Center, University of Copenhagen http://www.escience.ku.dk 2010 01 29 01/29/2010 125 Jesper Dahlkild Martin Djurno Finn Krog Paper Code Ray Tracing Science Jesper Dahlkild,Martin Djurno,Finn Krog,jesper.dahlkild@gmail.com,djurnoe@diku.dk,fk@finnkrog.com c15f6620-26e7-4f91-b857-46380ff67782 Raytracing in participating media This work presents a CUDA-accelerated algorithm for visualization of photorealistic lighting effects which is based on Henrik Wann Jensen's method for global illumination in scenes with participating media. 
/content/cudazone/CUDABrowser/assets/images/applications/1228_58732_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1228_58732_logo_large.png Academia Wroclaw University of Technology 2010 07 02 07/02/2010 10 Open source Piotr Orzechowski Application Multimedia Paper Code Graphics Ray Tracing raytracing, participating media, photon mapping,Piotr Orzechowski,piotr.orzechowski@gmail.com d35e353a-09c4-47df-ba61-10984823be50 Multi-domain, Higher Order Level Set Scheme for 3D Image Segmentation on the GPU A streaming level set PDE solver to handle large volumes (sizes larger than the available GPU memory). A higher order and multi-phase solver for smooth segmentation of the volume. /content/cudazone/CUDABrowser/assets/images/applications/1227_545525_Slide2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1227_545525_Slide2_large.jpg Academia The Technical University of Denmark / The University of Texas at Austin 2010 06 16 06/16/2010 10 Open source Ojaswa Sharma Qin Zhang Francois Anton Chandrajit Bajaj Paper Ojaswa Sharma, Qin Zhang,Francois Anton, and Chandrajit Bajaj ,os@imm.dtu.dk,zqyork@ices.utexas.edu,fa@imm.dtu.dk, bajaj@cs.utexas.edu 082dbdd8-2eb2-4511-9072-d5eff768e420 A Simple Pseudo-Random Number Generator Implementation of uniformly and normally distributed pseudo random number generators as device functions. 
http://people.virginia.edu/~mjt5v/pf/RNG/ /content/cudazone/CUDABrowser/assets/images/applications/1226_144550627858cb6d44ceb02ba9434317_small.png /content/cudazone/CUDABrowser/assets/images/applications/1226_144550627858cb6d44ceb02ba9434317_large.png Academia University of Virginia 2010 06 08 06/08/2010 Michael Trotter Matt Goodrum Application Programming Tools Michael Trotter,Matt Goodrum,mjt5v@virginia.edu,mag6x@virginia.edu 9e6bae6c-dc38-44dd-863a-441c9440bf9e High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing very optimized implementation in C language and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x. 
/content/cudazone/CUDABrowser/assets/images/applications/1225_20831_seismic_paper_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1225_20831_seismic_paper_large.jpg Academia Universite de Pau (France), Florida State University (US), TU Dortmund (Germany) 2010 06 15 06/15/2010 20 Dimitri Komatitsch Gordon Erlebacher Dominik Goddeke David Michea Paper Numerics Oil & Gas Clusters Dimitri Komatitsch,Gordon Erlebacher,Dominik Goddeke,David Michea,dimitri.komatitsch@univ-pau.fr 5f003419-e86a-4d8b-9d19-997acd44898b To GPU Synchronize or Not GPU Synchronize? The graphics processing unit (GPU) has evolved from being a fixed function processor with programmable stages into a programmable processor with many fixed function components that deliver massive parallelism. By modifying the GPU's stream processor to support general-purpose computation on the GPU (GPGPU), applications that perform massive vector operations can realize many orders of magnitude improvement in performance over a traditional processor, i.e., CPU. https://research.cs.vt.edu/synergy/pubs/papers/feng-iscas2010-gpusync.pdf /content/cudazone/CUDABrowser/assets/images/applications/1224_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1224_GPUComputing bgimg_large.png Academia Virginia Tech 2010 03 28 03/28/2010 Wu-chun Feng Shucai Xiao Paper Wu-chun Feng,Shucai Xiao,wfeng@vt.edu,shucaig@vt.edu aed6944b-13eb-478c-874a-751a332dad9d FATSEA An Architectural Simulator for General Purpose Computing on GPUs We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language aimed for GPGPU computing. FATSEA takes a Parallel Thread eXecution (PTX) code as input, which is a device independent code format generated by the Nvidia CUDA compiler, to validate results and estimate performance on Nvidia platforms. 
http://ditec.um.es/~jlaragon/papers/FATSEA-RAPIDO-2010.pdf /content/cudazone/CUDABrowser/assets/images/applications/1223_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1223_GPUComputing bgimg_large.png Academia University of Murcia 2009 12 22 12/22/2009 K. E. Ostby J. L. Aragon J. M. Garcia M. Ujaldon Paper K. E. Ostby,J. L. Aragon,J. M. Garcia 0c25d963-909c-4a1e-8f60-81bf8fd868cf Evaluating the use of GPUs in Liver Image Segmentation and HMMER Database Searches In this paper we present the results of parallelizing two life sciences applications, Markov random fields-based (MRF) liver segmentation and HMMER's Viterbi algorithm, using GPUs. We relate our experiences in porting both applications to the GPU as well as the techniques and optimizations that are most beneficial. The unique characteristics of both algorithms are demonstrated by implementations on an NVIDIA 8800 GTX Ultra using the CUDA programming environment. We test multiple enhancements in our GPU kernels in order to demonstrate the effectiveness of each strategy. http://cadi.buffalo.edu/papers/2009/2009_4.pdf /content/cudazone/CUDABrowser/assets/images/applications/1222_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1222_GPUComputing bgimg_large.png Academia University at Buffalo, SUNY 2009 02 15 02/15/2009 John Paul Walters Vidyananth Balu Suryaprakash Kompalli Vipin Chaudhary Paper John Paul Walters,Vidyananth Balu,Suryaprakash Kompalli,waltersj@buffalo.edu,vbalu2@buffalo.edu,kompalli@hp.com e21e7880-81bf-46d6-9185-b56a93a0ad3b GPUCT: A GPU-Accelerated CT Reconstruction System CT scanning is a medical imaging technique commonly used in hospitals, including the University of Virginia Hospital, to see inside the human body. Modern CT scanners can generate images of the body in three dimensions, a process called 3D reconstruction. 
This project illustrates the feasibility of using graphics hardware (GPUs) to process CT scans in a more efficient and inexpensive manner than current commercial reconstruction systems. Additionally, this research considers the ethical and social implications of an improved CT reconstruction system in terms of risks for hospitals and patients. http://www.cs.virginia.edu/~skadron/Papers/maier_thesis07.pdf /content/cudazone/CUDABrowser/assets/images/applications/1221_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1221_GPUComputing bgimg_large.png Academia University of Virginia 2007 03 30 03/30/2007 56 Drew Maier Paper Drew Maier 72ba00af-f128-438b-9e0f-379786953cce FPGA-Based Hardware Acceleration of Lithographic Aerial Image Simulation Lithography simulation, an essential step in design for manufacturability (DFM), is still far from computationally efficient. Most leading companies use large clusters of server computers to achieve acceptable turn-around time. Thus coprocessor acceleration is very attractive for obtaining increased computational performance with a reduced power consumption. This article describes the implementation of a customized accelerator on FPGA using a polygon-based simulation model. An application-specific memory partitioning scheme is designed to meet the bandwidth requirements for a large number of processing elements. http://cadlab.cs.ucla.edu/~cong/papers/TRETS-17.pdf /content/cudazone/CUDABrowser/assets/images/applications/1220_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1220_GPUComputing bgimg_large.png Academia University of California, Los Angeles 2009 09 17 09/17/2009 15 Jason Congyi Yi Zou Paper Jason Congyi,Yi Zou ebfbcda7-6930-45cc-83d5-1414da4c1325 Visualising Spins and Clusters in Regular and Small-World Ising Models with GPUs Visualising computational simulation models of solid state physical systems is a hard problem for dense lattice models. 
Fly throughs and cutaways can aid viewer understanding of a simulated system. Interactive time model parameter updates and overlaying of measurements and graticules, cluster colour labelling and other visual highlighting cues can also enhance user intuition of the model's meaning. We present some graphical and simulation optimisation techniques and various graphical rendering and explanatory techniques for computational simulation models such as the Ising model in 2 and 3 dimensions. In addition to aiding understanding of conventional algorithms such as Metropolis Monte Carlo, we try to visualise cluster updates to the system using algorithms like that of Wolff. http://www.massey.ac.nz/~dpplayne/Papers/cstn-108.pdf /content/cudazone/CUDABrowser/assets/images/applications/1219_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1219_GPUComputing bgimg_large.png Academia Massey University 2010 03 19 03/19/2010 A. Leist D. P. Playne K.A. Hawick Paper A. Leist,D. P. Playne,K.A. Hawick 3c9c3b66-7ebb-461e-884c-c820abca8856 Stereo Depth with a Unified Architecture GPU This paper describes how the calculation of depth from stereo images was accelerated using a GPU. The Compute Unified Device Architecture (CUDA) from NVIDIA was employed in novel ways to compute depth using BT cost matching and the Semi-Global Matching algorithm. The challenges of mapping a sequential algorithm to a massively parallel thread environment and performance optimization techniques are considered. 
http://mplab.ucsd.edu/wp-content/uploads/CVPR2008/WorkShops/data/papers/143.pdf /content/cudazone/CUDABrowser/assets/images/applications/1218_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1218_GPUComputing bgimg_large.png Academia Florida Atlantic University 2008 05 04 05/04/2008 Joel Gibson Oge Marques Paper Joel Gibson,Oge Marques 4ea4d671-5de1-4355-8c9d-392fa35be551 3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation Medical image registration is time-consuming but can be sped up by employing parallel processing on the GPU. Normalized mutual information (NMI) is a well-performing similarity measure for multi-modal registration. We present CUDA based solutions for computing NMI on the GPU and compare the results obtained by rigidly registering multi-modal data sets with a CPU based implementation. Our tests with RIRE data sets show a speed-up by a factor of 5 to 7 for our best GPU implementation. http://www.gris.informatik.tu-darmstadt.de/~swesarg/papers/1632.pdf /content/cudazone/CUDABrowser/assets/images/applications/1217_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1217_GPUComputing bgimg_large.png Academia Technische Universitat Darmstadt 2010 01 04 01/04/2010 Florian Jung Stefan Wesarg Paper Florian Jung,Stefan Wesarg,stefan.wesarg@gris.tu-darmstadt.de 93ec4473-01fb-42ff-9e39-46248ff46941 fastHOG - a real-time GPU implementation of HOG We introduce a parallel implementation of the histogram of oriented gradients algorithm for object detection. Our implementation uses the GPU and the NVIDIA CUDA framework. We achieve speedups of over 67x from the standard sequential code, using a single video card. Furthermore it supports multiple video cards so speedups of 120x or more can be achieved. This allows us to achieve real-time performance, using the full HOG algorithm for the first time in the literature. 
http://www.robots.ox.ac.uk/ActiveVision/Papers/prisacariu_reid_tr2310_09/prisacariu_reid_tr2310_09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1216_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1216_GPUComputing bgimg_large.png Academia University of Oxford 2009 07 14 07/14/2009 120 Victor Adrian Prisacariu Ian Reid Paper Victor Adrian Prisacariu,Ian Reid,victor@robots.ox.ac.uk,ian@robots.ox.ac.uk 608510ce-a512-41c8-b32a-b55cb524284d Detection and Tracking of Human Subjects The goal of the thesis project was to devise an algorithm to detect and track people in a static video. Existing techniques are inadequate; instead a new approach based on background subtraction is used. The approach is successful with a static camera. In background subtraction, the background of the video is calculated a priori and then subtracted from each frame of the video. This isolates the foreign objects, which are detected via two simple algorithms. Both algorithms are based on the subject's center of mass, but the first algorithm traces the path of the person around the video, making it very cluttered. http://www.cs.virginia.edu/~skadron/Papers/Grosvenor_Douglas_thesis.pdf /content/cudazone/CUDABrowser/assets/images/applications/1215_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1215_GPUComputing bgimg_large.png Academia University of Virginia 2009 04 30 04/30/2009 Douglas Grosvenor Paper Douglas Grosvenor 78013ee5-9a75-42af-93a8-4a5b68a330fa Optimizing Sparse Matrix-Vector Multiplication on GPUs We are witnessing the emergence of Graphics Processing Units (GPUs) as powerful massively parallel systems. Furthermore, the introduction of new APIs for general-purpose computations on GPUs, namely, CUDA from NVIDIA, Stream SDK from AMD, and OpenCL, makes GPUs an attractive choice for high-performance numerical and scientific computing. 
http://domino.watson.ibm.com/library/CyberDig.nsf/papers/1D32F6D23B99F7898525752200618339/$File/rc24704.pdf /content/cudazone/CUDABrowser/assets/images/applications/1214_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1214_GPUComputing bgimg_large.png Research IBM Research Division 2009 04 02 04/02/2009 Muthu Manikandan Baskaran Rajesh Bordawekar Paper Muthu Manikandan Baskaran,Rajesh Bordawekar,baskaran@ces.ohio-state.edu,bordaw@us.ibm.com 67ac5cbe-b91c-46b0-8a4e-9635e86dec49 Acceleration of Binomial Options Pricing via Parallelizing along time-axis on a GPU Since the introduction of organized trading of options for commodities and equities, computing fair prices for options has been an important problem in financial engineering. A variety of numerical methods, including Monte Carlo methods, binomial trees, and numerical solution of stochastic differential equations, are used to compute fair prices. Traders and brokerage firms constantly strive to achieve faster calculation of option prices because timely information can mean the difference between a deal struck or missed, which translates to substantial profit or loss. Hence, the latency to compute a fair option price plays an important role in short-term trading and arbitrage. http://saahpc.ncsa.illinois.edu/09/papers/Ganesan_paper.pdf /content/cudazone/CUDABrowser/assets/images/applications/1213_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1213_GPUComputing bgimg_large.png Academia Washington University in St. Louis 2009 06 29 06/29/2009 Narayan Ganesan Roger D. Chamberlain Jeremy Buhler Paper Narayan Ganesan,Roger D. Chamberlain,Jeremy Buhler,nganesan@wustl.edu,roger@wustl.edu,jbuhler@wustl.edu 52b3b800-2810-4e18-b631-e995c9a2ed48 GPU Accelerated Cardiac Electrophysiology Numerical simulations of cellular membranes are useful for both basic science and increasingly for clinical diagnostic and therapeutic applications. 
A common bottleneck in such simulations arises from solving large highly complex stiff systems of ordinary differential equations (ODEs) thousands of times for numerous collocation points (representing cells) throughout a three-dimensional volume. For some electrophysiology simulations, over 98% of the time is spent solving these systems of ODEs when run in serial on a single core. http://cseweb.ucsd.edu/groups/hpcl/scg/papers/2010/lionetti_ms_thesis.pdf /content/cudazone/CUDABrowser/assets/images/applications/1212_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1212_GPUComputing bgimg_large.png Academia University of California, San Diego 2010 04 15 04/15/2010 280 Fred Lionetti Paper Fred Lionetti 7dbbc8bd-6029-4692-911e-5a4e725ef349 HARNESSING THE POWER OF IDLE GPUS FOR ACCELERATION OF BIOLOGICAL SEQUENCE ALIGNMENT This paper presents a parallel system capable of accelerating biological sequence alignment on the graphics processing unit (GPU) grid. The GPU grid in this paper is a desktop grid system that utilizes idle GPUs and CPUs in the office and home. Our parallel implementation employs a master-worker paradigm to accelerate an OpenGL-based algorithm that runs on a single GPU. We integrate this implementation into a screensaver-based grid system that detects idle resources on which the alignment code can run. 
http://www-hagi.ist.osaka-u.ac.jp/research/papers/200912_ino_ppl.pdf /content/cudazone/CUDABrowser/assets/images/applications/1211_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1211_GPUComputing bgimg_large.png Academia Osaka University 2009 08 24 08/24/2009 FUMIHIKO INO YUKI KOTANI YUMA MUNEKAWA Paper FUMIHIKO INO,YUKI KOTANI,YUMA MUNEKAWA 701ae5ed-4273-4c9a-8546-55b23244a5ca Mixing Multi-Core CPUs and GPUs for Scientific Simulation Software Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations however, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. http://www.massey.ac.nz/~dpplayne/Papers/cstn-091.pdf /content/cudazone/CUDABrowser/assets/images/applications/1210_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1210_GPUComputing bgimg_large.png Academia Massey University 2009 09 21 09/21/2009 K. A. Hawick A. Leist D. P. Playne Paper K. A. Hawick,A. Leist,D. P. Playne,k.a.hawick@massey.ac.nz,a.leist@massey.ac.nz,d.p.playne@massey.ac.nz 4a0b9cb0-a366-4a75-b2c4-e00c1af60a5c Computing on GPUs The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been the porting of computing-intensive algorithms such as ray-tracing algorithms from CPU to GPU. 
Through the Compute Unified Device Architecture (CUDA [4]) GPUs can also be used to increase computing speed for High Performance Computing applications. In this paper different parallelization strategies for different processor architectures are presented. They are compared and first experiences using GPUs for a collection of numerical applications are given. /content/cudazone/CUDABrowser/assets/images/applications/1209_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1209_GPUComputing bgimg_large.png Commercial DYNAmore GmbH 2009 05 14 05/14/2009 Dr. Uli Gohner Paper Dr. Uli Gohner 3e365933-a682-4caa-aa9c-b165f30d448d High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading. The second method uses CUDA, a graphics card computing system. Parallel Tempering is described, and challenges such as parallel random number generation and mapping of Monte Carlo chains to different threads are explained. While conventional multi-threading on CPUs is well-established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads. http://arxiv.org/ftp/arxiv/papers/1004/1004.0023.pdf /content/cudazone/CUDABrowser/assets/images/applications/1208_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1208_GPUComputing bgimg_large.png Commercial D-Wave Systems Inc. 2010 03 31 03/31/2010 Kamran Karimi Neil G. Dickson Firas Hamze Paper Kamran Karimi,Neil G. 
Dickson,Firas Hamze,kkarimi@dwavesys.com,ndickson@dwavesys.com,fhamze@dwavesys.com 1e8f6f3a-a7bd-4267-9fc9-3680f9bc0449 Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors Training convolutional neural networks (CNNs) on large sets of high-resolution images is too computationally intense to be performed on commodity CPUs. Such architectures, however, achieve state-of-the-art results on low-resolution machine vision tasks such as the recognition of handwritten characters. We have adapted the inherent multi-level parallelism of CNNs for Nvidia's CUDA GPU architecture to accelerate the training by two orders of magnitude. http://www.ais.uni-bonn.de/papers/nips09ws_scherer_behnke.pdf /content/cudazone/CUDABrowser/assets/images/applications/1206_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1206_GPUComputing bgimg_large.png University of Bonn, Germany 2009 12 01 12/01/2009 Dominik Scherer Sven Behnke Paper Dominik Scherer,Sven Behnke,scherer@ais.ni-bonn.de,behnke@cs.uni-bonn.de c0bb0403-84f7-42b9-8575-ec19bf6268d2 A Practical GPU Based KNN Algorithm The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. However, it does not deliver satisfactory performance in many applications, because the KNN algorithm has a high computational complexity. Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have opened a new era of parallel computing that delivers tremendous computational horsepower in a single chip. In this paper, we describe a practical GPU based K Nearest Neighbor (KNN) algorithm implemented with CUDA.
http://www.academypublisher.com/proc/iscsct09/papers/iscsct09p151.pdf /content/cudazone/CUDABrowser/assets/images/applications/1205_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1205_GPUComputing bgimg_large.png Academia Soochow University 2009 12 26 12/26/2009 Quansheng Kuang Lei Zhao Paper Quansheng Kuang,Lei Zhao,kqs.net@163.com,zhaol@suda.edu.cn f51cd9bf-7067-4281-9b34-95671175e688 http://www.modelica.org/events/modelica2009/Proceedings/memorystick/pages/papers/0032/0032.pdf This work focuses on the use of parallel hardware to improve the simulation speed of equation-based object-oriented Modelica models. With this intention, a method has been developed that allows for the translation of a restricted class of Modelica models to parallel simulation code, targeted for the Nvidia Tesla architecture and based on the Quantized State Systems (QSS) simulation algorithm. http://www.modelica.org/events/modelica2009/Proceedings/memorystick/pages/papers/0032/0032.pdf /content/cudazone/CUDABrowser/assets/images/applications/1204_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1204_GPUComputing bgimg_large.png 2009 09 21 09/21/2009 Martina Maggio Kristian Stavaker Filippo Donida Francesco Casella Peter Fritzson Paper Martina Maggio,Kristian Stavaker,Filippo Donida 0648ea65-f71f-4795-a1be-579a9aa03e90 An efficient GPU implementation for large scale individual-based simulation of collective behavior In this work we describe a GPU implementation for an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors' positions and orientations. This carries a very high computational cost in the so-called nearest neighbors search.
By leveraging the GPU processing power and the new programming model called CUDA, we implement an efficient framework that makes it possible to simulate the collective motion of high-density individual groups. http://www.unibas.it/utenti/erra/Papers/HiBi09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1203_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1203_GPUComputing bgimg_large.png Academia Universita della Basilicata 2009 10 01 10/01/2009 Ugo Erra Bernardino Frola Vittorio Scarano Iain Couzin Paper Ugo Erra,Bernardino Frola,Vittorio Scarano,ugo.erra@unibas.it,ber.frola@gmail.com,vitsca@dia.unisa.it 9dedab39-73d9-4c13-b5d2-2104819cec81 A Hybrid Analytical DRAM Performance Model As process technology scales, the number of transistors that can fit in a unit area has increased exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad hoc analytical modeling techniques in lieu of cycle-accurate performance simulation to identify critical design points. https://www.ece.ubc.ca/~aamodt/papers/gyuan.mobs2009.pdf /content/cudazone/CUDABrowser/assets/images/applications/1202_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1202_GPUComputing bgimg_large.png Academia University of British Columbia 2008 05 19 05/19/2008 George L. Yuan Tor M. Aamodt Paper George L. Yuan,Tor M. Aamodt,gyuan@ece.ubc.ca,aamodt@ece.ubc.ca df66c76c-4fda-468e-9eca-124cae57e3c4 Parallelisation of Fuzzy Inference on a Graphics Processor Unit Using the Compute Unified Device Architecture The inherently parallel nature of fuzzy inference is rarely exploited by fuzzy systems researchers.
Hardware implementations, such as Field Programmable Gate Arrays (FPGAs), commonly use parallel architectures to achieve fast inference speeds. In this paper, we explore the use of Graphics Processor Units (GPUs) and NVIDIA's Compute Unified Device Architecture (CUDA) for fast inference speeds in a scalable and flexible Mamdani type fuzzy inference system (FIS). Our goal is to provide computational intelligence researchers the skills necessary to exploit the low cost and high performance of GPUs with a minimum learning cost. http://www.cci.dmu.ac.uk/ukci2008/papers/Parallelisation-of-Fuzzy-Inference-on-a-Graphics-Processor-Unit-Using-the-Compute-Unified-Device-Architecture.pdf /content/cudazone/CUDABrowser/assets/images/applications/1201_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1201_GPUComputing bgimg_large.png Academia University of Missouri, Columbia 2008 05 01 05/01/2008 Derek Anderson Simon Coupland Paper Derek Anderson,Simon Coupland 76dcd879-098a-41f2-ab03-38c43d2a042e GPU-Based Road Sign Detection Using Particle Swarm Optimization Road Sign Detection is a major goal of Advanced Driving Assistance Systems (ADAS). Since the dawn of this discipline, much work based on different techniques has been published which shows that traffic signs can be first detected and then classified in video sequences in real time. While detection is usually performed using classical computer vision techniques based on color and/or shape matching, most often classification is performed by neural networks. In this work we present a novel approach based on both sign shape and color which uses Particle Swarm Optimization (PSO) for detection.
http://www.ce.unipr.it/~mussi/downloads/papers/mussiISDA09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1200_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1200_GPUComputing bgimg_large.png Academia University of Parma 2009 11 01 11/01/2009 Luca Mussi Stefano Cagnoni Fabio Daolio Paper Luca Mussi,Stefano Cagnoni,Fabio Daolio,mussi@ce.unipr.it,cagnoni@ce.unipr.it,fabio.daolio@unil.ch f0e5e186-d65b-4d6b-9f18-fb153cfcf39a LARGE-SCALE PARALLEL MULTIBODY DYNAMICS WITH FRICTIONAL CONTACT In the context of simulating the frictional contact dynamics of large systems of rigid bodies, this paper reviews a novel method for solving large cone complementarity problems by means of a fixed-point iteration algorithm. The method is an extension of the Gauss-Seidel and Gauss-Jacobi methods with overrelaxation for symmetric convex linear complementarity problems. http://www.mcs.anl.gov/uploads/cels/papers/P1487.pdf /content/cudazone/CUDABrowser/assets/images/applications/1199_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1199_GPUComputing bgimg_large.png Academia University of Wisconsin Madison 2008 10 20 10/20/2008 Dan Negrut Alessandro Tasora Mihai Anitescu Paper Dan Negrut,Alessandro Tasora,Mihai Anitescu,negrut@wisc.edu,tasora@ied.unipr.it,anitescu@mcs.anl.gov abb411b1-54e9-4a4d-8f26-7acad6754856 A characterization and analysis of PTX kernels General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. 
While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design. http://www.computer.org/portal/web/csdl/doi/10.1109/IISWC.2009.5306801 /content/cudazone/CUDABrowser/assets/images/applications/1198_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1198_GPUComputing bgimg_large.png Academia Georgia Institute of Technology 2009 05 05 05/05/2009 Andrew Kerr Gregory Diamos Sudhakar Yalamanchili Paper Andrew Kerr,Gregory Diamos,Sudhakar Yalamanchili ed43c757-50fe-4151-a1c4-21184ce71dbd General Purpose Computation on Graphics Processing Units (GPGPU) using CUDA Graphics processing units (GPUs) are special processors which traditionally were used to accelerate computer graphics by offloading work from the CPU. Today, GPUs are highly parallel many-core processors which enable general-purpose computation on graphics processing units (GPGPU). GPGPU has already been an issue since 2002 but a huge interest did not evolve until Nvidia released the CUDA platform in 2007. Developers and researchers started to use CUDA for parallel programming. 
http://www.wi.uni-muenster.de/pi/lehre/ws0910/pppa/papers/gpgpu.pdf /content/cudazone/CUDABrowser/assets/images/applications/1195_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1195_GPUComputing bgimg_large.png Academia Westfalische Wilhelms-Universitat 2009 12 20 12/20/2009 Alexander Zibula Paper Alexander Zibula 993dd63c-de2a-49e1-81ba-ade7f1682b25 Simulation of one-layer shallow water systems on multicore and CUDA architectures The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the large size of the domains calls for powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model. http://lsi.ugr.es/~jmantas/papers/supercomputing09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1194_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1194_GPUComputing bgimg_large.png Academia 1 Universidad de Granada 2Universidad de Malaga 2010 03 01 03/01/2010 Marc de la Asuncion1 Jose M. Mantas1 Manuel J. Castro2 Paper Marc de la Asuncion1,Jose M. Mantas1,Manuel J.
Castro2 78e3f6d6-b219-43a8-8c2d-f515465c3670 IDN_MFC Image denoising with bilateral filter algorithms /content/cudazone/CUDABrowser/assets/images/applications/1193_64638_Application_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1193_64638_Application_large.jpg Academia Wroclaw University of Technology 2010 06 22 06/22/2010 100 Wojciech Korycki Application Paper Signal Processing Bilateral Filter denoising,Wojciech Korycki,wojciech.korycki@gmail.com 9a370a84-4d82-4133-b523-6f56cca33568 Hypercubic Storage Layout and Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1-, 2-, 3- or arbitrary physical dimensions and also in a manner that supports exploitation of data-parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data-parallel memory layouts. http://tur-www1.massey.ac.nz/~dpplayne/Papers/cstn-096.pdf /content/cudazone/CUDABrowser/assets/images/applications/1192_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1192_GPUComputing bgimg_large.png Academia Massey University 2009 06 23 06/23/2009 K. A. Hawick D. P. Playne Paper K. A. Hawick,D. P. Playne f33d13b3-5699-4fb2-b908-98d32866aa20 Analyzing CUDA Workloads Using a Detailed GPU Simulator Modern Graphic Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight in designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism.
While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance. https://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf /content/cudazone/CUDABrowser/assets/images/applications/1191_GPUComputing bgimg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1191_GPUComputing bgimg_large.png Academia University of British Columbia 2009 03 01 03/01/2009 Ali Bakhoda George L. Yuan Wilson W. L. Fung Henry Wong Tor M. Aamodt Paper Ali Bakhoda,George L. Yuan,Wilson W. L. Fung,Henry Wong,Tor M. Aamodt c1d20c0c-79f9-44f0-9331-291ccbeb0ee7 Phase Based Volume Registration Using CUDA We have implemented phase based volume registration using CUDA, in contrast to all other GPU based image registration implementations that are based on the image intensity. Our registration algorithm is more robust for volumes that differ significantly in intensity. This work was presented at the IEEE conference ICASSP in Dallas 2010. /content/cudazone/CUDABrowser/assets/images/applications/1189_449881_phase_based_volume_registration_using_CUDA_small.png /content/cudazone/CUDABrowser/assets/images/applications/1189_449881_phase_based_volume_registration_using_CUDA_large.png Academia Linkoping university http://www.moviii.isy.liu.se 2010 06 22 06/22/2010 30 Anders Eklund Mats Andersson Hans Knutsson Paper Medical Imaging Image registration, local phase,Anders Eklund,Mats Andersson,Hans Knutsson,andek@imt.liu.se,matsa@imt.liu.se,knutte@imt.liu.se f9f4cdda-fab7-40fa-b8bc-43482f378a81 Towards a Software Transactional Memory for Graphics Processors The introduction of general purpose computing on many-core graphics processor systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization. 
Software transactional memory (STM) simplifies development of concurrent code by allowing the programmer to mark sections of code to be executed concurrently and atomically in an optimistic manner. In contrast to locks, STMs are easy to compose and do not suffer from deadlocks. We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking. The design issues involved in the development of these two STMs are described and explained in the paper together with experimental results comparing the performance of the two STMs. /content/cudazone/CUDABrowser/assets/images/applications/1188_7612_cudazonestm_small.png /content/cudazone/CUDABrowser/assets/images/applications/1188_7612_cudazonestm_large.png Academia Chalmers University of Technology http://www.chalmers.se 2010 04 14 04/14/2010 Daniel Cederman Philippas Tsigas Muhammad Tayyab Chaudhry Paper Programming Tools Daniel Cederman,Philippas Tsigas,cederman@chalmers.se,tsigas@chalmers.se a8c7fb3f-0fb8-40d2-8440-c9dedecf7051 nexiwave Speech Indexing nexiwave 2.0 the GPU Assisted Speech Indexing /content/cudazone/CUDABrowser/assets/images/applications/1187_13933_nexilogo_betawith_snowflakes_small.png /content/cudazone/CUDABrowser/assets/images/applications/1187_13933_nexilogo_betawith_snowflakes_large.png Commercial nexiwave.com http://nexiwave.com 2010 06 03 06/03/2010 75 Commercial Ben Jiang Application Signal Processing Speech Indexing Speech Indexing, GPU,Ben Jiang,ben@nexiwave.com a42593ae-afe0-46ed-85d3-7a1ab25c93ac Massive Bayesian Mixture Modelling This paper describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large data sets. 
An example context concerns common biological studies using high-throughput technologies generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. /content/cudazone/CUDABrowser/assets/images/applications/1186_179430_cfse_clusters_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1186_179430_cfse_clusters_large.jpg Academia UCLA and Duke University 2010 03 01 03/01/2010 160 Open source Marc A. Suchard Quanli Wang Cliburn Chan Jacob Frelinger Andrew Cron Mike West Paper Code Numerics Life Sciences Science Computational statistics,Marc A. Suchard,Quanli Wang,Cliburn Chan, Jacob Frelinger, Andrew Cron, Mike West,msuchard@ucla.edu,mw@stat.duke.edu b5e57af8-695e-4550-9a42-a30be2716079 Accelerating Quadrature Methods for Option Valuation This paper presents an architecture for FPGA acceleration of quadrature methods used for pricing complex options, such as discrete barrier, Bermudan, and American options. The architecture can be optimized for speed and power consumption by exploiting pipelining and parallelism to produce efficient implementations in reconfigurable logic. An optimised implementation using Graphics Processing Units (GPUs) is also developed, to provide a performance and efficiency comparison with an FPGA accelerator. http://www.computer.org/portal/web/csdl/doi/10.1109/FCCM.2009.36 /content/cudazone/CUDABrowser/assets/images/applications/1185_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1185_logo_CS Digital Library_large.jpg Academia Imperial College London 09 04 01 04/01/09 Anson H. T. Tse David B. Thomas Wayne Luk Paper Anson H. T. Tse,David B. Thomas,Wayne Luk 934fb2f1-7ee6-4958-bb7d-b4ed38debaee High Resolution Program Flow Visualization of Hardware Accelerated Hybrid Multi-core Applications The advent of multi-core processors has made parallel computing techniques mandatory on main stream systems. 
With the recent rise of hardware accelerators, hybrid parallelism adds yet another dimension of complexity to the process of software development. This article presents a tool for graphical program flow analysis of hardware accelerated parallel programs. http://www.computer.org/portal/web/csdl/doi/10.1109/CCGRID.2010.27 /content/cudazone/CUDABrowser/assets/images/applications/1184_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1184_logo_CS Digital Library_large.jpg 2010 05 01 05/01/2010 Daniel Hackenberg Guido Juckeland Holger Brunst Paper Daniel Hackenberg,Guido Juckeland,Holger Brunst d0ffd6f3-3337-4dbe-a796-0c7d19b1cd6e An Analysis of GPU Parallel Computing Parallel systems are becoming ubiquitous in the world of computing as evidenced by multi-core processors, heterogeneous Cell broadband engine, and highly parallel graphics processing units (GPUs). All parallel systems share a requirement that parallel programming is necessary to leverage multiple cores. As a result of this trend, multi-core CPUs are no longer a clear winner, due to their peaked clock frequency and the programming effort involved in parallelizing code for multi-core architectures. Given such drawbacks, data-parallel applications might benefit from GPU assisted computing. http://www.computer.org/portal/web/csdl/doi/10.1109/HPCMP-UGC.2009.59 /content/cudazone/CUDABrowser/assets/images/applications/1183_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1183_logo_CS Digital Library_large.jpg Research U. S. Army Research Laboratory 2009 06 01 06/01/2009 Song Jun Park Paper Song Jun Park 79cc1232-9c03-4eca-ac3f-dd9b3743fac0 Tiling for Performance Tuning on Different Models of GPUs The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance of programs has been more and more widely approved during the last two years since the CUDA platform was released.
Its benefit extends from the graphic domain to many other computationally intensive domains. Tiling, as the most general and important technique, is widely used for optimization in CUDA programs. New models of GPUs with better compute capabilities have, however, been released, and new versions of the CUDA SDK have also been released. http://www.computer.org/portal/web/csdl/doi/10.1109/ISISE.2009.60 /content/cudazone/CUDABrowser/assets/images/applications/1182_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1182_logo_CS Digital Library_large.jpg Academia Hong Kong University of Science and Technology 2002 11 01 11/01/2002 Chang Xu Steven R. Kirk Samantha Jenkins Paper Chang Xu,Steven R. Kirk,Samantha Jenkins e5e749ac-945e-42f2-a238-1209f8986eb2 Acceleration of Medical Image Registration Using Graphics Process Units in Computing Normalized Mutual Information This paper presents a computational performance analysis of an accelerated medical image registration using Graphics Processing Units (GPUs). In our previous work, a multi-resolution approach using normalized mutual information (NMI) has proven to be useful in medical image registration. In this paper, we propose an acceleration of the NMI procedure using GPU implementation because of the parallel processing capabilities. http://www.computer.org/portal/web/csdl/doi/10.1109/ICIG.2009.48 /content/cudazone/CUDABrowser/assets/images/applications/1181_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1181_logo_CS Digital Library_large.jpg Academia Kent State University 2009 09 01 09/01/2009 Wei-Hung Cheng Cheng-Chang Lu Paper Wei-Hung Cheng,Cheng-Chang Lu 705aaab7-bfe3-4433-ae2d-e1490bf77dbb MAX-MIN Ant System on GPU with CUDA We propose a parallel MAX-MIN Ant System (MMAS) algorithm that is suitable for an implementation on graphics processing units (GPUs).
Multiple ant colonies, each with its respective parameter settings, are offloaded in their entirety to the GPU in parallel. We have implemented this MMAS on the GPU with compute unified device architecture (CUDA). Several performance optimizations for the GPU kernel program are introduced. Experimental results that are based on simulations for the traveling salesperson problem are presented to evaluate the proposed techniques. /content/cudazone/CUDABrowser/assets/images/applications/1180_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1180_logo_CS Digital Library_large.jpg 2009 12 07 12/07/2009 Hongtao Bai Dantong Ou Yang Ximing Li Lili He Haihong Yu Paper Hongtao Bai,Dantong Ou Yang,Ximing Li f3d7658d-4572-4345-a1bd-3fe05ca6ce37 Scene Recognition Acceleration Using CUDA and OpenMP Scene recognition has become a remarkable field in the image processing area, and many methods have been proposed in recent years, among which the idea of extracting the scene gist from global features has been shown to have higher retrieval accuracy than many other methods. However, the process of extracting gist is heavily time-consuming and not suitable for real-time application. In this paper, the CUDA architecture is deployed to accelerate this process.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICISE.2009.1045 /content/cudazone/CUDABrowser/assets/images/applications/1179_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1179_logo_CS Digital Library_large.jpg Academia Dalian University of Technology 2009 12 01 12/01/2009 Yuxin Wang Zhen Feng He Guo Changqin He Yuansheng Yang Paper Yuxin Wang,Zhen Feng,He Guo b717867c-c959-4577-b6a5-afbc0e42fdae A Stream Processor Cluster Architecture Model with the Hybrid Technology of MPI and CUDA Nowadays, the compute capability of traditional cluster systems can't keep up with the computing needs of a practical application, and these aspects of energy, space technology, etc. have become a huge problem. However, as parallel computing equipment, the stream processor (SP) has a high performance of floating-point operations. NVIDIA GPUs are typical stream processor devices, and CUDA technology makes developing parallel programs on GPUs more flexible. In this paper, we make use of the hybrid parallel computing programming environment (HPCPE) with MPI and CUDA technology to build a simple CPU + GPU-based stream processor cluster system. http://www.computer.org/portal/web/csdl/doi/10.1109/ICISE.2009.171 /content/cudazone/CUDABrowser/assets/images/applications/1177_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1177_logo_CS Digital Library_large.jpg Academia University of Shanghai for Science and Technology 2009 12 01 12/01/2009 Qing-kui Chen Jia-kang Zhang Paper Qing-kui Chen,Jia-kang Zhang 5da86d39-2d35-4792-a0b5-e400e3383959 Formal Description and Optimization Based High - Performance Computing on CUDA In recent years, with the development of GPUs, general purpose computation on graphics processors has become a new field.
Aiming at the processing of GPU, this paper provides the formal description for data parallel mode, a detailed description of the CUDA programming model and the principle of optimization. A comparative experiment shows that CUDA has a strong parallel processing capability and provides new methods and ideas for GPGPU. /content/cudazone/CUDABrowser/assets/images/applications/1176_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1176_logo_CS Digital Library_large.jpg Academia Hong Kong University of Science and Technology 2009 12 01 12/01/2009 Bo Li Huacheng Zhao JingJing Liang Paper Bo Li,Huacheng Zhao,JingJing Liang 2287f161-8378-4190-ae1f-bd428d9ca3c3 Password Recovery for RAR Files Using CUDA Driven by the insatiable demand of real-time graphics, especially from the market of computer games, Graphics Processing Unit (GPU) is becoming a major computing horsepower during recent years since the performance of GPU is surpassing that of the contemporary CPU. This paper presents our study on how to efficiently recover the passwords for encrypted RAR files. http://www.computer.org/portal/web/csdl/doi/10.1109/DASC.2009.123 /content/cudazone/CUDABrowser/assets/images/applications/1175_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1175_logo_CS Digital Library_large.jpg 2009 12 01 12/01/2009 Guang Hu Jianhua Ma Benxiong Huang Paper Guang Hu,Jianhua Ma,Benxiong Huang 2b933813-11b7-4f81-8e63-9ce67eba045f Password Recovery for RAR Files Using CUDA Driven by the insatiable demand of real-time graphics, especially from the market of computer games, Graphics Processing Unit (GPU) is becoming a major computing horsepower during recent years since the performance of GPU is surpassing that of the contemporary CPU. This paper presents our study on how to efficiently recover the passwords for encrypted RAR files.
Our research focus is on the AES key generation processing, which is the most time consuming stage in the whole RAR encryption/decryption process. http://www.computer.org/portal/web/csdl/doi/10.1109/DASC.2009.123 /content/cudazone/CUDABrowser/assets/images/applications/1174_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1174_logo_CS Digital Library_large.jpg 2009 12 01 12/01/2009 Guang Hu Jianhua Ma Benxiong Huang Paper Guang Hu,Jianhua Ma,Benxiong Huang f928f65a-86c9-4a31-8299-3e40f02d03fa GPU-Assisted Computation of Centroidal Voronoi Tessellation Centroidal Voronoi tessellations (CVT) are widely used in computational science and engineering. The most commonly used method is Lloyd's method, and recently the L-BFGS method is shown to be faster than Lloyd's method for computing the CVT. However, these methods run on the CPU and are still too slow for many practical applications. We present techniques to implement these methods on the GPU for computing the CVT on 2D planes and on surfaces, and demonstrate significant speedup of these GPU-based methods over their CPU counterparts. http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2010.53 /content/cudazone/CUDABrowser/assets/images/applications/1173_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1173_logo_CS Digital Library_large.jpg Academia University of Texas at Dallas 2010 03 16 03/16/2010 Guodong Rong Yang Liu Wenping Wang Xiaotian Yin David Gu Xiaohu Guo Paper Guodong Rong,Yang Liu,Wenping Wang,Xiaotian Yin,David Gu,Xiaohu Guo 7e3424c0-b0ed-4476-8830-eb60da8a80c7 Designing Efficient Many-Core Parallel Algorithms for All-Pairs Shortest-Paths Using CUDA Finding the all-pairs shortest-paths on a large graph is a fundamental problem in many practical applications such as bioinformatics, internet node traffic and network routing. 
In this paper, we present the designs of two efficient parallel algorithms for many-core GPUs using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. By using the global scope of the GPU's global memory, coalescing the global memory reads and writes, and avoiding on-chip shared memory bank conflicts, we are able to achieve a large performance benefit with a speed-up of 2,500x on a desktop computer in comparison with a single core program. http://www.computer.org/portal/web/csdl/doi/10.1109/ITNG.2010.230 /content/cudazone/CUDABrowser/assets/images/applications/1172_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1172_logo_CS Digital Library_large.jpg Academia Lamar University 2010 04 01 04/01/2010 Quoc-Nam Tran Paper Quoc-Nam Tran 0315bfde-f758-4fc6-8cf5-85aac810ca12 Record Setting Software Implementation of DES Using CUDA The increase in computational power of off-the-shelf hardware offers more and more advantageous tradeoffs among efficiency, cost and availability, thus enhancing the feasibility of cryptanalytic attacks aiming to lower the security of widely used cryptosystems. In this paper we illustrate a GPU-based software implementation of the most efficient variant of Data Encryption Standard (DES), showing the performance of a software breaker which effectively exploits the multi-core Nvidia GT200 graphic architecture.
http://www.computer.org/portal/web/csdl/doi/10.1109/ITNG.2010.43 /content/cudazone/CUDABrowser/assets/images/applications/1171_logo_CS Digital Library_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1171_logo_CS Digital Library_large.jpg 2010 04 01 04/01/2010 Giovanni Agosta Allessandro Barenghi Fabrizio De Santis Gerardo Pelosi Paper Giovanni Agosta,Allessandro Barenghi,Fabrizio De Santis 7df8d14f-4c52-460a-8881-ad932fd45292 Eye-Full Tower: A GPU-based variable multibaseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation In recent years, it can be observed that there is a gradual increase in the number of researchers and projects involved with the development of omnidirectional vision systems for various applications. The primary factors, which contributed towards this positive trend, are the availability of inexpensive and high resolution vision sensors, robust and fast computers and the advantages of using such systems over perspective vision systems. In this paper, a novel variable multibaseline omnidirectional stereovision system is presented. http://portal.acm.org/citation.cfm?id=1805342.1805504&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289 /content/cudazone/CUDABrowser/assets/images/applications/1170_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1170_logo_acm_portal2_large.jpg Academia Monash University 2010 06 01 06/01/2010 Wen Lik Dennis Lui Ray Jarvis Paper Wen Lik Dennis Lui,Ray Jarvis 56e6ce9f-ae50-427d-8d3f-23b1e24c6683 Optimized high speed pixel sorting and its application in watershed based image segmentation Efficient sorting of image pixels based on their grayscale value is traditionally implemented using an algorithm based on distribution or counting sort methods. We show that an elegant alternative can be used which outperforms the traditional method both in terms of processing speed and main memory access. 
We discuss both theoretically analyzed and real-life performance results, and demonstrate the improvements that can be obtained when our algorithm is combined with a well-known watershed image segmentation method. /content/cudazone/CUDABrowser/assets/images/applications/1169_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1169_logo_acm_portal2_large.jpg Research The National Institute for Criminalistics and Criminology (NICC) 2010 07 01 07/01/2010 Patrick De Smet Paper Patrick De Smet 729f484e-7759-401b-a6ef-78c85f290bd6 GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under the application of an electric field is presented. The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists was proposed. Some important issues in the implementation of computations that involve a large number of arithmetic operations and data on the GPU that has limited high-speed memory resources are addressed extensively. 
http://portal.acm.org/citation.cfm?id=1808372.1808870&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289 /content/cudazone/CUDABrowser/assets/images/applications/1168_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1168_logo_acm_portal2_large.jpg Academia Kochi University of Technology 2010 08 01 08/01/2010 Alfeus Sunarso Tomohiro Tsuji Shigeomi Chono Paper Alfeus Sunarso,Tomohiro Tsuji,Shigeomi Chono 2b5f83d2-9afa-42a5-8582-8a9b3ae48841 Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware The discrete wavelet transform (DWT) has been extensively used for image compression and denoising in the areas of image processing and computer vision. However, the intensive computation of the DWT, due to its inherent multilevel data decomposition and reconstruction operations, creates a bottleneck that drastically reduces its performance and hinders implementations for real-time applications when facing large digital images and/or high-definition videos. Although various software-based acceleration solutions, such as the lifting scheme, have been devised and have achieved higher performance in general, the purely software-accelerated DWT still struggles to cope with the demands of real-time and interactive applications. With the growing capacity and popularity of graphics hardware, personal computers (PCs) nowadays are often equipped with programmable graphics processing units (GPUs) for graphics acceleration. The GPU offers a cost-effective parallel data processing mechanism for operations on large amounts of data, even for applications beyond graphics. This practice is commonly referred to as general-purpose computing on GPU (GPGPU).
http://portal.acm.org/citation.cfm?id=1786816.1787181&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289 /content/cudazone/CUDABrowser/assets/images/applications/1167_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1167_logo_acm_portal2_large.jpg Academia University of Huddersfield 2010 08 01 08/01/2010 Yang Su Zhijie Xu Paper Yang Su,Zhijie Xu a4486be2-fdef-4db0-89b0-879b296f6681 GPU Computing for Atmospheric Modeling Experience with a small kernel and implications for a full model Much success has been achieved using GPUs to accelerate existing applications that are highly data parallel, or that are dominated by small, intense computational kernels. What are the prospects for porting existing large scientific models that do not fit this mold? We take an expensive routine from the CAM atmosphere model, and port it to a GPU using CUDA. We use the experience gained as a guide in thinking about porting the full application to an accelerator based system. We consider the best path forward for getting large scientific models running on accelerator based systems, and identify cases where porting may be feasible, and where a complete redesign may be the best option. /content/cudazone/CUDABrowser/assets/images/applications/1166_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1166_logo_xplore_large.gif Research National Center for Atmospheric Research 2010 02 04 02/04/2010 Rory Kelly Paper Rory Kelly ce8e0150-e62b-4a0c-890b-0442b2e058a6 Design and Performance Evaluation of Image Processing Algorithms on GPUs In this paper, we construe key factors in design and evaluation of image processing algorithms on the massive parallel GPU (graphics processing units) using the CUDA (compute unified device architecture) programming model. A set of metrics, customized for image processing, are proposed to quantitatively evaluate algorithm characteristics. 
In addition, we show that a range of image processing algorithms map readily to CUDA, using multiview stereo matching, linear feature extraction, JPEG2000 image encoding, and non-photorealistic rendering (NPR) as our example applications. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5477417&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1165_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1165_logo_xplore_large.gif Academia Inha University 2010 06 03 06/03/2010 In Kyu Park Nitin Singhal Man Hee Lee Sungdae Cho Chris Kim Paper In Kyu Park,Nitin Singhal,Man Hee Lee 750d6910-dfb0-4eee-9c8e-cec1320d7f09 CUDA-Based Linear Solvers for Stable Fluids In the field of computer graphics, physically based fluid simulations (i.e., simulations that solve the equations that govern fluid behaviour) are performed using, among others, Stam's stable fluids method. This method requires the solution of two sparse linear systems that can be solved using an iterative solver (e.g., Jacobi, Gauss-Seidel, or conjugate gradient). Focusing on real-time 3D applications, we provide and analyze the performance of the parallel GPU-based (using CUDA) algorithms of the Jacobi, Gauss-Seidel, and conjugate gradient solvers. /content/cudazone/CUDABrowser/assets/images/applications/1164_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1164_logo_xplore_large.gif 2010 04 21 04/21/2010 Goncalo Amador Abel Gomes Paper Goncalo Amador,Abel Gomes d43dc906-8a5f-4bb4-bcb2-006f2d9be085 Implementation of Variable Preconditioned GCR with mixed precision on GPU using CUDA The Variable Preconditioned GCR (VPGCR) with mixed precision on a Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA) is numerically investigated.
The convergence theorem of VPGCR guarantees that the residual equation for the preconditioning procedure can be solved within the range of single-precision operations. The computational results show that VPGCR with mixed-precision operation on the GPU significantly outperforms the CPU version. In particular, VPGCR on the GPU with mixed-precision operation is 22.53 times faster than on the Central Processing Unit (CPU). /content/cudazone/CUDABrowser/assets/images/applications/1163_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1163_logo_xplore_large.gif Academia Tokyo University of Technology 2010 05 09 05/09/2010 Soichiro Ikuno Norihisa Fujita Susumu Yamamoto Susumu Nakata Paper Soichiro Ikuno,Norihisa Fujita,Susumu Yamamoto 55d72cf8-150e-4281-be46-1a00cd588e1e A CUDA-Based Implementation of Stable Fluids in 3D with Internal and Moving Boundaries Fluid simulation has been an active research field in computer graphics for the last 30 years. Stam's stable fluids method, among others, is used for solving the equations that govern fluids (i.e., the Navier-Stokes equations). An implementation of stable fluids in 3D using NVIDIA's Compute Unified Device Architecture (CUDA) is provided in this paper. This CUDA-based implementation also features the accurate physical treatment of internal boundaries (i.e., static boundaries inside the simulation domain) and moving boundaries. The performance gains of the presented implementation over a sequential CPU-based implementation, and points of further improvement, are also addressed.
/content/cudazone/CUDABrowser/assets/images/applications/1161_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1161_logo_xplore_large.gif 2010 03 23 03/23/2010 Goncalo Amador Abel Gomes Paper Goncalo Amador,Abel Gomes d41f4ce3-8c86-4dd4-835b-8954c9caef44 Hybrid Core Acceleration of UWB SIRE Radar Signal Processing To move High Performance Computing (HPC) closer to forward operating environments and missions, the Army Research Laboratory is developing approaches using hybrid, asymmetric core computing. By blending capabilities found in Graphics Processing Units (GPUs) and traditional von Neumann multi-core Central Processing Units (CPUs), approaches are being developed and optimized to provide at or near real-time processing speeds for research project applications. Algorithms are designed to partition work to resources best designed to handle the processing load. The use of commodity resources allows the design to be flexible throughout the life-cycle without the costly and time-consuming delays associated with Application Specific Integrated Circuit (ASIC) development. This paradigm allows for rapid technology transfer to end users. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5477419&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1160_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1160_logo_xplore_large.gif Research U. S. Army Research Laboratory 2010 06 03 06/03/2010 Song Jun Park James Ross Dale Shires David Richie Brian Henz Lam Nguyen Paper Song Jun Park,James Ross,Dale Shires 78482823-4944-4b0a-ab95-15e1aad00454 Optimal loop unrolling for GPGPU programs Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth.
With the advent of general-purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general-purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470423&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1159_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1159_logo_xplore_large.gif Academia The Ohio State University 2010 04 19 04/19/2010 Giridhar Murthy Sreenivasa Mahesh Ravishankar Muthu Manikandan Baskaran P. Sadayappan Paper Giridhar Murthy Sreenivasa,Mahesh Ravishankar,Muthu Manikandan Baskaran 86fc0781-21d5-4b31-9fb0-7061d02a703b Using CUDA enabled FDTD simulations to solve multi-gigahertz EMI challenges Thanks to the application of GPU-CUDA acceleration technology to EM simulation tools, more and more complicated EMI challenges can be efficiently investigated and solved very early in the design process. This paper presents a novel methodology to predict EMI emission due to memory SSO noise from a real, commercial graphics card by means of a commercially available CUDA-accelerated full-wave FDTD simulator. It is shown that, thanks to the CUDA acceleration, one can estimate the influence of on-board decoupling capacitors on the EMI emission within hours.
/content/cudazone/CUDABrowser/assets/images/applications/1158_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1158_logo_xplore_large.gif Research KHBO Flanders Mechatronics Engineering Centre 2010 04 12 04/12/2010 Davy Pissoort Chen Wang Hany Fahmy Paper Davy Pissoort,Chen Wang,Hany Fahmy 860722bd-ba53-4391-9e5c-7197a5574713 Dynamic load balancing on single- and multi-GPU systems The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470413&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1157_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1157_logo_xplore_large.gif Academia University of Delaware 2010 04 19 04/19/2010 Long Chen Oreste Villa Sriram Krishnamoorthy Paper Long Chen,Oreste Villa,Sriram Krishnamoorthy 8bc36492-7c2b-4e62-ba6d-a9664ee84f10 Automatic Generation of Multi-Core Chemical Kernels This work presents KPPA (the Kinetics PreProcessor: Accelerated), a general analysis and code generation tool that achieves significantly reduced time-to-solution for chemical kinetics kernels on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs.
A comparative performance analysis of chemical kernels from WRF-Chem and the Community Multiscale Air Quality Model (CMAQ) is presented for each platform in double and single precision on coarse and fine grids. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5473221&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1156_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1156_logo_xplore_large.gif Academia Virginia Polytechnic Institute and State University 2010 05 27 05/27/2010 John Linford John Michalakes Manish Vachharajani Adrian Sandu Paper J. Linford,J. Michalakes,M. Vachharajani f0ed0b4c-5d78-4283-b786-d977d462b699 Speculative execution on multi-GPU systems The lag of parallel programming models and languages behind the advance of heterogeneous many-core processors has left a gap between the computational capability of modern systems and the ability of applications to exploit them. Emerging programming models, such as CUDA and OpenCL, force developers to explicitly partition applications into components (kernels) and assign them to accelerators in order to utilize them effectively. An accelerator is a processor with a different ISA and micro-architecture than the main CPU. These static partitioning schemes are effective when targeting a system with only a single accelerator. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470427&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1155_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1155_logo_xplore_large.gif Academia Georgia Institute of Technology 2010 04 19 04/19/2010 Gregory Diamos Sudhakar Yalamanchili Paper Gregory Diamos,Sudhakar Yalamanchili 829884ae-a849-4ab7-a5db-1ebb4290798a AUTO-GC: Automatic translation of data mining applications to GPU clusters Because of the very favorable price to performance ratio of the GPUs, a popular parallel programming configuration today is a cluster of GPUs. However, extracting performance on such a configuration would typically require programming in both MPI and CUDA, thus requiring a high degree of expertise and effort. It is clearly desirable to be able to support higher-level programming of this emerging high-performance computing platform. /content/cudazone/CUDABrowser/assets/images/applications/1154_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1154_logo_xplore_large.gif Academia The Ohio State University 2010 04 19 04/19/2010 Wenjing Ma Gagan Agrawal Paper Wenjing Ma,Gagan Agrawal 4401d416-56a2-4a55-88a0-a8ccbb66c75d Pricing of cross-currency interest rate derivatives on Graphics Processing Units We present a Graphics Processing Unit (GPU) parallelization of the computation of the price of cross-currency interest rate derivatives via a Partial Differential Equation (PDE) approach. In particular, we focus on the GPU-based parallel computation of the price of long-dated foreign exchange interest rate hybrids, namely Power Reverse Dual Currency (PRDC) swaps with Bermudan cancelable features. 
We consider a three-factor pricing model with foreign exchange skew which results in a time-dependent parabolic PDE in three spatial dimensions. Finite difference methods on uniform grids are used for the spatial discretization of the PDE, and the Alternating Direction Implicit (ADI) technique is employed for the time discretization. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470708&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1153_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1153_logo_xplore_large.gif Academia University of Toronto 2010 04 19 04/19/2010 Duy Minh Dang Paper Duy Minh Dang fb5436fc-e3e8-4c43-b70a-2dc6ba8e4f18 Study on GPU-accelerated extraction of interconnects parasitic using CUDA and MPI Parallel computation is application-oriented, particularly for the GPU (Graphics Processing Unit) with the inherent parallelism. This paper shows the architecture of a GPU cluster based on MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture). Results show that the acceleration ratio is obviously improved but the acceleration effect seems decelerated in large-scale GPU cluster. The parallel algorithm is mainly focused on task partitioning sparse matrix-vector multiplications (SpVM) in GPUs. /content/cudazone/CUDABrowser/assets/images/applications/1151_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1151_logo_xplore_large.gif Academia Chinese Academy of Sciences 2010 05 09 05/09/2010 Xiaoyu Xu Guoqiang Liu Hui Qu Wei Xu Yang Zhang Paper Xiaoyu Xu,Guoqiang Liu,Hui Qu eb22bb0c-f56a-47bc-8b54-6bc2fa978435 Performance study of mapping irregular computations on GPUs Recently, Graphical Processing Units (GPUs) have become increasingly more capable and well-suited to general purpose applications. 
As a result of the GPU's high degree of parallelism and computational power, there has been a great deal of interest directed toward the platform for parallel application development. Much of the focus, however, has been on very regular applications that exhibit a high degree of data parallelism, as these applications map well to the GPU. Irregular applications, such as the Breadth First Search discussed in this paper, have not been as extensively studied and are more difficult to implement in an efficient fashion on the GPU. We will present both an implementation of the Breadth First Search algorithm as well as that of a Matrix Parenthesization algorithm. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470770&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1150_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1150_logo_xplore_large.gif Academia University of Manitoba 2010 04 19 04/19/2010 Steven Solomon Parimala Thulasiraman Paper Steven Solomon,Parimala Thulasiraman 2aa51865-7e62-41be-8adf-a461a0ae58d7 Design and implementation of MPEG audio layer III decoder using graphics processing units This paper describes a newly implemented method for the MPEG audio layer III (MP3) decoder. The proposed architecture is based on a graphics processing unit (GPU) using the CUDA environment, where it can effectively take advantage of a modern GPU's parallel computing power. The implemented system with this architecture employs a multi-thread model and memory optimization to process MP3 decoding in parallel, which is important for minimizing the computational overhead. Experimental results on a GTX260+ graphics card showed that the proposed architecture is over five times faster than a traditional CPU-based MP3 library.
/content/cudazone/CUDABrowser/assets/images/applications/1148_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1148_logo_xplore_large.gif Academia Chinese Academy of Sciences 2010 04 09 04/09/2010 Chen Xiaoliang Zheng Chengshi Ma Longhua Cheng Xiaobin Li Xiaodong Paper Chen Xiaoliang ,Zheng Chengshi ,Ma Longhua 0e47069d-420f-4d76-9565-d04f4341f8d2 Dynamically tuned push-relabel algorithm for the maximum flow problem on CPU-GPU-Hybrid platforms The maximum flow problem is a fundamental graph theory problem with many important applications. Max-flow algorithms based on the push-relabel method are known to have better complexity bound and faster practical execution speed than others. However, existing push-relabel algorithms are designed for uniprocessors or parallel processors that support locking primitives, thus making it very difficult to apply the push-relabel technique to CUDA-based GPUs. In this paper, we present a first generic parallel push-relabel algorithm for CUDA devices. We model the parallelization efficiency of the algorithm, which reveals that, for a given input graph, the level of parallelism varies during the execution of the algorithm. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470401&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1147_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1147_logo_xplore_large.gif Academia Georgia Institute of Technology 2010 04 19 04/19/2010 Zhengyu He Bo Hong Paper Zhengyu He,Bo Hong a23a06cf-4a9e-46f4-9171-718369793c99 An auto-tuning framework for parallel multicore stencil computations Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470421&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1146_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1146_logo_xplore_large.gif Research CRD/NERSC, Lawrence Berkeley National Laboratory 2010 04 19 04/19/2010 Shoaib Kamil Cy Chan Leonid Oliker John Shalf Samuel Williams Paper Shoaib Kamil,Cy Chan,Leonid Oliker 5448d0e9-eebf-4e90-8c29-59f7f4c224c1 Comparing Hardware Accelerators in Scientific Applications: A Case Study Multi-core processors and a variety of accelerators have allowed scientific applications to scale to larger problem sizes. We present a performance, design methodology, platform, and architectural comparison of several application accelerators executing a Quantum Monte Carlo application. We compare the application's performance and programmability on a variety of platforms including CUDA with Nvidia GPUs, Brook+ with ATI graphics accelerators, OpenCL running on both multi-core and graphics processors, C++ running on multi-core processors, and a VHDL implementation running on a Xilinx FPGA. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5482576&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1145_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1145_logo_xplore_large.gif Academia University of Tennessee 2010 06 06 06/06/2010 R. Weber A. Gothandaraman R. Hinde G. Peterson Paper R. Weber,A. Gothandaraman,R. Hinde 1aa1156a-6f55-4e59-93bb-4cf5c1a6b6a2 Demystifying GPU Microarchitecture Through Microbenchmarking Graphics processors (GPU) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. 
Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5452013&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1144_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1144_logo_xplore_large.gif Academia University of Toronto 2010 03 28 03/28/2010 H. Wong M. Papadopoulou M. Sadooghi-Alvandi A. Moshovos Paper H. Wong,M. Papadopoulou,M. Sadooghi-Alvandi 56a2b66b-15bc-46e7-bce1-2f1b937dfe11 SelfAudience Audience Measurement - real time video analysis for counting people, face detection and tracking /content/cudazone/CUDABrowser/assets/images/applications/1143_428449_selfadvert_small.png /content/cudazone/CUDABrowser/assets/images/applications/1143_428449_selfadvert_large.png Commercial SelfAdvert http://www.selfadvert.com 2010 05 15 05/15/2010 300 SelfAdvert Application Multimedia Video & Audio freeware audience measurement,SelfAdvert,sales@selfadvert.com 41b685bb-4e0a-49ba-af2f-b938f11bae36 Cellular Automata Evolver Evolver of Cellular Automata 1D rules plus inference tools with the state of the art technology /content/cudazone/CUDABrowser/assets/images/applications/1142_caev2_small.png /content/cudazone/CUDABrowser/assets/images/applications/1142_caev2_large.png Research Cellular Automata Evolver 2010 06 02 06/02/2010 10 Denis Antiga Application Science Cellular Automata,Denis Antiga,a.denis1@yahoo.com a71cdec6-989a-487e-bfe7-36278925ca5d Statistical constraints on binary black hole
inspiral dynamics We perform a statistical analysis of the binary black hole problem in the post-Newtonian approximation by systematically sampling and evolving the parameter space of initial configurations for quasi-circular inspirals. Through a principal component analysis of spin and orbital angular momentum variables we systematically look for uncorrelated quantities and find three of them which are highly conserved in a statistical sense, both as functions of time and with respect to variations in initial spin orientations. http://arxiv.org/abs/1005.5560 /content/cudazone/CUDABrowser/assets/images/applications/1141_bh_small.png /content/cudazone/CUDABrowser/assets/images/applications/1141_bh_large.png Academia University of Maryland 2010 05 30 05/30/2010 50 Chad Galley Frank Herrmann John Silberholz Manuel Tiglio Gustavo Guerberoff Paper Numerics Science Chad Galley,Frank Herrmann,John Silberholz, Manuel Tiglio, Gustavo Guerberoff,tiglio@umd.edu f5c5c329-3a57-4d10-a40d-475b6d59423c Object-oriented stream programming using aspects High-performance parallel programs that efficiently utilize heterogeneous CPU+GPU accelerator systems require tuned coordination among multiple program units. However, using current programming frameworks such as CUDA leads to tangled source code that combines code for the core computation with that for device and computational kernel management, data transfers between memory spaces, and various optimizations. In this paper, we propose a programming system based on the principles of Aspect-Oriented Programming, to un-clutter the code and to improve programmability of these heterogeneous parallel systems. Specifically, we use standard C++ to describe the core computations and aspects to encapsulate all other support parts. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470472&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26pageNumber%3D2%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1140_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1140_logo_xplore_large.gif Academia Rutgers University 2010 04 19 04/19/2010 Mingliang Wang Manish Parashar Paper Mingliang Wang,Manish Parashar 60a13ea4-55ee-4aa3-9fbe-2d1ee29bca6c The GPU Computing Era GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures (from graphics processors to massively parallel many-core multiprocessors), recent developments in GPU computing architectures, and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications. /content/cudazone/CUDABrowser/assets/images/applications/1139_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1139_logo_xplore_large.gif Commercial NVIDIA 2010 03 01 03/01/2010 J. Nickolls W. J. Dally Paper J. Nickolls,W. J. Dally d5846aba-4896-45b5-b4d7-371b91ef56e5 Fast implementation of Wyner-Ziv Video codec using GPGPU In this paper, we report a fast implementation of a Wyner-Ziv video decoder using general-purpose computing on graphics processing units (GPGPU). Despite its many advantages, Wyner-Ziv video coding has the problem of huge decoding complexity. Since Slepian-Wolf decoding with rate-adaptive LDPC accumulate codes takes up more than 90% of the entire Wyner-Ziv video decoding complexity, in this paper we focus on a fast implementation of the Slepian-Wolf decoder using CUDA (Compute Unified Device Architecture), a GPGPU architecture developed by NVIDIA.
Our implementation is shown to be 4-5 times (QCIF size) or 15-20 times (CIF size) faster compared to conventional Slepian-Wolf decoding. /content/cudazone/CUDABrowser/assets/images/applications/1138_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1138_logo_xplore_large.gif Academia Sungkyunkwan University 2010 03 24 03/24/2010 20 Ryanggeun Oh Jongbing Park Byeungwoo Jeon Paper Ryanggeun Oh,Jongbing Park,Byeungwoo Jeon b37872a6-2751-4cb1-9d68-64b433ae6da1 Efficient parallel algorithms for maximum-density segment problem One of the fundamental problems involving DNA sequences is to find high density segments of certain widths, for example, those regions with intensive guanine and cytosine (GC). Formally, given a sequence, each element of which has a value and a width, the maximum-density segment problem asks for the segment with the maximum density while satisfying minimum and possibly maximum width constraints. While several linear-time sequential algorithms have emerged recently due to its primitive-like utility, to our knowledge, no nontrivial parallel algorithm has yet been proposed for this topical problem. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470390&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1136_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1136_logo_xplore_large.gif Academia Georgia State University 2010 04 19 04/19/2010 Xue Wang Fasheng Qiu Sushil K. Prasad Paper Xue Wang,Fasheng Qiu,Sushil K. Prasad ad4bb0eb-1147-4ed3-ae05-d7d49cb8d9b4 Fast binding site mapping using GPUs and CUDA Binding site mapping refers to the computational prediction of the regions on a protein surface that are likely to bind a small molecule with high affinity.
The process involves flexibly docking a variety of small molecule probes and finding a consensus site that binds most of those probes. Due to the computational complexity of flexible docking, the process is often split into two steps: the first performs rigid docking between the protein and the probe; the second models the side chain flexibility by energy-minimizing the (few thousand) top scoring protein-probe complexes generated by the first step. Both these steps are computationally very expensive, requiring many hours of runtime per probe on a serial CPU. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470895&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1134_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1134_logo_xplore_large.gif Academia Boston University 2010 04 19 04/19/2010 Bharat Sukhwani Martin C. Herbordt Paper Bharat Sukhwani,Martin C. Herbordt 4fb1d9f2-99b7-4237-b4f0-46e4bf9cf25a Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead Sorting is a well-investigated topic in Computer Science in general and by now many efficient sorting algorithms for CPUs and GPUs have been developed. There is no swapping, paging, etc. available on GPUs to provide more virtual memory than physically available, thus if one wants to sort sequences that exceed GPU memory using the GPU the problem of external sorting arises. 
/content/cudazone/CUDABrowser/assets/images/applications/1133_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1133_logo_xplore_large.gif Academia Christian-Albrechts-University 2010 04 19 04/19/2010 Hagen Peters Ole Schulz-Hildebrandt Norbert Luttenberger Paper Hagen Peters,Ole Schulz-Hildebrandt,Norbert Luttenberger 5668b6fb-9541-43d3-b426-da3fbc93395c A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA The Viterbi algorithm is the compute-intensive kernel in Hidden Markov Model (HMM) based sequence alignment applications. In this paper, we investigate extending several parallel methods, such as the wave-front and streaming methods for the Smith-Waterman algorithm, to achieve a significant speed-up on a GPU. The wave-front method can take advantage of the computing power of the GPU but it cannot handle long sequences because of the physical GPU memory limit. On the other hand, the streaming method can process long sequences but with increased overhead due to the increased data transmission between CPU and GPU. To further improve the performance on GPU, we propose a new tile-based parallel algorithm. We take advantage of the homological segments to divide long sequences into many short pieces and each piece pair (tile) can be fully held in the GPU's memory. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470903&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1132_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1132_logo_xplore_large.gif Academia Tsinghua University 2010 04 19 04/19/2010 Zhihui Du Zhaoming Yin David A. Bader Paper Zhihui Du,Zhaoming Yin,David A. 
Bader a2642850-909e-475e-bd95-fcb458609914 Designing scalable many-core parallel algorithms for min graphs using CUDA Removing redundant edges on a large graph is a fundamental problem in many practical applications such as verification of real-time systems and network routing. In this paper, we present the designs of scalable and efficient parallel algorithms for multiple many-core GPU devices using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. By using the global scope of the GPU's global memory, coalescing the global memory reads and writes, and avoiding on-chip shared memory bank conflicts, we are able to achieve a large performance benefit with a speed-up of 2,500x on a desktop computer in comparison with a single core CPU program. We report our experiments on large graphs with up to 29K vertices using multiple GPU devices. /content/cudazone/CUDABrowser/assets/images/applications/1131_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1131_logo_xplore_large.gif Academia Lamar University 2010 04 19 04/19/2010 Quoc-Nam Tran Paper Quoc-Nam Tran ebaed060-0f46-49b2-b9b3-5bfbb7e21ff5 Implementing the Himeno benchmark with CUDA on GPU clusters This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on a NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance. 
/content/cudazone/CUDABrowser/assets/images/applications/1130_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1130_logo_xplore_large.gif Commercial NVIDIA 2010 04 19 04/19/2010 Everett H. Phillips Massimiliano Fatica Paper Everett H. Phillips,Massimiliano Fatica adad926b-73e9-492a-a3d5-69301fb1d791 CUDA-based AES parallelization with fine-tuned GPU memory utilization Current Graphics Processing Unit (GPU) presents large potentials in speeding up computationally intensive data parallel applications over traditional parallelization approaches since there are much more hardware threads inside GPUs than the computational cores available to common CPU threads. NVIDIA developed a generic GPU programming platform, CUDA, which allows programmers to utilize GPU through C programming language and parallelize applications in a similar way as in traditional multithreading approach. However, not all applications are suitable for this new platform. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470766&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1129_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1129_logo_xplore_large.gif Academia Arkansas State University 2010 04 19 04/19/2010 Chonglei Mei Hai Jiang Jeff Jenness Paper Chonglei Mei,Hai Jiang,Jeff Jenness 59dd2453-f984-4ab7-9a4c-49d2350b0f09 Optimization of linked list prefix computations on multithreaded GPUs using CUDA We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. 
While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470455&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1128_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1128_logo_xplore_large.gif Academia University of Maryland 2010 04 19 04/19/2010 Zheng Wei Joseph JaJa Paper Zheng Wei,Joseph JaJa 9ecfd491-9553-4775-b59e-87718a3593fc Parallel computing with CUDA NVIDIA's CUDA architecture provides a powerful platform for writing highly parallel programs. By providing simple abstractions for hierarchical thread organization, memories, and synchronization, the CUDA programming model allows programmers to write scalable programs without the burden of learning a multitude of new programming constructs. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470378&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1127_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1127_logo_xplore_large.gif Commercial NVIDIA 2010 04 19 04/19/2010 Michael Garland Paper Michael Garland 7473646a-8f1b-4e8b-b578-cabe90a66678 Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. 
Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs. http://portal.acm.org/citation.cfm?id=1772954.1772971&coll=Portal&dl=ACM&CFID=91959390&CFTOKEN=70859630 /content/cudazone/CUDABrowser/assets/images/applications/1126_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1126_logo_acm_portal2_large.jpg Research NVIDIA Corporation 2010 04 01 04/01/2010 John A Stratton Vinod Grover Jaydeep Marathe Bastiaan Aarts Mike Murphy Ziang Hu Wen-mei W. Hwu Paper John A Stratton,Vinod Grover,Jaydeep Marathe 7702a523-e58c-4e1e-8e05-207e1430c47c Non-blocking programming on multi-core graphics processors: extended abstract This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need of synchronization primitives other than reads and writes. 
http://portal.acm.org/citation.cfm?id=1556444.1556448&coll=Portal&dl=ACM&CFID=91959390&CFTOKEN=70859630 /content/cudazone/CUDABrowser/assets/images/applications/1125_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1125_logo_acm_portal2_large.jpg Academia University of Tromsø 2009 06 01 06/01/2009 Phuong Hoai Ha Philippas Tsigas Otto J. Anshus Paper Phuong Hoai Ha,Philippas Tsigas,Otto J. Anshus 018fa5a9-ae5d-498c-87ba-c505061b01c5 Application-guided tool development for architecturally diverse computation Architecturally diverse computation exploits non-traditional computing platforms (e.g., field-programmable gate arrays, graphics processors, heterogeneous chip multiprocessors) to execute user applications. We have designed the Auto-Pipe tool set with the goal of easing the task of developing applications for architecturally diverse systems. Prior to and during the course of Auto-Pipe's design, we have developed a number of real, substantial applications, and the lessons learned during the development of these applications have had a direct bearing on the capabilities of Auto-Pipe. In this paper, we describe the relationship between our application development experience and Auto-Pipe. In short, how have applications guided the tools' evolution and development? /content/cudazone/CUDABrowser/assets/images/applications/1124_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1124_logo_acm_portal2_large.jpg Academia Washington University in St. Louis 2010 03 01 03/01/2010 R. D. Chamberlain J. Buhler M. Franklin J. H. Buckley Paper R. D. Chamberlain,J. Buhler,M. Franklin 7c408604-7b5b-4079-8a47-1aeb09371dde NeuroSolutions CUDA Add-on The NeuroSolutions CUDA Add-on implements high performance parallel computing of Neural Networks using Levenberg-Marquardt, one of the most powerful forms of back-propagation learning available. 
Neural Networks are a form of artificial intelligence (AI) that have proved to be effective in solving a wide range of data mining and data modeling problems including credit card fraud detection, cancer diagnosis and financial forecasting to name a few. As problems become more and more complex, so does the demand for processing power. By parallelizing advanced learning algorithms on a GPU (Graphics Processing Unit), NeuroSolutions can achieve up to 50 times greater performance than that of processing on a traditional CPU (Central Processing Unit). A free evaluation version of NeuroSolutions is available for download on our website. /content/cudazone/CUDABrowser/assets/images/applications/1123_v6-nscuda-large_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1123_v6-nscuda-large_large.jpg Commercial NeuroDimension, Inc. http://www.nd.com 2010 06 13 05/13/2010 50 Commercial Gary Lynn Brian Kachnowski Application Presentation Computational Fluid Dynamics Finance Imaging Medical Imaging Numerics Life Sciences Libraries Oil & Gas Science Signal Processing Neural Networks Data Mining Machine Learning neural network, Levenberg-Marquardt, CUDA, Multilayer Perceptron, GPU, parallel processing,Gary Lynn,Brian Kachnowski,info@nd.com e1b2a932-da54-4115-9913-ef21d09b12cb Bayesian Real-Time Perception Algorithms on GPU Real-time implementation of a Bayesian framework for robotic multisensory perception using the Compute Unified Device Architecture (CUDA). 
/content/cudazone/CUDABrowser/assets/images/applications/1122_1bayesoccupancyfilter_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1122_1bayesoccupancyfilter_large.jpg Academia Mobile Robotics Lab, Institute of Systems and Robotics, Coimbra Pole, Portugal http://paloma.isr.uc.pt 2010 02 26 02/26/2010 30,000 Joao Filipe Ferreira Jorge Lobo Jorge Dias Paper Life Sciences Science Signal Processing Video & Audio Robotics and Artificial Perception Joao Filipe Ferreira,Jorge Lobo,Jorge Dias,jfilipe@isr.uc.pt f695686e-a314-4d4c-a222-7a1e88c753f3 A Work-Efficient GPU Algorithm for Level Set Segmentation Level set segmentation is a powerful computational method for identifying complex objects in n-dimensional images. We present a novel level set segmentation algorithm that scales efficiently on an unbounded number of parallel computer processors while performing asymptotically no more work than the most efficient known sequential algorithm. We demonstrate that our new algorithm is one order of magnitude faster than current state-of-the-art parallel algorithms with no reduction in accuracy. /content/cudazone/CUDABrowser/assets/images/applications/1121_brainweb-3D-composite-offset-small_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1121_brainweb-3D-composite-offset-small_large.jpg Academia University of Calgary http://www.ucalgary.ca 2010 06 25 06/25/2010 14 Mike Roberts Jeff Packer Mario Costa Sousa J Ross Mitchell Multimedia Paper Graphics Imaging Medical Imaging level set image segmentation,Mike Roberts,Jeff Packer,Mario Costa Sousa / J Ross Mitchell,mlrobert@ucalgary.ca 4b052c40-1ee0-4188-bb05-5854ba1bafc9 Real Time Face Tracking A real-time face tracker that tracks multiple faces simultaneously on subsequent video frames with maximum stability. 
/content/cudazone/CUDABrowser/assets/images/applications/1120_NeST-NVIDIA_Center_small.png /content/cudazone/CUDABrowser/assets/images/applications/1120_NeST-NVIDIA_Center_large.png Commercial Network Systems & Technologies (P) Ltd. http://nestsoftware.com 2009 11 28 11/28/2009 Midhun M Neethu K Chandran Preetha Joy Paper Imaging Midhun M,Neethu K Chandran,Preetha Joy,hpc@nestgroup.net 055f0f06-5ab2-4644-9765-9d40534a183a HPC Platform options: Cell BE and GPU This write up briefly compares two competing performance architectures for data parallelism Cell Broadband Engine (Cell in short) and the GPU (Graphics Processing Unit). /content/cudazone/CUDABrowser/assets/images/applications/1119_NeST-NVIDIA_Center_small.png /content/cudazone/CUDABrowser/assets/images/applications/1119_NeST-NVIDIA_Center_large.png Commercial Network Systems & Technologies (P) Ltd. http://nestsoftware.com 2009 11 10 11/10/2009 Anoop Thomas Paper Cell BE and GPU comparison Anoop Thomas,hpc@nestgroup.net 91689c21-d8df-4d7d-b37a-1ac1fdd6227c Parallel Iterative Linear Solvers on GPU: A Financial Engineering Case In many numerical applications resulting from computational science and engineering problems, the solution of sparse linear systems is the most prohibitively compute intensive task. Consequently, the linear solvers need to be carefully chosen and efficiently implemented in order to harness the available computing resources. Krylov subspace based iterative solvers have been widely used for solving large systems of linear equations. In this paper, we focus on the design of such iterative solvers to take advantage of massive parallelism of general purpose Graphics Processing Units (GPU)s. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5452413&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1118_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1118_logo_xplore_large.gif Academia Ecole Centrale Paris 2010 02 17 02/17/2010 A. Gaikwad I.M. Toke Paper A. Gaikwad,I.M. Toke 5ad8691f-8688-42ac-b692-0094ba1c701b IP routing processing with graphic processors Throughput and programmability have always been the central, but generally conflicting concerns for modern IP router designs. Current high performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but could only achieve a throughput one order of magnitude lower. Modern GPUs are offering significant computing power, and its data-parallel computing model well matches the typical patterns of packet processing on routers. Accordingly, in this research we investigate the potential of CUDA-enabled GPUs for IP routing applications. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5457229&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1117_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1117_logo_xplore_large.gif Academia Tsinghua University 2010 03 08 03/08/2010 Shuai Mu Xinya Zhang Nairen Zhang Jiaxin Lu Yangdong Steve Deng Shu Zhang Paper Shuai Mu,Xinya Zhang,Nairen Zhang 2c2381ee-55c4-4cd2-8c69-1433e6716c77 Frame-based parallelization of MPEG-4 on compute unified device architecture (CUDA) Due to its object based nature, flexible features and provision for user interaction, MPEG-4 encoder is highly suitable for parallelization. 
The most critical and time-consuming operation of the encoder is motion estimation. Nvidia's general-purpose graphical processing unit (GPGPU) architecture allows for a massively parallel stream processor model at a very low price (a few thousand rupees). However, synchronization of parallel calculations and repeated device-to-host data transfers are a major challenge in parallelizing motion estimation on CUDA. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5422997&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1116_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1116_logo_xplore_large.gif Academia Indian Institute of Technology 2010 02 19 02/19/2010 D. Ailawadi M.K. Mohapatra A. Mittal Paper D. Ailawadi,M.K. Mohapatra,A. Mittal 031e13f9-1141-4bf3-8aed-851ab751fa2b CUDA Based GPU Programming to Simulate 3D Tissue Deformation Medical training systems based on virtual simulation are highly desired since minimally invasive surgical techniques have become popular with patients. Such a training system helps surgeon trainees to acquire, practice and evaluate their surgical skills, and its key component is the simulation of dynamic procedures such as 3D biological tissue deformation in surgical operation. In our paper, an improved mass-spring model is proposed to represent the biological tissue surface, in which a virtual spring is introduced to help compensate for the weakness of the conventional mass-spring model. 
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5462444&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1115_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1115_logo_xplore_large.gif 2010 04 23 04/23/2010 Yuanyuan Zhang Jianhui Zhao Zhiyong Yuan Yihua Ding Chengjian Long Lu Xiong Paper Yuanyuan Zhang,Jianhui Zhao,Zhiyong Yuan,Yihua Ding,Chengjian Long,Lu Xiong 678ff1ca-9f68-4d9c-b7ec-767fbbf2d2f0 Offloading Region Matching of Data Distribution Management with CUDA Data distribution management (DDM) aims to reduce the transmission of irrelevant data between High Level Architecture (HLA) compliant simulators by taking their interesting regions into account (i.e. region matching). In a large-scale simulation, computation intensive region matching would have a direct impact on the simulation performance. To deal with the high computation cost of region matching, the whole process of region matching is offloaded to graphical processing units (GPUs) based on Compute Unified Device Architecture (CUDA). http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5416077&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1114_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1114_logo_xplore_large.gif Academia National Tsing Hua University 2010 01 27 01/27/2010 Shih Hsiang Lo Yeh Ching Chung Fang Ping Pai Paper Shih Hsiang Lo,Yeh Ching Chung,Fang Ping Pai 7bcba91a-58ab-41e9-b4f3-7cfa59a1b492 hiCUDA: High-Level GPGPU Programming Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain, driven by improvements in GPU programmability. 
Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5445082&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1113_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1113_logo_xplore_large.gif 2010 04 08 04/08/2010 T. Han T. Abdelrahman Paper T. Han,T. Abdelrahman a27cc5b9-6fe9-40dd-a83c-c76f2b5e3228 Preliminary implementation of VQ image coding using GPGPU GPGPU (general purpose computing on graphic processing unit), which is used for general-purpose computations such as numerical calculations as well as for graphics processing, attracts a great deal of attention. In this paper, as an example of hierarchical clustering algorithms, we evaluate PNN (pairwise nearest neighbor) on GPUs by using CUDA (compute unified device architecture). We also evaluate it from the viewpoint of the power consumption. /content/cudazone/CUDABrowser/assets/images/applications/1111_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1111_logo_xplore_large.gif Academia Konan University 2010 01 09 01/09/2010 A. Wakatani Paper A. Wakatani e8febd87-0451-4a70-b279-4e86b3c46e9b Real Time Simulation of Tissue Cutting Based on GPU and CUDA for Surgical Training A novel approach to the simulation of soft tissue cutting in a virtual reality endoscopic simulator is presented for applications in surgical training and education. This approach is based on an improved mass-spring model and the use of computational geometry. 
A virtual spring is introduced to compensate the weakness of the conventional mass-spring model, and a detection algorithm utilizing decomposition of affine coordinates is adopted to determine the springs that intersect with the cutting plane. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5462450&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1110_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1110_logo_xplore_large.gif 2010 04 23 04/23/2010 Yuanyuan Zhang Zhiyong Yuan Yihua Ding Jianhui Zhao Zhaoliang Duan Mingui Sun Paper Life Sciences Yuanyuan Zhang,Zhiyong Yuan,Yihua Ding 6ae115e8-d829-42f7-a29e-fb2bf1e2ba0a A GPU-enabled solver for time-constrained linear sum assignment problems This paper deals with solving large instances of the Linear Sum Assignment Problems (LSAPs) under realtime constraints, using Graphical Processing Units (GPUs). The motivating scenario is an industrial application for P2P live streaming that is moderated by a central tracker that is periodically solving LSAP instances to optimize the connectivity of thousands of peers. However, our findings are generic enough to be applied in other contexts. Our main contribution is a parallel version of a heuristic algorithm called Deep Greedy Switching (DGS) on GPUs using the CUDA programming language. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5461816&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All /content/cudazone/CUDABrowser/assets/images/applications/1109_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1109_logo_xplore_large.gif Commercial Peerialism, Inc. 
2010 03 28 03/28/2010 Roberto Roverso Amgad Naeiem Mohammed El-Beltagy Sameh El-Ansary Seif Haridi Paper Roberto Roverso,Amgad Naeiem,Mohammed El-Beltagy 883bddba-16b6-4c99-a6e8-8317a72ca1a9 Accelerating H.264 inter prediction in a GPU by using CUDA H.264/AVC defines a very efficient algorithm for the inter prediction but it takes too much time. With the emergence of general purpose graphics processing units (GPGPU), a new door has been opened to support this video algorithm into these small processing units. In this paper, a forward step is developed towards an implementation of the H.264/AVC inter prediction algorithm into a GPU using compute unified device architecture (CUDA). The results show a negligible rate distortion drop with a time reduction on average up to 93.6%. http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5418821 /content/cudazone/CUDABrowser/assets/images/applications/1108_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1108_logo_xplore_large.gif Academia University of Castilla 2010 01 09 01/09/2010 R. Rodriguez J.L. Marttinez G. Fernandez-Escribano J.M. Claver J.L. Sanchez Paper R. Rodriguez,J.L. Marttinez,G. Fernandez-Escribano 58f115c8-1620-40a8-a30a-1bee644e7c5f Porting of an Edge-Based CFD Solver to GPUs Graphics processing units (GPUs) are increasingly becoming a mainstream platform for high performance computational fluid dynamics. This paper describes the porting of a substantial portion of FEFLO, an adaptive, edge-based finite element code for the solution of compressible and incompressible flow, to run on GPUs. The code is primarily written in Fortran 77 and has been ported to vector, shared memory parallel (via OpenMP) and distributed memory parallel (via MPI) machines. 
http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_523.pdf /content/cudazone/CUDABrowser/assets/images/applications/1107_logo_AIAA_portal_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1107_logo_AIAA_portal_large.jpg Academia George Mason University 2010 01 04 01/04/2010 Andrew Corrigan Fernando Camelli Rainald Lohner Paper Andrew Corrigan,Fernando Camelli,Rainald Lohner 22bda164-33bd-4311-9e70-1723cfeaee9b Toward efficient GPU-accelerated N-body simulations N-body algorithms are applicable to a number of common problems in computational physics including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but a series of improvements that support implementation of this solver on highly-data-parallel graphics processing units (GPUs). http://pdf.aiaa.org/preview/CDReadyMASM08_1065/PV2008_608.pdf /content/cudazone/CUDABrowser/assets/images/applications/1102_logo_AIAA_portal_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1102_logo_AIAA_portal_large.jpg Research Applied Scientific Research 2010 01 07 01/07/2010 Mark J. Stock Adrin Gharakhani Paper Science Mark J. Stock,Adrin Gharakhani 732a1a48-f8a5-40a0-ad2b-741fb822af91 Using GPU on HPC Applications to Satisfy Low-Power Computational Requirement The High-performance, low-power computing is required to reduce the computer infrastructure needed for large multi-physics calculations for reactive flow, high-resolution urban aerodynamics, deforming geometry fluid dynamics, etc. If computer infrastructure and costs can be reduced sufficiently, highly accurate calculations currently being performed only in large computer centers can be moved to operational centers and even into the field. 
http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_524.pdf /content/cudazone/CUDABrowser/assets/images/applications/1101_logo_AIAA_portal_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1101_logo_AIAA_portal_large.jpg Research Naval Research Laboratory 2010 01 04 01/04/2010 Gopal Patnaik Keith S. Obenschain Paper Gopal Patnaik,Keith S. Obenschain fdb72cda-8c6c-420f-bdf2-55a91c7427f5 An MPI-CUDA Implementation for Massively Parallel Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi- GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes. http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_522.pdf /content/cudazone/CUDABrowser/assets/images/applications/1100_logo_AIAA_portal_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1100_logo_AIAA_portal_large.jpg Boise State University 2010 01 04 01/04/2010 Dana Jacobsen Julien Thibault Inanc Senocak Paper Dana Jacobsen,Julien Thibault,Inanc Senocak b2c13afd-3503-4af9-9a6f-273f3d7589dc State-of-the-Art in Heterogeneous Computing This extensive survey (33 pages, over 180 references) gives an overview of hardware and software tools for the Cell Broadband Engine, Graphics Processing Units, and Field Programmable Gate Arrays. A qualitative and quantitative comparison is also presented, together with a summary of state-of-the-art approaches to heterogeneous computing. 
Computational Fluid Dynamics,Finance,Imaging,Numerics,Libraries,Oil & Gas,Programming Tools,Science /content/cudazone/CUDABrowser/assets/images/applications/1099_star_heterocomp_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1099_star_heterocomp_large.jpg Research SINTEF ICT and Oak Ridge National Laboratory, Future Technologies Group 2010 05 04 05/04/2010 A. R. Brodtkorb/C. Dyken/T R. Hagen/J. M. Hjelmervik/O. O. Storaasli Paper A. R. Brodtkorb,C. Dyken, T R. Hagen J. M. Hjelmervik and O. O. Storaasli,Andre.Brodtkorb@sintef.no aa8b4f0c-1c87-4251-8fe9-1e035864ce0e Simulation and Visualization of the Saint-Venant System using GPUs This paper describes the efficient implementation of three second order accurate explicit schemes that solve the shallow water equations. The implementation also supports real-time visualization with photorealistic effects. /content/cudazone/CUDABrowser/assets/images/applications/1098_sw_2010_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1098_sw_2010_large.jpg Research SINTEF ICT 2010 02 28 02/28/2010 A. R. Brodtkorb/T. R. Hagen/K.-A. Lie/J. R. Natvig Multimedia Paper Computational Fluid Dynamics Graphics Numerics Science A. R. Brodtkorb,T. R. Hagen,K.-A. Lie and J. R. Natvig,Andre.Brodtkorb@sintef.no d105ce4e-64c5-4075-8ad7-ba5a026484d8 Kappa The primary goal of Kappa is to allow for the creation of sophisticated, powerful, and complex processing that retains simple and easy-to-use interfaces. Kappa provides for creating processes with dynamic sizing, scheduling, and interactive execution for C and CUDA kernels to process data efficiently using the available resources. Kappa provides a library for creating processes to use combinations of CPUs and a GPU for tasks. Within a single host program process, a Kappa process can be created for each CUDA GPU using all GPUs.
Each Kappa process can use all of the multiprocessors of each GPU, share all of the CPUs of the host system, have its own separate namespace, and have its own separate CUDA context. /content/cudazone/CUDABrowser/assets/images/applications/1097_psilambdakappa_small.png /content/cudazone/CUDABrowser/assets/images/applications/1097_psilambdakappa_large.png Commercial Psi Lambda LLC http://psilambda.com 2010 05 03 05/03/2010 Commercial Psi Lambda LLC Application Libraries Programming Tools Psi Lambda LLC,kappa@psilambda.com 8ecf8eac-dcc4-4770-ba77-c7b68b2ec6e6 Cusp: A sparse matrix library for CUDA Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems. /content/cudazone/CUDABrowser/assets/images/applications/1096_cusp_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1096_cusp_logo_large.png Research NVIDIA Research http://research.nvidia.com/ 2010 05 04 05/04/2010 Open source Nathan Bell Michael Garland Code Numerics Libraries Nathan Bell,Michael Garland,nbell@gmail.com 4b17cf89-a211-4fcf-8c9e-32e04272529f Performance and Scalability of GPU-Based Convolutional Neural Networks In this paper we present the implementation of a framework for accelerating training and classification of arbitrary Convolutional Neural Networks (CNNs) on the GPU. CNNs are a derivative of standard Multilayer Perceptron (MLP) neural networks optimized for two-dimensional pattern recognition problems such as Optical Character Recognition (OCR) or face detection. We describe the basic parts of a CNN and demonstrate the performance and scalability improvement that can be achieved by shifting the computation-intensive tasks of a CNN to the GPU. Depending on the network topology training and classification on the GPU performs 2 to 24 times faster than on the CPU. 
Furthermore, the GPU version scales much better than the CPU implementation with respect to the network size. /content/cudazone/CUDABrowser/assets/images/applications/1095_lenet5_small.png /content/cudazone/CUDABrowser/assets/images/applications/1095_lenet5_large.png Academia Distributed and Parallel Systems Group, Institute of Computer Science, University of Innsbruck http://www.dps.uibk.ac.at 2010 02 18 02/18/2010 24 Open source Daniel Strigl Klaus Kofler Stefan Podlipnig Paper Code Imaging Numerics Science Machine Learning Daniel Strigl,Klaus Kofler,Stefan Podlipnig,daniel.strigl@student.uibk.ac.at, klaus.kofler@student.uibk.ac.at,stefan.podlipnig@uibk.ac.at bbc802a8-2ae5-449d-b4c7-059fe0daa3a2 AntiPlanet Reflections AntiPlanet Reflections is first person "doom" style 3D shooter game in fantastic extraterrestrial world, which is built of spheres, shadows and infinite reflections. AntiPlanet scenes are fully dynamic with moving objects and light sources. AntiPlanet uses ray tracing for visualization. It works through CUDA. GT200 architecture performs about 15 times faster than ordinary dual core cpu, and Fermi performs about 45 times faster. /content/cudazone/CUDABrowser/assets/images/applications/1094_sphericalflowers_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1094_sphericalflowers_large.jpg Research virtualray.ru http://www.virtualray.ru 2010 05 06 05/06/2010 30 Commercial Lev Dymchenko Application Graphics Ray Tracing antiplanet real time ray tracing spherical reflections,Lev Dymchenko,lev@virtualray.ru e5354ea1-1f61-4d80-b50d-f4c7d07355b3 Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. 
Perhaps the most well known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences. http://www.ecs.umass.edu/mie/faculty/perot/Programs.htm /content/cudazone/CUDABrowser/assets/images/applications/1093_S-W_small.png /content/cudazone/CUDABrowser/assets/images/applications/1093_S-W_large.png Academia University of Massachusetts, Amherst http://www.umass.edu/ 2010 05 13 05/13/2010 45 Open source Ali Khajeh-Saeed Stephen Poole J. Blair Perot Paper Code Life Sciences Ali Khajeh-Saeed,Stephen Poole,J. Blair Perot,khajehsaeed@ecs.umass.edu,perot@ecs.umass.edu ad98b4cb-6676-46c7-8cdb-c5288f4ab6b0 Nifty_reg Global and local medical image registration using CUDA. The global alignment is based on a block-matching technique and the local warping on a cubic B-spline deformation model. /content/cudazone/CUDABrowser/assets/images/applications/1092_nifty_reg_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1092_nifty_reg_logo_large.png Academia CMIC - University College London http://cmic.cs.ucl.ac.uk/ 2009 09 18 09/18/2009 Open source Marc Modat Pankaj Daga Sebastien Ourselin Paper Code Medical Imaging Marc Modat,marc.modat@gmail.com 38b7a4f8-004f-452b-b5dd-c7bc77b6fca3 Massively parallel Linux laptops, workstations and clusters with CUDA Unleash the GPU within! /content/cudazone/CUDABrowser/assets/images/applications/1090_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1090_logo_acm_portal2_large.jpg 2008 11 01 11/01/2008 Robert Farber Paper Robert Farber b7bcb258-0b02-4468-ad5c-217f93df94fe Multi-target C++ implementation of parallel skeletons This paper presents the design of an efficient multi-target (CPU+GPU) implementation for the Parallel_for skeleton.
Emerging massively parallel architectures promise very high performance at low cost. However, these architectures change faster than ever. Thus, optimization of codes becomes a very complex and time-consuming task. We have identified the data storage as the main difference between the CPU and the GPU implementation of a code. We introduce an abstract data layout in order to adapt the data storage. Based on this layout, the Parallel_for skeleton allows the same program to be compiled and executed on both CPU and GPU. Once compiled, the program runs close to the hardware limits. /content/cudazone/CUDABrowser/assets/images/applications/1089_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1089_logo_acm_portal2_large.jpg Research EDF R&D 2009 07 01 07/01/2009 Wilfried Kirschenmann Laurent Plagne Stephane Vialle Paper Wilfried Kirschenmann,Laurent Plagne,Stephane Vialle ed102912-3ed0-4296-bb43-f1e56474388a Triangular matrix inversion on Graphics Processing Unit Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPU) in a standard PC. Our implementation is based on a divide and conquer type recursive TMI algorithm, efficiently adapted to the GPU architecture.
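The divide and conquer recursion named in the triangular matrix inversion abstract can be illustrated with a small CPU-only sketch. This is not the authors' GPU implementation; it only shows the standard block identity that such recursive schemes typically rely on. The function names `mat_mul` and `invert_lower_triangular` are hypothetical, and a non-singular lower-triangular input stored as a list of rows is assumed.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def invert_lower_triangular(l):
    """Invert a non-singular lower-triangular matrix by recursive blocking.

    Splitting L = [[A, 0], [C, B]] gives
    L^-1 = [[A^-1, 0], [-B^-1 C A^-1, B^-1]],
    so the two diagonal blocks can be inverted independently of each other.
    """
    n = len(l)
    if n == 1:
        return [[1.0 / l[0][0]]]
    h = n // 2
    a = [row[:h] for row in l[:h]]   # top-left block
    c = [row[:h] for row in l[h:]]   # bottom-left block
    b = [row[h:] for row in l[h:]]   # bottom-right block
    a_inv = invert_lower_triangular(a)
    b_inv = invert_lower_triangular(b)
    off = mat_mul(mat_mul(b_inv, c), a_inv)  # B^-1 C A^-1, negated below
    top = [a_inv[i] + [0.0] * (n - h) for i in range(h)]
    bottom = [[-x for x in off[i]] + b_inv[i] for i in range(n - h)]
    return top + bottom
```

The two recursive calls do not depend on each other, and the off-diagonal block reduces to matrix products, which is the kind of work that maps well to data-parallel hardware.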
http://portal.acm.org/citation.cfm?id=1654059.1654069&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1088_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1088_logo_acm_portal2_large.jpg Academia University of Bologna 2009 11 01 11/01/2009 Florian Ries Tommaso De Marco Matteo Zivieri Roberto Guerrieri Paper Florian Ries,Tommaso De Marco,Matteo Zivieri 372d3c3e-bd11-424d-80f2-f54341867325 Fast heterogeneous computing with CUDA compatible Tesla GPU computing processor (personal supercomputing) This paper presents how fast heterogeneous computing can be achieved with Tesla GPU computing processor. Tesla GPU super computer brings the performance of a cluster to a workstation and turning it into a supercomputer. We have chosen molecular dynamics field to show fast and high performance computing with Tesla GPU. We have given a DCS (direct coulomb summation) algorithm for computing electrostatic fields around molecules with CUDA. Tesla GPU speeds up the molecular dynamics application up to 240X. http://portal.acm.org/citation.cfm?id=1741906.1742121&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1085_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1085_logo_acm_portal2_large.jpg Academia Aligarh Muslim University 2010 02 01 02/01/2010 D. Kumar M. A. Qadeer Paper D. Kumar,M. A. Qadeer 51b44143-4a88-4a74-96de-e899a53cb6d2 A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for various neural engineering applications. Spiking Neural Network (SNN) simulators have been traditionally simulated on large-scale clusters, super-computers, or on dedicated hardware architectures. 
Alternatively, Compute Unified Device Architecture (CUDA) Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this paper we demonstrate an efficient, biologically realistic, large-scale SNN simulator that runs on a single GPU. The SNN model includes Izhikevich spiking neurons, detailed models of synaptic plasticity and variable axonal delay. We allow user-defined configuration of the GPU-SNN model by means of a high-level programming interface written in C++ but similar to the PyNN programming interface specification. http://portal.acm.org/citation.cfm?id=1594405.1594470&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1084_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1084_logo_acm_portal2_large.jpg Academia University of California Irvine 2009 07 01 07/01/2009 Jayram Moorkanikara Nageswaran Nikil Dutt Jeffrey L. Krichmar Alex Nicolau Alexander V. Veidenbaum Paper Jayram Moorkanikara Nageswaran,Nikil Dutt,Jeffrey L. Krichmar ff32502d-f153-4df5-9007-fe61af6560c1 Boids that see: Using self-occlusion for simulating large groups on GPUs Behavioral models have been used in the entertainment industry to increase the realism in the simulation of large groups of individuals. Unfortunately, the classical models can be very compute-intensive when very large groups are considered, reducing their applicability in games and other interactive systems. In this article we explore both search space reduction and parallelism to improve the performance of Reynolds' Boids model. We propose a methodology that considers self-occlusion (visibility culling) to reduce the number of neighbors and we take advantage of the parallelism present in common graphics processor units (GPUs) to allow the simulation of very large groups.
We built different GPU implementations (GPGPU and CUDA); the results show that visibility culling allows significant gains in performance without affecting the model's overall behavior. /content/cudazone/CUDABrowser/assets/images/applications/1083_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1083_logo_acm_portal2_large.jpg Academia Universidade Federal de Minas Gerais 2009 12 01 12/01/2009 Alessandro Ribeiro Da Silva Wallace Santos Lages Luiz Chaimowicz Paper Alessandro Ribeiro Da Silva,Wallace Santos Lages,Luiz Chaimowicz 6d68818f-bbb6-4550-a857-32e87a7c5c86 Using common graphics hardware for multi-agent traffic simulation with CUDA Today's graphics processing units (GPU) have tremendous resources when it comes to raw computing power. The simulation of large groups of agents in transport simulation has a huge demand of computation time. Therefore it seems reasonable to try to harvest this computing power for traffic simulation. Unfortunately simulating a network of traffic is inherently connected with random memory access. This is not a domain that the SIMD (single instruction, multiple data) architecture of GPUs is known to work well with. In this paper the authors will try to achieve a speedup by computing multi-agent traffic simulations on the graphics device using NVIDIA's CUDA framework. /content/cudazone/CUDABrowser/assets/images/applications/1081_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1081_logo_acm_portal2_large.jpg Academia TU Berlin 2009 03 01 03/01/2009 David Strippgen Kai Nagel Paper David Strippgen,Kai Nagel 2bedb9a8-538c-458a-8014-36e5d8b1d4dc Taming irregular EDA applications on GPUs Recently general purpose computing on graphic processing units (GPUs) is rising as an exciting new trend in high-performance computing. Thus it is appealing to study the potential of GPU for Electronic Design Automation (EDA) applications.
However, EDA generally involves irregular data structures such as sparse matrix and graph operations, which pose significant challenges for efficient GPU implementations. In this paper, we propose high-performance GPU implementations for two important irregular EDA computing patterns, Sparse-Matrix Vector Product (SMVP) and graph traversal. On a wide range of EDA problem instances, our SMVP implementations outperform all published work and achieve a speedup of one order of magnitude over the CPU baseline. http://portal.acm.org/citation.cfm?id=1687399.1687501&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1080_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1080_logo_acm_portal2_large.jpg Academia Tsinghua University 2009 11 01 11/01/2009 Yangdong Deng Bo David Wang Shuai Mu Paper Yangdong Deng,Bo David Wang,Shuai Mu e601603a-6688-43d2-b74b-2febe1d8dafc CUDA renderer: a programmable graphics pipeline Modern GPUs provide gradually increasing programmability on vertex shader, geometry shader and fragment shader in the past decade. However, many classical problems such as order-independent transparency (OIT), occlusion culling have not yet been efficiently solved using the traditional graphics pipeline. The main reason is that the behavior of the current stage of the pipeline is hard to be determined due to the unpredictable future data. Since the rasterization and blending stage are still largely fixed functions on chip, previous improvements on these problems always require hardware modifications thus remain on the theoretical level.
http://portal.acm.org/citation.cfm?id=1667146.1667189&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1078_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1078_logo_acm_portal2_large.jpg Academia Chinese Academy of Sciences 2009 12 01 12/01/2009 Fan Liu Meng-Cheng Huang Xue-Hui Liu Paper Fan Liu,Meng-Cheng Huang,Xue-Hui Liu e88d024a-05b4-4bba-8641-5fbd42154978 42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs that scale as O(N^2), the present method calculates the O(N log N) treecode and O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method by choosing one standard application --a gravitational N-body simulation-- and one non-standard application --simulation of turbulence using vortex particles. The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops. http://portal.acm.org/citation.cfm?id=1654059.1654123&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951 /content/cudazone/CUDABrowser/assets/images/applications/1075_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1075_logo_acm_portal2_large.jpg Academia Nagasaki University 2009 11 01 11/01/2009 Tsuyoshi Hamada Tetsu Narumi Rio Yokota Paper Tsuyoshi Hamada,Tetsu Narumi,Rio Yokota e78b9ca3-7ebe-4fdb-98fe-c2ea6b67a9d5 LBM based flow simulation using GPU computing processor Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications.
In this paper, we develop a general purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single precision floating point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction. /content/cudazone/CUDABrowser/assets/images/applications/1074_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1074_logo_acm_portal2_large.jpg Academia Universite de Lyon 2010 04 01 04/01/2010 Frederic Kuznik Christian Obrecht Gilles Rusaouen Paper Frederic Kuznik,Christian Obrecht,Gilles Rusaouen 7619b5ce-3009-4e36-b291-af64ec7413fb Parallel GPU-based data-dependent triangulations In this paper we introduce a new technique for data-dependent triangulation which is suitable for implementation on a GPU. Our solution is based on a new parallel version of the well known Lawson's optimization process and is fully compatible with restrictions of the GPU hardware. We test and compare the quality of our solution in an image reconstruction problem. In comparison with the standard implementations we achieve significant speed-up (eight times on average) with comparable quality of the reconstructed image. Further, several other improvements and optimizations are introduced and tested, and the results are discussed in detail.
/content/cudazone/CUDABrowser/assets/images/applications/1073_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1073_logo_acm_portal2_large.jpg Comenius University 2010 04 01 04/01/2010 Michal Ervenansky Zsolt Toth Juraj Starinsky Paper Michal Ervenansky ,Zsolt Toth,Juraj Starinsky 80347ce7-cbf9-4e15-a927-5bab420dff15 Fault Table Computation on GPUs In this paper, we explore the implementation of fault table generation on a Graphics Processing Unit (GPU). A fault table is essential for fault diagnosis and fault detection in VLSI testing and debug. Generating a fault table requires extensive fault simulation, with no fault dropping, and is extremely expensive from a computational standpoint. Fault simulation is inherently parallelizable, and the large number of threads that a GPU can operate on in parallel can be employed to accelerate fault simulation, and thereby accelerate fault table generation. Our approach, called GFTABLE, employs a pattern parallel approach which utilizes both bit-parallelism and thread-level parallelism. Our implementation is a significantly modified version of FSIM, which is pattern parallel fault simulation approach for single core processors. http://portal.acm.org/citation.cfm?id=1773593.1773611&coll=Portal&dl=GUIDE&CFID=88119154&CFTOKEN=11832401 /content/cudazone/CUDABrowser/assets/images/applications/1072_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1072_logo_acm_portal2_large.jpg Academia Texas A and M University 2010 04 01 04/01/2010 Kanupriya Gulati Sunil P. Khatri Paper Kanupriya Gulati,Sunil P. Khatri 8c78d222-202b-4a63-8202-cdb88bbb1977 Small-Ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization We explore the intersection between an emerging class of architectures and a prominent workload: GPGPUs (General-Purpose Graphics Processing Units) and regular expression matching, respectively. 
It is a challenging task because this workload with its irregular, non-coalesceable memory access patterns is very different from the regular, numerical workloads that run efficiently on GPGPUs. http://domino.research.ibm.com/comm/research_people.nsf/pages/scarpazza.pubs.html/$FILE/2010-06-ICS-scarpazza.pdf /content/cudazone/CUDABrowser/assets/images/applications/1067_small-ruleset_small.png /content/cudazone/CUDABrowser/assets/images/applications/1067_small-ruleset_large.png Academia IBM T.J. Watson Research Center / Technische Universitat Braunschweig 2010 01 01 01/01/2010 Jamin Naghmouchi Daniele Paolo Scarpazza Mladen Berekovic Paper Science Jamin Naghmouchi,Daniele Paolo Scarpazza,Mladen Berekovic,j.naghmouchi@us.ibm.com,dpscarpazza@us.ibm.com,berekovic@ida.ing.tu-bs.de 77cbd1bb-51f1-4571-82f2-33999f0dd072 Modeling GPU-CPU workloads and systems Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high performance application for a very specific and well documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote a significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms.
http://portal.acm.org/citation.cfm?id=1735688.1735696&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1066_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1066_logo_acm_portal2_large.jpg Academia Georgia Institute of Technology 2010 03 01 03/01/2010 Andrew Kerr Gregory Diamos Sudhakar Yalamanchili Paper Andrew Kerr,Gregory Diamos,Sudhakar Yalamanchili 658e9e38-21dd-4f5f-9a1a-0da0b5d8df4d Cortical architectures on a GPGPU As the number of devices available per chip continues to increase, the computational potential of future computer architectures grows likewise. While this is a clear benefit for future computing devices, future chips will also likely suffer from more faulty devices and increased power consumption. It is also likely that these chips will be difficult to program if the current trend of adding more parallel cores continues to follow in the future. However, recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. http://portal.acm.org/citation.cfm?id=1735688.1735693&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1065_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1065_logo_acm_portal2_large.jpg Academia University of Wisconsin Madison 2010 03 01 03/01/2010 Andrew Nere Mikko Lipasti Paper Andrew Nere,Mikko Lipasti 763cc86a-ccf2-4b2b-bd42-276c55734d18 A simulation of large-scale groundwater flow on CUDA-enabled GPUs This paper presents a simulation method for large-scale groundwater flow on CUDA-enabled GPUs. The discretization method for a three-dimensional groundwater flow equation is introduced. 
When using the preconditioned Conjugate Gradient algorithm to solve the discretized equation, the implementing methods for the sparse matrix-vector multiplication and the vector inner product on CUDA-enabled GPUs are given. The experimental results show that GPUs can speed up the groundwater simulation significantly. /content/cudazone/CUDABrowser/assets/images/applications/1064_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1064_logo_acm_portal2_large.jpg Academia China University of Geosciences 2010 03 01 03/01/2010 Xiaohui Ji Tangpei Cheng Qun Wang Paper Science Xiaohui Ji,Tangpei Cheng,Qun Wang 83d40bdb-c97a-4e07-ad48-646ada77716b A symbolic verifier for CUDA programs We present a preliminary automated verifier based on mechanical decision procedures which is able to prove functional correctness of CUDA programs and guarantee to detect bugs such as race conditions. We also employ a symbolic partial order reduction (POR) technique to mitigate the interleaving explosion problem. /content/cudazone/CUDABrowser/assets/images/applications/1063_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1063_logo_acm_portal2_large.jpg Academia University of Utah 2010 01 01 01/01/2010 Guodong Li Ganesh Gopalakrishnan Robert Kirby Paper Guodong Li,Ganesh Gopalakrishnan,Robert Kirby 2fc58876-4ace-47d4-b181-1164b718426b A breadth-first course in multicore and manycore programming The technique of scaling hardware performance through increasing the number of cores on a chip requires programmers to learn to write parallel code that can exploit this hardware. In order to expose students to a variety of multicore programming models, our university offered a breadth-first introduction to multicore and manycore programming for upper-level undergraduates. 
http://portal.acm.org/citation.cfm?id=1734263.1734339&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1062_cover_thumbbreadth-first_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1062_cover_thumbbreadth-first_large.jpg Academia Sonoma State University 2010 03 01 03/01/2010 Suzanne Rivoire Paper Suzanne Rivoire 84257f90-4151-49ba-96a8-86a45d23e3be CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences Biological sequence comparison is a very important operation in Bioinformatics. Even though there do exist exact methods to compare biological sequences, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Nevertheless, all of them restrict the size of the smallest sequence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUDAlign, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested in two GPU boards, separately. For real sequences whose size range from 1MBP (Megabase Pairs) to 47MBP, a close to uniform GCUPS (Giga Cells Updates per Second) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes are compared with an exact method. 
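The CUDAlign abstract refers to an exact Smith-Waterman variant with affine gaps and reports throughput in GCUPS. As a rough CPU-only illustration (not CUDAlign itself), the sketch below computes the local alignment score with the standard Gotoh recurrences, keeping only one previous row so memory stays linear, and defines GCUPS as matrix cells per second divided by 10^9. The function names and the default scoring parameters are assumptions chosen for the example, not values taken from the paper.

```python
def smith_waterman_affine(s, t, match=1, mismatch=-3, gap_open=5, gap_extend=2):
    """Return the best local alignment score with affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Only the previous row is stored, so memory is linear in len(t).
    """
    neg = float("-inf")
    h_prev = [0] * (len(t) + 1)    # H values of the previous row
    f_prev = [neg] * (len(t) + 1)  # F values (vertical gaps) of the previous row
    best = 0
    for i in range(1, len(s) + 1):
        h_cur = [0] * (len(t) + 1)
        f_cur = [neg] * (len(t) + 1)
        e = neg  # E value (horizontal gap) carried along the current row
        for j in range(1, len(t) + 1):
            e = max(e - gap_extend, h_cur[j - 1] - gap_open)
            f_cur[j] = max(f_prev[j] - gap_extend, h_prev[j] - gap_open)
            diag = h_prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            h_cur[j] = max(0, diag, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


def gcups(len_s, len_t, seconds):
    """Giga cell updates per second for a len_s x len_t alignment matrix."""
    return len_s * len_t / (seconds * 1e9)
```

Every cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of each other; this is the property GPU implementations of the algorithm commonly exploit.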
http://portal.acm.org/citation.cfm?id=1693453.1693473&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1060_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1060_logo_acm_portal2_large.jpg Academia University of Brasilia 2010 01 01 01/01/2010 Edans Flavius O. Sandes Alba Cristina M.A. de Melo Paper Edans Flavius O. Sandes,Alba Cristina M.A. de Melo d2d92ca7-3344-4f47-9039-c3a203ed52cc Accelerating MATLAB Image Processing Toolbox functions on GPUs In this paper, we present our effort in developing an open-source GPU (graphics processing units) code library for the MATLAB Image Processing Toolbox (IPT). We ported a dozen of representative functions from IPT and based on their inherent characteristics, we grouped these functions into four categories: data independent, data sharing, algorithm dependent and data dependent. For each category, we present a detailed case study, which reveals interesting insights on how to efficiently optimize the code for GPUs and highlight performance-critical hardware features, some of which have not been well explored in existing literature. Our results show drastic speedups for the functions in the data-independent or data-sharing category by leveraging hardware support judiciously; and moderate speedups for those in the algorithm-dependent category by careful algorithm selection and parallelization. For the functions in the last category, fine-grain synchronization and data-dependency requirements are the main obstacles to an efficient implementation on GPUs. 
http://portal.acm.org/citation.cfm?id=1735688.1735703&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1058_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1058_logo_acm_portal2_large.jpg Academia University of Central Florida 2010 03 01 03/01/2010 Jingfei Kong Martin Dimitrov Yi Yang Paper Jingfei Kong,Martin Dimitrov,Yi Yang be8a567d-256c-453b-b4f5-1112db68abcc The Scalable Heterogeneous Computing (SHOC) benchmark suite Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard. http://portal.acm.org/citation.cfm?id=1735688.1735702&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1057_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1057_logo_acm_portal2_large.jpg Academia University of Tennessee 2010 03 01 03/01/2010 Anthony Danalis Gabriel Marin Collin McCurdy Paper Anthony Danalis,Gabriel Marin,Collin McCurdy 0b3ddfaa-9b4f-47d2-8f15-c7901994cd34 FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects In the past decade, modern GPUs have provided increasing programmability with vertex, geometry and fragment shaders. 
However, many classical problems have not been efficiently solved using the current graphics pipeline where some stages are still fixed functions on chip. In particular, multi-fragment effects, especially order-independent transparency, require programmability of the blending stage, which makes them difficult to solve in a single geometry pass. In this paper we present FreePipe, a system for programmable parallel rendering that can run entirely on current graphics hardware and has performance comparable with the traditional graphics pipeline. Within this framework, two schemes for the efficient rendering of multi-fragment effects in a single geometry pass have been developed by exploiting CUDA atomic operations. Both schemes have achieved significant speedups compared to the state-of-the-art methods that are based on traditional graphics pipelines. http://portal.acm.org/citation.cfm?id=1730804.1730817&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1056_cover_thumbfreepipe_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1056_cover_thumbfreepipe_large.jpg Academia Chinese Academy of Sciences 2010 02 01 02/01/2010 Fang Liu Meng-Cheng Huang Xue-Hui Liu Paper Fang Liu,Meng-Cheng Huang,Xue-Hui Liu bfecf3ec-363f-460f-b5f2-d4bc7599ca9e Accelerating the local outlier factor algorithm on a GPU for intrusion detection systems The Local Outlier Factor (LOF) is a very powerful anomaly detection method available in machine learning and classification. The algorithm defines the notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned an LOF which represents the likelihood of that object being an outlier. 
Although this concept of a local outlier is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor queries, and this computational overhead can limit the use of LOF. http://portal.acm.org/citation.cfm?id=1735688.1735707&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1055_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1055_logo_acm_portal2_large.jpg Academia Northeastern University 2010 03 01 03/01/2010 Malak Alshawabkeh Byunghyun Jang David Kaeli Paper Malak Alshawabkeh,Byunghyun Jang,David Kaeli 1e349053-7440-46a2-9958-5f49940f6846 An asymmetric distributed shared memory model for heterogeneous parallel systems Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory. http://portal.acm.org/citation.cfm?id=1736020.1736059&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1054_Asplos XV_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1054_Asplos XV_large.jpg Academia Universitat Politecnica de Catalunya 2010 03 01 03/01/2010 Isaac Gelado John E. Stone Javier Cabezas Paper Isaac Gelado,John E. Stone,Javier Cabezas 0944eab1-8c6f-4ba4-a258-14912057a27e Teaching design & analysis of multi-core parallel algorithms using CUDA One of the dominant trends in microprocessor architecture in recent years is continually increasing chip-level parallelism. However, many undergraduate curriculums, especially at small schools, do not offer courses that focus on the design and analysis of multi-threaded algorithms for multi-core processors. 
The courses that are offered address the theoretical aspects of parallel system design, but often fail to provide students with the opportunity to develop and evaluate distributed applications in real-world environments. As a result, undergraduate students are not as prepared as they should be for graduate study or careers in industry. http://portal.acm.org/citation.cfm?id=1734797.1734800&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1053_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1053_logo_acm_portal2_large.jpg Lamar University 2010 04 01 04/01/2010 Quoc-Nam Tran Paper Quoc-Nam Tran 422d4865-5b1d-4be0-82d2-fc9a795befe8 High-performance cone beam reconstruction using CUDA compatible GPUs Compute unified device architecture (CUDA) is a software development platform that allows us to run C-like programs on the nVIDIA graphics processing unit (GPU). This paper presents an acceleration method for cone beam reconstruction using CUDA compatible GPUs. The proposed method accelerates the Feldkamp, Davis, and Kress (FDK) algorithm using three techniques: (1) off-chip memory access reduction for saving the memory bandwidth; (2) loop unrolling for hiding the memory latency; and (3) multithreading for exploiting multiple GPUs. We describe how these techniques can be incorporated into the reconstruction code. 
http://portal.acm.org/citation.cfm?id=1750592.1750768&coll=GUIDE&dl=GUIDE&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1051_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1051_logo_acm_portal2_large.jpg Academia Graduate School of Information Science and Technology 2010 02 01 02/01/2010 24 Yusuke Okitsu Fumihiko Ino Kenichi Hagihara Paper Science Yusuke Okitsu,Fumihiko Ino,Kenichi Hagihara 6e20e119-f8dd-4ca4-b7f3-8afc56248978 Computational visual attention systems and their cognitive foundations: A survey Based on concepts of the human visual system, computational visual attention systems aim to detect regions of interest in images. Psychologists, neurobiologists, and computer scientists have investigated visual attention thoroughly during the last decades and profited considerably from each other. However, the interdisciplinarity of the topic holds not only benefits but also difficulties: Concepts of other fields are usually hard to access due to differences in vocabulary and lack of knowledge of the relevant literature. http://portal.acm.org/citation.cfm?id=1658349.1658355&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1047_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1047_logo_acm_portal2_large.jpg Academia Rheinische Friedrich-Wilhelms Universitat http://www3.uni-bonn.de/ 2010 01 01 01/01/2010 Simone Frintrop Erich Rome Henrick I. Christensen Paper Life Sciences Simone Frintrop,Erich Rome,Henrick I. Christensen da549e23-701a-44e5-b861-5f34fcfb1b47 Iterative induced dipoles computation for molecular mechanics on GPUs In this work, we present a first step towards the efficient implementation of polarizable molecular mechanics force fields with GPU acceleration. 
The computational bottleneck of such applications is found in the treatment of electrostatics, where higher-order multipoles and a self-consistent treatment of polarization effects are needed. We have coded these sections, for the case of a non-periodic simulation, with the CUDA programming model. Results show a speedup factor of 21 for a single precision GPU implementation, when compared to the serial CPU version. A discussion of the optimization and parameterization steps is included. A comparison between different graphics cards and a shared memory parallel CPU implementation is also given. The current work demonstrates the potential usefulness of GPU programming in accelerating this field of applications. http://portal.acm.org/citation.cfm?id=1735688.1735708&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1046_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1046_logo_acm_portal2_large.jpg Research INESC-ID Lisboa http://www.gsd.inesc-id.pt/ 2010 03 01 03/01/2010 21 Frederico Pratas Ricardo Mata Leonel Sousa Paper Science Frederico Pratas,Ricardo Mata,Leonel Sousa cc358d86-c5ab-4c38-ad0a-21fefd4037d4 Exploring NVIDIA-CUDA for video coding http://portal.acm.org/citation.cfm?id=1730836.1730839&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1045_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1045_logo_acm_portal2_large.jpg Academia Florida Atlantic University www.fau.edu 2010 02 01 02/01/2010 100 Aleksandar Colic Hari Kalva Borko Furht Paper Video and Audio Aleksandar Colic,Hari Kalva,Borko Furht 6067f8f4-55fe-430b-8ca9-2a857b264b33 Thermal analysis of multiprocessor SoC applications by simulation and verification Overheating of computer chips leads to degradation of performance and reliability. 
Therefore, preventing chips from overheating in spite of increased performance requirements has emerged as a major challenge. Since the cost of cooling has been rising steadily, various architecture and application design techniques are used to prevent chip overheating. Temperature-aware task scheduling has emerged as an important application design methodology for addressing this problem in multiprocessor SoC systems. http://portal.acm.org/citation.cfm?id=1698759.1698765&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1042_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1042_logo_acm_portal2_large.jpg Academia Indian Institute of Technology http://www.iitkgp.ac.in/ 2010 02 01 02/01/2010 Dipankar Das P. P. Chakrabarti Rajeev Kumar Paper Science Dipankar Das,P. P. Chakrabarti,Rajeev Kumar bc4ed1f4-eb4d-47bf-8564-801f90e87a6b A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming languages, and GPU hardware instance, that it is possible to achieve significant improvements in price/performance and energy/performance over general purpose processors. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability. 
http://portal.acm.org/citation.cfm?id=1735688.1735698&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1041_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1041_logo_acm_portal2_large.jpg Research Reservoir Labs https://www.reservoir.com/ 2010 03 01 03/01/2010 Allen Leung Nicolas Vasilache Benoit Meister Paper Science Allen Leung,Nicolas Vasilache,Benoit Meister 6fdd88c3-83c9-49e8-a50b-0d6ccd284b6b Performance analysis of accelerated image registration using GPGPU This paper presents a performance analysis of an accelerated 2-D rigid image registration implementation that employs the Compute Unified Device Architecture (CUDA) programming environment to take advantage of the parallel processing capabilities of NVIDIA's Tesla C870 GPU. We explain the underlying structure of the GPU implementation and compare its performance and accuracy against a fast CPU-based implementation. http://portal.acm.org/citation.cfm?id=1513895.1513900&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1037_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1037_logo_acm_portal2_large.jpg Academia University of Notre Dame http://www.nd.edu/ 2009 03 08 03/08/2009 Peter Bui Jay Brockman Paper Medical Imaging CUDA, GPGPU, image registration, performance analysis,Peter Bui,Jay Brockman e3ebcf2d-15c3-41b0-b4a1-6191ac30209b Design and implementation of the software architecture for a 3-D reconstruction system in medical imaging The design and implementation of the reconstruction system in medical X-ray imaging is a challenging issue due to its immense computational demands. In order to ensure an efficient clinical workflow it is inevitable to meet high performance requirements. Hence, the usage of hardware acceleration is mandatory. 
The software architecture of the reconstruction system is required to be modular in the sense that different accelerator hardware platforms are supported, and it must be possible to implement different parts of the algorithm using different acceleration architectures and techniques. http://portal.acm.org/citation.cfm?id=1368088.1368181&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264 /content/cudazone/CUDABrowser/assets/images/applications/1036_logo_acm_portal2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1036_logo_acm_portal2_large.jpg Academia University of Erlangen-Nuremberg http://www.uni-erlangen.org/ 2008 05 18 05/18/2008 Holger Scherl Stefan Hoppe Markus Kowarschik Paper Medical Imaging Holger Scherl,Stefan Hoppe,Markus Kowarschik 1df75b6e-b484-4233-8742-0cc83e34dfc8 Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth on NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it but at the cost of reduced performance. 
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WKJ-4VHSDG0-1&_user=10&_coverDate=05%2F31%2F2009&_alid=1315357454&_rdoc=3&_fmt=high&_orig=search&_cdi=6908&_sort=r&_docanchor=&view=c&_ct=791&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=cf0dc7f5e75630ed58513933e89e2835 /content/cudazone/CUDABrowser/assets/images/applications/1033_sciencedirect_small.png /content/cudazone/CUDABrowser/assets/images/applications/1033_sciencedirect_large.png Academia Universite de Pau et des Pays de lAdour 2009 05 01 05/01/2009 25 Dimitri Komatitsch David Michea Gordon Erlebacher Paper Science Dimitri Komatitsch,David Michea,Gordon Erlebacher 6ae2dac9-06ab-4893-b7a0-89da5330df36 Realtime free surface fluid simulation and visualization Implementation of a free surface fluid simulation and visualization using the Lattice Boltzmann method. OpenCL 1.0 is used for the fluid simulation and free surface handling while OpenGL is used for visualization of the refractions and caustics. /content/cudazone/CUDABrowser/assets/images/applications/1032_2010_04_05_free_beer_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1032_2010_04_05_free_beer_large.jpg CUDA Developer 2010 06 05 06/05/2010 Open Source Martin Schreiber Application Multimedia Code Computational Fluid Dynamics Game Physics Graphics Numerics Science Martin Schreiber,schreiberx@gmail.com 3bbb04a8-19f7-4823-9e08-1d712a548449 Palo GPU Business Intelligence Online Analytical Processing (OLAP) is a core technology in Business Intelligence and Corporate Performance Management, allowing users to navigate and explore corporate data (usually extracted from a data warehouse) and to roll up or drill down along different hierarchical levels. Palo enables OLAP reporting and analysis directly inside Excel spreadsheets. The GPU speeds up multidimensional aggregation queries for real-time interactive analyses. 
/content/cudazone/CUDABrowser/assets/images/applications/1031_suite-screen2-lg_small.png /content/cudazone/CUDABrowser/assets/images/applications/1031_suite-screen2-lg_large.png Commercial Jedox 2010 03 31 03/31/2010 40 Commercial Tobias Lauer Leif Mergener Application Multimedia Business Intelligence Leif Mergener,leif.mergener@jedox.com,OLAP,multidimensional aggregation 487089bf-2127-4c4c-9f6c-12cf40551ab4 Real-Time Multi-Agent Path Planning on Arbitrary Surfaces Path planning is an active topic in the literature, and efficient navigation over non-planar surfaces is an open research question. In this work we present a novel technique for navigation of multiple agents over arbitrary triangular domains. The proposed solution uses a fast hierarchical computation of geodesic distances over triangular meshes to allow interactive frame rates, and a GPU-based collision avoidance technique to guide individual agents. Unlike most previous work, the method imposes no limitations on the surface over which the agents are moving, and can naturally deal with non-planar meshes of arbitrary genus and curvature. Moreover, the implementation is a hybrid CPU/GPU algorithm that explores the current trend of increasing the number of CPU cores and GPU programmability. This approach exploits the best qualities in each processor, thus achieving very high performance. /content/cudazone/CUDABrowser/assets/images/applications/1030_teaser_small.png /content/cudazone/CUDABrowser/assets/images/applications/1030_teaser_large.png http://www.inf.ufrgs.br/ UFRGS and NVIDIA Academia 2010 02 21 02/21/2010 Rafael P. Torchelsen Luiz F. Scheidegger Guilherme N. Oliveira Rui Bastos Joao L. D. Comba Multimedia Paper Presentation Graphics Games Path Planning, Multi-Agents, Games,Rafael P. Torchelsen,Luiz F. Scheidegger,Guilherme N. Oliveira, Rui Bastos and Joao L. D. 
Comba ,rafael.torchelsen@gmail.com c6d19cfa-dfde-45ba-a8f3-c92550306e67 High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning Motivated by the high computation power and low price-to-performance ratio of GPUs, GPU accelerated clusters are being built for high performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster, where each cluster node has multiple GPUs. Basic computations of the solver are performed on GPUs and communications are managed by the CPU. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver selects the fastest among several high performance kernels running on GPUs. In a GPU-extended cluster, it is more difficult than in traditional CPU clusters to obtain scalability, since GPUs are very fast compared to CPUs. Since computation on GPUs is faster, GPU-extended clusters demand faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing for parallel sparse iterative solvers. We implement a hierarchical partitioning model which better optimizes for the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes. 
/content/cudazone/CUDABrowser/assets/images/applications/1029_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/1029_implementation_large.png Academia Tokyo Institute of Technology http://matsu-www.is.titech.ac.jp/ 2010 04 02 04/02/2010 Ali Cevahir Akira Nukada Satoshi Matsuoka Paper Numerics Science Ali Cevahir,Akira Nukada,Satoshi Matsuoka,ali@matsulab.is.titech.ac.jp ebf74770-43d6-4df3-9ad0-7caf33e71e45 Best-effort semantic document search on GPUs Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4GHz processor. We ascribe the less than spectacular speedup to a mismatch in the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm called SSI) has interesting characteristics -- the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, we observed that the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the "forgiving nature" of the algorithm. These unique characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. 
In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI to leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping concepts on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features supported by a GPU. Our new techniques dramatically enhance the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our novel performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X acceleration as compared to a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8-cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multi-core CPU that also uses data dependency relaxation and dropping computation techniques, our enhanced GPU implementation is 5X faster. /content/cudazone/CUDABrowser/assets/images/applications/1027_object3_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1027_object3_large.jpg Research NEC Labs America 2010 03 14 03/14/2010 14 Surendra Byna Jiayuan Meng Anand Raghunathan Srimat Chakradhar Srihari Cadambi Paper Life Sciences Surendra Byna,Jiayuan Meng,Anand Raghunathan, Srimat Chakradhar, Srihari Cadambi ,sbyna@nec-labs.com b71710f7-ab68-4e1a-a418-109bcd152445 rCUDA The rCUDA Framework enables the concurrent usage of CUDA-compatible devices remotely. 
/content/cudazone/CUDABrowser/assets/images/applications/1026_bluebreeze_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1026_bluebreeze_logo_large.png Academia Universidad Politecnica de Valencia and Universidad Jaume I 2010 04 01 04/01/2010 Open source The rCUDA team Application Programming Tools Remote CUDA,The rCUDA team,apenya@gap.upv.es 7a68e84c-71bb-4325-9bba-94e4f01277e1 High Performance Finite Difference PDE Solvers on GPUs We show how to implement highly efficient GPU solvers for one dimensional PDEs based on finite difference schemes. The typical use case is to price a large number of similar or related derivatives in parallel. Application scenarios include market making, real time pricing, and risk management. The tridiagonal systems in the backward propagation of a finite difference scheme are solved with parallel cyclic reduction. This is a fine-grained parallel tridiagonal solver, which is well adapted to the hierarchical architecture of a modern GPU. We explain in detail the calculation workflow and study the performance of the solver relative to a well optimized CPU implementation. Our timings demonstrate performance improvement factors of 25 on a single GPU and 38 on two GPUs. /content/cudazone/CUDABrowser/assets/images/applications/1025_parallel_cyclic_reduction_small.png /content/cudazone/CUDABrowser/assets/images/applications/1025_parallel_cyclic_reduction_large.png Commercial QuantAlea GmbH 2010 02 28 02/28/2010 38 Daniel Egloff Paper Finance Numerics Daniel Egloff,daniel.egloff@quantalea.net b5acd6e0-1899-49f0-b8c8-503be97f7afc CNS: a GPU-based framework for simulating cortically-organized networks A general GPU-based framework for the fast simulation of "cortically-organized" networks, defined as networks consisting of n-dimensional layers of similar cells. 
/content/cudazone/CUDABrowser/assets/images/applications/1024_cns_small.png /content/cudazone/CUDABrowser/assets/images/applications/1024_cns_large.png Academia Center for Biological & Computational Learning (CBCL) at MIT http://cbcl.mit.edu 2010 03 06 03/06/2010 80 Open source Jim Mutch Application Paper Code Programming Tools Science Computational Neuroscience Jim Mutch,jmutch@mit.edu 3c9da880-2acb-4ca9-984e-75abcca19b77 Massively parallel forward modeling of scalar and tensor gravimetry data Geophysical modeling code for calculating the first and second derivative of the gravitational potential for a 3D mass distribution. /content/cudazone/CUDABrowser/assets/images/applications/1023_logo_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1023_logo_large.gif Academia IFM-GEOMAR 2010 03 24 03/24/2010 40 Open source Max Moorkamp M. Jegen A. Roberts and R. Hobbs Paper Code Oil & Gas Science Max Moorkamp,M. Jegen,A. Roberts and R. Hobbs,mmoorkamp@ifm-geomar.de e06f5a43-2233-44c5-af51-7fd7c4cffe23 Simulation Game of Life on GPU. A simulation of the Game of Life on the GPU via CUDA. It uses a shared memory technique. /content/cudazone/CUDABrowser/assets/images/applications/1022_GPUgameOfLife_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1022_GPUgameOfLife_large.gif http://www.cs.tu.ac.th/ Thammasat University, Thailand 2010 03 25 03/25/2010 8 Open source Treepop Sunpetchniyom Code Life Sciences Science Simulation , Game of Life,Treepop Sunpetchniyom,treepop.sunpetchniyom@nectec.or.th 20e2b57e-43ee-4681-aa07-d1c24ed1ebda Incompressible Flow Computations on the NCSA Lincoln Tesla Cluster We pursue MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). 
We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations. /content/cudazone/CUDABrowser/assets/images/applications/1021_OKC_downtown_small.png /content/cudazone/CUDABrowser/assets/images/applications/1021_OKC_downtown_large.png Academia Boise State University 2010 03 25 03/25/2010 Jacobsen Thibault Senocak Paper Computational Fluid Dynamics Numerics CFD, CUDA, incompressible flow, MPI, GPU cluster,Jacobsen,Thibault,Senocak,danajacobsen@u.boisestate.edu,tchetchenko@gmail.com,senocak@boisestate.edu 8597944c-e271-43f7-aa6b-dd50f497ec38 ANDSolver ANDSolver solves the compressible Euler equations on unstructured polyhedral meshes. /content/cudazone/CUDABrowser/assets/images/applications/1020_page2_1_small.png /content/cudazone/CUDABrowser/assets/images/applications/1020_page2_1_large.png Commercial Palix Technologies http://www.palixtech.com 2010 03 23 03/23/2010 10 Commercial Palix Technologies Application Computational Fluid Dynamics Palix Technologies,info@palixtech.com b7ebd84b-4dfe-49c9-8ca7-64c913d30343 NBSymple: a symplectic N-body code for astrophysical simulations using TESLA GPUs NBSymple is a brand new parallel code which exploits the joint performance of multicore CPUs and GPUs by means of OpenMP and CUDA, respectively. It performs numerical integration of the equations of motion of a set of N particles interacting via Newtonian gravitational forces. The time integration is done by a high precision algorithm, which guarantees time reversibility and excellent energy conservation. 
We tested the code in various cases, making use of single precision and double precision arithmetic, as well as of a software "double-single" precision which seems a good compromise between precision and speed on TESLA C1060 GPUs. /content/cudazone/CUDABrowser/assets/images/applications/1019_macchina_small.png /content/cudazone/CUDABrowser/assets/images/applications/1019_macchina_large.png Academia Dep. of Physics, Univ. of Roma "La Sapienza", Roma, Italy http://www.uniroma1.it 2010 03 17 03/17/2010 Roberto CAPUZZO-DOLCETTA Alessandra MASTROBUONO -BATTISTI Paper Code Numerics Astrophysics astrophysics; N-body simulations; symplectic schemes,Roberto CAPUZZO-DOLCETTA Alessandra MASTROBUONO -BATTISTI,roberto.capuzzodolcetta@uniroma1.it a5518c10-3337-449f-aa4f-2b67aedd8636 Accelerating Biomedical Signal Processing Algorithms with Parallel Programming on Graphic Processor Units This paper investigates the benefits derived by adopting the use of Graphics Processing Unit (GPU) parallel programming in the field of biomedical signal processing. The differences in execution time when computing the Correlation Dimension (CD) of multivariate neurophysiological recordings and the Skin Conductance Level (SCL) are reported by comparing several common programming environments. Moreover, as indicated in this study, the combination of parallel programming with special design techniques dealing with memory management issues such as data transfer between device memory and GPU may further accelerate the processing speed. The reduction in execution time achieved by means of proper parallel architecture design may reach a factor of 29 in comparison with pure C language. Therefore, a parallel GPU programming environment may be beneficial for numerous biomedical applications within the sphere of biosignal processing. 
/content/cudazone/CUDABrowser/assets/images/applications/1018_biosignal_analysis_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1018_biosignal_analysis_large.jpg Academia Lab of Medical Informatics, Medical School, Aristotle University of Thessaloniki, Greece http://lomiweb.med.auth.gr/gan/ 2009 11 05 11/05/2009 29 Konstantinidis Evdokimos Frantzidis Christos Panagiotis Bamidis Paper Science Signal Processing Konstantinidis Evdokimos,Frantzidis Christos,Panagiotis Bamidis,evdokimosk@gmail.com,christos.frantzidis@gmail.com,bamidis@med.auth.gr 608f2d59-cd5f-43a0-ac09-c0741f149014 B Flash Finder This program harnesses the power of NVIDIA's CUDA technology to get fast search results from an entire hard drive in seconds. /content/cudazone/CUDABrowser/assets/images/applications/1017_flashfinder_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1017_flashfinder_large.jpg Academia Psyzone http://www.psyzone.co.in 2010 03 16 03/16/2010 Bhairav Pardiwala Application Files Search Search,files/folders,fast search,Bhairav Pardiwala,bhairavpardiwala@gmail.com 78387568-da7b-4cf6-b23c-504b45e212e4 Allinea DDT Graphical debugger for NVIDIA CUDA - parallel and sequential code /content/cudazone/CUDABrowser/assets/images/applications/1016_Allineaddt_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1016_Allineaddt_large.jpg Commercial Allinea Software Ltd. http://www.allinea.com 2010 03 23 03/23/2010 Commercial Allinea Software Application Paper Programming Tools David Lecomber,david@allinea.com b63787c7-a224-436e-92da-515a2bc7d015 A GPU approach to FDTD for radio coverage prediction A well known approach to compute radio wave propagation is the Finite-Difference Time-Domain (FDTD) model which solves the Maxwell equations on a discrete grid. With the development of new programmable graphics hardware, novel solutions to compute electromagnetics are already being implemented on GPUs. 
In this paper a GPU implementation of FDTD is developed that achieves a speedup of more than 100X over a Matlab implementation running on an AMD Athlon 64 X2 dual-core 4600. /content/cudazone/CUDABrowser/assets/images/applications/1015_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/1015_logo_xplore_large.gif Academia University of Bedfordshire, Luton, Bedfordshire, UK 2009 01 06 01/06/2009 100 Alvaro Valcarce Guillaume De La Roche Jie Zhang Paper Signal Processing Alvaro Valcarce,Guillaume De La Roche,Jie Zhang 4f6fb7c7-a80f-473f-9f52-0022e61724e9 gVirtuS: A GPGPU transparent virtualization component gVirtuS tries to fill the gap between in-house hosted computing clusters, equipped with GPGPU devices, and pay-for-use high performance virtual clusters deployed via public or private computing clouds. gVirtuS allows an instanced virtual machine to access GPGPUs in a transparent way, with an overhead slightly greater than a real machine/GPGPU setup. gVirtuS is hypervisor independent, and, even though it currently virtualizes NVIDIA CUDA based GPUs, it is not limited to a specific brand technology. The performance of the components of gVirtuS is assessed through a suite of tests in different deployment scenarios, such as providing GPGPU power to cloud computing based HPC clusters and sharing remotely hosted GPGPUs among HPC nodes. /content/cudazone/CUDABrowser/assets/images/applications/1014_gVirtuS_small.png /content/cudazone/CUDABrowser/assets/images/applications/1014_gVirtuS_large.png UniParthenope Open Source Lab http://osl.uniparthenope.it 2010 03 05 03/05/2010 Giulio Giunta Raffaele Montella Giuseppe Agrillo Giuseppe Coviello Code Giulio Giunta,Raffaele Montella,Giuseppe Agrillo,giulio.giunta@uniparthenope.it,raffaele.montella@uniparthenope.it,giuseppe.agrillo@uniparthenope.it 1f219825-5344-4682-9cc8-d329d28f8eb3 Volmaster FX A complete pricing tool for FX derivatives delivering advanced stochastic volatility models natively. 
Wide range of exotics covered. Thanks to innovative proprietary pricing techniques implemented with CUDA, Volmaster can achieve unrivalled computational speed. Delivered via web (software-as-a-service), Volmaster runs instantly on any desktop with a click-and-go distribution model. /content/cudazone/CUDABrowser/assets/images/applications/1013_logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1013_logo_large.jpg Commercial Volmaster B.V. http://www.volmaster.com 2010 03 05 03/05/2010 Commercial Stefano Silvano Application Finance fx option derivative pricing stochastic volatility exotic vanilla greeks numerical skew,Stefano Silvano,info@volmaster.com e0663fcb-55fe-4b9b-886b-0c7305d789df Exploring utilisation of GPU for database applications This study is devoted to exploring possible applications of GPU technology for acceleration of the database access. We use the n-gram based approximate text search engine as a test bed for GPU based acceleration algorithms. Two solutions - hybrid CPU/GPU and pure GPU algorithms for query processing are studied and compared with the baseline CPU algorithm as well as with the optimized versions of the CPU algorithm. The hybrid algorithm performs poorly on most queries and only modest acceleration is achievable for long queries with high error level. On the other hand speedups up to 18 times were achieved for pure GPU algorithm. Application of the GPU acceleration for more general data base problems is discussed. /content/cudazone/CUDABrowser/assets/images/applications/1012_chemsearch_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/1012_chemsearch_logo_large.png Academia Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw http://www.icm.edu.pl 2010 03 09 03/09/2010 18 S. Walkowiak K. Wawruch L. Ligowski Paper Science Databases S. Walkowiak,K. Wawruch,L. 
Ligowski,S.Walkowiak@icm.edu.pl,L.Ligowski@icm.edu.pl,W.Rudnicki@icm.edu.pl 09210fbd-d990-4fd2-a864-aeb2dc8291eb Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude. 
/content/cudazone/CUDABrowser/assets/images/applications/1011_TPDS_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1011_TPDS_large.jpg Academia TU Dortmund and Max Planck Institut Informatik 2010 03 02 03/02/2010 10 Dominik Goddeke Robert Strzodka Paper Numerics Dominik Goddeke,Robert Strzodka,dominik.goeddeke@math.tu-dortmund.de a22255de-199f-41da-b3c7-703a9714f29f Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors We present an efficient method for the simulation of laminar fluid flows with free surfaces including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We show that our code scales well on an MPI-based cluster; that an eightfold speedup can be achieved using modern GPUs in contrast to multithreaded CPU code and, finally, that it is possible to solve fluid-structure interaction scenarios with high resolution at interactive rates. 
/content/cudazone/CUDABrowser/assets/images/applications/1010_mcc-paper_small.png /content/cudazone/CUDABrowser/assets/images/applications/1010_mcc-paper_large.png Academia TU Dortmund 2010 02 24 02/24/2010 8 Markus Geveler Paper Computational Fluid Dynamics Numerics Markus Geveler,markus.geveler@math.tu-dortmund.de 095c041f-aa2c-472f-b0b4-eaa692951dc5 Fast Image Blurring with CUDA High-performance, good-quality image blurring using the stack blur algorithm provided by http://incubator.quasimondo.com /content/cudazone/CUDABrowser/assets/images/applications/1008_device_small.png /content/cudazone/CUDABrowser/assets/images/applications/1008_device_large.png Research http://home.so-net.net.tw/lioucy/ 2009 09 10 09/10/2009 300 Open source ChaoJui Application Graphics ChaoJui,lioucr@yahoo.ca 476d306d-6a77-4749-8210-8b7b19ebd420 Fast Human Detection with Cascaded Ensembles A real-time implementation of the Histograms of Oriented Gradients algorithm with cascaded classifiers. /content/cudazone/CUDABrowser/assets/images/applications/1007_cover_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1007_cover_large.jpg Academia MIT 2010 02 26 02/26/2010 13 Berkin Bilgic Paper Signal Processing Berkin Bilgic,berkin@mit.edu ba8b2f03-6170-4b26-99c3-86b4e47546d1 Cuda-Renderer 2009 - A Multi-Volume Polyhedral Renderer We present a new algorithm for hardware-accelerated ray casting of multiple volumes. Our approach supports a large number of volumes, complex translucent and concave polyhedral objects as well as CSG intersections of volumes and geometry in any combination. It is implemented as a software renderer in CUDA without any fixed function portions, which allows full control over the use of memory bandwidth. High depth complexity, which is problematic for conventional approaches based on depth peeling, can be successfully handled. 
As far as we know, our approach is the first framework for multi-volume rendering which provides interactive frame rates when concurrently rendering more than 50 arbitrarily overlapping volumes on current graphics hardware. /content/cudazone/CUDABrowser/assets/images/applications/1006_Thumbnail_small.png /content/cudazone/CUDABrowser/assets/images/applications/1006_Thumbnail_large.png Academia Graz University of Technology http://www.icg.tugraz.at 2009 12 14 12/14/2009 Bernhard Kainz Markus Grabner Alexander Bornik Stefan Hauswiesner Judith Muehl Dieter Schmalstieg Multimedia Paper Graphics Medical Imaging Bernhard Kainz,Markus Grabner,Alexander Bornik,kainz@icg.tugraz.at aa417b5b-e0cc-446a-9fca-a93e14d4868b Accelerating SQL Database Operations on a GPU with CUDA A reimplementation of portions of the SQLite database to execute on a GPU, part of the GPGPU-3 workshop. /content/cudazone/CUDABrowser/assets/images/applications/1005_volcano_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1005_volcano_large.jpg Academia University of Virginia (LAVA Lab) http://www.cs.virginia.edu/~skadron/pub_list.html 2010 03 14 03/14/2010 70 Peter Bakkum Kevin Skadron Paper Data Mining Peter Bakkum,pbb7c@virginia.edu c6b19852-39b9-4460-8777-47047330ce20 Gramm-software package for molecular dynamics on graphical processing units This work describes the software package and algorithms for molecular dynamics using NVIDIA GPU G80, G84, and G92. All potentials needed for MM2 and AMBER force fields are implemented and the combination of different potentials is allowed. The performance comparison of different MD algorithms on GPU and CPU is presented. All software is available from www.gpamm.mntech.ru. /content/cudazone/CUDABrowser/assets/images/applications/1003_cover-medium2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1003_cover-medium2_large.jpg Academia Russian Academy of Sciences 2010 01 21 01/21/2010 D. S. Tarasov E. D. Izotova D. A. 
Alisheva Paper Numerics D. S. Tarasov,E. D. Izotova,D. A. Alisheva de2ccfcc-9cf0-4076-b92e-969e42607064 Leveraging Computation Sharing and Parallel Processing in Location-Based Services A variety of research exists for the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. In this paper, we introduce an efficient and scalable system for monitoring continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit. http://www.computer.org/portal/web/csdl/doi/10.1109/CSE.2009.437 /content/cudazone/CUDABrowser/assets/images/applications/1002_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1002_cs_large.jpg Research 2009 International Conference on Computational Science and Engineering 2008 08 31 08/31/2008 Jonathan Cazalas Kien Hua Paper Science Jonathan Cazalas,Kien Hua 5914c55d-b33d-4834-91f5-52968c66c450 Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors Lattice Boltzmann Methods (LBM) are used for the computational simulation of Newtonian fluid dynamics. LBM-based simulations are readily parallelizable; they have been implemented on general-purpose processors, field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). Of the three methods, the GPU implementations achieved the highest simulation performance per chip. With memory bandwidth of up to 141 GB/s and a theoretical maximum floating point performance of over 600 GFLOPS, CUDA-ready GPUs from NVIDIA provide an attractive platform for a wide range of scientific simulations, including LBM. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPP.2009.38 /content/cudazone/CUDABrowser/assets/images/applications/1001_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1001_cs_large.jpg Research 2009 International Conference on Parallel Processing 2009 09 25 09/25/2009 Peter Bailey Joe Myre Stuart D.C. Walsh Paper Science Peter Bailey,Joe Myre,Stuart D.C. Walsh 9c638c6c-3e27-4a8c-b3ea-e6f75ca52f8d Theoretical and Empirical Analysis of a GPU Based Parallel Bayesian Optimization Algorithm General Purpose computing over Graphical Processing Units (GPGPUs) is a huge shift of paradigm in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelization of Bayesian Optimization Algorithm (BOA) to solve complex combinatorial optimization problems over nVidia commodity graphics hardware using Compute Unified Device Architecture (CUDA). http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.32 /content/cudazone/CUDABrowser/assets/images/applications/1000_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/1000_cs_large.jpg Research 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies 2009 12 11 12/11/2009 Asim Munawar Mohamed Wahib Masaharu Munetomo Paper Science Asim Munawar,Mohamed Wahib,Masaharu Munetomo 9c689f15-653f-4c90-b64c-1140bae9d5df Applying Modern Soft- and Hardware Technologies for Computational Steering Approaches in Computational Fluid Dynamics In this article we present an educational simulation tool, FlowSim 2007 CUDA edition, a computational steering application for interactive 2D flow simulation based on the Lattice Boltzmann Method. 
The application combines a comfortable user interface and a convenient development platform on the one hand with a high performance flow solver on the other. http://www.computer.org/portal/web/csdl/doi/10.1109/CW.2007.53 /content/cudazone/CUDABrowser/assets/images/applications/999_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/999_cs_large.jpg Research 2007 International Conference on Cyberworlds 2007 10 26 10/26/2007 Jan Linxweiler Jonas Tölke Manfred Krafczyk Paper Science Jan Linxweiler,Jonas Tölke,Manfred Krafczyk 722324b0-4ea9-4cc8-896d-190e61c0da21 High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units The power of graphics processing units (GPUs) has been increasing more rapidly than that of CPUs. It is not surprising that many software libraries have been developed which enable us to use the power of GPUs for general computations, especially in parallel data processing. In this paper, we propose implementations of ARIA, the standard block cipher of Korea, using the OpenGL and CUDA libraries on GPU. http://www.computer.org/portal/web/csdl/doi/10.1109/MUE.2008.94 /content/cudazone/CUDABrowser/assets/images/applications/998_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/998_cs_large.jpg Research 2008 International Conference on Multimedia and Ubiquitous Engineering 2008 04 26 04/26/2008 Yongjin Yeom Yongkuk Cho Moti Yung Paper Science Yongjin Yeom,Yongkuk Cho,Moti Yung 8fc3f95e-0465-479a-a4d9-6876d7b5e3b3 Accelerating Compute-Intensive Applications with GPUs and FPGAs Accelerators are special purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth. 
http://www.computer.org/portal/web/csdl/doi/10.1109/SASP.2008.4570793 /content/cudazone/CUDABrowser/assets/images/applications/997_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/997_cs_large.jpg Academia University of Virginia 2008 06 09 06/09/2008 Shuai Che Jie Li Jeremy W. Sheaffer Paper Science Shuai Che,Jie Li,Jeremy W. Sheaffer 458661f9-8252-4e51-b2bd-d53126b80571 High-Speed Private Information Retrieval Computation on GPU A Private Information Retrieval (PIR) scheme is a protocol in which a user retrieves a record out of n from a replicated database, while hiding from the database which record has been retrieved, as long as the different replicas do not collude. A specially interesting sub-field of research, called single-database PIR, deals with the schemes that allow a user to retrieve privately an element of a non-replicated database. In these schemes, user privacy is related to the intractability of a mathematical problem, instead of being based on the assumption that different replicas exist and do not collude against their users. http://www.computer.org/portal/web/csdl/doi/10.1109/SECURWARE.2008.55 /content/cudazone/CUDABrowser/assets/images/applications/996_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/996_cs_large.jpg Research 2008 Second International Conference on Emerging Security Information, Systems and Technologies 2008 08 31 08/31/2008 Carlos Aguilar Melchor Benoit Crespin Philippe Gaborit Paper Science Carlos Aguilar Melchor,Benoit Crespin,Philippe Gaborit 23325740-fa2b-4e7b-a7ee-b607da12ee54 Compute Unified Device Architecture Application Suitability Graphics processing units (GPUs) can provide excellent speedups on some, but not all, general-purpose workloads. Using a set of computational GPU kernels as examples, the authors show how to adapt kernels to utilize the architectural features of a GeForce 8800 GPU and what finally limits the achievable performance. 
/content/cudazone/CUDABrowser/assets/images/applications/995_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/995_cs_large.jpg Academia University of Illinois 2009 06 01 06/01/2009 Wen-Mei Hwu Christopher Rodrigues Shane Ryoo Paper Science Wen-Mei Hwu,Christopher Rodrigues,Shane Ryoo 0e7f8bf5-4959-4ece-9756-519dda1fe8b6 Parallel Approaches for SWAMP Sequence Alignment This document is a summary and overview of several approaches to implement the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model while maintaining speed and accuracy, and they produce a richer set of results in an automated way that is not currently available. We consider four different hardware architectures for the realization of the ASC model. These are the ClearSpeed CSX processor, NVIDIA GPGPU graphics processors, IBM Cell Processors, and FPGAs. /content/cudazone/CUDABrowser/assets/images/applications/994_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/994_cs_large.jpg Academia Case Western University, Cleveland, Ohio 2009 06 17 06/17/2009 Shannon Steinfadt Kevin Schaffer Paper Science Shannon Steinfadt,Kevin Schaffer 9d864e53-07ca-423a-ab86-088d470c12a2 Accelerating Algebraic Reconstruction Using CUDA-Enabled GPU In this paper, we apply the Compute Unified Device Architecture (CUDA) to 3D cone-beam CT reconstruction using the Simultaneous Algebraic Reconstruction Technique (SART). With hardware acceleration, the computationally complex SART can run at a speed comparable to the commonly used Filtered Back-Projection, and provide an even better quality volume with fewer samples. The main contributions include two novel techniques to accelerate the reconstruction. 
http://www.computer.org/portal/web/csdl/doi/10.1109/CGIV.2009.18 /content/cudazone/CUDABrowser/assets/images/applications/993_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/993_cs_large.jpg Research 2009 Sixth International Conference on Computer Graphics, Imaging and Visualization 2009 08 14 08/14/2009 Yuqiang Lu Weiming Wang Shifu Chen Paper Science Yuqiang Lu,Weiming Wang,Shifu Chen 9eb3b453-b0e1-42bc-8040-626f61e09879 Profiling General Purpose GPU Applications We are witnessing an increasing adoption of GPUs for performing general purpose computation, which is usually known as GPGPU. The main challenge in developing such applications is that they often do not fit the model required by graphics processing devices, limiting the scope of applications that may benefit from the computing power provided by GPUs. Even when the application fits the GPU model, obtaining optimal resource usage is a complex task. http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2009.26 /content/cudazone/CUDABrowser/assets/images/applications/992_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/992_cs_large.jpg Research 2009 21st International Symposium on Computer Architecture and High Performance Computing 2009 10 31 10/31/2009 Bruno Rocha Coutinho George Luiz Medeiros Teodoro Rafael Sachetto Oliveira Paper Science Bruno Rocha Coutinho,George Luiz Medeiros Teodoro,Rafael Sachetto Oliveira f186e9e8-e3bd-41da-a917-0868ecbc7fdc Improving Performance of Matrix Multiplication and FFT on GPU In this paper we discuss our experiences in improving the performance of two key algorithms: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and single-precision FFT using CUDA. The former is computation-intensive, while the latter is memory bandwidth or communication-intensive. A peak performance of 393 Gflops is achieved on NVIDIA GeForce GTX280 for the former, about 5% faster than the CUBLAS 2.0 library. 
Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms. /content/cudazone/CUDABrowser/assets/images/applications/991_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/991_cs_large.jpg Research 2009 15th International Conference on Parallel and Distributed Systems 2009 12 11 12/11/2009 Xiang Cui Yifeng Chen Hong Mei Paper Science Xiang Cui,Yifeng Chen,Hong Mei dedb0ea8-da35-401f-a28a-50bf45cb4f96 Coprocessor Computing with FPGA and GPU Specialized secondary processing units, such as field programmable gate arrays (FPGAs) and graphics processing units (GPUs), attempt to tackle time-consuming applications with high computational requirements. In order to achieve acceleration, FPGAs allow a customizable architecture and Nvidia GPUs offer up to 16 cores with 128 stream processors. http://www.computer.org/portal/web/csdl/doi/10.1109/DoD.HPCMP.UGC.2008.69 /content/cudazone/CUDABrowser/assets/images/applications/990_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/990_cs_large.jpg Research 2008 DoD HPCMP Users Group Conference 2008 07 17 07/17/2008 Song Jun Park Dale R. Shires Brian J. Henz Paper Science Song Jun Park,Dale R. Shires,Brian J. Henz a594c3dc-754e-4f17-a212-b1968a962069 GPU as a General Purpose Computing Resource In the last few years, GPUs (Graphics Processing Units) have developed rapidly. Their ever-increasing computing power and decreasing cost have attracted attention from both industry and academia. In addition to graphics applications, researchers are interested in using them for general purpose computing. Recently, NVIDIA released a new computing architecture, CUDA (Compute Unified Device Architecture), for its GeForce 8 series, Quadro FX, and Tesla GPU products. 
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2008.38 /content/cudazone/CUDABrowser/assets/images/applications/989_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/989_cs_large.jpg Research 2008 Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies 2008 12 04 12/04/2008 Qihang Huang Zhiyi Huang Paul Werstein Paper Science Qihang Huang,Zhiyi Huang,Paul Werstein ce07c8c1-bb41-4cc2-b04f-ee02a2980f68 Accelerating Partitional Algorithms for Flow Cytometry on GPUs Like many modern techniques for scientific analysis, flow cytometry produces massive amounts of data that must be analyzed and clustered intelligently to be useful. Current manual binning techniques are cumbersome and limited in both the quality and quantity of analysis produced. To address the quality of results, a new framework applying two different sets of clustering algorithms and inference methods are implemented. http://www.computer.org/portal/web/csdl/doi/10.1109/ISPA.2009.29 /content/cudazone/CUDABrowser/assets/images/applications/988_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/988_cs_large.jpg Research 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications 2009 08 12 08/12/2009 Jeremy Espenshade Andrew Pangborn Gregor von Laszewski Paper Science Jeremy Espenshade,Andrew Pangborn,Gregor von Laszewski 34d48171-d36d-4bd4-8cf5-25e71f00c0ee kD-Tree Traversal Implementations for Ray Tracing on Massive Multiprocessors: A Comparative Study Current GPU computational power enables the execution of complex and parallel algorithms, such as Ray Tracing techniques supported by kD-trees for 3D scene rendering in real time. 
This work describes in detail the study and implementation of five different kD-Tree traversal algorithms using the parallel framework NVIDIA Compute Unified Device Architecture (CUDA), in order to point their pros and cons regarding adaptation capability to the chosen architecture. http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2009.25 /content/cudazone/CUDABrowser/assets/images/applications/987_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/987_cs_large.jpg Research 2009 21st International Symposium on Computer Architecture and High Performance Computing 2009 10 31 10/31/2009 Artur L. dos Santos Joao Marcelo X.N. Teixeira Thiago S.M.C. de Farias Paper Science Artur L. dos Santos,Joao Marcelo X.N. Teixeira,Thiago S.M.C. de Farias ed07648a-45fa-4ca3-b053-e20c0411184a Multi-core acceleration of chemical kinetics for simulation and prediction This work implements a computationally expensive chemical kinetics kernel from a large-scale community atmospheric model on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis for each platform in double and single precision on coarse and fine grids is presented. Platform-specific design and optimization is discussed in a mechanism-agnostic way, permitting the optimization of many chemical mechanisms. http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654067 /content/cudazone/CUDABrowser/assets/images/applications/986_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/986_cs_large.jpg Academia Virginia Polytechnic Institute and State University 2009 11 20 11/20/2009 John C. Linford John Michalakes Manish Vachharajani Paper Science John C. 
Linford,John Michalakes,Manish Vachharajani 7d14d54f-5d86-4c0d-aed3-d3c48af7bcd6 Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence Direct numerical simulation (DNS) of turbulence is computationally intensive and typically relies on some form of parallel processing. Spectral kernels used for spatial discretization are a common computational bottleneck on distributed memory architectures. One way to increase DNS algorithms' efficiency is to parallelize spectral kernels using tightly coupled single-program, multiple-data (SPMD) multiprocessor units with minimal interprocessor communication latency. http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.42 /content/cudazone/CUDABrowser/assets/images/applications/985_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/985_cs_large.jpg Academia University of Maryland 2009 04 01 04/01/2009 George Stantchev Derek Juba William Dorland Paper Science George Stantchev,Derek Juba,William Dorland c8e73d49-b21f-4fdb-9e45-387be9600fe0 Accelerating Phase Correlation Functions Using GPU and FPGA In this paper, we present a comparison study about implementations of phase correlation function using GPUs, ASIC and FPGAs. The Phase Only Correlation(POC) method demonstrates high robustness and subpixel accuracy in the pattern matching and the image registration. However, there is a disadvantage in computational speed because of the calculation of 2D-FFT etc. 
http://www.computer.org/portal/web/csdl/doi/10.1109/AHS.2009.53 /content/cudazone/CUDABrowser/assets/images/applications/984_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/984_cs_large.jpg Research 2009 NASA/ESA Conference on Adaptive Hardware and Systems 2009 08 01 08/01/2009 Kentaro Matsuo Tsuyoshi Hamada Masayuki Miyoshi Paper Science Kentaro Matsuo,Tsuyoshi Hamada,Masayuki Miyoshi 9d8a949b-df07-474b-987c-0f649a0c3750 Financial Derivatives Modeling Using GPU's The architecture of the latest Graphic Processing Unit (GPU) has surpassed the previous application-specific stream architecture. This has led to an architecture consisting of a number of uniform programmable units integrated on the same chip which facilitate the general-purpose computing beyond the graphic processing. http://www.computer.org/portal/web/csdl/doi/10.1109/EmbeddedCom-ScalCom.2009.85 /content/cudazone/CUDABrowser/assets/images/applications/983_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/983_cs_large.jpg Research 2009 International Conference on Scalable Computing and Communications 2009 09 27 09/27/2009 Myungho Lee Chin Hong Chun Sugwon Hong Paper Science Myungho Lee,Chin Hong Chun,Sugwon Hong a9fb7f0f-051b-4f80-940c-0d35b077453f Fast k nearest neighbor search using GPU Statistical measures coming from information theory represent interesting bases for image and video processing tasks such as image retrieval and video object tracking. For example, let us mention the entropy and the Kullback-Leibler divergence. Accurate estimation of these measures requires to adapt to the local sample density, especially if the data are high-dimensional. 
http://www.computer.org/portal/web/csdl/doi/10.1109/CVPRW.2008.4563100 /content/cudazone/CUDABrowser/assets/images/applications/982_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/982_cs_large.jpg Research Université de Nice-Sophia Antipolis/CNRS Laboratoire I3S, France 2008 06 28 06/28/2008 Vincent Garcia Eric Debreuve Michel Barlaud Paper Science Vincent Garcia,Eric Debreuve,Michel Barlaud 3b04ed36-f9f7-4cdd-bae3-6f168d7a28f4 Accelerating Simulations of Light Scattering Based on Finite-Difference Time-Domain Method with General Purpose GPUs Simulations of light scattering from nano-structured surface areas require a substantial amount of computing time. The emergence of General Purpose Graphics Processing Units (GPGPUs) as affordable PC SIMD arithmetic coprocessors brings the necessary computing power to modern desktop PCs. In this paper we examine how the computation time of the Finite-Difference Time-Domain (FDTD) method, a classic numerical method for computing a solution to Maxwell's equations, can be reduced by leveraging the massively parallel architecture of GPGPU cards. /content/cudazone/CUDABrowser/assets/images/applications/981_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/981_cs_large.jpg Research 2008 11th IEEE International Conference on Computational Science and Engineering 2008 07 28 07/28/2008 A. Balevic L. Rockstroh A. Tausendfreund Paper Science A. Balevic,L. Rockstroh,A. Tausendfreund b249b69b-fbc0-48a9-b9a1-6dfc8e766fee Exploring the multiple-GPU design space Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. 
These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup. http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5161068 /content/cudazone/CUDABrowser/assets/images/applications/980_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/980_cs_large.jpg Academia Northeastern University 2009 05 29 05/29/2009 Dana Schaa David Kaeli Paper Science Dana Schaa,David Kaeli 558ecded-dc18-475f-9e3d-ebda86332f8f The Virtual Marathon: Parallel Computing Supports Crowd Simulations To be realistic, an urban model must include appropriate numbers of pedestrians, vehicles, and other dynamic entities. Using a parallel-computing architecture, researchers simulated a marathon with more than a million participants. To simulate participant behavior, they used fuzzy logic on a GPU to perform millions of inferences in real time. /content/cudazone/CUDABrowser/assets/images/applications/979_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/979_cs_large.jpg Research IEEE Computer Graphics 2009 08 01 08/01/2009 Erdal Yilmaz Veysi Isler Yasemin Yardimci Cetin Paper Science Erdal Yilmaz,Veysi Isler,Yasemin Yardimci Cetin e09be7a1-fad3-47c5-8189-cb111c0818df A Parallel Gibbs Sampling Algorithm for Motif Finding on GPU A motif is an overrepresented pattern in biological sequences, and motif finding is an important problem in bioinformatics. Because of the high computational complexity of motif finding, ever more computational capability is required as the volume of available biological data, such as gene transcription data, grows rapidly. Among many motif finding algorithms, Gibbs sampling is an effective method for long motif finding. In this paper we present an improved Gibbs sampling method on graphics processing units (GPU) to accelerate motif finding. 
Experimental data show that, compared to traditional programs on the CPU, our program running on the GPU provides an effective and low-cost solution to the motif finding problem, especially for long motifs. /content/cudazone/CUDABrowser/assets/images/applications/978_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/978_cs_large.jpg Research 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications 2009 08 12 08/12/2009 Linbin Yu Yun Xu Paper Science Linbin Yu,Yun Xu c1239952-bc70-4a62-b8aa-da685c20d2ea Cellular Level Agent Based Modelling on the Graphics Processing Unit Cellular level agent based modelling is reliant on either sequential processing environments or expensive and largely unavailable PC grids. The GPU offers an alternative architecture for such systems; however, the steep learning curve associated with the GPU's data-parallel architecture has previously limited the uptake of this emerging technology. http://www.computer.org/portal/web/csdl/doi/10.1109/HiBi.2009.12 /content/cudazone/CUDABrowser/assets/images/applications/977_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/977_cs_large.jpg Research 2009 International Workshop on High Performance Computational Systems Biology 2009 10 14 10/14/2009 Paul Richmond Simon Coakley Daniela Romano Paper Science Paul Richmond,Simon Coakley,Daniela Romano 843ce581-fa99-4372-99d8-6ef7b20ec10e A microdriver architecture for error correcting codes inside the Linux kernel Coding tasks, such as encryption of data or the generation of failure-tolerant codes, are among the most computationally expensive tasks inside the Linux kernel. Their integration into the kernel lets the user access these functionalities transparently: encrypted hard disks can be used in the same way as unencrypted ones. 
Nevertheless, Linux as a monolithic kernel is not prepared to support these expensive tasks by accessing modern hardware accelerators, like graphics processing units (GPUs), as the corresponding accelerator libraries, like the CUDA-API for NVIDIA GPUs, only offer user-space APIs. http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654095 /content/cudazone/CUDABrowser/assets/images/applications/976_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/976_cs_large.jpg Academia University of Paderborn, Germany 2009 11 20 11/20/2009 A. Brinkmann D. Eschweiler Paper Science A. Brinkmann,D. Eschweiler d6ce8db6-de74-452f-8737-b2efa14c3d63 A Program Behavior Study of Block Cryptography Algorithms on GPGPU Recently, many studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved large performance gains. This paper does not focus on the performance of a specific program tuned with every available algorithmic optimization, but on the intrinsic reason for this performance improvement, which lies in the architectural features of the GPU. http://www.computer.org/portal/web/csdl/doi/10.1109/FCST.2009.13 /content/cudazone/CUDABrowser/assets/images/applications/975_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/975_cs_large.jpg Research 2009 Fourth International Conference on Frontier of Computer Science and Technology 2009 12 19 12/19/2009 Gu Liu Hong An Wenting Han Paper Science Gu Liu,Hong An,Wenting Han d9dd760c-7469-42a8-905c-7144ff3d043d Count Sort for GPU Computing Counting sort is a simple, stable and efficient sort algorithm with linear running time, which is a fundamental building block for many applications. This paper describes the design issues of a data parallel implementation of the count sort algorithm on a commodity multiprocessor GPU using the Compute Unified Device Architecture (CUDA) platform, both from NVIDIA Corporation. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.30 /content/cudazone/CUDABrowser/assets/images/applications/974_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/974_cs_large.jpg Research 2009 15th International Conference on Parallel and Distributed Systems 2009 12 11 12/11/2009 Weidong Sun Zongmin Ma Paper Science Weidong Sun,Zongmin Ma 47e489fe-8243-425b-a88c-27a5d18b0f6a Solving Computational Problems with GPU Computing Modern GPUs are massively parallel microprocessors that can deliver very high performance for the parallel computations common in science and engineering. /content/cudazone/CUDABrowser/assets/images/applications/972_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/972_cs_large.jpg Research Computing Science 2009 10 01 10/01/2009 Jonathan Cohen Michael Garland Paper Science Jonathan Cohen,Michael Garland 8a1ca4bd-1895-40b6-98b7-33bdddca994d The Synchronization Power of Coalesced Memory Accesses Multicore architectures have established themselves as the new generation of computer architectures. As part of the one core to many cores evolution, memory access mechanisms have advanced rapidly. Several new memory access mechanisms have been implemented in many modern commodity multicore architectures. By specifying how processing cores access shared memory, memory access mechanisms directly influence the synchronization capabilities of multicore architectures. Therefore, it is crucial to investigate the synchronization power of these new memory access mechanisms. /content/cudazone/CUDABrowser/assets/images/applications/971_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/971_cs_large.jpg Academia Chalmers University of Technology, Gothenburg 2008 12 31 12/31/2008 Phuong Hoai Ha Philippas Tsigas Otto J. Anshus Paper Science Phuong Hoai Ha,Philippas Tsigas,Otto J. 
Anshus fbaceb8f-3e46-4070-ac07-fe2e8d5e4608 Fast Disk Encryption through GPGPU Acceleration We present the design and performance analysis of a GPU-optimized implementation of a disk encryption application employing the XTS mode of operation applied together with the Twofish algorithm within the well-known TrueCrypt suite. We show how to correctly tune the design parameters, including data allocation, thread packing, and parallelization strategy. Overall, our implementation of TrueCrypt running on an NVIDIA GTX260 GPU outperforms the baseline implementation running on a four-core CPU by 67%. /content/cudazone/CUDABrowser/assets/images/applications/970_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/970_cs_large.jpg Research 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies 2009 12 11 12/11/2009 Giovanni Agosta Alessandro Barenghi Fabrizio De Santis Paper Science Giovanni Agosta,Alessandro Barenghi,Fabrizio De Santis 4471c01f-7076-4798-acd4-af519ff3ae9e Optical Flow Computation on Compute Unified Device Architecture In this study, the implementation of an image processing technique on Compute Unified Device Architecture (CUDA) is discussed. CUDA is a new hardware and software architecture developed by NVIDIA Corporation for general-purpose computation on graphics processing units. http://www.computer.org/portal/web/csdl/doi/10.1109/ICIAP.2007.97 /content/cudazone/CUDABrowser/assets/images/applications/969_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/969_cs_large.jpg Academia Yamaguchi University, Japan 2007 09 14 09/14/2007 Yoshiki Mizukami Katsumi Tadamura Paper Science Yoshiki Mizukami,Katsumi Tadamura 8252cbfd-7350-4004-acfb-096dfea1d9e2 Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures Medical volumetric imaging requires high fidelity, high performance rendering algorithms. 
We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major categories of volume rendering algorithms and confirm through an imaging scientist-guided evaluation that ray-casting is the most acceptable. http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2009.164 /content/cudazone/CUDABrowser/assets/images/applications/967_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/967_cs_large.jpg Research Intel Corporation 2009 11 15 11/15/2009 Mikhail Smelyanskiy David Holmes Jatin Chhugani Paper Medical Imaging Mikhail Smelyanskiy,David Holmes,Jatin Chhugani bf5ce6bc-a2c0-4262-961b-d0fe4504edc1 GPU-accelerated, gradient-free MI deformable registration for atlas-based MR brain image segmentation Brain structure segmentation is an important task in many neuroscience and clinical applications. In this paper, we introduce a novel MI-based dense deformable registration method and apply it to the automatic segmentation of detailed brain structures. http://www.computer.org/portal/web/csdl/doi/10.1109/CVPR.2009.5204043 /content/cudazone/CUDABrowser/assets/images/applications/966_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/966_cs_large.jpg Academia Maryland Heights 2009 06 25 06/25/2009 Xiao Han L.S. Hibbard V. Willcut Paper Science Xiao Han,L.S. Hibbard,V. Willcut 97e427af-035a-48b2-944a-44029f2b874e Efficient band approximation of Gram matrices for large scale kernel methods on GPUs Kernel-based methods require O(N^2) time and space complexities to compute and store non-sparse Gram matrices, which is prohibitively expensive for large scale problems. We introduce a novel method to approximate a Gram matrix with a band matrix. Our method relies on the locality preserving properties of space filling curves, and the special structure of Gram matrices. Our approach has several important merits. 
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654091 /content/cudazone/CUDABrowser/assets/images/applications/965_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/965_cs_large.jpg Research Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis 2009 11 20 11/20/2009 Mohamed Hussein Wael Abd-Almageed Paper Science Mohamed Hussein,Wael Abd-Almageed 0286b4ba-3c62-4513-b654-eb17ca5eb44f CUDA-Based Jacobi's Iterative Method Solving linear equations is a common problem in the fields of science and engineering. Accelerating its solving process is of great significance. Modern GPUs are high performance many-core processors fit for large scale parallel computing. They offer a novel way to accelerate the solving process. A GPU-based parallel Jacobi's iterative solver for dense linear equations is presented in this paper. http://www.computer.org/portal/web/csdl/doi/10.1109/IFCSTA.2009.68 /content/cudazone/CUDABrowser/assets/images/applications/964_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/964_cs_large.jpg Research 2009 International Forum on Computer Science-Technology and Applications 2009 12 27 12/27/2009 Zhihui Zhang Qinghai Miao Ying Wang Paper Science Zhihui Zhang,Qinghai Miao,Ying Wang aa5f9c87-b466-42cd-bbbf-e6d69a883462 Voice Command Recognition with Dynamic Time Warping (DTW) using GPU with CUDA Recently, we have witnessed a major evolution in the development of high performance computing platforms. Among these platforms, the GPU (Graphics Processing Unit), driven by a game industry that constantly demands more graphics processing power, has evolved from a simple graphics card into a general-purpose, data-parallel processing device. 
http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2007.21 /content/cudazone/CUDABrowser/assets/images/applications/963_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/963_cs_large.jpg Research 19th International Symposium on Computer Architecture and High Performance Computing 2007 10 27 10/27/2007 Gustavo Poli João F. Mari José Hiroki Saito Paper Science Gustavo Poli,João F. Mari,José Hiroki Saito 6f659aa3-3402-474c-860e-06af9f94e3f8 NVIDIA Tesla: A Unified Graphics and Computing Architecture To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs. /content/cudazone/CUDABrowser/assets/images/applications/961_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/961_cs_large.jpg Research NVIDIA Corp. http://www.nvidia.com/cuda 2008 04 01 04/01/2008 Erik Lindholm John Nickolls Stuart Oberman Paper Science Erik Lindholm,John Nickolls,Stuart Oberman dee4e626-3b28-4f97-b21f-51e7af3cd36a Parallel Computing Experiences with CUDA The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems and the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/956_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/956_cs_large.jpg Research NVIDIA Corp. 
http://www.nvidia.com/cuda 2008 08 01 08/01/2008 Michael Garland Scott Le Grand John Nickolls Paper Science Michael Garland,Scott Le Grand,John Nickolls 33b8d91d-369c-4bcf-a02d-dea9fc848e19 Low-cost, high-speed computer vision using NVIDIA's CUDA architecture In this paper, we introduce real time image processing techniques using modern programmable Graphic Processing Units (GPU). A GPU is a SIMD (Single Instruction, Multiple Data) device that is inherently data-parallel. http://www.computer.org/portal/web/csdl/doi/10.1109/AIPR.2008.4906458 /content/cudazone/CUDABrowser/assets/images/applications/954_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/954_cs_large.jpg Academia Virginia Polytechnic Institute and University Blacksburg 2008 10 17 10/17/2008 Seung In Park Sean P. Ponce Jing Huang Paper Science Seung In Park,Sean P. Ponce,Jing Huang 27bfd115-787f-47af-bc51-e3978fe90dc2 K-Means on Commodity GPUs with CUDA The K-means algorithm is one of the most famous unsupervised clustering algorithms. Many theoretical improvements to the performance of the original algorithm have been put forward, but almost all of them are based on Single Instruction Single Data (SISD) architecture processors (CPUs), which partly ignores the inherently parallel character of the algorithm. 
http://www.computer.org/portal/web/csdl/doi/10.1109/CSIE.2009.491 /content/cudazone/CUDABrowser/assets/images/applications/947_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/947_cs_large.jpg Research 2009 WRI World Congress on Computer Science and Information Engineering 2009 04 02 04/02/2009 Bai Hong-tao He Li-li Ouyang Dan-tong Paper Science Bai Hong-tao,He Li-li,Ouyang Dan-tong e458a4f1-5693-4f34-beb8-9f46e2d0e158 Hierarchical Agglomerative Clustering Using Graphics Processor with Compute Unified Device Architecture We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with the Compute Unified Device Architecture (CUDA) of NVIDIA. Although the advancement in graphics cards has made the gaming industry flourish, there is a lot more to be gained in the fields of scientific computing, high performance computing and their applications. http://www.computer.org/portal/web/csdl/doi/10.1109/ICSPS.2009.167 /content/cudazone/CUDABrowser/assets/images/applications/945_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/945_cs_large.jpg Academia 2009 International Conference on Signal Processing Systems 2009 05 17 05/17/2009 S.A. Arul Shalom Manoranjan Dash Minh Tue Paper Science S.A. Arul Shalom,Manoranjan Dash,Minh Tue cf4cef10-6cad-4a80-94eb-fca17c2968c6 Compute Pairwise Manhattan Distance and Pearson Correlation Coefficient of Data Points with GPU Graphics processing units (GPUs) are powerful computational devices tailored towards the needs of the 3-D gaming industry for high-performance, real-time graphics engines. Nvidia Corporation released a new generation of GPUs designed for general-purpose computing in 2006, and it released a GPU programming language called CUDA in 2007. 
http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.34 /content/cudazone/CUDABrowser/assets/images/applications/944_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/944_cs_large.jpg Academia Catholic University of Daegu, Korea 2009 05 29 05/29/2009 Dar-Jen Chang Ahmed H. Desoky Ming Ouyang Paper Science Dar-Jen Chang,Ahmed H. Desoky,Ming Ouyang f61532c5-2f66-40d8-8f6c-04b50f5bbefd Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160924 /content/cudazone/CUDABrowser/assets/images/applications/940_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/940_cs_large.jpg Academia Nanyang Technological University, Singapore 2009 05 29 05/29/2009 Haixiang Shi Bertil Schmidt Weiguo Liu Paper Science Haixiang Shi,Bertil Schmidt,Weiguo Liu 158694c3-4565-45ec-9051-9c1dfd7120d0 An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases The Smith Waterman algorithm for sequence alignment is one of the main tools of bioinformatics. It is used for sequence similarity searches and alignment of similar sequences. 
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160931 /content/cudazone/CUDABrowser/assets/images/applications/937_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/937_cs_large.jpg Academia University of Warsaw, Poland 2009 05 29 05/29/2009 Lukasz Ligowski Witold Rudnicki Paper Science Lukasz Ligowski,Witold Rudnicki e58ea7e5-f7df-481c-8b10-3b3eccd9e977 CuPP - A framework for easy CUDA integration This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA's GPGPU system CUDA into existing C++ applications. CuPP provides interfaces to recurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures. http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160937 /content/cudazone/CUDABrowser/assets/images/applications/936_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/936_cs_large.jpg Academia Universität Kassel, Germany 2009 05 29 05/29/2009 Jens Breitbart Paper Science Jens Breitbart b854753b-3791-4a9c-9b9a-4fe6700b4aa1 Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA Computing large multiple protein sequence alignments using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. ClustalW uses a three-stage processing pipeline: http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160923 /content/cudazone/CUDABrowser/assets/images/applications/934_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/934_cs_large.jpg Academia Nanyang Technological University, Singapore 2009 05 29 05/29/2009 Yongchao Liu Bertil Schmidt Douglas L. Maskell Paper Science Yongchao Liu,Bertil Schmidt,Douglas L. Maskell cb7682ca-4aae-4703-ae29-e99695f85d91 Ocean3DTechnology Simulation of oceanic surfaces; physics calculation for objects in a water environment. 
/content/cudazone/CUDABrowser/assets/images/applications/933_ocean_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/933_ocean_large.jpg Commercial Ocean3DInteractive http://www.ocean3dinteractive.com 2009 04 15 04/15/2009 Commercial Mykola Ozerchuk Multimedia Game Physics Graphics Mykola Ozerchuk 23e01f03-877d-433d-8b31-74754d82b8d9 FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. http://www.computer.org/portal/web/csdl/doi/10.1109/SASP.2009.5226333 /content/cudazone/CUDABrowser/assets/images/applications/932_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/932_cs_large.jpg Academia University of Illinois 2009 07 27 07/27/2009 Alexandros Papakonstantinou Karthik Gururaj John A. Stratton Paper Science Alexandros Papakonstantinou,Karthik Gururaj,John A. Stratton 5894f1ec-6a2f-4df5-ac17-a8a77ada7394 MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. 
This paper presents MSA-CUDA, a parallel MSA program, which parallelizes all three stages of the ClustalW processing pipeline using CUDA and achieves significant speedups compared to the sequential ClustalW for a variety of large protein sequence datasets. http://www.computer.org/portal/web/csdl/doi/10.1109/ASAP.2009.14 /content/cudazone/CUDABrowser/assets/images/applications/931_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/931_cs_large.jpg Research 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors 2009 07 07 07/07/2009 36 Yongchao Liu Bertil Schmidt Douglas L. Maskell Paper Science Yongchao Liu,Bertil Schmidt,Douglas L. Maskell 0e4de9b3-9658-4d42-b197-6ef57ab2d2ee Getting Started with GPU Programming This tutorial describes a step-by-step procedure for programming a Macintosh Nvidia GPU. General scientific programmers with some C knowledge can get started in parallel processing application development with relative ease. /content/cudazone/CUDABrowser/assets/images/applications/930_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/930_cs_large.jpg Research American University 2009 08 01 08/01/2009 Michael A. Gray Paper Science Michael A. Gray 9777acef-c2ae-4810-a2bb-809471ddc369 An Empirically Optimized Radix Sort for GPU Graphics Processing Units (GPUs) that support general-purpose programming are promising platforms for high performance computing. However, the fundamental architectural difference between GPU and CPU, the complexity of the GPU platform and the diversity of GPU specifications have made the generation of highly efficient code for GPUs increasingly difficult. Manual code generation is time consuming and the result tends to be difficult to debug and maintain. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ISPA.2009.89 /content/cudazone/CUDABrowser/assets/images/applications/929_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/929_cs_large.jpg Research 2009 IEEE International Symposium on Parallel and Distributed Processing with Applications 2009 08 10 08/10/2009 Bonan Huang Jinlan Gao Xiaoming Li Paper Science Bonan Huang,Jinlan Gao,Xiaoming Li 8a13d819-67c1-47d5-994b-7a995ba156b8 Accelerating Genome-Wide Association Studies Using CUDA Compatible Graphics Processing Units Recent advances in highly parallel, multithreaded, manycore Graphics Processing Units (GPUs) have enabled massively parallel implementations of many applications in bioinformatics. In this paper, we describe a parallel implementation of genome-wide association studies (GWAS) using Compute Unified Device Architecture (CUDA). Using a single NVIDIA GTX 280 graphics card, we achieve speedups of about 15 times over an Intel Xeon E5420. We also implement a highly scalable, massively parallel GWAS system using the Message Passing Interface (MPI) and show that a single GTX 280 can deliver performance similar to that of a 16-node cluster. We further apply the GPU program to two real genome-wide case-control data sets. The results show that the GPU program is 17.7 times as fast as the CPU version for an Age-related Macular Degeneration (AMD) data set and 25.7 times as fast as the CPU version for a Parkinson's disease data set. 
/content/cudazone/CUDABrowser/assets/images/applications/928_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/928_cs_large.jpg Research 2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing 2009 08 03 08/03/2009 25 Rui Jiang Feng Zeng Wangshu Zhang Paper Science Rui Jiang,Feng Zeng,Wangshu Zhang ff55e59c-e068-4456-a289-f60c94909099 Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA Power efficiency is one of the most important issues in high performance computing (HPC), and it is tied to both software and hardware. The power dissipation of a program depends on the algorithm design and on the power features of the computer components on which the program runs. In this work, we measure and model the power consumption of large matrix multiplication on a multi-core CPU and GPU platform. By combining the major physical power constraints of the hardware components with an analysis of program execution behavior, we aim to reduce overall power consumption by using a multithreaded CPU to control two GPU devices computing synchronously in parallel. By implementing the above method on a real system, we show that it can save 22% of energy and speed up the kernel execution time by 71%, compared with solving the same large matrix multiplication using a single CPU and GPU combination. /content/cudazone/CUDABrowser/assets/images/applications/927_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/927_cs_large.jpg Academia 2009 International Conference on Computational Science and Engineering 2009 08 29 08/29/2009 DaQi Ren Reiji Suda Paper Science DaQi Ren,Reiji Suda 36b2c6dc-0c25-4e19-a1b1-4f01eb1ba9a3 Solving 0/1 Knapsack Problem for Light Communication SLA-Based Workflow Mapping Using CUDA Mapping and running jobs on suitable resources are the core tasks in Grid Computing. 
The algorithm for mapping light-communication Grid-based workflows within the SLA context includes an operation that resolves the conflict period, which is exactly a 0/1 knapsack problem. When the workflow is large, as in the case of mapping a group of workflows, solving this problem takes a long time and thus makes the whole mapping process long. In this paper, we describe a way to solve this problem by exploiting the parallel computing power of the Graphic Processing Unit (GPU) with Compute Unified Device Architecture (CUDA). The experiment shows that the approach is very efficient for very large problems. /content/cudazone/CUDABrowser/assets/images/applications/926_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/926_cs_large.jpg Academia 2009 International Conference on Computational Science and Engineering 2009 08 29 08/29/2009 Dang Minh Quan Laurence T. Yang Paper Science Dang Minh Quan,Laurence T. Yang 5c12478a-a0ac-4a51-b496-dea3b86a936f CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator Modern GPUs open a completely new field to optimize embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy whose organization and performance implications are radically different from those of general purpose CPUs; and optimizing programs at the instruction-level for the GPU. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPPW.2009.78 /content/cudazone/CUDABrowser/assets/images/applications/925_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/925_cs_large.jpg Academia 2009 International Conference on Parallel Processing Workshops 2009 09 25 09/25/2009 Jakob Siegel Juergen Ributzka Xiaoming Li Paper Science Jakob Siegel,Juergen Ributzka,Xiaoming Li 4e0a3931-0a17-43fc-a2e9-068c23a4a0ea String Matching on a Multicore GPU Using CUDA Graphics Processing Units (GPUs) have evolved over the past few years from dedicated graphics rendering devices to powerful parallel processors, outperforming traditional Central Processing Units (CPUs) in many areas of scientific computing. The use of GPUs as processing elements was very limited until recently, when the concept of General-Purpose computing on Graphics Processing Units (GPGPU) was introduced. GPGPU made it possible to exploit the processing power and the memory bandwidth of GPUs through APIs that hide the GPU hardware from programmers. This paper presents experimental results on the parallel processing of some well-known on-line string matching algorithms using one such GPU abstraction API, the Compute Unified Device Architecture (CUDA). /content/cudazone/CUDABrowser/assets/images/applications/924_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/924_cs_large.jpg Academia Corfu, Greece 2009 09 12 09/12/2009 Charalampos S. Kouzinopoulos Konstantinos G. Margaritis Paper Science Charalampos S. Kouzinopoulos,Konstantinos G. Margaritis fffa6694-d276-4fc7-938e-d1ae95148346 Isosurface Extraction and View-Dependent Filtering from Time-Varying Fields Using Persistent Time-Octree (PTOT) We develop a new algorithm for isosurface extraction and view-dependent filtering from large time-varying fields, by using a novel Persistent Time-Octree (PTOT) indexing structure. 
http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2009.160 /content/cudazone/CUDABrowser/assets/images/applications/923_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/923_cs_large.jpg Academia Polytechnic Institute of New York University 2009 12 01 12/01/2009 Cong Wang Yi-Jen Chiang Paper Science Cong Wang,Yi-Jen Chiang ffccf360-34bc-43f7-8fd1-de125348ba45 Simulation of P Systems with Active Membranes on CUDA P systems or membrane systems provide a high level computational modeling framework that combines the structural and dynamic aspects of biological systems in a relevant and understandable way. P systems are massively parallel distributed, and non-deterministic systems. In this paper, we describe the implementation of a simulator for the class of recognizer P systems with active membranes by using the GPU (Graphics Processing Unit). We compare the high performance parallel simulator for the GPU to the simulator developed on a single CPU (Central Processing Unit), and we show that the GPU is better suited than the CPU to simulate P systems due to its highly parallel nature. /content/cudazone/CUDABrowser/assets/images/applications/922_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/922_cs_large.jpg Academia CoSBi, Trento, Italy 2009 10 14 10/14/2009 Jose Maria Cecilia Canales Jose Manuel Garcia Carrasco Paper Science Jose Maria Cecilia Canales,Jose Manuel Garcia Carrasco c4e34cb4-2b20-4b45-96ad-99b180dbcc47 Auto-tuning 3-D FFT library for CUDA GPUs Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance i.e., do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem. 
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654090 /content/cudazone/CUDABrowser/assets/images/applications/921_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/921_cs_large.jpg Academia Tokyo Institute of Technology and Japan Science and Technology Agency 2009 11 14 11/14/2009 Akira Nukada Satoshi Matsuoka Paper Science Akira Nukada,Satoshi Matsuoka c94e9a92-86b2-46a4-bc60-9910216c5d48 CUDA Accelerated LTL Model Checking Recent technological developments made available various many-core hardware platforms. For example, a SIMD-like hardware architecture became easily accessible for many users who have their computers equipped with modern NVIDIA GPU cards with CUDA technology. In this paper we redesign the maximal accepting predecessors algorithm [7] for LTL model checking in terms of matrix-vector product in order to accelerate LTL model checking on many-core GPU platforms. Our experiments demonstrate that using the NVIDIA CUDA technology results in a significant speedup of verification process. /content/cudazone/CUDABrowser/assets/images/applications/919_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/919_cs_large.jpg Academia Shenzhen, Guangdong, China 2009 12 11 12/11/2009 Jiri Barnat Lubos Brim Milan Ceska Paper Science Jiri Barnat,Lubos Brim,Milan Ceska 810014ad-f17c-406d-aff6-737150d18fdd RankBoost Acceleration on both NVIDIA CUDA and ATI Stream Platforms NVIDIA CUDA and ATI Stream are the two major general-purpose GPU (GPGPU) computing technologies. We implemented RankBoost, a web relevance ranking algorithm, on both NVIDIA CUDA and ATI Stream platforms to accelerate the algorithm and illustrate the differences between these two technologies. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.115 /content/cudazone/CUDABrowser/assets/images/applications/917_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/917_cs_large.jpg Academia Shenzhen, Guangdong, China 2009 12 11 12/11/2009 Bo Wang Tianji Wu Feng Yan Paper Science Bo Wang,Tianji Wu,Feng Yan 38e91cb6-7537-44d0-9a2f-3fb26b020e88 Optimal Data Distribution for Versatile Finite Impulse Response Filtering on Next-Generation Graphics Hardware Using CUDA In this paper, we investigate discrete finite impulse response (FIR) filtering of images, while harnessing the powerful computational resources of next-generation GPUs. These novel platforms exhibit a massive data parallel architecture with an advanced SIMT execution model and thread management, to enable designers to better cope with the infamous memory wall, i.e. the growing gap between the cost of data communication and computational processing. http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.79 /content/cudazone/CUDABrowser/assets/images/applications/916_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/916_cs_large.jpg Academia Shenzhen, Guangdong, China 2009 12 11 12/11/2009 Patrik Goorts Sammy Rogmans Philippe Bekaert Paper Science Patrik Goorts,Sammy Rogmans,Philippe Bekaert b6017842-9107-4a99-b3f3-b15cc13fe777 Parallel Lexicographic Names Construction with CUDA The suffix array is a simpler and more compact alternative to the suffix tree, and lexicographic name construction is the fundamental building block of the suffix array construction process. This paper describes the design issues of the first data-parallel implementation of the lexicographic name construction algorithm on a commodity multiprocessor GPU using the Compute Unified Device Architecture (CUDA) platform, both from NVIDIA Corporation. 
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.31 /content/cudazone/CUDABrowser/assets/images/applications/915_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/915_cs_large.jpg Academia Shenzhen, Guangdong, China 2009 12 11 12/11/2009 Weidong Sun Zongmin Ma Paper Science Weidong Sun,Zongmin Ma 0b8f9a03-72c0-4eb4-bed8-7189f3048805 Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+ Graphic Processing Unit (GPU), with many light-weight data-parallel cores, can provide substantial parallel computing power to accelerate several general purpose applications. Both the AMD and NVIDIA corps provide their specific high performance GPUs and software platforms. http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.12 /content/cudazone/CUDABrowser/assets/images/applications/914_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/914_cs_large.jpg Academia Shenzhen, Guangdong, China 2009 12 11 12/11/2009 Guibin Wang Tao Tang Xudong Fang Paper Science Guibin Wang,Tao Tang,Xudong Fang a76e6a95-dc8f-444b-af64-badd2fddee07 Accelerating Multi-scale Image Fusion Algorithms Using CUDA Recently, fusion speed has emerged as an important factor in the image fusion and a substantial amount of memory and computing power are required for a high-speed fusion. This paper shows approaches to accelerate multi-scale image fusion speed on GPU (Graphics Processing Unit) using CUDA (Compute Unified Device Architecture). 
http://www.computer.org/portal/web/csdl/doi/10.1109/SoCPaR.2009.63 /content/cudazone/CUDABrowser/assets/images/applications/913_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/913_cs_large.jpg Academia Malacca, Malaysia 2007 12 14 12/14/2007 Seung-Hun Yoo Jin-Hyung Park Chang-Sung Jeong Paper Science Seung-Hun Yoo,Jin-Hyung Park,Chang-Sung Jeong 62ed4115-e5b5-4c6d-947e-3cb75f5c66d5 An Improved Parallel Implementation of 3D DRIE Simulation on GPU The deep reactive ion etching (DRIE) technique is a new and powerful tool in Micro-Electro-Mechanical Systems (MEMS) fabrication. A 3D DRIE simulation can help researchers understand the time evolution of the Bosch process used in DRIE. Due to the high complexity of the algorithm used in the simulation, it is necessary to develop an algorithm that can accelerate the simulation. http://www.computer.org/portal/web/csdl/doi/10.1109/I-SPAN.2009.111 /content/cudazone/CUDABrowser/assets/images/applications/912_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/912_cs_large.jpg Academia Kaohsiung, Taiwan 2008 12 14 12/14/2008 Fan Zhang Gang Wang Xiaoguang Liu Paper Science Fan Zhang,Gang Wang,Xiaoguang Liu 5f454c72-f9aa-4a01-b530-1dc441677f43 CheCUDA: A Checkpoint/Restart Tool for CUDA Applications In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a part of basic CUDA driver API calls in order to record the status changes on the main memory. 
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.78 /content/cudazone/CUDABrowser/assets/images/applications/911_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/911_cs_large.jpg Academia Higashi Hiroshima, Japan 2008 12 11 12/11/2008 Hiroyuki Takizawa Katsuto Sato Kazuhiko Komatsu Paper Science Hiroyuki Takizawa,Katsuto Sato,Kazuhiko Komatsu d7276c60-49fb-4114-8743-89bca632df40 Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing Power dissipation is one of the most imminent limitation factors influencing the development of High Performance Computing (HPC). Toward power-efficient HPC on CPU-GPU hybrid platform, we are investigating software methodologies to achieve optimized power utilization by algorithm design and programming technique. In this paper we discuss power measurements of GPU http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.65 /content/cudazone/CUDABrowser/assets/images/applications/910_cs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/910_cs_large.jpg Academia Higashi Hiroshima, Japan 2008 12 11 12/11/2008 Reiji Suda Da Qi Ren Paper Science Reiji Suda,Da Qi Ren 985ae77e-9e9e-4719-b960-cce5bb84051a Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs Using CUDA Expectation maximization (EM) algorithm is an iterative technique widely used in the fields of signal processing and data mining. We present a parallel implementation of EM for finding maximum likelihood estimates of parameters of Gaussian mixture models, designed for many-core architecture of Graphics Processing Units (GPU). http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5166982 /content/cudazone/CUDABrowser/assets/images/applications/909_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/909_logo_xplore_large.gif Research NVIDIA Corp. 
http://www.nvidia.com/cuda 2009 07 17 07/17/2009 Kumar, N Satoor, S Buck, I Paper Science Kumar, N,Satoor, S, Buck, I 1a3dd903-83a7-40a0-ba34-0f84d4c7df59 Parallelizing Motion JPEG 2000 with CUDA Due to the rapid growth of Graphics Processing Unit (GPU) processing capability, using GPU as a coprocessor for assisting the CPU in computing massive data has become indispensable. Nvidia's CUDA general-purpose graphical processing unit (GPGPU) architecture can greatly benefit single instruction multiple thread (SIMT) styled, computationally expensive programs. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5380169 /content/cudazone/CUDABrowser/assets/images/applications/907_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/907_logo_xplore_large.gif Research IEEE 2010 01 15 01/15/2010 Datla, Sanketh Gidijala Naga Sathish Paper Science Datla, Sanketh,Gidijala,Naga Sathish e8ece29c-cc39-4c42-a3fb-4f4d78a4a3bf Reliability modeling of MEMS devices on CUDA based HPC setup In this paper, we have reviewed the development in CUDA and the implementation of various distribution that exists in the reliability for MEMS based devices on a CUDA setup. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5340289 /content/cudazone/CUDABrowser/assets/images/applications/905_logo_xplore_small.gif /content/cudazone/CUDABrowser/assets/images/applications/905_logo_xplore_large.gif Academia Acropolis Inst. of Technol. & Res., Indore, India 2009 11 24 11/24/2009 Pathak, R Joshi, S Paper Science Pathak, R,Joshi, S fe250813-b016-4b71-b44d-80e8de5f4166 Survey on Parallel Programming Model The development of microprocessors design has been shifting to multi-core architectures. Therefore, it is expected that parallelism will play a significant role in future generations of applications. Throughout the years, there has been a myriad number of parallel programming models proposed. 
In choosing a parallel programming model, not only the performance aspect is important, but also the qualitative aspect of how well parallelism is abstracted to developers. A model with a good abstraction of parallelism leads to higher application-development productivity. In this paper, we propose seven criteria to qualitatively evaluate parallel programming models. Our focus is on how parallelism is abstracted and presented to application developers. As a case study, we use these criteria to investigate six well-known parallel programming models in the HPC community. /content/cudazone/CUDABrowser/assets/images/applications/904_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/904_cover-medium_large.jpg Research Sun Microsystems 2008 10 11 10/11/2008 Henry Kasim Verdi March Rita Zhang Paper Science Henry Kasim,Verdi March,Rita Zhang,henry.kasim@sun.com,verdi.march@sun.com,rita.zhang@sun.com 4736ba0b-002f-4ca5-b107-c70a1e05a004 A Variational Approach to Semiautomatic Generation of Digital Terrain Models We present a semiautomatic approach to generate high quality digital terrain models (DTM) from digital surface models (DSM). A DTM is a model of the earth's surface, where all man-made objects and the vegetation have been removed. In order to achieve this, we use a variational energy minimization approach. The proposed energy functional incorporates Huber regularization to yield piecewise smooth surfaces and an L1 norm in the data fidelity term. Additionally, a minimum constraint is used in order to prevent the ground level from pulling up, while buildings and vegetation are pulled down. Being convex, the proposed formulation allows us to compute the globally optimal solution. Clearly, a fully automatic approach does not yield the desired result in all situations. Therefore, we additionally allow the user to affect the algorithm using different user interaction tools. 
Furthermore, we provide a real-time 3D visualization of the output of the algorithm which additionally helps the user to assess the final DTM. We present results of the proposed approach using several real data sets. /content/cudazone/CUDABrowser/assets/images/applications/903_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/903_cover-medium_large.jpg Academia Graz University of Technology 2009 11 26 11/26/2009 Andreas Klaus Thomas Pock Markus Grabner Paper Science Andreas Klaus,Thomas Pock,Markus Grabner 58b1aaf4-695f-453b-833a-0c77c40fcae7 Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs We discuss implementing blocked sparse matrix-vector multiplication for NVIDIA GPUs. We outline an algorithm and various optimizations, and identify potential future improvements and challenging tasks. In comparison with previously published implementation, our implementation is faster on matrices having many high fill-ratio blocks but slower on matrices with low number of non-zero elements per row. /content/cudazone/CUDABrowser/assets/images/applications/902_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/902_cover-medium_large.jpg Academia Institute for System Programming of RAS, Russia 2009 07 21 07/21/2009 Alexander Monakov Arutyun Avetisyan Paper Science Alexander Monakov,Arutyun Avetisyan,amonakov@ispras.ru,arut@ispras.ru ba6d3a3b-b774-495a-8c34-f7c46204b175 AtelierM++: a fast and accurate marbling system We present AtelierM++, a new interactive marbling image rendering system which allows artists to create marbling textures with real-time visual feedback on mega-pixel sized images. Marbling is a method of aqueous surface design, which can produce patterns similar to marble or other stone, hence the name. The system is based on the physical model of the traditional marbling process. We simulate real marbling by solving the Navier-Stokes equations on the graphics processing unit. 
We employ a third-order accurate but fast Unsplit semi-Lagrangian Constrained Interpolation Profile method to reduce the numerical dissipation while retaining the stability. To simulate very sharp interface lines among different paints, a simple yet effective transformation function is applied to the paint concentrations. Several intuitive interfaces are implemented to provide flexible control for users. Extensive experimental results are shown to demonstrate both the effectiveness and efficiency of the proposed approach. /content/cudazone/CUDABrowser/assets/images/applications/901_cover-medium7_small.png /content/cudazone/CUDABrowser/assets/images/applications/901_cover-medium7_large.png Academia Zhejiang University, China 2009 05 12 05/12/2009 Hanli Zhao Xiaogang Jin Shufang Lu Paper Science Hanli Zhao,Xiaogang Jin,Shufang Lu,hanlizhao@gmail.com,jin@cad.zju.edu.cn,lushufang@cad.zju.edu.cn 13440b6a-5288-46fb-912a-0a7945d88544 Implementing P Systems Parallelism by Means of GPUs Software development for Membrane Computing is growing, yielding new applications. Nowadays, the efficiency of P systems simulators has become a critical point when working with instances of large size. The newest generation of GPUs (Graphics Processing Units) provides a massively parallel framework to compute general purpose computations. We present GPUs as an alternative to obtain better performance in the simulation of P systems and we illustrate it by giving a solution to the N-Queens problem as an example. /content/cudazone/CUDABrowser/assets/images/applications/900_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/900_cover-medium_large.jpg Academia University of Sevilla, Spain 2010 01 20 01/20/2010 Jose M. Cecilia Jose M. Garcia Gines D. Guerrero Paper Science Jose M. Cecilia,Jose M. Garcia,Gines D. 
Guerrero,chema@ditec.um.es,jmgarcia@ditec.um.es,gines.guerrero@ditec.um.es b21f6d6e-aa27-42cf-b19a-8c8d970e2433 Real-Time Neighborhood Based Disparity Estimation Incorporating Temporal Evidence This paper presents a system for dense area based disparity estimation from binocular rectified image sequences with the integration of temporal evidence. The system uses dense 2D optical flow fields and timely displaced disparity estimates to reason about the observed 3D scene flow. This scene flow is then exploited to strengthen timely consistent observations in the disparity estimation. Moreover, a novel neighborhood assumption is presented, which allows the presented algorithm to be implemented seamlessly on the GPU. It is shown that by means of the presented approach the sensitivity to noise and ambiguities observed with plain real-time disparity estimations can be improved, even in fully dynamic scenarios with simultaneous movement of objects and cameras. /content/cudazone/CUDABrowser/assets/images/applications/899_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/899_cover-medium_large.jpg Academia University of Kiel, Germany 2008 06 29 06/29/2008 Bogumil Bartczak Daniel Jung Reinhard Koch Paper Science Bogumil Bartczak,Daniel Jung,Reinhard Koch,bartczak@mip.informatik.uni-kiel.de,djung@mip.informatik.uni-kiel.de,rk@mip.informatik.uni-kiel.de 82775ff7-a0ec-4b29-8d21-f366cda8039a Relighting Forest Ecosystems Real-time cinematic relighting of large, forest ecosystems remains a challenging problem, in that important global illumination effects, such as leaf transparency and inter-object light scattering, are difficult to capture, given tight timing constraints and scenes that typically contain hundreds of millions of primitives. A solution that is based on a lattice-Boltzmann method is suggested. 
Reflectance, transmittance, and absorptance parameters are taken from measurements of real plants and integrated into a parameterized, dynamic global illumination model. When the model is combined with fast shadow rays, traced on a GPU, near real-time cinematic relighting is achievable for forest scenes containing hundreds of millions of polygons. /content/cudazone/CUDABrowser/assets/images/applications/898_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/898_cover-medium_large.jpg Academia Clemson University 2009 11 26 11/26/2009 Jay E. Steele Robert Geist Paper Science Jay E. Steele,Robert Geist,jesteel@cs.clemson.edu,geist@cs.clemson.edu 179e73fd-8aa8-41ce-9778-7fc4aa5d5044 Acceleration of cardiac tissue simulation with graphic processing units In this technical note we show the promise of using graphic processing units (GPUs) to accelerate simulations of electrical wave propagation in cardiac tissue, one of the more demanding computational problems in cardiology. We have found that the computational speed of two-dimensional (2D) tissue simulations with a single commercially available GPU is about 30 times faster than with a single 2.0 GHz Advanced Micro Devices (AMD) Opteron processor. We have also simulated wave conduction in the three-dimensional (3D) anatomic heart with GPUs where we found the computational speed with a single GPU is 1.6 times slower than with a 32-central processing unit (CPU) Opteron cluster. However, a cluster with two or four GPUs is faster than the CPU-based cluster. These results demonstrate that a commodity personal computer is able to perform a whole heart simulation of electrical wave conduction within times that enable the investigators to interact more easily with their simulations. 
/content/cudazone/CUDABrowser/assets/images/applications/897_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/897_prediction_large.png Academia David Geffen School of Medicine at UCLA, Los Angeles, CA 2009 08 04 08/04/2009 Daisuke Sato Alan Garfinkel Paper Computer Aided Engineering Daisuke Sato,Alan Garfinkel,dasato@mednet.ucla.edu,agarfinkel@mednet.ucla.edu 765b8ab8-2775-423e-8d68-3d5f4a6cc0b5 Real-Time Prediction of Brain Shift Using Nonlinear Finite Element Algorithms Patient-specific biomechanical models implemented using specialized nonlinear (i.e. taking into account material and geometric nonlinearities) finite element procedures were applied to predict the deformation field within the brain for five cases of craniotomy-induced brain shift. The procedures utilize the Total Lagrangian formulation with explicit time stepping. The loading was defined by prescribing deformations on the brain surface under the craniotomy. Application of the computed deformation fields to register the preoperative images with the intraoperative ones indicated that the models very accurately predict the intraoperative positions and deformations of the brain anatomical structures for limited information about the brain surface deformations. For each case, it took less than 40 s to compute the deformation field using a standard personal computer, and less than 4 s using a Graphics Processing Unit (GPU). The results suggest that nonlinear biomechanical models can be regarded as one possible method of complementing medical image processing techniques when conducting non-rigid registration within the real-time constraints of neurosurgery. 
/content/cudazone/CUDABrowser/assets/images/applications/896_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/896_cover-medium_large.jpg Academia The University of Western Australia 2009 09 30 09/30/2009 Grand Roman Joldes Paper Science Grand Roman Joldes,grandj@mech.uwa.edu.au 6134f011-ed6e-4cf4-9f52-887c65646088 An Extension of the StarSs Programming Model for Platforms with Multiple GPUs While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDK or programming languages which are not always easy to program. Thus, the impact of the new programming paradigms on the programmer's productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results. /content/cudazone/CUDABrowser/assets/images/applications/895_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/895_cover-medium_large.jpg Academia Consejo Superior de Investigaciones Cientificas, Spain 2009 08 22 08/22/2009 Eduard Ayguade Rosa M. Badia Francisco D. Igual Paper Science Task-level parallelism,heterogeneous systems,programming models,Eduard Ayguade,Rosa M. Badia,Francisco D. 
Igual,eduard.ayguade@bsc.es,rosa.m.badia@bsc.es,figual@icc.uji.es 480d6c65-d2c8-4d10-88e9-8f4cecdddd49 Fast Image Mapping of Endoscopic Image Mosaics with Three-Dimensional Ultrasound Image for Intrauterine Treatment of Twin-to-Twin Transfusion Syndrome This paper describes a fast image mapping system that integrates endoscopic image mosaics with three-dimensional (3-D) ultrasound images for assisting intrauterine treatment of twin-to-twin transfusion syndrome (TTTS) by laser photocoagulation. Endoscopic laser photocoagulation treatment has a good survival rate and a low complication rate for twins. However, the small field of view and lack of surrounding information makes the identification of vessels anastomosis difficult. We have developed an extended placenta visualization system with the fusion of endoscopic image mosaics with a 3-D ultrasound-image placenta model. Fully automatic and fast calibration is used for endoscope calibration in fluid. The 3-D spatial position of the endoscopic images and the ultrasound image are tracked by a 3-D position tracking device. The mosaiced endoscope images are registered to the surface of the 3-D ultrasound placenta model by using a fast GPU-based image rendering method. Experimental results show that the system may provide an improved and efficient way of planning and guidance in laser photocoagulation TTTS treatment. /content/cudazone/CUDABrowser/assets/images/applications/894_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/894_cover-medium_large.jpg Academia The University of Tokyo 2008 07 15 07/15/2008 Hongen Liao Paper Science Hongen Liao,liao@bmpe.t.u-tokyo.ac.jp 0a6541a1-93ac-4431-a6ff-45865398551e Accelerated Discovery of Discrete M-Clusters/Outliers on the Raster Plane Using Graphical Processing Units This paper presents two discrete computational geometry algorithms designed for execution on Graphics Processing Units (GPUs). 
The algorithms are parallelized versions of sequential algorithms intended for application in geographical data mining. The first algorithm finds clusters of m points, called m-clusters, in the rasterized plane. The second algorithm complements the first by identifying outliers, those points which are not members of any m-clusters. The use of a raster representation of coordinates provides an ideal data stream environment for efficient GPU utilization. The parallel algorithms have low memory demands, and require only a limited amount of inter-process communication. Initial performance analysis indicates the algorithms are scalable, both in problem size and in the number of seeds, and significantly outperform commercial implementations. /content/cudazone/CUDABrowser/assets/images/applications/893_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/893_cover-medium_large.jpg Academia Grand Valley State University, MI / Univ. of Maine-Augusta, ME 2009 05 20 05/20/2009 Christian Trefftz Joseph Szakas Igor Majdandzic Paper Numerics GPU algorithms,Geographical data mining,Christian Trefftz,Joseph Szakas,Igor Majdandzic,trefftzc@gvsu.edu,szakas@maine.edu,majdanig@student.gvsu.edu db591363-d303-433e-9ab7-f3e856c6a6b0 GP on SPMD parallel graphics hardware for mega Bioinformatics data mining We demonstrate a SIMD C++ genetic programming system on a single 128 node parallel NVIDIA GeForce 8800 GTX GPU under RapidMind's GPGPU Linux software by predicting ten year+ outcome of breast cancer from a dataset containing a million inputs. NCBI GEO GSE3494 contains hundreds of Affymetrix HG-U133A and HG-U133B GeneChip biopsies. Multiple GP runs each with a population of 5 million programs winnow useful variables from the chaff at more than 500 million GPops per second. Sources available via FTP. 
/content/cudazone/CUDABrowser/assets/images/applications/892_cover-medium6_small.png /content/cudazone/CUDABrowser/assets/images/applications/892_cover-medium6_large.png Academia University of Essex, Colchester 2008 05 08 05/08/2008 W. B. Langdon Paper Computer Aided Engineering W. B. Langdon,wlangdon@essex.ac.uk ddc5f77d-8ad3-4adf-bf3b-d88271d702fe A Real-Time Evolutionary Object Recognition System We have created a real-time evolutionary object recognition system. Genetic Programming is used to automatically search the space of possible computer vision programs guided through user interaction. The user selects the object to be extracted with the mouse pointer and follows it over multiple frames of a video sequence. Several different alternative algorithms are evaluated in the background for each input image. Real-time performance is achieved through the use of the GPU for image processing operations. 
/content/cudazone/CUDABrowser/assets/images/applications/891_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/891_cover-medium_large.jpg Academia Eberhard-Karls-Universitat Tubingen 2009 04 10 04/10/2009 Marc Ebner Paper Science Marc Ebner,marc.ebner@wsii.uni-tuebingen.de 4861ac13-3ca4-4659-9fb8-ae5704b94996 Concurrent CT Reconstruction and Visual Analysis Using Hybrid Multi-resolution Raycasting in a Cluster Environment GPU clusters nowadays combine enormous computational resources of GPUs and multi-core CPUs. This paper describes a distributed program architecture that leverages all resources of such a cluster to incrementally reconstruct, segment and render 3D cone beam computer tomography (CT) data with the objective to provide the user with results as quickly as possible at an early stage of the overall computation. As the reconstruction of high-resolution data sets requires a significant amount of time, our system first creates a low-resolution preview volume on the head node of the cluster, which is then incrementally supplemented by high-resolution blocks from the other cluster nodes using our multi-resolution renderer. It is further used for graphically choosing reconstruction priority and render modes of sub-volume blocks. The cluster nodes use their GPUs to reconstruct and render sub-volume blocks, while their multi-core CPUs are used to segment already available blocks. 
/content/cudazone/CUDABrowser/assets/images/applications/890_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/890_cover-medium_large.jpg Academia Visualisierungsinstitut der Universitat Stuttgart 2009 11 26 11/26/2009 Steffen Frey Christoph Muller Magnus Strengert Paper Science Steffen Frey,Christoph Muller,Magnus Strengert 67fcb101-5c74-49c7-abf2-b026feeea773 Modelling Anisotropic Viscoelasticity for Real-Time Soft Tissue Simulation Previously almost all biomechanically-based time-critical surgical simulation has ignored the well established features of tissue mechanical response of anisotropy and time-dependence. We address this issue by presenting an efficient solution procedure for anisotropic visco-hyperelastic constitutive models which allows use of these in nonlinear explicit dynamic finite element algorithms. We show that the procedure allows incorporation of both anisotropy and viscoelasticity for as little as 5.1% additional cost compared with the usual isotropic elastic models. When combined with high performance GPU execution the complete framework is suitable for time-critical simulation applications such as interactive surgical simulation and intraoperative image registration. /content/cudazone/CUDABrowser/assets/images/applications/889_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/889_cover-medium_large.jpg Academia University College London, UK 2008 09 10 09/10/2008 Zeike A. Taylor Paper Science Zeike A. Taylor fdeb943e-2746-42f1-a70b-e92ca74592c7 Fast and Robust Face Tracking for Analyzing Multiparty Face-to-Face Meetings This paper presents a novel face tracker and verifies its effectiveness for analyzing group meetings. In meeting scene analysis, face direction is an important clue for assessing the visual attention of meeting participants. 
The face tracker, called STCTracker (Sparse Template Condensation Tracker), estimates face position and pose by matching face templates in the framework of a particle filter. STCTracker is robust against large head rotation, up to 60 degrees in the horizontal direction, with relatively small mean deviation error. Also, it can track multiple faces simultaneously in real-time by utilizing a modern GPU (Graphics Processing Unit), e.g. 6 faces at about 28 frames/second on a single PC. Also, it can automatically build 3-D face templates upon initialization of the tracker. This paper evaluates the tracking errors and verifies the effectiveness of STCTracker for meeting scene analysis, in terms of conversation structures, gaze directions, and the structure of cross-modal interactions involving head gestures and utterances. Experiments confirm that STCTracker can basically match the performance of the user-unfriendly magnetic-sensor-based motion capture system. /content/cudazone/CUDABrowser/assets/images/applications/888_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/888_cover-medium_large.jpg Academia NTT Communication Science Labs, Japan 2008 09 20 09/20/2008 Kazuhiro Otsuka Junji Yamato Paper Science Kazuhiro Otsuka,Junji Yamato,otsuka@eye.brl.ntt.co.jp,yamato@eye.brl.ntt.co.jp d29fe864-0d87-456c-9127-ae0164499337 SUNVIZ: A Real-Time Visualization Environment for Space Physics Applications Real-time physically accurate simulations are difficult to create because of limited computational power available on a CPU. General purpose computing on the graphics processing unit (GPU) can provide a significant increase in performance. We are able to investigate the flow characteristics of a cloud of charged particles, which is one of the first steps in our goal of generating a real-time Coronal Mass Ejection (CME) simulator. 
Preliminary results show a sustained 60 Hz visual simulation with approximately four million particles and a non-visual simulation of 16 million particles at 30 Hz. The simulator provides a novel way to investigate a CME in real-time, and it has the potential to predict when a particular CME is geoeffective, i.e. an event that could damage electrical infrastructure such as satellites, space stations, power grids, etc. /content/cudazone/CUDABrowser/assets/images/applications/887_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/887_cover-medium_large.jpg Academia University of Alberta Physics 2008 12 03 12/03/2008 S. Eliuk P. Boulanger K. Kabin Paper Science S. Eliuk,P. Boulanger,K. Kabin af4a522e-2156-4c31-99f2-519daaa3e24d Graphic processing unit-accelerated mutual information-based 3D image rigid registration Mutual information (MI)-based image registration is effective in registering medical images, but it is computationally expensive. This paper accelerates MI-based image registration by dividing computation of mutual information into spatial transformation and histogram-based calculation, and performing 3D spatial transformation and trilinear interpolation on a graphic processing unit (GPU). The 3D floating image is downloaded to the GPU as a flat 3D texture, and then fetched and interpolated for each new voxel location in the fragment shader. The transformed results are rendered to textures using the frame buffer object (FBO) extension, and then read back to main memory for the remaining computation on the CPU. Experimental results show that the GPU-accelerated method can achieve a speedup of about an order of magnitude with better registration results compared with the software implementation on a single-core CPU. 
/content/cudazone/CUDABrowser/assets/images/applications/886_transactions_small.png /content/cudazone/CUDABrowser/assets/images/applications/886_transactions_large.png Academia Dalian University of Technology, China 2009 10 26 10/26/2009 Zongying Ou Paper Computer Aided Engineering Zongying Ou,ouzyg@dlut.edu.cn 684bdf6c-6ae7-4c97-a001-9dcc5b568603 Dual-RBF based surface reconstruction Surface reconstruction (Bloomenthal and Wyvill, Introduction to Implicit Surfaces, 1997) is a fundamental task in Computer Aided Design (CAD) and Computer Graphics (CG). In this paper, motivated by the physical polar field model (Yuxu Lin Chun Chen in Proceedings of the 3rd Pacific-Rim Symposium on Image and Video Technology, 1997), we propose a novel implicit surface reconstruction approach, named Dual-RBF. First, by simulating the physical polar field model, Dual-RBF provides a good initial reconstruction state. Second, two simple nonlinear methods are introduced to adjust the configuration of the Dual-RBF model, so that a more accurate reconstruction is reached. Third, Dual-RBF becomes even more robust at filling the holes in flawed input point clouds by adopting a multi-level strategy. Finally, the visualization of the surface reconstruction is sped up with the GPU. Experimental results show that the proposed approach is faster and more robust than previous implicit surface reconstruction techniques. 
/content/cudazone/CUDABrowser/assets/images/applications/885_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/885_visualcomputer_large.png Academia Zhejiang University, China 2009 03 03 03/03/2009 Yuxu Lin Chun Chen Mingli Song Paper Science Yuxu Lin,Chun Chen,Mingli Song,linyuxu@zju.edu.cn,chenc@cs.zju.edu.cn,brooksong@ieee.org 67822eb4-51de-4035-b30f-046c25f50c9d A Color Management Process for Real Time Color Reconstruction of Multispectral Images We introduce a new accurate and technology independent display color characterization model for color rendering of multispectral images. The establishment of this model is automatic and takes no longer than a coffee break, so it is efficient in practical situations. This model is a part of the color management workflow of the new tools designed at the C2RMF for multispectral image analysis of paintings acquired with the material developed during the CRISATEL European project. The analysis is based on color reconstruction with virtual illuminants and uses a GPU (Graphics processor unit) based processing model in order to interact in real time with virtual lighting. /content/cudazone/CUDABrowser/assets/images/applications/884_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/884_cover-medium_large.jpg Academia Universite Jean Monnet / France 2009 07 14 07/14/2009 Philippe Colantoni Jean-Baptiste Thomas Paper Science Philippe Colantoni,Jean-Baptiste Thomas 5da0d849-dfa9-4a99-b8d9-1ed49ac75197 Regular Expression Matching on Graphics Hardware for Intrusion Detection The expressive power of regular expressions has often been exploited in network intrusion detection systems, virus scanners, and spam filtering applications. 
However, the flexible pattern matching functionality of regular expressions in these systems comes with significant overheads in terms of both memory and CPU cycles, since every byte of the inspected input needs to be processed and compared against a large set of regular expressions. http://springerlink.com/content/b3m7662014272t8m/?p=0dd80c5c9b564b009c9e0e9c88044df6 /content/cudazone/CUDABrowser/assets/images/applications/883_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/883_cover-medium_large.jpg Academia Foundation for Research and Technology,Hellas 2009 09 30 09/30/2009 48 Giorgos Vasiliadis Michalis Polychronakis Spiros Antonatos Paper Science Giorgos Vasiliadis,Michalis Polychronakis,Spiros Antonatos,gvasil@ics.forth.gr,mikepo@ics.forth.gr,antonat@ics.forth.gr 0d6ee814-5cf0-4da6-988b-a9e7159f4f0a Haptic guided 3-D deformable image registration Purpose: We present a system which supports deformable image registration guided by a haptic device. Methods: The haptic device is tied to a block matching method where a set of uniformly distributed control points determine the block positions. Each control point constitutes a particle in a mass spring grid which limits the space of allowed movements to elastic movements. Control points are manipulated by the haptic device, and the negative gradient of the similarity metric over the corresponding block is rendered as a force on the haptic device guiding the user to a minimum of the optimization landscape. Fast update of forces was achieved by exploiting the GPU for computations of the similarity metric and for interpolation of the deformation field. 
/content/cudazone/CUDABrowser/assets/images/applications/882_cover-medium5_small.png /content/cudazone/CUDABrowser/assets/images/applications/882_cover-medium5_large.png Academia University of Oslo, Norway / Rikshospitalet University Hospital, Norway 2009 02 24 02/24/2009 Petter Risholm Eigil Samset Paper Medical Imaging Petter Risholm,Eigil Samset,pettri@ifi.uio.no 0f7ccffd-5dcf-44a4-9971-18d0a82c6dd3 Radar Signal Processing with Graphics Processors (GPUs) The investigation is conducted by comparing a GPU (GTX260) against a modern desktop CPU for several HPEC (High Performance Embedded Computing) and other radar signal processing algorithms; 12 in total. Several other aspects are also investigated, such as programming environment and efficiency, future GPU-architectures, and applicability in radar systems. Our CUDA GPU-implementations perform substantially better than the CPU and associated CPU-code used for all but one of the 12 algorithms tested, sometimes by a factor of 100 or more. The OpenCL implementations also perform substantially better than the CPU. The substantial performance achieved when using CUDA for almost all benchmarks can be attributed both to the high theoretical performance of the GPU and to the inherent data-parallelism, and hence GPU-suitability, of almost all of the investigated algorithms. Programming CUDA is reasonably straightforward, largely due to the mature development environment and abundance of documentation and white-papers. OpenCL is a lot more tedious to program. Furthermore, the coming CUDA GPU-architecture called Fermi is expected to further increase performance and programmability. When considering system integration of GPU-architectures into harsh radar application environments, one should be aware of potential heat and also possible obsolescence issues. 
/content/cudazone/CUDABrowser/assets/images/applications/881_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/881_logo_large.png Academia HPC http://www.hpcsweden.se 2010 02 08 02/08/2010 140 Ian Wainwright Jimmy Pettersson Paper Signal Processing Ian Wainwright,Jimmy Pettersson,jimmy.pettersson@hpcsweden.se,ian.wainwright@gmail.com 4ff91bfb-b496-46cb-9e1c-3572031aff73 Exploiting the Power of GPUs for Asymmetric Cryptography Modern Graphics Processing Units (GPU) have reached a dimension with respect to performance and gate count exceeding conventional Central Processing Units (CPU) by far. Many modern computer systems include, besides a CPU, such a powerful GPU, which runs idle most of the time and might be used as a cheap and instantly available co-processor for general purpose applications. http://springerlink.com/content/d1rt1r0326500541/?p=0dd80c5c9b564b009c9e0e9c88044df6 /content/cudazone/CUDABrowser/assets/images/applications/880_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/880_cover-medium_large.jpg Academia Ruhr University Bochum, Germany 2008 08 06 08/06/2008 Robert Szerwinski Tim Guneysu Paper Science Robert Szerwinski,Tim Guneysu,szerwinski@crypto.rub.de,gueneysu@crypto.rub.de 8749f743-3122-4ade-9664-c7c12c9cba95 Programmable and Scalable Architecture for Graphics Processing Units Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPU) are often implemented as multiprocessing systems with high performance floating point processing and application specific hardware stages for maximizing the graphics throughput. In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over the traditional VLIW-style architectures, making it interesting for computationally intensive applications. 
We show that TTA provides high floating point processing performance while allowing more programming freedom than vector processors. Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for general purpose computing on GPU APIs which have become popular in recent years. /content/cudazone/CUDABrowser/assets/images/applications/879_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/879_cover-medium_large.jpg Academia Universidad Rey Juan Carlos, Spain 2009 07 21 07/21/2009 Carlos S. de La Lama Pekka Jaaskelainen Jarmo Takala Paper Science Carlos S. de La Lama,Pekka Jaaskelainen,Jarmo Takala,carlos.delalama@urjc.es,pekka.jaaskelainen@tut.fi,jarmo.takala@tut.fi 81a5b615-fc67-4667-ab39-ad855603e008 Breath-Hold Target Localization with Simultaneous Kilovoltage/Megavoltage Cone-Beam CT and Fast Reconstruction Hypofractionated high dose radiotherapy of small lung tumors is very effective and was based on stereotaxy until now. It has recently become possible to achieve a high patient positioning precision based on on-line imaging with cone-beam CT (CBCT) and breath-hold techniques. The CBCT acquisition time of roughly 60 seconds, however, is too long for one breath-hold, resulting in image degradation by respiratory motion artifacts. By using megavoltage (MV) and kilovoltage (kV) photon sources (mounted perpendicularly on the Linac gantry) for volume reconstruction, we could reduce the acquisition time to 15 seconds. /content/cudazone/CUDABrowser/assets/images/applications/878_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/878_prediction_large.png Academia World Congress on Medical Physics and Biomedical Engineering, Germany 2010 01 04 01/04/2010 M. Blessing D. Stsepankou H. Wertz Paper Science M. Blessing,D. Stsepankou,H. 
Wertz b0ea1f26-c2cf-436b-830b-d3cbb7ecb7bf Implementation of Fine-Grained Algorithms on Graphical Processing Unit In this paper we solve the problem of mapping fine-grained algorithms to the graphical processing unit (GPU). Synchronous, asynchronous, block-synchronous and probabilistic cellular automata and an explicit PDE scheme are used as examples. Different implementation variants and their performances are presented. /content/cudazone/CUDABrowser/assets/images/applications/877_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/877_cover-medium_large.jpg Academia ICMMG SB RAS, Novosibirsk, Russia 2009 09 01 09/01/2009 Konstantin Kalgin Paper Science Konstantin Kalgin,kalgin@ssd.sscc.ru ac91c067-09bc-44a8-877b-254db2f289b0 StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE SPUs) or data-parallel accelerators (e.g. GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. 
We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements regarding execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine. /content/cudazone/CUDABrowser/assets/images/applications/876_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/876_cover-medium_large.jpg Academia University of Bordeaux 2009 08 22 08/22/2009 Cedric Augonnet Paper Science Cedric Augonnet 71356391-beda-4ab3-82f5-a0b60765af0f Seismic Wave Field Modeling with Graphics Processing Units GPGPU (general-purpose computing on graphics processing units) is a very effective and inexpensive way of dealing with time consuming computations. In some cases even a low end GPU can be dozens of times faster than a modern CPU. Utilization of GPGPU technology can make a typical desktop computer powerful enough to perform necessary computations in a fast, effective and inexpensive way. Seismic wave field modeling is one of the problems of this kind. Sometimes one modeled common shot-point gather or one wave field snapshot can reveal the nature of an analyzed wave phenomenon. On the other hand, these kinds of modeling are often a part of complex and extremely time consuming methods with almost unlimited needs for computational resources. 
This is always a problem for academic centers, especially now when times of generous support from oil and gas companies have ended. /content/cudazone/CUDABrowser/assets/images/applications/875_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/875_cover-medium_large.jpg Academia AGH University of Science and Technology, Poland 2009 05 21 05/21/2009 Tomasz Danek Paper Science Tomasz Danek,tdanek@agh.edu.pl 9c1ce1c7-2162-49d4-93ca-a21ba5687e48 Active Structured Learning for High-Speed Object Detection High-speed smooth and accurate visual tracking of objects in arbitrary, unstructured environments is essential for robotics and human motion analysis. However, building a system that can adapt to arbitrary objects and a wide range of lighting conditions is a challenging problem, especially if hard real-time constraints apply, as in robotics scenarios. In this work, we introduce a method for learning a discriminative object tracking system based on the recent structured regression framework for object localization. Using a kernel function that allows fast evaluation on the GPU, the resulting system can process video streams at speeds of 100 frames per second or more. Consecutive frames in high speed video sequences are typically very redundant, and for training an object detection system, it is sufficient to have training labels from only a subset of all images. We propose an active learning method that selects training examples in a data-driven way, thereby minimizing the required number of training labels. Experiments on realistic data show that the active learning is superior to previously used methods for dataset subsampling for this task. /content/cudazone/CUDABrowser/assets/images/applications/874_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/874_cover-medium_large.jpg Academia Max Planck Institute for Biological Cybernetics, Tubingen, Germany 2009 09 02 09/02/2009 Christoph H. 
Lampert Jan Peters Paper Science Christoph H. Lampert,Jan Peters,ChristophH.Lampert@tuebingen.mpg.de,Jan.Peters@tuebingen.mpg.de 7afe1a21-0c1b-40fd-9f4b-60e731b26240 Attaining High Performance in General-Purpose Computations on Current Graphics Processors The increase in performance of the last generations of graphics processors (GPUs) has made this class of hardware a coprocessing platform of remarkable success in certain types of operations. In this paper we evaluate the performance of linear algebra and image processing routines, both on classical and unified GPU architectures and traditional processors (CPUs). From this study, we gain insights on the properties that make an algorithm likely to deliver high performance on a GPU. /content/cudazone/CUDABrowser/assets/images/applications/873_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/873_cover-medium_large.jpg Academia Universidad Jaume, Spain 2008 12 06 12/06/2008 Francisco D. Igual Rafael Mayo Enrique S. Quintana-Orti Paper Science Francisco D. Igual,Rafael Mayo,Enrique S. Quintana-Orti,figual@icc.uji.es,mayo@icc.uji.es,quintana@icc.uji.es 6fe3ccf0-abfd-4113-bcf7-49d91f20f318 Efficient Multiplication of Polynomials on Graphics Hardware We present an algorithm to multiply univariate polynomials with integer coefficients efficiently using the Number Theoretic transform (NTT) on Graphics Processing Units (GPU). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits fused multiply-add capabilities of the graphics hardware. NTT multiplications are executed in parallel for a set of distinct primes followed by reconstruction using the Chinese Remainder theorem (CRT) on the GPU. Our benchmarks show NTT multiplication performance of up to 77 GMul/s. We compared our approach with CPU-based implementations of polynomial and large integer multiplication provided by NTL and GMP libraries. 
/content/cudazone/CUDABrowser/assets/images/applications/872_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/872_cover-medium_large.jpg Academia Saarbrucken 2009 08 21 08/21/2009 Pavel Emeliyanenko Paper Science Pavel Emeliyanenko,asm@mpi-inf.mpg.de 5f4ef7a1-9d5f-4ef3-a1ea-c2be5ed4d1e8 Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach Low-Density Parity-Check (LDPC) codes are powerful error correcting codes adopted by recent communication standards. LDPC decoders are based on belief propagation algorithms, which make use of a Tanner graph and very intensive message-passing computation, and usually require hardware-based dedicated solutions. With the exponential increase of the computational power of commodity graphics processing units (GPUs), new opportunities have arisen to develop general purpose processing on GPUs. This paper proposes the use of GPUs for implementing flexible and programmable LDPC decoders. A new stream-based approach is proposed, based on compact data structures to represent the Tanner graph. It is shown that such a challenging application for stream-based computing, because of irregular memory access patterns, memory bandwidth and recursive flow control constraints, can be efficiently implemented on GPUs. The proposal was experimentally evaluated by programming LDPC decoders on GPUs using the Caravela platform, a generic interface tool for managing the kernels' execution regardless of the GPU manufacturer and operating system. Moreover, to relatively assess the obtained results, we have also implemented LDPC decoders on general purpose processors with Streaming Single Instruction Multiple Data (SIMD) Extensions. Experimental results show that the solution proposed here efficiently decodes several codewords simultaneously, reducing the processing time by one order of magnitude. 
/content/cudazone/CUDABrowser/assets/images/applications/871_cover-medium4_small.png /content/cudazone/CUDABrowser/assets/images/applications/871_cover-medium4_large.png Academia Universidade de Coimbra 2009 09 28 09/28/2009 Gabriel Falcao Shinichi Yamagiwa Vitor Silva Paper Science Gabriel Falcao,Shinichi Yamagiwa,Vitor Silva,gff@co.it.pt,yama@inesc-id.pt,vitor@co.it.pt 394dbac7-73bd-4c9e-86da-1dd81c35ad28 Retargeting PLAPACK to Clusters with Hardware Accelerators Hardware accelerators are becoming a highly appealing approach to boost the raw performance as well as the price-performance and power-performance ratios of current clusters. In this paper we present a strategy to retarget PLAPACK, a library initially designed for clusters of nodes equipped with general-purpose processors and a single address space per node, to clusters equipped with graphics processors (GPUs). In our approach data are kept in the device memory and only retrieved to main memory when they have to be communicated to a different node. Here we benefit from the object-based orientation of PLAPACK which allows all communication between host and device to be embedded within a pair of routines, providing a clean abstraction that enables an efficient and direct port of all the contents of the library. Our experiments in a cluster consisting of 16 nodes with two NVIDIA Quadro FX5800 GPUs each show the performance of our approach. /content/cudazone/CUDABrowser/assets/images/applications/870_FLAMEbanner_small.png /content/cudazone/CUDABrowser/assets/images/applications/870_FLAMEbanner_large.png Academia University Jaume I / Texas University 2010 02 11 02/11/2010 Fogue Igual Quintana-Orti Paper Numerics Fogue,Igual,Quintana-Orti,figual@icc.uji.es d4831d82-4e49-46b5-9751-c1e58a61d67a Neural Network Training with Extended Kalman Filter Using Graphics Processing Unit The graphics processing unit has evolved through the years into a powerful resource for general purpose computing. 
We present in this article an implementation of the Extended Kalman filter used for recurrent neural network training, in which the most computationally intensive tasks are performed on the GPU. This approach achieves significant speedup of the neural network training process for larger networks. /content/cudazone/CUDABrowser/assets/images/applications/869_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/869_cover-medium_large.jpg Academia Slovak University of Technology in Bratislava 2008 08 29 08/29/2008 Peter Trebaticky Jiri Pospichal Paper Science Peter Trebaticky,Jiri Pospichal,trebaticky@fiit.stuba.sk,pospichal@fiit.stuba.sk c8daa779-2c65-4b59-a45a-a3648753fb56 Fast collision detection using the A-buffer This paper presents a novel and fast image-space collision detection algorithm with the A-buffer, where the GPU computes the potentially colliding sets (PCSs), and the CPU performs the standard triangle intersection test. When the bounding boxes of two objects intersect, the intersection is passed to the GPU. The object surfaces in the intersection are rendered into the A-buffer. Rendering into the A-buffer is up to eight times faster than the ordinary approaches. Then, PCSs are computed by comparing the depth values of each texel of the A-buffer. A PCS consists of only two triangles. The PCSs are read back to the CPU, and the CPU computes the intersection points between the triangles. The proposed algorithm runs extremely fast, does not require any preprocessing, can handle dynamic objects including deformable and fracturing models, and can compute self-collisions. Such versatility and performance gain of the proposed algorithm prove its usefulness in real-time applications such as 3D games. 
/content/cudazone/CUDABrowser/assets/images/applications/868_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/868_visualcomputer_large.png Academia Korea University, Seoul, Korea 2008 05 17 05/17/2008 Hanyoung Jang JungHyun Han Paper Science Hanyoung Jang,JungHyun Han,jhan@korea.ac.kr 3752cd56-fe2a-4457-a8e1-ea665d83102d Engineering of Computer Vision Algorithms Using Evolutionary Algorithms Computer vision algorithms are currently developed by looking up the available operators from the literature and then arranging those operators such that the desired task is performed. This is often a tedious process which also involves testing the algorithm with different lighting conditions or at different sites. We have developed a system for the automatic generation of computer vision algorithms at interactive frame rates using GPU accelerated image processing. The user simply tells the system which object should be detected in an image sequence. Simulated evolution, in particular Genetic Programming, is used to automatically generate and test alternative computer vision algorithms. Only the best algorithms survive and eventually provide a solution to the user's image processing task. /content/cudazone/CUDABrowser/assets/images/applications/867_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/867_cover-medium_large.jpg Academia Eberhard Karls Universitat Tubingen 2009 09 30 09/30/2009 Marc Ebner Paper Science Marc Ebner,marc.ebner@wsii.uni-tuebingen.de e2835997-b2cc-4236-a3c2-83316b6befcb Solving Dense Linear Systems on Graphics Processors We present several algorithms to compute the solution of a linear system of equations on a GPU, as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We also show how iterative refinement with mixed-precision can be used to regain full accuracy in the solution of linear systems. 
Experimental results on a G80 using CUBLAS 1.0, the implementation of BLAS for NVIDIA GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed. /content/cudazone/CUDABrowser/assets/images/applications/866_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/866_cover-medium_large.jpg Academia Universidad Jaume 2008 08 21 08/21/2008 Sergio Barrachina Maribel Castillo Francisco D. Igual Paper Science Sergio Barrachina,Maribel Castillo,Francisco D. Igual,barrachi@icc.uji.es,castillo@icc.uji.es,figual@icc.uji.es ee93a2d7-a172-4d28-88cc-ea3581de0988 Visual simulation of thermal fluid dynamics in a pressurized water reactor We present a simulation and visualization system for a critical application analysis of the thermal fluid dynamics inside a pressurized water reactor of a nuclear power plant when cold water is injected into the reactor vessel. We employ a hybrid thermal lattice Boltzmann method (HTLBM), which has the advantages of ease of parallelization and ease of handling complex simulation boundaries. For efficient computation and storage of the irregular-shaped simulation domain, we classify the domain into nonempty and empty cells and apply a novel packing technique to organize the nonempty cells. This method is implemented on a GPU cluster for acceleration. We demonstrate the formation of cold-water plumes in the reactor vessel. A set of interactive visualization tools, such as side-view slices, 3D volume rendering, thermal layers rendering, and panorama rendering, are provided to collectively visualize the structure and dynamics of the temperature field in the vessel. To the best of our knowledge, this is the first system that combines 3D simulation and visualization for analyzing thermal shock risk in a pressurized water reactor. 
/content/cudazone/CUDABrowser/assets/images/applications/865_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/865_visualcomputer_large.png Academia Stony Brook University, NY 2009 01 23 01/23/2009 Zhe Fan Yu-Chuan Kuo Ye Zhao Paper Science Zhe Fan,Yu-Chuan Kuo,Ye Zhao,fzhe@cs.sunysb.edu,yukuo@cs.sunysb.edu,zhao@cs.kent.edu bf51755a-de3f-4ba1-b775-ba5134f861e9 A novel multiple-walk parallel algorithm for the Barnes-Hut treecode on GPUs towards cost effective, high performance N-body simulation Recently, general-purpose computation on graphics processing units (GPGPU) has become an increasingly popular field of study as graphics processing units (GPUs) continue to be proposed as high performance and relatively low cost implementation platforms for scientific computing applications. Among these applications are astrophysical N-body simulations, which form one of the most challenging problems in computational science. However, in most reported studies, a simple algorithm was used for GPGPUs, and the resulting performances were not observed to be better than those of conventional CPUs that were based on more optimized algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because of the difficulty in getting efficient implementations of such algorithms on GPUs, a GPU cluster had no practical advantage over general-purpose PC clusters for N-body simulations. In this paper, we report a new method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows the realization of an N-body simulation on a GPU cluster at a much higher performance than that on general PC clusters. We performed a cosmological simulation with 562 million particles on a GPU cluster using 128 NVIDIA GeForce 8800GTS GPUs at an overall cost of $168,172.
We obtained a sustained performance of 20.1 Tflops, which when normalized against a general-purpose CPU implementation leads to a performance of 8.50 Tflops. The achieved cost/performance was hence a mere $19.8 /Gflops which shows the high competitiveness of GPGPUs. /content/cudazone/CUDABrowser/assets/images/applications/864_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/864_implementation_large.png Academia Nagasaki University, Japan 2009 05 20 05/20/2009 Tsuyoshi Hamada Keigo Nitadori Khaled Benkrid Paper Science Tsuyoshi Hamada,Keigo Nitadori,Khaled Benkrid,hamada@cis.nagasaki-u.ac.jp,nitadori@cfca.jp,k.benkdird@ed.ac.uk 4977b07b-89ac-439e-abb4-8879e099c3c4 Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware Graphics processing units (GPU) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provide increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations. 
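As a rough illustration of the two integer representations named in the abstract above (standard radix form and residue number system form), the following Python sketch shows square-and-multiply modular exponentiation alongside a toy residue number system multiplication. It is not the authors' GPU implementation: the function names and the four small moduli in `RNS_MODULI` are hypothetical choices made only for this example, and real RSA-sized operands would need far larger bases.

```python
from math import prod


def modexp(base, exponent, modulus):
    """Left-to-right square-and-multiply modular exponentiation (radix form)."""
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus      # square for every exponent bit
        if bit == "1":
            result = (result * base) % modulus    # multiply when the bit is set
    return result


# Hypothetical pairwise-coprime moduli (251 prime, 253 = 11*23, 255 = 3*5*17, 256 = 2**8).
RNS_MODULI = (251, 253, 255, 256)


def to_rns(x):
    """Represent x by its residues modulo each RNS modulus."""
    return tuple(x % m for m in RNS_MODULI)


def rns_mul(a, b):
    """Multiply channel by channel; the channels are independent of each other."""
    return tuple((x * y) % m for x, y, m in zip(a, b, RNS_MODULI))


def from_rns(residues):
    """Recover the integer with the Chinese Remainder Theorem."""
    big_m = prod(RNS_MODULI)
    total = 0
    for r, m in zip(residues, RNS_MODULI):
        partial = big_m // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    return total % big_m


if __name__ == "__main__":
    print(modexp(4, 13, 497))
    print(from_rns(rns_mul(to_rns(1234), to_rns(5678))))
```

The independence of the residue channels in `rns_mul` is what makes the residue form attractive for data-parallel hardware, while the radix form keeps a single long carry chain per operand.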
/content/cudazone/CUDABrowser/assets/images/applications/863_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/863_cover-medium_large.jpg Academia Trinity College Dublin 2009 06 19 06/19/2009 Owen Harrison John Waldron Paper Science Owen Harrison,John Waldron,harrisoo@cs.tcd.ie,john.waldron@cs.tcd.ie ed7c674a-46f4-4536-84a9-24e1489c692e Realistic real-time sound re-synthesis and processing for interactive virtual worlds We present new GPU-based techniques for implementing linear digital filters for real-time audio processing. Our solution for recursive filters is the first presented in the literature. We demonstrate the relevance of these algorithms to computer graphics by synthesizing realistic sounds of colliding objects made of different materials, such as glass, plastic, and wood, in real time. The synthesized sounds can be parameterized by the object materials, velocities, and collision angles. Despite its flexibility, our approach uses very little memory, since it essentially requires a set of coefficients representing the impulse response of each material sound. Such features make our approach an attractive alternative to traditional CPU-based techniques that use playback of pre-recorded sounds. /content/cudazone/CUDABrowser/assets/images/applications/862_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/862_visualcomputer_large.png Academia Instituto de Informatica 2009 03 11 03/11/2009 Fernando Trebien Manuel M. Oliveira Paper Video & Audio Fernando Trebien,Manuel M. Oliveira,ftrebien@inf.ufrgs.br,oliveira@inf.ufrgs.br b3df15ed-da9f-40e2-9e52-827b4ffa8012 Solid Mesh Registration for Radiotherapy Treatment Planning We present an algorithm for solid organ registration of pre-segmented data represented as tetrahedral meshes. Registration of the organ surface is driven by force terms based on a distance field representation of the source and reference shapes. 
Registration of internal morphology is achieved using a non-linear elastic finite element model. A key feature of the method is that the user does not need to specify boundary conditions (surface point correspondences) prior to the finite element analysis. Instead the boundary matches are found as an integrated part of the analysis. The method is evaluated on phantom data and prostate data obtained in vivo based on fiducial marker accuracy and inverse consistency of transformations. The parallel nature of the method allows an efficient implementation on a GPU and as a result the method is very fast. All validation registrations take less than 30 seconds to complete. The proposed method has many potential uses in image guided radiotherapy (IGRT) which relies on registration to account for organ deformation between treatment sessions. /content/cudazone/CUDABrowser/assets/images/applications/861_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/861_cover-medium_large.jpg Academia Aarhus University, Denmark 2010 01 21 01/21/2010 Karsten Ostergaard Noe Paper Science Karsten Ostergaard Noe,noe@cs.au.dk 6a0fe015-97bb-4d85-b408-308d30e105d5 Large Scale Bioinformatics Data Mining with Parallel Genetic Programming on Graphics Processing Units A suitable single instruction multiple data GP interpreter can achieve high (Giga GPop/second) performance on a SIMD GPU graphics card by simultaneously running multiple diverse members of the genetic programming population. SPMD dataflow parallelisation is achieved because the single interpreter treats the different GP programs as data. On a single 128 node parallel nVidia GeForce 8800 GTX GPU, the interpreter can out run a compiled approach, where data parallelisation comes only by running a single program at a time across multiple inputs. The RapidMind GPGPU Linux C++ system has been demonstrated by predicting ten year+ outcome of breast cancer from a dataset containing a million inputs. 
NCBI GEO GSE3494 contains hundreds of Affymetrix HG-U133A and HG-U133B GeneChip biopsies. Multiple GP runs, each with a population of five million programs, winnow useful variables from the chaff at more than 500 million GPops per second. Sources available via FTP. /content/cudazone/CUDABrowser/assets/images/applications/860_iss_small.png /content/cudazone/CUDABrowser/assets/images/applications/860_iss_large.png Academia King's College, London 2010 01 06 01/06/2010 William B. Langdon Paper Science William B. Langdon 8a53a58c-cb1e-4854-b81a-88ec94b5490d Hierarchical Markov Random Fields Applied to Model Soft Tissue Deformations on Graphics Hardware Many methodologies dealing with prediction or simulation of soft tissue deformations on medical image data require preprocessing of the data in order to produce a different shape representation that complies with standard methodologies, such as mass-spring networks or finite element methods (FEM). On the other hand, methodologies working directly on the image space normally do not take into account mechanical behavior of tissues and tend to lack physics foundations driving soft tissue deformations. This chapter presents a method to simulate soft tissue deformations based on coupled concepts from image analysis and mechanics theory. The proposed methodology is based on a robust stochastic approach that takes into account material properties retrieved directly from the image, concepts from continuum mechanics and FEM. The optimization framework is solved within a hierarchical Markov random field (HMRF) which is implemented on the graphics processor unit (GPU).
/content/cudazone/CUDABrowser/assets/images/applications/859_cover-medium3_small.png /content/cudazone/CUDABrowser/assets/images/applications/859_cover-medium3_large.png Academia University of Bern, Switzerland 2009 11 24 11/24/2009 Christof Seiler Philippe Buchler Lutz-Peter Nolte Paper Science Christof Seiler,Philippe Buchler,Lutz-Peter Nolte,christof.seiler@artorg.unibe.ch,christof.seiler@artorg.unibe.ch,christof.seiler@artorg.unibe.ch 8fea4552-0ab0-4c43-a0e4-f72c417d9e06 Efficient K-Means Clustering Using Accelerated Graphics Processors We exploit the parallel architecture of the Graphics Processing Unit (GPU) used in desktops to efficiently implement the traditional K-means algorithm. Our approach in clustering avoids the need for data and cluster information transfer between the GPU and CPU in between the iterations. In this paper we present the novelties in our approach and techniques employed to represent data, compute distances, centroids and identify the cluster elements using the GPU. We measure performance using the metric: computational time per iteration. Our implementation of k-means clustering on an Nvidia 5900 graphics processor is 4 to 12 times faster than the CPU and 7 to 22 times faster on the Nvidia 8500 graphics processor for various data sizes. We also achieved 12 to 64 times speed gains on the 5900 and 20 to 140 times speed gains on the 8500 graphics processor in computational time per iteration for evaluations with various cluster sizes. /content/cudazone/CUDABrowser/assets/images/applications/855_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/855_cover-medium_large.jpg Academia Nanyang Technological University, Singapore 2008 08 30 08/30/2008 S. A. Arul Shalom Manoranjan Dash Minh Tue Paper Science S. A.
Arul Shalom,Manoranjan Dash,Minh Tue,sall0001@ntu.edu.sg,asmdash@ntu.edu.sg,h0630082@nus.edu.sg 07072ad0-b7c1-4eb6-9826-2c1cc0ae740f Systematic Parallelization of Medical Image Reconstruction for Graphics Hardware Modern Graphics Processing Units (GPUs) consist of several SIMD-processors and thus provide a high degree of parallelism at low cost. We introduce a new approach to systematically develop parallel image reconstruction algorithms for GPUs from their parallel equivalents for distributed-memory machines. We use High-Level Petri Nets (HLPN) to intuitively describe the parallel implementations for distributed-memory machines. By denoting the functions of the HLPN with memory requirements and information about data distribution, we are able to identify parallel functions that can be implemented efficiently on the GPU. For an important iterative medical image reconstruction algorithm, the list-mode OSEM algorithm, we demonstrate the limitations of its distributed-memory implementation and show how our HLPN-based approach leads to a fast implementation on GPUs, reusable across different medical imaging devices. /content/cudazone/CUDABrowser/assets/images/applications/854_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/854_cover-medium_large.jpg Academia University of Munster, Germany 2008 08 21 08/21/2008 Maraike Schellmann Jurgen Vording Sergei Gorlatch Paper Science Maraike Schellmann,Jurgen Vording,Sergei Gorlatch,schellmann@uni-muenster.de,voerding@uni-muenster.de,gorlatch@uni-muenster.de 7794a349-0166-46c9-8c6d-32da8b4febda Real-Time Autostereoscopic Visualization of Registration-Generated 4D MR Image of Beating Heart This paper presents a real-time autostereoscopic visualization system using the principle of Integral Videography (IV). We developed MIP and composite volume ray casting methods for IV volume rendering and implemented the algorithms on the GPU to achieve real-time rendering.
The system was used to visualize a 4D MR image that was generated from registration of a 3D MR image and a 4D ultrasound image. The registration scheme consists of inter-modality rigid registration between a 3D MR image and a 3D ultrasound image and intra-modality non-rigid registration between 3D ultrasound images. Registration processes were also implemented on the GPU. Evaluation of processing speed showed that GPU processing was 48x, 13x, and 21x faster than CPU processing for IV volume rendering, rigid registration, and non-rigid registration, respectively. We also enabled real-time user interactivity for the IV visualization system. In the future, we plan to use this system to develop an intra-operative surgery navigation system for intra-cardiac surgery on a beating heart. /content/cudazone/CUDABrowser/assets/images/applications/853_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/853_cover-medium_large.jpg Academia The University of Tokyo, Japan 2008 07 15 07/15/2008 Nicholas Herlambang Hongen Liao Ken Masamune Paper Science Nicholas Herlambang,Hongen Liao,Ken Masamune,nicholas@atre.t.u-tokyo.ac.jp,liao@atre.t.u-tokyo.ac.jp,masa@i.u-tokyo.ac.jp 498009d7-ccca-476d-881a-4a392b52b7ba Multiscale and local search methods for real time region tracking with particle filters: local search driven by adaptive scale estimation on GPUs Tracking systems are important in computer vision, with applications in surveillance, human computer interaction, etc. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm, based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures.
The developed system provides successful results in precise tracking of single and multiple targets in monocular video, operating in real-time at 70 frames per second for 640 x 480 video resolutions on the GPU, up to 1,100% faster than the CPU version of the algorithm. /content/cudazone/CUDABrowser/assets/images/applications/852_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/852_implementation_large.png Academia Universidad Rey Juan Carlos, Spain 2008 05 08 05/08/2008 Raul Cabido Antonio S. Montemayor Juan Jose Pantrigo Paper Science Raul Cabido,Antonio S. Montemayor,Juan Jose Pantrigo,raul.cabido@urjc.es,antonio.sanz@urjc.es,juanjose.pantrigo@urjc.es 066c8093-375d-46dc-a170-4955e4c07315 Deforming a High-Resolution Mesh in Real-Time by Mapping onto a Low-Resolution Physical Model For interactive surgical simulation the physical model of the soft tissue needs to be solved in real-time. This limits the attainable model density to well below the desired mesh density for visual realism. Previous work avoids this problem by using a high-resolution visual mesh mapped onto a low-resolution physical model. We apply the same approach and present a computationally cheap implementation of a known algorithm to avoid texture artefacts caused by the mapping. We also introduce a spline-based algorithm to prevent groups of high-resolution vertices, mapped to the same low-resolution triangle, from exhibiting movements in which the underlying low-resolution structure can be recognised. The resulting mapping algorithm is very efficient, mapping 54,000 vertices in 8.5 ms on the CPU and in 0.88 ms on the GPU. Consequently, the density of the high-resolution visual mesh is limited only by the detail of the CT data from which the mesh was generated.
/content/cudazone/CUDABrowser/assets/images/applications/1372_evisser08_small.png /content/cudazone/CUDABrowser/assets/images/applications/1372_evisser08_large.png Research The Australian e-Health Research Centre 2008 07 07 07/07/2008 Hans de Visser Olivier Comas David Conlan Paper Science Hans de Visser,Olivier Comas,David Conlan 3423ec81-fdfb-4e8c-89af-2dce4ce05a4a ECM on Graphics Cards This paper reports record-setting performance for the elliptic-curve method of integer factorization: for example, 926.11 curves/second for ECM stage 1 with B1 = 8192 for 280-bit integers on a single PC. The state-of-the-art GMP-ECM software handles 124.71 curves/second for ECM stage 1 with B1 = 8192 for 280-bit integers using all four cores of a 2.4 GHz Core 2 Quad Q6600. The extra speed takes advantage of extra hardware, specifically two NVIDIA GTX 295 graphics cards, using a new ECM implementation introduced in this paper. Our implementation uses Edwards curves, relies on new parallel addition formulas, and is carefully tuned for the highly parallel GPU architecture. On a single GTX 295 the implementation performs 41.88 million modular multiplications per second for a general 280-bit modulus. GMP-ECM, using all four cores of a Q6600, performs 13.03 million modular multiplications per second. This paper also reports speeds on other graphics processors: for example, 2414 280-bit elliptic-curve scalar multiplications per second on an older NVIDIA 8800 GTS (G80), again for a general 280-bit modulus. For comparison, the CHES 2008 paper "Exploiting the Power of GPUs for Asymmetric Cryptography" reported 1412 elliptic-curve scalar multiplications per second on the same graphics processor despite having fewer bits in the scalar (224 instead of 280), fewer bits in the modulus (224 instead of 280), and a special modulus (2^224 − 2^96 + 1).
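The throughput figures quoted in the ECM abstract above can be turned into speedup ratios with a few divisions. The short Python sketch below does only that; the variable names are illustrative, and the numbers are copied from the abstract. Note that the curves/second comparison is two GTX 295 cards against four Q6600 cores, so it is a system-level ratio rather than a per-device one.

```python
# Figures copied from the abstract; variable names are illustrative only.
gpu_curves_per_s = 926.11    # ECM stage 1, B1 = 8192, two GTX 295 cards
cpu_curves_per_s = 124.71    # GMP-ECM, four cores of a Core 2 Quad Q6600

gpu_modmuls_per_s = 41.88e6  # single GTX 295, general 280-bit modulus
cpu_modmuls_per_s = 13.03e6  # GMP-ECM, four cores of a Q6600

new_scalar_mults_per_s = 2414   # 8800 GTS, 280-bit scalar and modulus
prior_scalar_mults_per_s = 1412 # CHES 2008 figure, 224-bit scalar and modulus

curve_speedup = gpu_curves_per_s / cpu_curves_per_s
modmul_speedup = gpu_modmuls_per_s / cpu_modmuls_per_s
scalar_mult_ratio = new_scalar_mults_per_s / prior_scalar_mults_per_s

if __name__ == "__main__":
    print(f"curves/second ratio:           {curve_speedup:.2f}x")
    print(f"modular multiplications ratio: {modmul_speedup:.2f}x")
    print(f"scalar multiplications ratio:  {scalar_mult_ratio:.2f}x")
```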
/content/cudazone/CUDABrowser/assets/images/applications/849_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/849_cover-medium_large.jpg Academia University of Illinois at Chicago 2009 04 16 04/16/2009 Daniel J. Bernstein Tien-Ren Chen Chen-Mou Cheng Paper Science Daniel J. Bernstein,Tien-Ren Chen,Chen-Mou Cheng,djb@cr.yp.to,trchen1033@crypto.tw,doug@crypto.tw 9dd9b7a3-2ab7-4a26-ad08-1f2554a989fe A Practical Approach of Curved Ray Prestack Kirchhoff Time Migration on GPGPU We introduced four prototypes of General Purpose GPU solutions using Compute Unified Device Architecture (CUDA) on NVidia GeForce 8800GT and Tesla C870 for a practical Curved Ray Prestack Kirchhoff Time Migration program, which is one of the most widely adopted imaging methods in the seismic data processing industry. We presented how to re-design and re-implement the original CPU code as efficient GPU code step by step. We demonstrated optimization methods, such as how to reduce the overhead of memory transfers on the PCI-E bus, how to significantly increase the kernel thread numbers on GPU cores, how to buffer the inputs and outputs of CUDA kernel modules, and how to utilize the memory streams to overlap GPU kernel execution time, etc., to improve the runtime performance on GPUs. We analyzed the floating point errors between CPUs and GPUs. We presented the images generated by CPU and GPU programs for the same real-world seismic data inputs. Our final approach of Prototype-IV on NVidia GeForce 8800GT is 16.3 times faster than its CPU version on Intel's P4 3.0G.
/content/cudazone/CUDABrowser/assets/images/applications/848_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/848_cover-medium_large.jpg Academia Beihang University, Beijing 2009 08 21 08/21/2009 Xiaohua Shi Chuang Li Xu Wang Paper Science Xiaohua Shi,Chuang Li,Xu Wang,xhshi@buaa.edu.cn,whlichuang@126.com,xu.wang@sei.buaa.edu.cn 6a921e34-8d47-4a42-a892-8398ec64468f A Practical Quicksort Algorithm for Graphics Processors In this paper we present GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors. /content/cudazone/CUDABrowser/assets/images/applications/847_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/847_cover-medium_large.jpg Academia Chalmers University of Technology, Sweden 2008 09 20 09/20/2008 Daniel Cederman Philippas Tsigas Paper Science Daniel Cederman,Philippas Tsigas,cederman@chalmers.se,tsigas@chalmers.se 77df8f3f-2d01-428d-b8b3-70fc6d308873 Hardware-Accelerated Particle-Based Volume Rendering for Multiple Irregular Volumes In this paper, we propose a performance improvement of particle-based volume rendering (PBVR) by using a current, programmable GPU architecture. PBVR allows rendering without visibility sorting by representing a given volume dataset as a set of opaque and emissive particles. In our new GPU acceleration of PBVR, we provide a switchable rendering pipeline that is compatible with both regular and irregular grid volumes. Particle generation is improved by using a cell-by-cell approach for processing large volume datasets.
We also reduce the memory cost required for storing all sub-pixel values by proposing a pixel-superimposing technique targeting a large sub-pixel level. Our work demonstrates a full detail rendering rate from 5 to 11 fps for overlapped or separated multi-irregular volumes with a mega-scale number of volume cells on NVIDIA Geforce 8800GTX. /content/cudazone/CUDABrowser/assets/images/applications/846_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/846_cover-medium_large.jpg Academia Center for the Promotion of Excellence in Higher Education, Kyoto University 2008 12 03 12/03/2008 Naohisa Sakamoto Ding Zhongming Takuma Kawamura Paper Science Naohisa Sakamoto,Ding Zhongming,Takuma Kawamura acb94169-1150-437c-9ebb-6c183be2b38f Compiler support for general-purpose computation on GPUs In recent years, the GPU (graphics processing unit) has evolved into an extremely powerful and flexible processor, with it now representing an attractive platform for general-purpose computation. Moreover, changes to the design and programmability of GPUs provide the opportunity to perform general-purpose computation on a GPU (GPGPU). Even though many programming languages, software tools, and libraries have been proposed to facilitate GPGPU programming, the unusual and specific programming model of the GPU remains a significant barrier to writing GPGPU programs. In this paper, we introduce a novel compiler-based approach for GPGPU programming. Compiler directives are used to label code fragments that are to be executed on the GPU. Our GPGPU compiler, Guru, converts the labeled code fragments into ISO-compliant C code that contains appropriate OpenGL and Cg APIs. A native C compiler can then be used to compile it into the executable code for GPU. Our compiler is implemented based on the Open64 compiler infrastructure. 
Preliminary experimental results from selected benchmarks show that our compiler produces significant performance improvements for programs that exhibit a high degree of data parallelism. /content/cudazone/CUDABrowser/assets/images/applications/844_neville_small.png /content/cudazone/CUDABrowser/assets/images/applications/844_neville_large.png Academia National Chung Cheng University, China 2008 11 19 11/19/2008 Yu-Te Lin Peng-Sheng Chen Paper Science Yu-Te Lin,Peng-Sheng Chen,lyt94@cs.ccu.edu.tw,pschen@cs.ccu.edu.tw a0c19949-34b1-4227-a049-a70feb8ad4e9 A Gradient Descent Approximation for Graph Cuts Graph cuts have become very popular in many areas of computer vision including segmentation, energy minimization, and 3D reconstruction. Their ability to find optimal results efficiently and the convenience of usage are some of the factors of this popularity. However, there are a few issues with graph cuts, such as the inherently sequential nature of popular algorithms and the memory bloat in large scale problems. In this paper, we introduce a novel method for the approximation of the graph cut optimization by posing the problem as a gradient descent formulation. The advantages of our method are the ability to work efficiently on large problems and the possibility of convenient implementation on parallel architectures such as inexpensive Graphics Processing Units (GPUs). We have implemented the proposed method on the Nvidia 8800GTS GPU. The classical segmentation experiments on static images and video data showed the effectiveness of our method.
/content/cudazone/CUDABrowser/assets/images/applications/843_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/843_cover-medium_large.jpg Academia Gebze Institute of Technology, Gebze 2009 09 02 09/02/2009 Alparslan Yildiz Yusuf Sinan Akgul Paper Science Alparslan Yildiz,Yusuf Sinan Akgul,yildiz@bilmuh.gyte.edu.tr,akgul@bilmuh.gyte.edu.tr 83479fc3-a87e-49c9-b505-d08ae7a1747f Data Mining Using Graphics Processing Units During the last few years, Graphics Processing Units (GPU) have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks such as rendering of 3D scenarios but can also be used for general numeric and symbolic computation tasks such as simulation and optimization. As a major advantage, GPUs provide extremely high parallelism (with several hundred simple programmable processors) combined with a high bandwidth in memory transfer at low cost. In this paper, we propose several algorithms for computationally expensive data mining tasks like similarity search and clustering which are designed for the highly parallel environment of a GPU. We define a multidimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU, and define a similarity join method. Moreover, we define highly parallel algorithms for density-based and partitioning clustering. In an extensive experimental evaluation, we demonstrate the superiority of our algorithms running on the GPU over their conventional counterparts on the CPU.
/content/cudazone/CUDABrowser/assets/images/applications/842_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/842_cover-medium_large.jpg Academia University of Munich, Germany 2009 08 24 08/24/2009 Christian Bohm Robert Noll Claudia Plant Paper Science Christian Bohm,Robert Noll,Claudia Plant,boehm@dbs.ifi.lmu.de,noll@dbs.ifi.lmu.de,plant@lrz.tum.de 23658fd2-eb70-4bb7-bc9a-5efb3e91b16e GPU RayTracing Pipeline We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized ray-primitive intersections. We utilize the well known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and parallel scan per each BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray-primitive intersection routines our pipeline is ~3x faster than an optimized standard depth first ray tracing implemented in one kernel. 
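The ray tracing abstract above relies on the scan primitive to handle irregular data without a stack. One common use of scan is stream compaction, sketched below in plain Python: an exclusive prefix sum over keep/discard flags gives every surviving element its output slot, so each element can be written independently. This is a sequential stand-in written for illustration, not the authors' CUDA pipeline, and the names `exclusive_scan` and `compact` are hypothetical.

```python
def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    out = []
    running = 0
    for v in values:
        out.append(running)
        running += v
    return out


def compact(items, keep_flags):
    """Keep items whose flag is 1, preserving order, via scan-derived offsets."""
    offsets = exclusive_scan(keep_flags)
    result = [None] * sum(keep_flags)
    # On a GPU each iteration would be one thread; no iteration depends on another.
    for i, (item, flag) in enumerate(zip(items, keep_flags)):
        if flag:
            result[offsets[i]] = item
    return result


if __name__ == "__main__":
    rays = ["r0", "r1", "r2", "r3", "r4"]
    still_active = [1, 0, 1, 1, 0]  # e.g. rays that still need traversal
    print(exclusive_scan(still_active))
    print(compact(rays, still_active))
```

Because the output position of every kept element is known before any write happens, the scatter step needs no synchronization between elements, which is why the pattern suits data-parallel hardware.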
/content/cudazone/CUDABrowser/assets/images/applications/841_paper4_small.png /content/cudazone/CUDABrowser/assets/images/applications/841_paper4_large.png Research Keldysh Institute of Applied Mathematics / Microsoft Research 2010 02 10 02/10/2010 3 K.Garanzha C.Loop Paper Presentation Graphics Ray Tracing GPU, ray tracing, custom pipeline,K.Garanzha,C.Loop,kirill@garanzha.com 89d31666-0540-4526-800d-124ea52364d8 Maaap Reduce In order to verify the feasibility of using the GPU for a fairly substantial and rapidly changing dataset, a simple set of benchmark functions was created for three main programming language families. Each test evaluated every element in the MNIST dataset with the sigmoid function. /content/cudazone/CUDABrowser/assets/images/applications/840_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/840_defaultlogo_large.png CUDA Developer 2009 04 07 04/07/2009 Paul Reimer Code Libraries Paul Reimer 5f27c6d9-6984-495a-bf57-bca8dd6ea108 Rocks CUDA Rocks Cluster Distribution is a Linux distribution for HPC clusters. It was started by the National Partnership for Advanced Computational Infrastructure and the SDSC in 2000. Rocks includes many tools that make a group of computers into a cluster. Installations can be customized with additional software packages at install-time by using special user-supplied CDs (called "Roll CDs"). The "Rolls" extend the system by integrating seamlessly and automatically into the management and packaging mechanisms used by base software, greatly simplifying installation and configuration of large numbers of computers. This project will contain the source code and images for an NPACI ROCKS 5.0 Roll for NVIDIA CUDA libraries and drivers.
/content/cudazone/CUDABrowser/assets/images/applications/839_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/839_defaultlogo_large.png CUDA Developer 2008 08 14 08/14/2008 3kforme Code 3kforme 746a6455-e475-40f9-b881-6f85a8ec0e76 GPU based Sparse Grid Technique for Solving Multidimensional Options Pricing PDEs It has been shown that the sparse grid combination technique can be a practical tool to solve high dimensional PDEs arising in multidimensional option pricing problems in finance. Hierarchical approximation of these problems leads to linear systems that are smaller in size compared to those arising from standard finite element or finite difference discretizations. However, these systems are still excessively demanding in terms of memory for direct methods and challenging to solve by iterative methods. In this paper we address iterative solutions via preconditioned Krylov subspace based methods, such as Stabilized BiConjugate Gradient (BiCGStab) and CG Squared (CGS), with the main focus on the design of such iterative solvers to harness massive parallelism of general purpose Graphics Processing Units (GPGPU)s. We discuss data structures and efficient implementation of iterative solvers. We also present a number of performance results to demonstrate the scalability of these solvers on the NVIDIA's CUDA platform. /content/cudazone/CUDABrowser/assets/images/applications/838_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/838_graph_large.png Academia Chatenay-Malabry, France 2009 12 31 12/31/2009 1000 Abhijeet Gaikwad Ioane Muni Toke Paper Finance NVIDIA CUDA, Iterative solvers, multidimensional option,Abhijeet Gaikwad,Ioane Muni Toke,abhijeet.gaikwad@ecp.fr,ioane.muni-toke@ecp.fr a65412a3-1d34-490f-a209-9f8d486c7b55 Micromanager London Kings CUDA makes the processing power of NVIDIA graphics cards available for normal computation. 
Here are some add-ons for uManager (http://www.micro-manager.org/), free cross-platform software for controlling microscopes and acquiring images. /content/cudazone/CUDABrowser/assets/images/applications/837_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/837_defaultlogo_large.png Research CUDA Developer 2009 04 28 04/28/2009 Martin Kielhorn Code Libraries Martin Kielhorn c0250d55-553e-4fa6-aa6a-e91916638b97 CBCL Model CUDA CUDA version of the HMAX model /content/cudazone/CUDABrowser/assets/images/applications/836_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/836_defaultlogo_large.png Research CUDA Developer 2009 01 28 01/28/2009 Sharat.Chikkerur Code Sharat.Chikkerur 9afbcfda-88d9-44a4-9cf3-8c2e3c2ec1d9 Real-time virtual environment signal extraction and denoising using programmable graphics hardware The sense of being within a three-dimensional (3D) space and interacting with virtual 3D objects in a computer-generated virtual environment (VE) often requires essential image, vision and sensor signal processing techniques such as differentiating and denoising. This paper describes novel implementations of Gaussian filtering for characteristic signal extraction and of wavelet-based image denoising algorithms that run on the graphics processing unit (GPU). While significant acceleration over standard CPU implementations is obtained through exploiting data parallelism provided by modern programmable graphics hardware, the CPU can be freed up to run other computations more efficiently, such as artificial intelligence (AI) and physics. The proposed GPU-based Gaussian filtering can extract surface information from a real object and provide its material features for rendering and illumination. The wavelet-based signal denoising for large digital images realized in this project provided better realism for VE visualization without sacrificing the real-time and interactive performance of an application. 
/content/cudazone/CUDABrowser/assets/images/applications/835_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/835_cover-medium_large.jpg Academia University of Huddersfield, Queensgate, Huddersfield 2009 10 21 10/21/2009 Yang Su Zhi-Jie Xu Xiang-Qian Jiang Paper Signal Processing Yang Su,Zhi-Jie Xu,Xiang-Qian Jiang,y.su@hud.ac.uk 578067f7-ac4b-47f2-950b-1f9ed61408e5 Extracting Curve Skeletons from Gray Value Images for Virtual Endoscopy The extraction of curve skeletons from tubular networks is a necessary prerequisite for virtual endoscopy applications. We present an approach for curve skeleton extraction directly from gray value images that supersedes the need to deal with segmentations and skeletonizations. The approach uses properties of the Gradient Vector Flow to derive a tube-likeliness measure and a medialness measure. Their combination allows the detection of tubular structures and an extraction of their medial curves that stays centered also in cases where the structures are not tubular such as junctions or severe stenoses. We present results on clinical datasets and compare them to curve skeletons derived with different skeletonization approaches from high quality segmentations. Our approach achieves a high centerline accuracy and is computationally efficient by making use of a GPU based implementation of the Gradient Vector Flow. 
/content/cudazone/CUDABrowser/assets/images/applications/834_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/834_cover-medium_large.jpg Academia Graz University of Technology, Austria 2008 07 15 07/15/2008 Christian Bauer Horst Bischof Paper Science Christian Bauer,Horst Bischof,cbauer@icg.tu-graz.ac.at,bischof@icg.tu-graz.ac.at bb093c2d-d78d-4eb7-81b1-af6e59587e17 Evaluating the Jaccard-Tanimoto Index on Multi-core Architectures The Jaccard/Tanimoto coefficient is an important workload, used in a large variety of problems including drug design fingerprinting, clustering analysis, similarity web searching and image segmentation. This paper evaluates the Jaccard coefficient on three platforms: the Cell Broadband Engine processor, the Intel Xeon dual-core platform, and the NVIDIA 8800 GTX GPU. In our work, we have developed a novel parallel algorithm specially suited for the Cell/B.E. architecture for all-to-all Jaccard comparisons, which minimizes DMA transfers and reuses data in the local store. We show that our implementation on Cell/B.E. outperforms the implementations on comparable Intel platforms by 6-20X with full accuracy, and by 10-50X in reduced-accuracy mode, depending on the size of the data, and by more than 60X compared to the NVIDIA 8800 GTX. In addition to performance, we also discuss in detail our efforts to optimize our workload on these architectures and explain how the avenues for optimization differ from one architecture to another for our workload. Our work shows that the algorithms or kernels employed for the Jaccard coefficient calculation are heavily dependent on the traits of the target hardware. /content/cudazone/CUDABrowser/assets/images/applications/833_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/833_cover-medium_large.jpg Academia Technologies Design Center, Indianapolis 2009 05 20 05/20/2009 20 Vipin Sachdeva Douglas M. 
Freimuth Chris Mueller Paper Science Vipin Sachdeva,Douglas M. Freimuth,Chris Mueller,vsachde@us.ibm.com,dmfreim@us.ibm.com,chemuell@cs.indiana.edu 7a0a2a4f-3fa3-4ed0-8e1c-f3dd9f2835e7 Focused Volumetric Visual Hull with Color Extraction This paper introduces a new approach for volumetric visual hull reconstruction, using a voxel grid that focuses on the moving target object. This grid is continuously updated as a function of object location, orientation, and size. The benefit is a reduced number of voxels that have to be evaluated or allocated towards capturing the target at higher resolution. This technique particularly improves reconstructions where the total reconstruction space is larger than the moving reconstruction target. The higher resolution of the voxel grid also reduces the computational cost per voxel reprojection since a one voxel to one input pixel reprojection ratio is approximated. In addition, the appropriate view-independent color of the surface voxels is computed, allowing for realistic visual hull texturing. All color calculations are performed locally, based on approximated surface voxel normals and the input images. A color outlier detection approach is introduced, which reduces the influence of occlusions in the color evaluation. The parallel nature of the presented focused visual hull reconstruction technique lends itself to hardware acceleration, allowing interactive rates to be achieved by performing most computations on the GPU. A set of case studies is provided for well-defined static and dynamic data sets. 
/content/cudazone/CUDABrowser/assets/images/applications/832_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/832_cover-medium_large.jpg Academia University of California, San Diego 2009 11 26 11/26/2009 Daniel Knoblauch Falko Kuester Paper Science Daniel Knoblauch,Falko Kuester 89f67b35-e35e-4e8c-8b22-14c848a66f32 Fourier Volume Rendering on GPGPU Fourier Volume Rendering (FVR) is a volume rendering technique with a lower computational complexity of O(N^2 log N) for an N^3 data array. A new FVR algorithm is proposed by extending the Fourier Projection-Slice Theorem to high dimensions and mapping the pipeline entirely onto the GPU. A windowed-sinc function is used as the reconstruction filter to implement higher-order interpolation, and the reduction of samples is executed on the GPU in parallel, which suits the heterogeneous multi-core architecture. The rendering is accelerated by a factor of 7 when the rendered image resolution is larger than 512x512. /content/cudazone/CUDABrowser/assets/images/applications/831_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/831_cover-medium_large.jpg Academia Hunan University 2009 05 21 05/21/2009 Degui Xiao Yi Liu Lei Yang Paper Science Degui Xiao,Yi Liu,Lei Yang e7dc92ba-1736-4c6f-92ea-6ac559d565f7 Practical Random Linear Network Coding on GPUs Recently, random linear network coding has been widely applied in peer-to-peer network applications. Instead of sharing the raw data with each other, peers in the network produce and send encoded data to each other. As a result, the communication protocols have been greatly simplified, and the applications experience higher end-to-end throughput and better robustness to network churn. Since it is difficult to verify the integrity of the encoded data, such systems can suffer from the famous pollution attack, in which a malicious node can send bad encoded blocks that consist of bogus data. 
Consequently, the bogus data will be propagated into the whole network at an exponential rate. Homomorphic hash functions (HHFs) have been designed to defend systems from such pollution attacks, but with a new challenge: HHFs require that network coding must be performed in GF(q), where q is a very large prime number. This greatly increases the computational cost of network coding, in addition to the already computationally expensive HHFs. This paper exploits the potential of the huge computing power of Graphics Processing Units (GPUs) to reduce the computational cost of network coding and homomorphic hashing. With our network coding and HHF implementation on GPU, we observed significant computational speedup in comparison with the best CPU implementation. This implementation can lead to a practical solution for defending against the pollution attacks in distributed systems. /content/cudazone/CUDABrowser/assets/images/applications/830_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/830_cover-medium_large.jpg Academia Hong Kong Baptist University / University of Calgary, Alberta, Canada 2009 05 07 05/07/2009 Xiaowen Chu Kaiyong Zhao Mea Wang Paper Science Xiaowen Chu,Kaiyong Zhao,Mea Wang,chxw@comp.hkbu.edu.hk,kyzhao@comp.hkbu.edu.hk,meawang@ucalgary.ca 207cd764-e884-47aa-b0c3-b5505bedfbe4 Fast Conjugate Gradients with Multiple GPUs The limiting factor for efficiency of sparse linear solvers is the memory bandwidth. In this work, we describe a fast Conjugate Gradient solver for unstructured problems, which runs on multiple GPUs installed on a single mainboard. The solver achieves double precision accuracy with single precision GPUs, using a mixed precision iterative refinement algorithm. To achieve high computation speed, we propose a fast sparse matrix-vector multiplication algorithm, which is the core operation of iterative solvers. 
The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses and load balance between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 11.6 Gflops on a single GeForce 8800 GTS card and the CG implementation achieves up to 24.6 Gflops with four GPUs. /content/cudazone/CUDABrowser/assets/images/applications/829_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/829_cover-medium_large.jpg Academia Tokyo Institute of Technology / National Institute of Informatics, Japan 2009 05 20 05/20/2009 Ali Cevahir Akira Nukada Satoshi Matsuoka Paper Science Ali Cevahir,Akira Nukada,Satoshi Matsuoka,ali@matsulab.is.titech.ac.jp,nukada@matsulab.is.titech.ac.jp,matsu@is.titech.ac.jp 307d80ab-1016-4ba9-9fda-be6f1e85a18f Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study To facilitate the design of hardware accelerators we propose in this paper the adoption of the stream-based computing model and the usage of Graphics Processing Units (GPUs) as prototyping platforms. This model exposes the maximum data parallelism available in the applications and decouples computation from memory accesses. The design and implementation procedures, including the programming of GPUs, are illustrated with the widely used MrBayes bioinformatics application. Experimental results show that a straightforward mapping of the stream-based program for the GPU into hardware structures leads to improvements in performance, scalability and cost. Moreover, it is shown that a set of simple optimization techniques can be applied in order to reduce the cost and the power consumption of hardware solutions. 
/content/cudazone/CUDABrowser/assets/images/applications/828_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/828_cover-medium_large.jpg Academia Rua Alves Redol 2009 07 21 07/21/2009 Frederico Pratas Leonel Sousa Paper Science Frederico Pratas,Leonel Sousa,fcpp@inesc-id.pt,las@inesc-id.pt 24d9dfbe-430a-4065-b835-69d1728e3a2b Parallel Calculating of the Goal Function in Metaheuristics Using GPU We consider a metaheuristic optimization algorithm which uses a single process (thread) to guide the search through the solution space. The thread cyclically (iteratively) performs two main tasks: the goal function evaluation for a single solution or a set of solutions, and management (solution filtering and selection, collection of history, updating). The latter task statistically takes 1-3% of the total iteration time, so we do not attempt to accelerate it. The former task can be accelerated in parallel environments in various manners. We propose a parallel small-grain calculation model providing a cost-optimal method. Then, we carry out an experiment using a Graphics Processing Unit (GPU) to confirm our theoretical results. /content/cudazone/CUDABrowser/assets/images/applications/827_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/827_cover-medium_large.jpg Academia Wrocław University of Technology 2009 05 20 05/20/2009 Wojciech Bozejko Czesław Smutnicki Mariusz Uchronski Paper Science Wojciech Bozejko,Czesław Smutnicki,Mariusz Uchronski,wojciech.bozejko@pwr.wroc.pl,czeslaw.smutnicki@pwr.wroc.pl,mariusz.uchronski@pwr.wroc.pl 89537d32-f563-4d80-af24-b3b43058d026 Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU) In a previous paper we have shown that direct gravitational N-body simulations in astrophysics scale very well for moderately parallel supercomputers (order 10-100 nodes). 
The best balance between computation and communication is reached if the nodes are accelerated by special purpose hardware; in this paper we describe the implementation of particle based astrophysical simulation codes on new types of accelerator hardware (field programmable gate arrays, FPGA, and graphical processing units, GPU). In addition to direct gravitational N-body simulations we also use the algorithmically similar smoothed particle hydrodynamics method as test application; the algorithms are used for astrophysical problems as e.g. evolution of galactic nuclei with central black holes and gravitational wave generation, and star formation in galaxies and galactic nuclei. We present the code performance on a single node using different kinds of special hardware (traditional GRAPE, FPGA, and GPU) and some implementation aspects (e.g. accuracy). The results show that GPU hardware for real application codes is as fast as GRAPE, but for an order of magnitude lower price, and that FPGA is useful for acceleration of complex sequences of operations (like SPH). We discuss future prospects and new cluster computers built with new generations of FPGA and GPU cards. /content/cudazone/CUDABrowser/assets/images/applications/826_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/826_implementation_large.png Academia University of Heidelberg 2009 05 12 05/12/2009 R. Spurzem P. Berczik G. Marcus Paper Science R. Spurzem,P. Berczik,G. Marcus,spurzem@ari.uni-heidelberg.de,berczik@ari.uni-heidelberg.de,guillermo.marcus@ziti.uni-heidelberg.de a9e38eba-e87f-426b-916a-5c33b9f69177 A framework for exploring numerical solutions of advection reaction diffusion equations using a GPU-based approach In this paper we describe a general purpose, graphics processing unit (GP-GPU)-based approach for solving partial differential equations (PDEs) within advection reaction diffusion models. 
The GP-GPU-based approach provides a platform for solving PDEs in parallel and can thus significantly reduce solution times over traditional CPU implementations. This allows for a more efficient exploration of various advection reaction diffusion models, as well as the parameters that govern them. Although the GPU does impose limitations on the size and accuracy of computations, the PDEs describing the advection reaction diffusion models of interest to us fit comfortably within these constraints. Furthermore, GPU technology continues to rapidly increase in speed, memory, and precision, so applying these techniques to larger systems should be possible in the future. We chose to solve the PDEs using two numerical approaches: for the diffusion, a first-order explicit forward Euler solution and a semi-implicit second-order Crank-Nicolson solution; and, for the advection and reaction, a first-order explicit solution. The goal of this work is to provide motivation and guidance to the application scientist interested in exploring the use of the GP-GPU computational framework in the course of their research. In this paper, we present a rigorous comparison of our GPU-based advection reaction diffusion code model with a CPU-based analog, finding that the GPU model out-performs the CPU implementation in one-to-one comparisons. /content/cudazone/CUDABrowser/assets/images/applications/825_computedvisualation_small.png /content/cudazone/CUDABrowser/assets/images/applications/825_computedvisualation_large.png Academia University of Utah 2008 03 04 03/04/2008 Allen R. Sanderson Miriah D. Meyer Robert M. Kirby Paper Numerics Allen R. Sanderson,Miriah D. Meyer,Robert M. Kirby,allen@sci.utah.edu,miriah@sci.utah.edu,kirby@sci.utah.edu 7d3cc29a-3dac-4791-8478-77dd28708ea8 Going Forward with GPU Computing This article describes why CEA is looking at GPU Computing and how the first experiments are conducted. 
We describe here a well-defined global strategy which relies on training users and taking advantage of Grand Challenges, involving early access users and system administrators. We also describe some preliminary results and raise questions which need to be addressed in the near future. /content/cudazone/CUDABrowser/assets/images/applications/824_highperformancecomputing_small.png /content/cudazone/CUDABrowser/assets/images/applications/824_highperformancecomputing_large.png Research CEA, DAM 2009 10 07 10/07/2009 Guillaume Colin de Verdiere Paper Science Guillaume Colin de Verdiere 441d4d84-f548-4465-ac76-eef36ff2a059 Introduction to Mastering Cell BE and GPU Execution Platforms Both Cell BE-type and GPU processors have emerged as multi-processor execution platforms that can outperform general purpose multi-core computers in certain application domains. The two architectures are quite different, and by no means interchangeable. GPUs are reminiscent of fine-grained systolic array architectures, while the Cell BE is suitable to execute a set of co-ordinated coarse-grained tasks. By now, enough applications have been mapped on either of these two processors, mostly by hand, that the pros and cons tables can be filled. The next step is to provide mappings that are based on efficient programming models and methods, in particular methods that minimize communication overheads. The six papers in this special session are attempts to take precisely that route. Three of them take the GPU as the underlying execution platform, with the third also taking the Cell-BE multicore processor into consideration. The other three papers target the Cell-BE processor. /content/cudazone/CUDABrowser/assets/images/applications/823_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/823_cover-medium_large.jpg Academia Leiden University, the Netherlands 2009 07 21 07/21/2009 Ed Deprettere Ana L. Varbanescu Paper Science Ed Deprettere,Ana L. 
Varbanescu 8949e7e3-c9b6-487a-894e-75c35f7b8d45 Development of a GPU-based multithreaded software application to calculate digitally reconstructed radiographs for radiotherapy To provide faster calculation of digitally reconstructed radiographs (DRRs) in patient-positioning verification, we developed and evaluated a graphics processing unit (GPU)-based DRR software application and compared it with a central processing unit (CPU)-based application. The evaluation metrics were calculation speed and image quality for various slice thicknesses. The results showed that the GPU-based DRR computation was an average of 50 times faster than the CPU-based methodology, whereas the image quality was very similar. This excellent performance may increase the accuracy of patient positioning and improve the patient treatment throughput time. /content/cudazone/CUDABrowser/assets/images/applications/822_radialogics_small.png /content/cudazone/CUDABrowser/assets/images/applications/822_radialogics_large.png Research National Institute of Radiological Sciences, Japan 2008 11 07 11/07/2008 Shinichiro Mori Paper Medical Imaging Shinichiro Mori,shinshin@nirs.go.jp 6884796d-0fa2-4f33-9297-1fde62fcc824 Lattice Boltzmann based PDE solver on the GPU In this paper, we propose a hardware-accelerated PDE (partial differential equation) solver based on the lattice Boltzmann model (LBM). The LBM was originally designed to solve fluid dynamics by constructing simplified microscopic kinetic models. As an explicit numerical scheme with only local operations, it has the advantage of being easy to implement and especially suitable for graphics hardware (GPU) acceleration. Beyond the Navier-Stokes equation of fluid mechanics, a typical LBM can be modified to solve the parabolic diffusion equation, which is further used to solve the elliptic Laplace and Poisson equations with a diffusion process. These PDEs are widely used in modeling and manipulating images, surfaces and volumetric data sets. 
Therefore, the LBM scheme can be used as a GPU-based numerical solver to provide a fast and convenient alternative to traditional implicit iterative solvers. We apply this method to several examples in volume smoothing, surface fairing and image editing, achieving outstanding performance on contemporary graphics hardware. It has great potential to be used as a general GPU computing framework for efficiently solving PDEs in image processing, computer graphics and visualization. /content/cudazone/CUDABrowser/assets/images/applications/821_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/821_visualcomputer_large.png Academia Kent State University 2007 12 07 12/07/2007 Ye Zhao Paper Imaging Ye Zhao,zhao@cs.kent.edu 1acdf9de-8761-4f13-9fea-7b8b02b55719 Real-Time Online Video Object Silhouette Extraction Using Graph Cuts on the GPU Being able to find the silhouette of an object is a very important front-end processing step for many high-level computer vision techniques, such as Shape-from-Silhouette 3D reconstruction methods, object shape tracking, and pose estimation. Graph cuts have been proposed as a method for finding very accurate silhouettes which can be used as input to such high level techniques, but graph cuts are notoriously computation intensive and slow. Leading CPU implementations can extract a silhouette from a single QVGA image in 100 milliseconds, with performance dramatically decreasing with increased resolution. Recent GPU implementations have been able to achieve performance of 6 milliseconds per image by exploiting the intrinsic properties of the lattice graphs and the hardware model of the GPU. However, these methods are restricted to a subclass of lattice graphs and are not generally applicable. We propose a novel method for graph cuts on the GPU which places no limits on graph configuration and which is able to achieve comparable real-time performance in online video processing scenarios. 
/content/cudazone/CUDABrowser/assets/images/applications/820_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/820_cover-medium_large.jpg Academia Keio University 2009 08 29 08/29/2009 Zachary A. Garrett Hideo Saito Paper Video & Audio Zachary A. Garrett,Hideo Saito,zgarrett@hvrl.ics.keio.ac.jp,saito@hvrl.ics.keio.ac.jp 1919e879-ecaa-471f-b6cb-93415638c16a Seeded ND medical image segmentation by cellular automaton on GPU Purpose: We present a GPU-based framework to perform organ segmentation in N-dimensional (ND) medical image datasets by computation of weighted distances using the Ford Bellman algorithm (FBA). Our GPU implementation of FBA gives an alternative and optimized solution to other graph-based segmentation techniques. http://springerlink.com/content/v92w2q820w412jj8/?p=617f22391ecf47f89a3da0c82420ae97&pi=63 /content/cudazone/CUDABrowser/assets/images/applications/819_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/819_cover-medium_large.jpg Research Notre-Dame Hospital 2009 07 31 07/31/2009 Claude Kauffmann Paper Medical Imaging Claude Kauffmann,claude.kauffmann@gmail.com 4bd610d3-92f8-4032-9730-02b0e6091d1f On GPU's viability as a middleware accelerator Today, Graphics Processing Units (GPUs) are a largely underexploited resource on existing desktops and a possible cost-effective enhancement to high-performance systems. To date, most applications that exploit GPUs are specialized scientific applications. Little attention has been paid to harnessing these highly-parallel devices to support more generic functionality at the operating system or middleware level. This study starts from the hypothesis that generic middleware-level techniques that improve distributed system reliability or performance (such as content addressing, erasure coding, or data similarity detection) can be significantly accelerated using GPU support. 
We take a first step towards validating this hypothesis and we design StoreGPU, a library that accelerates a number of hashing-based middleware primitives popular in distributed storage system implementations. Our evaluation shows that StoreGPU enables up to twenty-five-fold performance gains on synthetic benchmarks as well as on a high-level application: the online similarity detection between large data files. /content/cudazone/CUDABrowser/assets/images/applications/818_scalable_small.png /content/cudazone/CUDABrowser/assets/images/applications/818_scalable_large.png Academia University of British Columbia 2009 01 17 01/17/2009 Samer Al-Kiswany Abdullah Gharaibeh Elizeu Santos-Neto Paper Science Samer Al-Kiswany,Abdullah Gharaibeh,Elizeu Santos-Neto,samera@ece.ubc.ca,abdullah@ece.ubc.ca,elizeus@ece.ubc.ca 03286d23-be49-45d1-be2b-790c02badee7 Implementing Decision Trees and Forests on a GPU We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition. Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, non-branching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms. 
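The flat-array evaluation idea in the decision forest entry above can be illustrated on the CPU. The following is a minimal Python sketch, not the author's shader code: the node layout (feature, threshold, left, right, value), the names `TREE`, `DEPTH` and `evaluate`, and the convention that leaves point to themselves are all assumptions made for illustration. The child index is selected arithmetically rather than with an `if`, which mimics the non-branching traversal the abstract describes.

```python
# Hypothetical node layout: (feature index, threshold, left child, right child, leaf value).
# Leaves point to themselves, so extra traversal steps leave the position unchanged.
TREE = [
    (0, 0.5, 1, 2, 0.0),  # node 0: root, tests x[0] > 0.5
    (1, 0.3, 3, 4, 0.0),  # node 1: tests x[1] > 0.3
    (0, 0.0, 2, 2, 1.0),  # node 2: leaf with value 1.0
    (0, 0.0, 3, 3, 0.0),  # node 3: leaf with value 0.0
    (0, 0.0, 4, 4, 2.0),  # node 4: leaf with value 2.0
]
DEPTH = 2  # fixed number of steps, equal to the maximum depth of the tree


def evaluate(tree, depth, x):
    """Walk one tree for input x using index arithmetic instead of branches."""
    node = 0
    for _ in range(depth):
        feature, threshold, left, right, _value = tree[node]
        go_right = int(x[feature] > threshold)  # 0 or 1
        node = left + go_right * (right - left)
    return tree[node][4]


def evaluate_forest(forest, depth, x):
    """Return the leaf value reached in every tree of the forest."""
    return [evaluate(tree, depth, x) for tree in forest]


if __name__ == "__main__":
    print(evaluate(TREE, DEPTH, [0.9, 0.0]))
    print(evaluate_forest([TREE, TREE], DEPTH, [0.2, 0.7]))
```

Because every input takes the same fixed number of steps and touches only a flat table, each input point can be processed independently, which is what makes the layout attractive for a data-parallel device.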
http://springerlink.com/content/y702n504831g232m/?p=617f22391ecf47f89a3da0c82420ae97&pi=61 /content/cudazone/CUDABrowser/assets/images/applications/817_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/817_cover-medium_large.jpg Academia Microsoft Research, Cambridge, UK 2008 10 12 10/12/2008 Toby Sharp Paper Science Toby Sharp,toby.sharp@microsoft.com e4fd34a1-868c-482b-9522-41104b157431 CUDAMat CUDAMat provides a CUDA-based matrix class for Python, making it easy to implement algorithms that are easily expressed in terms of dense linear algebra. /content/cudazone/CUDABrowser/assets/images/applications/816_google_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/816_google_large.jpg Academia University of Toronto 2009 11 30 11/30/2009 50 Open source Volodymyr Mnih Code Libraries Volodymyr Mnih,vmnih@cs.toronto.edu 0692da9d-1f32-4819-a7e6-278383b1c438 Parallelization of a Video Segmentation Algorithm on CUDA Enabled Graphics Processing Units Nowadays, Graphics Processing Units (GPU) are emerging as SIMD coprocessors for general purpose computations, especially after the launch of nVIDIA CUDA. Since then, some libraries have been implemented for matrix computation and image processing. However, in real video applications some stages need irregular data distributions and the parallelism is not so inherent. This paper presents the parallelization of a video segmentation application on GPU hardware, which implements an algorithm for detecting abrupt and gradual transitions. A critical part of the algorithm requires highly intensive computation to calculate video frame features. Results on three CUDA-enabled GPUs are encouraging, because of the significant speedup achieved. They are also compared with an OpenMP version of the algorithm, running on two platforms with multiple cores. 
/content/cudazone/CUDABrowser/assets/images/applications/815_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/815_cover-medium_large.jpg Academia University of Cordoba, Spain / University of Malaga, Spain 2009 08 22 08/22/2009 Juan Gomez-Luna Jose Maria Gonzalez-Linares Jose Ignacio Benavides Paper Science Juan Gomez-Luna,Jose Maria Gonzalez-Linares,Jose Ignacio Benavides,el1goluj@uco.es,gonzalez@ac.uma.es,el1bebej@uco.es ae4da9b0-398e-4b88-ad64-95c879d6e61f Fast and automatic object pose estimation for range images on the GPU We present a pose estimation method for rigid objects from single range images. Using 3D models of the objects, many pose hypotheses are compared in a data-parallel version of the downhill simplex algorithm with an image-based error function. The pose hypothesis with the lowest error value yields the pose estimation (location and orientation), which is refined using ICP. The algorithm is designed especially for implementation on the GPU. It is completely automatic, fast, robust to occlusion and cluttered scenes, and scales with the number of different object types. We apply the system to bin picking, and evaluate it on cluttered scenes. Comprehensive experiments on challenging synthetic and real-world data demonstrate the effectiveness of our method. /content/cudazone/CUDABrowser/assets/images/applications/814_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/814_implementation_large.png Academia Inha University, Korea 2009 08 04 08/04/2009 In Kyu Park Paper Science In Kyu Park,pik@inha.ac.kr 6a9ef568-5517-4b74-b3d0-0070e8b2ab21 MinGPU: a minimum GPU library for computer vision In the field of computer vision, it is becoming increasingly popular to implement algorithms, in sections or in their entirety, on a graphics processing unit (GPU). This is due to the superior speed GPUs offer compared to CPUs. 
In this paper, we present a GPU library, MinGPU, which contains all of the necessary functions to convert existing CPU code to the GPU. We have created GPU implementations of several well known computer vision algorithms, including the homography transformation between two 3D views. We provide timing charts and show that our MinGPU implementation of homography transformations performs approximately 600 times faster than its C++ CPU implementation. /content/cudazone/CUDABrowser/assets/images/applications/813_iss_small.png /content/cudazone/CUDABrowser/assets/images/applications/813_iss_large.png Academia University of Central Florida 2009 05 28 05/28/2009 Pavel Babenko Paper Science Pavel Babenko,pavelb@cs.ucf.edu 38ad061e-364d-40a7-8e42-1233c587d56e GPU Accelerated Non-rigid Registration for the Evaluation of Cardiac Function We present a method for the fast and efficient tracking of motion in cardiac magnetic resonance (CMR) cines. A GPU accelerated Levenberg-Marquardt non-linear least squares optimization procedure for finite element non-rigid registration was implemented on an NVIDIA graphics card using the OpenGL environment. Points were tracked from frame to frame using forward and backward incremental registration. The inner (endocardial) and outer (epicardial) borders of the heart were tracked in six short axis cines with ~25 frames through the cardiac cycle in 36 patients with vascular disease. Contours placed by two independent expert observers using a semi-automatic ventricular analysis program (CIM version 4.6) were used as the gold standard. The method took 0.5 seconds per frame, and the maximum Hausdorff errors were less than 2 mm on average, which was of the same order as the expert inter-observer error. In conclusion, GPU accelerated Levenberg-Marquardt non-linear optimization enables fast and accurate tracking of cardiac motion in CMR images. 
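For readers unfamiliar with the Levenberg-Marquardt procedure named in the cardiac registration entry above, here is a minimal CPU sketch in Python. It is a generic textbook variant applied to a toy problem, not the finite element registration from the paper: the exponential model `a * exp(b * x)`, the sample data, the damping schedule (divide or multiply by 10) and all function names are assumptions chosen for illustration.

```python
import math


def cost(a, b, xs, ys):
    """Sum of squared residuals for the toy model f(x) = a * exp(b * x)."""
    return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))


def levenberg_marquardt(xs, ys, a, b, lam=1e-3, iterations=200):
    """Fit (a, b) by damped Gauss-Newton steps with Marquardt diagonal scaling."""
    current = cost(a, b, xs, ys)
    for _ in range(iterations):
        # Accumulate J^T J and J^T r, where r = y - f(x) and J holds df/da, df/db.
        jtj_aa = jtj_ab = jtj_bb = jtr_a = jtr_b = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            da, db = e, a * x * e
            r = y - a * e
            jtj_aa += da * da
            jtj_ab += da * db
            jtj_bb += db * db
            jtr_a += da * r
            jtr_b += db * r
        # Solve (J^T J + lam * diag(J^T J)) * step = J^T r with Cramer's rule.
        m_aa = jtj_aa * (1.0 + lam)
        m_bb = jtj_bb * (1.0 + lam)
        det = m_aa * m_bb - jtj_ab * jtj_ab
        if det == 0.0:
            break
        step_a = (m_bb * jtr_a - jtj_ab * jtr_b) / det
        step_b = (m_aa * jtr_b - jtj_ab * jtr_a) / det
        if abs(step_a) + abs(step_b) < 1e-12:
            break  # step is negligible, treat as converged
        trial = cost(a + step_a, b + step_b, xs, ys)
        if trial < current:
            # Accept the step and trust the quadratic model more.
            a, b, current = a + step_a, b + step_b, trial
            lam /= 10.0
        else:
            # Reject the step and increase damping (closer to gradient descent).
            lam *= 10.0
    return a, b


if __name__ == "__main__":
    xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    ys = [2.0 * math.exp(0.5 * x) for x in xs]  # noise-free synthetic data
    print(levenberg_marquardt(xs, ys, a=1.0, b=0.0))
```

The inner accumulation over data points is the part that parallelizes naturally, since each residual and Jacobian row is independent of the others.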
/content/cudazone/CUDABrowser/assets/images/applications/812_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/812_cover-medium_large.jpg Academia University of Auckland 2008 10 30 10/30/2008 Bo Li Alistair A. Young Brett R. Cowan Paper Science Bo Li,Alistair A. Young,Brett R. Cowan,b.li@auckland.ac.nz,a.young@auckland.ac.nz,b.cowan@auckland.ac.nz 22e19a72-ca7f-4e5d-a9d9-fbe3cbb38d5c A Hybrid Parallel Signature Matching Model for Network Security Applications Using SIMD GPU High performance signature matching against a large dictionary is of great importance in network security applications. The many-core SIMD GPU is a competitive choice for signature matching. In this paper, a hybrid parallel signature matching model (HPSMM) using SIMD GPU is proposed, which uses pattern set partition and input text partition together. Then the problem of load balancing for multiprocessors in the GPU is discussed carefully, and a balanced pattern set partition method (BPSPM) employed in HPSMM is introduced. Experiments demonstrate that using pattern set partition and input text partition together can help achieve a better performance, and the proposed BPSPM-Length works well in load balancing. /content/cudazone/CUDABrowser/assets/images/applications/811_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/811_cover-medium_large.jpg Academia National University of Defense Technology, China 2009 08 21 08/21/2009 Chengkun Wu Jianping Yin Zhiping Cai Paper Science Chengkun Wu,Jianping Yin,Zhiping Cai,chengkun_wu@nudt.edu.cn,jpyin@nudt.edu.cn,zpcai@nudt.edu.cn 89d9f616-d298-43a4-99e1-3fe1db248cba Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster In this paper, we propose an inherent parallel scheme for 3D image segmentation of large volume data on a GPU cluster. This method originates from an extended Lattice Boltzmann Model (LBM), and provides a new numerical solution for solving the level set equation. 
As a local, explicit and parallel scheme, our method lends itself to several favorable features: (1) Very easy to implement with the core program only requiring a few lines of code; (2) Implicit computation of curvatures; (3) Flexible control of generating smooth segmentation results; (4) Strong amenability to parallel computing, especially on low-cost, powerful graphics hardware (GPU). The parallel computational scheme is well suited for cluster computing, leading to a good solution for segmenting very large data sets. /content/cudazone/CUDABrowser/assets/images/applications/810_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/810_cover-medium_large.jpg Academia Kent State University 2009 11 26 11/26/2009 Aaron Hagan Ye Zhao Paper Science Aaron Hagan,Ye Zhao 9e70b216-1271-4886-be56-fe79e2bb7ea9 Computing the Longest Common Transposition-Invariant Subsequence with GPU Finding a longest common transposition-invariant subsequence (LCTS) of two given integer sequences A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n (a generalization of the well-known longest common subsequence problem (LCS)) has arisen in the field of music information retrieval. In the LCTS problem, we look for an LCS for the sequences A + t = (a_1 + t)(a_2 + t)...(a_m + t) and B, where t is any integer. Performance of the top graphical processing units (GPUs) outgrew the performance of the top CPUs a few years ago and there is a surge of interest in recent years in using GPUs for general processing. We propose and evaluate a bit-parallel algorithm solving the LCTS problem on a GPU.
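The LCTS definition above can be illustrated with a small CPU-only sketch. It is not the bit-parallel GPU algorithm of the paper; it is a plain dynamic-programming baseline, and the function names `lcs_length` and `lcts_length` are hypothetical. It relies on one observation: only transpositions of the form t = b_j - a_i can make any element match, so those are the only values of t worth trying.

```python
def lcs_length(a, b):
    """Classic O(m*n) dynamic-programming length of a longest common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcts_length(a, b):
    """Length of a longest common transposition-invariant subsequence.

    Baseline only (an assumption for illustration, not the paper's method):
    run an ordinary LCS on A + t and B for every candidate transposition t.
    """
    if not a or not b:
        return 0
    # Any other t leaves A + t and B without a single common element.
    candidates = {y - x for x in a for y in b}
    return max(lcs_length([x + t for x in a], b) for t in candidates)
```

For example, `lcts_length([60, 62, 64, 65], [67, 50, 69, 71, 72])` finds the transposition t = 7, under which all four elements of A + t appear in order in B.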
/content/cudazone/CUDABrowser/assets/images/applications/809_Untitledsecuritytechnology_small.png /content/cudazone/CUDABrowser/assets/images/applications/809_Untitledsecuritytechnology_large.png Academia Silesian University of Technology 2009 10 01 10/01/2009 Sebastian Deorowicz Paper Computer Aided Engineering Sebastian Deorowicz,sebastian.deorowicz@polsl.pl 58db5b29-d3e0-4e9a-975a-d39dfd48e727 Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling We present an approach to compute the visual hulls of multiple people in real-time in the presence of occlusions. We prove that the resulting visual hulls are correct and minimal under occlusions. Our proposed algorithm runs completely on the GPU with framerates up to 50fps for multiple people using only one computer equipped with off-the-shelf hardware. We also compare runtimes for different graphic chips and show that our approach scales very well without additional effort. Comparison to other work shows that our algorithm is as fast as state-of-the-art technology. The resulting visual hulls can be the basis for a wide range of algorithms that require a robust voxel representation as input. /content/cudazone/CUDABrowser/assets/images/applications/808_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/808_cover-medium_large.jpg Academia Fraunhofer IITB Karlsruhe / Universitat Karlsruhe 2009 09 02 09/02/2009 Alexander Schick Rainer Stiefelhagen Paper Science Alexander Schick,Rainer Stiefelhagen,alexander.schick@iitb.fraunhofer.de,rainer.stiefelhagen@iitb.fraunhofer.de f4504157-17b0-4b17-9476-d48e77994f7f Arion Render Arion is the hybrid-accelerated and physically-based light simulator developed by RandomControl. It comprises an interactive WYSIWYG editing application and a super-high performance production renderer. Arion uses all the GPUs and all the CPUs in your system simultaneously, not wasting a single flop available.
Additionally, Arion can use all the GPUs and all the CPUs in all the other computers in your network forming a cluster for massive computation. Arion is a grid-computing solution to the problem of light physics simulation. /content/cudazone/CUDABrowser/assets/images/applications/807_arion_cuda_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/807_arion_cuda_large.jpg Commercial RandomControl S.L.U. 2010 04 01 04/01/2010 50 Commercial RandomControl Application Multimedia Presentation Graphics Imaging Raytracing raytracing rendering physically-based unbiased randomcontrol arion fryrender,RandomControl,tech@randomcontrol.com 1a492908-0605-4d9e-af4f-085ff724e6cf Asymmetric Distributed Shared Memory GMAC is a run-time system that implements an Asymmetric Distributed Shared Memory model. This model eases the task of programming CUDA applications by building a unified global address space including system and GPU memories. Code executed at the CPU can transparently access data hosted by the GPU memory, but code run at the GPU is constrained to access the data hosted by its memory. GMAC removes the need to perform explicit data transfers using cudaMemcpy() calls and handles all data transfers in a transparent and efficient way. Moreover, the unified address space implemented by GMAC allows using CPU pointers in the GPU code.
/content/cudazone/CUDABrowser/assets/images/applications/806_google_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/806_google_large.jpg Academia Universitat Politecnica de Catalunya / University of Illinois 2009 11 02 11/02/2009 Open source Isaac Gelado Application Presentation Libraries Isaac Gelado,igelado@ac.upc.edu ef51a1b4-1fff-412e-a96d-796a24015f38 Octane Renderer Octane Render is a fully GPU-powered, un-biased and physically based rendering application, with a 10-15X speed increase over un-biased CPU based renderers /content/cudazone/CUDABrowser/assets/images/applications/806_octane_cuda_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/806_octane_cuda_large.jpg Commercial Refractive Software LTD http://www.refractivesoftware.com 2010 01 10 01/10/2010 15 Commercial Refractive Software LTD Application Multimedia Imaging Video & Audio Graphics Refractive Software LTD 554c3825-b0de-4df9-bd68-f0dba7b2a590 Textbook: GPU Chinese text book for CUDA programing /content/cudazone/CUDABrowser/assets/images/applications/803_20100202044228595_small.png /content/cudazone/CUDABrowser/assets/images/applications/803_20100202044228595_large.png Commercial www.hpctech.com http://www.hpctech.com/ 2009 10 01 10/01/2009 Shu Zhang Yanli Chu Kaiyong Zhao Multimedia HPC information Shu Zhang,Yanli Chu,Kaiyong Zhao,zhao.kaiyong@gmail.com a62c5428-2955-4cf3-9d0d-0078b395153f QView Multi-math object viewer . Still under development. 
/content/cudazone/CUDABrowser/assets/images/applications/802_qview_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/802_qview_large.jpg Research digitker - The digital kernel 2010 04 30 04/30/2010 Dimitar Tsonov Paper Presentation Computational Fluid Dynamics Finance Game Physics Graphics Numerics Libraries Science math kernel viewer Dimitar Tsonov,dtsonov@digitker.com ed7975e2-60da-449a-8a34-febfbd08eebf Textbook: Programming Massively Parallel Processors: A Hands-on Approach The first textbook of its kind, Programming Massively Parallel Processors: A Hands-on Approach is authored by Dr. David B. Kirk, NVIDIA Fellow and former chief scientist, and Dr. Wen-mei Hwu, who serves at the University of Illinois at Urbana-Champaign as Chair of Electrical and Computer Engineering in the Coordinated Science Laboratory, co-director of the Universal Parallel Computing Research Center and principal investigator of the CUDA Center of Excellence. The textbook, which is 256 pages, is the first aimed at teaching advanced students and professionals the basic concepts of parallel programming and GPU architectures. Published by Morgan Kaufmann, it explores various techniques for constructing parallel programs and reviews numerous case studies. /content/cudazone/CUDABrowser/assets/images/applications/801_Kirk-HR_large_small.png /content/cudazone/CUDABrowser/assets/images/applications/801_Kirk-HR_large_large.png Academia NVIDIA and UIUC 2011 01 28 01/28/2010 Dr. David Kirk Dr. Wen-meiHwu Multimedia Progamming textbook CUDA, Parallel Processing, NVIDIA, GPU,Dr. David Kirk,Dr. Wen-meiHwu,dkirk@nvidia.com c8e8ac46-4a7f-47db-b7d6-b79ae238ba7d PARRET: Parellel RestoreTools PARRET is a Python package for image deblurring on GPUs. By making use of the parallelism on NVIDIA GPU CUDA architecture, the deblurring time is greatly reduced. Besides image deblurring, PARRET can be used to solve linear equations. 
/content/cudazone/CUDABrowser/assets/images/applications/800_demo_small.png /content/cudazone/CUDABrowser/assets/images/applications/800_demo_large.png Academia Emory University 2010 02 01 02/01/2010 15 Open source Ying Wai (Daniel) Fan Code Imaging deblurring, Python, linear systems of equations,Ying Wai (Daniel) Fan,yfan@emory.edu 3192f565-72ab-4885-9348-2b3afd2511d6 QUDA : A library for QCD on GPUs QUDA is a library for performing calculations in lattice QCD on graphics processing units (GPUs) using NVIDIA's C for CUDA API. The current release includes optimized kernels for applying the Wilson Dirac operator and clover-improved Wilson Dirac operator, kernels for performing various BLAS-like operations, and full inverters built on these kernels. Mixed-precision implementations of both CG and BiCGstab are provided, with support for double, single, and half (16-bit fixed-point) precision. /content/cudazone/CUDABrowser/assets/images/applications/799_quda_image_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/799_quda_image_large.jpg Academia Boston University and Harvard University 2009 11 17 11/17/2009 10 Open source M. A. Clark R. Babich K. Barros R. Brower C. Rebbi Application Paper Code Science QCD, linear solver, mixed precision,Mike Clark,mikec@seas.harvard.edu 14721042-0396-4060-8731-199cc53e5bc2 SCGPSim: A fast SystemC simulator on GPUs The main objective of this paper is to speed up the simulation performance of SystemC designs at the RTL abstraction level by exploiting the high degree of parallelism afforded by today's general purpose graphics processors (GPGPUs). Our approach parallelizes SystemC's discrete-event simulation (DES) on GPGPUs by transforming the model of computation of DES into a model of concurrent threads that synchronize as and when necessary. 
Our simulation infrastructure is called SCGPSim and it includes a source-to-source (S2S) translator to transform synthesizable SystemC models into parallelly executable programs targeting an NVIDIA GPU. The translator retains the simulation semantics of the original designs by applying semantics preserving transformations. The resulting transformed models mapped onto the massively parallel architecture of GPUs improve simulation efficiency quite substantially. Preliminary experiments with varying-sized examples such as AES, ALU, and FIR have shown simulation speed-ups ranging from 30x to 100x. Considering that our transformations are not yet optimized, we believe that optimizing them will improve the simulation performance even further. /content/cudazone/CUDABrowser/assets/images/applications/798_scgp2_small.png /content/cudazone/CUDABrowser/assets/images/applications/798_scgp2_large.png Academia FERMAT Lab, Virginia Tech, Blacksburg, VA http://www.fermat.ece.vt.edu/ 2010 01 19 01/19/2010 100 Mahesh Nanjundappa Hiren D Patel Bijoy A Jose Sandeep K Shukla Paper Electronic Design Automation Mahesh Nanjundappa,Hiren D Patel,Bijoy A Jose,knmahesh@vt.edu 128f6237-5801-4d4f-b825-fc3a01ba1578 Myocyte Simulation The code performs several time-step simulations of a myocyte (heart muscle cell) in parallel, producing results for different sets of inputs. /content/cudazone/CUDABrowser/assets/images/applications/797_Myocyte_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/797_Myocyte_large.jpg Academia University of Virginia http://www.virginia.edu 2010 01 31 01/31/2010 10 Lukasz G. Szafaryn Application Multimedia Paper Code Life Sciences Science Simulation myocyte, simulation, ode solving, time-step,Lukasz G. Szafaryn,lgs9a@virginia.edu ab039cd4-07bd-419e-b6b0-a2e7e7be3fec Mutual Information Based Semi-Global Stereo Matching on the GPU Real-time stereo matching is necessary for many practical applications, including robotics.
There are already many real-time stereo systems, but they typically use local approaches that cause object boundaries to be blurred and small objects to be removed. We have selected the Semi-Global Matching (SGM) method for implementation on graphics hardware, because it can compete with the currently best global stereo methods. At the same time, it is much more efficient than most other methods that produce a similar quality. In contrast to previous work, we have fully implemented SGM including matching with mutual information, which is partly responsible for the high quality of disparity images. Our implementation reaches 4.2 fps on a GeForce 8800 ULTRA with images of 640 x480 pixel size and 128 pixel disparity range and 13 fps on images of 320 x240 pixel size and 64 pixel disparity range. /content/cudazone/CUDABrowser/assets/images/applications/796_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/796_cover-medium_large.jpg Academia German Aerospace Center 2008 12 02 12/02/2008 Ines Ernst Heiko Hirschmuller Paper Science Ines Ernst,Heiko Hirschmuller,ines.ernst@dlr.de,heiko.hirschmueller@dlr.de 4f1c26e4-bd49-4db3-9e21-65632e62b00d Experiences with Cell-BE and GPU for Tomography Tomography is a powerful technique for three-dimensional imaging, that deals with image reconstruction from a series of projection images, acquired along a range of viewing directions. An important part of any tomograph system is the reconstruction algorithm. Iterative reconstruction algorithms have many advantages over non-iterative methods, yet their running time can be prohibitively long. As these algorithms have high potential for parallelization, multi-core architectures, such as the Cell-BE and GPU, can possibly alleviate this problem. In this paper, we describe our experiences in mapping the basic operations of iterative reconstruction algorithms onto these platforms. 
We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE. Performance results of our implementation demonstrate a speedup of over 40 for a single GPU, compared to a single-core CPU version. By combining eight GPUs and a quad-core CPU in a single system, similar performance to a large cluster consisting of hundreds of CPU cores has been obtained. /content/cudazone/CUDABrowser/assets/images/applications/795_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/795_cover-medium_large.jpg Academia University of Antwerp, Belgium 2009 07 21 07/21/2009 40 Sander van der Maar Kees Joost Batenburg Jan Sijbers Paper Science Sander van der Maar,Kees Joost Batenburg,Jan Sijbers,Sander.vanderMaar@ua.ac.be,Joost.Batenburg@ua.ac.be,Jan.Sijbers@ua.ac.be 9256c867-a33e-4bca-8dd1-f56c21b6047b Experiences with Cell-BE and GPU for Tomography Tomography is a powerful technique for three-dimensional imaging, that deals with image reconstruction from a series of projection images, acquired along a range of viewing directions. An important part of any tomograph system is the reconstruction algorithm. Iterative reconstruction algorithms have many advantages over non-iterative methods, yet their running time can be prohibitively long. As these algorithms have high potential for parallelization, multi-core architectures, such as the Cell-BE and GPU, can possibly alleviate this problem. In this paper, we describe our experiences in mapping the basic operations of iterative reconstruction algorithms onto these platforms. We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE. Performance results of our implementation demonstrate a speedup of over 40 for a single GPU, compared to a single-core CPU version. By combining eight GPUs and a quad-core CPU in a single system, similar performance to a large cluster consisting of hundreds of CPU cores has been obtained. 
/content/cudazone/CUDABrowser/assets/images/applications/793_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/793_cover-medium_large.jpg Academia University of Antwerp, Belgium 2009 07 21 07/21/2009 40 Sander van der Maar Kees Joost Batenburg Jan Sijbers Paper Science Sander van der Maar,Kees Joost Batenburg,Jan Sijbers,Sander.vanderMaar@ua.ac.be,Joost.Batenburg@ua.ac.be,Jan.Sijbers@ua.ac.be 4ad94310-447d-47c8-bd18-1a36ddda8728 Multi-walk Parallel Pattern Search Approach on a GPU Computing Platform This paper studies the efficiency of using Pattern Search (PS) on bound constrained optimization functions on a Graphics Processing Unit (GPU) computing platform. Pattern Search is a direct search optimization technique that does not require derivative information on non-linear programming problems. Pattern Search is ideally suited to a GPU computing environment due to its low memory requirement and no communication between threads in a multi-walk setting. To adapt to a GPU environment, traditional Pattern Search is modified by terminating based on iterations instead of tolerance. This research designed and implemented a multi-walk Pattern Search algorithm on a GPU computing platform. Computational results are promising with a computing speedup of 100+ compared to a corresponding implementation on a single CPU. /content/cudazone/CUDABrowser/assets/images/applications/792_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/792_cover-medium_large.jpg Academia Lamar University 2009 05 20 05/20/2009 Weihang Zhu James Curry Paper Science Weihang Zhu,James Curry,Weihang.Zhu@lamar.edu,jcurry@my.lamar.edu 1ecce826-a4da-4bd6-932e-11130eeee781 A GPU-Based Simulation of Tsunami Propagation and Inundation Tsunami simulation consists of fluid dynamics, numerical computations, and visualization techniques. Nonlinear shallow water equations are often used to model the tsunami propagation. 
By adding the friction slope to the conservation of momentum, it also can model the tsunami inundation. To solve these equations, we use the second order finite difference MacCormack method. Since it is a finite difference method, it brings the possibility to be parallelized. We use the parallelism provided by GPU to speed up the computations. By loading data as textures in GPU memory, the computation processes can be written as shader programs and the operations will be done by GPU in parallel. The results show that with the help of GPU, the simulation can get a significant improvement in the execution time for each of the computation steps. /content/cudazone/CUDABrowser/assets/images/applications/790_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/790_cover-medium_large.jpg Academia National United University 2009 07 31 07/31/2009 Wen-Yew Liang Tung-Ju Hsieh Muhammad T. Satria Paper Science Wen-Yew Liang,Tung-Ju Hsieh,Muhammad T. Satria,wyliang@ntut.edu.tw,tjhsieh@ntut.edu.tw,t6598056@ntut.edu.tw 4aba234f-c87b-477d-84e4-5ccb3a641313 GPU-Supported Image Compression for Remote Visualization Realization and Benchmarking In this paper we introduce a novel GPU-supported JPEG image compression technique with a focus on its application for remote visualization purposes. Fast and high quality compression techniques are very important for the remote visualization of interactive simulations and Virtual reality applications (IS/VR) on hybrid clusters. Thus the main goals of the design and implementation of this compression technique were low compression times and nearly no visible quality loss, while achieving compression rates that allow for 30+ Frames per second over 10 MBit/s networks. To analyze the potential of the technique and further development needs and to compare it to existing methods, several benchmarks are conducted and described in this paper. 
Additionally a quality assessment is performed to allow statements about the achievable quality of the lossy image compression. The results show that using the GPU not only for rendering but also for image compression is a promising approach for interactive remote rendering. /content/cudazone/CUDABrowser/assets/images/applications/789_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/789_cover-medium_large.jpg Academia University of Paderborn 2008 12 02 12/02/2008 Stefan Lietsch Paul Hermann Lensing Paper Science Stefan Lietsch,Paul Hermann Lensing,slietsch@upb.de,plensing@upb.de 1968f34b-b4e7-4cfe-949e-957ac0b0a242 GPU-MEME: Using Graphics Hardware to Accelerate Motif Finding in DNA Sequences Discovery of motifs that are repeated in groups of biological sequences is a major task in bioinformatics. Iterative methods such as expectation maximization (EM) are used as a common approach to find such patterns. However, corresponding algorithms are highly compute-intensive due to the small size and degenerate nature of biological motifs. Runtime requirements are likely to become even more severe due to the rapid growth of available gene transcription data. In this paper we present a novel approach to accelerate motif discovery based on commodity graphics hardware (GPUs). To derive an efficient mapping onto this type of architecture, we have formulated the compute-intensive parts of the popular MEME tool as streaming algorithms. Our experimental results show that a single GPU allows speedups of one order of magnitude with respect to the sequential MEME implementation. Furthermore, parallelization on a GPU-cluster even improves the speedup to two orders of magnitude. 
/content/cudazone/CUDABrowser/assets/images/applications/788_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/788_cover-medium_large.jpg Academia Nanyang Technological University 2008 10 08 10/08/2008 Chen Chen Bertil Schmidt Liu Weiguo Paper Science Chen Chen,Bertil Schmidt,Liu Weiguo,cchen@ntu.edu.sg,asbschmidt@ntu.edu.sg,liuweiguo@ntu.edu.sg 9dd0b45a-39ac-46d4-b174-a1e78ecab2a7 Performance Optimization Strategies of High Performance Computing on GPU Recently GPU is widely utilized in scientific computing and engineering applications, owing primarily to the evolution of GPU architecture. Firstly, we analyze some key performance characters of GPU in detail, and the relationships among GPU architecture, programming model and memory hierarchy. Secondly, we present three performance optimization strategies: Prefetching, Streamlizing, and Task Division. Adequate experiments have been done to abstract the relationships among different factors and efficiency. Finally, we map the HPL benchmark to testify our strategies and achieve certain speedup. /content/cudazone/CUDABrowser/assets/images/applications/787_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/787_cover-medium_large.jpg Academia National University of Defense Technology, ChangSha 2009 08 21 08/21/2009 Anguo Ma Jing Cai Yu Cheng Paper Science Anguo Ma,Jing Cai,Yu Cheng,anguo.ma@nudt.edu.cn,jing.cai@nudt.edu.cn,y.cheng@nudt.edu.cn 168e001f-d970-4413-90a0-8d6c90fda259 Bipartite Graph Matching Computation on GPU The Bipartite Graph Matching Problem is a well studied topic in Graph Theory. Such matching relates pairs of nodes from two distinct sets by selecting a subset of the graph edges connecting them. Each edge selected has no common node as its end points to any other edge within the subset. When the considered graph has huge sets of nodes and edges the sequential approaches are impractical, specially for applications demanding fast results. 
In this paper we investigate how to compute such matching on Graphics Processing Units (GPUs), motivated by their increasing processing power made available at decreasing cost. We present a new data-parallel approach for computing bipartite graph matching that is efficiently computed on today's graphics hardware and apply it to solve the correspondence between 3D samples taken over a time interval. /content/cudazone/CUDABrowser/assets/images/applications/786_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/786_cover-medium_large.jpg Academia Leibniz Universitaet Hannover 2009 08 17 08/17/2009 Cristina Nader Vasconcelos Bodo Rosenhahn Paper Science Cristina Nader Vasconcelos,Bodo Rosenhahn,crisnv@inf.puc-rio.br,rosenhahn@tnt.uni-hannover.de 124508be-daac-4a5e-8a7d-8bcdae9ea237 Face Detection Using GPU-Based Convolutional Neural Networks In this paper, we consider the problem of face detection under pose variations. Unlike other contributions, a focus of this work resides within efficient implementation utilizing the computational powers of modern graphics cards. The proposed system consists of a parallelized implementation of convolutional neural networks (CNNs) with a special emphasis on also parallelizing the detection process. Experimental validation in a smart conference room with 4 active ceiling-mounted cameras shows a dramatic speed-gain under real-life conditions. /content/cudazone/CUDABrowser/assets/images/applications/785_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/785_cover-medium_large.jpg Academia TU Dortmund University 2009 08 29 08/29/2009 Fabian Nasse Christian Thurau Gernot A. Fink Paper Science Fabian Nasse,Christian Thurau,Gernot A.
Fink cfd6b540-64f5-423f-bc2e-1b7ec1439ba5 Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures Graphics processors are increasingly used in scientific applications due to their high computational power, which comes from hardware with multiple-level parallelism and memory hierarchy. Sparse matrix computations frequently arise in scientific applications, for example, when solving PDEs on unstructured grids. However, traditional sparse matrix algorithms are difficult to efficiently parallelize for GPUs due to irregular patterns of memory references. In this paper we present a new storage format for sparse matrices that better employs locality, has low memory footprint and enables automatic specialization for various matrices and future devices via parameter tuning. Experimental evaluation demonstrates significant speedups compared to previously published results. /content/cudazone/CUDABrowser/assets/images/applications/784_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/784_cover-medium_large.jpg Academia Institute for System Programming of RAS 2010 01 21 01/21/2010 Alexander Monakov Anton Lokhmotov Arutyun Avetisyan Paper Science Alexander Monakov,Anton Lokhmotov,Arutyun Avetisyan,amonakov@ispras.ru,anton@doc.ic.ac.uk,arut@ispras.ru e105e7e5-d0ca-4fe1-b6ce-897fd679d5b4 Searching High-Dimensional Neighbours: CPU-Based Tailored Data-Structures Versus GPU-Based Brute-Force Method Many image processing algorithms rely on nearest neighbor (NN) or on the k nearest neighbor (kNN) search problem. Several methods have been proposed to reduce the computation time, for instance using space partitioning. However, these methods are very slow in high dimensional space. In this paper, we propose a fast implementation of the brute-force algorithm using GPU (Graphics Processing Units) programming.
We show that our implementation is up to 150 times faster than the classical approaches on synthetic data, and up to 75 times faster on real image processing algorithms (finding similar patches in images and texture synthesis). /content/cudazone/CUDABrowser/assets/images/applications/783_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/783_cover-medium_large.jpg Academia Palaiseau 2009 05 05 05/05/2009 Vincent Garcia Frank Nielsen Paper Science Vincent Garcia,Frank Nielsen,garciav@lix.polytechnique.fr,nielsen@lix.polytechnique.fr 05b7c411-e33a-4038-b779-b94b67ba0e80 Belief Propagation Implementation Using CUDA on an NVIDIA GTX 280 Disparity map generation is a significant component of vision-based driver assistance systems. This paper describes an efficient implementation of a belief propagation algorithm on a graphics card (GPU) using CUDA (Compute Unified Device Architecture) that can be used to speed up stereo image processing by between 30 and 250 times. For evaluation purposes, different kinds of images have been used: reference images from the Middlebury stereo website, and real-world stereo sequences, self-recorded with the research vehicle of the .enpeda.. project at The University of Auckland. This paper provides implementation details, primarily concerned with the inequality constraints, involving the threads and shared memory, required for efficient programming on a GPU. /content/cudazone/CUDABrowser/assets/images/applications/780_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/780_cover-medium_large.jpg Academia Shandong University 2009 11 18 11/18/2009 Yanyan Xu Hui Chen Reinhard Klette Paper Science Yanyan Xu,Hui Chen,Reinhard Klette 1cb185e6-e66e-458f-95d9-0f08f2490b6b Lloyd's Algorithm on GPU The Centroidal Voronoi Diagram (CVD) is a very versatile structure, well studied in Computational Geometry. It is used as the basis for a number of applications.
This paper presents a deterministic algorithm, entirely computed using graphics hardware resources, based on Lloyds Method for computing CVDs. While the computation of the ordinary Voronoi diagram on GPU is a well explored topic, its extension to CVDs presents some challenges that the present study intends to overcome. /content/cudazone/CUDABrowser/assets/images/applications/779_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/779_cover-medium_large.jpg Academia Pontificia Universidade Catolica 2008 12 02 12/02/2008 Cristina N. Vasconcelos Asla Sa Paulo Cezar Carvalho Paper Science Cristina N. Vasconcelos,Asla Sa,Paulo Cezar Carvalho,crisnv@inf.puc-rio.br,asla@tecgraf.puc-rio.br,pcezar@impa.br 86056069-3857-4e63-8c25-55a234a83edd GPU-Accelerated Nearest Neighbor Search for 3D Registration Nearest Neighbor Search (NNS) is employed by many computer vision algorithms. The computational complexity is large and constitutes a challenge for real-time capability. The basic problem is in rapidly processing a huge amount of data, which is often addressed by means of highly sophisticated search methods and parallelism. We show that NNS based vision algorithms like the Iterative Closest Points algorithm (ICP) can achieve real-time capability while preserving compact size and moderate energy consumption as it is needed in robotics and many other domains. The approach exploits the concept of general purpose computation on graphics processing units (GPGPU) and is compared to parallel processing on CPU. We apply this approach to the 3D scan registration problem, for which a speed-up factor of 88 compared to a sequential CPU implementation is reported. 
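The correspondence step that dominates ICP can be sketched as a brute-force nearest neighbor search. This is a minimal CPU illustration only: the abstract does not say which search structure the authors use, and the function name `nearest_neighbors` is hypothetical. Each outer-loop iteration is independent of the others, which is the property that makes the search a natural fit for one-thread-per-query GPU execution.

```python
def nearest_neighbors(source, target):
    """For each 3D point in `source`, return the index of the closest point in `target`.

    Brute force, O(len(source) * len(target)); a stand-in for whatever search
    structure a real registration pipeline would use.
    """
    result = []
    for p in source:
        best_index, best_dist = -1, float("inf")
        for i, q in enumerate(target):
            # Squared Euclidean distance is enough for ranking candidates.
            d = sum((pc - qc) ** 2 for pc, qc in zip(p, q))
            if d < best_dist:
                best_index, best_dist = i, d
        result.append(best_index)
    return result
```

In an ICP loop these indices would pair up points before the rigid transform is re-estimated.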
/content/cudazone/CUDABrowser/assets/images/applications/778_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/778_cover-medium_large.jpg Academia Sankt Augustin 2009 10 14 10/14/2009 Deyuan Qiu Stefan May Andreas Nuchter Paper Science Deyuan Qiu,Stefan May,Andreas Nuchter,dqiu2s@smail.inf.h-brs.de,stefan_may@arcor.de,andreas@nuechti.de 95724514-4e0b-41fc-b92d-e2c41be2c895 An Efficient Pre-filtering Mechanism for Parallel Intrusion Detection Based on Many-Core GPU Multi-pattern search is a time-consuming task in Network Intrusion Detection Systems (NIDS). The processing ability of NIDS cannot catch up with the rapid development of network bandwidth. One intuitive idea is to use pre-filtering to reduce the workload of NIDS. Our goal is to design a novel method for pre-filtering which will be ready for an efficient implementation on many-core GPU. Through statistical analysis, we propose a rudimentary method to use 2-byte ASCII sub-patterns as the filter keywords. To reduce the size of the filter keyword set, we use Binary Integer Linear Programming (BILP) for optimization. The number of filter keywords is reduced from 4824 to 362, which is also much smaller than with the prefix-based and suffix-based methods. We argue that our method can well utilize the computation power of GPU. Experiments demonstrate that our pre-filter can achieve a good filter ratio, thus alleviating the burden of NIDS.
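The pre-filtering idea can be sketched as follows, under stated assumptions. The abstract selects filter keywords with Binary Integer Linear Programming; the sketch below swaps in a simple greedy cover instead, so it only shows the principle and will not reproduce the reported keyword counts. The function names `two_byte_substrings`, `choose_filter_keywords`, and `needs_full_scan` are hypothetical.

```python
def two_byte_substrings(data: bytes) -> set:
    """Return every 2-byte window that occurs in `data`."""
    return {data[i:i + 2] for i in range(len(data) - 1)}


def choose_filter_keywords(patterns):
    """Pick 2-byte keywords so that every pattern contains at least one of them.

    Greedy cover (an assumption; the paper uses BILP): repeatedly take the
    2-byte substring shared by the most still-uncovered patterns.
    """
    uncovered = {}
    for p in patterns:
        if len(p) < 2:
            raise ValueError("patterns must be at least 2 bytes long")
        uncovered[p] = two_byte_substrings(p)
    keywords = set()
    while uncovered:
        counts = {}
        for subs in uncovered.values():
            for s in subs:
                counts[s] = counts.get(s, 0) + 1
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(counts), key=counts.get)
        keywords.add(best)
        uncovered = {p: s for p, s in uncovered.items() if best not in s}
    return keywords


def needs_full_scan(payload: bytes, keywords) -> bool:
    """True if the payload contains any filter keyword and must go to the full matcher."""
    return not keywords.isdisjoint(two_byte_substrings(payload))


if __name__ == "__main__":
    signatures = [b"attack", b"attach", b"exploit"]
    keys = choose_filter_keywords(signatures)
    print(needs_full_scan(b"GET /exploit.php", keys))
    print(needs_full_scan(b"hello world", keys))
```

Because every pattern contains at least one keyword, a payload without any keyword cannot contain any pattern, so skipping it never causes a missed detection; payloads that do contain a keyword still need the full multi-pattern search.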
/content/cudazone/CUDABrowser/assets/images/applications/777_Untitledsecuritytechnology_small.png /content/cudazone/CUDABrowser/assets/images/applications/777_Untitledsecuritytechnology_large.png Academia National University of Defense Technology 2009 11 28 11/28/2009 Chengkun Wu Jianping Yin Zhiping Cai Paper Science Chengkun Wu,Jianping Yin,Zhiping Cai,chengkun_wu@nudt.edu.cn,jpyin@nudt.edu.cn,zpcai@nudt.edu.cn 21d4bbfd-5dd3-4016-982d-d55bab9285ed GPU-based Acceleration of System-level Design Tasks Many system-level design tasks (e.g., high-level timing analysis, hardware/software partitioning and design space exploration) involve computational kernels that are intractable (usually NP-hard). As a result, they involve high running times even for mid-sized problems. In this paper we explore the possibility of using commodity graphics processing units (GPUs) to accelerate such tasks that commonly arise in the electronic design automation (EDA) domain. We demonstrate this idea via two detailed case studies. The first explores the possibility of using GPUs to speed up standard schedulability analysis problems. The second proposes a GPU-based engine for a general hardware/software design space exploration problem. Not only do these problems commonly arise in the embedded systems domain, their computational kernels turn out to be variants of a combinatorial optimization problem, viz., the knapsack problem, which lies at the heart of several EDA applications. Experimental results show that our GPU-based implementations offer very attractive speedups for the computational kernels (up to 100x), and speedups of up to 17x for the full problem. In contrast to ASIC/FPGA-based accelerators, and given that even low-end desktop and notebook computers are now equipped with GPUs, our solution involves no extra hardware cost.
Although recent research has shown the benefits of using GPUs for a variety of non-graphics applications (e.g., in databases and bioinformatics), harnessing the parallelism of GPUs to accelerate problems from the EDA domain has not been sufficiently explored so far. We believe that our results and the generality of the core problem that we address will motivate researchers from this community to explore the possibility of using GPUs for a wider variety of problems from the EDA domain. /content/cudazone/CUDABrowser/assets/images/applications/776_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/776_cover-medium_large.jpg Academia TU Munich 2010 01 15 01/15/2010 Unmesh D. Bordoloi Samarjit Chakraborty Paper Science Unmesh D. Bordoloi,Samarjit Chakraborty d6f04307-3afc-40ed-9c71-ad0bc9456cec A generic library for structured real-time computations: GPU implementation applied to retinal and cortical vision processes Most graphics cards in standard personal computers are now equipped with several pixel pipelines running shader programs. Taking advantage of this technology by transferring parallel computations from the CPU side to the GPU side increases the overall computational power even in non-graphical applications by freeing the main processor from heavy work. A generic library is presented to show how anyone can benefit from modern hardware by combining various techniques with little hardware-specific programming skill. Its shader implementation is applied to retinal and cortical simulation. The purpose of this sample application is not to provide a correct approximation of real center surround ganglion or middle temporal cells, but to illustrate how easily intertwined spatiotemporal filters can be applied on raw input pictures in real-time. Requirements and interconnection complexity really depend on the vision framework adopted, therefore various hypotheses that may benefit from such a library are introduced.
/content/cudazone/CUDABrowser/assets/images/applications/775_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/775_implementation_large.png Academia University of Toulouse 2009 01 08 01/08/2009 Jean-Charles Quinton Paper Science Jean-Charles Quinton,quinton@n7.fr 4844c2e1-42ea-446c-aa46-616f14577bf2 GPU Accelerated 3D Face Registration / Recognition This paper proposes a novel approach to both registration and recognition of faces in three dimensions. The presented method is based on a normal map metric to perform either the alignment of a captured face to a reference template or the comparison between any two faces in a gallery. As the metric involved is highly suited to computation on a vector processor, we propose an implementation of the whole framework on last generation graphics boards, to exploit the potential of GPUs applied to large scale biometric identification applications. This work shows how the use of affordable consumer grade hardware could allow ultra rapid comparison between face descriptors through their highly specialized architecture. The approach also addresses facial expression changes by means of subject-specific weighting masks. We include preliminary results of experiments conducted on a proprietary gallery and on a subset of the FRGC database. /content/cudazone/CUDABrowser/assets/images/applications/774_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/774_cover-medium_large.jpg Academia Universita degli Studi di Salerno 2007 08 30 08/30/2007 Andrea Francesco Abate Michele Nappi Stefano Ricciardi Paper Imaging Andrea Francesco Abate,Michele Nappi,Stefano Ricciardi,abate@unisa.it,mnappi@unisa.it,sricciardi@unisa.it ffcd384a-918f-49f4-a5ad-b21b0988e948 Implementation of a Lattice Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology The Lattice Boltzmann method (LBM) is a distribution-function based approach to numerical fluid mechanics.
Due to the simple formulation of the underlying algorithm this method is well suited for parallelization and hardware acceleration using general purpose graphical processing units (GPGPU). Within this work LBM has been implemented in a new code with multi-GPU support and physically validated for a flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single socket multi core CPU calculation. Moreover the validation for the test case chosen shows excellent agreement with available reference data. /content/cudazone/CUDABrowser/assets/images/applications/773_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/773_implementation_large.png Academia Technische Universitat Munchen 2009 05 06 05/06/2009 T. Indinger Paper Science T. Indinger,Thomas.Indinger@tum.de b4a2cd2c-ab54-4f4e-a597-12943d456da4 GPU-Assisted Surface Reconstruction on Locally-Uniform Samples In point-based graphics, surfaces are represented by point clouds without explicit connectivity. If the distribution of the points can be carefully controlled, surface reconstruction becomes a much easier problem. We present a simple, completely local surface reconstruction algorithm for input point distributions that are locally uniform. The locality of the computation lets us handle large point sets using parallel and out-of-core methods. The algorithm can be implemented robustly with floating-point arithmetic. We demonstrate the simplicity, efficiency, and numerical stability of our algorithm with an out-of-core and parallel implementation using graphics hardware. 
/content/cudazone/CUDABrowser/assets/images/applications/772_roundtable_small.png /content/cudazone/CUDABrowser/assets/images/applications/772_roundtable_large.png Academia University of California 2009 10 23 10/23/2009 Yong Joo Kil Nina Amenta Paper Computer Aided Engineering Yong Joo Kil,Nina Amenta,kil@cs.ucdavis.edu,amenta@cs.ucdavis.edu fdd9c3bb-420d-4e67-94a0-60174e2f4534 GP-GPU Implementation of the Local Rank Differences Image Feature A currently popular trend in object detection and pattern recognition is the use of statistical classifiers, namely AdaBoost and its modifications. The speed performance of these classifiers largely depends on the low level image features they are using: both on the amount of information the feature provides and the processor time of its evaluation. Local Rank Differences (LRD) is an image feature that is an alternative to the commonly used Haar wavelets. It is suitable for implementation in programmable (FPGA) or specialized (ASIC) hardware, but, as this paper shows, it also performs very well on graphics hardware (GPU) used in a general purpose manner (GPGPU, namely CUDA in this case). The paper discusses the LRD features and their properties, describes an experimental implementation of the LRD in graphics hardware using CUDA, presents its empirical performance measures compared to alternative approaches, suggests several notes on practical usage of LRD and proposes directions for future work.
/content/cudazone/CUDABrowser/assets/images/applications/771_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/771_cover-medium_large.jpg Academia Brno University of Technology 2009 05 21 05/21/2009 Adam Herout Radovan Josth Pavel Zemcik Paper Imaging Adam Herout,Radovan Josth,Pavel Zemcik,fherout@fit.vutbr.cz,ijosth@fit.vutbr.cz,zemcik@fit.vutbr.cz 307a1055-9c6c-4df0-bc88-96f461322333 AES Encryption Implementation and Analysis on Commodity Graphics Processing Units Graphics Processing Units (GPUs) present large potential performance gains within stream processing applications over the standard CPU. These performance gains are best realised when high computational intensity is required across large amounts of mostly independent input elements. The GPU's success in general purpose stream processing has been demonstrated in many diverse fields, though attempts to port cryptographic algorithms to the GPU have thus far met with little success. In recent years, GPU architectures have continued to develop a more flexible and uniform programming environment. These developments have overcome a lot of previously encountered restrictions in cipher implementations. We present novel approaches for the implementation of the AES block cipher encryption algorithm on these GPUs. This work also serves as a precursor for future cipher implementations on the most advanced GPU architecture, the recently released Nvidia G80, which now includes integer support and a simplified programming interface.
/content/cudazone/CUDABrowser/assets/images/applications/770_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/770_cover-medium_large.jpg Academia Trinity College Dublin 2007 08 23 08/23/2007 Owen Harrison John Waldron Paper Science Owen Harrison,John Waldron,harrisoo@cs.tcd.ie,john.waldron@cs.tcd.ie 819a8581-877a-45c8-9cfb-d63121d5dbe2 The Future of Volume Graphics in Medical Virtual Reality A recent trend in medical virtual reality is to include information from multiple sources, especially about physiology, into one model and one single visualization. Computer graphics must therefore deal with a huge amount of information in real time. The latest developments in computer graphics hardware not only allow direct volume rendering to be implemented on the graphics processing unit (GPU); the emerging compute languages also enable us to address volume rendering problems of arbitrary complexity without being limited to formulating visualization techniques in an awkward fashion to match the GPU execution model. Utilizing these new possibilities, we meet the next generation's demands in medical visualization. /content/cudazone/CUDABrowser/assets/images/applications/769_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/769_prediction_large.png Academia Graz University of Technology 2010 01 01 01/01/2010 Judith Muehl Bernhard Kainz Alexander Bornik Paper Medical Imaging Judith Muehl,Bernhard Kainz,Alexander Bornik f327d71f-b539-441f-a3d5-fc8b66c264db Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA In this article a very efficient implementation of a 2D-Lattice Boltzmann kernel using the Compute Unified Device Architecture (CUDA) interface developed by nVIDIA is presented. By exploiting the explicit parallelism exposed in the graphics hardware we obtain a performance gain of more than one order of magnitude compared to standard CPUs.
A non-trivial example, the flow through a generic porous medium, shows the performance of the implementation. /content/cudazone/CUDABrowser/assets/images/applications/768_bottle_small.png /content/cudazone/CUDABrowser/assets/images/applications/768_bottle_large.png Academia TU Braunschweig 2008 07 24 07/24/2008 Jonas Tolke Paper Numerics Jonas Tolke,toelke@cab.bau.tu-bs.de 36f123f2-0612-42e2-8134-d637453033c5 GPU in Haptic Rendering of Deformable Objects We present some results on using a Graphics Processing Unit (GPU) to compute the deformation of two experimental objects. A suture simulation model on the GPU and a 2D deformable cloth model using nVidia CUDA techniques are also proposed. We conducted experimental studies to compare the GPU-based suture models with the CPU implementation. We also experimented with the implicit model of the 2D mesh, which offers computational challenges similar to those of any Finite-Element modeling approach. A method for computing the inverse of a matrix with a truncated Neumann series is also introduced. /content/cudazone/CUDABrowser/assets/images/applications/767_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/767_cover-medium_large.jpg Academia Simon Fraser University 2008 06 28 06/28/2008 Hans Fuhan Shi Shahram Payandeh Paper Imaging Hans Fuhan Shi,Shahram Payandeh,fuhans@cs.sfu.ca,shahram@cs.sfu.ca aa90451a-b028-44e9-98ae-84677865270f GP-GPU Implementation of the Local Rank Differences Image Feature A currently popular trend in object detection and pattern recognition is the use of statistical classifiers, namely AdaBoost and its modifications. The speed performance of these classifiers largely depends on the low level image features they are using: both on the amount of information the feature provides and the processor time of its evaluation. Local Rank Differences (LRD) is an image feature that is an alternative to the commonly used Haar wavelets.
It is suitable for implementation in programmable (FPGA) or specialized (ASIC) hardware, but, as this paper shows, it also performs very well on graphics hardware (GPU) used in a general purpose manner (GPGPU, namely CUDA in this case). The paper discusses the LRD features and their properties, describes an experimental implementation of the LRD in graphics hardware using CUDA, presents its empirical performance measures compared to alternative approaches, suggests several notes on practical usage of LRD and proposes directions for future work. /content/cudazone/CUDABrowser/assets/images/applications/766_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/766_cover-medium_large.jpg Academia Brno University of Technology 2009 05 21 05/21/2009 Adam Herout Radovan Josth Pavel Zemcik Paper Imaging Adam Herout,Radovan Josth,Pavel Zemcik 41e47d70-074c-4ef5-a17f-ba467f8e9d78 Monte Carlo Dose Calculation using GPU-Based parallel processing Recently it has become possible to simulate physical phenomena on a Graphics Processing Unit (GPU), and researchers have begun to study how GPUs can shorten the computing time of Monte Carlo calculation methods. This report shows how to significantly accelerate 3D dose calculation for photon beams using a GPU. We describe a GPU parallel processing method for dose simulation based on NRCC DOSXYZnrc.
http://www.springerlink.com/content/r42wtk514k03865j/?p=da8f68ea438f401396ffad66aea4a402&pi=77 /content/cudazone/CUDABrowser/assets/images/applications/765_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/765_prediction_large.png Academia Tokyo Metropolitan University 2010 01 01 01/01/2010 Atsushi Myojyoyama Hidetoshi Saitoh Paper Numerics Atsushi Myojyoyama,Hidetoshi Saitoh b4c6e882-4f8e-4f2b-87aa-c0667c088ae7 GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision This paper briefly presents the state of the art of accelerating image processing with graphics hardware (GPU) and discusses some of its caveats. Then it describes GpuCV, an open source multi-platform library for GPU-accelerated image processing and Computer Vision operators and applications. It is meant for computer vision scientists not familiar with GPU technologies. GpuCV is designed to be compatible with the popular OpenCV library by offering GPU-accelerated operators that can be integrated into native OpenCV applications. The GpuCV framework transparently manages hardware capabilities, data synchronization, activation of low level GLSL and CUDA programs, on-the-fly benchmarking and switching to the most efficient implementation, and finally offers a set of image processing operators with GPU acceleration. /content/cudazone/CUDABrowser/assets/images/applications/764_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/764_cover-medium_large.jpg Academia TELECOM & Management SudParis 2008 12 03 12/03/2008 Yannick Allusse Patrick Horain Ankit Agarwal Paper Imaging Yannick Allusse,Patrick Horain,Ankit Agarwal 4786c8f5-1af2-4f0b-b323-7dee0cdd4936 Population Parallel GP on the G80 GPU The availability of low cost powerful parallel graphics cards has stimulated a trend to port GP on Graphics Processing Units (GPUs). Previous works on GPUs have shown evaluation phase speedups for large training cases sets.
Using the CUDA language on the G80 GPU, we show it is possible to efficiently interpret several GP programs in parallel, thus obtaining speedups also for small training sets starting at less than 100 training cases. Our scheme was embedded in the well-known ECJ library, providing an easy entry point for owners of G80 GPUs. /content/cudazone/CUDABrowser/assets/images/applications/762_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/762_cover-medium_large.jpg Academia Universite du Littoral Cote dOpale 2009 04 03 04/03/2009 Denis Robilliard Virginie MarionPoty Cyril Fonlupt Paper Science Denis Robilliard,Virginie MarionPoty,Cyril Fonlupt,robillia@lil.univ-littoral.fr,poty@lil.univ-littoral.fr,fonlupt@lil.univ-littoral.fr 2d5f5437-9dcc-4368-ad9a-937cce37e34c Medical feature matching and model extraction from MRI/CT based on the Invariant Generalized Hough/Radon Transform In this paper we present a variation of the Generalized Hough Transform (GHT) for automatic feature matching and model extraction. We propose a two-dimensional algorithm with a two-reference-point parameterization (Dual-Point GHT) that is invariant to rotation and uniform scaling and uses the specific properties of both the generalized Hough and Radon transforms. The method operates with two-dimensional accumulators, which strongly decreases the required memory size. We implement the algorithm on Graphics Processing Units (GeForce 8800GTX/nVidia CUDA) and apply it to the extraction of cardiac shapes from MRI/CT as an initial step for further medical image segmentation. /content/cudazone/CUDABrowser/assets/images/applications/761_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/761_cover-medium_large.jpg Academia University of Heidelberg 2009 02 04 02/04/2009 D. Hlindzich R. Maenner Paper Science D. Hlindzich,R.
Maenner 4a1c7daa-bbdd-4a89-b774-fafdc8d40477 Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning NVIDIA has released a new platform (CUDA) for general purpose computing on its graphical processing units (GPU). This paper evaluates use of this platform for statistical machine learning applications. The transfer rates to and from the GPU are measured, as is the performance of matrix vector operations on the GPU. An implementation of a sparse matrix vector product on the GPU is outlined and evaluated. Performance comparisons are made with the host processor. /content/cudazone/CUDABrowser/assets/images/applications/760_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/760_cover-medium_large.jpg Academia Australian National University 2009 06 25 06/25/2009 Ahmed El Zein Eric McCreath Alistair Rendell Paper Science Ahmed El Zein,Eric McCreath,Alistair Rendell,Ahmed.ElZein@anu.edu.au,Eric.McCreath@anu.edu.au,Alistair.Rendell@anu.edu.au fcdd7311-d7ab-423b-9790-8fc720230f72 High-Quality Rendering of Varying Isosurfaces with Cubic Trivariate C1-Continuous Splines Smooth trivariate splines on uniform tetrahedral partitions are well suited for high-quality visualization of isosurfaces from scalar volumetric data. We propose a novel rendering approach based on spline patches with low total degree, for which ray-isosurface intersections are computed using efficient root finding algorithms. Smoothly varying surface normals are directly extracted from the underlying spline representation. Our approach uses a combined CUDA and graphics pipeline and yields two key advantages over previous work. First, we can interactively vary the isovalues since all required processing steps are performed on the GPU. Second, we employ instancing in order to reduce shader complexity and to minimize overall memory usage. In particular, this allows the spline coefficients to be computed on-the-fly in real-time on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/759_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/759_cover-medium_large.jpg Academia TU Darmstadt 2009 11 26 11/26/2009 Thomas Kalbe Thomas Koch Michael Goesele Paper Science Thomas Kalbe,Thomas Koch,Michael Goesele 10eb0dc2-f2cc-4beb-9bfb-d6483731f3a4 Evaluation of Parallel FFT Implementations on GPU and Multi-core PCs for Magnetic Induction Tomography Magnetic Induction Tomography (MIT) is a relatively new non-invasive modality for imaging the electrical properties of materials, which is currently under investigation for a variety of industrial and biomedical applications, in particular the detection and monitoring of cerebral haemorrhage. The speed of the FFT-based phase measurement algorithms employed in some current MIT systems is, however, a major limit to higher data acquisition rates and precision. http://www.springerlink.com/content/t5022335826j4052/?p=e6efad6c51a246a5a01428810aa2b808&pi=67 /content/cudazone/CUDABrowser/assets/images/applications/758_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/758_prediction_large.png Academia Philips Research / University of Glamorgan 2010 01 01 01/01/2010 2 Y. Maimaitijiang H. C. Wee A. Roula Paper Science Y. Maimaitijiang,H. C. Wee,A. Roula 72ad49af-dbf9-4586-9849-1a116402fbcd Visualization and GPU-accelerated simulation of medical We present a fast GPU-based method for simulation of ultrasound images from volumetric CT scans and their visualization. The method uses a ray-based model of the ultrasound to generate view-dependent ultrasonic effects such as occlusions, large-scale reflections and attenuation combined with speckle patterns derived from pre-processing the CT image using a wave-based model of ultrasound propagation in soft tissue. The main applications of the method are ultrasound training and registration of ultrasound and CT images.
/content/cudazone/CUDABrowser/assets/images/applications/755_computermethods_small.png /content/cudazone/CUDABrowser/assets/images/applications/755_computermethods_large.png Academia Technische Universitat Munchen 2008 12 19 12/19/2008 Oliver Kutter Ramtin Shams Nassir Navab Paper Medical Imaging Oliver Kutter,Ramtin Shams,Nassir Navab b717bbc6-50d1-4024-90db-2c891a8c7716 Parallel Computation of Mutual Information on the GPU with Application to Real-Time Registration of 3D Medical Images Due to processing constraints, automatic image-based registration of medical images has been largely used as a pre-operative tool. We propose a novel method named "sort and count" for efficient parallelization of mutual information (MI) computation designed for massively multiprocessing architectures. Combined with a parallel transformation implementation and an improved optimization algorithm, our method achieves real-time (less than 1 second) rigid registration of 3D medical images using a commodity graphics processing unit (GPU). This represents a more than 50-fold improvement over a standard implementation on a CPU. Real-time registration opens new possibilities for development of improved and interactive intraoperative tools that can be used for enhanced visualization and navigation during an intervention. /content/cudazone/CUDABrowser/assets/images/applications/754_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/754_graph_large.png Academia Australian National University 2009 08 21 08/21/2009 50 Ramtin Shams Parastoo Sadeghi Rodney Kennedy Paper Medical Imaging Ramtin Shams,Parastoo Sadeghi,Rodney Kennedy f6835ea9-7c74-4c76-9fe3-64880944cc7e A SURVEY OF MEDICAL IMAGE REGISTRATION ON MULTI-CORE AND THE GPU A surgeon is performing a potentially life-saving pancreatectomy on a patient in early stages of pancreatic cancer.
Two small incisions of no more than half an inch allow laparoscopic tools including a video camera and an ultrasound probe to be guided inside the abdominal cavity. A third, larger incision is occupied by a hand-access device that facilitates the operation. The surgeon is able to locate the tumor in the ultrasound view with ease. This is largely possible due to a newly installed 3D navigation and visualization system that virtually renders the patient transparent. http://users.rsise.anu.edu.au/~ramtin/papers/2010/SPM_2010.pdf /content/cudazone/CUDABrowser/assets/images/applications/753_multicoregpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/753_multicoregpu_large.png Academia Australian National University 2010 03 01 03/01/2010 Ramtin Shams Parastoo Sadeghi Rodney A. Kennedy Paper Medical Imaging Ramtin Shams,Parastoo Sadeghi,Rodney A. Kennedy 5ded63a6-656c-44f4-a306-7cc45e85ea40 A GPU Tile-Load-Map architecture for terrain rendering: theory and applications This paper describes a robust, modular, complete GPU architecture, the Tile-Load-Map (TLM), designed for the real-time visualization of wide textured terrains created with arbitrary meshes. It extends and completes our previous succinct paper Amara et al. (ISVC 2007, Part 1, Lecture Notes in Computer Science, vol. 4841, pp. 586-597, Springer, Berlin, 2007) by giving further technical and implementation details. It provides new solutions to problems that had been left unresolved, in the context of a joint use of OpenGL and CUDA, optimized on the G80 graphics chip. We explain the crucial components of the shaders, and emphasize the progress we have proposed, while resolving some difficulties. We show that this texturing architecture is well suited to current challenges, and takes into account most of the distinctive aspects of terrain rendering.
Finally, we demonstrate how the design of the TLM facilitates the integration of geomatic input-data into procedural selection/rendering tasks on the GPU, and immediate applications to amplification. /content/cudazone/CUDABrowser/assets/images/applications/751_visualcomputer_small.png /content/cudazone/CUDABrowser/assets/images/applications/751_visualcomputer_large.png Academia Bab Ezzouar 2009 01 14 01/14/2009 Yacine Amara Xavier Marsault Paper Science Yacine Amara,Xavier Marsault af46c6f4-36e8-4672-af52-7cc2741bccb6 HISTOGRAM COMPUTATION WITH CUDA GPU's higher processing power compared to a standard CPU comes at the cost of reduced data caching and flow control logic as more transistors have to be devoted to data processing. This imposes certain limitations in terms of how an application may access memory and implement flow control. As a result, implementation of certain algorithms (even trivial ones) on the GPU may be difficult or may not be computationally justified. /content/cudazone/CUDABrowser/assets/images/applications/750_8800gtx-128_small.png /content/cudazone/CUDABrowser/assets/images/applications/750_8800gtx-128_large.png Academia Australian National University 2008 08 01 08/01/2008 R. Shams Application R. Shams 82a5a192-ee7e-42dc-83d9-b24a79656a21 Parallel Lattice Boltzmann Flow Simulation on Emerging Multi-core Platforms A parallel Lattice Boltzmann Method (pLBM), which is based on hierarchical spatial decomposition, is designed to perform large-scale flow simulations. The algorithm uses critical section-free, dual representation in order to expose maximal concurrency and data locality. Performances of emerging multi-core platforms, PlayStation3 (Cell Broadband Engine) and Compute Unified Device Architecture (CUDA), are tested using the pLBM, which is implemented with multi-thread and message-passing programming.
The results show that pLBM achieves good performance improvement, 11.02 for Cell over a traditional Xeon cluster and 8.76 for CUDA graphics processing unit (GPU) over a Sempron central processing unit (CPU). The results provide some insights into application design on future many-core platforms. /content/cudazone/CUDABrowser/assets/images/applications/749_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/749_cover-medium_large.jpg Academia University of Southern California 2008 08 21 08/21/2008 Liu Peng Ken-ichi Nomura Takehiro Oyakawa Paper Science Liu Peng,Ken-ichi Nomura,Takehiro Oyakawa 05930d99-8367-4f73-a605-c469a41e6fdb Efficient Nonlinear FEM for Soft Tissue Modelling and Its GPU Implementation within the Open Source Framework SOFA Accurate biomechanical modelling of soft tissue is a key aspect for achieving realistic surgical simulations. However, because medical simulation is a multi-disciplinary area, researchers do not always have sufficient resources to develop an efficient and physically rigorous model for organ deformation. We address this issue by implementing a CUDA-based nonlinear finite element model into the SOFA open source framework. The proposed model is an anisotropic visco-hyperelastic constitutive formulation implemented on a graphical processor unit (GPU). After presenting results on the model's performance, we illustrate the benefits of its integration within the SOFA framework on a simulation of cataract surgery. /content/cudazone/CUDABrowser/assets/images/applications/1371_comas08_small.png /content/cudazone/CUDABrowser/assets/images/applications/1371_comas08_large.png Academia The Australian e-Health Research Centre 2008 04 07 04/07/2008 53 Olivier Comas Zeike A. Taylo Jeremie Allard Paper Science Olivier Comas,Zeike A.
Taylo,Jeremie Allard bd858d80-d8c8-4000-9df8-2353690a6f98 Four styles of parallel and net programming This paper reviews the programming landscape for parallel and network computing systems, focusing on four styles of concurrent programming models, and example languages/libraries. The four styles correspond to four scales of the targeted systems. At the smallest coprocessor scale, Single Instruction Multiple Thread (SIMT) and Compute Unified Device Architecture (CUDA) are considered. Transactional memory is discussed at the multicore or process scale. The MapReduce style is examined at the datacenter scale. At the Internet scale, Grid Service Markup Language (GSML) is reviewed, which intends to integrate resources distributed across multiple datacenters. /content/cudazone/CUDABrowser/assets/images/applications/747_computerscience_small.png /content/cudazone/CUDABrowser/assets/images/applications/747_computerscience_large.png Academia Chinese Academy of Sciences 2009 05 20 05/20/2009 Zhiwei Xu Yongqiang He Paper Science Zhiwei Xu,Yongqiang He,zxu@ict.ac.cn,heyongqiang@software.ict.ac.cn 955ab92b-e28a-4199-bb03-45c14dade318 Accelerating Image Retrieval Using Factorial Correspondence Analysis on GPU We are interested in the intensive use of Factorial Correspondence Analysis (FCA) for large-scale content-based image retrieval. Factorial Correspondence Analysis, is a useful method for analyzing textual data, and we adapt it to images using the SIFT local descriptors. FCA is used to reduce dimensions and to limit the number of images to be considered during the search. Graphics Processing Units (GPU) are fast emerging as inexpensive parallel processors due to their high computation power and low price. The G80 family of Nvidia GPUs provides the CUDA programming model that treats the GPU as a SIMD processor array. 
We present two very fast algorithms on GPU for image retrieval using FCA: the first one is a parallel incremental algorithm for FCA and the second one is an extension of the filtering algorithm in our previous work for the filtering step. Our implementation is able to speed up the FCA computation by a factor of 30 compared to the CPU version. For retrieval tasks, the parallel version on GPU performs 10 times faster than the one on CPU. Retrieving images in a database of 1 million images is done in about 8 milliseconds. /content/cudazone/CUDABrowser/assets/images/applications/746_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/746_cover-medium_large.jpg Academia Campus de Beaulieu 2009 08 29 08/29/2009 NguyenKhang Pham Annie Morin Patrick Gros Paper Science NguyenKhang Pham,Annie Morin,Patrick Gros,Nguyen_Khang@irisa.fr,Annie.Morin@irisa.fr,Patrick.Gros@inria.fr fdc39d5f-f0d5-4664-809c-9f9c10a35c34 Experiences with Mapping Non-linear Memory Access Patterns into GPUs Modern Graphics Processing Units (GPU) are very powerful computational systems on a chip. For this reason there is a growing interest in using these units as general purpose hardware accelerators (GPGPU). To facilitate the programming of general purpose applications, NVIDIA introduced the CUDA programming environment. CUDA provides a simplified abstraction of the underlying complex GPU architecture, so a number of critical optimizations must be applied to the code in order to get maximum performance. In this paper we discuss our experience in porting an application kernel to the GPU, and all classes of design decisions we adopted in order to obtain maximum performance. /content/cudazone/CUDABrowser/assets/images/applications/745_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/745_cover-medium_large.jpg Academia University of Malaga 2009 05 20 05/20/2009 Eladio Gutierrez Sergio Romero Maria A.
Trenas Paper Science Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@uma.es,sromero@uma.es,maria@uma.es 48d86485-4f4a-4edf-aa9c-9b0900fbf425 Mean Shift Parallel Tracking on GPU We propose a parallel Mean Shift (MS) tracking algorithm on Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA). The traditional MS algorithm uses a large number of color histogram bins, typically 16x16x16, which makes parallel implementation infeasible. We thus employ K-Means clustering to partition the object color space, which enables us to represent the color distribution with quite a small number of bins. Based on this compact histogram, all key components of the MS algorithm are mapped onto the GPU. The resultant parallel algorithm consists of six kernel functions, which primarily involve the parallel computation of the candidate histogram and calculation of the Mean Shift vector. Experiments on publicly available CAVIAR videos show that the proposed parallel tracking algorithm achieves a large speedup and has tracking performance comparable to that of the traditional serial MS tracking algorithm. /content/cudazone/CUDABrowser/assets/images/applications/744_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/744_cover-medium_large.jpg Academia Heilongjiang University 2009 06 09 06/09/2009 Peihua Li Paper Science Peihua Li,peihualj@hotmail.com e606b34c-1f0d-4118-a68a-d0d7286fbab5 Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations.
This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse general-purpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML). /content/cudazone/CUDABrowser/assets/images/applications/743_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/743_cover-medium_large.jpg Academia Nancy Universite 2007 09 08 09/08/2007 Luc Buatois Guillaume Caumon Bruno Levy Paper Science Luc Buatois,Guillaume Caumon,Bruno Levy,buatois@gocad.org,caumon@gocad.org,levy@loria.fr 3c1de2f7-4132-4e99-87fa-8242c3b9d107 Solving Sparse Linear Systems on NVIDIA Tesla GPUs Current many-core GPUs have enormous processing power, and unlocking this power for general-purpose computing is very attractive due to their low cost and efficient power utilization. However, the fine-grained parallelism and the stream-programming model supported by these GPUs require a paradigm shift, especially for algorithm designers. In this paper we present the design of a GPU-based sparse linear solver using the Generalized Minimum RESidual (GMRES) algorithm in the CUDA programming environment. Our implementation achieved a speedup of over 20x on the Tesla T10P based GTX280 GPU card for benchmarks with a few thousand to a few million unknowns.
/content/cudazone/CUDABrowser/assets/images/applications/742_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/742_cover-medium_large.jpg Academia State University of New Jersey 2009 05 20 05/20/2009 20 Mingliang Wang Hector Klie Manish Parashar Paper Science Mingliang Wang,Hector Klie,Manish Parashar f97eefd3-0c69-48b8-b703-c569c5afab1e Optimizing Monte Carlo radiosity on graphics hardware The radiosity method is usually employed for the rendering of highly realistic synthetic images. In this paper we present an implementation of the Monte Carlo radiosity algorithm on the GPU using CUDA. Our proposal is based on the partition of the scene into sub-scenes to be processed in parallel to exploit the graphics card structure. The convex partition method employed permits the exploitation of data locality and the optimization of the ray shooting procedure due to the minimization of the number of objects to be tested in the intersection calculation. The results are good in terms of execution times, increasing the flexibility of previous solutions and demonstrating that the GPU can outperform the CPU results even for non-regular algorithms. /content/cudazone/CUDABrowser/assets/images/applications/741_neville_small.png /content/cudazone/CUDABrowser/assets/images/applications/741_neville_large.png Academia Univ. of A Coruna 2009 11 06 11/06/2009 J. R. Sanjurjo M. Amor M. Boo Paper Numerics J. R. Sanjurjo,M. Amor,M. Boo,josesan@udc.es,margamor@udc.es,montserrat.boo@usc.es 84a7d907-10ce-4d67-ae51-cc83bf5e33ab Optimizations and Performance of a Robotics Grasping Algorithm Described in Geometric Algebra The usage of Conformal Geometric Algebra leads to algorithms that can be formulated in a very clear and easy to grasp way. But it can also increase the performance of an implementation because of its capabilities to be computed in parallel. 
In this paper we show how a grasping algorithm for a robotic arm is accelerated using a Conformal Geometric Algebra formulation. The optimized C code is produced by the CGA framework Gaalop automatically. We compare this implementation with a CUDA implementation and an implementation that uses standard vector algebra. /content/cudazone/CUDABrowser/assets/images/applications/740_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/740_cover-medium_large.jpg Academia Technische Universitat Darmstad 2009 11 16 11/16/2009 Florian Worsdorfer Florian Stock Eduardo BayroCorrochano Paper Numerics Florian Worsdorfer,Florian Stock,Eduardo BayroCorrochano dbeef696-2ea9-4394-bd10-b2f4aea55e81 Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors In the last decade, there has been a dramatic growth in research and development of massively parallel commodity graphics hardware both in academia and industry. Graphics card architectures provide an optimal platform for parallel execution of many number crunching loop programs from fields like image processing, linear algebra, etc. However, it is hard to efficiently map such algorithms to the graphics hardware even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows the efficient mapping of this type of algorithms to the graphics hardware. Furthermore, the impact of execution configuration is illustrated and a method is proposed to determine the best configuration offline in order to use it at run-time. Using CUDA as programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of up to 33x can be achieved on NVIDIA's Tesla C870 compared to a parallelized implementation on a Xeon Quad Core. 
/content/cudazone/CUDABrowser/assets/images/applications/739_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/739_cover-medium_large.jpg Academia University of Erlangen-Nuremberg 2009 07 21 07/21/2009 33 Richard Membarth Frank Hannig Hritam Dutta Paper Imaging Richard Membarth,Frank Hannig,Hritam Dutta,richard.membarth@cs.fau.de,hannig@cs.fau.de,dutta@cs.fau.de 2ca5d0a9-8e2c-4c94-8c92-ccea0f8f3ede Non-rigid Registration for Large Sets of Microscopic Images on Graphics Processors Microscopic imaging is an important tool for characterizing tissue morphology and pathology. 3D reconstruction and visualization of large sample tissue structure requires registration of large sets of high-resolution images. However, the scale of this problem presents a challenge for automatic registration methods. In this paper we present a novel method for efficient automatic registration using graphics processing units (GPUs) and parallel programming. Comparing a C++ CPU implementation with Compute Unified Device Architecture (CUDA) libraries and pthreads running on GPU we achieve a speed-up factor of up to 4.11 with a single GPU and 6.68x with a GPU pair. We present execution times for a benchmark composed of two sets of large-scale images: mouse placenta (16K x16K pixels) and breast cancer tumors (23K x62K pixels). It takes more than 12 hours for the genetic case in C++ to register a typical sample composed of 500 consecutive slides, which was reduced to less than 2 hours using two GPUs, in addition to a very promising scalability for extending those gains easily on a large number of GPUs in a distributed system. 
/content/cudazone/CUDABrowser/assets/images/applications/738_hyperspectral_small.png /content/cudazone/CUDABrowser/assets/images/applications/738_hyperspectral_large.png Academia University of Malaga 2008 05 20 05/20/2008 7 Antonio Ruiz Manuel Ujaldon Lee Cooper Paper Computer Aided Engineering Antonio Ruiz,Manuel Ujaldon,Lee Cooper,aruiz@ac.uma.es,ujaldon@ac.uma.es,cooperl@ece.osu.edu ddb390fb-53bb-46e9-aad0-d5443baf25a4 Integrated Digital Image Correlation for the Identification of Mechanical Properties Digital Image Correlation (DIC) is a powerful technique to provide full-field displacement measurements for mechanical tests of materials and structures. The displacement fields may be further processed as an input for identification procedures giving access to parameters of constitutive laws. A new implementation of a Finite Element based Integrated Digital Image Correlation (I-DIC) method is presented, where the two stages (image correlation and mechanical identification) are coupled. This coupling allows one to minimize information losses, even in case of low signal-to-noise ratios. A case study for elastic properties of a composite material illustrates the approach, and highlights the accuracy of the results. Implementations on GPUs (using CUDA) lead to high-speed performance while preserving the versatility of the methodology.
/content/cudazone/CUDABrowser/assets/images/applications/737_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/737_cover-medium_large.jpg Academia SpringerLink 2009 05 05 05/05/2009 Hugo Leclerc Jean-Noel Perie Stephane Roux Paper Science Hugo Leclerc,Jean-Noel Perie,Stephane Roux,hugo.leclerc@lmt.ens-cachan.fr,jean-noel.perie@lmt.ens-cachan.fr,stephane.roux@lmt.ens-cachan.fr 8b5c1771-af09-43fd-a7ba-8160936587d3 Multifold Acceleration of Neural Network Computations Using GPU With the emergence of graphics processing units (GPU) of the latest generation, it became possible to undertake neural network based computations using GPU on serially produced video display adapters. In this study, NVIDIA CUDA technology has been used to implement the standard back-propagation algorithm for training multiple perceptrons simultaneously on GPU. For the problem considered, the GPU-based implementation (on NVIDIA GTX 260 GPU) has led to a 50x speed increase compared to a highly optimized CPU-based computer program, and more than 150x compared to commercially available CPU-based software (NeuroShell 2) (AMD Athlon 64 Dual core 6000+ processor). /content/cudazone/CUDABrowser/assets/images/applications/736_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/736_cover-medium_large.jpg Academia Lomonosov Moscow State University 2009 09 16 09/16/2009 50 Alexander Guzhva Paper Science Alexander Guzhva,nop43@rambler.ru dddf07ad-463e-4556-8709-33b2f8f5b204 Genetic programming on graphics processing units The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language.
In a first work we showed that this setup allows fine-grained parallelization schemes to be developed to evaluate several GP programs in parallel, while obtaining speedups for usual training sets and program sizes. Here we present another parallelization scheme and optimizations about program representation and use of GPU fast memory. This increases the computation speed about threefold, up to 4 billion GP operations per second. The code has been developed within the well known ECJ library and is open source. /content/cudazone/CUDABrowser/assets/images/applications/735_hybrid_small.png /content/cudazone/CUDABrowser/assets/images/applications/735_hybrid_large.png Academia SpringerLink 2009 10 13 10/13/2009 Denis Robilliard Virginie Marion-Poty Cyril Fonlupt Paper Science Denis Robilliard,Virginie Marion-Poty,Cyril Fonlupt,robillia@lil.univ-littoral.fr,poty@lil.univ-littoral.fr,onlupt@lil.univ-littoral.fr a20016ac-ca3b-4eb6-b760-3c62fa956a30 A Particle-Mesh Integrator for Galactic Dynamics Powered by GPGPUs We present a particle-mesh N-body integrator running on GPU using CUDA. Relying on a grid-based description of the gravitational potential, it can simulate the evolution of self-interacting 'stars' in order to model e.g. galaxies. All the steps of the application have been ported to the GPU, namely (1) a histogramming algorithm with CUDPP, (2) the resolution of the Poisson equation by means of FFT with CUFFT and multi-grid relaxation, (3) an optimized finite difference scheme to compute the accelerations of stars, and (4) an update procedure for positions and velocities. We present several tests at different resolutions, and reach a speedup of 2 to 50 depending on the resolution and on the test case.
/content/cudazone/CUDABrowser/assets/images/applications/734_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/734_cover-medium_large.jpg Academia Universite de Strasbourg 2009 05 20 05/20/2009 50 Dominique Aubert Mehdi Amini Romaric David Paper Science Dominique Aubert,Mehdi Amini,Romaric David bee72b5d-162e-4691-a0b7-f6646c239fbf Parallel Implementations of Recurrent Neural Network Learning Neural networks have proved to be effective in solving a wide range of problems. As problems become more and more demanding, they require larger neural networks, and the time used for learning is consequently greater. Parallel implementations of learning algorithms are therefore vital for a useful application. Implementation, however, strongly depends on the features of the learning algorithm and the underlying hardware architecture. For this experimental work a dynamic problem was chosen which implicates the use of recurrent neural networks and a learning algorithm based on the paradigm of learning automata. Two parallel implementations of the algorithm were applied - one on a computing cluster using MPI and OpenMP libraries and one on a graphics processing unit using the CUDA library. The performance of both parallel implementations justifies the development of parallel algorithms. /content/cudazone/CUDABrowser/assets/images/applications/733_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/733_cover-medium_large.jpg Academia University of Ljubljana 2009 09 30 09/30/2009 Uros Lotric Andrej Dobnikar Paper Science ,Uros Lotric,Andrej Dobnikar,uros.lotric@fri.uni-lj.si,andrej.dobnikar@fri.uni-lj.si 91c16088-9260-4f2d-aaea-bdfde57fb25d Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station We implement image correlation, a fundamental component of many real-time imaging and tracking systems, on a graphics processing unit (GPU) using NVIDIA's CUDA platform. 
We use our code to analyze images of liquid-gas phase separation in a model colloid-polymer system, photographed in the absence of gravity aboard the International Space Station (ISS). Our GPU code is 4,000 times faster than simple MATLAB code performing the same calculation on a central processing unit (CPU), 130 times faster than simple C code, and 30 times faster than optimized C++ code using single-instruction, multiple-data (SIMD) extensions. The speed increases from these parallel algorithms enable us to analyze images downlinked from the ISS in a rapid fashion and send feedback to astronauts on orbit while the experiments are still being run. /content/cudazone/CUDABrowser/assets/images/applications/732_iss_small.png /content/cudazone/CUDABrowser/assets/images/applications/732_iss_large.png Academia Harvard University 2009 10 30 10/30/2009 130 Peter J. Lu Hidekazu Oki Catherine A. Frey Paper Science Peter J. Lu,Hidekazu Oki,Catherine A. Frey b3c55a10-7397-4e94-a954-a949e0bc26cd A mathematical speedup prediction model for parallel vs. sequential programs Data independent command sequences are part of many algorithms. One way to speed up their execution is processing on a single instruction multiple data (SIMD) architecture. But an implementation must not necessarily be efficient. To predict program acceleration for NVIDIA's compute unified device architecture (CUDA), a parallel computing platform based on graphics boards, a mathematical model is developed. This model extends the common approach for so called speedup prediction by CUDA hardware and algorithm specific parameters. The identification of some model parameters is difficult since they depend on hardware internal parameters. The model is tested for a convolution filter and yields conservative processing time predictions. 
/content/cudazone/CUDABrowser/assets/images/applications/731_prediction_small.png /content/cudazone/CUDABrowser/assets/images/applications/731_prediction_large.png Academia University of Applied Sciences Gelsenkirchen 2009 02 04 02/04/2009 Heinrich Martin Overhoff Paper Computer Aided Engineering Heinrich Martin Overhoff,heinrich-martin.overhoff@fh-gelsenkirchen.de 8a45da2c-3f81-429c-8576-c6be8690765f Improving the Performance of Hyperspectral Image and Signal Processing Algorithms Using Parallel, Distributed and Specialized Hardware-Based Systems Advances in sensor technology are revolutionizing the way remotely sensed data is collected, managed and analyzed. The incorporation of latest generation sensors to airborne and satellite platforms is currently producing a nearly continual stream of high dimensional data, and this explosion in the amount of collected information has rapidly created new processing challenges. http://www.springerlink.com/content/hp81u02p11126226/?p=c5eead9af73340e58a313d95581cfd40&pi=47 /content/cudazone/CUDABrowser/assets/images/applications/729_hyperspectral_small.png /content/cudazone/CUDABrowser/assets/images/applications/729_hyperspectral_large.png Academia University of Extremadura 2010 01 01 01/01/2010 Antonio Plaza Javier Plaz Hugo Vegas Paper Science Antonio Plaza,Javier Plaz,Hugo Vegas,aplaza@unex.es,jplaza@unex.es,hugovegas@fdi.ucm.es 66a4c5f1-213c-4450-b8d1-ce3745396713 GCSim: A GPU-Based Trace-Driven Simulator for Multi-level Cache We describe the design of parallel trace-driven cache simulation for the purposes of evaluating different cache structures. As the research goes deeper, traditional simulation methods, which can only execute simulation operations in sequence, are no longer practical due to their long simulation cycles. An obvious way to achieve fast parallel simulation is to simulate the independent sets of a cache concurrently on different compute resources. 
We considered the use of a generic GPU to accelerate cache simulation, exploiting set-partitioning as the main source of parallelism. But we show that this technique is not efficient when simulating just one cache configuration, because of the high correlation of activity between different sets. Trace-sort and single-pass multi-configuration simulation techniques are developed, taking advantage of the full programmability offered by the Compute Unified Device Architecture (CUDA) on the GPU. Our experimental results demonstrate that the cache simulator based on a GPU-CPU platform achieves a 2.44x performance improvement compared to the traditional sequential algorithm. /content/cudazone/CUDABrowser/assets/images/applications/728_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/728_cover-medium_large.jpg Academia Beihang University 2009 08 21 08/21/2009 3 Han Wan Xiaopeng Gao Xiang Long Paper Science Han Wan,Xiaopeng Gao,Xiang Long,wanhan@les.buaa.edu.cn,gxp@les.buaa.edu.cn,long@les.buaa.edu.cn f8d98fea-3e48-4399-97c6-1cc70bf36e27 GPU Accelerated RNA Folding Algorithm Many bioinformatics studies require the analysis of RNA or DNA structures. More specifically, extensive work is done to elaborate efficient algorithms able to predict the 2-D folding structures of RNA or DNA sequences. However, the high computational complexity of the algorithms, combined with the rapid increase of genomic data, triggers the need for faster methods. Current approaches focus on parallelizing these algorithms on multiprocessor systems or on clusters, yielding good performance but at a relatively high cost. Here, we explore the use of computer graphics hardware to speed up these algorithms, which, theoretically, provides both high performance and low cost. We use the CUDA programming language to harness the power of NVIDIA graphics cards for general computation with a C-like environment. On recent graphics cards, performance reaches a 17x speed-up.
/content/cudazone/CUDABrowser/assets/images/applications/727_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/727_cover-medium_large.jpg Academia universitaire de Beaulieu 2009 05 20 05/20/2009 17 Guillaume Rizk Dominique Lavenier Paper Science Guillaume Rizk,Dominique Lavenier,guillaume.rizk@irisa.fr,dominique.lavenier@irisa.fr b122d8ea-f81f-4796-a3b4-2f519b9b05f2 Multimedia Mining on Manycore Architectures: The Case for GPUs Media mining, the extraction of meaningful knowledge from multimedia content, poses significant computational challenges in today's platforms, particularly in real-time scenarios. In this paper, we show how Graphic Processing Units (GPUs) can be leveraged for compute-intensive media mining applications. Furthermore, we propose a parallel implementation of color visual descriptors (color correlograms and color histograms) commonly used in multimedia content analysis on a CUDA (Compute Unified Device Architecture) enabled GPU (the Nvidia GeForce GTX280 GPU). Through the use of shared memory as software managed cache and efficient data partitioning, we reach computation throughputs of over 1.2 Giga Pixels/sec for HSV color histograms and over 100 Mega Pixels/sec for HSV color correlograms. We show that we can achieve better than real time performance and major speedups compared to high-end multicore CPUs and comparable performance on known implementations on the Cell B.E. We also study different trade-offs on the size and complexity of the features and their effect on performance. 
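The abstract above mentions shared memory used as a software-managed cache together with data partitioning for histogram computation, but gives no implementation details. The following is only a minimal serial sketch in plain Python of one common way to read that idea (per-block private histograms merged into a global one); it is not the authors' code. The function names, the `block_size` parameter, and the use of a Python list as a stand-in for a block's shared-memory copy are all assumptions made for illustration.

```python
def block_histogram(pixels, num_bins):
    """Count bin indices for one partition of the input.

    The local list stands in for a per-block copy that a GPU kernel
    would keep in fast shared memory (an assumption of this sketch).
    """
    local = [0] * num_bins
    for p in pixels:
        local[p] += 1
    return local


def privatized_histogram(pixels, num_bins, block_size):
    """Partition the input, histogram each partition privately, then merge."""
    total = [0] * num_bins
    for start in range(0, len(pixels), block_size):
        local = block_histogram(pixels[start:start + block_size], num_bins)
        # Merge step: on a GPU this would be one update per bin per block,
        # instead of one contended global update per pixel.
        for b in range(num_bins):
            total[b] += local[b]
    return total


# Hypothetical input: pixel values already quantized to bin indices 0..3.
counts = privatized_histogram([0, 1, 1, 3, 3, 3, 2, 0], num_bins=4, block_size=3)
```

The point of the design is that most increments touch only the small private copy, so traffic to the slow global array grows with the number of partitions rather than the number of pixels.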
/content/cudazone/CUDABrowser/assets/images/applications/726_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/726_cover-medium_large.jpg Academia Georgia Institute of Technology 2009 11 26 11/26/2009 Mamadou Diao Jongman Kim Paper Science Mamadou Diao,Jongman Kim,mamadou@ece.gatech.edu,jkim@ece.gatech.edu 73c6f376-cc30-4f50-a53d-0246839f1870 MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data to replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures. /content/cudazone/CUDABrowser/assets/images/applications/725_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/725_cover-medium_large.jpg Academia University of Illinois at Urbana-Champaign 2008 11 28 11/28/2008 John A. Stratton Sam S. Stone Wen-mei W. Hwu Paper Science John A. Stratton,Sam S. Stone,Wen-mei W. 
Hwu,stratton@crhc.uiuc.edu,ssstone2@crhc.uiuc.edu,hwu@crhc.uiuc.edu 1c474439-b236-4f2e-b625-a7e540f06ffa A Real-Time Video Illustration Using CUDA According to advancements in video technology, there are lots of needs for various special effects of videos. The conventional image-transform effects could be applied to video streams, but non-photorealistic rendering effects are not easy to apply. For example, cartoon or illustration effects have expensive costs in video transformation which makes it difficult to execute in real-time. In this paper, we suggest a video transformation system with illustration effects. It is designed to apply the illustration effects to the video stream directly and is implemented to achieve real time performances using the GPU hardware with NVIDIA's CUDA. /content/cudazone/CUDABrowser/assets/images/applications/724_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/724_cover-medium_large.jpg Academia Electronics and Telecommunications Research Institute 2009 08 28 08/28/2009 JiHyung Lee Yoon-Seok Choi Bon-Ki Koo Paper Science JiHyung Lee,Yoon-Seok Choi,Bon-Ki Koo,ijihyung@etri.re.kr,ys-choi@etri.re.kr,bkkoo@etri.re.kr f3da9a57-398b-4320-adf8-c66fc56e7440 A Fast and Flexible Sorting Algorithm with CUDA In this paper, we propose a fast and flexible sorting algorithm with CUDA. The proposed algorithm is much more practical than the previous GPU-based sorting algorithms, as it is able to handle the sorting of elements represented by integers, floats and structures. Meanwhile, our algorithm is optimized for the modern GPU architecture to obtain high performance. We use different strategies for sorting disorderly list and nearly sorted list to make it adaptive. Extensive experiments demonstrate our algorithm has higher performance than previous GPU-based sorting algorithms and can support realtime applications. 
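The sorting abstract above says that different strategies are used for disordered and nearly sorted lists, and that integers, floats and structures are supported, without saying how the choice is made. As a rough illustration only, the serial Python sketch below picks a strategy from a simple presortedness measure (the number of adjacent out-of-order pairs). The measure, the `threshold` value, the function names, and the choice of insertion sort versus Python's built-in `sorted` are all assumptions of this sketch, not the algorithm from the paper.

```python
def count_descents(items, key):
    """Number of adjacent pairs that are out of order."""
    return sum(1 for a, b in zip(items, items[1:]) if key(a) > key(b))


def insertion_sort(items, key):
    """Stable insertion sort; cheap when the input is nearly sorted."""
    out = list(items)
    for i in range(1, len(out)):
        cur = out[i]
        j = i - 1
        while j >= 0 and key(out[j]) > key(cur):
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = cur
    return out


def adaptive_sort(items, key=lambda x: x, threshold=0.1):
    """Choose a strategy from how disordered the input looks.

    threshold is a hypothetical tuning knob: the largest fraction of
    descending adjacent pairs still treated as "nearly sorted".
    """
    n = len(items)
    if n < 2:
        return list(items)
    if count_descents(items, key) <= threshold * (n - 1):
        return insertion_sort(items, key)
    return sorted(items, key=key)  # general-purpose path for disordered input


# The key function lets the same code order integers, floats, or records.
records = adaptive_sort([("b", 2.5), ("a", 1.0), ("c", 0.5)], key=lambda r: r[1])
```

Passing a key function is one simple way to cover the abstract's three element kinds with a single code path.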
/content/cudazone/CUDABrowser/assets/images/applications/723_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/723_cover-medium_large.jpg Academia Chinese Academy of Sciences 2009 07 31 07/31/2009 Shifu Chen Jing Qin Yongming Xie Paper Numerics Shifu Chen,Jing Qin,Yongming Xie,sf.chen@siat.ac.cn,jqin@cse.cuhk.edu.hk,ymxie@cse.cuhk.edu.hk 448d3111-a000-421c-b2da-c00e1509d590 Exploring Parallel Algorithms for Volumetric Mass-Spring-Damper Models in CUDA Since the advent of programmable graphics processors (GPUs) their computational powers have been utilized for general purpose computation. Initially by exploiting graphics APIs and recently through dedicated parallel computation frameworks such as the Compute Unified Device Architecture (CUDA) from Nvidia. This paper investigates multiple implementations of volumetric Mass-Spring-Damper systems in CUDA. The obtained performance is compared to previous implementations utilizing the GPU through the OpenGL graphics API. We find that both performance and optimization strategies differ widely between the OpenGL and CUDA implementations. Specifically, the previous recommendation of using implicitly connected particles is replaced by a recommendation that supports unstructured meshes and run-time topological changes with an insignificant performance reduction. /content/cudazone/CUDABrowser/assets/images/applications/722_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/722_cover-medium_large.jpg Academia University of Aarhus 2008 07 07 07/07/2008 Allan Rasmusson Jesper Mosegaard Thomas Sangild Paper Science Allan Rasmusson,Jesper Mosegaard,Thomas Sangild 8072ce4d-85ec-4bad-8d3f-986eae58cfd2 Implementation of Parallel Genetic Algorithm Based on CUDA Genetic Algorithm (GA) is a powerful tool for science computing, while Parallel Genetic Algorithm (PGA) further promotes the performance of computing. 
However, the traditional parallel computing environment is very difficult to set up, to say nothing of its cost. This has motivated moving compute-intensive work to graphics hardware, which is inexpensive and more powerful. The paper presents a hierarchical parallel genetic algorithm, implemented with NVIDIA's Compute Unified Device Architecture (CUDA). By combining the master-slave and multiple-demes parallelization methods, this algorithm makes better use of threads and high-speed shared memory in CUDA. /content/cudazone/CUDABrowser/assets/images/applications/721_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/721_cover-medium_large.jpg China University of Geosciences 2009 09 30 09/30/2009 Sifa Zhang Zhenming He Paper Science Sifa Zhang,Zhenming He c8b48fed-9a91-483e-bfcd-d80e41dde203 Memory Locality Exploitation Strategies for FFT on the CUDA Architecture Modern graphics processing units (GPUs) are becoming more and more suitable for general purpose computing due to their growing computational power. These commodity processors follow, in general, a parallel SIMD execution model whose efficiency depends on proper exploitation of the explicit memory hierarchy, among other factors. In this paper we analyze the implementation of the Fast Fourier Transform using the programming model of the Compute Unified Device Architecture (CUDA) recently released by NVIDIA for its new graphics platforms. Within this model we propose an FFT implementation that takes into account memory reference locality issues that are crucial in order to achieve high execution performance. This proposal has been experimentally tested and compared with other well-known approaches such as the manufacturer's FFT library.
/content/cudazone/CUDABrowser/assets/images/applications/720_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/720_cover-medium_large.jpg Academia University of Malaga 2008 12 06 12/06/2008 Eladio Gutierrez Sergio Romero Maria A. Trenas Paper Science Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@ac.uma.es,sromero@ac.uma.es,maria@ac.uma.es b2b7e513-4ef5-4e1a-8a28-1aa414ba9965 Parallel Quantum Computer Simulation on the CUDA Architecture Due to their increasing computational power, modern graphics processing architectures are becoming more and more popular for general purpose applications with high performance demands. This is the case of quantum computer simulation, a problem with high computational requirements both in memory and processing power. When dealing with such simulations, multiprocessor architectures are an almost obligatory tool. In this paper we explore the use of the new graphics processor architecture NVIDIA CUDA in the simulation of some basic quantum computing operations. This new architecture is oriented towards a more general exploitation of the graphics platform, allowing it to be used as a parallel SIMD multiprocessor. In this direction, some implementation strategies are proposed, showing that the effectiveness of the codes depends on proper exploitation of the underlying memory hierarchy. /content/cudazone/CUDABrowser/assets/images/applications/718_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/718_cover-medium_large.jpg Academia University of Malaga 2008 06 25 06/25/2008 Eladio Gutierrez Sergio Romero Maria A. Trenas Paper Science Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@ac.uma.es,sromero@ac.uma.es,maria@ac.uma.es 3a301449-69ff-493b-a459-7f4ff6b973a0 Implementation of a Lattice-Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology The Lattice-Boltzmann method (LBM) is a distribution-function based approach to numerical fluid mechanics.
Due to the simple formulation of the underlying algorithm this method is well suited for parallelization and hardware acceleration using general purpose graphical processing units (GPGPU). Within this work LBM has been implemented in a new code with multi-GPU support and physically validated for a flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single-socket multi-core CPU calculation. Moreover, the validation for the test case chosen shows excellent agreement with available reference data. /content/cudazone/CUDABrowser/assets/images/applications/717_implementation_small.png /content/cudazone/CUDABrowser/assets/images/applications/717_implementation_large.png Academia Technische Universität München 2009 05 06 05/06/2009 18 T. Indinger E. Riegel N. A. Adams Paper Science T. Indinger,E. Riegel,N. A. Adams,Thomas.Indinger@tum.de ade79ba5-8167-479e-837c-4edc6c615cd4 CUDA-Lite: Reducing GPU Programming Complexity The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.
/content/cudazone/CUDABrowser/assets/images/applications/716_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/716_cover-medium_large.jpg Academia University of Illinois at Urbana-Champaign 2008 11 28 11/28/2008 17 Sain-Zee Ueng Melvin Lathara Sara S. Baghsorkhi Paper Programming Tools Sain-Zee Ueng,Melvin Lathara,Sara S. Baghsorkhi,ueng@crhc.uiuc.edu,mlathara@crhc.uiuc.edu,bsadeghi@crhc.uiuc.edu 99559840-4e04-42b3-aef7-def473ec47e9 Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA This paper describes several parallel algorithmic variations of the Neville elimination. This elimination solves a system of linear equations making zeros in a matrix column by adding to each row an adequate multiple of the preceding one. The parallel algorithms are run and compared on different multi- and many-core platforms using parallel programming techniques such as MPI, OpenMP and CUDA. /content/cudazone/CUDABrowser/assets/images/applications/715_neville_small.png /content/cudazone/CUDABrowser/assets/images/applications/715_neville_large.png Academia Universidad de Oviedo / Universidad Politecnica de Valencia 2009 11 18 11/18/2009 P. Alonso R. Cortina F. J. Martinez Zaldivar Paper Science Neville,Multi core, Many core, OpenMP, MPI,GPU,CUDA,CUBLAS,P. Alonso,R. Cortina,F. J. Martinez Zaldivar,palonso@uniovi.es,raquel@uniovi.es,fjmartin@dcom.upv.es 662a5685-925d-4f0a-9e8e-8cf9681a85a4 Real-Time Ray Tracing with CUDA Graphics processors (GPUs) have recently emerged as a low-cost alternative for parallel programming. Since modern GPUs have great computational power as well as high memory bandwidth, running ray tracing on them has been an active field of research in computer graphics in recent years. Furthermore, the introduction of CUDA, a novel GPGPU architecture, has removed several limitations that the traditional GPU-based ray tracing suffered. In this paper, an implementation of high performance CUDA ray tracing is demonstrated.
We focus on the performance and show how our design choices in various optimizations lead to an implementation that outperforms the previous works. For reasonably complex scenes with simple shading, our implementation achieves the performance of 30 to 43 million traced rays per second. Our implementation also includes the effects of recursive specular reflection and refraction, which were less discussed in previous GPU-based ray tracing works. /content/cudazone/CUDABrowser/assets/images/applications/714_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/714_cover-medium_large.jpg Academia National Tsing Hua University / National Taiwan Normal University 2009 07 31 07/31/2009 Min Shih Yung-Feng Chiu Ying-Chieh Chen Paper Imaging Ray Tracing - Programmable Graphics Hardware - GPU Computing - CUDA - Multithreaded Architectures,Min Shih,Yung-Feng Chiu,Ying-Chieh Chen,min_shih@ibr.cs.nthu.edu.tw,yfchiu@ibr.cs.nthu.edu.tw,louis@ibr.cs.nthu.edu.tw 21718119-decf-4a85-ad89-c2acf965a3a1 Scalable and highly parallel implementation of Smith-Waterman on graphics processing unit using CUDA Program development environments have enabled graphics processing units (GPUs) to become an attractive high performance computing platform for the scientific community. A commonly posed problem in computational biology is protein database searching for functional similarities. The most accurate algorithm for sequence alignments is Smith-Waterman (SW). However, due to its computational complexity and rapidly increasing database sizes, the process becomes more and more time consuming, making cluster-based systems more desirable. Therefore, scalable and highly parallel methods are necessary to make SW a viable solution for life science researchers. In this paper we evaluate how SW fits onto the target GPU architecture by exploring ways to map the program architecture on the processor architecture.
We develop new techniques to reduce the memory footprint of the application while exploiting the memory hierarchy of the GPU. With this implementation, GSW, we overcome the on chip memory size constraint, achieving 23x speedup compared to a serial implementation. Results show that as the query length increases our speedup almost stays stable indicating the solid scalability of our approach. Additionally this is a first of a kind implementation which purely runs on the GPU instead of a CPU-GPU integrated environment, making our design suitable for porting onto a cluster of GPUs. /content/cudazone/CUDABrowser/assets/images/applications/713_scalable_small.png /content/cudazone/CUDABrowser/assets/images/applications/713_scalable_large.png Academia University of Arizona 2009 06 11 06/11/2009 23 Ali Akoglu Gregory M. Striemer Paper Science Ali Akoglu,Gregory M. Striemer,akoglu@email.arizona.edu,gmstrie@email.arizona.edu 94546634-a42d-4be6-9db8-6e5f34612d9f Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework General Purpose computing over Graphical Processing Units (GPGPUs) is a huge shift of paradigm in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a hybrid of Genetic Algorithm (GA) and Local Search (LS) to solve MAXimum SATisfiability (MAX-SAT) problem on a state-of-the-art nVidia Tesla GPU using nVidia Compute Unified Device Architecture (CUDA). MAX-SAT is a problem of practical importance and is often solved by employing metaheuristics based search methods like GAs and hybrid of GA with LS. Almost all the parallel GAs (pGAs) designed in the last two decades were designed for either clusters or MPPs. 
Unfortunately, very little research is done on the implementation of such algorithms over commodity graphics hardware. GAs in their simple form are not suitable for implementation over the Single Instruction Multiple Thread (SIMT) architecture of a GPU, and the same is the case with conventional LS algorithms. In this paper we explore different genetic operators that can be used for an efficient implementation of GAs over nVidia GPUs. We also design and introduce new techniques/operators for an efficient implementation of GAs and LS over such architectures. We use nVidia Tesla C1060 to perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25x. We also discuss the effects of different optimization techniques on the overall execution time. /content/cudazone/CUDABrowser/assets/images/applications/712_hybrid_small.png /content/cudazone/CUDABrowser/assets/images/applications/712_hybrid_large.png Academia Hokkaido University 2009 10 20 10/20/2009 25 Asim Munawar Mohamed Wahib Masaharu Munetomo Paper Science Compute unified device architecture (CUDA) - General-purpose computing on graphics processing unit (GPGPU) - Genetic algorithm (GA) - MAXimum SATisfiability problem (MAX-SAT) - Single instruction multiple data (SIMD) - Single instruction multiple threads (SIMT),Asim Munawar,Mohamed Wahib,Masaharu Munetomo,asim@uva.cims.hokudai.ac.jp,wahibium@uva.cims.hokudai.ac.jp,munetomo@iic.hokudai.ac.jp 3c39d32e-5e81-4e93-8d02-4c6e2105d2be CUDA Solutions for the SSSP Problem We present several algorithms that solve the single-source shortest-path problem using CUDA. We have run them on a database, composed of hundreds of large graphs represented by adjacency lists and adjacency matrices, achieving high speedups regarding a CPU implementation based on Fibonacci heaps. Concerning correctness, we outline why our solutions work, and show that a previous approach [10] is incorrect. 
/content/cudazone/CUDABrowser/assets/images/applications/711_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/711_cover-medium_large.jpg Academia Universidad Complutense de Madrid 2009 05 20 05/20/2009 Pedro J. Martin Roberto Torres Antonio Gavilanes Paper Science Shortest path algorithms - GPU - CUDA,Pedro J. Martin,Roberto Torres,Antonio Gavilanes,pjmartin@sip.ucm.es,r.torres@fdi.ucm.es,agav@sip.ucm.es b6723f4e-8b3e-416f-a229-ba8f7fbb2334 Adaptative Resonance Theory Fuzzy Networks Parallel Computation Using CUDA Programming of Graphics Processing Units (GPUs) has evolved in a way they can be used to address and speed-up computation of algorithms exemplified by data-parallel models. In this paper parallelization of a Fuzzy ART algorithm is described and a detailed explanation of its implementation under CUDA is given. Experimental results show the algorithm runs up to 52 times faster on the GPU than on the CPU for testing and 18 times faster for training under specific conditions. /content/cudazone/CUDABrowser/assets/images/applications/710_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/710_cover-medium_large.jpg Academia University of Valladolid 2009 06 05 06/05/2009 M. Martinez-Zarzuela F. J. Diaz Pernas A. Tejero de Pablos Paper Science M. Martinez-Zarzuela,F. J. Diaz Pernas,A. Tejero de Pablos 627e05b2-0728-4871-ad2f-108473423236 Accelerating Large Graph Algorithms on the GPU Using CUDA Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. 
We present a few fundamental algorithms including breadth first search, single source shortest path, and all-pairs shortest path using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors. /content/cudazone/CUDABrowser/assets/images/applications/709_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/709_cover-medium_large.jpg Academia International Institute of Information Technology Hyderabad 2008 01 22 01/22/2008 Pawan Harish P. J. Narayanan Paper Science Pawan Harish,P. J. Narayanan,harishpk@research.iiit.ac.in,pjn@iiit.ac.in cef5a0e1-9512-4dd7-bf59-e0537bceb8f1 Molecular Dynamics Simulations on Commodity GPUs with CUDA Molecular dynamics simulations are a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the requirement for large system simulations with many atoms and numerous time steps. In this paper we present a new approach to high performance molecular dynamics simulations on graphics processing units. Using modern graphics processing units for high performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm. This results in an implementation with significant runtime savings on an off-the-shelf computer graphics card.
/content/cudazone/CUDABrowser/assets/images/applications/708_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/708_cover-medium_large.jpg Academia Nanyang Technological University 2008 01 22 01/22/2008 Weiguo Liu Bertil Schmidt Gerrit Voss Paper Science Weiguo Liu,Bertil Schmidt,Gerrit Voss,liuweiguo@ntu.edu.sg,bertil.schmidt@unsw.edu.au,asgerrit@ntu.edu.sg f891bba8-5e53-45f2-a171-f2fd2b1b02e2 Accelerating Cone Beam Reconstruction Using the CUDA-Enabled GPU Compute unified device architecture (CUDA) is a software development platform that enables us to write and run general-purpose applications on the graphics processing unit (GPU). This paper presents a fast method for cone beam reconstruction using the CUDA-enabled GPU. The proposed method is accelerated by two techniques: (1) off-chip memory access reduction; and (2) memory latency hiding. We describe how these techniques can be incorporated into CUDA code. Experimental results show that the proposed method runs at 82% of the peak memory bandwidth, taking 5.6 seconds to reconstruct a 512^3-voxel volume from 360 512^2-pixel projections. This performance is 18% faster than the prior method. Some detailed analyses are also presented to understand how effectively the acceleration techniques increase the reconstruction performance of a naive method. /content/cudazone/CUDABrowser/assets/images/applications/707_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/707_cover-medium_large.jpg Academia Osaka University 2008 12 17 12/17/2008 Yusuke Okitsu Fumihiko Ino Paper Science Yusuke Okitsu,Fumihiko Ino,y-okitu@ist.osaka-u.ac.jp,ino@ist.osaka-u.ac.jp 3ec2d6eb-a3ca-4435-9e66-c0dbe3e61938 Parallelization of a Video Segmentation Algorithm on CUDA Enabled Graphics Processing Units Nowadays, Graphics Processing Units (GPU) are emerging as SIMD coprocessors for general purpose computations, especially after the launch of nVIDIA CUDA.
Since then, some libraries have been implemented for matrix computation and image processing. However, in real video applications some stages need irregular data distributions and the parallelism is not so inherent. This paper presents the parallelization of a video segmentation application on GPU hardware, which implements an algorithm for abrupt and gradual transitions detection. A critical part of the algorithm requires highly intensive computation for video frames features calculation. Results on three CUDA-enabled GPUs are encouraging, because of the significant speedup achieved. They are also compared with an OpenMP version of the algorithm, running on two platforms with multiple cores. /content/cudazone/CUDABrowser/assets/images/applications/706_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/706_cover-medium_large.jpg Academia University of Cordoba / University of Malaga 2009 08 22 08/22/2009 Juan Gomez-Luna Jose Maria Gonzalez-Linares Jose Ignacio Benavides Paper Science Juan Gomez-Luna,Jose Maria Gonzalez-Linares,Jose Ignacio Benavides,el1goluj@uco.es,gonzalez@ac.uma.es,el1bebej@uco.es 7628110e-6e5b-4952-9a5f-dbe69816046e A CUDA-Supported Approach to Remote Rendering In this paper we present the utilization of advanced programming techniques on current graphics hardware to improve the performance of remote rendering for interactive applications. We give an overview of existing systems in remote rendering and focus on some general bottlenecks of remote visualization. Afterwards we describe current developments in graphics hardware and software and outline how they can be used to increase the performance of remote graphics systems. Finally we present some results and benchmarks to confirm the validity of our work.
/content/cudazone/CUDABrowser/assets/images/applications/705_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/705_cover-medium_large.jpg Academia University of Paderborn 2007 11 22 11/22/2007 Stefan Lietsch Oliver Marquardt Paper Science Stefan Lietsch,Oliver Marquardt,slietsch@upb.de,marquard@upb.de 50132356-0063-40d3-922b-8bc54e0ecb18 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels, and delegate the responsibility of generating the Java-CUDA bridge codes and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher level parallel programming languages like X10 and Habanero-Java. 
/content/cudazone/CUDABrowser/assets/images/applications/704_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/704_cover-medium_large.jpg Academia Department of Computer Science 2009 08 22 08/22/2009 Yonghong Yan Max Grossman Vivek Sarkar Paper Science Yonghong Yan,Max Grossman,Vivek Sarkar,yanyh@rice.edu,jmg3@rice.edu,vsarkar@rice.edu 5a46aeed-d703-4e3f-a010-ccf692264df9 Training Recurrent Neural Network Using Multistream Extended Kalman Filter on Multicore Processor and Cuda Enabled Graphic Processor Unit Recurrent neural networks are popular tools used for modeling time series. Common gradient-based algorithms are frequently used for training recurrent neural networks. On the other hand, approaches based on Kalman filtering are considered to be the most appropriate general-purpose training algorithms with respect to modeling accuracy. Their main drawbacks are high computational requirements and difficult implementation. In this work we first provide a clear description of the training algorithm using a simple pseudo-language. The problem of high computational requirements is addressed by performing the calculation on a multicore processor and a CUDA-enabled graphic processor unit. We show that an important reduction in execution time can be achieved by performing the computation on a manycore graphic processor unit.
/content/cudazone/CUDABrowser/assets/images/applications/703_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/703_cover-medium_large.jpg Academia Faculty of Informatics and Information Technologies 2009 09 16 09/16/2009 Michal Cernansky Paper Science Michal Cernansky,cernansky@fiit.stuba.sk f9600d38-e1e3-40c9-bd98-fca07f782225 Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with other simpler document models, e.g. probabilistic latent semantic indexing, etc. Therefore, we accelerate CVB inference, an efficient deterministic inference method for LDA, with Nvidia CUDA. In the evaluation experiments, we used a set of 50,000 documents and a set of 10,000 images. We could obtain inference results comparable to sequential CVB inference. /content/cudazone/CUDABrowser/assets/images/applications/702_cover-medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/702_cover-medium_large.jpg Academia Nagasaki University 2009 06 26 06/26/2009 Tomonari Masada Tsuyoshi Hamada Yuichiro Shibata Paper Tomonari Masada,Tsuyoshi Hamada,Yuichiro Shibata,masada@cis.nagasaki-u.ac.jp,hamada@cis.nagasaki-u.ac.jp,shibata@cis.nagasaki-u.ac.jp 9396fec3-f4d1-419d-9345-537bb1e70f10 POSIX Threads and NVIDIA's CUDA The current progression of commodity processing architectures exhibits a trend toward increasing parallelism, requiring that undergraduate students in a wide range of technical disciplines gain an understanding of problem solving in massively parallel environments. 
However, as a small comprehensive college, we cannot currently afford to dedicate an entire semester-long course to the study of parallel computing. To combat this situation, we have integrated the key components of such a course into a 300-level course on modern operating systems. In this paper, we describe a parallel computing unit that is designed to dovetail with the discussion of process and thread management common to operating systems courses. We also describe a set of self-contained projects in which students explore two parallel programming models, POSIX Threads and NVIDIA's Compute Unified Device Architecture, that enable parallel architectures to be utilized effectively. In our experience, this unit can be integrated with traditional operating systems topics quite readily, making parallel computing accessible to undergraduate students without requiring a full course dedicated to these increasingly important topics. /content/cudazone/CUDABrowser/assets/images/applications/701_mte_small.png /content/cudazone/CUDABrowser/assets/images/applications/701_mte_large.png Academia ogf.org 2008 12 31 12/31/2008 ogf.org Paper ogf.org a23ac3b0-45a9-4784-a714-af9f875bd5cc Open Inventor by VSG Open Inventor by VSG provides application developers with a unique solution that enables interoperability between advanced 3D visualization and powerful GPU-based computing capabilities to perform parallel computation on the fly on a workstation. /content/cudazone/CUDABrowser/assets/images/applications/700_vsg_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/700_vsg_logo_large.png Commercial VSG http://www.vsg3d.com/vsg_prod_openinventor.php 2009 12 31 12/31/2009 Commercial VSG Application Multimedia Oil & Gas VSG 984915a6-7fd1-45fa-8b37-52fd8af92486 Mental Ray 3.8 iray introduces a new way of utilizing photorealistic rendering, by integrating both preview and final frame rendering in one single interactive process. 
In addition, the power of the CUDA GPU dramatically shortens the processing time, introducing significant cost optimizations along the rendering pipeline. And the handling simplifications of iray provide a tool that enables professionals to focus on their core business, while still being able to generate beautiful photorealistic images of their works, all without the help of rendering experts and without the need of becoming rendering experts. /content/cudazone/CUDABrowser/assets/images/applications/scene_update_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/scene_update_new_large.jpg Commercial mental images http://www.mentalimages.com/index.php 2009 12 31 12/31/2009 Commercial mental images Application Multimedia Imaging mental images 0152d7ff-43b6-4f43-b884-582438d94a54 AxRTM Reverse Time Migration (RTM) is the current 'state-of-the-art' in seismic imaging. The strength of RTM stems from the fact that it fully respects the two-way acoustic wave equation, thus improving imaging in areas where complex geology violates the assumptions made in Kirchhoff or one-way wave equation migrations. Until recently, RTM's widespread use was severely hindered by the enormous computing resources required to process the data. This computational bottleneck is now cleared with Acceleware's patent-pending software solution AxRTM. AxRTM provides the core numerical functionality of Reverse Time Migration as a library that can be integrated into an existing seismic processing framework. AxRTM has a modular architecture supporting a variety of integrator-supplied functionality, and currently supports both optimized multi-core CPU and NVIDIA GPU hardware. 
/content/cudazone/CUDABrowser/assets/images/applications/698_Seismic_velocity_model_sml_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/698_Seismic_velocity_model_sml_large.jpg Commercial Acceleware http://www.acceleware.com/default 2008 01 01 01/01/2008 Acceleware Application Paper Oil & Gas Acceleware d9067cbb-646d-4ec7-ae1e-5302f65bed87 Linear Algebra Solvers and High Performance Computing Solving a system of linear equations is a common numerical technique applied in many fields including fluid dynamics, thermal analysis, mechanical simulations, and economics. As simulations and models increase in complexity, organizations require high performance software to meet their growing computational needs. Several widely available optimized versions of BLAS and LAPACK libraries have been written to take advantage of CPU architectures. Recently, graphics processing units (GPUs) have shown potential to offer substantial performance gains when solving data-intensive calculations. /content/cudazone/CUDABrowser/assets/images/applications/697_Engine_Block_sml_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/697_Engine_Block_sml_large.jpg Commercial Acceleware http://www.acceleware.com/default/ 2009 05 01 05/01/2009 Acceleware Application Computer Aided Engineering Acceleware 16cf695b-a4de-4cd9-9aca-10b8ad8eff95 Unipro UGENE UGENE is a free cross-platform bioinformatics toolkit. It works on Windows, Linux, and Mac OS and has out-of-the-box support for modern GPUs including NVIDIA CUDA. UGENE focuses on integration of highly optimized versions of the most popular bioinformatics algorithms (Smith-Waterman, HMMER, MUSCLE, Phylip, etc.) within a single flexible visual interface.
http://ugene.unipro.ru /content/cudazone/CUDABrowser/assets/images/applications/680_ss_mac_h1n1_small.png /content/cudazone/CUDABrowser/assets/images/applications/680_ss_mac_h1n1_large.png Commercial Unipro http://unipro.ru/eng/ 2009 07 15 07/15/2009 10 Open source Unipro UGENE team Application Code Life Sciences Unipro UGENE team,ugene@unipro.ru 8b638261-433f-4fc9-ad7b-96c2bd1a6599 Movavi Video Suite Movavi Video Suite is a complete collection of EIGHT powerful yet easy-to-use tools to suit your video processing needs /content/cudazone/CUDABrowser/assets/images/applications/694_0000217475_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/694_0000217475_large.jpg Commercial Movavi http://www.movavi.com/ 2009 11 25 11/25/2009 Commercial Movavi Application Video & Audio Movavi fa0fb82b-edb4-406b-bdcb-5b7d5e3eea51 Movavi Video Converter Movavi Video Converter is a leading video converter you can use to convert video & audio, save for portables, rip & burn DVD /content/cudazone/CUDABrowser/assets/images/applications/695_vc9box_jr_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/695_vc9box_jr_large.jpg Commercial Movavi http://www.movavi.com 2009 11 24 11/24/2009 Commercial Movavi Application Video & Audio Movavi a6e01b50-0329-4a90-85b3-597c398b8a63 PowerProducer 5 PowerProducer connects your HDV camcorder to your creative side, with a complete range of Blu-ray Disc and DVD authoring features for producing discs of your videos. 
/content/cudazone/CUDABrowser/assets/images/applications/693_2eqevch_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/693_2eqevch_large.jpg Commercial Cyberlink http://www.cyberlink.com/ 2009 11 02 11/02/2009 5 Commercial Cyberlink Application Video & Audio Cyberlink b4bfbf0c-7c02-49f3-9114-1bb621f4b3c7 HD NVR The HD NVR series network video recorder sets new standards for IP camera recorders, featuring full 1080p HD video output with dual monitor capability and hardware video acceleration via NVIDIA CUDA. It also features low power consumption with green hard drives up to 2TB, perfect for megapixel HD cameras. The wireless HD NVR is suitable for up to 16 network cameras and can be used in the home, office or for professional applications. /content/cudazone/CUDABrowser/assets/images/applications/691_nvr-header-2009_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/691_nvr-header-2009_large.jpg Commercial BiKal IP CCTV http://www.bikal.co.uk/ 2009 10 31 10/31/2009 Commercial BiKal IP CCTV Application Video & Audio BiKal IP CCTV 54670403-5c1a-4f0a-b226-3c1ca3dd071d EyeSoft EyeSoft is compatible with IP cameras and USB video devices from many different manufacturers, including analogue video capture cards, alarm boxes and PTZ keyboards. EyeSoft has an open source architecture allowing the integration of many hardware and software platforms, and its compatibility increases with each release. /content/cudazone/CUDABrowser/assets/images/applications/690_eyesoft-header2-2009_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/690_eyesoft-header2-2009_large.jpg BiKal IP CCTV http://www.bikal.co.uk/ 2009 10 30 10/30/2009 BiKal IP CCTV Application Video & Audio EyeSoft 9ae6cb89-6d4c-4480-8021-f35668d44724 Loilo Touch Now enjoy video editing by simply touching the screen. Enjoy using your fingers directly on your video, picture, and music with your friends and family.
Extreme 10X output is made possible by NVIDIA CUDA technology, which lets the GPU take command for ultra-fast video encoding. /content/cudazone/CUDABrowser/assets/images/applications/689_touch_05_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/689_touch_05_large.jpg Commercial Loilo http://loilo.tv 2009 10 23 10/23/2009 10 Commercial Loilo Application Multimedia Video & Audio Loilo 4db08198-2f0d-49dd-b9b2-6087c1c8b368 Mirics FlexiTV Mirics FlexiTV™ is a multi-standard broadcast TV receiver for netbooks, notebooks and desktop PCs. Using NVIDIA's CUDA™ GPU acceleration technology for critical TV signal processing, global TV and radio can be received using FlexiTV. The result is a single hardware design for worldwide terrestrial TV and radio reception. /content/cudazone/CUDABrowser/assets/images/applications/688_mirics_small.gif /content/cudazone/CUDABrowser/assets/images/applications/688_mirics_large.gif Commercial Mirics http://www.mirics.com/ 2009 10 02 10/02/2009 Mirics Application Video & Audio Mirics b7689d00-714b-455f-af55-9d1714974140 WinDVD 2010 Kick it up a notch with HD! WinDVD Pro is a Blu-ray player that supports AVCHD and even upscales standard DVDs to near-HD quality for more intense movies and music. Includes everything in the Standard version, plus: NVIDIA GPU-accelerated upscaling for smoother playback of your DVD-Video on a high-definition display. Upscale DVD-Video to fit your HD display, regardless of the platform!
/content/cudazone/CUDABrowser/assets/images/applications/687_images_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/687_images_large.jpg Commercial Corel http://www.corel.com 2009 09 10 09/10/2009 Corel Application Video & Audio Corel 7572a818-d387-4d19-b437-e2244aca5398 MilkyWay@home The goal of Milkyway@Home is to use the BOINC platform to harness volunteered computing resources in creating a highly accurate three dimensional model of the Milky Way galaxy using data gathered by the Sloan Digital Sky Survey. This project enables research in both astroinformatics and computer science. /content/cudazone/CUDABrowser/assets/images/applications/686_feed-248_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/686_feed-248_large.jpg Research MilkyWay@home http://milkyway.cs.rpi.edu/milkyway/ 2009 08 31 08/31/2009 MilkyWay@home Application Science MilkyWay@home a4cd88a8-6691-4923-83b0-a03b9a3f6e2b Roxio Creator 2010 With Creator 2010, you can render and encode your video 5 times faster thanks to NVIDIA Cuda technologies. /content/cudazone/CUDABrowser/assets/images/applications/685_creator2010-box-lg_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/685_creator2010-box-lg_large.jpg Commercial Roxio http://www.roxio.com 2009 08 25 08/25/2009 5 Commercial Roxio Application Video & Audio Roxio bd09a954-61de-4510-bf7d-ea219b830f78 DivideFrame GPU Decoder Hardware accelerated decoding of AVCHD/Quicktime h.264 files for NLEs /content/cudazone/CUDABrowser/assets/images/applications/683_logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/683_logo_large.jpg Commercial DivideFrame http://www.divideframe.com 2009 07 31 07/31/2009 10 Commercial DivideFrame Application Multimedia Video & Audio DivideFrame 135ad008-3035-4f9c-943a-af31b3302a2f Nero Moveit Nero Move it lets you convert and transfer all your multimedia files to the most popular portable and mobile devices. 
Easily transfer your MP3, WMA, and other audio and video files to your choice of device: PC, mobile phone, digital camera and more. Move it converts quickly and hassle-free from any supported source and from online communities, and easily moves files to devices such as iPod, iPhone, PSP, Blackberry, LG and Xbox, or to online communities such as YouTube. Integrated NVIDIA CUDA technology in Nero Move it lets users with compatible NVIDIA graphics cards convert their favorite videos faster and more efficiently. /content/cudazone/CUDABrowser/assets/images/applications/682_featured-product-moveit-eng_small.png /content/cudazone/CUDABrowser/assets/images/applications/682_featured-product-moveit-eng_large.png Commercial Nero http://www.nero.com/ 2009 04 20 04/20/2009 Commercial Nero Application Multimedia Video & Audio Nero cb39210c-23de-464c-ac39-a27c3ea748d6 Elcomsoft Wireless Security Auditor Elcomsoft Wireless Security Auditor allows network administrators to verify how secure a company's wireless network is by executing an audit of accessible wireless networks. Featuring patent-pending cost-efficient GPU acceleration technologies, Elcomsoft Wireless Security Auditor attempts to recover the original WPA/WPA2-PSK text passwords in order to test how secure your wireless environment is. /content/cudazone/CUDABrowser/assets/images/applications/681_ewsa_small.gif /content/cudazone/CUDABrowser/assets/images/applications/681_ewsa_large.gif Commercial Elcomsoft http://www.elcomsoft.com/ewsa.html 2009 01 01 01/01/2009 Commercial Elcomsoft Application Wireless Security,Elcomsoft 0465f26d-a230-406e-8c08-9cb481f02fab Wave Tomography 2D time-domain waveform tomography reconstruction algorithm using GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/680_cuda_website_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/680_cuda_website_large.jpg Academia EPFL 2010 01 20 01/20/2010 Open source Olivier Roy Ivana Jovanovic Reza Parhizkar Paper Code Imaging Signal Processing acoustic wave equation, inverse problems, waveform tomography,Olivier Roy,olivier.roy@usense.org 069ab305-5868-4d1b-8713-abf1f7dfd1ef A performance study of general-purpose applications on graphics processors using CUDA Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
/content/cudazone/CUDABrowser/assets/images/applications/679_pyramid_small.png /content/cudazone/CUDABrowser/assets/images/applications/679_pyramid_large.png Academia University of Virginia, Department of Computer Science, Charlottesville, VA 2008 03 02 03/02/2008 6 Shuai Che Michael Boyer Jiayuan Meng Paper Shuai Che,Michael Boyer,Jiayuan Meng,sc5nf@cs.virginia.edu,jm6dg@cs.virginia.edu,jws9c@cs.virginia.edu 254535ef-fa6c-46f1-a198-5a49c6deecbb Fast N-Body Simulation with CUDA An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. A familiar example is an astrophysical simulation in which each body represents a galaxy or an individual star, and the bodies attract each other through the gravitational force, as in Figure 31-1. N-body simulation arises in many other computational science problems as well. For example, protein folding is studied using N-body simulation to calculate electrostatic and van der Waals forces. Turbulent fluid flow simulation and global illumination computation in computer graphics are other examples of problems that use N-body simulation. /content/cudazone/CUDABrowser/assets/images/applications/678_n-body_small.png /content/cudazone/CUDABrowser/assets/images/applications/678_n-body_large.png Academia NVIDIA Corporation / University of North Carolina at Chapel Hill 2007 12 31 12/31/2007 Lars Nyland Mark Harris Jan Prins Paper Lars Nyland,Mark Harris,Jan Prins 5a75ea20-e19b-4d31-8ee4-cb07bd0cca4d Optimization principles and application performance evaluation of a multithreaded GPU using CUDA GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. 
This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and 1.16X to 431X in total application time. /content/cudazone/CUDABrowser/assets/images/applications/677_cover_thumb_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/677_cover_thumb_large.jpg Academia University of Illinois at Urbana-Champaign / NVIDIA Corporation 2008 12 31 12/31/2008 457 Shane Ryoo Christopher I. Rodrigues Sara S. Baghsorkhi Sam S. Stone Wen-mei W. Hwu Paper Shane Ryoo,Christopher I. Rodrigues,Sara S. Baghsorkhi, Sam S. Stone, Wen-mei W. Hwu bb1d56b5-acb4-41ec-a177-b2c2ab424e0f CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years.
It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards to develop high performance solutions for sequence alignment. /content/cudazone/CUDABrowser/assets/images/applications/676_1471-2105-9-S2-S10-1_small.gif /content/cudazone/CUDABrowser/assets/images/applications/676_1471-2105-9-S2-S10-1_large.gif Academia CRIBI, University of Padova / Elaide, Srl, Padova 2008 03 26 03/26/2008 30 Svetlin A Manavski Giorgio Valle Paper Science Svetlin A Manavski,Giorgio Valle 66e597cc-b024-471c-a2ae-ab091eb6f738 Speeding up Mutual Information Computation Using NVIDIA CUDA Hardware We present an efficient method for mutual information (MI) computation between images (2D or 3D) for NVIDIA's CUDA-compatible devices. Efficient parallelization of MI is particularly challenging on a GPU due to the need for histogram-based calculation of joint and marginal probability mass functions (pmfs) with a large number of bins. The data-dependent (unpredictable) nature of the updates to the histogram, together with hardware limitations of the GPU (lack of synchronization primitives and limited memory caching mechanisms), can make GPU-based computation inefficient. To overcome these limitations, we approximate the pmfs using a down-sampled version of the joint histogram, which avoids memory update problems.
Our CUDA implementation improves the efficiency of MI calculations by a factor of 25 compared to a standard CPU-based implementation and can be used in MI-based image registration applications. /content/cudazone/CUDABrowser/assets/images/applications/675_comparison_small.png /content/cudazone/CUDABrowser/assets/images/applications/675_comparison_large.png Academia The Australian National University 2007 12 31 12/31/2007 25 Ramtin Shams Nick Barnes Paper Imaging Ramtin Shams,Nick Barnes,ramtin.shams@anu.edu.au,nick.barnes@nicta.com.au d3a7926e-9a32-4fbf-abe5-9c2d26d38adc Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices We present two efficient histogram algorithms designed for NVIDIA's compute unified device architecture (CUDA) compatible graphics processor units (GPUs). Our algorithms can be used for parallel computation of histograms on large data-sets and for thousands of bins. Traditionally histogram computation has been difficult and inefficient on the GPU. This often means that GPU-based implementations of algorithms that require histogram calculation as part of their computation have to transfer data between the GPU and the host memory, which can be a significant bottleneck. Our algorithms remove the need for such costly data transfers by allowing efficient histogram calculation on the GPU. We show that the speed of histogram calculations can be improved by up to 30 times compared to a CPU-based implementation. /content/cudazone/CUDABrowser/assets/images/applications/674_ParaviewHistogram_small.png /content/cudazone/CUDABrowser/assets/images/applications/674_ParaviewHistogram_large.png Academia The Australian National University 2007 12 01 12/01/2007 30 Ramtin Shams A. Kennedy Paper Code Ramtin Shams,A.
Kennedy 62eb8c85-57f7-4ab2-b7db-6b5e0aab23f9 gpuCuller gpuCuller is a software library implementing parallel computation of view frustum culling for multiple view frustums and multiple entities (for now, AABBs). Its main application is to compute visible elements for autonomous agents in VR simulation platforms. The library builds up a BVH from the universe entities, which is parsed during culling operations. /content/cudazone/CUDABrowser/assets/images/applications/673_325px-View_frustum_culling.svg_small.png /content/cudazone/CUDABrowser/assets/images/applications/673_325px-View_frustum_culling.svg_large.png Academia UTBM 2010 01 14 01/14/2010 Nicolas Said Application Multimedia Graphics Nicolas Said,nicolas.said@gmail.com 00bed0a7-7d59-4f06-b773-fbcb33358272 Flow visualization and flow cytometry with holographic video microscopy CUDA-accelerated analysis of holographic images yields the three-dimensional position of colloidal spheres with nanometer resolution, and simultaneously yields each sphere's radius and complex refractive index with part-per-thousand resolution. /content/cudazone/CUDABrowser/assets/images/applications/672_img19_small.png /content/cudazone/CUDABrowser/assets/images/applications/672_img19_large.png Academia New York University http://physics.nyu.edu/grierlab/ 2009 07 16 07/16/2009 20 David G. Grier Paper Imaging Science Video & Audio David G. Grier,david.grier@nyu.edu b9d646ab-1b8e-42dc-b7c0-5033a2c01fe8 GPU acceleration of object classification algorithms using NVIDIA CUDA The field of computer vision has become an important part of today's society, supporting crucial applications in the medical, manufacturing, military intelligence and surveillance domains. Many computer vision tasks can be divided into fundamental steps: image acquisition, pre-processing, feature extraction, detection or segmentation, and high-level processing.
This work focuses on classification and object detection, specifically k-Nearest Neighbors, Support Vector Machine classification, and Viola & Jones object detection. Object detection and classification algorithms are computationally intensive, which makes it difficult to perform classification tasks in real-time. This thesis aims to overcome the processing limitations of the above classification algorithms by offloading computation to the graphics processing unit (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). The primary focus of this work is the implementation of the Viola and Jones object detector in CUDA. A multi-GPU implementation provides a speedup ranging from 1x to 6.5x over optimized OpenCV code for image sizes of 300 x 300 pixels up to 2900 x 1600 pixels while having comparable detection results. The second part of this thesis is the implementation of a multi-GPU multi-class SVM classifier. The classifier had the same accuracy as an identical implementation using LIBSVM with a speedup ranging from 89x to 263x on the tested datasets. The final part of this thesis was the extension of a previous CUDA k-Nearest Neighbor implementation by exploiting additional levels of parallelism. These extensions provided a speedup of 1.24x and 2.35x over the previous CUDA implementation. As an end result of this work, a library of these three CUDA classifiers has been compiled for use by future researchers. /content/cudazone/CUDABrowser/assets/images/applications/671_grouping_small.png /content/cudazone/CUDABrowser/assets/images/applications/671_grouping_large.png Academia Rochester Institute of Technology 2009 09 01 09/01/2009 263 Jesse Patrick Harvey Paper Jesse Patrick Harvey 59686a98-5e65-42d9-9044-02706ad3d148 Motion estimation for H.264/AVC on multiple GPUs using NVIDIA CUDA To achieve the high coding efficiency the H.264/AVC standard offers, the encoding process quickly becomes computationally demanding.
One of the most intensive encoding phases is motion estimation. Even modern CPUs struggle to process high-definition video sequences in real-time. While personal computers are typically equipped with powerful Graphics Processing Units (GPUs) to accelerate graphics operations, these GPUs lie dormant when encoding a video sequence. Furthermore, recent developments show more and more computer configurations come with multiple GPUs. However, no existing GPU-enabled motion estimation architectures target multiple GPUs. In addition, these architectures provide no early-out behavior nor can they enforce a specific processing order. We developed a motion search architecture, capable of executing motion estimation and partitioning for an H.264/AVC sequence entirely on the GPU using the NVIDIA CUDA (Compute Unified Device Architecture) platform. This paper describes our architecture and presents a novel job scheduling system we designed, making it possible to control the GPU in a flexible way. This job scheduling system can enforce real-time demands of the video encoder by prioritizing calculations and providing an early-out mode. Furthermore, the job scheduling system allows the use of multiple GPUs in one computer system and efficient load balancing of the motion search over these GPUs. This paper focuses on the execution speed of the novel job scheduling system on both single and multi-GPU systems. Initial results show that real-time full motion search of 720p high-definition content is possible with a 32 by 32 search window running on a system with four GPUs. /content/cudazone/CUDABrowser/assets/images/applications/670_h264_small.png /content/cudazone/CUDABrowser/assets/images/applications/670_h264_large.png The International Society for Optical Engineering 2009 09 02 09/02/2009 Bart Pieters Charles F. Hollemeersch Peter Lambert Paper Video & Audio Bart Pieters,Charles F. 
Hollemeersch,Peter Lambert 3f523b1a-9c95-4ffa-a003-90314020aede Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with simpler document models, e.g. probabilistic latent semantic indexing. Therefore, we accelerate CVB inference, an efficient deterministic inference method for LDA, with Nvidia CUDA. In the evaluation experiments, we used a set of 50,000 documents and a set of 10,000 images. We could obtain inference results comparable to sequential CVB inference. /content/cudazone/CUDABrowser/assets/images/applications/669_lncs_small.png /content/cudazone/CUDABrowser/assets/images/applications/669_lncs_large.png Academia Nagasaki University, Bunkyo-machi, Nagasaki, Japan 2009 06 26 06/26/2009 Tomonari Masada Tsuyoshi Hamada Yuichiro Shibata Paper Tomonari Masada,Tsuyoshi Hamada,Yuichiro Shibata 54d853fe-6790-40a4-84a1-d7bfadeaa979 Real-time 2D parallel windowed Fourier transform for fringe pattern analysis using GPUs In optical interferometers, fringe projection systems, and synthetic aperture radars, fringe patterns are common outcomes and are usually degraded by unavoidable noise. The presence of noise makes phase extraction and phase unwrapping challenging. Windowed Fourier transform (WFT) based algorithms have been proven to be effective for fringe pattern analysis in various applications. However, the WFT-based algorithms are computationally expensive, which prevents their use in real-time applications. In this paper, we propose a fast parallel WFT-based library using graphics processing units and compute unified device architecture (CUDA).
Real-time WFT-based algorithms are achieved, processing 256x256 fringe patterns at 4 frames per second. A speedup of up to 132x is obtained for WFT-based algorithms on an NVIDIA GTX295 graphics card compared with sequential C on a quad-core 2.5GHz Intel(R) Xeon(R) CPU E5420. /content/cudazone/CUDABrowser/assets/images/applications/668_rt2d_small.png /content/cudazone/CUDABrowser/assets/images/applications/668_rt2d_large.png Academia Nanyang Technological University, Singapore 2009 12 02 12/02/2009 132 Wenjing Gao Nguyen Thi Ho Sy Loi Paper Wenjing Gao,Nguyen Thi,Ho Sy Loi,mkmqian@ntu.edu.sg 1ac32591-b05b-4a98-9570-fbdd6347751c Solve MAX-SAT problem using nVidia CUDA framework General Purpose computing over Graphical Processing Units (GPGPUs) is a major paradigm shift in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a hybrid of Genetic Algorithm (GA) and Local Search (LS) to solve the MAXimum SATisfiability (MAX-SAT) problem on a state-of-the-art nVidia Tesla GPU using nVidia Compute Unified Device Architecture (CUDA). MAX-SAT is a problem of practical importance and is often solved by employing metaheuristics-based search methods like GAs and hybrids of GA with LS. Almost all the parallel GAs (pGAs) designed in the last two decades were designed for either clusters or MPPs. Unfortunately, very little research has been done on the implementation of such algorithms over commodity graphics hardware. GAs in their simple form are not suitable for implementation over the Single Instruction Multiple Thread (SIMT) architecture of a GPU, and the same is the case with conventional LS algorithms. In this paper we explore different genetic operators that can be used for an efficient implementation of GAs over nVidia GPUs.
We also design and introduce new techniques/operators for an efficient implementation of GAs and LS over such architectures. We use an nVidia Tesla C1060 to perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25x. We also discuss the effects of different optimization techniques on the overall execution time. /content/cudazone/CUDABrowser/assets/images/applications/667_cover-medium_small.png /content/cudazone/CUDABrowser/assets/images/applications/667_cover-medium_large.png Academia Hokkaido University, Sapporo, Japan 2009 10 20 10/20/2009 Asim Munawar Mohamed Wahib Masaharu Munetomo Paper Asim Munawar,Mohamed Wahib,Masaharu Munetomo,asim@uva.cims.hokudai.ac.jp,wahibium@uva.cims.hokudai.ac.jp,munetomo@iic.hokudai.ac.jp ff558f05-2894-4e4e-ae78-2c0a67925404 Scalable computation for spatially scalable video coding using NVIDIA CUDA and multi-core CPU Scalable video coding (SVC), an extension of H.264/MPEG4-AVC (H.264), was standardized in 2007 by the Joint Video Team (JVT). SVC provides spatial, temporal and SNR scalabilities. To achieve these scalabilities, SVC uses additional coding tools and coding modes based on H.264. The coding tools used by SVC and the variety of coding mode decisions make the coding complexity extremely high, so real-time SVC is nearly impossible using software on a single-core CPU alone. One possible solution to generate SVC streams in real-time is to parallelize the whole encoding process. Currently, multi-core CPU and GPU are two popular kinds of parallel processing architectures. Not much research has been devoted to realizing parallel SVC encoders based on the cooperation of these two architectures. In this paper, a scalable computation model for spatial SVC using multi-core CPU and GPGPU through NVIDIA CUDA is proposed.
On the basis of the proposed computational model, a solution to the challenging data transition problem of this CPU-GPU cooperative architecture is then provided. Simulation results show that, through our work, a significant speedup in spatial SVC encoding can be achieved. /content/cudazone/CUDABrowser/assets/images/applications/666_cover_thumb_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/666_cover_thumb_large.jpg Academia ACM http://www.acm.org/publications 2009 01 01 01/01/2009 Yen-Lin Huang Yun-Chung Shen Ja-Ling Wu Paper Yen-Lin Huang,Yun-Chung Shen,Ja-Ling Wu b972ec14-5548-49b5-8366-31d9df19eaf8 Sugarscape Cuda Using emergent programming techniques on the GPU, we have implemented Sugarscape to utilize the massively parallel architecture of modern GPUs. Agents within the model move optimally within their vision, which is uniformly set between 1 and 10. Multiple agents cannot occupy the same cell. The agents also interact with the sugar patches, with each agent given a metabolism drawn uniformly from [0.1,1). The sugar patches grow at a constant rate of 0.1 per time step until they reach their maximum values, which are determined by two Gaussian functions. /content/cudazone/CUDABrowser/assets/images/applications/663_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/663_logo_large.png code.google.com http://code.google.com 2009 08 21 08/21/2009 Devm Code Devm ae892094-5c92-4cd2-b782-fc12cb86f174 Cuda Nash Finding Nash equilibria for large games is a computationally difficult task. The goal of this project is to implement a simple algorithm that is well suited to being run in parallel on simple hardware. The algorithm boils down to solving a large system of differential equations until they converge within a given tolerance. We believe that the computational architecture of graphics cards is especially well suited to this type of problem.
/content/cudazone/CUDABrowser/assets/images/applications/662_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/662_defaultlogo_large.png CUDA Developer http://code.google.com 2009 06 03 06/03/2009 Aultman Stephen Code Numerics Aultman Stephen a58dcb4d-07ea-415c-8141-760465ce3812 Electromag with CUDA Fun electromagnetism simulation application with CUDA GPGPU acceleration /content/cudazone/CUDABrowser/assets/images/applications/661_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/661_defaultlogo_large.png CUDA Developer http://code.google.com 2009 05 08 05/08/2009 Code e128489b-f5c9-4fb5-9e20-e6f08d8d3cd7 Hydrazine A library of common operations needed for C++ and CUDA development /content/cudazone/CUDABrowser/assets/images/applications/660_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/660_defaultlogo_large.png CUDA Developer http://code.google.com 2009 05 13 05/13/2009 Gregory Code Libraries Gregory 6e7508ba-012a-4545-9c73-97e646edae15 CUDA Grayscale This project presents a common technique for converting colored images to their grayscale representation using CUDA enabled GPUs to speed up processing. This multi-platform implementation uses OpenCV for managing image files, while the conversion algorithm takes into consideration different weighting of the color channels for a more effective representation of the colored image. /content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_large.png CUDA Developer http://code.google.com 2009 11 16 11/16/2009 Karl Phillip Application Code Imaging Karl Phillip d84046cb-3498-46af-a187-293a84bbad65 CUDA Ndarray This project provides a type with an interface as similar as possible to numpy's ndarray whose storage is allocated on a GPU device. 
/content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_small.png /content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_large.png CUDA Developer http://code.google.com 2009 12 18 12/18/2009 James Bergstra Frederic Bastien Pascal Lamblin Code Libraries James Bergstra,Frederic Bastien,Pascal Lamblin ad2ef9c8-ad67-4c51-929b-c36141afa1c7 multisvm The scaling of serial algorithms cannot rely on the improvement of CPUs anymore. The performance of classical Support Vector Machine (SVM) implementations has reached its limit and the arrival of the multi-core era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high performance platforms to implement data parallel algorithms. This project describes how a naive implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy. /content/cudazone/CUDABrowser/assets/images/applications/657_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/657_logo_large.png CUDA Developer http://code.google.com 2009 11 14 11/14/2009 Sergherr Code Sergherr 9b79ff8f-cc43-43bf-8143-37759f811994 gpuocelot Ocelot is a dynamic compilation framework for heterogeneous systems, accomplishing this by providing various backend targets for CUDA programs. Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs and x86-CPUs at full speed without recompilation.
/content/cudazone/CUDABrowser/assets/images/applications/656_logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/656_logo_large.jpg CUDA Developer http://code.google.com 2009 12 15 12/15/2009 Gregory Arkerr Code Gregory,Arkerr 500a4c45-722d-4593-9b33-c78d5247013e PHENOTYPING RODENT MODELS OF OBESITY USING MAGNETIC RESONANCE IMAGING The emergence of dedicated, small animal imaging systems provides an excellent opportunity to study obesity using the rat and mouse models which will be critical to increasing our basic knowledge as well as deriving new treatments. MRI is well suited for quantifying fat depots (e.g., visceral, subcutaneous, hepatic, muscular) and for helping to determine the role of genetic, environmental, and therapeutic factors on lipid accumulation, metabolism, and disease. Assessment of lipid depots is important because of the linkage of visceral and ectopic depots to insulin resistance, vascular disease, etc. The importance of making reproducible imaging measurements can never be underestimated when conducting a study of many animals, and we demonstrated that ratio imaging enables reliable quantification even on a human clinical 1.5T MRI scanner. Scan-rescan variability and intra-operator variability were each reduced to a 2% coefficient of variation or less when the semi-automatic ratio image analysis was used. Receiver coil signal intensity inhomogeneity of over 200% across the field of view was flattened to less than 3% variation by ratio imaging. Using the SHR/SHROB rat model of dietary and genetic obesity, we found a novel image phenotype which showed that visceral adipose tissue depots are increased in both genetic and dietary obesity, but subcutaneous adipose tissue is uniquely linked to dietary obesity, at least in this model. 
/content/cudazone/CUDABrowser/assets/images/applications/655_rat_small.png /content/cudazone/CUDABrowser/assets/images/applications/655_rat_large.png Academia Department of Biomedical Engineering Case Western Reserve University 2010 01 01 01/01/2010 21 David Herbert Johnson Paper DAVID HERBERT JOHNSON 9ae40a45-facb-4779-9512-9bbf25f875c4 Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single precision and 15.7 Gflop/s in double precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8x and 1.5x for single and double precision, respectively. /content/cudazone/CUDABrowser/assets/images/applications/654_threadblock_small.png /content/cudazone/CUDABrowser/assets/images/applications/654_threadblock_large.png Academia Georgia Institute of Technology / Indian Institute of Technology Roorkee 2010 1 1 1/1/2010 Jee W. Choi Amik Singh Richard W. Vuduc Paper Jee W. Choi,Amik Singh,Richard W.
Vuduc,jee@ece.gatech.edu,amiksuec@iitr.ernet.in,richie@cc.gatech.edu d660b5b0-780e-42d3-9fc6-b4bb78acefde Real-time display on Fourier domain optical coherence tomography system Fourier domain optical coherence tomography (FD-OCT) requires resampling of spectrally resolved depth information from wavelength to wave number, and the subsequent application of the inverse Fourier transform. The display rates of OCT images are much slower than the image acquisition rates due to processing speed limitations on most computers. We demonstrate a real-time display of processed OCT images using a linear-in-wave-number (linear-k) spectrometer and a graphics processing unit (GPU). We use the linear-k spectrometer with the combination of a diffractive grating with 1200 lines/mm and an F2 equilateral prism in the 840-nm spectral region to avoid calculating the resampling process. The calculations of the fast Fourier transform (FFT) are accelerated by the GPU with many stream processors, which realizes highly parallel processing. A display rate of 27.9 frames/sec for processed images (2048 FFT size x 1000 lateral A-scans) is achieved in our OCT system using a line scan CCD camera operated at 27.9 kHz. /content/cudazone/CUDABrowser/assets/images/applications/653_060506_1-V1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/653_060506_1-V1_large.jpg Academia Graduate School of Science and Engineering, Yamagata University 2009 12 28 12/28/2009 Yuuki Watanabe Multimedia Paper Imaging Life Sciences Yuuki Watanabe,ywata@yz.yamagata-u.ac.jp 83c76f9b-4deb-4017-9e78-42e911ff01ed muvee Reveal version 8 muvee Reveal lets you create and share personalized, professional looking home movies in a few quick steps. With automatic motion and face detection, your photos and video are synced to the beat of your favorite music.
/content/cudazone/CUDABrowser/assets/images/applications/681_160x90_small.png /content/cudazone/CUDABrowser/assets/images/applications/681_160x90_large.png Commercial muvee Technologies Pte. Ltd. http://www.muvee.com 2009 11 17 11/17/2009 8 Commercial muvee Technologies Application Multimedia Digital Content Creation Imaging Video & Audio Mafrudy bin Rubani,mafrudy@muvee.com 376fa043-5c61-42ff-a8e9-9c5dbcee6c9e Multiple Back-Propagation source code Multiple Back-Propagation is an open source software application for training neural networks with the backpropagation and the multiple back propagation algorithms. Currently this project is hosted at http://code.google.com/p/multiplebackpropagation and http://sourceforge.net/projects/mbp/ /content/cudazone/CUDABrowser/assets/images/applications/651_mbpTop_small.png /content/cudazone/CUDABrowser/assets/images/applications/651_mbpTop_large.png Academia IPG http://dit.ipg.pt/MBP/ 2009 12 11 12/11/2009 179 Open source Noel Lopes Application Science Noel Lopes,noel@ipg.pt e3fe2d97-e009-470c-9d85-6ee65c25cd43 ClusterTech Financial Library in GPU CLUSTERTECH Finance Library includes a BGM Interest Path Generator and a Trinomial Tree-based Options Pricing Model. In the BGM model, each forward rate is modeled by a lognormal process. The volatility vector function is also defined in our implementation. Then numerous interest-rate paths are generated by Monte Carlo simulation. The library also includes a trinomial recombining tree based options-pricing model, which allows for greater flexibility in the movement of rates or prices compared to the binomial counterpart.
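The trinomial recombining tree mentioned in the ClusterTech Financial Library entry above can be illustrated with a small CPU-only sketch in Python. This is not ClusterTech's implementation: the function names, the Boyle-style up/middle/down parameterization, and the sample parameters are all assumptions chosen for illustration. At each step the price can move up, stay put, or move down (the extra middle branch is what gives the tree more flexibility than a binomial one), and the option value is rolled back from maturity to today.

```python
import math


def trinomial_european_call(s0, strike, rate, sigma, maturity, steps):
    """Price a European call on a recombining trinomial tree (hypothetical helper)."""
    dt = maturity / steps
    up = math.exp(sigma * math.sqrt(2.0 * dt))  # up factor; down factor is 1 / up
    a = math.exp(rate * dt / 2.0)
    b = math.exp(sigma * math.sqrt(dt / 2.0))
    p_up = ((a - 1.0 / b) / (b - 1.0 / b)) ** 2
    p_down = ((b - a) / (b - 1.0 / b)) ** 2
    p_mid = 1.0 - p_up - p_down
    discount = math.exp(-rate * dt)

    # Payoffs at maturity for net moves j = -steps .. +steps.
    values = [max(s0 * up ** j - strike, 0.0) for j in range(-steps, steps + 1)]

    # Backward induction: level n has 2n + 1 nodes, level n - 1 has 2n - 1.
    for n in range(steps, 0, -1):
        values = [
            discount * (p_down * values[i] + p_mid * values[i + 1] + p_up * values[i + 2])
            for i in range(2 * n - 1)
        ]
    return values[0]


def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form reference price, used only to sanity-check the tree."""
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


tree_price = trinomial_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200)
reference_price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

Each node of a tree level depends only on three nodes of the next level, so all nodes of one level can be evaluated independently; that is the kind of parallelism a GPU version could exploit, although how the library actually does so is not stated in the entry.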
/content/cudazone/CUDABrowser/assets/images/applications/650_ct-fl-ad_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/650_ct-fl-ad_large.jpg Commercial Cluster Technology Limited http://www.clustertech.com 2009 11 17 11/17/2009 30 Commercial Cluster Technology Limited Application Finance Numerics Libraries Cluster Technology Limited,hkbd@clustertech.com 224bcdd0-6e4a-422e-bd1a-4c367e094441 ClusterTech Parallel Random Number Generator The ClusterTech Parallel Random Number Generator is based on Mersenne Twister which has a period of 2^19937-1. It generates multiple independent streams simultaneously across a cluster of CPUs and GPUs with a jump-ahead feature to guarantee the quality of the output. /content/cudazone/CUDABrowser/assets/images/applications/649_ct-prng-ad_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/649_ct-prng-ad_large.jpg Commercial Cluster Technology Limited http://www.clustertech.com 2009 11 17 11/17/2009 30 Cluster Technology Limited Application Numerics Libraries Cluster Technology Limited,hkbd@clustertech.com df4a0b13-0479-40f6-8405-ae810286a06b GPU computing with Kaczmarz's and other iterative algorithms for linear systems The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems.
The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method. /content/cudazone/CUDABrowser/assets/images/applications/648_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/648_graph_large.png Academia University of Illinois Urbana-Champaign http://www.uiuc.edu 2009 12 22 12/22/2009 10 J. Elble N. Sahinidis P. Vouzis Paper Graphics Joseph Elble,elble@uiuc.edu 3b6e6752-173e-45e4-a22f-5e7eccae9b7f Acceleration of a Finite-Difference WENO Scheme for Large-Scale Simulations on Many-Core Architectures This is a highly accelerated implementation of the finite-difference weighted essentially non-oscillatory (WENO) scheme. This method is suitable for direct numerical simulations (DNS) and large eddy simulations (LES) of compressible turbulence and requires large computing resources in order to achieve high Reynolds numbers. Our implementation utilizes a multi-GPU environment. /content/cudazone/CUDABrowser/assets/images/applications/647_rayleigh-taylor_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/647_rayleigh-taylor_large.jpg Academia PDS Group - University of Patras http://pdsgroup.hpclab.ceid.upatras.gr 2009 12 14 12/14/2009 50 Konstantinos Karantasis Paper Computational Fluid Dynamics Konstantinos Karantasis,karantas@ceid.upatras.gr 39f2db06-bc42-4ecf-9f6a-235d25b002e6 GPU Accelerated Pathfinding In the past few years the graphics programmable processor (GPU) has evolved into an increasingly convincing computational resource for non graphics applications.
The GPU is especially well suited to address problem sets expressed as data parallel computation with the same program executed on many data elements concurrently. In pursuing a scalable navigation planning approach for many thousands of agents in crowded game scenes, developers became more attracted to decomposable movement algorithms that lend themselves to explicit parallelism. Pathfinding is one key computational intelligence action in games that is typified by intense search over sparse graph data structures. This paper describes an efficient GPU implementation of parallel global pathfinding using the CUDA programming environment, and demonstrates the GPU's performance scaling advantage in executing an inherently irregular and divergent algorithm. /content/cudazone/CUDABrowser/assets/images/applications/646_GPUAcceleratedPathfinding_small.png /content/cudazone/CUDABrowser/assets/images/applications/646_GPUAcceleratedPathfinding_large.png Research NVIDIA Corporation http://www.nvidia.com 2008 06 20 06/20/2008 Avi Bleiweiss Presentation Paper Artificial Intelligence Avi Bleiweiss,ableiweiss@nvidia.com 9b19fa0b-58e2-4937-a1bb-e6532bb7522a Scalable Multi Agent Simulation on the GPU We present a unique and elegant graphics hardware realization of multi agent simulation. Specifically, we adapted Velocity Obstacles, which are well suited to parallel computation on single-instruction, multiple-thread (SIMT) architectures. We explore hash based nearest neighbors search to considerably optimize the algorithm when mapped on to the GPU. Moreover, to alleviate inefficiencies of agent level concurrency, primarily exposed in small agent count (<32) scenarios, we exploit nested data parallelism in unrolling the inner velocity iteration, demonstrating an appreciable performance gain. Simulation of ten thousand agents created with our system runs on current hardware at a real time rate of eighteen frames per second. Our software implementation builds on NVIDIA's CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/645_aicuda_small.png /content/cudazone/CUDABrowser/assets/images/applications/645_aicuda_large.png Research NVIDIA Corporation http://www.nvidia.com 2009 11 02 11/02/2009 50 Avi Bleiweiss Presentation Paper Artificial Intelligence Avi Bleiweiss,ableiweiss@nvidia.com 25c4c0ea-21c5-449e-8a40-f8cdfa40d539 NVIDIA Nexus - Visual Studio-based GPU Development Our new GPU developer tools, code-named Nexus, bring GPU Computing into Visual Studio 2008. Debug, profile, and analyze GPU code using standard workflow and tools. Nexus supports CUDA C, OpenCL, DirectCompute, Direct3D, and OpenGL. /content/cudazone/CUDABrowser/assets/images/applications/644_64_small.png /content/cudazone/CUDABrowser/assets/images/applications/644_64_large.png Commercial NVIDIA http://www.nvidia.com 2009 12 16 12/16/2009 NVIDIA Application Multimedia nexus,NVIDIA,cuda@nvidia.com 4fd199bf-5cce-4533-af21-5db250922ae6 Recursive APSP on the GPU We consider the computation of shortest paths on Graphic Processing Units (GPUs). The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data access patterns. Using the all-pairs shortest-paths problem as an example, we uncover potential gains over this class of algorithms. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partitioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations.
The alternate schedule of path computations allowed us to cast almost all operations into matrix-matrix multiplications on a semiring. Since matrix-matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to GPU. /content/cudazone/CUDABrowser/assets/images/applications/643_apsp-timings-small_small.png /content/cudazone/CUDABrowser/assets/images/applications/643_apsp-timings-small_large.png Academia UC Santa Barbara 2008 11 30 11/30/2008 480 Open source Aydin Buluc Paper Code Numerics Science Aydin Buluc,aydin@cs.ucsb.edu b474b1ae-94b1-4ac4-ba86-a16506460ba4 Multiphase flow in porous media The movie shows fractional flow of oil and water in a generic porous medium (glass beads, water wet). The glass beads are visualized by a transparent material, the water is invisible and the oil phase is shown by a color encoded surface. The color represents the pressure distribution, where red is high and blue low pressure. The porous medium is resolved by 250^3 grid points. Ingrain's digital rock physics lab computes the physical properties and fluid flow characteristics of oil and gas reservoir rocks. Our technology leads the industry in measuring shales, carbonates, tight gas sands and oil sands. Ingrain uses advanced lattice Boltzmann methods to simulate multiphase flow in the rocks (porous media). The simulation engine uses a sparse data structure to represent the grid. The simulations are accelerated by using GPUs and the CUDA technology by two orders of magnitude compared to a state of the art multicore desktop computer.
On a single Tesla GPU with 4GB memory we are able to simulate grids up to 800^3/600^3 for 5 % porosity and up to 500^3/400^3 for 40 % porosity for single/multi phase flow. For larger grids multiple GPUs are used in parallel. /content/cudazone/CUDABrowser/assets/images/applications/642_movBlackLogo.0400_small.png /content/cudazone/CUDABrowser/assets/images/applications/642_movBlackLogo.0400_large.png Commercial Ingrain http://www.ingrainrocks.com 2009 12 05 12/05/2009 100 Jonas Toelke Multimedia Computational Fluid Dynamics Jonas Toelke,toelke@ingrainrocks.com c0d931f3-fe2d-42cf-aa7c-981392258c99 FastFractal256 Mandelbrot fractal render. Uses software integers at 256 bit precision run on GPU. /content/cudazone/CUDABrowser/assets/images/applications/641_baby-mandelbrot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/641_baby-mandelbrot_large.jpg Commercial Imaginary Software, LLC http://www.fastfractal.com/ 2009 11 16 11/16/2009 10 Commercial Imaginary Software, LLC Application Multimedia Graphics Numerics Science Imaginary Software, LLC,contact@fastfractal.com 75775576-7f1b-4338-950d-57508d71eb11 Digital Breast Tomosynthesis Reconstruction reconstruction of Digital Breast Tomosynthesis volumes. The CUDA version gave a minimum 25x speedup over multi-threaded implementation on an Intel Core i7 quad-core CPU. The application is also scalable to multiple GPUs for further acceleration. This work was done courtesy of Massachusetts General Hospital with additional support from the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems (Gordon-CenSSIS). Individual and Institutional Contributors include: Professor David Kaeli, Daniel B. Kopans M.D., Micha Moffie PhD., Richard H Moore, Diego Rivera, Dana Schaa, Juemin Zhang PhD., Brandeis University, and Dexela, Ltd. 
/content/cudazone/CUDABrowser/assets/images/applications/640_tomo_small.png /content/cudazone/CUDABrowser/assets/images/applications/640_tomo_large.png Research Massachusetts General Hospital 2009 11 03 11/03/2009 85 Benjamin C Brown Multimedia Paper Imaging Life Sciences GTX285 vs. Intel Core i7 940 quad-core 2.93 GHz.,Benjamin C Brown,bcbrown@partners.org 5a4a9940-b346-454f-926b-fc21e5e9995b Needleman-Wunsch Sequence Alignment The Needleman-Wunsch Sequence Alignment using CUDA /content/cudazone/CUDABrowser/assets/images/applications/639_nw_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/639_nw_large.jpg Academia University of Virginia 2009 12 03 12/03/2009 8 Shuai Che Kevin Skadron Application Multimedia Code Life Sciences Shuai Che,Kevin Skadron,sc5nf@virginia.edu 984a48ef-ecb4-4649-85e3-fb42ceff5269 CUDAEASY We present a graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe in NVIDIA's Compute Unified Device Architecture (CUDA). In chaotic inflation models we report speedups between one and two orders of magnitude depending on the used hardware and software while achieving small errors in single precision. Simulations that used to last roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations. 
/content/cudazone/CUDABrowser/assets/images/applications/638_logo_big_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/638_logo_big_large.jpg Academia University of Turku / Department of Physics and Astronomy http://www.physics.utu.fi/en/ 2009 12 01 12/01/2009 100 Open source Jani Sainio Application Multimedia Science Jani Sainio,jani.sainio@utu.fi e563fe70-8e81-434a-81a1-4d1ca78c77a4 TeraChem General purpose software for quantum chemistry calculations designed specifically for Nvidia GPU /content/cudazone/CUDABrowser/assets/images/applications/637_CoverArtDNANew_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/637_CoverArtDNANew_large.jpg Commercial PetaChem, LLC http://www.petachem.com 2009 11 24 11/24/2009 650 Commercial Ivan Ufimtsev Application Multimedia Life Sciences Science Ivan Ufimtsev,i.ufimtsev@gmail.com 51876181-0577-4305-8961-455fe9f22ce9 Monte Carlo eXtreme (MCX) Monte Carlo eXtreme, or MCX, is a Monte Carlo simulation software for photon migration in 3D turbid media. It uses Graphics Processing Units (GPU) based massively parallel computing techniques and is extremely fast compared to traditional CPU-based simulations. Using an nVidia 8800GT graphics card (14MP/114Cores), the acceleration is about 300x~400x with over 1700 parallel threads; this ratio can be as high as 700x on a high-end GTX 295 GPU (multiply by another 2x if both GPUs on GTX295 are used). /content/cudazone/CUDABrowser/assets/images/applications/636_mcx_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/636_mcx_logo_large.png Academia Massachusetts General Hospital, Harvard Medical School http://nmr.mgh.harvard.edu/ 2009 10 22 10/22/2009 300 Open source Qianqian Fang Application Paper Code Imaging 3D Photon Migration,Qianqian Fang,fangqq@gmail.com ff3b0870-5be3-48ff-b14a-1e3b54c3320f AIRWC Accelerated Image Registration with CUDA. Fast medical image registration using affine and B-Spline transformations.
/content/cudazone/CUDABrowser/assets/images/applications/634_image002_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/634_image002_large.jpg Academia University of Cambridge, Dept of Physics http://www.phy.cam.ac.uk/ 2009 11 15 11/15/2009 100 Richard Ansorge Application Imaging Richard Ansorge,rea1@cam.ac.uk a5d1af40-470d-4088-b087-30a5e7a408d3 Task and Data Parallel Framework for GPU Computing MIT Lincoln Laboratory is developing PVTOL, a high-performance, portable signal and image processing library. The goals of PVTOL are to: provide a portable framework for high-performance embedded computing; support data and task parallelism; reduce the complexity and increase the speed of developing applications. /content/cudazone/CUDABrowser/assets/images/applications/628_pvtol_small.png /content/cudazone/CUDABrowser/assets/images/applications/628_pvtol_large.png Research MIT Lincoln Laboratory http://ww.tll.mit.edu 2009 11 12 11/12/2009 Commercial James Brock Multimedia Paper Signal Processing James Brock,brock.j@neu.edu c47ca00a-cfa4-4bfc-9e05-9aa325fcf26c TMPGEnc KARMA..Plus TMPGEnc KARMA..Plus makes it easy to take control of your ever-growing digital video library. Sort, search, classify, play, and even compare your digital video with easy-to-use tools and controls. And it supports NVIDIA CUDA technology for filter processing, decoding and H.264/AVC file output. /content/cudazone/CUDABrowser/assets/images/applications/625_tmkp_main_quickview_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/625_tmkp_main_quickview_large.jpg Commercial Pegasys Inc. http://www.pegasys-inc.com 2009 11 10 11/10/2009 9 Commercial Zakk saito Application Multimedia Video & Audio CUDA H.264 Decode Player Manage TMPG TMPGEnc Pegasys,Zakk saito,saito@pegasys-inc.com f1d52d6a-a875-4d32-90ba-c1c23aa4f6a0 Mersenne Twister for Graphic Processors (MTGP) MTGP is a new variant of Mersenne Twister (MT) introduced by Mutsuo Saito and Makoto Matsumoto in 2009.
MTGP is designed with some features of Graphic Processors, such as parallel execution and high-speed constant reference. It supports 32-bit and 64-bit integers, as well as single and double precision floating point as output. The periods of the generated sequences are 2^11213-1, 2^23209-1 and 2^44497-1 for the 32-bit version, and 2^23209-1, 2^44497-1 and 2^110503-1 for the 64-bit version. It supports 128 parameter sets for each period; in other words, it can generate 128 independent pseudorandom number sequences for each period. We are now developing Dynamic Creator for MTGP, which generates more parameter sets. /content/cudazone/CUDABrowser/assets/images/applications/624_mtgp_small.png /content/cudazone/CUDABrowser/assets/images/applications/624_mtgp_large.png Academia Department of Mathematics, Hiroshima University , Japan 2009 11 17 11/17/2009 Open source Mutsuo Saito Makoto Matsumoto Paper Code Finance Numerics Libraries Science Mutsuo Saito,Makoto Matsumoto,saito@math.sci.hiroshima-u.ac.jp 5a730964-d49a-4305-b5a8-3c5d75ecf73b Eudyptula Eudyptula is a portable graphics engine that provides advanced support for NVIDIA's CUDA tools; its core purpose is to be used in the development of scientific applications /content/cudazone/CUDABrowser/assets/images/applications/622_eudyptula_small.png /content/cudazone/CUDABrowser/assets/images/applications/622_eudyptula_large.png OpenSource 2008 06 25 06/25/2008 Georgios Paraskevas Application Numerics Science Georgios Paraskevas 9ca281be-34d8-4b10-9f7c-cd1853ad715c High performance sequence alignment A fast Smith-Waterman algorithm, implemented on CUDA /content/cudazone/CUDABrowser/assets/images/applications/620_protein_small.png /content/cudazone/CUDABrowser/assets/images/applications/620_protein_large.png Research OpenSource 2008 09 19 09/19/2008 Vahid Noormofidi Code Life Sciences Vahid Noormofidi 082d85de-353e-4a4d-9613-2513309d4b09 aeth.drive A fast, parallel, versatile QED modelling framework. Uses Geometric Calculus and CUDA.
Algorithm supports complex phenomena including turbulence, quantum effects, and relativistic gravitational precession. /content/cudazone/CUDABrowser/assets/images/applications/619_aeth_small.jpeg /content/cudazone/CUDABrowser/assets/images/applications/619_aeth_large.jpeg Research OpenSource 2008 11 15 11/15/2008 Kevin Daley Code Numerics Science Kevin Daley 91df274b-6c8d-470a-956d-8e6ff1d8c053 jacuzzi This project aims at providing Java bindings to the CUDA numeric environment. CUDA is an extension to the C/C++ programming language by NVIDIA. /content/cudazone/CUDABrowser/assets/images/applications/617_jacuzzi_small.png /content/cudazone/CUDABrowser/assets/images/applications/617_jacuzzi_large.png Research OpenSource 2009 03 05 03/05/2009 Alexander Heusel Code Numerics Alexander Heusel 551bb282-5e25-4ff5-92fc-a0fc675d32bc cuda cagen CUDA-based rule 30 cellular automaton generator for nVidia GPUs /content/cudazone/CUDABrowser/assets/images/applications/616_CellularAutomata_small.png /content/cudazone/CUDABrowser/assets/images/applications/616_CellularAutomata_large.png Research OpenSource 2008 09 17 09/17/2008 Yuri Parfenov Code Numerics Yuri Parfenov 60d005b8-e3c7-47a5-8fec-ab8aef9f2031 Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU Particle-in-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation.
In particular we focus on fast algorithms for the performance bottleneck operation of particle-to-grid interpolation. /content/cudazone/CUDABrowser/assets/images/applications/615_ptg_small.png /content/cudazone/CUDABrowser/assets/images/applications/615_ptg_large.png Academia University of Maryland http://www.umd.edu/ 2008 10 01 10/01/2008 20 George Stantchev Paper Science George Stantchev,gogo@umd.edu 276b1bef-214e-4528-85e7-c08792f09988 cudacluster The CUDA Cluster allows you to organize a cluster of CUDA-enabled Peer-To-Peer nodes, allowing for execution of tasks with extreme performance, by harnessing the combined power of multiple such GPU hosts. Sample jobs are provided. C#.Net/Mono with C. /content/cudazone/CUDABrowser/assets/images/applications/614_cudacluster_small.png /content/cudazone/CUDABrowser/assets/images/applications/614_cudacluster_large.png Research OpenSource 2008 08 06 08/06/2008 Nikolaos Tountas Application Numerics Nikolaos Tountas 2be843df-918d-4f4f-94ec-6c1b99e58760 MP3 Encoder MP3 encoder that runs on CUDA compatible hardware. /content/cudazone/CUDABrowser/assets/images/applications/613_cudamp3_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/613_cudamp3_large.jpg OpenSource 2008 03 19 03/19/2008 Research biggestpos Application Video & Audio Numerics biggestpos 9a4aea49-e96f-487a-b6e3-ab50c134a049 cesql Database Server based on NVIDIA CUDA Technology. CUDA makes it possible to use the GPU and its performance for parallel data computing. A classic sql server uses only about 15 GFlops instead of more than 500 GFlops which could be used by cesql. /content/cudazone/CUDABrowser/assets/images/applications/612_cesql_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/612_cesql_large.jpg Research OpenSource 2008 06 08 06/08/2008 Arash Mahini Application Numerics Arash Mahini,Arash_Mahini@users.sourceforge.net 436a1f19-e066-438d-9769-afd6b612b52e cehttp Web Server based on NVIDIA CUDA Technology.
CUDA makes it possible to use the GPU and its performance for parallel data computing. A classic web server uses only about 15 GFlops instead of more than 500 GFlops which could be used by cehttp. /content/cudazone/CUDABrowser/assets/images/applications/611_cehttp_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/611_cehttp_large.jpg Research OpenSource 2008 06 08 06/08/2008 Arash Mahini Application Arash Mahini,Arash_Mahini@users.sourceforge.net faba717d-f830-457b-94a4-a8ca1d709890 The CUDA Files Implementations of various algorithms using CUDA. /content/cudazone/CUDABrowser/assets/images/applications/610_thecudafiles_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/610_thecudafiles_large.jpg Research OpenSource 2008 01 08 01/08/2008 sashang Code Numerics sashang,sashang@users.sourceforge.net 8ffabfbd-cad9-4fa1-81ee-f61d4bc4cc76 FreeSWITCH-CUDA The goal of this project is to produce and maintain a branch of the FreeSWITCH telephony platform that utilizes CUDA (NVIDIA's GPGPU toolkit) to offload CPU-intensive transcoding tasks to the (NVIDIA) GPU. /content/cudazone/CUDABrowser/assets/images/applications/609_freeswitch_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/609_freeswitch_large.jpg Academia OpenSource 2009 04 01 04/01/2009 Zac Wolfe Code Numerics Zac Wolfe,Zac_Wolfe@users.sourceforge.net 9b5c77ca-f014-4173-83cd-3bc3da09039b tokaspt The Once Known as SmallPT is a cheap editable realtime derivation of smallpt (http://kevinbeason.com/smallpt/). By way of the marketing department, some outrageously insignificant numbers: on a Quadro FX 5800, on the default scene at default resolution and configuration, 768x512x(2x2)x118fps = 185.6M 4-bounce rays are traced per second (alternatively, a maximum of 742.4M bounces are generated). Requires CUDA 2.1 to compile and run.
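The throughput figures quoted in the tokaspt entry above can be checked with a few lines of arithmetic. The sketch below is only a sanity check of the published numbers; the variable names are made up, and reading the (2x2) factor as four sub-samples per pixel is an assumption, since the entry does not say what it stands for.

```python
# Figures taken from the tokaspt entry; the interpretation of each factor is an assumption.
width, height = 768, 512          # default resolution
samples_per_pixel = 2 * 2         # the "(2x2)" factor, assumed to be sub-samples per pixel
frames_per_second = 118
bounces_per_ray = 4               # "4-bounce rays"

rays_per_second = width * height * samples_per_pixel * frames_per_second
bounce_ceiling = rays_per_second * bounces_per_ray

print(f"{rays_per_second / 1e6:.1f}M rays per second")
print(f"{bounce_ceiling / 1e6:.1f}M bounces per second at most")
```

The products come to 185,597,952 rays and 742,391,808 bounces per second, which round to the 185.6M and 742.4M stated in the entry.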
/content/cudazone/CUDABrowser/assets/images/applications/608_img_ui_bloated_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/608_img_ui_bloated_large.jpg Research http://ompf.org http://ompf.org 2009 01 25 01/25/2009 Thierry Berger-Perrin Application Code Graphics Thierry Berger-Perrin,tbptbp@gmail.com e143112b-a0c0-4f45-8360-6afe7687f68e A framework for efficient and scalable execution of domain-specific templates on GPUs Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts - processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of off-loaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, that uses lower level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. 
We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7-7.8x performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6GB and 17GB, respectively, on GPU platforms with only 768MB and 1.5GB of memory. /content/cudazone/CUDABrowser/assets/images/applications/607_ipdp_small.png /content/cudazone/CUDABrowser/assets/images/applications/607_ipdp_large.png Commercial NEC Labs, Berkeley, Purdue 2009 05 01 05/01/2009 8 Narayanan Sundaram Anand Raghunathan Srimat T. Chakradhar Paper Imaging Medical Imaging machine learning edge detection, convolution neural network, out-of-core,Narayanan Sundaram,Anand Raghunathan,Srimat T. Chakradhar,narayans@eecs.berkeley.edu d45f95f7-772b-41f6-a00d-4cb40e53e785 HyperNEAT4CUDA This is a simple C# implementation of HyperNEAT on NVIDIA's Compute Unified Device Architecture (CUDA). /content/cudazone/CUDABrowser/assets/images/applications/605_hyperneat_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/605_hyperneat_large.jpg Research OpenSource 2009 05 19 05/19/2009 K A Lloyd Code Numerics K A Lloyd 727f6e8e-1cc9-4afc-9d6f-3329a569a712 Smoke rendering demo This application renders a density field of float values. In this particular demo it is a smoke density field, but it might as well be other sorts of data such as fog, fluids or calculations. The density field is visualized using a ray marching technique and the background is rendered by ray tracing a kd tree.
/content/cudazone/CUDABrowser/assets/images/applications/604_smoke_sreenshot1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/604_smoke_sreenshot1_large.jpg Research Alexandra Instituttet http://www.alexandra.dk/index.htm 2009 05 14 05/14/2009 Peter Trier Application Multimedia Graphics Science Smoke rendering, ray tracing,Peter Trier,peter.trier@alexandra.dk 72067ded-99f3-4176-96ad-9f1551b12c41 CUJ2K - JPEG2000 Encoder CUJ2K is a fast encoder for the new image compression standard JPEG2000, which improves on JPEG by providing better compression ratios and supporting lossless compression along with many other features. JPEG2000 is very computation-intensive and therefore benefits greatly from CUDA acceleration. CUJ2K uses streaming to accelerate batch image compression. This program provides command-line, .NET GUI and library interfaces to convert BMP -> JPEG2000. It also supports creation of MJ2 videos. /content/cudazone/CUDABrowser/assets/images/applications/603_banner_small.gif /content/cudazone/CUDABrowser/assets/images/applications/603_banner_large.gif Hochschule University of Stuttgart, IPVS http://www.ipvs.uni-stuttgart.de/ 2009 09 20 09/20/2009 4 Open Source Norbert Fuerst Armin Weiss Simon Papandreou Martin Heide Ana Balevic Application Paper Code Graphics Imaging Medical Imaging Libraries Video & Audio JPEG2000, image compression, encoder, codec, JPEG, CUJ2K, image processing, lossless, lossy,Norbert Fuerst,Armin Weiss,Simon Papandreou, Martin Heide, Ana Balevic,cuj2k.project@googlemail.com 64528049-540a-4d7f-9cc0-2d4a2ccad4f0 Parallel Multiclass classification using SVM on GPUs The scaling of serial algorithms cannot rely on the improvement of CPUs anymore. The performance of classical Support Vector Machine (SVM) implementations has reached its limit and the arrival of the multi core era requires these algorithms to adapt to a new parallel scenario.
Graphics Processing Units (GPUs) have arisen as high performance platforms to implement data parallel algorithms. This paper describes how a native implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy. /content/cudazone/CUDABrowser/assets/images/applications/602_multisvm_small.gif /content/cudazone/CUDABrowser/assets/images/applications/602_multisvm_large.gif Academia MIT 2008 12 31 12/31/2008 112 Sergio Herrero-Lopez Code Numerics Sergio Herrero-Lopez,sherrero@mit.edu f7874e4b-ba49-44f9-b736-6a3341519f41 Fast pattern classification of ventricular arrhythmias using graphics processing units Graphics Processing Units (GPUs) can provide remarkable performance gains when compared to CPUs for computationally-intensive applications. In the biomedical area, most of the previous studies are focused on using Neural Networks (NNs) for pattern recognition of biomedical signals. However, the long training times prevent them from being used in real time. This is critical for the fast detection of Ventricular Arrhythmias (VAs) which may cause cardiac arrest and sudden death. In this paper, we present a parallel implementation of the Back-Propagation (BP) and the Multiple Back-Propagation (MBP) algorithms which allowed significant training speedups. In our proposal, we explicitly specify data parallel computations by defining special functions (kernels); therefore, we can use a fast evaluation strategy for reducing the computational cost without wasting memory resources. The performance of the pattern classification implementation is compared against other reported algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/600_mbpTop_small.png /content/cudazone/CUDABrowser/assets/images/applications/600_mbpTop_large.png Academia IPG http://www.ipg.pt 2009 11 09 11/09/2009 53 Noel Lopes Application Paper medicine Neural Networks,Noel Lopes,noel@ipg.pt c8a33001-387c-474f-a477-63571429ab6f Heart Wall Tracking Tracking of mouse heart walls through a series of ultrasound images. /content/cudazone/CUDABrowser/assets/images/applications/599_heartwall_small.png /content/cudazone/CUDABrowser/assets/images/applications/599_heartwall_large.png Academia University of Virginia http://www.virginia.edu 2009 11 05 11/05/2009 15 Open source Lukasz G. Szafaryn Application Multimedia Code Medical Imaging Image Processing, Feature Detection, Ultrasound,Lukasz G. Szafaryn,lgs9a@virginia.edu d29736de-ffee-4b0a-b7ec-8d041259c195 Towards a multi-GPU solver for the three-dimensional two-phase incompressible Navier-Stokes equations We have ported parts of our parallel level-set based two-phase solver for the three-dimensional Navier-Stokes equations on the GPU. To our knowledge, this is the first time that a two-phase fluid solver profits from the performance boost of several GPUs. A multi-GPU double-precision solver for the pressure Poisson equation based on the Jacobi preconditioned conjugate gradient method was implemented using CUDA and MPI. Thereby, we obtain a major speedup factor of 31.1 for the Poisson solver on four GPUs of our NVIDIA Tesla S1070, in contrast to a single CPU. Consequently, our overall fluid solver shows an impressive speedup factor of 16.6. 
/content/cudazone/CUDABrowser/assets/images/applications/598_logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/598_logo_large.jpg Academia Institute for Numerical Simulation - University of Bonn, Germany http://www.ins.uni-bonn.de 2009 09 30 09/30/2009 16 Peter Zaspel Paper Computational Fluid Dynamics Numerics Science CFD, multi-GPU, Navier-Stokes, multi-phase,Peter Zaspel,zaspel@ins.uni-bonn.de 5bd7b280-5a27-49e5-be83-c95099ac3a3c String Matching on a Multicore GPU Using CUDA Graphics Processing Units (GPUs) have evolved over the past few years from dedicated graphics rendering devices to powerful parallel processors, outperforming traditional Central Processing Units (CPUs) in many areas of scientific computing. The use of GPUs as processing elements was very limited until recently, when the concept of General-Purpose computing on Graphics Processing Units (GPGPU) was introduced. GPGPU made it possible to exploit the processing power and the memory bandwidth of GPUs through APIs that hide the GPU hardware from programmers. This paper presents experimental results on the parallel processing of some well-known on-line string matching algorithms using one such GPU abstraction API, the Compute Unified Device Architecture (CUDA). /content/cudazone/CUDABrowser/assets/images/applications/597_cuda1o_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/597_cuda1o_large.jpg Academia University of Macedonia http://www.uom.gr 2009 09 10 09/10/2009 24 C. S. Kouzinopoulos Paper String matching string matching, algorithms, CUDA, GPGPU, parallel,C. S. Kouzinopoulos,ckouz@uom.gr c3242f2b-7ede-43d1-87b7-c462eae24c94 Fast Tridiagonal Solvers on the GPU We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD).
We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver. /content/cudazone/CUDABrowser/assets/images/applications/596_idav_small.png /content/cudazone/CUDABrowser/assets/images/applications/596_idav_large.png Academia University of California, Davis 2009 10 28 10/28/2009 12 Yao Zhang Application Paper Numerics Yao Zhang,yaozhang@ucdavis.edu 0facea85-946d-47ef-93fd-12b5ae74b4b6 Accelerating Geo-Science and Engineering System Simulations on Graphics Hardware This paper discusses GPU implementations of three example applications from computational fluid dynamics, seismic wave propagation, and rock magnetism. These candidate applications involve important numerical modeling techniques, widely employed in physical system simulations, that are themselves examples of distinct computing classes identified as fundamental to scientific and engineering computing. The presented numerical methods (and respective computing classes they belong to) are: (1) a lattice-Boltzmann code for geofluid dynamics (structured grid class); (2) a spectral-finite-element code for seismic wave propagation simulations (sparse linear algebra class); and (3) a least-squares minimization code for interpreting magnetic force microscopy data (dense linear algebra class). 
Significant performance increases are seen in all three applications, demonstrating the power of GPU implementations for these types of simulations and their associated computing classes. /content/cudazone/CUDABrowser/assets/images/applications/595_stochastic_small.png /content/cudazone/CUDABrowser/assets/images/applications/595_stochastic_large.png Academia University of Minnesota 2009 10 25 10/25/2009 30 Stuart D.C. Walsh Paper Computational Fluid Dynamics Imaging Science Stuart D.C. Walsh,sdcwalsh@umn.edu 786fec9c-472d-4f0e-9985-42ad2050e358 Sailfish: An Open Source fluid simulation package using the Lattice-Boltzmann method Sailfish is a general purpose fluid dynamics solver optimized for modern multicore processors, especially Graphics Processing Units (GPUs). The solver is based on the Lattice Boltzmann Method and works for both 2D and 3D fluids. Its performance peaks at 950MLUPS with the D2Q9 grid and 750MLUPS with D3Q19 (using CUDA on a single GTX280 video card). The design of Sailfish tries to reconcile ease of use and flexibility with performance. Python, with its powerful modules: sympy (for automatic code generation), numpy, pygame, tvtk etc. is used as the main language on the host (for I/O, visualization and user interaction), while the actual computations are performed on the GPU using CUDA or OpenCL. /content/cudazone/CUDABrowser/assets/images/applications/594_sailfish_small.png /content/cudazone/CUDABrowser/assets/images/applications/594_sailfish_large.png Academia Institute of Physics, University of Silesia 2009 04 17 04/17/2009 100 Open Source M. Januszewski M. Kostur Multimedia Code Computational Fluid Dynamics M. Januszewski,M. 
Kostur,mjanusz@us.edu.pl 111d3757-3e16-4600-bf47-437a832bae86 GPU-SPHysics a GPU-based Smoothed Particle Hydrodynamics model for free surface flows /content/cudazone/CUDABrowser/assets/images/applications/593_boreinboxwhite_small.png /content/cudazone/CUDABrowser/assets/images/applications/593_boreinboxwhite_large.png Academia Istituto Nazionale di Geofisica e Vulcanologia 2008 12 31 12/31/2008 23 Alexis Herault Paper Computational Fluid Dynamics dea0e214-213a-4557-9ef4-1e9d5d6f80c9 Evaluating Multi-Core Platforms for HPC Data-Intensive Kernels We present an evaluation of three platform types, namely NVIDIA GPUs, the STI Cell/B.E., and generic multi-core CPUs on convolutional resampling (aka gridding), which is an irregular, data-intensive application from radio astronomy. We evaluate these platforms in terms of performance, programming effort and cost. Although we do not select a clear winner, we do provide a list of guidelines to assist in platform choice and development of similar data-intensive applications. /content/cudazone/CUDABrowser/assets/images/applications/592_gridding_fig_small.png /content/cudazone/CUDABrowser/assets/images/applications/592_gridding_fig_large.png Academia Delft University of Technology http://www.tudelft.nl/ 2009 05 18 05/18/2009 Alexander S. van Amesfoort Paper Imaging Numerics Science Signal Processing data-intensive gridding astronomy,Alexander S. van Amesfoort,a.s.vanamesfoort@tudelft.nl 02285ada-66ce-4cd5-8809-e459372d9fb8 An efficient GPU implementation for large-scale individual-based simulation of collective behavior In this work we describe a GPU implementation for an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors' positions and orientations. This carries a very high computational cost in the so-called nearest neighbors search.
By leveraging the GPU processing power and the new programming model called CUDA, we implement an efficient framework which makes it possible to simulate the collective motion of high-density groups of individuals. In particular, we present as a case study a simulation of the motion of millions of fish. We describe our implementation and present extensive experiments which demonstrate the effectiveness of our GPU implementation. /content/cudazone/CUDABrowser/assets/images/applications/591_HiBi09_small.png /content/cudazone/CUDABrowser/assets/images/applications/591_HiBi09_large.png Academia Universita di Salerno 2009 10 16 10/16/2009 Ugo Erra Bernardino Frola Vittorio Scarano Iain Couzin Application Multimedia Paper Life Sciences Ugo Erra,ugo.erra@unibas.it a412a716-04f1-4cf9-a389-8a51d3ea7680 OpenCurrent OpenCurrent is an open source C++ library for solving Partial Differential Equations (PDEs) over regular grids using the CUDA platform from NVIDIA. It breaks down a PDE into 3 basic objects: Grids, Solvers, and Equations. Grid data structures efficiently implement regular 1D, 2D, and 3D arrays in both double and single precision. Grids support operations like computing linear combinations, managing host-device memory transfers, interpolating values at non-grid points, and performing array-wide reductions. Solvers use these data structures to calculate terms arising from discretizations of PDEs, such as finite-difference based advection and diffusion schemes, and a multigrid solver for Poisson equations. These computational building blocks can be assembled into complete Equation objects that solve time-dependent PDEs. One such Equation solver is an incompressible Navier-Stokes solver that uses a second-order Boussinesq model. This equation solver is fully validated, and has been used to study Rayleigh-Benard convection under a variety of different regimes (citation). Benchmarks show it to perform about 8 times faster than an equivalent Fortran code running on an 8-core Xeon.
/content/cudazone/CUDABrowser/assets/images/applications/590_opencurrent_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/590_opencurrent_large.jpg Commercial NVIDIA http://www.nvidia.com 2009 09 25 09/25/2009 Open Source Jonathan Cohen Code libraries Jonathan Cohen 21a1b481-5773-403d-8644-730c1c5f1d58 Correlating Radio Astronomy Signals A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will even require in the order of exaflops, and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware, to increase flexibility and to reduce development efforts. /content/cudazone/CUDABrowser/assets/images/applications/589_LBA-field_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/589_LBA-field_large.jpg Research Astron http://www.astron.nl 2009 10 16 10/16/2009 6.3 Open source Rob van Nieuwpoort Paper Code Application Science Signal Processing Rob van Nieuwpoort,nieuwpoort@astron.nl fb82b05f-0449-485d-8779-b53d28646189 TUNED AND ASYNCHRONOUS STENCIL KERNELS FOR CPU/GPU SYSTEMS We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double-precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on a NVIDIA C1060. 
Motivated to find a still faster implementation, we further consider wildly asynchronous implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of a chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more ops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly fast-and-loose algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs. /content/cudazone/CUDABrowser/assets/images/applications/588_tuned_small.png /content/cudazone/CUDABrowser/assets/images/applications/588_tuned_large.png Academia Georgia Institute of Technology 2009 05 01 05/01/2009 Sundaresan Venkatasubramanian Paper Sundaresan Venkatasubramanian f72dcd39-833c-4760-8d04-87e67f4afa2b Hybrid GPU-Based Single- and Double-Bounce SAR Simulation A new hybrid graphics-processing-unit (GPU)-based real-time synthetic aperture radar (SAR) simulation system is presented. Previous real-time SAR simulators only supported single-bounce simulation in real time. The new hybrid system uses the rasterization approach for real-time single-bounce simulation and a new image-based GPU ray-tracing approach for monostatic SAR double-bounce simulation. This approach provides fast simulation results even while simulating complex and extended scenes.
/content/cudazone/CUDABrowser/assets/images/applications/587_hybrid_small.png /content/cudazone/CUDABrowser/assets/images/applications/587_hybrid_large.png Academia LIESMARS, Wuhan University 2009 10 01 10/01/2009 Timo Balz Uwe Stilla Paper Science Remote Sensing Radar, SAR, Remote Sensing, Simulation, Ray-Tracing,Timo Balz,Uwe Stilla,timobalz@gmail.com 1026c7d5-f1c2-4709-800e-fad3add12e5a A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures A proposal to extend OpenMP to incorporate the concept of multiple architectures, so that it takes care of separating the different pieces, compiling them adequately, and offloading them. The user is still responsible for identifying interesting parts to offload and optimize for the target. /content/cudazone/CUDABrowser/assets/images/applications/586_openmp_small.png /content/cudazone/CUDABrowser/assets/images/applications/586_openmp_large.png Academia Universitat Politècnica de Catalunya 2009 06 03 06/03/2009 E. Ayguade Presentation Libraries E. Ayguade 13072d4f-4cdc-488e-ac1b-5d42f73c2528 AntiPlanet2 AntiPlanet2 is a first-person 3D shooter game set in a fantastic extraterrestrial world built of spheres and shadows. AntiPlanet uses a ray-tracing renderer for visualization and runs on CUDA. The 3D engine works at any resolution in real time and supports transparency and bi-cubic textures.
/content/cudazone/CUDABrowser/assets/images/applications/585_fallenflowers_small.png /content/cudazone/CUDABrowser/assets/images/applications/585_fallenflowers_large.png Commercial virtualray.ru http://www.virtualray.ru 2009 10 06 10/06/2009 Commercial Lev Dymchenko Application Multimedia Graphics computer game 3d shooter antiplanet first person action game real time ray tracing spherical computer art,Lev Dymchenko,levdy@virtualray.ru 3859efe4-0773-4cc5-be54-9fc3d338a0ce cuco The GPU version of cosmological simulation code Gadget based on CUDA /content/cudazone/CUDABrowser/assets/images/applications/584_cuco_small.png /content/cudazone/CUDABrowser/assets/images/applications/584_cuco_large.png Partner Group of MPA 2009 08 25 08/25/2009 Lei Liu Code Science Lei Liu 0e8e658b-58f3-4627-a6ca-1c64e79c3416 Data Monster Database processing is a cornerstone of computing, and it is a market that last year generated approximately US $27 billion, according to technology analysis firm Forrester Research, in Cambridge, Mass. The firm projects that this number which includes new database licenses, technical support, and consulting will grow to $32 billion by 2013. Every time you bid on an eBay auction, search for a movie on Netflix, look for a Kindle title on Amazon, or do a Google search, massive database applications spring into action, delving into huge quantities of data spread across tens of thousands of machines. /content/cudazone/CUDABrowser/assets/images/applications/583_datamonster_small.png /content/cudazone/CUDABrowser/assets/images/applications/583_datamonster_large.png ieee spectrum http://spectrum.ieee.org 2009 09 01 09/01/2009 Andrea Di Blas Tim Kaldewey Paper Andrea Di Blas,Tim Kaldewey 4d61ce47-f1c6-472f-81d3-595fd0ab0883 Citrix HDX 3D for Professional Graphics Citrix HDX 3D for Professional Graphics can now deliver Windows physical desktops and applications to the most advanced professional graphics power users through Citrix XenDesktop technology. 
XenDesktop with HDX 3D provides the best performance possible over the wide area network (WAN), and over a local area network (LAN), HDX 3D consumes 10x less bandwidth than alternatives while still providing a high-definition user experience. /content/cudazone/CUDABrowser/assets/images/applications/582_citrix_small.png /content/cudazone/CUDABrowser/assets/images/applications/582_citrix_large.png Commercial Citrix http://www.citrix.com 2009 10 15 10/15/2009 Commercial Citrix Application Multimedia Paper Video & Audio Citrix a8985a03-4860-49c7-92ce-f1237031cc81 GPU-Accelerated TF-IDF TF-IDF (term-frequency/inverse-document frequency) is one of the fundamental concepts used in information retrieval and text mining. /content/cudazone/CUDABrowser/assets/images/applications/581_atomic_method_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/581_atomic_method_large.jpg Academia North Carolina State University 2009 03 10 03/10/2009 9 Yongpeng Zhang Frank Mueller Xiaohui Cui and Thomas Potok Paper Text Mining Yongpeng Zhang,Frank Mueller, Xiaohui Cui and Thomas Potok,zhang.yongpeng@gmail.com 27bd9c43-0986-477e-aa4d-1dcd0493090c High-Quality Rendering of Varying Isosurfaces Smooth trivariate splines on uniform tetrahedral partitions are well suited for high-quality visualization of isosurfaces from scalar volumetric data. We propose a novel rendering approach based on spline patches with low total degree, for which ray-isosurface intersections are computed using efficient root finding algorithms. Smoothly varying surface normals are directly extracted from the underlying spline representation. Our approach uses a combined CUDA and graphics pipeline and yields two key advantages over previous work. First, we can interactively vary the isovalues since all required processing steps are performed on the GPU. Second, we employ instancing in order to reduce shader complexity and to minimize overall memory usage.
In particular, this allows the spline coefficients to be computed on-the-fly in real-time on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/580_C1isosurfaces-medical_small.png /content/cudazone/CUDABrowser/assets/images/applications/580_C1isosurfaces-medical_large.png Academia TU Darmstadt http://www.tu-darmstadt.de/ 2009 10 07 10/07/2009 68 T. Kalbe T. Koch M. Goesele Multimedia Paper Graphics Medical Imaging Raycasting trivariate Splines isosurface volumerendering,T. Kalbe,T. Koch,M. Goesele,thomasdidikoch@gmx.net 0fa28489-5f19-4370-9a78-1d90711534a6 Realtime Dense Stereo Matching with Dynamic Programming in CUDA Real-time depth extraction from stereo images is an important process in computer vision. This paper proposes a new implementation of the dynamic programming algorithm to calculate dense depth maps using the CUDA architecture, achieving real-time performance with consumer graphics cards. We compare the running time of the algorithm against a CPU implementation and demonstrate the scalability of the algorithm by testing it on different graphics cards. /content/cudazone/CUDABrowser/assets/images/applications/579_DP_algorithm_CUDA_TV_2009_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/579_DP_algorithm_CUDA_TV_2009_large.jpg Academia CAD/CAM/CAE Lab. EAFIT University http://www1.eafit.edu.co/cadcamcae/ 2009 09 09 09/09/2009 10 John Congote Paper Graphics Imaging Video & Audio John Congote,jcongote@eafit.edu.co b7498c1e-fb46-492b-8ae6-fe6a4ccea50d Improving the Open64 Backend for GPUs NVIDIA uses Open64 as a front-end tool to compile CUDA programs into an intermediate language called PTX. PTX can be viewed as an assembly language targeting a virtual machine and is an abstract layer between the application and the final hardware-dependent machine code. Our research explores the relationship between register pressure in the PTX code and the final machine code.
We also implemented two optimizations in Open64 to help reduce register pressure and increase thread concurrency. /content/cudazone/CUDABrowser/assets/images/applications/578_open64_small.png /content/cudazone/CUDABrowser/assets/images/applications/578_open64_large.png Academia Northeastern University 2009 10 01 10/01/2009 Rodrigo Dominguez Presentation Programming Tools Rodrigo Dominguez,rdomingu@ece.neu.edu 92df54f0-8995-4ad9-8a57-0e9ddfd14842 Computer Generated Hologram on GPU - Simple color electroholography reconstruction system - We have constructed a simple color electroholography system that has excellent cost performance. It uses a graphics processing unit (GPU) and a liquid crystal display (LCD) projector. The structure of the GPU is suitable for calculating computer-generated holograms (CGHs). The GPU calculates approximately 1,500 times faster than a central processing unit (Intel Core 2 Duo 2.66 GHz; we used one core for the calculation). The LCD projector is an inexpensive, high-performance device for displaying CGHs. It has high-definition LCD panels for red, green and blue. Thus, it can be easily used for color electroholography. For a three-dimensional object consisting of 1,000 points, our system succeeded in real-time color holographic reconstruction at a rate of 30 frames per second. /content/cudazone/CUDABrowser/assets/images/applications/577_hologram_small.png /content/cudazone/CUDABrowser/assets/images/applications/577_hologram_large.png Chiba University / Shohoku College / Kisarazu National College of Technology 2009 10 07 10/07/2009 1500 Tomoyoshi Ito Naoki Takada Tomoyoshi Shimobaba Multimedia Paper Imaging Numerics Science Tomoyoshi Ito,Naoki Takada,Tomoyoshi Shimobaba,itot@faculty.chiba-u.jp a8608fd3-c5f3-45fe-bc14-9438fefb2c62 CudaPad CudaPad is software that helps developers develop and test small kernels for NVIDIA's CUDA language.
Sometimes in your IDE you will want a quick way to build or test a piece of CUDA code, and CudaPad lets you do it. It shows the ptx code, cubin code, register count, errors and more on the fly. There is no need to manually compile your code. /content/cudazone/CUDABrowser/assets/images/applications/576_CudaPad_small.png /content/cudazone/CUDABrowser/assets/images/applications/576_CudaPad_large.png CudaPad http://cudapad.com/ 2009 08 23 08/23/2009 CudaPad Application Programming Tools CudaPad d91c3c63-a2d6-4a15-a70a-87bcafdd70d8 Real-time Parallel Hashing on the GPU We introduce an efficient data-parallel algorithm for building hash tables containing millions of elements in real-time on the GPU. Our two-tiered approach combines classical randomized perfect hashing and the recently introduced cuckoo hashing. Retrieval of any item requires checking at most three locations. /content/cudazone/CUDABrowser/assets/images/applications/575_paper_thumb_small.png /content/cudazone/CUDABrowser/assets/images/applications/575_paper_thumb_large.png Academia University of California, Davis http://idav.ucdavis.edu/ 2009 09 12 09/12/2009 Dan Anthony Alcantara Paper Graphics Libraries Dan Anthony Alcantara,dfalcantara@ucdavis.edu b79e2f2b-047f-4497-bc71-8fae1e3bf2df Real-time Robotic Surgery Platform with the GPU A Real-time Simulation, Guidance and Visualisation Platform for Intra-operative Minimally Invasive Robotic Surgery /content/cudazone/CUDABrowser/assets/images/applications/574_robot-hotspot180_medium_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/574_robot-hotspot180_medium_large.jpg Academia Imperial College London 2009 10 06 10/06/2009 88 Guang-Zhong Yang Presentation Medical Imaging Guang-Zhong Yang,gzy@doc.ic.ac.uk aee7f189-cad9-4004-ab12-7af9e2dac705 Accelerating Virtual Texturing using CUDA Virtual texturing selectively loads parts of a large texture data set visible by the current view.
Our poster shows how virtual texturing can be accelerated by using CUDA and OpenGL /content/cudazone/CUDABrowser/assets/images/applications/573_cuda_zone_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/573_cuda_zone_large.jpg Academia Ghent University - IBBT, ELIS Department/Multimedia Lab http://multimedialab.elis.ugent.be/ 2009 09 30 09/30/2009 Charles-Frederik Hollemeersch Paper Graphics virtual textures rendering,Charles-Frederik Hollemeersch,charlesfrederik.hollemeersch@ugent.be dfe42cca-0549-462c-a8b4-6f7f2fdb17a8 Implementation in C+CUDA of Multi-Label Text Categorizers In automated multi-label text categorization problems with large numbers of labels, the training databases are large, which may render the categorization time prohibitive for online systems. In this work, we evaluate the parallel implementation in C+CUDA of two multi-label text categorizers: the first is based on the k-Nearest Neighbors (k-NN) algorithm and the second is based on Probabilistic Neural Networks (PNN). We implemented these algorithms in three different ways: sequential in C, parallel in C+CUDA, and parallel using the C+CUBLAS library. /content/cudazone/CUDABrowser/assets/images/applications/572_800px-Pnn_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/572_800px-Pnn_large.jpg Academia Universidade Federal do Espirito Santo http://www.ufes.br 2009 08 03 08/03/2009 64 Alberto F. De Souza et al. Paper Information Retrieval Alberto F. De Souza,alberto@lcad.inf.ufes.br 9e8ea1d4-1246-4b44-86aa-3eaeeec9bc0c Biologically Inspired Stereoscopic Vision Model in C+CUDA Most of the depth perception processing is done in the visual cortex, mainly in the primary (V1) and medial temporal (MT) areas. In this work, we modeled the neural architecture of the V1 and MT cortices using as building blocks previous models of cortical cells and log-polar mapping. 
A sequential implementation of our model can build a three-dimensional representation of the external world using stereoscopic image pairs obtained from a pair of fronto-parallel cameras. A C+CUDA parallel implementation is almost 60 times faster and allows real-time 3D reconstruction. /content/cudazone/CUDABrowser/assets/images/applications/571_800px-3d-hallysson_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/571_800px-3d-hallysson_large.jpg Academia Universidade Federal do Espirito Santo http://www.ufes.br 2009 08 03 08/03/2009 57 Alberto F. De Souza et al. Paper Computer Vision Alberto F. De Souza,alberto@lcad.inf.ufes.br ddb099b1-959c-4bf2-9254-ba51143125d4 ACCELERATING SPHERICAL HARMONIC TRANSFORMS ON THE NVIDIA GPU The Spherical Harmonic Transform is a critical computational kernel of the dynamics algorithms for numerical weather prediction and climate modeling. As atmospheric models push towards higher resolutions it has become necessary to accelerate this computationally intensive transform. Previous work has attempted to parallelize and optimize the transform [1] [2] [3] [4], but none has exploited the advantages of NVIDIA's General Purpose Graphics Processing Unit (GPGPU), a very recent SIMD-type architecture. This paper describes a CPU-GPU implementation for computing the Spherical Harmonic Transform. The implementation shows a gain in computation time and a low error rate when compared to the implementation discussed in [1].
/content/cudazone/CUDABrowser/assets/images/applications/570_soman_small.png /content/cudazone/CUDABrowser/assets/images/applications/570_soman_large.png Academia Department of Electrical Engineering University of Wisconsin, Madison, Wisconsin, USA 2008 12 31 12/31/2008 42 Vikrant Soman Paper Computational Fluid Dynamics Spherical Harmonic Transform, GPU, Parallel,Vikrant Soman e22a4a2f-cf43-499a-8777-3570c85b9e60 CULATools CULA is EM Photonics' GPU-accelerated numerical linear algebra library that contains a growing list of LAPACK functions. /content/cudazone/CUDABrowser/assets/images/applications/569_cula-logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/569_cula-logo_large.png Commercial CULATools http://www.culatools.com/ 2009 09 30 09/30/2009 Application 200 3355d528-c9f0-4e35-a07e-da8ea95ddc35 Scalable Split Primitives for the GPU Fast Split and Sort Implementation for millions of input elements and supporting 32-128 bit key values /content/cudazone/CUDABrowser/assets/images/applications/568_splitSort_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/568_splitSort_large.jpg Academia CVIT, IIIT Hyderabad http://cvit.iiit.ac.in 2009 07 15 07/15/2009 Open source Suryakant Patidar Paper Code Libraries Sort, Split,Suryakant Patidar,skp@research.iiit.ac.in 2f279ff5-7168-4acf-822b-72fd98b2cd76 FindCUDA.cmake Building on the open source project CMake, developers can now integrate CUDA C compilation directly into their Visual Studio, Makefile or XCode build systems. File level dependencies are supported, as well as many other features designed to help CUDA C files build as part of the native system. Starting with CMake 2.8, FindCUDA.cmake is part of the standard distribution. /content/cudazone/CUDABrowser/assets/images/applications/567_CMake-logo-high-res_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/567_CMake-logo-high-res_large.jpg Commercial NVIDIA Corp. 
http://www.nvidia.com 2009 09 30 09/30/2009 Open source James Bigler Code Programming Tools Build, CMake, Visual Studio, Makefile, XCode,James Bigler,jbigler@nvidia.com 46bb452f-bc32-4e3e-a9f8-ef2b42c975db Cognitive developmental approach towards the realization of human-like visual scene understanding How do we humans understand visual scenes so easily and quickly? It is difficult to answer the question. However, human babies naturally acquire the ability to do it. Thus, imitating typical actions of babies would be promising for acquiring the ability of human-like visual scene understanding. Based on the above discussion, we propose a new framework of human-like visual scene understanding based on a cognitive developmental approach, and construct a prototype system that recognizes already known objects and detects and registers unknown objects in near real-time with CUDA technologies. /content/cudazone/CUDABrowser/assets/images/applications/566_poster2_small.png /content/cudazone/CUDABrowser/assets/images/applications/566_poster2_large.png NTT Communication Science Laboratories http://www.kecl.ntt.co.jp 2009 09 27 09/27/2009 Akisato Kimura Multimedia Paper Signal Processing Video & Audio Cognitive developmental approach, visual scene understanding, saliency, video segmentation, CUDA,Akisato Kimura,akisato@ieee.org bd8ebdba-8c09-413c-8c09-8cd67ec51ea5 SCGPSim: A fast SystemC Simulator on GPUs A SystemC simulator on GPUs /content/cudazone/CUDABrowser/assets/images/applications/564_poster_small.png /content/cudazone/CUDABrowser/assets/images/applications/564_poster_large.png Academia FERMAT Lab, Virginia Tech, Blacksburg, USA http://www.fermat.ece.vt.edu/ 2009 10 01 10/01/2009 100 Mahesh Nanjundappa Paper Electronic Design Automation GPGPU, EDA, Parallel Simulation, SystemC,Mahesh Nanjundappa,knmahesh@vt.edu 6e6bb696-0ae8-49c7-b75f-982182e43b7e Flowcart Flowball is an interactive game using dense optical flow computed in realtime on a Geforce GTX 280.
We provide a video and optical flow libraries... /content/cudazone/CUDABrowser/assets/images/applications/563_cuda_zone_flowcart_small.png /content/cudazone/CUDABrowser/assets/images/applications/563_cuda_zone_flowcart_large.png Academia Institute for Computer Graphics and Vision, Graz University of Technology http://www.icg.tugraz.at/ 2009 09 02 09/02/2009 Wolfgang Paier Application Multimedia Paper Game Physics Graphics Video & Audio Wolfgang Paier,info@gpu4vision.org ccbd6aa9-f5a3-4310-bc01-4463d114ba04 CUDA Accelerated Sparse Field Level Set Segmentation of Large Medical Data Sets Segmentation of large medical volumes is an important task in diagnostic medicine. Computer assisted level set segmentation techniques have been shown to improve the accuracy of difficult segmentation tasks. We present a novel GPU accelerated level set segmentation algorithm that avoids redundant computations by only processing those voxels near the propagating level set surface. We evaluate the speed and accuracy of our algorithm by performing various segmentation tasks on a noisy magnetic resonance image (MRI) generated from the BrainWeb phantom dataset. We compare the performance of our algorithm to that of the previous best GPU and CPU algorithms. Compared to previous best GPU algorithm, our algorithm reduces the total number of processed voxels by 16 times with a negligible effect on segmentation accuracy. Our algorithm converges 9 times faster than the previous best GPU algorithm and 360 times faster than the previous best CPU algorithm on identical hardware. 
/content/cudazone/CUDABrowser/assets/images/applications/562_level_set_growth_3D_3_images_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/562_level_set_growth_3D_3_images_large.jpg Academia University of Calgary http://www.ucalgary.ca/ 2009 10 01 10/01/2009 360 Commercial Mike Roberts Paper Medical Imaging segmentation, level set, sparse field, narrow band,Mike Roberts,mlrobert@ucalgary.ca b19208d8-3dcc-4b57-b8aa-993ed8261989 GPU accelerated Maximum Intensity Projection The "Maximum Intensity Projection" (MIP) is a computer visualization method in medicine that uses 3D data, e.g. CT or MRT, and computes a 2D view from a certain viewpoint. /content/cudazone/CUDABrowser/assets/images/applications/603_mip_filter2_small.gif /content/cudazone/CUDABrowser/assets/images/applications/603_mip_filter2_large.gif Academia Heidelberg University / Heilbronn University 2008 12 31 12/31/2008 Clas Rurik Multimedia Medical Imaging Clas Rurik,crurik@ix.urz.uni-heidelberg.de 5ad29b38-3310-42a2-830e-f315c5103602 Stochastic Lagrangian Particle Model for Air Pollution The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA) - a parallel computing architecture - has been developed by NVIDIA to utilize this performance in general purpose computations. Here we show for the first time a possible application of GPU for environmental studies serving as a basis for decision making strategies. A stochastic Lagrangian particle model has been developed on CUDA to estimate the transport and the transformation of the radionuclides from a single point source during an accidental release. Our results show that parallel implementation achieves typical acceleration values in the order of 80-120 times compared to CPU using a single-threaded implementation on a 2.33 GHz desktop computer.
Only very small differences have been found between the results obtained from GPU and CPU simulations, which are comparable with the effect of stochastic transport phenomena in atmosphere. The relatively high speedup with no additional costs to maintain this parallel architecture could result in a wide usage of GPU for diversified environmental applications in the near future. /content/cudazone/CUDABrowser/assets/images/applications/602_plume_small.png /content/cudazone/CUDABrowser/assets/images/applications/602_plume_large.png Academia Eotvos Lorand University 2009 09 21 09/21/2009 120 Open source Ferenc Molnar Jr. Application Paper Code Computational Fluid Dynamics Numerics Science Video card, Parallel computing, CUDA, Environmental application, Air pollution,Ferenc Molnar Jr.,mofi@elte.hu 358bc116-6b7d-4598-a11a-bdad6cbd8e30 On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers. For certain classes of Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. 
We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design. /content/cudazone/CUDABrowser/assets/images/applications/601_montecarlo_small.png /content/cudazone/CUDABrowser/assets/images/applications/601_montecarlo_large.png Academia Oxford-Man Institute 2009 05 14 05/14/2009 500 Anthony Lee Christopher Yau Michael B. Giles Paper Numerics Sequential Monte Carlo, Population-Based Markov Chain Monte Carlo, General Purpose Computation on Graphics Processing Units, Many-Core Architecture, Stochastic Simulation, Parallel Processing,Anthony Lee,Christopher Yau,Michael B. Giles,lee@stats.ox.ac.uk ed8e3b35-7db8-4c89-8cf7-a9366ce84bbe FOLKI-GPU Optical Flow A very fast implementation of Optical flow (25fps for full HD res) /content/cudazone/CUDABrowser/assets/images/applications/600_onera_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/600_onera_large.jpg Research ONERA http://www.onera.fr/english.php 2009 07 24 07/24/2009 Open source Aurelien Plyer Guy Le Besnerais Frederic Champagnat Multimedia Paper Code Imaging Video & Audio computer vision optical flow motion,Aurelien Plyer,Guy Le Besnerais,Frederic Champagnat,aurelien.plyer@onera.fr aa156ca7-4c87-4d17-89d0-e51569250645 A Fast High Quality Pseudo Random Number Generator for NVIDIA CUDA Previously either due to hardware GPU limits or older versions of software, careful implementation of PRNGs was required to make good use of the limited numerical precision available on graphics cards. Newer nVidia G80 and Tesla hardware support double precision. This is available to high level programmers via CUDA. This allows a much simpler C++ implementation of Park-Miller random numbers, which provides a four fold speed up compared to an earlier GPU implementation. Code is available via ftp.
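The entry above does not include its code, so the following is only a minimal sketch of why double precision simplifies a Park-Miller ("minimal standard") generator. It is written in Python rather than the C++/CUDA the entry mentions, and the names `park_miller_next` and `park_miller_stream` are hypothetical. It assumes the classic constants, multiplier 16807 and modulus 2^31 - 1. The point it illustrates is that the product of the multiplier and any valid state fits exactly in a 53-bit double significand, so one multiply and one floating-point remainder suffice.

```python
import math

MODULUS = 2147483647.0   # 2**31 - 1, the Mersenne prime used by Park-Miller
MULTIPLIER = 16807.0     # 7**5, the classic "minimal standard" multiplier


def park_miller_next(state: float) -> float:
    """Advance the generator one step using only double-precision arithmetic."""
    # The product stays below 2**46, so a double (53-bit significand) holds it
    # exactly and fmod returns the exact integer remainder.
    return math.fmod(MULTIPLIER * state, MODULUS)


def park_miller_stream(seed: int, count: int) -> list:
    """Return `count` successive outputs as integers, starting from `seed`."""
    state = float(seed)
    values = []
    for _ in range(count):
        state = park_miller_next(state)
        values.append(int(state))
    return values


if __name__ == "__main__":
    # Show the first few outputs for seed 1.
    print(park_miller_stream(1, 3))
```

With 32-bit integers or single-precision floats the same product would overflow or lose bits, which is the kind of careful workaround the entry says was previously required.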
/content/cudazone/CUDABrowser/assets/images/applications/599_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/599_graph_large.png Academia Department of Computer Science, CREST centre, Kings College, London http://www.cs.ucl.ac.u 2009 01 01 01/01/2009 W. B. Langdon Paper Programming Tools W. B. Langdon,Wi11iam.Langdon@kcl.ac.uk cbe71302-afb1-4776-bb98-80fd8651b466 JUMP FLOODING ALGORITHM ON GRAPHICS HARDWARE AND ITS APPLICATIONS The graphics processing unit (GPU) has been developing at a very fast pace in recent years. More and more research has been done to utilize the ever-increasing computational power of the GPU for general-purpose computations. This thesis proposes a new GPU algorithm: the jump flooding algorithm (JFA). JFA is a new paradigm of communication between pixels on the GPU. It can quickly propagate the information of certain pixels to the others. The speed of JFA is exponentially faster than that of the standard flooding algorithm, and is approximately independent of the input size. /content/cudazone/CUDABrowser/assets/images/applications/597_progress_small.png /content/cudazone/CUDABrowser/assets/images/applications/597_progress_large.png 2008 12 31 12/31/2008 RONG GUODONG Paper Imaging RONG GUODONG 91755150-16a7-4570-9a2e-2b2e921d2baf Many-Core Algorithms for Statistical Phylogenetics Statistical phylogenetics is computationally intensive, resulting in considerable attention meted on techniques for parallelization. Codon-based models allow for independent rates of synonymous and replacement substitutions and have the potential to more adequately model the process of protein coding sequence evolution with a resulting increase in phylogenetic accuracy. Unfortunately, due to the high number of codon states, computational burden has largely thwarted phylogenetic reconstruction under codon models, particularly at the genomic-scale.
Here we describe novel algorithms and methods for evaluating phylogenies under arbitrary molecular evolutionary models on Graphics Processing Units (GPUs), making use of the large number of processing cores to efficiently parallelize calculations even for large state-size models. Results: We implement the approach in an existing Bayesian framework and apply the algorithms to estimating the phylogeny of 62 complete mitochondrial genomes of carnivores under a 60-state codon model. We see a near 90-fold speed increase over an optimized CPU-based computation and a >140-fold increase over the currently available implementation, making this the first practical use of codon models for phylogenetic inference over whole mitochondrial or microorganism genomes. /content/cudazone/CUDABrowser/assets/images/applications/596_Phylogenetics_small.png /content/cudazone/CUDABrowser/assets/images/applications/596_Phylogenetics_large.png Department of Biomathematics, University of California, Los Angeles 2009 04 15 04/15/2009 140 Marc A. Suchard Andrew Rambaut Paper Marc A. Suchard,Andrew Rambaut d753c609-c7e3-4ddc-b2c9-d054e3ab46dd Speed Up SVM Algorithm for Massive Classification Tasks We present a new parallel and incremental Support Vector Machine (SVM) algorithm for the classification of very large datasets on graphics processing units (GPUs). SVM and kernel related methods have shown to build accurate models but the learning task usually needs a quadratic program so that this task for large datasets requires large memory capacity and long time. We extend a recent Least Squares SVM (LS-SVM) proposed by Suykens and Vandewalle for building incremental and parallel algorithm. The new algorithm uses graphics processors to gain high performance at low cost. 
Numerical test results on UCI and Delve dataset repositories showed that our parallel incremental algorithm using GPUs is about 70 times faster than a CPU implementation and often significantly faster (over 1000 times) than state-of-the-art algorithms like LibSVM, SVM-perf and CB-SVM. /content/cudazone/CUDABrowser/assets/images/applications/595_svm_small.png /content/cudazone/CUDABrowser/assets/images/applications/595_svm_large.png Academia IRISA Symbiose, Campus de Beaulieu, 35042 Rennes Cedex, France 2008 09 30 09/30/2008 Thanh-Nghi Do Van-Hoa Nguyen Francois Poulet Paper Numerics Thanh-Nghi Do,Van-Hoa Nguyen,Francois Poulet,dtnghi@cit.ctu.edu.vn,vhnguyen@irisa.fr,francois.poulet@irisa.fr 040861ed-61a2-410f-907e-65e4a23b33a3 Visualizing Multiwavelength Astrophysical Data With recent advances in the measurement technology for allsky astrophysical imaging, our view of the sky is no longer limited to the tiny visible spectral range over the 2D Celestial sphere. We now can access a third dimension corresponding to a broad electromagnetic spectrum with a wide range of allsky surveys; these surveys span frequency bands including long wavelength radio, microwaves, very short X-rays, and gamma rays. These advances motivate us to study and examine multiwavelength visualization techniques to maximize our capabilities to visualize and exploit these informative image data sets. In this work, we begin with the processing of the data themselves, uniformizing the representations and units of raw data obtained from varied detector sources. Then we apply tools to map, convert, color-code, and format the multiwavelength data in forms useful for applications. We explore different visual representations for displaying the data, including such methods as textured image stacks, the horseshoe representation, and GPU-based volume visualization.
A family of visual tools and analysis methods are introduced to explore the data, including interactive data mapping on the graphics processing unit (GPU), the mini-map explorer, and GPU-based interactive feature analysis. /content/cudazone/CUDABrowser/assets/images/applications/593_title_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/593_title_large.jpg Academia The Hong Kong University of Science and Technology 2008 12 01 12/01/2008 Hongwei Li Paper Imaging Science Hongwei Li 489165bd-a529-412c-bb5e-0230b77d02f9 A GPU based real-time software correlation system for the Murchison Widefield Array prototype. Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom built hardware and pure CPU-based software correlation.
The correlator was deployed at the Murchison Widefield Array 32 antenna prototype system where it ran in real-time for extended periods. We briefly describe the data capture, streaming and correlation system for the prototype array. /content/cudazone/CUDABrowser/assets/images/applications/592_bar_small.png /content/cudazone/CUDABrowser/assets/images/applications/592_bar_large.png Academia Harvard-Smithsonian Center for Astrophysics 2008 12 31 12/31/2008 Randall B. Wayth Paper Science Signal Processing Randall B. Wayth,rwayth@cfa.harvard.edu 01e2e4a7-8b67-47c9-80b4-c4b0b40c66e7 Asymptotic theorems of sequential estimation-adjusted urn models The Generalized Pólya Urn (GPU) is a popular urn model which is widely used in many disciplines. In particular, it is extensively used in treatment allocation schemes in clinical trials. In this paper, we propose a sequential estimation-adjusted urn model (a nonhomogeneous GPU) which has a wide spectrum of applications. Because the proposed urn model depends on sequential estimations of unknown parameters, the derivation of asymptotic properties is mathematically intricate and the corresponding results are unavailable in the literature. We overcome these hurdles and establish the strong consistency and asymptotic normality for both the patient allocation and the estimators of unknown parameters, under some widely satisfied conditions. These properties are important for statistical inferences and they are also useful for the understanding of the urn limiting process. A superior feature of our proposed model is its capability to yield limiting treatment proportions according to any desired allocation target. The applicability of our model is illustrated with a number of examples. /content/cudazone/CUDABrowser/assets/images/applications/591_formula_small.png /content/cudazone/CUDABrowser/assets/images/applications/591_formula_large.png Academia Zhejiang University 2006 03 14 03/14/2006 Li-X.
Zhang Feifang Hu Siu Hung Cheung Paper Numerics Science Li-X. Zhang,Feifang Hu,Siu Hung Cheung 8fd8d414-8e27-4d33-b9b7-ec084a06aeb4 High Performance Direct Gravitational N-body Simulations We present the results of gravitational direct $N$-body simulations using the commercial graphics processing units (GPU) NVIDIA Quadro FX1400 and GeForce 8800GTX, and compare the results with GRAPE-6Af special purpose hardware. The force evaluation of the $N$-body problem was implemented in Cg using the GPU directly to speed up the calculations. The integration of the equations of motion, running on the host computer, was implemented in C using the 4th order predictor-corrector Hermite integrator with block time steps. We find that for a large number of particles ($N \gtrsim 10^4$) modern graphics processing units offer an attractive low cost alternative to GRAPE special purpose hardware. A modern GPU continues to give a relatively flat scaling with the number of particles, comparable to that of the GRAPE. Using the same time step criterion the total energy of the $N$-body system was conserved better than to one in $10^6$ on the GPU, which is only about an order of magnitude worse than obtained with GRAPE. For $N \gtrsim 10^6$ the GeForce 8800GTX was about 20 times faster than the host computer. Though still about an order of magnitude slower than GRAPE, modern GPUs outperform GRAPE in their low cost, long mean time between failure and the much larger onboard memory; the GRAPE-6Af holds at most 256k particles whereas the GeForce 8800GTX can hold 9 million particles in memory.
/content/cudazone/CUDABrowser/assets/images/applications/590_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/590_graph_large.png Academia Section Computational Science, University of Amsterdam, Amsterdam, The Netherlands 2009 02 23 02/23/2009 Simon Portegies Zwart Robert Belleman Peter Geldof Paper Numerics Science Simon Portegies Zwart,Robert Belleman,Peter Geldof abb3c32a-1e92-4553-9a7a-812aaa364adb Graphic processors to speed-up simulations for the design of high performance solar receptors Graphics Processing Units (GPUs) are now powerful and flexible systems adapted and used for other purposes than graphics calculations (General Purpose computation on GPU -- GPGPU). We present here a prototype to be integrated into simulation codes that estimate temperature, velocity and pressure to design next generations of solar receptors. Such codes will delegate to our contribution on GPUs the computation of heat transfers due to radiation. We use Monte-Carlo line-by-line ray-tracing through finite volumes. This means data-parallel arithmetic transformations on large data structures. Our prototype is inspired by the source code of GPUBench. Our performance on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) shows speed-ups higher than 400 compared to CPU implementations, leaving most CPU computing resources available. As there were some questions pending about the accuracy of the operators implemented in GPUs, we start this report with a survey and some contributed tests on the various floating point units available on GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/589_model_small.png /content/cudazone/CUDABrowser/assets/images/applications/589_model_large.png Academia ELIAUS, UPVD 2007 03 06 03/06/2007 420 Sylvain Collange Marc Daumas David Defour Paper Graphics Science Sylvain Collange,Marc Daumas,David Defour,firstname.lastname@univ-perp.fr 003d7f3b-356c-4876-81b4-c207d76b6bf2 nHD nHD is a multi-GPU 2nd order full Godunov three-dimensional uniform-mesh Euler equations solver for calorically ideal, compressible gas. nHD uses CUDA with MPI and runs on a cluster of multi-GPU machines to accelerate computational hydrodynamics calculations. Full Godunov method solves the hydrodynamic equations by discretizing the fluid and calculating the nonlinear evolution of the discretized distribution, using the analytic solutions for Riemann problems. Thus full Godunov method can resolve arbitrary severe shocks with minimum artificial dissipation and oscillation, and is the irreplaceable method for simulations of compressible fluid, where shocks and vacuums are naturally generated. /content/cudazone/CUDABrowser/assets/images/applications/588_nHD7_small.png /content/cudazone/CUDABrowser/assets/images/applications/588_nHD7_large.png Academia Department of Physics, Kyoto University http://www.scphys.kyoto-u.ac.jp/index_e.html 2009 09 20 09/20/2009 173 Open source Takayuki Muranushi Code Computational Fluid Dynamics Science Computational Hydrodynamics, Full Godunov Method,Takayuki Muranushi,muranushi@gmail.com 9369afee-d78a-4b20-a092-c689a4a40301 SCELib3.0 SCELib is a computer program which implements the Single Center Expansion (SCE) method to describe molecular electronic densities and the interaction potentials between a charged projectile (electron or positron) and a target molecular system. The first version (CPC Catalog identifier ADMG_v1_0) was submitted to the CPC Program Library in 2000, and version 2.0 (ADMG_v2_0) was submitted in 2004.
We here announce the new release 3.0, which presents additional features with respect to the previous versions, aiming at a significant enhancement of its capabilities to deal with larger molecular systems. SCELib 3.0 allows for ab initio effective core potential (ECP) calculations of the molecular wavefunctions to be used in the SCE method in addition to the standard all-electron description of the molecule. The list of supported architectures has been updated and the code has been ported to platforms based on accelerating coprocessors, such as the NVIDIA GPGPU, and the new parallel model adopted is able to run efficiently on a mixed many-core computing system. /content/cudazone/CUDABrowser/assets/images/applications/587_Ribose_toc_small.png /content/cudazone/CUDABrowser/assets/images/applications/587_Ribose_toc_large.png Research CASPUR, Consortium for Supercomputing in Research http://www.caspur.it 2009 07 25 07/25/2009 177 Nico Sanna Paper Science Nico Sanna,n.sanna@caspur.it 9a45c9f4-df80-4c98-b6b4-98def8807dd4 Black holes on GPUs This paper describes a parallel implementation of Monte Carlo simulations using the post-Newtonian equations of motion to model black holes. We use these simulations to investigate the phase space of binary black hole systems. /content/cudazone/CUDABrowser/assets/images/applications/586_blackhole_small.png /content/cudazone/CUDABrowser/assets/images/applications/586_blackhole_large.png Academia University of Maryland 2009 08 27 08/27/2009 50 Frank Herrmann John Silberholz Matias Bellone Gustavo Guerberoff Manuel Tiglio Paper Numerics Life Sciences Science Frank Herrmann,John Silberholz,Matias Bellone,Gustavo Guerberoff,Manuel Tiglio,tiglio@umd.edu 16c382b3-7218-4288-a261-523470b8c535 GPU accelerated analysis of financial markets The compute unified device architecture is an almost conventional programming approach for managing computations on a graphics processing unit (GPU) as a data-parallel computing device.
With a maximum number of 240 cores in combination with a high memory bandwidth, a recent GPU offers resources for computational physics. We apply this technology to methods of fluctuation analysis, which includes determination of the scaling behavior of a stochastic process and the equilibrium autocorrelation function. Additionally, the recently introduced pattern formation conformity (Preis T et al 2008 Europhys. Lett. 82 68005), which quantifies pattern-based complex short-time correlations of a time series, is calculated on a GPU and analyzed in detail. Results are obtained up to 84 times faster than on a current central processing unit core. When we apply this method to high-frequency time series of the German BUND future, we find significant pattern-based correlations on short time scales. /content/cudazone/CUDABrowser/assets/images/applications/585_financial_markets_small.gif /content/cudazone/CUDABrowser/assets/images/applications/585_financial_markets_large.gif Academia Johannes Gutenberg University Mainz 2009 09 16 09/16/2009 80 Open source Tobias Preis Multimedia Paper Code Finance Science Tobias Preis,preis@uni-mainz.de c462ebc4-646d-4eaf-9714-144678d49528 Fast recursive filters for simulating nonlinear dynamic systems A fast and accurate computational scheme for simulating nonlinear dynamic systems is presented. The scheme assumes that the system can be represented by a combination of components of only two different types: first-order low-pass filters and static nonlinearities. The parameters of these filters and nonlinearities may depend on system variables, and the topology of the system may be complex, including feedback. Several examples taken from neuroscience are given: phototransduction, photopigment bleaching, and spike generation according to the Hodgkin-Huxley equations. 
The scheme uses two slightly different forms of autoregressive filters, with an implicit delay of zero for feedforward control and an implicit delay of half a sample distance for feedback control. On a fairly complex model of the macaque retinal horizontal cell it computes, for a given level of accuracy, 1-2 orders of magnitude faster than 4th-order Runge-Kutta. The computational scheme has minimal memory requirements, and is also suited for computation on a stream processor, such as a GPU (Graphical Processing Unit). /content/cudazone/CUDABrowser/assets/images/applications/584_nuclear_small.gif /content/cudazone/CUDABrowser/assets/images/applications/584_nuclear_large.gif Academia Netherlands Institute for Neuroscience 2007 04 11 04/11/2007 J. H. van Hateren Paper Imaging Life Sciences Science J. H. van Hateren 4b683456-f4de-488d-b8f6-6e9a8607538f N-Body Simulations on GPUs Commercial graphics processors (GPUs) have high compute capacity at very low cost, which makes them attractive for general purpose scientific computing. In this paper we show how graphics processors can be used for N-body simulations to obtain improvements in performance over current generation CPUs. We have developed a highly optimized algorithm for performing the O(N^2) force calculations that constitute the major part of stellar and molecular dynamics simulations. In some of the calculations, we achieve sustained performance of nearly 100 GFlops on an ATI X1900XTX. The performance on GPUs is comparable to specialized processors such as GRAPE-6A and MDGRAPE-3, but at a fraction of the cost. Furthermore, the wide availability of GPUs has significant implications for cluster computing and distributed computing efforts like Folding@Home. /content/cudazone/CUDABrowser/assets/images/applications/583_nbody_small.gif /content/cudazone/CUDABrowser/assets/images/applications/583_nbody_large.gif Academia Stanford University 2007 06 20 06/20/2007 Erich Elsen V. 
Vishal Mike Houston Paper Numerics Life Sciences Science Erich Elsen,V. Vishal,Mike Houston,pande@stanford.edu e5271230-c663-4fb6-bf23-997f7563256e High Performance Direct Gravitational N-body Simulations We present the results of gravitational direct $N$-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the $N$-body problem is implemented in ``Compute Unified Device Architecture'' (CUDA) using the GPU to speed up the calculations. We tested the implementation on three different $N$-body codes: two direct $N$-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motion for all codes is performed on the host CPU. We find that for $N > 512$ particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the $N$-body system was conserved better than to one in $10^6$ on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For $N \gtrsim 10^5$ the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af. /content/cudazone/CUDABrowser/assets/images/applications/582_nbody_small.png /content/cudazone/CUDABrowser/assets/images/applications/582_nbody_large.png Academia Section Computational Science, University of Amsterdam, Amsterdam, The Netherlands 2007 07 06 07/06/2007 Robert G. Belleman Jeroen Bedorf Simon Portegies Zwart Paper Numerics Science Robert G.
Belleman,Jeroen Bedorf,Simon Portegies Zwart a89487d4-5b55-47ad-9a1c-38363d7c0e04 Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware The objective of our research is to demonstrate the practical usage and orders of magnitude speedup of real-world applications by using alternative technologies to support high performance computing. Currently, the main barrier to the widespread adoption of this technology is the lack of development tools and case studies that typically impede non-specialists that might otherwise develop applications that could leverage these technologies. By partnering with the Innovative Systems Laboratory at the National Center for Supercomputing, we have obtained access to several novel technologies, including several Field-Programmable Gate Array (FPGA) systems, NVidia Graphics Processing Units (GPUs), and the STI Cell BE platform. Our goal is to not only demonstrate the capabilities of these systems, but to also serve as guides for others to follow in our path. To date, we have explored the efficacy of the SRC-6 MAP-C and MAP-E and SGI RASC Athena and RC100 reconfigurable computing platforms in supporting a two-point correlation function which is used in a number of different scientific domains. In a brute force test, the FPGA based single-processor system has achieved an almost two orders of magnitude speedup over a single-processor CPU system. We are now developing implementations of this algorithm on other platforms, including one using a GPU. Given the considerable efforts of the cosmology community in optimizing these classes of algorithms, we are currently working to implement an optimized version of the basic family of correlation functions by using tree-based data structures. Finally, we are also exploring other algorithms, such as instance-based classifiers, power spectrum estimators, and higher-order correlation functions that are also commonly used in a wide range of scientific disciplines. 
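The brute-force two-point correlation test described above rests on counting pairs of points by separation. The Python sketch below shows only that pair-counting kernel. It is not the authors' FPGA or GPU implementation: the function name `pair_counts`, the 3D Cartesian points and the linear distance bins are all assumptions made for illustration, and a full estimator would also compare these counts against counts from a random catalog.

```python
import math

def pair_counts(points, bin_edges):
    """Count unordered pairs of points whose separation falls in each bin.

    points: list of (x, y, z) tuples (illustrative input format).
    bin_edges: ascending distances; bin i covers bin_edges[i] <= d < bin_edges[i + 1].
    """
    counts = [0] * (len(bin_edges) - 1)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):          # O(N^2) loop over all unordered pairs
            d = math.dist(points[i], points[j])
            for b in range(len(counts)):
                if bin_edges[b] <= d < bin_edges[b + 1]:
                    counts[b] += 1
                    break
    return counts

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
print(pair_counts(points, [0.0, 1.5, 2.5, 4.0]))  # prints [1, 2, 3]
```

Each pair is independent of every other pair, which is why this loop maps naturally onto parallel hardware. The tree-based variants mentioned above avoid visiting every pair explicitly.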
/content/cudazone/CUDABrowser/assets/images/applications/581_tesla_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/581_tesla_large.jpg Academia National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign 2007 11 21 11/21/2007 25 Robert J. Brunner Volodymyr V. Kindratenko Adam D. Myers Paper Numerics Science Robert J. Brunner,Volodymyr V. Kindratenko,Adam D. Myers,rb@astro.uiuc.edu 19b10e9a-f467-4538-8587-8594b128eeda Fast k Nearest Neighbor Search The recent improvements of graphics processing units (GPU) offer to the computer vision community a powerful processing platform. Indeed, a lot of highly-parallelizable computer vision problems can be significantly accelerated using GPU architecture. Among these algorithms, the k nearest neighbor search (KNN) is a well-known problem linked with many applications such as classification, estimation of statistical properties, etc. The main drawback of this task lies in its computation burden, as it grows polynomially with the data size. In this paper, we show that the use of the NVIDIA CUDA API accelerates the search for the KNN up to a factor of 120. /content/cudazone/CUDABrowser/assets/images/applications/580_dots_small.png /content/cudazone/CUDABrowser/assets/images/applications/580_dots_large.png Research 2009 04 09 04/09/2009 120 Vincent Garcia and Eric Debreuve and Michel Barlaud Paper Numerics Science Vincent Garcia and Eric Debreuve and Michel Barlaud 8890dacc-2905-41ac-a0bd-4efc292db999 A multiphysics and multiscale software environment for modeling astrophysical systems We present MUSE, a software framework for combining existing computational tools for different astrophysical domains into a single multiphysics, multiscale application.
MUSE facilitates the coupling of existing codes written in different languages by providing inter-language tools and by specifying an interface between each module and the framework that represents a balance between generality and computational efficiency. This approach allows scientists to use combinations of codes to solve highly-coupled problems without the need to write new codes for other domains or significantly alter their existing codes. MUSE currently incorporates the domains of stellar dynamics, stellar evolution and stellar hydrodynamics for studying generalized stellar systems. We have now reached a "Noah's Ark" milestone, with (at least) two available numerical solvers for each domain. MUSE can treat multi-scale and multi-physics systems in which the time- and size-scales are well separated, like simulating the evolution of planetary systems, small stellar associations, dense stellar clusters, galaxies and galactic nuclei. In this paper we describe three examples calculated using MUSE: the merger of two galaxies, the merger of two evolving stars, and a hybrid N-body simulation. In addition, we demonstrate an implementation of MUSE on a distributed computer which may also include special-purpose hardware, such as GRAPEs or GPUs, to accelerate computations. The current MUSE code base is publicly available as open source at http://muse.li/. /content/cudazone/CUDABrowser/assets/images/applications/579_sidexside_small.png /content/cudazone/CUDABrowser/assets/images/applications/579_sidexside_large.png Academia University of Amsterdam, Amsterdam, The Netherlands 2008 07 12 07/12/2008 Simon Portegies Zwart Steve McMillan Stefan Harfst Paper Numerics Science Simon Portegies Zwart,Steve McMillan,Stefan Harfst 0ba0bc17-1da4-46b9-8af0-b885bd619e74 Accelerating Scientific Computations with Mixed Precision Algorithms On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations.
By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented. /content/cudazone/CUDABrowser/assets/images/applications/578_c_small.png /content/cudazone/CUDABrowser/assets/images/applications/578_c_large.png Academia Department of Mathematics, University of Coimbra, Coimbra,Portugal 2008 08 20 08/20/2008 15 Marc Baboulin Alfredo Buttari Jack Dongarra Code Numerics Science Marc Baboulin,Alfredo Buttari,Jack Dongarra cfb65cc8-2394-4a3f-8658-4d117f3a3953 Parallel GPU Implementation of Iterative PCA Algorithms Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA) are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library). 
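The role of Gram-Schmidt orthogonalization in keeping principal components orthogonal can be illustrated with a small pure-Python sketch. This is not the GS-PCA algorithm of the entry above, and it uses neither CUBLAS nor CBLAS. It is a simplified variant, assumed here for illustration, that runs power iteration on X^T X and re-orthogonalizes each candidate loading vector against the components already found. The function name `gs_power_pca`, the start vectors and the iteration count are hypothetical choices.

```python
import math

def gs_power_pca(rows, n_components, iterations=200):
    """Return unit-length loading vectors for mean-centered data `rows`.

    Simplified sketch: power iteration on X^T X, with a Gram-Schmidt step
    against all previously found components in every iteration.
    """
    n_features = len(rows[0])
    components = []
    for c in range(n_components):
        # Deterministic start vector (an arbitrary illustrative choice).
        p = [1.0 if k == c else 0.5 for k in range(n_features)]
        for _ in range(iterations):
            # Scores t = X p, then new loadings p = X^T t.
            t = [sum(x * w for x, w in zip(row, p)) for row in rows]
            p = [sum(row[k] * t_i for row, t_i in zip(rows, t))
                 for k in range(n_features)]
            # Gram-Schmidt: remove the part of p along earlier components.
            for q in components:
                dot = sum(a * b for a, b in zip(p, q))
                p = [a - dot * b for a, b in zip(p, q)]
            norm = math.sqrt(sum(a * a for a in p))
            p = [a / norm for a in p]
        components.append(p)
    return components

data = [[2.0, 0.0, 0.0], [-2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
first, second = gs_power_pca(data, 2)
print(first, second)
```

Without the Gram-Schmidt step, rounding errors let later components drift back toward the dominant direction, which is the loss of orthogonality the entry attributes to plain NIPALS-PCA.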
/content/cudazone/CUDABrowser/assets/images/applications/577_pca_small.png /content/cudazone/CUDABrowser/assets/images/applications/577_pca_large.png Academia Institute for Biocomplexity and Informatics, University of Calgary 2009 11 07 11/07/2009 12 M. Andrecut Paper Numerics Science M. Andrecut 9e8271a2-4d5e-4e08-868d-fb8c0e0eb80a Recent algorithm and machine developments for lattice QCD I review recent machine trends and algorithmic developments for dynamical lattice QCD simulations with the HMC algorithm for Wilson-type fermions. The topics include the trend toward multi-core processors and general purpose GPU (GPGPU) computing, and improvements on the quark determinant preconditioning, molecular dynamics integrator, and quark solvers. I also discuss the prospect on the use of these techniques on the forthcoming petaflops machines. /content/cudazone/CUDABrowser/assets/images/applications/576_ps_small.png /content/cudazone/CUDABrowser/assets/images/applications/576_ps_large.png Academia Graduate School of Science, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8526,Japan. 2009 11 11 11/11/2009 Ken-Ichi Ishikawa Paper Numerics Science Ken-Ichi Ishikawa,ishikawa@theo.phys.sci.hiroshima-u.ac.jp 58941ce0-0754-418a-9235-d94fbd05b96f Interactive Visualization of Billion Point Cosmological Simulations Despite the recent advances in graphics hardware capabilities, a brute force approach is incapable of interactively displaying terabytes of data. We have implemented a system that uses hierarchical level-of-detailing for the results of cosmological simulations, in order to display visually accurate results without loading in the full dataset (containing over 10 billion points). The guiding principle of the program is that the user should not be able to distinguish what they are seeing from a full rendering of the original data. 
Furthermore, by using a tree-based system for levels of detail, the size of the underlying data is limited only by the capacity of the IO system containing it. /content/cudazone/CUDABrowser/assets/images/applications/575_space_small.png /content/cudazone/CUDABrowser/assets/images/applications/575_space_large.png Academia California Institute of Technology, California Ave, 91126, Pasadena, CA 2009 11 13 11/13/2009 Tamas Szalay Volker Springel Gerard Lemson Paper Imaging Numerics Science Tamas Szalay,Volker Springel,Gerard Lemson 8b031bf9-1fc2-4a6b-a94f-fc9d93433d19 Parallel Algorithm for Solving Kepler's Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., chi^2) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed-precision, our GPU code provides a speed-up factor of over 600, when evaluating N_sys > 1024 model planetary systems each containing N_pl = 4 planets and assuming N_obs = 256 observations of each system.
We conclude that modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space. /content/cudazone/CUDABrowser/assets/images/applications/574_KeplersEquation_small.gif /content/cudazone/CUDABrowser/assets/images/applications/574_KeplersEquation_large.gif Academia Department of Astronomy, University of Florida 2009 12 16 12/16/2009 600 Eric B. Ford Paper Numerics Science gravitation,planetary systems,methods: numerical,techniques:radial velocities,Eric B. Ford c8752f5a-9c9d-4f67-9779-6c0ffbd62c22 Differential Equations for Monte Carlo Recycling and a GPU-Optimized Normal Quantile This article presents differential equations and solution methods for the functions of the form $A(z) = F^{-1}(G(z))$, where $F$ and $G$ are cumulative distribution functions. Such functions allow the direct recycling of samples from one distribution into samples from another. The method may be developed analytically for certain special cases, and illuminate the idea that it is a more precise form of the traditional Cornish-Fisher expansion. In this manner the model risk of distributional risk may be assessed free of the Monte Carlo noise associated with resampling. The method may also be regarded as providing both analytical and numerical bases for doing more precise Cornish-Fisher transformations. Examples are given of equations for converting normal samples to Student t, and converting exponential to hyperbolic, variance gamma and normal. In the case of the normal distribution, the change of variables employed allows the sampling to take place to good accuracy based on a single rational approximation over a very wide range of the sample space. 
The avoidance of any branching statement is of use in optimal GPU computations, and we give examples of branch-free normal quantiles that offer performance improvements in a GPU environment, while retaining the precision characteristics of well-known methods. /content/cudazone/CUDABrowser/assets/images/applications/573_montecarlo_small.png /content/cudazone/CUDABrowser/assets/images/applications/573_montecarlo_large.png Academia Department of Mathematics, King's College, The Strand, London WC2R 2LS, England 2009 01 06 01/06/2009 William T. Shaw Nick Brickman Paper Numerics Science William T. Shaw,Nick Brickman,william.shaw@kcl.ac.uk a8956c98-3e65-4a84-8ed9-2b2c84becf99 Nodal Discontinuous Galerkin Methods Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. Lately, another property of DG has been growing in importance: The majority of a DG operator is applied in an element-local way, with weak penalty-based element-to-element coupling. The resulting locality in memory access is one of the factors that enables DG to run on off-the-shelf, massively parallel graphics processors (GPUs). In addition, DG's high-order nature lets it require fewer data points per represented wavelength and hence fewer memory accesses, in exchange for higher arithmetic intensity. Both of these factors work significantly in favor of a GPU implementation of DG. Using a single US$400 Nvidia GTX 280 GPU, we accelerate a solver for Maxwell's equations on a general 3D unstructured grid by a factor of 40 to 60 relative to a serial computation on a current-generation CPU. In many cases, our algorithms exhibit full use of the device's available memory bandwidth. Example computations achieve and surpass 200 gigaflops/s of net application-level floating point work.
In this article, we describe and derive the techniques used to reach this level of performance. In addition, we present comprehensive data on the accuracy and runtime behavior of the method. /content/cudazone/CUDABrowser/assets/images/applications/572_plane_small.png /content/cudazone/CUDABrowser/assets/images/applications/572_plane_large.png Academia Division of Applied Mathematics, Brown University, Providence, RI 02912 2009 01 08 01/08/2009 60 Andreas Klockner Tim Warburton Jeffrey Bridge Paper Numerics Science Andreas Klockner,Tim Warburton,Jeffrey Bridge,andreas@brown.edu,kloeckner@brown.edu f93b62b6-b6af-497e-83a8-865af31c8d7a Parallelizing Hash-based Data Carving The ability to detect fragments of deleted image files and to reconstruct these image files from all available fragments on disk is a key activity in the field of digital forensics. Although reconstruction of image files from the file fragments on disk can be accomplished by simply comparing the content of sectors on disk with the content of known files, this brute-force approach can be time consuming. This paper presents results from research into the use of Graphics Processing Units (GPUs) in detecting specific image file byte patterns in disk clusters. A unique identifying pattern for each disk sector is compared against patterns in known images. A pattern match indicates the potential presence of an image and flags the disk sector for further in-depth examination to confirm the match. The GPU-based implementation outperforms the software implementation by a significant margin.
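The sector-matching idea in the data carving entry above can be sketched on the CPU in a few lines of Python. Several things here are assumptions rather than details from the paper: the use of SHA-256 as the per-sector identifying pattern, the 512-byte sector size, and the names `sector_hashes` and `flag_sectors`. The sketch flags candidate sectors only; confirming a match would still need the in-depth examination the entry mentions.

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector size in bytes

def sector_hashes(data, sector_size=SECTOR_SIZE):
    """Return one SHA-256 digest per fixed-size sector of `data`."""
    return [hashlib.sha256(data[off:off + sector_size]).digest()
            for off in range(0, len(data), sector_size)]

def flag_sectors(disk, known_files, sector_size=SECTOR_SIZE):
    """Map disk sector index -> name of a known file containing that sector."""
    known = {}
    for name, content in known_files.items():
        for digest in sector_hashes(content, sector_size):
            known.setdefault(digest, name)   # keep the first file seen per digest
    flagged = {}
    for index, digest in enumerate(sector_hashes(disk, sector_size)):
        if digest in known:
            flagged[index] = known[digest]
    return flagged

sector_a = b"A" * SECTOR_SIZE
sector_b = b"B" * SECTOR_SIZE
# A toy disk image in which the known file's sectors are scattered and reordered.
disk = b"\x00" * SECTOR_SIZE + sector_b + b"\xff" * SECTOR_SIZE + sector_a
print(flag_sectors(disk, {"photo.jpg": sector_a + sector_b}))
# prints {1: 'photo.jpg', 3: 'photo.jpg'}
```

Hashing each sector is independent of every other sector, so the per-sector work is the part that a GPU implementation would distribute across threads.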
/content/cudazone/CUDABrowser/assets/images/applications/571_g80_small.png /content/cudazone/CUDABrowser/assets/images/applications/571_g80_large.png Academia ELIAUS University of Perpignan 2009 01 09 01/09/2009 Sylvain Collange Yoginder Dandass Paper Imaging Science Sylvain Collange,Yoginder Dandass,sylvain.collange@univ-perp.fr 0cf53113-4ca6-4571-ad43-030fb84f5f1e ACEMD: Accelerating bio-molecular dynamics in the microsecond time-scale The high arithmetic performance and intrinsic parallelism of recent graphical processing units (GPUs) can offer a technological edge for molecular dynamics simulations. ACEMD is a production-class bio-molecular dynamics (MD) simulation program designed specifically for GPUs which is able to achieve supercomputing scale performance of 40 nanoseconds/day for all-atom protein systems with over 23,000 atoms. We illustrate the characteristics of the code, its validation and performance. We also run a microsecond-long trajectory for an all-atom molecular system in explicit TIP3P water on a single workstation computer equipped with just 3 GPUs. This performance on cost effective hardware allows ACEMD to reach microsecond timescales routinely with important implications in terms of scientific applications. /content/cudazone/CUDABrowser/assets/images/applications/570_biomoleculardynamics_small.png /content/cudazone/CUDABrowser/assets/images/applications/570_biomoleculardynamics_large.png Academia Information and Communications Technologies,Imperial College London, South Kensington, London, SW7 2AZ, UK 2009 02 05 02/05/2009 19 M. J. Harvey G. Giupponi G. De Fabritiis Paper Life Sciences Science M. J. Harvey,G. Giupponi,G. De Fabritiis,m.j.harvey@imperial.ac.uk f16b23b6-991b-410d-b339-b4815a000f00 GPUs for data processing in the MWA The MWA is a next-generation radio interferometer under construction in remote Western Australia. 
The data rate from the correlator makes storing the raw data infeasible, so the data must be processed in real-time. The processing task is of order ~10 TFLOPS. The remote location of the MWA limits the power that can be allocated to computing. We describe the design and implementation of elements of the MWA real-time data processing system which leverage the computing abilities of modern graphics processing units (GPUs). The matrix algebra and texture mapping capabilities of GPUs are well suited to the majority of tasks involved in real-time calibration and imaging. Considerable performance advantages over a conventional CPU-based reference implementation are obtained. /content/cudazone/CUDABrowser/assets/images/applications/569_wma_small.png /content/cudazone/CUDABrowser/assets/images/applications/569_wma_large.png Academia Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA 2009 02 05 02/05/2009 S. Ord L. Greenhill R. Wayth Paper Numerics Science S. Ord,L. Greenhill,R. Wayth 31397807-ebad-4ce4-a822-4f66cfe8d3ca SAPPORO: A way to turn your graphics cards into a GRAPE-6 We present Sapporo, a library for performing high-precision gravitational N-body simulations on NVIDIA Graphical Processing Units (GPUs). Our library mimics the GRAPE-6 library, and N-body codes currently running on GRAPE-6 can switch to Sapporo by a simple relinking of the library. The precision of our library is comparable to that of GRAPE-6, even though internally the GPU hardware is limited to single precision arithmetic. This limitation is effectively overcome by emulating double precision for calculating the distance between particles. The performance loss of this operation is small ( 20 percent) compared to the advantage of being able to run at high precision. We tested the library using several GRAPE-6-enabled N-body codes, in particular with Starlab and phiGRAPE.
We measured peak performance of 800 Gflop/s for running with 10^6 particles on a PC with four commercial G92 architecture GPUs (two GeForce 9800GX2). As a production test, we simulated a 32k Plummer model with equal mass stars well beyond core collapse. The simulation took 41 days, during which the mean performance was 113 Gflop/s. The GPU did not show any problems from running in a production environment for such an extended period of time. /content/cudazone/CUDABrowser/assets/images/applications/567_cpu_gpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/567_cpu_gpu_large.png Academia Astronomical Institute "Anton Pannekoek", University of Amsterdam 2009 02 25 02/25/2009 Evghenii Gaburov Stefan Harfst Simon Portegies Zwart Paper Numerics Science Evghenii Gaburov,Stefan Harfst,Simon Portegies Zwart,egaburov@strw.leidenuniv.nl ca427734-eeff-4c03-ae7e-3230ad448d64 Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphic Processing Units (GPU) is presented. The code which is on the basis of our implementation is a GNU-GPL code based on Daubechies wavelets. It shows very good performances, systematic convergence properties and an excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all these properties. In particular, the code is able to run on many cores which may or may not have a GPU associated. It is thus able to run on parallel and massive parallel hybrid environment, also with a non-homogeneous ratio CPU/GPU. With double precision calculations, we may achieve considerable speedup, between a factor of 20 for some operations and a factor of 6 for the whole DFT code. 
/content/cudazone/CUDABrowser/assets/images/applications/566_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/566_graph_large.png Research European Synchrotron Radiation Facility 2009 04 09 04/09/2009 20 Luigi Genovese Matthieu Ospici Thierry Deutsch Paper Numerics Science Luigi Genovese,Matthieu Ospici,Thierry Deutsch,luigi.genovese@esrf.fr cea6c291-c024-4ea7-9cdb-af965bd49771 Accelerator-Oriented Algorithm Transformation for Temporal Data Mining Temporal data mining algorithms are becoming increasingly important in many application domains including computational neuroscience, especially the analysis of spike train data. While application scientists have been able to readily gather multi-neuronal datasets, analysis capabilities have lagged behind, due to both lack of powerful algorithms and inaccessibility to powerful hardware platforms. The advent of GPU architectures such as Nvidia's GTX 280 offers a cost-effective option to bring these capabilities to the neuroscientist's desktop. Rather than port existing algorithms onto this architecture, we advocate the need for algorithm transformation, i.e., rethinking the design of the algorithm in a way that need not necessarily mirror its serial implementation strictly. We present a novel implementation of a frequent episode discovery algorithm by revisiting "in-the-large" issues such as problem decomposition as well as "in-the-small" issues such as data layouts and memory access patterns. This is non-trivial because frequent episode discovery does not lend itself to GPU-friendly data-parallel mapping strategies. Applications to many datasets and comparisons to CPU as well as prior GPU implementations showcase the advantages of our approach. 
/content/cudazone/CUDABrowser/assets/images/applications/564_oriented_small.png /content/cudazone/CUDABrowser/assets/images/applications/564_oriented_large.png Academia Department of Computer Science, Virginia Tech 2009 05 13 05/13/2009 431 Debprakash Patnaik Sean P. Ponce Yong Cao Paper Numerics Life Sciences Science Debprakash Patnaik, Sean P. Ponce, Yong Cao, Naren Ramakrishnan aa89b4c8-9abd-4459-adc4-00bfbb8021f7 Solving $k$-Nearest Vector Problem on Multiple Graphics Processors In a recommendation system, customers' preferences are encoded into vectors, and finding the nearest vectors to each vector is an essential part. We define this part of the problem as a $k$-nearest vector problem and give an effective algorithm to solve it on multiple graphics processor units (GPUs). By an experiment, we show that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 260 times faster than a single core implementation on a latest-generation CPU. We also show that our algorithm scales well with respect to the number of GPUs. /content/cudazone/CUDABrowser/assets/images/applications/563_k_small.png /content/cudazone/CUDABrowser/assets/images/applications/563_k_large.png Research Nihon Unisys, Ltd. 2009 01 01 01/01/2009 260 Kimikazu Kato Tikara Hosino Paper Numerics Kimikazu Kato, Tikara Hosino d4976711-d460-44f9-bfa5-ce9ca5d4c44e Elemental Accelerator Elemental Accelerator is a video processing solution designed to add power and performance to the Adobe Premiere Pro CS4 workflow. Coupled with NVIDIA Quadro series video cards, Elemental Accelerator harnesses the power of the graphics processing unit (GPU) to perform high-speed video encoding and deliver dramatic time savings over conventional CPU-only encoding solutions. Elemental Accelerator performs GPU-accelerated conversion of commonly distributed digital video formats to H.264/AVC output ready for upload to the web or burning to Blu-ray disc.
Elemental Accelerator also supports high-speed MPEG-2 encoding for DVD or digital broadcast. By executing demanding processing tasks on the GPU, Elemental Accelerator not only speeds video transcoding, it frees CPU resources to perform other tasks, resulting in a faster, more efficient video editing and production environment. /content/cudazone/CUDABrowser/assets/images/applications/562_accelerator_small.png /content/cudazone/CUDABrowser/assets/images/applications/562_accelerator_large.png Commercial Elemental http://elementaltechnologies.com/ 2009 07 10 07/10/2009 7 Commercial Elemental Application Multimedia BUY NOW Video & Audio 14428008-4747-47d1-bcd5-f59bdb8230ec Towards Flow Cytometry Data Clustering on Graphics Processing Units Like many modern techniques for scientific analysis, flow cytometry produces massive amounts of data that must be analyzed and clustered intelligently to be useful. Current manual binning techniques are cumbersome and limited in both the quality and quantity of analysis produced. To address the quality of results, a new framework applying two different sets of clustering algorithms and inference methods is implemented. The two methods investigated are fuzzy c-means with minimum description length inference, and k-medoids with BIC. These approaches lend themselves to large scale parallel processing. To address the computational demands, the Nvidia CUDA framework and Tesla architecture are utilized. The resulting performance demonstrated 1-2 orders of magnitude improvement over an equivalent sequential version. The quality of results is promising and motivates further research and development in this direction.
/content/cudazone/CUDABrowser/assets/images/applications/561_flow_small.png /content/cudazone/CUDABrowser/assets/images/applications/561_flow_large.png Academia Rochester Institute of Technology, Rochester, NY 2008 12 31 12/31/2008 159 Jeremy Espenshade Doug Roberts James Cavenaugh Paper Numerics Jeremy Espenshade,Doug Roberts,James Cavenaugh d2033a62-e770-435d-a243-a38ca5a2ac58 Search Pipeline for Gravitational Waves from Coalescing Binaries of Compact Objects We report a novel application of graphics processing units (GPUs) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A 16-fold speed-up has been achieved compared with a single central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs. /content/cudazone/CUDABrowser/assets/images/applications/560_pipeline_small.png /content/cudazone/CUDABrowser/assets/images/applications/560_pipeline_large.png Academia School of Computer Science and Engineering, The University of Western Australi 2009 07 23 07/23/2009 16 Shin Kee Chung Linqing Wen David Blair Paper Science Shin Kee Chung, Linqing Wen, David Blair, Kipp Cannon, Amitava Datta e441e930-0ee4-41c2-9fa0-6d18c307ea30 Neuroblastoma Accelerating dataflow application through the coordination of CPU and GPU /content/cudazone/CUDABrowser/assets/images/applications/559_neuro_results_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/559_neuro_results_large.jpg Research UFMG 2009 09 04 09/04/2009 90 George Teodor Paper Medical Imaging George Teodor,george@dcc.ufmg.br 46b9fa44-59f6-4987-a80c-87339387925f Abe Abe is a different type of search, searching for images with images.
/content/cudazone/CUDABrowser/assets/images/applications/558_logo_s_small.png /content/cudazone/CUDABrowser/assets/images/applications/558_logo_s_large.png Commercial Quad Streaming http://www.quadstreaming.com/ 2010 02 08 02/08/2010 10 Quad Application Imaging Image search Quad,office@quadstreaming.com d84233c9-2041-4233-818b-e72e7813b115 GPU Satellite Image Processing Using CUDA and Tesla, PCI Geomatics has optimized code for orthorectification and pansharpening of high-resolution satellite imagery in the GeoImaging Accelerator (GXL) /content/cudazone/CUDABrowser/assets/images/applications/557_GXL_Server_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/557_GXL_Server_large.jpg Commercial PCI Geomatics http://www.pcigeomatics.com 2009 03 02 03/02/2009 2 Commercial David Piekny Paper Imaging David Piekny,piekny@pcigeomatics.com 8daa1fae-c95a-42a0-8d4a-82aab0b0d346 FlaCuda encoder Opensource CUDA-enabled FLAC encoder /content/cudazone/CUDABrowser/assets/images/applications/556_flacuda_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/556_flacuda_large.jpg Research 2009 09 10 09/10/2009 3 Open source Gregory S. Chudov Application Code Video & Audio Gregory S. Chudov, ef913ea0-85cb-4e53-a970-00b879982728 Large Integer/polynomial multiplication on GPU The paper describes the first implementation of large integer and/or polynomial multiplication using the number theoretic transform on GPU with 24-bit primes. The efficient 24-bit modular reduction is performed in floating-point arithmetic. Our algorithm exploits fused-multiply add (FMA) capabilities of the graphics hardware. 
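Multiplication via the number theoretic transform, as named in the entry above, can be sketched in plain Python. This is an illustration of the general technique, not the paper's GPU code: it uses Python integers instead of floating-point modular reduction with FMA, and it works with a single prime. The modulus 7340033 (which fits in 24 bits), the base 3 and the names `ntt` and `multiply` are assumptions chosen for the example. Results are exact only while every product coefficient stays below the modulus.

```python
MOD = 7340033   # 7 * 2**20 + 1, a prime below 2**24 (assumed choice)
ROOT = 3        # quadratic non-residue mod MOD; its powers give 2**k-th roots of unity

def ntt(values, invert=False):
    """In-place iterative NTT; len(values) must be a power of two (at most 2**20)."""
    n = len(values)
    j = 0
    for i in range(1, n):                    # bit-reversal permutation
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            values[i], values[j] = values[j], values[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)   # root of unity of order `length`
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)          # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):      # butterfly operations
                u = values[k]
                v = values[k + half] * w % MOD
                values[k] = (u + v) % MOD
                values[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            values[i] = values[i] * n_inv % MOD

def multiply(a, b):
    """Multiply polynomials given as coefficient lists (lowest degree first), mod MOD."""
    size = 1
    while size < len(a) + len(b) - 1:
        size <<= 1
    fa = a + [0] * (size - len(a))
    fb = b + [0] * (size - len(b))
    ntt(fa)
    ntt(fb)
    product = [x * y % MOD for x, y in zip(fa, fb)]   # pointwise product
    ntt(product, invert=True)
    return product[:len(a) + len(b) - 1]

print(multiply([1, 2, 3], [4, 5]))  # (1 + 2x + 3x^2)(4 + 5x) -> prints [4, 13, 22, 15]
```

Large integers can reuse the same routine by treating their digits as coefficients and propagating carries afterwards. Handling coefficients larger than one prime allows usually means combining several primes, which this sketch does not attempt.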
DOI: http://dx.doi.org/10.1007/978-3-642-03644-6_11 /content/cudazone/CUDABrowser/assets/images/applications/555_mul_image_small.png /content/cudazone/CUDABrowser/assets/images/applications/555_mul_image_large.png Academia Max Planck Institute for Informatics http://www.mpi-inf.mpg.de 2009 08 21 08/21/2009 Pavel Emeliyanenko Paper Numerics Science Pavel Emeliyanenko,asm@mpi-sb.mpg.de 82150290-d681-44c6-a606-35c9565949a8 A Parallel Annealing Method for Automatic Color Cervigram Image Segmentation The accurate and automatic segmentation of tissue regions in cervigram images can aid in the identification and classification of precancerous regions. We implement and analyze four GPU (Graphics Processing Unit) based clustering algorithms: K-means, mean shift, deterministic annealing, and spatially coherent deterministic annealing. From our results, we propose a novel parallel algorithm using the CUDA programming language for digital cervigram segmentation and clustering. The first step of our fully automatic method is to compute the number of modes in the feature space of a color cervigram image using the mean shift clustering algorithm. Next, we use the number of modes in a novel spatially coherent deterministic annealing optimization technique to produce an approximate optimal solution for the clustering problem. Our GPU based methods perform approximately 38x (deterministic annealing), 134x (mean shift), and 276x (spatially coherent deterministic annealing) faster than an equivalent CPU solution. Our implementation decreases the computational time of an annealing method on a 1280x872 pixel image from 5 hours 3 minutes to 72.12 seconds, enabling the use of this optimization method in clinical settings and on large cervigram datasets. 
/content/cudazone/CUDABrowser/assets/images/applications/554_edkim_miccaigpuimage_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/554_edkim_miccaigpuimage_large.jpg Academia Lehigh University 2009 08 15 08/15/2009 276 Edward Kim Paper Medical Imaging Edward Kim,edk208@lehigh.edu 30159376-0082-49a8-9716-a55f0b2fb707 Predicting Lightning in Protoplanetary Discs We study the role of dust-dust collisional charging in protoplanetary discs. Although in some cases the charge densities for different species differ by 20 orders of magnitude, we transformed the algorithm so that it gives sufficiently precise solutions using only single-precision floats. This made the program run faster on GPGPUs, allowing us to survey a wide range of parameter space at high resolution. As a result, we found that as dust condenses, the charge distribution passes through four phases. In one of these phases the electrostatic field grows as the fourth power of the dust density and lightning takes place. /content/cudazone/CUDABrowser/assets/images/applications/553_lightning-here_small.png /content/cudazone/CUDABrowser/assets/images/applications/553_lightning-here_large.png Academia Theoretical Astrophysics Group, Department of Physics, Kyoto University http://www-tap.scphys.kyoto-u.ac.jp/ 2009 08 11 08/11/2009 140 Takayuki Muranushi Paper Numerics Science Takayuki Muranushi,muranushi@gmail.com 2c8af116-762a-4968-b59c-bdc1328b7461 Optimization of FTLE Calculation We calculate the Finite-Time Lyapunov Exponent (FTLE) for several fluid flows and find that CUDA helps us immensely.
/content/cudazone/CUDABrowser/assets/images/applications/552_rlw_vortex_small.png /content/cudazone/CUDABrowser/assets/images/applications/552_rlw_vortex_large.png Academia California Institute of Technology 2009 08 14 08/14/2009 1000 Raymond Jimenez Application Multimedia Paper Code Computational Fluid Dynamics Raymond Jimenez,raymondj@caltech.edu d508073b-38bf-4fc7-b99b-1ad6ff71b868 CBDA: Cyclotron Beam Dynamics Analysis Software for the Accelerator Physics /content/cudazone/CUDABrowser/assets/images/applications/551_demo2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/551_demo2_large.jpg Research JINR http://www.jinr.ru 2009 07 01 07/01/2009 60 Perepelkin Evgeny Application Science Perepelkin Evgeny,pevgeny@jinr.ru,Cyclotron, Space charge effect, Acceleration 1b0a6a82-8181-4dd3-9477-0e7d523af249 Efficient Acceleration of Asymmetric Cryptography on GPUs We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations. 
/content/cudazone/CUDABrowser/assets/images/applications/550_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/550_graph_large.png Academia Trinity College Dublin, Ireland http://www.tcd.ie/ 2008 12 01 12/01/2008 4 Owen Harrison Paper Numerics Owen Harrison,harrisoo@cs.tcd.ie 4eb88033-a856-44e6-aa9c-f8143e624219 StandardModel on GPU This project is a GPU port of the "Standard Model of Visual Cortex" (CBCL, MIT, by Riesenhuber M., Poggio T., Serre T., Wolf L.) /content/cudazone/CUDABrowser/assets/images/applications/549_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/549_logo_large.png Research 2009 08 10 08/10/2009 100 Open source Giacomo Spigler Application Code Graphics Science Signal Processing Giacomo Spigler,spiglerg@gmail.com a927fa37-9635-48c1-8b51-1f237dec4035 Cuda Jpeg Decoder jpeg decoder on GPU /content/cudazone/CUDABrowser/assets/images/applications/548_screenshot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/548_screenshot_large.jpg Commercial 2U http://www.2uinfotech.com 2009 08 13 08/13/2009 10 Open source Ramazan Dincer Application Code Imaging Ramazan Dincer,rados82@gmail.com 36253542-171f-491b-8646-f224a5694e8f Hyperspectral unmixing on NVidia GPUs Hyperspectral images are now routinely used in several Earth observation and planetary exploration missions. These images can be seen as high-dimensional data cubes with three dimensions: two of which represent the spatial domain, while the third one comprises hundreds of spectral bands collected at different wavelengths. As a result, each pixel is represented by a spatial localization and a spectral signature which provides very detailed information about its composition. One of the main problems in the analysis of hyperspectral data cubes is the problem of mixed pixels, which arise when the spatial resolution of the sensor is not enough to separate spectrally distinct materials. 
In this case, several spectrally pure signatures (endmembers) are combined into the same (mixed) pixel. Hyperspectral unmixing techniques comprise two stages: 1) automatic identification of spectral endmembers; and 2) estimation of the fractional abundance of each endmember in each pixel. The unmixing process is quite computationally expensive, mainly due to the extremely high dimensionality of hyperspectral data cubes. In this work, we develop a computationally efficient implementation of the full hyperspectral unmixing chain using different endmember extraction and fractional abundance estimation algorithms. The proposed methodology has been implemented, using the Compute Unified Device Architecture (CUDA), on an NVidia GeForce 8800 GTX GPU, achieving speedups on the order of 25x when compared to an optimized implementation of the same code on a dual-core CPU. /content/cudazone/CUDABrowser/assets/images/applications/547_hyperspectralcube_small.png /content/cudazone/CUDABrowser/assets/images/applications/547_hyperspectralcube_large.png Academia Technology of Computers and Communications, University of Extremadura http://www.umbc.edu/rssipl/people/aplaza 2009 08 12 08/12/2009 25 Antonio Plaza Application Imaging Antonio Plaza,aplaza@unex.es 31d6fa65-0651-4b66-8e96-75dda84f13f6 Tracking as Segmentation of Spatial-Temporal Volumes In this work, we interpret tracking as segmentation of spatial-temporal volumes. Segmentation is done by a variational approach using anisotropic weighted Total Variation (TV) regularization.
All major parts of this approach are computed on the GPU using CUDA /content/cudazone/CUDABrowser/assets/images/applications/546_cuda_zone_emmcvpr_small.png /content/cudazone/CUDABrowser/assets/images/applications/546_cuda_zone_emmcvpr_large.png Academia Graz University of Technology, Institute for Computer Graphics and Vision http://www.gpu4vision.org 2009 08 12 08/12/2009 Markus Unger Multimedia Paper Numerics Science Video & Audio Computer Vision Markus Unger,info@gpu4vision.org 6217e565-1518-4cfd-9644-e7206c32a1a5 Performance Comparison of Single-Precision SPICE Model-Evaluation on FPGA, GPU, Cell, and Multi-core Processors Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE Model-Evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3--131x for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models. /content/cudazone/CUDABrowser/assets/images/applications/545_ic_logo_basic_small.png /content/cudazone/CUDABrowser/assets/images/applications/545_ic_logo_basic_large.png Academia U. Penn. 
Implementation of Computation Lab 2009 08 31 08/31/2009 133 Nachiket Kapre Paper Electronic Design Automation Nachiket Kapre,nachiket@ieee.org d0e01d7c-0f0f-4dcb-8b28-e0f98e87f914 Single Pass Depth Peeling via CUDA Renderer Multi-fragment effects play important roles in many graphics applications, which require operations on more than one fragment per pixel. The classical depth peeling algorithm provides a simple but robust solution by peeling off one layer each pass, but the multiple rasterization passes become a performance bottleneck for large and complex scenes. Ideally, we prefer to capture and sort multiple fragments in a single pass, which is difficult because the fragments generated in the graphics pipeline are not allowed to be scattered to arbitrary positions of the render targets. Compute Unified Device Architecture (CUDA) provides more flexible control over the GPU memory, but access to the fragments generated by the graphics pipeline is not yet supported. In this work we design a CUDA rasterizer so that many graphics applications can benefit from the free control of GPU memory, especially for the multi-fragment effects. We present two efficient schemes to capture and sort multiple fragments per pixel in a single geometry pass via the atomic operations of CUDA without read-modify-write (RMW) hazards. Experimental results show significant speedup over classical depth peeling, especially for large scenes.
/content/cudazone/CUDABrowser/assets/images/applications/544_dragon_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/544_dragon_large.jpg Research Institute of Software, Chinese Academy of Sciences 2009 08 10 08/10/2009 10 Open source Fang Liu Paper Graphics Fang Liu,liuf@ios.ac.cn b4cb74db-42cd-4def-86f0-65c87bd36187 FOLKI GPU Fast Optical Flow on GPU at video rate for full HD resolution /content/cudazone/CUDABrowser/assets/images/applications/543_icone_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/543_icone_large.jpg Research Onera http://www.onera.fr 2009 07 23 07/23/2009 100 Open source Aurelien Plyer Application Code Computational Fluid Dynamics Imaging Signal Processing Video & Audio Aurelien Plyer,aurelien.plyer@gmail.com bdb4e756-195f-4243-94f7-b10b7e11e2bd Iterative CUDA Iterative CUDA is a CUDA-based solver package for large, sparse linear systems. /content/cudazone/CUDABrowser/assets/images/applications/542_sparse-city-small_small.png /content/cudazone/CUDABrowser/assets/images/applications/542_sparse-city-small_large.png Academia Brown University http://brown.edu 2009 08 01 08/01/2009 10 Open source Andreas Kloeckner Code Computational Fluid Dynamics Numerics Libraries Science Andreas Kloeckner,kloeckner@dam.brown.edu,solver,cg,iterative,linear system c620d360-fac3-43d3-a045-9c1aae75ec57 SARRACUDA: Synthetic Aperture Radar Range-Doppler Algorithm using CUDA This application is a GPU version of a Synthetic Aperture Radar focusing algorithm. The implemented algorithm is the Range-Doppler algorithm, one of the most accurate and widely used. Synthetic Aperture Radar (SAR) is an imaging radar for earth observation from satellite and airborne manned/unmanned platforms; it is currently operational in recently launched polar-orbiting platforms such as TerraSAR-X, RadarSAT-2 and Cosmo-SkyMed as well as in previous missions.
Applications are tailored to disaster observation and management, mapping of renewable resources, geological mapping, snow/ice mapping and strategic surveillance of military sites. The data stream produced by high resolution SAR systems may exceed 1 Gb/s and the real-time or near real-time processing represents a demanding requirement for on-board or even ground-based processing systems. The remote sensing community and the space agencies spend a considerable amount of time and money each year to implement efficient and accurate processors for SAR data. Moreover, the scientific community is increasingly oriented toward a wide range of applications where the first step is the focusing of SAR data. The recent development and diffusion of multicore platforms opens new horizons and breaks barriers in the design of architectures for massively parallel processing of SAR data, without losing resolution and/or accuracy. /content/cudazone/CUDABrowser/assets/images/applications/541_sarracuda_small.png /content/cudazone/CUDABrowser/assets/images/applications/541_sarracuda_large.png Academia Universita degli Studi del Sannio http://www.ing.unisannio.it 2009 08 05 08/05/2009 15 Open source Carmine Clemente Paper Signal Processing Remote Sensing Carmine Clemente,carmineclemente@gmail.com 9ca18ade-f66f-4f40-9743-e6ebb760de33 Libra SDK Libra SDK is a scientific developer kit for building simple and fast cross CPU-GPU applications suited for scientific computations. Libra 1.1 SDK includes C/C++ matlab style API, sample programs and documentation.
Example code and a downloadable trial version is available from GPU Systems website http://www.gpusystems.com /content/cudazone/CUDABrowser/assets/images/applications/540_logo_bg_small.png /content/cudazone/CUDABrowser/assets/images/applications/540_logo_bg_large.png Commercial GPU Systems http://www.gpusystems.com 2009 06 24 06/24/2009 Commercial Marco Hjerpe Multimedia Paper Computational Fluid Dynamics,Digital Content Creation,Electronic Design Automation,Finance,Game Physics,Graphics,Imaging,Medical Imaging,Numerics,Life Sciences,Libraries,Oil & Gas,Science,Signal Processing,Video & Audio,matlab programming Marco Hjerpe,marco.hjerpe@gpusystems.com,CPU,GPU,C++ programming,gpgpu,matlab,CUDA,OpenCL 87d90d59-5c3a-4094-a8cf-0e8da5326193 Real-time optical manipulation of micron sized structures using GPU generated holograms Holographic optical tweezers allow the three dimensional, dynamic, multipoint manipulation of micron sized dielectric objects. Exploiting the massive parallel architecture of modern GPUs we can generate highly optimized holograms at video frame rate allowing the interactive micromanipulation of complex structures. /content/cudazone/CUDABrowser/assets/images/applications/539_slm_small.png /content/cudazone/CUDABrowser/assets/images/applications/539_slm_large.png Academia CNR-INFM, CRS-SOFT Dipartimento di Fisica, Universita di Roma La Sapienza 2009 07 23 07/23/2009 350 S. Bianchi R. Di Leonardo Paper Imaging Science S. Bianchi,R. Di Leonardo a0d09099-5643-406d-9d4a-9e7053425028 The Living Application: a Self-Organising System for Complex Grid Tasks We present the living application, a method to autonomously manage applications on the grid. During its execution on the grid, the living application makes choices on the resources to use in order to complete its tasks. These choices can be based on the internal state, or on autonomously acquired knowledge from external sensors. 
By giving limited user capabilities to a living application, the living application is able to port itself from one resource topology to another. The application performs these actions at run-time without depending on users or external workflow tools. We demonstrate this new concept in a special case of a living application: the living simulation. Today, many simulations require a wide range of numerical solvers and run most efficiently if specialized nodes are matched to the solvers. The idea of the living simulation is that it decides itself which grid machines to use based on the numerical solver currently in use. In this paper we apply the living simulation to modelling the collision between two galaxies in a test setup with two specialized computers. This simulation switches at run-time between a GPU-enabled computer in the Netherlands and a GRAPE-enabled machine that resides in the United States, using an oct-tree N-body code whenever it runs in the Netherlands and a direct N-body solver in the United States. /content/cudazone/CUDABrowser/assets/images/applications/538_self-organism_small.png /content/cudazone/CUDABrowser/assets/images/applications/538_self-organism_large.png Academia Section Computational Science, University of Amsterdam, Amsterdam, the Netherlands 2009 07 23 07/23/2009 D. Groen S. Harfst S. Portegies Zwart Paper Numerics Science Signal Processing D. Groen, S. Harfst, S. Portegies Zwart,djgroen@science.uva.nl 057f342d-84b4-4a12-ae35-5208c51ed958 Synthetic Aperture Radar Back-Projection Algorithm Synthetic Aperture Radar (SAR) uses microwaves to create images of the earth. These images provide information not visible to the naked eye, and can be made despite visibility conditions. SAR image formation requires massive amounts of computation and is hard to do in real-time. The best SAR processing algorithm, known as back-projection, is O(N^3) where N is the number of pixels -- which can be many thousands.
To reduce computation, suboptimal FFT-based algorithms have traditionally been used despite the various limitations and image degradation effects these algorithms have. The back-projection algorithm is however ideal for a highly parallel processor like NVIDIA's GPGPUs. At the Brigham Young University Microwave Earth Remote Sensing Laboratory we have been able to take advantage of the GPGPU's massive processing power to reduce the processing time for a 1500x1600 image that took 31 minutes in a well-optimized, single-threaded C implementation down to 5.6 seconds using one of the four processors of an NVIDIA S1070. This is even faster than many FFT-based algorithms! We hope to continue to build on this speedup to make further advancements in SAR imaging. /content/cudazone/CUDABrowser/assets/images/applications/537_sonar_small.png /content/cudazone/CUDABrowser/assets/images/applications/537_sonar_large.png Academia Brigham Young University Microwave Earth Remote Sensing Laboratory http://www.mers.byu.edu/SAR.html#YIFSAR 2009 08 03 08/03/2009 300 David G. Long Multimedia Paper Signal Processing David G. Long,long@ee.byu.edu 069854c1-e5b6-44b1-84a9-eb00831c8fae Julia 4D Ray tracing of quaternion Julia set /content/cudazone/CUDABrowser/assets/images/applications/540_Julia4D_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/540_Julia4D_large.jpg Research homemade 2009 08 02 08/02/2009 Charles Strub Application Multimedia Graphics Numerics Charles Strub,charles.strub@gmail.com,Julia 4D quaternion ray tracing 7cc11512-08ba-4264-a4c7-fa1c31ae47b2 cudaseg (Fast Level Set Segmentation of Biomedical Images using Graphics Processing Units) In this project we have engineered a parallel level set implementation using the NVIDIA CUDA framework to accelerate image and volume segmentations.
The final source code and thesis can be downloaded on this site. /content/cudazone/CUDABrowser/assets/images/applications/538_cudasegall_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/538_cudasegall_large.jpg Academia University of Oxford 2009 06 02 06/02/2009 Hormuz Mostofi Application Multimedia Paper Code Graphics Imaging Hormuz Mostofi, d86e78ef-8ada-44e6-ad42-fc6386e55cc0 Cholesky Decompositions Cholesky factorization for dense matrices, reaching a 450x speedup on a GTX 285 /content/cudazone/CUDABrowser/assets/images/applications/536_http_imgload.cgi_small.png /content/cudazone/CUDABrowser/assets/images/applications/536_http_imgload.cgi_large.png Freelance 2009 09 05 09/05/2009 450 lixiuyu Application Science lixiuyu,cyrosly@163.com 34d84e30-9ec7-42e6-a411-63810d133fc4 A GPU based GPS software receiver Off-the-shelf graphics processing units provide low-cost massive parallel computing performance, which can be utilized for the implementation of a GPS software receiver. In order to realize a real-time capable system the crucial stages of the receiver should be optimized to suit the requirements of a parallel processor. Moreover, the receiver should be capable of providing wider correlation functions and easy access to the spectral domain of the signals. Thus, the most suitable correlation algorithm, which forms the core part of each receiver, should be chosen and implemented on the graphics processor. Since the sampling rate of the received signal limits the real-time capabilities of the software radio it is necessary to determine an optimum value, considering that the precision of the observable varies with sampling bandwidth. We are going to discuss details and present our single frequency multi-channel implementation, which is capable of operating in real-time mode.
Our implementation differs from other solutions in the width of the correlation function and allows simple handling of data in the spectral domain. Comparison with output from a commercial hardware receiver, which shares the antenna with the software radio, confirms the consistency and accuracy of our development. /content/cudazone/CUDABrowser/assets/images/applications/535_gpsgpu_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/535_gpsgpu_large.jpg Research National Institute of Information and Communications Technology, Japan http://www.nict.go.jp 2009 08 08 08/08/2009 Thomas Hobiger Paper Science Signal Processing Thomas Hobiger,hobiger@nict.go.jp ce75850e-89b5-4c0f-af44-1f5f66f91cd1 Framework for efficient and scalable execution of domain-specific templates on GPUs Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts - processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of off-loaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, that uses lower level frameworks such as CUDA.
We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7 - 7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6GB and 17GB, respectively, on GPU platforms with only 768MB and 1.5GB of memory. /content/cudazone/CUDABrowser/assets/images/applications/534_image_small.png /content/cudazone/CUDABrowser/assets/images/applications/534_image_large.png Commercial NEC Labs, Berkeley, Purdue 2009 05 01 05/01/2009 Narayanan Sundaram Anand Raghunathan Srimat T. Chakradhar Paper Imaging Medical Imaging machine learning Narayanan Sundaram, Anand Raghunathan, and Srimat T. Chakradhar a9aaf71b-fcf4-4cab-a580-5029383afb71 Massively Parallel Population-Based Monte Carlo Methods Implementation of population-based MCMC and a sequential Monte Carlo sampler for inference in a Gaussian mixture model and a particle filter for a factor stochastic volatility state-space model. /content/cudazone/CUDABrowser/assets/images/applications/533_b1_small.png /content/cudazone/CUDABrowser/assets/images/applications/533_b1_large.png Academia University of Oxford 2009 05 14 05/14/2009 500 Open source Anthony Lee Christopher Yau Michael B. Giles Arnaud Doucet Christopher C. Holmes Application Paper Code Statistics Anthony Lee,lee@stats.ox.ac.uk f549113c-45f8-4aff-96b4-b89db7abe5bb 3D Image Deconvolution on GPUs A popular approach to solving the inverse problem of image deconvolution is to use iterative methods. Iterative deconvolution can provide better results than simpler methods at a cost of higher computational complexity and processing time.
In this work we investigate the use of graphics processing units (GPUs) and CUDA to accelerate the execution of one such iterative algorithm, the Richardson-Lucy (RL) algorithm. We compare performance results for a number of 3D Richardson-Lucy implementations on both the CPU and GPU, showing that our best GPU implementation, using Fourier space convolutions (CUFFT), significantly outperforms our best CPU implementation, which uses a publicly available and highly optimised Fast Fourier Transform (FFT) library. L. Domanski, P. Vallotton, and D. Wang. Two and Three-Dimensional Image Deconvolution on Graphics Hardware. In Anderssen, R.S., R.D. Braddock and L.T.H. Newham (eds) 18th World IMACS/MODSIM Congress, Cairns, Australia, pages 1010--1016, 13-17 July 2009. ISBN: 978-0-9758400-7-8. http://www.mssanz.org.au/modsim09/C5/domanski.pdf /content/cudazone/CUDABrowser/assets/images/applications/532_psfteaser_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/532_psfteaser_large.jpg Research Commonwealth Scientific and Industrial Research Organisation http://www.csiro.au/ 2009 07 13 07/13/2009 Luke Domanski Multimedia Paper Imaging Medical Imaging Luke Domanski,Luke.Domanski@csiro.au,image deconvolution, image restoration, microscopy, CUDA, CUFFT 0313f2ec-fa80-4682-8fb2-8e855c9f2e66 PAPER - Accelerating Parallel Evaluations of ROCS PAPER is a GPU-accelerated implementation of Gaussian molecular shape overlay (the algorithm in OpenEye ROCS) running on NVIDIA graphics cards. We have demonstrated multiple-order-of-magnitude speedups relative to a CPU-based implementation of the same algorithm, and 5x speedup relative to OpenEye ROCS even on low-end graphics hardware (an NVIDIA 8600GT). 
/content/cudazone/CUDABrowser/assets/images/applications/531_gpuROCS_thumb_small.png /content/cudazone/CUDABrowser/assets/images/applications/531_gpuROCS_thumb_large.png Academia Department of Computer Science, Stanford University 2009 05 06 05/06/2009 35 Imran Haque Application Paper Code Life Sciences Imran Haque,ihaque@cs.stanford.edu,paper openeye rocs ffe53df7-0183-443c-a269-710b724d1cb7 librysq librysq is C/C++ implementation of the Rys quadrature for computing arbitrary electron repulsion integrals on CPU and CUDA GPUs. A FORTRAN interface is provided for compatibility with the existing chemistry packages. /content/cudazone/CUDABrowser/assets/images/applications/529_MOS-902-8-400x300_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/529_MOS-902-8-400x300_large.jpg Research Source Forge 2009 03 29 03/29/2009 andrey asadchev Paper Numerics Science andrey asadchev, 0fc84b69-6d38-4463-adae-0d6d3ad2fdb0 GPU Flame Fractal Renderer Renderer for flam3 cosmic recursive fractal flames implemented on GPU. Requires a CUDA-capable graphics card. /content/cudazone/CUDABrowser/assets/images/applications/528_screenshot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/528_screenshot_large.jpg Research SourceForge 2009 07 24 07/24/2009 Keldor Application Graphics Keldor,Keldor@users.sourceforge.net 027d6d57-37b1-4657-9df6-394c24092014 Combining Molecular Dynamics with Bayesian Analysis To Predict and Evaluate Ligand-Binding Mutations in Influenza Hemagglutinin The influenza virus infects people and animals by binding to complex sugar molecules on the surface of the respiratory tract. Bird viruses bind most strongly to bird cell-surface sugars and human viruses bind most strongly to human cell-surface sugars. As the recent swine-origin influenza virus has demonstrated, there is considerable overlap between the binding ability of human and pig viruses to cells of the other host. 
Changes to this binding affinity are one key component for viruses to make a jump between species, and it is difficult to predict the necessary mutations ahead of time. We would like to predict high-risk mutations to enable better surveillance and early control of potential inter-species transmission events. This work represents a first step in that direction, as we examine mutations to H5N1 avian influenza that alter ligand binding. We use Folding@Home as a powerful computational screen to evaluate mutations that will eventually require experimental testing to verify. /content/cudazone/CUDABrowser/assets/images/applications/527_ja904_small.png /content/cudazone/CUDABrowser/assets/images/applications/527_ja904_large.png Academia Departments of Chemistry and Structural Biology, Stanford University 2009 07 28 07/28/2009 Peter M Kasson Paper Life Sciences Peter M Kasson,kasson@stanford.edu,folding@home influenza d1233002-2132-43e2-8527-3bf5159ddf19 ViVid Python framework for video processing and content analysis using CUDA for acceleration. /content/cudazone/CUDABrowser/assets/images/applications/525_6702-Water_Life_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/525_6702-Water_Life_large.jpg Research Source Forge 2009 04 18 04/18/2009 Dennis Lin Mert Dikmen Code Video & Audio Dennis Lin,Mert Dikmen,Dennis_Lin@sourceforge.net 0b495833-3a0f-45e7-88ab-43b3f28cc0fe SSbump Generator A GUI interface to a tool for generating SSBumps (Self Shadowed Bump Maps). Includes a CUDA GPU rendering extension. 
/content/cudazone/CUDABrowser/assets/images/applications/524_screenshot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/524_screenshot_large.jpg Research SourceForge 2008 12 31 12/31/2008 SARGE ssbumpgenerator Application Imaging Numerics Science SARGE,ssbumpgenerator,SARGE@users.sourceforge.net 2444c3a0-b3b0-4584-9b11-6f566f9030ee Open64 Compiler and Tools The Open64 Compiler and Tools site is dedicated to the continued development of the former SGI Pro64(TM) compiler for the IA64, x86, CUDA and MIPS architecture. /content/cudazone/CUDABrowser/assets/images/applications/523_nvidia-2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/523_nvidia-2_large.jpg Research NVIDIA http://www.nvidia.com 2009 04 04 04/04/2009 Alban Douillet Juergen Ributzka Suneel Jain Application Numerics Alban Douillet,Juergen Ributzka,Suneel Jain,adouillet@nvidia.com 90e8e358-dfd7-493e-b829-36373e4ab5ee CUDA-EC A fast parallel error correction tool for short reads. /content/cudazone/CUDABrowser/assets/images/applications/522_cuda-ec_small.png /content/cudazone/CUDABrowser/assets/images/applications/522_cuda-ec_large.png Academia Nanyang Technological University 2009 04 07 04/07/2009 Haixiang Shi Application Numerics Haixiang Shi,Haixiang_Shi@users.sourceforge.net 4c8b5fb1-15cd-4251-bec3-f6de3a414800 pfsRTtmo This project provides realtime implementations of popular HDR tone mapping operators on GeForce 8800 GPUs using the CUDA programming environment. 
/content/cudazone/CUDABrowser/assets/images/applications/521_screenshot_thumb_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/521_screenshot_thumb_large.jpg Research SourceForge 2008 12 31 12/31/2008 07/30/2008 Peter Kipfer Application Imaging Science Peter Kipfer,prkipfer@users.sourceforge.net 6de04003-5433-4de2-bf86-c308ac51fd12 GPU Accelerated Real Time HDR Rendering A real-time interactive display was developed to showcase timelapse photos by using motion estimation results to produce unique high-dynamic-range images as a function of the viewer's position in front of the display. /content/cudazone/CUDABrowser/assets/images/applications/520_ir_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/520_ir_large.jpg Academia University of Toronto http://www.eyetap.org 2009 05 05 05/05/2009 Raymond Lo Eric Tran Multimedia Graphics Imaging Raymond Lo,Eric Tran 63d283aa-d137-4705-899c-cbb174ef07ba GPU accelerated dose calculations for radiotherapy We developed a ray-tracing algorithm for radiotherapy dose calculations that enables (nearly) real-time calculation of the dose for realistic radiotherapy patient data-sets. This reduces the workload for manual determination of the optimal treatment plan. It also speeds up automated optimization of (advanced) radiotherapy treatment plans and/or re-planning after on-line imaging of the patient.
/content/cudazone/CUDABrowser/assets/images/applications/519_dosedistro_1e6_ptv_only_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/519_dosedistro_1e6_ptv_only_large.jpg Academia Academic Medical Center, University of Amsterdam http://www.amc.nl/radiotherapie 2009 08 12 08/12/2009 10 M.de Greef Multimedia Paper Numerics Life Sciences Science M.de Greef,m.degreef@amc.uva.nl ce36336b-2796-4924-9cb9-a79e4d7992e6 OpenMS An open-source framework for mass spectrometry /content/cudazone/CUDABrowser/assets/images/applications/518_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/518_logo_large.png Academia Center for Bioinformatics, Saarland University http://bioinf-www.bioinf.uni-sb.de/ 2009 01 14 01/14/2009 Rene Hussong Paper Code Life Sciences Rene Hussong,rene@bioinf.uni-sb.de,openms proteomics 228465d3-3c07-4215-ad0f-8bba8d3f87a8 parallel for A data parallel scientific programming model. Compiles efficiently to different platforms like distributed memory (MPI), shared memory multi-processor (pthreads), Cell BE processor, NVIDIA CUDA, SIMD vectorization (SSE, Altivec), and sequential C++ code. /content/cudazone/CUDABrowser/assets/images/applications/517_simd_mimd_small.png /content/cudazone/CUDABrowser/assets/images/applications/517_simd_mimd_large.png Research CISCO 2008 12 31 12/31/2008 GWZ Code Numerics Science GWZ,gwz@cisco.com 2c01c333-373e-49f3-9b2e-41a3d14db455 multiDAC multiDAC is intended to become a user-friendly tool for image and video processing in the field of deformation/movement analysis. It is written in C# with some C routines using CPU/GPU parallelization (e.g. CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/516_screenshot_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/516_screenshot_large.jpg Research SourceForge 2009 07 30 07/30/2009 purzel42 Application Video & Audio purzel42,purzel@users.sourceforge.net feb82039-4088-43ec-9118-1d2a1c80b349 CUDA-NN A parallel version of Neural Networks using CUDA for optimization, data mining, etc. /content/cudazone/CUDABrowser/assets/images/applications/515_datamining7_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/515_datamining7_large.jpg Academia Nanyang Technological University 2008 12 31 12/31/2008 Haixiang Shi Code Numerics Life Sciences Science Haixiang Shi,Haixiang_Shi@users.sourceforge.net f11599cd-b3a5-4b08-a9af-801b67ebd826 IllustStudio IllustStudio is a paint tool that lets users draw pen strokes similar to real ones and expand their range of expression. IllustStudio includes CUDA-enabled filters that achieve high-speed filtering through GPU computation. According to our research*, processing with CUDA is 35 times faster than without CUDA. * According to measurements by CELSYS. /content/cudazone/CUDABrowser/assets/images/applications/514_illuststudio_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/514_illuststudio_large.jpg CELSYS,Inc. http://www.celsys.co.jp/ 2009 07 29 07/29/2009 35 CELSYS,Inc. Application Digital Content Creation Graphics CELSYS,Inc. 12ba6f44-1cdc-4fad-a5de-7d9e052f76dc CUDA-SVM A fast parallel SVM tool based on CUDA. /content/cudazone/CUDABrowser/assets/images/applications/513_svm_small.png /content/cudazone/CUDABrowser/assets/images/applications/513_svm_large.png Nanyang Technological University Academia 2008 12 31 12/31/2008 Haixiang Shi Code Numerics Haixiang Shi,Haixiang_Shi@users.sourceforge.net 3662fbe9-eeec-4413-b43c-d42054cbfa52 CUDA-GA A fast parallel genetic algorithm using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/512_GAArt_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/512_GAArt_large.jpg Academia Nanyang Technological University 2008 12 31 12/31/2008 Haixiang Shi Code Numerics Life Sciences Science Haixiang Shi,Haixiang_Shi@users.sourceforge.net 76057c07-a388-46d9-af21-b9bfcc4453c3 CUDA-PSO A parallel version of Particle Swarm Optimization (PSO) using nVidia's CUDA. /content/cudazone/CUDABrowser/assets/images/applications/511_swarm_intelligence_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/511_swarm_intelligence_large.jpg Academia Nanyang Technological University 2008 12 31 12/31/2008 Haixiang Shi Code Imaging Science Haixiang Shi,Haixiang_Shi@users.sourceforge.net f4e8deee-ef9f-4047-bafd-0701a9a1bc27 Magnetohydrodynamics simulations on graphics processing units Magnetohydrodynamics (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive and Beowulf clusters or even supercomputers are often used to run the codes that implement these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the authors' knowledge, the first implementation to accelerate computation of MHD simulations on GPUs. Numerical tests have been performed to validate the correctness of our GPU MHD code. Performance measurements show that our GPU-based implementation achieves speedups of 2 (1D problem with 2048 grids), 106 (2D problem with 1024^2 grids), and 43 (3D problem with 128^3 grids), respectively, compared to the corresponding serial CPU MHD implementation.
/content/cudazone/CUDABrowser/assets/images/applications/510_GPU_MHD_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/510_GPU_MHD_new_large.jpg Academia Faculty of IT, Macau University of Science and Technology 2009 09 01 09/01/2009 100 Hon-Cheng Wong Un-Hong Wong Paper Science Computational Physics Hon-Cheng Wong,hcwong@ieee.org e8dc2667-cce6-47b8-8fe4-c0e18e14972b CUDA-ClustalW CUDA-ClustalW is publicly available open-source software for high-speed computation of large MSAs running on CUDA-enabled GPUs based on clustalw-2.0.9. The project has been tested on a GeForce GTX 280 graphics card. /content/cudazone/CUDABrowser/assets/images/applications/509_p53_Hsap_Mmus_Rnor_Frub_ClustalW_6Kb_angle_800p_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/509_p53_Hsap_Mmus_Rnor_Frub_ClustalW_6Kb_angle_800p_large.jpg Research SourceForge.net 2009 04 01 04/01/2009 nkcslyc Application Numerics nkcslyc, 49b7c770-57ef-4530-b9ea-ea804d21c7ff Cuda_Wrapper The CUDA wrapper library provides a means for efficient resource sharing and resource protection on multi-user GPU clusters. It implements the following functionality: 1) virtualization of the physical GPU devices; 2) ensuring NUMA affinity for GPUs. /content/cudazone/CUDABrowser/assets/images/applications/507_numerics_rayleighbenard3d_small.png /content/cudazone/CUDABrowser/assets/images/applications/507_numerics_rayleighbenard3d_large.png Academia University of Illinois at Urbana-Champaign 2009 07 21 07/21/2009 Guochun Shi Jeremy Enos Code Numerics Libraries Guochun Shi,Jeremy Enos,gshi@ncsa.uiuc.edu 54d76ef2-f4b0-4e80-a3c3-ee8338606f13 CUDA Neural Network Implementation of a feed-forward backpropagation artificial neural network using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/506_neural_network_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/506_neural_network_large.jpg Research Sourceforge 2008 12 03 12/03/2008 Pyrevenant Application Life Sciences,Libraries,Science Pyrevenant eabc31be-c665-455a-95bd-0d0e7dd532ab cuda-z A simple program that displays information about CUDA-enabled devices. The program includes a GPU performance test. /content/cudazone/CUDABrowser/assets/images/applications/505_CUDA-Z_2_small.png /content/cudazone/CUDABrowser/assets/images/applications/505_CUDA-Z_2_large.png Research SourceForge.net 2009 04 13 04/13/2009 Andriy Golovnya Application Numerics Andriy Golovnya,andrew_golovnia@users.sourceforge.net bbf86cd6-59ac-443b-ab02-8ba8ef3bbf60 Computation of Troposphere Slant Delays on a GPU The computation of ray-traced troposphere delays which can be utilized for space geodetic applications is a time-consuming effort when a large number of rays has to be calculated. On the other hand, computation time can be tremendously reduced when algorithms are capable of supporting parallel processing architectures. Thus, by the use of an off-the-shelf graphics processing unit (GPU), it is demonstrated that troposphere slant delays can be computed very efficiently, without loss of accuracy. An adopted ray-tracing algorithm is presented, and results from GPU computations are compared with those obtained from calculations on a standard personal computer's CPU.
/content/cudazone/CUDABrowser/assets/images/applications/504_IEEE_GPU_figureC_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/504_IEEE_GPU_figureC_new_large.jpg Research National Institute of Information and Communications Technology, Japan http://www.nict.go.jp 2009 06 26 06/26/2009 18 Hobiger Thomas Ichikawa Ryuichi Koyama Yasuhiro Paper Geoscience Science Hobiger Thomas, Ichikawa Ryuichi, Koyama Yasuhiro, Kondo Tetsuro e97270b7-8c73-45fa-ac57-96e9ab59ca88 Cuda ITK This project shows how to integrate the NVIDIA CUDA GPU programming API into the ITK (Insight Segmentation and Registration Toolkit) library. /content/cudazone/CUDABrowser/assets/images/applications/503_226314_small.png /content/cudazone/CUDABrowser/assets/images/applications/503_226314_large.png Academia Harvard University 2009 06 28 06/28/2009 Won-Ki Jeong Paper Numerics e9097706-9b7a-4d12-9104-a44bd1952348 Phobos Phobos is a continuous map-reduce framework built upon NVIDIA CUDA /content/cudazone/CUDABrowser/assets/images/applications/502_1_PHOBOS_461_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/502_1_PHOBOS_461_large.jpg Academia HKUST 2009 01 01 01/01/2009 Wenbin Fang Code Libraries Wenbin Fang,saven@cse.ust.hk 212c683f-9d5a-4654-9edc-c3e3fcfe8727 cudatemplates "CUDA Templates" is a collection of C++ template classes and functions which provide a consistent interface to NVidia's "Compute Unified Device Architecture" (CUDA), hiding much of the complexity of the underlying CUDA functions from the programmer.
/content/cudazone/CUDABrowser/assets/images/applications/501_CUDATemplates_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/501_CUDATemplates_large.jpg Academia Technische Universität Graz 2008 12 31 12/31/2008 Markus Grabner Application Numerics Science d321a794-7411-4623-a966-4586c0d149e8 Application of a Kinetic Theory based solver of the Euler Equations using GPU Presented is a modified form of the Quiet Direct Simulation (QDS) method [1] adapted to use Graphics Processing Units (GPU) for flux calculation. Fluxes between source and destination cells calculated by QDS are flux-vector split and (on a regular Cartesian grid) a function of the source cell alone. The resulting advantage is the rapid calculation of fluxes between cells without the prior exchange of information between them, allowing highly efficient calculation using GPU. Various flow problems have been solved and consistent speed-ups of over 35 times (when compared to an equivalent single CPU code) are reported. /content/cudazone/CUDABrowser/assets/images/applications/500_kinetic_small.png /content/cudazone/CUDABrowser/assets/images/applications/500_kinetic_large.png Academia National Centre for High Performance Computing, Hsinchu, Taiwan http://www.nchc.org.tw/en/ 2009 05 18 05/18/2009 35 Matthew Smith Paper Computational Fluid Dynamics Matthew Smith,msmith@nchc.org.tw,Quiet Direct Simulation, Kinetic Theory 426e2b01-84c7-4c8c-a935-c652aee3ba78 Conjugated Gradient CUDA and CPU solvers for float, double and quad precision Free CUDA CG! Take advantage of our full-featured 150 GFlop/s Conjugated Gradient CUDA and CPU solvers for float, double and quad precision for free.
/content/cudazone/CUDABrowser/assets/images/applications/499_CG_small.png /content/cudazone/CUDABrowser/assets/images/applications/499_CG_large.png Commercial Elegant Mathematics Ltd http://www.elegant-mathematics.com/ 2009 01 08 01/08/2009 Open source Elegant Mathematics Ltd Code Numerics Elegant Mathematics Ltd,info@elegant-mathematics.com ef371856-7d1b-4d97-ab4c-6e73f9925992 GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics We present the newly developed code, GAMER (GPU-accelerated Adaptive MEsh Refinement code), which has adopted a novel approach to improve the performance of adaptive mesh refinement (AMR) astrophysical simulations by a large factor with the use of the graphic processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing TVD scheme for the hydrodynamic solver, and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented in GPU, by which hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in GPU, and the computing time of the ghost-zone values for each patch is made to diminish by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely-baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU(s) acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with 8192^3 effective resolution, respectively. 
/content/cudazone/CUDABrowser/assets/images/applications/498_fig18_small.png /content/cudazone/CUDABrowser/assets/images/applications/498_fig18_large.png Academia Department of Physics, National Taiwan University 2009 07 30 07/30/2009 12 Hsi-Yu Schive Paper Computational Fluid Dynamics Science Hsi-Yu Schive,b88202011@ntu.edu.tw ab53f652-f3d5-41b7-ab63-10bbda728871 Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures The multi-core trend in CPUs and GPUs offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. 
In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g. GPUs). /content/cudazone/CUDABrowser/assets/images/applications/497_960_small.png /content/cudazone/CUDABrowser/assets/images/applications/497_960_large.png Academia University of California at Davis 2009 06 02 06/02/2009 18 Luke J. Gosink Paper Data Parallel Database Indexing Luke J. Gosink,jgosink@ucdavis.edu 47cb4eda-e210-4db8-bb80-d6ae342dd454 Physical-Space Refraction-Corrected Transmission Ultrasound Computed Tomography Made Computationally Practical Transmission Ultrasound Computed Tomography (CT) is strongly affected by the acoustic refraction properties of the imaged tissue, and proper modeling and correction of these effects is crucial to achieving high-quality image reconstructions. Excellent results can be obtained when these physics effects are incorporated, but at considerable computational expense. We have used CUDA to conceive a framework that implements refractive Ultrasound CT and meets the interactive demands of clinical practice, without a loss in reconstruction quality.
/content/cudazone/CUDABrowser/assets/images/applications/497_us_img_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/497_us_img_large.jpg Academia Stony Brook University 2008 09 11 09/11/2008 85 Klaus Mueller Paper Medical Imaging Klaus Mueller,mueller@cs.sunysb.edu b912f9a1-2627-4d4f-9f4f-4da3eff3ca78 Python Parallel Utilities NVIDIA CUDA and MPI Python wrappers. These wrappers are written in pure C; no SWIG or Boost is necessary. The CUDA wrapper exposes the CUDA runtime and driver APIs. /content/cudazone/CUDABrowser/assets/images/applications/496_smoothed_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/496_smoothed_large.jpg Academia Seismic Laboratory for Imaging and Modeling 2008 12 31 12/31/2008 Sean Ross-Ross Paper Programming Tools c58a8810-432f-4757-a91f-c80faabe20ab Signal Integrity Simulations Agilent Technologies Inc. (NYSE:A) announced its work with NVIDIA to accelerate signal integrity simulations using NVIDIA's Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPU). The association is expected to yield the commercial release of a GPU-enabled Advanced Design System (ADS) Transient Convolution Simulator that will allow signal integrity designers to run these simulations dramatically faster than was previously possible. /content/cudazone/CUDABrowser/assets/images/applications/495_hyperlinx-eye_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/495_hyperlinx-eye_large.jpg Commercial EDA Geek News Staff in Models, Simulations 2008 08 26 08/26/2008 EDA Geek News Staff in Models, Simulations Paper Signal Processing contact_us@agilent.com d0dbf768-8c4a-45f6-a6c3-8c38dc100a98 Applying Modern Soft and Hardware Technologies for Computational Steering Approaches in Computational Fluid Dynamics In this article we present an educational simulation tool, FlowSim 2007 CUDA edition, a computational steering application for interactive 2D flow simulation based on the Lattice Boltzmann Method.
The application combines a comfortable user interface as well as a convenient development platform on the one hand and a high performance flow solver on the other hand. The user interface is implemented using the Microsoft .NET Framework whereas the Lattice Boltzmann kernel is based on the Compute Unified Device Architecture (CUDA) by nVIDIA running on GeForce 8 series featuring G8X GPUs [2]. The gap between the managed intermediate language (IL) code and the hardware specific native code is filled using the recently introduced C++/CLI programming language [1]. We demonstrate that this integrated desktop approach can deliver a performance that exceeds that of a high end PC by at least an order of magnitude. In our conclusion we will focus on extensions to three dimensions and clusters of GPUs. /content/cudazone/CUDABrowser/assets/images/applications/494_p175_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/494_p175_large.jpg Academia Technology Institute at TU Braunschweig 2007 10 26 10/26/2007 Jan Linxweiler Jonas Tölke Manfred Krafczyk Paper Computational Fluid Dynamics Science Jan Linxweiler,Jonas Tölke,Manfred Krafczyk, a6e5c287-cebe-4a52-b00f-7ec58a5dbdd2 Computer generated hologram with geometric occlusion using GPU-accelerated depth buffer rasterization for three-dimensional display We present a method of rapidly producing computer-generated holograms that exhibit geometric occlusion in the reconstructed image. Conceptually, a bundle of rays is shot from every hologram sample into the object volume. We use z buffering to find the nearest intersecting object point for every ray and add its complex field contribution to the corresponding hologram sample. Each hologram sample belongs to an independent operation, allowing us to exploit the parallel computing capability of modern programmable graphics processing units (GPUs).
Unlike algorithms that use points or planar segments as the basis for constructing the hologram, our algorithm's complexity is dependent on fixed system parameters, such as the number of ray-casting operations, and can therefore handle complicated models more efficiently. The finite number of hologram pixels is, in effect, a windowing function, and from analyzing the Wigner distribution function of windowed free-space transfer function we find an upper limit on the cone angle of the ray bundle. Experimentally, we found that an angular sampling distance of 0.01° for a 2.66° cone angle produces acceptable reconstruction quality. /content/cudazone/CUDABrowser/assets/images/applications/493_h15g_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/493_h15g_large.jpg Academia University of Cambridge, Electrical Engineering Dept. 2009 07 17 07/17/2009 Rick H.-Y. Chen Timothy D. Wilkinson Paper Rick H.-Y. Chen,Timothy D. Wilkinson 6f121072-b6d3-47ba-9e6b-e872695eaaf8 Real-Time Fringe Pattern Generation with High Quality A hologram computation procedure and its GPU implementation are presented. The procedure is based on partitioning. Each segment has an approximate but simpler frequency domain representation. Quality of the results is comparable to Fresnel holograms. /content/cudazone/CUDABrowser/assets/images/applications/492_3d-scan1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/492_3d-scan1_large.jpg Academia Department of Electrical and Electronics Engineering and Bilkent University 2009 04 30 04/30/2009 Hoonjong Kang, Fahri Yaraş, Levent Onural, Paper Imaging Science Hoonjong Kang, Fahri Yaraş, Levent Onural,hjkang@ee.bilkent.edu.tr 922028e2-56a9-45cf-99de-ceaa8d0a5370 Real-Time Multiple SLM Color Holographic Display Using Multiple GPU Acceleration A real-time color holographic video display system computes holograms from point cloud of a rigid object by using multi-GPU system and uses three different colored LEDs for reconstruction.
Experimental results are satisfactory. /content/cudazone/CUDABrowser/assets/images/applications/491_slm_small.png /content/cudazone/CUDABrowser/assets/images/applications/491_slm_large.png Academia Dept. of Electrical and Electronics Eng., Bilkent University 2009 04 30 04/30/2009 Fahri Yaraş Hoonjong Kang Levent Onural Paper Imaging Science Video & Audio Fahri Yaraş, Hoonjong Kang,Levent Onural, 8bbd6e15-496a-4bb3-9053-b3811821e510 Fast Hardware-Accelerated Volume Rendering of CT Scans As CT scanning is a very common medical imaging method, we propose new hardware-based algorithms using GPU (Graphical Processor Unit) programming for rapid visualization. Firstly, 3D volumes are constructed from CT scans. Then volume rendering is used to display anatomical structures via algorithms founded on improved ray casting and 2D textures. Our methods achieve interactive rendering rates and require an ordinary PC with an off-the-shelf graphics card. We expect our approach to be useful to medical practitioners for handling modern, large-scale medical datasets. /content/cudazone/CUDABrowser/assets/images/applications/490_ct_head_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/490_ct_head_large.jpg Academia Zhejiang University 2007 12 01 12/01/2007 Ronghua Liang Zhigeng Pan Meleagros Krokos Paper Medical Imaging Life Sciences Ronghua Liang, Zhigeng Pan, Meleagros Krokos,zgpan@cad.zju.edu.cn f10e6a41-90c1-4b44-8a9a-a427e74974f8 GPU-Based Acceleration Method for Coherent Holographic Stereogram Calculation In this paper, we show an acceleration method of the coherent holographic stereogram calculation by means of the GPU, and demonstrate the performance gain up to a factor of over 10 compared with CPU-based computing.
/content/cudazone/CUDABrowser/assets/images/applications/489_mobius_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/489_mobius_large.jpg Academia Department of Electrical and Electronics Engineering and Bilkent University 2008 03 16 03/16/2008 10 Hoonjong Kang, Takeshi Yamaguchi, Hiroshi Yoshikawa Paper Imaging Science Hoonjong Kang, Takeshi Yamaguchi,Hiroshi Yoshikawa,hjkang@ee.bilkent.edu.tr f9c1b8ad-db0d-429f-8863-03ded9a69dab Atmospheric wavefront phase recovery by use of specialized hardware: graphical processing units and field-programmable gate arrays To achieve the wavefront phase-recovery stage of an adaptive-optics loop computed in real time for 32x32 or a greater number of subpupils in a Shack-Hartmann sensor, we present here, for what is to our knowledge the first time, preliminary results that we obtained by using innovative techniques: graphical processing units (GPUs) and field-programmable gate arrays (FPGAs). We describe the stream-computing paradigm of the GPU and adapt a zonal algorithm to take advantage of the parallel computational power of the GPU. We also present preliminary results we obtained by use of FPGAs on the same algorithm. GPUs have proved to be a promising technique, but FPGAs are already a feasible solution to adaptive-optics real-time requirements, even for a large number of subpupils. /content/cudazone/CUDABrowser/assets/images/applications/488_08_06a_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/488_08_06a_large.jpg Academia University of La Laguna, Spain 2004 12 31 12/31/2004 Jose G. Marichal-Hernandez Luis F. Rodriguez-Ramos Fernando Rosa Paper Imaging Science Jose G. Marichal-Hernandez, Luis F.
Rodriguez-Ramos, Fernando Rosa,tpc3dtvcon09@tnt.uni-hannover.de 8a5f261c-09d8-42a5-8726-7100bdde85c8 Acceleration method of computing a compensated phase-added stereogram We have implemented experimental code to compute a compensated phase-added stereogram (CPAS), which was proposed in a previous paper, on a graphic processing unit (GPU). In this paper, we show an acceleration method for CPAS computation by means of the GPU and compare the computation time between CPU-based and GPU-based calculations, which are programmed in our laboratories. In addition, we demonstrate their reconstructed images. As a result, we could achieve a performance gain of a factor of over 33 compared with a CPU-based computing environment and digital holograms can be displayed at 30 frames per second with 15,000 points. /content/cudazone/CUDABrowser/assets/images/applications/487_stereo_small.png /content/cudazone/CUDABrowser/assets/images/applications/487_stereo_large.png Academia Department of Electrical and Electronics Engineering and Bilkent University 2008 10 24 10/24/2008 33 Hoonjong Kang Takeshi Yamaguchi Hiroshi Yoshikawa Paper Imaging Science Hoonjong Kang, Takeshi Yamaguchi, Hiroshi Yoshikawa, hjkang@ee.bilkent.edu.tr 784de106-a069-4478-8a7e-92a0aed3649b Hologram synthesis for photorealistic reconstruction Computation of diffraction patterns, and thus holograms, of scenes with photorealistic properties is a highly complicated and demanding process. An algorithm, based primarily on computer graphics methods, for computing full-parallax diffraction patterns of complicated surfaces with realistic texture and reflectivity properties is proposed and tested. The algorithm is implemented on single-CPU, multiple-CPU and GPU platforms. An alternative algorithm, which implements reduced occlusion diffraction patterns for much faster but somewhat lower quality results, is also developed and tested. The algorithms allow GPU-aided calculations and easy parallelization. 
Both numerical and optical reconstructions are conducted. The results indicate that the presented algorithms compute diffraction patterns that provide successful photorealistic reconstructions; the computation times are acceptable, especially on the GPU implementations. /content/cudazone/CUDABrowser/assets/images/applications/486_image018_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/486_image018_large.jpg Academia JOSA A 2008 11 24 11/24/2008 Martin Janda Ivo Hanak Levent Onural Paper Imaging Numerics Science Martin Janda,Ivo Hanak, Levent Onural,mjandakiv@zcu.cz 9d2fbd2e-e241-478f-90a1-f40cb04ed084 Real-time digital holographic microscopy Digital holographic microscopy (DHM) is a well-known powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphic processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms whose size is 512x512 grids at 24 frames per second.
/content/cudazone/CUDABrowser/assets/images/applications/485_holo_small.png /content/cudazone/CUDABrowser/assets/images/applications/485_holo_large.png Academia Graduate School of Science and Engineering, Yamagata University 2009 07 23 07/23/2009 Tomoyoshi Shimobaba Yoshikuni Sato Junya Miura Multimedia Paper Imaging Science Numerics Tomoyoshi Shimobaba,Yoshikuni Sato,Junya Miura,shimo@yz.yamagata-u.ac.jp 5068f09c-0433-4ba5-9565-f77fbc04d4c8 Real-time liquid-crystal atmosphere turbulence simulator To generate time-evolving atmosphere turbulence in real time, a phase-generating method for our liquid-crystal (LC) atmosphere turbulence simulator (ATS) is derived based on the Fourier series (FS) method. A real matrix expression for generating turbulence phases is given and calculated with a graphic processing unit (GPU), the GeForce 8800 Ultra. A liquid crystal on silicon (LCOS) with 256x256 pixels is used as the turbulence simulator. The total time to generate a turbulence phase is about 7.8 ms for calculation and readout with the GPU. A parallel processing method of calculating and sending a picture to the LCOS is used to improve the simulating speed of our LC ATS. Therefore, the real-time turbulence phasegeneration frequency of our LC ATS is up to 128 Hz. To our knowledge, it is the highest speed used to generate a turbulence phase in real time. /content/cudazone/CUDABrowser/assets/images/applications/484_simulator_small.png /content/cudazone/CUDABrowser/assets/images/applications/484_simulator_large.png Academia Changchun Institute of Optics, Fine Mechanics and Physics 2009 04 17 04/17/2009 Lifa Hu Li Xuan Dayu Li Paper Numerics Science Lifa Hu,Li Xuan,Dayu Li,hulifa@ciomp.ac.cn dfaea93f-1724-4e37-b329-5ee4848f3988 GPU-assisted high-resolution, real-time3-D shape measurement This paper describes a Graphics Processing Unit (GPU)-assisted real-time three-dimensional shape measurement system. 
Our experiments demonstrated that the absolute coordinates calculation and rendering speed of a GPU is more than four times faster than that of a dual CPU workstation with the same graphics card. By implementing the GPU into our system, we realized simultaneous absolute coordinate acquisition, reconstruction and display at 30 frames per second with a resolution of approximately 266K points per frame. Moreover, a 2+1 phase-shifting algorithm was employed to alleviate the measurement error caused by motion. Applications of the system include medical imaging, manufacturing, entertainment, and security. /content/cudazone/CUDABrowser/assets/images/applications/483_face_small.png /content/cudazone/CUDABrowser/assets/images/applications/483_face_large.png Academia Mathematics Department, Harvard University 2006 10 02 10/02/2006 4 Song Zhang Dale Royer Shing-Tung Yau Paper Imaging Numerics Science Song Zhang,Dale Royer,Shing-Tung Yau,szhang77@gmail.com 6ab09d58-6ea6-4025-8af8-e1925bef8dce Computer generated holography We have applied the graphics processing unit (GPU) to computer generated holograms (CGH) to overcome the high computational cost of CGH and have compared the speed of a GPU implementation to a standard CPU implementation. The calculation speed of a GPU (GeForce 6600, nVIDIA) was found to be about 47 times faster than that of a personal computer with a Pentium 4 processor. Our system can realize real-time reconstruction of a 64-point 3-D object at video rate using a liquid-crystal display of resolution 800x600. 
/content/cudazone/CUDABrowser/assets/images/applications/482_computer-generated-hologram_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/482_computer-generated-hologram_large.jpg Academia Department of Medical System Engineering Chiba University 2008 12 31 12/31/2008 47 Nobuyuki Masuda Tomoyoshi Ito Takashi Tanaka Paper Imaging Numerics Video & Audio Nobuyuki Masuda,Tomoyoshi Ito,Takashi Tanaka,masudanb@faculty.chiba-u.jp 028ff6b3-3641-497e-8515-37370d59d3c3 Flow visualization and flow cytometry with holographic video microscopy The video stream captured by an in-line holographic microscope can be analyzed on a frame-by-frame basis to track individual colloidal particles' three-dimensional motions with nanometer resolution, and simultaneously to measure their sizes and refractive indexes. Through a combination of hardware acceleration and software optimization, this analysis can be carried out in near real time with off-the-shelf instrumentation. An efficient particle identification algorithm automates initial position estimation with sufficient accuracy to enable unattended holographic tracking and characterization. This technique's resolution for particle size is fine enough to detect molecular-scale coatings on the surfaces of colloidal spheres, without requiring staining or fluorescent labeling. We demonstrate this approach to label-free holographic flow cytometry by detecting the binding of avidin to biotinylated polystyrene spheres. 
/content/cudazone/CUDABrowser/assets/images/applications/481_laser_small.png /content/cudazone/CUDABrowser/assets/images/applications/481_laser_large.png Academia Department of Physics and Center for Soft Matter Research, New York University, New York 2009 07 17 07/17/2009 Fook Chiong Cheong Bo Sun Remi Dreyfus Paper Imaging Science Fook Chiong Cheong,Bo Sun,Remi Dreyfus,david.grier@nyu.edu 8dba8b26-2a21-43a9-945b-ec9f04d5ff5d A QAP Solver with CUDA GPU Computing Architecture This application solves the quadratic assignment problem (QAP) [1]. In QAP, we are given l locations and l facilities and the task is to assign the facilities to the locations to minimize the cost. We chose QAP for the following reasons: First, problem sizes of QAPs in real life problems are relatively small compared with other problems in permutation domains such as the traveling salesman problem (TSP) and the scheduling problem. This enables us to use the shared memory of a GPU effectively. Second, QAP is one of the most difficult problems among problems in permutation domains. Thus, QAP is a good test bed to evaluate an optimization algorithm. /content/cudazone/CUDABrowser/assets/images/applications/480_qap03_small.png /content/cudazone/CUDABrowser/assets/images/applications/480_qap03_large.png Academia Graduate School of Science, Osaka Prefecture University 2008 12 31 12/31/2008 Noriyuki Fujimoto Shigeyoshi Tsutsui Paper Numerics Science Noriyuki Fujimoto,Shigeyoshi Tsutsui,fujimoto@mi.s.osakafu-u.ac.jp ca10a525-1ae8-4b4e-9720-2243074cb32e A GPU Accelerated Evolutionary Computer Vision System We have used the graphics processing unit (GPU) of the graphics card to create an evolutionary image processing system which is able to learn how to detect a user-specified object in an image. The system receives an image sequence as input. The user only has to tell the system where this object is located. This is done by using the mouse pointer. 
The user simply moves the mouse over the desired object and then presses the mouse button as long as the object is located under the mouse pointer. The user follows this object over several frames while keeping the mouse button pressed. As this is being done, the system evolves a population of image processing algorithms by exploiting the power of the GPU at interactive rates. Our system is the first GPU accelerated evolutionary image processing system (Figure 1) which allows the automatic creation of object detection algorithms [2]. This is the first step towards building fully adaptive evolutionary vision systems [1]. /content/cudazone/CUDABrowser/assets/images/applications/479_ducks_small.png /content/cudazone/CUDABrowser/assets/images/applications/479_ducks_large.png Academia Universitat Tubingen 2008 12 31 12/31/2008 45 Eberhard Karls Paper Imaging Numerics Science Eberhard Karls,marc.ebner@wsii.uni-tuebingen.de 49ece6ac-6aa1-492c-8ceb-6a748939c306 GPU-based Acceleration of the Genetic Algorithm The genetic algorithm (GA) is a stochastic optimization method inspired by natural evolution. Because of their parallel nature, GAs have been parallelized many times. Graphic Processing Units (GPU) were originally targeted for rasterization of graphics primitives. Today GPUs are more like fast multi-core processors capable of performing complex mathematical tasks. There are many ways to exploit GPUs' potential for general purpose computation (GPGPU). One option is to employ the Compute Unified Device Architecture (CUDA) framework. 
/content/cudazone/CUDABrowser/assets/images/applications/478_voronoi_knauss_oesterle_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/478_voronoi_knauss_oesterle_large.jpg Academia Brno University of Technology, Bozetechova 2 2008 12 31 12/31/2008 2600 Petr Pospichal Jiri Jaros Numerics Science Petr Pospichal,Jiri Jaros,xpospi45@stud.t.vutbr.cz 4aa192d6-70c5-4040-9fa2-7ee690f988dc Parallel Ant System for Traveling Salesman Problem Ant Colony Optimization (ACO) is a meta-heuristic introduced in 1991 by Dorigo et al. for the TSP (Dorigo, 1992). This algorithm is inspired by the natural behavior of real ants. Ants usually communicate via pheromone trails, i.e. an ant lays down some amount of pheromone on the path it has passed. An ant's tendency to choose a specific path is positively correlated with the intensity of the trail. The pheromone trail evaporates over time if no pheromone is laid down by other ants. If many ants lay down pheromone on a specific path, the intensity attracts more ants toward this path. Although ACO has outstanding performance on the TSP, it requires a huge execution time on large-scale TSP instances. However, ACO has a highly parallelizable structure (Talbi, Roux, Fonlupt, & Robillard, 1999; Stützle, 1998). In this work, we choose NVIDIA's CUDA programming model and the Tesla C1060 as the platform to implement our Parallel ACO. /content/cudazone/CUDABrowser/assets/images/applications/477_AntLines_small.png /content/cudazone/CUDABrowser/assets/images/applications/477_AntLines_large.png Academia Taiwan Evolutionary Intelligence Laboratory (TEIL) Department of Electrical Engineering, National Taiwan University 2008 12 31 12/31/2008 21 Ying-Shiuan You Paper Numerics Science Ying-Shiuan You,r97921039@ntu.edu.tw 8fc9a55c-d66e-4870-8536-634fad8c6d4a StarPU StarPU is a unified runtime system that offers support for heterogeneous multicore architectures (CPUs, GPUs, Cell's SPUs, ...). 
Its unified execution model is tightly coupled with a high-level data management library and provides a convenient way to develop and tune powerful scheduling algorithms. StarPU therefore makes it possible to actually get the benefits of hybrid systems in a portable fashion. /content/cudazone/CUDABrowser/assets/images/applications/476_starpu-lu-dag_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/476_starpu-lu-dag_large.jpg Research INRIA http://www.inria.fr 2009 07 06 07/06/2009 Open source Cedric Augonnet Application Code Libraries runtime system, task scheduling, data management, portability 8d5ab3be-6289-4b09-8e1f-db37f6d927b0 Optimization of Primality Testing Methods Modern fast primality testing uses a combination of Strong Probable Prime (SPRP) rejection tests. We find more powerful combinations by intensive search of the vast domain of SPRP test configurations. Evolutionary guidance using previous promising results boosts search speed. We implement the entire search on the GPU with the CUDA programming language, resulting in a 65x speedup over a CPU search. This project has already found a test an order of magnitude more powerful than the best previously known. /content/cudazone/CUDABrowser/assets/images/applications/474_rabin_miller_1_small.PNG /content/cudazone/CUDABrowser/assets/images/applications/474_rabin_miller_1_large.PNG Academia 2008 12 31 12/31/2008 65 Steve Worley Paper Numerics Science Steve Worley,comments@worley.com b8b898ab-530a-4119-9976-a20a5fdc492b Particle Swarm Optimization The increasing interest of researchers in using low cost GPUs for applications requiring intensive parallel computing is due to the ability of these devices to solve parallelizable problems much faster than traditional sequential processors. 
The first applications of evolutionary algorithms (EAs) on GPUs were developed to solve specific image processing problems; at the beginning they used texture rendering for the encoding and evaluation of individuals, and most of the time tasks like pseudo-random number generation and other evolutionary operations were executed on the CPU. This project presents an approach for the implementation of PSO algorithms on GPUs which, by means of the nVIDIA CUDA™ environment, avoids the use of textures as data structures and performs all evolution on the GPU, reducing as much as possible the exchange of data with the CPU. /content/cudazone/CUDABrowser/assets/images/applications/473_phase_small.png /content/cudazone/CUDABrowser/assets/images/applications/473_phase_large.png Academia Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Parma 2008 12 31 12/31/2008 50 Luca Mussi Stefano Cagnoni Paper Numerics Luca Mussi,Stefano Cagnoni,mussi@ce.unipr.it,cagnoni@ce.unipr.it d23dc8ee-f770-49ab-ab28-abe32d6d2d10 Video Game Tools Used For Defense Needs Video gaming computers and video game consoles available today typically contain a graphics processing unit (GPU), which is very efficient at manipulating and displaying computer graphics. However, the unit's highly parallel structure also makes it more efficient than a general-purpose central processing unit for a range of complex calculations important to defense applications. /content/cudazone/CUDABrowser/assets/images/applications/472_commandandconquer-775336_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/472_commandandconquer-775336_large.jpg Academia Georgia Institute of Technology Research News 2009 06 24 06/24/2009 350 Georgia Institute of Technology Research News Paper Game Physics Video Game 1f321cbb-73a3-4321-8858-cf8f5d246fe1 Using Evolutionary Computing on Consumer Graphics Hardware for Epistasis Analysis in Human Genetics Biological systems are both complex and robust. 
Because of this, epistasis, or gene-gene interaction, is thought to be a ubiquitous component of common human diseases. Unfortunately, due to the non-linear nature of these interactions, detecting and characterizing epistasis requires algorithms which are combinatorial in complexity. One such algorithm is Multifactor Dimensionality Reduction (MDR). Expert knowledge guided evolutionary computing wrappers around MDR have previously been shown to be a powerful way to efficiently analyze datasets for interactions. Evolutionary computing can effectively address some of the challenges these datasets present. Unfortunately, examining the statistical significance of results requires permutation testing, which increases the computation requirements by a factor of 1000. Here we implement an expert knowledge guided ant system on graphics processing units (GPUs) and show that the GPU implementation makes the rigorous statistical analysis of large datasets practical. /content/cudazone/CUDABrowser/assets/images/applications/471_karyotype_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/471_karyotype_large.jpg Academia Dartmouth Medical School 2009 07 24 07/24/2009 Nicholas A. Sinnott-Armstrong Casey S. Greene Jason H. Moore Paper Life Sciences Science Nicholas A. Sinnott-Armstrong,Casey S. Greene,Jason H. Moore,Epistasis Analysis, Consumer app, human genetics 5a9e5a89-919c-4735-872b-a0670bb94480 How GPUs can outperform ASICs for fast LDPC decoding Due to huge computational requirements, powerful Low-Density Parity-Check (LDPC) error correcting codes, discovered in the early 1960s, have only recently been adopted by emerging communication standards. LDPC decoders are supported by VLSI technology, which delivers good parallel computational power with excellent throughputs, but at the expense of significant costs. 
In this work, we propose an alternative flexible LDPC decoder that exploits data-parallelism for simultaneous multicodeword decoding, supported by multithreading on CUDA-based graphics processing units (GPUs). The ratio of arithmetic operations per memory access is low for the efficient min-sum LDPC decoding algorithm proposed, which causes a bottleneck due to memory latency and data collisions. We propose runtime data realignment to allow coalesced parallel memory accesses to be performed by distinct threads inside the same warp. The memory access patterns of LDPC codes are random, which does not admit the simultaneous use of coalescence in both read and write operations of the decoding process. To overcome this problem we have developed a data mapping transformation which allows new addresses to be contiguously accessed for one of the mentioned memory access types. Our implementation shows throughputs above 100Mbps and BER curves that compare well with ASIC solutions. /content/cudazone/CUDABrowser/assets/images/applications/469_QPPldpcgraph_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/469_QPPldpcgraph_large.jpg Academia University of Coimbra, Coimbra, Portugal 2008 12 31 12/31/2008 Gabriel Falcao Vitor Silva Leonel Sousa Paper Numerics Gabriel Falcao,Leonel Sousa,Vitor Silva, 24d1eddb-c861-4eae-8746-e0bb6eb9c3f3 High performance genetic programming The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. We compare two parallelization schemes that evaluate several GP programs in parallel. We show that the fine grain distribution of computations over the elementary processors greatly impacts performances. We also present memory and representation optimizations that further enhance computation speed, up to 2.8 billion GP operations per second. 
The code has been developed with the well known ECJ library. /content/cudazone/CUDABrowser/assets/images/applications/468_mutation_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/468_mutation_large.jpg Academia Universite Lille Nord de France, Calais, France 2008 12 31 12/31/2008 Denis Robilliard Virginie Marion Cyril Fonlupt Paper Numerics Denis Robilliard,Virginie Marion,Cyril Fonlupt,poty@lil.univ-littoral.fr,genetic algorithms, genetic programming, parallel processing 27faebfe-cded-4c12-b4b5-88c50c12807c A game loop architecture for the GPU used as a math coprocessor in real-time applications This article concerns the use of a graphics processor unit (GPU) as a math co-processor in real-time applications in special games and physics simulations. To validate this approach, we present a new game loop architecture that employs GPUs for general-purpose computations (GPGPUs). A critical issue here is the process distribution between the CPU and the GPU. The architecture consists of a model for distribution, and our implementation offers many advantages in comparison to other approaches without the GPGPU stage. This architecture can be used either by a general-purpose language such as the Compute Unified Device Architecture (CUDA), or shader languages such as the High-Level Shader Language (HLSL) and the OpenGL Shading Language (GLSL). Although the architecture proposed here aims at supporting mathematics and physics on the GPU, it is possible to adapt any kind of generic computation. This article discusses the model implementation in an open-source game engine and presents the results of using this platform. /content/cudazone/CUDABrowser/assets/images/applications/467_Minna-de-Puzloop-1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/467_Minna-de-Puzloop-1_large.jpg Academia Instituto de Computacao, Universidade Federal Fluminense, Brazil 2008 12 31 12/31/2008 Marcelo P. M. Zamith Esteban W. G. 
Clua Aura Conci Paper Game Physics,Numerics,Science Marcelo P. M. Zamith,Esteban W. G. Clua,Aura Conci,esteban@inf.puc-rio.br,Game loop, real-time physics 3ac00157-b33b-4de3-87e4-b079e50c6f8a A hardware redundancy and recovery mechanism for reliable scientific computation General purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating point capabilities, and Folding@Home has even released a GPU port of its protein folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5 x performance penalty and saves energy for GPGPU but is completely transparent to general graphics and does not affect the performance of the games that drive the market. /content/cudazone/CUDABrowser/assets/images/applications/466_cuda-nbody-example_small.png /content/cudazone/CUDABrowser/assets/images/applications/466_cuda-nbody-example_large.png Academia University of Virginia 2008 12 31 12/31/2008 Jeremy W. Sheaffer David P. Luebke Kevin Skadron Paper Numerics Science Jeremy W. Sheaffer,David P. 
Luebke,Kevin Skadron 04b954f1-d86f-409f-a6ad-d3ac9b072663 Accelerated Pathfinding In the past few years the graphics programmable processor (GPU) has evolved into an increasingly convincing computational resource for non-graphics applications. The GPU is especially well suited to address problem sets expressed as data parallel computation with the same program executed on many data elements concurrently. In pursuing a scalable navigation planning approach for many thousands of agents in crowded game scenes, developers became more attracted to decomposable movement algorithms that lend themselves to explicit parallelism. Pathfinding is one key computational intelligence action in games that is typified by intense search over sparse graph data structures. This paper describes an efficient GPU implementation of parallel global pathfinding using the CUDA programming environment, and demonstrates the GPU's performance scaling advantage in executing an inherently irregular and divergent algorithm. /content/cudazone/CUDABrowser/assets/images/applications/465_image006_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/465_image006_large.jpg Commercial NVIDIA Corporation http://www.nvidia.com 2008 12 31 12/31/2008 Avi Bleiweiss Paper Numerics Avi Bleiweiss,ableiweiss@nvidia.com 2b0d90a1-b0dc-4834-833f-c633cdf8bd9b BSGP: bulk-synchronous We present BSGP, a new programming language for general purpose computation on the GPU. A BSGP program looks much the same as a sequential C program. Programmers only need to supply a bare minimum of extra information to describe parallel processing on GPUs. As a result, BSGP programs are easy to read, write, and maintain. Moreover, the ease of programming does not come at the cost of performance. A well-designed BSGP compiler converts BSGP programs to kernels and combines them using optimally allocated temporary streams. 
In our benchmark, BSGP programs achieve similar or better performance than well-optimized CUDA programs, while the source code complexity and programming time are significantly reduced. To test BSGP's code efficiency and ease of programming, we implemented a variety of GPU applications, including a highly sophisticated X3D parser that would be extremely difficult to develop with existing GPU programming languages. /content/cudazone/CUDABrowser/assets/images/applications/464_6_small.JPG /content/cudazone/CUDABrowser/assets/images/applications/464_6_large.JPG Academia Tsinghua University 2008 12 31 12/31/2008 Qiming Hou Kun Zhou Baining Guo Paper Numerics Qiming Hou,Kun Zhou,Baining Guo 0294e1a5-493d-432d-a5f8-66ca43222dc6 High performance discrete Fourier transforms We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x improvement over MKL for large sizes. 
/content/cudazone/CUDABrowser/assets/images/applications/463_fc100_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/463_fc100_large.jpg Commercial Microsoft Corporation 2008 12 31 12/31/2008 40 Naga K. 
Govindaraju Brandon Lloyd Yuri Dotsenko Paper Numerics Naga K. Govindaraju,Brandon Lloyd,Yuri Dotsenko,Algorithms, Design, Experimentation, Measurement, Performance a0694fe3-9fac-4d87-9684-16832426b768 Wave field synthesis for 3D audio: architectural prospectives In this paper, we compare the architectural perspectives of the Wave Field Synthesis (WFS) 3D-audio algorithm mapped on three different platforms: a General Purpose Processor (GPP), a Graphics Processor Unit (GPU) and a Field Programmable Gate Array (FPGA). Previous related work reveals that, up to now, WFS sound systems are based on standard PCs. However, on one hand, contemporary GPUs consist of many multiprocessors that can process data concurrently. On the other hand, recent FPGAs provide a huge level of parallelism and reasonably high performance potential, which can be exploited very efficiently by smart designers. Furthermore, new parallel programming environments, such as the Compute Unified Device Architecture (CUDA) from NVidia and the Stream from ATI, give researchers full access to the GPU resources. We use CUDA to map the WFS kernel on a GeForce 8600GT GPU. Additionally, we implement a reconfigurable and scalable hardware accelerator for the same kernel, and map it onto Virtex4 FPGAs. We compare both architectural approaches against a baseline GPP implementation on a Pentium D at 3.4 GHz. Our conclusion is that in highly demanding WFS-based audio systems, a low-cost GeForce 8600GT desktop GPU can achieve a speedup of up to 8x compared to a modern Pentium D implementation. An FPGA-based WFS hardware accelerator consisting of a single rendering unit (RU) can provide a speedup of up to 10x compared to the Pentium D approach. It can fit into small FPGAs and consumes approximately 3 Watts. 
Furthermore, cascading multiple RUs into a larger FPGA, can boost processing throughput up to more than two orders of magnitude higher than a GPP-based implementation and an order of magnitude better than a low-cost GPU one. /content/cudazone/CUDABrowser/assets/images/applications/462_wfs-objetos_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/462_wfs-objetos_large.jpg Academia Delft University of Technology, Delft, Netherlands 2008 12 31 12/31/2008 10 Dimitris Theodoropoulos Catalin Bogdan Ciobanu Georgi Kuzmanov Paper Numerics Video & Audio Dimitris Theodoropoulos,Catalin Bogdan Ciobanu,Georgi Kuzmanov b74a6976-b873-458c-acfe-6057b5eedf72 A compiler framework for optimization of affine loop nests GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on their GPUs. However, manual development of high-performance parallel code for GPUs is still very challenging. In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed. 
/content/cudazone/CUDABrowser/assets/images/applications/461_180px-Polytope_model_unskewed.svg_small.png /content/cudazone/CUDABrowser/assets/images/applications/461_180px-Polytope_model_unskewed.svg_large.png Academia The Ohio State University, Columbus, OH, USA 2008 12 31 12/31/2008 Muthu Manikandan Baskaran Uday Bondhugula Sriram Krishnamoorthy Paper Numerics Muthu Manikandan Baskaran,Uday Bondhugula,Sriram Krishnamoorthy bd6d474d-c2bb-423c-b00f-fe9a2fedb280 Single-particle 3d reconstruction from cryo-electron microscopy images Single-particle 3D reconstruction from cryo-electron microscopy (cryo-EM) images is a kernel application in the analysis of biological molecules, and its computational requirement is now beyond a PetaFlop for a high-resolution 3D structure. In this paper, we quantitatively analyze the workload, computational intensity and memory performance of the application, and parallelize it on an emerging multicore architecture, GPU-CUDA. Further, we apply a percolation technique to decouple computation from memory operations and orchestrate thread-data mapping to reduce the overhead of off-chip memory operations. Finally, we tested our optimization strategy on a popular open-source package, EMAN, on GPU-CUDA, which achieves a relative speedup of about 10X over the original CPU-only EMAN. The experimental results also show that the proposed percolation programming greatly improves utilization of memory bandwidth and floating-point units. 
/content/cudazone/CUDABrowser/assets/images/applications/460_kouzouseiri_image_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/460_kouzouseiri_image_large.jpg Academia Chinese Academy of Science, Beijing, China 2008 12 31 12/31/2008 10 Guangming Tan Ziyu Guo Mingyu Chen Paper Imaging Medical Imaging Life Sciences Science Guangming Tan,Ziyu Guo,Mingyu Chen 0d938d85-37c4-41f6-8329-fad802f09c5e All-pairs shortest-paths for large graphs The all-pairs shortest-path problem is an intricate part in numerous practical applications. We describe a shared-memory, cache-efficient GPU implementation to solve transitive closure and the all-pairs shortest-path problem on directed graphs for large datasets. The proposed algorithmic design utilizes the resources available on the NVIDIA G80 GPU architecture using the CUDA API. Our solution generalizes to handle graph sizes that are inherently larger than the DRAM memory available on the GPU. Experiments demonstrate that our method is able to significantly speed up the processing of large graphs, making our method applicable for bioinformatics, internet node traffic, social networking, and routing problems. /content/cudazone/CUDABrowser/assets/images/applications/459_2_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/459_2_new_large.jpg Academia University of Pennsylvania and Lockheed Martin 2008 12 31 12/31/2008 Gary J. Katz Joseph T. Kider, Jr Paper Numerics Science Gary J. Katz, Joseph T. Kider, Jr d8ab1ded-aa91-4d3f-be16-5b13f6a6a1e2 Program optimization space pruning for a multithreaded gpu Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. 
With the introduction of inexpensive, single-chip, massively parallel platforms, more developers who lack the substantial experience and knowledge needed to maximize performance will be creating highly-parallel applications for these platforms. 
This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and one relatively simple methodology for reducing the workload involved in the optimization process. This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications. /content/cudazone/CUDABrowser/assets/images/applications/458_deferredshadow_small.png /content/cudazone/CUDABrowser/assets/images/applications/458_deferredshadow_large.png Academia 2008 12 31 12/31/2008 Shane Ryoo Christopher I. Rodrigues Sam S. Stone Paper Numerics Shane Ryoo,Christopher I. Rodrigues,Sam S. Stone 018b5db4-b3cb-444f-be6c-778b8517c99b Aspects of GPU for general purpose high performance computing We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoints of parallel computing. 
The major weak points of GPUs relative to the newest supercomputers are identified and summarized as four points: large SIMD vector length, small memory, absence of fast L2 cache, and high register spill penalty. On the software side, we derive an optimal scheduling algorithm for latency hiding of host-device data transfer, and discuss SPMD parallelism on GPUs. /content/cudazone/CUDABrowser/assets/images/applications/457_GeForce_GTX_280_3qtr_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/457_GeForce_GTX_280_3qtr_large.jpg Academia The University of Tokyo 2008 12 31 12/31/2008 Reiji Suda Takayuki Aoki Shoichi Hirasawa Paper Numerics Reiji Suda,Takayuki Aoki,Shoichi Hirasawa 21306bd4-d5ae-4455-b202-c8bae8a17348 Software Pipelined Execution of Stream Programs The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem --- both scheduling and assignment of filters to processors --- as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. 
Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU. /content/cudazone/CUDABrowser/assets/images/applications/456_pipe_small.png /content/cudazone/CUDABrowser/assets/images/applications/456_pipe_large.png Academia Supercomputer Education and Research Centre, Indian Institute of Science 2008 12 31 12/31/2008 37 Abhishek Udupa R. Govindarajan Matthew J. Thazhuthaveetil Paper Numerics Abhishek Udupa, R. Govindarajan,Matthew J. Thazhuthaveetil,mjt@csa.iisc.ernet.in ea8c2995-d603-42a2-93f8-633811e8b9c2 Pervasive massively multithreaded GPU processors This talk presents an overview of NVIDIA's SIMT architecture and some brief insights on how some CUDA programming paradigms map onto it. A brief history of SIMT is provided to explain how NVIDIA ended up implementing a unified SIMT processor core in its GPUs including how graphics shaders are mapped onto SIMT threads. In addition, a conceptual view of how a SIMT microarchitecture executes threads in parallel is provided. The talk wraps up by describing some pitfalls related to thread synchronization, memory access, and cache management and describes some key problem areas in SIMT programming that NVIDIA would like to address in the future /content/cudazone/CUDABrowser/assets/images/applications/455_nvidia_gpu_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/455_nvidia_gpu_large.jpg Commercial NVIDIA Corporation, Santa Clara, CA, USA 2008 12 31 12/31/2008 Michael C. Shebanow Paper Science Numerics Michael C. 
Shebanow, mshebanow@nvidia.com d1cab6b3-cfb3-4feb-9d02-4517910a5cf0 A compiler and runtime system for enabling data mining applications With increasing need for accelerating data mining and scientific data analysis on large data sets, and less chance to improve processor performance by simply increasing clock frequencies, multi-core architectures and accelerators like FPGAs and GPUs have become popular. A recent development in using GPU for general computing has been the release of CUDA (Compute Unified Device Architecture) by NVIDIA. CUDA allows GPU programming with C language-like features, thus easing the development of non-graphics applications on a GPU. However, several challenges still remain in programming the GPUs with CUDA, because CUDA involves explicit parallel programming and management of its complex memory hierarchy, as well as allocating device memory, moving data between CPU and device memory, and specification of thread grid configurations. In this paper, we offer a solution for the programmers to generate CUDA code by specifying the sequential reduction loop(s) with some information about the parameters. With program analysis and code generation, the applications are mapped to a GPU. Several additional optimizations are also performed by the middleware. We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The speedup that each of these applications achieve over a sequential CPU version ranges between 20 and 50. 
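As an illustration of what a "sequential reduction loop" can look like for one of the evaluated applications, the following is a minimal Python sketch of a single k-means iteration written as a generalized reduction. It is not the authors' input format or generated code; the function name kmeans_reduction_step and the layout of the reduction object (per-cluster sums and counts) are assumptions made for this example.

```python
from math import dist


def kmeans_reduction_step(points, centroids):
    """One k-means iteration expressed as a sequential generalized reduction.

    Hypothetical helper for illustration only; not part of the catalogued system.
    """
    k = len(centroids)
    dim = len(centroids[0])
    # Reduction object: per-cluster coordinate sums and point counts.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    # Reduction loop: each point updates the reduction object independently,
    # which is the property that makes the loop a candidate for parallel mapping.
    for p in points:
        nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
        for d in range(dim):
            sums[nearest][d] += p[d]
        counts[nearest] += 1
    # Finalization: new centroid is the mean; an empty cluster keeps its old centroid.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(k)
    ]
```

Because every update to the reduction object is an associative, commutative accumulation, per-thread partial copies of sums and counts could in principle be combined afterwards; how the described middleware actually does this is not stated in the abstract.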
/content/cudazone/CUDABrowser/assets/images/applications/454_data-mining_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/454_data-mining_large.jpg Academia The Ohio State University, Columbus, OH, USA 2008 12 31 12/31/2008 50 Wenjing Ma Gagan Agrawal Paper Numerics Science Wenjing Ma, Gagan Agrawal 9bc5625b-e5f6-4072-a83c-32e59a956b1d A control-structure splitting optimization for GPGPU Control statements in a GPU program such as loops and branches pose serious challenges for the efficient usage of GPU resources because those control statements will lead to the serialization of threads and consequently ruin the occupancy of GPU, that is, the number of threads running concurrently. Unlike traditional vector processing units that are inside a general purpose processor, the GPU cannot leave the control statements to the CPU because fine-grain statement scheduling between GPU and CPU is impossible. We need an effective method to handle the control statements "just in place" on the GPUs. In this paper, we propose novel techniques to transform control statements so that they can be executed efficiently on GPUs. Our techniques smartly increase code redundancy, which might be deemed as "de-optimization" for CPU, to improve the occupancy of a program on GPU and therefore improve performance. We focus our attention on how common programming structures such as loops and branches decrease the occupancy of single kernels and how to counter that. We demonstrate our optimizations on a synthetic benchmark and a complex parallel algorithm, the Lattice Boltzmann Method (LBM). Our results show that these techniques are very efficient and can lead to an increase in occupancy and a drastic improvement in performance compared to non-split version of the programs. 
/content/cudazone/CUDABrowser/assets/images/applications/453_fracorg_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/453_fracorg_large.jpg Academia University of Delaware, Newark, USA 2008 12 31 12/31/2008 Snaider Carrillo Jakob Siegal Xiaoming Li Numerics Science Snaider Carrillo,Jakob Siegal,Xiaoming Li b223f4bf-c0f0-49bc-876f-1b1d7058d9e7 Massive parallel LDPC decoding Low-Density Parity-Check (LDPC) codes are powerful error correcting codes (ECC). They have recently been adopted by several data communication standards such as DVB-S2 and WiMax. LDPCs are represented by bipartite graphs, also called Tanner graphs, and their decoding demands very intensive computation. For that reason, VLSI dedicated architectures have been investigated and developed over the last few years. This paper proposes a new approach for LDPC decoding on graphics processing units (GPUs). Efficient data structures and a new algorithm are proposed to represent the Tanner graph and to perform LDPC decoding according to the stream-based computing model. GPUs were programmed to efficiently implement the proposed algorithms by applying data-parallel intensive computing. Experimental results show that GPUs perform LDPC decoding nearly three orders of magnitude faster than modern CPUs. Moreover, they lead to the conclusion that GPUs with their tremendous processing power can be considered as a consistent alternative to state-of-the-art hardware LDPC decoders. 
/content/cudazone/CUDABrowser/assets/images/applications/452_ldpc_generation_graph_small.png /content/cudazone/CUDABrowser/assets/images/applications/452_ldpc_generation_graph_large.png Academia Instituto de Telecomunicacoes/FCTUC, University of Coimbra, Coimbra, Portugal 2008 12 31 12/31/2008 Gabriel Falcao Leonel Sousa Vitor Silva Paper Numerics Science Gabriel Falcao,Leonel Sousa,Vitor Silva 3510667b-e35f-4140-a7dc-c4ff7e95ee68 Efficient computation of sum-products on GPUs through software-managed cache We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms. We apply this technique to the implementation of the GPU-based solver of the sum-product or marginalize a product of functions (MPF) problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the context of the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on GeForce 8800GTX GPU from NVIDIA over the optimized CPU version on an Intel 2.4GHz Core 2 with a 4MB L2 cache. 
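To make the comparison between MPF and a matrix product concrete, here is a minimal Python sketch of the smallest case: two functions over discrete variables, f(a, b) and g(b, c), stored as nested lists, with the shared variable b summed out. The function name marginalize_product and the table layout are assumptions for this example; the solver described above handles multi-dimensional tables and a software-managed cache, neither of which is shown here.

```python
def marginalize_product(f, g):
    """Return h[a][c] = sum over b of f[a][b] * g[b][c].

    Hypothetical two-function example; f and g are tables indexed by the
    values of discrete variables (a, b) and (b, c) respectively.
    """
    n_b = len(g)
    n_c = len(g[0])
    return [
        [sum(f_row[b] * g[b][c] for b in range(n_b)) for c in range(n_c)]
        for f_row in f
    ]
```

In this two-function case the operation is exactly an ordinary matrix product, which is why every entry of f and g is reused many times and why keeping tiles of the tables in fast on-chip memory pays off.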
/content/cudazone/CUDABrowser/assets/images/applications/451_6763420-0-large_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/451_6763420-0-large_large.jpg Academia Technion - Israel Institute of Technology, Haifa, Israel 2008 12 31 12/31/2008 270 Mark Silberstein Assaf Schuster Dan Geiger Paper Numerics Life Sciences Science Mark Silberstein, Assaf Schuster,Dan Geiger 66160e9c-33cb-4208-9562-05b21eceb571 Accelerating total variation regularization for matrix-valued images on GPUs The advent of new matrix-valued magnetic resonance imaging modalities such as Diffusion Tensor Imaging (DTI) requires extensive computational acceleration. Computational acceleration on graphics processing units (GPUs) can make the regularization (denoising) of DTI images attractive in clinical settings, hence improving the quality of DTI images in a broad range of applications. Construction of DTI images consists of direction-specific Magnetic Resonance (MR) measurements. Compared with conventional MR, direction-sensitive acquisition has a lower signal-to-noise ratio (SNR). Therefore, high noise levels often limit DTI imaging. Advanced post-processing of imaging data can improve the quality of estimated tensors. However, the post-processing problem is only made more computationally difficult when considering matrix-valued imaging data. This paper describes the acceleration of a Total Variation regularization method for matrix-valued images, in particular, for DTI images on NVIDIA Quadro FX 5600. The TV regularization of a 3-D image with 128^3 voxels ultimately achieves 266X speedup and requires 1 minute and 30 seconds on the Quadro, while this algorithm on a dual-core CPU completes in more than 3 hours. In this application study we aim to analyze the effect of excessive synchronization, which provides an insight into generally adapting Variational methods to the GPU architecture for other image processing algorithms designed for matrix-valued images. 
/content/cudazone/CUDABrowser/assets/images/applications/450_matrix_rose_leaf_3_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/450_matrix_rose_leaf_3_large.jpg Academia University of California, Los Angeles, CA 2008 12 31 12/31/2008 266 Maryam Moazeni Alex Bui Majid Sarrafzadeh Paper Imaging Science Maryam Moazeni,Alex Bui,Majid Sarrafzadeh ca59d3fd-811a-44aa-bffa-097765bd6b20 Performance analysis of accelerated image registration using GPGPU This paper presents a performance analysis of an accelerated 2-D rigid image registration implementation that employs the Compute Unified Device Architecture (CUDA) programming environment to take advantage of the parallel processing capabilities of NVIDIA's Tesla C870 GPU. We explain the underlying structure of the GPU implementation and compare its performance and accuracy against a fast CPU-based implementation. Our experimental results demonstrate that our GPU version is capable of up to 90x speedup with bilinear interpolation and 30x speedup with bicubic interpolation while maintaining a high level of accuracy. This compares favorably to recent image registration studies, but it also indicates that our implementation only reaches about 70% of theoretical peak performance. To analyze our results, we utilize profiling data to identify some of the underlying limitations of CUDA that prohibit peak performance. At the end, we emphasize the need to manage memory resources carefully to fully utilize the GPU and obtain maximum speedup. 
/content/cudazone/CUDABrowser/assets/images/applications/449_attention_based_image_registration_saliency_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/449_attention_based_image_registration_saliency_large.jpg Academia University of Notre Dame 2008 12 01 12/01/2008 90 Peter Bui Jay Brockman Paper Imaging,Science Peter Bui,Jay Brockman b33ef247-efa4-414a-bb2a-ee8f016f096f Accelerating advanced mri reconstructions Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. At present, MR imaging is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition (scan) times. Advanced image reconstruction algorithms can mitigate these limitations and improve image quality by simultaneously operating on scan data acquired with arbitrary trajectories and incorporating additional information such as anatomical constraints. However, the improvements in image quality come at the expense of a considerable increase in computation. This paper describes the acceleration of an advanced reconstruction algorithm on NVIDIA's Quadro FX 5600. Optimizations such as register allocating the voxel data, tiling the scan data, and storing the scan data in the Quadro's constant memory dramatically reduce the reconstruction's required bandwidth to on-chip memory. The Quadro's special functional units provide substantial acceleration of the trigonometric computations in the algorithm's inner loops, and experimentally-tuned code transformations increase the reconstruction's performance by an additional 20%. The reconstruction of a 3D image with 128^3 voxels ultimately achieves 150 GFLOPS and requires less than two minutes on the Quadro, while reconstruction on a quad-core CPU is thirteen times slower. 
Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%. In short, the acceleration afforded by the GPU greatly increases the appeal of the advanced reconstruction for clinical MRI applications. /content/cudazone/CUDABrowser/assets/images/applications/448_Img00250_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/448_Img00250_large.jpg Academia University of Illinois at Urbana-Champaign, Urbana, IL, USA 2008 12 31 12/31/2008 13 Samuel S. Stone Justin P. Haldar Stephanie C. Tsao Paper Imaging Medical Imaging Life Sciences Science Samuel S. Stone, Justin P. Haldar, Stephanie C. Tsao,ssstone2@crhc.uiuc.edu 51d66e62-d1a3-42b7-9f4b-29ce84e42a20 GPU acceleration of cutoff pair potentials for molecular modeling applications The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low compute efficiency, a newer strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work-efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. 
The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition. /content/cudazone/CUDABrowser/assets/images/applications/447_imprint_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/447_imprint_large.jpg Academia University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA 2008 12 31 12/31/2008 20 Christopher I. Rodrigues David J. Hardy John E. Stone Paper Numerics Christopher I. Rodrigues,David J. Hardy,John E. Stone, graphics processors, molecular dynamics 2155f6f6-18cc-479b-85a0-3e96576dff51 An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. 
Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language. /content/cudazone/CUDABrowser/assets/images/applications/446_figure09_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/446_figure09_large.jpg Academia Georgia Institute of Technology, Atlanta, GA 2008 12 01 12/01/2008 Sunpyo Hong Hyesoon Kim Paper Numerics Sunpyo Hong, Hyesoon Kim, hyesoon@cc.gatech.edu d9006f32-0255-4aab-a31d-a8f339088809 A translation system for enabling data mining applications on GPUs Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface. In this paper, we offer a solution that targets a specific class of applications, which are the data mining and scientific data analysis applications. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, the programmers simply need to specify the sequential reduction loop(s) with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU. Several additional optimizations are also performed by the system. We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieve over a sequential CPU version ranges between 20 and 50. 
The automatically generated version did not have any noticeable overheads compared to hand written codes. Finally, the optimizations performed in the system resulted in significant performance improvements. /content/cudazone/CUDABrowser/assets/images/applications/445_data-mining_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/445_data-mining_large.jpg Academia The Ohio State University, Columbus, OH, USA 2008 12 31 12/31/2008 50 Wenjing Ma Gagan Agrawal Paper Numerics Wenjing Ma,Gagan Agrawal 96154328-556a-4904-a557-ae73986ce7bc Hughes Trainable Text Skimmer: description of the TTS system as used for MUC-3 The objective of the Hughes Trainable Text Skimmer (TTS) Project is to create text skimming software that: (1) can be easily re-configured for new applications, (2) improves its performance with use, and (3) is fast enough to process megabytes of text per day. The TTS-MUC3 system is our first full scale prototype. /content/cudazone/CUDABrowser/assets/images/applications/444_text-deactivation_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/444_text-deactivation_large.jpg Academia Hughes Research Laboratories, Malibu, CA 2008 12 31 12/31/2008 Charles P. Dolan Thomas V. Cuda Seth R. Goldman Paper Numerics Charles P. Dolan, Thomas V. Cuda,Seth R. Goldman, HRLcontracts@hrl.com 68f9f27b-3df1-46c8-aebf-1c79fc6f3a47 Accelerating linpack with CUDA on heterogenous clusters This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogenous clusters, where both CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on both GPUs and CPU cores. An 8U cluster is able to sustain more than a Teraflop using a CUDA accelerated version of HPL. 
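The idea of running one DGEMM call on two kinds of processors at once can be sketched as follows. This is a toy Python illustration, not the host library described above: it assumes the work is divided by splitting the columns of B, uses two CPU threads as stand-ins for the GPU and the CPU cores, and the names split_matmul and gpu_fraction are hypothetical. The abstract does not say how the real library partitions the matrices.

```python
from concurrent.futures import ThreadPoolExecutor


def matmul(a, b):
    """Plain product of two row-major nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def split_matmul(a, b, gpu_fraction=0.75):
    """Compute A times B with B's columns divided between two workers.

    Assumption for illustration: a column split with a fixed fraction.
    Both workers are ordinary threads standing in for a GPU and CPU cores.
    """
    n_cols = len(b[0])
    cut = round(n_cols * gpu_fraction)
    b_gpu = [row[:cut] for row in b]  # share for the "GPU" worker
    b_cpu = [row[cut:] for row in b]  # share for the "CPU" worker
    with ThreadPoolExecutor(max_workers=2) as pool:
        f_gpu = pool.submit(matmul, a, b_gpu)
        f_cpu = pool.submit(matmul, a, b_cpu)
        c_gpu, c_cpu = f_gpu.result(), f_cpu.result()
    # Stitch the two column blocks of the result back together row by row.
    return [g + c for g, c in zip(c_gpu, c_cpu)]
```

The two partial products are independent, so the caller sees a single result while both workers are busy; in a real heterogeneous setup the split fraction would have to reflect the relative throughput of the devices.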
/content/cudazone/CUDABrowser/assets/images/applications/443_1476-072X-7-57-2-l_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/443_1476-072X-7-57-2-l_large.jpg NVIDIA http://www.nvidia.com/cuda 2008 12 31 12/31/2008 Massimiliano Fatica Paper Numerics Massimiliano Fatica, mfatica@nvidia.com, 53bada17-5c89-4cf8-8e56-5edee7ba8578 High-performance CUDA kernel execution on FPGAs In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state of the art high-level synthesis tool AutoPilot from AutoESL, to efficiently map the exposed parallelism in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers the advantage of a common programming interface for exploiting parallelism on two very different types of accelerators -- FPGAs and GPUs. Moreover, by leveraging the advanced synthesis capabilities of AutoPilot we enable efficient exploitation of the FPGA configurability for application specific acceleration. Our flow is based on a compilation process that transforms the SPMD CUDA thread blocks into high-concurrency AutoPilot-C code. We provide an overview of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the generated multi-core accelerators. /content/cudazone/CUDABrowser/assets/images/applications/442_fpga_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/442_fpga_large.jpg Academia University of Illinois, Urbana - Champaign, IL, USA 2008 12 31 12/31/2008 Alexandros Papakonstantinou Karthik Gururaj John A. Stratton Paper Electronic Design Automation Imaging Science Alexandros Papakonstantinou,Karthik Gururaj,John A. Stratton 56573ff8-9c5d-49b3-a543-759fd0b3dfb8 A Cross-Input Adaptive Framework for GPU Program Optimizations This work presents a CUDA program optimizer, named G-ADAPT. It is a tool for helping programmers determine the suitable values of a set of optimization parameters for a CUDA application. 
It is unique in being adaptive to the influence of program inputs on the application's executions. /content/cudazone/CUDABrowser/assets/images/applications/441_tjetb_iso_shaded_small.png /content/cudazone/CUDABrowser/assets/images/applications/441_tjetb_iso_shaded_large.png Academia The College of William and Mary http://www.cs.wm.edu/caps/ 2009 05 25 05/25/2009 Xipeng Shen Paper Programming Tools Program Optimizations, empirical search, Cross-input Adaptation. Xipeng Shen b6d01d14-c22a-43ac-bc34-9e0ee006c583 Optimization principles and application performance evaluation of a multithreaded GPU GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. 
We apply these strategies across a variety of applications and domains and achieve between a 10.5X to 457X speedup in kernel codes and between 1.16X to 431X total application speedup. /content/cudazone/CUDABrowser/assets/images/applications/440_comet-connections_small.png /content/cudazone/CUDABrowser/assets/images/applications/440_comet-connections_large.png Academia University of Illinois at Urbana-Champaign, Urbana, IL, USA 2008 12 31 12/31/2008 431 Shane Ryoo Christopher I. Rodrigues Sara S. Baghsorkhi Paper Parallel Algorithms Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, GPU computing, parallel computing 222b928f-43be-4e02-a007-a005d2655181 Bandwidth intensive 3-D FFT kernel Most GPU performance "hypes" have focused around tightly-coupled applications with small memory bandwidth requirements e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies thereof have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, being more than three times faster than any existing FFT implementations on GPUs including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed stride memory accesses. Our kernel applied to real applications achieves orders of magnitude boost in power&cost vs. performance metrics. The off-card bandwidth limitation is still an issue, which could be alleviated somewhat with application kernels confinement within the card, while ideal solution being facilitation of faster GPU interfaces. 
/content/cudazone/CUDABrowser/assets/images/applications/439_ERGOpage04_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/439_ERGOpage04_large.jpg Academia Tokyo Institute of Technology, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan 2008 12 31 12/31/2008 3 Akira Nukada Yasuhiko Ogata Toshio Endo Paper Numerics Akira Nukada, Yasuhiko Ogata, Toshio Endo, Algorithms, Design, Experimentation, Measurement, Performance ccaf5379-24f8-488e-a424-c6c223458be2 A High Performance Agent Based Modelling Framework We present an efficient implementation of a high performance parallel framework for Agent Based Modelling (ABM), exploiting the parallel architecture of the Graphics Processing Unit (GPU). It provides a mapping between formal agent specifications, with C based scripting, and optimised NVIDIA Compute Unified Device Architecture (CUDA) code. The mapping of agent data structures and agent communication is described, and our work is evaluated through a number of simple interacting agent examples. In contrast with an alternative, single machine CPU implementation, a speedup of up to 250 times is reported. /content/cudazone/CUDABrowser/assets/images/applications/438_11219696_small.JPG /content/cudazone/CUDABrowser/assets/images/applications/438_11219696_large.JPG Academia University of Sheffield, UK 2008 12 31 12/31/2008 250 Paul Richmond Simon Coakley Daniela M. Romano Paper parallel algorithms Paul Richmond, Simon Coakley, Daniela M. Romano 108e4261-9842-4b48-9266-5445cda7c5df Accelerating phase unwrapping and affine transformations for optical quadrature microscopy Optical Quadrature Microscopy (OQM) is a process which uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. 
In one particular application of interest, the framework is used to extract phase information from the image of an embryo to determine embryo viability. Phase Unwrapping is the process of reconstructing the real phase shift (propagation delay) of a sample from the measured "wrapped" representation which is between -π and +π. Unwrapping can be done using the Minimum LP Norm Phase Unwrap algorithm. Images are first preprocessed using an Affine Transform before they are unwrapped. Both of these steps are time consuming and would benefit greatly from parallelization and acceleration. Faster processing would lower many research barriers (in terms of throughput and performance) present when using OQM. In this paper we report on accelerating Phase Unwrapping and Affine Transformations using NVIDIA's CUDA programming model. We also run elementary noise removal on the GPU using NVIDIA's CUBLAS (CUDA Basic Linear Algebra Subprograms) library. We integrate GPU execution into a Matlab environment to seamlessly interface to the pre-existing image acquisition system. By mapping the unwrap and noise removal to a GPU, and by also reducing the amount of I/O overhead, we are able to accelerate the end-to-end process by more than 7.3x. This enables our imaging framework to perform high speed image acquisition and visualization at near real-time rates. /content/cudazone/CUDABrowser/assets/images/applications/437_20060621-QuenchedSi-AFM_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/437_20060621-QuenchedSi-AFM_large.jpg Academia Northeastern University, Boston, MA 2008 12 31 12/31/2008 8 Miriam Leeser Sherman Braganza David Kaeli Perhaad Mistry Paper Medical Imaging Life Sciences Science Perhaad Mistry, Sherman Braganza, David Kaeli, pmistry@ece.neu.edu 16bab639-6bea-43bf-bb15-a160a8fb5924 hiCUDA: a high-level directive-based language The Compute Unified Device Architecture (CUDA) has become a de facto standard for programming NVIDIA GPUs. 
However, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host memory and various components of the GPU memory, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, which are often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code. Nonetheless, it supports the same programming paradigm already familiar to CUDA programmers. We have prototyped a source-to-source compiler that translates a hiCUDA program to a CUDA program. Experiments using five standard CUDA benchmarks show that the simplicity and flexibility hiCUDA provides come at no expense to performance. /content/cudazone/CUDABrowser/assets/images/applications/436_sombrero_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/436_sombrero_large.jpg Academia University of Toronto, Toronto, Ontario, Canada 2008 12 31 12/31/2008 Tianyi David Han Tarek S. Abdelrahman Paper Parallel Algorithms b4a32d12-7514-402c-81f6-0f3a1131a030 Harvesting graphics power for MD simulations We discuss an implementation of molecular dynamics (MD) simulations on a graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPUs is compared to their main processor counterpart. We achieve speedups of up to 80, 40 and 150 fold, respectively. With the newest generation of GPUs one can run standard MD simulations at 10^7 flops. 
/content/cudazone/CUDABrowser/assets/images/applications/435_math_snap-480_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/435_math_snap-480_large.jpg Academia FOM Institute for Atomic and Molecular Physics, Kruislaan 2007 09 01 09/01/2007 150 J.A. van Meel A. Arnold D. Frenkel Paper Digital Content Creation Graphics Imaging Science simulations 7bb139aa-5808-4bf2-a89b-2c4666abc8cc GPU computing for 2-d spin systems: CUDA vs OpenGL In recent years the more and more powerful GPU's available on the PC market have attracted attention as a cost effective solution for parallel (SIMD) computing. CUDA is a solid evidence of the attention that the major companies are devoting to the field. CUDA is a hardware and software architecture developed by Nvidia for computing on the GPU. It qualifies as a friendly alternative to the approach to GPU computing that has been pioneered in the OpenGL environment. We discuss the application of both the CUDA and the OpenGL approach to the simulation of 2-d spin systems (XY model). /content/cudazone/CUDABrowser/assets/images/applications/434_opengl_small.png /content/cudazone/CUDABrowser/assets/images/applications/434_opengl_large.png Academia University of Parma 2008 11 13 11/13/2008 Viola Anselmi Giovanni Conti Francesco Di Renzo Paper Numerics Science 45b89c24-5196-4659-aa01-b47994748c78 Accelerating numerical solution of Stochastic Differential Equations with CUDA Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculations with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them by two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. 
In the presented cases the measured speedup can be as high as 675x compared to a standard CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively. /content/cudazone/CUDABrowser/assets/images/applications/433_numerical_small.png /content/cudazone/CUDABrowser/assets/images/applications/433_numerical_large.png Academia Institute of Physics, University of Silesia 2009 03 23 03/23/2009 675 M. Januszewski M. Kostur Paper Numerics Science Josephson junction, Kuramoto, graphics processing unit, advanced computer architecture, numerical integration, diffusion, stochastic differential equation, CUDA, Tesla, NVIDIA 17d19b5f-5d93-4db7-87a8-1d58ee75a60b An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home) We present a detailed approach for making use of two new computer hardware architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis application (Einstein@Home). Our results suggest that both architectures suit the application quite well and that the performance achievable in the same software development time-frame is nearly identical. 
/content/cudazone/CUDABrowser/assets/images/applications/432_96714main_DiskPreBurst_lg_web-1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/432_96714main_DiskPreBurst_lg_web-1_large.jpg Academia Research Group Programming Languages, Methodologies Universitat Kassel 2008 12 31 12/31/2008 06/29/2009 Jens Breitbart Gaurav Khanna Paper Numerics Science Signal Processing 99f41941-0cd4-4e77-88f3-19bbb5787ac0 Teraflop per second gravitational lensing ray-shooting using graphics processing units Gravitational lensing calculation using a direct inverse ray-shooting approach is a computationally expensive way to determine magnification maps, caustic patterns, and light-curves (e.g. as a function of source profile and size). However, as an easily parallelisable calculation, gravitational ray-shooting can be accelerated using programmable graphics processing units (GPUs). We present our implementation of inverse ray-shooting for the NVIDIA G80 generation of graphics processors using the NVIDIA Compute Unified Device Architecture (CUDA) software development kit. We also extend our code to multiple-GPU systems, including a 4-GPU NVIDIA S1070 Tesla unit. We achieve sustained processing performance of 182 Gflop/s on a single GPU, and 1.28 Tflop/s using the Tesla unit. We demonstrate that billion-lens microlensing simulations can be run on a single computer with a Tesla unit in timescales of order a day without the use of a hierarchical tree code. /content/cudazone/CUDABrowser/assets/images/applications/431_tera_small.png /content/cudazone/CUDABrowser/assets/images/applications/431_tera_large.png Academia Centre for Astrophysics and Supercomputing, Swinburne University of Technology 2009 05 15 05/15/2009 100 Alexander C. Thompson Christopher J. Fluke David G. 
Barnes Paper Graphics Imaging Numerics Science 9fe009cc-0722-47c2-8f4a-dbef6569de7a SAR simulation Synthetic Aperture Radar (SAR) target signal simulation, and SAR imaging /content/cudazone/CUDABrowser/assets/images/applications/430_d0b97bfa-b534-4d93-bb7d-cf8e9da6d64dB_small.JPG /content/cudazone/CUDABrowser/assets/images/applications/430_d0b97bfa-b534-4d93-bb7d-cf8e9da6d64dB_large.JPG Academia UESTC EE 2008 05 30 05/30/2008 40 Wang haihua Yu qin Zhang Shu Application Signal Processing eeae1a85-6d79-4615-92ad-23040cec2407 Chromakey Solution for Photo Studio Chromakey Solution for Photo Studio is a system that provides a synthesized live view of video and a background still image using the ISP chromakey algorithm, accelerated by CUDA. /content/cudazone/CUDABrowser/assets/images/applications/429_Chromakey_small.png /content/cudazone/CUDABrowser/assets/images/applications/429_Chromakey_large.png Commercial Research Institute of Systems Planning,Inc 2009 07 27 07/27/2009 13 Research Institute of Systems Planning,Inc Paper Imaging Video & Audio chromakey, synthesized live view 71454a1e-2588-4983-85bd-578a6d501c65 MediaCoder MediaCoder is a free universal batch media transcoder, which nicely integrates most popular audio/video codecs and tools into an all-in-one solution /content/cudazone/CUDABrowser/assets/images/applications/428_mc-skinned_small.png /content/cudazone/CUDABrowser/assets/images/applications/428_mc-skinned_large.png Commercial mediacoder http://www.mediacoderhq.com/index.htm 2009 07 22 07/22/2009 Stanley Huang Application Video & Audio audio and video transcoder 34f38ba0-4c4b-49e0-b737-14e7f8028d73 Furry Ball: GPU renderer for Maya GPU renderer for Maya for studio use. 
Features: Direct X 10 compatible, Full Maya Integration in Viewport, Complete realtime Dynamic Fur and Hairs, Bump mapping, Lambert, Blin, Phong materials, Textures, Unlimited lights Soft Shadows (with variable penumbra), Reflection, Blurred reflection, Resolution up to 8k, Unlimited Supersampling, Per Object Supersampling, Ambient occlusion, Transparency. 100-300 times faster than CPU render on regular Geforce card. /content/cudazone/CUDABrowser/assets/images/applications/427_furry_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/427_furry_large.jpg Commercial Art And Animation studio http://www.aaa-studio.cz 2009 12 30 12/30/2009 300 Commercial Art And Animation studio Application Multimedia Graphics Imaging Video & Audio GPU renderer Maya aaa7ed55-38b8-4b94-889b-b0d2a3d6b216 Massively-Parallel Game Servers This work is intended to show that the GPU is the most appropriate technology for game servers, and also that high performance for game servers can be achieved with low cost hardware. /content/cudazone/CUDABrowser/assets/images/applications/426_CPUvsGPU_game_server_time_chart_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/426_CPUvsGPU_game_server_time_chart_large.jpg Maxime Griot 2009 07 24 07/24/2009 Research Maxime Griot Paper server 1d3b0c3e-d0a6-4727-bf26-bf373f8f6953 Parallel, distributed and GPU computing technologies in single-particle electron microscopy Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. 
Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined. /content/cudazone/CUDABrowser/assets/images/applications/425_POLlen-_thmb_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/425_POLlen-_thmb_large.jpg 2008 10 29 10/29/2008 Martin Schmeisser Burkhard C. Heisen Mario Luettich Paper Science parallel processing, GPU processing, distributed heterogeneous computing, nondedicated systems, multicore performance, cluster computing, electron microscopy,Martin Schmeisser, Burkhard C. Heisen,Mario Luettich da9e5636-cfb8-4460-add9-571196637dcf CUDA Path Tracing Demo CUDA Path Tracing Demo. Can load and display static .obj scene with diffuse path tracing or direct illumination, both from a single area light. /content/cudazone/CUDABrowser/assets/images/applications/424_ruins_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/424_ruins_new_large.jpg Academia Saarland University 2009 07 01 07/01/2009 Javor Kalojanov Application Graphics Javor Kalojanov, javor@graphics.cs.uni-sb.de b8685f2e-ed2e-4b7a-86f5-37fe0ac5f8dd ToraTora File Encryption Software ( AES 256bit ) /content/cudazone/CUDABrowser/assets/images/applications/423_OpenHelp_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/423_OpenHelp_large.jpg Commercial iCanal Inc. 
http://www.icanal.co.jp/ 2009 07 10 07/10/2009 3 Naoki Hirayama Application File Utility nhiraya@icanal.co.jp, Naoki Hirayama 9e515da5-c97e-4c37-8305-f27982a02d5f CUDA Multiforcer Multihash CUDA Brute Forcer - The world's fastest cross-platform MD4/MD5/NTLM cracking for Windows/Mac/Linux /content/cudazone/CUDABrowser/assets/images/applications/421_md5lookupwidget_20070725131833_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/421_md5lookupwidget_20070725131833_large.jpg Research Cryptohaze http://www.cryptohaze.com/ 2009 01 23 01/23/2009 Bitweasil Application Information Security CUDA-Multiforcer, MD4, MD5, NTLM 7e5043c6-ca48-401d-a1e3-8fa1d3e12f99 Rapid Aerodynamic Performance Prediction on a Cluster of Graphics Processing Units We investigate the use of a cluster of GPUs for large-scale CFD problems and show order-of-magnitude increases in performance and performance-to-price ratio. We implement two separate compressible flow solvers. First, we develop a CUDA-based solver for the 2D compressible Euler equations and verify the results against a reference multi-block code MBFLO. After demonstrating the performance of our Euler solver, we proceed to develop a new version of MBFLO by adding GPU-accelerated subroutines to the existing Fortran codebase. Using an eight-node cluster equipped with 16 NVIDIA 9800GX2 GPUs, we achieve speedups of up to 496x on our Euler Solver and 88x on MBFLO. This paper describes the numerical, hardware and software techniques that provide significant speedups. /content/cudazone/CUDABrowser/assets/images/applications/420_fig5_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/420_fig5_large.jpg Academia University of California, Davis 2009 01 05 01/05/2009 496 Everett H. Phillips Yao Zhang Roger L. Davis Paper Computational Fluid Dynamics Everett H. Phillips, Yao Zhang, Roger L. 
Davis, MBFLO, Euler Solver, Navier-Stokes 03bb68ad-6cb6-4a8d-969e-53eebed9a521 PowerDVD 9 Video playback quality optimization with TrueTheater HD. CUDA support with build 1719 /content/cudazone/CUDABrowser/assets/images/applications/419_box_PDVD_9_ultra_eng-150_small.gif /content/cudazone/CUDABrowser/assets/images/applications/419_box_PDVD_9_ultra_eng-150_large.gif Commercial Cyberlink 2009 05 22 05/22/2009 Commercial Cyberlink Application Video & Audio CUDA video DVD upscaling quality 47d7ca93-8eb9-4f19-b3f6-813e99b2aa02 Granular Matter Realtime, simple model of granular matter implemented on CUDA architecture /content/cudazone/CUDABrowser/assets/images/applications/418_c2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/418_c2_large.jpg Academia Maciej Matyka 2009 07 08 07/08/2009 Maciej Matyka Application Multimedia Code Game Physics Graphics Maciej Matyka, CUDA Physics Gravity f2bf43a7-bd39-45ba-9845-a50c956157cd Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces We present a strategy for performing view-adaptive, crack-free tessellation of Catmull-Clark subdivision surfaces entirely on programmable graphics hardware. Our scheme extends the concept of breadth-first subdivision, which up to this point has only been applied to parametric patches. While mesh representations designed for a CPU often involve pointer-based structures and irregular per-element storage, neither of these is well-suited to GPU execution. To solve this problem, we use a simple yet effective data structure for representing a subdivision mesh, and design a careful algorithm to update the mesh in a completely parallel manner. We demonstrate that in spite of the complexities of the subdivision procedure, real-time tessellation to pixel-sized primitives can be done. Our implementation does not rely on any approximation of the limit surface, and avoids both subdivision cracks and T-junctions in the subdivided mesh. 
Using the approach in this paper, we are able to perform real-time subdivision for several static as well as animated models. Rendering performance is scalable for increasingly complex models. /content/cudazone/CUDABrowser/assets/images/applications/417_return_small.png /content/cudazone/CUDABrowser/assets/images/applications/417_return_large.png Academia University of California, Davis http://www.ece.ucdavis.edu/ 2009 08 01 08/01/2009 Anjul Patney Paper Graphics Anjul Patney, apatney@ucdavis.edu 87183ef1-9947-470d-bd10-3352dc74be16 Real-Time Reyes-Style Adaptive Surface Subdivision We present a GPU based implementation of Reyes-style adaptive surface subdivision, known in Reyes terminology as the Bound/Split and Dice stages. The performance of this task is important for the Reyes pipeline to map efficiently to graphics hardware, but its recursive nature and irregular and unbounded memory requirements present a challenge to an efficient implementation. Our solution begins by characterizing Reyes subdivision as a work queue with irregular computation, targeted to a massively parallel GPU. We propose efficient solutions to these general problems by casting our solution in terms of the fundamental primitives of prefix-sum and reduction, often encountered in parallel and GPGPU environments. Our results indicate that real-time Reyes subdivision can indeed be obtained on today's GPUs. We are able to subdivide a complex model to subpixel accuracy within 15 ms. Our measured performance is several times better than that of Pixar's RenderMan. Our implementation scales well with the input size and depth of subdivision. We also address concerns of memory size and bandwidth, and analyze the feasibility of conventional ideas on screen-space buckets. 
/content/cudazone/CUDABrowser/assets/images/applications/416_reyes08_new_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/416_reyes08_new_large.jpg Academia University of California, Davis http://www.ece.ucdavis.edu/ 2008 12 01 12/01/2008 Anjul Patney Paper Graphics Anjul Patney, apatney@ucdavis.edu df98027a-dd64-4806-a323-db07d9c1c88d R GPU: Enabling GPU Computing in the R Statistical Environment R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia's CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research. To overcome this obstacle, we decided to devote efforts for moving frequently used R functions in our work to the GPU using CUDA. In the ideal solution, if a CUDA compatible GPU and driver is present on a user's machine, the user may only need to prefix "gpu" to the original function name to take advantage of the GPU implementation of the corresponding R function. We take achieving this ideal as one of our primary goals so that any biomedical researcher can harness the computational power of a GPU using a familiar tool. Since our code is open source, researchers may customize the R interfaces to their particular needs. In addition, because CUDA uses shared libraries and unobtrusive extensions to the C programming language, any experienced C programmer can easily customize the underlying code. 
Using the CUDA extension to C and the shared linear algebra library CUBLAS, we have implemented a variety of statistical analysis functions with R interfaces that execute with different degrees of parallelism on a Graphics Processing Unit (GPU). If an algorithm is comprised of common vector or matrix operations each performed once, we involve the GPU by implementing those operations with calls to CUBLAS. If an algorithm involves computing the elements of a large matrix, we can often merely assign each thread executing on the GPU a portion of a row and/or column. Algorithms for which we have implemented GPU enabled versions include the calculations of distances between sets of points (R dist function), hierarchical clustering (R hclust function), Pearson and Kendall correlation coefficients (similar to R cor function), and the Granger test ('granger.test' in the R MSBVAR package). We are committed to implementing more R GPU functions, and we hope to contribute packages to the open source community via our project's website. The initial package is hosted by CRAN as gputools, a source package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia's nvcc compiler and CUBLAS shared library. We hope that others can contribute to the R-GPGPU effort and encourage any comments or suggestions. /content/cudazone/CUDABrowser/assets/images/applications/415_rgpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/415_rgpu_large.png Academia The Molecular and Behavioral Neuroscience Institute, U. of Michigan http://brainarray.mbni.med.umich.edu/ 2009 06 14 06/14/2009 75 Open source J. 
Buckner Paper Code Numerics Life Sciences Oil & Gas 29d03c16-cf86-4f24-9e60-3a943214b48a Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster We have designed a fast parallel simulator that solves the acoustic wave equation on a GPU cluster. Solving the acoustic wave equation in an oil exploration industrial context aims at speeding up seismic modeling and Reverse Time Migration. We consider a finite difference approach on a regular mesh, in both 2D and 3D cases. The acoustic wave equation is solved in either a constant density or a variable density domain. All the computations are done in single precision, since double precision is not required in our context. We use CUDA to take advantage of the GPUs' computational power. We study different implementations and their impact on the application performance. We obtain a speedup of 10 for Reverse Time Migration and up to 30 for the modeling application over a sequential code running on a general purpose CPU. /content/cudazone/CUDABrowser/assets/images/applications/414_inra_small.png /content/cudazone/CUDABrowser/assets/images/applications/414_inra_large.png INRIA Bordeaux 2009 07 14 07/14/2009 30 Rached Abdelkhalek Paper Libraries Rached Abdelkhalek 0f9d31f9-f736-4c8a-a281-0aa5b0883fe5 Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogeneous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. State-of-the-art GPUs exhibit very high computational capabilities at a reasonable price. 
These GPUs also support high-level parallel programming models, for example, NVIDIA's Compute Unified Device Architecture (CUDA) or Brook+ from AMD, that enable users to develop parallel applications that use the CPU as the host and the GPU as an accelerator. Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA CUDA interfaces, and works on all NVIDIA GPUs that support this library. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrates that the SpMV library provides a two to four times improvement over the equivalent solution provided by NVIDIA's CUDPP library. While the current implementation of the SpMV code uses the CUDA interfaces, the code can be easily migrated to use the upcoming OpenCL standard. This will allow the SpMV code to execute on a wide range of GPU architectures. 
/content/cudazone/CUDABrowser/assets/images/applications/413_thumbnail_small.png /content/cudazone/CUDABrowser/assets/images/applications/413_thumbnail_large.png Research IBM Research 2009 04 21 04/21/2009 10 Open source Rajesh Bordawekar Code Computational Fluid Dynamics Electronic Design Automation Medical Imaging Numerics Life Sciences Libraries Oil & Gas Rajesh Bordawekar, Linear Algebra, Sparse Matrix-Vector Multiplication 93caed1b-cbd9-4f44-88a3-ff0dc1182e35 Practical Pre-stack Kirchhoff Time Migration of Seismic Processing on General Purpose GPU In this paper, we introduce three prototypes of GPGPU solutions on the NVidia GeForce8800GT for a practical Pre-stack Kirchhoff Time Migration program. We present how to re-design and re-implement the original CPU code as efficient GPU code. The prototypes are up to 7.2 times faster than the CPU version on Intel's P4 3.0G /content/cudazone/CUDABrowser/assets/images/applications/412_kirchhoff_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/412_kirchhoff_large.jpg 2009 01 01 01/01/2009 7 Xiaohua Shi Xu Wang Paper Signal Processing Xiaohua Shi, Xu Wang, Changhai Zhao d5429d39-25a0-402b-9492-d775fa79105f Exploiting Computing Power on Graphics Processing Unit With recent technological advances, graphics processing units (GPUs) are providing increasingly higher performance with improved programmability. This paper investigates NVIDIA's CUDA technology, which enables data mining algorithms to be parallelized effectively on the GPU. The proposed algorithm exploits the computational power and the memory hierarchy of GPUs, using the shared memory to store frequently accessed data. Experimental results indicate that the speed of the computation through the GPU is considerably faster than through the CPU. 
/content/cudazone/CUDABrowser/assets/images/applications/411_0408_Hoff4_305_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/411_0408_Hoff4_305_large.jpg 2008 12 01 12/01/2008 Ziyi Liu Wenjing Ma Paper Ziyi Liu, Wenjing Ma 13e2548c-0608-45e5-b42c-32349121a917 Sequence alignment with GPU: Performance and design challenges In bioinformatics, alignments are commonly performed in genome and protein sequence analysis for gene identification and evolutionary similarities. There are several approaches for such analysis, each varying in accuracy and computational complexity. Smith-Waterman (SW) is by far the best algorithm for its accuracy in similarity scoring. However, execution time of this algorithm on general purpose processor based systems makes it impractical for use by life scientists. In this paper we take Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture on to the GPU. We achieve a 23x speedup against the serial version of the SW algorithm. We further study the effect of memory organization and the instruction set architecture on GPU performance. For that purpose we analyze another implementation on an Intel Quad Core processor that makes use of Intel's SIMD based SSE2 architecture. We show that if reading blocks of 16 words at a time instead of 4 is allowed, and if 64KB of shared memory as opposed to 16KB is available to the programmer, GPU performance enhances significantly making it comparable to the SIMD based implementation. We quantify these observations to illustrate the need for studies on extending the instruction set and memory organization for the GPU. 
/content/cudazone/CUDABrowser/assets/images/applications/410_bioinformatics_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/410_bioinformatics_large.jpg Academia Department of Electrical and Computer Engineering, University of Arizona 2009 01 01 01/01/2009 23 Gregory M. Striemer Ali Akoglu Paper Life Sciences Gregory M. Striemer, Ali Akoglu 8c74ee14-db85-421e-aab4-62db9a7be80e High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units The power of graphics processing unit (GPU) has been increasing rapidly more than that of CPU. It is not surprising that many software libraries were developed which enable us to use the power of GPU for general computations especially in parallel data processing. In this paper, we propose implementations of the standard block cipher ARIA of Korea using OpenGL and CUDA libraries on GPU. Since ARIA was announced only 4 years ago, there is no hardware solution yet providing high-speed encryption with ARIA. We make use of GPU as a parallel processors with several grid structures and optimize the encryption speed and the occupancy of shared-memory. As a result, when ARIA is running on GeForce 8800GTS using CUDA library, the speed of the encryption reaches up to 4.8 Gbps which is the fastest implementation of ARIA known to public. /content/cudazone/CUDABrowser/assets/images/applications/408_smileycbcb_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/408_smileycbcb_large.jpg 2008 12 01 12/01/2008 Yongjin Yeom Yongkuk Cho Moti Yung Paper Numerics Yongjin Yeom,Yongkuk Cho,Moti Yung fdc24388-6fb6-4e40-ab58-56bafa8f9422 Swarm's flight: Accelerating the particles using C-CUDA With the development of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform, several areas of knowledge are being benefited with the reduction of the computing time. 
Our goal is to show how optimization algorithms inspired by Swarm Intelligence can take profit from this technology. In this paper, we provide an implementation of the Particle Swarm Optimization (PSO) algorithm in C-CUDA. The algorithm was tested on a suite of well-known benchmark optimization problems and the computing time has been compared with the same algorithm implemented in C and Matlab. Results demonstrate that the computing time can significantly be reduced using C-CUDA. As far as we know, this is the first implementation of PSO in C-CUDA. /content/cudazone/CUDABrowser/assets/images/applications/407_flock1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/407_flock1_large.jpg Academia Departmento de Informatica, PPGI, Universidade Federal do Espirito Santo 2009 01 01 01/01/2009 Lucas de P. Veronese Renato A. Krohling Paper Science Lucas de P. Veronese, Renato A. Krohling cb1e2f8f-6bd3-4ac8-ab13-c84866cb1f6e Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence Direct numerical simulation (DNS) of turbulence is computationally intensive and typically relies on some form of parallel processing. Spectral kernels used for spatial discretization are a common computational bottleneck on distributed memory architectures. One way to increase DNS algorithms' efficiency is to parallelize spectral kernels using tightly coupled single-program, multiple-data (SPMD) multiprocessor units with minimal interprocessor communication latency. The authors present techniques to map DNS computations to modern graphics processing units (GPUs), which are characterized by very high memory bandwidth and hundreds of SPMD processors. The article compares the performance between the authors' parallel algorithm running on a GPU versus the associated CPU implementation of a solver for one of the fundamental nonlinear models of turbulence theory. 
They also demonstrate a prototype of a scalable computational steering framework based on turbulence simulation and visualization coupling on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/406_F-TFTRplasma_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/406_F-TFTRplasma_large.jpg Academia University of Maryland, College Park 2009 03 01 03/01/2009 George Stantchev Derek Juba William Dorland Paper Science George Stantchev,Derek Juba,William Dorland f0c09d7d-c0a3-4f4d-8e8b-4ba17d10e0bc GPU-based parallel particle swarm optimization A novel parallel approach to run standard particle swarm optimization (SPSO) on a Graphics Processing Unit (GPU) is presented in this paper. By using the general-purpose computing ability of the GPU and the Compute Unified Device Architecture (CUDA) software platform from NVIDIA, SPSO can be executed in parallel on the GPU. Experiments are conducted by running SPSO on both GPU and CPU to optimize four benchmark test functions. The running time of the SPSO based on GPU (GPU-SPSO) is greatly shortened compared to that of the SPSO on CPU (CPU-SPSO). Running speed of GPU-SPSO can be more than 11 times as fast as that of CPU-SPSO, with the same performance. Compared to CPU-SPSO, GPU-SPSO shows special speed advantages on large swarm population applications and high-dimensional problems, which can be widely used in real optimizing problems.
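The particle update that both PSO entries above accelerate can be sketched serially as follows. This is a minimal sketch under stated assumptions, not the SPSO variant used in the paper: it uses a global-best topology, and the function names, the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the bounds handling, and the sphere test function are all illustrative choices. Within one iteration, each particle's velocity and position update depends only on its own state and the shared best position, so the inner loop over particles is the part that lends itself to one-thread-per-particle execution.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=30, n_iters=200,
                 w=0.729, c1=1.49445, c2=1.49445, seed=0):
    """Global-best PSO sketch; all parameter defaults are assumptions."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # per-particle best positions
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(n_iters):
        for i in range(n_particles):            # independent per particle
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp to the search box (a simple assumed boundary rule).
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex benchmark: sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)
```

A call such as `pso_minimize(sphere, dim=2, bounds=(-5.0, 5.0))` returns the best position found and its function value; the fixed seed makes the run repeatable.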
/content/cudazone/CUDABrowser/assets/images/applications/404_ParticleSwarmOptimization_small.png /content/cudazone/CUDABrowser/assets/images/applications/404_ParticleSwarmOptimization_large.png Academia Key Laboratory of Machine Perception and Intelligence (Peking University) 2009 05 01 05/01/2009 11 You Zhou Ying Tan Paper Science You Zhou,Ying Tan 9a4d9359-25e3-4dc2-b4c8-65c7c4b3633c The Virtual Marathon: Parallel Computing Supports Crowd Simulations To be realistic, an urban model must include appropriate numbers of pedestrians, vehicles, and other dynamic entities. Using a parallel-computing architecture, researchers simulated a marathon with more than a million participants. To simulate participant behavior, they used fuzzy logic on a GPU to perform millions of inferences in real time. /content/cudazone/CUDABrowser/assets/images/applications/403_crowd_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/403_crowd_large.jpg Academia Middle East Technical University 2009 07 01 07/01/2009 Erdal Yilmaz Veysi Isler Yasemin Yardimci Cetin Paper Erdal Yilmaz, Veysi Isler, Yasemin Yardimci Cetin 3cdcf8ea-261d-4611-aaf3-44761158df5f Accelerating Dust Temperature Calculations with Graphics Processing Units When calculating the infrared spectral energy distributions (SEDs) of galaxies in radiation-transfer models, the calculation of dust grain temperatures is generally the most time-consuming part of the calculation. Because of its highly parallel nature, this calculation is perfectly suited for massively parallel general-purpose Graphics Processing Units (GPUs). This paper presents an implementation of the calculation of dust grain equilibrium temperatures on GPUs in the Monte-Carlo radiation transfer code sunrise, using the CUDA API. The GPU can perform this calculation 55 times faster than the 8 CPU cores, showing great potential for accelerating calculations of galaxy SEDs.
/content/cudazone/CUDABrowser/assets/images/applications/402_dust_small.png /content/cudazone/CUDABrowser/assets/images/applications/402_dust_large.png Academia Santa Cruz Institute for Particle Physics, University of California, Santa Cruz, CA 2009 07 22 07/22/2009 55 Patrik Jonsson Joel R. Primack Paper Numerics Joel R. Primack, Patrik Jonsson, dust, radiative transfer, methods: numerical eb30ba09-4ff0-453e-89f3-041ea6d73ec7 Linear optimization on modern GPUs Optimization algorithms are becoming increasingly important in many areas, such as finance and engineering. Typically, real problems involve several hundreds of variables, and are subject to as many constraints. Several methods have been developed trying to reduce the theoretical time complexity. Nevertheless, when problems exceed reasonable sizes they end up being very computationally intensive. Heterogeneous systems composed by coupling commodity CPUs and GPUs are becoming relatively cheap, highly performing systems. Recent developments of GPGPU technologies give even more powerful control over them. In this paper, we show how we use a revised simplex algorithm for solving linear programming problems, originally described by Dantzig, for both our CPU and GPU implementations. Previously, this approach has been shown not to scale beyond around 200 variables. However, by taking advantage of modern libraries such as ATLAS for matrix-matrix multiplication, and the NVIDIA CUDA programming library on recent GPUs, we show that we can scale to problem sizes up to at least 2000 variables in our experiments for both architectures. On the GPU, we also achieve an appreciable precision on large problems with thousands of variables and constraints while achieving between 2x and 2.5x speed-ups over the serial ATLAS-based CPU version. With further tuning of both the algorithm and its implementations, even better results should be achievable for both the CPU and GPU versions.
/content/cudazone/CUDABrowser/assets/images/applications/402_2601838e2_small.gif /content/cudazone/CUDABrowser/assets/images/applications/402_2601838e2_large.gif Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway 2009 05 01 05/01/2009 3 Daniele G. Spampinato Anne C. Elster Paper Numerics Daniele G. Spampinato, Anne C. Elster 354bd9a4-3e2a-412d-a9f6-e9e32be1e782 GPU acceleration of Zernike moments for large-scale images Zernike moments are transcendental digital image descriptors used in many application areas like biomedical image processing and computer vision due to their good properties of orthogonality and rotation invariance. However, their computation is too expensive and limits their application in practice, above all when real-time constraints are imposed. This work introduces a novel approach to the high-performance computation of Zernike moments using CUDA on graphics processors. The proposed method is applicable to the computation of an individual Zernike moment as well as a set of Zernike moments of a given order, and it is compared against three of the fastest implementations performed on CPUs over the last decade. Our experimental results on a commodity PC reveal up to 5x faster execution times on a GeForce 8800 GTX against the best existing implementation on a Pentium 4 CPU. /content/cudazone/CUDABrowser/assets/images/applications/401_zernike_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/401_zernike_large.jpg Academia Computer Architecture Department, University of Malaga, Spain 2009 05 01 05/01/2009 5 Manuel Ujaldon Application Imaging Manuel Ujaldon 7a75c63a-5693-4eac-8879-52b943e3db6a Efficient visual hull computation for real-time 3D reconstruction using CUDA In this paper we present two efficient GPU-based visual hull computation algorithms. We compare them in terms of performance using image sets of varying size and different voxel resolutions.
In addition, we present a real-time 3D reconstruction system which uses the proposed GPU-based reconstruction method to achieve real-time performance (30 fps) using 16 cameras and 4 PCs. /content/cudazone/CUDABrowser/assets/images/applications/399_PVHMemo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/399_PVHMemo_large.jpg Academia Department of Computer Science, Technische Universitat Munchen 2008 12 31 12/31/2008 12/01/2008 Alexander Ladikos Selim Benhimane Nassir Navab Paper Imaging Alexander Ladikos,Selim Benhimane,Nassir Navab 67438a50-58e1-44a1-aec4-42052ba5add2 CUDA cuts: Fast graph cuts on the GPU Graph cuts has become a powerful and popular optimization tool for energies defined over an MRF and have found applications in image segmentation, stereo vision, image restoration, etc. The maxflow/mincut algorithm to compute graph-cuts is computationally heavy. The best-reported implementation of graph cuts takes over 100 milliseconds even on images of size 640x480 and cannot be used for real-time applications or when iterated applications are needed. The commodity Graphics Processor Unit (GPU) has emerged as an economical and fast computation co-processor recently. In this paper, we present an implementation of the push-relabel algorithm for graph cuts on the GPU. We can perform over 60 graph cuts per second on 1024x1024 images and over 150 graph cuts per second on 640x480 images on an Nvidia 8800 GTX. The time for each complete graph-cut is about 1 millisecond when only a few weights change from the previous graph, as on dynamic graphs resulting from videos. The CUDA code with a well-defined interface can be downloaded for anyone's use. /content/cudazone/CUDABrowser/assets/images/applications/398_case01045_2T_half_fa_small.png /content/cudazone/CUDABrowser/assets/images/applications/398_case01045_2T_half_fa_large.png 2008 12 1 12/1/2008 Vibhav Vineet, P. J. Narayanan Paper Imaging Vibhav Vineet,P. J. 
Narayanan 16ef052b-1b67-4ada-88f8-6461329d82c8 GPU Acceleration of 2D-DWT Image Compression in MATLAB with CUDA This article presents the details of accelerating 2D wavelet-based medical data (image) compression in MATLAB with CUDA. Diagnostic materials (mostly a certain type of image) are increasingly acquired in a digital format. The common need to manipulate huge amounts of data daily has therefore raised the issue of compressing it within a very short stipulated amount of time. Attention will be given to the acceleration processing flow which exploits the massive parallel computational power offered by the latest NVIDIA graphics processor unit (GPU). It brings a compute device that can be programmed in a C-like language using CUDA (Compute Unified Device Architecture). At the same time, a number of attractive features can be exploited for a broad class of intensive data parallel computation tasks. The final part of the discussion outlines possible directions towards future improvements of compression ratio and processing speed. /content/cudazone/CUDABrowser/assets/images/applications/396_wls2_small.gif /content/cudazone/CUDABrowser/assets/images/applications/396_wls2_large.gif 2008 12 01 12/01/2008 Vaclav Simek Paper Medical Imaging Vaclav Simek, Radim Dvorak 7250813b-109b-4b5c-b956-70f1a188a517 A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality We present a development environment for distributed GPU computing targeted for multi-GPU systems, as well as graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely, the PCI bus and network interconnects.
While the extended API mimics the full function set of current graphics hardware including the concept of global memory on all distribution layers, the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular for network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This way, the overall amount of transmitted data can be heavily reduced, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus and especially network-level parallelism on typical multi-GPU systems and graphics clusters. /content/cudazone/CUDABrowser/assets/images/applications/395_3d-gauge-cluster-from-nvidia-and-icar_YatMw_59_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/395_3d-gauge-cluster-from-nvidia-and-icar_YatMw_59_large.jpg Academia Visualisierungsinstitut der Universitat Stuttgart 2009 31 12 12/31/2009 07/01/2009 Christoph Muller Steffen Frey Magnus Strengert Paper Christoph Muller, Steffen Frey,Magnus Strengert b66bbea9-6d0c-4b61-aa2f-128074a32b0b Processing Neocognitron of Face Recognition on High Performance Environment Based on GPU with CUDA Architecture This work presents an implementation of Neocognitron Neural Network, using a high performance computing architecture based on GPU (Graphics Processing Unit). Neocognitron is an artificial neural network, proposed by Fukushima and collaborators, constituted of several hierarchical stages of neuron layers, organized in two-dimensional matrices called cellular planes. For the high performance computation of Face Recognition application using Neocognitron it was used CUDA (Compute Unified Device Architecture) as API (Application Programming Interface) between the CPU and the GPU, from GeForce 8800 GTX of NVIDIA company, with 128 ALU's. 
The face image databases used were a face database created at UFSCar and the CMU-PIE (Carnegie Mellon University Pose, Illumination and Expression) database. The load balancing was achieved through the use of cellular connections as threads organized in blocks, following the CUDA philosophy of development. The results showed the feasibility of this type of device as a massively parallel data processing tool, and that the smaller the granularity and the data dependency of the parallel processing, the better its performance. /content/cudazone/CUDABrowser/assets/images/applications/394_polar-rose-face-3d_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/394_polar-rose-face-3d_large.jpg 2008 12 01 12/01/2008 Gustavo Poli José Hiroki Saito João F. Mari Paper Signal Processing Gustavo Poli,José Hiroki Saito,João F. Mari 11117185-2796-4f64-aab0-91728886bf15 Parallel Image Processing Based on CUDA CUDA (Compute Unified Device Architecture) is a novel technology for general-purpose computing on the GPU, which lets users develop general GPU (Graphics Processing Unit) programs easily. This paper analyzes the distinct features of CUDA GPUs and summarizes the general programming model of CUDA. Furthermore, we implement several classical image processing algorithms with CUDA, such as histogram equalization, cloud removal, edge detection, and DCT encoding and decoding, and describe the first two in detail. If the data transfer time between host memory and device memory is not taken into account, then as the image size increases, histogram computation achieves a speedup of more than 40x, cloud removal about 79x, DCT around 8x, and edge detection more than 200x.
/content/cudazone/CUDABrowser/assets/images/applications/393_nvidia-CUDA,Q-1-111097-13_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/393_nvidia-CUDA,Q-1-111097-13_large.jpg 2008 12 12 12/12/2008 200 Zhiyi Yang Yating Zhu Yong Pu Paper Imaging Zhiyi Yang, Yating Zhu, Yong Pu 5cc29037-c944-43de-94b6-046dc4fbbac2 Neural Network Implementation using CUDA and OpenMP Many algorithms for image processing and pattern recognition have recently been implemented on GPU (graphic processing unit) for faster computational times. However, the implementation using GPU encounters two problems. First, the programmer should master the fundamentals of the graphics shading languages that require the prior knowledge on computer graphics. Second, in a job which needs much cooperation between CPU and GPU, which is usual in image processings and pattern recognitions contrary to the graphics area, CPU should generate raw feature data for GPU processing as much as possible to effectively utilize GPU performance. This paper proposes more quick and efficient implementation of neural networks on both GPU and multi-core CPU. We use CUDA (compute unified device architecture) that can be easily programmed due to its simple C language-like style instead of GPGPU to solve the first problem. Moreover, OpenMP (Open Multi-Processing) is used to concurrently process multiple data with single instruction on multi-core CPU, which results in effectively utilizing the memories of GPU. In the experiments, we implemented neural networks-based text detection system using the proposed architecture, and the computational times showed about 15 times faster than implementation using CPU and about 4 times faster than implementation on only GPU without OpenMP. 
/content/cudazone/CUDABrowser/assets/images/applications/392_openmp_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/392_openmp_large.jpg Academia Department of Digital Media, College of Information Science, Soongsil University 2008 12 12 01/01/2009 Honghoon Jang Anjin Park Keechul Jung Paper Numerics Honghoon Jang, Anjin Park, Keechul Jung e6a0a5aa-1798-4eaa-b48f-e5a0a6810875 A Parallel Implementation of the 2D Wavelet Transform Using CUDA There is a multicore platform that is currently attracting enormous attention due to its tremendous potential in terms of sustained performance: the NVIDIA Tesla boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device Architecture (CUDA), which is common to the latest NVIDIA GPUs. The bottom line is a multicore platform which provides an enormous potential performance benefit driven by a non-traditional programming model. In this paper we try to provide some insight into the peculiarities of CUDA in order to target scientific computing by means of a specific example. In particular, we show that the parallelization of the two-dimensional fast wavelet transform for the NVIDIA Tesla C870 achieves a speedup of 20.8 for an image size of 8192x8192, when compared with the fastest host-only version implementation using OpenMP and including the data transfers between main memory and device memory.
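As a rough illustration of the separable structure that makes a two-dimensional wavelet transform data-parallel, one decomposition level can be sketched as a row pass followed by a column pass. This is a minimal sketch under stated assumptions: the abstract does not say which wavelet filter the paper uses, so the simple averaging Haar filter here is an assumption, as are the function names and the requirement of even image dimensions. Every row (and then every column) is transformed independently, which is the property a GPU implementation can map to threads.

```python
def haar_step(values):
    """One Haar level on a 1-D sequence of even length.

    Returns pairwise averages followed by pairwise half-differences.
    The averaging normalization is an illustrative assumption.
    """
    half = len(values) // 2
    avg = [(values[2 * k] + values[2 * k + 1]) / 2 for k in range(half)]
    diff = [(values[2 * k] - values[2 * k + 1]) / 2 for k in range(half)]
    return avg + diff


def haar_2d_level(image):
    """One 2-D Haar level: transform every row, then every column."""
    rows = [haar_step(r) for r in image]               # rows are independent
    cols = [haar_step(list(c)) for c in zip(*rows)]    # columns are independent
    return [list(r) for r in zip(*cols)]               # transpose back
```

For a constant image all detail coefficients vanish and the top-left quadrant holds the constant itself, which is a quick sanity check on the pass ordering.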
/content/cudazone/CUDABrowser/assets/images/applications/391_2d-wavelet_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/391_2d-wavelet_large.jpg 2009 01 01 01/01/2009 20 Joaquin Franco Gregorio Bernabe Juan Fernandez Paper Numerics Joaquin Franco, Gregorio Bernabe, Juan Fernandez 63d5f310-4eb7-4f0d-82ea-5b4dc263e859 Towards Accelerated Computation of Atmospheric Equations Using CUDA Main objective of this paper is to outline possible ways how to achieve a substantial acceleration in case of advection-diffusion equation (A-DE) calculation, which is commonly used for a description of the pollutant behavior in atmosphere. A-DE is a kind of partial differential equation (PDE) and in general case it is usually solved by numerical integration due to its high complexity. These types of calculations are time consuming thus the main idea of our work is to adopt CUDA platform and commodity GPU card to do the calculations in a faster way. The solution is based on method of lines with 4th order Runge-Kutta scheme to handle the integration. As a matter of fact, the selected approach involves number of auxiliary variables and thus the memory management is critical in order to achieve desired performance. We have implemented several possible solutions that use different memory access schemes. Detailed evaluation is provided in this paper where the obtained results show a tremendous processing speed up in comparison to CPU. /content/cudazone/CUDABrowser/assets/images/applications/390_600px-Lorenz_attractor.svg_small.png /content/cudazone/CUDABrowser/assets/images/applications/390_600px-Lorenz_attractor.svg_large.png 2009 01 01 01/01/2009 Vaclav Simek Radim Dvorak Frantisek Zboril Paper Life Sciences Vaclav Simek, Radim Dvorak, Frantisek Zboril d23b8b83-4dbd-47b4-a4b6-ee9df082dfb0 K-Means on Commodity GPUs with CUDA K-means algorithm is one of the most famous unsupervised clustering algorithms.
Many theoretical improvements for the performance of original algorithms have been put forward, while almost all of them are based on Single Instruction Single Data (SISD) architecture processors (CPUs), which partly ignored the inherent paralleled characteristic of the algorithms. In this paper, a novel Single Instruction Multiple Data (SIMD) architecture processors (GPUs) based k-means algorithm is proposed. In this algorithm, in order to accelerate compute-intensive portions of traditional k-means, both data objects assignment and k centroids recalculation are offloaded to the GPU in parallel. We have implemented this GPU-based k-means on the newest generation GPU with Compute Unified Device Architecture (CUDA). The numerical experiments demonstrated that the speed of GPU-based k-means could reach as high as 40 times of the CPU-based k-means. /content/cudazone/CUDABrowser/assets/images/applications/389_AndromedaKMEANSK_4_1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/389_AndromedaKMEANSK_4_1_large.jpg 2009 03 01 03/01/2009 40 Bai Hong-tao He Li-li Ouyang Dan-tong Paper Numerics Bai Hong-tao, He Li-li, Ouyang Dan-tong 7a2fcb84-0960-415d-901a-d59f0a3a81c3 Accelerating K-Means on the Graphics Processor via CUDA In this paper an optimized k-means implementation on the graphics processing unit (GPU) is presented. NVIDIA's Compute Unified Device Architecture (CUDA), available from the G80 GPU family onwards, is used as the programming environment. Emphasis is placed on optimizations directly targeted at this architecture to best exploit the computational capabilities available. Additionally drawbacks and limitations of previous related work, e.g. maximum instance, dimension and centroid count are addressed. The algorithm is realized in a hybrid manner, parallelizing distance calculations on the GPU while sequentially updating cluster centroids on the CPU based on the results from the GPU calculations.
An empirical performance study on synthetic data is given, demonstrating a maximum 14x speed increase to a fully SIMD optimized CPU implementation. /content/cudazone/CUDABrowser/assets/images/applications/388_k_means_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/388_k_means_large.jpg 2009 04 01 04/01/2009 14 Mario Zechner Michael Granitzer Paper Other Mario Zechner, Michael Granitzer a0bd4162-4ba4-4139-b2c6-b27e17553509 Design of a parallel AES for graphics hardware using the CUDA framework Web servers often need to manage encrypted transfers of data. The encryption activity is computationally intensive, and exposes a significant degree of parallelism. At the same time, cheap multicore processors are readily available on graphics hardware, and toolchains for development of general purpose programs are being released by the vendors. In this paper, we propose an effective implementation of the AES-CTR symmetric cryptographic primitive using the CUDA framework. We provide quantitative data for different implementation choices and compare them with the common CPU-based OpenSSL implementation on a performance-cost basis. With respect to previous works, we focus on optimizing the implementation for practical application scenarios, and we provide a throughput improvement of over 14 times. We also provide insights on the programming knowledge required to efficiently exploit the hardware resources by exposing the different kinds of parallelism built in the AES-CTR cryptographic primitive. 
/content/cudazone/CUDABrowser/assets/images/applications/387_encryption_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/387_encryption_large.jpg Academia Politecnico di Milano, Italy 2009 01 01 01/01/2009 14 Andrea Di Biagio Paper Other Andrea Di Biagio bb9a79fd-0b9b-44f3-b56a-5e5d04a51acc vCUDA: GPU accelerated high performance computing in virtual machines This paper describes vCUDA, a GPGPU (General Purpose Graphics Processing Unit) computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high performance computing (HPC) applications. The key idea in our design is: API call interception and redirection. With API interception and redirection, applications in VMs can access graphics hardware device and achieve high performance computing in a transparent way. We carry out detailed analysis on the performance and overhead of our framework. Our evaluation shows that GPU acceleration for HPC applications in VMs is feasible and competitive with those running in a native, non-virtualized environment. Furthermore, our evaluation also identifies the main cause of overhead in our current framework, and we give some suggestions for future improvement. /content/cudazone/CUDABrowser/assets/images/applications/386_electricsheep_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/386_electricsheep_large.jpg Academia Advanced Internet and Media Lab, School of Computer and Communications, Hunan University, Chang Sha 2009 01 01 01/01/2009 Lin Shi Paper Other 355534f2-32b2-4fa0-8a1f-3c7bdbeb17c8 Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. 
However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de-novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this paper we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data. It is based on spectral alignment and uses the CUDA programming model. Our computational experiments on a GTX 280 GPU show runtime savings between 10 and 19 times (for different error-rates using simulated datasets as well as real Solexa/Illumina datasets). /content/cudazone/CUDABrowser/assets/images/applications/385_figure1D_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/385_figure1D_large.jpg Academia School of Computer Engineering, Nanyang Technological University, Singapore 2009 01 01 01/01/2009 19 Haixiang Shi Paper Life Sciences f3c0c426-1df9-4fb3-b9f4-1bd7f62a6978 Accelerating the reduction to upper Hessenberg We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than 16x performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially with its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and in general, to two-sided matrix transformations.
The work further motivates and highlights the strengths of hybrid computing: to harness the strengths of the components of a hybrid architecture to get significant computational acceleration which otherwise may have been impossible. /content/cudazone/CUDABrowser/assets/images/applications/384_criticalpath_small.png /content/cudazone/CUDABrowser/assets/images/applications/384_criticalpath_large.png Academia University of Tennessee / Oak Ridge National Laboratory / University of Manchester 2009 05 29 05/29/2009 16 Stanimire Tomov Jack Dongarra Paper Numerics Stanimire Tomov, Jack Dongarra, Hessenberg reduction, eigenvalue problems, two-sided factorizations, 893b3219-bf47-4757-87f1-d83b48783c83 Zonar ZONAR is an advanced STP system that handles sales, marketmaking, portfolio and risk management of all major asset classes in a multi-user environment. New version optimized for Cuda in several areas; -Option calculations and formulas -Volatility smile calculations -Interpolations -Portfolio calculations /content/cudazone/CUDABrowser/assets/images/applications/383_portf_small.png /content/cudazone/CUDABrowser/assets/images/applications/383_portf_large.png Commercial SoftCapital http://www.softcapital.com/index.htm 2009 08 05 08/01/2009 40 Commercial Lars Pehrsson Application Finance Numerics Finance, derivatives, trading, volatility, Lars Pehrsson, larsnsj@gmail.com 9494d226-5bd1-4a88-98dd-1f9536598781 PointTrackerLibrary A C++ library with various frontends and a full search SSD CUDA blockmatcher as backend, able to track many points within realtime in images up to 2K resolution. /content/cudazone/CUDABrowser/assets/images/applications/382_Peacock0000_small.png /content/cudazone/CUDABrowser/assets/images/applications/382_Peacock0000_large.png Research JOANNEUM RESEARCH www.joanneum.at/iis 2009 07 18 07/18/2009 12 Commercial H.Fassold / H. Fuerntratt Application Multimedia Graphics Imaging Video & Audio Tracking, Blockmatching, Image conversion, H.Fassold,H.
Fuerntratt,hermann.fuerntratt@joanneum.at 17ed7b65-302b-46b7-9d62-624cfce64935 PyOSSMGPU : Propagation of high-intensity pulses in nonlinear fiber Bragg grating The propagation of high-intensity laser pulses in a fiber Bragg grating or in any nonlinear periodic dielectric medium can be studied using coupled-mode theory. When applied to a Bragg grating in optical fiber, coupled-mode theory leads to two coupled-mode equations which can be solved numerically using a classical fourth-order Runge-Kutta formula. When studying classical problems like the propagation of a Bragg soliton in a very long grating (many cm), the Runge-Kutta method usually takes many hours to complete. PyOSSMGPU is a CUDA implementation of the optimized split-step method for solving nonlinear coupled-mode equations that model wave propagation in nonlinear fiber Bragg gratings. The GPU-accelerated version of the OSSM code performs around 20X faster than the plain C version. Classical problems like Bragg solitons in very long gratings can typically be completed within a minute. /content/cudazone/CUDABrowser/assets/images/applications/381_PyOSSMGPU_screenshot1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/381_PyOSSMGPU_screenshot1_large.jpg Academia Universite Laval 2009 07 10 07/10/2009 20 Martin Laprise Multimedia Code Science Martin Laprise, martin.laprise.1@ulaval.ca f498d1da-fd44-4359-a83d-84a281d58c53 HONEI: A collection of libraries for numerical computations targeting multiple processor architectures We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves.
By linking against HONEI's libraries, we achieve a twofold speedup over straightforward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. /content/cudazone/CUDABrowser/assets/images/applications/380_honei_small.png /content/cudazone/CUDABrowser/assets/images/applications/380_honei_large.png Academia Institut für Physik, TU Dortmund, Germany 2009 01 01 01/01/2009 16 Danny van Dyk Markus Geveler Sven Mallach Paper Numerics Danny van Dyk, Markus Geveler, Sven Mallach 55b2f736-0a6b-4893-aaed-272cb5dd676d Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors Atomistic molecular dynamics (MD) simulations are a vital tool in chemical research, as they are able to provide a view of chemical systems and processes that is not obtainable through experiment. However, large-scale MD simulations require access to multicore clusters or supercomputers that are not always available to all researchers. Recently, many have begun to explore the power of graphics processing units (GPUs) for various applications, such as MD. We present preliminary results of water simulations carried out on GPUs. We compare the performance gained using a GPU versus the same simulation on a single CPU or multiple CPUs. We also address the use of more accurate double precision arithmetic with the newest GPUs and its cost in performance. 
/content/cudazone/CUDABrowser/assets/images/applications/379_towards_molecular_dynamics_small.png /content/cudazone/CUDABrowser/assets/images/applications/379_towards_molecular_dynamics_large.png Academia University of Delaware, Newark 2009 01 01 01/01/2009 7 Joseph E. Davis Adnan Ozsoy Sandeep Patel Paper Life Sciences Joseph E. Davis, Adnan Ozsoy, Sandeep Patel ccf2228b-a635-4fc1-8875-4321542e5a7c Multi-Dimensional Characterization of Temporal Data Mining on Graphics Processors Through the algorithmic design patterns of data parallelism and task parallelism, the graphics processing unit (GPU) offers the potential to vastly accelerate discovery and innovation across a multitude of disciplines. For example, the exponential growth in data volume now presents an obstacle for high-throughput data mining in fields such as neuroinformatics and bioinformatics. As such, we present a characterization of a MapReduce-based data-mining application on a general-purpose GPU (GPGPU). Using neuroscience as the application vehicle, the results of our multi-dimensional performance evaluation show that a "one-size-fits-all" approach maps poorly across different GPGPU cards. Rather, a high-performance implementation on the GPGPU should factor in the 1) problem size, 2) type of GPU, 3) type of algorithm, and 4) data-access method when determining the type and level of parallelism. To guide the GPGPU programmer towards optimal performance within such a broad design space, we provide eight general performance characterizations of our data-mining application. 
/content/cudazone/CUDABrowser/assets/images/applications/378_csatvt-header_small.gif /content/cudazone/CUDABrowser/assets/images/applications/378_csatvt-header_large.gif Academia Department of Computer Science, Virginia Tech 2009 01 01 01/01/2009 Jeremy Archuleta Yong Cao Wu-chun Feng Paper Numerics Jeremy Archuleta, Yong Cao, Wu-chun Feng e728d1cd-65b2-4b12-af9e-e1a445f0b779 Molecular dynamics simulation of complex multiphase flow on a computer cluster with GPUs Compute Unified Device Architecture (CUDA) was used to design and implement molecular dynamics (MD) simulations on graphics processing units (GPU). With an NVIDIA Tesla C870, a 20 to 60 fold speedup over that of one core of the Intel Xeon 5430 CPU was achieved, reaching up to 150 Gflops. MD simulation of cavity flow and particle-bubble interaction in liquid was implemented on multiple GPUs using a message passing interface (MPI). Up to 200 GPUs were tested on a special network topology, which achieves good scalability. The capability of GPU clusters for large-scale molecular dynamics simulation of meso-scale flow behavior was, therefore, uncovered. /content/cudazone/CUDABrowser/assets/images/applications/377_molecular_dynamics_simulation_small.png /content/cudazone/CUDABrowser/assets/images/applications/377_molecular_dynamics_simulation_large.png State Key Laboratory of Multi-Phase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Beijing 2009 01 01 01/01/2009 60 CHEN FeiGuo GE Wei LI JingHai Paper Life Sciences multiphase flow, molecular dynamics, CUDA, GPU, parallel computing, CHEN FeiGuo, GE Wei, LI JingHai e38ddfe2-3ca5-4cb0-9a6d-209006e8051a CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled GPUs The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. 
Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware. /content/cudazone/CUDABrowser/assets/images/applications/376_1756-0500-2-73-1-l_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/376_1756-0500-2-73-1-l_large.jpg Research BioMed Central Ltd 2009 02 01 02/01/2009 Yongchao Liu Douglas L Maskell Bertil Schmidt Paper Life Sciences Yongchao Liu, Douglas L Maskell, Bertil Schmidt 99c99e2d-b68b-4e24-a12d-e5ff3105c5b0 A High-Speed Multi-GPU Implementation of Bottom-Up Attention Using CUDA In this paper a novel implementation of the saliency map model on a multi-GPU platform using CUDA technology is presented. The saliency map model is a well-known computational model for bottom-up attention selection and serves as a basis of many attention control strategies of cognitive vision systems. A real-time implementation is the prerequisite of an application of bottom-up attention on mobile robots and vehicles. Parallel computation on Graphics Processing Unit (GPU) provides an excellent solution for this kind of compute-intensive image processing. Running on 1 to 4 NVIDIA GeForce 8800 (GTX) graphics cards a frame rate of 313 fps at resolution of 640 x 480 is achieved, which is approximately 8.5 times faster than the standard implementations on CPUs. The implementation is also evaluated using a high-speed camera at 200 Hz. Using two GPUs only 2 ms extra computational time for the saliency map generation in addition to the camera capture time is required for images of 640 x 480 pixels. 
/content/cudazone/CUDABrowser/assets/images/applications/375_attention_small.png /content/cudazone/CUDABrowser/assets/images/applications/375_attention_large.png Academia Institute of Automatic Control Engineering Technische Universitat Munchen, Germany 2009 01 01 01/01/2009 9 Tingting Xu Thomas Pototschnig Kolja Kuhnlenz Paper Imaging Tingting Xu, Thomas Pototschnig, Kolja Kuhnlenz 06a005b8-2c22-4aaa-9a47-1dc0e24dbfe8 Implementation of a Lattice-Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology The Lattice-Boltzmann method (LBM) is a distribution-function based approach to numerical fluid mechanics. Due to the simple formulation of the underlying algorithm this method is well suited for parallelization and hardware acceleration using general purpose graphical processing units (GPGPU). Within this work LBM has been implemented in a new code with multi-GPU support and physically validated for a flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single socket multi core CPU calculation. Moreover the validation for the test case chosen shows excellent agreement with available reference data. /content/cudazone/CUDABrowser/assets/images/applications/374_boltzmann_small.png /content/cudazone/CUDABrowser/assets/images/applications/374_boltzmann_large.png Academia Technische Universitat Munchen Lehrstuhl fur Aerodynamik 2009 01 01 01/01/2009 1840 Eugen Riegel Thomas Indinger Paper Computational Fluid Dynamics Eugen Riegel, Thomas Indinger bf616128-b630-49cb-9ab0-996b495737b6 QP: A Heterogeneous Multi-Accelerator Cluster We present a heterogeneous multi-accelerator cluster developed and deployed at NCSA. The cluster consists of 16 AMD dual-core CPU compute nodes each with four NVIDIA GPUs and one Xilinx FPGA. Cluster nodes are interconnected with both InfiniBand and Ethernet networks. 
The software stack consists of standard cluster tools with the addition of accelerator-specific software packages and enhancements to the resource allocation and batch sub-systems. We highlight several HPC applications that have been developed and deployed on the cluster. We also present our Phoenix application development framework that is meant to help with developing new applications and migrating existing legacy codes to heterogeneous systems. /content/cudazone/CUDABrowser/assets/images/applications/373_heterogeneous_small.png /content/cudazone/CUDABrowser/assets/images/applications/373_heterogeneous_large.png Academia University of Illinois at Urbana-Champaign, Urbana 2009 01 01 01/01/2009 48 Commercial Michael Showerman Jeremy Enos Avneesh Pant Paper Other heterogeneous system, acceleration co-processor, GPGPU, FPGA, Michael Showerman, Jeremy Enos, Avneesh Pant 8d2571a3-e901-47bc-a42a-2b2291dc858e OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. 
Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial on a CPU). /content/cudazone/CUDABrowser/assets/images/applications/372_gpu_perf_small.png /content/cudazone/CUDABrowser/assets/images/applications/372_gpu_perf_large.png Academia School of ECE, Purdue University West Lafayette, IN 2009 01 01 01/01/2009 50 Seyong Lee Seung-Jai Min Rudolf Eigenmann Paper Programming Tools Seyong Lee, Seung-Jai Min, Rudolf Eigenmann 4dd15dbc-9919-413f-9603-3bd4744edbd5 Nuclei: GPU-accelerated Many-core Network Coding While it is a well known result that network coding achieves optimal flow rates in multicast sessions, its potential for practical use has remained to be a question, due to its high computational complexity. Our previous work has attempted to design a hardware-accelerated and multi-threaded implementation of network coding to fully utilize multi-core CPUs, as well as SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors. This paper represents another step forward, and presents the first attempt in the literature to maximize the performance of network coding by taking advantage of not only multi-core CPUs, but also potentially hundreds of computing cores in commodity off-the-shelf Graphics Processing Units (GPU). 
/content/cudazone/CUDABrowser/assets/images/applications/371_network_coding_small.png /content/cudazone/CUDABrowser/assets/images/applications/371_network_coding_large.png Academia University of Toronto / School of Computer Science Fudan University 2009 01 01 01/01/2009 3 Hassan Shojania Baochun Li Xin Wang Paper Signal Processing Hassan Shojania, Baochun Li, Xin Wang c24dcc0f-c60c-45f9-8d57-588e9460a58f High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs The visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. Existing software packages that render MOs perform calculations on the CPU and require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system. We present novel data-parallel algorithms for computing lattices of MOs on modern graphics processing units (GPUs) and multi-core CPUs. The fastest GPU algorithm achieves up to a 125-fold speedup over an optimized CPU implementation running on one CPU core. We also demonstrate possible benefits of dynamic GPU kernel generation and just-in-time compilation for MO calculation. We have implemented these algorithms within the popular molecular visualization program VMD, which can now produce high quality MO renderings for large systems in less than a second, and achieves the first-ever interactive animations of quantum chemistry simulation trajectories using only on-the-fly calculation. /content/cudazone/CUDABrowser/assets/images/applications/374_molecular_orbitals_small.png /content/cudazone/CUDABrowser/assets/images/applications/374_molecular_orbitals_large.png Academia University of Illinois at Urbana-Champaign Urbana 2009 01 01 01/01/2009 125 John E. 
Stone Jan Saam David J. Hardy Paper Science John E. Stone, Jan Saam, David J. Hardy f1be0f16-d60c-4c25-b225-6b7d575d6efc CUDA Implementation of a Navier-Stokes Solver on Multi GPU Desktop Platforms for Incompressible Flows Graphics processor units (GPU) that are traditionally designed for graphics rendering have emerged as massively-parallel "co-processors" to the central processing unit (CPU). Small-footprint desktop supercomputers with hundreds of cores that can deliver teraflops peak performance at the price of conventional workstations have been realized. A computational fluid dynamics (CFD) simulation capability with rapid computational turnaround time has the potential to transform engineering analysis and design optimization procedures. We describe the implementation of a Navier-Stokes solver for incompressible fluid flow using desktop platforms equipped with multi-GPUs. Specifically, NVIDIA's Compute Unified Device Architecture (CUDA) programming model is used to implement the discretized form of the governing equations. The projection algorithm to solve the incompressible fluid flow equations is divided into distinct CUDA kernels, and a unique implementation that exploits the memory hierarchy of the CUDA programming model is suggested. Using a quad-GPU platform, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU desktops can serve as a cost-effective small-footprint parallel computing platform to accelerate CFD simulations substantially. /content/cudazone/CUDABrowser/assets/images/applications/373_navier_small.png /content/cudazone/CUDABrowser/assets/images/applications/373_navier_large.png Academia Boise State University, Boise, Idaho 2009 01 01 01/01/2009 Julien C. Thibault Inanc Senocak Paper Numerics Julien C. 
Thibault, Inanc Senocak 78767ea4-1fbc-4def-b463-06022ab41ae5 Feasibility of GPU-assisted iterative image reconstruction for mobile C-arm CT Computed tomography (CT) has been extensively studied and widely used for a variety of medical applications. The reconstruction of 3D images from a projection series is an important aspect of the modality. Reconstruction by filtered backprojection (FBP) is used by most manufacturers because of speed, ease of implementation, and relatively few parameters. Iterative reconstruction methods have a significant potential to provide superior performance with incomplete or noisy data, or with less than ideal geometries, such as cone-beam systems. However, iterative methods have a high computational cost, and regularization is usually required to reduce the effects of noise. The simultaneous algebraic reconstruction technique (SART) is studied in this paper, where the Feldkamp method (FDK) for filtered back projection is used as an initialization for iterative SART. Additionally, graphics hardware is utilized to increase the speed of SART implementation. Nvidia processors and compute unified device architecture (CUDA) form the platform for GPU computation. Total variation (TV) minimization is applied for the regularization of SART results. Preliminary results of SART on a 3-D Shepp-Logan phantom using TV regularization and GPU computation are presented in this paper. Potential improvements of the proposed framework are also discussed. 
/content/cudazone/CUDABrowser/assets/images/applications/372_image_reconstruction_for_mobile_c_arm_small.png /content/cudazone/CUDABrowser/assets/images/applications/372_image_reconstruction_for_mobile_c_arm_large.png Academia Scientific Computing and Imaging Institute, University of Utah 2009 01 01 01/01/2009 130 Yongsheng Pan Ross Whitaker Arvi Cheryauka Paper Medical Imaging C-arm CT, FDK, SART, GPU, CUDA, TV, Yongsheng Pan, Ross Whitaker, Arvi Cheryauka 2f26c62f-b633-45aa-8690-9a493d3851aa Scalable Parallel Programming with CUDA The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores. /content/cudazone/CUDABrowser/assets/images/applications/371_fig5-7_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/371_fig5-7_large.jpg Commercial NVIDIA http://www.nvidia.com 2008 04 01 04/01/2008 263 John Nickolls Ian Buck Michael Garland Paper Programming Tools John Nickolls, Ian Buck, Michael Garland 774b5ab3-3d70-4061-9dc6-1809c3eaa8f3 GPU Decoder (Vegas/Premiere) GPU Decoder comes as a plugin for Non-Linear Editors (Sony Vegas/Adobe Premiere). It uses the power of your NVIDIA graphics card to decode h.264 video files such as AVCHD or files from Canon EOS 5D Mark II. 
/content/cudazone/CUDABrowser/assets/images/applications/370_packshot_small.png /content/cudazone/CUDABrowser/assets/images/applications/370_packshot_large.png Commercial DIVIDE FRAME http://www.divideframe.com 2009 07 11 07/11/2009 10 Commercial Robin Lobel Application Multimedia Video & Audio gpu, decoder, sony, vegas, adobe, premiere, panasonic, avchd, h.264, h264, quicktime, canon, Robin Lobel 008d8c73-6b0b-4838-a211-8587b4f23232 Data structure design for GPU based heterogeneous systems This paper reports on our experience with data structure design for systems having both multiple CPU cores and a programmable graphics card. We integrate our data structures into the game-like application OpenSteerDemo and compare our data structures on two different pc-systems. One System has a relative fast single core CPU and slower GPU, whereas the other one uses a high-end GPU with a slower multi core CPU. We design two grid based data structures for effectively solving the k-nearest neighbor problem. The static grid uses grid cells of uniform size, whereas the dynamic grid does not rely on given grid cells, but creates them at runtime. The static grid is designed for fast data structure creation, in contrast to the dynamic grid, which is designed to provide high simulation performance at the GPU. The high performance at the GPU is achieved by explicitly taking advantage of the special GPU memory system, which however comes at the cost of a more complex construction algorithm. Our experiments show that with a slower CPU the complex algorithm for creating the dynamic grid becomes the bottleneck and the increased simulation performance at the GPU thereby does not provide an increase in performance compared to the static grid based implementation. This also holds true when the simulation is run with a faster CPU and a slower GPU, even though the break-even point is different. 
Furthermore we experimented with data structure creation on the GPU, but the performance of the static grid is not feasible, whereas the creation of the dynamic grid on the GPU is not possible due to the lack of support for recursive functions. We provide a dynamic grid creation algorithm, which uses multiple CPU cores. However, this algorithm is slower than its sequential counterpart due to the parallelization overhead. /content/cudazone/CUDABrowser/assets/images/applications/369_data_structure_small.png /content/cudazone/CUDABrowser/assets/images/applications/369_data_structure_large.png Academia Research Group Programming Languages / Methodologies Universitat Kassel 2008 12 01 12/01/2008 Jens Breitbart Paper Numerics GPGPU, k-nearest neighbor, games, OpenMP, CUDA, Jens Breitbart aa81fd54-34d5-4855-abe2-74940dd2d268 Comparing CUDA and OpenGL implementations for a Jacobi iteration The use of the GPU as a general purpose processor is becoming more popular and there are different approaches for this kind of programming. In this paper we present a comparison between different implementations of the OpenGL and CUDA approaches for solving our test case, a weighted Jacobi iteration with a structured matrix originating from a finite element discretization of the elliptic PDE part of the cardiac bidomain equations. The CUDA approach using textures showed to be the fastest with a speedup of 78 over a CPU implementation. CUDA showed to be an efficient and easy way of programming GPU for general purpose problems, though it is also easier to write inefficient codes. 
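The weighted Jacobi iteration named in the abstract above can be sketched in a few lines. This is only an illustration in plain Python on the CPU, not the authors' CUDA or OpenGL code: the small tridiagonal matrix, the function name weighted_jacobi, and the damping factor omega = 2/3 are all assumptions chosen for the example, and the paper's matrix comes from a finite element discretization of the bidomain equations rather than this toy system.

```python
def weighted_jacobi(A, b, omega=2.0 / 3.0, iterations=200):
    """Weighted Jacobi sweep: x <- x + omega * D^{-1} (b - A x).

    A is a dense square matrix given as a list of rows; omega and the
    iteration count are illustrative assumptions, not values from the paper.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        # The residual depends only on the previous iterate, so every row
        # can be computed independently (the property a GPU version exploits).
        residual = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + omega * residual[i] / A[i][i] for i in range(n)]
    return x


# Hypothetical 3x3 tridiagonal test system whose exact solution is [1, 1, 1].
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x = weighted_jacobi(A, b)
```

Because each entry of the new iterate reads only the old iterate, one GPU thread per row is a natural mapping; how the matrix and vector are stored (for example in textures, as the abstract mentions) is a separate design choice that this sketch does not model.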
/content/cudazone/CUDABrowser/assets/images/applications/368_visualization_simulation_small.png /content/cudazone/CUDABrowser/assets/images/applications/368_visualization_simulation_large.png Academia Karl Franzens Universitat Graz 2008 12 01 12/01/2008 78 Ronan Amorim Gundolf Haase Manfred Liebmann Paper Programming Tools Ronan Amorim, Gundolf Haase, Manfred Liebmann d473a665-7e53-4405-9be4-952aeea0f762 GPU Acceleration of an Unmodified Parallel Finite Element Navier Stokes Solver We have previously suggested a minimally invasive approach to include hardware accelerators into an existing large-scale parallel finite element PDE solver toolkit, and implemented it into our software FEAST. Our concept has the important advantage that applications built on top of FEAST benefit from the acceleration immediately, without changes to application code. In this paper we explore the limitations of our approach by accelerating a Navier-Stokes solver. This nonlinear saddle point problem is much more involved than our previous tests, and does not exhibit an equally favourable acceleration potential: Not all computational work is concentrated inside the linear solver. Nonetheless, we are able to achieve speedups of more than a factor of two on a small GPU-enhanced cluster. We conclude with a discussion how our concept can be altered to further improve acceleration. /content/cudazone/CUDABrowser/assets/images/applications/367_channel_flow_small.png /content/cudazone/CUDABrowser/assets/images/applications/367_channel_flow_large.png Academia Angewandte Mathematik und Numerik, TU Dortmund, Germany 2009 06 23 06/23/2009 2 Dominik Goddeke Sven H.M. Buijssen Hilmar Wobker and Stefan Turek Paper Numerics Dominik Goddeke, Sven H.M. 
Buijssen, Hilmar Wobker and Stefan Turek dee67bc1-3342-481f-b4d9-64f377689b1e GPU Implementation of the Multiple Back-Propagation Algorithm In this paper, we describe a parallel implementation of the Multiple Back-Propagation (MBP) algorithm and present the results obtained when running the algorithm on two well-known benchmarks. The implementation described in the paper will be included in the next version of the Multiple Back-Propagation Software. /content/cudazone/CUDABrowser/assets/images/applications/366_mbpTop_small.png /content/cudazone/CUDABrowser/assets/images/applications/366_mbpTop_large.png Academia IPG 2009 09 01 09/01/2009 40 Noel Lopes Application Paper Neural Networks Neural Networks, Multiple Back-Propagation, Noel Lopes 20d2df07-85f7-4bc9-9689-ab36bad685af F2C-ACC F2C-ACC was developed to reduce the time required to modify codes to run on the GPU or Cell devices. We believe that, in time, Fortran language support will be provided by PGI and others. In the meantime, we have developed a language translator to convert codes from Fortran into C or CUDA-C. Both translations are useful: C can be used for testing and as a base code for running on the IBM Cell processor, and the generated CUDA code serves as a base for running on the GPU. The translator handles parsing of all Fortran 95 language features, but output generation of the C and CUDA code is not complete. /content/cudazone/CUDABrowser/assets/images/applications/365_notepadplus_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/365_notepadplus_large.jpg Research NOAA 2009 05 01 05/01/2009 Mark Govett Application Code Programming Tool Medical Imaging Mark Govett dd07c8a1-e8f8-472d-95f5-9e4cfbf8e928 Interactive Point-Based Rendering of Higher-Order Tetrahedral Data Computational simulations frequently generate solutions defined over very large tetrahedral volume meshes containing many millions of elements. 
Furthermore, such solutions may often be expressed using non-linear basis functions. Certain solution techniques, such as discontinuous Galerkin methods, may even produce non-conforming meshes. Such data is difficult to visualize interactively, as it is far too large to fit in memory and many common data reduction techniques, such as mesh simplification, cannot be applied to non-conforming meshes. We introduce a point-based visualization system for interactive rendering of large, potentially non-conforming, tetrahedral meshes. We propose methods for adaptively sampling points from non-linear solution data and for decimating points at run time to fit GPU memory limits. Because these are streaming processes, memory consumption is independent of the input size. We also present an /content/cudazone/CUDABrowser/assets/images/applications/364_thumbnail_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/364_thumbnail_large.jpg Academia IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, 2008 1 1 1/1/2008 Yuan Zhou Michael Garland Paper Imaging Yuan Zhou, Michael Garland 7b377967-35c3-427c-ba6e-196138e952fd Rapid multipole graph drawing on the GPU As graphics processors become powerful, ubiquitous and easier to program, they have also become more amenable to general purpose high-performance computing, including the computationally expensive task of drawing large graphs. This paper describes a new parallel analysis of the multipole method of graph drawing to support its efficient GPU implementation. We use a variation of the Fast Multipole Method to estimate the long distance repulsive forces in force directed layout. We support these multipole computations efficiently with a k-d tree constructed and traversed on the GPU. The algorithm achieves impressive speedup over previous CPU and GPU methods, drawing graphs with hundreds of thousands of vertices within a few seconds via CUDA on an NVIDIA GeForce 8800 GTX. 
/content/cudazone/CUDABrowser/assets/images/applications/363_thumbnail_small.png /content/cudazone/CUDABrowser/assets/images/applications/363_thumbnail_large.png Academia University of Illinois / NVIDIA 2008 09 01 09/01/2008 4 Apeksha Godiyal Jared Hoberock Michael Garland Paper Electronic Design Automation Apeksha Godiyal, Jared Hoberock, Michael Garland fdfd95d7-184d-4637-879e-26b8906b3aec Fast BVH Construction on GPUs We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks in the algorithm for GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications. /content/cudazone/CUDABrowser/assets/images/applications/362_thumbnail_small.png /content/cudazone/CUDABrowser/assets/images/applications/362_thumbnail_large.png Academia University of North Carolina at Chapel Hill / NVIDIA 2009 05 01 05/01/2009 C. Lauterbach M. Garland S. Sengupta Multimedia Paper Imaging C. Lauterbach, M. Garland, S. 
Sengupta cf7594c3-abb8-4fd2-8e0f-a570be1629e4 Designing Efficient Sorting Algorithms for Manycore GPUs We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. /content/cudazone/CUDABrowser/assets/images/applications/361_thumbnail_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/361_thumbnail_large.jpg Academia Dept. of Electrical Engineering University of California / NVIDIA http://eecs.berkeley.edu 2009 05 01 05/01/2009 Nadathur Satish Mark Harris Michael Garland Numerics Nadathur Satish, Mark Harris, Michael Garland de61eb09-d3dc-43a2-9fe7-13e3709e7c04 Cryostasis: Benchmark Sneak Peek - Featuring GPU accelerated NVIDIA PhysX effects /content/cudazone/CUDABrowser/assets/images/applications/359_screenshot_cryostasis_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/359_screenshot_cryostasis_large.jpg Commercial Action Forms ltd http://www.cryostasis-game.com/ 2008 12 01 12/01/2008 10 Commercial Action Forms ltd Application Multimedia Game Physics Action Forms ltd 166a76d3-01d8-4200-857a-3bb71d282a60 Ocelot CUDA provides a programming model with abstractions that are amenable to many-core architectures in general, not only GPUs. We argue that the optimal partitioning of an application may require both highly data-parallel architectures that rely on hardware multithreading to hide memory latency as well as superscalar architectures with deep cache hierarchies. 
Ocelot aims to push towards the development of tools that can compile CUDA programs to multiple architectures, and dynamically determine which parts of an application should be run on each architecture. We currently have code analysis tools for CUDA and PTX, as well as a full featured emulator for PTX. /content/cudazone/CUDABrowser/assets/images/applications/358_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/358_logo_large.png Academia Georgia Institute of Technology www.ece.gatech.edu/research/labs/casl/index.html 2009 07 09 07/09/2009 Open source Gregory Diamos Application Libraries Gregory Diamos a3eb01b2-b364-4d42-8abe-9187a6b2136c Prestack migration on gpu We provide world-leading prestack migration technology on GPU. The program has already been used by many oil fields in China. /content/cudazone/CUDABrowser/assets/images/applications/357_2009041220515865320_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/357_2009041220515865320_large.jpg Commercial Geostar Science & Technology Co. LTD 2008 07 01 07/01/2008 80 Commercial Tong Xiaolong Application Paper Oil & Gas prestack migration, Tong Xiaolong 45413113-1fe7-4f1f-b61a-b39a80f6c99a Crazy Machines 2 Casual gaming could not get any better: this GPU accelerated puzzle game challenges the player to create a series of interacting mechanisms to achieve the final goal. GPU accelerated PhysX is used for amazing fluid simulation which interacts with the machines and puzzle elements. A fully playable multi-level version is available as a free download as part of NVIDIA's Powerpack. 
/content/cudazone/CUDABrowser/assets/images/applications/356_screenshot_crazymachines_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/356_screenshot_crazymachines_large.jpg Commercial Viva Media LLC http://www.crazymachinesgame.com/ 2008 12 1 12/1/2008 5 Commercial Viva Media LLC Application Multimedia Game Physics Viva Media LLC b39b2278-ae30-4a00-bdab-c7efa59f065d Star Tales Benchmark Demo This exciting social networking game features GPU-accelerated PhysX for lifelike simulation of clothing and hair. Characters can dance and their clothing will move and interact with the surroundings. /content/cudazone/CUDABrowser/assets/images/applications/355_screenshot_startales_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/355_screenshot_startales_large.jpg Commercial QWD1 2008 12 1 12/1/2008 10 Commercial QWD1 Application Multimedia Game Physics QWD1 68630bdc-c298-481b-905b-e806a212ad99 Sacred 2: Fallen Angel - PhysX Game Patch Sacred 2: Fallen Angel - PhysX Game Patch is an Action Role-playing Game (RPG) set in a vast, graphically rich world called Ancaria. The free GPU PhysX patch extends this richness to the physical world with the addition of physical debris, physical spell particles, and force fields. Storms come to life as a myriad of leaves dance in the wind and collide with the environment. Players can finally "feel" the power of spells as surrounding leaves and pebbles are blown about and magic spell particles fly about the environment, bouncing off of buildings, and tumbling down hillsides. Once you've experienced the amount of energy and detail this patch brings to the environment, there is no going back.
/content/cudazone/CUDABrowser/assets/images/applications/354_screenshot_sacred2_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/354_screenshot_sacred2_large.jpg Ascaron Entertainment GmbH 2008 http://www.sacred2.com/en.html 2008 12 1 12/1/2008 5 Commercial Ascaron Entertainment GmbH 2008 Application Multimedia Game Physics Ascaron Entertainment GmbH 2008 bfdb49b3-472b-4ddb-8772-7ff5640c5f27 Warmonger This multi-player shooter with five levels sets the standard for interactivity for all multi-player shooters. It makes terrific use of GPU PhysX. Every building is destructible. The constant battle has created an environment of debris, rags and embers which all interact with the environment. You can create obstacles and affect game play by blowing up buildings. You can protect your rear by blowing up staircases behind you. You can create shortcuts to get to your opponents by blowing up the wall they were seeking shelter behind. There is NO PLACE TO HIDE! /content/cudazone/CUDABrowser/assets/images/applications/353_warmonger-free_fullgame_01_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/353_warmonger-free_fullgame_01_large.jpg Commercial NetDevil http://www.netdevil.com 2008 08 12 08/12/2008 5 Commercial NetDevil Application Multimedia Game Physics NetDevil 7e81a1fa-17d2-4a23-b4f7-52a3ac094cc8 PhysX Screen Saver The PhysX screen saver uses the power of accelerated GPU Physics to create this unique hypnotic experience. The forever rolling ball will power through objects and cloth banners featuring your pictures, against your own favorite panoramic image -- but now you have a chance to customize and mod it to make your own personal creation.
The full source for the PhysX Screen Saver is now available from The Game Creators at: http://gdk.thegamecreators.com/ /content/cudazone/CUDABrowser/assets/images/applications/352_it_photo_111042_28_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/352_it_photo_111042_28_large.jpg Commercial The Game Creators http://www.thegamecreators.com/ 2008 08 12 08/12/2008 5 Commercial The Game Creators Application Multimedia Game Physics The Game Creators ea10a5ce-2e1a-45d0-b6b1-5f4297590aa2 Ghost Recon Advanced Warfighter 2 Tom Clancy's Ghost Recon Advanced Warfighter 2 builds off of the events in the first game and places gamers in control of the U.S. military's elite fighting unit, the Ghosts. In the year 2014, the rising conflict between Mexican loyalists and insurgent rebel forces has thrown Mexico into full-scale civil war. Under the command of Captain Scott Mitchell, the Ghosts are called upon to face an imminent threat to the United States. The fate of two countries now lies in the hands of the Ghosts as they fend off an attack on U.S. soil. Equipped with the most cutting-edge weaponry and technology, the Ghosts must battle on both sides of the border to neutralize the escalating rebel threat. Use of PhysX extends the visually rich and complex GRAW 2 to an entirely new level of game realism and interactivity, with dynamic gameplay physics, impactful environmental effects and persistent destruction and debris throughout. PhysX provides a realistic combat experience, from the characters to tanks to buildings and every other object within the game world. When something explodes, the physics engine kicks in to create superb explosions, emitting debris which affects gameplay.
/content/cudazone/CUDABrowser/assets/images/applications/351_GRAW2_Logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/351_GRAW2_Logo_large.jpg Ubisoft 2008 08 12 08/12/2008 5 Commercial Red Storm Entertainment Application Multimedia Game Physics Red Storm Entertainment bcfc968e-677a-44c2-9cf3-34042de48d2e Selective and Adaptive Supersampling for Real-Time Ray Tracing While supersampling is an essential element for high quality rendering, high sampling rates, routinely employed in offline rendering, are still considered quite burdensome for real-time ray tracing. In this paper, we propose a selective and adaptive supersampling technique aimed at the development of a real-time ray tracer on today's many-core processors. For efficient utilization of very precious computing time, this technique explores both image-space and object-space attributes, which can be easily gathered during the ray tracing computation, minimizing rendering artifacts by cleverly distributing ray samples to rendering elements according to priorities that are selectively set by a user. Our implementation on the current GPU demonstrates that the presented algorithm makes high sampling rates as effective as 9 to 16 samples per pixel more affordable than before for real-time ray tracing. /content/cudazone/CUDABrowser/assets/images/applications/350_AdaptiveSamplingRenderedImage_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/350_AdaptiveSamplingRenderedImage_large.jpg Sogang Computer Graphics Lab http://grmanet.sogang.ac.kr/results_rtg.html#gpurt2 2009 06 01 06/01/2009 B. Jin I. Ihm C. Park Multimedia Paper Imaging B. Jin, I. Ihm, C. Park a8845841-7514-46a1-9c0d-d28fef4e68ea SIMD Optimization of Linear Expressions for Programmable Graphics Hardware The increased programmability of graphics hardware allows efficient GPU implementations of a wide range of general computations on commodity PCs.
An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions in the form of y = Ax + b, where A is a matrix, and x, y, and b are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions. It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods. /content/cudazone/CUDABrowser/assets/images/applications/349_simd_ole4pgh_title_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/349_simd_ole4pgh_title_large.jpg Department of Computer Science, University of Texas / Sogang University 2008 07 01 07/01/2008 Chandrajit Bajaj Insung Ihm Jungki Min Imaging programmable GPU, vertex shader, pixel shader, numerical computing, linear expression, SIMD, shader code optimization, Chandrajit Bajaj, Insung Ihm, Jungki Min 12d5c225-4df4-4315-b428-40025933ac3d Nebula 3 VST Plugin based on Volterra Kernels Series.
It emulates different types of vintage gear: equalizers, filters, microphones, preamps, compressors, reverb and generic time-variant processors (chorus, flangers, phasers). /content/cudazone/CUDABrowser/assets/images/applications/348_acustica_small.png /content/cudazone/CUDABrowser/assets/images/applications/348_acustica_large.png Commercial ACUSTICA http://www.acusticaudio.net/ 2009 06 01 06/01/2009 2 Commercial ACUSTICA Application Video & Audio ACUSTICA 225159b9-381a-45ca-9fa2-556bdac7e48f 3D-Coat 3D-Coat is a creative 3D painting, texturing and sculpting tool that uses CUDA to speed up voxel sculpting so the application can keep up with the artist's creative input. /content/cudazone/CUDABrowser/assets/images/applications/347_3dcoat_small.png /content/cudazone/CUDABrowser/assets/images/applications/347_3dcoat_large.png Commercial Pilgway http://www.3dcoat.com 2009 06 01 06/01/2009 2 Consumer Application Multimedia 3D digital content creation Imaging 79b0b95f-6bc4-4ccd-b719-8d33ba9f5875 MediaShow Espresso MediaShow Espresso is hassle-free. Its user interface is intuitively designed to convert videos in two easy steps, allowing you to convert all your favorite videos for playback on iPhone, PSP, Xbox, YouTube and more. Optimized for NVIDIA CUDA, MediaShow Espresso offers incredible performance, up to 10 times faster. Faster performance doesn't necessarily mean you have to waste power though, as you'll find out with MediaShow Espresso's energy-saving feature, auto-shutdown. /content/cudazone/CUDABrowser/assets/images/applications/346_mediashowespresso_small.png /content/cudazone/CUDABrowser/assets/images/applications/346_mediashowespresso_large.png Commercial CyberLink http://www.cyberlink.com/ 2009 4 29 4/29/2009 4 CyberLink Application Video & Audio CyberLink 7b3f7561-d959-44df-a9a8-0d5cac1c630c Move it 1.5 Nero Move it lets customers enjoy all their multimedia files on compatible portable and mobile devices.
Nero Move it uses CUDA to convert videos in a fraction of the usual time, moving content between mobile devices at incredible speeds. Up to 5x the speed when compared to classic CPU-based transcoding. /content/cudazone/CUDABrowser/assets/images/applications/345_box-moveit-96_small.png /content/cudazone/CUDABrowser/assets/images/applications/345_box-moveit-96_large.png Commercial Nero http://www.nero.com/ 2009 04 30 4/20/2009 4 Nero Application Video & Audio Nero bfd8e336-ea99-45e2-8b96-1e65404c3cf3 Super LoiLoScope Super LoiLoScope is an easy-to-use super high-speed GPU-based video editing software with a game-like ultra intuitive GUI, developed by two top Japanese game creators. /content/cudazone/CUDABrowser/assets/images/applications/344_loiloscope_small_.png /content/cudazone/CUDABrowser/assets/images/applications/344_loiloscope_large_.png Commercial LoiLo Inc. http://www.loilo.tv/ 2009 1 31 1/31/2009 10 Consumer LoiLo Inc. Application Video & Audio LoiLo Inc 9c83700f-4313-46ac-b21e-8bbbab6c579c ffA Software: Performance Acceleration Integration of Graphics Processing Unit (GPU) based computing capabilities into ffA's SVI Pro and SEA3D Pro desktop image processing and analysis applications is delivering step increases in the processing performance of ffA Seismic Image Processing Algorithms. /content/cudazone/CUDABrowser/assets/images/applications/343_HPCPic_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/343_HPCPic_large.jpg ffA http://www.ffa.co.uk/hpc.html 2009 05 15 05/15/2009 98 ffA Multimedia Paper Oil & Gas ffA 8d33eb53-44ed-4237-bf40-82cb67db9224 SeismicCity As a foundation for its future needs, SeismicCity turned to GPU Computing by running NVIDIA CUDA on an NVIDIA Tesla S870 1U server system. This massively parallel computing architecture produces a 20X performance increase over the previous CPU configuration. Performance was accelerated an additional 3.5X with NVIDIA's next-generation Tesla processors based on CUDA technology. 
Going forward, the scalability of GPUs will make the transition to new algorithms faster and allow the hardware platform to be expanded as need arises. /content/cudazone/CUDABrowser/assets/images/applications/342_large_Seismiccity_RTM_image_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/342_large_Seismiccity_RTM_image_large.jpg Commercial SeismicCity http://www.seismiccity.com 2008 10 29 10/29/2008 20 SeismicCity Paper Oil & Gas SeismicCity 340dde84-bed3-41c8-b7e9-25ac626ea0ee A method of accelerating seismic Pre-stack time migration by GPU General-purpose GPU technology has been maturing and has been applied in many industries. However, because the computing characteristics of CPUs and GPUs differ, the use of GPUs in petroleum industry applications needs to be developed effectively. In this article, we introduce general-purpose GPU technology and propose a method for implementing pre-stack time migration software on the GPU. Compared with traditional pre-stack time migration running on a personal computer (PC) or PC cluster, the new programming method greatly improves computational efficiency and dramatically reduces power and maintenance costs. Tests on real seismic data illustrate that high-performance computing based on general-purpose GPU technology (GPGPU) is an important direction of development for meeting the large-scale computing requirements of the petroleum industry.
/content/cudazone/CUDABrowser/assets/images/applications/341_waves_small.png /content/cudazone/CUDABrowser/assets/images/applications/341_waves_large.png Academia Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing 2009 01 01 01/01/2009 15 LI Bo LIU Guo-feng LIU Hong Paper Oil & Gas LI Bo, LIU Guo-feng, LIU Hong 14ba18ca-16f2-4463-a272-ad86dac1149d Ruins 1.5 If you want to shatter large amounts of debris, you need CUDA acceleration. You must have an NVIDIA GeForce 8800 or Quadro FX 4600 with at least 512 MB of display memory; note that with 512 MB of display memory, only about 350 MB is actually usable. /content/cudazone/CUDABrowser/assets/images/applications/340_shatter_helix_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/340_shatter_helix_large.jpg Commercial nShatter.com http://www.nshatter.com/index.html 2009 07 07 07/07/2009 Commercial nShatter.com Application Multimedia Imaging nShatter.com 5ac5e350-f908-4569-b8b8-d4e510e57595 Seismic Solvers Acceleware is leading the market in providing acceleration solutions for seismic data processing and reservoir simulation. By combining our core knowledge in parallelization and optimization of complex algorithms with an in-house team of seismic industry experts, Acceleware provides software solutions for seismic data processors, which access the massively parallel capabilities of compute GPUs. The Acceleware seismic processing solutions provide multi-fold performance increases to reduce lengthy processing times and deliver faster business decisions for the seismic industry. By harnessing the parallel processing power of GPU accelerators to dramatically increase the computation power of data centers, seismic jobs are processed faster and with a reduced total cost of IT ownership.
/content/cudazone/CUDABrowser/assets/images/applications/339_SEG_Salt_Model_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/339_SEG_Salt_Model_large.jpg Commercial Acceleware http://www.acceleware.com/default/ 2009 06 09 06/09/2009 200 Acceleware Application Multimedia Oil & Gas Acceleware 1b3adcb0-1f3b-4f2d-8772-a4158048ddbf Imaging Earth's Subsurface Using CUDA The state-of-the-art algorithms used in seismic data processing are evolving rapidly, and the need for computing power increases dramatically every year. For this reason, CGGVeritas has always pioneered new high-performance computing (HPC) technologies, and in this work we explore GPUs and NVIDIA's CUDA programming model to accelerate our industrial applications. /content/cudazone/CUDABrowser/assets/images/applications/338_gems3_small.png /content/cudazone/CUDABrowser/assets/images/applications/338_gems3_large.png CGGVeritas: Global Provider of Geophysical Services and Equipment http://www.cggveritas.com 2008 08 02 08/02/2008 15 Bernard Deschizeaux Jean-Yves Blanc Paper Oil & Gas Bernard Deschizeaux, Jean-Yves Blanc 1fa695ac-ce3d-4712-93cf-5ce15091f681 GADGET2 Optimization Optimization of the astrophysics N-Body/SPH solver, using the CUDA architecture to calculate the particle forces. 
/content/cudazone/CUDABrowser/assets/images/applications/337_structure2_small.png /content/cudazone/CUDABrowser/assets/images/applications/337_structure2_large.png Academia Private/Acme Late Night Coding 2009 07 06 07/06/2009 30 Open source Carsten Frigaard Code Computational Fluid Dynamics Numerics Science Carsten Frigaard 8178e5ed-fa4f-4d93-901b-0bbd3fa0c50f FIDESYS Strength analysis at large deformations; mechanics and strength at phase transformations under finite strains; strength of solids made of materials whose properties change under loading; strength analysis of solids from which parts are removed; development of numerical and analytical computational methods. /content/cudazone/CUDABrowser/assets/images/applications/336_mp8-fig1_small.png /content/cudazone/CUDABrowser/assets/images/applications/336_mp8-fig1_large.png Commercial SALD Laboratory http://www.saldlab.com 2009 07 03 07/03/2009 30 Commercial Vladimir A. Levin Multimedia Paper Numerics Oil & Gas Science strength analysis, large deformations, finite strains, phase transitions, superposition, numerical methods, CUDA, Vladimir A. Levin f335d34b-a206-43d9-9e84-bf4500523f8c Real-time estimation of human visual attention We propose a new stochastic model of visual attention. The proposed model is composed of a dynamic Bayesian network with four layers that combines several fundamental statistical models. The proposed model enables us to automatically estimate eye focusing positions and their densities from video frames alone.
/content/cudazone/CUDABrowser/assets/images/applications/335_akisato_small.png /content/cudazone/CUDABrowser/assets/images/applications/335_akisato_large.png Academia NTT Communication Science Laboratories http://www.brl.ntt.co.jp/people/akisato/ 2009 07 03 07/03/2009 20 Akisato Kimura Application Multimedia Paper Digital Content Creation Graphics Science Signal Processing Video & Audio Akisato Kimura 3bb25969-f48e-4aee-b4cb-65c369ca70d2 Autodesk Moldflow Autodesk Moldflow Adviser injection molding software can put plastics simulation within every designer's grasp. Autodesk Moldflow Adviser simplifies plastics injection molding simulation, enabling you to optimize mold features, such as gates, runners, and cavity layouts. The product guides designers and mold makers through analysis setup and results interpretation, so you can see how changes to wall thickness, gate location, material, and geometry affect manufacturability. /content/cudazone/CUDABrowser/assets/images/applications/334_autodesk_small.png /content/cudazone/CUDABrowser/assets/images/applications/334_autodesk_large.png Commercial Autodesk http://usa.autodesk.com/ 2009 07 01 07/01/2009 2 Autodesk Application Industrial design Autodesk 89e2206e-bcf5-4db7-a353-ed500b83c385 Massively-Parallel Simulation of Biochemical Systems Understanding biological evolution prompts for a detailed understanding of the realized phenotype. Biochemical and gene regulatory dynamics are a cornerstone for the physiology of the cell and must therefore be regarded as one of the major aspects of such a phenotype. Experimental insight into molecular parameters is, however, hard to come by. Model development therefore requires computational parameter estimation. At the same time, design of cellular dynamics is highly efficient when done in-silico. 
We therefore developed a computational approach to allow for massively parallel simulation of biological molecular networks that leverages the massively-parallel computing power of modern graphics cards and other many-core programming paradigms. Our system can automatically compile standard SBML files into CUDA code, using analytic derivatives, and computing standard measures of complex dynamics like the Lyapunov exponent. /content/cudazone/CUDABrowser/assets/images/applications/333_SBMLtoCUDA-pipeline-thumb_small.png /content/cudazone/CUDABrowser/assets/images/applications/333_SBMLtoCUDA-pipeline-thumb_large.png Academia TU Darmstadt http://www.tu-darmstadt.de 2009 07 03 07/03/2009 59 J. Ackermann P. Baecher T. Franzel M. Goesele K. Hamacher Paper Life Sciences SBML-to-CUDA conversion, J. Ackermann, P. Baecher, T. Franzel, M. Goesele, K. Hamacher f7976db7-ee65-4f7c-80fe-75011b114c6a GPU-SNN: Large-Scale Biologically Realistic Spiking Neural Networks Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for engineering applications. Spiking Neural Network (SNN) simulators have traditionally been run on large-scale clusters, super-computers, or on dedicated hardware architectures. Alternatively, Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this project we demonstrate an efficient, Izhikevich-neuron-based large-scale SNN simulator that runs on a single GPU. The GPU-SNN model (running on an NVIDIA GTX-280 with 1GB of memory) is up to 26 times faster than a CPU version for the simulation of 100K neurons with 50 Million synaptic connections, firing at an average rate of 7Hz. For simulation of 100K neurons with 10 Million synaptic connections, the GPU-SNN model is only 1.5 times slower than real-time.
This project uses a collection of new techniques related to parallelism extraction, mapping of irregular communication, and compact network representation for effective simulation of SNNs on GPUs. /content/cudazone/CUDABrowser/assets/images/applications/332_gpu-snn-logo_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/332_gpu-snn-logo_large.jpg Academia University of California - Irvine http://www.ics.uci.edu/~jmoorkan/project/ 2009 06 06 06/06/2009 26 Jayram Moorkanikara Nageswaran Micah Richert Paper Code Jayram Moorkanikara Nageswaran, Micah Richert bf5d4551-ca35-4137-9070-987c91bc227a CUDPP: CUDA Data-Parallel Primitives Library v1.1 CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information and to download CUDPP, visit the CUDPP homepage at http://www.gpgpu.org/developer/cudpp /content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_small.png /content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_large.png Research NVIDIA and University of California at Davis http://gpgpu.org/developer/cudpp 2009 07 01 07/01/2009 Open source Mark Harris Code parallel algorithms data-parallel, scan, sort, random number generation, Mark Harris ab27bee7-f0b0-4e84-b9d6-9155055ef5d5 Clustering Billions of Data Points Using GPUs In this paper, we report our research on using GPUs to accelerate clustering algorithms, with a special interest in very large data sets, which are common in today's real world applications.
While many published works have shown that GPUs can be used to accelerate various general purpose applications with respectable performance gains, few attempts have been made to tackle very large problems. Our goal here is to investigate if the GPUs can be useful accelerators even with very large data sets that cannot fit into GPU's onboard memory. Using a popular clustering algorithm, K-Means, as an example, our results have been very positive. With the GPU acceleration, a data set with a billion data points can be clustered within minutes. We achieved 10x performance gain over our highly optimized CPU-only version running on 8 cores, and about 300x performance boost against a popular benchmark, MineBench running on a single core. /content/cudazone/CUDABrowser/assets/images/applications/330_renwu_cudazone_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/330_renwu_cudazone_large.jpg Intelligent Information Management Lab, HP Labs. http://www.hpl.hp.com/ 2009 05 18 05/18/2009 300 Ren Wu Bin Zhang Meichun Hsu Paper Business Intelligence Data Mining Analytics Parallel Algorithm, Data-mining, Clustering, Graphics Processor, GPGPU, Accelerator, Multi-core, Many-core, Data parallelism, Ren Wu, Bin Zhang, Meichun Hsu 66322604-44b3-4a6b-941f-5c6cd54cc0bb Running Unstructured Grid CFD Solvers on Modern Graphics Hardware We implement an unstructured grid finite volume solver for the three-dimensional Euler equations for compressible flow. We describe optimization strategies taken to minimize uncoalesced memory access and achieve high performance. We consider two benchmark cases from aerodynamics. 
/content/cudazone/CUDABrowser/assets/images/applications/329_pressure_small.png /content/cudazone/CUDABrowser/assets/images/applications/329_pressure_large.png Academia George Mason University 2009 06 24 06/24/2009 33 Andrew Corrigan Paper Computational Fluid Dynamics NACA0012 Air Foil, Andrew Corrigan d8e7a5a4-6e68-4613-885c-1f02a8232df8 Efficient parallel scan algorithms for GPUs Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library. /content/cudazone/CUDABrowser/assets/images/applications/328_thumbnail_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/328_thumbnail_large.jpg Research NVIDIA Research http://www.nvidia.com/research 2008 12 15 12/15/2008 Shubho Sengupta Paper Code Libraries Parallel Algorithms Data-parallel algorithms, algorithms, CUDPP, scan, segmented scan, Shubho Sengupta e98a2682-012c-4145-a08d-689927f2f107 Hyperspectral image compression on NVidia GPUs Hyperspectral imaging instruments are capable of collecting hundreds of images, corresponding to different wavelength channels, for the same area on the surface of the Earth. 
For instance, NASA is continuously gathering imagery data with instruments such as the Jet Propulsion Laboratory's Airborne Visible-Infrared Imaging Spectrometer (AVIRIS), able to record the visible and near-infrared spectrum (wavelength region from 0.4 to 2.5 micrometers) of the reflected light of an area 2 to 12 kilometers wide and several kilometers long, using 224 spectral bands. The resulting multidimensional data volume typically comprises several GBs per flight. We have developed a computationally efficient approach for lossy compression of remotely sensed hyperspectral images that retains the relevant information for analyzing the hyperspectral data with sub-pixel precision. The proposed methodology has been implemented, using the Compute Unified Device Architecture (CUDA), on an NVIDIA GeForce 8800 GTX GPU, achieving speedups on the order of 26x when compared to an optimized implementation of the same code on a dual-core CPU. /content/cudazone/CUDABrowser/assets/images/applications/327_hyperspectral_small.png /content/cudazone/CUDABrowser/assets/images/applications/327_hyperspectral_large.png Academia Technology of Computers and Communications, University of Extremadura http://www.umbc.edu/rssipl/people/aplaza 2009 06 24 06/24/2009 26 Antonio Plaza Javier Plaza Sergio Sanchez Paper Imaging Antonio Plaza, Javier Plaza, Sergio Sanchez d5d96c8c-eae1-4c91-b0d1-118677e7315a GPUTop - Topology Optimization on CUDA Graphics Cards in 3D GPUTop is a topology optimizer for CUDA-enabled graphics cards. It is based on the SIMP method with optimality criteria updates in three dimensions. Linear elasticity is discretized using finite elements on a Cartesian mesh. The material density is assumed constant in each element. The resulting system is solved by a matrix-free conjugate gradient method entirely inside the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/326_cantilever_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/326_cantilever_large.jpg Academia University of Trier 2009 05 01 05/01/2009 60 Stephan Schmidt Application Multimedia Electronic Design Automation Numerics Science Stephan Schmidt fe18ed72-01e4-486a-974c-4a31bdce2636 The sparse matrix vector product on GPUs The sparse matrix vector product (SpMV) is a paramount operation in engineering and scientific computing and, hence, has long been a subject of intense research. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devising data formats to store the sparse matrix with the ultimate aim of maximizing the performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors. SpMV implementations for NVIDIA GPUs have already appeared. This work proposes and evaluates a new implementation of SpMV for GPUs based on a new matrix storage format, called ELLPACK-R, and compares it against a variety of formats proposed elsewhere. The most important qualities of this new format are that (1) no preprocessing of the sparse matrix is required, and (2) the resulting SpMV algorithm is very regular. The comparative evaluation of this new SpMV approach has been carried out based on a representative set of test matrices. The results show that the SpMV approach based on ELLPACK-R turns out to be superior to the strategies used so far. Moreover, a comparison with standard state-of-the-art superscalar processors reveals that significant speedup factors are achieved with GPUs. /content/cudazone/CUDABrowser/assets/images/applications/324_Cuda_Zone_Sp_format_GPU_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/324_Cuda_Zone_Sp_format_GPU_large.jpg Academia Dpt.
Computer Architecture and Electronics, University of Almeria, Spain 2009 06 14 06/14/2009 80 Ester Martin Garzon Paper Numerics Libraries Ester Martin Garzon 8c9d6b52-283f-46d2-be89-3b83c39233e0 Real Time Holographic Optical Trapping Using a CUDA-powered NVIDIA graphics card, we can quickly generate highly optimized holograms, allowing interactive optical manipulation of micron-sized structures. /content/cudazone/CUDABrowser/assets/images/applications/323_HotCuda-big_small.png /content/cudazone/CUDABrowser/assets/images/applications/323_HotCuda-big_large.png Research CNR-Dip Fisica, Univ. "La Sapienza" Roma, Italy 2009 06 19 06/19/2009 350 S. Bianchi R. Di Leonardo Application Multimedia Science holographic optical tweezers, S. Bianchi, R. Di Leonardo 13958d4d-5cf1-420b-b8a2-14932b5bb9d7 Parallel Computation With NVIDIA Graphics Card Using CUDA in Hyperthermia Applications In this work we developed a fast parallel computing tool to simulate electromagnetic (EM) fields using the finite-difference time-domain (FDTD) method. The software is used to calculate the EM distribution during a hyperthermia session. Hyperthermia is a modality in cancer treatment that involves heating of tumors. The software can also be used for different applications that require fast and accurate simulation of EM fields, like MRI, RFID, medical implants, wireless sensors, etc.
/content/cudazone/CUDABrowser/assets/images/applications/322_SARCP01_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/322_SARCP01_large.jpg Research Academic Medical Center, Amsterdam http://www.amc.nl/radiotherapie 2009 06 12 06/12/2009 25 Davi Correia Multimedia Presentation Numerics Life Sciences Science hyperthermia, electromagnetics, FDTD, Davi Correia aca0f696-82fe-41fb-b6a3-6d296bd1bb83 Real Time Elimination of Undersampling Artifacts in CE MRA using Variational Denoising on Graphics Hardware Undersampled imaging strategies with state of the art reconstruction methods like compressed sensing, which reformulate image reconstruction as a constrained optimization problem, have the potential to deliver CE MRA images with high spatial and temporal resolution. The drawback of these algorithms is their long reconstruction time which makes it impossible to use them in clinical practice. This study demonstrates that these optimization problems can be solved on modern graphic processing units (GPUs), with computation times that allow real time imaging. /content/cudazone/CUDABrowser/assets/images/applications/321_mra08_small.png /content/cudazone/CUDABrowser/assets/images/applications/321_mra08_large.png Academia Graz University of Technology http://www.tugraz.at 2008 12 01 12/01/2008 2300 F. Knoll M. Unger F. Ebner Multimedia Presentation Imaging F. Knoll, M. Unger, F. Ebner b7be624c-6ee4-4753-89a6-2cfab4e0f695 Mumford-Shah Meets Stereo: Integration of Weak Depth Hypotheses Recent results on stereo indicate that an accurate segmentation is crucial for obtaining faithful depth maps. Variational methods have successfully been applied to both image segmentation and computational stereo. In this paper we propose a combination in a unified framework. In particular, we use a Mumford-Shah-like functional to compute a piecewise smooth depth map of a stereo pair. 
Our approach has two novel features: First, the regularization term of the functional combines edge information obtained from the color segmentation with flow-driven depth discontinuities emerging during the optimization procedure. Second, we propose a robust data term which adaptively selects the best matches obtained from different weak stereo algorithms. We integrate these features in a theoretically consistent framework. The final depth map is the minimizer of the energy functional, which can be solved by the associated functional derivatives. The underlying numerical scheme allows an efficient implementation on modern graphics hardware. We illustrate the performance of our algorithm using the Middlebury database as well as on real imagery. /content/cudazone/CUDABrowser/assets/images/applications/320_cvpr07pock_small.png /content/cudazone/CUDABrowser/assets/images/applications/320_cvpr07pock_large.png Academia Graz University of Technology http://www.tugraz.at 2007 12 01 12/01/2007 Thomas Pock Christopher Zach Horst Bischof Paper Imaging Thomas Pock, Christopher Zach, Horst Bischof 5f8c1341-18c0-4057-9760-46eac91eea78 A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration Robust integration of range images is an important task for building high-quality 3D models. Since range images, and in particular range maps from stereo vision, may have a substantial amount of outliers, any integration approach aiming at high-quality models needs an increased level of robustness. Additionally, a certain level of regularization is required to obtain smooth surfaces. Computational efficiency and global convergence are further preferable properties. The contribution of this paper is a unified framework to solve all these issues. Our method is based on minimizing an energy functional consisting of a total variation (TV) regularization force and an L1 data fidelity term. 
We present a novel and efficient numerical scheme, which combines the duality principle for the TV term with a point-wise optimization step. We demonstrate the superior performance of our algorithm on the well-known Middlebury multi-view database and additionally on real-world multi-view images. /content/cudazone/CUDABrowser/assets/images/applications/319_iccv07_paper_small.png /content/cudazone/CUDABrowser/assets/images/applications/319_iccv07_paper_large.png Academia Graz University of Technology http://www.tugraz.at 2007 12 01 12/01/2007 Christopher Zach Thomas Pock Horst Bischof Paper Imaging Christopher Zach, Thomas Pock, Horst Bischof d816287e-5a19-4af3-9147-a117a60e5d9e A Convex Formulation of Continuous Multi-Label Problems We propose a spatially continuous formulation of Ishikawa's discrete multi-label problem. We show that the resulting non-convex variational problem can be reformulated as a convex variational problem via embedding in a higher dimensional space. This variational problem can be interpreted as a minimal surface problem in an anisotropic Riemannian space. In several stereo experiments we show that the proposed continuous formulation is superior to its discrete counterpart in terms of computing time, memory efficiency and metrication errors. /content/cudazone/CUDABrowser/assets/images/applications/318_pockeccv08_small.png /content/cudazone/CUDABrowser/assets/images/applications/318_pockeccv08_large.png Academia Graz University of Technology http://www.tugraz.at 2008 12 01 12/01/2008 33 Thomas Pock Thomas Schoenemann Gottfried Grabe Paper Imaging Thomas Pock, Thomas Schoenemann, Gottfried Grabe c754fdc4-120c-4eb3-ab2d-de1dc540cf4b Continuous Globally Optimal Image Segmentation with Local Constraints The Geodesic Active Contour model is a very flexible model for variational image segmentation. Unfortunately the Geodesic Active Contour model exhibits local minima, making segmentation results strongly dependent on its initialization. 
We propose a flexible, interactive segmentation method in two and three dimensions that yields the globally optimal solution with respect to local constraints introduced by the user. A fast numerical scheme is used to minimize the proposed energy which is based on a weighted Total Variation energy functional. With our GPU-based implementation, real-time performance is achieved for both 2D and 3D segmentation problems. We show experimental results on various medical datasets, and discuss the properties of the segmentation framework. /content/cudazone/CUDABrowser/assets/images/applications/317_cvww08seg_small.png /content/cudazone/CUDABrowser/assets/images/applications/317_cvww08seg_large.png Academia Graz University of Technology http://www.tugraz.at 2008 05 01 05/01/2008 Markus Unger Thomas Pock Horst Bischof Multimedia Paper Imaging Markus Unger, Thomas Pock, Horst Bischof 44c02794-46b1-4ec6-8c12-1d0ebb082d7b Globally Optimal TV-L1 Shape Prior Segmentation Interpreting an image is a common and challenging task in computer vision. A human observer does not only use intensity or color information or other basic features when looking for region boundaries but also takes prior knowledge into account. This increases the robustness of the segmentation result for most images. The main intention of our work is to propose a globally optimal segmentation algorithm that incorporates prior knowledge in the form of a geometric shape. The proposed energy is based on a weighted Total Variation energy and is optimized with fast numerical approaches like the projected gradient descent method. The GPU-based implementation is able to achieve real-time performance for the presented applications. We show the coherence of the proposed energy model to former variational methods like the well-known edge-preserving restoration model of Rudin, Osher and Fatemi and methods that incorporate prior information into classical segmentation models. 
Different applications are realized with the proposed energy. First of all a semi-automatic, interactive segmentation tool is implemented. The user can either define a shape prior on the fly using the weighted Total Variation as geodesic active contour or load a predefined geometric shape. Next the energy model can be used to align two shapes on each other or optimize the alignment of a shape to an underlying edge function. Consequentially a tracking approach was introduced with the ability to optimize the incorporated shape information according to consecutive frames. This position update is also used when processing 3D data sets with a 2D prior which is particularly useful for segmenting tubular structures in medical data sets with a single constraint on the first slice. /content/cudazone/CUDABrowser/assets/images/applications/316_werlberger_master_small.png /content/cudazone/CUDABrowser/assets/images/applications/316_werlberger_master_large.png Academia Graz University of Technology http://www.tugraz.at 2008 05 01 05/01/2008 Manuel Werlberger Multimedia Paper Imaging Manuel Werlberger fd209c06-27b3-4bd5-9110-18f512c9d85e Interactive Globally Optimal Image Segmentation Image segmentation is a challenging task in computer vision. We present a general purpose image segmentation framework, and focus on its application to medical imaging. Features like gray values or edges are commonly used as input for segmentation algorithms. The geodesic active contour model gained popularity as a flexible variational image segmentation model based solely on edge information. Unfortunately the geodesic active contour model exhibits local minima, making segmentation results strongly dependent on its initialisation. We propose a globally optimal segmentation model, that unifies the usage of gray value information with the geodesic active contour model. A flexible, interactive segmentation framework is presented, that allows incorporation of local constraints. 
Fast numerical schemes are used to minimise the proposed energy which is based on a weighted Total Variation energy functional. Different segmentation approaches using the proposed energy functional are discussed. The relation to the image denoising task is analysed, and we present a fast implementation of the image denoising model of Rudin, Osher and Fatemi. With our GPU-based implementation real-time performance is achieved for both 2D and 3D segmentation problems. We show experimental results on various real world images and different medical datasets. /content/cudazone/CUDABrowser/assets/images/applications/315_unger_tr0802_small.png /content/cudazone/CUDABrowser/assets/images/applications/315_unger_tr0802_large.png Academia Graz University of Technology http://www.tugraz.at/ 2008 02 01 02/01/2008 100 Markus Unger Thomas Pock Horst Bischof Multimedia Paper Imaging Markus Unger, Thomas Pock, Horst Bischof 4230035e-7064-4178-837b-7e86b8b8af75 TVSeg - Interactive Total Variation Based Image Segmentation Interactive object extraction is an important part in any image editing software. We present a two step segmentation algorithm that first obtains a binary segmentation and then applies matting on the border regions to obtain a smooth alpha channel. The proposed segmentation algorithm is based on the minimization of the Geodesic Active Contour energy. A fast Total Variation minimization algorithm is used to find the globally optimal solution. We show how user interaction can be incorporated and outline an efficient way to exploit color information. A novel matting approach, based on energy minimization, is presented. Experimental evaluations are discussed, and the algorithm is compared to state of the art object extraction algorithms. The GPU based binaries are available online. 
/content/cudazone/CUDABrowser/assets/images/applications/314_ungerbmvc2008_small.png /content/cudazone/CUDABrowser/assets/images/applications/314_ungerbmvc2008_large.png Academia Graz University of Technology http://www.tugraz.at/ 2008 12 01 12/01/2008 Markus Unger Thomas Pock Werner Trobin Application Multimedia Paper Imaging Markus Unger, Thomas Pock, Werner Trobin 9b4757c0-bb97-47fe-bafb-21111b2cd277 A convex approach for computing minimal partitions We describe a convex relaxation for a family of problems of minimal perimeter partitions. The minimization of the relaxed problem can be tackled numerically; we describe an algorithm and show some results. In most cases, our relaxed problem finds a correct (approximate) solution: we give some arguments to explain why it should be so, and also discuss some situations where it fails. /content/cudazone/CUDABrowser/assets/images/applications/313_chambolle08_small.png /content/cudazone/CUDABrowser/assets/images/applications/313_chambolle08_large.png Academia ECOLE POLYTECHNIQUE http://www.cmap.polytechnique.fr/ 2008 11 01 11/01/2008 Antonin Chambolle Daniel Cremers Thomas Pock Paper Imaging Antonin Chambolle, Daniel Cremers, Thomas Pock 98759e63-9907-443c-ab79-dc6e8de0f2c4 A Convex Relaxation Approach for Computing Minimal Partitions In this work we propose a convex relaxation approach for computing minimal partitions. Our approach is based on rewriting the minimal partition problem (also known as Potts model) in terms of a primal dual Total Variation functional. We show that the Potts prior can be incorporated by means of convex constraints on the dual variables. For minimization we propose an efficient primal dual projected gradient algorithm which also allows a fast implementation on parallel hardware. Although our approach does not guarantee to find global minimizers of the Potts model, we can give a tight bound on the energy between the computed solution and the true minimizer. 
Furthermore we show that our relaxation approach dominates recently proposed relaxations. As a consequence, our approach allows us to compute solutions closer to the true minimizer. For many practical problems we even find the global minimizer. We demonstrate the excellent performance of our approach on several multi-label image segmentation and stereo problems. /content/cudazone/CUDABrowser/assets/images/applications/312_pockcvpr2009_small.png /content/cudazone/CUDABrowser/assets/images/applications/312_pockcvpr2009_large.png Academia Graz University of Technology http://www.tugraz.at/ 2009 01 01 01/01/2009 Thomas Pock Daniel Cremers Antonin Chambolle Paper Imaging Antonin Chambolle, Daniel Cremers, Thomas Pock f5051dff-fcce-4689-ad6e-bbbeeb64ad67 Semi Automatic Segmentation of Articular Cartilage using Variational Methods Osteoarthritis (OA) is a syndrome of joint pain that affects the large weight bearing joints. It is caused by an abnormal wearing of articular cartilage, covering the joints. In addition to the inconvenience OA causes, its treatment is very time consuming and expensive. Therefore it is desirable to improve methods for an early diagnosis of OA. The detection of thinning of articular cartilage provides a good support for the diagnosis of OA in its early stage. The first step in this diagnosis process is the accurate segmentation of the cartilage surface. In this Master's Thesis we propose an interactive segmentation framework for the semi automatic segmentation of articular cartilage. Until today, no automatic segmentation method is able to achieve the accuracy necessary for a trustworthy diagnosis. Also, physicians in general prefer to be able to control and modify the segmentation result, which is usually complicated using automatic methods. Semi automatic methods allow the user to incorporate knowledge into the segmentation process, whilst reducing the time and improving the repeatability compared to fully manual methods. 
The proposed segmentation model is based on a weighted Total Variation energy and minimised using efficient numerical approaches. Implemented on today's user-programmable graphics cards, it allows real-time user interaction. The evaluation of our segmentation method using real-world magnetic resonance datasets of human knee joints shows that we are able to speed up the segmentation process significantly, compared to manual and semi automatic segmentation methods. /content/cudazone/CUDABrowser/assets/images/applications/311_thesis_christian_reinbacher_web_small.png /content/cudazone/CUDABrowser/assets/images/applications/311_thesis_christian_reinbacher_web_large.png Academia Graz University of Technology http://www.tugraz.at/ 2009 01 01 01/01/2009 50 Christian Reinbacher Paper Imaging Christian Reinbacher ad494f35-db0c-4777-bec8-14bcd01a305c A Variational Model for Interactive Shape Prior Segmentation and Real-Time Tracking In this paper, we introduce a semi-automated segmentation method based on minimizing the Geodesic Active Contour energy incorporating a shape prior. We increase the robustness of the segmentation result using the additional shape information that represents the desired structure. Furthermore the user has the possibility to take corrective actions during the segmentation and adapt the shape prior position. Interaction is often desirable when processing difficult data like in medical applications. To facilitate the user interaction we add a shape deformation which allows the shape position to be changed manually by the user and automatically according to underlying image features. Using a variational formulation, the optimization can be done in a globally optimal manner for a fixed shape representation. To obtain real-time behavior, which is especially important for an interactive tool, the whole method is implemented on the GPU. Experiments are done on medical, as well as on video data and camera streams that are processed in real-time. 
In terms of medical data we compare our method with a segmentation done by an expert. The GPU based binaries will be available online on our homepage. /content/cudazone/CUDABrowser/assets/images/applications/310_werlberger_ssvm2009_small.png /content/cudazone/CUDABrowser/assets/images/applications/310_werlberger_ssvm2009_large.png http://www.gpu4vision.org http://www.gpu4vision.org 2008 12 01 12/01/2008 Manuel Werlberger Thomas Pock Thomas Pock Paper Multimedia Imaging Manuel Werlberger, Thomas Pock 847a423f-7656-4ac0-b0d5-5e5b41c66f11 A Duality Based Approach for Realtime TV-L1 Optical Flow Variational methods are among the most successful approaches to calculate the optical flow between two image frames. A particularly appealing formulation is based on total variation (TV) regularization and the robust L1 norm in the data fidelity term. This formulation can preserve discontinuities in the flow field and offers an increased robustness against illumination changes, occlusions and noise. In this work we present a novel approach to solve the TV-L1 formulation. Our method results in a very efficient numerical scheme, which is based on a dual formulation of the TV energy and employs an efficient point-wise thresholding step. Additionally, our approach can be accelerated by modern graphics processing units. We demonstrate the real-time performance (30 fps) of our approach for video inputs at a resolution of 320 x 240 pixels. /content/cudazone/CUDABrowser/assets/images/applications/309_pockdagm07_small.png /content/cudazone/CUDABrowser/assets/images/applications/309_pockdagm07_large.png Academia Graz University of Technology http://www.tugraz.at/ 2007 12 01 12/01/2007 C. Zach T. Pock H. Bischof Application Multimedia Paper Imaging C. Zach, T. Pock, H. 
Bischof e51747cc-73ba-4e32-8e4d-35dfc6679a6c An Unbiased Second-Order Prior for High-Accuracy Motion Estimation Virtually all variational methods for motion estimation regularize the gradient of the flow field, which introduces a bias towards piecewise constant motions in weakly textured areas. We propose a novel regularization approach, based on decorrelated second-order derivatives, that does not suffer from this shortcoming. We then derive an efficient numerical scheme to solve the new model using projected gradient descent. A comparison to a TV regularized model shows that the proposed second-order prior exhibits superior performance, in particular in low-textured areas (where the prior becomes important). Finally, we show that the proposed model yields state-of-the-art results on the Middlebury optical flow database. /content/cudazone/CUDABrowser/assets/images/applications/308_trobindagm08_small.png /content/cudazone/CUDABrowser/assets/images/applications/308_trobindagm08_large.png Academia Graz University of Technology http://www.tugraz.at/ 2008 12 01 12/01/2008 Werner Trobin Thomas Pock Daniel Cremers Paper Imaging Werner Trobin, Thomas Pock, Daniel Cremers 0d9b7a3f-47e7-4f51-a0c9-8161f1f5a368 Continuous Energy Minimization Variational problems, which are commonly used to solve low-level vision tasks, are typically minimized via a local, iterative optimization strategy, e.g. gradient descent. Since every iteration is restricted to a small, local improvement, the overall convergence can be slow and the algorithm may get stuck in an undesirable local minimum. In this paper, we propose to approximate the minimization by solving a series of binary subproblems to facilitate large optimization moves. The proposed method can be interpreted as an extension of discrete graph-cut based methods such as α-expansion or LogCut to a spatially continuous setting. 
In order to demonstrate the viability of the approach, we evaluated the novel optimization strategy in the context of optical flow estimation, yielding excellent results on the Middlebury optical flow datasets. /content/cudazone/CUDABrowser/assets/images/applications/307_trobineccv2008_small.png /content/cudazone/CUDABrowser/assets/images/applications/307_trobineccv2008_large.png Academia Graz University of Technology http://www.tugraz.at/ 2008 12 1 12/1/2008 Werner Trobin Thomas Pock Daniel Cremers Paper Imaging Werner Trobin, Thomas Pock, Daniel Cremers bf8c14b0-7314-47ac-915f-e8a86a852188 Duality TV-L1 Flow with Fundamental Matrix Prior Variational techniques yield the most accurate results for dense optical flow fields between two images. They have the nice property of inherent smoothness to cope with untextured image regions: the filling-in of such regions is driven by neighbouring pixels. Such filling-in is not always the best choice. If the scene is mostly stationary and the camera is moving, the direction of the optical flow vectors can be restricted using the fundamental matrix. In this paper we propose an exact solution of the variational optical flow, using the fundamental matrix geometry as an additional weak prior. Our novel approach currently performs best on the Middlebury flow evaluation which includes images from stationary and dynamic scenes. /content/cudazone/CUDABrowser/assets/images/applications/306_ivcnz08_small.png /content/cudazone/CUDABrowser/assets/images/applications/306_ivcnz08_large.png Academia Daimler Group Research, Sindelfingen, Germany 2008 12 01 12/01/2008 A. Wedel T. Pock J. Braun Multimedia Paper Imaging Optical flow, fundamental matrix, structure from motion, optimization, total variation, A. Wedel, T. Pock, J. Braun 337c0c41-4ef4-4b42-897e-0a7762736367 Real-time Computation of Variational Methods on Graphics Hardware This paper combines two powerful approaches: variational methods and graphics hardware. 
Variational methods have demonstrated considerable success in computer vision for such diverse tasks as denoising, segmentation, registration, stereo matching etc. Their main advantage is a mathematically clean and powerful formulation of the vision problem in terms of energy functionals that have to be minimized. However, due to their iterative nature these approaches tend to be slow and far from real-time capable. Recent progress in graphics hardware (the computational power grows much faster than for standard CPUs) makes this hardware interesting for computer vision applications. For example, floating point arithmetic and high-level programming languages such as Cg are now available for modern graphics cards. In this paper we demonstrate that by a careful analysis and formulation of variational methods and exploitation of the parallelism of modern GPUs we can achieve real-time performance without complex multi-grid optimization schemes (we obtain speed-ups of a factor of more than 200 compared to Matlab implementations). This opens several new application areas for variational methods (e.g. real-time algorithms). /content/cudazone/CUDABrowser/assets/images/applications/305_cvww07_pock_small.png /content/cudazone/CUDABrowser/assets/images/applications/305_cvww07_pock_large.png Academia Graz University of Technology http://www.tugraz.at/ 2007 02 06 02-06-2007 200 Thomas Pock Markus Grabner Horst Bischof Paper Imaging Thomas Pock, Markus Grabner, Horst Bischof 7385b402-b661-4ec7-a1e7-a1774baf45f6 Fast Total Variation for Computer Vision Motivated by statistical inference methods, variational methods are among the most successful methods to solve a number of different Computer Vision problems. Variational methods aim to minimize an energy functional which is designed to appropriately describe a Computer Vision task. Since Computer Vision problems are typically ill-posed, appropriate priors (or regularizers) are needed to find physically meaningful solutions. 
A particularly interesting prior is given by the Total Variation norm. It provides a good tradeoff between modeling the true statistics of natural images and still allowing an exact solution to be computed. Total Variation methods were first introduced by Rudin, Osher and Fatemi in 1992 for edge preserving image denoising. Computing the solution of energy functionals incorporating Total Variation is a challenging task. We review different numerical algorithms and discuss their properties. For implementation on a digital computer we will consider two different approaches. The first approach is based on an explicit discretization of the Euler-Lagrange partial differential equations, which is the standard approach. The second approach is based on algorithmic differentiation of a discretized version of the energy functional. We show that both approaches yield equivalent results whereas algorithmic differentiation is less error prone and can be applied to very complex models. For performance evaluation, we implement our variational algorithms on the graphics processing unit (GPU). Through controlled experiments we show that our GPU-based implementations clearly outperform recently proposed discrete optimization techniques in both speed and maximum problem size. In the remaining part of the thesis, we apply Total Variation methods to three fundamental Computer Vision problems: Segmentation, Optical Flow and 3D Reconstruction. We show that our Total Variation based methods yield state-of-the-art results. 
/content/cudazone/CUDABrowser/assets/images/applications/304_pock_phd_small.png /content/cudazone/CUDABrowser/assets/images/applications/304_pock_phd_large.png Academia Graz University of Technology http://www.tugraz.at/ 2008 1 1 1/1/2008 1000 Thomas Pock Paper Imaging Thomas Pock bfd5c98d-c5d6-4d33-9d85-2cdaa02e118f Fast and Exact Solution of Total Variation Models on the GPU This paper discusses fast and accurate methods to solve Total Variation (TV) models on the graphics processing unit (GPU). We review two prominent models incorporating TV regularization and present different algorithms to solve these models. We mainly concentrate on variational techniques, i.e. algorithms which aim at solving the Euler-Lagrange equations associated with the variational model. We then show that particularly these algorithms can be effectively accelerated by implementing them on parallel architectures such as GPUs. For comparison we chose a state-of-the-art method based on discrete optimization techniques. We then present the results of a rigorous performance evaluation including 2D and 3D problems. As a main result we show that our GPU-based algorithms clearly outperform discrete optimization techniques in both speed and maximum problem size. /content/cudazone/CUDABrowser/assets/images/applications/303_pockcvpr2008_small.png /content/cudazone/CUDABrowser/assets/images/applications/303_pockcvpr2008_large.png Academia Institute for Computer Graphics and Vision, Graz University of Technology http://www.icg.tu-graz.ac.at 2008 12 01 12/01/2008 1000 Thomas Pock Markus Unger Daniel Cremers Application Multimedia Paper Imaging Thomas Pock, Markus Unger, Daniel Cremers f7584aa1-8c40-4182-86b2-2ec2baebb15d Automatic Differentiation for GPU-Accelerated 2D/3D Registration A common task in medical image analysis is the alignment of data from different sources, e.g., X-ray images and computed tomography (CT) data. Such a task is generally known as registration. 
We demonstrate the applicability of automatic differentiation (AD) techniques to a class of 2D/3D registration problems which are highly computationally intensive and can therefore greatly benefit from a parallel implementation on recent graphics processing units (GPUs). However, being designed for graphics applications, GPUs have some restrictions which conflict with requirements for reverse mode AD, in particular for taping and TBR analysis. We discuss design and implementation issues in the presence of such restrictions on the target platform and present a method which can register a CT volume data set (512x512x288 voxels) with three X-ray images (512x512 pixels each) in 11.8 seconds on a GeForce 8800GTX graphics card. /content/cudazone/CUDABrowser/assets/images/applications/302_grabner_AD08_small.png /content/cudazone/CUDABrowser/assets/images/applications/302_grabner_AD08_large.png Academia Institute for Computer Graphics and Vision, Graz University of Technology http://www.icg.tu-graz.ac.at/ 2008 08 15 08/15/2008 Markus Grabner Thomas Pock Tobias Gross Paper Imaging Optimization, medical image analysis, 2D/3D registration, GPU, Markus Grabner, Thomas Pock, Tobias Gross 4c67cc92-5ea4-4b4c-afd2-e7041206bc09 Ascalaph Liquid GPU 1.2.1 Ascalaph Liquid GPU is an application that calculates dynamic molecules in a liquid phase. Calculation of dynamic molecules is a very complex process, so it demands as much processing power as possible. Agile Molecule therefore took the CUDA 2.0 API and designed an application that takes advantage of this technology. The GPU version can speed up the calculations dramatically. 
/content/cudazone/CUDABrowser/assets/images/applications/301_liquid-simulation_small.png /content/cudazone/CUDABrowser/assets/images/applications/301_liquid-simulation_large.png Commercial Agile Molecule http://www.agilemolecule.com 2009 03 01 03/01/2009 29 Agile Molecule Application Computational Fluid Dynamics Agile Molecule d73b70a5-03df-4e6d-b2cd-7098e976fd00 CUDA OpenGL Tutorials CUDA ("Compute Unified Device Architecture") is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the GPU. CUDA has been developed by NVIDIA, and using this architecture requires an NVIDIA GPU and special stream processing drivers. CUDA only works with the new GeForce 8 Series, featuring G8X GPUs; NVIDIA guarantees that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards. CUDA gives developers unfettered access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, NVIDIA GeForce-based GPUs effectively become powerful, programmable open architectures like today's CPUs (Central Processing Units). By opening up the architecture, CUDA provides developers with the low-level, deterministic, and repeatable access to hardware that is necessary to develop essential high-level programming tools such as compilers, debuggers, math libraries, and application platforms. 
/content/cudazone/CUDABrowser/assets/images/applications/300_index_clip_image002_0002_small.png /content/cudazone/CUDABrowser/assets/images/applications/300_index_clip_image002_0002_large.png Academia Department of Computer Science & Engineering, CUHK http://www.cse.cuhk.edu.hk/ 2007 02 26 02/26/2007 Xie Yongming Application Imaging Xie Yongming f633b870-abcb-4a3d-9dd4-0a53519dca66 Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse general-purpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML). /content/cudazone/CUDABrowser/assets/images/applications/299_dragon_small.png /content/cudazone/CUDABrowser/assets/images/applications/299_dragon_large.png Academia Gocad Research Group, INRIA, Nancy Universite, France http://gocad.org 2007 12 1 12/1/2007 8 L. Buatois G. Caumon B. Levy Paper Numerics L. Buatois, G. Caumon, B. Levy 633f6b09-e9cb-4ce6-9bc4-ce40bfa307d8 LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. 
Our matrix-matrix multiply routine (GEMM) runs 60% faster than the vendor implementation in CUBLAS 1.1 and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate. http://www.netlib.org/lapack/lawnspdf/lawn202.pdf /content/cudazone/CUDABrowser/assets/images/applications/298_lawn202_small.png /content/cudazone/CUDABrowser/assets/images/applications/298_lawn202_large.png Academia University of California at Berkeley http://berkeley.edu/ 2008 07 07 07/07/2008 8 Vasily Volkov James W. Demmel Paper Research Vasily Volkov, James W. Demmel f832aa18-4900-4118-9e81-cd2d52dedf71 General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization. http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Christen07.pdf /content/cudazone/CUDABrowser/assets/images/applications/297_christen_small.png /content/cudazone/CUDABrowser/assets/images/applications/297_christen_large.png 2006 09 01 09/01/2006 7 Matthias Christen Olaf Schenk Helmar Burkhart Paper Numerics Matthias Christen, Olaf Schenk, Helmar Burkhart 65d33580-ff45-439e-9f5c-5fdc1814f542 A Fast Double Precision CFD Code using CUDA We describe a second-order double precision finite volume Boussinesq code implemented using the CUDA platform. We perform detailed validation of the code on a variety of Rayleigh-Benard convection problems and show second-order convergence. We obtain matching results with a Fortran code running on a high-end eight-core CPU. The CUDA-accelerated code achieves approximately an eightfold speedup versus the Fortran code on identical problems.
As a result, we are able to run a simulation with a grid of size 384 x 384 x 192 at 1.6 seconds per time step on a machine with a single GPU. /content/cudazone/CUDABrowser/assets/images/applications/296_cfd_small.png /content/cudazone/CUDABrowser/assets/images/applications/296_cfd_large.png Commercial / Academic NVIDIA / IGPP UCLA http://www.nvidia.com 2009 06 01 06/01/2009 8 Jonathan M. Cohen M. Jeroen Molemaker Paper Computational Fluid Dynamics CUDA, GPU Computing, Multicore, Rayleigh-Benard convection, Jonathan M. Cohen, M. Jeroen Molemaker a711966a-9d50-42f0-9797-544c84d2d4a2 cRARk RAR archives password recovery /content/cudazone/CUDABrowser/assets/images/applications/295_crark32-ss_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/295_crark32-ss_large.jpg Academia St. Petersburg Technical University 2009 05 29 05/29/2009 15 Pavel Semjanov Application Multimedia Numerics rar password, Pavel Semjanov e69e67b3-e02f-4918-a2a1-f4ba67c92bfd GPGPU BASED IMAGE SEGMENTATION LIVEWIRE ALGORITHM IMPLEMENTATION This thesis presents a GPU implementation of the Livewire algorithm. Instead of using traditional architectures, like the CPU, this implementation focuses on the advantages obtained using Single Instruction Multiple Data (SIMD) architectures. http://code.google.com/p/gpuwire/ /content/cudazone/CUDABrowser/assets/images/applications/294_gpuwire_large.jpg /content/cudazone/CUDABrowser/assets/images/applications/294_gpuwire_small.jpg 2008 12 1 12/1/2008 Daniel Lelis Baggio Application Multimedia Paper Imaging Daniel Lelis Baggio 46373d5b-aa72-4200-b0ba-c62ddda1cd1b Asymmetric Cryptography on GPUs The paper discusses the use of CUDA to accelerate asymmetric cryptographic algorithms based on modular exponentiation, such as RSA and DSA, and elliptic curve-based methods such as ECDSA.
/content/cudazone/CUDABrowser/assets/images/applications/293_nvidia-CUDA,P-W-111092-13_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/293_nvidia-CUDA,P-W-111092-13_large.jpg Hochschule Horst Gortz Institute for IT Security, Ruhr University Bochum http://www.crypto.rub.de 2008 08 10 08/10/2008 N. z. Robert Szerwinski Paper Numerics Science RSA, DSA, ECDSA, ECC, Robert Szerwinski d095f5a1-5a77-4533-9f22-576c0d37f0aa Level-3 BLAS on a GPU: Picking the Low Hanging Fruit The arrival of hardware accelerators has created a new gold rush to be the first to deliver their promise of high performance for numerical applications. Since they are relatively hard to program, with limited language and compiler support, it is generally accepted that one needs to roll up one's sleeves and tough it out, not unlike the early days of distributed memory parallel computing (or any other period after the introduction of a drastically different architecture). In this paper we remind the community that while this is a noble endeavor, there is a lot of low-hanging fruit that can be harvested easily. Picking this low-hanging fruit benefits the scientific computing community immediately and prototypes the approach that further optimizations may wish to follow. We demonstrate this by focusing on a widely used set of operations, the level-3 BLAS, targeting the NVIDIA family of GPUs. /content/cudazone/CUDABrowser/assets/images/applications/292_level3_small.png /content/cudazone/CUDABrowser/assets/images/applications/292_level3_large.png Research Universitat Jaume I, Spain - University of Texas at Austin 2009 05 22 05/22/2009 4 Francisco Igual Gregorio Quintana Robert van de Geijn Paper Numerics Francisco Igual, Gregorio Quintana, Robert van de Geijn 795116c5-6b29-4699-849d-7fd134527031 Density field viewer This CUDA demo is able to ray trace a density volume and surrounding triangle objects in real time.
In the demo the density is a smoke simulation done by Michael Bang and Brian Bunch Christensen from the Department of Computer Science, University of Aarhus. /content/cudazone/CUDABrowser/assets/images/applications/291_cool_smoke_small.png /content/cudazone/CUDABrowser/assets/images/applications/291_cool_smoke_large.png Research Alexandra Institute http://www.alexandra.dk/ 2009 05 15 05/15/2009 10 Peter Trier Application Multimedia Computational Fluid Dynamics Graphics Science Ray tracing, volume rendering, Peter Trier aece5608-4e34-46be-aa9d-69da7de03fbf Accelerating RTM on GPU, what is the current status? This presentation describes experience with GPU-parallelization of the Reverse Time Migration (RTM) approach to seismic depth imaging. RTM is reviewed at a high level, followed by a GPU implementation geared towards multi-GPU systems. GPU code (in this case C for CUDA) was generated using CAPS HMPP programming directives. /content/cudazone/CUDABrowser/assets/images/applications/290_trc_small.png /content/cudazone/CUDABrowser/assets/images/applications/290_trc_large.png Commercial NVIDIA http://www.nvidia.com 2009 01 01 01/01/2009 Henri Calandra Stephane Bihan Paulius Micikevicius Paper Signal Processing Henri Calandra, Stephane Bihan, Paulius Micikevicius 61950ddf-e994-4823-a034-9b92880a62f9 Parallelized Turing bombe & Enigma simulations This project involves implementing simulations of Enigma machines and the Turing bombe on various parallel-computing systems including multi-processor PCs, Linux clusters, and modern enhanced graphics cards.
/content/cudazone/CUDABrowser/assets/images/applications/289_180px-Enigma-rotor-stack_small.png /content/cudazone/CUDABrowser/assets/images/applications/289_180px-Enigma-rotor-stack_large.png Academia Department of Electronic Engineering, La Trobe University, Bundoora, Australia http://www.latrobe.edu.au/ee/ 2009 05 14 05/14/2009 35 Open source Cong Van Nguyen Application Multimedia Code Numerics Enigma, Turing bombe, CUDA, Cong Van Nguyen 43f8cd9c-46c8-48d4-aac8-588815fde07b CUDA Accelerated Expectation Maximization of Gaussian Mixture Models This is a CUDA implementation of the Expectation Maximization algorithm for Gaussian Mixture Models. On my machine, it provides up to 170x performance increases versus a CPU reference version. See the report available at http://andrewharp.com/gmmcuda for more information. /content/cudazone/CUDABrowser/assets/images/applications/288_em_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/288_em_large.jpg 2009 5 6 5/6/2009 170 Andrew Harp Paper Code Numerics Science Other Machine Learning Clustering AI Statistics GMM, Gaussian, Machine Learning, Statistics, Andrew Harp 7fcf2257-7a1f-4f39-b2f1-78266b3cc524 Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures Stencil computations arise in a wide range of applications of computational sciences. This paper focuses on stencil computations arising in the context of a biomedical simulation. Compute-intensive bio-medical simulations represent an attractive application for the Cell Broadband Engine Architecture (CBEA) and for graphics processing units (GPUs) as hardware accelerators. Due to the low arithmetic intensity of stencil computations and bandwidth limitations of the compute hardware, the performance is usually only a fraction of peak performance. We detail an implementation of parallel stencil computations on the CBEA and GPUs, which improves performance by exploiting temporal locality. 
We report on performance improvements over CPU implementations. /content/cudazone/CUDABrowser/assets/images/applications/287_christenschenk_small.png /content/cudazone/CUDABrowser/assets/images/applications/287_christenschenk_large.png 2008 12 01 12/01/2008 Matthias Christen Olaf Schenk Peter Messmer Paper Numerics Matthias Christen, Olaf Schenk, Peter Messmer 5c5a6a6d-ccd0-47d4-bdff-bc89f7320589 Seismic imaging using GPGPU accelerated Reverse Time Migration CS315A Final Project Report In this report, I outline the implementation and preliminary benchmarking of a parallelized program to perform reverse time migration (RTM) seismic imaging using the Nvidia CUDA platform for scientific computing, accelerated by a general purpose graphics processing unit (GPGPU). This novel software architecture allows access to the massively parallel computational capabilities of a high performance GPU system, which is leveraged for its high throughput of numeric capabilities. /content/cudazone/CUDABrowser/assets/images/applications/286_nwmoussa_small.png /content/cudazone/CUDABrowser/assets/images/applications/286_nwmoussa_large.png Academia Stanford University http://folding.stanford.edu/ 2009 01 01 01/01/2009 Nader W. Moussa Paper Science Nader W. Moussa a7979b2b-d04d-4ce5-9395-ba7f6facbd8e GPU accelerated Monte Carlo simulation of the Ising model The compute unified device architecture (CUDA) is a programming approach for performing scientific calculations on a graphics processing unit (GPU) as a data-parallel computing device. First, we apply this new technology to Monte Carlo simulations of the two dimensional ferromagnetic square lattice Ising model. By implementing a variant of the checkerboard algorithm, results are obtained up to 60 times faster on the GPU than on a current CPU core. An implementation of the three dimensional ferromagnetic cubic lattice Ising model on a GPU is able to generate results up to 35 times faster than on a current CPU core. 
As proof of concept we calculate the critical temperature of the 2D and 3D Ising model using finite size scaling techniques. Theoretical results for the 2D Ising model and previous simulation results for the 3D Ising model can be reproduced. /content/cudazone/CUDABrowser/assets/images/applications/285_montecarlo_small.png /content/cudazone/CUDABrowser/assets/images/applications/285_montecarlo_large.png Academia Johannes Gutenberg University Mainz 2009 4 30 4/30/2009 60 Open source Tobias Preis Multimedia Paper Code Numerics Tobias Preis 827b03f6-1371-40f2-b2a2-406e56f22825 A GPU interval library based on Boost interval Interval arithmetic is widely used in numerical algorithms requiring reliability. Ray tracing of implicit surfaces is one of the applications that use interval arithmetic to increase the quality of the produced image. However, these applications are computationally demanding. One solution is to use a graphics processing unit (GPU) in order to take advantage of its computational power. We describe in this paper a GPU implementation of interval operators based on the Boost library. We tested these operators on a ray tracing algorithm and observed execution speed improvements of several orders of magnitude over the CPU version with the same image quality. /content/cudazone/CUDABrowser/assets/images/applications/284_interval_small.png /content/cudazone/CUDABrowser/assets/images/applications/284_interval_large.png 2008 03 13 03/13/2008 300 Sylvain Collange Jorge Florez David Defour Paper Libraries Sylvain Collange, Jorge Florez, David Defour a257d123-ba1c-4b24-99ae-3cf4c49b4f28 Implementation of float-float operators on graphics hardware The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations.
However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results. /content/cudazone/CUDABrowser/assets/images/applications/283_floats_small.png /content/cudazone/CUDABrowser/assets/images/applications/283_floats_large.png Academia Dali, LP2A, Universite de Perpignan 2006 03 29 03/29/2006 Guillaume Da Graca David Defour Paper Programming Tools Guillaume Da Graca, David Defour cd2c96da-0961-4644-af98-390aa7bbdb59 Power Consumption of GPUs from a Software Perspective GPUs are now considered serious challengers for high performance computing solutions. They have power consumptions of up to 300 W. This may lead to power supply and thermal dissipation problems in computing centers. In this article we investigate, using measurements, how and where modern GPUs are using energy during various computations in a CUDA environment. /content/cudazone/CUDABrowser/assets/images/applications/282_pcg_small.png /content/cudazone/CUDABrowser/assets/images/applications/282_pcg_large.png Academia ELIAUS, Univ. de Perpignan http://www.univperp.fr 2009 02 12 02/12/2009 Sylvain Collange David Defour Arnaud Tisserand Paper Programming Tools Sylvain Collange, David Defour, Arnaud Tisserand 2ad66f68-4139-4334-b027-f3a282d242a2 Barra, a Modular Functional GPU Simulator for GPGPU The use of GPUs for general-purpose applications promises huge performance returns for a small investment. However, the internal design of such processors is undocumented and many details are unknown, preventing developers from optimizing their code for these architectures. One solution is to use functional simulation to determine program behavior and gather statistics when counters are missing or unavailable.
In this article we present a GPU functional simulator targeting GPGPU based on the UNISIM framework which takes an NVIDIA cubin file as input. /content/cudazone/CUDABrowser/assets/images/applications/281_bmfg_small.png /content/cudazone/CUDABrowser/assets/images/applications/281_bmfg_large.png Academia ELIAUS, Univ. de Perpignan http://www.univperp.fr 2009 02 09 02/09/2009 Sylvain Collange David Defour David Parello Paper Programming Tools Sylvain Collange, David Defour, David Parello 2234c230-375e-11de-8a39-0800200c9a66 Stochastic Differential Equations with CUDA Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them by two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675x compared to a standard CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively.
/content/cudazone/CUDABrowser/assets/images/applications/280_cudasde3_small.png /content/cudazone/CUDABrowser/assets/images/applications/280_cudasde3_large.png Academia University of Silesia http://www.us.edu.pl 2009 03 23 03/23/2009 675 Open source Michal Januszewski Marcin Kostur Code Paper Numerics Science SDE, stochastic, simulation, Langevin, Michal Januszewski, Marcin Kostur 1b536fc0-375e-11de-8a39-0800200c9a66 Efficient Sparse Matrix-Vector Multiplication on CUDA In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU. /content/cudazone/CUDABrowser/assets/images/applications/279_esmv_small.png /content/cudazone/CUDABrowser/assets/images/applications/279_esmv_large.png Commercial NVIDIA http://www.nvidia.com 2008 12 08 12/08/2008 Nathan Bell Code Paper Numerics Libraries Sparse Matrix, SpMV, iterative, Nathan Bell 11fb7800-375e-11de-8a39-0800200c9a66 Solving Kinetic Equations on GPUs I: Model Kinetic Equations We present an algorithm specifically tailored for solving kinetic equations on GPUs. The efficiency of the algorithm is demonstrated by solving the one-dimensional shock wave structure problem and a two-dimensional low Mach number driven cavity flow. Computational results show that it is possible to cut down the computing time of the sequential codes by two orders of magnitude. The algorithm can easily be extended to three-dimensional flows and more general collision models.
/content/cudazone/CUDABrowser/assets/images/applications/278_ske_small.png /content/cudazone/CUDABrowser/assets/images/applications/278_ske_large.png Academia Politecnico di Milano / Dipartimento di Matematica http://www.mate.polimi.it/ 2009 03 24 03/24/2009 500 Aldo Frezzotti Gian Pietro Ghiroldi Livio Gibelli Paper Computational Fluid Dynamics Numerics Science Rarefied Gas Dynamics Rarefied Gas Dynamics, Boltzmann Equation, Semi-regular method of solution, Aldo Frezzotti, Gian Pietro Ghiroldi, Livio Gibelli fa944db0-35cb-11de-8a39-0800200c9a66 Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem. The target architecture is a current general purpose multi-core processor, where parallelism is extracted using a tuned multi-threaded implementation of BLAS. Also, in response to the advances of hardware accelerators, we modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Our results on a system with two Intel QuadCore processors and a Tesla C1060 GPU illustrate the performance and scalability delivered by these architectures.
/content/cudazone/CUDABrowser/assets/images/applications/277_hpf_small.png /content/cudazone/CUDABrowser/assets/images/applications/277_hpf_large.png Research Aachen University, University Jaume I, ETH Zurich www.hpca.uji.es 2009 04 02 04/02/2009 12 Paolo Bientinesi Francisco Igual Daniel Kressner Enrique Quintana-Orti Paper Numerics Eigenvalues, CUDA, Tesla, Paolo Bientinesi, Francisco Igual, Daniel Kressner, Enrique Quintana-Orti f4522170-35cb-11de-8a39-0800200c9a66 Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 x 100,000 symmetric positive definite linear system in less than one hour. Thus, for problems that used to be considered large, it is not necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution to be computed on a fast multithreaded architecture like a desktop computer equipped with a GPU. This paper provides evidence in support of these claims.
/content/cudazone/CUDABrowser/assets/images/applications/276_ooc_small.png /content/cudazone/CUDABrowser/assets/images/applications/276_ooc_large.png Academia Universitat Jaume I (Castellon, Spain) 2009 04 02 04/02/2009 Mercedes Marques Gregorio Quintana-Orti Enrique Quintana-Orti Robert van de Geijn Paper Numerics Mercedes Marques, Gregorio Quintana-Orti, Enrique Quintana-Orti, Robert van de Geijn ed20da40-35cb-11de-8a39-0800200c9a66 Multidimensional Decomposition for Nuclear Magnetic Resonance Recently, multilinear decomposition was established as a new robust method for processing multidimensional Nuclear Magnetic Resonance data, and these results were published in Nature. In this application, problem sizes are so large that solutions need several days on modern workstations. The algorithm is based on a sparse implementation of the parallel factor decomposition algorithm (PARAFAC), which performs sparsely defined alternating least squares minimization. The algorithm can solve several small and medium-sized problems simultaneously using the full power of several GPU multiprocessors, or one huge (rank over 1000) sparse PARAFAC task. It reduces NMR data processing time by a factor of 20-30, allowing it to run on a modern PC supercomputer from NVIDIA instead of a medium-sized Linux cluster. /content/cudazone/CUDABrowser/assets/images/applications/275_mdd-nmr_small.png /content/cudazone/CUDABrowser/assets/images/applications/275_mdd-nmr_large.png Commercial Commercial Elegant Mathematics Ltd. http://www.elegant-mathematics.com/ 2009 04 02 04/02/2009 40 Ilgis Ibragimov Code Numerics Libraries Science Signal Processing Multilinear Multidimensional Tensor Singular Value Decomposition, Ilgis Ibragimov e17a1080-35cb-11de-8a39-0800200c9a66 Real-Time Fiber Tracking Fiber tracking is a technique based on diffusion tensor magnetic resonance imaging (DT-MRI) that allows a neurosurgeon to visualize the neuronal fibers in the brain of a patient.
By using CUDA, our fiber tracking tool is now much more interactive. /content/cudazone/CUDABrowser/assets/images/applications/274_fiber_small.png /content/cudazone/CUDABrowser/assets/images/applications/274_fiber_large.png Academia The Cyclops Group - Laboratory for Image Processing and Computer Graphics http://www.lapix.ufsc.br/ 2009 03 20 04/20/2009 Adiel Mittmann Multimedia Paper Medical Imaging fiber tracking, dt-mri, diffusion tensor imaging, Adiel Mittmann dc080580-35cb-11de-8a39-0800200c9a66 Thrust Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity. Develop high-performance applications rapidly with Thrust! /content/cudazone/CUDABrowser/assets/images/applications/273_thrust_small.png /content/cudazone/CUDABrowser/assets/images/applications/273_thrust_large.png Commercial NVIDIA http://www.nvidia.com 2009 04 06 04/06/2009 Nathan Bell Code Libraries STL CUDA Templates C++ HighLevel, Nathan Bell d6b394a0-35cb-11de-8a39-0800200c9a66 2D FDTD Wave Propagation The finite difference time domain (FDTD) solution of the scalar wave equation over a two-dimensional space discretized by a 2D grid of uniform grid cells. /content/cudazone/CUDABrowser/assets/images/applications/272_fdtd_small.png /content/cudazone/CUDABrowser/assets/images/applications/272_fdtd_large.png Academia University of Stuttgart 2008 02 06 02/06/2008 50 Open source Ana Balevic Code Presentation Science Computational electromagnetics, FDTD, Wave propagation, Ana Balevic ccb37600-35cb-11de-8a39-0800200c9a66 Fast and Scalable List Ranking on the GPU In this paper, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on such massively multi-threaded hardware. We first present an implementation of Wyllie's algorithm based on pointer jumping.
This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized Recursive Helman-JaJa (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multi-threaded architectures like that of the GPU. Regular or coalesced memory access patterns and balanced load are critical to achieving good performance on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/271_fastscaleable_small.png /content/cudazone/CUDABrowser/assets/images/applications/271_fastscaleable_large.png Academia International Institute of Information Technology, Hyderabad http://www.iiit.ac.in 2009 04 16 04/16/2009 15 Suhail Rehman Paper Numerics Algorithms list ranking, GPGPU, irregular algorithm, Suhail Rehman ab2b9de0-35c7-11de-8a39-0800200c9a66 Statistical phylogenetics Many-core algorithms for statistical phylogenetics /content/cudazone/CUDABrowser/assets/images/applications/270_statphy_small.png /content/cudazone/CUDABrowser/assets/images/applications/270_statphy_large.png Academia UCLA 2009 04 01 04/01/2009 90 Open source Marc Suchard Code Paper Numerics Life Sciences Libraries Marc Suchard a4938970-35c7-11de-8a39-0800200c9a66 GPUmat GPUmat is a Freeware library that enables Matlab code to run on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/269_gpumat_small.png /content/cudazone/CUDABrowser/assets/images/applications/269_gpumat_large.png Commercial The GP-you Group http://gp-you.org/ 2009 04 28 04/28/2009 The GP-you Group Application Numerics Matlab 95ffb1e0-35c7-11de-8a39-0800200c9a66 GPU accelerated Poisson Boltzmann calculations and their comparison to the ASIC MDGRAPE-3 For proper functionality biomolecules need to be in specific environments (water, biomembranes, specialized tissues, etc). The Poisson Boltzmann (PB) approach can be used to account for this environmental effect in simulation studies of biomolecular matter. Here we study the suitability of the GPU (NVIDIA GTX 280) to accelerate PB computations within an enhanced Boundary Element Method (BEM). We compare to a general purpose CPU (INTEL Quad-Core Xeon E5430) and a specifically designed chip (the ASIC MDGRAPE-3). Both specialized devices, i.e. the GPU as well as the ASIC, offer comparable compute performance, with a theoretical speedup of approximately 39x within the current implementation.
/content/cudazone/CUDABrowser/assets/images/applications/268_pbbem_small.png /content/cudazone/CUDABrowser/assets/images/applications/268_pbbem_large.png Academia Keio University; University of Electro-Communications; RIKEN Advanced Science Institute; University of Bologna; Michigan Tech; 2009 02 04 02/04/2009 39 Tetsu Narumi Kenji Yasuoka Makoto Taiji Siegfried Hoefinger Paper Life Sciences Computational Biophysical Chemistry solvation, Poisson Boltzmann, implicit solvation models, GPU, ASIC, MDGRAPE, Tetsu Narumi, Kenji Yasuoka, Makoto Taiji, Siegfried Hoefinger b4f27b00-1731-11de-8c30-0800200c9a66 Tool for Generalized Harmonics Analysis An implementation of Generalized Harmonics Analysis with CUDA /content/cudazone/CUDABrowser/assets/images/applications/267_harmonics_small.png /content/cudazone/CUDABrowser/assets/images/applications/267_harmonics_large.png Research Nishihara's Laboratory in Tokyo Institute of Technology http://www.nh.cradle.titech.ac.jp/ 2009 03 19 03/10/2009 420 Hisayori Noda Application Numerics Signal Processing Hisayori Noda ad352070-1731-11de-8c30-0800200c9a66 vReveal vReveal is MotionDSP's video enhancement software for Windows. It can dramatically improve the quality of flawed videos with just one click. The unrivaled enhancement technology powering vReveal works wonders with videos that are shaky, dark, noisy, pixelated, or blurry. With vReveal, you can play video files, preview enhancements to videos, and then save enhanced videos to disk or upload them directly to YouTube. Plus, vReveal has been specially tuned to run up to five times faster on CUDA-enabled NVIDIA graphics processors. That means you can process video enhancements in less time and have your CPU available for normal everyday tasks like emailing and internet browsing. 
/content/cudazone/CUDABrowser/assets/images/applications/266_vraveal_new_small.png /content/cudazone/CUDABrowser/assets/images/applications/266_vraveal_new_large.png Commercial MotionDSP http://www.vreveal.com 2009 03 24 03/24/2009 5 Commercial MotionDSP Application Digital Content Creation Imaging Signal Processing Video & Audio vReveal, video enhancement, GPU, super-resolution, MotionDSP a653ce00-1731-11de-8c30-0800200c9a66 Discontinuous Galerkin Methods DG methods yield high-order accurate PDE (partial differential equation) solvers on unstructured meshes. The special structure of the DG operator is shown to be well-suited to the CUDA parallel computation model, allowing net application speeds exceeding 200 GFlops/s. /content/cudazone/CUDABrowser/assets/images/applications/265_galerkin_methods_small.png /content/cudazone/CUDABrowser/assets/images/applications/265_galerkin_methods_large.png Academia Brown/Rice Collaboration http://www.dam.brown.edu/scicomp 2008 12 18 12/18/2008 50 Open Source A Kloeckner T Warburton J Bridge JS Hesthaven Paper Multimedia Computational Fluid Dynamics Numerics Science A Kloeckner, T Warburton, J Bridge, JS Hesthaven 8843e330-0f59-11de-8c30-0800200c9a66 High-Speed Single-Database PIR Implementation In this HotPETs session we would like to present an implementation of a single-database Private Information Retrieval (PIR) scheme that can process a database at 2 Gbits/s using a commodity Graphics Processing Unit (GPU). http://www.petsymposium.org/2008/hotpets/PIR.pdf /content/cudazone/CUDABrowser/assets/images/applications/258_PIR_small.png /content/cudazone/CUDABrowser/assets/images/applications/258_PIR_large.png Academia University of Limoges http://www.unilim.fr 2008 01 01 01/01/2008 Carlos Aquilar Paper Carlos Aquilar 850c10c0-0f59-11de-8c30-0800200c9a66 Lattice QCD on modern graphics cards Presentation on using CUDA-enabled GPUs to solve the Dirac-Wilson equation on the lattice.
/content/cudazone/CUDABrowser/assets/images/applications/257_lattice_small.png /content/cudazone/CUDABrowser/assets/images/applications/257_lattice_large.png Academia University of Wuppertal & University Budapest 2007 11 09 11/09/2007 10 Gyozo Egri Paper Gyozo Egri 81e30b60-0f59-11de-8c30-0800200c9a66 Fine-grained Parallelization of Lattice QCD Kernel Routine on GPU This paper briefly describes an experience in parallelizing Hopping_Matrix, a kernel function responsible for computing the action of the Dirac operator. This routine accounts for most of the execution time of a simulation of the classical problem of Lattice Quantum Chromodynamics (Lattice QCD). /content/cudazone/CUDABrowser/assets/images/applications/263_fine_grained_small.png /content/cudazone/CUDABrowser/assets/images/applications/263_fine_grained_large.png Academia INRIA http://irisa.fr/home_html-en?set_language=en 2008 01 01 01/01/2008 9 Khaled Z. Ibrahim F. Bodin Olivier Pene Paper Numerics Khaled Z. Ibrahim, F. Bodin, Olivier Pene E72d2b0d0-0f59-11de-8c30-0800200c9a66 Blasting through lattice calculations using CUDA Modern graphics hardware is designed for highly parallel numerical tasks and provides significant cost and performance benefits. Graphics hardware vendors are now making available development tools to support general purpose high performance computing. NVIDIA's CUDA platform, in particular, offers direct access to graphics hardware through a programming language similar to C. Using the CUDA platform we have implemented a Wilson-Dirac operator which runs at an effective 68 Gflops on the Tesla C870. The recently released GeForce GTX 280 runs this same code at 92 Gflops, and we expect further improvement pending code optimization.
/content/cudazone/CUDABrowser/assets/images/applications/256_bu_small.png /content/cudazone/CUDABrowser/assets/images/applications/256_bu_large.png Academia Boston University http://arxiv.org/PS_cache/arxiv/pdf/0810/0810.5365v1.pdf 2008 10 29 10/29/2008 Kipton Barros Paper Numerics Kipton Barros 3edb5b60-1378-11de-8c30-0800200c9a66 Accelerating Leukocyte Tracking using CUDA The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application, detection and tracking of white blood cells in video microscopy, can be accelerated by 200x using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU. /content/cudazone/CUDABrowser/assets/images/applications/255_leukocyte_small.png /content/cudazone/CUDABrowser/assets/images/applications/255_leukocyte_large.png Academia Computer Science and Electrical and Computer Engineering University of Virginia, Charlottesville http://www.cs.virginia.edu/ 2009 05 01 05/01/2009 29 Michael Boyer David Tarjan Scott T. Acton Kevin Skadron Paper Life Sciences Michael Boyer, David Tarjan, Scott T. Acton, Kevin Skadron 5f3db880-0f59-11de-8c30-0800200c9a66 Cloud Services for GPU Computing Hoopoe is a cloud solution and infrastructure for organizations that allows the use of GPU hardware for computationally intensive tasks while running on existing systems. 
http://www.gass-ltd.co.il/hoopoe/Features.aspx /content/cudazone/CUDABrowser/assets/images/applications/254_hoopoe_small.png /content/cudazone/CUDABrowser/assets/images/applications/254_hoopoe_large.png Research Hoopoe http://www.gass-ltd.co.il/hoopoe/Features.aspx#instance_types 2008 12 01 2008 2008 Hoopoe Paper Science Hoopoe 5a60c190-0f59-11de-8c30-0800200c9a66 Automatic Parallelization for Graphics Processing Units in JikesRVM Accelerated graphics cards, including specialized high-performance processors called Graphics Processing Units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in terms of raw performance. However, GPUs are currently used only for a narrow class of special-purpose applications; the raw processing power available in a typical desktop PC is unused most of the time. http://uwspace.uwaterloo.ca/bitstream/10012/3752/1/thesis.pdf /content/cudazone/CUDABrowser/assets/images/applications/253_automatic_parallelization_small.png /content/cudazone/CUDABrowser/assets/images/applications/253_automatic_parallelization_large.png Academia University of Waterloo http://www.lib.uwaterloo.ca/ 2008 01 01 01/01/2008 13 Alan Chun-Wai Leung Paper Alan Chun-Wai Leung 56788ae0-0f59-11de-8c30-0800200c9a66 Seeing 3D: issues, algorithms and applications The acquisition of 3D data is becoming very easy. With the recent technological advances, we expect 3D acquisition hardware and 3D modeling software to become as popular as their traditional 2D counterparts. This, on one hand, opens new perspectives for solving computer vision problems related to scene analysis and understanding. On the other hand, it raises new challenges in the storage, analysis, and retrieval of large amounts of 3D data. 
/content/cudazone/CUDABrowser/assets/images/applications/252_seeing_3d_small.png /content/cudazone/CUDABrowser/assets/images/applications/252_seeing_3d_large.png Academia Tokyo Institute of Technology, Japan http://www.titech.ac.jp 2008 01 01 01/01/2008 Hamid Laga Paper Signal Processing Web search, query processing, GPU, Hamid Laga 53b10670-0f59-11de-8c30-0800200c9a66 Using Graphics Processors for High-Performance IR Query Processing Web search engines are facing formidable performance challenges as they need to process thousands of queries per second over billions of documents. To deal with this heavy workload, current engines use massively parallel architectures of thousands of machines that require large hardware investments. We investigate new ways to build such high-performance IR systems based on Graphical Processing Units (GPUs). http://www2008.org/papers/pdf/p1213-ding.pdf /content/cudazone/CUDABrowser/assets/images/applications/251_ir_query_small.png /content/cudazone/CUDABrowser/assets/images/applications/251_ir_query_large.png Academia CIS Department, Polytechnic University http://www.poly.edu/cse/ 2008 04 25 04/25/2008 3 Shuai Ding Jinru He Hao Yan Torsten Suel Paper Shuai Ding, Jinru He, Hao Yan, Torsten Suel 4f6d6950-0f59-11de-8c30-0800200c9a66 FLOCKING-BASED DOCUMENT CLUSTERING ON THE GRAPHICS PROCESSING UNIT Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity O(n^2). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. 
However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has experienced improved performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GEFORCE 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3,000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times improvement of the GPU over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms. /content/cudazone/CUDABrowser/assets/images/applications/250_flocking_small.png /content/cudazone/CUDABrowser/assets/images/applications/250_flocking_large.png Science U.S. Department of Energy Journal of Undergraduate Research http://www.scied.science.doe.gov 2007 12 01 12/01/2007 5 Jesse St. Charles Robert M. Patton Thomas E. Potok Xiaohui Cui Paper Jesse St. Charles, Robert M. Patton, Thomas E. 
Potok, Xiaohui Cui 4808dd20-0f59-11de-8c30-0800200c9a66 Real Time 3D Fluid and Particle Simulation and Rendering This application uses a GPU based 3D fluid solver with a moving domain to seamlessly advect hundreds of thousands of particles. We use a globally second-order accurate fluid simulation method which takes constant calculation time, leading to a highly turbulent but stable simulation suitable for use in a real-time situation. In this demo our solver is used to simulate the turbulent wake behind a car, and we render smoke transported through that wake with a hardware-based volume renderer. The simulation domain is at a fixed position relative to the car, and we exploit Galilean invariance to transform the moving simulation domain into a motionless domain with inflow and outflow boundary conditions. Particles are advected through this fluid, but seamlessly transition to simple Newtonian dynamics as they leave the simulation domain. After simulation, the particles are sorted using a GPU-accelerated radix sort, and then rendered as alpha blended sprites with motion blur and volumetric shadows. /content/cudazone/CUDABrowser/assets/images/applications/249_GDCSmokeDemo_6_small.png /content/cudazone/CUDABrowser/assets/images/applications/249_GDCSmokeDemo_6_large.png Commercial NVIDIA http://www.nvidia.com 2009 03 17 03/17/2009 40 Jonathan Cohen Sarah Tariq Simon Green Multimedia Computational Fluid Dynamics Computational Fluid Dynamics, CFD, Navier Stokes, Particle simulation, Jonathan Cohen, Sarah Tariq, Simon Green 63708100-f945-11dd-87af-0800200c9a66 8850 Roll Microfilm Scanstation Image Processing The application processes an 85 Mbyte/s stream of greyscale data as it is captured by the linear CCD array camera of the scanner from the source microfilm. An NVIDIA GPU running a CUDA application then processes the data stream in real time. 
The image frames (typically newspaper pages, medical records, land records) are enhanced and an additional bitonal data stream created. Image enhancement includes correction for the microfilm density, background artefact removal, character enhancement and noise reduction without the loss of fine detail. The NVIDIA GPU and CUDA application replaces, in our previous Scanstations, a dedicated piece of image processing hardware. The new CUDA solution is faster (more data per second), cheaper and, most importantly, improvements can be made in the software without change to custom hardware. /content/cudazone/CUDABrowser/assets/images/applications/259_8850-1_small.png /content/cudazone/CUDABrowser/assets/images/applications/259_8850-1_large.png Commercial Wicks and Wilson http://www.wwl.co.uk 2008 11 03 11/03/2008 20 Commercial Kevin Keeler Application Multimedia Imaging Kevin Keeler 448d9910-0f59-11de-8c30-0800200c9a66 GPU in power system engineering This work demonstrates how the application of a GPU can yield significant speedup in transient stability simulation of large-scale power systems. /content/cudazone/CUDABrowser/assets/images/applications/247_electricalandcomputerengineering_small.png /content/cudazone/CUDABrowser/assets/images/applications/247_electricalandcomputerengineering_large.png Academia University of Alberta http://www.engineering.ualberta.ca/ece/ 2009 02 25 02/25/2009 340 Vahid Jalili-Marandi Paper Numerics Graphics processors, Parallel programming, Power system transient stability, Vahid Jalili-Marandi 4135bb80-0f59-11de-8c30-0800200c9a66 Graph Cuts via L1 Norm Minimization Graph cuts have become an increasingly important tool for solving a number of energy minimization problems in computer vision and other fields. In this paper, the graph cut problem is reformulated as an unconstrained L1 norm minimization that can be solved effectively using interior point methods. 
This reformulation exposes connections between graph cuts and other related continuous optimization problems. Eventually, the problem is reduced to solving a sequence of sparse linear systems involving the Laplacian of the underlying graph. The proposed procedure exploits the structure of these linear systems in a manner that is easily amenable to parallel implementations. Experimental results obtained by applying the procedure to graphs derived from image processing problems are provided. /content/cudazone/CUDABrowser/assets/images/applications/246_pattern_analysis_small.png /content/cudazone/CUDABrowser/assets/images/applications/246_pattern_analysis_large.png Academia Department of Computer & Information Science, Penn Libraries http://www.library.upenn.edu/ 2008 10 01 October 2008 Arvind Bhusnurmath Camillo J. Taylor Paper Video & Audio Arvind Bhusnurmath, Camillo J. Taylor 3c7462e0-0f59-11de-8c30-0800200c9a66 3D Finite Difference Computation on GPUs using CUDA In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves a throughput of between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running a similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with GPUs by overlapping inter-GPU communication with computation. 
/content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_small.png /content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_large.png Commercial NVIDIA http://www.nvidia.com 2009 03 08 03/08/2009 Paulius Micikevicius Paper Oil & Gas Paulius Micikevicius 386bd2f0-0f59-11de-8c30-0800200c9a66 Efficient Parallelization of Stochastic Simulation Algorithm for Chemically Reacting Systems on the Graphics Processing Unit In this paper, we will show how Single Instruction Multiple Data (SIMD) computation can be implemented on a CUDA-enabled GPU, the NVIDIA GeForce 8800GTX, to efficiently perform ensemble runs of SSA simulations for chemically reacting systems. /content/cudazone/CUDABrowser/assets/images/applications/262_efficient_small.png /content/cudazone/CUDABrowser/assets/images/applications/262_efficient_large.png Academia Department of Computer Science, University of California, Santa Barbara http://www.cs.ucsb.edu 2008 12 01 12/01/2008 Hong Li Linda Petzold Paper Hong Li, Linda Petzold 33ae4ae0-0f59-11de-8c30-0800200c9a66 A FAST HYBRID TIME-SYNCHRONOUS/EVENT APPROACH TO PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS The trend in computing architectures has been toward multicore central processing units (CPUs) and graphics processing units (GPUs). An affordable and highly parallelizable GPU is a practical example of Single Instruction, Multiple Data (SIMD) architectures oriented toward stream processing. http://www.informs-sim.org/wsc08papers/095.pdf /content/cudazone/CUDABrowser/assets/images/applications/261_fast_hybrid_small.png /content/cudazone/CUDABrowser/assets/images/applications/261_fast_hybrid_large.png Academia Department of Computer and Information Science and Engineering University of Florida http://www.cise.ufl.edu/ 2008 08 06 08/06/2008 2 Hyungwook Park Paul A. Fishwick Paper Hyungwook Park, Paul A. 
Fishwick 2f7757f0-0f59-11de-8c30-0800200c9a66 Singular Value Decomposition on GPU using CUDA Linear algebra algorithms are fundamental to many computing applications. Modern GPUs are suited for many general purpose processing tasks and have emerged as inexpensive high performance co-processors due to their tremendous computing power. In this paper, we present the implementation of singular value decomposition (SVD) of a dense matrix on GPU using the CUDA programming model. SVD is implemented using the twin steps of bidiagonalization followed by diagonalization. It has not been implemented on the GPU before. Bidiagonalization is implemented using a series of Householder transformations which map well to BLAS operations. Diagonalization is performed by applying the implicitly shifted QR algorithm. Our complete SVD implementation significantly outperforms the MATLAB and Intel Math Kernel Library (MKL) LAPACK implementations on the CPU. We show a speedup of up to 60 over the MATLAB implementation and up to 8 over the Intel MKL implementation on an Intel Dual Core 2.66GHz PC on NVIDIA GTX 280 for large matrices. We also give results for very large matrices on NVIDIA Tesla S1070. /content/cudazone/CUDABrowser/assets/images/applications/245_auevt_small.png /content/cudazone/CUDABrowser/assets/images/applications/245_auevt_large.png Research IIIT Hyderabad http://www.iiit.net/ 2008 12 15 12/15/2008 60 Sheetal Lahabar Paper Numerics Sheetal Lahabar 6e9f39d0-04a4-11de-8c30-0800200c9a66 Sparse Multifrontal Performance Gains via NVIDIA GPU Discussion of Access Analytics International's BCSLIB-EXT, an advanced analytic engine for use in finite element analysis, optimization, and data analysis tool suites. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009. 
/content/cudazone/CUDABrowser/assets/images/applications/244_sparse_small.png /content/cudazone/CUDABrowser/assets/images/applications/244_sparse_large.png Commercial Access Analytics International http://www.aanalytics.com 2009 01 16 01/16/2009 7 Dan'l Pierce Presentation Numerics Dan'l Pierce 2bcbdc20-0f59-11de-8c30-0800200c9a66 AN APPROACH FOR THE EFFECTIVE UTILIZATION OF GP-GPUS IN PARALLEL COMBINED SIMULATION A major challenge in the field of Modeling & Simulation is providing efficient parallel computation for a variety of algorithms. Algorithms that are described easily and computed efficiently for continuous simulation, may be complex to describe and/or efficiently execute in a discrete event context, and vice-versa. Real-world models often employ multiple algorithms that are optimally defined in one approach or the other. Parallel combined simulation addresses this problem by allowing models to define algorithmic components across multiple paradigms. In this paper, we illustrate the performance of parallel combined simulation, where the continuous component is executed across multiple graphical processing units (GPU) and the discrete event component is executed across multiple central processing units (CPU). /content/cudazone/CUDABrowser/assets/images/applications/260_approach_small.png /content/cudazone/CUDABrowser/assets/images/applications/260_approach_large.png Commercial The MITRE Corporation http://www.mitre.org 2008 01 01 2008 539 David W. Bauer Jr. Matthew McMahon Ernest H. Page Paper Imaging Simulation David W. Bauer Jr., Matthew McMahon, Ernest H. Page 156a1ca0-0e03-11de-8c30-0800200c9a66 Particle Swarm Optimization on GPU Review of Particle Swarm Optimization (PSO) on the GPU. PSO is a population-based stochastic optimization technique inspired by social behavior of bird flocking or fish schooling. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009. 
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_small.png /content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_large.png Academia Department of Mathematics National Taiwan University http://www.math.ntu.edu.tw/newenglish 2009 01 16 01/16/2009 270 Weichung Wang Presentation Life Sciences Weichung Wang 0f4626c0-0e03-11de-8c30-0800200c9a66 JaCuda This project aims to help you run applications on CUDA using Java, Python, or Groovy. Basically, we provide a couple of functions that are executed on the GPU or, if you don't have a GPU, can always be run in Java/Python mode, depending on which framework you prefer. /content/cudazone/CUDABrowser/assets/images/applications/242_jacuda_small.png /content/cudazone/CUDABrowser/assets/images/applications/242_jacuda_large.png Research 2008 06 07 06/07/2008 LGPL gert wohlgemuth Application Programming Tools gert wohlgemuth 612c3b90-04a4-11de-8c30-0800200c9a66 JCufft JCufft provides Java bindings for the NVIDIA CUDA FFT implementation. /content/cudazone/CUDABrowser/assets/images/applications/241_jcufft_small.png /content/cudazone/CUDABrowser/assets/images/applications/241_jcufft_large.png Research JavaGL http://javagl.de/jcuda/jcublas/JCublas.html 2008 12 31 12/31/2008 javagl@javagl.de Application Paper Programming Tools javagl@javagl.de 5b705650-04a4-11de-8c30-0800200c9a66 JCublas JCublas provides Java bindings for the NVIDIA CUDA BLAS implementation, thus making the parallel processing power of modern graphics hardware available for Java programs. 
/content/cudazone/CUDABrowser/assets/images/applications/240_jcublas_small.png /content/cudazone/CUDABrowser/assets/images/applications/240_jcublas_large.png Research JavaGL http://javagl.de/jcuda/jcublas/JCublas.html 2008 12 31 12/31/2008 javagl@javagl.de Application Paper Programming Tools javagl@javagl.de 5359b830-04a4-11de-8c30-0800200c9a66 FLAGON: Fortran-9X Library for GPU Numerics FLAGON is an open source library/middleware for using GPUs from Fortran-9X, without necessarily knowing too much C or CUDA. It provides a Fortran Module (similar to a class in C++) that provides variables that are pointers to device variables on the GPU. Several supporting functions are available for data transfer and for manipulating variables on the device, as are simple interfaces to the CUBLAS and CUFFT libraries. Additional functionality is provided by functions that are written in CU, and can be called. Several such functions are available. The CUDPP library is also incorporated. Flagon has been used to develop relatively large pieces of scientific computing code on the GPU (Fast Multipole Methods on the GPU, plasma turbulence simulations), under both Windows and Linux. /content/cudazone/CUDABrowser/assets/images/applications/239_flagon_small.png /content/cudazone/CUDABrowser/assets/images/applications/239_flagon_large.png Research Fantalgo, LLC http://www.fantalgo.com/ 2008 08 01 08/01/2008 Yuancheng Luo Nail A. Gumerov Kate Despain Bill Dorland Ramani Duraiswami Application Imaging Scientific Physics Geoscience Yuancheng Luo, Nail A. Gumerov, Kate Despain, Bill Dorland, Ramani Duraiswami 1c6dd6c0-0f59-11de-8c30-0800200c9a66 Manufacturing Computations Lab Agent-Based Models ABMs are used to model dynamic systems such as stock markets, societies, and complex biological systems that are difficult to model analytically using partial differential equations. This is particularly the case where the system consists of autonomous individuals who are "intelligent" and evolve over time. 
By intelligent, we mean that the individuals can independently act based on their goals, the environment and other individuals. The properties of the system as a whole emerge from micro-scale interactions between individuals and the environment. In the recent past, there has been an explosion in ABM research in disciplines such as economics, sociology, ecology, epidemiology, and computational biology. /content/cudazone/CUDABrowser/assets/images/applications/239_manufacturing_small.png /content/cudazone/CUDABrowser/assets/images/applications/239_manufacturing_large.png Academia Dept. of Mechanical Engineering-Engineering Mechanics Michigan Tech. University http://www.me.mtu.edu/ 2008 12 1 12/01/2008 Michigan Tech. University Paper Multimedia Numerics Life Sciences ABM Simulation, Graphics Processing Units (GPU), GPGPU, Complex Systems, Data-Parallel Algorithms, Michigan Tech. University 029579b0-0f59-11de-8c30-0800200c9a66 GPU Accelerated Radio Astronomy Signal Convolution The increasing array size of radio astronomy interferometers is causing the associated computation to scale quadratically with the number of array signals. Consequently, efficient usage of alternate processing architectures should be explored in order to meet this computational challenge. Affordable parallel processors have been made available to the general scientific community in the form of the commodity graphics card. This work investigates the use of the Graphics Processing Unit (GPU) in the parallelisation of the combined conjugate multiply and accumulation stage of a correlator for a radio astronomy array. Using NVIDIA's Compute Unified Device Architecture, our testing shows processing speeds from one to two orders of magnitude faster than a Central Processing Unit (CPU) approach. 
/content/cudazone/CUDABrowser/assets/images/applications/238_gpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/238_gpu_large.png Academia The University of Western Australia http://www.uwa.edu.au/ 2008 07 31 07/31/2008 Chris Harris Karen Haines Lister Staveley-Smith Paper Signal Processing Chris Harris, Karen Haines, Lister Staveley-Smith 38e24440-0ed1-11de-8c30-0800200c9a66 GPU Implementation of Belief Propagation Using CUDA for Cloud Tracking and Reconstruction This paper describes an efficient CUDA-based GPU implementation of the belief propagation algorithm that can be used to speed up stereo image processing and motion tracking calculations without loss of accuracy. Preliminary results in using belief propagation to analyze satellite images of Hurricane Luis for real-time cloud structure and tracking are promising, with speedups of nearly a factor of five. /content/cudazone/CUDABrowser/assets/images/applications/237_clouds_small.png /content/cudazone/CUDABrowser/assets/images/applications/237_clouds_large.png Academia Department of Computer and Information Sciences University of Delaware http://www.cis.udel.edu/ 2009 02 13 02/13/2009 5 Scott Grauer-Gray Chandra Kambhamettu Kannappan Palaniappan Paper Science Scott Grauer-Gray, Chandra Kambhamettu, Kannappan Palaniappan f99265b0-0ec9-11de-8c30-0800200c9a66 LOW-COST REAL-TIME SAR SIMULATION FOR APPLICATIONS IN MISSION PLANNING, EDUCATION AND INFORMATION EXTRACTION SAR simulators are important for a huge variety of applications. Realistic SAR simulations need realistic 3D models, which are often not available. Less realistic models can be used in the less accurate real-time simulation approach. Using modern graphics cards for SAR simulation, even complex environments can be simulated in real-time. This is realised by implementing SAR geometry and radiometry within standard graphics hardware, which offers 3D hardware acceleration and programmable graphics processing units (GPU). 
http://www.isprs2007ist.itu.edu.tr/21.pdf /content/cudazone/CUDABrowser/assets/images/applications/236_sar_small.png /content/cudazone/CUDABrowser/assets/images/applications/236_sar_large.png Academia Institute for Photogrammetry, Universitat Stuttgart, Germany http://www.ifp.uni-stuttgart.de/forschung/photo/georef-Dateien/georef.en.html 2007 05 08 05/08/2007 Timo Balz Paper Imaging Graphics SAR, Radar, Real-Time, Simulation, Timo Balz f43e42f0-0ec9-11de-8c30-0800200c9a66 GPU Accelerated Acoustic Likelihood Computations This paper introduces the use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system. In addition to their high availability, GPUs provide high computing performance at low cost. We have used an NVIDIA GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor. http://www.crim.ca/Publications/2008/documents/plein_texte/PAR_CarPals_Interspeech2008.pdf /content/cudazone/CUDABrowser/assets/images/applications/235_likelihood_computations_small.png /content/cudazone/CUDABrowser/assets/images/applications/235_likelihood_computations_large.png Academia Centre de transfert de technologies et de connaissances (CRIM) http://www.crim.ca/fr/ 2008 12 01 12/01/2008 5 Patrick Cardinal Pierre Dumouchel Gilles Boulianne Michel Comeau Paper Signal Processing Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne, Michel Comeau e4315ff0-0ec9-11de-8c30-0800200c9a66 Ultrasound goes GPU: real-time simulation using CUDA Despite the increasing adoption of other imaging modalities, ultrasound guidance is widely used for surgical procedures and clinical imaging due to its low cost, non-invasiveness, and real-time visual feedback. Many ultrasound-guided procedures require extensive training, and where possible, training on simulations should be preferred over training on patients. 
Computational resources for existing approaches to ultrasound simulation are usually limited by real-time requirements. Unlike previous approaches we simulate freehand ultrasound images from CT data on the Graphics Processing Unit (GPU). http://campar.in.tum.de/pub/reichl2009spie/reichl2009spie.pdf /content/cudazone/CUDABrowser/assets/images/applications/234_ultrasound_small.png /content/cudazone/CUDABrowser/assets/images/applications/234_ultrasound_large.png Research Computer-Aided Medical Procedures (CAMP), TUM, Munich, Germany http://wwwnavab.in.tum.de/WebHome 2009 02 01 02/01/2009 Tobias Reichl Josh Passenger Oscar Acosta Olivier Salvado Paper Life Sciences Tobias Reichl, Josh Passenger, Oscar Acosta, Olivier Salvado f1c168e0-0e15-11de-8c30-0800200c9a66 GPU FX Spectrometer using CUDA The next generation of radio telescopes, such as the Square Kilometer Array and the associated Pathfinder arrays, require vast amounts of computation due to the extremely large number of interferometers and the imaging requirements. The hardware for this computation is becoming a significant consideration in array design, both in terms of initial cost and power consumption. Graphics Processing Units (GPU) provide power efficiency and affordability as well as the flexibility of general purpose hardware. This work implements a GPU-based FX spectrometer, which processes four streams of 8-bit interferometer data, for a variable number of frequency channels. This approach scales well with frequency channels, with computation speeds up to 18 times faster than those of a CPU implementation. Further work is in progress to scale the algorithm with the number of interferometer streams, and to investigate optimisation of the GPU algorithm. 
/content/cudazone/CUDABrowser/assets/images/applications/233_gpu_fx_spectrometer_small.png /content/cudazone/CUDABrowser/assets/images/applications/233_gpu_fx_spectrometer_large.png Research The University of Western Australia http://www.uwa.edu.au/ 2007 12 01 1/12/2007 Chris Harris Paper Signal Processing Imaging Chris Harris ec53f1c0-0e15-11de-8c30-0800200c9a66 IMPLEMENTING ALGORITHMS FOR SIGNAL AND IMAGE RECONSTRUCTION ON GRAPHICAL PROCESSING UNITS Several highly effective algorithms that have been proposed recently for compressed sensing and image processing applications can be implemented efficiently on commodity graphical processing units (GPUs). The properties of algorithms and application that make for efficient GPU implementation are discussed, and computational results for several algorithms are presented that show large speedups over CPU implementations. /content/cudazone/CUDABrowser/assets/images/applications/232_signal_images_small.png /content/cudazone/CUDABrowser/assets/images/applications/232_signal_images_large.png Academia The University of Wisconsin Madison http://www.cs.wisc.edu/ 2008 12 01 12/01/2008 169 SANGKYUN LEE STEPHEN J. WRIGHT Code Paper Numerics Imaging Graphical processing units, compressed sensing, image denoising, image deblurring, SANGKYUN LEE, STEPHEN J. WRIGHT e3fb68a0-0e15-11de-8c30-0800200c9a66 CUDA Fluid Simulation in NVIDIA PhysX The NVIDIA fluid particle demo uses PhysX technology accelerated by the CUDA architecture. Released in October 2008 as part of a GeForce Powerpack. Over 64000 SPH fluid particles pour into the scene, push aside wooden crates, which float up as the fluid level rises. All these particles are simulated real time using accelerated PhysX, each SPH particle moves in the scene as the result of interactions with other particles, rigid body objects and the surrounding environment. 
This demo was recorded using a GeForce 9800GTX+ /content/cudazone/CUDABrowser/assets/images/applications/231_fluidsim_small.png /content/cudazone/CUDABrowser/assets/images/applications/231_fluidsim_large.png Consumer NVIDIA http://www.nvidia.com/cuda 2008 12 1 12/01/2008 10 Mark Harris Paper Computational Fluid Dynamics PhysX, Fluids, SPH, Mark Harris ded49770-0e15-11de-8c30-0800200c9a66 Translating GPU Binaries to Tiered SIMD Architectures with Ocelot Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even simple architectures with low hardware complexity can easily exploit the parallelism in an application. With these applications in mind, this paper presents Ocelot, a binary translation framework designed to allow architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs. Specifically, we show how (i) the PTX thread hierarchy can be mapped to many-core architectures, (ii) translation techniques can be used to hide memory latency, and (iii) GPU data structures can be efficiently emulated or mapped to native equivalents. We describe the low level implementation of our translator, ending with a case study detailing the complete translation process from PTX to SPU assembly used by the IBM Cell Processor. 
/content/cudazone/CUDABrowser/assets/images/applications/230_ptx_small.png /content/cudazone/CUDABrowser/assets/images/applications/230_ptx_large.png Academia School of Electrical and Computer Engineering Georgia Institute of Technology Atlanta, Georgia http://www.ece.gatech.edu/ 2009 01 01 01/01/2009 32 Gregory Diamos Andrew Kerr Mukil Kesavan Paper Programming Tools Gregory Diamos, Andrew Kerr, Mukil Kesavan d964fd70-0e15-11de-8c30-0800200c9a66 Reverberate LE GPU Edition VST convolution reverb for PC, powered by NVIDIA CUDA for GeForce 8 series and above. /content/cudazone/CUDABrowser/assets/images/applications/229_reverberate_small.png /content/cudazone/CUDABrowser/assets/images/applications/229_reverberate_large.png Commercial LiquidSonics http://www.liquidsonics.com/publicbeta/ 2009 03 05 03/05/2009 LiquidSonics Application Video & Audio VST, Audio, Reverb, LiquidSonics d46f0040-0e15-11de-8c30-0800200c9a66 VST Plugin: Convolution Reverb on NVidia GPUs The plugin is able to load wav files as impulse responses and can be used together with an NVIDIA GPU (GeForce 8xxx or better) to run a convolution reverb with almost no CPU usage. /content/cudazone/CUDABrowser/assets/images/applications/228_kvr_listed_at_png32_small.png /content/cudazone/CUDABrowser/assets/images/applications/228_kvr_listed_at_png32_large.png Commercial nilsschneider.de http://www.nilsschneider.de/cms/index.php 2008 11 17 11/17/2008 Nils Schneider Application Video & Audio VST, Audio, Reverb, Nils Schneider cef9c0f0-0e15-11de-8c30-0800200c9a66 Massively Parallel Two-Dimensional TLM Algorithm on Graphics Processing Units Recent advances in computing technology have brought massively parallel computing power to desktop PCs. As multi-core processor technology becomes mature, a new front in parallel technology based on graphics processors has emerged. A massively parallel 2D-TLM algorithm for NVIDIA advanced graphics processors has been developed.
The proposed parallel computing paradigm can be adopted straightforwardly to accelerate time-domain electromagnetic field modeling programs. /content/cudazone/CUDABrowser/assets/images/applications/227_twodimensional-tlm_small.png /content/cudazone/CUDABrowser/assets/images/applications/227_twodimensional-tlm_large.png Academia University of British Columbia http://www.ece.ubc.ca/ 2009 01 22 01/22/2009 10 Filippo V. Rossi Poman P.M. So Nikolaus Fichtner Peter Russer Paper Filippo V. Rossi, Poman P.M. So, Nikolaus Fichtner, Peter Russer b6f64af0-0e15-11de-8c30-0800200c9a66 Fast GPU Implementation of Sparse Signal Recovery from Random Projections We consider the problem of sparse signal recovery from a small number of random projections (measurements). This is a well known NP-hard to solve combinatorial optimization problem. A frequently used approach is based on greedy iterative procedures, such as the Matching Pursuit (MP) algorithm. Here, we discuss a fast GPU implementation of the MP algorithm, based on the recently released NVIDIA CUDA API and CUBLAS library. The results show that the GPU version is substantially faster (up to 31 times) than the highly optimized CPU version based on CBLAS (GNU Scientific Library). /content/cudazone/CUDABrowser/assets/images/applications/226_sparse_signal_recovery_small.png /content/cudazone/CUDABrowser/assets/images/applications/226_sparse_signal_recovery_large.png Academia University of Calgary http://www.ucalgary.ca 2009 01 29 01/25/2009 M. Andrecut Paper Signal Processing GPU programming, Nvidia CUDA, sparse signal recovery, random projections, matching pursuit algorithm, M. 
Andrecut c9700f40-0e15-11de-8c30-0800200c9a66 ArcSoft TotalMedia Theatre "The newly released, CUDA-powered SimHD technology is available in TotalMedia Theatre to allow viewers to obtain an HD-like viewing experience from not only DVDs, but also other standard definition multimedia files," said George Tang, ArcSoft's Vice President and General Manager of Video and Home Entertainment Group. "We are pleased to be partnering with NVIDIA to deliver excellence in high-definition video on the PC." /content/cudazone/CUDABrowser/assets/images/applications/225_arcsoft_small.png /content/cudazone/CUDABrowser/assets/images/applications/225_arcsoft_large.png Consumer ArcSoft http://www.arcsoft.com/products/totalmediatheatre/ 2003 03 10 03/27/2008 ArcSoft Application Consumer ArcSoft c39d46a0-0e15-11de-8c30-0800200c9a66 SETI@home SETI (Search for Extraterrestrial Intelligence) is a scientific area whose goal is to detect intelligent life outside Earth. One approach, known as radio SETI, uses radio telescopes to listen for narrow-bandwidth radio signals from space. Such signals are not known to occur naturally, so a detection would provide evidence of extraterrestrial technology. You can participate by running this free program that downloads and analyzes radio telescope data. And now, with this CUDA-optimized client powered by your CUDA-ready GeForce GPU, your system can deliver as much as 10 times the computational power of an average home PC CPU.
/content/cudazone/CUDABrowser/assets/images/applications/224_setiathome_small.png /content/cudazone/CUDABrowser/assets/images/applications/224_setiathome_large.png Consumer University of California, Berkeley http://setiathome.berkeley.edu/ 2008 12 01 12/01/2008 10 University of California, Berkeley Application Signal Processing University of California, Berkeley be062770-0e15-11de-8c30-0800200c9a66 Folding@home Download Folding@home and band together with people across the globe by adding the massive compute power of an NVIDIA GeForce GPU to one of the largest supercomputers in the world. Use your PC to help fight many of the world's most devastating diseases while you sleep at night. /content/cudazone/CUDABrowser/assets/images/applications/223_foldingathome_small.png /content/cudazone/CUDABrowser/assets/images/applications/223_foldingathome_large.png Academia Stanford University http://folding.stanford.edu/ 2008 06 01 06/01/2008 100 Stanford University Application Multimedia Life Sciences Stanford University b02413b0-0e15-11de-8c30-0800200c9a66 PowerDirector7 Ultra Award winning video editing software with CUDA accelerated video effects and encoding. PowerDirector 7's support for NVIDIA CUDA technology delivers huge speed gains when encoding HD video into the H.264 format. Offering performance gains of 270% for encoding high-definition video using NVIDIA CUDA technology, PowerDirector 7 leverages the power of the GPU to deliver its faster results.
http://www.cyberlink.com/multi/products/main_4_en_US.html /content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_new_small.png /content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_new_large.png /content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_box1.png Commercial Cyberlink http://www.cyberlink.com 2009 02 04 02/04/2009 3.5 Consumer Cyberlink Application Video & Audio PowerDirector,CyberLink,Video,Videoschnitt,Videobearbeitung,Authoring,Videoauthoring,AVCHD,Blu-ray,H.264,MPEG-2,DV-CAM,Kamera,Videokamera, Cyberlink 031bb7a0-0e0f-11de-8c30-0800200c9a66 cuSVM cuSVM is a software package for high-speed (Gaussian-kernelized) Support Vector Machine training and prediction that exploits the massively parallel processing power of Graphics Processors (GPUs). cuSVM is written in NVIDIA's CUDA C-language GPU programming environment, includes implementations of both classification and regression, and performs SVM training (prediction) at 13-73 (22-172) times the rate of state of the art CPU software. Moreover, cuSVM features a Matlab MEX wrapper so that users can access the GPU's power without having to do any "real" programming. /content/cudazone/CUDABrowser/assets/images/applications/222_cusvm_small.png /content/cudazone/CUDABrowser/assets/images/applications/222_cusvm_large.png 2009 01 17 01/17/2009 172 AUSTIN CARPENTER Application Paper AUSTIN CARPENTER fc3423a0-0e0e-11de-8c30-0800200c9a66 CuPP Big improvements in the performance of graphics processing units (GPUs) and enhancements in the corresponding programming systems have turned GPUs into a compelling platform for high performance computing. In this thesis, we present CuPP, a C++ framework built on top of NVIDIA's CUDA. CuPP allows easier development of GPU programs, even compared to CUDA, by automating frequent programming tasks, e.g. memory management. The thesis is roughly divided into three parts.
We begin with an introduction to the CUDA programming system and discuss difficulties when integrating it into already existing C++ applications. Then we describe the CuPP framework and explain how it solves these difficulties. Afterwards we demonstrate the benefits of CuPP on the example of OpenSteer, a steering library and a demo application. With only a small amount of code changes, the performance was increased by a factor of 42 as compared to the CPU version. /content/cudazone/CUDABrowser/assets/images/applications/221_cupp_small.png /content/cudazone/CUDABrowser/assets/images/applications/221_cupp_large.png Academia Universitat Kassel http://cms.uni-kassel.de/unicms/?id=eecs 2009 01 16 01/16/2009 Jens Breitbart Application Paper Programming Tools Jens Breitbart f64a5090-0e0e-11de-8c30-0800200c9a66 Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics Discussion of GraCCA (Graphic-Card Cluster for Astrophysics) system and AMR (Adaptive-Mesh-Refinement) Hydrodynamics + Self-Gravity Simulation in GPUs. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009. /content/cudazone/CUDABrowser/assets/images/applications/220_gpu_computation_in_astrophysics_small.png /content/cudazone/CUDABrowser/assets/images/applications/220_gpu_computation_in_astrophysics_large.png Academia Graduate Institute of Physics, National Taiwan University http://www.ntu.edu.tw/engv4/ 2009 01 16 01/16/2009 23 H. Y. Schive Presentation Science H. Y. Schive f1230a30-0e0e-11de-8c30-0800200c9a66 CRYPTOGRAPHIC COMPUTING ON GPU Review of Cryptographic Computing on the GPU. Discusses elliptic curve method of factorization (ECM) on the GPU. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009. 
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_small.png /content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_large.png Academia Dept. Electrical Engineering National Taiwan University http://www.ee.ntu.edu.tw/en/ 2009 01 16 01/16/2009 Chen-Mou Cheng Presentation Numerics Chen-Mou Cheng 67a8ffd0-04a4-11de-8c30-0800200c9a66 The GPU Supercomputer of CQSE We have developed highly efficient CUDA (Compute Unified Device Architecture) codes for our computationally intense problems (quantum chromodynamics, quantum spin systems, and astrophysics). With our GPU supercomputer, we can tackle many large scale computations without using prohibitively expensive supercomputers like IBM BlueGene. /content/cudazone/CUDABrowser/assets/images/applications/219_cqse_small.png /content/cudazone/CUDABrowser/assets/images/applications/219_cqse_large.png Academia Center for Quantum Science and Engineering National Taiwan University http://cqse.ntu.edu.tw/cqse/ 2009 01 16 01/16/2009 100 Ting-Wai Chiu Paper Ting-Wai Chiu 4c8a6720-04a4-11de-8c30-0800200c9a66 Real-time modelling of sea-surface radiance In order to address the issue of scene simulation in marine environment with optical sea clutter, Alyotech Technologies developed a real-time model of the wind-driven sea surface radiance in the IR and visible spectrum. While IR surveillance from surface ships is among the first considered applications, the model was specifically designed to face the tricky problems related to observation at grazing angles.
For this purpose, special effort has been made to deal efficiently with the following issues: - dynamical computing of surface geometry from wave height spectra, including some (limited) nonlinearity, - representing the surface on optimized multi-scale mesh, - global illumination of the ocean surface by partially cloudy sky-domes, - dynamical estimate and rendering of unresolved surface rugosity accounting for both capillary waves and unresolved distant gravity waves. Real-time animation and rendering for meshes as large as several 10^6 polygons is achieved through massive parallelization on GPU. Full sky domes for both global illumination and sky rendering are precomputed using SKYGEN, a cloudy-sky simulation software recently developed by Alyotech. The application is based on CUDA and OpenGL Shader Model 4.0. The CUDA part is at least 50 times faster than current multi-core implementations. Almost all the computation is offloaded to the GPU giving high performance results (around 150 fps using a GeForce GTX 280).
/content/cudazone/CUDABrowser/assets/images/applications/218_sea_small.png /content/cudazone/CUDABrowser/assets/images/applications/218_sea_large.png Commercial Alyotech technologies http://www.alyotech.com 2009 03 02 03/02/2009 50 Commercial Stephane Melledant Sebastien Vince Goulven Monnier Multimedia Ocean, sea, surface, radiance,alyotech, Stephane Melledant, Sebastien Vince, Goulven Monnier 441666c0-04a4-11de-8c30-0800200c9a66 Smith Waterman algorithm Protein sequence alignment /content/cudazone/CUDABrowser/assets/images/applications/217_SWA_small.png /content/cudazone/CUDABrowser/assets/images/applications/217_SWA_large.png Academia ICM, University of Warsaw http://bioinfo.icm.edu.pl/algorithm/ 2009 03 09 03/09/2009 3.5 Lukasz Ligowski/Witold Rudnicki Paper Life Sciences bioinformatics, sequence alignment, Lukasz Ligowski, Witold Rudnicki 3cdfef70-04a4-11de-8c30-0800200c9a66 GPU VSIPL GPU VSIPL is an implementation of the Vector Signal Image Processing Library Application Programming Interface that exploits CUDA capable GPUs to accelerate signal processing and dense linear algebra applications /content/cudazone/CUDABrowser/assets/images/applications/216_GPU_VSIPL_small.png /content/cudazone/CUDABrowser/assets/images/applications/216_GPU_VSIPL_large.png Research Georgia Tech Research Institute http://www.gtri.gatech.edu 2009 02 27 02/27/2009 75 GPU VSIPL Team Application Paper Numerics Libraries Signal Processing Linear Algebra Signal Processing, GPU VSIPL Team 3512b660-04a4-11de-8c30-0800200c9a66 Boris Pusher The Boris pusher is a numerical algorithm to advance charged particles in an electromagnetic field. It is widely used in numerical simulations in Plasma Physics. This application implements the Boris Pusher in CUDA. 
/content/cudazone/CUDABrowser/assets/images/applications/215_sbp_small.png /content/cudazone/CUDABrowser/assets/images/applications/215_sbp_large.png Research Lasers & Plasma Group (GoLP) of the Institute for Plasmas and Nuclear Fusion (IPFN) http://cfp.ist.utl.pt/golp/epp/ 2008 08 02 08/02/2008 16 Open source Paulo Abreu Multimedia Paper Science Paulo Abreu 26f98d10-04a4-11de-8c30-0800200c9a66 Ikena Live Ikena is a revolutionary video enhancement/forensic solution for Intelligence and Law Enforcement applications. Its entire video processing pipeline is implemented in both CUDA and x86, and is able to run on any Windows PC (laptop or desktop), and runs up to 5 times faster with NVIDIA GPUs. Ikena's powerful multi-frame enhancement technology can quickly and dramatically extract information from poor video sources such as: mobile phones, YouTube videos, and surveillance cameras. In seconds, faces and objects can be enhanced, and license plates can be read. /content/cudazone/CUDABrowser/assets/images/applications/214_Ikena160x90_small.png /content/cudazone/CUDABrowser/assets/images/applications/214_Ikena160x90_large.png Commercial MotionDSP http://www.motiondsp.com 2008 12 15 12/15/2008 5 MotionDSP Multimedia Video & Audio Imaging video, video forensic, video enhancement, CUDA, MotionDSP 80228240-ff98-11dd-87af-0800200c9a66 Many-Core Simulation on GPU Platform Many-core processors have gained significant attention recently. To do research on many-core processors we need a fast and accurate simulator. Existing multi-core simulators run on CPU platforms and are extremely slow when more than sixteen cores are simulated, so traditional simulation techniques are not suitable for the many-core case. In this project, we try to use the GPU platform, an existing many-core instance that favors streaming applications, to simulate a general many-core processor.
It is well known that general CPU processor simulation has very irregular program behaviour. Mapping this irregular simulation behaviour onto a regular platform while still obtaining a simulation speedup is a big challenge. We plan to do it in two steps: first we make it work without considering simulation speed, and then we tune the simulator to make it faster and easy for the research community to use. /content/cudazone/CUDABrowser/assets/images/applications/213_irisa1_small.png /content/cudazone/CUDABrowser/assets/images/applications/213_irisa1_large.png Academia Inner Mongolia University http://www.imu.edu.cn 2007 12 31 12/31/2007 Open Source He Liqiang Code microprocessor simulation microprocessor, simulation, GPU, He Liqiang 7053b9b0-ff98-11dd-87af-0800200c9a66 StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems Today Graphics Processing Units (GPUs) are a largely underexploited resource on existing desktops and a possible cost-effective enhancement to high-performance systems. To date, most applications that exploit GPUs are specialized scientific applications. Little attention has been paid to harnessing these highly-parallel devices to support more generic functionality at the operating system or middleware level. This study starts from the hypothesis that generic middleware-level techniques that improve distributed system reliability or performance (such as content addressing, erasure coding, or data similarity detection) can be significantly accelerated using GPU support.
http://www.ece.ubc.ca/~samera/papers/StoreGPU-HPDC08.pdf /content/cudazone/CUDABrowser/assets/images/applications/212_md5_small.png /content/cudazone/CUDABrowser/assets/images/applications/212_md5_large.png Academia Electrical and Computer Engineering Department, University of British Columbia http://www.ece.ubc.ca 2008 06 27 06/27/2008 8 Samer Al-Kiswany Paper Numerics MD5, Samer Al-Kiswany 6a162150-ff98-11dd-87af-0800200c9a66 Exegy Exegy offers hardware-accelerated appliances for the Financial Services community, facilitating the delivery and normalization of market data at very high data rates, without sacrificing latency or useful functionality. Because Exegy appliances are based on nontraditional hardware-acceleration technologies, more market data can be delivered faster without increasing operating costs, space or management time. /content/cudazone/CUDABrowser/assets/images/applications/211_exegy_small.png /content/cudazone/CUDABrowser/assets/images/applications/211_exegy_large.png Commercial Exegy http://www.exegy.com 2008 11 16 11/16/2008 74 Exegy Paper Financial Exegy 60c359b0-ff98-11dd-87af-0800200c9a66 GPU-HMMER GPU-Based MPI-HMMER is an open source MPI implementation of the HMMER protein sequence analysis suite. The main search algorithms, hmmpfam and hmmsearch, have been ported to MPI in order to provide high throughput HMMER searches on modern computational clusters. We improve on HMMER through sophisticated I/O, a self-contained coordinator/worker model, and the easy inclusion of accelerated architectures. 
This results in better scalability while still maintaining the familiar user interface http://www.mpihmmer.org/ /content/cudazone/CUDABrowser/assets/images/applications/210_hmmr_small.png /content/cudazone/CUDABrowser/assets/images/applications/210_hmmr_large.png Research mpiHMMER http://www.mpihmmer.org/ 2009 02 09 02/09/2009 Open Source John Paul Walters Joseph Landman Vipin Chaudhary Application Paper John Paul Walters, Joseph Landman, Vipin Chaudhary 584a2930-ff98-11dd-87af-0800200c9a66 Glimmer: Multilevel MDS on the GPU We present Glimmer, a new multilevel algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm makes local minima less likely while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We also show that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm. 
/content/cudazone/CUDABrowser/assets/images/applications/209_mds_small.png /content/cudazone/CUDABrowser/assets/images/applications/209_mds_large.png Academia University of British Columbia http://www.cs.ubc.ca/ 2009 01 08 01/08/2009 Open Source Stephen Ingram Tamara Munzner Marc Olano Paper Multimedia Code Numerics Stephen Ingram, Tamara Munzner, Marc Olano 4fc93080-ff98-11dd-87af-0800200c9a66 Accelerating Molecular Dynamic Simulations on GPUs Using OpenMM OpenMM is a freely downloadable, high performance, extensible library that allows molecular dynamics (MD) simulations to run on high performance computer architectures, such as graphics processing units (GPUs). Significant performance speedups of over 100X in some cases were achieved using OpenMM, as compared to a conventional implementation running on a single CPU core. The library performs full protein Hamiltonian calculations without any cutoffs (full O(N^2) treatment). The current release includes a version of GROMACS that uses OpenMM to speed up its calculations on recent versions of NVIDIA and ATI GPUs. It supports implicit solvent models (Onufriev, Bashford, Case GB), with explicit solvent models to be incorporated into the next release. /content/cudazone/CUDABrowser/assets/images/applications/208_openmm_small.png /content/cudazone/CUDABrowser/assets/images/applications/208_openmm_large.png Academia Simbios http://simbios.stanford.edu 2009 01 26 01/26/2009 100 Open Source OpenMM Team Application Life Sciences Libraries protein folding, RNA folding, molecular dynamics, molecular modeling, GROMACS, OpenMM Team 43e02da0-ff98-11dd-87af-0800200c9a66 Predictive Runtime Code Scheduling for Heterogeneous Architectures Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power.
However, such architectures are often used in a restricted way for domain specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all their heterogeneous resources are continuously utilized by different applications with versioned critical parts to be able to better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user level scheduler based on past performance history for heterogeneous systems. We developed several scheduling policies and present the study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single application mode. /content/cudazone/CUDABrowser/assets/images/applications/207_pred_runtime_small.png /content/cudazone/CUDABrowser/assets/images/applications/207_pred_runtime_large.png Research HiPEAC European Network of Excellence http://www.hipeac.net 2008 01 01 01/01/2008 72 Barcelona Supercomputing Center Paper Numerics Barcelona Supercomputing Center 3afb9120-ff98-11dd-87af-0800200c9a66 GPU Acceleration of a Production Molecular Docking Code Modeling the interactions of biological molecules, or docking, is critical to both understanding basic life processes and to designing new drugs. Here we describe the GPU-based acceleration of a recently developed, complex, production docking code. We show how the various functions can be mapped to the GPU and present numerous optimizations. We find which parts of the problem domain are best suited to the different correlation methods.
The GPU-accelerated system achieves a speedup of at least 16x for all likely problem sizes. This makes it competitive with FPGA-based systems for small molecule docking, and superior for protein-protein docking. /content/cudazone/CUDABrowser/assets/images/applications/205_moleculardocking_small.png /content/cudazone/CUDABrowser/assets/images/applications/205_moleculardocking_large.png Academia Department of Electrical and Computer Engineering Boston University http://www.bu.edu/dbin/ece/web 2008 01 01 01/01/2008 16 Bharat Sukhwani Martin C. Herbordt Paper Science Bharat Sukhwani, Martin C. Herbordt 262c6b20-ff98-11dd-87af-0800200c9a66 Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. Hidden Markov Model (HMM) based recognition approaches are widely used for modeling the human speech process by constructing probabilistic estimates of the underlying word sequence from an acoustic signal. High-accuracy speech recognition, however, requires complex models, large vocabulary sizes, and exploration of a very large search space, making the computation too intense for current personal and mobile platforms. In this paper, we explore opportunities for parallelizing the HMM based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and present an efficient implementation on current many-core platforms. For the case study, we use a recognition model of 50,000 English words, with more than 500,000 word bigram transitions, and one million hidden states. We examine important implementation tradeoffs for shared-memory single-chip many-core processors by implementing LVCSR on the NVIDIA G80 Graphics Processing Unit (GPU) in Compute Unified Device Architecture (CUDA), leading to significant speedups.
This work is an important step forward for LVCSR-based applications to leverage many-core processors in achieving real-time performance on personal and mobile computing platforms. /content/cudazone/CUDABrowser/assets/images/applications/206_continuos_speech_recognition_small.png /content/cudazone/CUDABrowser/assets/images/applications/206_continuos_speech_recognition_large.png Academia Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu 2008 05 22 05/22/2008 9 Jike Chong Youngmin Yi Arlo Faria Nadathur Satish Kurt Keutzer Paper Video & Audio Jike Chong,Youngmin Yi, Arlo Faria, Nadathur Satish, Kurt Keutzer 16639600-ff98-11dd-87af-0800200c9a66 Improving Optical Character Recognition There is a clear need for optical character recognition in order to provide a fast and accurate method to search both existing images as well as large archives of existing paper documents. However, existing optical character recognition programs suffer from a flawed tradeoff between speed and accuracy, making it less attractive for large quantities of documents. This paper analyzes five different algorithms which operate completely independently of optical character recognition programs, but which have the combined effect of decreasing computational complexity and increasing overall accuracy. Finally, the paper proposes implementing each of these algorithms on the GPU, as well as optical character recognition programs themselves, in order to deliver another massive speed increase. /content/cudazone/CUDABrowser/assets/images/applications/204_wordrecognition_small.png /content/cudazone/CUDABrowser/assets/images/applications/204_wordrecognition_large.png Academia Villanova University http://www.villanova.edu 2008 01 01 01/01/2008 AJ Palkovic Paper AJ Palkovic f08f28b0-fa04-11dd-87af-0800200c9a66 Parallel, stochastic measurement of molecular surface area Biochemists often wish to compute surface areas of proteins. 
A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy. /content/cudazone/CUDABrowser/assets/images/applications/203_jmgm_small.png /content/cudazone/CUDABrowser/assets/images/applications/203_jmgm_large.png Academia Department of Computer Science, University of Maryland http://www.cs.umd.edu/ 2008 02 13 03/13/2008 7 Derek Juba Amitabh Varshney Paper Science Molecular surface, Parallel, Progressive, GPU, Stochastic, Quasi-random, Derek Juba, Amitabh Varshney e5d3a950-fa04-11dd-87af-0800200c9a66 IV Data Feed Server Analytical Data Distribution System, Implied Volatility Index (IVX) calculations (IVX(c) is a registered trademark of IVolatility.com), Risk analysis, all in real-time. /content/cudazone/CUDABrowser/assets/images/applications/201_logo_small.png /content/cudazone/CUDABrowser/assets/images/applications/201_logo_large.png Commercial IVolatility.com http://www.ivolatility.com 2008 08 03 08/03/2008 20 Commercial Sergey Fedoseev Presentation Finance Sergey Fedoseev e0e30350-fa04-11dd-87af-0800200c9a66 Q GPU Q-GPU (Quantara-GPU) is a high performance options analytics for pricing and risk managing exotic structures.
Q-GPU is based on the NVIDIA-CUDA architecture high performance computing technology to price a wide range of interest rate structures using state-of-the-art stochastic volatility and multi-factor models. /content/cudazone/CUDABrowser/assets/images/applications/200_q-gpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/200_q-gpu_large.png Commercial Advanced Derivatives Solutions http://www.aderivatives.com 2008 08 20 08/20/2008 100 Commercial Skander Handous Presentation Finance montecarlo callable interest rates daabd390-fa04-11dd-87af-0800200c9a66 Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration General-purpose computing on graphics processing units (GPGPU) is shown to dramatically increase the speed of Monte Carlo simulations of photon migration. In a standard simulation of time-resolved photon migration in a semi-infinite geometry, the proposed methodology executed on a low-cost graphics processing unit (GPU) is a factor of 1000 faster than simulation performed on a single standard processor. In addition, we address important technical aspects of GPU-based simulations of photon migration. The technique is expected to become a standard method in Monte Carlo simulations of photon migration. /content/cudazone/CUDABrowser/assets/images/applications/199_parallelcomputing_small.png /content/cudazone/CUDABrowser/assets/images/applications/199_parallelcomputing_large.png Academia Lund University http://www.lth.se/fysik/english/ 2008 07 16 07/16/2008 1000 Erik Alerstam Tomas Svensson Stefan Andersson-Engels Paper biomedical optics, simulations, scattering, Monte Carlo, Erik Alerstam, Tomas Svensson, Stefan Andersson-Engels d526dcd0-fa04-11dd-87af-0800200c9a66 Dense Compressed/Hierarchical Linear System Solver Benchmarks with NVIDIA 260 GTX hardware for the solution of large dense linear systems with compressed/hierarchical structures are discussed.
A linear system with 163844 unknowns was generated and solved on the GPU with a 25 times speedup relative to a Quad Core Xeon 2.66 GHz. Matrix generation shows a 20 times speedup with a peak performance of 70 GFlop/s. The iterative solver and compressed matrix multiplication algorithm produce up to a 50 times speedup with a peak performance of 6 GFlop/s = 45 GB/s memory bandwidth. /content/cudazone/CUDABrowser/assets/images/applications/198_dense_small.png /content/cudazone/CUDABrowser/assets/images/applications/198_dense_large.png Commercial Elegant Mathematics Ltd. http://www.elegant-mathematics.com/ 2009 02 13 02/13/2009 50 Commercial Ilgis Ibragimov Application Computational Fluid Dynamics Numerics Libraries Science dense h-matrix low rank linear system solver, Ilgis Ibragimov d02cc0f0-fa04-11dd-87af-0800200c9a66 GPU Accelerated Free Surface Flows Using Smoothed Particle Hydrodynamics NVIDIA C870 Implementation of a Smoothed Particle Hydrodynamics (SPH) CFD code for wave interactions with fixed and floating bodies /content/cudazone/CUDABrowser/assets/images/applications/197_shot700_small.png /content/cudazone/CUDABrowser/assets/images/applications/197_shot700_large.png Academia Manchester Metropolitan University http://www.doc.mmu.ac.uk/cmmfa/ 2009 02 12 02/12/2009 23 Professor Derek Causon Presentation Computational Fluid Dynamics Numerics Computational Fluid Dynamics, Free Surface Flows, SPH, Waves, Professor Derek Causon f146d3d0-f99f-11dd-87af-0800200c9a66 Fluids: Technology Demo The NVIDIA fluid particle demo uses PhysX technology accelerated by the CUDA architecture. Released in October 2008 as part of a GeForce Powerpack. Over 64000 SPH fluid particles pour into the scene and push aside wooden crates, which float up as the fluid level rises. All these particles are simulated in real time using accelerated PhysX; each SPH particle moves in the scene as the result of interactions with other particles, rigid body objects and the surrounding environment.
This demo was recorded using a GeForce 9800GTX+. /content/cudazone/CUDABrowser/assets/images/applications/196_screenshot_fluids_small.png /content/cudazone/CUDABrowser/assets/images/applications/196_screenshot_fluids_large.png Commercial NVIDIA http://www.nvidia.com/object/nvidia_physx.html 2008 08 12 08/12/2008 10 NVIDIA Application Multimedia Game Physics NVIDIA 94c0f270-f991-11dd-87af-0800200c9a66 The Great Kulu: Technology Demo Using CUDA The Great Kulu was one of the NVIDIA GTX280 launch demos, featuring PhysX technology accelerated by the CUDA architecture. The demo is set below deck on a research ship, where a large sea creature, "Kulu", features fully physically simulated flesh using PhysX soft body simulation. The creature's skeleton movements are animated, but the flesh movements are the result of simulation. This amazing demo gives a glimpse into the immense possibilities offered by GPU accelerated PhysX. /content/cudazone/CUDABrowser/assets/images/applications/195_screenshot_kulu_small.png /content/cudazone/CUDABrowser/assets/images/applications/195_screenshot_kulu_large.png Commercial NVIDIA http://www.nvidia.com/object/nvidia_physx.html 2008 08 12 08/12/2008 5 Commercial NVIDIA Application Multimedia Game Physics NVIDIA 8fbceb80-f991-11dd-87af-0800200c9a66 Nurien Demo Using CUDA This is a demo of a social networking game in development by Nurien, featuring NVIDIA PhysX technology accelerated by the CUDA architecture. This demo was released in October 2008 as part of a GeForce Powerpack. This fashion show runway scene is brought to life by the physically simulated skirts and character hair. Note how the skirt moves and flows naturally as the character walks and dances. This was recorded using a GeForce 9800GT.
/content/cudazone/CUDABrowser/assets/images/applications/194_screenshot_nurien_small.png /content/cudazone/CUDABrowser/assets/images/applications/194_screenshot_nurien_large.png Commercial Nurien Software http://www.nurien.com/service/main/main.nrn 2008 08 12 08/12/2008 5 Commercial Application Multimedia Game Physics 4f4d41b0-f961-11dd-87af-0800200c9a66 UT3 PhysX Mod Pack Using CUDA The NVIDIA PhysX mod pack features three extra PhysX levels for Epic's Unreal Tournament III, accelerated by the CUDA architecture. These levels are Lighthouse, Tornado and HeatRay. Each level features advanced PhysX effects throughout, with dynamic environmental elements such as dust, hail, rain and wind, destructible buildings, tearing cloth banners, amazing explosions and many more additions. These levels are a must for any UT3 PC gamer. /content/cudazone/CUDABrowser/assets/images/applications/193_screenshot_ut3_small.png /content/cudazone/CUDABrowser/assets/images/applications/193_screenshot_ut3_large.png Commercial Epic Games http://www.epicgames.com 2008 08 12 08/12/2008 5 Commercial NVIDIA Application Multimedia Game Physics NVIDIA 824e3300-f991-11dd-87af-0800200c9a66 GPU for Surveillance Our goal is to develop very fast image and video processing algorithms by taking advantage of Graphics Processing Units (GPU). We have already implemented MERL's state-of-the-art Bayesian background generation and foreground detection method. In comparison to the CPU version of the same algorithm, the GPU implementation is more than 20 times faster.
/content/cudazone/CUDABrowser/assets/images/applications/192_surveillance_small.png /content/cudazone/CUDABrowser/assets/images/applications/192_surveillance_large.png Commercial Mitsubishi Electric Research Laboratories http://www.merl.com/ 2009 01 16 01/16/2009 20 Fatih Porikli Jay Thornton Paper Imaging Fatih Porikli, Jay Thornton 5fb7fc20-f961-11dd-87af-0800200c9a66 Parallel Fast Multipole Method for Global Illumination on Graphics Hardware Traditionally, Graphics Processing Units (GPUs) were designed for performing graphics specific computations. However, with rapid improvements in performance and programmability, GPUs have fostered considerable interest in doing computations that go beyond computer graphics; general purpose computation on GPUs, or "GPGPU". GPUs may be viewed as data parallel compute coprocessors that can provide significant improvements in computational performance especially for algorithms which exhibit sufficiently high amount of parallelism. One such algorithm is the Fast Multipole Method (FMM). http://www.cse.iitb.ac.in/~prekshu/dd.html /content/cudazone/CUDABrowser/assets/images/applications/190_prekshu_small.png /content/cudazone/CUDABrowser/assets/images/applications/190_prekshu_large.png Academia Prekshu Ajmera http://www.cse.iitb.ac.in/~prekshu/flash.php 2008 06 27 06/27/2008 20 Prekshu Ajmera Paper Presentation Imaging Prekshu Ajmera 59b4ada0-f961-11dd-87af-0800200c9a66 Extending VForce to Include Support for NVIDIA GPUs using CUDA VSIPL++ for Reconfigurable Computing (VForce) is a middleware framework that adds support for special purpose processors (SPPs) to VSIPL++ [1], a C++ extension of the Vector, Signal, and Image Processing Library. VSIPL++ defines an object oriented API that provides a collection of commonly used signal processing algorithms and strives to enable performance, portability, and productivity. 
/content/cudazone/CUDABrowser/assets/images/applications/189_vforce_small.png /content/cudazone/CUDABrowser/assets/images/applications/189_vforce_large.png Academia Northeastern University http://www.ece.neu.edu/groups/rcl/projects/vsipl/vsipl.html/ 2008 09 24 09/24/2008 Miriam Leeser Presentation Paper Imaging, Signal Processing, Libraries Signal Processing Libraries Miriam Leeser 538a2810-f961-11dd-87af-0800200c9a66 Computing spike-based convolutions on GPUs using CUDA This project developed a hierarchical spike-based network for object recognition using the dynamic vision sensor silicon retina and NVIDIA CUDA GPUs. /content/cudazone/CUDABrowser/assets/images/applications/188_CUDAspikeoutDiffSaccade_small.png /content/cudazone/CUDABrowser/assets/images/applications/188_CUDAspikeoutDiffSaccade_large.png Telluride Neuromorphic Engineering Workshop https://neuromorphs.net 2008 07 18 07/18/2008 5 Yingxue Wang Jayram Moorkanikara Nageswaran Tobi Delbruck Paper Multimedia Imaging Yingxue Wang, Jayram Moorkanikara Nageswaran, Tobi Delbruck 497d3830-f961-11dd-87af-0800200c9a66 A GPU Accelerated Speech Recognition System Graphics Processing Units (GPUs) have become increasingly programmable over the past few years and are able to accomplish more than the specific graphics tasks for which they were designed. Their relatively low price, inherent parallelism, and the fact that their overall performance is increasing faster than for CPUs make them ideal co-processors for certain applications other than graphics. The purpose of this project is to utilize the power of a GPU for a speech recognition application, decreasing the overall processing time while maintaining the same level of recognition performance.
/content/cudazone/CUDABrowser/assets/images/applications/187_speech_recognition_small.png /content/cudazone/CUDABrowser/assets/images/applications/187_speech_recognition_large.png Academia Mississippi State University http://www.msstate.edu/ 2005 05 01 01/05/2005 7 John Johnson Paper Video & Audio John Johnson 2bd2c480-f961-11dd-87af-0800200c9a66 Hierarchical Object Recognition Algorithm NVIDIA CUDA Implementation of a Hierarchical Object Recognition Algorithm /content/cudazone/CUDABrowser/assets/images/applications/186_object_recognition_small.png /content/cudazone/CUDABrowser/assets/images/applications/186_object_recognition_large.png Academia 2008 11 1 01/11/2008 Sharat Chikkerur Paper Graphics Sharat Chikkerur 24de3830-f961-11dd-87af-0800200c9a66 Arbitrary Dimension Reed-Solomon Coding and Decoding for Extended RAID on GPUs Reed-Solomon coding is a method of generating arbitrary amounts of checksum information from original data via matrix-vector multiplication in finite fields. Previous work has shown that CPUs are not well-matched to this type of computation, but recent graphical processing units (GPUs) have been shown through a case study to perform this encoding quickly for the 3 + 3 (three data + three parity) case. In order to be utilized in a true RAID-like system, it is important to understand how well this computation can scale in the number of data disks supported. This paper details the performance of a general Reed-Solomon encoding and decoding library that is suitable for use in RAID-like systems. Both generation and recovery are performance-tested and discussed. 
/content/cudazone/CUDABrowser/assets/images/applications/185_raid_on_gpu_small.png /content/cudazone/CUDABrowser/assets/images/applications/185_raid_on_gpu_large.png Academia University of Alabama at Birmingham / Sandia National Laboratories http://www.cis.uab.edu 2008 11 17 11/07/2008 15 Matthew Curry tony@cis.uab.edu Lee Ward Ron Brightwell Paper Matthew Curry, tony@cis.uab.edu, Lee Ward, Ron Brightwell 7324e1c0-f95b-11dd-87af-0800200c9a66 Knoppix for CUDA We are developing a USB/CD/DVD bootable Linux system named "Knoppix for CUDA". Using it, we can quickly start and evaluate many applications in the scientific GPU computing area without any effort spent installing CUDA software, GPU device drivers, or MPI/OpenMP environments, or compiling CUDA sample code, etc. /content/cudazone/CUDABrowser/assets/images/applications/184_knx4cuda_small.png /content/cudazone/CUDABrowser/assets/images/applications/184_knx4cuda_large.png Academia Nagasaki University http://progrape.jp/cs/ 2008 02 10 02/10/2008 77 Open source Tsuyoshi Hamada Application Code Multimedia Computational Fluid Dynamics Life Sciences Libraries Programming Tools Science Other USB bootable and GPU-enable Linux System, Tsuyoshi Hamada 718b9100-f943-11dd-87af-0800200c9a66 3D Particle Boltzmann Solver This software package solves the Boltzmann equation with a particle method. It allows one to make mixtures of particles and particle transformations, i.e. chemical reactions at supersonic speed. The CUDA collision part is almost 30 times faster than the tuned and threaded Quad Core Xeon version. The free flow part is even faster, up to 120 times faster. A "simple" supersonic (Mach=7) problem with 15,000,000 particles can be solved within several minutes using just the GPU memory of the GTX 2xx series (the 9xxx series is also supported). /content/cudazone/CUDABrowser/assets/images/applications/182_ss_small.png /content/cudazone/CUDABrowser/assets/images/applications/182_ss_large.png Commercial Elegant Mathematics Ltd.
http://www.elegant-mathematics.com 2009 02 10 02/10/2009 120 Commercial Ilgis Ibragimov Application Computational Fluid Dynamics Dynamics Numerics Life Sciences Libraries Science Boltzmann Particle CFD, Ilgis Ibragimov f760db20-f88d-11dd-87af-0800200c9a66 Creation parallel dotplots for suite of protein sequences This application is developed to generate pairwise dotplots for a huge number of protein sequences in a database. Creating dotplots sequentially for a database of sequences takes a huge amount of time. However, if we use CUDA we can do the same task in a much shorter time. /content/cudazone/CUDABrowser/assets/images/applications/181_multidotplot_small.png /content/cudazone/CUDABrowser/assets/images/applications/181_multidotplot_large.png Commercial New England Biolabs 2009 01 17 01/17/2009 Chandra Sekhar Pedamallu Application Code Life Sciences Protein Sequences, Dot plots, Sequence similarity, Chandra Sekhar Pedamallu 7f2db890-ef5e-11dd-ba2f-0800200c9a66 Fast Support Vector Machine Training and Classification on Graphics Processors Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training, using Platt's Sequential Minimal Optimization algorithm, which achieves speedups of 5-32x over LibSVM running on a high-end traditional processor. We also present a system for SVM classification which achieves speedups of 120-150x over LibSVM.
/content/cudazone/CUDABrowser/assets/images/applications/180_vector_machine_training_small.png /content/cudazone/CUDABrowser/assets/images/applications/180_vector_machine_training_large.png Academia Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/ 2008 02 08 02/08/2008 150 Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Paper Numerics Support Vector Machines, Sequential Minimal Optimization, Graphics Processing Units, Bryan Catanzaro, Narayanan Sundaram, Kurt Keutzer 275f4a70-e91e-11dd-ba2f-0800200c9a66 GPGPU for Accelerated GRAPPA Autocalibration in Magnetic Resonance Imaging The first part of this thesis provided an overview of MRI and explained how the acquisition time can be reduced by parallel imaging techniques such as GRAPPA. GRAPPA in a nutshell: undersample k-space and reconstruct missing information by fitting acquired data to k-space gaps. The reconstruction of missing data is a computationally intensive task. The second part described the massively parallel and specialized architecture of graphics hardware and how to effectively harness its computational power to accelerate general-purpose computations. The final part presented different CUDA kernels for complex-valued matrix multiplication on GPU and explained various optimization techniques that have been applied step-by-step, yielding speedups of 12 through 18 for special cases compared to the highly optimized Intel MKL.
/content/cudazone/CUDABrowser/assets/images/applications/178_grappa_small.png /content/cudazone/CUDABrowser/assets/images/applications/178_grappa_large.png 2008 04 01 04/01/2008 18 Matthias Schneider Paper Imaging Matthias Schneider 136a90d0-e91c-11dd-ba2f-0800200c9a66 Clinical Evaluation of GPU-Based Cone Beam Computed Tomography The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3-D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 seconds). In many situations, the short scanning time of CBCT is followed by a time consuming 3-D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256^3 takes up to 25 minutes on a standard system. Recent developments in the area of Graphics Processing Units (GPUs) make it possible to have access to high performance computing solutions at a low cost, allowing for use in applications to many scientific problems. We have implemented an algorithm for 3-D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corp., Santa Clara, California), which was executed on an NVIDIA GeForce 8800GT. Our implementation reduces reconstruction times from the order of minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3-D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe differences that can occur between CPU and GPU based reconstructions. By using our approach, the computation time for 256^3 is reduced from 25 minutes on the CPU to 4.8 seconds on the GPU. The GPU reconstruction time for 512^3 is 11.3 seconds, and for 1024^3 it is 61.4 seconds.
/content/cudazone/CUDABrowser/assets/images/applications/177_conebeam_small.png /content/cudazone/CUDABrowser/assets/images/applications/177_conebeam_large.png Academia Computer Science and Toshiba Stroke Research Center of The State University of New York at Buffalo http://www.cse.buffalo.edu 2008 12 31 12/31/2008 300 Peter B. Noel Alan M. Walczak Kenneth R. Hoffmann Jinhui Xu Jason J. Corso Sebastian Schafer Paper Medical Imaging Peter B. Noel, Alan M. Walczak, Kenneth R. Hoffmann, Jinhui Xu, Jason J. Corso, Sebastian Schafer 863de380-e919-11dd-ba2f-0800200c9a66 Fast Deformable Registration on the GPU: A CUDA Implementation of Demons In the medical imaging field, we need fast deformable registration methods especially in intra-operative settings characterized by their time-critical applications. Image registration studies which are based on Graphics Processing Units (GPUs) provide fast implementations. However, only a small number of these GPU-based studies concentrate on deformable registration. We implemented Demons, a widely used deformable image registration algorithm, on NVIDIA's Quadro FX 5600 GPU with the Compute Unified Device Architecture (CUDA) programming environment. Using our code, we registered 3D CT lung images of patients. Our results show that we achieved the fastest runtime among the available GPU-based Demons implementations. Additionally, regardless of the given dataset size, we provided a factor of 55 speedup over an optimized CPU-based implementation. Hence, this study addresses the need for on-line deformable registration methods in intra-operative settings by providing the fastest and most scalable Demons implementation available to date. In addition, it provides an implementation of a deformable registration algorithm on a GPU, an understudied type of registration in the general-purpose computation on graphics processors (GPGPU) community. 
/content/cudazone/CUDABrowser/assets/images/applications/176_brain_small.png /content/cudazone/CUDABrowser/assets/images/applications/176_brain_large.png Academia University of California / University of Florida 2008 06 01 06/01/2008 55 Pinar Muyan-Ozcelik John D. Owens Junyi Xia Sanjiv S. Samant Paper Life Sciences Science Pinar Muyan-Ozcelik, John D. Owens, Junyi Xia, Sanjiv S. Samant e51d78e0-e912-11dd-ba2f-0800200c9a66 Accelerated Image Registration With CUDA In image registration, one of the images is referred to as the reference or source and the second image is referred to as the target or sensed. Image registration involves spatially transforming the target image to align with the reference image. A broad category of transformation models includes linear transformations, which include translation, rotation, scaling, and affine. Affine registration was performed between two 3D anatomical volume data sets (size 240x256x176 voxels). The registration algorithm seeks to find an affine transformation that maps a "source" volume onto a "target" volume so as to minimize a cost function calculated between the two. In practice the transformed source voxels straddle target voxel positions, so that interpolation of target voxel values is required. /content/cudazone/CUDABrowser/assets/images/applications/175_brain_small.png /content/cudazone/CUDABrowser/assets/images/applications/175_brain_large.png Academia BSS Group, Cavendish Laboratory, University of Cambridge UK http://www.bss.phy.cam.ac.uk/ 2008 08 01 08/01/2008 100 Richard Ansorge Paper Life Sciences Science Richard Ansorge bcfa89a0-e90f-11dd-ba2f-0800200c9a66 A Note on Auto-tuning GEMM for GPUs The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM).
This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13, 11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even backward engineering to understand some undocumented insides about the architecture that have been of key importance in the development [12]. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas. Auto-tuning, as we show in this paper, is a very practical solution where in addition to getting an easy portability, we can often get substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280). Keywords: Auto-tuning, matrix multiply, dense linear algebra, GPUs. /content/cudazone/CUDABrowser/assets/images/applications/174_anagg_small.png /content/cudazone/CUDABrowser/assets/images/applications/174_anagg_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://www.cs.utk.edu/~tomov 2009 01 12 01/12/2009 Yinan Li Jack Dongarra Stanimire Tomov Paper Numerics Auto-tuning, matrix multiply, dense linear algebra, GPUs, Yinan Li, Jack Dongarra, Stanimire Tomov 33852d10-e90f-11dd-ba2f-0800200c9a66 Enhancing the Performance of Dense Linear Algebra Solvers on GPUs The MAGMA project, headed by the linear algebra research groups at University of Tennessee, UC Berkeley, and UC Denver, aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current 'Multicore+GPU' systems. 
This transition cannot be done automatically, as in many cases new algorithms that significantly differ from algorithms for conventional architectures will be needed. Preliminary studies - on a new class of 'heterogeneity-aware' algorithms of 'reduced communication' and 'high-parallelism', as shown in this poster - confirm that this is the case. /content/cudazone/CUDABrowser/assets/images/applications/173_magma_small.png /content/cudazone/CUDABrowser/assets/images/applications/173_magma_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://www.cs.utk.edu/~tomov 2008 11 08 11/08/2008 2 M. Baboulin J. Demmel J. Dongarra S. Tomov V. Volkov Paper Application Numerics M. Baboulin, J. Demmel, J. Dongarra, S. Tomov, V. Volkov 95015ef0-e90b-11dd-ba2f-0800200c9a66 Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems.
We illustrate the main ideas with an LU-factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon (2x4 cores @ 2.33 GHz) - NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz) system. Keywords: hybrid computing, dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units. /content/cudazone/CUDABrowser/assets/images/applications/172_tdla_small.png /content/cudazone/CUDABrowser/assets/images/applications/172_tdla_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://www.cs.utk.edu/~tomov 2008 10 01 10/01/2008 5 S. Tomov J. Dongarra M. Baboulin Paper Numerics hybrid computing, dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units, S. Tomov, J. Dongarra, M. Baboulin 0d29c620-e906-11dd-ba2f-0800200c9a66 Exploring New Architectures in Accelerating CFD for Air Force Applications Computational Fluid Dynamics (CFD) is an active field of research where the development of faster and more accurate methods is linked to the continuous demand for ever higher computational power. And indeed, for at least two decades, high-performance computing (HPC) programmers have taken for granted that each successive generation of microprocessors would, either immediately or after minor adjustments, make their software run substantially faster. But recent microprocessor design trends including the introduction of multi/many-core designs and the increasingly popular use in HPC of accelerators such as General Purpose Graphics Processing Units (GPGPU) and Field Programmable Gate Arrays (FPGAs), present an unprecedented challenge, namely how to update and enhance the existing large CFD software infrastructure to efficiently use these new architectures.
In this paper we address some main issues in this transition and present ideas on using the new architectures to accelerate CFD applications that are of interest to the Air Force. We consider not only multi/many-core but also special purpose (e.g. GPUs) and reconfigurable computing (e.g. FPGAs) architectures. Moreover, we demonstrate benefits of using hybrid combinations where the strengths of each platform can be used to better map algorithm requirements and underlying architecture. /content/cudazone/CUDABrowser/assets/images/applications/171_enaa_small.png /content/cudazone/CUDABrowser/assets/images/applications/171_enaa_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://www.cs.utk.edu/~tomov 2008 10 31 10/31/2008 3 J. Dongarra S. Moore G. Peterson S. Tomov J. Allred V. Natoli D. Richie Paper Numerics J. Dongarra, S. Moore, G. Peterson, S. Tomov 0d6fb020-e903-11dd-ba2f-0800200c9a66 Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special purpose architectures (in particular GPUs). We present them in the context of an LU factorization algorithm, where randomization techniques are used as an alternative to pivoting. This approach yields an algorithm based entirely on a collection of small Level 3 BLAS type computational tasks, which has emerged as a common goal in designing DLA algorithms for new architectures. Other common trends, also considered here, are block asynchronous task execution and "Block" layouts for the data associated with the separate tasks. We present numerical results and other specific experiments with DLA algorithms on NVIDIA GPUs using CUDA. The GPU results are also of interest in themselves as we show a performance of up to 160 GFlop/s on a single Quadro FX 5600 card.
Keywords: dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphic process units. /content/cudazone/CUDABrowser/assets/images/applications/170_sidlamspa_small.png /content/cudazone/CUDABrowser/assets/images/applications/170_sidlamspa_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://www.cs.utk.edu/~tomov 2009 01 13 01/13/2009 2 M. Baboulin J. Dongarra Stanimire Tomov Paper Numerics dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphic process units, M. Baboulin, J. Dongarra, Stanimire Tomov 6dc8e4b0-e821-11dd-ba2f-0800200c9a66 MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES The MAGMA project, led by the linear algebra research groups at University of Tennessee, UC Berkeley, and UC Denver, aims to develop a linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems. This transition cannot be done automatically, as in many cases new algorithms that significantly differ from algorithms for conventional architectures will be needed. Preliminary studies on a new class of "heterogeneity-aware" algorithms of "reduced communication" and "high-parallelism" confirm that this is the case. /content/cudazone/CUDABrowser/assets/images/applications/169_magma_small.png /content/cudazone/CUDABrowser/assets/images/applications/169_magma_large.png Academia Innovative Computing Laboratory Computer Science Department, University of Tennessee http://icl.cs.utk.edu/magma 2008 11 08 11/08/2008 2 Stanimire Tomov Paper Numerics Stanimire Tomov c74547b0-e810-11dd-ba2f-0800200c9a66 High-Performance Stream Computing for Particle Beam Transport Simulations Understanding modern particle accelerators requires simulating charged particle transport through the machine elements. 
These simulations can be very time consuming due to the large number of particles and the need to consider many turns of a circular machine. Stream computing offers an attractive way to dramatically improve the performance of such simulations by calculating the simultaneous transport of many particles using dedicated hardware. Modern Graphics Processing Units (GPUs) are powerful and affordable stream computing devices. The results of simulations of particle transport through the booster-to-storage-ring transfer line of the DIAMOND synchrotron light source using an NVIDIA GeForce 7900 GPU are compared to the standard transport code MAD. It is found that particle transport calculations are suitable for stream processing and large performance increases are possible. The accuracy and potential speed gains are compared and the prospects for future work in the area are discussed. /content/cudazone/CUDABrowser/assets/images/applications/168_hpscpbts_small.png /content/cudazone/CUDABrowser/assets/images/applications/168_hpscpbts_large.png N/A Academia Particle Physics Group at the University of Manchester http://www.hep.man.ac.uk/whoswho/HEP_members.html 2007 09 02 09/02/2007 5 R. Appleby D. Bailey M. Salt Application Paper R. Appleby, D. Bailey, M. Salt 073721d0-e1a5-11dd-ad8b-0800200c9a66 YARP CUDA Driver YARP is a widely used open source robotic platform that runs on many advanced humanoid robots (e.g. MIT Cog, Kismet, RobotCub iCub, ...). It is divided into two main branches: IPC for programs distributed across a local network, and a driver system that standardizes the software interface to hardware (e.g. a "grabber" interface allows video acquisition from many types of devices, allowing developers to use the same code on different platforms). This project provides YARP with a CUDA-based driver to allow execution of user-made kernels on NVIDIA GPUs, easing the integration of GPGPU and robotics.
/content/cudazone/CUDABrowser/assets/images/applications/167_small_yarp.png /content/cudazone/CUDABrowser/assets/images/applications/167_large_yarp.png N/A N/A 2007 11 24 11/24/2007 Open Source Giacomo Spigler (YARP authors are Paul Fitzpatrick, Giorgio Metta, Lorenzo Natale and Alessandro Scalzo) Code Numerics Science Signal Processing CUDA, YARP, YARP device, robotics, robotcub, icub, robot, Giacomo Spigler 317cf050-e114-11dd-ad8b-0800200c9a66 National Instruments LabVIEW HIL application in control of extremely large telescope A prototype in which NI LabVIEW is enabled by NVIDIA's CUDA technology has been thoroughly benchmarked, with impressive computational results. /content/cudazone/CUDABrowser/assets/images/applications/166_small_E-ELT_telescope_10494_p.png /content/cudazone/CUDABrowser/assets/images/applications/166_large_E-ELT_telescope_10494_p.png National Instruments http://www.ni.com 2008 12 02 12/02/2008 Shawn McCaslin Paper Multimedia Science Hardware-in-the-loop, HIL, high performance computing, extremely large telescope, Shawn McCaslin 29168480-e114-11dd-ad8b-0800200c9a66 Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors Lattice Boltzmann Methods (LBM) are used for the computational simulation of Newtonian fluid dynamics. In general, LBM-based simulations are relatively straightforward to parallelize, and this technique has been applied numerous times on general-purpose processors, field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). Of the three methods, the GPU implementations achieved the highest simulation performance per chip. With memory bandwidth of up to 141 GB/s and a theoretical maximum floating point performance of over 600 GFLOPS, CUDA-ready GPUs from NVIDIA provide an attractive platform for a wide range of scientific simulations, including LBM.
Using the D3Q19 model, this paper improves upon prior GPU LBM results by increasing GPU multiprocessor occupancy, resulting in an increase in maximum performance by 20%, and by introducing a space-efficient storage method which reduces GPU RAM requirements by 50% at a slight detriment to performance. Both GPU versions are over 28 times faster than a quad-core CPU version using OpenMP. /content/cudazone/CUDABrowser/assets/images/applications/165_small_flow_in_porous_media.jpg /content/cudazone/CUDABrowser/assets/images/applications/165_large_flow_in_porous_media.jpg Academia University of Minnesota http://www.umn.edu 2008 12 19 12/19/2008 28 Peter Bailey Paper Computational Fluid Dynamics "Lattice Boltzmann" LBM cfd d3q19, Peter Bailey 1fe0e1d0-e114-11dd-ad8b-0800200c9a66 Ray tracing with CUDA An interactive real-time ray tracing system, developed as part of a diploma thesis. Amongst others, it features three different traversal strategies for a SAH-based BVH, two of them leveraging specific aspects of the novelties NVIDIA CUDA introduces to GPU computing, one resembling a more traditional approach. It is shown that the new features offered by CUDA provide for substantial improvement in hierarchy traversal over previous solutions and that GPU-based ray tracing in general can match or even better the performance of CPU-based systems. /content/cudazone/CUDABrowser/assets/images/applications/164_small_RTCUDA_cover.png /content/cudazone/CUDABrowser/assets/images/applications/164_large_RTCUDA_cover.png Academia University of Koblenz-Landau, Koblenz Campus http://www.uni-koblenz.de 2008 09 30 09/30/2008 Hanno Rabe Presentation Graphics ray tracing, Hanno Rabe 136059e0-e114-11dd-ad8b-0800200c9a66 GPU Particle Tracking and Multi-Fluid Simulations with Greatly Enhanced Computational Speed This is a poster presented at the American Geophysical Union Meeting in December 2008. 
/content/cudazone/CUDABrowser/assets/images/applications/163_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/163_large.jpg Academia University of Washington 2008 12 16 12/16/08 50 Open source Michele Cash Presentation Computational Fluid Dynamics Science Michele Cash 494de4a0-e114-11dd-ad8b-0800200c9a66 CoreAVC Professional CoreAVC is recognized as the world's fastest H.264 software video decoder. Now featuring NVIDIA CUDA Support! /content/cudazone/CUDABrowser/assets/images/applications/162_bg_small.png /content/cudazone/CUDABrowser/assets/images/applications/162_bg_large.png Commercial CoreCodec, Inc. http://www.corecodec.com 2008 12 19 12/19/2008 Commercial Dan Marlin Application Graphics h.264, h264, avc, coreavc, corecodec, video, decoder, Dan Marlin 208b40f0-ccef-11dd-ad8b-0800200c9a66 Gnort: High Performance Network Intrusion Detection Using Graphics Processors We present an intrusion detection system, based on the Snort open-source NIDS, that suggests that modern graphics cards can greatly speed up intrusion detection systems. /content/cudazone/CUDABrowser/assets/images/applications/161_gnort_Small.png /content/cudazone/CUDABrowser/assets/images/applications/161_gnort_Large.png Research ICS-FORTH http://www.ics.forth.gr 2008 06 06 06/06/2008 Giorgos Vasiliadis Paper Other Network Intrusion Detection pattern matching, intrusion detection systems, network security, Giorgos Vasiliadis eb232f60-cc29-11dd-ad8b-0800200c9a66 Parallel algorithms for approximation of distance maps on parametric surfaces We present an efficient O(n) numerical algorithm for first-order approximation of geodesic distances on geometry images, where n is the number of points on the surface. The structure of our algorithm allows efficient implementation on parallel architectures.
/content/cudazone/CUDABrowser/assets/images/applications/160_PMM_head_Small.png /content/cudazone/CUDABrowser/assets/images/applications/160_PMM_head_Large.png Academia Technion http://www.cs.technion.ac.il/~weber/Publications/ 2008 10 1 10/1/2008 150 Ofir Weber Application Multimedia Paper Graphics Numerics Geodesic distance, shortest path, Ofir Weber 90c91cc0-cc1d-11dd-ad8b-0800200c9a66 Multigrid on GPU: Tackling Power Grid Analysis on Parallel SIMT Platforms The challenging task of analyzing on-chip power (ground) distribution networks with multi-million node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance. /content/cudazone/CUDABrowser/assets/images/applications/159_HybridMultigrid_Small.png /content/cudazone/CUDABrowser/assets/images/applications/159_HybridMultigrid_Large.png Texas A&M University http://www.tamu.edu 2008 11 12 11/12/2008 Zhuo Feng and Peng Li Paper Electronic Design Automation Zhuo Feng and Peng Li 44cf2020-cc14-11dd-ad8b-0800200c9a66 Voxel-based real-time ray tracing A real-time ray tracer that uses voxels (volumetric pixels) and is fully implemented on the GPU using the NVIDIA CUDA architecture. This ray tracer is independent of the scene complexity, thus making it possible to create models as complex as desired.
/content/cudazone/CUDABrowser/assets/images/applications/158_voxeldemo1_Small.png /content/cudazone/CUDABrowser/assets/images/applications/158_voxeldemo1_Large.png Academia ijsf 2008 07 12 7/12/2008 Cecill Etheredge Application Paper Graphics Imaging voxel ray tracing real time volumetric raytracing, Cecill Etheredge fa51e3b6-a0f8-4d3c-af4b-a19144509f43 Multi-GPU Incompressible Navier-Stokes Solver We describe the implementation of an incompressible Navier-Stokes solver code with Cartesian geometry capability on desktop supercomputers with multi-GPUs. Specifically, we adopt the NVIDIA CUDA programming model to implement the discretized form of the governing equations on desktop supercomputers with up to four GPUs. /content/cudazone/CUDABrowser/assets/images/applications/157_lidDrivenCavity_Small.png /content/cudazone/CUDABrowser/assets/images/applications/157_lidDrivenCavity_Large.png Academia Boise State University http://www.boisestate.edu 2008 12 10 10/12/2008 100 Thibault / Senocak Presentation Multimedia Computational Fluid Dynamics Numerics CFD, multi-GPU, Navier-Stokes, Thibault, Senocak d3447ef0-c132-11dd-ad8b-0800200c9a66 Realtime Conversation Scene Analysis This is a realtime system for analyzing group meetings by combining face pose tracking and speaker diarization, based on audio-visual signals from an omnidirectional camera-microphone. It aims to estimate "who is talking to whom" and is based on CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/148_3d_Crop_TopviewCrop_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/148_3d_Crop_TopviewCrop_large.jpg Research NTT Communication Science Laboratories 2008 10 20 20/10/2008 Kazuhiro Otsuka Application Paper Presentation Imaging Video & Audio Kazuhiro Otsuka c0d91d10-c133-11dd-ad8b-0800200c9a66 Matrix Algebra on GPU and Multicore Architectures The MAGMA project, headed by the linear algebra research groups at University of Tennessee and U of California, Berkeley, aims to develop a library similar to LAPACK but for Multicore+GPU systems. /content/cudazone/CUDABrowser/assets/images/applications/149_logoMAGMA_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/149_logoMAGMA_large.jpg Academia University of Tennessee 2008 11 2 2/11/2008 S.Tomov, J.Dongarra, M. Baboulin Application Paper Numerics S.Tomov, J.Dongarra, M. Baboulin d5849900-c134-11dd-ad8b-0800200c9a66 Flame Fractals Flame fractals are a generalization of IFS fractals, allowing a range of non-linear transformations in addition to linear ones. A good example is the Electric Sheep screensaver, which uses a distributed rendering system to render animations. /content/cudazone/CUDABrowser/assets/images/applications/150_12secondswithfinalxform_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/150_12secondswithfinalxform_large.jpg Research 2008 11 4 4/11/2008 50 Open source Steven Brodhead Application Code Digital Content Creation Graphics Flam3, Flame, Fractal, Steven Brodhead c418fe70-c136-11dd-ad8b-0800200c9a66 Fast Iterative Linear System Solvers CG, BiCGStab, GMRES/FGMRES/NGMRES, Lanczos, Arnoldi and Davidson together with tensor preconditioners /content/cudazone/CUDABrowser/assets/images/applications/151_GPU-Iter_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/151_GPU-Iter_large.jpg Commercial Elegant Mathematics Ltd.
2008 11 5 5/11/2008 50 Commercial Ibragimov Application Code Numerics Libraries Programming Tools Science Ibragimov 70a1d900-c137-11dd-ad8b-0800200c9a66 TMPGEnc 4.0 Xpress Multi-codec transcoder with the CUDA accelerated filters. /content/cudazone/CUDABrowser/assets/images/applications/152_te4xp_box_med_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/152_te4xp_box_med_large.jpg Commercial Pegasys Inc. http://www.pegasys-inc.com/en/index.html 2008 10 31 31/10/2008 4.46 Commercial Kaname Saito Application Video & Audio transcode, encode, transcoder, encoder, TMPGEnc, Pegasys, Kaname Saito 07c34530-c138-11dd-ad8b-0800200c9a66 Exploiting the capabilities of modern GPUs for dense matrix computations We present several algorithms to compute the solution of a linear system of equations on the GPU, as well as general techniques to improve the performance, such as padding and hybrid CPU-GPU computation. We compare single and double precision on a GTX280 /content/cudazone/CUDABrowser/assets/images/applications/153_tr_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/153_tr_large.jpg Academia University Jaume I, Castellon 2008 10 10 10/10/2008 Francisco Igual Application Paper Numerics Francisco Igual ba6e8f00-c138-11dd-ad8b-0800200c9a66 Jacket's Graphics Toolbox for MATLAB The Graphics Toolbox extends Jacket for MATLAB to seamlessly integrate computation with visualization making difficult to program, multi-threaded, and real time graphical displays effortless to achieve. 
/content/cudazone/CUDABrowser/assets/images/applications/154_gfx_interactive_ocean_example_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/154_gfx_interactive_ocean_example_large.jpg Commercial AccelerEyes http://www.accelereyes.com 2008 11 1 1/11/2008 100 Gallagher Pryor Application Multimedia Computational Fluid Dynamics Finance Graphics Imaging Numerics Libraries Oil & Gas Programming Tools Science Signal Processing Video & Audio Gallagher Pryor 66a26bc0-c139-11dd-ad8b-0800200c9a66 Neurocuda - CUDA-accelerated neurodynamics Software for neurodynamic simulations, consisting of a C++ class library and applications. The neural networks are biologically plausible competitive nets and associative memories, with an intended direction towards vision processing. /content/cudazone/CUDABrowser/assets/images/applications/155_neurocuda1_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/155_neurocuda1_large.jpg Research 2008 11 12 12/11/2008 Open source Fredrik Farnstrom Application Code Science Other Neural, network, neurodynamics, artificial, intelligence, AI, associative, memory, competitive, vision, computational, neuroscience, Fredrik Farnstrom ee77fb00-c139-11dd-ad8b-0800200c9a66 GPU-Quicksort GPU-Quicksort is a Quicksort-based algorithm designed for sorting integers and floats on graphics processors.
Experiments show that it can outperform a highly optimized CPU-based Quicksort by a factor of 10 on high-end graphics processors. /content/cudazone/CUDABrowser/assets/images/applications/156_gpuqsort_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/156_gpuqsort_large.jpg Academia Distributed Computing and Systems - Chalmers University of Technology http://www.cs.chalmers.se/~dcs 2008 11 10 10/11/2008 10 Open source Daniel Cederman Philippas Tsigas Application Code Paper Presentation Libraries sorting, Daniel Cederman, Philippas Tsigas 7E3FAFC6-B7D0-11DD-A2A0-A58455D89593 Badaboom Media Converter Elemental Technologies Badaboom Media Converter takes a fundamentally different approach to video format conversion from other solutions. Instead of performing format conversion on the CPU, it harnesses massively parallel GPUs from NVIDIA. By using the power of the GPU, the time required for video conversion is reduced. /content/cudazone/CUDABrowser/assets/images/applications/147_badaboom_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/147_badaboom_large.jpg Commercial Elemental Technologies http://www.elementaltechnologies.com/ 2008 1 10 10/1/2008 GeForce 8 and higher 20 Commercial Elemental Technologies Application Video & Audio Encoding, transcoding, audio, video, BadaBOOM, Elemental Technologies b830c9b0-a5bd-11dd-ad8b-0800200c9a66 Multiresolution Gradient Adaptive Filter The GIMP plugin implements a multiresolution gradient adaptive filter with a bilateral filter kernel. The filter operates on grayscale images and removes noise while edges are preserved.
/content/cudazone/CUDABrowser/assets/images/applications/146_Multiresolution_Gradient_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/146_Multiresolution_Gradient_large.jpg Academia University of Erlangen-Nuremberg http://www.fau.de/ 2008 10 09 10/09/2008 30 Open Source Membarth Application Code Graphics Imaging Membarth 310eabf0-a5bd-11dd-ad8b-0800200c9a66 CUDA vs Wizard CUDA Microsoft Visual Studio Wizard /content/cudazone/CUDABrowser/assets/images/applications/145_CUDA_vs_Wizard_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/145_CUDA_vs_Wizard_large.jpg HKBU 2008 04 18 04/18/2008 Zhao, Kaiyong Application Programming Tools Zhao, Kaiyong 360ecd30-a5bb-11dd-ad8b-0800200c9a66 LISSOM LISSOM is a model of human neocortex (mainly visual cortex) at a neural column level. The model was developed by Bednar, Choe, Miikkulainen and Sirosh, at University of Texas. The model was ported to GPUs. /content/cudazone/CUDABrowser/assets/images/applications/144_LISSOM_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/144_LISSOM_large.jpg Research 2008 10 14 10/14/2008 9 Open Source Spigler Application Code Presentation Life Sciences Science Video & Audio lissom visual cortex neural network som v1, Spiglerg, Spigler 15bde710-a5ba-11dd-ad8b-0800200c9a66 Fast Computed Tomography North Star Imaging's new proprietary efX CT software utilizes GPU reconstruction speed. The software was developed with a CUDA interface and reconstruction speeds have increased up to 50x as compared to other CT software. /content/cudazone/CUDABrowser/assets/images/applications/143_Fast_Computed_Tomography_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/143_Fast_Computed_Tomography_large.jpg Commercial North Star Imaging, Inc. 
http://www.4nsi.com/ 2008 10 16 10/16/2008 50 Damhof Noel Application Paper Digital Content Creation Imaging industrial nondestructive testing, computed tomography, ct scan, ct reconstruction, ct software, CUDA, GPU reconstruction, North Star Imaging, NSI, Damhof, Noel 0329d970-a5b9-11dd-ad8b-0800200c9a66 Flowball Flowball is an interactive game using dense optical flow computed in realtime on a Geforce GTX 280. We provide a video and even a free Win32 optical flow library. /content/cudazone/CUDABrowser/assets/images/applications/142_Flowball_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/142_Flowball_large.jpg Academia Graz University of Technology, Institute for Computer Graphics and Vision http://www.icg.tugraz.at/ 2008 10 09 10/09/2008 Open Source Santner Application Paper Multimedia Imaging Numerics Science Signal Processing Video & Audio Realtime Dense Optical Flow Interactive Game, Santner ad3b3da0-a5b4-11dd-ad8b-0800200c9a66 Fast Blood Flow Visualization of High-resolution Laser Speckle Imaging Data This paper introduces GPUs into the data processing framework of laser speckle contrast imaging, to achieve fast and high-resolution blood flow visualization on PCs by exploiting the high floating-point processing power of GPUs. By using GPU, a 12-60 fold performance enhancement is obtained in comparison to the optimized CPU implementations. 
/content/cudazone/CUDABrowser/assets/images/applications/141_Fast_Blood_Flow_Visualization_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/141_Fast_Blood_Flow_Visualization_large.jpg Academia Wuhan National Laboratory for Optoelectronics http://wnlo.hust.edu.cn/english/ 2008 08 29 08/29/2008 60 Li, Liu, and Luo Paper Science Video & Audio Blood flow, visualization, video, Li, Liu, Luo f2667150-a5b0-11dd-ad8b-0800200c9a66 Powerful Real-time Electrodynamics Aeth.drive is a fast, parallel, versatile grid-based EM modelling framework including support for relativity, turbulent/quantum effects, and isothermal participating media. Runs in real-time and includes demo. /content/cudazone/CUDABrowser/assets/images/applications/140_Powerful_Real-time_Electrodynamics_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/140_Powerful_Real-time_Electrodynamics_large.jpg Research 2008 10 20 10/20/2008 Open Source Daley Application Code Science Alpha, computational, computer graphics, computer simulation, CUDA, electrodynamics, EM, free software, gpgpu, GPU, isothermal, mapping, MIT license, modelling, NVIDIA, open source, parallel, photon, physics, plasma, radiosity, release, science, scientific computing, simulation, thermodynamics, turbulence, daley db3a3d80-99f4-11dd-ad8b-0800200c9a66 GPU Based Image Segmentation Livewire Algorithm Implementation This thesis presents a GPU implementation of the Livewire algorithm. It is divided into 3 phases: Sobel or Laplacian filter convolution, image modeling as a grid graph, and solving the single-source shortest path problem with non-negative edge weights. An adapted version of the parallel delta-stepping algorithm is used for the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/139_GPU_Based_Image_Segmentation_Livewire_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/139_GPU_Based_Image_Segmentation_Livewire_large.jpg Academia Instituto Tecnologico de Aeronautica http://www.ita.br/ 2008 02 15 02/15/2008 Open Source Baggio Application Paper Code Multimedia Imaging livewire, dijkstra, cuda, Baggio 43b5c9c0-99f4-11dd-ad8b-0800200c9a66 Computational Fluid Dynamics (CFD) using GPUs Solving 2D heat conduction CFD problems using CUDA. Using Red-Black Gauss-Seidel with SOR (successive over-relaxation). /content/cudazone/CUDABrowser/assets/images/applications/138_CFD_using_GPUs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/138_CFD_using_GPUs_large.jpg Academia Mechanical Science and Engineering Dept., University of Illinois Urbana-Champaign http://www.mechse.uiuc.edu/ 2008 08 18 08/18/2008 17 Shinn Presentation Computational Fluid Dynamics Science CFD, 2D solver, heat conduction, Shinn 500ec1f0-99f3-11dd-ad8b-0800200c9a66 Volume Ray Casting With CUDA This dissertation includes a chapter that is devoted to volume ray casting on CUDA architecture. It is tailored to take into account the CUDA architecture's unique details and performs 1.5 times better than that of the Cell processor and 15 times better than that of Intel Xeon processor in our implementations. /content/cudazone/CUDABrowser/assets/images/applications/137_Volume_Ray_Casting_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/137_Volume_Ray_Casting_large.jpg Academia University of Maryland at College Park http://www.umd.edu/ 2008 05 09 05/09/2008 Kim Paper Graphics Kim 73926e40-99f0-11dd-ad8b-0800200c9a66 Radius-CUDA This application implements a complete ray tracing kernel using a kd-tree structure for the hierarchical space subdivision and plain triangles for the geometry.
The complete kernel including simple shading, shadow and visibility ray generation is implemented using CUDA. The source is also provided by the author. /content/cudazone/CUDABrowser/assets/images/applications/136_Radius-Cuda_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/136_Radius-Cuda_large.jpg Commercial Etranges Libellules http://www.etranges-libellules.fr/?lang=en 2008 10 10 10/10/2008 2 Open Source Segovia Application Graphics Ray tracing, CUDA, Segovia d70d9d10-99ef-11dd-ad8b-0800200c9a66 The Synchronization Power of Coalesced Memory Accesses This paper investigates the synchronization power of coalesced memory accesses in CUDA. The results show that coalesced memory accesses can be used to construct concurrent data objects that tolerate up to 63 crash-failures (compute capability 1.2 & higher) or 15 crash-failures (compute capability 1.1 & lower). /content/cudazone/CUDABrowser/assets/images/applications/135_Coalesced_Memory_Accesses_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/135_Coalesced_Memory_Accesses_large.jpg Academia University of Tromsø, Norway http://uit.no/informatikk/ 2008 09 24 09/24/2008 Ha Anshus Tsigas Paper Presentation Programming Tools Science Fault-tolerance, synchronization, multicores, memory access mechanisms, consensus, Ha, Anshus, Tsigas fc0adf20-99ee-11dd-ad8b-0800200c9a66 Cubic Interpolation Cubic B-spline interpolation of 2D and 3D textures. Easily replace your tex2D and tex3D calls by cubic interpolated texture lookups. A CUDA accelerated pre-filter for calculating the B-spline coefficients and example programs are also included.
/content/cudazone/CUDABrowser/assets/images/applications/134_Cubic_Interpolation_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/134_Cubic_Interpolation_large.jpg Research KU Leuven / TU Eindhoven 2008 10 06 10/06/2008 327 Open Source Ruijters Application Code Graphics Imaging Libraries Programming Tools Science Signal Processing Video & Audio Cubic B-spline interpolation, Ruijters f8398140-99ed-11dd-ad8b-0800200c9a66 SVI Pro Advanced 3D Seismic Analysis SVI Pro is a 3D seismic image analysis and visualisation application, allowing geological objects to be identified, enhanced and extracted from large 3D seismic datasets. The results of this objective and repeatable analysis provide a comprehensive understanding of the subsurface, without pre-interpretation. /content/cudazone/CUDABrowser/assets/images/applications/133_SVI_Pro_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/133_SVI_Pro_large.jpg Commercial ffA http://www.ffa.co.uk/ 2008 10 08 10/08/2008 34 Commercial ffA Application Presentation Paper Multimedia Oil & Gas 3D Seismic, Image Analysis, Exploration, Production, Image Processing, Visualisation, ffA e274b200-8df7-11dd-ad8b-0800200c9a66 Computational Chemistry Using GPUs We undertook the task of accelerating the resolution-of-the-identity second-order Moeller-Plesset (RI-MP2) calculations as implemented in Q-Chem 3.1 by executing matrix-matrix multiplication operations using CUBLAS. We exploited the fact that large matrices can be multiplied about 13x faster on the GPU than on the host CPU. With moderate programming effort, our code had a 4.3x speedup. 
/content/cudazone/CUDABrowser/assets/images/applications/132_Computational_Chemistry_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/132_Computational_Chemistry_large.jpg Academia Harvard, Department of Chemistry and Chemical Biology http://aspuru.chem.harvard.edu/About/ 2008 08 01 01/08/2008 4.3 Open Source Aspuru-Guzik Paper Science Other Quantum, chemistry, Moller-Plesset, molecular, Aspuru-Guzik 776e39b0-8959-11dd-ad8b-0800200c9a66 GpuCV: GPU-accelerated Computer vision library GpuCV is an open-source GPU-accelerated image processing and Computer Vision library. It is meant for easily porting existing OpenCV applications, while taking advantage of computing power available from recent GPUs. /content/cudazone/CUDABrowser/assets/images/applications/131_GpuCV_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/131_GpuCV_large.jpg Academia Institut TELECOM; TELECOM & Management SudParis http://www.it-sudparis.eu/ 2008 10 22 10/22/2007 100 Open Source Allusse Horain Application Paper Code Imaging Programming Tools Science Signal Processing Video & Audio Other GLSL, NVIDIA CUDA, computer vision, image processing, Allusse, Horain 21efc800-8959-11dd-ad8b-0800200c9a66 Ray Tracing This application shows a method allowing ray tracing with CUDA. /content/cudazone/CUDABrowser/assets/images/applications/130_Ray_Tracing_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/130_Ray_Tracing_large.jpg Academia University of Reims 2008 9 12 9/12/2008 16 Open Source Maxime Application Code Graphics Imaging Numerics Ray tracing, Maxime bec4bb60-847f-11dd-ad8b-0800200c9a66 Photon-Mapping Demo Real-time, physically-based photon mapper using aeth.drive, a general-purpose library for dynamical simulations. Kernel solves Maxwell's equations in a turbulent participating medium using nearly 800,000 samples in a few hundredths of a second. 
/content/cudazone/CUDABrowser/assets/images/applications/129_Photon_Mapping_Demo_Small.png /content/cudazone/CUDABrowser/assets/images/applications/129_Photon_Mapping_Demo_Large.png Research 2008 9 6 9/6/2008 256 Open Source Daley Application Code Graphics Libraries Science Other Photon mapper, simulations, Maxwell's equations, Daley 07243f10-847c-11dd-ad8b-0800200c9a66 Fast Sparse Signal Recovery from Random Projections We consider the problem of sparse signal recovery from a small number of random projections (measurements). This is a well-known NP-hard combinatorial optimization problem. Here, we discuss the fast GPU (CUDA & CUBLAS) implementation. /content/cudazone/CUDABrowser/assets/images/applications/128_Fast_Sparse_Signal_Recovery_from_Random_Projections_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/128_Fast_Sparse_Signal_Recovery_from_Random_Projections_Large.jpg Academia Institute for Biocomplexity and Informatics, University of Calgary http://www.ibi.ucalgary.ca/ 2008 9 10 9/10/2008 31 Andrecut Application Paper Code Numerics Signal Processing Matching Pursuit; Signal Recovery; Random Projections, Andrecut 0d68f890-7b56-11dd-ad8b-0800200c9a66 Ray tracing with CUDA (CUDART-sp) Realtime ray tracing on a CUDA device. This implementation uses only spheres, 2 lights and no textures. /content/cudazone/CUDABrowser/assets/images/applications/127_cudart_sp_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/127_cudart_sp_large.jpg Academia BV2 http://www.bv2.co.uk/ 2008 8 27 8/27/2008 25 Abernethy Application Multimedia Graphics Ray tracing, Abernethy 7b764240-7b54-11dd-ad8b-0800200c9a66 Lucas and Kanade optical flow algorithm using CUDA (LKCUDA) Real time pyramidal implementation of the Lucas and Kanade optical flow algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/126_LKCuda_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/126_LKCuda_large.jpg Research INRIA http://www-rocq.inria.fr/ 2008 8 29 8/29/2008 55 Yann DUMORTIER Julien MARZAT Application Imaging Libraries vision, dense optical flow, real-time, Lucas and Kanade, algorithm, LKCuda Team 012a74f0-7b51-11dd-ad8b-0800200c9a66 Fast Sliding-Window Object Detection The paper presents a fast object class localization framework implemented on a data parallel architecture available in recent computers. Our case study, the implementation of HOG descriptors, shows that just by using this recent programming model we can easily speed up an original CPU-only implementation by a factor of 34/109, making it unnecessary to use early rejection cascades that sacrifice classification performance, even in real-time conditions. /content/cudazone/CUDABrowser/assets/images/applications/125_Fast_Sliding_Window_Object_Detection_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/125_Fast_Sliding_Window_Object_Detection_large.jpg Academia TU Darmstadt http://www.mis.informatik.tu-darmstadt.de/ 2008 6 10 6/10/2008 109 Wojek, Dork, Schulz, Schiele Paper Graphics Imaging Video & Audio Other Object Detection, Histograms of Oriented Gradients, HOG, Sliding-Window, People Detection, Wojek, Dork, Schulz, Schiele f89bc2f5-c528-41ce-84ef-8e833503d4de Teraflops for Games and Derivatives Pricing Financial instruments pricing using Monte-Carlo methods. /content/cudazone/CUDABrowser/assets/images/applications/124_Teraflops_for_Games_and_Derivatives_Pricing_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/124_Teraflops_for_Games_and_Derivatives_Pricing_large.jpg Commercial QuantCatalyst Inc. 
http://www.quantcatalyst.com/ 2008 8 1 8/1/2008 50 Egloff, Bennemann, Beinker, Gauckler Paper Finance Graphics processing units, high performance computing, cluster, grid, Monte-Carlo simulation, basket options, local volatility, derivatives pricing, risk analytics, Egloff, Bennemann, Beinker, Gauckler 3f8319f0-5fc7-11dd-ad8b-0800200c9a66 Ray Casting Deformable Models This paper explores the problem of real time ray casting of large deformable models (over a million triangles) on large displays (a million pixels) on an off-the-shelf GPU. We build a GPU-efficient three dimensional data structure for this purpose and a corresponding algorithm that uses it for fast ray casting using the CUDA model. http://cvit.iiit.ac.in/projects/gpuproject/ /content/cudazone/CUDABrowser/assets/images/applications/123_Ray_Casting_Deformable_Models_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/123_Ray_Casting_Deformable_Models_Large.jpg Academia International Institute of Information Technology Hyderabad http://cvit.iiit.ac.in/ 2008 7 4 7/4/2008 Open Source Patidar, et al Paper Graphics Ray casting, Deformable Models, Data structures on GPU, Patidar, Narayanan, Patidar, et al 90392390-5fc6-11dd-ad8b-0800200c9a66 A Fast Similarity Join Algorithm A novel similarity join algorithm called LSS is presented that executes on a GPU, exploiting its parallelism and high data throughput. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods. 
/content/cudazone/CUDABrowser/assets/images/applications/122_A_Fast_Similarity_Join_Algorithm_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/122_A_Fast_Similarity_Join_Algorithm_Large.jpg Academia University of Maryland http://www.cs.umd.edu/ 2008 04 09 04/09/2008 100 Lieberman, Sankaranarayanan, Samet Paper Science Similarity search, joins, high-dimensional points, Lieberman, Sankaranarayanan, Samet bc2716d0-5fc4-11dd-ad8b-0800200c9a66 Concurrent Number Cruncher Concurrent Number Cruncher: a general purpose symmetric sparse solver on the GPU. It describes how to combine recent GPU programming techniques and new GPU dedicated APIs with high performance computing strategies to implement a sparse general-purpose linear solver. /content/cudazone/CUDABrowser/assets/images/applications/120_CNC_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/120_CNC_Large.jpg Academia INRIA http://www.inria.fr/ 2008 7 28 7/28/2008 10 Open Source Buatois, Caumon, Lévy Application Paper Code Multimedia Numerics Conjugate gradient, sparse solver, Buatois, Caumon, Lévy f751cbd0-5fc2-11dd-ad8b-0800200c9a66 NaminamiFX for Fluid Simulation NaminamiFX, bundled with LiquidPack, delivers more than 4 times the computational performance of a CPU-only system. Available as a plug-in for LightWave v9, it was developed and is currently sold only in Japan. /content/cudazone/CUDABrowser/assets/images/applications/119_NaminamiFX_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/119_NaminamiFX_Large.jpg Commercial D-Storm Inc. http://www.dstorm.co.jp/ 2007 10 16 10/16/2007 4 Commercial Kenkyujo Application Digital Content Creation LightWave, Liquid Pack, fluid simulation, wave, Kenkyujo db0238a0-5ecf-11dd-ad8b-0800200c9a66 Real-time Digital Holographic Microscopy This paper describes a real-time DHM system using a GPU with many stream processors. The Fresnel diffraction computation runs faster on the GPU than on recent CPUs.
The real-time DHM system can obtain reconstructed images from 512x512-grid holograms at 24 frames per second. /content/cudazone/CUDABrowser/assets/images/applications/115_Real-time_Digital_Holographic_Microscopy_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/115_Real-time_Digital_Holographic_Microscopy_Large.jpg Academia Yamagata University http://gabor.yz.yamagata-u.ac.jp/ 2008 7 24 7/24/2008 Shimobaba, Sato, Miura, Takenouchi, Ito Application Paper Multimedia Numerics Libraries Science Signal Processing Digital holography microscope microscopy Fresnel diffraction light propagation hologram, Tomoyoshi, Shimobaba, Yoshikuni, Sato, Junya, Miura, Mai, Takenouchi, Tomoyoshi, Ito 7083a0f0-5e42-11dd-ad8b-0800200c9a66 GPU4Vision Usage of GPUs to tackle computer vision tasks like denoising, filtering, segmentation, stereo, optical flow, etc. /content/cudazone/CUDABrowser/assets/images/applications/118_GPU4Vision_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/118_GPU4Vision_Large.jpg Academia Graz University of Technology http://www.icg.tugraz.at/ 2008 7 21 7/21/2008 Pock Application Paper Graphics Imaging Numerics Science Signal Processing Video & Audio Computer vision, denoising, filtering, segmentation, stereo, optical flow, Pock 4b8d2790-5e41-11dd-ad8b-0800200c9a66 Molecular Dynamics of DNA and Liquids Ascalaph Liquid GPU is a program for molecular dynamics simulation in the liquid phase. Ascalaph DNA GPU is a program for creating models of nucleic acids and their complexes with ligands.
http://mtzweb.scs.uiuc.edu/research/gpu/ /content/cudazone/CUDABrowser/assets/images/applications/117_Molecular_Dynamics_of_DNA_and_Liquids_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/117_Molecular_Dynamics_of_DNA_and_Liquids_Large.jpg Commercial Agile Molecule http://www.agilemolecule.com/index.html 2008 7 23 7/23/2008 18 Commercial Alexey Application Science Molecular Dynamics, Building Design, Alexey 8ec8dff0-5e40-11dd-ad8b-0800200c9a66 GPUGRID.NET GPUGRID.NET is a volunteer computing project using NVIDIA graphics cards and CUDA for full-atom molecular dynamics simulations of proteins. /content/cudazone/CUDABrowser/assets/images/applications/116_GPUGRID_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/116_GPUGRID_Large.jpg Academia Universitat Pompeu Fabra - Multiscale Lab http://multiscalelab.org/ 2008 7 17 7/17/2008 Commercial De Fabritiis Application Life Sciences Science molecular dynamics, distributed computing, BOINC, Gianni, De Fabritiis 8bfac8a0-5e3c-11dd-ad8b-0800200c9a66 LIBOR Interest rate Model With recent exciting developments, the author is dedicating some research time to exploring the capabilities of the latest hardware/software for HPC including topics such as trends in mainstream HPC, the co-processor alternatives, NVIDIA GPUs (hardware/software/applications), and whether the alternatives will have an impact. 
/content/cudazone/CUDABrowser/assets/images/applications/114_LIBOR_Interest_rate_Model_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/114_LIBOR_Interest_rate_Model_Large.jpg Academia Oxford University http://people.maths.ox.ac.uk/~gilesm/hpc/ 2006 7 1 7/1/2006 50 Giles Application Presentation Finance Monte Carlo, finance, computational finance, LIBOR, interest rate model, finite difference, Giles f6a4261c-5a03-4123-b651-0057b017551d Real-time Visual Tracker by Stream Processing This work describes the implementation of a real-time visual tracker that targets the position and 3D pose of objects (specifically faces) in video sequences. Using a GPU and the NVIDIA CUDA technology, performance improvements as large as ten times compared to a similar CPU-only tracker are achieved. /content/cudazone/CUDABrowser/assets/images/applications/113_Real-time Visual_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/113_Real-time Visual_Large.jpg Research NTT Communication Science Laboratories http://www.kecl.ntt.co.jp/rps/index.html 2008 7 12 7/12/2008 10 Lozano, Otsuka Application Paper Video & Audio Stream processing, particle filtering, face tracking, gpu, cuda, vision, Lozano, Otsuka b4954515-c71d-4c7d-b29c-8dc6954724ef Numerical Calculation Library for Diffraction Integrals The GPU-based Wave Optics library is a numerical calculation library for diffraction integrals using the GPU. It can calculate several kinds of diffraction: Fresnel diffraction, the angular spectrum method, Fraunhofer diffraction and Shifted-Fresnel diffraction.
/content/cudazone/CUDABrowser/assets/images/applications/112_Diffraction Integrals_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/112_Diffraction Integrals_Large.jpg Academia Yamagata University http://gabor.yz.yamagata-u.ac.jp/ 2008 7 01 7/01/2008 Shimobaba Application Numerics Science Imaging Digital Content Creation Light propagation, Diffraction theory, Holography, Hologram, Computer-generated-hologram, Shimobaba 6cf70d73-92ff-4896-9f30-9a5b4c47e2b5 Cost-effective Medical Image Reconstruction This paper demonstrates that parallel implementations of modern medical imaging applications on traditional parallel architectures can be outperformed, in both speed and cost-effectiveness, by new implementations on next-generation architectures like GPUs. http://portal.acm.org/citation.cfm?id=1366230.1366278 /content/cudazone/CUDABrowser/assets/images/applications/111_Cost Effective Medical_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/111_Cost Effective Medical_Large.jpg Academia University of Muenster, Germany http://www.math.uni-muenster.de/ 2008 5 15 5/15/2008 10 Schellmann, et al Paper Imaging Algorithms, general purpose gpu programming, list-mode osem, medical image reconstruction, parallel programming, Schellmann, et al 7cf2de98-b41d-44dc-bf59-ef4f0510e9d4 Obsidian: GPU Programming in Haskell Obsidian is a GPGPU language embedded in Haskell. The goal is to simplify GPGPU programming by raising the level of abstraction while still offering control of the details necessary to write efficient programs.
/content/cudazone/CUDABrowser/assets/images/applications/110_Obsidian_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/110_Obsidian_Large.jpg Academia Chalmers University of Technology http://www.chalmers.se/ 2008 5 15 5/15/2008 Svensson, et al Presentation Libraries Embedded language, GPGPU, Functional Programming, CUDA, Schellmann, et al 20cc945a-5e1b-4473-8f25-f42a4155eb22 Accelerating Density Functional Calculations with GPU The G80 GPU accelerates the ab initio density functional calculation (Gaussian03) by a factor of 10 over the latest quad-core CPU. The errors due to single precision were found to be small enough for practical usage. New algorithms suitable for GPUs are reported. /content/cudazone/CUDABrowser/assets/images/applications/109_Accelerating Density_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/109_Accelerating Density_Large.jpg Academia Nagoya University http://www.is.nagoya-u.ac.jp/index.html.en 2008 7 04 7/4/2008 G80 40 Yasuda Paper Science Density functional theory, quantum chemistry, first-principle calculation, Yasuda e84a322a-abec-4a99-a894-d1aa60e2ffa3 Motion Tracking Using Recursive Gaussian Uses a re-implemented recursive Gaussian to track an object moving into the screen; the background can have a global displacement, and the shape of the object must not change much.
/content/cudazone/CUDABrowser/assets/images/applications/108_Motion Tracking_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/108_Motion Tracking_Large.jpg Academia Donar Team 2008 6 30 6/30/2008 Open source Donar Team Code Imaging Recursive gaussian, imaging, motion tracking, Donar Team 6928c0a0-8571-11dd-ad8b-0800200c9a66 Sliding-Windows for Rapid Object Class Localization: A Parallel Technique This paper presents a fast object class localization framework implemented on a data parallel architecture currently available in recent computers, with a case study on the implementation of Histograms of Oriented Gradients (HOG) descriptors showing a speedup over a CPU-only implementation by a factor of 34 for the application and 109 for the accelerated part. /content/cudazone/CUDABrowser/assets/images/applications/107_Sliding-Window for Rapid_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/107_Sliding-Window for Rapid_Large.jpg Academia TU Darmstadt http://www.mis.informatik.tu-darmstadt.de/ 2008 6 11 6/11/2008 109 Wojek, Dork, Schulz, Schiele Paper Imaging Science Imaging, science, object detection, object localization, HOG, SVM, Wojek, Dork, Schulz, Schiele e03fdc93-ad98-47e2-b82d-a809e92adf6c Dense Matrix-Vector Multiplication A dense matrix-vector multiplication routine up to 15.69 times faster than sgemv in CUBLAS 1.1. /content/cudazone/CUDABrowser/assets/images/applications/106_Desnse Matrix_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/106_Dense_Matrix_Large.jpg Academia Osaka Prefecture University http://www.osakafu-u.ac.jp/english/index.html 2008 4 14 4/14/2008 32 Open source Fujimoto Application Paper Code Numerics Numeric, matrix-vector multiplication, sgemv, CUBLAS, Fujimoto e120454a-0ccb-4a7f-8bb4-ec1179681443 Wait-free Programming for Computations on Graphics Processors This paper demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need of strong 
synchronization primitives in hardware and that wait-free programming is possible for GPUs. /content/cudazone/CUDABrowser/assets/images/applications/105_Wait-free Programming_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/105_Wait-free Programming_Large.jpg Academia University of Tromsø, Norway http://uit.no/informatikk 2008 7 18 7/18/2008 Phuong, Tsigas, Anshus Paper Presentation Science Non-blocking programming, consensus, read-modify-write objects, synchronization, many-core architectures, SIMD, graphics processors, Science, Phuong, Tsigas, Anshus 5ee8c199-24b4-46b6-8ef9-34f4af739e65 CUDA.NET CUDA.NET is a library that provides access to CUDA functionality from .NET based applications. It can be used on both Windows and Linux operating systems, supporting 32 and 64 bit modes of operation. Examples are provided in C# and IronPython. /content/cudazone/CUDABrowser/assets/images/applications/104_CUDANET_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/104_CUDA.NET_Large.jpg Commercial GASS Company for Advanced Supercomputing Solutions Ltd. http://www.gass-ltd.co.il/ 2008 6 07 6/7/2008 Butrashvily Application Code Library Programming Tools CUDA.NET, .NET, Library, Butrashvily a31eb28a-5d98-404d-8038-034575e9ea19 Real Time Deformable Body Physics Simulation This paper introduces an optimal solution to implement accurate simulation of deformable bodies in real time, accomplished through the use of Point Based Animation. Significant performance improvements over the CPU were observed: speedups of about 20 times were achieved in the simulation of deformable bodies with 575 physics elements (phyxels) and 53,504 surface elements (surfels).
/content/cudazone/CUDABrowser/assets/images/applications/103_Massively Parallel_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/103_Massively Parallel_Large.jpg Academia GRVM / CIn / UFPE http://www.gprt.ufpe.br/~grvm 2008 6 08 6/8/2008 24 Farias, Almeida, Teixeira, Teichrieb, Kelner Application paper Graphics Point Based Animation, meshless simulation technique, graphics, Farias, Almeida, Teixeira, Teichrieb, Kelner c5e7ae8a-f431-4c5d-877f-17b90f3ffc6e Mixed Precision Linear Solvers This report updates results from "Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations". It shows that mixed-precision schemes are still preferable to double precision alone. A significant quantitative performance improvement is observed with more powerful hardware, demonstrated in a multi-grid scheme that provided an accurate solution in Finite Element settings with one million unknowns in less than 0.1 seconds. /content/cudazone/CUDABrowser/assets/images/applications/102_Mixed_Precision_Linear_Solvers_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/102_Mixed_Precision_Linear_Solvers_large.jpg Academia TU Dortmund http://www.mathematik.uni-dortmund.de/~goeddeke/ 2008 7 08 7/08/2008 G200, T10 27 Göddeke, Strzodka Application paper Numerics Mixed precision multigrid finite element, multigrid, finite element, FEM, numerics, Göddeke, Strzodka 395b7b4c-d22d-4539-a7ba-b4157bafb2d2 Large Vocabulary Continuous Speech Recognition Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. This paper explores opportunities for parallelizing the Hidden Markov Model (HMM) based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and presents an efficient implementation on the G80 architecture.
/content/cudazone/CUDABrowser/assets/images/applications/101_LVCSR_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/101_LVCSR_Large.jpg Academia University of California Berkeley http://www.eecs.berkeley.edu/ 2008 6 21 6/21/2008 Chong, Yi, Faria, Satish, Keutzer Paper Video & Audio Speech recognition, probabilistic inference, HMM, Beam Search, LVCSR, data parallelism, graph traversal, Chong, Yi, Faria, Satish, Keutzer e4ad4db9-af51-4c58-a3aa-b20ea097ae3e Jacket: GPU Engine for MATLAB Jacket enables standard MATLAB code to run on the GPU, connecting MATLAB directly to the speed and visual computing capability of the GPU. It is a system that automatically makes memory transfer and execution optimization decisions, and it uses a compile on-the-fly system to allow GPU functions to run in MATLAB's interpretive style. This example demonstrates some of the BLAS capability of Jacket, providing several speedup benchmarks. http://www.accelereyes.com/documentation.php /content/cudazone/CUDABrowser/assets/images/applications/100_Jacket_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/100_Jacket_Large.jpg Commercial AccelerEyes LLC http://accelereyes.com/ 2008 6 16 6/16/2008 50 Melonakos Application Computational Fluid Dynamics Digital Content Creation Electronic Design Automation Finance Graphics Imaging Numerics Life Sciences Libraries Oil & Gas Science Video & Audio Matlab, Jacket, memory transfer, Pryor, Malcolm, Rehman, Melonakos e415e883-5349-4c9a-abb7-fefe294c5b08 Optical Flow Algorithm using CUDA and OpenCV It implements an optical flow algorithm using CUDA and OpenCV, achieving 90FPS on 640x480 images with a 4 level pyramid using a GeForce 8800 GTX compared to 1FPS on a Pentium 4@3GHz with 320x240 images on a 3 level pyramid. The algorithm implemented is described in Bayesian Multi-scale Differential Optical Flow, Handbook of Computer Vision and Applications.
/content/cudazone/CUDABrowser/assets/images/applications/99_Optical_Flow_Algorithm_using_CUDA_and_OpenCV_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/99_Optical_Flow_Algorithm_using_CUDA_and_OpenCV_large.jpg Academia 2008 6 18 6/18/2008 90 Hauagge Paper Code Imaging Numeric, algorithm, optical flow algorithm, optical flow, OpenCV, Hauagge 6a83932a-0df0-4e85-832d-cd3c218c24c7 Python bindings for CUDA using ctypes The application emulates the original CUDA code more closely. The .tar.gz or .rpm files contain numerous examples, many translated from the CUDA SDK examples. ftp://ftp.graviscom.com/pub/code/python-cuda/ /content/cudazone/CUDABrowser/assets/images/applications/98_Python_bindings_for_CUDA_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/98_Python_bindings_for_CUDA_large.jpg Commercial GraVisCom.de http://www.graviscom.com/ 2008 3 8 3/8/2008 Open source Paehler Paper Libraries Programming Tools CUDA, Python, Paehler ef17a9c2-4240-4c2e-970a-dd41ce505d92 Towards Acceleration of Fault Simulation This paper discusses the implementation of a fault simulator in a GPU that exploits thread level parallelism. /content/cudazone/CUDABrowser/assets/images/applications/96_Towards_Acceleration_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/96_Towards_Acceleration_Large.jpg Academia Texas A & M University http://www.tamu.edu/ 2008 6 8 6/8/2008 35 Gulati, Khatri Paper Electronic Design Automation EDA, fault simulator, simulation, electronic design automation, Gulati, Khatri ec54e761-d6dc-4fbc-a8f6-5346104cade3 Accelerating Statistical Static Timing Analysis This paper explores the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a GPU. 
/content/cudazone/CUDABrowser/assets/images/applications/95_Accelerating_Statistical_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/95_Accelerating_Statistical_Large.jpg Academia Texas A & M University http://www.tamu.edu/ 2008 5 7 5/7/2008 260 Gulati, Khatri Paper Electronic Design Automation EDA, Monte Carlo, simulation, electronic design automation, Gulati, Khatri 34737671-9717-4453-98b0-57b4ea19a3e6 Low Viscosity Flow Simulations for Animation This paper describes a fluid simulation method used in the film industry. We use CUDA to accelerate our high resolution Poisson solver to enforce fluid incompressibility. /content/cudazone/CUDABrowser/assets/images/applications/94_Low_Viscosity_Flow_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/94_Low_Viscosity_Flow_Large.jpg Commercial Rhythm and Hues, UCLA, Inst. of Geophysics and Planetary Physics http://www.rhythm.com/ 2008 7 7 7/7/2008 55 Cohen, Molemaker, Patel, Noh Paper Multimedia Computational Fluid Dynamics Fluid dynamics,multigrid,poisson solver, Cohen, Molemaker, Patel, Noh 5cbe6168-f2ee-4979-94d4-189bac136744 MIDG Discontinuous Galerkin Methods for GPU. http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html /content/cudazone/CUDABrowser/assets/images/applications/93_MIDG_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/93_MIDG_Large.jpg Academia Rice University http://www.rice.edu/ 2008 8 1 08/01/2008 50 Warburton Application Numerics Galerkin method,partial differential equation, Warburton e379ae7f-988d-44f6-ad46-4a87ff7f2bee Real Time Capture of Audio Images and Use with Video Arrays of microphone arrays provide an ability to compute the intensity of sound corresponding to different directions at a given time. Intensities may be exhibited as an image and these images updated at a high frame rate to achieve a real time video image of the sound reflections. 
/content/cudazone/CUDABrowser/assets/images/applications/94_Real_Time_Capture_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/94_Real_Time_Capture_Large.jpg Academia University of Maryland http://www.umd.edu/ 2007 9 21 9/21/2007 O'Donovan, Duraiswami, Gumerov Paper Video & Audio Imaging Imaging, audio, camera, spherical microphone arrays, O'Donovan, Duraiswami, Gumerov 745d5157-b2c7-495e-bced-1c3f6d6d7a32 Silicon Informatics Protein Docking The DockStar deskside server for AutoDock 4.0 can dramatically change workflow and thinking and increase scientific productivity and interactivity. http://www.siliconinformatics.com/products.html /content/cudazone/CUDABrowser/assets/images/applications/91_Silicon_Informatics_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/91_Silicon_Informatics_Large.jpg Commercial Silicon Informatics http://www.siliconinformatics.com/ 2008 6 16 6/16/2008 20 Silicon Informatics Application Life Sciences Life science, science, protein, drug discovery, Silicon Informatics 2e520f9d-943e-454d-8a46-865d5f351a7b High Performance Pattern Recognition on GPU This paper presents high-performance pattern recognition algorithms and their fast implementations on the GPU using CUDA. We study the Parzen windows scheme for density estimation and the Artificial Neural Network for training and classification. /content/cudazone/CUDABrowser/assets/images/applications/90_High_Performance_Pattern_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/90_High_Performance_Pattern_Large.jpg Academia International Institute of Information Technology http://www.iiit.ac.in/ 2007 11 7 11/7/2007 100 Lahabar, Agrawal, Narayanan Paper Numerics Numerics, pattern recognition, algorithms, Parzen, Artificial Neural Network, Lahabar, Agrawal, Narayanan ab502e83-66df-44cb-a81e-14dae30ac6d6 Audio FIR Crossover 4 Way FIR Crossover / Channel Divider, with 8192 TAPs FIR filter.
http://koonlab.com/CUDA_RealFIR/CUDA%20Real%20FIR.html /content/cudazone/CUDABrowser/assets/images/applications/89_Audio_FIR_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/89_Audio_FIR_Large.jpg Academia Koon lab http://koonlab.com/ 2008 5 3 5/3/2008 35 Open source Koon lab Application Code Video & Audio CrossOver FIR Audio, Koon c101fdeb-9e59-4490-b6f5-75e198022e18 PyCuda PyCuda lets you access the NVIDIA CUDA parallel computation API from Python, and offers features such as object cleanup tied to the lifetime of objects and automatic error checking. http://mathema.tician.de/software/pycuda /content/cudazone/CUDABrowser/assets/images/applications/88_PyCuda_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/88_PyCuda_Large.jpg Academia Brown University http://www.brown.edu/ 2008 6 15 6/15/2008 Open source Klöckner Code Numerics Programming Tools Numerics, PyCUDA, CUDA, Python, object cleanup, automatic error checking, Klöckner 397405d9-916f-4c22-8d5d-770001adf5c7 OpenVIDIA: Parallel GPU Computer Vision This project implements computer vision algorithms on computer graphics hardware. The project provides useful example programs which run real time computer vision algorithms on single or multiple GPU system configurations.
http://openvidia.sourceforge.net/ /content/cudazone/CUDABrowser/assets/images/applications/87_OpenVIDIA_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/87_OpenVIDIA_Large.jpg Academia University of Toronto http://www.eecg.toronto.edu/ 2004 1 7 1/7/2004 Open source Fung, Mann Paper Application Code Multimedia Imaging Imaging, computer vision algorithm, computer vision, algorithm, Canny edge, numerics, Fung, Mann e7670906-c1de-4202-acfa-3a56dc57aef9 TechniScan 3D UltraSound CT The TechniScan UltraSound CT Imaging System features include the ability to scan the whole breast and produce high resolution 3D images, which provide for easier, more accurate localization and characterization of areas identified as requiring further workup after mammography or conventional ultrasound. http://www.techniscanmedicalsystems.com/ /content/cudazone/CUDABrowser/assets/images/applications/86_TechniScan_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/86_TechniScan_Large.jpg Commercial TechniScan http://www.techniscanmedicalsystems.com/ 2007 12 31 12/31/2007 TechniScan Application Life Sciences Imaging Life sciences, imaging, medical imaging, medical equipment, CT scan, 3D, TechniScan e96a79c5-371c-496a-b851-b99b83360354 Ray Casting Algebraic Surfaces using the Frustum Form This paper discusses an algorithm for interactive ray-casting of algebraic surfaces of high degree. The authors perform numerical root-finding using B-spline and Bézier techniques, then compare them to recent and classical algorithms. The paper proposes an anti-aliasing scheme and shows how this algorithm can be implemented on streaming architectures with single precision.
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-8659.2008.01133.x /content/cudazone/CUDABrowser/assets/images/applications/84_klebsch_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/84_klebsch_Large.jpg Academia University of Oslo, Norway http://www.uio.no/english/ 2008 4 1 04/01/2008 16 Seland, et al Paper Numerics Graphics Numeric, algorithm, B-spline, Bézier, graphics, Seland, et al fbdfdeae-93a4-473e-899d-081e73858fda xNormal An application to render normal/ambient occlusion/parallax/relief maps with an integrated 3D viewer and Photoshop tools and mesh importers/exporters for 3dsmax. It makes intensive use of the GPU and multicore CPU to perform ray tracing. http://www.xnormal.net/ /content/cudazone/CUDABrowser/assets/images/applications/85_xNormal_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/85_xNormal_Large.jpg Commercial xNormal http://www.xnormal.net/ 2008 6 11 6/11/2008 Open source xNormal Application Graphics Graphic, render, rendering, ray tracing, xNormal 210dbbcf-ffdf-4145-a4d8-c89d1ca50d54 CUDA Accelerated DXT Compression NVIDIA supplies a free texture utility to create different types of texture formats for content creation tools. http://developer.nvidia.com/object/texture_tools.html /content/cudazone/CUDABrowser/assets/images/applications/02_CUDA_Accelerated_DXT_Compression_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/02_CUDA_Accelerated_DXT_Compression_Large.jpg Commercial NVIDIA http://www.nvidia.com 2008 3 18 3/18/2008 Any CUDA Open source NVIDIA Code Digital Content Creation Digital content creation, texture, NVIDIA 1218c1de-21b0-45c4-a999-c531eb2be811 Efficient Computation of Sum Products on GPUs A wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications rely on solvers for the sum-product, or marginalize a product of functions (MPF), problem.
The authors describe the results of an MPF solver that achieves excellent results on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/03_Efficient_Computation_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/03_Efficient_Computation_Large.jpg Academia 2008 7 12 7/12/2008 Any CUDA 270 Silberstein, Schuster, Geiger, Patney, Owens Paper Numerics Algorithms, Numerics, Mathematics, Silberstein, Schuster, Geiger, Patney, Owens 2f6af007-7448-4114-abc9-c41b39265b76 Programming Algorithms-by-Block Made Easy The FLAME project is a framework for linear algebra. This paper explains how, when applied to a new architecture (GPU), an out-of-the-box solution attains high performance almost effortlessly. The FLAME project has been studying the question of parallel programming in the context of dense and banded matrix computations. In this paper the authors address the programmability issue head-on and show that their solution, which departs from the traditional evolutionary path, supports portability to new architectures by demonstrating their work on an NVIDIA multi-GPU system. /content/cudazone/CUDABrowser/assets/images/applications/04_Programming_Algorithms_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/04_Programming_Algorithms_Large.jpg Academia 2008 11 1 01/01/2008 Any CUDA Castillo, Chan, Igual, Mayo, Quintana-Ortí, van de Geijn, Van Zee Paper Numerics Algorithms, Numerics, Mathematics, linear algebra, libraries, high-performance, multithreaded architectures, Castillo, Chan, Igual, Mayo, Quintana-Ortí, van de Geijn, Van Zee e7e74109-3b6e-49c4-9a30-dfc5b74f986c Highly Optimized Object-oriented Molecular Dynamics: HOOMD HOOMD stands for Highly Optimized Object Oriented Molecular Dynamics. It performs general purpose molecular dynamics simulations on a single workstation, taking advantage of the NVIDIA GPUs to attain a level of performance equivalent to 30 processor cores on a fast cluster.
http://www.external.ameslab.gov/hoomd/download.html /content/cudazone/CUDABrowser/assets/images/applications/05_Highly_Optimized_Object_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/05_Highly_Optimized_Object_Large.jpg Academia Ames Laboratory, United States Department of Energy http://www.external.ameslab.gov/hoomd/index.html 2008 2 1 02/01/2008 Any CUDA 15 Open source, BSD Anderson, et al Code Science Molecular dynamics, HOOMD, biophysics, Anderson, et al d6c799c2-7704-4e68-a188-022c8a40b02a AES Cryptography Acceleration This paper presents a study of how efficiently modern GPUs can be applied to symmetric key cryptographic solutions. It describes an efficient implementation of the Advanced Encryption Standard (AES) algorithm on the novel CUDA platform by NVIDIA. /content/cudazone/CUDABrowser/assets/images/applications/06_CUDA compatible GPU_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/06_CUDA compatible GPU_Large.jpg Research 2007 11 1 11/01/2007 Any CUDA 12 Manavski Paper Numerics Numerics, Manavski 07341e38-d5ed-462a-8891-6e9803d13697 Visualization of Meshless Simulations Using Fourier Volume Rendering This paper discusses the implementation of the Fourier volume rendering technique on graphics hardware, and demonstrates its usefulness in visualizing data produced by both astrophysical and fluid dynamics simulations.
http://cds.gmu.edu/~acorriga/pubs/meshless_fvr/ /content/cudazone/CUDABrowser/assets/images/applications/07_Visualization_of_Meshless_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/07_Visualization_of_Meshless_Large.jpg Academia George Mason University http://cds.gmu.edu/ 2007 7 1 07/01/2007 Open source Corrigan, Wallin, Vesenjak Paper Code Multimedia Science Computational Fluid Dynamics Astrophysics, fluid dynamics, Fourier, hydrodynamics, Corrigan, Wallin, Vesenjak 194c280d-6e0e-48ad-84d0-442383dad19c Scalable Molecular Dynamics: NAMD This article introduces concepts and methods used in the NAMD program and provides a list of the key features of NAMD. It describes the benefits of combining NAMD with the molecular graphics/sequence analysis software, VMD, and the grid computing/collaboratory software, BioCoRE. http://www.ks.uiuc.edu/Research/namd/ /content/cudazone/CUDABrowser/assets/images/applications/08_Scalable_Molecular_Dynamics_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/08_Scalable_Molecular_Dynamics_Large.jpg Academia University of Illinois, Urbana-Champaign http://www.ks.uiuc.edu/ 2005 10 1 10/01/2005 Open source Phillips, Braun, Wang, Gumbart, Tajkhorshid, Villa, Chipot, Skeel, Kalé, Schulten Paper Code Life sciences Biomolecular simulation, molecular dynamics, parallel computing, Phillips, Braun, Wang, Gumbart, Tajkhorshid, Villa, Chipot, Skeel, Kalé, Schulten 3e29b9f9-4166-41eb-97e7-8ba488703924 Automated Dynamic Analysis of CUDA Programs This paper presents an automated analysis technique that can be run directly in CUDA's device emulation mode to help programmers find and solve subtle bugs in programs that are too complex to analyze manually.
/content/cudazone/CUDABrowser/assets/images/applications/09_Automated_Dynamic_analysis_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/09_Automated_Dynamic_analysis_Large.jpg Academia http://web.mit.edu/rabbah/www/conferences/08/ 2008 4 1 04/01/2008 Open source Boyer, Skadron, Weimer Paper Libraries Analysis,memory, Boyer, Skadron, Weimer 473d6a2d-1359-4212-9586-320d648f5469 GLAME@lab API for Linear Algebra Operations on GPUs This paper describes the implementation and performance evaluation of three different variants of the Cholesky factorization, on two high-level APIs to use a GPU as a coprocessor for dense linear algebra operations. /content/cudazone/CUDABrowser/assets/images/applications/10_GLAME_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/10_GLAME_Large.jpg Academia Universitat Jaume I http://www.uji.es/ 2008 2 1 02/01/2008 GeForce 8800 Ultra (G80 processor) Barrachina, et al Paper Graphics Graphics processors (GPUs), general purpose computing on GPU, linear algebra, BLAS, high performance, Barrachina, et al 415c9ac6-7945-458e-88ae-f036b45a697a Remote Rendering with CUDA This paper presents the utilization of advanced programming techniques on current graphics hardware to improve the performance of remote rendering for interactive applications. 
http://www.nvidia.com/object/io_1200981635689.html /content/cudazone/CUDABrowser/assets/images/applications/11_CUDA_Supported_Approach_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/11_CUDA_Supported_Approach_Large.jpg Academia University of Paderborn http://www.uni-paderborn.de/en/ 2007 9 1 09/01/2007 GeForce 8800 GTS Lietsch, Marquardt Paper Graphics Graphics,rendering,visualization, Lietsch, Marquardt 4804ed9b-996c-4cbc-a091-1c7c3ff3abf8 Molecular Dynamics Simulations This paper presents a new approach to high performance molecular dynamics simulations on GPUs, facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. http://www.springerlink.com/content/p106n8501059l077/ /content/cudazone/CUDABrowser/assets/images/applications/12_Molecular_Dynamics_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/12_Molecular_Dynamics_large.jpg Academia Nanyang Technological University http://www.ntu.edu.sg/publicportal/ 2007 12 1 12/01/2007 Liu, Schmidt, Voss, Mailler-Wittig Application Life Sciences Molecular dynamic,molecule,biology, Liu, Schmidt, Voss, Mailler-Wittig 9d99ed75-fd2a-49fe-9951-d19f6dbc8637 Improved Magnetic Resonance Imaging (MRI) Quality This paper describes how the reconstruction algorithm leverages the resources of the G80 GPU to achieve over 150 GFLOPS in performance. The G80 helps to dramatically reduce the algorithm's required bandwidth to off-chip memory, while providing substantial acceleration for the trigonometric computations in the algorithm's inner loops, resulting in a significant performance increase. 
/content/cudazone/CUDABrowser/assets/images/applications/13_Improved_MRI_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/13_Improved_MRI_Large.jpg Academia University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/ 2007 10 1 October 2007 GeForce 8800 GTX (G80) Stone, et al Paper Life Sciences MRI,magnetic resonance imaging, Stone, et al df71897e-b47a-428e-b12f-fafdc0b537d3 Fast Multipole Methods on Graphics Processors GPUs contain a large number of processing units with access to local and shared memory, and achieve significant speedups vis-a-vis CPUs on problems that can be mapped to their Single Program Multiple Data (SPMD) architecture. This paper describes how our FMM algorithm achieves timings that, if computed using an O(N^2) algorithm, correspond to speeds of 25-45 Tflops (for achieved L2 errors of ~10^-6 to 2x10^-4). http://www.nvidia.com/object/io_1195169962941.html /content/cudazone/CUDABrowser/assets/images/applications/14_Fast_Multiple_Methods_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/14_Fast_Multiple_Methods_Large.jpg Academia University of Maryland http://www.umd.edu/ 2007 10 1 10/01/2007 GeForce 8800 GTX Gumerov, Duraiswami Paper Libraries Fast Multipole Method, GPU, GPGPU, Personal Supercomputing, NVIDIA CUDA, GPU/Multicore Development Environment, Gumerov, Duraiswami df6142a1-ec69-44ec-8ba4-853403e1f34a FIR and QR Decomposition on GPUs This paper describes the implementation of two HPEC Challenge benchmarks (Finite Impulse Response and QR decomposition) on NVIDIA GPUs using a data-parallel implementation approach, as well as results compared to calculations on a CPU. 
/content/cudazone/CUDABrowser/assets/images/applications/15_FIR_and_QR_Decomposition_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/15_FIR_and_QR_Decomposition_Large.jpg Academia MIT http://web.mit.edu/ 2007 9 1 September 2007 GeForce 8800 GTX 35 McGraw-Herdeg, Enright, Michel Paper Science Computation,HPEC,parallel algorithm, McGraw-Herdeg, Enright, Michel 05ceb304-a1a9-474e-8ed7-29dfb61f358e Molecular Dynamics Simulations on GPUs This paper discusses an implementation of molecular dynamics simulations on a GPU in the CUDA language. Results for two algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPUs is compared to their main processor counterpart. http://eprintweb.org/S/article/cond-mat/0709.3225 /content/cudazone/CUDABrowser/assets/images/applications/16_Molecular_Dynamics_Simulations_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/16_Molecular_Dynamics_Simulations_Large.jpg Academia Universiteit van Amsterdam http://www.uva.nl/start.cfm/la=en/th=main 2007 9 1 09/01/2007 GeForce 8800 GTX 150 Open Source van Meel, Arnold, Frenkel, Portegies Zwart, Belleman Paper Code Life Sciences Molecular dynamics,simulation, van Meel, Arnold, Frenkel, Portegies Zwart, Belleman 6a5b108f-8260-4481-a1c1-f946acb50255 Accelerating Molecular Modeling with GPUs This article presents an overview of recent advances in programmable GPUs, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed. 
http://www3.interscience.wiley.com/journal/116323814/abstract /content/cudazone/CUDABrowser/assets/images/applications/17_Accelerating_Molecular_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/17_Accelerating_Molecular_large.jpg Academia University of Illinois, Urbana-Champaign http://www.ks.uiuc.edu/ 2007 07 01 07/01/2007 Stone, et al Application Life Sciences GPU computing, CUDA, parallel computing, molecular modeling, electrostatic potential, multilevel summation, molecular dynamics, ion placement, multithreading, graphics processing unit, Stone, et al 2518fa4c-903f-438d-a48a-aeb104d1b8f9 Increased Performance of Digital Forensics Tools This paper presents the results of a number of experiments that evaluate the effectiveness of offloading processing common to digital forensics tools to a GPU, using "massive" numbers of threads to parallelize the computation. These results are compared to speedups obtainable by simple threading schemes appropriate for multi-core CPUs. /content/cudazone/CUDABrowser/assets/images/applications/18_Increased_Performance_Digital_Forensics_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/18_Increased_Performance_Digital_Forensics_Large.jpg Academia University of New Orleans http://www.uno.edu/ 2007 8 1 August 2007 GeForce 8800GTX (G80) Marziale, Richard, Roussev Paper Science Forensic, Marziale, Richard, Roussev 0885bce4-5ead-478e-807d-8bb29b38f831 Scan Primitives for GPU Computing This paper describes GPU implementations of scan primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA programming model implemented in the C language. 
Using the scan primitives, the paper shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver. http://www.nvidia.com/object/io_1195170133199.html /content/cudazone/CUDABrowser/assets/images/applications/19_Scan_Primitives_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/19_Scan_Primitives_Large.jpg Academia University of California Davis http://www.ucdavis.edu/index.html 2007 8 1 08/01/2007 NVIDIA 8-series (G80) 12 Sengupta, Harris, Zhang, Owens Paper Libraries Parallel computing, general purpose computing, scan primitives, segmented scan, Sengupta, Harris, Zhang, Owens 5dc43cde-e9be-44af-a0f0-4243411a22a3 N-body Simulations in CUDA This paper presents the results of gravitational direct N-body simulations using the GPU. The force evaluation of the N-body problem is implemented in CUDA using the GPU to speed up the calculations, and the implementation is tested on three different N-body codes: two direct N-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. 
http://arxiv.org/abs/0707.0438v2 /content/cudazone/CUDABrowser/assets/images/applications/20_N_body_Simulations_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/20_N_body_Simulations_Large.jpg Academia Universiteit van Amsterdam http://www.uva.nl/start.cfm/la=en/th=main 2007 07 01 07/01/2007 GeForce 8800GTX Belleman, Baedorf, Portegies Zwart Paper Science Gravitation,stellar dynamics,N-body simulation,numerical, Belleman, Baedorf, Portegies Zwart cd2d33ac-8dfd-4845-a2e8-89ec3bd4095e Graphic-Card Cluster for Astrophysics (GraCCA) This paper describes the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. To demonstrate this computing cluster's performance in astrophysics computation, the authors implemented a parallel direct N-body simulation program with a shared time-step algorithm on this system, and reported performance results and comparisons. http://arxiv.org/abs/0707.2991 /content/cudazone/CUDABrowser/assets/images/applications/21_Graphic_Card_Cluster_2_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/21_Graphic_Card Cluster_2_Large.jpg Academia National Taiwan University http://www.ntu.edu.tw/engv4/ 2008 1 1 01/01/2008 GeForce 8800 GTX 250 Schive, Chiena, Wong, Tsai, Chiueh Paper Science Gravitation,stellar dynamics,N-body simulations,numerical, Schive, Chiena, Wong, Tsai, Chiueh fb45c4f8-549c-4d31-9a84-25c92313ebaf The Chamomile Scheme: N-body Simulations This paper presents an algorithm named the "Chamomile Scheme". The scheme is fully optimized for calculating gravitational interactions on a programmable GPU, which has (a) small but fast shared memories with no broadcasting mechanism and (b) floating point arithmetic hardware of 500 Gflop/s but only for single precision. 
Based on this scheme, the authors developed a library for gravitational N-body simulations, "CUNBODY-1", whose measured performance reaches 173 Gflop/s for 2048 particles and 256 Gflop/s for 131072 particles. http://arxiv.org/abs/astro-ph/0703100 /content/cudazone/CUDABrowser/assets/images/applications/22_The_Chamomile_Scheme_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/22_The_Chamomile_Scheme_Large.jpg Research Computational Astrophysics Laboratory, RIKEN http://atlas.riken.go.jp/ 2007 3 1 03/01/2007 GeForce8800GTX Hamada, Iitaka Paper Science Stellar Dynamics,numerical,N-body simulations, Hamada, Iitaka 6fe4156e-2e59-4913-931c-8d975619cd06 Smith-Waterman Sequence Alignment This paper exploits the huge computational power of commonly available GPUs to develop high performance solutions for sequence alignment, as industry development and increasing demands make using the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches. http://www.biomedcentral.com/1471-2105/9/S2/S10 /content/cudazone/CUDABrowser/assets/images/applications/23_Smith_Waterman_Sequence_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/23_Smith_Waterman_Sequence_large.jpg Academia Universita Degli Studi Di Padova http://www.unipd.it/en/ 2008 3 1 03/01/2008 GeForce 8800 GTX 30 Manavski, Valle Paper Life Sciences Molecular biology,Smith-Waterman algorithm,protein,DNA,FASTA,BLAST, Manavski, Valle edbc17aa-a223-42fa-842c-7b2d4b80cd75 MUMmerGPU: High-throughput Sequence Alignment This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on GPUs in common workstations. 
MUMmerGPU uses CUDA to align multiple query sequences against a single reference sequence stored as a suffix tree, providing a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. http://www.biomedcentral.com/1471-2105/8/474 /content/cudazone/CUDABrowser/assets/images/applications/24_MUMmerGPU_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/24_MUMmerGPU_Large.jpg Academia University of Maryland http://www.umd.edu/ 2007 12 1 12/01/2007 G80 10 Schatz, Trapnell, Delcher, Varshney Paper Life Sciences DNA sequencing, sequence alignment, MUMmer, genotyping, genome resequencing, metagenomics, de novo genome assembly, parallel computing, Schatz, Trapnell, Delcher, Varshney 6be79b95-7535-41e5-bfb7-e89d5ec9305b Two-electron Integral Evaluation The paper proposes an algorithm to evaluate the Coulomb potential in ab initio density functional calculations on the GPU. It discusses in detail the numerical accuracy required for the algorithm. http://www3.interscience.wiley.com/journal/114287520/abstract /content/cudazone/CUDABrowser/assets/images/applications/25_Two-electron Integral_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/25_Two-electron Integral_Large.jpg Academia Nagoya University http://www.nagoya-u.ac.jp/en/ 2007 4 1 04/01/2007 GeForce 8800 GTX Yasuda Application Science Algorithm, Coulomb, computing, Gauss-Rys, two-electron integrals, quantum chemistry, first-principle calculation, Yasuda c3e22827-0b47-4e95-b4ba-c073ee1fc74a Interactive Visualization of Volumetric White Matter Connectivity Diffusion tensor magnetic resonance imaging (DT-MRI) using a parallel Hamilton-Jacobi (H-J) equation solver implemented in CUDA running on an NVIDIA GeForce 8800GTX. 
/content/cudazone/CUDABrowser/assets/images/applications/26_Interactive_Visualization_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/26_Interactive_Visualization_Large.jpg Research University of Utah http://www.cs.utah.edu/ 2007 10 1 October 2007 G80+ 100 Jeong, Fletcher, Tao, Whitaker Paper Life Sciences GPGPU CUDA MRI DT-MRI H-J Hamilton-Jacobi PDE parallel, Diffusion tensor visualization, graphics hardware, interactivity, fast iterative method (FIM), NIH, Jeong, Fletcher, Tao, Whitaker e223fbfc-f017-498c-8174-699747dbd88b Accelerating Distributed Storage Systems with CUDA Hashing module algorithms in CUDA: SHA1 and MD5. http://www.ece.ubc.ca/~samera/projects/StoreGPU/ /content/cudazone/CUDABrowser/assets/images/applications/27_Accelerating_Distributed_Storage_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/27_Accelerating_Distributed_Storage_large.jpg Academia The University of British Columbia http://www.ubc.ca/ 2008 01 01 01/01/2008 G80+ 9 Al-Kiswany, Gharaibeh, Santos-Neto, Yuan, Ripeanu Paper Code Numerics MD5,SHA1,CTM, Al-Kiswany, Gharaibeh, Santos-Neto, Yuan, Ripeanu aecf6efa-875d-47d1-a3f1-429316708e70 Astrophysical N-body Simulation An optimized C/C++/Fortran library to accelerate N-body interactions using CUDA on NVIDIA GPUs http://progrape.jp/cs/ /content/cudazone/CUDABrowser/assets/images/applications/28_Astrophysical_N_body_Simulation_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/28_Astrophysical_N_body_Simulation_Large.jpg Academia Genomic Sciences Center, RIKEN http://mdgrape.gsc.riken.jp/ 2007 7 27 7/27/2007 G8x and up Paper Code Science Numerics Astrophysics,N-body,library 248f3de9-aa4c-4a56-9980-76eb1057ce4d pystream: Stream and GPU computing in Python PyStream enhances Python with seamless access to CUDA libraries including the CUDA BLAS and FFT libraries. 
http://code.google.com/p/pystream/ /content/cudazone/CUDABrowser/assets/images/applications/29_pystream_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/29_pystream_Large.jpg Commercial Tech-X Corporation http://www.txcorp.com/ 2007 12 31 12/31/2007 G80 and up Tech-X Corporation Paper Code Numerics Python, language bindings, high performance computing, stream computing, Tech-X Corporation 3af4b289-a5ff-4d4e-bb63-ab493aa101a5 Biomedical Image Analysis Large scale biomedical image analysis applications on heterogeneous systems with multiple processors and multiple GPUs. /content/cudazone/CUDABrowser/assets/images/applications/30_Biomedical_Image_Analysis_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/30_Biomedical_Image_Analysis_Large.jpg Academia Dept of Biomedical Informatics, Ohio State http://bmi.osu.edu/ 2008 6 7 6/7/2008 G80 13 Hartley, Catalyurek, Ruiz, Igual, Mayo, Ujaldon Paper Life Sciences Imaging Biomedical Imaging, heterogeneous computing, cluster, high performance computing, Hartley, Catalyurek, Ruiz, Igual, Mayo, Ujaldon cf8a8ada-2505-4fc3-8584-567addfc8c02 3D Euler Solver Two- and three-dimensional Euler solvers are ported to the GPU, achieving 16x and 29x speedups respectively. /content/cudazone/CUDABrowser/assets/images/applications/31_3D_Euler_Solver_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/31_3D_Euler_Solver_Large.jpg Academia Whittle Laboratory, University of Cambridge http://www-g.eng.cam.ac.uk/whittle/ 2008 1 1 January 2008 G80 and up 29 Brandvik, Pullan Paper Computational Fluid Dynamics Numerics Euler Solver, Brandvik, Pullan f49c3ba6-2bd2-426f-b42e-5b36899ca0d3 Lattice Boltzmann Kernel using CUDA A 2D Lattice Boltzmann kernel is accelerated using CUDA, and performance is shown in an example of flow through a generic porous medium. 
/content/cudazone/CUDABrowser/assets/images/applications/32_Lattice_Boltzmann_Kernal_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/32_Lattice_Boltzmann_Kernal_Large.jpg Academia TU Braunschweig http://www.tu-braunschweig.de/ 2008 2 1 02/01/2008 G80 and up 10 Tolke Paper Computational Fluid Dynamics CFD, Lattice Boltzmann, Tolke fb370bb1-f723-4cf8-a3f1-c29eac806c20 AstroGPU 2007 Workshop Video collection of presentations at the AstroGPU 2007 Workshop on GPUs in Astronomy and Astrophysics. /content/cudazone/CUDABrowser/assets/images/applications/33_AstroGPU_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/33_AstroGPU_Large.jpg Academia AstroGPU 2007 http://www.astrogpu.org/ 2007 11 01 11/01/2007 Multimedia Science Astrophysics, Photo/Imaging 28b28fb9-ff52-4086-ace1-be86818553b0 GPULib: Library of Mathematical Functions GPULib allows users to harness the computational power of GPUs from high level languages and environments such as Python, MATLAB, and IDL. http://www.txcorp.com/technologies/GPULib/download.php /content/cudazone/CUDABrowser/assets/images/applications/34_Library_of_mathematical_functions_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/34_Library_of_mathematical_functions_Large.jpg Commercial Tech-X Corporation http://www.txcorp.com/technologies/GPULib/ 2008 3 6 3/6/2008 G80 and up 40 Commercial Tech-X Corporation Code Numerics Numerics/Algorithms/Libraries, MATLAB, Python, Programming Languages, Interpreters, Tech-X Corporation ea24bf13-69ba-432a-9c47-deae561251e5 GPU Acceleration Solutions Acceleware's products leverage NVIDIA GPUs to provide solutions for processing computationally intensive applications. Acceleware provides Seismic solutions and Imaging solutions. 
http://www.acceleware.com /content/cudazone/CUDABrowser/assets/images/applications/35_GPU_Acceleration_Solution_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/35_GPU_Acceleration_Solution_Large.jpg Commercial Acceleware http://www.acceleware.com/ 2007 12 31 12/31/2007 G80 and up 35 Commercial Acceleware Application Imaging Oil & Gas Astrophysics, Photo/Imaging, Acceleware 6716afdc-9d38-4f80-a103-b4f53d389a1b Cmatch: Fast Exact String Matching on the GPU A string matching kernel whose search times are proportional to the length of the query string rather than to the size of the text searched. http://www.cbcb.umd.edu/software/cmatch/ /content/cudazone/CUDABrowser/assets/images/applications/36_Cmatch_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/36_Cmatch_Large.jpg Academia Center for Bioinformatics & Computational Biology, University of Maryland http://www.cbcb.umd.edu/ 2007 05 01 05/01/2007 G80, T10P 35 Open source Schatz, Trapnell Paper Code Presentation Life Sciences String match,computational biology,suffix tree,data reordering, Schatz, Trapnell f54917b8-caab-4705-9aef-6ced78aa72d6 General Purpose Molecular Dynamics Simulations This paper and code show that our GPU implementation provides performance equivalent to that of a fast thirty-core distributed-memory cluster. 
http://www.ameslab.gov/hoomd/index.html /content/cudazone/CUDABrowser/assets/images/applications/37_General_Purpose_Molecular_Dynamics_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/37_General_Purpose_Molecular_Dynamics_Large.jpg Academia Ames Laboratory, United States Department of Energy http://www.external.ameslab.gov/ 2008 2 1 02/01/2008 G80, T10P 30 Open source Code Life Sciences Molecular dynamics,HOOMD c8a25cdb-3816-49bc-a7ac-d22b58473d56 Quantum Mechanical Calculations of Molecular Properties The modification of a general purpose code for quantum mechanical calculations of molecular properties (Q-Chem) to use a graphical processing unit (GPU) is reported. http://pubs.acs.org/cgi-bin/abstract.cgi/jpcafh/2008/112/i10/abs/jp0776762.html /content/cudazone/CUDABrowser/assets/images/applications/38_Quantum_Mechanical_Calculations_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/38_Quantum_Mechanical_Calculations_large.jpg Academia 2007 11 1 11/01/2007 G80 4 Vogt, et al Application Science Moller-Plesset, Quantum Chemistry, Vogt, et al a0302629-3147-4aa5-9e29-74e5875036d5 Fast GPU-Based CT Reconstruction Application of GPU acceleration to the FDK method of CBCT image reconstruction, with on-the-fly reconstruction for presentation immediately after scanning. /content/cudazone/CUDABrowser/assets/images/applications/39_Fast_GPU_Based_CT_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/39_Fast_GPU_Based_CT_large.jpg Academia 2007 11 1 11/01/2007 G8x 2 Scherl, Keck, Kowarschik, Hornegger Paper Imaging CT, FDK reconstruction, FDK algorithm, CUDA, Scherl, Keck, Kowarschik, Hornegger 037a9a90-0c50-4769-965e-a6eec6d448b1 MapReduce Framework MapReduce interface (a software framework implemented by Google to support parallel computations on large datasets) using GPUs. 
http://www.cse.ust.hk/gpuqp /content/cudazone/CUDABrowser/assets/images/applications/40_MapReduce_Framework_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/40_MapReduce_Framework_Large.jpg Academia 2007 11 25 11/25/2007 10 He, Fang, Govindaraju, Luo, Wang Paper Code Numerics MapReduce,search, He, Fang, Govindaraju, Luo, Wang b6ff75f6-1b1c-4faf-bc5a-77fb92193382 Dirac Video Codec Acceleration of the wavelet-based Dirac Video Codec (DVC), including overlapped block motion compensation, wavelet transforms, and frame arithmetic. Used for better compression rates and real-time decompression for streaming over low-bandwidth networks. http://www.cs.rug.nl/~wladimir/sc-cuda/ /content/cudazone/CUDABrowser/assets/images/applications/41_Dirac_Video_Codec_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/41_Dirac_Video_Codec_Large.jpg Academia University of Groningen http://www.rug.nl/corporate/index 2007 12 1 December 2007 Open source van der Laan, et al Paper Code Video & Audio Codec,video,streaming, van der Laan, et al 6ad74c31-fdb2-4fd2-a322-f7ee4d8d7ee1 Quantum Chemistry Two-Electron Integral Evolution Use of GPUs to calculate two-electron repulsion integrals over Gaussian basis functions. http://pubs.acs.org/cgi-bin/abstract.cgi/jctcce/2008/4/i02/abs/ct700268q.html /content/cudazone/CUDABrowser/assets/images/applications/42_Quantum_Chemistry_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/42_Quantum_Chemistry_large.jpg Academia Beckman Institute, University of Illinois at Urbana-Champaign http://www.beckman.uiuc.edu/ 2008 1 1 01/01/2008 130 Ufimtsev, et al Application Science Computational chemistry, Ufimtsev, et al e60b1da5-bf35-47ea-8a16-5b471358e63d Teraflop CFD Computing Implementation of a Lattice Boltzmann (LB) kernel based on a D3Q13 model. 
/content/cudazone/CUDABrowser/assets/images/applications/43_Towards_3D_teraflop_CFD_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/43_Towards_3D_teraflop_CFD_Large.jpg Academia TU Braunschweig http://www.tu-braunschweig.de/ 2008 2 1 02/01/2008 100 Tolke, Krafczyk Paper Computational Fluid Dynamics Fluids,Lattice Boltzmann, Tolke, Krafczyk 92e63d97-f7f8-47e6-938e-a23ac28444d8 OmegaSim GX Hardware-Accelerated SPICE Simulator SPICE simulator for analog and mixed-analog-digital circuits. http://www.nascentric.com/omegasim_gx.html /content/cudazone/CUDABrowser/assets/images/applications/45_OmegaSim_GX_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/45_OmegaSim_GX_Large.jpg Commercial Nascentric http://www.nascentric.com/ 2008 4 1 04/01/2008 8 Commercial Nascentric Application Electronic Design Automation EDA,SPICE, Nascentric 00b5aac0-8570-11dd-ad8b-0800200c9a66 A Neural Network on GPU Implementation of a neural network with CUDA. /content/cudazone/CUDABrowser/assets/images/applications/46_Neutral_Network_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/46_Neutral_Network_Large.jpg Academia University of California Davis http://www.ucdavis.edu/ 2008 3 14 3/14/2008 10 Open Source Billconan, Kavinguy Paper Code Multimedia Life Sciences Neural network, Billconan, Kavinguy d5f22f44-568c-4be8-9c30-e631e0b37fea General Relativistic Evolution Code Implementation of a finite-differencing code for solving Einstein's field equations on a GPU. 
/content/cudazone/CUDABrowser/assets/images/applications/47_General_Relativistic_Evolution_Code_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/47_General_Relativistic_Evolution_Code_Large.jpg Academia Center for Computation and Technology, and Department of Physics and Astronomy, Louisiana State University http://www.cct.lsu.edu/home 2008 1 1 01/01/2008 26 Zink, Burkhard Paper Science Computational physics, Zink, Burkhard ed4fdee7-a95b-461b-8dc8-394f4ce1dd8b Relational Joins on Graphics Processors Implementation of indexed or non-indexed nested-loop, sort-merge and hash joins using a set of data-parallel primitives such as split and sort. /content/cudazone/CUDABrowser/assets/images/applications/48_Relational_Joins_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/48_Relational_Joins_Large.jpg Academia 2008 6 1 06/01/2008 7 He, Yang, Fang, Lu, Govindaraju, Luo, Sander Paper Numerics Algorithm, He, Yang, Fang, Lu, Govindaraju, Luo, Sander d3acfd2b-6d24-4718-bff6-980d0b5116c7 Level 3 CUBLAS Evaluation of the performance of Level 3 operations in CUBLAS. /content/cudazone/CUDABrowser/assets/images/applications/49_Level_3_CUBLAS_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/49_Level_3_CUBLAS_Large.jpg Academia Departamento de Ingenieria y Ciencia de los Computadores, Universitat Jaume I http://www.uji.es/CA/departaments/icc/ 2008 4 1 04/01/2008 Barrachina, Castillo, Igual, Mayo, Quintana-Orti Paper Numerics Linear algebra, Barrachina, Castillo, Igual, Mayo, Quintana-Orti 229a0477-cf7f-4ee2-bc86-0c91ef1d8332 SnapCT: Tomography Volume Reconstruction Accelerated tomography volume reconstruction. 
http://www.digisens.fr/snapct/ /content/cudazone/CUDABrowser/assets/images/applications/50_SnapCT_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/50_SnapCT_Large.jpg Commercial Digisens SA http://www.digisens.fr/en/ 2007 12 31 12/31/2007 50 Commercial Digisens SA Application Imaging Tomography,medical imaging, Digisens SA 450f1e45-cf76-43ee-8b6f-fa2d8af8202c Distributed Password Recovery High-Performance Distributed Password Recovery from the operating system, Microsoft Office products, Adobe PDF files, ZIP and RAR archives, and a variety of other applications. http://elcomsoft.com/edpr.html/ /content/cudazone/CUDABrowser/assets/images/applications/51_Distributed_Password_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/51_Distributed_Password_Large.jpg Commercial Elcomsoft http://www.elcomsoft.com/ 2008 12 31 12/31/2008 50 Elcomsoft Code Science Password recovery,forensic, Elcomsoft a5b6f331-06dd-452e-8120-54c1ee3b1ba9 Smith-Waterman Sequence Alignment Exact Smith-Waterman sequence alignment on CUDA. http://www.biomedcentral.com/1471-2105/9/S2/S10 /content/cudazone/CUDABrowser/assets/images/applications/52_Smith_Waterman_Sequence_Alignment_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/52_Smith_Waterman_Sequence_Alignment_large.jpg Academia Universita Degli Studi Di Padova http://www.cribi.unipd.it/ 2008 3 26 3/26/2008 30 Ruiz, Ujaldon, Cooper, Huang Paper Life Sciences Smith-Waterman, Gene, sequencing, DNA, molecular biology, FASTA, BLAST, medical, Ruiz, Ujaldon, Cooper, Huang 44c31276-bccd-4eda-af9f-5a6eb60ad923 Simulation Open Framework Architecture (SOFA) GPU-based Gauss-Seidel Algorithm for Dense Matrices. 
http://www.sofa-framework.org/ /content/cudazone/CUDABrowser/assets/images/applications/53_Simulation_Open_Framework_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/53_Simulation_Open_Framework_Large.jpg Research Institut National De Recherche En Informatique Et En Automatique http://www.inria.fr/ 2007 2 1 02/01/2007 55 Open source Allard, et al Paper Code Numerics Medical, simulation, dense matrix, Gauss-Seidel, Allard, et al a7ace7a8-69c7-48c9-9a5f-495d8827e3dc Non-rigid Registration for Large Set of Microscopic Images 3D reconstruction and visualization of tissue structures from large sets of microscopic images. /content/cudazone/CUDABrowser/assets/images/applications/54_Non_Ridgid_Registration_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/54_Non_Ridgid_Registration_Large.jpg Academia Universidad de Malaga http://www.ac.uma.es/ 2008 4 1 04/01/2008 4 Ruiz, Ujaldon, Cooper, Huang Paper Life sciences Medical,imaging,Microscopic imaging,3D reconstruction, Ruiz, Ujaldon, Cooper, Huang 55062578-eb80-4a33-b1a4-90fb21746ebf Solving Dense Linear Systems on GPUs Algorithms to compute the solution of a linear system of equations on a GPU. /content/cudazone/CUDABrowser/assets/images/applications/55_Solving_Dense_Linear_Systems_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/55_Solving_Dense_Linear_Systems_Large.jpg Academia Departamento de Ingenieria y Ciencia de los Computadores, Universitat Jaume I http://www.uji.es/CA/departaments/icc/ 2008 4 1 04/01/2008 3 Barrachina, Castillo, Igual, Mayo, Quintana-Orti Paper Numerics Linear algebra, BLAS, Linear systems, Cholesky factorization, LU factorization, Barrachina, Castillo, Igual, Mayo, Quintana-Orti 63beacf6-b813-4932-aba9-62f9ce118774 Geometric Algorithms with CUDA Solve basic geometric problems on 3D meshes like the point inclusion test or the self-intersection detection. 
/content/cudazone/CUDABrowser/assets/images/applications/56_Geometric_Algorithms_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/56_Geometric_Algorithms_Large.jpg Academia Universidad de Jaen http://www.ujaen.es/ 2008 1 29 1/29/2008 100 Rueda, Ortega Paper Numerics 3D meshes, inclusion test, self-intersection test, geometric, Rueda, Ortega b67217ab-626a-4163-9ea7-bb219d660698 Numerical Weather Prediction CUDA-based speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model. /content/cudazone/CUDABrowser/assets/images/applications/57_Numerical_Weather_Prediction_2_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/57_Numerical_Weather_Prediction_2_Large.jpg Research The Weather Research & Forecasting Model http://wrf-model.org/index.php 2007 12 19 12/19/2007 1.3 Michalakes, Vachharajani Application Paper Computational Fluid Dynamics Weather,computational fluid dynamics,CFD,microphysics,thermodynamics, Michalakes, Vachharajani 0ccfe013-7de1-4269-929f-32012797ce28 MDGPU: Molecular Dynamics Simulation The MDGPU simulation package presents a framework for Molecular Dynamics simulations, where the most computationally intensive parts are offloaded to the GPU and system-dependent tasks are conveniently performed on the CPU. http://www.amolf.nl/~vanmeel/mdgpu/download.html /content/cudazone/CUDABrowser/assets/images/applications/58_MDGPU_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/58_MDGPU_large.jpg Research Institute for Atomic and Molecular Physics http://www.amolf.nl/ 2007 12 31 12/31/2007 van Meel, et al Paper Code Life Sciences Molecular Dynamics,Life Sciences, van Meel, et al e9158b7d-0e71-4550-98ed-386d6acb1142 Visual Molecular Dynamics: VMD VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
http://www.ks.uiuc.edu/Research/vmd/ /content/cudazone/CUDABrowser/assets/images/applications/59_Visual_Molecular_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/59_Visual_Molecular_Large.jpg Academia http://www.ks.uiuc.edu/ 2007 4 1 04/01/2007 100 Paper Code Life Sciences Molecular Dynamics,Life Sciences f6a4e7d4-7f25-41f8-aa92-83305b61867d Map Reduce Framework The MapReduce interface is a software framework implemented by Google to support parallel computations on the datasets. This paper describes a framework built around the Map Reduce abstraction, which allows application developers to focus on their application, while enabling high performance GPU implementation. /content/cudazone/CUDABrowser/assets/images/applications/60_Map_Reduce_Framework_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/60_Map_Reduce_Framework_Large.jpg Academia University of California Berkeley http://www.eecs.berkeley.edu/ 2008 3 1 03/01/2008 G80 150 Catanzaro, Sundaram, Keutzer Paper Numerics MapReduce,Search,Numerical,Algorithms,Libraries, Catanzaro, Sundaram, Keutzer 7e814a19-5a09-470d-9752-e3ba25d1d51e Innovative 3D visualization solutions for Oil and Gas Open Inventor by Mercury is a comprehensive solution for simultaneous computation and visualization of huge 3D seismic data sets, or any highly demanding computing tasks in the interpretation and simulation workflows. 
http://3dviz.mc.com/solutions/GPUComputing/default.asp /content/cudazone/CUDABrowser/assets/images/applications/61_Innovative_3D_visualization_solutions_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/61_Innovative_3D_visualization_solutions_Large.jpg Commercial Mercury Computer Systems http://3dviz.mc.com/solutions/oilandgas/ngog/ 2007 11 1 November 2007 G80 10 Mercury Computer Systems Application Multimedia Oil & Gas Seismic, Graphics, Mercury Computer Systems 25c2eda7-7c30-41c2-9bb3-92134f7f3638 Prestack Seismic Data Interaction Headwave's Prestack for Interpreters is an integrated Windows/Linux software solution, giving easy access to potentially enormous prestack datasets from single PCs using GPUs. http://www.headwave.com/article/articleview/25 /content/cudazone/CUDABrowser/assets/images/applications/62_Prestack_Seismic_Data_Interaction_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/62_Prestack_Seismic_Data_Interaction_Large.jpg Commercial Headwave http://www.headwave.com/ 2007 6 1 06/01/2007 G80 100 Headwave Application Oil & Gas Seismic,Graphics, Headwave 515ca076-3ed9-4311-a6d0-498d5560e1a5 Swaption Volatility Short rate models have been dismissed for financial engineering applications in favor of market models as the latter are more flexible and best suited to cluster computing implementations. In this paper, we argue that the paradigm shift toward GPU architectures currently taking place in the high performance computing world can potentially change the situation and tilt the balance back in favor of a new generation of short rate models. 
http://www.level3finance.com/index.html /content/cudazone/CUDABrowser/assets/images/applications/63_Swaption_Violatility_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/63_Swaption_Volatility_Large.jpg Commercial Level 3 Finance http://www.level3finance.com/ 2007 9 1 09/01/2007 G80 11 Level 3 Finance Application Finance Finance, Level 3 Finance 22497f94-fe89-4c5a-a6d3-e396e0c60aec Quantitative Risk Analysis and Algorithmic Trading Systems The Volera product line delivers high-speed, low-latency options analytics for trading and risk management. Using GPU-based high-performance computing technology, Volera systems offer performance exceeding that of traditional grid computing, and require far less hardware, rack space, electrical power and cooling. http://www.hanweckassoc.com/products.html /content/cudazone/CUDABrowser/assets/images/applications/64_Quantitative_Risk_Analysis_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/64_Quantitative_Risk_analysis_Large.jpg Commercial Hanweck Associates http://www.hanweckassoc.com/ 2007 9 1 09/01/2007 G80 50 Hanweck Associates Application Finance Finance, Hanweck Associates f5121b1f-fbcf-466b-9d47-d4b892ca0da2 Geographic Information System (GIS) Manifold System is a single, integrated product that provides three major classes of GIS functionality in a single package: as a desktop application, as an objects library for programmers and as an Internet Map Server for web applications. 
http://www.manifold.net/info/products.shtml /content/cudazone/CUDABrowser/assets/images/applications/65_Geographic_Information_System_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/65_Geographic_Information_System_Large.jpg Commercial Manifold http://www.manifold.net/index.shtml 2007 8 1 08/01/2007 G80 36 Manifold Application Imaging GIS,Imaging, Manifold 0cdcdb36-45e3-4c53-a001-3cc39aa4460b Human Body 3D Surface Image Capture and Analysis A stereo pair of images are captured simultaneously and instantaneously using a pair of synchronised high resolution digital stills cameras. http://www.di3d.com/ /content/cudazone/CUDABrowser/assets/images/applications/66_Human_Body_3D_Surface_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/66_Human_Body_3D_Surface_Large.jpg Commercial Dimensional Imaging http://www.di3d.com/ 2007 12 31 12/31/2007 G80 Dimensional Imaging Application Imaging Imaging, Dimensional Imaging f31a1ef1-71f6-45ea-bec5-cdb783b71acb Synthesis of Artificial Neural Circuitry Simulation of neuronal components closely modeled after neurons in the brain, and synthesize arrays which wire themselves by simulating neural circuit growth in 3D. http://www.evolvedmachines.com/ /content/cudazone/CUDABrowser/assets/images/applications/67_Synthesis_of_Artificial_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/67_Synthesis_of_Artificial_Large.jpg Commercial Evolved Machines http://www.evolvedmachines.com/ 2007 12 31 12/31/2007 G80 Evolved Machines Application Science Computational biology and simulation, Evolved Machines 87617b60-6f68-4a10-a064-49acf3f44a21 Multiple Relatively Robust Representations (MRRR) This paper presents an implementation of the Algorithm of Multiple Relatively Robust Representations (MRRR) for the symmetric tridiagonal eigenvalue problem on a data-parallel coprocessor using the CUDA programming environment. 
http://www.dgp.toronto.edu/people/lessig/mrrr/ /content/cudazone/CUDABrowser/assets/images/applications/68_MRRR_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/68_MRRR_Large.jpg Academia University of Toronto http://www.utoronto.ca/ 2008 5 1 05/01/2008 50 Open source Lessig Paper Code Numerics Algorithms,numeric,libraries,tridiagonal eigenvalue,multiple relatively robust representations,MRRR, Lessig e2821f27-0731-465a-a8bf-550230b7c600 Fast MRI Gridding on GPUs via CUDA This paper explores the challenges and opportunities of exploiting general-purpose GPU processing by implementing the non-equispaced Fast Fourier Transform algorithm (commonly known as 'gridding') and reports the results. /content/cudazone/CUDABrowser/assets/images/applications/69_Fast_MRI_Gridding_on_GPUs_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/69_Fast_MRI_Gridding_on_GPUs_large.jpg Academia University of Wisconsin - Madison http://www.wisc.edu/ 2008 5 13 5/13/2008 GeForce 8800 (G80) Open source Gregerson Paper Code Presentation Imaging Imaging,gridding,Fast Fourier,algorithm, Gregerson 5e5775d0-6663-4807-a80a-02c9b89859c9 Fast Support Vector Machine Training and Classification This paper describes a solver for Support Vector Machine training running on a GPU, using Platt's Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic.
/content/cudazone/CUDABrowser/assets/images/applications/70_Fast Support_Vector_Machine_Smal.jpg /content/cudazone/CUDABrowser/assets/images/applications/70_Fast_Support_Vector_Machine_Large.jpg Academia University of California Berkeley http://www.eecs.berkeley.edu/ 2007 12 31 12/31/2007 138 Catanzaro, Sundaram, Keutzer Paper Numerics Algorithms,numeric,libraries,support vector machine,Platt, Catanzaro, Sundaram, Keutzer a00952ad-19d7-4ce9-a84a-d35cf6252276 Fluorescent Microscopy This paper describes how a typical calculation covering 10 seconds of measurement time, which required 8 minutes of computing time on a single core of a Quad CPU, can be done in less time by using parallel processing with GPUs. Besides being the cheaper option, these GPUs can carry out the computation as fast as a computer cluster with many processors, offering nearly teraflops of computing power. http://www.ks.uiuc.edu/Research/microscope/ /content/cudazone/CUDABrowser/assets/images/applications/71_Flurorescent_Microscopy_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/71_Flurorescent_Microscopy_Large.jpg Academia University of Illinois at Urbana-Champaign http://www.ks.uiuc.edu/ 2007 11 19 11/19/2007 GeForce 8800 GTX Open source Arkhipov, et al Paper Code Multimedia Life Sciences Life sciences,microscopy, Arkhipov, et al 32888aaf-5ef8-4302-a4c7-843626b05718 Antenna Modeling Design System This tool addresses the challenges of fast-changing wireless appliance design dictated by consumer aesthetic and feature appetite with technology that efficiently imports, meshes and simulates the entire wireless appliance within its surrounding real-world environment.
http://eesof.tm.agilent.com/products/amds_main.html /content/cudazone/CUDABrowser/assets/images/applications/73_Antenna_Modeling_Design_System_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/73_Antenna_Modeling_Design_System_Large.jpg Commercial Agilent EEsof EDA http://eesof.tm.agilent.com/ 2008 2 14 2/14/2008 Agilent AMDS Application Imaging Imaging,simulation,3D,wireless, Agilent AMDS 9791ea82-56fa-461c-bb77-e6d18eb8cb76 Flocking-based Document Clustering on the GPU In this paper, we have conducted research to exploit the GPU's architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to many naturally inspired algorithms. /content/cudazone/CUDABrowser/assets/images/applications/74_Flocking_based_Document_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/74_Flocking_based_Document_Large.jpg Research Applied Software Engineering Group http://aser.ornl.gov/ 2007 10 15 10/15/2007 GeForce 8800 3 Charles, Potok, Patton, Cui Paper Numerics Algorithm,document flocking, Charles, Potok, Patton, Cui 1d4c05f9-4904-457c-acae-50514535274d Tomographic Reconstruction How much computing power can you cram into a single desktop PC? Research on image reconstruction often requires large-scale scientific computations that can easily take weeks on a normal PC. To tackle this problem, the team developed a special PC costing less than 4000 euro that is capable of performing computations as fast as a cluster, and can perform three-dimensional reconstructions within a few hours -- over 100 times as fast.
http://fastra.ua.ac.be/en/index.html /content/cudazone/CUDABrowser/assets/images/applications/75_Tomographic_Reconstruction_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/75_Tomographic_Reconstruction_Large.jpg Academia University of Antwerp http://www.ua.ac.be/main.aspx?c=.ENGLISH 2008 5 28 5/28/2008 GeForce 9800 GX2 40 Sijbers, et al Paper Code Multimedia Imaging Medical, imaging, 3D scan, x ray, FASTRA, Vision Lab, Sijbers, et al fdb3584f-25dd-47d7-a251-56c932a15edf Solving Dense Linear Systems on Multi-Accelerator Platforms This paper generalizes the approach for systems with multiple hardware accelerators, and incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. This experimental evaluation on an NVIDIA Tesla S870 platform delivers a peak performance well over 400 GFLOPS. /content/cudazone/CUDABrowser/assets/images/applications/76_Solving_Dense_Linear_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/76_Solving_Dense_Linear_Large.jpg Academia Universidad Jaume http://www.icc.uji.es/ 2008 5 9 5/9/2008 Quintana-Orta, Igual, van de Geijn Paper Numerics FLAME, linear algebra, numeric, library, multicore, multi-core, BLAS, PLASMA, SMPS, Cell, FPGA, Tesla, Quintana-Orta, Igual, van de Geijn 807de7b2-250d-45f0-aca6-2c9720ad523c Histogram Computation with CUDA GPU's higher processing power compared to a standard CPU comes at the cost of reduced data caching and flow control logic as more transistors have to be devoted to data processing. This imposes certain limitations in terms of how an application may access memory and implement flow control. As a result, implementation of certain algorithms (even trivial ones) on the GPU may be difficult or may not be computationally justified. 
http://users.rsise.anu.edu.au/~ramtin/cuda.htm /content/cudazone/CUDABrowser/assets/images/applications/77_Histogram_Computation_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/77_Histogram_Computation_Large.jpg Academia The Australian National University http://www.anu.edu.au/ 2008 5 1 05/01/2008 Ramtin Shams Application Code Numerics Numerics,algorithm,parallel processing,GPU,histogram computation, Ramtin Shams 0766881c-4193-45b7-8aa7-c4113a8d0172 Efficient Histogram Algorithms This paper presents two efficient histogram algorithms designed for NVIDIA's CUDA-compatible GPUs, which can be used for parallel computation of histograms on large data sets and for thousands of bins. These algorithms do not require the typically costly data transfers, because they allow efficient histogram calculation on the GPU. /content/cudazone/CUDABrowser/assets/images/applications/78_Efficient_Histogram_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/78_Efficient_Histogram_Large.jpg Academia The Australian National University http://www.anu.edu.au/ 2007 08 08 08/08/2007 30 Ramtin Shams Paper Numerics Histogram, parallel processing, compute unified device architecture, CUDA, graphics processor unit, GPU, numerics, algorithm, Ramtin Shams 007043c1-e26b-4087-b694-fb00069f40c3 Speeding Up Mutual Information Computation Hardware This paper presents an efficient method for mutual information computation between images (2D or 3D) for NVIDIA's CUDA-compatible devices, overcoming limitations by approximating the pmfs using a down-sampled version of the joint histogram, which avoids memory update problems.
/content/cudazone/CUDABrowser/assets/images/applications/79_Speeding_up_Mutual_Information_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/79_Speeding_up_Mutual_Information_Large.jpg Academia The Australian National University http://www.anu.edu.au/ 2007 9 11 9/11/2007 25 Ramtin Shams Paper Numerics Histogram, parallel processing, compute unified device architecture, CUDA, graphics processor unit, GPU, numerics, algorithm, Ramtin Shams 121a8218-926a-4578-be55-6632c782b1ba LINZIK: The compact optical CAD A lens ray tracing program for calculating, in particular, astronomical optics. It includes an optimizer, which can choose parameters of surfaces to minimize the goal (merit) function and satisfy the specified restrictions. http://www.linzik.com/download_eng.htm /content/cudazone/CUDABrowser/assets/images/applications/80_LINZINK_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/80_LINZINK_Large.jpg LINZIK http://www.linzik.com/ 2008 5 18 5/18/2008 10 Open source Vodyanik Application Code Vodyanik f4df4b66-bc01-4b08-afe1-a4868c9ff4d8 Canny Edge Detection The Canny edge detector is a very popular edge feature detector used as a pre-processing step in many computer vision algorithms. By using the more programmer-friendly CUDA framework, we are able to implement the entire Canny algorithm. Details are presented along with a comparison with CPU implementations. We also integrate our detector into MATLAB, a popular interactive simulation package often used by researchers.
http://www.wam.umd.edu/~yluo1/canny.htm /content/cudazone/CUDABrowser/assets/images/applications/81_Canny_Edge_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/81_Canny_Edge_Large.jpg Academia University of Maryland http://www.umd.edu/ 2008 5 6 5/6/2008 3 Open source Luo, Duraiswami Paper Code Digital Content Creation Graphics Numerics Imaging Canny edge detector,Canny algorithm,edge feature detector,graphical application layers,algorithm,simulation,numerics, Luo, Duraiswami 9b9fb409-8a9a-4c8d-92c7-3dbb2b1a4c6c SciFinance Speeds Financial Results with Parallel Computing By harnessing the power of NVIDIA CUDA with GPU or multi-CPU workstations, SciFinance parallel codes for Monte Carlo pricing models run blazingly fast. SciFinance CUDA-enabled codes achieve astounding acceleration, while SciFinance OpenMP-compliant codes yield near-linear acceleration on multi-CPU workstations. http://www.scicomp.com/parallel_computing/GPU_OpenMP /content/cudazone/CUDABrowser/assets/images/applications/82_SciFinance_Small.jpg /content/cudazone/CUDABrowser/assets/images/applications/82_SciFinance_Large.jpg Commercial SciComp Inc. http://www.scicomp.com/ 2008 6 9 6/9/2008 220 SciComp Inc. Application Code Multimedia Finance Graphic card, parallel computing, finance, Monte Carlo, SciComp Inc. d1677622-d8fa-4f22-9532-74d40f7d2d6e Accelerate Large Graph Algorithms This paper presents a few fundamental algorithms - including breadth first search, single source shortest path, and all-pairs shortest path - using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the GeForce 8800 GTX GPU costing $600.
/content/cudazone/CUDABrowser/assets/images/applications/83_Accelerating_Large_Graph_Algorithms_small.jpg /content/cudazone/CUDABrowser/assets/images/applications/83_Accelerating_Large_Graph_Algorithms_large.jpg Academia International Institute of Information Technology Hyderabad http://www.iiit.ac.in/ 2008 6 5 6/5/2008 Harish, et al Paper Numerics Numeric, algorithm, breadth first search, single source shortest path, all-pairs shortest path, vertex graph, Harish, et al eb42f530-fa04-11dd-87af-0800200c9a66 Multibody mechanical simulations on the GPU Parallel solver for non-linear complementarity problems in mechanical systems with a large number of moving parts and frictional contacts. /content/cudazone/CUDABrowser/assets/images/applications/202_benchmarkGPU_small.png /content/cudazone/CUDABrowser/assets/images/applications/202_benchmarkGPU_large.png Academia Universita degli Studi di Parma and University of Wisconsin - Madison 2009 02 01 02/01/2009 13 Alessandro Tasora Application Multimedia Numerics Libraries Science Multibody, complementarity, differential variational inequality, Alessandro Tasora