7cde13sw-sdfg-443b-82d0-ba01dd84469a9
aeroCuda: GPU-Optimized Immersed Solid Code
This is an immersed solid CFD code that uses Peskin's immersed boundary method with Tryggvason's formulation of Chorin's projection method for solving the full Navier-Stokes equations. It is written in Python and uses the PyCUDA and PyFFT libraries to access CUDA and the cuFFT library.
http://www.scribd.com/doc/97875813/aeroCuda-The-2-d-CFD-Code
/content/cudazone/CUDABrowser/assets/images/applications/163272_Vel100-low.png
/content/cudazone/CUDABrowser/assets/images/applications/163272_Vel100-med.png
Academia
Harvard School of Engineering and Applied Sciences
2011
06
21
06/21/2011
48
Open source
Samir Patel
Paper
Computational Fluid Dynamics
7cde13sw-sdfg-443b-82d0-ba018el4469a9
Coulomb Integrals CUDA Experimental Routine
Quantum dots are artificial semiconductor structures in which charge carriers are confined to a nanometer-scale region. Because the carriers sit so close together, computing their state means computing a correlated state. Such states can be computed by rewriting the many-particle Schrödinger equation in matrix form. Finding the elements of this matrix requires evaluating Coulomb integrals, which is computationally very demanding. This application shows that the evaluation can be sped up dramatically by properly exploiting CUDA GPUs.
http://cudaci.tumblr.com/
/content/cudazone/CUDABrowser/assets/images/applications/157939_nanotrappole-low.png
/content/cudazone/CUDABrowser/assets/images/applications/157939_nanotrappole-med.png
Academia
University of Modena and Reggio Emilia
2011
01
15
01/15/2011
44
Open source
Martin Klapez
Code
Paper
Presentation
Numerics
Science
Quantum Dots, Coulomb Integrals
7cde13sw-sdfg-443v-82d0-ba018el4469a9
VexCL
VexCL is a vector expression template library for OpenCL, created to ease C++-based OpenCL development. Multi-device (and multi-platform) computations are supported. The code is publicly available under the MIT license. Main features: * Selection and initialization of compute devices according to an extensible set of device filters. * Transparent allocation of device vectors spanning multiple devices. * Convenient notation for vector arithmetic, sparse matrix-vector multiplication, and reductions. All computations are performed in parallel on all selected devices. * Appropriate kernels for vector expressions are generated automatically the first time an expression is used.
http://ddemidov.github.com/vexcl/
/content/cudazone/CUDABrowser/assets/images/applications/19880_logo-low.png
/content/cudazone/CUDABrowser/assets/images/applications/19880_logo-med.png
Academia
Kazan Federal University, Supercomputer Center of Russian Academy of Sciences
http://www.ksu.ru
2012
05
18
05/18/2012
10-20
Open source
Denis Demidov
Code
Numerics
Libraries
Programming Tools
Science
C++, OpenCL, Libraries, Meta-programming, Linear Algebra
7cde13sw-7ofb-443b-82d0-ba018el4469a9
Computational Fluid Dynamics Simulations Using Many Graphics Processors
In this scenario, computational fluid dynamics simulations of turbulence are performed with 64 GPUs and an optimized CFD algorithm using communication/computation overlapping. Detailed timings reveal that the GPUs' internal calculations are so efficient that operations related to data exchange between compute nodes now cause a scaling bottleneck on all but the largest problems.
http://www.computer.org/csdl/mags/cs/2012/03/mcs2012030010-abs.html
/content/cudazone/CUDABrowser/assets/images/applications/405887_GPU-CFD-low.png
/content/cudazone/CUDABrowser/assets/images/applications/405887_GPU-CFD-med.png
Academia
University of Massachusetts, Amherst
2012
04
21
04/21/2012
45
N/A
Ali Khajeh-Saeed
J. Blair Perot
Paper
Computational Fluid Dynamics
7cde13sw-7ofb-443b-82d0-ba018el4468a9
Fast Multipole Method on GPU: Tackling 3-D Capacitance Extraction on Massively Parallel SIMD Platforms
To facilitate full chip capacitance extraction, field solvers are typically deployed for characterizing capacitance libraries for various interconnect structures and configurations. In the past decades, various algorithms for accelerating boundary element methods (BEM) have been developed to improve the efficiency of field solvers for capacitance extraction. This paper presents the first massively parallel capacitance extraction algorithm FMMGpu that accelerates the well-known fast multipole methods (FMM) on modern Graphics Processing Units (GPUs). We propose GPU friendly data structures and SIMD parallel algorithm flows to facilitate the FMM-based 3-D capacitance extraction on GPU. Effective GPU performance modeling methods are also proposed to properly balance the workload of each critical kernel in our FMMGpu implementation, by taking advantage of the latest Fermi GPU’s concurrent kernel executions on streaming multiprocessors (SMs). Our experimental results show that FMMGpu brings 22X to 30X speedups in capacitance extractions for various test cases. We also show that even for small test cases that may not well utilize GPU’s hardware resources, the proposed cube clustering and workload balancing techniques can bring 20% to 60% extra performance improvements.
http://www.ece.mtu.edu/~zhuofeng/MTU_VLSI_DA_files/papers/DAC11paper.pdf
/content/cudazone/CUDABrowser/assets/images/applications/25120_CapExtraction-low.png
/content/cudazone/CUDABrowser/assets/images/applications/25120_CapExtraction-med.png
Academia
Michigan Technological University
http://www.mtu.edu
2011
06
10
06/10/2011
30
N/A
Xueqian Zhao and Zhuo Feng
Paper
Presentation
Electronic Design Automation
Fast Multipole Method, Capacitance Extraction
7cde1372-7ofb-443b-82d0-ba018el469a9
GPU-Accelerated Large-Eddy Simulation of Turbulent Channel Flows
High performance computing clusters that are augmented with cost- and power-efficient graphics processing units (GPUs) provide new opportunities to broaden the use of the large-eddy simulation technique to study high Reynolds number turbulent flows in fluids engineering applications. In this paper, we extend our earlier work on multi-GPU acceleration of an incompressible Navier-Stokes solver to include a large-eddy simulation (LES) capability. In particular, we implement the Lagrangian dynamic subgrid scale model and compare our results against existing direct numerical simulation (DNS) data of a turbulent channel flow at friction Re = 180. Overall, our LES results match fairly well with the DNS data. Our results show that the friction Re = 180 case can be entirely simulated on a single GPU, whereas higher Reynolds number cases can benefit from a GPU cluster.
http://scholarworks.boisestate.edu/mecheng_facpubs/37/
/content/cudazone/CUDABrowser/assets/images/applications/gpu-large-order-low.png
/content/cudazone/CUDABrowser/assets/images/applications/gpu-large-order-med.png
Academia
Boise State University, Mechanical & Biomedical Engineering Department
2012
01
09
01/09/2012
N/A
Rey DeLeon
Inanc Senocak
Paper
Computational Fluid Dynamics
7cde1372-7efb-443b-82t0-ba018el469a9
GPU Application for High-Order Compact Finite Difference Scheme
A high-order compact finite difference scheme for the solution of fluid flow problems is implemented to run on a GPU. Besides the compact scheme, a high-order low-pass filter is also employed. For time integration, the classical fourth-order Runge–Kutta method is used. Advection of a vortical disturbance and a temporal mixing layer are simulated with speedups between 9x and 16.5x.
http://www.sciencedirect.com/science/article/pii/S0045793011003227
/content/cudazone/CUDABrowser/assets/images/applications/gpu-high-order-low.png
/content/cudazone/CUDABrowser/assets/images/applications/gpu-high-order-med.png
Academia
Istanbul Technical University, Faculty of Aeronautics and Astronautics
2012
02
15
02/15/2012
N/A
Bulent Tutkun
Firat Oguz Edis
Paper
Computational Fluid Dynamics
Numerics
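As a small illustration of the time integrator named in the entry above, the classical fourth-order Runge-Kutta step can be sketched in plain Python. This is not the authors' GPU code: the function names `rk4_step` and `integrate` and the scalar test problem dy/dt = -y are assumptions chosen only for illustration.

```python
import math


def rk4_step(f, t, y, h):
    """Advance y' = f(t, y) by one step of size h with classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    # Weighted average of the four slope estimates.
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)


def integrate(f, t0, y0, t_end, n_steps):
    """Apply n_steps equal RK4 steps from t0 to t_end and return y(t_end)."""
    h = (t_end - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


def decay(t, y):
    """Hypothetical test problem: dy/dt = -y, exact solution exp(-t)."""
    return -y
```

Calling `integrate(decay, 0.0, 1.0, 1.0, 10)` should land close to `math.exp(-1.0)`; halving the step size should shrink the error by roughly a factor of 16, which is what fourth-order accuracy means.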
7cde1372-7efb-443b-82d0-ba018el469a9
Focused Ultrasound - Efficient GPU Simulation Methods for Therapy Planning
Over the past years, high intensity focused ultrasound therapy has become a promising therapeutic alternative for non-invasive tumor treatment. The basic idea of this interventional approach is to apply focused ultrasound waves to the tumor tissue such that the cells are heated and hence destroyed. Since it is quite difficult to assess the quality of this non-invasive therapy, there is a dire need for computer support in planning, conduction, and monitoring of such treatments. In this work, we propose efficient simulation techniques for focused ultrasound waves as well as their heat dissemination using current graphics hardware as a numerical co-processor. We achieve speed-ups between 10 and 700 for the single simulation steps compared to an optimized CPU solution, overall resulting in a significant performance gain over previous approaches for simulation of focused ultrasound.
http://diglib.eg.org/EG/DL/PE/vriphys/vriphys11/119-128.pdf.abstract.pdf
/content/cudazone/CUDABrowser/assets/images/applications/screenshotTemperature-low.png
/content/cudazone/CUDABrowser/assets/images/applications/screenshotTemperature-med.png
Research
Fraunhofer MEVIS
2011
12
05
12/05/2011
N/A
J. Georgii, C.v. Dresky, S. Meier, D. Demedts, C. Schumann, T. Preusser
Paper
Medical Imaging
Numerics
Science
FUS, HIFU
8cdd1372-7efb-443b-82d0-ba018el469a9
GPU and APU computations of Finite Time Lyapunov Exponent fields
We present GPU and APU accelerated computations of Finite-Time Lyapunov Exponent (FTLE) fields. The calculation of FTLEs is a computationally intensive process, as in order to obtain the sharp ridges associated with the Lagrangian Coherent Structures an extensive resampling of the flow field is required. The computational performance of this resampling is limited by the memory bandwidth of the underlying computer architecture.
http://www.sciencedirect.com/science/article/pii/S0021999111006322
/content/cudazone/CUDABrowser/assets/images/applications/ftle-app-image-low.png
/content/cudazone/CUDABrowser/assets/images/applications/ftle-app-image-med.png
Academia
ETH Zurich
2011
11
18
11/18/2011
N/A
Christian Conti, Diego Rossinelli, Petros Koumoutsakos
Paper
Computational Fluid Dynamics
Numerics
Science
Signal Processing
8cdd1372-7efb-429b-82d0-ba018el469a9
LASSO Regression Using Distributed GPUs
We have introduced a scalable framework that uses MPI to distribute work across multiple GPUs in order to solve extremely large regression problems. The speedup scales with the number of available nodes.
http://code.google.com/p/parallel-lasso/
/content/cudazone/CUDABrowser/assets/images/applications/mpi-low.png
/content/cudazone/CUDABrowser/assets/images/applications/mpi-med.png
Academia
University of Southern California
2012
01
19
01/19/2012
Open source
Gary K. Chen
Code
Paper
Life Sciences
bioinformatics
8ced1372-7efb-429b-82d0-ba018el469a9
CUDA Accelerated Face Recognition Using Local Binary Patterns
We present a GPU accelerated face recognition framework using CUDA. The framework utilizes weighted regional LBP (Local Binary Pattern) histograms as features and k-nearest neighbour (k-NN) algorithm for classification. We show an efficient way to compute LBP values from an input image and construct weighted regional LBP histograms in GPU using a single kernel. We also propose a massively parallel GPU implementation of the k-NN algorithm optimized for handling high-dimensional feature vectors. Comparisons with CPU implementations have shown that, by accelerating both the feature extraction and classification process of the face recognition algorithm, we have managed to achieve up to 29x increase in recognition speed.
http://cvip.itu.edu.tr/media/4809606.pdf
/content/cudazone/CUDABrowser/assets/images/applications/45273_image-low.png
/content/cudazone/CUDABrowser/assets/images/applications/45273_image-med.png
Academia
Istanbul Technical University
www.itu.edu.tr
2012
02
20
02/20/2012
29x
N/A
Salih Cihan Tek
Muhittin Gökmen
Paper
Imaging
Signal Processing
Face recognition, k-nearest neighbours, local binary patterns
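The local binary pattern feature mentioned in the entry above can be sketched for a single 3x3 neighbourhood as follows. This is a minimal CPU illustration, not the authors' single-kernel GPU implementation: the names `lbp_code` and `lbp_histogram`, the clockwise neighbour ordering, and the unweighted whole-image histogram are all assumptions (the entry describes weighted regional histograms).

```python
def lbp_code(patch):
    """Return the 8-bit LBP code of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Neighbours visited clockwise from the top-left corner.
    # The ordering is a convention; any fixed order works.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2),
               (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over the interior pixels of a 2D list."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            hist[lbp_code(patch)] += 1
    return hist
```

A classifier such as k-NN would then compare histograms like these between a probe image and the gallery images.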
8cdd1372-7efb-449b-82d0-ba018el469a9
PyCOOL
PyCOOL (Cosmological Object-Oriented Lattice code) is a fast GPU accelerated program that solves the evolution of interacting scalar fields in an expanding universe with symplectic algorithms. The program has been written with the intention to hit a sweet spot of speed, accuracy and user friendliness. This is achieved by using the Python language with the PyCUDA interface to make a program that is very easy to adapt to different scalar field models.
http://www.physics.utu.fi/tiedostot/theory/particlecosmology/pycool/
/content/cudazone/CUDABrowser/assets/images/applications/pycool-low.png
/content/cudazone/CUDABrowser/assets/images/applications/pycool-med.png
Academia
University of Turku / Department of Physics and Astronomy
2012
01
24
01/24/2011
Open source
Jani Sainio
Application
Code
Science
8cdd1372-7efb-449b-82d0-ba018ef469a9
CUDAfy.NET
An open source Microsoft .NET library that allows writing of CUDA applications including device code in languages such as C# and VB. Contains wrappers for CUSPARSE, CUBLAS, CUFFT and CURAND, as well as a growing number of specialized numerics libraries.
http://www.hybriddsp.com/Products/CUDAfyNET.aspx
/content/cudazone/CUDABrowser/assets/images/applications/cudafyi-low.png
/content/cudazone/CUDABrowser/assets/images/applications/cudafy-med.png
Commercial
Hybrid DSP Systems
http://www.hybriddsp.com
2011
12
07
12/07/2011
Open source
Hybrid DSP Systems
Code
Numerics
Libraries
.NET,C#,VB,Solver
8cdd1372-7efb-449b-82d0-ba018ef469a6
QnDynCUDA
We present a set of C++ classes that allow one to use a graphics card's processor cores for quantum ab initio simulations, i.e. direct solution of the time-dependent Schrödinger equation, benefiting from the parallel architecture of graphics processing units. We use Chebyshev polynomials and the FFT algorithm. The solution is based on NVIDIA CUDA technology. In test runs, the speed-up factor of our classes on the graphics card can be on the order of 300 compared with runs using only a single CPU core. The solver is not limited to the Schrödinger equation. With only small changes it can be used for solving the nonlinear Gross–Pitaevskii equation of BEC dynamics, the heat equation, the diffusion equation or other parabolic partial differential equations of second order.
http://dx.doi.org/10.1016/j.cpc.2011.11.026
/content/cudazone/CUDABrowser/assets/images/applications/QnDynCUDA-low.png
/content/cudazone/CUDABrowser/assets/images/applications/QnDynCUDA-med.png
Academia
Nicolaus Copernicus University
http://fizyka.umk.pl
2011
12
09
12/09/2011
Open source
Tomasz Dziubak
Jacek Matulewski
Code
Paper
Numerics
Libraries
Science
CUDA
8cdd1372-7efb-449b-82d0-ba08wef469a6
Efficient Decoding of QC-LDPC Codes Using GPUs
In this work, we propose an efficient quasi-cyclic LDPC (QC-LDPC) decoder simulator which runs on graphics processing units (GPUs). We optimize the data structures of the messages used in the decoding process such that both the read and write processes can be performed in a highly parallel manner by the GPUs. We also propose a highly efficient algorithm to convert the data structure of the messages from one form to another with very little latency. Finally, by using the large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder is found to run around 100 times faster than a CPU-based simulator.
http://www.springerlink.com/content/j8g53w2260224wx7/
/content/cudazone/CUDABrowser/assets/images/applications/QC-LDPC-low.png
/content/cudazone/CUDABrowser/assets/images/applications/QC-LDPC-med.png
Academia
The Hong Kong Polytechnic University
2011
06
16
06/16/2011
100
N/A
Yue Zhao
Xu Chen
Chiu-Wing Sham
Wai M. Tam and Francis C. M. Lau
Paper
Signal Processing
CUDA
8cdd1372-7efb-449b-82d0-ba018ef4542f
Integrating CUDA & GNU Autotools
One of the drawbacks to GNU Autotools is that it only provides native support for certain languages; however, it is flexible enough that you can make it do what you want it to do... if you know how. I wanted to distribute CUDA-based applications with GNU Autotools, but unfortunately CUDA is not one of the languages that it supports... so I started googling around. I found several threads where other people wanted to be able to do the same thing. I found various bread crumbs here and there that enabled me to piece together a working build. Since I couldn't find all of the information that I needed in one spot, I figured I'd write it all down and publish it so others don't have to waste time figuring it out. "The ClusterChimps Guide to Integrating CUDA and GNU Autotools" is a simple guide to building CUDA targets using GNU Autotools. It will show you how to build standalone CUDA applications, static CUDA libraries, and shared CUDA libraries. The guide comes with a companion example tarball.
http://www.clusterchimps.org/autotools.php
/content/cudazone/CUDABrowser/assets/images/applications/dr-zaius-low.png
/content/cudazone/CUDABrowser/assets/images/applications/dr-zaius-med.png
Research
ClusterChimps.org
http://www.clusterchimps.org
2011
11
18
11/18/2011
Open source
Dr. Zaius
Code
Paper
Tools
CUDA, Autotools
8cdd1372-7efb-449b-82d0-ba018ef543e8
fMRI Analysis on the GPU
Faster fMRI analysis by using the GPU.
http://www.sciencedirect.com/science/article/pii/S0169260711001957
/content/cudazone/CUDABrowser/assets/images/applications/article-01.png
/content/cudazone/CUDABrowser/assets/images/applications/article-02.png
Research
Linköping University
http://www.liu.se
2011
11
12
11/12/2011
N/A
Anders Eklund
Mats Andersson
Hans Knutsson
Multimedia
Paper
Signal Processing
Medical Imaging
fMRI, GPU, permutation test
8cdd1372-7efb-449b-82d0-ba018ef567r9
True 4D Image Denoising on the GPU
4D Image denoising on the GPU
http://www.hindawi.com/journals/ijbi/2011/952819/
/content/cudazone/CUDABrowser/assets/images/applications/4-dimension-01.png
/content/cudazone/CUDABrowser/assets/images/applications/4-dimensions-02.png
Research
Linköping University
http://www.liu.se
2011
11
12
11/12/2011
N/A
Anders Eklund
Mats Andersson
Hans Knutsson
Multimedia
Paper
Signal Processing
Medical Imaging
Science
4D, image denoising, CT
8cdd1372-7efb-449b-82d0-ba018ef458y9
A real-time crosstalk canceller on a notebook GPU
People want to take part in communication with the feeling of being together and sharing the same environment. Crosstalk cancellation is one of the main applications in multichannel acoustic signal processing that provides this kind of feeling. This work shows that a GPU can be used as a co-processor that carries out audio processing tasks, freeing CPU resources for other tasks.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6012072
/content/cudazone/CUDABrowser/assets/images/applications/159460_Crosstalk_01.png
/content/cudazone/CUDABrowser/assets/images/applications/159460_Crosstalk_02.png
Academia
INCO2 (http://www.inco2.upv.es/) and GTAC (http://www.gtac.upv.es/) Groups. Universidad Politecnica de Valencia
http://www.upv.es
2011
09
06
09/16/2011
N/A
Jose A. Belloch
Alberto Gonzalez
F. J. Martínez-Zaldívar
Antonio M. Vidal
Application
Paper
Signal Processing
Video & Audio
8cdd1372-7efb-449b-82d0-ba018ez454f2
Real-time massive convolution for audio applications on GPU
Massive convolution is the basic operation in multichannel acoustic signal processing. This field has developed rapidly in recent years due to the growing need to incorporate new effects and the natural desire to improve the hearing experience. These acoustic effects require computing multiple convolutions simultaneously in real time. The work we present describes a GPU implementation of all the operations involved in the convolution, extended to multiple channels. One of the main features of our work is the use of streams on the GPU, which allows computation to overlap with data transfer from CPU to GPU. This application is the first step toward real-time multichannel sound applications on the GPU.
http://www.springerlink.com/content/h37u46j2416m6733/
/content/cudazone/CUDABrowser/assets/images/applications/156214_Rea_Time_2.png
/content/cudazone/CUDABrowser/assets/images/applications/156214_Rea_Time_1.png
Academia
INCO2 (http://www.inco2.upv.es/) and GTAC (http://www.gtac.upv.es/) Groups. Universidad Politecnica de Valencia
http://www.upv.es
2011
04
19
04/19/2011
N/A
Jose A. Belloch
Alberto Gonzalez
F. J. Martínez-Zaldívar
Antonio M. Vidal
Application
Paper
Signal Processing
Video & Audio
8cdd1372-7efb-449b-82d0-ca018ef454f2
Exposure Render
Exposure Render is a Direct Volume Rendering Application that applies progressive Monte Carlo raytracing, coupled with physically based light transport to heterogeneous volumetric data. Exposure Render enables the configuration of any number of arbitrarily shaped area lights, models a real-world camera, including its lens and aperture, and incorporates complex materials, whilst still maintaining interactive display updates. It features both surface and volumetric scattering, and applies noise reduction to remove the unwanted startup noise associated with progressive Monte Carlo rendering. The complete implementation is available in source and binary forms under a permissive free software license.
http://code.google.com/p/exposure-render/downloads/list
/content/cudazone/CUDABrowser/assets/images/applications/655602-example-01.png
/content/cudazone/CUDABrowser/assets/images/applications/655602-example-02.png
Academia
TU Delft
http://graphics.tudelft.nl
2011
10
19
10/19/2011
Open source
T. Kroes
Application
Paper
Code
Digital Content Creation
Graphics
Imaging
Medical Imaging
Ray Tracing
Monte Carlo Simulation, NVIDIA CUDA, Open Source, Ray Tracing, Volume Rendering
8cdd1372-7efb-449b-82e0-ba018ef454f2
DualSPHysics
DualSPHysics is a combined CPU/GPU solver for mesh-free Smoothed Particle Hydrodynamics to be applied in CFD applications with free-surface flows.
/content/cudazone/CUDABrowser/assets/images/applications/1370_283365_dualsphysics_cuda_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1370_283365_dualsphysics_cuda_large.png
Academia
EPHYSLAB--Universidade de Vigo and School of Mechanical, Aerospace and Civil Engineering-University of Manchester
http://ephyslab.uvigo.es/index.php/eng/dual_sphysics/
2011
01
11
01/11/2011
90
A.J.C. Crespo
J.M. Dominguez
M.G. Gesteira
Multimedia
Paper
Computational Fluid Dynamics
Science
SPH, GPU, meshless method, lagrangian, fluid dynamics, free-surface flow,A.J.C. Crespo,J.M. Dominguez,M.G. Gesteira,alexbexe@uvigo.es,jmdominguez@uvigo.es,mggesteira@uvigo.es
9561fe2e-1de6-461b-b0dc-1cae41da5eb5
NeMo: real-time spiking neural network simulation
Spiking neural network simulations are used to model biological brain structures. Simulating large-scale networks is computationally expensive, however, due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied (i.e. robotic) setting, the simulation must be real-time in order to be useful.
http://nemosim.sourceforge.net
/content/cudazone/CUDABrowser/assets/images/applications/1368_193704_firing_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1368_193704_firing_large.png
Academia
Imperial College London
http://www.imperial.ac.uk
2010
07
18
07/18/2010
20
Open source
Andreas Fidjeland
Murray Shanahan
Paper
Life Sciences
Science
neural network simulation,Andreas Fidjeland,Murray Shanahan,andreas.fidjeland@imperial.ac.uk
a6fbee45-c093-4bab-958c-ce8e5e7baa64
CUDA Benoit
Realtime, high resolution, high iteration, supersampled, fractal zoom. Specify the vanishing point, iteration count and colors for an animated zoom into the Mandelbrot set and then watch the zoom without having to run a separate, lengthy, rendering step. Multiple zoom specifications, called tracks, can be stored and played back in sequence, creating a continuously running fractal zooming show.
/content/cudazone/CUDABrowser/assets/images/applications/1367_86365_player_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1367_86365_player_large.jpg
Research
2011
04
30
04/30/2011
Open source
Roger Dahl
Paper
Application
Code
Graphics
Mandelbrot fractal realtime zoom log scale map,Roger Dahl,dahlsys@gmail.com
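The per-pixel work behind a Mandelbrot zoom like the one in the entry above is the escape-time iteration, sketched here in plain Python. This is an illustrative CPU version, not the CUDA Benoit code: the names `escape_time` and `render`, the escape radius of 2, and the default iteration cap are assumptions.

```python
def escape_time(c, max_iter=100):
    """Count iterations of z -> z*z + c before |z| exceeds 2.

    Returns max_iter if the orbit has not escaped by then, in which
    case c is treated as belonging to the Mandelbrot set.
    """
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render(width, height, center, scale, max_iter=100):
    """Return rows of escape times for a width x height pixel grid.

    `center` is the complex number at the middle of the image and
    `scale` is the distance in the complex plane between pixels.
    """
    rows = []
    for py in range(height):
        row = []
        for px in range(width):
            c = center + complex((px - width / 2) * scale,
                                 (py - height / 2) * scale)
            row.append(escape_time(c, max_iter))
        rows.append(row)
    return rows
```

Every pixel is independent of its neighbours, which is why this computation maps naturally onto one GPU thread per pixel; a zoom animation amounts to calling `render` with a shrinking `scale`.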
457a8222-cc5c-4ab5-a938-290aa493803f
SARUMAN
SARUMAN (Semiglobal Alignment of short Reads Using CUDA and NeedleMAN-Wunsch) is a mapping approach that returns all possible alignment positions of a read in a reference genome sequence under a given error threshold, together with one optimal alignment for each of these positions. Alignments are computed in parallel on graphics hardware, facilitating a considerable speedup of this normally time-consuming step.
http://www.cebitec.uni-bielefeld.de/brf/saruman/saruman.html
/content/cudazone/CUDABrowser/assets/images/applications/1366_61128_saruman_flow_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1366_61128_saruman_flow_large.png
Academia
Center for Biotechnology, Bielefeld University
http://www.cebitec.uni-bielefeld.de/
2011
03
30
03/30/2011
Jochen Blom
Tobias Jakobi
Daniel Doppmeier
Sebastian Jaenicke
Jörn Kalinowski
Jens Stoye
Alexander Goesmann
Paper
Life Sciences
Science
Bioinformatics, Sequence Alignment, Short read mapping, Bioinformatics workbench, Sequence Analysis,Jochen Blom,Tobias Jakobi,Daniel Doppmeier,saruman@cebitec.uni-bielefeld.de
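The Needleman-Wunsch recurrence that gives SARUMAN part of its name can be sketched as a short dynamic program. The version below computes a plain global alignment score on the CPU; it is not SARUMAN's semiglobal GPU implementation, and the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # First column: a[:i] aligned against an empty prefix of b.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + substitution,  # align a[i-1] with b[j-1]
                prev[j] + gap,               # gap in b
                curr[j - 1] + gap,           # gap in a
            )
        prev = curr
    return prev[-1]
```

Only two rows of the score matrix are kept, which is enough for the score alone; recovering the alignment itself would require keeping the full matrix or traceback pointers.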
a11c0bd6-6125-4d81-ad46-4f5c320951c7
Data Assimilation using a GPU Accelerated Path Integral Monte Carlo Approach
A general approach to data assimilation (state and parameter estimation) in nonlinear dynamical systems with noisy dynamics and noisy measurements. In general terms, it is a method for extracting a few useful pieces of information from a large amount of raw time series data.
/content/cudazone/CUDABrowser/assets/images/applications/1365_90544_unobsStatesSmall_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1365_90544_unobsStatesSmall_large.png
Academia
University of California, San Diego
http://physics.ucsd.edu/
2011
04
05
04/05/2011
300
John C. Quinn
Henry D.I. Abarbanel
Paper
Science
Signal Processing
Data Assimilation, Parameter Estimation, Monte Carlo, Path Integral,John C. Quinn,Henry D.I. Abarbanel,jquinn@ucsd.edu,habarbanel@ucsd.edu
cf146928-c319-4ea3-97ee-6f24e3b80847
CP_select
Parallel selection algorithm for GPUs: calculation of the median and order statistics.
/content/cudazone/CUDABrowser/assets/images/applications/1364_8595_median_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1364_8595_median_large.png
Academia
Deakin University
http://www.deakin.edu.au/
2011
04
10
04/10/2011
6
Open source
Gleb Beliakov
Paper
Numerics
Libraries
Programming Tools
Science
selection, median, sorting,Gleb Beliakov,gleb@deakin.edu.au
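To make the selection problem in the entry above concrete, here is a sequential quickselect that finds an order statistic without fully sorting the data. It is a swapped-in textbook technique shown only to illustrate the task, not the parallel algorithm used by CP_select; the names `select_kth` and `median` are assumptions.

```python
import random


def select_kth(values, k):
    """Return the k-th smallest element (0-based) of a non-empty sequence."""
    if not 0 <= k < len(values):
        raise IndexError("k out of range")
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        if k < len(lows):
            # The answer lies among the elements smaller than the pivot.
            items = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            # Skip past the smaller and equal elements.
            k -= len(lows) + len(pivots)
            items = [x for x in items if x > pivot]


def median(values):
    """Median via selection; averages the two middle values for even sizes."""
    n = len(values)
    if n % 2 == 1:
        return select_kth(values, n // 2)
    return 0.5 * (select_kth(values, n // 2 - 1) + select_kth(values, n // 2))
```

Quickselect runs in linear expected time, but its data-dependent partitioning is awkward to parallelize, which is part of why dedicated GPU selection algorithms exist.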
5e4ba313-241d-4dc3-8146-66ddb7379614
Mesh-particle interpolations on GPUs and multicore CPUs
Particle-mesh interpolations are fundamental operations for particle-in-cell codes, as implemented in vortex methods, plasma dynamics and electrostatics simulations. In these simulations, the mesh is used to solve the field equations and the gradients of the fields are used in order to advance the particles. The time integration of particle trajectories is performed through an extensive resampling of the flow field at the particle locations. The computational performance of this resampling turns out to be limited by the memory bandwidth of the underlying computer architecture. We investigate how mesh-particle interpolation can be efficiently performed on graphics processing units (GPUs) and multicore central processing units (CPUs), and we present two implementation techniques.
http://rsta.royalsocietypublishing.org/content/369/1944/2164.abstract
/content/cudazone/CUDABrowser/assets/images/applications/1363_259257_cyl-re40000_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1363_259257_cyl-re40000_large.png
Academia
CSE Lab, ETH Zurich
www.cse-lab.ethz.ch
2011
06
01
06/01/2011
Diego Rossinelli
Christian Conti
Petros Koumoutsakos
Paper
Computational Fluid Dynamics
Game Physics
Numerics
CPU, GPU, HPC, mesh-particle, grid-particle,Diego Rossinelli,Christian Conti,Petros Koumoutsakos,petros@inf.ethz.ch
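The two interpolation directions described in the entry above can be sketched in one dimension with a linear kernel. This is a minimal illustration under stated assumptions, not either of the paper's implementation techniques: the function names, the uniform grid with node i at position i*dx, and the linear (two-node) kernel are all hypothetical choices, and production codes often use higher-order kernels.

```python
def _cell_and_weight(n, dx, x):
    """Return (left node index, weight of the right node) for position x."""
    if not 0.0 <= x <= (n - 1) * dx:
        raise ValueError("position outside the mesh")
    # Clamp so that a particle exactly on the last node uses the last cell.
    i = min(int(x / dx), n - 2)
    return i, x / dx - i


def mesh_to_particle(field, dx, x):
    """Linearly interpolate a nodal field to a particle at position x."""
    i, w = _cell_and_weight(len(field), dx, x)
    return (1.0 - w) * field[i] + w * field[i + 1]


def particle_to_mesh(n, dx, positions, strengths):
    """Scatter particle strengths onto an n-node mesh with the same weights."""
    mesh = [0.0] * n
    for x, q in zip(positions, strengths):
        i, w = _cell_and_weight(n, dx, x)
        mesh[i] += (1.0 - w) * q
        mesh[i + 1] += w * q
    return mesh
```

The gather direction (`mesh_to_particle`) only reads the mesh, while the scatter direction (`particle_to_mesh`) has several particles writing to the same node, which is the part that needs care when run in parallel.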
9abc758e-c89b-4761-bfcc-57c36df7e9a1
GPU-computing in econophysics and statistical physics
A recent trend in computer science and related fields is general purpose computing on graphics processing units (GPUs), which can yield impressive performance. With multiple cores connected by high memory bandwidth, today's GPUs offer resources for non-graphics parallel processing. This article provides a brief introduction to the field of GPU computing and includes examples.
http://dx.doi.org/10.1140/epjst/e2011-01398-x
/content/cudazone/CUDABrowser/assets/images/applications/1362_5681_econophysics_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1362_5681_econophysics_large.gif
Academia
ETH Zurich
http://www.tobiaspreis.de/
2011
04
07
04/07/2011
Open source
Tobias Preis
Paper
Finance
Science
Computational Finance, Computational Physics,Tobias Preis,mail@tobiaspreis.de
aa2667f4-078f-444b-b136-f0065d5c014e
Processing and rendering of Fourier domain optical coherence tomography images at a line rate over 524 kHz using a graphics processing unit
In Fourier domain optical coherence tomography (FD-OCT), a large amount of interference data needs to be resampled from the wavelength domain to the wavenumber domain prior to Fourier transformation. We present an approach to optimize this data processing, using a graphics processing unit (GPU) and parallel processing algorithms. We demonstrate an increased processing and rendering rate over that previously reported by using GPU paged memory to render data in the GPU rather than copying back to the CPU.
http://spie.org/x648.html?product_id=896535
/content/cudazone/CUDABrowser/assets/images/applications/1361_367646_Screenshot-a_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1361_367646_Screenshot-a_large.png
Academia
Aston University, Queen Mary Univ of London, NPL
www.aston.ac.uk
2011
02
01
02/01/2011
Janarthanan Rasakanthan
Kate Sugden
Peter H. Tomlins
Paper
Medical Imaging
Signal Processing
Optical Coherence Tomography, OCT, medical Imaging,Janarthanan Rasakanthan,Kate Sugden,Peter H. Tomlins,raskanj@aston.ac.uk
13893c5b-60cd-468f-b8ec-ea6c11e367c2
Multicore/Multi-GPU Accelerated Simulations of Multiphase Compressible Flows Using Wavelet Adapted Grids
We present a computational method of coupling average interpolating wavelets with high-order finite volume schemes and its implementation on heterogeneous computer architectures for the simulation of multiphase compressible flows. The method is implemented to take advantage of the parallel computing capabilities of emerging heterogeneous multicore/multi-GPU architectures.
http://epubs.siam.org/sisc/resource/1/sjoce3/v33/i2/p512_s1
/content/cudazone/CUDABrowser/assets/images/applications/1360_263363_application-image_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1360_263363_application-image_large.png
Academia
ETH Zurich
www.cse-lab.ethz.ch
2011
03
01
03/01/2011
Diego Rossinelli
Babak Hejazialhosseini
Daniele G. Spampinato
Paper
Computational Fluid Dynamics
Numerics
Signal Processing
GPU, compressible flow, wavelets, multiresolution, adaptive grid, multiphase, multicore architectures, OpenCL,Diego Rossinelli,Babak Hejazialhosseini,Daniele G. Spampinato,diegor@inf.ethz.ch
089e53a3-5369-4c7c-b9ff-20d04b833618
Real-time numerical dispersion compensation using graphics processing unit for Fourier-domain optical coherence tomography
Numerical dispersion compensation for both standard and full-range Fourier-domain optical coherence tomography (FD-OCT) on the graphics processing unit (GPU) architecture has been implemented. The data acquisition, processing and image display were performed on a multi-thread, CPU-GPU heterogeneous computing system. Real-time ultra-high-resolution full-range complex-conjugate-free FD-OCT imaging was demonstrated at 68.4 frames/s with a frame size of 1024 (lateral) by 2048 (axial) pixels.
/content/cudazone/CUDABrowser/assets/images/applications/1359_20606_dispersion_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1359_20606_dispersion_large.jpg
Academia
Johns Hopkins University
http://www.ece.jhu.edu/photonics/zhangkang.html
2011
03
03
03/03/2011
Kang Zhang
Jin U. Kang
Multimedia
Paper
Imaging
Medical Imaging
Life Sciences
Signal Processing
GPU, Numerical Dispersion Compensation, Optical coherence tomography,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu
fa369b09-c534-4b01-b7b8-28ee8d433c77
Real-time intraoperative 4D full-range FD-OCT based on the dual graphics processing units architecture for microsurgery guidance
Real-time 4D full-range complex-conjugate-free Fourier-domain optical coherence tomography (FD-OCT) is implemented using a dual graphics processing unit (dual-GPU) architecture. One GPU is dedicated to the FD-OCT data processing while the second is used for volume rendering and display. A GPU-accelerated non-uniform fast Fourier transform (NUFFT) is also implemented to suppress the side lobes of the point spread function and improve image quality.
http://www.opticsinfobase.org/abstract.cfm?uri=boe-2-4-764
/content/cudazone/CUDABrowser/assets/images/applications/1358_40008_microsurgery_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1358_40008_microsurgery_large.jpg
Academia
Johns Hopkins University
http://www.ece.jhu.edu/photonics/zhangkang.html
2011
03
01
03/01/2011
Kang Zhang
Jin U. Kang
Multimedia
Paper
Imaging
Medical Imaging
Life Sciences
Signal Processing
GPU, Optical coherence tomography, 4D imaging,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu
93912935-c2cc-4303-b5b2-7355ddff8c8e
IGMAS+
Three-dimensional (3D) interactive modeling with the IGMAS software provides the means for integrated processing and interpretation of geoid, gravity and magnetic fields, yielding improved geological interpretation. IGMAS 3D models are constructed using triangulated polyhedra to which constant density and/or induced and remanent susceptibility are assigned.
http://www.potentialgs.com/
/content/cudazone/CUDABrowser/assets/images/applications/1357_790835_IGMAS_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1357_790835_IGMAS_large.png
Academia
Transinsight GmbH
http://transinsight.com/
2011
02
01
02/01/2011
300
Transinsight GmbH, Christian-Albrechts-Universität zu Kiel - Department for Geophysics & Geoinformation
Paper
Oil & Gas
Science
interactive modelling, gravity, magnetic, seismic, inversion, numerical modelling, OpenCL,Transinsight GmbH, Christian-Albrechts-Universität zu Kiel - Department for Geophysics & Geoinformation,info@potentialgs.com
fb111a83-f81b-4fb8-8adf-b15c8e375ad4
Directionally Unsplit Hydrodynamic Schemes with Hybrid MPI/OpenMP/GPU Parallelization in AMR
We present the implementation and performance of a class of directionally unsplit Riemann-solver-based hydrodynamic schemes on graphics processing units (GPUs). These schemes, including the MUSCL-Hancock method, a variant of the MUSCL-Hancock method, and the corner-transport-upwind method, are embedded into the adaptive-mesh-refinement (AMR) code GAMER. Furthermore, a hybrid MPI/OpenMP model is investigated, which enables the full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance.
/content/cudazone/CUDABrowser/assets/images/applications/1356_1516457_KH_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1356_1516457_KH_large.png
Academia
National Taiwan University, Department of Physics
2011
03
22
03/22/2011
101
Hsi-Yu Schive
Ui-Han Zhang
Tzihong Chiueh
Paper
Computational Fluid Dynamics
Science
hybrid MPI/OpenMP/GPU, AMR,Hsi-Yu Schive,Ui-Han Zhang,Tzihong Chiueh,b88202011@ntu.edu.tw
e47a37df-6e16-41c8-b775-a88790713add
Horizon MHD
General relativistic magnetohydrodynamics code. Used in computational astrophysics applications, particularly the prediction of gravitational radiation from compact objects and the dynamics of magnetars.
http://www.horizoncode.org/
/content/cudazone/CUDABrowser/assets/images/applications/1355_172283_orszag_tang_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1355_172283_orszag_tang_large.png
Academia
University of Tuebingen, Institute for Astronomy and Astrophysics
2011
02
25
02/25/2011
200
Burkhard Zink
Multimedia
Paper
Computational Fluid Dynamics
Science
mhd astrophysics simulator relativistic,Burkhard Zink,bzink@tat.uni-tuebingen.de
d0c8b7d9-8adf-4aa6-82aa-4dc5d4c7a070
Practical Time Bundle Adjustment for 3D Reconstruction on GPU
We present a hybrid implementation of sparse bundle adjustment on the GPU using CUDA, with the CPU working in parallel. The algorithm is decomposed into smaller steps, each of which is scheduled on the GPU or the CPU. We develop efficient kernels for the steps and make use of existing libraries for several steps. Our implementation significantly outperforms the standard CPU implementation, achieving a speedup of 30-40 times for datasets with up to 500 images on an Nvidia Tesla C2050 GPU.
/content/cudazone/CUDABrowser/assets/images/applications/1354_45129_CPU-GPU-Hybrid3_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1354_45129_CPU-GPU-Hybrid3_large.jpg
Academia
IIIT Hyderabad
www.iiit.ac.in
2011
01
01
01/01/2011
40
Siddharth Choudhary
Paper
Siddharth Choudhary,siddharth.choudhary@research.iiit.ac.in
5064a523-1a53-4c6e-9be0-49b519753279
Flow dynamics measurements using digital holographic PIV
An in-line digital holographic PIV (D-HPIV) setup and a CUDA-accelerated algorithm were implemented in order to measure the instantaneous three-dimensional (3D), three-component (3C) velocity field of nonstationary flows. The CUDA acceleration dramatically increases the speed of digital video hologram processing. The system can measure the number, 3D position, size, 3C velocity and track of the particles. The results of the hologram reconstruction are rendered using OpenGL.
/content/cudazone/CUDABrowser/assets/images/applications/1353_145946_Figure3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1353_145946_Figure3_large.png
Academia
Petrozavodsk State University
www.petrsu.ru
2010
10
07
10/07/2010
1000
Dmitry Ekimov
Paper
Imaging
Numerics
Science
Signal Processing
Video & Audio
Dmitry Ekimov,edmitr@onego.ru
c3c135bf-3665-438c-b634-25ee54e81a90
Numerical simulation of flow around an oscillating cylinder
This program presents a finite difference solution for 2D, low Reynolds number (1-350), unsteady flow around, and heat transfer from, a stationary or oscillating circular cylinder with constant surface temperature placed in a uniform stream. The fluid is assumed to be incompressible with constant properties. The cylinder is moved mechanically and can vibrate in-line with or transverse to the main stream, or can follow an elliptical or figure-8-shaped path. The governing equations are the Navier-Stokes equations, the continuity equation, a Poisson equation for pressure and the energy equation.
http://www.filefactory.com/file/cac6d70/n/FlowCFD.zip
/content/cudazone/CUDABrowser/assets/images/applications/1352_24973_NVIDIA_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1352_24973_NVIDIA_large.jpg
Academia
Department of Fluid and Heat Engineering, University of Miskolc, Hungary
www.uni-miskolc.hu
2010
02
02
02/02/2010
13
Prof. Laszlo Baranyi, Laszlo Daroczy
Multimedia
Paper
Computational Fluid Dynamics
Numerics
Science
CFD, numerics, Computational Fluid Dynamics, oscillating cylinder, in-line oscillation, transverse oscillation, Figure-8-shape motion, SOR, successive over-relaxation, heat transfer, 2D, Reynolds number, Strouhal number, Nusselt number, incompressible, lift, drag, Poisson equation, Navier-Stokes equations, temperature,Prof. Laszlo Baranyi, Laszlo Daroczy,arambl@uni-miskolc.hu; daroczy4@freemail.hu
acbbd15e-82f0-45e6-988a-f1726e4bb1ce
Running the High Performance Linpack (HPL) Benchmark on NVIDIA GPUs
The HPL benchmark is used to rank the world's Top500 supercomputers. This is a step-by-step procedure for running NVIDIA's version of the HPL benchmark on Tesla GPUs. We also compare the results of a normal HPL run on CPU to a hybrid run on CPU-GPU to show the performance boost gained with GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1349_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1349_logo_large.png
Research
Saudi Aramco
2011
01
10
01/10/2011
Open source
Mohamad Sindi
Paper
Benchmark
NVIDIA, Linpack, HPL, GPU, Top500, FLOPS, High Performance Computing, HPC,Mohamad Sindi,sindimo@ieee.org
be475639-1007-4bf8-bcc8-b103379fdf9d
GPU Vision: Accelerating Computer Vision algorithms with Graphics Processing Units
We present an introduction to the field of GPU accelerated computer vision by examining several projects that provide the framework for researchers and developers to tap into the computational power of Graphics Processing Units (GPU). Our goal is to identify the tools and areas where GPU acceleration can provide the highest performance increases in computer vision applications by creating performance benchmarks to compare and contrast the GPU and CPU versions in realistic applications.
http://c13software.com/downloads/GPUVision_2011.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1347_133852_haar_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1347_133852_haar_large.jpg
Academia
University of Connecticut
2011
02
09
02/09/2011
Tamas K. Lengyel
James Gedarovich
Antonio Cusano
Thomas J. Peters
Paper
Imaging
Tamas K. Lengyel,James Gedarovich,Antonio Cusano,tamas.k.lengyel@gmail.com,james.gedarovich@gmail.com,antonio.cusano@gmail.com
1deb8e36-96ab-486b-a05e-0af7a067b6b7
CUDA Image Mosaic
Creates image mosaics from a database of thumbnails on a pixel by pixel basis using CUDA to perform the image comparisons.
Digital Content Creation,Graphics
/content/cudazone/CUDABrowser/assets/images/applications/1346_601593_cuda800_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1346_601593_cuda800_large.jpg
Research
Personal
2011
02
11
02/11/2011
100
Commercial
Andy H Coates
Multimedia
Paper
Photo Mosaic Image PhotoMosaic,Andy H Coates,andyhcoates@gmail.com
50827283-e4ee-4c05-bc3d-27bd9df0436b
CUDA
Real-time renderer
/content/cudazone/CUDABrowser/assets/images/applications/1345_62423_chessRefraction_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1345_62423_chessRefraction_large.jpg
Research
Freelancer
2011
02
05
02/05/2011
20
Open Source
Thanassis Tsiodras
Multimedia
Paper
Graphics
Ray Tracing
SAH AABB BVH Triangle-meshes real-time raytracer,Thanassis Tsiodras,ttsiodras@gmail.com
7717e876-57cb-49db-9350-0ae7cc77ac63
Multi-GPU accelerated multi-spin Monte Carlo simulations of the 2D Ising model
A modern graphics processing unit (GPU) is able to perform massively parallel scientific computations at low cost. We extend our implementation of the checkerboard algorithm for the two-dimensional Ising model [T. Preis et al., Journal of Computational Physics 228 (2009) 4468-4477] in order to overcome the memory limitations of a single GPU, which enables us to simulate significantly larger systems. Using multi-spin coding techniques, we are able to accelerate simulations on a single GPU by factors of up to 35 compared to an optimized single central processing unit (CPU) core implementation that employs multi-spin coding.
www.tobiaspreis.de/publications/bvp_cpc_2010.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1344_11409_preis_multi_gpu_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1344_11409_preis_multi_gpu_large.gif
Academia
Johannes Gutenberg University Mainz
www.tobiaspreis.de
2010
08
01
08/01/2010
35
Benjamin Block
Peter Virnau
Tobias Preis
Paper
Science
Computational Physics, Monte Carlo Simulation, GPU Clusters,Benjamin Block,Peter Virnau,Tobias Preis,mail@tobiaspreis.de
96a2f87d-895b-4b9f-b04e-7ceb30d28941
Hex Protein Docking
Modelling protein-protein interactions (PPIs) is an important aspect of structural bioinformatics. The Hex spherical polar Fourier protein docking algorithm has been implemented on Nvidia graphics processor units (GPUs). On a GTX 285 GPU, an exhaustive six-dimensional docking search can be calculated in just 15 seconds using multiple one-dimensional fast Fourier transforms. This represents a 45-fold speed-up over the corresponding calculation on a single CPU, and is at least two orders of magnitude faster than conventional Cartesian grid-based FFT docking approaches.
/content/cudazone/CUDABrowser/assets/images/applications/1342_47767_hex_3hfl_docked_rainbow_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1342_47767_hex_3hfl_docked_rainbow_large.jpg
Research
INRIA
http://www.inria.fr
2010
04
24
04/24/2010
45
Dave Ritchie
Paper
Life Sciences
protein docking,Dave Ritchie,dave.ritchie@inria.fr
fe5a34d7-6844-4b87-8542-7ef5ce307b1c
GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows
We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under an applied electric field is presented. The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists is proposed.
http://portal.acm.org/citation.cfm?id=1808870
/content/cudazone/CUDABrowser/assets/images/applications/1340_header_r1_c1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1340_header_r1_c1_large.jpg
Academia
Kochi University of Technology
2010
08
01
08/01/2010
50
Alfeus Sunarso
Tomohiro Tsuji
Shigeomi Chono
Paper
Science
Alfeus Sunarso ,Tomohiro Tsuji,Shigeomi Chono,sunarso@kochi-tech.ac.jp
4672b45f-125b-480f-a8c6-f5aa647b2a75
MandelCUDA
Real-time rendering of the Mandelbrot fractal
/content/cudazone/CUDABrowser/assets/images/applications/1338_43133_mandel_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1338_43133_mandel_large.gif
Research
Freelancer
2010
03
01
03/01/2010
40
Open Source
Thanassis Tsiodras
Paper
Graphics
Application,Code,Thanassis Tsiodras,ttsiodras@gmail.com
adcd077f-17fb-44e4-91de-cb028f7fe788
CUDA Accelerated Particle Engine
A simple point sprite based particle engine accelerated with CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/1337_309888_particles (1)_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1337_309888_particles (1)_large.png
Academia
Student
2010
05
24
05/24/2010
10
Craig Mouser
Multimedia
Paper
Digital Content Creation
Graphics
Video
Audio
Craig Mouser,mouser58907@yahoo.com
8a9de003-fdec-4da9-b81e-d3c7a991d982
Parallel Option Pricing on GPU: Barrier Options and Realized Variance Options
We present parallel algorithms, implemented as CUDA subroutines ready to run on Graphics Processing Units (GPUs), to price two kinds of financial derivatives, namely continuous barrier options and realized variance options. The outstanding parallel performance of these algorithms when executed on GPUs is due to the mathematical properties of the pricing formulae used and to their software implementation.
http://www.econ.univpm.it/recchioni/finance/w13
/content/cudazone/CUDABrowser/assets/images/applications/1336_Fig2GPU_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1336_Fig2GPU_large.jpg
Academia
Universita di Camerino, Universita Politecnica delle Marche, Universita di Roma La Sapienza
2010
11
05
11/05/2010
L. Fatone
M. Giacinti
F. Mariani
Paper
Finance
L. Fatone,M. Giacinti,F. Mariani,lorella.fatone@unicam.it ,m.giacinti@univpm.it ,fra_mariani@libero.it
38d09cb5-3ff4-431c-a705-23b70259a7c1
Graphics processing unit accelerated non-uniform fast Fourier transform for ultrahigh-speed, real-time Fourier-domain OCT
We implemented fast Gaussian gridding (FGG)-based non-uniform fast Fourier transform (NUFFT) on the graphics processing unit (GPU) architecture for ultrahigh-speed, real-time Fourier-domain optical coherence tomography (FD-OCT). The Vandermonde matrix-based non-uniform discrete Fourier transform (NUDFT) as well as the linear/cubic interpolation with fast Fourier transform (InFFT) methods are also implemented on GPU to compare their performance in terms of image quality and processing speed.
http://www.opticsinfobase.org/oe/abstract.cfm?uri=oe-18-22-23472
/content/cudazone/CUDABrowser/assets/images/applications/1335_165404_finger_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1335_165404_finger_large.jpg
Academia
Johns Hopkins University
www.ece.jhu.edu
2010
10
25
10/25/2010
Kang Zhang
Jin U. Kang
Paper
Imaging
Medical Imaging
Signal Processing
Kang Zhang,Jin U. Kang,kzhang8@jhu.edu
ba0d2d88-8e95-486f-87c7-2e084e8bcc35
Real-time 4D signal processing and visualization using graphics processing unit on a regular nonlinear-k Fourier-domain OCT system
We realized graphics processing unit (GPU) based real-time 4D (3D + time) signal processing and visualization on a regular Fourier-domain optical coherence tomography (FD-OCT) system with a nonlinear k-space spectrometer. An ultra-high speed linear spline interpolation (LSI) method for λ-to-k spectral re-sampling is implemented in the GPU architecture, which gives average interpolation speeds of >3,000,000 lines/s for 1024-pixel OCT (1024-OCT) and >1,400,000 lines/s for 2048-pixel OCT (2048-OCT).
http://www.opticsinfobase.org/oe/abstract.cfm?URI=oe-18-11-11772
/content/cudazone/CUDABrowser/assets/images/applications/1333_61749_finger tip singles 2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1333_61749_finger tip singles 2_large.jpg
Academia
Johns Hopkins University
www.ece.jhu.edu
2010
05
18
05/18/2010
Kang Zhang
Jin U. Kang
Paper
Imaging
Medical Imaging
Signal Processing
Real-time 4D Optical coherence tomography,Kang Zhang,Jin U. Kang,kzhang8@jhu.edu
4ee10503-6f9a-4136-858c-5df8aa4cf07f
GPU Smoldyn
A CUDA port of the core simulation algorithms of Smoldyn.
/content/cudazone/CUDABrowser/assets/images/applications/1332_167342_screenshot_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1332_167342_screenshot_large.png
Research
COSBI
2010
12
10
12/10/2010
130
Lorenzo Dematte
Davide Prandi
Paper
Life Sciences
Lorenzo Dematte,Davide Prandi,dematte@ieee.org
6933fa46-7fc4-4053-b724-feda8995296b
MC-GPU: Monte Carlo Simulation of X-ray Transport for Medical Imaging Applications
MC-GPU is a GPU-accelerated x-ray transport simulation code that can generate clinically realistic radiographic projection images and computed tomography (CT) scans of the human anatomy. MC-GPU implements a massively multi-threaded Monte Carlo simulation algorithm for the transport of x rays in a voxelized geometry and uses the x-ray interaction models and cross sections from PENELOPE 2006. The code can handle realistic human anatomy phantoms, for example the freely available models from the Virtual Family. Electron transport is not implemented. The code has been developed using the CUDA programming model and MPI to address multiple GPUs in parallel. In typical diagnostic imaging simulations, a 15- to 30-fold speedup is obtained using a GPU compared to CPU execution.
/content/cudazone/CUDABrowser/assets/images/applications/1331_99177_mc-gpu_1mmDuke_50keV_1e10hist__All_and_NoScatter_LowRes_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1331_99177_mc-gpu_1mmDuke_50keV_1e10hist__All_and_NoScatter_LowRes_large.png
Research
US Food and Drug Administration
http://www.fda.gov/MedicalDevices/ScienceandResearch/ucm2007489.htm
2010
07
08
07/08/2010
30
Open source
Andreu Badal
Aldo Badano
Paper
Code
Medical Imaging
Ray Tracing
Andreu Badal,Aldo Badano,andreu_badal@hotmail.com
59b21188-3fc1-448c-bee4-6c5923cfcd67
A demonstration of Exact String Matching Algorithms in CUDA
I had a simple idea: is it possible to convert some of the well-known exact string matching algorithms into CUDA versions?
/content/cudazone/CUDABrowser/assets/images/applications/1330_80324_Screen_shot_2011-01-07_at_PM_2012_12_43_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1330_80324_Screen_shot_2011-01-07_at_PM_2012_12_43_large.png
Research
HP Labs Singapore
http://www.hp.com
2010
12
23
12/23/2010
100
Open source
Raymond Tay
Application
Paper
Code
general purpose computing
Raymond Tay,raymondtay1974@gmail.com
f48bb746-7b1b-4da3-b204-cc93a3414cc0
Poker Simulation In GPU
Simulation is a technique widely used by artificial and human players to support decision-making in poker. In a typical Texas hold'em game, simulating all possible game states requires millions of hand evaluations. In this application, we port the Hand-Eval poker library to CUDA, providing a generic interface for evaluating large amounts of hand data.
/content/cudazone/CUDABrowser/assets/images/applications/1329_6875_resim_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1329_6875_resim_large.jpg
Academia
METU
2010
01
31
01/31/2010
15
Commercial
Sirin,Volkan
Paper
Game Simulation
Sirin,Volkan,volkansirin@gmail.com
3b26e749-9952-4f86-b8f7-4ca377ef9dea
Satellite Image Processing on GPU
Satellite Image Processing on GPU is a demonstration of the performance of remote sensing algorithms such as Shadow Detection and Vegetation Detection on the GPU. Basic image processing algorithms such as Contrast Normalization, Histogram Equalization and Automatic Threshold (Otsu's) are also implemented. 4-band satellite images with 8-bit and 16-bit data are used in the tests. The performance of basic and complex algorithms is compared on CPU and GPU using images of various sizes. The tests also consider the effect of memory transfer and the order of bands.
/content/cudazone/CUDABrowser/assets/images/applications/1328_60270_imageGPU_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1328_60270_imageGPU_large.png
Academia
Informatics Institute, Middle East Technical University
http://www.vrcv.ii.metu.edu.tr
2010
01
31
01/31/2010
10
Open source
Mustafa Teke
Paper
Code
Signal Processing
Mustafa Teke,mustafa.teke@gmail.com
3dcc9409-2154-48c2-8b2d-9f64294ea4a5
Parallel implementation of large scale crowd simulation
Human crowd movement was simulated using texture convolution and a behavioral model inspired by smoothed particle hydrodynamics. In order to make large-scale simulation possible in real time, or almost real time, we implement a model for human crowd behavior on a parallel processing platform using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/1327_361460_accumulatorMultiColor_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1327_361460_accumulatorMultiColor_large.png
Academia
DIKU
2009
11
11
11/11/2009
Thomas Gronnelov
Paper
Crowd simulation
Thomas Gronnelov,tag@greenleaf.dk
c9f1c703-120a-4322-a545-025f92f91c95
Monte Carlo simulation of the q-state Potts Model using CUDA
In this work we implement a parallel code to perform finite-temperature Monte Carlo simulations of a magnetic system described by a two-dimensional q-state Potts model.
http://www.famaf.unc.edu.ar/grupos/GPGPU/Potts/CUDAPotts.html
/content/cudazone/CUDABrowser/assets/images/applications/1326_82402_potts-nvidia_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1326_82402_potts-nvidia_large.jpg
Academia
GPGPU Computing Group - Fa.M.A.F. - U.N.C.
http://www.famaf.unc.edu.ar/grupos/GPGPU/
2010
01
05
01/05/2010
155
Open source
Ezequiel E. Ferrero
Juan Pablo De Francesco
Nicolas Wolovick, Sergio A. Cannas
Paper
Statistical Mechanics
Ezequiel E. Ferrero,Juan Pablo De Francesco,Nicolas Wolovick, Sergio A. Cannas,ferrero@famaf.unc.edu.ar,jde@famaf.unc.edu.ar,nicolasw@famaf.unc.edu.ar, cannas@famaf.unc.edu.ar
ce4b9b35-490f-49f0-b1e7-4d5ec3b9841f
CoroBot
CUDA-enabled controller for a mobile robot. The controller takes advantage of an ION board. Machine vision algorithms are accelerated by a minimum of 8x compared to their single-threaded C++ versions executed on the ION Atom CPU.
/content/cudazone/CUDABrowser/assets/images/applications/1325_6852_corobot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1325_6852_corobot_large.jpg
Commercial
RealityFrontier
http://www.realityfrontier.com
2010
09
01
09/01/2010
8
Commercial
Raphael Cariou
Application
Multimedia
Imaging
Signal Processing
Robotics
Raphael Cariou,raphael.cariou@realityfrontier.com
9101844c-56f4-40d7-a856-c75fc81e385a
Simulating spin models on GPU
Simulations of the Ising, Heisenberg and spin-glass models with Metropolis and parallel tempering updates.
/content/cudazone/CUDABrowser/assets/images/applications/1324_checker_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1324_checker_large.png
Academia
Johannes Gutenberg-University Mainz
http://www.uni-mainz.de
2010
01
07
01/07/2010
1000
Open source
Martin Weigel
Paper
Code
Science
Statistical physics
Martin Weigel,weigel@uni-mainz.de
be82fae2-de7e-48bc-94c2-5f68a86c8c48
iWormhole Desktop Edition
An ultra-secure file-sending Windows application. This application was designed for consumer use with three guiding principles: 1) Speed, 2) Privacy and 3) Security.
/content/cudazone/CUDABrowser/assets/images/applications/1323_36895_screenshot_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1323_36895_screenshot_large.png
Commercial
iWormhole Communications Corp
http://www.iwormhole.com
2010
12
12
12/12/2010
1700
Commercial
Rob Gagnon
Application
File Transmission
Rob Gagnon,rob@iwormhole.com
56b1c6a1-88bd-4f75-9898-730bd21ce344
Simulation of 1+1 dimensional surface growth and lattices gases using GPUs
Restricted solid-on-solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs by either CUDA or OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions, related to the Asymmetric Simple Exclusion Process, and show that for sizes that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup (x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, which requires extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized, and its dynamical behavior has been investigated.
/content/cudazone/CUDABrowser/assets/images/applications/1322_15738_Model1d_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1322_15738_Model1d_large.png
Academia
MTA-MFA, Res. Inst. for Tech. Phys. and Materials Sci. Budapest
http://www.mfa.kfki.hu
2010
12
03
12/03/2010
100
Henrik Schulz
Geza Odor
Gergely Odor, Mate F. Nagy
Paper
Statistical Physics
Henrik Schulz,Geza Odor,Gergely Odor, Mate F. Nagy,odor@mfa.kfki.hu
e795a2dd-798b-4cb4-9630-e8c7cd042a16
CUVI Lib
CUDA for Vision and Imaging Library
/content/cudazone/CUDABrowser/assets/images/applications/1321_5734_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1321_5734_logo_large.png
Commercial
TunaCode
2010
08
26
08/26/2010
40
Tauseef Rehman
Salman Haq
Usman Aziz, Jawad Masood
Code
Medical Imaging
Libraries
Programming Tools
Signal Processing
Tauseef Rehman,Salman Haq,Usman Aziz, Jawad Masood,tauseef@tunacode.com
c0ae46f3-d256-4c38-9f59-11cddd117c0a
Interactive visualization of the largest radioastronomy cubes
Astronomy is a data-intensive science. Upcoming and future astronomy research facilities will systematically generate terabyte-sized data sets, moving astronomy into the Petascale data era. Such increases in dataset size and dimensionality will pose serious computational challenges for many current astronomy data analysis and visualization tools.
http://astronomy.swin.edu.au/~ahassan/Research.html
/content/cudazone/CUDABrowser/assets/images/applications/1320_optiportal_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1320_optiportal_large.jpg
Academia
Swinburne University of Technology-Centre for Astrophysics and Supercomputing
http://astronomy.swin.edu.au/scivis/
2010
09
01
09/01/2010
A. H. Hassan
C. J. Fluke
D. G. Barnes
Application
A. H. Hassan,C. J. Fluke,D. G. Barnes
f01e4f1c-7947-4609-b24c-4732f954bafb
Simulation of 1+1 dimensional surface growth and lattices gases using GPUs
Restricted solid-on-solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs by either CUDA or OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions, related to the Asymmetric Simple Exclusion Process, and show that for sizes that fit into the shared memory of GPUs one can achieve the maximum parallelization speedup (x100 for a Quadro FX 5800 graphics card with respect to a single CPU of 2.67 GHz). This permits us to study the effect of quenched columnar disorder, which requires extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized, and its dynamical behavior has been investigated.
/content/cudazone/CUDABrowser/assets/images/applications/1318_15738_Model1d_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1318_15738_Model1d_large.png
Academia
MTA-MFA, Res. Inst. for Tech. Phys. and Materials Sci. Budapest
http://www.mfa.kfki.hu
2010
12
03
12/03/2010
100
Henrik Schulz
Geza Odor
Gergely Odor, Mate F. Nagy
Paper
Statistical Physics
Henrik Schulz,Geza Odor,Gergely Odor, Mate F. Nagy,odor@mfa.kfki.hu
4936a20a-b98f-4e39-94d2-ec174d002e9e
Nonlinear Free Surface Water Waves
Fast Desktop Computing for Nonlinear Free Surface Water Waves (OceanWave3D potential flow model)
/content/cudazone/CUDABrowser/assets/images/applications/1317_47409_whalint3_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1317_47409_whalint3_large.jpg
Academia
Technical University of Denmark
http://www.imm.dtu.dk/~apek
2010
12
03
12/03/2010
42
Open source
Allan P. Engsig-Karup
Application
Computational Fluid Dynamics
Numerics
oceanwave3d, potential free surface flow, finite difference method, coastal engineering,Allan P. Engsig-Karup,apek@imm.dtu.dk
a79d4690-9c30-4837-88bc-805873f7e5f5
Lagrangian Stochastic Particle Model using Large-Eddy Simulation Meteorology
Atmospheric transport and dispersion (T&D) models play an important role in United States national defense. Due to operational time constraints, less sophisticated models have consistently dominated the defense market. Recent advances in graphics processing units (GPUs) and their programming models have made GPUs an attractive platform for commodity, low-power, high-performance parallel computing. Two GPU-accelerated versions (using NVIDIA Corporation's CUDA technology) of a sophisticated, large-eddy simulation (LES) based, Lagrangian stochastic model, developed at the National Center for Atmospheric Research (NCAR), were implemented and compared against their single- and multiple-core CPU (Intel Harpertown) counterparts. The implementation representing the shortest route to GPU acceleration achieved a single-GPU speedup of 14x over the single-core CPU implementation. A more robust and scalable single-GPU implementation achieved speedups of 20x over the single-core CPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/1316_27146_ave_plan_view_crop_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1316_27146_ave_plan_view_crop_large.png
Academia
University of Colorado - Boulder
2010
07
13
07/13/2010
20
Jonathan Hurst
Paper
Computational Fluid Dynamics
Jonathan Hurst,jhurst@ucar.edu
ebb0619e-dd44-4c10-ac6c-1659ff388b6f
rCUDA 2.0
Allows performing CUDA calls to remote GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1315_4691_rCUDA_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1315_4691_rCUDA_logo_large.png
Academia
UPV / UJI
2010
11
24
11/24/2010
Open source
The rCUDA Team
Application
Paper
Code
Libraries
Programming Tools
The rCUDA Team,apenya@gap.upv.es
bab91959-3b2a-494c-8644-cf771a2f9bc0
LATTE
GPU-accelerated self-consistent tight-binding molecular dynamics for materials with mixed covalent and ionic bonding.
/content/cudazone/CUDABrowser/assets/images/applications/1314_main_orig_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1314_main_orig_large.png
Research
Los Alamos National Laboratory
http://www.lanl.gov
2010
11
01
11/01/2010
Open source
E.J. Sanville
N. Bock
A. M. N. Niklasson
A. Odell
S. Rudin
M. J. Cawkwell
J. Coe
Code
Life Sciences
Science
E.J. Sanville,N. Bock,J. Coe, A. M. N. Niklasson, A. Odell, S. Rudin, M. J. Cawkwell,edsanville@gmail.com,nbock@lanl.gov, jcoe@lanl.gov, amn@lanl.gov, aodell@kth.se, srudin@lanl.gov, cawkwell@lanl.gov
4dfaf12b-06c9-418f-8250-7d75fa91a932
Reverse extraction of early-age hydration kinetic equation from observed data of Portland cement.
The early-age hydration of Portland cement paste has an important impact on the formation of microstructure and the development of strength. However, manual derivation of a hydration kinetic equation is very difficult because cement hydration involves multi-phased, multi-sized and interrelated complex chemical and physical reactions. In this paper, the early-age hydration kinetic equation is reversely extracted automatically from the observed time series of the hydration degree of Portland cement using an evolutionary computation method that combines gene expression programming and particle swarm optimization algorithms. To reduce the computing time, GPUs are used for parallel acceleration. Studies have shown that the simulation curve of early-age hydration obtained from the extracted kinetic equation is in good accordance with the observed experimental data. Furthermore, this equation retains good generalization ability even when chemical composition, particle size and curing conditions change.
/content/cudazone/CUDABrowser/assets/images/applications/1313_75384_Reverse_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1313_75384_Reverse_large.jpg
Academia
Provincial Key Laboratory for Network-based Intelligent Computing, University of Jinan, Jinan 250022, China
2010
11
19
11/19/2010
WANG Lin
YANG Bo
ZHAO XiuYang, CHEN YueHui, CHANG Jun
Paper
Science
Material
WANG Lin,YANG Bo,ZHAO XiuYang, CHEN YueHui, CHANG Jun,wangplanet@gmail.com
f905568d-b7d2-4ac5-b102-89b8f9a8cbdc
IntelliEtch GPU module
IntelliEtch is an Anisotropic Wet Etch simulator. This chemical process can be used for Silicon-based Microsystems fabrication. IntelliEtch can be used as a CAD tool for Microsystem fabrication, allowing fast and accurate simulations.
/content/cudazone/CUDABrowser/assets/images/applications/1312_87164_Images-036_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1312_87164_Images-036_large.jpg
Research
I3M Institute (Polytechnic University of Valencia), DIPC Institute (University of the Basque Country)
2010
10
01
10/01/2010
150
Commercial
N Ferrando
M A Gosalvez
Multimedia
Paper
Computer Aided Engineering
Microsystems
MEMS, microsystems, cellular automata,N Ferrando,M A Gosalvez,nesferjo@upvnet.upv.es,miguelangel.gosalvez@ehu.es
a63d80ac-a9bb-4cb9-a632-66d9d2563f07
Ultra Fast SOM using CUDA
This paper presents an overview of the optimization strategies used for the parallel implementation of Basic-SOM on the GPU using the CUDA programming paradigm.
/content/cudazone/CUDABrowser/assets/images/applications/1311_NeST-NVIDIA_Center_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1311_NeST-NVIDIA_Center_large.png
Commercial
NeST
http://nestsoftware.com/
2010
05
18
05/18/2010
Sijo Mathew
Preetha Joy
Paper
Numerics
Data mining
Sijo Mathew,Preetha Joy,hpc@nestgroup.net
30fbce15-63ea-41fe-8ff3-665316b591e3
AgiSoft PhotoScan
AgiSoft PhotoScan is an advanced image-based 3D modeling solution for creating professional quality 3D content from still images. Based on the latest multi-view 3D reconstruction technology, it operates on arbitrary images and is efficient in both controlled and uncontrolled conditions. The photos can be taken from any position, provided that the object to be reconstructed is visible in at least two photos. Both image alignment and 3D model reconstruction are fully automated.
/content/cudazone/CUDABrowser/assets/images/applications/1309_436476_logo-pscan-2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1309_436476_logo-pscan-2_large.png
Commercial
AgiSoft
http://www.agisoft.ru
2010
08
18
08/18/2010
20
Commercial
AgiSoft
Application
Computer Aided Engineering
Digital Content Creation
Graphics
image based modeling,AgiSoft,info@agisoft.ru
46c6675d-b59a-49c8-89dd-fc7a32e6484e
Field Forge
Field Forge brings massively parallel processing (MPP) to PostgreSQL's current single-threaded sessions. Field Forge utilizes the MPP power of the Kappa framework. The Kappa framework provides practical usage of CUDA GPU, OpenMP, and partitioned data flow scheduled processing. Field Forge makes the Kappa framework from Psi Lambda LLC a new language for defining Window and Table functions. These functions allow processing to be specified using SQL and index component notation for MPP using GPUs and CPUs. Within each Field Forge node, the Kappa framework passes (subsets of) the data sets between processing kernels and into and out of data sets. Field Forge also utilizes the Kappa framework's Apache Portable Runtime (APR) database driver SQL connections to retrieve data fields from any database source (including other Field Forge sessions and nodes), process them using the MPP capabilities of the Kappa framework, and return them as PostgreSQL table or window fields from table or window functions respectively. This combination of features enables a Dataset Passing Interface (DPI) for distributed MPP. DPI leverages the existing skills, protocols, connectivity, and infrastructure of an organization.
/content/cudazone/CUDABrowser/assets/images/applications/1308_37625_psilambdakappa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1308_37625_psilambdakappa_large.png
Commercial
Psi Lambda LLC
http://psilambda.com
2010
11
07
11/07/2010
Commercial
Psi Lambda LLC
Application
Finance
Numerics
Life Sciences
Programming Tools
Science
PostgreSQL OpenMP CUDA Window Table Partition,Psi Lambda LLC,kappa@psilambda.com
dbb00087-2b32-421a-9e65-f1295b0546b6
Introducing libflame with multi-GPU support
We are happy to announce the fifth milestone release (r4648) of libflame, a modern replacement for the most-used functionality of the LAPACK linear algebra library. The main improvement since version 4.0 is that libflame now supports parallel execution using multiple GPUs through the SuperMatrix runtime system. By linking libflame with CUBLAS for the execution of BLAS routines on a single GPU, the SuperMatrix runtime system schedules operations to each GPU and manages the explicit movement of data. This release includes support for single and double precision real and complex floating point operations.
/content/cudazone/CUDABrowser/assets/images/applications/1307_205958_FLAMEbanner_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1307_205958_FLAMEbanner_large.png
Academia
UT Austin / Universitat Jaume
2010
10
28
10/28/2010
Open source
Ernie Chan
Francisco Igual
Field van Zee, Robert van de Geijn
Application
Code
Numerics
Libraries
Ernie Chan,Francisco Igual,Field van Zee, Robert van de Geijn,figual@icc.uji.es
6865f9fe-2a11-47bc-9d50-c0b4026248dc
alenka
Alenka is a high level, high performance SQL-like language for data processing on CUDA hardware
/content/cudazone/CUDABrowser/assets/images/applications/1306_53666_Cubes_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1306_53666_Cubes_large.png
Research
2010
11
02
11/02/2010
Open source
Anton K.
Application
Code
Programming Tools
databases
Anton K.,antonmks@gmail.com
159ab96a-c59e-4853-bd70-03a4e7691b6f
CUDA Accelerated Face Recognition
We explore one of the possibilities of parallelizing and optimizing a well-known Face Recognition algorithm, Principal Component Analysis (PCA) with Eigenfaces.
/content/cudazone/CUDABrowser/assets/images/applications/1305_NeST-NVIDIA_Center_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1305_NeST-NVIDIA_Center_large.png
Commercial
NeST
http://nestsoftware.com
2010
07
26
07/26/2010
Numaan. A
Sibi A
Paper
Imaging
Numaan. A,Sibi A,hpc@nestgroup.net
df37fa39-5f4f-43cd-8a94-797995daea91
On the Use of Small 2D Convolutions on GPUs
Computing many small 2D convolutions using FFTs is a basis for a large number of applications in many domains in science and engineering, among them electromagnetic diffraction modeling in physics. The GPU architecture seems to be a suitable architecture to accelerate these convolutions, but reaching high application performance requires substantial development time and non-portable optimizations. In this work, we present the techniques, performance results and considerations to accelerate small 2D convolutions using CUDA, and compare performance to a multi-threaded CPU implementation. To improve programmability and performance of applications that make heavy use of small convolutions, we argue that two improvements to software and hardware are needed: FFT libraries must be extended with a single convolution function and communication bandwidth between CPU and GPU needs to be drastically improved.
/content/cudazone/CUDABrowser/assets/images/applications/1304_2dconvolutions_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1304_2dconvolutions_large.jpg
Academia
TUDelft, ASML, TU/e
http://www.tudelft.nl/http://www.asml.nl/http://www.tue.nl
2010
06
19
06/19/2010
Shams Al Umairy
Alexander S. van Amesfoort
Henk Sips, Irwan Setija, Martijn van Beurden
Irwan Setija
Martijn van Beurden
Paper
Numerics
Science
2D convolution, FFT, Electromagnetic diffraction grating,GPU, CUDA, Tesla,Shams Al Umairy,Alexander S. van Amesfoort,Henk Sips, Irwan Setija, Martijn van Beurden,salumairy@gmail.com,a.s.vanamesfoort@tudelft.nl,sips@ewi.tudelft.nl,Irwan.Setija@asml.com,M.C.v.Beurden@tue.nl
f0e9b09f-f104-4594-9ec8-165b10670b21
GFARGO
GFARGO simulates the evolution of a gaseous protoplanetary disk subject to the gravitational perturbation of forming protoplanets embedded in it, by solving the Navier-Stokes equations on a polar mesh. It simultaneously describes how the planetary orbits expand or shrink with time, a process known as planetary migration, which plays an important role in shaping the planetary system that emerges once the disk dissipates. The current implementation is two-dimensional, and performance gains of up to 90x are achieved with respect to CPU implementations.
/content/cudazone/CUDABrowser/assets/images/applications/1303_35573_fargo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1303_35573_fargo_large.png
Academia
Institute of Physical Sciences, UNAM, Mexico and CEA, Saclay, France
2010
10
22
10/22/2010
90
Open source
Frederic Masset
Application
Code
Computational Fluid Dynamics
Science
Frederic Masset,fmasset@cea.fr
6c2695c6-b17f-4357-ba6a-71715583d5ab
2 million pixel experiment
This experimental application maps an HD video source (1080p) into 3D space. Each frame is processed in real time on the GPU using NVIDIA CUDA technology. Each pixel in a frame (2,073,600 pixels per frame) is scaled by its luminance value and given the original color. The application is written in C# using DirectX11 via SlimDX, CUDA.NET and DirectShow.NET libraries.
/content/cudazone/CUDABrowser/assets/images/applications/1302_114048_visualcompute_cuda_app960_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1302_114048_visualcompute_cuda_app960_large.jpg
Research
noumentalia.de - digital arts - visualcompute.com
http://www.noumentalia.de
2010
10
22
10/22/2010
Philipp Drieger
Multimedia
Presentation
Digital Content Creation
Graphics
Imaging
Libraries
Science
Signal Processing
Video & Audio
HD video processing 1080p 3D CUDA .NET C# map 3D space,Philipp Drieger,info@visualcompute.com
ece4bfff-3896-47dd-b0af-f6abbb592a8e
powDOG: powder diffraction on GPUs
Diffraction, particularly of X-rays, is a powerful technique for the investigation of structure, microstructure and dynamical properties of matter. In order to link theoretical methods, like Molecular Dynamics and other atomistic approaches, with diffraction experiments, we developed new software for calculating the powder diffraction pattern of nano-sized objects on GPUs. The software, soon to be made available under the GPL license, allows the use of GPUs on different hosts for a direct (brute-force) computation of the Debye scattering equation.
/content/cudazone/CUDABrowser/assets/images/applications/1301_1322162_powDOG_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1301_1322162_powDOG_large.png
Academia
University of Trento, Trento, Italy
2010
02
08
02/08/2010
Luca Gelisio
Cristy Leonor Azanza Ricardo, Matteo Leoni, Paolo Scardi.
Application
Science
Powder diffraction, Debye scattering equation, nanostructured materials,Luca Gelisio,Cristy Leonor Azanza Ricardo, Matteo Leoni, Paolo Scardi.,luca.gelisio@unitn.it
85367756-abdb-4bc0-ab43-beed43680f51
GPU Accelerated Likelihoods for Stereo-Based Articulated Tracking
For many years articulated tracking has been an active research topic in the computer vision community. While working solutions have been suggested, computational time is still problematic. We present a GPU implementation of a ray-casting based likelihood model that is orders of magnitude faster than a traditional CPU implementation. We explain the non-intuitive steps required to attain an optimized GPU implementation, where the dominant part is to hide the memory latency effectively. Benchmarks show that computations which previously required several minutes are now performed in a few seconds.
/content/cudazone/CUDABrowser/assets/images/applications/1299_88964_gpu_vision_2010_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1299_88964_gpu_vision_2010_large.png
Academia
The eScience Centre,Dept. of Computer Science, University of Copenhagen
http://www.diku.dk
2010
09
05
09/05/2010
600
Rune Mollegaard Friborg
Soren Hauberg
Kenny Erleben
Paper
Ray Tracing
Computer Vision
Machine Learning
Tracking
Articulated Tracking,Particle Filtering,Rune Mollegaard Friborg,Soren Hauberg,Kenny Erleben ,runef@diku.dk,hauberg@diku.dk,kenny@diku.dk
8052c851-ce8c-4767-9452-e3df12796c1d
Electronic Design Automation
GPU-Based Robust Multigrid Preconditioned Solver for Large Scale On-Chip Power Grid Simulation
/content/cudazone/CUDABrowser/assets/images/applications/1298_1108533_40-1-9_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1298_1108533_40-1-9_large.jpg
Academia
Michigan Technological University
http://www.ece.mtu.edu/~zhuofeng/MTU_VLSI_DA.htm
2010
09
15
09/15/2010
50
Zhuo Feng
Multimedia
Paper
Computer Aided Engineering
Electronic Design Automation
Multigrid, preconditioned iterative methods, power delivery network, on-chip interconnect simulation, VLSI system,Zhuo Feng,zhuofeng@mtu.edu
c9e901d1-7e78-40ed-ac3d-4b59592bbb9b
Engine_cudamrg for OpenSSL
Engine_cudamrg is a cryptographic engine for the OpenSSL Toolkit that can accelerate some operations using a CUDA-supported device. We currently support the following cipher types: * AES-128-ECB * AES-128-CBC * AES-192-ECB * AES-192-CBC * AES-256-ECB * AES-256-CBC We support both encryption and decryption for these cipher types. For future releases we plan to optimize the currently supported cipher types and to add more cipher types and digest algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/1297_8191_engineCudamrg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1297_8191_engineCudamrg_large.png
Commercial
Engine_cudamrg Development Team
http://groups.google.com/group/engine-cudamrg
2010
07
26
07/26/2010
Open source
Paolo Margara
Application
Code
Cryptography
AES, cryptography,Paolo Margara,paolo.margara@gmail.com
8179b89f-1d5e-4dad-84d4-6b0466dcde7e
Smoke Simulation for Fire Engineering using a Multigrid Method on Graphics Hardware
We present a GPU-based Computational Fluid Dynamics solver for the purpose of fire engineering. We apply a multigrid method to the Jacobi solver when solving the Poisson pressure equation, supporting internal boundaries. Boundaries are handled on the coarse levels, ensuring that boundaries will never vanish after restriction. We demonstrate cases where the multigrid solver computes results up to three times more accurate than the standard Jacobi method within the same time, providing rich visual details and flows closer to widely accepted standards in fire engineering. Making accurate interactive physical simulation available for engineering purposes has the benefit of reducing production turn-around time. We have measured speed-up improvements by a factor of up to 350, compared to existing CPU-based solvers. The present CUDA-based solver promises huge potential in economic benefits, as well as the construction of safer and more complex buildings. In this paper, the multigrid method is applied to fire engineering. However, this is not a limitation, since improvements are possible for other fields as well. Traditional Jacobi solvers are particularly suitable for the methods presented.
/content/cudazone/CUDABrowser/assets/images/applications/1296_121739_vriphys2009_glimberg_erleben_teaser_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1296_121739_vriphys2009_glimberg_erleben_teaser_large.png
Academia
Department of Computer Science/University of Copenhagen
http://www.diku.dk
2009
11
05
11/05/2009
350
Stefan Glimberg
Kenny Erleben
Jens Bennetsen
Paper
Computational Fluid Dynamics
Computer Aided Engineering
Pre-parameter studies of virtual designs
Stefan Glimberg,Kenny Erleben,Jens Bennetsen,glimberg@diku.dk,kenny@diku.dk
2e28ba24-3cc8-4333-be77-fb572dc28776
GPU Accelerated Tandem Traversal of Blocked Bounding Volume Hierarchy Collision Detection for Multibody Dynamics
The performance bottleneck of physics-based animation is often the collision detection. It is well known by practitioners that collision detection may consume more than half of the simulation time. In this work we introduce a novel approach for collision detection using bounding volume hierarchies. Our approach makes it possible to perform non-convex object versus non-convex object collision on the GPU, using tandem traversals of bounding volume hierarchies. Prior work only supports single traversals on GPUs. We introduce a blocked hierarchy data structure, using imaginary nodes and a simultaneous descent in the tandem traversal. The data structure design and traversal are highly specialized for exploiting the parallel threads in NVIDIA GPUs. As proof-of-concept we demonstrate a GPU implementation for a multibody dynamics simulation, showing an approximate speedup factor of up to 8 compared to a CPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/1295_52591_vriphys2009_damkjaer_erleben_teaser_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1295_52591_vriphys2009_damkjaer_erleben_teaser_large.png
Academia
Department of Computer Science, University of Copenhagen.
http://www.diku.dk/
2009
11
05
11/05/2009
8
Open source
Jesper Damkjaer
Kenny Erleben
Paper
Game Physics
Graphics
Numerics
Libraries
Programming Tools
Science
Bounding volume Hierarchies, Collision Detection, Rigid Body Simulation,Jesper Damkjaer, Kenny Erleben,damkjaer@diku.dk,kenny@diku.dk
70463033-d9d8-49f6-9067-8a982284a733
permGPU: Using graphics processing units in RNA microarray association studies
Background: Many analyses of microarray association studies involve permutation, bootstrap resampling and cross-validation, which are ideally formulated as embarrassingly parallel computing problems. Given that these analyses are computationally intensive, scalable approaches that can take advantage of multi-core processor systems need to be developed.
http://www.gpucomputing.net/?q=node/2083
/content/cudazone/CUDABrowser/assets/images/applications/1294_bmc_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1294_bmc_large.jpg
Commercial
BMC Bioinformatics
2010
05
22
05/22/2010
78
Ivo D Shterev
Sin-Ho Jung
Stephen L George
Paper
Ivo D Shterev,Sin-Ho Jung,Stephen L George
796fa229-7424-4736-b68b-793cf9120ee9
High performance GPU radix sorting in CUDA
This project implements a very fast, efficient radix sorting method for CUDA-capable devices. For sorting large sequences of fixed-length keys (and values), we believe our sorting primitive to be the fastest available for any fully-programmable microarchitecture: our stock NVIDIA GTX480 sorting results exceed the 1G keys/sec average sorting rate (i.e., one billion 32-bit keys sorted per second).
http://code.google.com/p/back40computing/wiki/RadixSorting
/content/cudazone/CUDABrowser/assets/images/applications/1291_SortingSmall_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1291_SortingSmall_large.jpg
Research
CUDA Developer
2010
05
27
05/27/2010
Duane Merrill
Application
Duane Merrill
b8244c5b-21cf-4a7d-8eef-7b8e72792b98
Hardware-Assisted Projected Tetrahedra
We present a flexible and highly efficient hardware-assisted volume renderer grounded on the original Projected Tetrahedra (PT) algorithm. Unlike recent similar approaches, our method is exclusively based on the rasterization of simple geometric primitives and takes full advantage of graphics hardware. Both vertex and geometry shaders are used to compute the tetrahedral projection, while the volume ray integral is evaluated in a fragment shader; hence, volume rendering is performed entirely on the GPU within a single pass through the pipeline.
http://www.lcg.ufrj.br/Members/andream/papers/cgf2010.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1290_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1290_GPUComputing bgimg_large.png
Academia
University of Rio de Janeiro
2010
03
18
03/18/2010
A. Maximo
R. Marroquim
R. Farias
Paper
A. Maximo,R. Marroquim,R. Farias
1af9d850-4d52-4267-9787-72027ff4928c
A Parallel Algorithm for Construction of Uniform Grids
We present a fast, parallel GPU algorithm for construction of uniform grids for ray tracing, which we implement in CUDA. The algorithm performance does not depend on the primitive distribution, because we reduce the problem to sorting pairs of primitives and cell indices. Our implementation is able to take full advantage of the parallel architecture of the GPU, and construction speed is faster than CPU algorithms running on multiple cores.
http://graphics.cs.uni-sb.de/fileadmin/cguds/papers/2009/kalojanov_hpg2009/kalojanov_hpg2009.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1289_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1289_GPUComputing bgimg_large.png
Academia
Saarland University
2009
06
13
06/13/2009
Javor Kalojanov
Philipp Slusallek
Paper
Javor Kalojanov,Philipp Slusallek
17270ae2-0417-4766-88db-17f20e1e3073
Evaluation of Streaming Aggregation on Parallel Hardware Architectures
We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware provides promise as accelerators for streaming aggregation. However, how streaming aggregation can map to the different parallel architectures is still an open question.
http://people.cs.vt.edu/~scschnei/papers/debs2010.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1288_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1288_GPUComputing bgimg_large.png
Research
IBM Research Division
2010
07
12
07/12/2010
Scott Schneider
Henrique Andrade
Bugra Gedik
Kun-Lung Wu
Dimitrios S. Nikolopoulos
Paper
Scott Schneider,Henrique Andrade,Bugra Gedik
9f95c82d-a9c4-4324-be40-85e0a4d5ebd3
A Middleware for Efficient Stream Processing in CUDA
This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on the CUDA-compatible graphics processing unit (GPU). Using the middleware, application developers can easily overlap kernel computation with data transfer between the main memory and the video memory.
http://www-hagi.ist.osaka-u.ac.jp/research/papers/201005_s-nakagw_isc.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1287_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1287_GPUComputing bgimg_large.png
Academia
University of Trier
2010
03
12
03/12/2010
Shinta Nakagawa
Fumihiko Ino
Kenichi Hagihara
Paper
Shinta Nakagawa,Fumihiko Ino,Kenichi Hagihara
5fc49232-b32f-49ab-81d9-8548d9b4b730
An Adaptive Performance Modeling Tool for GPU Architectures
This paper presents an analytical model to predict the performance of general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features.
http://impact.crhc.illinois.edu/ftp/conference/sara.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1286_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1286_GPUComputing bgimg_large.png
Academia
University of Illinois at Urbana-Champaign
2009
11
19
11/19/2009
Sara S. Baghsorkhi
Matthieu Delahaye
Sanjay J. Patel
William D. Gropp
Wen-mei W. Hwu
Paper
Sara S. Baghsorkhi,Matthieu Delahaye,Sanjay J. Patel,bsadeghi@illinois.edu,matthieu@illinois.edu,sjp@illinois.edu
d3d1ef02-d0ff-40e9-ae3c-2f380b1f45d7
Kd-Jump: a Path-Preserving Stackless Traversal for Faster Isosurface Raytracing on GPUs
Stackless traversal techniques are often used to circumvent memory bottlenecks by avoiding a stack and replacing return traversal with extra computation. This paper addresses whether the stackless traversal approaches are useful on newer hardware and technology (such as CUDA). To this end, we present a novel stackless approach for implicit kd-trees, which exploits the benefits of index-based node traversal, without incurring extra node visitation. This approach, which we term Kd-Jump, enables the traversal to immediately return to the next valid node, like a stack, without incurring extra node visitation (kd-restart).
http://vplab.snu.ac.kr/lectures/09-2/graphics/lecture_notes/11%20Kd-jump.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1285_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1285_GPUComputing bgimg_large.png
Academia
Bangor University
2009
07
27
07/27/2009
David M. Hughes
Ik Soo Lim
Paper
David M. Hughes,Ik Soo Lim,meirion@bangor.ac.uk,i.s.lim@bangor.ac.uk
b82e77d4-8aab-4e0c-828d-d9c87e198557
Accelerating Flow Cytometry Data Clustering Workflows with Graphics Processing Units
Flow cytometry is a mainstay technology used by biologists and immunologists for counting, sorting, and analyzing cells suspended in a fluid. The results of flow cytometry are used in a variety of important clinical and research applications such as phenotyping, DNA analysis, and cell function analysis. Like many modern scientific applications, flow cytometry produces massive amounts of data which must be clustered in order to be useful. Conventional analysis of flow cytometry data uses manual sequential bivariate gating.
http://cyberaide.googlecode.com/svn/trunk/papers/thesis-pangborn/proposal/pangborn-proposal.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1284_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1284_GPUComputing bgimg_large.png
Academia
Rochester Institute of Technology
2009
09
01
09/01/2009
Andrew D. Pangborn
Paper
Andrew D. Pangborn
b3217a94-0e15-431b-bdba-c2b96f030be4
GPU Accelerated Scientific Computing: Fluid and Particulate Flows with CUDA
Simulations of particulate flows, which involve gases and liquids with suspended solid particles like dust, are generally highly CPU-time demanding. The question arises whether such computations can be performed on the GPU applying highly parallel programming models like CUDA. In this paper we demonstrate that numerical simulation in that context can greatly benefit from these emerging technologies and present results in a 2D and 3D setup.
http://numhpc.math.kit.edu/download/PARS_Full_Paper_Final_Heuveline_Hahn_Rocker.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1283_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1283_GPUComputing bgimg_large.png
Academia
University of Karlsruhe
2009
10
14
10/14/2009
Tobias Hahn
Vincent Heuveline
Bjorn Rocker
Paper
Tobias Hahn,Vincent Heuveline,Bjorn Rocker,tobias.hahn@kit.edu,vincent.heuveline@kit.edu,bjoern.rocker@kit.edu
3587097f-b801-4185-ad0c-f4d4c78480c9
General-Purpose vs. GPU: Comparison of Many-Cores on Irregular Workloads
XMT is a general-purpose many-core parallel architecture. The foremost design objective for XMT was to meet the highest standards for ease of parallel programming. GPUs, on the other hand, have acquired a strong reputation for performance, sometimes at the expense of ease of programming. The current paper presents a performance comparison on diverse workloads between XMT and an NVIDIA CUDA-enabled GPU. Configured with roughly the same amount of chip resources as the GPU, XMT achieves an average speedup of 6.05x on irregular applications, while incurring an average slowdown of 2.07x on regular ones.
http://www.umiacs.umd.edu/users/vishkin/XMT/CKTV_hotpar10.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1282_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1282_GPUComputing bgimg_large.png
Academia
University of Maryland, College Park
2010
04
27
04/27/2010
George C. Caragea
Fuat Keceli
Alexandros Tzannes
Uzi Vishkin
Paper
George C. Caragea,Fuat Keceli,Alexandros Tzannes,gcaragea@umd.edu,keceli@umd.edu,tzannes@umd.edu
ace3d995-2e1a-4d64-a5af-450ffcfee3fd
Fast Minimum Spanning Tree for Large Graphs on the GPU
Graphics Processor Units are used for many general-purpose processing tasks due to the high compute power available on them. Regular, data-parallel algorithms map well to the SIMD architecture of current GPUs. Irregular algorithms on discrete structures like graphs are harder to map to them. Efficient data-mapping primitives can play a crucial role in mapping such algorithms onto the GPU. In this paper, we present a minimum spanning tree algorithm on Nvidia GPUs under CUDA, as a recursive formulation of Boruvka's approach for undirected graphs.
http://www.gpucomputing.net/?q=node/1612
/content/cudazone/CUDABrowser/assets/images/applications/1281_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1281_GPUComputing bgimg_large.png
Academia
International Institute of Information Technology
2009
06
07
06/07/2009
50
Vibhav Vineet
Pawan Harish
Suryakant Patidar
P. J. Narayanan
Paper
Vibhav Vineet,Pawan Harish,Suryakant Patidar,vibhavvinet@research.iiit.ac.in,harishpk@research.iiit.ac.in,skp@research.iit.ac.in
81ab4f38-8401-4ac6-9e25-d5ad5edceb53
CUDA-based Triangulations of Convolution Molecular Surfaces
Computing molecular surfaces is important to measure areas and volumes of molecules, as well as to infer useful information about interactions with other molecules. Over the years many algorithms have been developed to triangulate and to render molecular surfaces. However, triangulation algorithms usually are very expensive in terms of memory storage and time performance, and thus far from real-time performance. Fortunately, the massive computational power of the new generation of low-cost GPUs opens up an opportunity window to solve these problems: real-time performance and cheap computing commodities.
http://salsahpc.indiana.edu/ECMLS2010/papers/066.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1280_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1280_GPUComputing bgimg_large.png
Academia
Universidade da Beira Interior
2010
06
20
06/20/2010
Sergio Dias
Kuldeep Bora
Abel Gomes
Paper
Sergio Dias,Kuldeep Bora,Abel Gomes,sdias@ubi.pt,kuldeep@iitg.ernet.in,agomes@di.ubi.pt
a5df523d-c91c-4105-b8ee-fb0979362cc3
Data Parallel Three-Dimensional Cahn-Hilliard Field Equation Simulation on GPUs with CUDA
Computational scientific simulations have long used parallel computers to increase their performance. Recently graphics cards have been utilised to provide this functionality. GPGPU APIs such as NVIDIA's CUDA can be used to harness the power of GPUs for purposes other than computer graphics. GPUs are designed for processing two-dimensional data. In previous work we have presented several two-dimensional Cahn-Hilliard simulations that each utilise different CUDA memory types and compared their results.
http://www.massey.ac.nz/~kahawick/cstn/073/cstn-073.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1279_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1279_GPUComputing bgimg_large.png
Academia
Massey University
2009
02
01
02/01/2009
D. P. Playne
K. A. Hawick
Paper
D. P. Playne,K. A. Hawick,d.p.playne@massey.ac.nz,k.a.hawick@massey.ac.nz
0b5f7cf7-9d2e-4318-b4c5-9c9a1dba1143
FAST VISUAL HULL AND STEREO MATCHING ON CUDA
Stereo matching and visual hull are techniques that are often used in 3D reconstruction. This paper presents and evaluates implementations of these algorithms on the GPU using the CUDA architecture. Experimental results show that both visual hull and stereo matching have much to gain in terms of speed from the data-parallel execution model.
/content/cudazone/CUDABrowser/assets/images/applications/1278_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1278_GPUComputing bgimg_large.png
Academia
University of Surrey
2010
02
11
02/11/2010
Mykyta Fastovets
Jean-Yves Guillemaut
Adrian Hilton
Paper
Mykyta Fastovets,Jean-Yves Guillemaut,Adrian Hilton,mykyta.fastovets@surrey.ac.uk,j.guillemaut@surrey.ac.uk,a.hiltong@surrey.ac.uk
17140b63-a39b-4cdf-bd86-9a54079055b8
Speed records for NTRU
In this paper NTRUEncrypt is implemented for the first time on a GPU using the CUDA platform. As is shown, this operation lends itself excellently to parallelization and performs extremely well compared to ECC and RSA at similar security levels, giving speedups of around three to four orders of magnitude. The focus is on achieving a high throughput, in this case performing a large number of encryptions/decryptions in parallel.
http://www.gpucomputing.net/?q=node/1573
/content/cudazone/CUDABrowser/assets/images/applications/1277_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1277_GPUComputing bgimg_large.png
Academia
University of Leuven
2009
09
10
09/10/2009
Jens Hermans
Frederik Vercauteren
Bart Preneel
Paper
Jens Hermans,Frederik Vercauteren,Bart Preneel
042c7881-8db7-45d5-8824-8c7f99c6ce91
Implementing a GPU Programming Model on a non-GPU Accelerator Architecture
Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging.
http://hal.archives-ouvertes.fr/docs/00/49/39/05/PDF/A4MMC-kofsky.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1275_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1275_GPUComputing bgimg_large.png
Academia
University of Illinois at Urbana-Champaign
2010
06
21
06/21/2010
Stephen M. Kofsky
Daniel R. Johnson
John A. Stratton
Wen-mei W. Hwu
Sanjay J. Patel
Steven S. Lumetta
Paper
Stephen M. Kofsky,Daniel R. Johnson,John A. Stratton
93887a5d-c38e-40ec-bda5-b5bb59aae6d7
Parallelising Wavefront Applications on General-Purpose GPU Devices
Pipelined wavefront applications form a large portion of the high performance scientific computing workloads at supercomputing centres such as LANL in the United States and AWE in the United Kingdom. This paper investigates the viability of utilising graphics processing units (GPUs) for the acceleration of these codes, using NVIDIA's Compute Unified Device Architecture (CUDA). Wavefront applications differ from the massively data-parallel codes typically selected for execution on GPUs in that their computation must obey a strict data dependency, limiting the achievable level of parallelism.
http://www2.warwick.ac.uk/fac/sci/dcs/research/pcav/publications/pubs/ukpew-gpu-wavefronts.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1274_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1274_GPUComputing bgimg_large.png
Academia
University of Warwick
2010
06
01
06/01/2010
S. J. Pennycook
G. R. Mudalige
S. D. Hammond
S. A. Jarvis
Paper
S. J. Pennycook,G. R. Mudalige,S. D. Hammond,sjp@dcs.warwick.ac.uk,g.r.mudalige@dcs.warwick.ac.uk,sdh@dcs.warwick.ac.uk
f6b4d7a3-c2f6-456c-80f9-1a6666c19c99
Performance Cost Analysis of Software-Implemented Hardware Fault Tolerance Methods in General-Purpose GPU Computing
Commercial off-the-shelf graphics processing units (GPUs) provide an attractive, inexpensive platform for high-throughput scientific applications. Whereas fault tolerance may be desirable for many scientific applications, off-the-shelf GPU hardware has been designed for commodity graphics applications, where fault tolerance is not necessary.
http://homepages.cae.wisc.edu/~ece753/papers/Paper_4.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1273_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1273_GPUComputing bgimg_large.png
Academia
University of Wisconsin, Madison
2009
04
26
04/26/2009
Anthony E. Gregerson
Ameya V. Abhyankar
Paper
Anthony E. Gregerson,Ameya V. Abhyankar,agregerson@wisc.edu,aabhyankar@wisc.edu
a90fc8ef-5fba-447e-b091-fd5f9d5b1e50
GPU Accelerated Stylistic Augmented Reality
With the introduction of the programmable graphics pipeline, the highly parallel processing power of graphics processing units (GPUs) is being used not only for special graphics effects but also for general-purpose computation in areas such as molecular dynamics simulation, stock options pricing, and image processing. In this work, we utilize this power to increase the immersion level in an augmented reality (AR) application.
http://www.vmasc.odu.edu/downloads/Capstone_Papers/Engineering/Aras.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1272_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1272_GPUComputing bgimg_large.png
Academia
Old Dominion University
2010
04
02
04/02/2010
Rifat Aras
Yuzhong Shen
Paper
Rifat Aras,Yuzhong Shen
33549c56-faa7-4ed2-8762-b7176670e639
A Batched GPU Algorithm for Set Intersection
Intersection of inverted lists is a frequently used operation in search engine systems. Efficient CPU and GPU intersection algorithms for large problem sizes are well studied. We propose an efficient GPU algorithm for high-performance intersection of inverted index lists on the CUDA platform. This algorithm feeds queries to the GPU in batches, and thus can take full advantage of the GPU processor cores even if the problem size is small. We also propose an input preprocessing method which alleviates load imbalance effectively.
http://nbjl.nankai.edu.cn/Lab_Papers/2009/A%20Batched%20GPU%20Algorithm%20for%20Set%20Intersection.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1271_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1271_GPUComputing bgimg_large.png
Academia
Nankai University
2009
09
19
09/19/2009
Di Wu
Fan Zhang
Naiyong Ao
Fang Wang
Xiaoguang Liu
Gang Wang
Paper
Di Wu,Fan Zhang,Naiyong Ao,wakensky@gmail.com,zhangfan555@gmail.com,aonaiyong@163.com
c7c4497b-7a3a-4847-8010-748fc72fdd19
GPU-based ultrafast IMRT plan optimization
The widespread adoption of on-board volumetric imaging in cancer radiotherapy has stimulated research efforts to develop online adaptive radiotherapy techniques to handle the inter-fraction variation of the patient's geometry. Such efforts face major technical challenges to perform treatment planning in real time. To overcome this challenge, we are developing a supercomputing online re-planning environment (SCORE) at the University of California, San Diego (UCSD). As part of the SCORE project, this paper presents our work on the implementation of an intensity-modulated radiation therapy (IMRT) optimization algorithm on graphics processing units (GPUs).
http://iopscience.iop.org/0031-9155/54/21/008
/content/cudazone/CUDABrowser/assets/images/applications/1270_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1270_GPUComputing bgimg_large.png
Academia
University of California, San Diego
2009
10
14
10/14/2009
40
Chunhua Men
Xuejun Gu
Dongju Choi
Amitava Majumdar
Ziyi Zheng
Klaus Mueller
Steve B. Jiang
Paper
Chunhua Men,Xuejun Gu,Dongju Choi
4e362946-a2fd-4f2b-8740-0009b7348bd6
Real-time Forest Simulation for a Flight Simulator using a GPU
This paper concerns the real-time simulation of forests for a flight simulator, exploiting the capacities of recent graphics cards. As we will show, these architectures, coupled with recent ergonomic environments like CUDA, allow C programmers to implement highly parallelizable algorithms to be executed on the GPU, without being specialized in parallel programming.
http://www.ecam-rennes.fr/IMG/pdf/ICCTA2008.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1268_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1268_GPUComputing bgimg_large.png
Academia
Louis de Broglie, Graduate Engineering School
2008
02
19
02/19/2008
Jean-Marc Laferte
Guillaume Daussin
Pascal Haigron
Jihed Flifla
Paper
Jean-Marc Laferte,Guillaume Daussin,Jihed Flifla,laferte@ecole-debroglie.fr,g.daussin@ecole-debroglie.fr,flifla@ecole-debroglie.fr
e490331e-ba50-4ff8-ad01-f2af8b63cada
cuInspiral: prototype gravitational waves detection pipeline fully coded on GPU using CUDA
In this paper we report the prototype of the first coalescing binary detection pipeline fully implemented on NVIDIA GPU hardware accelerators. The code has been embedded in a GPU library, called cuInspiral, and has been developed under the CUDA framework. The library contains, for example, a PN gravitational wave signal generator, matched filtering/FFT and detection algorithms that have been profiled and compared with the corresponding CPU code with a dedicated benchmark in order to provide the gain factor with respect to the standard CPU implementation.
http://arxiv.org/PS_cache/arxiv/pdf/1006/1006.4644v1.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1267_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1267_GPUComputing bgimg_large.png
Research
National Institute of Nuclear Physics
2010
06
16
06/16/2010
Leone B. Bosi
Paper
Leone B. Bosi
b0a54882-2b4a-4058-a6e7-b829a2a04a53
GPU Accelerated Path-planning for Multi-agents in Virtual Environments
Many games are populated by synthetic humanoid actors that act as autonomous agents. The animation of humanoids in real-time applications is still a challenge if the problem involves attaining a precise location in a virtual world (path-planning), and moving realistically according to its own personality, intentions and mood (motion planning). In this paper we present a strategy to implement, using CUDA on the GPU, a path planner that produces natural steering behaviors for virtual humans using a numerical solution for boundary value problems.
http://www.sbgames.org/papers/sbgames09/computing/full/cp15_09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1266_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1266_GPUComputing bgimg_large.png
Academia
Federal University of Rio Grande do Sul
2009
10
08
10/08/2009
56
Leonardo G. Fischer
Renato Silveira
Luciana Nedel
Paper
Leonardo G. Fischer,Renato Silveira,Luciana Nedel
cb2fd4bb-d313-4c57-aae7-547e4b78dc27
Real-time image segmentation on a GPU
Efficient segmentation of color images is important for many applications in computer vision. Non-parametric solutions are required in situations where little or no prior knowledge about the data is available. In this paper, we present a novel parallel image segmentation algorithm which segments images in real-time in a non-parametric way. The algorithm finds the equilibrium states of a Potts model in the superparamagnetic phase of the system.
http://upcommons.upc.edu/e-prints/bitstream/2117/7866/1/1104-Real-time-image-segmentation-on-a-GPU.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1264_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1264_GPUComputing bgimg_large.png
Academia
Georg-August University
2010
06
28
06/28/2010
Alexey Abramov
Tomas Kulvicius
Florentin Worgotter
Babette Dellen
Paper
Alexey Abramov,Tomas Kulvicius,Florentin Worgotter,abramov@bccn-goettingen.de,tomas@bccn-goettingen.de,worgottg@bccn-goettingen.de
7f4e5772-7ed9-40a1-8da0-7694fec71c3c
Development of a GPU-Based Monte Carlo Dose Calculation Code for Coupled Electron-Photon Transport
Monte Carlo simulation is the most accurate method for absorbed dose calculations in radiotherapy. Its efficiency still requires improvement for routine clinical applications, especially for online adaptive radiotherapy. In this paper, we report our recent development on a GPU-based Monte Carlo dose calculation code for coupled electron-photon transport. We have implemented the Dose Planning Method (DPM) Monte Carlo dose calculation package (Sempau et al, Phys. Med. Biol., 45(2000)2263-2291) on the GPU architecture under the CUDA platform.
http://arxiv.org/ftp/arxiv/papers/0910/0910.0329.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1263_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1263_GPUComputing bgimg_large.png
Academia
University of California, San Diego
2010
03
22
03/22/2010
Xun Jia
Xuejun Gu
Josep Sempau
Dongju Choi
Amitava Majumdar
Steve B. Jiang
Paper
Xun Jia,Xuejun Gu,Josep Sempau,Dongju Choi,Amitava Majumdar,Steve B. Jiang
e41a3774-d96e-444e-a928-811d5f31b161
Performance Characterization of a GPU as a Ubiquitous Accelerator in Commodity Multiprocessor Systems
Graphics processing units (GPUs) are increasingly being employed as commodity data-parallel co-processors in desktop and laptop systems due to their tremendous computational power as well as high memory bandwidth. A number of research efforts are focusing on the development of methodologies for efficient utilization of GPU hardware as a ubiquitous accelerator for CPU and memory intensive tasks to off-load the main processor(s). In order to effectively off-load parts of computation, developers need to have a clear understanding of the performance trade-offs of using a GPU as an accelerator for the host processor.
http://www.kics.edu.pk/hpcnl/images/hpcnl_kics_tr_03.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1262_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1262_GPUComputing bgimg_large.png
Academia
Al-Khawarizmi Institute of Computer Science
2010
06
01
06/01/2010
Ghulam Mustafa
Abdul Waheed
Waqar Mahmood
Paper
Ghulam Mustafa,Abdul Waheed,Waqar Mahmood,ghulam.mustafa@kics.edu.pk,awaheed@kics.edu.pk,director@kics.edu.pk
38d916ba-ce82-48eb-9159-6f33a4396526
Implementation of Stereophonic Acoustic Echo Canceller on nVIDIA GeForce Graphics Processing Unit
This paper presents an implementation of a stereophonic acoustic echo canceller on an NVIDIA GeForce graphics processor using the CUDA software development environment. For efficiency, fast shared memory has been used as much as possible. A tree adder is introduced to reduce the cost of summing up thread outputs. The performance evaluation results suggest that even a low-cost GPU with a small number of shader processors greatly helps echo cancellation for low-cost PC-based teleconferencing.
/content/cudazone/CUDABrowser/assets/images/applications/1261_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1261_GPUComputing bgimg_large.png
Academia
Kanazawa University
2009
12
07
12/07/2009
Akihiro Hirano
Kenji Nakayama
Paper
Akihiro Hirano,Kenji Nakayama,hirano@t.kanazawa-u.ac.jp,nakayama@t.kanazawa-u.ac.jp
74329815-f8c3-4f35-92de-134ac83e4ada
Implementing Closed-Form Expressions on FPGAs Using the NAL, with Comparison to CUDA GPU and Cell BE Implementations
This paper outlines the Nallatech Accelerator Layer (NAL) and its relationship to Intel's Accelerator Abstraction Layer. The NAL is looked at in its academic context. Hardware platforms that support the NAL are discussed: the Nallatech H101, the Intel FSB-FPGA Module and the BenOne PCIe. The Intel QuickAssist Technology initiative and its associated Accelerator Abstraction Layer (AAL) are introduced.
http://www.rssi2008.org/proceedings/papers/posters/07_Bruce.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1260_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1260_GPUComputing bgimg_large.png
Research
Nallatech Ltd
2008
06
17
06/17/2008
Robin Bruce
Javier Setoain
Richard Chamberlain
Malachy Devlin
Rosa M. Badia
Paper
Robin Bruce,Javier Setoain,Richard Chamberlain
64be2be4-6831-45f3-aa71-7ed3906590cc
MITHRA: Multiple data Independent Tasks on a Heterogeneous Resource Architecture
With the advent of high-performance COTS clusters, there is a need for a simple, scalable and fault-tolerant parallel programming and execution paradigm. In this paper, we show that the popular MapReduce programming model can be utilized to solve many interesting scientific simulation problems with much higher performance than regular cluster computers by leveraging GPGPU accelerators in cluster nodes. We use the Massive Unordered Distributed (MUD) formalism and establish a one-to-one correspondence between it and general Monte Carlo simulation methods.
http://verma7.com/wp/wp-content/uploads/2009/09/CS597_Spring09_MITHRA_Technical_Report.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1259_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1259_GPUComputing bgimg_large.png
Academia
University of Illinois at Urbana-Champaign
2009
08
25
08/25/2009
Reza Farivar
Abhishek Verma
Ellick M. Chan
Roy H. Campbell
Paper
Reza Farivar,Abhishek Verma,Ellick M. Chan,rhc@illinois.edu,Roy H. Campbell,farivar2@illinois.edu,verma7@illinois.edu,emchan@illinois.edu
82928644-d96f-4ba8-9c46-0e6bf1e1f95e
Simulation of Reaction-Diffusion Processes in Three Dimensions using CUDA
Numerical solution of reaction-diffusion equations in three dimensions is one of the most challenging applied mathematical problems. Since these simulations are very time consuming, any ideas and strategies aiming at the reduction of CPU time are important topics of research. A general and robust idea is the parallelization of source codes/programs. Recently, the technological development of graphics hardware created a possibility to use desktop video cards to solve numerically intensive problems.
http://arxiv.org/ftp/arxiv/papers/1004/1004.0480.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1257_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1257_GPUComputing bgimg_large.png
Academia
Eotvos Lorand University
2010
04
03
04/03/2010
Ferenc Molnar Jr
Ferenc Izsak
Robert Meszaros
Paper
Ferenc Molnar Jr,Ferenc Izsak,Robert Meszaros
171b372c-fb8e-4fc3-9a02-a3db231424a7
CUDASW++2.0: enhanced Smith-Waterman Protein Database Search on CUDA-Enabled GPUs Based on SIMT and Virtualized SIMD Abstractions
Due to its high sensitivity, the Smith-Waterman algorithm is widely used for biological database searches. Unfortunately, the quadratic time complexity of this algorithm makes it highly time-consuming. The exponential growth of biological databases further deteriorates the situation. To accelerate this algorithm, many efforts have been made to develop techniques in high performance architectures, especially the recently emerging many-core architectures and their associated programming models.
http://www.biomedcentral.com/content/pdf/1756-0500-3-93.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1256_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1256_GPUComputing bgimg_large.png
Academia
Nanyang Technological University, Singapore
2010
04
14
04/14/2010
Yongchao Liu
Bertil Schmidt
Douglas L Maskell
Paper
Yongchao Liu,Bertil Schmidt,Douglas L Maskell,liu0039@ntu.edu.sg
2733733f-bf02-44f0-92ce-49eed1ab150c
Design and Implementation of the Smith-Waterman Algorithm on the CUDA-Compatible GPU
This paper describes a design and implementation of the Smith-Waterman algorithm accelerated on the graphics processing unit (GPU). Our method is implemented using the compute unified device architecture (CUDA), which is available on the nVIDIA GPU. The method efficiently uses on-chip shared memory to reduce the amount of data being transferred between off-chip memory and processing elements in the GPU. Furthermore, it reduces the number of data fetches by applying a data reuse technique to query and database sequences.
http://www-hagi.ist.osaka-u.ac.jp/research/papers/200810_y-munekw_bibe.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1255_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1255_GPUComputing bgimg_large.png
Academia
Osaka University
2008
08
09
08/09/2008
Yuma Munekawa
Fumihiko Ino
Kenichi Hagihara
Paper
Yuma Munekawa,Fumihiko Ino,Kenichi Hagihara,y-munekw@ist.osaka-u.ac.jp,ino@ist.osaka-u.ac.jp
5f961830-f9db-421a-b7b6-cd749907f46e
Tapping the Supercomputer Under Your Desk: Solving Dynamic Equilibrium Models with Graphics Processors
This paper shows how to build algorithms that use graphics processing units (GPUs) installed in most modern computers to solve dynamic equilibrium models in economics. In particular, we rely on the compute unified device architecture (CUDA) of NVIDIA GPUs. We illustrate the power of the approach by solving a simple real business cycle model with value function iteration. We document improvements in speed of around 200 times and suggest that even further gains are likely.
/content/cudazone/CUDABrowser/assets/images/applications/1254_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1254_GPUComputing bgimg_large.png
Academia
Duke University
2010
04
10
04/10/2010
Eric M. Aldrich
Jesus Fernandez-Villaverde
A. Ronald Gallant
Juan F. Rubio-Ramirez
Paper
Eric M. Aldrich,Jesus Fernandez-Villaverde,A. Ronald Gallant,ealdrich@gmail.com,jesusfv@econ.upenn.edu,aronaldg@gmail.com,Juan F. Rubio-Ramirez,jfr23@duke.edu
c1f6075d-5f89-44d5-938b-f527b8a56825
Faster Matrix-Vector Multiplication on GeForce 8800GTX
Recently, GPUs have acquired the programmability to perform general-purpose computation quickly by running tens of thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experimental results on a GeForce 8800GTX show that the proposed algorithm runs up to 15.69 (resp., 32.88) times faster than the sgemv routine in NVIDIA's BLAS library CUBLAS 1.1 (resp., an Intel Xeon E5335 CPU with SSE3 SIMD instructions) for matrices with order 16 to 12800.
http://ch.nvidia.com/docs/IO/47905/fujimoto_lspp2008.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1253_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1253_GPUComputing bgimg_large.png
Academia
Osaka University
2008
01
29
01/29/2008
15
Noriyuki Fujimoto
Paper
Noriyuki Fujimoto,fujimoto@ist.osaka-u.ac.jp
3939b0a0-52a9-48e0-8b31-6af60eae6ce6
Stackless KD-Tree Traversal for High Performance GPU Ray Tracing
Significant advances have been achieved for realtime ray tracing recently, but realtime performance for complex scenes still requires large computational resources not yet available from the CPUs in standard PCs. Incidentally, most of these PCs also contain modern GPUs that do offer much larger raw compute power. However, limitations in the programming and memory model have so far kept the performance of GPU ray tracers well below that of their CPU counterparts.
http://www.gpucomputing.net/?q=node/1293
/content/cudazone/CUDABrowser/assets/images/applications/1252_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1252_GPUComputing bgimg_large.png
Academia
Saarland University and MPI Informatik
2007
06
11
06/11/2007
Stefan Popov
Johannes Gunther
Hans-Peter Seidel
Philipp Slusallek
Paper
Stefan Popov,Johannes Gunther,Hans-Peter Seidel
c747f985-752f-4f32-be50-cf24898c527b
Fast Parallel GPU-Sorting Using a Hybrid Algorithm
This paper presents an algorithm for fast sorting of large lists using modern GPUs. The method achieves high speed by efficiently utilizing the parallelism of the GPU throughout the whole algorithm. Initially, a parallel bucketsort splits the list into enough sublists, which are then sorted in parallel using merge sort. The parallel bucketsort, implemented in NVIDIA's CUDA, utilizes the synchronization mechanisms, such as atomic increment, that are available on modern GPUs.
http://www.gpucomputing.net/?q=node/1291
/content/cudazone/CUDABrowser/assets/images/applications/1251_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1251_GPUComputing bgimg_large.png
Academia
Chalmers University of Technology
2007
09
25
09/25/2007
Erik Sintorn
Ulf Assarsson
Paper
Erik Sintorn,Ulf Assarsson,erik.sintorn@chalmers.se,uffe@chalmers.se
01c04032-f48d-40e7-ba6f-105e9e541977
Testing the Feasibility of Running a Computationally Intensive Real-Time Traffic Simulation on a Multicore Programmable Graphics Processor
In the 1960s, a semiconductor scientist named Gordon Moore theorized that the number of transistors would double each year on a single integrated circuit. Through much effort, the semiconductor industry has been able to closely follow "Moore's Law", but new information shows this type of progress is not sustainable in the coming years. This realization has implications in both chip fabrication and software development. Instead of making chips with more transistors per unit area, industry now produces newer multicore chips.
http://www.gpucomputing.net/?q=node/603
/content/cudazone/CUDABrowser/assets/images/applications/1250_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1250_GPUComputing bgimg_large.png
Academia
University of Virginia
2007
04
04
04/04/2007
Kevin Stammetti
Paper
Kevin Stammetti
18ea0385-7666-484e-abed-98ef81d2697a
A Flexible High-Performance Lattice Boltzmann GPU Code for the Simulations of Fluid Flows in Complex Geometries
We describe the porting of the Lattice Boltzmann component of MUPHY, a multi-physics/scale simulation software, to multiple graphics processing units using the Compute Unified Device Architecture. The novelty of this work is the development of ad hoc techniques for optimizing the indirect addressing that MUPHY uses for efficient simulations of irregular domains.
http://www.iac.rm.cnr.it/~massimo/Papers/LBEonGPU.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1247_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1247_GPUComputing bgimg_large.png
Academia
1 Istituto Applicazioni Calcolo, 2 NVIDIA, 3 SOFT, Istituto Nazionale Fisica della Materia, 4 Harvard University School of Eng and Applied Sciences, 5 Harvard University Initiative in Innovative Computing
2009
05
11
05/11/2009
Massimo Bernaschi1
Massimiliano Fatica2
Simone Melchionna3,4
Sauro Succi1,5
Efthimios Kaxiras4
Paper
Massimo Bernaschi,Massimiliano Fatica,Simone Melchionna,Sauro Succi,Efthimios Kaxiras
74ce7180-6c27-4cf9-906a-507feae7f418
GPU Clusters for High-Performance Computing
Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. We touch upon such issues as balanced cluster architecture, resource sharing in a cluster environment, programming models, and applications for GPU clusters.
http://www.ncsa.illinois.edu/~gshi/ppac09_paper.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1246_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1246_GPUComputing bgimg_large.png
Academia
University of Illinois at Urbana-Champaign
2009
08
01
08/01/2009
Volodymyr V. Kindratenko
Jeremy J. Enos
Guochun Shi
Michael T. Showerman
Galen W. Arnold
John E. Stone
James C. Phillips
Wen-mei Hwu
Paper
Volodymyr V. Kindratenko,Jeremy J. Enos,Guochun Shi,kindr@ncsa.uiuc.edu,jenos@ncsa.uiuc.edu,gshi@ncsa.uiuc.edu
ed2b492f-10c9-4c78-9acc-215d2de0900e
Towards User Transparent Parallel Multimedia
The research area of Multimedia Content Analysis (MMCA) considers all aspects of the automated extraction of knowledge from multimedia archives and data streams. To satisfy the increasing computational demands of MMCA problems, the use of High Performance Computing (HPC) techniques is essential. As most MMCA researchers are not HPC experts, there is an urgent need for 'familiar' programming models and tools that are both easy to use and efficient.
http://hal.inria.fr/docs/00/49/38/83/PDF/A4MMC-werkhoven.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1244_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1244_GPUComputing bgimg_large.png
Academia
VU University
2010
06
21
06/21/2010
Ben van Werkhoven
Jason Maassen
Frank J. Seinstra
Paper
Ben van Werkhoven,Jason Maassen,Frank J. Seinstra,bjvwerkh@few.vu.nl,jason@few.vu.nl,fjseins@vew.vu.nl
a4825b5f-3099-4dcc-8335-ba0f18a6a351
Axel: A Heterogeneous Cluster with FPGAs and GPUs
This paper describes a heterogeneous computer cluster called Axel. Axel contains a collection of nodes; each node can include multiple types of accelerators such as FPGAs (Field Programmable Gate Arrays) and GPUs (Graphics Processing Units). A Map-Reduce framework for the Axel cluster is presented which exploits spatial and temporal locality through different types of processing elements and communication channels. The Axel system enables the first demonstration of FPGAs, GPUs and CPUs running collaboratively for N-body simulation.
http://portal.acm.org/citation.cfm?id=1723112.1723134#abstract
/content/cudazone/CUDABrowser/assets/images/applications/1243_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1243_logo_acm_portal2_large.jpg
Academia
Imperial College London
2010
02
01
02/01/2010
23
Kuen Hung Tsoi
Wayne Luk
Paper
Kuen Hung Tsoi,Wayne Luk
f6a0332b-f8ec-4d4d-9551-0f7fc57ad735
GPU-based Hierarchical Computations for View Independent Visibility
With rapid improvements in the performance and programmability, Graphics Processing Units (GPUs) have fostered considerable interest in substantially reducing the running time of compute intensive problems. The solution to the view-independent mutual point-pair visibility problem (required for inter-reflections in global illumination) can, it would seem, require the capabilities of the GPUs.
http://dspace.library.iitb.ac.in/jspui/bitstream/10054/1708/1/4756034.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1242_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1242_GPUComputing bgimg_large.png
Academia
Indian Institute of Technology Bombay
www.cse.iitb.ac.in
2008
12
16
12/16/2008
Rhushabh Goradia
Prekshu Ajmera
Sharat Chandran
Paper
Rhushabh Goradia,Prekshu Ajmera,Sharat Chandran
e3c99de6-a223-4437-af51-01bd3faf3026
Fast, GPU-based Diffuse Global Illumination For Point Models
Photorealistic computer graphics attempts to match as closely as possible the rendering of a virtual scene with an actual photograph of the scene had it existed in the real world. Of the several techniques that are used to achieve this goal, physically-based approaches (i.e. those that attempt to simulate the actual physical process of illumination) provide the most striking results.
http://www.cse.iitb.ac.in/~rhushabh/aps/aps4/report.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1241_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1241_GPUComputing bgimg_large.png
Academia
Indian Institute of Technology
2008
08
26
08/26/2008
Rhushabh Goradia
Paper
Rhushabh Goradia
5c93da6a-193d-4595-998d-07313764eef5
Fast GPU-based Adaptive Tessellation with CUDA
Compact surface descriptions like higher-order surfaces are popular representations for both modeling and animation. However, for fast graphics-hardware-assisted rendering, they usually need to be converted to triangle meshes. In this paper, we introduce a new framework for performing on-the-fly crack-free adaptive tessellation of surface primitives completely on the GPU. Utilizing CUDA and its flexible memory write capabilities, we parallelize the tessellation task at the level of single surface primitives.
https://www.mpi-sb.mpg.de/~mschwarz/papers/cudatess-eg09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1240_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1240_GPUComputing bgimg_large.png
University of Erlangen-Nuremberg
2009
04
01
04/01/2009
Michael Schwarz
Marc Stamminger
Michael Schwarz,Marc Stamminger
5eb012b6-230e-49e0-ac77-6be5c40b011a
Using Graphics Devices in Reverse: GPU-based Image Processing and Computer Vision
Graphics and vision are approximate inverses of each other. Ordinarily Graphics Processing Units are used to convert "numbers into pictures" (i.e. computer graphics). In this paper, we discuss the use of GPUs in approximately the reverse way to assist in "converting pictures into numbers" (i.e. computer vision). For graphical operations, GPUs currently provide many hundreds of gigaflops of processing power.
http://www.uweb.ucsb.edu/~yichuwang/ecv/paper/using_graphics_device_in_reverse.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1239_gpucomputing_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1239_gpucomputing_large.png
Commercial
NVIDIA Corporation
2008
08
26
08/26/2008
James Fung
Steve Mann
Paper
James Fung,Steve Mann
1bbe2f89-4dd9-4051-9bf0-bcd1abdc7a03
CUDA SURF - A real-time implementation for SURF
Keypoint detection and matching is a basic computer vision task and a necessary ingredient for several applications, e.g., object recognition, structure from motion, panorama stitching. In this work we implement the popular SURF descriptor, an approximation of SIFT, on commodity graphics hardware and achieve real-time performance even for HD images. For VGA images we achieve a speed-up of about 50x on a GTX 285 and for HD images even up to 87x.
/content/cudazone/CUDABrowser/assets/images/applications/1238_633214_match_GPU_graff_img1_img2_small_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1238_633214_match_GPU_graff_img1_img2_small_large.png
Academia
TU Darmstadt
http://www.tu-darmstadt.de
2010
07
13
07/13/2010
87
Open source
Andre Schulz
Florian Jung
Sebastian Hartte
Daniel Trick
Christian Wojek
Konrad Schindler
Jens Ackermann
Michael Goesele
Application
Code
Imaging
Video & Audio
Keypoint detection, Keypoint description, SURF, Object detection, SfM, Image stitching,Andre Schulz,Florian Jung,Sebastian Hartte, Daniel Trick,Christian Wojek, Konrad Schindler, Jens Ackermann, Michael Goesele,wojek@cs.tu-darmstadt.de
fdd8f407-1003-4323-82d5-f91b0c483a18
A Real-Time Multigrid Finite Hexahedra Method for Elasticity Simulation using CUDA
We present an efficient CUDA implementation of a finite hexahedra multigrid solver for simulating elastic deformable models in real time. Due to the regular shape of the numerical stencil induced by the hexahedral regime, computations and data layout can be restructured to avoid execution divergence and to support memory access patterns enabling the hardware to coalesce multiple memory accesses into single memory transactions. This makes it possible to effectively exploit the GPU's parallel processing units and high memory bandwidth. Performance gains of up to a factor of 12 compared to a highly optimized CPU implementation are demonstrated.
/content/cudazone/CUDABrowser/assets/images/applications/1237_170404_VoxelModel_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1237_170404_VoxelModel_large.png
Academia
Computer Graphics and Visualization Group, Technische Universitat Munchen, Germany
http://wwwcg.in.tum.de/
2010
07
10
07/10/2010
12
Christian Dick
Joachim Georgii
Rudiger Westermann
Paper
Computer Aided Engineering
Numerics
Deformable Objects, Finite Element Methods, Multigrid,Christian Dick,Joachim Georgii,Rudiger Westermann
1cbf9658-009b-4ee7-81fe-1f144a456225
Real-time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid
We introduce a real-time stereo matching technique based on a reformulation of Yoon and Kweon's adaptive support weights algorithm [1]. Our implementation uses the bilateral grid to achieve a speedup of 200x compared to a straightforward full-kernel GPU implementation, which in turn is 20x faster than the original CPU implementation, thus making it the fastest technique on the Middlebury website. Published at the European Conference on Computer Vision (ECCV) 2010.
/content/cudazone/CUDABrowser/assets/images/applications/1236_167313_DCBGrid-teaser_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1236_167313_DCBGrid-teaser_large.png
Academia
University of Cambridge and Microsoft Research Cambridge
2010
06
24
06/24/2010
200
Christian Richardt
Douglas Orr
Ian Davies
Antonio Criminisi
Neil A. Dodgson
Multimedia
Paper
Code
Science
Video & Audio
Computer Vision
Christian Richardt,Douglas Orr,Ian Davies, Antonio Criminisi, Neil A. Dodgson,christian.richardt@cl.cam.ac.uk
bbaba1fd-20be-44f9-9b64-fea2cd12fce2
Realtime Tracking With a Pan-Tilt Camera
The human eye is amazingly adept at tracking moving objects. The process is so natural to humans that it happens without any conscious effort. While this remarkable ability depends in part on the human brain's immense processing power, the fast response of the extraocular muscles and the eyeball's light weight are also vital. Even a small point and shoot camera mounted on a servo is typically too heavy and slow to move with the agility of the human eye. How, then, can we give a computer the ability to track movement quickly and responsively?
http://umassgv.blogspot.com/2010/07/realtime-tracking-with-pan-tilt-camera.html
/content/cudazone/CUDABrowser/assets/images/applications/1235_123697_tracking_overview_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1235_123697_tracking_overview_large.jpg
Academia
University of Massachusetts, Amherst
http://www.cs.umass.edu
2010
07
06
07/06/2010
Open source
Blake Foster
Rui Wang
Erik Learned-Miller
Multimedia
Paper
Code
Video & Audio
tracking, camera, fpv, pan tilt, human eye, vision,Blake Foster,Rui Wang,Erik Learned-Miller,blfoster@cs.umass.edu,ruiwang@cs.umass.edu,elm@cs.umass.edu
749bb40e-ff99-4a34-9093-c8ff4b7ab08d
Thrust Graph Library
Thrust Graph Library provides graph containers, algorithms, and other concepts in the manner of the Boost Graph Library. The library is based on Thrust, a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL).
/content/cudazone/CUDABrowser/assets/images/applications/1234_53418_networks_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1234_53418_networks_large.jpg
Research
National Institute of Advanced Industrial Science and Technology (AIST)
2010
07
06
07/06/2010
Open source
kazuhiro kojima
Code
Libraries
Graph Library,kazuhiro kojima,k.kojima@aist.go.jp
aafa92f6-e4db-4ff5-bbf2-093cef1271ea
Modeling Rotor Wakes with a Hybrid OVERFLOW-Vortex Method on a GPU Cluster
The vortex core shed from rotorcraft blades maintains coherency, and thus dynamic relevance, many blade turns after its creation. This presents a challenge to traditional Eulerian computational methods, as fine grids are required to suppress numerical diffusion which would weaken the vortex cores after a small number of revolutions. Vortex methods have been used in the past to overcome these problems, as they require computational elements only in vorticity-containing regions, but suffer from greater computational cost per element.
http://markjstock.org/research/AIAA-2010-4553.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1233_86624_4bladed_720_web2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1233_86624_4bladed_720_web2_large.png
Research
Applied Scientific Research, Inc.
http://www.applied-scientific.com/
2010
06
29
06/29/2010
Commercial
Mark J. Stock
Adrin Gharakhani
Multimedia
Paper
Computational Fluid Dynamics
cfd rotor helicopter vortex fluid,Mark J. Stock,Adrin Gharakhani,mstock@applied-scientific.com
5eb5fa1a-f845-4b64-863f-48f16aae06bb
A GPU-accelerated Boundary Element Method and Vortex Particle Method
Vortex particle methods, when combined with multipole-accelerated boundary element methods (BEM), become a complete tool for direct numerical simulation (DNS) of internal or external vortex-dominated flows. In previous work, we presented a method to accelerate the vorticity-velocity inversion at the heart of vortex particle methods by performing a multipole treecode N-body method on parallel graphics hardware. The resulting method achieved a 17-fold speedup over a dual-core CPU implementation.
http://markjstock.org/research/AIAA-2010-5099.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1232_266408_spheres_cl_vort_crop_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1232_266408_spheres_cl_vort_crop_large.png
Research
Applied Scientific Research, Inc.
http://applied-scientific.com/
2010
07
01
07/01/2010
43
Commercial
Mark J. Stock
Adrin Gharakhani
Paper
Computational Fluid Dynamics
cfd vortex nbody bem fluid,Mark J. Stock,mstock@applied-scientific.com
c7ec9d53-04d1-4b50-9a6a-0f92323dce34
Leukocyte Tracking: ImageJ Plugin
This software is a plugin for the ImageJ image processing program. The plugin is designed to detect and track rolling leukocytes (white blood cells) through multiple frames of video. It can take advantage of a CUDA-capable GPU to dramatically accelerate video processing time; with appropriate hardware, near real-time processing can be achieved.
/content/cudazone/CUDABrowser/assets/images/applications/1231_26570_leukocytes_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1231_26570_leukocytes_large.png
Academia
University of Virginia
2010
07
01
07/01/2010
26
Open source
Michael Boyer
David Tarjan
Scott T. Acton
Kevin Skadron
Application
Multimedia
Paper
Code
Imaging
Medical Imaging
Science
Video & Audio
leukocyte, blood cell, tracking, video,Michael Boyer,David Tarjan,Scott T. Acton, Kevin Skadron,boyer@cs.virginia.edu
d36b8ae0-3465-4b14-82f6-07712472e3ae
McStas CUDA optimization project
Optimizes the single-crystal component of the McStas neutron ray tracer using CUDA.
http://www.mcstas.org/
/content/cudazone/CUDABrowser/assets/images/applications/1229_6226_logo-left_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1229_6226_logo-left_large.png
Academia
eScience Center, University of Copenhagen
http://www.escience.ku.dk
2010
01
29
01/29/2010
125
Jesper Dahlkild
Martin Djurno
Finn Krog
Paper
Code
Ray Tracing
Science
Jesper Dahlkild,Martin Djurno,Finn Krog,jesper.dahlkild@gmail.com,djurnoe@diku.dk,fk@finnkrog.com
c15f6620-26e7-4f91-b857-46380ff67782
Raytracing in participating media
This work presents a CUDA-accelerated algorithm for visualization of photorealistic lighting effects which is based on Henrik Wann Jensen's method for global illumination in scenes with participating media.
/content/cudazone/CUDABrowser/assets/images/applications/1228_58732_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1228_58732_logo_large.png
Academia
Wroclaw University of Technology
2010
07
02
07/02/2010
10
Open source
Piotr Orzechowski
Application
Multimedia
Paper
Code
Graphics
Ray Tracing
raytracing, participating media, photon mapping,Piotr Orzechowski,piotr.orzechowski@gmail.com
d35e353a-09c4-47df-ba61-10984823be50
Multi-domain, Higher Order Level Set Scheme for 3D Image Segmentation on the GPU
A streaming level set PDE solver that handles large volumes (larger than the available GPU memory), together with a higher-order, multi-phase solver for smooth segmentation of the volume.
/content/cudazone/CUDABrowser/assets/images/applications/1227_545525_Slide2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1227_545525_Slide2_large.jpg
Academia
The Technical University of Denmark / The University of Texas at Austin
2010
06
16
06/16/2010
10
Open source
Ojaswa Sharma
Qin Zhang
Francois Anton
Chandrajit Bajaj
Paper
Ojaswa Sharma, Qin Zhang,Francois Anton, and Chandrajit Bajaj ,os@imm.dtu.dk,zqyork@ices.utexas.edu,fa@imm.dtu.dk, bajaj@cs.utexas.edu
082dbdd8-2eb2-4511-9072-d5eff768e420
A Simple Pseudo-Random Number Generator
Implementation of uniformly and normally distributed pseudo random number generators as device functions.
http://people.virginia.edu/~mjt5v/pf/RNG/
/content/cudazone/CUDABrowser/assets/images/applications/1226_144550627858cb6d44ceb02ba9434317_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1226_144550627858cb6d44ceb02ba9434317_large.png
Academia
University of Virginia
2010
06
08
06/08/2010
Michael Trotter
Matt Goodrum
Application
Programming Tools
Michael Trotter,Matt Goodrum,mjt5v@virginia.edu,mag6x@virginia.edu
9e6bae6c-dc38-44dd-863a-441c9440bf9e
High-order finite-element seismic wave propagation modeling with MPI on a large GPU cluster
We implement a high-order finite-element application, which performs the numerical simulation of seismic wave propagation resulting for instance from earthquakes at the scale of a continent or from active seismic acquisition experiments in the oil industry, on a large cluster of NVIDIA Tesla graphics cards using the CUDA programming environment and non-blocking message passing based on MPI. Contrary to many finite-element implementations, ours is implemented successfully in single precision, maximizing the performance of current generation GPUs. We discuss the implementation and optimization of the code and compare it to an existing very optimized implementation in C language and MPI on a classical cluster of CPU nodes. We use mesh coloring to efficiently handle summation operations over degrees of freedom on an unstructured mesh, and non-blocking MPI messages in order to overlap the communications across the network and the data transfer to and from the device via PCIe with calculations on the GPU. We perform a number of numerical tests to validate the single-precision CUDA and MPI implementation and assess its accuracy. We then analyze performance measurements and depending on how the problem is mapped to the reference CPU cluster, we obtain a speedup of 20x or 12x.
/content/cudazone/CUDABrowser/assets/images/applications/1225_20831_seismic_paper_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1225_20831_seismic_paper_large.jpg
Academia
Universite de Pau (France), Florida State University (US), TU Dortmund (Germany)
2010
06
15
06/15/2010
20
Dimitri Komatitsch
Gordon Erlebacher
Dominik Goddeke
David Michea
Paper
Numerics
Oil & Gas
Clusters
Dimitri Komatitsch,Gordon Erlebacher,Dominik Goddeke,David Michea,dimitri.komatitsch@univ-pau.fr
5f003419-e86a-4d8b-9d19-997acd44898b
To GPU Synchronize or Not GPU Synchronize?
The graphics processing unit (GPU) has evolved from being a fixed function processor with programmable stages into a programmable processor with many fixed function components that deliver massive parallelism. By modifying the GPU's stream processor to support general-purpose computation on the GPU (GPGPU), applications that perform massive vector operations can realize many orders of magnitude improvement in performance over a traditional processor, i.e., CPU.
https://research.cs.vt.edu/synergy/pubs/papers/feng-iscas2010-gpusync.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1224_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1224_GPUComputing bgimg_large.png
Academia
Virginia Tech
2010
03
28
03/28/2010
Wu-chun Feng
Shucai Xiao
Paper
Wu-chun Feng,Shucai Xiao,wfeng@vt.edu,shucaig@vt.edu
aed6944b-13eb-478c-874a-751a332dad9d
FATSEA An Architectural Simulator for General Purpose Computing on GPUs
We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language aimed for GPGPU computing. FATSEA takes a Parallel Thread eXecution (PTX) code as input, which is a device independent code format generated by the Nvidia CUDA compiler, to validate results and estimate performance on Nvidia platforms.
http://ditec.um.es/~jlaragon/papers/FATSEA-RAPIDO-2010.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1223_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1223_GPUComputing bgimg_large.png
Academia
University of Murcia
2009
12
22
12/22/2009
K. E. Ostby
J. L. Aragon
J. M. Garcia
M. Ujaldon
Paper
K. E. Ostby,J. L. Aragon,J. M. Garcia
0c25d963-909c-4a1e-8f60-81bf8fd868cf
Evaluating the use of GPUs in Liver Image Segmentation and HMMER Database Searches
In this paper we present the results of parallelizing two life sciences applications, Markov random fields-based (MRF) liver segmentation and HMMER's Viterbi algorithm, using GPUs. We relate our experiences in porting both applications to the GPU as well as the techniques and optimizations that are most beneficial. The unique characteristics of both algorithms are demonstrated by implementations on an NVIDIA 8800 GTX Ultra using the CUDA programming environment. We test multiple enhancements in our GPU kernels in order to demonstrate the effectiveness of each strategy.
http://cadi.buffalo.edu/papers/2009/2009_4.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1222_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1222_GPUComputing bgimg_large.png
Academia
University at Buffalo, SUNY
2009
02
15
02/15/2009
John Paul Walters
Vidyananth Balu
Suryaprakash Kompalli
Vipin Chaudhary
Paper
John Paul Walters,Vidyananth Balu,Suryaprakash Kompalli,waltersj@buffalo.edu,vbalu2@buffalo.edu,kompalli@hp.com
e21e7880-81bf-46d6-9185-b56a93a0ad3b
GPUCT: A GPU-Accelerated CT Reconstruction System
CT scanning is a medical imaging technique commonly used in hospitals, including the University of Virginia Hospital, to see inside the human body. Modern CT scanners can generate images of the body in three dimensions, a process called 3D reconstruction. This project illustrates the feasibility of using graphics hardware (GPUs) to process CT scans in a more efficient and inexpensive manner than current commercial reconstruction systems. Additionally, this research considers the ethical and social implications of an improved CT reconstruction system in terms of risks for hospitals and patients.
http://www.cs.virginia.edu/~skadron/Papers/maier_thesis07.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1221_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1221_GPUComputing bgimg_large.png
Academia
University of Virginia
2007
03
30
03/30/2007
56
Drew Maier
Paper
Drew Maier
72ba00af-f128-438b-9e0f-379786953cce
FPGA-Based Hardware Acceleration of Lithographic Aerial Image Simulation
Lithography simulation, an essential step in design for manufacturability (DFM), is still far from computationally efficient. Most leading companies use large clusters of server computers to achieve acceptable turn-around time. Thus coprocessor acceleration is very attractive for obtaining increased computational performance with a reduced power consumption. This article describes the implementation of a customized accelerator on FPGA using a polygon-based simulation model. An application-specific memory partitioning scheme is designed to meet the bandwidth requirements for a large number of processing elements.
http://cadlab.cs.ucla.edu/~cong/papers/TRETS-17.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1220_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1220_GPUComputing bgimg_large.png
Academia
University of California, Los Angeles
2009
09
17
09/17/2009
15
Jason Cong
Yi Zou
Paper
Jason Cong,Yi Zou
ebfbcda7-6930-45cc-83d5-1414da4c1325
Visualising Spins and Clusters in Regular and Small-World Ising Models with GPUs
Visualising computational simulation models of solid state physical systems is a hard problem for dense lattice models. Fly throughs and cutaways can aid viewer understanding of a simulated system. Interactive time model parameter updates and overlaying of measurements and graticules, cluster colour labelling and other visual highlighting cues can also enhance user intuition of the model's meaning. We present some graphical and simulation optimisation techniques and various graphical rendering and explanatory techniques for computational simulation models such as the Ising model in 2 and 3 dimensions. In addition to aiding understanding of conventional algorithms such as Metropolis Monte Carlo, we try to visualise cluster updates to the system using algorithms like that of Wolff.
http://www.massey.ac.nz/~dpplayne/Papers/cstn-108.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1219_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1219_GPUComputing bgimg_large.png
Academia
Massey University
2010
03
19
03/19/2010
A. Leist
D. P. Playne
K.A. Hawick
Paper
A. Leist,D. P. Playne,K.A. Hawick
3c9c3b66-7ebb-461e-884c-c820abca8856
Stereo Depth with a Unified Architecture GPU
This paper describes how the calculation of depth from stereo images was accelerated using a GPU. The Compute Unified Device Architecture (CUDA) from NVIDIA was employed in novel ways to compute depth using BT cost matching and the Semi-Global Matching algorithm. The challenges of mapping a sequential algorithm to a massively parallel thread environment and performance optimization techniques are considered.
http://mplab.ucsd.edu/wp-content/uploads/CVPR2008/WorkShops/data/papers/143.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1218_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1218_GPUComputing bgimg_large.png
Academia
Florida Atlantic University
2008
05
04
05/04/2008
Joel Gibson
Oge Marques
Paper
Joel Gibson,Oge Marques
4ea4d671-5de1-4355-8c9d-392fa35be551
3D Registration Based on Normalized Mutual Information: Performance of CPU vs. GPU Implementation
Medical image registration is time-consuming but can be sped up by employing parallel processing on the GPU. Normalized mutual information (NMI) is a well-performing similarity measure for multi-modal registration. We present CUDA-based solutions for computing NMI on the GPU and compare the results obtained by rigidly registering multi-modal data sets with a CPU-based implementation. Our tests with RIRE data sets show a speed-up by a factor of 5 to 7 for our best GPU implementation.
http://www.gris.informatik.tu-darmstadt.de/~swesarg/papers/1632.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1217_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1217_GPUComputing bgimg_large.png
Academia
Technische Universitat Darmstadt
2010
01
04
01/04/2010
Florian Jung
Stefan Wesarg
Paper
Florian Jung,Stefan Wesarg,stefan.wesarg@gris.tu-darmstadt.de
93ec4473-01fb-42ff-9e39-46248ff46941
fastHOG - a real-time GPU implementation of HOG
We introduce a parallel implementation of the histogram of oriented gradients algorithm for object detection. Our implementation uses the GPU and the NVIDIA CUDA framework. We achieve speedups of over 67x from the standard sequential code, using a single video card. Furthermore, it supports multiple video cards, so speedups of 120x or more can be achieved. This allows us to achieve real-time performance, using the full HOG algorithm for the first time in the literature.
http://www.robots.ox.ac.uk/ActiveVision/Papers/prisacariu_reid_tr2310_09/prisacariu_reid_tr2310_09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1216_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1216_GPUComputing bgimg_large.png
Academia
University of Oxford
2009
07
14
07/14/2009
120
Victor Adrian Prisacariu
Ian Reid
Paper
Victor Adrian Prisacariu,Ian Reid,victor@robots.ox.ac.uk,ian@robots.ox.ac.uk
608510ce-a512-41c8-b32a-b55cb524284d
Detection and Tracking of Human Subjects
The goal of the thesis project was to devise an algorithm to detect and track people in a static video. Existing techniques are inadequate; instead a new approach based on background subtraction is used. The approach is successful with a static camera. In background subtraction, the background of the video is calculated a priori and then subtracted from each frame of the video. This isolates the foreign objects, which are detected via two simple algorithms. Both algorithms are based on the subject's center of mass, but the first algorithm traces the path of the person around the video, making it very cluttered.
http://www.cs.virginia.edu/~skadron/Papers/Grosvenor_Douglas_thesis.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1215_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1215_GPUComputing bgimg_large.png
Academia
University of Virginia
2009
04
30
04/30/2009
Douglas Grosvenor
Paper
Douglas Grosvenor
78013ee5-9a75-42af-93a8-4a5b68a330fa
Optimizing Sparse Matrix-Vector Multiplication on GPUs
We are witnessing the emergence of Graphics Processing Units (GPUs) as powerful massively parallel systems. Furthermore, the introduction of new APIs for general-purpose computations on GPUs, namely, CUDA from NVIDIA, Stream SDK from AMD, and OpenCL, makes GPUs an attractive choice for high-performance numerical and scientific computing.
http://domino.watson.ibm.com/library/CyberDig.nsf/papers/1D32F6D23B99F7898525752200618339/$File/rc24704.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1214_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1214_GPUComputing bgimg_large.png
Research
IBM Research Division
2009
04
02
04/02/2009
Muthu Manikandan Baskaran
Rajesh Bordawekar
Paper
Muthu Manikandan Baskaran,Rajesh Bordawekar,baskaran@ces.ohio-state.edu,bordaw@us.ibm.com
67ac5cbe-b91c-46b0-8a4e-9635e86dec49
Acceleration of Binomial Options Pricing via Parallelizing along time-axis on a GPU
Since the introduction of organized trading of options for commodities and equities, computing fair prices for options has been an important problem in financial engineering. A variety of numerical methods, including Monte Carlo methods, binomial trees, and numerical solution of stochastic differential equations, are used to compute fair prices. Traders and brokerage firms constantly strive to achieve faster calculation of option prices because timely information can mean the difference between a deal struck or missed, which translates to substantial profit or loss. Hence, the latency to compute a fair option price plays an important role in short-term trading and arbitrage.
http://saahpc.ncsa.illinois.edu/09/papers/Ganesan_paper.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1213_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1213_GPUComputing bgimg_large.png
Academia
Washington University in St. Louis
2009
06
29
06/29/2009
Narayan Ganesan
Roger D. Chamberlain
Jeremy Buhler
Paper
Narayan Ganesan,Roger D. Chamberlain,Jeremy Buhler,nganesan@wustl.edu,roger@wustl.edu,jbuhler@wustl.edu
52b3b800-2810-4e18-b631-e995c9a2ed48
GPU Accelerated Cardiac Electrophysiology
Numerical simulations of cellular membranes are useful for both basic science and increasingly for clinical diagnostic and therapeutic applications. A common bottleneck in such simulations arises from solving large highly complex stiff systems of ordinary differential equations (ODEs) thousands of times for numerous collocation points (representing cells) throughout a three-dimensional volume. For some electrophysiology simulations, over 98% of the time is spent solving these systems of ODEs when run in serial on a single core.
http://cseweb.ucsd.edu/groups/hpcl/scg/papers/2010/lionetti_ms_thesis.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1212_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1212_GPUComputing bgimg_large.png
Academia
University of California, San Diego
2010
04
15
04/15/2010
280
Fred Lionetti
Paper
Fred Lionetti
7dbbc8bd-6029-4692-911e-5a4e725ef349
HARNESSING THE POWER OF IDLE GPUS FOR ACCELERATION OF BIOLOGICAL SEQUENCE ALIGNMENT
This paper presents a parallel system capable of accelerating biological sequence alignment on the graphics processing unit (GPU) grid. The GPU grid in this paper is a desktop grid system that utilizes idle GPUs and CPUs in the office and home. Our parallel implementation employs a master-worker paradigm to accelerate an OpenGL-based algorithm that runs on a single GPU. We integrate this implementation into a screensaver-based grid system that detects idle resources on which the alignment code can run.
http://www-hagi.ist.osaka-u.ac.jp/research/papers/200912_ino_ppl.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1211_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1211_GPUComputing bgimg_large.png
Academia
Osaka University
2009
08
24
08/24/2009
FUMIHIKO INO
YUKI KOTANI
YUMA MUNEKAWA
Paper
FUMIHIKO INO,YUKI KOTANI,YUMA MUNEKAWA
701ae5ed-4273-4c9a-8546-55b23244a5ca
Mixing Multi-Core CPUs and GPUs for Scientific Simulation Software
Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, however, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary.
http://www.massey.ac.nz/~dpplayne/Papers/cstn-091.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1210_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1210_GPUComputing bgimg_large.png
Academia
Massey University
2009
09
21
09/21/2009
K. A. Hawick
A. Leist
D. P. Playne
Paper
K. A. Hawick,A. Leist,D. P. Playne,k.a.hawick@massey.ac.nz,a.leist@massey.ac.nz,d.p.playne@massey.ac.nz
4a0b9cb0-a366-4a75-b2c4-e00c1af60a5c
Computing on GPUs
The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been the porting of compute-intensive algorithms, such as ray-tracing algorithms, from CPU to GPU. Through the Compute Unified Device Architecture (CUDA [4]), GPUs can also be used to increase computing speed for High Performance Computing applications. In this paper, different parallelization strategies for different processor architectures are presented. They are compared, and first experiences using GPUs for a collection of numerical applications are given.
/content/cudazone/CUDABrowser/assets/images/applications/1209_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1209_GPUComputing bgimg_large.png
Commercial
DYNAmore GmbH
2009
05
14
05/14/2009
Dr. Uli Gohner
Paper
Dr. Uli Gohner
3e365933-a682-4caa-aa9c-b165f30d448d
High-Performance Physics Simulations Using Multi-Core CPUs and GPGPUs in a Volunteer Computing Context
This paper presents two conceptually simple methods for parallelizing a Parallel Tempering Monte Carlo simulation in a distributed volunteer computing context, where computers belonging to the general public are used. The first method uses conventional multi-threading. The second method uses CUDA, a graphics card computing system. Parallel Tempering is described, and challenges such as parallel random number generation and mapping of Monte Carlo chains to different threads are explained. While conventional multi-threading on CPUs is well-established, GPGPU programming techniques and technologies are still developing and present several challenges, such as the effective use of a relatively large number of threads.
http://arxiv.org/ftp/arxiv/papers/1004/1004.0023.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1208_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1208_GPUComputing bgimg_large.png
Commercial
D-Wave Systems Inc.
2010
03
31
03/31/2010
Kamran Karimi
Neil G. Dickson
Firas Hamze
Paper
Kamran Karimi,Neil G. Dickson,Firas Hamze,kkarimi@dwavesys.com,ndickson@dwavesys.com,fhamze@dwavesys.com
1e8f6f3a-a7bd-4267-9fc9-3680f9bc0449
Accelerating Large-scale Convolutional Neural Networks with Parallel Graphics Multiprocessors
Training convolutional neural networks (CNNs) on large sets of high-resolution images is too computationally intense to be performed on commodity CPUs. Such architectures however achieve state-of-the-art results on low-resolution machine vision tasks such as the recognition of handwritten characters. We have adapted the inherent multi-level parallelism of CNNs for Nvidia's CUDA GPU architecture to accelerate the training by two orders of magnitude.
http://www.ais.uni-bonn.de/papers/nips09ws_scherer_behnke.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1206_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1206_GPUComputing bgimg_large.png
University of Bonn, Germany
2009
12
01
12/01/2009
Dominik Scherer
Sven Behnke
Paper
Dominik Scherer,Sven Behnke,scherer@ais.uni-bonn.de,behnke@cs.uni-bonn.de
c0bb0403-84f7-42b9-8575-ec19bf6268d2
A Practical GPU Based KNN Algorithm
The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. However, it cannot achieve satisfactory performance in many applications because of its high computational complexity. Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have opened a new era of parallel computing that delivers tremendous computational horsepower in a single chip. In this paper, we describe a practical GPU-based K Nearest Neighbor (KNN) algorithm implemented in CUDA.
http://www.academypublisher.com/proc/iscsct09/papers/iscsct09p151.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1205_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1205_GPUComputing bgimg_large.png
Academia
Soochow University
2009
12
26
12/26/2009
Quansheng Kuang
Lei Zhao
Paper
Quansheng Kuang,Lei Zhao,kqs.net@163.com,zhaol@suda.edu.cn
f51cd9bf-7067-4281-9b34-95671175e688
http://www.modelica.org/events/modelica2009/Proceedings/memorystick/pages/papers/0032/0032.pdf
This work focuses on the use of parallel hardware to improve the simulation speed of equation-based object-oriented Modelica models. With this intention, a method has been developed that allows for the translation of a restricted class of Modelica models to parallel simulation code, targeted for the Nvidia Tesla architecture and based on the Quantized State Systems (QSS) simulation algorithm.
http://www.modelica.org/events/modelica2009/Proceedings/memorystick/pages/papers/0032/0032.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1204_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1204_GPUComputing bgimg_large.png
2009
09
21
09/21/2009
Martina Maggio
Kristian Stavaker
Filippo Donida
Francesco Casella
Peter Fritzson
Paper
Martina Maggio,Kristian Stavaker,Filippo Donida
0648ea65-f71f-4795-a1be-579a9aa03e90
An efficient GPU implementation for large scale individual-based simulation of collective behavior
In this work we describe a GPU implementation of an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors' positions and orientations. This carries a very high computational cost in the so-called nearest-neighbors search. By leveraging the GPU processing power and the new programming model called CUDA, we implement an efficient framework that makes it possible to simulate the collective motion of high-density groups of individuals.
http://www.unibas.it/utenti/erra/Papers/HiBi09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1203_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1203_GPUComputing bgimg_large.png
Academia
Universita della Basilicata
2009
10
01
10/01/2009
Ugo Erra
Bernardino Frola
Vittorio Scarano
Iain Couzin
Paper
Ugo Erra,Bernardino Frola,Vittorio Scarano,ugo.erra@unibas.it,ber.frola@gmail.com,vitsca@dia.unisa.it
9dedab39-73d9-4c13-b5d2-2104819cec81
A Hybrid Analytical DRAM Performance Model
As process technology scales, the number of transistors that can fit in a unit area has increased exponentially. Processor throughput, memory storage, and memory throughput have all been increasing at an exponential pace. As such, DRAM has become an ever-tightening bottleneck for applications with irregular memory access patterns. Computer architects in industry sometimes use ad hoc analytical modeling techniques in lieu of cycle-accurate performance simulation to identify critical design points.
https://www.ece.ubc.ca/~aamodt/papers/gyuan.mobs2009.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1202_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1202_GPUComputing bgimg_large.png
Academia
University of British Columbia
2008
05
19
05/19/2008
George L. Yuan
Tor M. Aamodt
Paper
George L. Yuan,Tor M. Aamodt,gyuan@ece.ubc.ca,aamodt@ece.ubc.ca
df66c76c-4fda-468e-9eca-124cae57e3c4
Parallelisation of Fuzzy Inference on a Graphics Processor Unit Using the Compute Unified Device Architecture
The inherently parallel nature of fuzzy inference is rarely exploited by fuzzy systems researchers. Hardware implementations, such as Field Programmable Gate Arrays (FPGAs), commonly use parallel architectures to achieve fast inference speeds. In this paper, we explore the use of Graphics Processor Units (GPUs) and NVIDIA's Compute Unified Device Architecture (CUDA) for fast inference speeds in a scalable and flexible Mamdani type fuzzy inference system (FIS). Our goal is to provide computational intelligence researchers the skills necessary to exploit the low cost and high performance of GPUs with a minimum learning cost.
http://www.cci.dmu.ac.uk/ukci2008/papers/Parallelisation-of-Fuzzy-Inference-on-a-Graphics-Processor-Unit-Using-the-Compute-Unified-Device-Architecture.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1201_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1201_GPUComputing bgimg_large.png
Academia
University of Missouri, Columbia
2008
05
01
05/01/2008
Derek Anderson
Simon Coupland
Paper
Derek Anderson,Simon Coupland
76dcd879-098a-41f2-ab03-38c43d2a042e
GPU-Based Road Sign Detection Using Particle Swarm Optimization
Road Sign Detection is a major goal of Advanced Driving Assistance Systems (ADAS). Since the dawn of this discipline, much work based on different techniques has been published which shows that traffic signs can be first detected and then classified in video sequences in real time. While detection is usually performed using classical computer vision techniques based on color and/or shape matching, most often classification is performed by neural networks. In this work we present a novel approach based on both sign shape and color which uses Particle Swarm Optimization (PSO) for detection.
http://www.ce.unipr.it/~mussi/downloads/papers/mussiISDA09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1200_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1200_GPUComputing bgimg_large.png
Academia
University of Parma
2009
11
01
11/01/2009
Luca Mussi
Stefano Cagnoni
Fabio Daolio
Paper
Luca Mussi,Stefano Cagnoni,Fabio Daolio,mussi@ce.unipr.it,cagnoni@ce.unipr.it,fabio.daolio@unil.ch
f0e5e186-d65b-4d6b-9f18-fb153cfcf39a
LARGE-SCALE PARALLEL MULTIBODY DYNAMICS WITH FRICTIONAL CONTACT
In the context of simulating the frictional contact dynamics of large systems of rigid bodies, this paper reviews a novel method for solving large cone complementarity problems by means of a fixed-point iteration algorithm. The method is an extension of the Gauss-Seidel and Gauss-Jacobi methods with overrelaxation for symmetric convex linear complementarity problems.
http://www.mcs.anl.gov/uploads/cels/papers/P1487.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1199_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1199_GPUComputing bgimg_large.png
Academia
University of Wisconsin Madison
2008
10
20
10/20/2008
Dan Negrut
Alessandro Tasora
Mihai Anitescu
Paper
Dan Negrut,Alessandro Tasora,Mihai Anitescu,negrut@wisc.edu,tasora@ied.unipr.it,anitescu@mcs.anl.gov
abb411b1-54e9-4a4d-8f26-7acad6754856
A characterization and analysis of PTX kernels
General purpose application development for GPUs (GPGPU) has recently gained momentum as a cost-effective approach for accelerating data- and compute-intensive applications. It has been driven by the introduction of C-based programming environments such as NVIDIA's CUDA [1], OpenCL [2], and Intel's Ct [3]. While significant effort has been focused on developing and evaluating applications and software tools, comparatively little has been devoted to the analysis and characterization of applications to assist future work in compiler optimizations, application re-structuring, and micro-architecture design.
http://www.computer.org/portal/web/csdl/doi/10.1109/IISWC.2009.5306801
/content/cudazone/CUDABrowser/assets/images/applications/1198_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1198_GPUComputing bgimg_large.png
Academia
Georgia Institute of Technology
2009
05
05
05/05/2009
Andrew Kerr
Gregory Diamos
Sudhakar Yalamanchili
Paper
Andrew Kerr,Gregory Diamos,Sudhakar Yalamanchili
ed43c757-50fe-4151-a1c4-21184ce71dbd
General Purpose Computation on Graphics Processing Units (GPGPU) using CUDA
Graphics processing units (GPUs) are special processors which traditionally were used to accelerate computer graphics by offloading work from the CPU. Today, GPUs are highly parallel many-core processors which enable general-purpose computation on graphics processing units (GPGPU). GPGPU has been a topic since 2002, but broad interest did not develop until Nvidia released the CUDA platform in 2007. Developers and researchers then started to use CUDA for parallel programming.
http://www.wi.uni-muenster.de/pi/lehre/ws0910/pppa/papers/gpgpu.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1195_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1195_GPUComputing bgimg_large.png
Academia
Westfalische Wilhelms-Universitat
2009
12
20
12/20/2009
Alexander Zibula
Paper
Alexander Zibula
993dd63c-de2a-49e1-81ba-ade7f1682b25
Simulation of one-layer shallow water systems on multicore and CUDA architectures
The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the large size of the domains suggests the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first order well-balanced finite volume scheme for 2D one-layer shallow water systems by using modern Graphics Processing Units (GPUs) supporting the NVIDIA CUDA programming model.
http://lsi.ugr.es/~jmantas/papers/supercomputing09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1194_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1194_GPUComputing bgimg_large.png
Academia
1 Universidad de Granada, 2 Universidad de Malaga
2010
03
01
03/01/2010
Marc de la Asuncion1
Jose M. Mantas1
Manuel J. Castro2
Paper
Marc de la Asuncion1,Jose M. Mantas1,Manuel J. Castro2
78e3f6d6-b219-43a8-8c2d-f515465c3670
IDN_MFC
Image denoising with bilateral filter algorithms
/content/cudazone/CUDABrowser/assets/images/applications/1193_64638_Application_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1193_64638_Application_large.jpg
Academia
Wroclaw University of Technology
2010
06
22
06/22/2010
100
Wojciech Korycki
Application
Paper
Signal Processing
Bilateral Filter denoising,Wojciech Korycki,wojciech.korycki@gmail.com
9a370a84-4d82-4133-b523-6f56cca33568
Hypercubic Storage Layout and
Many simulations in the physical sciences are expressed in terms of rectilinear arrays of variables. It is attractive to develop such simulations for use in 1-, 2-, 3- or arbitrary physical dimensions and also in a manner that supports exploitation of data-parallelism on fast modern processing devices. We report on data layouts and transformation algorithms that support both conventional and data-parallel memory layouts.
http://tur-www1.massey.ac.nz/~dpplayne/Papers/cstn-096.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1192_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1192_GPUComputing bgimg_large.png
Academia
Massey University
2009
06
23
06/23/2009
K. A. Hawick
D. P. Playne
Paper
K. A. Hawick,D. P. Playne
f33d13b3-5699-4fb2-b908-98d32866aa20
Analyzing CUDA Workloads Using a Detailed GPU Simulator
Modern Graphics Processing Units (GPUs) provide sufficiently flexible programming models that understanding their performance can provide insight into designing tomorrow's manycore processors, whether those are GPUs or otherwise. The combination of multiple, multithreaded, SIMD cores makes studying these GPUs useful in understanding tradeoffs among memory, data, and thread level parallelism. While modern GPUs offer orders of magnitude more raw computing power than contemporary CPUs, many important applications, even those with abundant data level parallelism, do not achieve peak performance.
https://www.ece.ubc.ca/~aamodt/papers/gpgpusim.ispass09.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1191_GPUComputing bgimg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1191_GPUComputing bgimg_large.png
Academia
University of British Columbia
2009
03
01
03/01/2009
Ali Bakhoda
George L. Yuan
Wilson W. L. Fung
Henry Wong
Tor M. Aamodt
Paper
Ali Bakhoda,George L. Yuan,Wilson W. L. Fung,Henry Wong,Tor M. Aamodt
c1d20c0c-79f9-44f0-9331-291ccbeb0ee7
Phase Based Volume Registration Using CUDA
We have implemented phase based volume registration using CUDA, in contrast to all other GPU based image registration implementations that are based on the image intensity. Our registration algorithm is more robust for volumes that differ significantly in intensity. This work was presented at the IEEE conference ICASSP in Dallas 2010.
/content/cudazone/CUDABrowser/assets/images/applications/1189_449881_phase_based_volume_registration_using_CUDA_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1189_449881_phase_based_volume_registration_using_CUDA_large.png
Academia
Linkoping University
http://www.moviii.isy.liu.se
2010
06
22
06/22/2010
30
Anders Eklund
Mats Andersson
Hans Knutsson
Paper
Medical Imaging
Image registration, local phase,Anders Eklund,Mats Andersson,Hans Knutsson,andek@imt.liu.se,matsa@imt.liu.se,knutte@imt.liu.se
f9f4cdda-fab7-40fa-b8bc-43482f378a81
Towards a Software Transactional Memory for Graphics Processors
The introduction of general purpose computing on many-core graphics processor systems, and the general shift in the industry towards parallelism, has created a demand for ease of parallelization. Software transactional memory (STM) simplifies development of concurrent code by allowing the programmer to mark sections of code to be executed concurrently and atomically in an optimistic manner. In contrast to locks, STMs are easy to compose and do not suffer from deadlocks. We have designed and implemented two STMs for graphics processors, one blocking and one non-blocking. The design issues involved in the development of these two STMs are described and explained in the paper together with experimental results comparing the performance of the two STMs.
/content/cudazone/CUDABrowser/assets/images/applications/1188_7612_cudazonestm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1188_7612_cudazonestm_large.png
Academia
Chalmers University of Technology
http://www.chalmers.se
2010
04
14
04/14/2010
Daniel Cederman
Philippas Tsigas
Muhammad Tayyab Chaudhry
Paper
Programming Tools
Daniel Cederman,Philippas Tsigas,cederman@chalmers.se,tsigas@chalmers.se
a8c7fb3f-0fb8-40d2-8440-c9dedecf7051
nexiwave Speech Indexing
nexiwave 2.0: GPU-assisted speech indexing
/content/cudazone/CUDABrowser/assets/images/applications/1187_13933_nexilogo_betawith_snowflakes_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1187_13933_nexilogo_betawith_snowflakes_large.png
Commercial
nexiwave.com
http://nexiwave.com
2010
06
03
06/03/2010
75
Commercial
Ben Jiang
Application
Signal Processing
Speech Indexing
Speech Indexing, GPU,Ben Jiang,ben@nexiwave.com
a42593ae-afe0-46ed-85d3-7a1ab25c93ac
Massive Bayesian Mixture Modelling
This paper describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large data sets. An example context concerns common biological studies using high-throughput technologies generating many, very large data sets and requiring increasingly high-dimensional mixture models with large numbers of mixture components.
/content/cudazone/CUDABrowser/assets/images/applications/1186_179430_cfse_clusters_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1186_179430_cfse_clusters_large.jpg
Academia
UCLA and Duke University
2010
03
01
03/01/2010
160
Open source
Marc A. Suchard
Quanli Wang
Cliburn Chan
Jacob Frelinger
Andrew Cron
Mike West
Paper
Code
Numerics
Life Sciences
Science
Computational statistics,Marc A. Suchard,Quanli Wang,Cliburn Chan, Jacob Frelinger, Andrew Cron, Mike West,msuchard@ucla.edu,mw@stat.duke.edu
b5e57af8-695e-4550-9a42-a30be2716079
Accelerating Quadrature Methods for Option Valuation
This paper presents an architecture for FPGA acceleration of quadrature methods used for pricing complex options, such as discrete barrier, Bermudan, and American options. The architecture can be optimized for speed and power consumption by exploiting pipelining and parallelism to produce efficient implementations in reconfigurable logic. An optimised implementation using Graphics Processing Units (GPUs) is also developed, to provide a performance and efficiency comparison with an FPGA accelerator.
http://www.computer.org/portal/web/csdl/doi/10.1109/FCCM.2009.36
/content/cudazone/CUDABrowser/assets/images/applications/1185_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1185_logo_CS Digital Library_large.jpg
Academia
Imperial College London
2009
04
01
04/01/2009
Anson H. T. Tse
David B. Thomas
Wayne Luk
Paper
Anson H. T. Tse,David B. Thomas,Wayne Luk
934fb2f1-7ee6-4958-bb7d-b4ed38debaee
High Resolution Program Flow Visualization of Hardware Accelerated Hybrid Multi-core Applications
The advent of multi-core processors has made parallel computing techniques mandatory on mainstream systems. With the recent rise of hardware accelerators, hybrid parallelism adds yet another dimension of complexity to the process of software development. This article presents a tool for graphical program flow analysis of hardware accelerated parallel programs.
http://www.computer.org/portal/web/csdl/doi/10.1109/CCGRID.2010.27
/content/cudazone/CUDABrowser/assets/images/applications/1184_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1184_logo_CS Digital Library_large.jpg
2010
05
01
05/01/2010
Daniel Hackenberg
Guido Juckeland
Holger Brunst
Paper
Daniel Hackenberg,Guido Juckeland,Holger Brunst
d0ffd6f3-3337-4dbe-a796-0c7d19b1cd6e
An Analysis of GPU Parallel Computing
Parallel systems are becoming ubiquitous in the world of computing, as evidenced by multi-core processors, the heterogeneous Cell Broadband Engine, and highly parallel graphics processing units (GPUs). All parallel systems share a requirement that parallel programming is necessary to leverage multiple cores. As a result of this trend, multi-core CPUs are no longer a clear winner, due to their peaked clock frequencies and the programming effort involved in parallelizing code for multi-core architectures. Given such drawbacks, data-parallel applications might benefit from GPU-assisted computing.
http://www.computer.org/portal/web/csdl/doi/10.1109/HPCMP-UGC.2009.59
/content/cudazone/CUDABrowser/assets/images/applications/1183_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1183_logo_CS Digital Library_large.jpg
Research
U. S. Army Research Laboratory
2009
06
01
06/01/2009
Song Jun Park
Paper
Song Jun Park
79cc1232-9c03-4eca-ac3f-dd9b3743fac0
Tiling for Performance Tuning on Different Models of GPUs
The strategy of using CUDA-compatible GPUs as a parallel computation solution to improve the performance of programs has been increasingly widely adopted during the last two years since the CUDA platform was released. Its benefit extends from the graphics domain to many other computationally intensive domains. Tiling, as the most general and important technique, is widely used for optimization in CUDA programs. However, new GPU models with better compute capabilities have been released, along with new versions of the CUDA SDK.
http://www.computer.org/portal/web/csdl/doi/10.1109/ISISE.2009.60
/content/cudazone/CUDABrowser/assets/images/applications/1182_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1182_logo_CS Digital Library_large.jpg
Academia
Hong Kong University of Science and Technology
2002
11
01
11/01/2002
Chang Xu
Steven R. Kirk
Samantha Jenkins
Paper
Chang Xu,Steven R. Kirk,Samantha Jenkins
e5e749ac-945e-42f2-a238-1209f8986eb2
Acceleration of Medical Image Registration Using Graphics Process Units in Computing Normalized Mutual Information
This paper presents a computational performance analysis of accelerated medical image registration using Graphics Processing Units (GPUs). In our previous work, a multi-resolution approach using normalized mutual information (NMI) proved to be useful in medical image registration. In this paper, we propose an acceleration of the NMI procedure using a GPU implementation because of its parallel processing capabilities.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICIG.2009.48
/content/cudazone/CUDABrowser/assets/images/applications/1181_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1181_logo_CS Digital Library_large.jpg
Academia
Kent State University
2009
09
01
09/01/2009
Wei-Hung Cheng
Cheng-Chang Lu
Paper
Wei-Hung Cheng,Cheng-Chang Lu
705aaab7-bfe3-4433-ae2d-e1490bf77dbb
MAX-MIN Ant System on GPU with CUDA
We propose a parallel MAX-MIN Ant System (MMAS) algorithm that is suitable for implementation on graphics processing units (GPUs). Multiple ant colonies, each with its own parameter settings, are offloaded entirely to the GPU in parallel. We have implemented this GPU-based MMAS with the compute unified device architecture (CUDA). Several performance optimizations for the GPU kernel program are introduced. Experimental results based on simulations for the traveling salesperson problem are presented to evaluate the proposed techniques.
/content/cudazone/CUDABrowser/assets/images/applications/1180_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1180_logo_CS Digital Library_large.jpg
2009
12
07
12/07/2009
Hongtao Bai
Dantong Ou Yang
Ximing Li
Lili He
Haihong Yu
Paper
Hongtao Bai,Dantong Ou Yang,Ximing Li
f3d7658d-4572-4345-a1bd-3fe05ca6ce37
Scene Recognition Acceleration Using CUDA and OpenMP
Scene recognition has become a prominent field in image processing, and many methods have been proposed in recent years. Among them, extracting the scene gist from global features has been shown to give higher retrieval accuracy than many other methods. However, the process of extracting the gist is heavily time-consuming and not suitable for real-time applications. In this paper, the CUDA architecture is deployed to accelerate this process.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICISE.2009.1045
/content/cudazone/CUDABrowser/assets/images/applications/1179_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1179_logo_CS Digital Library_large.jpg
Academia
Dalian University of Technology
2009
12
01
12/01/2009
Yuxin Wang
Zhen Feng
He Guo
Changqin He
Yuansheng Yang
Paper
Yuxin Wang,Zhen Feng,He Guo
b717867c-c959-4577-b6a5-afbc0e42fdae
A Stream Processor Cluster Architecture Model with the Hybrid Technology of MPI and CUDA
Nowadays, the compute capability of traditional cluster systems cannot keep up with the computing needs of practical applications, and aspects such as energy and space have become a huge problem. However, as parallel computing equipment, the stream processor (SP) offers high floating-point performance. NVIDIA GPUs are typical stream processor devices, and CUDA technology makes developing parallel programs on GPUs more flexible. In this paper, we make use of the hybrid parallel computing programming environment (HPCPE) with MPI and CUDA technology to build a simple CPU + GPU-based stream processor cluster system.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICISE.2009.171
/content/cudazone/CUDABrowser/assets/images/applications/1177_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1177_logo_CS Digital Library_large.jpg
Academia
University of Shanghai for Science and Technology
2009
12
01
12/01/2009
Qing-kui Chen
Jia-kang Zhang
Paper
Qing-kui Chen,Jia-kang Zhang
5da86d39-2d35-4792-a0b5-e400e3383959
Formal Description and Optimization Based High-Performance Computing on CUDA
In recent years, with the development of GPUs, general purpose computation on graphics processors has become a new field. Aiming at GPU processing, this paper provides a formal description of the data parallel model, a detailed description of the CUDA programming model, and the principles of optimization. Comparative experiments show that CUDA has strong parallel processing ability and provides new methods and ideas for GPGPU.
/content/cudazone/CUDABrowser/assets/images/applications/1176_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1176_logo_CS Digital Library_large.jpg
Academia
Hong Kong University of Science and Technology
2009
12
01
12/01/2009
Bo Li
Huacheng Zhao
JingJing Liang
Paper
Bo Li,Huacheng Zhao,JingJing Liang
2287f161-8378-4190-ae1f-bd428d9ca3c3
Password Recovery for RAR Files Using CUDA
Driven by the insatiable demand for real-time graphics, especially from the computer games market, the Graphics Processing Unit (GPU) has become a major source of computing power in recent years, as GPU performance now surpasses that of contemporary CPUs. This paper presents our study on how to efficiently recover the passwords for encrypted RAR files.
http://www.computer.org/portal/web/csdl/doi/10.1109/DASC.2009.123
/content/cudazone/CUDABrowser/assets/images/applications/1175_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1175_logo_CS Digital Library_large.jpg
2009
12
01
12/01/2009
Guang Hu
Jianhua Ma
Benxiong Huang
Paper
Guang Hu,Jianhua Ma,Benxiong Huang
2b933813-11b7-4f81-8e63-9ce67eba045f
Password Recovery for RAR Files Using CUDA
Driven by the insatiable demand for real-time graphics, especially from the computer games market, the Graphics Processing Unit (GPU) has become a major source of computing power in recent years, as GPU performance now surpasses that of contemporary CPUs. This paper presents our study on how to efficiently recover the passwords for encrypted RAR files. Our research focuses on AES key generation, which is the most time-consuming stage in the whole RAR encryption/decryption process.
http://www.computer.org/portal/web/csdl/doi/10.1109/DASC.2009.123
/content/cudazone/CUDABrowser/assets/images/applications/1174_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1174_logo_CS Digital Library_large.jpg
2009
12
01
12/01/2009
Guang Hu
Jianhua Ma
Benxiong Huang
Paper
Guang Hu,Jianhua Ma,Benxiong Huang
f928f65a-86c9-4a31-8299-3e40f02d03fa
GPU-Assisted Computation of Centroidal Voronoi Tessellation
Centroidal Voronoi tessellations (CVT) are widely used in computational science and engineering. The most commonly used method for computing the CVT is Lloyd's method, and the L-BFGS method has recently been shown to be faster than Lloyd's method. However, these methods run on the CPU and are still too slow for many practical applications. We present techniques to implement these methods on the GPU for computing the CVT on 2D planes and on surfaces, and demonstrate significant speedup of these GPU-based methods over their CPU counterparts.
http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2010.53
/content/cudazone/CUDABrowser/assets/images/applications/1173_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1173_logo_CS Digital Library_large.jpg
Academia
University of Texas at Dallas
2010
03
16
03/16/2010
Guodong Rong
Yang Liu
Wenping Wang
Xiaotian Yin
David Gu
Xiaohu Guo
Paper
Guodong Rong,Yang Liu,Wenping Wang,Xiaotian Yin,David Gu,Xiaohu Guo
7e3424c0-b0ed-4476-8830-eb60da8a80c7
Designing Efficient Many-Core Parallel Algorithms for All-Pairs Shortest-Paths Using CUDA
Finding the all-pairs shortest-paths on a large graph is a fundamental problem in many practical applications such as bioinformatics, internet node traffic and network routing. In this paper, we present the designs of two efficient parallel algorithms for many-core GPUs using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. By using the global scope of the GPU's global memory, coalescing the global memory reads and writes, and avoiding on-chip shared memory bank conflicts, we are able to achieve a large performance benefit with a speed-up of 2,500x on a desktop computer in comparison with a single core program.
http://www.computer.org/portal/web/csdl/doi/10.1109/ITNG.2010.230
/content/cudazone/CUDABrowser/assets/images/applications/1172_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1172_logo_CS Digital Library_large.jpg
Academia
Lamar University
2010
04
01
04/01/2010
Quoc-Nam Tran
Paper
Quoc-Nam Tran
0315bfde-f758-4fc6-8cf5-85aac810ca12
Record Setting Software Implementation of DES Using CUDA
The increase in computational power of off-the-shelf hardware offers more and more advantageous tradeoffs among efficiency, cost and availability, thus enhancing the feasibility of cryptanalytic attacks aiming to lower the security of widely used cryptosystems. In this paper we illustrate a GPU-based software implementation of the most efficient variant of Data Encryption Standard (DES), showing the performance of a software breaker which effectively exploits the multi-core Nvidia GT200 graphics architecture.
http://www.computer.org/portal/web/csdl/doi/10.1109/ITNG.2010.43
/content/cudazone/CUDABrowser/assets/images/applications/1171_logo_CS Digital Library_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1171_logo_CS Digital Library_large.jpg
2010
04
01
04/01/2010
Giovanni Agosta
Alessandro Barenghi
Fabrizio De Santis
Gerardo Pelosi
Paper
Giovanni Agosta,Alessandro Barenghi,Fabrizio De Santis
7df8d14f-4c52-460a-8881-ad932fd45292
Eye-Full Tower: A GPU-based variable multibaseline omnidirectional stereovision system with automatic baseline selection for outdoor mobile robot navigation
In recent years, it can be observed that there is a gradual increase in the number of researchers and projects involved with the development of omnidirectional vision systems for various applications. The primary factors, which contributed towards this positive trend, are the availability of inexpensive and high resolution vision sensors, robust and fast computers and the advantages of using such systems over perspective vision systems. In this paper, a novel variable multibaseline omnidirectional stereovision system is presented.
http://portal.acm.org/citation.cfm?id=1805342.1805504&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289
/content/cudazone/CUDABrowser/assets/images/applications/1170_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1170_logo_acm_portal2_large.jpg
Academia
Monash University
2010
06
01
06/01/2010
Wen Lik Dennis Lui
Ray Jarvis
Paper
Wen Lik Dennis Lui,Ray Jarvis
56e6ce9f-ae50-427d-8d3f-23b1e24c6683
Optimized high speed pixel sorting and its application in watershed based image segmentation
Efficient sorting of image pixels based on their grayscale value is traditionally implemented using an algorithm based on distribution or counting sort methods. We show that an elegant alternative can be used which outperforms the traditional method both in terms of processing speed and main memory access. We discuss both theoretically analyzed and real-life performance results, and demonstrate the improvements that can be obtained when our algorithm is combined with a well-known watershed image segmentation method.
/content/cudazone/CUDABrowser/assets/images/applications/1169_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1169_logo_acm_portal2_large.jpg
Research
The National Institute for Criminalistics and Criminology (NICC)
2010
07
01
07/01/2010
Patrick De Smet
Paper
Patrick De Smet
729f484e-7759-401b-a6ef-78c85f290bd6
GPU-accelerated molecular dynamics simulation for study of liquid crystalline flows
We have developed a GPU-based molecular dynamics simulation for the study of flows of fluids with anisotropic molecules such as liquid crystals. An application of the simulation to the study of macroscopic flow (backflow) generation by molecular reorientation in a nematic liquid crystal under the application of an electric field is presented. The computations of intermolecular force and torque are parallelized on the GPU using the cell-list method, and an efficient algorithm to update the cell lists was proposed. Some important issues in the implementation of computations that involve a large number of arithmetic operations and data on the GPU that has limited high-speed memory resources are addressed extensively.
http://portal.acm.org/citation.cfm?id=1808372.1808870&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289
/content/cudazone/CUDABrowser/assets/images/applications/1168_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1168_logo_acm_portal2_large.jpg
Academia
Kochi University of Technology
2010
08
01
08/01/2010
Alfeus Sunarso
Tomohiro Tsuji
Shigeomi Chono
Paper
Alfeus Sunarso,Tomohiro Tsuji,Shigeomi Chono
2b5f83d2-9afa-42a5-8582-8a9b3ae48841
Parallel implementation of wavelet-based image denoising on programmable PC-grade graphics hardware
The discrete wavelet transform (DWT) has been extensively used for image compression and denoising in the areas of image processing and computer vision. However, the intensive computation of DWT due to its inherent multilevel data decomposition and reconstruction operations brings a bottleneck that drastically reduces its performance and implementations for real-time applications when facing large size digital images and/or high-definition videos. Although various software-based acceleration solutions, such as the lifting scheme, have been devised and achieved a higher performance in general, the pure software accelerated DWT still struggles to cope with the demands from real-time and interactive applications. With the growing capacity and popularity of graphics hardware, personal computers (PCs) nowadays are often equipped with programmable graphics processing units (GPUs) for graphics acceleration. The GPU offers a cost-effective parallel data processing mechanism for operations on large amounts of data, even for applications beyond graphics. This practice is commonly referred to as general-purpose computing on GPU (GPGPU).
http://portal.acm.org/citation.cfm?id=1786816.1787181&coll=Portal&dl=GUIDE&CFID=92176503&CFTOKEN=39358289
/content/cudazone/CUDABrowser/assets/images/applications/1167_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1167_logo_acm_portal2_large.jpg
Academia
University of Huddersfield
2010
08
01
08/01/2010
Yang Su
Zhijie Xu
Paper
Yang Su,Zhijie Xu
a4486be2-fdef-4db0-89b0-879b296f6681
GPU Computing for Atmospheric Modeling: Experience with a small kernel and implications for a full model
Much success has been achieved using GPUs to accelerate existing applications that are highly data parallel, or that are dominated by small, intense computational kernels. What are the prospects for porting existing large scientific models that do not fit this mold? We take an expensive routine from the CAM atmosphere model, and port it to a GPU using CUDA. We use the experience gained as a guide in thinking about porting the full application to an accelerator based system. We consider the best path forward for getting large scientific models running on accelerator based systems, and identify cases where porting may be feasible, and where a complete redesign may be the best option.
/content/cudazone/CUDABrowser/assets/images/applications/1166_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1166_logo_xplore_large.gif
Research
National Center for Atmospheric Research
2010
02
04
02/04/2010
Rory Kelly
Paper
Rory Kelly
ce8e0150-e62b-4a0c-890b-0442b2e058a6
Design and Performance Evaluation of Image Processing Algorithms on GPUs
In this paper, we examine key factors in the design and evaluation of image processing algorithms on massively parallel GPUs (graphics processing units) using the CUDA (compute unified device architecture) programming model. A set of metrics, customized for image processing, is proposed to quantitatively evaluate algorithm characteristics. In addition, we show that a range of image processing algorithms map readily to CUDA, using multiview stereo matching, linear feature extraction, JPEG2000 image encoding, and non-photorealistic rendering (NPR) as our example applications.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5477417&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1165_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1165_logo_xplore_large.gif
Academia
Inha University
2010
06
03
06/03/2010
In Kyu Park
Nitin Singhal
Man Hee Lee
Sungdae Cho
Chris Kim
Paper
In Kyu Park,Nitin Singhal,Man Hee Lee
750d6910-dfb0-4eee-9c8e-cec1320d7f09
CUDA-Based Linear Solvers for Stable Fluids
In the field of computer graphics, physically-based fluid simulations (i.e., simulations that solve the equations that govern fluid behaviour) are performed using, among others, Stam's stable fluids method. This method requires the solution of two sparse linear systems that can be solved using an iterative solver (e.g., Jacobi, Gauss-Seidel, or conjugate gradient). Focusing on real-time 3D applications, we provide and analyze the performance of the parallel GPU-based (using CUDA) algorithms of the Jacobi, Gauss-Seidel, and conjugate gradient solvers.
/content/cudazone/CUDABrowser/assets/images/applications/1164_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1164_logo_xplore_large.gif
2010
04
21
04/21/2010
Goncalo Amador
Abel Gomes
Paper
Goncalo Amador,Abel Gomes
d43dc906-8a5f-4bb4-bcb2-006f2d9be085
Implementation of Variable Preconditioned GCR with mixed precision on GPU using CUDA
The Variable Preconditioned GCR (VPGCR) with mixed precision on Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA) is numerically investigated. The convergence theorem of VPGCR guarantees that the residual equation for the preconditioning procedure can be solved in single precision. The results of computations show that VPGCR with mixed precision on the GPU significantly outperforms the CPU version. In particular, VPGCR on the GPU with mixed precision is 22.53 times faster than on the Central Processing Unit (CPU).
/content/cudazone/CUDABrowser/assets/images/applications/1163_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1163_logo_xplore_large.gif
Academia
Tokyo University of Technology
2010
05
09
05/09/2010
Soichiro Ikuno
Norihisa Fujita
Susumu Yamamoto
Susumu Nakata
Paper
Soichiro Ikuno,Norihisa Fujita,Susumu Yamamoto
55d72cf8-150e-4281-be46-1a00cd588e1e
A CUDA-Based Implementation of Stable Fluids in 3D with Internal and Moving Boundaries
Fluid simulation has been an active research field in computer graphics for the last 30 years. Stam's stable fluids method, among others, is used for solving the equations that govern fluids (i.e. Navier-Stokes equations). An implementation of stable fluids in 3D using NVIDIA Compute Unified Device Architecture (CUDA) is provided in this paper. This CUDA-based implementation also features the accurate physical treatment of internal (i.e. static boundaries inside the simulation domain) and moving boundaries. The performance gains of the presented implementation vs a sequential CPU-based implementation, and points of further improvement are also addressed.
/content/cudazone/CUDABrowser/assets/images/applications/1161_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1161_logo_xplore_large.gif
2010
03
23
03/23/2010
Goncalo Amador
Abel Gomes
Paper
Goncalo Amador,Abel Gomes
d41f4ce3-8c86-4dd4-835b-8954c9caef44
Hybrid Core Acceleration of UWB SIRE Radar Signal Processing
To move High Performance Computing (HPC) closer to forward operating environments and missions, the Army Research Laboratory is developing approaches using hybrid, asymmetric core computing. By blending capabilities found in Graphics Processing Units (GPUs) and traditional von Neumann multi-core Central Processing Units (CPUs), approaches are being developed and optimized to provide at or near real-time processing speeds for research project applications. Algorithms are designed to partition work to resources best designed to handle the processing load. The use of commodity resources allows the design to be flexible throughout the life-cycle without the costly and time-consuming delays associated with Application Specific Integrated Circuit (ASIC) development. This paradigm allows for rapid technology transfer to end users.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5477419&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1160_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1160_logo_xplore_large.gif
Research
U.S. Army Research Laboratory
2010
06
03
06/03/2010
Song Jun Park
James Ross
Dale Shires
David Richie
Brian Henz
Lam Nguyen
Paper
Song Jun Park,James Ross,Dale Shires
78482823-4944-4b0a-ab95-15e1aad00454
Optimal loop unrolling for GPGPU programs
Graphics Processing Units (GPUs) are massively parallel, many-core processors with tremendous computational power and very high memory bandwidth. With the advent of general purpose programming models such as NVIDIA's CUDA and the new standard OpenCL, general purpose programming using GPUs (GPGPU) has become very popular. However, the GPU architecture and programming model have brought with them many new challenges and opportunities for compiler optimizations. One such classical optimization is loop unrolling. Current GPU compilers perform limited loop unrolling. In this paper, we attempt to understand the impact of loop unrolling on GPGPU programs.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470423&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1159_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1159_logo_xplore_large.gif
Academia
The Ohio State University
2010
04
19
04/19/2010
Giridhar Murthy Sreenivasa
Mahesh Ravishankar
Muthu Manikandan Baskaran
P. Sadayappan
Paper
Giridhar Murthy Sreenivasa,Mahesh Ravishankar,Muthu Manikandan Baskaran
86fc0781-21d5-4b31-9fb0-7061d02a703b
Using CUDA enabled FDTD simulations to solve multi-gigahertz EMI challenges
Thanks to the application of GPU-CUDA acceleration technology to EM simulation tools, more and more complicated EMI challenges can be efficiently investigated and solved very early in the design process. This paper presents a novel methodology to predict EMI emission due to memory SSO noise from a real, commercial graphics card by means of a commercially available CUDA accelerated full-wave FDTD simulator. It is shown that thanks to the CUDA acceleration one can estimate the influence of on-board decoupling capacitors on the EMI emission within hours.
/content/cudazone/CUDABrowser/assets/images/applications/1158_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1158_logo_xplore_large.gif
Research
KHBO Flanders Mechatronics Engineering Centre
2010
04
12
04/12/2010
Davy Pissoort
Chen Wang
Hany Fahmy
Paper
Davy Pissoort,Chen Wang,Hany Fahmy
860722bd-ba53-4391-9e5c-7197a5574713
Dynamic load balancing on single- and multi-GPU systems
The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470413&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1157_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1157_logo_xplore_large.gif
Academia
University of Delaware
2010
04
19
04/19/2010
Long Chen
Oreste Villa
Sriram Krishnamoorthy
Paper
Long Chen,Oreste Villa,Sriram Krishnamoorthy
8bc36492-7c2b-4e62-ba6d-a9664ee84f10
Automatic Generation of Multi-Core Chemical Kernels
This work presents KPPA (the Kinetics PreProcessor: Accelerated), a general analysis and code generation tool that achieves significantly reduced time-to-solution for chemical kinetics kernels on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis of chemical kernels from WRF-Chem and the Community Multiscale Air Quality Model (CMAQ) is presented for each platform in double and single precision on coarse and fine grids.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5473221&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1156_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1156_logo_xplore_large.gif
Academia
Virginia Polytechnic Institute and State University
2010
05
27
05/27/2010
John Linford
John Michalakes
Manish Vachharajani
Adrian Sandu
Paper
J. Linford,J. Michalakes,M. Vachharajani
f0ed0b4c-5d78-4283-b786-d977d462b699
Speculative execution on multi-GPU systems
The lag of parallel programming models and languages behind the advance of heterogeneous many-core processors has left a gap between the computational capability of modern systems and the ability of applications to exploit them. Emerging programming models, such as CUDA and OpenCL, force developers to explicitly partition applications into components (kernels) and assign them to accelerators in order to utilize them effectively. An accelerator is a processor with a different ISA and micro-architecture than the main CPU. These static partitioning schemes are effective when targeting a system with only a single accelerator.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470427&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1155_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1155_logo_xplore_large.gif
Academia
Georgia Institute of Technology
2010
04
19
04/19/2010
Gregory Diamos
Sudhakar Yalamanchili
Paper
Gregory Diamos,Sudhakar Yalamanchili
829884ae-a849-4ab7-a5db-1ebb4290798a
AUTO-GC: Automatic translation of data mining applications to GPU clusters
Because of the very favorable price to performance ratio of the GPUs, a popular parallel programming configuration today is a cluster of GPUs. However, extracting performance on such a configuration would typically require programming in both MPI and CUDA, thus requiring a high degree of expertise and effort. It is clearly desirable to be able to support higher-level programming of this emerging high-performance computing platform.
/content/cudazone/CUDABrowser/assets/images/applications/1154_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1154_logo_xplore_large.gif
Academia
The Ohio State University
2010
04
19
04/19/2010
Wenjing Ma
Gagan Agrawal
Paper
Wenjing Ma,Gagan Agrawal
4401d416-56a2-4a55-88a0-a8ccbb66c75d
Pricing of cross-currency interest rate derivatives on Graphics Processing Units
We present a Graphics Processing Unit (GPU) parallelization of the computation of the price of cross-currency interest rate derivatives via a Partial Differential Equation (PDE) approach. In particular, we focus on the GPU-based parallel computation of the price of long-dated foreign exchange interest rate hybrids, namely Power Reverse Dual Currency (PRDC) swaps with Bermudan cancelable features. We consider a three-factor pricing model with foreign exchange skew which results in a time-dependent parabolic PDE in three spatial dimensions. Finite difference methods on uniform grids are used for the spatial discretization of the PDE, and the Alternating Direction Implicit (ADI) technique is employed for the time discretization.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470708&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1153_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1153_logo_xplore_large.gif
Academia
University of Toronto
2010
04
19
04/19/2010
Duy Minh Dang
Paper
Duy Minh Dang
fb5436fc-e3e8-4c43-b70a-2dc6ba8e4f18
Study on GPU-accelerated extraction of interconnects parasitic using CUDA and MPI
Parallel computation is application-oriented, particularly for the GPU (Graphics Processing Unit) with its inherent parallelism. This paper shows the architecture of a GPU cluster based on MPI (Message Passing Interface) and CUDA (Compute Unified Device Architecture). Results show that the acceleration ratio improves markedly, but the gains diminish on large-scale GPU clusters. The parallel algorithm focuses mainly on partitioning sparse matrix-vector multiplications (SpVM) across GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1151_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1151_logo_xplore_large.gif
Academia
Chinese Academy of Sciences
2010
05
09
05/09/2010
Xiaoyu Xu
Guoqiang Liu
Hui Qu
Wei Xu
Yang Zhang
Paper
Xiaoyu Xu,Guoqiang Liu,Hui Qu
eb22bb0c-f56a-47bc-8b54-6bc2fa978435
Performance study of mapping irregular computations on GPUs
Recently, Graphical Processing Units (GPUs) have become increasingly capable and well-suited to general purpose applications. As a result of the GPU's high degree of parallelism and computational power, there has been a great deal of interest directed toward the platform for parallel application development. Much of the focus, however, has been on very regular applications that exhibit a high degree of data parallelism, as these applications map well to the GPU. Irregular applications, such as the Breadth First Search discussed in this paper, have not been as extensively studied and are more difficult to implement in an efficient fashion on the GPU. We will present both an implementation of the Breadth First Search algorithm as well as that of a Matrix Parenthesization algorithm.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470770&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1150_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1150_logo_xplore_large.gif
Academia
University of Manitoba
2010
04
19
04/19/2010
Steven Solomon
Parimala Thulasiraman
Paper
Steven Solomon,Parimala Thulasiraman
2aa51865-7e62-41be-8adf-a461a0ae58d7
Design and implementation of MPEG audio layer III decoder using graphics processing units
This paper describes a new implementation method for the MPEG audio layer III (MP3) decoder. The proposed architecture is based on a graphics processing unit (GPU) using the CUDA environment, so that it can effectively take advantage of the parallel computing power of modern GPUs. The system implemented with this architecture employs a multi-thread model and memory optimization to process MP3 decoding in parallel, which significantly reduces the computational overhead. Experimental results on a GTX260+ graphics card showed that the proposed architecture is over five times faster than a traditional CPU-based MP3 library.
/content/cudazone/CUDABrowser/assets/images/applications/1148_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1148_logo_xplore_large.gif
Academia
Chinese Academy of Sciences
2010
04
09
04/09/2010
Chen Xiaoliang
Zheng Chengshi
Ma Longhua
Cheng Xiaobin
Li Xiaodong
Paper
Chen Xiaoliang,Zheng Chengshi,Ma Longhua
0e47069d-420f-4d76-9565-d04f4341f8d2
Dynamically tuned push-relabel algorithm for the maximum flow problem on CPU-GPU-Hybrid platforms
The maximum flow problem is a fundamental graph theory problem with many important applications. Max-flow algorithms based on the push-relabel method are known to have better complexity bound and faster practical execution speed than others. However, existing push-relabel algorithms are designed for uniprocessors or parallel processors that support locking primitives, thus making it very difficult to apply the push-relabel technique to CUDA-based GPUs. In this paper, we present a first generic parallel push-relabel algorithm for CUDA devices. We model the parallelization efficiency of the algorithm, which reveals that, for a given input graph, the level of parallelism varies during the execution of the algorithm.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470401&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1147_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1147_logo_xplore_large.gif
Academia
Georgia Institute of Technology
2010
04
19
04/19/2010
Zhengyu He
Bo Hong
Paper
Zhengyu He,Bo Hong
a23a06cf-4a9e-46f4-9171-718369793c99
An auto-tuning framework for parallel multicore stencil computations
Although stencil auto-tuning has shown tremendous potential in effectively utilizing architectural resources, it has hitherto been limited to single kernel instantiations; in addition, the large variety of stencil kernels used in practice makes this computation pattern difficult to assemble into a library. This work presents a stencil auto-tuning framework that significantly advances programmer productivity by automatically converting a straightforward sequential Fortran 95 stencil expression into tuned parallel implementations in Fortran, C, or CUDA, thus allowing performance portability across diverse computer architectures, including the AMD Barcelona, Intel Nehalem, Sun Victoria Falls, and the latest NVIDIA GPUs.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470421&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1146_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1146_logo_xplore_large.gif
Research
CRD/NERSC, Lawrence Berkeley National Laboratory
2010
04
19
04/19/2010
Shoaib Kamil
Cy Chan
Leonid Oliker
John Shalf
Samuel Williams
Paper
Shoaib Kamil,Cy Chan,Leonid Oliker
5448d0e9-eebf-4e90-8c29-59f7f4c224c1
Comparing Hardware Accelerators in Scientific Applications: A Case Study
Multi-core processors and a variety of accelerators have allowed scientific applications to scale to larger problem sizes. We present a performance, design methodology, platform, and architectural comparison of several application accelerators executing a Quantum Monte Carlo application. We compare the application's performance and programmability on a variety of platforms including CUDA with Nvidia GPUs, Brook+ with ATI graphics accelerators, OpenCL running on both multi-core and graphics processors, C++ running on multi-core processors, and a VHDL implementation running on a Xilinx FPGA.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5482576&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1145_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1145_logo_xplore_large.gif
Academia
University of Tennessee
2010
06
06
06/06/2010
R. Weber
A. Gothandaraman
R. Hinde
G. Peterson
Paper
R. Weber,A. Gothandaraman,R. Hinde
1aa1156a-6f55-4e59-93bb-4cf5c1a6b6a2
Demystifying GPU Microarchitecture Through Microbenchmarking
Graphics processors (GPUs) offer the promise of more than an order of magnitude speedup over conventional processors for certain non-graphics computations. Because the GPU is often presented as a C-like abstraction (e.g., Nvidia's CUDA), little is known about the characteristics of the GPU's architecture beyond what the manufacturer has documented. This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. Various undisclosed characteristics of the processing elements and the memory hierarchies are measured.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5452013&queryText%3Dcuda+2010%26openedRefinements%3D*%26sortType%3Ddesc_Publication+Year%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1144_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1144_logo_xplore_large.gif
Academia
University of Toronto
2010
03
28
03/28/2010
H. Wong
M. Papadopoulou
M. Sadooghi-Alvandi
A. Moshovos
Paper
H. Wong,M. Papadopoulou,M. Sadooghi-Alvandi
56a2b66b-15bc-46e7-bce1-2f1b937dfe11
SelfAudience
Audience Measurement - real time video analysis for counting people, face detection and tracking
/content/cudazone/CUDABrowser/assets/images/applications/1143_428449_selfadvert_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1143_428449_selfadvert_large.png
Commercial
SelfAdvert
http://www.selfadvert.com
2010
05
15
05/15/2010
300
SelfAdvert
Application
Multimedia
Video & Audio
freeware audience measurement,SelfAdvert,sales@selfadvert.com
41b685bb-4e0a-49ba-af2f-b938f11bae36
Cellular Automata Evolver
Evolver of Cellular Automata 1D rules plus inference tools with the state of the art technology
/content/cudazone/CUDABrowser/assets/images/applications/1142_caev2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1142_caev2_large.png
Research
Cellular Automata Evolver
2010
06
02
06/02/2010
10
Denis Antiga
Application
Science
Cellular Automata,Denis Antiga,a.denis1@yahoo.com
a71cdec6-989a-487e-bfe7-36278925ca5d
Statistical constraints on binary black hole inspiral dynamics
We perform a statistical analysis of the binary black hole problem in the post-Newtonian approximation by systematically sampling and evolving the parameter space of initial configurations for quasi-circular inspirals. Through a principal component analysis of spin and orbital angular momentum variables we systematically look for uncorrelated quantities and find three of them which are highly conserved in a statistical sense, both as functions of time and with respect to variations in initial spin orientations.
http://arxiv.org/abs/1005.5560
/content/cudazone/CUDABrowser/assets/images/applications/1141_bh_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1141_bh_large.png
Academia
University of Maryland
2010
05
30
05/30/2010
50
Chad Galley
Frank Herrmann
John Silberholz
Manuel Tiglio
Gustavo Guerberoff
Paper
Numerics
Science
Chad Galley,Frank Herrmann,John Silberholz, Manuel Tiglio, Gustavo Guerberoff,tiglio@umd.edu
f5c5c329-3a57-4d10-a40d-475b6d59423c
Object-oriented stream programming using aspects
High-performance parallel programs that efficiently utilize heterogeneous CPU+GPU accelerator systems require tuned coordination among multiple program units. However, using current programming frameworks such as CUDA leads to tangled source code that combines code for the core computation with that for device and computational kernel management, data transfers between memory spaces, and various optimizations. In this paper, we propose a programming system based on the principles of Aspect-Oriented Programming, to un-clutter the code and to improve programmability of these heterogeneous parallel systems. Specifically, we use standard C++ to describe the core computations and aspects to encapsulate all other support parts.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470472&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26pageNumber%3D2%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1140_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1140_logo_xplore_large.gif
Academia
Rutgers University
2010
04
19
04/19/2010
Mingliang Wang
Manish Parashar
Paper
Mingliang Wang,Manish Parashar
60a13ea4-55ee-4aa3-9fbe-2d1ee29bca6c
The GPU Computing Era
GPU computing is at a tipping point, becoming more widely used in demanding consumer applications and high-performance computing. This article describes the rapid evolution of GPU architectures (from graphics processors to massively parallel many-core multiprocessors), recent developments in GPU computing architectures, and how the enthusiastic adoption of CPU+GPU coprocessing is accelerating parallel applications.
/content/cudazone/CUDABrowser/assets/images/applications/1139_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1139_logo_xplore_large.gif
Commercial
NVIDIA
2010
03
01
03/01/2010
J. Nickolls
W. J. Dally
Paper
J. Nickolls,W. J. Dally
d5846aba-4896-45b5-b4d7-371b91ef56e5
Fast implementation of Wyner-Ziv Video codec using GPGPU
In this paper, we report a fast implementation of a Wyner-Ziv video decoder using general-purpose computing on graphics processing units (GPGPU). Despite its many advantages, Wyner-Ziv video coding has the problem of huge decoding complexity. Since Slepian-Wolf decoding with rate-adaptive LDPC accumulate codes takes up more than 90% of the entire Wyner-Ziv video decoding complexity, in this paper we focus on a fast implementation of the Slepian-Wolf decoder using CUDA (Compute Unified Device Architecture), which is a GPGPU architecture developed by NVIDIA. Our implementation is shown to be 4 to 5 times (QCIF size) or 15 to 20 times (CIF size) faster than conventional Slepian-Wolf decoding.
/content/cudazone/CUDABrowser/assets/images/applications/1138_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1138_logo_xplore_large.gif
Academia
Sungkyunkwan University
2010
03
24
03/24/2010
20
Ryanggeun Oh
Jongbing Park
Byeungwoo Jeon
Paper
Ryanggeun Oh,Jongbing Park,Byeungwoo Jeon
b37872a6-2751-4cb1-9d68-64b433ae6da1
Efficient parallel algorithms for maximum-density segment problem
One of the fundamental problems involving DNA sequences is to find high density segments of certain widths, for example, those regions with intensive guanine and cytosine (GC). Formally, given a sequence, each element of which has a value and a width, the maximum-density segment problem asks for the segment with the maximum density while satisfying minimum and possibly maximum width constraints. While several linear-time sequential algorithms have emerged recently due to its primitive-like utility, to our knowledge, no nontrivial parallel algorithm has yet been proposed for this topical problem.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470390&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1136_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1136_logo_xplore_large.gif
Academia
Georgia State University
2010
04
19
04/19/2010
Xue Wang
Fasheng Qiu
Sushil K. Prasad
Paper
Xue Wang,Fasheng Qiu,Sushil K. Prasad
ad4bb0eb-1147-4ed3-ae05-d7d49cb8d9b4
Fast binding site mapping using GPUs and CUDA
Binding site mapping refers to the computational prediction of the regions on a protein surface that are likely to bind a small molecule with high affinity. The process involves flexibly docking a variety of small molecule probes and finding a consensus site that binds most of those probes. Due to the computational complexity of flexible docking, the process is often split into two steps: the first performs rigid docking between the protein and the probe; the second models the side chain flexibility by energy-minimizing the (few thousand) top scoring protein-probe complexes generated by the first step. Both these steps are computationally very expensive, requiring many hours of runtime per probe on a serial CPU.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470895&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1134_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1134_logo_xplore_large.gif
Academia
Boston University
2010
04
19
04/19/2010
Bharat Sukhwani
Martin C. Herbordt
Paper
Bharat Sukhwani,Martin C. Herbordt
4fb1d9f2-99b7-4237-b4f0-46e4bf9cf25a
Parallel external sorting for CUDA-enabled GPUs with load balancing and low transfer overhead
Sorting is a well-investigated topic in computer science, and by now many efficient sorting algorithms for CPUs and GPUs have been developed. GPUs offer no swapping, paging, or similar mechanism to provide more virtual memory than is physically available; thus, if one wants to use the GPU to sort sequences that exceed GPU memory, the problem of external sorting arises.
/content/cudazone/CUDABrowser/assets/images/applications/1133_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1133_logo_xplore_large.gif
Academia
Christian-Albrechts-University
2010
04
19
04/19/2010
Hagen Peters
Ole Schulz-Hildebrandt
Norbert Luttenberger
Paper
Hagen Peters,Ole Schulz-Hildebrandt,Norbert Luttenberger
5668b6fb-9541-43d3-b426-da3fbc93395c
A tile-based parallel Viterbi algorithm for biological sequence alignment on GPU with CUDA
The Viterbi algorithm is the compute-intensive kernel in Hidden Markov Model (HMM) based sequence alignment applications. In this paper, we investigate extending several parallel methods, such as the wave-front and streaming methods for the Smith-Waterman algorithm, to achieve a significant speed-up on a GPU. The wave-front method can take advantage of the computing power of the GPU but it cannot handle long sequences because of the physical GPU memory limit. On the other hand, the streaming method can process long sequences but with increased overhead due to the increased data transmission between CPU and GPU. To further improve the performance on GPU, we propose a new tile-based parallel algorithm. We take advantage of the homological segments to divide long sequences into many short pieces and each piece pair (tile) can be fully held in the GPU's memory.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470903&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1132_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1132_logo_xplore_large.gif
Academia
Tsinghua University
2010
04
19
04/19/2010
Zhihui Du
Zhaoming Yin
David A. Bader
Paper
Zhihui Du,Zhaoming Yin,David A. Bader
a2642850-909e-475e-bd95-fcb458609914
Designing scalable many-core parallel algorithms for min graphs using CUDA
Removing redundant edges on a large graph is a fundamental problem in many practical applications such as verification of real-time systems and network routing. In this paper, we present the designs of scalable and efficient parallel algorithms for multiple many-core GPU devices using CUDA. Our algorithms expose substantial fine-grained parallelism while maintaining minimal global communication. By using the global scope of the GPU's global memory, coalescing the global memory reads and writes, and avoiding on-chip shared memory bank conflicts, we are able to achieve a large performance benefit with a speed-up of 2,500x on a desktop computer in comparison with a single core CPU program. We report our experiments on large graphs with up to 29K vertices using multiple GPU devices.
/content/cudazone/CUDABrowser/assets/images/applications/1131_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1131_logo_xplore_large.gif
Academia
Lamar University
2010
04
19
04/19/2010
Quoc-Nam Tran
Paper
Quoc-Nam Tran
ebaed060-0f46-49b2-b9b3-5bfbb7e21ff5
Implementing the Himeno benchmark with CUDA on GPU clusters
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers allows linear scaling and performs at over 800 GFlops on a cluster with 16 GPUs. The paper presents the optimizations required to achieve this level of performance.
/content/cudazone/CUDABrowser/assets/images/applications/1130_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1130_logo_xplore_large.gif
Commercial
NVIDIA
2010
04
19
04/19/2010
Everett H. Phillips
Massimiliano Fatica
Paper
Everett H. Phillips,Massimiliano Fatica
adad926b-73e9-492a-a3d5-69301fb1d791
CUDA-based AES parallelization with fine-tuned GPU memory utilization
Current Graphics Processing Units (GPUs) offer large potential for speeding up computationally intensive data-parallel applications over traditional parallelization approaches, since there are many more hardware threads inside GPUs than computational cores available to common CPU threads. NVIDIA developed a generic GPU programming platform, CUDA, which allows programmers to utilize GPUs through the C programming language and parallelize applications in a way similar to traditional multithreading approaches. However, not all applications are suitable for this new platform.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470766&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1129_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1129_logo_xplore_large.gif
Academia
Arkansas State University
2010
04
19
04/19/2010
Chonglei Mei
Hai Jiang
Jeff Jenness
Paper
Chonglei Mei,Hai Jiang,Jeff Jenness
59dd2453-f984-4ab7-9a4c-49d2350b0f09
Optimization of linked list prefix computations on multithreaded GPUs using CUDA
We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470455&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1128_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1128_logo_xplore_large.gif
Academia
University of Maryland
2010
04
19
04/19/2010
Zheng Wei
Joseph JaJa
Paper
Zheng Wei,Joseph JaJa
9ecfd491-9553-4775-b59e-87718a3593fc
Parallel computing with CUDA
NVIDIA's CUDA architecture provides a powerful platform for writing highly parallel programs. By providing simple abstractions for hierarchical thread organization, memories, and synchronization, the CUDA programming model allows programmers to write scalable programs without the burden of learning a multitude of new programming constructs.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5470378&queryText%3Dcuda%26searchWithin%3D2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1127_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1127_logo_xplore_large.gif
Commercial
NVIDIA
2010
04
19
04/19/2010
Michael Garland
Paper
Michael Garland
7473646a-8f1b-4e8b-b578-cabe90a66678
Efficient compilation of fine-grained SPMD-threaded programs for multicore CPUs
In this paper we describe techniques for compiling fine-grained SPMD-threaded programs, expressed in programming models such as OpenCL or CUDA, to multicore execution platforms. Programs developed for manycore processors typically express finer thread-level parallelism than is appropriate for multicore platforms. We describe options for implementing fine-grained threading in software, and find that reasonable restrictions on the synchronization model enable significant optimizations and performance improvements over a baseline approach. We evaluate these techniques in a production-level compiler and runtime for the CUDA programming model targeting modern CPUs.
http://portal.acm.org/citation.cfm?id=1772954.1772971&coll=Portal&dl=ACM&CFID=91959390&CFTOKEN=70859630
/content/cudazone/CUDABrowser/assets/images/applications/1126_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1126_logo_acm_portal2_large.jpg
Research
NVIDIA Corporation
2010
04
01
04/01/2010
John A Stratton
Vinod Grover
Jaydeep Marathe
Bastiaan Aarts
Mike Murphy
Ziang Hu
Wen-mei W. Hwu
Paper
John A Stratton,Vinod Grover,Jaydeep Marathe
7702a523-e58c-4e1e-8e05-207e1430c47c
Non-blocking programming on multi-core graphics processors: extended abstract
This paper investigates the synchronization power of coalesced memory accesses, a family of memory access mechanisms introduced in recent large multicore architectures like the CUDA graphics processors. We first design three memory access models to capture the fundamental features of the new memory access mechanisms. Subsequently, we prove the exact synchronization power of these models in terms of their consensus numbers. These tight results show that the coalesced memory access mechanisms can facilitate strong synchronization between the threads of multicore processors, without the need of synchronization primitives other than reads and writes.
http://portal.acm.org/citation.cfm?id=1556444.1556448&coll=Portal&dl=ACM&CFID=91959390&CFTOKEN=70859630
/content/cudazone/CUDABrowser/assets/images/applications/1125_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1125_logo_acm_portal2_large.jpg
Academia
University of Tromsø
2009
06
01
06/01/2009
Phuong Hoai Ha
Philippas Tsigas
Otto J. Anshus
Paper
Phuong Hoai Ha,Philippas Tsigas,Otto J. Anshus
018fa5a9-ae5d-498c-87ba-c505061b01c5
Application-guided tool development for architecturally diverse computation
Architecturally diverse computation exploits non-traditional computing platforms (e.g., field-programmable gate arrays, graphics processors, heterogeneous chip multiprocessors) to execute user applications. We have designed the Auto-Pipe tool set with the goal of easing the task of developing applications for architecturally diverse systems. Prior to and during the course of Auto-Pipe's design, we have developed a number of real, substantial applications, and the lessons learned during the development of these applications have had a direct bearing on the capabilities of Auto-Pipe. In this paper, we describe the relationship between our application development experience and Auto-Pipe. In short, how have applications guided the tools' evolution and development?
/content/cudazone/CUDABrowser/assets/images/applications/1124_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1124_logo_acm_portal2_large.jpg
Academia
Washington University in St. Louis
2010
03
01
03/01/2010
R. D. Chamberlain
J. Buhler
M. Franklin
J. H. Buckley
Paper
R. D. Chamberlain,J. Buhler,M. Franklin
7c408604-7b5b-4079-8a47-1aeb09371dde
NeuroSolutions CUDA Add-on
The NeuroSolutions CUDA Add-on implements high-performance parallel computing of Neural Networks using Levenberg-Marquardt, one of the most powerful forms of back-propagation learning available. Neural Networks are a form of artificial intelligence (AI) that has proved to be effective in solving a wide range of data mining and data modeling problems, including credit card fraud detection, cancer diagnosis and financial forecasting, to name a few. As problems become more and more complex, so does the demand for processing power. By parallelizing advanced learning algorithms on a GPU (Graphics Processing Unit), NeuroSolutions can achieve up to 50 times greater performance than processing on a traditional CPU (Central Processing Unit). A free evaluation version of NeuroSolutions is available for download on our website.
/content/cudazone/CUDABrowser/assets/images/applications/1123_v6-nscuda-large_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1123_v6-nscuda-large_large.jpg
Commercial
NeuroDimension, Inc.
http://www.nd.com
2010
06
13
05/13/2010
50
Commercial
Gary Lynn
Brian Kachnowski
Application
Presentation
Computational Fluid Dynamics
Finance
Imaging
Medical Imaging
Numerics
Life Sciences
Libraries
Oil & Gas
Science
Signal Processing
Neural Networks
Data Mining
Machine Learning
neural network, Levenberg-Marquardt, CUDA, Multilayer Perceptron, GPU, parallel processing,Gary Lynn,Brian Kachnowski,info@nd.com
e1b2a932-da54-4115-9913-ef21d09b12cb
Bayesian Real-Time Perception Algorithms on GPU
Real-time implementation of a Bayesian framework for robotic multisensory perception using the Compute Unified Device Architecture (CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/1122_1bayesoccupancyfilter_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1122_1bayesoccupancyfilter_large.jpg
Academia
Mobile Robotics Lab, Institute of Systems and Robotics, Coimbra Pole, Portugal
http://paloma.isr.uc.pt
2010
02
26
02/26/2010
30,000
Joao Filipe Ferreira
Jorge Lobo
Jorge Dias
Paper
Life Sciences
Science
Signal Processing
Video & Audio
Robotics and Artificial Perception
Joao Filipe Ferreira,Jorge Lobo,Jorge Dias,jfilipe@isr.uc.pt
f695686e-a314-4d4c-a222-7a1e88c753f3
A Work-Efficient GPU Algorithm for Level Set Segmentation
Level set segmentation is a powerful computational method for identifying complex objects in n-dimensional images. We present a novel level set segmentation algorithm that scales efficiently on an unbounded number of parallel computer processors while performing asymptotically no more work than the most efficient known sequential algorithm. We demonstrate that our new algorithm is one order of magnitude faster than current state-of-the-art parallel algorithms with no reduction in accuracy.
/content/cudazone/CUDABrowser/assets/images/applications/1121_brainweb-3D-composite-offset-small_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1121_brainweb-3D-composite-offset-small_large.jpg
Academia
University of Calgary
http://www.ucalgary.ca
2010
06
25
06/25/2010
14
Mike Roberts
Jeff Packer
Mario Costa Sousa
J Ross Mitchell
Multimedia
Paper
Graphics
Imaging
Medical Imaging
level set image segmentation,Mike Roberts,Jeff Packer,Mario Costa Sousa / J Ross Mitchell,mlrobert@ucalgary.ca
4b052c40-1ee0-4188-bb05-5854ba1bafc9
Real Time Face Tracking
A real-time face tracker that tracks multiple faces simultaneously across successive video frames with maximum stability.
/content/cudazone/CUDABrowser/assets/images/applications/1120_NeST-NVIDIA_Center_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1120_NeST-NVIDIA_Center_large.png
Commercial
Network Systems & Technologies (P) Ltd.
http://nestsoftware.com
2009
11
28
11/28/2009
Midhun M
Neethu K Chandran
Preetha Joy
Paper
Imaging
Midhun M,Neethu K Chandran,Preetha Joy,hpc@nestgroup.net
055f0f06-5ab2-4644-9765-9d40534a183a
HPC Platform options: Cell BE and GPU
This write-up briefly compares two competing performance architectures for data parallelism: the Cell Broadband Engine (Cell for short) and the GPU (Graphics Processing Unit).
/content/cudazone/CUDABrowser/assets/images/applications/1119_NeST-NVIDIA_Center_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1119_NeST-NVIDIA_Center_large.png
Commercial
Network Systems & Technologies (P) Ltd.
http://nestsoftware.com
2009
11
10
11/10/2009
Anoop Thomas
Paper
Cell BE and GPU comparison
Anoop Thomas,hpc@nestgroup.net
91689c21-d8df-4d7d-b37a-1ac1fdd6227c
Parallel Iterative Linear Solvers on GPU: A Financial Engineering Case
In many numerical applications resulting from computational science and engineering problems, the solution of sparse linear systems is the most prohibitively compute-intensive task. Consequently, the linear solvers need to be carefully chosen and efficiently implemented in order to harness the available computing resources. Krylov subspace based iterative solvers have been widely used for solving large systems of linear equations. In this paper, we focus on the design of such iterative solvers to take advantage of the massive parallelism of general-purpose Graphics Processing Units (GPUs).
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5452413&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1118_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1118_logo_xplore_large.gif
Academia
Ecole Centrale Paris
2010
02
17
02/17/2010
A. Gaikwad
I.M. Toke
Paper
A. Gaikwad,I.M. Toke
5ad8691f-8688-42ac-b692-0094ba1c701b
IP routing processing with graphic processors
Throughput and programmability have always been the central, but generally conflicting, concerns for modern IP router designs. Current high-performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but can only achieve a throughput one order of magnitude lower. Modern GPUs offer significant computing power, and their data-parallel computing model matches well the typical patterns of packet processing on routers. Accordingly, in this research we investigate the potential of CUDA-enabled GPUs for IP routing applications.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5457229&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1117_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1117_logo_xplore_large.gif
Academia
Tsinghua University
2010
03
08
03/08/2010
Shuai Mu
Xinya Zhang
Nairen Zhang
Jiaxin Lu
Yangdong Steve Deng
Shu Zhang
Paper
Shuai Mu,Xinya Zhang,Nairen Zhang
2c2381ee-55c4-4cd2-8c69-1433e6716c77
Frame-based parallelization of MPEG-4 on compute unified device architecture (CUDA)
Due to its object-based nature, flexible features and provision for user interaction, the MPEG-4 encoder is highly suitable for parallelization. The most critical and time-consuming operation of the encoder is motion estimation. Nvidia's general-purpose graphical processing unit (GPGPU) architecture allows for a massively parallel stream processor model at a very low price (a few thousand rupees). However, synchronization of parallel calculations and repeated device-to-host data transfer are major challenges in parallelizing motion estimation on CUDA.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5422997&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1116_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1116_logo_xplore_large.gif
Academia
Indian Institute of Technology
2010
02
19
02/19/2010
D. Ailawadi
M.K. Mohapatra
A. Mittal
Paper
D. Ailawadi,M.K. Mohapatra,A. Mittal
031e13f9-1141-4bf3-8aed-851ab751fa2b
CUDA Based GPU Programming to Simulate 3D Tissue Deformation
Medical training systems based on virtual simulation are in high demand since minimally invasive surgical techniques have become popular with patients. Such a training system helps surgeon trainees to acquire, practice and evaluate their surgical skills, and its key component is the simulation of dynamic procedures such as 3D biological tissue deformation during surgical operations. In our paper, an improved mass-spring model is proposed to represent the biological tissue surface, in which a virtual spring is introduced to help compensate for the weakness of the conventional mass-spring model.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5462444&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1115_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1115_logo_xplore_large.gif
2010
04
23
04/23/2010
Yuanyuan Zhang
Jianhui Zhao
Zhiyong Yuan
Yihua Ding
Chengjian Long
Lu Xiong
Paper
Yuanyuan Zhang,Jianhui Zhao,Zhiyong Yuan, Yihua Ding, Chengjian Long,Lu Xiong
678ff1ca-9f68-4d9c-b7ec-767fbbf2d2f0
Offloading Region Matching of Data Distribution Management with CUDA
Data distribution management (DDM) aims to reduce the transmission of irrelevant data between High Level Architecture (HLA) compliant simulators by taking their interesting regions into account (i.e. region matching). In a large-scale simulation, computation-intensive region matching would have a direct impact on the simulation performance. To deal with the high computation cost of region matching, the whole process of region matching is offloaded to graphical processing units (GPUs) based on Compute Unified Device Architecture (CUDA).
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5416077&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1114_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1114_logo_xplore_large.gif
Academia
National Tsing Hua University
2010
01
27
01/27/2010
Shih Hsiang Lo
Yeh Ching Chung
Fang Ping Pai
Paper
Shih Hsiang Lo,Yeh Ching Chung,Fang Ping Pai
7bcba91a-58ab-41e9-b4f3-7cfa59a1b492
hiCUDA: High-Level GPGPU Programming
Graphics Processing Units (GPUs) have become a competitive accelerator for applications outside the graphics domain, driven by improvements in GPU programmability. Although the Compute Unified Device Architecture (CUDA) is a simple C-like interface for programming NVIDIA GPUs, porting applications to CUDA remains a challenge to average programmers. In particular, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host and GPU memories, and of manually optimizing the utilization of the GPU memory.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5445082&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1113_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1113_logo_xplore_large.gif
2010
04
08
04/08/2010
T. Han
T. Abdelrahman
Paper
T. Han,T. Abdelrahman
a27cc5b9-6fe9-40dd-a83c-c76f2b5e3228
Preliminary implementation of VQ image coding using GPGPU
GPGPU (general-purpose computing on graphics processing units), which uses GPUs for general-purpose computations such as numerical calculations as well as for graphics processing, is attracting a great deal of attention. In this paper, as an example of hierarchical clustering algorithms, we evaluate PNN (pairwise nearest neighbor) on GPUs by using CUDA (compute unified device architecture). We also evaluate it from the viewpoint of power consumption.
/content/cudazone/CUDABrowser/assets/images/applications/1111_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1111_logo_xplore_large.gif
Academia
Konan University
2010
01
09
01/09/2010
A. Wakatani
Paper
A. Wakatani
e8febd87-0451-4a70-b279-4e86b3c46e9b
Real Time Simulation of Tissue Cutting Based on GPU and CUDA for Surgical Training
A novel approach to the simulation of soft tissue cutting in a virtual reality endoscopic simulator is presented for applications in surgical training and education. This approach is based on an improved mass-spring model and the use of computational geometry. A virtual spring is introduced to compensate for the weakness of the conventional mass-spring model, and a detection algorithm utilizing decomposition of affine coordinates is adopted to determine the springs that intersect with the cutting plane.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5462450&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1110_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1110_logo_xplore_large.gif
2010
04
23
04/23/2010
Yuanyuan Zhang
Zhiyong Yuan
Yihua Ding
Jianhui Zhao
Zhaoliang Duan
Mingui Sun
Paper
Life Sciences
Yuanyuan Zhang,Zhiyong Yuan,Yihua Ding
6ae115e8-d829-42f7-a29e-fb2bf1e2ba0a
A GPU-enabled solver for time-constrained linear sum assignment problems
This paper deals with solving large instances of the Linear Sum Assignment Problem (LSAP) under real-time constraints, using Graphical Processing Units (GPUs). The motivating scenario is an industrial application for P2P live streaming that is moderated by a central tracker that periodically solves LSAP instances to optimize the connectivity of thousands of peers. However, our findings are generic enough to be applied in other contexts. Our main contribution is a parallel version of a heuristic algorithm called Deep Greedy Switching (DGS) on GPUs using the CUDA programming language.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5461816&queryText%3Dcuda+2010%26openedRefinements%3D*%26searchField%3DSearch+All
/content/cudazone/CUDABrowser/assets/images/applications/1109_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1109_logo_xplore_large.gif
Commercial
Peerialism, Inc.
2010
03
28
03/28/2010
Roberto Roverso
Amgad Naeiem
Mohammed El-Beltagy
Sameh El-Ansary
Seif Haridi
Paper
Roberto Roverso,Amgad Naeiem,Mohammed El-Beltagy
883bddba-16b6-4c99-a6e8-8317a72ca1a9
Accelerating H.264 inter prediction in a GPU by using CUDA
H.264/AVC defines a very efficient inter prediction algorithm, but it is computationally expensive. The emergence of general-purpose graphics processing units (GPGPUs) has opened a new door to running this video algorithm on these small processing units. In this paper, a step is taken towards an implementation of the H.264/AVC inter prediction algorithm on a GPU using the Compute Unified Device Architecture (CUDA). The results show a negligible rate-distortion drop with an average time reduction of up to 93.6%.
http://www.ieeexplore.ieee.org/search/freesrchabstract.jsp?tp=&arnumber=5418821
/content/cudazone/CUDABrowser/assets/images/applications/1108_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1108_logo_xplore_large.gif
Academia
University of Castilla
2010
01
09
01/09/2010
R. Rodriguez
J.L. Martinez
G. Fernandez-Escribano
J.M. Claver
J.L. Sanchez
Paper
R. Rodriguez,J.L. Martinez,G. Fernandez-Escribano
58f115c8-1620-40a8-a30a-1bee644e7c5f
Porting of an Edge-Based CFD Solver to GPUs
Graphics processing units (GPUs) are increasingly becoming a mainstream platform for high performance computational fluid dynamics. This paper describes the porting of a substantial portion of FEFLO, an adaptive, edge-based finite element code for the solution of compressible and incompressible flow, to run on GPUs. The code is primarily written in Fortran 77 and has been ported to vector, shared memory parallel (via OpenMP) and distributed memory parallel (via MPI) machines.
http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_523.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1107_logo_AIAA_portal_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1107_logo_AIAA_portal_large.jpg
Academia
George Mason University
2010
01
04
01/04/2010
Andrew Corrigan
Fernando Camelli
Rainald Lohner
Paper
Andrew Corrigan,Fernando Camelli,Rainald Lohner
22bda164-33bd-4311-9e70-1723cfeaee9b
Toward efficient GPU-accelerated N-body simulations
N-body algorithms are applicable to a number of common problems in computational physics, including gravitation, electrostatics, and fluid dynamics. Fast algorithms (those with better than O(N^2) performance) exist, but have not been successfully implemented on GPU hardware for practical problems. In the present work, we introduce not only best-in-class performance for a multipole-accelerated treecode method, but also a series of improvements that support implementation of this solver on highly data-parallel graphics processing units (GPUs).
http://pdf.aiaa.org/preview/CDReadyMASM08_1065/PV2008_608.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1102_logo_AIAA_portal_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1102_logo_AIAA_portal_large.jpg
Research
Applied Scientific Research
2010
01
07
01/07/2010
Mark J. Stock
Adrin Gharakhani
Paper
Science
Mark J. Stock,Adrin Gharakhani
732a1a48-f8a5-40a0-ad2b-741fb822af91
Using GPU on HPC Applications to Satisfy Low-Power Computational Requirement
High-performance, low-power computing is required to reduce the computer infrastructure needed for large multi-physics calculations for reactive flow, high-resolution urban aerodynamics, deforming-geometry fluid dynamics, etc. If computer infrastructure and costs can be reduced sufficiently, highly accurate calculations currently being performed only in large computer centers can be moved to operational centers and even into the field.
http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_524.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1101_logo_AIAA_portal_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1101_logo_AIAA_portal_large.jpg
Research
Naval Research Laboratory
2010
01
04
01/04/2010
Gopal Patnaik
Keith S. Obenschain
Paper
Gopal Patnaik,Keith S. Obenschain
fdb72cda-8c6c-420f-bdf2-55a91c7427f5
An MPI-CUDA Implementation for Massively Parallel
Modern graphics processing units (GPUs) with many-core architectures have emerged as general-purpose parallel computing platforms that can accelerate simulation science applications tremendously. While multi-GPU workstations with several TeraFLOPS of peak computing power are available to accelerate computational problems, larger problems require even more resources. Conventional clusters of central processing units (CPU) are now being augmented with multiple GPUs in each compute-node to tackle large problems. The heterogeneous architecture of a multi-GPU cluster with a deep memory hierarchy creates unique challenges in developing scalable and efficient simulation codes.
http://pdf.aiaa.org/preview/2010/CDReadyMASM10_1812/PV2010_522.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1100_logo_AIAA_portal_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1100_logo_AIAA_portal_large.jpg
Boise State University
2010
01
04
01/04/2010
Dana Jacobsen
Julien Thibault
Inanc Senocak
Paper
Dana Jacobsen,Julien Thibault,Inanc Senocak
b2c13afd-3503-4af9-9a6f-273f3d7589dc
State-of-the-Art in Heterogeneous Computing
This extensive survey (33 pages, over 180 references) gives an overview of hardware and software tools for the Cell Broadband Engine, Graphics Processing Units, and Field Programmable Gate Arrays. A qualitative and quantitative comparison is also presented, together with a summary of state-of-the-art approaches to heterogeneous computing.
Computational Fluid Dynamics,Finance,Imaging,Numerics,Libraries,Oil & Gas,Programming Tools,Science
/content/cudazone/CUDABrowser/assets/images/applications/1099_star_heterocomp_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1099_star_heterocomp_large.jpg
Research
SINTEF ICT and Oak Ridge National Laboratory, Future Technologies Group
2010
05
04
05/04/2010
A. R. Brodtkorb/C. Dyken/T. R. Hagen/J. M. Hjelmervik/O. O. Storaasli
Paper
A. R. Brodtkorb,C. Dyken,T. R. Hagen, J. M. Hjelmervik and O. O. Storaasli,Andre.Brodtkorb@sintef.no
aa8b4f0c-1c87-4251-8fe9-1e035864ce0e
Simulation and Visualization of the Saint-Venant System using GPUs
This paper describes the efficient implementation of three second-order accurate explicit schemes that solve the shallow water equations. The implementation also supports real-time visualization with photorealistic effects.
/content/cudazone/CUDABrowser/assets/images/applications/1098_sw_2010_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1098_sw_2010_large.jpg
Research
SINTEF ICT
2010
02
28
02/28/2010
A. R. Brodtkorb/T. R. Hagen/K.-A. Lie/J. R. Natvig
Multimedia
Paper
Computational Fluid Dynamics
Graphics
Numerics
Science
A. R. Brodtkorb,T. R. Hagen,K.-A. Lie and J. R. Natvig,Andre.Brodtkorb@sintef.no
d105ce4e-64c5-4075-8ad7-ba5a026484d8
Kappa
The primary goal of Kappa is to allow for the creation of sophisticated, powerful, and complex processing that retains simple and easy-to-use interfaces. Kappa provides for creating processes with dynamic sizing, scheduling, and interactive execution for C and CUDA kernels to process data efficiently using the available resources. Kappa provides a library for creating processes that use combinations of CPUs and a GPU for tasks. Within a single host program process, a Kappa process can be created for each CUDA GPU, so that all GPUs are used. Each Kappa process can use all of the multiprocessors of its GPU, share all of the CPUs of the host system, have its own separate namespace, and have its own separate CUDA context.
/content/cudazone/CUDABrowser/assets/images/applications/1097_psilambdakappa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1097_psilambdakappa_large.png
Commercial
Psi Lambda LLC
http://psilambda.com
2010
05
03
05/03/2010
Commercial
Psi Lambda LLC
Application
Libraries
Programming Tools
Psi Lambda LLC,kappa@psilambda.com
8ecf8eac-dcc4-4770-ba77-c7b68b2ec6e6
Cusp: A sparse matrix library for CUDA
Cusp is a library for sparse linear algebra and graph computations on CUDA. Cusp provides a flexible, high-level interface for manipulating sparse matrices and solving sparse linear systems.
/content/cudazone/CUDABrowser/assets/images/applications/1096_cusp_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1096_cusp_logo_large.png
Research
NVIDIA Research
http://research.nvidia.com/
2010
05
04
05/04/2010
Open source
Nathan Bell
Michael Garland
Code
Numerics
Libraries
Nathan Bell,Michael Garland,nbell@gmail.com
4b17cf89-a211-4fcf-8c9e-32e04272529f
Performance and Scalability of GPU-Based Convolutional Neural Networks
In this paper we present the implementation of a framework for accelerating training and classification of arbitrary Convolutional Neural Networks (CNNs) on the GPU. CNNs are a derivative of standard Multilayer Perceptron (MLP) neural networks optimized for two-dimensional pattern recognition problems such as Optical Character Recognition (OCR) or face detection. We describe the basic parts of a CNN and demonstrate the performance and scalability improvement that can be achieved by shifting the computation-intensive tasks of a CNN to the GPU. Depending on the network topology, training and classification on the GPU run 2 to 24 times faster than on the CPU. Furthermore, the GPU version scales much better than the CPU implementation with respect to network size.
/content/cudazone/CUDABrowser/assets/images/applications/1095_lenet5_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1095_lenet5_large.png
Academia
Distributed and Parallel Systems Group, Institute of Computer Science, University of Innsbruck
http://www.dps.uibk.ac.at
2010
02
18
02/18/2010
24
Open source
Daniel Strigl
Klaus Kofler
Stefan Podlipnig
Paper
Code
Imaging
Numerics
Science
Machine Learning
Daniel Strigl,Klaus Kofler,Stefan Podlipnig,daniel.strigl@student.uibk.ac.at, klaus.kofler@student.uibk.ac.at,stefan.podlipnig@uibk.ac.at
bbc802a8-2ae5-449d-b4c7-059fe0daa3a2
AntiPlanet Reflections
AntiPlanet Reflections is a first-person "Doom"-style 3D shooter game set in a fantastic extraterrestrial world built of spheres, shadows and infinite reflections. AntiPlanet scenes are fully dynamic, with moving objects and light sources. AntiPlanet uses ray tracing for visualization, implemented in CUDA. The GT200 architecture performs about 15 times faster than an ordinary dual-core CPU, and Fermi performs about 45 times faster.
/content/cudazone/CUDABrowser/assets/images/applications/1094_sphericalflowers_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1094_sphericalflowers_large.jpg
Research
virtualray.ru
http://www.virtualray.ru
2010
05
06
05/06/2010
30
Commercial
Lev Dymchenko
Application
Graphics
Ray Tracing
antiplanet real time ray tracing spherical reflections,Lev Dymchenko,lev@virtualray.ru
e5354ea1-1f61-4d80-b50d-f4c7d07355b3
Acceleration of the Smith-Waterman Algorithm using Single and Multiple Graphics Processors
Finding regions of similarity between two very long data streams is a computationally intensive problem referred to as sequence alignment. Alignment algorithms must allow for imperfect sequence matching with different starting locations and some gaps and errors between the two data sequences. Perhaps the best-known application of sequence matching is the testing of DNA or protein sequences against genome databases. The Smith-Waterman algorithm is a method for precisely characterizing how well two sequences can be aligned and for determining the optimal alignment of those two sequences.
http://www.ecs.umass.edu/mie/faculty/perot/Programs.htm
/content/cudazone/CUDABrowser/assets/images/applications/1093_S-W_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1093_S-W_large.png
Academia
University of Massachusetts, Amherst
http://www.umass.edu/
2010
05
13
05/13/2010
45
Open source
Ali Khajeh-Saeed
Stephen Poole
J. Blair Perot
Paper
Code
Life Sciences
Ali Khajeh-Saeed,Stephen Poole,J. Blair Perot,khajehsaeed@ecs.umass.edu,perot@ecs.umass.edu
ad98b4cb-6676-46c7-8cdb-c5288f4ab6b0
Nifty_reg
Global and local medical image registration using CUDA. The global alignment is based on a block-matching technique and the local warping on a cubic B-spline deformation model.
/content/cudazone/CUDABrowser/assets/images/applications/1092_nifty_reg_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1092_nifty_reg_logo_large.png
Academia
CMIC - University College London
http://cmic.cs.ucl.ac.uk/
2009
09
18
09/18/2009
Open source
Marc Modat
Pankaj Daga
Sebastien Ourselin
Paper
Code
Medical Imaging
Marc Modat,marc.modat@gmail.com
38b7a4f8-004f-452b-b5dd-c7bc77b6fca3
Massively parallel Linux laptops, workstations and clusters with CUDA
Unleash the GPU within!
/content/cudazone/CUDABrowser/assets/images/applications/1090_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1090_logo_acm_portal2_large.jpg
2008
11
01
11/01/2008
Robert Farber
Paper
Robert Farber
b7bcb258-0b02-4468-ad5c-217f93df94fe
Multi-target C++ implementation of parallel skeletons
This paper presents the design of an efficient multi-target (CPU+GPU) implementation of the Parallel_for skeleton. Emerging massively parallel architectures promise very high performance at low cost. However, these architectures change faster than ever, so optimizing code becomes a very complex and time-consuming task. We have identified data storage as the main difference between the CPU and the GPU implementation of a code. We introduce an abstract data layout in order to adapt the data storage. Based on this layout, the Parallel_for skeleton allows the same program to be compiled and executed on both CPU and GPU. Once compiled, the program runs close to the hardware limits.
/content/cudazone/CUDABrowser/assets/images/applications/1089_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1089_logo_acm_portal2_large.jpg
Research
EDF R&D
2009
07
01
07/01/2009
Wilfried Kirschenmann
Laurent Plagne
Stephane Vialle
Paper
Wilfried Kirschenmann,Laurent Plagne,Stephane Vialle
ed102912-3ed0-4296-bb43-f1e56474388a
Triangular matrix inversion on Graphics Processing Unit
Dense matrix inversion is a basic procedure in many linear algebra algorithms. A computationally arduous step in most dense matrix inversion methods is the inversion of triangular matrices as produced by factorization methods such as LU decomposition. In this paper, we demonstrate how triangular matrix inversion (TMI) can be accelerated considerably by using commercial Graphics Processing Units (GPUs) in a standard PC. Our implementation is based on a recursive divide-and-conquer TMI algorithm, efficiently adapted to the GPU architecture.
http://portal.acm.org/citation.cfm?id=1654059.1654069&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1088_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1088_logo_acm_portal2_large.jpg
Academia
University of Bologna
2009
11
01
11/01/2009
Florian Ries
Tommaso De Marco
Matteo Zivieri
Roberto Guerrieri
Paper
Florian Ries,Tommaso De Marco,Matteo Zivieri
372d3c3e-bd11-424d-80f2-f54341867325
Fast heterogeneous computing with CUDA compatible Tesla GPU computing processor (personal supercomputing)
This paper presents how fast heterogeneous computing can be achieved with the Tesla GPU computing processor. The Tesla GPU brings the performance of a cluster to a workstation, turning it into a supercomputer. We have chosen the field of molecular dynamics to demonstrate fast, high-performance computing with the Tesla GPU. We present a DCS (direct Coulomb summation) algorithm for computing electrostatic fields around molecules with CUDA. The Tesla GPU speeds up the molecular dynamics application by up to 240X.
http://portal.acm.org/citation.cfm?id=1741906.1742121&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1085_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1085_logo_acm_portal2_large.jpg
Academia
Aligarh Muslim University
2010
02
01
02/01/2010
D. Kumar
M. A. Qadeer
Paper
D. Kumar,M. A. Qadeer
51b44143-4a88-4a74-96de-e899a53cb6d2
A configurable simulation environment for the efficient simulation of large-scale spiking neural networks on graphics processors
Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for various neural engineering applications. Spiking Neural Networks (SNNs) have traditionally been simulated on large-scale clusters, supercomputers, or dedicated hardware architectures. Alternatively, Compute Unified Device Architecture (CUDA) Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this paper we demonstrate an efficient, biologically realistic, large-scale SNN simulator that runs on a single GPU. The SNN model includes Izhikevich spiking neurons, detailed models of synaptic plasticity and variable axonal delay. We allow user-defined configuration of the GPU-SNN model by means of a high-level programming interface written in C++ but similar to the PyNN programming interface specification.
http://portal.acm.org/citation.cfm?id=1594405.1594470&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1084_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1084_logo_acm_portal2_large.jpg
Academia
University of California Irvine
2009
07
01
07/01/2009
Jayram Moorkanikara Nageswaran
Nikil Dutt
Jeffrey L. Krichmar
Alex Nicolau
Alexander V. Veidenbaum
Paper
Jayram Moorkanikara Nageswaran,Nikil Dutt,Jeffrey L. Krichmar
ff32502d-f153-4df5-9007-fe61af6560c1
Boids that see: Using self-occlusion for simulating large groups on GPUs
Behavioral models have been used in the entertainment industry to increase the realism in the simulation of large groups of individuals. Unfortunately, the classical models can be very compute-intensive when very large groups are considered, reducing their applicability in games and other interactive systems. In this article we explore both search space reduction and parallelism to improve the performance of Reynolds' Boids model. We propose a methodology that considers self-occlusion (visibility culling) to reduce the number of neighbors, and we take advantage of the parallelism present in common graphics processing units (GPUs) to allow the simulation of very large groups. We developed different GPU implementations (GPGPU and CUDA); the results show that visibility culling allows significant gains in performance without affecting the model's overall behavior.
/content/cudazone/CUDABrowser/assets/images/applications/1083_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1083_logo_acm_portal2_large.jpg
Academia
Universidade Federal de Minas Gerais
2009
12
01
12/01/2009
Alessandro Ribeiro Da Silva
Wallace Santos Lages
Luiz Chaimowicz
Paper
Alessandro Ribeiro Da Silva,Wallace Santos Lages,Luiz Chaimowicz
6d68818f-bbb6-4550-a857-32e87a7c5c86
Using common graphics hardware for multi-agent traffic simulation with CUDA
Today's graphics processing units (GPUs) have tremendous resources when it comes to raw computing power. The simulation of large groups of agents in transport simulation places a huge demand on computation time. Therefore it seems reasonable to try to harvest this computing power for traffic simulation. Unfortunately, simulating a traffic network is inherently connected with random memory access. This is not a domain that the SIMD (single instruction, multiple data) architecture of GPUs is known to work well with. In this paper the authors try to achieve a speedup by computing multi-agent traffic simulations on the graphics device using NVIDIA's CUDA framework.
/content/cudazone/CUDABrowser/assets/images/applications/1081_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1081_logo_acm_portal2_large.jpg
Academia
TU Berlin
2009
03
01
03/01/2009
David Strippgen
Kai Nagel
Paper
David Strippgen,Kai Nagel
2bedb9a8-538c-458a-8014-36e5d8b1d4dc
Taming irregular EDA applications on GPUs
Recently, general-purpose computing on graphics processing units (GPUs) has been rising as an exciting new trend in high-performance computing. Thus it is appealing to study the potential of GPUs for Electronic Design Automation (EDA) applications. However, EDA generally involves irregular data structures such as sparse matrix and graph operations, which pose significant challenges for efficient GPU implementations. In this paper, we propose high-performance GPU implementations for two important irregular EDA computing patterns, Sparse-Matrix Vector Product (SMVP) and graph traversal. On a wide range of EDA problem instances, our SMVP implementations outperform all published work and achieve a speedup of one order of magnitude over the CPU baseline.
http://portal.acm.org/citation.cfm?id=1687399.1687501&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1080_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1080_logo_acm_portal2_large.jpg
Academia
Tsinghua University
2009
11
01
11/01/2009
Yangdong Deng
Bo David Wang
Shuai Mu
Paper
Yangdong Deng,Bo David Wang,Shuai Mu
e601603a-6688-43d2-b74b-2febe1d8dafc
CUDA renderer: a programmable graphics pipeline
Modern GPUs have provided gradually increasing programmability in the vertex shader, geometry shader and fragment shader over the past decade. However, many classical problems, such as order-independent transparency (OIT) and occlusion culling, have not yet been efficiently solved using the traditional graphics pipeline. The main reason is that the behavior of the current stage of the pipeline is hard to determine because of the unpredictable future data. Since the rasterization and blending stages are still largely fixed functions on chip, previous improvements on these problems always require hardware modifications and thus remain on the theoretical level.
http://portal.acm.org/citation.cfm?id=1667146.1667189&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1078_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1078_logo_acm_portal2_large.jpg
Academia
Chinese Academy of Sciences
2009
12
01
12/01/2009
Fan Liu
Meng-Cheng Huang
Xue-Hui Liu
Paper
Fan Liu,Meng-Cheng Huang,Xue-Hui Liu
e88d024a-05b4-4bba-8641-5fbd42154978
42 TFlops hierarchical N-body simulations on GPUs with applications in both astrophysics and turbulence
As an entry for the 2009 Gordon Bell price/performance prize, we present the results of two different hierarchical N-body simulations on a cluster of 256 graphics processing units (GPUs). Unlike many previous N-body simulations on GPUs that scale as O(N^2), the present method calculates the O(N log N) treecode and O(N) fast multipole method (FMM) on the GPUs with unprecedented efficiency. We demonstrate the performance of our method by choosing one standard application --a gravitational N-body simulation-- and one non-standard application --simulation of turbulence using vortex particles. The gravitational simulation using the treecode with 1,608,044,129 particles showed a sustained performance of 42.15 TFlops.
http://portal.acm.org/citation.cfm?id=1654059.1654123&coll=Portal&dl=GUIDE&CFID=88127131&CFTOKEN=72486951
/content/cudazone/CUDABrowser/assets/images/applications/1075_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1075_logo_acm_portal2_large.jpg
Academia
Nagasaki University
2009
11
01
11/01/2009
Tsuyoshi Hamada
Tetsu Narumi
Rio Yokota
Paper
Tsuyoshi Hamada,Tetsu Narumi,Rio Yokota
e78b9ca3-7ebe-4fdb-98fe-c2ea6b67a9d5
LBM based flow simulation using GPU computing processor
Graphics Processing Units (GPUs), originally developed for computer games, now provide computational power for scientific applications. In this paper, we develop a general-purpose Lattice Boltzmann code that runs entirely on a single GPU. The results show that: (1) single-precision floating-point arithmetic is sufficient for LBM computation in comparison to double precision; (2) the implementation of LBM on GPUs allows us to achieve up to about one billion lattice updates per second using single-precision floating point; (3) GPUs provide an inexpensive alternative to large clusters for fluid dynamics prediction.
/content/cudazone/CUDABrowser/assets/images/applications/1074_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1074_logo_acm_portal2_large.jpg
Academia
Universite de Lyon
2010
04
01
04/01/2010
Frederic Kuznik
Christian Obrecht
Gilles Rusaouen
Paper
Frederic Kuznik,Christian Obrecht,Gilles Rusaouen
7619b5ce-3009-4e36-b291-af64ec7413fb
Parallel GPU-based data-dependent triangulations
In this paper we introduce a new technique for data-dependent triangulation which is suitable for implementation on a GPU. Our solution is based on a new parallel version of the well-known Lawson optimization process and is fully compatible with the restrictions of the GPU hardware. We test and compare the quality of our solution in an image reconstruction problem. In comparison with the standard implementations, we achieve a significant speed-up (eight times on average) with comparable quality of the reconstructed image. Further, several other improvements and optimizations are introduced and tested, and the results are discussed in detail.
/content/cudazone/CUDABrowser/assets/images/applications/1073_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1073_logo_acm_portal2_large.jpg
Comenius University
2010
04
01
04/01/2010
Michal Cervenansky
Zsolt Toth
Juraj Starinsky
Paper
Michal Cervenansky,Zsolt Toth,Juraj Starinsky
80347ce7-cbf9-4e15-a927-5bab420dff15
Fault Table Computation on GPUs
In this paper, we explore the implementation of fault table generation on a Graphics Processing Unit (GPU). A fault table is essential for fault diagnosis and fault detection in VLSI testing and debug. Generating a fault table requires extensive fault simulation, with no fault dropping, and is extremely expensive from a computational standpoint. Fault simulation is inherently parallelizable, and the large number of threads that a GPU can operate on in parallel can be employed to accelerate fault simulation, and thereby accelerate fault table generation. Our approach, called GFTABLE, employs a pattern-parallel approach which utilizes both bit-parallelism and thread-level parallelism. Our implementation is a significantly modified version of FSIM, which is a pattern-parallel fault simulation approach for single-core processors.
http://portal.acm.org/citation.cfm?id=1773593.1773611&coll=Portal&dl=GUIDE&CFID=88119154&CFTOKEN=11832401
/content/cudazone/CUDABrowser/assets/images/applications/1072_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1072_logo_acm_portal2_large.jpg
Academia
Texas A and M University
2010
04
01
04/01/2010
Kanupriya Gulati
Sunil P. Khatri
Paper
Kanupriya Gulati,Sunil P. Khatri
8c78d222-202b-4a63-8202-cdb88bbb1977
Small-Ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization
We explore the intersection between an emerging class of architectures and a prominent workload: GPGPUs (General-Purpose Graphics Processing Units) and regular expression matching, respectively. It is a challenging task because this workload with its irregular, non-coalesceable memory access patterns is very different from the regular, numerical workloads that run efficiently on GPGPUs.
http://domino.research.ibm.com/comm/research_people.nsf/pages/scarpazza.pubs.html/$FILE/2010-06-ICS-scarpazza.pdf
/content/cudazone/CUDABrowser/assets/images/applications/1067_small-ruleset_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1067_small-ruleset_large.png
Academia
IBM T.J. Watson Research Center / Technische Universitat Braunschweig
2010
01
01
01/01/2010
Jamin Naghmouchi
Daniele Paolo Scarpazza
Mladen Berekovic
Paper
Science
Jamin Naghmouchi,Daniele Paolo Scarpazza,Mladen Berekovic,j.naghmouchi@us.ibm.com,dpscarpazza@us.ibm.com,berekovic@ida.ing.tu-bs.de
77cbd1bb-51f1-4571-82f2-33999f0dd072
Modeling GPU-CPU workloads and systems
Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high-performance application for a very specific and well-documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms.
http://portal.acm.org/citation.cfm?id=1735688.1735696&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1066_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1066_logo_acm_portal2_large.jpg
Academia
Georgia Institute of Technology
2010
03
01
03/01/2010
Andrew Kerr
Gregory Diamos
Sudhakar Yalamanchili
Paper
Andrew Kerr,Gregory Diamos,Sudhakar Yalamanchili
658e9e38-21dd-4f5f-9a1a-0da0b5d8df4d
Cortical architectures on a GPGPU
As the number of devices available per chip continues to increase, the computational potential of future computer architectures grows likewise. While this is a clear benefit for future computing devices, future chips will also likely suffer from more faulty devices and increased power consumption. It is also likely that these chips will be difficult to program if the current trend of adding more parallel cores continues. However, recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility.
http://portal.acm.org/citation.cfm?id=1735688.1735693&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1065_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1065_logo_acm_portal2_large.jpg
Academia
University of Wisconsin Madison
2010
03
01
03/01/2010
Andrew Nere
Mikko Lipasti
Paper
Andrew Nere,Mikko Lipasti
763cc86a-ccf2-4b2b-bd42-276c55734d18
A simulation of large-scale groundwater flow on CUDA-enabled GPUs
This paper presents a simulation method for large-scale groundwater flow on CUDA-enabled GPUs. The discretization method for a three-dimensional groundwater flow equation is introduced. The discretized equation is solved with the preconditioned Conjugate Gradient algorithm, and implementation methods for the sparse matrix-vector multiplication and the vector inner product on CUDA-enabled GPUs are given. The experimental results show that GPUs can speed up the groundwater simulation significantly.
/content/cudazone/CUDABrowser/assets/images/applications/1064_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1064_logo_acm_portal2_large.jpg
Academia
China University of Geosciences
2010
03
01
03/01/2010
Xiaohui Ji
Tangpei Cheng
Qun Wang
Paper
Science
Xiaohui Ji,Tangpei Cheng,Qun Wang
83d40bdb-c97a-4e07-ad48-646ada77716b
A symbolic verifier for CUDA programs
We present a preliminary automated verifier based on mechanical decision procedures which is able to prove functional correctness of CUDA programs and is guaranteed to detect bugs such as race conditions. We also employ a symbolic partial order reduction (POR) technique to mitigate the interleaving explosion problem.
/content/cudazone/CUDABrowser/assets/images/applications/1063_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1063_logo_acm_portal2_large.jpg
Academia
University of Utah
2010
01
01
01/01/2010
Guodong Li
Ganesh Gopalakrishnan
Robert Kirby
Paper
Guodong Li,Ganesh Gopalakrishnan,Robert Kirby
2fc58876-4ace-47d4-b181-1164b718426b
A breadth-first course in multicore and manycore programming
The technique of scaling hardware performance through increasing the number of cores on a chip requires programmers to learn to write parallel code that can exploit this hardware. In order to expose students to a variety of multicore programming models, our university offered a breadth-first introduction to multicore and manycore programming for upper-level undergraduates.
http://portal.acm.org/citation.cfm?id=1734263.1734339&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1062_cover_thumbbreadth-first_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1062_cover_thumbbreadth-first_large.jpg
Academia
Sonoma State University
2010
03
01
03/01/2010
Suzanne Rivoire
Paper
Suzanne Rivoire
84257f90-4151-49ba-96a8-86a45d23e3be
CUDAlign: using GPU to accelerate the comparison of megabase genomic sequences
Biological sequence comparison is a very important operation in Bioinformatics. Even though exact methods to compare biological sequences exist, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Nevertheless, all of them restrict the size of the smallest sequence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUDAlign, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested on two GPU boards, separately. For real sequences whose sizes range from 1MBP (Megabase Pairs) to 47MBP, a close to uniform GCUPS (Giga Cells Updates per Second) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on a GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes are compared with an exact method.
http://portal.acm.org/citation.cfm?id=1693453.1693473&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1060_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1060_logo_acm_portal2_large.jpg
Academia
University of Brasilia
2010
01
01
01/01/2010
Edans Flavius O. Sandes
Alba Cristina M.A. de Melo
Paper
Edans Flavius O. Sandes,Alba Cristina M.A. de Melo
d2d92ca7-3344-4f47-9039-c3a203ed52cc
Accelerating MATLAB Image Processing Toolbox functions on GPUs
In this paper, we present our effort in developing an open-source GPU (graphics processing unit) code library for the MATLAB Image Processing Toolbox (IPT). We ported a dozen representative functions from IPT and, based on their inherent characteristics, grouped these functions into four categories: data independent, data sharing, algorithm dependent and data dependent. For each category, we present a detailed case study, which reveals interesting insights on how to efficiently optimize the code for GPUs and highlights performance-critical hardware features, some of which have not been well explored in existing literature. Our results show drastic speedups for the functions in the data-independent or data-sharing category by leveraging hardware support judiciously, and moderate speedups for those in the algorithm-dependent category by careful algorithm selection and parallelization. For the functions in the last category, fine-grain synchronization and data-dependency requirements are the main obstacles to an efficient implementation on GPUs.
http://portal.acm.org/citation.cfm?id=1735688.1735703&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1058_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1058_logo_acm_portal2_large.jpg
Academia
University of Central Florida
2010
03
01
03/01/2010
Jingfei Kong
Martin Dimitrov
Yi Yang
Paper
Jingfei Kong,Martin Dimitrov,Yi Yang
be8a567d-256c-453b-b4f5-1112db68abcc
The Scalable Heterogeneous Computing (SHOC) benchmark suite
Scalable heterogeneous computing systems, which are composed of a mix of compute devices, such as commodity multicore processors, graphics processors, reconfigurable processors, and others, are gaining attention as one approach to continuing performance improvement while managing the new challenge of energy efficiency. As these systems become more common, it is important to be able to compare and contrast architectural designs and programming systems in a fair and open forum. To this end, we have designed the Scalable HeterOgeneous Computing benchmark suite (SHOC). SHOC's initial focus is on systems containing graphics processing units (GPUs) and multi-core processors, and on the new OpenCL programming standard.
http://portal.acm.org/citation.cfm?id=1735688.1735702&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1057_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1057_logo_acm_portal2_large.jpg
Academia
University of Tennessee
2010
03
01
03/01/2010
Anthony Danalis
Gabriel Marin
Collin McCurdy
Paper
Anthony Danalis,Gabriel Marin,Collin McCurdy
0b3ddfaa-9b4f-47d2-8f15-c7901994cd34
FreePipe: a programmable parallel rendering architecture for efficient multi-fragment effects
In the past decade, modern GPUs have provided increasing programmability with vertex, geometry and fragment shaders. However, many classical problems have not been efficiently solved using the current graphics pipeline, where some stages are still fixed functions on chip. In particular, multi-fragment effects, especially order-independent transparency, require programmability of the blending stage, which makes them difficult to solve in a single geometry pass. In this paper we present FreePipe, a system for programmable parallel rendering that can run entirely on current graphics hardware and has performance comparable with the traditional graphics pipeline. Within this framework, two schemes for the efficient rendering of multi-fragment effects in a single geometry pass have been developed by exploiting CUDA atomic operations. Both schemes have achieved significant speedups compared to the state-of-the-art methods that are based on traditional graphics pipelines.
http://portal.acm.org/citation.cfm?id=1730804.1730817&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1056_cover_thumbfreepipe_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1056_cover_thumbfreepipe_large.jpg
Academia
Chinese Academy of Sciences
2010
02
01
02/01/2010
Fang Liu
Meng-Cheng Huang
Xue-Hui Liu
Paper
Fang Liu,Meng-Cheng Huang,Xue-Hui Liu
bfecf3ec-363f-460f-b5f2-d4bc7599ca9e
Accelerating the local outlier factor algorithm on a GPU for intrusion detection systems
The Local Outlier Factor (LOF) is a very powerful anomaly detection method available in machine learning and classification. The algorithm defines the notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned an LOF which represents the likelihood of that object being an outlier. Although this concept of a local outlier is a useful one, the computation of LOF values for every data object requires a large number of k-nearest neighbor queries, and this computational overhead can limit the use of LOF.
http://portal.acm.org/citation.cfm?id=1735688.1735707&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1055_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1055_logo_acm_portal2_large.jpg
Academia
Northeastern University
2010
03
01
03/01/2010
Malak Alshawabkeh
Byunghyun Jang
David Kaeli
Paper
Malak Alshawabkeh,Byunghyun Jang,David Kaeli
1e349053-7440-46a2-9958-5f49940f6846
An asymmetric distributed shared memory model for heterogeneous parallel systems
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous computing rely on programmers to explicitly manage data transfers between the CPU system memory and accelerator memory.
http://portal.acm.org/citation.cfm?id=1736020.1736059&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1054_Asplos XV_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1054_Asplos XV_large.jpg
Academia
Universitat Politecnica de Catalunya
2010
03
01
03/01/2010
Isaac Gelado
John E. Stone
Javier Cabezas
Paper
Isaac Gelado,John E. Stone,Javier Cabezas
0944eab1-8c6f-4ba4-a258-14912057a27e
Teaching design & analysis of multi-core parallel algorithms using CUDA
One of the dominant trends in microprocessor architecture in recent years is continually increasing chip-level parallelism. However, many undergraduate curriculums, especially at small schools, do not offer courses that focus on the design and analysis of multi-threaded algorithms for multi-core processors. The courses that are offered address the theoretical aspects of parallel system design, but often fail to provide students with the opportunity to develop and evaluate distributed applications in real-world environments. As a result, undergraduate students are not as prepared as they should be for graduate study or careers in industry.
http://portal.acm.org/citation.cfm?id=1734797.1734800&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1053_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1053_logo_acm_portal2_large.jpg
Lamar University
2010
04
01
04/01/2010
Quoc-Nam Tran
Paper
Quoc-Nam Tran
422d4865-5b1d-4be0-82d2-fc9a795befe8
High-performance cone beam reconstruction using CUDA compatible GPUs
Compute unified device architecture (CUDA) is a software development platform that allows us to run C-like programs on the NVIDIA graphics processing unit (GPU). This paper presents an acceleration method for cone beam reconstruction using CUDA compatible GPUs. The proposed method accelerates the Feldkamp, Davis, and Kress (FDK) algorithm using three techniques: (1) off-chip memory access reduction for saving the memory bandwidth; (2) loop unrolling for hiding the memory latency; and (3) multithreading for exploiting multiple GPUs. We describe how these techniques can be incorporated into the reconstruction code.
http://portal.acm.org/citation.cfm?id=1750592.1750768&coll=GUIDE&dl=GUIDE&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1051_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1051_logo_acm_portal2_large.jpg
Academia
Graduate School of Information Science and Technology
2010
02
01
02/01/2010
24
Yusuke Okitsu
Fumihiko Ino
Kenichi Hagihara
Paper
Science
Yusuke Okitsu,Fumihiko Ino,Kenichi Hagihara
6e20e119-f8dd-4ca4-b7f3-8afc56248978
Computational visual attention systems and their cognitive foundations: A survey
Based on concepts of the human visual system, computational visual attention systems aim to detect regions of interest in images. Psychologists, neurobiologists, and computer scientists have investigated visual attention thoroughly during the last decades and profited considerably from each other. However, the interdisciplinarity of the topic holds not only benefits but also difficulties: Concepts of other fields are usually hard to access due to differences in vocabulary and lack of knowledge of the relevant literature.
http://portal.acm.org/citation.cfm?id=1658349.1658355&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1047_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1047_logo_acm_portal2_large.jpg
Academia
Rheinische Friedrich-Wilhelms Universitat
http://www3.uni-bonn.de/
2010
01
01
01/01/2010
Simone Frintrop
Erich Rome
Henrik I. Christensen
Paper
Life Sciences
Simone Frintrop,Erich Rome,Henrik I. Christensen
da549e23-701a-44e5-b861-5f34fcfb1b47
Iterative induced dipoles computation for molecular mechanics on GPUs
In this work, we present a first step towards the efficient implementation of polarizable molecular mechanics force fields with GPU acceleration. The computational bottleneck of such applications is found in the treatment of electrostatics, where higher-order multipoles and a self-consistent treatment of polarization effects are needed. We have coded these sections, for the case of a non-periodic simulation, with the CUDA programming model. Results show a speedup factor of 21 for a single precision GPU implementation compared to the serial CPU version. A discussion of the optimization and parameterization steps is included. A comparison between different graphics cards and a shared memory parallel CPU implementation is also given. The current work demonstrates the potential usefulness of GPU programming in accelerating this field of applications.
http://portal.acm.org/citation.cfm?id=1735688.1735708&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1046_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1046_logo_acm_portal2_large.jpg
Research
INESC-ID Lisboa
http://www.gsd.inesc-id.pt/
2010
03
01
03/01/2010
21
Frederico Pratas
Ricardo Mata
Leonel Sousa
Paper
Science
Frederico Pratas,Ricardo Mata,Leonel Sousa
cc358d86-c5ab-4c38-ad0a-21fefd4037d4
Exploring NVIDIA-CUDA for video coding
http://portal.acm.org/citation.cfm?id=1730836.1730839&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1045_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1045_logo_acm_portal2_large.jpg
Academia
Florida Atlantic University
www.fau.edu
2010
02
01
02/01/2010
100
Aleksandar Colic
Hari Kalva
Borko Furht
Paper
Video and Audio
Aleksandar Colic,Hari Kalva,Borko Furht
6067f8f4-55fe-430b-8ca9-2a857b264b33
Thermal analysis of multiprocessor SoC applications by simulation and verification
Overheating of computer chips leads to degradation of performance and reliability. Therefore, preventing chips from overheating in spite of increased performance requirements has emerged as a major challenge. Since the cost of cooling has been rising steadily, various architecture and application design techniques are used to prevent chip overheating. Temperature-aware task scheduling has emerged as an important application design methodology for addressing this problem in multiprocessor SoC systems.
http://portal.acm.org/citation.cfm?id=1698759.1698765&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1042_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1042_logo_acm_portal2_large.jpg
Academia
Indian Institute of Technology
http://www.iitkgp.ac.in/
2010
02
01
02/01/2010
Dipankar Das
P. P. Chakrabarti
Rajeev Kumar
Paper
Science
Dipankar Das,P. P. Chakrabarti,Rajeev Kumar
bc4ed1f4-eb4d-47bf-8564-801f90e87a6b
A mapping path for multi-GPGPU accelerated computers from a portable high level programming abstraction
Programmers for GPGPU face a rapidly changing substrate of programming abstractions, execution models, and hardware implementations. It has been established, through numerous demonstrations for particular conjunctions of application kernel, programming language, and GPU hardware instance, that it is possible to achieve significant improvements in price/performance and energy/performance over general purpose processors. But these demonstrations are each the result of significant dedicated programmer labor, which is likely to be duplicated for each new GPU hardware architecture to achieve performance portability.
http://portal.acm.org/citation.cfm?id=1735688.1735698&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1041_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1041_logo_acm_portal2_large.jpg
Research
Reservoir Labs
https://www.reservoir.com/
2010
03
01
03/01/2010
Allen Leung
Nicolas Vasilache
Benoit Meister
Paper
Science
Allen Leung,Nicolas Vasilache,Benoit Meister
6fdd88c3-83c9-49e8-a50b-0d6ccd284b6b
Performance analysis of accelerated image registration using GPGPU
This paper presents a performance analysis of an accelerated 2-D rigid image registration implementation that employs the Compute Unified Device Architecture (CUDA) programming environment to take advantage of the parallel processing capabilities of NVIDIA's Tesla C870 GPU. We explain the underlying structure of the GPU implementation and compare its performance and accuracy against a fast CPU-based implementation.
http://portal.acm.org/citation.cfm?id=1513895.1513900&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1037_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1037_logo_acm_portal2_large.jpg
Academia
University of Notre Dame
http://www.nd.edu/
2009
03
08
03/08/2009
Peter Bui
Jay Brockman
Paper
Medical Imaging
CUDA, GPGPU, image registration, performance analysis,Peter Bui,Jay Brockman
e3ebcf2d-15c3-41b0-b4a1-6191ac30209b
Design and implementation of the software architecture for a 3-D reconstruction system in medical imaging
The design and implementation of the reconstruction system in medical X-ray imaging is a challenging issue due to its immense computational demands. In order to ensure an efficient clinical workflow, it is essential to meet high performance requirements. Hence, the usage of hardware acceleration is mandatory. The software architecture of the reconstruction system is required to be modular in the sense that different accelerator hardware platforms are supported, and it must be possible to implement different parts of the algorithm using different acceleration architectures and techniques.
http://portal.acm.org/citation.cfm?id=1368088.1368181&coll=GUIDE&dl=ACM&CFID=88441459&CFTOKEN=90295264
/content/cudazone/CUDABrowser/assets/images/applications/1036_logo_acm_portal2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1036_logo_acm_portal2_large.jpg
Academia
University of Erlangen-Nuremberg
http://www.uni-erlangen.org/
2008
05
18
05/18/2008
Holger Scherl
Stefan Hoppe
Markus Kowarschik
Paper
Medical Imaging
Holger Scherl,Stefan Hoppe,Markus Kowarschik
1df75b6e-b484-4233-8742-0cc83e34dfc8
Porting a high-order finite-element earthquake modeling application to NVIDIA graphics cards using CUDA
We port a high-order finite-element application that performs the numerical simulation of seismic wave propagation resulting from earthquakes in the Earth on NVIDIA GeForce 8800 GTX and GTX 280 graphics cards using CUDA. This application runs in single precision and is therefore a good candidate for implementation on current GPU hardware, which either does not support double precision or supports it but at the cost of reduced performance.
http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WKJ-4VHSDG0-1&_user=10&_coverDate=05%2F31%2F2009&_alid=1315357454&_rdoc=3&_fmt=high&_orig=search&_cdi=6908&_sort=r&_docanchor=&view=c&_ct=791&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=cf0dc7f5e75630ed58513933e89e2835
/content/cudazone/CUDABrowser/assets/images/applications/1033_sciencedirect_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1033_sciencedirect_large.png
Academia
Universite de Pau et des Pays de l'Adour
2009
05
01
05/01/2009
25
Dimitri Komatitsch
David Michea
Gordon Erlebacher
Paper
Science
Dimitri Komatitsch,David Michea,Gordon Erlebacher
6ae2dac9-06ab-4893-b7a0-89da5330df36
Realtime free surface fluid simulation and visualization
Implementation of a free surface fluid simulation and visualization using the Lattice Boltzmann method. OpenCL 1.0 is used for the fluid simulation and free surface handling while OpenGL is used for visualization of the refractions and caustics.
/content/cudazone/CUDABrowser/assets/images/applications/1032_2010_04_05_free_beer_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1032_2010_04_05_free_beer_large.jpg
CUDA Developer
2010
06
05
06/05/2010
Open Source
Martin Schreiber
Application
Multimedia
Code
Computational Fluid Dynamics
Game Physics
Graphics
Numerics
Science
Martin Schreiber,schreiberx@gmail.com
3bbb04a8-19f7-4823-9e08-1d712a548449
Palo GPU Business Intelligence
Online Analytical Processing (OLAP) is a core technology in Business Intelligence and Corporate Performance Management, allowing users to navigate and explore corporate data (usually extracted from a data warehouse) and to roll up or drill down along different hierarchical levels. Palo enables OLAP reporting and analysis directly inside Excel spreadsheets. The GPU speeds up multidimensional aggregation queries for real-time interactive analyses.
/content/cudazone/CUDABrowser/assets/images/applications/1031_suite-screen2-lg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1031_suite-screen2-lg_large.png
Commercial
Jedox
2010
03
31
03/31/2010
40
Commercial
Tobias Lauer
Leif Mergener
Application
Multimedia
Business Intelligence
Leif Mergener,leif.mergener@jedox.com,OLAP,multidimensional aggregation
487089bf-2127-4c4c-9f6c-12cf40551ab4
Real-Time Multi-Agent Path Planning on Arbitrary Surfaces
Path planning is an active topic in the literature, and efficient navigation over non-planar surfaces is an open research question. In this work we present a novel technique for navigation of multiple agents over arbitrary triangular domains. The proposed solution uses a fast hierarchical computation of geodesic distances over triangular meshes to allow interactive frame rates, and a GPU-based collision avoidance technique to guide individual agents. Unlike most previous work, the method imposes no limitations on the surface over which the agents are moving, and can naturally deal with non-planar meshes of arbitrary genus and curvature. Moreover, the implementation is a hybrid CPU/GPU algorithm that explores the current trend of increasing the number of CPU cores and GPU programmability. This approach exploits the best qualities in each processor, thus achieving very high performance.
/content/cudazone/CUDABrowser/assets/images/applications/1030_teaser_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1030_teaser_large.png
http://www.inf.ufrgs.br/
UFRGS and NVIDIA
Academia
2010
02
21
02/21/2010
Rafael P. Torchelsen
Luiz F. Scheidegger
Guilherme N. Oliveira
Rui Bastos
Joao L. D. Comba
Multimedia
Paper
Presentation
Graphics
Games
Path Planning, Multi-Agents, Games,Rafael P. Torchelsen,Luiz F. Scheidegger,Guilherme N. Oliveira,Rui Bastos,Joao L. D. Comba,rafael.torchelsen@gmail.com
c6d19cfa-dfde-45ba-a8f3-c92550306e67
High performance conjugate gradient solver on multi-GPU clusters using hypergraph partitioning
Motivated by the high computation power and low price-to-performance ratio of GPUs, GPU-accelerated clusters are being built for high performance scientific computing. In this work, we propose a scalable implementation of a Conjugate Gradient (CG) solver for unstructured matrices on a GPU-extended cluster, where each cluster node has multiple GPUs. Basic computations of the solver are performed on GPUs and communications are managed by the CPU. For sparse matrix-vector multiplication, which is the most time-consuming operation, the solver selects the fastest among several high performance kernels running on GPUs. In a GPU-extended cluster, it is more difficult than in traditional CPU clusters to obtain scalability: because computation on GPUs is much faster than on CPUs, GPU-extended clusters demand faster communication between compute units. To achieve scalability, we adopt hypergraph-partitioning models, which are state-of-the-art models for communication reduction and load balancing for parallel sparse iterative solvers. We implement a hierarchical partitioning model which better suits the underlying heterogeneous system. In our experiments, we obtain up to 94 Gflops double-precision CG performance using 64 NVIDIA Tesla GPUs on 32 nodes.
/content/cudazone/CUDABrowser/assets/images/applications/1029_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1029_implementation_large.png
Academia
Tokyo Institute of Technology
http://matsu-www.is.titech.ac.jp/
2010
04
02
04/02/2010
Ali Cevahir
Akira Nukada
Satoshi Matsuoka
Paper
Numerics
Science
Ali Cevahir,Akira Nukada,Satoshi Matsuoka,ali@matsulab.is.titech.ac.jp
ebf74770-43d6-4df3-9ad0-7caf33e71e45
Best-effort semantic document search on GPUs
Semantic indexing is a popular technique used to access and organize large amounts of unstructured text data. We describe an optimized implementation of semantic indexing and document search on manycore GPU platforms. We observed that a parallel implementation of semantic indexing on a 128-core Tesla C870 GPU is only 2.4X faster than a sequential implementation on an Intel Xeon 2.4GHz processor. We ascribe the less than spectacular speedup to a mismatch in the workload characteristics of semantic indexing and the unique architectural features of GPUs. Compared to the regular numerical computations that have been ported to GPUs with great success, our semantic indexing algorithm (the recently proposed Supervised Semantic Indexing algorithm called SSI) has interesting characteristics -- the amount of parallelism in each training instance is data-dependent, and each iteration involves the product of a dense matrix with a sparse vector, resulting in random memory access patterns. As a result, we observed that the baseline GPU implementation significantly under-utilizes the hardware resources (processing elements and memory bandwidth) of the GPU platform. However, the SSI algorithm also demonstrates unique characteristics, which we collectively refer to as the "forgiving nature" of the algorithm. These unique characteristics allow for novel optimizations that do not strive to preserve numerical equivalence of each training iteration with the sequential implementation. In particular, we consider best-effort computing techniques, such as dependency relaxation and computation dropping, to suitably alter the workload characteristics of SSI to leverage the unique architectural features of the GPU. We also show that the realization of dependency relaxation and computation dropping concepts on a GPU is quite different from how one would implement these concepts on a multicore CPU, largely due to the distinct architectural features supported by a GPU. 
Our new techniques dramatically enhance the amount of parallel workload, leading to much higher performance on the GPU. By optimizing data transfers between CPU and GPU, and by reducing GPU kernel invocation overheads, we achieve further performance gains. We evaluated our new GPU-accelerated implementation of semantic document search on a database of over 1.8 million documents from Wikipedia. By applying our novel performance-enhancing strategies, our GPU implementation on a 128-core Tesla C870 achieved a 5.5X acceleration as compared to a baseline parallel implementation on the same GPU. Compared to a baseline parallel TBB implementation on a dual-socket quad-core Intel Xeon multicore CPU (8-cores), the enhanced GPU implementation is 11X faster. Compared to a parallel implementation on the same multi-core CPU that also uses data dependency relaxation and dropping computation techniques, our enhanced GPU implementation is 5X faster.
/content/cudazone/CUDABrowser/assets/images/applications/1027_object3_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1027_object3_large.jpg
Research
NEC Labs America
2010
03
14
03/14/2010
14
Surendra Byna
Jiayuan Meng
Anand Raghunathan
Srimat Chakradhar
Srihari Cadambi
Paper
Life Sciences
Surendra Byna,Jiayuan Meng,Anand Raghunathan, Srimat Chakradhar, Srihari Cadambi ,sbyna@nec-labs.com
b71710f7-ab68-4e1a-a418-109bcd152445
rCUDA
The rCUDA Framework enables the concurrent usage of CUDA-compatible devices remotely.
/content/cudazone/CUDABrowser/assets/images/applications/1026_bluebreeze_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1026_bluebreeze_logo_large.png
Academia
Universidad Politecnica de Valencia and Universidad Jaume I
2010
04
01
04/01/2010
Open source
The rCUDA team
Application
Programming Tools
Remote CUDA,The rCUDA team,apenya@gap.upv.es
7a68e84c-71bb-4325-9bba-94e4f01277e1
High Performance Finite Difference PDE Solvers on GPUs
We show how to implement highly efficient GPU solvers for one-dimensional PDEs based on finite difference schemes. The typical use case is to price a large number of similar or related derivatives in parallel. Application scenarios include market making, real-time pricing, and risk management. The tridiagonal systems in the backward propagation of a finite difference scheme are solved with parallel cyclic reduction. This is a fine-grained parallel tridiagonal solver, which is well adapted to the hierarchical architecture of a modern GPU. We explain in detail the calculation workflow and study the performance of the solver relative to a well optimized CPU implementation. Our timings demonstrate performance improvement factors of 25 on a single GPU and 38 on two GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1025_parallel_cyclic_reduction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1025_parallel_cyclic_reduction_large.png
Commercial
QuantAlea GmbH
2010
02
28
02/28/2010
38
Daniel Egloff
Paper
Finance
Numerics
Daniel Egloff,daniel.egloff@quantalea.net
b5acd6e0-1899-49f0-b8c8-503be97f7afc
CNS: a GPU-based framework for simulating cortically-organized networks
A general GPU-based framework for the fast simulation of "cortically-organized" networks, defined as networks consisting of n-dimensional layers of similar cells.
/content/cudazone/CUDABrowser/assets/images/applications/1024_cns_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1024_cns_large.png
Academia
Center for Biological & Computational Learning (CBCL) at MIT
http://cbcl.mit.edu
2010
03
06
03/06/2010
80
Open source
Jim Mutch
Application
Paper
Code
Programming Tools
Science
Computational Neuroscience
Jim Mutch,jmutch@mit.edu
3c9da880-2acb-4ca9-984e-75abcca19b77
Massively parallel forward modeling of scalar and tensor gravimetry data
Geophysical modeling code for calculating the first and second derivatives of the gravitational potential for a 3D mass distribution.
/content/cudazone/CUDABrowser/assets/images/applications/1023_logo_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1023_logo_large.gif
Academia
IFM-GEOMAR
2010
03
24
03/24/2010
40
Open source
Max Moorkamp
M. Jegen
A. Roberts and R. Hobbs
Paper
Code
Oil & Gas
Science
Max Moorkamp,M. Jegen,A. Roberts and R. Hobbs,mmoorkamp@ifm-geomar.de
e06f5a43-2233-44c5-af51-7fd7c4cffe23
Simulation Game of Life on GPU.
Simulation of the Game of Life on the GPU via CUDA. It uses a shared-memory technique.
/content/cudazone/CUDABrowser/assets/images/applications/1022_GPUgameOfLife_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1022_GPUgameOfLife_large.gif
http://www.cs.tu.ac.th/
Thammasat University, Thailand
2010
03
25
03/25/2010
8
Open source
Treepop Sunpetchniyom
Code
Life Sciences
Science
Simulation, Game of Life,Treepop Sunpetchniyom,treepop.sunpetchniyom@nectec.or.th
20e2b57e-43ee-4681-aa07-d1c24ed1ebda
Incompressible Flow Computations on the NCSA Lincoln Tesla Cluster
We pursue MPI-CUDA implementations and investigate three strategies to probe the efficiency and scalability of incompressible flow computations on the Lincoln Tesla cluster at the National Center for Supercomputing Applications (NCSA). We exploit some of the advanced features of MPI and CUDA programming to overlap both GPU data transfer and MPI communications with computations on the GPU. We sustain approximately 2.4 TeraFLOPS on the 64 nodes of the NCSA Lincoln Tesla cluster using 128 GPUs with a total of 30,720 processing elements. Our results demonstrate that multi-GPU clusters can substantially accelerate computational fluid dynamics (CFD) simulations.
/content/cudazone/CUDABrowser/assets/images/applications/1021_OKC_downtown_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1021_OKC_downtown_large.png
Academia
Boise State University
2010
03
25
03/25/2010
Jacobsen
Thibault
Senocak
Paper
Computational Fluid Dynamics
Numerics
CFD, CUDA, incompressible flow, MPI, GPU cluster,Jacobsen,Thibault,Senocak,danajacobsen@u.boisestate.edu,tchetchenko@gmail.com,senocak@boisestate.edu
8597944c-e271-43f7-aa6b-dd50f497ec38
ANDSolver
ANDSolver solves the compressible Euler equations on unstructured polyhedral meshes.
/content/cudazone/CUDABrowser/assets/images/applications/1020_page2_1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1020_page2_1_large.png
Commercial
Palix Technologies
http://www.palixtech.com
2010
03
23
03/23/2010
10
Commercial
Palix Technologies
Application
Computational Fluid Dynamics
Palix Technologies,info@palixtech.com
b7ebd84b-4dfe-49c9-8ca7-64c913d30343
NBSymple: a symplectic N-body code for astrophysical simulations using TESLA GPUs
NBSymple is a brand new parallel code which exploits the joint performance of multicore CPUs and GPUs, by means of OpenMP and CUDA, respectively. It performs numerical integration of the equations of motion of a set of N particles interacting via Newtonian gravitational forces. The time integration is done by a high-precision algorithm, which guarantees time reversibility and excellent energy conservation. We tested the code in various cases, making use of single-precision and double-precision arithmetic, as well as of a software "double-single" precision which seems a good compromise between precision and speed on TESLA C1060 GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/1019_macchina_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1019_macchina_large.png
Academia
Dep. of Physics, Univ. of Roma "La Sapienza", Roma, Italy
http://www.uniroma1.it
2010
03
17
03/17/2010
Roberto CAPUZZO-DOLCETTA Alessandra MASTROBUONO-BATTISTI
Paper
Code
Numerics
Astrophysics
astrophysics; N-body simulations; symplectic schemes,Roberto CAPUZZO-DOLCETTA Alessandra MASTROBUONO-BATTISTI,roberto.capuzzodolcetta@uniroma1.it
a5518c10-3337-449f-aa4f-2b67aedd8636
Accelerating Biomedical Signal Processing Algorithms with Parallel Programming on Graphic Processor Units
This paper investigates the benefits of adopting Graphics Processing Unit (GPU) parallel programming in the field of biomedical signal processing. The differences in execution time when computing the Correlation Dimension (CD) of multivariate neurophysiological recordings and the Skin Conductance Level (SCL) are reported by comparing several common programming environments. Moreover, as indicated in this study, combining parallel programming with special design techniques that deal with memory management issues, such as data transfer between device memory and GPU, may further accelerate processing. The reduction in execution time achieved through proper parallel architecture design may reach a factor of 29 compared with a pure C implementation. Therefore, a parallel GPU programming environment may be beneficial for numerous biomedical applications within the sphere of biosignal processing.
/content/cudazone/CUDABrowser/assets/images/applications/1018_biosignal_analysis_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1018_biosignal_analysis_large.jpg
Academia
Lab of Medical Informatics, Medical School, Aristotle University of Thessaloniki, Greece
http://lomiweb.med.auth.gr/gan/
2009
11
05
11/05/2009
29
Konstantinidis Evdokimos
Frantzidis Christos
Panagiotis Bamidis
Paper
Science
Signal Processing
Konstantinidis Evdokimos,Frantzidis Christos,Panagiotis Bamidis,evdokimosk@gmail.com,christos.frantzidis@gmail.com,bamidis@med.auth.gr
608f2d59-cd5f-43a0-ac09-c0741f149014
B Flash Finder
This program harnesses the power of NVIDIA's CUDA technology to return search results from an entire hard drive in seconds.
/content/cudazone/CUDABrowser/assets/images/applications/1017_flashfinder_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1017_flashfinder_large.jpg
Academia
Psyzone
http://www.psyzone.co.in
2010
03
16
03/16/2010
Bhairav Pardiwala
Application
Files Search
Search,files/folders,fast search,Bhairav Pardiwala,bhairavpardiwala@gmail.com
78387568-da7b-4cf6-b23c-504b45e212e4
Allinea DDT
Graphical debugger for NVIDIA CUDA - parallel and sequential code
/content/cudazone/CUDABrowser/assets/images/applications/1016_Allineaddt_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1016_Allineaddt_large.jpg
Commercial
Allinea Software Ltd.
http://www.allinea.com
2010
03
23
03/23/2010
Commercial
Allinea Software
Application
Paper
Programming Tools
David Lecomber,david@allinea.com
b63787c7-a224-436e-92da-515a2bc7d015
A GPU approach to FDTD for radio coverage prediction
A well-known approach to computing radio wave propagation is the Finite-Difference Time-Domain (FDTD) model, which solves the Maxwell equations on a discrete grid. With the development of new programmable graphics hardware, novel solutions for computational electromagnetics are already being implemented on GPUs. In this paper a GPU implementation of FDTD is developed and achieves a speedup of over 100X over a Matlab implementation running on an AMD Athlon 64X2 dual core 4600
/content/cudazone/CUDABrowser/assets/images/applications/1015_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/1015_logo_xplore_large.gif
Academia
University of Bedfordshire, Luton, Bedfordshire, UK
2009
01
06
01/06/2009
100
Alvaro Valcarce
Guillaume De La Roche
Jie Zhang
Paper
Signal Processing
Alvaro Valcarce,Guillaume De La Roche,Jie Zhang
4f6fb7c7-a80f-473f-9f52-0022e61724e9
gVirtuS: A GPGPU transparent virtualization component
gVirtuS tries to fill the gap between in-house hosted computing clusters equipped with GPGPU devices and pay-for-use high performance virtual clusters deployed via public or private computing clouds. gVirtuS allows an instantiated virtual machine to access GPGPUs in a transparent way, with an overhead only slightly greater than that of a real machine/GPGPU setup. gVirtuS is hypervisor independent and, even though it currently virtualizes NVIDIA CUDA based GPUs, it is not limited to a specific brand technology. The performance of the components of gVirtuS is assessed through a suite of tests in different deployment scenarios, such as providing GPGPU power to cloud computing based HPC clusters and sharing remotely hosted GPGPUs among HPC nodes.
/content/cudazone/CUDABrowser/assets/images/applications/1014_gVirtuS_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1014_gVirtuS_large.png
UniParthenope Open Source Lab
http://osl.uniparthenope.it
2010
03
05
03/05/2010
Giulio Giunta
Raffaele Montella
Giuseppe Agrillo
Giuseppe Coviello
Code
Giulio Giunta,Raffaele Montella,Giuseppe Agrillo,giulio.giunta@uniparthenope.it,raffaele.montella@uniparthenope.it,giuseppe.agrillo@uniparthenope.it
1f219825-5344-4682-9cc8-d329d28f8eb3
Volmaster FX
A complete pricing tool for FX derivatives delivering advanced stochastic volatility models natively. Wide range of exotics covered. Thanks to innovative proprietary pricing techniques implemented with CUDA, Volmaster can achieve unrivalled computational speed. Delivered via web (software-as-a-service), Volmaster runs instantly on any desktop with a click-and-go distribution model.
/content/cudazone/CUDABrowser/assets/images/applications/1013_logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1013_logo_large.jpg
Commercial
Volmaster B.V.
http://www.volmaster.com
2010
03
05
03/05/2010
Commercial
Stefano Silvano
Application
Finance
fx option derivative pricing stochastic volatility exotic vanilla greeks numerical skew,Stefano Silvano,info@volmaster.com
e0663fcb-55fe-4b9b-886b-0c7305d789df
Exploring utilisation of GPU for database applications
This study explores possible applications of GPU technology for accelerating database access. We use an n-gram based approximate text search engine as a test bed for GPU based acceleration algorithms. Two solutions for query processing, a hybrid CPU/GPU algorithm and a pure GPU algorithm, are studied and compared with the baseline CPU algorithm as well as with optimized versions of the CPU algorithm. The hybrid algorithm performs poorly on most queries, and only modest acceleration is achievable for long queries with a high error level. On the other hand, speedups of up to 18 times were achieved for the pure GPU algorithm. Application of GPU acceleration to more general database problems is discussed.
/content/cudazone/CUDABrowser/assets/images/applications/1012_chemsearch_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1012_chemsearch_logo_large.png
Academia
Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw
http://www.icm.edu.pl
2010
03
09
03/09/2010
18
S. Walkowiak
K. Wawruch
L. Ligowski
Paper
Science
Databases
S. Walkowiak,K. Wawruch,L. Ligowski,S.Walkowiak@icm.edu.pl,L.Ligowski@icm.edu.pl,W.Rudnicki@icm.edu.pl
09210fbd-d990-4fd2-a864-aeb2dc8291eb
Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed Precision Multigrid
We have previously suggested mixed precision iterative solvers specifically tailored to the iterative solution of sparse linear equation systems as they typically arise in the finite element discretization of partial differential equations. These schemes have been evaluated for a number of hardware platforms, in particular single precision GPUs as accelerators to the general purpose CPU. This paper reevaluates the situation with new mixed precision solvers that run entirely on the GPU: We demonstrate that mixed precision schemes constitute a significant performance gain over native double precision. Moreover, we present a new implementation of cyclic reduction for the parallel solution of tridiagonal systems and employ this scheme as a line relaxation smoother in our GPU-based multigrid solver. With an alternating direction implicit variant of this advanced smoother we can extend the applicability of the GPU multigrid solvers to very ill-conditioned systems arising from the discretization on anisotropic meshes, that previously had to be solved on the CPU. The resulting mixed precision schemes are always faster than double precision alone, and outperform tuned CPU solvers consistently by almost an order of magnitude.
/content/cudazone/CUDABrowser/assets/images/applications/1011_TPDS_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1011_TPDS_large.jpg
Academia
TU Dortmund and Max Planck Institut Informatik
2010
03
02
03/02/2010
10
Dominik Goddeke
Robert Strzodka
Paper
Numerics
Dominik Goddeke,Robert Strzodka,dominik.goeddeke@math.tu-dortmund.de
a22255de-199f-41da-b3c7-703a9714f29f
Lattice-Boltzmann Simulation of the Shallow-Water Equations with Fluid-Structure Interaction on Multi- and Manycore Processors
We present an efficient method for the simulation of laminar fluid flows with free surfaces including their interaction with moving rigid bodies, based on the two-dimensional shallow water equations and the Lattice-Boltzmann method. Our implementation targets multiple fundamentally different architectures such as commodity multicore CPUs with SSE, GPUs, the Cell BE and clusters. We show that our code scales well on an MPI-based cluster; that an eightfold speedup can be achieved using modern GPUs in contrast to multithreaded CPU code and, finally, that it is possible to solve fluid-structure interaction scenarios with high resolution at interactive rates.
/content/cudazone/CUDABrowser/assets/images/applications/1010_mcc-paper_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1010_mcc-paper_large.png
Academia
TU Dortmund
2010
02
24
02/24/2010
8
Markus Geveler
Paper
Computational Fluid Dynamics
Numerics
Markus Geveler,markus.geveler@math.tu-dortmund.de
095c041f-aa2c-472f-b0b4-eaa692951dc5
Fast Image Blurring with CUDA
High-performance, good-quality image blurring using the stack blur algorithm provided by http://incubator.quasimondo.com
/content/cudazone/CUDABrowser/assets/images/applications/1008_device_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1008_device_large.png
Research
http://home.so-net.net.tw/lioucy/
2009
09
10
09/10/2009
300
Open source
ChaoJui
Application
Graphics
ChaoJui,lioucr@yahoo.ca
476d306d-6a77-4749-8210-8b7b19ebd420
Fast Human Detection with Cascaded Ensembles
A real-time implementation of the Histograms of Oriented Gradients algorithm with cascaded classifiers.
/content/cudazone/CUDABrowser/assets/images/applications/1007_cover_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1007_cover_large.jpg
Academia
MIT
2010
02
26
02/26/2010
13
Berkin Bilgic
Paper
Signal Processing
Berkin Bilgic,berkin@mit.edu
ba8b2f03-6170-4b26-99c3-86b4e47546d1
Cuda-Renderer 2009 - A Multi-Volume Polyhedral Renderer
We present a new algorithm for hardware-accelerated ray casting of multiple volumes. Our approach supports a large number of volumes, complex translucent and concave polyhedral objects as well as CSG intersections of volumes and geometry in any combination. It is implemented as a software renderer in CUDA without any fixed function portions, which allows full control over the use of memory bandwidth. High depth complexity, which is problematic for conventional approaches based on depth peeling, can be successfully handled. As far as we know, our approach is the first framework for multi-volume rendering which provides interactive frame rates when concurrently rendering more than 50 arbitrarily overlapping volumes on current graphics hardware.
/content/cudazone/CUDABrowser/assets/images/applications/1006_Thumbnail_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1006_Thumbnail_large.png
Academia
Graz University of Technology
http://www.icg.tugraz.at
2009
12
14
12/14/2009
Bernhard Kainz
Markus Grabner
Alexander Bornik
Stefan Hauswiesner
Judith Muehl
Dieter Schmalstieg
Multimedia
Paper
Graphics
Medical Imaging
Bernhard Kainz,Markus Grabner,Alexander Bornik,kainz@icg.tugraz.at
aa417b5b-e0cc-446a-9fca-a93e14d4868b
Accelerating SQL Database Operations on a GPU with CUDA
A reimplementation of portions of the SQLite database to execute on a GPU, part of the GPGPU-3 workshop.
/content/cudazone/CUDABrowser/assets/images/applications/1005_volcano_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1005_volcano_large.jpg
Academia
University of Virginia (LAVA Lab)
http://www.cs.virginia.edu/~skadron/pub_list.html
2010
03
14
03/14/2010
70
Peter Bakkum
Kevin Skadron
Paper
Data Mining
Peter Bakkum,pbb7c@virginia.edu
c6b19852-39b9-4460-8777-47047330ce20
Gramm-software package for molecular dynamics on graphical processing units
This work describes the software package and algorithms for molecular dynamics using NVIDIA GPU G80, G84, and G92. All potentials needed for MM2 and AMBER force fields are implemented and the combination of different potentials is allowed. The performance comparison of different MD algorithms on GPU and CPU is presented. All software is available from www.gpamm.mntech.ru.
/content/cudazone/CUDABrowser/assets/images/applications/1003_cover-medium2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1003_cover-medium2_large.jpg
Academia
Russian Academy of Sciences
2010
01
21
01/21/2010
D. S. Tarasov
E. D. Izotova
D. A. Alisheva
Paper
Numerics
D. S. Tarasov,E. D. Izotova,D. A. Alisheva
de2ccfcc-9cf0-4076-b92e-969e42607064
Leveraging Computation Sharing and Parallel Processing in Location-Based Services
A variety of research exists for the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. In this paper, we introduce an efficient and scalable system for monitoring continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit.
http://www.computer.org/portal/web/csdl/doi/10.1109/CSE.2009.437
/content/cudazone/CUDABrowser/assets/images/applications/1002_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1002_cs_large.jpg
Research
2009 International Conference on Computational Science and Engineering
2008
08
31
08/31/2008
Jonathan Cazalas
Kien Hua
Paper
Science
Jonathan Cazalas,Kien Hua
5914c55d-b33d-4834-91f5-52968c66c450
Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors
Lattice Boltzmann Methods (LBM) are used for the computational simulation of Newtonian fluid dynamics. LBM-based simulations are readily parallelizable; they have been implemented on general-purpose processors, field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). Of the three methods, the GPU implementations achieved the highest simulation performance per chip. With memory bandwidth of up to 141 GB/s and a theoretical maximum floating point performance of over 600 GFLOPS, CUDA-ready GPUs from NVIDIA provide an attractive platform for a wide range of scientific simulations, including LBM.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPP.2009.38
/content/cudazone/CUDABrowser/assets/images/applications/1001_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1001_cs_large.jpg
Research
2009 International Conference on Parallel Processing
2009
09
25
09/25/2009
Peter Bailey
Joe Myre
Stuart D.C. Walsh
Paper
Science
Peter Bailey,Joe Myre,Stuart D.C. Walsh
9c638c6c-3e27-4a8c-b3ea-e6f75ca52f8d
Theoretical and Empirical Analysis of a GPU Based Parallel Bayesian Optimization Algorithm
General Purpose computing over Graphical Processing Units (GPGPUs) is a huge shift of paradigm in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelization of Bayesian Optimization Algorithm (BOA) to solve complex combinatorial optimization problems over nVidia commodity graphics hardware using Compute Unified Device Architecture (CUDA).
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.32
/content/cudazone/CUDABrowser/assets/images/applications/1000_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/1000_cs_large.jpg
Research
2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
2009
12
11
12/11/2009
Asim Munawar
Mohamed Wahib
Masaharu Munetomo
Paper
Science
Asim Munawar,Mohamed Wahib,Masaharu Munetomo
9c689f15-653f-4c90-b64c-1140bae9d5df
Applying Modern Soft- and Hardware Technologies for Computational Steering Approaches in Computational Fluid Dynamics
In this article we present an educational simulation tool, FlowSim 2007 CUDA edition, a computational steering application for interactive 2D flow simulation based on the Lattice Boltzmann Method. The application combines a comfortable user interface as well as a convenient development platform on the one hand and a high performance flow solver on the other hand.
http://www.computer.org/portal/web/csdl/doi/10.1109/CW.2007.53
/content/cudazone/CUDABrowser/assets/images/applications/999_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/999_cs_large.jpg
Research
2007 International Conference on Cyberworlds
2007
10
26
10/26/2007
Jan Linxweiler
Jonas Tölke
Manfred Krafczyk
Paper
Science
Jan Linxweiler,Jonas Tölke,Manfred Krafczyk
722324b0-4ea9-4cc8-896d-190e61c0da21
High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units
The power of graphics processing units (GPUs) has been increasing more rapidly than that of CPUs. It is not surprising that many software libraries have been developed which enable us to use the power of the GPU for general computations, especially in parallel data processing. In this paper, we propose implementations of ARIA, the standard block cipher of Korea, using the OpenGL and CUDA libraries on the GPU.
http://www.computer.org/portal/web/csdl/doi/10.1109/MUE.2008.94
/content/cudazone/CUDABrowser/assets/images/applications/998_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/998_cs_large.jpg
Research
2008 International Conference on Multimedia and Ubiquitous Engineering
2008
04
26
04/26/2008
Yongjin Yeom
Yongkuk Cho
Moti Yung
Paper
Science
Yongjin Yeom,Yongkuk Cho,Moti Yung
8fc3f95e-0465-479a-a4d9-6876d7b5e3b3
Accelerating Compute-Intensive Applications with GPUs and FPGAs
Accelerators are special purpose processors designed to speed up compute-intensive sections of applications. Two extreme endpoints in the spectrum of possible accelerators are FPGAs and GPUs, which can often achieve better performance than CPUs on certain workloads. FPGAs are highly customizable, while GPUs provide massive parallel execution resources and high memory bandwidth.
http://www.computer.org/portal/web/csdl/doi/10.1109/SASP.2008.4570793
/content/cudazone/CUDABrowser/assets/images/applications/997_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/997_cs_large.jpg
Academia
University of Virginia
2008
06
09
06/09/2008
Shuai Che
Jie Li
Jeremy W. Sheaffer
Paper
Science
Shuai Che,Jie Li,Jeremy W. Sheaffer
458661f9-8252-4e51-b2bd-d53126b80571
High-Speed Private Information Retrieval Computation on GPU
A Private Information Retrieval (PIR) scheme is a protocol in which a user retrieves a record out of n from a replicated database, while hiding from the database which record has been retrieved, as long as the different replicas do not collude. An especially interesting sub-field of research, called single-database PIR, deals with schemes that allow a user to privately retrieve an element of a non-replicated database. In these schemes, user privacy is related to the intractability of a mathematical problem, instead of being based on the assumption that different replicas exist and do not collude against their users.
http://www.computer.org/portal/web/csdl/doi/10.1109/SECURWARE.2008.55
/content/cudazone/CUDABrowser/assets/images/applications/996_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/996_cs_large.jpg
Research
2008 Second International Conference on Emerging Security Information, Systems and Technologies
2008
08
31
08/31/2008
Carlos Aguilar Melchor
Benoit Crespin
Philippe Gaborit
Paper
Science
Carlos Aguilar Melchor,Benoit Crespin,Philippe Gaborit
23325740-fa2b-4e7b-a7ee-b607da12ee54
Compute Unified Device Architecture Application Suitability
Graphics processing units (GPUs) can provide excellent speedups on some, but not all, general-purpose workloads. Using a set of computational GPU kernels as examples, the authors show how to adapt kernels to utilize the architectural features of a GeForce 8800 GPU and what finally limits the achievable performance.
/content/cudazone/CUDABrowser/assets/images/applications/995_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/995_cs_large.jpg
Academia
University of Illinois
2009
06
01
06/01/2009
Wen-Mei Hwu
Christopher Rodrigues
Shane Ryoo
Paper
Science
Wen-Mei Hwu,Christopher Rodrigues,Shane Ryoo
0e7f8bf5-4959-4ece-9756-519dda1fe8b6
Parallel Approaches for SWAMP Sequence Alignment
This document is a summary and overview of several approaches to implement the local sequence alignment algorithms known as SWAMP and SWAMP+ on commercially available hardware. Using a Smith-Waterman style of alignment, these parallel algorithms have several innovative extensions that take advantage of the ASC associative computing model while maintaining speed, accuracy, and producing a richer set of results in an automated way that is not currently available. We consider four different hardware architectures for the realization of the ASC model. These are the ClearSpeed CSX processor, NVIDIA GPGPU graphics processors, IBM Cell Processors, and FPGAs.
/content/cudazone/CUDABrowser/assets/images/applications/994_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/994_cs_large.jpg
Academia
Case Western University, Cleveland, Ohio
2009
06
17
06/17/2009
Shannon Steinfadt
Kevin Schaffer
Paper
Science
Shannon Steinfadt,Kevin Schaffer
9d864e53-07ca-423a-ab86-088d470c12a2
Accelerating Algebraic Reconstruction Using CUDA-Enabled GPU
In this paper, we apply the Compute Unified Device Architecture (CUDA) to 3D cone-beam CT reconstruction using the Simultaneous Algebraic Reconstruction Technique (SART). With hardware acceleration, the computationally complex SART can run at a speed comparable to the commonly used Filtered Back-Projection, and provide an even better quality volume with fewer samples. The main contributions include two novel techniques to accelerate the reconstruction.
http://www.computer.org/portal/web/csdl/doi/10.1109/CGIV.2009.18
/content/cudazone/CUDABrowser/assets/images/applications/993_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/993_cs_large.jpg
Research
2009 Sixth International Conference on Computer Graphics, Imaging and Visualization
2009
08
14
08/14/2009
Yuqiang Lu
Weiming Wang
Shifu Chen
Paper
Science
Yuqiang Lu,Weiming Wang,Shifu Chen
9eb3b453-b0e1-42bc-8040-626f61e09879
Profiling General Purpose GPU Applications
We are witnessing an increasing adoption of GPUs for performing general purpose computation, which is usually known as GPGPU. The main challenge in developing such applications is that they often do not fit the model required by graphics processing devices, limiting the scope of applications that may benefit from the computing power provided by GPUs. Even when the application fits the GPU model, obtaining optimal resource usage is a complex task.
http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2009.26
/content/cudazone/CUDABrowser/assets/images/applications/992_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/992_cs_large.jpg
Research
2009 21st International Symposium on Computer Architecture and High Performance Computing
2009
10
31
10/31/2009
Bruno Rocha Coutinho
George Luiz Medeiros Teodoro
Rafael Sachetto Oliveira
Paper
Science
Bruno Rocha Coutinho,George Luiz Medeiros Teodoro,Rafael Sachetto Oliveira
f186e9e8-e3bd-41da-a917-0868ecbc7fdc
Improving Performance of Matrix Multiplication and FFT on GPU
In this paper we discuss our experiences in improving the performance of two key algorithms: the single-precision matrix-matrix multiplication subprogram (SGEMM of BLAS) and single-precision FFT using CUDA. The former is computation-intensive, while the latter is memory bandwidth or communication-intensive. A peak performance of 393 Gflops is achieved on NVIDIA GeForce GTX280 for the former, about 5% faster than the CUBLAS 2.0 library. Better FFT performance results are obtained for a range of dimensions. Some common principles are discussed for the design and implementation of many-core algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/991_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/991_cs_large.jpg
Research
2009 15th International Conference on Parallel and Distributed Systems
2009
12
11
12/11/2009
Xiang Cui
Yifeng Chen
Hong Mei
Paper
Science
Xiang Cui,Yifeng Chen,Hong Mei
dedb0ea8-da35-401f-a28a-50bf45cb4f96
Coprocessor Computing with FPGA and GPU
Specialized secondary processing units, such as field programmable gate arrays (FPGAs) and graphics processing units (GPUs), attempt to tackle time-consuming applications with high computational requirements. In order to achieve acceleration, FPGAs allow a customizable architecture and Nvidia GPUs offer up to 16 cores with 128 stream processors.
http://www.computer.org/portal/web/csdl/doi/10.1109/DoD.HPCMP.UGC.2008.69
/content/cudazone/CUDABrowser/assets/images/applications/990_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/990_cs_large.jpg
Research
2008 DoD HPCMP Users Group Conference
2008
07
17
07/17/2008
Song Jun Park
Dale R. Shires
Brian J. Henz
Paper
Science
Song Jun Park,Dale R. Shires,Brian J. Henz
a594c3dc-754e-4f17-a212-b1968a962069
GPU as a General Purpose Computing Resource
In the last few years, GPUs (Graphics Processing Units) have developed rapidly. Their ever-increasing computing power and decreasing cost have attracted attention from both industry and academia. In addition to graphics applications, researchers are interested in using them for general purpose computing. Recently, NVIDIA released a new computing architecture, CUDA (Compute Unified Device Architecture), for its GeForce 8 series, Quadro FX, and Tesla GPU products.
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2008.38
/content/cudazone/CUDABrowser/assets/images/applications/989_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/989_cs_large.jpg
Research
2008 Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies
2008
12
04
12/04/2008
Qihang Huang
Zhiyi Huang
Paul Werstein
Paper
Science
Qihang Huang,Zhiyi Huang,Paul Werstein
ce07c8c1-bb41-4cc2-b04f-ee02a2980f68
Accelerating Partitional Algorithms for Flow Cytometry on GPUs
Like many modern techniques for scientific analysis, flow cytometry produces massive amounts of data that must be analyzed and clustered intelligently to be useful. Current manual binning techniques are cumbersome and limited in both the quality and quantity of analysis produced. To address the quality of results, a new framework applying two different sets of clustering algorithms and inference methods is implemented.
http://www.computer.org/portal/web/csdl/doi/10.1109/ISPA.2009.29
/content/cudazone/CUDABrowser/assets/images/applications/988_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/988_cs_large.jpg
Research
2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
2009
08
12
08/12/2009
Jeremy Espenshade
Andrew Pangborn
Gregor von Laszewski
Paper
Science
Jeremy Espenshade,Andrew Pangborn,Gregor von Laszewski
34d48171-d36d-4bd4-8cf5-25e71f00c0ee
kD-Tree Traversal Implementations for Ray Tracing on Massive Multiprocessors: A Comparative Study
Current GPU computational power enables the execution of complex and parallel algorithms, such as Ray Tracing techniques supported by kD-trees for 3D scene rendering in real time. This work describes in detail the study and implementation of five different kD-Tree traversal algorithms using the parallel framework NVIDIA Compute Unified Device Architecture (CUDA), in order to point out their pros and cons regarding adaptation capability to the chosen architecture.
http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2009.25
/content/cudazone/CUDABrowser/assets/images/applications/987_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/987_cs_large.jpg
Research
2009 21st International Symposium on Computer Architecture and High Performance Computing
2009
10
31
10/31/2009
Artur L. dos Santos
Joao Marcelo X.N. Teixeira
Thiago S.M.C. de Farias
Paper
Science
Artur L. dos Santos,Joao Marcelo X.N. Teixeira,Thiago S.M.C. de Farias
ed07648a-45fa-4ca3-b053-e20c0411184a
Multi-core acceleration of chemical kinetics for simulation and prediction
This work implements a computationally expensive chemical kinetics kernel from a large-scale community atmospheric model on three multi-core platforms: NVIDIA GPUs using CUDA, the Cell Broadband Engine, and Intel Quad-Core Xeon CPUs. A comparative performance analysis for each platform in double and single precision on coarse and fine grids is presented. Platform-specific design and optimization is discussed in a mechanism-agnostic way, permitting the optimization of many chemical mechanisms.
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654067
/content/cudazone/CUDABrowser/assets/images/applications/986_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/986_cs_large.jpg
Academia
Virginia Polytechnic Institute and State University
2009
11
20
11/20/2009
John C. Linford
John Michalakes
Manish Vachharajani
Paper
Science
John C. Linford,John Michalakes,Manish Vachharajani
7d14d54f-5d86-4c0d-aed3-d3c48af7bcd6
Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence
Direct numerical simulation (DNS) of turbulence is computationally intensive and typically relies on some form of parallel processing. Spectral kernels used for spatial discretization are a common computational bottleneck on distributed memory architectures. One way to increase DNS algorithms' efficiency is to parallelize spectral kernels using tightly coupled single-program, multiple-data (SPMD) multiprocessor units with minimal interprocessor communication latency.
http://www.computer.org/portal/web/csdl/doi/10.1109/MCSE.2009.42
/content/cudazone/CUDABrowser/assets/images/applications/985_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/985_cs_large.jpg
Academia
University of Maryland
2009
04
01
04/01/2009
George Stantchev
Derek Juba
William Dorland
Paper
Science
George Stantchev,Derek Juba,William Dorland
c8e73d49-b21f-4fdb-9e45-387be9600fe0
Accelerating Phase Correlation Functions Using GPU and FPGA
In this paper, we present a comparative study of implementations of the phase correlation function using GPUs, ASICs, and FPGAs. The Phase-Only Correlation (POC) method demonstrates high robustness and subpixel accuracy in pattern matching and image registration. However, it is at a disadvantage in computational speed because of operations such as the 2D FFT.
http://www.computer.org/portal/web/csdl/doi/10.1109/AHS.2009.53
/content/cudazone/CUDABrowser/assets/images/applications/984_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/984_cs_large.jpg
Research
2009 NASA/ESA Conference on Adaptive Hardware and Systems
2009
08
01
08/01/2009
Kentaro Matsuo
Tsuyoshi Hamada
Masayuki Miyoshi
Paper
Science
Kentaro Matsuo,Tsuyoshi Hamada,Masayuki Miyoshi
9d8a949b-df07-474b-987c-0f649a0c3750
Financial Derivatives Modeling Using GPU's
The architecture of the latest Graphics Processing Units (GPUs) has surpassed the previous application-specific stream architecture. This has led to an architecture consisting of a number of uniform programmable units integrated on the same chip, which facilitates general-purpose computing beyond graphics processing.
http://www.computer.org/portal/web/csdl/doi/10.1109/EmbeddedCom-ScalCom.2009.85
/content/cudazone/CUDABrowser/assets/images/applications/983_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/983_cs_large.jpg
Research
2009 International Conference on Scalable Computing and Communications
2009
09
27
09/27/2009
Myungho Lee
Chin Hong Chun
Sugwon Hong
Paper
Science
Myungho Lee,Chin Hong Chun,Sugwon Hong
a9fb7f0f-051b-4f80-940c-0d35b077453f
Fast k nearest neighbor search using GPU
Statistical measures coming from information theory represent interesting bases for image and video processing tasks such as image retrieval and video object tracking. Examples include the entropy and the Kullback-Leibler divergence. Accurate estimation of these measures requires adapting to the local sample density, especially if the data are high-dimensional.
http://www.computer.org/portal/web/csdl/doi/10.1109/CVPRW.2008.4563100
/content/cudazone/CUDABrowser/assets/images/applications/982_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/982_cs_large.jpg
Research
Université de Nice-Sophia Antipolis/CNRS Laboratoire I3S, France
2008
06
28
06/28/2008
Vincent Garcia
Eric Debreuve
Michel Barlaud
Paper
Science
Vincent Garcia,Eric Debreuve,Michel Barlaud
3b04ed36-f9f7-4cdd-bae3-6f168d7a28f4
Accelerating Simulations of Light Scattering Based on Finite-Difference Time-Domain Method with General Purpose GPUs
Simulations of light scattering from nano-structured surface areas require a substantial amount of computing time. The emergence of General Purpose Graphics Processing Units (GPGPUs) as affordable PC SIMD arithmetic coprocessors brings the necessary computing power to modern desktop PCs. In this paper we examine how the computation time of the Finite-Difference Time-Domain (FDTD) method, a classic numerical method for computing a solution to Maxwell's equations, can be reduced by leveraging the massively parallel architecture of GPGPU cards.
/content/cudazone/CUDABrowser/assets/images/applications/981_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/981_cs_large.jpg
Research
2008 11th IEEE International Conference on Computational Science and Engineering
2008
07
28
07/28/2008
A. Balevic
L. Rockstroh
A. Tausendfreund
Paper
Science
A. Balevic,L. Rockstroh,A. Tausendfreund
b249b69b-fbc0-48a9-b9a1-6dfc8e766fee
Exploring the multiple-GPU design space
Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution. In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup.
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5161068
/content/cudazone/CUDABrowser/assets/images/applications/980_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/980_cs_large.jpg
Academia
Northeastern University
2009
05
29
05/29/2009
Dana Schaa
David Kaeli
Paper
Science
Dana Schaa,David Kaeli
558ecded-dc18-475f-9e3d-ebda86332f8f
The Virtual Marathon: Parallel Computing Supports Crowd Simulations
To be realistic, an urban model must include appropriate numbers of pedestrians, vehicles, and other dynamic entities. Using a parallel-computing architecture, researchers simulated a marathon with more than a million participants. To simulate participant behavior, they used fuzzy logic on a GPU to perform millions of inferences in real time.
/content/cudazone/CUDABrowser/assets/images/applications/979_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/979_cs_large.jpg
Research
IEEE Computer Graphics
2009
08
01
08/01/2009
Erdal Yilmaz
Veysi Isler
Yasemin Yardimci Cetin
Paper
Science
Erdal Yilmaz,Veysi Isler,Yasemin Yardimci Cetin
e09be7a1-fad3-47c5-8189-cb111c0818df
A Parallel Gibbs Sampling Algorithm for Motif Finding on GPU
A motif is an overrepresented pattern in biological sequences, and motif finding is an important problem in bioinformatics. Due to the high computational complexity of motif finding, more and more computational capability is required as the amount of available biological data, such as gene transcription data, grows rapidly. Among many motif finding algorithms, Gibbs sampling is an effective method for long motif finding. In this paper we present an improved Gibbs sampling method on graphics processing units (GPUs) to accelerate motif finding. Experimental data show that, compared to traditional programs on CPU, our program running on GPU provides an effective and low-cost solution for the motif finding problem, especially for long motif finding.
/content/cudazone/CUDABrowser/assets/images/applications/978_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/978_cs_large.jpg
Research
2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
2009
08
12
08/12/2009
Linbin Yu
Yun Xu
Paper
Science
Linbin Yu,Yun Xu
c1239952-bc70-4a62-b8aa-da685c20d2ea
Cellular Level Agent Based Modelling on the Graphics Processing Unit
Cellular level agent based modelling is reliant on either sequential processing environments or expensive and largely unavailable PC grids. The GPU offers an alternative architecture for such systems; however, the steep learning curve associated with the GPU's data-parallel architecture has previously limited the uptake of this emerging technology.
http://www.computer.org/portal/web/csdl/doi/10.1109/HiBi.2009.12
/content/cudazone/CUDABrowser/assets/images/applications/977_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/977_cs_large.jpg
Research
2009 International Workshop on High Performance Computational Systems Biology
2009
10
14
10/14/2009
Paul Richmond
Simon Coakley
Daniela Romano
Paper
Science
Paul Richmond,Simon Coakley,Daniela Romano
843ce581-fa99-4372-99d8-6ef7b20ec10e
A microdriver architecture for error correcting codes inside the Linux kernel
Coding tasks, such as encryption of data or the generation of failure-tolerant codes, are among the most computationally expensive tasks inside the Linux kernel. Their integration into the kernel enables the user to access these functionalities transparently: encrypted hard disks can be used in the same way as unencrypted ones. Nevertheless, Linux as a monolithic kernel is not prepared to support these expensive tasks by accessing modern hardware accelerators, like graphics processing units (GPUs), as the corresponding accelerator libraries, like the CUDA API for NVIDIA GPUs, only offer user-space APIs.
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654095
/content/cudazone/CUDABrowser/assets/images/applications/976_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/976_cs_large.jpg
Academia
University of Paderborn, Germany
2009
11
20
11/20/2009
A. Brinkmann
D. Eschweiler
Paper
Science
A. Brinkmann,D. Eschweiler
d6ce8db6-de74-452f-8737-b2efa14c3d63
A Program Behavior Study of Block Cryptography Algorithms on GPGPU
Recently, many studies have mapped cryptography algorithms onto graphics processors (GPUs) and achieved large performance gains. This paper focuses not on the performance of a specific program obtained through all kinds of algorithmic optimization methods, but on the intrinsic reasons for this performance improvement, which lie in GPU architectural features.
http://www.computer.org/portal/web/csdl/doi/10.1109/FCST.2009.13
/content/cudazone/CUDABrowser/assets/images/applications/975_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/975_cs_large.jpg
Research
2009 Fourth International Conference on Frontier of Computer Science and Technology
2009
12
19
12/19/2009
Gu Liu
Hong An
Wenting Han
Paper
Science
Gu Liu,Hong An,Wenting Han
d9dd760c-7469-42a8-905c-7144ff3d043d
Count Sort for GPU Computing
Counting sort is a simple, stable and efficient sort algorithm with linear running time, which is a fundamental building block for many applications. This paper describes the design issues of a data-parallel implementation of the count sort algorithm on a commodity multiprocessor GPU using the Compute Unified Device Architecture (CUDA) platform, both from NVIDIA Corporation.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.30
/content/cudazone/CUDABrowser/assets/images/applications/974_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/974_cs_large.jpg
Research
2009 15th International Conference on Parallel and Distributed Systems
2009
12
11
12/11/2009
Weidong Sun
Zongmin Ma
Paper
Science
Weidong Sun,Zongmin Ma
47e489fe-8243-425b-a88c-27a5d18b0f6a
Solving Computational Problems with GPU Computing
Modern GPUs are massively parallel microprocessors that can deliver very high performance for the parallel computations common in science and engineering.
/content/cudazone/CUDABrowser/assets/images/applications/972_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/972_cs_large.jpg
Research
Computing Science
2009
10
01
10/01/2009
Jonathan Cohen
Michael Garland
Paper
Science
Jonathan Cohen,Michael Garland
8a1ca4bd-1895-40b6-98b7-33bdddca994d
The Synchronization Power of Coalesced Memory Accesses
Multicore architectures have established themselves as the new generation of computer architectures. As part of the one core to many cores evolution, memory access mechanisms have advanced rapidly. Several new memory access mechanisms have been implemented in many modern commodity multicore architectures. By specifying how processing cores access shared memory, memory access mechanisms directly influence the synchronization capabilities of multicore architectures. Therefore, it is crucial to investigate the synchronization power of these new memory access mechanisms.
/content/cudazone/CUDABrowser/assets/images/applications/971_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/971_cs_large.jpg
Academia
Chalmers University of Technology, Gothenburg
2008
12
31
12/31/2008
Phuong Hoai Ha
Philippas Tsigas
Otto J. Anshus
Paper
Science
Phuong Hoai Ha,Philippas Tsigas,Otto J. Anshus
fbaceb8f-3e46-4070-ac07-fe2e8d5e4608
Fast Disk Encryption through GPGPU Acceleration
We present the design and performance analysis of a GPU-optimized implementation of a disk encryption application employing the XTS mode of operation applied together with the Twofish algorithm within the well-known TrueCrypt suite. We show how to correctly tune the design parameters, including data allocation, thread packing, and parallelization strategy. Overall, our implementation of TrueCrypt running on an NVIDIA GTX260 GPU outperforms the baseline implementation running on a four-core CPU by 67%.
/content/cudazone/CUDABrowser/assets/images/applications/970_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/970_cs_large.jpg
Research
2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
2009
12
11
12/11/2009
Giovanni Agosta
Alessandro Barenghi
Fabrizio De Santis
Paper
Science
Giovanni Agosta,Alessandro Barenghi,Fabrizio De Santis
4471c01f-7076-4798-acd4-af519ff3ae9e
Optical Flow Computation on Compute Unified Device Architecture
In this study, the implementation of an image processing technique on Compute Unified Device Architecture (CUDA) is discussed. CUDA is a new hardware and software architecture developed by NVIDIA Corporation for general-purpose computation on graphics processing units.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICIAP.2007.97
/content/cudazone/CUDABrowser/assets/images/applications/969_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/969_cs_large.jpg
Academia
Yamaguchi University, Japan
2007
09
14
09/14/2007
Yoshiki Mizukami
Katsumi Tadamura
Paper
Science
Yoshiki Mizukami,Katsumi Tadamura
8252cbfd-7350-4004-acfb-096dfea1d9e2
Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures
Medical volumetric imaging requires high fidelity, high performance rendering algorithms. We motivate and analyze new volumetric rendering algorithms that are suited to modern parallel processing architectures. First, we describe the three major categories of volume rendering algorithms and confirm through an imaging scientist-guided evaluation that ray-casting is the most acceptable.
http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2009.164
/content/cudazone/CUDABrowser/assets/images/applications/967_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/967_cs_large.jpg
Research
Intel Corporation
2009
11
15
11/15/2009
Mikhail Smelyanskiy
David Holmes
Jatin Chhugani
Paper
Medical Imaging
Mikhail Smelyanskiy,David Holmes,Jatin Chhugani
bf5ce6bc-a2c0-4262-961b-d0fe4504edc1
GPU-accelerated, gradient-free MI deformable registration for atlas-based MR brain image segmentation
Brain structure segmentation is an important task in many neuroscience and clinical applications. In this paper, we introduce a novel MI-based dense deformable registration method and apply it to the automatic segmentation of detailed brain structures.
http://www.computer.org/portal/web/csdl/doi/10.1109/CVPR.2009.5204043
/content/cudazone/CUDABrowser/assets/images/applications/966_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/966_cs_large.jpg
Academia
Maryland Heights
2009
06
25
06/25/2009
Xiao Han
L.S. Hibbard
V. Willcut
Paper
Science
Xiao Han,L.S. Hibbard,V. Willcut
97e427af-035a-48b2-944a-44029f2b874e
Efficient band approximation of Gram matrices for large scale kernel methods on GPUs
Kernel-based methods require O(N^2) time and space complexities to compute and store non-sparse Gram matrices, which is prohibitively expensive for large scale problems. We introduce a novel method to approximate a Gram matrix with a band matrix. Our method relies on the locality preserving properties of space filling curves, and the special structure of Gram matrices. Our approach has several important merits.
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654091
/content/cudazone/CUDABrowser/assets/images/applications/965_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/965_cs_large.jpg
Research
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
2009
11
20
11/20/2009
Mohamed Hussein
Wael Abd-Almageed
Paper
Science
Mohamed Hussein,Wael Abd-Almageed
0286b4ba-3c62-4513-b654-eb17ca5eb44f
CUDA-Based Jacobi's Iterative Method
Solving linear equations is a common problem in the fields of science and engineering, and accelerating the solving process is of great significance. Modern GPUs are high-performance many-core processors suited to large-scale parallel computing, and they offer a novel way to accelerate the solving process. A GPU-based parallel Jacobi's iterative solver for dense linear equations is presented in this paper.
http://www.computer.org/portal/web/csdl/doi/10.1109/IFCSTA.2009.68
/content/cudazone/CUDABrowser/assets/images/applications/964_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/964_cs_large.jpg
Research
2009 International Forum on Computer Science-Technology and Applications
2009
12
27
12/27/2009
Zhihui Zhang
Qinghai Miao
Ying Wang
Paper
Science
Zhihui Zhang,Qinghai Miao,Ying Wang
aa5f9c87-b466-42cd-bbbf-e6d69a883462
Voice Command Recognition with Dynamic Time Warping (DTW) using GPU with CUDA
Recently, we have been witnessing a huge evolution in the development of high-performance computing platforms. Among these platforms, the GPU (Graphics Processing Unit), stimulated by the game industry's constant demand for more graphics processing power, has evolved from a simple graphics card into a general-purpose parallel data processing device.
http://www.computer.org/portal/web/csdl/doi/10.1109/SBAC-PAD.2007.21
/content/cudazone/CUDABrowser/assets/images/applications/963_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/963_cs_large.jpg
Research
19th International Symposium on Computer Architecture and High Performance Computing
2007
10
27
10/27/2007
Gustavo Poli
João F. Mari
José Hiroki Saito
Paper
Science
Gustavo Poli,João F. Mari,José Hiroki Saito
6f659aa3-3402-474c-860e-06af9f94e3f8
NVIDIA Tesla: A Unified Graphics and Computing Architecture
To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture. Its scalable parallel array of processors is massively multithreaded and programmable in C or via graphics APIs.
/content/cudazone/CUDABrowser/assets/images/applications/961_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/961_cs_large.jpg
Research
NVIDIA Corp.
http://www.nvidia.com/cuda
2008
04
01
04/01/2008
Erik Lindholm
John Nickolls
Stuart Oberman
Paper
Science
Erik Lindholm,John Nickolls,Stuart Oberman
dee4e626-3b28-4f97-b21f-51e7af3cd36a
Parallel Computing Experiences with CUDA
The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA's Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems and the parallel speedups over sequential codes running on traditional CPU architectures attained by executing key computations on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/956_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/956_cs_large.jpg
Research
NVIDIA Corp.
http://www.nvidia.com/cuda
2008
08
01
08/01/2008
Michael Garland
Scott Le Grand
John Nickolls
Paper
Science
Michael Garland,Scott Le Grand,John Nickolls
33b8d91d-369c-4bcf-a02d-dea9fc848e19
Low-cost, high-speed computer vision using NVIDIA's CUDA architecture
In this paper, we introduce real-time image processing techniques using modern programmable Graphics Processing Units (GPUs). GPUs are SIMD (Single Instruction, Multiple Data) devices that are inherently data-parallel.
http://www.computer.org/portal/web/csdl/doi/10.1109/AIPR.2008.4906458
/content/cudazone/CUDABrowser/assets/images/applications/954_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/954_cs_large.jpg
Academia
Virginia Polytechnic Institute and State University, Blacksburg
2008
10
17
10/17/2008
Seung In Park
Sean P. Ponce
Jing Huang
Paper
Science
Seung In Park,Sean P. Ponce,Jing Huang
27bfd115-787f-47af-bc51-e3978fe90dc2
K-Means on Commodity GPUs with CUDA
The K-means algorithm is one of the most famous unsupervised clustering algorithms. Many theoretical improvements to the performance of the original algorithm have been put forward, but almost all of them are based on Single Instruction Single Data (SISD) architecture processors (CPUs), which partly ignores the inherently parallel character of the algorithm.
http://www.computer.org/portal/web/csdl/doi/10.1109/CSIE.2009.491
/content/cudazone/CUDABrowser/assets/images/applications/947_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/947_cs_large.jpg
Research
2009 WRI World Congress on Computer Science and Information Engineering
2009
04
02
04/02/2009
Bai Hong-tao
He Li-li
Ouyang Dan-tong
Paper
Science
Bai Hong-tao,He Li-li,Ouyang Dan-tong
e458a4f1-5693-4f34-beb8-9f46e2d0e158
Hierarchical Agglomerative Clustering Using Graphics Processor with Compute Unified Device Architecture
We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with NVIDIA's Compute Unified Device Architecture (CUDA). Although the advancement in graphics cards has made the gaming industry flourish, there is a lot more to be gained in the fields of scientific computing, high performance computing and their applications.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICSPS.2009.167
/content/cudazone/CUDABrowser/assets/images/applications/945_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/945_cs_large.jpg
Academia
2009 International Conference on Signal Processing Systems
2009
05
17
05/17/2009
S.A. Arul Shalom
Manoranjan Dash
Minh Tue
Paper
Science
S.A. Arul Shalom,Manoranjan Dash,Minh Tue
cf4cef10-6cad-4a80-94eb-fca17c2968c6
Compute Pairwise Manhattan Distance and Pearson Correlation Coefficient of Data Points with GPU
Graphics processing units (GPUs) are powerful computational devices tailored towards the needs of the 3-D gaming industry for high-performance, real-time graphics engines. Nvidia Corporation released a new generation of GPUs designed for general-purpose computing in 2006, and it released a GPU programming language called CUDA in 2007.
http://www.computer.org/portal/web/csdl/doi/10.1109/SNPD.2009.34
/content/cudazone/CUDABrowser/assets/images/applications/944_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/944_cs_large.jpg
Academia
Catholic University of Daegu, Korea
2009
05
29
05/29/2009
Dar-Jen Chang
Ahmed H. Desoky
Ming Ouyang
Paper
Science
Dar-Jen Chang,Ahmed H. Desoky,Ming Ouyang
f61532c5-2f66-40d8-8f6c-04b50f5bbefd
Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA
Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method.
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160924
/content/cudazone/CUDABrowser/assets/images/applications/940_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/940_cs_large.jpg
Academia
Nanyang Technological University, Singapore
2009
05
29
05/29/2009
Haixiang Shi
Bertil Schmidt
Weiguo Liu
Paper
Science
Haixiang Shi,Bertil Schmidt,Weiguo Liu
158694c3-4565-45ec-9051-9c1dfd7120d0
An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases
The Smith Waterman algorithm for sequence alignment is one of the main tools of bioinformatics. It is used for sequence similarity searches and alignment of similar sequences.
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160931
/content/cudazone/CUDABrowser/assets/images/applications/937_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/937_cs_large.jpg
Academia
University of Warsaw, Poland
2009
05
29
05/29/2009
Lukasz Ligowski
Witold Rudnicki
Paper
Science
Lukasz Ligowski,Witold Rudnicki
e58ea7e5-f7df-481c-8b10-3b3eccd9e977
CuPP - A framework for easy CUDA integration
This paper reports on CuPP, our newly developed C++ framework designed to ease integration of NVIDIA's GPGPU system CUDA into existing C++ applications. CuPP provides interfaces to recurring tasks that are easier to use than the standard CUDA interfaces. In this paper we concentrate on memory management and related data structures.
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160937
/content/cudazone/CUDABrowser/assets/images/applications/936_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/936_cs_large.jpg
Academia
Universität Kassel, Germany
2009
05
29
05/29/2009
Jens Breitbart
Paper
Science
Jens Breitbart
b854753b-3791-4a9c-9b9a-4fe6700b4aa1
Parallel reconstruction of neighbor-joining trees for large multiple sequence alignments using CUDA
Computing large multiple protein sequence alignments using progressive alignment tools such as ClustalW requires several hours on state-of-the-art workstations. ClustalW uses a three-stage processing pipeline:
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5160923
/content/cudazone/CUDABrowser/assets/images/applications/934_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/934_cs_large.jpg
Academia
Nanyang Technological University, Singapore
2009
05
29
05/29/2009
Yongchao Liu
Bertil Schmidt
Douglas L. Maskell
Paper
Science
Yongchao Liu,Bertil Schmidt,Douglas L. Maskell
cb7682ca-4aae-4703-ae29-e99695f85d91
Ocean3DTechnology
Simulation of oceanic surfaces; physics calculations for objects in a water environment.
/content/cudazone/CUDABrowser/assets/images/applications/933_ocean_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/933_ocean_large.jpg
Commercial
Ocean3DInteractive
http://www.ocean3dinteractive.com
2009
04
15
04/15/2009
Commercial
Mykola Ozerchuk
Multimedia
Game Physics
Graphics
Mykola Ozerchuk
23e01f03-877d-433d-8b31-74754d82b8d9
FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications.
http://www.computer.org/portal/web/csdl/doi/10.1109/SASP.2009.5226333
/content/cudazone/CUDABrowser/assets/images/applications/932_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/932_cs_large.jpg
Academia
University of Illinois
2009
07
27
07/27/2009
Alexandros Papakonstantinou
Karthik Gururaj
John A. Stratton
Paper
Science
Alexandros Papakonstantinou,Karthik Gururaj,John A. Stratton
5894f1ec-6a2f-4df5-ac17-a8a77ada7394
MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA
Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-CUDA, a parallel MSA program, which parallelizes all three stages of the ClustalW processing pipeline using CUDA and achieves significant speedups compared to the sequential ClustalW for a variety of large protein sequence datasets.
http://www.computer.org/portal/web/csdl/doi/10.1109/ASAP.2009.14
/content/cudazone/CUDABrowser/assets/images/applications/931_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/931_cs_large.jpg
Research
2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors
2009
07
07
07/07/2009
36
Yongchao Liu
Bertil Schmidt
Douglas L. Maskell
Paper
Science
Yongchao Liu,Bertil Schmidt,Douglas L. Maskell
0e4de9b3-9658-4d42-b197-6ef57ab2d2ee
Getting Started with GPU Programming
This tutorial describes a step-by-step procedure for programming a Macintosh Nvidia GPU. General scientific programmers with some C knowledge can get started in parallel processing application development with relative ease.
/content/cudazone/CUDABrowser/assets/images/applications/930_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/930_cs_large.jpg
Research
American University
2009
08
01
08/01/2009
Michael A. Gray
Paper
Science
Michael A. Gray
9777acef-c2ae-4810-a2bb-809471ddc369
An Empirically Optimized Radix Sort for GPU
Graphics Processing Units (GPUs) that support general-purpose programming are promising platforms for high performance computing. However, the fundamental architectural difference between GPU and CPU, the complexity of the GPU platform and the diversity of GPU specifications have made the generation of highly efficient code for GPUs increasingly difficult. Manual code generation is time consuming and the result tends to be difficult to debug and maintain.
http://www.computer.org/portal/web/csdl/doi/10.1109/ISPA.2009.89
/content/cudazone/CUDABrowser/assets/images/applications/929_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/929_cs_large.jpg
Research
2009 IEEE International Symposium on Parallel and Distributed Processing with Applications
2009
08
10
08/10/2009
Bonan Huang
Jinlan Gao
Xiaoming Li
Paper
Science
Bonan Huang,Jinlan Gao,Xiaoming Li
8a13d819-67c1-47d5-994b-7a995ba156b8
Accelerating Genome-Wide Association Studies Using CUDA Compatible Graphics Processing Units
Recent advances in highly parallel, multithreaded, manycore Graphics Processing Units (GPUs) have been enabling massively parallel implementations of many applications in bioinformatics. In this paper, we describe a parallel implementation of genome-wide association studies (GWAS) using Compute Unified Device Architecture (CUDA). Using a single NVIDIA GTX 280 graphics card, we achieve speedups of about 15 times over an Intel Xeon E5420. We also implement a highly scalable, massively parallel GWAS system using the Message Passing Interface (MPI) and show that a single GTX 280 can deliver performance similar to that of a 16-node cluster. We further apply the GPU program to two real genome-wide case-control data sets. The results show that the GPU program is 17.7 times as fast as the CPU version for an Age-related Macular Degeneration (AMD) data set and 25.7 times as fast as the CPU version for a Parkinson's disease data set.
/content/cudazone/CUDABrowser/assets/images/applications/928_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/928_cs_large.jpg
Research
2009 International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing
2009
08
03
08/03/2009
25
Rui Jiang
Feng Zeng
Wangshu Zhang
Paper
Science
Rui Jiang,Feng Zeng,Wangshu Zhang
ff55e59c-e068-4456-a289-f60c94909099
Power Efficient Large Matrices Multiplication by Load Scheduling on Multi-core and GPU Platform with CUDA
Power efficiency is one of the most important issues in high performance computing (HPC), and it depends on both software and hardware. The power dissipation of a program depends on the algorithm design and on the power features of the computer components on which the program runs. In this work, we measure and model the power consumption of large matrix multiplication on a multi-core CPU and GPU platform. By combining the major physical power constraints of the hardware components with an analysis of program execution behavior, we reduce overall power consumption by using a multithreaded CPU to control two GPU devices computing synchronously in parallel. By implementing this method on a real system, we show that it can save 22% of the energy and speed up the kernel execution time by 71%, compared with solving the same large matrix multiplication using a single CPU and GPU combination.
/content/cudazone/CUDABrowser/assets/images/applications/927_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/927_cs_large.jpg
Academia
2009 International Conference on Computational Science and Engineering
2009
08
29
08/29/2009
DaQi Ren
Reiji Suda
Paper
Science
DaQi Ren,Reiji Suda
36b2c6dc-0c25-4e19-a1b1-4f01eb1ba9a3
Solving 0/1 Knapsack Problem for Light Communication SLA-Based Workflow Mapping Using CUDA
Mapping and running jobs on suitable resources are the core tasks in Grid Computing. In the algorithm to map light communication Grid-based workflows within the SLA context, there is an operation of resolving the conflict period, which is exactly a 0/1 knapsack problem. When the workflow is large, as in the case of mapping a group of workflows, solving this problem takes a long time and thus makes the whole mapping process long. In this paper, we describe a way to solve this problem by exploiting the parallel computing power of the Graphics Processing Unit (GPU) with Compute Unified Device Architecture (CUDA). The experiment shows that the approach is very efficient for huge problems.
/content/cudazone/CUDABrowser/assets/images/applications/926_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/926_cs_large.jpg
Academia
2009 International Conference on Computational Science and Engineering
2009
08
29
08/29/2009
Dang Minh Quan
Laurence T. Yang
Paper
Science
Dang Minh Quan,Laurence T. Yang
5c12478a-a0ac-4a51-b496-dea3b86a936f
CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator
Modern GPUs open a completely new field to optimize embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy whose organization and performance implications are radically different from those of general purpose CPUs; and optimizing programs at the instruction-level for the GPU.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPPW.2009.78
/content/cudazone/CUDABrowser/assets/images/applications/925_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/925_cs_large.jpg
Academia
2009 International Conference on Parallel Processing Workshops
2009
09
25
09/25/2009
Jakob Siegel
Juergen Ributzka
Xiaoming Li
Paper
Science
Jakob Siegel,Juergen Ributzka,Xiaoming Li
4e0a3931-0a17-43fc-a2e9-068c23a4a0ea
String Matching on a Multicore GPU Using CUDA
Graphics Processing Units (GPUs) have evolved over the past few years from dedicated graphics rendering devices to powerful parallel processors, outperforming traditional Central Processing Units (CPUs) in many areas of scientific computing. The use of GPUs as processing elements was very limited until recently, when the concept of General-Purpose computing on Graphics Processing Units (GPGPU) was introduced. GPGPU made it possible to exploit the processing power and the memory bandwidth of GPUs through APIs that hide the GPU hardware from programmers. This paper presents experimental results on the parallel processing of some well-known on-line string matching algorithms using one such GPU abstraction API, the Compute Unified Device Architecture (CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/924_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/924_cs_large.jpg
Academia
Corfu, Greece
2009
09
12
09/12/2009
Charalampos S. Kouzinopoulos
Konstantinos G. Margaritis
Paper
Science
Charalampos S. Kouzinopoulos,Konstantinos G. Margaritis
fffa6694-d276-4fc7-938e-d1ae95148346
Isosurface Extraction and View-Dependent Filtering from Time-Varying Fields Using Persistent Time-Octree (PTOT)
We develop a new algorithm for isosurface extraction and view-dependent filtering from large time-varying fields, by using a novel Persistent Time-Octree (PTOT) indexing structure.
http://www.computer.org/portal/web/csdl/doi/10.1109/TVCG.2009.160
/content/cudazone/CUDABrowser/assets/images/applications/923_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/923_cs_large.jpg
Academia
Polytechnic Institute of New York University
2009
12
01
12/01/2009
Cong Wang
Yi-Jen Chiang
Paper
Science
Cong Wang,Yi-Jen Chiang
ffccf360-34bc-43f7-8fd1-de125348ba45
Simulation of P Systems with Active Membranes on CUDA
P systems, or membrane systems, provide a high-level computational modeling framework that combines the structural and dynamic aspects of biological systems in a relevant and understandable way. P systems are massively parallel, distributed, and non-deterministic systems. In this paper, we describe the implementation of a simulator for the class of recognizer P systems with active membranes using the GPU (Graphics Processing Unit). We compare the high performance parallel simulator for the GPU to the simulator developed on a single CPU (Central Processing Unit), and we show that the GPU is better suited than the CPU to simulate P systems due to its highly parallel nature.
/content/cudazone/CUDABrowser/assets/images/applications/922_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/922_cs_large.jpg
Academia
CoSBi, Trento, Italy
2009
10
14
10/14/2009
Jose Maria Cecilia Canales
Jose Manuel Garcia Carrasco
Paper
Science
Jose Maria Cecilia Canales,Jose Manuel Garcia Carrasco
c4e34cb4-2b20-4b45-96ad-99b180dbcc47
Auto-tuning 3-D FFT library for CUDA GPUs
Existing implementations of FFTs on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable and peaky performance, i.e., they do not perform as well for other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this problem.
http://www.computer.org/portal/web/csdl/doi/10.1145/1654059.1654090
/content/cudazone/CUDABrowser/assets/images/applications/921_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/921_cs_large.jpg
Academia
Tokyo Institute of Technology and Japan Science and Technology Agency
2009
11
14
11/14/2009
Akira Nukada
Satoshi Matsuoka
Paper
Science
Akira Nukada,Satoshi Matsuoka
c94e9a92-86b2-46a4-bc60-9910216c5d48
CUDA Accelerated LTL Model Checking
Recent technological developments have made various many-core hardware platforms available. For example, a SIMD-like hardware architecture has become easily accessible to many users whose computers are equipped with modern NVIDIA GPU cards with CUDA technology. In this paper we redesign the maximal accepting predecessors algorithm [7] for LTL model checking in terms of matrix-vector products in order to accelerate LTL model checking on many-core GPU platforms. Our experiments demonstrate that using the NVIDIA CUDA technology results in a significant speedup of the verification process.
/content/cudazone/CUDABrowser/assets/images/applications/919_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/919_cs_large.jpg
Academia
Shenzhen, Guangdong, China
2009
12
11
12/11/2009
Jiri Barnat
Lubos Brim
Milan Ceska
Paper
Science
Jiri Barnat,Lubos Brim,Milan Ceska
810014ad-f17c-406d-aff6-737150d18fdd
RankBoost Acceleration on both NVIDIA CUDA and ATI Stream Platforms
NVIDIA CUDA and ATI Stream are the two major general-purpose GPU (GPGPU) computing technologies. We implemented RankBoost, a web relevance ranking algorithm, on both NVIDIA CUDA and ATI Stream platforms to accelerate the algorithm and illustrate the differences between these two technologies.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.115
/content/cudazone/CUDABrowser/assets/images/applications/917_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/917_cs_large.jpg
Academia
Shenzhen, Guangdong, China
2009
12
11
12/11/2009
Bo Wang
Tianji Wu
Feng Yan
Paper
Science
Bo Wang,Tianji Wu,Feng Yan
38e91cb6-7537-44d0-9a2f-3fb26b020e88
Optimal Data Distribution for Versatile Finite Impulse Response Filtering on Next-Generation Graphics Hardware Using CUDA
In this paper, we investigate discrete finite impulse response (FIR) filtering of images, while harnessing the powerful computational resources of next-generation GPUs. These novel platforms exhibit a massive data parallel architecture with an advanced SIMT execution model and thread management, to enable designers to better cope with the infamous memory wall, i.e. the growing gap between the cost of data communication and computational processing.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.79
/content/cudazone/CUDABrowser/assets/images/applications/916_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/916_cs_large.jpg
Academia
Shenzhen, Guangdong, China
2009
12
11
12/11/2009
Patrik Goorts
Sammy Rogmans
Philippe Bekaert
Paper
Science
Patrik Goorts,Sammy Rogmans,Philippe Bekaert
b6017842-9107-4a99-b3f3-b15cc13fe777
Parallel Lexicographic Names Construction with CUDA
The suffix array is a simpler and more compact alternative to the suffix tree, and lexicographic name construction is the fundamental building block of the suffix array construction process. This paper describes the design issues of the first data-parallel implementation of the lexicographic name construction algorithm on a commodity multiprocessor GPU using the Compute Unified Device Architecture (CUDA) platform, both from NVIDIA Corporation.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.31
/content/cudazone/CUDABrowser/assets/images/applications/915_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/915_cs_large.jpg
Academia
Shenzhen, Guangdong, China
2009
12
11
12/11/2009
Weidong Sun
Zongmin Ma
Paper
Science
Weidong Sun,Zongmin Ma
0b8f9a03-72c0-4eb4-bed8-7189f3048805
Program Optimization of Array-Intensive SPEC2k Benchmarks on Multithreaded GPU Using CUDA and Brook+
The Graphics Processing Unit (GPU), with its many lightweight data-parallel cores, can provide substantial parallel computing power to accelerate general-purpose applications. Both AMD and NVIDIA provide their own high performance GPUs and software platforms.
http://www.computer.org/portal/web/csdl/doi/10.1109/ICPADS.2009.12
/content/cudazone/CUDABrowser/assets/images/applications/914_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/914_cs_large.jpg
Academia
Shenzhen, Guangdong, China
2009
12
11
12/11/2009
Guibin Wang
Tao Tang
Xudong Fang
Paper
Science
Guibin Wang,Tao Tang,Xudong Fang
a76e6a95-dc8f-444b-af64-badd2fddee07
Accelerating Multi-scale Image Fusion Algorithms Using CUDA
Recently, fusion speed has emerged as an important factor in image fusion, and a substantial amount of memory and computing power is required for high-speed fusion. This paper presents approaches to accelerating multi-scale image fusion on the GPU (Graphics Processing Unit) using CUDA (Compute Unified Device Architecture).
http://www.computer.org/portal/web/csdl/doi/10.1109/SoCPaR.2009.63
/content/cudazone/CUDABrowser/assets/images/applications/913_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/913_cs_large.jpg
Academia
Malacca, Malaysia
2007
12
14
12/14/2007
Seung-Hun Yoo
Jin-Hyung Park
Chang-Sung Jeong
Paper
Science
Seung-Hun Yoo,Jin-Hyung Park,Chang-Sung Jeong
62ed4115-e5b5-4c6d-947e-3cb75f5c66d5
An Improved Parallel Implementation of 3D DRIE Simulation on GPU
The deep reactive ion etching (DRIE) technique is a new and powerful tool in Micro-Electro-Mechanical Systems (MEMS) fabrication. A 3D DRIE simulation can help researchers understand the time evolution of the Bosch process used in DRIE. Due to the high complexity of the algorithm used in the simulation, it is necessary to develop an algorithm that can accelerate the simulation.
http://www.computer.org/portal/web/csdl/doi/10.1109/I-SPAN.2009.111
/content/cudazone/CUDABrowser/assets/images/applications/912_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/912_cs_large.jpg
Academia
Kaohsiung, Taiwan
2008
12
14
12/14/2008
Fan Zhang
Gang Wang
Xiaoguang Liu
Paper
Science
Fan Zhang,Gang Wang,Xiaoguang Liu
5f454c72-f9aa-4a01-b530-1dc441677f43
CheCUDA: A Checkpoint/Restart Tool for CUDA Applications
In this paper, a tool named CheCUDA is designed to checkpoint CUDA applications that use GPUs as accelerators. As existing checkpoint/restart implementations do not support checkpointing the GPU status, CheCUDA hooks a subset of the basic CUDA driver API calls in order to record the status changes in main memory.
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.78
/content/cudazone/CUDABrowser/assets/images/applications/911_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/911_cs_large.jpg
Academia
Higashi Hiroshima, Japan
2008
12
11
12/11/2008
Hiroyuki Takizawa
Katsuto Sato
Kazuhiko Komatsu
Paper
Science
Hiroyuki Takizawa,Katsuto Sato,Kazuhiko Komatsu
d7276c60-49fb-4114-8743-89bca632df40
Accurate Measurements and Precise Modeling of Power Dissipation of CUDA Kernels toward Power Optimized High Performance CPU-GPU Computing
Power dissipation is one of the most pressing limiting factors influencing the development of High Performance Computing (HPC). Toward power-efficient HPC on CPU-GPU hybrid platforms, we are investigating software methodologies to achieve optimized power utilization through algorithm design and programming technique. In this paper we discuss power measurements of GPU
http://www.computer.org/portal/web/csdl/doi/10.1109/PDCAT.2009.65
/content/cudazone/CUDABrowser/assets/images/applications/910_cs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/910_cs_large.jpg
Academia
Higashi Hiroshima, Japan
2008
12
11
12/11/2008
Reiji Suda
Da Qi Ren
Paper
Science
Reiji Suda,Da Qi Ren
985ae77e-9e9e-4719-b960-cce5bb84051a
Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs Using CUDA
The expectation maximization (EM) algorithm is an iterative technique widely used in the fields of signal processing and data mining. We present a parallel implementation of EM for finding maximum likelihood estimates of the parameters of Gaussian mixture models, designed for the many-core architecture of Graphics Processing Units (GPUs).
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5166982
/content/cudazone/CUDABrowser/assets/images/applications/909_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/909_logo_xplore_large.gif
Research
NVIDIA Corp.
http://www.nvidia.com/cuda
2009
07
17
07/17/2009
Kumar, N
Satoor, S
Buck, I
Paper
Science
Kumar, N,Satoor, S,Buck, I
1a3dd903-83a7-40a0-ba34-0f84d4c7df59
Parallelizing Motion JPEG 2000 with CUDA
Due to the rapid growth of Graphics Processing Unit (GPU) processing capability, using GPU as a coprocessor for assisting the CPU in computing massive data has become indispensable. Nvidia's CUDA general-purpose graphical processing unit (GPGPU) architecture can greatly benefit single instruction multiple thread (SIMT) styled, computationally expensive programs.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5380169
/content/cudazone/CUDABrowser/assets/images/applications/907_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/907_logo_xplore_large.gif
Research
IEEE
2010
01
15
01/15/2010
Datla, Sanketh
Gidijala, Naga Sathish
Paper
Science
Datla, Sanketh,Gidijala, Naga Sathish
e8ece29c-cc39-4c42-a3fb-4f4d78a4a3bf
Reliability modeling of MEMS devices on CUDA based HPC setup
In this paper, we review developments in CUDA and the implementation, on a CUDA setup, of the various distributions used in reliability modeling of MEMS-based devices.
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5340289
/content/cudazone/CUDABrowser/assets/images/applications/905_logo_xplore_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/905_logo_xplore_large.gif
Academia
Acropolis Inst. of Technol. & Res., Indore, India
2009
11
24
11/24/2009
Pathak, R
Joshi, S
Paper
Science
Pathak, R,Joshi, S
fe250813-b016-4b71-b44d-80e8de5f4166
Survey on Parallel Programming Model
The development of microprocessor design has been shifting to multi-core architectures. Therefore, it is expected that parallelism will play a significant role in future generations of applications. Over the years, a myriad of parallel programming models have been proposed. In choosing a parallel programming model, not only the performance aspect is important, but also the qualitative aspect of how well parallelism is abstracted for developers. A model that abstracts parallelism well leads to higher application-development productivity. In this paper, we propose seven criteria to qualitatively evaluate parallel programming models. Our focus is on how parallelism is abstracted and presented to application developers. As a case study, we use these criteria to investigate six well-known parallel programming models in the HPC community.
/content/cudazone/CUDABrowser/assets/images/applications/904_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/904_cover-medium_large.jpg
Research
Sun Microsystems
2008
10
11
10/11/2008
Henry Kasim
Verdi March
Rita Zhang
Paper
Science
Henry Kasim,Verdi March,Rita Zhang,henry.kasim@sun.com,verdi.march@sun.com,rita.zhang@sun.com
4736ba0b-002f-4ca5-b107-c70a1e05a004
A Variational Approach to Semiautomatic Generation of Digital Terrain Models
We present a semiautomatic approach to generate high quality digital terrain models (DTM) from digital surface models (DSM). A DTM is a model of the Earth's surface, where all man-made objects and the vegetation have been removed. In order to achieve this, we use a variational energy minimization approach. The proposed energy functional incorporates Huber regularization to yield piecewise smooth surfaces and an L1 norm in the data fidelity term. Additionally, a minimum constraint is used in order to prevent the ground level from pulling up, while buildings and vegetation are pulled down. Being convex, the proposed formulation allows us to compute the globally optimal solution. Clearly, a fully automatic approach does not yield the desired result in all situations. Therefore, we additionally allow the user to affect the algorithm using different user interaction tools. Furthermore, we provide a real-time 3D visualization of the output of the algorithm which additionally helps the user to assess the final DTM. We present results of the proposed approach using several real data sets.
/content/cudazone/CUDABrowser/assets/images/applications/903_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/903_cover-medium_large.jpg
Academia
Graz University of Technology
2009
11
26
11/26/2009
Andreas Klaus
Thomas Pock
Markus Grabner
Paper
Science
Andreas Klaus,Thomas Pock,Markus Grabner
58b1aaf4-695f-453b-833a-0c77c40fcae7
Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
We discuss implementing blocked sparse matrix-vector multiplication for NVIDIA GPUs. We outline an algorithm and various optimizations, and identify potential future improvements and challenging tasks. In comparison with previously published implementations, our implementation is faster on matrices having many high fill-ratio blocks but slower on matrices with a low number of non-zero elements per row.
/content/cudazone/CUDABrowser/assets/images/applications/902_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/902_cover-medium_large.jpg
Academia
Institute for System Programming of RAS, Russia
2009
07
21
07/21/2009
Alexander Monakov
Arutyun Avetisyan
Paper
Science
Alexander Monakov,Arutyun Avetisyan,amonakov@ispras.ru,arut@ispras.ru
ba6d3a3b-b774-495a-8c34-f7c46204b175
AtelierM++: a fast and accurate marbling system
We present AtelierM++, a new interactive marbling image rendering system which allows artists to create marbling textures with real-time visual feedback on mega-pixel sized images. Marbling is a method of aqueous surface design, which can produce patterns similar to marble or other stone, hence the name. The system is based on the physical model of the traditional marbling process. We simulate real marbling by solving the Navier-Stokes equations on the graphics processing unit. We employ a third-order accurate but fast Unsplit semi-Lagrangian Constrained Interpolation Profile method to reduce the numerical dissipation while retaining the stability. To simulate very sharp interface lines among different paints, a simple yet effective transformation function is applied to the paint concentrations. Several intuitive interfaces are implemented to provide flexible control for users. Extensive experimental results are shown to demonstrate both the effectiveness and efficiency of the proposed approach.
/content/cudazone/CUDABrowser/assets/images/applications/901_cover-medium7_small.png
/content/cudazone/CUDABrowser/assets/images/applications/901_cover-medium7_large.png
Academia
Zhejiang University, China
2009
05
12
05/12/2009
Hanli Zhao
Xiaogang Jin
Shufang Lu
Paper
Science
Hanli Zhao,Xiaogang Jin,Shufang Lu,hanlizhao@gmail.com,jin@cad.zju.edu.cn,lushufang@cad.zju.edu.cn
13440b6a-5288-46fb-912a-0a7945d88544
Implementing P Systems Parallelism by Means of GPUs
Software development for Membrane Computing is growing, yielding new applications. Nowadays, the efficiency of P systems simulators has become a critical point when working with instances of large size. The newest generation of GPUs (Graphics Processing Units) provides a massively parallel framework for general-purpose computation. We present GPUs as an alternative for obtaining better performance in the simulation of P systems, and we illustrate it by giving a solution to the N-Queens problem as an example.
/content/cudazone/CUDABrowser/assets/images/applications/900_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/900_cover-medium_large.jpg
Academia
University of Sevilla, Spain
2010
01
20
01/20/2010
Jose M. Cecilia
Jose M. Garcia
Gines D. Guerrero
Paper
Science
Jose M. Cecilia,Jose M. Garcia,Gines D. Guerrero,chema@ditec.um.es,jmgarcia@ditec.um.es,gines.guerrero@ditec.um.es
b21f6d6e-aa27-42cf-b19a-8c8d970e2433
Real-Time Neighborhood Based Disparity Estimation Incorporating Temporal Evidence
This paper presents a system for dense area-based disparity estimation from binocular rectified image sequences with the integration of temporal evidence. The system uses dense 2D optical flow fields and temporally displaced disparity estimates to reason about the observed 3D scene flow. This scene flow is then exploited to strengthen temporally consistent observations in the disparity estimation. Moreover, a novel neighborhood assumption is presented, which allows the presented algorithm to be implemented seamlessly on the GPU. It is shown that, by means of the presented approach, the sensitivity to noise and ambiguities observed with plain real-time disparity estimations can be reduced, even in fully dynamic scenarios with simultaneous movement of objects and cameras.
/content/cudazone/CUDABrowser/assets/images/applications/899_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/899_cover-medium_large.jpg
Academia
University of Kiel, Germany
2008
06
29
06/29/2008
Bogumil Bartczak
Daniel Jung
Reinhard Koch
Paper
Science
Bogumil Bartczak,Daniel Jung,Reinhard Koch,bartczak@mip.informatik.uni-kiel.de,djung@mip.informatik.uni-kiel.de,rk@mip.informatik.uni-kiel.de
82775ff7-a0ec-4b29-8d21-f366cda8039a
Relighting Forest Ecosystems
Real-time cinematic relighting of large forest ecosystems remains a challenging problem, in that important global illumination effects, such as leaf transparency and inter-object light scattering, are difficult to capture, given tight timing constraints and scenes that typically contain hundreds of millions of primitives. A solution that is based on a lattice-Boltzmann method is suggested. Reflectance, transmittance, and absorptance parameters are taken from measurements of real plants and integrated into a parameterized, dynamic global illumination model. When the model is combined with fast shadow rays, traced on a GPU, near real-time cinematic relighting is achievable for forest scenes containing hundreds of millions of polygons.
/content/cudazone/CUDABrowser/assets/images/applications/898_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/898_cover-medium_large.jpg
Academia
Clemson University
2009
11
26
11/26/2009
Jay E. Steele
Robert Geist
Paper
Science
Jay E. Steele,Robert Geist,jesteel@cs.clemson.edu,geist@cs.clemson.edu
179e73fd-8aa8-41ce-9778-7fc4aa5d5044
Acceleration of cardiac tissue simulation with graphic processing units
In this technical note we show the promise of using graphic processing units (GPUs) to accelerate simulations of electrical wave propagation in cardiac tissue, one of the more demanding computational problems in cardiology. We have found that the computational speed of two-dimensional (2D) tissue simulations with a single commercially available GPU is about 30 times faster than with a single 2.0 GHz Advanced Micro Devices (AMD) Opteron processor. We have also simulated wave conduction in the three-dimensional (3D) anatomic heart with GPUs where we found the computational speed with a single GPU is 1.6 times slower than with a 32-central processing unit (CPU) Opteron cluster. However, a cluster with two or four GPUs is faster than the CPU-based cluster. These results demonstrate that a commodity personal computer is able to perform a whole heart simulation of electrical wave conduction within times that enable the investigators to interact more easily with their simulations.
/content/cudazone/CUDABrowser/assets/images/applications/897_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/897_prediction_large.png
Academia
David Geffen School of Medicine at UCLA, Los Angeles, CA
2009
08
04
08/04/2009
Daisuke Sato
Alan Garfinkel
Paper
Computer Aided Engineering
Daisuke Sato,Alan Garfinkel,dasato@mednet.ucla.edu,agarfinkel@mednet.ucla.edu
765b8ab8-2775-423e-8d68-3d5f4a6cc0b5
Real-Time Prediction of Brain Shift Using Nonlinear Finite Element Algorithms
Patient-specific biomechanical models implemented using specialized nonlinear (i.e. taking into account material and geometric nonlinearities) finite element procedures were applied to predict the deformation field within the brain for five cases of craniotomy-induced brain shift. The procedures utilize the Total Lagrangian formulation with explicit time stepping. The loading was defined by prescribing deformations on the brain surface under the craniotomy. Application of the computed deformation fields to register the preoperative images with the intraoperative ones indicated that the models very accurately predict the intraoperative positions and deformations of the brain anatomical structures for limited information about the brain surface deformations. For each case, it took less than 40 s to compute the deformation field using a standard personal computer, and less than 4 s using a Graphics Processing Unit (GPU). The results suggest that nonlinear biomechanical models can be regarded as one possible method of complementing medical image processing techniques when conducting non-rigid registration within the real-time constraints of neurosurgery.
/content/cudazone/CUDABrowser/assets/images/applications/896_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/896_cover-medium_large.jpg
Academia
The University of Western Australia
2009
09
30
09/30/2009
Grand Roman Joldes
Paper
Science
Grand Roman Joldes,grandj@mech.uwa.edu.au
6134f011-ed6e-4cf4-9f52-887c65646088
An Extension of the StarSs Programming Model for Platforms with Multiple GPUs
While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages, which are not always easy to use. Thus, the impact of the new programming paradigms on the programmer's productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results.
/content/cudazone/CUDABrowser/assets/images/applications/895_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/895_cover-medium_large.jpg
Academia
Consejo Superior de Investigaciones Cientificas, Spain
2009
08
22
08/22/2009
Eduard Ayguade
Rosa M. Badia
Francisco D. Igual
Paper
Science
Task-level parallelism,heterogeneous systems,programming models,Eduard Ayguade,Rosa M. Badia,Francisco D. Igual,eduard.ayguade@bsc.es,rosa.m.badia@bsc.es,figual@icc.uji.es
480d6c65-d2c8-4d10-88e9-8f4cecdddd49
Fast Image Mapping of Endoscopic Image Mosaics with Three-Dimensional Ultrasound Image for Intrauterine Treatment of Twin-to-Twin Transfusion Syndrome
This paper describes a fast image mapping system that integrates endoscopic image mosaics with three-dimensional (3-D) ultrasound images for assisting intrauterine treatment of twin-to-twin transfusion syndrome (TTTS) by laser photocoagulation. Endoscopic laser photocoagulation treatment has a good survival rate and a low complication rate for twins. However, the small field of view and lack of surrounding information make the identification of vessel anastomoses difficult. We have developed an extended placenta visualization system with the fusion of endoscopic image mosaics with a 3-D ultrasound-image placenta model. Fully automatic and fast calibration is used for endoscope calibration in fluid. The 3-D spatial positions of the endoscopic images and the ultrasound image are tracked by a 3-D position tracking device. The mosaiced endoscope images are registered to the surface of the 3-D ultrasound placenta model by using a fast GPU-based image rendering method. Experimental results show that the system may provide an improved and efficient way of planning and guidance in laser photocoagulation TTTS treatment.
/content/cudazone/CUDABrowser/assets/images/applications/894_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/894_cover-medium_large.jpg
Academia
The University of Tokyo
2008
07
15
07/15/2008
Hongen Liao
Paper
Science
Hongen Liao,liao@bmpe.t.u-tokyo.ac.jp
0a6541a1-93ac-4431-a6ff-45865398551e
Accelerated Discovery of Discrete M-Clusters/Outliers on the Raster Plane Using Graphical Processing Units
This paper presents two discrete computational geometry algorithms designed for execution on Graphics Processing Units (GPUs). The algorithms are parallelized versions of sequential algorithms intended for application in geographical data mining. The first algorithm finds clusters of m points, called m-clusters, in the rasterized plane. The second algorithm complements the first by identifying outliers, those points which are not members of any m-clusters. The use of a raster representation of coordinates provides an ideal data stream environment for efficient GPU utilization. The parallel algorithms have low memory demands, and require only a limited amount of inter-process communication. Initial performance analysis indicates the algorithms are scalable, both in problem size and in the number of seeds, and significantly outperform commercial implementations.
/content/cudazone/CUDABrowser/assets/images/applications/893_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/893_cover-medium_large.jpg
Academia
Grand Valley State University, MI / Univ. of Maine-Augusta, ME
2009
05
20
05/20/2009
Christian Trefftz
Joseph Szakas
Igor Majdandzic
Paper
Numerics
GPU algorithms,Geographical data mining,Christian Trefftz,Joseph Szakas,Igor Majdandzic,trefftzc@gvsu.edu,szakas@maine.edu,majdanig@student.gvsu.edu
db591363-d303-433e-9ab7-f3e856c6a6b0
GP on SPMD parallel graphics hardware for mega Bioinformatics data mining
We demonstrate a SIMD C++ genetic programming system on a single 128 node parallel NVIDIA GeForce 8800 GTX GPU under RapidMind's GPGPU Linux software by predicting ten year+ outcome of breast cancer from a dataset containing a million inputs. NCBI GEO GSE3494 contains hundreds of Affymetrix HG-U133A and HG-U133B GeneChip biopsies. Multiple GP runs each with a population of 5 million programs winnow useful variables from the chaff at more than 500 million GPops per second. Sources available via FTP.
/content/cudazone/CUDABrowser/assets/images/applications/892_cover-medium6_small.png
/content/cudazone/CUDABrowser/assets/images/applications/892_cover-medium6_large.png
Academia
University of Essex, Colchester
2008
05
08
05/08/2008
W. B. Langdon
Paper
Computer Aided Engineering
W. B. Langdon,wlangdon@essex.ac.uk
ddc5f77d-8ad3-4adf-bf3b-d88271d702fe
A Real-Time Evolutionary Object Recognition System
We have created a real-time evolutionary object recognition system. Genetic Programming is used to automatically search the space of possible computer vision programs guided through user interaction. The user selects the object to be extracted with the mouse pointer and follows it over multiple frames of a video sequence. Several different alternative algorithms are evaluated in the background for each input image. Real-time performance is achieved through the use of the GPU for image processing operations.
/content/cudazone/CUDABrowser/assets/images/applications/891_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/891_cover-medium_large.jpg
Academia
Eberhard-Karls-Universitat Tubingen
2009
04
10
04/10/2009
Marc Ebner
Paper
Science
Marc Ebner,marc.ebner@wsii.uni-tuebingen.de
4861ac13-3ca4-4659-9fb8-ae5704b94996
Concurrent CT Reconstruction and Visual Analysis Using Hybrid Multi-resolution Raycasting in a Cluster Environment
GPU clusters nowadays combine enormous computational resources of GPUs and multi-core CPUs. This paper describes a distributed program architecture that leverages all resources of such a cluster to incrementally reconstruct, segment and render 3D cone beam computer tomography (CT) data with the objective to provide the user with results as quickly as possible at an early stage of the overall computation. As the reconstruction of high-resolution data sets requires a significant amount of time, our system first creates a low-resolution preview volume on the head node of the cluster, which is then incrementally supplemented by high-resolution blocks from the other cluster nodes using our multi-resolution renderer. It is further used for graphically choosing reconstruction priority and render modes of sub-volume blocks. The cluster nodes use their GPUs to reconstruct and render sub-volume blocks, while their multi-core CPUs are used to segment already available blocks.
/content/cudazone/CUDABrowser/assets/images/applications/890_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/890_cover-medium_large.jpg
Academia
Visualisierungsinstitut der Universitat Stuttgart
2009
11
26
11/26/2009
Steffen Frey
Christoph Muller
Magnus Strengert
Paper
Science
Steffen Frey,Christoph Muller,Magnus Strengert
67fcb101-5c74-49c7-abf2-b026feeea773
Modelling Anisotropic Viscoelasticity for Real-Time Soft Tissue Simulation
Previously almost all biomechanically-based time-critical surgical simulation has ignored the well established features of tissue mechanical response of anisotropy and time-dependence. We address this issue by presenting an efficient solution procedure for anisotropic visco-hyperelastic constitutive models which allows use of these in nonlinear explicit dynamic finite element algorithms. We show that the procedure allows incorporation of both anisotropy and viscoelasticity for as little as 5.1% additional cost compared with the usual isotropic elastic models. When combined with high performance GPU execution the complete framework is suitable for time-critical simulation applications such as interactive surgical simulation and intraoperative image registration.
/content/cudazone/CUDABrowser/assets/images/applications/889_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/889_cover-medium_large.jpg
Academia
University College London, UK
2008
09
10
09/10/2008
Zeike A. Taylor
Paper
Science
Zeike A. Taylor
fdeb943e-2746-42f1-a70b-e92ca74592c7
Fast and Robust Face Tracking for Analyzing Multiparty Face-to-Face Meetings
This paper presents a novel face tracker and verifies its effectiveness for analyzing group meetings. In meeting scene analysis, face direction is an important clue for assessing the visual attention of meeting participants. The face tracker, called STCTracker (Sparse Template Condensation Tracker), estimates face position and pose by matching face templates in the framework of a particle filter. STCTracker is robust against large head rotation, up to 60 degrees in the horizontal direction, with relatively small mean deviation error. It can also track multiple faces simultaneously in real time by utilizing a modern GPU (Graphics Processing Unit), e.g. 6 faces at about 28 frames/second on a single PC, and it can automatically build 3-D face templates upon initialization of the tracker. This paper evaluates the tracking errors and verifies the effectiveness of STCTracker for meeting scene analysis, in terms of conversation structures, gaze directions, and the structure of cross-modal interactions involving head gestures and utterances. Experiments confirm that STCTracker can basically match the performance of the user-unfriendly magnetic-sensor-based motion capture system.
/content/cudazone/CUDABrowser/assets/images/applications/888_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/888_cover-medium_large.jpg
Academia
NTT Communication Science Labs, Japan
2008
09
20
09/20/2008
Kazuhiro Otsuka
Junji Yamato
Paper
Science
Kazuhiro Otsuka,Junji Yamato,otsuka@eye.brl.ntt.co.jp,yamato@eye.brl.ntt.co.jp
d29fe864-0d87-456c-9127-ae0164499337
SUNVIZ: A Real-Time Visualization Environment for Space Physics Applications
Real-time physically accurate simulations are difficult to create because of the limited computational power available on a CPU. General purpose computing on the graphics processing unit (GPU) can provide a significant increase in performance. We are able to investigate the flow characteristics of a cloud of charged particles, which is one of the first steps in our goal of generating a real-time Coronal Mass Ejection (CME) simulator. Preliminary results show a sustained 60 Hz visual simulation with approximately four million particles and a non-visual simulation of 16 million particles at 30 Hz. The simulator provides a novel way to investigate a CME in real time, and it has the potential to predict when a particular CME is geoeffective, i.e. an event that could damage electrical infrastructure such as satellites, space stations, power grids, etc.
/content/cudazone/CUDABrowser/assets/images/applications/887_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/887_cover-medium_large.jpg
Academia
University of Alberta Physics
2008
12
03
12/03/2008
S. Eliuk
P. Boulanger
K. Kabin
Paper
Science
S. Eliuk,P. Boulanger,K. Kabin
af4a522e-2156-4c31-99f2-519daaa3e24d
Graphic processing unit-accelerated mutual information-based 3D image rigid registration
Mutual information (MI)-based image registration is effective in registering medical images, but it is computationally expensive. This paper accelerates MI-based image registration by dividing the computation of mutual information into spatial transformation and histogram-based calculation, and performing the 3D spatial transformation and trilinear interpolation on a graphics processing unit (GPU). The 3D floating image is downloaded to the GPU as a flat 3D texture, and then fetched and interpolated for each new voxel location in a fragment shader. The transformed results are rendered to textures by using the frame buffer object (FBO) extension, and then read back to main memory for the remaining computation on the CPU. Experimental results show that the GPU-accelerated method can achieve a speedup of about an order of magnitude, with a better registration result, compared with the software implementation on a single-core CPU.
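The histogram-based half of the computation described above can be sketched in a few lines of Python. This is only an illustration of mutual information estimated from a joint intensity histogram, not the paper's implementation: the function name, the bin count, and the flat intensity lists are assumptions made for the example, and the spatial transformation and interpolation that the paper moves to the GPU are taken as already applied to `moving`.

```python
import math
from collections import Counter


def mutual_information(fixed, moving, bins=8, max_value=256):
    """Estimate MI (in nats) between two equally sized intensity sequences.

    `fixed` and `moving` are flat sequences of intensities in [0, max_value);
    `moving` is assumed to be already resampled onto the grid of `fixed`.
    """
    if len(fixed) != len(moving) or not fixed:
        raise ValueError("inputs must be non-empty and of equal length")
    width = max_value / bins
    # Quantize each intensity pair into a joint-histogram cell.
    pairs = [(int(f // width), int(m // width)) for f, m in zip(fixed, moving)]
    n = len(pairs)
    joint = Counter(pairs)                 # joint histogram
    px = Counter(a for a, _ in pairs)      # marginal histogram of `fixed`
    py = Counter(b for _, b in pairs)      # marginal histogram of `moving`
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log(count * n / (px[a] * py[b]))
    return mi
```

In a registration loop, an optimizer would call a function like this once per candidate transform, which is why the resampling step that precedes it dominates the cost.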
/content/cudazone/CUDABrowser/assets/images/applications/886_transactions_small.png
/content/cudazone/CUDABrowser/assets/images/applications/886_transactions_large.png
Academia
Dalian University of Technology, China
2009
10
26
10/26/2009
Zongying Ou
Paper
Computer Aided Engineering
Zongying Ou,ouzyg@dlut.edu.cn
684bdf6c-6ae7-4c97-a001-9dcc5b568603
Dual-RBF based surface reconstruction
Surface reconstruction (Bloomenthal and Wyvill, Introduction to Implicit Surfaces, 1997) is a fundamental task in Computer Aided Design (CAD) and Computer Graphics (CG). In this paper, motivated by the physical polar field model (Yuxu Lin Chun Chen in Proceedings of the 3rd Pacific-Rim Symposium on Image and Video Technology, 1997), we propose a novel implicit surface reconstruction approach, named Dual-RBF. First, by simulating the physical polar field model, Dual-RBF provides a good initial reconstruction state. Second, two simple nonlinear methods are introduced to adjust the configuration of the Dual-RBF model, so that a more accurate reconstruction is reached. Third, Dual-RBF becomes even more robust at filling holes in flawed input point clouds by adopting a multi-level strategy. Finally, the visualization of the surface reconstruction is sped up with the GPU. Experimental results show that the proposed approach is faster and more robust than previous implicit surface reconstruction techniques.
/content/cudazone/CUDABrowser/assets/images/applications/885_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/885_visualcomputer_large.png
Academia
Zhejiang University, China
2009
03
03
03/03/2009
Yuxu Lin
Chun Chen
Mingli Song
Paper
Science
Yuxu Lin,Chun Chen,Mingli Song,linyuxu@zju.edu.cn,chenc@cs.zju.edu.cn,brooksong@ieee.org
67822eb4-51de-4035-b30f-046c25f50c9d
A Color Management Process for Real Time Color Reconstruction of Multispectral Images
We introduce a new accurate and technology-independent display color characterization model for color rendering of multispectral images. The model is established automatically and takes no longer than a coffee break, which makes it efficient in a practical situation. This model is a part of the color management workflow of the new tools designed at the C2RMF for multispectral image analysis of paintings acquired with the material developed during the CRISATEL European project. The analysis is based on color reconstruction with virtual illuminants and uses a GPU (graphics processing unit) based processing model in order to interact in real time with virtual lighting.
/content/cudazone/CUDABrowser/assets/images/applications/884_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/884_cover-medium_large.jpg
Academia
Universite Jean Monnet / France
2009
07
14
07/14/2009
Philippe Colantoni
Jean-Baptiste Thomas
Paper
Science
Philippe Colantoni,Jean-Baptiste Thomas
5da0d849-dfa9-4a99-b8d9-1ed49ac75197
Regular Expression Matching on Graphics Hardware for Intrusion Detection
The expressive power of regular expressions has been often exploited in network intrusion detection systems, virus scanners, and spam filtering applications. However, the flexible pattern matching functionality of regular expressions in these systems comes with significant overheads in terms of both memory and CPU cycles, since every byte of the inspected input needs to be processed and compared against a large set of regular expressions.
http://springerlink.com/content/b3m7662014272t8m/?p=0dd80c5c9b564b009c9e0e9c88044df6
/content/cudazone/CUDABrowser/assets/images/applications/883_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/883_cover-medium_large.jpg
Academia
Foundation for Research and Technology,Hellas
2009
09
30
09/30/2009
48
Giorgos Vasiliadis
Michalis Polychronakis
Spiros Antonatos
Paper
Science
Giorgos Vasiliadis,Michalis Polychronakis,Spiros Antonatos,gvasil@ics.forth.gr,mikepo@ics.forth.gr,antonat@ics.forth.gr
0d6ee814-5cf0-4da6-988b-a9e7159f4f0a
Haptic guided 3-D deformable image registration
Purpose: We present a system which supports deformable image registration guided by a haptic device.
Methods: The haptic device is tied to a block matching method where a set of uniformly distributed control points determine the block positions. Each control point constitutes a particle in a mass spring grid which limits the space of allowed movements to elastic movements. Control points are manipulated by the haptic device, and the negative gradient of the similarity metric over the corresponding block is rendered as a force on the haptic device guiding the user to a minimum of the optimization landscape. Fast update of forces was achieved by exploiting the GPU for computations of the similarity metric and for interpolation of the deformation field.
/content/cudazone/CUDABrowser/assets/images/applications/882_cover-medium5_small.png
/content/cudazone/CUDABrowser/assets/images/applications/882_cover-medium5_large.png
Academia
University of Oslo, Norway / Rikshospitalet University Hospital, Norway
2009
02
24
02/24/2009
Petter Risholm
Eigil Samset
Paper
Medical Imaging
Petter Risholm,Eigil Samset,pettri@ifi.uio.no
0f7ccffd-5dcf-44a4-9971-18d0a82c6dd3
Radar Signal Processing with Graphics Processors (GPUs)
The investigation is conducted by comparing a GPU (GTX260) against a modern desktop CPU for several HPEC (High Performance Embedded Computing) and other radar signal processing algorithms; 12 in total. Several other aspects are also investigated, such as programming environment and efficiency, future GPU architectures, and applicability in radar systems. Our CUDA GPU implementations perform substantially better than the CPU and associated CPU code for all but one of the 12 algorithms tested, sometimes by a factor of 100 or more. The OpenCL implementations also perform substantially better than the CPU. The substantial performance achieved when using CUDA for almost all benchmarks can be attributed both to the high theoretical performance of the GPU and to the inherent data-parallelism, and hence GPU-suitability, of almost all of the investigated algorithms. Programming CUDA is reasonably straightforward, largely due to the mature development environment and abundance of documentation and white papers. OpenCL is a lot more tedious to program. Furthermore, the coming CUDA GPU architecture called Fermi is expected to further increase performance and programmability. When considering system integration of GPU architectures into harsh radar application environments, one should be aware of potential heat and also possible obsolescence issues.
/content/cudazone/CUDABrowser/assets/images/applications/881_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/881_logo_large.png
Academia
HPC
http://www.hpcsweden.se
2010
02
08
02/08/2010
140
Ian Wainwright
Jimmy Pettersson
Paper
Signal Processing
Ian Wainwright,Jimmy Pettersson,jimmy.pettersson@hpcsweden.se,ian.wainwright@gmail.com
4ff91bfb-b496-46cb-9e1c-3572031aff73
Exploiting the Power of GPUs for Asymmetric Cryptography
Modern Graphics Processing Units (GPUs) have reached a dimension with respect to performance and gate count that exceeds conventional Central Processing Units (CPUs) by far. Many modern computer systems include, besides a CPU, such a powerful GPU, which runs idle most of the time and might be used as a cheap and instantly available co-processor for general purpose applications.
http://springerlink.com/content/d1rt1r0326500541/?p=0dd80c5c9b564b009c9e0e9c88044df6
/content/cudazone/CUDABrowser/assets/images/applications/880_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/880_cover-medium_large.jpg
Academia
Ruhr University Bochum, Germany
2008
08
06
08/06/2008
Robert Szerwinski
Tim Guneysu
Paper
Science
Robert Szerwinski,Tim Guneysu,szerwinski@crypto.rub.de,gueneysu@crypto.rub.de
8749f743-3122-4ade-9664-c7c12c9cba95
Programmable and Scalable Architecture for Graphics Processing Units
Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPUs) are often implemented as multiprocessing systems with high performance floating point processing and application specific hardware stages for maximizing the graphics throughput.
In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over the traditional VLIW-style architectures making it interesting for computationally intensive applications. We show that TTA provides high floating point processing performance while allowing more programming freedom than vector processors.
Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for general purpose computing on GPU APIs, which have become popular in recent years.
/content/cudazone/CUDABrowser/assets/images/applications/879_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/879_cover-medium_large.jpg
Academia
Universidad Rey Juan Carlos, Spain
2009
07
21
07/21/2009
Carlos S. de La Lama
Pekka Jaaskelainen
Jarmo Takala
Paper
Science
Carlos S. de La Lama,Pekka Jaaskelainen,Jarmo Takala,carlos.delalama@urjc.es,pekka.jaaskelainen@tut.fi,jarmo.takala@tut.fi
81a5b615-fc67-4667-ab39-ad855603e008
Breath-Hold Target Localization with Simultaneous Kilovoltage/Megavoltage Cone-Beam CT and Fast Reconstruction
Hypofractionated high dose radiotherapy of small lung tumors is very effective and was based on stereotaxy until now. It has recently become possible to achieve a high patient positioning precision based on on-line imaging with cone-beam CT (CBCT) and breath-hold techniques. The CBCT acquisition time of roughly 60 seconds, however, is too long for one breath-hold, resulting in image degradation by respiratory motion artifacts. By using megavoltage (MV) and kilovoltage (kV) photon sources (mounted perpendicularly on the Linac gantry) for volume reconstruction, we could reduce the acquisition time to 15 seconds.
/content/cudazone/CUDABrowser/assets/images/applications/878_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/878_prediction_large.png
Academia
World Congress on Medical Physics and Biomedical Engineering, Germany
2010
01
04
01/04/2010
M. Blessing
D. Stsepankou
H. Wertz
Paper
Science
M. Blessing,D. Stsepankou,H. Wertz
b0ea1f26-c2cf-436b-830b-d3cbb7ecb7bf
Implementation of Fine-Grained Algorithms on Graphical Processing Unit
In this paper we solve the problem of mapping fine-grained algorithms to a graphical processing unit (GPU). Synchronous, asynchronous, block-synchronous and probabilistic cellular automata and an explicit scheme for PDEs are used as examples. Different implementation variants and their performance are presented.
/content/cudazone/CUDABrowser/assets/images/applications/877_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/877_cover-medium_large.jpg
Academia
ICMMG SB RAS, Novosibirsk, Russia
2009
09
01
09/01/2009
Konstantin Kalgin
Paper
Science
Konstantin Kalgin,kalgin@ssd.sscc.ru
ac91c067-09bc-44a8-877b-254db2f289b0
StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE SPUs) or data-parallel accelerators (e.g. GPGPUs).
Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge.
We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand.
We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements regarding execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
/content/cudazone/CUDABrowser/assets/images/applications/876_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/876_cover-medium_large.jpg
Academia
University of Bordeaux
2009
08
22
08/22/2009
Cedric Augonnet
Paper
Science
Cedric Augonnet
71356391-beda-4ab3-82f5-a0b60765af0f
Seismic Wave Field Modeling with Graphics Processing Units
GPGPU (general-purpose computing on graphics processing units) is a very effective and inexpensive way of dealing with time-consuming computations. In some cases even a low-end GPU can be dozens of times faster than a modern CPU. Utilization of GPGPU technology can make a typical desktop computer powerful enough to perform the necessary computations in a fast, effective and inexpensive way. Seismic wave field modeling is one of the problems of this kind. Sometimes one modeled common shot-point gather or one wave field snapshot can reveal the nature of an analyzed wave phenomenon. On the other hand, these kinds of modeling are often a part of complex and extremely time-consuming methods with almost unlimited needs for computational resources. This is always a problem for academic centers, especially now that the times of generous support from oil and gas companies have ended.
/content/cudazone/CUDABrowser/assets/images/applications/875_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/875_cover-medium_large.jpg
Academia
AGH University of Science and Technology, Poland
2009
05
21
05/21/2009
Tomasz Danek
Paper
Science
Tomasz Danek,tdanek@agh.edu.pl
9c1ce1c7-2162-49d4-93ca-a21ba5687e48
Active Structured Learning for High-Speed Object Detection
High-speed, smooth and accurate visual tracking of objects in arbitrary, unstructured environments is essential for robotics and human motion analysis. However, building a system that can adapt to arbitrary objects and a wide range of lighting conditions is a challenging problem, especially if hard real-time constraints apply, as in robotics scenarios. In this work, we introduce a method for learning a discriminative object tracking system based on the recent structured regression framework for object localization. Using a kernel function that allows fast evaluation on the GPU, the resulting system can process video streams at speeds of 100 frames per second or more.
Consecutive frames in high-speed video sequences are typically very redundant, and for training an object detection system, it is sufficient to have training labels from only a subset of all images. We propose an active learning method that selects training examples in a data-driven way, thereby minimizing the required number of training labels. Experiments on realistic data show that the active learning is superior to previously used methods for dataset subsampling for this task.
/content/cudazone/CUDABrowser/assets/images/applications/874_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/874_cover-medium_large.jpg
Academia
Max Planck Institute for Biological Cybernetics, Tubingen, Germany
2009
09
02
09/02/2009
Christoph H. Lampert
Jan Peters
Paper
Science
Christoph H. Lampert,Jan Peters,ChristophH.Lampert@tuebingen.mpg.de,Jan.Peters@tuebingen.mpg.de
7afe1a21-0c1b-40fd-9f4b-60e731b26240
Attaining High Performance in General-Purpose Computations on Current Graphics Processors
The increase in performance of the last generations of graphics processors (GPUs) has made this class of hardware a coprocessing platform of remarkable success in certain types of operations. In this paper we evaluate the performance of linear algebra and image processing routines, both on classical and unified GPU architectures and traditional processors (CPUs). From this study, we gain insights on the properties that make an algorithm likely to deliver high performance on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/873_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/873_cover-medium_large.jpg
Academia
Universidad Jaume, Spain
2008
12
06
12/06/2008
Francisco D. Igual
Rafael Mayo
Enrique S. Quintana-Orti
Paper
Science
Francisco D. Igual,Rafael Mayo,Enrique S. Quintana-Orti,figual@icc.uji.es,mayo@icc.uji.es,quintana@icc.uji.es
6fe3ccf0-abfd-4113-bcf7-49d91f20f318
Efficient Multiplication of Polynomials on Graphics Hardware
We present an algorithm to multiply univariate polynomials with integer coefficients efficiently using the Number Theoretic Transform (NTT) on Graphics Processing Units (GPUs). The same approach can be used to multiply large integers encoded as polynomials. Our algorithm exploits fused multiply-add capabilities of the graphics hardware. NTT multiplications are executed in parallel for a set of distinct primes, followed by reconstruction using the Chinese Remainder Theorem (CRT) on the GPU. Our benchmarking experiments show NTT multiplication performance of up to 77 GMul/s. We compared our approach with CPU-based implementations of polynomial and large integer multiplication provided by the NTL and GMP libraries.
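The scheme described above (one NTT-based product per prime, then CRT reconstruction of the integer coefficients) can be sketched sequentially in Python. This is a minimal CPU illustration under stated assumptions, not the paper's GPU code: the three primes, the shared primitive root 3, and all function names are chosen for the example, and the per-prime products that the paper runs in parallel are computed one after another here.

```python
# Assumed NTT-friendly primes (each of the form c * 2^k + 1 with primitive root 3).
PRIMES = [998244353, 469762049, 167772161]
PRIMITIVE_ROOT = 3


def ntt(values, prime, invert=False):
    """Iterative radix-2 transform over Z/prime; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(PRIMITIVE_ROOT, (prime - 1) // length, prime)
        if invert:
            w_len = pow(w_len, prime - 2, prime)  # modular inverse (Fermat)
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * w % prime
                a[k] = (u + v) % prime
                a[k + length // 2] = (u - v) % prime
                w = w * w_len % prime
        length <<= 1
    if invert:
        n_inv = pow(n, prime - 2, prime)
        a = [x * n_inv % prime for x in a]
    return a


def multiply_mod(p, q, prime):
    """Coefficients of p*q reduced modulo one prime."""
    out_len = len(p) + len(q) - 1
    size = 1
    while size < out_len:
        size <<= 1
    fa = ntt([c % prime for c in p] + [0] * (size - len(p)), prime)
    fb = ntt([c % prime for c in q] + [0] * (size - len(q)), prime)
    prod = ntt([x * y % prime for x, y in zip(fa, fb)], prime, invert=True)
    return prod[:out_len]


def crt(residues, moduli):
    """Combine residues, then map to the symmetric range so signed coefficients survive."""
    total, modulus = 0, 1
    for r, m in zip(residues, moduli):
        t = (r - total) * pow(modulus % m, m - 2, m) % m
        total += modulus * t
        modulus *= m
    return total - modulus if total > modulus // 2 else total


def multiply(p, q):
    """Exact product, valid while every coefficient is below half the product of PRIMES in magnitude."""
    per_prime = [multiply_mod(p, q, prime) for prime in PRIMES]
    return [crt(rs, PRIMES) for rs in zip(*per_prime)]
```

Coefficient lists are ordered from the constant term upward, so `multiply([1, 2, 3], [4, 5])` represents (1 + 2x + 3x^2)(4 + 5x). More primes would be needed once the true coefficients can exceed half the product of the chosen moduli.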
/content/cudazone/CUDABrowser/assets/images/applications/872_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/872_cover-medium_large.jpg
Academia
Saarbrucken
2009
08
21
08/21/2009
Pavel Emeliyanenko
Paper
Science
Pavel Emeliyanenko,asm@mpi-inf.mpg.de
5f4ef7a1-9d5f-4ef3-a1ea-c2be5ed4d1e8
Parallel LDPC Decoding on GPUs Using a Stream-Based Computing Approach
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes adopted by recent communication standards. LDPC decoders are based on belief propagation algorithms, which make use of a Tanner graph and very intensive message-passing computation, and usually require hardware-based dedicated solutions. With the exponential increase of the computational power of commodity graphics processing units (GPUs), new opportunities have arisen to develop general purpose processing on GPUs. This paper proposes the use of GPUs for implementing flexible and programmable LDPC decoders. A new stream-based approach is proposed, based on compact data structures to represent the Tanner graph. It is shown that such a challenging application for stream-based computing, because of irregular memory access patterns, memory bandwidth and recursive flow control constraints, can be efficiently implemented on GPUs. The proposal was experimentally evaluated by programming LDPC decoders on GPUs using the Caravela platform, a generic interface tool for managing the kernels' execution regardless of the GPU manufacturer and operating system. Moreover, to relatively assess the obtained results, we have also implemented LDPC decoders on general purpose processors with Streaming Single Instruction Multiple Data (SIMD) Extensions. Experimental results show that the solution proposed here efficiently decodes several codewords simultaneously, reducing the processing time by one order of magnitude.
/content/cudazone/CUDABrowser/assets/images/applications/871_cover-medium4_small.png
/content/cudazone/CUDABrowser/assets/images/applications/871_cover-medium4_large.png
Academia
Universidade de Coimbra
2009
09
28
09/28/2009
Gabriel Falcao
Shinichi Yamagiwa
Vitor Silva
Paper
Science
Gabriel Falcao,Shinichi Yamagiwa,Vitor Silva,gff@co.it.pt,yama@inesc-id.pt,vitor@co.it.pt
394dbac7-73bd-4c9e-86da-1dd81c35ad28
Retargeting PLAPACK to Clusters with Hardware Accelerators
Hardware accelerators are becoming a highly appealing approach to boost the raw performance as well as the price-performance and power-performance ratios of current clusters. In this paper we present a strategy to retarget PLAPACK, a library initially designed for clusters of nodes equipped with general-purpose processors and a single address space per node, to clusters equipped with graphics processors (GPUs). In our approach, data are kept in the device memory and only retrieved to main memory when they have to be communicated to a different node. Here we benefit from the object-based orientation of PLAPACK, which allows all communication between host and device to be embedded within a pair of routines, providing a clean abstraction that enables an efficient and direct port of all the contents of the library. Our experiments on a cluster consisting of 16 nodes with two NVIDIA Quadro FX5800 GPUs each show the performance of our approach.
/content/cudazone/CUDABrowser/assets/images/applications/870_FLAMEbanner_small.png
/content/cudazone/CUDABrowser/assets/images/applications/870_FLAMEbanner_large.png
Academia
University Jaume I / Texas University
2010
02
11
02/11/2010
Fogue
Igual
Quintana-Orti
Paper
Numerics
Fogue,Igual,Quintana-Orti,figual@icc.uji.es
d4831d82-4e49-46b5-9751-c1e58a61d67a
Neural Network Training with Extended Kalman Filter Using Graphics Processing Unit
The graphics processing unit has evolved through the years into a powerful resource for general purpose computing. In this article we present an implementation of the extended Kalman filter for recurrent neural network training, in which the most computationally intensive tasks are performed on the GPU. This approach achieves a significant speedup of the neural network training process for larger networks.
/content/cudazone/CUDABrowser/assets/images/applications/869_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/869_cover-medium_large.jpg
Academia
Slovak University of Technology in Bratislava
2008
08
29
08/29/2008
Peter Trebaticky
Jiri Pospichal
Paper
Science
Peter Trebaticky,Jiri Pospichal,trebaticky@fiit.stuba.sk,pospichal@fiit.stuba.sk
c8daa779-2c65-4b59-a45a-a3648753fb56
Fast collision detection using the A-buffer
This paper presents a novel and fast image-space collision detection algorithm with the A-buffer, where the GPU computes the potentially colliding sets (PCSs), and the CPU performs the standard triangle intersection test. When the bounding boxes of two objects intersect, the intersection is passed to the GPU. The object surfaces in the intersection are rendered into the A-buffer. Rendering into the A-buffer is up to eight times faster than the ordinary approaches. Then, PCSs are computed by comparing the depth values of each texel of the A-buffer. A PCS consists of only two triangles. The PCSs are read back to the CPU, and the CPU computes the intersection points between the triangles. The proposed algorithm runs extremely fast, does not require any preprocessing, can handle dynamic objects including deformable and fracturing models, and can compute self-collisions. The versatility and performance gain of the proposed algorithm prove its usefulness in real-time applications such as 3D games.
/content/cudazone/CUDABrowser/assets/images/applications/868_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/868_visualcomputer_large.png
Academia
Korea University, Seoul, Korea
2008
05
17
05/17/2008
Hanyoung Jang
JungHyun Han
Paper
Science
Hanyoung Jang,JungHyun Han,jhan@korea.ac.kr
3752cd56-fe2a-4457-a8e1-ea665d83102d
Engineering of Computer Vision Algorithms Using Evolutionary Algorithms
Computer vision algorithms are currently developed by looking up the available operators from the literature and then arranging those operators such that the desired task is performed. This is often a tedious process which also involves testing the algorithm with different lighting conditions or at different sites. We have developed a system for the automatic generation of computer vision algorithms at interactive frame rates using GPU-accelerated image processing. The user simply tells the system which object should be detected in an image sequence. Simulated evolution, in particular Genetic Programming, is used to automatically generate and test alternative computer vision algorithms. Only the best algorithms survive and eventually provide a solution to the user's image processing task.
/content/cudazone/CUDABrowser/assets/images/applications/867_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/867_cover-medium_large.jpg
Academia
Eberhard Karls Universitat Tubingen
2009
09
30
09/30/2009
Marc Ebner
Paper
Science
Marc Ebner,marc.ebner@wsii.uni-tuebingen.de
e2835997-b2cc-4236-a3c2-83316b6befcb
Solving Dense Linear Systems on Graphics Processors
We present several algorithms to compute the solution of a linear system of equations on a GPU, as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We also show how iterative refinement with mixed-precision can be used to regain full accuracy in the solution of linear systems. Experimental results on a G80 using CUBLAS 1.0, the implementation of BLAS for NVIDIA GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.
/content/cudazone/CUDABrowser/assets/images/applications/866_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/866_cover-medium_large.jpg
Academia
Universidad Jaume
2008
08
21
08/21/2008
Sergio Barrachina
Maribel Castillo
Francisco D. Igual
Paper
Science
Sergio Barrachina,Maribel Castillo,Francisco D. Igual,barrachi@icc.uji.es,castillo@icc.uji.es,figual@icc.uji.es
ee93a2d7-a172-4d28-88cc-ea3581de0988
Visual simulation of thermal fluid dynamics in a pressurized water reactor
We present a simulation and visualization system for a critical application: analysis of the thermal fluid dynamics inside a pressurized water reactor of a nuclear power plant when cold water is injected into the reactor vessel. We employ a hybrid thermal lattice Boltzmann method (HTLBM), which has the advantages of ease of parallelization and ease of handling complex simulation boundaries. For efficient computation and storage of the irregular-shaped simulation domain, we classify the domain into nonempty and empty cells and apply a novel packing technique to organize the nonempty cells. This method is implemented on a GPU cluster for acceleration. We demonstrate the formation of cold-water plumes in the reactor vessel. A set of interactive visualization tools, such as side-view slices, 3D volume rendering, thermal layers rendering, and panorama rendering, are provided to collectively visualize the structure and dynamics of the temperature field in the vessel. To the best of our knowledge, this is the first system that combines 3D simulation and visualization for analyzing thermal shock risk in a pressurized water reactor.
/content/cudazone/CUDABrowser/assets/images/applications/865_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/865_visualcomputer_large.png
Academia
Stony Brook University, NY
2009
01
23
01/23/2009
Zhe Fan
Yu-Chuan Kuo
Ye Zhao
Paper
Science
Zhe Fan,Yu-Chuan Kuo,Ye Zhao,fzhe@cs.sunysb.edu,yukuo@cs.sunysb.edu,zhao@cs.kent.edu
bf51755a-de3f-4ba1-b775-ba5134f861e9
A novel multiple-walk parallel algorithm for the Barnes-Hut treecode on GPUs: towards cost effective, high performance N-body simulation
Recently, general-purpose computation on graphics processing units (GPGPU) has become an increasingly popular field of study as graphics processing units (GPUs) continue to be proposed as high performance and relatively low cost implementation platforms for scientific computing applications. Among these applications figure astrophysical N-body simulations, which form one of the most challenging problems in computational science. However, in most reported studies, a simple algorithm was used for GPGPUs, and the resulting performances were not observed to be better than those of conventional CPUs that were based on more optimized algorithms such as the tree algorithm or the particle-particle particle-mesh algorithm. Because of the difficulty in getting efficient implementations of such algorithms on GPUs, a GPU cluster had no practical advantage over general-purpose PC clusters for N-body simulations. In this paper, we report a new method for efficient parallel implementation of the tree algorithm on GPUs. Our novel tree code allows the realization of an N-body simulation on a GPU cluster at a much higher performance than that on general PC clusters. We performed a cosmological simulation with 562 million particles on a GPU cluster using 128 NVIDIA GeForce 8800GTS GPUs at an overall cost of $168,172. We obtained a sustained performance of 20.1 Tflops, which when normalized against a general-purpose CPU implementation leads to a performance of 8.50 Tflops. The achieved cost/performance was hence a mere $19.8/Gflops, which shows the high competitiveness of GPGPUs.
/content/cudazone/CUDABrowser/assets/images/applications/864_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/864_implementation_large.png
Academia
Nagasaki University, Japan
2009
05
20
05/20/2009
Tsuyoshi Hamada
Keigo Nitadori
Khaled Benkrid
Paper
Science
Tsuyoshi Hamada,Keigo Nitadori,Khaled Benkrid,hamada@cis.nagasaki-u.ac.jp,nitadori@cfca.jp,k.benkdird@ed.ac.uk
4977b07b-89ac-439e-abb4-8879e099c3c4
Efficient Acceleration of Asymmetric Cryptography on Graphics Hardware
Graphics processing units (GPUs) are increasingly being used for general purpose computing. We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. DirectX 10 compliant graphics processors are the latest generation of GPU architecture, which provides increased programming flexibility and support for integer operations. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We show how an adaptive approach to modular exponentiation, involving implementations based on both a radix and a residue number system, gives the best all-around performance on the GPU in terms of both latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.
/content/cudazone/CUDABrowser/assets/images/applications/863_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/863_cover-medium_large.jpg
Academia
Trinity College Dublin
2009
06
19
06/19/2009
Owen Harrison
John Waldron
Paper
Science
Owen Harrison,John Waldron,harrisoo@cs.tcd.ie,john.waldron@cs.tcd.ie
ed7c674a-46f4-4536-84a9-24e1489c692e
Realistic real-time sound re-synthesis and processing for interactive virtual worlds
We present new GPU-based techniques for implementing linear digital filters for real-time audio processing. Our solution for recursive filters is the first presented in the literature. We demonstrate the relevance of these algorithms to computer graphics by synthesizing realistic sounds of colliding objects made of different materials, such as glass, plastic, and wood, in real time. The synthesized sounds can be parameterized by the object materials, velocities, and collision angles. Despite its flexibility, our approach uses very little memory, since it essentially requires a set of coefficients representing the impulse response of each material sound. Such features make our approach an attractive alternative to traditional CPU-based techniques that use playback of pre-recorded sounds.
/content/cudazone/CUDABrowser/assets/images/applications/862_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/862_visualcomputer_large.png
Academia
Instituto de Informatica
2009
03
11
03/11/2009
Fernando Trebien
Manuel M. Oliveira
Paper
Video & Audio
Fernando Trebien,Manuel M. Oliveira,ftrebien@inf.ufrgs.br,oliveira@inf.ufrgs.br
b3df15ed-da9f-40e2-9e52-827b4ffa8012
Solid Mesh Registration for Radiotherapy Treatment Planning
We present an algorithm for solid organ registration of pre-segmented data represented as tetrahedral meshes. Registration of the organ surface is driven by force terms based on a distance field representation of the source and reference shapes. Registration of internal morphology is achieved using a non-linear elastic finite element model. A key feature of the method is that the user does not need to specify boundary conditions (surface point correspondences) prior to the finite element analysis. Instead the boundary matches are found as an integrated part of the analysis. The method is evaluated on phantom data and prostate data obtained in vivo based on fiducial marker accuracy and inverse consistency of transformations. The parallel nature of the method allows an efficient implementation on a GPU and as a result the method is very fast. All validation registrations take less than 30 seconds to complete. The proposed method has many potential uses in image guided radiotherapy (IGRT) which relies on registration to account for organ deformation between treatment sessions.
/content/cudazone/CUDABrowser/assets/images/applications/861_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/861_cover-medium_large.jpg
Academia
Aarhus University, Denmark
2010
01
21
01/21/2010
Karsten Ostergaard Noe
Paper
Science
Karsten Ostergaard Noe,noe@cs.au.dk
6a0fe015-97bb-4d85-b408-308d30e105d5
Large Scale Bioinformatics Data Mining with Parallel Genetic Programming on Graphics Processing Units
A suitable single instruction multiple data GP interpreter can achieve high (Giga GPop/second) performance on a SIMD GPU graphics card by simultaneously running multiple diverse members of the genetic programming population. SPMD dataflow parallelisation is achieved because the single interpreter treats the different GP programs as data. On a single 128 node parallel nVidia GeForce 8800 GTX GPU, the interpreter can outrun a compiled approach, where data parallelisation comes only by running a single program at a time across multiple inputs.
The RapidMind GPGPU Linux C++ system has been demonstrated by predicting ten year+ outcome of breast cancer from a dataset containing a million inputs. NCBI GEO GSE3494 contains hundreds of Affymetrix HG-U133A and HG-U133B GeneChip biopsies. Multiple GP runs each with a population of five million programs winnow useful variables from the chaff at more than 500 million GPops per second. Sources available via FTP.
/content/cudazone/CUDABrowser/assets/images/applications/860_iss_small.png
/content/cudazone/CUDABrowser/assets/images/applications/860_iss_large.png
Academia
King's College, London
2010
01
06
01/06/2010
William B. Langdon
Paper
Science
William B. Langdon
8a53a58c-cb1e-4854-b81a-88ec94b5490d
Hierarchical Markov Random Fields Applied to Model Soft Tissue Deformations on Graphics Hardware
Many methodologies dealing with prediction or simulation of soft tissue deformations on medical image data require preprocessing of the data in order to produce a different shape representation that complies with standard methodologies, such as mass-spring networks or finite element methods (FEM). On the other hand, methodologies working directly on the image space normally do not take into account the mechanical behavior of tissues and tend to lack the physics foundations driving soft tissue deformations. This chapter presents a method to simulate soft tissue deformations based on coupled concepts from image analysis and mechanics theory. The proposed methodology is based on a robust stochastic approach that takes into account material properties retrieved directly from the image, concepts from continuum mechanics, and FEM. The optimization framework is solved within a hierarchical Markov random field (HMRF), which is implemented on the graphics processing unit (GPU).
/content/cudazone/CUDABrowser/assets/images/applications/859_cover-medium3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/859_cover-medium3_large.png
Academia
University of Bern, Switzerland
2009
11
24
11/24/2009
Christof Seiler
Philippe Buchler
Lutz-Peter Nolte
Paper
Science
Christof Seiler,Philippe Buchler,Lutz-Peter Nolte,christof.seiler@artorg.unibe.ch,christof.seiler@artorg.unibe.ch,christof.seiler@artorg.unibe.ch
8fea4552-0ab0-4c43-a0e4-f72c417d9e06
Efficient K-Means Clustering Using Accelerated Graphics Processors
We exploit the parallel architecture of the Graphics Processing Unit (GPU) used in desktops to efficiently implement the traditional K-means algorithm. Our approach in clustering avoids the need for data and cluster information transfer between the GPU and CPU in between the iterations. In this paper we present the novelties in our approach and techniques employed to represent data, compute distances, centroids and identify the cluster elements using the GPU. We measure performance using the metric: computational time per iteration. Our implementation of k-means clustering on an Nvidia 5900 graphics processor is 4 to 12 times faster than the CPU and 7 to 22 times faster on the Nvidia 8500 graphics processor for various data sizes. We also achieved 12 to 64 times speed gain on the 5900 and 20 to 140 times speed gains on the 8500 graphics processor in computational time per iteration for evaluations with various cluster sizes.
/content/cudazone/CUDABrowser/assets/images/applications/855_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/855_cover-medium_large.jpg
Academia
Nanyang Technological University, Singapore
2008
08
30
08/30/2008
S. A. Arul Shalom
Manoranjan Dash
Minh Tue
Paper
Science
S. A. Arul Shalom,Manoranjan Dash,Minh Tue,sall0001@ntu.edu.sg,asmdash@ntu.edu.sg,h0630082@nus.edu.sg
07072ad0-b7c1-4eb6-9826-2c1cc0ae740f
Systematic Parallelization of Medical Image Reconstruction for Graphics Hardware
Modern Graphics Processing Units (GPUs) consist of several SIMD-processors and thus provide a high degree of parallelism at low cost. We introduce a new approach to systematically develop parallel image reconstruction algorithms for GPUs from their parallel equivalents for distributed-memory machines. We use High-Level Petri Nets (HLPN) to intuitively describe the parallel implementations for distributed-memory machines. By annotating the functions of the HLPN with memory requirements and information about data distribution, we are able to identify parallel functions that can be implemented efficiently on the GPU. For an important iterative medical image reconstruction algorithm, the list-mode OSEM algorithm, we demonstrate the limitations of its distributed-memory implementation and show how our HLPN-based approach leads to a fast implementation on GPUs, reusable across different medical imaging devices.
/content/cudazone/CUDABrowser/assets/images/applications/854_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/854_cover-medium_large.jpg
Academia
University of Munster, Germany
2008
08
21
08/21/2008
Maraike Schellmann
Jurgen Vording
Sergei Gorlatch
Paper
Science
Maraike Schellmann,Jurgen Vording,Sergei Gorlatch,schellmann@uni-muenster.de,voerding@uni-muenster.de,gorlatch@uni-muenster.de
7794a349-0166-46c9-8c6d-32da8b4febda
Real-Time Autostereoscopic Visualization of Registration-Generated 4D MR Image of Beating Heart
This paper presents a real-time autostereoscopic visualization system using the principle of Integral Videography (IV). We developed MIP and composite volume ray casting methods for IV volume rendering and implemented the algorithms on the GPU to achieve real-time rendering. The system was used to visualize a 4D MR image that was generated from registration of a 3D MR image and a 4D ultrasound image. The registration scheme consists of inter-modality rigid registration between the 3D MR image and a 3D ultrasound image and intra-modality non-rigid registration between 3D ultrasound images. Registration processes were also implemented on the GPU. Evaluation of processing speed showed that GPU processing was 48x, 13x, and 21x faster than CPU processing for IV volume rendering, rigid registration, and non-rigid registration, respectively. We also enabled real-time user interactivity for the IV visualization system. In the future, we plan to use this system to develop an intra-operative surgery navigation system for intra-cardiac surgery on the beating heart.
/content/cudazone/CUDABrowser/assets/images/applications/853_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/853_cover-medium_large.jpg
Academia
The University of Tokyo, Japan
2008
07
15
07/15/2008
Nicholas Herlambang
Hongen Liao
Ken Masamune
Paper
Science
Nicholas Herlambang,Hongen Liao,Ken Masamune,nicholas@atre.t.u-tokyo.ac.jp,liao@atre.t.u-tokyo.ac.jp,masa@i.u-tokyo.ac.jp
498009d7-ccca-476d-881a-4a392b52b7ba
Multiscale and local search methods for real time region tracking with particle filters: local search driven by adaptive scale estimation on GPUs
Tracking systems are important in computer vision, with applications in surveillance, human-computer interaction, etc. Consumer graphics processing units (GPUs) have experienced an extraordinary evolution in both computing performance and programmability, leading to greater use of the GPU for non-rendering applications. In this work we propose a real-time object tracking algorithm, based on the hybridization of particle filtering (PF) and a multi-scale local search (MSLS) algorithm, presented for both CPU and GPU architectures. The developed system provides successful results in precise tracking of single and multiple targets in monocular video, operating in real time at 70 frames per second for 640 x 480 video resolutions on the GPU, up to 1,100% faster than the CPU version of the algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/852_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/852_implementation_large.png
Academia
Universidad Rey Juan Carlos, Spain
2008
05
08
05/08/2008
Raul Cabido
Antonio S. Montemayor
Juan Jose Pantrigo
Paper
Science
Raul Cabido,Antonio S. Montemayor,Juan Jose Pantrigo,raul.cabido@urjc.es,antonio.sanz@urjc.es,juanjose.pantrigo@urjc.es
066c8093-375d-46dc-a170-4955e4c07315
Deforming a High-Resolution Mesh in Real-Time by Mapping onto a Low-Resolution Physical Model
For interactive surgical simulation, the physical model of the soft tissue needs to be solved in real time. This limits the attainable model density to well below the desired mesh density for visual realism. Previous work avoids this problem by using a high-resolution visual mesh mapped onto a low-resolution physical model. We apply the same approach and present a computationally cheap implementation of a known algorithm to avoid texture artefacts caused by the mapping. We also introduce a spline-based algorithm to prevent groups of high-resolution vertices, mapped to the same low-resolution triangle, from exhibiting movements in which the underlying low-resolution structure can be recognised. The resulting mapping algorithm is very efficient, mapping 54,000 vertices in 8.5 ms on the CPU and in 0.88 ms on the GPU. Consequently, the density of the high-resolution visual mesh is limited only by the detail of the CT data from which the mesh was generated.
/content/cudazone/CUDABrowser/assets/images/applications/1372_evisser08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1372_evisser08_large.png
Research
The Australian e-Health Research Centre
2008
07
07
07/07/2008
Hans de Visser
Olivier Comas
David Conlan
Paper
Science
Hans de Visser,Olivier Comas,David Conlan
3423ec81-fdfb-4e8c-89af-2dce4ce05a4a
ECM on Graphics Cards
This paper reports record-setting performance for the elliptic-curve method of integer factorization: for example, 926.11 curves/second for ECM stage 1 with B1 = 8192 for 280-bit integers on a single PC. The state-of-the-art GMP-ECM software handles 124.71 curves/second for ECM stage 1 with B1 = 8192 for 280-bit integers using all four cores of a 2.4 GHz Core 2 Quad Q6600.
The extra speed takes advantage of extra hardware, specifically two NVIDIA GTX 295 graphics cards, using a new ECM implementation introduced in this paper. Our implementation uses Edwards curves, relies on new parallel addition formulas, and is carefully tuned for the highly parallel GPU architecture. On a single GTX 295 the implementation performs 41.88 million modular multiplications per second for a general 280-bit modulus. GMP-ECM, using all four cores of a Q6600, performs 13.03 million modular multiplications per second.
This paper also reports speeds on other graphics processors: for example, 2414 280-bit elliptic-curve scalar multiplications per second on an older NVIDIA 8800 GTS (G80), again for a general 280-bit modulus. For comparison, the CHES 2008 paper "Exploiting the Power of GPUs for Asymmetric Cryptography" reported 1412 elliptic-curve scalar multiplications per second on the same graphics processor despite having fewer bits in the scalar (224 instead of 280), fewer bits in the modulus (224 instead of 280), and a special modulus (2^224 − 2^96 + 1).
/content/cudazone/CUDABrowser/assets/images/applications/849_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/849_cover-medium_large.jpg
Academia
University of Illinois at Chicago
2009
04
16
04/16/2009
Daniel J. Bernstein
Tien-Ren Chen
Chen-Mou Cheng
Paper
Science
Daniel J. Bernstein,Tien-Ren Chen,Chen-Mou Cheng,djb@cr.yp.to,trchen1033@crypto.tw,doug@crypto.tw
9dd9b7a3-2ab7-4a26-ad08-1f2554a989fe
A Practical Approach of Curved Ray Prestack Kirchhoff Time Migration on GPGPU
We introduced four prototypes of General Purpose GPU solutions using the Compute Unified Device Architecture (CUDA) on the NVidia GeForce 8800GT and Tesla C870 for a practical Curved Ray Prestack Kirchhoff Time Migration program, which is one of the most widely adopted imaging methods in the seismic data processing industry. We presented how to re-design and re-implement the original CPU code as efficient GPU code step by step. We demonstrated optimization methods to improve the runtime performance on GPUs, such as how to reduce the overhead of memory transfers over the PCI-E bus, how to significantly increase the number of kernel threads on GPU cores, how to buffer the inputs and outputs of CUDA kernel modules, and how to use memory streams to overlap GPU kernel execution time. We analyzed the floating point errors between CPUs and GPUs. We presented the images generated by CPU and GPU programs for the same real-world seismic data inputs. Our final approach, Prototype-IV, on the NVidia GeForce 8800GT is 16.3 times faster than its CPU version on Intel's P4 3.0G.
/content/cudazone/CUDABrowser/assets/images/applications/848_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/848_cover-medium_large.jpg
Academia
Beihang University, Beijing
2009
08
21
08/21/2009
Xiaohua Shi
Chuang Li
Xu Wang
Paper
Science
Xiaohua Shi,Chuang Li,Xu Wang,xhshi@buaa.edu.cn,whlichuang@126.com,xu.wang@sei.buaa.edu.cn
6a921e34-8d47-4a42-a892-8398ec64468f
A Practical Quicksort Algorithm for Graphics Processors
In this paper we present GPU-Quicksort, an efficient Quicksort algorithm suitable for highly parallel multi-core graphics processors. Quicksort has previously been considered an inefficient sorting solution for graphics processors, but we show that GPU-Quicksort often performs better than the fastest known sorting implementations for graphics processors, such as radix and bitonic sort. Quicksort can thus be seen as a viable alternative for sorting large quantities of data on graphics processors.
/content/cudazone/CUDABrowser/assets/images/applications/847_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/847_cover-medium_large.jpg
Academia
Chalmers University of Technology, Sweden
2008
09
20
09/20/2008
Daniel Cederman
Philippas Tsigas
Paper
Science
Daniel Cederman,Philippas Tsigas,cederman@chalmers.se,tsigas@chalmers.se
77df8f3f-2d01-428d-b8b3-70fc6d308873
Hardware-Accelerated Particle-Based Volume Rendering for Multiple Irregular Volumes
In this paper, we propose a performance improvement of particle-based volume rendering (PBVR) by using a current, programmable GPU architecture. PBVR allows rendering without visibility sorting by representing a given volume dataset as a set of opaque and emissive particles. In our new GPU acceleration of PBVR, we provide a switchable rendering pipeline that is compatible with both regular and irregular grid volumes. Particle generation is improved by using a cell-by-cell approach for processing large volume datasets. We also reduce the memory cost required for storing all sub-pixel values by proposing a pixel-superimposing technique targeting a large sub-pixel level. Our work demonstrates a full detail rendering rate from 5 to 11 fps for overlapped or separated multi-irregular volumes with a mega-scale number of volume cells on an NVIDIA GeForce 8800GTX.
/content/cudazone/CUDABrowser/assets/images/applications/846_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/846_cover-medium_large.jpg
Academia
Center for the Promotion of Excellence in Higher Education, Kyoto University
2008
12
03
12/03/2008
Naohisa Sakamoto
Ding Zhongming
Takuma Kawamura
Paper
Science
Naohisa Sakamoto,Ding Zhongming,Takuma Kawamura
acb94169-1150-437c-9ebb-6c183be2b38f
Compiler support for general-purpose computation on GPUs
In recent years, the GPU (graphics processing unit) has evolved into an extremely powerful and flexible processor, and it now represents an attractive platform for general-purpose computation. Moreover, changes to the design and programmability of GPUs provide the opportunity to perform general-purpose computation on a GPU (GPGPU). Even though many programming languages, software tools, and libraries have been proposed to facilitate GPGPU programming, the unusual and specific programming model of the GPU remains a significant barrier to writing GPGPU programs. In this paper, we introduce a novel compiler-based approach for GPGPU programming. Compiler directives are used to label code fragments that are to be executed on the GPU. Our GPGPU compiler, Guru, converts the labeled code fragments into ISO-compliant C code that contains appropriate OpenGL and Cg APIs. A native C compiler can then be used to compile it into executable code for the GPU. Our compiler is implemented based on the Open64 compiler infrastructure. Preliminary experimental results from selected benchmarks show that our compiler produces significant performance improvements for programs that exhibit a high degree of data parallelism.
/content/cudazone/CUDABrowser/assets/images/applications/844_neville_small.png
/content/cudazone/CUDABrowser/assets/images/applications/844_neville_large.png
Academia
National Chung Cheng University, China
2008
11
19
11/19/2008
Yu-Te Lin
Peng-Sheng Chen
Paper
Science
Yu-Te Lin,Peng-Sheng Chen,lyt94@cs.ccu.edu.tw,pschen@cs.ccu.edu.tw
a0c19949-34b1-4227-a049-a70feb8ad4e9
A Gradient Descent Approximation for Graph Cuts
Graph cuts have become very popular in many areas of computer vision, including segmentation, energy minimization, and 3D reconstruction. Their ability to find optimal results efficiently and their ease of use are some of the reasons for this popularity. However, there are a few issues with graph cuts, such as the inherently sequential nature of popular algorithms and memory bloat in large-scale problems. In this paper, we introduce a novel method for approximating the graph cut optimization by posing the problem as a gradient descent formulation. The advantages of our method are the ability to work efficiently on large problems and the possibility of convenient implementation on parallel architectures such as inexpensive Graphics Processing Units (GPUs). We have implemented the proposed method on the Nvidia 8800GTS GPU. Classical segmentation experiments on static images and video data showed the effectiveness of our method.
/content/cudazone/CUDABrowser/assets/images/applications/843_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/843_cover-medium_large.jpg
Academia
Gebze Institute of Technology, Gebze
2009
09
02
09/02/2009
Alparslan Yildiz
Yusuf Sinan Akgul
Paper
Science
Alparslan Yildiz,Yusuf Sinan Akgul,yildiz@bilmuh.gyte.edu.tr,akgul@bilmuh.gyte.edu.tr
83479fc3-a87e-49c9-b505-d08ae7a1747f
Data Mining Using Graphics Processing Units
During the last few years, Graphics Processing Units (GPUs) have evolved from simple devices for display signal preparation into powerful coprocessors that not only support typical computer graphics tasks such as rendering of 3D scenarios but can also be used for general numeric and symbolic computation tasks such as simulation and optimization. As a major advantage, GPUs provide extremely high parallelism (with several hundred simple programmable processors) combined with high memory bandwidth at low cost. In this paper, we propose several algorithms for computationally expensive data mining tasks like similarity search and clustering which are designed for the highly parallel environment of a GPU. We define a multidimensional index structure which is particularly suited to support similarity queries under the restricted programming model of a GPU, and define a similarity join method. Moreover, we define highly parallel algorithms for density-based and partitioning clustering. In an extensive experimental evaluation, we demonstrate the superiority of our algorithms running on the GPU over their conventional counterparts on the CPU.
/content/cudazone/CUDABrowser/assets/images/applications/842_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/842_cover-medium_large.jpg
Academia
University of Munich, Germany
2009
08
24
08/24/2009
Christian Bohm
Robert Noll
Claudia Plant
Paper
Science
Christian Bohm,Robert Noll,Claudia Plant,boehm@dbs.ifi.lmu.de,noll@dbs.ifi.lmu.de,plant@lrz.tum.de
23658fd2-eb70-4bb7-bc9a-5efb3e91b16e
GPU RayTracing Pipeline
We present a novel approach to ray tracing execution on commodity graphics hardware using CUDA. We decompose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frustums for packets, breadth-first frustum traversal through a bounding volume hierarchy (BVH) for the scene, and localized ray-primitive intersections. We utilize the well-known parallel primitives scan and segmented scan in order to process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and decompression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and a parallel scan per BVH level. We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is reasonable for GPU implementation. For the same data sets and ray-primitive intersection routines, our pipeline is ~3x faster than an optimized standard depth-first ray tracer implemented in one kernel.
/content/cudazone/CUDABrowser/assets/images/applications/841_paper4_small.png
/content/cudazone/CUDABrowser/assets/images/applications/841_paper4_large.png
Research
Keldysh Institute of Applied Mathematics / Microsoft Research
2010
02
10
02/10/2010
3
K.Garanzha
C.Loop
Paper
Presentation
Graphics
Ray Tracing
GPU, ray tracing, custom pipeline,K.Garanzha,C.Loop,kirill@garanzha.com
89d31666-0540-4526-800d-124ea52364d8
Maaap Reduce
In order to verify the feasibility of using the GPU for a fairly substantial and rapidly changing dataset, a simple set of benchmark functions was created for three main programming language families. Each test evaluated every element in the MNIST dataset with the sigmoid function.
/content/cudazone/CUDABrowser/assets/images/applications/840_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/840_defaultlogo_large.png
CUDA Developer
2009
04
07
04/07/2009
Paul Reimer
Code
Libraries
Paul Reimer
5f27c6d9-6984-495a-bf57-bca8dd6ea108
Rocks CUDA
Rocks Cluster Distribution is a Linux distribution for HPC clusters. It was started by the National Partnership for Advanced Computational Infrastructure and the SDSC in 2000. Rocks includes many tools that make a group of computers into a cluster.
Installations can be customized with additional software packages at install-time by using special user-supplied CDs (called "Roll CDs"). The "Rolls" extend the system by integrating seamlessly and automatically into the management and packaging mechanisms used by the base software, greatly simplifying installation and configuration of large numbers of computers. This project will contain the source code and images for an NPACI ROCKS 5.0 Roll for NVIDIA CUDA libraries and drivers.
/content/cudazone/CUDABrowser/assets/images/applications/839_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/839_defaultlogo_large.png
CUDA Developer
2008
08
14
08/14/2008
3kforme
Code
3kforme
746a6455-e475-40f9-b881-6f85a8ec0e76
GPU based Sparse Grid Technique for Solving Multidimensional Options Pricing PDEs
It has been shown that the sparse grid combination technique can be a practical tool to solve high-dimensional PDEs arising in multidimensional option pricing problems in finance. Hierarchical approximation of these problems leads to linear systems that are smaller in size compared to those arising from standard finite element or finite difference discretizations. However, these systems are still excessively demanding in terms of memory for direct methods and challenging to solve by iterative methods. In this paper we address iterative solutions via preconditioned Krylov subspace based methods, such as Stabilized BiConjugate Gradient (BiCGStab) and CG Squared (CGS), with the main focus on the design of such iterative solvers to harness the massive parallelism of general purpose Graphics Processing Units (GPGPUs). We discuss data structures and efficient implementation of iterative solvers. We also present a number of performance results to demonstrate the scalability of these solvers on NVIDIA's CUDA platform.
/content/cudazone/CUDABrowser/assets/images/applications/838_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/838_graph_large.png
Academia
Chatenay-Malabry, France
2009
12
31
12/31/2009
1000
Abhijeet Gaikwad
Ioane Muni Toke
Paper
Finance
NVIDIA CUDA, Iterative solvers, multidimensional option,Abhijeet Gaikwad,Ioane Muni Toke,abhijeet.gaikwad@ecp.fr,ioane.muni-toke@ecp.fr
a65412a3-1d34-490f-a209-9f8d486c7b55
Micromanager London Kings
CUDA makes the processing power of NVIDIA graphics cards available for normal computation. Here are some add-ons for uManager (http://www.micro-manager.org/), free cross-platform software for controlling microscopes and acquiring images.
/content/cudazone/CUDABrowser/assets/images/applications/837_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/837_defaultlogo_large.png
Research
CUDA Developer
2009
04
28
04/28/2009
Martin Kielhorn
Code
Libraries
Martin Kielhorn
c0250d55-553e-4fa6-aa6a-e91916638b97
CBCL Model CUDA
CUDA version of the HMAX model
/content/cudazone/CUDABrowser/assets/images/applications/836_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/836_defaultlogo_large.png
Research
CUDA Developer
2009
01
28
01/28/2009
Sharat.Chikkerur
Code
Sharat.Chikkerur
9afbcfda-88d9-44a4-9cf3-8c2e3c2ec1d9
Real-time virtual environment signal extraction and denoising using programmable graphics hardware
The sense of being within a three-dimensional (3D) space and interacting with virtual 3D objects in a computer-generated virtual environment (VE) often requires essential image, vision and sensor signal processing techniques such as differentiating and denoising. This paper describes novel implementations of Gaussian filtering for characteristic signal extraction and of wavelet-based image denoising algorithms that run on the graphics processing unit (GPU). While significant acceleration over standard CPU implementations is obtained by exploiting the data parallelism provided by modern programmable graphics hardware, the CPU is also freed up to run other computations, such as artificial intelligence (AI) and physics, more efficiently. The proposed GPU-based Gaussian filtering can extract surface information from a real object and provide its material features for rendering and illumination. The wavelet-based signal denoising for large digital images realized in this project provided better realism for VE visualization without sacrificing the real-time and interactive performance of an application.
/content/cudazone/CUDABrowser/assets/images/applications/835_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/835_cover-medium_large.jpg
Academia
University of Huddersfield, Queensgate, Huddersfield
2009
10
21
10/21/2009
Yang Su
Zhi-Jie Xu
Xiang-Qian Jiang
Paper
Signal Processing
Yang Su,Zhi-Jie Xu,Xiang-Qian Jiang,y.su@hud.ac.uk
578067f7-ac4b-47f2-950b-1f9ed61408e5
Extracting Curve Skeletons from Gray Value Images for Virtual Endoscopy
The extraction of curve skeletons from tubular networks is a necessary prerequisite for virtual endoscopy applications. We present an approach for curve skeleton extraction directly from gray value images that supersedes the need to deal with segmentations and skeletonizations. The approach uses properties of the Gradient Vector Flow to derive a tube-likeliness measure and a medialness measure. Their combination allows the detection of tubular structures and an extraction of their medial curves that stays centered also in cases where the structures are not tubular such as junctions or severe stenoses. We present results on clinical datasets and compare them to curve skeletons derived with different skeletonization approaches from high quality segmentations. Our approach achieves a high centerline accuracy and is computationally efficient by making use of a GPU based implementation of the Gradient Vector Flow.
/content/cudazone/CUDABrowser/assets/images/applications/834_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/834_cover-medium_large.jpg
Academia
Graz University of Technology, Austria
2008
07
15
07/15/2008
Christian Bauer
Horst Bischof
Paper
Science
Christian Bauer,Horst Bischof,cbauer@icg.tu-graz.ac.at,bischof@icg.tu-graz.ac.at
bb093c2d-d78d-4eb7-81b1-af6e59587e17
Evaluating the Jaccard-Tanimoto Index on Multi-core Architectures
The Jaccard/Tanimoto coefficient is an important workload, used in a large variety of problems including drug design fingerprinting, clustering analysis, similarity web searching and image segmentation. This paper evaluates the Jaccard coefficient on three platforms: the Cell Broadband Engine processor, an Intel Xeon dual-core platform, and an NVIDIA 8800 GTX GPU. In our work, we have developed a novel parallel algorithm specially suited to the Cell/B.E. architecture for all-to-all Jaccard comparisons that minimizes DMA transfers and reuses data in the local store. We show that our implementation on Cell/B.E. outperforms the implementations on comparable Intel platforms by 6-20X with full accuracy and by 10-50X in reduced accuracy mode, depending on the size of the data, and by more than 60X compared to the NVIDIA 8800 GTX. In addition to performance, we also discuss in detail our efforts to optimize our workload on these architectures and explain how the avenues for optimization differ from one architecture to another for our workload. Our work shows that the algorithms or kernels employed for the Jaccard coefficient calculation are heavily dependent on the traits of the target hardware.
/content/cudazone/CUDABrowser/assets/images/applications/833_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/833_cover-medium_large.jpg
Academia
Technologies Design Center, Indianapolis
2009
05
20
05/20/2009
20
Vipin Sachdeva
Douglas M. Freimuth
Chris Mueller
Paper
Science
Vipin Sachdeva,Douglas M. Freimuth,Chris Mueller,vsachde@us.ibm.com,dmfreim@us.ibm.com,chemuell@cs.indiana.edu
7a0a2a4f-3fa3-4ed0-8e1c-f3dd9f2835e7
Focused Volumetric Visual Hull with Color Extraction
This paper introduces a new approach for volumetric visual hull reconstruction, using a voxel grid that focuses on the moving target object. This grid is continuously updated as a function of object location, orientation, and size. The benefit is a reduced number of voxels that have to be evaluated or allocated towards capturing the target at higher resolution. This technique particularly improves reconstructions where the total reconstruction space is larger than the moving reconstruction target. The higher resolution of the voxel grid also reduces the computational cost per voxel reprojection since a one voxel to one input pixel reprojection ratio is approximated. In addition, the appropriate view-independent color of the surface voxels is computed, allowing for realistic visual hull texturing. All color calculations are performed locally, based on approximated surface voxel normals and the input images. A color outlier detection approach is introduced, which reduces the influence of occlusions in the color evaluation. The parallel nature of the presented focused visual hull reconstruction technique lends itself to hardware acceleration, allowing interactive rates to be achieved by performing most computations on the GPU. A set of case studies is provided for well-defined static and dynamic data sets.
/content/cudazone/CUDABrowser/assets/images/applications/832_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/832_cover-medium_large.jpg
Academia
University of California, San Diego
2009
11
26
11/26/2009
Daniel Knoblauch
Falko Kuester
Paper
Science
Daniel Knoblauch,Falko Kuester
89f67b35-e35e-4e8c-8b22-14c848a66f32
Fourier Volume Rendering on GPGPU
Fourier Volume Rendering (FVR) is a volume rendering technique with a lower computational complexity of O(N^2 log N) for an N^3 data array. A new FVR algorithm is proposed by extending the Fourier Projection-Slice Theorem to higher dimensions and mapping the pipeline entirely onto the GPU. A windowed-sinc function is used as the reconstruction filter to implement higher-order interpolation, and the reduction of samples is executed on the GPU in parallel, which suits heterogeneous multi-core architectures. The rendering is accelerated by a factor of 7 when the rendered image resolution is larger than 512x512.
/content/cudazone/CUDABrowser/assets/images/applications/831_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/831_cover-medium_large.jpg
Academia
Hunan University
2009
05
21
05/21/2009
Degui Xiao
Yi Liu
Lei Yang
Paper
Science
Degui Xiao,Yi Liu,Lei Yang
e7dc92ba-1736-4c6f-92ea-6ac559d565f7
Practical Random Linear Network Coding on GPUs
Recently, random linear network coding has been widely applied in peer-to-peer network applications. Instead of sharing the raw data with each other, peers in the network produce and send encoded data to each other. As a result, the communication protocols have been greatly simplified, and the applications experience higher end-to-end throughput and better robustness to network churn. Since it is difficult to verify the integrity of the encoded data, such systems can suffer from the well-known pollution attack, in which a malicious node can send bad encoded blocks that consist of bogus data. Consequently, the bogus data will be propagated into the whole network at an exponential rate. Homomorphic hash functions (HHFs) have been designed to defend systems from such pollution attacks, but with a new challenge: HHFs require that network coding be performed in GF(q), where q is a very large prime number. This greatly increases the computational cost of network coding, in addition to the already computationally expensive HHFs. This paper exploits the huge computing power of Graphics Processing Units (GPUs) to reduce the computational cost of network coding and homomorphic hashing. With our network coding and HHF implementation on the GPU, we observed significant computational speedup in comparison with the best CPU implementation. This implementation can lead to a practical solution for defending against pollution attacks in distributed systems.
/content/cudazone/CUDABrowser/assets/images/applications/830_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/830_cover-medium_large.jpg
Academia
Hong Kong Baptist University / University of Calgary, Alberta, Canada
2009
05
07
05/07/2009
Xiaowen Chu
Kaiyong Zhao
Mea Wang
Paper
Science
Xiaowen Chu,Kaiyong Zhao,Mea Wang,chxw@comp.hkbu.edu.hk,kyzhao@comp.hkbu.edu.hk,meawang@ucalgary.ca
207cd764-e884-47aa-b0c3-b5505bedfbe4
Fast Conjugate Gradients with Multiple GPUs
The limiting factor for the efficiency of sparse linear solvers is memory bandwidth. In this work, we describe a fast Conjugate Gradient solver for unstructured problems, which runs on multiple GPUs installed on a single mainboard. The solver achieves double precision accuracy with single precision GPUs, using a mixed precision iterative refinement algorithm. To achieve high computation speed, we propose a fast algorithm for sparse matrix-vector multiplication, which is the core operation of iterative solvers. The proposed multiplication algorithm efficiently utilizes GPU resources via caching, coalesced memory accesses and load balancing between running threads. Experiments on a wide range of matrices show that our matrix-vector multiplication algorithm achieves up to 11.6 Gflops on a single GeForce 8800 GTS card and the CG implementation achieves up to 24.6 Gflops with four GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/829_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/829_cover-medium_large.jpg
Academia
Tokyo Institute of Technology / National Institute of Informatics, Japan
2009
05
20
05/20/2009
Ali Cevahir
Akira Nukada
Satoshi Matsuoka
Paper
Science
Ali Cevahir,Akira Nukada,Satoshi Matsuoka,ali@matsulab.is.titech.ac.jp,nukada@matsulab.is.titech.ac.jp,matsu@is.titech.ac.jp
307d80ab-1016-4ba9-9fda-be6f1e85a18f
Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study
To facilitate the design of hardware accelerators we propose in this paper the adoption of the stream-based computing model and the usage of Graphics Processing Units (GPUs) as prototyping platforms. This model exposes the maximum data parallelism available in the applications and decouples computation from memory accesses. The design and implementation procedures, including the programming of GPUs, are illustrated with the widely used MrBayes bioinformatics application. Experimental results show that a straightforward mapping of the stream-based program for the GPU into hardware structures leads to improvements in performance, scalability and cost. Moreover, it is shown that a set of simple optimization techniques can be applied in order to reduce the cost, and the power consumption of hardware solutions.
/content/cudazone/CUDABrowser/assets/images/applications/828_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/828_cover-medium_large.jpg
Academia
Rua Alves Redol
2009
07
21
07/21/2009
Frederico Pratas
Leonel Sousa
Paper
Science
Frederico Pratas,Leonel Sousa,fcpp@inesc-id.pt,las@inesc-id.pt
24d9dfbe-430a-4065-b835-69d1728e3a2b
Parallel Calculating of the Goal Function in Metaheuristics Using GPU
We consider a metaheuristic optimization algorithm which uses a single process (thread) to guide the search through the solution space. The thread iteratively performs two main tasks: the goal function evaluation for a single solution or a set of solutions, and management (solution filtering and selection, collection of history, updating). The latter task statistically takes 1-3% of the total iteration time, so we do not attempt to accelerate it. The former task can be accelerated in parallel environments in various manners. We propose a parallel small-grain calculation model providing a cost-optimal method. Then, we carry out an experiment using a Graphics Processing Unit (GPU) to confirm our theoretical results.
/content/cudazone/CUDABrowser/assets/images/applications/827_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/827_cover-medium_large.jpg
Academia
Wrocław University of Technology
2009
05
20
05/20/2009
Wojciech Bozejko
Czesław Smutnicki
Mariusz Uchronski
Paper
Science
Wojciech Bozejko,Czesław Smutnicki,Mariusz Uchronski,wojciech.bozejko@pwr.wroc.pl,czeslaw.smutnicki@pwr.wroc.pl,mariusz.uchronski@pwr.wroc.pl
89537d32-f563-4d80-af24-b3b43058d026
Accelerating astrophysical particle simulations with programmable hardware (FPGA and GPU)
In a previous paper we have shown that direct gravitational N-body simulations in astrophysics scale very well for moderately parallel supercomputers (order 10-100 nodes). The best balance between computation and communication is reached if the nodes are accelerated by special purpose hardware; in this paper we describe the implementation of particle-based astrophysical simulation codes on new types of accelerator hardware (field programmable gate arrays, FPGA, and graphical processing units, GPU). In addition to direct gravitational N-body simulations we also use the algorithmically similar smoothed particle hydrodynamics (SPH) method as a test application; the algorithms are used for astrophysical problems such as the evolution of galactic nuclei with central black holes and gravitational wave generation, and star formation in galaxies and galactic nuclei. We present the code performance on a single node using different kinds of special hardware (traditional GRAPE, FPGA, and GPU) and some implementation aspects (e.g. accuracy). The results show that GPU hardware for real application codes is as fast as GRAPE, but at an order of magnitude lower price, and that FPGA is useful for acceleration of complex sequences of operations (like SPH). We discuss future prospects and new cluster computers built with new generations of FPGA and GPU cards.
/content/cudazone/CUDABrowser/assets/images/applications/826_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/826_implementation_large.png
Academia
University of Heidelberg
2009
05
12
05/12/2009
R. Spurzem
P. Berczik
G. Marcus
Paper
Science
R. Spurzem,P. Berczik,G. Marcus,spurzem@ari.uni-heidelberg.de,berczik@ari.uni-heidelberg.de,guillermo.marcus@ziti.uni-heidelberg.de
a9e38eba-e87f-426b-916a-5c33b9f69177
A framework for exploring numerical solutions of advection reaction diffusion equations using a GPU-based approach
In this paper we describe a general purpose, graphics processing unit (GP-GPU)-based approach for solving partial differential equations (PDEs) within advection reaction diffusion models. The GP-GPU-based approach provides a platform for solving PDEs in parallel and can thus significantly reduce solution times over traditional CPU implementations. This allows for a more efficient exploration of various advection reaction diffusion models, as well as the parameters that govern them. Although the GPU does impose limitations on the size and accuracy of computations, the PDEs describing the advection reaction diffusion models of interest to us fit comfortably within these constraints. Furthermore, GPU technology continues to rapidly increase in speed, memory, and precision, so applying these techniques to larger systems should be possible in the future. We chose to solve the PDEs using two numerical approaches: for the diffusion, a first-order explicit forward Euler solution and a semi-implicit second-order Crank-Nicolson solution; and, for the advection and reaction, a first-order explicit solution. The goal of this work is to provide motivation and guidance to the application scientist interested in exploring the use of the GP-GPU computational framework in the course of their research. In this paper, we present a rigorous comparison of our GPU-based advection reaction diffusion code model with a CPU-based analog, finding that the GPU model outperforms the CPU implementation in one-to-one comparisons.
/content/cudazone/CUDABrowser/assets/images/applications/825_computedvisualation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/825_computedvisualation_large.png
Academia
University of Utah
2008
03
04
03/04/2008
Allen R. Sanderson
Miriah D. Meyer
Robert M. Kirby
Paper
Numerics
Allen R. Sanderson,Miriah D. Meyer,Robert M. Kirby,allen@sci.utah.edu,miriah@sci.utah.edu,kirby@sci.utah.edu
7d3cc29a-3dac-4791-8478-77dd28708ea8
Going Forward with GPU Computing
This article describes why CEA is looking at GPU Computing and how the first experiments are conducted. We describe here a well defined global strategy which relies on training users and taking advantage of Grand Challenges, involving early access users and system administrators. We also describe some preliminary results and raise questions which need to be addressed in the near future.
/content/cudazone/CUDABrowser/assets/images/applications/824_highperformancecomputing_small.png
/content/cudazone/CUDABrowser/assets/images/applications/824_highperformancecomputing_large.png
Research
CEA, DAM
2009
10
07
10/07/2009
Guillaume Colin de Verdiere
Paper
Science
Guillaume Colin de Verdiere
441d4d84-f548-4465-ac76-eef36ff2a059
Introduction to Mastering Cell BE and GPU Execution Platforms
Both Cell BE-type and GPU processors have emerged as multi-processor execution platforms that can outperform general purpose multi-core computers in certain application domains. The two architectures are quite different, and by no means interchangeable. GPUs are reminiscent of fine-grained systolic array architectures, while the Cell BE is suitable to execute a set of co-ordinated coarse-grained tasks. By now, enough applications have been mapped on either of these two processors, mostly by hand, that the pros and cons tables can be filled. The next step is to provide mappings that are based on efficient programming models and methods, in particular methods that minimize communication overheads. The six papers in this special session are attempts to take precisely that route. Three of them take the GPU as the underlying execution platform, the third also taking the Cell BE multicore processor into consideration. The other three papers target the Cell BE processor.
/content/cudazone/CUDABrowser/assets/images/applications/823_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/823_cover-medium_large.jpg
Academia
Leiden University, the Netherlands
2009
07
21
07/21/2009
Ed Deprettere
Ana L. Varbanescu
Paper
Science
Ed Deprettere,Ana L. Varbanescu
8949e7e3-c9b6-487a-894e-75c35f7b8d45
Development of a GPU-based multithreaded software application to calculate digitally reconstructed radiographs for radiotherapy
To provide faster calculation of digitally reconstructed radiographs (DRRs) in patient-positioning verification, we developed and evaluated a graphics processing unit (GPU)-based DRR software application and compared it with a central processing unit (CPU)-based application. The evaluation metrics were calculation speed and image quality for various slice thicknesses. The results showed that the GPU-based DRR computation was an average of 50 times faster than the CPU-based methodology, whereas the image quality was very similar. This excellent performance may increase the accuracy of patient positioning and improve the patient treatment throughput time.
/content/cudazone/CUDABrowser/assets/images/applications/822_radialogics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/822_radialogics_large.png
Research
National Institute of Radiological Sciences, Japan
2008
11
07
11/07/2008
Shinichiro Mori
Paper
Medical Imaging
Shinichiro Mori,shinshin@nirs.go.jp
6884796d-0fa2-4f33-9297-1fde62fcc824
Lattice Boltzmann based PDE solver on the GPU
In this paper, we propose a hardware-accelerated PDE (partial differential equation) solver based on the lattice Boltzmann model (LBM). The LBM was initially designed to solve fluid dynamics problems by constructing simplified microscopic kinetic models. As an explicit numerical scheme with only local operations, it has the advantage of being easy to implement and especially suitable for graphics hardware (GPU) acceleration. Beyond the Navier-Stokes equation of fluid mechanics, a typical LBM can be modified to solve the parabolic diffusion equation, which is further used to solve the elliptic Laplace and Poisson equations with a diffusion process. These PDEs are widely used in modeling and manipulating images, surfaces and volumetric data sets. Therefore, the LBM scheme can be used as a GPU-based numerical solver to provide a fast and convenient alternative to traditional implicit iterative solvers. We apply this method to several examples in volume smoothing, surface fairing and image editing, achieving outstanding performance on contemporary graphics hardware. It has great potential to be used as a general GPU computing framework for efficiently solving PDEs in image processing, computer graphics and visualization.
/content/cudazone/CUDABrowser/assets/images/applications/821_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/821_visualcomputer_large.png
Academia
Kent State University
2007
12
07
12/07/2007
Ye Zhao
Paper
Imaging
Ye Zhao,zhao@cs.kent.edu
1acdf9de-8761-4f13-9fea-7b8b02b55719
Real-Time Online Video Object Silhouette Extraction Using Graph Cuts on the GPU
Being able to find the silhouette of an object is a very important front-end processing step for many high-level computer vision techniques, such as Shape-from-Silhouette 3D reconstruction methods, object shape tracking, and pose estimation. Graph cuts have been proposed as a method for finding very accurate silhouettes which can be used as input to such high-level techniques, but graph cuts are notoriously computationally intensive and slow. Leading CPU implementations can extract a silhouette from a single QVGA image in 100 milliseconds, with performance dramatically decreasing with increased resolution. Recent GPU implementations have been able to achieve performance of 6 milliseconds per image by exploiting the intrinsic properties of the lattice graphs and the hardware model of the GPU. However, these methods are restricted to a subclass of lattice graphs and are not generally applicable. We propose a novel method for graph cuts on the GPU which places no limits on graph configuration and which is able to achieve comparable real-time performance in online video processing scenarios.
/content/cudazone/CUDABrowser/assets/images/applications/820_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/820_cover-medium_large.jpg
Academia
Keio University
2009
08
29
08/29/2009
Zachary A. Garrett
Hideo Saito
Paper
Video & Audio
Zachary A. Garrett,Hideo Saito,zgarrett@hvrl.ics.keio.ac.jp,saito@hvrl.ics.keio.ac.jp
1919e879-ecaa-471f-b6cb-93415638c16a
Seeded ND medical image segmentation by cellular automaton on GPU
Purpose: We present a GPU-based framework to perform organ segmentation in N-dimensional (ND) medical image datasets by computation of weighted distances using the Ford-Bellman algorithm (FBA). Our GPU implementation of FBA gives an alternative and optimized solution to other graph-based segmentation techniques.
http://springerlink.com/content/v92w2q820w412jj8/?p=617f22391ecf47f89a3da0c82420ae97&pi=63
/content/cudazone/CUDABrowser/assets/images/applications/819_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/819_cover-medium_large.jpg
Research
Notre-Dame Hospital
2009
07
31
07/31/2009
Claude Kauffmann
Paper
Medical Imaging
Claude Kauffmann,claude.kauffmann@gmail.com
4bd610d3-92f8-4032-9730-02b0e6091d1f
On GPU's viability as a middleware accelerator
Today Graphics Processing Units (GPUs) are a largely underexploited resource on existing desktops and a possible cost-effective enhancement to high-performance systems. To date, most applications that exploit GPUs are specialized scientific applications. Little attention has been paid to harnessing these highly-parallel devices to support more generic functionality at the operating system or middleware level. This study starts from the hypothesis that generic middleware-level techniques that improve distributed system reliability or performance (such as content addressing, erasure coding, or data similarity detection) can be significantly accelerated using GPU support. We take a first step towards validating this hypothesis and we design StoreGPU, a library that accelerates a number of hashing-based middleware primitives popular in distributed storage system implementations. Our evaluation shows that StoreGPU enables up to twenty-five-fold performance gains on synthetic benchmarks as well as on a high-level application: the online similarity detection between large data files.
/content/cudazone/CUDABrowser/assets/images/applications/818_scalable_small.png
/content/cudazone/CUDABrowser/assets/images/applications/818_scalable_large.png
Academia
University of British Columbia
2009
01
17
01/17/2009
Samer Al-Kiswany
Abdullah Gharaibeh
Elizeu Santos-Neto
Paper
Science
Samer Al-Kiswany,Abdullah Gharaibeh,Elizeu Santos-Neto,samera@ece.ubc.ca,abdullah@ece.ubc.ca,elizeus@ece.ubc.ca
03286d23-be49-45d1-be2b-790c02badee7
Implementing Decision Trees and Forests on a GPU
We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition. Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, non-branching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms.
http://springerlink.com/content/y702n504831g232m/?p=617f22391ecf47f89a3da0c82420ae97&pi=61
/content/cudazone/CUDABrowser/assets/images/applications/817_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/817_cover-medium_large.jpg
Academia
Microsoft Research, Cambridge, UK
2008
10
12
10/12/2008
Toby Sharp
Paper
Science
Toby Sharp,toby.sharp@microsoft.com
e4fd34a1-868c-482b-9522-41104b157431
CUDAMat
CUDAMat provides a CUDA-based matrix class for Python, making it easy to implement algorithms that are easily expressed in terms of dense linear algebra.
/content/cudazone/CUDABrowser/assets/images/applications/816_google_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/816_google_large.jpg
Academia
University of Toronto
2009
11
30
11/30/2009
50
Open source
Volodymyr Mnih
Code
Libraries
Volodymyr Mnih,vmnih@cs.toronto.edu
0692da9d-1f32-4819-a7e6-278383b1c438
Parallelization of a Video Segmentation Algorithm on CUDA Enabled Graphics Processing Units
Nowadays, Graphics Processing Units (GPUs) are emerging as SIMD coprocessors for general-purpose computations, especially after the launch of NVIDIA CUDA. Since then, some libraries have been implemented for matrix computation and image processing. However, in real video applications some stages need irregular data distributions and the parallelism is less inherent. This paper presents the parallelization of a video segmentation application on GPU hardware, which implements an algorithm for detecting abrupt and gradual transitions. A critical part of the algorithm requires highly intensive computation to calculate video frame features. Results on three CUDA-enabled GPUs are encouraging because of the significant speedup achieved. They are also compared with an OpenMP version of the algorithm running on two multi-core platforms.
/content/cudazone/CUDABrowser/assets/images/applications/815_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/815_cover-medium_large.jpg
Academia
University of Cordoba, Spain / University of Malaga, Spain
2009
08
22
08/22/2009
Juan Gomez-Luna
Jose Maria Gonzalez-Linares
Jose Ignacio Benavides
Paper
Science
Juan Gomez-Luna,Jose Maria Gonzalez-Linares,Jose Ignacio Benavides,el1goluj@uco.es,gonzalez@ac.uma.es,el1bebej@uco.es
ae4da9b0-398e-4b88-ad64-95c879d6e61f
Fast and automatic object pose estimation for range images on the GPU
We present a pose estimation method for rigid objects from single range images. Using 3D models of the objects, many pose hypotheses are compared in a data-parallel version of the downhill simplex algorithm with an image-based error function. The pose hypothesis with the lowest error value yields the pose estimation (location and orientation), which is refined using ICP. The algorithm is designed especially for implementation on the GPU. It is completely automatic, fast, robust to occlusion and cluttered scenes, and scales with the number of different object types. We apply the system to bin picking, and evaluate it on cluttered scenes. Comprehensive experiments on challenging synthetic and real-world data demonstrate the effectiveness of our method.
/content/cudazone/CUDABrowser/assets/images/applications/814_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/814_implementation_large.png
Academia
Inha University, Korea
2009
08
04
08/04/2009
In Kyu Park
Paper
Science
In Kyu Park,pik@inha.ac.kr
6a9ef568-5517-4b74-b3d0-0070e8b2ab21
MinGPU: a minimum GPU library for computer vision
In the field of computer vision, it is becoming increasingly popular to implement algorithms, in sections or in their entirety, on a graphics processing unit (GPU). This is due to the superior speed GPUs offer compared to CPUs. In this paper, we present a GPU library, MinGPU, which contains all of the necessary functions to convert an existing CPU code to GPU. We have created GPU implementations of several well known computer vision algorithms, including the homography transformation between two 3D views. We provide timing charts and show that our MinGPU implementation of homography transformations performs approximately 600 times faster than its C++ CPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/813_iss_small.png
/content/cudazone/CUDABrowser/assets/images/applications/813_iss_large.png
Academia
University of Central Florida
2009
05
28
05/28/2009
Pavel Babenko
Paper
Science
Pavel Babenko,pavelb@cs.ucf.edu
38ad061e-364d-40a7-8e42-1233c587d56e
GPU Accelerated Non-rigid Registration for the Evaluation of Cardiac Function
We present a method for the fast and efficient tracking of motion in cardiac magnetic resonance (CMR) cines. A GPU accelerated Levenberg-Marquardt non-linear least squares optimization procedure for finite element non-rigid registration was implemented on an NVIDIA graphics card using the OpenGL environment. Points were tracked from frame to frame using forward and backward incremental registration. The inner (endocardial) and outer (epicardial) borders of the heart were tracked in six short-axis cines with ~25 frames through the cardiac cycle in 36 patients with vascular disease. Contours placed by two independent expert observers using a semi-automatic ventricular analysis program (CIM version 4.6) were used as the gold standard. The method took 0.5 seconds per frame, and the maximum Hausdorff errors were less than 2 mm on average, which was of the same order as the expert inter-observer error. In conclusion, GPU accelerated Levenberg-Marquardt non-linear optimization enables fast and accurate tracking of cardiac motion in CMR images.
/content/cudazone/CUDABrowser/assets/images/applications/812_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/812_cover-medium_large.jpg
Academia
University of Auckland
2008
10
30
10/30/2008
Bo Li
Alistair A. Young
Brett R. Cowan
Paper
Science
Bo Li,Alistair A. Young,Brett R. Cowan,b.li@auckland.ac.nz,a.young@auckland.ac.nz,b.cowan@auckland.ac.nz
22e19a72-ca7f-4e5d-a9d9-fbe3cbb38d5c
A Hybrid Parallel Signature Matching Model for Network Security Applications Using SIMD GPU
High performance signature matching against a large dictionary is of great importance in network security applications. The many-core SIMD GPU is a competitive choice for signature matching. In this paper, a hybrid parallel signature matching model (HPSMM) using SIMD GPU is proposed, which uses pattern set partition and input text partition together. Then the problem of load balancing for multiprocessors in the GPU is discussed carefully, and a balanced pattern set partition method (BPSPM) employed in HPSMM is introduced. Experiments demonstrate that using pattern set partition and input text partition together can help achieve a better performance, and the proposed BPSPM-Length works well in load balancing.
/content/cudazone/CUDABrowser/assets/images/applications/811_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/811_cover-medium_large.jpg
Academia
National University of Defense Technology, China
2009
08
21
08/21/2009
Chengkun Wu
Jianping Yin
Zhiping Cai
Paper
Science
Chengkun Wu,Jianping Yin,Zhiping Cai,chengkun_wu@nudt.edu.cn,jpyin@nudt.edu.cn,zpcai@nudt.edu.cn
89d9f616-d298-43a4-99e1-3fe1db248cba
Parallel 3D Image Segmentation of Large Data Sets on a GPU Cluster
In this paper, we propose an inherent parallel scheme for 3D image segmentation of large volume data on a GPU cluster. This method originates from an extended Lattice Boltzmann Model (LBM), and provides a new numerical solution for solving the level set equation. As a local, explicit and parallel scheme, our method lends itself to several favorable features: (1) Very easy to implement with the core program only requiring a few lines of code; (2) Implicit computation of curvatures; (3) Flexible control of generating smooth segmentation results; (4) Strong amenability to parallel computing, especially on low-cost, powerful graphics hardware (GPU). The parallel computational scheme is well suited for cluster computing, leading to a good solution for segmenting very large data sets.
/content/cudazone/CUDABrowser/assets/images/applications/810_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/810_cover-medium_large.jpg
Academia
Kent State University
2009
11
26
11/26/2009
Aaron Hagan
Ye Zhao
Paper
Science
Aaron Hagan,Ye Zhao
9e70b216-1271-4886-be56-fe79e2bb7ea9
Computing the Longest Common Transposition-Invariant Subsequence with GPU
Finding a longest common transposition-invariant subsequence (LCTS) of two given integer sequences A = a_1 a_2 ... a_m and B = b_1 b_2 ... b_n (a generalization of the well-known longest common subsequence (LCS) problem) has arisen in the field of music information retrieval. In the LCTS problem, we look for an LCS of the sequences A + t = (a_1 + t)(a_2 + t) ... (a_m + t) and B, where t is any integer. Performance of the top graphics processing units (GPUs) outgrew the performance of the top CPUs a few years ago, and there has been a surge of interest in recent years in using GPUs for general processing. We propose and evaluate a bit-parallel algorithm solving the LCTS problem on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/809_Untitledsecuritytechnology_small.png
/content/cudazone/CUDABrowser/assets/images/applications/809_Untitledsecuritytechnology_large.png
Academia
Silesian University of Technology
2009
10
01
10/01/2009
Sebastian Deorowicz
Paper
Computer Aided Engineering
Sebastian Deorowicz,sebastian.deorowicz@polsl.pl
58db5b29-d3e0-4e9a-975a-d39dfd48e727
Real-Time GPU-Based Voxel Carving with Systematic Occlusion Handling
We present an approach to compute the visual hulls of multiple people in real-time in the presence of occlusions. We prove that the resulting visual hulls are correct and minimal under occlusions. Our proposed algorithm runs completely on the GPU with framerates up to 50fps for multiple people using only one computer equipped with off-the-shelf hardware. We also compare runtimes for different graphic chips and show that our approach scales very well without additional effort. Comparison to other work shows that our algorithm is as fast as state-of-the-art technology. The resulting visual hulls can be the basis for a wide range of algorithms that require a robust voxel representation as input.
/content/cudazone/CUDABrowser/assets/images/applications/808_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/808_cover-medium_large.jpg
Academia
Fraunhofer IITB Karlsruhe / Universitat Karlsruhe
2009
09
02
09/02/2009
Alexander Schick
Rainer Stiefelhagen
Paper
Science
Alexander Schick,Rainer Stiefelhagen,alexander.schick@iitb.fraunhofer.de,rainer.stiefelhagen@iitb.fraunhofer.de
f4504157-17b0-4b17-9476-d48e77994f7f
Arion Render
Arion is the hybrid-accelerated and physically-based light simulator developed by RandomControl. It comprises an interactive WYSIWYG editing application and a super-high performance production renderer. Arion uses all the GPUs and all the CPUs in your system simultaneously, not wasting a single flop available. Additionally, Arion can use all the GPUs and all the CPUs in all the other computers in your network, forming a cluster for massive computation. Arion is a grid-computing solution to the problem of light physics simulation.
/content/cudazone/CUDABrowser/assets/images/applications/807_arion_cuda_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/807_arion_cuda_large.jpg
Commercial
RandomControl S.L.U.
2010
04
01
04/01/2010
50
Commercial
RandomControl
Application
Multimedia
Presentation
Graphics
Imaging
Raytracing
raytracing rendering physically-based unbiased randomcontrol arion fryrender,RandomControl,tech@randomcontrol.com
1a492908-0605-4d9e-af4f-085ff724e6cf
Asymmetric Distributed Shared Memory
GMAC is a run-time system that implements an Asymmetric Distributed Shared Memory model. This model eases the task of programming CUDA applications by building a unified global address space including system and GPU memories. Code executed on the CPU can transparently access data hosted in GPU memory, but code run on the GPU is constrained to access the data hosted in its own memory. GMAC removes the need to perform explicit data transfers using cudaMemcpy() calls and handles all data transfers in a transparent and efficient way. Moreover, the unified address space implemented by GMAC allows using CPU pointers in the GPU code.
/content/cudazone/CUDABrowser/assets/images/applications/806_google_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/806_google_large.jpg
Academia
Universitat Politecnica de Catalunya / University of Illinois
2009
11
02
11/02/2009
Open source
Isaac Gelado
Application
Presentation
Libraries
Isaac Gelado,igelado@ac.upc.edu
ef51a1b4-1fff-412e-a96d-796a24015f38
Octane Renderer
Octane Render is a fully GPU-powered, un-biased and physically based rendering application, with a 10-15X speed increase over un-biased CPU based renderers
/content/cudazone/CUDABrowser/assets/images/applications/806_octane_cuda_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/806_octane_cuda_large.jpg
Commercial
Refractive Software LTD
http://www.refractivesoftware.com
2010
01
10
01/10/2010
15
Commercial
Refractive Software LTD
Application
Multimedia
Imaging
Video & Audio
Graphics
Refractive Software LTD
554c3825-b0de-4df9-bd68-f0dba7b2a590
Textbook: GPU
Chinese textbook for CUDA programming
/content/cudazone/CUDABrowser/assets/images/applications/803_20100202044228595_small.png
/content/cudazone/CUDABrowser/assets/images/applications/803_20100202044228595_large.png
Commercial
www.hpctech.com
http://www.hpctech.com/
2009
10
01
10/01/2009
Shu Zhang
Yanli Chu
Kaiyong Zhao
Multimedia
HPC information
Shu Zhang,Yanli Chu,Kaiyong Zhao,zhao.kaiyong@gmail.com
a62c5428-2955-4cf3-9d0d-0078b395153f
QView
Multi-math object viewer. Still under development.
/content/cudazone/CUDABrowser/assets/images/applications/802_qview_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/802_qview_large.jpg
Research
digitker - The digital kernel
2010
04
30
04/30/2010
Dimitar Tsonov
Paper
Presentation
Computational Fluid Dynamics
Finance
Game Physics
Graphics
Numerics
Libraries
Science
math kernel viewer
Dimitar Tsonov,dtsonov@digitker.com
ed7975e2-60da-449a-8a34-febfbd08eebf
Textbook: Programming Massively Parallel Processors: A Hands-on Approach
The first textbook of its kind, Programming Massively Parallel Processors: A Hands-on Approach is authored by Dr. David B. Kirk, NVIDIA Fellow and former chief scientist, and Dr. Wen-mei Hwu, who serves at the University of Illinois at Urbana-Champaign as Chair of Electrical and Computer Engineering in the Coordinated Science Laboratory, co-director of the Universal Parallel Computing Research Center and principal investigator of the CUDA Center of Excellence. The textbook, which is 256 pages, is the first aimed at teaching advanced students and professionals the basic concepts of parallel programming and GPU architectures. Published by Morgan Kaufmann, it explores various techniques for constructing parallel programs and reviews numerous case studies.
/content/cudazone/CUDABrowser/assets/images/applications/801_Kirk-HR_large_small.png
/content/cudazone/CUDABrowser/assets/images/applications/801_Kirk-HR_large_large.png
Academia
NVIDIA and UIUC
2010
01
28
01/28/2010
Dr. David Kirk
Dr. Wen-mei Hwu
Multimedia
Programming textbook
CUDA, Parallel Processing, NVIDIA, GPU,Dr. David Kirk,Dr. Wen-mei Hwu,dkirk@nvidia.com
c8e8ac46-4a7f-47db-b7d6-b79ae238ba7d
PARRET: Parallel RestoreTools
PARRET is a Python package for image deblurring on GPUs. By making use of the parallelism on NVIDIA GPU CUDA architecture, the deblurring time is greatly reduced. Besides image deblurring, PARRET can be used to solve linear equations.
/content/cudazone/CUDABrowser/assets/images/applications/800_demo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/800_demo_large.png
Academia
Emory University
2010
02
01
02/01/2010
15
Open source
Ying Wai (Daniel) Fan
Code
Imaging
deblurring, Python, linear systems of equations,Ying Wai (Daniel) Fan,yfan@emory.edu
3192f565-72ab-4885-9348-2b3afd2511d6
QUDA : A library for QCD on GPUs
QUDA is a library for performing calculations in lattice QCD on graphics processing units (GPUs) using NVIDIA's C for CUDA API. The current release includes optimized kernels for applying the Wilson Dirac operator and clover-improved Wilson Dirac operator, kernels for performing various BLAS-like operations, and full inverters built on these kernels. Mixed-precision implementations of both CG and BiCGstab are provided, with support for double, single, and half (16-bit fixed-point) precision.
/content/cudazone/CUDABrowser/assets/images/applications/799_quda_image_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/799_quda_image_large.jpg
Academia
Boston University and Harvard University
2009
11
17
11/17/2009
10
Open source
M. A. Clark
R. Babich
K. Barros
R. Brower
C. Rebbi
Application
Paper
Code
Science
QCD, linear solver, mixed precision,Mike Clark,mikec@seas.harvard.edu
14721042-0396-4060-8731-199cc53e5bc2
SCGPSim: A fast SystemC simulator on GPUs
The main objective of this paper is to speed up the simulation performance of SystemC designs at the RTL abstraction level by exploiting the high degree of parallelism afforded by today's general purpose graphics processors (GPGPUs). Our approach parallelizes SystemC's discrete-event simulation (DES) on GPGPUs by transforming the model of computation of DES into a model of concurrent threads that synchronize as and when necessary. Our simulation infrastructure is called SCGPSim and it includes a source-to-source (S2S) translator to transform synthesizable SystemC models into parallelly executable programs targeting an NVIDIA GPU. The translator retains the simulation semantics of the original designs by applying semantics preserving transformations. The resulting transformed models mapped onto the massively parallel architecture of GPUs improve simulation efficiency quite substantially. Preliminary experiments with varying-sized examples such as AES, ALU, and FIR have shown simulation speed-ups ranging from 30x to 100x. Considering that our transformations are not yet optimized, we believe that optimizing them will improve the simulation performance even further.
/content/cudazone/CUDABrowser/assets/images/applications/798_scgp2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/798_scgp2_large.png
Academia
FERMAT Lab, Virginia Tech, Blacksburg, VA
http://www.fermat.ece.vt.edu/
2010
01
19
01/19/2010
100
Mahesh Nanjundappa
Hiren D Patel
Bijoy A Jose
Sandeep K Shukla
Paper
Electronic Design Automation
Mahesh Nanjundappa,Hiren D Patel,Bijoy A Jose,knmahesh@vt.edu
128f6237-5801-4d4f-b825-fc3a01ba1578
Myocyte Simulation
The code performs several time-step simulations of a myocyte (heart muscle cell) in parallel, yielding results for different sets of inputs.
/content/cudazone/CUDABrowser/assets/images/applications/797_Myocyte_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/797_Myocyte_large.jpg
Academia
University of Virginia
http://www.virginia.edu
2010
01
31
01/31/2010
10
Lukasz G. Szafaryn
Application
Multimedia
Paper
Code
Life Sciences
Science
Simulation
myocyte, simulation, ode solving, time-step,Lukasz G. Szafaryn,lgs9a@virginia.edu
ab039cd4-07bd-419e-b6b0-a2e7e7be3fec
Mutual Information Based Semi-Global Stereo Matching on the GPU
Real-time stereo matching is necessary for many practical applications, including robotics. There are already many real-time stereo systems, but they typically use local approaches that cause object boundaries to be blurred and small objects to be removed. We have selected the Semi-Global Matching (SGM) method for implementation on graphics hardware, because it can compete with the currently best global stereo methods. At the same time, it is much more efficient than most other methods that produce a similar quality. In contrast to previous work, we have fully implemented SGM including matching with mutual information, which is partly responsible for the high quality of disparity images. Our implementation reaches 4.2 fps on a GeForce 8800 ULTRA with images of 640 x 480 pixel size and 128 pixel disparity range and 13 fps on images of 320 x 240 pixel size and 64 pixel disparity range.
/content/cudazone/CUDABrowser/assets/images/applications/796_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/796_cover-medium_large.jpg
Academia
German Aerospace Center
2008
12
02
12/02/2008
Ines Ernst
Heiko Hirschmuller
Paper
Science
Ines Ernst,Heiko Hirschmuller,ines.ernst@dlr.de,heiko.hirschmueller@dlr.de
4f1c26e4-bd49-4db3-9e21-65632e62b00d
Experiences with Cell-BE and GPU for Tomography
Tomography is a powerful technique for three-dimensional imaging, that deals with image reconstruction from a series of projection images, acquired along a range of viewing directions. An important part of any tomograph system is the reconstruction algorithm. Iterative reconstruction algorithms have many advantages over non-iterative methods, yet their running time can be prohibitively long. As these algorithms have high potential for parallelization, multi-core architectures, such as the Cell-BE and GPU, can possibly alleviate this problem.
In this paper, we describe our experiences in mapping the basic operations of iterative reconstruction algorithms onto these platforms. We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE. Performance results of our implementation demonstrate a speedup of over 40 for a single GPU, compared to a single-core CPU version. By combining eight GPUs and a quad-core CPU in a single system, similar performance to a large cluster consisting of hundreds of CPU cores has been obtained.
/content/cudazone/CUDABrowser/assets/images/applications/795_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/795_cover-medium_large.jpg
Academia
University of Antwerp, Belgium
2009
07
21
07/21/2009
40
Sander van der Maar
Kees Joost Batenburg
Jan Sijbers
Paper
Science
Sander van der Maar,Kees Joost Batenburg,Jan Sijbers,Sander.vanderMaar@ua.ac.be,Joost.Batenburg@ua.ac.be,Jan.Sijbers@ua.ac.be
9256c867-a33e-4bca-8dd1-f56c21b6047b
Experiences with Cell-BE and GPU for Tomography
Tomography is a powerful technique for three-dimensional imaging, that deals with image reconstruction from a series of projection images, acquired along a range of viewing directions. An important part of any tomograph system is the reconstruction algorithm. Iterative reconstruction algorithms have many advantages over non-iterative methods, yet their running time can be prohibitively long. As these algorithms have high potential for parallelization, multi-core architectures, such as the Cell-BE and GPU, can possibly alleviate this problem.
In this paper, we describe our experiences in mapping the basic operations of iterative reconstruction algorithms onto these platforms. We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE. Performance results of our implementation demonstrate a speedup of over 40 for a single GPU, compared to a single-core CPU version. By combining eight GPUs and a quad-core CPU in a single system, similar performance to a large cluster consisting of hundreds of CPU cores has been obtained.
/content/cudazone/CUDABrowser/assets/images/applications/793_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/793_cover-medium_large.jpg
Academia
University of Antwerp, Belgium
2009
07
21
07/21/2009
40
Sander van der Maar
Kees Joost Batenburg
Jan Sijbers
Paper
Science
Sander van der Maar,Kees Joost Batenburg,Jan Sijbers,Sander.vanderMaar@ua.ac.be,Joost.Batenburg@ua.ac.be,Jan.Sijbers@ua.ac.be
4ad94310-447d-47c8-bd18-1a36ddda8728
Multi-walk Parallel Pattern Search Approach on a GPU Computing Platform
This paper studies the efficiency of using Pattern Search (PS) on bound constrained optimization functions on a Graphics Processing Unit (GPU) computing platform. Pattern Search is a direct search optimization technique that does not require derivative information on non-linear programming problems. Pattern Search is ideally suited to a GPU computing environment due to its low memory requirement and no communication between threads in a multi-walk setting. To adapt to a GPU environment, traditional Pattern Search is modified by terminating based on iterations instead of tolerance. This research designed and implemented a multi-walk Pattern Search algorithm on a GPU computing platform. Computational results are promising with a computing speedup of 100+ compared to a corresponding implementation on a single CPU.
/content/cudazone/CUDABrowser/assets/images/applications/792_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/792_cover-medium_large.jpg
Academia
Lamar University
2009
05
20
05/20/2009
Weihang Zhu
James Curry
Paper
Science
Weihang Zhu,James Curry,Weihang.Zhu@lamar.edu,jcurry@my.lamar.edu
1ecce826-a4da-4bd6-932e-11130eeee781
A GPU-Based Simulation of Tsunami Propagation and Inundation
Tsunami simulation combines fluid dynamics, numerical computations, and visualization techniques. Nonlinear shallow water equations are often used to model tsunami propagation. By adding the friction slope to the conservation of momentum, they can also model tsunami inundation. To solve these equations, we use the second-order finite difference MacCormack method. Because it is a finite difference method, it can be parallelized. We use the parallelism provided by the GPU to speed up the computations. By loading data as textures in GPU memory, the computation processes can be written as shader programs and the operations are executed by the GPU in parallel. The results show that with the help of the GPU, the simulation achieves a significant improvement in execution time for each of the computation steps.
/content/cudazone/CUDABrowser/assets/images/applications/790_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/790_cover-medium_large.jpg
Academia
National United University
2009
07
31
07/31/2009
Wen-Yew Liang
Tung-Ju Hsieh
Muhammad T. Satria
Paper
Science
Wen-Yew Liang,Tung-Ju Hsieh,Muhammad T. Satria,wyliang@ntut.edu.tw,tjhsieh@ntut.edu.tw,t6598056@ntut.edu.tw
4aba234f-c87b-477d-84e4-5ccb3a641313
GPU-Supported Image Compression for Remote Visualization Realization and Benchmarking
In this paper we introduce a novel GPU-supported JPEG image compression technique with a focus on its application for remote visualization purposes. Fast and high quality compression techniques are very important for the remote visualization of interactive simulations and Virtual reality applications (IS/VR) on hybrid clusters. Thus the main goals of the design and implementation of this compression technique were low compression times and nearly no visible quality loss, while achieving compression rates that allow for 30+ Frames per second over 10 MBit/s networks. To analyze the potential of the technique and further development needs and to compare it to existing methods, several benchmarks are conducted and described in this paper. Additionally a quality assessment is performed to allow statements about the achievable quality of the lossy image compression. The results show that using the GPU not only for rendering but also for image compression is a promising approach for interactive remote rendering.
/content/cudazone/CUDABrowser/assets/images/applications/789_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/789_cover-medium_large.jpg
Academia
University of Paderborn
2008
12
02
12/02/2008
Stefan Lietsch
Paul Hermann Lensing
Paper
Science
Stefan Lietsch,Paul Hermann Lensing,slietsch@upb.de,plensing@upb.de
1968f34b-b4e7-4cfe-949e-957ac0b0a242
GPU-MEME: Using Graphics Hardware to Accelerate Motif Finding in DNA Sequences
Discovery of motifs that are repeated in groups of biological sequences is a major task in bioinformatics. Iterative methods such as expectation maximization (EM) are used as a common approach to find such patterns. However, corresponding algorithms are highly compute-intensive due to the small size and degenerate nature of biological motifs. Runtime requirements are likely to become even more severe due to the rapid growth of available gene transcription data. In this paper we present a novel approach to accelerate motif discovery based on commodity graphics hardware (GPUs). To derive an efficient mapping onto this type of architecture, we have formulated the compute-intensive parts of the popular MEME tool as streaming algorithms. Our experimental results show that a single GPU allows speedups of one order of magnitude with respect to the sequential MEME implementation. Furthermore, parallelization on a GPU-cluster even improves the speedup to two orders of magnitude.
/content/cudazone/CUDABrowser/assets/images/applications/788_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/788_cover-medium_large.jpg
Academia
Nanyang Technological University
2008
10
08
10/08/2008
Chen Chen
Bertil Schmidt
Liu Weiguo
Paper
Science
Chen Chen,Bertil Schmidt,Liu Weiguo,cchen@ntu.edu.sg,asbschmidt@ntu.edu.sg,liuweiguo@ntu.edu.sg
9dd0b45a-39ac-46d4-b174-a1e78ecab2a7
Performance Optimization Strategies of High Performance Computing on GPU
Recently, GPUs have been widely used in scientific computing and engineering applications, owing primarily to the evolution of GPU architecture. Firstly, we analyze some key performance characteristics of GPUs in detail, and the relationships among GPU architecture, programming model and memory hierarchy. Secondly, we present three performance optimization strategies: Prefetching, Streamlizing, and Task Division. Extensive experiments have been done to abstract the relationships between the different factors and efficiency. Finally, we map the HPL benchmark to the GPU to test our strategies and achieve a certain speedup.
/content/cudazone/CUDABrowser/assets/images/applications/787_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/787_cover-medium_large.jpg
Academia
National University of Defense Technology, ChangSha
2009
08
21
08/21/2009
Anguo Ma
Jing Cai
Yu Cheng
Paper
Science
Anguo Ma,Jing Cai,Yu Cheng,anguo.ma@nudt.edu.cn,jing.cai@nudt.edu.cn,y.cheng@nudt.edu.cn
168e001f-d970-4413-90a0-8d6c90fda259
Bipartite Graph Matching Computation on GPU
The Bipartite Graph Matching Problem is a well-studied topic in Graph Theory. Such a matching relates pairs of nodes from two distinct sets by selecting a subset of the graph edges connecting them. No selected edge shares an end point with any other edge within the subset. When the considered graph has huge sets of nodes and edges, sequential approaches are impractical, especially for applications demanding fast results. In this paper we investigate how to compute such a matching on Graphics Processing Units (GPUs), motivated by their increasing processing power made available at decreasing cost. We present a new data-parallel approach for computing bipartite graph matching that is efficiently computed on today's graphics hardware and apply it to solve the correspondence between 3D samples taken over a time interval.
/content/cudazone/CUDABrowser/assets/images/applications/786_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/786_cover-medium_large.jpg
Academia
Leibniz Universitaet Hannover
2009
08
17
08/17/2009
Cristina Nader Vasconcelos
Bodo Rosenhahn
Paper
Science
Cristina Nader Vasconcelos,Bodo Rosenhahn,crisnv@inf.puc-rio.br,rosenhahn@tnt.uni-hannover.de
124508be-daac-4a5e-8a7d-8bcdae9ea237
Face Detection Using GPU-Based Convolutional Neural Networks
In this paper, we consider the problem of face detection under pose variations. Unlike other contributions, this work focuses on an efficient implementation that utilizes the computational power of modern graphics cards. The proposed system consists of a parallelized implementation of convolutional neural networks (CNNs) with a special emphasis on also parallelizing the detection process. Experimental validation in a smart conference room with 4 active ceiling-mounted cameras shows a dramatic speed gain under real-life conditions.
/content/cudazone/CUDABrowser/assets/images/applications/785_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/785_cover-medium_large.jpg
Academia
TU Dortmund University
2009
08
29
08/29/2009
Fabian Nasse
Christian Thurau
Gernot A. Fink
Paper
Science
Fabian Nasse,Christian Thurau,Gernot A. Fink
cfd6b540-64f5-423f-bc2e-1b7ec1439ba5
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
Graphics processors are increasingly used in scientific applications due to their high computational power, which comes from hardware with multiple-level parallelism and memory hierarchy. Sparse matrix computations frequently arise in scientific applications, for example, when solving PDEs on unstructured grids. However, traditional sparse matrix algorithms are difficult to efficiently parallelize for GPUs due to irregular patterns of memory references. In this paper we present a new storage format for sparse matrices that better employs locality, has low memory footprint and enables automatic specialization for various matrices and future devices via parameter tuning. Experimental evaluation demonstrates significant speedups compared to previously published results.
/content/cudazone/CUDABrowser/assets/images/applications/784_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/784_cover-medium_large.jpg
Academia
Institute for System Programming of RAS
2010
01
21
01/21/2010
Alexander Monakov
Anton Lokhmotov
Arutyun Avetisyan
Paper
Science
Alexander Monakov,Anton Lokhmotov,Arutyun Avetisyan,amonakov@ispras.ru,anton@doc.ic.ac.uk,arut@ispras.ru
e105e7e5-d0ca-4fe1-b6ce-897fd679d5b4
Searching High-Dimensional Neighbours: CPU-Based Tailored Data-Structures Versus GPU-Based Brute-Force Method
Many image processing algorithms rely on the nearest neighbor (NN) or the k nearest neighbor (kNN) search problem. Several methods have been proposed to reduce the computation time, for instance using space partitioning. However, these methods are very slow in high-dimensional space. In this paper, we propose a fast implementation of the brute-force algorithm using GPU (Graphics Processing Unit) programming. We show that our implementation is up to 150 times faster than the classical approaches on synthetic data, and up to 75 times faster on real image processing algorithms (finding similar patches in images and texture synthesis).
/content/cudazone/CUDABrowser/assets/images/applications/783_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/783_cover-medium_large.jpg
Academia
Palaiseau
2009
05
05
05/05/2009
Vincent Garcia
Frank Nielsen
Paper
Science
Vincent Garcia,Frank Nielsen,garciav@lix.polytechnique.fr,nielsen@lix.polytechnique.fr
05b7c411-e33a-4038-b779-b94b67ba0e80
Belief Propagation Implementation Using CUDA on an NVIDIA GTX 280
Disparity map generation is a significant component of vision-based driver assistance systems. This paper describes an efficient implementation of a belief propagation algorithm on a graphics card (GPU) using CUDA (Compute Unified Device Architecture) that can be used to speed up stereo image processing by between 30 and 250 times. For evaluation purposes, different kinds of images have been used: reference images from the Middlebury stereo website, and real-world stereo sequences, self-recorded with the research vehicle of the .enpeda.. project at The University of Auckland. This paper provides implementation details, primarily concerned with the inequality constraints, involving the threads and shared memory, required for efficient programming on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/780_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/780_cover-medium_large.jpg
Academia
Shandong University
2009
11
18
11/18/2009
Yanyan Xu
Hui Chen
Reinhard Klette
Paper
Science
Yanyan Xu,Hui Chen,Reinhard Klette
1cb185e6-e66e-458f-95d9-0f08f2490b6b
Lloyd's Algorithm on GPU
The Centroidal Voronoi Diagram (CVD) is a very versatile structure, well studied in Computational Geometry. It is used as the basis for a number of applications. This paper presents a deterministic algorithm, entirely computed using graphics hardware resources, based on Lloyd's method for computing CVDs. While the computation of the ordinary Voronoi diagram on the GPU is a well-explored topic, its extension to CVDs presents some challenges that the present study intends to overcome.
/content/cudazone/CUDABrowser/assets/images/applications/779_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/779_cover-medium_large.jpg
Academia
Pontificia Universidade Catolica
2008
12
02
12/02/2008
Cristina N. Vasconcelos
Asla Sa
Paulo Cezar Carvalho
Paper
Science
Cristina N. Vasconcelos,Asla Sa,Paulo Cezar Carvalho,crisnv@inf.puc-rio.br,asla@tecgraf.puc-rio.br,pcezar@impa.br
86056069-3857-4e63-8c25-55a234a83edd
GPU-Accelerated Nearest Neighbor Search for 3D Registration
Nearest Neighbor Search (NNS) is employed by many computer vision algorithms. The computational complexity is large and constitutes a challenge for real-time capability. The basic problem is in rapidly processing a huge amount of data, which is often addressed by means of highly sophisticated search methods and parallelism. We show that NNS based vision algorithms like the Iterative Closest Points algorithm (ICP) can achieve real-time capability while preserving compact size and moderate energy consumption as it is needed in robotics and many other domains. The approach exploits the concept of general purpose computation on graphics processing units (GPGPU) and is compared to parallel processing on CPU. We apply this approach to the 3D scan registration problem, for which a speed-up factor of 88 compared to a sequential CPU implementation is reported.
/content/cudazone/CUDABrowser/assets/images/applications/778_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/778_cover-medium_large.jpg
Academia
Sankt Augustin
2009
10
14
10/14/2009
Deyuan Qiu
Stefan May
Andreas Nuchter
Paper
Science
Deyuan Qiu,Stefan May,Andreas Nuchter,dqiu2s@smail.inf.h-brs.de,stefan_may@arcor.de,andreas@nuechti.de
95724514-4e0b-41fc-b92d-e2c41be2c895
An Efficient Pre-filtering Mechanism for Parallel Intrusion Detection Based on Many-Core GPU
Multi-pattern search is a time-consuming task in Network Intrusion Detection Systems (NIDS). The processing ability of NIDS cannot keep up with the rapid development of network bandwidth. One intuitive idea is to use pre-filtering to reduce the workload of NIDS. Our goal is to design a novel method for pre-filtering that lends itself to an efficient implementation on many-core GPUs. Through statistical analysis, we propose a rudimentary method that uses 2B ASCII sub-patterns as the filter keywords. To reduce the size of the filter keyword set, we use Binary Integer Linear Programming (BILP) for optimization. The number of filter keywords is reduced from 4824 to 362, which is also much smaller than with the prefix-based and suffix-based methods. We argue that our method can make good use of the computation power of the GPU. Experiments demonstrate that our pre-filter can achieve a good filter ratio, thus alleviating the burden on NIDS.
/content/cudazone/CUDABrowser/assets/images/applications/777_Untitledsecuritytechnology_small.png
/content/cudazone/CUDABrowser/assets/images/applications/777_Untitledsecuritytechnology_large.png
Academia
National University of Defense Technology
2009
11
28
11/28/2009
Chengkun Wu
Jianping Yin
Zhiping Cai
Paper
Science
Chengkun Wu,Jianping Yin,Zhiping Cai,chengkun_wu@nudt.edu.cn,jpyin@nudt.edu.cn,zpcai@nudt.edu.cn
21d4bbfd-5dd3-4016-982d-d55bab9285ed
GPU-based Acceleration of System-level Design Tasks
Many system-level design tasks (e.g., high-level timing analysis, hardware/software partitioning and design space exploration) involve computational kernels that are intractable (usually NP-hard). As a result, they involve high running times even for mid-sized problems. In this paper we explore the possibility of using commodity graphics processing units (GPUs) to accelerate such tasks that commonly arise in the electronic design automation (EDA) domain. We demonstrate this idea via two detailed case studies. The first explores the possibility of using GPUs to speed up standard schedulability analysis problems. The second proposes a GPU-based engine for a general hardware/software design space exploration problem. Not only do these problems commonly arise in the embedded systems domain, their computational kernels turn out to be variants of a combinatorial optimization problem, viz., the knapsack problem, that lies at the heart of several EDA applications. Experimental results show that our GPU-based implementations offer very attractive speedups for the computational kernels (up to 100x), and speedups of up to 17x for the full problem. In contrast to ASIC/FPGA-based accelerators, and given that even low-end desktop and notebook computers are now equipped with GPUs, our solution involves no extra hardware cost. Although recent research has shown the benefits of using GPUs for a variety of non-graphics applications (e.g., in databases and bioinformatics), harnessing the parallelism of GPUs to accelerate problems from the EDA domain has not been sufficiently explored so far. We believe that our results and the generality of the core problem that we address will motivate researchers from this community to explore the possibility of using GPUs for a wider variety of problems from the EDA domain.
/content/cudazone/CUDABrowser/assets/images/applications/776_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/776_cover-medium_large.jpg
Academia
TU Munich
2010
01
15
01/15/2010
Unmesh D. Bordoloi
Samarjit Chakraborty
Paper
Science
Unmesh D. Bordoloi,Samarjit Chakraborty
d6f04307-3afc-40ed-9c71-ad0bc9456cec
A generic library for structured real-time computations: GPU implementation applied to retinal and cortical vision processes
Most graphics cards in standard personal computers are now equipped with several pixel pipelines running shader programs. Taking advantage of this technology by transferring parallel computations from the CPU side to the GPU side increases the overall computational power even in non-graphical applications by freeing the main processor from heavy work. A generic library is presented to show how anyone can benefit from modern hardware by combining various techniques with little hardware-specific programming skill. Its shader implementation is applied to retinal and cortical simulation. The purpose of this sample application is not to provide a correct approximation of real center-surround ganglion or middle temporal cells, but to illustrate how easily intertwined spatiotemporal filters can be applied to raw input pictures in real time. Requirements and interconnection complexity really depend on the vision framework adopted; therefore, various hypotheses that may benefit from such a library are introduced.
/content/cudazone/CUDABrowser/assets/images/applications/775_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/775_implementation_large.png
Academia
University of Toulouse
2009
01
08
01/08/2009
Jean-Charles Quinton
Paper
Science
Jean-Charles Quinton,quinton@n7.fr
4844c2e1-42ea-446c-aa46-616f14577bf2
GPU Accelerated 3D Face Registration / Recognition
This paper proposes a novel approach to both registration and recognition of faces in three dimensions. The presented method is based on a normal map metric to perform either the alignment of a captured face to a reference template or the comparison between any two faces in a gallery. As the metric involved is highly suited to computation on a vector processor, we propose an implementation of the whole framework on latest-generation graphics boards, to exploit the potential of GPUs applied to large-scale biometric identification applications. This work shows how the use of affordable consumer-grade hardware could allow ultra-rapid comparison between face descriptors through their highly specialized architecture. The approach also addresses facial expression changes by means of subject-specific weighting masks. We include preliminary results of experiments conducted on a proprietary gallery and on a subset of the FRGC database.
/content/cudazone/CUDABrowser/assets/images/applications/774_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/774_cover-medium_large.jpg
Academia
Universita degli Studi di Salerno
2007
08
30
08/30/2007
Andrea Francesco Abate
Michele Nappi
Stefano Ricciardi
Paper
Imaging
Andrea Francesco Abate,Michele Nappi,Stefano Ricciardi,abate@unisa.it,mnappi@unisa.it,sricciardi@unisa.it
ffcd384a-918f-49f4-a5ad-b21b0988e948
Implementation of a Lattice Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology
The Lattice Boltzmann method (LBM) is a distribution-function based approach to numerical fluid mechanics. Due to the simple formulation of the underlying algorithm this method is well suited for parallelization and hardware acceleration using general purpose graphical processing units (GPGPU). Within this work LBM has been implemented in a new code with multi-GPU support and physically validated for a flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single socket multi core CPU calculation. Moreover the validation for the test case chosen shows excellent agreement with available reference data.
/content/cudazone/CUDABrowser/assets/images/applications/773_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/773_implementation_large.png
Academia
Technische Universitat Munchen
2009
05
06
05/06/2009
T. Indinger
Paper
Science
T. Indinger,Thomas.Indinger@tum.de
b4a2cd2c-ab54-4f4e-a597-12943d456da4
GPU-Assisted Surface Reconstruction on Locally-Uniform Samples
In point-based graphics, surfaces are represented by point clouds without explicit connectivity. If the distribution of the points can be carefully controlled, surface reconstruction becomes a much easier problem. We present a simple, completely local surface reconstruction algorithm for input point distributions that are locally uniform. The locality of the computation lets us handle large point sets using parallel and out-of-core methods. The algorithm can be implemented robustly with floating-point arithmetic. We demonstrate the simplicity, efficiency, and numerical stability of our algorithm with an out-of-core and parallel implementation using graphics hardware.
/content/cudazone/CUDABrowser/assets/images/applications/772_roundtable_small.png
/content/cudazone/CUDABrowser/assets/images/applications/772_roundtable_large.png
Academia
University of California
2009
10
23
10/23/2009
Yong Joo Kil
Nina Amenta
Paper
Computer Aided Engineering
Yong Joo Kil,Nina Amenta,kil@cs.ucdavis.edu,amenta@cs.ucdavis.edu
fdd9c3bb-420d-4e67-94a0-60174e2f4534
GP-GPU Implementation of the Local Rank Differences Image Feature
A currently popular trend in object detection and pattern recognition is the use of statistical classifiers, namely AdaBoost and its modifications. The speed performance of these classifiers largely depends on the low-level image features they are using: both on the amount of information the feature provides and the processor time of its evaluation. Local Rank Differences (LRD) is an image feature that is an alternative to the commonly used Haar wavelets. It is suitable for implementation in programmable (FPGA) or specialized (ASIC) hardware, but, as this paper shows, it also performs very well on graphics hardware (GPU) used in a general-purpose manner (GPGPU, namely CUDA in this case). The paper discusses the LRD features and their properties, describes an experimental implementation of the LRD in graphics hardware using CUDA, presents its empirical performance measures compared to alternative approaches, suggests several notes on practical usage of LRD and proposes directions for future work.
/content/cudazone/CUDABrowser/assets/images/applications/771_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/771_cover-medium_large.jpg
Academia
Brno University of Technology
2009
05
21
05/21/2009
Adam Herout
Radovan Josth
Pavel Zemcik
Paper
Imaging
Adam Herout,Radovan Josth,Pavel Zemcik,fherout@fit.vutbr.cz,ijosth@fit.vutbr.cz,zemcik@fit.vutbr.cz
307a1055-9c6c-4df0-bc88-96f461322333
AES Encryption Implementation and Analysis on Commodity Graphics Processing Units
Graphics Processing Units (GPUs) present large potential performance gains within stream processing applications over the standard CPU. These performance gains are best realised when high computational intensity is required across large amounts of mostly independent input elements. The GPU's success in general-purpose stream processing has been demonstrated in many diverse fields, though attempts to port cryptographic algorithms to the GPU have thus far met with little success. In recent years, GPU architectures have continued to develop a more flexible and uniform programming environment. These developments have overcome a lot of previously encountered restrictions in cipher implementations. We present novel approaches for the implementation of the AES block cipher encryption algorithm on these GPUs. This work also serves as a precursor for future cipher implementations on the most advanced GPU architecture, the recently released Nvidia G80, which now includes integer support and a simplified programming interface.
/content/cudazone/CUDABrowser/assets/images/applications/770_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/770_cover-medium_large.jpg
Academia
Trinity College Dublin
2007
08
23
08/23/2007
Owen Harrison
John Waldron
Paper
Science
Owen Harrison,John Waldron,harrisoo@cs.tcd.ie,john.waldron@cs.tcd.ie
819a8581-877a-45c8-9cfb-d63121d5dbe2
The Future of Volume Graphics in Medical Virtual Reality
A recent trend in medical virtual reality is to include information from multiple sources, especially about physiology, into one model and one single visualization. Computer graphics must therefore deal with a huge amount of information in real time. The latest developments in computer graphics hardware not only allow direct volume rendering to be implemented on the graphics processing unit (GPU); the emerging compute languages also enable us to address volume rendering problems of arbitrary complexity without being limited to formulating visualization techniques in an awkward fashion to match the GPU execution model. Utilizing these new possibilities, we meet the next generation's demands in medical visualization.
/content/cudazone/CUDABrowser/assets/images/applications/769_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/769_prediction_large.png
Academia
Graz University of Technology
2010
01
01
01/01/2010
Judith Muehl
Bernhard Kainz
Alexander Bornik
Paper
Medical Imaging
Judith Muehl,Bernhard Kainz,Alexander Bornik
f327d71f-b539-441f-a3d5-fc8b66c264db
Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA
In this article a very efficient implementation of a 2D Lattice Boltzmann kernel using the Compute Unified Device Architecture (CUDA) interface developed by NVIDIA is presented. By exploiting the explicit parallelism exposed in the graphics hardware we obtain a performance gain of more than one order of magnitude compared to standard CPUs. A non-trivial example, the flow through a generic porous medium, shows the performance of the implementation.
/content/cudazone/CUDABrowser/assets/images/applications/768_bottle_small.png
/content/cudazone/CUDABrowser/assets/images/applications/768_bottle_large.png
Academia
TU Braunschweig
2008
07
24
07/24/2008
Jonas Tolke
Paper
Numerics
Jonas Tolke,toelke@cab.bau.tu-bs.de
36f123f2-0612-42e2-8134-d637453033c5
GPU in Haptic Rendering of Deformable Objects
We present some results on using the Graphics Processing Unit (GPU) to compute the deformation of two experimental objects. A suture simulation model on the GPU and a 2D deformable cloth model using NVIDIA CUDA techniques are also proposed. We conducted experimental studies to compare the GPU-based suture models with the CPU implementation. We also experimented with the implicit model of the 2D mesh, which offers computational challenges similar to those associated with any Finite-Element modeling approach. A method for computing the inverse of a matrix with a truncated Neumann series is also introduced.
/content/cudazone/CUDABrowser/assets/images/applications/767_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/767_cover-medium_large.jpg
Academia
Simon Fraser University
2008
06
28
06/28/2008
Hans Fuhan Shi
Shahram Payandeh
Paper
Imaging
Hans Fuhan Shi,Shahram Payandeh,fuhans@cs.sfu.ca,shahram@cs.sfu.ca
aa90451a-b028-44e9-98ae-84677865270f
GP-GPU Implementation of the Local Rank Differences Image Feature
A currently popular trend in object detection and pattern recognition is the use of statistical classifiers, namely AdaBoost and its modifications. The speed performance of these classifiers largely depends on the low-level image features they are using: both on the amount of information the feature provides and the processor time of its evaluation. Local Rank Differences (LRD) is an image feature that is an alternative to the commonly used Haar wavelets. It is suitable for implementation in programmable (FPGA) or specialized (ASIC) hardware, but, as this paper shows, it also performs very well on graphics hardware (GPU) used in a general-purpose manner (GPGPU, namely CUDA in this case). The paper discusses the LRD features and their properties, describes an experimental implementation of the LRD in graphics hardware using CUDA, presents its empirical performance measures compared to alternative approaches, suggests several notes on practical usage of LRD and proposes directions for future work.
/content/cudazone/CUDABrowser/assets/images/applications/766_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/766_cover-medium_large.jpg
Academia
Brno University of Technology
2009
05
21
05/21/2009
Adam Herout
Radovan Josth
Pavel Zemcik
Paper
Imaging
Adam Herout,Radovan Josth,Pavel Zemcik
41e47d70-074c-4ef5-a17f-ba467f8e9d78
Monte Carlo Dose Calculation using GPU-Based parallel processing
Recently, it has become possible to compute physical phenomena using the Graphics Processing Unit (GPU), and the use of GPUs to shorten the computing time of Monte Carlo calculation methods is being actively researched. This report shows how to significantly accelerate 3D dose calculation of photon beams using the GPU. We describe a GPU parallel processing method for dose simulation based on NRCC DOSXYZnrc.
http://www.springerlink.com/content/r42wtk514k03865j/?p=da8f68ea438f401396ffad66aea4a402&pi=77
/content/cudazone/CUDABrowser/assets/images/applications/765_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/765_prediction_large.png
Academia
Tokyo Metropolitan University
2010
01
01
01/01/2010
Atsushi Myojyoyama
Hidetoshi Saitoh
Paper
Numerics
Atsushi Myojyoyama,Hidetoshi Saitoh
b4c6e882-4f8e-4f2b-87aa-c0667c088ae7
GpuCV: A GPU-Accelerated Framework for Image Processing and Computer Vision
This paper briefly presents the state of the art of accelerating image processing with graphics hardware (GPU) and discusses some of its caveats. Then it describes GpuCV, an open source multi-platform library for GPU-accelerated image processing and Computer Vision operators and applications. It is meant for computer vision scientists not familiar with GPU technologies. GpuCV is designed to be compatible with the popular OpenCV library by offering GPU-accelerated operators that can be integrated into native OpenCV applications. The GpuCV framework transparently manages hardware capabilities, data synchronization, activation of low-level GLSL and CUDA programs, on-the-fly benchmarking and switching to the most efficient implementation, and finally offers a set of GPU-accelerated image processing operators.
/content/cudazone/CUDABrowser/assets/images/applications/764_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/764_cover-medium_large.jpg
Academia
TELECOM & Management SudParis
2008
12
03
12/03/2008
Yannick Allusse
Patrick Horain
Ankit Agarwal
Paper
Imaging
Yannick Allusse,Patrick Horain,Ankit Agarwal
4786c8f5-1af2-4f0b-b323-7dee0cdd4936
Population Parallel GP on the G80 GPU
The availability of low-cost, powerful parallel graphics cards has stimulated a trend to port GP to Graphics Processing Units (GPUs). Previous works on GPUs have shown evaluation phase speedups for large sets of training cases. Using the CUDA language on the G80 GPU, we show it is possible to efficiently interpret several GP programs in parallel, thus also obtaining speedups for small training sets starting at fewer than 100 training cases. Our scheme was embedded in the well-known ECJ library, providing an easy entry point for owners of G80 GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/762_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/762_cover-medium_large.jpg
Academia
Universite du Littoral Cote dOpale
2009
04
03
04/03/2009
Denis Robilliard
Virginie MarionPoty
Cyril Fonlupt
Paper
Science
Denis Robilliard,Virginie MarionPoty,Cyril Fonlupt,robillia@lil.univ-littoral.fr,poty@lil.univ-littoral.fr,fonlupt@lil.univ-littoral.fr
2d5f5437-9dcc-4368-ad9a-937cce37e34c
Medical feature matching and model extraction from MRI/CT based on the Invariant Generalized Hough/Radon Transform
In this paper we present a variation of the Generalized Hough Transform (GHT) for automatic feature matching and model extraction. We propose a two-dimensional algorithm with a two-reference-point parameterization (Dual-Point GHT) that is invariant to rotation and uniform scaling and uses the specific properties of both the generalized Hough and Radon transforms. The method operates with two-dimensional accumulators, which strongly decreases the required memory size. We implement the algorithm on Graphics Processing Units (GeForce 8800GTX/NVIDIA CUDA) and apply it to the extraction of cardiac shapes from MRI/CT as an initial step for further medical image segmentation.
/content/cudazone/CUDABrowser/assets/images/applications/761_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/761_cover-medium_large.jpg
Academia
University of Heidelberg
2009
02
04
02/04/2009
D. Hlindzich
R. Maenner
Paper
Science
D. Hlindzich,R. Maenner
4a1c7daa-bbdd-4a89-b774-fafdc8d40477
Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning
NVIDIA have released a new platform (CUDA) for general purpose computing on their graphical processing units (GPU). This paper evaluates use of this platform for statistical machine learning applications. The transfer rates to and from the GPU are measured, as is the performance of matrix vector operations on the GPU. An implementation of a sparse matrix vector product on the GPU is outlined and evaluated. Performance comparisons are made with the host processor.
/content/cudazone/CUDABrowser/assets/images/applications/760_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/760_cover-medium_large.jpg
Academia
Australian National University
2009
06
25
06/25/2009
Ahmed El Zein
Eric McCreath
Alistair Rendell
Paper
Science
Ahmed El Zein,Eric McCreath,Alistair Rendell,Ahmed.ElZein@anu.edu.au,Eric.McCreath@anu.edu.au,Alistair.Rendell@anu.edu.au
fcdd7311-d7ab-423b-9790-8fc720230f72
High-Quality Rendering of Varying Isosurfaces with Cubic Trivariate C1-Continuous Splines
Smooth trivariate splines on uniform tetrahedral partitions are well suited for high-quality visualization of isosurfaces from scalar volumetric data. We propose a novel rendering approach based on spline patches with low total degree, for which ray-isosurface intersections are computed using efficient root-finding algorithms. Smoothly varying surface normals are directly extracted from the underlying spline representation. Our approach uses a combined CUDA and graphics pipeline and yields two key advantages over previous work. First, we can interactively vary the isovalues since all required processing steps are performed on the GPU. Second, we employ instancing in order to reduce shader complexity and to minimize overall memory usage. In particular, this allows the spline coefficients to be computed on the fly in real time on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/759_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/759_cover-medium_large.jpg
Academia
TU Darmstadt
2009
11
26
11/26/2009
Thomas Kalbe
Thomas Koch
Michael Goesele
Paper
Science
Thomas Kalbe,Thomas Koch,Michael Goesele
10eb0dc2-f2cc-4beb-9bfb-d6483731f3a4
Evaluation of Parallel FFT Implementations on GPU and Multi-core PCs for Magnetic Induction Tomography
Magnetic Induction Tomography is a relatively new non-invasive modality for the imaging of the electrical properties of materials which is currently under investigation for a variety of industrial and biomedical applications, in particular the detection and monitoring of cerebral haemorrhage. The speed of FFT-based phase measurement algorithms employed in some current MIT systems is however a major limit to higher data acquisition rate and precision.
http://www.springerlink.com/content/t5022335826j4052/?p=e6efad6c51a246a5a01428810aa2b808&pi=67
/content/cudazone/CUDABrowser/assets/images/applications/758_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/758_prediction_large.png
Academia
Philips Research / University of Glamorgan
2010
01
01
01/01/2010
2
Y. Maimaitijiang
H. C. Wee
A. Roula
Paper
Science
Y. Maimaitijiang,H. C. Wee,A. Roula
72ad49af-dbf9-4586-9849-1a116402fbcd
Visualization and GPU-accelerated simulation of medical
We present a fast GPU-based method for simulation of ultrasound images from volumetric CT scans and their visualization. The method uses a ray-based model of the ultrasound to generate view-dependent ultrasonic effects such as occlusions, large-scale reflections and attenuation combined with speckle patterns derived from pre-processing the CT image using a wave-based model of ultrasound propagation in soft tissue. The main applications of the method are ultrasound training and registration of ultrasound and CT images.
/content/cudazone/CUDABrowser/assets/images/applications/755_computermethods_small.png
/content/cudazone/CUDABrowser/assets/images/applications/755_computermethods_large.png
Academia
Technische Universitat Munchen
2008
12
19
12/19/2008
Oliver Kutter
Ramtin Shams
Nassir Navab
Paper
Medical Imaging
Oliver Kutter,Ramtin Shams,Nassir Navab
b717bbc6-50d1-4024-90db-2c891a8c7716
Parallel Computation of Mutual Information on the GPU with Application to Real-Time Registration of 3D Medical Images
Due to processing constraints, automatic image-based registration of medical images has been largely used as a pre-operative tool. We propose a novel method named sort and count for efficient parallelization of mutual information (MI) computation designed for massively multiprocessing architectures. Combined with a parallel transformation implementation and an improved optimization algorithm, our method achieves real-time (less than 1 second) rigid registration of 3D medical images using a commodity graphics processing unit (GPU). This represents a more than 50-fold improvement over a standard implementation on a CPU. Real-time registration opens new possibilities for development of improved and interactive intraoperative tools that can be used for enhanced visualization and navigation during an intervention.
/content/cudazone/CUDABrowser/assets/images/applications/754_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/754_graph_large.png
Academia
Australian National University
2009
08
21
08/21/2009
50
Ramtin Shams
Parastoo Sadeghi
Rodney Kennedy
Paper
Medical Imaging
Ramtin Shams,Parastoo Sadeghi,Rodney Kennedy
f6835ea9-7c74-4c76-9fe3-64880944cc7e
A SURVEY OF MEDICAL IMAGE REGISTRATION ON MULTI-CORE AND THE GPU
A surgeon is performing a potentially life-saving pancreatectomy on a patient in early stages of pancreatic cancer. Two small incisions of no more than half an inch allow laparoscopic tools including a video camera and an ultrasound probe to be guided inside the abdominal cavity. A third, larger incision, is occupied by a hand-access device that facilitates the operation. The surgeon is able to locate the tumor in the ultrasound view with ease. This is largely possible due to a newly installed 3D navigation and visualization system that virtually renders the patient transparent.
http://users.rsise.anu.edu.au/~ramtin/papers/2010/SPM_2010.pdf
/content/cudazone/CUDABrowser/assets/images/applications/753_multicoregpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/753_multicoregpu_large.png
Academia
Australian National University
2010
03
01
03/01/2010
Ramtin Shams
Parastoo Sadeghi
Rodney A. Kennedy
Paper
Medical Imaging
Ramtin Shams,Parastoo Sadeghi,Rodney A. Kennedy
5ded63a6-656c-44f4-a306-7cc45e85ea40
A GPU Tile-Load-Map architecture for terrain rendering: theory and applications
This paper describes a robust, modular, complete GPU architecture, the Tile-Load-Map (TLM), designed for the real-time visualization of wide textured terrains created with arbitrary meshes. It extends and completes our previous succinct paper Amara et al. (ISVC 2007, Part 1, Lecture Notes in Computer Science, vol. 4841, pp. 586-597, Springer, Berlin, 2007) by giving further technical and implementation details. It provides new solutions to problems that had been left unresolved, in the context of a joint use of OpenGL and CUDA, optimized on the G80 graphics chip. We explain the crucial components of the shaders, and emphasize the progress we have proposed, while resolving some difficulties. We show that this texturing architecture is well suited to current challenges, and takes into account most of the distinctive aspects of terrain rendering. Finally, we demonstrate how the design of the TLM facilitates the integration of geomatic input-data into procedural selection/rendering tasks on the GPU, and immediate applications to amplification.
/content/cudazone/CUDABrowser/assets/images/applications/751_visualcomputer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/751_visualcomputer_large.png
Academia
Bab Ezzouar
2009
01
14
01/14/2009
Yacine Amara
Xavier Marsault
Paper
Science
Yacine Amara,Xavier Marsault
af46c6f4-36e8-4672-af52-7cc2741bccb6
HISTOGRAM COMPUTATION WITH CUDA
GPU's higher processing power compared to a standard CPU comes at the cost of reduced data caching and flow control logic as more transistors have to be devoted to data processing. This imposes certain limitations in terms of how an application may access memory and implement flow control. As a result, implementation of certain algorithms (even trivial ones) on the GPU may be difficult or may not be computationally justified.
/content/cudazone/CUDABrowser/assets/images/applications/750_8800gtx-128_small.png
/content/cudazone/CUDABrowser/assets/images/applications/750_8800gtx-128_large.png
Academia
Australian National University
2008
08
01
08/01/2008
R. Shams
Application
R. Shams
82a5a192-ee7e-42dc-83d9-b24a79656a21
Parallel Lattice Boltzmann Flow Simulation on Emerging Multi-core Platforms
A parallel Lattice Boltzmann Method (pLBM), which is based on hierarchical spatial decomposition, is designed to perform large-scale flow simulations. The algorithm uses critical section-free, dual representation in order to expose maximal concurrency and data locality. The performance of the emerging multi-core platforms PlayStation3 (Cell Broadband Engine) and Compute Unified Device Architecture (CUDA) is tested using the pLBM, which is implemented with multi-thread and message-passing programming. The results show that pLBM achieves good performance improvement, 11.02 for Cell over a traditional Xeon cluster and 8.76 for CUDA graphics processing unit (GPU) over a Sempron central processing unit (CPU). The results provide some insights into application design on future many-core platforms.
/content/cudazone/CUDABrowser/assets/images/applications/749_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/749_cover-medium_large.jpg
Academia
University of Southern California
2008
08
21
08/21/2008
Liu Peng
Ken-ichi Nomura
Takehiro Oyakawa
Paper
Science
Liu Peng,Ken-ichi Nomura,Takehiro Oyakawa
05930d99-8367-4f73-a605-c469a41e6fdb
Efficient Nonlinear FEM for Soft Tissue Modelling and Its GPU Implementation within the Open Source Framework SOFA
Accurate biomechanical modelling of soft tissue is a key aspect for achieving realistic surgical simulations. However, because medical simulation is a multi-disciplinary area, researchers do not always have sufficient resources to develop an efficient and physically rigorous model for organ deformation. We address this issue by implementing a CUDA-based nonlinear finite element model into the SOFA open source framework. The proposed model is an anisotropic visco-hyperelastic constitutive formulation implemented on a graphical processor unit (GPU). After presenting results on the model's performance we illustrate the benefits of its integration within the SOFA framework on a simulation of cataract surgery.
/content/cudazone/CUDABrowser/assets/images/applications/1371_comas08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/1371_comas08_large.png
Academia
The Australian e-Health Research Centre
2008
04
07
04/07/2008
53
Olivier Comas
Zeike A. Taylo
Jeremie Allard
Paper
Science
Olivier Comas,Zeike A. Taylo,Jeremie Allard
bd858d80-d8c8-4000-9df8-2353690a6f98
Four styles of parallel and net programming
This paper reviews the programming landscape for parallel and network computing systems, focusing on four styles of concurrent programming models, and example languages/libraries. The four styles correspond to four scales of the targeted systems. At the smallest coprocessor scale, Single Instruction Multiple Thread (SIMT) and Compute Unified Device Architecture (CUDA) are considered. Transactional memory is discussed at the multicore or process scale. The MapReduce style is examined at the datacenter scale. At the Internet scale, Grid Service Markup Language (GSML) is reviewed, which intends to integrate resources distributed across multiple datacenters.
/content/cudazone/CUDABrowser/assets/images/applications/747_computerscience_small.png
/content/cudazone/CUDABrowser/assets/images/applications/747_computerscience_large.png
Academia
Chinese Academy of Sciences
2009
05
20
05/20/2009
Zhiwei Xu
Yongqiang He
Paper
Science
Zhiwei Xu,Yongqiang He,zxu@ict.ac.cn,heyongqiang@software.ict.ac.cn
955ab92b-e28a-4199-bb03-45c14dade318
Accelerating Image Retrieval Using Factorial Correspondence Analysis on GPU
We are interested in the intensive use of Factorial Correspondence Analysis (FCA) for large-scale content-based image retrieval. Factorial Correspondence Analysis is a useful method for analyzing textual data, and we adapt it to images using the SIFT local descriptors. FCA is used to reduce dimensions and to limit the number of images to be considered during the search. Graphics Processing Units (GPU) are fast emerging as inexpensive parallel processors due to their high computation power and low price. The G80 family of Nvidia GPUs provides the CUDA programming model that treats the GPU as a SIMD processor array. We present two very fast algorithms on GPU for image retrieval using FCA: the first one is a parallel incremental algorithm for FCA and the second one is an extension of the filtering algorithm in our previous work for the filtering step.
Our implementation is able to speed up the FCA computation by a factor of 30 compared to the CPU version. For retrieval tasks, the parallel version on GPU performs 10 times faster than the one on CPU. Retrieving images in a database of 1 million images is done in about 8 milliseconds.
/content/cudazone/CUDABrowser/assets/images/applications/746_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/746_cover-medium_large.jpg
Academia
Campus de Beaulieu
2009
08
29
08/29/2009
NguyenKhang Pham
Annie Morin
Patrick Gros
Paper
Science
NguyenKhang Pham,Annie Morin,Patrick Gros,Nguyen_Khang@irisa.fr,Annie.Morin@irisa.fr,Patrick.Gros@inria.fr
fdc39d5f-f0d5-4664-809c-9f9c10a35c34
Experiences with Mapping Non-linear Memory Access Patterns into GPUs
Modern Graphics Processing Units (GPU) are very powerful computational systems on a chip. For this reason there is a growing interest in using these units as general purpose hardware accelerators (GPGPU). To facilitate the programming of general purpose applications, NVIDIA introduced the CUDA programming environment. CUDA provides a simplified abstraction of the underlying complex GPU architecture, so a number of critical optimizations must be applied to the code in order to get maximum performance. In this paper we discuss our experience in porting an application kernel to the GPU, and all classes of design decisions we adopted in order to obtain maximum performance.
/content/cudazone/CUDABrowser/assets/images/applications/745_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/745_cover-medium_large.jpg
Academia
University of Malaga
2009
05
20
05/20/2009
Eladio Gutierrez
Sergio Romero
Maria A. Trenas
Paper
Science
Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@uma.es,sromero@uma.es,maria@uma.es
48d86485-4f4a-4edf-aa9c-9b0900fbf425
Mean Shift Parallel Tracking on GPU
We propose a parallel Mean Shift (MS) tracking algorithm on Graphics Processing Unit (GPU) using Compute Unified Device Architecture (CUDA). The traditional MS algorithm uses a large number of color histogram bins, typically 16x16x16, which makes parallel implementation infeasible. We thus employ K-Means clustering to partition the object color space, which enables us to represent the color distribution with quite a small number of bins. Based on this compact histogram, all key components of the MS algorithm are mapped onto the GPU. The resultant parallel algorithm consists of six kernel functions, which primarily involve the parallel computation of the candidate histogram and calculation of the Mean Shift vector. Experiments on publicly available CAVIAR videos show that the proposed parallel tracking algorithm achieves large speedup and has comparable tracking performance, compared with the traditional serial MS tracking algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/744_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/744_cover-medium_large.jpg
Academia
Heilongjiang University
2009
06
09
06/09/2009
Peihua Li
Paper
Science
Peihua Li,peihualj@hotmail.com
e606b34c-1f0d-4118-a68a-d0d7286fbab5
Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU
A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse general-purpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML).
/content/cudazone/CUDABrowser/assets/images/applications/743_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/743_cover-medium_large.jpg
Academia
Nancy Universite
2007
09
08
09/08/2007
Luc Buatois
Guillaume Caumon
Bruno Levy
Paper
Science
Luc Buatois,Guillaume Caumon,Bruno Levy,buatois@gocad.org,caumon@gocad.org,levy@loria.fr
3c1de2f7-4132-4e99-87fa-8242c3b9d107
Solving Sparse Linear Systems on NVIDIA Tesla GPUs
Current many-core GPUs have enormous processing power, and unlocking this power for general-purpose computing is very attractive due to their low cost and efficient power utilization. However, the fine-grained parallelism and the stream-programming model supported by these GPUs require a paradigm shift, especially for algorithm designers. In this paper we present the design of a GPU-based sparse linear solver using the Generalized Minimum RESidual (GMRES) algorithm in the CUDA programming environment. Our implementation achieved a speedup of over 20x on the Tesla T10P based GTX280 GPU card for benchmarks with a few thousand to a few million unknowns.
/content/cudazone/CUDABrowser/assets/images/applications/742_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/742_cover-medium_large.jpg
Academia
State University of New Jersey
2009
05
20
05/20/2009
20
Mingliang Wang
Hector Klie
Manish Parashar
Paper
Science
Mingliang Wang,Hector Klie,Manish Parashar
f97eefd3-0c69-48b8-b703-c569c5afab1e
Optimizing Monte Carlo radiosity on graphics hardware
The radiosity method is usually employed for the rendering of highly realistic synthetic images. In this paper we present an implementation of the Monte Carlo radiosity algorithm on the GPU using CUDA. Our proposal is based on the partition of the scene into sub-scenes to be processed in parallel to exploit the graphics card structure. The convex partition method employed permits the exploitation of data locality and the optimization of the ray shooting procedure due to the minimization of the number of objects to be tested in the intersection calculation. The results are good in terms of execution times, increasing the flexibility of previous solutions and demonstrating that the GPU can outperform the CPU results even for non-regular algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/741_neville_small.png
/content/cudazone/CUDABrowser/assets/images/applications/741_neville_large.png
Academia
Univ. of A Coruna
2009
11
06
11/06/2009
J. R. Sanjurjo
M. Amor
M. Boo
Paper
Numerics
J. R. Sanjurjo,M. Amor,M. Boo,josesan@udc.es,margamor@udc.es,montserrat.boo@usc.es
84a7d907-10ce-4d67-ae51-cc83bf5e33ab
Optimizations and Performance of a Robotics Grasping Algorithm Described in Geometric Algebra
The usage of Conformal Geometric Algebra leads to algorithms that can be formulated in a very clear and easy to grasp way. But it can also increase the performance of an implementation because of its capabilities to be computed in parallel. In this paper we show how a grasping algorithm for a robotic arm is accelerated using a Conformal Geometric Algebra formulation. The optimized C code is produced by the CGA framework Gaalop automatically. We compare this implementation with a CUDA implementation and an implementation that uses standard vector algebra.
/content/cudazone/CUDABrowser/assets/images/applications/740_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/740_cover-medium_large.jpg
Academia
Technische Universitat Darmstadt
2009
11
16
11/16/2009
Florian Worsdorfer
Florian Stock
Eduardo BayroCorrochano
Paper
Numerics
Florian Worsdorfer,Florian Stock,Eduardo BayroCorrochano
dbeef696-2ea9-4394-bd10-b2f4aea55e81
Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors
In the last decade, there has been a dramatic growth in research and development of massively parallel commodity graphics hardware both in academia and industry. Graphics card architectures provide an optimal platform for parallel execution of many number crunching loop programs from fields like image processing, linear algebra, etc. However, it is hard to efficiently map such algorithms to the graphics hardware even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows the efficient mapping of this type of algorithm to the graphics hardware. Furthermore, the impact of execution configuration is illustrated and a method is proposed to determine the best configuration offline in order to use it at run-time. Using CUDA as programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of up to 33x can be achieved on NVIDIA's Tesla C870 compared to a parallelized implementation on a Xeon Quad Core.
/content/cudazone/CUDABrowser/assets/images/applications/739_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/739_cover-medium_large.jpg
Academia
University of Erlangen-Nuremberg
2009
07
21
07/21/2009
33
Richard Membarth
Frank Hannig
Hritam Dutta
Paper
Imaging
Richard Membarth,Frank Hannig,Hritam Dutta,richard.membarth@cs.fau.de,hannig@cs.fau.de,dutta@cs.fau.de
2ca5d0a9-8e2c-4c94-8c92-ccea0f8f3ede
Non-rigid Registration for Large Sets of Microscopic Images on Graphics Processors
Microscopic imaging is an important tool for characterizing tissue morphology and pathology. 3D reconstruction and visualization of large sample tissue structure requires registration of large sets of high-resolution images. However, the scale of this problem presents a challenge for automatic registration methods. In this paper we present a novel method for efficient automatic registration using graphics processing units (GPUs) and parallel programming. Comparing a C++ CPU implementation with Compute Unified Device Architecture (CUDA) libraries and pthreads running on GPU, we achieve a speed-up factor of up to 4.11x with a single GPU and 6.68x with a GPU pair. We present execution times for a benchmark composed of two sets of large-scale images: mouse placenta (16K x 16K pixels) and breast cancer tumors (23K x 62K pixels). It takes more than 12 hours for the genetic case in C++ to register a typical sample composed of 500 consecutive slides, which was reduced to less than 2 hours using two GPUs, in addition to a very promising scalability for extending those gains easily on a large number of GPUs in a distributed system.
/content/cudazone/CUDABrowser/assets/images/applications/738_hyperspectral_small.png
/content/cudazone/CUDABrowser/assets/images/applications/738_hyperspectral_large.png
Academia
University of Malaga
2008
05
20
05/20/2008
7
Antonio Ruiz
Manuel Ujaldon
Lee Cooper
Paper
Computer Aided Engineering
Antonio Ruiz,Manuel Ujaldon,Lee Cooper,aruiz@ac.uma.es,ujaldon@ac.uma.es,cooperl@ece.osu.edu
ddb390fb-53bb-46e9-aad0-d5443baf25a4
Integrated Digital Image Correlation for the Identification of Mechanical Properties
Digital Image Correlation (DIC) is a powerful technique to provide full-field displacement measurements for mechanical tests of materials and structures. The displacement fields may be further processed as an entry for identification procedures giving access to parameters of constitutive laws. A new implementation of a Finite Element based Integrated Digital Image Correlation (I-DIC) method is presented, where the two stages (image correlation and mechanical identification) are coupled. This coupling allows one to minimize information losses, even in case of low signal-to-noise ratios. A case study for elastic properties of a composite material illustrates the approach, and highlights the accuracy of the results. Implementation on GPUs (using CUDA) leads to high speed performance while preserving the versatility of the methodology.
/content/cudazone/CUDABrowser/assets/images/applications/737_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/737_cover-medium_large.jpg
Academia
SpringerLink
2009
05
05
05/05/2009
Hugo Leclerc
Jean-Noel Perie
Stephane Roux
Paper
Science
Hugo Leclerc,Jean-Noel Perie,Stephane Roux,hugo.leclerc@lmt.ens-cachan.fr,jean-noel.perie@lmt.ens-cachan.fr,stephane.roux@lmt.ens-cachan.fr
8b5c1771-af09-43fd-a7ba-8160936587d3
Multifold Acceleration of Neural Network Computations Using GPU
With the emergence of graphics processing units (GPU) of the latest generation, it became possible to undertake neural network based computations using GPU on serially produced video display adapters. In this study, NVIDIA CUDA technology has been used to implement the standard back-propagation algorithm for training multiple perceptrons simultaneously on GPU. For the problem considered, the GPU-based implementation (on NVIDIA GTX 260 GPU) has led to a 50x speed increase compared to a highly optimized CPU-based computer program, and more than 150x compared to commercially available CPU-based software (NeuroShell 2) (AMD Athlon 64 Dual core 6000+ processor).
/content/cudazone/CUDABrowser/assets/images/applications/736_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/736_cover-medium_large.jpg
Academia
Lomonosov Moscow State University
2009
09
16
09/16/2009
50
Alexander Guzhva
Paper
Science
Alexander Guzhva,nop43@rambler.ru
dddf07ad-463e-4556-8709-33b2f8f5b204
Genetic programming on graphics processing units
The availability of low cost powerful parallel graphics cards has stimulated the port of Genetic Programming (GP) on Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. In earlier work we showed that this setup allows the development of fine-grained parallelization schemes to evaluate several GP programs in parallel, while obtaining speedups for usual training sets and program sizes. Here we present another parallelization scheme and optimizations about program representation and use of GPU fast memory. This increases the computation speed about threefold, up to 4 billion GP operations per second. The code has been developed within the well known ECJ library and is open source.
/content/cudazone/CUDABrowser/assets/images/applications/735_hybrid_small.png
/content/cudazone/CUDABrowser/assets/images/applications/735_hybrid_large.png
Academia
SpringerLink
2009
10
13
10/13/2009
Denis Robilliard
Virginie Marion-Poty
Cyril Fonlupt
Paper
Science
Denis Robilliard,Virginie Marion-Poty,Cyril Fonlupt,robillia@lil.univ-littoral.fr,poty@lil.univ-littoral.fr,onlupt@lil.univ-littoral.fr
a20016ac-ca3b-4eb6-b760-3c62fa956a30
A Particle-Mesh Integrator for Galactic Dynamics Powered by GPGPUs
We present a particle-mesh N-body integrator running on GPU using CUDA. Relying on a grid-based description of the gravitational potential, it can simulate the evolution of self-interacting 'stars' in order to model e.g. galaxies. All the steps of the application have been ported to the GPU, namely 1/ a histogramming algorithm with CUDPP, 2/ the resolution of the Poisson equation by means of FFT with CUFFT and multi-grid relaxation, 3/ an optimized finite difference scheme to compute the accelerations of stars and 4/ an update procedure for positions and velocities. We present several tests at different resolutions, and reach a speedup of 2 to 50 depending on the resolution and on the test case.
/content/cudazone/CUDABrowser/assets/images/applications/734_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/734_cover-medium_large.jpg
Academia
Universite de Strasbourg
2009
05
20
05/20/2009
50
Dominique Aubert
Mehdi Amini
Romaric David
Paper
Science
Dominique Aubert,Mehdi Amini,Romaric David
bee72b5d-162e-4691-a0b7-f6646c239fbf
Parallel Implementations of Recurrent Neural Network Learning
Neural networks have proved to be effective in solving a wide range of problems. As problems become more and more demanding, they require larger neural networks, and the time used for learning is consequently greater. Parallel implementations of learning algorithms are therefore vital for a useful application. Implementation, however, strongly depends on the features of the learning algorithm and the underlying hardware architecture. For this experimental work a dynamic problem was chosen which entails the use of recurrent neural networks and a learning algorithm based on the paradigm of learning automata. Two parallel implementations of the algorithm were applied - one on a computing cluster using MPI and OpenMP libraries and one on a graphics processing unit using the CUDA library. The performance of both parallel implementations justifies the development of parallel algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/733_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/733_cover-medium_large.jpg
Academia
University of Ljubljana
2009
09
30
09/30/2009
Uros Lotric
Andrej Dobnikar
Paper
Science
Uros Lotric,Andrej Dobnikar,uros.lotric@fri.uni-lj.si,andrej.dobnikar@fri.uni-lj.si
91c16088-9260-4f2d-aaea-bdfde57fb25d
Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station
We implement image correlation, a fundamental component of many real-time imaging and tracking systems, on a graphics processing unit (GPU) using NVIDIA's CUDA platform. We use our code to analyze images of liquid-gas phase separation in a model colloid-polymer system, photographed in the absence of gravity aboard the International Space Station (ISS). Our GPU code is 4,000 times faster than simple MATLAB code performing the same calculation on a central processing unit (CPU), 130 times faster than simple C code, and 30 times faster than optimized C++ code using single-instruction, multiple-data (SIMD) extensions. The speed increases from these parallel algorithms enable us to analyze images downlinked from the ISS in a rapid fashion and send feedback to astronauts on orbit while the experiments are still being run.
/content/cudazone/CUDABrowser/assets/images/applications/732_iss_small.png
/content/cudazone/CUDABrowser/assets/images/applications/732_iss_large.png
Academia
Harvard University
2009
10
30
10/30/2009
130
Peter J. Lu
Hidekazu Oki
Catherine A. Frey
Paper
Science
Peter J. Lu,Hidekazu Oki,Catherine A. Frey
b3c55a10-7397-4e94-a954-a949e0bc26cd
A mathematical speedup prediction model for parallel vs. sequential programs
Data independent command sequences are part of many algorithms. One way to speed up their execution is processing on a single instruction multiple data (SIMD) architecture. But an implementation is not necessarily efficient. To predict program acceleration for NVIDIA's compute unified device architecture (CUDA), a parallel computing platform based on graphics boards, a mathematical model is developed. This model extends the common approach for so-called speedup prediction by CUDA hardware and algorithm specific parameters. The identification of some model parameters is difficult since they depend on hardware internal parameters. The model is tested for a convolution filter and yields conservative processing time predictions.
/content/cudazone/CUDABrowser/assets/images/applications/731_prediction_small.png
/content/cudazone/CUDABrowser/assets/images/applications/731_prediction_large.png
Academia
University of Applied Sciences Gelsenkirchen
2009
02
04
02/04/2009
Heinrich Martin Overhoff
Paper
Computer Aided Engineering
Heinrich Martin Overhoff,heinrich-martin.overhoff@fh-gelsenkirchen.de
8a45da2c-3f81-429c-8576-c6be8690765f
Improving the Performance of Hyperspectral Image and Signal Processing Algorithms Using Parallel, Distributed and Specialized Hardware-Based Systems
Advances in sensor technology are revolutionizing the way remotely sensed data is collected, managed and analyzed. The incorporation of latest generation sensors into airborne and satellite platforms is currently producing a nearly continual stream of high dimensional data, and this explosion in the amount of collected information has rapidly created new processing challenges.
http://www.springerlink.com/content/hp81u02p11126226/?p=c5eead9af73340e58a313d95581cfd40&pi=47
/content/cudazone/CUDABrowser/assets/images/applications/729_hyperspectral_small.png
/content/cudazone/CUDABrowser/assets/images/applications/729_hyperspectral_large.png
Academia
University of Extremadura
2010
01
01
01/01/2010
Antonio Plaza
Javier Plaza
Hugo Vegas
Paper
Science
Antonio Plaza,Javier Plaza,Hugo Vegas,aplaza@unex.es,jplaza@unex.es,hugovegas@fdi.ucm.es
66a4c5f1-213c-4450-b8d1-ce3745396713
GCSim: A GPU-Based Trace-Driven Simulator for Multi-level Cache
We describe the design of a parallel trace-driven cache simulator for evaluating different cache structures. As research goes deeper, traditional simulation methods, which can only execute simulation operations in sequence, are no longer practical because of their long simulation cycles. An obvious way to achieve fast parallel simulation is to simulate the independent sets of a cache concurrently on different compute resources. We considered using a generic GPU to accelerate cache simulation, exploiting set partitioning as the main source of parallelism. However, we show that this technique is not efficient when simulating just one cache configuration, because activity is highly correlated between different sets. We therefore develop trace-sort and single-pass multi-configuration simulation techniques, taking advantage of the full programmability offered by the Compute Unified Device Architecture (CUDA) on the GPU. Our experimental results demonstrate that the cache simulator on a GPU-CPU platform achieves a 2.44x performance improvement over the traditional sequential algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/728_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/728_cover-medium_large.jpg
Academia
Beihang University
2009
08
21
08/21/2009
3
Han Wan
Xiaopeng Gao
Xiang Long
Paper
Science
Han Wan,Xiaopeng Gao,Xiang Long,wanhan@les.buaa.edu.cn,gxp@les.buaa.edu.cn,long@les.buaa.edu.cn
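The set-partitioning idea in the GCSim abstract above can be illustrated with a minimal Python sketch. It is not the authors' simulator: the LRU replacement policy, the function names `simulate_set` and `simulate_cache`, and the address-to-set mapping are all assumptions made for illustration. The point is only that each set's miss count depends solely on the accesses mapped to that set, so the per-set replays are independent and could run concurrently.

```python
from collections import OrderedDict


def simulate_set(blocks, ways):
    """Replay the block addresses mapped to ONE cache set; return its miss count.

    LRU replacement is an assumption of this sketch, not something the
    abstract specifies.
    """
    lru = OrderedDict()  # keys are resident blocks, ordered oldest to newest
    misses = 0
    for block in blocks:
        if block in lru:
            lru.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used block
            lru[block] = True
    return misses


def simulate_cache(trace, num_sets, ways, block_size):
    """Partition a trace of byte addresses by set, then simulate each set."""
    per_set = [[] for _ in range(num_sets)]
    for addr in trace:
        block = addr // block_size
        per_set[block % num_sets].append(block)
    # Each call below is independent of the others, which is the source of
    # parallelism that set partitioning exposes.
    return sum(simulate_set(blocks, ways) for blocks in per_set)


if __name__ == "__main__":
    trace = [0, 64, 0, 128, 64]
    print(simulate_cache(trace, num_sets=1, ways=2, block_size=64))
    print(simulate_cache(trace, num_sets=2, ways=1, block_size=64))
```

If most accesses land in a few sets, the per-set work lists are unbalanced, which is one way to read the abstract's remark that set partitioning alone is inefficient for a single configuration.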
f8d98fea-3e48-4399-97c6-1cc70bf36e27
GPU Accelerated RNA Folding Algorithm
Many bioinformatics studies require the analysis of RNA or DNA structures. More specifically, extensive work has gone into devising efficient algorithms that predict the 2-D folding structures of RNA or DNA sequences. However, the high computational complexity of these algorithms, combined with the rapid increase of genomic data, creates a need for faster methods. Current approaches focus on parallelizing these algorithms on multiprocessor systems or on clusters, yielding good performance but at a relatively high cost. Here, we explore the use of computer graphics hardware to speed up these algorithms, which theoretically provides both high performance and low cost. We use the CUDA programming language to harness the power of NVIDIA graphics cards for general computation with a C-like environment. On recent graphics cards, the implementation achieves a 17x speed-up.
/content/cudazone/CUDABrowser/assets/images/applications/727_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/727_cover-medium_large.jpg
Academia
universitaire de Beaulieu
2009
05
20
05/20/2009
17
Guillaume Rizk
Dominique Lavenier
Paper
Science
Guillaume Rizk,Dominique Lavenier,guillaume.rizk@irisa.fr,dominique.lavenier@irisa.fr
b122d8ea-f81f-4796-a3b4-2f519b9b05f2
Multimedia Mining on Manycore Architectures: The Case for GPUs
Media mining, the extraction of meaningful knowledge from multimedia content, poses significant computational challenges on today's platforms, particularly in real-time scenarios. In this paper, we show how Graphics Processing Units (GPUs) can be leveraged for compute-intensive media mining applications. Furthermore, we propose a parallel implementation of color visual descriptors (color correlograms and color histograms) commonly used in multimedia content analysis on a CUDA (Compute Unified Device Architecture) enabled GPU (the Nvidia GeForce GTX280 GPU). Through the use of shared memory as a software-managed cache and efficient data partitioning, we reach computation throughputs of over 1.2 Giga Pixels/sec for HSV color histograms and over 100 Mega Pixels/sec for HSV color correlograms. We show that we can achieve better-than-real-time performance, major speedups compared to high-end multicore CPUs, and performance comparable to known implementations on the Cell B.E. We also study different trade-offs in the size and complexity of the features and their effect on performance.
/content/cudazone/CUDABrowser/assets/images/applications/726_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/726_cover-medium_large.jpg
Academia
Georgia Institute of Technology
2009
11
26
11/26/2009
Mamadou Diao
Jongman Kim
Paper
Science
Mamadou Diao,Jongman Kim,mamadou@ece.gatech.edu,jkim@ece.gatech.edu
73c6f376-cc30-4f50-a53d-0246839f1870
MCUDA: An Efficient Implementation of CUDA Kernels for Multi-core CPUs
CUDA is a data parallel programming model that supports several key abstractions - thread blocks, hierarchical memory and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared memory, multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. Preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs fission to eliminate barrier synchronizations, and converts scalar references to thread-local data to replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
/content/cudazone/CUDABrowser/assets/images/applications/725_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/725_cover-medium_large.jpg
Academia
University of Illinois at Urbana-Champaign
2008
11
28
11/28/2008
John A. Stratton
Sam S. Stone
Wen-mei W. Hwu
Paper
Science
John A. Stratton,Sam S. Stone,Wen-mei W. Hwu,stratton@crhc.uiuc.edu,ssstone2@crhc.uiuc.edu,hwu@crhc.uiuc.edu
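The transformations named in the MCUDA abstract above (thread loops, loop fission at barriers, and replication of thread-local scalars) can be sketched in plain Python. This is a hypothetical toy kernel, not MCUDA output: the kernel body, the function name `block_kernel_as_loops`, and the variable names are assumptions for illustration. The imagined per-thread kernel is `x = data[tid] * 2; shared[tid] = x; barrier; out[tid] = x + shared[(tid + 1) % block_dim]`.

```python
def block_kernel_as_loops(data, block_dim):
    """Run one thread block of a hypothetical SPMD kernel as explicit loops."""
    shared = [0] * block_dim
    # The thread-local scalar x is live across the barrier, so it is
    # replicated into a vector with one slot per logical thread.
    x = [0] * block_dim
    out = [0] * block_dim

    # Region before the barrier: one loop over all logical threads.
    for tid in range(block_dim):
        x[tid] = data[tid] * 2
        shared[tid] = x[tid]

    # The barrier becomes a loop boundary (loop fission): every logical
    # thread has finished the first region before any starts the second.
    for tid in range(block_dim):
        out[tid] = x[tid] + shared[(tid + 1) % block_dim]

    return out


if __name__ == "__main__":
    print(block_kernel_as_loops([1, 2, 3, 4], 4))
```

Keeping both regions in a single loop would be incorrect here, because thread `tid` would read `shared[tid + 1]` before the neighbouring logical thread had written it.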
1c474439-b236-4f2e-b625-a7e540f06ffa
A Real-Time Video Illustration Using CUDA
As video technology advances, demand is growing for a variety of special video effects. Conventional image-transform effects can be applied to video streams, but non-photorealistic rendering effects are not easy to apply. For example, cartoon or illustration effects are computationally expensive, which makes them difficult to execute in real time. In this paper, we propose a video transformation system with illustration effects. It is designed to apply the illustration effects to the video stream directly and is implemented to achieve real-time performance using GPU hardware with NVIDIA's CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/724_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/724_cover-medium_large.jpg
Academia
Electronics and Telecommunications Research Institute
2009
08
28
08/28/2009
JiHyung Lee
Yoon-Seok Choi
Bon-Ki Koo
Paper
Science
JiHyung Lee,Yoon-Seok Choi,Bon-Ki Koo,ijihyung@etri.re.kr,ys-choi@etri.re.kr,bkkoo@etri.re.kr
f3da9a57-398b-4320-adf8-c66fc56e7440
A Fast and Flexible Sorting Algorithm with CUDA
In this paper, we propose a fast and flexible sorting algorithm with CUDA. The proposed algorithm is much more practical than previous GPU-based sorting algorithms, as it can sort elements represented by integers, floats and structures. Our algorithm is also optimized for the modern GPU architecture to obtain high performance. To make it adaptive, we use different strategies for sorting disordered lists and nearly sorted lists. Extensive experiments demonstrate that our algorithm outperforms previous GPU-based sorting algorithms and can support real-time applications.
/content/cudazone/CUDABrowser/assets/images/applications/723_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/723_cover-medium_large.jpg
Academia
Chinese Academy of Sciences
2009
07
31
07/31/2009
Shifu Chen
Jing Qin
Yongming Xie
Paper
Numerics
Shifu Chen,Jing Qin,Yongming Xie,sf.chen@siat.ac.cn,jqin@cse.cuhk.edu.hk,ymxie@cse.cuhk.edu.hk
448d3111-a000-421c-b2da-c00e1509d590
Exploring Parallel Algorithms for Volumetric Mass-Spring-Damper Models in CUDA
Since the advent of programmable graphics processors (GPUs), their computational power has been utilized for general-purpose computation, initially by exploiting graphics APIs and recently through dedicated parallel computation frameworks such as the Compute Unified Device Architecture (CUDA) from Nvidia. This paper investigates multiple implementations of volumetric Mass-Spring-Damper systems in CUDA. The obtained performance is compared to previous implementations utilizing the GPU through the OpenGL graphics API. We find that both performance and optimization strategies differ widely between the OpenGL and CUDA implementations. Specifically, the previous recommendation of using implicitly connected particles is replaced by a recommendation that supports unstructured meshes and run-time topological changes with an insignificant performance reduction.
/content/cudazone/CUDABrowser/assets/images/applications/722_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/722_cover-medium_large.jpg
Academia
University of Aarhus
2008
07
07
07/07/2008
Allan Rasmusson
Jesper Mosegaard
Thomas Sangild
Paper
Science
Allan Rasmusson,Jesper Mosegaard,Thomas Sangild
8072ce4d-85ec-4bad-8d3f-986eae58cfd2
Implementation of Parallel Genetic Algorithm Based on CUDA
The Genetic Algorithm (GA) is a powerful tool for scientific computing, and the Parallel Genetic Algorithm (PGA) further improves computing performance. However, a traditional parallel computing environment is difficult to set up, to say nothing of its price. This has motivated moving compute-intensive work to graphics hardware, which is inexpensive and more powerful. The paper presents a hierarchical parallel genetic algorithm implemented with NVIDIA's Compute Unified Device Architecture (CUDA). By combining the master-slave and multiple-deme parallelization methods, this algorithm makes better use of threads and high-speed shared memory in CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/721_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/721_cover-medium_large.jpg
China University of Geosciences
2009
09
30
09/30/2009
Sifa Zhang
Zhenming He
Paper
Science
Sifa Zhang,Zhenming He
c8b48fed-9a91-483e-bfcd-d80e41dde203
Memory Locality Exploitation Strategies for FFT on the CUDA Architecture
Modern graphics processing units (GPUs) are becoming more and more suitable for general-purpose computing due to their growing computational power. These commodity processors follow, in general, a parallel SIMD execution model whose efficiency depends on proper exploitation of the explicit memory hierarchy, among other factors. In this paper we analyze the implementation of the Fast Fourier Transform using the programming model of the Compute Unified Device Architecture (CUDA) recently released by NVIDIA for its new graphics platforms. Within this model we propose an FFT implementation that takes into account memory reference locality issues that are crucial to achieving high execution performance. This proposal has been experimentally tested and compared with other well-known approaches such as the manufacturer's FFT library.
/content/cudazone/CUDABrowser/assets/images/applications/720_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/720_cover-medium_large.jpg
Academia
University of Malaga
2008
12
06
12/06/2008
Eladio Gutierrez
Sergio Romero
Maria A. Trenas
Paper
Science
Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@ac.uma.es,sromero@ac.uma.es,maria@ac.uma.es
b2b7e513-4ef5-4e1a-8a28-1aa414ba9965
Parallel Quantum Computer Simulation on the CUDA Architecture
Due to their increasing computational power, modern graphics processing architectures are becoming more and more popular for general-purpose applications with high performance demands. This is the case for quantum computer simulation, a problem with high computational requirements in both memory and processing power. For such simulations, multiprocessor architectures are almost a necessity. In this paper we explore the use of the new graphics processor architecture NVIDIA CUDA in the simulation of some basic quantum computing operations. This new architecture is oriented towards a more general exploitation of the graphics platform, allowing it to be used as a parallel SIMD multiprocessor. In this direction, some implementation strategies are proposed, showing that the effectiveness of the codes depends on proper exploitation of the underlying memory hierarchy.
/content/cudazone/CUDABrowser/assets/images/applications/718_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/718_cover-medium_large.jpg
Academia
University of Malaga
2008
06
25
06/25/2008
Eladio Gutierrez
Sergio Romero
Maria A. Trenas
Paper
Science
Eladio Gutierrez,Sergio Romero,Maria A. Trenas,eladio@ac.uma.es,sromero@ac.uma.es,maria@ac.uma.es
3a301449-69ff-493b-a459-7f4ff6b973a0
Implementation of a Lattice-Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology
The Lattice-Boltzmann method (LBM) is a distribution-function-based approach to numerical fluid mechanics. Due to the simple formulation of the underlying algorithm, this method is well suited for parallelization and hardware acceleration using general-purpose graphics processing units (GPGPUs). Within this work, LBM has been implemented in a new code with multi-GPU support and physically validated for flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single-socket multi-core CPU calculation. Moreover, the validation for the chosen test case shows excellent agreement with available reference data.
/content/cudazone/CUDABrowser/assets/images/applications/717_implementation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/717_implementation_large.png
Academia
Technische Universität München
2009
05
06
05/06/2009
18
T. Indinger
E. Riegel
N. A. Adams
Paper
Science
T. Indinger,E. Riegel,N. A. Adams,Thomas.Indinger@tum.de
ade79ba5-8167-479e-837c-4edc6c615cd4
CUDA-Lite: Reducing GPU Programming Complexity
The computer industry has transitioned into multi-core and many-core parallel systems. The CUDA programming environment from NVIDIA is an attempt to make programming many-core GPUs more accessible to programmers. However, there are still many burdens placed upon the programmer to maximize performance when using CUDA. One such burden is dealing with the complex memory hierarchy. Efficient and correct usage of the various memories is essential, making a difference of 2-17x in performance. Currently, the task of determining the appropriate memory to use and the coding of data transfer between memories is still left to the programmer. We believe that this task can be better performed by automated tools. We present CUDA-lite, an enhancement to CUDA, as one such tool. We leverage programmer knowledge via annotations to perform transformations and show preliminary results that indicate auto-generated code can have performance comparable to hand coding.
/content/cudazone/CUDABrowser/assets/images/applications/716_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/716_cover-medium_large.jpg
Academia
University of Illinois at Urbana-Champaign
2008
11
28
11/28/2008
17
Sain-Zee Ueng
Melvin Lathara
Sara S. Baghsorkhi
Paper
Programming Tools
Sain-Zee Ueng,Melvin Lathara,Sara S. Baghsorkhi,ueng@crhc.uiuc.edu,mlathara@crhc.uiuc.edu,bsadeghi@crhc.uiuc.edu
99559840-4e04-42b3-aef7-def473ec47e9
Neville elimination on multi- and many-core systems: OpenMP, MPI and CUDA
This paper describes several parallel algorithmic variations of Neville elimination. This elimination method solves a system of linear equations by making zeros in a matrix column, adding to each row an adequate multiple of the preceding one. The parallel algorithms are run and compared on different multi- and many-core platforms using parallel programming techniques such as MPI, OpenMP and CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/715_neville_small.png
/content/cudazone/CUDABrowser/assets/images/applications/715_neville_large.png
Academia
Universidad de Oviedo / Universidad Politecnica de Valencia
2009
11
18
11/18/2009
P. Alonso
R. Cortina
F. J. Martinez Zaldivar
Paper
Science
Neville,Multi core, Many core, OpenMP, MPI,GPU,CUDA,CUBLAS,P. Alonso,R. Cortina,F. J. Martinez Zaldivar,palonso@uniovi.es,raquel@uniovi.es,fjmartin@dcom.upv.es
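The elimination step described in the Neville abstract above (zeroing a column by adding to each row a multiple of the preceding row) can be sketched sequentially in Python. This is a minimal illustration, not one of the paper's parallel variants: the function name `neville_solve`, the use of exact `Fraction` arithmetic, and the decision to raise an error instead of exchanging rows are assumptions of this sketch.

```python
from fractions import Fraction


def neville_solve(a, b):
    """Solve a x = b by Neville elimination followed by back substitution.

    Assumes no row exchanges are needed (true, for example, for totally
    positive matrices); otherwise a ValueError or ZeroDivisionError results.
    """
    n = len(a)
    # Augmented matrix [a | b] in exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]

    for k in range(n - 1):
        # Work bottom-up so row i-1 still holds its original entry in
        # column k when it is used to eliminate row i.
        for i in range(n - 1, k, -1):
            if m[i][k] == 0:
                continue
            if m[i - 1][k] == 0:
                raise ValueError("row exchange needed; not handled in this sketch")
            factor = m[i][k] / m[i - 1][k]
            for j in range(k, n + 1):
                m[i][j] -= factor * m[i - 1][j]

    # Back substitution on the resulting upper triangular system.
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


if __name__ == "__main__":
    a = [[1, 1, 1], [1, 2, 4], [1, 3, 9]]
    b = [6, 17, 34]
    print(neville_solve(a, b))
```

Unlike Gaussian elimination, which subtracts multiples of one pivot row from all rows below it, each update here involves only two adjacent rows.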
662a5685-925d-4f0a-9e8e-8cf9681a85a4
Real-Time Ray Tracing with CUDA
Graphics processors (GPUs) have recently emerged as a low-cost alternative for parallel programming. Since modern GPUs have great computational power as well as high memory bandwidth, running ray tracing on them has been an active field of research in computer graphics in recent years. Furthermore, the introduction of CUDA, a novel GPGPU architecture, has removed several limitations that traditional GPU-based ray tracing suffered from. In this paper, an implementation of high-performance CUDA ray tracing is demonstrated. We focus on performance and show how our design choices in various optimizations lead to an implementation that outperforms previous work. For reasonably complex scenes with simple shading, our implementation achieves a performance of 30 to 43 million traced rays per second. Our implementation also includes the effects of recursive specular reflection and refraction, which were less discussed in previous GPU-based ray tracing work.
/content/cudazone/CUDABrowser/assets/images/applications/714_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/714_cover-medium_large.jpg
Academia
National Tsing Hua University / National Taiwan Normal University
2009
07
31
07/31/2009
Min Shih
Yung-Feng Chiu
Ying-Chieh Chen
Paper
Imaging
Ray Tracing - Programmable Graphics Hardware - GPU Computing - CUDA - Multithreaded Architectures,Min Shih,Yung-Feng Chiu,Ying-Chieh Chen,min_shih@ibr.cs.nthu.edu.tw,yfchiu@ibr.cs.nthu.edu.tw,louis@ibr.cs.nthu.edu.tw
21718119-decf-4a85-ad89-c2acf965a3a1
Scalable and highly parallel implementation of Smith-Waterman on graphics processing unit using CUDA
Program development environments have enabled graphics processing units (GPUs) to become an attractive high performance computing platform for the scientific community. A commonly posed problem in computational biology is protein database searching for functional similarities. The most accurate algorithm for sequence alignments is Smith-Waterman (SW). However, due to its computational complexity and rapidly increasing database sizes, the process becomes more and more time-consuming, making cluster-based systems more desirable. Therefore, scalable and highly parallel methods are necessary to make SW a viable solution for life science researchers. In this paper we evaluate how SW fits onto the target GPU architecture by exploring ways to map the program architecture onto the processor architecture. We develop new techniques to reduce the memory footprint of the application while exploiting the memory hierarchy of the GPU. With this implementation, GSW, we overcome the on-chip memory size constraint, achieving a 23x speedup compared to a serial implementation. Results show that as the query length increases our speedup stays almost constant, indicating the solid scalability of our approach. Additionally, this is a first-of-its-kind implementation that runs purely on the GPU instead of in a CPU-GPU integrated environment, making our design suitable for porting onto a cluster of GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/713_scalable_small.png
/content/cudazone/CUDABrowser/assets/images/applications/713_scalable_large.png
Academia
University of Arizona
2009
06
11
06/11/2009
23
Ali Akoglu
Gregory M. Striemer
Paper
Science
Ali Akoglu,Gregory M. Striemer,akoglu@email.arizona.edu,gmstrie@email.arizona.edu
94546634-a42d-4be6-9db8-6e5f34612d9f
Hybrid of genetic algorithm and local search to solve MAX-SAT problem using nVidia CUDA framework
General Purpose computing over Graphical Processing Units (GPGPUs) is a huge paradigm shift in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a hybrid of Genetic Algorithm (GA) and Local Search (LS) to solve the MAXimum SATisfiability (MAX-SAT) problem on a state-of-the-art nVidia Tesla GPU using the nVidia Compute Unified Device Architecture (CUDA). MAX-SAT is a problem of practical importance and is often solved by employing metaheuristic-based search methods such as GAs and hybrids of GA with LS. Almost all the parallel GAs (pGAs) designed in the last two decades were designed for either clusters or MPPs. Unfortunately, very little research has been done on the implementation of such algorithms on commodity graphics hardware. GAs in their simple form are not suitable for implementation on the Single Instruction Multiple Thread (SIMT) architecture of a GPU, and the same is the case with conventional LS algorithms. In this paper we explore different genetic operators that can be used for an efficient implementation of GAs on nVidia GPUs. We also design and introduce new techniques/operators for an efficient implementation of GAs and LS on such architectures. We use an nVidia Tesla C1060 to perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25x. We also discuss the effects of different optimization techniques on the overall execution time.
/content/cudazone/CUDABrowser/assets/images/applications/712_hybrid_small.png
/content/cudazone/CUDABrowser/assets/images/applications/712_hybrid_large.png
Academia
Hokkaido University
2009
10
20
10/20/2009
25
Asim Munawar
Mohamed Wahib
Masaharu Munetomo
Paper
Science
Compute unified device architecture (CUDA) - General-purpose computing on graphics processing unit (GPGPU) - Genetic algorithm (GA) - MAXimum SATisfiability problem (MAX-SAT) - Single instruction multiple data (SIMD) - Single instruction multiple threads (SIMT),Asim Munawar,Mohamed Wahib,Masaharu Munetomo,asim@uva.cims.hokudai.ac.jp,wahibium@uva.cims.hokudai.ac.jp,munetomo@iic.hokudai.ac.jp
3c39d32e-5e81-4e93-8d02-4c6e2105d2be
CUDA Solutions for the SSSP Problem
We present several algorithms that solve the single-source shortest-path problem using CUDA. We have run them on a database composed of hundreds of large graphs represented by adjacency lists and adjacency matrices, achieving high speedups relative to a CPU implementation based on Fibonacci heaps. Concerning correctness, we outline why our solutions work, and show that a previous approach [10] is incorrect.
/content/cudazone/CUDABrowser/assets/images/applications/711_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/711_cover-medium_large.jpg
Academia
Universidad Complutense de Madrid
2009
05
20
05/20/2009
Pedro J. Martin
Roberto Torres
Antonio Gavilanes
Paper
Science
Shortest path algorithms - GPU - CUDA,Pedro J. Martin,Roberto Torres,Antonio Gavilanes,pjmartin@sip.ucm.es,r.torres@fdi.ucm.es,agav@sip.ucm.es
b6723f4e-8b3e-416f-a229-ba8f7fbb2334
Adaptative Resonance Theory Fuzzy Networks Parallel Computation Using CUDA
Programming of Graphics Processing Units (GPUs) has evolved so that they can be used to speed up the computation of algorithms that fit data-parallel models. In this paper, the parallelization of a Fuzzy ART algorithm is described and a detailed explanation of its implementation under CUDA is given. Experimental results show the algorithm runs up to 52 times faster on the GPU than on the CPU for testing and 18 times faster for training under specific conditions.
/content/cudazone/CUDABrowser/assets/images/applications/710_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/710_cover-medium_large.jpg
Academia
University of Valladolid
2009
06
05
06/05/2009
M. Martinez-Zarzuela
F. J. Diaz Pernas
A. Tejero de Pablos
Paper
Science
M. Martinez-Zarzuela,F. J. Diaz Pernas,A. Tejero de Pablos
627e05b2-0728-4871-ad2f-108473423236
Accelerating Large Graph Algorithms on the GPU Using CUDA
Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms, including breadth-first search, single-source shortest path, and all-pairs shortest path, using CUDA on large graphs. We can compute the single-source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
/content/cudazone/CUDABrowser/assets/images/applications/709_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/709_cover-medium_large.jpg
Academia
International Institute of Information Technology Hyderabad
2008
01
22
01/22/2008
Pawan Harish
P. J. Narayanan
Paper
Science
Pawan Harish,P. J. Narayanan,harishpk@research.iiit.ac.in,pjn@iiit.ac.in
cef5a0e1-9512-4dd7-bf59-e0537bceb8f1
Molecular Dynamics Simulations on Commodity GPUs with CUDA
Molecular dynamics simulations are a common and often repeated task in molecular biology. The need for speeding up this treatment comes from the requirement for large system simulations with many atoms and numerous time steps. In this paper we present a new approach to high performance molecular dynamics simulations on graphics processing units. Using modern graphics processing units for high performance computing is facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed. To derive an efficient mapping onto this type of architecture, we have used the Compute Unified Device Architecture (CUDA) to design and implement a new parallel algorithm. This results in an implementation with significant runtime savings on an off-the-shelf computer graphics card.
/content/cudazone/CUDABrowser/assets/images/applications/708_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/708_cover-medium_large.jpg
Academia
Nanyang Technological University
2008
01
22
01/22/2008
Weiguo Liu
Bertil Schmidt
Gerrit Voss
Paper
Science
Weiguo Liu,Bertil Schmidt,Gerrit Voss,liuweiguo@ntu.edu.sg,bertil.schmidt@unsw.edu.au,asgerrit@ntu.edu.sg
f891bba8-5e53-45f2-a171-f2fd2b1b02e2
Accelerating Cone Beam Reconstruction Using the CUDA-Enabled GPU
Compute unified device architecture (CUDA) is a software development platform that enables us to write and run general-purpose applications on the graphics processing unit (GPU). This paper presents a fast method for cone beam reconstruction using the CUDA-enabled GPU. The proposed method is accelerated by two techniques: (1) off-chip memory access reduction; and (2) memory latency hiding. We describe how these techniques can be incorporated into CUDA code. Experimental results show that the proposed method runs at 82% of the peak memory bandwidth, taking 5.6 seconds to reconstruct a 512^3-voxel volume from 360 512^2-pixel projections. This performance is 18% faster than the prior method. Some detailed analyses are also presented to understand how effectively the acceleration techniques increase the reconstruction performance of a naive method.
/content/cudazone/CUDABrowser/assets/images/applications/707_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/707_cover-medium_large.jpg
Academia
Osaka University
2008
12
17
12/17/2008
Yusuke Okitsu
Fumihiko Ino
Paper
Science
Yusuke Okitsu,Fumihiko Ino,y-okitu@ist.osaka-u.ac.jp,ino@ist.osaka-u.ac.jp
3ec2d6eb-a3ca-4435-9e66-c0dbe3e61938
Parallelization of a Video Segmentation Algorithm on CUDA Enabled Graphics Processing Units
Nowadays, Graphics Processing Units (GPUs) are emerging as SIMD coprocessors for general-purpose computations, especially after the launch of nVIDIA CUDA. Since then, some libraries have been implemented for matrix computation and image processing. However, in real video applications some stages need irregular data distributions and the parallelism is less inherent. This paper presents the parallelization of a video segmentation application on GPU hardware, which implements an algorithm for detecting abrupt and gradual transitions. A critical part of the algorithm requires highly intensive computation to calculate video frame features. Results on three CUDA-enabled GPUs are encouraging because of the significant speedup achieved. They are also compared with an OpenMP version of the algorithm, running on two platforms with multiple cores.
/content/cudazone/CUDABrowser/assets/images/applications/706_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/706_cover-medium_large.jpg
Academia
University of Cordoba / University of Malaga
2009
08
22
08/22/2009
Juan Gomez-Luna
Jose Maria Gonzalez-Linares
Jose Ignacio Benavides
Paper
Science
Juan Gomez-Luna,Jose Maria Gonzalez-Linares,Jose Ignacio Benavides,el1goluj@uco.es,gonzalez@ac.uma.es,el1bebej@uco.es
7628110e-6e5b-4952-9a5f-dbe69816046e
A CUDA-Supported Approach to Remote Rendering
In this paper we present the utilization of advanced programming techniques on current graphics hardware to improve the performance of remote rendering for interactive applications. We give an overview of existing systems in remote rendering and focus on some general bottlenecks of remote visualization. Afterwards we describe current developments in graphics hardware and software and outline how they can be used to increase the performance of remote graphics systems. Finally we present some results and benchmarks to confirm the validity of our work.
/content/cudazone/CUDABrowser/assets/images/applications/705_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/705_cover-medium_large.jpg
Academia
University of Paderborn
2007
11
22
11/22/2007
Stefan Lietsch
Oliver Marquardt
Paper
Science
Stefan Lietsch,Oliver Marquardt,slietsch@upb.de,marquard@upb.de
50132356-0063-40d3-922b-8bc54e0ecb18
JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA
A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-of-magnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels, and delegate the responsibility of generating the Java-CUDA bridge codes and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher level parallel programming languages like X10 and Habanero-Java.
/content/cudazone/CUDABrowser/assets/images/applications/704_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/704_cover-medium_large.jpg
Academia
Department of Computer Science
2009
08
22
08/22/2009
Yonghong Yan
Max Grossman
Vivek Sarkar
Paper
Science
Yonghong Yan,Max Grossman,Vivek Sarkar,yanyh@rice.edu,jmg3@rice.edu,vsarkar@rice.edu
5a46aeed-d703-4e3f-a010-ccf692264df9
Training Recurrent Neural Network Using Multistream Extended Kalman Filter on Multicore Processor and Cuda Enabled Graphic Processor Unit
Recurrent neural networks are popular tools used for modeling time series. Common gradient-based algorithms are frequently used for training recurrent neural networks. On the other hand, approaches based on Kalman filtering are considered to be the most appropriate general-purpose training algorithms with respect to modeling accuracy. Their main drawbacks are high computational requirements and difficult implementation. In this work we first provide a clear description of the training algorithm using a simple pseudo-language. The problem of high computational requirements is addressed by performing the calculations on a multicore processor and a CUDA-enabled graphics processing unit. We show that a significant reduction in execution time can be achieved by performing the computation on a manycore graphics processing unit.
/content/cudazone/CUDABrowser/assets/images/applications/703_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/703_cover-medium_large.jpg
Academia
Faculty of Informatics and Information Technologies
2009
09
16
09/16/2009
Michal Cernansky
Paper
Science
Michal Cernansky,cernansky@fiit.stuba.sk
f9600d38-e1e3-40c9-bd98-fca07f782225
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation with Nvidia CUDA Compatible Devices
In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with other simpler document models, e.g. probabilistic latent semantic indexing, etc. Therefore, we accelerate CVB inference, an efficient deterministic inference method for LDA, with Nvidia CUDA. In the evaluation experiments, we used a set of 50,000 documents and a set of 10,000 images. We could obtain inference results comparable to sequential CVB inference.
/content/cudazone/CUDABrowser/assets/images/applications/702_cover-medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/702_cover-medium_large.jpg
Academia
Nagasaki University
2009
06
26
06/26/2009
Tomonari Masada
Tsuyoshi Hamada
Yuichiro Shibata
Paper
Tomonari Masada,Tsuyoshi Hamada,Yuichiro Shibata,masada@cis.nagasaki-u.ac.jp,hamada@cis.nagasaki-u.ac.jp,shibata@cis.nagasaki-u.ac.jp
9396fec3-f4d1-419d-9345-537bb1e70f10
POSIX Threads and NVIDIA's CUDA
The current progression of commodity processing architectures exhibits a trend toward increasing parallelism, requiring that undergraduate students in a wide range of technical disciplines gain an understanding of problem solving in massively parallel environments. However, as a small comprehensive college, we cannot currently afford to dedicate an entire semester-long course to the study of parallel computing. To combat this situation, we have integrated the key components of such a course into a 300-level course on modern operating systems. In this paper, we describe a parallel computing unit that is designed to dovetail with the discussion of process and thread management common to operating systems courses. We also describe a set of self-contained projects in which students explore two parallel programming models, POSIX Threads and NVIDIA's Compute Unified Device Architecture, that enable parallel architectures to be utilized effectively. In our experience, this unit can be integrated with traditional operating systems topics quite readily, making parallel computing accessible to undergraduate students without requiring a full course dedicated to these increasingly important topics.
/content/cudazone/CUDABrowser/assets/images/applications/701_mte_small.png
/content/cudazone/CUDABrowser/assets/images/applications/701_mte_large.png
Academia
ogf.org
2008
12
31
12/31/2008
ogf.org
Paper
ogf.org
a23ac3b0-45a9-4784-a714-af9f875bd5cc
Open Inventor by VSG
Open Inventor by VSG provides application developers with a unique solution that enables interoperability between advanced 3D visualization and powerful GPU-based computing capabilities to perform parallel computation on the fly on a workstation.
/content/cudazone/CUDABrowser/assets/images/applications/700_vsg_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/700_vsg_logo_large.png
Commercial
VSG
http://www.vsg3d.com/vsg_prod_openinventor.php
2009
12
31
12/31/2009
Commercial
VSG
Application
Multimedia
Oil & Gas
VSG
984915a6-7fd1-45fa-8b37-52fd8af92486
Mental Ray 3.8
iray introduces a new way of utilizing photorealistic rendering, by integrating both preview and final frame rendering in one single interactive process. In addition, the power of the CUDA GPU dramatically shortens the processing time, introducing significant cost optimizations along the rendering pipeline. And the handling simplifications of iray provide a tool that enables professionals to focus on their core business, while still being able to generate beautiful photorealistic images of their works, all without the help of rendering experts and without the need of becoming rendering experts.
/content/cudazone/CUDABrowser/assets/images/applications/scene_update_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/scene_update_new_large.jpg
Commercial
mental images
http://www.mentalimages.com/index.php
2009
12
31
12/31/2009
Commercial
mental images
Application
Multimedia
Imaging
mental images
0152d7ff-43b6-4f43-b884-582438d94a54
AxRTM
Reverse Time Migration (RTM) is the current 'state-of-the-art' in seismic imaging. The strength of RTM stems from the fact that it fully respects the two-way acoustic wave equation, thus improving imaging in areas where complex geology violates the assumptions made in Kirchhoff or one-way wave equation migrations. Until recently, RTM's widespread use was severely hindered by the enormous computing resources required to process the data. This computational bottleneck is now cleared with Acceleware's patent-pending software solution AxRTM.
AxRTM provides the core numerical functionality of Reverse Time Migration as a library that can be integrated into an existing seismic processing framework. AxRTM has a modular architecture supporting a variety of integrator-supplied functionality, and currently supports both optimized multi-core CPU and NVIDIA GPU hardware.
/content/cudazone/CUDABrowser/assets/images/applications/698_Seismic_velocity_model_sml_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/698_Seismic_velocity_model_sml_large.jpg
Commercial
Acceleware
http://www.acceleware.com/default
2008
01
01
01/01/2008
Acceleware
Application
Paper
Oil & Gas
Acceleware
d9067cbb-646d-4ec7-ae1e-5302f65bed87
Linear Algebra Solvers and High Performance Computing
Solving a system of linear equations is a common numerical technique applied in many fields including fluid dynamics, thermal analysis, mechanical simulations, and economics. As simulations and models increase in complexity, organizations require high performance software to meet their growing computational needs. Several widely available optimized versions of BLAS and LAPACK libraries have been written to take advantage of CPU architectures. Recently, graphics processing units (GPUs) have shown potential to offer substantial performance gains when solving data-intensive calculations.
/content/cudazone/CUDABrowser/assets/images/applications/697_Engine_Block_sml_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/697_Engine_Block_sml_large.jpg
Commercial
Acceleware
http://www.acceleware.com/default/
2009
05
01
05/01/2009
Acceleware
Application
Computer Aided Engineering
Acceleware
16cf695b-a4de-4cd9-9aca-10b8ad8eff95
Unipro UGENE
UGENE is a free, cross-platform bioinformatics toolkit. It works on Windows, Linux, and Mac OS and has out-of-the-box support for modern GPUs, including NVIDIA CUDA. UGENE focuses on integrating highly optimized versions of the most popular bioinformatics algorithms (Smith-Waterman, HMMER, MUSCLE, Phylip, etc.) within a single flexible visual interface.
http://ugene.unipro.ru
/content/cudazone/CUDABrowser/assets/images/applications/680_ss_mac_h1n1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/680_ss_mac_h1n1_large.png
Commercial
Unipro
http://unipro.ru/eng/
2009
07
15
07/15/2009
10
Open source
Unipro UGENE team
Application
Code
Life Sciences
Unipro UGENE team,ugene@unipro.ru
8b638261-433f-4fc9-ad7b-96c2bd1a6599
Movavi Video Suite
Movavi Video Suite is a complete collection of eight powerful yet easy-to-use tools to suit your video processing needs.
/content/cudazone/CUDABrowser/assets/images/applications/694_0000217475_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/694_0000217475_large.jpg
Commercial
Movavi
http://www.movavi.com/
2009
11
25
11/25/2009
Commercial
Movavi
Application
Video & Audio
Movavi
fa0fb82b-edb4-406b-bdcb-5b7d5e3eea51
Movavi Video Converter
Movavi Video Converter is a leading video converter you can use to convert video and audio, save files for portable devices, and rip and burn DVDs.
/content/cudazone/CUDABrowser/assets/images/applications/695_vc9box_jr_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/695_vc9box_jr_large.jpg
Commercial
Movavi
http://www.movavi.com
2009
11
24
11/24/2009
Commercial
Movavi
Application
Video & Audio
Movavi
a6e01b50-0329-4a90-85b3-597c398b8a63
PowerProducer 5
PowerProducer connects your HDV camcorder to your creative side, with a complete range of Blu-ray Disc and DVD authoring features for producing discs of your videos.
/content/cudazone/CUDABrowser/assets/images/applications/693_2eqevch_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/693_2eqevch_large.jpg
Commercial
Cyberlink
http://www.cyberlink.com/
2009
11
02
11/02/2009
5
Commercial
Cyberlink
Application
Video & Audio
Cyberlink
b4bfbf0c-7c02-49f3-9114-1bb621f4b3c7
HD NVR
The HD NVR series network video recorder sets new standards for IP camera recorders, featuring full 1080p HD video output with dual-monitor capability and hardware video acceleration via NVIDIA CUDA. It also features low power consumption with green hard drives of up to 2TB, perfect for megapixel HD cameras. The wireless HD NVR is suitable for up to 16 network cameras and can be used in the home, in the office, or for professional applications.
/content/cudazone/CUDABrowser/assets/images/applications/691_nvr-header-2009_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/691_nvr-header-2009_large.jpg
Commercial
BiKal IP CCTV
http://www.bikal.co.uk/
2009
10
31
10/31/2009
Commercial
BiKal IP CCTV
Application
Video & Audio
BiKal IP CCTV
54670403-5c1a-4f0a-b226-3c1ca3dd071d
EyeSoft
EyeSoft is compatible with IP cameras and USB video devices from many different manufacturers, including analogue video capture cards, alarm boxes, and PTZ keyboards. EyeSoft has an open source architecture allowing the integration of many hardware and software platforms, and its compatibility increases with each release.
/content/cudazone/CUDABrowser/assets/images/applications/690_eyesoft-header2-2009_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/690_eyesoft-header2-2009_large.jpg
BiKal IP CCTV
http://www.bikal.co.uk/
2009
10
30
10/30/2009
BiKal IP CCTV
Application
Video & Audio
EyeSoft
9ae6cb89-6d4c-4480-8021-f35668d44724
Loilo Touch
Now enjoy video editing by simply touching the screen. Enjoy using your fingers directly on your video, picture, and music with your friends and family.
Extreme 10X output is made possible by NVIDIA CUDA technology, which lets the GPU take over for ultra-fast video encoding.
/content/cudazone/CUDABrowser/assets/images/applications/689_touch_05_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/689_touch_05_large.jpg
Commercial
Loilo
http://loilo.tv
2009
10
23
10/23/2009
10
Commercial
Loilo
Application
Multimedia
Video & Audio
Loilo
4db08198-2f0d-49dd-b9b2-6087c1c8b368
Mirics FlexiTV
Mirics FlexiTV™ is a multi-standard broadcast TV receiver for netbooks, notebooks and desktop PCs. Using NVIDIA's CUDA™ GPU acceleration technology for critical TV signal processing, global TV and radio can be received using FlexiTV. The result is a single hardware design for worldwide terrestrial TV and radio reception.
/content/cudazone/CUDABrowser/assets/images/applications/688_mirics_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/688_mirics_large.gif
Commercial
Mirics
http://www.mirics.com/
2009
10
02
10/02/2009
Mirics
Application
Video & Audio
Mirics
b7689d00-714b-455f-af55-9d1714974140
WinDVD 2010
Kick it up a notch with HD! WinDVD Pro is a Blu-ray player that supports AVCHD and even upscales standard DVDs to near-HD quality for more intense movies and music. Includes everything in the Standard version, plus:
NVIDIA GPU-accelerated upscaling for smoother playback of your DVD-Video on a high-definition display. Upscale DVD-Video to fit your HD display, regardless of the platform!
/content/cudazone/CUDABrowser/assets/images/applications/687_images_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/687_images_large.jpg
Commercial
Corel
http://www.corel.com
2009
09
10
09/10/2009
Corel
Application
Video & Audio
Corel
7572a818-d387-4d19-b437-e2244aca5398
MilkyWay@home
The goal of Milkyway@Home is to use the BOINC platform to harness volunteered computing resources in creating a highly accurate three dimensional model of the Milky Way galaxy using data gathered by the Sloan Digital Sky Survey. This project enables research in both astroinformatics and computer science.
/content/cudazone/CUDABrowser/assets/images/applications/686_feed-248_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/686_feed-248_large.jpg
Research
MilkyWay@home
http://milkyway.cs.rpi.edu/milkyway/
2009
08
31
08/31/2009
MilkyWay@home
Application
Science
MilkyWay@home
a4cd88a8-6691-4923-83b0-a03b9a3f6e2b
Roxio Creator 2010
With Creator 2010, you can render and encode your video 5 times faster thanks to NVIDIA CUDA technology.
/content/cudazone/CUDABrowser/assets/images/applications/685_creator2010-box-lg_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/685_creator2010-box-lg_large.jpg
Commercial
Roxio
http://www.roxio.com
2009
08
25
08/25/2009
5
Commercial
Roxio
Application
Video & Audio
Roxio
bd09a954-61de-4510-bf7d-ea219b830f78
DivideFrame GPU Decoder
Hardware-accelerated decoding of AVCHD/QuickTime H.264 files for NLEs.
/content/cudazone/CUDABrowser/assets/images/applications/683_logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/683_logo_large.jpg
Commercial
DivideFrame
http://www.divideframe.com
2009
07
31
07/31/2009
10
Commercial
DivideFrame
Application
Multimedia
Video & Audio
DivideFrame
135ad008-3035-4f9c-943a-af31b3302a2f
Nero Moveit
Nero Move it lets you convert and transfer all your multimedia files to the most popular portable and mobile devices. Easily transfer your MP3, WMA, and other audio and video files to your choice of device: PC, mobile phone, digital camera, and more. Nero Move it converts files quickly and hassle-free from any supported source, including online communities, and easily moves them to iPod, iPhone, PSP, and other mobile devices or online communities such as Blackberry, LG, Xbox, YouTube, and more. Integrated NVIDIA CUDA technology lets users with compatible NVIDIA graphics cards convert their favorite videos faster and more efficiently.
/content/cudazone/CUDABrowser/assets/images/applications/682_featured-product-moveit-eng_small.png
/content/cudazone/CUDABrowser/assets/images/applications/682_featured-product-moveit-eng_large.png
Commercial
Nero
http://www.nero.com/
2009
04
20
04/20/2009
Commercial
Nero
Application
Multimedia
Video & Audio
Nero
cb39210c-23de-464c-ac39-a27c3ea748d6
Elcomsoft Wireless Security Auditor
Elcomsoft Wireless Security Auditor allows network administrators to verify how secure a company's wireless network is by executing an audit of accessible wireless networks. Featuring patent-pending cost-efficient GPU acceleration technologies, Elcomsoft Wireless Security Auditor attempts to recover the original WPA/WPA2-PSK text passwords in order to test how secure your wireless environment is.
/content/cudazone/CUDABrowser/assets/images/applications/681_ewsa_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/681_ewsa_large.gif
Commercial
Elcomsoft
http://www.elcomsoft.com/ewsa.html
2009
01
01
01/01/2009
Commercial
Elcomsoft
Application
Wireless Security,Elcomsoft
0465f26d-a230-406e-8c08-9cb481f02fab
Wave Tomography
2D time-domain waveform tomography reconstruction algorithm using GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/680_cuda_website_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/680_cuda_website_large.jpg
Academia
EPFL
2010
01
20
01/20/2010
Open source
Olivier Roy
Ivana Jovanovic
Reza Parhizkar
Paper
Code
Imaging
Signal Processing
acoustic wave equation, inverse problems, waveform tomography,Olivier Roy,olivier.roy@usense.org
069ab305-5868-4d1b-8713-abf1f7dfd1ef
A performance study of general-purpose applications on graphics processors using CUDA
Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths. GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs). This paper uses NVIDIA's C-like CUDA language and an engineering sample of their recently introduced GTX 260 GPU to explore the effectiveness of GPUs for a variety of application types, and describes some specific coding idioms that improve their performance on the GPU. GPU performance is compared to both single-core and multicore CPU performance, with multicore CPU implementations written using OpenMP. The paper also discusses advantages and inefficiencies of the CUDA programming model and some desirable features that might allow for greater ease of use and also more readily support a larger body of applications.
/content/cudazone/CUDABrowser/assets/images/applications/679_pyramid_small.png
/content/cudazone/CUDABrowser/assets/images/applications/679_pyramid_large.png
Academia
University of Virginia, Department of Computer Science, Charlottesville, VA
2008
03
02
03/02/2008
6
Shuai Che
Michael Boyer
Jiayuan Meng
Paper
Shuai Che,Michael Boyer,Jiayuan Meng,sc5nf@cs.virginia.edu,jm6dg@cs.virginia.edu,jws9c@cs.virginia.edu
254535ef-fa6c-46f1-a198-5a49c6deecbb
Fast N-Body Simulation with CUDA
An N-body simulation numerically approximates the evolution of a system of bodies in which each body continuously interacts with every other body. A familiar example is an astrophysical simulation in which each body represents a galaxy or an individual star, and the bodies attract each other through the gravitational force, as in Figure 31-1. N-body simulation arises in many other computational science problems as well. For example, protein folding is studied using N-body simulation to calculate electrostatic and van der Waals forces. Turbulent fluid flow simulation and global illumination computation in computer graphics are other examples of problems that use N-body simulation.
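The all-pairs interaction described above can be sketched in a few lines of CPU code. This is only an illustration of the O(N^2) gravitational force sum, not the CUDA implementation from the paper; the function name `accelerations`, the gravitational constant `g`, and the `softening` term (a common device for avoiding division by zero at small separations) are assumptions made for this example.

```python
import math

def accelerations(positions, masses, g=1.0, softening=1e-3):
    """Brute-force gravitational acceleration on every body (O(N^2) pairs).

    positions: list of (x, y, z) tuples; masses: list of floats.
    g and softening are illustrative parameters, not values from the paper.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            # Vector from body i to body j.
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            dist_sq = sum(c * c for c in d) + softening * softening
            # a_i += g * m_j * d / |d|^3
            scale = g * masses[j] / (dist_sq * math.sqrt(dist_sq))
            for k in range(3):
                acc[i][k] += scale * d[k]
    return acc
```

On a GPU, the outer loop over `i` is the natural unit of parallelism: each thread can accumulate the acceleration of one body independently of the others.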
/content/cudazone/CUDABrowser/assets/images/applications/678_n-body_small.png
/content/cudazone/CUDABrowser/assets/images/applications/678_n-body_large.png
Academia
NVIDIA Corporation / University of North Carolina at Chapel Hill
2007
12
31
12/31/2007
Lars Nyland
Mark Harris
Jan Prins
Paper
Lars Nyland,Mark Harris,Jan Prins
5a75ea20-e19b-4d31-8ee4-cb07bd0cca4d
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve between 10.5X and 457X speedup in kernel codes and between 1.16X and 431X total application speedup.
/content/cudazone/CUDABrowser/assets/images/applications/677_cover_thumb_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/677_cover_thumb_large.jpg
Academia
University of Illinois at Urbana-Champaign / NVIDIA Corporation
2008
12
31
12/31/2008
457
Shane Ryoo
Christopher I. Rodrigues
Sara S. Baghsorkhi
Sam S. Stone
Wen-mei W. Hwu
Paper
Shane Ryoo,Christopher I. Rodrigues,Sara S. Baghsorkhi,Sam S. Stone,Wen-mei W. Hwu
bb1d56b5-acb4-41ec-a177-b2c2ab424e0f
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment.
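The dynamic programming recurrence behind Smith-Waterman can be sketched as follows. This is a minimal CPU illustration, not the CUDA implementation from the paper; the function name `smith_waterman_score`, the linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example, whereas real protein searches typically use substitution matrices and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    The scoring parameters are illustrative defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or mismatch.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops make the cost proportional to the product of the two sequence lengths, which is the expense the abstract refers to. Cells on the same anti-diagonal do not depend on each other, which is one common way to expose parallelism within a single alignment.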
/content/cudazone/CUDABrowser/assets/images/applications/676_1471-2105-9-S2-S10-1_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/676_1471-2105-9-S2-S10-1_large.gif
Academia
CRIBI, University of Padova / Elaide, Srl, Padova
2008
03
26
03/26/2008
30
Svetlin A Manavski
Giorgio Valle
Paper
Science
Svetlin A Manavski,Giorgio Valle
66e597cc-b024-471c-a2ae-ab091eb6f738
Speeding up Mutual Information Computation Using NVIDIA CUDA Hardware
We present an efficient method for mutual information (MI) computation between images (2D or 3D) for NVIDIA's CUDA-compatible devices. Efficient parallelization of MI is particularly challenging on a GPU due to the need for histogram-based calculation of joint and marginal probability mass functions (pmfs) with a large number of bins. The data-dependent (unpredictable) nature of the updates to the histogram, together with hardware limitations of the GPU (lack of synchronization primitives and limited memory caching mechanisms), can make GPU-based computation inefficient. To overcome these limitations, we approximate the pmfs using a down-sampled version of the joint histogram, which avoids memory update problems. Our CUDA implementation improves the efficiency of MI calculations by a factor of 25 compared to a standard CPU-based implementation and can be used in MI-based image registration applications.
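For reference, the quantity being accelerated can be computed on the CPU from joint and marginal histograms as sketched below. This shows only the standard definition of mutual information for two equally sized sequences of discrete intensity values; it does not reproduce the down-sampled approximation used in the paper, and the function name `mutual_information` is an assumption for this example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length sequences
    of discrete values, e.g. binned pixel intensities of two images."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # joint histogram
    hist_x = Counter(xs)           # marginal histogram of xs
    hist_y = Counter(ys)           # marginal histogram of ys
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (hist_x * hist_y).
        mi += p_xy * math.log2(count * n / (hist_x[x] * hist_y[y]))
    return mi
```

The histogram updates in `Counter` are the data-dependent writes that the abstract identifies as the difficult part to parallelize on a GPU.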
/content/cudazone/CUDABrowser/assets/images/applications/675_comparison_small.png
/content/cudazone/CUDABrowser/assets/images/applications/675_comparison_large.png
Academia
The Australian National University
2007
12
31
12/31/2007
25
Ramtin Shams
Nick Barnes
Paper
Imaging
Ramtin Shams,Nick Barnes,ramtin.shams@anu.edu.au,nick.barnes@nicta.com.au
d3a7926e-9a32-4fbf-abe5-9c2d26d38adc
Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices
We present two efficient histogram algorithms designed for NVIDIA's compute unified device architecture (CUDA) compatible graphics processor units (GPUs). Our algorithms can be used for parallel computation of histograms on large data-sets and for thousands of bins. Traditionally histogram computation has been difficult and inefficient on the GPU. This often means that GPU-based implementations of algorithms that require histogram calculation as part of their computation must transfer data between the GPU and the host memory, which can be a significant bottleneck. Our algorithms remove the need for such costly data transfers by allowing efficient histogram calculation on the GPU. We show that the speed of histogram calculations can be improved by up to 30 times compared to a CPU-based implementation.
/content/cudazone/CUDABrowser/assets/images/applications/674_ParaviewHistogram_small.png
/content/cudazone/CUDABrowser/assets/images/applications/674_ParaviewHistogram_large.png
Academia
The Australian National University
2007
12
01
12/01/2007
30
Ramtin Shams
A. Kennedy
Paper
Code
Ramtin Shams,A. Kennedy
62eb8c85-57f7-4ab2-b7db-6b5e0aab23f9
gpuCuller
gpuCuller is a software library implementing parallel computation of view frustum culling for multiple view frustums and multiple entities (for now, AABBs). Its main application is to compute visible elements for autonomous agents in VR simulation platforms. The library builds up a BVH from the universe entities, which is parsed during culling operations.
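The per-entity test at the heart of view frustum culling can be sketched as below. This is a generic, conservative AABB-versus-plane check and not gpuCuller's API; the function names and the plane convention (each plane given as `(nx, ny, nz, d)`, with points inside the frustum satisfying `nx*x + ny*y + nz*z + d >= 0`) are assumptions for this example, and the BVH traversal mentioned above is omitted.

```python
def aabb_outside_plane(box_min, box_max, plane):
    """True if the whole box lies on the negative side of the plane."""
    nx, ny, nz, d = plane
    # Pick the box corner farthest along the plane normal.
    px = box_max[0] if nx >= 0 else box_min[0]
    py = box_max[1] if ny >= 0 else box_min[1]
    pz = box_max[2] if nz >= 0 else box_min[2]
    # If even that corner is behind the plane, the box is fully outside.
    return nx * px + ny * py + nz * pz + d < 0

def aabb_in_frustum(box_min, box_max, planes):
    """Conservative visibility test: False only when the box is entirely
    behind at least one frustum plane."""
    return not any(aabb_outside_plane(box_min, box_max, p) for p in planes)
```

The test may report a box as visible when it lies just outside a frustum corner, which is usually acceptable for culling. Each (frustum, entity) pair is independent, so the work maps naturally onto one GPU thread per pair.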
/content/cudazone/CUDABrowser/assets/images/applications/673_325px-View_frustum_culling.svg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/673_325px-View_frustum_culling.svg_large.png
Academia
UTBM
2010
01
14
01/14/2010
Nicolas Said
Application
Multimedia
Graphics
Nicolas Said,nicolas.said@gmail.com
00bed0a7-7d59-4f06-b773-fbcb33358272
Flow visualization and flow cytometry with holographic video microscopy
CUDA-accelerated analysis of holographic images yields the three-dimensional position of colloidal spheres with nanometer resolution, and simultaneously yields each sphere's radius and complex refractive index with part-per-thousand resolution.
/content/cudazone/CUDABrowser/assets/images/applications/672_img19_small.png
/content/cudazone/CUDABrowser/assets/images/applications/672_img19_large.png
Academia
New York University
http://physics.nyu.edu/grierlab/
2009
07
16
07/16/2009
20
David G. Grier
Paper
Imaging
Science
Video & Audio
David G. Grier,david.grier@nyu.edu
b9d646ab-1b8e-42dc-b7c0-5033a2c01fe8
GPU acceleration of object classification algorithms using NVIDIA CUDA
The field of computer vision has become an important part of today's society, supporting crucial applications in the medical, manufacturing, military intelligence and surveillance domains. Many computer vision tasks can be divided into fundamental steps: image acquisition, pre-processing, feature extraction, detection or segmentation, and high-level processing. This work focuses on classification and object detection, specifically k-Nearest Neighbors, Support Vector Machine classification, and Viola & Jones object detection. Object detection and classification algorithms are computationally intensive, which makes it difficult to perform classification tasks in real-time. This thesis aims to overcome the processing limitations of the above classification algorithms by offloading computation to the graphics processing unit (GPU) using NVIDIA's Compute Unified Device Architecture (CUDA). The primary focus of this work is the implementation of the Viola and Jones object detector in CUDA. A multi-GPU implementation provides a speedup ranging from 1x to 6.5x over optimized OpenCV code for image sizes of 300 x 300 pixels up to 2900 x 1600 pixels while having comparable detection results. The second part of this thesis is the implementation of a multi-GPU multi-class SVM classifier. The classifier had the same accuracy as an identical implementation using LIBSVM with a speedup ranging from 89x to 263x on the tested datasets. The final part of this thesis was the extension of a previous CUDA k-Nearest Neighbor implementation by exploiting additional levels of parallelism. These extensions provided a speedup of 1.24x and 2.35x over the previous CUDA implementation. As an end result of this work, a library of these three CUDA classifiers has been compiled for use by future researchers.
/content/cudazone/CUDABrowser/assets/images/applications/671_grouping_small.png
/content/cudazone/CUDABrowser/assets/images/applications/671_grouping_large.png
Academia
Rochester Institute of Technology
2009
09
01
09/01/2009
263
Jesse Patrick Harvey
Paper
Jesse Patrick Harvey
59686a98-5e65-42d9-9044-02706ad3d148
Motion estimation for H.264/AVC on multiple GPUs using NVIDIA CUDA
To achieve the high coding efficiency the H.264/AVC standard offers, the encoding process quickly becomes computationally demanding. One of the most intensive encoding phases is motion estimation. Even modern CPUs struggle to process high-definition video sequences in real-time. While personal computers are typically equipped with powerful Graphics Processing Units (GPUs) to accelerate graphics operations, these GPUs lie dormant when encoding a video sequence. Furthermore, recent developments show more and more computer configurations come with multiple GPUs. However, no existing GPU-enabled motion estimation architectures target multiple GPUs. In addition, these architectures provide no early-out behavior nor can they enforce a specific processing order. We developed a motion search architecture, capable of executing motion estimation and partitioning for an H.264/AVC sequence entirely on the GPU using the NVIDIA CUDA (Compute Unified Device Architecture) platform. This paper describes our architecture and presents a novel job scheduling system we designed, making it possible to control the GPU in a flexible way. This job scheduling system can enforce real-time demands of the video encoder by prioritizing calculations and providing an early-out mode. Furthermore, the job scheduling system allows the use of multiple GPUs in one computer system and efficient load balancing of the motion search over these GPUs. This paper focuses on the execution speed of the novel job scheduling system on both single and multi-GPU systems. Initial results show that real-time full motion search of 720p high-definition content is possible with a 32 by 32 search window running on a system with four GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/670_h264_small.png
/content/cudazone/CUDABrowser/assets/images/applications/670_h264_large.png
The International Society for Optical Engineering
2009
09
02
09/02/2009
Bart Pieters
Charles F. Hollemeersch
Peter Lambert
Paper
Video & Audio
Bart Pieters,Charles F. Hollemeersch,Peter Lambert
3f523b1a-9c95-4ffa-a003-90314020aede
Accelerating Collapsed Variational Bayesian Inference for Latent Dirichlet Allocation
In this paper, we propose an acceleration of collapsed variational Bayesian (CVB) inference for latent Dirichlet allocation (LDA) by using Nvidia CUDA compatible devices. While LDA is an efficient Bayesian multi-topic document model, it requires complicated computations for parameter estimation in comparison with simpler document models, e.g. probabilistic latent semantic indexing. Therefore, we accelerate CVB inference, an efficient deterministic inference method for LDA, with Nvidia CUDA. In the evaluation experiments, we used a set of 50,000 documents and a set of 10,000 images. We could obtain inference results comparable to sequential CVB inference.
/content/cudazone/CUDABrowser/assets/images/applications/669_lncs_small.png
/content/cudazone/CUDABrowser/assets/images/applications/669_lncs_large.png
Academia
Nagasaki University, Bunkyo-machi, Nagasaki, Japan
2009
06
26
06/26/2009
Tomonari Masada
Tsuyoshi Hamada
Yuichiro Shibata
Paper
Tomonari Masada,Tsuyoshi Hamada,Yuichiro Shibata
54d853fe-6790-40a4-84a1-d7bfadeaa979
Real-time 2D parallel windowed Fourier transform for fringe pattern analysis using GPUs
In optical interferometers, fringe projection systems, and synthetic aperture radars, fringe patterns are common outcomes and are usually degraded by unavoidable noise. The presence of noise makes phase extraction and phase unwrapping challenging. Windowed Fourier transform (WFT) based algorithms have proven effective for fringe pattern analysis in various applications. However, WFT-based algorithms are computationally expensive, which prohibits their use in real-time applications. In this paper, we propose a fast parallel WFT-based library using graphics processing units and the compute unified device architecture (CUDA). Real-time WFT-based algorithms are achieved at 4 frames per second when processing 256x256 fringe patterns. A speedup of up to 132x is obtained for WFT-based algorithms on an NVIDIA GTX295 graphics card compared with sequential C on a quad-core 2.5GHz Intel(R) Xeon(R) CPU E5420.
/content/cudazone/CUDABrowser/assets/images/applications/668_rt2d_small.png
/content/cudazone/CUDABrowser/assets/images/applications/668_rt2d_large.png
Academia
Nanyang Technological University, Singapore
2009
12
02
12/02/2009
132
Wenjing Gao
Nguyen Thi
Ho Sy Loi
Paper
Wenjing Gao,Nguyen Thi,Ho Sy Loi,mkmqian@ntu.edu.sg
1ac32591-b05b-4a98-9570-fbdd6347751c
Solve MAX-SAT problem using nVidia CUDA framework
General Purpose computing over Graphical Processing Units (GPGPUs) is a huge shift of paradigm in parallel computing that promises a dramatic increase in performance. But GPGPUs also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges and design choices involved in parallelizing a hybrid of Genetic Algorithm (GA) and Local Search (LS) to solve MAXimum SATisfiability (MAX-SAT) problem on a state-of-the-art nVidia Tesla GPU using nVidia Compute Unified Device Architecture (CUDA). MAX-SAT is a problem of practical importance and is often solved by employing metaheuristics based search methods like GAs and hybrid of GA with LS. Almost all the parallel GAs (pGAs) designed in the last two decades were designed for either clusters or MPPs. Unfortunately, very little research is done on the implementation of such algorithms over commodity graphics hardware. GAs in their simple form are not suitable for implementation over the Single Instruction Multiple Thread (SIMT) architecture of a GPU, and the same is the case with conventional LS algorithms. In this paper we explore different genetic operators that can be used for an efficient implementation of GAs over nVidia GPUs. We also design and introduce new techniques/operators for an efficient implementation of GAs and LS over such architectures. We use nVidia Tesla C1060 to perform several numerical tests and performance measurements and show that in the best case we obtain a speedup of 25x. We also discuss the effects of different optimization techniques on the overall execution time.
/content/cudazone/CUDABrowser/assets/images/applications/667_cover-medium_small.png
/content/cudazone/CUDABrowser/assets/images/applications/667_cover-medium_large.png
Academia
Hokkaido University, Sapporo, Japan
2009
10
20
10/20/2009
Asim Munawar
Mohamed Wahib
Masaharu Munetomo
Paper
Asim Munawar,Mohamed Wahib,Masaharu Munetomo,asim@uva.cims.hokudai.ac.jp,wahibium@uva.cims.hokudai.ac.jp,munetomo@iic.hokudai.ac.jp
ff558f05-2894-4e4e-ae78-2c0a67925404
Scalable computation for spatially scalable video coding using NVIDIA CUDA and multi-core CPU
Scalable video coding (SVC), an extension of H.264/MPEG4-AVC (H.264), was standardized in 2007 by the Joint Video Team (JVT). SVC provides spatial, temporal and SNR scalabilities. To achieve these scalabilities, SVC uses additional coding tools and coding modes based on H.264. The coding tools used by SVC and the variety of coding mode decisions make the coding complexity extremely high, so real-time realization of SVC is nearly impossible using software on a single-core CPU only. One possible solution for generating SVC streams in real-time is to parallelize the whole encoding process. Currently, multi-core CPUs and GPUs are two popular kinds of parallel processing architectures. Not much research has been devoted to realizing parallel SVC encoders based on the co-work of these two architectures. In this paper, a scalable computation model for spatial SVC using multi-core CPU and GPGPU through NVIDIA CUDA is proposed. On the basis of the proposed computational model, a solution to the challenging data transition problem of this CPU-GPU co-work architecture is then provided. Simulation results show that, through our work, a significant speedup in spatial SVC encoding can be achieved.
/content/cudazone/CUDABrowser/assets/images/applications/666_cover_thumb_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/666_cover_thumb_large.jpg
Academia
ACM
http://www.acm.org/publications
2009
01
01
01/01/2009
Yen-Lin Huang
Yun-Chung Shen
Ja-Ling Wu
Paper
Yen-Lin Huang,Yun-Chung Shen,Ja-Ling Wu
b972ec14-5548-49b5-8366-31d9df19eaf8
Sugarscape Cuda
Using emergent programming techniques on the GPU we have made an implementation of sugarscape to utilize the massively parallel architecture of modern GPUs. Agents within the model move optimally within their vision, which is uniformly set between 1 and 10. Multiple agents cannot occupy the same cell. The agents also interact with the sugar patches, with each agent uniformly given a metabolism in [0.1,1). The sugar patches grow at a constant rate of 0.1 per time step until they reach their maximum values, which are determined by two Gaussian functions.
/content/cudazone/CUDABrowser/assets/images/applications/663_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/663_logo_large.png
code.google.com
http://code.google.com
2009
08
21
08/21/2009
Devm
Code
Devm
ae892094-5c92-4cd2-b782-fc12cb86f174
Cuda Nash
Finding Nash equilibria for large games is a computationally difficult task. The goal of this project is to implement a simple algorithm that is well suited to being run in parallel on simple hardware. The algorithm boils down to solving a large system of differential equations until they converge within a given tolerance. We believe that the computational architecture of graphics cards is especially well suited to this type of problem.
/content/cudazone/CUDABrowser/assets/images/applications/662_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/662_defaultlogo_large.png
CUDA Developer
http://code.google.com
2009
06
03
06/03/2009
Aultman Stephen
Code
Numerics
Aultman Stephen
a58dcb4d-07ea-415c-8141-760465ce3812
Electromag with CUDA
Fun electromagnetism simulation application with CUDA GPGPU acceleration
/content/cudazone/CUDABrowser/assets/images/applications/661_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/661_defaultlogo_large.png
CUDA Developer
http://code.google.com
2009
05
08
05/08/2009
Code
e128489b-f5c9-4fb5-9e20-e6f08d8d3cd7
Hydrazine
A library of common operations needed for C++ and CUDA development
/content/cudazone/CUDABrowser/assets/images/applications/660_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/660_defaultlogo_large.png
CUDA Developer
http://code.google.com
2009
05
13
05/13/2009
Gregory
Code
Libraries
Gregory
6e7508ba-012a-4545-9c73-97e646edae15
CUDA Grayscale
This project presents a common technique for converting colored images to their grayscale representation using CUDA enabled GPUs to speed up processing.
This multi-platform implementation uses OpenCV for managing image files, while the conversion algorithm takes into consideration different weighting of the color channels for a more effective representation of the colored image.
/content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_large.png
CUDA Developer
http://code.google.com
2009
11
16
11/16/2009
Karl Phillip
Application
Code
Imaging
Karl Phillip
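The weighted color-channel conversion described in the CUDA Grayscale entry above can be sketched in a few lines of Python. This is only an illustration: the entry does not say which weights the project uses, so the ITU-R BT.601 luma coefficients below are an assumption, and the function name `to_grayscale` is hypothetical. The sketch uses plain Python lists rather than OpenCV or CUDA.

```python
# Assumed weights: ITU-R BT.601 luma coefficients (not stated in the entry).
BT601_WEIGHTS = (0.299, 0.587, 0.114)


def to_grayscale(pixels, weights=BT601_WEIGHTS):
    """Convert rows of (R, G, B) tuples into rows of 0-255 gray levels."""
    wr, wg, wb = weights
    return [
        # Each output pixel depends only on its own input pixel.
        [min(255, int(round(wr * r + wg * g + wb * b))) for (r, g, b) in row]
        for row in pixels
    ]


# A 2x2 test image: red, green, blue, and white pixels.
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (255, 255, 255)]]
gray = to_grayscale(image)
```

Because every output value is computed from a single input pixel, the inner expression maps naturally onto one GPU thread per pixel.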
d84046cb-3498-46af-a187-293a84bbad65
CUDA Ndarray
This project provides a type with an interface as similar as possible to numpy's ndarray whose storage is allocated on a GPU device.
/content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/659_defaultlogo_large.png
CUDA Developer
http://code.google.com
2009
12
18
12/18/2009
James Bergstra
Frederic Bastien
Pascal Lamblin
Code
Libraries
James Bergstra,Frederic Bastien,Pascal Lamblin
ad2ef9c8-ad67-4c51-929b-c36141afa1c7
multisvm
The scaling of serial algorithms cannot rely on the improvement of CPUs anymore. The performance of classical Support Vector Machine (SVM) implementations has reached its limit and the arrival of the multi-core era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high performance platforms to implement data parallel algorithms. This project describes how a naive implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy.
/content/cudazone/CUDABrowser/assets/images/applications/657_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/657_logo_large.png
CUDA Developer
http://code.google.com
2009
11
14
11/14/2009
Sergherr
Code
Sergherr
9b79ff8f-cc43-43bf-8143-37759f811994
gpuocelot
Ocelot is a dynamic compilation framework for heterogeneous systems, accomplishing this by providing various backend targets for CUDA programs. Ocelot currently allows CUDA programs to be executed on NVIDIA GPUs and x86-CPUs at full speed without recompilation.
/content/cudazone/CUDABrowser/assets/images/applications/656_logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/656_logo_large.jpg
CUDA Developer
http://code.google.com
2009
12
15
12/15/2009
Gregory
Arkerr
Code
Gregory,Arkerr
500a4c45-722d-4593-9b33-c78d5247013e
PHENOTYPING RODENT MODELS OF OBESITY USING MAGNETIC RESONANCE IMAGING
The emergence of dedicated, small animal imaging systems provides an excellent opportunity to study obesity using the rat and mouse models which will be critical to increasing our basic knowledge as well as deriving new treatments. MRI is well suited for quantifying fat depots (e.g., visceral, subcutaneous, hepatic, muscular) and for helping to determine the role of genetic, environmental, and therapeutic factors on lipid accumulation, metabolism, and disease. Assessment of lipid depots is important because of the linkage of visceral and ectopic depots to insulin resistance, vascular disease, etc. The importance of making reproducible imaging measurements cannot be overstated when conducting a study of many animals, and we demonstrated that ratio imaging enables reliable quantification even on a human clinical 1.5T MRI scanner. Scan-rescan variability and intra-operator variability were each reduced to a 2% coefficient of variation or less when the semi-automatic ratio image analysis was used. Receiver coil signal intensity inhomogeneity of over 200% across the field of view was flattened to less than 3% variation by ratio imaging. Using the SHR/SHROB rat model of dietary and genetic obesity, we found a novel image phenotype which showed that visceral adipose tissue depots are increased in both genetic and dietary obesity, but subcutaneous adipose tissue is uniquely linked to dietary obesity, at least in this model.
/content/cudazone/CUDABrowser/assets/images/applications/655_rat_small.png
/content/cudazone/CUDABrowser/assets/images/applications/655_rat_large.png
Academia
Department of Biomedical Engineering Case Western Reserve University
2010
01
01
01/01/2010
21
David Herbert Johnson
Paper
DAVID HERBERT JOHNSON
9ae40a45-facb-4779-9512-9bbf25f875c4
Model-driven Autotuning of Sparse Matrix-Vector Multiply on GPUs
We present a performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU). Our study consists of two parts. First, we describe several carefully hand-tuned SpMV implementations for GPUs, identifying key GPU-specific performance limitations, enhancements, and tuning opportunities. These implementations, which include variants on classical blocked compressed sparse row (BCSR) and blocked ELLPACK (BELLPACK) storage formats, match or exceed state-of-the-art implementations. For instance, our best BELLPACK implementation achieves up to 29.0 Gflop/s in single-precision and 15.7 Gflop/s in double-precision on the NVIDIA T10P multiprocessor (C1060), enhancing prior state-of-the-art unblocked implementations (Bell and Garland, 2009) by up to 1.8x and 1.5x for single and double-precision respectively.
/content/cudazone/CUDABrowser/assets/images/applications/654_threadblock_small.png
/content/cudazone/CUDABrowser/assets/images/applications/654_threadblock_large.png
Academia
Georgia Institute of Technology / Indian Institute of Technology Roorkee
2010
01
01
01/01/2010
Jee W. Choi
Amik Singh
Richard W. Vuduc
Paper
Jee W. Choi,Amik Singh,Richard W. Vuduc,jee@ece.gatech.edu,amiksuec@iitr.ernet.in,richie@cc.gatech.edu
d660b5b0-780e-42d3-9fc6-b4bb78acefde
Real-time display on Fourier domain optical coherence tomography system
Fourier domain optical coherence tomography (FD-OCT) requires resampling of spectrally resolved depth information from wavelength to wave number, and the subsequent application of the inverse Fourier transform. The display rates of OCT images are much slower than the image acquisition rates due to processing speed limitations on most computers. We demonstrate a real-time display of processed OCT images using a linear-in-wave-number (linear-k) spectrometer and a graphics processing unit (GPU). We use the linear-k spectrometer with the combination of a diffractive grating with 1200 lines/mm and an F2 equilateral prism in the 840-nm spectral region to avoid calculating the resampling process. The calculations of the fast Fourier transform (FFT) are accelerated by the GPU with many stream processors, which realizes highly parallel processing. A display rate of 27.9 frames/sec for processed images (2048 FFT size x 1000 lateral A-scans) is achieved in our OCT system using a line scan CCD camera operated at 27.9 kHz.
/content/cudazone/CUDABrowser/assets/images/applications/653_060506_1-V1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/653_060506_1-V1_large.jpg
Academia
Graduate School of Science and Engineering, Yamagata University
2009
12
28
12/28/2009
Yuuki Watanabe
Multimedia
Paper
Imaging
Life Sciences
Yuuki Watanabe,ywata@yz.yamagata-u.ac.jp
83c76f9b-4deb-4017-9e78-42e911ff01ed
muvee Reveal version 8
muvee Reveal lets you create and share personalized, professional looking home movies in a few quick steps. With automatic motion and face detection, your photos and video are synced to the beat of your favorite music.
/content/cudazone/CUDABrowser/assets/images/applications/681_160x90_small.png
/content/cudazone/CUDABrowser/assets/images/applications/681_160x90_large.png
Commercial
muvee Technologies Pte. Ltd.
http://www.muvee.com
2009
11
17
11/17/2009
8
Commercial
muvee Technologies
Application
Multimedia
Digital Content Creation
Imaging
Video & Audio
Mafrudy bin Rubani,mafrudy@muvee.com
376fa043-5c61-42ff-a8e9-9c5dbcee6c9e
Multiple Back-Propagation source code
Multiple Back-Propagation is an open source software application for training neural networks with the backpropagation and the multiple back propagation algorithms. Currently this project is hosted at http://code.google.com/p/multiplebackpropagation and http://sourceforge.net/projects/mbp/
/content/cudazone/CUDABrowser/assets/images/applications/651_mbpTop_small.png
/content/cudazone/CUDABrowser/assets/images/applications/651_mbpTop_large.png
Academia
IPG
http://dit.ipg.pt/MBP/
2009
12
11
12/11/2009
179
Open source
Noel Lopes
Application
Science
Noel Lopes,noel@ipg.pt
e3fe2d97-e009-470c-9d85-6ee65c25cd43
ClusterTech Financial Library in GPU
CLUSTERTECH Finance Library includes a BGM Interest Path Generator and a Trinomial Tree-based Options Pricing Model. In the BGM model, each forward rate is modeled by a lognormal process. The volatility vector function is also defined in our implementation. Then numerous interest-rate paths are generated by Monte Carlo simulation. The library also includes a trinomial recombining tree based options-pricing model, which allows for greater flexibility in the movement of rates or prices compared to the binomial counterpart.
/content/cudazone/CUDABrowser/assets/images/applications/650_ct-fl-ad_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/650_ct-fl-ad_large.jpg
Commercial
Cluster Technology Limited
http://www.clustertech.com
2009
11
17
11/17/2009
30
Commercial
Cluster Technology Limited
Application
Finance
Numerics
Libraries
Cluster Technology Limited,hkbd@clustertech.com
224bcdd0-6e4a-422e-bd1a-4c367e094441
ClusterTech Parallel Random Number Generator
The ClusterTech Parallel Random Number Generator is based on Mersenne Twister which has a period of 2^19937-1. It generates multiple independent streams simultaneously across a cluster of CPUs and GPUs with a jump-ahead feature to guarantee the quality of the output.
/content/cudazone/CUDABrowser/assets/images/applications/649_ct-prng-ad_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/649_ct-prng-ad_large.jpg
Commercial
Cluster Technology Limited
http://www.clustertech.com
2009
11
17
11/17/2009
30
Cluster Technology Limited
Application
Numerics
Libraries
Cluster Technology Limited,hkbd@clustertech.com
df4a0b13-0479-40f6-8405-ae810286a06b
GPU computing with Kaczmarz's and other iterative algorithms for linear systems
The graphics processing unit (GPU) is used to solve large linear systems derived from partial differential equations. The differential equations studied are strongly convection-dominated, of various sizes, and common to many fields, including computational fluid dynamics, heat transfer, and structural mechanics. The paper presents comparisons between GPU and CPU implementations of several well-known iterative methods, including Kaczmarz's, Cimmino's, component averaging, conjugate gradient normal residual (CGNR), symmetric successive overrelaxation-preconditioned conjugate gradient, and conjugate-gradient-accelerated component-averaged row projections (CARP-CG). Computations are performed with dense as well as general banded systems. The results demonstrate that our GPU implementation outperforms CPU implementations of these algorithms, as well as previously studied parallel implementations on Linux clusters and shared memory systems. While the CGNR method had begun to fall out of favor for solving such problems, for the problems studied in this paper, the CGNR method implemented on the GPU performed better than the other methods, including a cluster implementation of the CARP-CG method.
/content/cudazone/CUDABrowser/assets/images/applications/648_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/648_graph_large.png
Academia
University of Illinois Urbana-Champaign
http://www.uiuc.edu
2009
12
22
12/22/2009
10
J. Elble
N. Sahinidis
P. Vouzis
Paper
Graphics
Joseph Elble,elble@uiuc.edu
3b6e6752-173e-45e4-a22f-5e7eccae9b7f
Acceleration of a Finite-Difference WENO Scheme for Large-Scale Simulations on Many-Core Architectures
This is a highly accelerated implementation of the finite-difference weighted essentially non-oscillatory (WENO) scheme. This method is suitable for direct numerical simulations (DNS) and large eddy simulations (LES) of compressible turbulence and requires large computing resources in order to achieve high Reynolds numbers. Our implementation utilizes a multi-GPU environment.
/content/cudazone/CUDABrowser/assets/images/applications/647_rayleigh-taylor_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/647_rayleigh-taylor_large.jpg
Academia
PDS Group - University of Patras
http://pdsgroup.hpclab.ceid.upatras.gr
2009
12
14
12/14/2009
50
Konstantinos Karantasis
Paper
Computational Fluid Dynamics
Konstantinos Karantasis,karantas@ceid.upatras.gr
39f2db06-bc42-4ecf-9f6a-235d25b002e6
GPU Accelerated Pathfinding
In the past few years the programmable graphics processor (GPU) has evolved into an increasingly convincing computational resource for non-graphics applications. The GPU is especially well suited to address problem sets expressed as data parallel computation with the same program executed on many data elements concurrently. In pursuing a scalable navigation planning approach for many thousands of agents in crowded game scenes, developers became more attracted to decomposable movement algorithms that lend themselves to explicit parallelism. Pathfinding is one key computational intelligence action in games that is typified by intense search over sparse graph data structures. This paper describes an efficient GPU implementation of parallel global pathfinding using the CUDA programming environment, and demonstrates the GPU's performance scaling advantage in executing an inherently irregular and divergent algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/646_GPUAcceleratedPathfinding_small.png
/content/cudazone/CUDABrowser/assets/images/applications/646_GPUAcceleratedPathfinding_large.png
Research
NVIDIA Corporation
http://www.nvidia.com
2008
06
20
06/20/2008
Avi Bleiweiss
Presentation
Paper
Artificial Intelligence
Avi Bleiweiss,ableiweiss@nvidia.com
9b19fa0b-58e2-4937-a1bb-e6532bb7522a
Scalable Multi Agent Simulation on the GPU
We present a unique and elegant graphics hardware realization of multi agent simulation. Specifically, we adapted Velocity Obstacles, which are well suited to parallel computation on single instruction, multiple thread (SIMT) architectures. We explore hash based nearest neighbors search to considerably optimize the algorithm when mapped onto the GPU. Moreover, to alleviate inefficiencies of agent level concurrency, primarily exposed in small agent count (<32) scenarios, we exploit nested data parallelism in unrolling the inner velocity iteration, demonstrating an appreciable performance gain. Simulation of ten thousand agents created with our system runs on current hardware at a real time rate of eighteen frames per second. Our software implementation builds on NVIDIA's CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/645_aicuda_small.png
/content/cudazone/CUDABrowser/assets/images/applications/645_aicuda_large.png
Research
NVIDIA Corporation
http://www.nvidia.com
2009
11
02
11/02/2009
50
Avi Bleiweiss
Presentation
Paper
Artificial Intelligence
Avi Bleiweiss,ableiweiss@nvidia.com
25c4c0ea-21c5-449e-8a40-f8cdfa40d539
NVIDIA Nexus - Visual Studio-based GPU Development
Our new GPU developer tool, code-named Nexus, brings GPU Computing into Visual Studio 2008. Debug, profile, and analyze GPU code using standard workflow and tools. Nexus supports CUDA C, OpenCL, DirectCompute, Direct3D, and OpenGL.
/content/cudazone/CUDABrowser/assets/images/applications/644_64_small.png
/content/cudazone/CUDABrowser/assets/images/applications/644_64_large.png
Commercial
NVIDIA
http://www.nvidia.com
2009
12
16
12/16/2009
NVIDIA
Application
Multimedia
nexus,NVIDIA,cuda@nvidia.com
4fd199bf-5cce-4533-af21-5db250922ae6
Recursive APSP on the GPU
We consider the computation of shortest paths on Graphic Processing Units (GPUs). The blocked recursive elimination strategy we use is applicable to a class of algorithms (such as all-pairs shortest-paths, transitive closure, and LU decomposition without pivoting) having similar data access patterns. Using the all-pairs shortest-paths problem as an example, we uncover potential gains over this class of algorithms. The impressive computational power and memory bandwidth of the GPU make it an attractive platform to run such computationally intensive algorithms. Although improvements over CPU implementations have previously been achieved for those algorithms in terms of raw speed, the utilization of the underlying computational resources was quite low. We implemented a recursively partitioned all-pairs shortest-paths algorithm that harnesses the power of GPUs better than existing implementations. The alternate schedule of path computations allowed us to cast almost all operations into matrix-matrix multiplications on a semiring. Since matrix-matrix multiplication is highly optimized and has a high ratio of computation to communication, our implementation does not suffer from the premature saturation of bandwidth resources as iterative algorithms do. By increasing temporal locality, our implementation runs more than two orders of magnitude faster on an NVIDIA 8800 GPU than on an Opteron. Our work provides evidence that programmers should rethink algorithms instead of directly porting them to GPU.
/content/cudazone/CUDABrowser/assets/images/applications/643_apsp-timings-small_small.png
/content/cudazone/CUDABrowser/assets/images/applications/643_apsp-timings-small_large.png
Academia
UC Santa Barbara
2008
11
30
11/30/2008
480
Open source
Aydin Buluc
Paper
Code
Numerics
Science
Aydin Buluc,aydin@cs.ucsb.edu
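The Recursive APSP entry above mentions casting shortest-path work into matrix-matrix multiplications on a semiring. A minimal Python sketch of that idea follows, using the (min, +) semiring and plain repeated squaring. This is a simpler stand-in chosen for illustration, not the recursively partitioned blocked algorithm the entry describes; the function names and the sample graph are hypothetical.

```python
INF = float("inf")  # marks a missing edge


def min_plus(A, B):
    """Matrix product on the (min, +) semiring: min replaces sum, + replaces *."""
    n = len(A)
    return [
        [min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def apsp(W):
    """All-pairs shortest paths by repeated min-plus squaring.

    Assumes W has a zero diagonal and no negative cycles, so squaring a
    matrix of paths with at most m edges yields paths with at most 2m edges.
    """
    n = len(W)
    D = [row[:] for row in W]
    edges_covered = 1
    while edges_covered < n - 1:
        D = min_plus(D, D)
        edges_covered *= 2
    return D


# Hypothetical 4-vertex directed graph given as an edge-weight matrix.
W = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
dist = apsp(W)
```

Each call to `min_plus` has the same loop structure as an ordinary dense matrix product, which is why this formulation can reuse the tiling strategies of optimized matrix multiplication kernels.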
b474b1ae-94b1-4ac4-ba86-a16506460ba4
Multiphase flow in porous media
The movie shows fractional flow of oil and water in a generic porous medium (glass beads, water wet). The glass beads are visualized by a transparent material, the water is invisible and the oil phase is shown by a color encoded surface. The color represents the pressure distribution, where red is high and blue low pressure. The porous medium is resolved by 250^3 grid points. Ingrain's digital rock physics lab computes the physical properties and fluid flow characteristics of oil and gas reservoir rocks. Our technology leads the industry in measuring shales, carbonates, tight gas sands and oil sands. Ingrain uses advanced lattice Boltzmann methods to simulate multiphase flow in the rocks (porous media). The simulation engine uses a sparse data structure to represent the grid. The simulations are accelerated by using GPUs and the CUDA technology by two orders of magnitude compared to a state of the art multicore desktop computer. On a single Tesla GPU with 4GB memory we are able to simulate grids up to 800^3/600^3 for 5 % porosity and up to 500^3/400^3 for 40 % porosity for single/multi phase flow. For larger grids multiple GPUs are used in parallel.
/content/cudazone/CUDABrowser/assets/images/applications/642_movBlackLogo.0400_small.png
/content/cudazone/CUDABrowser/assets/images/applications/642_movBlackLogo.0400_large.png
Commercial
Ingrain
http://www.ingrainrocks.com
2009
12
05
12/05/2009
100
Jonas Toelke
Multimedia
Computational Fluid Dynamics
Jonas Toelke,toelke@ingrainrocks.com
c0d931f3-fe2d-42cf-aa7c-981392258c99
FastFractal256
Mandelbrot fractal render. Uses software integers at 256 bit precision run on GPU.
/content/cudazone/CUDABrowser/assets/images/applications/641_baby-mandelbrot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/641_baby-mandelbrot_large.jpg
Commercial
Imaginary Software, LLC
http://www.fastfractal.com/
2009
11
16
11/16/2009
10
Commercial
Imaginary Software, LLC
Application
Multimedia
Graphics
Numerics
Science
Imaginary Software, LLC,contact@fastfractal.com
75775576-7f1b-4338-950d-57508d71eb11
Digital Breast Tomosynthesis Reconstruction
Reconstruction of Digital Breast Tomosynthesis volumes. The CUDA version gave a minimum 25x speedup over a multi-threaded implementation on an Intel Core i7 quad-core CPU. The application is also scalable to multiple GPUs for further acceleration.
This work was done courtesy of Massachusetts General Hospital with additional support from the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems (Gordon-CenSSIS). Individual and Institutional Contributors include: Professor David Kaeli, Daniel B. Kopans M.D., Micha Moffie PhD., Richard H Moore, Diego Rivera, Dana Schaa, Juemin Zhang PhD., Brandeis University, and Dexela, Ltd.
/content/cudazone/CUDABrowser/assets/images/applications/640_tomo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/640_tomo_large.png
Research
Massachusetts General Hospital
2009
11
03
11/03/2009
85
Benjamin C Brown
Multimedia
Paper
Imaging
Life Sciences
GTX285 vs. Intel Core i7 940 quad-core 2.93 GHz.,Benjamin C Brown,bcbrown@partners.org
5a4a9940-b346-454f-926b-fc21e5e9995b
Needleman-Wunsch Sequence Alignment
The Needleman-Wunsch Sequence Alignment using CUDA
/content/cudazone/CUDABrowser/assets/images/applications/639_nw_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/639_nw_large.jpg
Academia
University of Virginia
2009
12
03
12/03/2009
8
Shuai Che
Kevin Skadron
Application
Multimedia
Code
Life Sciences
Shuai Che,Kevin Skadron,sc5nf@virginia.edu
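For readers unfamiliar with the algorithm named in the Needleman-Wunsch entry above, the following Python sketch computes the global alignment score with the standard dynamic-programming recurrence. The scoring values (match +1, mismatch -1, gap -1) and the function name `nw_score` are assumptions made for illustration; the entry does not specify the scoring scheme, and this sequential version is not the listed CUDA code.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against the empty sequence costs one gap per symbol.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


score = nw_score("GATTACA", "GCATGCU")
```

Each cell depends only on its upper, left, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other; parallel implementations commonly exploit this by sweeping the table one anti-diagonal at a time.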
984a48ef-ecb4-4649-85e3-fb42ceff5269
CUDAEASY
We present a graphics processing unit (GPU) accelerated program that solves the evolution of interacting scalar fields in an expanding universe in NVIDIA's Compute Unified Device Architecture (CUDA). In chaotic inflation models we report speedups between one and two orders of magnitude, depending on the hardware and software used, while achieving small errors in single precision. Simulations that used to take roughly one day to compute can now be done in hours and this difference is expected to increase in the future. The program has been written in the spirit of LATTICEEASY and users of the aforementioned program should find it relatively easy to start using CUDAEASY in lattice simulations.
/content/cudazone/CUDABrowser/assets/images/applications/638_logo_big_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/638_logo_big_large.jpg
Academia
University of Turku / Department of Physics and Astronomy
http://www.physics.utu.fi/en/
2009
12
01
12/01/2009
100
Open source
Jani Sainio
Application
Multimedia
Science
Jani Sainio,jani.sainio@utu.fi
e563fe70-8e81-434a-81a1-4d1ca78c77a4
TeraChem
General purpose software for quantum chemistry calculations designed specifically for NVIDIA GPUs
/content/cudazone/CUDABrowser/assets/images/applications/637_CoverArtDNANew_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/637_CoverArtDNANew_large.jpg
Commercial
PetaChem, LLC
http://www.petachem.com
2009
11
24
11/24/2009
650
Commercial
Ivan Ufimtsev
Application
Multimedia
Life Sciences
Science
Ivan Ufimtsev,i.ufimtsev@gmail.com
51876181-0577-4305-8961-455fe9f22ce9
Monte Carlo eXtreme (MCX)
Monte Carlo eXtreme, or MCX, is Monte Carlo simulation software for photon migration in 3D turbid media. It uses Graphics Processing Unit (GPU) based massively parallel computing techniques and is extremely fast compared to traditional CPU-based simulations. Using an nVidia 8800GT graphics card (14MP/112Cores), the acceleration is about 300x to 400x with over 1700 parallel threads; this ratio can be as high as 700x on a high-end GTX 295 GPU (multiply by another 2x if both GPUs on the GTX 295 are used).
/content/cudazone/CUDABrowser/assets/images/applications/636_mcx_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/636_mcx_logo_large.png
Academia
Massachusetts General Hospital, Harvard Medical School
http://nmr.mgh.harvard.edu/
2009
10
22
10/22/2009
300
Open source
Qianqian Fang
Application
Paper
Code
Imaging
3D Photon Migration,Qianqian Fang,fangqq@gmail.com
ff3b0870-5be3-48ff-b14a-1e3b54c3320f
AIRWC
Accelerated Image Registration with CUDA. Fast medical image registration using affine and B-spline transformations.
/content/cudazone/CUDABrowser/assets/images/applications/634_image002_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/634_image002_large.jpg
Academia
University of Cambridge, Dept of Physics
http://www.phy.cam.ac.uk/
2009
11
15
11/15/2009
100
Richard Ansorge
Application
Imaging
Richard Ansorge,rea1@cam.ac.uk
a5d1af40-470d-4088-b087-30a5e7a408d3
Task and Data Parallel Framework for GPU Computing
MIT Lincoln Laboratory is developing PVTOL, a high-performance, portable signal and image processing library. The goals of PVTOL are to: provide a portable framework for high-performance embedded computing; support data and task parallelism; and reduce the complexity and increase the speed of developing applications.
/content/cudazone/CUDABrowser/assets/images/applications/628_pvtol_small.png
/content/cudazone/CUDABrowser/assets/images/applications/628_pvtol_large.png
Research
MIT Lincoln Laboratory
http://www.ll.mit.edu
2009
11
12
11/12/2009
Commercial
James Brock
Multimedia
Paper
Signal Processing
James Brock,brock.j@neu.edu
c47ca00a-cfa4-4bfc-9e05-9aa325fcf26c
TMPGEnc KARMA..Plus
TMPGEnc KARMA..Plus makes it easy to take control of your ever-growing digital video library. Sort, search, classify, play, and even compare your digital video with easy-to-use tools and controls. And it supports NVIDIA CUDA technology for filter processing, decoding and H.264/AVC file output.
/content/cudazone/CUDABrowser/assets/images/applications/625_tmkp_main_quickview_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/625_tmkp_main_quickview_large.jpg
Commercial
Pegasys Inc.
http://www.pegasys-inc.com
2009
11
10
11/10/2009
9
Commercial
Zakk saito
Application
Multimedia
Video & Audio
CUDA H.264 Decode Player Manage TMPG TMPGEnc Pegasys,Zakk saito,saito@pegasys-inc.com
f1d52d6a-a875-4d32-90ba-c1c23aa4f6a0
Mersenne Twister for Graphic Processors (MTGP)
MTGP is a new variant of Mersenne Twister (MT) introduced by Mutsuo Saito and Makoto Matsumoto in 2009. MTGP is designed to take advantage of some features of graphics processors, such as parallel execution and high-speed constant reference. It supports 32-bit and 64-bit integers, as well as single and double precision floating point, as output. The periods of the generated sequences are 2^11213-1, 2^23209-1 and 2^44497-1 for the 32-bit version, and 2^23209-1, 2^44497-1 and 2^110503-1 for the 64-bit version. It supports 128 parameter sets for each period; in other words, it can generate 128 independent pseudorandom number sequences for each period. We are now developing Dynamic Creator for MTGP, which generates more parameter sets.
/content/cudazone/CUDABrowser/assets/images/applications/624_mtgp_small.png
/content/cudazone/CUDABrowser/assets/images/applications/624_mtgp_large.png
Academia
Department of Mathematics, Hiroshima University, Japan
2009
11
17
11/17/2009
Open source
Mutsuo Saito
Makoto Matsumoto
Paper
Code
Finance
Numerics
Libraries
Science
Mutsuo Saito,Makoto Matsumoto,saito@math.sci.hiroshima-u.ac.jp
5a730964-d49a-4305-b5a8-3c5d75ecf73b
Eudyptula
Eudyptula is a portable graphics engine that provides advanced support for NVIDIA's CUDA tools. Its core purpose is to be used in the development of scientific applications.
/content/cudazone/CUDABrowser/assets/images/applications/622_eudyptula_small.png
/content/cudazone/CUDABrowser/assets/images/applications/622_eudyptula_large.png
OpenSource
2008
06
25
06/25/2008
Georgios Paraskevas
Application
Numerics
Science
Georgios Paraskevas
9ca281be-34d8-4b10-9f7c-cd1853ad715c
High performance sequence alignment
A fast Smith-Waterman algorithm, implemented on CUDA
/content/cudazone/CUDABrowser/assets/images/applications/620_protein_small.png
/content/cudazone/CUDABrowser/assets/images/applications/620_protein_large.png
Research
OpenSource
2008
09
19
09/19/2008
Vahid Noormofidi
Code
Life Sciences
Vahid Noormofidi
082d85de-353e-4a4d-9613-2513309d4b09
aeth.drive
A fast, parallel, versatile QED modelling framework. Uses Geometric Calculus and CUDA. The algorithm supports complex phenomena including turbulence, quantum effects, and relativistic gravitational precession.
/content/cudazone/CUDABrowser/assets/images/applications/619_aeth_small.jpeg
/content/cudazone/CUDABrowser/assets/images/applications/619_aeth_large.jpeg
Research
OpenSource
2008
11
15
11/15/2008
Kevin Daley
Code
Numerics
Science
Kevin Daley
91df274b-6c8d-470a-956d-8e6ff1d8c053
jacuzzi
This project aims to provide Java bindings to the CUDA numeric environment. CUDA is an extension to the C/C++ programming language by NVIDIA.
/content/cudazone/CUDABrowser/assets/images/applications/617_jacuzzi_small.png
/content/cudazone/CUDABrowser/assets/images/applications/617_jacuzzi_large.png
Research
OpenSource
2009
03
05
03/05/2009
Alexander Heusel
Code
Numerics
Alexander Heusel
551bb282-5e25-4ff5-92fc-a0fc675d32bc
cuda cagen
CUDA-based rule 30 cellular automaton generator for nVidia GPUs
/content/cudazone/CUDABrowser/assets/images/applications/616_CellularAutomata_small.png
/content/cudazone/CUDABrowser/assets/images/applications/616_CellularAutomata_large.png
Research
OpenSource
2008
09
17
09/17/2008
Yuri Parfenov
Code
Numerics
Yuri Parfenov
60d005b8-e3c7-47a5-8fec-ab8aef9f2031
Fast parallel Particle-To-Grid interpolation for plasma PIC simulations on the GPU
Particle-in-Cell (PIC) methods have been widely used for plasma physics simulations in the past three decades. To ensure an acceptable level of statistical accuracy relatively large numbers of particles are needed. State-of-the-art Graphics Processing Units (GPUs), with their high memory bandwidth, hundreds of SPMD processors, and half-a-teraflop performance potential, offer a viable alternative to distributed memory parallel computers for running medium-scale PIC plasma simulations on inexpensive commodity hardware. In this paper, we present an overview of a typical plasma PIC code and discuss its GPU implementation. In particular we focus on fast algorithms for the performance bottleneck operation of particle-to-grid interpolation.
/content/cudazone/CUDABrowser/assets/images/applications/615_ptg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/615_ptg_large.png
Academia
University of Maryland
http://www.umd.edu/
2008
10
01
10/01/2008
20
George Stantchev
Paper
Science
George Stantchev,gogo@umd.edu
276b1bef-214e-4528-85e7-c08792f09988
cudacluster
The CUDA Cluster allows you to organize a cluster of CUDA-enabled Peer-To-Peer nodes, allowing for execution of tasks with extreme performance, by harnessing the combined power of multiple such GPU hosts. Sample jobs are provided. C#.Net/Mono with C.
/content/cudazone/CUDABrowser/assets/images/applications/614_cudacluster_small.png
/content/cudazone/CUDABrowser/assets/images/applications/614_cudacluster_large.png
Research
OpenSource
2008
08
06
08/06/2008
Nikolaos Tountas
Application
Numerics
Nikolaos Tountas
2be843df-918d-4f4f-94ec-6c1b99e58760
MP3 Encoder
MP3 encoder that runs on CUDA compatible hardware.
/content/cudazone/CUDABrowser/assets/images/applications/613_cudamp3_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/613_cudamp3_large.jpg
OpenSource
2008
03
19
03/19/2008
Research
biggestpos
Application
Video & Audio
Numerics
biggestpos
9a4aea49-e96f-487a-b6e3-ab50c134a049
cesql
Database server based on NVIDIA CUDA technology. CUDA makes it possible to use the GPU and its performance for parallel data computing. A classic SQL server uses only about 15 GFlops, instead of the more than 500 GFlops which could be used by cesql.
/content/cudazone/CUDABrowser/assets/images/applications/612_cesql_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/612_cesql_large.jpg
Research
OpenSource
2008
06
08
06/08/2008
Arash Mahini
Application
Numerics
Arash Mahini,Arash_Mahini@users.sourceforge.net
436a1f19-e066-438d-9769-afd6b612b52e
cehttp
Web server based on NVIDIA CUDA technology. CUDA makes it possible to use the GPU and its performance for parallel data computing. A classic web server uses only about 15 GFlops, instead of the more than 500 GFlops which could be used by cehttp.
/content/cudazone/CUDABrowser/assets/images/applications/611_cehttp_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/611_cehttp_large.jpg
Research
OpenSource
2008
06
08
06/08/2008
Arash Mahini
Application
Arash Mahini,Arash_Mahini@users.sourceforge.net
faba717d-f830-457b-94a4-a8ca1d709890
The CUDA Files
Implementations of various algorithms using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/610_thecudafiles_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/610_thecudafiles_large.jpg
Research
OpenSource
2008
01
08
01/08/2008
sashang
Code
Numerics
sashang,sashang@users.sourceforge.net
8ffabfbd-cad9-4fa1-81ee-f61d4bc4cc76
FreeSWITCH-CUDA
The goal of this project is to produce and maintain a branch of the FreeSWITCH telephony platform that uses CUDA (NVIDIA's GPGPU toolkit) to offload CPU-intensive transcoding tasks to the (NVIDIA) GPU.
/content/cudazone/CUDABrowser/assets/images/applications/609_freeswitch_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/609_freeswitch_large.jpg
Academia
OpenSource
2009
04
01
04/01/2009
Zac Wolfe
Code
Numerics
Zac Wolfe,Zac_Wolfe@users.sourceforge.net
9b5c77ca-f014-4173-83cd-3bc3da09039b
tokaspt
The Once Known as SmallPT is a cheap, editable, realtime derivation of smallpt (http://kevinbeason.com/smallpt/). By way of the marketing department, some outrageously insignificant numbers: on a Quadro FX 5800, on the default scene at default resolution and configuration, 768x512x(2x2)x118fps = 185.6M 4-bounce rays are traced per second (alternatively, a maximum of 742.4M bounces are generated). Requires CUDA 2.1 to compile and run.
/content/cudazone/CUDABrowser/assets/images/applications/608_img_ui_bloated_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/608_img_ui_bloated_large.jpg
Research
http://ompf.org
http://ompf.org
2009
01
25
01/25/2009
Thierry Berger-Perrin
Application
Code
Graphics
Thierry Berger-Perrin,tbptbp@gmail.com
e143112b-a0c0-4f45-8360-6afe7687f68e
A framework for efficient and scalable execution of domain-specific templates on GPUs
Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts - processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of off-loaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, that uses lower level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7 - 7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6GB and 17GB, respectively, on GPU platforms with only 768MB and 1.5GB of memory.
/content/cudazone/CUDABrowser/assets/images/applications/607_ipdp_small.png
/content/cudazone/CUDABrowser/assets/images/applications/607_ipdp_large.png
Commercial
NEC Labs, Berkeley, Purdue
2009
05
01
05/01/2009
8
Narayanan Sundaram
Anand Raghunathan
Srimat T. Chakradhar
Paper
Imaging
Medical Imaging
machine learning
edge detection, convolution neural network, out-of-core,Narayanan Sundaram,Anand Raghunathan,Srimat T. Chakradhar,narayans@eecs.berkeley.edu
d45f95f7-772b-41f6-a00d-4cb40e53e785
HyperNEAT4CUDA
This is a simple C# implementation of HyperNEAT on NVIDIA's Compute Unified Device Architecture (CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/605_hyperneat_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/605_hyperneat_large.jpg
Research
OpenSource
2009
05
19
05/19/2009
K A Lloyd
Code
Numerics
K A Lloyd
727f6e8e-1cc9-4afc-9d6f-3329a569a712
Smoke rendering demo
This application renders a density field of float values. In this particular demo it is a smoke density field, but it might as well be other sorts of data such as fog, fluids or calculations. The density field is visualized using a ray marching technique, and the background is rendered by ray tracing a kd-tree.
/content/cudazone/CUDABrowser/assets/images/applications/604_smoke_sreenshot1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/604_smoke_sreenshot1_large.jpg
Research
Alexandra Instituttet
http://www.alexandra.dk/index.htm
2009
05
14
05/14/2009
Peter Trier
Application
Multimedia
Graphics
Science
Smoke rendering, ray tracing,Peter Trier,peter.trier@alexandra.dk
72067ded-99f3-4176-96ad-9f1551b12c41
CUJ2K - JPEG2000 Encoder
CUJ2K is a fast encoder for the new image compression standard JPEG2000, which is an improvement on JPEG that provides better compression ratios and also supports lossless compression, along with many other features. JPEG2000 is very computation-intensive and therefore benefits greatly from CUDA acceleration. CUJ2K uses streaming to accelerate batch image compression. The program provides command-line, .NET GUI and library interfaces to convert BMP to JPEG2000. It also supports creation of MJ2 videos.
/content/cudazone/CUDABrowser/assets/images/applications/603_banner_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/603_banner_large.gif
Academia
University of Stuttgart, IPVS
http://www.ipvs.uni-stuttgart.de/
2009
09
20
09/20/2009
4
Open Source
Norbert Fuerst
Armin Weiss
Simon Papandreou
Martin Heide
Ana Balevic
Application
Paper
Code
Graphics
Imaging
Medical Imaging
Libraries
Video & Audio
JPEG2000, image compression, encoder, codec, JPEG, CUJ2K, image processing, lossless, lossy,Norbert Fuerst,Armin Weiss,Simon Papandreou, Martin Heide, Ana Balevic,cuj2k.project@googlemail.com
64528049-540a-4d7f-9cc0-2d4a2ccad4f0
Parallel Multiclass classification using SVM on GPUs
The scaling of serial algorithms cannot rely on the improvement of CPUs anymore. The performance of classical Support Vector Machine (SVM) implementations has reached its limit, and the arrival of the multi-core era requires these algorithms to adapt to a new parallel scenario. Graphics Processing Units (GPUs) have arisen as high-performance platforms for implementing data-parallel algorithms. This paper describes how a native implementation of a multiclass classifier based on SVMs can map its inherent degrees of parallelism to the GPU programming model and efficiently use its computational throughput. Empirical results show that the training and classification time of the algorithm can be reduced by an order of magnitude compared to a classical solver, LIBSVM, while guaranteeing the same accuracy.
/content/cudazone/CUDABrowser/assets/images/applications/602_multisvm_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/602_multisvm_large.gif
Academia
MIT
2008
12
31
12/31/2008
112
Sergio Herrero-Lopez
Code
Numerics
Sergio Herrero-Lopez,sherrero@mit.edu
f7874e4b-ba49-44f9-b736-6a3341519f41
Fast pattern classification of ventricular arrhythmias using graphics processing units
Graphics Processing Units (GPUs) can provide remarkable performance gains when compared to CPUs for computationally intensive applications. In the biomedical area, most previous studies have focused on using Neural Networks (NNs) for pattern recognition of biomedical signals. However, long training times prevent them from being used in real time. This is critical for the fast detection of Ventricular Arrhythmias (VAs), which may cause cardiac arrest and sudden death. In this paper, we present a parallel implementation of the Back-Propagation (BP) and the Multiple Back-Propagation (MBP) algorithms, which allows significant training speedups. In our proposal, we explicitly specify data-parallel computations by defining special functions (kernels); therefore, we can use a fast evaluation strategy to reduce the computational cost without wasting memory resources. The performance of the pattern classification implementation is compared against other reported algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/600_mbpTop_small.png
/content/cudazone/CUDABrowser/assets/images/applications/600_mbpTop_large.png
Academia
IPG
http://www.ipg.pt
2009
11
09
11/09/2009
53
Noel Lopes
Application
Paper
medicine
Neural Networks,Noel Lopes,noel@ipg.pt
c8a33001-387c-474f-a477-63571429ab6f
Heart Wall Tracking
Tracking of mouse heart walls through a series of ultrasound images.
/content/cudazone/CUDABrowser/assets/images/applications/599_heartwall_small.png
/content/cudazone/CUDABrowser/assets/images/applications/599_heartwall_large.png
Academia
University of Virginia
http://www.virginia.edu
2009
11
05
11/05/2009
15
Open source
Lukasz G. Szafaryn
Application
Multimedia
Code
Medical Imaging
Image Processing, Feature Detection, Ultrasound,Lukasz G. Szafaryn,lgs9a@virginia.edu
d29736de-ffee-4b0a-b7ec-8d041259c195
Towards a multi-GPU solver for the three-dimensional two-phase incompressible Navier-Stokes equations
We have ported parts of our parallel level-set based two-phase solver for the three-dimensional Navier-Stokes equations to the GPU. To our knowledge, this is the first time that a two-phase fluid solver profits from the performance boost of several GPUs. A multi-GPU double-precision solver for the pressure Poisson equation, based on the Jacobi preconditioned conjugate gradient method, was implemented using CUDA and MPI. With it, we obtain a speedup factor of 31.1 for the Poisson solver on four GPUs of our NVIDIA Tesla S1070, compared to a single CPU. Consequently, our overall fluid solver shows a speedup factor of 16.6.
/content/cudazone/CUDABrowser/assets/images/applications/598_logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/598_logo_large.jpg
Academia
Institute for Numerical Simulation - University of Bonn, Germany
http://www.ins.uni-bonn.de
2009
09
30
09/30/2009
16
Peter Zaspel
Paper
Computational Fluid Dynamics
Numerics
Science
CFD, multi-GPU, Navier-Stokes, multi-phase,Peter Zaspel,zaspel@ins.uni-bonn.de
5bd7b280-5a27-49e5-be83-c95099ac3a3c
String Matching on a Multicore GPU Using CUDA
Graphics Processing Units (GPUs) have evolved over the past few years from dedicated graphics rendering devices to powerful parallel processors, outperforming traditional Central Processing Units (CPUs) in many areas of scientific computing. The use of GPUs as processing elements was very limited until recently, when the concept of General-Purpose computing on Graphics Processing Units (GPGPU) was introduced. GPGPU made it possible to exploit the processing power and the memory bandwidth of GPUs through APIs that hide the GPU hardware from programmers. This paper presents experimental results on the parallel processing of some well-known on-line string matching algorithms using one such GPU abstraction API, the Compute Unified Device Architecture (CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/597_cuda1o_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/597_cuda1o_large.jpg
Academia
University of Macedonia
http://www.uom.gr
2009
09
10
09/10/2009
24
C. S. Kouzinopoulos
Paper
String matching
string matching, algorithms, CUDA, GPGPU, parallel,C. S. Kouzinopoulos,ckouz@uom.gr
c3242f2b-7ede-43d1-87b7-c462eae24c94
Fast Tridiagonal Solvers on the GPU
We study the performance of three parallel algorithms and their hybrid variants for solving tridiagonal linear systems on a GPU: cyclic reduction (CR), parallel cyclic reduction (PCR) and recursive doubling (RD). We develop an approach to measure, analyze, and optimize the performance of GPU programs in terms of memory access, computation, and control overhead. We find that CR enjoys linear algorithm complexity but suffers from more algorithmic steps and bank conflicts, while PCR and RD have fewer algorithmic steps but do more work each step. To combine the benefits of the basic algorithms, we propose hybrid CR+PCR and CR+RD algorithms, which improve the performance of PCR, RD and CR by 21%, 31% and 61% respectively. Our GPU solvers achieve up to a 28x speedup over a sequential LAPACK solver, and a 12x speedup over a multi-threaded CPU solver.
/content/cudazone/CUDABrowser/assets/images/applications/596_idav_small.png
/content/cudazone/CUDABrowser/assets/images/applications/596_idav_large.png
Academia
University of California, Davis
2009
10
28
10/28/2009
12
Yao Zhang
Application
Paper
Numerics
Yao Zhang,yaozhang@ucdavis.edu
0facea85-946d-47ef-93fd-12b5ae74b4b6
Accelerating Geo-Science and Engineering System Simulations on Graphics Hardware
This paper discusses GPU implementations of three example applications from computational fluid dynamics, seismic wave propagation, and rock magnetism. These candidate applications involve important numerical modeling techniques, widely employed in physical system simulations, that are themselves examples of distinct computing classes identified as fundamental to scientific and engineering computing. The presented numerical methods (and respective computing classes they belong to) are: (1) a lattice-Boltzmann code for geofluid dynamics (structured grid class); (2) a spectral-finite-element code for seismic wave propagation simulations (sparse linear algebra class); and (3) a least-squares minimization code for interpreting magnetic force microscopy data (dense linear algebra class). Significant performance increases are seen in all three applications, demonstrating the power of GPU implementations for these types of simulations and their associated computing classes.
/content/cudazone/CUDABrowser/assets/images/applications/595_stochastic_small.png
/content/cudazone/CUDABrowser/assets/images/applications/595_stochastic_large.png
Academia
University of Minnesota
2009
10
25
10/25/2009
30
Stuart D.C. Walsh
Paper
Computational Fluid Dynamics
Imaging
Science
Stuart D.C. Walsh,sdcwalsh@umn.edu
786fec9c-472d-4f0e-9985-42ad2050e358
Sailfish: An Open Source fluid simulation package using the Lattice-Boltzmann method
Sailfish is a general purpose fluid dynamics solver optimized for modern multicore processors, especially Graphics Processing Units (GPUs). The solver is based on the Lattice Boltzmann Method and works for both 2D and 3D fluids. Its performance peaks at 950MLUPS with the D2Q9 grid and 750MLUPS with D3Q19 (using CUDA on a single GTX280 video card). The design of Sailfish tries to reconcile ease of use and flexibility with performance. Python, with its powerful modules: sympy (for automatic code generation), numpy, pygame, tvtk etc. is used as the main language on the host (for I/O, visualization and user interaction), while the actual computations are performed on the GPU using CUDA or OpenCL.
/content/cudazone/CUDABrowser/assets/images/applications/594_sailfish_small.png
/content/cudazone/CUDABrowser/assets/images/applications/594_sailfish_large.png
Academia
Institute of Physics, University of Silesia
2009
04
17
04/17/2009
100
Open Source
M. Januszewski
M. Kostur
Multimedia
Code
Computational Fluid Dynamics
M. Januszewski,M. Kostur,mjanusz@us.edu.pl
111d3757-3e16-4600-bf47-437a832bae86
GPU-SPHysics
A GPU-based Smoothed Particle Hydrodynamics model for free surface flows
/content/cudazone/CUDABrowser/assets/images/applications/593_boreinboxwhite_small.png
/content/cudazone/CUDABrowser/assets/images/applications/593_boreinboxwhite_large.png
Academia
Istituto Nazionale di Geofisica e Vulcanologia
2008
12
31
12/31/2008
23
Alexis Herault
Paper
Computational Fluid Dynamics
dea0e214-213a-4557-9ef4-1e9d5d6f80c9
Evaluating Multi-Core Platforms for HPC Data-Intensive Kernels
We present an evaluation of three platform types, namely NVIDIA GPUs, the STI Cell/B.E., and generic multi-core CPUs on convolutional resampling (aka gridding), which is an irregular, data-intensive application from radio astronomy. We evaluate these platforms in terms of performance, programming effort and cost. Although we do not select a clear winner, we do provide a list of guidelines to assist in platform choice and development of similar data-intensive applications.
/content/cudazone/CUDABrowser/assets/images/applications/592_gridding_fig_small.png
/content/cudazone/CUDABrowser/assets/images/applications/592_gridding_fig_large.png
Academia
Delft University of Technology
http://www.tudelft.nl/
2009
05
18
05/18/2009
Alexander S. van Amesfoort
Paper
Imaging
Numerics
Science
Signal Processing
data-intensive gridding astronomy,Alexander S. van Amesfoort,a.s.vanamesfoort@tudelft.nl
02285ada-66ce-4cd5-8809-e459372d9fb8
An efficient GPU implementation for large scale individual-based simulation of collective behavior
In this work we describe a GPU implementation of an individual-based model for fish schooling. In this model each fish aligns its position and orientation with an appropriate average of its neighbors' positions and orientations. This carries a very high computational cost in the so-called nearest neighbors search. By leveraging the GPU processing power and the new programming model called CUDA, we implement an efficient framework which makes it possible to simulate the collective motion of high-density groups of individuals. In particular, we present as a case study a simulation of the motion of millions of fish. We describe our implementation and present extensive experiments which demonstrate its effectiveness.
/content/cudazone/CUDABrowser/assets/images/applications/591_HiBi09_small.png
/content/cudazone/CUDABrowser/assets/images/applications/591_HiBi09_large.png
Academia
Universita di Salerno
2009
10
16
10/16/2009
Ugo Erra
Bernardino Frola
Vittorio Scarano
Iain Couzin
Application
Multimedia
Paper
Life Sciences
Ugo Erra,ugo.erra@unibas.it
a412a716-04f1-4cf9-a389-8a51d3ea7680
OpenCurrent
OpenCurrent is an open source C++ library for solving Partial Differential Equations (PDEs) over regular grids using the CUDA platform from NVIDIA. It breaks down a PDE into 3 basic objects: Grids, Solvers, and Equations. Grid data structures efficiently implement regular 1D, 2D, and 3D arrays in both double and single precision. Grids support operations like computing linear combinations, managing host-device memory transfers, interpolating values at non-grid points, and performing array-wide reductions. Solvers use these data structures to calculate terms arising from discretizations of PDEs, such as finite-difference based advection and diffusion schemes, and a multigrid solver for Poisson equations. These computational building blocks can be assembled into complete Equation objects that solve time-dependent PDEs. One such Equation solver is an incompressible Navier-Stokes solver that uses a second-order Boussinesq model. This equation solver is fully validated, and has been used to study Rayleigh-Benard convection under a variety of different regimes (citation). Benchmarks show it to perform about 8 times faster than an equivalent Fortran code running on an 8-core Xeon.
/content/cudazone/CUDABrowser/assets/images/applications/590_opencurrent_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/590_opencurrent_large.jpg
Commercial
NVIDIA
http://www.nvidia.com
2009
09
25
09/25/2009
Open Source
Jonathan Cohen
Code
libraries
Jonathan Cohen
21a1b481-5773-403d-8644-730c1c5f1d58
Correlating Radio Astronomy Signals
A recent development in radio astronomy is to replace traditional dishes with many small antennas. The signals are combined to form one large, virtual telescope. The enormous data streams are cross-correlated to filter out noise. This is especially challenging, since the computational demands grow quadratically with the number of data streams. Moreover, the correlator is not only computationally intensive, but also very I/O intensive. The LOFAR telescope, for instance, will produce over 100 terabytes per day. The future SKA telescope will even require on the order of exaflops, and petabits/s of I/O. A recent trend is to correlate in software instead of dedicated hardware, to increase flexibility and to reduce development effort.
/content/cudazone/CUDABrowser/assets/images/applications/589_LBA-field_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/589_LBA-field_large.jpg
Research
Astron
http://www.astron.nl
2009
10
16
10/16/2009
6.3
Open source
Rob van Nieuwpoort
Paper
Code
Application
Science
Signal Processing
Rob van Nieuwpoort,nieuwpoort@astron.nl
fb82b05f-0449-485d-8779-b53d28646189
TUNED AND ASYNCHRONOUS STENCIL KERNELS FOR CPU/GPU SYSTEMS
We describe heterogeneous multi-CPU and multi-GPU implementations of Jacobi's iterative method for the 2-D Poisson equation on a structured grid, in both single and double precision. Properly tuned, our best implementation achieves 98% of the empirical streaming GPU bandwidth (66% of peak) on an NVIDIA C1060. Motivated to find a still faster implementation, we further consider wildly asynchronous implementations that can reduce or even eliminate the synchronization bottleneck between iterations. In these versions, which are based on the principle of chaotic relaxation (Chazan and Miranker, 1969), we simply remove or delay synchronization between iterations, thereby potentially trading off more flops (via more iterations to converge) for a higher degree of asynchronous parallelism. Our relaxed-synchronization implementations on a GPU can be 1.2-2.5x faster than our best synchronized GPU implementation while achieving the same accuracy. Looking forward, this result suggests research on similarly fast-and-loose algorithms in the coming era of increasingly massive concurrency and relatively high synchronization or communication costs.
/content/cudazone/CUDABrowser/assets/images/applications/588_tuned_small.png
/content/cudazone/CUDABrowser/assets/images/applications/588_tuned_large.png
Academia
Georgia Institute of Technology
2009
05
01
05/01/2009
Sundaresan Venkatasubramanian
Paper
Sundaresan Venkatasubramanian
f72dcd39-833c-4760-8d04-87e67f4afa2b
Hybrid GPU-Based Single- and Double-Bounce SAR Simulation
A new hybrid graphics-processing-unit (GPU)-based real-time synthetic aperture radar (SAR) simulation system is presented. Previous real-time SAR simulators only supported single-bounce simulation in real time. The new hybrid system uses the rasterization approach for real-time single-bounce simulation and a new image-based GPU ray-tracing approach for monostatic SAR double-bounce simulation. This approach provides fast simulation results even while simulating complex and extended scenes.
/content/cudazone/CUDABrowser/assets/images/applications/587_hybrid_small.png
/content/cudazone/CUDABrowser/assets/images/applications/587_hybrid_large.png
Academia
LIESMARS, Wuhan University
2009
10
01
10/01/2009
Timo Balz
Uwe Stilla
Paper
Science
Remote Sensing
Radar, SAR, Remote Sensing, Simulation, Ray-Tracing,Timo Balz,Uwe Stilla,timobalz@gmail.com
1026c7d5-f1c2-4709-800e-fad3add12e5a
A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures
A proposal to extend OpenMP with the concept of multiple architectures, so that it takes care of separating the different pieces, compiling them appropriately, and offloading them. The user remains responsible for identifying the parts worth offloading and for optimizing them for the target.
/content/cudazone/CUDABrowser/assets/images/applications/586_openmp_small.png
/content/cudazone/CUDABrowser/assets/images/applications/586_openmp_large.png
Academia
Universitat Politècnica de Catalunya
2009
06
03
06/03/2009
E. Ayguade
Presentation
Libraries
E. Ayguade
13072d4f-4cdc-488e-ac1b-5d42f73c2528
AntiPlanet2
AntiPlanet2 is a first-person 3D shooter game set in a fantastic extraterrestrial world built of spheres and shadows. AntiPlanet2 uses a ray-tracing renderer for visualization, which runs on CUDA. The 3D engine works at any resolution in real time and supports transparency and bi-cubic textures.
/content/cudazone/CUDABrowser/assets/images/applications/585_fallenflowers_small.png
/content/cudazone/CUDABrowser/assets/images/applications/585_fallenflowers_large.png
Commercial
virtualray.ru
http://www.virtualray.ru
2009
10
06
10/06/2009
Commercial
Lev Dymchenko
Application
Multimedia
Graphics
computer game
3d shooter antiplanet first person action game real time ray tracing spherical computer art,Lev Dymchenko,levdy@virtualray.ru
3859efe4-0773-4cc5-be54-9fc3d338a0ce
cuco
A CUDA-based GPU version of the cosmological simulation code Gadget.
/content/cudazone/CUDABrowser/assets/images/applications/584_cuco_small.png
/content/cudazone/CUDABrowser/assets/images/applications/584_cuco_large.png
Partner Group of MPA
2009
08
25
08/25/2009
Lei Liu
Code
Science
Lei Liu
0e8e658b-58f3-4627-a6ca-1c64e79c3416
Data Monster
Database processing is a cornerstone of computing, and it is a market that last year generated approximately US $27 billion, according to technology analysis firm Forrester Research, in Cambridge, Mass. The firm projects that this number, which includes new database licenses, technical support, and consulting, will grow to $32 billion by 2013. Every time you bid on an eBay auction, search for a movie on Netflix, look for a Kindle title on Amazon, or do a Google search, massive database applications spring into action, delving into huge quantities of data spread across tens of thousands of machines.
/content/cudazone/CUDABrowser/assets/images/applications/583_datamonster_small.png
/content/cudazone/CUDABrowser/assets/images/applications/583_datamonster_large.png
ieee spectrum
http://spectrum.ieee.org
2009
09
01
09/01/2009
Andrea Di Blas
Tim Kaldewey
Paper
Andrea Di Blas,Tim Kaldewey
4d61ce47-f1c6-472f-81d3-595fd0ab0883
Citrix HDX 3D for Professional Graphics
Citrix HDX 3D for Professional Graphics can now deliver Windows physical desktops and applications to the most advanced professional graphics power users through Citrix XenDesktop technology. XenDesktop with HDX 3D provides the best performance possible over the wide area network (WAN). Over a local area network (LAN), HDX 3D consumes 10x less bandwidth than alternatives while still providing a high-definition user experience.
/content/cudazone/CUDABrowser/assets/images/applications/582_citrix_small.png
/content/cudazone/CUDABrowser/assets/images/applications/582_citrix_large.png
Commercial
Citrix
http://www.citrix.com
2009
10
15
10/15/2009
Commercial
Citrix
Application
Multimedia
Paper
Video & Audio
Citrix
a8985a03-4860-49c7-92ce-f1237031cc81
GPU-Accelerated TF-IDF
TF-IDF (term-frequency/inverse-document frequency) is one of the fundamental concepts used in information retrieval and text mining.
/content/cudazone/CUDABrowser/assets/images/applications/581_atomic_method_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/581_atomic_method_large.jpg
Academia
North Carolina State University
2009
03
10
03/10/2009
9
Yongpeng Zhang
Frank Mueller
Xiaohui Cui and Thomas Potok
Paper
Text Mining
Yongpeng Zhang,Frank Mueller, Xiaohui Cui and Thomas Potok,zhang.yongpeng@gmail.com
27bd9c43-0986-477e-aa4d-1dcd0493090c
High-Quality Rendering of Varying Isosurfaces
Smooth trivariate splines on uniform tetrahedral partitions are well suited for high-quality visualization of isosurfaces from scalar volumetric data. We propose a novel rendering approach based on spline patches with low total degree, for which ray-isosurface intersections are computed using efficient root-finding algorithms. Smoothly varying surface normals are directly extracted from the underlying spline representation. Our approach uses a combined CUDA and graphics pipeline and yields two key advantages over previous work. First, we can interactively vary the isovalues since all required processing steps are performed on the GPU. Second, we employ instancing in order to reduce shader complexity and to minimize overall memory usage. In particular, this allows the spline coefficients to be computed on the fly in real time on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/580_C1isosurfaces-medical_small.png
/content/cudazone/CUDABrowser/assets/images/applications/580_C1isosurfaces-medical_large.png
Academia
TU Darmstadt
http://www.tu-darmstadt.de/
2009
10
07
10/07/2009
68
T. Kalbe
T. Koch
M. Goesele
Multimedia
Paper
Graphics
Medical Imaging
Raycasting trivariate Splines isosurface volumerendering,T. Kalbe,T. Koch,M. Goesele,thomasdidikoch@gmx.net
0fa28489-5f19-4370-9a78-1d90711534a6
Realtime Dense Stereo Matching with Dynamic Programming in CUDA
Real-time depth extraction from stereo images is an important process in computer vision. This paper proposes a new implementation of the dynamic programming algorithm to calculate dense depth maps using the CUDA architecture, achieving real-time performance with consumer graphics cards. We compare the running time of the algorithm against a CPU implementation and demonstrate the scalability of the algorithm by testing it on different graphics cards.
/content/cudazone/CUDABrowser/assets/images/applications/579_DP_algorithm_CUDA_TV_2009_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/579_DP_algorithm_CUDA_TV_2009_large.jpg
Academia
CAD/CAM/CAE Lab. EAFIT University
http://www1.eafit.edu.co/cadcamcae/
2009
09
09
09/09/2009
10
John Congote
Paper
Graphics
Imaging
Video & Audio
John Congote,jcongote@eafit.edu.co
b7498c1e-fb46-492b-8ae6-fe6a4ccea50d
Improving the Open64 Backend for GPUs
NVIDIA uses Open64 as a front-end tool to compile CUDA programs into an intermediate language called PTX. PTX can be viewed as an assembly language targeting a virtual machine and is an abstract layer between the application and the final hardware-dependent machine code. Our research explores the relationship between register pressure in the PTX code and the final machine code. We also implemented two optimizations in Open64 to help reduce register pressure and increase thread concurrency.
/content/cudazone/CUDABrowser/assets/images/applications/578_open64_small.png
/content/cudazone/CUDABrowser/assets/images/applications/578_open64_large.png
Academia
Northeastern University
2009
10
01
10/01/2009
Rodrigo Dominguez
Presentation
Programming Tools
Rodrigo Dominguez,rdomingu@ece.neu.edu
92df54f0-8995-4ad9-8a57-0e9ddfd14842
Computer Generated Hologram on GPU - Simple color electroholography reconstruction system -
We have constructed a simple color electroholography system that has excellent cost performance. It uses a graphics processing unit (GPU) and a liquid crystal display (LCD) projector. The structure of the GPU is suitable for calculating computer-generated holograms (CGHs). The calculation speed of the GPU is approximately 1,500 times that of a central processing unit (Intel Core 2 Duo, 2.66 GHz, using one core for the calculation). The LCD projector is an inexpensive, high-performance device for displaying CGHs. It has high-definition LCD panels for red, green and blue. Thus, it can be easily used for color electroholography. For a three-dimensional object consisting of 1,000 points, our system succeeded in real-time color holographic reconstruction at a rate of 30 frames per second.
/content/cudazone/CUDABrowser/assets/images/applications/577_hologram_small.png
/content/cudazone/CUDABrowser/assets/images/applications/577_hologram_large.png
Chiba University / Shohoku College / Kisarazu National College of Technology
2009
10
07
10/07/2009
1500
Tomoyoshi Ito
Naoki Takada
Tomoyoshi Shimobaba
Multimedia
Paper
Imaging
Numerics
Science
Tomoyoshi Ito,Naoki Takada,Tomoyoshi Shimobaba,itot@faculty.chiba-u.jp
a8608fd3-c5f3-45fe-bc14-9438fefb2c62
CudaPad
CudaPad is software that helps developers develop and test small kernels for NVIDIA's CUDA language. Sometimes in your IDE you want a quick way to build or test a piece of CUDA code, and CudaPad lets you do it. It shows the PTX code, cubin code, register count, errors, and more on the fly. There is no need to manually compile your code.
/content/cudazone/CUDABrowser/assets/images/applications/576_CudaPad_small.png
/content/cudazone/CUDABrowser/assets/images/applications/576_CudaPad_large.png
CudaPad
http://cudapad.com/
2009
08
23
08/23/2009
CudaPad
Application
Programming Tools
CudaPad
d91c3c63-a2d6-4a15-a70a-87bcafdd70d8
Real-time Parallel Hashing on the GPU
We introduce an efficient data-parallel algorithm for building hash tables containing millions of elements in real-time on the GPU. Our two-tiered approach combines classical randomized perfect hashing and the recently introduced cuckoo hashing. Retrieval of any item requires checking at most three locations.
/content/cudazone/CUDABrowser/assets/images/applications/575_paper_thumb_small.png
/content/cudazone/CUDABrowser/assets/images/applications/575_paper_thumb_large.png
Academia
University of California, Davis
http://idav.ucdavis.edu/
2009
09
12
09/12/2009
Dan Anthony Alcantara
Paper
Graphics
Libraries
Dan Anthony Alcantara,dfalcantara@ucdavis.edu
b79e2f2b-047f-4497-bc71-8fae1e3bf2df
Real-time Robotic Surgery Platform with the GPU
A Real-time Simulation, Guidance and Visualisation Platform for Intra-operative Minimally Invasive Robotic Surgery
/content/cudazone/CUDABrowser/assets/images/applications/574_robot-hotspot180_medium_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/574_robot-hotspot180_medium_large.jpg
Academia
Imperial College London
2009
10
06
10/06/2009
88
Guang-Zhong Yang
Presentation
Medical Imaging
Guang-Zhong Yang,gzy@doc.ic.ac.uk
aee7f189-cad9-4004-ab12-7af9e2dac705
Accelerating Virtual Texturing using CUDA
Virtual texturing selectively loads the parts of a large texture data set that are visible in the current view. Our poster shows how virtual texturing can be accelerated by using CUDA and OpenGL.
/content/cudazone/CUDABrowser/assets/images/applications/573_cuda_zone_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/573_cuda_zone_large.jpg
Academia
Ghent University - IBBT, ELIS Department/Multimedia Lab
http://multimedialab.elis.ugent.be/
2009
09
30
09/30/2009
Charles-Frederik Hollemeersch
Paper
Graphics
virtual textures rendering,Charles-Frederik Hollemeersch,charlesfrederik.hollemeersch@ugent.be
dfe42cca-0549-462c-a8b4-6f7f2fdb17a8
Implementation in C+CUDA of Multi-Label Text Categorizers
In automated multi-label text categorization problems with large numbers of labels, the training databases are large, which may render the categorization time prohibitive for online systems. In this work, we evaluate the parallel implementation in C+CUDA of two multi-label text categorizers: the first is based on the k-Nearest Neighbors (k-NN) algorithm and the second is based on Probabilistic Neural Networks (PNN). We implemented these algorithms in three different ways: sequential in C, parallel in C+CUDA, and parallel using the C+CUBLAS library.
/content/cudazone/CUDABrowser/assets/images/applications/572_800px-Pnn_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/572_800px-Pnn_large.jpg
Academia
Universidade Federal do Espirito Santo
http://www.ufes.br
2009
08
03
08/03/2009
64
Alberto F. De Souza et al.
Paper
Information Retrieval
Alberto F. De Souza,alberto@lcad.inf.ufes.br
9e8ea1d4-1246-4b44-86aa-3eaeeec9bc0c
Biologically Inspired Stereoscopic Vision Model in C+CUDA
Most of the depth perception processing is done in the visual cortex, mainly in the primary (V1) and medial temporal (MT) areas. In this work, we modeled the neural architecture of the V1 and MT cortices using as building blocks previous models of cortical cells and log-polar mapping. A sequential implementation of our model can build a tridimensional representation of the external world using stereoscopic image pairs obtained from a pair of fronto-parallel cameras. A C+CUDA parallel implementation is almost 60 times faster and allows real-time 3D reconstruction.
/content/cudazone/CUDABrowser/assets/images/applications/571_800px-3d-hallysson_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/571_800px-3d-hallysson_large.jpg
Academia
Universidade Federal do Espirito Santo
http://www.ufes.br
2009
08
03
08/03/2009
57
Alberto F. De Souza et al.
Paper
Computer Vision
Alberto F. De Souza,alberto@lcad.inf.ufes.br
ddb099b1-959c-4bf2-9254-ba51143125d4
ACCELERATING SPHERICAL HARMONIC TRANSFORMS ON THE NVIDIA GPU
The Spherical Harmonic Transform is a critical computational kernel of the dynamics algorithms for numerical weather prediction and climate modeling. As atmospheric models push towards higher resolutions it has become necessary to accelerate this computationally intensive transform. Previous work has made attempts to parallelize and optimize the transform [1] [2] [3] [4], but none have exploited the advantages of NVIDIA's General Purpose Graphics Processing Unit (GPGPU), a very recent SIMD-type architecture. This paper describes a CPU-GPU implementation for computing the Spherical Harmonic Transform. The implementation shows a gain in computation time and a low error rate when compared to the implementation discussed in [1].
/content/cudazone/CUDABrowser/assets/images/applications/570_soman_small.png
/content/cudazone/CUDABrowser/assets/images/applications/570_soman_large.png
Academia
Department of Electrical Engineering University of Wisconsin, Madison, Wisconsin, USA
2008
12
31
12/31/2008
42
Vikrant Soman
Paper
Computational Fluid Dynamics
Spherical Harmonic Transform, GPU, Parallel,Vikrant Soman
e22a4a2f-cf43-499a-8777-3570c85b9e60
CULATools
CULA is EM Photonics' GPU-accelerated numerical linear algebra library that contains a growing list of LAPACK functions.
/content/cudazone/CUDABrowser/assets/images/applications/569_cula-logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/569_cula-logo_large.png
Commercial
CULATools
http://www.culatools.com/
2009
09
30
09/30/2009
Application
200
3355d528-c9f0-4e35-a07e-da8ea95ddc35
Scalable Split Primitives for the GPU
Fast Split and Sort Implementation for millions of input elements and supporting 32-128 bit key values
/content/cudazone/CUDABrowser/assets/images/applications/568_splitSort_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/568_splitSort_large.jpg
Academia
CVIT, IIIT Hyderabad
http://cvit.iiit.ac.in
2009
07
15
07/15/2009
Open source
Suryakant Patidar
Paper
Code
Libraries
Sort, Split,Suryakant Patidar,skp@research.iiit.ac.in
2f279ff5-7168-4acf-822b-72fd98b2cd76
FindCUDA.cmake
Building on the open source project CMake, developers can now integrate CUDA C compilation directly into their Visual Studio, Makefile or XCode build systems. File level dependencies are supported, as well as many other features designed to help CUDA C files build as part of the native system. Starting with CMake 2.8, FindCUDA.cmake is part of the standard distribution.
/content/cudazone/CUDABrowser/assets/images/applications/567_CMake-logo-high-res_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/567_CMake-logo-high-res_large.jpg
Commercial
NVIDIA Corp.
http://www.nvidia.com
2009
09
30
09/30/2009
Open source
James Bigler
Code
Programming Tools
Build, CMake, Visual Studio, Makefile, XCode,James Bigler,jbigler@nvidia.com
46bb452f-bc32-4e3e-a9f8-ef2b42c975db
Cognitive developmental approach towards the realization of human-like visual scene understanding
How do we humans understand visual scenes so easily and quickly? The question is difficult to answer. However, human babies naturally acquire the ability to do it. Thus, imitating the typical actions of babies is a promising route to acquiring human-like visual scene understanding. Based on this observation, we propose a new framework for human-like visual scene understanding based on a cognitive developmental approach, and construct a prototype system that recognizes already known objects and detects and registers unknown objects in near real time with CUDA technologies.
/content/cudazone/CUDABrowser/assets/images/applications/566_poster2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/566_poster2_large.png
NTT Communication Science Laboratories
http://www.kecl.ntt.co.jp
2009
09
27
09/27/2009
Akisato Kimura
Multimedia
Paper
Signal Processing
Video & Audio
Cognitive developmental approach, visual scene understanding, saliency, video segmentation, CUDA,Akisato Kimura,akisato@ieee.org
bd8ebdba-8c09-413c-8c09-8cd67ec51ea5
SCGPSim: A fast SystemC Simulator on GPUs
A SystemC simulator on GPUs
/content/cudazone/CUDABrowser/assets/images/applications/564_poster_small.png
/content/cudazone/CUDABrowser/assets/images/applications/564_poster_large.png
Academia
FERMAT Lab, Virginia Tech, Blacksburg, USA
http://www.fermat.ece.vt.edu/
2009
10
01
10/01/2009
100
Mahesh Nanjundappa
Paper
Electronic Design Automation
GPGPU, EDA, Parallel Simulation, SystemC,Mahesh Nanjundappa,knmahesh@vt.edu
6e6bb696-0ae8-49c7-b75f-982182e43b7e
Flowcart
Flowball is an interactive game using dense optical flow computed in real time on a GeForce GTX 280. We provide a video and optical flow libraries...
/content/cudazone/CUDABrowser/assets/images/applications/563_cuda_zone_flowcart_small.png
/content/cudazone/CUDABrowser/assets/images/applications/563_cuda_zone_flowcart_large.png
Academia
Institute for Computer Graphics and Vision, Graz University of Technology
http://www.icg.tugraz.at/
2009
09
02
09/02/2009
Wolfgang Paier
Application
Multimedia
Paper
Game Physics
Graphics
Video & Audio
Wolfgang Paier,info@gpu4vision.org
ccbd6aa9-f5a3-4310-bc01-4463d114ba04
CUDA Accelerated Sparse Field Level Set Segmentation of Large Medical Data Sets
Segmentation of large medical volumes is an important task in diagnostic medicine. Computer-assisted level set segmentation techniques have been shown to improve the accuracy of difficult segmentation tasks. We present a novel GPU-accelerated level set segmentation algorithm that avoids redundant computations by only processing those voxels near the propagating level set surface. We evaluate the speed and accuracy of our algorithm by performing various segmentation tasks on a noisy magnetic resonance image (MRI) generated from the BrainWeb phantom dataset. We compare the performance of our algorithm to that of the previous best GPU and CPU algorithms. Compared to the previous best GPU algorithm, our algorithm reduces the total number of processed voxels by 16 times with a negligible effect on segmentation accuracy. Our algorithm converges 9 times faster than the previous best GPU algorithm and 360 times faster than the previous best CPU algorithm on identical hardware.
/content/cudazone/CUDABrowser/assets/images/applications/562_level_set_growth_3D_3_images_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/562_level_set_growth_3D_3_images_large.jpg
Academia
University of Calgary
http://www.ucalgary.ca/
2009
10
01
10/01/2009
360
Commercial
Mike Roberts
Paper
Medical Imaging
segmentation, level set, sparse field, narrow band,Mike Roberts,mlrobert@ucalgary.ca
b19208d8-3dcc-4b57-b8aa-993ed8261989
GPU accelerated Maximum Intensity Projection
The "Maximum Intensity Projection" (MIP) is a computer visualization method in medicine that uses 3D data, e.g. CT or MRI, and computes a 2D view from a given viewpoint.
/content/cudazone/CUDABrowser/assets/images/applications/603_mip_filter2_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/603_mip_filter2_large.gif
Academia
Heidelberg University / Heilbronn University
2008
12
31
12/31/2008
Clas Rurik
Multimedia
Medical Imaging
Clas Rurik,crurik@ix.urz.uni-heidelberg.de
5ad29b38-3310-42a2-830e-f315c5103602
Stochastic Lagrangian Particle Model for Air Pollution
The Graphics Processing Unit (GPU) is a powerful tool for parallel computing. In the past years the performance and capabilities of GPUs have increased, and the Compute Unified Device Architecture (CUDA), a parallel computing architecture, has been developed by NVIDIA to utilize this performance in general-purpose computations. Here we show for the first time a possible application of the GPU for environmental studies serving as a basis for decision-making strategies. A stochastic Lagrangian particle model has been developed on CUDA to estimate the transport and the transformation of radionuclides from a single point source during an accidental release. Our results show that the parallel implementation achieves typical acceleration values on the order of 80-120 times compared to a single-threaded CPU implementation on a 2.33 GHz desktop computer. Only very small differences have been found between the results obtained from GPU and CPU simulations, which are comparable with the effect of stochastic transport phenomena in the atmosphere. The relatively high speedup with no additional costs to maintain this parallel architecture could result in wide usage of GPUs for diversified environmental applications in the near future.
/content/cudazone/CUDABrowser/assets/images/applications/602_plume_small.png
/content/cudazone/CUDABrowser/assets/images/applications/602_plume_large.png
Academia
Eotvos Lorand University
2009
09
21
09/21/2009
120
Open source
Ferenc Molnar Jr.
Application
Paper
Code
Computational Fluid Dynamics
Numerics
Science
Video card, Parallel computing, CUDA, Environmental application, Air pollution,Ferenc Molnar Jr.,mofi@elte.hu
358bc116-6b7d-4598-a11a-bdad6cbd8e30
On the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods
We present a case-study on the utility of graphics cards to perform massively parallel simulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Processing Units (GPUs), are self-contained parallel computational devices that can be housed in conventional desktop and laptop computers. For certain classes of Monte Carlo algorithms they offer massively parallel simulation, with the added advantage over conventional distributed multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, dedicated local devices with low power consumption. On a canonical set of stochastic simulation examples including population-based Markov chain Monte Carlo methods and Sequential Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded computer code. Our findings suggest that GPUs have the potential to facilitate the growth of statistical modelling into complex data rich domains through the availability of cheap and accessible many-core computation. We believe the speedup we observe should motivate wider use of parallelizable simulation methods and greater methodological attention to their design.
/content/cudazone/CUDABrowser/assets/images/applications/601_montecarlo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/601_montecarlo_large.png
Academia
Oxford-Man Institute
2009
05
14
05/14/2009
500
Anthony Lee
Christopher Yau
Michael B. Giles
Paper
Numerics
Sequential Monte Carlo, Population-Based Markov Chain Monte Carlo, General Purpose Computation on Graphics Processing Units, Many-Core Architecture, Stochastic Simulation, Parallel Processing,Anthony Lee,Christopher Yau,Michael B. Giles,lee@stats.ox.ac.uk
ed8e3b35-7db8-4c89-8cf7-a9366ce84bbe
FOLKI-GPU Optical Flow
A very fast implementation of optical flow (25 fps at full HD resolution)
/content/cudazone/CUDABrowser/assets/images/applications/600_onera_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/600_onera_large.jpg
Research
ONERA
http://www.onera.fr/english.php
2009
07
24
07/24/2009
Open source
Aurelien Plyer
Guy Le Besnerais
Frederic Champagnat
Multimedia
Paper
Code
Imaging
Video & Audio
computer vision
optical flow motion,Aurelien Plyer,Guy Le Besnerais,Frederic Champagnat,aurelien.plyer@onera.fr
aa156ca7-4c87-4d17-89d0-e51569250645
A Fast High Quality Pseudo Random Number Generator for NVIDIA CUDA
Previously, either due to hardware GPU limits or older versions of software, careful implementation of PRNGs was required to make good use of the limited numerical precision available on graphics cards. Newer nVidia G80 and Tesla hardware support double precision. This is available to high-level programmers via CUDA. This allows a much simpler C++ implementation of Park-Miller random numbers, which provides a fourfold speedup compared to an earlier GPU implementation. Code is available via ftp.
/content/cudazone/CUDABrowser/assets/images/applications/599_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/599_graph_large.png
Academia
Department of Computer Science, CREST centre, Kings College, London
http://www.cs.ucl.ac.u
2009
01
01
01/01/2009
W. B. Langdon
Paper
Programming Tools
W. B. Langdon,Wi11iam.Langdon@kcl.ac.uk
cbe71302-afb1-4776-bb98-80fd8651b466
JUMP FLOODING ALGORITHM ON GRAPHICS HARDWARE AND ITS APPLICATIONS
The graphics processing unit (GPU) has been developing at a very fast pace in recent years. More and more research has been done to utilize the ever-increasing computational power of the GPU for general-purpose computations. This thesis proposes a new GPU algorithm, the jump flooding algorithm (JFA). JFA is a new paradigm of communication between pixels on the GPU. It can quickly propagate the information of certain pixels to the others. The speed of JFA is exponentially faster than that of the standard flooding algorithm, and is approximately independent of the input size.
/content/cudazone/CUDABrowser/assets/images/applications/597_progress_small.png
/content/cudazone/CUDABrowser/assets/images/applications/597_progress_large.png
2008
12
31
12/31/2008
RONG GUODONG
Paper
Imaging
RONG GUODONG
91755150-16a7-4570-9a2e-2b2e921d2baf
Many-Core Algorithms for Statistical Phylogenetics
Statistical phylogenetics is computationally intensive, resulting in considerable attention meted on techniques for parallelization. Codon-based models allow for independent rates of synonymous and replacement substitutions and have the potential to more adequately model the process of protein coding sequence evolution with a resulting increase in phylogenetic accuracy. Unfortunately, due to the high number of codon states, computational burden has largely thwarted phylogenetic reconstruction under codon models, particularly at the genomic scale. Here we describe novel algorithms and methods for evaluating phylogenies under arbitrary molecular evolutionary models on Graphics Processing Units (GPUs), making use of the large number of processing cores to efficiently parallelize calculations even for large state-size models.
Results: We implement the approach in an existing Bayesian framework and apply the algorithms to estimating the phylogeny of 62 complete mitochondrial genomes of carnivores under a 60-state codon model. We see a near 90-fold speed increase over an optimized CPU-based computation and a >140-fold increase over the currently available implementation, making this the first practical use of codon models for phylogenetic inference over whole mitochondrial or microorganism genomes.
/content/cudazone/CUDABrowser/assets/images/applications/596_Phylogenetics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/596_Phylogenetics_large.png
Department of Biomathematics, University of California, Los Angeles
2009
04
15
04/15/2009
140
Marc A. Suchard
Andrew Rambaut
Paper
Marc A. Suchard,Andrew Rambaut
d753c609-c7e3-4ddc-b2c9-d054e3ab46dd
Speed Up SVM Algorithm for Massive Classification Tasks
We present a new parallel and incremental Support Vector Machine (SVM) algorithm for the classification of very large datasets on graphics processing units (GPUs). SVM and kernel-related methods have been shown to build accurate models, but the learning task usually requires solving a quadratic program, so that for large datasets it demands large memory capacity and a long training time. We extend a recent Least Squares SVM (LS-SVM) proposed by Suykens and Vandewalle to build an incremental and parallel algorithm. The new algorithm uses graphics processors to gain high performance at low cost. Numerical test results on the UCI and Delve dataset repositories showed that our parallel incremental algorithm using GPUs is about 70 times faster than a CPU implementation and often significantly faster (over 1000 times) than state-of-the-art algorithms like LibSVM, SVM-perf and CB-SVM.
/content/cudazone/CUDABrowser/assets/images/applications/595_svm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/595_svm_large.png
Academia
IRISA Symbiose, Campus de Beaulieu, 35042 Rennes Cedex, France
2008
09
30
09/30/2008
Thanh-Nghi Do
Van-Hoa Nguyen
Francois Poulet
Paper
Numerics
Thanh-Nghi Do,Van-Hoa Nguyen,Francois Poulet,dtnghi@cit.ctu.edu.vn,vhnguyen@irisa.fr,francois.poulet@irisa.fr
040861ed-61a2-410f-907e-65e4a23b33a3
Visualizing Multiwavelength Astrophysical Data
With recent advances in the measurement technology for all-sky astrophysical imaging, our view of the sky is no longer limited to the tiny visible spectral range over the 2D celestial sphere. We now can access a third dimension corresponding to a broad electromagnetic spectrum with a wide range of all-sky surveys; these surveys span frequency bands including long-wavelength radio, microwaves, very short X-rays, and gamma rays. These advances motivate us to study and examine multiwavelength visualization techniques to maximize our capabilities to visualize and exploit these informative image data sets. In this work, we begin with the processing of the data themselves, uniformizing the representations and units of raw data obtained from varied detector sources. Then we apply tools to map, convert, color-code, and format the multiwavelength data in forms useful for applications. We explore different visual representations for displaying the data, including such methods as textured image stacks, the horseshoe representation, and GPU-based volume visualization. A family of visual tools and analysis methods is introduced to explore the data, including interactive data mapping on the graphics processing unit (GPU), the mini-map explorer, and GPU-based interactive feature analysis.
/content/cudazone/CUDABrowser/assets/images/applications/593_title_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/593_title_large.jpg
Academia
The Hong Kong University of Science and Technology
2008
12
01
12/01/2008
Hongwei Li
Paper
Imaging
Science
Hongwei Li
489165bd-a529-412c-bb5e-0230b77d02f9
A GPU based real-time software correlation system for the Murchison Widefield Array prototype.
Modern graphics processing units (GPUs) are inexpensive commodity hardware that offer Tflop/s theoretical computing capacity. GPUs are well suited to many compute-intensive tasks including digital signal processing. We describe the implementation and performance of a GPU-based digital correlator for radio astronomy. The correlator is implemented using the NVIDIA CUDA development environment. We evaluate three design options on two generations of NVIDIA hardware. The different designs utilize the internal registers, shared memory and multiprocessors in different ways. We find that optimal performance is achieved with the design that minimizes global memory reads on recent generations of hardware. The GPU-based correlator outperforms a single-threaded CPU equivalent by a factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The extra compute capability provided by the GPU maximises the correlation capability of a PC while retaining the fast development time associated with using standard hardware, networking and programming languages. In this way, a GPU-based correlation system represents a middle ground in design space between high performance, custom built hardware and pure CPU-based software correlation. The correlator was deployed at the Murchison Widefield Array 32 antenna prototype system where it ran in real-time for extended periods. We briefly describe the data capture, streaming and correlation system for the prototype array.
/content/cudazone/CUDABrowser/assets/images/applications/592_bar_small.png
/content/cudazone/CUDABrowser/assets/images/applications/592_bar_large.png
Academia
Harvard-Smithsonian Center for Astrophysics
2008
12
31
12/31/2008
Randall B. Wayth
Paper
Science
Signal Processing
Randall B. Wayth,rwayth@cfa.harvard.edu
01e2e4a7-8b67-47c9-80b4-c4b0b40c66e7
Asymptotic theorems of sequential estimation-adjusted urn models
The Generalized Pólya Urn (GPU) is a popular urn model which is widely used in many disciplines. In particular, it is extensively used in treatment allocation schemes in clinical trials. In this paper, we propose a sequential estimation-adjusted urn model (a nonhomogeneous GPU) which has a wide spectrum of applications. Because the proposed urn model depends on sequential estimations of unknown parameters, the derivation of asymptotic properties is mathematically intricate and the corresponding results are unavailable in the literature. We overcome these hurdles and establish the strong consistency and asymptotic normality for both the patient allocation and the estimators of unknown parameters, under some widely satisfied conditions. These properties are important for statistical inferences and they are also useful for the understanding of the urn limiting process. A superior feature of our proposed model is its capability to yield limiting treatment proportions according to any desired allocation target. The applicability of our model is illustrated with a number of examples.
/content/cudazone/CUDABrowser/assets/images/applications/591_formula_small.png
/content/cudazone/CUDABrowser/assets/images/applications/591_formula_large.png
Academia
Zhejiang University
2006
03
14
03/14/2006
Li-X. Zhang
Feifang Hu
Siu Hung Cheung
Paper
Numerics
Science
Li-X. Zhang,Feifang Hu,Siu Hung Cheung
8fd8d414-8e27-4d33-b9b7-ec084a06aeb4
High Performance Direct Gravitational N-body Simulations
We present the results of gravitational direct $N$-body simulations using the commercial graphics processing units (GPUs) NVIDIA Quadro FX1400 and GeForce 8800GTX, and compare the results with GRAPE-6Af special purpose hardware. The force evaluation of the $N$-body problem was implemented in Cg using the GPU directly to speed up the calculations. The integration of the equations of motion, running on the host computer, was implemented in C using the 4th order predictor-corrector Hermite integrator with block time steps. We find that for a large number of particles ($N \gtrsim 10^4$) modern graphics processing units offer an attractive low cost alternative to GRAPE special purpose hardware. A modern GPU continues to give a relatively flat scaling with the number of particles, comparable to that of the GRAPE. Using the same time step criterion the total energy of the $N$-body system was conserved better than to one in $10^6$ on the GPU, which is only about an order of magnitude worse than obtained with GRAPE. For $N \gtrsim 10^6$ the GeForce 8800GTX was about 20 times faster than the host computer. Though still about an order of magnitude slower than GRAPE, modern GPUs outperform GRAPE in their low cost, long mean time between failures and the much larger onboard memory; the GRAPE-6Af holds at most 256k particles whereas the GeForce 8800GTX can hold 9 million particles in memory.
/content/cudazone/CUDABrowser/assets/images/applications/590_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/590_graph_large.png
Academia
Section Computational Science, University of Amsterdam, Amsterdam, The Netherlands
2009
02
23
02/23/2009
Simon Portegies Zwart
Robert Belleman
Peter Geldof
Paper
Numerics
Science
Simon Portegies Zwart,Robert Belleman,Peter Geldof
abb3c32a-1e92-4553-9a7a-812aaa364adb
Graphic processors to speed-up simulations for the design of high performance solar receptors
Graphics Processing Units (GPUs) are now powerful and flexible systems adapted and used for purposes other than graphics calculations (General Purpose computation on GPU, GPGPU). We present here a prototype to be integrated into simulation codes that estimate temperature, velocity and pressure to design the next generations of solar receptors. Such codes will delegate to our contribution on GPUs the computation of heat transfers due to radiation. We use Monte-Carlo line-by-line ray-tracing through finite volumes. This means data-parallel arithmetic transformations on large data structures. Our prototype is inspired by the source code of GPUBench. Our performance on two recent graphics cards (Nvidia 7800GTX and ATI RX1800XL) shows speed-ups higher than 400 compared to CPU implementations, leaving most of the CPU computing resources available. As there were some questions pending about the accuracy of the operators implemented in GPUs, we start this report with a survey and some contributed tests on the various floating point units available on GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/589_model_small.png
/content/cudazone/CUDABrowser/assets/images/applications/589_model_large.png
Academia
ELIAUS, UPVD
2007
03
06
03/06/2007
420
Sylvain Collange
Marc Daumas
David Defour
Paper
Graphics
Science
Sylvain Collange,Marc Daumas,David Defour,firstname.lastname@univ-perp.fr
003d7f3b-356c-4876-81b4-c207d76b6bf2
nHD
nHD is a multi-GPU 2nd order full Godunov three-dimensional uniform-mesh Euler equations solver for calorically ideal, compressible gas. nHD uses CUDA with MPI and runs on a cluster of multi-GPU machines to accelerate computational hydrodynamics calculations. Full Godunov method solves the hydrodynamic equations by discretizing the fluid and calculating the nonlinear evolution of the discretized distribution, using the analytic solutions for Riemann problems. Thus full Godunov method can resolve arbitrary severe shocks with minimum artificial dissipation and oscillation, and is the irreplaceable method for simulations of compressible fluid, where shocks and vacuums are naturally generated.
/content/cudazone/CUDABrowser/assets/images/applications/588_nHD7_small.png
/content/cudazone/CUDABrowser/assets/images/applications/588_nHD7_large.png
Academia
Department of Physics, Kyoto University
http://www.scphys.kyoto-u.ac.jp/index_e.html
2009
09
20
09/20/2009
173
Open source
Takayuki Muranushi
Code
Computational Fluid Dynamics
Science
Computational Hydrodynamics, Full Godunov Method,Takayuki Muranushi,muranushi@gmail.com
9369afee-d78a-4b20-a092-c689a4a40301
SCELib3.0
SCELib is a computer program which implements the Single Center Expansion (SCE) method to describe molecular electronic densities and the interaction potentials between a charged projectile (electron or positron) and a target molecular system. The first version (CPC Catalog identifier ADMG_v1_0) was submitted to the CPC Program Library in 2000, and version 2.0 (ADMG_v2_0) was submitted in 2004. We here announce the new release 3.0, which presents additional features with respect to the previous versions, aiming at a significant enhancement of its capabilities to deal with larger molecular systems. SCELib 3.0 allows for ab initio effective core potential (ECP) calculations of the molecular wavefunctions to be used in the SCE method in addition to the standard all-electron description of the molecule. The list of supported architectures has been updated and the code has been ported to platforms based on accelerating coprocessors, such as NVIDIA GPGPUs, and the new parallel model adopted is able to run efficiently on a mixed many-core computing system.
/content/cudazone/CUDABrowser/assets/images/applications/587_Ribose_toc_small.png
/content/cudazone/CUDABrowser/assets/images/applications/587_Ribose_toc_large.png
Research
CASPUR, Consortium for Supercomputing in Research
http://www.caspur.it
2009
07
25
07/25/2009
177
Nico Sanna
Paper
Science
Nico Sanna,n.sanna@caspur.it
9a45c9f4-df80-4c98-b6b4-98def8807dd4
Black holes on GPUs
This paper describes a parallel implementation of Monte Carlo simulations using the post-Newtonian equations of motion to model black holes. We use these simulations to investigate the phase space of binary black hole systems.
/content/cudazone/CUDABrowser/assets/images/applications/586_blackhole_small.png
/content/cudazone/CUDABrowser/assets/images/applications/586_blackhole_large.png
Academia
University of Maryland
2009
08
27
08/27/2009
50
Frank Herrmann
John Silberholz
Matias Bellone
Gustavo Guerberoff
Manuel Tiglio
Paper
Numerics
Life Sciences
Science
Frank Herrmann,John Silberholz,Matias Bellone,Gustavo Guerberoff,Manuel Tiglio,tiglio@umd.edu
16c382b3-7218-4288-a261-523470b8c535
GPU accelerated analysis of financial markets
The compute unified device architecture is an almost conventional programming approach for managing computations on a graphics processing unit (GPU) as a data-parallel computing device. With a maximum number of 240 cores in combination with a high memory bandwidth, a recent GPU offers resources for computational physics. We apply this technology to methods of fluctuation analysis, which includes determination of the scaling behavior of a stochastic process and the equilibrium autocorrelation function. Additionally, the recently introduced pattern formation conformity (Preis T et al 2008 Europhys. Lett. 82 68005), which quantifies pattern-based complex short-time correlations of a time series, is calculated on a GPU and analyzed in detail. Results are obtained up to 84 times faster than on a current central processing unit core. When we apply this method to high-frequency time series of the German BUND future, we find significant pattern-based correlations on short time scales.
/content/cudazone/CUDABrowser/assets/images/applications/585_financial_markets_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/585_financial_markets_large.gif
Academia
Johannes Gutenberg University Mainz
2009
09
16
09/16/2009
80
Open source
Tobias Preis
Multimedia
Paper
Code
Finance
Science
Tobias Preis,preis@uni-mainz.de
c462ebc4-646d-4eaf-9714-144678d49528
Fast recursive filters for simulating nonlinear dynamic systems
A fast and accurate computational scheme for simulating nonlinear dynamic systems is presented. The scheme assumes that the system can be represented by a combination of components of only two different types: first-order low-pass filters and static nonlinearities. The parameters of these filters and nonlinearities may depend on system variables, and the topology of the system may be complex, including feedback. Several examples taken from neuroscience are given: phototransduction, photopigment bleaching, and spike generation according to the Hodgkin-Huxley equations. The scheme uses two slightly different forms of autoregressive filters, with an implicit delay of zero for feedforward control and an implicit delay of half a sample distance for feedback control. On a fairly complex model of the macaque retinal horizontal cell it computes, for a given level of accuracy, 1-2 orders of magnitude faster than 4th-order Runge-Kutta. The computational scheme has minimal memory requirements, and is also suited for computation on a stream processor, such as a GPU (Graphical Processing Unit).
/content/cudazone/CUDABrowser/assets/images/applications/584_nuclear_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/584_nuclear_large.gif
Academia
Netherlands Institute for Neuroscience
2007
04
11
04/11/2007
J. H. van Hateren
Paper
Imaging
Life Sciences
Science
J. H. van Hateren
4b683456-f4de-488d-b8f6-6e9a8607538f
N-Body Simulations on GPUs
Commercial graphics processors (GPUs) have high compute capacity at very low cost, which makes them attractive for general purpose scientific computing. In this paper we show how graphics processors can be used for N-body simulations to obtain improvements in performance over current generation CPUs. We have developed a highly optimized algorithm for performing the O(N^2) force calculations that constitute the major part of stellar and molecular dynamics simulations. In some of the calculations, we achieve sustained performance of nearly 100 GFlops on an ATI X1900XTX. The performance on GPUs is comparable to specialized processors such as GRAPE-6A and MDGRAPE-3, but at a fraction of the cost. Furthermore, the wide availability of GPUs has significant implications for cluster computing and distributed computing efforts like Folding@Home.
/content/cudazone/CUDABrowser/assets/images/applications/583_nbody_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/583_nbody_large.gif
Academia
Stanford University
2007
06
20
06/20/2007
Erich Elsen
V. Vishal
Mike Houston
Paper
Numerics
Life Sciences
Science
Erich Elsen,V. Vishal,Mike Houston,pande@stanford.edu
e5271230-c663-4fb6-bf23-997f7563256e
High Performance Direct Gravitational N-body Simulations
We present the results of gravitational direct $N$-body simulations using the Graphics Processing Unit (GPU) on a commercial NVIDIA GeForce 8800GTX designed for gaming computers. The force evaluation of the $N$-body problem is implemented in "Compute Unified Device Architecture" (CUDA) using the GPU to speed up the calculations. We tested the implementation on three different $N$-body codes: two direct $N$-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme. The integration of the equations of motion for all codes is performed on the host CPU. We find that for $N > 512$ particles the GPU outperforms the GRAPE-6Af, if some softening in the force calculation is accepted. Without softening and for very small integration time steps the GRAPE still outperforms the GPU. We conclude that modern GPUs offer an attractive alternative to GRAPE-6Af special purpose hardware. Using the same time-step criterion, the total energy of the $N$-body system was conserved better than to one in $10^6$ on the GPU, only about an order of magnitude worse than obtained with GRAPE-6Af. For $N \gtrsim 10^5$ the 8800GTX outperforms the host CPU by a factor of about 100 and runs at about the same speed as the GRAPE-6Af.
/content/cudazone/CUDABrowser/assets/images/applications/582_nbody_small.png
/content/cudazone/CUDABrowser/assets/images/applications/582_nbody_large.png
Academia
Section Computational Science, University of Amsterdam, Amsterdam, The Netherlands
2007
07
06
07/06/2007
Robert G. Belleman
Jeroen Bedorf
Simon Portegies Zwart
Paper
Numerics
Science
Robert G. Belleman,Jeroen Bedorf,Simon Portegies Zwart
a89487d4-5b55-47ad-9a1c-38363d7c0e04
Developing and Deploying Advanced Algorithms to Novel Supercomputing Hardware
The objective of our research is to demonstrate the practical usage and orders of magnitude speedup of real-world applications by using alternative technologies to support high performance computing. Currently, the main barrier to the widespread adoption of this technology is the lack of development tools and case studies, which typically impedes non-specialists who might otherwise develop applications that could leverage these technologies. By partnering with the Innovative Systems Laboratory at the National Center for Supercomputing Applications, we have obtained access to several novel technologies, including several Field-Programmable Gate Array (FPGA) systems, NVidia Graphics Processing Units (GPUs), and the STI Cell BE platform. Our goal is not only to demonstrate the capabilities of these systems, but also to serve as guides for others to follow in our path. To date, we have explored the efficacy of the SRC-6 MAP-C and MAP-E and SGI RASC Athena and RC100 reconfigurable computing platforms in supporting a two-point correlation function, which is used in a number of different scientific domains. In a brute force test, the FPGA-based single-processor system has achieved an almost two orders of magnitude speedup over a single-processor CPU system. We are now developing implementations of this algorithm on other platforms, including one using a GPU. Given the considerable efforts of the cosmology community in optimizing these classes of algorithms, we are currently working to implement an optimized version of the basic family of correlation functions by using tree-based data structures. Finally, we are also exploring other algorithms, such as instance-based classifiers, power spectrum estimators, and higher-order correlation functions that are also commonly used in a wide range of scientific disciplines.
/content/cudazone/CUDABrowser/assets/images/applications/581_tesla_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/581_tesla_large.jpg
Academia
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
2007
11
21
11/21/2007
25
Robert J. Brunner
Volodymyr V. Kindratenko
Adam D. Myers
Paper
Numerics
Science
Robert J. Brunner,Volodymyr V. Kindratenko,Adam D. Myers,rb@astro.uiuc.edu
19b10e9a-f467-4538-8587-8594b128eeda
Fast k Nearest Neighbor Search
Recent improvements in graphics processing units (GPUs) offer the computer vision community a powerful processing platform. Indeed, many highly parallelizable computer vision problems can be significantly accelerated using GPU architecture. Among these algorithms, the k nearest neighbor search (KNN) is a well-known problem linked with many applications such as classification, estimation of statistical properties, etc. The main drawback of this task lies in its computational burden, as it grows polynomially with the data size. In this paper, we show that the use of the NVIDIA CUDA API accelerates the search for the KNN by up to a factor of 120.
/content/cudazone/CUDABrowser/assets/images/applications/580_dots_small.png
/content/cudazone/CUDABrowser/assets/images/applications/580_dots_large.png
Research
2009
04
09
04/09/2009
120
Vincent Garcia
Eric Debreuve
Michel Barlaud
Paper
Numerics
Science
Vincent Garcia,Eric Debreuve,Michel Barlaud
8890dacc-2905-41ac-a0bd-4efc292db999
A multiphysics and multiscale software environment for modeling astrophysical systems
We present MUSE, a software framework for combining existing computational tools for different astrophysical domains into a single multiphysics, multiscale application. MUSE facilitates the coupling of existing codes written in different languages by providing inter-language tools and by specifying an interface between each module and the framework that represents a balance between generality and computational efficiency. This approach allows scientists to use combinations of codes to solve highly-coupled problems without the need to write new codes for other domains or significantly alter their existing codes. MUSE currently incorporates the domains of stellar dynamics, stellar evolution and stellar hydrodynamics for studying generalized stellar systems. We have now reached a "Noah's Ark" milestone, with (at least) two available numerical solvers for each domain. MUSE can treat multi-scale and multi-physics systems in which the time- and size-scales are well separated, like simulating the evolution of planetary systems, small stellar associations, dense stellar clusters, galaxies and galactic nuclei. In this paper we describe three examples calculated using MUSE: the merger of two galaxies, the merger of two evolving stars, and a hybrid N-body simulation. In addition, we demonstrate an implementation of MUSE on a distributed computer which may also include special-purpose hardware, such as GRAPEs or GPUs, to accelerate computations. The current MUSE code base is publicly available as open source at http://muse.li/.
/content/cudazone/CUDABrowser/assets/images/applications/579_sidexside_small.png
/content/cudazone/CUDABrowser/assets/images/applications/579_sidexside_large.png
Academia
University of Amsterdam, Amsterdam, The Netherlands
2008
07
12
07/12/2008
Simon Portegies Zwart
Steve McMillan
Stefan Harfst
Paper
Numerics
Science
Simon Portegies Zwart,Steve McMillan,Stefan Harfst
0ba0bc17-1da4-46b9-8af0-b885bd619e74
Accelerating Scientific Computations with Mixed Precision Algorithms
On modern architectures, 32-bit operations are often at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
/content/cudazone/CUDABrowser/assets/images/applications/578_c_small.png
/content/cudazone/CUDABrowser/assets/images/applications/578_c_large.png
Academia
Department of Mathematics, University of Coimbra, Coimbra, Portugal
2008
08
20
08/20/2008
15
Marc Baboulin
Alfredo Buttari
Jack Dongarra
Code
Numerics
Science
Marc Baboulin,Alfredo Buttari,Jack Dongarra
cfb65cc8-2394-4a3f-8658-4d117f3a3953
Parallel GPU Implementation of Iterative PCA Algorithms
Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. We also discuss the GPU (Graphics Processing Unit) parallel implementation of both the NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
/content/cudazone/CUDABrowser/assets/images/applications/577_pca_small.png
/content/cudazone/CUDABrowser/assets/images/applications/577_pca_large.png
Academia
Institute for Biocomplexity and Informatics, University of Calgary
2009
11
07
11/07/2009
12
M. Andrecut
Paper
Numerics
Science
M. Andrecut
9e8271a2-4d5e-4e08-868d-fb8c0e0eb80a
Recent algorithm and machine developments for lattice QCD
I review recent machine trends and algorithmic developments for dynamical lattice QCD simulations with the HMC algorithm for Wilson-type fermions. The topics include the trend toward multi-core processors and general purpose GPU (GPGPU) computing, and improvements in the quark determinant preconditioning, molecular dynamics integrator, and quark solvers. I also discuss the prospects for the use of these techniques on the forthcoming petaflops machines.
/content/cudazone/CUDABrowser/assets/images/applications/576_ps_small.png
/content/cudazone/CUDABrowser/assets/images/applications/576_ps_large.png
Academia
Graduate School of Science, Hiroshima University, Higashi-Hiroshima, Hiroshima 739-8526, Japan
2009
11
11
11/11/2009
Ken-Ichi Ishikawa
Paper
Numerics
Science
Ken-Ichi Ishikawa,ishikawa@theo.phys.sci.hiroshima-u.ac.jp
58941ce0-0754-418a-9235-d94fbd05b96f
Interactive Visualization of Billion Point Cosmological Simulations
Despite the recent advances in graphics hardware capabilities, a brute force approach is incapable of interactively displaying terabytes of data. We have implemented a system that uses hierarchical level-of-detailing for the results of cosmological simulations, in order to display visually accurate results without loading in the full dataset (containing over 10 billion points). The guiding principle of the program is that the user should not be able to distinguish what they are seeing from a full rendering of the original data. Furthermore, by using a tree-based system for levels of detail, the size of the underlying data is limited only by the capacity of the IO system containing it.
/content/cudazone/CUDABrowser/assets/images/applications/575_space_small.png
/content/cudazone/CUDABrowser/assets/images/applications/575_space_large.png
Academia
California Institute of Technology, California Ave, 91126, Pasadena, CA
2009
11
13
11/13/2009
Tamas Szalay
Volker Springel
Gerard Lemson
Paper
Imaging
Numerics
Science
Tamas Szalay,Volker Springel,Gerard Lemson
8b031bf9-1fc2-4a6b-a94f-fc9d93433d19
Parallel Algorithm for Solving Kepler's Equation on Graphics Processing Units: Application to Analysis of Doppler Exoplanet Searches
We present the results of a highly parallel Kepler equation solver using the Graphics Processing Unit (GPU) on a commercial nVidia GeForce 280GTX and the "Compute Unified Device Architecture" programming environment. We apply this to evaluate a goodness-of-fit statistic (e.g., chi^2) for Doppler observations of stars potentially harboring multiple planetary companions (assuming negligible planet-planet interactions). We tested multiple implementations using single precision, double precision, pairs of single precision, and mixed precision arithmetic. We find that the vast majority of computations can be performed using single precision arithmetic, with selective use of compensated summation for increased precision. However, standard single precision is not adequate for calculating the mean anomaly from the time of observation and orbital period when evaluating the goodness-of-fit for real planetary systems and observational data sets. Using all double precision, our GPU code outperforms a similar code using a modern CPU by a factor of over 60. Using mixed precision, our GPU code provides a speed-up factor of over 600 when evaluating N_sys > 1024 model planetary systems, each containing N_pl = 4 planets and assuming N_obs = 256 observations of each system. We conclude that modern GPUs also offer a powerful tool for repeatedly evaluating Kepler's equation and a goodness-of-fit statistic for orbital models when presented with a large parameter space.
/content/cudazone/CUDABrowser/assets/images/applications/574_KeplersEquation_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/574_KeplersEquation_large.gif
Academia
Department of Astronomy, University of Florida
2009
12
16
12/16/2009
600
Eric B. Ford
Paper
Numerics
Science
gravitation,planetary systems,methods: numerical,techniques:radial velocities,Eric B. Ford
c8752f5a-9c9d-4f67-9779-6c0ffbd62c22
Differential Equations for Monte Carlo Recycling and a GPU-Optimized Normal Quantile
This article presents differential equations and solution methods for functions of the form $A(z) = F^{-1}(G(z))$, where $F$ and $G$ are cumulative distribution functions. Such functions allow the direct recycling of samples from one distribution into samples from another. The method may be developed analytically for certain special cases, and illuminates the idea that it is a more precise form of the traditional Cornish-Fisher expansion. In this manner the model risk of distributional risk may be assessed free of the Monte Carlo noise associated with resampling. The method may also be regarded as providing both analytical and numerical bases for doing more precise Cornish-Fisher transformations. Examples are given of equations for converting normal samples to Student t, and converting exponential to hyperbolic, variance gamma and normal. In the case of the normal distribution, the change of variables employed allows the sampling to take place to good accuracy based on a single rational approximation over a very wide range of the sample space. The avoidance of any branching statement is of use in optimal GPU computations, and we give examples of branch-free normal quantiles that offer performance improvements in a GPU environment, while retaining the precision characteristics of well-known methods.
/content/cudazone/CUDABrowser/assets/images/applications/573_montecarlo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/573_montecarlo_large.png
Academia
Department of Mathematics, King's College, The Strand, London WC2R 2LS, England
2009
01
06
01/06/2009
William T. Shaw
Nick Brickman
Paper
Numerics
Science
William T. Shaw,Nick Brickman,william.shaw@kcl.ac.uk
a8956c98-3e65-4a84-8ed9-2b2c84becf99
Nodal Discontinuous Galerkin Methods
Discontinuous Galerkin (DG) methods for the numerical solution of partial differential equations have enjoyed considerable success because they are both flexible and robust: They allow arbitrary unstructured geometries and easy control of accuracy without compromising simulation stability. Lately, another property of DG has been growing in importance: The majority of a DG operator is applied in an element-local way, with weak penalty-based element-to-element coupling. The resulting locality in memory access is one of the factors that enables DG to run on off-the-shelf, massively parallel graphics processors (GPUs). In addition, DG's high-order nature lets it require fewer data points per represented wavelength and hence fewer memory accesses, in exchange for higher arithmetic intensity. Both of these factors work significantly in favor of a GPU implementation of DG. Using a single US$400 Nvidia GTX 280 GPU, we accelerate a solver for Maxwell's equations on a general 3D unstructured grid by a factor of 40 to 60 relative to a serial computation on a current-generation CPU. In many cases, our algorithms exhibit full use of the device's available memory bandwidth. Example computations achieve and surpass 200 gigaflops/s of net application-level floating point work. In this article, we describe and derive the techniques used to reach this level of performance. In addition, we present comprehensive data on the accuracy and runtime behavior of the method.
/content/cudazone/CUDABrowser/assets/images/applications/572_plane_small.png
/content/cudazone/CUDABrowser/assets/images/applications/572_plane_large.png
Academia
Division of Applied Mathematics, Brown University, Providence, RI 02912
2009
01
08
01/08/2009
60
Andreas Klockner
Tim Warburton
Jeffrey Bridge
Paper
Numerics
Science
Andreas Klockner,Tim Warburton,Jeffrey Bridge,andreas@brown.edu,kloeckner@brown.edu
f93b62b6-b6af-497e-83a8-865af31c8d7a
Parallelizing Hash-based Data Carving
The ability to detect fragments of deleted image files and to reconstruct these image files from all available fragments on disk is a key activity in the field of digital forensics. Although reconstruction of image files from the file fragments on disk can be accomplished by simply comparing the content of sectors on disk with the content of known files, this brute-force approach can be time-consuming. This paper presents results from research into the use of Graphics Processing Units (GPUs) in detecting specific image file byte patterns in disk clusters. A unique identifying pattern for each disk sector is compared against patterns in known images. A pattern match indicates the potential presence of an image and flags the disk sector for further in-depth examination to confirm the match. The GPU-based implementation outperforms the software implementation by a significant margin.
/content/cudazone/CUDABrowser/assets/images/applications/571_g80_small.png
/content/cudazone/CUDABrowser/assets/images/applications/571_g80_large.png
Academia
ELIAUS University of Perpignan
2009
01
09
01/09/2009
Sylvain Collange
Yoginder Dandass
Paper
Imaging
Science
Sylvain Collange,Yoginder Dandass,sylvain.collange@univ-perp.fr
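The sector-matching idea described in the data carving entry above can be sketched on the CPU as follows. This is a minimal illustration only, assuming 512-byte sectors and SHA-256 as the per-sector identifying pattern; neither choice is taken from the paper, and the function names (`sector_hashes`, `build_index`, `scan_disk`) are hypothetical.

```python
import hashlib

SECTOR = 512  # assumed sector size in bytes (not stated in the source)


def sector_hashes(data, sector=SECTOR):
    """Yield (offset, digest) for every full sector of a byte string."""
    for off in range(0, len(data) - sector + 1, sector):
        yield off, hashlib.sha256(data[off:off + sector]).digest()


def build_index(known_files):
    """Map each sector digest of the known files to (file name, offset) pairs."""
    index = {}
    for name, content in known_files.items():
        for off, digest in sector_hashes(content):
            index.setdefault(digest, []).append((name, off))
    return index


def scan_disk(disk, index):
    """Return (disk offset, matches) for each disk sector whose digest is known.

    A hit only flags the sector as a candidate; a byte-for-byte comparison
    would still be needed to confirm the match.
    """
    return [(off, index[d]) for off, d in sector_hashes(disk) if d in index]
```

Each sector is hashed independently of the others, which is the property that makes this kind of scan a natural fit for a data-parallel device.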
0cf53113-4ca6-4571-ad43-030fb84f5f1e
ACEMD: Accelerating bio-molecular dynamics in the microsecond time-scale
The high arithmetic performance and intrinsic parallelism of recent graphical processing units (GPUs) can offer a technological edge for molecular dynamics simulations. ACEMD is a production-class bio-molecular dynamics (MD) simulation program designed specifically for GPUs which is able to achieve supercomputing-scale performance of 40 nanoseconds/day for all-atom protein systems with over 23,000 atoms. We illustrate the characteristics of the code, its validation and performance. We also run a microsecond-long trajectory for an all-atom molecular system in explicit TIP3P water on a single workstation computer equipped with just 3 GPUs. This performance on cost-effective hardware allows ACEMD to reach microsecond timescales routinely, with important implications in terms of scientific applications.
/content/cudazone/CUDABrowser/assets/images/applications/570_biomoleculardynamics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/570_biomoleculardynamics_large.png
Academia
Information and Communications Technologies, Imperial College London, South Kensington, London, SW7 2AZ, UK
2009
02
05
02/05/2009
19
M. J. Harvey
G. Giupponi
G. De Fabritiis
Paper
Life Sciences
Science
M. J. Harvey,G. Giupponi,G. De Fabritiis,m.j.harvey@imperial.ac.uk
f16b23b6-991b-410d-b339-b4815a000f00
GPUs for data processing in the MWA
The MWA is a next-generation radio interferometer under construction in remote Western Australia. The data rate from the correlator makes storing the raw data infeasible, so the data must be processed in real-time. The processing task is of order ~10 TFLOPS. The remote location of the MWA limits the power that can be allocated to computing. We describe the design and implementation of elements of the MWA real-time data processing system which leverage the computing abilities of modern graphics processing units (GPUs). The matrix algebra and texture mapping capabilities of GPUs are well suited to the majority of tasks involved in real-time calibration and imaging. Considerable performance advantages over a conventional CPU-based reference implementation are obtained.
/content/cudazone/CUDABrowser/assets/images/applications/569_wma_small.png
/content/cudazone/CUDABrowser/assets/images/applications/569_wma_large.png
Academia
Harvard-Smithsonian Center for Astrophysics, Cambridge, MA, USA
2009
02
05
02/05/2009
S. Ord
L. Greenhill
R. Wayth
Paper
Numerics
Science
S. Ord,L. Greenhill,R. Wayth
31397807-ebad-4ce4-a822-4f66cfe8d3ca
SAPPORO: A way to turn your graphics cards into a GRAPE-6
We present Sapporo, a library for performing high-precision gravitational N-body simulations on NVIDIA Graphical Processing Units (GPUs). Our library mimics the GRAPE-6 library, and N-body codes currently running on GRAPE-6 can switch to Sapporo by a simple relinking of the library. The precision of our library is comparable to that of GRAPE-6, even though internally the GPU hardware is limited to single precision arithmetic. This limitation is effectively overcome by emulating double precision for calculating the distance between particles. The performance loss of this operation is small (about 20 percent) compared to the advantage of being able to run at high precision. We tested the library using several GRAPE-6-enabled N-body codes, in particular with Starlab and phiGRAPE. We measured peak performance of 800 Gflop/s for running with 10^6 particles on a PC with four commercial G92 architecture GPUs (two GeForce 9800GX2). As a production test, we simulated a 32k Plummer model with equal mass stars well beyond core collapse. The simulation took 41 days, during which the mean performance was 113 Gflop/s. The GPU did not show any problems from running in a production environment for such an extended period of time.
/content/cudazone/CUDABrowser/assets/images/applications/567_cpu_gpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/567_cpu_gpu_large.png
Academia
Astronomical Institute "Anton Pannekoek", University of Amsterdam
2009
02
25
02/25/2009
Evghenii Gaburov
Stefan Harfst
Simon Portegies Zwart
Paper
Numerics
Science
Evghenii Gaburov,Stefan Harfst,Simon Portegies Zwart,egaburov@strw.leidenuniv.nl
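The Sapporo entry above mentions emulating double precision when computing particle separations on single-precision hardware. One common way to do this is to store each coordinate as a pair of single-precision numbers. The sketch below illustrates that general two-float idea in plain Python; it is not Sapporo's actual code, and the helper names (`f32`, `split`, `diff_single`, `diff_two_float`) are hypothetical.

```python
import struct


def f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split(x):
    """Represent x as an unevaluated sum hi + lo of two single-precision numbers."""
    hi = f32(x)
    lo = f32(x - hi)  # the part of x that did not fit into hi
    return hi, lo


def diff_single(a, b):
    """Coordinate difference using one single-precision number per coordinate."""
    return f32(f32(a) - f32(b))


def diff_two_float(a, b):
    """Coordinate difference using a (hi, lo) single-precision pair per coordinate.

    The large parts cancel in the hi subtraction, so the small separation
    carried by the lo parts survives the final rounding.
    """
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    return f32(f32(a_hi - b_hi) + f32(a_lo - b_lo))
```

For two particles at x = 1000.00001 and x = 1000.0, `diff_single` returns zero because both coordinates round to the same single-precision value, while `diff_two_float` recovers a separation close to 1e-5.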
ca427734-eeff-4c03-ae7e-3230ad448d64
Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures
The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphic Processing Units (GPU) is presented. The code on which our implementation is based is a GNU GPL code based on Daubechies wavelets. It shows very good performance, systematic convergence properties and excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all these properties. In particular, the code is able to run on many cores which may or may not have a GPU associated. It is thus able to run in parallel and massively parallel hybrid environments, including those with a non-homogeneous CPU/GPU ratio. With double precision calculations, we may achieve considerable speedup, between a factor of 20 for some operations and a factor of 6 for the whole DFT code.
/content/cudazone/CUDABrowser/assets/images/applications/566_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/566_graph_large.png
Research
European Synchrotron Radiation Facility
2009
04
09
04/09/2009
20
Luigi Genovese
Matthieu Ospici
Thierry Deutsch
Paper
Numerics
Science
Luigi Genovese,Matthieu Ospici,Thierry Deutsch,luigi.genovese@esrf.fr
cea6c291-c024-4ea7-9cdb-af965bd49771
Accelerator-Oriented Algorithm Transformation for Temporal Data Mining
Temporal data mining algorithms are becoming increasingly important in many application domains including computational neuroscience, especially the analysis of spike train data. While application scientists have been able to readily gather multi-neuronal datasets, analysis capabilities have lagged behind, due to both lack of powerful algorithms and inaccessibility to powerful hardware platforms. The advent of GPU architectures such as Nvidia's GTX 280 offers a cost-effective option to bring these capabilities to the neuroscientist's desktop. Rather than port existing algorithms onto this architecture, we advocate the need for algorithm transformation, i.e., rethinking the design of the algorithm in a way that need not necessarily mirror its serial implementation strictly. We present a novel implementation of a frequent episode discovery algorithm by revisiting "in-the-large" issues such as problem decomposition as well as "in-the-small" issues such as data layouts and memory access patterns. This is non-trivial because frequent episode discovery does not lend itself to GPU-friendly data-parallel mapping strategies. Applications to many datasets and comparisons to CPU as well as prior GPU implementations showcase the advantages of our approach.
/content/cudazone/CUDABrowser/assets/images/applications/564_oriented_small.png
/content/cudazone/CUDABrowser/assets/images/applications/564_oriented_large.png
Academia
Department of Computer Science, Virginia Tech
2009
05
13
05/13/2009
431
Debprakash Patnaik
Sean P. Ponce
Yong Cao
Paper
Numerics
Life Sciences
Science
Debprakash Patnaik, Sean P. Ponce, Yong Cao, Naren Ramakrishnan
aa89b4c8-9abd-4459-adc4-00bfbb8021f7
Solving $k$-Nearest Vector Problem on Multiple Graphics Processors
In a recommendation system, customers' preferences are encoded into vectors, and finding the nearest vectors to each vector is an essential step. We define this part of the problem as a $k$-nearest vector problem and give an effective algorithm to solve it on multiple graphics processor units (GPUs). We show experimentally that when the size of the problem is large, an implementation of the algorithm on two GPUs runs more than 260 times faster than a single-core implementation on a recent CPU. We also show that our algorithm scales well with respect to the number of GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/563_k_small.png
/content/cudazone/CUDABrowser/assets/images/applications/563_k_large.png
Research
Nihon Unisys, Ltd.
2009
01
01
01/01/2009
260
Kimikazu Kato
Tikara Hosino
Paper
Numerics
Kimikazu Kato, Tikara Hosino
d4976711-d460-44f9-bfa5-ce9ca5d4c44e
Elemental Accelerator
Elemental Accelerator is a video processing solution designed to add power and performance to the Adobe Premiere Pro CS4 workflow. Coupled with NVIDIA Quadro series video cards, Elemental Accelerator harnesses the power of the graphics processing unit (GPU) to perform high-speed video encoding and deliver dramatic time savings over conventional CPU-only encoding solutions. Elemental Accelerator performs GPU-accelerated conversion of commonly distributed digital video formats to H.264/AVC output ready for upload to the web or burning to Blu-ray disc. Elemental Accelerator also supports high-speed MPEG-2 encoding for DVD or digital broadcast. By executing demanding processing tasks on the GPU, Elemental Accelerator not only speeds video transcoding, it frees CPU resources to perform other tasks, resulting in a faster, more efficient video editing and production environment.
/content/cudazone/CUDABrowser/assets/images/applications/562_accelerator_small.png
/content/cudazone/CUDABrowser/assets/images/applications/562_accelerator_large.png
Commercial
Elemental
http://elementaltechnologies.com/
2009
07
10
07/10/2009
7
Commercial
Elemental
Application
Multimedia
Video & Audio
14428008-4747-47d1-bcd5-f59bdb8230ec
Towards Flow Cytometry Data Clustering on Graphics Processing Units
Like many modern techniques for scientific analysis, flow cytometry produces massive amounts of data that must be analyzed and clustered intelligently to be useful. Current manual binning techniques are cumbersome and limited in both the quality and quantity of analysis produced. To address the quality of results, a new framework applying two different sets of clustering algorithms and inference methods is implemented. The two methods investigated are fuzzy c-means with minimum description length inference, and k-medoids with BIC. These approaches lend themselves to large scale parallel processing. To address the computational demands, the Nvidia CUDA framework and Tesla architecture are utilized. The resulting performance demonstrated 1-2 orders of magnitude improvement over an equivalent sequential version. The quality of results is promising and motivates further research and development in this direction.
/content/cudazone/CUDABrowser/assets/images/applications/561_flow_small.png
/content/cudazone/CUDABrowser/assets/images/applications/561_flow_large.png
Academia
Rochester Institute of Technology, Rochester, NY
2008
12
31
12/31/2008
159
Jeremy Espenshade
Doug Roberts
James Cavenaugh
Paper
Numerics
Jeremy Espenshade,Doug Roberts,James Cavenaugh
d2033a62-e770-435d-a243-a38ca5a2ac58
Search Pipeline for Gravitational Waves from Coalescing Binaries of Compact Objects
We report a novel application of graphics processing units (GPUs) for the purpose of accelerating the search pipelines for gravitational waves from coalescing binaries of compact objects. A 16-fold speed-up has been achieved compared with a single central processing unit (CPU). We show that substantial improvements are possible and discuss the reduction in CPU count required for the detection of inspiral sources afforded by the use of GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/560_pipeline_small.png
/content/cudazone/CUDABrowser/assets/images/applications/560_pipeline_large.png
Academia
School of Computer Science and Engineering, The University of Western Australia
2009
07
23
07/23/2009
16
Shin Kee Chung
Linqing Wen
David Blair
Paper
Science
Shin Kee Chung, Linqing Wen, David Blair, Kipp Cannon, Amitava Datta
e441e930-0ee4-41c2-9fa0-6d18c307ea30
Neuroblastoma
Accelerating dataflow applications through the coordination of CPU and GPU
/content/cudazone/CUDABrowser/assets/images/applications/559_neuro_results_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/559_neuro_results_large.jpg
Research
UFMG
2009
09
04
09/04/2009
90
George Teodor
Paper
Medical Imaging
George Teodor,george@dcc.ufmg.br
46b9fa44-59f6-4987-a80c-87339387925f
Abe
Abe is a different type of search: searching for images with images.
/content/cudazone/CUDABrowser/assets/images/applications/558_logo_s_small.png
/content/cudazone/CUDABrowser/assets/images/applications/558_logo_s_large.png
Commercial
Quad Streaming
http://www.quadstreaming.com/
2010
02
08
02/08/2010
10
Quad
Application
Imaging
Image search
Quad,office@quadstreaming.com
d84233c9-2041-4233-818b-e72e7813b115
GPU Satellite Image Processing
Using CUDA and Tesla, PCI Geomatics has optimized code for orthorectification and pansharpening of high-resolution satellite imagery in the GeoImaging Accelerator (GXL)
/content/cudazone/CUDABrowser/assets/images/applications/557_GXL_Server_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/557_GXL_Server_large.jpg
Commercial
PCI Geomatics
http://www.pcigeomatics.com
2009
03
02
03/02/2009
2
Commercial
David Piekny
Paper
Imaging
David Piekny,piekny@pcigeomatics.com
8daa1fae-c95a-42a0-8d4a-82aab0b0d346
FlaCuda encoder
Open-source CUDA-enabled FLAC encoder
/content/cudazone/CUDABrowser/assets/images/applications/556_flacuda_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/556_flacuda_large.jpg
Research
2009
09
10
09/10/2009
3
Open source
Gregory S. Chudov
Application
Code
Video & Audio
Gregory S. Chudov,
ef913ea0-85cb-4e53-a970-00b879982728
Large Integer/polynomial multiplication on GPU
The paper describes the first implementation of large integer and/or polynomial multiplication using the number theoretic transform on GPU with 24-bit primes. The efficient 24-bit modular reduction is performed in floating-point arithmetic. Our algorithm exploits the fused multiply-add (FMA) capabilities of the graphics hardware. DOI: http://dx.doi.org/10.1007/978-3-642-03644-6_11
/content/cudazone/CUDABrowser/assets/images/applications/555_mul_image_small.png
/content/cudazone/CUDABrowser/assets/images/applications/555_mul_image_large.png
Academia
Max Planck Institute for Informatics
http://www.mpi-inf.mpg.de
2009
08
21
08/21/2009
Pavel Emeliyanenko
Paper
Numerics
Science
Pavel Emeliyanenko,asm@mpi-sb.mpg.de
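The multiplication entry above relies on the number theoretic transform (NTT), which can be sketched in a few lines. This is a rough CPU illustration under stated assumptions, not the paper's GPU algorithm: it uses a single prime, 7340033 = 7 * 2^20 + 1 (which fits in 24 bits), with exact Python integers instead of floating-point modular reduction, and it works on base-10 digits. The names `ntt` and `multiply` are hypothetical.

```python
P = 7340033  # assumed prime: 7 * 2**20 + 1, below 2**24
G = 3        # quadratic non-residue mod P, so G**((P-1)//n) has order exactly n


def ntt(a, invert=False):
    """Iterative radix-2 NTT of a list whose length is a power of two (<= 2**20)."""
    a = list(a)
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes.
    length = 2
    while length <= n:
        w_len = pow(G, (P - 1) // length, P)
        if invert:
            w_len = pow(w_len, P - 2, P)  # modular inverse via Fermat
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                u, v = a[k], a[k + half] * w % P
                a[k] = (u + v) % P
                a[k + half] = (u - v) % P
                w = w * w_len % P
        length <<= 1
    if invert:
        n_inv = pow(n, P - 2, P)
        a = [x * n_inv % P for x in a]
    return a


def multiply(x, y):
    """Multiply two non-negative integers by convolving their decimal digits.

    Valid while every convolution coefficient stays below P, i.e. for
    operands of fewer than roughly 90,000 digits with this single prime.
    """
    dx = [int(c) for c in reversed(str(x))]
    dy = [int(c) for c in reversed(str(y))]
    n = 1
    while n < len(dx) + len(dy):
        n <<= 1
    fa = ntt(dx + [0] * (n - len(dx)))
    fb = ntt(dy + [0] * (n - len(dy)))
    coeffs = ntt([p * q % P for p, q in zip(fa, fb)], invert=True)
    return sum(c * 10 ** i for i, c in enumerate(coeffs))  # carries resolved here
```

A single small prime limits the operand size; combining several primes through the Chinese remainder theorem is the usual way to lift that limit, and is left out here to keep the sketch short.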
82150290-d681-44c6-a606-35c9565949a8
A Parallel Annealing Method for Automatic Color Cervigram Image Segmentation
The accurate and automatic segmentation of tissue regions in cervigram images can aid in the identification and classification of precancerous regions. We implement and analyze four GPU (Graphics Processing Unit) based clustering algorithms: K-means, mean shift, deterministic annealing, and spatially coherent deterministic annealing. From our results, we propose a novel parallel algorithm using the CUDA programming language for digital cervigram segmentation and clustering. The first step of our fully automatic method is to compute the number of modes in the feature space of a color cervigram image using the mean shift clustering algorithm. Next, we use the number of modes in a novel spatially coherent deterministic annealing optimization technique to produce an approximate optimal solution for the clustering problem. Our GPU based methods perform approximately 38x (deterministic annealing), 134x (mean shift), and 276x (spatially coherent deterministic annealing) faster than an equivalent CPU solution. Our implementation decreases the computational time of an annealing method on a 1280x872 pixel image from 5 hours 3 minutes to 72.12 seconds, enabling the use of this optimization method in clinical settings and on large cervigram datasets.
/content/cudazone/CUDABrowser/assets/images/applications/554_edkim_miccaigpuimage_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/554_edkim_miccaigpuimage_large.jpg
Academia
Lehigh University
2009
08
15
08/15/2009
276
Edward Kim
Paper
Medical Imaging
Edward Kim,edk208@lehigh.edu
30159376-0082-49a8-9716-a55f0b2fb707
Predicting Lightning in Protoplanetary Discs
We study the role of dust-dust collisional charging in protoplanetary discs. Although in some cases the charge densities for different species differ by 20 orders of magnitude, we transformed the algorithm so that it gives sufficiently precise solutions using only single precision floats. This made the program run faster on GPGPUs, allowing us to survey a wide range of parameter space at high resolution. As a result, we found that as the dust condenses, the charge distribution passes through four phases. In one of these phases the electrostatic field grows as the fourth power of the dust density and lightning takes place.
/content/cudazone/CUDABrowser/assets/images/applications/553_lightning-here_small.png
/content/cudazone/CUDABrowser/assets/images/applications/553_lightning-here_large.png
Academia
Theoretical Astrophysics Group, Department of Physics, Kyoto University
http://www-tap.scphys.kyoto-u.ac.jp/
2009
08
11
08/11/2009
140
Takayuki Muranushi
Paper
Numerics
Science
Takayuki Muranushi,muranushi@gmail.com
2c8af116-762a-4968-b59c-bdc1328b7461
Optimization of FTLE Calculation
We calculate the Finite-Time Lyapunov Exponent (FTLE) for several fluid flows and find that CUDA helps us immensely.
/content/cudazone/CUDABrowser/assets/images/applications/552_rlw_vortex_small.png
/content/cudazone/CUDABrowser/assets/images/applications/552_rlw_vortex_large.png
Academia
California Institute of Technology
2009
08
14
08/14/2009
1000
Raymond Jimenez
Application
Multimedia
Paper
Code
Computational Fluid Dynamics
Raymond Jimenez,raymondj@caltech.edu
d508073b-38bf-4fc7-b99b-1ad6ff71b868
CBDA: Cyclotron Beam Dynamics Analysis
Software for accelerator physics
/content/cudazone/CUDABrowser/assets/images/applications/551_demo2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/551_demo2_large.jpg
Research
JINR
http://www.jinr.ru
2009
07
01
07/01/2009
60
Perepelkin Evgeny
Application
Science
Perepelkin Evgeny,pevgeny@jinr.ru,Cyclotron, Space charge effect, Acceleration
1b0a6a82-8181-4dd3-9477-0e7d523af249
Efficient Acceleration of Asymmetric Cryptography on GPUs
We present implementations of large integer modular exponentiation, the core of public-key cryptosystems such as RSA, on a DirectX 10 compliant GPU. We present high performance modular exponentiation implementations based on integers represented in both standard radix form and residue number system form. We show how a GPU implementation of a 1024-bit RSA decrypt primitive can outperform a comparable CPU implementation by up to 4 times and also improve the performance of previous GPU implementations by decreasing latency by up to 7 times and doubling throughput. We present how an adaptive approach to modular exponentiation involving implementations based on both a radix and a residue number system gives the best all-around performance on the GPU both in terms of latency and throughput. We also highlight the usage criteria necessary to allow the GPU to reach peak performance on public key cryptographic operations.
/content/cudazone/CUDABrowser/assets/images/applications/550_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/550_graph_large.png
Academia
Trinity College Dublin, Ireland
http://www.tcd.ie/
2008
12
01
12/01/2008
4
Owen Harrison
Paper
Numerics
Owen Harrison,harrisoo@cs.tcd.ie
4eb88033-a856-44e6-aa9c-f8143e624219
StandardModel on GPU
This project is a GPU port of the "Standard Model of Visual Cortex" (CBCL, MIT, by Riesenhuber M., Poggio T., Serre T., Wolf L.)
/content/cudazone/CUDABrowser/assets/images/applications/549_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/549_logo_large.png
Research
2009
08
10
08/10/2009
100
Open source
Giacomo Spigler
Application
Code
Graphics
Science
Signal Processing
Giacomo Spigler,spiglerg@gmail.com
a927fa37-9635-48c1-8b51-1f237dec4035
Cuda Jpeg Decoder
jpeg decoder on GPU
/content/cudazone/CUDABrowser/assets/images/applications/548_screenshot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/548_screenshot_large.jpg
Commercial
2U
http://www.2uinfotech.com
2009
08
13
08/13/2009
10
Open source
Ramazan Dincer
Application
Code
Imaging
Ramazan Dincer,rados82@gmail.com
36253542-171f-491b-8646-f224a5694e8f
Hyperspectral unmixing on NVidia GPUs
Hyperspectral images are now routinely used in several Earth observation and planetary exploration missions. These images can be seen as high-dimensional data cubes with three dimensions: two of which represent the spatial domain, while the third one comprises hundreds of spectral bands collected at different wavelengths. As a result, each pixel is represented by a spatial localization and a spectral signature which provides very detailed information about its composition. One of the main problems in the analysis of hyperspectral data cubes is the problem of mixed pixels, which arise when the spatial resolution of the sensor is not enough to separate spectrally distinct materials. In this case, several spectrally pure signatures (endmembers) are combined into the same (mixed) pixel. Hyperspectral unmixing techniques comprise two stages: 1) automatic identification of spectral endmembers; and 2) estimation of the fractional abundance of each endmember in each pixel. The unmixing process is quite computationally expensive, mainly due to the extremely high dimensionality of hyperspectral data cubes. In this work, we develop a computationally efficient implementation of the full hyperspectral unmixing chain using different endmember extraction and fractional abundance estimation algorithms. The proposed methodology has been implemented, using the Compute Unified Device Architecture (CUDA), on an NVidia GeForce 8800 GTX GPU, achieving speedups on the order of 25x when compared to an optimized implementation of the same code on a dual-core CPU.
/content/cudazone/CUDABrowser/assets/images/applications/547_hyperspectralcube_small.png
/content/cudazone/CUDABrowser/assets/images/applications/547_hyperspectralcube_large.png
Academia
Technology of Computers and Communications, University of Extremadura
http://www.umbc.edu/rssipl/people/aplaza
2009
08
12
08/12/2009
25
Antonio Plaza
Application
Imaging
Antonio Plaza,aplaza@unex.es
31d6fa65-0651-4b66-8e96-75dda84f13f6
Tracking as Segmentation of Spatial-Temporal Volumes
In this work, we interpret tracking as segmentation of spatial-temporal volumes. Segmentation is done by a variational approach using anisotropic weighted Total Variation (TV) regularization. All major parts of this approach are computed on the GPU using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/546_cuda_zone_emmcvpr_small.png
/content/cudazone/CUDABrowser/assets/images/applications/546_cuda_zone_emmcvpr_large.png
Academia
Graz University of Technology, Institute for Computer Graphics and Vision
http://www.gpu4vision.org
2009
08
12
08/12/2009
Markus Unger
Multimedia
Paper
Numerics
Science
Video & Audio
Computer Vision
Markus Unger,info@gpu4vision.org
6217e565-1518-4cfd-9644-e7206c32a1a5
Performance Comparison of Single-Precision SPICE Model-Evaluation on FPGA, GPU, Cell, and Multi-core Processors
Automated code generation and performance tuning techniques for concurrent architectures such as GPUs, Cell and FPGAs can provide integer factor speedups over multi-core processor organizations for data-parallel, floating-point computation in SPICE Model-Evaluation. Our Verilog AMS compiler produces code for parallel evaluation of non-linear circuit models suitable for use in SPICE simulations where the same model is evaluated several times for all the devices in the circuit. Our compiler uses architecture specific parallelization strategies (OpenMP for multi-core, PThreads for Cell, CUDA for GPU, statically scheduled VLIW for FPGA) when producing code for these different architectures. We automatically explore different implementation configurations (e.g. unroll factor, vector length) using our performance-tuner to identify the best possible configuration for each architecture. We demonstrate speedups of 3x to 131x for an NVIDIA 9600 GT GPU over a 3 GHz Intel Xeon 5160 implementation for a variety of single-precision device models.
/content/cudazone/CUDABrowser/assets/images/applications/545_ic_logo_basic_small.png
/content/cudazone/CUDABrowser/assets/images/applications/545_ic_logo_basic_large.png
Academia
U. Penn. Implementation of Computation Lab
2009
08
31
08/31/2009
133
Nachiket Kapre
Paper
Electronic Design Automation
Nachiket Kapre,nachiket@ieee.org
d0e01d7c-0f0f-4dcb-8b28-e0f98e87f914
Single Pass Depth Peeling via CUDA Renderer
Multi-fragment effects, which require operations on more than one fragment per pixel, play important roles in many graphics applications. The classical depth peeling algorithm provides a simple but robust solution by peeling off one layer each pass, but multiple rasterizations become a performance bottleneck for large and complex scenes. Ideally, we prefer to capture and sort multiple fragments in a single pass, which is difficult because the fragments generated in the graphics pipeline are not allowed to be scattered to arbitrary positions of the render targets. Compute unified device architecture (CUDA) provides more flexible control over the GPU memory, but access to the fragments generated by the graphics pipeline is not yet supported. In this work we design a CUDA rasterizer so that many graphics applications can benefit from the free control of GPU memory, especially for multi-fragment effects. We present two efficient schemes to capture and sort multiple fragments per pixel in a single geometry pass via the atomic operations of CUDA without read-modify-write (RMW) hazards. Experimental results show significant speedup over classical depth peeling, especially for large scenes.
/content/cudazone/CUDABrowser/assets/images/applications/544_dragon_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/544_dragon_large.jpg
Research
Institute of Software, Chinese Academy of Sciences
2009
08
10
08/10/2009
10
Open source
Fang Liu
Paper
Graphics
Fang Liu,liuf@ios.ac.cn
b4cb74db-42cd-4def-86f0-65c87bd36187
FOLKI GPU
Fast Optical Flow on GPU at video rate for full HD resolution
/content/cudazone/CUDABrowser/assets/images/applications/543_icone_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/543_icone_large.jpg
Research
Onera
http://www.onera.fr
2009
07
23
07/23/2009
100
Open source
Aurelien Plyer
Application
Code
Computational Fluid Dynamics
Imaging
Signal Processing
Video & Audio
Aurelien Plyer,aurelien.plyer@gmail.com
bdb4e756-195f-4243-94f7-b10b7e11e2bd
Iterative CUDA
Iterative CUDA is a CUDA-based solver package for large, sparse linear systems.
/content/cudazone/CUDABrowser/assets/images/applications/542_sparse-city-small_small.png
/content/cudazone/CUDABrowser/assets/images/applications/542_sparse-city-small_large.png
Academia
Brown University
http://brown.edu
2009
08
01
08/01/2009
10
Open source
Andreas Kloeckner
Code
Computational Fluid Dynamics
Numerics
Libraries
Science
Andreas Kloeckner,kloeckner@dam.brown.edu,solver,cg,iterative,linear system
c620d360-fac3-43d3-a045-9c1aae75ec57
SARRACUDA: Synthetic Aperture Radar Range-doppler Algorithm using CUDA
This application is a GPU version of a Synthetic Aperture Radar focusing algorithm. The implemented algorithm is the Range-Doppler algorithm, one of the most accurate and widely used. Synthetic Aperture Radar (SAR) is an imaging radar for earth observation from satellite and airborne manned/unmanned platforms; it is currently operational in recently launched polar-orbiting platforms such as TerraSAR-X, RadarSAT-2 and Cosmo-SkyMed as well as in previous missions. Applications are tailored to disaster observation and management, mapping of renewable resources, geological mapping, snow/ice mapping and strategic surveillance of military sites. The data stream produced by high resolution SAR systems may exceed 1 Gb/s, and real-time or near real-time processing represents a demanding requirement for on-board or even ground-based processing systems. The remote sensing community and the space agencies spend a considerable amount of time and money every year to implement efficient and accurate processors for SAR data. Moreover, the scientific community is increasingly oriented toward a wide range of applications where the first step is the focusing of SAR data. The recent development and diffusion of multicore platforms opens new horizons and breaks barriers in the design of architectures for massively parallel processing of SAR data, without losing resolution and/or accuracy.
/content/cudazone/CUDABrowser/assets/images/applications/541_sarracuda_small.png
/content/cudazone/CUDABrowser/assets/images/applications/541_sarracuda_large.png
Academia
Universita degli Studi del Sannio
http://www.ing.unisannio.it
2009
08
05
08/05/2009
15
Open source
Carmine Clemente
Paper
Signal Processing
Remote Sensing
Carmine Clemente,carmineclemente@gmail.com
9ca18ade-f66f-4f40-9743-e6ebb760de33
Libra SDK
Libra SDK is a scientific developer kit for building simple and fast cross CPU-GPU applications suited for scientific computations. Libra 1.1 SDK includes a MATLAB-style C/C++ API, sample programs and documentation. Example code and a downloadable trial version are available from the GPU Systems website http://www.gpusystems.com
/content/cudazone/CUDABrowser/assets/images/applications/540_logo_bg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/540_logo_bg_large.png
Commercial
GPU Systems
http://www.gpusystems.com
2009
06
24
06/24/2009
Commercial
Marco Hjerpe
Multimedia
Paper
Computational Fluid Dynamics,Digital Content Creation,Electronic Design Automation,Finance,Game Physics,Graphics,Imaging,Medical Imaging,Numerics,Life Sciences,Libraries,Oil & Gas,Science,Signal Processing,Video & Audio,matlab programming
Marco Hjerpe,marco.hjerpe@gpusystems.com,CPU,GPU,C++ programming,gpgpu,matlab,CUDA,OpenCL
87d90d59-5c3a-4094-a8cf-0e8da5326193
Real-time optical manipulation of micron sized structures using GPU generated holograms
Holographic optical tweezers allow the three-dimensional, dynamic, multipoint manipulation of micron-sized dielectric objects. Exploiting the massively parallel architecture of modern GPUs, we can generate highly optimized holograms at video frame rate, allowing the interactive micromanipulation of complex structures.
/content/cudazone/CUDABrowser/assets/images/applications/539_slm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/539_slm_large.png
Academia
CNR-INFM, CRS-SOFT Dipartimento di Fisica, Universita di Roma La Sapienza
2009
07
23
07/23/2009
350
S. Bianchi
R. Di Leonardo
Paper
Imaging
Science
S. Bianchi,R. Di Leonardo
a0d09099-5643-406d-9d4a-9e7053425028
The Living Application: a Self-Organising System for Complex Grid Tasks
We present the living application, a method to autonomously manage applications on the grid. During its execution on the grid, the living application makes choices on the resources to use in order to complete its tasks. These choices can be based on the internal state, or on autonomously acquired knowledge from external sensors. By giving limited user capabilities to a living application, the living application is able to port itself from one resource topology to another. The application performs these actions at run-time without depending on users or external workflow tools. We demonstrate this new concept in a special case of a living application: the living simulation. Today, many simulations require a wide range of numerical solvers and run most efficiently if specialized nodes are matched to the solvers. The idea of the living simulation is that it decides itself which grid machines to use based on the numerical solver currently in use. In this paper we apply the living simulation to modelling the collision between two galaxies in a test setup with two specialized computers. This simulation switches at run-time between a GPU-enabled computer in the Netherlands and a GRAPE-enabled machine that resides in the United States, using an oct-tree N-body code whenever it runs in the Netherlands and a direct N-body solver in the United States.
/content/cudazone/CUDABrowser/assets/images/applications/538_self-organism_small.png
/content/cudazone/CUDABrowser/assets/images/applications/538_self-organism_large.png
Academia
Section Computational Science, University of Amsterdam, Amsterdam, the Netherlands
2009
07
23
07/23/2009
D. Groen
S. Harfst
S. Portegies Zwart
Paper
Numerics
Science
Signal Processing
D. Groen, S. Harfst, S. Portegies Zwart,djgroen@science.uva.nl
057f342d-84b4-4a12-ae35-5208c51ed958
Synthetic Aperture Radar Back-Projection Algorithm
Synthetic Aperture Radar (SAR) uses microwaves to create images of the earth. These images provide information not visible to the naked eye, and can be made regardless of visibility conditions. SAR image formation requires massive amounts of computation and is hard to do in real-time. The best SAR processing algorithm, known as back-projection, is O(N^3) where N is the number of pixels -- which can be many thousands. To reduce computation, suboptimal FFT-based algorithms have traditionally been used despite the various limitations and image degradation effects these algorithms have. The back-projection algorithm is, however, ideal for a highly parallel processor like NVIDIA's GPGPUs. At the Brigham Young University Microwave Earth Remote Sensing Laboratory we have been able to take advantage of the GPGPU's massive processing power to reduce the processing time for a 1500x1600 image from 31 minutes in a well-optimized, single-threaded C implementation down to 5.6 seconds using one of the four processors of an NVIDIA S1070. This is even faster than many FFT-based algorithms! We hope to continue to build on this speedup to make further advancements in SAR imaging.
/content/cudazone/CUDABrowser/assets/images/applications/537_sonar_small.png
/content/cudazone/CUDABrowser/assets/images/applications/537_sonar_large.png
Academia
Brigham Young University Microwave Earth Remote Sensing Laboratory
http://www.mers.byu.edu/SAR.html#YIFSAR
2009
08
03
08/03/2009
300
David G. Long
Multimedia
Paper
Signal Processing
David G. Long,long@ee.byu.edu
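The back-projection entry above describes a summation over every pixel and every radar pulse, which is where the cubic cost comes from. The following is a minimal NumPy sketch of that idea under stated assumptions, not the BYU implementation: the function name `backproject`, its parameters, the 2-D geometry, and the assumption that the input has already been range-compressed are all illustrative.

```python
import numpy as np

def backproject(data, ranges, antenna_xy, grid_x, grid_y, wavelength):
    """Time-domain back-projection sketch (hypothetical interface).

    data       : complex array (pulses, range_bins), assumed range-compressed
    ranges     : 1-D increasing array of range-bin distances in metres
    antenna_xy : array (pulses, 2) with the antenna position for each pulse
    grid_x/y   : 1-D arrays with the pixel coordinates of the output image
    """
    px, py = np.meshgrid(grid_x, grid_y)
    image = np.zeros(px.shape, dtype=complex)
    for pulse, (ax, ay) in zip(data, antenna_xy):
        # Distance from this antenna position to every pixel.
        dist = np.hypot(px - ax, py - ay)
        # Look up the echo at that distance (linear interpolation).
        sample = (np.interp(dist, ranges, pulse.real)
                  + 1j * np.interp(dist, ranges, pulse.imag))
        # Undo the two-way propagation phase so contributions add coherently.
        image += sample * np.exp(4j * np.pi * dist / wavelength)
    return image
```

Each pixel is computed independently of the others, which is why the algorithm maps naturally onto one GPU thread per pixel; the loop over pulses is the only sequential part in this sketch.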
069854c1-e5b6-44b1-84a9-eb00831c8fae
Julia 4D
Ray tracing of quaternion Julia sets.
/content/cudazone/CUDABrowser/assets/images/applications/540_Julia4D_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/540_Julia4D_large.jpg
Research
homemade
2009
08
02
08/02/2009
Charles Strub
Application
Multimedia
Graphics
Numerics
Charles Strub,charles.strub@gmail.com,Julia 4D quaternion ray tracing
7cc11512-08ba-4264-a4c7-fa1c31ae47b2
cudaseg (Fast Level Set Segmentation of Biomedical Images using Graphics Processing Units)
In this project we have engineered a parallel level set implementation using the NVIDIA CUDA framework to accelerate image and volume segmentations. The final source code and thesis can be downloaded on this site.
/content/cudazone/CUDABrowser/assets/images/applications/538_cudasegall_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/538_cudasegall_large.jpg
Academia
University of Oxford
2009
06
02
06/02/2009
Hormuz Mostofi
Application
Multimedia
Paper
Code
Graphics
Imaging
Hormuz Mostofi,
d86e78ef-8ada-44e6-ad42-fc6386e55cc0
Cholesky Decompositions
Cholesky factorization for dense matrices, reaching a 450x speedup on a GTX 285.
/content/cudazone/CUDABrowser/assets/images/applications/536_http_imgload.cgi_small.png
/content/cudazone/CUDABrowser/assets/images/applications/536_http_imgload.cgi_large.png
Freelance
2009
09
05
09/05/2009
450
lixiuyu
Application
Science
lixiuyu,cyrosly@163.com
34d84e30-9ec7-42e6-a411-63810d133fc4
A GPU based GPS software receiver
Off-the-shelf graphics processing units provide low-cost massively parallel computing performance, which can be utilized for the implementation of a GPS software receiver. In order to realize a real-time capable system, the crucial stages of the receiver should be optimized to suit the requirements of a parallel processor. Moreover, the receiver should be capable of providing wider correlation functions and easy access to the spectral domain of the signals. Thus, the most suitable correlation algorithm, which forms the core part of each receiver, should be chosen and implemented on the graphics processor. Since the sampling rate of the received signal limits the real-time capabilities of the software radio, it is necessary to determine an optimum value, considering that the precision of the observable varies with sampling bandwidth. We discuss details and present our single-frequency multi-channel implementation, which is capable of operating in real-time mode. Our implementation differs from other solutions in the width of the correlation function and allows simple handling of data in the spectral domain. Comparison with output from a commercial hardware receiver, which shares the antenna with the software radio, confirms the consistency and accuracy of our development.
/content/cudazone/CUDABrowser/assets/images/applications/535_gpsgpu_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/535_gpsgpu_large.jpg
Research
National Institute of Information and Communications Technology, Japan
http://www.nict.go.jp
2009
08
08
08/08/2009
Thomas Hobiger
Paper
Science
Signal Processing
Thomas Hobiger,hobiger@nict.go.jp
ce75850e-89b5-4c0f-af44-1f5f66f91cd1
A framework for efficient and scalable execution of domain-specific templates on GPUs
Graphics Processing Units (GPUs) have emerged as important players in the transition of the computing industry from sequential to multi- and many-core computing. We propose a software framework for execution of domain specific parallel templates on GPUs, which simultaneously raises the abstraction level of GPU programming and ensures efficient execution with forward scalability to large data sizes and new GPU platforms. To achieve scalable and efficient GPU execution, our framework focuses on two critical problems that have been largely ignored in previous efforts - processing large data sets that do not fit within the GPU memory, and minimizing data transfers between the host and GPU. Our framework takes domain-specific parallel programming templates that are expressed as parallel operator graphs, and performs operator splitting, offload unit identification, and scheduling of off-loaded computations and data transfers between the host and the GPU, to generate a highly optimized execution plan. Finally, a code generator produces a hybrid CPU/GPU program in accordance with the derived execution plan, that uses lower level frameworks such as CUDA. We have applied the proposed framework to templates from the recognition domain, specifically edge detection kernels and convolutional neural networks that are commonly used in image and video analysis. We present results on two different GPU platforms from NVIDIA (a Tesla C870 GPU computing card and a GeForce 8800 graphics card) that demonstrate 1.7 - 7.8X performance improvements over already accelerated baseline GPU implementations. We also demonstrate scalability to input data sets and application memory footprints of 6GB and 17GB, respectively, on GPU platforms with only 768MB and 1.5GB of memory.
/content/cudazone/CUDABrowser/assets/images/applications/534_image_small.png
/content/cudazone/CUDABrowser/assets/images/applications/534_image_large.png
Commercial
NEC Labs, Berkeley, Purdue
2009
05
01
05/01/2009
Narayanan Sundaram
Anand Raghunathan
Srimat T. Chakradhar
Paper
Imaging
Medical Imaging
machine learning
Narayanan Sundaram, Anand Raghunathan, and Srimat T. Chakradhar
a9aaf71b-fcf4-4cab-a580-5029383afb71
Massively Parallel Population-Based Monte Carlo Methods
Implementation of population-based MCMC and a sequential Monte Carlo sampler for inference in a Gaussian mixture model and a particle filter for a factor stochastic volatility state-space model.
/content/cudazone/CUDABrowser/assets/images/applications/533_b1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/533_b1_large.png
Academia
University of Oxford
2009
05
14
05/14/2009
500
Open source
Anthony Lee
Christopher Yau
Michael B. Giles
Arnaud Doucet
Christopher C. Holmes
Application
Paper
Code
Statistics
Anthony Lee,lee@stats.ox.ac.uk
f549113c-45f8-4aff-96b4-b89db7abe5bb
3D Image Deconvolution on GPUs
A popular approach to solving the inverse problem of image deconvolution is to use iterative methods. Iterative deconvolution can provide better results than simpler methods at a cost of higher computational complexity and processing time. In this work we investigate the use of graphics processing units (GPUs) and CUDA to accelerate the execution of one such iterative algorithm, the Richardson-Lucy (RL) algorithm. We compare performance results for a number of 3D Richardson-Lucy implementations on both the CPU and GPU, showing that our best GPU implementation, using Fourier space convolutions (CUFFT), significantly outperforms our best CPU implementation, which uses a publicly available and highly optimised Fast Fourier Transform (FFT) library. L. Domanski, P. Vallotton, and D. Wang. Two and Three-Dimensional Image Deconvolution on Graphics Hardware. In Anderssen, R.S., R.D. Braddock and L.T.H. Newham (eds) 18th World IMACS/MODSIM Congress, Cairns, Australia, pages 1010--1016, 13-17 July 2009. ISBN: 978-0-9758400-7-8. http://www.mssanz.org.au/modsim09/C5/domanski.pdf
/content/cudazone/CUDABrowser/assets/images/applications/532_psfteaser_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/532_psfteaser_large.jpg
Research
Commonwealth Scientific and Industrial Research Organisation
http://www.csiro.au/
2009
07
13
07/13/2009
Luke Domanski
Multimedia
Paper
Imaging
Medical Imaging
Luke Domanski,Luke.Domanski@csiro.au,image deconvolution, image restoration, microscopy, CUDA, CUFFT
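The deconvolution entry above refers to Richardson-Lucy iterations with the convolutions carried out in Fourier space. A minimal CPU sketch of that update using NumPy's FFT follows. It is an illustration under stated assumptions, not the authors' CUFFT code: the function name `richardson_lucy`, the circular boundary handling, the centred same-shape PSF, and the small `eps` guard are all choices made here.

```python
import numpy as np

def richardson_lucy(observed, psf, iterations=50, eps=1e-12):
    """Richardson-Lucy deconvolution with FFT-based circular convolution.

    observed : non-negative array of any dimensionality (e.g. a 3-D stack)
    psf      : array of the same shape, centred at index shape // 2 and
               normalised to sum to 1 (assumptions of this sketch)
    """
    # Transfer function of the blur; ifftshift moves the PSF centre to index 0.
    otf = np.fft.fftn(np.fft.ifftshift(psf))
    estimate = np.full(observed.shape, observed.mean(), dtype=float)
    for _ in range(iterations):
        # Forward model: blur the current estimate.
        blurred = np.fft.ifftn(np.fft.fftn(estimate) * otf).real
        # Compare with the data; eps guards against division by ~0.
        ratio = observed / np.maximum(blurred, eps)
        # Correlate the ratio with the PSF (conjugate transfer function).
        correction = np.fft.ifftn(np.fft.fftn(ratio) * np.conj(otf)).real
        # Multiplicative update; clip tiny negative round-off values.
        estimate *= np.maximum(correction, 0.0)
    return estimate
```

Every iteration costs four FFTs of the full volume, which is the part a GPU FFT library would take over in an accelerated version.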
0313f2ec-fa80-4682-8fb2-8e855c9f2e66
PAPER - Accelerating Parallel Evaluations of ROCS
PAPER is a GPU-accelerated implementation of Gaussian molecular shape overlay (the algorithm in OpenEye ROCS) running on NVIDIA graphics cards. We have demonstrated multiple-order-of-magnitude speedups relative to a CPU-based implementation of the same algorithm, and 5x speedup relative to OpenEye ROCS even on low-end graphics hardware (an NVIDIA 8600GT).
/content/cudazone/CUDABrowser/assets/images/applications/531_gpuROCS_thumb_small.png
/content/cudazone/CUDABrowser/assets/images/applications/531_gpuROCS_thumb_large.png
Academia
Department of Computer Science, Stanford University
2009
05
06
05/06/2009
35
Imran Haque
Application
Paper
Code
Life Sciences
Imran Haque,ihaque@cs.stanford.edu,paper openeye rocs
ffe53df7-0183-443c-a269-710b724d1cb7
librysq
librysq is a C/C++ implementation of the Rys quadrature for computing arbitrary electron repulsion integrals on CPU and CUDA GPUs. A FORTRAN interface is provided for compatibility with existing chemistry packages.
/content/cudazone/CUDABrowser/assets/images/applications/529_MOS-902-8-400x300_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/529_MOS-902-8-400x300_large.jpg
Research
SourceForge
2009
03
29
03/29/2009
Andrey Asadchev
Paper
Numerics
Science
Andrey Asadchev,
0fc84b69-6d38-4463-adae-0d6d3ad2fdb0
GPU Flame Fractal Renderer
Renderer for flam3 cosmic recursive fractal flames implemented on GPU. Requires a CUDA-capable graphics card.
/content/cudazone/CUDABrowser/assets/images/applications/528_screenshot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/528_screenshot_large.jpg
Research
SourceForge
2009
07
24
07/24/2009
Keldor
Application
Graphics
Keldor,Keldor@users.sourceforge.net
027d6d57-37b1-4657-9df6-394c24092014
Combining Molecular Dynamics with Bayesian Analysis To Predict and Evaluate Ligand-Binding Mutations in Influenza Hemagglutinin
The influenza virus infects people and animals by binding to complex sugar molecules on the surface of the respiratory tract. Bird viruses bind most strongly to bird cell-surface sugars and human viruses bind most strongly to human cell-surface sugars. As the recent swine-origin influenza virus has demonstrated, there is considerable overlap between the binding ability of human and pig viruses to cells of the other host. Changes to this binding affinity are one key component for viruses to make a jump between species, and it is difficult to predict the necessary mutations ahead of time. We would like to predict high-risk mutations to enable better surveillance and early control of potential inter-species transmission events. This work represents a first step in that direction, as we examine mutations to H5N1 avian influenza that alter ligand binding. We use Folding@Home as a powerful computational screen to evaluate mutations that will eventually require experimental testing to verify.
/content/cudazone/CUDABrowser/assets/images/applications/527_ja904_small.png
/content/cudazone/CUDABrowser/assets/images/applications/527_ja904_large.png
Academia
Departments of Chemistry and Structural Biology, Stanford University
2009
07
28
07/28/2009
Peter M Kasson
Paper
Life Sciences
Peter M Kasson,kasson@stanford.edu,folding@home influenza
d1233002-2132-43e2-8527-3bf5159ddf19
ViVid
Python framework for video processing and content analysis using CUDA for acceleration.
/content/cudazone/CUDABrowser/assets/images/applications/525_6702-Water_Life_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/525_6702-Water_Life_large.jpg
Research
SourceForge
2009
04
18
04/18/2009
Dennis Lin
Mert Dikmen
Code
Video & Audio
Dennis Lin,Mert Dikmen,Dennis_Lin@sourceforge.net
0b495833-3a0f-45e7-88ab-43b3f28cc0fe
SSbump Generator
A GUI interface to a tool for generating SSBumps (Self Shadowed Bump Maps). Includes a CUDA GPU rendering extension.
/content/cudazone/CUDABrowser/assets/images/applications/524_screenshot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/524_screenshot_large.jpg
Research
SourceForge
2008
12
31
12/31/2008
SARGE
ssbumpgenerator
Application
Imaging
Numerics
Science
SARGE,ssbumpgenerator,SARGE@users.sourceforge.net
2444c3a0-b3b0-4584-9b11-6f566f9030ee
Open64 Compiler and Tools
The Open64 Compiler and Tools site is dedicated to the continued development of the former SGI Pro64(TM) compiler for the IA64, x86, CUDA and MIPS architectures.
/content/cudazone/CUDABrowser/assets/images/applications/523_nvidia-2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/523_nvidia-2_large.jpg
Research
NVIDIA
http://www.nvidia.com
2009
04
04
04/04/2009
Alban Douillet
Juergen Ributzka
Suneel Jain
Application
Numerics
Alban Douillet,Juergen Ributzka,Suneel Jain,adouillet@nvidia.com
90e8e358-dfd7-493e-b829-36373e4ab5ee
CUDA-EC
A fast parallel error correction tool for short reads.
/content/cudazone/CUDABrowser/assets/images/applications/522_cuda-ec_small.png
/content/cudazone/CUDABrowser/assets/images/applications/522_cuda-ec_large.png
Academia
Nanyang Technological University
2009
04
07
04/07/2009
Haixiang Shi
Application
Numerics
Haixiang Shi,Haixiang_Shi@users.sourceforge.net
4c8b5fb1-15cd-4251-bec3-f6de3a414800
pfsRTtmo
This project provides realtime implementations of popular HDR tone mapping operators on GeForce 8800 GPUs using the CUDA programming environment.
/content/cudazone/CUDABrowser/assets/images/applications/521_screenshot_thumb_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/521_screenshot_thumb_large.jpg
Research
SourceForge
2008
12
31
12/31/2008
07/30/2008
Peter Kipfer
Application
Imaging
Science
Peter Kipfer,prkipfer@users.sourceforge.net
6de04003-5433-4de2-bf86-c308ac51fd12
GPU Accelerated Real Time HDR Rendering
A real-time interactive display was developed to showcase timelapse photos by using motion estimation results to produce unique high-dynamic range images as a function of the viewer's position in front of the display.
/content/cudazone/CUDABrowser/assets/images/applications/520_ir_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/520_ir_large.jpg
Academia
University of Toronto
http://www.eyetap.org
2009
05
05
05/05/2009
Raymond Lo
Eric Tran
Multimedia
Graphics
Imaging
Raymond Lo,Eric Tran
63d283aa-d137-4705-899c-cbb174ef07ba
GPU accelerated dose calculations for radiotherapy
We developed a ray-tracing algorithm for radiotherapy dose calculations that enables (nearly) real-time calculation of the dose for realistic radiotherapy patient data sets. This reduces the workload for manual determination of the optimal treatment plan. It also speeds up automated optimization of (advanced) radiotherapy treatment plans and/or re-planning after on-line imaging of the patient.
/content/cudazone/CUDABrowser/assets/images/applications/519_dosedistro_1e6_ptv_only_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/519_dosedistro_1e6_ptv_only_large.jpg
Academia
Academic Medical Center, University of Amsterdam
http://www.amc.nl/radiotherapie
2009
08
12
08/12/2009
10
M.de Greef
Multimedia
Paper
Numerics
Life Sciences
Science
M.de Greef,m.degreef@amc.uva.nl
ce36336b-2796-4924-9cb9-a79e4d7992e6
OpenMS
An open-source framework for mass spectrometry
/content/cudazone/CUDABrowser/assets/images/applications/518_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/518_logo_large.png
Academia
Center for Bioinformatics, Saarland University
http://bioinf-www.bioinf.uni-sb.de/
2009
01
14
01/14/2009
Rene Hussong
Paper
Code
Life Sciences
Rene Hussong,rene@bioinf.uni-sb.de,openms proteomics
228465d3-3c07-4215-ad0f-8bba8d3f87a8
parallel for
A data parallel scientific programming model. Compiles efficiently to different platforms like distributed memory (MPI), shared memory multi-processor (pthreads), Cell BE processor, NVIDIA CUDA, SIMD vectorization (SSE, Altivec), and sequential C++ code.
/content/cudazone/CUDABrowser/assets/images/applications/517_simd_mimd_small.png
/content/cudazone/CUDABrowser/assets/images/applications/517_simd_mimd_large.png
Research
CISCO
2008
12
31
12/31/2008
GWZ
Code
Numerics
Science
GWZ,gwz@cisco.com
2c01c333-373e-49f3-9b2e-41a3d14db455
multiDAC
multiDAC is intended to become a user-friendly tool for image and video processing in the field of deformation/movement analysis. It is written in C# with some C routines using CPU/GPU parallelization (e.g. CUDA).
/content/cudazone/CUDABrowser/assets/images/applications/516_screenshot_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/516_screenshot_large.jpg
Research
SourceForge
2009
07
30
07/30/2009
purzel42
Application
Video & Audio
purzel42,purzel@users.sourceforge.net
feb82039-4088-43ec-9118-1d2a1c80b349
CUDA-NN
A parallel version of Neural Networks using CUDA for optimization, data mining, etc.
/content/cudazone/CUDABrowser/assets/images/applications/515_datamining7_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/515_datamining7_large.jpg
Academia
Nanyang Technological University
2008
12
31
12/31/2008
Haixiang Shi
Code
Numerics
Life Sciences
Science
Haixiang Shi,Haixiang_Shi@users.sourceforge.net
f11599cd-b3a5-4b08-a9af-801b67ebd826
IllustStudio
IllustStudio is a paint tool that lets users draw pen strokes similar to real ones and expand their range of expression. IllustStudio includes CUDA-enabled filters that achieve high-speed filtering through GPU computation. According to our research*, processing with CUDA is 35 times faster than without CUDA. * Ratio as measured by CELSYS.
/content/cudazone/CUDABrowser/assets/images/applications/514_illuststudio_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/514_illuststudio_large.jpg
CELSYS,Inc.
http://www.celsys.co.jp/
2009
07
29
07/29/2009
35
CELSYS,Inc.
Application
Digital Content Creation
Graphics
CELSYS,Inc.
12ba6f44-1cdc-4fad-a5de-7d9e052f76dc
CUDA-SVM
A fast parallel SVM tool based on CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/513_svm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/513_svm_large.png
Nanyang Technological University
Academia
2008
12
31
12/31/2008
Haixiang Shi
Code
Numerics
Haixiang Shi,Haixiang_Shi@users.sourceforge.net
3662fbe9-eeec-4413-b43c-d42054cbfa52
CUDA-GA
A fast parallel genetic algorithm using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/512_GAArt_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/512_GAArt_large.jpg
Academia
Nanyang Technological University
2008
12
31
12/31/2008
Haixiang Shi
Code
Numerics
Life Sciences
Science
Haixiang Shi,Haixiang_Shi@users.sourceforge.net
76057c07-a388-46d9-af21-b9bfcc4453c3
CUDA-PSO
A parallel version of Particle Swarm Optimization (PSO) using NVIDIA's CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/511_swarm_intelligence_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/511_swarm_intelligence_large.jpg
Academia
Nanyang Technological University
2008
12
31
12/31/2008
Haixiang Shi
Code
Imaging
Science
Haixiang Shi,Haixiang_Shi@users.sourceforge.net
f4e8deee-ef9f-4047-bafd-0701a9a1bc27
Magnetohydrodynamics simulations on graphics processing units
Magnetohydrodynamics (MHD) simulations based on the ideal MHD equations have become a powerful tool for modeling phenomena in a wide range of applications including laboratory, astrophysical, and space plasmas. In general, high-resolution methods for solving the ideal MHD equations are computationally expensive, and Beowulf clusters or even supercomputers are often used to run the codes that implement these methods. With the advent of the Compute Unified Device Architecture (CUDA), modern graphics processing units (GPUs) provide an alternative approach to parallel computing for scientific simulations. In this paper we present, to the authors' knowledge, the first implementation to accelerate computation of MHD simulations on GPUs. Numerical tests have been performed to validate the correctness of our GPU MHD code. Performance measurements show that our GPU-based implementation achieves speedups of 2 (1D problem with 2048 grids), 106 (2D problem with 1024^2 grids), and 43 (3D problem with 128^3 grids), respectively, compared to the corresponding serial CPU MHD implementation.
/content/cudazone/CUDABrowser/assets/images/applications/510_GPU_MHD_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/510_GPU_MHD_new_large.jpg
Academia
Faculty of IT, Macau University of Science and Technology
2009
09
01
09/01/2009
100
Hon-Cheng Wong
Un-Hong Wong
Paper
Science
Computational Physics
Hon-Cheng Wong,hcwong@ieee.org
e8dc2667-cce6-47b8-8fe4-c0e18e14972b
CUDA-ClustalW
CUDA-ClustalW is publicly available open-source software for high-speed computation of large MSAs running on CUDA-enabled GPUs based on clustalw-2.0.9. The project has been tested on a GeForce GTX 280 graphics card.
/content/cudazone/CUDABrowser/assets/images/applications/509_p53_Hsap_Mmus_Rnor_Frub_ClustalW_6Kb_angle_800p_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/509_p53_Hsap_Mmus_Rnor_Frub_ClustalW_6Kb_angle_800p_large.jpg
Research
SourceForge.net
2009
04
01
04/01/2009
nkcslyc
Application
Numerics
nkcslyc,
49b7c770-57ef-4530-b9ea-ea804d21c7ff
Cuda_Wrapper
The CUDA wrapper library provides a means of efficient resource sharing and resource protection on multi-user GPU clusters. It implements the following functionality: 1) Virtualization of the physical GPU devices 2) Ensuring NUMA affinity for GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/507_numerics_rayleighbenard3d_small.png
/content/cudazone/CUDABrowser/assets/images/applications/507_numerics_rayleighbenard3d_large.png
Academia
University of Illinois at Urbana-Champaign
2009
07
21
07/21/2009
Guochun Shi
Jeremy Enos
Code
Numerics
Libraries
Guochun Shi,Jeremy Enos,gshi@ncsa.uiuc.edu
54d76ef2-f4b0-4e80-a3c3-ee8338606f13
CUDA Neural Network
Implementation of a feed-forward backpropagation artificial neural network using CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/506_neural_network_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/506_neural_network_large.jpg
Research
SourceForge
2008
12
03
12/03/2008
Pyrevenant
Application
Life Sciences,Libraries,Science
Pyrevenant
eabc31be-c665-455a-95bd-0d0e7dd532ab
cuda-z
A simple program that displays information about CUDA-enabled devices. It includes a GPU performance test.
/content/cudazone/CUDABrowser/assets/images/applications/505_CUDA-Z_2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/505_CUDA-Z_2_large.png
Research
SourceForge.net
2009
04
13
04/13/2009
Andriy Golovnya
Application
Numerics
Andriy Golovnya,andrew_golovnia@users.sourceforge.net
bbf86cd6-59ac-443b-ab02-8ba8ef3bbf60
Computation of Troposphere Slant Delays on a GPU
The computation of ray-traced troposphere delays, which can be utilized for space geodetic applications, is a time-consuming effort when a large number of rays has to be calculated. On the other hand, computation time can be tremendously reduced when algorithms are capable of supporting parallel processing architectures. Thus, by the use of an off-the-shelf graphics processing unit (GPU), it is demonstrated that troposphere slant delays can be computed very efficiently, without loss of accuracy. An adopted ray-tracing algorithm is presented, and results from GPU computations are compared with those obtained from calculations on a standard personal computer's CPU.
/content/cudazone/CUDABrowser/assets/images/applications/504_IEEE_GPU_figureC_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/504_IEEE_GPU_figureC_new_large.jpg
Research
National Institute of Information and Communications Technology, Japan
http://www.nict.go.jp
2009
06
26
06/26/2009
18
Hobiger Thomas
Ichikawa Ryuichi
Koyama Yasuhiro
Paper
Geoscience
Science
Hobiger Thomas, Ichikawa Ryuichi, Koyama Yasuhiro, Kondo Tetsuro
e97270b7-8c73-45fa-ac57-96e9ab59ca88
Cuda ITK
This project shows how to integrate the NVIDIA CUDA GPU programming API into the ITK (Insight Segmentation and Registration Toolkit) library.
/content/cudazone/CUDABrowser/assets/images/applications/503_226314_small.png
/content/cudazone/CUDABrowser/assets/images/applications/503_226314_large.png
Academia
Harvard University
2009
06
28
06/28/2009
Won-Ki Jeong
Paper
Numerics
e9097706-9b7a-4d12-9104-a44bd1952348
Phobos
Phobos is a continuous map-reduce framework built upon NVIDIA CUDA
/content/cudazone/CUDABrowser/assets/images/applications/502_1_PHOBOS_461_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/502_1_PHOBOS_461_large.jpg
Academia
HKUST
2009
01
01
01/01/2009
Wenbin Fang
Code
Libraries
Wenbin Fang,saven@cse.ust.hk
212c683f-9d5a-4654-9edc-c3e3fcfe8727
cudatemplates
"CUDA Templates" is a collection of C++ template classes and functions which provide a consistent interface to NVIDIA's "Compute Unified Device Architecture" (CUDA), hiding much of the complexity of the underlying CUDA functions from the programmer.
/content/cudazone/CUDABrowser/assets/images/applications/501_CUDATemplates_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/501_CUDATemplates_large.jpg
Academia
Technische Universität Graz
2008
12
31
12/31/2008
Markus Grabner
Application
Numerics
Science
d321a794-7411-4623-a966-4586c0d149e8
Application of a Kinetic Theory based solver of the Euler Equations using GPU
Presented is a modified form of the Quiet Direct Simulation (QDS) method [1] adapted to use Graphics Processing Units (GPUs) for flux calculation. Fluxes between source and destination cells calculated by QDS are flux-vector split and (on a regular Cartesian grid) a function of the source cell alone. The resulting advantage is the rapid calculation of fluxes between cells without the prior exchange of information between them, allowing highly efficient calculation on GPUs. Various flow problems have been solved, and consistent speed-ups of over 35 times (when compared to an equivalent single-CPU code) are reported.
/content/cudazone/CUDABrowser/assets/images/applications/500_kinetic_small.png
/content/cudazone/CUDABrowser/assets/images/applications/500_kinetic_large.png
Academia
National Centre for High Performance Computing, Hsinchu, Taiwan
http://www.nchc.org.tw/en/
2009
05
18
05/18/2009
35
Matthew Smith
Paper
Computational Fluid Dynamics
Matthew Smith,msmith@nchc.org.tw,Quiet Direct Simulation, Kinetic Theory
426e2b01-84c7-4c8c-a935-c652aee3ba78
Conjugate Gradient CUDA and CPU solvers for float, double and quad precision
Free CUDA CG! Take advantage of our full-featured 150 GFlop/s Conjugate Gradient CUDA and CPU solvers for float, double and quad precision for free.
/content/cudazone/CUDABrowser/assets/images/applications/499_CG_small.png
/content/cudazone/CUDABrowser/assets/images/applications/499_CG_large.png
Commercial
Elegant Mathematics Ltd
http://www.elegant-mathematics.com/
2009
01
08
01/08/2009
Open source
Elegant Mathematics Ltd
Code
Numerics
Elegant Mathematics Ltd,info@elegant-mathematics.com
ef371856-7d1b-4d97-ab4c-6e73f9925992
GAMER: a GPU-Accelerated Adaptive Mesh Refinement Code for Astrophysics
We present the newly developed code, GAMER (GPU-accelerated Adaptive MEsh Refinement code), which adopts a novel approach to improve the performance of adaptive mesh refinement (AMR) astrophysical simulations by a large factor with the use of the graphics processing unit (GPU). The AMR implementation is based on a hierarchy of grid patches with an oct-tree data structure. We adopt a three-dimensional relaxing TVD scheme for the hydrodynamic solver, and a multi-level relaxation scheme for the Poisson solver. Both solvers have been implemented on the GPU, so that hundreds of patches can be advanced in parallel. The computational overhead associated with the data transfer between CPU and GPU is carefully reduced by utilizing the capability of asynchronous memory copies in the GPU, and the computing time of the ghost-zone values for each patch is hidden by overlapping it with the GPU computations. We demonstrate the accuracy of the code by performing several standard test problems in astrophysics. GAMER is a parallel code that can be run in a multi-GPU cluster system. We measure the performance of the code by performing purely baryonic cosmological simulations in different hardware implementations, in which detailed timing analyses provide comparison between the computations with and without GPU acceleration. Maximum speed-up factors of 12.19 and 10.47 are demonstrated using 1 GPU with 4096^3 effective resolution and 16 GPUs with 8192^3 effective resolution, respectively.
/content/cudazone/CUDABrowser/assets/images/applications/498_fig18_small.png
/content/cudazone/CUDABrowser/assets/images/applications/498_fig18_large.png
Academia
Department of Physics, National Taiwan University
2009
07
30
07/30/2009
12
Hsi-Yu Schive
Paper
Computational Fluid Dynamics
Science
Hsi-Yu Schive,b88202011@ntu.edu.tw
ab53f652-f3d5-41b7-ab63-10bbda728871
Data Parallel Bin-Based Indexing for Answering Queries on Multi-Core Architectures
The multi-core trend in CPUs and GPUs offers new opportunities for the database community. The increase of cores at exponential rates is likely to affect virtually every server and client in the coming decade, and presents database management systems with a huge, compelling disruption that will radically change how processing is done. This paper presents a new parallel indexing data structure for answering queries that takes full advantage of the increasing thread-level parallelism emerging in multi-core architectures. In our approach, our Data Parallel Bin-based Index Strategy (DP-BIS) first bins the base data, and then partitions and stores the values in each bin as a separate, bin-based data cluster. In answering a query, the procedures for examining the bin numbers and the bin-based data clusters offer the maximum possible level of concurrency; each record is evaluated by a single thread and all threads are processed simultaneously in parallel. We implement and demonstrate the effectiveness of DP-BIS on two multi-core architectures: a multi-core CPU and a GPU. The concurrency afforded by DP-BIS allows us to fully utilize the thread-level parallelism provided by each architecture--for example, our GPU-based DP-BIS implementation simultaneously evaluates over 12,000 records with an equivalent number of concurrently executing threads. In comparing DP-BIS's performance across these architectures, we show that the GPU-based DP-BIS implementation requires significantly less computation time to answer a query than the CPU-based implementation. We also demonstrate in our analysis that DP-BIS provides better overall performance than the commonly utilized CPU and GPU-based projection index. Finally, due to data encoding, we show that DP-BIS accesses significantly smaller amounts of data than index strategies that operate solely on a column's base data; this smaller data footprint is critical for parallel processors that possess limited memory resources (e.g. GPUs).
/content/cudazone/CUDABrowser/assets/images/applications/497_960_small.png
/content/cudazone/CUDABrowser/assets/images/applications/497_960_large.png
Academia
University of California at Davis
2009
06
02
06/02/2009
18
Luke J. Gosink
Paper
Data Parallel Database Indexing
Luke J. Gosink,jgosink@ucdavis.edu
47cb4eda-e210-4db8-bb80-d6ae342dd454
Physical-Space Refraction-Corrected Transmission Ultrasound Computed Tomography Made Computationally Practical
Transmission Ultrasound Computed Tomography (CT) is strongly affected by the acoustic refraction properties of the imaged tissue, and proper modeling and correction of these effects is crucial to achieving high-quality image reconstructions. Excellent results can be obtained when these physics effects are incorporated, but at considerable computational expense. We have used CUDA to conceive a framework that implements refractive Ultrasound CT and meets the interactive demands of clinical practice, without a loss in reconstruction quality.
/content/cudazone/CUDABrowser/assets/images/applications/497_us_img_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/497_us_img_large.jpg
Academia
Stony Brook University
2008
09
11
09/11/2008
85
Klaus Mueller
Paper
Medical Imaging
Klaus Mueller,mueller@cs.sunysb.edu
b912f9a1-2627-4d4f-9f4f-4da3eff3ca78
Python Parallel Utilities
NVIDIA CUDA and MPI Python wrappers. These wrappers are written in pure C; no SWIG or Boost is necessary. The CUDA wrapper exposes the CUDA runtime and driver APIs.
/content/cudazone/CUDABrowser/assets/images/applications/496_smoothed_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/496_smoothed_large.jpg
Academia
Seismic Laboratory for Imaging and Modeling
2008
12
31
12/31/2008
Sean Ross-Ross
Paper
Programming Tools
c58a8810-432f-4757-a91f-c80faabe20ab
Signal Integrity Simulations
Agilent Technologies Inc. (NYSE:A) announced its work with NVIDIA to accelerate signal integrity simulations using NVIDIA's Compute Unified Device Architecture (CUDA)-based Graphics Processing Units (GPUs). The association is expected to yield the commercial release of a GPU-enabled Advanced Design System (ADS) Transient Convolution Simulator that will allow signal integrity designers to run these simulations dramatically faster than was previously possible.
/content/cudazone/CUDABrowser/assets/images/applications/495_hyperlinx-eye_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/495_hyperlinx-eye_large.jpg
Commercial
EDA Geek News Staff in Models, Simulations
2008
08
26
08/26/2008
EDA Geek News Staff in Models, Simulations
Paper
Signal Processing
contact_us@agilent.com
d0dbf768-8c4a-45f6-a6c3-8c38dc100a98
Applying Modern Soft and Hardware Technologies for Computational Steering Approaches in Computational Fluid Dynamics
In this article we present an educational simulation tool, FlowSim 2007 CUDA edition, a computational steering application for interactive 2D flow simulation based on the Lattice Boltzmann Method. The application combines a comfortable user interface as well as a convenient development platform on the one hand and a high performance flow solver on the other hand. The user interface is implemented using the Microsoft .NET Framework whereas the Lattice Boltzmann kernel is based on the Compute Unified Device Architecture (CUDA) by nVIDIA running on GeForce 8 series featuring G8X GPUs [2]. The gap between the managed intermediate language (IL) code and the hardware specific native code is filled using the recently introduced C++/CLI programming language [1]. We demonstrate that this integrated desktop approach can deliver a performance that exceeds that of a high end PC by at least an order of magnitude. In our conclusion we will focus on extensions to three dimensions and clusters of GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/494_p175_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/494_p175_large.jpg
Academia
Technology Institute at TU Braunschweig
2007
10
26
10/26/2007
Jan Linxweiler
Jonas Tölke Manfred Krafczyk
Paper
Computational Fluid Dynamics
Science
Jan Linxweiler,Jonas Tölke,Manfred Krafczyk,
a6e5c287-cebe-4a52-b00f-7ec58a5dbdd2
Computer generated hologram with geometric occlusion using GPU-accelerated depth buffer rasterization for three-dimensional display
We present a method of rapidly producing computer-generated holograms that exhibit geometric occlusion in the reconstructed image. Conceptually, a bundle of rays is shot from every hologram sample into the object volume. We use z buffering to find the nearest intersecting object point for every ray and add its complex field contribution to the corresponding hologram sample. Each hologram sample belongs to an independent operation, allowing us to exploit the parallel computing capability of modern programmable graphics processing units (GPUs). Unlike algorithms that use points or planar segments as the basis for constructing the hologram, our algorithm's complexity is dependent on fixed system parameters, such as the number of ray-casting operations, and can therefore handle complicated models more efficiently. The finite number of hologram pixels is, in effect, a windowing function, and from analyzing the Wigner distribution function of the windowed free-space transfer function we find an upper limit on the cone angle of the ray bundle. Experimentally, we found that an angular sampling distance of 0.01° for a 2.66° cone angle produces acceptable reconstruction quality.
/content/cudazone/CUDABrowser/assets/images/applications/493_h15g_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/493_h15g_large.jpg
Academia
University of Cambridge, Electrical Engineering Dept.
2009
07
17
07/17/2009
Rick H.-Y. Chen
Timothy D. Wilkinson
Paper
Rick H.-Y. Chen,Timothy D. Wilkinson
6f121072-b6d3-47ba-9e6b-e872695eaaf8
Real-Time Fringe Pattern Generation with High Quality
A hologram computation procedure and its GPU implementation are presented. The procedure is based on partitioning. Each segment has an approximate but simpler frequency domain representation. Quality of the results is comparable to Fresnel holograms.
/content/cudazone/CUDABrowser/assets/images/applications/492_3d-scan1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/492_3d-scan1_large.jpg
Academia
Department of Electrical and Electronics Engineering, Bilkent University
2009
04
30
04/30/2009
Hoonjong Kang
Fahri Yaraş, Levent Onural
Paper
Imaging
Science
Hoonjong Kang, Fahri Yaraş, Levent Onural,hjkang@ee.bilkent.edu.tr
922028e2-56a9-45cf-99de-ceaa8d0a5370
Real-Time Multiple SLM Color Holographic Display Using Multiple GPU Acceleration
A real-time color holographic video display system computes holograms from a point cloud of a rigid object using a multi-GPU system and uses three differently colored LEDs for reconstruction. Experimental results are satisfactory.
/content/cudazone/CUDABrowser/assets/images/applications/491_slm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/491_slm_large.png
Academia
Dept. of Electrical and Electronics Eng., Bilkent University
2009
04
30
04/30/2009
Fahri Yaraş
Hoonjong Kang Levent Onural
Paper
Imaging
Science
Video & Audio
Fahri Yaraş, Hoonjong Kang,Levent Onural,
8bbd6e15-496a-4bb3-9053-b3811821e510
Fast Hardware-Accelerated Volume Rendering of CT Scans
As CT scanning is a very common medical imaging method, we propose new hardware-based algorithms using GPU (Graphics Processing Unit) programming for rapid visualization. First, 3D volumes are constructed from CT scans. Then volume rendering is used to display anatomical structures via algorithms founded on improved ray casting and 2D textures. Our methods achieve interactive rendering rates and require only an ordinary PC with an off-the-shelf graphics card. We expect our approach to be useful to medical practitioners for handling modern, large-scale medical datasets.
/content/cudazone/CUDABrowser/assets/images/applications/490_ct_head_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/490_ct_head_large.jpg
Academia
Zhejiang University
2007
12
01
12/01/2007
Ronghua Liang
Zhigeng Pan Meleagros Krokos
Paper
Medical Imaging
Life Sciences
Ronghua Liang, Zhigeng Pan, Meleagros Krokos,zgpan@cad.zju.edu.cn
f10e6a41-90c1-4b44-8a9a-a427e74974f8
GPU-Based Acceleration Method for Coherent Holographic Stereogram Calculation
In this paper, we show a GPU-based acceleration method for coherent holographic stereogram calculation, and demonstrate a performance gain of more than a factor of 10 compared with CPU-based computing.
/content/cudazone/CUDABrowser/assets/images/applications/489_mobius_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/489_mobius_large.jpg
Academia
Department of Electrical and Electronics Engineering, Bilkent University
2008
03
16
03/16/2008
10
Hoonjong Kang
Takeshi Yamaguchi, Hiroshi Yoshikawa
Paper
Imaging
Science
Hoonjong Kang, Takeshi Yamaguchi,Hiroshi Yoshikawa,hjkang@ee.bilkent.edu.tr
f9c1b8ad-db0d-429f-8863-03ded9a69dab
Atmospheric wavefront phase recovery by use of specialized hardware: graphical processing units and field-programmable gate arrays
To achieve the wavefront phase-recovery stage of an adaptive-optics loop computed in real time for 32x32 or a greater number of subpupils in a Shack-Hartmann sensor, we present here, for what is to our knowledge the first time, preliminary results that we obtained by using innovative techniques: graphical processing units (GPUs) and field-programmable gate arrays (FPGAs). We describe the stream-computing paradigm of the GPU and adapt a zonal algorithm to take advantage of the parallel computational power of the GPU. We also present preliminary results we obtained by use of FPGAs on the same algorithm. GPUs have proved to be a promising technique, but FPGAs are already a feasible solution to adaptive-optics real-time requirements, even for a large number of subpupils.
/content/cudazone/CUDABrowser/assets/images/applications/488_08_06a_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/488_08_06a_large.jpg
Academia
University of La Laguna, Spain
2004
12
31
12/31/2004
Jose G. Marichal-Hernandez
Luis F. Rodriguez-Ramos Fernando Rosa
Paper
Imaging
Science
Jose G. Marichal-Hernandez, Luis F. Rodriguez-Ramos, Fernando Rosa,tpc3dtvcon09@tnt.uni-hannover.de
8a5f261c-09d8-42a5-8726-7100bdde85c8
Acceleration method of computing a compensated phase-added stereogram
We have implemented experimental code to compute a compensated phase-added stereogram (CPAS), which was proposed in a previous paper, on a graphic processing unit (GPU). In this paper, we show an acceleration method for CPAS computation by means of the GPU and compare the computation time between CPU-based and GPU-based calculations, which are programmed in our laboratories. In addition, we demonstrate their reconstructed images. As a result, we could achieve a performance gain of a factor of over 33 compared with a CPU-based computing environment and digital holograms can be displayed at 30 frames per second with 15,000 points.
/content/cudazone/CUDABrowser/assets/images/applications/487_stereo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/487_stereo_large.png
Academia
Department of Electrical and Electronics Engineering, Bilkent University
2008
10
24
10/24/2008
33
Hoonjong Kang
Takeshi Yamaguchi Hiroshi Yoshikawa
Paper
Imaging
Science
Hoonjong Kang, Takeshi Yamaguchi, Hiroshi Yoshikawa, hjkang@ee.bilkent.edu.tr
784de106-a069-4478-8a7e-92a0aed3649b
Hologram synthesis for photorealistic reconstruction
Computation of diffraction patterns, and thus holograms, of scenes with photorealistic properties is a highly complicated and demanding process. An algorithm, based primarily on computer graphics methods, for computing full-parallax diffraction patterns of complicated surfaces with realistic texture and reflectivity properties is proposed and tested. The algorithm is implemented on single-CPU, multiple-CPU and GPU platforms. An alternative algorithm, which implements reduced occlusion diffraction patterns for much faster but somewhat lower quality results, is also developed and tested. The algorithms allow GPU-aided calculations and easy parallelization. Both numerical and optical reconstructions are conducted. The results indicate that the presented algorithms compute diffraction patterns that provide successful photorealistic reconstructions; the computation times are acceptable especially on the GPU implementations.
/content/cudazone/CUDABrowser/assets/images/applications/486_image018_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/486_image018_large.jpg
Academia
JOSA A
2008
11
24
11/24/2008
Martin Janda
Ivo Hanak Levent Onural
Paper
Imaging
Numerics
Science
Martin Janda,Ivo Hanak, Levent Onural,mjandakiv@zcu.cz
9d2fbd2e-e241-478f-90a1-f40cb04ed084
Real-time digital holographic microscopy
Digital holographic microscopy (DHM) is a well-known, powerful method allowing both the amplitude and phase of a specimen to be simultaneously observed. In order to obtain a reconstructed image from a hologram, numerous calculations for the Fresnel diffraction are required. The Fresnel diffraction can be accelerated by the FFT (Fast Fourier Transform) algorithm. However, real-time reconstruction from a hologram is difficult even if we use a recent central processing unit (CPU) to calculate the Fresnel diffraction by the FFT algorithm. In this paper, we describe a real-time DHM system using a graphics processing unit (GPU) with many stream processors, which allows use as a highly parallel processor. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from holograms of 512x512 grid points at 24 frames per second.
/content/cudazone/CUDABrowser/assets/images/applications/485_holo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/485_holo_large.png
Academia
Graduate School of Science and Engineering, Yamagata University
2009
07
23
07/23/2009
Tomoyoshi Shimobaba
Yoshikuni Sato Junya Miura
Multimedia Paper
Imaging
Science
Numerics
Tomoyoshi Shimobaba,Yoshikuni Sato,Junya Miura,shimo@yz.yamagata-u.ac.jp
5068f09c-0433-4ba5-9565-f77fbc04d4c8
Real-time liquid-crystal atmosphere turbulence simulator
To generate time-evolving atmosphere turbulence in real time, a phase-generating method for our liquid-crystal (LC) atmosphere turbulence simulator (ATS) is derived based on the Fourier series (FS) method. A real matrix expression for generating turbulence phases is given and calculated with a graphics processing unit (GPU), the GeForce 8800 Ultra. A liquid crystal on silicon (LCOS) device with 256x256 pixels is used as the turbulence simulator. The total time to generate a turbulence phase is about 7.8 ms for calculation and readout with the GPU. A parallel processing method of calculating and sending a picture to the LCOS is used to improve the simulating speed of our LC ATS. Therefore, the real-time turbulence phase-generation frequency of our LC ATS is up to 128 Hz. To our knowledge, it is the highest speed used to generate a turbulence phase in real time.
/content/cudazone/CUDABrowser/assets/images/applications/484_simulator_small.png
/content/cudazone/CUDABrowser/assets/images/applications/484_simulator_large.png
Academia
Changchun Institute of Optics, Fine Mechanics and Physics
2009
04
17
04/17/2009
Lifa Hu
Li Xuan Dayu Li
Paper
Numerics
Science
Lifa Hu,Li Xuan,Dayu Li,hulifa@ciomp.ac.cn
dfaea93f-1724-4e37-b329-5ee4848f3988
GPU-assisted high-resolution, real-time 3-D shape measurement
This paper describes a Graphics Processing Unit (GPU)-assisted real-time three-dimensional shape measurement system. Our experiments demonstrated that the absolute coordinates calculation and rendering speed of a GPU is more than four times faster than that of a dual CPU workstation with the same graphics card. By implementing the GPU into our system, we realized simultaneous absolute coordinate acquisition, reconstruction and display at 30 frames per second with a resolution of approximately 266K points per frame. Moreover, a 2+1 phase-shifting algorithm was employed to alleviate the measurement error caused by motion. Applications of the system include medical imaging, manufacturing, entertainment, and security.
/content/cudazone/CUDABrowser/assets/images/applications/483_face_small.png
/content/cudazone/CUDABrowser/assets/images/applications/483_face_large.png
Academia
Mathematics Department, Harvard University
2006
10
02
10/02/2006
4
Song Zhang
Dale Royer Shing-Tung Yau
Paper
Imaging
Numerics
Science
Song Zhang,Dale Royer,Shing-Tung Yau,szhang77@gmail.com
6ab09d58-6ea6-4025-8af8-e1925bef8dce
Computer generated holography
We have applied the graphics processing unit (GPU) to computer generated holograms (CGH) to overcome the high computational cost of CGH and have compared the speed of a GPU implementation to a standard CPU implementation. The calculation speed of a GPU (GeForce 6600, nVIDIA) was found to be about 47 times faster than that of a personal computer with a Pentium 4 processor. Our system can realize real-time reconstruction of a 64-point 3-D object at video rate using a liquid-crystal display of resolution 800x600.
/content/cudazone/CUDABrowser/assets/images/applications/482_computer-generated-hologram_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/482_computer-generated-hologram_large.jpg
Academia
Department of Medical System Engineering Chiba University
2008
12
31
12/31/2008
47
Nobuyuki Masuda
Tomoyoshi Ito Takashi Tanaka
Paper
Imaging
Numerics
Video & Audio
Nobuyuki Masuda,Tomoyoshi Ito,Takashi Tanaka,masudanb@faculty.chiba-u.jp
028ff6b3-3641-497e-8515-37370d59d3c3
Flow visualization and flow cytometry with holographic video microscopy
The video stream captured by an in-line holographic microscope can be analyzed on a frame-by-frame basis to track individual colloidal particles' three-dimensional motions with nanometer resolution, and simultaneously to measure their sizes and refractive indexes. Through a combination of hardware acceleration and software optimization, this analysis can be carried out in near real time with off-the-shelf instrumentation. An efficient particle identification algorithm automates initial position estimation with sufficient accuracy to enable unattended holographic tracking and characterization. This technique's resolution for particle size is fine enough to detect molecular-scale coatings on the surfaces of colloidal spheres, without requiring staining or fluorescent labeling. We demonstrate this approach to label-free holographic flow cytometry by detecting the binding of avidin to biotinylated polystyrene spheres.
/content/cudazone/CUDABrowser/assets/images/applications/481_laser_small.png
/content/cudazone/CUDABrowser/assets/images/applications/481_laser_large.png
Academia
Department of Physics and Center for Soft Matter Research, New York University, New York
2009
07
17
07/17/2009
Fook Chiong Cheong
Bo Sun Remi Dreyfus
Paper
Imaging
Science
Fook Chiong Cheong,Bo Sun,Remi Dreyfus,david.grier@nyu.edu
8dba8b26-2a21-43a9-945b-ec9f04d5ff5d
A QAP Solver with CUDA GPU Computing Architecture
This application solves the quadratic assignment problem (QAP) [1]. In QAP, we are given l locations and l facilities, and the task is to assign the facilities to the locations so as to minimize the cost. We chose QAP for the following reasons. First, problem sizes of QAPs in real-life problems are relatively small compared with other problems in permutation domains, such as the traveling salesman problem (TSP) and the scheduling problem. This enables us to use the shared memory of a GPU effectively. Second, QAP is one of the most difficult problems among problems in permutation domains. Thus, QAP is a good test bed for evaluating an optimization algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/480_qap03_small.png
/content/cudazone/CUDABrowser/assets/images/applications/480_qap03_large.png
Academia
Graduate School of Science, Osaka Prefecture University
2008
12
31
12/31/2008
Noriyuki Fujimoto
Shigeyoshi Tsutsui
Paper
Numerics
Science
Noriyuki Fujimoto,Shigeyoshi Tsutsui,fujimoto@mi.s.osakafu-u.ac.jp
ca10a525-1ae8-4b4e-9720-2243074cb32e
A GPU Accelerated Evolutionary Computer Vision System
We have used the graphics processing unit (GPU) of the graphics card to create an evolutionary image processing system which is able to learn how to detect a user-specified object in an image. The system receives an image sequence as input. The user only has to tell the system where this object is located. This is done by using the mouse pointer. The user simply moves the mouse over the desired object and then presses the mouse button as long as the object is located under the mouse pointer. The user follows this object over several frames while keeping the mouse button pressed. As this is being done, the system evolves a population of image processing algorithms by exploiting the power of the GPU at interactive rates. Our system is the first GPU accelerated evolutionary image processing system (Figure 1) which allows the automatic creation of object detection algorithms [2]. This is the first step towards building fully adaptive evolutionary vision systems [1].
/content/cudazone/CUDABrowser/assets/images/applications/479_ducks_small.png
/content/cudazone/CUDABrowser/assets/images/applications/479_ducks_large.png
Academia
Universität Tübingen
2008
12
31
12/31/2008
45
Eberhard Karls
Paper
Imaging
Numerics
Science
Eberhard Karls,marc.ebner@wsii.uni-tuebingen.de
49ece6ac-6aa1-492c-8ceb-6a748939c306
GPU-based Acceleration of the Genetic Algorithm
Genetic algorithms (GAs) are stochastic optimization methods inspired by natural evolution. Because of their parallel nature, they have been parallelized many times. Graphics Processing Units (GPUs) were originally targeted at rasterization of graphics primitives. Today GPUs are more like fast multi-core processors capable of performing complex mathematical tasks. There are many ways to exploit the potential of GPUs for general-purpose computation (GPGPU). One option is to employ the Compute Unified Device Architecture (CUDA) framework.
/content/cudazone/CUDABrowser/assets/images/applications/478_voronoi_knauss_oesterle_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/478_voronoi_knauss_oesterle_large.jpg
Academia
Brno University of Technology, Bozetechova 2
2008
12
31
12/31/2008
2600
Petr Pospichal
Jiri Jaros
Numerics
Science
Petr Pospichal,Jiri Jaros,xpospi45@stud.t.vutbr.cz
4aa192d6-70c5-4040-9fa2-7ee690f988dc
Parallel Ant System for Traveling Salesman Problem
Ant Colony Optimization (ACO) is a meta-heuristic introduced in 1991 by Dorigo et al. for the TSP (Dorigo, 1992). This algorithm is inspired by the natural behavior of real ants. Ants usually communicate via pheromone trails, i.e., an ant lays down some amount of pheromone on the path it has passed. An ant's tendency to choose a specific path is positively correlated with the intensity of the trail. The pheromone trail evaporates over time if no pheromone is laid down by other ants. If many ants lay down pheromone on a specific path, the intensity attracts more ants toward this path. Although ACO has outstanding performance on the TSP, it requires a huge execution time for large-scale TSP instances. However, ACO has a highly parallelizable structure (Talbi, Roux, Fonlupt, & Robillard, 1999; Stützle, 1998). In this work, we choose NVIDIA's CUDA programming model and the Tesla C1060 as the platform to implement our Parallel ACO.
/content/cudazone/CUDABrowser/assets/images/applications/477_AntLines_small.png
/content/cudazone/CUDABrowser/assets/images/applications/477_AntLines_large.png
Academia
Taiwan Evolutionary Intelligence Laboratory (TEIL) Department of Electrical Engineering, National Taiwan University
2008
12
31
12/31/2008
21
Ying-Shiuan You
Paper
Numerics
Science
Ying-Shiuan You,r97921039@ntu.edu.tw
8fc9a55c-d66e-4870-8536-634fad8c6d4a
StarPU
StarPU is a unified runtime system that offers support for heterogeneous multicore architectures (CPUs, GPUs, Cell's SPUs, ...). Its unified execution model is tightly coupled with a high-level data management library and provides a convenient way to develop and tune powerful scheduling algorithms. StarPU therefore makes it possible to actually get the benefits of hybrid systems in a portable fashion.
/content/cudazone/CUDABrowser/assets/images/applications/476_starpu-lu-dag_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/476_starpu-lu-dag_large.jpg
Research
INRIA
http://www.inria.fr
2009
07
06
07/06/2009
Open source
Cedric Augonnet
Application Code
Libraries
runtime system, task scheduling, data management, portability
8d5ab3be-6289-4b09-8e1f-db37f6d927b0
Optimization of Primality Testing Methods
Modern fast primality testing uses a combination of Strong Probable Prime (SPRP) rejection tests. We find more powerful combinations by intensive search of the vast domain of SPRP test configurations. Evolutionary guidance using previous promising results boosts search speed. We implement the entire search on the GPU with the CUDA programming language, resulting in a 65-fold speedup over a CPU search. This project has already found a test an order of magnitude more powerful than the best previously known.
/content/cudazone/CUDABrowser/assets/images/applications/474_rabin_miller_1_small.PNG
/content/cudazone/CUDABrowser/assets/images/applications/474_rabin_miller_1_large.PNG
Academia
2008
12
31
12/31/2008
65
Steve Worley
Paper
Numerics
Science
Steve Worley,comments@worley.com
b8b898ab-530a-4119-9976-a20a5fdc492b
Particle Swarm Optimization
The increasing interest of researchers in using low-cost GPUs for applications requiring intensive parallel computing is due to the ability of these devices to solve parallelizable problems much faster than traditional sequential processors. The first applications of evolutionary algorithms (EAs) on GPUs were developed to solve specific image processing problems; at the beginning they used texture rendering for the encoding and evaluation of individuals, and most of the time tasks like pseudo-random number generation and other evolutionary operations were executed on the CPU. This project presents an approach for the implementation of PSO algorithms on GPUs which, by means of the nVIDIA CUDA environment, avoids the use of textures as data structures and performs all evolution on the GPU, reducing as much as possible the exchange of data with the CPU.
/content/cudazone/CUDABrowser/assets/images/applications/473_phase_small.png
/content/cudazone/CUDABrowser/assets/images/applications/473_phase_large.png
Academia
Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Parma
2008
12
31
12/31/2008
50
Luca Mussi
Stefano Cagnoni
Paper
Numerics
Luca Mussi,Stefano Cagnoni,mussi@ce.unipr.it,cagnoni@ce.unipr.it
d23dc8ee-f770-49ab-ab28-abe32d6d2d10
Video Game Tools Used For Defense Needs
Video gaming computers and video game consoles available today typically contain a graphics processing unit (GPU), which is very efficient at manipulating and displaying computer graphics. However, the unit's highly parallel structure also makes it more efficient than a general-purpose central processing unit for a range of complex calculations important to defense applications.
/content/cudazone/CUDABrowser/assets/images/applications/472_commandandconquer-775336_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/472_commandandconquer-775336_large.jpg
Academia
Georgia Institute of Technology Research News
2009
06
24
06/24/2009
350
Georgia Institute of Technology Research News
Paper
Game Physics
Video Game
1f321cbb-73a3-4321-8858-cf8f5d246fe1
Using Evolutionary Computing on Consumer Graphics Hardware for Epistasis Analysis in Human Genetics
Biological systems are both complex and robust. Because of this, epistasis, or gene-gene interaction, is thought to be a ubiquitous component of common human diseases. Unfortunately, due to the non-linear nature of these interactions, detecting and characterizing epistasis requires algorithms which are combinatorial in complexity. One such algorithm is Multifactor Dimensionality Reduction (MDR). Expert-knowledge-guided evolutionary computing wrappers around MDR have previously been shown to be a powerful way to efficiently analyze datasets for interactions. Evolutionary computing can effectively address some of the challenges these datasets present. Unfortunately, examining the statistical significance of results requires permutation testing, which increases the computation requirements by a factor of 1000. Here we implement an expert-knowledge-guided ant system on graphics processing units (GPUs) and show that the GPU implementation makes the rigorous statistical analysis of large datasets practical.
/content/cudazone/CUDABrowser/assets/images/applications/471_karyotype_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/471_karyotype_large.jpg
Academia
Dartmouth Medical School
2009
07
24
07/24/2009
Nicholas A. Sinnott-Armstrong
Casey S. Greene Jason H. Moore
Paper
Life Sciences
Science
Nicholas A. Sinnott-Armstrong,Casey S. Greene,Jason H. Moore,Epistasis Analysis, Consumer app, human genetics
5a9e5a89-919c-4735-872b-a0670bb94480
How GPUs can outperform ASICs for fast LDPC decoding
Due to huge computational requirements, powerful Low-Density Parity-Check (LDPC) error correcting codes, discovered in the early 1960s, have only recently been adopted by emerging communication standards. LDPC decoders are supported by VLSI technology, which delivers good parallel computational power with excellent throughputs, but at the expense of significant costs. In this work, we propose an alternative flexible LDPC decoder that exploits data-parallelism for simultaneous multicodeword decoding, supported by multithreading on CUDA-based graphics processing units (GPUs). The ratio of arithmetic operations per memory access is low for the efficient min-sum LDPC decoding algorithm proposed, which causes a bottleneck due to memory latency and data collisions. We propose runtime data realignment to allow coalesced parallel memory accesses to be performed by distinct threads inside the same warp. The memory access patterns of LDPC codes are random, which does not admit the simultaneous use of coalescence in both read and write operations of the decoding process. To overcome this problem we have developed a data mapping transformation which allows new addresses to be contiguously accessed for one of the mentioned memory access types. Our implementation shows throughputs above 100Mbps and BER curves that compare well with ASIC solutions.
/content/cudazone/CUDABrowser/assets/images/applications/469_QPPldpcgraph_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/469_QPPldpcgraph_large.jpg
Academia
University of Coimbra, Coimbra, Portugal
2008
12
31
12/31/2008
Gabriel Falcao
Vitor Silva Leonel Sousa
Paper
Numerics
Gabriel Falcao,Leonel Sousa,Vitor Silva
24d1eddb-c861-4eae-8746-e0bb6eb9c3f3
High performance genetic programming
The availability of low cost powerful parallel graphics cards has stimulated the porting of Genetic Programming (GP) to Graphics Processing Units (GPUs). Our work focuses on the possibilities offered by Nvidia G80 GPUs when programmed in the CUDA language. We compare two parallelization schemes that evaluate several GP programs in parallel. We show that the fine grain distribution of computations over the elementary processors greatly impacts performance. We also present memory and representation optimizations that further enhance computation speed, up to 2.8 billion GP operations per second. The code has been developed with the well known ECJ library.
/content/cudazone/CUDABrowser/assets/images/applications/468_mutation_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/468_mutation_large.jpg
Academia
Universite Lille Nord de France, Calais, France
2008
12
31
12/31/2008
Denis Robilliard
Virginie Marion Cyril Fonlupt
Paper
Numerics
Denis Robilliard,Virginie Marion,Cyril Fonlupt,poty@lil.univ-littoral.fr,genetic algorithms, genetic programming, parallel processing
27faebfe-cded-4c12-b4b5-88c50c12807c
A game loop architecture for the GPU used as a math coprocessor in real-time applications
This article concerns the use of a graphics processor unit (GPU) as a math co-processor in real-time applications, especially games and physics simulations. To validate this approach, we present a new game loop architecture that employs GPUs for general-purpose computations (GPGPUs). A critical issue here is the process distribution between the CPU and the GPU. The architecture consists of a model for distribution, and our implementation offers many advantages in comparison to other approaches without the GPGPU stage. This architecture can be used either by a general-purpose language such as the Compute Unified Device Architecture (CUDA), or shader languages such as the High-Level Shader Language (HLSL) and the OpenGL Shading Language (GLSL). Although the architecture proposed here aims at supporting mathematics and physics on the GPU, it is possible to adapt any kind of generic computation. This article discusses the model implementation in an open-source game engine and presents the results of using this platform.
/content/cudazone/CUDABrowser/assets/images/applications/467_Minna-de-Puzloop-1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/467_Minna-de-Puzloop-1_large.jpg
Academia
Instituto de Computacao, Universidade Federal Fluminense, Brazil
2008
12
31
12/31/2008
Marcelo P. M. Zamith
Esteban W. G. Clua Aura Conci
Paper
Game Physics,Numerics,Science
Marcelo P. M. Zamith,Esteban W. G. Clua,Aura Conci,esteban@inf.puc-rio.br,Game loop, real-time physics
3ac00157-b33b-4de3-87e4-b079e50c6f8a
A hardware redundancy and recovery mechanism for reliable scientific computation
General purpose computation on graphics processors (GPGPU) has rapidly evolved since the introduction of commodity programmable graphics hardware. With the appearance of GPGPU computation-oriented APIs such as AMD's Close to the Metal (CTM) and NVIDIA's Compute Unified Device Architecture (CUDA), we begin to see GPU vendors putting financial stakes into this non-graphics, one-time niche market. Major supercomputing installations are building GPGPU clusters to take advantage of massively parallel floating point capabilities, and Folding@Home has even released a GPU port of its protein folding distributed computation client. But in order for GPGPU to truly become important to the supercomputing community, vendors will have to address the heretofore unimportant reliability concerns of graphics processors. We present a hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures. Upon detecting an error, the system invokes an automatic recovery mechanism that only recomputes erroneous results. Our results show that our technique imposes less than a 1.5 x performance penalty and saves energy for GPGPU but is completely transparent to general graphics and does not affect the performance of the games that drive the market.
/content/cudazone/CUDABrowser/assets/images/applications/466_cuda-nbody-example_small.png
/content/cudazone/CUDABrowser/assets/images/applications/466_cuda-nbody-example_large.png
Academia
University of Virginia
2008
12
31
12/31/2008
Jeremy W. Sheaffer
David P. Luebke Kevin Skadron
Paper
Numerics
Science
Jeremy W. Sheaffer,David P. Luebke,Kevin Skadron
04b954f1-d86f-409f-a6ad-d3ac9b072663
Accelerated Pathfinding
In the past few years the programmable graphics processor (GPU) has evolved into an increasingly convincing computational resource for non-graphics applications. The GPU is especially well suited to address problem sets expressed as data parallel computation with the same program executed on many data elements concurrently. In pursuing a scalable navigation planning approach for many thousands of agents in crowded game scenes, developers became more attracted to decomposable movement algorithms that lend themselves to explicit parallelism. Pathfinding is one key computational intelligence action in games that is typified by intense search over sparse graph data structures. This paper describes an efficient GPU implementation of parallel global pathfinding using the CUDA programming environment, and demonstrates the GPU performance scaling advantage in executing an inherently irregular and divergent algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/465_image006_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/465_image006_large.jpg
Commercial
NVIDIA Corporation
http://www.nvidia.com
2008
12
31
12/31/2008
Avi Bleiweiss
Paper
Numerics
Avi Bleiweiss,ableiweiss@nvidia.com
2b0d90a1-b0dc-4834-833f-c633cdf8bd9b
BSGP: bulk-synchronous
We present BSGP, a new programming language for general purpose computation on the GPU. A BSGP program looks much the same as a sequential C program. Programmers only need to supply a bare minimum of extra information to describe parallel processing on GPUs. As a result, BSGP programs are easy to read, write, and maintain. Moreover, the ease of programming does not come at the cost of performance. A well-designed BSGP compiler converts BSGP programs to kernels and combines them using optimally allocated temporary streams. In our benchmark, BSGP programs achieve similar or better performance than well-optimized CUDA programs, while the source code complexity and programming time are significantly reduced. To test BSGP's code efficiency and ease of programming, we implemented a variety of GPU applications, including a highly sophisticated X3D parser that would be extremely difficult to develop with existing GPU programming languages.
/content/cudazone/CUDABrowser/assets/images/applications/464_6_small.JPG
/content/cudazone/CUDABrowser/assets/images/applications/464_6_large.JPG
Academia
Tsinghua University
2008
12
31
12/31/2008
Qiming Hou
Kun Zhou Baining Guo
Paper
Numerics
Qiming Hou,Kun Zhou,Baining Guo
0294e1a5-493d-432d-a5f8-66ca43222dc6
High performance discrete Fourier transforms
We present novel algorithms for computing discrete Fourier transforms with high performance on GPUs. We present hierarchical, mixed radix FFT algorithms for both power-of-two and non-power-of-two sizes. Our hierarchical FFT algorithms efficiently exploit shared memory on GPUs using a Stockham formulation. We reduce the memory transpose overheads in hierarchical algorithms by combining the transposes into a block-based multi-FFT algorithm. For non-power-of-two sizes, we use a combination of mixed radix FFTs of small primes and Bluestein's algorithm. We use modular arithmetic in Bluestein's algorithm to improve the accuracy. We implemented our algorithms using the NVIDIA CUDA API and compared their performance with NVIDIA's CUFFT library and an optimized CPU-implementation (Intel's MKL) on a high-end quad-core CPU. On an NVIDIA GPU, we obtained performance of up to 300 GFlops, with typical performance improvements of 2-4x over CUFFT and 8-40x improvement over MKL for large sizes.
/content/cudazone/CUDABrowser/assets/images/applications/463_fc100_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/463_fc100_large.jpg
Commercial
Microsoft Corporation
2008
12
31
12/31/2008
40
Naga K. Govindaraju
Brandon Lloyd Yuri Dotsenko
Paper
Numerics
Naga K. Govindaraju,Brandon Lloyd,Yuri Dotsenko,Algorithms, Design, Experimentation, Measurement, Performance
a0694fe3-9fac-4d87-9684-16832426b768
Wave field synthesis for 3D audio: architectural prospectives
In this paper, we compare the architectural perspectives of the Wave Field Synthesis (WFS) 3D-audio algorithm mapped on three different platforms: a General Purpose Processor (GPP), a Graphics Processor Unit (GPU) and a Field Programmable Gate Array (FPGA). Previous related work reveals that, up to now, WFS sound systems are based on standard PCs. However, on one hand, contemporary GPUs consist of many multiprocessors that can process data concurrently. On the other hand, recent FPGAs provide a huge level of parallelism and reasonably high performance potential, which can be exploited very efficiently by smart designers. Furthermore, new parallel programming environments, such as the Compute Unified Device Architecture (CUDA) from NVidia and Stream from ATI, give researchers full access to the GPU resources. We use CUDA to map the WFS kernel on a GeForce 8600GT GPU. Additionally, we implement a reconfigurable and scalable hardware accelerator for the same kernel, and map it onto Virtex4 FPGAs. We compare both architectural approaches against a baseline GPP implementation on a Pentium D at 3.4 GHz. Our conclusion is that in highly demanding WFS-based audio systems, a low-cost GeForce 8600GT desktop GPU can achieve a speedup of up to 8x compared to a modern Pentium D implementation. An FPGA-based WFS hardware accelerator consisting of a single rendering unit (RU) can provide a speedup of up to 10x compared to the Pentium D approach. It can fit into small FPGAs and consumes approximately 3 Watts. Furthermore, cascading multiple RUs into a larger FPGA can boost processing throughput to more than two orders of magnitude higher than a GPP-based implementation and an order of magnitude better than a low-cost GPU one.
/content/cudazone/CUDABrowser/assets/images/applications/462_wfs-objetos_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/462_wfs-objetos_large.jpg
Academia
Delft University of Technology, Delft, Netherlands
2008
12
31
12/31/2008
10
Dimitris Theodoropoulos
Catalin Bogdan Ciobanu Georgi Kuzmanov
Paper
Numerics
Video & Audio
Dimitris Theodoropoulos,Catalin Bogdan Ciobanu,Georgi Kuzmanov
b74a6976-b873-458c-acfe-6057b5eedf72
A compiler framework for optimization of affine loop nests
GPUs are a class of specialized parallel architectures with tremendous computational power. The new Compute Unified Device Architecture (CUDA) programming model from NVIDIA facilitates programming of general purpose applications on their GPUs. However, manual development of high-performance parallel code for GPUs is still very challenging. In this paper, a number of issues are addressed towards the goal of developing a compiler framework for automatic parallelization and performance optimization of affine loop nests on GPGPUs: 1) approach to program transformation for efficient data access from GPU global memory, using a polyhedral compiler model of data dependence abstraction and program transformation; 2) determination of optimal padding factors for conflict-minimal data access from GPU shared memory; and 3) model-driven empirical search to determine optimal parameters for unrolling and tiling. Experimental results on a number of kernels demonstrate the effectiveness of the compiler optimization approaches developed.
/content/cudazone/CUDABrowser/assets/images/applications/461_180px-Polytope_model_unskewed.svg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/461_180px-Polytope_model_unskewed.svg_large.png
Academia
The Ohio State University, Columbus, OH, USA
2008
12
31
12/31/2008
Muthu Manikandan Baskaran
Uday Bondhugula Sriram Krishnamoorthy
Paper
Numerics
Muthu Manikandan Baskaran,Uday Bondhugula,Sriram Krishnamoorthy
bd6d474d-c2bb-423c-b00f-fe9a2fedb280
Single-particle 3d reconstruction from cryo-electron microscopy images
Single-particle 3D reconstruction from cryo-electron microscopy (cryo-EM) images is a key application in the analysis of biological molecules, and its computational requirement is now beyond a petaflop for a high-resolution 3D structure. In this paper, we quantitatively analyze the workload, computational intensity and memory performance of the application, and parallelize it on an emerging multicore architecture, GPU-CUDA. Further, we apply a percolation technique to decouple computation from memory operations and orchestrate thread-data mapping to reduce the overhead of off-chip memory operations. Finally, we test our optimization strategy by porting the popular open-source package EMAN to GPU-CUDA, which achieves a relative speedup of about 10X over the original CPU-only EMAN. The experimental results also show that the proposed percolation programming greatly improves utilization of memory bandwidth and floating-point units.
/content/cudazone/CUDABrowser/assets/images/applications/460_kouzouseiri_image_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/460_kouzouseiri_image_large.jpg
Academia
Chinese Academy of Science, Beijing, China
2008
12
31
12/31/2008
10
Guangming Tan
Ziyu Guo Mingyu Chen
Paper
Imaging
Medical Imaging
Life Sciences
Science
Guangming Tan,Ziyu Guo,Mingyu Chen
0d938d85-37c4-41f6-8329-fad802f09c5e
All-pairs shortest-paths for large graphs
The all-pairs shortest-path problem is an intricate part of numerous practical applications. We describe a shared-memory, cache-efficient GPU implementation to solve transitive closure and the all-pairs shortest-path problem on directed graphs for large datasets. The proposed algorithmic design utilizes the resources available on the NVIDIA G80 GPU architecture using the CUDA API. Our solution generalizes to handle graph sizes that are inherently larger than the DRAM memory available on the GPU. Experiments demonstrate that our method is able to significantly speed up the processing of large graphs, making our method applicable for bioinformatics, internet node traffic, social networking, and routing problems.
/content/cudazone/CUDABrowser/assets/images/applications/459_2_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/459_2_new_large.jpg
Academia
University of Pennsylvania and Lockheed Martin
2008
12
31
12/31/2008
Gary J. Katz
Joseph T. Kider, Jr
Paper
Numerics
Science
Gary J. Katz, Joseph T. Kider, Jr
d8ab1ded-aa91-4d3f-be16-5b13f6a6a1e2
Program optimization space pruning for a multithreaded gpu
Program optimization for highly-parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers will be creating highly-parallel applications for these platforms, who lack the substantial experience and knowledge needed to maximize their performance. This creates a need for more structured optimization methods with means to estimate their performance effects. Furthermore these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and one relatively simple methodology for reducing the workload involved in the optimization process. This work is based on one such highly-parallel system, the GeForce 8800 GTX using CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% and still finds the optimal configuration for each of the studied applications.
/content/cudazone/CUDABrowser/assets/images/applications/458_deferredshadow_small.png
/content/cudazone/CUDABrowser/assets/images/applications/458_deferredshadow_large.png
Academia
2008
12
31
12/31/2008
Shane Ryoo
Christopher I. Rodrigues Sam S. Stone
Paper
Numerics
Shane Ryoo,Christopher I. Rodrigues,Sam S. Stone
018b5db4-b3cb-444f-be6c-778b8517c99b
Aspects of GPU for general purpose high performance computing
We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoint of parallel computing. The major weaknesses of GPUs relative to the newest supercomputers are summarized in four points: large SIMD vector length, small memory, absence of a fast L2 cache, and high register spill penalty. On the software side, we derive an optimal scheduling algorithm for latency hiding of host-device data transfer, and discuss SPMD parallelism on GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/457_GeForce_GTX_280_3qtr_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/457_GeForce_GTX_280_3qtr_large.jpg
Academia
The University of Tokyo
2008
12
31
12/31/2008
Reiji Suda
Takayuki Aoki Shoichi Hirasawa
Paper
Numerics
Reiji Suda,Takayuki Aoki,Shoichi Hirasawa
21306bd4-d5ae-4455-b202-c8bae8a17348
Software Pipelined Execution of Stream Programs
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multi-core architectures. This model allows programmers to specify the structure of a program as a set of filters that act upon data, and a set of communication channels between them. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on modern Graphics Processing Units (GPUs), as they support abundant parallelism in hardware. In this paper, we describe the challenges in mapping StreamIt to GPUs and propose an efficient technique to software pipeline the execution of stream programs on GPUs. We formulate this problem --- both scheduling and assignment of filters to processors --- as an efficient Integer Linear Program (ILP), which is then solved using ILP solvers. We also describe a novel buffer layout technique for GPUs which facilitates exploiting the high memory bandwidth available in GPUs. The proposed scheduling utilizes both the scalar units in GPU, to exploit data parallelism, and multiprocessors, to exploit task and pipeline parallelism. Further it takes into consideration the synchronization and bandwidth limitations of GPUs, and yields speedups between 1.87X and 36.83X over a single threaded CPU.
/content/cudazone/CUDABrowser/assets/images/applications/456_pipe_small.png
/content/cudazone/CUDABrowser/assets/images/applications/456_pipe_large.png
Academia
Supercomputer Education and Research Centre, Indian Institute of Science
2008
12
31
12/31/2008
37
Abhishek Udupa
R. Govindarajan Matthew J. Thazhuthaveetil
Paper
Numerics
Abhishek Udupa, R. Govindarajan,Matthew J. Thazhuthaveetil,mjt@csa.iisc.ernet.in
ea8c2995-d603-42a2-93f8-633811e8b9c2
Pervasive massively multithreaded GPU processors
This talk presents an overview of NVIDIA's SIMT architecture and some brief insights on how some CUDA programming paradigms map onto it. A brief history of SIMT is provided to explain how NVIDIA ended up implementing a unified SIMT processor core in its GPUs including how graphics shaders are mapped onto SIMT threads. In addition, a conceptual view of how a SIMT microarchitecture executes threads in parallel is provided. The talk wraps up by describing some pitfalls related to thread synchronization, memory access, and cache management and describes some key problem areas in SIMT programming that NVIDIA would like to address in the future.
/content/cudazone/CUDABrowser/assets/images/applications/455_nvidia_gpu_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/455_nvidia_gpu_large.jpg
Commercial
NVIDIA Corporation, Santa Clara, CA, USA
2008
12
31
12/31/2008
Michael C. Shebanow
Paper
Science
Numerics
Michael C. Shebanow,mshebanow@nvidia.com
d1cab6b3-cfb3-4feb-9d02-4517910a5cf0
A compiler and runtime system for enabling data mining applications
With increasing need for accelerating data mining and scientific data analysis on large data sets, and less chance to improve processor performance by simply increasing clock frequencies, multi-core architectures and accelerators like FPGAs and GPUs have become popular. A recent development in using GPU for general computing has been the release of CUDA (Compute Unified Device Architecture) by NVIDIA. CUDA allows GPU programming with C language-like features, thus easing the development of non-graphics applications on a GPU. However, several challenges still remain in programming the GPUs with CUDA, because CUDA involves explicit parallel programming and management of its complex memory hierarchy, as well as allocating device memory, moving data between CPU and device memory, and specification of thread grid configurations. In this paper, we offer a solution for the programmers to generate CUDA code by specifying the sequential reduction loop(s) with some information about the parameters. With program analysis and code generation, the applications are mapped to a GPU. Several additional optimizations are also performed by the middleware. We have evaluated our system using three popular data mining applications, k-means clustering, EM clustering, and Principal Component Analysis (PCA). The speedup that each of these applications achieves over a sequential CPU version ranges between 20 and 50.
/content/cudazone/CUDABrowser/assets/images/applications/454_data-mining_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/454_data-mining_large.jpg
Academia
The Ohio State University, Columbus, OH, USA
2008
12
31
12/31/2008
50
Wenjing Ma
Gagan Agrawal
Paper
Numerics
Science
Wenjing Ma, Gagan Agrawal
9bc5625b-e5f6-4072-a83c-32e59a956b1d
A control-structure splitting optimization for GPGPU
Control statements in a GPU program such as loops and branches pose serious challenges for the efficient usage of GPU resources because those control statements will lead to the serialization of threads and consequently ruin the occupancy of the GPU, that is, the number of threads running concurrently. Unlike traditional vector processing units that are inside a general purpose processor, the GPU cannot leave the control statements to the CPU because fine-grain statement scheduling between GPU and CPU is impossible. We need an effective method to handle the control statements "just in place" on the GPUs. In this paper, we propose novel techniques to transform control statements so that they can be executed efficiently on GPUs. Our techniques smartly increase code redundancy, which might be deemed as "de-optimization" for the CPU, to improve the occupancy of a program on the GPU and therefore improve performance. We focus our attention on how common programming structures such as loops and branches decrease the occupancy of single kernels and how to counter that. We demonstrate our optimizations on a synthetic benchmark and a complex parallel algorithm, the Lattice Boltzmann Method (LBM). Our results show that these techniques are very efficient and can lead to an increase in occupancy and a drastic improvement in performance compared to the non-split versions of the programs.
/content/cudazone/CUDABrowser/assets/images/applications/453_fracorg_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/453_fracorg_large.jpg
Academia
University of Delaware, Newark, USA
2008
12
31
12/31/2008
Snaider Carrillo
Jakob Siegal Xiaoming Li
Numerics
Science
Snaider Carrillo,Jakob Siegal,Xiaoming Li
b223f4bf-c0f0-49bc-876f-1b1d7058d9e7
Massive parallel LDPC decoding
Low-Density Parity-Check (LDPC) codes are powerful error correcting codes (ECC). They have recently been adopted by several data communication standards such as DVB-S2 and WiMax. LDPCs are represented by bipartite graphs, also called Tanner graphs, and their decoding demands very intensive computation. For that reason, VLSI dedicated architectures have been investigated and developed over the last few years. This paper proposes a new approach for LDPC decoding on graphics processing units (GPUs). Efficient data structures and a new algorithm are proposed to represent the Tanner graph and to perform LDPC decoding according to the stream-based computing model. GPUs were programmed to efficiently implement the proposed algorithms by applying data-parallel intensive computing. Experimental results show that GPUs perform LDPC decoding nearly three orders of magnitude faster than modern CPUs. Moreover, they lead to the conclusion that GPUs with their tremendous processing power can be considered as a consistent alternative to state-of-the-art hardware LDPC decoders.
/content/cudazone/CUDABrowser/assets/images/applications/452_ldpc_generation_graph_small.png
/content/cudazone/CUDABrowser/assets/images/applications/452_ldpc_generation_graph_large.png
Academia
Instituto de Telecomunicacoes/FCTUC, University of Coimbra, Coimbra, Portugal
2008
12
31
12/31/2008
Gabriel Falcao
Leonel Sousa Vitor Silva
Paper
Numerics
Science
Gabriel Falcao,Leonel Sousa,Vitor Silva
3510667b-e35f-4140-a7dc-c4ff7e95ee68
Efficient computation of sum-products on GPUs through software-managed cache
We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the implementation of a software-managed cache. We also present an analytical model for performance analysis of such algorithms. We apply this technique to the implementation of the GPU-based solver of the sum-product or marginalize a product of functions (MPF) problem, which arises in a wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications. Our motivation to accelerate MPF originated in the context of the analysis of genetic diseases, which in some cases requires years to complete on modern CPUs. Computing MPF is similar to computing the chain matrix product of multi-dimensional matrices, but is more difficult due to a complex data-dependent access pattern, high data reuse, and a low compute-to-memory access ratio. Our GPU-based MPF solver achieves up to 2700-fold speedup on random data and 270-fold on real-life genetic analysis datasets on GeForce 8800GTX GPU from NVIDIA over the optimized CPU version on an Intel 2.4GHz Core 2 with a 4MB L2 cache.
/content/cudazone/CUDABrowser/assets/images/applications/451_6763420-0-large_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/451_6763420-0-large_large.jpg
Academia
Technion - Israel Institute of Technology, Haifa, Israel
2008
12
31
12/31/2008
270
Mark Silberstein
Assaf Schuster Dan Geiger
Paper
Numerics
Life Sciences
Science
Mark Silberstein, Assaf Schuster,Dan Geiger
66160e9c-33cb-4208-9562-05b21eceb571
Accelerating total variation regularization for matrix-valued images on GPUs
The advent of new matrix-valued magnetic resonance imaging modalities such as Diffusion Tensor Imaging (DTI) requires extensive computational acceleration. Computational acceleration on graphics processing units (GPUs) can make the regularization (denoising) of DTI images attractive in clinical settings, hence improving the quality of DTI images in a broad range of applications. Construction of DTI images consists of direction-specific Magnetic Resonance (MR) measurements. Compared with conventional MR, direction-sensitive acquisition has a lower signal-to-noise ratio (SNR). Therefore, high noise levels often limit DTI imaging. Advanced post-processing of imaging data can improve the quality of estimated tensors. However, the post-processing problem is only made more computationally difficult when considering matrix-valued imaging data. This paper describes the acceleration of a Total Variation regularization method for matrix-valued images, in particular, for DTI images on the NVIDIA Quadro FX 5600. The TV regularization of a 3-D image with 128^3 voxels ultimately achieves 266X speedup and requires 1 minute and 30 seconds on the Quadro, while this algorithm on a dual-core CPU completes in more than 3 hours. In this application study we aim to analyze the effect of excessive synchronization, which provides an insight into generally adapting variational methods to the GPU architecture for other image processing algorithms designed for matrix-valued images.
/content/cudazone/CUDABrowser/assets/images/applications/450_matrix_rose_leaf_3_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/450_matrix_rose_leaf_3_large.jpg
Academia
University of California, Los Angeles, CA
2008
12
31
12/31/2008
266
Maryam Moazeni
Alex Bui Majid Sarrafzadeh
Paper
Imaging
Science
Maryam Moazeni,Alex Bui,Majid Sarrafzadeh
ca59d3fd-811a-44aa-bffa-097765bd6b20
Performance analysis of accelerated image registration using GPGPU
This paper presents a performance analysis of an accelerated 2-D rigid image registration implementation that employs the Compute Unified Device Architecture (CUDA) programming environment to take advantage of the parallel processing capabilities of NVIDIA's Tesla C870 GPU. We explain the underlying structure of the GPU implementation and compare its performance and accuracy against a fast CPU-based implementation. Our experimental results demonstrate that our GPU version is capable of up to 90x speedup with bilinear interpolation and 30x speedup with bicubic interpolation while maintaining a high level of accuracy. This compares favorably to recent image registration studies, but it also indicates that our implementation only reaches about 70% of theoretical peak performance. To analyze our results, we utilize profiling data to identify some of the underlying limitations of CUDA that prohibit peak performance. At the end, we emphasize the need to manage memory resources carefully to fully utilize the GPU and obtain maximum speedup.
/content/cudazone/CUDABrowser/assets/images/applications/449_attention_based_image_registration_saliency_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/449_attention_based_image_registration_saliency_large.jpg
Academia
University of Notre Dame
2008
12
01
12/01/2008
90
Peter Bui
Jay Brockman
Paper
Imaging,Science
Peter Bui,Jay Brockman
b33ef247-efa4-414a-bb2a-ee8f016f096f
Accelerating advanced MRI reconstructions
Computational acceleration on graphics processing units (GPUs) can make advanced magnetic resonance imaging (MRI) reconstruction algorithms attractive in clinical settings, thereby improving the quality of MR images across a broad spectrum of applications. At present, MR imaging is often limited by high noise levels, significant imaging artifacts, and/or long data acquisition (scan) times. Advanced image reconstruction algorithms can mitigate these limitations and improve image quality by simultaneously operating on scan data acquired with arbitrary trajectories and incorporating additional information such as anatomical constraints. However, the improvements in image quality come at the expense of a considerable increase in computation. This paper describes the acceleration of an advanced reconstruction algorithm on NVIDIA's Quadro FX 5600. Optimizations such as register-allocating the voxel data, tiling the scan data, and storing the scan data in the Quadro's constant memory dramatically reduce the reconstruction's required bandwidth to off-chip memory. The Quadro's special functional units provide substantial acceleration of the trigonometric computations in the algorithm's inner loops, and experimentally-tuned code transformations increase the reconstruction's performance by an additional 20%. The reconstruction of a 3D image with 128^3 voxels ultimately achieves 150 GFLOPS and requires less than two minutes on the Quadro, while reconstruction on a quad-core CPU is thirteen times slower. Furthermore, relative to the true image, the error exhibited by the advanced reconstruction is only 12%, while conventional reconstruction techniques incur error of 42%. In short, the acceleration afforded by the GPU greatly increases the appeal of the advanced reconstruction for clinical MRI applications.
/content/cudazone/CUDABrowser/assets/images/applications/448_Img00250_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/448_Img00250_large.jpg
Academia
University of Illinois at Urbana-Champaign, Urbana, IL, USA
2008
12
31
12/31/2008
13
Samuel S. Stone
Justin P. Haldar Stephanie C. Tsao
Paper
Imaging
Medical Imaging
Life Sciences
Science
Samuel S. Stone, Justin P. Haldar, Stephanie C. Tsao, ssstone2@crhc.uiuc.edu
51d66e62-d1a3-42b7-9f4b-29ce84e42a20
GPU acceleration of cutoff pair potentials for molecular modeling applications
The advent of systems biology requires the simulation of ever-larger biomolecular systems, demanding a commensurate growth in computational power. This paper examines the use of the NVIDIA Tesla C870 graphics card programmed through the CUDA toolkit to accelerate the calculation of cutoff pair potentials, one of the most prevalent computations required by many different molecular modeling applications. We present algorithms to calculate electrostatic potential maps for cutoff pair potentials. Whereas a straightforward approach for decomposing atom data leads to low compute efficiency, a newer strategy enables fine-grained spatial decomposition of atom data that maps efficiently to the C870's memory system while increasing work-efficiency of atom data traversal by a factor of 5. The memory addressing flexibility exposed through CUDA's SPMD programming model is crucial in enabling this new strategy. An implementation of the new algorithm provides a greater than threefold performance improvement over our previously published implementation and runs 12 to 20 times faster than optimized CPU-only code. The lessons learned are generally applicable to algorithms accelerated by uniform grid spatial decomposition.
/content/cudazone/CUDABrowser/assets/images/applications/447_imprint_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/447_imprint_large.jpg
Academia
University of Illinois at Urbana-Champaign, Urbana-Champaign, IL, USA
2008
12
31
12/31/2008
20
Christopher I. Rodrigues
David J. Hardy John E. Stone
Paper
Numerics
Christopher I. Rodrigues, David J. Hardy, John E. Stone, graphics processors, molecular dynamics
2155f6f6-18cc-479b-85a0-3e96576dff51
An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness
GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Programming thousands of massively parallel threads is a big challenge for software engineers, but understanding the performance bottlenecks of those parallel programs on GPU architectures to improve application performance is even more difficult. Current approaches rely on programmers to tune their applications by exploiting the design space exhaustively without fully understanding the performance characteristics of their applications. To provide insights into the performance bottlenecks of parallel applications on GPU architectures, we propose a simple analytical model that estimates the execution time of massively parallel programs. The key component of our model is estimating the number of parallel memory requests (we call this the memory warp parallelism) by considering the number of running threads and memory bandwidth. Based on the degree of memory warp parallelism, the model estimates the cost of memory requests, thereby estimating the overall execution time of a program. Comparisons between the outcome of the model and the actual execution time in several GPUs show that the geometric mean of absolute error of our model on micro-benchmarks is 5.4% and on GPU computing applications is 13.3%. All the applications are written in the CUDA programming language.
/content/cudazone/CUDABrowser/assets/images/applications/446_figure09_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/446_figure09_large.jpg
Academia
Georgia Institute of Technology, Atlanta, GA
2008
12
01
12/01/2008
Sunpyo Hong
Hyesoon Kim
Paper
Numerics
Sunpyo Hong, Hyesoon Kim, hyesoon@cc.gatech.edu
d9006f32-0255-4aab-a31d-a8f339088809
A translation system for enabling data mining applications on GPUs
Modern GPUs offer much computing power at a very modest cost. Even though CUDA and other related recent developments are accelerating the use of GPUs for general purpose applications, several challenges still remain in programming the GPUs. Thus, it is clearly desirable to be able to program GPUs using a higher-level interface. In this paper, we offer a solution that targets a specific class of applications, namely data mining and scientific data analysis applications. Our work is driven by the observation that a common processing structure, that of generalized reductions, fits a large number of popular data mining algorithms. In our solution, the programmers simply need to specify the sequential reduction loop(s) with some additional information about the parameters. We use program analysis and code generation to map the applications to a GPU. Several additional optimizations are also performed by the system. We have evaluated our system using three popular data mining applications: k-means clustering, EM clustering, and Principal Component Analysis (PCA). The main observations from our experiments are as follows. The speedup that each of these applications achieves over a sequential CPU version ranges between 20 and 50. The automatically generated version did not have any noticeable overheads compared to hand-written code. Finally, the optimizations performed in the system resulted in significant performance improvements.
/content/cudazone/CUDABrowser/assets/images/applications/445_data-mining_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/445_data-mining_large.jpg
Academia
The Ohio State University, Columbus, OH, USA
2008
12
31
12/31/2008
50
Wenjing Ma
Gagan Agrawal
Paper
Numerics
Wenjing Ma, Gagan Agrawal
96154328-556a-4904-a557-ae73986ce7bc
Hughes Trainable Text Skimmer: description of the TTS system as used for MUC-3
The objective of the Hughes Trainable Text Skimmer (TTS) Project is to create text skimming software that: (1) can be easily re-configured for new applications, (2) improves its performance with use, and (3) is fast enough to process megabytes of text per day. The TTS-MUC3 system is our first full scale prototype.
/content/cudazone/CUDABrowser/assets/images/applications/444_text-deactivation_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/444_text-deactivation_large.jpg
Academia
Hughes Research Laboratories, Malibu, CA
2008
12
31
12/31/2008
Charles P. Dolan
Thomas V. Cuda Seth R. Goldman
Paper
Numerics
Charles P. Dolan, Thomas V. Cuda, Seth R. Goldman, HRLcontracts@hrl.com
68f9f27b-3df1-46c8-aebf-1c79fc6f3a47
Accelerating linpack with CUDA on heterogenous clusters
This paper describes the use of CUDA to accelerate the Linpack benchmark on heterogenous clusters, where both CPUs and GPUs are used in synergy with minor or no modifications to the original source code. A host library intercepts the calls to DGEMM and DTRSM and executes them simultaneously on both GPUs and CPU cores. An 8U cluster is able to sustain more than a Teraflop using a CUDA accelerated version of HPL.
/content/cudazone/CUDABrowser/assets/images/applications/443_1476-072X-7-57-2-l_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/443_1476-072X-7-57-2-l_large.jpg
NVIDIA
http://www.nvidia.com/cuda
2008
12
31
12/31/2008
Massimiliano Fatica
Paper
Numerics
Massimiliano Fatica, mfatica@nvidia.com
53bada17-5c89-4cf8-8e56-5edee7ba8578
High-performance CUDA kernel execution on FPGAs
In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state of the art high-level synthesis tool AutoPilot from AutoESL, to efficiently map the exposed parallelism in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers the advantage of a common programming interface for exploiting parallelism on two very different types of accelerators -- FPGAs and GPUs. Moreover, by leveraging the advanced synthesis capabilities of AutoPilot we enable efficient exploitation of the FPGA configurability for application specific acceleration. Our flow is based on a compilation process that transforms the SPMD CUDA thread blocks into high-concurrency AutoPilot-C code. We provide an overview of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the generated multi-core accelerators.
/content/cudazone/CUDABrowser/assets/images/applications/442_fpga_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/442_fpga_large.jpg
Academia
University of Illinois, Urbana-Champaign, IL, USA
2008
12
31
12/31/2008
Alexandros Papakonstantinou
Karthik Gururaj
John A. Stratton
Paper
Electronic Design Automation
Imaging
Science
Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton
56573ff8-9c5d-49b3-a543-759fd0b3dfb8
A Cross-Input Adaptive Framework for GPU Program Optimizations
This work presents a CUDA program optimizer, named G-ADAPT. It is a tool for helping programmers determine the suitable values of a set of optimization parameters for a CUDA application. It is unique in being adaptive to the influence of program inputs on the application's executions.
/content/cudazone/CUDABrowser/assets/images/applications/441_tjetb_iso_shaded_small.png
/content/cudazone/CUDABrowser/assets/images/applications/441_tjetb_iso_shaded_large.png
Academia
The College of William and Mary
http://www.cs.wm.edu/caps/
2009
05
25
05/25/2009
Xipeng Shen
Paper
Programming Tools
Program Optimizations, empirical search, Cross-input Adaptation, Xipeng Shen
b6d01d14-c22a-43ac-bc34-9e0ee006c583
Optimization principles and application performance evaluation of a multithreaded GPU
GPUs have recently attracted the attention of many application developers as commodity data-parallel coprocessors. The newest generations of GPU architecture provide easier programmability and increased generality while maintaining the tremendous memory bandwidth and computational power of traditional GPUs. This opportunity should redirect efforts in GPGPU research from ad hoc porting of applications to establishing principles and strategies that allow efficient mapping of computation to graphics hardware. In this work we discuss the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies. Key to performance on this platform is using massive multithreading to utilize the large number of cores and hide global memory latency. To achieve this, developers face the challenge of striking the right balance between each thread's resource usage and the number of simultaneously active threads. The resources to manage include the number of registers and the amount of on-chip memory used per thread, number of threads per multiprocessor, and global memory bandwidth. We also obtain increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and apply classical optimizations to reduce the number of executed operations. We apply these strategies across a variety of applications and domains and achieve speedups of 10.5X to 457X in kernel codes and total application speedups of 1.16X to 431X.
/content/cudazone/CUDABrowser/assets/images/applications/440_comet-connections_small.png
/content/cudazone/CUDABrowser/assets/images/applications/440_comet-connections_large.png
Academia
University of Illinois at Urbana-Champaign, Urbana, IL, USA
2008
12
31
12/31/2008
431
Shane Ryoo
Christopher I. Rodrigues Sara S. Baghsorkhi
Paper
Parallel Algorithms
Shane Ryoo, Christopher I. Rodrigues, Sara S. Baghsorkhi, GPU computing, parallel computing
222b928f-43be-4e02-a007-a005d2655181
Bandwidth intensive 3-D FFT kernel
Most GPU performance "hypes" have focused on tightly-coupled applications with small memory bandwidth requirements, e.g., N-body, but GPUs are also commodity vector machines sporting substantial memory bandwidth; however, effective programming methodologies for them have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA, achieves nearly 80 GFLOPS on a top-end GPU, more than three times faster than any existing FFT implementation on GPUs, including CUFFT. Careful programming techniques are employed to fully exploit modern GPU hardware characteristics while overcoming their limitations, including on-chip shared memory utilization, optimizing the number of threads and registers through appropriate localization, and avoiding low-speed strided memory accesses. Our kernel applied to real applications achieves an orders-of-magnitude boost in power and cost vs. performance metrics. The off-card bandwidth limitation is still an issue; it could be alleviated somewhat by confining application kernels to the card, while the ideal solution would be faster GPU interfaces.
/content/cudazone/CUDABrowser/assets/images/applications/439_ERGOpage04_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/439_ERGOpage04_large.jpg
Academia
Tokyo Institute of Technology, Tokyo, Japan and Japan Science and Technology Agency, Kawaguchi, Saitama, Japan
2008
12
31
12/31/2008
3
Akira Nukada
Yasuhiko Ogata Toshio Endo
Paper
Numerics
Akira Nukada, Yasuhiko Ogata, Toshio Endo, Algorithms, Design, Experimentation, Measurement, Performance
ccaf5379-24f8-488e-a424-c6c223458be2
A High Performance Agent Based Modelling Framework
We present an efficient implementation of a high performance parallel framework for Agent Based Modelling (ABM), exploiting the parallel architecture of the Graphics Processing Unit (GPU). It provides a mapping between formal agent specifications, with C based scripting, and optimised NVIDIA Compute Unified Device Architecture (CUDA) code. The mapping of agent data structures and agent communication is described, and our work is evaluated through a number of simple interacting agent examples. In contrast with an alternative, single machine CPU implementation, a speedup of up to 250 times is reported.
/content/cudazone/CUDABrowser/assets/images/applications/438_11219696_small.JPG
/content/cudazone/CUDABrowser/assets/images/applications/438_11219696_large.JPG
Academia
University of Sheffield, UK
2008
12
31
12/31/2008
250
Paul Richmond
Simon Coakley
Daniela M. Romano
Paper
Parallel Algorithms
Paul Richmond, Simon Coakley, Daniela M. Romano
108e4261-9842-4b48-9266-5445cda7c5df
Accelerating phase unwrapping and affine transformations for optical quadrature microscopy
Optical Quadrature Microscopy (OQM) is a process which uses phase data to capture information about the sample being studied. OQM is part of an imaging framework developed by the Optical Science Laboratory at Northeastern University. In one particular application of interest, the framework is used to extract phase information from the image of an embryo to determine embryo viability. Phase Unwrapping is the process of reconstructing the real phase shift (propagation delay) of a sample from the measured "wrapped" representation which is between -π and +π. Unwrapping can be done using the Minimum LP Norm Phase Unwrap algorithm. Images are first preprocessed using an Affine Transform before they are unwrapped. Both of these steps are time consuming and would benefit greatly from parallelization and acceleration. Faster processing would lower many research barriers (in terms of throughput and performance) present when using OQM. In this paper we report on accelerating Phase Unwrapping and Affine Transformations using NVIDIA's CUDA programming model. We also run elementary noise removal on the GPU using NVIDIA's CUBLAS (CUDA Basic Linear Algebra Subprograms) library. We integrate GPU execution into a Matlab environment to seamlessly interface to the pre-existing image acquisition system. By mapping the unwrap and noise removal to a GPU, and by also reducing the amount of I/O overhead, we are able to accelerate the end-to-end process by more than 7.3x. This enables our imaging framework to perform high speed image acquisition and visualization at near real-time rates.
/content/cudazone/CUDABrowser/assets/images/applications/437_20060621-QuenchedSi-AFM_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/437_20060621-QuenchedSi-AFM_large.jpg
Academia
Northeastern University, Boston, MA
2008
12
31
12/31/2008
8
Miriam Leeser
Sherman Braganza
David Kaeli
Perhaad Mistry
Paper
Medical Imaging
Life Sciences
Science
Perhaad Mistry, Sherman Braganza, David Kaeli, pmistry@ece.neu.edu
16bab639-6bea-43bf-bb15-a160a8fb5924
hiCUDA: a high-level directive-based language
The Compute Unified Device Architecture (CUDA) has become a de facto standard for programming NVIDIA GPUs. However, CUDA places on the programmer the burden of packaging GPU code in separate functions, of explicitly managing data transfer between the host memory and various components of the GPU memory, and of manually optimizing the utilization of the GPU memory. Practical experience shows that the programmer needs to make significant code changes, which are often tedious and error-prone, before getting an optimized program. We have designed hiCUDA, a high-level directive-based language for CUDA programming. It allows programmers to perform these tedious tasks in a simpler manner, and directly to the sequential code. Nonetheless, it supports the same programming paradigm already familiar to CUDA programmers. We have prototyped a source-to-source compiler that translates a hiCUDA program to a CUDA program. Experiments using five standard CUDA benchmarks show that the simplicity and flexibility hiCUDA provides come at no expense to performance.
/content/cudazone/CUDABrowser/assets/images/applications/436_sombrero_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/436_sombrero_large.jpg
Academia
University of Toronto, Toronto, Ontario, Canada
2008
12
31
12/31/2008
Tianyi David Han
Tarek S. Abdelrahman
Paper
Parallel Algorithms
b4a32d12-7514-402c-81f6-0f3a1131a030
Harvesting graphics power for MD simulations
We discuss an implementation of molecular dynamics (MD) simulations on a graphics processing unit (GPU) in the NVIDIA CUDA language. We tested our code on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPUs is compared to that of their main processor counterparts. We achieve speedups of up to 80, 40 and 150 fold, respectively. With the newest generation of GPUs one can run standard MD simulations at 10^7 flops.
/content/cudazone/CUDABrowser/assets/images/applications/435_math_snap-480_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/435_math_snap-480_large.jpg
Academia
FOM Institute for Atomic and Molecular Physics, Kruislaan
2007
09
01
09/01/2007
150
J.A. van Meel
A. Arnold
D. Frenkel
Paper
Digital Content Creation
Graphics
Imaging
Science
simulations
7bb139aa-5808-4bf2-a89b-2c4666abc8cc
GPU computing for 2-d spin systems: CUDA vs OpenGL
In recent years the increasingly powerful GPUs available on the PC market have attracted attention as a cost-effective solution for parallel (SIMD) computing. CUDA is solid evidence of the attention that the major companies are devoting to the field. CUDA is a hardware and software architecture developed by NVIDIA for computing on the GPU. It qualifies as a friendly alternative to the approach to GPU computing that was pioneered in the OpenGL environment. We discuss the application of both the CUDA and the OpenGL approaches to the simulation of 2-d spin systems (XY model).
/content/cudazone/CUDABrowser/assets/images/applications/434_opengl_small.png
/content/cudazone/CUDABrowser/assets/images/applications/434_opengl_large.png
Academia
University of Parma
2008
11
13
11/13/2008
Viola Anselmi
Giovanni Conti Francesco Di Renzo
Paper
Numerics
Science
45b89c24-5196-4659-aa01-b47994748c78
Accelerating numerical solution of Stochastic Differential Equations with CUDA
Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we present how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them by two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675x compared to a standard CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening for research a whole new range of problems which can now be solved interactively.
/content/cudazone/CUDABrowser/assets/images/applications/433_numerical_small.png
/content/cudazone/CUDABrowser/assets/images/applications/433_numerical_large.png
Academia
Institute of Physics, University of Silesia
2009
03
23
03/23/2009
675
M. Januszewski
M. Kostur
Paper
Numerics
Science
Josephson junction, Kuramoto, graphics processing unit, advanced computer architecture, numerical integration, diffusion, stochastic differential equation, CUDA, Tesla, NVIDIA
17d19b5f-5d93-4db7-87a8-1d58ee75a60b
An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)
We present a detailed approach for making use of two new computer hardware architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis application (Einstein@Home). Our results suggest that both architectures suit the application quite well and that the achievable performance in the same software development time-frame is nearly identical.
/content/cudazone/CUDABrowser/assets/images/applications/432_96714main_DiskPreBurst_lg_web-1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/432_96714main_DiskPreBurst_lg_web-1_large.jpg
Academia
Research Group Programming Languages, Methodologies Universität Kassel
2008
12
31
12/31/2008
06/29/2009
Jens Breitbart
Gaurav Khanna
Paper
Numerics
Science
Signal Processing
99f41941-0cd4-4e77-88f3-19bbb5787ac0
Teraflop per second gravitational lensing ray-shooting using graphics processing units
Gravitational lensing calculation using a direct inverse ray-shooting approach is a computationally expensive way to determine magnification maps, caustic patterns, and light-curves (e.g. as a function of source profile and size). However, as an easily parallelisable calculation, gravitational ray-shooting can be accelerated using programmable graphics processing units (GPUs). We present our implementation of inverse ray-shooting for the NVIDIA G80 generation of graphics processors using the NVIDIA Compute Unified Device Architecture (CUDA) software development kit. We also extend our code to multiple-GPU systems, including a 4-GPU NVIDIA S1070 Tesla unit. We achieve sustained processing performance of 182 Gflop/s on a single GPU, and 1.28 Tflop/s using the Tesla unit. We demonstrate that billion-lens microlensing simulations can be run on a single computer with a Tesla unit in timescales of order a day without the use of a hierarchical tree code.
/content/cudazone/CUDABrowser/assets/images/applications/431_tera_small.png
/content/cudazone/CUDABrowser/assets/images/applications/431_tera_large.png
Academia
Centre for Astrophysics and Supercomputing, Swinburne University of Technology
2009
05
15
05/15/2009
100
Alexander C. Thompson
Christopher J. Fluke David G. Barnes
Paper
Graphics
Imaging
Numerics
Science
9fe009cc-0722-47c2-8f4a-dbef6569de7a
SAR simulation
Synthetic Aperture Radar (SAR) target signal simulation and SAR imaging
/content/cudazone/CUDABrowser/assets/images/applications/430_d0b97bfa-b534-4d93-bb7d-cf8e9da6d64dB_small.JPG
/content/cudazone/CUDABrowser/assets/images/applications/430_d0b97bfa-b534-4d93-bb7d-cf8e9da6d64dB_large.JPG
Academia
UESTC EE
2008
05
30
05/30/2008
40
Wang haihua
Yu qin Zhang Shu
Application
Signal Processing
eeae1a85-6d79-4615-92ad-23040cec2407
Chromakey Solution for Photo Studio
Chromakey Solution for Photo Studio is a system that provides a synthesized live view of video and a still background image using the ISP chromakey algorithm, accelerated by CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/429_Chromakey_small.png
/content/cudazone/CUDABrowser/assets/images/applications/429_Chromakey_large.png
Commercial
Research Institute of Systems Planning, Inc.
2009
07
27
07/27/2009
13
Research Institute of Systems Planning, Inc.
Paper
Imaging
Video & Audio
chromakey, synthesized live view
71454a1e-2588-4983-85bd-578a6d501c65
MediaCoder
MediaCoder is a free universal batch media transcoder, which nicely integrates most popular audio/video codecs and tools into an all-in-one solution.
/content/cudazone/CUDABrowser/assets/images/applications/428_mc-skinned_small.png
/content/cudazone/CUDABrowser/assets/images/applications/428_mc-skinned_large.png
Commercial
mediacoder
http://www.mediacoderhq.com/index.htm
2009
07
22
07/22/2009
Stanley Huang
Application
Video & Audio
audio and video transcoder
34f38ba0-4c4b-49e0-b737-14e7f8028d73
Furry Ball: GPU renderer for Maya
GPU renderer for Maya for studio use. Features: DirectX 10 compatible, Full Maya Integration in Viewport, Complete realtime Dynamic Fur and Hair, Bump mapping, Lambert, Blinn, Phong materials, Textures, Unlimited lights, Soft Shadows (with variable penumbra), Reflection, Blurred reflection, Resolution up to 8k, Unlimited Supersampling, Per Object Supersampling, Ambient occlusion, Transparency.
100-300 times faster than CPU rendering on a regular GeForce card.
/content/cudazone/CUDABrowser/assets/images/applications/427_furry_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/427_furry_large.jpg
Commercial
Art And Animation studio
http://www.aaa-studio.cz
2009
12
30
12/30/2009
300
Commercial
Art And Animation studio
Application
Multimedia
Graphics
Imaging
Video & Audio
GPU renderer Maya
aaa7ed55-38b8-4b94-889b-b0d2a3d6b216
Massively-Parallel Game Servers
This work is intended to show that the GPU is the most appropriate technology for game servers, and also that high performance for game servers can be achieved with low cost hardware.
/content/cudazone/CUDABrowser/assets/images/applications/426_CPUvsGPU_game_server_time_chart_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/426_CPUvsGPU_game_server_time_chart_large.jpg
Maxime Griot
2009
07
24
07/24/2009
Research
Maxime Griot
Paper
server
1d3b0c3e-d0a6-4727-bf26-bf373f8f6953
Parallel, distributed and GPU computing technologies in single-particle electron microscopy
Most known methods for the determination of the structure of macromolecular complexes are limited or at least restricted at some point by their computational demands. Recent developments in information technology such as multicore, parallel and GPU processing can be used to overcome these limitations. In particular, graphics processing units (GPUs), which were originally developed for rendering real-time effects in computer games, are now ubiquitous and provide unprecedented computational power for scientific applications. Each parallel-processing paradigm alone can improve overall performance; the increased computational performance obtained by combining all paradigms, unleashing the full power of today's technology, makes certain applications feasible that were previously virtually impossible. In this article, state-of-the-art paradigms are introduced, the tools and infrastructure needed to apply these paradigms are presented and a state-of-the-art infrastructure and solution strategy for moving scientific applications to the next generation of computer hardware is outlined.
/content/cudazone/CUDABrowser/assets/images/applications/425_POLlen-_thmb_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/425_POLlen-_thmb_large.jpg
2008
10
29
10/29/2008
Martin Schmeisser
Burkhard C. Heisen Mario Luettich
Paper
Science
parallel processing, GPU processing, distributed heterogeneous computing, nondedicated systems, multicore performance, cluster computing, electron microscopy, Martin Schmeisser, Burkhard C. Heisen, Mario Luettich
da9e5636-cfb8-4460-add9-571196637dcf
CUDA Path Tracing Demo
CUDA Path Tracing Demo. Can load and display a static .obj scene with diffuse path tracing or direct illumination, both from a single area light.
/content/cudazone/CUDABrowser/assets/images/applications/424_ruins_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/424_ruins_new_large.jpg
Academia
Saarland University
2009
07
01
07/01/2009
Javor Kalojanov
Application
Graphics
Javor Kalojanov, javor@graphics.cs.uni-sb.de
b8685f2e-ed2e-4b7a-86f5-37fe0ac5f8dd
ToraTora
File Encryption Software (AES 256-bit)
/content/cudazone/CUDABrowser/assets/images/applications/423_OpenHelp_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/423_OpenHelp_large.jpg
Commercial
iCanal Inc.
http://www.icanal.co.jp/
2009
07
10
07/10/2009
3
Naoki Hirayama
Application
File Utility
nhiraya@icanal.co.jp, Naoki Hirayama
9e515da5-c97e-4c37-8305-f27982a02d5f
CUDA Multiforcer
Multihash CUDA Brute Forcer - The world's fastest cross-platform MD4/MD5/NTLM cracking for Windows/Mac/Linux
/content/cudazone/CUDABrowser/assets/images/applications/421_md5lookupwidget_20070725131833_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/421_md5lookupwidget_20070725131833_large.jpg
Research
Cryptohaze
http://www.cryptohaze.com/
2009
01
23
01/23/2009
Bitweasil
Application
Information Security
CUDA-Multiforcer, MD4, MD5, NTLM
7e5043c6-ca48-401d-a1e3-8fa1d3e12f99
Rapid Aerodynamic Performance Prediction on a Cluster of Graphics Processing Units
We investigate the use of a cluster of GPUs for large-scale CFD problems and show order-of-magnitude increases in performance and performance-to-price ratio. We implement two separate compressible flow solvers. First, we develop a CUDA-based solver for the 2D compressible Euler equations and verify the results against a reference multi-block code MBFLO. After demonstrating the performance of our Euler solver, we proceed to develop a new version of MBFLO by adding GPU-accelerated subroutines to the existing Fortran codebase. Using an eight-node cluster equipped with 16 NVIDIA 9800GX2 GPUs, we achieve speedups of up to 496x on our Euler Solver and 88x on MBFLO. This paper describes the numerical, hardware and software techniques that provide significant speedups.
/content/cudazone/CUDABrowser/assets/images/applications/420_fig5_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/420_fig5_large.jpg
Academia
University of California, Davis
2009
01
05
01/05/2009
496
Everett H. Phillips
Yao Zhang Roger L. Davis
Paper
Computational Fluid Dynamics
Everett H. Phillips, Yao Zhang, Roger L. Davis, MBFLO, Euler Solver, Navier-Stokes
03bb68ad-6cb6-4a8d-969e-53eebed9a521
PowerDVD 9
Video playback quality optimization with TrueTheater HD. CUDA support with build 1719
/content/cudazone/CUDABrowser/assets/images/applications/419_box_PDVD_9_ultra_eng-150_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/419_box_PDVD_9_ultra_eng-150_large.gif
Commercial
Cyberlink
2009
05
22
05/22/2009
Commercial
Cyberlink
Application
Video & Audio
CUDA video DVD upscaling quality
47d7ca93-8eb9-4f19-b3f6-813e99b2aa02
Granular Matter
Realtime, simple model of granular matter implemented on CUDA architecture
/content/cudazone/CUDABrowser/assets/images/applications/418_c2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/418_c2_large.jpg
Academia
Maciej Matyka
2009
07
08
07/08/2009
Maciej Matyka
Application Multimedia Code
Game Physics
Graphics
Maciej Matyka, CUDA Physics Gravity
f2bf43a7-bd39-45ba-9845-a50c956157cd
Parallel View-Dependent Tessellation of Catmull-Clark Subdivision Surfaces
We present a strategy for performing view-adaptive, crack-free tessellation of Catmull-Clark subdivision surfaces entirely on programmable graphics hardware. Our scheme extends the concept of breadth-first subdivision, which up to this point has only been applied to parametric patches. While mesh representations designed for a CPU often involve pointer-based structures and irregular per-element storage, neither of these is well-suited to GPU execution. To solve this problem, we use a simple yet effective data structure for representing a subdivision mesh, and design a careful algorithm to update the mesh in a completely parallel manner. We demonstrate that in spite of the complexities of the subdivision procedure, real-time tessellation to pixel-sized primitives can be done. Our implementation does not rely on any approximation of the limit surface, and avoids both subdivision cracks and T-junctions in the subdivided mesh. Using the approach in this paper, we are able to perform real-time subdivision for several static as well as animated models. Rendering performance is scalable for increasingly complex models.
/content/cudazone/CUDABrowser/assets/images/applications/417_return_small.png
/content/cudazone/CUDABrowser/assets/images/applications/417_return_large.png
Academia
University of California, Davis
http://www.ece.ucdavis.edu/
2009
08
01
08/01/2009
Anjul Patney
Paper
Graphics
Anjul Patney, apatney@ucdavis.edu
87183ef1-9947-470d-bd10-3352dc74be16
Real-Time Reyes-Style Adaptive Surface Subdivision
We present a GPU based implementation of Reyes-style adaptive surface subdivision, known in Reyes terminology as the Bound/Split and Dice stages. The performance of this task is important for the Reyes pipeline to map efficiently to graphics hardware, but its recursive nature and irregular and unbounded memory requirements present a challenge to an efficient implementation. Our solution begins by characterizing Reyes subdivision as a work queue with irregular computation, targeted to a massively parallel GPU. We propose efficient solutions to these general problems by casting our solution in terms of the fundamental primitives of prefix-sum and reduction, often encountered in parallel and GPGPU environments. Our results indicate that real-time Reyes subdivision can indeed be obtained on today's GPUs. We are able to subdivide a complex model to subpixel accuracy within 15 ms. Our measured performance is several times better than that of Pixar's RenderMan. Our implementation scales well with the input size and depth of subdivision. We also address concerns of memory size and bandwidth, and analyze the feasibility of conventional ideas on screen-space buckets.
/content/cudazone/CUDABrowser/assets/images/applications/416_reyes08_new_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/416_reyes08_new_large.jpg
Academia
University of California, Davis
http://www.ece.ucdavis.edu/
2008
12
01
12/01/2008
Anjul Patney
Paper
Graphics
Anjul Patney, apatney@ucdavis.edu
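The abstract above casts the irregular work queue in terms of prefix-sum and reduction. A minimal sequential sketch of that idea follows: an exclusive prefix sum over keep/discard flags gives every surviving item its output slot, so the scatter step has no write conflicts. The function names and the size-threshold example are hypothetical illustrations, not the authors' implementation.

```python
from itertools import accumulate


def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    if not flags:
        return []
    return [0] + list(accumulate(flags))[:-1]


def compact(items, keep):
    """Keep the items for which keep(item) is true, preserving their order."""
    flags = [1 if keep(x) else 0 for x in items]
    offsets = exclusive_scan(flags)
    total = (offsets[-1] + flags[-1]) if items else 0
    out = [None] * total
    # On a GPU each iteration of this loop would be an independent thread:
    # the scan has already assigned a unique output slot to every kept item.
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out


# Hypothetical use: keep only the patches still too large to dice.
patch_sizes = [5, 1, 8, 2, 9]
still_too_big = compact(patch_sizes, lambda s: s > 4)
```

Because the offsets are computed up front, the output buffer can be sized exactly, which is one way to cope with the unbounded memory demand the abstract mentions.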
df98027a-dd64-4806-a323-db07d9c1c88d
R GPU: Enabling GPU Computing in the R Statistical Environment
R is the most popular open source statistical environment in the biomedical research community. However, most of the popular R function implementations involve no parallelism and they can only be executed as separate instances on multicore or cluster hardware for large data-parallel analysis tasks. The arrival of modern graphics processing units (GPUs) with user friendly programming tools, such as nVidia's CUDA toolkit (http://www.nvidia.com/cuda), provides a possibility of increasing the computational efficiency of many common tasks by more than one order of magnitude (http://gpgpu.org/). However, most R users are not trained to program a GPU, a key obstacle for the widespread adoption of GPUs in biomedical research. To overcome this obstacle, we decided to devote efforts for moving frequently used R functions in our work to the GPU using CUDA. In the ideal solution, if a CUDA compatible GPU and driver is present on a user's machine, the user may only need to prefix "gpu" to the original function name to take advantage of the GPU implementation of the corresponding R function. We take achieving this ideal as one of our primary goals so that any biomedical researcher can harness the computational power of a GPU using a familiar tool. Since our code is open source, researchers may customize the R interfaces to their particular needs. In addition, because CUDA uses shared libraries and unobtrusive extensions to the C programming language, any experienced C programmer can easily customize the underlying code. Using the CUDA extension to C and the shared linear algebra library CUBLAS, we have implemented a variety of statistical analysis functions with R interfaces that execute with different degrees of parallelism on a Graphics Processing Unit (GPU). If an algorithm is comprised of common vector or matrix operations each performed once, we involve the GPU by implementing those operations with calls to CUBLAS. 
If an algorithm involves computing the elements of a large matrix, we can often merely assign each thread executing on the GPU a portion of a row and/or column. Algorithms for which we have implemented GPU enabled versions include the calculations of distances between sets of points (R dist function), hierarchical clustering (R hclust function), Pearson and Kendall correlation coefficients (similar to R cor function), and the Granger test ('granger.test' in the R MSBVAR package). We are committed to implementing more R GPU functions, and we hope to contribute packages to the open source community via our project's website. The initial package is hosted by CRAN as gputools, a source package for UNIX and Linux systems. Be sure to set the environment variable CUDA_HOME to the root of your CUDA toolkit installation. Then install the package in the usual R manner. The installation process will automatically make use of nVidia's nvcc compiler and CUBLAS shared library. We hope that others can contribute to the R-GPGPU effort and encourage any comments or suggestions.
/content/cudazone/CUDABrowser/assets/images/applications/415_rgpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/415_rgpu_large.png
Academia
The Molecular and Behavioral Neuroscience Institute, U. of Michigan
http://brainarray.mbni.med.umich.edu/
2009
06
14
06/14/2009
75
Open source
J. Buckner
Paper Code
Numerics
Life Sciences
Oil & Gas
29d03c16-cf86-4f24-9e60-3a943214b48a
Fast Seismic Modeling and Reverse Time Migration on a GPU Cluster
We have designed a fast parallel simulator that solves the acoustic wave equation on a GPU cluster. Solving the acoustic wave equation in an oil exploration industrial context aims at speeding up seismic modeling and Reverse Time Migration. We consider a finite difference approach on a regular mesh, in both 2D and 3D cases. The acoustic wave equation is solved in either a constant density or a variable density domain. All the computations are done in single precision, since double precision is not required in our context. We use CUDA to take advantage of the GPUs' computational power. We study different implementations and their impact on the application performance. We obtain a speedup of 10x for Reverse Time Migration and up to 30x for the modeling application over a sequential code running on a general-purpose CPU.
/content/cudazone/CUDABrowser/assets/images/applications/414_inra_small.png
/content/cudazone/CUDABrowser/assets/images/applications/414_inra_large.png
INRIA Bordeaux
2009
07
14
07/14/2009
30
Rached Abdelkhalek
Paper
Libraries
Rached Abdelkhalek
0f9d31f9-f736-4c8a-a281-0aa5b0883fe5
Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units
Sparse Matrix-Vector Multiplication Toolkit for Graphics Processing Units (SpMV4GPU) is a library optimized for NVIDIA Graphics Processing Units (GPUs). The GPU is fast emerging as the ideal architecture to use as an accelerator in a heterogeneous computing environment. Modern GPUs are designed not only for accelerating traditional graphics kernels, but also for general-purpose computationally intensive kernels. State-of-the-art GPUs exhibit very high computational capabilities at a reasonable price. These GPUs also support high-level parallel programming models, for example, NVIDIA's Compute Unified Device Architecture (CUDA) or Brook+ from AMD, that enable users to develop parallel applications that use the CPU as the host and the GPU as an accelerator. Sparse Matrix-Vector Multiplication is a core numerical analysis kernel used for a wide range of application domains, such as graphics, data mining, and image processing. SpMV4GPU is a sparse matrix-vector multiplication library optimized for the NVIDIA GPUs. It is developed using the NVIDIA CUDA interfaces, and works on all NVIDIA GPUs that support this library. SpMV4GPU uses the standard sparse matrix storage formats, such as compressed row and column storage formats. It hides the intricacies of GPU programming by using an abstract interface. The SpMV4GPU interface also allows users to provide optional performance hints, and optionally use special storage representations. Experimental evaluation demonstrates that the SpMV library provides a two to four times improvement over the equivalent solution provided by NVIDIA's CUDPP library. While the current implementation of the SpMV code uses the CUDA interfaces, the code can be easily migrated to use the upcoming OpenCL standard. This will allow the SpMV code to execute on a wide range of GPU architectures.
/content/cudazone/CUDABrowser/assets/images/applications/413_thumbnail_small.png
/content/cudazone/CUDABrowser/assets/images/applications/413_thumbnail_large.png
Research
IBM Research
2009
04
21
04/21/2009
10
Open source
Rajesh Bordawekar
Code
Computational Fluid Dynamics
Electronic Design Automation
Medical Imaging
Numerics
Life Sciences
Libraries
Oil & Gas
Rajesh Bordawekar, Linear Algebra, Sparse Matrix-Vector Multiplication
93caed1b-cbd9-4f44-88a3-ff0dc1182e35
Practical Pre-stack Kirchhoff Time Migration of Seismic Processing on General Purpose GPU
In this paper, we introduce three prototypes of GPGPU solutions on the NVIDIA GeForce 8800GT for a practical Pre-stack Kirchhoff Time Migration program. We present how to re-design and re-implement the original CPU code as efficient GPU code. The prototypes are up to 7.2 times faster than the CPU version on an Intel P4 3.0G.
/content/cudazone/CUDABrowser/assets/images/applications/412_kirchhoff_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/412_kirchhoff_large.jpg
2009
01
01
01/01/2009
7
Xiaohua Shi
Xu Wang
Paper
Signal Processing
Xiaohua Shi, Xu Wang, Changhai Zhao
d5429d39-25a0-402b-9492-d775fa79105f
Exploiting Computing Power on Graphics Processing Unit
With recent technological advances, graphics processing units (GPUs) are providing increasingly higher performance with improved programmability. This paper investigates NVIDIA's CUDA technology, which enables data mining algorithms to be parallelized effectively on the GPU. The proposed algorithm exploits the computational power and the memory hierarchy of GPUs, using the shared memory to store frequently accessed data. Experimental results indicate that the speed of the computation through the GPU is considerably faster than through the CPU.
/content/cudazone/CUDABrowser/assets/images/applications/411_0408_Hoff4_305_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/411_0408_Hoff4_305_large.jpg
2008
12
01
12/01/2008
Ziyi Liu
Wenjing Ma
Paper
Ziyi Liu, Wenjing Ma
13e2548c-0608-45e5-b42c-32349121a917
Sequence alignment with GPU: Performance and design challenges
In bioinformatics, alignments are commonly performed in genome and protein sequence analysis for gene identification and evolutionary similarities. There are several approaches for such analysis, each varying in accuracy and computational complexity. Smith-Waterman (SW) is by far the best algorithm for its accuracy in similarity scoring. However, the execution time of this algorithm on general-purpose processor based systems makes it impractical for use by life scientists. In this paper we take Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture onto the GPU. We achieve a 23x speedup against the serial version of the SW algorithm. We further study the effect of memory organization and the instruction set architecture on GPU performance. For that purpose we analyze another implementation on an Intel Quad Core processor that makes use of Intel's SIMD based SSE2 architecture. We show that if reading blocks of 16 words at a time instead of 4 is allowed, and if 64KB of shared memory as opposed to 16KB is available to the programmer, GPU performance improves significantly, making it comparable to the SIMD based implementation. We quantify these observations to illustrate the need for studies on extending the instruction set and memory organization for the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/410_bioinformatics_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/410_bioinformatics_large.jpg
Academia
Department of Electrical and Computer Engineering, University of Arizona
2009
01
01
01/01/2009
23
Gregory M. Striemer
Ali Akoglu
Paper
Life Sciences
Gregory M. Striemer, Ali Akoglu
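For readers unfamiliar with the Smith-Waterman similarity scoring used as the case study above, the following is a minimal serial sketch of the recurrence. The scoring values (match=2, mismatch=-1, linear gap=-1) are illustrative assumptions, not the parameters or substitution matrix used in the paper, and the function name is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the alignment local.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


identical = smith_waterman_score("ACGT", "ACGT")
unrelated = smith_waterman_score("AAAA", "TTTT")
```

Each cell depends on its left, upper and upper-left neighbours, so only cells on the same anti-diagonal are independent of one another. That dependency pattern is one reason the mapping to a GPU is not straightforward.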
8c74ee14-db85-421e-aab4-62db9a7be80e
High-Speed Implementations of Block Cipher ARIA Using Graphics Processing Units
The power of the graphics processing unit (GPU) has been increasing more rapidly than that of the CPU. It is not surprising that many software libraries have been developed which enable us to use the power of the GPU for general computations, especially in parallel data processing. In this paper, we propose implementations of the standard block cipher ARIA of Korea using OpenGL and CUDA libraries on the GPU. Since ARIA was announced only 4 years ago, there is no hardware solution yet providing high-speed encryption with ARIA. We make use of the GPU as a parallel processor with several grid structures and optimize the encryption speed and the occupancy of shared memory. As a result, when ARIA is running on a GeForce 8800GTS using the CUDA library, the speed of the encryption reaches up to 4.8 Gbps, which is the fastest publicly known implementation of ARIA.
/content/cudazone/CUDABrowser/assets/images/applications/408_smileycbcb_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/408_smileycbcb_large.jpg
2008
12
01
12/01/2008
Yongjin Yeom
Yongkuk Cho Moti Yung
Paper
Numerics
Yongjin Yeom,Yongkuk Cho,Moti Yung
fdc24388-6fb6-4e40-ab58-56bafa8f9422
Swarm's flight: Accelerating the particles using C-CUDA
With the development of Graphics Processing Units (GPU) and the Compute Unified Device Architecture (CUDA) platform, several areas of knowledge are benefiting from reduced computing time. Our goal is to show how optimization algorithms inspired by Swarm Intelligence can benefit from this technology. In this paper, we provide an implementation of the Particle Swarm Optimization (PSO) algorithm in C-CUDA. The algorithm was tested on a suite of well-known benchmark optimization problems and the computing time has been compared with the same algorithm implemented in C and Matlab. Results demonstrate that the computing time can be significantly reduced using C-CUDA. As far as we know, this is the first implementation of PSO in C-CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/407_flock1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/407_flock1_large.jpg
Academia
Departamento de Informatica, PPGI, Universidade Federal do Espirito Santo
2009
01
01
01/01/2009
Lucas de P. Veronese
Renato A. Krohling
Paper
Science
Lucas de P. Veronese, Renato A. Krohling
cb1e2f8f-6bd3-4ac8-ab13-c84866cb1f6e
Using Graphics Processors for High-Performance Computation and Visualization of Plasma Turbulence
Direct numerical simulation (DNS) of turbulence is computationally intensive and typically relies on some form of parallel processing. Spectral kernels used for spatial discretization are a common computational bottleneck on distributed memory architectures. One way to increase DNS algorithms' efficiency is to parallelize spectral kernels using tightly coupled single-program, multiple-data (SPMD) multiprocessor units with minimal interprocessor communication latency. The authors present techniques to map DNS computations to modern graphics processing units (GPUs), which are characterized by very high memory bandwidth and hundreds of SPMD processors. The article compares the performance between the authors' parallel algorithm running on a GPU versus the associated CPU implementation of a solver for one of the fundamental nonlinear models of turbulence theory. They also demonstrate a prototype of a scalable computational steering framework based on turbulence simulation and visualization coupling on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/406_F-TFTRplasma_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/406_F-TFTRplasma_large.jpg
Academia
University of Maryland, College Park
2009
03
01
03/01/2009
George Stantchev
Derek Juba William Dorland
Paper
Science
George Stantchev,Derek Juba,William Dorland
f0c09d7d-c0a3-4f4d-8e8b-4ba17d10e0bc
GPU-based parallel particle swarm optimization
A novel parallel approach to run standard particle swarm optimization (SPSO) on a Graphics Processing Unit (GPU) is presented in this paper. By using the general-purpose computing ability of the GPU and based on the software platform of Compute Unified Device Architecture (CUDA) from NVIDIA, SPSO can be executed in parallel on the GPU. Experiments are conducted by running SPSO on both GPU and CPU to optimize four benchmark test functions. The running time of the SPSO based on GPU (GPU-SPSO) is greatly shortened compared to that of the SPSO on CPU (CPU-SPSO). The running speed of GPU-SPSO can be more than 11 times as fast as that of CPU-SPSO, with the same performance. Compared to CPU-SPSO, GPU-SPSO shows special speed advantages on large swarm population applications and high-dimensional problems, and so can be widely used in real optimization problems.
/content/cudazone/CUDABrowser/assets/images/applications/404_ParticleSwarmOptimization_small.png
/content/cudazone/CUDABrowser/assets/images/applications/404_ParticleSwarmOptimization_large.png
Academia
Key Laboratory of Machine Perception and Intelligence (Peking University)
2009
05
01
05/01/2009
11
You Zhou
Ying Tan
Paper
Science
You Zhou,Ying Tan
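As a rough illustration of the kind of update that particle swarm optimization performs, here is a minimal serial sketch of a basic global-best PSO. It is not the SPSO variant or the GPU code from the paper: the inertia and acceleration constants (0.729, 1.494), the search bounds, the swarm size and the sphere test function are all assumptions chosen for the example, and the function names are hypothetical.

```python
import random


def pso_minimize(f, dim, n_particles=30, iters=200, lo=-5.0, hi=5.0, seed=0):
    """Minimize f over R^dim with a basic global-best particle swarm."""
    rng = random.Random(seed)
    w, c1, c2 = 0.729, 1.494, 1.494  # assumed inertia and acceleration weights
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]
    for _ in range(iters):
        # The per-particle work below is what a GPU version would run in parallel.
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple benchmark function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_val = pso_minimize(sphere, dim=2)
```

Every particle evaluates the objective and updates its velocity independently within an iteration; only the global best is shared. This is why larger swarms and higher dimensions give a parallel implementation more work to spread across threads.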
9a4d9359-25e3-4dc2-b4c8-65c7c4b3633c
The Virtual Marathon: Parallel Computing Supports Crowd Simulations
To be realistic, an urban model must include appropriate numbers of pedestrians, vehicles, and other dynamic entities. Using a parallel-computing architecture, researchers simulated a marathon with more than a million participants. To simulate participant behavior, they used fuzzy logic on a GPU to perform millions of inferences in real time.
/content/cudazone/CUDABrowser/assets/images/applications/403_crowd_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/403_crowd_large.jpg
Academia
Middle East Technical University
2009
07
01
07/01/2009
Erdal Yilmaz
Veysi Isler Yasemin Yardimci Cetin
Paper
Erdal Yilmaz, Veysi Isler, Yasemin Yardimci Cetin
3cdcf8ea-261d-4611-aaf3-44761158df5f
Accelerating Dust Temperature Calculations with Graphics Processing Units
When calculating the infrared spectral energy distributions (SEDs) of galaxies in radiation-transfer models, the calculation of dust grain temperatures is generally the most time-consuming part of the calculation. Because of its highly parallel nature, this calculation is perfectly suited for massively parallel general-purpose Graphics Processing Units (GPUs). This paper presents an implementation of the calculation of dust grain equilibrium temperatures on GPUs in the Monte-Carlo radiation transfer code sunrise, using the CUDA API. The GPU can perform this calculation 55 times faster than the 8 CPU cores, showing great potential for accelerating calculations of galaxy SEDs.
/content/cudazone/CUDABrowser/assets/images/applications/402_dust_small.png
/content/cudazone/CUDABrowser/assets/images/applications/402_dust_large.png
Academia
Santa Cruz Institute for Particle Physics, University of California, Santa Cruz, CA
2009
07
22
07/22/2009
55
Patrik Jonsson
Joel R. Primack
Paper
Numerics
Joel R. Primack, Patrik Jonsson, dust, radiative transfer, methods: numerical
eb30ba09-4ff0-453e-89f3-041ea6d73ec7
Linear optimization on modern GPUs
Optimization algorithms are becoming increasingly important in many areas, such as finance and engineering. Typically, real problems involve several hundreds of variables, and are subject to as many constraints. Several methods have been developed trying to reduce the theoretical time complexity. Nevertheless, when problems exceed reasonable sizes they end up being very computationally intensive. Heterogeneous systems composed by coupling commodity CPUs and GPUs are becoming relatively cheap, highly performing systems. Recent developments of GPGPU technologies give even more powerful control over them. In this paper, we show how we use a revised simplex algorithm for solving linear programming problems, originally described by Dantzig, for both our CPU and GPU implementations. Previously, this approach has been shown not to scale beyond around 200 variables. However, by taking advantage of modern libraries such as ATLAS for matrix-matrix multiplication, and the NVIDIA CUDA programming library on recent GPUs, we show that we can scale to problem sizes up to at least 2000 variables in our experiments for both architectures. On the GPU, we also achieve an appreciable precision on large problems with thousands of variables and constraints while achieving between 2x and 2.5x speed-ups over the serial ATLAS-based CPU version. With further tuning of both the algorithm and its implementations, even better results should be achievable for both the CPU and GPU versions.
/content/cudazone/CUDABrowser/assets/images/applications/402_2601838e2_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/402_2601838e2_large.gif
Department of Computer and Information Science, Norwegian University of Science and Technology, Trondheim, Norway
2009
05
01
05/01/2009
3
Daniele G. Spampinato
Anne C. Elster
Paper
Numerics
Daniele G. Spampinato, Anne C. Elster
354bd9a4-3e2a-412d-a9f6-e9e32be1e782
GPU acceleration of Zernike moments for large-scale images
Zernike moments are transcendental digital image descriptors used in many application areas like biomedical image processing and computer vision due to their good properties of orthogonality and rotation invariance. However, their computation is too expensive and limits their application in practice, especially when real-time constraints are imposed. This work introduces a novel approach to the high-performance computation of Zernike moments using CUDA on graphics processors. The proposed method is applicable to the computation of an individual Zernike moment as well as a set of Zernike moments of a given order, and it is compared against three of the fastest implementations performed on CPUs over the last decade. Our experimental results on a commodity PC reveal up to 5x faster execution times on a GeForce 8800 GTX against the best existing implementation on a Pentium 4 CPU.
/content/cudazone/CUDABrowser/assets/images/applications/401_zernike_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/401_zernike_large.jpg
Academia
Computer Architecture Department, University of Malaga, Spain
2009
05
01
05/01/2009
5
Manuel Ujaldon
Application
Imaging
Manuel Ujaldon
7a75c63a-5693-4eac-8879-52b943e3db6a
Efficient visual hull computation for real-time 3D reconstruction using CUDA
In this paper we present two efficient GPU-based visual hull computation algorithms. We compare them in terms of performance using image sets of varying size and different voxel resolutions. In addition, we present a real-time 3D reconstruction system which uses the proposed GPU-based reconstruction method to achieve real-time performance (30 fps) using 16 cameras and 4 PCs.
/content/cudazone/CUDABrowser/assets/images/applications/399_PVHMemo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/399_PVHMemo_large.jpg
Academia
Department of Computer Science, Technische Universität München
2008
12
31
12/31/2008
12/01/2008
Alexander Ladikos
Selim Benhimane
Nassir Navab
Paper
Imaging
Alexander Ladikos,Selim Benhimane,Nassir Navab
67438a50-58e1-44a1-aec4-42052ba5add2
CUDA cuts: Fast graph cuts on the GPU
Graph cuts have become a powerful and popular optimization tool for energies defined over an MRF and have found applications in image segmentation, stereo vision, image restoration, etc. The maxflow/mincut algorithm to compute graph cuts is computationally heavy. The best-reported implementation of graph cuts takes over 100 milliseconds even on images of size 640x480 and cannot be used for real-time applications or when iterated applications are needed. The commodity Graphics Processor Unit (GPU) has recently emerged as an economical and fast computation co-processor. In this paper, we present an implementation of the push-relabel algorithm for graph cuts on the GPU. We can perform over 60 graph cuts per second on 1024x1024 images and over 150 graph cuts per second on 640x480 images on an Nvidia 8800 GTX. The time for each complete graph cut is about 1 millisecond when only a few weights change from the previous graph, as on dynamic graphs resulting from videos. The CUDA code with a well-defined interface can be downloaded for anyone's use.
/content/cudazone/CUDABrowser/assets/images/applications/398_case01045_2T_half_fa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/398_case01045_2T_half_fa_large.png
2008
12
01
12/01/2008
Vibhav Vineet
P. J. Narayanan
Paper
Imaging
Vibhav Vineet,P. J. Narayanan
16ef052b-1b67-4ada-88f8-6461329d82c8
GPU Acceleration of 2D-DWT Image Compression in MATLAB with CUDA
This article presents the details of the acceleration of 2D wavelet-based medical data (image) compression in MATLAB with CUDA. Diagnostic materials (mostly as a certain type of image) are increasingly acquired in a digital format. Therefore, the common need to manipulate huge amounts of data daily has raised the issue of compression within a very short amount of time. Attention will be given to the acceleration processing flow which exploits the massive parallel computational power offered by the latest NVIDIA graphics processor unit (GPU). It brings a compute device that can be programmed using a C-like language using CUDA (Compute Unified Device Architecture). At the same time, a number of attractive features can be exploited for a broad class of intensive data parallel computation tasks. The final part of the discussion outlines possible directions towards future improvements of compression ratio and processing speed.
/content/cudazone/CUDABrowser/assets/images/applications/396_wls2_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/396_wls2_large.gif
2008
12
01
12/01/2008
Vaclav Simek
Paper
Medical Imaging
Vaclav Simek, Radim Dvorak
7250813b-109b-4b5c-b956-70f1a188a517
A Compute Unified System Architecture for Graphics Clusters Incorporating Data Locality
We present a development environment for distributed GPU computing targeted for multi-GPU systems, as well as graphics clusters. Our system is based on CUDA and logically extends its parallel programming model for graphics processors to higher levels of parallelism, namely, the PCI bus and network interconnects. While the extended API mimics the full function set of current graphics hardware including the concept of global memory on all distribution layers, the underlying communication mechanisms are handled transparently for the application developer. To allow for high scalability, in particular for network-interconnected environments, we introduce an automatic GPU-accelerated scheduling mechanism that is aware of data locality. This way, the overall amount of transmitted data can be heavily reduced, which leads to better GPU utilization and faster execution. We evaluate the performance and scalability of our system for bus and especially network-level parallelism on typical multi-GPU systems and graphics clusters.
/content/cudazone/CUDABrowser/assets/images/applications/395_3d-gauge-cluster-from-nvidia-and-icar_YatMw_59_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/395_3d-gauge-cluster-from-nvidia-and-icar_YatMw_59_large.jpg
Academia
Visualisierungsinstitut der Universität Stuttgart
2009
12
31
12/31/2009
07/01/2009
Christoph Müller
Steffen Frey Magnus Strengert
Paper
Christoph Müller, Steffen Frey, Magnus Strengert
b66bbea9-6d0c-4b61-aa2f-128074a32b0b
Processing Neocognitron of Face Recognition on High Performance Environment Based on GPU with CUDA Architecture
This work presents an implementation of the Neocognitron Neural Network, using a high performance computing architecture based on GPU (Graphics Processing Unit). Neocognitron is an artificial neural network, proposed by Fukushima and collaborators, constituted of several hierarchical stages of neuron layers, organized in two-dimensional matrices called cellular planes. For the high performance computation of the Face Recognition application using Neocognitron, CUDA (Compute Unified Device Architecture) was used as the API (Application Programming Interface) between the CPU and the GPU, a GeForce 8800 GTX from NVIDIA with 128 ALUs. The face image databases used were a face database created at UFSCar and the CMU-PIE (Carnegie Mellon University Pose, Illumination and Expression) database. The load balancing was achieved through the use of cellular connections as threads organized in blocks, following the CUDA philosophy of development. The results showed the feasibility of this type of device as a massively parallel data processing tool, and that the smaller the granularity and the data dependency of the parallel processing, the better its performance.
/content/cudazone/CUDABrowser/assets/images/applications/394_polar-rose-face-3d_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/394_polar-rose-face-3d_large.jpg
2008
12
01
12/01/2008
Gustavo Poli
José Hiroki Saito João F. Mari
Paper
Signal Processing
Gustavo Poli, José Hiroki Saito, João F. Mari
11117185-2796-4f64-aab0-91728886bf15
Parallel Image Processing Based on CUDA
CUDA (Compute Unified Device Architecture) is a novel technology for general-purpose computing on the GPU, which lets users develop general GPU (Graphics Processing Unit) programs easily. This paper analyzes the distinct features of CUDA GPUs and summarizes the general programming model of CUDA. Furthermore, we implement several classical image processing algorithms with CUDA, such as histogram equalization, cloud removal, edge detection, and DCT encoding and decoding, and introduce the first two algorithms in detail. If we do not take the data transfer time between host memory and device memory into account, then as the image size increases, histogram computation achieves a more than 40x speedup, cloud removal about 79x, DCT around 8x, and edge detection more than 200x.
/content/cudazone/CUDABrowser/assets/images/applications/393_nvidia-CUDA,Q-1-111097-13_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/393_nvidia-CUDA,Q-1-111097-13_large.jpg
2008
12
12
12/12/2008
200
Zhiyi Yang
Yating Zhu Yong Pu
Paper
Imaging
Zhiyi Yang, Yating Zhu, Yong Pu
5cc29037-c944-43de-94b6-046dc4fbbac2
Neural Network Implementation using CUDA and OpenMP
Many algorithms for image processing and pattern recognition have recently been implemented on the GPU (graphics processing unit) for faster computation. However, GPU implementations encounter two problems. First, the programmer must master the fundamentals of graphics shading languages, which require prior knowledge of computer graphics. Second, in a job that needs much cooperation between CPU and GPU, which is usual in image processing and pattern recognition in contrast to the graphics area, the CPU should generate as much raw feature data for GPU processing as possible to utilize GPU performance effectively. This paper proposes a quicker and more efficient implementation of neural networks on both the GPU and a multi-core CPU. To solve the first problem, we use CUDA (compute unified device architecture), which is easy to program thanks to its simple C-like style, instead of GPGPU. Moreover, OpenMP (Open Multi-Processing) is used to concurrently process multiple data with a single instruction on the multi-core CPU, which results in effective utilization of the GPU memories. In the experiments, we implemented a neural network-based text detection system using the proposed architecture; it ran about 15 times faster than a CPU implementation and about 4 times faster than a GPU-only implementation without OpenMP.
/content/cudazone/CUDABrowser/assets/images/applications/392_openmp_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/392_openmp_large.jpg
Academia
Department of Digital Media, College of Information Science, Soongsil University
2008
12
12
01/01/2009
Honghoon Jang
Anjin Park
Keechul Jung
Paper
Numerics
Honghoon Jang, Anjin Park, Keechul Jung
e6a0a5aa-1798-4eaa-b48f-e5a0a6810875
A Parallel Implementation of the 2D Wavelet Transform Using CUDA
There is a multicore platform that is currently attracting enormous attention due to its tremendous potential in terms of sustained performance: the NVIDIA Tesla boards. These cards, intended for general-purpose computing on graphics processing units (GPGPU), are used as data-parallel computing devices. They are based on the Compute Unified Device Architecture (CUDA), which is common to the latest NVIDIA GPUs. The bottom line is a multicore platform that provides an enormous potential performance benefit driven by a non-traditional programming model. In this paper we try to provide some insight into the peculiarities of CUDA in order to target scientific computing by means of a specific example. In particular, we show that the parallelization of the two-dimensional fast wavelet transform for the NVIDIA Tesla C870 achieves a speedup of 20.8 for an image size of 8192x8192, when compared with the fastest host-only implementation using OpenMP and including the data transfers between main memory and device memory.
/content/cudazone/CUDABrowser/assets/images/applications/391_2d-wavelet_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/391_2d-wavelet_large.jpg
2009
01
01
01/01/2009
20
Joaquin Franco
Gregorio Bernabe
Juan Fernandez
Paper
Numerics
Joaquin Franco, Gregorio Bernabe, Juan Fernandez
63d5f310-4eb7-4f0d-82ea-5b4dc263e859
Towards Accelerated Computation of Atmospheric Equations Using CUDA
The main objective of this paper is to outline possible ways to achieve a substantial acceleration of advection-diffusion equation (A-DE) calculations, which are commonly used to describe pollutant behavior in the atmosphere. The A-DE is a partial differential equation (PDE), and in the general case it is usually solved by numerical integration due to its high complexity. These calculations are time consuming, so the main idea of our work is to adopt the CUDA platform and a commodity GPU card to perform them faster. The solution is based on the method of lines with a 4th-order Runge-Kutta scheme to handle the integration. The selected approach involves a number of auxiliary variables, so memory management is critical to achieving the desired performance. We have implemented several possible solutions that use different memory access schemes. A detailed evaluation is provided in this paper, and the obtained results show a tremendous processing speedup in comparison to the CPU.
/content/cudazone/CUDABrowser/assets/images/applications/390_600px-Lorenz_attractor.svg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/390_600px-Lorenz_attractor.svg_large.png
2009
01
01
01/01/2009
Vaclav Simek
Radim Dvorak Frantisek Zboril
Paper
Life Sciences
Vaclav Simek, Radim Dvorak, Frantisek Zboril
d23b8b83-4dbd-47b4-a4b6-ee9df082dfb0
K-Means on Commodity GPUs with CUDA
The k-means algorithm is one of the most famous unsupervised clustering algorithms. Many theoretical improvements to the performance of the original algorithm have been put forward, but almost all of them are based on Single Instruction Single Data (SISD) architecture processors (CPUs), which partly ignores the inherently parallel character of the algorithm. In this paper, a novel k-means algorithm based on Single Instruction Multiple Data (SIMD) architecture processors (GPUs) is proposed. In this algorithm, to accelerate the compute-intensive portions of traditional k-means, both data object assignment and recalculation of the k centroids are offloaded to the GPU in parallel. We have implemented this GPU-based k-means on the newest generation of GPUs with Compute Unified Device Architecture (CUDA). The numerical experiments demonstrated that GPU-based k-means can run up to 40 times faster than CPU-based k-means.
/content/cudazone/CUDABrowser/assets/images/applications/389_AndromedaKMEANSK_4_1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/389_AndromedaKMEANSK_4_1_large.jpg
2009
03
01
03/01/2009
40
Bai Hong-tao
He Li-li
Ouyang Dan-tong
Paper
Numerics
Bai Hong-tao, He Li-li, Ouyang Dan-tong
7a2fcb84-0960-415d-901a-d59f0a3a81c3
Accelerating K-Means on the Graphics Processor via CUDA
In this paper an optimized k-means implementation on the graphics processing unit (GPU) is presented. NVIDIA's Compute Unified Device Architecture (CUDA), available from the G80 GPU family onwards, is used as the programming environment. Emphasis is placed on optimizations directly targeted at this architecture to best exploit the computational capabilities available. Additionally, drawbacks and limitations of previous related work, e.g. maximum instance, dimension and centroid count, are addressed. The algorithm is realized in a hybrid manner, parallelizing distance calculations on the GPU while sequentially updating cluster centroids on the CPU based on the results from the GPU calculations. An empirical performance study on synthetic data is given, demonstrating a maximum 14x speed increase over a fully SIMD-optimized CPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/388_k_means_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/388_k_means_large.jpg
2009
04
01
04/01/2009
14
Mario Zechner
Michael Granitzer
Paper
Other
Mario Zechner, Michael Granitzer
a0bd4162-4ba4-4139-b2c6-b27e17553509
Design of a parallel AES for graphics hardware using the CUDA framework
Web servers often need to manage encrypted transfers of data. The encryption activity is computationally intensive, and exposes a significant degree of parallelism. At the same time, cheap multicore processors are readily available on graphics hardware, and toolchains for development of general purpose programs are being released by the vendors. In this paper, we propose an effective implementation of the AES-CTR symmetric cryptographic primitive using the CUDA framework. We provide quantitative data for different implementation choices and compare them with the common CPU-based OpenSSL implementation on a performance-cost basis. With respect to previous works, we focus on optimizing the implementation for practical application scenarios, and we provide a throughput improvement of over 14 times. We also provide insights on the programming knowledge required to efficiently exploit the hardware resources by exposing the different kinds of parallelism built in the AES-CTR cryptographic primitive.
/content/cudazone/CUDABrowser/assets/images/applications/387_encryption_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/387_encryption_large.jpg
Academia
Politecnico di Milano, Italy
2009
01
01
01/01/2009
14
Andrea Di Biagio
Paper
Other
Andrea Di Biagio
bb9a79fd-0b9b-44f3-b56a-5e5d04a51acc
vCUDA: GPU accelerated high performance computing in virtual machines
This paper describes vCUDA, a GPGPU (General Purpose Graphics Processing Unit) computing solution for virtual machines. vCUDA allows applications executing within virtual machines (VMs) to leverage hardware acceleration, which can be beneficial to the performance of a class of high performance computing (HPC) applications. The key idea in our design is API call interception and redirection. With API interception and redirection, applications in VMs can access the graphics hardware device and achieve high performance computing in a transparent way. We carry out a detailed analysis of the performance and overhead of our framework. Our evaluation shows that GPU acceleration for HPC applications in VMs is feasible and competitive with those running in a native, non-virtualized environment. Furthermore, our evaluation also identifies the main cause of overhead in our current framework, and we give some suggestions for future improvement.
/content/cudazone/CUDABrowser/assets/images/applications/386_electricsheep_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/386_electricsheep_large.jpg
Academia
Advanced Internet and Media Lab, School of Computer and Communications, Hunan University, Chang Sha
2009
01
01
01/01/2009
Lin Shi
Paper
Other
355534f2-32b2-4fa0-8a1f-3c7bdbeb17c8
Accelerating error correction in high-throughput short-read DNA sequencing data with CUDA
Emerging DNA sequencing technologies open up exciting new opportunities for genome sequencing by generating read data with a massive throughput. However, produced reads are significantly shorter and more error-prone compared to the traditional Sanger shotgun sequencing method. This poses challenges for de-novo DNA fragment assembly algorithms in terms of both accuracy (to deal with short, error-prone reads) and scalability (to deal with very large input data sets). In this paper we present a scalable parallel algorithm for correcting sequencing errors in high-throughput short-read data. It is based on spectral alignment and uses the CUDA programming model. Our computational experiments on a GTX 280 GPU show runtime savings between 10 and 19 times (for different error-rates using simulated datasets as well as real Solexa/Illumina datasets).
/content/cudazone/CUDABrowser/assets/images/applications/385_figure1D_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/385_figure1D_large.jpg
Academia
School of Computer Engineering, Nanyang Technological University, Singapore
2009
01
01
01/01/2009
19
Haixiang Shi
Paper
Life Sciences
f3c0c426-1df9-4fb3-b9f4-1bd7f62a6978
Accelerating the reduction to upper Hessenberg
We present a Hessenberg reduction (HR) algorithm for hybrid multicore + GPU systems that gets more than 16x performance improvement over the current LAPACK algorithm running just on current multicores (in double precision arithmetic). This enormous acceleration is due to proper matching of algorithmic requirements to architectural strengths of the hybrid components. The reduction itself is an important linear algebra problem, especially with its relevance to eigenvalue problems. The results described in this paper are significant because Hessenberg reduction has not yet been accelerated on multicore architectures, and it plays a significant role in solving nonsymmetric eigenvalue problems. The approach can be applied to the symmetric problem and, in general, to two-sided matrix transformations. The work further motivates and highlights the strengths of hybrid computing: to harness the strengths of the components of a hybrid architecture to get significant computational acceleration which otherwise may have been impossible.
/content/cudazone/CUDABrowser/assets/images/applications/384_criticalpath_small.png
/content/cudazone/CUDABrowser/assets/images/applications/384_criticalpath_large.png
Academia
University of Tennessee / Oak Ridge National Laboratory / University of Manchester
2009
05
29
05/29/2009
16
Stanimire Tomov
Jack Dongarra
Paper
Numerics
Stanimire Tomov, Jack Dongarra, Hessenberg reduction, eigenvalue problems, two-sided factorizations
893b3219-bf47-4757-87f1-d83b48783c83
Zonar
ZONAR is an advanced STP system that handles sales, market making, portfolio and risk management of all major asset classes in a multi-user environment. The new version is optimized for CUDA in several areas: option calculations and formulas, volatility smile calculations, interpolations, and portfolio calculations.
/content/cudazone/CUDABrowser/assets/images/applications/383_portf_small.png
/content/cudazone/CUDABrowser/assets/images/applications/383_portf_large.png
Commercial
SoftCapital
http://www.softcapital.com/index.htm
2009
08
05
08/01/2009
40
Commercial
Lars Pehrsson
Application
Finance
Numerics
Finance, derivatives, trading, volatility, Lars Pehrsson, larsnsj@gmail.com
9494d226-5bd1-4a88-98dd-1f9536598781
PointTrackerLibrary
A C++ library with various frontends and a full-search SSD CUDA blockmatcher as backend, able to track many points in real time in images up to 2K resolution.
/content/cudazone/CUDABrowser/assets/images/applications/382_Peacock0000_small.png
/content/cudazone/CUDABrowser/assets/images/applications/382_Peacock0000_large.png
Research
JOANNEUM RESEARCH
www.joanneum.at/iis
2009
07
18
07/18/2009
12
Commercial
H.Fassold / H. Fuerntratt
Application
Multimedia
Graphics
Imaging
Video & Audio
Tracking, Blockmatching, Image conversion, H.Fassold,H. Fuerntratt,hermann.fuerntratt@joanneum.at
17ed7b65-302b-46b7-9d62-624cfce64935
PyOSSMGPU : Propagation of high-intensity pulses in nonlinear fiber bragg grating
The propagation of high-intensity laser pulses in a fiber Bragg grating, or in any nonlinear periodic dielectric medium, can be studied using coupled-mode theory. When applied to a Bragg grating in optical fiber, coupled-mode theory leads to two coupled-mode equations, which can be solved numerically using a classical fourth-order Runge-Kutta formula. For classical problems such as the propagation of a Bragg soliton in a very long grating (many cm), the Runge-Kutta method usually takes many hours to complete. PyOSSMGPU is a CUDA implementation of the optimized split-step method (OSSM) for solving the nonlinear coupled-mode equations that model wave propagation in nonlinear fiber Bragg gratings. The GPU-accelerated version of the OSSM code performs around 20x faster than the plain C version. Classical problems such as a Bragg soliton in a very long grating can typically be completed within a minute.
/content/cudazone/CUDABrowser/assets/images/applications/381_PyOSSMGPU_screenshot1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/381_PyOSSMGPU_screenshot1_large.jpg
Academia
Universite Laval
2009
07
10
07/10/2009
20
Martin Laprise
Multimedia
Code
Science
Martin Laprise, martin.laprise.1@ulaval.ca
f498d1da-fd44-4359-a83d-84a281d58c53
HONEI: A collection of libraries for numerical computations targeting multiple processor architectures
We present HONEI, an open-source collection of libraries offering a hardware-oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a twofold speedup over straightforward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development.
/content/cudazone/CUDABrowser/assets/images/applications/380_honei_small.png
/content/cudazone/CUDABrowser/assets/images/applications/380_honei_large.png
Academia
Institut für Physik, TU Dortmund, Germany
2009
01
01
01/01/2009
16
Danny van Dyk
Markus Geveler
Sven Mallach
Paper
Numerics
Danny van Dyk, Markus Geveler, Sven Mallach
55b2f736-0a6b-4893-aaed-272cb5dd676d
Towards Large-Scale Molecular Dynamics Simulations on Graphics Processors
Atomistic molecular dynamics (MD) simulations are a vital tool in chemical research, as they are able to provide a view of chemical systems and processes that is not obtainable through experiment. However, large-scale MD simulations require access to multicore clusters or supercomputers that are not always available to all researchers. Recently, many have begun to explore the power of graphics processing units (GPUs) for various applications, such as MD. We present preliminary results of water simulations carried out on GPUs. We compare the performance gained using a GPU versus the same simulation on a single CPU or multiple CPUs. We also address the use of more accurate double precision arithmetic with the newest GPUs and its cost in performance.
/content/cudazone/CUDABrowser/assets/images/applications/379_towards_molecular_dynamics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/379_towards_molecular_dynamics_large.png
Academia
University of Delaware, Newark
2009
01
01
01/01/2009
7
Joseph E. Davis
Adnan Ozsoy
Sandeep Patel
Paper
Life Sciences
Joseph E. Davis, Adnan Ozsoy, Sandeep Patel
ccf2228b-a635-4fc1-8875-4321542e5a7c
Multi-Dimensional Characterization of Temporal Data Mining on Graphics Processors
Through the algorithmic design patterns of data parallelism and task parallelism, the graphics processing unit (GPU) offers the potential to vastly accelerate discovery and innovation across a multitude of disciplines. For example, the exponential growth in data volume now presents an obstacle for high-throughput data mining in fields such as neuroinformatics and bioinformatics. As such, we present a characterization of a MapReduce-based data-mining application on a general-purpose GPU (GPGPU). Using neuroscience as the application vehicle, the results of our multi-dimensional performance evaluation show that a "one-size-fits-all" approach maps poorly across different GPGPU cards. Rather, a high-performance implementation on the GPGPU should factor in the 1) problem size, 2) type of GPU, 3) type of algorithm, and 4) data-access method when determining the type and level of parallelism. To guide the GPGPU programmer towards optimal performance within such a broad design space, we provide eight general performance characterizations of our data-mining application.
/content/cudazone/CUDABrowser/assets/images/applications/378_csatvt-header_small.gif
/content/cudazone/CUDABrowser/assets/images/applications/378_csatvt-header_large.gif
Academia
Department of Computer Science, Virginia Tech
2009
01
01
01/01/2009
Jeremy Archuleta
Yong Cao
Wu-chun Feng
Paper
Numerics
Jeremy Archuleta, Yong Cao, Wu-chun Feng
e728d1cd-65b2-4b12-af9e-e1a445f0b779
Molecular dynamics simulation of complex multiphase flow on a computer cluster with GPUs
Compute Unified Device Architecture (CUDA) was used to design and implement molecular dynamics (MD) simulations on graphics processing units (GPU). With an NVIDIA Tesla C870, a 20 to 60 fold speedup over that of one core of the Intel Xeon 5430 CPU was achieved, reaching up to 150 Gflops. MD simulation of cavity flow and particle-bubble interaction in liquid was implemented on multiple GPUs using a message passing interface (MPI). Up to 200 GPUs were tested on a special network topology, which achieves good scalability. The capability of GPU clusters for large-scale molecular dynamics simulation of meso-scale flow behavior was, therefore, uncovered.
/content/cudazone/CUDABrowser/assets/images/applications/377_molecular_dynamics_simulation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/377_molecular_dynamics_simulation_large.png
State Key Laboratory of Multi-Phase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences, Beijing
2009
01
01
01/01/2009
60
CHEN FeiGuo
GE Wei
LI JingHai
Paper
Life Sciences
multiphase flow, molecular dynamics, CUDA, GPU, parallel computing, CHEN FeiGuo, GE Wei, LI JingHai
e38ddfe2-3ca5-4cb0-9a6d-209006e8051a
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled GPUs
The Smith-Waterman algorithm is one of the most widely used tools for searching biological sequence databases due to its high sensitivity. Unfortunately, the Smith-Waterman algorithm is computationally demanding, which is further compounded by the exponential growth of sequence databases. The recent emergence of many-core architectures, and their associated programming interfaces, provides an opportunity to accelerate sequence database searches using commonly available and inexpensive hardware.
/content/cudazone/CUDABrowser/assets/images/applications/376_1756-0500-2-73-1-l_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/376_1756-0500-2-73-1-l_large.jpg
Research
BioMed Central Ltd
2009
02
01
02/01/2009
Yongchao Liu
Douglas L Maskell Bertil Schmidt
Paper
Life Sciences
Yongchao Liu, Douglas L Maskell, Bertil Schmidt
99c99e2d-b68b-4e24-a12d-e5ff3105c5b0
A High-Speed Multi-GPU Implementation of Bottom-Up Attention Using CUDA
In this paper a novel implementation of the saliency map model on a multi-GPU platform using CUDA technology is presented. The saliency map model is a well-known computational model for bottom-up attention selection and serves as a basis of many attention control strategies of cognitive vision systems. A real-time implementation is the prerequisite of an application of bottom-up attention on mobile robots and vehicles. Parallel computation on the Graphics Processing Unit (GPU) provides an excellent solution for this kind of compute-intensive image processing. Running on 1 to 4 NVIDIA GeForce 8800 (GTX) graphics cards, a frame rate of 313 fps at a resolution of 640 x 480 is achieved, which is approximately 8.5 times faster than the standard implementations on CPUs. The implementation is also evaluated using a high-speed camera at 200 Hz. Using two GPUs, only 2 ms of extra computational time for the saliency map generation, in addition to the camera capture time, is required for images of 640 x 480 pixels.
/content/cudazone/CUDABrowser/assets/images/applications/375_attention_small.png
/content/cudazone/CUDABrowser/assets/images/applications/375_attention_large.png
Academia
Institute of Automatic Control Engineering, Technische Universität München, Germany
2009
01
01
01/01/2009
9
Tingting Xu
Thomas Pototschnig
Kolja Kuhnlenz
Paper
Imaging
Tingting Xu, Thomas Pototschnig, Kolja Kuhnlenz
06a005b8-2c22-4aaa-9a47-1dc0e24dbfe8
Implementation of a Lattice-Boltzmann method for numerical fluid mechanics using the NVIDIA CUDA technology
The Lattice-Boltzmann method (LBM) is a distribution-function based approach to numerical fluid mechanics. Due to the simple formulation of the underlying algorithm, this method is well suited for parallelization and hardware acceleration using general-purpose graphics processing units (GPGPU). Within this work, LBM has been implemented in a new code with multi-GPU support and physically validated for a flow around a sphere. The performance analysis shows a remarkable speed-up of 1840% using 3 GPUs in comparison to a single-socket multi-core CPU calculation. Moreover, the validation for the chosen test case shows excellent agreement with available reference data.
/content/cudazone/CUDABrowser/assets/images/applications/374_boltzmann_small.png
/content/cudazone/CUDABrowser/assets/images/applications/374_boltzmann_large.png
Academia
Technische Universität München, Lehrstuhl für Aerodynamik
2009
01
01
01/01/2009
1840
Eugen Riegel
Thomas Indinger
Paper
Computational Fluid Dynamics
Eugen Riegel, Thomas Indinger
bf616128-b630-49cb-9ab0-996b495737b6
QP: A Heterogeneous Multi-Accelerator Cluster
We present a heterogeneous multi-accelerator cluster developed and deployed at NCSA. The cluster consists of 16 AMD dual-core CPU compute nodes each with four NVIDIA GPUs and one Xilinx FPGA. Cluster nodes are interconnected with both InfiniBand and Ethernet networks. The software stack consists of standard cluster tools with the addition of accelerator-specific software packages and enhancements to the resource allocation and batch sub-systems. We highlight several HPC applications that have been developed and deployed on the cluster. We also present our Phoenix application development framework that is meant to help with developing new applications and migrating existing legacy codes to heterogeneous systems.
/content/cudazone/CUDABrowser/assets/images/applications/373_heterogeneous_small.png
/content/cudazone/CUDABrowser/assets/images/applications/373_heterogeneous_large.png
Academia
University of Illinois at Urbana-Champaign, Urbana
2009
01
01
01/01/2009
48
Commercial
Michael Showerman
Jeremy Enos
Avneesh Pant
Paper
Other
heterogeneous system, acceleration co-processor, GPGPU, FPGA, Michael Showerman, Jeremy Enos, Avneesh Pant
8d2571a3-e901-47bc-a42a-2b2291dc858e
OpenMP to GPGPU: A Compiler Framework for Automatic Translation and Optimization
GPGPUs have recently emerged as powerful vehicles for general-purpose high-performance computing. Although a new Compute Unified Device Architecture (CUDA) programming model from NVIDIA offers improved programmability for general computing, programming GPGPUs is still complex and error-prone. This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications. The goal of this translation is to further improve programmability and make existing OpenMP applications amenable to execution on GPGPUs. In this paper, we have identified several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance. Experimental results from two important kernels (JACOBI and SPMUL) and two NAS OpenMP Parallel Benchmarks (EP and CG) show that the described translator and compile-time optimizations work well on both regular and irregular applications, leading to performance improvements of up to 50X over the unoptimized translation (up to 328X over serial on a CPU).
/content/cudazone/CUDABrowser/assets/images/applications/372_gpu_perf_small.png
/content/cudazone/CUDABrowser/assets/images/applications/372_gpu_perf_large.png
Academia
School of ECE, Purdue University West Lafayette, IN
2009
01
01
01/01/2009
50
Seyong Lee
Seung-Jai Min
Rudolf Eigenmann
Paper
Programming Tools
Seyong Lee, Seung-Jai Min, Rudolf Eigenmann
4dd15dbc-9919-413f-9603-3bd4744edbd5
Nuclei: GPU-accelerated Many-core Network Coding
While it is a well-known result that network coding achieves optimal flow rates in multicast sessions, its potential for practical use has remained in question due to its high computational complexity. Our previous work has attempted to design a hardware-accelerated and multi-threaded implementation of network coding to fully utilize multi-core CPUs, as well as SSE2 and AltiVec SIMD vector instructions on x86 and PowerPC processors. This paper represents another step forward, and presents the first attempt in the literature to maximize the performance of network coding by taking advantage of not only multi-core CPUs, but also potentially hundreds of computing cores in commodity off-the-shelf Graphics Processing Units (GPU).
/content/cudazone/CUDABrowser/assets/images/applications/371_network_coding_small.png
/content/cudazone/CUDABrowser/assets/images/applications/371_network_coding_large.png
Academia
University of Toronto / School of Computer Science Fudan University
2009
01
01
01/01/2009
3
Hassan Shojania
Baochun Li Xin Wang
Paper
Signal Processing
Hassan Shojania, Baochun Li, Xin Wang
c24dcc0f-c60c-45f9-8d57-588e9460a58f
High Performance Computation and Interactive Display of Molecular Orbitals on GPUs and Multi-core CPUs
The visualization of molecular orbitals (MOs) is important for analyzing the results of quantum chemistry simulations. The functions describing the MOs are computed on a three-dimensional lattice, and the resulting data can then be used for plotting isocontours or isosurfaces for visualization as well as for other types of analyses. Existing software packages that render MOs perform calculations on the CPU and require runtimes of tens to hundreds of seconds depending on the complexity of the molecular system. We present novel data-parallel algorithms for computing lattices of MOs on modern graphics processing units (GPUs) and multi-core CPUs. The fastest GPU algorithm achieves up to a 125-fold speedup over an optimized CPU implementation running on one CPU core. We also demonstrate possible benefits of dynamic GPU kernel generation and just-in-time compilation for MO calculation. We have implemented these algorithms within the popular molecular visualization program VMD, which can now produce high quality MO renderings for large systems in less than a second, and achieves the first-ever interactive animations of quantum chemistry simulation trajectories using only on-the-fly calculation.
/content/cudazone/CUDABrowser/assets/images/applications/374_molecular_orbitals_small.png
/content/cudazone/CUDABrowser/assets/images/applications/374_molecular_orbitals_large.png
Academia
University of Illinois at Urbana-Champaign Urbana
2009
01
01
01/01/2009
125
John E. Stone
Jan Saam
David J. Hardy
Paper
Science
John E. Stone, Jan Saam, David J. Hardy
f1be0f16-d60c-4c25-b225-6b7d575d6efc
CUDA Implementation of a Navier-Stokes Solver on Multi GPU Desktop Platforms for Incompressible Flows
Graphics processor units (GPU) that are traditionally designed for graphics rendering have emerged as massively-parallel "co-processors" to the central processing unit (CPU). Small-footprint desktop supercomputers with hundreds of cores that can deliver teraflops peak performance at the price of conventional workstations have been realized. A computational fluid dynamics (CFD) simulation capability with rapid computational turnaround time has the potential to transform engineering analysis and design optimization procedures. We describe the implementation of a Navier-Stokes solver for incompressible fluid flow using desktop platforms equipped with multi-GPUs. Specifically, NVIDIA's Compute Unified Device Architecture (CUDA) programming model is used to implement the discretized form of the governing equations. The projection algorithm to solve the incompressible fluid flow equations is divided into distinct CUDA kernels, and a unique implementation that exploits the memory hierarchy of the CUDA programming model is suggested. Using a quad-GPU platform, we observe two orders of magnitude speedup relative to a serial CPU implementation. Our results demonstrate that multi-GPU desktops can serve as a cost-effective small-footprint parallel computing platform to accelerate CFD simulations substantially.
/content/cudazone/CUDABrowser/assets/images/applications/373_navier_small.png
/content/cudazone/CUDABrowser/assets/images/applications/373_navier_large.png
Academia
Boise State University, Boise, Idaho
2009
01
01
01/01/2009
Julien C. Thibault
Inanc Senocak
Paper
Numerics
Julien C. Thibault, Inanc Senocak
78767ea4-1fbc-4def-b463-06022ab41ae5
Feasibility of GPU-assisted iterative image reconstruction for mobile C-arm CT
Computed tomography (CT) has been extensively studied and widely used for a variety of medical applications. The reconstruction of 3D images from a projection series is an important aspect of the modality. Reconstruction by filtered backprojection (FBP) is used by most manufacturers because of speed, ease of implementation, and relatively few parameters. Iterative reconstruction methods have a significant potential to provide superior performance with incomplete or noisy data, or with less than ideal geometries, such as cone-beam systems. However, iterative methods have a high computational cost, and regularization is usually required to reduce the effects of noise. The simultaneous algebraic reconstruction technique (SART) is studied in this paper, where the Feldkamp method (FDK) for filtered back projection is used as an initialization for iterative SART. Additionally, graphics hardware is utilized to increase the speed of SART implementation. Nvidia processors and compute unified device architecture (CUDA) form the platform for GPU computation. Total variation (TV) minimization is applied for the regularization of SART results. Preliminary results of SART on 3-D Shepp-Logan phantom using TV regularization and GPU computation are presented in this paper. Potential improvements of the proposed framework are also discussed.
/content/cudazone/CUDABrowser/assets/images/applications/372_image_reconstruction_for_mobile_c_arm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/372_image_reconstruction_for_mobile_c_arm_large.png
Academia
Scientific Computing and Imaging Institute, University of Utah
2009
01
01
01/01/2009
130
Yongsheng Pan
Ross Whitaker
Arvi Cheryauka
Paper
Medical Imaging
C-arm CT, FDK, SART, GPU, CUDA, TV, Yongsheng Pan, Ross Whitaker, Arvi Cheryauka
2f26c62f-b633-45aa-8690-9a493d3851aa
Scalable Parallel Programming with CUDA
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
/content/cudazone/CUDABrowser/assets/images/applications/371_fig5-7_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/371_fig5-7_large.jpg
Commercial
NVIDIA
http://www.nvidia.com
2008
04
01
04/01/2008
263
John Nickolls
Ian Buck
Michael Garland
Paper
Programming Tools
John Nickolls, Ian Buck, Michael Garland
774b5ab3-3d70-4061-9dc6-1809c3eaa8f3
GPU Decoder (Vegas/Premiere)
GPU Decoder comes as a plugin for Non-Linear Editors (Sony Vegas/Adobe Premiere). It uses the power of your NVIDIA graphics card to decode H.264 video files such as AVCHD or files from the Canon EOS 5D Mark II.
/content/cudazone/CUDABrowser/assets/images/applications/370_packshot_small.png
/content/cudazone/CUDABrowser/assets/images/applications/370_packshot_large.png
Commercial
DIVIDE FRAME
http://www.divideframe.com
2009
07
11
07/11/2009
10
Commercial
Robin Lobel
Application
Multimedia
Video & Audio
gpu, decoder, sony, vegas, adobe, premiere, panasonic, avchd, h.264, h264, quicktime, canon, Robin Lobel
008d8c73-6b0b-4838-a211-8587b4f23232
Data structure design for GPU based heterogeneous systems
This paper reports on our experience with data structure design for systems having both multiple CPU cores and a programmable graphics card. We integrate our data structures into the game-like application OpenSteerDemo and compare them on two different PC systems. One system has a relatively fast single-core CPU and a slower GPU, whereas the other uses a high-end GPU with a slower multi-core CPU. We design two grid-based data structures for effectively solving the k-nearest neighbor problem. The static grid uses grid cells of uniform size, whereas the dynamic grid does not rely on given grid cells, but creates them at runtime. The static grid is designed for fast data structure creation, in contrast to the dynamic grid, which is designed to provide high simulation performance on the GPU. The high performance on the GPU is achieved by explicitly taking advantage of the special GPU memory system, which however comes at the cost of a more complex construction algorithm. Our experiments show that with a slower CPU the complex algorithm for creating the dynamic grid becomes the bottleneck, and the increased simulation performance on the GPU thereby does not provide an increase in performance compared to the static grid based implementation. This also holds true when the simulation is run with a faster CPU and a slower GPU, even though the break-even point is different. Furthermore, we experimented with data structure creation on the GPU, but the performance of static grid creation there is not competitive, whereas the creation of the dynamic grid on the GPU is not possible due to the lack of support for recursive functions. We provide a dynamic grid creation algorithm that uses multiple CPU cores. However, this algorithm is slower than its sequential counterpart due to the parallelization overhead.
/content/cudazone/CUDABrowser/assets/images/applications/369_data_structure_small.png
/content/cudazone/CUDABrowser/assets/images/applications/369_data_structure_large.png
Academia
Research Group Programming Languages / Methodologies, Universität Kassel
2008
12
01
12/01/2008
Jens Breitbart
Paper
Numerics
GPGPU, k-nearest neighbor, games, OpenMP, CUDA, Jens Breitbart
aa81fd54-34d5-4855-abe2-74940dd2d268
Comparing CUDA and OpenGL implementations for a Jacobi iteration
The use of the GPU as a general purpose processor is becoming more popular and there are different approaches for this kind of programming. In this paper we present a comparison between different implementations of the OpenGL and CUDA approaches for solving our test case, a weighted Jacobi iteration with a structured matrix originating from a finite element discretization of the elliptic PDE part of the cardiac bidomain equations. The CUDA approach using textures proved to be the fastest, with a speedup of 78 over a CPU implementation. CUDA proved to be an efficient and easy way of programming GPUs for general purpose problems, though it also makes it easier to write inefficient code.
/content/cudazone/CUDABrowser/assets/images/applications/368_visualization_simulation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/368_visualization_simulation_large.png
Academia
Karl-Franzens-Universität Graz
2008
12
01
12/01/2008
78
Ronan Amorim
Gundolf Haase
Manfred Liebmann
Paper
Programming Tools
Ronan Amorim, Gundolf Haase, Manfred Liebmann
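Note on the entry above: the weighted Jacobi iteration it benchmarks can be sketched in a few lines. This is a minimal serial illustration only, not the authors' OpenGL or CUDA code. The function name `weighted_jacobi`, the dense list-of-lists matrix, the damping factor `omega = 2/3`, and the fixed iteration count are all assumptions made for the example.

```python
def weighted_jacobi(A, b, omega=2.0 / 3.0, iterations=200):
    """Approximately solve A x = b with the weighted (damped) Jacobi iteration.

    Update rule: x_i <- x_i + omega * (b_i - sum_j A[i][j] * x_j) / A[i][i].
    A is a dense list of rows here (an assumption for illustration only).
    """
    n = len(b)
    x = [0.0] * n  # start from the zero vector
    for _ in range(iterations):
        x_new = []
        for i in range(n):
            # Residual of row i, computed from the previous iterate only,
            # which is what makes every row update independent (data parallel).
            residual = b[i] - sum(A[i][j] * x[j] for j in range(n))
            x_new.append(x[i] + omega * residual / A[i][i])
        x = x_new
    return x


if __name__ == "__main__":
    A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
    b = [2.0, 4.0, 10.0]
    print(weighted_jacobi(A, b))
```

Because each row update reads only the previous iterate, one GPU thread per unknown is a natural mapping; the memory layout (for example, textures versus global memory) is where the implementations compared in the entry differ.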
d473a665-7e53-4405-9be4-952aeea0f762
GPU Acceleration of an Unmodified Parallel Finite Element Navier Stokes Solver
We have previously suggested a minimally invasive approach to include hardware accelerators into an existing large-scale parallel finite element PDE solver toolkit, and implemented it into our software FEAST. Our concept has the important advantage that applications built on top of FEAST benefit from the acceleration immediately, without changes to application code. In this paper we explore the limitations of our approach by accelerating a Navier-Stokes solver. This nonlinear saddle point problem is much more involved than our previous tests, and does not exhibit an equally favourable acceleration potential: not all computational work is concentrated inside the linear solver. Nonetheless, we are able to achieve speedups of more than a factor of two on a small GPU-enhanced cluster. We conclude with a discussion of how our concept can be altered to further improve acceleration.
/content/cudazone/CUDABrowser/assets/images/applications/367_channel_flow_small.png
/content/cudazone/CUDABrowser/assets/images/applications/367_channel_flow_large.png
Academia
Angewandte Mathematik und Numerik, TU Dortmund, Germany
2009
06
23
06/23/2009
2
Dominik Goddeke
Sven H.M. Buijssen
Hilmar Wobker
Stefan Turek
Paper
Numerics
Dominik Goddeke, Sven H.M. Buijssen, Hilmar Wobker, Stefan Turek
dee67bc1-3342-481f-b4d9-64f377689b1e
GPU Implementation of the Multiple Back-Propagation Algorithm
In this paper, we describe a parallel implementation of the Multiple Back-Propagation (MBP) algorithm and present the results obtained when running the algorithm on two well-known benchmarks. The implementation described in the paper will be included in the next version of the Multiple Back-Propagation Software.
/content/cudazone/CUDABrowser/assets/images/applications/366_mbpTop_small.png
/content/cudazone/CUDABrowser/assets/images/applications/366_mbpTop_large.png
Academia
IPG
2009
09
01
09/01/2009
40
Noel Lopes
Application
Paper
Neural Networks
Neural Networks, Multiple Back-Propagation, Noel Lopes
20d2df07-85f7-4bc9-9689-ab36bad685af
F2C-ACC
F2C-ACC was developed to reduce the time required to modify codes to run on the GPU or Cell devices. We believe that, in time, Fortran language support will be provided by PGI and others. In the meantime, we have developed a language translator to convert codes from Fortran into C or CUDA-C. Both translations are useful: C can be used for testing and as a base code for running on the IBM Cell processor, and the generated CUDA code serves as a base for running on the GPU. The translator handles parsing of all Fortran 95 language features, but output generation of the C and CUDA code is not complete.
/content/cudazone/CUDABrowser/assets/images/applications/365_notepadplus_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/365_notepadplus_large.jpg
Research
NOAA
2009
05
01
05/01/2009
Mark Govett
Application
Code
Programming Tool
Medical Imaging
Mark Govett
dd07c8a1-e8f8-472d-95f5-9e4cfbf8e928
Interactive Point-Based Rendering of Higher-Order Tetrahedral Data
Computational simulations frequently generate solutions defined over very large tetrahedral volume meshes containing many millions of elements. Furthermore, such solutions may often be expressed using non-linear basis functions. Certain solution techniques, such as discontinuous Galerkin methods, may even produce non-conforming meshes. Such data is difficult to visualize interactively, as it is far too large to fit in memory and many common data reduction techniques, such as mesh simplification, cannot be applied to non-conforming meshes. We introduce a point-based visualization system for interactive rendering of large, potentially non-conforming, tetrahedral meshes. We propose methods for adaptively sampling points from non-linear solution data and for decimating points at run time to fit GPU memory limits. Because these are streaming processes, memory consumption is independent of the input size. We also present an
/content/cudazone/CUDABrowser/assets/images/applications/364_thumbnail_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/364_thumbnail_large.jpg
Academia
IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS
2008
1
1
1/1/2008
Yuan Zhou
Michael Garland
Paper
Imaging
Yuan Zhou, Michael Garland
7b377967-35c3-427c-ba6e-196138e952fd
Rapid multipole graph drawing on the GPU
As graphics processors become powerful, ubiquitous and easier to program, they have also become more amenable to general purpose high-performance computing, including the computationally expensive task of drawing large graphs. This paper describes a new parallel analysis of the multipole method of graph drawing to support its efficient GPU implementation. We use a variation of the Fast Multipole Method to estimate the long distance repulsive forces in force directed layout. We support these multipole computations efficiently with a k-d tree constructed and traversed on the GPU. The algorithm achieves impressive speedup over previous CPU and GPU methods, drawing graphs with hundreds of thousands of vertices within a few seconds via CUDA on an NVIDIA GeForce 8800 GTX.
/content/cudazone/CUDABrowser/assets/images/applications/363_thumbnail_small.png
/content/cudazone/CUDABrowser/assets/images/applications/363_thumbnail_large.png
Academia
University of Illinois / NVIDIA
2008
09
01
09/01/2008
4
Apeksha Godiyal
Jared Hoberock
Michael Garland
Paper
Electronic Design Automation
Apeksha Godiyal, Jared Hoberock, Michael Garland
fdfd95d7-184d-4637-879e-26b8906b3aec
Fast BVH Construction on GPUs
We present two novel parallel algorithms for rapidly constructing bounding volume hierarchies on manycore GPUs. The first uses a linear ordering derived from spatial Morton codes to build hierarchies extremely quickly and with high parallel scalability. The second is a top-down approach that uses the surface area heuristic (SAH) to build hierarchies optimized for fast ray tracing. Both algorithms are combined into a hybrid algorithm that removes existing bottlenecks in the algorithm for GPU construction performance and scalability, leading to significantly decreased build time. The resulting hierarchies are close in quality to optimized SAH hierarchies, but the construction process is substantially faster, leading to a significant net benefit when both construction and traversal cost are accounted for. Our preliminary results show that current GPU architectures can compete with CPU implementations of hierarchy construction running on multicore systems. In practice, we can construct hierarchies of models with up to several million triangles and use them for fast ray tracing or other applications.
/content/cudazone/CUDABrowser/assets/images/applications/362_thumbnail_small.png
/content/cudazone/CUDABrowser/assets/images/applications/362_thumbnail_large.png
Academia
University of North Carolina at Chapel Hill / NVIDIA
2009
05
01
05/01/2009
C. Lauterbach
M. Garland
S. Sengupta
Multimedia
Paper
Imaging
C. Lauterbach, M. Garland, S. Sengupta
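Note on the entry above: the "linear ordering derived from spatial Morton codes" can be illustrated with a short sketch. It is not the authors' GPU implementation. The helper names (`expand_bits`, `morton3d`, `morton_order`), the 10 bits per axis, and the assumption that points lie in the unit cube are all choices made for this example.

```python
def expand_bits(v, bits=10):
    """Spread the low `bits` bits of v so that bit i moves to bit 3*i."""
    out = 0
    for i in range(bits):
        out |= ((v >> i) & 1) << (3 * i)
    return out


def morton3d(x, y, z, bits=10):
    """Interleave three quantized coordinates into one Morton (Z-order) code."""
    return (
        (expand_bits(x, bits) << 2)
        | (expand_bits(y, bits) << 1)
        | expand_bits(z, bits)
    )


def morton_order(points, bits=10):
    """Return point indices sorted by Morton code.

    Points are assumed to lie in the unit cube [0, 1]^3; coordinates
    outside that range are clamped (an assumption of this sketch).
    """
    scale = (1 << bits) - 1

    def code(p):
        x, y, z = (min(max(int(c * scale), 0), scale) for c in p)
        return morton3d(x, y, z, bits)

    return sorted(range(len(points)), key=lambda i: code(points[i]))


if __name__ == "__main__":
    pts = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.1, 0.1, 0.9)]
    print(morton_order(pts))
```

Sorting primitives by these codes places spatially nearby primitives next to each other in memory, so a hierarchy can be formed by splitting the sorted range wherever the highest differing code bit changes.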
cf7594c3-abb8-4fd2-8e0f-a570be1629e4
Designing Efficient Sorting Algorithms for Manycore GPUs
We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine.
/content/cudazone/CUDABrowser/assets/images/applications/361_thumbnail_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/361_thumbnail_large.jpg
Academia
Dept. of Electrical Engineering, University of California / NVIDIA
http://eecs.berkeley.edu
2009
05
01
05/01/2009
Nadathur Satish
Mark Harris
Michael Garland
Numerics
Nadathur Satish, Mark Harris, Michael Garland
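Note on the entry above: a least-significant-digit radix sort is built from three steps per digit (count, prefix sum, scatter). The serial sketch below shows only that structure and is not the authors' CUDA routine. The function name `radix_sort`, the 4-bit digit width, and the restriction to non-negative integers below 2**32 are assumptions made for this example.

```python
def radix_sort(keys, bits_per_pass=4, key_bits=32):
    """Sort non-negative integers below 2**key_bits with an LSD radix sort.

    Each pass is a stable counting sort on one digit:
      1. histogram the digit values,
      2. take an exclusive prefix sum to get each bucket's start offset,
      3. scatter keys to their output positions.
    """
    radix = 1 << bits_per_pass
    mask = radix - 1
    for shift in range(0, key_bits, bits_per_pass):
        # Step 1: count how many keys fall into each digit bucket.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Step 2: exclusive prefix sum gives the first slot of each bucket.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]
        # Step 3: stable scatter into the output array.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out
    return keys


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

The histogram and scatter loops are independent per key, and the prefix sum is a standard parallel scan, which is why this formulation maps well to manycore hardware.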
de61eb09-d3dc-43a2-9fe7-13e3709e7c04
Cryostasis: Benchmark
Sneak Peek - Featuring GPU accelerated NVIDIA PhysX effects
/content/cudazone/CUDABrowser/assets/images/applications/359_screenshot_cryostasis_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/359_screenshot_cryostasis_large.jpg
Commercial
Action Forms ltd
http://www.cryostasis-game.com/
2008
12
01
12/01/2008
10
Commercial
Action Forms ltd
Application
Multimedia
Game Physics
Action Forms ltd
166a76d3-01d8-4200-857a-3bb71d282a60
Ocelot
CUDA provides a programming model with abstractions that are amenable to many-core architectures in general, not only GPUs. We argue that the optimal partitioning of an application may require both highly data-parallel architectures that rely on hardware multithreading to hide memory latency and superscalar architectures with deep cache hierarchies. Ocelot aims to push towards the development of tools that can compile CUDA programs to multiple architectures, and dynamically determine which parts of an application should be run on each architecture. We currently have code analysis tools for CUDA and PTX, as well as a full-featured emulator for PTX.
/content/cudazone/CUDABrowser/assets/images/applications/358_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/358_logo_large.png
Academia
Georgia Institute of Technology
www.ece.gatech.edu/research/labs/casl/index.html
2009
07
09
07/09/2009
Open source
Gregory Diamos
Application
Libraries
Gregory Diamos
a3eb01b2-b364-4d42-8abe-9187a6b2136c
Prestack migration on GPU
We provide world-leading prestack migration technology on the GPU. The program is already used by many oil fields in China.
/content/cudazone/CUDABrowser/assets/images/applications/357_2009041220515865320_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/357_2009041220515865320_large.jpg
Commercial
Geostar Science & Technology Co. LTD
2008
07
01
07/01/2008
80
Commercial
Tong Xiaolong
Application
Paper
Oil & Gas
prestack migration, Tong Xiaolong
45413113-1fe7-4f1f-b61a-b39a80f6c99a
Crazy Machines 2
Casual gaming could not get any better: this GPU-accelerated puzzle game challenges the player to create a series of interacting mechanisms to achieve the final goal. GPU-accelerated PhysX is used for amazing fluid simulation which interacts with the machines and puzzle elements. A fully playable multi-level version is available as a free download as part of NVIDIA's Powerpack.
/content/cudazone/CUDABrowser/assets/images/applications/356_screenshot_crazymachines_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/356_screenshot_crazymachines_large.jpg
Commercial
Viva Media LLC
http://www.crazymachinesgame.com/
2008
12
1
12/1/2008
5
Commercial
Viva Media LLC
Application
Multimedia
Game Physics
Viva Media LLC
b39b2278-ae30-4a00-bdab-c7efa59f065d
Star Tales Benchmark Demo
This exciting social networking game features GPU-accelerated PhysX for lifelike simulation of clothing and hair. Characters can dance and their clothing will move and interact with the surroundings.
/content/cudazone/CUDABrowser/assets/images/applications/355_screenshot_startales_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/355_screenshot_startales_large.jpg
Commercial
QWD1
2008
12
1
12/1/2008
10
Commercial
QWD1
Application
Multimedia
Game Physics
QWD1
68630bdc-c298-481b-905b-e806a212ad99
Sacred 2: Fallen Angel - PhysX Game Patch
Sacred 2: Fallen Angel - PhysX Game Patch is an Action Role-playing Game (RPG) which occurs in a vast, graphically rich world, called Ancaria. The free GPU PhysX patch extends this richness to the physical world with the addition of physical debris, physical spell particles, and force fields. Storms come to life as a myriad of leaves dance in the wind and collide with the environment. Players can finally "feel" the power of spells as surrounding leaves and pebbles are blown about and magic spell particles fly about the environment, bouncing off buildings and tumbling down hillsides. Once you've experienced the amount of energy and detail this patch brings to the environment, there is no going back.
/content/cudazone/CUDABrowser/assets/images/applications/354_screenshot_sacred2_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/354_screenshot_sacred2_large.jpg
Ascaron Entertainment GmbH 2008
http://www.sacred2.com/en.html
2008
12
1
12/1/2008
5
Commercial
Ascaron Entertainment GmbH 2008
Application
Multimedia
Game Physics
Ascaron Entertainment GmbH 2008
bfdb49b3-472b-4ddb-8772-7ff5640c5f27
Warmonger
This multi-player shooter with five levels sets the standard for interactivity for all multi-player shooters. It makes terrific use of GPU PhysX. Every building is destructible. The constant battle has created an environment of debris, rags and embers which all interact with the environment. You can create obstacles and affect game play by blowing up buildings. You can protect your rear by blowing up staircases behind you. You can create shortcuts to get to your opponents by blowing up the wall they are sheltering behind. There is NO PLACE TO HIDE!
/content/cudazone/CUDABrowser/assets/images/applications/353_warmonger-free_fullgame_01_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/353_warmonger-free_fullgame_01_large.jpg
Commercial
NetDevil
http://www.netdevil.com
2008
08
12
08/12/2008
5
Commercial
NetDevil
Application
Multimedia
Game Physics
NetDevil
7e81a1fa-17d2-4a23-b4f7-52a3ac094cc8
PhysX Screen Saver
The PhysX screen saver uses the power of accelerated GPU Physics to create this unique hypnotic experience. The forever rolling ball will power through objects and cloth banners featuring your pictures, against your own favorite panoramic image -- but now you have a chance to customize and mod it to make your own personal creation. The full source for the PhysX Screen Saver is now available from The Game Creators at: http://gdk.thegamecreators.com/
/content/cudazone/CUDABrowser/assets/images/applications/352_it_photo_111042_28_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/352_it_photo_111042_28_large.jpg
Commercial
The Game Creators
http://www.thegamecreators.com/
2008
08
12
08/12/2008
5
Commercial
The Game Creators
Application
Multimedia
Game Physics
The Game Creators
ea10a5ce-2e1a-45d0-b6b1-5f4297590aa2
Ghost Recon Advanced Warfighter 2
Tom Clancy's Ghost Recon Advanced Warfighter 2 builds off of the events in the first game and places gamers in control of the U.S. military's elite fighting unit, the Ghosts. In the year 2014, the rising conflict between Mexican loyalists and insurgent rebel forces has thrown Mexico into full-scale civil war. Under the command of Captain Scott Mitchell, the Ghosts are called upon to face an imminent threat to the United States. The fate of two countries now lies in the hands of the Ghosts as they fend off an attack on U.S. soil. Equipped with the most cutting-edge weaponry and technology, the Ghosts must battle on both sides of the border to neutralize the escalating rebel threat. Use of PhysX extends the visually rich and complex GRAW 2 to an entirely new level of game realism and interactivity, with dynamic gameplay physics, impactful environmental effects and persistent destruction and debris throughout. PhysX provides a realistic combat experience, from the characters to tanks to buildings and every other object within the game world. When something explodes, the physics engine kicks in to create superb explosions, emitting debris which affects gameplay.
/content/cudazone/CUDABrowser/assets/images/applications/351_GRAW2_Logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/351_GRAW2_Logo_large.jpg
Ubisoft
2008
08
12
08/12/2008
5
Commercial
Red Storm Entertainment
Application
Multimedia
Game Physics
Red Storm Entertainment
bcfc968e-677a-44c2-9cf3-34042de48d2e
Selective and Adaptive Supersampling for Real-Time Ray Tracing
While supersampling is an essential element for high quality rendering, high sampling rates, routinely employed in offline rendering, are still considered quite burdensome for real-time ray tracing. In this paper, we propose a selective and adaptive supersampling technique aimed at the development of a real-time ray tracer on today's many-core processors. For efficient utilization of very precious computing time, this technique explores both image-space and object-space attributes, which can be easily gathered during the ray tracing computation, minimizing rendering artifacts by cleverly distributing ray samples to rendering elements according to priorities that are selectively set by a user. Our implementation on the current GPU demonstrates that the presented algorithm makes high sampling rates as effective as 9 to 16 samples per pixel more affordable than before for real-time ray tracing.
/content/cudazone/CUDABrowser/assets/images/applications/350_AdaptiveSamplingRenderedImage_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/350_AdaptiveSamplingRenderedImage_large.jpg
Sogang Computer Graphics Lab
http://grmanet.sogang.ac.kr/results_rtg.html#gpurt2
2009
06
01
06/01/2009
B. Jin
I. Ihm
C. Park
Multimedia
Paper
Imaging
B. Jin, I. Ihm, C. Park
a8845841-7514-46a1-9c0d-d28fef4e68ea
SIMD Optimization of Linear Expressions for Programmable Graphics Hardware
The increased programmability of graphics hardware allows efficient GPU implementations of a wide range of general computations on commodity PCs. An important factor in such implementations is how to fully exploit the SIMD computing capacities offered by modern graphics processors. Linear expressions in the form of y = Ax + b, where A is a matrix, and x, y, and b are vectors, constitute one of the most basic operations in many scientific computations. In this paper, we propose a SIMD code optimization technique that enables efficient shader codes to be generated for evaluating linear expressions. It is shown that performance can be improved considerably by efficiently packing arithmetic operations into four-wide SIMD instructions through reordering of the operations in linear expressions. We demonstrate that the presented technique can be used effectively for programming both vertex and pixel shaders for a variety of mathematical applications, including integrating differential equations and solving a sparse linear system of equations using iterative methods.
/content/cudazone/CUDABrowser/assets/images/applications/349_simd_ole4pgh_title_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/349_simd_ole4pgh_title_large.jpg
Department of Computer Science, University of Texas / Sogang University
2008
07
01
07/01/2008
Chandrajit Bajaj
Insung Ihm
Jungki Min
Imaging
programmable GPU, vertex shader, pixel shader, numerical computing, linear expression, SIMD, shader code optimization, Chandrajit Bajaj, Insung Ihm, Jungki Min
12d5c225-4df4-4315-b428-40025933ac3d
Nebula 3
VST Plugin based on Volterra Kernels Series. It emulates different types of vintage gear: equalizers, filters, microphones, preamps, compressors, reverb and generic time-variant processors (chorus, flangers, phasers)
/content/cudazone/CUDABrowser/assets/images/applications/348_acustica_small.png
/content/cudazone/CUDABrowser/assets/images/applications/348_acustica_large.png
Commercial
ACUSTICA
http://www.acusticaudio.net/
2009
06
01
06/01/2009
2
Commercial
ACUSTICA
Application
Video & Audio
ACUSTICA
225159b9-381a-45ca-9fa2-556bdac7e48f
3D-Coat
3D-Coat is a creative 3D painting, texturing and sculpting tool that uses CUDA to speed up Voxel sculpting so the application can keep up with the artist's creative input.
/content/cudazone/CUDABrowser/assets/images/applications/347_3dcoat_small.png
/content/cudazone/CUDABrowser/assets/images/applications/347_3dcoat_large.png
Commercial
Pilgway
http://www.3dcoat.com
2009
06
01
06/01/2009
2
Consumer
Application
Multimedia
3D digital content creation
Imaging
79b0b95f-6bc4-4ccd-b719-8d33ba9f5875
MediaShow Espresso
MediaShow Espresso is hassle-free. Its user interface is intuitively designed to convert videos in two easy steps, allowing you to convert all your favorite videos for playback on iPhone, PSP, Xbox, YouTube and more. Optimized for NVIDIA CUDA, MediaShow Espresso offers performance up to 10 times faster. Faster performance doesn't necessarily mean you have to waste power though, as you'll find out with MediaShow Espresso's energy-saving feature, auto-shutdown.
/content/cudazone/CUDABrowser/assets/images/applications/346_mediashowespresso_small.png
/content/cudazone/CUDABrowser/assets/images/applications/346_mediashowespresso_large.png
Commercial
CyberLink
http://www.cyberlink.com/
2009
4
29
4/29/2009
4
CyberLink
Application
Video & Audio
CyberLink
7b3f7561-d959-44df-a9a8-0d5cac1c630c
Move it 1.5
Nero Move it lets customers enjoy all their multimedia files on compatible portable and mobile devices. Nero Move it uses CUDA to convert videos in a fraction of the usual time, moving content between mobile devices at up to 5x the speed of classic CPU-based transcoding.
/content/cudazone/CUDABrowser/assets/images/applications/345_box-moveit-96_small.png
/content/cudazone/CUDABrowser/assets/images/applications/345_box-moveit-96_large.png
Commercial
Nero
http://www.nero.com/
2009
04
30
4/20/2009
4
Nero
Application
Video & Audio
Nero
bfd8e336-ea99-45e2-8b96-1e65404c3cf3
Super LoiLoScope
Super LoiLoScope is an easy-to-use super high-speed GPU-based video editing software with a game-like ultra intuitive GUI, developed by two top Japanese game creators.
/content/cudazone/CUDABrowser/assets/images/applications/344_loiloscope_small_.png
/content/cudazone/CUDABrowser/assets/images/applications/344_loiloscope_large_.png
Commercial
LoiLo Inc.
http://www.loilo.tv/
2009
1
31
1/31/2009
10
Consumer
LoiLo Inc.
Application
Video & Audio
LoiLo Inc.
9c83700f-4313-46ac-b21e-8bbbab6c579c
ffA Software: Performance Acceleration
Integration of Graphics Processing Unit (GPU) based computing capabilities into ffA's SVI Pro and SEA3D Pro desktop image processing and analysis applications is delivering step increases in the processing performance of ffA Seismic Image Processing Algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/343_HPCPic_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/343_HPCPic_large.jpg
ffA
http://www.ffa.co.uk/hpc.html
2009
05
15
05/15/2009
98
ffA
Multimedia
Paper
Oil & Gas
ffA
8d33eb53-44ed-4237-bf40-82cb67db9224
SeismicCity
As a foundation for its future needs, SeismicCity turned to GPU Computing by running NVIDIA CUDA on an NVIDIA Tesla S870 1U server system. This massively parallel computing architecture produces a 20X performance increase over the previous CPU configuration. Performance was accelerated an additional 3.5X with NVIDIA's next-generation Tesla processors based on CUDA technology. Going forward, the scalability of GPUs will make the transition to new algorithms faster and allow the hardware platform to be expanded as need arises.
/content/cudazone/CUDABrowser/assets/images/applications/342_large_Seismiccity_RTM_image_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/342_large_Seismiccity_RTM_image_large.jpg
Commercial
SeismicCity
http://www.seismiccity.com
2008
10
29
10/29/2008
20
SeismicCity
Paper
Oil & Gas
SeismicCity
340dde84-bed3-41c8-b7e9-25ac626ea0ee
A method of accelerating seismic Pre-stack time migration by GPU
General-purpose GPU (GPGPU) technology has been maturing and has been applied in many industrial areas. However, because the computing characteristics of CPUs and GPUs differ, the use of GPUs in petroleum industry applications needs to be developed effectively. In this article, we introduce GPGPU technology and propose a method to implement pre-stack time migration software on the GPU. Compared with traditional pre-stack time migration running on a personal computer (PC) or PC cluster, the new programming method greatly improves computational efficiency and dramatically reduces power and maintenance costs. Tests on real seismic data illustrate that high-performance computing based on GPGPU technology is an important direction of development for meeting the requirements of large-scale computing in the petroleum industry.
/content/cudazone/CUDABrowser/assets/images/applications/341_waves_small.png
/content/cudazone/CUDABrowser/assets/images/applications/341_waves_large.png
Academia
Institute of Geology and Geophysics, Chinese Academy of Sciences, Beijing
2009
01
01
01/01/2009
15
LI Bo
LIU Guo-feng
LIU Hong
Paper
Oil & Gas
LI Bo, LIU Guo-feng, LIU Hong
14ba18ca-16f2-4463-a272-ad86dac1149d
Ruins 1.5
If you want to shatter objects into many pieces of debris, you need CUDA acceleration. You must have an NVIDIA GeForce 8800 or Quadro FX 4600 with at least 512 MB of display memory. Note that even with 512 MB of display memory, only about 350 MB is actually usable.
/content/cudazone/CUDABrowser/assets/images/applications/340_shatter_helix_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/340_shatter_helix_large.jpg
Commercial
nShatter.com
http://www.nshatter.com/index.html
2009
07
07
07/07/2009
Commercial
nShatter.com
Application
Multimedia
Imaging
nShatter.com
5ac5e350-f908-4569-b8b8-d4e510e57595
Seismic Solvers
Acceleware is leading the market in providing acceleration solutions for seismic data processing and reservoir simulation. By combining our core knowledge in parallelization and optimization of complex algorithms with an in-house team of seismic industry experts, Acceleware provides software solutions for seismic data processors, which access the massively parallel capabilities of compute GPUs. The Acceleware seismic processing solutions provide multi-fold performance increases to reduce lengthy processing times and deliver faster business decisions for the seismic industry. By harnessing the parallel processing power of GPU accelerators to dramatically increase the computation power of data centers, seismic jobs are processed faster and with a reduced total cost of IT ownership.
/content/cudazone/CUDABrowser/assets/images/applications/339_SEG_Salt_Model_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/339_SEG_Salt_Model_large.jpg
Commercial
Acceleware
http://www.acceleware.com/default/
2009
06
09
06/09/2009
200
Acceleware
Application
Multimedia
Oil & Gas
Acceleware
1b3adcb0-1f3b-4f2d-8772-a4158048ddbf
Imaging Earth's Subsurface Using CUDA
The state-of-the-art algorithms used in seismic data processing are evolving rapidly, and the need for computing power increases dramatically every year. For this reason, CGGVeritas has always pioneered new high-performance computing (HPC) technologies, and in this work we explore GPUs and NVIDIA's CUDA programming model to accelerate our industrial applications.
/content/cudazone/CUDABrowser/assets/images/applications/338_gems3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/338_gems3_large.png
CGGVeritas: Global Provider of Geophysical Services and Equipment
http://www.cggveritas.com
2008
08
02
08/02/2008
15
Bernard Deschizeaux
Jean-Yves Blanc
Paper
Oil & Gas
Bernard Deschizeaux, Jean-Yves Blanc
1fa695ac-ce3d-4712-93cf-5ce15091f681
GADGET2 Optimization
Optimization of the astrophysics N-Body/SPH solver, using the CUDA architecture to calculate the particle forces.
/content/cudazone/CUDABrowser/assets/images/applications/337_structure2_small.png
/content/cudazone/CUDABrowser/assets/images/applications/337_structure2_large.png
Academia
Private/Acme Late Night Coding
2009
07
06
07/06/2009
30
Open source
Carsten Frigaard
Code
Computational Fluid Dynamics
Numerics
Science
Carsten Frigaard
8178e5ed-fa4f-4d93-901b-0bbd3fa0c50f
FIDESYS
Strength analysis at large deformations; mechanics and strength at phase transformations under finite strains; strength of solids made of materials whose properties change under loading; strength analysis of solids from which parts are removed; development of numerical and analytical computational methods.
/content/cudazone/CUDABrowser/assets/images/applications/336_mp8-fig1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/336_mp8-fig1_large.png
Commercial
SALD Laboratory
http://www.saldlab.com
2009
07
03
07/03/2009
30
Commercial
Vladimir A. Levin
Multimedia
Paper
Numerics
Oil & Gas
Science
strength analysis, large deformations, finite strains, phase transitions, superposition, numerical methods, CUDA, Vladimir A. Levin
f335d34b-a206-43d9-9e84-bf4500523f8c
Real-time estimation of human visual attention
We propose a new stochastic model of visual attention. The proposed model is composed of a dynamic Bayesian network with four layers that combines several fundamental statistical models. The proposed model enables us to automatically estimate eye focusing positions and their densities from video frames alone.
/content/cudazone/CUDABrowser/assets/images/applications/335_akisato_small.png
/content/cudazone/CUDABrowser/assets/images/applications/335_akisato_large.png
Academia
NTT Communication Science Laboratories
http://www.brl.ntt.co.jp/people/akisato/
2009
07
03
07/03/2009
20
Akisato Kimura
Application
Multimedia
Paper
Digital Content Creation
Graphics
Science
Signal Processing
Video & Audio
Akisato Kimura
3bb25969-f48e-4aee-b4cb-65c369ca70d2
Autodesk Moldflow
Autodesk Moldflow Adviser injection molding software can put plastics simulation within every designer's grasp. Autodesk Moldflow Adviser simplifies plastics injection molding simulation, enabling you to optimize mold features, such as gates, runners, and cavity layouts. The product guides designers and mold makers through analysis setup and results interpretation, so you can see how changes to wall thickness, gate location, material, and geometry affect manufacturability.
/content/cudazone/CUDABrowser/assets/images/applications/334_autodesk_small.png
/content/cudazone/CUDABrowser/assets/images/applications/334_autodesk_large.png
Commercial
Autodesk
http://usa.autodesk.com/
2009
07
01
07/01/2009
2
Autodesk
Application
Industrial design
Autodesk
89e2206e-bcf5-4db7-a353-ed500b83c385
Massively-Parallel Simulation of Biochemical Systems
Understanding biological evolution calls for a detailed understanding of the realized phenotype. Biochemical and gene regulatory dynamics are a cornerstone of the physiology of the cell and must therefore be regarded as one of the major aspects of such a phenotype. Experimental insight into molecular parameters is, however, hard to come by. Model development therefore requires computational parameter estimation. At the same time, design of cellular dynamics is highly efficient when done in silico. We therefore developed a computational approach to allow for massively parallel simulation of biological molecular networks that leverages the massively parallel computing power of modern graphics cards and other many-core programming paradigms. Our system can automatically compile standard SBML files into CUDA code, using analytic derivatives, and compute standard measures of complex dynamics such as the Lyapunov exponent.
/content/cudazone/CUDABrowser/assets/images/applications/333_SBMLtoCUDA-pipeline-thumb_small.png
/content/cudazone/CUDABrowser/assets/images/applications/333_SBMLtoCUDA-pipeline-thumb_large.png
Academia
TU Darmstadt
http://www.tu-darmstadt.de
2009
07
03
07/03/2009
59
J. Ackermann
P. Baecher
T. Franzel
M. Goesele
K. Hamacher
Paper
Life Sciences
SBML-to-CUDA conversion, J. Ackermann, P. Baecher, T. Franzel, M. Goesele, K. Hamacher
f7976db7-ee65-4f7c-80fe-75011b114c6a
GPU-SNN: Large-Scale Biologically Realistic Spiking Neural Networks
Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for engineering applications. Spiking Neural Network (SNN) simulators have traditionally been run on large-scale clusters, supercomputers, or dedicated hardware architectures. Alternatively, Graphics Processing Units (GPUs) can provide a low-cost, programmable, and high-performance computing platform for simulation of SNNs. In this project we demonstrate an efficient, Izhikevich neuron based large-scale SNN simulator that runs on a single GPU. The GPU-SNN model (running on an NVIDIA GTX-280 with 1 GB of memory) is up to 26 times faster than a CPU version for the simulation of 100K neurons with 50 million synaptic connections, firing at an average rate of 7 Hz. For simulation of 100K neurons with 10 million synaptic connections, the GPU-SNN model is only 1.5 times slower than real time. This project uses a collection of new techniques related to parallelism extraction, mapping of irregular communication, and compact network representation for effective simulation of SNNs on GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/332_gpu-snn-logo_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/332_gpu-snn-logo_large.jpg
Academia
University of California - Irvine
http://www.ics.uci.edu/~jmoorkan/project/
2009
06
06
06/06/2009
26
Jayram Moorkanikara Nageswaran
Micah Richert
Paper
Code
Jayram Moorkanikara Nageswaran, Micah Richert
bf5d4551-ca35-4137-9070-987c91bc227a
CUDPP: CUDA Data-Parallel Primitives Library v1.1
CUDPP is the CUDA Data Parallel Primitives Library. CUDPP is a library of data-parallel algorithm primitives such as parallel prefix-sum ("scan"), parallel sort and parallel reduction. Primitives such as these are important building blocks for a wide variety of data-parallel algorithms, including sorting, stream compaction, and building data structures such as trees and summed-area tables. For more information and to download CUDPP, visit the CUDPP homepage at http://www.gpgpu.org/developer/cudpp
/content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_small.png
/content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_large.png
Research
NVIDIA and University of California at Davis
http://gpgpu.org/developer/cudpp
2009
07
01
07/01/2009
Open source
Mark Harris
Code
parallel algorithms
data-parallel, scan, sort, random number generation, Mark Harris
ab27bee7-f0b0-4e84-b9d6-9155055ef5d5
Clustering Billions of Data Points Using GPUs
In this paper, we report our research on using GPUs to accelerate clustering algorithms, with special interest in very large data sets, which are common in today's real-world applications. While many published works have shown that GPUs can be used to accelerate various general-purpose applications with respectable performance gains, few attempts have been made to tackle very large problems. Our goal here is to investigate whether GPUs can be useful accelerators even with very large data sets that cannot fit into the GPU's onboard memory. Using a popular clustering algorithm, K-Means, as an example, our results have been very positive. With GPU acceleration, a data set with a billion data points can be clustered within minutes. We achieved a 10x performance gain over our highly optimized CPU-only version running on 8 cores, and about a 300x performance boost against a popular benchmark, MineBench, running on a single core.
/content/cudazone/CUDABrowser/assets/images/applications/330_renwu_cudazone_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/330_renwu_cudazone_large.jpg
Intelligent Information Management Lab, HP Labs.
http://www.hpl.hp.com/
2009
05
18
05/18/2009
300
Ren Wu
Bin Zhang
Meichun Hsu
Paper
Business Intelligence
Data Mining
Analytics
Parallel Algorithm, Data-mining, Clustering, Graphics Processor, GPGPU, Accelerator, Multi-core, Many-core, Data parallelism, Ren Wu, Bin Zhang, Meichun Hsu
66322604-44b3-4a6b-941f-5c6cd54cc0bb
Running Unstructured Grid CFD Solvers on Modern Graphics Hardware
We implement an unstructured grid finite volume solver for the three-dimensional Euler equations for compressible flow. We describe optimization strategies taken to minimize uncoalesced memory access and achieve high performance. We consider two benchmark cases from aerodynamics.
/content/cudazone/CUDABrowser/assets/images/applications/329_pressure_small.png
/content/cudazone/CUDABrowser/assets/images/applications/329_pressure_large.png
Academia
George Mason University
2009
06
24
06/24/2009
33
Andrew Corrigan
Paper
Computational Fluid Dynamics
NACA0012 Airfoil, Andrew Corrigan
d8e7a5a4-6e68-4613-885c-1f02a8232df8
Efficient parallel scan algorithms for GPUs
Scan and segmented scan algorithms are crucial building blocks for a great many data-parallel algorithms. Segmented scan and related primitives also provide the necessary support for the flattening transform, which allows for nested data-parallel programs to be compiled into flat data-parallel languages. In this paper, we describe the design of efficient scan and segmented scan parallel primitives in CUDA for execution on GPUs. Our algorithms are designed using a divide-and-conquer approach that builds all scan primitives on top of a set of primitive intra-warp scan routines. We demonstrate that this design methodology results in routines that are simple, highly efficient, and free of irregular access patterns that lead to memory bank conflicts. These algorithms form the basis for current and upcoming releases of the widely used CUDPP library.
/content/cudazone/CUDABrowser/assets/images/applications/328_thumbnail_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/328_thumbnail_large.jpg
Research
NVIDIA Research
http://www.nvidia.com/research
2008
12
15
12/15/2008
Shubho Sengupta
Paper
Code
Libraries
Parallel Algorithms
Data-parallel algorithms, algorithms, CUDPP, scan, segmented scan, Shubho Sengupta
e98a2682-012c-4145-a08d-689927f2f107
Hyperspectral image compression on NVidia GPUs
Hyperspectral imaging instruments are capable of collecting hundreds of images, corresponding to different wavelength channels, for the same area on the surface of the Earth. For instance, NASA is continuously gathering imagery data with instruments such as the Jet Propulsion Laboratory's Airborne Visible-Infrared Imaging Spectrometer (AVIRIS), able to record the visible and near-infrared spectrum (wavelength region from 0.4 to 2.5 micrometers) of the reflected light of an area 2 to 12 kilometers wide and several kilometers long, using 224 spectral bands. The resulting multidimensional data volume typically comprises several GBs per flight. We have developed a computationally efficient approach for lossy compression of remotely sensed hyperspectral images that retains the relevant information for analyzing the hyperspectral data with sub-pixel precision. The proposed methodology has been implemented, using the Compute Unified Device Architecture (CUDA), on an NVIDIA GeForce 8800 GTX GPU, achieving speedups on the order of 26x when compared to an optimized implementation of the same code on a dual-core CPU.
/content/cudazone/CUDABrowser/assets/images/applications/327_hyperspectral_small.png
/content/cudazone/CUDABrowser/assets/images/applications/327_hyperspectral_large.png
Academia
Technology of Computers and Communications, University of Extremadura
http://www.umbc.edu/rssipl/people/aplaza
2009
06
24
06/24/2009
26
Antonio Plaza
Javier Plaza
Sergio Sanchez
Paper
Imaging
Antonio Plaza, Javier Plaza, Sergio Sanchez
d5d96c8c-eae1-4c91-b0d1-118677e7315a
GPUTop - Topology Optimization on CUDA Graphics Cards in 3D
GPUTop is a topology optimizer for CUDA-enabled graphics cards. It is based on the SIMP method with optimality criteria updates in three dimensions. Linear elasticity is discretized using finite elements on a Cartesian mesh. The material density is assumed constant in each element. The resulting system is solved by a matrix-free conjugate gradient method entirely on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/326_cantilever_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/326_cantilever_large.jpg
Academia
University of Trier
2009
05
01
05/01/2009
60
Stephan Schmidt
Application
Multimedia
Electronic Design Automation
Numerics
Science
Stephan Schmidt
fe18ed72-01e4-486a-974c-4a31bdce2636
The sparse matrix vector product on GPUs
The sparse matrix vector product (SpMV) is a paramount operation in engineering and scientific computing and, hence, has long been a subject of intense research. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devising data formats to store the sparse matrix with the ultimate aim of maximizing performance. Graphics Processing Units (GPUs) have recently emerged as platforms that yield outstanding acceleration factors, and SpMV implementations for NVIDIA GPUs have already appeared. This work proposes and evaluates a new implementation of SpMV for GPUs based on a new matrix storage format, called ELLPACK-R, and compares it against a variety of formats proposed elsewhere. The most important qualities of this new format are that (1) no preprocessing of the sparse matrix is required, and (2) the resulting SpMV algorithm is very regular. The comparative evaluation of this new SpMV approach has been carried out on a representative set of test matrices. The results show that the SpMV approach based on ELLPACK-R is superior to the strategies used so far. Moreover, a comparison with standard state-of-the-art superscalar processors reveals that significant speedup factors are achieved with GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/324_Cuda_Zone_Sp_format_GPU_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/324_Cuda_Zone_Sp_format_GPU_large.jpg
Academia
Dpt. Computer Architecture and Electronics, University of Almeria, Spain
2009
06
14
06/14/2009
80
Ester Martin Garzon
Paper
Numerics
Libraries
Ester Martin Garzon
8c9d6b52-283f-46d2-be89-3b83c39233e0
Real Time Holographic Optical Trapping
Using a CUDA-powered NVIDIA graphics card, we can quickly generate highly optimized holograms, allowing interactive optical manipulation of micron-sized structures.
/content/cudazone/CUDABrowser/assets/images/applications/323_HotCuda-big_small.png
/content/cudazone/CUDABrowser/assets/images/applications/323_HotCuda-big_large.png
Research
CNR-Dip Fisica, Univ."La Sapienza" Roma, Italy
2009
06
19
06/19/2009
350
S.Bianchi
R.Di Leonardo
Application
Multimedia
Science
holographic optical tweezers, S.Bianchi, R.Di Leonardo
13958d4d-5cf1-420b-b8a2-14932b5bb9d7
Parallel Computation With NVIDIA Graphics Card Using CUDA in Hyperthermia Applications
In this work we developed a fast parallel computing tool to simulate electromagnetic (EM) fields using the finite-difference time-domain (FDTD) method. The software is used to calculate the EM distribution during a hyperthermia session. Hyperthermia is a modality in cancer treatment that involves heating of tumors. The software can also be used for different applications that require fast and accurate simulation of EM fields, like MRI, RFID, medical implants, wireless sensors, etc.
/content/cudazone/CUDABrowser/assets/images/applications/322_SARCP01_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/322_SARCP01_large.jpg
Research
Academic Medical Center, Amsterdam
http://www.amc.nl/radiotherapie
2009
06
12
06/12/2009
25
Davi Correia
Multimedia
Presentation
Numerics
Life Sciences
Science
hyperthermia, electromagnetics, FDTD, Davi Correia
aca0f696-82fe-41fb-b6a3-6d296bd1bb83
Real Time Elimination of Undersampling Artifacts in CE MRA using Variational Denoising on Graphics Hardware
Undersampled imaging strategies with state-of-the-art reconstruction methods like compressed sensing, which reformulate image reconstruction as a constrained optimization problem, have the potential to deliver CE MRA images with high spatial and temporal resolution. The drawback of these algorithms is their long reconstruction time, which makes it impossible to use them in clinical practice. This study demonstrates that these optimization problems can be solved on modern graphics processing units (GPUs), with computation times that allow real-time imaging.
/content/cudazone/CUDABrowser/assets/images/applications/321_mra08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/321_mra08_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2008
12
01
12/01/2008
2300
F. Knoll
M. Unger
F. Ebner
Multimedia
Presentation
Imaging
F. Knoll, M. Unger, F. Ebner
b7be624c-6ee4-4753-89a6-2cfab4e0f695
Mumford-Shah Meets Stereo: Integration of Weak Depth Hypotheses
Recent results on stereo indicate that an accurate segmentation is crucial for obtaining faithful depth maps. Variational methods have successfully been applied to both image segmentation and computational stereo. In this paper we propose a combination in a unified framework. In particular, we use a Mumford-Shah-like functional to compute a piecewise smooth depth map of a stereo pair. Our approach has two novel features: First, the regularization term of the functional combines edge information obtained from the color segmentation with flow-driven depth discontinuities emerging during the optimization procedure. Second, we propose a robust data term which adaptively selects the best matches obtained from different weak stereo algorithms. We integrate these features in a theoretically consistent framework. The final depth map is the minimizer of the energy functional, which can be solved by the associated functional derivatives. The underlying numerical scheme allows an efficient implementation on modern graphics hardware. We illustrate the performance of our algorithm using the Middlebury database as well as on real imagery.
/content/cudazone/CUDABrowser/assets/images/applications/320_cvpr07pock_small.png
/content/cudazone/CUDABrowser/assets/images/applications/320_cvpr07pock_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2007
12
01
12/01/2007
Thomas Pock
Christopher Zach
Horst Bischof
Paper
Imaging
Thomas Pock, Christopher Zach, Horst Bischof
5f8c1341-18c0-4057-9760-46eac91eea78
A Globally Optimal Algorithm for Robust TV-L1 Range Image Integration
Robust integration of range images is an important task for building high-quality 3D models. Since range images, and in particular range maps from stereo vision, may have a substantial amount of outliers, any integration approach aiming at high-quality models needs an increased level of robustness. Additionally, a certain level of regularization is required to obtain smooth surfaces. Computational efficiency and global convergence are further preferable properties. The contribution of this paper is a unified framework to solve all these issues. Our method is based on minimizing an energy functional consisting of a total variation (TV) regularization force and an L1 data fidelity term. We present a novel and efficient numerical scheme, which combines the duality principle for the TV term with a point-wise optimization step. We demonstrate the superior performance of our algorithm on the well-known Middlebury multi-view database and additionally on real-world multi-view images.
/content/cudazone/CUDABrowser/assets/images/applications/319_iccv07_paper_small.png
/content/cudazone/CUDABrowser/assets/images/applications/319_iccv07_paper_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2007
12
01
12/01/2007
Christopher Zach
Thomas Pock
Horst Bischof
Paper
Imaging
Christopher Zach, Thomas Pock, Horst Bischof
d816287e-5a19-4af3-9147-a117a60e5d9e
A Convex Formulation of Continuous Multi-Label Problems
We propose a spatially continuous formulation of Ishikawa's discrete multi-label problem. We show that the resulting non-convex variational problem can be reformulated as a convex variational problem via embedding in a higher dimensional space. This variational problem can be interpreted as a minimal surface problem in an anisotropic Riemannian space. In several stereo experiments we show that the proposed continuous formulation is superior to its discrete counterpart in terms of computing time, memory efficiency and metrication errors.
/content/cudazone/CUDABrowser/assets/images/applications/318_pockeccv08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/318_pockeccv08_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2008
12
01
12/01/2008
33
Thomas Pock
Thomas Schoenemann
Gottfried Grabe
Paper
Imaging
Thomas Pock, Thomas Schoenemann, Gottfried Grabe
c754fdc4-120c-4eb3-ab2d-de1dc540cf4b
Continuous Globally Optimal Image Segmentation with Local Constraints
The Geodesic Active Contour model is a very flexible model for variational image segmentation. Unfortunately, the Geodesic Active Contour model exhibits local minima, making segmentation results strongly dependent on its initialization. We propose a flexible, interactive segmentation method in two and three dimensions that yields the globally optimal solution with respect to local constraints introduced by the user. A fast numerical scheme is used to minimize the proposed energy, which is based on a weighted Total Variation energy functional. With our GPU-based implementation, real-time performance is achieved for both 2D and 3D segmentation problems. We show experimental results on various medical datasets, and discuss the properties of the segmentation framework.
/content/cudazone/CUDABrowser/assets/images/applications/317_cvww08seg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/317_cvww08seg_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2008
05
01
05/01/2008
Markus Unger
Thomas Pock
Horst Bischof
Multimedia
Paper
Imaging
Markus Unger, Thomas Pock, Horst Bischof
44c02794-46b1-4ec6-8c12-1d0ebb082d7b
Globally Optimal TV-L1 Shape Prior Segmentation
Interpreting an image is a common and challenging task in computer vision. A human observer does not only use intensity or color information or other basic features when looking for region boundaries but also takes prior knowledge into account. This increases the robustness of the segmentation result for most images. The main intention of our work is to propose a globally optimal segmentation algorithm that incorporates prior knowledge in the form of a geometric shape. The proposed energy is based on a weighted Total Variation energy and is optimized with fast numerical approaches like the projected gradient descent method. The GPU-based implementation is able to achieve real-time performance for the presented applications. We show how the proposed energy model relates to former variational methods like the well-known edge-preserving restoration model of Rudin, Osher and Fatemi and to methods that incorporate prior information into classical segmentation models. Different applications are realized with the proposed energy. First, a semi-automatic, interactive segmentation tool is implemented. The user can either define a shape prior on the fly using the weighted Total Variation as a geodesic active contour or load a predefined geometric shape. Next, the energy model can be used to align two shapes with each other or to optimize the alignment of a shape to an underlying edge function. Consequently, a tracking approach was introduced with the ability to optimize the incorporated shape information according to consecutive frames. This position update is also used when processing 3D data sets with a 2D prior, which is particularly useful for segmenting tubular structures in medical data sets with a single constraint on the first slice.
/content/cudazone/CUDABrowser/assets/images/applications/316_werlberger_master_small.png
/content/cudazone/CUDABrowser/assets/images/applications/316_werlberger_master_large.png
Academia
Graz University of Technology
http://www.tugraz.at
2008
05
01
05/01/2008
Manuel Werlberger
Multimedia
Paper
Imaging
Manuel Werlberger
fd209c06-27b3-4bd5-9110-18f512c9d85e
Interactive Globally Optimal Image Segmentation
Image segmentation is a challenging task in computer vision. We present a general purpose image segmentation framework, and focus on its application to medical imaging. Features like gray values or edges are commonly used as input for segmentation algorithms. The geodesic active contour model gained popularity as a flexible variational image segmentation model based solely on edge information. Unfortunately, the geodesic active contour model exhibits local minima, making segmentation results strongly dependent on its initialisation. We propose a globally optimal segmentation model that unifies the usage of gray value information with the geodesic active contour model. A flexible, interactive segmentation framework is presented that allows incorporation of local constraints. Fast numerical schemes are used to minimise the proposed energy, which is based on a weighted Total Variation energy functional. Different segmentation approaches using the proposed energy functional are discussed. The relation to the image denoising task is analysed, and we present a fast implementation of the image denoising model of Rudin, Osher and Fatemi. With our GPU-based implementation, real-time performance is achieved for both 2D and 3D segmentation problems. We show experimental results on various real-world images and different medical datasets.
/content/cudazone/CUDABrowser/assets/images/applications/315_unger_tr0802_small.png
/content/cudazone/CUDABrowser/assets/images/applications/315_unger_tr0802_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2008
02
01
02/01/2008
100
Markus Unger
Thomas Pock
Horst Bischof
Multimedia
Paper
Imaging
Markus Unger, Thomas Pock, Horst Bischof
4230035e-7064-4178-837b-7e86b8b8af75
TVSeg - Interactive Total Variation Based Image Segmentation
Interactive object extraction is an important part in any image editing software. We present a two step segmentation algorithm that first obtains a binary segmentation and then applies matting on the border regions to obtain a smooth alpha channel. The proposed segmentation algorithm is based on the minimization of the Geodesic Active Contour energy. A fast Total Variation minimization algorithm is used to find the globally optimal solution. We show how user interaction can be incorporated and outline an efficient way to exploit color information. A novel matting approach, based on energy minimization, is presented. Experimental evaluations are discussed, and the algorithm is compared to state of the art object extraction algorithms. The GPU based binaries are available online.
/content/cudazone/CUDABrowser/assets/images/applications/314_ungerbmvc2008_small.png
/content/cudazone/CUDABrowser/assets/images/applications/314_ungerbmvc2008_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2008
12
01
12/01/2008
Markus Unger
Thomas Pock
Werner Trobin
Application
Multimedia
Paper
Imaging
Markus Unger, Thomas Pock, Werner Trobin
9b4757c0-bb97-47fe-bafb-21111b2cd277
A convex approach for computing minimal partitions
We describe a convex relaxation for a family of problems of minimal perimeter partitions. The minimization of the relaxed problem can be tackled numerically; we describe an algorithm and show some results. In most cases, our relaxed problem finds a correct (approximate) solution. We give some arguments to explain why this should be so, and also discuss some situations where it fails.
/content/cudazone/CUDABrowser/assets/images/applications/313_chambolle08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/313_chambolle08_large.png
Academia
ECOLE POLYTECHNIQUE
http://www.cmap.polytechnique.fr/
2008
11
01
11/01/2008
Antonin Chambolle
Daniel Cremers
Thomas Pock
Paper
Imaging
Antonin Chambolle, Daniel Cremers, Thomas Pock
98759e63-9907-443c-ab79-dc6e8de0f2c4
A Convex Relaxation Approach for Computing Minimal Partitions
In this work we propose a convex relaxation approach for computing minimal partitions. Our approach is based on rewriting the minimal partition problem (also known as the Potts model) in terms of a primal-dual Total Variation functional. We show that the Potts prior can be incorporated by means of convex constraints on the dual variables. For minimization we propose an efficient primal-dual projected gradient algorithm which also allows a fast implementation on parallel hardware. Although our approach is not guaranteed to find global minimizers of the Potts model, we can give a tight bound on the energy between the computed solution and the true minimizer. Furthermore, we show that our relaxation approach dominates recently proposed relaxations. As a consequence, our approach allows us to compute solutions closer to the true minimizer. For many practical problems we even find the global minimizer. We demonstrate the excellent performance of our approach on several multi-label image segmentation and stereo problems.
/content/cudazone/CUDABrowser/assets/images/applications/312_pockcvpr2009_small.png
/content/cudazone/CUDABrowser/assets/images/applications/312_pockcvpr2009_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2009
01
01
01/01/2009
Thomas Pock
Daniel Cremers
Antonin Chambolle
Paper
Imaging
Antonin Chambolle, Daniel Cremers, Thomas Pock
f5051dff-fcce-4689-ad6e-bbbeeb64ad67
Semi Automatic Segmentation of Articular Cartilage using Variational Methods
Osteoarthritis (OA) is a syndrome of joint pain that affects the large weight-bearing joints. It is caused by abnormal wear of the articular cartilage covering the joints. In addition to the inconvenience OA causes, its treatment is very time consuming and expensive. It is therefore desirable to improve methods for early diagnosis of OA. Detecting thinning of the articular cartilage provides good support for diagnosing OA at an early stage. The first step in this diagnosis process is the accurate segmentation of the cartilage surface. In this Master's Thesis we propose an interactive segmentation framework for the semi-automatic segmentation of articular cartilage. To date, no automatic segmentation method is able to achieve the accuracy necessary for a trustworthy diagnosis. Also, physicians in general prefer to be able to control and modify the segmentation result, which is usually complicated with automatic methods. Semi-automatic methods allow the user to incorporate knowledge into the segmentation process, while reducing the time and improving the repeatability compared to fully manual methods. The proposed segmentation model is based on a weighted Total Variation energy and is minimised using efficient numerical approaches. Implemented on today's user-programmable graphics cards, it allows real-time user interaction. The evaluation of our segmentation method on real-world magnetic resonance datasets of human knee joints shows that we are able to speed up the segmentation process significantly compared to manual and semi-automatic segmentation methods.
/content/cudazone/CUDABrowser/assets/images/applications/311_thesis_christian_reinbacher_web_small.png
/content/cudazone/CUDABrowser/assets/images/applications/311_thesis_christian_reinbacher_web_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2009
01
01
01/01/2009
50
Christian Reinbacher
Paper
Imaging
Christian Reinbacher
ad494f35-db0c-4777-bec8-14bcd01a305c
A Variational Model for Interactive Shape Prior Segmentation and Real-Time Tracking
In this paper, we introduce a semi-automated segmentation method based on minimizing the Geodesic Active Contour energy incorporating a shape prior. We increase the robustness of the segmentation result using the additional shape information that represents the desired structure. Furthermore, the user can take corrective actions during the segmentation and adapt the shape prior position. Interaction is often desirable when processing difficult data, such as in medical applications. To facilitate user interaction, we add a shape deformation that allows the shape position to be changed manually by the user and automatically according to underlying image features. Using a variational formulation, the optimization can be done in a globally optimal manner for a fixed shape representation. To obtain real-time behavior, which is especially important for an interactive tool, the whole method is implemented on the GPU. Experiments are done on medical data, as well as on video data and camera streams that are processed in real time. For medical data, we compare our method with a segmentation done by an expert. The GPU-based binaries will be available online on our homepage.
/content/cudazone/CUDABrowser/assets/images/applications/310_werlberger_ssvm2009_small.png
/content/cudazone/CUDABrowser/assets/images/applications/310_werlberger_ssvm2009_large.png
http://www.gpu4vision.org
http://www.gpu4vision.org
2008
12
01
12/01/2008
Manuel Werlberger
Thomas Pock
Paper
Multimedia
Imaging
Manuel Werlberger, Thomas Pock
847a423f-7656-4ac0-b0d5-5e5b41c66f11
A Duality Based Approach for Realtime TV-L1 Optical Flow
Variational methods are among the most successful approaches to calculate the optical flow between two image frames. A particularly appealing formulation is based on total variation (TV) regularization and the robust L1 norm in the data fidelity term. This formulation can preserve discontinuities in the flow field and offers an increased robustness against illumination changes, occlusions and noise. In this work we present a novel approach to solve the TV-L1 formulation. Our method results in a very efficient numerical scheme, which is based on a dual formulation of the TV energy and employs an efficient point-wise thresholding step. Additionally, our approach can be accelerated by modern graphics processing units. We demonstrate the real-time performance (30 fps) of our approach for video inputs at a resolution of 320 x 240 pixels.
/content/cudazone/CUDABrowser/assets/images/applications/309_pockdagm07_small.png
/content/cudazone/CUDABrowser/assets/images/applications/309_pockdagm07_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2007
12
01
12/01/2007
C. Zach
T. Pock
H. Bischof
Application
Multimedia
Paper
Imaging
C. Zach, T. Pock, H. Bischof
e51747cc-73ba-4e32-8e4d-35dfc6679a6c
An Unbiased Second-Order Prior for High-Accuracy Motion Estimation
Virtually all variational methods for motion estimation regularize the gradient of the flow field, which introduces a bias towards piecewise constant motions in weakly textured areas. We propose a novel regularization approach, based on decorrelated second-order derivatives, that does not suffer from this shortcoming. We then derive an efficient numerical scheme to solve the new model using projected gradient descent. A comparison to a TV regularized model shows that the proposed second-order prior exhibits superior performance, in particular in low-textured areas (where the prior becomes important). Finally, we show that the proposed model yields state-of-the-art results on the Middlebury optical flow database.
/content/cudazone/CUDABrowser/assets/images/applications/308_trobindagm08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/308_trobindagm08_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2008
12
01
12/01/2008
Werner Trobin
Thomas Pock
Daniel Cremers
Paper
Imaging
Werner Trobin, Thomas Pock, Daniel Cremers
0d9b7a3f-47e7-4f51-a0c9-8161f1f5a368
Continuous Energy Minimization
Variational problems, which are commonly used to solve low-level vision tasks, are typically minimized via a local, iterative optimization strategy, e.g. gradient descent. Since every iteration is restricted to a small, local improvement, the overall convergence can be slow and the algorithm may get stuck in an undesirable local minimum. In this paper, we propose to approximate the minimization by solving a series of binary subproblems to facilitate large optimization moves. The proposed method can be interpreted as an extension of discrete graph-cut based methods such as α-expansion or LogCut to a spatially continuous setting. In order to demonstrate the viability of the approach, we evaluated the novel optimization strategy in the context of optical flow estimation, yielding excellent results on the Middlebury optical flow datasets.
/content/cudazone/CUDABrowser/assets/images/applications/307_trobineccv2008_small.png
/content/cudazone/CUDABrowser/assets/images/applications/307_trobineccv2008_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2008
12
1
12/1/2008
Werner Trobin
Thomas Pock
Daniel Cremers
Paper
Imaging
Werner Trobin, Thomas Pock, Daniel Cremers
bf8c14b0-7314-47ac-915f-e8a86a852188
Duality TV-L1 Flow with Fundamental Matrix Prior
Variational techniques yield the most accurate results for dense optical flow fields between two images. They have the nice property of inherent smoothness to cope with untextured image regions: the filling-in of such regions is driven by neighbouring pixels. Such filling-in is not always the best choice. If the scene is mostly stationary and the camera is moving, the direction of the optical flow vectors can be restricted using the fundamental matrix. In this paper we propose an exact solution of the variational optical flow, using the fundamental matrix geometry as an additional weak prior. Our novel approach currently performs best on the Middlebury flow evaluation which includes images from stationary and dynamic scenes.
/content/cudazone/CUDABrowser/assets/images/applications/306_ivcnz08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/306_ivcnz08_large.png
Academia
Daimler Group Research, Sindelfingen, Germany
2008
12
01
12/01/2008
A. Wedel
T. Pock
J. Braun
Multimedia
Paper
Imaging
Optical flow, fundamental matrix, structure from motion, optimization, total variation, A. Wedel, T. Pock, J. Braun
337c0c41-4ef4-4b42-897e-0a7762736367
Real-time Computation of Variational Methods on Graphics Hardware
This paper combines two powerful approaches: variational methods and graphics hardware. Variational methods have demonstrated considerable success in computer vision for such diverse tasks as denoising, segmentation, registration and stereo matching. Their main advantage is a mathematically clean and powerful formulation of the vision problem in terms of energy functionals that have to be minimized. However, due to their iterative nature, these approaches tend to be slow and far from real-time capable. Recent progress in graphics hardware (its computational power grows much faster than that of standard CPUs) makes this hardware interesting for computer vision applications. For example, floating point arithmetic and high-level programming languages such as Cg are now available for modern graphics cards. In this paper we demonstrate that, by a careful analysis and formulation of variational methods and exploitation of the parallelism of modern GPUs, we can achieve real-time performance without complex multi-grid optimization schemes (we obtain speed-ups of a factor of more than 200 compared to Matlab implementations). This opens several new application areas for variational methods (e.g. real-time algorithms).
/content/cudazone/CUDABrowser/assets/images/applications/305_cvww07_pock_small.png
/content/cudazone/CUDABrowser/assets/images/applications/305_cvww07_pock_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2007
02
06
02/06/2007
200
Thomas Pock
Markus Grabner
Horst Bischof
Paper
Imaging
Thomas Pock, Markus Grabner, Horst Bischof
7385b402-b661-4ec7-a1e7-a1774baf45f6
Fast Total Variation for Computer Vision
Motivated by statistical inference methods, variational methods are among the most successful methods to solve a number of different Computer Vision problems. Variational methods aim to minimize an energy functional which is designed to appropriately describe a Computer Vision task. Since Computer Vision problems are typically ill-posed, appropriate priors (or regularizers) are needed to find physically meaningful solutions. A particularly interesting prior is given by the Total Variation norm. It provides a good tradeoff between modeling the true statistics of natural images and still allowing an exact solution to be computed. Total Variation methods were first introduced by Rudin, Osher and Fatemi in 1992 for edge-preserving image denoising. Computing the solution of energy functionals incorporating Total Variation is a challenging task. We review different numerical algorithms and discuss their properties. For implementation on a digital computer we consider two different approaches. The first approach is based on an explicit discretization of the Euler-Lagrange partial differential equations, which is the standard approach. The second approach is based on algorithmic differentiation of a discretized version of the energy functional. We show that both approaches yield equivalent results, whereas algorithmic differentiation is less error prone and can be applied to very complex models. For performance evaluation, we implement our variational algorithms on the graphics processing unit (GPU). Through controlled experiments we show that our GPU-based implementations clearly outperform recently proposed discrete optimization techniques in both speed and maximum problem size. In the remaining part of the thesis, we apply Total Variation methods to three fundamental Computer Vision problems: Segmentation, Optical Flow and 3D Reconstruction. We show that our Total Variation based methods yield state-of-the-art results.
/content/cudazone/CUDABrowser/assets/images/applications/304_pock_phd_small.png
/content/cudazone/CUDABrowser/assets/images/applications/304_pock_phd_large.png
Academia
Graz University of Technology
http://www.tugraz.at/
2008
1
1
1/1/2008
1000
Thomas Pock
Paper
Imaging
Thomas Pock
bfd5c98d-c5d6-4d33-9d85-2cdaa02e118f
Fast and Exact Solution of Total Variation Models on the GPU
This paper discusses fast and accurate methods to solve Total Variation (TV) models on the graphics processing unit (GPU). We review two prominent models incorporating TV regularization and present different algorithms to solve these models. We mainly concentrate on variational techniques, i.e. algorithms which aim at solving the Euler-Lagrange equations associated with the variational model. We then show that these algorithms in particular can be effectively accelerated by implementing them on parallel architectures such as GPUs. For comparison we chose a state-of-the-art method based on discrete optimization techniques. We then present the results of a rigorous performance evaluation including 2D and 3D problems. As a main result we show that our GPU-based algorithms clearly outperform discrete optimization techniques in both speed and maximum problem size.
/content/cudazone/CUDABrowser/assets/images/applications/303_pockcvpr2008_small.png
/content/cudazone/CUDABrowser/assets/images/applications/303_pockcvpr2008_large.png
Academia
Institute for Computer Graphics and Vision, Graz University of Technology
http://www.icg.tu-graz.ac.at
2008
12
01
12/01/2008
1000
Thomas Pock
Markus Unger
Daniel Cremers
Application
Multimedia
Paper
Imaging
Thomas Pock, Markus Unger, Daniel Cremers
f7584aa1-8c40-4182-86b2-2ec2baebb15d
Automatic Differentiation for GPU-Accelerated 2D/3D Registration
A common task in medical image analysis is the alignment of data from different sources, e.g., X-ray images and computed tomography (CT) data. Such a task is generally known as registration. We demonstrate the applicability of automatic differentiation (AD) techniques to a class of 2D/3D registration problems which are highly computationally intensive and can therefore greatly benefit from a parallel implementation on recent graphics processing units (GPUs). However, being designed for graphics applications, GPUs have some restrictions which conflict with requirements for reverse mode AD, in particular for taping and TBR analysis. We discuss design and implementation issues in the presence of such restrictions on the target platform and present a method which can register a CT volume data set (512x512x288 voxels) with three X-ray images (512x512 pixels each) in 11.8 seconds on a GeForce 8800GTX graphics card.
/content/cudazone/CUDABrowser/assets/images/applications/302_grabner_AD08_small.png
/content/cudazone/CUDABrowser/assets/images/applications/302_grabner_AD08_large.png
Academia
Institute for Computer Graphics and Vision, Graz University of Technology
http://www.icg.tu-graz.ac.at/
2008
08
15
08/15/2008
Markus Grabner
Thomas Pock
Tobias Gross
Paper
Imaging
Optimization, medical image analysis, 2D/3D registration, GPU, Markus Grabner, Thomas Pock, Tobias Gross
4c67cc92-5ea4-4b4c-afd2-e7041206bc09
Ascalaph Liquid GPU 1.2.1
Ascalaph Liquid GPU is an application that computes molecular dynamics in the liquid phase. Molecular dynamics calculations are very demanding, so they need as much computing power as possible. Agile Molecule therefore built the application on the CUDA 2.0 API to take advantage of the GPU. The GPU version can speed up the calculations dramatically.
/content/cudazone/CUDABrowser/assets/images/applications/301_liquid-simulation_small.png
/content/cudazone/CUDABrowser/assets/images/applications/301_liquid-simulation_large.png
Commercial
Agile Molecule
http://www.agilemolecule.com
2009
03
01
03/01/2009
29
Agile Molecule
Application
Computational Fluid Dynamics
Agile Molecule
d73b70a5-03df-4e6d-b2cd-7098e976fd00
CUDA OpenGL Tutorials
CUDA ("Compute Unified Device Architecture") is a GPGPU technology that allows a programmer to use the C programming language to code algorithms for execution on the GPU. CUDA was developed by NVIDIA, and using this architecture requires an NVIDIA GPU and special stream processing drivers. CUDA only works with the new GeForce 8 Series, featuring G8X GPUs; NVIDIA guarantees that programs developed for the GeForce 8 series will also work without modification on all future NVIDIA video cards. CUDA gives developers unfettered access to the native instruction set and memory of the massively parallel computational elements in CUDA GPUs. Using CUDA, NVIDIA GeForce-based GPUs effectively become powerful, programmable open architectures like today's CPUs (Central Processing Units). By opening up the architecture, CUDA gives developers the low-level, deterministic, repeatable access to hardware that is needed to develop essential high-level programming tools such as compilers, debuggers, math libraries, and application platforms.
/content/cudazone/CUDABrowser/assets/images/applications/300_index_clip_image002_0002_small.png
/content/cudazone/CUDABrowser/assets/images/applications/300_index_clip_image002_0002_large.png
Academia
Department of Computer Science & Engineering, CUHK
http://www.cse.cuhk.edu.hk/
2007
02
26
02/26/2007
Xie Yongming
Application
Imaging
Xie Yongming
f633b870-abcb-4a3d-9dd4-0a53519dca66
Concurrent Number Cruncher: An Efficient Sparse Linear Solver on the GPU
A wide class of geometry processing and PDE resolution methods needs to solve a linear system, where the non-zero pattern of the matrix is dictated by the connectivity matrix of the mesh. The advent of GPUs with their ever-growing amount of parallel horsepower makes them a tempting resource for such numerical computations. This can be helped by new APIs (CTM from ATI and CUDA from NVIDIA) which give direct access to the multithreaded computational resources and associated memory bandwidth of GPUs; CUDA even provides a BLAS implementation, but only for dense matrices (CuBLAS). However, existing GPU linear solvers are restricted to specific types of matrices, or use non-optimal compressed row storage strategies. By combining recent GPU programming techniques with supercomputing strategies (namely block compressed row storage and register blocking), we implement a sparse general-purpose linear solver which outperforms leading-edge CPU counterparts (MKL / ACML).
/content/cudazone/CUDABrowser/assets/images/applications/299_dragon_small.png
/content/cudazone/CUDABrowser/assets/images/applications/299_dragon_large.png
Academia
Gocad Research Group, INRIA, Nancy Universite, France
http://gocad.org
2007
12
1
12/1/2007
8
L. Buatois
G. Caumon
B. Levy
Paper
Numerics
L. Buatois, G. Caumon, B. Levy
633f6b09-e9cb-4ce6-9bc4-ce40bfa307d8
LU, QR and Cholesky Factorizations using Vector Capabilities of GPUs
We present performance results for dense linear algebra using the 8-series NVIDIA GPUs. Our matrix-matrix multiply routine (GEMM) runs 60% faster than the vendor implementation in CUBLAS 1.1 and approaches the peak of hardware capabilities. Our LU, QR and Cholesky factorizations achieve up to 80-90% of the peak GEMM rate.
http://www.netlib.org/lapack/lawnspdf/lawn202.pdf
/content/cudazone/CUDABrowser/assets/images/applications/298_lawn202_small.png
/content/cudazone/CUDABrowser/assets/images/applications/298_lawn202_large.png
Academia
University of California at Berkeley
http://berkeley.edu/
2008
07
07
07/07/2008
8
Vasily Volkov
James W. Demmel
Paper
Research
Vasily Volkov, James W. Demmel
f832aa18-4900-4118-9e81-cd2d52dedf71
General-Purpose Sparse Matrix Building Blocks using the NVIDIA CUDA Technology Platform
We report on our experience with integrating and using graphics processing units (GPUs) as fast parallel floating-point co-processors to accelerate two fundamental computational scientific kernels on the GPU: sparse direct factorization and nonlinear interior-point optimization.
http://www.cs.jhu.edu/~misha/ReadingSeminar/Papers/Christen07.pdf
/content/cudazone/CUDABrowser/assets/images/applications/297_christen_small.png
/content/cudazone/CUDABrowser/assets/images/applications/297_christen_large.png
2006
09
01
09/01/2006
7
Matthias Christen
Olaf Schenk
Helmar Burkhart
Paper
Numerics
Matthias Christen, Olaf Schenk, Helmar Burkhart
65d33580-ff45-439e-9f5c-5fdc1814f542
A Fast Double Precision CFD Code using CUDA
We describe a second-order double precision finite volume Boussinesq code implemented using the CUDA platform. We perform detailed validation of the code on a variety of Rayleigh-Benard convection problems and show second-order convergence. We obtain matching results with a Fortran code running on a high-end eight-core CPU. The CUDA-accelerated code achieves approximately an eightfold speedup over the Fortran code on identical problems. As a result, we are able to run a simulation with a grid of size 384 x 384 x 192 at 1.6 seconds per time step on a machine with a single GPU.
/content/cudazone/CUDABrowser/assets/images/applications/296_cfd_small.png
/content/cudazone/CUDABrowser/assets/images/applications/296_cfd_large.png
Commercial / Academic
NVIDIA / IGPP UCLA
http://www.nvidia.com
2009
06
01
06/01/2009
8
Jonathan M. Cohen
M. Jeroen Molemaker
Paper
Computational Fluid Dynamics
CUDA, GPU Computing, Multicore, Rayleigh-Benard convection, Jonathan M. Cohen, M. Jeroen Molemaker
a711966a-9d50-42f0-9797-544c84d2d4a2
cRARk
RAR archives password recovery
/content/cudazone/CUDABrowser/assets/images/applications/295_crark32-ss_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/295_crark32-ss_large.jpg
Academia
St. Petersburg Technical University
2009
05
29
05/29/2009
15
Pavel Semjanov
Application
Multimedia
Numerics
rar password, Pavel Semjanov
e69e67b3-e02f-4918-a2a1-f4ba67c92bfd
GPGPU BASED IMAGE SEGMENTATION LIVEWIRE ALGORITHM IMPLEMENTATION
This thesis presents a GPU implementation of the Livewire algorithm. Instead of using traditional architectures, like the CPU, this implementation focuses on the advantages obtained using Single Instruction Multiple Data (SIMD) architectures.
http://code.google.com/p/gpuwire/
/content/cudazone/CUDABrowser/assets/images/applications/294_gpuwire_large.jpg
/content/cudazone/CUDABrowser/assets/images/applications/294_gpuwire_small.jpg
2008
12
1
12/1/2008
Daniel Lelis Baggio
Application
Multimedia
Paper
Imaging
Daniel Lelis Baggio
46373d5b-aa72-4200-b0ba-c62ddda1cd1b
Asymmetric Cryptography on GPUs
The paper discusses the use of CUDA to accelerate asymmetric cryptographic algorithms based on modular exponentiation, such as RSA and DSA, and elliptic curve-based methods such as ECDSA.
/content/cudazone/CUDABrowser/assets/images/applications/293_nvidia-CUDA,P-W-111092-13_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/293_nvidia-CUDA,P-W-111092-13_large.jpg
Academia
Horst Gortz Institute for IT Security, Ruhr University Bochum
http://www.crypto.rub.de
2008
08
10
08/10/2008
N. z.
Robert Szerwinski
Paper
Numerics
Science
RSA, DSA, ECDSA, ECC, Robert Szerwinski
d095f5a1-5a77-4533-9f22-576c0d37f0aa
Level-3 BLAS on a GPU: Picking the Low Hanging Fruit
The arrival of hardware accelerators has created a new gold rush to be the first to deliver their promise of high performance for numerical applications. Since they are relatively hard to program, with limited language and compiler support, it is generally accepted that one needs to roll up one's sleeves and tough it out, not unlike the early days of distributed memory parallel computing (or any other period after the introduction of a drastically different architecture). In this paper we remind the community that while this is a noble endeavor, there is a lot of low-hanging fruit that can be harvested easily. Picking this low-hanging fruit benefits the scientific computing community immediately and prototypes the approach that further optimizations may wish to follow. We demonstrate this by focusing on a widely used set of operations, the level-3 BLAS, targeting the NVIDIA family of GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/292_level3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/292_level3_large.png
Research
Universitat Jaume I, Spain - University of Texas at Austin
2009
05
22
05/22/2009
4
Francisco Igual
Gregorio Quintana
Robert van de Geijn
Paper
Numerics
Francisco Igual, Gregorio Quintana, Robert van de Geijn
795116c5-6b29-4699-849d-7fd134527031
Density field viewer
This CUDA demo ray traces a density volume and surrounding triangle objects in real time. In the demo, the density is a smoke simulation done by Michael Bang and Brian Bunch Christensen from the Department of Computer Science, University of Aarhus.
/content/cudazone/CUDABrowser/assets/images/applications/291_cool_smoke_small.png
/content/cudazone/CUDABrowser/assets/images/applications/291_cool_smoke_large.png
Research
Alexandra Institute
http://www.alexandra.dk/
2009
05
15
05/15/2009
10
Peter Trier
Application
Multimedia
Computational Fluid Dynamics
Graphics
Science
Ray tracing, volume rendering, Peter Trier
aece5608-4e34-46be-aa9d-69da7de03fbf
Accelerating RTM on GPU, what is the current status?
This presentation describes experience with GPU-parallelization of the Reverse Time Migration (RTM) approach to seismic depth imaging. RTM is reviewed at a high level, followed by GPU implementation geared towards multi-GPU systems. GPU code (in this case C for CUDA) was generated using CAPS HMPP programming directives.
/content/cudazone/CUDABrowser/assets/images/applications/290_trc_small.png
/content/cudazone/CUDABrowser/assets/images/applications/290_trc_large.png
Commercial
NVIDIA
http://www.nvidia.com
2009
01
01
01/01/2009
Henri Calandra
Stephane Bihan
Paulius Micikevicius
Paper
Signal Processing
Henri Calandra, Stephane Bihan, Paulius Micikevicius
61950ddf-e994-4823-a034-9b92880a62f9
Parallelized Turing bombe & Enigma simulations
This project involves implementing simulations of Enigma machines and the Turing bombe on various parallel-computing systems including multi-processor PCs, Linux clusters, and modern enhanced graphics cards.
/content/cudazone/CUDABrowser/assets/images/applications/289_180px-Enigma-rotor-stack_small.png
/content/cudazone/CUDABrowser/assets/images/applications/289_180px-Enigma-rotor-stack_large.png
Academia
Department of Electronic Engineering, La Trobe University, Bundoora, Australia
http://www.latrobe.edu.au/ee/
2009
05
14
05/14/2009
35
Open source
Cong Van Nguyen
Application
Multimedia
Code
Numerics
Enigma, Turing bombe, CUDA, Cong Van Nguyen
43f8cd9c-46c8-48d4-aac8-588815fde07b
CUDA Accelerated Expectation Maximization of Gaussian Mixture Models
This is a CUDA implementation of the Expectation Maximization algorithm for Gaussian Mixture Models. On my machine, it provides up to 170x performance increases versus a CPU reference version. See the report available at http://andrewharp.com/gmmcuda for more information.
/content/cudazone/CUDABrowser/assets/images/applications/288_em_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/288_em_large.jpg
2009
5
6
5/6/2009
170
Andrew Harp
Paper
Code
Numerics
Science
Other
Machine Learning
Clustering
AI
Statistics
GMM, Gaussian, Machine Learning, Statistics, Andrew Harp
7fcf2257-7a1f-4f39-b2f1-78266b3cc524
Accelerating Stencil-Based Computations by Increased Temporal Locality on Modern Multi- and Many-Core Architectures
Stencil computations arise in a wide range of applications of computational sciences. This paper focuses on stencil computations arising in the context of a biomedical simulation. Compute-intensive biomedical simulations represent an attractive application for the Cell Broadband Engine Architecture (CBEA) and for graphics processing units (GPUs) as hardware accelerators. Due to the low arithmetic intensity of stencil computations and bandwidth limitations of the compute hardware, the performance is usually only a fraction of peak performance. We detail an implementation of parallel stencil computations on the CBEA and GPUs, which improves performance by exploiting temporal locality. We report on performance improvements over CPU implementations.
/content/cudazone/CUDABrowser/assets/images/applications/287_christenschenk_small.png
/content/cudazone/CUDABrowser/assets/images/applications/287_christenschenk_large.png
2008
12
01
12/01/2008
Matthias Christen
Olaf Schenk
Peter Messmer
Paper
Numerics
Matthias Christen, Olaf Schenk, Peter Messmer
5c5a6a6d-ccd0-47d4-bdff-bc89f7320589
Seismic imaging using GPGPU accelerated Reverse Time Migration CS315A Final Project Report
In this report, I outline the implementation and preliminary benchmarking of a parallelized program to perform reverse time migration (RTM) seismic imaging using the Nvidia CUDA platform for scientific computing, accelerated by a general purpose graphics processing unit (GPGPU). This novel software architecture allows access to the massively parallel computational capabilities of a high performance GPU system, which is leveraged for its high throughput of numeric capabilities.
/content/cudazone/CUDABrowser/assets/images/applications/286_nwmoussa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/286_nwmoussa_large.png
Academia
Stanford University
http://folding.stanford.edu/
2009
01
01
01/01/2009
Nader W. Moussa
Paper
Science
Nader W. Moussa
a7979b2b-d04d-4ce5-9395-ba7f6facbd8e
GPU accelerated Monte Carlo simulation of the Ising model
The compute unified device architecture (CUDA) is a programming approach for performing scientific calculations on a graphics processing unit (GPU) as a data-parallel computing device. First, we apply this new technology to Monte Carlo simulations of the two dimensional ferromagnetic square lattice Ising model. By implementing a variant of the checkerboard algorithm, results are obtained up to 60 times faster on the GPU than on a current CPU core. An implementation of the three dimensional ferromagnetic cubic lattice Ising model on a GPU is able to generate results up to 35 times faster than on a current CPU core. As proof of concept we calculate the critical temperature of the 2D and 3D Ising model using finite size scaling techniques. Theoretical results for the 2D Ising model and previous simulation results for the 3D Ising model can be reproduced.
/content/cudazone/CUDABrowser/assets/images/applications/285_montecarlo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/285_montecarlo_large.png
Academia
Johannes Gutenberg University Mainz
2009
4
30
4/30/2009
60
Open source
Tobias Preis
Multimedia
Paper
Code
Numerics
Tobias Preis
827b03f6-1371-40f2-b2a2-406e56f22825
A GPU interval library based on Boost interval
Interval arithmetic is widely used in numerical algorithms requiring reliability. Ray tracing of implicit surfaces is one such application, using interval arithmetic to increase the quality of the produced image. However, these applications are computationally demanding. One solution is to use the graphics processing unit (GPU) in order to take advantage of its computational power. In this paper we describe a GPU implementation of interval operators based on the Boost library. We tested these operators on a ray tracing algorithm and observed execution speed improvements of several orders of magnitude over the CPU version with the same image quality.
/content/cudazone/CUDABrowser/assets/images/applications/284_interval_small.png
/content/cudazone/CUDABrowser/assets/images/applications/284_interval_large.png
2008
03
13
03/13/2008
300
Sylvain Collange
Jorge Florez
David Defour
Paper
Libraries
Sylvain Collange, Jorge Florez, David Defour
a257d123-ba1c-4b24-99ae-3cf4c49b4f28
Implementation of float-float operators on graphics hardware
The Graphic Processing Unit (GPU) has evolved into a powerful and flexible processor. The latest graphic processors provide fully programmable vertex and pixel processing units that support vector operations up to single floating-point precision. This computational power is now being used for general-purpose computations. However, some applications require higher precision than single precision. This paper describes the emulation of a 44-bit floating-point number format and its corresponding operations. An implementation is presented along with performance and accuracy results.
/content/cudazone/CUDABrowser/assets/images/applications/283_floats_small.png
/content/cudazone/CUDABrowser/assets/images/applications/283_floats_large.png
Academia
Dali, LP2A, Universite de Perpignan
2006
03
29
03/29/2006
Guillaume Da Graca
David Defour
Paper
Programming Tools
Guillaume Da Graca, David Defour
cd2c96da-0961-4644-af98-390aa7bbdb59
Power Consumption of GPUs from a Software Perspective
GPUs are now considered as serious challengers for high performance computing solutions. They have power consumptions up to 300 W. This may lead to power supply and thermal dissipation problems in computing centers. In this article we investigate, using measurements, how and where modern GPUs are using energy during various computations in a CUDA environment.
/content/cudazone/CUDABrowser/assets/images/applications/282_pcg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/282_pcg_large.png
Academia
ELIAUS, Univ. de Perpignan
http://www.univperp.fr
2009
02
12
02/12/2009
Sylvain Collange
David Defour
Arnaud Tisserand
Paper
Programming Tools
Sylvain Collange, David Defour, Arnaud Tisserand
2ad66f68-4139-4334-b027-f3a282d242a2
Barra, a Modular Functional GPU Simulator for GPGPU
The use of GPUs for general-purpose applications promises huge performance returns for a small investment. However, the internal design of such processors is undocumented and many details are unknown, preventing developers from optimizing their code for these architectures. One solution is to use functional simulation to determine program behavior and gather statistics when counters are missing or unavailable. In this article we present a GPU functional simulator targeting GPGPU, based on the UNISIM framework, which takes an NVIDIA cubin file as input.
/content/cudazone/CUDABrowser/assets/images/applications/281_bmfg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/281_bmfg_large.png
Academia
ELIAUS, Univ. de Perpignan
http://www.univperp.fr
2009
02
09
02/09/2009
Sylvain Collange
David Defour
David Parello
Paper
Programming Tools
Sylvain Collange, David Defour, David Parello
2234c230-375e-11de-8a39-0800200c9a66
Stochastic Differential Equations with CUDA
Numerical integration of stochastic differential equations is commonly used in many branches of science. In this paper we show how to accelerate this kind of numerical calculation with popular NVIDIA Graphics Processing Units using the CUDA programming environment. We address general aspects of numerical programming on stream processors and illustrate them with two examples: the noisy phase dynamics in a Josephson junction and the noisy Kuramoto model. In the presented cases the measured speedup can be as high as 675x compared to a standard CPU, which corresponds to several billion integration steps per second. This means that calculations which took weeks can now be completed in less than one hour. This brings stochastic simulation to a completely new level, opening up for research a whole new range of problems which can now be solved interactively.
/content/cudazone/CUDABrowser/assets/images/applications/280_cudasde3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/280_cudasde3_large.png
Academia
University of Silesia
http://www.us.edu.pl
2009
03
23
03/23/2009
675
Open source
Michal Januszewski
Marcin Kostur
Code
Paper
Numerics
Science
SDE, stochastic, simulation, Langevin, Michal Januszewski, Marcin Kostur
1b536fc0-375e-11de-8a39-0800200c9a66
Efficient Sparse Matrix-Vector Multiplication on CUDA
In this paper we discuss data structures and algorithms for SpMV that are efficiently implemented on the CUDA platform for the fine-grained parallel architecture of the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/279_esmv_small.png
/content/cudazone/CUDABrowser/assets/images/applications/279_esmv_large.png
Commercial
NVIDIA
http://www.nvidia.com
2008
12
08
12/08/2008
Nathan Bell
Code
Paper
Numerics
Libraries
Sparse Matrix, SpMV, iterative, Nathan Bell
11fb7800-375e-11de-8a39-0800200c9a66
Solving Kinetic Equations on GPUs I: Model Kinetic Equations
We present an algorithm specifically tailored for solving kinetic equations on GPUs. The efficiency of the algorithm is demonstrated by solving the one-dimensional shock wave structure problem and a two-dimensional low Mach number driven cavity flow. Computational results show that it is possible to cut the computing time of the sequential codes by two orders of magnitude. The algorithm can easily be extended to three-dimensional flows and more general collision models.
/content/cudazone/CUDABrowser/assets/images/applications/278_ske_small.png
/content/cudazone/CUDABrowser/assets/images/applications/278_ske_large.png
Academia
Politecnico di Milano / Dipartimento di Matematica
http://www.mate.polimi.it/
2009
03
24
03/24/2009
500
Aldo Frezzotti
Gian Pietro Ghiroldi
Livio Gibelli
Paper
Computational Fluid Dynamics
Numerics
Science
Rarefied Gas Dynamics
Rarefied Gas Dynamics, Boltzmann Equation, Semi-regular method of solution, Aldo Frezzotti, Gian Pietro Ghiroldi, Livio Gibelli
fa944db0-35cb-11de-8a39-0800200c9a66
Reduction to Condensed Forms for Symmetric Eigenvalue Problems on Multi-core Architectures
We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem. The target architecture is a current general purpose multi-core processor, where parallelism is extracted using a tuned multi-threaded implementation of BLAS. Also, in response to the advances of hardware accelerators, we modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Our results on a system with two Intel QuadCore processors and a Tesla C1060 GPU illustrate the performance and scalability delivered by these architectures.
/content/cudazone/CUDABrowser/assets/images/applications/277_hpf_small.png
/content/cudazone/CUDABrowser/assets/images/applications/277_hpf_large.png
Research
Aachen University, University Jaume I, ETH Zurich
www.hpca.uji.es
2009
04
02
04/02/2009
12
Paolo Bientinesi
Francisco Igual
Daniel Kressner
Enrique Quintana-Orti
Paper
Numerics
Eigenvalues, CUDA, Tesla, Paolo Bientinesi, Francisco Igual, Daniel Kressner, Enrique Quintana-Orti
f4522170-35cb-11de-8a39-0800200c9a66
Using Graphics Processors to Accelerate the Solution of Out-of-Core Linear Systems
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 x 100,000 symmetric positive definite linear system in less than one hour. Thus, for problems that used to be considered large, it is not necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution to be computed on a fast multithreaded architecture like a desktop computer equipped with a GPU. This paper provides evidence in support of these claims.
/content/cudazone/CUDABrowser/assets/images/applications/276_ooc_small.png
/content/cudazone/CUDABrowser/assets/images/applications/276_ooc_large.png
Academia
Universitat Jaume I (Castellon, Spain)
2009
04
02
04/02/2009
Mercedes Marques
Gregorio Quintana-Orti
Enrique Quintana-Orti
Robert van de Geijn
Paper
Numerics
Mercedes Marques, Gregorio Quintana-Orti, Enrique Quintana-Orti, Robert van de Geijn
ed20da40-35cb-11de-8a39-0800200c9a66
Multidimensional Decomposition for Nuclear Magnetic Resonance
Recently, multilinear decomposition was established as a new robust method for processing multidimensional Nuclear Magnetic Resonance data, and these results were published in Nature. In this application, problem sizes are so large that solutions take several days on modern workstations. The algorithm is based on a sparse implementation of the parallel factor decomposition algorithm (PARAFAC), which performs sparsely defined alternating least squares minimization. The algorithm can solve several small and medium-sized problems simultaneously using the full power of several GPU multiprocessors, or one huge (rank over 1000) sparse PARAFAC task. It reduces NMR data processing time by a factor of 20-30, allowing the work to be done on a modern PC supercomputer from NVIDIA instead of a medium-sized Linux cluster.
/content/cudazone/CUDABrowser/assets/images/applications/275_mdd-nmr_small.png
/content/cudazone/CUDABrowser/assets/images/applications/275_mdd-nmr_large.png
Commercial
Commercial
Elegant Mathematics Ltd.
http://www.elegant-mathematics.com/
2009
04
02
04/02/2009
40
Ilgis Ibragimov
Code
Numerics
Libraries
Science
Signal Processing
Multilinear Multidimensional Tensor Singular Value Decomposition, Ilgis Ibragimov
e17a1080-35cb-11de-8a39-0800200c9a66
Real-Time Fiber Tracking
Fiber tracking is a technique based on diffusion tensor magnetic resonance imaging (DT-MRI) that allows a neurosurgeon to visualize the neuronal fibers in the brain of a patient. By using CUDA, our fiber tracking tool is now much more interactive.
/content/cudazone/CUDABrowser/assets/images/applications/274_fiber_small.png
/content/cudazone/CUDABrowser/assets/images/applications/274_fiber_large.png
Academia
The Cyclops Group - Laboratory for Image Processing and Computer Graphics
http://www.lapix.ufsc.br/
2009
03
20
04/20/2009
Adiel Mittmann
Multimedia
Paper
Medical Imaging
fiber tracking, dt-mri, diffusion tensor imaging, Adiel Mittmann
dc080580-35cb-11de-8a39-0800200c9a66
Thrust
Thrust is a CUDA library of parallel algorithms with an interface resembling the C++ Standard Template Library (STL). Thrust provides a flexible high-level interface for GPU programming that greatly enhances developer productivity. Develop high-performance applications rapidly with Thrust!
/content/cudazone/CUDABrowser/assets/images/applications/273_thrust_small.png
/content/cudazone/CUDABrowser/assets/images/applications/273_thrust_large.png
Commercial
NVIDIA
http://www.nvidia.com
2009
04
06
04/06/2009
Nathan Bell
Code
Libraries
STL CUDA Templates C++ HighLevel, Nathan Bell
d6b394a0-35cb-11de-8a39-0800200c9a66
2D FDTD Wave Propagation
The finite difference time domain (FDTD) solution of the scalar wave equation over a two-dimensional space discretized by a 2D grid of uniform grid cells.
/content/cudazone/CUDABrowser/assets/images/applications/272_fdtd_small.png
/content/cudazone/CUDABrowser/assets/images/applications/272_fdtd_large.png
Academia
University of Stuttgart
2008
02
06
02/06/2008
50
Open source
Ana Balevic
Code
Presentation
Science
Computational electromagnetics, FDTD, Wave propagation, Ana Balevic
ccb37600-35cb-11de-8a39-0800200c9a66
Fast and Scalable List Ranking on the GPU
In this paper, we describe two implementations of List Ranking, a traditional irregular algorithm that is difficult to parallelize on massively multi-threaded hardware. We first present an implementation of Wyllie's algorithm based on pointer jumping. This technique does not scale well to large lists due to the suboptimal work done. We then present a GPU-optimized Recursive Helman-JaJa (RHJ) algorithm. Our RHJ implementation can rank a random list of 32 million elements in about a second and achieves a speedup of about 8-9 over a CPU implementation as well as a speedup of 3-4 over the best reported implementation on the Cell Broadband Engine. We also discuss the practical issues relating to the implementation of irregular algorithms on massively multi-threaded architectures like that of the GPU. Regular or coalesced memory access patterns and balanced load are critical to achieving good performance on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/271_fastscaleable_small.png
/content/cudazone/CUDABrowser/assets/images/applications/271_fastscaleable_large.png
Academia
International Institute of Information Technology, Hyderabad
http://www.iiit.ac.in
2009
04
16
04/16/2009
15
Suhail Rehman
Paper
Numerics
Algorithms
list ranking, GPGPU, irregular algorithm, Suhail Rehman
ab2b9de0-35c7-11de-8a39-0800200c9a66
Statistical phylogenetics
Many-core algorithms for statistical phylogenetics
/content/cudazone/CUDABrowser/assets/images/applications/270_statphy_small.png
/content/cudazone/CUDABrowser/assets/images/applications/270_statphy_large.png
Academia
UCLA
2009
04
01
04/01/2009
90
Open source
Marc Suchard
Code
Paper
Numerics
Life Sciences
Libraries
Marc Suchard
a4938970-35c7-11de-8a39-0800200c9a66
GPUmat
GPUmat is a Freeware library that enables Matlab code to run on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/269_gpumat_small.png
/content/cudazone/CUDABrowser/assets/images/applications/269_gpumat_large.png
Commercial
The GP-you Group
http://gp-you.org/
2009
04
28
04/28/2009
The GP-you Group
Application
Numerics
Matlab
95ffb1e0-35c7-11de-8a39-0800200c9a66
GPU accelerated Poisson Boltzmann calculations and their comparison to the ASIC MDGRAPE-3
For proper functionality, biomolecules need to be in specific environments (water, biomembranes, specialized tissues, etc.). The Poisson Boltzmann (PB) approach can be used to account for this environmental effect in simulation studies of biomolecular matter. Here we study the suitability of the GPU (NVIDIA GTX 280) to accelerate PB computations within an enhanced Boundary Element Method (BEM). We compare it to a general-purpose CPU (Intel Quad-Core Xeon E5430) and a specifically designed chip (the ASIC MDGRAPE-3). Both specialized devices, i.e., the GPU as well as the ASIC, offer comparable compute performance, revealing a theoretical speedup of approximately 39x within the current implementation.
/content/cudazone/CUDABrowser/assets/images/applications/268_pbbem_small.png
/content/cudazone/CUDABrowser/assets/images/applications/268_pbbem_large.png
Academia
Keio University; University of Electro-Communications; RIKEN Advanced Science Institute; University of Bologna; Michigan Tech
2009
02
04
02/04/2009
39
Tetsu Narumi
Kenji Yasuoka
Makoto Taiji
Siegfried Hoefinger
Paper
Life Sciences
Computational Biophysical Chemistry
solvation, Poisson Boltzmann, implicit solvation models, GPU, ASIC, MDGRAPE, Tetsu Narumi, Kenji Yasuoka, Makoto Taiji, Siegfried Hoefinger
b4f27b00-1731-11de-8c30-0800200c9a66
Tool for Generalized Harmonics Analysis
An implementation of Generalized Harmonics Analysis with CUDA
/content/cudazone/CUDABrowser/assets/images/applications/267_harmonics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/267_harmonics_large.png
Research
Nishihara's Laboratory in Tokyo Institute of Technology
http://www.nh.cradle.titech.ac.jp/
2009
03
19
03/10/2009
420
Hisayori Noda
Application
Numerics
Signal Processing
Hisayori Noda
ad352070-1731-11de-8c30-0800200c9a66
vReveal
vReveal is MotionDSP's video enhancement software for Windows. It can dramatically improve the quality of flawed videos with just one click. The unrivaled enhancement technology powering vReveal works wonders with videos that are shaky, dark, noisy, pixelated, or blurry. With vReveal, you can play video files, preview enhancements to videos, and then save enhanced videos to disk or upload them directly to YouTube. Plus, vReveal has been specially tuned to run up to five times faster on CUDA-enabled NVIDIA graphics processors. That means you can process video enhancements in less time and have your CPU available for normal everyday tasks like emailing and internet browsing.
/content/cudazone/CUDABrowser/assets/images/applications/266_vraveal_new_small.png
/content/cudazone/CUDABrowser/assets/images/applications/266_vraveal_new_large.png
Commercial
MotionDSP
http://www.vreveal.com
2009
03
24
03/24/2009
5
Commercial
MotionDSP
Application
Digital Content Creation
Imaging
Signal Processing
Video & Audio
vReveal, video enhancement, GPU, super-resolution, MotionDSP
a653ce00-1731-11de-8c30-0800200c9a66
Discontinuous Galerkin Methods
DG methods yield high-order accurate PDE (partial differential equation) solvers on unstructured meshes. The special structure of the DG operator is shown to be well suited to the CUDA parallel computation model, allowing net application speeds exceeding 200 GFlop/s.
/content/cudazone/CUDABrowser/assets/images/applications/265_galerkin_methods_small.png
/content/cudazone/CUDABrowser/assets/images/applications/265_galerkin_methods_large.png
Academia
Brown/Rice Collaboration
http://www.dam.brown.edu/scicomp
2008
12
18
12/18/2008
50
Open source
A Kloeckner
T Warburton
J Bridge
JS Hesthaven
Paper
Multimedia
Computational Fluid Dynamics
Numerics
Science
A Kloeckner, T Warburton, J Bridge, JS Hesthaven
8843e330-0f59-11de-8c30-0800200c9a66
High-Speed Single-Database PIR Implementation
In this HotPETs session we would like to present an implementation of a single-database Private Information Retrieval (PIR) scheme that can process a database at 2 Gbits/s using a commodity Graphics Processing Unit (GPU).
http://www.petsymposium.org/2008/hotpets/PIR.pdf
/content/cudazone/CUDABrowser/assets/images/applications/258_PIR_small.png
/content/cudazone/CUDABrowser/assets/images/applications/258_PIR_large.png
Academia
University of Limoges
http://www.unilim.fr
2008
01
01
01/01/2008
Carlos Aquilar
Paper
Carlos Aquilar
850c10c0-0f59-11de-8c30-0800200c9a66
Lattice QCD on modern graphics cards
Presentation on using CUDA-enabled GPUs to solve the Dirac-Wilson equation on the lattice.
/content/cudazone/CUDABrowser/assets/images/applications/257_lattice_small.png
/content/cudazone/CUDABrowser/assets/images/applications/257_lattice_large.png
Academia
University of Wuppertal & University Budapest
2007
11
09
11/09/2007
10
Gyozo Egri
Paper
Gyozo Egri
81e30b60-0f59-11de-8c30-0800200c9a66
Fine-grained Parallelization of Lattice QCD Kernel Routine on GPU
This paper briefly describes our experience parallelizing a kernel function responsible for computing the action of the Dirac operator, called Hopping_Matrix. This routine accounts for most of the execution time of a simulation of the classical problem of Lattice Quantum Chromodynamics (Lattice QCD).
/content/cudazone/CUDABrowser/assets/images/applications/263_fine_grained_small.png
/content/cudazone/CUDABrowser/assets/images/applications/263_fine_grained_large.png
Academia
INRIA
http://irisa.fr/home_html-en?set_language=en
2008
01
01
01/01/2008
9
Khaled Z. Ibrahim
F. Bodin
Olivier Pene
Paper
Numerics
Khaled Z. Ibrahim, F. Bodin, Olivier Pene
E72d2b0d0-0f59-11de-8c30-0800200c9a66
Blasting through lattice calculations using CUDA
Modern graphics hardware is designed for highly parallel numerical tasks and provides significant cost and performance benefits. Graphics hardware vendors are now making available development tools to support general-purpose high performance computing. NVIDIA's CUDA platform, in particular, offers direct access to graphics hardware through a programming language similar to C. Using the CUDA platform we have implemented a Wilson-Dirac operator which runs at an effective 68 Gflops on the Tesla C870. The recently released GeForce GTX 280 runs this same code at 92 Gflops, and we expect further improvement pending code optimization.
/content/cudazone/CUDABrowser/assets/images/applications/256_bu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/256_bu_large.png
Academia
Boston University
http://arxiv.org/PS_cache/arxiv/pdf/0810/0810.5365v1.pdf
2008
10
29
10/29/2008
Kipton Barros
Paper
Numerics
Kipton Barros
3edb5b60-1378-11de-8c30-0800200c9a66
Accelerating Leukocyte Tracking using CUDA
The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application, detection and tracking of white blood cells in video microscopy, can be accelerated by 200x using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU.
/content/cudazone/CUDABrowser/assets/images/applications/255_leukocyte_small.png
/content/cudazone/CUDABrowser/assets/images/applications/255_leukocyte_large.png
Academia
Computer Science and Electrical and Computer Engineering University of Virginia, Charlottesville
http://www.cs.virginia.edu/
2009
05
01
05/01/2009
29
Michael Boyer
David Tarjan
Scott T. Acton
Kevin Skadron
Paper
Life Sciences
Michael Boyer, David Tarjan, Scott T. Acton, Kevin Skadron
5f3db880-0f59-11de-8c30-0800200c9a66
Cloud Services for GPU Computing
Hoopoe is a cloud solution and infrastructure for organizations that allows GPU hardware to be used for computationally intensive tasks while running on existing systems.
http://www.gass-ltd.co.il/hoopoe/Features.aspx
/content/cudazone/CUDABrowser/assets/images/applications/254_hoopoe_small.png
/content/cudazone/CUDABrowser/assets/images/applications/254_hoopoe_large.png
Research
Hoopoe
http://www.gass-ltd.co.il/hoopoe/Features.aspx#instance_types
2008
12
01
2008
2008
Hoopoe
Paper
Science
Hoopoe
5a60c190-0f59-11de-8c30-0800200c9a66
Automatic Parallelization for Graphics Processing Units in JikesRVM
Accelerated graphics cards, including specialized high-performance processors called Graphics Processing Units (GPUs), have become ubiquitous in recent years. On the right kinds of problems, GPUs greatly surpass CPUs in terms of raw performance. However, GPUs are currently used only for a narrow class of special-purpose applications; the raw processing power available in a typical desktop PC is unused most of the time.
http://uwspace.uwaterloo.ca/bitstream/10012/3752/1/thesis.pdf
/content/cudazone/CUDABrowser/assets/images/applications/253_automatic_parallelization_small.png
/content/cudazone/CUDABrowser/assets/images/applications/253_automatic_parallelization_large.png
Academia
University of Waterloo
http://www.lib.uwaterloo.ca/
2008
01
01
01/01/2008
13
Alan Chun-Wai Leung
Paper
Alan Chun-Wai Leung
56788ae0-0f59-11de-8c30-0800200c9a66
Seeing 3D: issues, algorithms and applications
The acquisition of 3D data is becoming very easy. With recent technological advances, we expect 3D acquisition hardware and 3D modeling software to become as popular as their traditional 2D counterparts. This, on the one hand, opens new perspectives for solving computer vision problems related to scene analysis and understanding. On the other hand, it raises new challenges in the storage, analysis, and retrieval of the large amount of 3D data.
/content/cudazone/CUDABrowser/assets/images/applications/252_seeing_3d_small.png
/content/cudazone/CUDABrowser/assets/images/applications/252_seeing_3d_large.png
Academia
Tokyo Institute of Technology, Japan
http://www.titech.ac.jp
2008
01
01
01/01/2008
Hamid Laga
Paper
Signal Processing
Web search, query processing, GPU, Hamid Laga
53b10670-0f59-11de-8c30-0800200c9a66
Using Graphics Processors for High-Performance IR Query Processing
Web search engines are facing formidable performance challenges as they need to process thousands of queries per second over billions of documents. To deal with this heavy workload, current engines use massively parallel architectures of thousands of machines that require large hardware investments. We investigate new ways to build such high-performance IR systems based on Graphical Processing Units (GPUs).
http://www2008.org/papers/pdf/p1213-ding.pdf
/content/cudazone/CUDABrowser/assets/images/applications/251_ir_query_small.png
/content/cudazone/CUDABrowser/assets/images/applications/251_ir_query_large.png
Academia
CIS Department, Polytechnic University
http://www.poly.edu/cse/
2008
04
25
04/25/2008
3
Shuai Ding
Jinru He
Hao Yan
Torsten Suel
Paper
Shuai Ding, Jinru He, Hao Yan, Torsten Suel
4f6d6950-0f59-11de-8c30-0800200c9a66
FLOCKING-BASED DOCUMENT CLUSTERING ON THE GRAPHICS PROCESSING UNIT
Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its complexity O(n^2). As the number of documents grows, it becomes increasingly difficult to receive results in a reasonable amount of time. However, flocking behavior, along with most naturally inspired algorithms such as ant colony optimization and particle swarm optimization, are highly parallel and have experienced improved performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly-parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a huge increase in performance on this new platform. The cost of these high-performance devices is also marginal when compared with the price of cluster machines. In this paper, we have conducted research to exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to all naturally inspired algorithms. Using the CUDA platform from NVIDIA, we developed a document flocking implementation to be run on the NVIDIA GEFORCE 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to be run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3,000 documents. The results of these tests were very significant. Performance gains ranged from three to nearly five times improvement of the GPU over the CPU implementation. This dramatic improvement in runtime makes the GPU a potentially revolutionary platform for document clustering algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/250_flocking_small.png
/content/cudazone/CUDABrowser/assets/images/applications/250_flocking_large.png
Science
U.S. Department of Energy Journal of Undergraduate Research
http://www.scied.science.doe.gov
2007
12
01
12/01/2007
5
Jesse St. Charles
Robert M. Patton
Thomas E. Potok
Xiaohui Cui
Paper
Jesse St. Charles, Robert M. Patton, Thomas E. Potok, Xiaohui Cui
4808dd20-0f59-11de-8c30-0800200c9a66
Real Time 3D Fluid and Particle Simulation and Rendering
This application uses a GPU based 3D fluid solver with a moving domain to seamlessly advect hundreds of thousands of particles. We use a globally second-order accurate fluid simulation method which takes constant calculation time, leading to a highly turbulent but stable simulation suitable for use in a real-time situation. In this demo our solver is used to simulate the turbulent wake behind a car, and we render smoke transported through that wake with a hardware-based volume renderer. The simulation domain is at a fixed position relative to the car, and we exploit Galilean invariance to transform the moving simulation domain into a motionless domain with inflow and outflow boundary conditions. Particles are advected through this fluid, but seamlessly transition to simple Newtonian dynamics as they leave the simulation domain. After simulation, the particles are sorted using a GPU-accelerated radix sort, and then rendered as alpha blended sprites with motion blur and volumetric shadows.
/content/cudazone/CUDABrowser/assets/images/applications/249_GDCSmokeDemo_6_small.png
/content/cudazone/CUDABrowser/assets/images/applications/249_GDCSmokeDemo_6_large.png
Commercial
NVIDIA
http://www.nvidia.com
2009
03
17
03/17/2009
40
Jonathan Cohen
Sarah Tariq
Simon Green
Multimedia
Computational Fluid Dynamics
Computational Fluid Dynamics, CFD, Navier Stokes, Particle simulation, Jonathan Cohen, Sarah Tariq, Simon Green
63708100-f945-11dd-87af-0800200c9a66
8850 Roll Microfilm Scanstation Image Processing
The application processes an 85 Mbyte/s stream of greyscale data as it is captured by the linear CCD array camera of the scanner from the source microfilm. An NVIDIA GPU running a CUDA application then processes the data stream in real time. The image frames (typically newspaper pages, medical records, land records) are enhanced and an additional bitonal data stream is created. Image enhancement includes correction for the microfilm density, background artefact removal, character enhancement and noise reduction without the loss of fine detail. The NVIDIA GPU and CUDA application replaces a dedicated piece of image processing hardware used in our previous Scanstations. The new CUDA solution is faster (more data per second), cheaper and, most importantly, improvements can be made in the software without change to custom hardware.
/content/cudazone/CUDABrowser/assets/images/applications/259_8850-1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/259_8850-1_large.png
Commercial
Wicks and Wilson
http://www.wwl.co.uk
2008
11
03
11/03/2008
20
Commercial
Kevin Keeler
Application
Multimedia
Imaging
Kevin Keeler
448d9910-0f59-11de-8c30-0800200c9a66
GPU in power system engineering
This work demonstrates how the application of a GPU can yield significant speedup in transient stability simulation of large-scale power systems.
/content/cudazone/CUDABrowser/assets/images/applications/247_electricalandcomputerengineering_small.png
/content/cudazone/CUDABrowser/assets/images/applications/247_electricalandcomputerengineering_large.png
Academia
University of Alberta
http://www.engineering.ualberta.ca/ece/
2009
02
25
02/25/2009
340
Vahid Jalili-Marandi
Paper
Numerics
Graphics processors, Parallel programming, Power system transient stability, Vahid Jalili-Marandi
4135bb80-0f59-11de-8c30-0800200c9a66
Graph Cuts via L1 Norm Minimization
Graph cuts have become an increasingly important tool for solving a number of energy minimization problems in computer vision and other fields. In this paper, the graph cut problem is reformulated as an unconstrained L1 norm minimization that can be solved effectively using interior point methods. This reformulation exposes connections between graph cuts and other related continuous optimization problems. Eventually, the problem is reduced to solving a sequence of sparse linear systems involving the Laplacian of the underlying graph. The proposed procedure exploits the structure of these linear systems in a manner that is easily amenable to parallel implementations. Experimental results obtained by applying the procedure to graphs derived from image processing problems are provided.
/content/cudazone/CUDABrowser/assets/images/applications/246_pattern_analysis_small.png
/content/cudazone/CUDABrowser/assets/images/applications/246_pattern_analysis_large.png
Academia
Department of Computer & Information Science, Penn Libraries
http://www.library.upenn.edu/
2008
10
01
October 2008
Arvind Bhusnurmath
Camillo J. Taylor
Paper
Video & Audio
Arvind Bhusnurmath, Camillo J. Taylor
3c7462e0-0f59-11de-8c30-0800200c9a66
3D Finite Difference Computation on GPUs using CUDA
In this paper we describe a GPU parallelization of the 3D finite difference computation using CUDA. Data access redundancy is used as the metric to determine the optimal implementation for both the stencil-only computation and the discretization of the wave equation, which is currently of great interest in seismic computing. For the larger stencils, the described approach achieves a throughput of between 2,400 and over 3,000 million output points per second on a single Tesla 10-series GPU. This is roughly an order of magnitude higher than a 4-core Harpertown CPU running a similar code from the seismic industry. Multi-GPU parallelization is also described, achieving linear scaling with GPUs by overlapping inter-GPU communication with computation.
/content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_small.png
/content/cudazone/CUDABrowser/assets/images/applications/NVIDIACUDA_large.png
Commercial
NVIDIA
http://www.nvidia.com
2009
03
08
03/08/2009
Paulius Micikevicius
Paper
Oil & Gas
Paulius Micikevicius
386bd2f0-0f59-11de-8c30-0800200c9a66
Efficient Parallelization of Stochastic Simulation Algorithm for Chemically Reacting Systems on the Graphics Processing Unit
In this paper, we will show how Single Instruction Multiple Data (SIMD) computation can be implemented on a CUDA-enabled GPU, the NVIDIA GeForce 8800GTX, to efficiently perform ensemble runs of SSA simulations for chemically reacting systems.
/content/cudazone/CUDABrowser/assets/images/applications/262_efficient_small.png
/content/cudazone/CUDABrowser/assets/images/applications/262_efficient_large.png
Academia
Department of Computer Science, University of California, Santa Barbara
http://www.cs.ucsb.edu
2008
12
01
12/01/2008
Hong Li
Linda Petzold
Paper
Hong Li, Linda Petzold
33ae4ae0-0f59-11de-8c30-0800200c9a66
A FAST HYBRID TIME-SYNCHRONOUS/EVENT APPROACH TO PARALLEL DISCRETE EVENT SIMULATION OF QUEUING NETWORKS
The trend in computing architectures has been toward multicore central processing units (CPUs) and graphics processing units (GPUs). An affordable and highly parallelizable GPU is a practical example of Single Instruction, Multiple Data (SIMD) architectures oriented toward stream processing.
http://www.informs-sim.org/wsc08papers/095.pdf
/content/cudazone/CUDABrowser/assets/images/applications/261_fast_hybrid_small.png
/content/cudazone/CUDABrowser/assets/images/applications/261_fast_hybrid_large.png
Academia
Department of Computer and Information Science and Engineering, University of Florida
http://www.cise.ufl.edu/
2008
08
06
08/06/2008
2
Hyungwook Park
Paul A. Fishwick
Paper
Hyungwook Park, Paul A. Fishwick
2f7757f0-0f59-11de-8c30-0800200c9a66
Singular Value Decomposition on GPU using CUDA
Linear algebra algorithms are fundamental to many computing applications. Modern GPUs are suited for many general purpose processing tasks and have emerged as inexpensive high performance co-processors due to their tremendous computing power. In this paper, we present the implementation of singular value decomposition (SVD) of a dense matrix on GPU using the CUDA programming model. SVD is implemented using the twin steps of bidiagonalization followed by diagonalization. It has not been implemented on the GPU before. Bidiagonalization is implemented using a series of Householder transformations which map well to BLAS operations. Diagonalization is performed by applying the implicitly shifted QR algorithm. Our complete SVD implementation significantly outperforms the MATLAB and Intel® Math Kernel Library (MKL) LAPACK implementations on the CPU. We show a speedup of up to 60 over the MATLAB implementation and up to 8 over the Intel MKL implementation on an Intel Dual Core 2.66GHz PC on NVIDIA GTX 280 for large matrices. We also give results for very large matrices on NVIDIA Tesla S1070.
/content/cudazone/CUDABrowser/assets/images/applications/245_auevt_small.png
/content/cudazone/CUDABrowser/assets/images/applications/245_auevt_large.png
Research
IIIT Hyderabad
http://www.iiit.net/
2008
12
15
12/15/2008
60
Sheetal Lahabar
Paper
Numerics
Sheetal Lahabar
6e9f39d0-04a4-11de-8c30-0800200c9a66
Sparse Multifrontal Performance Gains via NVIDIA GPU
Discussion of Access Analytics International's BCSLIB-EXT, an advanced analytic engine for use in finite element analysis, optimization, and data analysis tool suites. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009.
/content/cudazone/CUDABrowser/assets/images/applications/244_sparse_small.png
/content/cudazone/CUDABrowser/assets/images/applications/244_sparse_large.png
Commercial
Access Analytics International
http://www.aanalytics.com
2009
01
16
01/16/2009
7
Dan'l Pierce
Presentation
Numerics
Dan'l Pierce
2bcbdc20-0f59-11de-8c30-0800200c9a66
AN APPROACH FOR THE EFFECTIVE UTILIZATION OF GP-GPUS IN PARALLEL COMBINED SIMULATION
A major challenge in the field of Modeling & Simulation is providing efficient parallel computation for a variety of algorithms. Algorithms that are described easily and computed efficiently for continuous simulation may be complex to describe and/or execute efficiently in a discrete event context, and vice-versa. Real-world models often employ multiple algorithms that are optimally defined in one approach or the other. Parallel combined simulation addresses this problem by allowing models to define algorithmic components across multiple paradigms. In this paper, we illustrate the performance of parallel combined simulation, where the continuous component is executed across multiple graphical processing units (GPUs) and the discrete event component is executed across multiple central processing units (CPUs).
/content/cudazone/CUDABrowser/assets/images/applications/260_approach_small.png
/content/cudazone/CUDABrowser/assets/images/applications/260_approach_large.png
Commercial
The MITRE Corporation
http://www.mitre.org
2008
01
01
2008
539
David W. Bauer Jr.
Matthew McMahon
Ernest H. Page
Paper
Imaging
Simulation
David W. Bauer Jr., Matthew McMahon, Ernest H. Page
156a1ca0-0e03-11de-8c30-0800200c9a66
Particle Swarm Optimization on GPU
Review of Particle Swarm Optimization (PSO) on the GPU. PSO is a population-based stochastic optimization technique inspired by social behavior of bird flocking or fish schooling. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009.
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_large.png
Academia
Department of Mathematics, National Taiwan University
http://www.math.ntu.edu.tw/newenglish
2009
01
16
01/16/2009
270
Weichung Wang
Presentation
Life Sciences
Weichung Wang
0f4626c0-0e03-11de-8c30-0800200c9a66
JaCuda
This project aims to help you run applications on CUDA using Java, Python or Groovy. Basically, we provide a couple of functions that are executed on the GPU; if you don't have a GPU, you can always run them in Java/Python mode, depending on which framework you prefer.
/content/cudazone/CUDABrowser/assets/images/applications/242_jacuda_small.png
/content/cudazone/CUDABrowser/assets/images/applications/242_jacuda_large.png
Research
2008
06
07
06/07/2008
LGPL
Gert Wohlgemuth
Application
Programming Tools
Gert Wohlgemuth
612c3b90-04a4-11de-8c30-0800200c9a66
JCufft
JCufft provides Java bindings for the NVIDIA CUDA FFT implementation.
/content/cudazone/CUDABrowser/assets/images/applications/241_jcufft_small.png
/content/cudazone/CUDABrowser/assets/images/applications/241_jcufft_large.png
Research
JavaGL
http://javagl.de/jcuda/jcublas/JCublas.html
2008
12
31
12/31/2008
javagl@javagl.de
Application
Paper
Programming Tools
javagl@javagl.de
5b705650-04a4-11de-8c30-0800200c9a66
JCublas
JCublas provides Java bindings for the NVIDIA CUDA BLAS implementation, thus making the parallel processing power of modern graphics hardware available for Java programs.
/content/cudazone/CUDABrowser/assets/images/applications/240_jcublas_small.png
/content/cudazone/CUDABrowser/assets/images/applications/240_jcublas_large.png
Research
JavaGL
http://javagl.de/jcuda/jcublas/JCublas.html
2008
12
31
12/31/2008
javagl@javagl.de
Application
Paper
Programming Tools
javagl@javagl.de
5359b830-04a4-11de-8c30-0800200c9a66
FLAGON: Fortran-9X Library for GPU Numerics
FLAGON is an open source library/middleware for using GPUs from Fortran-9X, without necessarily knowing too much C or CUDA. It provides a Fortran Module (similar to a class in C++) that provides variables that are pointers to device variables on the GPU. Several supporting functions are available for data transfer and for manipulating variables on the device, as are simple interfaces to the CUBLAS and CUFFT libraries. Additional functionality is provided by callable functions written in CU; several such functions are available. The CUDPP library is also incorporated. FLAGON has been used to develop relatively large pieces of scientific computing code on the GPU (Fast Multipole Methods on the GPU, plasma turbulence simulations), under both Windows and Linux.
/content/cudazone/CUDABrowser/assets/images/applications/239_flagon_small.png
/content/cudazone/CUDABrowser/assets/images/applications/239_flagon_large.png
Research
Fantalgo, LLC
http://www.fantalgo.com/
2008
08
01
08/01/2008
Yuancheng Luo
Nail A. Gumerov
Kate Despain
Bill Dorland
Ramani Duraiswami
Application
Imaging
Scientific
Physics
Geoscience
Yuancheng Luo, Nail A. Gumerov, Kate Despain, Bill Dorland, Ramani Duraiswami
1c6dd6c0-0f59-11de-8c30-0800200c9a66
Manufacturing Computations Lab
Agent-Based Models (ABMs) are used to model dynamic systems such as stock markets, societies, and complex biological systems that are difficult to model analytically using partial differential equations. This is particularly the case where the system consists of autonomous individuals who are "intelligent" and evolve over time. By intelligent, we mean that the individuals can independently act based on their goals, the environment and other individuals. The properties of the system as a whole emerge from micro-scale interactions between individuals and the environment. In the recent past, there has been an explosion in ABM research in disciplines such as economics, sociology, ecology, epidemiology, and computational biology.
/content/cudazone/CUDABrowser/assets/images/applications/239_manufacturing_small.png
/content/cudazone/CUDABrowser/assets/images/applications/239_manufacturing_large.png
Academia
Dept. of Mechanical Engineering-Engineering Mechanics, Michigan Tech. University
http://www.me.mtu.edu/
2008
12
01
12/01/2008
Michigan Tech. University
Paper
Multimedia
Numerics
Life Sciences
ABM Simulation, Graphics Processing Units (GPU), GPGPU, Complex Systems, Data-Parallel Algorithms, Michigan Tech. University
029579b0-0f59-11de-8c30-0800200c9a66
GPU Accelerated Radio Astronomy Signal Convolution
The increasing array size of radio astronomy interferometers is causing the associated computation to scale quadratically with the number of array signals. Consequently, efficient usage of alternate processing architectures should be explored in order to meet this computational challenge. Affordable parallel processors have been made available to the general scientific community in the form of the commodity graphics card. This work investigates the use of the Graphics Processing Unit (GPU) in the parallelisation of the combined conjugate multiply and accumulation stage of a correlator for a radio astronomy array. Using NVIDIA's Compute Unified Device Architecture, our testing shows processing speeds from one to two orders of magnitude faster than a Central Processing Unit (CPU) approach.
/content/cudazone/CUDABrowser/assets/images/applications/238_gpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/238_gpu_large.png
Academia
The University of Western Australia
http://www.uwa.edu.au/
2008
07
31
07/31/2008
Chris Harris
Karen Haines
Lister Staveley-Smith
Paper
Signal Processing
Chris Harris, Karen Haines, Lister Staveley-Smith
38e24440-0ed1-11de-8c30-0800200c9a66
GPU Implementation of Belief Propagation Using CUDA for Cloud Tracking and Reconstruction
This paper describes an efficient CUDA-based GPU implementation of the belief propagation algorithm that can be used to speed up stereo image processing and motion tracking calculations without loss of accuracy. Preliminary results in using belief propagation to analyze satellite images of Hurricane Luis for real-time cloud structure and tracking are promising, with speedups of nearly a factor of five.
/content/cudazone/CUDABrowser/assets/images/applications/237_clouds_small.png
/content/cudazone/CUDABrowser/assets/images/applications/237_clouds_large.png
Academia
Department of Computer and Information Sciences, University of Delaware
http://www.cis.udel.edu/
2009
02
13
02/13/2009
5
Scott Grauer-Gray
Chandra Kambhamettu
Kannappan Palaniappan
Paper
Science
Scott Grauer-Gray, Chandra Kambhamettu, Kannappan Palaniappan
f99265b0-0ec9-11de-8c30-0800200c9a66
LOW-COST REAL-TIME SAR SIMULATION FOR APPLICATIONS IN MISSION PLANNING, EDUCATION AND INFORMATION EXTRACTION
SAR simulators are important for a huge variety of applications. Realistic SAR simulations need realistic 3D models, which are often not available. Less realistic models can be used in the less accurate real-time simulation approach. Using modern graphics cards for SAR simulation, even complex environments can be simulated in real time. This is realised by implementing SAR geometry and radiometry within standard graphics hardware, which offers 3D hardware acceleration and programmable graphics processing units (GPUs).
http://www.isprs2007ist.itu.edu.tr/21.pdf
/content/cudazone/CUDABrowser/assets/images/applications/236_sar_small.png
/content/cudazone/CUDABrowser/assets/images/applications/236_sar_large.png
Academia
Institute for Photogrammetry, Universität Stuttgart, Germany
http://www.ifp.uni-stuttgart.de/forschung/photo/georef-Dateien/georef.en.html
2007
05
08
05/08/2007
Timo Balz
Paper
Imaging
Graphics
SAR, Radar, Real-Time, Simulation, Timo Balz
f43e42f0-0ec9-11de-8c30-0800200c9a66
GPU Accelerated Acoustic Likelihood Computations
This paper introduces the use of Graphics Processing Units (GPUs) for computing acoustic likelihoods in a speech recognition system. In addition to their high availability, GPUs provide high computing performance at low cost. We have used an NVIDIA GeForce 8800GTX programmed with CUDA (Compute Unified Device Architecture), which exposes the GPU as a parallel coprocessor.
http://www.crim.ca/Publications/2008/documents/plein_texte/PAR_CarPals_Interspeech2008.pdf
/content/cudazone/CUDABrowser/assets/images/applications/235_likelihood_computations_small.png
/content/cudazone/CUDABrowser/assets/images/applications/235_likelihood_computations_large.png
Academia
Centre de transfert de technologies et de connaissances (CRIM)
http://www.crim.ca/fr/
2008
12
01
12/01/2008
5
Patrick Cardinal
Pierre Dumouchel
Gilles Boulianne
Michel Comeau
Paper
Signal Processing
Patrick Cardinal, Pierre Dumouchel, Gilles Boulianne, Michel Comeau
e4315ff0-0ec9-11de-8c30-0800200c9a66
Ultrasound goes GPU: real-time simulation using CUDA
Despite the increasing adoption of other imaging modalities, ultrasound guidance is widely used for surgical procedures and clinical imaging due to its low cost, non-invasiveness, and real-time visual feedback. Many ultrasound-guided procedures require extensive training and, where possible, training on simulations should be preferred over training on patients. Computational resources for existing approaches to ultrasound simulation are usually limited by real-time requirements. Unlike previous approaches, we simulate freehand ultrasound images from CT data on the Graphics Processing Unit (GPU).
http://campar.in.tum.de/pub/reichl2009spie/reichl2009spie.pdf
/content/cudazone/CUDABrowser/assets/images/applications/234_ultrasound_small.png
/content/cudazone/CUDABrowser/assets/images/applications/234_ultrasound_large.png
Research
Computer-Aided Medical Procedures (CAMP), TUM, Munich, Germany
http://wwwnavab.in.tum.de/WebHome
2009
02
01
02/01/2009
Tobias Reichl
Josh Passenger
Oscar Acosta
Olivier Salvado
Paper
Life Sciences
Tobias Reichl, Josh Passenger, Oscar Acosta, Olivier Salvado
f1c168e0-0e15-11de-8c30-0800200c9a66
GPU FX Spectrometer using CUDA
The next generation of radio telescopes, such as the Square Kilometer Array and the associated Pathfinder arrays, require vast amounts of computation due to the extremely large number of interferometers and the imaging requirements. The hardware for this computation is becoming a significant consideration in array design, both in terms of initial cost and power consumption. Graphics Processing Units (GPUs) provide power efficiency and affordability as well as the flexibility of general purpose hardware. This work implements a GPU-based FX spectrometer, which processes four streams of 8-bit interferometer data, for a variable number of frequency channels. This approach scales well with frequency channels, with computation speeds up to 18 times faster than those of a CPU implementation. Further work is in progress to scale the algorithm with the number of interferometer streams, and to investigate optimisation of the GPU algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/233_gpu_fx_spectrometer_small.png
/content/cudazone/CUDABrowser/assets/images/applications/233_gpu_fx_spectrometer_large.png
Research
The University of Western Australia
http://www.uwa.edu.au/
2007
12
01
12/01/2007
Chris Harris
Paper
Signal Processing
Imaging
Chris Harris
ec53f1c0-0e15-11de-8c30-0800200c9a66
IMPLEMENTING ALGORITHMS FOR SIGNAL AND IMAGE RECONSTRUCTION ON GRAPHICAL PROCESSING UNITS
Several highly effective algorithms that have been proposed recently for compressed sensing and image processing applications can be implemented efficiently on commodity graphical processing units (GPUs). The properties of algorithms and applications that make for efficient GPU implementation are discussed, and computational results for several algorithms are presented that show large speedups over CPU implementations.
/content/cudazone/CUDABrowser/assets/images/applications/232_signal_images_small.png
/content/cudazone/CUDABrowser/assets/images/applications/232_signal_images_large.png
Academia
The University of Wisconsin Madison
http://www.cs.wisc.edu/
2008
12
01
12/01/2008
169
SANGKYUN LEE
STEPHEN J. WRIGHT
Code
Paper
Numerics
Imaging
Graphical processing units, compressed sensing, image denoising, image deblurring, SANGKYUN LEE, STEPHEN J. WRIGHT
e3fb68a0-0e15-11de-8c30-0800200c9a66
CUDA Fluid Simulation in NVIDIA PhysX
The NVIDIA fluid particle demo uses PhysX technology accelerated by the CUDA architecture. It was released in October 2008 as part of a GeForce Powerpack. Over 64000 SPH fluid particles pour into the scene and push aside wooden crates, which float up as the fluid level rises. All these particles are simulated in real time using accelerated PhysX; each SPH particle moves in the scene as the result of interactions with other particles, rigid body objects and the surrounding environment. This demo was recorded using a GeForce 9800GTX+.
/content/cudazone/CUDABrowser/assets/images/applications/231_fluidsim_small.png
/content/cudazone/CUDABrowser/assets/images/applications/231_fluidsim_large.png
Consumer
NVIDIA
http://www.nvidia.com/cuda
2008
12
01
12/01/2008
10
Mark Harris
Paper
Computational Fluid Dynamics
PhysX, Fluids, SPH, Mark Harris
ded49770-0e15-11de-8c30-0800200c9a66
Translating GPU Binaries to Tiered SIMD Architectures with Ocelot
Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once this has been accomplished, even simple architectures with low hardware complexity can easily exploit the parallelism in an application. With these applications in mind, this paper presents Ocelot, a binary translation framework designed to allow architectures other than NVIDIA GPUs to leverage the parallelism in PTX programs. Specifically, we show how (i) the PTX thread hierarchy can be mapped to many-core architectures, (ii) translation techniques can be used to hide memory latency, and (iii) GPU data structures can be efficiently emulated or mapped to native equivalents. We describe the low level implementation of our translator, ending with a case study detailing the complete translation process from PTX to SPU assembly used by the IBM Cell Processor.
/content/cudazone/CUDABrowser/assets/images/applications/230_ptx_small.png
/content/cudazone/CUDABrowser/assets/images/applications/230_ptx_large.png
Academia
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia
http://www.ece.gatech.edu/
2009
01
01
01/01/2009
32
Gregory Diamos
Andrew Kerr
Mukil Kesavan
Paper
Programming Tools
Gregory Diamos, Andrew Kerr, Mukil Kesavan
d964fd70-0e15-11de-8c30-0800200c9a66
Reverberate LE GPU Edition
VST convolution reverb for PC, powered by NVIDIA CUDA for GeForce 8 series and above.
/content/cudazone/CUDABrowser/assets/images/applications/229_reverberate_small.png
/content/cudazone/CUDABrowser/assets/images/applications/229_reverberate_large.png
Commercial
LiquidSonics
http://www.liquidsonics.com/publicbeta/
2009
03
05
03/05/2009
LiquidSonics
Application
Video & Audio
VST, Audio, Reverb, LiquidSonics
d46f0040-0e15-11de-8c30-0800200c9a66
VST Plugin: Convolution Reverb on NVidia GPUs
The plugin can load WAV files as impulse responses and can be used together with an NVIDIA GPU (GeForce 8xxx or better) to provide a convolution reverb with almost no CPU usage.
/content/cudazone/CUDABrowser/assets/images/applications/228_kvr_listed_at_png32_small.png
/content/cudazone/CUDABrowser/assets/images/applications/228_kvr_listed_at_png32_large.png
Commercial
nilsschneider.de
http://www.nilsschneider.de/cms/index.php
2008
11
17
11/17/2008
Nils Schneider
Application
Video & Audio
VST, Audio, Reverb, Nils Schneider
cef9c0f0-0e15-11de-8c30-0800200c9a66
Massively Parallel Two-Dimensional TLM Algorithm on Graphics Processing Units
Recent advances in computing technology have brought massively parallel computing power to desktop PCs. As multi-core processor technology becomes mature, a new front in parallel technology based on graphics processors has emerged. A massively parallel 2D-TLM algorithm for NVIDIA advanced graphics processors has been developed. The proposed parallel computing paradigm can be adopted straightforwardly to accelerate time-domain electromagnetic field modeling programs.
/content/cudazone/CUDABrowser/assets/images/applications/227_twodimensional-tlm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/227_twodimensional-tlm_large.png
Academia
University of British Columbia
http://www.ece.ubc.ca/
2009
01
22
01/22/2009
10
Filippo V. Rossi
Poman P.M. So
Nikolaus Fichtner
Peter Russer
Paper
Filippo V. Rossi, Poman P.M. So, Nikolaus Fichtner, Peter Russer
b6f64af0-0e15-11de-8c30-0800200c9a66
Fast GPU Implementation of Sparse Signal Recovery from Random Projections
We consider the problem of sparse signal recovery from a small number of random projections (measurements). This is a well-known NP-hard combinatorial optimization problem. A frequently used approach is based on greedy iterative procedures, such as the Matching Pursuit (MP) algorithm. Here, we discuss a fast GPU implementation of the MP algorithm, based on the recently released NVIDIA CUDA API and CUBLAS library. The results show that the GPU version is substantially faster (up to 31 times) than the highly optimized CPU version based on CBLAS (GNU Scientific Library).
/content/cudazone/CUDABrowser/assets/images/applications/226_sparse_signal_recovery_small.png
/content/cudazone/CUDABrowser/assets/images/applications/226_sparse_signal_recovery_large.png
Academia
University of Calgary
http://www.ucalgary.ca
2009
01
29
01/25/2009
M. Andrecut
Paper
Signal Processing
GPU programming, Nvidia CUDA, sparse signal recovery, random projections, matching pursuit algorithm, M. Andrecut
c9700f40-0e15-11de-8c30-0800200c9a66
ArcSoft TotalMedia Theatre
"The newly released, CUDA-powered SimHD technology is available in TotalMedia Theatre to allow viewers to obtain an HD-like viewing experience from not only DVDs, but also other standard definition multimedia files," said George Tang, ArcSoft's Vice President and General Manager of Video and Home Entertainment Group. "We are pleased to be partnering with NVIDIA to deliver excellence in high-definition video on the PC."
/content/cudazone/CUDABrowser/assets/images/applications/225_arcsoft_small.png
/content/cudazone/CUDABrowser/assets/images/applications/225_arcsoft_large.png
Consumer
ArcSoft
http://www.arcsoft.com/products/totalmediatheatre/
2003
03
10
03/27/2008
ArcSoft
Application
Consumer
ArcSoft
c39d46a0-0e15-11de-8c30-0800200c9a66
SETI@home
SETI (Search for Extraterrestrial Intelligence) is a scientific area whose goal is to detect intelligent life outside Earth. One approach, known as radio SETI, uses radio telescopes to listen for narrow-bandwidth radio signals from space. Such signals are not known to occur naturally, so a detection would provide evidence of extraterrestrial technology. You can participate by running this free program that downloads and analyzes radio telescope data. And now, with this CUDA-optimized client powered by your CUDA-ready GeForce GPU, your system can deliver as much as 10 times the computational power of an average home PC CPU.
/content/cudazone/CUDABrowser/assets/images/applications/224_setiathome_small.png
/content/cudazone/CUDABrowser/assets/images/applications/224_setiathome_large.png
Consumer
University of California, Berkeley
http://setiathome.berkeley.edu/
2008
12
01
12/01/2008
10
University of California, Berkeley
Application
Signal Processing
University of California, Berkeley
be062770-0e15-11de-8c30-0800200c9a66
Folding@home
Download Folding@home and band together with people across the globe by adding the massive compute power of an NVIDIA GeForce GPU to one of the largest supercomputers in the world. Use your PC to help fight many of the world's most devastating diseases while you sleep at night.
/content/cudazone/CUDABrowser/assets/images/applications/223_foldingathome_small.png
/content/cudazone/CUDABrowser/assets/images/applications/223_foldingathome_large.png
Academia
Stanford University
http://folding.stanford.edu/
2008
06
01
06/01/2008
100
Stanford University
Application
Multimedia
Life Sciences
Stanford University
b02413b0-0e15-11de-8c30-0800200c9a66
PowerDirector7 Ultra
Award-winning video editing software with CUDA-accelerated video effects and encoding. PowerDirector 7's support for NVIDIA CUDA technology delivers huge speed gains when encoding HD video into the H.264 format. Offering performance gains of 270% for encoding high-definition video, PowerDirector 7 leverages the power of the GPU to deliver faster results.
http://www.cyberlink.com/multi/products/main_4_en_US.html
/content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_new_small.png
/content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_new_large.png
/content/cudazone/CUDABrowser/assets/images/applications/222_powerdirector_box1.png
Commercial
Cyberlink
http://www.cyberlink.com
2009
02
04
02/04/2009
3.5
Consumer
Cyberlink
Application
Video & Audio
PowerDirector, CyberLink, Video, Video editing, Video processing, Authoring, Video authoring, AVCHD, Blu-ray, H.264, MPEG-2, DV-CAM, Camera, Video camera, Cyberlink
031bb7a0-0e0f-11de-8c30-0800200c9a66
cuSVM
cuSVM is a software package for high-speed (Gaussian-kernelized) Support Vector Machine training and prediction that exploits the massively parallel processing power of Graphics Processors (GPUs). cuSVM is written in NVIDIA's CUDA C-language GPU programming environment, includes implementations of both classification and regression, and performs SVM training (prediction) at 13-73 (22-172) times the rate of state-of-the-art CPU software. Moreover, cuSVM features a Matlab MEX wrapper so that users can access the GPU's power without having to do any "real" programming.
/content/cudazone/CUDABrowser/assets/images/applications/222_cusvm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/222_cusvm_large.png
2009
01
17
01/17/2009
172
AUSTIN CARPENTER
Application
Paper
AUSTIN CARPENTER
fc3423a0-0e0e-11de-8c30-0800200c9a66
CuPP
Big improvements in the performance of graphics processing units (GPUs) and enhancements in the corresponding programming systems have turned GPUs into a compelling platform for high performance computing. In this thesis, we present CuPP, a C++ framework built on NVIDIA's CUDA. CuPP allows easier development of GPU programs, even compared to CUDA, by automating frequent programming tasks, e.g. memory management. The thesis is roughly divided into three parts. We begin with an introduction to the CUDA programming system and discuss difficulties when integrating it into already existing C++ applications. Then we describe the CuPP framework and explain how it solves these difficulties. Afterwards we demonstrate the benefits of CuPP using the example of OpenSteer, a steering library and a demo application. With only a small amount of code changes, the performance was increased by a factor of 42 compared to the CPU version.
/content/cudazone/CUDABrowser/assets/images/applications/221_cupp_small.png
/content/cudazone/CUDABrowser/assets/images/applications/221_cupp_large.png
Academia
Universität Kassel
http://cms.uni-kassel.de/unicms/?id=eecs
2009
01
16
01/16/2009
Jens Breitbart
Application
Paper
Programming Tools
Jens Breitbart
f64a5090-0e0e-11de-8c30-0800200c9a66
Adaptive-Mesh-Refinement Hydrodynamic GPU Computation in Astrophysics
Discussion of GraCCA (Graphic-Card Cluster for Astrophysics) system and AMR (Adaptive-Mesh-Refinement) Hydrodynamics + Self-Gravity Simulation in GPUs. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009.
/content/cudazone/CUDABrowser/assets/images/applications/220_gpu_computation_in_astrophysics_small.png
/content/cudazone/CUDABrowser/assets/images/applications/220_gpu_computation_in_astrophysics_large.png
Academia
Graduate Institute of Physics, National Taiwan University
http://www.ntu.edu.tw/engv4/
2009
01
16
01/16/2009
23
H. Y. Schive
Presentation
Science
H. Y. Schive
f1230a30-0e0e-11de-8c30-0800200c9a66
CRYPTOGRAPHIC COMPUTING ON GPU
Review of Cryptographic Computing on the GPU. Discusses elliptic curve method of factorization (ECM) on the GPU. Presentation made at Workshop on GPU Computing, Center for Quantum Science and Engineering, National Taiwan University, January 2009.
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/243_particle_swarm_large.png
Academia
Dept. of Electrical Engineering, National Taiwan University
http://www.ee.ntu.edu.tw/en/
2009
01
16
01/16/2009
Chen-Mou Cheng
Presentation
Numerics
Chen-Mou Cheng
67a8ffd0-04a4-11de-8c30-0800200c9a66
The GPU Supercomputer of CQSE
We have developed highly efficient CUDA (Compute Unified Device Architecture) codes for our computationally intense problems (quantum chromodynamics, quantum spin systems, and astrophysics). With our GPU supercomputer, we can tackle many large-scale computations without using prohibitively expensive supercomputers like IBM BlueGene.
/content/cudazone/CUDABrowser/assets/images/applications/219_cqse_small.png
/content/cudazone/CUDABrowser/assets/images/applications/219_cqse_large.png
Academia
Center for Quantum Science and Engineering National Taiwan University
http://cqse.ntu.edu.tw/cqse/
2009
01
16
01/16/2009
100
Ting-Wai Chiu
Paper
Ting-Wai Chiu
4c8a6720-04a4-11de-8c30-0800200c9a66
Real-time modelling of sea-surface radiance
To address the issue of scene simulation in marine environments with optical sea clutter, Alyotech Technologies developed a real-time model of the wind-driven sea surface radiance in the IR and visible spectrum. While IR surveillance from surface ships is among the first considered applications, the model was specifically designed to face the tricky problems related to observation at grazing angles. For this purpose, special effort has been made to deal efficiently with the following issues:
- dynamical computing of surface geometry from wave height spectra, including some (limited) nonlinearity,
- representing the surface on optimized multi-scale mesh,
- global illumination of the ocean surface by partially cloudy sky-domes,
- dynamical estimate and rendering of unresolved surface rugosity accounting for both capillary waves and unresolved distant gravity waves.
Real-time animation and rendering for meshes as large as several 10^6 polygons is achieved through massive parallelization on the GPU. Full sky domes for both global illumination and sky rendering are precomputed using SKYGEN, a cloudy-sky simulation software recently developed by Alyotech.
The application is based on CUDA and OpenGL Shader Model 4.0. The CUDA part is at least 50 times faster than current multi-core implementations. Almost all the computation is offloaded to the GPU giving high performance results (around 150 fps using a GeForce GTX 280).
/content/cudazone/CUDABrowser/assets/images/applications/218_sea_small.png
/content/cudazone/CUDABrowser/assets/images/applications/218_sea_large.png
Commercial
Alyotech technologies
http://www.alyotech.com
2009
03
02
03/02/2009
50
Commercial
Stephane Melledant
Sebastien Vince
Goulven Monnier
Multimedia
Ocean, sea, surface, radiance, alyotech, Stephane Melledant, Sebastien Vince, Goulven Monnier
441666c0-04a4-11de-8c30-0800200c9a66
Smith Waterman algorithm
Protein sequence alignment
/content/cudazone/CUDABrowser/assets/images/applications/217_SWA_small.png
/content/cudazone/CUDABrowser/assets/images/applications/217_SWA_large.png
Academia
ICM, University of Warsaw
http://bioinfo.icm.edu.pl/algorithm/
2009
03
09
03/09/2009
3.5
Lukasz Ligowski/Witold Rudnicki
Paper
Life Sciences
bioinformatics, sequence alignment, Lukasz Ligowski, Witold Rudnicki
3cdfef70-04a4-11de-8c30-0800200c9a66
GPU VSIPL
GPU VSIPL is an implementation of the Vector Signal Image Processing Library Application Programming Interface that exploits CUDA-capable GPUs to accelerate signal processing and dense linear algebra applications.
/content/cudazone/CUDABrowser/assets/images/applications/216_GPU_VSIPL_small.png
/content/cudazone/CUDABrowser/assets/images/applications/216_GPU_VSIPL_large.png
Research
Georgia Tech Research Institute
http://www.gtri.gatech.edu
2009
02
27
02/27/2009
75
GPU VSIPL Team
Application
Paper
Numerics
Libraries
Signal Processing
Linear Algebra Signal Processing, GPU VSIPL Team
3512b660-04a4-11de-8c30-0800200c9a66
Boris Pusher
The Boris pusher is a numerical algorithm to advance charged particles in an electromagnetic field. It is widely used in numerical simulations in Plasma Physics. This application implements the Boris Pusher in CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/215_sbp_small.png
/content/cudazone/CUDABrowser/assets/images/applications/215_sbp_large.png
Research
Lasers & Plasma Group (GoLP) of the Institute for Plasmas and Nuclear Fusion (IPFN)
http://cfp.ist.utl.pt/golp/epp/
2008
08
02
08/02/2008
16
Open source
Paulo Abreu
Multimedia
Paper
Science
Paulo Abreu
26f98d10-04a4-11de-8c30-0800200c9a66
Ikena Live
Ikena is a revolutionary video enhancement/forensic solution for Intelligence and Law Enforcement applications. Its entire video processing pipeline is implemented in both CUDA and x86; it runs on any Windows PC (laptop or desktop) and runs up to 5 times faster with NVIDIA GPUs. Ikena's powerful multi-frame enhancement technology can quickly and dramatically extract information from poor video sources such as mobile phones, YouTube videos, and surveillance cameras. In seconds, faces and objects can be enhanced, and license plates can be read.
/content/cudazone/CUDABrowser/assets/images/applications/214_Ikena160x90_small.png
/content/cudazone/CUDABrowser/assets/images/applications/214_Ikena160x90_large.png
Commercial
MotionDSP
http://www.motiondsp.com
2008
12
15
12/15/2008
5
MotionDSP
Multimedia
Video & Audio
Imaging
video, video forensic, video enhancement, CUDA, MotionDSP
80228240-ff98-11dd-87af-0800200c9a66
Many-Core Simulation on GPU Platform
Many-core processors have gained significant attention recently. Research on many-core processors needs a fast and accurate simulator. Existing multi-core simulators run on CPU platforms and are extremely slow when more than sixteen cores are simulated, so traditional simulation techniques are not suitable for the many-core case. In this project, we try to use the GPU platform, an existing many-core instance that favors streaming applications, to simulate a general many-core processor. It is well known that general CPU simulation has very irregular program behavior. Mapping this irregular simulation behavior onto a regular platform while still obtaining a simulation speedup is a big challenge. We plan to do it in two steps: first we make the simulator work without considering simulation speed, and then we tune it to make it faster and easier for the research community to use.
/content/cudazone/CUDABrowser/assets/images/applications/213_irisa1_small.png
/content/cudazone/CUDABrowser/assets/images/applications/213_irisa1_large.png
Academia
Inner Mongolia University
http://www.imu.edu.cn
2007
12
31
12/31/2007
Open Source
He Liqiang
Code
microprocessor
simulation
microprocessor, simulation, GPU, He Liqiang
7053b9b0-ff98-11dd-87af-0800200c9a66
StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems
Today Graphics Processing Units (GPUs) are a largely underexploited resource on existing desktops and a possible cost-effective enhancement to high-performance systems. To date, most applications that exploit GPUs are specialized scientific applications. Little attention has been paid to harnessing these highly parallel devices to support more generic functionality at the operating system or middleware level. This study starts from the hypothesis that generic middleware-level techniques that improve distributed system reliability or performance (such as content addressing, erasure coding, or data similarity detection) can be significantly accelerated using GPU support.
http://www.ece.ubc.ca/~samera/papers/StoreGPU-HPDC08.pdf
/content/cudazone/CUDABrowser/assets/images/applications/212_md5_small.png
/content/cudazone/CUDABrowser/assets/images/applications/212_md5_large.png
Academia
Electrical and Computer Engineering Department, University of British Columbia
http://www.ece.ubc.ca
2008
06
27
06/27/2008
8
Samer Al-Kiswany
Paper
Numerics
MD5, Samer Al-Kiswany
6a162150-ff98-11dd-87af-0800200c9a66
Exegy
Exegy offers hardware-accelerated appliances for the Financial Services community, facilitating the delivery and normalization of market data at very high data rates, without sacrificing latency or useful functionality. Because Exegy appliances are based on nontraditional hardware-acceleration technologies, more market data can be delivered faster without increasing operating costs, space or management time.
/content/cudazone/CUDABrowser/assets/images/applications/211_exegy_small.png
/content/cudazone/CUDABrowser/assets/images/applications/211_exegy_large.png
Commercial
Exegy
http://www.exegy.com
2008
11
16
11/16/2008
74
Exegy
Paper
Financial
Exegy
60c359b0-ff98-11dd-87af-0800200c9a66
GPU-HMMER
GPU-Based MPI-HMMER is an open source MPI implementation of the HMMER protein sequence analysis suite. The main search algorithms, hmmpfam and hmmsearch, have been ported to MPI in order to provide high-throughput HMMER searches on modern computational clusters. We improve on HMMER through sophisticated I/O, a self-contained coordinator/worker model, and the easy inclusion of accelerated architectures. This results in better scalability while still maintaining the familiar user interface.
http://www.mpihmmer.org/
/content/cudazone/CUDABrowser/assets/images/applications/210_hmmr_small.png
/content/cudazone/CUDABrowser/assets/images/applications/210_hmmr_large.png
Research
mpiHMMER
http://www.mpihmmer.org/
2009
02
09
02/09/2009
Open Source
John Paul Walters
Joseph Landman
Vipin Chaudhary
Application
Paper
John Paul Walters, Joseph Landman, Vipin Chaudhary
584a2930-ff98-11dd-87af-0800200c9a66
Glimmer: Multilevel MDS on the GPU
We present Glimmer, a new multilevel algorithm for multidimensional scaling designed to exploit modern graphics processing unit (GPU) hardware. We also present GPU-SF, a parallel, force-based subsystem used by Glimmer. Glimmer organizes input into a hierarchy of levels and recursively applies GPU-SF to combine and refine the levels. The multilevel nature of the algorithm makes local minima less likely while the GPU parallelism improves speed of computation. We propose a robust termination condition for GPU-SF based on a filtered approximation of the normalized stress function. We demonstrate the benefits of Glimmer in terms of speed, normalized stress, and visual quality against several previous algorithms for a range of synthetic and real benchmark datasets. We also show that the performance of Glimmer on GPUs is substantially faster than a CPU implementation of the same algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/209_mds_small.png
/content/cudazone/CUDABrowser/assets/images/applications/209_mds_large.png
Academia
University of British Columbia
http://www.cs.ubc.ca/
2009
01
08
01/08/2009
Open Source
Stephen Ingram
Tamara Munzner
Marc Olano
Paper
Multimedia
Code
Numerics
Stephen Ingram, Tamara Munzner, Marc Olano
4fc93080-ff98-11dd-87af-0800200c9a66
Accelerating Molecular Dynamic Simulations on GPUs Using OpenMM
OpenMM is a freely downloadable, high-performance, extensible library that allows molecular dynamics (MD) simulations to run on high-performance computer architectures, such as graphics processing units (GPUs). Significant speedups, over 100X in some cases, were achieved using OpenMM as compared to a conventional implementation running on a single CPU core. The library performs full protein Hamiltonian calculations without any cutoffs (full O(N^2) treatment). The current release includes a version of GROMACS that uses OpenMM to speed up its calculations on recent versions of NVIDIA and ATI GPUs. It supports implicit solvent models (Onufriev, Bashford, Case GB), with explicit solvent models to be incorporated into the next release.
/content/cudazone/CUDABrowser/assets/images/applications/208_openmm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/208_openmm_large.png
Academia
Simbios
http://simbios.stanford.edu
2009
01
26
01/26/2009
100
Open Source
OpenMM Team
Application
Life Sciences
Libraries
protein folding, RNA folding, molecular dynamics, molecular modeling, GROMACS, OpenMM Team
43e02da0-ff98-11dd-87af-0800200c9a66
Predictive Runtime Code Scheduling for Heterogeneous Architectures
Heterogeneous architectures are currently widespread. With the advent of easy-to-program general purpose GPUs, virtually every recent desktop computer is a heterogeneous system. Combining the CPU and the GPU brings great amounts of processing power. However, such architectures are often used in a restricted way for domain-specific applications like scientific applications and games, and they tend to be used by a single application at a time. We envision future heterogeneous computing systems where all heterogeneous resources are continuously utilized by different applications with versioned critical parts, so that they can better adapt their behavior and improve execution time, power consumption, response time and other constraints at runtime. Under such a model, adaptive scheduling becomes a critical component. In this paper, we propose a novel predictive user-level scheduler for heterogeneous systems based on past performance history. We developed several scheduling policies and present a study of their impact on system performance. We demonstrate that such a scheduler allows multiple applications to fully utilize all available processing resources in CPU/GPU-like systems and consistently achieve speedups ranging from 30% to 40% compared to just using the GPU in a single-application mode.
/content/cudazone/CUDABrowser/assets/images/applications/207_pred_runtime_small.png
/content/cudazone/CUDABrowser/assets/images/applications/207_pred_runtime_large.png
Research
HiPEAC European Network of Excellence
http://www.hipeac.net
2008
01
01
01/01/2008
72
Barcelona Supercomputing Center
Paper
Numerics
Barcelona Supercomputing Center
3afb9120-ff98-11dd-87af-0800200c9a66
GPU Acceleration of a Production Molecular Docking Code
Modeling the interactions of biological molecules, or docking, is critical both to understanding basic life processes and to designing new drugs. Here we describe the GPU-based acceleration of a recently developed, complex, production docking code. We show how the various functions can be mapped to the GPU and present numerous optimizations. We find which parts of the problem domain are best suited to the different correlation methods. The GPU-accelerated system achieves a speedup of at least 16x for all likely problem sizes. This makes it competitive with FPGA-based systems for small molecule docking, and superior for protein-protein docking.
/content/cudazone/CUDABrowser/assets/images/applications/205_moleculardocking_small.png
/content/cudazone/CUDABrowser/assets/images/applications/205_moleculardocking_large.png
Academia
Department of Electrical and Computer Engineering Boston University
http://www.bu.edu/dbin/ece/web
2008
01
01
01/01/2008
16
Bharat Sukhwani
Martin C. Herbordt
Paper
Science
Bharat Sukhwani, Martin C. Herbordt
262c6b20-ff98-11dd-87af-0800200c9a66
Data-Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processors
Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. Hidden Markov Model (HMM) based recognition approaches are widely used for modeling the human speech process by constructing probabilistic estimates of the underlying word sequence from an acoustic signal. High-accuracy speech recognition, however, requires complex models, large vocabulary sizes, and exploration of a very large search space, making the computation too intense for current personal and mobile platforms. In this paper, we explore opportunities for parallelizing the HMM based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and present an efficient implementation on current many-core platforms. For the case study, we use a recognition model of 50,000 English words, with more than 500,000 word bigram transitions, and one million hidden states. We examine important implementation tradeoffs for shared-memory single-chip many-core processors by implementing LVCSR on the NVIDIA G80 Graphics Processing Unit (GPU) in Compute Unified Device Architecture (CUDA), leading to significant speedups. This work is an important step forward for LVCSR-based applications to leverage many-core processors in achieving real-time performance on personal and mobile computing platforms.
/content/cudazone/CUDABrowser/assets/images/applications/206_continuos_speech_recognition_small.png
/content/cudazone/CUDABrowser/assets/images/applications/206_continuos_speech_recognition_large.png
Academia
Electrical Engineering and Computer Sciences University of California at Berkeley
http://www.eecs.berkeley.edu
2008
05
22
05/22/2008
9
Jike Chong
Youngmin Yi
Arlo Faria
Nadathur Satish
Kurt Keutzer
Paper
Video & Audio
Jike Chong, Youngmin Yi, Arlo Faria, Nadathur Satish, Kurt Keutzer
16639600-ff98-11dd-87af-0800200c9a66
Improving Optical Character Recognition
There is a clear need for optical character recognition to provide a fast and accurate method to search both existing images and large archives of existing paper documents. However, existing optical character recognition programs suffer from a flawed tradeoff between speed and accuracy, making them less attractive for large quantities of documents. This paper analyzes five different algorithms which operate completely independently of optical character recognition programs, but which have the combined effect of decreasing computational complexity and increasing overall accuracy. Finally, the paper proposes implementing each of these algorithms on the GPU, as well as optical character recognition programs themselves, in order to deliver another massive speed increase.
/content/cudazone/CUDABrowser/assets/images/applications/204_wordrecognition_small.png
/content/cudazone/CUDABrowser/assets/images/applications/204_wordrecognition_large.png
Academia
Villanova University
http://www.villanova.edu
2008
01
01
01/01/2008
AJ Palkovic
Paper
AJ Palkovic
f08f28b0-fa04-11dd-87af-0800200c9a66
Parallel, stochastic measurement of molecular surface area
Biochemists often wish to compute surface areas of proteins. A variety of algorithms have been developed for this task, but they are designed for traditional single-processor architectures. The current trend in computer hardware is towards increasingly parallel architectures for which these algorithms are not well suited. We describe a parallel, stochastic algorithm for molecular surface area computation that maps well to the emerging multi-core architectures. Our algorithm is also progressive, providing a rough estimate of surface area immediately and refining this estimate as time goes on. Furthermore, the algorithm generates points on the molecular surface which can be used for point-based rendering. We demonstrate a GPU implementation of our algorithm and show that it compares favorably with several existing molecular surface computation programs, giving fast estimates of the molecular surface area with good accuracy.
/content/cudazone/CUDABrowser/assets/images/applications/203_jmgm_small.png
/content/cudazone/CUDABrowser/assets/images/applications/203_jmgm_large.png
Academia
Department of Computer Science, University of Maryland
http://www.cs.umd.edu/
2008
02
13
03/13/2008
7
Derek Juba
Amitabh Varshney
Paper
Science
Molecular surface, Parallel, Progressive, GPU, Stochastic, Quasi-random, Derek Juba, Amitabh Varshney
e5d3a950-fa04-11dd-87af-0800200c9a66
IV Data Feed Server
Analytical Data Distribution System, Implied Volatility Index (IVX) calculations (IVX(c) is a registered trademark of IVolatility.com), Risk analysis, all in real-time.
/content/cudazone/CUDABrowser/assets/images/applications/201_logo_small.png
/content/cudazone/CUDABrowser/assets/images/applications/201_logo_large.png
Commercial
IVolatility.com
http://www.ivolatility.com
2008
08
03
08/03/2008
20
Commercial
Sergey Fedoseev
Presentation
Finance
Sergey Fedoseev
e0e30350-fa04-11dd-87af-0800200c9a66
Q GPU
Q-GPU (Quantara-GPU) is a high-performance options analytics tool for pricing and risk-managing exotic structures. Q-GPU uses high-performance computing technology based on the NVIDIA CUDA architecture to price a wide range of interest rate structures using state-of-the-art stochastic volatility and multi-factor models.
/content/cudazone/CUDABrowser/assets/images/applications/200_q-gpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/200_q-gpu_large.png
Commercial
Advanced Derivatives Solutions
http://www.aderivatives.com
2008
08
20
08/20/2008
100
Commercial
Skander Handous
Presentation
Finance
montecarlo callable interest rates
daabd390-fa04-11dd-87af-0800200c9a66
Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration
General-purpose computing on graphics processing units (GPGPU) is shown to dramatically increase the speed of Monte Carlo simulations of photon migration. In a standard simulation of time-resolved photon migration in a semi-infinite geometry, the proposed methodology executed on a low-cost graphics processing unit (GPU) is a factor of 1000 faster than simulation performed on a single standard processor. In addition, we address important technical aspects of GPU-based simulations of photon migration. The technique is expected to become a standard method in Monte Carlo simulations of photon migration.
/content/cudazone/CUDABrowser/assets/images/applications/199_parallelcomputing_small.png
/content/cudazone/CUDABrowser/assets/images/applications/199_parallelcomputing_large.png
Academia
Lund University
http://www.lth.se/fysik/english/
2008
07
16
07/16/2008
1000
Erik Alerstam
Tomas Svensson
Stefan Andersson-Engels
Paper
biomedical optics, simulations, scattering, Monte Carlo, Erik Alerstam, Tomas Svensson, Stefan Andersson-Engels
d526dcd0-fa04-11dd-87af-0800200c9a66
Dense Compressed/Hierarchical Linear System Solver
Benchmarks with NVIDIA 260 GTX hardware for the solution of large dense linear systems with compressed/hierarchical structures are discussed. A linear system with 163844 unknowns was generated and solved on the GPU with a 25 times speedup relative to a Quad Core Xeon at 2.66 GHz. Matrix generation shows a 20 times speedup with a peak performance of 70 GFlop/s. The iterative solver and compressed matrix multiplication algorithm produce up to 50 times speedup with a peak performance of 6 GFlop/s = 45 GB/s memory bandwidth.
/content/cudazone/CUDABrowser/assets/images/applications/198_dense_small.png
/content/cudazone/CUDABrowser/assets/images/applications/198_dense_large.png
Commercial
Elegant Mathematics Ltd.
http://www.elegant-mathematics.com/
2009
02
13
02/13/2009
50
Commercial
Ilgis Ibragimov
Application
Computational Fluid Dynamics
Numerics
Libraries
Science
dense h-matrix low rank linear system solver, Ilgis Ibragimov
d02cc0f0-fa04-11dd-87af-0800200c9a66
GPU Accelerated Free Surface Flows Using Smoothed Particle Hydrodynamics
NVIDIA C870 Implementation of a Smoothed Particle Hydrodynamics (SPH) CFD code for wave interactions with fixed and floating bodies
/content/cudazone/CUDABrowser/assets/images/applications/197_shot700_small.png
/content/cudazone/CUDABrowser/assets/images/applications/197_shot700_large.png
Academia
Manchester Metropolitan University
http://www.doc.mmu.ac.uk/cmmfa/
2009
02
12
02/12/2009
23
Professor Derek Causon
Presentation
Computational Fluid Dynamics
Numerics
Computational Fluid Dynamics, Free Surface Flows, SPH, Waves, Professor Derek Causon
f146d3d0-f99f-11dd-87af-0800200c9a66
Fluids: Technology Demo
The NVIDIA fluid particle demo uses PhysX technology accelerated by the CUDA architecture. It was released in October 2008 as part of a GeForce Powerpack. Over 64000 SPH fluid particles pour into the scene and push aside wooden crates, which float up as the fluid level rises. All these particles are simulated in real time using accelerated PhysX; each SPH particle moves in the scene as the result of interactions with other particles, rigid body objects and the surrounding environment. This demo was recorded using a GeForce 9800GTX+.
/content/cudazone/CUDABrowser/assets/images/applications/196_screenshot_fluids_small.png
/content/cudazone/CUDABrowser/assets/images/applications/196_screenshot_fluids_large.png
Commercial
NVIDIA
http://www.nvidia.com/object/nvidia_physx.html
2008
08
12
08/12/2008
10
NVIDIA
Application
Multimedia
Game Physics
NVIDIA
94c0f270-f991-11dd-87af-0800200c9a66
The Great Kulu: Technology Demo Using CUDA
The Great Kulu was one of the NVIDIA GTX280 launch demos, featuring PhysX technology accelerated by the CUDA architecture. The demo is set below deck on a research ship. A large sea creature, "Kulu", features fully physically simulated flesh using PhysX soft body simulation. The creature's skeleton movements are animated, but the flesh movements are the result of simulation. This demo gives a glimpse into the possibilities offered by GPU-accelerated PhysX.
/content/cudazone/CUDABrowser/assets/images/applications/195_screenshot_kulu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/195_screenshot_kulu_large.png
Commercial
NVIDIA
http://www.nvidia.com/object/nvidia_physx.html
2008
08
12
08/12/2008
5
Commercial
NVIDIA
Application
Multimedia
Game Physics
NVIDIA
8fbceb80-f991-11dd-87af-0800200c9a66
Nurien Demo Using CUDA
This is a demo of a social networking game in development by Nurien, featuring NVIDIA PhysX technology accelerated by the CUDA architecture. This demo was released in October 2008 as part of a GeForce Powerpack. This fashion show runway scene is brought to life by the physically simulated skirts and character hair. Note how the skirt moves and flows naturally as the character walks and dances. This was recorded using a GeForce 9800GT.
/content/cudazone/CUDABrowser/assets/images/applications/194_screenshot_nurien_small.png
/content/cudazone/CUDABrowser/assets/images/applications/194_screenshot_nurien_large.png
Commercial
Nurien Software
http://www.nurien.com/service/main/main.nrn
2008
08
12
08/12/2008
5
Commercial
Application
Multimedia
Game Physics
4f4d41b0-f961-11dd-87af-0800200c9a66
UT3 PhysX Mod Pack Using CUDA
The NVIDIA PhysX mod pack features three extra PhysX levels for Epic's Unreal Tournament III, accelerated by the CUDA architecture. These levels are Lighthouse, Tornado and HeatRay. Each level features advanced PhysX effects throughout, with dynamic environmental elements such as dust, hail, rain and wind, destructible buildings, tearing cloth banners, explosions and many more additions. These levels are a must for any UT3 PC gamer.
/content/cudazone/CUDABrowser/assets/images/applications/193_screenshot_ut3_small.png
/content/cudazone/CUDABrowser/assets/images/applications/193_screenshot_ut3_large.png
Commercial
Epic Games
http://www.epicgames.com
2008
08
12
08/12/2008
5
Commercial
NVIDIA
Application
Multimedia
Game Physics
NVIDIA
824e3300-f991-11dd-87af-0800200c9a66
GPU for Surveillance
Our goal is to develop very fast image and video processing algorithms by taking advantage of Graphics Processing Units (GPUs). We have already implemented MERL's state-of-the-art Bayesian background generation and foreground detection method. In comparison to the CPU version of the same algorithm, the GPU implementation is more than 20 times faster.
/content/cudazone/CUDABrowser/assets/images/applications/192_surveillance_small.png
/content/cudazone/CUDABrowser/assets/images/applications/192_surveillance_large.png
Commercial
Mitsubishi Electric Research Laboratories
http://www.merl.com/
2009
01
16
01/16/2009
20
Fatih Porikli
Jay Thornton
Paper
Imaging
Fatih Porikli, Jay Thornton
5fb7fc20-f961-11dd-87af-0800200c9a66
Parallel Fast Multipole Method for Global Illumination on Graphics Hardware
Traditionally, Graphics Processing Units (GPUs) were designed for performing graphics-specific computations. However, with rapid improvements in performance and programmability, GPUs have fostered considerable interest in computations that go beyond computer graphics: general purpose computation on GPUs, or "GPGPU". GPUs may be viewed as data-parallel compute coprocessors that can provide significant improvements in computational performance, especially for algorithms which exhibit a sufficiently high degree of parallelism. One such algorithm is the Fast Multipole Method (FMM).
http://www.cse.iitb.ac.in/~prekshu/dd.html
/content/cudazone/CUDABrowser/assets/images/applications/190_prekshu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/190_prekshu_large.png
Academia
Prekshu Ajmera
http://www.cse.iitb.ac.in/~prekshu/flash.php
2008
06
27
06/27/2008
20
Prekshu Ajmera
Paper
Presentation
Imaging
Prekshu Ajmera
59b4ada0-f961-11dd-87af-0800200c9a66
Extending VForce to Include Support for NVIDIA GPUs using CUDA
VSIPL++ for Reconfigurable Computing (VForce) is a middleware framework that adds support for special purpose processors (SPPs) to VSIPL++ [1], a C++ extension of the Vector, Signal, and Image Processing Library. VSIPL++ defines an object oriented API that provides a collection of commonly used signal processing algorithms and strives to enable performance, portability, and productivity.
/content/cudazone/CUDABrowser/assets/images/applications/189_vforce_small.png
/content/cudazone/CUDABrowser/assets/images/applications/189_vforce_large.png
Academia
Northeastern University
http://www.ece.neu.edu/groups/rcl/projects/vsipl/vsipl.html/
2008
09
24
09/24/2008
Miriam Leeser
Presentation
Paper
Imaging, Signal Processing, Libraries
Signal Processing
Libraries
Miriam Leeser
538a2810-f961-11dd-87af-0800200c9a66
Computing spike-based convolutions on GPUs using CUDA
This project developed a hierarchical spike-based network for object recognition using the dynamic vision sensor silicon retina and NVIDIA CUDA GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/188_CUDAspikeoutDiffSaccade_small.png
/content/cudazone/CUDABrowser/assets/images/applications/188_CUDAspikeoutDiffSaccade_large.png
Telluride Neuromorphic Engineering Workshop
https://neuromorphs.net
2008
07
18
07/18/2008
5
Yingxue Wang
Jayram Moorkanikara Nageswaran
Tobi Delbruck
Paper
Multimedia
Imaging
Yingxue Wang, Jayram Moorkanikara Nageswaran, Tobi Delbruck
497d3830-f961-11dd-87af-0800200c9a66
A GPU Accelerated Speech Recognition System
Graphics Processing Units (GPUs) have become increasingly programmable over the past few years and are able to accomplish more than the specific graphics tasks for which they were designed. Their relatively low price, inherent parallelism, and the fact that their overall performance is increasing faster than for CPUs make them ideal co-processors for certain applications other than graphics. The purpose of this project is to utilize the power of a GPU for a speech recognition application, decreasing the overall processing time while maintaining the same level of recognition performance.
/content/cudazone/CUDABrowser/assets/images/applications/187_speech_recognition_small.png
/content/cudazone/CUDABrowser/assets/images/applications/187_speech_recognition_large.png
Academia
Mississippi State University
http://www.msstate.edu/
2005
05
01
01/05/2005
7
John Johnson
Paper
Video & Audio
John Johnson
2bd2c480-f961-11dd-87af-0800200c9a66
Hierarchical Object Recognition Algorithm
NVIDIA CUDA Implementation of a Hierarchical Object Recognition Algorithm
/content/cudazone/CUDABrowser/assets/images/applications/186_object_recognition_small.png
/content/cudazone/CUDABrowser/assets/images/applications/186_object_recognition_large.png
Academia
2008
11
1
01/11/2008
Sharat Chikkerur
Paper
Graphics
Sharat Chikkerur
24de3830-f961-11dd-87af-0800200c9a66
Arbitrary Dimension Reed-Solomon Coding and Decoding for Extended RAID on GPUs
Reed-Solomon coding is a method of generating arbitrary amounts of checksum information from original data via matrix-vector multiplication in finite fields. Previous work has shown that CPUs are not well-matched to this type of computation, but recent graphical processing units (GPUs) have been shown through a case study to perform this encoding quickly for the 3 + 3 (three data + three parity) case. In order to be utilized in a true RAID-like system, it is important to understand how well this computation can scale in the number of data disks supported. This paper details the performance of a general Reed-Solomon encoding and decoding library that is suitable for use in RAID-like systems. Both generation and recovery are performance-tested and discussed.
/content/cudazone/CUDABrowser/assets/images/applications/185_raid_on_gpu_small.png
/content/cudazone/CUDABrowser/assets/images/applications/185_raid_on_gpu_large.png
Academia
University of Alabama at Birmingham / Sandia National Laboratories
http://www.cis.uab.edu
2008
11
17
11/07/2008
15
Matthew Curry
tony@cis.uab.edu
Lee Ward
Ron Brightwell
Paper
Matthew Curry, tony@cis.uab.edu, Lee Ward, Ron Brightwell
7324e1c0-f95b-11dd-87af-0800200c9a66
Knoppix for CUDA
We are developing a USB/CD/DVD-bootable Linux system named "Knoppix for CUDA". With it, users can quickly start and evaluate many applications in scientific GPU computing without the effort of installing the CUDA software, the GPU device driver, or MPI/OpenMP environments, or of compiling the CUDA sample code.
/content/cudazone/CUDABrowser/assets/images/applications/184_knx4cuda_small.png
/content/cudazone/CUDABrowser/assets/images/applications/184_knx4cuda_large.png
Academia
Nagasaki University
http://progrape.jp/cs/
2008
02
10
02/10/2008
77
Open source
Tsuyoshi Hamada
Application
Code
Multimedia
Computational Fluid Dynamics
Life Sciences
Libraries
Programming Tools
Science
Other
USB bootable and GPU-enabled Linux System, Tsuyoshi Hamada
718b9100-f943-11dd-87af-0800200c9a66
3D Particle Boltzmann Solver
This software package solves the Boltzmann equation with a particle method. It allows one to make mixtures of particles and particle transformations, i.e. chemical reactions at supersonic speed. The CUDA collision part is almost 30 times faster than a tuned, threaded quad-core Xeon version. The free-flow part is even faster, up to 120 times. A "simple" supersonic (Mach 7) problem with 15,000,000 particles can be solved within several minutes entirely in the GPU memory of a GTX 2xx series card (the 9xxx series is also supported).
/content/cudazone/CUDABrowser/assets/images/applications/182_ss_small.png
/content/cudazone/CUDABrowser/assets/images/applications/182_ss_large.png
Commercial
Elegant Mathematics Ltd.
http://www.elegant-mathematics.com
2009
02
10
02/10/2009
120
Commercial
Ilgis Ibragimov
Application
Computational Fluid Dynamics
Dynamics
Numerics
Life Sciences
Libraries
Science
Boltzmann Particle CFD, Ilgis Ibragimov
f760db20-f88d-11dd-87af-0800200c9a66
Creation of parallel dotplots for a suite of protein sequences
This application generates pairwise dotplots for a huge number of protein sequences in a database. Creating the dotplots sequentially for a whole database of sequences takes a huge amount of time; with CUDA, the same task can be done in much less time.
/content/cudazone/CUDABrowser/assets/images/applications/181_multidotplot_small.png
/content/cudazone/CUDABrowser/assets/images/applications/181_multidotplot_large.png
Commercial
New England Biolabs
2009
01
17
01/17/2009
Chandra Sekhar Pedamallu
Application
Code
Life Sciences
Protein Sequences, Dot plots, Sequence similarity, Chandra Sekhar Pedamallu
7f2db890-ef5e-11dd-ba2f-0800200c9a66
Fast Support Vector Machine Training and Classification on Graphics Processors
Recent developments in programmable, highly parallel Graphics Processing Units (GPUs) have enabled high performance implementations of machine learning algorithms. We describe a solver for Support Vector Machine training, using Platt's Sequential Minimal Optimization algorithm, which achieves speedups of 5-32x over LibSVM running on a high-end traditional processor. We also present a system for SVM classification which achieves speedups of 120-150x over LibSVM.
/content/cudazone/CUDABrowser/assets/images/applications/180_vector_machine_training_small.png
/content/cudazone/CUDABrowser/assets/images/applications/180_vector_machine_training_large.png
Academia
Electrical Engineering and Computer Sciences University of California at Berkeley
http://www.eecs.berkeley.edu/
2008
02
08
02/08/2008
150
Bryan Catanzaro
Narayanan Sundaram
Kurt Keutzer
Paper
Numerics
Support Vector Machines, Sequential Minimal Optimization, Graphics Processing Units, Bryan Catanzaro, Narayanan Sundaram, Kurt Keutzer
275f4a70-e91e-11dd-ba2f-0800200c9a66
GPGPU for Accelerated GRAPPA Autocalibration in Magnetic Resonance Imaging
The first part of this thesis provided an overview of MRI and explained how the acquisition time can be reduced by parallel imaging techniques such as GRAPPA. GRAPPA in a nutshell: undersample k-space and reconstruct missing information by fitting acquired data to k-space gaps. The reconstruction of missing data is a computationally intensive task. The second part described the massively parallel and specialized architecture of graphics hardware and how to effectively harness its computational power to accelerate general-purpose computations. The final part presented different CUDA kernels for complex-valued matrix multiplication on the GPU and explained various optimization techniques that were applied step by step, yielding speedups of 12 to 18 for special cases compared to the highly optimized Intel MKL.
/content/cudazone/CUDABrowser/assets/images/applications/178_grappa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/178_grappa_large.png
2008
04
01
04/01/2008
18
Matthias Schneider
Paper
Imaging
Matthias Schneider
136a90d0-e91c-11dd-ba2f-0800200c9a66
Clinical Evaluation of GPU-Based Cone Beam Computed Tomography
The use of cone beam computed tomography (CBCT) is growing in the clinical arena due to its ability to provide 3-D information during interventions, its high diagnostic quality (sub-millimeter resolution), and its short scanning times (60 seconds). In many situations, the short scanning time of CBCT is followed by a time-consuming 3-D reconstruction. The standard reconstruction algorithm for CBCT data is the filtered backprojection, which for a volume of size 256^3 takes up to 25 minutes on a standard system. Recent developments in the area of Graphics Processing Units (GPUs) make it possible to have access to high performance computing solutions at a low cost, allowing for use in applications to many scientific problems. We have implemented an algorithm for 3-D reconstruction of CBCT data using the Compute Unified Device Architecture (CUDA) provided by NVIDIA (NVIDIA Corp., Santa Clara, California), which was executed on an NVIDIA GeForce 8800GT. Our implementation reduces reconstruction times from the order of minutes, and perhaps hours, to a matter of seconds, while also giving the clinician the ability to view 3-D volumetric data at higher resolutions. We evaluated our implementation on ten clinical data sets and one phantom data set to observe differences that can occur between CPU and GPU based reconstructions. By using our approach, the computation time for 256^3 is reduced from 25 minutes on the CPU to 4.8 seconds on the GPU. The GPU reconstruction time for 512^3 is 11.3 seconds, and for 1024^3 it is 61.4 seconds.
/content/cudazone/CUDABrowser/assets/images/applications/177_conebeam_small.png
/content/cudazone/CUDABrowser/assets/images/applications/177_conebeam_large.png
Academia
Computer Science and Toshiba Stroke Research Center of The State University of New York at Buffalo
http://www.cse.buffalo.edu
2008
12
31
12/31/2008
300
Peter B. Noel
Alan M. Walczak
Kenneth R. Hoffmann
Jinhui Xu
Jason J. Corso
Sebastian Schafer
Paper
Medical Imaging
Peter B. Noel, Alan M. Walczak, Kenneth R. Hoffmann, Jinhui Xu, Jason J. Corso, Sebastian Schafer
863de380-e919-11dd-ba2f-0800200c9a66
Fast Deformable Registration on the GPU: A CUDA Implementation of Demons
In the medical imaging field, we need fast deformable registration methods especially in intra-operative settings characterized by their time-critical applications. Image registration studies which are based on Graphics Processing Units (GPUs) provide fast implementations. However, only a small number of these GPU-based studies concentrate on deformable registration. We implemented Demons, a widely used deformable image registration algorithm, on NVIDIA's Quadro FX 5600 GPU with the Compute Unified Device Architecture (CUDA) programming environment. Using our code, we registered 3D CT lung images of patients. Our results show that we achieved the fastest runtime among the available GPU-based Demons implementations. Additionally, regardless of the given dataset size, we provided a factor of 55 speedup over an optimized CPU-based implementation. Hence, this study addresses the need for on-line deformable registration methods in intra-operative settings by providing the fastest and most scalable Demons implementation available to date. In addition, it provides an implementation of a deformable registration algorithm on a GPU, an understudied type of registration in the general-purpose computation on graphics processors (GPGPU) community.
/content/cudazone/CUDABrowser/assets/images/applications/176_brain_small.png
/content/cudazone/CUDABrowser/assets/images/applications/176_brain_large.png
Academia
University of California / University of Florida
2008
06
01
06/01/2008
55
Pinar Muyan-Ozcelik
John D. Owens
Junyi Xia
Sanjiv S. Samant
Paper
Life Sciences
Science
Pinar Muyan-Ozcelik, John D. Owens, Junyi Xia, Sanjiv S. Samant
e51d78e0-e912-11dd-ba2f-0800200c9a66
Accelerated Image Registration With CUDA
In image registration, one of the images is referred to as the reference or source and the second image is referred to as the target or sensed. Image registration involves spatially transforming the target image to align with the reference image. A broad category of transformation models includes linear transformations, which include translation, rotation, scaling, and affine. Affine registration was performed between two 3D anatomical volume data sets (size 240x256x176 voxels). The registration algorithm seeks to find an affine transformation that maps a "source" volume onto a "target" volume so as to minimize a cost function calculated between the two. In practice the transformed source voxels straddle target voxel positions, so that interpolation of target voxel values is required.
/content/cudazone/CUDABrowser/assets/images/applications/175_brain_small.png
/content/cudazone/CUDABrowser/assets/images/applications/175_brain_large.png
Academia
BSS Group, Cavendish Laboratory, University of Cambridge UK
http://www.bss.phy.cam.ac.uk/
2008
08
01
08/01/2008
100
Richard Ansorge
Paper
Life Sciences
Science
Richard Ansorge
bcfa89a0-e90f-11dd-ba2f-0800200c9a66
A Note on Auto-tuning GEMM for GPUs
The development of high performance dense linear algebra (DLA) critically depends on highly optimized BLAS, and especially on the matrix multiplication routine (GEMM). This is especially true for Graphics Processing Units (GPUs), as evidenced by recently published results on DLA for GPUs that rely on highly optimized GEMM [13, 11]. However, the current best GEMM performance, e.g. of up to 375 GFlop/s in single precision and of up to 75 GFlop/s in double precision arithmetic on NVIDIA's GTX 280, is difficult to achieve. The development involves extensive GPU knowledge and even backward engineering to understand some undocumented insides about the architecture that have been of key importance in the development [12]. In this paper, we describe some GPU GEMM auto-tuning optimization techniques that allow us to keep up with changing hardware by rapidly reusing, rather than reinventing, the existing ideas. Auto-tuning, as we show in this paper, is a very practical solution where in addition to getting an easy portability, we can often get substantial speedups even on current GPUs (e.g. up to 27% in certain cases for both single and double precision GEMMs on the GTX 280). Keywords: Auto-tuning, matrix multiply, dense linear algebra, GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/174_anagg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/174_anagg_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://www.cs.utk.edu/~tomov
2009
01
12
01/12/2009
Yinan Li
Jack Dongarra
Stanimire Tomov
Paper
Numerics
Auto-tuning, matrix multiply, dense linear algebra, GPUs, Yinan Li, Jack Dongarra, Stanimire Tomov
33852d10-e90f-11dd-ba2f-0800200c9a66
Enhancing the Performance of Dense Linear Algebra Solvers on GPUs
The MAGMA project, headed by the linear algebra research groups at University of Tennessee, UC Berkeley, and UC Denver, aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current 'Multicore+GPU' systems. This transition cannot be done automatically, as in many cases new algorithms that significantly differ from algorithms for conventional architectures will be needed. Preliminary studies on a new class of 'heterogeneity-aware' algorithms of 'reduced communication' and 'high-parallelism', as shown in this poster, confirm that this is the case.
/content/cudazone/CUDABrowser/assets/images/applications/173_magma_small.png
/content/cudazone/CUDABrowser/assets/images/applications/173_magma_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://www.cs.utk.edu/~tomov
2008
11
08
11/08/2008
2
M. Baboulin
J. Demmel
J. Dongarra
S. Tomov
V. Volkov
Paper
Application
Numerics
M. Baboulin, J. Demmel, J. Dongarra, S. Tomov, V. Volkov
95015ef0-e90b-11dd-ba2f-0800200c9a66
Towards Dense Linear Algebra for Hybrid GPU Accelerated Manycore Systems
If multicore is a disruptive technology, try to imagine hybrid multicore systems enhanced with accelerators! This is happening today as accelerators, in particular Graphics Processing Units (GPUs), are steadily making their way into the high performance computing (HPC) world. We highlight the trends leading to the idea of hybrid manycore/GPU systems, and we present a set of techniques that can be used to efficiently program them. The presentation is in the context of Dense Linear Algebra (DLA), a major building block for many scientific computing applications. We motivate the need for new algorithms that would split the computation in a way that would fully exploit the power that each of the hybrid components offers. As the area of hybrid multicore/GPU computing is still in its infancy, we also argue for its importance in view of what future architectures may look like. We therefore envision the need for a DLA library similar to LAPACK but for hybrid manycore/GPU systems. We illustrate the main ideas with an LU-factorization algorithm where particular techniques are used to reduce the amount of pivoting, resulting in an algorithm achieving up to 388 GFlop/s for single and up to 99.4 GFlop/s for double precision factorization on a hybrid Intel Xeon (2x4 cores @ 2.33 GHz) - NVIDIA GeForce GTX 280 (240 cores @ 1.30 GHz) system. Keywords: hybrid computing, dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units.
/content/cudazone/CUDABrowser/assets/images/applications/172_tdla_small.png
/content/cudazone/CUDABrowser/assets/images/applications/172_tdla_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://www.cs.utk.edu/~tomov
2008
10
01
10/01/2008
5
S. Tomov
J. Dongarra
M. Baboulin
Paper
Numerics
hybrid computing, dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units, S. Tomov, J. Dongarra, M. Baboulin
0d29c620-e906-11dd-ba2f-0800200c9a66
Exploring New Architectures in Accelerating CFD for Air Force Applications
Computational Fluid Dynamics (CFD) is an active field of research where the development of faster and more accurate methods is linked to the continuous demand for ever higher computational power. And indeed, for at least two decades, high-performance computing (HPC) programmers have taken for granted that each successive generation of microprocessors would, either immediately or after minor adjustments, make their software run substantially faster. But recent microprocessor design trends including the introduction of multi/many-core designs and the increasingly popular use in HPC of accelerators such as General Purpose Graphics Processing Units (GPGPU) and Field Programmable Gate Arrays (FPGAs), present an unprecedented challenge, namely how to update and enhance the existing large CFD software infrastructure to efficiently use these new architectures. In this paper we address some main issues in this transition and present ideas on using the new architectures to accelerate CFD applications that are of interest to the Air Force. We consider not only multi/many-core but also special purpose (e.g. GPUs) and reconfigurable computing (e.g. FPGAs) architectures. Moreover, we demonstrate benefits of using hybrid combinations where the strengths of each platform can be used to better map algorithm requirements and underlying architecture.
/content/cudazone/CUDABrowser/assets/images/applications/171_enaa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/171_enaa_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://www.cs.utk.edu/~tomov
2008
10
31
10/31/2008
3
J. Dongarra
S. Moore
G. Peterson
S. Tomov
J. Allred
V. Natoli
D. Richie
Paper
Numerics
J. Dongarra, S. Moore, G. Peterson, S. Tomov
0d6fb020-e903-11dd-ba2f-0800200c9a66
Some Issues in Dense Linear Algebra for Multicore and Special Purpose Architectures
We address some key issues in designing dense linear algebra (DLA) algorithms that are common for both multi/many-cores and special purpose architectures (in particular GPUs). We present them in the context of an LU factorization algorithm, where randomization techniques are used as an alternative to pivoting. This approach yields an algorithm based entirely on a collection of small Level 3 BLAS type computational tasks, which has emerged as a common goal in designing DLA algorithms for new architectures. Other common trends, also considered here, are block asynchronous task execution and "Block" layouts for the data associated with the separate tasks. We present numerical results and other specific experiments with DLA algorithms on NVIDIA GPUs using CUDA. The GPU results are also of interest in themselves, as we show a performance of up to 160 GFlop/s on a single Quadro FX 5600 card. Keywords: dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units.
/content/cudazone/CUDABrowser/assets/images/applications/170_sidlamspa_small.png
/content/cudazone/CUDABrowser/assets/images/applications/170_sidlamspa_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://www.cs.utk.edu/~tomov
2009
01
13
01/13/2009
2
M. Baboulin
J. Dongarra
Stanimire Tomov
Paper
Numerics
dense linear algebra, parallel algorithms, LU factorization, multicore processors, graphics processing units, M. Baboulin, J. Dongarra, Stanimire Tomov
6dc8e4b0-e821-11dd-ba2f-0800200c9a66
MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES
The MAGMA project, led by the linear algebra research groups at University of Tennessee, UC Berkeley, and UC Denver, aims to develop a linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current "Multicore+GPU" systems. This transition cannot be done automatically, as in many cases new algorithms that significantly differ from algorithms for conventional architectures will be needed. Preliminary studies on a new class of "heterogeneity-aware" algorithms of "reduced communication" and "high-parallelism" confirm that this is the case.
/content/cudazone/CUDABrowser/assets/images/applications/169_magma_small.png
/content/cudazone/CUDABrowser/assets/images/applications/169_magma_large.png
Academia
Innovative Computing Laboratory Computer Science Department, University of Tennessee
http://icl.cs.utk.edu/magma
2008
11
08
11/08/2008
2
Stanimire Tomov
Paper
Numerics
Stanimire Tomov
c74547b0-e810-11dd-ba2f-0800200c9a66
High-Performance Stream Computing for Particle Beam Transport Simulations
Understanding modern particle accelerators requires simulating charged particle transport through the machine elements. These simulations can be very time consuming due to the large number of particles and the need to consider many turns of a circular machine. Stream computing offers an attractive way to dramatically improve the performance of such simulations by calculating the simultaneous transport of many particles using dedicated hardware. Modern Graphics Processing Units (GPUs) are powerful and affordable stream computing devices. The results of simulations of particle transport through the booster-to-storage-ring transfer line of the DIAMOND synchrotron light source using an NVidia GeForce 7900 GPU are compared to the standard transport code MAD. It is found that particle transport calculations are suitable for stream processing and large performance increases are possible. The accuracy and potential speed gains are compared and the prospects for future work in the area are discussed.
/content/cudazone/CUDABrowser/assets/images/applications/168_hpscpbts_small.png
/content/cudazone/CUDABrowser/assets/images/applications/168_hpscpbts_large.png
N/A
Academia
Particle Physics Group at the University of Manchester
http://www.hep.man.ac.uk/whoswho/HEP_members.html
2007
09
02
09/02/2007
5
R. Appleby
D. Bailey
M. Salt
Application
Paper
R. Appleby, D. Bailey, M. Salt
073721d0-e1a5-11dd-ad8b-0800200c9a66
YARP CUDA Driver
YARP is a widely used open source robotics platform that runs on many advanced humanoid robots (e.g. MIT Cog, Kismet, RobotCub iCub, ...). It is divided into two main branches: IPC for programs distributed across a local network, and a driver system that standardizes the software interface to hardware (e.g. a "grabber" interface allows video acquisition from many types of devices, letting developers use the same code on different platforms). This project provides YARP with a CUDA-based driver that executes user-written kernels on NVIDIA GPUs, easing the integration of GPGPU and robotics.
/content/cudazone/CUDABrowser/assets/images/applications/167_small_yarp.png
/content/cudazone/CUDABrowser/assets/images/applications/167_large_yarp.png
N/A
N/A
2007
11
24
11/24/2007
Open source
Giacomo Spigler (YARP authors are Paul Fitzpatrick, Giorgio Metta, Lorenzo Natale and Alessandro Scalzo)
Code
Numerics
Science
Signal Processing
CUDA, YARP, YARP device, robotics, robotcub, icub, robot, Giacomo Spigler
317cf050-e114-11dd-ad8b-0800200c9a66
National Instruments LabVIEW HIL application in control of extremely large telescope
A prototype that pairs NI LabVIEW with NVIDIA's CUDA technology has been thoroughly benchmarked, with impressive computational results.
/content/cudazone/CUDABrowser/assets/images/applications/166_small_E-ELT_telescope_10494_p.png
/content/cudazone/CUDABrowser/assets/images/applications/166_large_E-ELT_telescope_10494_p.png
National Instruments
http://www.ni.com
2008
12
02
12/02/2008
Shawn McCaslin
Paper
Multimedia
Science
Hardware-in-the-loop, HIL, high performance computing, extremely large telescope, Shawn McCaslin
29168480-e114-11dd-ad8b-0800200c9a66
Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors
Lattice Boltzmann Methods (LBM) are used for the computational simulation of Newtonian fluid dynamics. In general, LBM-based simulations are relatively straightforward to parallelize, and this technique has been applied numerous times on general-purpose processors, field-programmable gate arrays (FPGAs), and graphics processing units (GPUs). Of the three methods, the GPU implementations achieved the highest simulation performance per chip. With memory bandwidth of up to 141 GB/s and a theoretical maximum floating point performance of over 600 GFLOPS, CUDA-ready GPUs from NVIDIA provide an attractive platform for a wide range of scientific simulations, including LBM. Using the D3Q19 model, this paper improves upon prior GPU LBM results by increasing GPU multiprocessor occupancy, resulting in an increase in maximum performance by 20%, and by introducing a space-efficient storage method which reduces GPU RAM requirements by 50% at a slight detriment to performance. Both GPU versions are over 28 times faster than a quad-core CPU version using OpenMP.
/content/cudazone/CUDABrowser/assets/images/applications/165_small_flow_in_porous_media.jpg
/content/cudazone/CUDABrowser/assets/images/applications/165_large_flow_in_porous_media.jpg
Academia
University of Minnesota
http://www.umn.edu
2008
12
19
12/19/2008
28
Peter Bailey
Paper
Computational Fluid Dynamics
"Lattice Boltzmann" LBM cfd d3q19, Peter Bailey
1fe0e1d0-e114-11dd-ad8b-0800200c9a66
Ray tracing with CUDA
An interactive real-time ray tracing system, developed as part of a diploma thesis. Amongst others, it features three different traversal strategies for a SAH-based BVH, two of them leveraging specific aspects of the novelties NVIDIA CUDA introduces to GPU computing, one resembling a more traditional approach. It is shown that the new features offered by CUDA provide for substantial improvement in hierarchy traversal over previous solutions and that GPU-based ray tracing in general can match or even better the performance of CPU-based systems.
/content/cudazone/CUDABrowser/assets/images/applications/164_small_RTCUDA_cover.png
/content/cudazone/CUDABrowser/assets/images/applications/164_large_RTCUDA_cover.png
Academia
University of Koblenz-Landau, Koblenz Campus
http://www.uni-koblenz.de
2008
09
30
09/30/2008
Hanno Rabe
Presentation
Graphics
ray tracing, Hanno Rabe
136059e0-e114-11dd-ad8b-0800200c9a66
GPU Particle Tracking and Multi-Fluid Simulations with Greatly Enhanced Computational Speed
This is a poster presented at the American Geophysical Union Meeting in December 2008.
/content/cudazone/CUDABrowser/assets/images/applications/163_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/163_large.jpg
Academia
University of Washington
2008
12
16
12/16/2008
50
Open source
Michele Cash
Presentation
Computational Fluid Dynamics
Science
Michele Cash
494de4a0-e114-11dd-ad8b-0800200c9a66
CoreAVC Professional
CoreAVC is recognized as the world's fastest H.264 software video decoder. Now featuring NVIDIA CUDA support!
/content/cudazone/CUDABrowser/assets/images/applications/162_bg_small.png
/content/cudazone/CUDABrowser/assets/images/applications/162_bg_large.png
Commercial
CoreCodec, Inc.
http://www.corecodec.com
2008
12
19
12/19/2008
Commercial
Dan Marlin
Application
Graphics
h.264, h264, avc, coreavc, corecodec, video, decoder, Dan Marlin
208b40f0-ccef-11dd-ad8b-0800200c9a66
Gnort: High Performance Network Intrusion Detection Using Graphics Processors
We present an intrusion detection system based on the Snort open-source NIDS. Our results suggest that modern graphics cards can greatly speed up intrusion detection systems.
/content/cudazone/CUDABrowser/assets/images/applications/161_gnort_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/161_gnort_Large.png
Research
ICS-FORTH
http://www.ics.forth.gr
2008
06
06
06/06/2008
Giorgos Vasiliadis
Paper
Other
Network Intrusion Detection
pattern matching, intrusion detection systems, network security, Giorgos Vasiliadis
eb232f60-cc29-11dd-ad8b-0800200c9a66
Parallel algorithms for approximation of distance maps on parametric surfaces
We present an efficient O(n) numerical algorithm for first-order approximation of geodesic distances on geometry images, where n is the number of points on the surface. The structure of our algorithm allows efficient implementation on parallel architectures.
/content/cudazone/CUDABrowser/assets/images/applications/160_PMM_head_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/160_PMM_head_Large.png
Academia
Technion
http://www.cs.technion.ac.il/~weber/Publications/
2008
10
1
10/01/2008
150
Ofir Weber
Application
Multimedia
Paper
Graphics
Numerics
Geodesic distance, shortest path, Ofir Weber
90c91cc0-cc1d-11dd-ad8b-0800200c9a66
Multigrid on GPU: Tackling Power Grid Analysis on Parallel SIMT Platforms
The challenging task of analyzing on-chip power (ground) distribution networks with multi-million node complexity and beyond is key to today's large chip designs. For the first time, we show how to exploit recent massively parallel single-instruction multiple-thread (SIMT) based graphics processing unit (GPU) platforms to tackle power grid analysis with promising performance.
/content/cudazone/CUDABrowser/assets/images/applications/159_HybridMultigrid_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/159_HybridMultigrid_Large.png
Texas A&M University
http://www.tamu.edu
2008
11
12
11/12/2008
Zhuo Feng and Peng Li
Paper
Electronic Design Automation
Zhuo Feng and Peng Li
44cf2020-cc14-11dd-ad8b-0800200c9a66
Voxel-based real-time ray tracing
A real-time ray tracer that uses voxels (volumetric pixels) and is fully implemented on the GPU using the NVIDIA CUDA architecture. This ray tracer is independent of scene complexity, making it possible to create models as complex as desired.
/content/cudazone/CUDABrowser/assets/images/applications/158_voxeldemo1_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/158_voxeldemo1_Large.png
Academia
ijsf
2008
07
12
07/12/2008
Cecill Etheredge
Application
Paper
Graphics
Imaging
voxel ray tracing real time volumetric raytracing, Cecill Etheredge
fa51e3b6-a0f8-4d3c-af4b-a19144509f43
Multi-GPU Incompressible Navier-Stokes Solver
We describe the implementation of an incompressible Navier-Stokes solver code with Cartesian geometry capability on desktop supercomputers with multiple GPUs. Specifically, we adopt the NVIDIA CUDA programming model to implement the discretized form of the governing equations on desktop supercomputers with up to four GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/157_lidDrivenCavity_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/157_lidDrivenCavity_Large.png
Academia
Boise State University
http://www.boisestate.edu
2008
12
10
10/12/2008
100
Thibault / Senocak
Presentation
Multimedia
Computational Fluid Dynamics
Numerics
CFD, multi-GPU, Navier-Stokes, Thibault, Senocak
d3447ef0-c132-11dd-ad8b-0800200c9a66
Realtime Conversation Scene Analysis
This is a CUDA-based realtime system for analyzing group meetings by combining face pose tracking and speaker diarization, based on audio-visual signals from an omnidirectional camera-microphone. It aims to estimate "who is talking to whom".
/content/cudazone/CUDABrowser/assets/images/applications/148_3d_Crop_TopviewCrop_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/148_3d_Crop_TopviewCrop_large.jpg
Research
NTT Communication Science Laboratories
2008
10
20
20/10/2008
Kazuhiro Otsuka
Application
Paper
Presentation
Imaging
Video & Audio
Kazuhiro Otsuka
c0d91d10-c133-11dd-ad8b-0800200c9a66
Matrix Algebra on GPU and Multicore Architectures
The MAGMA project, headed by the linear algebra research groups at the University of Tennessee and the University of California, Berkeley, aims to develop a library similar to LAPACK but for Multicore+GPU systems.
/content/cudazone/CUDABrowser/assets/images/applications/149_logoMAGMA_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/149_logoMAGMA_large.jpg
Academia
University of Tennessee
2008
11
2
2/11/2008
S. Tomov, J. Dongarra, M. Baboulin
Application
Paper
Numerics
S. Tomov, J. Dongarra, M. Baboulin
d5849900-c134-11dd-ad8b-0800200c9a66
Flame Fractals
Flame fractals are a generalization of IFS fractals, allowing a range of non-linear transformations in addition to linear ones. A good example is the Electric Sheep screensaver, which uses a distributed rendering system to render animations.
/content/cudazone/CUDABrowser/assets/images/applications/150_12secondswithfinalxform_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/150_12secondswithfinalxform_large.jpg
Research
2008
11
4
4/11/2008
50
Open source
Steven Brodhead
Application
Code
Digital Content Creation
Graphics
Flam3, Flame, Fractal, Steven Brodhead
c418fe70-c136-11dd-ad8b-0800200c9a66
Fast Iterative Linear System Solvers
CG, BiCGStab, GMRES/FGMRES/NGMRES, Lanczos, Arnoldi and Davidson solvers, together with tensor preconditioners.
/content/cudazone/CUDABrowser/assets/images/applications/151_GPU-Iter_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/151_GPU-Iter_large.jpg
Commercial
Elegant Mathematics Ltd.
2008
11
5
5/11/2008
50
Commercial
Ibragimov
Application
Code
Numerics
Libraries
Programming Tools
Science
Ibragimov
70a1d900-c137-11dd-ad8b-0800200c9a66
TMPGEnc 4.0 Xpress
Multi-codec transcoder with CUDA-accelerated filters.
/content/cudazone/CUDABrowser/assets/images/applications/152_te4xp_box_med_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/152_te4xp_box_med_large.jpg
Commercial
Pegasys Inc.
http://www.pegasys-inc.com/en/index.html
2008
10
31
31/10/2008
4.46
Commercial
Kaname Saito
Application
Video & Audio
transcode, encode, transcoder, encoder, TMPGEnc, Pegasys, Kaname Saito
07c34530-c138-11dd-ad8b-0800200c9a66
Exploiting the capabilities of modern GPUs for dense matrix computations
We present several algorithms to compute the solution of a linear system of equations on the GPU, as well as general techniques to improve the performance, such as padding and hybrid CPU-GPU computation. We compare single and double precision on a GTX280
/content/cudazone/CUDABrowser/assets/images/applications/153_tr_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/153_tr_large.jpg
Academia
University Jaume I, Castellon
2008
10
10
10/10/2008
Francisco Igual
Application
Paper
Numerics
Francisco Igual
ba6e8f00-c138-11dd-ad8b-0800200c9a66
Jacket's Graphics Toolbox for MATLAB
The Graphics Toolbox extends Jacket for MATLAB to seamlessly integrate computation with visualization, making difficult-to-program, multi-threaded, real-time graphical displays effortless to achieve.
/content/cudazone/CUDABrowser/assets/images/applications/154_gfx_interactive_ocean_example_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/154_gfx_interactive_ocean_example_large.jpg
Commercial
AccelerEyes
http://www.accelereyes.com
2008
11
1
1/11/2008
100
Gallagher Pryor
Application
Multimedia
Computational Fluid Dynamics
Finance
Graphics
Imaging
Numerics
Libraries
Oil & Gas
Programming Tools
Science
Signal Processing
Video & Audio
Gallagher Pryor
66a26bc0-c139-11dd-ad8b-0800200c9a66
Neurocuda - CUDA-accelerated neurodynamics
Software for neurodynamic simulations, consisting of a C++ class library and applications. The neural networks are biologically plausible competitive nets and associative memories, with an intended direction towards vision processing.
/content/cudazone/CUDABrowser/assets/images/applications/155_neurocuda1_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/155_neurocuda1_large.jpg
Research
2008
11
12
12/11/2008
Open source
Fredrik Farnstrom
Application
Code
Science
Other
Neural, network, neurodynamics, artificial, intelligence, AI, associative, memory, competitive, vision, computational, neuroscience, Fredrik Farnstrom
ee77fb00-c139-11dd-ad8b-0800200c9a66
GPU-Quicksort
GPU-Quicksort is a Quicksort-based algorithm designed for sorting integers and floats on graphics processors. Experiments show that it can outperform highly optimized CPU-based Quicksort by a factor of 10 on high-end graphics processors.
/content/cudazone/CUDABrowser/assets/images/applications/156_gpuqsort_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/156_gpuqsort_large.jpg
Academia
Distributed Computing and Systems - Chalmers University of Technology
http://www.cs.chalmers.se/~dcs
2008
11
10
10/11/2008
10
Open source
Daniel Cederman
Philippas Tsigas
Application
Code
Paper
Presentation
Libraries
sorting, Daniel Cederman, Philippas Tsigas
7E3FAFC6-B7D0-11DD-A2A0-A58455D89593
Badaboom Media Converter
Elemental Technologies' Badaboom Media Converter takes a fundamentally different approach to video format conversion from other solutions. Instead of performing format conversion on the CPU, it harnesses massively parallel GPUs from NVIDIA. By using the power of the GPU, the time required for video conversion is reduced.
/content/cudazone/CUDABrowser/assets/images/applications/147_badaboom_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/147_badaboom_large.jpg
Commercial
Elemental Technologies
http://www.elementaltechnologies.com/
2008
1
10
10/1/2008
GeForce 8 and higher
20
Commercial
Elemental Technologies
Application
Video & Audio
Encoding, transcoding, audio, video, BadaBOOM, Elemental Technologies
b830c9b0-a5bd-11dd-ad8b-0800200c9a66
Multiresolution Gradient Adaptive Filter
This GIMP plugin implements a multiresolution gradient adaptive filter with a bilateral filter kernel. The filter operates on grayscale images and removes noise while preserving edges.
/content/cudazone/CUDABrowser/assets/images/applications/146_Multiresolution_Gradient_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/146_Multiresolution_Gradient_large.jpg
Academia
University of Erlangen-Nuremberg
http://www.fau.de/
2008
10
09
10/09/2008
30
Open Source
Membarth
Application
Code
Graphics
Imaging
Membarth
310eabf0-a5bd-11dd-ad8b-0800200c9a66
CUDA vs Wizard
CUDA Microsoft Visual Studio Wizard
/content/cudazone/CUDABrowser/assets/images/applications/145_CUDA_vs_Wizard_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/145_CUDA_vs_Wizard_large.jpg
HKBU
2008
04
18
04/18/2008
Zhao, Kaiyong
Application
Programming Tools
Zhao, Kaiyong
360ecd30-a5bb-11dd-ad8b-0800200c9a66
LISSOM
LISSOM is a model of the human neocortex (mainly the visual cortex) at the neural column level. The model was developed by Bednar, Choe, Miikkulainen and Sirosh at the University of Texas and has been ported to GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/144_LISSOM_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/144_LISSOM_large.jpg
Research
2008
10
14
10/14/2008
9
Open Source
Spigler
Application
Code
Presentation
Life Sciences
Science
Video & Audio
lissom visual cortex neural network som v1, Spiglerg, Spigler
15bde710-a5ba-11dd-ad8b-0800200c9a66
Fast Computed Tomography
North Star Imaging's new proprietary efX CT software utilizes GPU reconstruction speed. The software was developed with a CUDA interface and reconstruction speeds have increased up to 50x as compared to other CT software.
/content/cudazone/CUDABrowser/assets/images/applications/143_Fast_Computed_Tomography_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/143_Fast_Computed_Tomography_large.jpg
Commercial
North Star Imaging, Inc.
http://www.4nsi.com/
2008
10
16
10/16/2008
50
Damhof
Noel
Application
Paper
Digital Content Creation
Imaging
industrial nondestructive testing, computed tomography, ct scan, ct reconstruction, ct software, CUDA, GPU reconstruction, North Star Imaging, NSI, Damhof, Noel
0329d970-a5b9-11dd-ad8b-0800200c9a66
Flowball
Flowball is an interactive game using dense optical flow computed in real time on a GeForce GTX 280. We provide a video and a free Win32 optical flow library.
/content/cudazone/CUDABrowser/assets/images/applications/142_Flowball_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/142_Flowball_large.jpg
Academia
Graz University of Technology, Institute for Computer Graphics and Vision
http://www.icg.tugraz.at/
2008
10
09
10/09/2008
Open Source
Santner
Application
Paper
Multimedia
Imaging
Numerics
Science
Signal Processing
Video & Audio
Realtime Dense Optical Flow Interactive Game, Santner
ad3b3da0-a5b4-11dd-ad8b-0800200c9a66
Fast Blood Flow Visualization of High-resolution Laser Speckle Imaging Data
This paper introduces GPUs into the data processing framework of laser speckle contrast imaging, to achieve fast and high-resolution blood flow visualization on PCs by exploiting the high floating-point processing power of GPUs. Using a GPU yields a 12- to 60-fold performance improvement over the optimized CPU implementations.
/content/cudazone/CUDABrowser/assets/images/applications/141_Fast_Blood_Flow_Visualization_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/141_Fast_Blood_Flow_Visualization_large.jpg
Academia
Wuhan National Laboratory for Optoelectronics
http://wnlo.hust.edu.cn/english/
2008
08
29
08/29/2008
60
Li, Liu, and Luo
Paper
Science
Video & Audio
Blood flow, visualization, video, Li, Liu, Luo
f2667150-a5b0-11dd-ad8b-0800200c9a66
Powerful Real-time Electrodynamics
Aeth.drive is a fast, parallel, versatile grid-based EM modelling framework including support for relativity, turbulent/quantum effects, and isothermal participating media. Runs in real-time and includes demo.
/content/cudazone/CUDABrowser/assets/images/applications/140_Powerful_Real-time_Electrodynamics_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/140_Powerful_Real-time_Electrodynamics_large.jpg
Research
2008
10
20
10/20/2008
Open Source
Daley
Application
Code
Science
Alpha, computational, computer graphics, computer simulation, CUDA, electrodynamics, EM, free software, gpgpu, GPU, isothermal, mapping, MIT license, modelling, NVIDIA, open source, parallel, photon, physics, plasma, radiosity, release, science, scientific computing, simulation, thermodynamics, turbulence, daley
db3a3d80-99f4-11dd-ad8b-0800200c9a66
GPU Based Image Segmentation Livewire Algorithm Implementation
This thesis presents a GPU implementation of the Livewire algorithm. It is divided into 3 phases: Sobel or Laplacian filter convolution, image modeling as a grid graph, and solving the single-source shortest path problem with non-negative edge weights. An adapted version of the parallel delta-stepping algorithm is used for the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/139_GPU_Based_Image_Segmentation_Livewire_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/139_GPU_Based_Image_Segmentation_Livewire_large.jpg
Academia
Instituto Tecnologico de Aeronautica
http://www.ita.br/
2008
02
15
02/15/2008
Open Source
Baggio
Application
Paper
Code
Multimedia
Imaging
livewire, dijkstra, cuda, Baggio
43b5c9c0-99f4-11dd-ad8b-0800200c9a66
Computational Fluid Dynamics (CFD) using GPUs
Solving 2D heat conduction CFD problems using CUDA, with a Red-Black Gauss-Seidel solver and SOR (successive over-relaxation).
/content/cudazone/CUDABrowser/assets/images/applications/138_CFD_using_GPUs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/138_CFD_using_GPUs_large.jpg
Academia
Mechanical Science and Engineering Dept., University of Illinois Urbana-Champaign
http://www.mechse.uiuc.edu/
2008
08
18
08/18/2008
17
Shinn
Presentation
Computational Fluid Dynamics
Science
CFD, 2D solver, heat conduction, Shinn
500ec1f0-99f3-11dd-ad8b-0800200c9a66
Volume Ray Casting With CUDA
This dissertation includes a chapter devoted to volume ray casting on the CUDA architecture. The implementation is tailored to the CUDA architecture's unique details and performs 1.5 times better than our Cell processor implementation and 15 times better than our Intel Xeon implementation.
/content/cudazone/CUDABrowser/assets/images/applications/137_Volume_Ray_Casting_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/137_Volume_Ray_Casting_large.jpg
Academia
University of Maryland at College Park
http://www.umd.edu/
2008
05
09
05/09/2008
Kim
Paper
Graphics
Kim
73926e40-99f0-11dd-ad8b-0800200c9a66
Radius-CUDA
This application implements a complete ray tracing kernel using a kd-tree structure for the hierarchical space subdivision and plain triangles for the geometry. The complete kernel including simple shading, shadow and visibility ray generation is implemented using CUDA. The source is also provided by the author.
/content/cudazone/CUDABrowser/assets/images/applications/136_Radius-Cuda_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/136_Radius-Cuda_large.jpg
Commercial
Etranges Libellules
http://www.etranges-libellules.fr/?lang=en
2008
10
10
10/10/2008
2
Open Source
Segovia
Application
Graphics
Ray tracing, CUDA, Segovia
d70d9d10-99ef-11dd-ad8b-0800200c9a66
The Synchronization Power of Coalesced Memory Accesses
This paper investigates the synchronization power of coalesced memory accesses in CUDA. The results show that coalesced memory accesses can be used to construct concurrent data objects that tolerate up to 63 crash-failures (compute capability 1.2 and higher) or 15 crash-failures (compute capability 1.1 and lower).
/content/cudazone/CUDABrowser/assets/images/applications/135_Coalesced_Memory_Accesses_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/135_Coalesced_Memory_Accesses_large.jpg
Academia
University of Tromsø, Norway
http://uit.no/informatikk/
2008
09
24
09/24/2008
Ha
Anshus
Tsigas
Paper
Presentation
Programming Tools
Science
Fault-tolerance, synchronization, multicores, memory access mechanisms, consensus, Ha, Anshus, Tsigas
fc0adf20-99ee-11dd-ad8b-0800200c9a66
Cubic Interpolation
Cubic B-spline interpolation of 2D and 3D textures. Easily replace your tex2D and tex3D calls by cubic interpolated texture lookups. A CUDA accelerated pre-filter for calculating the B-spline coefficients and example programs are also included.
/content/cudazone/CUDABrowser/assets/images/applications/134_Cubic_Interpolation_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/134_Cubic_Interpolation_large.jpg
Research
KU Leuven / TU Eindhoven
2008
10
06
10/06/2008
327
Open Source
Ruijters
Application
Code
Graphics
Imaging
Libraries
Programming Tools
Science
Signal Processing
Video & Audio
Cubic B-spline interpolation, Ruijters
f8398140-99ed-11dd-ad8b-0800200c9a66
SVI Pro Advanced 3D Seismic Analysis
SVI Pro is a 3D seismic image analysis and visualisation application, allowing geological objects to be identified, enhanced and extracted from large 3D seismic datasets. The results of this objective and repeatable analysis provide a comprehensive understanding of the subsurface, without pre-interpretation.
/content/cudazone/CUDABrowser/assets/images/applications/133_SVI_Pro_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/133_SVI_Pro_large.jpg
Commercial
ffA
http://www.ffa.co.uk/
2008
10
08
10/08/2008
34
Commercial
ffA
Application
Presentation
Paper
Multimedia
Oil & Gas
3D Seismic, Image Analysis, Exploration, Production, Image Processing, Visualisation, ffA
e274b200-8df7-11dd-ad8b-0800200c9a66
Computational Chemistry Using GPUs
We undertook the task of accelerating the resolution-of-the-identity second-order Moeller-Plesset (RI-MP2) calculations as implemented in Q-Chem 3.1 by executing matrix-matrix multiplication operations using CUBLAS. We exploited the fact that large matrices can be multiplied about 13x faster on the GPU than on the host CPU. With moderate programming effort, our code had a 4.3x speedup.
/content/cudazone/CUDABrowser/assets/images/applications/132_Computational_Chemistry_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/132_Computational_Chemistry_large.jpg
Academia
Harvard, Department of Chemistry and Chemical Biology
http://aspuru.chem.harvard.edu/About/
2008
08
01
01/08/2008
4.3
Open Source
Aspuru-Guzik
Paper
Science
Other
Quantum, chemistry, Moller-Plesset, molecular, Aspuru-Guzik
776e39b0-8959-11dd-ad8b-0800200c9a66
GpuCV: GPU-accelerated Computer vision library
GpuCV is an open-source GPU-accelerated image processing and Computer Vision library. It is meant for easily porting existing OpenCV applications, while taking advantage of computing power available from recent GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/131_GpuCV_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/131_GpuCV_large.jpg
Academia
Institut TELECOM; TELECOM & Management SudParis
http://www.it-sudparis.eu/
2008
10
22
10/22/2007
100
Open Source
Allusse
Horain
Application
Paper
Code
Imaging
Programming Tools
Science
Signal Processing
Video & Audio
Other
GLSL, NVIDIA CUDA, computer vision, image processing, Allusse, Horain
21efc800-8959-11dd-ad8b-0800200c9a66
Ray Tracing
This application shows a method allowing ray tracing with CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/130_Ray_Tracing_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/130_Ray_Tracing_large.jpg
Academia
University of Reims
2008
9
12
9/12/2008
16
Open Source
Maxime
Application
Code
Graphics
Imaging
Numerics
Ray tracing, Maxime
bec4bb60-847f-11dd-ad8b-0800200c9a66
Photon-Mapping Demo
Real-time, physically-based photon mapper using aeth.drive, a general-purpose library for dynamical simulations. Kernel solves Maxwell's equations in a turbulent participating medium using nearly 800,000 samples in a few hundredths of a second.
/content/cudazone/CUDABrowser/assets/images/applications/129_Photon_Mapping_Demo_Small.png
/content/cudazone/CUDABrowser/assets/images/applications/129_Photon_Mapping_Demo_Large.png
Research
2008
9
6
9/6/2008
256
Open Source
Daley
Application
Code
Graphics
Libraries
Science
Other
Photon mapper, simulations, Maxwell's equations, Daley
07243f10-847c-11dd-ad8b-0800200c9a66
Fast Sparse Signal Recovery from Random Projections
We consider the problem of sparse signal recovery from a small number of random projections (measurements). This is a well-known NP-hard combinatorial optimization problem. Here, we discuss a fast GPU (CUDA & CUBLAS) implementation.
/content/cudazone/CUDABrowser/assets/images/applications/128_Fast_Sparse_Signal_Recovery_from_Random_Projections_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/128_Fast_Sparse_Signal_Recovery_from_Random_Projections_Large.jpg
Academia
Institute for Biocomplexity and Informatics, University of Calgary
http://www.ibi.ucalgary.ca/
2008
9
10
9/10/2008
31
Andrecut
Application
Paper
Code
Numerics
Signal Processing
Matching Pursuit; Signal Recovery; Random Projections, Andrecut
0d68f890-7b56-11dd-ad8b-0800200c9a66
Ray tracing with CUDA (CUDART-sp)
Real-time ray tracing on a CUDA device. This implementation uses only spheres, 2 lights and no textures.
/content/cudazone/CUDABrowser/assets/images/applications/127_cudart_sp_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/127_cudart_sp_large.jpg
Academia
BV2
http://www.bv2.co.uk/
2008
8
27
8/27/2008
25
Abernethy
Application
Multimedia
Graphics
Ray tracing, Abernethy
7b764240-7b54-11dd-ad8b-0800200c9a66
Lucas and Kanade optical flow algorithm using CUDA (LKCUDA)
Real-time pyramidal implementation of the Lucas and Kanade optical flow algorithm.
/content/cudazone/CUDABrowser/assets/images/applications/126_LKCuda_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/126_LKCuda_large.jpg
Research
INRIA
http://www-rocq.inria.fr/
2008
8
29
8/29/2008
55
Yann DUMORTIER
Julien MARZAT
Application
Imaging
Libraries
vision, dense optical flow, real-time, Lucas and Kanade, algorithm, LKCuda Team
012a74f0-7b51-11dd-ad8b-0800200c9a66
Fast Sliding-Window Object Detection
The paper presents a fast object class localization framework implemented on a data parallel architecture available in recent computers. Our case study, the implementation of HOG descriptors, shows that just by using this recent programming model we can easily speed up an original CPU-only implementation by a factor of 34 (109 for the accelerated part), making it unnecessary to use early rejection cascades that sacrifice classification performance, even in real-time conditions.
/content/cudazone/CUDABrowser/assets/images/applications/125_Fast_Sliding_Window_Object_Detection_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/125_Fast_Sliding_Window_Object_Detection_large.jpg
Academia
TU Darmstadt
http://www.mis.informatik.tu-darmstadt.de/
2008
6
10
6/10/2008
109
Wojek, Dorkó, Schulz, Schiele
Paper
Graphics
Imaging
Video & Audio
Other
Object Detection, Histograms of Oriented Gradients, HOG, Sliding-Window, People Detection, Wojek, Dorkó, Schulz, Schiele
f89bc2f5-c528-41ce-84ef-8e833503d4de
Teraflops for Games and Derivatives Pricing
Financial instruments pricing using Monte-Carlo methods.
/content/cudazone/CUDABrowser/assets/images/applications/124_Teraflops_for_Games_and_Derivatives_Pricing_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/124_Teraflops_for_Games_and_Derivatives_Pricing_large.jpg
Commercial
QuantCatalyst Inc.
http://www.quantcatalyst.com/
2008
8
1
8/1/2008
50
Egloff, Bennemann, Beinker, Gauckler
Paper
Finance
Graphics processing units, high performance computing, cluster, grid, Monte-Carlo simulation, basket options, local volatility, derivatives pricing, risk analytics, Egloff, Bennemann, Beinker, Gauckler
3f8319f0-5fc7-11dd-ad8b-0800200c9a66
Ray Casting Deformable Models
This paper explores the problem of real time ray casting of large deformable models (over a million triangles) on large displays (a million pixels) on an off-the-shelf GPU. We build a GPU-efficient three dimensional data structure for this purpose and a corresponding algorithm that uses it for fast ray casting using the CUDA model.
http://cvit.iiit.ac.in/projects/gpuproject/
/content/cudazone/CUDABrowser/assets/images/applications/123_Ray_Casting_Deformable_Models_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/123_Ray_Casting_Deformable_Models_Large.jpg
Academia
International Institute of Information Technology Hyderabad
http://cvit.iiit.ac.in/
2008
7
4
7/4/2008
Open Source
Patidar, et al
Paper
Graphics
Ray casting, Deformable Models, Data structures on GPU, Patidar, Narayanan, Patidar, et al
90392390-5fc6-11dd-ad8b-0800200c9a66
A Fast Similarity Join Algorithm
A novel similarity join algorithm called LSS is presented that executes on a GPU, exploiting its parallelism and high data throughput. Experimental results demonstrate that LSS is suitable for similarity joins in large high-dimensional datasets, and that it performs well when compared against two existing prominent similarity join methods.
/content/cudazone/CUDABrowser/assets/images/applications/122_A_Fast_Similarity_Join_Algorithm_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/122_A_Fast_Similarity_Join_Algorithm_Large.jpg
Academia
University of Maryland
http://www.cs.umd.edu/
2008
04
09
04/09/2008
100
Lieberman, Sankaranarayanan, Samet
Paper
Science
Similarity search, joins, high-dimensional points, Lieberman, Sankaranarayanan, Samet
bc2716d0-5fc4-11dd-ad8b-0800200c9a66
Concurrent Number Cruncher
Concurrent Number Cruncher: a general purpose symmetric sparse solver on the GPU. It describes how to combine recent GPU programming techniques and new GPU dedicated APIs with high performance computing strategies to implement a sparse general-purpose linear solver.
/content/cudazone/CUDABrowser/assets/images/applications/120_CNC_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/120_CNC_Large.jpg
Academia
INRIA
http://www.inria.fr/
2008
7
28
7/28/2008
10
Open Source
Buatois, Caumon, Lévy
Application
Paper
Code
Multimedia
Numerics
Conjugate gradient, sparse solver, Buatois, Caumon, Levy
f751cbd0-5fc2-11dd-ad8b-0800200c9a66
NaminamiFX for Fluid Simulation
NaminamiFX, bundled with LiquidPack, delivers more than 4 times the computational performance of a CPU-only system. Available as a plug-in for LightWave v9, it was developed and is currently sold only in Japan.
/content/cudazone/CUDABrowser/assets/images/applications/119_NaminamiFX_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/119_NaminamiFX_Large.jpg
Commercial
D-Storm Inc.
http://www.dstorm.co.jp/
2007
10
16
10/16/2007
4
Commercial
Kenkyujo
Application
Digital Content Creation
LightWave, Liquid Pack, fluid simulation, wave, Kenkyujo
db0238a0-5ecf-11dd-ad8b-0800200c9a66
Real-time Digital Holographic Microscopy
This paper describes a real-time DHM system using a GPU with many stream processors. The computational speed of the Fresnel diffraction using the GPU is faster than that of recent CPUs. The real-time DHM system can obtain reconstructed images from 512x512 holograms at 24 frames per second.
/content/cudazone/CUDABrowser/assets/images/applications/115_Real-time_Digital_Holographic_Microscopy_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/115_Real-time_Digital_Holographic_Microscopy_Large.jpg
Academia
Yamagata University
http://gabor.yz.yamagata-u.ac.jp/
2008
7
24
7/24/2008
Shimobaba, Sato, Miura, Takenouchi, Ito
Application
Paper
Multimedia
Numerics
Libraries
Science
Signal Processing
Digital holography microscope microscopy Fresnel diffraction light propagation hologram, Tomoyoshi, Shimobaba, Yoshikuni, Sato, Junya, Miura, Mai, Takenouchi, Tomoyoshi, Ito
7083a0f0-5e42-11dd-ad8b-0800200c9a66
GPU4Vision
Usage of GPUs to tackle computer vision tasks like denoising, filtering, segmentation, stereo, optical flow, etc.
/content/cudazone/CUDABrowser/assets/images/applications/118_GPU4Vision_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/118_GPU4Vision_Large.jpg
Academia
Graz University of Technology
http://www.icg.tugraz.at/
2008
7
21
7/21/2008
Pock
Application
Paper
Graphics
Imaging
Numerics
Science
Signal Processing
Video & Audio
Computer vision, denoising, filtering, segmentation, stereo, optical flow, Pock
4b8d2790-5e41-11dd-ad8b-0800200c9a66
Molecular Dynamics of DNA and Liquids
Ascalaph Liquid GPU is a program for molecular dynamics simulation in the liquid phase. Ascalaph DNA GPU is a program for creating models of nucleic acids and their complexes with ligands.
http://mtzweb.scs.uiuc.edu/research/gpu/
/content/cudazone/CUDABrowser/assets/images/applications/117_Molecular_Dynamics_of_DNA_and_Liquids_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/117_Molecular_Dynamics_of_DNA_and_Liquids_Large.jpg
Commercial
Agile Molecule
http://www.agilemolecule.com/index.html
2008
7
23
7/23/2008
18
Commercial
Alexey
Application
Science
Molecular Dynamics, Building Design, Alexey
8ec8dff0-5e40-11dd-ad8b-0800200c9a66
GPUGRID.NET
GPUGRID.NET is a volunteer computing project using NVIDIA graphics cards and CUDA for full-atom molecular dynamics simulations of proteins.
/content/cudazone/CUDABrowser/assets/images/applications/116_GPUGRID_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/116_GPUGRID_Large.jpg
Academia
Universitat Pompeu Fabra - Multiscale Lab
http://multiscalelab.org/
2008
7
17
7/17/2008
Commercial
De Fabritiis
Application
Life Sciences
Science
molecular dynamics, distributed computing, BOINC, Gianni, De Fabritiis
8bfac8a0-5e3c-11dd-ad8b-0800200c9a66
LIBOR Interest rate Model
With recent exciting developments, the author is dedicating some research time to exploring the capabilities of the latest hardware/software for HPC including topics such as trends in mainstream HPC, the co-processor alternatives, NVIDIA GPUs (hardware/software/applications), and whether the alternatives will have an impact.
/content/cudazone/CUDABrowser/assets/images/applications/114_LIBOR_Interest_rate_Model_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/114_LIBOR_Interest_rate_Model_Large.jpg
Academia
Oxford University
http://people.maths.ox.ac.uk/~gilesm/hpc/
2006
7
1
7/1/2006
50
Giles
Application
Presentation
Finance
Monte Carlo, finance, computational finance, LIBOR, interest rate model, finite difference, Giles
f6a4261c-5a03-4123-b651-0057b017551d
Real-time Visual Tracker by Stream Processing
This work describes the implementation of a real-time visual tracker that targets the position and 3D pose of objects (specifically faces) in video sequences. Using a GPU and the NVIDIA CUDA technology, performance improvements as large as ten times compared to a similar CPU-only tracker are achieved.
/content/cudazone/CUDABrowser/assets/images/applications/113_Real-time Visual_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/113_Real-time Visual_Large.jpg
Research
NTT Communication Science Laboratories
http://www.kecl.ntt.co.jp/rps/index.html
2008
7
12
7/12/2008
10
Lozano, Otsuka
Application
Paper
Video & Audio
Stream processing, particle filtering, face tracking, gpu, cuda, vision, Lozano, Otsuka
b4954515-c71d-4c7d-b29c-8dc6954724ef
Numerical Calculation Library for Diffraction Integrals
The GPU-based Wave Optics library is a numerical calculation library for diffraction integrals using the GPU. It can calculate several types of diffraction: Fresnel diffraction, the angular spectrum method, Fraunhofer diffraction and shifted-Fresnel diffraction.
/content/cudazone/CUDABrowser/assets/images/applications/112_Diffraction Integrals_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/112_Diffraction Integrals_Large.jpg
Academia
Yamagata University
http://gabor.yz.yamagata-u.ac.jp/
2008
7
01
7/01/2008
Shimobaba
Application
Numerics
Science
Imaging
Digital Content Creation
Light propagation, Diffraction theory, Holography, Hologram, Computer-generated-hologram, Shimobaba
6cf70d73-92ff-4896-9f30-9a5b4c47e2b5
Cost-effective Medical Image Reconstruction
This paper demonstrates that parallel implementations of modern medical imaging applications on traditional parallel architectures can be outperformed, in both speed and cost-effectiveness, by new implementations on next-generation architectures like GPUs.
http://portal.acm.org/citation.cfm?id=1366230.1366278
/content/cudazone/CUDABrowser/assets/images/applications/111_Cost Effective Medical_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/111_Cost Effective Medical_Large.jpg
Academia
University of Muenster, Germany
http://www.math.uni-muenster.de/
2008
5
15
5/15/2008
10
Schellmann, et al
Paper
Imaging
Algorithms, general purpose gpu programming, list-mode osem, medical image reconstruction, parallel programming, Schellmann, et al
7cf2de98-b41d-44dc-bf59-ef4f0510e9d4
Obsidian: GPU Programming in Haskell
Obsidian is a GPGPU language embedded in Haskell. The goal is to simplify GPGPU programming by raising the level of abstraction but still offer control of the details necessary to write efficient programs.
/content/cudazone/CUDABrowser/assets/images/applications/110_Obsidian_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/110_Obsidian_Large.jpg
Academia
Chalmers University of Technology
http://www.chalmers.se/
2008
5
15
5/15/2008
Svensson, et al
Presentation
Libraries
Embedded language, GPGPU, Functional Programming, CUDA, Svensson, et al
20cc945a-5e1b-4473-8f25-f42a4155eb22
Accelerating Density Functional Calculations with GPU
A G80 GPU accelerates the ab initio density functional calculation (Gaussian03) by a factor of 10 over the latest quad-core CPU. The errors due to single precision were found to be small enough for practical usage. New algorithms suitable for GPUs are reported.
/content/cudazone/CUDABrowser/assets/images/applications/109_Accelerating Density_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/109_Accelerating Density_Large.jpg
Academia
Nagoya University
http://www.is.nagoya-u.ac.jp/index.html.en
2008
7
04
7/4/2008
G80
40
Yasuda
Paper
Science
Density functional theory, quantum chemistry, first-principle calculation, Yasuda
e84a322a-abec-4a99-a894-d1aa60e2ffa3
Motion Tracking Using Recursive Gaussian
Tracks an object moving into the screen using a re-implemented recursive Gaussian: the background can have a global displacement, and the shape of the object must not change much.
/content/cudazone/CUDABrowser/assets/images/applications/108_Motion Tracking_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/108_Motion Tracking_Large.jpg
Academia
Donar Team
2008
6
30
6/30/2008
Open source
Donar Team
Code
Imaging
Recursive gaussian, imaging, motion tracking, Donar Team
6928c0a0-8571-11dd-ad8b-0800200c9a66
Sliding-Windows for Rapid Object Class Localization: A Parallel Technique
This paper presents a fast object class localization framework implemented on a data-parallel architecture available in recent computers, with a case study on the implementation of Histograms of Oriented Gradients (HOG) descriptors showing a speedup over a CPU-only implementation by a factor of 34 for the application and 109 for the accelerated part.
/content/cudazone/CUDABrowser/assets/images/applications/107_Sliding-Window for Rapid_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/107_Sliding-Window for Rapid_Large.jpg
Academia
TU Darmstadt
http://www.mis.informatik.tu-darmstadt.de/
2008
6
11
6/11/2008
109
Wojek, Dorkó, Schulz, Schiele
Paper
Imaging
Science
Imaging, science, object detection, object localization, HOG, SVM, Wojek, Dorkó, Schulz, Schiele
e03fdc93-ad98-47e2-b82d-a809e92adf6c
Dense Matrix-Vector Multiplication
A dense matrix-vector multiplication routine up to 15.69 times faster than sgemv in CUBLAS 1.1.
/content/cudazone/CUDABrowser/assets/images/applications/106_Desnse Matrix_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/106_Dense_Matrix_Large.jpg
Academia
Osaka Prefecture University
http://www.osakafu-u.ac.jp/english/index.html
2008
4
14
4/14/2008
32
Open source
Fujimoto
Application
Paper
Code
Numerics
Numeric, matrix-vector multiplication, sgemv, CUBLAS, Fujimoto
e120454a-0ccb-4a7f-8bb4-ec1179681443
Wait-free Programming for Computations on Graphics Processors
This paper demonstrates that it is possible to construct wait-free synchronization mechanisms for GPUs without the need of strong synchronization primitives in hardware and that wait-free programming is possible for GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/105_Wait-free Programming_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/105_Wait-free Programming_Large.jpg
Academia
University of Tromsø, Norway
http://uit.no/informatikk
2008
7
18
7/18/2008
Phuong, Tsigas, Anshus
Paper
Presentation
Science
Non-blocking programming, consensus, read-modify-write objects, synchronization, many-core architectures, SIMD, graphics processors, Science, Phuong, Tsigas, Anshus
5ee8c199-24b4-46b6-8ef9-34f4af739e65
CUDA.NET
CUDA.NET is a library that provides access to CUDA functionality from .NET based applications. It can be used on both Windows and Linux operating systems, supporting 32 and 64 bit modes of operation. Examples are provided in C# and IronPython.
/content/cudazone/CUDABrowser/assets/images/applications/104_CUDANET_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/104_CUDA.NET_Large.jpg
Commercial
GASS Company for Advanced Supercomputing Solutions Ltd.
http://www.gass-ltd.co.il/
2008
6
07
6/7/2008
Butrashvily
Application
Code
Library
Programming Tools
CUDA.NET, .NET, Library, Butrashvily
a31eb28a-5d98-404d-8038-034575e9ea19
Real Time Deformable Body Physics Simulation
This paper introduces an optimal solution to implement accurate simulation of deformable bodies in real time, accomplished through the use of Point Based Animation. Significant performance improvements over the CPU were observed: speedups of about 20 times were achieved in the simulation of deformable bodies with 575 physics elements (phyxels) and 53,504 surface elements (surfels).
/content/cudazone/CUDABrowser/assets/images/applications/103_Massively Parallel_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/103_Massively Parallel_Large.jpg
Academia
GRVM / CIn / UFPE
http://www.gprt.ufpe.br/~grvm
2008
6
08
6/8/2008
24
Farias, Almeida, Teixeira, Teichrieb, Kelner
Application
Paper
Graphics
Point Based Animation, meshless simulation technique, graphics, Farias, Almeida, Teixeira, Teichrieb, Kelner
c5e7ae8a-f431-4c5d-877f-17b90f3ffc6e
Mixed Precision Linear Solvers
This report updates results from "Performance and accuracy of hardware-oriented native-, emulated- and mixed-precision solvers in FEM simulations". It shows that mixed-precision schemes are still preferable to double precision alone. Significant quantitative performance improvements are observed with more powerful hardware, demonstrated in a multigrid scheme that provided an accurate solution in finite element settings with one million unknowns in less than 0.1 seconds.
/content/cudazone/CUDABrowser/assets/images/applications/102_Mixed_Precision_Linear_Solvers_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/102_Mixed_Precision_Linear_Solvers_large.jpg
Academia
TU Dortmund
http://www.mathematik.uni-dortmund.de/~goeddeke/
2008
7
08
7/08/2008
G200, T10
27
Göddeke, Strzodka
Application
Paper
Numerics
Mixed precision multigrid finite element, multigrid, finite element, FEM, numerics, Göddeke, Strzodka
395b7b4c-d22d-4539-a7ba-b4157bafb2d2
Large Vocabulary Continuous Speech Recognition
Automatic speech recognition is a key technology for enabling rich human-computer interaction in emerging applications. This paper explores opportunities for parallelizing the Hidden Markov Model (HMM) based Viterbi search algorithm typically used for large-vocabulary continuous speech recognition (LVCSR), and presents an efficient implementation on the G80 architecture.
/content/cudazone/CUDABrowser/assets/images/applications/101_LVCSR_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/101_LVCSR_Large.jpg
Academia
University of California Berkeley
http://www.eecs.berkeley.edu/
2008
6
21
6/21/2008
Chong, Yi, Faria, Satish, Keutzer
Paper
Video & Audio
Speech recognition, probabilistic inference, HMM, Beam Search, LVCSR, data parallelism, graph traversal, Chong, Yi, Faria, Satish, Keutzer
e4ad4db9-af51-4c58-a3aa-b20ea097ae3e
Jacket: GPU Engine for MATLAB
Jacket enables standard MATLAB code to run on the GPU, connecting MATLAB directly to the speed and visual computing capability of the GPU. It is a system that automatically makes memory transfer and execution optimization decisions, and it uses an on-the-fly compilation system to allow GPU functions to run in MATLAB's interpretive style. This example demonstrates some of the BLAS capability of Jacket, providing several speedup benchmarks.
http://www.accelereyes.com/documentation.php
/content/cudazone/CUDABrowser/assets/images/applications/100_Jacket_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/100_Jacket_Large.jpg
Commercial
AccelerEyes LLC
http://accelereyes.com/
2008
6
16
6/16/2008
50
Melonakos
Application
Computational Fluid Dynamics
Digital Content Creation
Electronic Design Automation
Finance
Graphics
Imaging
Numerics
Life Sciences
Libraries
Oil & Gas
Science
Video & Audio
Matlab, Jacket, memory transfer, Pryor, Malcolm, Rehman, Melonakos
e415e883-5349-4c9a-abb7-fefe294c5b08
Optical Flow Algorithm using CUDA and OpenCV
This project implements an optical flow algorithm using CUDA and OpenCV, achieving 90 FPS on 640x480 images with a 4-level pyramid on a GeForce 8800 GTX, compared to 1 FPS on a Pentium 4 at 3 GHz with 320x240 images and a 3-level pyramid. The algorithm implemented is described in Bayesian Multi-scale Differential Optical Flow, Handbook of Computer Vision and Applications.
/content/cudazone/CUDABrowser/assets/images/applications/99_Optical_Flow_Algorithm_using_CUDA_and_OpenCV_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/99_Optical_Flow_Algorithm_using_CUDA_and_OpenCV_large.jpg
Academia
2008
6
18
6/18/2008
90
Hauagge
Paper
Code
Imaging
Numeric, algorithm, optical flow algorithm, optical flow, OpenCV, Hauagge
6a83932a-0df0-4e85-832d-cd3c218c24c7
Python bindings for CUDA using ctypes
The application emulates the original CUDA code more closely. The .tar.gz or .rpm files contain numerous examples, many translated from the CUDA SDK examples.
ftp://ftp.graviscom.com/pub/code/python-cuda/
/content/cudazone/CUDABrowser/assets/images/applications/98_Python_bindings_for_CUDA_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/98_Python_bindings_for_CUDA_large.jpg
Commercial
GraVisCom.de
http://www.graviscom.com/
2008
3
8
3/8/2008
Open source
Paehler
Paper
Libraries
Programming Tools
CUDA, Python, Paehler
ef17a9c2-4240-4c2e-970a-dd41ce505d92
Towards Acceleration of Fault Simulation
This paper discusses the implementation of a fault simulator in a GPU that exploits thread level parallelism.
/content/cudazone/CUDABrowser/assets/images/applications/96_Towards_Acceleration_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/96_Towards_Acceleration_Large.jpg
Academia
Texas A & M University
http://www.tamu.edu/
2008
6
8
6/8/2008
35
Gulati, Khatri
Paper
Electronic Design Automation
EDA, fault simulator, simulation, electronic design automation, Gulati, Khatri
ec54e761-d6dc-4fbc-a8f6-5346104cade3
Accelerating Statistical Static Timing Analysis
This paper explores the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/95_Accelerating_Statistical_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/95_Accelerating_Statistical_Large.jpg
Academia
Texas A & M University
http://www.tamu.edu/
2008
5
7
5/7/2008
260
Gulati, Khatri
Paper
Electronic Design Automation
EDA, Monte Carlo, simulation, electronic design automation, Gulati, Khatri
34737671-9717-4453-98b0-57b4ea19a3e6
Low Viscosity Flow Simulations for Animation
This paper describes a fluid simulation method used in the film industry. We use CUDA to accelerate our high resolution Poisson solver to enforce fluid incompressibility.
/content/cudazone/CUDABrowser/assets/images/applications/94_Low_Viscosity_Flow_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/94_Low_Viscosity_Flow_Large.jpg
Commercial
Rhythm and Hues, UCLA, Inst. of Geophysics and Planetary Physics
http://www.rhythm.com/
2008
7
7
7/7/2008
55
Cohen, Molemaker, Patel, Noh
Paper
Multimedia
Computational Fluid Dynamics
Fluid dynamics,multigrid,poisson solver, Cohen, Molemaker, Patel, Noh
5cbe6168-f2ee-4979-94d4-189bac136744
MIDG
Discontinuous Galerkin Methods for GPU.
http://www.caam.rice.edu/~timwar/RMMC/gpuDG.html
/content/cudazone/CUDABrowser/assets/images/applications/93_MIDG_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/93_MIDG_Large.jpg
Academia
Rice University
http://www.rice.edu/
2008
8
1
08/01/2008
50
Warburton
Application
Numerics
Galerkin method,partial differential equation, Warburton
e379ae7f-988d-44f6-ad46-4a87ff7f2bee
Real Time Capture of Audio Images and Use with Video
Arrays of microphone arrays provide an ability to compute the intensity of sound corresponding to different directions at a given time. Intensities may be exhibited as an image and these images updated at a high frame rate to achieve a real time video image of the sound reflections.
/content/cudazone/CUDABrowser/assets/images/applications/94_Real_Time_Capture_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/94_Real_Time_Capture_Large.jpg
Academia
University of Maryland
http://www.umd.edu/
2007
9
21
9/21/2007
O'Donovan, Duraiswami, Gumerov
Paper
Video & Audio
Imaging
Imaging, audio, camera, spherical microphone arrays, O'Donovan, Duraiswami, Gumerov
745d5157-b2c7-495e-bced-1c3f6d6d7a32
Silicon Informatics Protein Docking
The DockStar deskside server for AutoDock 4.0 can dramatically change workflow and thinking, increasing scientific productivity and interactivity.
http://www.siliconinformatics.com/products.html
/content/cudazone/CUDABrowser/assets/images/applications/91_Silicon_Informatics_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/91_Silicon_Informatics_Large.jpg
Commercial
Silicon Informatics
http://www.siliconinformatics.com/
2008
6
16
6/16/2008
20
Silicon Informatics
Application
Life Sciences
Life science, science, protein, drug discovery, Silicon Informatics
2e520f9d-943e-454d-8a46-865d5f351a7b
High Performance Pattern Recognition on GPU
This paper presents high-performance pattern recognition algorithms with fast implementations on the GPU using CUDA. We study the Parzen windows scheme for density estimation and the Artificial Neural Network for training and classification.
/content/cudazone/CUDABrowser/assets/images/applications/90_High_Performance_Pattern_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/90_High_Performance_Pattern_Large.jpg
Academia
International Institute of Information Technology
http://www.iiit.ac.in/
2007
11
7
11/7/2007
100
Lahabar, Agrawal, Narayanan
Paper
Numerics
Numerics, pattern recognition, algorithms, Parzen, Artificial Neural Network, Lahabar, Agrawal, Narayanan
ab502e83-66df-44cb-a81e-14dae30ac6d6
Audio FIR Crossover
4-way FIR crossover / channel divider with an 8192-tap FIR filter.
http://koonlab.com/CUDA_RealFIR/CUDA%20Real%20FIR.html
/content/cudazone/CUDABrowser/assets/images/applications/89_Audio_FIR_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/89_Audio_FIR_Large.jpg
Academia
Koon lab
http://koonlab.com/
2008
5
3
5/3/2008
35
Open source
Koon lab
Application
Code
Video & Audio
CrossOver FIR Audio,Koon
c101fdeb-9e59-4490-b6f5-75e198022e18
PyCuda
PyCuda lets you access NVIDIA's CUDA parallel computation API from Python, and offers features such as object cleanup tied to the lifetime of objects and automatic error checking.
http://mathema.tician.de/software/pycuda
/content/cudazone/CUDABrowser/assets/images/applications/88_PyCuda_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/88_PyCuda_Large.jpg
Academia
Brown University
http://www.brown.edu/
2008
6
15
6/15/2008
Open source
Klöckner
Code
Numerics
Programming Tools
Numerics, PyCUDA, CUDA, Python, object cleanup, automatic error checking, Klöckner
397405d9-916f-4c22-8d5d-770001adf5c7
OpenVIDIA: Parallel GPU Computer Vision
This project implements computer vision algorithms on computer graphics hardware. The project provides useful example programs which run real time computer vision algorithms on single or multiple GPU system configurations.
http://openvidia.sourceforge.net/
/content/cudazone/CUDABrowser/assets/images/applications/87_OpenVIDIA_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/87_OpenVIDIA_Large.jpg
Academia
University of Toronto
http://www.eecg.toronto.edu/
2004
1
7
1/7/2004
Open source
Fung, Mann
Paper
Application
Code
Multimedia
Imaging
Imaging, computer vision algorithm, computer vision, algorithm, Canny edge, numerics, Fung, Mann
e7670906-c1de-4202-acfa-3a56dc57aef9
TechniScan 3D UltraSound CT
The TechniScan UltraSound CT Imaging System features include the ability to scan the whole breast and produce high resolution 3D images, which provide for easier, more accurate localization and characterization of areas identified as requiring further workup after mammography or conventional ultrasound.
http://www.techniscanmedicalsystems.com/
/content/cudazone/CUDABrowser/assets/images/applications/86_TechniScan_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/86_TechniScan_Large.jpg
Commercial
TechniScan
http://www.techniscanmedicalsystems.com/
2007
12
31
12/31/2007
TechniScan
Application
Life Sciences
Imaging
Life sciences, imaging, medical imaging, medical equipment, CT scan, 3D, TechniScan
e96a79c5-371c-496a-b851-b99b83360354
Ray Casting Algebraic Surfaces using the Frustum Form
This paper discusses an algorithm for interactive ray-casting of algebraic surfaces of high degree. The authors perform numerical root-finding using B-spline and Bézier techniques, then compare them to recent and classical algorithms. The paper proposes an anti-aliasing scheme and shows how this algorithm can be implemented on streaming architectures with single precision.
http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-8659.2008.01133.x
/content/cudazone/CUDABrowser/assets/images/applications/84_klebsch_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/84_klebsch_Large.jpg
Academia
University of Oslo, Norway
http://www.uio.no/english/
2008
4
1
04/01/2008
16
Seland, et al
Paper
Numerics
Graphics
Numeric, algorithm, B-spline, Bézier, graphics, Seland, et al
fbdfdeae-93a4-473e-899d-081e73858fda
xNormal
An application to render normal/ambient occlusion/parallax/relief maps with an integrated 3D viewer, Photoshop tools, and mesh importers/exporters for 3dsmax. It makes intensive use of the GPU and multicore CPUs to perform ray tracing.
http://www.xnormal.net/
/content/cudazone/CUDABrowser/assets/images/applications/85_xNormal_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/85_xNormal_Large.jpg
Commercial
xNormal
http://www.xnormal.net/
2008
6
11
6/11/2008
Open source
xNormal
Application
Graphics
Graphic, render, rendering, ray tracing, xNormal
210dbbcf-ffdf-4145-a4d8-c89d1ca50d54
CUDA Accelerated DXT Compression
NVIDIA supplies a free texture utility to create different types of texture formats for content creation tools.
http://developer.nvidia.com/object/texture_tools.html
/content/cudazone/CUDABrowser/assets/images/applications/02_CUDA_Accelerated_DXT_Compression_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/02_CUDA_Accelerated_DXT_Compression_Large.jpg
Commercial
NVIDIA
http://www.nvidia.com
2008
3
18
3/18/2008
Any CUDA
Open source
NVIDIA
Code
Digital Content Creation
Digital content creation,texture, NVIDIA
1218c1de-21b0-45c4-a999-c531eb2be811
Efficient Computation of Sum Products on GPUs
A wide variety of real-life applications in artificial intelligence, statistics, image processing, and digital communications rely on mathematical techniques such as solvers for the sum-product, or marginalize a product of functions (MPF), problem. The authors describe the results of an MPF solver that achieves excellent results on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/03_Efficient_Computation_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/03_Efficient_Computation_Large.jpg
Academia
2008
7
12
7/12/2008
Any CUDA
270
Silberstein, Schuster, Geiger, Patney, Owens
Paper
Numerics
Algorithms,Numerics,Mathematics, Silberstein, Schuster, Geiger, Patney, Owens
2f6af007-7448-4114-abc9-c41b39265b76
Programming Algorithms-by-Block Made Easy
The FLAME project is a framework for linear algebra. This paper explains how, when applied to a new architecture (GPU), an out-of-the-box solution attains high performance almost effortlessly. The FLAME project has been studying the question of parallel programming in the context of dense and banded matrix computations. In this paper they address the programmability issue head-on and demonstrate that their solution, which departs from the traditional evolutionary path, supports portability to new architectures by demonstrating their work with an NVIDIA multi-GPU system.
/content/cudazone/CUDABrowser/assets/images/applications/04_Programming_Algorithms_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/04_Programming_Algorithms_Large.jpg
Academia
2008
11
1
01/01/2008
Any CUDA
Castillo, Chan, Igual, Mayo, Quintana-Ortí, van de Geijn, Van Zee
Paper
Numerics
Algorithms, Numerics, Mathematics, linear algebra, libraries, high-performance, multithreaded architectures, Castillo, Chan, Igual, Mayo, Quintana-Ortí, van de Geijn, Van Zee
e7e74109-3b6e-49c4-9a30-dfc5b74f986c
Highly Optimized Object-oriented Molecular Dynamics: HOOMD
HOOMD stands for Highly Optimized Object Oriented Molecular Dynamics. It performs general purpose molecular dynamics simulations on a single workstation, taking advantage of the NVIDIA GPUs to attain a level of performance equivalent to 30 processor cores on a fast cluster.
http://www.external.ameslab.gov/hoomd/download.html
/content/cudazone/CUDABrowser/assets/images/applications/05_Highly_Optimized_Object_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/05_Highly_Optimized_Object_Large.jpg
Academia
Ames Laboratory, United States Department of Energy
http://www.external.ameslab.gov/hoomd/index.html
2008
2
1
02/01/2008
Any CUDA
15
Open source, BSD
Anderson, et al
Code
Science
Molecular dynamics,HOOMD,biophysics, Anderson, et al
d6c799c2-7704-4e68-a188-022c8a40b02a
AES Cryptography Acceleration
This paper presents a study of how efficiently modern GPUs can be applied to symmetric key cryptographic solutions. It describes an efficient implementation of the Advanced Encryption Standard (AES) algorithm on NVIDIA's novel CUDA platform.
/content/cudazone/CUDABrowser/assets/images/applications/06_CUDA compatible GPU_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/06_CUDA compatible GPU_Large.jpg
Research
2007
11
1
11/01/2007
Any CUDA
12
Manavski
Paper
Numerics
Numerics, Manavski
07341e38-d5ed-462a-8891-6e9803d13697
Visualization of Meshless Simulations Using Fourier Volume Rendering
This paper discusses the implementation of the Fourier volume rendering technique on graphics hardware, and demonstrates its usefulness in visualizing data produced by both astrophysical and fluid dynamics simulations.
http://cds.gmu.edu/~acorriga/pubs/meshless_fvr/
/content/cudazone/CUDABrowser/assets/images/applications/07_Visualization_of_Meshless_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/07_Visualization_of_Meshless_Large.jpg
Academia
George Mason University
http://cds.gmu.edu/
2007
7
1
07/01/2007
Open source
Corrigan, Wallin, Vesenjak
Paper
Code
Multimedia
Science
Computational Fluid Dynamics
Astrophysics,fluid dynamics,Fourier,hydrodynamics, Corrigan, Wallin, Vesenjak
194c280d-6e0e-48ad-84d0-442383dad19c
Scalable Molecular Dynamics: NAMD
This article introduces concepts and methods used in the NAMD program and provides a list of the key features of NAMD. It also describes the benefits of combining NAMD with the molecular graphics/sequence analysis software, VMD, and the grid computing/collaboratory software, BioCoRE.
http://www.ks.uiuc.edu/Research/namd/
/content/cudazone/CUDABrowser/assets/images/applications/08_Scalable_Molecular_Dynamics_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/08_Scalable_Molecular_Dynamics_Large.jpg
Academia
University of Illinois, Urbana-Champaign
http://www.ks.uiuc.edu/
2005
10
1
10/01/2005
Open source
Phillips, Braun, Wang, Gumbart, Tajkhorshid, Villa, Chipot, Skeel, Kala, Schulten
Paper
Code
Life Sciences
Biomolecular simulation, molecular dynamics, parallel computing, Phillips, Braun, Wang, Gumbart, Tajkhorshid, Villa, Chipot, Skeel, Kala, Schulten
3e29b9f9-4166-41eb-97e7-8ba488703924
Automated Dynamic Analysis of CUDA Programs
This paper presents an automated analysis technique that can be run directly in CUDA's device emulation mode to help programmers find and solve subtle bugs in programs that are too complex to analyze manually.
/content/cudazone/CUDABrowser/assets/images/applications/09_Automated_Dynamic_analysis_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/09_Automated_Dynamic_analysis_Large.jpg
Academia
http://web.mit.edu/rabbah/www/conferences/08/
2008
4
1
04/01/2008
Open source
Boyer, Skadron, Weimer
Paper
Libraries
Analysis,memory, Boyer, Skadron, Weimer
473d6a2d-1359-4212-9586-320d648f5469
GLAME@lab API for Linear Algebra Operations on GPUs
This paper describes the implementation and performance evaluation of three different variants of the Cholesky factorization, on two high-level APIs to use a GPU as a coprocessor for dense linear algebra operations.
/content/cudazone/CUDABrowser/assets/images/applications/10_GLAME_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/10_GLAME_Large.jpg
Academia
Universitat Jaume I
http://www.uji.es/
2008
2
1
02/01/2008
GeForce 8800 Ultra (G80 processor)
Barrachina, et al
Paper
Graphics
Graphics processors (GPUs), general purpose computing on GPU, linear algebra, BLAS, high performance, Barrachina, et al
415c9ac6-7945-458e-88ae-f036b45a697a
Remote Rendering with CUDA
This paper presents the utilization of advanced programming techniques on current graphics hardware to improve the performance of remote rendering for interactive applications.
http://www.nvidia.com/object/io_1200981635689.html
/content/cudazone/CUDABrowser/assets/images/applications/11_CUDA_Supported_Approach_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/11_CUDA_Supported_Approach_Large.jpg
Academia
University of Paderborn
http://www.uni-paderborn.de/en/
2007
9
1
09/01/2007
GeForce 8800 GTS
Lietsch, Marquardt
Paper
Graphics
Graphics,rendering,visualization, Lietsch, Marquardt
4804ed9b-996c-4cbc-a091-1c7c3ff3abf8
Molecular Dynamics Simulations
This paper presents a new approach to high performance molecular dynamics simulations on GPUs, facilitated by their enhanced programmability and motivated by their attractive price/performance ratio and incredible growth in speed.
http://www.springerlink.com/content/p106n8501059l077/
/content/cudazone/CUDABrowser/assets/images/applications/12_Molecular_Dynamics_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/12_Molecular_Dynamics_large.jpg
Academia
Nanyang Technological University
http://www.ntu.edu.sg/publicportal/
2007
12
1
12/01/2007
Liu, Schmidt, Voss, Müller-Wittig
Application
Life Sciences
Molecular dynamic, molecule, biology, Liu, Schmidt, Voss, Müller-Wittig
9d99ed75-fd2a-49fe-9951-d19f6dbc8637
Improved Magnetic Resonance Imaging (MRI) Quality
This paper describes how the reconstruction algorithm leverages the resources of the G80 GPU to achieve over 150 GFLOPS in performance. The G80 helps to dramatically reduce the algorithm's required bandwidth to off-chip memory, while providing substantial acceleration for the trigonometric computations in the algorithm's inner loops, resulting in a significant performance increase.
/content/cudazone/CUDABrowser/assets/images/applications/13_Improved_MRI_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/13_Improved_MRI_Large.jpg
Academia
University of Illinois at Urbana-Champaign
http://www.ks.uiuc.edu/
2007
10
1
10/01/2007
GeForce 8800 GTX (G80)
Stone, et al
Paper
Life Sciences
MRI,magnetic resonance imaging, Stone, et al
df71897e-b47a-428e-b12f-fafdc0b537d3
Fast Multipole Methods on Graphics Processors
GPUs contain a large number of processing units with access to local and shared memory, and achieve significant speedups vis-a-vis CPUs on problems that can be mapped to their Single Program Multiple Data (SPMD) architecture. This paper describes how our FMM algorithm achieves timings that, if computed using an O(N^2) algorithm, correspond to speeds of 25-45 Tflops (for achieved L2 errors of ~10^-6 to 2x10^-4).
http://www.nvidia.com/object/io_1195169962941.html
/content/cudazone/CUDABrowser/assets/images/applications/14_Fast_Multiple_Methods_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/14_Fast_Multiple_Methods_Large.jpg
Academia
University of Maryland
http://www.umd.edu/
2007
10
1
10/01/2007
GeForce 8800 GTX
Gumerov, Duraiswami
Paper
Libraries
Fast Multipole Method, GPU, GPGPU, Personal Supercomputing, NVIDIA CUDA, GPU/Multicore Development Environment, Gumerov, Duraiswami
df6142a1-ec69-44ec-8ba4-853403e1f34a
FIR and QR Decomposition on GPUs
This paper describes the implementation of two HPEC Challenge benchmarks (Finite Impulse Response and QR decomposition) on NVIDIA GPUs using data-parallel implementation approach, as well as results compared to calculations on a CPU.
/content/cudazone/CUDABrowser/assets/images/applications/15_FIR_and_QR_Decomposition_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/15_FIR_and_QR_Decomposition_Large.jpg
Academia
MIT
http://web.mit.edu/
2007
9
1
09/01/2007
GeForce 8800 GTX
35
McGraw-Herdeg, Enright, Michel
Paper
Science
Computation,HPEC,parallel algorithm, McGraw-Herdeg, Enright, Michel
05ceb304-a1a9-474e-8ed7-29dfb61f358e
Molecular Dynamics Simulations on GPUs
This paper discusses an implementation of molecular dynamics simulations on a GPU in the CUDA language. Results for two algorithms suitable for short-ranged and long-ranged interactions, and a congruential shift random number generator are presented. The performance of the GPUs is compared to their main processor counterpart.
http://eprintweb.org/S/article/cond-mat/0709.3225
/content/cudazone/CUDABrowser/assets/images/applications/16_Molecular_Dynamics_Simulations_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/16_Molecular_Dynamics_Simulations_Large.jpg
Academia
Universiteit van Amsterdam
http://www.uva.nl/start.cfm/la=en/th=main
2007
9
1
09/01/2007
GeForce 8800 GTX
150
Open source
van Meel, Arnold, Frenkel, Portegies Zwart, Belleman
Paper
Code
Life Sciences
Molecular dynamics,simulation, van Meel, Arnold, Frenkel, Portegies Zwart, Belleman
6a5b108f-8260-4481-a1c1-f946acb50255
Accelerating Molecular Modeling with GPUs
This article presents an overview of recent advances in programmable GPUs, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases. In light of the performance obtained for this set of calculations, future applications of graphics processors to molecular dynamics simulations are discussed.
http://www3.interscience.wiley.com/journal/116323814/abstract
/content/cudazone/CUDABrowser/assets/images/applications/17_Accelerating_Molecular_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/17_Accelerating_Molecular_large.jpg
Academia
University of Illinois, Urbana-Champaign
http://www.ks.uiuc.edu/
2007
07
01
07/01/2007
Stone, et al
Application
Life Sciences
GPU computing, CUDA, parallel computing, molecular modeling, electrostatic potential, multilevel summation, molecular dynamics, ion placement, multithreading, graphics processing unit, Stone, et al
2518fa4c-903f-438d-a48a-aeb104d1b8f9
Increased Performance of Digital Forensics Tools
This paper presents the results of a number of experiments that evaluate the effectiveness of offloading processing common to digital forensics tools to a GPU, using "massive" numbers of threads to parallelize the computation. These results are compared to speedups obtainable by simple threading schemes appropriate for multi-core CPUs.
/content/cudazone/CUDABrowser/assets/images/applications/18_Increased_Performance_Digital_Forensics_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/18_Increased_Performance_Digital_Forensics_Large.jpg
Academia
University of New Orleans
http://www.uno.edu/
2007
8
1
08/01/2007
GeForce 8800GTX (G80)
Marziale, Richard, Roussev
Paper
Science
Forensic, Marziale, Richard, Roussev
0885bce4-5ead-478e-807d-8bb29b38f831
Scan Primitives for GPU Computing
This paper describes GPU implementations of scan primitives, specifically an efficient formulation and implementation of segmented scan, on NVIDIA GPUs using the CUDA programming model implemented in the C language. Using the scan primitives, the paper shows novel GPU implementations of quicksort and sparse matrix-vector multiply, and analyzes the performance of the scan primitives, several sort algorithms that use the scan primitives, and a graphical shallow-water fluid simulation using the scan framework for a tridiagonal matrix solver.
http://www.nvidia.com/object/io_1195170133199.html
/content/cudazone/CUDABrowser/assets/images/applications/19_Scan_Primitives_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/19_Scan_Primitives_Large.jpg
Academia
University of California Davis
http://www.ucdavis.edu/index.html
2007
8
1
08/01/2007
NVIDIA 8-series (G80)
12
Sengupta, Harris, Zhang, Owens
Paper
Libraries
Parallel computing, general purpose computing, scan primitives, segmented scan, Sengupta, Harris, Zhang, Owens
5dc43cde-e9be-44af-a0f0-4243411a22a3
N-body Simulations in CUDA
This paper presents the results of gravitational direct N-body simulations using the GPU. The force evaluation of the N-body problem is implemented in CUDA using the GPU to speed up the calculations, and the implementation is tested on three different N-body codes: two direct N-body integration codes, using the 4th order predictor-corrector Hermite integrator with block time-steps, and one Barnes-Hut treecode, which uses a 2nd order leapfrog integration scheme.
http://arxiv.org/abs/0707.0438v2
/content/cudazone/CUDABrowser/assets/images/applications/20_N_body_Simulations_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/20_N_body_Simulations_Large.jpg
Academia
Universiteit van Amsterdam
http://www.uva.nl/start.cfm/la=en/th=main
2007
07
01
07/01/2007
GeForce 8800GTX
Belleman, Baedorf, Portegies Zwart
Paper
Science
Gravitation,stellar dynamics,N-body simulation,numerical, Belleman, Baedorf, Portegies Zwart
cd2d33ac-8dfd-4845-a2e8-89ec3bd4095e
Graphic-Card Cluster for Astrophysics (GraCCA)
This paper describes the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. To demonstrate the cluster's performance in astrophysics computation, the authors implemented a parallel direct N-body simulation program with a shared time-step algorithm on this system, and report performance results and comparisons.
http://arxiv.org/abs/0707.2991
/content/cudazone/CUDABrowser/assets/images/applications/21_Graphic_Card_Cluster_2_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/21_Graphic_Card Cluster_2_Large.jpg
Academia
National Taiwan University
http://www.ntu.edu.tw/engv4/
2008
1
1
01/01/2008
GeForce 8800 GTX
250
Schive, Chien, Wong, Tsai, Chiueh
Paper
Science
Gravitation,stellar dynamics,N-body simulations,numerical, Schive, Chien, Wong, Tsai, Chiueh
fb45c4f8-549c-4d31-9a84-25c92313ebaf
The Chamomile Scheme: N-body Simulations
This paper presents an algorithm named the "Chamomile Scheme". The scheme is fully optimized for calculating gravitational interactions on a programmable GPU, which has (a) small but fast shared memories with no broadcasting mechanism and (b) floating point arithmetic hardware capable of 500 Gflop/s, but only in single precision. Based on this scheme, the authors developed a library for gravitational N-body simulations, "CUNBODY-1", whose measured performance reaches 173 Gflop/s for 2048 particles and 256 Gflop/s for 131072 particles.
http://arxiv.org/abs/astro-ph/0703100
/content/cudazone/CUDABrowser/assets/images/applications/22_The_Chamomile_Scheme_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/22_The_Chamomile_Scheme_Large.jpg
Research
Computational Astrophysics Laboratory, RIKEN
http://atlas.riken.go.jp/
2007
3
1
03/01/2007
GeForce 8800 GTX
Hamada, Iitaka
Paper
Science
Stellar Dynamics,numerical,N-body simulations, Hamada, Iitaka
6fe4156e-2e59-4913-931c-8d975619cd06
Smith-Waterman Sequence Alignment
This paper exploits the huge computational power of commonly available GPUs to develop high performance solutions for sequence alignment, as industry development and increasing demands make using the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the widely adopted heuristic approaches.
http://www.biomedcentral.com/1471-2105/9/S2/S10
/content/cudazone/CUDABrowser/assets/images/applications/23_Smith_Waterman_Sequence_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/23_Smith_Waterman_Sequence_large.jpg
Academia
Universita Degli Studi Di Padova
http://www.unipd.it/en/
2008
3
1
03/01/2008
GeForce 8800 GTX
30
Manavski, Valle
Paper
Life Sciences
Molecular biology,Smith-Waterman algorithm,protein,DNA,FASTA,BLAST, Manavski, Valle
edbc17aa-a223-42fa-842c-7b2d4b80cd75
MUMmerGPU: High-throughput Sequence Alignment
This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on GPUs in common workstations. MUMmerGPU uses CUDA to align multiple query sequences against a single reference sequence stored as a suffix tree, providing a low cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies.
http://www.biomedcentral.com/1471-2105/8/474
/content/cudazone/CUDABrowser/assets/images/applications/24_MUMmerGPU_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/24_MUMmerGPU_Large.jpg
Academia
University of Maryland
http://www.umd.edu/
2007
12
1
12/01/2007
G80
10
Schatz, Trapnell, Delcher, Varshney
Paper
Life Sciences
DNA sequencing, sequence alignment, MUMmer, genotyping, genome resequencing, metagenomics, de novo genome assembly, parallel computing, Schatz, Trapnell, Delcher, Varshney
6be79b95-7535-41e5-bfb7-e89d5ec9305b
Two-electron Integral Evaluation
The paper proposes an algorithm for evaluating the Coulomb potential in ab initio density functional calculations on the GPU, and discusses in detail the numerical accuracy the algorithm requires.
http://www3.interscience.wiley.com/journal/114287520/abstract
/content/cudazone/CUDABrowser/assets/images/applications/25_Two-electron Integral_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/25_Two-electron Integral_Large.jpg
Academia
Nagoya University
http://www.nagoya-u.ac.jp/en/
2007
4
1
04/01/2007
GeForce 8800 GTX
Yasuda
Application
Science
Algorithm, Coulomb, computing, Gauss-Rys, two-electron integrals, quantum chemistry, first-principle calculation, Yasuda
c3e22827-0b47-4e95-b4ba-c073ee1fc74a
Interactive Visualization of Volumetric White Matter Connectivity
Diffusion tensor magnetic resonance imaging (DT-MRI) using a parallel Hamilton-Jacobi (H-J) equation solver implemented in CUDA, running on an NVIDIA GeForce 8800GTX.
/content/cudazone/CUDABrowser/assets/images/applications/26_Interactive_Visualization_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/26_Interactive_Visualization_Large.jpg
Research
University of Utah
http://www.cs.utah.edu/
2007
10
1
October 2007
G80+
100
Jeong, Fletcher, Tao, Whitaker
Paper
Life Sciences
GPGPU CUDA MRI DT-MRI H-J Hamilton-Jacobi PDE parallel, Diffusion tensor visualization, graphics hardware, interactivity, fast iterative method (FIM), NIH, Jeong, Fletcher, Tao, Whitaker
e223fbfc-f017-498c-8174-699747dbd88b
Accelerating Distributed Storage Systems with CUDA
Hashing module algorithms in CUDA: SHA1 and MD5.
http://www.ece.ubc.ca/~samera/projects/StoreGPU/
/content/cudazone/CUDABrowser/assets/images/applications/27_Accelerating_Distributed_Storage_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/27_Accelerating_Distributed_Storage_large.jpg
Academia
The University of British Columbia
http://www.ubc.ca/
2008
01
01
01/01/2008
G80+
9
Al-Kiswany, Gharaibeh, Santos-Neto, Yuan, Ripeanu
Paper
Code
Numerics
MD5,SHA1,CTM, Al-Kiswany, Gharaibeh, Santos-Neto, Yuan, Ripeanu
aecf6efa-875d-47d1-a3f1-429316708e70
Astrophysical N-body Simulation
An optimized C/C++/Fortran library to accelerate N-body interactions using CUDA on NVIDIA GPUs
http://progrape.jp/cs/
/content/cudazone/CUDABrowser/assets/images/applications/28_Astrophysical_N_body_Simulation_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/28_Astrophysical_N_body_Simulation_Large.jpg
Academia
Genomic Sciences Center, RIKEN
http://mdgrape.gsc.riken.jp/
2007
7
27
7/27/2007
G8x and up
Paper
Code
Science
Numerics
Astrophysics,N-body,library
248f3de9-aa4c-4a56-9980-76eb1057ce4d
pystream: Stream and GPU computing in Python
PyStream enhances Python with seamless access to CUDA libraries including the CUDA BLAS and FFT libraries.
http://code.google.com/p/pystream/
/content/cudazone/CUDABrowser/assets/images/applications/29_pystream_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/29_pystream_Large.jpg
Commercial
Tech-X Corporation
http://www.txcorp.com/
2007
12
31
12/31/2007
G80 and up
Tech-X Corporation
Paper
Code
Numerics
Python, language bindings, high performance computing, stream computing, Tech-X Corporation
3af4b289-a5ff-4d4e-bb63-ab493aa101a5
Biomedical Image Analysis
Large scale biomedical image analysis applications on heterogeneous systems with multiple processors and multiple GPUs.
/content/cudazone/CUDABrowser/assets/images/applications/30_Biomedical_Image_Analysis_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/30_Biomedical_Image_Analysis_Large.jpg
Academia
Dept of Biomedical Informatics, Ohio State
http://bmi.osu.edu/
2008
6
7
6/7/2008
G80
13
Hartley, Catalyurek, Ruiz, Igual, Mayo, Ujaldon
Paper
Life Sciences
Imaging
Biomedical Imaging, heterogeneous computing, cluster, high performance computing, Hartley, Catalyurek, Ruiz, Igual, Mayo, Ujaldon
cf8a8ada-2505-4fc3-8584-567addfc8c02
3D Euler Solver
Two- and three-dimensional Euler solvers are ported to the GPU, achieving 16x and 29x speedups respectively.
/content/cudazone/CUDABrowser/assets/images/applications/31_3D_Euler_Solver_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/31_3D_Euler_Solver_Large.jpg
Academia
Whittle Laboratory, University of Cambridge
http://www-g.eng.cam.ac.uk/whittle/
2008
1
1
January 2008
G80 and up
29
Brandvik, Pullan
Paper
Computational Fluid Dynamics
Numerics
Euler Solver, Brandvik, Pullan
f49c3ba6-2bd2-426f-b42e-5b36899ca0d3
Lattice Boltzmann Kernel using CUDA
A 2D Lattice Boltzmann kernel is accelerated using CUDA, and performance is shown in an example of flow through a generic porous medium.
/content/cudazone/CUDABrowser/assets/images/applications/32_Lattice_Boltzmann_Kernal_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/32_Lattice_Boltzmann_Kernal_Large.jpg
Academia
TU Braunschweig
http://www.tu-braunschweig.de/
2008
2
1
02/01/2008
G80 and up
10
Tolke
Paper
Computational Fluid Dynamics
CFD, Lattice Boltzmann, Tolke
fb370bb1-f723-4cf8-a3f1-c29eac806c20
AstroGPU 2007 Workshop
Video collection of presentations at the AstroGPU 2007 Workshop on GPUs in Astronomy and Astrophysics.
/content/cudazone/CUDABrowser/assets/images/applications/33_AstroGPU_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/33_AstroGPU_Large.jpg
Academia
AstroGPU 2007
http://www.astrogpu.org/
2007
11
01
11/01/2007
Multimedia
Science
Astrophysics, Photo/Imaging
28b28fb9-ff52-4086-ace1-be86818553b0
GPULib: Library of Mathematical Functions
GPULib allows users to harness the computational power of GPUs from high level languages and environments such as Python, MATLAB, and IDL.
http://www.txcorp.com/technologies/GPULib/download.php
/content/cudazone/CUDABrowser/assets/images/applications/34_Library_of_mathematical_functions_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/34_Library_of_mathematical_functions_Large.jpg
Commercial
Tech-X Corporation
http://www.txcorp.com/technologies/GPULib/
2008
3
6
3/6/2008
G80 and up
40
Commercial
Tech-X Corporation
Code
Numerics
Numerics/Algorithms/Libraries, MATLAB, Python, Programming Languages, Interpreters, Tech-X Corporation
ea24bf13-69ba-432a-9c47-deae561251e5
GPU Acceleration Solutions
Acceleware's products leverage NVIDIA GPUs to provide solutions for processing computationally intensive applications. Acceleware provides Seismic solutions and Imaging solutions.
http://www.acceleware.com
/content/cudazone/CUDABrowser/assets/images/applications/35_GPU_Acceleration_Solution_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/35_GPU_Acceleration_Solution_Large.jpg
Commercial
Acceleware
http://www.acceleware.com/
2007
12
31
12/31/2007
G80 and up
35
Commercial
Acceleware
Application
Imaging
Oil & Gas
Astrophysics, Photo/Imaging, Acceleware
6716afdc-9d38-4f80-a103-b4f53d389a1b
Cmatch: Fast Exact String Matching on the GPU
A string matching kernel whose search time is proportional to the length of the query string rather than to the size of the text searched.
http://www.cbcb.umd.edu/software/cmatch/
/content/cudazone/CUDABrowser/assets/images/applications/36_Cmatch_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/36_Cmatch_Large.jpg
Academia
Center for Bioinformatics & Computational Biology, University of Maryland
http://www.cbcb.umd.edu/
2007
05
01
05/01/2007
G80, T10P
35
Open source
Schatz, Trapnell
Paper
Code
Presentation
Life Sciences
String match,computational biology,suffix tree,data reordering, Schatz, Trapnell
f54917b8-caab-4705-9aef-6ced78aa72d6
General Purpose Molecular Dynamics Simulations
This paper and code show that our GPU implementation provides performance equivalent to that of a fast thirty-processor-core distributed-memory cluster.
http://www.ameslab.gov/hoomd/index.html
/content/cudazone/CUDABrowser/assets/images/applications/37_General_Purpose_Molecular_Dynamics_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/37_General_Purpose_Molecular_Dynamics_Large.jpg
Academia
Ames Laboratory, United States Department of Energy
http://www.external.ameslab.gov/
2008
2
1
02/01/2008
G80, T10P
30
Open source
Code
Life Sciences
Molecular dynamics,HOOMD
c8a25cdb-3816-49bc-a7ac-d22b58473d56
Quantum Mechanical Calculations of Molecular Properties
The modification of a general purpose code for quantum mechanical calculations of molecular properties (Q-Chem) to use a graphical processing unit (GPU) is reported.
http://pubs.acs.org/cgi-bin/abstract.cgi/jpcafh/2008/112/i10/abs/jp0776762.html
/content/cudazone/CUDABrowser/assets/images/applications/38_Quantum_Mechanical_Calculations_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/38_Quantum_Mechanical_Calculations_large.jpg
Academia
2007
11
1
11/01/2007
G80
4
Vogt, et al
Application
Science
Moller-Plesset, Quantum Chemistry, Vogt, et al
a0302629-3147-4aa5-9e29-74e5875036d5
Fast GPU-Based CT Reconstruction
Application of GPU acceleration to the FDK method of cone-beam CT (CBCT) image reconstruction, with on-the-fly reconstruction for presentation immediately after scanning.
/content/cudazone/CUDABrowser/assets/images/applications/39_Fast_GPU_Based_CT_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/39_Fast_GPU_Based_CT_large.jpg
Academia
2007
11
1
11/01/2007
G8x
2
Scherl, Keck, Kowarschik, Hornegger
Paper
Imaging
CT, FDK reconstruction, FDK algorithm, CUDA, Scherl, Keck, Kowarschik, Hornegger
037a9a90-0c50-4769-965e-a6eec6d448b1
MapReduce Framework
MapReduce interface (a software framework implemented by Google to support parallel computations on large datasets) using GPUs.
http://www.cse.ust.hk/gpuqp
/content/cudazone/CUDABrowser/assets/images/applications/40_MapReduce_Framework_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/40_MapReduce_Framework_Large.jpg
Academia
2007
11
25
11/25/2007
10
He, Fang, Govindaraju, Luo, Wang
Paper
Code
Numerics
MapReduce,search, He, Fang, Govindaraju, Luo, Wang
b6ff75f6-1b1c-4faf-bc5a-77fb92193382
Dirac Video Codec
Acceleration of the wavelet-based Dirac Video Codec (DVC), including overlapped block motion compensation, wavelet transforms, and frame arithmetic. Used for better compression rates and real-time decompression for streaming over low-bandwidth networks.
http://www.cs.rug.nl/~wladimir/sc-cuda/
/content/cudazone/CUDABrowser/assets/images/applications/41_Dirac_Video_Codec_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/41_Dirac_Video_Codec_Large.jpg
Academia
University of Groningen
http://www.rug.nl/corporate/index
2007
12
1
December 2007
Open source
van der Laan, et al
Paper
Code
Video & Audio
Codec,video,streaming, van der Laan, et al
6ad74c31-fdb2-4fd2-a322-f7ee4d8d7ee1
Quantum Chemistry Two-Electron Integral Evaluation
Use of GPUs to calculate two-electron repulsion integrals over Gaussian basis functions.
http://pubs.acs.org/cgi-bin/abstract.cgi/jctcce/2008/4/i02/abs/ct700268q.html
/content/cudazone/CUDABrowser/assets/images/applications/42_Quantum_Chemistry_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/42_Quantum_Chemistry_large.jpg
Academia
Beckman Institute, University of Illinois at Urbana-Champaign
http://www.beckman.uiuc.edu/
2008
1
1
01/01/2008
130
Ufimtsev, et al
Application
Science
Computational chemistry, Ufimtsev, et al
e60b1da5-bf35-47ea-8a16-5b471358e63d
Teraflop CFD Computing
Implementation of a Lattice Boltzmann (LB) kernel based on a D3Q13 model.
/content/cudazone/CUDABrowser/assets/images/applications/43_Towards_3D_teraflop_CFD_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/43_Towards_3D_teraflop_CFD_Large.jpg
Academia
TU Braunschweig
http://www.tu-braunschweig.de/
2008
2
1
02/01/2008
100
Tolke, Krafczyk
Paper
Computational Fluid Dynamics
Fluids,Lattice Boltzmann, Tolke, Krafczyk
92e63d97-f7f8-47e6-938e-a23ac28444d8
OmegaSim GX Hardware-Accelerated SPICE Simulator
SPICE simulator for analog and mixed-analog-digital circuits.
http://www.nascentric.com/omegasim_gx.html
/content/cudazone/CUDABrowser/assets/images/applications/45_OmegaSim_GX_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/45_OmegaSim_GX_Large.jpg
Commercial
Nascentric
http://www.nascentric.com/
2008
4
1
04/01/2008
8
Commercial
Nascentric
Application
Electronic Design Automation
EDA,SPICE, Nascentric
00b5aac0-8570-11dd-ad8b-0800200c9a66
A Neural Network on GPU
Implementation of a neural network with CUDA.
/content/cudazone/CUDABrowser/assets/images/applications/46_Neutral_Network_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/46_Neutral_Network_Large.jpg
Academia
University of California Davis
http://www.ucdavis.edu/
2008
3
14
3/14/2008
10
Open source
Billconan, Kavinguy
Paper
Code
Multimedia
Life Sciences
Neural network, Billconan, Kavinguy
d5f22f44-568c-4be8-9c30-e631e0b37fea
General Relativistic Evolution Code
Implementation of a finite-differencing code for solving Einstein's field equations on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/47_General_Relativistic_Evolution_Code_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/47_General_Relativistic_Evolution_Code_Large.jpg
Academia
Center for Computation and Technology, and Department of Physics and Astronomy, Louisiana State University
http://www.cct.lsu.edu/home
2008
1
1
01/01/2008
26
Zink, Burkhard
Paper
Science
Computational physics, Zink, Burkhard
ed4fdee7-a95b-461b-8dc8-394f4ce1dd8b
Relational Joins on Graphics Processors
Implementation of indexed or non-indexed nested-loop, sort-merge and hash joins using a set of data-parallel primitives such as split and sort.
/content/cudazone/CUDABrowser/assets/images/applications/48_Relational_Joins_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/48_Relational_Joins_Large.jpg
Academia
2008
6
1
06/01/2008
7
He, Yang, Fang, Lu, Govindaraju, Luo, Sander
Paper
Numerics
Algorithm, He, Yang, Fang, Lu, Govindaraju, Luo, Sander
d3acfd2b-6d24-4718-bff6-980d0b5116c7
Level 3 CUBLAS
Evaluation of the performance of Level 3 operations in CUBLAS.
/content/cudazone/CUDABrowser/assets/images/applications/49_Level_3_CUBLAS_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/49_Level_3_CUBLAS_Large.jpg
Academia
Departamento de Ingenieria y Ciencia de los Computadores, Universitat Jaume I
http://www.uji.es/CA/departaments/icc/
2008
4
1
04/01/2008
Barrachina, Castillo, Igual, Mayo, Quintana-Orti
Paper
Numerics
Linear algebra, Barrachina, Castillo, Igual, Mayo, Quintana-Orti
229a0477-cf7f-4ee2-bc86-0c91ef1d8332
SnapCT: Tomography Volume Reconstruction
Accelerated tomography volume reconstruction.
http://www.digisens.fr/snapct/
/content/cudazone/CUDABrowser/assets/images/applications/50_SnapCT_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/50_SnapCT_Large.jpg
Commercial
Digisens SA
http://www.digisens.fr/en/
2007
12
31
12/31/2007
50
Commercial
Digisens SA
Application
Imaging
Tomography,medical imaging, Digisens SA
450f1e45-cf76-43ee-8b6f-fa2d8af8202c
Distributed Password Recovery
High-Performance Distributed Password Recovery from the operating system, Microsoft Office products, Adobe PDF files, ZIP and RAR archives, and a variety of other applications.
http://elcomsoft.com/edpr.html/
/content/cudazone/CUDABrowser/assets/images/applications/51_Distributed_Password_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/51_Distributed_Password_Large.jpg
Commercial
Elcomsoft
http://www.elcomsoft.com/
2008
12
31
12/31/2008
50
Elcomsoft
Code
Science
Password recovery,forensic, Elcomsoft
a5b6f331-06dd-452e-8120-54c1ee3b1ba9
Smith-Waterman Sequence Alignment
Exact Smith-Waterman sequence alignment on CUDA.
http://www.biomedcentral.com/1471-2105/9/S2/S10
/content/cudazone/CUDABrowser/assets/images/applications/52_Smith_Waterman_Sequence_Alignment_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/52_Smith_Waterman_Sequence_Alignment_large.jpg
Academia
Universita Degli Studi Di Padova
http://www.cribi.unipd.it/
2008
3
26
3/26/2008
30
Ruiz, Ujaldon, Cooper, Huang
Paper
Life Sciences
Smith-Waterman, Gene, sequencing, DNA, molecular biology, FASTA, BLAST, medical, Ruiz, Ujaldon, Cooper, Huang
44c31276-bccd-4eda-af9f-5a6eb60ad923
Simulation Open Framework Architecture (SOFA)
GPU-based Gauss-Seidel Algorithm for Dense Matrices.
http://www.sofa-framework.org/
/content/cudazone/CUDABrowser/assets/images/applications/53_Simulation_Open_Framework_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/53_Simulation_Open_Framework_Large.jpg
Research
Institut National De Recherche En Informatique Et En Automatique
http://www.inria.fr/
2007
2
1
02/01/2007
55
Open source
Allard, et al
Paper
Code
Numerics
Medical, simulation, dense matrix, Gauss-Seidel, Allard, et al
a7ace7a8-69c7-48c9-9a5f-495d8827e3dc
Non-rigid Registration for Large Sets of Microscopic Images
3D reconstruction and visualization of tissue structures from large sets of microscopic images.
/content/cudazone/CUDABrowser/assets/images/applications/54_Non_Ridgid_Registration_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/54_Non_Ridgid_Registration_Large.jpg
Academia
Universidad de Malaga
http://www.ac.uma.es/
2008
4
1
04/01/2008
4
Ruiz, Ujaldon, Cooper, Huang
Paper
Life Sciences
Medical,imaging,Microscopic imaging,3D reconstruction, Ruiz, Ujaldon, Cooper, Huang
55062578-eb80-4a33-b1a4-90fb21746ebf
Solving Dense Linear Systems on GPUs
Algorithms to compute the solution of a linear system of equations on a GPU.
/content/cudazone/CUDABrowser/assets/images/applications/55_Solving_Dense_Linear_Systems_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/55_Solving_Dense_Linear_Systems_Large.jpg
Academia
Departamento de Ingenieria y Ciencia de los Computadores, Universitat Jaume I
http://www.uji.es/CA/departaments/icc/
2008
4
1
04/01/2008
3
Barrachina, Castillo, Igual, Mayo, Quintana-Orti
Paper
Numerics
Linear algebra, BLAS, Linear systems, Cholesky factorization, LU factorization, Barrachina, Castillo, Igual, Mayo, Quintana-Orti
63beacf6-b813-4932-aba9-62f9ce118774
Geometric Algorithms with CUDA
Solves basic geometric problems on 3D meshes, such as the point inclusion test and self-intersection detection.
/content/cudazone/CUDABrowser/assets/images/applications/56_Geometric_Algorithms_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/56_Geometric_Algorithms_Large.jpg
Academia
Universidad de Jaen
http://www.ujaen.es/
2008
1
29
1/29/2008
100
Rueda, Ortega
Paper
Numerics
3D meshes, inclusion test, self-intersection test, geometric, Rueda, Ortega
b67217ab-626a-4163-9ea7-bb219d660698
Numerical Weather Prediction
CUDA-based speedup for a computationally intensive portion of the Weather Research and Forecast (WRF) model.
/content/cudazone/CUDABrowser/assets/images/applications/57_Numerical_Weather_Prediction_2_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/57_Numerical_Weather_Prediction_2_Large.jpg
Research
The Weather Research & Forecasting Model
http://wrf-model.org/index.php
2007
12
19
12/19/2007
1.3
Michalakes, Vachharajani
Application
Paper
Computational Fluid Dynamics
Weather,computational fluid dynamics,CFD,microphysics,thermodynamics, Michalakes, Vachharajani
0ccfe013-7de1-4269-929f-32012797ce28
MDGPU: Molecular Dynamics Simulation
The MDGPU simulation package presents a framework for Molecular Dynamics simulations, where the most computationally intensive parts are offloaded to the GPU and system dependent tasks are conveniently performed on the CPU.
http://www.amolf.nl/~vanmeel/mdgpu/download.html
/content/cudazone/CUDABrowser/assets/images/applications/58_MDGPU_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/58_MDGPU_large.jpg
Research
Institute for Atomic and Molecular Physics
http://www.amolf.nl/
2007
12
31
12/31/2007
van Meel, et al
Paper
Code
Life Sciences
Molecular Dynamics,Life Sciences, van Meel, et al
e9158b7d-0e71-4550-98ed-386d6acb1142
Visual Molecular Dynamics: VMD
VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems using 3-D graphics and built-in scripting.
http://www.ks.uiuc.edu/Research/vmd/
/content/cudazone/CUDABrowser/assets/images/applications/59_Visual_Molecular_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/59_Visual_Molecular_Large.jpg
Academia
http://www.ks.uiuc.edu/
2007
4
1
04/01/2007
100
Paper
Code
Life Sciences
Molecular Dynamics,Life Sciences
f6a4e7d4-7f25-41f8-aa92-83305b61867d
Map Reduce Framework
The MapReduce interface is a software framework implemented by Google to support parallel computations on large datasets. This paper describes a framework built around the MapReduce abstraction, which allows application developers to focus on their application while enabling a high performance GPU implementation.
/content/cudazone/CUDABrowser/assets/images/applications/60_Map_Reduce_Framework_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/60_Map_Reduce_Framework_Large.jpg
Academia
University of California Berkeley
http://www.eecs.berkeley.edu/
2008
3
1
03/01/2008
G80
150
Catanzaro, Sundaram, Keutzer
Paper
Numerics
MapReduce,Search,Numerical,Algorithms,Libraries, Catanzaro, Sundaram, Keutzer
7e814a19-5a09-470d-9752-e3ba25d1d51e
Innovative 3D visualization solutions for Oil and Gas
Open Inventor by Mercury is a comprehensive solution for simultaneous computation and visualization of huge 3D seismic data sets, or any highly demanding computing tasks in the interpretation and simulation workflows.
http://3dviz.mc.com/solutions/GPUComputing/default.asp
/content/cudazone/CUDABrowser/assets/images/applications/61_Innovative_3D_visualization_solutions_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/61_Innovative_3D_visualization_solutions_Large.jpg
Commercial
Mercury Computer Systems
http://3dviz.mc.com/solutions/oilandgas/ngog/
2007
11
1
November 2007
G80
10
Mercury Computer Systems
Application
Multimedia
Oil & Gas
Seismic, Graphics, Mercury Computer Systems
25c2eda7-7c30-41c2-9bb3-92134f7f3638
Prestack Seismic Data Interaction
Headwave's Prestack for Interpreters is an integrated Windows/Linux software solution, giving easy access to potentially enormous prestack datasets from single PCs using GPUs.
http://www.headwave.com/article/articleview/25
/content/cudazone/CUDABrowser/assets/images/applications/62_Prestack_Seismic_Data_Interaction_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/62_Prestack_Seismic_Data_Interaction_Large.jpg
Commercial
Headwave
http://www.headwave.com/
2007
6
1
06/01/2007
G80
100
Headwave
Application
Oil & Gas
Seismic,Graphics, Headwave
515ca076-3ed9-4311-a6d0-498d5560e1a5
Swaption Volatility
Short rate models have been dismissed for financial engineering applications in favor of market models, as the latter are more flexible and better suited to cluster computing implementations. In this paper, we argue that the paradigm shift toward GPU architectures currently taking place in the high performance computing world can potentially change the situation and tilt the balance back in favor of a new generation of short rate models.
http://www.level3finance.com/index.html
/content/cudazone/CUDABrowser/assets/images/applications/63_Swaption_Violatility_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/63_Swaption_Volatility_Large.jpg
Commercial
Level 3 Finance
http://www.level3finance.com/
2007
9
1
09/01/2007
G80
11
Level 3 Finance
Application
Finance
Finance, Level 3 Finance
22497f94-fe89-4c5a-a6d3-e396e0c60aec
Quantitative Risk Analysis and Algorithmic Trading Systems
The Volera product line delivers high-speed, low-latency options analytics for trading and risk management. Using GPU-based high-performance computing technology, Volera systems offer performance exceeding that of traditional grid computing, and require far less hardware, rack space, electrical power and cooling.
http://www.hanweckassoc.com/products.html
/content/cudazone/CUDABrowser/assets/images/applications/64_Quantitative_Risk_Analysis_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/64_Quantitative_Risk_analysis_Large.jpg
Commercial
Hanweck Associates
http://www.hanweckassoc.com/
2007
9
1
09/01/2007
G80
50
Hanweck Associates
Application
Finance
Finance, Hanweck Associates
f5121b1f-fbcf-466b-9d47-d4b892ca0da2
Geographic Information System (GIS)
Manifold System is a single, integrated product that provides three major classes of GIS functionality in a single package: as a desktop application, as an objects library for programmers and as an Internet Map Server for web applications.
http://www.manifold.net/info/products.shtml
/content/cudazone/CUDABrowser/assets/images/applications/65_Geographic_Information_System_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/65_Geographic_Information_System_Large.jpg
Commercial
Manifold
http://www.manifold.net/index.shtml
2007
8
1
08/01/2007
G80
36
Manifold
Application
Imaging
GIS,Imaging, Manifold
0cdcdb36-45e3-4c53-a001-3cc39aa4460b
Human Body 3D Surface Image Capture and Analysis
A stereo pair of images is captured simultaneously and instantaneously using a pair of synchronised high resolution digital stills cameras.
http://www.di3d.com/
/content/cudazone/CUDABrowser/assets/images/applications/66_Human_Body_3D_Surface_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/66_Human_Body_3D_Surface_Large.jpg
Commercial
Dimensional Imaging
http://www.di3d.com/
2007
12
31
12/31/2007
G80
Dimensional Imaging
Application
Imaging
Imaging, Dimensional Imaging
f31a1ef1-71f6-45ea-bec5-cdb783b71acb
Synthesis of Artificial Neural Circuitry
Simulation of neuronal components closely modeled after neurons in the brain, and synthesis of arrays that wire themselves by simulating neural circuit growth in 3D.
http://www.evolvedmachines.com/
/content/cudazone/CUDABrowser/assets/images/applications/67_Synthesis_of_Artificial_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/67_Synthesis_of_Artificial_Large.jpg
Commercial
Evolved Machines
http://www.evolvedmachines.com/
2007
12
31
12/31/2007
G80
Evolved Machines
Application
Science
Computational biology and simulation, Evolved Machines
87617b60-6f68-4a10-a064-49acf3f44a21
Multiple Relatively Robust Representations (MRRR)
This paper presents an implementation of the Algorithm of Multiple Relatively Robust Representations (MRRR) for the symmetric tridiagonal eigenvalue problem on a data-parallel coprocessor using the CUDA programming environment.
http://www.dgp.toronto.edu/people/lessig/mrrr/
/content/cudazone/CUDABrowser/assets/images/applications/68_MRRR_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/68_MRRR_Large.jpg
Academia
University of Toronto
http://www.utoronto.ca/
2008
5
1
05/01/2008
50
Open source
Lessig
Paper
Code
Numerics
Algorithms,numeric,libraries,tridiagonal eigenvalue,multiple relatively robust representations,MRRR, Lessig
e2821f27-0731-465a-a8bf-550230b7c600
Fast MRI Gridding on GPUs via CUDA
This paper explores the challenges and opportunities of exploiting general-purpose GPU processing by implementing the non-equispaced fast Fourier transform algorithm (commonly known as 'gridding') and reporting the results.
/content/cudazone/CUDABrowser/assets/images/applications/69_Fast_MRI_Gridding_on_GPUs_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/69_Fast_MRI_Gridding_on_GPUs_large.jpg
Academia
University of Wisconsin - Madison
http://www.wisc.edu/
2008
5
13
5/13/2008
GeForce 8800 (G80)
Open source
Gregerson
Paper
Code
Presentation
Imaging
Imaging,gridding,Fast Fourier,algorithm, Gregerson
5e5775d0-6663-4807-a80a-02c9b89859c9
Fast Support Vector Machine Training and Classification
This paper describes a solver for Support Vector Machine training running on a GPU, using Platt's Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic.
/content/cudazone/CUDABrowser/assets/images/applications/70_Fast Support_Vector_Machine_Smal.jpg
/content/cudazone/CUDABrowser/assets/images/applications/70_Fast_Support_Vector_Machine_Large.jpg
Academia
University of California, Berkeley
http://www.eecs.berkeley.edu/
2007
12
31
12/31/2007
138
Catanzaro, Sundaram, Keutzer
Paper
Numerics
Algorithms,numeric,libraries,support vector machine,Platt, Catanzaro, Sundaram, Keutzer
a00952ad-19d7-4ce9-a84a-d35cf6252276
Fluorescent Microscopy
This paper describes how a typical calculation covering 10 seconds of measurement time, which required 8 minutes of computing time on a single core of a Quad CPU, can be completed in less time by using parallel processing with GPUs. Besides being the cheaper option, GPUs can carry out the computation as fast as a computer cluster with many processors, offering nearly teraflops of computing power.
http://www.ks.uiuc.edu/Research/microscope/
/content/cudazone/CUDABrowser/assets/images/applications/71_Flurorescent_Microscopy_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/71_Flurorescent_Microscopy_Large.jpg
Academia
University of Illinois at Urbana-Champaign
http://www.ks.uiuc.edu/
2007
11
19
11/19/2007
GeForce 8800 GTX
Open source
Arkhipov, et al
Paper
Code
Multimedia
Life Sciences
Life sciences,microscopy, Arkhipov, et al
32888aaf-5ef8-4302-a4c7-843626b05718
Antenna Modeling Design System
This tool addresses the challenges of fast-changing wireless appliance design, dictated by consumer aesthetic and feature appetite, with technology that efficiently imports, meshes and simulates the entire wireless appliance within its surrounding real-world environment.
http://eesof.tm.agilent.com/products/amds_main.html
/content/cudazone/CUDABrowser/assets/images/applications/73_Antenna_Modeling_Design_System_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/73_Antenna_Modeling_Design_System_Large.jpg
Commercial
Agilent EEsof EDA
http://eesof.tm.agilent.com/
2008
2
14
2/14/2008
Agilent AMDS
Application
Imaging
Imaging,simulation,3D,wireless, Agilent AMDS
9791ea82-56fa-461c-bb77-e6d18eb8cb76
Flocking-based Document Clustering on the GPU
In this paper, we have conducted research to exploit the GPU's architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to many naturally inspired algorithms.
/content/cudazone/CUDABrowser/assets/images/applications/74_Flocking_based_Document_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/74_Flocking_based_Document_Large.jpg
Research
Applied Software Engineering Group
http://aser.ornl.gov/
2007
10
15
10/15/2007
GeForce 8800
3
Charles, Potok, Patton, Cui
Paper
Numerics
Algorithm,document flocking, Charles, Potok, Patton, Cui
1d4c05f9-4904-457c-acae-50514535274d
Tomographic Reconstruction
How much computing power can you cram into a single desktop PC? Research on image reconstruction often requires large-scale scientific computations that can easily take weeks on a normal PC. To tackle this problem, the team developed a special PC costing less than 4000 euro that is capable of performing computations as fast as a cluster, and can perform three-dimensional reconstructions within a few hours -- over 100 times as fast.
http://fastra.ua.ac.be/en/index.html
/content/cudazone/CUDABrowser/assets/images/applications/75_Tomographic_Reconstruction_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/75_Tomographic_Reconstruction_Large.jpg
Academia
University of Antwerp
http://www.ua.ac.be/main.aspx?c=.ENGLISH
2008
5
28
5/28/2008
GeForce 9800 GX2
40
Sijbers, et al
Paper
Code
Multimedia
Imaging
Medical, imaging, 3D scan, x ray, FASTRA, Vision Lab, Sijbers, et al
fdb3584f-25dd-47d7-a251-56c932a15edf
Solving Dense Linear Systems on Multi-Accelerator Platforms
This paper generalizes the approach for systems with multiple hardware accelerators, and incorporates software implementations of standard cache/memory coherence techniques from computer architecture to improve performance. This experimental evaluation on an NVIDIA Tesla S870 platform delivers a peak performance well over 400 GFLOPS.
/content/cudazone/CUDABrowser/assets/images/applications/76_Solving_Dense_Linear_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/76_Solving_Dense_Linear_Large.jpg
Academia
Universidad Jaume I
http://www.icc.uji.es/
2008
5
9
5/9/2008
Quintana-Ortí, Igual, van de Geijn
Paper
Numerics
FLAME, linear algebra, numeric, library, multicore, multi-core, BLAS, PLASMA, SMPS, Cell, FPGA, Tesla, Quintana-Ortí, Igual, van de Geijn
807de7b2-250d-45f0-aca6-2c9720ad523c
Histogram Computation with CUDA
A GPU's higher processing power compared to a standard CPU comes at the cost of reduced data caching and flow control logic, as more transistors have to be devoted to data processing. This imposes certain limitations on how an application may access memory and implement flow control. As a result, implementing certain algorithms (even trivial ones) on the GPU may be difficult or may not be computationally justified.
http://users.rsise.anu.edu.au/~ramtin/cuda.htm
/content/cudazone/CUDABrowser/assets/images/applications/77_Histogram_Computation_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/77_Histogram_Computation_Large.jpg
Academia
The Australian National University
http://www.anu.edu.au/
2008
5
1
05/01/2008
Ramtin Shams
Application
Code
Numerics
Numerics,algorithm,parallel processing,GPU,histogram computation, Ramtin Shams
0766881c-4193-45b7-8aa7-c4113a8d0172
Efficient Histogram Algorithms
This paper presents two efficient histogram algorithms designed for NVIDIA's CUDA compatible GPUs, which can be used for parallel computation of histograms on large data-sets and for thousands of bins. These algorithms do not require the typically costly data transfers by allowing efficient histogram calculation on the GPU.
/content/cudazone/CUDABrowser/assets/images/applications/78_Efficient_Histogram_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/78_Efficient_Histogram_Large.jpg
Academia
The Australian National University
http://www.anu.edu.au/
2007
08
08
08/08/2007
30
Ramtin Shams
Paper
Numerics
Histogram, parallel processing, compute unified device architecture, CUDA, graphics processor unit, GPU, numerics, algorithm, Ramtin Shams
007043c1-e26b-4087-b694-fb00069f40c3
Speeding Up Mutual Information Computation Hardware
This paper presents an efficient method for mutual information computation between images (2D or 3D) for NVIDIA's CUDA compatible devices. It overcomes limitations by approximating the probability mass functions (pmfs) using a down-sampled version of the joint histogram, which avoids memory update problems.
/content/cudazone/CUDABrowser/assets/images/applications/79_Speeding_up_Mutual_Information_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/79_Speeding_up_Mutual_Information_Large.jpg
Academia
The Australian National University
http://www.anu.edu.au/
2007
9
11
9/11/2007
25
Ramtin Shams
Paper
Numerics
Histogram, parallel processing, compute unified device architecture, CUDA, graphics processor unit, GPU, numerics, algorithm, Ramtin Shams
121a8218-926a-4578-be55-6632c782b1ba
LINZIK: The compact optical CAD
A lens ray tracing program for optical calculations, in particular for astronomical optics. It includes an optimizer that can choose surface parameters to minimize the goal (merit) function while satisfying the specified restrictions.
http://www.linzik.com/download_eng.htm
/content/cudazone/CUDABrowser/assets/images/applications/80_LINZINK_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/80_LINZINK_Large.jpg
LINZIK
http://www.linzik.com/
2008
5
18
5/18/2008
10
Open source
Vodyanik
Application
Code
Vodyanik
f4df4b66-bc01-4b08-afe1-a4868c9ff4d8
Canny Edge Detection
The Canny edge detector is a very popular edge feature detector used as a pre-processing step in many computer vision algorithms. By using the more programmer-friendly CUDA framework, we are able to implement the entire Canny algorithm. Details are presented along with a comparison with CPU implementations. We also integrate our detector into MATLAB, a popular interactive simulation package often used by researchers.
http://www.wam.umd.edu/~yluo1/canny.htm
/content/cudazone/CUDABrowser/assets/images/applications/81_Canny_Edge_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/81_Canny_Edge_Large.jpg
Academia
University of Maryland
http://www.umd.edu/
2008
5
6
5/6/2008
3
Open source
Luo, Duraiswami
Paper
Code
Digital Content Creation
Graphics
Numerics
Imaging
Canny edge detector,Canny algorithm,edge feature detector,graphical application layers,algorithm,simulation,numerics, Luo, Duraiswami
9b9fb409-8a9a-4c8d-92c7-3dbb2b1a4c6c
SciFinance Speeds Financial Results with Parallel Computing
By harnessing the power of NVIDIA CUDA with GPU or multi-CPU workstations, SciFinance parallel codes for Monte Carlo pricing models run blazingly fast. SciFinance CUDA-enabled codes achieve astounding acceleration, while SciFinance OpenMP-compliant codes yield near linear acceleration on multi-CPU workstations.
http://www.scicomp.com/parallel_computing/GPU_OpenMP
/content/cudazone/CUDABrowser/assets/images/applications/82_SciFinance_Small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/82_SciFinance_Large.jpg
Commercial
SciComp Inc.
http://www.scicomp.com/
2008
6
9
6/9/2008
220
SciComp Inc.
Application
Code
Multimedia
Finance
Graphic card, parallel computing, finance, Monte Carlo, SciComp Inc.
d1677622-d8fa-4f22-9532-74d40f7d2d6e
Accelerate Large Graph Algorithms
This paper presents a few fundamental algorithms - including breadth first search, single source shortest path, and all-pairs shortest path - using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the GeForce 8800 GTX GPU costing $600.
/content/cudazone/CUDABrowser/assets/images/applications/83_Accelerating_Large_Graph_Algorithms_small.jpg
/content/cudazone/CUDABrowser/assets/images/applications/83_Accelerating_Large_Graph_Algorithms_large.jpg
Academia
International Institute of Information Technology Hyderabad
http://www.iiit.ac.in/
2008
6
5
6/5/2008
Harish, et al
Paper
Numerics
Numeric, algorithm, breadth first search, single source shortest path, all-pairs shortest path, vertex graph, Harish, et al
eb42f530-fa04-11dd-87af-0800200c9a66
Multibody mechanical simulations on the GPU
Parallel solver for non-linear complementarity problems in mechanical systems with a large number of moving parts and frictional contacts.
/content/cudazone/CUDABrowser/assets/images/applications/202_benchmarkGPU_small.png
/content/cudazone/CUDABrowser/assets/images/applications/202_benchmarkGPU_large.png
Academia
Università degli Studi di Parma and University of Wisconsin - Madison
2009
02
01
02/01/2009
13
Alessandro Tasora
Application
Multimedia
Numerics
Libraries
Science
Multibody, complementarity, differential variational inequality, Alessandro Tasora