CUDA Spotlight: Cris Cecka
GPU-Accelerated Computational Mathematics
This week's Spotlight is on Dr. Cris Cecka, a research scientist and lecturer in the new Institute for Applied Computational Science (IACS) at Harvard University. He is currently also performing research with the Mathematics Department at the Massachusetts Institute of Technology.
Previously, Cris was a graduate student in the Institute for Computational and Mathematical Engineering (ICME) at Stanford University with Prof. Eric Darve.
Harvard has been a CUDA Center of Excellence since 2009, led by Dr. Hanspeter Pfister, IACS Director. This interview is part of the CUDA Spotlight Series.
Q & A with Cris Cecka
NVIDIA: Cris, what are your primary research interests?
More recently, I have become interested in mathematical and computational abstractions to produce efficient, library-quality scientific software. Specifically, I have focused on generalized n-body problems, including integral equation methods, particle methods, and structured dense matrices.
As part of my work, I've released several software libraries, including FMMTL to aid in the research, development, and use of kernel matrices and CrowdCL to aid in the use of GPU computing within a browser.
NVIDIA: Tell us more about FMMTL. Is it GPU-accelerated?
By avoiding the computation of all of the elements or delaying the computation until the matrix element is requested, the amount of data required to define the matrix is reduced to O(N), which is great in terms of computational intensity! There is very little data to access and lots and lots of computation.
For this reason, these computations are a great fit for GPUs. Notoriously, GPUs are hard to "feed" — they compute much, much faster than they can access data to compute on. Indeed, sparse linear algebra usually gets only 1-5% of the peak performance because there is approximately one operation per piece of data read from memory.
Dense linear algebra does much, much better, but can still be difficult to tune. With implicit dense matrices, where the elements are computed on demand and each piece of data read from memory can be reused N times or more, it is very easy to reach 80% of peak performance right away. For this reason, FMMTL attempts to use GPUs for these direct n-body computations whenever possible.
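To make the idea of an implicit dense matrix concrete, here is a minimal CPU-side sketch in C++. It is not FMMTL's API: the names `InverseDistanceKernel` and `direct_matvec` are hypothetical, and the 1D inverse-distance kernel is an assumption chosen only for illustration. The point is that the N x M matrix is never stored; each entry is recomputed from O(N) point data when it is needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 1D kernel, chosen only for illustration:
// K(t, s) = 1 / |t - s|, with the singular self-interaction set to zero.
struct InverseDistanceKernel {
  double operator()(double target, double source) const {
    const double r = std::abs(target - source);
    return r > 0.0 ? 1.0 / r : 0.0;
  }
};

// Computes result[i] = sum_j K(targets[i], sources[j]) * charges[j].
// The matrix entries are evaluated on demand and never stored, so the
// memory footprint is linear in the number of points while the work is
// proportional to targets.size() * sources.size().
template <class Kernel>
std::vector<double> direct_matvec(const Kernel& K,
                                  const std::vector<double>& targets,
                                  const std::vector<double>& sources,
                                  const std::vector<double>& charges) {
  assert(sources.size() == charges.size());
  std::vector<double> result(targets.size(), 0.0);
  for (std::size_t i = 0; i < targets.size(); ++i) {
    double sum = 0.0;
    for (std::size_t j = 0; j < sources.size(); ++j) {
      sum += K(targets[i], sources[j]) * charges[j];
    }
    result[i] = sum;
  }
  return result;
}
```

Each source value is read once per target but feeds a full kernel evaluation every time, which is the reuse pattern described above. A GPU version would typically assign targets to threads and stage sources through fast on-chip memory, but that is beyond this sketch.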
FMMTL Error Plot
NVIDIA: What approaches have you used to apply the CUDA platform to your work?
With the development of the Thrust library and the recently proposed parallel computing additions to the C++ standards being led by NVIDIA, expressing our algorithms in terms of these optimized primitives has never been more important. Primitives like partition, prefix sum, reduce, and map, when fully abstracted, act as powerful components of efficient cross-platform generic algorithms. This is a more functional-programming-like way of writing code, but it is compact and efficient. I wish more people used the C++ std:: library (or equivalents) and NVIDIA's Thrust.
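As a small illustration of the primitives named above, the following sketch expresses map, reduce, prefix sum, and partition with the C++ standard library. The wrapper function names are hypothetical and exist only for this example. Thrust offers counterparts with similar shapes (`thrust::transform`, `thrust::reduce`, `thrust::inclusive_scan`, `thrust::partition`), though this sketch does not use Thrust itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Map: apply a function to every element (here, squaring).
std::vector<int> map_square(const std::vector<int>& v) {
  std::vector<int> out(v.size());
  std::transform(v.begin(), v.end(), out.begin(),
                 [](int x) { return x * x; });
  return out;
}

// Reduce: combine all elements into a single value (here, a sum).
int reduce_sum(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0);
}

// Prefix sum (inclusive scan): out[i] = v[0] + ... + v[i].
std::vector<int> prefix_sum(const std::vector<int>& v) {
  std::vector<int> out(v.size());
  std::partial_sum(v.begin(), v.end(), out.begin());
  return out;
}

// Partition: move even values to the front, keeping relative order,
// and return how many even values there are.
std::size_t partition_evens_first(std::vector<int>& v) {
  auto is_even = [](int x) { return x % 2 == 0; };
  auto first_odd = std::stable_partition(v.begin(), v.end(), is_even);
  assert(std::is_partitioned(v.begin(), v.end(), is_even));
  return static_cast<std::size_t>(first_odd - v.begin());
}
```

For the input `{1, 2, 3, 4}`, the map yields `{1, 4, 9, 16}`, the reduction yields 10, the prefix sum yields `{1, 3, 6, 10}`, and the partition reorders the data to `{2, 4, 1, 3}` and returns 2.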
Using the same kind of advanced C++ techniques, we also create robust and reusable primitives that dispatch based on data structure, available co-processors, available threading, etc. Currently, I'm working with Wesley Chen to wrap up a generalized n-body direct evaluation that is provably optimal in terms of distributed memory communication and takes advantage of available GPUs and threading appropriately. This is written with a general std::-like interface and can be applied far beyond the typically considered n-body problems. Writing code in this way allows us to think about the concepts and algorithms we're trying to express rather than the details of the implementation, while still benefiting from CPU and GPU parallelization.
NVIDIA: Tell us about the course you are teaching.
We wanted to make this course accessible to a broad range of students, and we chose to teach the course in Python. We cover MapReduce using MRJob, distributed computing using MPI4py, and CUDA with PyCUDA.
The choice to run the course in Python made it much more accessible for a broad range of students — we've had students from the social sciences and government who are simply interested in accelerating their data processing Python/Matlab scripts and producing better software.
Clearly, the long-exalted notion that the average programmer will need to understand and develop parallel programs is becoming a reality.
NVIDIA: How has learning the CUDA programming model helped your students?
This is an imaging technique of encoding 3D volumes in 2D images that allows the recording frame rate to be increased dramatically. The 3D data must then be reconstructed with back-propagation of the scattered light, which is very computationally demanding. Furthermore, experimentalists must fit parameters of the back-propagation and have an intuition for how the resulting data changes due to the parameters. The initial guesses for the parameters were constructed by hand.
Her team was using Holopy, an open source Python package, to recover the 3D information given the data and parameters… slowly. Fitting the parameters to a single frame could take minutes. Using the GPU computing techniques learned in class, Rebecca supplemented Holopy and achieved real-time manipulation of the input parameters via a GUI.
The speedup achieved led to new insights into the imaging technique. Failures and discrepancies became much more obvious when the data could be manipulated in real time, whereas previously these features were shrouded by computation.
NVIDIA: How did you first hear about CUDA?
I've been impressed with how NVIDIA's programming model has evolved since then. CUDA forces a user to inherently think about their problem in a hierarchical way — breaking it into smaller and smaller problems with higher and higher degrees of parallelism.
This programming model follows the architecture of NVIDIA's GPUs closely. With algorithm-architecture co-design, the same hierarchical strategies appear in CPU caches, distributed memory algorithms, threading strategies, etc. I think the skills acquired in mapping ideas to CUDA and GPU computing are similar to the skills required to write cache-efficient, parallel code in other contexts.
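The hierarchical way of thinking described here can be sketched even in plain serial C++. The example below is an assumption-laden illustration, not CUDA code: the function name `hierarchical_sum` and the `block_size` parameter are hypothetical. It splits a sum into independent block-level partial sums, which play the role of CUDA thread blocks, and then combines the partial results in a second, smaller reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Two-level sum: first reduce each block of `block_size` elements
// independently (these block reductions could run in parallel), then
// reduce the much smaller vector of per-block partial sums.
double hierarchical_sum(const std::vector<double>& data,
                        std::size_t block_size) {
  assert(block_size > 0);
  std::vector<double> partial;
  for (std::size_t start = 0; start < data.size(); start += block_size) {
    const std::size_t end = std::min(start + block_size, data.size());
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(start);
    const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
    partial.push_back(std::accumulate(first, last, 0.0));
  }
  return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The same two-level shape shows up when blocking for CPU caches or when combining per-node results in a distributed reduction, which is the parallel drawn in the answer above.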
NVIDIA: What excites you the most about your work?
Bio for Cris Cecka
Dr. Cris Cecka received his PhD from Stanford University in Computational and Mathematical Engineering (ICME). He joined the new Institute for Applied Computational Science (IACS) at Harvard University in 2011 as a lecturer and research scientist, where he developed core courses on parallel computing and software development. His research is focused on developing and applying integral equation methods and generalized n-body problems using hierarchical methods and high performance computing.