Patrick Roye | CUDA Spotlight

By Calisa Cole, posted Aug 13, 2013

Ultra-Low Latency Shape Sensing Using CUDA

This week's Spotlight is on Patrick Roye of Luna, a technology development company based in Virginia.

In the healthcare market, Luna is a pioneer in fiber-optic shape and position sensing. Its technology is being developed for integration into systems that perform minimally invasive diagnostics, surgery and therapy, pinpointing the position and shape of an instrument inside the body.

Patrick works on accelerating Luna's processing algorithms using GPUs. He and a team of engineers and scientists are developing a prototype system that uses CUDA to calculate the shape of a fiber-optic sensor in real time.

Luna's shape-sensing systems, which are currently in development, will be used to guide the next generation of medical robotic systems safely through a patient's body.

This interview is part of the CUDA Spotlight Series.

Q & A with Patrick Roye

NVIDIA: Patrick, tell us about your work at Luna.
Patrick: I've been a software engineer at Luna for just over five years. I work in Luna's Lightwave Division, which develops and manufactures products for fiber-optic testing, strain and temperature sensing, and shape sensing. For the last year and a half, I've helped develop high-speed versions of our products that utilize NVIDIA GPUs to accelerate data processing.

NVIDIA: What are some applications of Luna's technology?
Patrick: One of our key target markets is healthcare, including the area of Minimally Invasive Surgery (MIS). Luna's shape-sensing systems, which are currently in development, calculate the shape of fiber-optic sensors in real time.

NVIDIA: Why did you choose to work with GPUs?
Patrick: The processing for our shape-sensing technology was initially developed on FPGAs, which allowed us to transfer and process data at extremely low latencies, on the order of milliseconds. But when higher levels of accuracy required us to increase the number of points and complexity of our algorithms, the FPGAs we were using were no longer a viable option.

Fortunately, at the same time the door closed on our FPGAs, NVIDIA opened a window with the announcement of RDMA for GPUDirect.

Since we had used CUDA a year earlier to accelerate our strain and temperature sensing calculations, we already had an idea of the advantages of GPU-accelerated processing.

With RDMA for GPUDirect and CUDA-accelerated processing, we determined that we could perform data acquisition and minimal processing on an FPGA, transfer our data directly to the GPU for processing and then transfer the results back to the FPGA fast enough to meet our real-time requirements.

NVIDIA: What approaches did you find useful for developing on the CUDA platform?
Patrick: The algorithm requires over 100 kernels, operating on tens of thousands of data points. All kernels must complete before the next data set arrives from the FPGA, so every kernel had to be optimized to run as fast as possible. Here are a few tips I learned from this extreme optimization process:

Get it working first: There's no point in doing something fast if you're doing it wrong.

Take time to generate comprehensive unit tests for each of the kernels: Once you begin optimizing, these unit tests will be invaluable for ensuring your optimizations don't introduce new processing bugs.

Implement each kernel a few different ways: There were a few times where an implementation I was almost certain would be slower turned out to be the fastest one. Additionally, thinking through multiple solutions to one kernel may give you an idea that helps accelerate a different kernel later on.
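
The three tips above can be illustrated with a small sketch. It is not Luna's code: it assumes a generic textbook 1-D phase unwrap as the kernel under test, written here as plain serial Python rather than CUDA, and every function name is hypothetical. One version is the simple "get it working first" loop; the other recasts the same calculation as independent per-sample jump detection followed by a prefix sum, a form that is easier to spread across many threads. The unit-test idea is simply that both versions must agree on the same input.

```python
import math
from itertools import accumulate

TWO_PI = 2.0 * math.pi

def unwrap_reference(phase):
    """Simple serial unwrap: the 'get it working first' version."""
    if not phase:
        return []
    out = [phase[0]]
    offset = 0.0
    for prev, cur in zip(phase, phase[1:]):
        delta = cur - prev
        if delta > math.pi:        # jumped up by about one turn
            offset -= TWO_PI
        elif delta < -math.pi:     # jumped down by about one turn
            offset += TWO_PI
        out.append(cur + offset)
    return out

def unwrap_scan(phase):
    """Same result, restructured as jump detection plus a prefix sum.

    Each jump depends only on two neighbouring samples, and the running
    turn count is a prefix sum, so both stages suit parallel hardware.
    """
    if not phase:
        return []
    jumps = [0]
    for prev, cur in zip(phase, phase[1:]):
        delta = cur - prev
        jumps.append(-1 if delta > math.pi else (1 if delta < -math.pi else 0))
    turns = list(accumulate(jumps))  # running count of whole turns
    return [p + TWO_PI * k for p, k in zip(phase, turns)]

def max_abs_diff(a, b):
    """Largest element-wise difference, used as the comparison metric."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

def wrap(x):
    """Wrap an angle into the interval (-pi, pi]."""
    return math.atan2(math.sin(x), math.cos(x))

# Synthetic test signal (made up for this sketch): a steadily growing phase.
true_phase = [0.3 * i for i in range(200)]
wrapped = [wrap(x) for x in true_phase]
```

Comparing `unwrap_scan(wrapped)` against `unwrap_reference(wrapped)` with a small tolerance, rather than exact equality, allows for the slightly different floating-point rounding of the two formulations.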

[Video: Luna Shape Sensing Demo]

NVIDIA: Tell us about some of the computations performed by the many CUDA kernels you use.
Patrick: Our algorithms employ FFTs, filters, complex integrals, complex derivatives, phase unwrapping, and a host of proprietary algorithms. We use CUFFT for our large FFTs, and everything else is custom. The most difficult algorithm to parallelize was the final shape calculation, which integrates coordinate transform matrices to calculate the position and rotation of each point along the sensor.
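
The interview does not describe how Luna parallelized the shape calculation, and the method is proprietary. As a general illustration of why chaining transform matrices looks serial yet can still be parallelized, the sketch below assumes a simplified planar sensor in which each short segment contributes one 3x3 homogeneous transform. Because matrix multiplication is associative, the running products can be computed with a doubling prefix scan (the Hillis-Steele pattern) in roughly log2(N) passes. All names and the segment model are assumptions made for this example, and the code is serial Python standing in for what would be GPU threads.

```python
import math

def matmul(a, b):
    """Product of two small square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def segment_transform(bend, length):
    """Planar transform for one segment: rotate by `bend` radians,
    then advance `length` along the rotated x axis."""
    c, s = math.cos(bend), math.sin(bend)
    return [[c, -s, c * length],
            [s,  c, s * length],
            [0.0, 0.0, 1.0]]

def serial_transforms(transforms):
    """Reference: pose after each segment, computed one step at a time."""
    out, pose = [], None
    for t in transforms:
        pose = t if pose is None else matmul(pose, t)
        out.append(pose)
    return out

def scan_transforms(transforms):
    """Inclusive prefix product using a doubling (Hillis-Steele) scan.

    In each pass, element i is combined with the element `step` places
    before it. The earlier element stays on the left, because matrix
    multiplication is associative but not commutative.
    """
    result = list(transforms)
    step = 1
    while step < len(result):
        # Every i here is independent, so on a GPU each could be a thread.
        result = [
            result[i] if i < step else matmul(result[i - step], result[i])
            for i in range(len(result))
        ]
        step *= 2
    return result

# Made-up example: ten unit-length segments, each bending by pi/20,
# so the tip ends up heading a quarter turn from the base.
transforms = [segment_transform(math.pi / 20, 1.0) for _ in range(10)]
poses = scan_transforms(transforms)
tip = poses[-1]
tip_position = (tip[0][2], tip[1][2])
tip_heading = math.atan2(tip[1][0], tip[0][0])
```

The scan performs more total multiplications than the serial loop, which is the usual trade: extra arithmetic in exchange for far fewer dependent steps.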

NVIDIA: What types of parallel algorithms are being implemented?
Patrick: Many of our calculations are vector based, making parallelization easy. But even after parallelizing, many of those operations had to be optimized for efficient global memory access. For more complicated calculations, we typically use reduction or partitioning.
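
For readers unfamiliar with reduction, here is a minimal sketch of the tree-shaped sum, modeled in serial Python with a hypothetical function name. It follows the widely taught "sequential addressing" layout, in which the active range halves on every pass and each active slot adds in its partner one stride away. It is an assumption that Luna's kernels resemble this; the interview says only that reduction is used.

```python
def tree_reduce_sum(values):
    """Sum `values` with a tree reduction, as a block of GPU threads might."""
    data = list(values)

    # Pad to a power of two with the identity element so every pass pairs up.
    n = 1
    while n < len(data):
        n *= 2
    data += [0.0] * (n - len(data))

    stride = n // 2
    while stride > 0:
        # Each tid stands for one thread. Writes go to slots below `stride`
        # and reads come from slots at or above it, so the passes do not
        # interfere and could run concurrently with a barrier between them.
        for tid in range(stride):
            data[tid] += data[tid + stride]
        stride //= 2
    return data[0]

total = tree_reduce_sum(range(1, 101))
```

A list of N values is summed in about log2(N) passes instead of N - 1 dependent additions, which is why the pattern suits wide parallel hardware.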

NVIDIA: In your field, what are the biggest challenges going forward?
Patrick: In the future, we’d like to make our shape-sensing systems smaller and lighter. Currently, we require a motherboard, 64-bit CPU, and memory just so that we can set up the GPU and send kernels to it.

That’s a lot of space and energy required for components that basically act as glue between the FPGA and GPU. We’re very excited about NVIDIA’s roadmap toward Parker, which we hope will allow us to shrink our design considerably by combining a next-generation Maxwell GPU with a 64-bit ARM core in a single package.

NVIDIA: How did you first learn about GPU computing?
Patrick: I learned about GPU computing from my coworker, Brian Marshall. He had heard about CUDA from an HPC Wire article about a CUDA programming contest.

NVIDIA: If you had more computing power, what could you and your team do?
Patrick: Go faster! But seriously, we could implement more corrections to increase the accuracy of our current algorithms while maintaining our current time requirements.

NVIDIA: Describe your development infrastructure.
Patrick: Our shape-sensing platform consists of a large fiber-optic network, an FPGA for data acquisition, a Quadro K5000 GPU for processing, and a CPU that performs basic setup and sends kernels to the GPU for processing.

NVIDIA: What advice would you offer others who are considering GPU computing?
Patrick: If you enjoy optimizing code and making it run blazing fast, then GPU programming is for you. I highly recommend reading through Mark Harris’s tutorial on “Optimizing Parallel Reduction in CUDA.”

I found that article to be the most helpful in understanding how the CUDA architecture works and how to organize algorithms to get the most performance out of the GPU.

NVIDIA: How did you become interested in the area of sensor technology?
Patrick: My real interest is in developing cool software for a small company that will keep me busy and excited about my job. When I interviewed at Luna, they promised that they would provide exactly that. So my interest in sensors really came about from applying my software expertise at a company that has been on the cutting edge of fiber-optic sensing for years.

Bio for Patrick Roye

Patrick and his wife Joanna live in Blacksburg, Virginia. Patrick enjoys working on challenging software problems. He has developed toolkits for programming collaborative simulations in virtual environments. He also created software that used genetic algorithms to control robotic arms using shape-sensing fiber-optic sensors.

Relevant Links
Luna, Inc.
Optimizing Parallel Reduction in CUDA, by M. Harris
Video Demos of Luna's Shape-Sensing Technology

Contact Info
Email: proye (at) lunainc [dot] com
