CUDA Spotlight: GPU Computing Momentum at Microway

This week's Spotlight is on Stephen Fried, founder of Microway and veteran technology inventor (with a current focus on clusters and InfiniBand fabrics). Steve is a former space scientist and FAA flight examiner who can be found on weekends in his sailplane soaring over the Green Mountains of Vermont.

Steve Fried is seen in this picture in his Schleicher ASH-26 E sailplane, which has a 60-foot wingspan and includes a 55 HP rotary engine in the hull.

We caught up with Steve after learning that BioStack-LS – a CUDA/Tesla-based Microway product – was named a "Best of Show" finalist at the Bio-IT World Conference in Boston.

NVIDIA: Steve, tell us about Microway.
Steve: Microway was founded in 1982 and, since its inception, has provided leading-edge hardware and software products that speed up floating point calculations. We develop x86-based Linux clusters whose nodes each employ a pair of NVIDIA Tesla GPUs. For the past four years, we've been providing customers with the best designed and cooled GPU platforms.

NVIDIA: Where are you seeing the most momentum in GPU computing?
Steve: Applications that manipulate matrices or rely on linear algebra techniques are excellent candidates for a system of parallel GPGPUs. Vector-based problems are normally speed-limited by the ability of the floating point units (FPUs) in CPUs to retrieve data from system memory, not by the ability of the device doing the computation to carry out the floating point operation.

The crucial problem is not memory bandwidth but memory latency. The time that it takes to retrieve a piece of data is determined by the latency of the memory and it is identical for CPUs and GPUs. The trick that GPGPUs employ to get around latency is their ability to queue up a huge number of parallel requests for data. These requests are queued and retrieved by the GPU's memory controller. When the data arrives back at the GPU it is sent to the cores that requested it, which in turn wake up the thread that requested a particular piece of data. Data for that core quickly forms a queue for the parallel threads that have made requests.

With hundreds to thousands of requests now in its queue, each core is able to run at full speed for many thousands of cycles. This makes it possible to achieve FPU efficiencies that approach or exceed 90%, giving the user the ability to take full advantage of the 1 Teraflop performance of Tesla.
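The relationship between latency and the number of queued requests can be sketched with Little's law: the amount of data that must be in flight equals latency multiplied by bandwidth. The short Python sketch below is only an illustration. The function name `requests_in_flight` and the latency, bandwidth and request-size figures are hypothetical round numbers, not measurements of any Tesla product.

```python
def requests_in_flight(latency_s, bandwidth_bytes_per_s, request_bytes):
    """Little's law: how many memory requests must be outstanding
    at once to keep a memory system of the given bandwidth busy."""
    bytes_in_flight = latency_s * bandwidth_bytes_per_s
    return bytes_in_flight / request_bytes


# Hypothetical round numbers, chosen only for illustration.
LATENCY_S = 400e-9      # assumed memory latency: 400 nanoseconds
BANDWIDTH = 150e9       # assumed memory bandwidth: 150 GB/s
REQUEST_BYTES = 128     # assumed size of one memory request

if __name__ == "__main__":
    n = requests_in_flight(LATENCY_S, BANDWIDTH, REQUEST_BYTES)
    print(f"about {n:.0f} requests must be outstanding to hide the latency")
```

Under these assumed figures, several hundred requests have to be outstanding at any moment, which is consistent with the "hundreds to thousands" of queued requests described above. A processor that can only keep a handful of requests pending would leave most of that bandwidth unused.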

We believe that GPUs are ideal for executing the parallel vector applications that dominate much of the bio-informatics world. This week we are at Bio-IT World demonstrating BioStack-LS. BioStack-LS includes seven GPU compute nodes, each with two Tesla C2070s. BioStack-LS represents an innovation for the bio-medical community because it's delivered pre-configured for life sciences software, including AMBER, MATLAB, NAMD and VMD.

NVIDIA: Why are people embracing the CUDA parallel programming model?
Steve: CUDA automatically solves many annoying problems that developers previously had to deal with themselves. The typical programming model for a group of parallel vector cards in the past was complicated by the master node's interaction with the vector processing units (VPUs). One of the first programs used to write these applications employed an "alien file server" on the host node. Its main task was to read and write data to the host's hard disks that was then sent to or received from the VPU.

The user writing code for such an environment had to worry about 1) reading the kernel code from a file and issuing it to the vector card that held one or more VPUs; 2) sending a signal to the VPUs to start execution; 3) sending and receiving data between the host and the VPU; and 4) coordinating tasks carried out by VPUs (using semaphores whose purpose was to guarantee that data would never be read by VPUs sharing data until all VPUs completed a particular portion of an algorithm).
The CUDA programming model hides these bits and pieces of kernel control, data flow and task synchronization from the user. It reduces the user's task to writing a single piece of C code that executes on the host, which automatically loads and calls the kernels that are embedded within the CUDA application in the order that they are to be executed.
Sophisticated users who plan to write applications that employ multiple hosts running in parallel must still divide up their application and then employ a paradigm like MPI to pass data to and from the GPU-enhanced nodes in their cluster. At that point, however, the problem can be handled entirely with two higher-level programming paradigms: CUDA and MPI.

NVIDIA: How did you become interested in this area?
Steve: Working as a space scientist in the 1960s, I discovered that 30% of the computer time at our laboratory was spent solving linear equations using Gauss Elimination. Twenty years later when I was developing Microway's first i860 card, things had not changed much. DAXPY was the critical routine in the LINPACK benchmark and it was employed at the heart of Gauss Elimination. Today DAXPY has been replaced by matrix multiplies and more efficient methods (LU decomposition) to speed up LINPACK, but solving simultaneous linear equations still appears to be at the heart of many scientific applications.
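To make the role of DAXPY concrete: DAXPY computes y = a*x + y on double-precision vectors, and each row update in Gaussian elimination has exactly that form. The following Python sketch is a minimal illustration under simplifying assumptions. It works row by row on plain lists and returns new lists rather than updating in place, so it is not how LINPACK itself organizes the computation, and the function names `daxpy` and `solve` are chosen here for the example.

```python
def daxpy(a, x, y):
    """Return a*x + y elementwise. BLAS DAXPY updates y in place;
    this sketch returns a new list instead."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting,
    using daxpy for every row update."""
    n = len(b)
    # Build an augmented matrix [A | b] so the inputs are left untouched.
    M = [[float(v) for v in row] + [float(bi)] for row, bi in zip(A, b)]
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            raise ValueError("matrix is singular")
        M[k], M[p] = M[p], M[k]
        # Eliminate column k from every row below the pivot row.
        for i in range(k + 1, n):
            factor = -M[i][k] / M[k][k]
            M[i] = daxpy(factor, M[k], M[i])
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


if __name__ == "__main__":
    print(solve([[2, 1], [1, 3]], [3, 5]))
```

Nearly all of the arithmetic happens inside `daxpy`, which is why the speed of that one routine once dominated the cost of solving the system.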

NVIDIA: What are some of the real-world applications of your products?
Steve: Microway's scientific solutions include WhisperStation-PSC and fully integrated Tesla GPU-based clusters. Our products have been used by scientists to design aircraft and space vehicles; sails and hulls for America's Cup racers; and wings for aircraft. They are employed in all areas of research -- from performing nuclear simulations to mapping the ocean floor and brain imaging. They are also employed in CFD applications, including those used by auto and tire companies.

NVIDIA: As computing becomes faster, what will we be able to do in the future?
Steve: With further advances in cooling and improvements in connectivity between GPUs, it should be possible not only to improve GPU cluster efficiency, but also to employ GPU clusters in office environments where cooling and noise are an issue. Faster, next-generation GPUs will also make it possible to port problems that now require a small cluster for timely execution to a quiet server that sits desk-side.
Future desktop solutions include simulating car crashes, weapons design simulations, ray trace rendering of architectural drawings, bio-informatics and quantum mechanical applications. For clusters, future GPU developments will enable testing aircraft designs without the use of wind tunnels, more accurate gravitational simulations of astronomical events, and monitoring and searching huge databases for security purposes. The future benefits of enhanced GPU performance are infinite.

Stephen S. Fried's Bio
Stephen S. Fried received a Bachelor's degree in Physics from Brown University in 1964. He designed velocity selectors for molecular beams at the Brown molecular beam group. At Avco, where he was a member of the Atomic and Laser Physics Committees, he and J.R. Airey co-invented the first HF Chemical Laser that ran on a ground state transition.

In 1982, he co-founded Microway with his wife Ann and wrote the first code to employ an Intel math coprocessor in an IBM-PC. During the 1980s he was a frequent contributor to both BYTE and Dr. Dobb's Journal and won several Byte BOM awards. At Microway he designed the first PC accelerator board to allow the use of math coprocessors. This was followed by parallel processing cards that used Inmos Transputers and later, the Intel i860 vector processor.

Fried's recent software products have focused on tools for monitoring, controlling and debugging clusters and InfiniBand fabrics. His most recent peer-reviewed publication, "Loop Heat Pipes for Cooling Systems of Servers," appears in IEEE Transactions on Components and Packaging Technologies. It is co-authored with Professor Yury Maydanik, chairperson of the International Heat Pipe Symposium. He can be reached at steve@microway.com.