Just the Facts
Accelerated computing is revolutionizing high-performance computing (HPC). It is now widely accepted that systems with accelerators deliver the highest performance and the most energy-efficient computing for HPC today. Accelerated systems are the new norm in large-scale HPC and leadership supercomputing. It's not a question of "if" but "when and how?"
We'd like to share some facts about accelerated computing, stripped of promises and hype. Specifically, two notions are simply not based on fact: that Intel's Xeon Phi accelerator can deliver acceptable application performance compared to a GPU by simply recompiling and running code natively on Xeon Phi, and that performance optimization is easier on a Xeon Phi than on a GPU.
A GPU is significantly faster than Intel's Xeon Phi on real HPC applications.
GPUs speed time to result for key science applications by 2x over Xeon Phi.
Although Xeon Phi can be optimized to outperform a CPU, the GPU consistently outperforms Xeon Phi on a wide range of supercomputing applications. System and configuration details [1] (Data from August 2013)
The Department of Energy uses a collection of 'mini-applications' (Miniapps) to assess the performance of computing architectures on highly representative HPC workloads.
Running the mini-applications shown in the above chart, GPUs deliver a speedup of approximately 2.5x to 5x over CPUs. Although Intel's Xeon Phi can be optimized to outperform a CPU, GPU performance remains on average more than 2x faster than Xeon Phi.
Independent results have shown GPU outperforms Xeon Phi by 2x or more. (Data from January 2014)
HPC is all about application performance. Today, more than 200 applications across a wide range of fields are GPU-accelerated.
"Recompile & Run" on Xeon Phi actually slows down your application.
The notion that developers can simply "recompile and run" applications on Intel's Xeon Phi, without any change to their CPU code, is attractive but misleading. The resulting performance is usually much slower than CPU performance, literally the opposite of acceleration.
While a simple recompile to run natively on Xeon Phi may work on many codes, doing so slows the application down relative to the CPU: up to 4x slower on DOE mini-applications, as shown above.
"Recompile and run" faces a host of technical challenges, as described in the NVIDIA blog post "No Free Lunch for Intel MIC (or GPUs)", including Amdahl's Law for serial portions of the code. Because the serial performance of the Xeon Phi cores (based on an old Pentium design) is poor compared to modern CPU cores, the serial portion of codes run natively on a Xeon Phi can run an order of magnitude slower.
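The Amdahl's Law point can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and the inputs (a 10% serial fraction, serial code running 10x slower, parallel code running 20x faster) are assumptions chosen for the example, not measurements from the benchmarks above.

```python
def relative_runtime(serial_fraction, serial_slowdown, parallel_speedup):
    """Runtime on the accelerator relative to a CPU baseline of 1.0.

    serial_fraction  -- share of CPU runtime spent in serial code (0..1)
    serial_slowdown  -- how many times slower serial code runs natively
    parallel_speedup -- how many times faster the parallel part runs
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if serial_slowdown <= 0 or parallel_speedup <= 0:
        raise ValueError("slowdown and speedup factors must be positive")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part stretches; the parallel part shrinks.
    return serial_fraction * serial_slowdown + parallel_fraction / parallel_speedup


def overall_speedup(serial_fraction, serial_slowdown, parallel_speedup):
    """Overall speedup versus the CPU baseline (values below 1.0 mean slower)."""
    return 1.0 / relative_runtime(serial_fraction, serial_slowdown, parallel_speedup)


if __name__ == "__main__":
    # Assumed, illustrative inputs: 10% serial, 10x serial slowdown,
    # 20x speedup on the parallel portion.
    print(relative_runtime(0.10, 10.0, 20.0))
    print(overall_speedup(0.10, 10.0, 20.0))
```

Under these assumed inputs the relative runtime comes out at about 1.045, so the native run is slightly slower than the CPU even though the parallel portion runs 20x faster. The serial portion alone consumes as much time as the entire original CPU run.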
In practice, a developer must first work to get the code to recompile on Xeon Phi, then put in effort to refactor and optimize the code, just to get back to performance parity with CPUs.
At the end of the day, it takes some effort to extract parallelism, whether you want to accelerate with Xeon Phi or GPU. At best, "recompile and run" is a mildly convenient first step for developers; at worst, an attractive claim destined to disappoint.
Developers use libraries, directives, or native programming models to program accelerators and optimize for performance. (Data from August 2013)
GPU and Intel's Xeon Phi may be different in some ways, but they are similar in that both are parallel processors. Developers need to put in similar effort and use similar optimization techniques to expose massive amounts of parallelism, whether on Xeon Phi or GPU.
As shown in the table above, a developer uses the same three methods to accelerate their code – libraries, directives, and native programming models like CUDA C for GPU or vector intrinsics on Xeon Phi.
And the programming efforts for Xeon Phi and GPU are more alike than most people realize.
Below, an N-body kernel code illustrates that comparable optimization techniques and effort are required to optimize for either accelerator. While the code changes are basically the same, performance on GPU significantly outpaces that of Xeon Phi.
A simple N-body code comparison shows similar optimization techniques must be used, but the GPU is significantly faster. System and configuration details [3] (Data from August 2013)
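For readers unfamiliar with the workload, here is a minimal reference sketch of an all-pairs N-body acceleration kernel. It is not the optimization example referenced above; it is plain, unoptimized Python written for illustration, and the function name, the unit gravitational constant, and the softening value are all assumptions.

```python
import math


def nbody_accelerations(positions, masses, softening=1e-3):
    """All-pairs gravitational accelerations, O(n^2), with G assumed to be 1.

    positions -- list of (x, y, z) tuples
    masses    -- list of masses, same length as positions
    softening -- small positive length that keeps the i == j term finite
    """
    if len(positions) != len(masses):
        raise ValueError("positions and masses must have the same length")
    if softening <= 0:
        raise ValueError("softening must be positive")
    eps2 = softening * softening
    accelerations = []
    # Each iteration of the outer loop is independent of the others.
    # That independence is the parallelism an accelerator has to exploit.
    for xi, yi, zi in positions:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(positions, masses):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist2 = dx * dx + dy * dy + dz * dz + eps2
            inv_dist3 = 1.0 / (dist2 * math.sqrt(dist2))
            # When i == j the displacement is zero, so the term adds nothing
            # and no branch is needed in the inner loop.
            scale = mj * inv_dist3
            ax += dx * scale
            ay += dy * scale
            az += dz * scale
        accelerations.append((ax, ay, az))
    return accelerations


if __name__ == "__main__":
    bodies = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(nbody_accelerations(bodies, [1.0, 1.0]))
```

The outer loop over bodies carries no dependencies between iterations, which is why the same restructuring (one body per GPU thread or per SIMD lane, with the inner loop tiled for data reuse) applies on either kind of accelerator.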
"You can port easily, but the things you do in CUDA to vectorize your code still have to be done for Phi."
Dr. Karl Schultz, Director of Scientific Applications at Texas Advanced Computing Center (TACC), May 17, 2013
"Our GPU codes are quite similar to the Xeon Phi codes, except for replacing SIMD operations with SIMT operations."
“Results gathered on Intel’s Xeon Phi were surprisingly disappointing… It took quite some effort to create solutions with good performance due to vectorization tuning, despite that the Xeon Phi is said to be easily programmable.”
"While getting a program running on Xeon Phi is easy, I found that it is easier with CUDA and NVIDIA GPUs to achieve high sustained performances for Lattice Boltzmann applications."
Dr. Sebastiano Fabio Schifano, Department of Mathematics and Informatics, University of Ferrara
Once you see the facts, a better understanding of accelerated computing emerges. Today, a GPU provides double the performance for essentially the same developer effort, which makes GPUs the logical choice for accelerating parallel code. This may be part of the reason scientific researchers have published with GPUs more than 10:1 over Intel Xeon Phi this year [4], and why NVIDIA GPUs are favored more than 20:1 over Xeon Phi in HPC systems today [5].