Just the Facts

Accelerated computing is revolutionizing high-performance computing (HPC). It is now widely accepted that systems with accelerators deliver the highest performance and the most energy-efficient computing for HPC today. Accelerated systems are the new norm in large-scale HPC and leadership supercomputing. It's not a question of "if" but "when and how?"

We'd like to share some facts about accelerated computing, stripped of the promises and hype. Specifically, two notions are not based on fact: that Intel's Xeon Phi accelerator can deliver acceptable application performance compared to a GPU by simply recompiling and running code natively on Xeon Phi, and that performance optimization is easier on a Xeon Phi than on a GPU.

FACT: A GPU is significantly faster than Intel's Xeon Phi on real HPC applications.
Speeding time to result for key science applications by 2x over Xeon Phi.
Although Xeon Phi can be optimized to outperform a CPU, a GPU consistently outperforms Xeon Phi on a wide range of supercomputing applications. System and configuration details [1] (Data from August 2013)
The Department of Energy uses a collection of 'mini-applications' (Miniapps) to assess the performance of computing architectures on highly representative HPC workloads. On the mini-applications shown in the above chart, GPUs deliver a speedup of approximately 2.5x to 5x over CPUs. Although Intel's Xeon Phi can be optimized to outperform a CPU, a GPU remains on average more than 2x faster than Xeon Phi.
Organization                  | Application                               | GPU Speed-up over Xeon Phi
Tokyo Institute of Technology | CFD Diffusion                             | 2.6x
Xcelerit                      | Monte-Carlo LIBOR Swap Pricing            | 2.2x - 4x
Georgia Tech                  | Synthetic Aperture Radar                  | 2.1x
CGGVeritas                    | Reverse Time Migration                    | 2.0x
Paralution                    | BLAS & SpMV                               | 2.0x
Univ. of Wisconsin-Madison    | WRF (Weather Forecasting)                 | 1.8x
University Erlangen-Nuremberg | Medical Imaging: 3D Image Reconstruction  | 7x
Independent results show a GPU outperforming Xeon Phi by roughly 2x or more, with reported speed-ups ranging from 1.8x to 7x across the applications listed above. (Data from January 2014)
HPC is all about application performance. Today, more than 200 applications across a wide range of fields are GPU-accelerated.
FACT: "Recompile & Run" on Xeon Phi actually slows down your application.
The notion that developers can simply "recompile and run" applications on Intel's Xeon Phi, without any change to their CPU code, is attractive but misleading. The resulting performance is usually much slower than CPU performance, literally the opposite of acceleration.
Simple recompile and run on Xeon Phi can work, but codes run much slower than on the CPU. System and configuration details [2] (Data from August 2013)
While a simple recompile to run natively on Xeon Phi may work for many codes, doing so slows the application down relative to the CPU, by up to 4x on the DOE mini-applications shown above.

"Recompile and run" faces a host of technical challenges as described in the NVIDIA blog post "No Free Lunch for Intel MIC (or GPUs)", including Amdahl's Law for serial portions of the code. Because of the poor serial performance of the Xeon Phi cores (based on an old Pentium design) compared to the modern CPU cores, the serial portion of codes run natively on a Xeon Phi can run an order of magnitude slower.

In practice, a developer must first get the code to compile on Xeon Phi, then refactor and optimize it to increase performance, just to get back to performance parity with the CPU.

At the end of the day, it takes some effort to extract parallelism, whether you want to accelerate with Xeon Phi or GPU. At best, "recompile and run" is a mildly convenient first step for developers; at worst, an attractive claim destined to disappoint.
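The Amdahl's Law argument above can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the input values (90% parallel code, a 5x gain on the parallel part, a 10x slowdown on the serial part) are assumptions chosen to show the shape of the effect, not measurements from this page.

```python
def overall_speedup(parallel_fraction, parallel_gain, serial_slowdown):
    """Speedup over the CPU baseline under Amdahl's Law.

    parallel_fraction: share of the CPU runtime that can run in parallel (0..1)
    parallel_gain:     how many times faster that share runs on the accelerator
    serial_slowdown:   how many times slower the serial share runs on the
                       accelerator's cores (1.0 if it stays on the CPU)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    # New runtime, normalized so that the CPU-only runtime equals 1.0.
    new_time = serial_fraction * serial_slowdown + parallel_fraction / parallel_gain
    return 1.0 / new_time


# Hypothetical inputs: everything, including the serial part, runs natively
# on slow accelerator cores.
print(f"{overall_speedup(0.9, 5.0, 10.0):.2f}")  # prints 0.85 (slower than the CPU)

# Same hypothetical code, but the serial part stays on the CPU.
print(f"{overall_speedup(0.9, 5.0, 1.0):.2f}")   # prints 3.57
```

With these assumed numbers, a 10% serial portion running ten times slower is enough to wipe out the entire gain from the parallel portion.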
FACT: Programming for a GPU and for a Xeon Phi requires similar effort, but the results are significantly better on a GPU.
Same optimization techniques. Same developer effort. 2x faster acceleration on GPU.
Method                    | GPU                     | Phi
Libraries                 | CUDA Libraries + others | Intel MKL + others
Directives                | [entry not recovered]   | OpenMP + Phi Directives
Native Programming Models | CUDA C                  | Vector Intrinsics
Developers use libraries, directives, or native programming models to program accelerators and optimize for performance. (Data from August 2013)
GPU and Intel's Xeon Phi may be different in some ways, but they are similar in that both are parallel processors. Developers need to put in similar effort and use similar optimization techniques to expose massive amounts of parallelism, whether on Xeon Phi or GPU.

As shown in the table above, a developer uses the same three methods to accelerate their code – libraries, directives, and native programming models like CUDA C for GPU or vector intrinsics on Xeon Phi.

And the programming efforts for Xeon Phi and GPU are more alike than most people realize.

Below, an N-body kernel code illustrates that comparable optimization techniques and effort are required to optimize for either accelerator. While the code changes are basically the same, performance on GPU significantly outpaces that of Xeon Phi. Download the optimization example.
A simple n-body code comparison shows similar optimization techniques must be used, but the GPU is significantly faster. System and configuration details [3] (Data from August 2013)
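To give a feel for the kind of restructuring an n-body kernel typically needs on either accelerator, here is a minimal sketch. It is not the downloadable optimization example referenced above. It is written in plain Python for readability, and every name in it (`accel_baseline`, `accel_tiled`, `SOFTENING`, the `tile` parameter) is hypothetical, as is the choice of G = 1. The first function is a straightforward all-pairs loop. The second computes the same result after two common changes: a structure-of-arrays layout (separate x, y, z lists, which suits SIMD lanes and coalesced GPU loads) and tiling over the inner loop (which stands in for staging data in cache or GPU shared memory).

```python
import math

SOFTENING = 1e-9  # assumed small term that keeps the i == j case finite


def accel_baseline(pos, mass):
    """All-pairs accelerations with an array-of-structures layout (G = 1)."""
    acc = []
    for xi, yi, zi in pos:
        ax = ay = az = 0.0
        for (xj, yj, zj), mj in zip(pos, mass):
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING
            s = mj / (dist_sqr * math.sqrt(dist_sqr))
            ax += dx * s
            ay += dy * s
            az += dz * s
        acc.append((ax, ay, az))
    return acc


def accel_tiled(xs, ys, zs, mass, tile=4):
    """Same math with a structure-of-arrays layout and a tiled inner loop."""
    n = len(xs)
    ax, ay, az = [0.0] * n, [0.0] * n, [0.0] * n
    for j0 in range(0, n, tile):
        # Stage one tile of source bodies; on real hardware this would be
        # a cache block or a shared-memory buffer.
        tx, ty, tz = xs[j0:j0 + tile], ys[j0:j0 + tile], zs[j0:j0 + tile]
        tm = mass[j0:j0 + tile]
        for i in range(n):
            xi, yi, zi = xs[i], ys[i], zs[i]
            sx = sy = sz = 0.0
            for xj, yj, zj, mj in zip(tx, ty, tz, tm):
                dx, dy, dz = xj - xi, yj - yi, zj - zi
                dist_sqr = dx * dx + dy * dy + dz * dz + SOFTENING
                s = mj / (dist_sqr * math.sqrt(dist_sqr))
                sx += dx * s
                sy += dy * s
                sz += dz * s
            ax[i] += sx
            ay[i] += sy
            az[i] += sz
    return list(zip(ax, ay, az))
```

Both functions do the same arithmetic per body pair. Only the data layout and loop order differ, which is the sense in which the optimization work is similar whichever accelerator is targeted. The two results can differ in the last floating-point digits because the tiled version sums partial results per tile.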
"You can port easily, but the things you do in CUDA to vectorize your code still have to be done for Phi."
Dr. Karl Schulz
Director of Scientific Applications at
Texas Advanced Computing Center (TACC)
Source: HPCWire, May 17, 2013
"Our GPU codes are quite similar to the Xeon Phi codes, except for replacing SIMD operations with SIMT operations."
“Results gathered on Intel’s Xeon Phi were surprisingly disappointing… It took quite some effort to create solutions with good performance due to vectorization tuning, despite that the Xeon Phi is said to be easily programmable.”
"While getting a program running on Xeon Phi is easy, I found that it is easier with CUDA and NVIDIA GPUs to achieve high sustained performances for Lattice Boltzmann applications."
Dr. Sebastiano Fabio Schifano
Department of Mathematics and Informatics
University of Ferrara
Once you see the facts, a better understanding of accelerated computing emerges. Today, a GPU provides double the performance for essentially the same developer effort. GPUs are the logical choice for accelerating parallel code. In part, this could be why scientific researchers have published results with GPUs over Intel Xeon Phi by more than 10:1 this year [4], and why NVIDIA GPUs are favored over Xeon Phi by more than 20:1 in HPC systems today [5].
[1] Dual-socket Intel Xeon E5-2667, 6 cores/socket @ 2.90 GHz with HT off, 64 GB RAM, RHEL 6.2, Tesla K20X, Intel Xeon Phi 5110P. Data is based on 2 CPUs versus 2 CPUs + GPU. MiniMD and NAMD were run in offload mode on Intel Xeon Phi. CloverLeaf ran in native mode on Intel Xeon Phi. For codes, go to MiniMD Version 1.2RC1, CloverLeaf, and NAMD. (Data from August 2013)
[2] MiniFE data is based on Dual-socket Intel Xeon E5-2670, 8 cores/socket @ 2.60 GHz with HT off, 128 GB RAM, RHEL 6.2, Intel Xeon Phi 5110P. All other apps are based on Dual-socket Intel Xeon E5-2667, 6 cores/socket @ 2.90 GHz with HT off, 64 GB RAM, RHEL 6.2, Xeon Phi 5110P. To see actual codes, go to MiniMD Version 1.2RC1, MiniFE, MiniGhost, GTC, and SNAP. (Data from August 2013)
[3] Data is based on Dual-socket Intel Xeon E5-2667, 6 cores/socket @ 2.90 GHz, 64 GB RAM, RHEL 6.2, Tesla K20X, Intel Xeon Phi 5110P. (Data from August 2013)
[4] Source: Google Scholar, all results since 2013. GPU search terms: CUDA GPU. Intel Xeon Phi search terms: "Xeon Phi".
[5] Source: Intersect360 Research, HPC User Site Census, 2013.