Fengshun Lu, Kaijun Ren, Junqiang Song, and Jinjun Chen
Recent advances in graphics processing unit (GPU) technology have brought enormous benefits to the high-performance computing (HPC) community. Many scientific and engineering applications have achieved order-of-magnitude speedups on energy-efficient CPU/GPU clusters such as TianHe-1A (TH-1A), Nebulae, Lincoln, and TSUBAME. Besides the heterogeneity of their processing units, these HPC systems also have a complex memory hierarchy: distributed memory across compute nodes, shared memory within each node, and device memory on the GPUs. Consequently, researchers and domain scientists face great challenges in performing efficient and scalable computing on these heterogeneous HPC infrastructures.
Although most issues of scalable computing on traditional CPU-based HPC systems have been extensively addressed over the past decades [5–9], scalability on the newer heterogeneous CPU/GPU systems has not yet been well addressed. Here we give an overview of recent work on the scalability of GPU applications. Goddeke et al. investigated the weak scalability of finite element method (FEM) calculations on a 160-node GPU cluster and observed that the performance of the FEM applications scaled favorably with the number of nodes. Ltaief et al. addressed ...