Chapter 6. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm
Martin Burtscher and Keshav Pingali
This chapter describes the first CUDA implementation of the classical Barnes Hut n-body algorithm that runs entirely on the GPU. Unlike most other CUDA programs, our code builds an irregular tree-based data structure and performs complex traversals on it. It consists of six GPU kernels. The kernels are optimized to minimize memory accesses and thread divergence and are fully parallelized within and across blocks. Our CUDA code takes 5.2 seconds to simulate one time step with 5,000,000 bodies on a 1.3 GHz Quadro FX 5800 GPU with 240 cores, which is 74 times faster than an optimized serial implementation running on a ...