Computational accelerators, such as manycore NVIDIA GPUs, Intel Xeon Phi and FPGAs, are becoming common in workstations, servers and supercomputers for scientific and engineering...
Yonghong Yan 0001, Pei-Hung Lin, Chunhua Liao, Bro...
Modern graphics processing units (GPUs) include hardware-controlled caches to reduce bandwidth requirements and energy consumption. However, current GPU cache hierarchies are ine...
Yingying Tian, Sooraj Puthoor, Joseph L. Greathous...
Read-copy update (RCU) is a shared memory synchronization mechanism with scalable synchronization-free reads that nevertheless execute correctly with concurrent updates. To guaran...
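The snippet above describes the core RCU usage pattern: readers run without synchronization, while updaters publish a new version of the data and wait for a grace period before reclaiming the old one. Purely as a hedged illustration of that pattern (not this paper's contribution), the C++ sketch below uses the userspace RCU library, liburcu; the exact header, RCU flavor, and link flags (roughly -lurcu) depend on the installed version, and a recent release is assumed for C++-friendly headers.

// Minimal sketch of the classic RCU pattern with liburcu (userspace RCU).
// Assumes liburcu is installed; compile roughly as: g++ rcu_sketch.cpp -lurcu -pthread
#include <urcu.h>        // default liburcu flavor: rcu_read_lock(), synchronize_rcu(), ...
#include <cstdio>
#include <thread>

struct Config { int value; };

static Config *shared_cfg;          // readers dereference this without taking any lock

void reader() {
    rcu_register_thread();          // every thread that uses RCU must register itself
    long sum = 0;
    for (int i = 0; i < 100000; ++i) {
        rcu_read_lock();            // read-side critical section: never blocks
        Config *c = (Config *) rcu_dereference(shared_cfg);
        if (c) sum += c->value;
        rcu_read_unlock();
    }
    rcu_unregister_thread();
    std::printf("reader done, sum = %ld\n", sum);
}

void updater() {
    rcu_register_thread();
    for (int v = 1; v <= 10; ++v) {
        Config *next = new Config{v};
        Config *old  = shared_cfg;
        rcu_assign_pointer(shared_cfg, next);  // publish the new version to readers
        synchronize_rcu();                     // wait for a grace period: all pre-existing
        delete old;                            // readers are gone, old version is reclaimable
    }
    rcu_unregister_thread();
}

int main() {
    shared_cfg = new Config{0};
    std::thread r(reader), u(updater);
    r.join(); u.join();
    delete shared_cfg;
}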
This paper describes Surge, a collection-oriented programming model that enables programmers to compose parallel computations using nested high-level data collections and operator...
Saurav Muralidharan, Michael Garland, Bryan C. Cat...
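Surge's own API is not visible in this snippet, so the following is only a hedged illustration of the collection-oriented style it describes: composing a computation from high-level collections and map/reduce-style operators instead of explicit loops. The sketch uses standard C++17 (std::transform_reduce); none of the names below come from Surge.

// Illustration only (not Surge's API): a dot product expressed as a fused
// map (element-wise multiply) and reduce (sum) over two collections.
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> x{1.0, 2.0, 3.0, 4.0};
    std::vector<double> y{4.0, 3.0, 2.0, 1.0};

    // Default operators are multiplication for the "map" step and addition
    // for the "reduce" step, so this computes sum_i x[i] * y[i].
    double dot = std::transform_reduce(x.begin(), x.end(), y.begin(), 0.0);

    std::printf("dot = %f\n", dot);   // 1*4 + 2*3 + 3*2 + 4*1 = 20
}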
As modern hardware keeps evolving, an increasingly effective approach to developing energy-efficient and high-performance solvers is to design them to work on many small-size indepe...
Azzam Haidar, Tingxing Dong, Piotr Luszczek, Stani...
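The abstract is cut off before naming a library or kernel design, so the sketch below only illustrates the batched-small-problems idea it motivates: submitting many tiny, independent matrix multiplications in a single call through cuBLAS's strided batched GEMM. This is a generic, hedged example written as host-side C++ (linked against cuBLAS and the CUDA runtime), not the solvers the paper describes; error checking is omitted for brevity.

// Sketch: many small, independent DGEMMs in one library call.
// Compile roughly as: g++ batched_gemm.cpp -lcublas -lcudart
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 8;                 // each problem is a tiny 8x8 matrix...
    const int batch = 10000;         // ...and there are many independent ones
    const size_t elems = (size_t)n * n * batch;

    // Batches laid out back to back in memory (column-major, stride n*n).
    std::vector<double> hA(elems, 1.0), hB(elems, 2.0), hC(elems, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, elems * sizeof(double));
    cudaMalloc(&dB, elems * sizeof(double));
    cudaMalloc(&dC, elems * sizeof(double));
    cudaMemcpy(dA, hA.data(), elems * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), elems * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const double alpha = 1.0, beta = 0.0;
    const long long stride = (long long)n * n;
    // One call performs all `batch` small products: C_i = alpha * A_i * B_i + beta * C_i.
    cublasDgemmStridedBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                              n, n, n,
                              &alpha, dA, n, stride,
                                      dB, n, stride,
                              &beta,  dC, n, stride,
                              batch);

    cudaMemcpy(hC.data(), dC, elems * sizeof(double), cudaMemcpyDeviceToHost);
    std::printf("C[0] = %f (expected %d)\n", hC[0], 2 * n);  // each entry sums 8 terms of 1*2

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}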
This paper studies the effects of source-code optimizations on the performance, power draw, and energy consumption of a modern compute GPU. We evaluate 128 versions of two n-body ...
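This study compares many source-level variants of the same GPU n-body kernel. Purely as a hedged, CPU-side illustration of what a source-level variant can look like (these are not the paper's 128 versions), the sketch below writes the same gravitational acceleration once with a square root and a division, and once reformulated around a reciprocal square root.

// Two source-level variants of the same n-body inner loop; the physics is identical,
// only the arithmetic formulation changes (the kind of difference such studies measure).
#include <cmath>
#include <vector>
#include <cstdio>

struct Body { float x, y, z, mass; };

// Variant A: straightforward distance, then divide by its cube.
float accel_x_a(const Body& bi, const std::vector<Body>& bodies, float soft) {
    float ax = 0.0f;
    for (const Body& bj : bodies) {
        float dx = bj.x - bi.x, dy = bj.y - bi.y, dz = bj.z - bi.z;
        float d  = std::sqrt(dx * dx + dy * dy + dz * dz + soft);
        ax += bj.mass * dx / (d * d * d);
    }
    return ax;
}

// Variant B: same math, reformulated around one reciprocal square root.
float accel_x_b(const Body& bi, const std::vector<Body>& bodies, float soft) {
    float ax = 0.0f;
    for (const Body& bj : bodies) {
        float dx = bj.x - bi.x, dy = bj.y - bi.y, dz = bj.z - bi.z;
        float inv = 1.0f / std::sqrt(dx * dx + dy * dy + dz * dz + soft);
        ax += bj.mass * dx * inv * inv * inv;   // the divide becomes multiplies
    }
    return ax;
}

int main() {
    std::vector<Body> bodies{{0, 0, 0, 1}, {1, 0, 0, 2}, {0, 2, 0, 3}};
    std::printf("A: %f  B: %f\n",
                accel_x_a(bodies[0], bodies, 1e-6f),
                accel_x_b(bodies[0], bodies, 1e-6f));
}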
In this paper, we present the most extensive comparison of synchronization techniques. We evaluate 5 different synchronization techniques through a series of 31 data structure alg...
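The list of techniques and data structures is cut off here, so the following is only a generic, hedged miniature of such a comparison: the same shared counter driven by the same multithreaded workload, once under std::mutex and once under a test-and-set spinlock built from std::atomic_flag.

// Miniature synchronization comparison: std::mutex vs. a test-and-set spinlock,
// both guarding the same shared counter under the same multithreaded workload.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

struct Spinlock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};

template <class Lock>
void run(const char* name, int threads, int iters_per_thread) {
    Lock lock;
    long long counter = 0;
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters_per_thread; ++i) {
                lock.lock();
                ++counter;                       // the protected critical section
                lock.unlock();
            }
        });
    for (auto& w : workers) w.join();
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - start;
    std::printf("%-10s counter=%lld time=%.3fs\n", name, counter, dt.count());
}

int main() {
    run<std::mutex>("mutex", 4, 500000);         // blocking lock
    run<Spinlock>("spinlock", 4, 500000);        // busy-waiting lock
}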
Large-scale graph-structured computation usually exhibits an iterative, convergence-oriented computing nature, where input data is computed iteratively until a convergence conditi...
Chenning Xie, Rong Chen, Haibing Guan, Binyu Zang,...
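The snippet describes the iterate-until-convergence pattern that graph processing systems are built around. As a hedged, generic illustration (not this system's programming interface), the small C++ example below runs PageRank on a toy graph and stops once the largest per-vertex change drops below a tolerance.

// Generic iterate-until-convergence loop: PageRank on a tiny directed graph,
// repeated until the largest per-vertex change falls below a tolerance.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Adjacency list: out_edges[u] lists the vertices u links to.
    std::vector<std::vector<int>> out_edges = {{1, 2}, {2}, {0}, {0, 2}};
    const int n = (int)out_edges.size();
    const double d = 0.85, tol = 1e-10;

    std::vector<double> rank(n, 1.0 / n), next(n);
    double delta = 1.0;
    int iter = 0;

    while (delta > tol) {                        // the convergence condition
        std::fill(next.begin(), next.end(), (1.0 - d) / n);
        for (int u = 0; u < n; ++u)
            for (int v : out_edges[u])
                next[v] += d * rank[u] / out_edges[u].size();
        delta = 0.0;                             // largest change in this iteration
        for (int v = 0; v < n; ++v)
            delta = std::max(delta, std::fabs(next[v] - rank[v]));
        rank.swap(next);
        ++iter;
    }
    std::printf("converged after %d iterations\n", iter);
    for (int v = 0; v < n; ++v) std::printf("rank[%d] = %f\n", v, rank[v]);
}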
Many recent multiprocessor systems are realized with a non-uniform memory architecture (NUMA), and accesses to remote memory locations take more time than local memory accesses. Opt...
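As a hedged illustration of the local-versus-remote distinction this abstract starts from (not the optimization the paper proposes), the sketch below uses Linux's libnuma to place a buffer on a chosen NUMA node: threads running on that node then access it locally, while threads on other nodes pay the remote-access penalty. It assumes a Linux system with libnuma installed and links with -lnuma.

// Sketch: explicit NUMA placement with libnuma. Memory allocated on a chosen node
// is "local" to cores on that node and "remote" (slower) to cores on other nodes.
// Compile roughly as: g++ numa_sketch.cpp -lnuma
#include <numa.h>
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }
    std::printf("NUMA nodes: %d\n", numa_max_node() + 1);

    const size_t bytes = 64UL * 1024 * 1024;
    // Allocate the buffer on node 0; threads running on node 0 get local accesses,
    // threads on any other node pay the remote-access penalty when touching it.
    void* buf = numa_alloc_onnode(bytes, 0);
    if (!buf) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    std::memset(buf, 0, bytes);      // first touch / initialization
    numa_free(buf, bytes);
    return 0;
}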