The graphics processing unit (GPU) has evolved from a fixed-function processor with programmable stages into a programmable processor with many fixed-function components that deliver massive parallelism. Consequently, GPUs increasingly leverage this programmable processing power for general-purpose, non-graphics tasks, i.e., general-purpose computation on graphics processing units (GPGPU). However, while the GPU can massively accelerate data-parallel (or task-parallel) applications, the lack of explicit support for inter-block communication on the GPU hampers its broader adoption as a general-purpose computing device. Inter-block communication on the GPU occurs via global memory and then requires a barrier synchronization across the blocks, i.e., inter-block GPU communication via barrier synchronization. Currently, such synchronization is only available via the CPU, which, in turn, incurs significant overhead. Thus, we seek to propose more efficient methods for inter-block communication.
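To make the problem concrete, below is a minimal CUDA sketch (illustrative, not code from this work) of the status quo: a two-stage reduction in which blocks communicate partial results through global memory, and the inter-block barrier is realized implicitly by terminating one kernel and launching another. Because kernels on the same stream execute in order, stage2 cannot start until every block of stage1 has finished; this kernel termination and relaunch is the CPU-mediated synchronization whose overhead motivates this work. All names (stage1, stage2, partial) are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stage 1: each block reduces its tile and writes a partial
// result to global memory (the inter-block communication channel).
__global__ void stage1(const int *in, int *partial, int n) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    sdata[tid] = (i < n) ? in[i] : 0;
    __syncthreads();                       // intra-block barrier: hardware-supported
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

// Stage 2: a single block combines the per-block partial results.
__global__ void stage2(const int *partial, int *out, int numBlocks) {
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int b = 0; b < numBlocks; ++b) sum += partial[b];
        *out = sum;
    }
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int *in, *partial, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&partial, blocks * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;

    stage1<<<blocks, threads, threads * sizeof(int)>>>(in, partial, n);
    // Implicit inter-block barrier: same-stream kernels serialize, so
    // stage2 observes all of stage1's global-memory writes. The kernel
    // termination/relaunch round trip is the CPU-side overhead at issue.
    stage2<<<1, 1>>>(partial, out, blocks);
    cudaDeviceSynchronize();

    printf("sum = %d (expected %d)\n", *out, n);
    cudaFree(in); cudaFree(partial); cudaFree(out);
    return 0;
}
```

Note the asymmetry the sketch exposes: threads within a block synchronize cheaply via __syncthreads(), whereas blocks have no analogous device-side primitive here and must fall back on kernel boundaries, which is precisely the gap that GPU-resident barrier synchronization aims to close.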