A global barrier synchronizes all processors in a parallel system. This paper investigates algorithms that allow disjoint subsets of processors to synchronize independently and in...
Anja Feldmann, Thomas R. Gross, David R. O'Hallaro...
We present a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and...
Originally developed to connect processors and memories in multicomputers, prior research and design of interconnection networks have focused largely on performance. As these netw...
Heterogeneous computing combines general purpose CPUs with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existin...
Isaac Gelado, Javier Cabezas, Nacho Navarro, John ...
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance...