Latency tolerance is essential in achieving high performance on parallel computers for remote function calls and fine-grained remote memory accesses. EM-X supports interprocessor ...
Abstract. We introduce a collection of high performance kernels for basic linear algebra. The kernels encapsulate small xed size computations in order to provide building blocks fo...
OpenMP is emerging as a quasi-standard for shared memory parallel programming on small SMP-systems. To serve as a common programming interface in shared memory parallel programmin...
This paper describes a proposal for a set of Parallel Basic Linear Algebra Subprograms PBLAS. The PBLAS are targeted at distributed vector-vector, matrix-vector and matrixmatrix...
Jaeyoung Choi, Jack Dongarra, Susan Ostrouchov, An...
TACO (Topologies and Collections) is a template library that introduces the flavour of distributed data parallel processing by means of reusable topology classes and C++ s. This p...