We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates t...
Lee W. Howes, Anton Lokhmotov, Alastair F. Donalds...
Abstract: We report on our experiences with the implementation of a parallel algorithm to compute the cycle structure of a permutation given as an oracle. As a sub-problem, the cyc...
Abstract--We present Barra, a simulator of Graphics Processing Units (GPU) tuned for general purpose processing (GPGPU). It is based on the UNISIM framework and it simulates the na...
Sylvain Collange, Marc Daumas, David Defour, David...
Abstract. Designing and tuning parallel applications with MPI, particularly at large scale, requires understanding the performance implications of different choices of algorithms ...
Torsten Hoefler, William Gropp, Rajeev Thakur, Jes...
Clustering and scheduling of tasks for parallel implementation is a well researched problem. Several techniques have been presented in the literature to improve performance and re...