High performance compilers increasingly rely on accurate modeling of the machine resources to efficiently exploit the instruction level parallelism of an application. In this pape...
OpenMP relies heavily on barrier synchronization to coordinate the work of threads that are performing the computations in a parallel region. A good implementation of barriers is ...
Ramachandra C. Nanjegowda, Oscar Hernandez, Barbar...
On machines with high-performance processors, the memory system continues to be a performance bottleneck. Compilers insert prefetch operations and reorder data accesses to improve...
Nathaniel McIntosh, Sandya Mannarswamy, Robert Hun...
CCJ is a communication library that adds MPI-like collective operations to Java. Rather than trying to adhere to the precise MPI syntax, CCJ aims at a clean integration of collect...
Arnold Nelisse, Thilo Kielmann, Henri E. Bal, Jaso...
We systematically evaluate the performance of five implementations of a single, user-level communication interface. Each implementation makes different architectural assumptions ...